What is a Modern Data Stack
Before we begin, however, it’s important to understand what exactly people mean when they say “modern data stack”:
- It’s cloud-based
- It’s modular and customizable
- It’s best-of-breed first (choosing the best tool for a specific job, versus an all-in-one solution)
- It’s metadata-driven
- It runs on SQL (at least for now)
“Simply put, the modern data stack refers to a suite of products used for data integration and analysis by more technology-forward companies. One of the key characteristics that makes the tools in the stack ‘modern’ is that they are all cloud-native.”
Compared to legacy stacks, this tends to offer a few benefits:
- Lower barriers to set up and deploy
- Easier to scale up as needed
- More operationally focused than IT focused
- A data platform is necessary to glean insights that improve customer service, supply chain operations, and even marketing
- Adopting it requires a radically different approach from the legacy stack
At its core, the modern data stack is a suite of tools used for data integration:
- A fully-managed ETL data pipeline
- A cloud based columnar warehouse or data lake as a destination
- A data transformation tool
- A business intelligence (BI) and data visualization platform
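Taken together, these components form a single extract → load → transform → serve flow. Below is a minimal, purely illustrative Python sketch of that flow; every function, table, and field name here is hypothetical, and a real stack would use managed tools (a hosted pipeline, a cloud warehouse, a transformation tool, a BI platform) for each stage:

```python
# Illustrative sketch of the stages a modern data stack covers.
# All names are invented -- this only shows the shape of the flow.

def extract_and_load(source_records, warehouse):
    """A managed pipeline lands raw source data in the warehouse as-is."""
    warehouse.setdefault("raw_orders", []).extend(source_records)

def transform(warehouse):
    """A transformation tool cleans raw rows into analysis-ready tables."""
    warehouse["orders"] = [
        {"id": r["id"], "amount_usd": round(r["amount_cents"] / 100, 2)}
        for r in warehouse["raw_orders"]
        if r["amount_cents"] > 0  # example business rule: drop refunds/voids
    ]

def report(warehouse):
    """A BI layer aggregates the modeled table for end users."""
    return sum(r["amount_usd"] for r in warehouse["orders"])

warehouse = {}
extract_and_load([{"id": 1, "amount_cents": 1250},
                  {"id": 2, "amount_cents": -300}], warehouse)
transform(warehouse)
total = report(warehouse)  # only the valid order survives: 12.5
```

The point is the separation of stages: each tool in the stack owns exactly one of these steps, which is what makes the pieces swappable.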
Why do you need a modern data stack?
- Traditional stacks are inefficient at stitching data together
- They fail to keep up with the volume of data enterprises generate
- Integrations fail, go without upgrades, and struggle to keep data in sync
“The modern data stack saves time, money and effort. The low and declining costs of cloud computing and storage continue to increase the cost savings of a modern data stack compared with on-premise solutions. Off-the-shelf connectors save considerable engineering time otherwise spent designing, building and maintaining data connectors, leaving your analysts, data scientists and data engineers free to pursue higher-value analytics and data science projects.”
How is it different from a legacy data stack?
- On the cloud and hopefully cloud-native
- Requires very little maintenance
- Well integrated with all your tools
- Easy for MLOps practitioners and other business users to work with
- Does not need in-depth technical knowledge
What challenges do you face when upgrading to a modern data stack (MDS)?
1. Long turn-around time to untangle and set up infrastructure
Companies making use of on-premise infrastructure are responsible for all the costs associated with it, such as the army of engineers required to keep everything maintained and running smoothly.
Since the setup is so deeply interconnected, what may seem like a minor change might break other parts of the system. Finding the exact logical coupling between the systems requires a lot of work-hours to analyze before any improvements can be made to the existing landscape.
2. Slow response to new information
As the company grows, so does its data and computational power needs. It is very costly in terms of resources and time when it comes to scaling out (expanding) on-premises infrastructure. Since on-premises infrastructure is difficult to scale, this naturally leads to a limit in how much computational power there is to analyze data. Data pipelines can take hours to complete, a problem that is compounded as the organization grows.
A traditional data stack requires slow ETL (Extract, Transform, Load) operations before newly ingested data can conform to the rest of the data model. A new data update can take weeks and many hours of refactoring before insights appear. By the time the data is ready, the organization can no longer act in time, resulting in missed opportunities.
3. Expensive journey to insights
A lot of the report generation is manually done, especially when the data is coming from different sources. The report is manually generated, manually cleaned, and manually transferred to Excel (gasp!). This leads to errors being made, time being taken away from other business-critical tasks, and the inability to scale.
Analysts are unable to efficiently perform their roles due to the complex landscape. Data engineers are pulled into operational queries that prevent them from doing their actual jobs (like making the data pipelines more scalable!).
Seeing how competitive the business landscape is and the need to adapt to new information quickly, it’s quite clear that the traditional data stack is not an ideal solution. This is where the modern data stack comes in to help your business remain competitive.
What are the benefits of the modern data stack?
Move from IT-focused to a business-focused operating model
- With an MDS, your organization regains the freedom to focus on the business side of things instead of being bogged down by IT-related woes.
- Your organization can have leaner data teams and can focus on the higher value data tasks instead of losing time with the administration and performance optimization of the traditional data stack.
- The tools offered by an MDS are designed with greater accessibility in mind (no-code or little code needed), greatly lowering the technical barrier to entry.
- An MDS treats self-service as core functionality, reducing dependencies on data professionals. This means that CMOs can extract campaign analytics themselves and view data teams as enablers rather than bottlenecks.
Long-term commitments are replaced with plug-and-play flexibility
- Since infrastructure is no longer on-premises and is instead deployed in the cloud, companies no longer have to worry about hardware/platform maintenance and its associated costs (which results in significant savings).
- Storage & compute are available on-tap, improving the data processing response time through the cloud provider’s elasticity.
- The modern data stack makes use of software-as-a-service (SaaS) platforms, offering out-of-the-box tools. This means that your team can get to work with minimal setup requirements. (Hence the “we work with the modern data stack” slogan on seemingly every new DataOps/MLOps tool’s website.)
Moving beyond once-off analytics to operational BI and AI
- Modern data stacks are much faster to set up and iterate, eliminating the requirement for large IT teams. This allows for non-tech companies to start generating actionable insights within a few hours, instead of the usual days or weeks.
- Data can come from a variety of first and third-party sources. A modern data stack can integrate all of these sources into its data ingestion tool which in turn will work with business intelligence tools.
Treating data governance as a first-class citizen
- Problems can be detected and mitigated earlier in the process.
- The tools provided by MDS vendors allow for better data quality, privacy control, and access governance. With the rise of cybersecurity threats, responsible AI, and increasing regulations on data, systems built without data governance in mind are every CIO’s nightmare. Failing to protect data could lead to disastrous consequences for the organization.
- While it’s still a challenge to secure an entire stack, MDS technology providers don’t treat data governance as an afterthought. This results in data governance being part of the process across the entire stack.
The Elements of a Modern Data Stack
- Data source
- Data integration
- Data Storage and Querying
- Data Transformation
- Data Governance and Monitoring
- Data Pipeline
- BI / data visualization
- The loading process involves moving data from one location to another.
- Store everything in one location, generally in the cloud, with warehousing.
- Transform it into data that can be utilized.
- Serve analysis and business intelligence to teams.
The Layers in a modern data platform
The “right” data stack will look vastly different for a 5,000-person e-commerce company than it will for a 200-person startup in the FinTech space, but there are a few core layers that all data platforms must have in one shape or another.
Keep in mind: just as you can’t build a house without a foundation, frame, and roof, at the end of the day, you can’t build a true data platform without each of these 6 layers. Below, we share what the “basic” data platform looks like and list some hot tools in each space (you’re likely using several of them):
Data Storage and Processing
The first layer? Data storage and processing, as you need a place to store your data and process it before it is later transformed and sent off for analysis. This layer becomes especially important when you start to deal with large amounts of data, hold that data for long periods of time, and need it readily available for analysis.
With companies moving their data platforms to the cloud, cloud-native solutions (data warehouses, data lakes, or even data lakehouses) have taken over the market, offering more accessible and affordable options for storing data relative to many on-premises solutions.
Whether you choose to go with a data warehouse, data lake or some combination of both is entirely up to the needs of your business.
Recently, there’s been a lot of discussion around whether to go with open source or closed source solutions (the dialogue between Snowflake and Databricks’ marketing teams really brings this to light) when it comes to building your data platform.
Regardless of which side you take, you quite literally cannot build a modern data platform without investing in cloud storage and compute. Snowflake, a cloud data warehouse, is a popular choice among data teams when it comes to quickly scaling up a data platform. (Image courtesy of Snowflake.)
Data Ingestion
As is the case for nearly any modern data platform, there will be a need to ingest data from one system to another.
As data infrastructures become increasingly complex, data teams are left with the challenging task of ingesting structured and unstructured data from a wide variety of sources. This is often referred to as the extraction and loading stage of Extract Transform Load (ETL) and Extract Load Transform (ELT). Data ingestion tools, like Fivetran, make it easy for data engineering teams to port data to their warehouse or lake.
Even with the prevalence of ingestion tools available on today’s market, some data teams choose to build custom code to ingest data from internal and external sources, and many organizations even build their own custom frameworks to handle this task.
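A custom ingestion job usually has the same extract-then-load shape as the managed tools. Here is a hedged, self-contained sketch using an in-memory SQLite database as a stand-in for a cloud warehouse; the payload, table, and field names are all invented for illustration:

```python
import json
import sqlite3

# Toy stand-in for a custom extract-and-load job: pull records from a
# source and land them unmodified in a store. A real job would call a
# SaaS API over HTTP and load into a cloud warehouse; sqlite3 keeps
# this sketch self-contained and runnable.

def extract():
    """Pretend API response; real code would use an HTTP client here."""
    payload = '[{"user_id": 1, "event": "signup"}, {"user_id": 2, "event": "login"}]'
    return json.loads(payload)

def load(records, conn):
    """Land raw records as-is -- transformation happens later (ELT)."""
    conn.execute("CREATE TABLE IF NOT EXISTS raw_events (user_id INT, event TEXT)")
    conn.executemany("INSERT INTO raw_events VALUES (?, ?)",
                     [(r["user_id"], r["event"]) for r in records])

conn = sqlite3.connect(":memory:")
load(extract(), conn)
count = conn.execute("SELECT COUNT(*) FROM raw_events").fetchone()[0]
```

Note that the job deliberately does no cleanup on the way in: in the ELT pattern, raw data lands first and business logic is applied afterwards in the warehouse.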
Orchestration and workflow automation, featuring such tools as Apache Airflow, Prefect, and Dagster, often folds into the ingestion layer, too. Orchestration takes ingestion a step further by taking siloed data, combining it with other sources, and making it available for analysis.
I would argue, though, orchestration can be (and should be) weaved into the data platform after you handle the storage, processing, and business intelligence layers. You can’t orchestrate without an orchestra of queryable data, after all!
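The idea behind these orchestrators can be reduced to a few lines: tasks declare dependencies, and the runner executes each task only after its upstream tasks finish. The toy runner below illustrates that concept only; it is not actual Airflow, Prefect, or Dagster code, and the task names are invented:

```python
# Toy illustration of what an orchestrator provides: tasks declared
# with dependencies, executed in dependency order. Real tools add
# scheduling, retries, parallelism, and observability on top.

def run_dag(tasks, deps):
    """Run each task only after all of its upstream dependencies are done."""
    done, order = set(), []
    while len(done) < len(tasks):
        for name in tasks:
            if name not in done and all(d in done for d in deps.get(name, [])):
                tasks[name]()
                done.add(name)
                order.append(name)
    return order

log = []
tasks = {
    "ingest_orders": lambda: log.append("ingest"),
    "ingest_users":  lambda: log.append("ingest"),
    "combine":       lambda: log.append("combine"),  # joins the silos
    "publish":       lambda: log.append("publish"),  # exposes data for BI
}
deps = {"combine": ["ingest_orders", "ingest_users"], "publish": ["combine"]}
order = run_dag(tasks, deps)  # combine waits for both ingests; publish runs last
```

This is exactly the "combining siloed data and making it available" step described above, expressed as a dependency graph.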
Data Transformation and Modeling
Data transformation and modeling are often used interchangeably, but they are two very different processes.
When you transform your data, you are taking raw data and cleaning it up with business logic to get the data ready for analysis and reporting. When you model data, you are creating a visual representation of data for storage in a data warehouse.
dbt, which sports a vibrant open source community, gives data analysts fluent in SQL the ability to easily transform and model data for use by your platform’s business intelligence layer.
The data transformation and modeling layer turns data into something a little more useful, readying it for the next stage in its journey: analytics.
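For illustration, here is what applying business logic to raw data can look like. In a real stack this would typically be SQL in a tool like dbt; the Python below is a self-contained sketch, and every field name and rule (normalizing identifiers, de-duplicating, deriving a paying-customer flag) is invented:

```python
# Transformation in miniature: raw, messy source rows become a clean,
# analysis-ready table by applying business logic. All names invented.

raw_subscriptions = [
    {"email": " Ada@Example.com ", "plan": "PRO",  "mrr_cents": 4900},
    {"email": "bob@example.com",   "plan": "free", "mrr_cents": 0},
    {"email": " Ada@Example.com ", "plan": "PRO",  "mrr_cents": 4900},  # duplicate
]

def transform(rows):
    seen, out = set(), []
    for r in rows:
        email = r["email"].strip().lower()  # normalize the identifier
        if email in seen:                   # business rule: de-duplicate
            continue
        seen.add(email)
        out.append({
            "email": email,
            "plan": r["plan"].lower(),
            "is_paying": r["mrr_cents"] > 0,  # derived business flag
            "mrr_usd": r["mrr_cents"] / 100,
        })
    return out

clean = transform(raw_subscriptions)  # two rows remain: ada (paying), bob (free)
```

The output table is what the BI layer queries; none of this cleanup logic should live in dashboards.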
Business Intelligence (BI) and Analytics
The data you have collected, transformed, and stored does your business no good if your employees can’t use it.
If the data platform was a book, the BI and analytics layer would be the cover, replete with an engaging title, visuals, and summary of what the data is actually trying to tell you. In fact, this layer is often what end-users think of when they picture a data platform, and for good reason: it makes data actionable and intelligent, and without it, your data lacks meaning.
Tableau is a leading business intelligence tool that gives data analysts and scientists the capability to build dashboards and other visualizations that power decision making.
Data Observability
No data platform is complete without data observability. Data observability gives teams a holistic view of data trust across five key pillars of observability, including freshness, schema, and lineage. (Image courtesy of Monte Carlo.)
With data pipelines becoming increasingly complex and organizations relying on data to drive decision-making, the need for this data being ingested, stored, processed, analyzed, and transformed to be trustworthy and reliable has never been higher. Simply put, organizations can no longer afford data downtime i.e., partial, inaccurate, missing, or erroneous data. Data observability is an organization’s ability to fully understand the health of the data in their data ecosystem. It eliminates data downtime by applying best practices learned from DevOps to data pipelines, ensuring that the data is usable and actionable.
With blast radius, it is easier and faster to identify the reports and people affected by a data incident.
Your data observability layer must be able to monitor and alert for the following pillars of observability:
- Freshness: is the data recent? When was the last time it was generated? What upstream data is included/omitted?
- Volume: is the data within accepted ranges? Is it properly formatted? Is it complete?
- Schema: what is the schema, and how has it changed? Who has made these changes and for what reasons?
- Lineage: for a given data asset, what are the upstream sources and downstream assets which are impacted by it? Who are the people generating this data, and who is relying on it for decision-making?
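As a rough illustration, the freshness and volume pillars can be expressed as simple checks over a table’s load metadata. The thresholds and names below are invented for the sketch; a real observability platform would learn these baselines automatically rather than hard-coding them:

```python
from datetime import datetime, timedelta, timezone

# Minimal sketch of two observability checks -- freshness and volume --
# run against a table's load metadata. Thresholds are illustrative only.

def check_freshness(last_loaded_at, max_age_hours=6):
    """Freshness: was the table updated recently enough?"""
    age = datetime.now(timezone.utc) - last_loaded_at
    return age <= timedelta(hours=max_age_hours)

def check_volume(row_count, expected_min, expected_max):
    """Volume: is today's row count within the accepted range?"""
    return expected_min <= row_count <= expected_max

# A table loaded two hours ago with ~10k rows passes both checks.
fresh = check_freshness(datetime.now(timezone.utc) - timedelta(hours=2))
in_range = check_volume(row_count=9_800, expected_min=8_000, expected_max=12_000)
```

In practice these checks fire alerts rather than return booleans, and schema and lineage checks require warehouse metadata rather than simple thresholds.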
An effective, proactive data observability solution will connect to your existing data platform quickly and seamlessly, providing end-to-end lineage that allows you to track downstream dependencies.
Additionally, it will automatically monitor your data-at-rest without requiring the extraction of data from your data store. This approach ensures you meet the highest levels of security and compliance requirements and scale to the most demanding data volumes.