What is the Modern Data Stack?
The data world has become a complex landscape, full of acronyms and buzzwords. In this post we will give a birds-eye view of the modern data stack and demystify some of the terminology.
The modern data stack (MDS) is a set of tools and technologies that have emerged in recent years to help businesses collect, store, process, analyze, and visualize data more effectively. In contrast to the traditional data stack (TDS), the modern data stack makes use of cloud technologies to remove the burden, complexity and risk of managing a performant on-premise data system.
In this blog, we will explore the key components of the modern data stack, popular tools and technologies, and use cases they unlock, such as Customer Data Platform (CDP), Infrastructure Data Platform (IDP), CSPM (Cloud Security Posture Management) and many other acronyms and custom use cases that involve the above processes.
A stack is a collection of technologies that combine to form a system. Usually, components of the stack can be swapped out for others. The modern data stack typically includes the following base components: Data Sources, ELT, Data Storage, Data Transformation, Orchestration, Monitoring, Governance and Data Analysis and Visualization. Though as with any engineering stack you can plug and play different tools at each component and use only some of the components, depending on your needs and use cases.
Prior to what is now called the modern data stack we had standard Relational Database Management System (RDBMS) used for analytical workloads which made it hard to scale to big data due to its storage/compute coupling and OLTP architecture.
The modern data stack started gaining popularity in the early 2010s, enabled by cloud computing and OLAP data warehouse and data lake technologies such as Redshift, BigQuery, Snowflake, and more recent innovations like ClickHouse and in-memory OLAP technologies like DuckDB. Even though it has been around for a while, the popularity of this approach has grown significantly, and we are seeing a lot of innovation in every part of the stack.
The introduction of data warehouses and data lakes also made it possible to shift Extract-Transform-Load (ELT) workloads to Extract-Load-Transform (ELT). This approach moves transformation to the data warehouse and data lake which gives much more flexibility to teams working with the data without the need to ingest the data again and running compute intensive operations.
Note: There are many more vendors at each component of the stack and we chose some of them in no particular order (apart from CloudQuery given our bias :)).
This refers to the data we want to ingest and involves different types of sources:
- Databases: Companies have many databases for different workloads. Since these databases are not optimized for analytical workloads, it is often necessary to ingest data from production databases to analytical databases where you can run complex queries without impacting the production system. Additionally, there have been many advances around Change Data Capture (CDC) for databases, allowing you to ingest data in real-time from databases. This enables efficient, near-real time syncing between source databases and destinations.
- APIs: Most SaaS products expose their data via APIs. So, data for anything from marketing or finance products such as Salesforce and Stripe, to infrastructure platforms such as AWS, GCP, Azure can be loaded.**
- Event Streams: This is another popular source type where you might want to subscribe to an event queue and send it to an analytical destination.
This is where CloudQuery (opens in a new tab) and other EL tools come in. They connect sources to the destinations. EL tools have a number of unique features to be able to be performant and scale to an infinite number of sources and destinations.
You can also think of EL tools as an intermediate language between sources and destinations. They extract data from sources, transform them to an internal type system, and then send them to any of the supported destinations, which transform the internal type system to the destination type system.
Other important features of EL tools:
- Open Source - Not all EL tools are open source, but if you want to extend connectors or write new ones, you will need to find an open source EL tool such as CloudQuery. :)
- Performance - The better the performance, the cheaper the EL workload will be, and the faster data will be ingested.
- Pluggable Architecture - Given the large number of sources and destinations, it is essential that sources and destinations are decoupled; otherwise, each source will need to support a growing number of destinations.
Using the tools above, the data can be stored in a centralized place for further analysis. This category includes databases, data warehouses, and data lakes.
The advances in OLAP data warehouses and data lakes were one of the main enablers of the modern data stack: it separated storage and computation of the traditional database, and made it cheap to store large amounts of data and query them efficiently.
One of the main design/architecture changes in the modern data stack was moving from "ETL" (Extract-Transform-Load) to "ELT" (Extract-Load-Transform). What this means is that any additional transformations are made inside the data warehouse or data lake. As requirements change, this gives more flexibility for teams to create different transformations inside the database, rather than re-running compute-intensive workloads to reload data from the source. Transformations can be done with anything from pure SQL, to Python, to dbt.
The three key things you want from your orchestration platform are reliability, observability, and the ability to handle dependencies.
In simpler (non-critical) cases you can get away with simple cron-based or serverless solutions, like GitHub Actions. As the importance of the data grows and you add more transformation steps, you're likely to need an orchestrator like Apache Airflow to declare dependencies between jobs and notify you if things go wrong. This leads us to Monitoring.
APIs and data pipelines may experience downtime, throttling, and failures, so it's important to be aware of when something goes wrong, what specifically failed, and when to rerun a job. Monitoring isn't mandatory from day one, and you can start with something as simple as GitHub action notifications, structured log parsing produced by CloudQuery, or re-using orchestrator capabilities such as Airflow for monitoring jobs.
A lot of work can go into creating a robust data pipeline, but if the resulting data quality and documentation are lacking, the results of using the data will likely be as well. This is where the Governance component comes into play. Tools like Great Expectations, Amundsen and dbt all offer insight into how data moves through the pipeline, the ability to document their features and constraints, and monitoring of the quality of the data so that analysts are not deriving insights from low-quality data.
Visualizing data is key to making sense of all the data that we have gathered. There are many tools that provide such capabilities, such as Superset (opens in a new tab), Metabase (opens in a new tab), and Power BI. The cool part about the modern data stack is that most likely, you already use one of those tools in your company, and you can use it for this stack instead of buying yet another dashboard.
Other optional analysis tools are Jupyter Notebook (opens in a new tab) that connects to the central database, where users can collaborate on queries and analysis in a comfortable way. Also, you can feed this back upstream to machine learning models in TensorFlow (opens in a new tab), PyTorch (opens in a new tab), and so on.
The number of use cases for all that data is infinite. Some popular use cases even have their own acronyms, such as CSPM (Cloud Security Posture Management), CDP (Customer Data Platform), IDP (Infrastructure Data Platform) and CNAPP (Cloud-Native Application Protection Platform). The integration of different data sources often enables unique data science, business insight, security and machine learning use cases, among others.
Even though the modern data stack was born a decade ago, its growing popularity is leading to fast-paced and exciting innovation in every component of the stack.
Across the stack, we are seeing Apache Arrow (opens in a new tab) standardizing in-memory storage and communication. For example, both InfluxDB (opens in a new tab) and Pandas 2.0 (opens in a new tab) now use Arrow for standardized, efficient internal memory representation. We expect this to lead not just to more performant analysis, but also to better interoperability with the rest of the data stack.
In data sources, we are starting to see the native standardization of Arrow Flight and Arrow Flight SQL.
In EL, we are seeing open source high-performance native tools such as CloudQuery that run cross-platform, have a pluggable architecture and are standardizing in-memory type transformations with Apache Arrow.
In data storage, we are seeing the rise of new columnar databases and query engines such as ClickHouse and DuckDB. We are also seeing storage and catalog standards such as Parquet and Apache Iceberg that can decouple schema changes from storage, along with the existing decoupling of storage and compute.
It's an exciting time for the data community, and we’re excited to be a part of this evolving story. So stay tuned and give us a try!