An automated approach with Airflow
“Data quality is a team effort, and the strength of the team depends on the strength of its weakest link.”
Introduction
Data quality is critical for any data-driven organisation that wants to make informed decisions based on reliable information. Inaccurate or incomplete data can lead to erroneous insights, wrong business decisions and loss of credibility. Many data projects fail to deliver and to gain stakeholders’ buy-in due to a lack of trust in data.
At Astrafy we always select open-source tools that excel in what they do and avoid tools that say they can do it all. For data quality, it was a no-brainer to include Great Expectations (GX) in our data quality stack as it has a huge community, it integrates really well with the main cloud providers and it now has a Cloud version to simplify the rendering of the test results.
As with every open-source project, GX OSS requires some technical skills to set up and use seamlessly. We have been through this learning curve with Great Expectations and want to share what we learned in this article. The following topics will be covered:
Some reminders about the two families of data quality (prevention versus detection)
Comparison of data quality tools and why we selected GX
Data Contracts with Great Expectations
Python package to abstract boilerplate to run GX via Airflow
A simple local development process for analytics engineers to quickly test their Expectations
Keep your Expectations DRY by defining custom Expectations
Warning versus Failing Expectations
Great Expectations Cloud
Prevention versus Detection
There are two primary aspects to Data Quality: Prevention and Detection. Data teams need both to tackle quality.
Prevention refers to data producers/controllers making considered changes to production-grade data and eliminating as many data quality issues as possible in the build phase. When you hear the term ‘shift left’, this is where it applies to data quality.
Main categories of prevention include:
Upfront collaboration and code review
Data Contracts (APIs)
Contract Enforcement / DLQ
Data Diffs
Detection (or Observability) refers to understanding the state of data quality at any given time, identifying errors/regressions when they happen, and root-causing them.
Main categories of detection include:
Pipeline monitoring
Tests & Alerting
Root cause analysis
Assertions and Data Profiling
Data Cataloging
Detection is currently facilitated by great tools which consume metadata from your data platform of choice (Snowflake, Databricks, BigQuery). Just to name a few:
dbt [transformation, testing]
Great Expectations [Assertions, Profiling]
Monte Carlo [Data Reliability, Monitoring, Alerting]
In this article we focus on prevention with data contracts that stop corrupted data from reaching the data consumers.
Choosing the right data quality tool
It can be really daunting to choose amongst all the data quality tools on the market. With so much choice, you can easily feel lost and pick the wrong tool based on the best sales pitch you hear. A significant amount of time must be spent selecting the tools that fit best within your data stack. Some rules to help guide you in this decision process:
Choose tools that excel in what they do: the Modern Data Stack offers the flexibility of assembling different tools like lego blocks. Solutions that focus on a specific area often beat solutions that want to do it all.
Choose tools that integrate well within your data stack: you need seamless integration with your current data stack, otherwise you will spend time and effort constantly trying to integrate and fix those tools within your current stack.
Don’t reinvent the wheel: there are plenty of open-source and enterprise-grade data quality tools in the market. You certainly need time to select the right one but you don’t have to build something custom.
Start small: Follow the principle of “crawl, walk, run”. Start with a simple open-source tool and have your data quality process fine-tuned and well-oiled with this tool before moving to a more complex solution. Buying the latest expensive data quality tool and plugging it on top of messy data with no data management will only make things worse.
With that in mind, we decided to go with the following data quality stack at Astrafy:
Great Expectations for data contracts within the data warehouse (inspiration article from Chad Sanderson here): a lot of our data comes from external data sources where we have no control over the data producers. We therefore land the raw data in BigQuery and, before transforming it, check that we have not received any corrupted data.
We went with GX for the following reasons:
Laser focus tool on data quality with hundreds of built-in Expectations
Seamless integration with BigQuery
Nice UI to visualise data quality results
Open-source solution with large community
Our data quality stack continues with:
dbt tests to test transformations: once we start transforming data within the dbt context, dbt tests are the easiest and most efficient way to test your data. Singular tests, generic tests and packages such as dbt-utils and dbt-expectations give you all the flexibility you need to write advanced tests.
As SQL transformations are mainly deterministic, it is good practice to mock data and test your transformations within a CI tool. This ensures that your transformation outputs match expected results and that wrong SQL code is stopped at the CI stage. We highly recommend checking the unit tests developed within the dbt-ga4 package to get a good grasp of this approach.
Data-diff to detect semantic changes in the data: this tool allows you to see how every change to dbt code affects the data produced in the modified model and its downstream dependencies. The SQL code introduced by one of your engineers might be technically correct (and therefore not caught by your tests) but simply wrong in terms of business logic.
Elementary for data observability: Elementary provides a convenient user interface to visualise all your dbt tests and also offers features to detect anomalies in your data. This is convenient to detect outliers for certain fields where you did not set up any tests.
With this data quality stack in place, we can guarantee data quality at all levels and, more importantly, have full visibility on our data quality. We have set up Slack alerts on all data quality issues directly from those tools with relatively little development. It is also of major importance to add the results of those tests as metadata to your data catalogue tool; many data catalogue tools offer direct integration with the aforementioned data quality tools. In our case we use Dataplex and leveraged the Dataplex API to send data quality results as metadata.
Data Contracts with Great Expectations
At Astrafy we apply a data mesh paradigm to our data and therefore have different data products that represent our different families of data. Each of those data products corresponds to an Airflow DAG, and the first stage in each of those DAGs makes sure that the source data is not corrupted. We want to avoid "garbage in, garbage out" at all costs. Our standard data product DAG is as follows:
Without high-quality data, every analytics initiative will be underwhelming at best and actively damaging to the business at worst. Data contracts are API-based agreements between producers and consumers designed to solve exactly that problem. Data contracts are not a new concept; they are simply new implementations of a very old idea: that producers and consumers should work together to generate high-quality, semantically valid data from the ground up.
Ideally, data contracts should be implemented before your data reaches your data warehouse. This is feasible when the data producer is part of your company and you can collaborate on setting up a dynamic data contract mechanism with a messaging system, a dead letter queue for data not meeting the contract, etc. This ideal implementation is well explained by Chad Sanderson in his article 'An engineer's guide to data contracts'.
In our case, we have no control over the data producers and get the data "AS IS" in our warehouse. We might have received corrupted data, and for that reason we consider this data unreliable until it has passed through a data contract that consists of a series of tests defined in GX. The rest of this article focuses on the implementation of those contracts with GX, how we can easily extend and test them locally, and how we automated running those tests via Airflow.
Great Expectations
Great Expectations comes with two flavours:
Great Expectations Open Source (GX OSS). It provides a suite of tools that lets you easily validate data, define Expectations, and monitor data pipelines.
Great Expectations Cloud (GX Cloud). Recently released in beta, it abstracts away some of the boilerplate you have with the open-source version.
At Astrafy, we use both GX OSS and GX Cloud to store and manage our daily validation results. With GX OSS at the core of everything Cloud offers, we keep control over customisation and deployment options, while GX Cloud provides a fully managed data quality platform where our validation results are stored and where any stakeholder can go to check the quality of the data.
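To give an idea of what GX OSS looks like in practice, here is a minimal validation sketch. The fluent API has shifted between releases, so this assumes a recent 0.16+ OSS version and uses an in-memory DataFrame in place of a BigQuery table:

```python
import great_expectations as gx
import pandas as pd

# Sample data standing in for a raw table; in practice the data would come
# from BigQuery rather than an in-memory DataFrame.
df = pd.DataFrame({"order_id": [1, 2, 3], "amount": [10.0, 25.5, 7.2]})

# An ephemeral data context is enough for a quick local check.
context = gx.get_context()

# Wrap the DataFrame in a validator and attach a couple of Expectations.
validator = context.sources.pandas_default.read_dataframe(dataframe=df)
validator.expect_column_values_to_not_be_null("order_id")
validator.expect_column_values_to_be_between("amount", min_value=0)

# Run the validation and check the overall outcome.
results = validator.validate()
print(results.success)
```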
Workflow Implementation
We wanted to achieve the following features for this data contract initiative to be successful:
Abstract away the boilerplate for handling authentication, setting up the engine to run the Expectations, etc.
Ease of defining Expectations: analytics engineers should be able to write and test Expectations from their IDE by simply defining JSON Expectations and running those via a notebook.
Possibility to test the expectations locally through a notebook: Analytics engineers should be able to run GX Expectations seamlessly through a curated notebook. More on this one in the ‘local development’ section.
Keep the code DRY by having custom Expectations defined once and reused anywhere.
Fully automated deployment within our Airflow Data Product DAG. More on this feature in the ‘Automation & Airflow scheduling’ section.
We achieved all of those thanks to GX's modularity and a few developments on our side. We tackled the first feature by building a Python package that handles all the boilerplate of creating data assets, handling authentication, etc. One core component of this package is the logic that switches between GX OSS when it runs locally and GX Cloud when it runs on Airflow.
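In spirit, that switch can be as small as the following sketch (the helper name is illustrative, not the package's actual API); gx.get_context returns a Cloud-backed context when the GX Cloud credentials are available:

```python
import os

import great_expectations as gx


def get_gx_context():
    """Return a GX Cloud context on Airflow, a local OSS context otherwise.

    Illustrative helper: when the GX Cloud environment variables are set
    (as they are in our Airflow pods), validation results go to GX Cloud;
    locally we fall back to an ephemeral OSS context.
    """
    cloud_configured = bool(
        os.getenv("GX_CLOUD_ORGANIZATION_ID") and os.getenv("GX_CLOUD_ACCESS_TOKEN")
    )
    return gx.get_context(cloud_mode=cloud_configured)
```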
The second feature was maybe the most important, as it would determine the adoption rate among our analytics engineers. If defining and running tests is too complex, you end up with a frustrated team and few tests deployed. We structured our data repository as follows:
We have a top folder 'data-contracts' that contains one folder per data product, and within each of those subfolders we have an additional split between warning and failing tests. This split gives us two levels of test criticality; in case of failing tests, we stop the data pipeline in Airflow.
Analytics engineers just have to define their tests as GX Expectations in JSON files, with one JSON file per table. Within this file, they can use any placeholders they want; those are automatically replaced by environment variables. We recommend starting small with tests and having those tests driven by the business: fields of high business value should have multiple tests, while most fields can go untested at first. A schema validation test should then be added for each table to ensure that your downstream dbt models will not break in case of undetected schema changes. Our data contracts fulfil two aspects of the data contract described above:
Semantic checks through tests at field level
Schema validation
The GX JSON file contains all the necessary information to define the tests (the data source, the batch request, and the specific Expectations that need to be validated). It is worth noting that the data repository where analytics engineers work does not contain any GX boilerplate code. All this code is abstracted away in a Python package hosted on Google Cloud Artifact Registry. The next section deep dives into this package.
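To give a flavour, here is a sketch of what one table's contract file and the placeholder substitution can look like. The field names, placeholder syntax and table identifiers below are illustrative, not our exact schema:

```python
import json
import os
from string import Template

# Illustrative content of a per-table Expectation file; the real files follow
# GX's expectation suite JSON schema, with placeholders resolved at runtime.
RAW_CONTRACT = """
{
  "table": "${GCP_PROJECT}.${RAW_DATASET}.orders",
  "expectations": [
    {"expectation_type": "expect_column_values_to_not_be_null",
     "kwargs": {"column": "order_id"}},
    {"expectation_type": "expect_column_values_to_be_between",
     "kwargs": {"column": "amount", "min_value": 0}}
  ]
}
"""


def load_contract(raw: str) -> dict:
    # Replace ${VAR} placeholders with environment variables before parsing.
    return json.loads(Template(raw).safe_substitute(os.environ))


contract = load_contract(RAW_CONTRACT)
print(contract["table"])
```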
GX python package
As mentioned above, we have built a Python package that abstracts away all the boilerplate needed to handle authentication to BigQuery, the logic to run either the OSS or the Cloud version, and the storage of custom Expectations (which meets the requirement of keeping the code DRY). By separating this repository from our main data project, we allow non-technical members of our team to define Expectations without having to understand the technicalities of Great Expectations. In our setup, data engineers maintain this package and analytics engineers use it in the data repository where the Expectations are defined. It has proven to be a very efficient setup where each team works on what they master best.
The package selects Pandas as the default execution engine, as it is the engine that most built-in Expectations support. However, analytics engineers can override this default directly from the JSON Expectation file: Pandas can be slow on large volumes of data, and the SQLAlchemy engine can be better suited in that case. It is also the responsibility of the analytics engineers to work only with new data and to define this delta logic in the query definition of their Expectations.
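A sketch of what such an engine switch can look like inside the package (illustrative, not our actual code; the 'execution_engine' and 'project' keys are assumptions about the contract file):

```python
from great_expectations.execution_engine import (
    PandasExecutionEngine,
    SqlAlchemyExecutionEngine,
)


def build_execution_engine(contract: dict):
    """Pick the execution engine declared in the contract file, defaulting to Pandas."""
    if contract.get("execution_engine") == "sqlalchemy":
        # Push the validation query down to BigQuery via the sqlalchemy-bigquery dialect.
        return SqlAlchemyExecutionEngine(
            connection_string=f"bigquery://{contract['project']}"
        )
    return PandasExecutionEngine()
```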
The CI pipeline around this Python repository generates two artifacts:
A Python package that can be installed via "pip" and is used by analytics engineers in the data repository (more on this in the next section dedicated to 'local development').
A docker image that is used by Airflow in a self-contained stage to run GX code via a Kubernetes Pod Operator.
The CI pipeline is as follows:
The git tag follows semantic versioning, which allows users to easily pin a specific version of the Python package and docker image.
Local Development
One of the key features we wanted to achieve is that analytics engineers should be able to run GX Expectations seamlessly through a curated notebook. We wanted this experience to be as straightforward and simple as possible, so we created a notebook with placeholders for the data product to run and the environment variables to export. This notebook installs the GX package from the artifact registry and runs all the Expectations for the specified data product.
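Concretely, the notebook boils down to a few cells along these lines. The package name, index URL and entrypoint are illustrative placeholders, not our real identifiers:

```python
# Cell 1: install the internal GX package from Artifact Registry.
# The index URL and package name below are placeholders for our internal ones.
# %pip install --index-url https://europe-west1-python.pkg.dev/my-project/python-repo/simple/ gx-contracts==1.2.0

# Cell 2: point the run at one data product and export the variables
# that will replace the placeholders in the JSON Expectation files.
import os

os.environ["GCP_PROJECT"] = "my-sandbox-project"
os.environ["RAW_DATASET"] = "raw_sales"

# Cell 3: run every Expectation defined for that data product and inspect the results.
from gx_contracts import run_data_contracts  # hypothetical entrypoint of the package

results = run_data_contracts(data_product="sales")
print(results)
```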
This design is depicted in the following architecture:
Tests that can be reused across tables or data products (for instance, checking that a product field contains only specific product codes) should be defined by the data engineers within the GX package so that any analytics engineer can use them.
Results are displayed directly within the notebook, and once analytics engineers are satisfied with the GX tests they wrote, they push to their feature branch. This triggers a CI job that uploads the JSON Expectations to a storage bucket. This is the beginning of the fully automated data quality journey explained in the next section.
Automation & Airflow scheduling
We use Airflow as an orchestrator (the same idea applies to other orchestrators) and run the GX tests right after the data is ingested. This stage is depicted hereafter:
This data contract stage uses the docker image published from our GX Python repository (details in this section). It takes as an argument the data product for which it has to run the Expectations, and it fetches all those Expectations from Cloud Storage. Once it has finished running them, it outputs the results to GX Cloud, where they can be visualised. All this code runs via a Kubernetes Pod Operator, and the decoupling between the Expectations and the code that runs them makes it very flexible to add or change Expectations without redeploying the docker image. This image mainly needs to be rebuilt when custom Expectations are added or the GX version changes.
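A simplified version of what this task can look like inside the data product DAG is sketched below; the import path depends on the cncf-kubernetes provider version, and the image path, bucket name and CLI flags are illustrative placeholders:

```python
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,
)

# Illustrative data-contract task: the image tag, registry path and arguments
# stand in for our internal conventions.
data_contracts_sales = KubernetesPodOperator(
    task_id="data_contracts_sales",
    name="data-contracts-sales",
    image="europe-west1-docker.pkg.dev/my-project/gx/gx-runner:1.2.0",
    arguments=[
        "--data-product", "sales",
        "--expectations-bucket", "gs://my-company-data-contracts-prd",
    ],
    # Credentials so the pod can push validation results to GX Cloud.
    env_vars={
        "GX_CLOUD_ORGANIZATION_ID": "{{ var.value.gx_cloud_org_id }}",
        "GX_CLOUD_ACCESS_TOKEN": "{{ var.value.gx_cloud_access_token }}",
    },
    get_logs=True,
)
```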
One important point about this Airflow stage is that it is marked as failed if one of the "failing" Expectations fails. As mentioned above, we have a split between "warning" and "failing" Expectations, and any Expectation whose violation would propagate highly corrupted data to downstream consumers should be categorised as a "failing" Expectation.
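Enforcing this split takes very little code. In a hedged sketch (the helper and the result structure are illustrative), only suites loaded from the "failing" folder raise and therefore fail the pod and the Airflow task, while "warning" suites only log:

```python
import logging


def handle_validation_result(result: dict, severity: str) -> None:
    """Fail the run only for 'failing' contracts; log a warning otherwise.

    Illustrative helper: `result` is the dict form of a GX validation result
    and `severity` comes from the folder the Expectation suite was loaded from.
    """
    if result["success"]:
        return
    if severity == "failing":
        # A raised exception marks the Kubernetes pod, and hence the Airflow task, as failed.
        raise RuntimeError(f"Failing data contract violated: {result.get('meta')}")
    logging.warning("Warning data contract violated: %s", result.get("meta"))
```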
Last but not least, you will certainly have a 'dev' and a 'prd' environment; pushing Expectations to separate environment buckets and using semantic versioning for the docker image allows you to seamlessly run different Expectations and versions of the code in 'dev' and 'prd'.
Checking results in Great Expectations Cloud
One of the benefits of using Great Expectations Cloud is the ability to visualise our data validation results through a nice UI. We can store and monitor our data validation results over time and track any changes or anomalies in our data.
GX Cloud provides a user-friendly interface that allows us to visualise our history of data validation results in a variety of ways, including data quality reports and data profiling. We can easily share these reports with other members of our team and ensure that everyone is on the same page when it comes to the quality of our data.
By visualising our data validation results in GX Cloud, we can quickly identify any issues with our data and take action to rectify them. This has been incredibly valuable for us, as it allows us to maintain high-quality data and ensure that our data pipeline runs smoothly.
Conclusion
Data quality is one of the main topics in today's Modern Data Stack, and most companies struggle to achieve it. The reasons are multiple and often arise from a lack of knowledge about how exactly to tackle data quality: what its components are, which tools to use for which use cases, etc. Once you demystify the topic and go beyond the buzzword, you gain clarity and are ready to set up a plan of action.
In this article, we started by reviewing a few concepts of data quality and then deep dived into the implementation of data contracts with GX within a data warehouse. This was a specific use case, and subsequent articles will cover how we tackle data quality for our dbt transformations, how we detect anomalies automatically and, last but not least, how we feed those data quality results back into our data catalogue.
Regarding Great Expectations specifically, we are convinced it is a great tool to tackle data contracts use cases and its cloud version makes it comfortable to share data quality results with a non-technical audience.
As a final word, we want to emphasise that tools and data engineers alone will not solve your data quality issues. It is foremost a team effort involving everyone in the chain of data. Data producers, data engineers, analytics engineers and data analysts should all be involved in the data quality process. Data producers should be aware of the data contracts and be incentivised to ship data of good quality, while data engineers should build and maintain the data quality frameworks that then allow analytics engineers and data analysts to write tests; it is a real living system that is no stronger than its weakest link. We have often seen companies where data engineers bear the entire responsibility for data quality, and this has never led to sustainable results.
If you enjoyed reading this article, stay tuned as more articles will come in the coming weeks on data quality. Follow Astrafy on LinkedIn to be notified for the next article ;).
If you are looking for support on Data Stack or Google Cloud solutions, feel free to reach out to us at sales@astrafy.io.