Analytics Engineering

Implementation of the Data Contracts with dbt, Google Cloud & Great Expectations: Part 2

Jul 18, 2023


Illustrative flowchart showing the data contract update process. Data Producers edit contracts and submit a pull request. Data Consumers review and approve it. Changes in the 'Data Contracts Repo' trigger a CI/CD pipeline, which updates and stores the contracts in a versioned 'Data Contracts Bucket'.

Data Contracts Management Layer

Introduction

In this article, we delve into the heart of the data contract system: its collaborative layer, which involves Data Producers and Data Consumers. This layer, anchored by the Data Contract Store, maintains and versions data contracts. For our implementation at Astrafy, we’ve combined a git repository with a Cloud Storage bucket. This setup allows data contracts, which are essentially text files with schema descriptions, to be read, written, and understood by users. Producers propose changes to these contracts, and Consumers review and approve them, which builds a shared understanding of the changes. Upon approval, changes are deployed to Cloud Storage through an automated CI pipeline. Version control, retention policies, and IAM role management provide security and traceability. Let’s dive deeper into how this system operates.

Refer to Part 1 for the high-level architecture of this implementation, and to Part 3 for more details on the data contracts execution layer.

Schema language selection

The goal of this subsystem is to allow data producers and consumers to establish an agreement regarding the data processed. Depending on the experience of the analytics engineers in your organization, you can modify the implementation to suit your needs. In our case, we usually work with people who deal with schema files and git daily, so we structured our solution around those tools. In organizations where this is not the case, the Data Contract Management Layer could instead involve web apps with drag-and-drop interfaces for building contracts. Such tools are often easier to use for data analysts without technical expertise; however, they usually take more time to develop and are more difficult to extend.

The contract that we deal with here is a schema file capable of describing a set of columns, including their names and types. In addition, fields can be lists/arrays or structs. There are a couple of schema languages that can be used for this task, and you can also build your own schema language on top of them. Let’s review our options:

Open source performant message passing languages: Apache Avro, Protobuf
dbt schema files
BigQuery JSON schema
Custom schema

Apache Avro & Protobuf

These languages are capable of expressing schemas, both names and types, including arrays and structs. They are used to generate classes in various programming languages. On top of that, these languages can automatically guard you against breaking backward compatibility in your schemas (for example, by removing a column). Also, if you happen to work with Apache Kafka or Apache Hadoop, chances are that you’re already using Avro and can leverage that for your implementation of Data Contracts.

However, for dbt & BigQuery users, these languages might not be the best fit. Firstly, their primitive types do not include types that are common in the analytics world, such as the various date and time types and numerics. One can express such types as a string, but then the contract loses its value: you cannot enforce the existence of the field “buy_timestamp” of type Timestamp, only of type String, and we all know how common these kinds of type errors are in the analytics world. Secondly, they use concepts and tools that are familiar to people from the Software Engineering world but not necessarily to those from Data Analytics, which might make them more difficult to work with. Protobuf requires attaching a number to each field, and the Avro ecosystem is centered on the JVM, which is less common in analytics.
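To make the backward-compatibility guard mentioned above concrete, here is a minimal Python sketch of the kind of check involved. It works on plain name-to-type mappings rather than real Avro or Protobuf schemas, and both the `breaking_changes` function and the sample columns are illustrative assumptions, not part of either tool.

```python
def breaking_changes(old: dict[str, str], new: dict[str, str]) -> list[str]:
    """Return human-readable reasons why `new` breaks consumers of `old`."""
    problems = []
    for name, old_type in old.items():
        if name not in new:
            problems.append(f"column removed: {name}")
        elif new[name] != old_type:
            problems.append(f"type changed: {name} ({old_type} -> {new[name]})")
    # Columns that exist only in `new` are additions and do not break readers.
    return problems


# Hypothetical schema versions: a type change plus a harmless new column.
old_schema = {"order_id": "integer", "buy_timestamp": "timestamp"}
new_schema = {"order_id": "integer", "buy_timestamp": "string", "channel": "string"}
result = breaking_changes(old_schema, new_schema)
```

Here the change of “buy_timestamp” from timestamp to string is reported, while the added column is not.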

dbt Schema Files

Maybe there is no need to reinvent the wheel. dbt models can have schema files attached to them, so could it be enough to use those here?

Data Contracts should be usable across your whole data organization. Most likely, only a fraction of your data tools use dbt. Hopefully that fraction covers most of the processing, but it’s common that, due to legacy systems, migrations, or just sheer force of will, the systems producing or consuming the data do not run inside dbt. Therefore, a Data Contract meant to be used across the organization cannot be dbt-specific. Ideally, you would be able to use the same Data Contract between a table in BigQuery and a Data Studio dashboard.

On top of that, dbt schema files define a lot more than just the schema. It is difficult to look at such a file and determine at a glance which columns it requires. Also, we neither need nor want to contract all of the columns of a table.

BigQuery JSON schema

While choosing this representation would make schema comparison much easier, it is not without its flaws either. JSON as a format is quite wordy (even though it is a big step up from XML), and you need to make sure your quotes are in place. Also, choosing BigQuery JSON schema and comparing schemas 1:1 would make it easy to just copy and paste an existing table with all of its columns as contracted, which is not necessarily what we want.

There are no hard arguments against BigQuery JSON schema. If you accept Google Cloud as your data analytics environment, you are quite locked into BigQuery anyway, and even though it may not cover all of the use cases, the vast majority of them would probably store their data in BigQuery. BigQuery JSON schema was therefore the runner-up: it lost only to the Custom Schema.
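The copy-and-paste concern above comes from comparing schemas 1:1. As a minimal Python sketch of the alternative, the check below only verifies that the contracted columns are present in a table’s BigQuery JSON schema with the expected type, and ignores every other column. The `contract_violations` function and the sample columns are hypothetical, and only top-level columns are handled; the `name`, `type` and `mode` keys follow BigQuery’s JSON schema format.

```python
import json

# A table schema in BigQuery's JSON schema format (sample data for illustration).
TABLE_SCHEMA_JSON = """
[
  {"name": "order_id", "type": "INTEGER", "mode": "REQUIRED"},
  {"name": "buy_timestamp", "type": "TIMESTAMP", "mode": "NULLABLE"},
  {"name": "internal_note", "type": "STRING", "mode": "NULLABLE"}
]
"""


def contract_violations(contract: list[dict], table: list[dict]) -> list[str]:
    """Check that every contracted column exists in the table with the same type.

    Columns present only in the table are ignored, so a contract can cover
    just a subset of the table.
    """
    table_types = {col["name"]: col["type"] for col in table}
    violations = []
    for col in contract:
        actual = table_types.get(col["name"])
        if actual is None:
            violations.append(f"missing column: {col['name']}")
        elif actual != col["type"]:
            violations.append(
                f"wrong type for {col['name']}: expected {col['type']}, got {actual}"
            )
    return violations


table_schema = json.loads(TABLE_SCHEMA_JSON)
contract = [
    {"name": "order_id", "type": "INTEGER"},
    {"name": "buy_timestamp", "type": "TIMESTAMP"},
]
violations = contract_violations(contract, table_schema)
```

The uncontracted `internal_note` column does not cause a violation, which is the behaviour a strict 1:1 comparison would not give you.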

Custom Schema for Data Contracts

Since most of our customers use dbt running on BigQuery, they are used to working with YAML files that express schemas. An easy schema language that supports all of the high-level BigQuery types seems like the best of the worlds described above. By keeping schema files simple and providing examples, we let analytics engineers quickly grasp the concepts and learn how to contract complicated fields in this schema. Below is a file with all possible types:

fields:
  - name: _string
    type: string
  - name: _bytes
    type: bytes
  - name: _integer
    type: integer
  - name: _int64
    type: integer
  - name: _float64
    type: float
  - name: _bool
    type: boolean
  - name: _timestamp
    type: timestamp
  - name: _date
    type: date
  - name: _time
    type: time
  - name: _datetime
    type: datetime
  - name: _numeric
    type: numeric
  - name: _bignumeric
    type: bignumeric
  - name: _struct1
    type:
      fields:
        - name: a
          type: integer
        - name: b
          type: string
  - name: _struct2
    type:
      fields:
        - name: a
          type: integer
        - name: b
          type:
            fields:
              - name: a
                type: integer
              - name: b
                type: string
  - name: _array_int
    type: integer
    repeated: yes
  - name: _array_string
    type: string
    repeated: yes
  - name: _array_struct
    type:
      fields:
        - name: a
          type: integer
        - name: b
          type: string
    repeated: yes

All of the types that can be expressed with the schema language
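Because the custom schema maps closely onto BigQuery types, a contract can be translated into BigQuery’s JSON schema format with a short recursive function. The sketch below is an assumption about how such a translation might look, not our production code. The `to_bigquery_fields` name is hypothetical, the contract is shown as the Python structure a YAML parser would return (PyYAML, for instance, loads `repeated: yes` as `True`), and every non-repeated field is assumed to be `NULLABLE`.

```python
# Contract type name -> BigQuery JSON schema type name.
SCALAR_TYPES = {
    "string": "STRING", "bytes": "BYTES", "integer": "INTEGER",
    "float": "FLOAT", "boolean": "BOOLEAN", "timestamp": "TIMESTAMP",
    "date": "DATE", "time": "TIME", "datetime": "DATETIME",
    "numeric": "NUMERIC", "bignumeric": "BIGNUMERIC",
}


def to_bigquery_fields(fields: list[dict]) -> list[dict]:
    """Translate contract fields into BigQuery JSON schema fields."""
    result = []
    for field in fields:
        column = {
            "name": field["name"],
            # Assumption: anything that is not repeated is nullable.
            "mode": "REPEATED" if field.get("repeated") else "NULLABLE",
        }
        field_type = field["type"]
        if isinstance(field_type, dict):
            # A struct: the type is itself a nested list of fields.
            column["type"] = "RECORD"
            column["fields"] = to_bigquery_fields(field_type["fields"])
        else:
            # Raises KeyError for a type the schema language does not define.
            column["type"] = SCALAR_TYPES[field_type]
        result.append(column)
    return result


# Two fields taken from the listing above, as a YAML parser would return them.
contract = {
    "fields": [
        {"name": "_timestamp", "type": "timestamp"},
        {"name": "_array_struct", "repeated": True, "type": {
            "fields": [
                {"name": "a", "type": "integer"},
                {"name": "b", "type": "string"},
            ],
        }},
    ]
}
bigquery_schema = to_bigquery_fields(contract["fields"])
```

Serializing `bigquery_schema` with `json.dumps` gives a schema that can be compared against a table’s schema, so the contract stays readable while the comparison remains simple.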

Data Contracts Repository Structure

There needs to be a separate directory for each data product, so that all contracts for a given data product can be retrieved easily. Also, each schema file should map to a given entity. We could have opted for just one YAML file named after the entity; however, to keep this structure easy to extend in the future (for example, with data tests defined here), we opted for a separate directory per entity that, for now, contains only a schema file.

Directory & file structure of data contracts repo
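As a rough illustration of why the per-product, per-entity layout makes retrieval easy, the Python sketch below lists all contracts of one data product. The schema file name `schema.yaml`, the data product `sales`, and the entities `orders` and `customers` are all hypothetical placeholders, since the exact names depend on your repository.

```python
import tempfile
from pathlib import Path

SCHEMA_FILE = "schema.yaml"  # hypothetical file name, chosen for this sketch


def contracts_for_product(repo_root: Path, data_product: str) -> dict[str, Path]:
    """Map each entity of a data product to its schema file."""
    product_dir = repo_root / data_product
    return {
        entity_dir.name: entity_dir / SCHEMA_FILE
        for entity_dir in sorted(product_dir.iterdir())
        if (entity_dir / SCHEMA_FILE).is_file()
    }


# Build a throwaway repository layout to demonstrate the lookup.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for entity in ("orders", "customers"):  # hypothetical entities
        entity_dir = root / "sales" / entity  # "sales" is a hypothetical data product
        entity_dir.mkdir(parents=True)
        (entity_dir / SCHEMA_FILE).write_text("fields: []\n")
    found = contracts_for_product(root, "sales")
    entities = list(found)
```

Adding data tests later would only mean placing another file next to the schema file in each entity directory, without changing the lookup.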

Deployment pipelines

The aim of the deployment pipelines is, well, to deploy the contracts to Cloud Storage, from where they will be accessible to other principals. An added benefit of using a Git repository is the ease of versioning thanks to tags that can be applied to commits. We leverage this in our implementation.

Astrafy operates on GitLab and uses GitLab CI, which is similar in its capabilities to GitHub Actions. In our deployment pipeline, we distinguish two separate environments: development & production. The code is versioned with the help of git tags.

The production deployment pipeline has two major steps:

Bump the patch part of the version
Deploy files to the prod bucket

For the first part, it reads the latest tag from the git history, increases the patch version by one, tags the current commit, and pushes it to the repository. (Make sure that the CI token has write access to the repository!)
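The version bump itself is simple string handling. The Python sketch below shows one way to do it, assuming tags of the form `MAJOR.MINOR.PATCH` with an optional `v` prefix; the tag format and the `bump_patch` name are assumptions, not something prescribed by our pipeline.

```python
import re

# Assumed tag format: MAJOR.MINOR.PATCH, optionally prefixed with "v".
TAG_PATTERN = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)$")


def bump_patch(latest_tag: str) -> str:
    """Return the next tag with the patch component increased by one."""
    match = TAG_PATTERN.match(latest_tag)
    if match is None:
        raise ValueError(f"not a MAJOR.MINOR.PATCH tag: {latest_tag!r}")
    major, minor, patch = (int(part) for part in match.groups())
    # Keep the "v" prefix only if the previous tag had one.
    prefix = "v" if latest_tag.startswith("v") else ""
    return f"{prefix}{major}.{minor}.{patch + 1}"


next_tag = bump_patch("v1.4.9")
```

In a CI job, the latest tag could be read with a command such as `git describe --tags --abbrev=0` before calling this function, and the result applied with `git tag` and `git push`.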

For the second part, it authenticates with the given Service Account on Google Cloud and uses gsutil rsync to create a new directory within the production bucket. The bucket, Service Account, and permissions need to be in place before the deployment happens. For setting up infrastructure on Google Cloud we use Terraform.

The production pipeline is triggered when new commits are pushed to the main branch. This means that if you enforce Pull Requests for updates to your main branch, any change to a contract must go through review.

The development deployment pipeline is simpler: it only deploys contracts to the dev bucket for the current version (taken from the latest git tag). You can trigger it manually, or have it triggered by a push to a non-main branch.
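The copy step can be sketched in Python as a function that builds the gsutil command for a versioned destination. The bucket name, the local `contracts` directory, and the choice to name the new directory after the version tag are assumptions made for this example.

```python
def rsync_command(source_dir: str, bucket: str, version: str) -> list[str]:
    """Build the gsutil command that copies contracts into a versioned directory."""
    # Assumption: the new directory in the bucket is named after the version tag.
    destination = f"gs://{bucket}/{version}"
    # -m runs operations in parallel; -r recurses into subdirectories.
    return ["gsutil", "-m", "rsync", "-r", source_dir, destination]


# Hypothetical local directory, bucket name, and version.
command = rsync_command("contracts", "example-data-contracts-prod", "v1.4.10")
```

Once the Service Account is authenticated, the resulting list can be passed to `subprocess.run(command, check=True)`, so the job fails if the copy does.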

Data Contracts Structure on Cloud Storage bucket

Conclusion

The collaborative layer of the data contract system presents an approach to managing and versioning data contracts. By combining a git repository with a Cloud Storage bucket, the system provides a platform that is easy to use, is secure, and fosters collaboration between Data Producers and Data Consumers. The review and approval process ensures that any changes to the data contracts are mutually agreed upon, which promotes a shared understanding and maintains the integrity of the data.

The system also provides a CI pipeline that deploys approved changes efficiently, with straightforward version control and the ability to revert to previous versions of a contract if needed. Moreover, retention policies and IAM role management add an extra layer of security, safeguarding the data and preventing unauthorized modifications.

This collaborative layer is an integral part of the broader data contract system. For a more comprehensive understanding, remember to review Part 1 for the high-level architecture of this implementation, and see Part 3, where we dissect the Data Contract Execution Layer in more depth. This series aims to provide a thorough insight into the power and sophistication of the data contract system, highlighting how it can revolutionize your data management strategies.
