dbt at scale on Google Cloud — Part 2

Perseverance is not a long race; it is many short races one after the other.

In this part 2 of “dbt at scale on Google Cloud”, we will deep dive into the following topics

Infrastructure
dbt versioning
dbt style guide
dbt macros
Data Quality
- Unit testing using dbt test
- Data quality using Great Expectations
Data Governance
- Securing the data
- Regulations and Compliance

Infrastructure

Infrastructure is the core foundation where your data ecosystem is running. If you are reading this article, chances are you are already on Google Cloud or considering it. Hereafter is an architecture of what we recommend for an analytical project with dbt:

Diagram depicting a multi-layered Google Cloud data architecture for a company, highlighting different project environments such as sandbox, user acceptance testing, and production, with data flowing through landing zones, staging, warehouses, and datamarts, managed via Virtual Private Clouds.

This might seem intimidating at first but this architecture is actually quite simple. As Google Cloud folders and projects are free to create, you should leverage those in order to organise your data ecosystem. This architecture has two layers of folders under the company organisation:

The first one is at department level
The second one is within the data analytics department where folders exist to separate the different environments and also to separate the data from the engine that processes and moves this data (Data projects versus OPS projects).

At project level, the following structure applies and repeats for each environment (i.e. sbx, uat and prod):

One project per data layer and data flows from the landing zone project until the datamart project where the data is accessed by downstream applications. Each data layer serves a specific purpose:
- Landing zone: this layer is where the data lands “AS IS” from the source system. It must be a mirror from the source system.
- Staging area: At this layer basic transformations are applied to the landing zone tables. Those transformations include data type casting, column selection and small cleaning at column level. No joins or complex transformations are applied yet.
- Data warehouse: At this layer, data modelling is applied to each of your data domains. This layer is the brain of your analytical database and a significant amount of time must be spent to thoroughly design the best data model for your domain. Star schemas, data vault 2.0, 3NF, full denormalization (BigQuery excels at handling huge tables fully denormalized thanks to structs and arrays) must be compared and selected to best suit your data. We have put at the end of this article some links to valuable data modelling articles [1].
- Datamarts: This is the only layer that matters for all your stakeholders and downstream applications. All the data at this level must be curated and serve a clear purpose of feeding a dashboard, a Machine Learning model or an API product.
On the OPS side, recommendation is to have one compute project per environment and this project will host the following resources:
- Cloud composer to orchestrate the move of data between the different data layers.
- Any other compute product that your data department requires (GKE, Cloud Function, Compute Engine, etc.). For instance hosting Airbyte on GKE would take place on those OPS projects.
There is also one “artifact” project under the OPS folder that is environment agnostic. This project hosts all the different common artifacts used by your different compute products. We will see in the “dbt versioning” hereafter that dbt runs via a Kubernetes Pod Operator on Airflow and this means that the dbt code is dockerized and the image is stored on Google Cloud artifact registry of this dedicated artifact project.

There are various benefits to this infrastructure granularity:

One of the main ones is that it gives clarity and orderliness to your data ecosystem; all your data is not cluttered into one project.
A direct other benefit of this granularity is the fact you can set IAM bindings at various levels (folder, project and dataset level) and risk of wrongly assigning IAM bindings is greatly reduced.
Your data team works in projects totally separated from the ones where your DataOps team work. This split makes sure that your data team cannot inadvertently create compute resources or jeopardise current compute resources.
Networking is simplified as only the OPS projects require networking; all the data projects only require BigQuery which is serverless.

The list goes on and the bottom line is that this architecture enables your data ecosystem to scale safely while maintaining maximum level of security and observability.

Last but not least, all those folders, projects and resources within those projects should be deployed using Terraform. For folders and projects you can use terraform factories developed and maintained by Google [2]. For the different resources, all must be as well created via Terraform except BigQuery tables. Those will be created by dbt as those resources are quite dynamic and Terraform cannot handle smoothly schema changes.

dbt Versioning

dbt is bringing best practices from software engineering to the data world and one of those best practices is version control. Your data codebase must be considered as a software that evolves and for which clear versions with changelogs and detailed commits are related to.

In part 1 of this article we mentioned how to structure your dbt project by referring to this discourse article. We recommend to go with one repository and to keep your different data domains into separate dbt projects within the same git repository. Interoperability can then be easily achieved by importing the models of another data domain via the packages.yml file. The “marketing” data domain can import the “finance” models by simply adding the following in its list of packages:

packages:

- local: ../finance

This makes compilation quite fast for data domains that are quite isolated and also ensures that compilation bugs from other data domains do not impact all the other data domains (as those are in their own contained dbt projects).

But what about versioning between those different dbt projects ? Versioning purists might be tempted to version each dbt project within the repository independently with tags but we go against this approach for the following simple reason:

BigQuery tables are not versioned (this would be too costly and too hard to maintain versioned and up-to-date tables of all your data transformations) and therefore referring to specific versions of a data domain is useless as anyway only one specific version would be materialised at any point in time. If a data domain has removed a specific column in a table in version 1.1, it doesn’t matter that you refer to version 1.0 of that data domain that still has that column. This column will be absent in the materialised table as the deployed version has this column removed.

The following diagram is the Git workflow we recommend:

The image is a version control workflow diagram, showing a master branch, two feature branches, and a hotfix branch with version tags and sprints indicating an agile development cycle.

A classical workflow with only one main branch (ie “master” branch) and from which developers branch out to work on feature branches. The crucial point here is that developers can iterate quickly on their development by testing their code on sbx and uat data at ease. This is achieved as follows:

On a feature branch, each push triggers a “Cloud Build” that builds a docker image with one tag being the branch name or an eventual git tag.
Image is pushed on the artifact registry project and ready to be used by the different Airflow flies running on Cloud Composer.
Images versions (ie branch name or tags) to be used for each of your data domain in sbx and uat environment is set via an environment variable in Terraform. This environment variable is then used as an environment variable the Cloud Composer resource. Once the Cloud Composer resource is update with this new environment variable, your data domain Airflow flows will use those new versions of the code.

Example of an Airlfow python file for the marketing data domain:

The image displays a snippet of Python code importing the os module, retrieving an environment variable for a DBT marketing version tag, and defining a Kubernetes Pod Operator for a dbt staging task with a specific docker image path including the project ID and the retrieved version tag.

The environment variable “VERSION_MARKETING_DBT” comes from the Cloud Composer instance and can be different for the different Cloud Composer environments. This allows you to iterate quickly on your transformations via Cloud Composer both in sandbox and uat environments (on top of being able to iterate as well locally via the dbt command line).

Once you have tested your data carefully in uat, you create a merge request to the master branch. In case you have made breaking changes in your data domain (such as columns removals and columns renaming), it is your responsibility to notify all the domain owners that use your data domain in their dbt project of the changes you have done. Safeguards via CI pipelines must be in place to ensure that your merge request to the master branch does not break anything (more on this in part 3 of this article in the DataOps section).

All data domains in production must refer to the latest tag of the master branch (for the reason aforementioned about tables not being versioned) and exceptions can be done in case you are done with a feature and you want to put it in production before the official release at the end of a sprint. In that case, you tag a “release candidate” version from your feature branch (for instance “v0.2.0-marketing-rc”). Next step is then to update the terraform environment variable for the version of your data domain in production with this release candidate tag.

dbt style guide

A style guide tells a developer how to work with a particular programming language. Programming style guides are composed of rules and guidance. Rules are mandates, with very few exceptions, enforceable throughout the entire IT organisation. Rules make good and bad programming behaviour explicit. (Software Engineering at Google).

A style guide needs to be defined at the very start of your project and enforced via pre-commit hooks, CI pipeline and code reviews. Ensuring that everyone codes with the same best-practices leads to less technical debts and ease the transfer of knowledge of anyone’s code.

This section will give a high overview of the style guide we recommend on both naming conventions and SQL structure.

Naming conventions:

Naming conventions reduce the effort needed to read and understand source code. Naming conventions let the programmer focus on more important issues than syntax and naming standards. Naming conventions facilitate and let programmers share their code with each other without losing focus on formatting.

Landing zone tables are not managed by dbt and must be created beforehand. We adopt the following naming conventions for those tables:

bqtb_[DATA DOMAIN]_[TABLE IDENTIFIER]

For all the other tables, the following namign convention applies:

bqm_[FLOW NAME]_[TABLE IDENTIFIER]: tables that are materialized with the “table” materialization keyword
bqmi_[FLOW NAME]_[TABLE IDENTIFIER]: tables that are materialized with the “incremental” materialization keyword
bqmvw_[FLOW NAME]_[TABLE IDENTIFIER]: tables that are materialized with the “materialzed_view” materialization keyword. This is a local adapter and you can refer to this page for setting it up and using it.
bqvw_[FLOW NAME]_[TABLE IDENTIFIER]: tables that are materialized with the “view” materialization keyword

For the mapping tables that we materialize via seeds, we apply the following naming convention:

mapping_[TABLE_IDENTIFIER]

It is worth noting that the file names must be unique across all your codebase but table names can be identical across data layers (as a reminder, table names are set via the “alias” setting). Our naming conventions also prepend the data layer in the file name. The following example shows a very simple example of this naming convention:

The image displays a folder structure for a marketing dbt project, including configuration files, SQL and YAML files for staging, data warehouse, and data marts, as well as seed files for Salesforce customer mapping.

The customer table in staging and data warehouse will have the same name in BigQuery (as both will be given aliases “bqmi_customer” ) but dbt will compile without any errors as those tables materialise in different project IDs (therefore have different table IDs).

If your data domain contains many models and a certain level of complexity, it is very likely you will end up with various levels of subfolders and this is totally fine as it will give more structure to your project. A simple example with a subfolder within your data warehouse layer folder:

The image shows a directory structure for a data warehouse within a dbt project, including an 'intermediate' subfolder with SQL files for unioned and grouped customer data, and SQL and YAML files for a marketing customer dataset.

This was just a little taste of naming conventions we apply and naming conventions should actually go as far as setting naming conventions for your columns (see this great article on “Column Names as Contracts”). Naming conventions should be discussed during a workshop with your entire dev and business team to come up with what works best for you. The important point is to write documentation about those naming conventions once you have agreed upon those and to enforce those via:

Pre-commit hooks: this gives direct indication to the developer when he/she commits on the feature branch.
Code review: peer review where a senior dev reviews your potential wrong naming conventions.
CI pipeline: this is an automated pipeline that will fail in case your naming conventions do not comply with your company naming conventions.

SQL conventions:

There are great conventions that have already been written and I will refer to the ones of dbt Labs here. I invite the reader to read all those best-in-class conventions and make it a standard within your company. One of the big dangers of SQL is that it is easy enough for anyone to start writing it; this can then lead to a great amount of SQL technical debt afterwards.

In order to automate your SQL conventions as much as possible, SQLFluff will come handy (it supports the dbt SQL syntax). It is open-source and probably the most advanced SQL linter and auto-formatter. You can define your SQL conventions in a config file and run SQLFluff as a pre-commit hook; this will ensure that your developers receive guidance every time they commit a new SQL file.

You can also have those SQL enforcements done during a merge request pipeline using reviewdog. Reviewdog provides a way to post review comments to code hosting services, such as GitHub, automatically by integrating with any linter tools with ease. It uses an output of lint tools and posts them as a comment if findings are in diff of patches to review. For SQLFluff Github actions, many linter rules have developed on this repository.

dbt macros

dbt macros are the equivalent of functions in any other programming language. As such, macros need to abide to functions best practices:

Do One thing
One level of abstraction per function

Macros need to be located in the “macros” folder and can either be located in the “shared” dbt project or the “data domain” dbt project. This decision boils down to if the macro is specific to a data domain or is generic and can be reused across other data domains. For instance a macro that compute the age of a person based on a birth date should probably be located in the “shared” dbt project as it is quite generic and might be used by different data domains.

Macros are a game changer in the sense that they bring a new level of abstraction and coding practices within your SQL. It might be tempting to write macros everywhere once you start having a good taste of macros. However, always keep in mind that readability of your SQL is of major importance and that abstracting too much within macros can make your SQL harder to read (“Any fool can write code that a computer can understand. Good programmers write code that humans can understand.” — Martin Fowler).

Data Quality

Testing is a core practice in software engineering. Code coverage close to 100% is often required and any new code failing the tests won’t be merged into the main branch. Such practices have become the standards for a while in the software engineering world. But it has yet to become the standard in the data world. Corrupted data or data that a stakeholder cannot trust is the main cause of frustration between departments and can lead to stakeholders or downstream applications to look for other ways to get the data they need.

Fortunately many tools have emerged in recent years to detect data quality issues. We will focus on two different data quality paradigms that we recommend to tackle with two different tools.

Unit Testing

The first class of data quality paradigm is unit testing, or in other words data quality issues that emerge from the SQL transformations implemented by your analytics engineers. Unit testing makes sure that the SQL logic you implemented does what it is supposed to do.

We recommend implementing unit tests via dbt tests and using seeds to mock a sample of raw data and a sample of expected transformed data. The diagram below summarises the concept:

The image illustrates a flowchart for DBT (Data Build Tool) transformations in a data pipeline. It shows the process from landing zone mock data through DBT transformations to datamart data, with input and output seeds at each stage. The final step involves comparing the DBT transformed data with expected results for unit testing.

Those unit tests should execute via a CI pipeline that executes at each push on your feature branch and the master branch. The main idea is that all your SQL code runs on your mock data, produces the datamart tables that are then compared to expected results. If the results from the dbt transformations are equal to the expected results from your seed tables, then the tests are considered successful. More technical details about implementing those unit tests can be found in this article. We have also put other links at the end of this article [3]; thos go in details over implementation of unit testing with dbt.

Data integrity testing

The second paradigm of tests are the data integrity tests. Those tests make sure that the data received in the landing zone is not corrupted. This layer is where most of the data quality issues normally reside and as opposed to unit tests where you can just fix your SQL, tests failing at this stage requires you to get back to the source to fix it. It is important to understand that data quality issues in your landing zone tables will have repercussions on all your downstream tables. As the adage says: “Garbage in, garbage out”.

To tackle data integrity issues, we recommend using the open-source library “Great Expectations”. Why not use dbt-expectations or other dbt packages ? This boils down to how big your data project will scale, on the number of stakeholders and the need for them to get access to those data integrity test results. The key factor for us is the “data docs” provided out-of-the-box by Great Expectations. Those are html pages that you can then host on App Engine (protected by IAP) and make accessible to all your stakeholders. Hereafter is a small architecture on we make those data docs accessible to any end users in a secure way:

The image depicts a data testing and presentation workflow. It begins with the Great Expectations test suites at the testing layer, leading to the generation of HTML data docs. These documents are then served through Google's App Engine and made securely accessible to users via an Identity-Aware Proxy, which represents the presentation layer.

“Great Expectations” also gives you much more freedom to develop and iterate over your tests. The main engine is pandas and that usually makes your analytics engineer happy as everyone is acquainted with pandas and its user-friendliness.

Notifying your source systems and downstream applications about data integrity issues is crucial as the ultimate goal for your data team is to become proactive on those data issues instead of being reactive. There is nothing worse than learning about a data quality issue from the business side. “Great Expectations” come handy for notifications on failing tests as you can set up “actions” that will automatically send notifications to all the relevant stakeholders of the failing tests.

“Great Expectations” runs as part of your data domain pipeline (see architecture 1 from first article) and before any dbt transformation stages. You can define “warning” and “failing” tests. “Failing” tests will make your Airflow pipeline fail; you should use “failing” only when data of a test is too corrupted to process it further.

Setting it up and integrating it in your Airlfow pipeline certainly takes a bit more time than using for instance “dbt expectations” but the benefits you will get from using a dedicated tool for those data integrity tests will largely outweigh the time and effort you will spend setting up that tool.

We will probably write a series of dedicated articles on dbt testing and “Great Expectations” at a later stage but for the time being, I recommend reading this great article that deep dive a bit more on the tools detailed above.

Data Governance

As per this whitepaper from Google, Data governance is concerned with the following three areas:

Securing the data
Regulations and compliance
Visibility

Using dbt together with Google Cloud allows you to tackle those three areas in a robust and consistent manner.

Securing the data

You should only have access to the data you require to perform your job and nothing more. The “Least Privilege Principle” must apply at all times and granting access to someone must abide by a strict security process.

Some key security golden rules:

Do not manage access via dbt pre and post hooks. It is not the role of dbt to manage permissions. This could lead to severe security leakages as anyone having access to dbt could potentially grant access to data.
Service account used by dbt should only have the following roles:
BigQuery Data Editor
BigQuery User
Access must never be granted to an individual user but always to Google groups. This makes the process of onboarding and offboarding users much easier. You just need to add or remove them to or from certain groups.

Service accounts creation (for dbt and any other compute products) as well as every IAM binding must happen via Terraform. As your terraform code resides in a git repository, it is versioned and you can audit it at any point of time in history. For all your IAM bindings, you should create a dedicated terraform module and any changes in this module should go through a merge request with at least the approval of the other developers and a security officer (this can be enforced via push rules and merge request rules).

Regarding Google groups that are given access to the data, two kind of groups should be defined:

Internal groups: those groups consist of your data analysts, data engineer, analytics engineers and any other groups that work in the data department. Each one of those groups has different roles and responsibilities and those should reflect with different Google Cloud roles. Different groups should also be created for the different environments. For instance a group “data-engineers-non-prod” where you include all your data engineers and for which you give access only to all the non-prod environments (sandbox and UAT for instance). Then a group “data-engineers-prod” where you include only a few data engineers that are allowed to access the production environment.
External groups: those groups consist of all the stakeholders that consume data from BigQuery. Those groups should be extremely granular and be granted viewer access only to the data they need and nothing more. As per the internal groups, differentiation between nprd and prd should be done.

PII (Personally Identifiable Information) is another sensitive topic that most regulations are chasing. Google Cloud offers two solutions to deal with PII:

BigQuery column security level: BigQuery provides fine-grained access to sensitive columns using policy tags, or type-based classification, of data. Using BigQuery column-level security, you can create policies that check, at query time, whether a user has proper access. For example, a policy can enforce access checks such as:
You must be in group:high-access to see the columns containing TYPE_SSN.

The process to use column security level is as follows:

The image shows a three-step process for managing sensitive data: First, define data classes and organize them into policy tags; second, set policy tags on database columns; and third, manage access control through these policy tags.

Define a taxonomy and data policy tags. Use Data Catalog to create and manage a taxonomy and policy tags for your data. This can be done via Terraform.
Assign policy tags to your BigQuery columns. This can be done via dbt through the yml file of each model (documentation here). It is of major importance to have strict CI rules in place to detect if someone changes those policy tags and try to push those changes.
Manage access on the policy tags. Use Terraform to assign IAM bindings to restrict access to each policy tag. The policy is in effect for each column that belongs to the policy tag.

When a user tries to access column data at query time, BigQuery checks the column policy tag and its policy to see if the user is authorised to access the data.

This is the recommended approach to handle PII as it is non-invasive and allows integrity of the PII data for team members that are allowed to see this data.

Cloud Data Loss Prevention (DLP): can de-identify sensitive data in text content, including text stored in container structures such as tables. De-identification is the process of removing identifying information from data. The API detects sensitive data such as personally identifiable information (PII), and then uses a de-identification transformation to mask, delete, or otherwise obscure the data. DLP is a very powerful tool but is an overkill when you know in which fields PII can be found. It could be used to scan each of the fields containing PII and then use “pseudonymization” to mask the PII data. However, this is a costly process that will have to be run every time new data is ingested. On the contrary, BigQuery column security level is free and ensures that all fields containing PII are only accessible by well defined end users.

Regulations and compliance

There is a growing set of regulations, including the California Consumer Privacy Act (CCPA), the European Union’s General Data Protection Regulation (GDPR), and industry-specific standards such as Global Legal Entity Identifier (LEI) in the Financial industry and ACORD data standards in the insurance industry. Compliance teams responsible for adhering to these regulations and standards may have concerns about oversight and control of data stored in the cloud.

The first step to meet those regulations is to have control over your data. By reading until here and implementing the best practices along this article, you will be in control of your data. Once you are in control of your data and have it secured by allowing only authorised persons to access it, you are already a long way in complying with most regulations. Having your data split between different projects and folders makes it also easier to isolate access and control activities in each of those self contained objects. In the last part of this article, we will cover the “monitoring” pillar that gives you full observability about all the activity going on in your Google Cloud environment. By leveraging Audit logs, you can notice malicious or abnormal activities and take actions.

Visibility

Visibility is knowing where to find the data and being able to navigate seamlessly across the large amount of data that your company possesses. This is achieved by using a data catalogue and many technologies have emerged recently to deal with this topic. We will focus on it in part 3 of this article via the “documentation” section which describes the integrated data catalogue of dbt.