Introduction
As all Data Engineers know, sometimes you need to move fast and get things done quickly. While this “quick and dirty” approach might be necessary to meet deadlines, it often comes at the expense of clean, understandable, and reusable code. However, this doesn’t mean sacrificing code quality forever! A crucial step is code refactoring, which MUST be done before your code goes into production.
Speaking of refactoring, I would like to share with you a feature of dbt that I recently discovered, which greatly helped me clean up and parameterize my dbt structure and organization.
dbt Marketing Data Product
Let’s imagine you’re a data engineer working within the marketing team, building a data product using dbt Core on Google Cloud BigQuery as your data warehouse.
You have a ton of data flowing in from various channels (email, LinkedIn, Facebook, website, etc.), each landing in a separate BigQuery table, as shown below.
BigQuery marketing dataset
Due to technical reasons (data synchronization, loading times, segregation), having a dedicated source table per channel was a must.
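To make this concrete, the landing tables in the marketing dataset might look something like the following (table names are illustrative assumptions, not taken from the actual dataset):

```
marketing.email_campaign_performance
marketing.linkedin_campaign_performance
marketing.facebook_campaign_performance
marketing.website_traffic
marketing.email_form_submissions
...
```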
When building your data product, here are some best practices to follow:
One data product per GitLab repository
A data product means one local dbt package
An organized dbt package structure
The data product dbt package in detail:
Staging: contains the light SQL transformations that clean and deduplicate data from the sources.
Intermediate: handles more complex transformations with business logic (think KPI calculations, aggregations, joins, etc.).
Data mart: exposes the intermediate data to BI tools, AI pipelines, reporting, and other downstream applications.
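Put together, the package layout could look roughly like this (a sketch; directory names are assumptions based on the layers above):

```
marketing/                  -- the data product: one dbt package, one repo
├── dbt_project.yml
└── models/
    ├── staging/
    ├── intermediate/
    └── datamart/
```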
Let’s focus on the staging layer, which could look like this:
marketing — dbt — Data Product
A SQL file for each of your models and for each channel. In more detail:
stg_campaign_performance: This model stores raw campaign performance data (e.g., email blasts, social media campaigns). It might include details like impressions, clicks, conversions, and cost.
stg_demographic_data: This model stores raw demographic data about your target audience (if obtained ethically and compliantly). It could include age, location, income, and interests.
stg_traffic: This model stores raw channel traffic data captured by your analytics platform. It might include details like page views, user demographics, and referring domains.
stg_form_submissions: This model stores raw data submitted through channel forms (e.g., contact forms, lead capture forms). It might include contact information, interests, and inquiry details.
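As a sketch, one of these per-channel staging files might look like the following (source, table, and column names are illustrative assumptions, including the _loaded_at timestamp):

```sql
-- models/staging/stg_email_campaign_performance.sql (hypothetical)
-- Light cleaning and deduplication of the email channel's raw table.
select
    campaign_id,
    impressions,
    clicks,
    conversions,
    cost
from {{ source('marketing', 'email_campaign_performance') }}
-- BigQuery QUALIFY: keep only the most recent load per campaign
qualify row_number() over (partition by campaign_id order by _loaded_at desc) = 1
```

The LinkedIn, Facebook, and website versions would be nearly identical, differing only in the table they read from, which is exactly the duplication the rest of this article removes.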
In addition, you have a single file for all your sources:
Source — marketing.yml
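That single file presumably enumerates every channel's tables in one place, along the lines of this assumed excerpt:

```yaml
# models/staging/marketing.yml (hypothetical excerpt)
version: 2

sources:
  - name: marketing
    schema: marketing
    tables:
      - name: email_campaign_performance
      - name: linkedin_campaign_performance
      - name: facebook_campaign_performance
      - name: website_traffic
      # ...one entry per channel and per entity
```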
This is not very clean or well organized. Refactor time!
dbt Refactoring — First Attempt
A better organization could include a sub-directory for each channel, allowing us to run the code for a single channel (e.g., email) using dbt selectors or tags, along with a separate source definition per channel.
dbt structure refactoring
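Under this first refactoring, the staging layer might be organized as follows (a sketch; file names are assumptions):

```
models/staging/
├── email/
│   ├── email.yml                     -- email-only source definitions
│   ├── stg_campaign_performance.sql
│   ├── stg_demographic_data.sql
│   ├── stg_traffic.sql
│   └── stg_form_submissions.sql
├── linkedin/
│   └── ...                           -- same four models, LinkedIn sources
├── facebook/
│   └── ...
└── website/
    └── ...
```

With a channel tag applied per sub-directory (e.g., in dbt_project.yml), a single channel can then be run in isolation with something like dbt run --select tag:email.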
Cool, this is definitely better, at least in terms of code structure.
However, I am sure you can see all the “noise” created by all these SQL files.
There is still a lot of duplicated, repetitive code doing the same thing. The logic is agnostic to the channel, which is exactly the part that needs to be parameterized so we can have a single SQL file for staging.
This is where dbt identifiers come in to save the day!
dbt Refactoring — Final Attempt
By exploiting aliases in the dbt model configuration, dbt project variables, and the identifier property of dbt sources, we can parameterize the channel so that a single SQL file stages everything.
dbt_project.yml
The channel variable is defined in the dbt project variables, with “email” as its default value.
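In dbt_project.yml this is just a vars entry; a minimal sketch:

```yaml
# dbt_project.yml (excerpt)
vars:
  channel: email  # default channel, can be overridden at run time via --vars
```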
The definitions of the different source tables then look like this:
source definition with dbt identifiers
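A plausible version of that source definition, assuming the identifier is built with Jinja from the channel variable (source and table names are illustrative):

```yaml
# models/staging/marketing.yml (hypothetical excerpt)
version: 2

sources:
  - name: marketing
    schema: marketing
    tables:
      # `name` is the stable label used in source() calls;
      # `identifier` resolves to the real, channel-specific BigQuery table.
      - name: campaign_performance
        identifier: "{{ var('channel') }}_campaign_performance"
      - name: traffic
        identifier: "{{ var('channel') }}_traffic"
```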
These sources are then used in the models together with an alias (with the channel variable used to build what is named alias_custom).
Essentially, the identifier represents the actual BigQuery table name, parameterized by the channel variable, while the name is a label that stays consistent across all sources.
Now each model leverages the alias to define the model name, and the identifier (driven by the channel variable) to determine the source table.
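Concretely, the final staging model could look like this sketch (alias_custom is the article's name for the channel-based alias; the column list is an illustrative assumption):

```sql
-- models/staging/stg_campaign_performance.sql (hypothetical)
-- One file now stages every channel: the alias names the materialized table
-- per channel, and the source identifier (driven by the same variable)
-- resolves to the matching raw table.
{% set alias_custom = 'stg_' ~ var('channel') ~ '_campaign_performance' %}
{{ config(alias=alias_custom) }}

select
    campaign_id,
    impressions,
    clicks,
    conversions,
    cost
from {{ source('marketing', 'campaign_performance') }}
```

Switching channels then becomes a run-time concern: dbt run --select stg_campaign_performance --vars '{channel: linkedin}' builds stg_linkedin_campaign_performance from the LinkedIn raw table.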
Conclusion
In conclusion, this approach might seem a bit complex at first, but trust me, it’s a lifesaver in the long run! It promotes cleaner, more maintainable code, and makes future modifications a breeze. If you are looking for support on Modern Data Stack or Google Cloud solutions, feel free to reach out to us at sales@astrafy.io.