Introduction
As all Data Engineers know, sometimes you need to move fast and get things done quickly. While this “quick and dirty” approach might be necessary to meet deadlines, it often comes at the expense of clean, understandable, and reusable code. However, this doesn’t mean sacrificing code quality forever! A crucial step is code refactoring, which MUST be done before your code goes into production.
Speaking of refactoring, I would like to share with you a feature of dbt that I recently discovered, which greatly helped me clean up and parameterize my dbt structure and organization.
dbt Marketing Data Product
Let’s imagine you’re a data engineer working within the marketing team, building a data product using dbt Core on Google Cloud BigQuery as your data warehouse.
You have a ton of data flowing in from various channels (email, LinkedIn, Facebook, website, etc.), each landing in a separate BigQuery table, as shown below.
BigQuery marketing dataset
Due to technical reasons (data synchronization, loading times, segregation), having a dedicated source table per channel was a must.
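To make this concrete, the landing tables in the marketing dataset might look something like the following (table names are illustrative assumptions, not taken from the actual dataset):

```
marketing.email_campaign_performance
marketing.linkedin_campaign_performance
marketing.facebook_campaign_performance
marketing.website_traffic
marketing.email_form_submissions
...
```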
When building your data product, here are some best practices to follow:
One data product per GitLab repository
A data product means one local dbt package
An organized dbt package structure
The data product dbt package in detail:
Staging: contains the light SQL transformations that clean and deduplicate data from the sources.
Intermediate: handles more complex transformations with business logic (think KPI calculations, aggregations, joins, etc.).
Data mart: exposes the intermediate data to BI tools, AI pipelines, reporting, and other downstream applications.
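Put together, the package layout could look roughly like this (a sketch; directory names are assumptions based on the layers above):

```
marketing/                  -- the data product: one dbt package, one repo
├── dbt_project.yml
└── models/
    ├── staging/
    ├── intermediate/
    └── datamart/
```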
Let’s focus on the staging layer, which could look like this:
marketing — dbt — Data Product
A SQL file for each of your models and for each channel. In more detail:
stg_campaign_performance: This model stores raw campaign performance data (e.g., email blasts, social media campaigns). It might include details like impressions, clicks, conversions, and cost.
stg_demographic_data: This model stores raw demographic data about your target audience (if obtained ethically and compliantly). It could include age, location, income, and interests.
stg_traffic: This model stores raw channel traffic data captured by your analytics platform. It might include details like page views, user demographics, and referring domains.
stg_form_submissions: This model stores raw data submitted through channel forms (e.g., contact forms, lead capture forms). It might include contact information, interests, and inquiry details.
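As a sketch, one of these per-channel staging files might look like the following (source, table, and column names are illustrative assumptions, including the _loaded_at timestamp):

```sql
-- models/staging/stg_email_campaign_performance.sql (hypothetical)
-- Light cleaning and deduplication of the email channel's raw table.
select
    campaign_id,
    impressions,
    clicks,
    conversions,
    cost
from {{ source('marketing', 'email_campaign_performance') }}
-- BigQuery QUALIFY: keep only the most recent load per campaign
qualify row_number() over (partition by campaign_id order by _loaded_at desc) = 1
```

The LinkedIn, Facebook, and website versions would be nearly identical, differing only in the table they read from, which is exactly the duplication the rest of this article removes.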
In addition, you have a single file for all your sources:
Source — marketing.yml
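That single file presumably enumerates every channel's tables in one place, along the lines of this assumed excerpt:

```yaml
# models/staging/marketing.yml (hypothetical excerpt)
version: 2

sources:
  - name: marketing
    schema: marketing
    tables:
      - name: email_campaign_performance
      - name: linkedin_campaign_performance
      - name: facebook_campaign_performance
      - name: website_traffic
      # ...one entry per channel and per entity
```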
This is not very clean or well organized. Refactor time!
dbt Refactoring — First Attempt
A better organization could include a sub-directory for each channel, allowing us to run the code for a single channel (e.g., email) using dbt selectors or tags, along with a separate source definition per channel.
dbt structure refactoring
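Under this first refactoring, the staging layer might be organized as follows (a sketch; file names are assumptions):

```
models/staging/
├── email/
│   ├── email.yml                     -- email-only source definitions
│   ├── stg_campaign_performance.sql
│   ├── stg_demographic_data.sql
│   ├── stg_traffic.sql
│   └── stg_form_submissions.sql
├── linkedin/
│   └── ...                           -- same four models, LinkedIn sources
├── facebook/
│   └── ...
└── website/
    └── ...
```

With a channel tag applied per sub-directory (e.g., in dbt_project.yml), a single channel can then be run in isolation with something like dbt run --select tag:email.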
Cool, this is definitely better, at least in terms of code structure.
However, I am sure you can see all the “noise” created by all these SQL files.
There is still a lot of duplicated, repetitive code doing the same thing. The logic is agnostic to the channel, which is exactly the part that needs to be parameterized so we can have a single SQL file for staging.
This is where dbt identifiers come in to save the day!
dbt Refactoring — Final Attempt
By exploiting aliases in the dbt model configuration, dbt project variables, and the identifier property of dbt sources, we can parameterize the channel so that a single SQL file stages everything.
dbt_project.yml
The channel variable is defined in the dbt project variables, with “email” as its default value.
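In dbt_project.yml this is just a vars entry; a minimal sketch:

```yaml
# dbt_project.yml (excerpt)
vars:
  channel: email  # default channel, can be overridden at run time via --vars
```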
The definitions of the different source tables then look like this:
source definition with dbt identifiers
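A plausible version of that source definition, assuming the identifier is built with Jinja from the channel variable (source and table names are illustrative):

```yaml
# models/staging/marketing.yml (hypothetical excerpt)
version: 2

sources:
  - name: marketing
    schema: marketing
    tables:
      # `name` is the stable label used in source() calls;
      # `identifier` resolves to the real, channel-specific BigQuery table.
      - name: campaign_performance
        identifier: "{{ var('channel') }}_campaign_performance"
      - name: traffic
        identifier: "{{ var('channel') }}_traffic"
```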
These sources are then used in the models together with an alias (with the channel variable used to build what is named alias_custom).
Essentially, the identifier represents the actual BigQuery table name, parameterized by the channel variable, while the name is a label that stays consistent across all sources.
Now each model leverages the alias to define the model name, and the identifier (driven by the channel variable) to determine the source table.
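Concretely, the final staging model could look like this sketch (alias_custom is the article's name for the channel-based alias; the column list is an illustrative assumption):

```sql
-- models/staging/stg_campaign_performance.sql (hypothetical)
-- One file now stages every channel: the alias names the materialized table
-- per channel, and the source identifier (driven by the same variable)
-- resolves to the matching raw table.
{% set alias_custom = 'stg_' ~ var('channel') ~ '_campaign_performance' %}
{{ config(alias=alias_custom) }}

select
    campaign_id,
    impressions,
    clicks,
    conversions,
    cost
from {{ source('marketing', 'campaign_performance') }}
```

Switching channels then becomes a run-time concern: dbt run --select stg_campaign_performance --vars '{channel: linkedin}' builds stg_linkedin_campaign_performance from the LinkedIn raw table.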
Conclusion
In conclusion, this approach might seem a bit complex at first, but trust me, it’s a lifesaver in the long run! It promotes cleaner, more maintainable code, and makes future modifications a breeze. If you are looking for support on Modern Data Stack or Google Cloud solutions, feel free to reach out to us at sales@astrafy.io.