Data Mesh is becoming the de facto paradigm for dealing with big data that spans many business domains. In dbt, this translates to keeping each data product in a dedicated dbt project, so that every project stays small, rather than putting all your data products in one huge dbt project that takes a long time to compile and overwhelms the engineers working on it.
A critical requirement in a data mesh is interoperability between your data products, and dbt gives you two options:
packages.yml: Import other data products as packages. This option has always existed but has an important drawback: it pulls every model, macro and other boilerplate file of the imported project into the dbt project that imports it. And if you are not using selectors, all models from the imported package will run during your dbt runs.
dependencies.yml: This option, only available in dbt Cloud, works as in the API world: each dbt project (i.e. each data product) defines the public models it exposes, and consumers of that project only get access to those public models. Files are no longer fetched into the consuming dbt project; only references to public models are imported. This is the better design and the one to prefer.
The second option is the better one, but if you are on dbt-core you cannot leverage it natively. A plugin has recently been developed specifically to bridge this gap: dbt-loom, developed by Nicholas Yager, with a contribution from Astrafy for the Cloud Storage adapter. By giving dbt-core the ability to integrate multiple projects, dbt-loom is not just an addition to the dbt ecosystem; it is a significant step forward for the Data Mesh paradigm. In this article, we will explore how dbt-loom unlocks new possibilities for dbt-core users, enabling more effective and flexible data project management and collaboration in line with Data Mesh principles.
dbt and Data Mesh
With the growing number of models in monolithic dbt projects and the rising popularity of the Data Mesh paradigm, dbt has recently introduced several features to support Data Mesh adoption. Those features are:
Data Contracts: Ensuring that data coming from upstream models abides by well-defined schemas.
Model Access Levels: When you start sharing models between dbt projects, you need to define which models should be accessible and which should not. Access levels allow you to define private and public models to serve that purpose.
Model Versioning: Breaking changes are common practice in analytics engineering and can cause a lot of frustration. Model versioning comes to the rescue to allow upstream models to have multiple versions up and running at the same time. As a downstream consumer your models won’t break anymore and you can make your downstream models evolve to use the new upstream versions at your own pace.
dbt-meshify Utility: The three previous features all require a bit of YAML boilerplate, and the dbt-meshify utility takes care of it by providing a wizard to quickly set up data contracts, model access levels and versioning for your dbt project.
dependencies.yml: This feature, available only in dbt Cloud, manages dependencies between projects in a more flexible and efficient way than the traditional package import.
Only the last feature is unavailable to dbt-core users, and this gap led to the development of dbt-loom to bring the power of this feature to dbt-core. To illustrate this feature, let’s work with a simple project dependency between two data products:
FinOps data product that has a source layer, an intermediate layer and a data mart layer.
Monitoring data product that has the same data layer structure but that is using a model from the FinOps data product within its intermediate layer (with a JOIN transformation).
Data mart models from the FinOps data product are public and can be accessed by any other dbt project. It’s worth mentioning that a dbt project importing this data product must also have access to its git repository, and that the credentials used by the importing project must have viewer access on the imported models (in the case of BigQuery, data viewer on the table or view). In summary, three factors to keep in mind for access: the models must be public, the importing project must have access to the git repository, and its credentials must have viewer access on the imported models.
Let’s now dive into dbt-loom, which reproduces the behaviour of the dependencies.yml file by importing only references to public models into another dbt project.
Understanding the Workflow of dbt-loom
dbt-loom functions by retrieving public model definitions from dbt artifacts (manifest.json) and integrating them into your dbt project. The plugin supports various sources for obtaining model definitions, such as local manifest files, dbt Cloud, Google Cloud Storage (GCS), and S3-compatible object storage services. For dbt-loom to work, the manifest.json file of each of your data products must be stored in one of these sources. To industrialise the upload of your manifest.json files, you would typically add a stage to your CI pipeline that runs when a merge request is merged into your main branch and uploads the latest version of the manifest.json file to the storage of your choice.
How dbt-loom works
dbt-loom leverages the dbtPlugin class from dbt-core, which defines functions callable by dbt-core’s PluginManager. During various lifecycle stages, such as graph linking and manifest writing, PluginManager invokes these functions, allowing dbt-loom to parse manifests, identify public models, and inject them when needed.
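To make the "identify public models" step more concrete, here is a minimal sketch in plain Python. It is not dbt-loom's actual code: it only assumes that each entry under the manifest's nodes key carries resource_type, access and relation_name fields, as recent dbt manifests do. The model names in the sample are hypothetical.

```python
import json
from pathlib import Path


def load_manifest(path: str) -> dict:
    """Read a manifest.json file from disk."""
    return json.loads(Path(path).read_text())


def public_models(manifest: dict) -> dict:
    """Return {unique_id: relation_name} for every model whose access is 'public'."""
    found = {}
    for unique_id, node in manifest.get("nodes", {}).items():
        is_model = node.get("resource_type") == "model"
        is_public = node.get("access") == "public"
        if is_model and is_public:
            found[unique_id] = node.get("relation_name")
    return found


# Tiny hand-written stand-in for the FinOps manifest (hypothetical names).
sample_manifest = {
    "nodes": {
        "model.finops.mart_costs": {
            "resource_type": "model",
            "access": "public",
            "relation_name": "finops.marts.mart_costs",
        },
        "model.finops.int_costs": {
            "resource_type": "model",
            "access": "protected",
            "relation_name": "finops.intermediate.int_costs",
        },
        "test.finops.not_null_mart_costs_id": {"resource_type": "test"},
    }
}

print(public_models(sample_manifest))
```

Only the data mart model is returned; the intermediate model and the test are ignored, which is the behaviour a consuming project relies on.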
Setting Up dbt-loom: A Step-by-Step Guide
To use dbt-loom, you first need to install the Python package, for example with pip install dbt-loom.
Then, create a dbt-loom configuration file specifying the paths for the upstream project’s manifest files. This setup allows you to fetch and integrate models from various sources, including dbt Cloud, S3, and GCS.
Example of config file (for local manifest file):
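As a minimal sketch, such a file can be generated with a few lines of Python. The keys used here (manifests, name, type, config, path) reflect our understanding of the dbt-loom configuration layout and should be checked against the dbt-loom README; the project name finops and the manifest path are hypothetical.

```python
from pathlib import Path


def render_file_manifest(name: str, manifest_path: str) -> str:
    """Render a config entry that points dbt-loom at a manifest on local disk."""
    return (
        "manifests:\n"
        f"  - name: {name}\n"
        "    type: file\n"
        "    config:\n"
        f"      path: {manifest_path}\n"
    )


def write_loom_config(project_dir: str, name: str, manifest_path: str) -> Path:
    """Write dbt_loom.config.yml into the given dbt project directory."""
    target = Path(project_dir) / "dbt_loom.config.yml"
    target.write_text(render_file_manifest(name, manifest_path))
    return target


# Preview the file content for the FinOps data product (hypothetical path).
print(render_file_manifest("finops", "../finops/target/manifest.json"))
```

Calling write_loom_config(".", "finops", "../finops/target/manifest.json") from the root of the Monitoring project would place the file next to dbt_project.yml.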
End-to-End workflow on Google Cloud
A key innovation of dbt-loom, to which we at Astrafy have contributed, is its capability to retrieve and effectively utilize manifest files from Google Cloud Storage (GCS), which was added in the latest version 0.3.0. This section provides an in-depth exploration of the comprehensive workflow when employing dbt-loom in tandem with GCS.
Setting Up the Workflow
1. Manifest File Generation:
The manifest file, an artifact of dbt, contains metadata about your dbt models. It’s crucial that this file is updated and available in your GCS bucket. This is where CI tools play a crucial role.
Setting Up the CI Pipeline:
We will take GitLab CI as an example. In your .gitlab-ci.yml file, define a job that performs two main tasks:
Compile the dbt project to generate the latest manifest file.
Upload the manifest file to the designated GCS bucket.
Here’s an example snippet from GitLab CI:
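As a rough sketch, the two tasks can be wrapped in a small Python script that the CI job calls. It assumes the dbt and gsutil command-line tools are installed and authenticated on the runner; the bucket name and project name are hypothetical.

```python
import subprocess


def manifest_upload_commands(bucket: str, project: str, target_dir: str = "target") -> list:
    """Build the two commands of the CI job: compile, then upload manifest.json."""
    source = f"{target_dir}/manifest.json"
    destination = f"gs://{bucket}/{project}/manifest.json"
    return [
        ["dbt", "compile"],  # writes target/manifest.json
        ["gsutil", "cp", source, destination],  # copies it to the bucket
    ]


def run_ci_job(bucket: str, project: str) -> None:
    """Run each command in order and stop at the first failure."""
    for command in manifest_upload_commands(bucket, project):
        subprocess.run(command, check=True)


# Show what the job would execute, without running anything.
for command in manifest_upload_commands("my-manifest-bucket", "finops"):
    print(" ".join(command))
```

Keeping the command list separate from the runner makes it easy to review what the pipeline will do before it touches the bucket.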
Versioning Manifest Files:
Optionally, you can incorporate a versioning system for your manifest files within the GCS bucket. This can be achieved by appending a timestamp or a commit hash to the file name when uploading it during the CI stage. Such a practice ensures the availability of historical manifest files for auditing or rollback purposes.
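One possible naming scheme, sketched in Python under the assumption that the job runs in GitLab CI, where the short commit hash is exposed as CI_COMMIT_SHORT_SHA. The function name and the object name pattern are illustrative choices, not something dbt-loom prescribes.

```python
import os
from datetime import datetime, timezone
from typing import Optional


def versioned_object_name(project: str, commit_sha: str = "", now: Optional[datetime] = None) -> str:
    """Return an object name such as finops/manifest_<suffix>.json.

    The suffix is the commit hash when one is given, otherwise a UTC timestamp.
    """
    if commit_sha:
        suffix = commit_sha
    else:
        moment = now or datetime.now(timezone.utc)
        suffix = moment.strftime("%Y%m%dT%H%M%SZ")
    return f"{project}/manifest_{suffix}.json"


# Falls back to a timestamp when the variable is not set (e.g. on a laptop).
print(versioned_object_name("finops", os.environ.get("CI_COMMIT_SHORT_SHA", "")))
```

Since downstream projects reference a fixed object name in their dbt-loom configuration, a reasonable design is to upload the versioned copy in addition to the unversioned manifest.json rather than instead of it.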
2. Project Configuration in dbt-loom:
Ensure that your dbt-loom configuration file (dbt_loom.config.yml) includes the details needed to connect to GCS: the project ID, the bucket name, the object name, and optionally your Google Cloud credentials (by default, dbt-loom uses the Google Application Default Credentials). You can also reference your own environment variables in the config file, which allows configuration values to change with the environment. For instance, you might have different GCS buckets for development, staging, and production environments, and instead of hardcoding bucket names, you can use environment variables.
To specify an environment variable in the dbt-loom config file, use one of the following formats: ${ENV_VAR} or $ENV_VAR
Example of config file (for GCS):
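The sketch below holds such a config as a Python string and previews it with environment variables substituted. The keys (project_id, bucket_name, object_name) mirror the details listed above and should be checked against the dbt-loom README; the variable names GCP_PROJECT_ID and MANIFEST_BUCKET are hypothetical. Note that os.path.expandvars only imitates the substitution for preview purposes; dbt-loom performs its own when it loads the file.

```python
import os

# Assumed layout of a dbt-loom config entry for a manifest stored in GCS.
# Both environment variable formats are used on purpose.
GCS_CONFIG_TEMPLATE = """\
manifests:
  - name: finops
    type: gcs
    config:
      project_id: ${GCP_PROJECT_ID}
      bucket_name: $MANIFEST_BUCKET
      object_name: finops/manifest.json
"""


def preview_config(template: str = GCS_CONFIG_TEMPLATE) -> str:
    """Return the config as it would look once environment variables are substituted."""
    return os.path.expandvars(template)


# Placeholder values for a local preview; real pipelines would export these.
os.environ.setdefault("GCP_PROJECT_ID", "my-gcp-project")
os.environ.setdefault("MANIFEST_BUCKET", "my-manifest-bucket")
print(preview_config())
```

The file committed to the repository keeps the ${...} placeholders, so the same config can serve development, staging and production by changing only the exported variables.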
By leveraging dbt-loom’s integration with GCS and automating the manifest file generation and upload process through a CI tool, you establish a robust and efficient workflow. This setup not only streamlines your data engineering processes but also helps your dbt projects remain agile, scalable, and maintainable. With this workflow, your dbt projects work together in a data mesh approach in the same way as with the dependencies.yml file in dbt Cloud. It’s worth noting that this step-by-step workflow, showcased here with GCS, differs very little from adapter to adapter, so you can easily switch between GCS, S3 and the other storage adapters available in the dbt-loom package.
Conclusion
dbt-loom is an indispensable tool for dbt-core users wanting to implement data mesh at scale. It greatly simplifies interoperability between multiple dbt projects, in the same way as dbt Cloud does with dependencies.yml. Beyond technical benefits, it promotes wider and easier access to data across the organization, marking a significant improvement in dbt project interoperability.
dbt-loom is still at an early stage of development, but it already does the job with the main storage providers and you can start using it in your various dbt projects. The plugin uses the same core functions as the dbt Cloud dependencies.yml feature, which is reassuring for the robustness and stability of the code.
Keep an eye on the repository as improvements and updates will come at a consistent pace.
Thank you
Thank you for reading this article! If you are looking for some support on your dbt implementation, feel free to reach out to us at sales@astrafy.io.