Introduction

Every four months, dbt releases a new version of dbt Core and the dbt adapters. In the first days of October, dbt 1.9.0 was released and came with a particularly exciting feature. In this article, we will explore in depth the core feature of this release, “microbatch incremental”, and also cover the other enhancements that come with it.

Disclaimer: this release is still in beta and is expected to be GA by the first half of November 2024. The concepts and configurations described in this article are likely to remain the same.

Microbatch incremental

dbt 1.9 introduces a new strategy for incremental models: microbatch (full documentation here). Previously, an incremental model processed all of its new data in a single query, with filtering logic you had to write yourself, typically on a timestamp column. This could lead to performance bottlenecks when dealing with large volumes of data.

With microbatch, you define an event_time column and a batch size, and dbt splits the work into smaller, more manageable chunks of data that are processed independently. This translates to faster incremental model refreshes, reduced strain on your data warehouse, and, ultimately, more efficient data pipelines.

This approach offers several benefits:

  • Efficiency: By processing smaller chunks of data, you can optimize resource usage and reduce processing time.

  • Resiliency: If a batch fails, you can retry it without reprocessing other batches.

  • Concurrency: In future updates, batches can be processed concurrently, further speeding up data transformations.

  • Idempotency: Since each batch is independent, reprocessing a batch should yield the same result.

The biggest change compared to the previous incremental configuration is that you no longer need an is_incremental() block in your model. When you run a microbatch model, dbt will evaluate which batches need to be loaded, break them up into one SQL query per batch, and load each one independently.

How does it work if you don’t have to specify any incremental logic? The configuration for microbatch consists of four parameters, and the trick lies in the event_time parameter.

  • event_time: The column indicating “at what time did the row occur.” Required for your microbatch model and for any direct parents that should be filtered. This last point is very important: the direct parents of your microbatch model also need to have this parameter configured.

  • begin: The “beginning of time” for the microbatch model. This parameter is used in full-refresh builds to indicate the oldest data that should be loaded.

  • batch_size: The granularity of your batches. Supported values are hour, day, month, and year.

  • lookback: The number of batches before the latest bookmark to process again, in order to capture late-arriving records.

Let’s take a model “sessions.sql” with the following configuration and query:

-- sessions.sql

{{ config(
    materialized='incremental',
    incremental_strategy='microbatch',
    event_time='session_start',
    begin='2020-01-01',
    batch_size='day',
    lookback=1
) }}


select 
      page_view_start as session_start 
from {{ ref('page_views') }}

Depending on the days the query runs, this translates into the following data being processed:

[Diagram: each page_view_start date in the page_views table is split into independent batches according to the batch_size and lookback settings, allowing efficient and flexible batch processing.]

Each of the batches (in this case, each of the days processed) launches a separate, independent query in your query engine, in a “divide and conquer” mode.
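
To make this more concrete, here is a simplified sketch (not the exact SQL dbt generates) of what the query for a single daily batch of the sessions model could look like. It assumes the batch covers 2024-10-02, that page_views lives in a hypothetical analytics schema, and that page_view_start is configured as the parent’s event_time column:

-- Illustrative sketch only: the SQL dbt actually generates will differ.
-- One independent query per batch; this one covers the day 2024-10-02.
select
    page_view_start as session_start
from (
    -- the direct parent is filtered on its own event_time column
    select *
    from analytics.page_views  -- hypothetical relation name
    where page_view_start >= '2024-10-02'
      and page_view_start < '2024-10-03'
) as page_views

dbt then writes the result of each such query into the target table for that time period only, independently of the other batches.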

These microbatches make it easier than ever to reprocess only specific batches of data from the past, and therefore to backfill historical data. Reprocessing data for the first five days of October is as simple as running the following dbt command:

dbt run --event-time-start "2024-10-01" --event-time-end "2024-10-05"

Retrying failed batches is also made easier. Imagine you run a full refresh that goes two years back in time with a granularity of one day: this will run more than 700 independent batches, and if some of them fail, you can easily retry only those by running the command “dbt retry”. The following illustration depicts a case with three failed batches:
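
For example, once the failed run has finished, retrying only the failed batches boils down to:

dbt retry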

[Diagram: time-based batches labelled from t (the current batch) backwards (t-1, t-2, …); three failed batches are marked in red, illustrating how dbt retry reprocesses only the failed batches without re-running the successful ones.]

As with every new feature, there is still room for improvement, and I highly recommend reading Christophe Oudar’s article on this topic. His recommendations are quite valuable and should be taken into consideration by the dbt Labs team for subsequent releases.

Snapshot Improvements

dbt 1.9 introduces some exciting updates to snapshots, making them more flexible and user-friendly. Here’s a breakdown of the key changes:

Snapshots in YAML

You can now define snapshots using YAML files instead of SQL. This is consistent with how sources are defined in dbt and offers a cleaner approach.

YAML (snapshots/orders_snapshot.yml):

snapshots:
- name: orders_snapshot
  relation: source('jaffle_shop', 'orders')
  config:
    schema: snapshots
    database: analytics
    unique_key: id
    strategy: timestamp
    updated_at: updated_at

Need to apply transformations? No problem! Define an ephemeral model and reference it in your snapshot:

SQL (models/ephemeral_orders.sql):

{{ config(materialized='ephemeral') }}

select * 
from {{ source('jaffle_shop', 'orders') }}
where some_condition

YAML (snapshots/orders_snapshot.yml):

snapshots:
- name: orders_snapshot
  relation: ref('ephemeral_orders')
  config:
    # ... your snapshot configuration

Optional target_schema

Previously, all snapshots were written to the same schema, regardless of the environment. Now, you have the flexibility to define different schemas for different environments (e.g., development, production) or use a single schema for all. This simplifies development workflows and gives you more control.
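
As a minimal illustration, reusing the orders_snapshot example from above and assuming the same schema-resolution behavior dbt applies to models, leaving the schema out of the snapshot config lets each environment write the snapshot to its own schema:

YAML (snapshots/orders_snapshot.yml):

snapshots:
- name: orders_snapshot
  relation: source('jaffle_shop', 'orders')
  config:
    # no schema / target_schema set: the snapshot lands in the schema
    # resolved for the current environment, like a regular model
    unique_key: id
    strategy: timestamp
    updated_at: updated_at

If you prefer a single schema across all environments, you can still configure one explicitly, as in the first YAML example above.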

Customizable Meta Column Names

dbt automatically adds metadata columns to your snapshots (e.g., dbt_valid_from, dbt_valid_to). With dbt 1.9, you can customize these column names directly in your YAML configuration:

YAML (snapshots/orders_snapshot.yml):

snapshots:
- name: orders_snapshot
  # ... your snapshot configuration
  config:
    # ...
    snapshot_meta_column_names:
      dbt_valid_from: start_date
      dbt_valid_to: end_date

These updates make dbt snapshots more powerful and easier to manage. For more details, check out the official dbt documentation: https://docs.getdbt.com/docs/build/snapshots

Conclusion

The article explored the two main features/improvements that make dbt 1.9 a worthwhile upgrade for any data professional. But this release also comes with a number of other noticeable enhancements.

As this version does not introduce any breaking changes, we highly recommend upgrading your dbt projects to this version once the release is GA. By keeping up with the latest developments and leveraging the latest features, you make it easier to maintain best practices, improve the efficiency and reliability of your data pipelines, and future-proof your dbt projects against potential compatibility issues.

Thank you

If you enjoyed reading this article, stay tuned, as we regularly publish articles on dbt and analytics engineering in general. Follow Astrafy on LinkedIn, Medium, and YouTube to be notified of the next article.

If you are looking for support on Modern Data Stack or Google Cloud solutions, feel free to reach out to us at sales@astrafy.io.