Note: You can check the code for this POC in the dbt-iceberg-poc repository.

Motivation

Data lakes are becoming mainstream. More and more teams are moving away from traditional data warehouses and exploring data lakes as their primary analytics platform.

At Astrafy, we mainly use dbt for our ETL pipelines. It has proven to be a spectacular way of simplifying pipelines and making them more maintainable. So we decided to explore how we could use dbt to build a Lakehouse on GCP.

The ideal outcome would be a fully functional, open-format pipeline in which storage is explicitly decoupled from compute. That way, we could process the workloads with any compute engine instead of relying on BigQuery's.

Potential benefits:

  • Cost savings: data warehouse compute is typically more expensive than bringing your own compute engine.

  • Flexibility: We could migrate our storage and/or compute workloads to other platforms.

Let’s dive into the details of how we built the pipeline.

Prerequisites & Infrastructure Setup

Create Storage

We separate storage (a GCS bucket) from compute (BigQuery).

Create the GCS Bucket: This will hold our Iceberg data and metadata.

# Replace with your project details
export PROJECT_ID="your-project-id"
export REGION="EU"
export BUCKET_NAME="${PROJECT_ID}-iceberg"
gsutil mb -l $REGION gs://$BUCKET_NAME
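
Depending on your setup, BigQuery typically also needs a Cloud resource connection whose service account can read and write the bucket. A minimal sketch of that step, assuming a connection called iceberg-connection (names and roles are our assumptions; adjust to your project):

# Create a Cloud resource connection for BigQuery
bq mk --connection --location=$REGION --project_id=$PROJECT_ID \
    --connection_type=CLOUD_RESOURCE iceberg-connection

# Look up the connection's service account...
bq show --connection --format=json "${PROJECT_ID}.${REGION}.iceberg-connection"

# ...and grant it access to the bucket
gcloud storage buckets add-iam-policy-binding gs://$BUCKET_NAME \
    --member="serviceAccount:<connection-service-account>" \
    --role="roles/storage.objectAdmin"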

Configuring dbt for Iceberg

The magic happens in how we configure dbt to talk to this external storage. Instead of hardcoding paths in every model, we use the catalogs.yml feature available in recent versions of dbt and `dbt-bigquery` (this POC runs dbt 1.11 with dbt-bigquery 1.11).

catalogs.yml: Create this file in your dbt project root. It tells dbt how to write Iceberg tables and where to put them.

# catalogs.yml
catalogs:
  - name: iceberg_catalog
    active_write_integration: biglake_conf
    write_integrations:
      - name: biglake_conf
        external_volume: 'gs://your-bucket'
        table_format: iceberg
        file_format: parquet
        catalog_type: biglake_metastore
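
Individual models opt into the catalog by setting `catalog_name` in their config block, as the Bronze model below does. Depending on your dbt version, the same config can usually also be applied to a whole folder from dbt_project.yml; a small sketch, assuming the project is named my_lakehouse:

# dbt_project.yml (sketch)
models:
  my_lakehouse:
    bronze:
      +materialized: table
      +catalog_name: iceberg_catalog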

Building the Pipeline

We implemented a Medallion Architecture (Bronze -> Silver -> Gold).

Bronze: Raw Ingestion

In the Bronze layer, we land data from our source (CSV files in GCS) into native Iceberg tables.

-- models/bronze/bronze_transactions.sql
{{ config(
    materialized='table',
    catalog_name='iceberg_catalog'
) }}

select
    id,
    amount,
    cast(transaction_date as date) as transaction_date,
    current_timestamp() as ingestion_ts
from {{ source('bronze', 'ext_transactions') }}

*Note: `ext_transactions` is a standard BigQuery external table pointing to the raw CSVs.*
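
For completeness, this is roughly what that external table could look like; the dataset name, columns, and raw-data path below are illustrative assumptions, and the table is simply declared as a dbt source (`bronze.ext_transactions`) in a sources.yml file.

-- Illustrative DDL for the raw external table (names and paths are assumptions)
create external table `your-project-id.bronze.ext_transactions` (
    id int64,
    amount numeric,
    transaction_date string
)
options (
    format = 'CSV',
    uris = ['gs://your-raw-bucket/transactions/*.csv'],
    skip_leading_rows = 1
);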

Silver: Incremental Merge

This is where Iceberg shines. We use the `incremental` materialization with the `merge` strategy to deduplicate data. Unlike standard external tables, Iceberg supports ACID transactions, allowing us to update records efficiently.

-- models/silver/silver_transactions.sql
{{ config(
    materialized='incremental',
    unique_key='id',
    incremental_strategy='merge',
    catalog_name='iceberg_catalog'
) }}

with source_data as (
    select * from {{ ref('bronze_transactions') }}
),

deduplicated as (
    select *
    from (
        select 
            *,
            row_number() over (partition by id order by transaction_date desc) as rn
        from source_data
    )
    where rn = 1
)

select
    id,
    amount,
    transaction_date,
    ingestion_ts
from deduplicated
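
Under the hood, dbt compiles this into a MERGE statement against the Iceberg table on incremental runs. The actual generated SQL has more boilerplate (dbt stages the model results in a temporary table first), but it is roughly equivalent to this sketch:

-- Simplified sketch of the MERGE dbt generates on incremental runs
merge into `your-project-id.silver.silver_transactions` as dest
using (
    -- dbt stages the compiled SELECT of the model here
    select id, amount, transaction_date, ingestion_ts
    from `your-project-id.silver.silver_transactions__dbt_tmp`
) as src
on src.id = dest.id
when matched then update set
    amount = src.amount,
    transaction_date = src.transaction_date,
    ingestion_ts = src.ingestion_ts
when not matched then insert (id, amount, transaction_date, ingestion_ts)
values (src.id, src.amount, src.transaction_date, src.ingestion_ts)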

Gold: Consumption Layer

For the final layer, we switch back to **Standard BigQuery Views**. Since the heavy lifting (deduplication) happened in Silver, views offer the best performance and compatibility for BI tools (Looker, Tableau) without the overhead of managing another storage layer.

-- models/gold/gold_daily_sales.sql
select
    transaction_date,
    sum(amount) as total_amount,
    count(id) as transaction_count,
    max(ingestion_ts) as last_ingested_at
from {{ ref('silver_transactions') }}
group by 1
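
Note that the Gold models deliberately do not set `catalog_name`, so they are created as plain BigQuery views on top of the Iceberg-backed Silver table. If you want to make this explicit rather than rely on dbt's default materialization, a one-line config is enough:

-- Optional: be explicit that Gold models are plain BigQuery views
{{ config(materialized='view') }}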

Running the Pipeline

After setting up the environment variables, service account, and permissions, we can test our pipeline.

dbt build
10:13:20  Running with dbt=1.11.2
10:13:23  Registered adapter: bigquery=1.11.0
10:13:23  Found 3 models, 1 source, 539 macros
10:13:23  
10:13:23  Concurrency: 1 threads (target='dev')
10:13:23  
10:13:26  1 of 3 START sql table model bronze.bronze_transactions ........................ [RUN]
10:13:30  1 of 3 OK created sql table model bronze.bronze_transactions ................... [CREATE TABLE (0.0 rows, 51.0 Bytes processed) in 4.10s]
10:13:30  2 of 3 START sql incremental model silver.silver_transactions .................. [RUN]
10:13:33  2 of 3 OK created sql incremental model silver.silver_transactions ............. [MERGE (3.0 rows, 216.0 Bytes processed) in 2.80s]
10:13:33  3 of 3 START sql view model gold.gold_daily_sales .............................. [RUN]
10:13:34  3 of 3 OK created sql view model gold.gold_daily_sales ......................... [CREATE VIEW (0 processed) in 0.90s]
10:13:34  
10:13:34  Finished running 1 incremental model, 1 table model, 1 view model in 0 hours 0 minutes and 10.57 seconds (10.57s).
10:13:34  
10:13:34  Completed successfully
10:13:34  
10:13:34  Done. PASS=3 WARN=0 ERROR=0 SKIP=0 NO-OP=0 TOTAL=3

We can then see our files created in GCS.

Files created in Google Cloud Storage

And our tables use the Iceberg Table Format.

Bronze transactions table details in BigQuery
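
The same can be verified from the command line. A quick sketch (dataset and table names are the ones used in this POC; adjust to yours):

# List the data and metadata files Iceberg wrote to the bucket
gsutil ls -r "gs://${BUCKET_NAME}/**"

# Inspect the table definition; for a BigLake managed Iceberg table the output
# should include the table format and storage URI
bq show --format=prettyjson bronze.bronze_transactions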

Limitations — Not so decoupled architecture

While the setup above works seamlessly, our research revealed a critical limitation regarding the open nature of this stack.

When using `catalog_type: biglake_metastore`, dbt creates BigLake Managed Tables. The metadata is managed implicitly by BigQuery. There is no accessible “Iceberg Catalog API” endpoint.

This defeats the primary promise of Iceberg: interoperability. You cannot easily connect an external engine (like Spark or Trino) to this catalog because it is locked inside BigQuery’s internal metadata service. You have open file formats (Parquet/Iceberg) trapped in a closed catalog.

We checked the BigLake Metastore (BigQuery's managed Iceberg catalog). As we can see below, no catalog was created automatically for our stack.

If we really wanted a true data lakehouse on GCP, with all its advantages, we would have to create a BigLake Metastore catalog manually and use compute engines that can connect to BigQuery's Iceberg REST catalog, such as Spark, Trino, or Presto.
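
For reference, pointing Spark at an Iceberg REST catalog looks roughly like the properties below; the catalog name, endpoint, and warehouse path are placeholders, and the exact BigLake Metastore endpoint and authentication settings depend on Google's current documentation:

# Sketch of Spark properties for an Iceberg REST catalog (values are placeholders)
spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.blms=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.blms.type=rest
spark.sql.catalog.blms.uri=<biglake-iceberg-rest-endpoint>
spark.sql.catalog.blms.warehouse=gs://your-bucket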

However, it is currently not possible to use dbt with that architecture. It adds complexity to the pipelines and, unless you are already using those engines or your data genuinely requires them, it is overkill.

Conclusion

If your goal is simply to lower storage costs while keeping BigQuery as your primary engine, this stack could be an option. Keep in mind, however, that many of the optimizations BigQuery applies to its native storage are not available here, which makes this solution harder to manage.

Unfortunately, using dbt with Apache Iceberg on GCP is not an easy option at the moment. Better integrations may come in the future, but in the current state of the technologies this architecture is not really a viable choice.