In this blog post I am going to walk through a personal thought that came to mind while working on a specific use case.
Topic: partitioning combined with clustering in BigQuery, using the dbt insert_overwrite strategy. Why can't we overwrite the partitions for a specific cluster column value instead of the whole table?
Partitioning
A partitioned table is divided into segments, called partitions. You partition a table by specifying a partitioning column, which is used to segment the table.
The partitioning column can be:
Integer column: partition the table based on ranges of values
https://cloud.google.com/bigquery/docs/partitioned-tables#integer_range
Time column: partition the table based on a DATE, TIMESTAMP or DATETIME column
https://cloud.google.com/bigquery/docs/partitioned-tables#date_timestamp_partitioned_tables
Ingestion time: partition the table based on the pseudocolumn _PARTITIONTIME
https://cloud.google.com/bigquery/docs/partitioned-tables#ingestion_time
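To make the first two flavors concrete, here is a minimal DDL sketch; the dataset, table and column names are hypothetical:

```sql
-- Time-based partitioning: one partition per day of order_date
CREATE TABLE my_dataset.orders_by_day (
  order_id STRING,
  order_date DATE,
  amount NUMERIC
)
PARTITION BY order_date;

-- Integer-range partitioning: buckets of 10,000 customer_id values each
CREATE TABLE my_dataset.orders_by_customer (
  order_id STRING,
  customer_id INT64,
  amount NUMERIC
)
PARTITION BY RANGE_BUCKET(customer_id, GENERATE_ARRAY(0, 100000, 10000));
```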
Clustering
Clustered tables in BigQuery are tables that have a user-defined column sort order using clustered columns. The clustered column is a user-defined table property that sorts storage blocks based on the values in the clustered columns.
Queries that filter or aggregate by the clustered columns only scan the relevant blocks based on the clustered columns instead of the entire table or table partition.
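As a minimal sketch (again with hypothetical names), a clustered-only table and a query that benefits from block pruning could look like this:

```sql
-- Clustered table: storage blocks are sorted by customer_id
CREATE TABLE my_dataset.orders_clustered (
  order_id STRING,
  customer_id STRING,
  amount NUMERIC
)
CLUSTER BY customer_id;

-- The filter on the clustering column lets BigQuery skip the blocks
-- whose customer_id range cannot match
SELECT SUM(amount)
FROM my_dataset.orders_clustered
WHERE customer_id = 'C-42';
```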
Use case
Partitioning and clustering can be combined to achieve a finely grained sorting of the table.
Coming back to my previous sentence: “Queries that filter or aggregate by the clustered columns only scan the relevant blocks based on the clustered columns instead of the entire table or table partition.”
So in a combined approach using both table partitioning and clustering, you first segment table data into partitions, then you cluster the data within each partition by the clustering columns.
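Concretely, a query like the following sketch (hypothetical names) benefits from both levels of pruning:

```sql
-- The date predicate prunes whole partitions; the cluster predicate
-- then prunes blocks inside the surviving partitions
SELECT COUNT(*)
FROM my_dataset.orders_partitioned_clustered
WHERE order_date BETWEEN '2023-01-01' AND '2023-01-31'  -- partition pruning
  AND customer_id = 'C-42';                             -- block pruning
```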
When you work with a tool like dbt, you can leverage the different types of incremental models to get better performance and costs. With the insert_overwrite strategy, dbt generates a merge statement that replaces entire partitions in the destination table. I thought it could be interesting and useful to replace the partitions of only a specific cluster column value, not of the whole table.
What I mean: let's imagine that we have the following table containing daily statistics, per Italian region, on the number of people getting married.
So the table will look like:
Initial table
The table is partitioned by marriage day and clustered on a single column, the region. It can be rendered as follows.
Clustered & Partitioned table
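In dbt, the configuration of such a model could look roughly like the sketch below; the model, source and column names (marriage_day, region, marriages_count) are my own invention for the example:

```sql
-- models/daily_marriages.sql
{{
  config(
    materialized='incremental',
    incremental_strategy='insert_overwrite',
    partition_by={'field': 'marriage_day', 'data_type': 'date'},
    cluster_by=['region']
  )
}}

SELECT
  marriage_day,
  region,
  marriages_count
FROM {{ source('statistics', 'raw_marriages') }}
{% if is_incremental() %}
-- Only reprocess the window delivered by the ingestion tool
WHERE marriage_day >= DATE_SUB(CURRENT_DATE(), INTERVAL 3 MONTH)
{% endif %}
```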
Imagine that every day my ingestion tool delivers the last 3 months of data on this schedule:
10 AM: data from the Lombardia region
12 PM: data from the Marche region
3 PM: data from the Puglia region
Now if I use the dbt insert_overwrite strategy, every time I run my dbt model it overwrites the last 3 months of data independently of the region, which implies that the 3 PM load will replace the data from the Marche and Lombardia regions.
The result will be:
insert_overwrite strategy
final table
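To see why, here is a simplified sketch of what the merge generated by insert_overwrite behaves like (this is not the exact dbt output; table and column names follow the example above): every row in the touched partitions is deleted, whatever its region, before the new rows are inserted.

```sql
-- Simplified sketch of the generated merge, not the literal dbt SQL
MERGE INTO my_dataset.daily_marriages AS dest
USING (SELECT * FROM my_dataset.daily_marriages__dbt_tmp) AS src
ON FALSE
-- Every row of the replaced partitions is deleted, regardless of region...
WHEN NOT MATCHED BY SOURCE
  AND dest.marriage_day >= DATE_SUB(CURRENT_DATE(), INTERVAL 3 MONTH)
  THEN DELETE
-- ...and only the regions present in the current load are re-inserted
WHEN NOT MATCHED THEN
  INSERT (marriage_day, region, marriages_count)
  VALUES (marriage_day, region, marriages_count);
```

So the 3 PM Puglia load deletes the last 3 months of Lombardia and Marche rows as well, because the partition, not the cluster, is the unit of replacement.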
What I think would be a nice-to-have: the possibility to replace a partition only for a specific cluster column value.
What do you think about it?
Conclusion
To overcome the problem, I split the model into several models, one per region, so that each run can always safely replace its partitions. I then aggregate all these models into a final model with a UNION ALL statement, materialized as a view. Why a view? To still benefit from clustering and partitioning and to keep control over costs, without re-materializing the whole table every time a single region is updated.
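As a sketch of this workaround (model names are hypothetical), each region gets its own incremental model, and the final view unions them:

```sql
-- models/marriages_lombardia.sql — one incremental model per region
{{
  config(
    materialized='incremental',
    incremental_strategy='insert_overwrite',
    partition_by={'field': 'marriage_day', 'data_type': 'date'},
    cluster_by=['region']
  )
}}

SELECT marriage_day, region, marriages_count
FROM {{ source('statistics', 'raw_marriages') }}
WHERE region = 'Lombardia'
{% if is_incremental() %}
  AND marriage_day >= DATE_SUB(CURRENT_DATE(), INTERVAL 3 MONTH)
{% endif %}
```

```sql
-- models/daily_marriages.sql — final model materialized as a view
{{ config(materialized='view') }}

SELECT * FROM {{ ref('marriages_lombardia') }}
UNION ALL
SELECT * FROM {{ ref('marriages_marche') }}
UNION ALL
SELECT * FROM {{ ref('marriages_puglia') }}
```

Each regional load now only overwrites its own partitions, while the view keeps a single entry point for consumers on top of tables that retain their partitioning and clustering.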
If you are looking for support on Data Stack or Google Cloud solutions, feel free to reach out to us at sales@astrafy.io.