Introduction
Monitoring and reviewing are the two paddles of the decision-making cycle.
Have you ever wanted to know the nitty-gritty details of your BigQuery consumption? Have you ever had a spike in costs that you wanted to understand? Have you ever wanted to pinpoint your most expensive queries? And the list goes on… If the answer to those questions is “yes”, then this article is for you and will help you gain full visibility on your BigQuery activity. While this article focuses on BigQuery, some of the data sources we use come from dbt packages that are cloud agnostic. Also, the concept of a data product and all the transformations can easily be transposed to other analytical databases.
Monitoring as a Data Product
At Astrafy, we are passionate practitioners of data mesh principles and have translated our different families of data into specific data products. We have, for instance, a data product for FinOps that includes all our cloud billing data from the different cloud providers we use. We have another data product called “web analytics” for all the data generated by our website. And the list goes on, with a “sales & marketing” data product for data generated by HubSpot, a “chatops” data product for data from Slack, etc. This mesh of data products brings a lot of benefits that are well explained in the many articles dedicated to this hot topic. The main benefits for us are the flexibility it gives us and the fact that we can quickly iterate on a data product without impacting any of the others. Interoperability is also critical, and it is made easy thanks to dbt packages; one data product can easily be imported and referred to via a dbt package within another data product. dbt 1.6 also released the dependencies.yml file to allow for a smoother import of data products.
We decided to consider “monitoring” as a dedicated data product as it contains a lot of data and can be well isolated and owned by a team. The scope of this article will focus on monitoring of analytical data (queries, storage, dbt runs, etc.) but this monitoring data product should actually contain all monitoring data of your Cloud infrastructure; anything you want to monitor in depth should be included in this data product.
This data product is mostly self-contained, but we still wanted to check that the consumption computed from the BigQuery logs matches what Google Cloud bills us. We use one of the datamarts from the FinOps data product within our monitoring data product to perform this task. The diagram depicted below demonstrates our monitoring data product and its interoperability with the FinOps data product.
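To make this interoperability concrete, here is a minimal sketch of what such a reconciliation model could look like in dbt, assuming a hypothetical monitoring datamart `dm_bigquery_cost_per_day` and a hypothetical FinOps datamart `dm_gcp_billing_per_day` imported via the FinOps dbt package (model and column names are illustrative, not our actual schemas):

```sql
-- Reconciliation sketch: compare BigQuery cost estimated from the logs with the
-- cost billed by Google Cloud. All model and column names below are hypothetical.
select
    monitoring.usage_date,
    monitoring.cost_estimated_from_logs,
    finops.cost_billed,
    round(monitoring.cost_estimated_from_logs - finops.cost_billed, 2) as delta
from {{ ref('dm_bigquery_cost_per_day') }} as monitoring
left join {{ ref('finops', 'dm_gcp_billing_per_day') }} as finops
    on monitoring.usage_date = finops.usage_date
order by monitoring.usage_date desc
```

The two-argument `ref('finops', 'dm_gcp_billing_per_day')` is how a model from another data product, imported as a dbt package, can be referenced from within this one.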
In the next two sections, we are going to dive deep into the different technical components of this monitoring data product, and the last section will shed light on the end goal of this product, which is to give the different stakeholders visibility on what is happening on BigQuery. Visibility is given mainly through BI dashboards, but we will see that we can leverage other types of distribution layers, such as a messaging backbone to trigger another process, a Slack alert, etc.
Source data
It all starts with the source data. When starting a data product, the first step is to identify all the sources required to answer the different business questions. We identified the following two families of sources:
BigQuery metadata: everything we do on BigQuery generates logs.
dbt artifacts & elementary data: metadata generated by dbt packages that allows us to derive granular insights when joined with BigQuery metadata
BigQuery Metadata
BigQuery creates loads of logs throughout the day, so we created a sink to gather these logs into a BigQuery table. This table is the main source of all the data used later. These logs are organised into three different streams:
Data access
System event
Admin activity
Each of these streams gives access to a different kind of information. While Data access gives you insights about job executions and table modifications, System event reports system-generated events such as a table expiring and being removed. Finally, the Admin activity stream covers all remaining activities and events, such as table and dataset creation.
You can read more detailed information about these tables here.
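Once the sink is in place, the logs can be queried like any other BigQuery table. As a minimal sketch, assuming the sink routes the Data access stream into a table named `cloudaudit_googleapis_com_data_access` in a `bigquery_logs` dataset (the project, dataset and table names depend on your own sink configuration), the following query counts audit events per principal and method over the last seven days:

```sql
-- Number of Data access audit events per principal and method over the last 7 days.
-- Project, dataset and table names are assumptions tied to the sink setup.
select
    protopayload_auditlog.authenticationInfo.principalEmail as principal_email,
    protopayload_auditlog.methodName as method_name,
    count(*) as nb_events
from `my-project.bigquery_logs.cloudaudit_googleapis_com_data_access`
where timestamp >= timestamp_sub(current_timestamp(), interval 7 day)
group by principal_email, method_name
order by nb_events desc
```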
BigQuery Information Schema
To complement the information obtained from the BigQuery logs, we also use the BigQuery INFORMATION_SCHEMA. These are read-only, system-defined views that provide metadata about your BigQuery objects and can be queried directly. This metadata overlaps with the BigQuery metadata from the previous section but is easier to access and comes with a simpler schema.
After some consideration, we chose the views that were the most useful for our usage. For instance, we have made use of these resource types:
Datasets
Jobs
Jobs by timeslice
Tables
etc.
You can read more detailed information about these INFORMATION_SCHEMA views here.
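As an illustration of the kind of question these views answer out of the box, here is a sketch that aggregates bytes billed and slot time per user over the last seven days (adjust the `region-eu` qualifier to the region where your jobs run):

```sql
-- Bytes billed and slot hours per user over the last 7 days (query jobs only).
select
    user_email,
    count(*) as nb_jobs,
    round(sum(total_bytes_billed) / pow(1024, 4), 2) as tib_billed,
    round(sum(total_slot_ms) / 1000 / 3600, 1) as slot_hours
from `region-eu`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
where creation_time >= timestamp_sub(current_timestamp(), interval 7 day)
    and job_type = 'QUERY'
    and state = 'DONE'
group by user_email
order by tib_billed desc
```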
dbt artifacts
We use dbt as our unique framework to transform data within BigQuery. It is the entry point for all the transformations written by our analytics engineers. We have written extensively about how to get the most out of dbt in this series of three articles (part 1, part 2 and part 3). A good thing about dbt is that it generates a lot of metadata as well, and this metadata can be leveraged in a structured manner via the following dbt packages:
dbt artifacts
elementary
dbt project evaluator
These three packages work for you behind the scenes and feed tables in your analytical database. Those tables can then be used as source tables for our monitoring data product and contain information such as:
Metrics for each dbt run (total time elapsed, bytes consumed per model, etc.)
Metrics about models
Metrics about tests
etc.
dbt project evaluator is more of an auditing package and will give you visibility on how compliant your dbt project is with dbt best practices.
All this metadata is very precious and enables granular insights. One important use case with regard to data mesh is the ability to filter by data product; this is possible because we attach the data product metadata to all our models. For a data product owner, it then becomes easy to get visibility on just the queries of his/her data product.
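As a sketch of how this metadata can be used downstream, the following query computes average runtime and failure counts per model from the tables produced by the dbt artifacts package (the `fct_dbt__model_executions` model and its columns are assumptions that may differ across package versions; joining on model metadata would additionally let you slice these figures per data product):

```sql
-- Average runtime and failures per dbt model over the last 30 days, based on the
-- dbt artifacts package. Model and column names may vary with the package version.
select
    executions.node_id,
    count(*) as nb_runs,
    round(avg(executions.total_node_runtime), 1) as avg_runtime_seconds,
    countif(executions.status != 'success') as nb_failures
from {{ ref('dbt_artifacts', 'fct_dbt__model_executions') }} as executions
where executions.run_started_at >= timestamp_sub(current_timestamp(), interval 30 day)
group by executions.node_id
order by avg_runtime_seconds desc
```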
Transformation in dbt
After gathering all the data from the BigQuery logs sink, INFORMATION_SCHEMA views and the various dbt artifacts packages, it is now time to transform this data and get valuable insights from it.
As usual, we follow the best practices, which are:
To segment our dbt models into different data layers:
→ staging models
→ data warehouse/intermediate models
→ datamart models.
To configure incremental materialization for models with huge amounts of data ingested daily; this avoids reprocessing data that has already been transformed.
To configure clustering on fields that are frequently filtered on; this improves both query speed and cost.
This way, we knew our work would be efficient and easy to understand later on. First come the staging models, mainly used for minimal 1-to-1 transformations from the sources, where we add the “record_date” field used by the incremental configuration. Then come the intermediate models that contain our data modelling logic with full historisation of the data. Finally, the datamart models are the final tables that are plugged into our various downstream applications. The new version of the BigQuery logs is such that a lot of information is contained in the JSON field “protopayload_auditlog.metadataJson”. This makes storage efficient, but to extract data from this field we have to leverage BigQuery JSON functions. You can get more information about this metadata JSON here.
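To make this concrete, here is a minimal sketch of what such an incremental staging model could look like; the source name is illustrative and the JSON paths follow the BigQueryAuditMetadata format, so adapt both to your own sink and needs:

```sql
-- stg_bigquery_jobs.sql: sketch of a staging model over the Data access log sink.
-- Source name and extracted fields are illustrative; paths follow BigQueryAuditMetadata.
{{
    config(
        materialized='incremental',
        partition_by={'field': 'record_date', 'data_type': 'date'},
        cluster_by=['principal_email']
    )
}}

select
    date(timestamp) as record_date,
    timestamp as event_timestamp,
    protopayload_auditlog.authenticationInfo.principalEmail as principal_email,
    json_value(protopayload_auditlog.metadataJson,
               '$.jobChange.job.jobConfig.queryConfig.query') as query_text,
    cast(json_value(protopayload_auditlog.metadataJson,
                    '$.jobChange.job.jobStats.queryStats.totalBilledBytes') as int64) as total_billed_bytes
from {{ source('bigquery_logs', 'cloudaudit_googleapis_com_data_access') }}
where json_value(protopayload_auditlog.metadataJson, '$.jobChange.job.jobStatus.jobState') = 'DONE'
{% if is_incremental() %}
    and date(timestamp) > (select max(record_date) from {{ this }})
{% endif %}
```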
All the dbt models run on a daily basis, with Airflow orchestrating the process. For our use case, having the data refreshed once a day is enough, but this frequency could be increased in case you need closer to real-time monitoring.
The following architecture depicts the entire flow of data, flowing from left to right. In the blue area are the different sources described above, automatically fed by the log sinks and the dbt packages. The green area is where the dbt transformations happen; this is the part that runs daily through Airflow to bring updated datamarts into BigQuery.
Distribution Layer
Visualisation via BI dashboards
Getting great insights from data through sophisticated yet simple dashboards is a daunting task. It requires close collaboration between the end users of the dashboards and the different technical teams that intervene in the process. These technical personas are the analytics engineers who get the data from the sources and transform it into the datamarts defined by the business. Designing and developing the dashboards then requires visualization skills to convey the needed insights in a well-designed and accessible way. Dashboards also need to provide flexibility and granularity via filters, all without becoming overwhelming. Below is a summarized workflow of how we proceeded to build those monitoring dashboards:
We took quite some inspiration from this GitHub repository for the admin insights and then added some custom dashboards based on our own needs.
These dashboards contain either admin insights about what happened lately and where employees accessed BigQuery from, or more precise insights about the queries run: how long they took, who ran them, which were the most expensive, etc.
Some real use cases we solved recently with those dashboards:
We noticed a spike in the daily BigQuery cost dashboard and, at the time, did not know the cause of this expense. Using the Query Cost by Day dashboard, we managed to isolate the problem with ease: the cost increase compared to other days was due to many dbt run --full-refresh executions triggered in Airflow (a sketch of this kind of daily cost check is shown after this list).
Fun fact: we are a European company and have no one working from the USA. But on the admin dashboard, where we have a map showing the different logins of service accounts and employee accounts, we noticed many logins from the US east coast. With a bit of digging into the data through the BI tool, we found out that the GitLab CI servers were at the origin of those logins (we run some dbt operations in our GitLab CI).
As a BI tool, we use Looker internally; it does a great job on visualizations but, more importantly, it excels with its semantic layer through LookML. On the workflow above, you might have noticed the dashed green rectangle “Define metrics and build semantic layer”. Ideally, you want a single source of truth for your metrics and have all downstream applications use the same definition of a metric. This is possible and that is the strategy we recommend; dbt 1.6 and future releases include a great semantic layer to perform this task. But in case you use Looker or another advanced BI tool, it is also possible to define your semantic layer within that tool. This is a design discussion your data architects need to have.
Other downstream applications
BI dashboards are by far the most popular downstream application, but it is worth mentioning that a multitude of other downstream applications exist. The following diagram gives a high-level overview of some other useful downstream applications that can be plugged into this data.
Three other families of downstream applications, on top of BI dashboards, are worth mentioning:
Machine Learning: with the large amount of data produced by those logs, you might be interested in building predictive ML models. In that case, Vertex AI can plug directly into your datamarts in BigQuery. From the Vertex AI platform, you can then develop your model and put it into production using the multitude of MLOps tools Vertex AI offers.
API endpoints: datamart data can definitely be of interest to other departments in your company, and the best way to expose this data is through an API product. Tools like Tinybird allow you to get API endpoints on top of your BigQuery tables in a matter of minutes, in a fully managed manner.
Messaging backbone: you can also have your orchestrator dispatch a message to a messaging backbone once the datamart transformations are done. This opens the door to integrations with countless tools, as your messaging backbone would then be connected to an engine that takes actions based on this message. We use Pub/Sub for the following two use cases internally:
→ to send alerts to Slack via a Cloud Function; we define some alerting logic within the Cloud Function and, if an alert criterion is met, it fires a message to a specific channel.
→ to feed some data to an external SIEM system
While BI dashboards offer visibility and give great insights into the current state of the data, the three other families of downstream applications allow you to activate the data and go beyond simple insights. Having systems in place that act on your data in an automated way is a real game changer and sets apart truly data-driven companies.
Conclusion
Throughout this article we have delved into the intricate processes and technologies utilised in monitoring BigQuery activity. The concept of monitoring as a data product offers enhanced granularity and a well-structured approach, ensuring more comprehensive insights. By segmenting and transforming data using tools like dbt, we can derive valuable insights that not only help in cost analysis but also in optimizing processes and detecting anomalies.
Visualization through BI dashboards remains a powerful tool for conveying insights, but it’s essential to recognize the expanding horizon of other downstream applications. Machine Learning, API endpoints, and Messaging backbones each offer unique opportunities to harness and activate data in a proactive manner. In today’s fast-paced digital environment, achieving a holistic understanding of BigQuery activity and leveraging it to make informed decisions is more than a luxury — it’s a necessity.
If you are looking for support on Data Stack or Google Cloud solutions, feel free to reach out to us at sales@astrafy.io.