Analytics Engineering

Anomaly Detection with dbt and Elementary

Jun 10, 2024

Introduction

Data quality issues are costly, with impacts ranging from incorrect figures in accounting reports to inaccurate forecasts. A widely cited IBM estimate puts the cost of poor data quality to the US economy alone at around $3.1 trillion per year. Maintaining high data quality and reliable data pipelines is therefore critical.

For data and analytics engineers, keeping data pipelines free from anomalies and inconsistencies directly improves the accuracy of business insights. This is where Elementary, a dbt-native data observability platform, comes into play.

Anomaly detection is a critical aspect of data management. Unexpected deviations in data can indicate errors, system issues, or even fraudulent activities. For data and analytics engineers, promptly identifying and addressing these anomalies is essential to maintain data integrity. Elementary provides an efficient and seamless way to integrate anomaly detection into your dbt projects, offering immediate visibility into data quality issues with minimal setup.

In this article, we will dive into the features of Elementary, how it integrates with dbt, and how you can configure and use its anomaly detection capabilities to safeguard your data pipelines.

Tests versus anomalies

When discussing data integrity tests, it is crucial to differentiate between classical tests and anomaly detection. Both are essential in maintaining high data quality but operate differently.

Classical Tests

Classical tests are predefined and known by the users. They are explicitly set up to monitor specific metrics or conditions within the data, such as ensuring that no null values exist in a critical column or verifying that a particular range of values is maintained. These tests are deterministic and require the user to define the exact criteria for what constitutes a pass or fail.

Anomaly Detection

Anomalies, on the other hand, are unknown and undefined beforehand. Detecting them relies on statistical methods that identify deviations from the norm based on historical data. Anomaly detection does not require explicit criteria to be set by the user; instead, it uses patterns and trends from past data to establish what is considered ‘normal’ and flags any significant deviations from this expected range. This approach allows for the detection of unexpected and potentially unknown issues within the data.
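To make the contrast concrete, here is a minimal Python sketch of the two kinds of check. It is an illustration only, not Elementary's implementation: the function names are hypothetical, and the use of a sample standard deviation with a threshold of 3 is an assumption.

```python
from statistics import mean, stdev


def classical_not_null_test(values):
    """Classical test: the pass/fail rule is written down in advance."""
    return all(v is not None for v in values)


def anomaly_test(history, latest, z_threshold=3.0):
    """Anomaly test: the rule is derived from historical values.

    Passes when `latest` lies within `z_threshold` standard deviations
    of the historical mean (sample standard deviation, an assumption).
    """
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return latest == mu
    return abs(latest - mu) / sigma <= z_threshold


daily_row_counts = [1000, 1020, 980, 1010, 990, 1005, 995]

print(classical_not_null_test([1, 2, None]))  # False: a null breaks the fixed rule
print(anomaly_test(daily_row_counts, 1003))   # True: close to what history suggests
print(anomaly_test(daily_row_counts, 400))    # False: far outside the usual variation
```

Note that nobody had to state that 400 rows is too few: the second check infers it from the history it was given.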

Integration with Elementary

Elementary seamlessly integrates both types of tests within your dbt projects. By default, Elementary picks up all dbt classical tests and incorporates additional anomaly detection tests specific to the Elementary library. This dual approach ensures a comprehensive monitoring strategy, leveraging the strengths of both classical tests and anomaly detection to safeguard data integrity.

Figure: Comparison between classical tests and anomaly detection. Classical tests (left) are predefined and monitor specific metrics; anomaly detection (right) is not predefined and uses patterns from past data.

Understanding Elementary’s Anomaly Detection

Elementary is designed to work natively with dbt, allowing you to configure and execute data tests just like native dbt tests. These tests help monitor specific metrics, such as row count, null rate, and average value, to detect significant changes and deviations. The results are then presented in the Elementary UI, complete with alerts for any detected anomalies.

Key Concepts

Before we dive into the configuration and execution of tests, it’s important to understand some core concepts related to Elementary’s anomaly detection:

Anomaly: A value that deviates significantly from the expected range calculated based on historical data.
Monitored Data Set: The data set against which the data monitor runs, including both training set values and detection set values.
Data Monitors: Metrics collected to detect problems, such as freshness, volume, nullness, uniqueness, and distribution.
Training Set: A reference set of values used to calculate the expected range for the data monitor.
Detection Set: Values compared to the expected range. If a value is outside the expected range, it is flagged as an anomaly.
Expected Range: The range of values calculated based on the training set, used as a benchmark for detecting anomalies.
Time Bucket: Data is split into consistent time intervals to analyze changes over time.
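The concepts above can be tied together in a few lines of Python. This is a sketch under stated assumptions rather than Elementary's actual logic: the expected range is taken to be the training mean plus or minus `sensitivity` sample standard deviations, the last buckets of the monitored data set form the detection set, and all names are hypothetical.

```python
from statistics import mean, stdev


def expected_range(training_set, sensitivity=3.0):
    """Expected range derived from the training set (assumed: mean +/- k * stdev)."""
    mu = mean(training_set)
    sigma = stdev(training_set)
    return mu - sensitivity * sigma, mu + sensitivity * sigma


def find_anomalies(monitored, detection_buckets, sensitivity=3.0):
    """Split the monitored data set and flag detection values outside the range.

    `monitored` holds one metric value per time bucket, oldest first.
    Returns a list of (value, is_anomaly) pairs for the detection set.
    """
    training_set = monitored[:-detection_buckets]
    detection_set = monitored[-detection_buckets:]
    low, high = expected_range(training_set, sensitivity)
    return [(value, not (low <= value <= high)) for value in detection_set]


# Null count per daily bucket: 14 training buckets followed by 2 detection buckets.
null_counts = [5, 7, 6, 5, 8, 6, 7, 5, 6, 7, 6, 5, 7, 6, 6, 30]

print(find_anomalies(null_counts, detection_buckets=2))
# [(6, False), (30, True)]
```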

Figure: Anomaly detection tests core concepts (source: Elementary docs). A null count metric is plotted over time and split into a "Training" and a "Detection" period, with a shaded area for the expected range. In the detection period, values within the range are green and anomalies outside it are red; annotations mark a non-anomaly in the training period and an anomaly in the detection period.

Configuring Anomaly Detection Tests

Elementary’s anomaly detection tests are configured using .yml files within your dbt project. The configuration process is straightforward, following dbt’s native setup.

Example Configuration

Here’s an example of configuring a volume anomalies test in your dbt project:

Figure: Code snippet configuring a volume anomalies test with Elementary. It sets timestamp_column to updated_at, anomaly_sensitivity to 3, and monitors anomalies in both directions. The detection period is 2 days, the training period is 14 days, and the time bucket is 1 day. Small changes are ignored through spike_failure_percent_threshold and drop_failure_percent_threshold, both set to 10.

A volume anomalies test checks the volume of data over a specified period of time. This test is crucial for ensuring that the ingestion of data remains consistent. For instance, if your data pipeline typically ingests around 10,000 rows of data daily, a sudden drop or spike in this volume could indicate an issue that needs immediate attention. By implementing a volume anomalies test, you can detect these irregularities early and maintain the reliability of your data processes.
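As a worked example of the 10,000-rows-a-day scenario, the sketch below computes how unusual a day with 9,700 rows would be. The row counts are made-up illustration data, and scoring the day with a Z-score over a sample standard deviation is an assumption about the statistics involved, not a description of Elementary's internals.

```python
from statistics import mean, stdev


def volume_z_score(training_counts, observed):
    """Number of standard deviations `observed` lies from the training mean.

    Assumes the training counts are not all identical (non-zero deviation).
    """
    mu = mean(training_counts)
    sigma = stdev(training_counts)
    return (observed - mu) / sigma


# Fourteen days of ingestion hovering around 10,000 rows per day (illustrative).
training_counts = [10000, 10120, 9950, 10080, 9900, 10010, 10050,
                   9970, 10100, 9930, 10020, 9980, 10060, 9990]

z = volume_z_score(training_counts, 9_700)
print(round(z, 2))

# The same day is judged differently depending on the chosen threshold.
for sensitivity in (3, 5):
    print(sensitivity, abs(z) > sensitivity)
```

A drop of only 3% is flagged at a threshold of 3 here because the history is very stable, which is exactly the kind of case the ignore-small-changes thresholds are meant to soften.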

Core Configuration Parameters

To effectively detect anomalies, it is crucial to configure several key parameters:

timestamp_column: Specifies the column used to determine the time buckets. This is crucial for splitting data into the defined intervals for analysis.
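anomaly_sensitivity: Defines the sensitivity of the anomaly detection as a Z-score threshold, that is, how many standard deviations from the training average the expected range extends. A lower value narrows the expected range, so smaller deviations from the norm are flagged as anomalies; a higher value widens it, so only larger deviations are flagged.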
anomaly_direction: Indicates whether to detect spikes, drops, or both. This allows you to tailor the detection process based on the type of anomalies that are most concerning for your data set.
detection_period: The period over which to detect anomalies. Only values from this most recent window are compared against the expected range.
training_period: The period used to establish the expected range. This defines how far back in time the historical data should be considered when calculating the expected range.
time_bucket: The granularity of the time intervals for analysis. This helps in breaking down the data into manageable and comparable chunks.
ignore_small_changes: Thresholds to ignore minor variations, ensuring that only significant anomalies are flagged.
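The following Python sketch shows one way these parameters could interact when a single bucket is evaluated. It is an interpretation for illustration, not Elementary's code: the function and argument names are hypothetical, and the percentage thresholds are assumed to be measured as a change relative to the training mean.

```python
import math
from statistics import mean, stdev


def evaluate_bucket(value, training, sensitivity=3.0, direction="both",
                    spike_pct=None, drop_pct=None):
    """Return True if `value` should be reported as an anomaly."""
    mu = mean(training)
    sigma = stdev(training)
    if sigma == 0:
        z = 0.0 if value == mu else math.copysign(math.inf, value - mu)
    else:
        z = (value - mu) / sigma

    # anomaly_sensitivity: Z-score threshold in each direction.
    is_spike = z > sensitivity
    is_drop = z < -sensitivity

    # anomaly_direction: keep only the direction(s) of interest.
    if direction == "spike":
        is_drop = False
    elif direction == "drop":
        is_spike = False

    # ignore_small_changes (assumed semantics): the change must also exceed
    # the given percentage of the training mean to count.
    if is_spike and spike_pct is not None:
        is_spike = value > mu * (1 + spike_pct / 100)
    if is_drop and drop_pct is not None:
        is_drop = value < mu * (1 - drop_pct / 100)

    return is_spike or is_drop


stable = [1000, 1001, 999, 1000, 1002, 998, 1000]

print(evaluate_bucket(1010, stable))                      # True: tiny but statistically unusual
print(evaluate_bucket(1010, stable, spike_pct=10))        # False: below the 10% threshold
print(evaluate_bucket(880, stable, direction="spike"))    # False: drops are not monitored
```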

Running Anomaly Detection Tests

Once configured, running the tests is as simple as executing your usual dbt commands, such as dbt test. Elementary will split your data into the defined time buckets, calculate the metrics, and compare recent values against historical data. Any anomalies detected during the detection period will result in test failures, allowing you to take immediate corrective action.

For detailed steps and code snippets, refer to the Elementary Anomaly Detection section in our public repository dbt-demo.

Example of a Test Run

When a test is run, Elementary performs the following steps:

Splitting Data: Data is split into time buckets based on the specified timestamp_column and time_bucket configuration. This step ensures that the data is analyzed over consistent intervals.
Calculating Metrics: The system calculates the relevant metrics (e.g., row count, null rate) for each bucket within the training period.
Establishing Expected Range: Based on the metrics from the training period, Elementary establishes an expected range for each metric.
Comparing Metrics: The metrics from the detection period are compared against the expected range. This step identifies any deviations that fall outside the normal variation.
Flagging Anomalies: Significant deviations are flagged as anomalies, triggering alerts and marking the test as failed.
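The steps above can be sketched end to end in plain Python for a row count metric. This is a simplified model, not what Elementary executes in the warehouse: the function name and return shape are hypothetical, buckets are calendar days, the training window is assumed to sit directly before the detection window, and the expected range is assumed to be the mean plus or minus `sensitivity` sample standard deviations.

```python
from collections import Counter
from datetime import date, datetime, timedelta
from statistics import mean, stdev


def run_volume_test(timestamps, today, training_days=14, detection_days=2,
                    sensitivity=3.0):
    # 1. Splitting data: one bucket per calendar day of the timestamp column.
    counts = Counter(ts.date() for ts in timestamps)
    detection = [today - timedelta(days=i) for i in range(detection_days, 0, -1)]
    training = [detection[0] - timedelta(days=i) for i in range(training_days, 0, -1)]

    # 2. Calculating metrics: row count per training bucket (missing day = 0 rows).
    training_values = [counts.get(day, 0) for day in training]

    # 3. Establishing the expected range from the training buckets.
    mu, sigma = mean(training_values), stdev(training_values)
    low, high = mu - sensitivity * sigma, mu + sensitivity * sigma

    # 4-5. Comparing detection buckets to the range and flagging anomalies.
    anomalies = {day: counts.get(day, 0) for day in detection
                 if not low <= counts.get(day, 0) <= high}
    return {"expected_range": (low, high), "anomalies": anomalies,
            "passed": not anomalies}


# Illustrative data: 14 ordinary days, then a normal day and a day with a sharp drop.
today = date(2024, 2, 21)
rows_per_day = [100, 102, 99, 101, 100, 98, 103, 100, 101, 99, 102, 100, 98, 101,
                100, 20]
timestamps = []
for offset, n in zip(range(16, 0, -1), rows_per_day):
    midnight = datetime.combine(today - timedelta(days=offset), datetime.min.time())
    timestamps.extend(midnight + timedelta(minutes=k) for k in range(n))

result = run_volume_test(timestamps, today)
print(result["passed"])     # False
print(result["anomalies"])  # {datetime.date(2024, 2, 20): 20}
```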

Figure: Anomaly detection test result example from Elementary report (source: Elementary docs). A line graph tracks a metric from February 8 to February 20; most points fall within the shaded expected range, and the point on February 20, marked in red, is an anomaly outside it.

Handling Test Failures

A test failure indicates that an anomaly has been detected. The Elementary UI provides detailed insights into the nature of the anomaly, including the specific metric and time period affected. This enables data engineers to quickly diagnose and address the underlying issue.

For example, if a significant drop in row count is detected, it may indicate an upstream data ingestion issue. Promptly addressing such issues can prevent incorrect data from propagating through your analytics systems, ensuring the accuracy and reliability of your insights.

Use Cases for Anomaly Detection

Anomaly detection is applicable in various scenarios within data pipelines:

Data Freshness: Monitoring the freshness of data ensures that the latest information is always available for analysis. Anomalies in data freshness can indicate delays in data processing or ingestion.
Volume Anomalies: Detecting unexpected changes in data volume helps in identifying issues such as data loss, duplication, or unusual data spikes.
Null Rate: Monitoring the null rate of critical columns ensures data completeness and helps in identifying missing or incomplete data.
Uniqueness: Ensuring the uniqueness of key columns prevents issues related to data duplication and integrity.
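Each of these use cases boils down to a metric computed per time bucket and then tracked over time. The sketch below computes four such metrics for one bucket of rows. The metric names and formulas are illustrative assumptions and are not Elementary's monitor definitions.

```python
from datetime import datetime


def bucket_metrics(rows, column, key, timestamp_column, now):
    """Compute simple freshness, volume, null-rate and uniqueness metrics.

    `rows` is a non-empty list of dicts representing one time bucket.
    """
    values = [row[column] for row in rows]
    keys = [row[key] for row in rows]
    latest = max(row[timestamp_column] for row in rows)
    return {
        "row_count": len(rows),                                    # volume
        "null_rate": sum(v is None for v in values) / len(rows),   # completeness
        "distinct_ratio": len(set(keys)) / len(rows),              # uniqueness
        "freshness_hours": (now - latest).total_seconds() / 3600,  # freshness
    }


rows = [
    {"order_id": 1, "email": "a@example.com", "updated_at": datetime(2024, 6, 10, 3, 0)},
    {"order_id": 2, "email": None, "updated_at": datetime(2024, 6, 10, 4, 0)},
    {"order_id": 3, "email": "c@example.com", "updated_at": datetime(2024, 6, 10, 5, 0)},
    {"order_id": 3, "email": "c@example.com", "updated_at": datetime(2024, 6, 10, 6, 0)},
]

metrics = bucket_metrics(rows, column="email", key="order_id",
                         timestamp_column="updated_at",
                         now=datetime(2024, 6, 10, 12, 0))
print(metrics)
# {'row_count': 4, 'null_rate': 0.25, 'distinct_ratio': 0.75, 'freshness_hours': 6.0}
```

Feeding a series of such values, one per bucket, into an expected-range check is what turns a plain metric into an anomaly test.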

Advanced Configuration Options

Elementary also provides advanced configuration options to tailor the anomaly detection process to specific needs:

Seasonality: By configuring seasonality, you can account for periodic fluctuations in your data. For example, web traffic may exhibit weekly patterns that should not be flagged as anomalies.
Dimensions: You can configure dimensions to perform more granular anomaly detection based on specific attributes of your data.
Exclusion Criteria: Specific columns or metrics can be excluded from the anomaly detection process, allowing you to focus on the most critical aspects of your data.
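To illustrate why seasonality matters, the sketch below compares a day either against all of its history or only against the same weekday. It is a simplified model with hypothetical names and made-up traffic numbers, and it assumes a mean plus or minus three sample standard deviations as the expected range; it does not reproduce how Elementary implements seasonality.

```python
from datetime import date, timedelta
from statistics import mean, stdev


def is_anomalous(day, value, history, sensitivity=3.0, seasonal=False):
    """Check `value` against history, optionally restricted to the same weekday.

    `history` maps a date to the metric value observed on that date.
    """
    training = [v for d, v in history.items()
                if not seasonal or d.weekday() == day.weekday()]
    mu, sigma = mean(training), stdev(training)
    return abs(value - mu) > sensitivity * sigma


# Eight weeks of web traffic: about 1000 visits on weekdays, about 300 on weekends.
start = date(2024, 4, 1)  # a Monday
history = {}
for i in range(56):
    day = start + timedelta(days=i)
    base = 300 if day.weekday() >= 5 else 1000
    history[day] = base + (i % 5) * 4 - 8  # small deterministic variation

saturday = date(2024, 6, 1)

# A Saturday with weekday-level traffic hides inside the wide overall range...
print(is_anomalous(saturday, 1000, history, seasonal=False))  # False
# ...but stands out once it is compared with previous Saturdays only.
print(is_anomalous(saturday, 1000, history, seasonal=True))   # True
```

Without the weekday restriction, the weekly pattern itself inflates the spread of the training set, so the expected range becomes too wide to be useful.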

Conclusion

Elementary’s dbt-native anomaly detection capabilities provide a powerful tool for data and analytics engineers to maintain the integrity of their data pipelines. By seamlessly integrating with dbt projects, Elementary allows for easy configuration and execution of anomaly detection tests, ensuring that data quality issues are promptly identified and addressed.

One of the key advantages of Elementary is its ability to detect anomalies based on the patterns in your data, highlighting issues that might have otherwise gone unnoticed. It is impossible to define or anticipate all potential tests for your data, but Elementary’s robust anomaly detection fills this gap, alerting you to unexpected deviations.

Implementing Elementary in our data observability strategy has significantly enhanced our customers’ ability to detect and respond to anomalies, ultimately leading to more reliable and accurate data insights.

Thank you

If you enjoyed reading this article, stay tuned: we regularly publish technical articles on dbt and how best to leverage it to transform your data efficiently. Follow Astrafy on LinkedIn, Medium and YouTube to be notified of the next article.

If you are looking for support on Modern Data Stack or Google Cloud solutions, feel free to reach out to us at sales@astrafy.io.