Anomaly tests parameters

All anomaly detection tests:
- timestamp_column: column name
- where_expression: SQL expression
- anomaly_sensitivity: int
- anomaly_direction: [both | spike | drop]
- ignore_small_changes: spike_failure_percent_threshold, drop_failure_percent_threshold
- anomaly_exclude_metrics: SQL expression

Anomaly detection tests with timestamp_column:
- training_period
- detection_period
- time_bucket
- seasonality
- detection_delay

all_columns_anomalies test:
- column_anomalies
- exclude_prefix
- exclude_regexp
Parameters configuration
timestamp_column
```yml
timestamp_column: [column name]
```
Anomaly detection tests utilize a specified column to segment data into time buckets and to filter the dataset. It is highly recommended to use a timestamp column such as `updated_at`, `created_at`, or `loaded_at` (a date type column is also acceptable) for optimal performance.
- With a timestamp column: Specifying a `timestamp_column` enables the test to divide the data into time-based buckets using this column's timestamps. It calculates the metric for each bucket and identifies anomalies among them. This approach allows the test to operate immediately if the table has sufficient historical data.
- Without a timestamp column: If a `timestamp_column` is not specified, the test computes the metric for the entire table on each run and compares it with metrics from previous runs to detect anomalies. In this case, the test requires the `training_period` duration to accumulate the necessary metrics before it becomes effective.
If a timestamp column is not defined, the default behavior is to not create time buckets (default is null).
Default: none
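As an illustration, a dbt-style Elementary test configuration with a timestamp column might look like this (the model name `orders` and column `updated_at` are placeholders):

```yml
models:
  - name: orders
    tests:
      - elementary.volume_anomalies:
          # Bucket and filter the data by this column
          timestamp_column: updated_at
```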
where_expression
```yml
where_expression: [SQL expression]
```
Filter the tested data using a valid SQL expression.
Default: none
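For example, a sketch of a filter limiting the tested rows (the model name and SQL expression are placeholders):

```yml
models:
  - name: orders
    tests:
      - elementary.volume_anomalies:
          timestamp_column: updated_at
          # Only rows matching this SQL expression are tested
          where_expression: "country = 'US'"
```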
anomaly_sensitivity
```yml
anomaly_sensitivity: [int]
```
This configuration defines how the expected range is calculated. A sensitivity setting of 3 implies that the expected range is within three standard deviations from the average of the training set. A smaller sensitivity value will decrease this range, potentially flagging more values as anomalies. Conversely, larger values increase the expected range, likely reducing the number of detected anomalies.
Default: 3
anomaly_direction
```yml
anomaly_direction: [both | spike | drop]
```
This setting determines how data points are compared to the expected range, specifically whether anomalies are identified when data points are above, below, or in both directions relative to this range. This is particularly useful when monitoring metrics where only one type of deviation is considered problematic. For instance, in freshness monitoring, the focus might be solely on detecting delays (data appearing later than expected) rather than early data. The anomaly_direction configuration allows for specifying the direction of interest—both for both deviations, spike for above-the-range anomalies, or drop for below-the-range anomalies.
Default: both
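A sketch combining sensitivity and direction (the model name and values are illustrative):

```yml
models:
  - name: orders
    tests:
      - elementary.volume_anomalies:
          timestamp_column: updated_at
          # Narrower expected range than the default of 3 standard deviations
          anomaly_sensitivity: 2
          # Alert only when the metric falls below the expected range
          anomaly_direction: drop
```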
ignore_small_changes
```yml
ignore_small_changes:
  spike_failure_percent_threshold: [int]
  drop_failure_percent_threshold: [int]
```
This configuration allows an anomaly test to fail only if all of the following conditions are met:
- The z-score of the metric within the detection period is considered anomalous.
- Additionally, one of the following conditions holds:
  - The metric within the detection period exceeds the `spike_failure_percent_threshold` percentage of the mean value from the training period, if this threshold is defined.
  - The metric within the detection period is below the `drop_failure_percent_threshold` percentage of the mean value from the training period, if this threshold is defined.
These settings are useful for situations where metrics are stable, and minor fluctuations result in disproportionately high z-scores, leading to false positives in anomaly detection.
If these thresholds are not defined, the default behavior does not consider small changes, with both spike_failure_percent_threshold and drop_failure_percent_threshold being null.
Default: none
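A sketch of these thresholds in a test configuration (the model name and threshold values are illustrative):

```yml
models:
  - name: orders
    tests:
      - elementary.volume_anomalies:
          timestamp_column: updated_at
          ignore_small_changes:
            # Fail only when the metric also crosses these percentage
            # thresholds relative to the training-period mean
            spike_failure_percent_threshold: 10
            drop_failure_percent_threshold: 10
```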
anomaly_exclude_metrics
```yml
anomaly_exclude_metrics: [SQL where expression on fields metric_date / metric_time_bucket / metric_value]
```
This parameter allows for the exclusion of certain metrics from the training set to enhance test accuracy. By default, all data points in the training set are used for comparison. However, specific metrics can be excluded by applying a filter based on an SQL where expression.
The filter can target the following fields:
- `metric_date` - The date associated with the relevant bucket, applicable even for non-daily buckets.
- `metric_time_bucket` - The precise time bucket.
- `metric_value` - The metric's value.
To use this feature, specify a valid SQL where expression focusing on the columns metric_date, metric_time_bucket, and metric_value. This approach helps refine the training set by removing outliers or irrelevant data points, thereby improving the precision of anomaly detection.
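For illustration, an exclusion filter might look like this (the model name, date, and value are placeholders for a known incident):

```yml
models:
  - name: orders
    tests:
      - elementary.volume_anomalies:
          timestamp_column: updated_at
          # Drop a known bad day and near-empty buckets from the training set
          anomaly_exclude_metrics: "metric_date = '2024-01-01' or metric_value < 10"
```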
training_period
```yml
training_period:
  period: < time period > # supported periods: day, week, month
  count: < number of periods >
```
Defines the maximum duration for data collection, encompassing both the training and detection periods. Should a detection delay be specified, the entire training period is adjusted accordingly.
How it works
The `training_period` parameter is effective for tests configured with a `timestamp_column`, and it influences how historical data is utilized based on the table's materialization:
- Regular tables and views: Each run calculates values across the entire `training_period`.
- Incremental models and sources: Initial and full-refresh tests calculate the full `training_period`. Subsequent runs focus on the `detection_period`.

Changes from default:
- Full time buckets: To ensure complete time buckets, the `training_period` is adjusted as needed. For instance, with a weekly `time_bucket` (period: week), if a 14-day period ends on a Tuesday, the period is extended to include a full week starting from Sunday.
- Seasonality training set: When seasonality is applied, the `training_period` is extended to gather sufficient data for each seasonality aspect (e.g., `day_of_week`) to accurately detect anomalies.

Impact of adjusting training_period:
- Increasing `training_period`: Results in a larger training set, providing a broader data range for establishing the expected range. This generally reduces the test's sensitivity to outliers, decreasing the likelihood of false positives but requiring a higher anomaly threshold.
- Decreasing `training_period`: Leads to a smaller training set, limiting the data range for the expected range calculation. This may increase the test's sensitivity to outliers, elevating the risk of false positives but lowering the anomaly threshold for detection.
Default: 14 days
Relevant tests: Anomaly detection tests that utilize a timestamp_column
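A sketch of widening the training window (the model name and 30-day count are illustrative):

```yml
models:
  - name: orders
    tests:
      - elementary.volume_anomalies:
          timestamp_column: updated_at
          # Broader training set than the default 14 days
          training_period:
            period: day
            count: 30
```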
detection_period
```yml
detection_period:
  period: < time period > # supported periods: day, week, month
  count: < number of periods >
```
This setting specifies the length of the detection period. For example, if set to 2 days, only data points from the last 2 days are considered for anomaly detection. Similarly, setting it to 7 days means the detection window extends to the last 7 days.
In the context of incremental models, the `detection_period` also determines how frequently metrics are recalculated. If metrics within this period have been previously calculated, Elementary will update them to account for any recent backfills or data updates. Adjust this configuration based on the typical delays in your data processing to ensure timely and accurate anomaly detection.
How it works
The `detection_period` defines the timeframe for anomaly detection, with its application varying by the table's materialization type:
- Regular tables and views: Sets the period for analyzing data for anomalies.
- Incremental models and sources: Besides detection, it also dictates the timeframe for recalculating metrics to reflect recent data changes or backfills.
Default: 2 days
Relevant tests: Anomaly detection tests that utilize a timestamp_column
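For example, extending the detection window (the model name and values are illustrative):

```yml
models:
  - name: orders
    tests:
      - elementary.volume_anomalies:
          timestamp_column: updated_at
          # Scan the last 7 days for anomalies instead of the default 2
          detection_period:
            period: day
            count: 7
```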
time_bucket
```yml
time_bucket:
  period: < time period > # supported periods: hour, day, week, month
  count: < number of periods >
```
This parameter sets the granularity of time buckets for data analysis.
Data is segmented into time buckets to track changes and identify anomalies. For instance, with a daily time bucket (`period: day`, `count: 1`), the test assesses daily row count variations.

Adjust this setting based on your data's characteristics and the resolution needed for anomaly detection. For hourly volume anomaly detection, configure it as `period: hour`, `count: 1`.
How it works
- The `training_period` and `detection_period` of the test might be extended to ensure full time buckets (for example, a full week from Sunday to Saturday).
- Weekly buckets start on the day that is configured as the week start in the data warehouse.
Default: time_bucket: {period: day, count: 1}
Relevant tests: Anomaly detection tests that utilize a timestamp_column
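A sketch of hourly bucketing for finer-grained detection (the model name is a placeholder):

```yml
models:
  - name: orders
    tests:
      - elementary.volume_anomalies:
          timestamp_column: updated_at
          # Hourly buckets instead of the default daily bucket
          time_bucket:
            period: hour
            count: 1
```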
seasonality
```yml
seasonality: day_of_week | hour_of_day | hour_of_week
```
The `seasonality` configuration is crucial for datasets with predictable, repeating patterns over time. It enhances the precision of anomaly detection by taking these regular patterns into account, helping to reduce false positives and avoid missed anomalies.
Supported seasonality configurations:
- `day_of_week`: Aligns daily data buckets for comparison based on the day of the week, ensuring each day is compared with the same weekdays from the past.
- `hour_of_day`: For hourly data buckets, aligns them by the hour of the day, comparing, for example, 10:00-11:00 AM across different days.
- `hour_of_week`: Combines both day and hour for a more granular weekly pattern, comparing specific hours on specific days across weeks, like 10:00-11:00 AM on Sundays to the same timeframe on previous Sundays.
How it works
- The test compares the metric value of a current bucket not to its immediate predecessor but to previous buckets sharing the same seasonality attribute. This means, for instance, a Monday's data is compared against past Mondays, providing a more accurate anomaly detection basis.
- To ensure a sufficient historical data set for comparison, the `training_period` is automatically adjusted when seasonality is applied. For example, when `seasonality: day_of_week` is configured, the `training_period` is by default multiplied by 7, ensuring there is enough data from each day of the week to form a robust training set.
Example use case for seasonality
Different days of the week may show varying activity levels in many datasets, with weekends often seeing lower volumes than weekdays. Applying the `day_of_week` seasonality means the expected range for each day's data is based on historical data from the same weekday, accommodating the normal fluctuations seen throughout the week.
Default: none
Relevant tests: Anomaly detection tests that utilize a `timestamp_column` and a 1-day `time_bucket`.
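A sketch of applying weekly seasonality to daily buckets (the model name is a placeholder):

```yml
models:
  - name: orders
    tests:
      - elementary.volume_anomalies:
          timestamp_column: updated_at
          time_bucket:
            period: day
            count: 1
          # Compare each day to the same weekday in previous weeks
          seasonality: day_of_week
```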
detection_delay
```yml
detection_delay:
  period: < time period > # supported periods: hour, day, week, month
  count: < number of periods >
```
Specifies the time to exclude from the end of the detection period. This is beneficial when recent data might not be fully available or reliable, such as in cases of scheduling discrepancies where tests precede data population. Essentially, it's the buffer period post-detection to omit from analysis.
Default: 0
Relevant tests: Anomaly detection tests that utilize a `timestamp_column`.
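For instance, buffering against late-arriving data (the model name and 3-hour delay are illustrative):

```yml
models:
  - name: orders
    tests:
      - elementary.volume_anomalies:
          timestamp_column: updated_at
          # Ignore the most recent 3 hours while data is still landing
          detection_delay:
            period: hour
            count: 3
```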
column_anomalies
Select which monitors to activate as part of the test.
Default monitors by type:
| Data quality metric | Column type |
| --- | --- |
| null_count | any |
| null_percent | any |
| min_length | string |
| max_length | string |
| average_length | string |
| missing_count | string |
| missing_percent | string |
| min | numeric |
| max | numeric |
| average | numeric |
| zero_count | numeric |
| zero_percent | numeric |
| standard_deviation | numeric |
| variance | numeric |
Opt-in monitors by type:
| Data quality metric | Column type |
| --- | --- |
| sum | numeric |
Default: default monitors
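As an illustration, opting in to a subset of monitors on a column (the model and column names are placeholders):

```yml
models:
  - name: orders
    columns:
      - name: amount
        tests:
          - elementary.column_anomalies:
              # Only these monitors run, including the opt-in sum monitor
              column_anomalies:
                - null_count
                - average
                - sum
```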
exclude_prefix
```yml
exclude_prefix: [string]
```
This parameter is specific to the `all_columns_anomalies` test, allowing the exclusion of columns from the test based on their prefix. Any column whose name starts with the specified prefix will not be included in the anomaly detection process. This is particularly useful for selectively ignoring columns that may not be relevant or could skew the results of the anomaly detection.
Default: none
exclude_regexp
```yml
exclude_regexp: [regex]
```
This parameter is specific to the `all_columns_anomalies` test, allowing the exclusion of columns based on a regular expression match. Columns whose names match the provided pattern are excluded from the anomaly detection process. This is useful for filtering out columns dynamically based on naming conventions or patterns, ensuring that only relevant data is analyzed for anomalies.
Default: none
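A combined sketch of both exclusion parameters (the model name, prefix, and pattern are placeholders):

```yml
models:
  - name: orders
    tests:
      - elementary.all_columns_anomalies:
          timestamp_column: updated_at
          # Skip temp columns such as tmp_debug_id
          exclude_prefix: tmp_
          # Skip columns whose names end with _deprecated
          exclude_regexp: ".*_deprecated$"
```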