Snapshots

Snapshots

Add new snapshot.

Add new snapshot.

Overview

Analysts often need to examine historical data. While many source systems provide this capability, some do not offer ways to review table history. Snapshots enable you to handle Slowly Changing Dimensions type 2 (opens in a new tab) over mutable source tables, allowing users to track records that change over time directly in Y42.

Example: Tracking customer addresses

Imagine you have a customers table with fields like address. People move, and when they do, their addresses in the table change. Snapshots help you retain this historical information for better data analysis.

idnameaddresscity
1Henry HavingKoenigsplatz 10Cologne
2Garry GivingPrinzenpark 322Munich

If Henry changes his address to 1 Hacker Road:

idnameaddresscity
1Henry Having1 Hacker RoadPalo Alto
2Garry GivingPrinzenpark 322Munich

We will lose the information about his previous address, making it challenging to analyze data and answer questions such as the average shipping time to a specific country or city.

Snapshots address this issue by preserving these changes in the table:

idnameaddresscityupdated_atvalid_fromvalid_to
1Henry HavingKoenigsplatz 10Cologne2023-10-012023-10-012023-10-27
2Garry GivingPrinzenpark 322Munich2023-10-012023-10-01null
1Henry Having1 Hacker RoadPalo Alto2023-10-272023-10-27null

How to create snapshots

List view

  1. Create a new asset: Click on the + sign at the top of the left sidebar to create a new asset.
  2. Select asset type: From the options presented, choose Snapshot.
  3. Name your snapshot: Provide a name for the snapshot and select Create.
Create Snapshot in List view.

Create Snapshot in List view.

  1. Configure snapshot settings: Begin by setting up the necessary configurations for snapshots via our user interface:
Configure snapshot.

Configure snapshot.

  • Target Database and Target Schema: Mandatory fields that inform Y42 where to store the snapshot.
  • Unique_key: Helps Y42 identify unique rows in the source table to track changes.
  • Strategy: Two strategies exist: timestamp and check. Each runs a slightly different query against the DWH. Learn more about selecting the appropriate strategy.
  • Updated At / Check Cols: These settings vary based on the strategy selected.
  • Invalidate hard deletes: This is an opt-in feature that allows for invalidating hard-deleted records during snapshotting. It is disabled by default.

Code View

A complete snapshot query in Code view looks like this:

snapshots/orders_snapshot_timestamp.sql

_14
{% snapshot orders_snapshot %}
_14
_14
{{
_14
config(
_14
target_database='analytics',
_14
target_schema='snapshots',
_14
unique_key='id',
_14
strategy='timestamp',
_14
updated_at='updated_at',
_14
)
_14
}}
_14
select * from {{ source('jaffle_shop', 'orders') }}
_14
_14
{% endsnapshot %}

Timestamp and Check strategies

Timestamp strategy (Recommended)

The timestamp strategy relies on an updated_at field to identify row changes. If the updated_at column for a row has a timestamp more recent than the last snapshot, Y42 will invalidate the old record and store the new one. If the timestamps are the same, dbt will not make any changes.

Required configurations for Timestamp Strategy

ConfigDescriptionExample
updated_atA column that indicates when the source row was last updatedupdated_at

Example usage:

snapshots/orders_snapshot_timestamp.sql

_11
{{
_11
config(
_11
target_database='analytics',
_11
target_schema='snapshots',
_11
unique_key='id',
_11
strategy='timestamp',
_11
updated_at='updated_at',
_11
)
_11
}}
_11
_11
select * from {{ source('jaffle_shop', 'orders') }}

Check strategy

The check strategy is suitable for tables lacking a reliable updated_at column. It compares a specified list of columns between the current and historical records. If any column has changed, dbt will invalidate the old record and capture the new one. If the columns remain the same, no action is taken.

Required configurations for Check Strategy:

ConfigDescriptionExample
check_colsA list of columns to monitor for changes, or 'all' to check all columns["name", "email"] or 'ALL'

Example usage:

snapshots/orders_snapshot_timestamp.sql

_11
{{
_11
config(
_11
target_database='analytics',
_11
target_schema='snapshots',
_11
unique_key='customer_id',
_11
strategy='check',
_11
check_cols='all',
_11
)
_11
}}
_11
_11
select * from {{ source('jaffle_shop', 'orders') }}

snapshots/orders_snapshot_timestamp.sql

_11
{{
_11
config(
_11
target_database='analytics',
_11
target_schema='snapshots',
_11
unique_key='customer_id',
_11
strategy='check',
_11
check_cols=['column1', 'name', 'birthdate'],
_11
)
_11
}}
_11
_11
select * from {{ source('jaffle_shop', 'orders') }}

Query best practices

For optimal performance, start with the simplest query. If you experience performance or cost issues, consider selecting specific columns.

snapshots/orders_snapshot_timestamp.sql

_10
{{
_10
config(
_10
...
_10
)
_10
}}
_10
select
_10
col1,
_10
col2,
_10
col3
_10
from {{ source('jaffle_shop', 'orders') }}

Running snapshot jobs

Snapshots can only be triggered in the default branches you've configured (e.g., main). This is to prevent snapshot runs in development branches from accidentally overwriting your main snapshot table. Each job writes into the same DWH table, unless you modify the configs.