Incremental models
Created as tables in your data warehouse, incremental models initially transform all of your source data rows. On subsequent runs, they target specific rows based on a user-provided filter, typically rows that were newly added or updated since the last run.
This incremental loading mechanism significantly reduces transformation runtime and compute costs, improving overall warehouse performance.
Building incremental models
Building an incremental model follows the same initial process as building a model materialized as a table or view. However, you will also need to include both the incremental configuration and the filtering criteria.
Here's an example to illustrate the structure of an incremental model:
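(A minimal sketch; the stg_events model and the event_id, event_type, and created_at columns are illustrative placeholders.)

```sql
{{
    config(
        materialized='incremental',
        unique_key='event_id'
    )
}}

select
    event_id,
    event_type,
    created_at
from {{ ref('stg_events') }}

{% if is_incremental() %}
-- only process rows added since the last run of this model
where created_at > (select max(created_at) from {{ this }})
{% endif %}
```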
1. Configure the model as incremental. Define the materialization type as part of the config block.
2. Define the model's logic. Provide the SQL logic that will act as the base for your incremental model.
3. Use the is_incremental() macro to filter for "new" rows, which are those added since the last model run. Use the {{ this }} variable to easily query your target table for the latest timestamp.
4. Set an optional unique_key to identify rows that need to be updated. When a unique_key is set, Y42 merges the new data into the zero-copy clone of the existing table, allowing updates to existing rows. If no unique_key is specified, Y42 simply appends the data to the end of the zero-copy clone, allowing the addition of new rows only.
Merge strategy
Y42 supports various strategies for implementing incremental models, including the merge incremental strategy. This approach appends new rows or updates existing ones based on the unique_key configuration, mirroring the principles of Slowly Changing Dimension Type 1 (SCD1). New data overwrites old data when the key(s) match, while unmatched rows are appended, making this strategy ideal for scenarios where maintaining historical data variations is not required.
You can specify the incremental_strategy configuration for individual models or globally for all models in your dbt_project.yml file:
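(A sketch of the per-model config block, reusing the illustrative event_id key.)

```sql
{{
    config(
        materialized='incremental',
        incremental_strategy='merge',
        unique_key='event_id'
    )
}}
```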
or:
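(A sketch of the global config in dbt_project.yml; my_project stands in for your project name.)

```yaml
models:
  my_project:
    +incremental_strategy: "merge"
```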
The above configuration translates into a SQL MERGE statement as follows:
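(A simplified sketch; table and column names are illustrative, and the exact SQL generated varies by warehouse.)

```sql
merge into analytics.my_model as dest
using my_model__tmp as src
    on dest.event_id = src.event_id
when matched then update set
    event_type = src.event_type,
    created_at = src.created_at
when not matched then insert
    (event_id, event_type, created_at)
    values (src.event_id, src.event_type, src.created_at);
```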
Y42 vs. dbt incremental model execution
In Y42, the tables in your data warehouse (DWH) are synchronized with your code. This abstraction eliminates concerns about data loss due to errors overwriting a table in your DWH. Unlike with dbt, expensive full-refreshes are not necessary. If a job runs with incorrect configurations, erroneous queries, or source issues, you can revert your changes and restore the model's data to its previous state.
In contrast, each dbt run overwrites your existing DWH table. If the table was incrementally built, overwriting it with incorrect data can be costly. You would have to perform a full-refresh, which undermines the efficiency of incremental models.
BigQuery limitations
In BigQuery, tables are not partitioned or clustered by default. This means that even if you use a WHERE clause to filter data, the entire table is scanned. Consequently, incremental models like the following will read the whole source table (e.g., raw_app_data.events) on every run:
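(A sketch; created_at is an illustrative timestamp column on raw_app_data.events.)

```sql
{{ config(materialized='incremental') }}

select * from raw_app_data.events

{% if is_incremental() %}
-- the filter limits the rows returned, but BigQuery still scans
-- the entire unpartitioned table to evaluate it
where created_at > (select max(created_at) from {{ this }})
{% endif %}
```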
Therefore, the incremental model does not reduce cost, since BigQuery bills by bytes scanned. It does save time, however, because any filter or limit reduces execution time in BigQuery.
Incremental predicates
incremental_predicates offer an advanced approach for managing incremental models whose data volume is large enough to justify further performance optimization. This configuration accepts a list of valid SQL expressions. Note that Y42 does not verify the syntax of these SQL statements.
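(A sketch of how this might look: the DBT_INTERNAL_DEST alias follows dbt's convention for the target table, dateadd assumes Snowflake syntax, and event_id and created_at are illustrative.)

```sql
{{
    config(
        materialized='incremental',
        incremental_strategy='merge',
        unique_key='event_id',
        incremental_predicates=[
            "DBT_INTERNAL_DEST.created_at > dateadd(day, -7, current_date)"
        ]
    )
}}
```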
FAQ
What happens if there was a mistake and I have to roll back my incremental model?
This isn't a problem, as we use zero-copy clones for incremental models, meaning every incremental job gets a new table. If the incremental job fails your tests or the incrementally updated data is incorrect, you can roll back as you would with any other Y42 model and restore the correct data. There's no need to worry about mistakes in code or source data breaking your downstream dependencies or stakeholder-facing tables, and you don't need to run an expensive full-refresh to recover from mistakes.
What happens when the schema of my upstream model changes?
At the moment, we perform a full-refresh in this case. In the future, we plan to allow more fine-grained control from the user side.
When do we trigger a full-refresh instead of an incremental job?
Currently, when an upstream dependency changes or something in the incremental model changes, we trigger a full-refresh. In the future, we will allow more fine-grained control within configs.
How do I partition a model in Y42?
This is not possible today. We will enable this feature soon.