dbt sources: Managing Data Sources in dbt
In dbt, sources are objects representing raw data tables loaded into your data warehouse by your ELT processes. By defining these tables as sources
within your dbt project, you can manage data lineage, validate assumptions about data integrity, and monitor the timeliness of your data with freshness checks (opens in a new tab).
Declaring
Sources in dbt are defined within YAML
files under the sources:
key, typically stored in the models/
directory.
These definitions allow you to specify the database
and schema
of your source tables, and assign a logical name
used within dbt.
Example of a source declaration:
_10sources:_10 - name: sales_data_10 database: analytics_10 schema: raw_sales_10 tables:_10 - name: transactions_10 - name: customers
This setup enables you to manage source metadata centrally and leverage dbt’s powerful features to interact with these tables directly in your transformation models.
Using the {{ source() }}
Function
Once a source is defined, you can reference it in your models using the {{ source() }}
function. This not only helps in creating clean and manageable code but also ensures that dependencies are correctly understood and handled by dbt.
Example of selecting from a source in a model
_10select t.customer_id, sum(t.amount) as total_spent_10from {{ source('sales_data', 'transactions') }} as t_10join {{ source('sales_data', 'customers') }} as c on t.customer_id = c.id_10group by t.customer_id;
This SQL snippet joins transactions and customers tables from the sales_data source.
Testing and Documenting Sources
Adding tests to sources is a best practice that helps ensure your assumptions about the underlying data hold true throughout the lifecycle of your data products.
Example of adding tests and descriptions to sources
_19sources:_19 - name: sales_data_19 description: "Data loaded from the online transaction system."_19 tables:_19 - name: transactions_19 description: "Records of all sales transactions."_19 tests:_19 - unique:_19 column_name: transaction_id_19 - not_null:_19 column_name: transaction_date_19 - name: customers_19 description: "Customer information."_19 columns:_19 - name: id_19 description: "Unique customer identifier."_19 tests:_19 - unique_19 - not_null
This configuration enhances the maintainability of your project by making it easier for new team members to understand the data structures and for existing team members to ensure data quality.
Monitoring Data Freshness
dbt allows you to configure freshness tests for your sources, which can alert you if your data isn't updated within expected timeframes, critical for maintaining SLAs.
Example of configuring source freshness
_11sources:_11 - name: sales_data_11 freshness:_11 warn_after: { count: 24, period: hour }_11 error_after: { count: 48, period: hour }_11 tables:_11 - name: transactions_11 loaded_at_field: last_updated_11 freshness:_11 warn_after: { count: 6, period: hour }_11 error_after: { count: 12, period: hour }
By setting these parameters, dbt helps you keep track of how current your data is, enabling proactive management of data pipelines.
Manage Sources and dbt Models in one place
Build end-to-end pipelines using a single framework.
Get Started