dbt sources: Managing Data Sources in dbt

In dbt, sources are objects representing raw data tables loaded into your data warehouse by your ELT processes. By defining these tables as sources within your dbt project, you can manage data lineage, validate assumptions about data integrity, and monitor the timeliness of your data with freshness checks (opens in a new tab).

Declaring

Sources in dbt are defined within YAML files under the sources: key, typically stored in the models/ directory.

These definitions allow you to specify the database and schema of your source tables, and assign a logical name used within dbt.

Example of a source declaration:


_10sources:
_10  - name: sales_data
_10    database: analytics
_10    schema: raw_sales
_10    tables:
_10      - name: transactions
_10      - name: customers

This setup enables you to manage source metadata centrally and leverage dbt’s powerful features to interact with these tables directly in your transformation models.

Using the `{{ source() }}` Function

Once a source is defined, you can reference it in your models using the {{ source() }} function. This not only helps in creating clean and manageable code but also ensures that dependencies are correctly understood and handled by dbt.

Example of selecting from a source in a model


_10select t.customer_id, sum(t.amount) as total_spent
_10from {{ source('sales_data', 'transactions') }} as t
_10join {{ source('sales_data', 'customers') }} as c on t.customer_id = c.id
_10group by t.customer_id;

This SQL snippet joins transactions and customers tables from the sales_data source.

Testing and Documenting Sources

Adding tests to sources is a best practice that helps ensure your assumptions about the underlying data hold true throughout the lifecycle of your data products.

Example of adding tests and descriptions to sources


_19sources:
_19  - name: sales_data
_19    description: "Data loaded from the online transaction system."
_19    tables:
_19      - name: transactions
_19        description: "Records of all sales transactions."
_19        tests:
_19          - unique:
_19              column_name: transaction_id
_19          - not_null:
_19              column_name: transaction_date
_19      - name: customers
_19        description: "Customer information."
_19        columns:
_19          - name: id
_19            description: "Unique customer identifier."
_19            tests:
_19              - unique
_19              - not_null

This configuration enhances the maintainability of your project by making it easier for new team members to understand the data structures and for existing team members to ensure data quality.

Monitoring Data Freshness

dbt allows you to configure freshness tests for your sources, which can alert you if your data isn't updated within expected timeframes, critical for maintaining SLAs.

Example of configuring source freshness


_11sources:
_11  - name: sales_data
_11    freshness:
_11      warn_after: { count: 24, period: hour }
_11      error_after: { count: 48, period: hour }
_11    tables:
_11      - name: transactions
_11        loaded_at_field: last_updated
_11        freshness:
_11          warn_after: { count: 6, period: hour }
_11          error_after: { count: 12, period: hour }

By setting these parameters, dbt helps you keep track of how current your data is, enabling proactive management of data pipelines.

Manage Sources and dbt Models in one place

Build end-to-end pipelines using a single framework.

Get Started