Introduction

dbt is a powerful data transformation tool for cloud data warehouses like BigQuery. It helps you streamline operations in your warehouse, ensuring the data you need is refreshed automatically and always available.

One of the benefits of dbt is the ability to harness BigQuery’s specific functionalities like timeouts, sample functions, partitioning, and clustering directly within your transformations. dbt allows you to set these up dynamically, improving query performance and saving costs, without the need to constantly monitor your warehouse.

In this article, we will discuss best practices when setting up dbt and BigQuery to harness the powerful capabilities of each tool.

Configure timeouts and retries

Specifying timeout and retry parameters is key when using any data warehouse. When you don't set them, long-running queries can quickly eat away at your bank account. Transient errors can also fail a run outright when a single retry would have fixed the disruption in your pipeline. Setting these parameters is a good way to prevent ongoing failures and an accumulation of costs.

With BigQuery, dbt doesn't necessarily enforce a timeout by default (in some setups the default is 300 seconds), so it's best to set one explicitly for your data environment. The timeout period specifies how long a query can run before dbt marks the run as a failure. Long-running queries are common when something isn't properly set up in your data environment, so timeouts protect you from this.

To specify this in dbt, add the job_execution_timeout_seconds setting to the relevant output in your profiles.yml.

hi-y42:
  target: prod
  outputs:
    prod:
      type: bigquery
      method: oauth
      project: y42-prod-env
      dataset: hello_world
      job_execution_timeout_seconds: 300

Here, I set the timeout to 300s, or 5 minutes. Queries typically shouldn't run longer than this unless they are very inefficient, in which case you should optimize them anyway.

Note that this is the timeout for the prod output. If you have other outputs, make sure to specify the timeout for those as well. For dev, you may want to reduce the timeout even further, since there is rarely a reason for queries there to run longer than a minute.

You can also set the number of retries here. Retries are the number of times dbt will re-attempt a query after a failure before marking the run as failed. The default is 1. You can disable retries entirely or increase this number, depending on how often your queries fail due to minor issues that typically resolve themselves on subsequent runs.

You can specify this using the job_retries setting.

hi-y42:
  target: prod
  outputs:
    prod:
      type: bigquery
      method: oauth
      project: y42-prod-env
      dataset: hello_world
      job_execution_timeout_seconds: 300
      job_retries: 2

Again, make sure you do this for any other outputs as well. For dev, you may want to set this to 0.
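
For example, you could add a dev output with tighter settings next to prod in the same profile. This is just a sketch; the dev project and dataset names, the 60-second timeout, and the zero retries are illustrative values, so pick whatever fits your environment.

hi-y42:
  target: prod
  outputs:
    prod:
      type: bigquery
      method: oauth
      project: y42-prod-env
      dataset: hello_world
      job_execution_timeout_seconds: 300
      job_retries: 2
    dev:
      type: bigquery
      method: oauth
      project: y42-dev-env          # illustrative dev project
      dataset: hello_world_dev      # illustrative dev dataset
      job_execution_timeout_seconds: 60   # fail fast in development
      job_retries: 0                      # don't retry in development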

Utilize the tablesample clause when testing

BigQuery offers a powerful clause called tablesample. When applied to a table in your query, it returns a random subset of that table's data. Instead of reading the entire table, BigQuery reads only a random selection of the table's data blocks, roughly matching the percentage you specify.

Tablesample is powerful because it limits the data you scan in the first place, unlike WHERE and LIMIT, which only filter rows after the data has already been read. Scanning less data saves computing power, and thus time and money.

With dbt, you'll want to use this clause in development, when testing your data models. There's no need to run your code against all the data you have available in development, making it the perfect place to sample.

A model utilizing this clause in dbt would look something like this:

select
  account_id,
  account_type,
  email
from {{ ref('stg_raw_data__accounts') }} tablesample system (10 percent)
where account_status = 'active'

Using tablesample here reads only about 10% of the table's data, compared to the original query without it.

Now, we don't want to accidentally use this clause in production. And because the same dbt model code runs in both development and production, even when your targets point to different locations, you need to add a dbt-specific configuration.

dbt makes it easy to add environment-specific code in the form of Jinja in your SQL models. To do this, add an if statement that checks for your development target.

{% if target.name == 'dev' %}

Then, below this Jinja code, include the code that you want to run only in development. End the code block with {% endif %}.

With our example, this would look something like this:

select
  account_id,
  account_type,
  email
from {{ ref('stg_raw_data__accounts') }}
{% if target.name == 'dev' %}
tablesample system (10 percent)
{% endif %}
where account_status = 'active'

This way, the tablesample code will only be applied when your target is set to your development environment.

You can also add this sampling to your staging models in dbt to limit all source data. However, this can get tricky when doing joins, because the sampled subsets of different tables may not contain matching rows. I recommend focusing on large core models when using this clause, as those tend to take up the most storage and computing power.

Partition or cluster your tables

BigQuery offers you the ability to partition or cluster your tables, improving performance and reducing costs for high-volume tables. Partitioned tables are divided into segments based on a column's values, so queries that filter on that column only scan the relevant partitions. Clustering works by sorting the rows within a table based on the values of one or more specified columns.

Partitioning is a good fit for large tables, and because only the relevant partitions are scanned, it also makes it easier to see how much data (and therefore cost) a query will involve before running it. In dbt, you can specify a partition_by block in the configuration of your SQL models.

{{
  config(
    materialized = 'table',
    partition_by = {"field": "account_created_at", "data_type": "timestamp"}
  )
}}

This configuration is for a model materialized as a table and partitioned by the account_created_at field.
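
To illustrate the benefit, a downstream query (or dbt model) that filters on the partitioning column scans only the matching partitions rather than the whole table. This is just a sketch; the upstream model name accounts is hypothetical.

-- lets BigQuery prune partitions and scan roughly the last 7 days of data
select
  account_id,
  account_type
from {{ ref('accounts') }}  -- hypothetical model partitioned on account_created_at
where account_created_at >= timestamp_sub(current_timestamp(), interval 7 day)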

Clustering is a good fit for tables that are frequently filtered, for example in a where clause, or aggregated on particular columns. Clustered tables are commonly used for dbt incremental models to make them cheaper and faster to build.

Similar to partitioning, clustering can be defined in a model’s configuration, like so:

{{
  config(
    materialized = "table",
    cluster_by = ["account_id", "account_created_at"]
  )
}}

Here, we are helping to improve the table’s performance by clustering it on account_id and account_created_at. Now, every time this model is built by dbt, it will be built with performance and costs in mind.
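
As a sketch of what this buys you, a query that filters on the leading cluster column lets BigQuery skip storage blocks that can't contain matching rows. The model name and literal value below are hypothetical.

-- clustering on account_id lets BigQuery skip blocks without this value
select
  account_id,
  account_created_at,
  account_type
from {{ ref('accounts') }}  -- hypothetical model clustered on account_id, account_created_at
where account_id = 'acct_12345'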

Choose the right strategy for incremental models

If you aren’t familiar with incremental models, they are a popular materialization strategy used in dbt models. They allow you to build your larger, more compute-intensive models incrementally, ensuring you aren’t running your SQL code on huge data tables every single time. In general, these are a great way to save time and money.

dbt typically uses merge statements behind the scenes in your cloud data warehouse of choice. However, these require scanning the source and destination tables. While they are faster and cheaper than rebuilding the model every time it's run, they can still be slow and expensive for large tables. In addition, the merge strategy needs a unique_key to match and update existing rows.
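
For reference, a basic merge-based incremental model, the default on BigQuery, might look something like the sketch below. The updated_at column is an assumption here; use whatever change-tracking column your source actually has.

{{
  config(
    materialized = 'incremental',
    unique_key = 'account_id'
  )
}}

select
  account_id,
  account_status,
  updated_at  -- assumed change-tracking column
from {{ ref('stg_raw_data__accounts') }}

{% if is_incremental() %}
-- on incremental runs, only process rows that changed since the last build
where updated_at > (select max(updated_at) from {{ this }})
{% endif %}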

You can save even more computing power by taking advantage of BigQuery’s insert_overwrite method when building incremental models in dbt. This technique generates merge statements as well, but ones that replace entire partitions.

To take advantage of this, you must use the partition_by configuration that we covered above. You can set this directly within the configuration for your model.

{{
  config(
    materialized = 'incremental',
    incremental_strategy = 'insert_overwrite',
    partition_by = {'field': 'account_created_at', 'data_type': 'timestamp'}
  )
}}

You can also pass a static list of partitions to the partitions config, typically defined in a Jinja variable such as partitions_to_replace. While the default dynamic behavior is simpler and doesn't require this, specifying the partitions explicitly can increase your cost savings even more.

{% set partitions_to_replace = [
  'timestamp(current_date)',
  'timestamp(date_sub(current_date, interval 1 day))'
] %}

{{
  config(
    materialized = 'incremental',
    incremental_strategy = 'insert_overwrite',
    partition_by = {'field': 'account_created_at', 'data_type': 'timestamp'},
    partitions = partitions_to_replace
  )
}}

This will replace today's and yesterday's data in your destination each time it runs. According to dbt's documentation, this is the fastest and most cost-effective way to run incremental models on BigQuery.
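
To get the full benefit, the model body usually also restricts the source data to those same partitions on incremental runs. Here is a sketch following that pattern, with the column names carried over from the earlier examples.

select
  account_id,
  account_type,
  account_created_at
from {{ ref('stg_raw_data__accounts') }}

{% if is_incremental() %}
-- only read and rebuild the partitions listed in partitions_to_replace
where timestamp_trunc(account_created_at, day) in ({{ partitions_to_replace | join(',') }})
{% endif %}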

Conclusion

While cloud data warehouses have many benefits, it can be daunting to start using them when you have little visibility into the costs of running orchestrated transformations. Luckily, dbt integrates with BigQuery's performance configurations, allowing you to automate cost savings and increase the performance of your models at the same time.

Taking advantage of BigQuery features like timeouts, partitioning, and sampling ensures you aren't racking up unnecessary costs. You can take advantage of all dbt has to offer while still being cognizant of the data your transformations run on and how those models are built in your data warehouse.

Automating your transformations and leveraging a cloud data warehouse doesn’t have to be stressful and costly. You just need to ensure you are implementing the best practices that make your life easier.
