After a year of discussions on BigQuery vs Databricks vs Snowflake, your company has finally decided to migrate to Snowflake. Life couldn’t be better. Your data is yielding its first insights, people are sending more requests to the data team, and everyone can easily create and share worksheets for analyses.
Perhaps it's a little too easy, because after a few months you're juggling dozens of worksheets across the company! Queries are copy/pasted across ten different places, teams have mismatched calculations for key metrics, changes are hard to implement and even harder to test, and data quality is steadily declining. As a result, data engineering time is becoming an ever-scarcer resource. Help!
Luckily, all is not lost. We can implement more rigorous engineering tools and practices that can help maximize data engineers’ impact. Because ingestion is already easier than ever, the majority of their work involves data transformation. After a recent Snowflake migration, you can likely find the quickest wins in good tooling for transforming data.
Why do most data teams extend Snowflake?
While Snowflake offers a fair starting point for data transformation, many teams rely on Snowflake’s extensive ecosystem of partners for additional tooling. The very existence of this ecosystem is a conscious decision on Snowflake’s part. Rather than build a platform that just works okay for everyone, they let you tailor your experience to work great for your team specifically.
We can pick and choose integrations to tailor our warehouse to our specific needs. For example, to help:
- Implement data quality tests and data lineage
- Create pipelines to automate queries according to dependencies
- Separate dev and prod environments to avoid breaking live pipelines
- Fully modularize and version control pipeline components
- Use advanced modeling techniques such as Data Vault 2.0
The downside is that customization is indeed often needed. “Vanilla” Snowflake leaves gaps in its default offering and relies on its ecosystem to fill them. Implementing additional tooling may seem a bit daunting, but it is an opportunity to massively speed up our data engineering efforts. After all, a fine-tuned data warehouse helps you extract value from data much faster.
Upgrading data transformation on Snowflake with dbt
As outlined above, we can achieve the quickest wins in data transformation. One broadly adopted tool is dbt, or data build tool. How can dbt help us remedy the previously listed growing pains? This is not an exhaustive list, but dbt enhances Snowflake with:
- Modular pipelines that provide reusable elements
- Central catalog for a single source of truth
- Data lineage for full transparency of dependencies between models
- Separate dev environments for implementing and verifying changes
- Automated testing for increased data quality
In short, on top of Snowflake, dbt helps tremendously in getting the most out of our valuable data engineering time! We’ll go over these five benefits of dbt and how they compare to “vanilla” Snowflake.
Modular pipelines with reusable elements
Snowflake lets you write plain SQL: you open a worksheet, write a query, and get the results on execution. This is simple, but it makes it hard to reuse code for multiple purposes. As a result, teams often end up copy/pasting queries across multiple worksheets. Maintainability suffers, and data definitions start to drift apart as some worksheets are updated while others remain unchanged.
dbt combines SQL with Jinja templating and structures a pipeline into models. The templating lets us dynamically reference other models, which means we don’t have to copy/paste code across worksheets. If we need to update a model, we can do so in one single place, and models are easy for our teammates to find and repurpose.
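As a rough sketch, assuming two upstream staging models named stg_orders and stg_payments (the names here are purely illustrative), a downstream model could look like this:

-- models/order_totals.sql (hypothetical model)
-- {{ ref() }} resolves to whatever table or view dbt builds for each model,
-- so nothing is hard-coded and nothing needs to be copy/pasted.
SELECT
    o.order_id,
    o.ordered_at,
    SUM(p.amount) AS total_amount
FROM {{ ref('stg_orders') }} AS o
LEFT JOIN {{ ref('stg_payments') }} AS p
    ON p.order_id = o.order_id
GROUP BY o.order_id, o.ordered_at

Because each model lives in its own version-controlled file, a change to stg_orders automatically flows through to every model that references it.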
Aside from cutting down on repetitive data engineering, maintainability and data quality benefit immensely.
Central catalog for single source of truth
Snowflake features extensive metadata, but most of it is hidden from view. Retrieving it often requires writing extra queries, which puts it out of reach for less technical users. Zooming out to see what data exists in the landscape is also tricky: we can see the table names, but not necessarily how they relate to our business.
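For example, just to look up the column comments of a single table, you might end up writing something along these lines against Snowflake’s INFORMATION_SCHEMA (the database, schema, and table names are illustrative):

-- List the columns of one table together with their data types and comments.
SELECT
    column_name,
    data_type,
    comment
FROM company.information_schema.columns
WHERE table_schema = 'PROD'
  AND table_name = 'ORDERS'
ORDER BY ordinal_position;

That works, but it is hardly something you want every analyst to do for every table they touch.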
dbt integrates metadata at the model and column levels, allowing us to document arbitrary details. When looking at a timestamp, for example, it is worth knowing its timezone; we can document this in dbt so that it is immediately obvious. This way we can construct a catalog that lets us assess the contents of our data landscape at a glance.
Having an extensive data catalog goes a long way in driving adoption. Finding relevant assets and understanding their meaning becomes easier. This not only boosts ease-of-use, but trust in data as well.
Data lineage
To trust our data, we need to know its provenance. Data lineage provides an overview of dependencies between models, but it is difficult to construct in Snowflake. You could query the access_history and query_history views, but this is an involved process with many manual steps. As a result, it is very challenging to construct a full data lineage view in “vanilla” Snowflake.
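To give an impression, a manual lineage query might look roughly like the sketch below. It assumes you have access to the SNOWFLAKE.ACCOUNT_USAGE views, and even then it only tells you which objects a given query read and wrote:

-- For every query that wrote to an object, list the objects it read from.
SELECT
    ah.query_id,
    read_obj.value:objectName::string    AS source_object,
    written_obj.value:objectName::string AS target_object
FROM snowflake.account_usage.access_history AS ah,
     LATERAL FLATTEN(input => ah.base_objects_accessed) AS read_obj,
     LATERAL FLATTEN(input => ah.objects_modified)      AS written_obj
WHERE written_obj.value:objectName IS NOT NULL;

In practice you would still join this to query_history for the query text, filter out system queries, and stitch the results into a readable graph yourself, and then keep all of that up to date.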
As outlined above, dbt allows us to dynamically reference other models in our code. dbt uses these relationships to construct a lineage view of the dependencies across your models. In other words, you get data lineage for free.
Having insightful data lineage not only increases trust in data, but also makes data engineering much easier. We can see at a glance how changes to one query affect models elsewhere. This means we can move a lot quicker when transforming data.
Separated dev environments
As an old adage from software engineering goes: “everyone has a dev environment; some people have the luxury of a separate production environment”. In a Snowflake worksheet, we hard-code references to tables or views. This means that we either duplicate code to account for dev and prod versions, or simply execute everything on prod data.
In dbt, references to tables and views are resolved dynamically when a model is compiled. Imagine a model with the following source code:
SELECT *
FROM {{ ref('orders') }}
When running against the prod environment, dbt compiles this into company.prod.orders. Similarly, in a dev environment dbt would generate the following query:
SELECT *
FROM company.dev.orders
Having a robust development environment is a must-have for any data-driven organization. We cannot avoid people making mistakes, but we can mitigate the effects of those mistakes. With the separated dev environments that dbt enables, we secure our production environment. Moreover, engineers will experiment more freely, resulting in new insights and innovations.
Data tests
Imagine you want to verify that no order has a negative total amount. In Snowflake, the only way to do this is a manually written query, such as:
SELECT
order_id,
SUM(amount) AS total_amount
FROM payments
GROUP BY order_id
HAVING total_amount < 0
With dbt, you get tests out-of-the-box. You can specify tests you want to run for columns, and dbt will generate and execute queries automatically. If a test fails, dbt will notify you and show the rows that didn’t pass the test.
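For example, the manual check from above can be captured as a singular dbt test: a SQL file saved in the tests/ directory that selects the rows violating the rule. If the query returns any rows, the test fails (the payments model name is illustrative):

-- tests/assert_order_totals_non_negative.sql (hypothetical file)
-- dbt treats every returned row as a failure of this test.
SELECT
    order_id,
    SUM(amount) AS total_amount
FROM {{ ref('payments') }}
GROUP BY order_id
HAVING total_amount < 0

Built-in generic tests such as not_null, unique, and accepted_values can likewise be attached to individual columns, and dbt generates the corresponding queries for you.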
Tests like these allow us to better verify our work, speeding up our data engineering efforts. Moreover, they boost trust in our data, which in turn drives adoption.
Conclusions
Snowflake is a great platform, but not a silver bullet for gaining value from our data. We need to tap into its ecosystem of partners to unlock its full potential.
dbt is a valuable addition to Snowflake that helps us throughout the transformation stages in data engineering. We looked at five specific benefits that will significantly improve the velocity of your data engineering efforts: modular pipelines with reusable components, a central catalog, data lineage, separated dev environments, and data tests.
There are additional benefits that we did not explore in detail, such as data documentation. That said, dbt also introduces some downsides. It requires some configuration and by default runs on a local machine, which means it also orchestrates pipeline executions locally. If you work in a team of multiple data engineers, your runs may interfere with one another, and you will need a disciplined approach to keep your code in sync with your colleagues’.
The next step: a central orchestrator
A central orchestrator is an excellent solution to overcome these challenges. It provides a single point from which pipeline runs are executed and a single source of truth for your configurations. This will simplify your team’s CI/CD and greatly boost your productivity.
However, it might be a bit daunting to set up and maintain the extra infrastructure. There’s a reason you chose Snowflake: you don’t want to maintain a fragile stack of six different tools, and neither do your users want to work across all of those separate tools. Instead, you want a turnkey solution that plugs into Snowflake and provides dbt, central orchestration, and even ingestion.
If this piques your interest, you might like Y42. We built it to be an orchestrator that helps you maximize your data engineering impact. Your entire team can work on dbt projects in a collaborative IDE, without wasting time and resources to configure complex infrastructure for additional tools. And it doesn’t just provide this environment for data transformation. Y42 also handles your data ingestion, pipeline monitoring, and data governance.
If you would like to learn how Y42 empowers you to get the most value from your Snowflake data warehouse, feel free to contact us and request a demo.