The latest dbt product announcements, while exciting, seem to offer incremental improvements instead of fundamentally transforming the way we manage data assets.

Last month, the data world paused to catch dbt's annual Coalesce event. The main theme was "dealing with complexity". The solution? An architectural mesh that splits a monolithic dbt project into smaller, interconnected ones, with contracts, versions, and tighter RBAC sprinkled on top to foster trust in large-scale data collaboration. New features like dbt Explorer, a cloud-based CLI, and CI/CD capabilities rolled out, but only for paying Cloud customers.

Meanwhile, dbt-core, the open-source lifeline of dbt Labs, stands still. The same dbt-core that people rallied behind for years to build the largest data community and establish it as the go-to for data transformation. Yet, this time, no significant investments were announced for it other than minor incremental fixes. With each new dbt release, the gap between dbt-core and dbt Cloud continues to widen.

Of the 30,000 companies using dbt, only 3,600 are dbt Cloud customers – that's just 12%. The remaining 88% of dbt's user base still relies on dbt-core, the open-source offering. Such heavy reliance on dbt-core by the community raises concerns about the long-term effects of its neglect. As dbt Labs continues to roll out updates tailored mainly to its premium service, dbt Cloud, one can't help but wonder about the disconnect that may arise from overlooking the needs of the majority.

What’s new from dbt?

dbt Cloud's product updates reveal a commitment to scalability and workflow optimization.

These developments are aimed at teams scaling their data practices, with an emphasis on collaboration, control, and documentation. We can group the new capabilities into two overarching themes: enhanced visibility and workflow improvements.

The mesh architecture required cross-project references, contracts, versions, and access control to govern information and let data flow securely from one project to another – all wrapped in a new, more performant documentation service called dbt Explorer.
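To make the contract piece concrete, here is a minimal sketch, assuming a hypothetical check script rather than dbt's actual enforcement logic: a model declares its public columns and types, and the build fails if the materialized table drifts from that declaration. The `declared` and `actual` dictionaries stand in for a schema.yml contract and the warehouse's information schema.

```python
# Illustrative sketch of contract enforcement (not dbt's implementation):
# a model declares its public schema, and the build fails if the
# materialized table drifts away from that declaration.

declared = {  # hypothetical contract, as it might appear in a schema.yml
    "customer_id": "integer",
    "lifetime_value": "numeric",
}

actual = {  # hypothetical result of querying the warehouse's information schema
    "customer_id": "integer",
    "lifetime_value": "text",  # drifted: someone changed a cast upstream
}

def check_contract(declared: dict[str, str], actual: dict[str, str]) -> list[str]:
    """Return the list of contract violations between declared and actual columns."""
    violations = []
    for column, dtype in declared.items():
        if column not in actual:
            violations.append(f"missing column: {column}")
        elif actual[column] != dtype:
            violations.append(f"type mismatch on {column}: {actual[column]} != {dtype}")
    return violations

if __name__ == "__main__":
    for violation in check_contract(declared, actual):
        print(violation)  # a CI job would fail the build here
```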

In terms of workflow improvements, the CI job updates and the Cloud CLI (which simplifies Python dependencies, credential management, and local setup) only scratch the surface of what could have been done for developer workflows and ops management. This is where we think folks expected true innovation from dbt Labs.

Workflow improvements we wish dbt announced

Looking back, not much has really changed in how we handle data over the last decade. We've moved from drag-and-drop ETL tools to code-based systems that are version-controlled, more flexible, and easier to work with. Add automated docs, tests, CI/CD, and observability on top, and we are rapidly catching up with software engineering best practices – as we set out to do.

Yet the fundamental way we manage data assets – from building to deploying and monitoring – remains the same. Working with data involves more than code. It involves managing the codebase, the database schema, the infrastructure, and the data itself, and keeping them all in sync is a real struggle for any practitioner. We need to address this; otherwise, the codebase and the data warehouse state will continue to drift apart, leading to all sorts of data quality issues. But what if we could handle everything from one place? Imagine a system where your code is the single source of truth, always in sync with the latest state of your data warehouse – just like in software development, where a single system dictates the state of the app.

This is the kind of paradigm shift we expected dbt to announce at Coalesce. Better workflows designed for data specifically. It's about carving out our own path as data practitioners, not just borrowing ideas from software engineering.

Step 1: Making our intent possible

To keep the data warehouse in sync with the codebase, we need a system where code changes directly update the data warehouse state. It's simple: change your code, and your data warehouse changes too. Deploy something new or roll back a change, and it should instantly show up in your data, your schema, everything. This way, what you see in your code is what's really happening in the data warehouse.

Code and data warehouse state always in sync.
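As a rough illustration of that loop, the sketch below derives the desired warehouse state from the codebase, reads the actual state, and plans only the difference. The `Model` type and the stubbed `actual_state` function are hypothetical; this is a thought experiment, not any existing tool's implementation.

```python
# Hypothetical sketch of a code-to-warehouse sync loop: the codebase is the
# single source of truth, and every deploy reconciles the warehouse with it.

from dataclasses import dataclass

@dataclass(frozen=True)
class Model:
    name: str
    sql: str  # compiled SELECT statement defining the model

def desired_state(models: list[Model]) -> dict[str, str]:
    """State implied by the codebase: model name -> defining SQL."""
    return {m.name: m.sql for m in models}

def actual_state() -> dict[str, str]:
    """State of the warehouse, e.g. read from an information schema or state store (stubbed here)."""
    return {"stg_orders": "select * from raw.orders"}

def plan(desired: dict[str, str], actual: dict[str, str]) -> list[str]:
    """Compute the minimal set of actions that brings the warehouse in line with the code."""
    actions = []
    for name, sql in desired.items():
        if name not in actual:
            actions.append(f"CREATE {name}")
        elif actual[name] != sql:
            actions.append(f"REPLACE {name}")  # definition changed, rebuild it
    for name in actual.keys() - desired.keys():
        actions.append(f"DROP {name}")  # no longer in the codebase
    return actions

if __name__ == "__main__":
    models = [Model("stg_orders", "select * from raw.orders"),
              Model("fct_revenue", "select order_id, amount from stg_orders")]
    print(plan(desired_state(models), actual_state()))
    # ['CREATE fct_revenue']; rolling the code back would produce the inverse plan
```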

Step 2: Doing it without wasting compute resources

Sure, tracking every code change by adding new tables in the data warehouse sounds great. But that could easily get messy and costly. We should be efficient and reuse tables where possible. For example, what we have already built in dev could be reused in prod, provided no other changes or refreshes happen before the changes are merged. For this to work, we need a setup where dev always works against the latest prod data, so that the swap is safe to perform.

Reusing assets across environments.
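One way this could work is to fingerprint what each dev build was built against and only promote it into prod (via a swap or zero-copy clone, for example) when those upstreams are still unchanged in prod. The sketch below is a hypothetical illustration of that rule; the function names are made up.

```python
# Hypothetical sketch: a dev build can be promoted into prod only if the
# upstream state it was built against is still the current state in prod.

import hashlib

def fingerprint(text: str) -> str:
    """Cheap content hash used to identify a table's defining code/state."""
    return hashlib.sha256(text.encode()).hexdigest()

def promotion_plan(dev_builds: dict[str, dict[str, str]],
                   prod_state: dict[str, str]) -> dict[str, str]:
    """dev_builds maps asset -> fingerprints of its upstreams at dev build time;
    prod_state maps asset -> its current fingerprint in prod."""
    plan = {}
    for asset, upstreams_at_build in dev_builds.items():
        unchanged = all(prod_state.get(up) == fp for up, fp in upstreams_at_build.items())
        plan[asset] = "swap/clone dev table into prod" if unchanged else "rebuild in prod"
    return plan

if __name__ == "__main__":
    stg_fp = fingerprint("select * from raw.orders")
    dev_builds = {"fct_revenue": {"stg_orders": stg_fp}}  # built in dev against stg_orders
    prod_state = {"stg_orders": stg_fp}                   # prod has not refreshed since
    print(promotion_plan(dev_builds, prod_state))
    # {'fct_revenue': 'swap/clone dev table into prod'}
```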

Step 3: Making it safe for everyone to develop

Efficiency is important, but what matters more is creating an environment where data is accessible and safe for everyone to use and develop on. We need to break down the barriers that limit who can work with data and ensure it's not just a few gatekeepers holding the keys to the data warehouse. This means equipping our tools with safety features that prevent the most common incidents, while offering the flexibility to customize those checks. It also means being able to quickly revert to a previous state if things don't go as planned. By doing this, we empower more people in an organization to confidently use and experiment with data, democratizing its use and preventing bottlenecks.

Making it safe for everyone to develop.
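A minimal sketch of such guardrails, assuming a hypothetical deploy hook rather than any existing tool: a small, customizable set of pre-deploy checks blocks the obviously destructive cases, and anything that does ship keeps a rollback path.

```python
# Hypothetical guardrail sketch: run customizable safety checks before a deploy
# and keep the previous state around so the change can be reverted quickly.

from typing import Callable

Check = Callable[[dict], str | None]  # returns an error message, or None if the check passes

def no_dropped_columns(change: dict) -> str | None:
    dropped = set(change["before_columns"]) - set(change["after_columns"])
    return f"would drop columns: {sorted(dropped)}" if dropped else None

def row_count_sane(change: dict) -> str | None:
    before, after = change["before_rows"], change["after_rows"]
    return "row count dropped by more than 50%" if before and after < before * 0.5 else None

DEFAULT_CHECKS: list[Check] = [no_dropped_columns, row_count_sane]

def deploy(change: dict, checks: list[Check] = DEFAULT_CHECKS) -> str:
    errors = [msg for check in checks if (msg := check(change))]
    if errors:
        return "blocked: " + "; ".join(errors)  # nothing is touched, so there is nothing to revert
    # In a real system the previous table would be snapshotted (or simply kept)
    # here, so a one-command rollback can restore it if the change misbehaves.
    return "deployed"

if __name__ == "__main__":
    risky = {"before_columns": ["id", "email"], "after_columns": ["id"],
             "before_rows": 1000, "after_rows": 100}
    print(deploy(risky))
    # blocked: would drop columns: ['email']; row count dropped by more than 50%
```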

A deeper problem with the Modern Data Stack (MDS)

There's a fundamental issue in the MDS that many have hinted at over the past few years, but few products have addressed it head-on. As developers, we're still juggling multiple disjointed tools to run an end-to-end pipeline. This means integrating various systems, investing in specialized training, managing access across different services, and understanding the end-to-end performance of our pipelines. It's a constant hop from one tool to another. While dbt acknowledges these challenges, we haven't yet seen great, all-encompassing solutions tailored to the MDS.

Data leaders want fewer vendor relationships. I’ve talked about this at length before and won’t belabor it here, but it came out very clearly this week. Relatedly, the boundaries between product categories like observability, quality, governance, cataloging, discovery, and lineage are becoming less and less clear. They were never that clear to begin with, and vendors are now increasingly overlapping with one another’s functionality.

A few years ago, we would have bet on dbt to drive this consolidation. Fewer tools and more integration would lead to a more coherent development experience, lower billing and management overhead, less time spent onboarding and hiring, and better all-round products that also solve some of the real-world challenges teams face today – challenges often overlooked because they fall between the cracks of integrating so many vendors.

For instance, when focusing solely on transformation, changes in upstream sources can cause significant downstream issues. Keeping load and transform coupled could improve debugging and give better control over the pipeline.

Another area is monitoring. A lineage view of dbt assets is useful for visibility, but what if this view also showed each asset's health status? Imagine a color-coded system indicating the status of each asset, allowing you to jump right in and fix problems as they arise. Simple, yet powerful.

Y42's Stateful Data Lineage
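As a rough sketch of the idea, assuming nothing more than a lineage graph and the latest check result per asset: flag an asset red when it or anything upstream of it has failed, so the lineage view doubles as a health map.

```python
# Hypothetical sketch: combine a lineage graph with per-asset check results and
# propagate failures downstream, so the lineage view doubles as a health map.

lineage = {  # asset -> its direct upstream assets (toy example)
    "raw_orders": [],
    "stg_orders": ["raw_orders"],
    "fct_revenue": ["stg_orders"],
    "finance_dashboard": ["fct_revenue"],
}

checks = {  # latest test/freshness result per asset
    "raw_orders": "pass",
    "stg_orders": "fail",  # e.g. a not_null test broke on the last run
    "fct_revenue": "pass",
    "finance_dashboard": "pass",
}

def health(asset: str) -> str:
    """Red if the asset or anything upstream of it failed, green otherwise."""
    if checks[asset] == "fail":
        return "red"
    return "red" if any(health(up) == "red" for up in lineage[asset]) else "green"

if __name__ == "__main__":
    for asset in lineage:
        print(f"{asset}: {health(asset)}")
    # raw_orders: green, stg_orders: red, fct_revenue: red, finance_dashboard: red
```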

Conclusion

There are also more advanced topics to explore. What would pull requests look like in the context of data assets? Rather than reviewing changes file by file, maybe we zoom out to a conceptual level: see the downstream impact of a change visually, decide which tests to add or remove based on that impact, and select which assets to reuse rather than rebuild. Similarly, why stick to manually writing YAML files for your models when a platform could translate user actions into YAML, or generate it from common patterns, such as creating staging tables? What's the ideal IDE experience for data practitioners? How can we further leverage git in our workflows?
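As one small example of that last point, generating the boilerplate for a staging model from source metadata is mostly mechanical. The sketch below is purely illustrative; the source metadata and naming conventions are assumptions, not a description of any existing feature.

```python
# Hypothetical sketch: derive a staging model and its YAML entry from source
# metadata instead of hand-writing both files.

source = {  # made-up source metadata, e.g. pulled from the warehouse catalog
    "name": "orders",
    "schema": "raw",
    "columns": ["id", "customer_id", "ordered_at", "amount"],
}

def staging_sql(src: dict) -> str:
    """Build a plain select-all staging model for the source."""
    cols = ",\n    ".join(src["columns"])
    return f"select\n    {cols}\nfrom {src['schema']}.{src['name']}\n"

def staging_yaml(src: dict) -> str:
    """Build the matching models: YAML entry with one line per column."""
    lines = ["models:", f"  - name: stg_{src['name']}", "    columns:"]
    lines += [f"      - name: {c}" for c in src["columns"]]
    return "\n".join(lines) + "\n"

if __name__ == "__main__":
    print(staging_sql(source))
    print(staging_yaml(source))
```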

These are areas we're actively exploring at Y42, constantly pushing the boundaries of what's possible in data management to improve the data experience for everyone.
