
Data News — dbt Coalesce 2021 takeaways

Data News #49 — dbt Coalesce takeaways: vision for dbt, what other companies are doing, learn dbt. Usual Data News at the end.

Christophe Blefari
What I see when I read my own post. Something weird but cool. (credits)

The end of the year is coming and fewer articles are being written by the data community, which is normal because we all need a well-deserved rest. But don't worry, I'll continue to share what has been done!

This week I want to review the whole dbt Coalesce conference in addition to the usual data articles.

dbt is not — anymore — a data product

Obviously dbt is made of dbt Core and dbt Cloud, two useful and well-crafted products. But I think dbt is becoming something different right now: they have left the space of a simple product and tech company to drive a whole ecosystem.

All the markers are now there: they made some terms the norm (Modern Data Stack and Analytics Engineering), they set a vision, they welcome everyone (especially people not used to data), they produce entry-level tools and documentation, and they foster conversations through the Coalesce conferences.

If we look at dbt-core, the concept is quite simple: put SQL queries in order and run them. I've seen a lot of companies out there developing similar in-house solutions, and competitors doing the same. So what is different about dbt Labs?

The main difference is the content: dbt Labs produces a freaking huge amount of it, giving you best practices and advice. It's actually a framework, in the pure software sense of the word. They released the first all-in-one SQL framework for data warehousing.
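To make the "put SQL queries in order and run them" idea concrete, here is a minimal sketch in Python. It is not how dbt-core is implemented. It only illustrates the concept: read the `ref()` calls inside each query, build a dependency graph, and sort it. The model names, the SQL and the helper functions are all hypothetical.

```python
import re
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical models: name -> SQL text. Only the {{ ref('...') }} calls matter here.
MODELS = {
    "stg_orders": "select * from raw.orders",
    "stg_customers": "select * from raw.customers",
    "orders_enriched": (
        "select * from {{ ref('stg_orders') }} o "
        "join {{ ref('stg_customers') }} c on o.customer_id = c.id"
    ),
    "revenue": "select sum(amount) from {{ ref('orders_enriched') }}",
}

# Matches {{ ref('model_name') }} with single or double quotes.
REF_PATTERN = re.compile(r"\{\{\s*ref\(\s*['\"]([^'\"]+)['\"]\s*\)\s*\}\}")


def find_refs(sql: str) -> set[str]:
    """Return the model names referenced through ref() in a SQL string."""
    return set(REF_PATTERN.findall(sql))


def run_order(models: dict[str, str]) -> list[str]:
    """Order models so that each one comes after the models it depends on."""
    graph = {name: find_refs(sql) for name, sql in models.items()}
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    for name in run_order(MODELS):
        print(f"would run: {name}")
```

The sort guarantees that staging models come before the models that reference them. Everything else (materializations, tests, docs, packages) is where the framework part comes in.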

To me, dbt became a pillar of the data ecosystem this year: everyone wants to develop an integration with it. Actually everyone wants to be in (incl. me tbh).

Small remark: please write dbt and not DBT. I'm tilted each time I see DBT.

Coalesce 2021 in review

Online conference 101 (credits)

For this edition of the Data News I want to give my views on the 70+ live sessions of Coalesce 2021, the annual dbt conference. This is the same format as the Airflow Summit review I've already done.

So I grabbed a cup of tea and I watched the 5 days of conferences. I've seen 5 categories of talks this year:

  • Food for thought: talks that help us look ahead
  • 101 talks about dbt or other concepts
  • Feedback from companies implementing dbt; this is my favourite part, obviously
  • Promotional content: talks from the sponsors, sometimes related to dbt, sometimes not 🤷‍♂️
  • Diversity talks about how we can be more open in the data field

Food for thought

I really liked the introductory talk by Erica Louie about Scaling Knowledge over Scaling Bodies. In the presentation Erica tries to define what good self-service is and which metrics go with it. She also mentions a goal I've seen a lot of companies struggle with: aiming for "data requests are dominantly investigative questions or infra tasks [rather than answering business questions]".

But this is actually a question every company should answer for itself, because self-service is not the go-to solution for everyone; sometimes DaaS (Data as a Service) is still the best option.

From the presentation I can give you some key concepts:

  • prefer async documentation (docs or videos) for onboarding
  • define your data user journey
  • visualize the reach of the data team's dashboards (for instance dashboards owned by non-data team members, etc.)
  • track weekly active users (WAUs) and a feedback score

Next, the dbt co-founders (Tristan and Drew) had 3 awesome chats: one about investment in data, one about Spark history (unveiling the Databricks partnership), and one with a Snowflake SVP about how Snowflake sees analytics today. These discussions truly place dbt as a key player in the cloud data storage war.

My main takeaway is that data is becoming the main differentiator, whereas a few years ago it was software. Do people still want to join companies that treat their data badly?

At a glance, some other stuff I liked:

This is you. If you open all the links provided. (credits)

Let's talk about dbt only

They presented the v1 version during the Coalesce sessions. The v1 version means the API will be stable and officially "out-of-beta". It also means compilation is gonna be faster (it could be up to 100x faster for companies with a lot of models).

They also teased us about what's coming next in early 2022: the Metrics system and dbt Server. The idea of dbt Server is to provide a unified interface for the last mile of the data. Right now, in v1, you can already describe your metrics in the YAML configs. Benn also made the case for the metrics system and why it's important. It is probably the next big move for all companies trying to build the best warehouse.

They also teased us about "Define your own tasks for the dbt DAG". Would it mean that dbt could become more than a repository of SQL queries?

📗 If you are new to the data field you may also struggle to build a mature dbt project or to use Git, which is often required to operate your projects. Finally, Emma detailed how you can develop dbt packages to encapsulate logic for your data client teams.

Companies feedback

A lot of companies went to Coalesce to show how they use dbt and also to give us a glimpse of what we can achieve. Like last time at the Airflow Summit I really like this because it says to everyone: it's an open-source product, so take it and make it your own, be creative.

First, Aula Education came with 2 talks. The first one was more general, about the Analytics Engineering discipline becoming a thing and how we should all apply software practices to it, like data engineering did a couple of years ago: apply testing and modularity, build small products first and then iterate.

✨ The second one was very inspirational: how to survive schema changes with automation. Imagine a world where MongoDB is your primary source, with no schema defined and where everything is changing. At Aula they ingeniously used dbt_utils macros to automate adding sources and dbt-audit-helper to easily compare relations. To finish, they also presented how they automatically generate the whole dbt config with Jinja and Python from config files.
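To give a rough idea of what "generating dbt config from config files" can look like, here is a small sketch. It is not Aula's implementation. The talk mentioned Jinja and Python; this version uses plain Python string building instead of Jinja so it runs without any dependency. The config keys, source name and collection names are all hypothetical.

```python
# Hypothetical config describing which MongoDB collections land in the warehouse.
CONFIG = {
    "source_name": "mongo_raw",
    "schema": "raw_mongo",
    "collections": ["users", "courses", "messages"],
}


def render_sources_yaml(config: dict) -> str:
    """Render the text of a dbt sources file from a small config dictionary."""
    lines = [
        "version: 2",
        "",
        "sources:",
        f"  - name: {config['source_name']}",
        f"    schema: {config['schema']}",
        "    tables:",
    ]
    # One table entry per collection, sorted so the output is stable in Git diffs.
    for collection in sorted(config["collections"]):
        lines.append(f"      - name: {collection}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_sources_yaml(CONFIG))
```

When a new collection appears, you only add one entry to the config and regenerate the file, instead of editing the YAML by hand.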

Eric from Mattermost also explained dynamic sources management with macros.

If we continue down this path of dbt tooling, Slido demonstrated two packages they developed that could be useful to everyone out there. The first one is dbt-coverage, which lets you compute a coverage metric over your dbt docs. The second one is dbt-superset-lineage, which lets you push documentation to Superset or pull dashboards in as exposures. It reminds me of the dbt-metabase package. This stuff is really good.
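The idea of a coverage metric over docs can be sketched in a few lines. This is not the dbt-coverage package itself, only an illustration of the concept under a simple assumption: coverage is the share of columns that have a non-empty description. The catalog below and the function name are hypothetical.

```python
# Hypothetical catalog: model -> {column: description}. An empty string means undocumented.
CATALOG = {
    "customers": {"id": "Primary key", "email": "", "created_at": "Signup date"},
    "orders": {"id": "Primary key", "amount": ""},
}


def doc_coverage(catalog: dict) -> float:
    """Return the share of columns with a non-empty description (0.0 if no columns)."""
    total = sum(len(columns) for columns in catalog.values())
    documented = sum(
        1
        for columns in catalog.values()
        for description in columns.values()
        if description.strip()
    )
    return documented / total if total else 0.0


if __name__ == "__main__":
    print(f"docs coverage: {doc_coverage(CATALOG):.0%}")
```

A number like this is easy to track over time or to check in CI, which is what makes a coverage metric useful.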

And in a nutshell (to keep it short):


Diversity

Because this post is already too long, I'll give you a list of the open talks I liked:


If you came this far, first I want to thank you. Second, hi, I'm Christophe, I do data stuff and I write a weekly curation newsletter with my views inside. If you liked it, please consider subscribing to support this kind of work in the future. It's forever free.

Usual Data News is below but in a smaller format.

Back to the roots (credits)

Data fundraising 💰

  • HashiCorp, the home company of Terraform, went public this week: the IPO was successful and raised around $1.2b. Even if we are using more and more cloud services, Terraform is here to stay and will remain the good practice in terms of architecture setup.

AI Friday

Pinterest shows us how they use Machine Learning (with BERT under the hood) to provide a "healthy" space in the comments sections. This could inspire a lot of companies, even those without a comment section, for instance in customer support.

Fast News ⚡️


Christophe Blefari

I do Data Engineering in Python.
