blef.fr

What I see when I read my own post. Something weird but cool. (credits)

The end of the year is coming and fewer articles are being written by the whole data community, which is normal because we all need a well-deserved rest. But don't worry, I'll continue to share what has been done!

This week I want to review the whole dbt Coalesce conference in addition to the usual data articles.

dbt is not — anymore — a data product

Obviously dbt is made of dbt Core and dbt Cloud, two useful and well-crafted products. But I think that dbt is becoming something different right now. They have left the space of a simple product and tech company to drive a whole ecosystem.

All the markers are there: they made some terms the norm (Modern Data Stack and Analytics Engineering), they set out a vision, they welcome everyone (especially people not used to data), they produce entry-level tools and documentation, and they foster conversations through the Coalesce conferences.

If we have a look at dbt-core, the concept is quite simple: put SQL queries in order and run them. I've seen a lot of companies out there developing similar in-house solutions, or competitors doing the same. What is different about dbt Labs?
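To make "put SQL queries in order and run them" concrete, here is a minimal sketch of the idea in Python. It is not how dbt-core is implemented: the model names and SQL are made up, and the regex only handles the simplest single-quoted `ref('...')` form. It just shows that the ordering is a topological sort over the references between queries.

```python
import re
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical models (name -> SQL). The ref() calls mimic dbt's syntax,
# but the parsing below is a deliberate simplification.
MODELS = {
    "daily_revenue": "select order_date, sum(amount) from {{ ref('orders_enriched') }} group by 1",
    "orders_enriched": "select * from {{ ref('stg_orders') }} o "
                       "join {{ ref('stg_customers') }} c using (customer_id)",
    "stg_orders": "select * from raw.orders",
    "stg_customers": "select * from raw.customers",
}

REF_PATTERN = re.compile(r"\{\{\s*ref\(\s*'([^']+)'\s*\)\s*\}\}")


def build_graph(models):
    """Map each model to the set of models it references."""
    return {name: set(REF_PATTERN.findall(sql)) for name, sql in models.items()}


def run_order(models):
    """Return model names so that every model comes after its dependencies."""
    return list(TopologicalSorter(build_graph(models)).static_order())


if __name__ == "__main__":
    for name in run_order(MODELS):
        print("would run:", name)
```

Everything else (materializations, tests, docs) is built on top of that ordering, which is why the surrounding content and conventions matter so much.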

The main difference is the dbt content: they produce a freaking huge amount of content giving you best practices and advice. It's actually a framework, in the pure software sense of the word. They released the first all-in-one SQL framework for data warehousing.

To me dbt became, this year, a pillar of the data ecosystem: everyone wants to develop an integration with it. Actually everyone wants to be in (incl. me tbh).

Small remark: please write dbt and not DBT. I'm tilted each time I see DBT.

Coalesce 2021 in review

For this edition of the Data News I want to give my views on the 70+ live sessions of Coalesce 2021, the annual dbt conference. This is the same format as the Airflow Summit review I've already done.

So I grabbed a cup of tea and I watched the 5 days of conferences. I've seen 5 categories of talks this year:

Food for thought, to help us look ahead
101 talks about dbt or other concepts
Feedback from companies implementing dbt, which is obviously my favourite part
Promotional content — talks from the sponsors, sometimes related to dbt sometimes not 🤷‍♂️
Diversity talks about how we can be more open in the data field

Food for thought

I really liked the introductory talk by Erica Louie about Scaling Knowledge over Scaling Bodies. In the presentation Erica tries to define what good self-service looks like and which metrics to track around it. She also mentions something I've seen a lot of companies struggle with: aiming for "data requests are dominantly investigative questions or infra tasks [rather than answering business questions]".

But actually this is a question everyone should answer for themselves, because self-service is not the go-to solution for every company; sometimes DaaS (Data as a Service) could still be the best option.

From the presentation I can give you some key concepts:

prefer async documentation (docs or videos) for onboarding
define your data user journey
visualize your data team's reach through dashboards (for instance, dashboards owned by non-data team members)
track weekly active users (WAUs) and a feedback score

To continue, the dbt co-founders (Tristan and Drew) had 3 awesome chats: one about investment in data, one about Spark history (unveiling the Databricks partnership) and one with a Snowflake SVP about how Snowflake sees analytics today. These discussions truly place dbt as a key actor in the cloud data storage war.

My main takeaway is that data is now becoming the main differentiator, when some years ago it was the software. Would people now want to join companies that treat their data badly?

At a glance, some other stuff I liked:

This is you. If you open all the links provided. (credits)

Let's talk about dbt only

They presented v1 during the Coalesce sessions. v1 means the API will be stable and dbt is officially "out-of-beta". It also means compilation is going to be faster (it could improve by up to 100x for companies with a lot of models).

They also teased us about what's coming next in early 2022: the Metrics system and dbt Server. The idea of the dbt Server is to provide a unified interface for the last mile of the data. Right now, in v1, you can describe your metrics in the YAML configs. Benn also made the case for the metrics system and why it's important. It's probably the next big move for all companies trying to build the best warehouse.

They also teased us about "Define your own tasks for the dbt DAG". Would it mean that dbt could become more than a repository of SQL queries?

📗 If you are new to the data field you may also struggle to build a mature dbt project or to use Git, which is often required to operate your projects. And finally Emma detailed how you can develop dbt packages to encapsulate logic for your data client teams.

Companies feedback

A lot of companies went to Coalesce to show how they use dbt and also to give us a glimpse of what we can achieve. Like last time at the Airflow Summit I really like this because it says to everyone: it's an open-source product, so take it and make it your own, be creative.

First, Aula Education came with 2 talks. The first one was more general, about the Analytics Engineering discipline becoming a thing and how we should all apply software practices to it, like data engineering did a couple of years ago: apply testing and modularity, build small products first and then iterate.

✨ The second one was very inspirational, about how you can survive schema changes with automation. Imagine a world where MongoDB is your primary source, with no schema defined, and where everything is changing. At Aula they ingeniously used dbt_utils macros to automate the addition of sources and the dbt-audit-helper to easily compare relations. To finish, they also presented how they automatically generate the whole dbt config with Jinja and Python from config files.
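To give an idea of what "generate the dbt config from config files" can look like, here is a rough sketch. It is not Aula's implementation: they used Jinja and Python, while this uses plain string formatting to stay dependency-free, and the source, schema and collection names are invented. Only the output shape (a `version: 2` file with a `sources` list) follows dbt's sources YAML format.

```python
def render_sources(source_name, schema, collections):
    """Render a dbt sources YAML document for a list of collection names.

    All arguments are hypothetical inputs that would normally come from
    your own config file or from listing the collections in the database.
    """
    lines = [
        "version: 2",
        "",
        "sources:",
        f"  - name: {source_name}",
        f"    schema: {schema}",
        "    tables:",
    ]
    # Sort so the generated file is stable from one run to the next.
    for collection in sorted(collections):
        lines.append(f"      - name: {collection}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Made-up example: three MongoDB collections landed in a raw schema.
    print(render_sources("mongo", "raw_mongo", ["users", "courses", "posts"]))
```

Regenerating this file on every run means a new collection shows up as a dbt source without anyone editing YAML by hand.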

Eric from Mattermost also explained dynamic sources management with macros.

If we continue down this path of dbt tooling, Slido demonstrated two packages they developed that could be useful to everyone out there. The first one is dbt-coverage, which lets you compute a coverage metric over your dbt docs. The second one is dbt-superset-lineage, which lets you push documentation to Superset or pull dashboards in as exposures. It reminds me of the dbt-metabase package. This stuff is really good.

And in a nutshell (to keep it short):

Abhiti explained how you can adapt incremental_strategy to your needs. What I also liked was how they crafted their Sessions modeling with immutable and mutable tables along the way. It resonates with Snowplow time.
The most common practice I noticed was the use of tags to create different schedules, for instance daily, weekly, monthly or 30_minutes, 2_hours.
Companies love using the dbt Cloud Metadata API to build observability: here and here.
Regarding observability, don't forget you can use dbt artifacts (run_results.json and manifest.json) to create dashboards on top of them. The Snapcommerce team illustrated how they did observability within dbt (they used the dbt_artifacts package).
dbtplyr: brings R's dplyr to dbt, particularly useful when you need to build column selections dynamically.
Use the meta tag in your YAML file to add context: add ownership or access policies, add alerting rules. Everything could go in.
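As a small illustration of the artifacts point above, here is a sketch that summarizes a run_results.json in Python. The sample data is invented and trimmed down to three fields per result (`unique_id`, `status`, `execution_time`); a real artifact carries much more, and the default path assumes dbt's usual `target/` directory.

```python
import json
from collections import Counter


def load_run_results(path="target/run_results.json"):
    """Load the artifact written by a dbt invocation (path is an assumption)."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def summarize(run_results, top_n=3):
    """Count results per status and list the slowest nodes."""
    results = run_results.get("results", [])
    status_counts = Counter(result["status"] for result in results)
    slowest = sorted(results, key=lambda r: r["execution_time"], reverse=True)[:top_n]
    return {
        "status_counts": dict(status_counts),
        "slowest": [(r["unique_id"], r["execution_time"]) for r in slowest],
    }


# Invented sample, shaped like a heavily trimmed run_results.json.
SAMPLE = {
    "results": [
        {"unique_id": "model.shop.stg_orders", "status": "success", "execution_time": 1.5},
        {"unique_id": "model.shop.daily_revenue", "status": "error", "execution_time": 0.2},
        {"unique_id": "model.shop.orders_enriched", "status": "success", "execution_time": 4.0},
    ]
}

if __name__ == "__main__":
    print(summarize(SAMPLE))
```

Loading that summary into a table after each run is already enough to chart failures and slow models over time.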

Diversity

Because this post is already too long, I'll give you a list of the open talks I liked:

Inclusive Design and dbt
Beyond the Box: Stop relying on your Black co-worker to help you build a diverse team.
To all the data managers we've loved before
All the daily talks by Jillian especially the closing remarks

If you came this far, first I want to thank you. Second, hi, I'm Christophe, I do data stuff and I write a weekly curation newsletter with my views inside. If you liked it, please consider subscribing to support this kind of work in the future. It's forever free.

Usual Data News is below but in a smaller format.

Data fundraising 💰

HashiCorp, the home company of Terraform, went public this week: the IPO was a success and raised around $1.2b. Even if we are using more and more cloud services, Terraform is here to stay and will remain the good practice in terms of architecture setup.

AI Friday

Pinterest shows us how they use Machine Learning (with BERT under the hood) to provide a "healthy" space in the comment sections. This could inspire a lot of companies even if you don't have a comment section, for instance with customer support.

Fast News ⚡️

Behavioral timeseries segmentation in ClickHouse: how you can use ClickHouse to build a user journey funnel, as written up by the Content Square team.
I said the word dbt 43 times, so let's speak about dataform, a competitor bought by Google but only for BigQuery. This post gives you a glimpse of what you can achieve with it.
How Data Engineers can use SQL to estimate BigQuery storage costs: you can query INFORMATION_SCHEMA for your storage size and multiply it by the storage price to estimate the money you spend. Actually I prefer another solution, which is to send all your billing data directly to BigQuery to have the actual precise amount.
Keeping your data pipelines organized: Felipe proposes a way to structure your pipeline code to keep it organized.
Evolving LinkedIn’s analytics tech stack: the Modern Data Stack is paving the way and changing everything, and sometimes it's good to see how big companies with a big Hadoop footprint evolve their stack. Today it's LinkedIn.
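For the BigQuery storage estimate mentioned in the list above, the arithmetic is just "bytes times price". Here is a tiny sketch of it. The per-GiB monthly prices are placeholder assumptions, not quoted rates (check the current pricing for your region), and the byte counts would come from whatever metadata query you run.

```python
GIB = 2**30  # bytes in one gibibyte


def monthly_storage_cost(active_bytes, long_term_bytes,
                         active_price=0.02, long_term_price=0.01):
    """Estimate a monthly storage bill from byte counts.

    The default prices (per GiB per month) are assumed placeholder values;
    replace them with the rates that apply to your project.
    """
    active_cost = (active_bytes / GIB) * active_price
    long_term_cost = (long_term_bytes / GIB) * long_term_price
    return active_cost + long_term_cost


if __name__ == "__main__":
    # Made-up dataset: 100 GiB of active and 50 GiB of long-term storage.
    print(f"{monthly_storage_cost(100 * GIB, 50 * GIB):.2f}")
```

This stays an estimate, which is why exporting the actual billing data gives a more precise number.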