Hey dear members. I have to confess I'm lazy. Every week I want to create content, to work on a new article or video. The more ideas I have, the more I procrastinate. Every week Friday arrives and I'm still here, late with the newsletter. For years I was convinced I could change, but let's face the truth: I'm 30 now, it will never change.
Still, while procrastinating this week I decided to watch all the replays—around 120—from dbt's annual conference. This newsletter gives you my Coalesce 2022 takeaways.
I have also added a ❤️ on my 3 favourite talks.
The conference agenda was divided, as I see it, into 5 categories similar to last year's:
- dbt future — which direction the data field is going with dbt at the center
- Analytics engineering
- HR — Grow your data career and fix your data team
- Diversity — talks about how we can be more open in the data field
- Partners — dbt is booming, everyone wants in
Obviously, Coalesce was the theatre for dbt Labs to announce new stuff. Nothing revolutionary or surprising, because it had all been discussed or announced before the conference. During the 5 days, dbt Labs' talks focused on 3 main topics: Python, the Semantic Layer and the Community. In the modern data stack the warehouse is king, at the center, and dbt sits on top of it. From this privileged position, dbt usage is growing.
Being at the center of a community of users and partners means a lot. You foster a variety of usages while your growth attracts a lot of partners looking to integrate with you. This is what dbt Labs has to juggle. My personal opinion is that too many tools were just demoing their product without any added value, but this is not a big issue as I can skip them.
Technically speaking, being at the center of the data stack also leads to the next step for dbt: the Semantic Layer. This layer is designed to be the all-in-one interface for every tool that needs data. dbt Labs will open-source a new project called dbt Server—not yet released—that puts an HTTP API on top of dbt Core to run dbt operations. In addition, dbt Cloud will offer a proprietary Metadata API and a Cloud proxy. The Cloud proxy will be able to translate YAML metric definitions to SQL. As I already said, it feels a bit like their best shot at generating revenue.
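To give you an idea of what those YAML metric definitions look like, here is a sketch in the dbt v1.3-era metrics spec; the model, column and metric names are made up for illustration:

```yaml
# models/metrics.yml — hypothetical metric definition
version: 2

metrics:
  - name: total_revenue
    label: Total revenue
    model: ref('fct_orders')
    calculation_method: sum
    expression: amount
    timestamp: ordered_at
    time_grains: [day, week, month]
    dimensions: [customer_segment]
```

The Cloud proxy's job is to compile a query against such a definition into warehouse SQL, so every downstream tool gets the same number for "total revenue".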
If I'm being sarcastic and defensive: I don't see it as a good sign that dbt wants to be a new data connector on top of my warehouse, adding a layer of complexity to my infrastructure.
Lastly, the Python support, while fairly simple, impressed me. In the form of a Jeremy vs. Cody duel, the dbt team demoed what you can and can't do with Python models. In a Python vs. SQL face-off we saw pandas describe and pivot, fuzzy matching and sklearn.
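As a taste of the kind of pandas work shown in the demo, here is a minimal sketch of describe and pivot. In a real dbt Python model the DataFrame would come from `dbt.ref(...)`; here I fabricate a tiny one instead:

```python
import pandas as pd

# Fabricated data standing in for a dbt model's output.
df = pd.DataFrame({
    "country": ["FR", "FR", "US", "US"],
    "channel": ["web", "store", "web", "store"],
    "revenue": [100, 150, 200, 250],
})

# describe(): quick profiling of the numeric columns.
profile = df.describe()

# pivot_table(): reshape long data into a country x channel matrix.
wide = df.pivot_table(index="country", columns="channel",
                      values="revenue", aggfunc="sum")

print(wide.loc["FR", "web"])   # 100
```

This is trivial in pandas and genuinely painful in SQL, which was the point of the face-off.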
On a side note, the dbt team also presented their focus for 2023 and 2024, shaped by their user research. As Tristan said, dbt wants to be the open standard to create and disseminate knowledge. So 2023 will bring: better lineage support for data science, standardization around metrics and the semantic layer, and enriched dbt DAG capabilities to add more context to it—whatever that means, re-bundling is coming.
2022 is probably the year analytics engineering became popularized. While the true frontiers of the role are still unclear, everyone knows that "dbt developers" are analytics engineers. But it goes deeper than that. The role implies a mix of business understanding and technical expertise over SQL engines and data modeling.
At Coalesce we saw that analytics engineering has a wide range of applications. But in the end you don't build models, you construct knowledge, and this knowledge is essential to find common ground between the company's verticals. Even if AE is still new, it relies on old principles like Kimball modeling—but is that still relevant? Spoiler: yes. Even if it's no longer about performance like before, Kimball brings understandability.
Under analytics engineering, I really liked 3 presentations that I would recommend to anyone in analytics; while approaching technical concepts well, they bring good food for thought to improve any dbt project:
- Outgrowing a single `dbt run` — at scale, schedule-based orchestration can fail: a CRON that runs dbt will lead to issues, so you need a smarter orchestration pattern. This is where reactive/proactive scheduling enters the room. In the Airflow world it means using sensors to trigger runs. Prratek also recommends running staging models each time a source is refreshed, and running the marts once every staging model has run. I think this is a good pattern.
- ❤️ Testing: Our assertions vs. reality — Probably the best talk of the conference for me. Mariah shows how dbt is natively badly designed when it comes to testing: dbt tests mix code quality and data quality, which are 2 different pieces of the testing framework. She also greatly illustrates the difference between assumptions and assertions when it comes to data.
- Efficient is the new sexy - A minimalist approach to growth — Matthieu proposes a framework to handle team growth while tackling engineering problems. He also addresses issues like modularity (linked to mesh concepts) and testing, from a different angle than the previous talk.
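Prratek's reactive pattern can be sketched outside of any orchestrator. This toy scheduler (all model and source names are hypothetical) runs a staging model whenever its source refreshes, and runs the marts only once every staging model is fresh:

```python
# Toy reactive scheduler: staging models run when their source refreshes;
# marts run once every staging model has been refreshed at least once.
# All names are hypothetical.

SOURCES = {"stg_orders": "raw_orders", "stg_customers": "raw_customers"}

ran = []               # execution log, in order
fresh_staging = set()  # staging models refreshed so far

def on_source_refresh(source):
    """React to a source refresh by running the matching staging model."""
    for staging, src in SOURCES.items():
        if src == source:
            ran.append(staging)
            fresh_staging.add(staging)
    # Only once every staging model is fresh do we run the marts layer.
    if fresh_staging == set(SOURCES):
        ran.append("marts")

on_source_refresh("raw_orders")     # only stg_orders runs
on_source_refresh("raw_customers")  # stg_customers runs, then marts
```

In Airflow the `on_source_refresh` trigger would be a sensor on the source table; the logic stays the same.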
Lastly, data contracts were on fire in the data community. This time Jake and Emily provide us with a practical example, using jsonschema to define the interface between product and data teams.
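To make the idea concrete, here is a hedged sketch of such a contract: the product team publishes a JSON Schema for a hypothetical "order_created" event, and the data team validates incoming payloads against it. The tiny validator below checks only required keys and types, enough to illustrate the point; a real setup would use the `jsonschema` library:

```python
# Hypothetical contract for an "order_created" event, as JSON Schema.
CONTRACT = {
    "type": "object",
    "required": ["order_id", "amount", "currency"],
    "properties": {
        "order_id": {"type": "string"},
        "amount": {"type": "number"},
        "currency": {"type": "string"},
    },
}

# Map JSON Schema scalar types to Python types for this sketch.
TYPES = {"string": str, "number": (int, float)}

def validates(payload, schema=CONTRACT):
    """Return True if payload has all required keys with the right types."""
    if not all(key in payload for key in schema["required"]):
        return False
    return all(
        isinstance(payload[k], TYPES[spec["type"]])
        for k, spec in schema["properties"].items()
        if k in payload
    )

print(validates({"order_id": "o-1", "amount": 12.5, "currency": "EUR"}))
```

The value is organizational, not technical: when the product team changes the event, the contract breaks loudly at the interface instead of silently downstream in the marts.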
Grow as an individual and fix your data team
A lot of talks this year tried to answer a simple question: how can a data team have an impact? This is obviously related to the fact that data teams around the world cost a lot, and leaders are still struggling to find the return on investment (ROI).
In this introspective search for what a data team is, the picture seems to be the same for everyone. Cultural challenges are the main blockers to massive data adoption. 5 talks proposed something to help adoption:
- Know your worth: Unpacking business value delivered by data teams — A framework to build knowledge to exploit data for stakeholders
- Data teams v. The recession — How to win the ROI battle. You have to act on at least 3 levers: core business reporting, avoiding people-pleasing and driving decisions that affect revenue. Chetan illustrates with Airbnb and Webflow examples.
- How to build data accessibility for everyone — use the JTBD framework to know your data users to achieve self-service.
- Money, Python, and the Holy Grail: Designing Operational Data Models — We need to simplify data models: a simple model of the business means you've understood what's going on. Data teams should not be a consulting team that answers every question. A data team creates a simple, understandable view for everyone.
- ❤️ Operations vs. product: The data definition showdown — Every operational team is different, and data should be the glue between stakeholders even if it's hard. Words have different meanings from team to team. Data alignment is a people and language problem, not a technical one.
Being in an analytics team can be difficult because you're in the middle of everything without the power to take decisions. That's why data teams have to be empathetic. Empathy means "the action of understanding" (cf. Empathy-building in data work and How insensitive: Increasing analytics capacity through empathy).
The dbt blog mentioned the purple people concept last year. Purple people are those generalists who are the glue between the business and the data stack. But being a generalist is often a solo job: you navigate between specialist worlds and help these expert communities communicate with each other. This is what Stephen greatly depicted in ❤️ Excel at nothing: How to be an effective generalist.
There were also open formats. This creativity shows how great the data community is. Tiankai sang a data jam 🎵, competitors battled to answer business questions as fast as possible and Joe developed a Unity SQL game.
A final shout-out to Mehdio, who did video interviews and highlights of the conference as he was there in person.
The last thing I discovered is the dbt-project-evaluator package, which seems amazing for creating CI/CD rules to detect, for instance, direct joins to sources or gaps in documentation coverage.
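If you want to try it, installation is the usual dbt package workflow; add it to your packages.yml (the version pin below is illustrative, check the package's docs for the current one):

```yaml
# packages.yml
packages:
  - package: dbt-labs/dbt_project_evaluator
    version: [">=0.1.0", "<1.0.0"]
```

Then `dbt deps` followed by `dbt build --select package:dbt_project_evaluator` runs its checks, and violations surface as test failures you can wire into CI.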
PS: I already did this last year for Coalesce 2021, if you wanna check it out.
PS2: Sorry for the length of this edition and for the delay. I hope the reading was enjoyable; I'm not really proud of my writing here.
See you next week.
Join the newsletter to receive the latest updates in your inbox.