
Here's a new edition of the Data News newsletter. Since my 2-year anniversary post, I've been struggling to find the right writing rhythm. I've been sick and stuck on a client project, so writing the newsletter hasn't been an easy exercise, even though I keep telling myself "it's not a question of motivation, it's a question of discipline" like a LinkedIn guy. I do things because I enjoy the process of doing them, not for the results.

That's why I'll try to change the way things are done a bit for the next 3 months. As of today, I do the newsletter every Friday: I search for and read articles first, then I write. Starting next week I'll do it on Thursday, so I can schedule the sending for the same hour every Friday, at 2PM.

This way, I'll dedicate my Fridays to writing original articles, exploring ideas and preparing a stock of articles for the summer holidays. I plan to take a 1-month break in August, but at the same time I have FOMO (fear of missing out), so I need to schedule articles in advance. I can already tease that I'll create content about "Create a data platform in 2023", with live examples.

In September I'll do a retro and decide whether this is the right way to continue.


In terms of content, I've recorded a new podcast episode (in French) that will be out next week. The French version will be a bit different from Minds of data: it'll be more round tables and discussions about the present and the future of our ecosystem.

We also scheduled the next Paris Airflow Meetup at Mirakl's offices. Pierre, an Airflow committer and PMC member, will present his Airflow journey. Join us!

Data contracts, dbt and modeling

Back to the roots: it's been a long time since I shared dedicated stuff about dbt. This week a natural cluster of articles emerged. A few people have already implemented things with the new model governance features dbt introduced last month in v1.5.

Julian shared a nice way to use dbt model governance when you have 1000+ models. In a nutshell, you can add new properties to models that give dbt more context: models can have a group, an access level, a contract and versions. In the article, Julian draws a great comparison with software development, likening model management to managing programmatic APIs with public or private visibility. Finally, he proposes 6 logical data layers to sort your models: source, base, cleanse, core, business and marts.
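To make this concrete, here's a minimal sketch of what those governance properties look like in a dbt (v1.5+) model YAML file. The model, group and column names are illustrative, not taken from Julian's article:

```yaml
# models/core/_core__models.yml
groups:
  - name: core
    owner:
      name: Data Engineering

models:
  - name: dim_customers
    group: core
    access: private        # only models in the "core" group can ref() this one
    latest_version: 2
    config:
      contract:
        enforced: true     # dbt verifies column names and data types at build time
    columns:
      - name: customer_id
        data_type: int
    versions:
      - v: 1
      - v: 2
```

With `access: private`, a `ref('dim_customers')` from a model outside the `core` group fails at parse time, which is exactly the public/private API boundary the article describes.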

This structure also gives the team more visibility, because you can draw clear boundaries, like: data engineers are responsible for the first 3 layers, analytics engineers for the others.

To go deeper into data contract concepts applied to the warehouse and dbt, you can activate ownership with dbt data contracts. Mikkel also showcases his tool, synq.io, which runs tests and alerts on top of dbt.

In addition there are 2 awesome articles about related topics:

Gen AI 🤖

The single greatest risk of AI is that China wins global AI dominance and we – the United States and the West – do not.

I propose a simple strategy for what to do about this – in fact, the same strategy President Ronald Reagan used to win the first Cold War with the Soviet Union.
A ControlNet-generated QR code; the link leads to a website, developed by the author, for personalising QR codes

Fast News ⚡️

Data Economy 💰


See you next week ❤️.