Joy (credits)

Hello here ☀️

I feel ashamed for not posting any Data News for the last two months. A lot was going on and I didn't manage to find time every Friday to write; I'm so sorry about it.

Hello to all the new subscribers who have arrived since January, I want to warmly welcome you ❤️. This is your first Data News ever, enjoy the moment, read whatever you feel curious about, at your own rhythm.

At the moment, I don't want to promise a return to our regular weekly schedule, but I'm trying as hard as I can to organise my new routines/life as a content creator and company founder.

I've always worked on multiple projects at the same time, but since I started nao, things have changed. There's a truth you only grasp when you've lived it: you are thinking about your company all the damn time.

Events 🪭

While being less present online, I've done a lot of things in real life in the last weeks and I'll continue to in the weeks to come. I was at DuckCon #6 in Amsterdam to talk about yato, the smallest DuckDB SQL orchestrator, and Robin published three podcast episodes—in French—that I hope you'll listen to while running this weekend 🤭:

At the end of the month, on March 31st, I'll co-organise the AI Product Day in Paris. We are sold out, but we still have slots for sponsors if you want to help us organise the event and get massive visibility with AI and product teams.

I'm going to Barcelona 🇪🇸 (from March 19 to 22)—I'd love to hang out with data people there. I'll give a talk at French Tech Barcelona on March 20, you can register here. I might plan a day-trip to Madrid (?).

The biggest news, as a data fan, is that I'll be at Data Council this year as your news reporter on duty 🤓. So if you plan to go, or if you're in San Francisco around April, let's have a coffee.

AI News 🤖

Running after all the models releases (credits)

The pace of change is nothing short of extraordinary. I haven't published in two months, and it feels like two years. Here's a recap.

dbt Core and SQLMesh, wat 🧭

dbt Core has become one of the most used tools across data teams all around the world. With that success comes dbt fatigue: your dbt project has been a success and has spread widely within the company, leading to A LOT of tables—we call them models, the dbt way.

When you have a lot of tables, dbt projects tend to become less manageable: the CLI becomes slow, the local development experience isn't great, and more and more features are going into the Cloud version. SQLMesh was created to fix dbt Core's issues and to compete with dbt Cloud.

A few weeks ago, dbt Labs acquired SDF—which I had been watching closely for more than a year, see DN#24.07. SDF is a Rust binary that understands dbt projects and speeds up everything, delivering up to a 100x performance gain. Under the hood, SDF parses the SQL queries, builds syntax trees, and compiles and executes them to find issues before they even hit the data warehouse.

We will know very soon what this acquisition brings to dbt, and we all pray that the best improvements land in the open-source codebase (spoiler: not sure).

On the other side, SQLMesh answered with the acquisition of Quary, a Rust-savvy team that has made significant improvements to SQLGlot, the SQL parser underlying SQLMesh.

There is fierce competition between the two companies and shots are fired. The SQLMesh team is also organising GROUP BY, their annual conference, in a few days (any resemblance to another event is fortuitous). This week Tobiko also published a benchmark claiming that SQLMesh on top of Databricks delivers 9x cost savings.

Time will tell where this leads, but ultimately it will benefit data professionals as they strive to build the best SQL orchestrators. However, I believe there are still unresolved issues with the developer experience in the age of AI—challenges that I'm actively working to address with nao 🤭.

Will dbt stay on top? (credits)

Fast News ⚡️

🕵️ What if you could become a SQL detective: SQL Noir. It's a fun game to practice your SQL skills by solving mysteries.

Data Economy 💰

Because it's already too long, only headlines.


Sorry for the long edition, I also feel a bit rusty after two months of not writing. See you soon folks ❤️.