Data News — Week 25.15
Data News #25.15 — Arrived in San Francisco, Llama 4 is out, reasoning hacks, MCP hype, Iceberg stuff.

Hey there. What's up? While you're all data vibing, I'm sliding into your inbox with the fresh Data News of the last month.
I have moved to San Francisco for the next 3 months, so if you're in town and wanna talk data or go for a run, you know where to find me. It's been a week since we arrived in SF with the nao Labs team and it has been a blast. We will be at Data Council, pitching at the AI Launchpad on the 22nd.
I'm planning to create content so you can follow Data Council from the inside, as it has always been great to write takeaways about the talks these last years (2023 and 2024).
AI News 🤖
- New OpenAI text-to-speech model — OpenAI released a new text-to-speech model, available through their API, and it looks better than the Whisper baseline. There is a demo website which is quite impressive.
- Llama 4 is out — Meta has released the new iteration of their open models. This time it includes 4 models:
- Llama 4 Scout, a small model with 17B active parameters. Natively multimodal, it achieves an industry-leading 10M+ token context window and can also run on a single GPU.
- Llama 4 Maverick, a multimodal model beating GPT-4o and Gemini 2.0 Flash across a broad range of widely reported benchmarks. It can also run on a single host.
- And soon they will release Llama 4 Behemoth (outperforming GPT-4.5, Claude 3.7 Sonnet, and Gemini 2.0 Pro on several STEM benchmarks) and Llama 4 Reasoning.
- The cool part is that they partnered from the beginning with Databricks and Snowflake to bring LLMs to your data.
- Reasoning models don't always say what they think — A summary of research on the faithfulness of reasoning models. Anthropic discovered that these models, which are not really "reasoning" but rather doing Chain-of-Thought (CoT), are not always honest when they are given a hint (or hacked) about the answer they should provide.
- How to roll out a data conversational agent? — The Gorgias engineering team released a conversational agent to the company using Dot, an AI Slack bot that answers data warehouse questions. The article explains super well how Dot fits in the data team and how you can evaluate the AI.
- Google announced AI across their cloud products at Next'25
- AI.GENERATE_TABLE in BigQuery — The setup looks a bit weird because, in order for everything to be accessible, you have to have a model in a dataset and prompts in a table, but this is a great way to extract structured information from strings, as per the example (see the first sketch after this list).
- Talk to your data with Looker — BI tools are already widely deployed in most organizations, and conversational analytics serves as another interface to access the same insights. That's why embedding this "talk to your data" capability within existing BI tools is likely the best path for adoption. However, this approach might also introduce additional complexity, given that many BI tools are already somewhat cumbersome, cluttered repositories of charts.
- Google announced their Agent Development Kit (ADK) and an open protocol to enable communication between agentic apps called Agent2Agent (A2A).
- Also announced: a BigQuery AI engine that does something I don't completely understand (analysts could ask to extract info from an image and match it to a product catalog), and a copilot for Jupyter Notebooks (in Colab). More about analytics announcements.
- Lessons learned from building an agent that can code like Composer — Petr wrote a great article that explains how coding agents work (which is dead simple). In order to build the simplest version of a coding agent, you need to give it 3 capabilities: list files, read file, and write file. With this and a few prompts you can have a working demo in a few minutes, but how do you move forward? He shares the 5 important lessons he got out of this (see the second sketch after this list).
- Oxy, an open-source agentic analytics framework — Robert and Joseph co-founded Hyperquery back in the day (at the peak of the notebook hype) and they are back with a new journey: Oxy, an open-source framework to create analytics workflows in a friendly way. Looks promising.
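First sketch, for the AI.GENERATE_TABLE item above: a minimal look at what the setup gives you from Python. All project, dataset, model, and table names are made up, and the function arguments follow the documented pattern as I understand it, so double-check the BigQuery docs before copying:

```python
# Hypothetical sketch: calling BigQuery's AI.GENERATE_TABLE from Python.
# Names are invented; the setup mirrors the item above: a remote model
# living in a dataset, and prompts coming from a table.
from google.cloud import bigquery

client = bigquery.Client()

query = """
SELECT *
FROM AI.GENERATE_TABLE(
  MODEL `my_project.my_dataset.gemini_model`,           -- model in a dataset
  (SELECT CONCAT('Extract name and city from: ', raw_text) AS prompt
   FROM `my_project.my_dataset.support_tickets`),       -- prompts in a table
  STRUCT('name STRING, city STRING' AS output_schema)   -- typed output columns
)
"""

for row in client.query(query).result():
    print(row["name"], row["city"])
```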
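Second sketch, for the coding-agent post: since the recipe is genuinely that simple, here is a minimal version of the loop. This is not Petr's actual code; the `llm_call` helper and the shape of what it returns are assumptions standing in for whatever chat API (with tool calling) you use:

```python
# Minimal coding-agent loop: an LLM plus 3 capabilities.
import json
import pathlib

def list_files(path: str = ".") -> list[str]:
    """Capability 1: let the model see the repo layout."""
    return [str(p) for p in pathlib.Path(path).rglob("*") if p.is_file()]

def read_file(path: str) -> str:
    """Capability 2: let the model inspect a file."""
    return pathlib.Path(path).read_text()

def write_file(path: str, content: str) -> str:
    """Capability 3: let the model edit the codebase."""
    pathlib.Path(path).write_text(content)
    return f"wrote {len(content)} chars to {path}"

TOOLS = {"list_files": list_files, "read_file": read_file, "write_file": write_file}

def agent(task: str, llm_call, max_steps: int = 10):
    # llm_call(messages) is assumed to return either a final answer (str)
    # or a tool request like {"tool": "read_file", "args": {"path": "x.py"}}.
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        action = llm_call(messages)
        if isinstance(action, str):  # the model decided it is done
            return action
        result = TOOLS[action["tool"]](**action["args"])
        messages.append({"role": "tool", "content": json.dumps({"result": result})})
    return "ran out of steps"
```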

Navigate the MCP hype
If you've been on the internet lately you have probably seen the massive MCP hype: everyone is either building an MCP server, an MCP registry, or even a registry of registries.
But what's an MCP?
MCP means Model Context Protocol and is an open protocol created by folks at Anthropic. An MCP is most of the time referenced as a server that encapsulates discoverable tools, prompts and data to be used by an LLM. MCP clients are on the LLM side and make requests to MCP servers.
For instance there are a few Snowflake MCP servers; if you add them to Claude, you will be able to query Snowflake from a Claude prompt, or get a table's metadata.
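To give you an idea of the shape of an MCP server, here is a toy sketch using the FastMCP helper from Anthropic's official Python SDK (API as of early 2025; check the SDK docs). A real Snowflake server would wrap an actual connection; the metadata here is fake:

```python
# Toy MCP server exposing one discoverable tool over stdio.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("toy-warehouse")

@mcp.tool()
def get_table_metadata(table: str) -> dict:
    """Return (fake) metadata for a table. An MCP client such as Claude
    discovers this tool by name and can call it from a chat."""
    return {"table": table, "rows": 42, "columns": ["id", "created_at"]}

if __name__ == "__main__":
    mcp.run()  # stdio transport: the MCP client spawns this process
```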
- Everything a Developer needs to know about the MCP — If you want to deep-dive more, there is this article about it.
- OpenAI supports MCP in the agents SDK — The biggest news recently is OpenAI validating the protocol and supporting it (not yet in ChatGPT tho).
Fast News ⚡️
- Overclocking dbt, Discord's custom solution — Discord's data platform is huge and they hit a few dbt limitations, especially on the backfilling side (which is not really a dbt strength), so they built their own way to overcome this, leveraging the `meta` tag. They also managed to create isolated environments for the whole team and have a bunch of CI/CD jobs running on each PR validating their own internal rules.
- Current state of Databricks SQL warehouse — Does Databricks SQL outperform Snowflake?
- Deduplication in BigQuery, 7 ways to do it — If you usually do deduplication in BigQuery (or elsewhere), here are 7 patterns to achieve it (see the first sketch after this list).
- Ensuring data contracts adoption across an organization.
- The shift left data Manifesto — I did not read it because it's too long, but Chad has been a shift-left advocate for a long time. Which means: "Shifting Left means moving ownership, accountability, quality and governance from reactive downstream teams, to proactive upstream teams". Put another way: give software engineers the responsibility for the data.
- BI is dead, change my mind — It's ClickHouse's director of engineering's turn to say BI is dead; he saw the light while chatting with ClickHouse using LibreChat + a ClickHouse and a GitHub MCP server. Looking at how chat-for-everything is taking over, it's only a matter of months until stakeholders ask for data interfaces using chat.
- Local data transformation with dbt and DuckDB — Great article showcasing how you can do all your transformations locally today with dbt and DuckDB, and we even got a great DuckDB local UI (see the second sketch after this list).
- Software is now content — I really liked Benoit's post this week.
- How we built a robust ecosystem for dataset development — Duolingo's process for applying software engineering practices to data modeling, in the sense that datasets are assets that can be treated like APIs.
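First sketch: one classic deduplication pattern of the seven, keeping the latest row per key with ROW_NUMBER() and QUALIFY. I'm showing it on DuckDB so it runs locally, but the same SQL works in BigQuery; table and column names are invented:

```python
import duckdb

# A tiny table with duplicate ids to dedup.
duckdb.sql("""
    CREATE TABLE events AS
    SELECT * FROM (VALUES
        (1, 'a', TIMESTAMP '2025-04-01 10:00:00'),
        (1, 'b', TIMESTAMP '2025-04-02 10:00:00'),
        (2, 'c', TIMESTAMP '2025-04-01 09:00:00')
    ) AS t(id, payload, updated_at)
""")

deduped = duckdb.sql("""
    SELECT *
    FROM events
    QUALIFY ROW_NUMBER() OVER (
        PARTITION BY id            -- one row per business key
        ORDER BY updated_at DESC   -- keep the most recent version
    ) = 1
""")
print(deduped)
```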
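Second sketch: the local dbt + DuckDB combo can even be driven from Python through dbt's programmatic entry point (dbt-core 1.5+). The project directory is a placeholder, and its profile is assumed to target the dbt-duckdb adapter with a local DuckDB file:

```python
# Run a dbt project locally against DuckDB, no warehouse needed.
from dbt.cli.main import dbtRunner

# "my_dbt_project" is a hypothetical project; its profiles.yml is
# assumed to point at the dbt-duckdb adapter.
res = dbtRunner().invoke(["run", "--project-dir", "my_dbt_project"])
print("success:", res.success)
```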

Navigating Iceberg landscape
Over the last month a lot also happened in the data engineering space, especially around Iceberg, which is taking over a lot of the discussion when it comes to data storage.
Why is Iceberg so important right now?
Iceberg is a way to escape the data warehouses and build your own warehouse from parts, on top of bucket storage. Iceberg being open-source, it allows us to build interoperability between all systems while supporting some kind of transactional semantics on top of Parquet files.
- The Iceberg Summit took place in San Francisco (but I could not go), though Neelesh published a small recap of the Summit. I guess the videos will be on the YouTube channel soon.
- I personally think that DuckDB might be the easiest developer interface to interact with the Iceberg ecosystem, as it's dead simple to spin up a DuckDB instance. Recently we got the ability for DuckDB to attach to Iceberg, with preview support for Amazon S3 Tables (see the first sketch after this list).
- Athena vs. Snowflake on Iceberg, a performance comparison — In the end Snowflake won, being 2x less expensive. The tests use each engine on top of Iceberg datasets to see how they handle working with Iceberg. It would have been cool to compare with the same workload on native tables.
- Data wants to be free: fast data exchange with Apache Arrow — How Arrow compares to Postgres when it comes to serialisation, and why it is so fast (see the second sketch after this list).
- Cloudflare R2 data catalog — Cloudflare R2 is a global object storage (like S3) with free egress (meaning free data reads from external systems), which is a paradise for Iceberg, as lakehouses rely heavily on data reads from buckets while most of the engines live elsewhere. So Cloudflare announced an Iceberg catalog that can live close to your tables.
- Amazon reduces prices for S3 Express One Zone — Following Cloudflare's announcement, Amazon decided to reduce the prices of their own offering.
- xorq, declarative multi-engine pipelines — This new world opened by Iceberg brings us the multi-engine data stack, where we use different engines (Snowflake, DuckDB, BigQuery) for what each is great at, and store the underlying data in buckets using Iceberg, unifying everything in a catalog. xorq is one of the first multi-engine pipeline systems for ML use-cases.
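First sketch: reading an Iceberg table from DuckDB through the iceberg extension. The bucket path is hypothetical, S3 credentials are assumed to be configured, and the newer ATTACH-style catalog support (e.g. for Amazon S3 Tables) is still in preview, so the exact syntax may shift:

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL iceberg")
con.execute("LOAD iceberg")

# iceberg_scan takes the table location (its metadata lives underneath).
rows = con.sql("""
    SELECT count(*) AS n
    FROM iceberg_scan('s3://my-bucket/warehouse/db/orders')
""")
print(rows)
```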
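Second sketch: why Arrow exchange is fast, in miniature. Record batches travel in Arrow's columnar wire format, so the reader gets columns back with almost no per-row parsing, unlike a Postgres-style row protocol:

```python
import pyarrow as pa
import pyarrow.ipc as ipc

table = pa.table({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})

# Serialize: stream the columnar batches into an in-memory buffer.
sink = pa.BufferOutputStream()
with ipc.new_stream(sink, table.schema) as writer:
    writer.write_table(table)
buf = sink.getvalue()

# Deserialize: columns come back ready to use, no row-by-row decoding.
reader = ipc.open_stream(buf)
print(reader.read_all().to_pydict())
```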
And because we have to unify all the trends, an Iceberg MCP server has already been developed.
Examples and thoughts
Just to go further and connect everything, a few posts about the relationship between Iceberg and the lakehouse, where all this fuss is going, and what it could mean for your actual data stack.
- Iceberg?? Give it a REST!
- We built a data lakehouse to help dogs live longer.
- Towards Composable Data Infrastructure.
- Roadmap: data 3.0 in the lakehouse era — 4 possible theses on what the next revolution in your data stack could be.
My two cents about this: it's mainly experimental and not yet relevant at the scale most companies operate at. Warehouse + native tables is the easiest user experience you can find, and as data engineers what we want is users using our platforms, right?
Data Economy 💰
See you soon ❤️