
Data News — Week 25.43

Data News #25.43 — Best of the last 6 months of articles: AI and data eng stuff that happened.

Christophe Blefari
Stand out from the cloud (Credits)

Hey you. It's been a while! The newsletter is back. So, expect Data News to land in your inbox every week between Friday and Sunday. Same recipe as before: a bunch of links about data and AI, topped with my usual spicy opinions.

Below is a best-of from the last months of Data News, mainly the best articles about the AI and data ecosystem that I've come across. It's a great reading list.

AI News 🤖

  • The consumer AI companies are working on changing the way we browse and consume the internet
    • OpenAI brought new integrations to ChatGPT, like shopping and courses. This is a new way to consume the web: ChatGPT shopping will be a way to monetise, but also a paradigm shift in how we use the internet. OpenAI is trying to rebuild a web from within a chat, shortcutting the browser. But they also released a browser this week, named Atlas.
    • Browsers are getting more and more AI capabilities. Whether it's Dia or Comet, the goal is to give the AI browsing capabilities as if it were human. Might be a transitional phase until the whole web gets destroyed because of bots and usual websites disappear?
  • From GPT-2 to gpt-oss: analysing the architectural advances — This is a great deep-dive to understand the architecture behind all the GPTs: what layers are used, how gpt-oss compares with Qwen, etc. If you speak French, Defend Intelligence redeveloped a GPT for a YouTube video.
  • The upcoming GPT-3 moment for RL — Small essay about the current state of reinforcement learning, which needs to move from being task-specific to something that scales. Scaling RL would require something like replication training as a set of specs to reproduce complex RL scenarios.
  • AI vs Gen Z — How AI has changed the career pathway for junior developers. It was posted on the Stack Overflow blog, which ironically has also been very impacted by AI in the last 2 years. It describes the current situation well: being a junior developer was already difficult and AI made it worse (25% decrease in junior job postings in 2024), and employment for software engineers has decreased nearly 20% since its 2022 peak.

    After years of software engineering being considered a promising career, AI is changing everything: we don't learn as much as before, we don't need interns or juniors, and salaries might decrease if the job becomes less complex. But if you don’t hire junior developers, you’ll someday never have senior developers.
  • Use LLMs to analyse postmortems at Zalando — Large companies often have a large number of postmortems (memos written after incidents) and analysing them might be a great use of AI. They designed a multi-stage pipeline with: summarisation, classification, analysis, patterns and opportunity.
  • How I got the highest score on ARC-AGI again swapping Python for English — ARC-AGI is a benchmark, a kind of intelligence test designed to measure pattern recognition on puzzles that humans can easily solve.

    Currently a human panel scores 98%, while GPT-5 Pro scores 18%. The author of the article successfully scored 29% when switching from code to English.
  • An unusual consequence of AI coding: "What AI coding has taken away is the time where you know exactly what you want to implement and have a rough mental model of how to do it [...] There was a beauty and joy to this part that I miss, a flow state you can hit with a nice linear progression". Probably what factory workers might have said when their factory got automated? We don't have to think the way we were thinking before.

    Related: Dumb Cursor is the best Cursor.
  • Basic facts about GPUs — Explains how GPU compute and memory work and the different performance regimes: memory-bound, compute-bound and overhead-bound.
  • Prompt injection attacks through images — Hide text in an image so that it becomes readable when the image gets downsampled or filtered. If an LLM interprets this text, it becomes an attack surface whenever people add images to their chat conversations.
  • Context Engineering: How RAG, agents, and memory make LLMs actually useful and Learn Agentic AI: A Beginner’s Guide to RAG, MCPs, and AI Agents — Two guides to explore agentic concepts.
  • Use GEPA automated prompt optimisation to surpass Claude Opus 4.1 — Databricks achieved great performance after doing prompt optimisation on gpt-oss-120b.
  • [study] Using ChatGPT is not bad for the environment — A cheat sheet about carbon emissions related to LLMs.
  • [paper] Large Language Models often know when they are being evaluated.
  • [podcast] How GPT-5 thinks — From OpenAI’s VP of Research Jerry Tworek. He explains how reasoning works.
  • [paper] How people use ChatGPT — OpenAI ran a classifier on 1.1m sample conversations to understand how their 800m+ weekly active chatters are using the AI. It shows how widely the AI can be used to do people stuff.
Breakdown of granular conversation topic shares from a sample of approximately 1.1 million sampled conversations from May 15, 2024 through June 26, 2025 (extracted from the paper How people use ChatGPT).
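
The three performance regimes from the "Basic facts about GPUs" item above can be illustrated with a roofline-style estimate. This is a minimal sketch and is not taken from the article: the function name, the hardware figures and the fixed launch overhead are all hypothetical values chosen for illustration.

```python
def classify_regime(flops, bytes_moved, peak_flops, peak_bandwidth,
                    launch_overhead_s=5e-6):
    """Return which cost dominates a kernel's estimated runtime.

    All hardware figures passed in are illustrative assumptions,
    not measurements of any real GPU.
    """
    compute_s = flops / peak_flops            # time if only math mattered
    memory_s = bytes_moved / peak_bandwidth   # time if only data movement mattered
    costs = {
        "compute-bound": compute_s,
        "memory-bound": memory_s,
        "overhead": launch_overhead_s,
    }
    return max(costs, key=costs.get)


# Hypothetical accelerator: 100 TFLOP/s of compute, 1 TB/s of memory bandwidth.
PEAK_FLOPS = 100e12
PEAK_BW = 1e12

# Element-wise add of two float32 vectors: 1 FLOP for every 12 bytes moved.
n = 100_000_000
print(classify_regime(n, 12 * n, PEAK_FLOPS, PEAK_BW))

# Square float32 matmul (8192 x 8192): 2*m^3 FLOPs over 3 matrices of m*m*4 bytes.
m = 8192
print(classify_regime(2 * m**3, 3 * m * m * 4, PEAK_FLOPS, PEAK_BW))

# Tiny vector add: too little work to amortise the launch cost.
print(classify_regime(1_000, 12_000, PEAK_FLOPS, PEAK_BW))
```

The point of the sketch is only the comparison: whichever of the three estimated times is largest tells you which lever (faster math, less data movement, fewer launches) is worth pulling.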

Fast News ⚡️

  • Python, the documentary — A great documentary about Python and its origin, how the initial community was built and what it takes to create such a widely used piece of open-source software. Python scores 25% on the TIOBE popularity index, while numbers 2 and 3, C and C++, are at 9% each.
  • Apache Airflow Summit took place a few weeks ago, videos are not yet out but Marc Lamberti shared a few takeaways on LinkedIn like how Duolingo is using Airflow. Airflow 3 has also been released.
  • Python 3.14 is out — it gets a natural performance uplift and paves the way for the GIL changes.
  • Astral innovations to the Python ecosystem — Astral is changing Python tooling forever with well-crafted products. Recently they released:
    • astral/ty — A Python type checker, written in Rust (obv) that runs faster than anything else.
    • uv — uv format (might replace black). And, funny thing, someone solved Wordle using the uv dependency resolver.
    • pyx — If you need a private package registry Astral created pyx. Might be their way to make money at the Enterprise level to keep them working on this great tooling.
  • How not to partition data in S3 and what to do instead — When you need to partition by date on S3 you should partition using the YYYY-MM-DD format.
  • Does OLAP need an ORM? — Great question. An ORM can bring type-safety to SQL generation because database objects are translated into the native programming language. This way, when AI generates objects it knows the types and might know if something will fail before it hits the database. As chat with your data is being tried at more and more companies, this is maybe a requirement we actually need.
  • Some news about the Iceberg / lake house ecosystem.
    • ClickHouse and DuckLake now support writing to Iceberg.
    • Thoughts on DuckLake — Max explains why DuckLake might be a big thing when it comes to improving the local developer experience, as DuckLake can make DuckDB function as a data warehouse. Imagine if, while developing, you could run your usual BigQuery pipelines locally on the production data (available on GCS).
    • Cloudflare data platform — Cloudflare announced their lake house platform based on R2 (S3 compatible storage). They released R2 catalog (a fully managed Iceberg catalog) and R2 SQL. R2 SQL relies on Apache DataFusion.
    • The age of the 10$ lakehouse — A great deep-dive into the combination of the 2 previous bullet points. It is awesome to see this new kind of data platform. Back then they moved away from Fivetran + Snowflake to CDC with Debezium + Hudi (an Iceberg alternative).
  • The minimalist data stack — 5-part article describing a dltHub + dbt + BigQuery data stack.
  • If you missed it, Fivetran and dbt Labs are merging. Here are my thoughts.
  • Redefining analytics roles and aligning skills and practices for future-ready insights — How to rebalance the skills and responsibilities when analytics engineering becomes a bottleneck.
  • Data modeling framework + revisiting medallion architecture — Would you take a bit of data modeling content?
  • Analytics at scale — How to do product analytics at scale when tens of new features are released every week and product teams want to understand what's happening. The article shares the organisation Doctolib implemented and the data modeling that was put in place to make it work.
  • Scaling Success: The dbt ecosystem at BlaBlaCar — What a team of 45+ engineers had to put in place to make their dbt setup work for everyone: dev containers + extensions + a few dbt packages. If you want the same setup but without doing anything you can use nao.
  • Data as a product, applying a product mindset to data at Netflix.
  • RIP Tableau — 2 months ago Voi killed Tableau and switched to an LLM as a bridge in Slack and Sheets to accomplish what was possible in Tableau before. It required an effort on metrics definitions tho.
  • Why BI in the AI age — "Great analytics isn’t about generating charts quickly, it’s about building confidence in decisions through rigorous investigation of data. Every discovery, design choice, and contextual annotation represents a human analyst’s business intelligence."
  • Vibe Analysis — The other side of the coin.
  • Doing SQL work with LLM aids as a SQL addict.
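
For the S3 partitioning item above, here is a minimal sketch of why a single YYYY-MM-DD partition value is convenient: ISO dates sort as strings in chronological order, so a date range maps to a contiguous, ordered run of prefixes. The `dt=` key name, the `events` table name and both helper functions are hypothetical, not taken from the article.

```python
from datetime import date, timedelta


def partition_prefix(day: date, table: str = "events") -> str:
    # Hypothetical layout: <table>/dt=YYYY-MM-DD/
    return f"{table}/dt={day.isoformat()}/"


def prefixes_between(start: date, end: date, table: str = "events") -> list[str]:
    """List the partition prefixes covering [start, end], inclusive."""
    days = (end - start).days
    return [partition_prefix(start + timedelta(days=i), table)
            for i in range(days + 1)]


# A range that crosses a month boundary still yields ordered prefixes.
prefixes = prefixes_between(date(2025, 9, 29), date(2025, 10, 2))
for p in prefixes:
    print(p)

# String order and date order agree for ISO-formatted dates.
print(prefixes == sorted(prefixes))
```

With a year=/month=/day= layout, the same range would need a filter spanning three separate keys; here it is a single comparison on one value.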

I'll be speaking at OSDC AI next Tuesday about Building AI Agents is data engineering.


See you next week!


Christophe Blefari

Staff Data Engineer. I like 🚲, 🪴 and 🎮. I can do everything with data, just ask.
