
Data News — Week 24.08

Data News #24.08 — Presentation about Engines leading to DuckDB, Gemma and Gemini, Mistral Next, MDS follow-up and more.

Christophe Blefari
My ideas these days (credits)

Hey, fresh Data News edition. This week I took part in a round table about data and gave a cool presentation about engines. The idea was to depict the history of engines over the last 40 years and what led to Polars and DuckDB. Obviously I forgot a few things, and I'll do a more complete v2 soon.

This is my third presentation about DuckDB in the last 3 months and I think I'll slow down a bit until I get new crazy things to share.

Engines evolution (me)

Three points in the visualisation I made triggered discussion:

  • What about Arrow? — Apache Arrow is an awesome library that has powered a lot of innovation in the data space in recent years. But UX is where the difference lies: the DuckDB user experience is insanely magical. So yeah. But for sure I'll add Arrow in the v2.
  • Spark future — I'm convinced that Apache Spark will have to transform itself if it is not to disappear (disappear in the sense of Hadoop, still present but niche). This is already happening, according to the feedback I've had, but Spark requires more infrastructure and investment, which will continue to drive adoption down, whereas the current trend is towards simplification.
  • JVM vs. SQL data engineer — There's a big discussion in the community about what real data engineering is. Is it Java/Scala or Python? Is it DataFrames or SQL? Is it lake or warehouse? It's a sterile debate: both sides are useful and can serve different organisations with different service levels for data users and stakeholders. Still, I prefer SQL/Python data engineering, as you know.

Small reminder: I'm partnering with La Conférence MLOps, a half-day conference on the challenges of industrialising AI. It will take place on March 7 in Paris. The list of speakers includes many important figures from the French data ecosystem, and I'm very excited about it. You can get a ticket at a 40% discount with the following promo code: mlops-blef-40. We have only a few seats left.

AI News 🤖

  • Mistral AI will release Mistral Next, a ChatGPT alternative, next week. We don't have a lot of details because it has not been announced publicly; I got the news from a French political newspaper. Still, you can test mistral-next on lmsys. Here is a first review.
  • Google releases Gemma — Gemma is a family of open models, available in 2 sizes: 2B and 7B. It seems to have baseline performance compared to Llama-2.
  • The same Google got a backlash after the Gemini image generation rollout — Conservative people on social networks were upset because Gemini wasn't able to generate images of white people. Google rolled back the feature until further improvements.
  • Models comparison across key metrics — I found it via Guido on LinkedIn. It shows a lot of cool metrics, for instance the price per token, the speed and the model quality.

Fast News ⚡️

  • Is the modern data stack dead? — This is a follow-up podcast by Tristan Handy with Matt Turck (the famous VC who produces the MAD landscape), following last week's post about the MDS. In this 40-minute podcast they discuss in more detail the dynamics behind the end of the hype around the MDS, the implications of AI and the future of analytics engineering work.
  • Is the modern data stack disappearing? — An article I wrote 4 days ago as an answer to the trend. Pragmatic and easy to read. Essentially, I analyse why the semantics of the word "modern" are an issue.
  • State of the Duck — Introductory keynote of DuckCon that gives an overview of the current ecosystem and what's to come.
  • PyIceberg 0.6.0: Write support — Yeah, finally I'll be able to play a bit more with Iceberg. Still you need a catalog to make it work.
  • How you can write a Polars plugin — A dedicated website that explains how to write Polars plugins to extend the library's capabilities. To do so you'll have to write Rust and Python code. This is a good way to enter the Rust world, I guess.
  • Unit testing for data engineers — Daniel describes what you need to know as a data engineer to write tests. He mainly covers BDD (behavior-driven development) as opposed to TDD (test-driven development).
  • Understand the design principles of Snowflake — Someone took a few hours to understand Snowflake internals and this is a great wrap-up.
  • Aligning Velox and Apache Arrow — Go deeper into memory management and how you can create open standards across the different libraries.
  • Enabling near real-time data analytics on the data lake — Grab showcases what they did with Flink and Hudi to enable real-time use cases.
  • Retrieve, merge, predict: augmenting tables with data lakes — A paper that explains how you can improve data discovery on data lakes to finally augment a given table with new data. I only read the introduction and the first schema, but it looks awesome.

Cool ideas

Data Economy 💰

  • MariaDB takeover at $37m. MariaDB is a public company and could be taken over by an investment company.
  • Neurelo raises $5m seed to provide HTTP APIs on top of databases (PostgreSQL, MongoDB and MySQL). We can see it as a semantic layer, but on the software engineering side.
  • Motif Analytics raises $5.7m seed. This is a tool made to analyse sequences, especially useful in web analytics / acquisition. They provide tooling to do this without writing awful SQL queries.

See you next week ❤️.



