Hey, it's been a few weeks since I last wrote any news. It was a necessary break for me, combined with a bout of blank page syndrome. Still, I've accumulated a lot of articles that I think fit in the Data News, so this week is a big recap of content produced over the last month.
I hope you will enjoy the selection.
On Monday I'll also give a talk at the Berlin MotherDuck meetup: DuckDB experiments, a glimpse of the future. I don't think it will be streamed live, but the recording should be published on YouTube after the event.
AI News 🤖
- Sam Altman has been fired as CEO of OpenAI.
- OpenAI announced this leadership transition yesterday. At the same time, Greg Brockman (current President and co-founder) will step down as chairman of the board and Mira Murati (current CTO) will become interim CEO. It was a brutal decision.
- The official reason given publicly was "[Sam] was not consistently candid in his communications with the board, hindering its ability to exercise its responsibilities. The board no longer has confidence."
- The Internet has spent the last 15 hours guessing what this really meant. Here are a few theories I've read: a security leak occurred and Sam/Greg hid it from the board, Sam is publicly accused of sexual abuse by his sister, Sam has views on the company vision that displease the board—esp. regarding profits or AI regulations, Sam invested in an OpenAI competitor. Either way, we'll see in a few days.
- People are mostly saddened by the news because Sam was a publicly beloved and transparent CEO who changed AI. Many compare it to the coup that ousted Steve Jobs back in the day.
- The news arrived a few days after OpenAI's DevDay, a public conference announcing new products and features. Mainly they announced GPTs, a no-code UI to create custom versions of ChatGPT.
- Other AI announcements
- GitHub Universe was the occasion to announce more Copilot everywhere in the GitHub ecosystem. The most interesting part is that GitHub will introduce M1 and GPU runners.
- xAI—the company founded by Musk after he left OpenAI—announced Grok. It's a 33B-parameter LLM.
- Germany wants to build the European OpenAI competitor and invested $500m in Aleph Alpha, a startup. On the landing page it's clear that the focus is to build safe AI.
- Kyutai was announced at the AI Pulse event at Station F, Paris. Kyutai is an open science lab that aims to build and democratize AGI—artificial general intelligence—through open science. They deliberately chose open science rather than open source. The team looks great.
- The GPU availability competition is on. Y Combinator announced a Microsoft partnership and priority access to compute resources. This is also linked to Microsoft making custom AI chips.
- Biden issues executive order on safe, secure, and trustworthy AI.
- 2 reports with hundreds of pages about AI were published — The State of AI report and AI: The Coming Revolution. Both look full of interesting things, but I have not read them.
- A Google team wrote a paper "demonstrating various failure modes of transformers and degradation of their generalization for even simple extrapolation tasks". In a nutshell, LLMs struggle to generalize beyond what they were trained on.
Now that I've given you the general news, let's jump to a few AI use-cases.
- Towards a real-time decoding of images from brain activity — This is crazy: Meta researchers have been able to create a system that predicts the image a person is seeing from their brain activity, recorded with magnetoencephalography.
- LLM-powered data classification for data entities at scale — Grab explains how you can use LLMs to do classification, in this case identifying PII in the database. They explain the real-time architecture the system is using and give an example of the prompt they are using.
- Generative AI, the intern you can’t trust — A short post from the Atlassian blog that gives 3 ways to improve LLM accuracy.
- Summarizing post incident reviews with GPT-4 — Canva has so many incidents that they need an LLM to summarize them for reporting purposes 🙃. Obviously that's a joke, but while the use-case is interesting, I wonder about the real need behind it.
- Building in-video search at Netflix — What if you could prompt for a specific situation and get all the movies—at the relevant timecodes—featuring that situation? This is so cool.
- Cost analysis of deploying LLMs — All of this is cool but pricey; this post does a good job of exploring the costs.
Fast News ⚡️
Because the AI News is pretty packed and I still want you to enjoy this newsletter, articles will get less commentary than usual. But still with spicy opinions because, you know, it's me.
- Data contracts are undoubtedly a new growth lever for data observability companies and data VCs. Soda announced their open-source data contracts engine. It's done in YAML. Here is another example of contracts with msgspec.
- NVIDIA research has supercharged pandas with cuDF, which lets pandas code run on GPUs.
- Wes McKinney, creator of pandas and Arrow, will join Posit—the company behind RStudio—as a Principal Architect. His new role will probably ease the integration of Python tooling into the Posit ecosystem, even if that has already been underway for months.
- dbt Labs hired Brandon Sweeney as its new President and COO. Brandon previously handled revenue at HashiCorp, the same company that recently changed its licensing to BSL and faced backlash from the tech community for it. Our prayers go to dbt Core.
- Onehouse, Microsoft and Google are working on a table format standard called Onetable. This isn't a new format but a way to create interoperability between Delta, Iceberg and Hudi.
- If you are curious about Iceberg and Hudi ACID guarantees read the article.
- Code faster with Ruff, a Python formatter written in Rust. All the time you wasted waiting for Black to reformat your code can now be put to good use.
Taking other companies as examples is often a good way to get ideas.
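If the data contracts idea mentioned above feels abstract, here is a minimal sketch of what a contract check can look like. It is neither Soda's YAML engine nor msgspec, just plain standard-library Python, and `OrderContract`, its fields and the `violations` helper are hypothetical names made up for the illustration.

```python
from dataclasses import dataclass, fields
from typing import Any


@dataclass(frozen=True)
class OrderContract:
    """Hypothetical contract: the fields a producer promises to deliver."""
    order_id: int
    customer_email: str
    amount: float


def violations(record: dict[str, Any], contract: type = OrderContract) -> list[str]:
    """Return human-readable contract violations for a single record."""
    problems = []
    for field in fields(contract):
        if field.name not in record:
            problems.append(f"missing field: {field.name}")
            continue
        value = record[field.name]
        # field.type is the annotated class (int, str, float) in this sketch
        if not isinstance(value, field.type):
            problems.append(
                f"wrong type for {field.name}: "
                f"expected {field.type.__name__}, got {type(value).__name__}"
            )
    return problems


good = {"order_id": 1, "customer_email": "jane@example.com", "amount": 9.5}
bad = {"order_id": "1", "amount": 9.5}

print(violations(good))
print(violations(bad))
```

The point is only that the producer and the consumer agree on a schema written down in code or config, and that records get checked against it before they land downstream. Real engines add much more on top (nullability, ranges, freshness, ownership).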
- Gusto, data platform to generate HR insights — All data is sent to OneModel—a paid HR tool—in Redshift, with Tableau for visualisation.
- Criteo, how to compute data lineage — Criteo has a homemade application for data documentation called... Datadoc, in which they compute their cross-asset lineage.
- Picnic, master data management — Creating MDM for retailers is like the One Piece.
- LinkedIn, how to use 4 trillion events daily — Leveraging Apache Beam and Samza.
- Netflix, streaming SQL — Flink architecture in a data mesh organisation.
- Zalando, how to patch Postgres and fix WAL — The Zalando team explains what they patched in the Postgres JDBC driver to fix growth in the write-ahead log.
- GoDaddy, layered architecture for a data lake — Naming conventions ideas and 5 data layers: source, raw, clean, enterprise and analytical.
A few food-for-thought articles about data concepts and roles.
- From data platform to ML platform — How data platforms are built incrementally, first for analytical use-cases and then with added ML capabilities.
- Why you should not build apps directly on the data warehouse.
- SQL is not designed for analytics, and why Malloy is paving the way for the future.
- Would you become a data strategist? — Great post from Marie about a key analytical role shaping company strategy.
- Two archetypes of data engineers — Closer to business or to the tech. Best data engineering teams successfully blend the 2 archetypes.
- The Economics team at Instacart — Or how economists and PhDs become more tech-savvy, enabling more and more relevant uses of data.
Data Economy 💰
- ZenML raises an additional $3.7m in Seed funding. An MLOps platform that works with all clouds and tools.
- Snowflake acquires Sisu and Ponder. The first one is an engine to monitor business metrics while the second is a tool to run pandas at scale.
- Yahoo spins out Vespa, which raises $31m. Vespa is a search engine and a vector database. With AI use-cases booming, this is good timing for an open-source engine like this one.
- Aleph Alpha raises $500m Series B to build the German OpenAI.
- Kyutai is funded with $330m from 2 French billionaires and Eric Schmidt—ex-Google CEO. Kyutai is an open science lab that wants to build AGI. The team has a strong résumé and the scientific committee looks awesome (Yejin Choi, Yann LeCun and Bernhard Schölkopf).
Ghost implemented a recommendation feature recently, so I've added a few folks I like to read on the internet.
See you next week ❤️.