
Data News — Week 24.11

Data News #24.11 — OpenAI CTO, Musk vs. LeCun, Grok open-source?, French report about AI ambition, RAG is hype, and data engineering stuff.

Christophe Blefari

I hope this e-mail finds you well, wherever you are. I'd like to thank you for the excellent comments you sent me last week after the publication of the first version of the Recommendations. This is just the beginning!

This week I've added a subscribe button to the Recommendations page so you can opt in to the weekly recommendation email, sent every Tuesday. You can subscribe on the page starting today, and you'll get emails as soon as I've developed the email sending, which I expect to ship at the end of the month.


Second point: yato passed 100 stars on GitHub, which is a crazy number! I'd like to do a bit of user research about yato, so if you are considering using it, please drop me a message.

yato is a small Python library that I've developed. The name stands for "yet another transformation orchestrator". With yato you provide a folder of SQL queries, and it guesses the DAG and runs the queries in the right order.
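To give an idea of what "guessing the DAG" can mean, here is a minimal sketch. It is not yato's actual implementation: the function names `load_queries` and `infer_run_order` are hypothetical, and it assumes each `.sql` file is named after the table it produces and that dependencies can be spotted with a naive `FROM` / `JOIN` regex (a real tool would use a proper SQL parser).

```python
import re
from graphlib import TopologicalSorter
from pathlib import Path

# Naive pattern: the identifier that follows FROM or JOIN (assumption, not a SQL parser).
REF_PATTERN = re.compile(r"\b(?:from|join)\s+([A-Za-z_][A-Za-z0-9_]*)", re.IGNORECASE)


def load_queries(folder: Path) -> dict[str, str]:
    """Hypothetical helper: map each .sql file stem (the table name) to its SQL text."""
    return {path.stem: path.read_text() for path in sorted(folder.glob("*.sql"))}


def infer_run_order(queries: dict[str, str]) -> list[str]:
    """Return query names ordered so that every query runs after the ones it reads from."""
    known = {name.lower(): name for name in queries}
    graph: dict[str, set[str]] = {}
    for name, sql in queries.items():
        referenced = {ref.lower() for ref in REF_PATTERN.findall(sql)}
        # Keep only references to other queries in the folder; external tables are ignored.
        graph[name] = {known[ref] for ref in referenced if ref in known and known[ref] != name}
    # TopologicalSorter expects node -> predecessors and raises CycleError on a cycle.
    return list(TopologicalSorter(graph).static_order())


# In-memory example standing in for a folder of three .sql files.
example_queries = {
    "orders_clean": "SELECT * FROM raw_orders WHERE amount > 0",
    "revenue": "SELECT customer_id, SUM(amount) FROM orders_clean GROUP BY 1",
    "raw_orders": "SELECT 1 AS customer_id, 10 AS amount",
}
order = infer_run_order(example_queries)
print(order)  # ['raw_orders', 'orders_clean', 'revenue']
```

With a real folder you would call `infer_run_order(load_queries(Path("sql")))` and then execute each query in the returned order.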

AI News 🤖

  • Mira Murati answers the Wall Street Journal about OpenAI Sora — The OpenAI CTO was asked a few questions about the technology underlying Sora and shared a few insights. For the moment OpenAI considers Sora a research output that might be released later this year. Generating a video requires "much much more" compute power than DALL-E, and the company still has a lot of questions about its impact on elections and the film industry. Her main message was that "Sora is a tool to extend creativity".

    Last point: Mira has been mocked and criticised online because, as CTO, she wasn't able to say which public or licensed data Sora was trained on. When asked whether it included videos from YouTube, Facebook or Instagram, she said "I'm actually not sure about that".

    I personally really recommend this interview which covers a lot of interesting topics in 10 minutes.
  • Elon Musk publicly said that xAI will open-source Grok this week. It's Friday, and it seems they are even later than me when it comes to releasing stuff. It's a good time for a reminder that open-source ≠ open-weights when it comes to AI licensing, although differences in weights licensing are not as important as they seem.
  • Databricks invests in Mistral AI — Mistral has successfully positioned itself as the main OpenAI rival by being integrated into all the major data platforms (Azure and Snowflake previously).
  • A French commission released a 130-page report titled "Our AI: our ambition for France". You can download the French version and a 16-page English summary. The report includes 25 recommendations from French-speaking AI leaders (Yann LeCun, Arthur Mensch, etc.).
  • AI-assisted wars are around the corner — I'm only following the French news, but the government is proudly doubling its budget for "AI defense". From what I know, AI is mainly used as an information companion to find signals in the huge amount of data we generate, creating more efficient agents.

    This is related to Paris testing automated video surveillance during the Olympics. The technology behind it is Cityvision.
  • Yann LeCun clashed with Elon Musk on Twitter about the future of AI. Musk thinks that AI will be smarter than any single human next year, while LeCun said "No", taking the unfulfilled self-driving car promise as an example. Moreover, LeCun believes that human information compression capabilities are still so far ahead of AI that AGI is not even close.
  • Cognition AI introduced Devin — Devin is presented as the first AI software engineer. Devin can, unassisted, do software engineering tasks like fixing GitHub issues (13% success rate, the previous best was ~5%), apply to jobs on Upwork, and train and fine-tune its own models. I'm speechless.
  • Building Meta’s GenAI infrastructure — 2x 24k GPU clusters and it's growing. I like how Meta tries to do stuff out in the open (or at least with some kind of transparency) but the number of GPUs is just disconcerting.
  • RAG is the new trend — RAG stands for retrieval-augmented generation. The term was coined in 2020 (see more), and the technique lets you ground AI models with facts fetched from external sources.
    • The number of technologies in the RAG space is exploding, especially vector databases, so much so that I don't even mention them, but obviously the posts all say "ours is the best".
  • Croissant: a metadata format for ML-ready datasets — In order to move forward faster in AI and model building, we need an interoperable and easy-to-use metadata format for ML datasets. This is Croissant. Starting today it will be supported by 3 major platforms: Kaggle, HuggingFace and OpenML. Croissant is under mlcommons and you can have a look at the specification.
  • The State of competitive machine learning — A study about ML competition platforms. It gives a lot of insights on the market.
A new standard full of butter (credits)
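To make the RAG item above more concrete, here is a minimal sketch of the "retrieve, then ground the prompt" idea. It is a toy under loud assumptions: retrieval uses bag-of-words cosine similarity instead of embeddings and a vector database, the helper names (`retrieve`, `build_prompt`) are made up for illustration, and the generation step (sending the prompt to a model) is left out entirely.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question: str, documents: list[str], k: int = 1) -> list[str]:
    """Return the k documents most similar to the question (toy stand-in for a vector search)."""
    query_vector = Counter(tokenize(question))
    ranked = sorted(
        documents,
        key=lambda doc: cosine(query_vector, Counter(tokenize(doc))),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question: str, documents: list[str], k: int = 1) -> str:
    """Ground the question with retrieved facts; the result would be sent to a model."""
    context = "\n".join(f"- {doc}" for doc in retrieve(question, documents, k))
    return f"Answer using only the context below.\nContext:\n{context}\nQuestion: {question}"


# Tiny external "knowledge base" built from items in this newsletter.
documents = [
    "Croissant is a metadata format for ML-ready datasets supported by Kaggle, HuggingFace and OpenML.",
    "Meta runs two 24k GPU clusters for its GenAI infrastructure.",
    "Databricks invested in Mistral AI.",
]
question = "Which platforms support the Croissant metadata format?"
prompt = build_prompt(question, documents)
print(prompt)
```

In a real setup the `retrieve` step is where the vector databases mentioned above come in, but the overall shape (fetch facts, then put them in the prompt) stays the same.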

Fast News ⚡️

Forward thinking

  • Dataviz is hierarchical — Malloy, once again, provides an excellent article about a new way to see data visualisations. It's inspirational.
  • Coding data pipelines is faster than renting connector catalogs — This is something I've always believed. The devil is in the details, and when it comes to data pipelines there are a lot of details, which often keeps us from buying and leads us to build (or code). Matthaus gives the dlt vision: giving developers the foundation to create sources in a wink, resulting in a large ecosystem of easily maintainable API datasets.
  • Differential storage, a building block for a DuckDB-based data warehouse — This is MotherDuck's vision: creating the next data warehouse on top of DuckDB, leveraging DuckDB's ability to morph between a single machine and a production ecosystem. In the article, Joseph explains how MotherDuck extended DuckDB to add time travel and zero-copy snapshots, opening the door to more collaboration and concurrency.

See you next week ❤️ — recommendations for this week have been computed, go check it out.


Christophe Blefari

Staff Data Engineer. I like 🚲, 🪴 and 🎮. I can do everything with data, just ask.
