Data News — Week 24.09

Data News #24.09 — Mistral AI, Klarna AI customer support agent, extract and load still unsolved

Christophe Blefari
5 min read
[Image: trees in the wind. Caption: Mistral.]

Hello all, this is the Data News. This week's edition might be lighter than usual in terms of comments, as I'm working on a Data News related project that takes a bit of my time and will probably lead to a series of articles.

Before I forget: I appeared on The Joe Reis Show. Joe and I chatted about teaching data engineering, why it is hard, and how generative AI will change education forever. It's a 1h podcast, I hope you will enjoy listening to it.

Final reminder: La Conférence MLOps takes place next week in Paris, on March 7th. If you want to register I still have a 40% promo code: mlops-blef-40. I'll give a talk (in French) about how to put machine learning in production at a small scale, a topic related to the Data News project 😬.

AI News 🤖

  • Mistral AI announcements
    • Mistral Large, their new flagship model, which outperforms its competitors except GPT-4. At the same time Microsoft signed a partnership with Mistral to make Large available on Azure, their first distribution partner. It has led to a lot of discussion in French politics about Mistral AI being more American than French. With the partnership, Microsoft joined the Series A with an additional €15m, alongside a16z.
    • They also released a smaller model called Mistral Small.
    • Le Chat, the conversational interface to interact with Mistral models.
    • Final comment: with these 2 announcements Mistral left the open side to go commercial / closed. It led to conversations where people felt betrayed by Mistral, which built its differentiator (or should I say marketing) on top of open-source / open-weight models. Mistral perdant ("Mistral the loser").
  • GitHub Copilot Enterprise is now generally available — This week I started using GitHub Copilot (not the Enterprise version) and, let's be honest, it is a productivity boost, especially for writing docstrings and comments. Still, there is an annoying interaction in PyCharm where Copilot takes too much space. Copilot Enterprise mainly comes with 3 features: it understands your whole org's codebase, offers a chat to ask questions about that codebase, and summarises pull requests.
  • Fast, efficient active speaker detection on videos — This is a great introduction to active speaker detection, which means detecting the faces of speakers in a video and whether they are actually speaking.
  • Klarna's AI customer support agent does the equivalent of 700 agents — Klarna developed an AI agent that interacts automatically with customers, driving profit. It has to be put in context.
  • Using DuckDB + Ibis for RAG — Handy code snippet that explains why DuckDB is a good solution, bringing the best of both worlds when it comes to RAG.

Extract and load, still unsolved 🤭

I started writing data pipelines in 2014, and moving data from sources to destinations has always been one of the most discussed topics in my data engineering circles. Personally I'm the kind of guy who likes to build it custom, because I think an out-of-the-box solution does not exist. In the end you finish with a composable solution mixing 2 or 3 technologies to extract and load your data into your central storage, ready for transformations.

In 2024 we have more tools than ever to move data from sources to destinations. But the field has taken a new direction.

Until now, solutions were mainly full platforms (often in the cloud) promising to do everything, in an attempt to rebundle the data platform (cf. The unbundling of Airflow). Recently, things have gone a step further: what if extract and load were just a small library layer that integrates with whatever you're doing? For people reading me carefully, this is what I was calling for in using Airflow the wrong way, but the fun way.

Enter the new kids on the block:

  • dlt — it stands for data load tool, a Python library installable with pip. It provides a framework to do the extract and load: you define sources and resources, declare the specifics of each resource you want to load (primary keys, write disposition, incremental mode, etc.) and the library does the heavy lifting accordingly.
  • PyAirbyte — Airbyte announced their Python library in beta. It currently supports around 250 sources, a subset of all Airbyte sources (only the ones written in Python), and it does not seem to support connecting to classic databases. They call a destination a Cache, which is a terrible name. Even if the library is a great idea, I find it sad that the interoperability with Airbyte is not 100%.

    Adrian from dlt wrote a small post about PyAirbyte.
  • CloudQuery — Written in Go, with a YAML-driven configuration to move data.
  • ingestr — ingestr is a CLI tool to copy data between any databases with a single command seamlessly. It's built on top of dlt.
  • Sling — Sling is a CLI tool that extracts data from a source storage/database and loads it into a target storage/database. Written in Go.
  • Let's not forget Meltano.
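
To make the vocabulary used for dlt above more concrete (primary keys, write disposition, incremental mode), here is a minimal sketch using only the Python standard library. It is not dlt's API: the `Resource` class, the `load` function and the `fetch_users` source are hypothetical names made up for illustration, and SQLite stands in for the destination.

```python
import sqlite3
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class Resource:
    """Hypothetical declarative description of one thing to load (not dlt's API)."""
    name: str                                          # destination table
    columns: tuple                                     # column names, in order
    fetch: Callable[[Optional[int]], Iterable[dict]]   # receives the last cursor value
    primary_key: Optional[str] = None
    write_disposition: str = "append"                  # "append", "replace" or "merge"
    cursor_column: Optional[str] = None                # set it to enable incremental mode


def load(conn: sqlite3.Connection, res: Resource) -> int:
    """Load one resource into SQLite and return the number of rows written."""
    if res.write_disposition not in ("append", "replace", "merge"):
        raise ValueError(f"unknown write disposition: {res.write_disposition}")

    pk = f", PRIMARY KEY ({res.primary_key})" if res.primary_key else ""
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS {res.name} ({', '.join(res.columns)}{pk})"
    )

    # Incremental mode: resume from the highest cursor value already loaded.
    last_value = None
    if res.cursor_column:
        last_value = conn.execute(
            f"SELECT MAX({res.cursor_column}) FROM {res.name}"
        ).fetchone()[0]

    rows = [tuple(row[c] for c in res.columns) for row in res.fetch(last_value)]

    if res.write_disposition == "replace":
        conn.execute(f"DELETE FROM {res.name}")
    # "merge" upserts on the primary key, the other modes insert rows as-is.
    verb = "INSERT OR REPLACE" if res.write_disposition == "merge" else "INSERT"
    placeholders = ", ".join("?" for _ in res.columns)
    conn.executemany(f"{verb} INTO {res.name} VALUES ({placeholders})", rows)
    conn.commit()
    return len(rows)


# In-memory stand-in for a source that exposes changes over time.
SOURCE = [
    {"id": 1, "name": "Ada", "updated_at": 1},
    {"id": 2, "name": "Bob", "updated_at": 2},
]


def fetch_users(last_value: Optional[int]) -> list:
    """Return only the records newer than the last loaded cursor value."""
    return [r for r in SOURCE if last_value is None or r["updated_at"] > last_value]


users = Resource(
    name="users",
    columns=("id", "name", "updated_at"),
    fetch=fetch_users,
    primary_key="id",
    write_disposition="merge",
    cursor_column="updated_at",
)

conn = sqlite3.connect(":memory:")
first = load(conn, users)                                   # initial load
SOURCE.append({"id": 1, "name": "Ada L.", "updated_at": 3})  # user 1 changes
second = load(conn, users)                                  # only the change is fetched
final_rows = conn.execute(
    "SELECT id, name, updated_at FROM users ORDER BY id"
).fetchall()
print(first, second, final_rows)
```

The second run only fetches the record that changed, and the merge disposition updates user 1 in place instead of duplicating it. A real library also has to deal with schema evolution, typing, retries and state storage, which this sketch ignores.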

We see a pattern here: when we talk about extract and load there are 2 kinds of sources, databases and APIs, and being able to do both correctly is the key.
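
One way to picture the "small library layer" idea is that both kinds of sources can be reduced to an iterator of records feeding the same loader. The sketch below is an assumption-laden illustration using only the standard library: `database_source`, `api_source`, `load_into` and the `PAGES` fake API are hypothetical names, and the pagination contract (an `items` list plus a `next` page number) is invented for the example.

```python
import sqlite3
from typing import Callable, Iterator


def database_source(conn: sqlite3.Connection, query: str) -> Iterator[dict]:
    """Yield the rows of a SQL query as dicts keyed by column name."""
    cursor = conn.execute(query)
    names = [description[0] for description in cursor.description]
    for row in cursor:
        yield dict(zip(names, row))


def api_source(get_page: Callable[[int], dict]) -> Iterator[dict]:
    """Yield items from a paginated API until a page reports no next page."""
    page = 1
    while page is not None:
        payload = get_page(page)
        yield from payload["items"]
        page = payload.get("next")


def load_into(destination: list, rows: Iterator[dict]) -> int:
    """Stand-in loader: append records to a list and return how many were added."""
    before = len(destination)
    destination.extend(rows)
    return len(destination) - before


# A database source, here an in-memory SQLite table.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
src.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 9.5), (2, 20.0)])

# A fake paginated API, so the example runs without any network call.
PAGES = {
    1: {"items": [{"id": "a"}, {"id": "b"}], "next": 2},
    2: {"items": [{"id": "c"}], "next": None},
}

warehouse: list = []
db_count = load_into(
    warehouse, database_source(src, "SELECT id, amount FROM orders ORDER BY id")
)
api_count = load_into(warehouse, api_source(lambda page: PAGES[page]))
print(db_count, api_count, len(warehouse))
```

Because the loader only sees an iterator of dicts, it does not care whether the records come from a SQL cursor or from HTTP pages. The hard parts in practice (typing, rate limits, incremental state) live inside each source.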

On the other side of the movement there is a new open-source reverse-ETL technology called Multiwoven/multiwoven. This is built in Ruby (haha). At the moment it can sync to Facebook, Salesforce and Slack.

[Image: green trees and plants under a blue sky with white clouds. Caption: Rare footage of a Roman extract and load pipeline.]

Fast News ⚡️

Tech stuff


See you next week ❤️

Christophe Blefari

Staff Data Engineer. I like 🚲, 🪴 and 🎮. I can do everything with data, just ask.
