
Data News — Week 23.11

Data News #23.11 — Airflow alternatives meetup, Gen AI new category, online gradient descent in SQL, data with Rust, dbt exposures to sources, etc.

Christophe Blefari
5 min read
Took a few days with the ☀️ (credits)

Hey you, I hope you had a great week. On my side I'm slowly starting to get on top of the things I had in my queue. But, sadly, I work in LIFO so I feel that I'm never done. For people who are not used to it, LIFO means last in, first out: I easily get distracted by a notification (or even a thought) and end up doing something I did not plan to do at first. It probably explains why you always get the newsletter late on Fridays, or on Saturdays.

Thank you for the feedback about last week's issue, it seems you liked it. I'll try to continue doing deep dives on articles from time to time.

Airflow alternatives meetup

Click on the image to go to the LinkedIn event.

Next week, together with the Paris Apache Airflow Meetup group, we are organising an online event to discuss Airflow alternatives. At every Airflow meetup we get questions about Airflow's competition, so we decided to give a voice to the alternatives in order to understand how they compare with Airflow and more.

The first event will take place next week, on March 21st at 7PM CET (UTC+1), and we invited Mage and Kestra. We will host another event soon after with others. You can either register on LinkedIn or join the meetup event.

How lucky you are because I will host the event, so you'll hear my awesome French accent. It also means that if you have any questions that you want me to ask you can send them to me beforehand 🫠.

Gen AI 🤖

I will create a specific category for generative AI.

If you live in a cave or if you only read my newsletter to get news about the data world you might have missed that GPT-4 was announced and released this week. I even had a hard time navigating between data engineering memes and GPT-4 tips on LinkedIn, and my Twitter is divided between GPT-4 threads and protests in France. What a time to be alive. Politicians think we should work longer while we are slowly starting to discover new AI capabilities that will surely impact workplaces.

I don't want to take the usual shortcut, but how could I not? Will AI replace jobs? I do think that AI should empower people, but will capitalism think this way when an API call can do the same job as a human? Does capitalism even think? Actually, it's probably human decisions about AI that will lead to AI replacing people.

One field that has been totally impacted by generative AI is Natural Language Processing (NLP). On Reddit someone asked if others were also witnessing panic in NLP orgs. The general feeling is that GPT made years of NLP research outdated.

ℹ️
BTW, technically GPT-4 will be multimodal: you will be able to use text and images as inputs, and the model will give you text outputs.

Some other news:

Can we develop a GenAI that generates protest slogans? (credits)

Fast News ⚡️

  • Migrating from role to attribute-based access control — RBAC is probably one of the most used paradigms when it comes to authorisation, especially because role-based authorisations are faster to put in place. In the article the Grab team explains how they migrated from role-based to attribute-based authorisation on Kafka.
  • Speeding up “Reverse ETL” — Ziqi works at Microsoft and details in this article what they had to consider to improve their Lakehouse exports to downstream databases. In short, they switched SQL Server to columnar storage, disabled indexes and locks when copying, and played with parallelisation and batch size.
  • Online gradient descent written in SQL — Max is one of the best when it comes to running great experiments. This time he shows that everything can be done in SQL. With recursive CTEs he implemented an sklearn linear model, and the code is not even that long.
  • Data with Rust — This is a handbook that will showcase how to work in data engineering with Rust. At the moment only parts 1 and 2 are written but it looks promising.
  • Sharing data between dbt projects, dbt exposures to sources — When you have multiple dbt projects it can be a mess to reference a model from another project. This blog shows how you can automate it with a CI and definitions in exposures.
  • Polars vs pandas : A new era for Python DataFrames — This comparison is also slowly starting to be a great debate in the data world. Will Polars overtake pandas in the coming years? Guillaume wrote yet another great comparison.
  • Tracking the fake GitHub star black market with Dagster, dbt and BigQuery — Things are getting spicy here. The Dagster team proposed a way to potentially identify GitHub projects that buy stars.
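
For readers who have never seen online gradient descent, here is a minimal sketch in plain Python. It is not Max's SQL implementation and it does not use sklearn; the function name `online_gradient_descent` and its parameters are made up for illustration. The idea is simply that the model sees one observation at a time and nudges its weights after each one, which is the kind of row-by-row update a recursive CTE can express.

```python
from typing import Iterable, List, Sequence, Tuple


def online_gradient_descent(
    stream: Iterable[Tuple[Sequence[float], float]],
    n_features: int,
    lr: float = 0.01,
) -> Tuple[List[float], float]:
    """Fit a linear model one observation at a time (squared-error loss).

    Hypothetical helper for illustration: `stream` yields (features, target)
    pairs and `lr` is the learning rate.
    """
    weights = [0.0] * n_features
    bias = 0.0
    for x, y in stream:
        # Predict with the current parameters.
        prediction = bias + sum(w * xi for w, xi in zip(weights, x))
        error = prediction - y
        # Gradient of 0.5 * error**2 is error * xi for each weight, error for the bias.
        weights = [w - lr * error * xi for w, xi in zip(weights, x)]
        bias -= lr * error
    return weights, bias


if __name__ == "__main__":
    # Toy, noise-free stream following y = 2x + 1, replayed many times.
    toy_stream = [([x / 4], 2.0 * (x / 4) + 1.0) for x in range(5)] * 2000
    learned_weights, learned_bias = online_gradient_descent(toy_stream, n_features=1, lr=0.1)
    print(learned_weights, learned_bias)
```

Each update only needs the previous weights and the current row, which is why the whole loop can be carried as state from one recursion step to the next.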

A few other articles, without comment:

Data Economy 💰

  • The Austrian data protection authority has decided that Meta tracking tools are in violation of the GDPR. It will create a precedent.
  • Seldon raises $20m Series B. Seldon is an MLOps platform that helps you deploy models in production. At its core Seldon provides a framework that you can configure to serve your models on top of Kubernetes.
  • 👀 Adept raises $350m Series B. This is again a testimony to the frenzy about generative AI and, in my opinion, the most impressive one. Adept wants to create a general purpose AI teammate for everyone. At the moment it takes the form of a browser extension in which you can ask for things while you navigate on Salesforce, Google Sheets or Craigslist.
  • Cast AI raises $20m in funding. They propose an AI to cut your Kubernetes costs in half. Bold promise.

See you next week ❤️.
