People getting their Data News at the local market (credits)

Welcome to a new edition of Data News. I hope this email finds you well. Enjoy this week's reading.

Data fundraising πŸ’°

Maxime Beauchemin's new post — Reshaping Data Engineering

If you missed it, you should read it: how the Modern Data Stack is reshaping Data Engineering. The post is more of an appetizer: it gives an overview of the discipline and tries to define its trends in simple, honest words.

Scaling Airflow on Kubernetes: lessons learned

My friends at Qonto wrote a feedback article about scaling Airflow on top of Kubernetes. In the post, Yannick details the parameters to look at when you want to fine-tune your cluster performance on both sides (Airflow and Kubernetes).

How to improve at SQL as a data engineer

We all know that data engineering teams often require both Python and SQL. Even if, to me, Python — or any other data-oriented language — is more important, SQL still matters. In a multi-skilled data team, having data engineers who master SQL, and more importantly data modeling, is a must-have. The post covers all the important topics.

How to rerun a task with Airflow

This is maybe the most complicated Airflow concept. Each time I teach Airflow I take time to explain why Airflow is built like this and why you should clear tasks instead of triggering them again. The Astronomer team wrote a guide about task reruns.

But don't forget that clearing tasks is useless if your tasks are not idempotent and deterministic — you don't want to insert the same data into your tables twice, for instance.

PS: one annoying thing about task clearing is that it updates the Airflow metadata database directly, so it will break any monitoring you have built on top of that database.
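To make the idempotency point concrete, here is a minimal sketch of my own (not from the Astronomer guide), using SQLite as a stand-in for a warehouse. The delete-then-insert pattern means a cleared and rerun task leaves the table in the same state as a single run; the table and function names are illustrative.

```python
import sqlite3

def load_daily_events(conn, execution_date, rows):
    """Idempotent load: running this twice for the same date yields the same table state."""
    with conn:  # one transaction: the delete and the insert commit together
        conn.execute("DELETE FROM events WHERE event_date = ?", (execution_date,))
        conn.executemany(
            "INSERT INTO events (event_date, payload) VALUES (?, ?)",
            [(execution_date, p) for p in rows],
        )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (event_date TEXT, payload TEXT)")

load_daily_events(conn, "2021-11-19", ["a", "b"])
load_daily_events(conn, "2021-11-19", ["a", "b"])  # simulated rerun after a clear

count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
print(count)  # 2, not 4: the rerun did not duplicate rows
```

A plain `INSERT`-only task would have left 4 rows after the rerun, which is exactly the kind of double-loading that makes clearing dangerous.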

Data Engineering learning path β€” metaphor (credits)

The Modern Real-Time Data Stack

WTF Christophe, what is this again? At the moment the MDS is focused on ELT and warehousing, and not many articles or visions cover the real-time part. This article on thenewstack tries to define the limits of real-time and what you should do to add streams as a source.

To be practical, I also propose an awesome guide to building a real-time Metabase dashboard on top of Materialize. Marta uses the Twitch API to get data and publish it into Kafka, then uses streaming SQL to query the data live from Metabase. It looks promising.

Today you can read everywhere on the blogs that you should start treating your data like a product. But to do so you have to adapt your organization, and we've seen some articles about this in the past, like this one.

To support data products, Databand says that you also need Data SLAs. This article is a good introduction to the service-level KPIs you should define to be less blind.
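As a toy illustration of what a data SLA check could look like — my own hypothetical example, not from the Databand article — here is a freshness check: the newest record must be less than two hours old, otherwise monitoring should alert. The threshold and function names are assumptions.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical freshness SLA: the latest record must be under two hours old.
FRESHNESS_SLA = timedelta(hours=2)

def freshness_ok(latest_record_at, now=None):
    """Return True if the newest record is within the freshness SLA."""
    now = now or datetime.now(timezone.utc)
    return now - latest_record_at <= FRESHNESS_SLA

now = datetime(2021, 11, 19, 12, 0, tzinfo=timezone.utc)
print(freshness_ok(datetime(2021, 11, 19, 11, 0, tzinfo=timezone.utc), now))  # True: 1h old
print(freshness_ok(datetime(2021, 11, 19, 9, 0, tzinfo=timezone.utc), now))   # False: 3h old
```

The same shape works for other SLAs (completeness, row counts, schema checks): a metric, a threshold, and a boolean you can alert on.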

AI Friday

This category is back, even if renamed, because this week I have some AI news to share.

Is this the Metaverse? (credits)

Fast News ⚑


PS: Can you guess where I am?