People getting their Data News at the local market (credits)

Welcome to a new edition of Data News. I hope this email finds you well. Enjoy this week's reading.

Data fundraising πŸ’°

Maxime Beauchemin's new post — Reshaping Data Engineering

If you missed it, you should read it: how the Modern Data Stack is reshaping Data Engineering. The post is more of an appetizer: it gives an overview of the discipline and tries to define its trends in simple, honest words.

Scaling Airflow on Kubernetes: lessons learned

My friends at Qonto wrote a feedback article about scaling Airflow on top of Kubernetes. In the post, Yannick details the parameters to look at when you want to fine-tune your cluster performance on both sides (Airflow and Kubernetes).

How to improve at SQL as a data engineer

We all know that data engineering teams often require both Python and SQL. Even if, to me, Python — or any other data-oriented language — is more important, SQL still matters. In a multi-skilled data team, having data engineers who master SQL, and more importantly data modeling, is a must-have. The post covers all the important topics.

How to rerun a task with Airflow

This is maybe the most complicated Airflow concept. Each time I teach Airflow I take time to explain why Airflow is built like this and why you should clear tasks instead of triggering them again. The Astronomer team wrote a guide about task reruns.

But don't forget that clearing tasks is useless if your tasks are not idempotent and deterministic — you don't want to insert the same data into your tables twice, for instance.

PS: one annoying thing about task clearing is that it updates the Airflow metadata database directly, so it will break any monitoring you have built on top of that database.
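To make the idempotency point concrete, here is a minimal sketch of my own (not from the Astronomer guide), using SQLite as a stand-in for a warehouse. The delete-then-insert pattern means a cleared and rerun task leaves the table in the same state as a single run; the table and function names are illustrative.

```python
import sqlite3

def load_daily_events(conn, execution_date, rows):
    """Idempotent load: running this twice for the same date yields the same table state."""
    with conn:  # one transaction: the delete and the insert commit together
        conn.execute("DELETE FROM events WHERE event_date = ?", (execution_date,))
        conn.executemany(
            "INSERT INTO events (event_date, payload) VALUES (?, ?)",
            [(execution_date, p) for p in rows],
        )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (event_date TEXT, payload TEXT)")

load_daily_events(conn, "2021-11-19", ["a", "b"])
load_daily_events(conn, "2021-11-19", ["a", "b"])  # simulated rerun after a clear

count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
print(count)  # 2, not 4: the rerun did not duplicate rows
```

A plain `INSERT`-only task would have left 4 rows after the rerun, which is exactly the kind of double-loading that makes clearing dangerous.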

Data Engineering learning path β€” metaphor (credits)

The Modern Real-Time Data Stack

WTF Christophe, what is this again? At the moment the MDS is focused on ELT and warehousing, and not many articles or visions cover the real-time part. This article on thenewstack tries to define the limits of real-time and what you should do to add streams as a source.

To be practical, I also propose an awesome guide to building a real-time Metabase dashboard on top of Materialize. Marta uses the Twitch API to get data and publish it into Kafka, then uses streaming SQL to query the data live from Metabase. It looks promising.

Today you can read everywhere on the blogs that you should start treating your data like a product. But to do so you have to adapt your organization, and we've seen some articles about this in the past, like this one.

To support data products, Databand says that you also need Data SLAs. This article is a good introduction to the service-level KPIs you should define to be less blind.
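As a toy illustration of what a data SLA check could look like — my own hypothetical example, not from the Databand article — here is a freshness check: the newest record must be less than two hours old, otherwise monitoring should alert. The threshold and function names are assumptions.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical freshness SLA: the latest record must be under two hours old.
FRESHNESS_SLA = timedelta(hours=2)

def freshness_ok(latest_record_at, now=None):
    """Return True if the newest record is within the freshness SLA."""
    now = now or datetime.now(timezone.utc)
    return now - latest_record_at <= FRESHNESS_SLA

now = datetime(2021, 11, 19, 12, 0, tzinfo=timezone.utc)
print(freshness_ok(datetime(2021, 11, 19, 11, 0, tzinfo=timezone.utc), now))  # True: 1h old
print(freshness_ok(datetime(2021, 11, 19, 9, 0, tzinfo=timezone.utc), now))   # False: 3h old
```

The same shape works for other SLAs (completeness, row counts, schema checks): a metric, a threshold, and a boolean you can alert on.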

AI Friday

This category is back, even if renamed, because this week I have some AI news to share.

Is this the Metaverse? (credits)

Fast News ⚑


PS: Can you guess where I am?