
Data News — Week 44

Data News #44: Informatica IPO, dbt and Snowflake partner, Reshaping Data Engineering, Facebook stops facial reco, data quality monitoring ideas.

Christophe Blefari
People getting their Data News at the local market (credits)

Hello and welcome to a new edition of the Data News. I hope this email finds you well. Enjoy this week's reading.

Data fundraising 💰

  • The old dinosaur Informatica went public for the second time on the NYSE. The company was founded in 1993 (!) and was already public once, from 1999 to 2015. They are still selling an all-in-one platform for "Data Management", but in the Cloud (obviously).
  • Yellowbrick Data, a cloud data warehouse, raised $75m in Series C to accelerate growth and to compete with a myriad of players. Their key differentiator is that they offer an on-premise deployment service plus native streaming capabilities. What a good time to be a data warehouse.
  • A data labeling company that fights against poverty by giving people work, Sama, raised $70m in Series B. They have a Hub in Kenya where most of the work is probably done, and they have developed an advanced workflow to label video, images and text.
  • dbt Labs and Snowflake announced a partnership to help companies succeed in integrating dbt with Snowflake. They will team up on customer success, and you'll be able to access a dbt Cloud free trial from Snowflake Partner Connect. Interesting move.

Maxime Beauchemin's new post: Reshaping Data Engineering

If you missed it, you should read it: how the Modern Data Stack is reshaping Data Engineering. Actually, this is more of an appetizer: the post is meant to give an overview of the discipline and tries to define trends in simple, true words.

Scaling Airflow on Kubernetes: lessons learned

My friends at Qonto wrote a feedback article about scaling Airflow on top of Kubernetes. In the post, Yannick details which parameters to look at when you want to fine-tune your cluster performance on both sides (Airflow and Kube).

How to improve at SQL as a data engineer

We all know that data engineering teams often require both Python and SQL. Even if, to me, Python (or any other data-oriented language) is more important, SQL still matters. In a multi-skilled data team, having data engineers who master SQL and, more importantly, data modeling is a must-have. The post covers all the important topics.

How to rerun a task with Airflow

This is maybe the most complicated Airflow concept. Each time I teach Airflow I take time to explain why Airflow is built like this and why you should clear instead of trigger. The Astronomer team wrote a guide about task reruns.

But don't forget that it's useless to clear tasks if your tasks are not idempotent and deterministic — you don't want to re-insert the data in your tables for instance.
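To make the idempotency point concrete, here is a minimal sketch of a load step that can be cleared and rerun safely. It is not taken from the Astronomer guide: the `events` table, the `load_partition` helper and the sample rows are all made up for illustration, and SQLite stands in for your real warehouse. The idea is to replace the whole partition for the run's date in one transaction instead of blindly appending.

```python
import sqlite3


def load_partition(conn, day, rows):
    """Idempotent load: replace the whole partition for `day` in one transaction."""
    with conn:  # commits on success, rolls back on error
        # Remove whatever a previous run of the same task wrote for this day.
        conn.execute("DELETE FROM events WHERE day = ?", (day,))
        conn.executemany(
            "INSERT INTO events (day, user_id, amount) VALUES (?, ?, ?)",
            [(day, user_id, amount) for user_id, amount in rows],
        )


# Hypothetical table and data, only here to show the behavior.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (day TEXT, user_id INTEGER, amount REAL)")

rows = [(1, 9.5), (2, 3.0)]
load_partition(conn, "2021-11-05", rows)
load_partition(conn, "2021-11-05", rows)  # simulates clearing and rerunning the task

row_count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
print(row_count)
```

With a plain `INSERT` and no `DELETE`, the second run would double the rows. In an Airflow task the `day` value would typically come from the run's logical date, so that a cleared task rewrites exactly the partition it owns.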

PS: something annoying with task clearing is that it directly updates the Airflow metadata, so it'll break any monitoring you have built on the Airflow database.

Data Engineering learning path — metaphor (credits)

The Modern Real-Time Data Stack

WTF Christophe, what is this again? At the moment the MDS is focused on ELT and warehousing, and few articles or visions cover the real-time part. This article on The New Stack tries to define the limits of real-time and what you should do to add streams as a source.

To be practical, here is an awesome guide to building a real-time Metabase dashboard on top of Materialize. Marta uses the Twitch API to get data and publish it into Kafka, then uses SQL on the streams to query the data live from Metabase. It looks promising.

Today you can read on blogs everywhere that you should start treating your data like a product. But to do so you need to adapt your organization, and we've seen some articles about this in the past, like this one.

To support data products, Databand says that you also need Data SLAs. This article is a good introduction to the service-level KPIs you should define to be less blind.
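As a rough illustration of what a service-level KPI can look like, here is a small sketch of a freshness SLA check. This is not Databand's definition: the one-hour threshold, the function names and the timestamps are all assumptions picked for the example. It checks whether each dataset was loaded recently enough, then reports the share of datasets that meet the SLA.

```python
from datetime import datetime, timedelta


def freshness_sla_met(last_loaded_at, now, max_delay):
    """True when the dataset was refreshed within the allowed delay."""
    return now - last_loaded_at <= max_delay


def sla_compliance(check_results):
    """Share of checks that passed, between 0.0 and 1.0."""
    return sum(check_results) / len(check_results)


# Hypothetical last-load timestamps for four datasets.
now = datetime(2021, 11, 5, 9, 0)
loads = [
    datetime(2021, 11, 5, 8, 30),  # 30 minutes old
    datetime(2021, 11, 5, 6, 0),   # 3 hours old
    datetime(2021, 11, 5, 8, 55),  # 5 minutes old
    datetime(2021, 11, 4, 23, 0),  # 10 hours old
]

# Assumed SLA: every dataset must be at most one hour old.
results = [freshness_sla_met(t, now, timedelta(hours=1)) for t in loads]
compliance = sla_compliance(results)
print(results, compliance)
```

Freshness is only one possible KPI. The same pattern (a boolean check per dataset, aggregated into a compliance rate) could apply to completeness or volume checks.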

AI Friday

This category is back, even if renamed, because this week I have some AI news to share.

Is this the Metaverse? (credits)

Fast News ⚡

PS: Can you guess where I am?


Christophe Blefari

Data Engineering Coach who enjoys all kinds of data platforms.