Summer is coming, and the planet is burning (credits)

Dear members, this is the return of the Data News 🎉. These last 2 weeks we got few fundraising rounds and some awesome posts. I really enjoyed Zalando's post about Airflow tests and Benn's about Google.

In the coming weeks I plan to do some user research about the newsletter and how you consume it. If you want to contribute, ping me and I'll schedule a half-hour meeting.

Data Fundraising 💰

Should Google wake up?

Amazon sells to developers and engineers. Salesforce champions lines of business. Microsoft and Oracle win with IT. Google, it seems, hasn’t established itself yet.

Once again, using metaphors, Benn gives us a great way to look at the cloud industry. He thinks that Google is close to having the best data tools but lacks a vision to tie everything together. I agree with everything said in the post. Rather than releasing a fancy new product every 6 months, can Google truly embrace the modern data stack?

Airflow is still the cool kid

With the recent 2.3 release, the Airflow community is still bringing new ideas and features to the product (cf. Dynamic Task Mapping). I recently used the new Grid view, which looks neat.

On the commercial side, Astronomer, the company offering managed Airflow instances, launched its Astro product. This is not a revolution: it packages their existing offer with the OpenLineage integration that followed the Datakin acquisition, which looks interesting. On the marketing side, they sell Astro Runtime, a cloud-optimized distribution of Apache Airflow. I hope it won't diverge too much from Airflow.

If you still need an Airflow introduction, the HiPay team detailed why they picked Airflow and what its main concepts are. In addition, Jarek, the biggest Airflow contributor, wrote a best-of post from the Airflow Summit (I know you're still waiting for mine). This is a great list.

Finally, the Zalando team explained how they spin up a test environment for each new DAG version. This is a great hack to see.

Design patterns in the data world

Design patterns are somewhat important in the software world. As always, if we want to build stuff out of data, we should consider every good practice from there. Eugene wrote Design Patterns in machine learning code and systems to show what's possible. It includes patterns like factory, adapter, decorator, strategy, iterator, pipeline, proxy and mediator.
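
To make one of these concrete, here is a minimal sketch of the strategy pattern applied to a small data task, filling missing values. It is not taken from Eugene's post; the names impute and FillStrategy are hypothetical and only there for illustration.

```python
from statistics import mean, median
from typing import Callable, List, Optional, Sequence

# A "strategy" is any callable that turns the observed values into a fill value.
# FillStrategy and impute are hypothetical names used for this sketch.
FillStrategy = Callable[[Sequence[float]], float]


def impute(values: Sequence[Optional[float]], strategy: FillStrategy) -> List[float]:
    """Replace missing entries (None) with the value chosen by the strategy."""
    observed = [v for v in values if v is not None]
    fill = strategy(observed)
    return [fill if v is None else v for v in values]


raw = [1.0, None, 3.0, 8.0]
filled_with_mean = impute(raw, mean)      # swap behaviour by passing another callable
filled_with_median = impute(raw, median)
print(filled_with_mean, filled_with_median)
```

The calling code never changes when a new filling rule is needed: you pass a different callable, which is the whole point of the pattern.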

In addition to Eugene's post, there is a more generic post (not contextualised for machine learning) about design patterns and SOLID principles.

Flashy patterns (credits)

Tackling data tests

This is a hard topic. There are many ways to achieve data tests in modern data stacks. We have static and runtime tests. The static tests run in the CI/CD, while the runtime ones run directly on the warehouse. The best solution is probably a combination of both.
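
As a rough illustration of the difference, here is a small sketch that uses an in-memory SQLite database as a stand-in for a warehouse. The users table, the model query and both function names are assumptions made up for the example, and what counts as a "static" test varies from team to team.

```python
import sqlite3

# Hypothetical model definition that would live in the repository.
MODEL_SQL = "SELECT id, email FROM users"
EXPECTED_COLUMNS = ["id", "email"]


def static_test(conn: sqlite3.Connection) -> bool:
    """CI-style check: the model compiles and exposes the expected columns.
    It runs against an empty schema, so no real data is needed."""
    cursor = conn.execute(f"SELECT * FROM ({MODEL_SQL}) LIMIT 0")
    return [col[0] for col in cursor.description] == EXPECTED_COLUMNS


def runtime_test(conn: sqlite3.Connection) -> dict:
    """Warehouse-style check: inspect the actual rows for bad data."""
    nulls = conn.execute(
        "SELECT COUNT(*) FROM users WHERE email IS NULL"
    ).fetchone()[0]
    dupes = conn.execute(
        "SELECT COUNT(*) FROM (SELECT id FROM users GROUP BY id HAVING COUNT(*) > 1)"
    ).fetchone()[0]
    return {"null_emails": nulls, "duplicate_ids": dupes}


conn = sqlite3.connect(":memory:")  # stand-in for a real warehouse
conn.execute("CREATE TABLE users (id INTEGER, email TEXT)")
static_ok = static_test(conn)  # passes before any data is loaded

conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [(1, "a@x.io"), (2, None), (2, "b@x.io")],
)
report = runtime_test(conn)  # only the loaded rows reveal the problems
print(static_ok, report)
```

The static check would catch a renamed column before deployment, while only the runtime check sees the null email and the duplicated id, which is why combining both is appealing.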

LinkedIn detailed how they do data quality management. Behind the scenes, this is about how they detect issues in the metadata: availability, freshness, schema, completeness. In addition, Ismail detailed how he automates tests for Redshift at Doctolib, with different possible strategies.
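
The four metadata dimensions listed above can be sketched as simple checks. This is not LinkedIn's implementation; the DatasetMetadata fields, the check_metadata function and the thresholds are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List, Optional


@dataclass
class DatasetMetadata:
    """Hypothetical metadata record for one dataset."""
    exists: bool
    last_updated: Optional[datetime]
    columns: Dict[str, str]  # column name -> type
    row_count: int


def check_metadata(
    meta: DatasetMetadata,
    now: datetime,
    max_age: timedelta,
    expected_columns: Dict[str, str],
    min_rows: int,
) -> List[str]:
    """Return the names of the quality dimensions that fail."""
    if not meta.exists:
        return ["availability"]  # nothing else can be checked
    issues = []
    if meta.last_updated is None or now - meta.last_updated > max_age:
        issues.append("freshness")
    if meta.columns != expected_columns:
        issues.append("schema")
    if meta.row_count < min_rows:
        issues.append("completeness")
    return issues


now = datetime(2022, 6, 3, 12, 0)
schema = {"id": "int", "email": "text"}
stale = DatasetMetadata(
    exists=True,
    last_updated=now - timedelta(days=3),
    columns=schema,
    row_count=90,
)
issues = check_metadata(
    stale, now, max_age=timedelta(days=1), expected_columns=schema, min_rows=100
)
print(issues)
```

Working on metadata rather than on the rows themselves keeps such checks cheap, since none of them needs to scan the data.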

Finally, Datafold showcased 7 dbt testing best practices you can use.

ML Friday 🤖

DALL-E 2 is quickly becoming a new internet star. DALL-E is an AI model that generates images from any text you give it. Recently, a lot of people on the internet tried the new version, which looks more than promising. Alberto demystified DALL-E on Towards Data Science.

But if you feel that DALL-E does not perform well enough, it seems that Google Imagen outperforms it.

DALL-E vision of me writing my newsletter (credits)

Fast News ⚡️


s/o to medhio for creating Data Creators Club, a search bar to find the best data creators out there. Use it to find blogs, newsletters or YouTube channels related to data.


  1. This is a term I've invented, but it means all the trendy databases attached to the modern data stack wave: warehouses and others.