blef.fr

Summer is coming, and the planet is burning (credits)

Dear members, Data News is back 🎉. Over the last 2 weeks we got a few fundraising rounds and some awesome posts. I really enjoyed Zalando's post about Airflow tests and Benn's post about Google.

In the coming weeks I plan to do some user research about the newsletter and how you consume it. If you want to contribute, ping me and I'll schedule a half-hour meeting.

Data Fundraising 💰

Castor raised $23.5m in Series A. Castor is a data catalog created 2 years ago. Within the tool you can search, catalog, govern and see the lineage of your data. They are betting on simplicity to drive adoption within enterprises, and adoption is the biggest issue of all data catalogs. As I've already said, catalogs are only half of the solution; the rest is on the human side.
DataStax got $115m in a private round. They started by providing Cassandra as a cloud service a few years ago. Last year they decided to offer managed Apache Pulsar to cover streaming use cases. To me they are trying to board the modern data storage¹ train.
Meltano raised another $8.2m in funding. Almost one year after their first round they are trying to reorient their product. As the TechCrunch article says, they were navigating between building an ETL tool and an end-to-end data platform, their DataOps OS. They are now on track to pick the latter. I consider this data operating system vision to be one of the next major trends.
Immuta, a secure data access company, secured a $100m Series E. They provide a central platform to manage data access.
Continual AI raised $14.5m in Series A to add a machine learning layer to the modern data stack. The product sits on top of your warehouse: you define ML features using SQL and can then run predictions directly in the warehouse.

Should Google wake up?

Amazon sells to developers and engineers. Salesforce champions lines of business. Microsoft and Oracle win with IT. Google, it seems, hasn’t established itself yet.

Once again, using metaphors, Benn gives us a great way to look at the cloud industry. Benn thinks that Google is close to having the best data tools but has no vision to tie everything together. I agree with everything said in the post. Rather than releasing a fancy new product every 6 months, can Google truly embrace the modern data stack?

Airflow is still the cool kid

With the recent 2.3 release, the Airflow community is still bringing new ideas and features to the product (cf. Dynamic Task Mapping). I recently used the new Grid view, which looks neat.

On the commercial side, Astronomer, the company offering managed Airflow instances, launched their Astro product. This is not a revolution: it packages their existing offer with an OpenLineage integration that follows the Datakin acquisition, which looks interesting. On the marketing side they sell their Astro Runtime, a cloud-optimized distribution of Apache Airflow. I hope it won't diverge too much from Airflow.

If you still need an introduction to Airflow, the HiPay team detailed why they picked Airflow and what the main concepts are. In addition, Jarek, the biggest Airflow contributor, wrote a best-of post from the Airflow Summit (I know you're still waiting for mine). It's a great list.

Finally, the Zalando team explained how they spin up a test environment for each new DAG version. It's a great hack to see.

Design patterns in the data world

Design patterns are somewhat important in the software world. As always, if we want to build things out of data we should borrow the good practices from there. Eugene wrote Design Patterns in machine learning code and systems to show what's possible. It includes patterns like factory, adapter, decorator, strategy, iterator, pipeline, proxy and mediator.

In addition to Eugene's post, there is a more generic post (not contextualised for machine learning) about design patterns and SOLID principles.
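To make one of the listed patterns concrete, here is a minimal sketch of the strategy pattern applied to a small data task. It is not taken from Eugene's post: the imputation use case and the names `STRATEGIES` and `impute` are hypothetical, chosen only for illustration.

```python
from statistics import mean, median
from typing import Callable, Dict, List, Optional

Column = List[Optional[float]]
# A strategy computes the fill value from the observed (non-missing) values.
ImputeStrategy = Callable[[List[float]], float]

# Registry of interchangeable strategies (hypothetical names).
STRATEGIES: Dict[str, ImputeStrategy] = {
    "mean": mean,
    "median": median,
    "zero": lambda observed: 0.0,
}


def impute(column: Column, strategy: str = "mean") -> List[float]:
    """Replace missing values using the strategy selected by name."""
    observed = [v for v in column if v is not None]
    fill = STRATEGIES[strategy](observed)
    return [fill if v is None else v for v in column]


filled_mean = impute([1.0, None, 3.0], "mean")
filled_zero = impute([1.0, None, 3.0], "zero")
```

The calling code never changes when a new strategy is added to the registry, which is the point of the pattern. Note that "mean" and "median" raise an error on a column with no observed values; a real implementation would have to decide what to do in that case.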

Tackling data tests

This is a hard topic. There are many ways to implement data tests in modern data stacks. We have static and runtime tests. Static tests run in CI/CD, while runtime ones run directly on the warehouse. The best solution is probably a combination of both.

LinkedIn detailed how they do data quality management. Behind the scenes, this is how they detect issues from metadata: availability, freshness, schema, completeness. In addition, Ismail detailed how he automates tests for Redshift at Doctolib, with several possible strategies.
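As a rough illustration of what a runtime test looks like, the sketch below runs two checks (no missing keys, no duplicate keys) as plain SQL queries. It does not reproduce LinkedIn's or Doctolib's setup: SQLite stands in for the warehouse, and the `orders` table, the `order_id` column and the `run_checks` helper are all hypothetical.

```python
import sqlite3


def run_checks(conn: sqlite3.Connection, table: str, key: str) -> dict:
    """Run two runtime checks and return the offending count for each one."""
    checks = {
        # Rows where the key is missing.
        "not_null": f"SELECT COUNT(*) FROM {table} WHERE {key} IS NULL",
        # Distinct key values that appear more than once.
        "unique": (
            f"SELECT COUNT(*) FROM ("
            f"SELECT {key} FROM {table} WHERE {key} IS NOT NULL "
            f"GROUP BY {key} HAVING COUNT(*) > 1)"
        ),
    }
    return {name: conn.execute(sql).fetchone()[0] for name, sql in checks.items()}


# In-memory database standing in for the warehouse, with deliberately bad rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 9.5), (2, 20.0), (2, 20.0), (None, 3.0)],
)

results = run_checks(conn, "orders", "order_id")
failed = [name for name, bad in results.items() if bad > 0]
```

Each check is a query that returns zero when the data is healthy, so the same queries can run after every load and feed an alert when `failed` is not empty. The table and column names are interpolated directly into the SQL, which is only acceptable here because they are hard-coded.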

Finally, Datafold showcased 7 dbt testing best practices you can use.

ML Friday 🤖

DALL-E 2 is quickly becoming a new internet star. DALL-E is an AI model that generates images from any text you give it. Recently a lot of people on the internet tried the new version, which looks more than promising. Alberto demystified DALL-E on Towards Data Science.

But if you feel that DALL-E does not perform well enough, it seems that Google's Imagen outperforms it.

DALL-E's vision of me writing my newsletter (credits)

Fast News ⚡️

Fake Snowflake data the easy way — James demonstrated an easy way to create synthetic data in your warehouse using the Python Faker library. It requires creating a FAKE Python UDF that calls Faker; you then use it wherever you need fake data.
Acing Twitch's SQL screen — A lot of data role interviews will exercise your SQL skills, whether you're a scientist, an analyst or an engineer. The Twitch team wrote a guide to help you train your querying skills.
Understand big data file formats — A big deal in data engineering, and an awesome explainer post.
Why your data pipelines need a fail-safe
Delta vs Iceberg : Performance as a decisive criteria — Delta seems to outperform Iceberg.
Building Spark lineage For data lakes — The Monte Carlo team detailed what's behind their lineage technology for SQL and Spark. It's interesting to see what's under the hood.
The shift from data pipelines to data products — Simon tells you why you should consider writing declarative DAGs rather than imperative ones.
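The fake-data item at the top of this list relies on one idea: expose a Python function as a SQL function, then call it in a query. The sketch below shows that idea only, and it swaps in different tools from James's post: SQLite replaces Snowflake, and the standard `random` module replaces Faker. The `fake` function, its "name" and "amount" kinds, and the sample values are hypothetical.

```python
import random
import sqlite3

FIRST_NAMES = ["Ada", "Grace", "Linus", "Margaret"]  # hypothetical sample values
rng = random.Random(42)  # seeded so that runs are repeatable


def fake(kind: str):
    """Return one fake value for the requested kind."""
    if kind == "name":
        return rng.choice(FIRST_NAMES)
    if kind == "amount":
        return round(rng.uniform(1, 100), 2)
    raise ValueError(f"unknown kind: {kind}")


conn = sqlite3.connect(":memory:")
# Register the Python function as a one-argument SQL function called FAKE.
conn.create_function("FAKE", 1, fake)

# A small driver table: one generated row per row in seq.
conn.execute("CREATE TABLE seq (n INTEGER)")
conn.executemany("INSERT INTO seq VALUES (?)", [(i,) for i in range(5)])

rows = conn.execute(
    "SELECT n, FAKE('name'), FAKE('amount') FROM seq"
).fetchall()
```

Once the function is registered, generating more data is just a matter of selecting from a bigger driver table, which is what makes the UDF approach convenient.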

s/o to medhio for creating Data Creators Club, a search bar to find the best data creators out there. Use it to find blogs, newsletters or YouTube channels related to data.

¹ This is a term I've invented; it covers all the trendy databases attached to the modern data stack wave: warehouses and others.