Coming from data tools shopping spree (credits)

Hello friends, I really liked writing last week's edition, even if it was too short and I did not go deep enough into my introspective thoughts. But I promise it will come back one day. Today's edition will probably feel like teleshopping. Unfortunately I don't set the agenda yet, and everyone decided to announce something this week.

Data fundraising 💰

There was a lot of fundraising in the data field this week. It is fun to analyse because VC money tends to reveal trends.

And to finish this long list of data fundraising, Kumo.ai and Ascend.io announced $18.5m in Series A and $31m in Series B respectively. The first one developed a new way to approach machine learning for the enterprise using graph modelling over enterprise data, and the second one develops an all-in-one tool to do everything related to data and analytics engineering.

Airbyte acquires reverse ETL company Grouparoo

Data platforms are easy. We have data storage with inbound and outbound pipes, and transformations on top. Regarding the inbound pipes, Airbyte is leading the open-source conversation. Conceptually, if you are capable of doing the in, the out could just be the pipe reversed. But yesterday Airbyte acquired Grouparoo, an open-source outbound pipes technology (sometimes called reverse ETL), in order to enter this segment.

To be honest, seeing how modular Airbyte is, I bet that this acquisition is a reputation / people acquisition rather than a technology one, because Airbyte will build everything on top of what they already have. And if that's not possible, there may be an issue somewhere in their promise.

By coincidence of the calendar, Rudderstack announced their reverse ETL product this week.

PS: I am obviously caricaturing the reverse ETL job, and I know that reading from endpoints is different from writing to them.

Airbyte taking over Grouparoo (credits)

Reddit r/place data and architecture

If you weren't on the internet this week you may have missed Reddit's r/place subreddit, a 2000 x 2000 pixel canvas where every redditor could colour one pixel every 5 minutes. Reddit gave us some statistics about the event: in 4 days, around 10m users placed more than 160m tiles. They ran the same event in 2017 and explained how they built it technically (that event was 10x smaller).
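To make the mechanics concrete, here is a minimal sketch of a canvas with a per-user cooldown. It is a toy illustration, not Reddit's implementation: the `Canvas` class, the 16-colour palette (4 bits per pixel, two pixels per byte) and the in-memory dictionary of timestamps are all assumptions made for the example.

```python
import time


class Canvas:
    """Toy r/place-style canvas (hypothetical, not Reddit's code)."""

    def __init__(self, width=2000, height=2000, cooldown_s=300):
        self.width = width
        self.height = height
        self.cooldown_s = cooldown_s  # 5 minutes between placements
        # Assumption: 16 colours, so each pixel fits in 4 bits (2 pixels per byte).
        self.pixels = bytearray((width * height + 1) // 2)
        self.last_placed = {}  # user id -> timestamp of last accepted placement

    def get(self, x, y):
        """Return the colour index (0-15) stored at (x, y)."""
        idx = y * self.width + x
        byte = self.pixels[idx // 2]
        # Even indices live in the high nibble, odd indices in the low nibble.
        return byte >> 4 if idx % 2 == 0 else byte & 0x0F

    def place(self, user, x, y, colour, now=None):
        """Colour one pixel; return False if the user is still on cooldown."""
        now = time.time() if now is None else now
        if not (0 <= x < self.width and 0 <= y < self.height):
            raise ValueError("pixel out of bounds")
        if not 0 <= colour < 16:
            raise ValueError("colour must fit in 4 bits")
        last = self.last_placed.get(user)
        if last is not None and now - last < self.cooldown_s:
            return False
        idx = y * self.width + x
        byte = self.pixels[idx // 2]
        if idx % 2 == 0:
            byte = (colour << 4) | (byte & 0x0F)
        else:
            byte = (byte & 0xF0) | colour
        self.pixels[idx // 2] = byte
        self.last_placed[user] = now
        return True
```

With these assumptions the whole 2000 x 2000 board fits in 2 MB, and a call such as `Canvas().place("alice", 10, 20, 5)` returns `True` the first time and `False` if the same user tries again within 5 minutes. The hard part at Reddit's scale is of course not the data structure but serving it to millions of concurrent users.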

The Validio team, a startup based in Sweden, analysed the 20 most popular data engineering tools in the Nordics and, surprisingly (not really), BigQuery ranked third behind Airflow and dbt. The reason is Spotify. Spotify has been the big data-driven company in Europe that inspired many others, and if you work in tech in Sweden you have probably worked there or know someone who works there. So, as we all just copy-paste what others do, people use BigQuery like Spotify.

I ran a small survey (no science behind it) in a French-based community about scheduling, and Airflow was used by more than 90% of the respondents. On Airflow x dbt, Astronomer announced the new dbt Cloud provider to standardize the way we interact with those tools. If you like Airflow, I also found a cross-DAG diagram generator to draw the whole picture. Don't look at the examples, they are bad.

To conclude this category, I want to share Jacques' thoughts about the modern data stack for marketing, or as they call it MarketingOps or MOps. Last year I shared a lot of stuff around warehouses as Customer Data Platforms, and the transformation is still going on.

BigLake

Google. Google, I professionally like you, because 4 years ago when I started working on GCP I really liked BigQuery and everything around GCP. Everything was simple to use and straightforward. But when I read this BigLake stuff I think you'll lose me for the sake of this marketing competition against the Databricks Lakehouse concept. BigLake is the name Google chose for the multi-cloud capabilities of BigQuery data storage. The idea is to provide unified data storage APIs across clouds for compute.

Big Lake Cloud Refuge (credits)

Miro Data Engineering team’s journey to monitoring

The Miro data engineering team detailed their journey to monitoring and observability. If you are building a data platform, this post is a goldmine of concepts to help you understand what you need to define for your incident management system. You can complete the picture with these 10 processes that will help you define your data quality routines.

On that topic, I discovered the term circuit breakers on the Monte Carlo blog, and I really liked it. Like a pipe valve, a circuit breaker stops the flow to prevent data pollution from spreading downstream.
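As an illustration of the circuit breaker idea, here is a minimal sketch of a pipeline step that refuses to load a batch when a quality check fails. Everything in it is hypothetical: the `CircuitOpen` exception, the null-rate check, the `user_id` column and the 10% threshold are assumptions chosen for the example, not Monte Carlo's API.

```python
class CircuitOpen(Exception):
    """Raised when a quality check fails and the pipeline must stop."""


def null_rate(rows, column):
    """Share of rows where `column` is missing or None (1.0 for an empty batch)."""
    if not rows:
        return 1.0
    missing = sum(1 for row in rows if row.get(column) is None)
    return missing / len(rows)


def circuit_breaker(rows, column, max_null_rate=0.1):
    """Return the batch unchanged if it passes the check, otherwise trip."""
    rate = null_rate(rows, column)
    if rate > max_null_rate:
        raise CircuitOpen(
            f"{column}: null rate {rate:.0%} exceeds {max_null_rate:.0%}"
        )
    return rows


def run_pipeline(rows, load):
    """Check the batch first; `load` only runs when the circuit stays closed."""
    checked = circuit_breaker(rows, "user_id")
    load(checked)
```

The design choice is that the check sits between the transformation and the load, so a bad batch raises an error and never reaches downstream tables, instead of being detected after the fact by a dashboard user.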

Fast News ⚡️


See you next week ❤️.