Hello friends, I really liked writing last week's edition, even if it was too short and I did not go deep enough into my introspective thoughts. But I promise it will come back one day. Today's edition will probably feel like teleshopping. Unfortunately I don't set the agenda yet, and everyone decided to announce something this week.
Data fundraising 💰
A lot of fundraising in the data field this week. This is fun to analyse because VC money tends to reveal trends.
- Grafana Labs raised $240m in Series D, less than one year after their previous round. Thanks to the shift to the cloud and Kubernetes, the Grafana stack has been playing a key role in tech stacks, increasing visibility and observability. Maybe Grafana still suffers from being seen as a DevOps tool rather than a data one, but seeing the Snowflake logo on the landing page shows that something could change in the near future.
- Data.World raised $50m in Series C, one year after the B, to bring another data catalog to the cloud world. But it seems the promise goes a bit deeper, with an all-in-one tool to manage and query your data with a project view. The pricing starts at 50k annually. They clearly compete with Atlan in the data workspace segment.
- Tinybird got $37m in Series A to provide a cloud platform that creates API endpoints on top of all your data (batch and real-time) in minutes. The product is built with ClickHouse as its main OLAP warehouse, with in-house connectors for Kafka, S3/GCS, Postgres and BQ/Snowflake. We've seen a lot of companies entering the serverless real-time data platform space and, to be honest, Tinybird looks awesome among them.
And to finish this long list of data fundraising, Kumo.ai and Ascend.io announced $18.5m in Series A and $31m in Series B respectively. The first one developed a new way to approach machine learning for enterprises, using graph data modeling over enterprise data, and the second one develops an all-in-one tool to do everything related to data and analytics engineering.
Airbyte acquires reverse ETL company Grouparoo
Data platforms are easy: data storage with inbound and outbound pipes, and transformations on top. On the inbound side, Airbyte is leading the open-source conversation. Conceptually, if you are capable of doing the in, the out could just be the pipe reversed. But yesterday Airbyte acquired Grouparoo, an open-source outbound pipes technology (sometimes called reverse ETL), in order to enter this segment.
To be honest, seeing how modular Airbyte is, I bet that this acquisition is a reputation / people acquisition rather than a technology one, because Airbyte will build everything on top of what they already have. And if that's not possible, there may be an issue somewhere in their promise.
PS: I am obviously caricaturing the reverse ETL job, and I know that reading from endpoints is different from writing to them.
Reddit r/place data and architecture
If you weren't on the internet this week you may have missed Reddit's r/place subreddit, a 2000 x 2000 pixel canvas where every redditor could colour one pixel every 5 minutes. Reddit gave us some statistics about the event: in 4 days, around 10m users placed more than 160m tiles. In 2017 they ran the same event and explained how they technically did it (that event was 10x smaller).
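To make the rules concrete, here is a minimal sketch of a canvas with a per-user cooldown. This is only my illustration of the mechanics described above, not how Reddit built it: the `Canvas` class, its in-memory dictionaries and the `place` method are all hypothetical names, and a real system would need shared storage and much more.

```python
import time

CANVAS_SIZE = 2000          # 2000 x 2000 pixels, as in this year's event
COOLDOWN_SECONDS = 5 * 60   # one pixel per user every 5 minutes


class Canvas:
    """Hypothetical in-memory canvas, for illustration only."""

    def __init__(self, size=CANVAS_SIZE, cooldown=COOLDOWN_SECONDS):
        self.size = size
        self.cooldown = cooldown
        self.pixels = {}       # (x, y) -> colour, stored sparsely
        self.last_placed = {}  # user -> timestamp of their last accepted pixel

    def place(self, user, x, y, colour, now=None):
        """Try to colour one pixel; return True if accepted, False if on cooldown."""
        now = time.time() if now is None else now
        if not (0 <= x < self.size and 0 <= y < self.size):
            raise ValueError("pixel outside the canvas")
        last = self.last_placed.get(user)
        if last is not None and now - last < self.cooldown:
            return False  # the user has to wait before placing again
        self.pixels[(x, y)] = colour
        self.last_placed[user] = now
        return True
```

For a sense of scale, the canvas holds 4 million pixels, and 160m tiles spread over 10m users is roughly 16 tiles per user on average.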
The 20 most popular data engineering tools in the Nordics
The Validio team, a startup based in Sweden, analysed the 20 most popular data engineering tools in the Nordics and, surprisingly (not really), BigQuery was ranked third behind Airflow and dbt. The reason behind it is Spotify. Spotify has been the big data-driven company in Europe that inspired a lot of people, and if you do tech in Sweden you probably worked there or know someone who works there. So, as we all just copy-paste what others do, people use BigQuery like Spotify.
I did a small survey (no science behind it) in a French-based community about scheduling, and Airflow was used by more than 90% of the respondents. On Airflow x dbt, Astronomer announced the new dbt Cloud provider to standardize the way we interact with those tools. If you like Airflow, I also found a cross-DAG diagram generator to draw the whole picture. Don't look at the examples, they are bad.
To conclude this category I want to share Jacques' thoughts about the modern data stack for marketing, or as they call it, MarketingOps or MOps. Last year I shared a lot of stuff around warehouses as Customer Data Platforms, and the transformation is still going on.
Google. Google, I like you professionally, because 4 years ago when I started working on GCP I really liked BigQuery and everything around it. Everything was simple to use and straightforward. But when I read this BigLake stuff, I think you'll lose me for the sake of a marketing competition against the Databricks Lakehouse concept. BigLake is the name Google chose for the multi-cloud capabilities of BigQuery data storage. The idea is to provide unified data storage APIs across clouds for compute.
Miro Data Engineering team’s journey to monitoring
The Miro data engineering team detailed their journey to monitoring and observability. If you are building a data platform, this post is a goldmine of concepts to help you understand what you need to define for your incident management system. You can complete the picture with these 10 processes that will help you define your data quality routines.
On that topic, I discovered the term circuit breakers on the Monte Carlo blog, and I really like it. A circuit breaker works like a valve on a pipe, to prevent data pollution.
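Here is a minimal sketch of what a circuit breaker could look like in a pipeline, assuming a simple rule on the share of missing values. Everything in it is hypothetical and mine (the `CircuitBreaker` class, the `max_null_ratio` threshold, the `load` function and the `user_id` field); it is not taken from Monte Carlo or any specific tool.

```python
from dataclasses import dataclass


class CircuitOpenError(RuntimeError):
    """Raised when a batch fails its quality check and the load must stop."""


@dataclass
class CircuitBreaker:
    # Hypothetical threshold: maximum share of rows allowed to miss the field.
    max_null_ratio: float

    def check(self, rows, field):
        """Raise CircuitOpenError if the batch is empty or has too many nulls."""
        if not rows:
            raise CircuitOpenError("empty batch")
        nulls = sum(1 for row in rows if row.get(field) is None)
        ratio = nulls / len(rows)
        if ratio > self.max_null_ratio:
            raise CircuitOpenError(
                f"{field}: {ratio:.0%} nulls exceeds {self.max_null_ratio:.0%}"
            )


def load(rows, breaker, warehouse):
    """Append rows to the warehouse only if the breaker lets them through."""
    breaker.check(rows, "user_id")  # the valve: bad batches stop here
    warehouse.extend(rows)
```

The point of the design is that the check runs before the write, so a bad batch never reaches the downstream tables instead of being detected after the fact.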
Fast News ⚡️
- Scaling our dependency graph: Doctrine explained how they scaled their Python dependency graph with pip-tools to lower the time spent resolving conflicts.
- Delta Live Tables (DLT): Databricks announced Delta Live Tables. As I don't get the point, I can only say that they announced it.
- A portable devkit for CI/CD pipelines (dagger): the creators of Docker released a devkit to build, test and debug CI/CD pipelines locally. What a revolution. No more "test ci" commit messages?
- Transform team announced MetricFlow: Transform starts with a proposal for a metrics engine. With this semantic layer you define your metrics, and you can then query them through the exposition server. I need to dive deeper into this topic to give you more insights (soon).
- Manage time series data pipes with Meerschaum
- Get Lyn Health’s Data Laboratory feedback on deploying Dagster on ECS
- Get a glimpse of Python 3.11 new features like the new fancy tracebacks
See you next week ❤️.