Data News — Week 12
Data News #12 — Astronomer, Hex, YZR fundraising, navigate the modern data stack, make data engineering shine again, and more as usual.
Hello dear members, this week I've activated the paid membership on the blog. It's in beta for the moment (additional perks are not yet available), so if you want to show some early support for what's coming soon, you can upgrade.
No worries, Data News will always be free to read.
This week Data Council took place in Austin. A lot of tools and people from the data community were there to discuss the future of data. I've seen some tweets about it and I can't wait for the YouTube videos.
Data fundraising 💰
- Astronomer is going big. The commercial developer of Airflow acquired Datakin this week and raised $213m in Series C. They did a major rebrand a few weeks ago and are going to push Airflow orchestration forward in the modern data stack context. Datakin, a startup specialized in data lineage with Marquez and the OpenLineage initiative, will bring high value to the Airflow/Astro ecosystem, helping it become the central place in the stack thanks to lineage.
- Hex announced their $52m Series B with Snowflake and Databricks among the investors. Hex's tech is here to fix the last part of the data journey: knowledge creation. The product brings a new way to do data exploration and sharing that is worth seeing.
- YZR, a Paris-based startup, raised $12m in Series A to expand in the US. YZR provides a SaaS platform to increase data quality with 3 main features: standardization, labelling and fuzzy matching.
- datagen closed a $50m Series B to lead the way in synthetic data production. They want to provide generators that create data-as-code, so we can get the data we need to build more performant models.
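Fuzzy matching, one of the data-quality features mentioned above, is easy to get a feel for. Here is a minimal sketch using Python's standard `difflib`, not YZR's actual method; the brand list and similarity cutoff are made up for the example.

```python
import difflib

def standardize(value, reference_values, cutoff=0.6):
    """Map a messy input value to the closest canonical value, if close enough."""
    # Compare case-insensitively, but return the canonical spelling.
    lookup = {ref.lower(): ref for ref in reference_values}
    match = difflib.get_close_matches(value.lower(), list(lookup), n=1, cutoff=cutoff)
    return lookup[match[0]] if match else None

brands = ["Coca-Cola", "Pepsi", "Dr Pepper"]
print(standardize("coca cola", brands))  # -> Coca-Cola
print(standardize("pepssi", brands))     # -> Pepsi
print(standardize("fanta", brands))      # -> None (no close-enough match)
```

In a real data-quality pipeline you would run this against a curated reference table and review the low-confidence matches by hand.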
The life of a data engineer: The Game
Firebolt's creativity once again reaches new heights. They developed a 2D game where you play a data engineer navigating broken data pipelines in the modern data stack while the C-level board waits. Have fun.
Navigate through the data stack
In the big data game you'll see a lot of different logos and companies. The number of tools you can use to deploy your data stack has exploded in the last 3 years. In order to understand and follow what's going on, the first thing you can do is subscribe to this newsletter.
You can also read the updated report from a16z — a Silicon Valley VC investing in a lot of successful data companies — on Emerging Architectures for Modern Data Infrastructure. The report gives you blueprints describing, in one drawing, every part of your infrastructure: sources, ingestion, storage, query and processing, transformation and output.
To complete the picture you can also read a16z's Data50 list of the world's top 50 data startups. The analysis is interesting because they propose a categorisation for each company as well as a split by founding location, year and money invested. My main takeaway is the amount invested in the query and processing category (around 37% of the money raised, if we exclude Databricks' huge amounts).
Last but not least, Secoda wrote a modern data stack glossary with more than 75 concepts to help people enter this new field.
Details on GitHub service disruptions
I really like posts detailing outages or service disruptions because we can learn a lot from them. GitHub recently had performance trouble and incidents with their MySQL cluster. In the post they detail the timeline and their next steps.
How to make my Data Engineering department shine again
Alexandre from Papernest gives us the weekly post about data organization. He tries to figure out how you can develop your data engineering team and make it shine. Do we need data engineers? How can the DE team be useful to the whole company? Alexandre answers these questions and more.
Headless BI, Datalake vs. data warehouse
- What is Headless BI? — Cube, an API-first BI platform, defines what headless BI is and why you will need Cube (or something similar) in the future. They also explain how it integrates with dbt and the concepts around dbt metrics.
- Datalake vs. a data warehouse — another post on this topic. I think this is a super write-up.
Trying out delta lake and Hex
Last week we had a personal project reading wine bottles; this week Paul tried Hex and Delta Lake to get alerts on crypto prices. It's nice because it gives you a glimpse of the kind of stuff — notebooks — you can achieve with Hex. Sadly his project seems down, but I like the idea.
Fast News ⚡️
- The secrets of PostHog query performance — PostHog is an open-source analytics platform built on top of ClickHouse. In the article they detail what they did to achieve high query performance.
- How I built a music synthesiser using SQL — Another fun side project. Ramiro built a wav sound from SQL code. 🤯
- Visualise your Spotify data — Using Pandas to visualise your Spotify usage, everyone had this in mind.
- RepliByte — Qovery, an all-in-one DevOps platform for building apps on AWS, developed a CLI tool to replicate Postgres databases while hiding sensitive data.
- Implementing the GDPR ‘Right to be Forgotten’ in Delta Lake — I've spoken about data lakes, Delta Lake and hiding sensitive data, so this post is the culmination: it combines everything.
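The SQL synthesiser idea above can be approximated in surprisingly few lines: a recursive CTE acts as a row generator, and a SQL expression computes each audio sample. This is a minimal sketch using SQLite from Python, not Ramiro's actual code; `sin` is exposed to SQL as a user-defined function because stock SQLite builds often lack the math functions.

```python
import math
import sqlite3
import struct
import wave

SAMPLE_RATE = 8000  # samples per second
DURATION = 1        # seconds
FREQ = 440          # A4 note

con = sqlite3.connect(":memory:")
con.create_function("sin", 1, math.sin)  # expose math.sin to SQL

# Phase increment per sample, precomputed in Python.
step = 2 * math.pi * FREQ / SAMPLE_RATE

# Generate one second of a 440 Hz sine wave entirely in SQL:
# the recursive CTE produces row numbers 0..N-1, the SELECT maps
# each one to a 16-bit signed sample.
rows = con.execute("""
    WITH RECURSIVE t(n) AS (
        SELECT 0
        UNION ALL
        SELECT n + 1 FROM t WHERE n + 1 < ?
    )
    SELECT CAST(32767 * sin(? * n) AS INTEGER) FROM t
""", (SAMPLE_RATE * DURATION, step)).fetchall()
samples = [r[0] for r in rows]

# Write the samples out as a mono 16-bit wav file.
with wave.open("tone.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(SAMPLE_RATE)
    w.writeframes(struct.pack(f"<{len(samples)}h", *samples))
```

Mixing several frequencies, or driving the frequency from another table, is just more SQL in the same SELECT.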
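The Spotify visualisation above boils down to a classic pandas pattern: group the listening history by artist and sum the play time. A minimal sketch, using a toy DataFrame shaped like Spotify's StreamingHistory export (I'm assuming its `artistName`, `trackName` and `msPlayed` field names here):

```python
import pandas as pd

# Toy stand-in for the StreamingHistory JSON you get from a Spotify data export.
history = pd.DataFrame([
    {"artistName": "Daft Punk", "trackName": "One More Time", "msPlayed": 320000},
    {"artistName": "Daft Punk", "trackName": "Around the World", "msPlayed": 428000},
    {"artistName": "Air", "trackName": "La Femme d'Argent", "msPlayed": 430000},
])

# Hours listened per artist, most-played first.
top_artists = (
    history.groupby("artistName")["msPlayed"].sum()
           .div(3_600_000)  # milliseconds -> hours
           .sort_values(ascending=False)
)
print(top_artists.head(10))
```

With a real export you would load the JSON files with `pd.read_json` and finish with `top_artists.plot.barh()` for the chart.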