Data News — Week 22.40
Data News #22.40 — Lightdash and Immerok fundraises, the ClickHouse Cloud launch, data engineering migration projects and more.
Dear members. Once again a late Friday edition. I was travelling this week and I slept too much. But no more excuses: here is your Data News edition.
Data fundraising 💰
- Lightdash is finally launching their commercial product. They raised $8.4m in funding (pre-seed + seed). Lightdash is a dbt-based BI tool. It leverages metrics and dimensions defined within dbt to provide an explore UI where you can create visualisations to answer questions, and later add these to dashboards. In my opinion Lightdash is conceptually very similar to Metabase.
- Immerok raised a $17m seed round to launch a serverless service for Apache Flink. The promise is to make real-time mainstream by providing a no-operations platform while keeping access to all the Flink APIs.
- ClickHouse Cloud launched, one year after their $250m Series B. ClickHouse is a real-time OLAP database originally developed within Yandex. The database promises to reconcile the warehouse-first approach with real-time performance. The Cloud (AWS only for now) will charge you for storage, compute, writes and reads if you "pay as you go".
What a crazy period we live in. Every open-source technology launches a cloud-based offering of its tool, expecting the revenue to finance development. Is it really sustainable?
A bit of data engineering
I do not share much about what I do as a data engineer outside of this newsletter. Even if this probably deserves a dedicated post, today I'll write a section about the data engineer's life. At the moment I'm working on two projects that are migrations. For the first project I'm migrating a 12-year-old custom-made analytical application to a new one built with Apache Superset.
I also feel that a lot of the projects I've worked on as a data engineer were migrations. Sometimes small migrations like changing a data pipeline, sometimes larger ones like migrating a warehouse technology or an orchestration tool.
Migrations fuel data engineering work today and Ben depicts it greatly in his new post Realities of being a data engineer — Migrations. As Ben says, we deal with different kinds of migrations: operating systems, hardware, cloud, analytics or data. Every migration obviously carries risk, and that's why we do preparatory work to mitigate it. But even with solid experience we can't plan for the unexpected, and deadlines will slide.
Later in the post Ben proposes a 5-step framework every migration should follow:
- Initiate — Justify the migration and get buy-in from stakeholders
- Design and discovery — Do the product work and design what you expect, take time to explore the unknown
- Execute implementation — Develop what you have to develop and automate the boring stuff (a lot of migrations contain boring stuff)
- Testing and validation — Check everything and do a double run with your old system and the new one
- Roll out and the long tail — Decide when to stop the old system and use the opportunity to change the processes with the new system
After all the different migrations I've done and read about, one piece of advice I can give you is to invest in developing custom tools to track and help the migration. For instance, if you have to migrate 200 SQL queries from Postgres to BigQuery, build a dashboard that shows the migration's progress and provide automated scripts to do the dumb work for you. Migration work is boring, so gamify it.
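As a minimal sketch of what such a tracking tool could look like (all query names and statuses here are hypothetical, not from any real migration), you can keep a small registry of queries and compute the progress figures a dashboard would display:

```python
from dataclasses import dataclass

@dataclass
class QueryMigration:
    """Tracks one query being migrated, e.g. from Postgres to BigQuery."""
    name: str
    migrated: bool = False   # query rewritten for the new warehouse
    validated: bool = False  # double run: old and new outputs match

def progress_report(queries):
    """Summarise migration progress as percentages for a dashboard."""
    total = len(queries)
    migrated = sum(q.migrated for q in queries)
    validated = sum(q.validated for q in queries)
    return {
        "total": total,
        "migrated_pct": round(100 * migrated / total, 1),
        "validated_pct": round(100 * validated / total, 1),
    }

# Hypothetical registry of queries to migrate
queries = [
    QueryMigration("daily_revenue", migrated=True, validated=True),
    QueryMigration("user_retention", migrated=True),
    QueryMigration("churn_cohorts"),
    QueryMigration("marketing_spend"),
]
print(progress_report(queries))
# {'total': 4, 'migrated_pct': 50.0, 'validated_pct': 25.0}
```

In a real migration the registry would be filled by a script scanning the old codebase, and the report rendered in a dashboard so the team can watch the numbers climb.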
To illustrate this post, Ronnie from Airbnb described how they upgraded their data warehouse infrastructure, migrating from Hive to Spark 3 + Iceberg.
ML Friday 🤖
- Homepage recommendation with exploitation and exploration — How DoorDash created a personalized homepage with their custom ranking algorithms.
- Also this week Etsy wrote about their search ranking personalisation with Deep Learning.
- Finally, Walmart detailed their machine learning platform further. In a nutshell this is a big platform with a lot of fancy technologies involved. It sits on top of Kubernetes and, be ready, mentions BigQuery, Spark, Cassandra, Trino, Hive and GCS, at least as data storage platforms.
- 📅 The feature store summit will take place next week on Oct. 11th.
Fast News ⚡️
- The EU wants to put companies on the hook for harmful AI — "A new bill will allow consumers to sue companies for damages—if they can prove that a company’s AI harmed them." Once again the EU regulates, probably for the best, while companies are trying AI everywhere. If it ripples to others like the GDPR did, it could be good.
- Recruitment Difficulties, an analysis of 2019 French company data — This is a study from the French statistical studies bureau. The study outlines a high mismatch between labour supply and demand.
- Use Iceberg to reduce storage cost — Deniz describes how migrating from ORC + Snappy to Iceberg with Parquet + Zstandard drastically reduced their S3 GetObject costs (by ~90%). As a side effect it also reduced the Spark compute cost by 20%.
- ❤️ Postgres: a better message queue than Kafka? — Dagster recently launched their cloud offering. They decided to use Postgres as the foundation for their logging system. This post explains why. I really like the post because it deals with technology choices and problem framing.
- matanolabs/matano — The open-source security lake platform for AWS. Matano gives you a way to query and alert on logs collected from all your sources. Matano stores everything as Iceberg files in S3, and you can write Python rules to get real-time alerts on top of it.
- dbt repository — to split or not to split? ; this is a hard question for every dbt developer. Should I go for a monorepo as dbt recommends, or should I go for a modular approach? Adrian covers both ways in the post. I personally think everyone should start with a monorepo. Once the data team moves to a mesh organisation, the modular approach with packages should be considered.
- Another tool won’t fix your MLOps problems — Whether it's MLOps or DataOps, we have too many tools and yet more marketing than practitioners in the space. We need to reach the plateau, like DevOps did, to avoid collecting tools like Panini cards.
- What are we missing in data CI/CD pipelines? — Thoughts around an incremental CI/CD approach for dbt.
See you next week ❤️.
Join the newsletter to receive the latest updates in your inbox.