Hey, I hope you're doing well. Let's jump into the Data News: this week is full of great content.
Data Fundraising 💰
- Validio raised $15m in seed funding. The Swedish startup addresses data quality issues with a SaaS platform. It sits on top of a lot of different data sources (warehouses, lakes & streams) to find "bad data". Their pricing includes a free tier.
- Starburst acquired Varada. I'll try to summarize. Presto was created at Facebook. After a conflict, the Presto community split and Trino was founded. The 4 original Presto creators joined Starburst to provide services and to maintain Trino. Starburst has now bought Varada, a company providing Presto/Trino optimisations for your lake. They claim this integration reduces Starburst compute prices by 40%. That says a lot about the inflated cloud bills companies are probably paying.
- CloudQuery raised $15m in a Series A — cf. week 38. CloudQuery is your cloud inventory, queryable in SQL. It sits on top of your cloud providers' APIs to give you SQL access to the details of your resources. They market the tool as a way to control costs but, above all, security and compliance. It looks promising. It's time for data teams to help SREs.
- Ataccama received a $150m investment. They provide a "unified data management platform" to cover quality, cataloging, MDM and visualisation in the same tool. I don't fully get their product vision.
- Eppo raised $19.5m in seed + Series A. Experimentation is a side topic to data, but a super important one. They took inspiration from NATU's way of experimenting. Eppo sits on top of your warehouse and encapsulates everything you need to run tests.
Last week I forgot to write specifically about the Snowflake Summit. It was in Vegas and people were as excited as if Steve Jobs had appeared in the warehouse. Here is some of what was announced:
- Snowpark in Python is in public preview — it's time to write awful UDFs with Python inside to become the next Oracle.
- Build Streamlit (let's say interactive dashboards) inside the Snowflake UI — not yet launched, but it may become the biggest revolution in the visualisation space.
- Unistore (technical details) — use Snowflake for transactional workloads and run real-time apps on top of your transactional data. Huge promise. It amplifies the conclusion of my first bullet.
Data teams organisation — apply the JTBD framework
It's been a long time since I last shared thoughts on data teams. This week Emilie proposed using the JTBD framework to build more effective data teams. The Jobs to be done framework is a way to prioritize work. As a data team, our mission is to empower people in their decision-making.
If you correctly identify your Jobs, the team will naturally enable others to drive business impact. Emilie shared 5 frequent jobs she has observed. I recommend reading them all.
To some extent, I also recommend reading Christophe's post on the Airbyte blog on how to structure a data team to climb the AI pyramid of needs.
Our journey towards an open data platform
Doron shares how they built their data platform at Yotpo. All the drawings are super useful and clear. When you look closely at the platform, you might question the need for ~3 data stores (Redshift, Snowflake and Databricks) and 3 visualisation tools.
This is a good experience-sharing post, but it also shows very well the explosion of technologies — cf. state of data engineering — we face today in the data ecosystem. Cherry-picking tools is becoming an art.
A framework for designing document processing solutions
Data extraction from documents is slowly becoming, for a lot of companies, one of the best ways to apply artificial intelligence to the business and to help operational teams. Humans love paperwork, and all of this paperwork is just asking to be parsed.
Lester James proposed a framework for designing document processing solutions. Converting PDFs to usable data is a key task. To do that he proposes 3 steps: annotation, multimodal models and an evaluation step. As a disclaimer, he showcases Prodigy, an annotation lib he works on.
From Jupyter to Kubernetes: Refactoring and Deploying Notebooks
On the other hand, you can also try to measure the CO2 impact of your notebooks (on Azure).
Side projects FTW
I've always been a huge fan of side projects as a way to learn something, so each time I see people doing extra stuff with data I make sure to share it because it resonates with me.
This week Jack built a data pipeline for his own Strava data to visualise everything in Tableau. Almost 7,200 km in 2021, well done.
Fast News ⚡️
- 📚 Fundamentals of Data Engineering (70€) 📚 — A new O'Reilly book that looks promising. It covers a lot. PS: I've not read the book yet.
- Monarch: Google's planet-scale in-memory time series database — I deliberately copy-pasted the title as Google wrote it and I can't wait to get the first interstellar database. While writing my joke I discovered PlanetScale, a MySQL serverless OS database.
- fal released their Python models in dbt — since dbt announced the same feature a few weeks ago, I don't know if they will last.
- How we implemented a Tableau governance strategy — Great inspiration for Tableau users. It features Marie Kondo.
- Data Ingestion in Apache Druid — A walk-through showing what it takes to tune Druid for the best performance.
- Rasgo developed a visual tool to build SQL queries — To be honest, when I see the tool, I'd rather recommend you learn SQL.
- Datafold open-sourced data-diff — A lib that uses checksums to compare a data source with its destination, like Postgres with Snowflake, in 25s for 100m rows. Who hasn't tried to build it? Don't lie to me.
- How we rebuilt the dbt Cloud Scheduler — Follow-up on their post announcing work on their scheduler, but this time with details. Spoiler: this is a classic async job design and they partially migrated from Python to Go.
A discussion about dbt, Airflow and the semantic layer. Good one.
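Coming back to the data-diff item above: for those who never tried to build it, here is a minimal sketch of the general idea of comparing two tables with checksums. To be clear, this is not data-diff's code or API. The two databases are replaced by plain Python dicts, and the names `checksum`, `segment` and `diff_keys` are hypothetical helpers I made up for the illustration. The idea: hash a range of keys on both sides, and only dig into the ranges whose hashes differ.

```python
import hashlib
from typing import Dict, List, Tuple

Row = Tuple[int, str]  # (primary key, payload): a hypothetical row shape


def checksum(rows: List[Row]) -> str:
    """Hash a segment of rows (already sorted by key) into a single digest."""
    digest = hashlib.md5()
    for key, payload in rows:
        digest.update(f"{key}|{payload}\n".encode("utf-8"))
    return digest.hexdigest()


def segment(table: Dict[int, str], lo: int, hi: int) -> List[Row]:
    """Return the rows whose key falls in [lo, hi), sorted by key.

    In a real setup this would be a query pushed down to each database.
    """
    return sorted((k, v) for k, v in table.items() if lo <= k < hi)


def diff_keys(
    source: Dict[int, str],
    target: Dict[int, str],
    lo: int,
    hi: int,
    min_size: int = 4,
) -> List[int]:
    """Return the keys in [lo, hi) that are missing or different on one side."""
    src, dst = segment(source, lo, hi), segment(target, lo, hi)
    if checksum(src) == checksum(dst):
        return []  # identical segment: nothing to download or compare
    if hi - lo <= min_size:
        # Small enough: compare the rows one by one.
        src_map, dst_map = dict(src), dict(dst)
        all_keys = src_map.keys() | dst_map.keys()
        return sorted(k for k in all_keys if src_map.get(k) != dst_map.get(k))
    # Otherwise split the key range in two and recurse on each half.
    mid = (lo + hi) // 2
    return diff_keys(source, target, lo, mid, min_size) + diff_keys(
        source, target, mid, hi, min_size
    )


if __name__ == "__main__":
    source = {i: f"row-{i}" for i in range(100)}
    target = dict(source)
    target[42] = "changed"  # one modified row
    del target[7]           # one missing row
    print(diff_keys(source, target, 0, 100))  # [7, 42]
```

The appeal of this design is that when the two sides mostly agree, only a handful of small segments ever need a row-by-row comparison; everything else is settled by comparing one digest per segment.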
Last Read — 3 years after Data mesh: lessons learnt
A month-old article. Michelin detailed lessons learnt from implementing a Data Mesh — which, as a side note, suddenly became un-trendy this year. The main building blocks are a data fabric [the data platform], distributed data domains w/ teams [exposing and/or storing and/or valuing data] and a federated governance.
See you next week. I love you all.