Data News — Week 22.25

Data News #22.25: Validio, CloudQuery, Starburst and Eppo money moves, Snowflake Summit in 3 bullets, JTBD for data teams and the best Fast News ever.

Christophe Blefari
5 min read
Writing the Data News — metaphor (credits)

Hey, I hope you're doing well. Let's jump into the Data News; this week is full of great content.

Data Fundraising 💰

  • Validio raised $15m in seed funding. The Swedish startup addresses data quality issues with a SaaS platform. It sits on top of many different data sources (warehouses, lakes & streams) to find "bad data". Their pricing includes a free tier.
  • Starburst acquired Varada. I'll try to summarize. Presto was created at Facebook. After a conflict, the Presto community split and Trino was founded. The 4 original Presto creators joined Starburst to provide services and to maintain Trino. Starburst has now bought Varada, a company providing Presto/Trino optimisations for your lake. They claim to reduce Starburst compute prices by 40% with this integration. That says a lot about the inflated cloud bills companies are probably paying.
  • CloudQuery raised $15m in a Series A (cf. week 38). CloudQuery is your cloud inventory, queryable in SQL. It sits on top of your cloud providers' APIs to give you SQL access to the details of your resources. They market the tool as a way to control costs but, above all, security and compliance. It looks promising. It's time for data teams to help SREs.
  • Ataccama received a $150m investment. They provide a "unified data management platform" to cover quality, cataloging, MDM and visualisation in the same tool. I don't fully get their product vision.
  • Eppo raised $19.5m in seed + Series A. This is a side topic to data, but a super important one: experimentation. They took inspiration from NATU's way of experimenting. Eppo sits on top of your warehouse and encapsulates everything you need to run tests.
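To make the "cloud inventory queryable in SQL" idea from the CloudQuery bullet more concrete, here is a minimal sketch in Python. It does not use CloudQuery itself: the `storage_buckets` table, its columns and the rows are all hypothetical, and an in-memory SQLite database stands in for wherever the inventory would really be synced.

```python
import sqlite3

# Hypothetical inventory table; real CloudQuery schemas differ per provider.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE storage_buckets "
    "(name TEXT, region TEXT, public INTEGER, encrypted INTEGER)"
)
# Made-up rows, only here to make the query below return something.
conn.executemany(
    "INSERT INTO storage_buckets VALUES (?, ?, ?, ?)",
    [
        ("logs", "eu-west-1", 0, 1),
        ("assets", "eu-west-1", 1, 1),
        ("backups", "us-east-1", 1, 0),
    ],
)


def risky_buckets(connection):
    """Return names of buckets that are public or unencrypted, sorted by name."""
    rows = connection.execute(
        "SELECT name FROM storage_buckets "
        "WHERE public = 1 OR encrypted = 0 "
        "ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


print(risky_buckets(conn))
```

Once resources live in tables, a security or compliance check is just a query, which is probably why this kind of tool is a natural meeting point for data teams and SREs.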

Snowflake Summit

Last week I forgot to write specifically about the Snowflake Summit. It was in Vegas and people were as excited as if Steve Jobs had appeared in the warehouse. A few things were announced there:

  • Snowpark in Python is in public preview: it's time to write awful UDFs with Python inside and become the next Oracle.
  • Build Streamlit apps (let's say interactive dashboards) inside the Snowflake UI: not yet launched, but it may become the biggest revolution in the visualisation space.
  • Unistore (technical details): use Snowflake for transactional workloads and run real-time apps on top of your transactional data. Huge promise. It amplifies the conclusion of my first bullet.
Snowflake founders (credits)

Data teams organisation — apply the JTBD framework

It's been a long time since I last shared thoughts on data teams. This week Emilie proposed using the JTBD framework to build more effective data teams. The Jobs to be done framework is a way to prioritize work. As a data team, our mission is to empower people in their decision-making.

If you identify your Jobs correctly, the team will naturally enable others to drive business impact. Emilie shared 5 frequent jobs she has observed. I recommend you read them all.

To some extent I also recommend reading Christophe's post on the Airbyte blog about how to structure a data team to climb the AI pyramid of needs.

Our journey towards an open data platform

Doron shares how they built their data platform at Yotpo. All the drawings are super useful and clear. When you look closely at the platform, you may question the need for ~3 data stores (Redshift, Snowflake and Databricks) and 3 visualisation tools.

This is a good feedback post, but it also shows very well the explosion of technologies (cf. state of data engineering) we face today in the data ecosystem. Cherry-picking tools is becoming an art.

A framework for designing document processing solutions

Extracting data from documents is slowly becoming, for a lot of companies, one of the best ways to apply artificial intelligence to the business and to help operational teams. Humans love paperwork and all of this paperwork is just waiting to be parsed.

Lester James proposed a framework for designing document processing solutions. Converting PDFs to usable data is a key task. To do that he proposes 3 steps: annotation, multimodal models and an evaluation step. As a disclaimer, he showcases an annotation lib (Prodigy) he works on.
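To give an idea of what the evaluation step could look like, here is a minimal sketch in Python. It is not Lester James' method and it does not use Prodigy: it only assumes you have annotated ("gold") field values for some documents and the values a model extracted, and it computes a per-field exact-match accuracy. The `field_accuracy` helper and the invoice data are made up for illustration.

```python
def field_accuracy(predicted, expected):
    """Per field name, the share of gold values matched exactly by the prediction."""
    totals, hits = {}, {}
    for pred_doc, gold_doc in zip(predicted, expected):
        for field, gold_value in gold_doc.items():
            totals[field] = totals.get(field, 0) + 1
            if pred_doc.get(field) == gold_value:
                hits[field] = hits.get(field, 0) + 1
    return {field: hits.get(field, 0) / totals[field] for field in totals}


# Toy, made-up invoices: the second predicted total is wrong on purpose.
gold = [
    {"invoice_id": "A-1", "total": "120.00"},
    {"invoice_id": "A-2", "total": "75.50"},
]
pred = [
    {"invoice_id": "A-1", "total": "120.00"},
    {"invoice_id": "A-2", "total": "15.50"},
]

scores = field_accuracy(pred, gold)
print(scores)
```

Scoring per field rather than per document makes it easier to see which part of the paperwork the model struggles with (here, the totals).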

From Jupyter to Kubernetes: Refactoring and Deploying Notebooks

I imagine the energy Netflix spent on putting Notebooks in production didn't go unnoticed. People are developing Ploomber to help you build data pipelines from Notebooks.

On another note, you can also try to measure the CO2 impact of your notebooks (on Azure).

Data scientists (credits)

Side projects FTW

I've always been a huge fan of side projects as a way to learn something, so each time I see people doing extra stuff with data I make a point of sharing it because it resonates with me.

This week Jack built a data pipeline for his own Strava data to visualise everything in Tableau. Almost 7200 km in 2021, well done.

Fast News ⚡️

Podcast 🎙

A discussion about dbt, Airflow and the semantic layer. Good one.

Last Read — 3 years after Data mesh: lessons learnt

A month-old article. Michelin detailed lessons learnt from implementing a Data Mesh (which, as a side note, suddenly became un-trendy this year). The main blocks are a data fabric [the data platform], distributed data domains w/ teams [exposing and/or storing and/or valuing data] and a federated governance.


See you next week. I love you all.


Christophe Blefari

Staff Data Engineer. I like 🚲, 🪴 and 🎮. I can do everything with data, just ask.
