
Data News — Week 22.32

Data News #22.32 — Forbes Cloud 100, data integration fundraising, recursive CTEs, a packed ML Friday, Dagster cloud, BigQuery search.

Christophe Blefari
It's been a long week, please forgive me, I picked the most obvious picture (credits)

Hey there. This is it. I'm now in Berlin, so if you wanna meet, say hi. I have plenty of time to meet new people.

Here's the weekly data news everyone's waiting for 🙃.

Data fundraising 💰

  • Datawisp got a $3.6m seed round. They tick a few buzzwords: a no-code data platform for web2 and web3. Their landing page is primarily focused on crypto analytics. Once again I find it interesting that the crypto field is also innovating in the data field, and it could be good news for data in general.
  • Privya raised a $6m seed round. This is an engine that analyses every part of your infrastructure to find PII and help you comply with data privacy laws. With all the GDPR burden, this kind of product could become legion in the future to help companies mitigate risk.
  • Equalum, another data integration tool, raised a $14m Series C. I'm impressed by the client portfolio they already have and by how unknown they were to me. Their product screen looks like a product coming from the future, but the future as imagined in the 2000s.

When Forbes talks about tech

Last week Forbes released their Cloud 100, a list that ranks (according to them) the top 100 private cloud companies. In this ranking around 10 companies are really about data. This is the stuff we speak about every Friday. They also discussed the Databricks, Fivetran and dbt Labs valuations that skyrocketed recently and what it means for their future.

We all live in a CTE hell loop

Cloud warehouses really popularized Common Table Expressions (CTEs), which used to be in disgrace. CTEs bring flexibility and a linear structure to a SQL query, which helps us achieve wonderful thousand-line SQL.

If you want to understand what CTEs are, this week Brian wrote a guide on how to approach CTEs when you come from the subselect world. As a side note, I also discovered the recursive CTE. It looks like hell.

A recursive CTE (credits)
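
To make the recursive CTE a bit less hellish, here is a minimal sketch of the pattern. It is my own toy example, not the one from Brian's guide or from the picture: the employees table, its rows and the org_depths helper are all hypothetical, and I run it on SQLite from Python only because it needs no warehouse. The shape is always the same: an anchor query, a UNION ALL, and a query that joins back onto the CTE itself.

```python
import sqlite3

# Hypothetical org chart: each row points at its manager (None for the root).
ROWS = [
    (1, "Ada", None),
    (2, "Brian", 1),
    (3, "Chloe", 1),
    (4, "Dev", 2),
]

QUERY = """
WITH RECURSIVE chain(id, name, depth) AS (
    -- Anchor member: start from the row that has no manager.
    SELECT id, name, 0 FROM employees WHERE manager_id IS NULL
    UNION ALL
    -- Recursive member: attach direct reports to the rows found so far.
    SELECT e.id, e.name, c.depth + 1
    FROM employees AS e
    JOIN chain AS c ON e.manager_id = c.id
)
SELECT name, depth FROM chain ORDER BY depth, name
"""


def org_depths(rows):
    """Load the rows into an in-memory SQLite table and walk the hierarchy."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, manager_id INTEGER)"
    )
    conn.executemany("INSERT INTO employees VALUES (?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    # Print each employee indented by their depth in the hierarchy.
    for name, depth in org_depths(ROWS):
        print("  " * depth + name)
```

The recursion stops on its own once the recursive member returns no new rows, which here happens when nobody reports to the last level found.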

How to use dbt's run_results.json

This is a great post about how you can use the run_results.json artifact to create your own metrics dashboard and achieve dbt observability. I really like this topic because dbt artifacts are a really powerful way to own your dbt projects and to find incremental performance gains.
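
As a rough idea of what you can pull out of the artifact, here is a small sketch of mine (not the code from the post). It assumes the usual shape of run_results.json, a top-level "results" list where each entry has a "unique_id", a "status" and an "execution_time" in seconds. The SAMPLE payload and the summarize helper are hypothetical; a real file has many more keys and normally lives in the target/ folder of your dbt project.

```python
import json

# Hypothetical, trimmed-down artifact. A real run_results.json has more keys
# per result (timing, message, adapter_response, ...).
SAMPLE = """
{
  "elapsed_time": 12.5,
  "results": [
    {"unique_id": "model.shop.orders", "status": "success", "execution_time": 8.0},
    {"unique_id": "model.shop.customers", "status": "success", "execution_time": 3.0},
    {"unique_id": "test.shop.not_null_orders_id", "status": "fail", "execution_time": 0.5}
  ]
}
"""


def summarize(artifact):
    """Return the slowest nodes, a count per status and the summed execution time."""
    results = artifact.get("results", [])
    slowest = sorted(
        results, key=lambda r: r.get("execution_time", 0.0), reverse=True
    )
    by_status = {}
    for r in results:
        by_status[r["status"]] = by_status.get(r["status"], 0) + 1
    return {
        "slowest": [(r["unique_id"], r.get("execution_time", 0.0)) for r in slowest],
        "by_status": by_status,
        "total_execution_time": sum(r.get("execution_time", 0.0) for r in results),
    }


if __name__ == "__main__":
    # With a real project you would load the file instead, for example:
    # with open("target/run_results.json") as f: artifact = json.load(f)
    summary = summarize(json.loads(SAMPLE))
    for unique_id, seconds in summary["slowest"]:
        print(f"{seconds:6.1f}s  {unique_id}")
    print(summary["by_status"])
```

Storing one such summary per run is probably the simplest way to start a dashboard: the slowest nodes tell you where to look for performance gains, and the status counts give you a basic health signal.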

Things I wish I knew...

ML Friday 🤖

Aren't interactive websites the best ones when it comes to understanding a machine learning model? A few weeks ago I shared Random Forest explained. This week MLU-EXPLAIN did it again with Logistic Regression. This is a cool web app where you discover stuff while you scroll down the page.

If you want to go further with machine learning you can still register for the free Machine Learning Zoomcamp created by Alexey. This is a 4-month program which looks really neat. To finish this category I suggest Vinija's notes about Stanford's NLP with Deep Learning course.

Faster ML Experimentation at Etsy with interleaving — tbh this is not something I'm able to understand after I've already written 700 words of the newsletter. But I hope ML Friday's readers will like it 🤓.

PS: Meta wrote a post about Scaling data ingestion for machine learning training, but then they removed it. If you still want to read the original text, I managed to save it in this gist (sorry Meta). In a nutshell they reduced their data center power consumption by 35-45 percent.

This picture appeared when I typed logistic regression (credits)

Fast News ⚡️

  • From McLaren Formula 1 to Quix — Ananth from dataengineeringweekly has an open blog where you can submit founder stories. This week Tomas wrote about Quix and how they solve realtime challenges, including McLaren ones (this is something I forgot a few weeks ago when I spoke about data in F1).
  • Dagster Cloud launched — Dagster, an open-source data pipeline tool, released Dagster Cloud. It includes 2 modes: hybrid and serverless. Hybrid lets you run code on your infra while they manage the control plane; serverless means they run everything for you. In terms of pricing the hybrid option is $0.03 per compute minute and the serverless one $0.04. Compared with Prefect, Dagster is a bit more expensive, but still way cheaper than any current Airflow offering.
  • BigQuery launched search features in preview — You'll be able to create indexes on table columns and then do a string search across those columns in a SQL WHERE clause.
  • Understand how Apache Iceberg integrates with warehouses: Snowflake and Fivetran — Iceberg and table formats are the future (even if we still have people using CSV as a default format, I was guilty of such a crime).
  • The CDP as we know it is dead — Something I've been saying for a long time in the Data News. Warehouses are the new CDP. Long live the warehouse.
  • Count things: Counting users part 2 — A great post from Pedram about the complexities when counting people.
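
To put the Dagster Cloud prices from the list above in perspective, here is a back-of-the-envelope sketch. Only the two per-minute prices come from the announcement. The 10,000 compute minutes per month and the monthly_cost_usd helper are assumptions of mine for illustration, and real bills may include things this ignores.

```python
# Prices quoted above, in US cents per compute minute.
# Integers are used to avoid floating point surprises.
PRICE_CENTS_PER_MINUTE = {"hybrid": 3, "serverless": 4}


def monthly_cost_usd(mode, compute_minutes):
    """Cost in dollars for a number of compute minutes in a given mode."""
    cents = PRICE_CENTS_PER_MINUTE[mode] * compute_minutes
    return cents / 100


if __name__ == "__main__":
    # Hypothetical workload: 10,000 compute minutes in a month.
    for mode in PRICE_CENTS_PER_MINUTE:
        print(mode, monthly_cost_usd(mode, 10_000))
```

At that hypothetical volume the gap between the two modes is $100 a month, which is what you would pay Dagster to run the compute for you instead of hosting it yourself.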

Christophe Blefari

Staff Data Engineer. I like 🚲, 🪴 and 🎮. I can do everything with data, just ask.

