Hey there. This is it. I'm now in Berlin, so if you wanna meet or say hi, I have plenty of time to meet new people.
Here is the weekly data news everyone's waiting for 🙃.
Data fundraising 💰
- Datawisp got a $3.6m seed round. They tick a few buzzwords: a no-code data platform for web2 and web3. Their landing page is primarily focused on crypto analytics. Once again I find it interesting that the crypto field is also innovating in the data field, which could be good news for data in general.
- Privya raised a $6m seed round. This is an engine that analyses every part of your infrastructure to find PII and help you comply with data privacy laws. With the whole GDPR burden, this kind of product could become legion in the future to help companies mitigate risk.
- Equalum, another data integration tool, raised a $14m Series C. I'm impressed by the client portfolio they already have and by how unknown they were to me. Their product screen looks like a product coming from the future. But the future as seen in the 2000s.
When Forbes talks about tech
Last week Forbes released their Cloud 100, a list that ranks (according to them) the top 100 private cloud companies. In this ranking around 10 companies are really about data. This is the stuff we speak about every Friday. They also discussed the Databricks, Fivetran and dbt Labs valuations that skyrocketed recently and what it means for their future.
We all live in a CTE hell loop
Cloud warehouses really popularized Common Table Expressions (CTEs); in the past they were in disgrace. CTEs give a SQL query flexibility and a linear structure, which helps us achieve wonderful thousand-line SQL queries.
If you want to understand what CTEs are, this week Brian wrote a guide on how to approach CTEs when you come from the subselect world. As a side note I also discovered the recursive CTE. It looks like hell.
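To make the subselect vs CTE difference concrete, here is a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for a cloud warehouse, and the orders table and its rows are made up for the example. It runs the same aggregation as a subselect and as a CTE, then a small recursive CTE that counts from 1 to 5.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for a warehouse (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer TEXT, amount REAL);  -- hypothetical table
    INSERT INTO orders VALUES ('ana', 10), ('ana', 30), ('bob', 5);
""")

# Subselect style: the aggregation is nested inside the FROM clause.
subselect_sql = """
    SELECT customer, total
    FROM (
        SELECT customer, SUM(amount) AS total
        FROM orders
        GROUP BY customer
    ) AS totals
    WHERE total > 20
"""

# CTE style: the same aggregation is named first, then read top to bottom.
cte_sql = """
    WITH totals AS (
        SELECT customer, SUM(amount) AS total
        FROM orders
        GROUP BY customer
    )
    SELECT customer, total
    FROM totals
    WHERE total > 20
"""

# Recursive CTE: start from the anchor row (1) and keep adding n + 1
# while n is below 5.
recursive_sql = """
    WITH RECURSIVE counter(n) AS (
        SELECT 1
        UNION ALL
        SELECT n + 1 FROM counter WHERE n < 5
    )
    SELECT n FROM counter ORDER BY n
"""

subselect_rows = conn.execute(subselect_sql).fetchall()
cte_rows = conn.execute(cte_sql).fetchall()
recursive_rows = [row[0] for row in conn.execute(recursive_sql)]

print(subselect_rows == cte_rows)  # both styles return the same rows
print(recursive_rows)
```

The two first queries are equivalent; the CTE version just moves the inner query to the top and gives it a name, which is what makes long queries readable in a linear way.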
How to use dbt's run_results.json
This is a great post about how you can use the run_results.json artifact to create your own metrics dashboard and achieve dbt observability. I really like this topic because dbt artifacts are a really powerful way to own your dbt projects and to find incremental performance gains.
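As a rough illustration of the idea (not the method from the linked post), the sketch below summarizes a run from the artifact. It only relies on the results list and its unique_id, status and execution_time fields; the sample payload, the project name shop and the summarize helper are hypothetical.

```python
import json
from collections import Counter

# Hypothetical, heavily trimmed stand-in for a real run_results.json file.
SAMPLE_RUN_RESULTS = json.loads("""
{
  "results": [
    {"unique_id": "model.shop.stg_orders", "status": "success", "execution_time": 1.2},
    {"unique_id": "model.shop.fct_orders", "status": "success", "execution_time": 8.5},
    {"unique_id": "test.shop.not_null_orders_id", "status": "fail", "execution_time": 0.4}
  ]
}
""")


def summarize(run_results, top_n=2):
    """Return status counts, total execution time and the slowest nodes."""
    results = run_results.get("results", [])
    status_counts = Counter(r.get("status", "unknown") for r in results)
    # Sort nodes from slowest to fastest; a missing time counts as 0.
    slowest = sorted(
        results, key=lambda r: r.get("execution_time") or 0.0, reverse=True
    )[:top_n]
    return {
        "status_counts": dict(status_counts),
        "total_execution_time": sum(r.get("execution_time") or 0.0 for r in results),
        "slowest": [(r.get("unique_id"), r.get("execution_time")) for r in slowest],
    }


summary = summarize(SAMPLE_RUN_RESULTS)
print(summary)
```

On a real project you would load the file with json.load instead of the inline sample (dbt writes it to the target folder by default), then store one summary per run to feed a dashboard.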
Things I wish I knew...
- when scaling Apache Spark — 3 lessons
- about Databricks — 5 things from confessionsofadataguy; I really like the first 2
- about delivering data projects — 10 useful principles
ML Friday 🤖
Aren't interactive websites the best when it comes to understanding a machine learning model? A few weeks ago I shared Random Forest explained. This week MLU-EXPLAIN did it again with Logistic Regression. This is a cool web app where you discover stuff as you scroll down the page.
If you want to go further in machine learning you can still register for the free Machine Learning Zoomcamp created by Alexey. This is a 4-month program which looks really neat. To finish this category, I suggest Vinija's notes on Stanford's NLP with Deep Learning course.
PS: Meta wrote a post about Scaling data ingestion for machine learning training. But they removed the post. If you still want to read it verbatim, I managed to save it in this gist (sorry Meta). In a nutshell they reduced their data center power consumption by 35-45 percent.
Fast News ⚡️
- From McLaren Formula 1 to Quix — Ananth from dataengineeringweekly has an open blog where you can submit founder stories. This week Tomas wrote about Quix and how they solve real-time challenges, including McLaren's (this is something I forgot a few weeks ago when I spoke about data in F1).
- Dagster Cloud launched — Dagster, an open-source data pipeline tool, released Dagster Cloud. It includes 2 modes: hybrid and serverless. Hybrid lets you run code on your infra while they manage the control plane; with the serverless option they run everything for you. In terms of pricing, the hybrid option is $0.03 per compute minute and the serverless one $0.04. Compared with Prefect, Dagster is a bit more expensive, but still way cheaper than any current Airflow offering.
- BigQuery launched search features in preview — You'll be able to create indexes on table columns and then run a string search across those columns in a SQL WHERE clause.
- Understand how Apache Iceberg integrates with warehouses: Snowflake and Fivetran — Iceberg and table formats are the future (even if we still have people using CSV as a default format; I have been guilty of such a crime myself).
- The CDP as we know it is dead — Something I have been saying for a long time in the data news. Warehouses are the new CDP. Long live the warehouse.
- Count things: Counting users part 2 — A great post from Pedram about the complexities of counting people.
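To put the Dagster Cloud prices quoted above in perspective, here is a back-of-the-envelope sketch. The workload of 120 compute minutes per day and the monthly_cost helper are assumptions for the example, and the calculation ignores any base fee, free tier, or the cost of your own infrastructure in hybrid mode.

```python
# Per-compute-minute prices quoted in the Dagster Cloud item above (USD).
PRICE_PER_MINUTE = {"hybrid": 0.03, "serverless": 0.04}


def monthly_cost(compute_minutes_per_day, mode, days=30):
    """Estimate a monthly bill from daily compute minutes (hypothetical helper)."""
    return round(compute_minutes_per_day * days * PRICE_PER_MINUTE[mode], 2)


# Assumed workload: 120 compute minutes per day over a 30-day month.
hybrid = monthly_cost(120, "hybrid")
serverless = monthly_cost(120, "serverless")
print(hybrid, serverless)
```

With this assumed workload, that is 3,600 compute minutes a month, so $108 in hybrid mode against $144 in serverless mode.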