
Data News — Week 22.26

Data News #22.26 — Snowplow fundraising, stuff you should know about databases, a rant against dbt, the real data quality and fast news.

Christophe Blefari
The World is Mine

Bonjour ! Here is your fresh Data News edition. This is the first day of July and summer break will arrive for many of you. Sadly, data never waits, so I hope you have enough team redundancy to be able to disconnect.

This week I published a new animated YouTube video: data explained to kids (in French, with English subtitles). BTW, if you want to help me make the English version, ping me 👋

Data Fundraising 💰

  • Snowplow, a complete suite of web analytics tools, raised a $40m Series B. I have known this open-source tool since 2014, and their marketing has changed drastically since then, from a web analytics tool to a behavioral data platform including AI and so on. By coincidence (or by chance), last week Italy declared Google Analytics illegal, following other European countries.
  • Zing Data raised a $2.4m seed round. "Data at your fingertips" as they say. Their idea is to provide a mobile-first BI tool: use your mobile to answer data questions and start data discussions. Obviously it works better for events companies.
  • Lightbits raised $42m in capital. They sell highly available storage built on proprietary technology. I got into private object storage just recently because of a mission and I find it super interesting. It is a fiercely competitive market, fought over percentage savings and so on.
  • Opaque Systems raised $22m in Series A. Don't stop at the corporate website, the technology seems promising. Here is what they sell: "Opaque is the first confidential computing platform that enables [...] analytics and machine learning on encrypted data". They are also the creators of MC2, an open-source version of their platform. It deserves a try.
Raising money

Things you should know about databases

I often share content about database knowledge, and it is content I really enjoy. It's time for you to learn new things you should know about databases. Mahdi wrote a post with great illustrations. He explains very well how indexes and transactions work. What really happens between BEGIN and COMMIT in your SQL query?
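To make the BEGIN / COMMIT question concrete, here is a minimal sketch using Python's built-in sqlite3 module. It is not taken from Mahdi's post: the accounts table, the transfer helper and the amounts are made up for illustration. The idea is only to show that everything between BEGIN and COMMIT is applied as a whole, or not at all.

```python
import sqlite3


def make_db():
    """Create an in-memory database with a hypothetical accounts table."""
    # isolation_level=None disables implicit transactions,
    # so BEGIN / COMMIT / ROLLBACK are issued explicitly below.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute(
        "CREATE TABLE accounts ("
        "name TEXT PRIMARY KEY, "
        "balance INTEGER NOT NULL CHECK (balance >= 0))"
    )
    conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
    return conn


def transfer(conn, sender, receiver, amount):
    """Move money atomically: both updates are applied, or neither is."""
    conn.execute("BEGIN")
    try:
        # The credit runs first on purpose: if the debit then fails,
        # the credit has already been written inside the transaction.
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE name = ?",
            (amount, receiver),
        )
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE name = ?",
            (amount, sender),
        )
        conn.execute("COMMIT")
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected a negative balance:
        # ROLLBACK also undoes the credit made earlier in the transaction.
        conn.execute("ROLLBACK")
        return False


def balances(conn):
    """Return the current balances as a {name: balance} dict."""
    return dict(conn.execute("SELECT name, balance FROM accounts ORDER BY name"))


conn = make_db()
print(transfer(conn, "alice", "bob", 30), balances(conn))
print(transfer(conn, "alice", "bob", 500), balances(conn))
```

The second transfer fails on the debit, and the ROLLBACK takes the already-written credit away with it, so the balances stay as they were after the first transfer.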

A rant against dbt ref

Complaining about dbt has become a trend. Given the adoption and how happy people are with it, it is normal to hear dissenting voices at some point. It's Max's turn to rant against dbt ref.

I do agree with Max. ref manipulation is a pain point in dbt that breaks the magic. Especially when your workflow as an analyst is:

  • writing SQL in your Snowflake/BigQuery web UI
  • copy/pasting the SQL into your text editor (whatever it is) to add it to git
  • finding all the table references to change them to ref
  • forgetting something (alerted by the CI/CD)

And you do this every day, for every model you touch. In the end you spend more time playing Where's Wally? with table names than writing SQL. OK, I exaggerate a bit, but you get the idea.

On this specific point, I think it is possible to build a browser extension that replaces table names on the fly with the right dbt references, while waiting for some changes from the inside. Ping me if you want to build it with me.
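This is not the browser extension, only a rough Python sketch of the replacement step it would need. It assumes you already have a mapping from warehouse table names to dbt model names (TABLE_TO_MODEL below is hypothetical), and it uses a naive regex rather than a real SQL parser, so quoted identifiers, comma joins and things like extract(day from ...) are not handled properly.

```python
import re

# Hypothetical mapping from warehouse table names to dbt model names.
TABLE_TO_MODEL = {
    "analytics.stg_orders": "stg_orders",
    "analytics.stg_customers": "stg_customers",
}

# Naive match: the identifier that directly follows FROM or JOIN.
TABLE_AFTER_KEYWORD = re.compile(r"\b(from|join)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)


def add_refs(sql, table_to_model):
    """Replace known table names with dbt ref() calls.

    Returns the rewritten SQL and the list of names that were not found
    in the mapping (sources, CTE names, typos...), left untouched.
    """
    unknown = []

    def replace(match):
        keyword, table = match.group(1), match.group(2)
        model = table_to_model.get(table.lower())
        if model is None:
            unknown.append(table)
            return match.group(0)
        # Doubled braces in the f-string produce literal {{ and }}.
        return f"{keyword} {{{{ ref('{model}') }}}}"

    return TABLE_AFTER_KEYWORD.sub(replace, sql), unknown


sql = (
    "select o.id, c.name "
    "from analytics.stg_orders o "
    "join analytics.stg_customers c on c.id = o.customer_id "
    "left join raw.payments p on p.order_id = o.id"
)
new_sql, unknown = add_refs(sql, TABLE_TO_MODEL)
print(new_sql)
print(unknown)
```

Returning the unknown names is the "forgetting something" step of the workflow above, except that you get the list before the CI/CD does.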

The data quality no-one is speaking of

This is an intervention.

Thanks to the Modern Data Stack and dbt we created SQL-driven platforms and analysts are becoming SQL monkeys. This is not good. Pissing SQL all day long creates monstrosities. Rather than adding extra layers to achieve data quality, build quality from the inside out.

In this post, which I highly recommend, Petr speaks the truth. Everyone should go back to the root cause of data quality issues: the complexity of your code. It's time to "tame the complexity".

Finding the real data quality

Databricks Summit

Databricks Summit (called Data + AI Summit) is taking place. As I don't have the time to follow it, here is Simon's feedback on Day 1 and Day 2. In a nutshell, they announced:

  • Change Data Feed, a way to track row-level changes in Delta tables.
  • Databricks Workflows, an orchestration tool with a matrix view copy/pasted from Airflow.
  • Enzyme, a new optimization layer to speed up the ETL process. Bingo.

ML Friday 🤖

Fast News ⚡️


PS: for the first time in a long time I'm not late. See you next week ❤️.


Christophe Blefari

I do Data Engineering in Python.
