Data News — Week 22.26

Data News #22.26 — Snowplow fundraising, stuff you should know about databases, a rant against dbt, the real data quality and fast news.

Christophe Blefari

1 Jul 2022, 4 min read

Bonjour ! Here is your fresh Data News edition. It's the first day of July and summer break is about to start for many of you. Sadly, data never waits, so I hope you have enough team redundancy to be able to disconnect.

This week I published a new animated YouTube video: data explained to kids. It is in French, with English subtitles. BTW, if you want to help me make the English version, ping me 👋

Data Fundraising 💰

Snowplow, a complete suite of web analytics tools, raised a $40m Series B. This is an open-source tool I have known since 2014, and their marketing has evolved drastically: from a web analytics tool to a behavioral data platform including AI and so on. By coincidence (or by chance), last week Italy declared Google Analytics illegal, following other European countries.
Zing Data raised a $2.4m seed round. "Data at your fingertips", as they say. Their idea is to provide a mobile-first BI tool: use your phone to answer data questions and start data discussions. Obviously it works better for events companies.
Lightbits raised $42m in capital. They sell highly available storage built on proprietary technology. I got into private object storage only recently because of a mission, and I find it super interesting. This is a fierce competition based on percentage savings and so on.
Opaque Systems raised $22m in Series A. Don't stop at the corporate website, the technology seems promising. Here is what they sell: "Opaque is the first confidential computing platform that enables [...] analytics and machine learning on encrypted data". They are also the creators of MC², an open-source version of their platform. It deserves a try.

Things you should know about databases

I often share content about database knowledge, and it is content I really enjoy. It's time for you to learn new things you should know about databases. Mahdi wrote a post with great illustrations that explains very well how indexes and transactions work. What really happens between BEGIN and COMMIT in your SQL query?
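To make the BEGIN/COMMIT question concrete, here is a minimal sketch using Python's built-in sqlite3 module. It is not taken from Mahdi's post: the accounts table and the transfer helper are hypothetical names made up for the illustration. It only shows the atomicity part of the story: either every statement between BEGIN and COMMIT is applied, or none is.

```python
import sqlite3

# isolation_level=None puts the connection in autocommit mode, so the
# BEGIN / COMMIT / ROLLBACK statements below are the only transaction control.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst atomically; roll back on any error."""
    conn.execute("BEGIN")
    try:
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
        )
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
        )
        (balance,) = conn.execute(
            "SELECT balance FROM accounts WHERE name = ?", (src,)
        ).fetchone()
        if balance < 0:
            raise ValueError("insufficient funds")
        conn.execute("COMMIT")  # both updates become visible together
        return True
    except Exception:
        conn.execute("ROLLBACK")  # both updates are discarded together
        return False


def balances(conn):
    """Return the current balances as a {name: balance} dict."""
    return dict(conn.execute("SELECT name, balance FROM accounts ORDER BY name"))
```

A transfer that would overdraw the source account leaves both rows untouched, even though the two UPDATE statements had already run inside the transaction.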

A rant against dbt ref

Complaining about dbt has become a trend. Given the adoption and how happy people are with it, it is normal to hear dissenting voices at some point. It's Max's turn to rant against dbt ref.

I do agree with Max. ref manipulation is a pain point in dbt that breaks the magic, especially when your workflow as an analyst is:

writing SQL in your Snowflake/BigQuery web UI
copying the SQL into your text editor (whatever it is) to add it to git
finding all the table references to change them to ref
forgetting something (and being alerted by the CI/CD)

And you do this every day, for every model you touch. In the end you spend more time playing Where's Wally? with table names than writing SQL (ok, I exaggerate a bit, but you get it).

On this specific point, I think it is possible to develop a browser extension that replaces table names with the right dbt references on the fly, while waiting for some changes from the inside. Ping me if you want to build it with me.
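As a rough idea of what the core of such a tool could look like, here is a small Python sketch of the replacement step only. Everything in it is an assumption made for illustration: the MODEL_BY_TABLE mapping, the table names and the to_dbt_refs function are hypothetical, and a regular expression is a shortcut that ignores comments, quoted identifiers and CTE names. A real SQL parser would be more robust.

```python
import re

# Hypothetical mapping from warehouse table names to dbt model names.
MODEL_BY_TABLE = {
    "analytics.orders": "orders",
    "analytics.customers": "customers",
}


def to_dbt_refs(sql, model_by_table=MODEL_BY_TABLE):
    """Replace known table names that follow FROM or JOIN with {{ ref('model') }}."""
    pattern = re.compile(r"\b(FROM|JOIN)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)

    def swap(match):
        keyword, table = match.group(1), match.group(2)
        model = model_by_table.get(table.lower())
        if model is None:
            return match.group(0)  # unknown table: leave it untouched
        return f"{keyword} {{{{ ref('{model}') }}}}"

    return pattern.sub(swap, sql)
```

Tables that are not in the mapping are left as they are, so the CI/CD can still flag them instead of the tool guessing a model name.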

The data quality no-one is speaking of

This is an intervention.

Thanks to the Modern Data Stack and dbt we created SQL-driven platforms and analysts are becoming SQL monkeys. This is not good. Pissing SQL all day long creates monstrosities. Rather than adding extra layers to achieve data quality, build quality from the inside out.

In this post, which I highly recommend, Petr speaks the truth. Everyone should go back to the root cause of data quality issues: code complexity. It's time to "tame the complexity".

Databricks Summit

The Databricks Summit (called Data + AI Summit) is taking place. As I don't have time to follow it, here is Simon's feedback on Day 1 and Day 2. In a nutshell, they announced:

Change Data Feed, a way to track row-level changes in Delta tables.
Databricks Workflows, an orchestration tool with a matrix view copy/pasted from Airflow.
Enzyme, a new optimization layer to speed up the ETL process. Bingo.

ML Friday 🤖

Instacart detailed how they developed an internal platform to answer data science needs. They call it Griffin. It supports MLOps and includes a feature marketplace, a workflow manager and a training & inference platform.
Causal Forecasting at Lyft: I didn't get much out of this post, but it fills my brain with memories of PHV stuff.

Fast News ⚡️

envd, a development environment for machine learning: write all your app configuration in Python or R and envd will build everything out of it. Why not.
sqlglot, a Python SQL parser, transpiler, and optimizer: this is a recently released SQL parser. They claim to be the fastest one written in Python. It is modular, so you can adapt it to your specific dialects.
People-first Data stacks: it's time to add empathy on top of the modern data stack.
Everything is a funnel, but SQL doesn’t get it: in a lot of businesses we tend to represent stuff with funnels. Sadly, SQL is not a good language to manipulate funnels.
Remote development at Slack: it is interesting to see how the Slack tech team has been equipped with remote development environments and how it worked.
Building a Real-time discovery stack at Whatnot
Airflow Survey 2022: this is something I missed a few weeks ago. Some numbers about how Airflow is used today. For instance, half of the user base is still using the Celery Executor (and a quarter is on Local).
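On the funnel item above, here is a tiny Python sketch of what "manipulating a funnel" means, to show why it feels awkward in SQL. It is not taken from the linked post: the step names and the funnel_counts function are hypothetical. The logic is a small state machine per user, which is a few lines in a procedural language but usually needs one join or window function per step in SQL.

```python
from collections import defaultdict

FUNNEL_STEPS = ["visit", "add_to_cart", "purchase"]  # hypothetical step names


def funnel_counts(events, steps=FUNNEL_STEPS):
    """Count users reaching each step, requiring steps to occur in order.

    `events` is an iterable of (user_id, timestamp, event_name) tuples.
    """
    by_user = defaultdict(list)
    for user_id, ts, name in events:
        by_user[user_id].append((ts, name))

    counts = [0] * len(steps)
    for user_events in by_user.values():
        next_step = 0  # index of the step this user has to reach next
        for _, name in sorted(user_events):  # replay events in time order
            if next_step < len(steps) and name == steps[next_step]:
                counts[next_step] += 1
                next_step += 1
    return dict(zip(steps, counts))
```

A user who purchases without adding to cart, or who adds to cart before visiting, is only counted up to the last step reached in the expected order.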

PS: for the first time in a long time I'm not late. See you next week ❤️.

