Data News — Week 6
Data News #6 — Census and Great Expectations fundraising, Google Analytics banned, understanding the metrics store and distributed computing.
A new Friday means Data News, and here it is. I hope this edition finds you well; enjoy the reading.
The newsletter is way longer than usual because I tried to dive deep into the EU-to-US data transfers topic to give you some perspective. I hope you'll enjoy it.
Data fundraising 💰
- Census raised $60m in Series B to bring what they call Operational Analytics to reality. The idea behind OA is to use your warehouse as the primary data storage and Census as a tool to reverse ETL data to operational tools with software engineering principles (versioning, tests, monitoring, CI/CD, etc.).
- Superconductive — the team behind Great Expectations — raised $40m in Series B. Over the last year Great Expectations has become a well-identified tool when it comes to data quality. With the money they will launch a cloud version combining the open-source core with "added features".
- Starburst raised $250m in Series D to push Trino forward as a storage-agnostic SQL query engine. Trino was born out of the conflict between the Presto community (incl. the founders) and Facebook. Trino offers data teams a unified SQL engine to query all your data storages (which also means the data is computed on your servers).
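At its core, the reverse ETL idea behind Census is simple: read modeled rows out of the warehouse and push them into an operational tool. Here is a minimal sketch of that loop, using sqlite3 as a stand-in warehouse and a plain callable as a fake CRM client; the table and function names are hypothetical, not Census's API:

```python
import sqlite3

def sync_to_crm(warehouse: sqlite3.Connection, push) -> int:
    """Read modeled users from the warehouse and push each row
    to an operational tool via the provided `push` callable."""
    rows = warehouse.execute(
        "SELECT email, lifetime_value FROM dim_users"
    ).fetchall()
    for email, ltv in rows:
        # A real reverse ETL tool would make batched, monitored,
        # versioned API calls to the CRM here.
        push({"email": email, "lifetime_value": ltv})
    return len(rows)

# Demo with an in-memory warehouse and a list as the fake CRM.
wh = sqlite3.connect(":memory:")
wh.execute("CREATE TABLE dim_users (email TEXT, lifetime_value REAL)")
wh.executemany("INSERT INTO dim_users VALUES (?, ?)",
               [("a@example.com", 120.0), ("b@example.com", 40.5)])
crm = []
synced = sync_to_crm(wh, crm.append)
print(synced, crm)
```

The "software engineering principles" part is everything around this loop: the query lives in version control, the sync is tested and monitored, and deployments go through CI/CD.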
EU to US data transfers
In the past week there has been a lot of discussion around EU-to-US data transfers. This is obviously related to data privacy concepts, but also to sovereignty. I'll try to summarize in a few words what happened recently.
In order to operate global services and global collaboration, the United States and the European Union negotiated two major frameworks over the years to regulate data transfers.
First came the Safe Harbor in 1998, which prevented EU and US private organizations from accidentally disclosing or losing personal information. In 2015 the CJEU (Court of Justice of the European Union) invalidated the Safe Harbor, and one year later the Privacy Shield was born. The Privacy Shield framed the commercial use of EU citizens' data in order to protect their privacy. In July 2020 the CJEU declared the Privacy Shield invalid as well.
Starting from this, a European non-profit org called noyb — none of your business — filed 101 complaints in August 2020 against websites using Google Analytics or Facebook Connect after reviewing them. They filed the complaints with each relevant local DPA (Data Protection Authority).
In January 2022, the Austrian regulator stated that the use of Google Analytics violates the CJEU decision. Then the French regulator, the CNIL, found Google Analytics data transfers illegal under the GDPR.
The CNIL considers that these transfers are illegal and orders a French website manager to comply with the GDPR and, if necessary, to stop using this service under the current conditions.
Google Analytics future
Ok, the current situation is obviously a mix of political, lobbying, and technical stakes. To be honest, I'm quite happy to finally see some light shed on data privacy topics. But what does it mean for Google Analytics' future?
It is difficult to find numbers on the revenue Alphabet gets from GA, but at $150k/year for the paid version I imagine it is a core product for them. I believe companies will now take a deeper look at their tracking to see if personal data is transferred to GA and take measures: either they will remove this tracking or they will move to another tool. Either way, I bet this is not a small project.
If you want to look at open-source alternatives, there are plenty. The historical one is Matomo — formerly known as Piwik, which is different from Piwik PRO; the two started out the same but diverged in 2016. More recently PostHog appeared, as well as my favourite one: Plausible, which I use for the blog's tracking. There are also Snowplow and RudderStack, two big tools with a lot of features.
Meta threatens to shut down in the EU?
Big headlines everywhere. In their 10-K filing — an annual report sent to the SEC about financial performance — Meta wrote the following 👇
If a new transatlantic data transfer (Ed. from EU to US) framework is not adopted [...], we will likely be unable to offer a number of our most significant products and services, including Facebook and Instagram, in Europe, which would materially and adversely affect our business, financial condition, and results of operations. (source — p. 9)
This is quite different from what the press was saying, but it still means that Meta is financially impacted by all the privacy regulations. Which obviously means they earn a substantial amount of money from our data. Add to this Apple's privacy changes, which could cost Meta $10b in advertising revenue, and something is clearly changing.
To close this category, I want to share Mozilla's joint work with Meta engineers on Privacy Preserving Attribution, which could be the future of online attribution. The idea is to propose a new conversion measurement while providing strong privacy guarantees through a multi-party computation and aggregation system (results will not be linked to individuals).
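To make the multi-party computation idea concrete, here is a toy sketch of additive secret sharing (a classic MPC building block, not the actual Mozilla/Meta protocol): each user's conversion value is split into two random shares sent to two independent helpers, so neither helper ever sees an individual value, yet adding the helpers' totals recovers the aggregate.

```python
import secrets

MOD = 2**32  # all arithmetic is done modulo a fixed value

def share(value: int) -> tuple[int, int]:
    """Split a value into two random shares that sum to it mod MOD."""
    r = secrets.randbelow(MOD)
    return r, (value - r) % MOD

conversions = [1, 0, 1, 1, 0]  # per-user conversion flags
helper_a, helper_b = 0, 0
for v in conversions:
    a, b = share(v)
    helper_a = (helper_a + a) % MOD  # helper A only ever sees random noise
    helper_b = (helper_b + b) % MOD  # and so does helper B

aggregate = (helper_a + helper_b) % MOD
print(aggregate)  # 3: the total, with no individual value revealed
```

Each share on its own is uniformly random, so the advertiser learns only the aggregate conversion count, which is exactly the "result will not be linked to individuals" property.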
PS: I'm far from being an expert in the legal domain, but I tried to write a big summary of this whole recent story. I hope you liked it.
Modern data stack builders
Arpit, who is building astorik, a place to explore the modern data landscape, tried to depict the different evolutions of the modern data stack: from early-stage companies needing their first dashboards to well-established companies with a wide portfolio of data tools.
Meanwhile, Photobox shared their new data platform. They decided to be event-driven first and used the CloudEvents spec to simplify their work. It is a huge article, but full of insights.
If we zoom in on the transformation layer, Vimeo shared how they do dbt. They have been very creative in the way they mix dbt Cloud with their own CI/CD processing in dev and staging environments. If you still struggle to put environments in place in your dbt setup, this post is for you. On another note, Son showed how you can do environment-dependent unit testing in dbt.
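Setting dbt's Jinja specifics aside, the general pattern behind environment-dependent testing can be sketched in plain Python: the code under test resolves its data sources from the current target, swapping real tables for fixtures outside prod. All names below are hypothetical illustrations, not Son's actual setup:

```python
import os

def source_table(name: str) -> str:
    """Resolve a table reference depending on the environment,
    mirroring how dbt projects often swap real sources for
    fixtures when the target is not prod (names are hypothetical)."""
    target = os.environ.get("DBT_TARGET", "dev")
    if target == "prod":
        return f"analytics.{name}"
    return f"analytics_{target}.{name}_fixture"

os.environ["DBT_TARGET"] = "ci"
print(source_table("orders"))  # analytics_ci.orders_fixture
```

In dbt itself the same switch typically lives in a macro that checks `target.name`, so unit tests run against small, controlled fixture tables.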
iAdvize data catalog research
The iAdvize team offered us an awesome series of articles this week about their data catalog research on top of Tableau. They decided to build it on the Tableau Metadata API, and identified what they needed from it: datasources, calculated fields, and workbook content. In the end they put the puzzle together to create a Tableau dashboard where you can explore the data.
I've already done this kind of thing in the past, and I would personally have developed something outside Tableau to have more freedom in terms of the result.
Understand deeper
- the metrics store ; recently popularized by dbt, this post explains what's behind it.
- kafka ; a new way to be introduced to Kafka concepts — and if you are feeling brave, Slack explained how they built self-driving Kafka clusters.
- distributed computing ; a huge post demystifying it.
Cost Efficiency in big data file format (Uber)
Uber shared metrics about their Parquet compression performance at scale, comparing ZSTD with GZIP and SNAPPY. It is a good post to understand what is under the hood.
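Parquet's ZSTD and SNAPPY codecs are not in the Python standard library, but the ratio trade-off Uber measures can be illustrated with stdlib codecs on the same payload. This is a rough analogy on a toy byte string, not a Parquet benchmark:

```python
import bz2
import lzma
import zlib

# Repetitive payload, loosely mimicking a columnar data chunk.
data = b"user_id,event,country\n" + b"42,click,FR\n" * 10_000

# Compressed-size / original-size for each stdlib codec;
# zlib implements DEFLATE, the algorithm behind GZIP.
ratios = {
    "zlib": len(zlib.compress(data, level=6)) / len(data),
    "bz2": len(bz2.compress(data)) / len(data),
    "lzma": len(lzma.compress(data)) / len(data),
}
for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1]):
    print(f"{name}: {ratio:.4f}")
```

The shape of the result matches the post's theme: stronger codecs squeeze more but cost more CPU, and at Uber's storage scale a few percent of ratio is real money.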
Firebolt, thinking in Lambdas
Can we introduce lambda functions to SQL? Octavian, a product manager at Firebolt, proposed a way to add imperative concepts to the declarative nature of SQL. Imagine creating lambda functions the Python way, but inside a SQL select. This is what he brings to the table.
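For readers who haven't met the pattern, this is the Python construct being borrowed: an anonymous function applied element-wise over an array, which a lambda-enabled SQL dialect could expose inside a select. The SQL shown in the comment is only an illustration of the idea, not Firebolt's actual syntax:

```python
# The imperative building block: an anonymous function mapped over
# an array, e.g. what TRANSFORM(x -> x * 2, prices) could mean in a
# lambda-enabled SQL dialect (hypothetical syntax).
prices = [9.99, 24.50, 3.10]
doubled = list(map(lambda x: x * 2, prices))
print(doubled)  # [19.98, 49.0, 6.2]
```

The appeal is expressing per-element logic inline over array columns instead of unnesting, transforming, and re-aggregating.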
Fast News ⚡️
- Snowflake announced the data classification feature in public preview, but then removed the article. You can still read it in the cache. The feature uses machine learning to detect column tags; you can then apply policies to protect sensitive columns.
- Alibaba Open-Sources AutoML Algorithm KNAS
- A Career in Football Analytics, The What — If you like sports analytics, this post from Benoit will help you understand what it means for football.
- What is MLOps? — If you are still looking for an MLOps definition, this post is for you.
- O'Reilly humble book bundle for charity — you can get 15 books with a single charitable donation.
- Andrew Ng interview: "The AI pioneer says it’s time for smart-sized, “data-centric” solutions to big issues"
To split news from fun ideas and curiosities, I separated them into two categories this week.
- Load Twitter data into Google Sheets and automate it — Aurélien Robin shows how you can load and schedule Twitter pipelines on Google Cloud Platform. To be honest, I don't think we should use Pub/Sub for this kind of entry-level tutorial, but it is still a good project.
- github/pull request applying Black to the entire Django source code → around 100k changes 😬
- github/wtfpython — Explore Python through surprising snippets, like "is not ..." versus "is (not ...)".
- MergeStat — Treat your source code and development history as data and use SQL to explore it
See you next week.
Join the newsletter to receive the latest updates in your inbox.