blef.fr

Hey, this week we've crossed the 1000 members milestone! I want to thank you for being an important part of this newsletter. I've said it before: I do the newsletter because I want to bookmark links, but I also do it because you read and like it. What would a newsletter be without readers?

That's why next week, if I'm on time for the official one-year anniversary 🎉 of the newsletter, I'll add interactive features. You'll be able to comment on and like every post, Substack style. In addition, I'll improve the Links search feature we developed with Saija a few months ago and the overall navigation of the blog.

Now let's aim for 2000 members! 🎯

PS: I want to specifically thank the members who took the Early Supporter tier when subscribing to the Data News ❤️.

Data fundraising 💰

HuggingFace raised $100m in Series C for open & collaborative machine learning, a.k.a. the GitHub of machine learning. The French-American startup, which released pre-trained PyTorch BERT models 4 years ago, will continue to expand and to federate the ML community. They are also paving the way regarding responsible AI.
Aiven, a Finnish startup, raised $210m in Series D to become a new European unicorn. The software company provides a cloud platform to deploy services on all major cloud providers. Some parts of the platform are open-source. Right now the services are mainly data storage tools like Kafka, Postgres, MySQL, etc. From what I understand, the promise lies in the ability to switch cloud providers and regions easily, with fixed pricing.
Improvado just did a $22m Series A. The company illustrates super well the all-in-one vertical ops platform. They created a data platform for marketing and sales teams to do data transformation on top of integrations. I know that the Data News audience is more engineering-oriented, and this kind of platform can become your shadow IT.

Following last week's partnership announcement with Dell, this week Snowflake is partnering with Pure Storage to run data warehouse workloads on top of their on-premise S3-based storage. Yay.

Discussions about dbt vision and roadmap

When I woke up this morning I saw Pedram's post "We need to talk about dbt". I felt break-up vibes. You know, like when you have two close friends in a relationship and one of them tells you that something is going wrong. That they need to talk. They may need a break.

Pedram is a dbt early adopter and he mainly feels that dbt is no longer the product he loved before, especially when he sees VCs filling the roadmap with SSO, AES and some other weird enterprise acronyms. He also asks for fixes or a vision for their relationship, namely the dbt Core library (the so-called "CLI"), and criticizes the quality of the Cloud IDE product.

In a hot take, Tristan, dbt Labs CEO, representing the other half of the relationship, tried to be honest by bringing "The response you deserve!". He understands Pedram's pain and is touched to learn that he lacked transparency. Tristan points out that he has started to invest a lot in their relationship: 10 FTEs working on the CLI and 8 on community content. He also reveals that he plans to support non-SQL languages like Python (**cheering crowd**). Finally, he justifies the VC presence by saying they are helping the relationship by bringing another perspective.

In all seriousness, I really liked this conversation between Pedram and Tristan. I think dbt shines thanks to its community. Even if a huge part of it now only asks for tech support on Slack, there are still thinkers and makers. These thinkers and makers are helping dbt become the central piece of any data stack. It is becoming more than a tool: it is becoming your organised knowledge repository.

I wrote some thoughts on this after the last Coalesce: dbt is not a data product anymore.

Google I/O and AWS Summit Berlin

Google's annual developer conference took place this week while the AWS Summit was taking place in Berlin. Because I had too much work, and was a bit lazy, I did not go to the AWS Summit, but here is what I was planning to attend.

On the Google side, the agenda was heavily oriented toward AI and machine learning. Among everything they announced is AlloyDB, a service that could change how we do databases. AlloyDB brings the disaggregation of compute and storage to PostgreSQL. This is a major evolution in the industry. Richard detailed on Twitter what it looks like.

Earlier this month, the brilliant Looker team that was acquired by Google reappeared with Malloy (naming coincidence?), their new data exploration language. If you want more detail, here are some slides about Malloy and Malloy Composer, the UI to compose queries.

Use URLs to preselect dashboard filters in Superset

Yesterday I published an article on Apache Superset. Rather technical. Actually, this week I was stuck on one specific Superset task and did not find any content on Google to help me, so I decided to write this post for myself and to help others with the same issue. I detail how you can use URLs to preselect dashboard filters. It includes a small introduction to Superset.

ML Friday 🤖

A dress is not a pullover: In his post, Alex, an ML engineer, offers a great walkthrough on how you can write a classification model with PyTorch. He uses Fashion MNIST data to classify dresses and pullovers. I find it crazy how concise the code has become for tasks like this.

Detect silent model failure: I recently spoke with the NannyML team about their open-source library to detect issues in ML models. By using only the models' inputs and outputs (class and probabilities), they are able to recompute the confusion matrix and then detect data drift, and hence performance drift. It works on classification use-cases. If you look at their user guides you will find great content around data drift.
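
To make the idea more concrete, here is a minimal sketch of how a confusion matrix can be estimated from predicted probabilities alone, without ground truth. This is my own illustration and not NannyML's API: the function names are hypothetical, and it assumes a binary classifier whose probabilities are well calibrated (a score of 0.8 really means an 80% chance of the positive class).

```python
def expected_confusion_matrix(probabilities, threshold=0.5):
    """Estimate a binary confusion matrix without labels.

    Assumption: `probabilities` are calibrated scores for the positive class.
    Each prediction contributes fractionally to the cell it is expected to hit.
    """
    tp = fp = fn = tn = 0.0
    for p in probabilities:
        if p >= threshold:
            # Predicted positive: correct with probability p.
            tp += p
            fp += 1.0 - p
        else:
            # Predicted negative: correct with probability 1 - p.
            fn += p
            tn += 1.0 - p
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn}


def expected_accuracy(probabilities, threshold=0.5):
    """Expected accuracy derived from the estimated confusion matrix."""
    cm = expected_confusion_matrix(probabilities, threshold)
    total = sum(cm.values())
    return (cm["tp"] + cm["tn"]) / total if total else float("nan")


if __name__ == "__main__":
    scores = [0.9, 0.8, 0.2, 0.4]  # made-up model outputs for illustration
    print(expected_confusion_matrix(scores))
    print(expected_accuracy(scores))
```

Tracking such an estimate over time windows is one way to spot a drop in performance before the real labels arrive, as long as the calibration assumption holds.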

Fast News ⚡️

Awesome Public Datasets: A GitHub repository of public datasets from all over the world, organized by topic. This is a huge repo to get ideas.
Anaconda (full installer) is available now for M1 Macs!
Trino: open source infrastructure upgrading at Lyft: A nice write-up on how Lyft deployed Trino (formerly Presto) at their scale.
The landscape of timeseries databases: A simple comparison of timeseries databases explaining how each one works.
Monitor your Snowflake credits usage: If you do self-service BI you may be afraid of exploding your bills. This post can give you ideas on how to monitor this.

PS: I don't know why but today's edition feels small in terms of links. I hope you will enjoy it ❤️.