blef.fr

Dear readers, I hope this email finds you well. This is still the summer edition of the Data News. I'm so happy to see people reading the news even if it's summer, so thank you all for the support once again.

Data fundraising 💰

Contentsquare is raising again: a $600m Series F. This new round includes $200m in debt. They are one of the leading platforms when it comes to analysing user behaviour. I really like Contentsquare in terms of engineering. They have really good technical articles about how they do data at massive scale. If I'm not wrong, they heavily use ClickHouse and Scala.
Hightouch acquired Whatis, a company providing a Slack app and a Chrome extension to give access to company knowledge. It's awesome how Hightouch is diversifying these days. They are walking away from the reverse ETL image they had. In addition, they also have a Datafold integration within Hightouch that provides data-diff on their customers' syncs.
Neon raised a $30m Series A to provide the best Postgres experience in the cloud. The product seems promising with a serverless multi-cloud Postgres that handles time travel.
In a nutshell, other news I spotted: a web3 data warehouse raised $10m (I'm not into crypto stuff but the news interested me); a feature-engineering platform got $5.7m in seed funding; Manta (another data lineage tool) partnered with IBM; and an end-to-end NLP platform, Humanloop, raised $2.6m.

RStudio news

Rebranding time. During rstudio::conf(2022) the RStudio team introduced a lot of changes and a new direction. First, they are becoming Posit. In summary, they are making this change because they want to reach beyond R for data science. They aim to help all data scientists, which means they will develop multi-language tooling (incl. Python).

The first manifestation of this vision is the release of Quarto. This is an open-source scientific and technical publishing system in which you can create dynamic content in Python, R and Julia. It has been inspired by R Markdown.

Next, they also announced Shiny for Python. Shiny, which is a way to create and publish web apps directly from your R code, will be available in Python. This could become a credible alternative to Streamlit.

To be honest I don't know the R world very well. I've written so little R code that my opinion could be wrong and bad. So, sorry in advance. I really like the vision and the initiative. R developers are legion, and a lot of people still use R because their niche library is only available in R. So if the vision is to empower everyone no matter the language, this is good. Still, it'll be hard to break out of the scientific-tool box and become an enterprise-ready, I mean production-ready, tool.

Just as a side note: please don't become like Anaconda. I feel they tried to become the one-stop shop for everything, and now it is too big to be the relevant player I want.

I've also read on Twitter that it would be awesome if R's dependency system could improve Python's. I don't disagree.

Data versioning

With today's cloud capacities we are able to save data changes, and a lot of different technologies can work with data versions. Christian wrote about how you can version your data lake (with LakeFS).

Also, shame on me, I just discovered today that BigQuery has a time travel feature (up to 7 days). See how Guillaume does BigQuery table snapshots.
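To make the time travel idea concrete, here is a minimal sketch that builds a BigQuery query reading a table as it was some time ago, using the `FOR SYSTEM_TIME AS OF` clause. The helper name `time_travel_query` and the table name are hypothetical, chosen only for illustration. The code only produces the SQL string; running it is left to whatever client you use.

```python
from datetime import timedelta

# Time travel window mentioned above (up to 7 days).
MAX_TIME_TRAVEL = timedelta(days=7)


def time_travel_query(table: str, age: timedelta) -> str:
    """Build a query reading `table` as it was `age` ago.

    `table` is a hypothetical fully qualified name such as
    "my-project.my_dataset.orders".
    """
    if age < timedelta(0) or age > MAX_TIME_TRAVEL:
        raise ValueError("age must be between 0 and 7 days")
    seconds = int(age.total_seconds())
    return (
        f"SELECT * FROM `{table}` "
        f"FOR SYSTEM_TIME AS OF "
        f"TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL {seconds} SECOND)"
    )


# Read the (hypothetical) orders table as it was one hour ago.
print(time_travel_query("my-project.my_dataset.orders", timedelta(hours=1)))
```

Rejecting ages outside the window up front avoids sending a query that BigQuery would refuse anyway.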

In addition, if you use dbt, here is how you can do Change Data Capture in dbt, or two ways to create incremental models.

Analytics time

How to empower analytics teams: I personally think that empowering others is what best describes what I do. I really like empowering analytics teams so they can in turn empower business teams with data.
How Criteo use reporting data
Why arguing about metrics is a waste of time

Q&A to learn from others

This week we got a small Q&A from the Picnic data engineering team sharing thoughts on the lakehouse and the data mesh. On the other side, Instacart's VP of data science shared how you can build a data-driven company, which you should put in perspective with Benn's post from last week: do data-driven companies always win?

Fast News ⚡️

Snowflake's improvements (digest summary): this is one of the major perks of having a proprietary cloud data warehouse. When they make performance improvements, they release them and you just enjoy, feet up. For instance, they reduced storage costs by 7-10%.
On the other side, Pub/Sub released a native BigQuery integration: you can now stream a Pub/Sub topic directly into BigQuery without writing any pipeline. Small news, but a big step forward to me. The cost is $50/TiB.
Cloud competition to get US gov contracts is fierce: everyone wants to take down AWS's leadership.
Prefect released version 2.0, which should fix the flaws of 1.0 and brings many improvements.
Text search at scale with ClickHouse in Tinybird: this article is obviously biased towards Tinybird (which is an HTTP API platform on top of your SQL queries, like the dbt semantic layer) but it shows how full-text search can be done in Postgres and in ClickHouse.
Use Alvin to leverage column-level lineage on Airflow: the Alvin team developed an Airflow integration that sends your DAG information to Alvin. It looks promising.
7 tips for a successful Machine Learning project in production
STOP USING CSV: the usual reminder that you should move to Parquet.
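Coming back to the Pub/Sub to BigQuery item above, here is a rough sketch of what the quoted $50/TiB means for a given volume. The workload (100 GiB per day) and the function name are hypothetical. The estimate only applies the per-TiB figure from the news item and ignores any free tier, minimums or BigQuery storage costs.

```python
TIB = 1024 ** 4  # bytes in one tebibyte
PRICE_PER_TIB_USD = 50.0  # figure quoted in the Pub/Sub item above


def bigquery_subscription_cost(bytes_streamed: int) -> float:
    """Estimated cost in USD of streaming `bytes_streamed` bytes."""
    return bytes_streamed / TIB * PRICE_PER_TIB_USD


# Hypothetical workload: 100 GiB per day for 30 days.
daily_bytes = 100 * 1024 ** 3
monthly_cost = bigquery_subscription_cost(daily_bytes * 30)
print(f"${monthly_cost:.2f} per month")  # prints "$146.48 per month"
```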