Hey friends, a small reminder: I want to do a community post soon and I'm collecting the great posts you've liked recently. Don't hesitate to hit reply 👋.
Data fundraising 💰
- Hightouch acquires Workbase for an undisclosed amount. Hightouch has been one of the leading companies in the reverse ETL category; Workbase is an operational tool to automate workflows on top of your data records. Hightouch will be able to offer more features, with graphical workflow editing in their SaaS product, in order to achieve their "Data Activation" goal. This acquisition should be put in perspective with Airbyte's acquisition a few weeks ago, both companies trying to bridge the gaps in the data platform while coming from opposite directions.
- OVHcloud, a French cloud provider, acquires ForePaaS, a 7-year-old company providing an end-to-end machine learning and analytics platform. OVHcloud aims to enrich their Platform as a Service offering with this acquisition. To be honest, as of today OVHcloud's data service catalog is pretty small, and this new expertise will surely help them grow.
Untapped potential of data lineage
Data lineage tools are sexy because they provide a neat graph view of your metadata, but is it enough? Petr tries to answer this question by drawing a parallel between Google Maps and lineage as ways to find information. The underlying idea is to uncover the true potential of data lineage, which probably resides in search. Reading his article, I have to admit I thought of the search bars already offered by all the data catalog tools, which answer his remarks.
Our favourite tools' schedulers
When we want to schedule data pipelines we have a lot of products to choose from. But what schedules our data tools themselves? dbt and Airbyte wrote articles about what powers them. The Airbyte open source project uses the Temporal Java SDK to implement sync workflows and triggers.
On the other hand, in order for dbt Cloud to run client projects, the dbt Labs team had to develop an in-house scheduler. This scheduler had poor performance prior to March: more than 80% of scheduler tasks were delayed by more than 1 minute. In a blog post, Julia explained the dbt Cloud scheduler improvements, which achieved less than 25s of delay for more than 80% of tasks.
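As a side note, "less than 25s of delay for more than 80% of tasks" is a p80 latency figure. A minimal sketch of computing it from per-task delays with Python's standard library (the delay values here are made up, not dbt Cloud data):

```python
from statistics import quantiles

# Hypothetical scheduler delays in seconds: the gap between a task's
# scheduled time and its actual start, one value per task.
delays = [3, 5, 8, 12, 14, 18, 20, 22, 24, 70]

# quantiles() with n=10 returns the 9 deciles; index 7 is the 80th
# percentile, i.e. 80% of tasks started with at most this much delay.
p80 = quantiles(delays, n=10)[7]

print(f"p80 delay: {p80:.1f}s")
print(f"tasks delayed > 60s: {sum(d > 60 for d in delays) / len(delays):.0%}")
```

With this sample the p80 delay is 23.6s even though one outlier task waited 70s, which is why percentile targets are a friendlier SLO than averages for a scheduler.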
Data tests and the broken windows theory
It is hard to write a complete suite of tests for your data. When we dive into it, the complexity is not a matter of tech (we have a lot of tools at our disposal) but rather one of process and maintenance. Keeping a suite of tests up to date across hundreds of tables and thousands of columns is complex. Mo' data, mo' problems.
Mikkel explains it better than me by linking data tests to the broken windows theory. He proposes routines to apply in order to shift the mentality and end up with important tests under great ownership.
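To make "a suite of tests on your data" concrete, here is a minimal sketch of the two most common checks such suites automate, not-null and uniqueness, run against an in-memory SQLite table with made-up rows (this is generic illustration, not Mikkel's tooling):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL);
    INSERT INTO orders VALUES (1, 10, 9.99), (2, 11, 25.00), (3, NULL, 12.50);
""")

def not_null(table: str, column: str) -> bool:
    # Passes only if no row has a NULL in the given column.
    (n,) = conn.execute(
        f"SELECT COUNT(*) FROM {table} WHERE {column} IS NULL"
    ).fetchone()
    return n == 0

def unique(table: str, column: str) -> bool:
    # Passes only if no value appears more than once.
    dupes = conn.execute(
        f"SELECT {column} FROM {table} GROUP BY {column} HAVING COUNT(*) > 1"
    ).fetchall()
    return len(dupes) == 0

checks = {
    "orders.order_id not null": not_null("orders", "order_id"),
    "orders.order_id unique": unique("orders", "order_id"),
    "orders.customer_id not null": not_null("orders", "customer_id"),  # fails: row 3
}
for name, ok in checks.items():
    print(f"{'PASS' if ok else 'FAIL'}  {name}")
```

Writing ten of these is easy; the maintenance problem above starts when you have thousands of columns, each check needs an owner, and a FAIL has to mean something to someone.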
People doing stuff on Snowflake
There are only two hard things in Computer Science: cache invalidation and naming things.
💬 Phil Karlton
Martin Fowler shared this 12 years ago. Regarding the second point, in the data stack it is more valid than ever. When it comes to Snowflake organisation, or table and schema naming, we can do anything. Madison shared how she organizes a Snowflake data warehouse. Obviously this is just one proposal, and a good source of inspiration. Own your conventions and naming.
And in bulk, people wrote about why they chose Snowflake as the backend for an observability product, and how you can build a data app on top of Snowflake.
Modern data stack future
What is the modern data stack? What does the future hold for us? I think no one knows yet. Still, Nick, for instance, argues that our data stacks aren't built for change. Which is close to the truth: our DAGs are often static, with logic hard-wired into the data storage and close to no room for evolution.
On the other side, Dunith writes about the Modern Streaming Stack. This is the first time I've read this term, and I like it: how the classic event-driven stack can live in the cloud-driven world a lot of companies are in. It is a long post, but it addresses every part of the stack, from the compute layer to the serving part and everything around them.
Final note: TechCrunch is trying to rethink (with paywall) Databricks' valuation, $38b, in light of their revenue, less than $1b. Will Databricks be part of the modern data stack's future?
ML Friday 🤖
- Podcast episode about MLOps and the data engineer role: an episode from the Data Engineering Podcast. To be honest, I have not listened to it yet.
- Model Evaluation in MLflow: a walkthrough post on how you can use MLflow to store all your model evaluations and keep a history. If you need help understanding model evaluation itself, vpTech can help.
- Enabling data science on Google Cloud Platform at Adevinta: the Adevinta team detailed how they use GCP to support data science efforts with Vertex AI (pipelines and serving), Spark, and BigQuery. I really like the effort they put into the design of the platform components.
- Vox journalism: Why it’s so damn hard to make AI fair and unbiased.
Fast News ⚡️
- Web scraping is legal: a US appeals court reaffirmed its original decision, and this is good news for internet freedom and especially for side-project lovers 🤓.
- Yandex open sources YDB, a distributed SQL database: following ClickHouse's success, the Russian search engine's tech team released a new piece of technology, with YDB entering the database segment occupied by the likes of CockroachDB.
- Microsoft announced a partnership with Grafana Labs to provide a managed Grafana on Azure. This announcement bridges the gap with the other cloud providers, which already offer either a managed version (AWS) or something equivalent (the Grafana Google Cloud Monitoring data source).
- Last week I shared LakeFS's comparison of table formats, and this week Dremio did their own. I prefer Dremio's, even if they go too far in the community comparison.
- Evolution of Redash at Blinkit: Shubham from Blinkit shows us what they did to fix all the Redash issues they ran into along the way.
Join the newsletter to receive the latest updates in your inbox.