Hello, new Friday, new weekly dose of data for all of you. I hope you had a great week, and I send you sun rays from my sunrise ☀️. For the next two weeks you'll get two special weekly editions. Stay tuned!
Data fundraising 💰
Again, only one new story this week (the holidays are coming), but a very big one. Amplitude, the product intelligence platform, has confidentially filed to go public, according to Forbes. The company was valued at $4B after its last fundraising round one month ago.
Snowflake schema detection
Following the Summit announcement, Snowflake has finally rolled out the schema detection feature in public preview. With it, Snowflake can automatically determine the schema of a staged file. It's also possible to generate a DDL or to use one of the three new functions.
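Snowflake does this detection server-side over staged files. To see the general idea, here is a minimal local sketch in plain Python that infers a coarse column type per CSV column, widening the type when values disagree (the type names and widening order are illustrative, not Snowflake's actual inference rules):

```python
import csv
import io

def infer_type(value: str) -> str:
    """Guess a coarse type for a single string value."""
    try:
        int(value)
        return "INTEGER"
    except ValueError:
        pass
    try:
        float(value)
        return "FLOAT"
    except ValueError:
        return "TEXT"

def infer_schema(csv_text: str) -> dict:
    """Infer a column-name -> type mapping from a CSV with a header row.

    Each column gets the widest type seen across its values
    (INTEGER < FLOAT < TEXT).
    """
    order = {"INTEGER": 0, "FLOAT": 1, "TEXT": 2}
    schema = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        for col, val in row.items():
            t = infer_type(val)
            prev = schema.get(col)
            if prev is None or order[t] > order[prev]:
                schema[col] = t
    return schema

schema = infer_schema("id,price,name\n1,9.99,foo\n2,12,bar\n")
# From the inferred schema a DDL fragment can be generated, which is
# roughly what Snowflake's template-based table creation automates.
ddl = ", ".join(f"{col} {typ}" for col, typ in schema.items())
```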
If you're asking yourself which cloud data warehouse to choose (or if you want to switch), you can read this comparison of the four major vendors; I also wrote a short comparison two months ago here (scroll down for the table).
Data discovery, mesh and analytics
I've elected data mesh, data discovery and dbt as the 2021 data buzzwords 🙉. This article written by Saxo Bank describes how they integrated DataHub into their decentralized platform, a data mesh. The diagrams are super interesting, and the way Kafka integrates with their Data Workbench is inspiring.
On the same topic, the Nubank team explained how they scaled their data analytics with software engineering best practices (dbt, but without dbt). In the post you'll get a glimpse of Compass, their internal data search engine!
✨ Can Argo Workflows be a data engineering tool?
Recently I've heard people talking about Argo Workflows. As Kubernetes and Argo become more and more widely used, could Argo replace Airflow in the future? Could it be the way to bring the data and software worlds together? Christoph Schnabl tries to show how we can use Argo Workflows today.
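To give a feel for what an Airflow-style DAG looks like in Argo, here is a minimal Workflow manifest sketch. Argo pipelines are Kubernetes resources written in YAML; the image and step names below are illustrative assumptions, not taken from the article:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: daily-etl-          # Argo appends a random suffix
spec:
  entrypoint: pipeline
  templates:
    - name: pipeline
      dag:
        tasks:
          - name: extract
            template: run-step
            arguments:
              parameters: [{name: msg, value: extract}]
          - name: load
            dependencies: [extract]  # runs after extract, like an Airflow DAG edge
            template: run-step
            arguments:
              parameters: [{name: msg, value: load}]
    - name: run-step
      inputs:
        parameters:
          - name: msg
      container:
        image: python:3.9-slim       # illustrative image
        command: [python, -c]
        args: ["print('{{inputs.parameters.msg}}')"]
```

Each task runs in its own container, which is the main structural difference from Airflow's in-process operators.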
Hey, it's time to tidy your DAGs 🛏️
Nitai did it first, but it's a good idea once a year to sort through all your Airflow DAGs. Yes, you know, the ones you hide and that fail every week. In the post you can see how they reorganized the Airflow DAGs of the Ministry of Economy of Brazil.
Also, as a reminder, Airflow 1.x is no longer supported, so it's time to upgrade to version 2 and enjoy the new UI features.
Kafka and Matplotlib cheat codes
The Clairvoyant blog shows how you can use the Kafka AdminClient programmatic interface to manage topics. The examples are in Java, but if you are a Python developer the methods are almost the same.
Datacamp wrote a cheat sheet about Matplotlib, and I'm quite happy about it because I never remember how to handle axes.
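Speaking of never remembering how to handle axes, here is a small reminder sketch of the `Axes` calls that always need a cheat sheet (the plotted data is just a placeholder):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, no display needed
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 1, 2, 3], [0, 1, 4, 9], label="y = x^2")

# The calls I always have to look up: limits, labels, ticks, legend.
ax.set_xlim(0, 3)
ax.set_ylim(0, 10)
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_xticks([0, 1, 2, 3])
ax.legend()

fig.savefig("parabola.png")
```

The object-oriented `fig, ax = plt.subplots()` style makes these calls easier to find than the implicit `plt.*` interface.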
Feature store at MoMo
It's been a long time since I've seen an article about feature stores. This week Hung Nguyen details how they do feature engineering for real-time prediction using Bigtable, with a historical data table and offline feature tables. They also give examples of the Bigtable schemas they designed.
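A common Bigtable pattern for serving the latest features, which this kind of schema design often relies on, is a row key combining the entity id with a reversed timestamp so the newest row sorts first. A minimal sketch of that idea (the key layout and names are illustrative, not MoMo's actual schema):

```python
def feature_row_key(user_id: str, event_ts_ms: int) -> bytes:
    """Build a Bigtable-style row key: '<user_id>#<reversed_ts>'.

    Bigtable sorts rows lexicographically, so storing
    (max_ts - event_ts) zero-padded makes the newest row for a
    user sort first: scanning a single row with the user's prefix
    returns the latest features.
    """
    max_ts = 10**13  # beyond any millisecond timestamp we expect
    reversed_ts = max_ts - event_ts_ms
    return f"{user_id}#{reversed_ts:013d}".encode()

# A newer event produces a lexicographically smaller key,
# so it comes first in a prefix scan.
key_old = feature_row_key("user42", 1_600_000_000_000)
key_new = feature_row_key("user42", 1_625_000_000_000)
```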
(Reading this story, I have a thought for Arthur and Augustin.)
Listen to the Data Engineering stories
Over the last few months Netflix has been running a new series of articles called "Data Engineers of Netflix". The latest episode features Kevin Wylie, who designed and developed the first version of Netflix's knowledge graph. I really like how Kevin sees the data engineering position and how he wants to empower his colleagues.
If you want to understand the geometric foundations of Deep Learning, KDnuggets will help you. But if you'd rather have fewer formulas, you can ask yourself whether 2021 is the year of the rise of ML Engineers. The last article obviously comes after the best data engineering article of all time (a reminder).
In this new post, the MIT Technology Review shows what startups can do today with AI-generated voices. It's quite impressive.
See you next week 🦊