Data News — Week 43
Data News #43 — ClickHouse raising spree, BigQuery brain drain, data team organization at Postman, grow as ML Engineer, product releases and more!
Hello dear readers, this is week 43 of 2021. Yes, that's the big advantage of being a subscriber: you get the week number directly in your mailbox.
This week I published 3 stories about the time I deleted data in production, even if the title is deliberately clickbait because the term "production" is blurry. I had my first experience of being on the Hacker News front page and I did not like it; as you can see, the comments there are a bit negative. But isn't it a good way to learn something?
I've changed something in the usual Fast News, so read everything to see it!
Data fundraising 💰
- Five weeks after their Series A, ClickHouse is on a raising spree: they announced a $250m Series B. As a reminder, ClickHouse is an open-source analytical database that promises low-latency performance over a warehouse. As always, I see this as a good step towards a wider variety of available warehouse technologies.
- To continue on database technologies, Yugabyte raised $188m in a Series C to be the "default database for Cloud native applications". It's a distributed SQL database built on Postgres. They are going to compete with Google Spanner and CockroachDB.
- This week also features a transfer: a new departure from the BigQuery team. Hossein Ahmadi has joined Snowflake as a Principal Software Engineer after 10 years at Google. This follows week 32, when another Principal left to become CTO at Firebolt.
The next big challenge for Data is organizational
I know that you are a diligent reader of the data news, so you have already noticed that I really like articles dealing with data organization topics, even if I'm a data engineer 🤓. We already see recurring organizational issues across all data teams, and it's going to be the next big challenge for data.
I totally agree with the article: today, tools are important, but not as crucial as people. Tomorrow your data teams will shine because you hire the right people for your organization and because you embrace techniques that suit them, probably inspired by software engineering.
Also, do not hesitate to subscribe if you like this kind of content.
How Postman's data team works
We have already seen how Postman's data team organized their platform around Atlan to create a data workspace for everyone. This week we'll see how they organized the data team to meet operational needs. It seems they are organized around 2 teams, engineering and science, with analysts included in the second one. They also put different hats on the analysts, who can be central, embedded or distributed.
I really like this article because it shows the thinking behind how they built their team, and that could help everyone.
Grow as an ML engineer
The ML engineer position is probably the newest in the tech ecosystem. The boundaries of the role are still blurry, but the more people write about it, the faster we will converge on a good definition. This week I suggest you read this Roadmap: from Backend Engineer to ML Engineer. It also covers the major topics an MLE is involved in and the main skills they need to master.
If you also plan to become a freelance machine learning engineer, Pau wrote about how you can defeat your impostor syndrome and launch yourself as an independent.
To finish, and because I'm a data engineer, I'd also like to share the 7 things you need to know to become a data engineer that Medhi proposed.
The rise and evolution of data engineering — what’s it all about? 🌙→☀️
Data engineering is truly changing. The first reason is all the tools appearing each day that simplify our daily job. The second reason is that it is starting to become hyped, attracting money and talent. Data engineers are being dragged out of the shadows because of this. But what's outside the shadows? Is it light? Olivier Molander puts words on the evolution of data engineering.
What is data versioning and 3 ways to implement it
We've all built a data lake or a data warehouse once in our life, and we probably did it the batch way. This is normal. But with time and volume, the batch method will start to struggle. You'll need to understand concepts such as change data capture (CDC) or data versioning; here is how to do it.
To go further on this concept, I suggest this episode of the Data Engineering Podcast about Streaming data pipelines made SQL with Decodable (I like the pun) and this state of the art on Feature Stores. Look at the end of the post for a link to the Feature Store Summit if you're interested.
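To make the idea of change capture a bit more concrete, here is a minimal sketch that derives change events by comparing two batch snapshots of the same table. This is only one simple approach (a snapshot diff, not log-based CDC) and not necessarily one of the three ways described in the linked article. The function name `diff_snapshots`, the row shape and the `key` parameter are all assumptions made for illustration.

```python
from typing import Dict, List

Row = Dict[str, object]


def diff_snapshots(previous: List[Row], current: List[Row], key: str) -> Dict[str, List[Row]]:
    """Compare two full snapshots and return the rows that changed.

    Hypothetical helper: `key` is the name of the column that uniquely
    identifies a row in both snapshots.
    """
    prev_by_key = {row[key]: row for row in previous}
    curr_by_key = {row[key]: row for row in current}

    # Rows whose key only exists in the new snapshot.
    inserts = [row for k, row in curr_by_key.items() if k not in prev_by_key]
    # Rows whose key only exists in the old snapshot.
    deletes = [row for k, row in prev_by_key.items() if k not in curr_by_key]
    # Rows present in both snapshots but with different values.
    updates = [
        row
        for k, row in curr_by_key.items()
        if k in prev_by_key and prev_by_key[k] != row
    ]
    return {"insert": inserts, "update": updates, "delete": deletes}


if __name__ == "__main__":
    yesterday = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
    today = [{"id": 1, "name": "a"}, {"id": 2, "name": "B"}, {"id": 3, "name": "c"}]
    print(diff_snapshots(yesterday, today, key="id"))
```

Storing these change sets with a timestamp, instead of overwriting the table, is one simple way to keep a history of versions. The trade-off is that every run still has to read both full snapshots.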
Data Engineers shouldn't write Airflow DAGs — new episode¹
The Leroy Merlin Russia team explained how they implemented configuration-driven DAG generation for their data platform using Airflow. Thanks to YAML templates, they are able to describe what data they want and how they transform it.
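The general idea of configuration-driven generation can be sketched without Airflow at all. The example below is not the Leroy Merlin implementation: it assumes the YAML template has already been parsed into a plain dictionary, and every name in it (`PIPELINE_CONFIG`, `TASK_FACTORIES`, `build_pipeline`, the task types) is hypothetical. In a real setup, each factory would presumably return an Airflow operator instead of a plain function.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Callable, Dict, List, Tuple

# Hypothetical configuration, standing in for a parsed YAML template.
PIPELINE_CONFIG = {
    "dag_id": "orders_daily",
    "tasks": [
        {"name": "extract_orders", "type": "extract", "source": "orders", "depends_on": []},
        {"name": "clean_orders", "type": "transform", "sql": "SELECT * FROM orders",
         "depends_on": ["extract_orders"]},
        {"name": "load_orders", "type": "load", "target": "dwh.orders",
         "depends_on": ["clean_orders"]},
    ],
}

# One factory per task type: it turns a task description into a callable.
TASK_FACTORIES: Dict[str, Callable[[dict], Callable[[], str]]] = {
    "extract": lambda spec: (lambda: f"extract {spec['source']}"),
    "transform": lambda spec: (lambda: f"run {spec['sql']}"),
    "load": lambda spec: (lambda: f"load into {spec['target']}"),
}


def build_pipeline(config: dict) -> Tuple[List[str], Dict[str, Callable[[], str]]]:
    """Build the tasks described in `config` and return them with a valid run order."""
    tasks: Dict[str, Callable[[], str]] = {}
    graph: Dict[str, set] = {}
    for spec in config["tasks"]:
        tasks[spec["name"]] = TASK_FACTORIES[spec["type"]](spec)
        # Map each task to the set of tasks that must run before it.
        graph[spec["name"]] = set(spec["depends_on"])
    order = list(TopologicalSorter(graph).static_order())
    return order, tasks


if __name__ == "__main__":
    order, tasks = build_pipeline(PIPELINE_CONFIG)
    for name in order:
        print(name, "->", tasks[name]())
```

The appeal of this design is that adding a pipeline only requires a new configuration file, while the generation code stays the same.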
Fast News ⚡ and Releases 👻
Let's bring some data product news and releases into the Fast News to follow how our beloved data products are evolving.
- dbt Cloud — They finally rolled out environment variable support in dbt Cloud; it can be used to bring environment-related variables into the Jinja templates.
- Snowflake x Google Cloud — Snowflake announced support with Google VPC. It could, for instance, allow you to block public access to your account.
- Metaflow UI out — Metaflow, the Netflix open-source ML platform, is finally getting a GUI. It bridges the gap with MLflow and lets scientists go deeper into model monitoring, understanding and debugging.
Below, the usual Fast News.
- Snowflake Java UDTF in action — Felipe Hoffa shows how he used a UDF to detect the language of Reddit messages and created a treemap.
- Hazelcast + Kibana — A walkthrough post on how to combine Hazelcast and Kibana to explore Wikipedia data. Hazelcast is an in-memory platform geared towards real-time performance.
- Choosing a Cache — If you are looking for criteria to meet when choosing a cache, this article is for you. It does not talk about technologies, only about concepts to look for.
- Tracking in SwiftUI — If your product team does not know how to track your iOS app, Mixpanel wrote about how you can do it elegantly without breaking any Swift logic.
- Count rows in a CSV file — There are at least 6 ways to count the rows of a file in Python and, surprisingly, Pandas is not the best one, lmao.
- LinkedIn and the Great Reshuffle — What a concept! It was popularized by LinkedIn's CEO following the Great Resignation, and here we have an AI post looking over what LinkedIn is doing. It's mildly interesting, but I like the Great Concepts.
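For the CSV row-counting item in the list above, here is a minimal sketch of two standard-library approaches. These are not necessarily among the methods benchmarked in the linked article, and the function names are hypothetical. The point is that physical lines and CSV records are not always the same count.

```python
import csv


def count_lines(path: str) -> int:
    """Count physical lines by streaming the file, without loading it in memory."""
    with open(path, "r", encoding="utf-8", newline="") as f:
        return sum(1 for _ in f)


def count_csv_records(path: str, has_header: bool = True) -> int:
    """Count CSV records; a quoted field containing a newline is a single record."""
    with open(path, "r", encoding="utf-8", newline="") as f:
        n = sum(1 for _ in csv.reader(f))
    # Do not count the header row as data.
    return max(n - 1, 0) if has_header else n
```

The first function is usually enough for simple files. The second one is safer when fields can contain line breaks inside quotes.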
¹ Props to Anthony Henao for the title (cf. here)