
Data News — Week 43

Data News #43 — ClickHouse raising spree, BigQuery brain drain, data team organization at Postman, grow as ML Engineer, product releases and more!

Christophe Blefari
Me reading Hacker News comments (credits)

Hello dear readers, this is week 43 of 2021. Yes, that is the big advantage of being a subscriber: you get the week number directly in your mailbox.

This week I published 3 stories about the time I deleted data in production, even if the title is deliberately clickbait because the term "production" is blurry. It was my first time on the Hacker News front page and I did not like it: as you can see there, the comments are a bit negative. But isn't that a good way to learn something?

I've changed something in the usual fast news, so read everything to see it!

Data fundraising 💰

  • 5 weeks after their Series A, ClickHouse is on a raising spree: they announced their $250m Series B. As a reminder, ClickHouse is an open-source analytical database that promises lower-latency performance than a warehouse. As always, I see this as a good step towards a greater diversity of available warehouse technologies.
  • To continue on database technologies, Yugabyte raised $188m in a Series C to become the "default database for Cloud native applications". It's a distributed SQL database built over Postgres. They are going to compete with Google Spanner and CockroachDB.
  • This week also features a transfer, a new departure from the BigQuery team. Hossein Ahmadi has joined Snowflake as a Principal Software Engineer after 10 years at Google. This follows week 32, when another Principal left to become CTO at Firebolt.

The next big challenge for Data is organizational

I know that you are a diligent reader of the data news, so you have already noticed that I really like articles dealing with data organization topics, even if I'm a data engineer 🤓. We already see recurring organizational issues in all data teams, and this is going to be the next big challenge for data.

I totally agree with the article: today, tools are important but not as crucial as people. Tomorrow your data teams will shine because you hire the right people for your organization and because you embrace techniques that suit them, probably inspired by software engineering.

Also, do not hesitate to subscribe if you like this kind of content.

How Postman's data team works

We have already seen how the Postman data team organized their platform around Atlan to create a data workspace for everyone. This week we'll see how they organized the data team itself to meet operational needs. It seems they are organized around 2 teams, engineering and science, with analysts included in the second one. They also put different hats on the analysts, who can be central, embedded or distributed.

I really like this article because it shows the thinking they went through while building their team, and it could help everyone.

Grow as a ML engineer

The ML engineer position is probably the newest in the tech ecosystem. Its boundaries and role are still blurry, but the more people write about it, the faster we will converge on a good definition. This week I suggest you read this Roadmap: from Backend Engineer to ML Engineer. It covers the major topics an MLE is involved in and the main skills the role needs to master.

If you also plan to become a freelance machine learning engineer, Pau wrote about how you can defeat your impostor syndrome and launch yourself as an independent.

To finish, and because I'm a data engineer, I'd also like to share the 7 things you need to know to become a data engineer that Medhi proposed.

A ML engineer, a data engineer and an analytics engineer are in a pot... (credits)

The rise and evolution of data engineering — what’s it all about? 🌙→☀️

Data engineering is truly changing. The first reason is all the tools appearing each day that simplify our daily job. The second reason is that it is starting to become hyped, attracting money and talent. Data engineers are being dragged out of the shadows because of this. But what's outside the shadows? Is it light? Olivier Molander puts words on the evolution of data engineering.

What is data versioning and 3 ways to implement it

We've all built a data lake or a data warehouse at least once in our lives, and we probably did it the batch way. This is normal. But as time passes and volume grows, the batch method will start to struggle. You'll then need to understand concepts such as change data capture (CDC) or data versioning; here is how to do it.
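As a rough illustration of the change-capture idea, here is a minimal Python sketch that compares two snapshots of a table, keyed by primary key, and emits insert, update and delete events. It is not taken from the linked article: the function name `diff_snapshots`, the event shape and the sample rows are all hypothetical, and real CDC tools usually read the database log rather than diffing full snapshots.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]
Snapshot = Dict[int, Row]  # primary key -> row (hypothetical layout)


def diff_snapshots(old: Snapshot, new: Snapshot) -> List[dict]:
    """Return the change events that turn `old` into `new`."""
    events = []
    # Keys only present in the new snapshot are inserts.
    for key in sorted(new.keys() - old.keys()):
        events.append({"op": "insert", "key": key, "row": new[key]})
    # Keys present in both snapshots are updates when the row differs.
    for key in sorted(old.keys() & new.keys()):
        if old[key] != new[key]:
            events.append({"op": "update", "key": key, "row": new[key]})
    # Keys only present in the old snapshot are deletes.
    for key in sorted(old.keys() - new.keys()):
        events.append({"op": "delete", "key": key, "row": old[key]})
    return events


# Made-up sample data, for illustration only.
yesterday = {1: {"name": "Ada"}, 2: {"name": "Bob"}, 3: {"name": "Cy"}}
today = {1: {"name": "Ada"}, 2: {"name": "Bobby"}, 4: {"name": "Dee"}}

changes = diff_snapshots(yesterday, today)
for change in changes:
    print(change["op"], change["key"])
```

Storing these events instead of overwriting the table is one simple way to keep a history of the data, at the cost of scanning both snapshots on every run.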

To go further on this concept, I suggest this episode of the Data Engineering Podcast about Streaming data pipelines made SQL with Decodable (I like the pun) and this state of the art on Feature Stores. Look at the end of the post for a link to the Feature Store Summit if you're interested.

Data Engineers shouldn't write Airflow DAGs: new episode [1]

The Leroy Merlin Russia team explained how they implemented configuration-driven DAG generation for their data platform using Airflow. Thanks to YAML templates, they are able to describe what data they want and how they transform it.
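To give a feel for the configuration-driven approach, here is a minimal sketch in plain Python. It does not use Airflow and is not the Leroy Merlin implementation: the config schema (`name`, `tasks`, `id`, `depends_on`) and the function `build_execution_order` are hypothetical, and the dict stands in for what a parsed YAML template might look like. The idea is only that the pipeline is described as data and the code derives the task order from it.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical config, as it might look once a YAML file has been parsed.
PIPELINE_CONFIG = {
    "name": "orders_daily",
    "tasks": [
        {"id": "extract_orders", "depends_on": []},
        {"id": "clean_orders", "depends_on": ["extract_orders"]},
        {"id": "load_orders", "depends_on": ["clean_orders"]},
    ],
}


def build_execution_order(config: dict) -> list:
    """Validate the declared dependencies and return tasks in a runnable order."""
    known = {task["id"] for task in config["tasks"]}
    graph = {}
    for task in config["tasks"]:
        missing = set(task["depends_on"]) - known
        if missing:
            raise ValueError(
                f"{task['id']} depends on unknown tasks: {sorted(missing)}"
            )
        # TopologicalSorter expects a mapping of node -> its predecessors.
        graph[task["id"]] = set(task["depends_on"])
    return list(TopologicalSorter(graph).static_order())


order = build_execution_order(PIPELINE_CONFIG)
print(order)
```

In a real setup, a generator of this kind would create the scheduler's tasks from each entry instead of returning a list, so that adding a pipeline means adding a config file rather than writing code.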

Let's release data products (credits)

Fast News ⚡ and Releases 👻

Let's bring some data product news and releases in the fast news to follow how our beloved data products are evolving.

Below, the usual Fast News.


[1] Props to Anthony Henao for the title (cf. here)
