My weeks are like (credits)

Hello Data News readers. The weeks are pretty intense for me and every Friday comes in the blink of an eye. I write the introduction before the content of the newsletter, so I don't know how it'll turn out today. But I hope you'll enjoy it.

For a future deep-dive, I'm looking for data engineering career paths. If you have one, or something similar, in your company, I'd love to have a look at it (everything will be anonymized by default, ofc).

No fundraising this week. I didn't find any news worth shining a light on.

Data roles

Every tech lead faces this identity crisis one day or another, and it's the same for every data lead. How should you divide your time between management, individual contribution and stakeholders? Mikkel describes well the difficult life of the data lead. I was previously in a lead role, and the main advice I can give to people in the same situation is: do your grieving and stop the contribution work, except for code reviews.

In the same vein, two other posts I liked this week:

The metrics layer

Pedram produced a deep-dive on the metrics layer. He explains what's behind the concept and what the current solutions proposing a metrics layer are: Looker, dbt Metrics and Lightdash.

In the current state of the technology, the metrics layer is nothing more than a declarative way (a file) to describe the metrics, dimensions, filters and segments in your warehouse tables. In Looker you write it in LookML, in dbt and Lightdash you use the dbt YAML, and in Cube you use JavaScript.
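To make this concrete, here is a minimal sketch of what such a declarative definition could look like, in the spirit of the dbt metrics YAML; the model, column and metric names are all hypothetical, illustrative placeholders.

```yaml
# Hypothetical dbt-style metric definition (names are made up for illustration).
metrics:
  - name: weekly_active_users
    label: Weekly Active Users
    model: ref('fct_user_events')    # a hypothetical warehouse table
    calculation_method: count_distinct
    expression: user_id
    timestamp: event_at
    time_grains: [day, week, month]
    dimensions: [country, device]    # the slices a BI tool could offer
    filters:
      - field: is_internal
        operator: '='
        value: 'false'
```

The point is that this is pure declaration: what the metric is, where it lives, and how it can be sliced, with no tool-specific query logic.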

The end vision of the metrics layer is an interoperable way to define metrics and dimensions that every BI tool will understand natively, avoiding the hours spent recreating this knowledge in each tool. But we are far from there.

McDonald’s event-driven architecture

Event flows at McDonald's (credits)

A two-post series details what's behind McDonald's events architecture. In the first post, they define what it means to develop such an architecture: something that needs to be scalable, available, performant, secure, reliable, consistent and simple. Quite conventionally, they picked Kafka (but managed by AWS), the Schema Registry, DynamoDB to store the events, and API Gateway to create an API endpoint that receives events. Nothing fancy on the face of it, but it looks solid.

In the second post, they give the big picture and show how everything orchestrates together, defining the typical data flow. We can summarize it like this: define the event schema, produce the event, validate it, publish it, and if something goes wrong, use a dead-letter topic or write directly to DynamoDB.
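The flow above can be sketched in a few lines. This is not McDonald's actual code, just a toy Python model of the routing logic, with stand-in stubs for the schema check, the broker and the fallback store; every name here is an assumption.

```python
# Toy sketch of the publish flow: validate against a schema, try the main
# topic, and fall back to a dead-letter path or a DynamoDB-like store.
# All names and behaviors are illustrative, not McDonald's real API.

REQUIRED_FIELDS = {"event_id", "event_type", "payload"}

def validate(event: dict) -> bool:
    """Stand-in for a Schema Registry check: required fields present?"""
    return REQUIRED_FIELDS.issubset(event)

def publish(event: dict, broker_up: bool = True) -> str:
    """Route an event and return the destination it ends up in."""
    if not validate(event):
        return "dead-letter-topic"     # malformed events are quarantined
    if not broker_up:
        return "dynamodb-fallback"     # broker outage: persist for replay
    return "events-topic"              # happy path

good = {"event_id": "1", "event_type": "order", "payload": {}}
bad = {"event_id": "2"}

print(publish(good))                   # → events-topic
print(publish(bad))                    # → dead-letter-topic
print(publish(good, broker_up=False))  # → dynamodb-fallback
```

The design choice worth noting is that failures never drop events: they are always parked somewhere replayable.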

ML Friday 🤖

Fast News ⚡️

The true Uber alternative (credits)

See you next week and please stop writing about data contracts.