blef.fr

Me enjoying the data engineering playlist while *everything is good now* (credits)

Hey, this will probably be one of the shortest editions of the year. I feel that summer is coming and fewer articles are being written, while LinkedIn posts are still flourishing, with uneven quality. Sadly, I miss the good ol' web.

While reading this edition, listen to the Spotify data engineering playlist made by Barr Moses. 🎶 EVERYTHING IS BROKEN.

Data Fundraising 💰

Whaly raised a $1.9m seed round to provide an all-in-one BI tool. The YC company offers a way to sync data from dozens of sources directly into your warehouse and adds on top of this a visual way to transform your data and plug it into their Report Builder. They approach the modern data stack from a BI perspective, providing all the tools needed in one platform.
IBM acquires Databand. Is it already time for consolidation in the data observability space? As mentioned by IBM, this is their fifth acquisition since the beginning of the year. It will be interesting to follow how Databand evolves once in contact with IBM customers.

How to make great schemas

In data engineering we often draw schemas (diagrams) to present architectures, projects or other stuff. Information visualisation is the best way to simplify the complex world we live in. Benoit described what to do to make great schemas, from using a sheet of paper (yes, you know, the white rectangle you may have somewhere in a drawer) and a pencil to using digital tools.

To be honest I hate Benoit right now because I deeply want the $350 e-ink tablet he's using to draw.

Speaking of schemas, he also featured in his monthly newsletter a great way to visualize SQL joins. It is way better than the traditional one with circles.
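One thing the circles tend to hide, at least as I read them: a join returns rows, not a set intersection, so a key that matches several times multiplies rows. Here is a tiny Python sketch of that point. It is my own illustration and not taken from Benoit's newsletter; the tables, column names and helper functions are made up.

```python
# Toy tables (hypothetical data, for illustration only).
customers = [
    {"customer_id": 1, "name": "Ada"},
    {"customer_id": 2, "name": "Linus"},
    {"customer_id": 3, "name": "Grace"},
]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 1},
    {"order_id": 12, "customer_id": 4},  # no matching customer
]


def inner_join(left, right, key):
    """Keep one output row per matching (left, right) pair."""
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]


def left_join(left, right, key, right_columns):
    """Like inner_join, but keep unmatched left rows, padded with None."""
    rows = []
    for l in left:
        matches = [r for r in right if r[key] == l[key]]
        if matches:
            rows.extend({**l, **r} for r in matches)
        else:
            rows.append({**l, **{column: None for column in right_columns}})
    return rows


inner = inner_join(customers, orders, "customer_id")
left = left_join(customers, orders, "customer_id", ["order_id"])

# Ada has two orders, so she appears twice in both results.
print(len(inner), "rows from the inner join")
print(len(left), "rows from the left join")
```

Three customers go in and four rows come out of the left join, which is exactly the kind of thing two overlapping circles cannot show.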

2 technical deep dives that will make you dizzy

Uber uses Spark at a scale few companies have ever imagined, which means they shuffle a lot. The shuffle is the operation that happens every time data is transferred between the stages of a job. So they decided to develop a Remote Shuffle Service that handles all shuffles efficiently. This is a crazy deep technical post.
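If the shuffle still feels abstract, here is a toy Python sketch of the general idea: map tasks bucket their records by key, the buckets travel to the reducer that owns each partition, and only then can the next stage aggregate. It has nothing to do with Uber's actual Remote Shuffle Service or with Spark internals; the function names, the CRC32 partitioner and the sample data are all assumptions made for illustration.

```python
import zlib
from collections import defaultdict


def partition_for(key, num_partitions):
    """Stable hash partitioner: the same key always maps to the same partition."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def map_side_write(task_records, num_partitions):
    """A map task groups its (key, value) records by target partition."""
    buckets = defaultdict(list)
    for key, value in task_records:
        buckets[partition_for(key, num_partitions)].append((key, value))
    return dict(buckets)


def shuffle(map_outputs, num_partitions):
    """Move every bucket to the reducer that owns its partition.

    On a real cluster this is the expensive part: the data crosses the network.
    """
    reducer_inputs = {p: [] for p in range(num_partitions)}
    for buckets in map_outputs:
        for partition, records in buckets.items():
            reducer_inputs[partition].extend(records)
    return reducer_inputs


def reduce_sum(records):
    """A reduce task sums the values of each key it received."""
    sums = defaultdict(int)
    for key, value in records:
        sums[key] += value
    return dict(sums)


NUM_PARTITIONS = 2
map_tasks = [
    [("paris", 1), ("lyon", 2), ("paris", 3)],
    [("lyon", 4), ("nice", 5), ("paris", 6)],
]

map_outputs = [map_side_write(task, NUM_PARTITIONS) for task in map_tasks]
reducer_inputs = shuffle(map_outputs, NUM_PARTITIONS)

totals = {}
for records in reducer_inputs.values():
    totals.update(reduce_sum(records))
print(totals)
```

Every record of a given key ends up on a single reducer, whichever map task produced it. Doing that movement for a huge number of jobs is why a dedicated service can be worth building.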

Canva is an online graphic design platform, which means they have a lot of visual content, which means they need GPUs if they want to apply machine learning to that content. They developed an awesome encapsulation of their applications combining Docker, Kubernetes and Nix for ML. This is a crazy deep technical post.

Me after reading the 2 previous articles (credits, cropped)

THE CLOUD AND DATA WAREHOUSE - ARE THEY COMPATIBLE?

First, you don't need to yell at me. Second, this is a good question I ask myself every time I wake up. Thankfully Bill Inmon (one of the 2 popes of the data warehouse) also had this question in mind 2 days ago. To him the cloud is not totally compatible with data warehouses, mainly because of data movement, which is a big cost in a cloud environment.

Data platforms future

Speaking of cloud costs, this week Kris wrote some thoughts on how data platform costs are driven by the underlying cloud costs, and on how hard it will be for companies to keep up. Pay-as-you-go has its limits.

On the other side, Alexandre finished his 3-post series about Data Platforms: Past, Present, Future. In a well-written Medium piece he tries to guess where we are going and which mutations the data field will face.

As a side note, in the 3rd mutation he mentions the Data Mesh, but the Gartner hype cycle already considers the concept obsolete. What a fun world.

Product News 🎚

This is a category I sometimes have in mind, but I usually blend it into the Fast News. Here I want to try splitting it out.

Preset announced their dbt integration. This is interesting to see: Preset is a BI tool (the cloud offering of Apache Superset), and they decided to develop a deep integration with dbt that works in both directions. Preset is able to read sources, models and metrics from dbt, and dbt can access dashboards in order to fill exposures. This is something Preset developed on their end with their CLI, but it still paves the way for other tools.
Discover dolt. Dolt is another SQL database, but with a key differentiator: you can manage your data with git-like commands. All the commands you know for Git work exactly the same for Dolt. I want to try dolt cherry-pick.

Fast News ⚡️

How do I prevent people from running SELECT * on Snowflake tables? Felipe Hoffa proposed a way to add a 1/0 column to your Snowflake table to trigger an error for anyone trying to run a SELECT *. WHAT A G33K 🤓.
Cracking the Data Engineering Interview — Part 1: Structure — I agree with the main parts mentioned, and I'd add that you should use my guide on data engineering to refresh your brain before interviews.
Delta vs. Hudi — or how a performance test said one thing the first time and was invalidated the second time they ran it. They were stating that Delta was outperforming Hudi, but they had misconfigured it, and now the two perform the same. Never trust performance tests. Make your own, and still, don't trust yourself.
If you are a financial geek, I found the blog you were expecting — Clouded Judgement; every week Jamin breaks down valuation trends in the cloud universe. This week he analysed Q1 results, and Snowflake is in his top 8 winners.
⏱ 3 tips to take back control of your time — A friend of mine wrote this and I really like it.