Hello dear members, I hope this new edition finds you well. I want to welcome all the new subscribers who joined this week: it boomed. In this newsletter you'll find data articles from my weekly curation, plus my subjective views.
The Data News is a way for me to keep a record of the articles I like, but also a way for you to save time and get a diverse glimpse of the data ecosystem.
Enjoy the reading.
Data fundraising 💰
- We've already seen the Firebolt data warehouse in the newsletter. They market their product excellently and claim a performance gap with other data warehouses. After recruiting key engineers from the BigQuery team, Firebolt raised $100m at a $1.4b valuation. I'm waiting for feedback from the community before forming any opinion.
- Dremio raised $160m in Series E, reaching a $2b valuation. Dremio is a SQL-first data lake platform that plugs its query engine on top of your cloud storage to create your interactive analytics layer.
- Speaking of cloud storage, MinIO raised $103m in Series B to provide a cloud-agnostic, S3-compatible storage platform. I didn't know the product before and I think it's a good trend to watch.
ClickOps: it's time to tell the truth
This January I started teaching "DataOps" to students. It's a new class I've written this year. As DataOps differs so much from company to company it's hard to define, but there are still invariants. Even if we sell Terraform or Ansible as magic tools, in the end, as Corey Quinn said, we're going to “click around in the web console, then lie about it”. This is ClickOps.
As sad as it sounds, there is even a browser extension that records what you do in the console and translates it into config: the Console Recorder.
PS: if you're interested in the course content, ping me; I'll need beta testers because I plan to release it to everyone. I cover cloud (GCP), Terraform, data infra, and DevOps, and we build a dev + data platform from scratch.
A new year always means technology introspection
Every new year we get the same recipe: posts about the trending language or tool for the year to come, and I usually don't like them. But this week Furcy tried to find the place of Spark (and Databricks) in the Modern Data Stack, and the post is great; in the end, SQL will still come first.
On the other side, Mehdi said that the data engineer should not be just a "SQL person" and placed a bet that Rust could become a thing in the data world. To be honest, I don't totally agree, mainly because today there aren't many data engineers who are only SQL-focused, and also because Python is still well suited and simple for today's data use-cases, which are still batch-driven.
Learn from Google’s data engineers: don’t optimize your SQL
No, this is bad advice.
Galen, who works at Google, wrote this piece of advice saying you should save time by not optimizing your SQL [the original post has been deleted]. It's a strong opinion, and I found the post thanks to the associated Reddit thread. The main takeaway is that cloud computing time is way cheaper than your salary, so don't bother doing optimisation with MERGE or dimensional modeling; do full snapshots and deliver more value elsewhere.
It's probably a solution when you work at Google, because you have close to unlimited power, money and storage, but I'm not sure it suits everyone. If you live every day in fear of breaking the BigQuery limits with a badly written query, you are far from this. With this kind of mindset, how lucky we are that Google is carbon neutral 🙄. Yep, more cloud means more datacenters.
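The "compute is cheaper than your salary" takeaway is really a break-even computation, and it's worth running it for your own numbers. Here is a minimal sketch; every figure (price per TB, data volumes, hourly cost) is an illustrative assumption of mine, not from the post:

```python
# Back-of-the-envelope check of "compute is cheaper than your salary".
# Every number below is an illustrative assumption.

SCAN_PRICE_PER_TB = 5.0       # on-demand price per TB scanned, in dollars
EXTRA_TB_PER_DAY = 0.4        # extra data scanned daily by full snapshots vs MERGE
ENGINEER_HOURLY_COST = 75.0   # fully loaded engineer cost, dollars per hour
OPTIMIZATION_HOURS = 16.0     # time to build and maintain the incremental logic

extra_compute_per_year = SCAN_PRICE_PER_TB * EXTRA_TB_PER_DAY * 365
optimization_cost = ENGINEER_HOURLY_COST * OPTIMIZATION_HOURS

print(f"Extra compute per year: ${extra_compute_per_year:.0f}")  # $730
print(f"Engineering cost: ${optimization_cost:.0f}")             # $1200
print("optimize" if extra_compute_per_year > optimization_cost else "snapshot")
```

Bump EXTRA_TB_PER_DAY to a few terabytes and the conclusion flips, which is exactly why the advice doesn't generalize outside Google-scale budgets.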
Building Reference Architectures for User-Facing Analytics
What can you do to develop user-facing analytics? What are the available open-source solutions? Dunith explored different options, proposing Spark + MongoDB or Apache Pinot combined with CDC. It reminds me that I don't know a lot about Pinot and that I should explore it more.
Learn from the best
The GitLab Handbook has always been a huge resource for all data people. I saw this week that they refreshed it, and I noticed 2 concepts I really liked:
- Data Pump — This sounds like Reverse ETL, but I prefer the data pump name. They also created a well-documented approach to doing it.
- Trusted Data Framework — If you're still in the middle of defining your data quality / testing approach, GitLab's way of doing it will help you for sure.
Data Customers expect Data Teams to provide data they can trust to make their important decisions. And Data Teams need to be confident in the quality of data they deliver. But this is a hard problem to solve.
Fast News ⚡️
- If you ever wondered what the single and double underscores mean in Python, Ahmed tried all the cases for you.
- Snowpark Is Now Generally Available — Use Scala and Java to interact with Snowflake dataframes.
- How will Paris public transport evolutions impact average travel time in the future? Modality developed a tool to explore geographical data, like kepler.gl. It's well crafted.
- Let's remember Oozie — I developed some Oozie workflows 8 years ago, and seeing this post made me nostalgic; new data engineers never had the chance to play with it.
- Now’s the time to tackle data ownership — Maggie's post, from the DataHub project, was like an alarm to me. Data ownership has always been a problem in my previous experiences; maybe it will sound the same to you.
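On the underscores item in the list above, here is my own quick recap of the most common conventions (a sketch, not necessarily the cases from Ahmed's article):

```python
class Account:
    def __init__(self, owner, balance):
        self.owner = owner
        self._balance = balance   # single leading underscore: "internal use", by convention only
        self.__token = "secret"   # double leading underscore: name-mangled to _Account__token

    def __repr__(self):           # leading + trailing doubles ("dunder"): reserved by the language
        return f"Account({self.owner!r})"


acct = Account("jane", 100)
print(acct._balance)              # 100 -> still accessible, the underscore is only a hint
print(acct._Account__token)       # secret -> the mangled name; acct.__token would raise

# A lone underscore marks a throwaway variable
quotient, _ = divmod(7, 2)        # keep the quotient, ignore the remainder
print(quotient)                   # 3
```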