
Data News — Week 4

Data News #4 — Firebolt & Dremio fundraising, ClickOps, don't optimize SQL, learn from GitLab, pray Oozie.

Christophe Blefari
4 min read
Me on Thursday morning seeing new members (credits)

Hello dear members, I hope this new edition finds you well. I want to welcome all the new subscribers who joined this week; it boomed. In the newsletter you'll find data articles from my weekly curation, plus my subjective views.

The Data News is a way for me to keep a record of the articles I like, but also a way for you to save time and get a diverse glimpse of the data ecosystem.

Enjoy the reading.

Data fundraising 💰

  • We've already seen the Firebolt data warehouse in the newsletter. They market their product excellently and claim a performance gap over other data warehouses. After recruiting key engineers from the BigQuery team, Firebolt raised $100m at a $1.4b valuation. I'm waiting for feedback from the community before forming an opinion.
  • Dremio raised $160m in Series E, reaching a $2b valuation. Dremio is a SQL-first data lake platform that plugs its query engine on top of your cloud storage to create your interactive analytics layer.
  • When it comes to cloud storage, MinIO raised $103m in Series B to provide a cloud-agnostic object storage platform that is S3-compatible. I didn't know the product before, and I think it's a good trend to watch.
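For the curious, S3 compatibility means the standard AWS tooling works unchanged against MinIO. Here is a minimal sketch, assuming a local MinIO server running with its default development credentials; the bucket and object names are made up:

```python
# Minimal sketch: because MinIO speaks the S3 API, the regular AWS SDK works
# against it. Endpoint, credentials, and names below are placeholders.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",  # a local MinIO server
    aws_access_key_id="minioadmin",        # MinIO's default dev credentials
    aws_secret_access_key="minioadmin",
)

s3.create_bucket(Bucket="raw-events")
s3.put_object(Bucket="raw-events", Key="2022/01/events.json", Body=b'{"event": "signup"}')
print(s3.list_objects_v2(Bucket="raw-events")["Contents"][0]["Key"])
```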

ClickOps: it's time to tell the truth

This January I started teaching "DataOps" to students. It's a new class I wrote this year. Since DataOps differs so much from company to company, it is hard to define, but there are still invariants. Even if we sell Terraform or Ansible as magic tools, in the end, as Corey Quinn said, we "click around in the web console, then lie about it". This is ClickOps.

As sad as it sounds, there is even a browser extension, Console Recorder, that records what you do in the console and translates it into configuration.
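The honest fix is still to describe resources in code you can review and replay. As a minimal sketch (assuming GCP, since that's what the course covers, with credentials already configured and a made-up bucket name), creating a bucket from a script instead of the console looks like this:

```python
# Minimal sketch of the anti-ClickOps reflex: create the resource from code,
# not from the web console. Assumes GOOGLE_APPLICATION_CREDENTIALS is set;
# the bucket name is hypothetical.
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-data-platform-landing-zone")
bucket.storage_class = "STANDARD"
client.create_bucket(bucket, location="EU")
print(f"Created {bucket.name}, and no need to lie about it.")
```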

PS: if you're interested in the course content, ping me. I'll need beta testers because I plan to release it to everyone. It covers cloud (GCP), Terraform, data infrastructure, and DevOps, and we build a dev + data platform from scratch.

A new year always means technology introspection

Every new year we get the same recipe: posts about what the trending language or tool of the coming year will be, and I usually don't like it. But this week Furcy tried to find the place of Spark (and Databricks) in the Modern Data Stack, and the post is great; in the end, SQL will still come first.

On the other side, Mehdi said that the data engineer should not be just a "SQL person" and places a bet that Rust could become a thing in the data world. To be honest, I don't totally agree, mainly because today there aren't a lot of data engineers who are only SQL-focused, and also because Python remains well suited and simple for today's data use cases, which are still batch-driven.
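To illustrate that last point, here is a minimal sketch of the kind of batch job where Python still shines, written with PySpark; the paths and column names are invented:

```python
# Minimal sketch of a typical batch job in PySpark, the Python-plus-SQL mix
# most data engineering still looks like. Paths and columns are made up.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily-orders").getOrCreate()

orders = spark.read.parquet("gs://my-lake/raw/orders/dt=2022-01-20/")
daily = (
    orders
    .where(F.col("status") == "completed")
    .groupBy("country")
    .agg(F.sum("amount").alias("revenue"))
)
daily.write.mode("overwrite").parquet("gs://my-lake/marts/daily_revenue/")
```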

Learn from Google’s data engineers: don’t optimize your SQL

No, this is bad advice.

Galen, who works at Google, wrote this piece of advice saying you should save time by not optimizing your SQL [the original post has been deleted]. It's a strong opinion, and I found the post thanks to the associated Reddit thread. The main takeaway is that cloud computing time is way cheaper than your salary, so don't bother with MERGE statements or dimensional modeling: do full snapshots and deliver more value elsewhere.
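To make the trade-off concrete, here is a hedged sketch of the two approaches the post contrasts, using BigQuery's Python client; the dataset and table names are invented:

```python
# Sketch of the two strategies: a lazy full snapshot vs. an incremental MERGE.
# Dataset and table names are made up.
from google.cloud import bigquery

client = bigquery.Client()

# Simple but wasteful: rebuild the whole table from source on every run.
full_snapshot = """
CREATE OR REPLACE TABLE analytics.customers AS
SELECT * FROM raw.customers
"""

# Cheaper to run, more work to write: only apply what changed.
incremental_merge = """
MERGE analytics.customers T
USING raw.customers_updates S
ON T.customer_id = S.customer_id
WHEN MATCHED THEN UPDATE SET T.email = S.email, T.updated_at = S.updated_at
WHEN NOT MATCHED THEN INSERT ROW
"""

client.query(full_snapshot).result()  # Galen's advice: stop here
```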

It's probably a valid solution when you work at Google, because you have close to unlimited compute, money, and storage, but I'm not sure it suits everyone. If you live every day in fear of breaking BigQuery limits with a badly written query, you are far from this. With this kind of mindset, how lucky we are that Google is carbon neutral 🙄. Yep, more cloud means more data centers.

Each time your SD card is full, buy a new one. Don't optimize photo storage. (credits)

Building Reference Architectures for User-Facing Analytics

What can you do to build user-facing analytics? What open-source solutions actually exist? Dunith explored several options, proposing Spark + MongoDB or Apache Pinot combined with CDC. It reminds me that I don't know much about Pinot and should explore it more.
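As a first step in that exploration, here is a hedged sketch of the user-facing side: querying Pinot from Python with the pinotdb client. The host, table, and columns are placeholders, assuming a broker on its default port:

```python
# Sketch of a user-facing analytics query against a Pinot broker via pinotdb.
# Host, table, and column names are hypothetical.
from pinotdb import connect

conn = connect(host="localhost", port=8099, path="/query/sql", scheme="http")
curs = conn.cursor()
curs.execute("""
    SELECT country, COUNT(*) AS views
    FROM pageviews
    GROUP BY country
    ORDER BY views DESC
    LIMIT 10
""")
for row in curs:
    print(row)
```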

Learn from the best

The GitLab Handbook has always been a huge resource for data people. I saw this week that they refreshed it, and I noticed two concepts I really liked:

  • Data Pump — This sounds like Reverse ETL, but I prefer the data pump name. They also created a well-documented approach to doing it.
  • Trusted Data Framework — If you're still in the middle of defining your data quality / testing approach, GitLab's way of doing it will help you for sure (see the sketch after the quote below).
"Data Customers expect Data Teams to provide data they can trust to make their important decisions. And Data Teams need to be confident in the quality of data they deliver. But this is a hard problem to solve."
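This is not GitLab's actual framework, just a minimal sketch of the idea behind trusted data tests: assert the properties consumers rely on before publishing a mart. The file, table, and column names are made up:

```python
# Minimal sketch of "trusted data" style checks: verify the guarantees your
# data customers depend on before they see the data. Names are hypothetical.
import pandas as pd

mart = pd.read_parquet("marts/daily_revenue.parquet")  # hypothetical mart

def test_not_empty(df: pd.DataFrame) -> None:
    assert len(df) > 0, "mart is empty"

def test_primary_key_unique(df: pd.DataFrame) -> None:
    assert df["country"].is_unique, "duplicate keys in mart"

def test_no_negative_revenue(df: pd.DataFrame) -> None:
    assert (df["revenue"] >= 0).all(), "negative revenue found"

for test in (test_not_empty, test_primary_key_unique, test_no_negative_revenue):
    test(mart)
```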

Fast News ⚡️
