Hey you, the 11th of November used to be a day off for me. Since I started freelancing I don't really follow the usual calendar, working whenever I need or want to. I mainly work 3 to 4 days a week, which is awesome, but it has a major drawback: I've never taken a break longer than a week. Which, yeah, kinda sucks. Let's change that next year.
On a social note, today I joined the data-folks Mastodon server, where you can follow me. I'll add this new community as a source for my curation and I'm gonna try to be active there.
Also, on the 21st of November I'm gonna speak at a meetup for the first time in English, in Berlin. So if you wanna listen to my terrible French accent, join us. I'll speak about "How to build the data dream team".
Let's jump onto the news.
Ingredients of a Data Warehouse
Going back to basics. Kovid wrote an article that explains the ingredients of a data warehouse, and he does it well. A data warehouse is a piece of technology built on 3 ideas: data modeling, data storage and the processing engine.
In the post Kovid details each idea. In this cloud world where everything is serverless, good data modeling is still a key factor in the performance (which often means cost) of a data platform. Modeling is usually led by dimensional modeling, but you can also do 3NF or Data Vault. When it comes to storage it's mainly a row-based vs. column-based discussion, which in the end impacts how the engine processes data.
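To make the storage discussion concrete, here is a toy illustration (not any real engine): the same table laid out row by row and column by column. The table and numbers are made up for the example.

```python
# The same 3-row table, stored two ways.

# Row-based: one record per row, great for transactional lookups.
rows = [
    {"order_id": 1, "country": "FR", "amount": 120.0},
    {"order_id": 2, "country": "DE", "amount": 80.0},
    {"order_id": 3, "country": "FR", "amount": 45.0},
]

# Column-based: one array per column, great for analytical scans.
columns = {
    "order_id": [1, 2, 3],
    "country": ["FR", "DE", "FR"],
    "amount": [120.0, 80.0, 45.0],
}

# Aggregating a single column: the row store has to touch every field
# of every record, while the column store reads exactly one array.
total_row_store = sum(r["amount"] for r in rows)
total_col_store = sum(columns["amount"])
assert total_row_store == total_col_store == 245.0
```

That asymmetry is why analytical warehouses lean columnar: a `SUM(amount)` over billions of rows only has to read one column off disk.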
Schema changes management
I bet that the most common data horror stories are about schema changes. It could be because the product team changed an integer to a varchar in a source Postgres table, or because an analyst removed the tax field from the income table. Every time it means morning headaches, with Slack messages, Airflow screaming at you with red circles and downstream pipelines to re-run.
Fast forward to today, and more and more teams are trying to fix this. Here are a few articles that will give you ideas about what to do—tbh, there isn't a one-stop solution:
- Programmatic schema management — Manage all your schemas with some kind of code. At the end of the article Petrica showcases Alembic, which works, but I think it adds a lot of overhead in the data warehousing world.
- How to be more confident making data model changes — This article is a hidden ad by the author, but still, it nicely depicts what you can do in CI/CD with a static diff that compares old schemas against new ones.
- Tulip: Schematizing Meta’s data platform — Shows a tool called Tulip that handles message schematization while also handling schema evolution.
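The static-diff idea from the second article can be sketched in a few lines: before deploying, compare the old and new table schemas and fail the build on breaking changes. A minimal sketch, with made-up schemas and rules (not taken from any of the tools above):

```python
def diff_schemas(old: dict, new: dict) -> list:
    """Return human-readable breaking changes between two table
    schemas, each expressed as {column_name: type}."""
    issues = []
    for col, old_type in old.items():
        if col not in new:
            issues.append(f"column dropped: {col}")
        elif new[col] != old_type:
            issues.append(f"type changed: {col} {old_type} -> {new[col]}")
    return issues

# Hypothetical before/after schemas of an income table.
old = {"id": "integer", "tax": "numeric", "label": "integer"}
new = {"id": "integer", "label": "varchar"}  # tax removed, label re-typed

for issue in diff_schemas(old, new):
    print(issue)
# A CI job would block the merge when this list is non-empty.
```

Column additions are ignored on purpose: they are usually backward compatible, while drops and type changes are the ones that break downstream pipelines.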
Machine learning at Riot Games
If you play video games like me you'll like this video. If not, you'll still like it, I think. This is a morning coffee session from the MLOps Community with Ian Schweer, who works at Riot Games. Ian describes how Riot Games uses data and what machine learning means to them.
Even if I recommend you watch the video, here are a few points I wrote down that were interesting to me:
- A good part of the discussion was around the fact that DEs and MLEs should just copy what SREs have been doing for years. In the end, why should data management be different from config management—ok, except for the scale?
- Riot has also embraced the concept of a feature store, but at enterprise scale there isn't yet a standard way to do it. In their case it also means they embed the ML models in the game binaries.
- This is probably the concept I liked the most from the video: the end-game dataset. It means that every game can be captured as a dataset, with a known schema, in immutable storage accessible to everyone. I like this idea and it can be replicated in a lot of different businesses.
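A hypothetical shape for such an end-game dataset: once a game ends, its stats are frozen into an immutable record with a known schema that anyone downstream can read. All field names here are invented for illustration, not Riot's actual schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen = immutable, like the storage layer
class EndGameDataset:
    game_id: str
    duration_s: int
    winner: str
    kills_by_player: tuple  # tuples rather than lists, to stay immutable

game = EndGameDataset(
    game_id="g-42",
    duration_s=1870,
    winner="blue",
    kills_by_player=(("p1", 7), ("p2", 3)),
)
# Any mutation attempt, e.g. game.winner = "red", raises
# dataclasses.FrozenInstanceError: the record is write-once.
```

The same contract (known schema, write-once storage, readable by everyone) applies to any business event you want to treat as a fact: an order, a delivery, a session.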
Fast News ⚡️
- dbt Labs Founder Tristan Handy on the modern data stack, partnerships — This is a cool (long) interview with dbt co-founder Tristan about his vision for the product. If you have time, listen to or read it. My main takeaway is that dbt (Core at least) is community-led. The community shaped dbt as a framework: a framework to organise your data assets and your knowledge. As of today, dbt is the most advanced framework for doing this. The rest is just implementation details.
- Is it time to rebrand (or rethink) the Modern Data Stack? — It complements the previous interview well. 10 years after the "Redshift revolution", it's probably time to put words on today's stacks.
- 2003–2023: A Brief History of Big Data — If, in parallel, you need a great description of the last 20 years, Furcy wrote the whole history of data platforms, from the Google File System in 2003 to the 2022 lakehouse swarm.
- Data engineering is not software engineering — Even if the title is a bit clickbait, the article holds some truths. The author states that data pipelines are not applications, and that pipelines are single-person tasks that have to be 100% complete, otherwise worthless. IMHO this is only partially true, and it mostly depends on how mature the team is in their data asset design.
- Introduction to Snowflake's Micro-Partitions — I think explanations of database internals are my favourite tech articles. It probably comes from the fact that I like to understand how the stuff I'm using works.
- GoodData and dbt Metrics — Headless BI vs. Semantic Layer will be the next big vocabulary battle in the data ecosystem. BI tools will want to sell headless BI while transformation platforms will sell metrics or semantic layers; the idea is to capture the data warehouse exposition layer via proprietary code.
Data Fundraising 💰
- Equals raises a $16m Series A. 4 months after a Seed round, they get money once again to develop their Excel alternative. The SaaS app connects to your warehouse and displays your data in tabular format after a query (graphically built or SQL). It looks like Google Sheets on steroids for data.
- EdgeDB raises a $15m Series A. Slowly, year after year, graph databases' time is coming. Enterprises rely on a multitude of apps, each with its own view of their clients, and graph databases are a key piece of technology for providing a unified view over relationships. EdgeDB is a hybrid open-source graph database built on top of Postgres.
PS: Regarding database trends, Cloud Database Report wrote a great article about 7 current database market trends: more serverless, graph, vector, Postgres used everywhere, etc.
See you next week.
Join the newsletter to receive the latest updates in your inbox.