Data News entering in town (credits)

Hey you, if I weren't late writing my newsletter it wouldn't be me. But here is your usual Data News. The main reason for this delay is that I played with LLMs yesterday. I tried to run open-source models locally on my own laptop. There are still a few bugs and the results are not really at OpenAI level, but it is fun to do.

This Tuesday we hosted the second part of the Airflow alternatives meetup with Prefect and Dagster. You can find the replay on YouTube.

Data modeling

Dear readers, I have to confess something. I did not care about data modeling for years. I mean, in the sense everyone understands it today: in 7 professional years I never built a star schema or anything similar. I was in the Hadoop world and all I was doing was denormalisation. Denormalisation everywhere. The only normalisation I did was back in engineering school while learning SQL with Normal Forms.

What I actually cared about was physical storage, data formats, logical partitioning and indexing.

But that's normal: my role was not to translate the business into tables. I still firmly believe that this is not the role of a data engineer. A data engineer should still be a software engineer working with data, empowering others with tooling and apps. Data modeling should not be a required data engineer skill. Enter the analytics engineer.

Still, I feel that there is a hole in my skillset because I can't give relevant advice when it comes to modeling a business with 3 fact tables instead of 5. And to be honest there isn't any good modern literature to answer this question. Simon started a multipart guide about data modeling. I hope he will fill the gaps. In the first part he covers the history of modeling and the main concepts.

At the same time Maxime Beauchemin wrote a post about Entity-Centric data modeling. Compared to dimensional modeling, it uses entities instead of facts, which is conceptually easier to understand and also easier to use in machine learning.
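To make the contrast concrete, here is a minimal sketch using Python's built-in sqlite3 module. It is an illustration and not taken from Maxime's post: the table and column names (dim_user, fact_order, entity_user) are hypothetical. It starts from a small dimensional layout, one dimension and one fact table, and derives an entity table with one row per user that carries its metrics as plain columns.

```python
import sqlite3

# Hypothetical toy schema, for illustration only.
SETUP_SQL = """
CREATE TABLE dim_user (user_id INTEGER PRIMARY KEY, country TEXT);
CREATE TABLE fact_order (order_id INTEGER PRIMARY KEY, user_id INTEGER, amount REAL);

INSERT INTO dim_user VALUES (1, 'FR'), (2, 'US'), (3, 'DE');
INSERT INTO fact_order VALUES (10, 1, 20.0), (11, 1, 30.0), (12, 2, 5.0);

-- Entity table: one row per user, metrics stored as columns on the entity.
CREATE TABLE entity_user AS
SELECT u.user_id,
       u.country,
       COUNT(o.order_id)            AS order_count,
       COALESCE(SUM(o.amount), 0.0) AS total_amount
FROM dim_user u
LEFT JOIN fact_order o ON o.user_id = u.user_id
GROUP BY u.user_id, u.country;
"""


def build_user_entity_table(conn):
    """Create the toy tables and return the entity rows ordered by user."""
    conn.executescript(SETUP_SQL)
    query = (
        "SELECT user_id, country, order_count, total_amount "
        "FROM entity_user ORDER BY user_id"
    )
    return conn.execute(query).fetchall()


conn = sqlite3.connect(":memory:")
rows = build_user_entity_table(conn)
for row in rows:
    print(row)
```

Each row of entity_user can be read on its own, without joining back to a fact table, which is roughly why this shape is convenient as a feature table for machine learning.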

When it comes to modeling it's hard not to mention dbt. In recent years dbt simplified and revolutionised the tooling to create data models. dbt, as of today, is the leading framework. But alternatives are coming. This week I discovered SQLMesh, an all-in-one data pipeline tool. SQLMesh lets you define models like dbt but spares you the burden of the Jinja ref/source macros. Under the hood it uses sqlglot, the SQL parser developed by the same developer. It seems there is also a scheduler and a web UI included in the open-source version.
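The idea behind dropping ref macros is that dependencies between models can be read from the SQL itself. The sketch below illustrates that idea with the Python standard library only. It is not SQLMesh's implementation and it does not use sqlglot: a naive regex stands in for a real SQL parser, and the model names are hypothetical. A real parser is needed for anything beyond simple FROM/JOIN clauses (CTEs, subqueries, quoted identifiers).

```python
import re
from graphlib import TopologicalSorter

# Naive stand-in for a SQL parser: grab the identifier after FROM or JOIN.
TABLE_REF = re.compile(r"\b(?:FROM|JOIN)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)


def infer_dependencies(models):
    """Map each model name to the set of other models its SQL reads from."""
    deps = {}
    for name, sql in models.items():
        referenced = set(TABLE_REF.findall(sql))
        # Keep only references to known models; raw source tables are ignored.
        deps[name] = {t for t in referenced if t in models and t != name}
    return deps


def execution_order(models):
    """Return model names so that every model comes after its dependencies."""
    return list(TopologicalSorter(infer_dependencies(models)).static_order())


# Hypothetical models written as plain SQL, without any ref() macro.
models = {
    "stg_orders": "SELECT * FROM raw.orders",
    "stg_users": "SELECT * FROM raw.users",
    "user_orders": (
        "SELECT u.user_id, COUNT(*) AS n "
        "FROM stg_users u JOIN stg_orders o ON o.user_id = u.user_id "
        "GROUP BY u.user_id"
    ),
}

deps = infer_dependencies(models)
order = execution_order(models)
print(order)
```

Here user_orders always comes last because it reads from both staging models, while the relative order of the two staging models is not significant.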

Gen AI 🤖

Rare footage of a foundation model (credits)

Fast News ⚡️

PS: Data Council took place in Austin a few days ago. As soon as the videos are out on YouTube I'll do a wrap-up of the sessions. Data Council is usually the moment of the year when the cream of the US data scene gathers to discuss.

Data Economy 💰


See you next week ❤️.