Data News — Week 23.14
Data News #23.14 — Data modeling guide, entity-centric modeling, SQLMesh, GenAI: Italy bans, Samsung leak, Vicuna open-source model, reducing the lottery factor.
Hey you, if I weren't late writing my newsletter it wouldn't be me. But here is your usual Data News. The main reason for this delay is that I played with LLMs yesterday. I tried to run open-source models locally on my own laptop. There are still a few bugs and the results are not really at OpenAI's level, but it is fun to do.
This Tuesday we hosted the second part of the Airflow alternatives meetup with Prefect and Dagster. You can find the replay on YouTube.
Dear readers, I have to confess something. I did not care about data modeling for years. I mean, in the sense everyone understands it today: in 7 professional years I never built a star schema or anything similar. I was in the Hadoop world and all I was doing was denormalisation. Denormalisation everywhere. The only normalisation I did was back in engineering school, while learning SQL with Normal Forms.
What I actually cared about was physical storage, data formats, logical partitioning and indexing.
But that is normal: my role was not to translate the business into tables. I still firmly believe that this is not the role of a data engineer. A data engineer should still be a software engineer working with data, empowering others with tooling and apps. Data modeling should not be a required data engineer skill. Enter the analytics engineer.
Still, I feel that there is a hole in my skillset because I can't give relevant advice when it comes to modeling a business with 3 fact tables instead of 5. And to be honest, there isn't any good modern literature to answer this question. Simon started a multipart guide about data modeling. I hope he will fill the gaps. In the first part he covers the history of modeling and the main concepts.
At the same time, Maxime Beauchemin wrote a post about Entity-Centric data modeling. Compared to dimensional modeling, it is built around entities instead of facts, which is conceptually easier to understand and also easier to use in machine learning.
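To make the difference concrete, here is a minimal sketch in Python, assuming a hypothetical list of order events. Instead of querying a table at the event grain, the entity-centric version keeps one row per customer with metrics attached as columns. The field names (`customer_id`, `amount`, `orders_count`, `total_spent`) are made up for illustration and do not come from Maxime's post.

```python
from collections import defaultdict

# Hypothetical fact-style records: one row per order event.
orders = [
    {"customer_id": "c1", "amount": 20.0},
    {"customer_id": "c2", "amount": 5.0},
    {"customer_id": "c1", "amount": 30.0},
]


def build_customer_entities(order_rows):
    """Roll order events up to one row per customer (the entity)."""
    entities = defaultdict(lambda: {"orders_count": 0, "total_spent": 0.0})
    for row in order_rows:
        entity = entities[row["customer_id"]]
        entity["orders_count"] += 1
        entity["total_spent"] += row["amount"]
    return dict(entities)


# One row per customer, with metrics stored as plain columns.
customers = build_customer_entities(orders)
```

Each customer row can then be read directly as a set of features, which is probably why this shape is convenient for machine learning.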
When it comes to modeling it's hard not to mention dbt. In recent years dbt simplified and revolutionised the tooling to create data models. dbt, as of today, is the leading framework. But alternatives are coming. This week I discovered SQLMesh, an all-in-one data pipeline tool. SQLMesh lets you define models like dbt but spares you the burden of the Jinja ref/source macros. Under the hood it uses sqlglot, a SQL parser developed by the same developer. It seems there is also a scheduler and a web UI included in the open-source version.
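The idea of dropping ref macros can be sketched as follows: if a tool can read the SQL itself, it can work out which models depend on which. This is not SQLMesh's implementation, which relies on a real parser (sqlglot). The sketch below swaps in a naive regular expression and the standard library `graphlib`, and the model names are hypothetical.

```python
import re
from graphlib import TopologicalSorter

# Hypothetical models: name -> SQL query, written as plain SQL without ref().
models = {
    "stg_orders": "SELECT * FROM raw.orders",
    "stg_customers": "SELECT * FROM raw.customers",
    "customer_orders": (
        "SELECT c.id, COUNT(*) AS orders_count "
        "FROM stg_customers AS c JOIN stg_orders AS o ON o.customer_id = c.id "
        "GROUP BY c.id"
    ),
}

# Naive pattern: the identifier that follows FROM or JOIN.
# A real parser is needed for subqueries, CTEs, quoting and comments.
TABLE_PATTERN = re.compile(r"\b(?:FROM|JOIN)\s+([\w.]+)", re.IGNORECASE)


def find_dependencies(sql, known_models):
    """Return the known models that the query reads from."""
    return {name for name in TABLE_PATTERN.findall(sql) if name in known_models}


# Map each model to the models it depends on, then order them for execution.
graph = {name: find_dependencies(sql, models) for name, sql in models.items()}
run_order = list(TopologicalSorter(graph).static_order())
```

Here `run_order` lists the staging models before `customer_orders`, without any macro telling the tool about the dependency.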
Gen AI 🤖
- It seems that Samsung employees leaked data to ChatGPT — Unsurprisingly, OpenAI saves all the prompts we type and can eventually use them to improve its models incrementally. It seems that Samsung employees gave confidential information to ChatGPT, which means that OpenAI owns Samsung data. But is it really different from what we already have with Gmail or AWS? Or from Tesla employees watching customers' in-car footage for years?
- Italy decided to ban ChatGPT — To do so, the Italian Data Protection Watchdog ordered OpenAI to temporarily cease processing Italian users' data. France and Germany might follow.
- OpenAI: Our approach to AI safety — Four areas in which OpenAI wants to invest: improving safeguards, protecting children, respecting privacy and improving factual accuracy.
- Eight things to know about Large Language Models — A PDF that will give me a headache.
- On the practical side, I tried to run an LLM locally on my M1 Mac for the first time and it was a fun ride. In a nutshell, I first wanted to run Vicuna, an open-source chatbot that gets great results when compared to GPT-3.5. In order to run Vicuna (or other similar open-source models) you need to get the weights of LLaMA, Meta's foundation model (up to 65B params). You can get the model either by filling in a Google form and waiting, or through other channels that remind me of the early days of the internet 🧲. Apart from the fact that inference was super slow, while using dozens of GB of RAM, the results were not as good as ChatGPT but still great. If you find it interesting, tell me and I'll write a post about what I launched and how.
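A rough back-of-the-envelope calculation shows where the dozens of GB come from. The sketch below only counts the memory needed to hold the weights, assuming 2 bytes per parameter in 16-bit precision and about half a byte in 4-bit quantisation. It ignores activations and any runtime overhead, so treat the numbers as a lower bound rather than a measurement of what I ran.

```python
def weights_memory_gb(params_billions, bytes_per_param):
    """Memory to hold the weights only, in GB (1 GB = 1e9 bytes)."""
    # (params_billions * 1e9 params) * bytes_per_param / 1e9 bytes per GB
    return params_billions * bytes_per_param


# Nominal LLaMA sizes, in billions of parameters.
for size in (7, 13, 65):
    fp16 = weights_memory_gb(size, 2)    # 16-bit floats
    q4 = weights_memory_gb(size, 0.5)    # 4-bit quantisation
    print(f"{size}B params: {fp16} GB in fp16, {q4} GB in 4-bit")
```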
Fast News ⚡️
- Twitter's recommendation algorithm — It started with an Elon tweet. Twitter published their recommendation algorithm on GitHub (here and here) and wrote a blog post explaining how the recommendation works. The machine learning is mainly in Python and uses PyTorch. But the algorithm as a whole contains a lot of features, filters and network algorithms.
- Microsoft data integration new capabilities — A few months ago I entered the Azure world, not without pain. Today, Microsoft announces new low-code capabilities for Power Query in order to do "data preparation" from multiple sources. Disclaimer: I don't use Power Query and I don't plan to ever use it.
- One year as a dataviz journalist — Saturday is a good day to have a look at great data visualisations. Erin celebrates his 1-year anniversary as a viz journalist by highlighting the work he is proud of. I really like the "Farthest distance between World Cup stadiums" or the paths to become CCO.
- Life after orchestrators — Benjamin thinks that orchestrators are legacy systems and that we should all move to the real-time world where everything is simpler. No need to add triggers or to synchronise workflows together. Side note: Ben co-founded Popsink, a real-time ETL company.
- Meta introduces Segment Anything — A new foundation model enters the game. Its name is SAM, and SAM wants to identify which image pixels belong to an object. Will traditional computer vision be the next field to become a has-been with the new AI innovations?
- ❤️ Reducing the lottery factor, for data teams — If you had to read only one article today, you should read this one. The lottery factor, also called the bus factor, is a measure of the risk created when knowledge is held by too few people. In data teams a lot of work has to be done in the early days to avoid knowledge being lost later on. The article gives about 10 pieces of advice to lower the risks. Among them I like the changelog, pair programming, pre-recorded videos and stable credentials.
- The ultimate guide to hire your data team — An awesome canvas to conduct data interviews. This guide will help you before and during the interview. It includes a great list of example questions that you could ask in interviews.
PS: Data Council took place in Austin a few days ago. As soon as the videos are out on YouTube I'll do a wrap-up of the sessions. Data Council is usually the moment of the year when the US data elite gathers to discuss.
Data Economy 💰
- Dozer raises $3m seed round. Dozer is a platform to develop real-time data apps that looks like a real-time ETL platform. With Dozer you can connect to multiple sources, do transformations (SQL, Python or JS) and then expose the output in APIs for frontend consumers (React, Vue or Python). It is configured in YAML. It also looks like Dozer is not really under a proper open-source license. If you want to go deeper: to me, Dozer looks like Materialize or Popsink but with a different vision, offering an API as a serving layer rather than a database.
- Roboto AI raises $4.8m seed round. I hate this as much as I find it interesting. Roboto AI wants to create an AI-powered toolbox for people in robotics. In their demo you can use prompts to search over images or time series.
See you next week ❤️.