Hey, this is the Data News. It's super hard to change habits, but that's how it is: the newsletter is going out on Saturday. I hope this edition finds you well. Summer is coming ☀️.
Thank you all: we crossed the 3000-subscriber mark last week. Let's go for 4000 before the end of the year 🤗.
This is an almost-raw edition for this week.
Gen AI 🤖
- MPT-30B-Chat — This is a chat interface hosted on HuggingFace on top of the MPT-30B model. The MPT models are interesting because they are under the Apache licence, which can mean true open source, unlike others.
- Continuing on the licence topic, you can watch this great video, laptop-sized ML for text, with Open Source, in which Nick Burch explores what you can do today on a laptop and gives a great introduction to the Gen AI field.
- New approaches for detecting AI-Generated profile photos — This is the era we're going to live in. We'll be writing models to moderate generative models. Am I the only one who thinks this is a waste of energy?
- Crypto collapse? Get in loser, we’re pivoting to AI — It's a rant that begins with the fact that many opportunists are getting into AI now that VCs have left crypto. ChatGPT "is a stupendously scaled-up autocomplete", which leads to questions about intelligence in AI. I really like the conclusion: "The real threat of AI is the bozos promoting AI doom who want to use it as an excuse to ignore real-world problems — like the risk of climate change to humanity (...) The VCs’ actual use case for AI is treating workers badly".
Fast News ⚡️
- MotherDuck announcing DuckDB in the cloud — First, context. DuckDB is an in-memory analytics database, so it runs on a single server. DuckDB has been open-sourced by DuckDB Labs. Then comes MotherDuck, a commercial company, in partnership with DuckDB Labs, aiming to build a modern serverless cloud analytics platform based on DuckDB. That's it for the context.
So this week MotherDuck finally announced their cloud offering. It's invite-only for the moment, and I have not received my invite yet. In a nutshell the announcement is: you can connect to a remote DuckDB by putting "md:" in the connection string, and you can join local and remote data (also seen on Twitter).
- Iceberg in the clouds — Last week BigQuery announced Iceberg support in GA. At the same time James from Snowflake wrote a blog post helping you choose between the Snowflake and Iceberg table formats. Mainly he says: pick Iceberg if you know what you're doing.
- An introductory video about Iceberg — If you want a great Iceberg introduction, go watch Fokko's talk from Berlin Buzzwords.
- Understanding dbt runtime environment — Leo takes the time to explain what the messages from the dbt CLI are telling you.
- Replacing Apache Hive, Elasticsearch and PostgreSQL with Apache Doris — This is a technology bingo. You can replace 3 technologies with only one! This post details the choices behind a migration to Apache Doris. Doris is a real-time analytical database.
- How data engineers drive data culture and empower users — This article reminds all data engineers that you're part of the team that brings data culture to a company, so you need to play your part.
- How to become a valuable data engineer — A post that aggregates great resources and advice on becoming a data engineer. I should also mention that I have a similar one on the blog: how to learn data engineering.
- Dealing with missing weight data — Carbonfact tries to measure the environmental footprint of clothing. This is not an easy task, and it requires working with missing data.
- Conceptual vs logical vs physical data models — The author explains that there are 3 ways to model data, each with a different layer of understanding. In the end he says that you should model your data at all 3 layers: conceptual, logical and physical.
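To make the three modeling layers from the last item a bit more concrete, here is a minimal sketch. Everything in it is my own assumption for illustration and does not come from the linked article: the customer/order domain, the names `CONCEPTUAL`, `Customer`, `Order`, `PHYSICAL_DDL`, `build_database` and `total_per_customer`, and the choice of SQLite as the physical engine.

```python
import sqlite3
from dataclasses import dataclass

# Conceptual layer: which business entities exist and how they relate.
# (Hypothetical domain, chosen only for illustration.)
CONCEPTUAL = {
    "entities": ["Customer", "Order"],
    "relationships": [("Customer", "places", "Order")],
}


# Logical layer: attributes, keys and types, independent of any engine.
@dataclass(frozen=True)
class Customer:
    customer_id: int
    email: str


@dataclass(frozen=True)
class Order:
    order_id: int
    customer_id: int  # refers to Customer.customer_id
    amount_cents: int


# Physical layer: concrete DDL for one engine (SQLite here), including
# storage-level decisions such as constraints and indexes.
PHYSICAL_DDL = """
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    email       TEXT NOT NULL UNIQUE
);
CREATE TABLE customer_order (
    order_id     INTEGER PRIMARY KEY,
    customer_id  INTEGER NOT NULL REFERENCES customer(customer_id),
    amount_cents INTEGER NOT NULL
);
CREATE INDEX idx_order_customer ON customer_order(customer_id);
"""


def build_database() -> sqlite3.Connection:
    """Create an in-memory SQLite database from the physical model."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(PHYSICAL_DDL)
    return conn


def total_per_customer(conn, customers, orders):
    """Load logical-layer objects into the tables and sum orders per email."""
    conn.executemany(
        "INSERT INTO customer VALUES (?, ?)",
        [(c.customer_id, c.email) for c in customers],
    )
    conn.executemany(
        "INSERT INTO customer_order VALUES (?, ?, ?)",
        [(o.order_id, o.customer_id, o.amount_cents) for o in orders],
    )
    rows = conn.execute(
        "SELECT c.email, SUM(o.amount_cents) "
        "FROM customer c JOIN customer_order o "
        "ON o.customer_id = c.customer_id "
        "GROUP BY c.email ORDER BY c.email"
    ).fetchall()
    return dict(rows)
```

The point of the sketch is only that the same two entities show up three times, each time with more implementation detail. A real project would probably keep these layers in diagrams, a modeling tool or dbt models rather than in one Python file.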
Data Economy 💰
- Acryl Data raises $21m Series A. Acryl Data is the company behind DataHub, the data catalog that has been open-sourced out of LinkedIn.
See you next week ❤️