Data News — Week 23.25

Data News #23.25 — Yes I was late. A bit of Gen AI and the usual Fast News + Acryl Data fundraising.

Christophe Blefari

Hey, this is the Data News. Habits are hard to change, and that's how it is: the newsletter is going out on Saturday. I hope this edition finds you well. Summer is coming ☀️.

Thank you all, because we crossed the 3000-subscriber mark last week. Let's go for 4000 before the end of the year 🤗.

This is an almost-raw edition for this week.

Gen AI 🤖

  • MPT-30B-Chat — This is a chat interface hosted on HuggingFace on top of the MPT-30B model. The MPT models are interesting because they are released under the Apache License, which means truly open source, unlike others.
  • Continuing on the licence topic, you can watch this great video, laptop-sized ML for text, with Open Source, where Nick Burch explores what you can do today on a laptop and gives a great introduction to the Gen AI field.
  • New approaches for detecting AI-Generated profile photos — This is the era we're going to live in. We'll be writing models moderating generative models. Am I the only one who thinks this is a waste of energy?
  • Crypto collapse? Get in loser, we’re pivoting to AI — It's a rant that begins with the fact that many opportunists are getting into AI now that VCs have left crypto. ChatGPT "is a stupendously scaled-up autocomplete", which leads to questions about intelligence in AI. I really like the conclusion: "The real threat of AI is the bozos promoting AI doom who want to use it as an excuse to ignore real-world problems — like the risk of climate change to humanity (...) The VCs’ actual use case for AI is treating workers badly".
smiling man standing near green trees
Too perfect to be a real picture (credits)

Fast News ⚡️

  • MotherDuck announcing DuckDB in the cloud — First, context. DuckDB is an in-process analytics database, so it runs on a single machine. DuckDB has been open-sourced by DuckDB Labs. Then comes MotherDuck, a commercial company, in partnership with DuckDB Labs, aiming to build a modern serverless cloud analytics platform based on DuckDB. That's the context.

    So this week MotherDuck finally announced their cloud offering. It's invite-only for the moment, and I have not received my invite yet. In a nutshell, the announcement is: you can connect to a remote DuckDB by using md: in the connection string, and you can join local and remote data (also seen on Twitter).
  • Iceberg in the clouds — Last week BigQuery announced Iceberg support in GA. At the same time James from Snowflake wrote a blog post helping you choose between the Snowflake and Iceberg table formats. In short, he says: pick Iceberg if you know what you're doing.
  • An introductory video about Iceberg — If you want a great Iceberg introduction, go watch Fokko's talk from Berlin Buzzwords.
  • Understanding dbt runtime environment — Leo takes the time to explain what the messages from the dbt CLI are telling you.
  • Replacing Apache Hive, Elasticsearch and PostgreSQL with Apache Doris — This is technology bingo. You can replace 3 technologies with only one! This post details the choices behind a migration to Apache Doris. Doris is a real-time analytical database.
  • How data engineers drive data culture and empower users — This article reminds all data engineers that you're part of the team that brings data culture to a company, so you need to play your part.
  • How to become a valuable data engineer — A post that aggregates great resources and advice on becoming a data engineer. I also have a similar one on the blog: how to learn data engineering.
  • Dealing with missing weight data — Carbonfact tries to measure the environmental footprint of clothing. This is not an easy task, and it requires working with missing data.
  • Conceptual vs logical vs physical data models — The author mentions that there are 3 ways to model data with different layers of understanding. In the end he says that you should model your data in the 3 layers: conceptual, logical and physical.
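
To make the three layers from the last item a bit more concrete, here is a minimal sketch in Python. It is my own toy illustration, not something taken from the linked article: the Customer and Order entities, their fields, and the SQLite schema are all hypothetical. The idea is only to show the same model at the conceptual level (entities and relations), the logical level (attributes and keys) and the physical level (engine-specific DDL).

```python
import sqlite3
from dataclasses import dataclass

# Conceptual layer: business entities and how they relate, no attributes yet.
# (Hypothetical entities, chosen only for illustration.)
CONCEPTUAL = {"Customer": ["places Order"], "Order": []}


# Logical layer: attributes, keys and relationships, independent of any engine.
@dataclass(frozen=True)
class Customer:
    customer_id: int
    email: str


@dataclass(frozen=True)
class Order:
    order_id: int
    customer_id: int  # reference to Customer.customer_id
    amount_cents: int


# Physical layer: engine-specific DDL (SQLite here) with types, constraints, indexes.
PHYSICAL_DDL = """
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    email TEXT NOT NULL UNIQUE
);
CREATE TABLE customer_order (
    order_id INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
    amount_cents INTEGER NOT NULL
);
CREATE INDEX idx_order_customer ON customer_order(customer_id);
"""


def build_database() -> sqlite3.Connection:
    """Create an in-memory SQLite database from the physical model."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(PHYSICAL_DDL)
    return conn


def total_per_customer(conn, customers, orders):
    """Load logical-layer objects into the physical tables and sum orders per email."""
    conn.executemany(
        "INSERT INTO customer VALUES (?, ?)",
        [(c.customer_id, c.email) for c in customers],
    )
    conn.executemany(
        "INSERT INTO customer_order VALUES (?, ?, ?)",
        [(o.order_id, o.customer_id, o.amount_cents) for o in orders],
    )
    rows = conn.execute(
        "SELECT c.email, SUM(o.amount_cents) "
        "FROM customer c JOIN customer_order o USING (customer_id) "
        "GROUP BY c.email ORDER BY c.email"
    ).fetchall()
    return dict(rows)
```

The point of keeping the layers separate is that the conceptual and logical parts would stay the same if you swapped SQLite for another engine; only the DDL would change.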

Data Economy 💰

  • Acryl Data raises $21m Series A. Acryl Data is the company behind DataHub, the data catalog that has been open-sourced out of LinkedIn.

See you next week ❤️
