Data News — Week 23.35

Data News #23.35 — I'm back. Let's digest what happened in August: dbt tests, Gen AI with Meta's new model releases, Python in Excel, new Airflow features, Terraform, etc.

Christophe Blefari
5 min read
Back to school (credits)

Hey, I'm back.

I've taken an unplanned 3-week break since the last Data News and, let's be honest, it was necessary! I spent a few hours working on the fancy data stack project and articles are in the works, but it was idealistic to think I could produce quality code and content while enjoying the summer. Like wine, it takes time to get it right. If you want a first glimpse of the Dagster code, you can look at it on GitHub; it is not yet documented, but the commit messages are clean.

On September 1, I'm still getting used to the school rhythm. A new year starts in September: new friends, new classes and new things, even if things are different now that I'm an adult. Data News is back with the same recipe: a weekly newsletter to let you catch up on the previous weeks' articles. I make the selection myself, choosing things I like while being under the influence of others. But I'm not an influencer. I just create content.

A glimpse into a fancy assets graph.

This week features what happened in August. Even though it was the summer holidays, the data world still got its share of news, features and drama. Enjoy the news recap.

dbt tests 🧪

dbt Core's proposition has been to bring software engineering practices to SQL development. Obviously testing is invited to the party, but tests are hard and everyone does and understands tests differently. There are unit, integration, functional and end-to-end tests.

This summer a lot of people wrote about testing with dbt.

💡
Before you start reading anything else, I recommend the excellent video Testing: Our assertions vs. reality from the last Coalesce, on YouTube.

Generative AI 🤖

I haven't really been keeping up with the news because it moves too fast, but here are a few things that have stood out:

  • Meta releasing models faster than before: expanding DINOv2, a computer vision model (on X), releasing SeamlessM4T, a multilingual multimodal translation model (on X), and releasing Code Llama, an LLM for coding.
  • Snowflake fine-tuning Code Llama for SQL generation — With this fine-tuning it seems they are close to GPT-4 accuracy in text-to-SQL.
  • Llama 2 is about as factually accurate as GPT-4 for summaries and is 30X cheaper
  • A French YouTuber launched a 24/7 Twitch stream with AI deepfakes of French presidents (Macron, De Gaulle, Chirac) answering questions from the Twitch chat, but his channel got banned by a Twitch bot after AI-Macron said something illegal while answering a question about the worst French cities. AI fights: this is the future we want.

Fast News ⚡️

A certain idea of hell.
  • Python in Excel — Microsoft and Anaconda announced that Python is coming to Excel. I have bittersweet feelings about it: on one side I don't think Excel is a good platform for software development, on the other side, let's be honest and face the truth, Excel is the only data platform business users want. Still, the big winner here is Microsoft, because the Python code will run on Azure.
  • After Excel, Notebooks get a second youth — Meta explained how they schedule Jupyter Notebooks in production, Google announced the BigQuery studio with embedded Notebooks in the UI and Jupyter released Jupyter AI (you call it with %ai) to bring Gen AI to the notebook.
  • New features in Airflow — with 2.7 you get a Cluster Activity UI, and with the new airflowctl CLI you can spin up Airflow instances in a wink.
  • Introducing the revamped dbt Semantic Layer — dbt Labs announced the Beta of the Semantic Layer, which will be a paid product in dbt Cloud. I've already written a lot about the semantic layer and more is to come. So let's see where it goes.
  • Introducing SOL: Sequence Operations Language — A new language dedicated to sequence analysis, which can be useful when working with web traffic data.
  • Answering "Why did the KPI change?" using decomposition — If you are an analyst who needs to explain every day why a metric increased or decreased, this article is for you. Max explores metrics decomposition for sum and ratio. This is brilliant.
  • Apache Hudi: From Zero To One (1/10).
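
To give a flavour of the KPI decomposition idea mentioned above, here is a minimal sketch in Python. It is not taken from Max's article: the function names, the segments and the numbers are all hypothetical, and the ratio split into a "rate" and a "mix" effect is one common convention among several.

```python
def decompose_sum(before, after):
    """Per-segment contribution to the change of an additive metric."""
    segments = sorted(set(before) | set(after))
    return {s: after.get(s, 0.0) - before.get(s, 0.0) for s in segments}


def decompose_ratio(before, after):
    """Split the change of a ratio metric into rate and mix effects.

    before/after map segment -> (numerator, denominator). Assumption:
    every segment exists in both periods with a non-zero denominator.
    """
    total_d0 = sum(d for _, d in before.values())
    total_d1 = sum(d for _, d in after.values())
    effects = {}
    for segment in before:
        n0, d0 = before[segment]
        n1, d1 = after[segment]
        r0, r1 = n0 / d0, n1 / d1              # segment-level ratio
        w0, w1 = d0 / total_d0, d1 / total_d1  # segment share of the denominator
        effects[segment] = {
            "rate": (r1 - r0) * w0,  # the segment's own ratio moved
            "mix": (w1 - w0) * r1,   # the segment's weight moved
        }
    return effects


# Hypothetical revenue per country (additive metric)
revenue_effects = decompose_sum(
    {"FR": 100.0, "US": 200.0},
    {"FR": 90.0, "US": 260.0},
)

# Hypothetical conversions: segment -> (orders, visits)
conversion_effects = decompose_ratio(
    {"mobile": (10, 100), "desktop": (30, 100)},
    {"mobile": (30, 300), "desktop": (30, 100)},
)
```

In this made-up example the revenue moves by +50, which splits into -10 for FR and +60 for US. The overall conversion rate drops from 20% to 15% although no segment converts worse: all rate effects are zero and the whole change comes from the mix, because low-converting mobile traffic tripled.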

Drama

  • Instacart's Snowflake bills — When public companies publish their results, the numbers get looked at. This time it was Instacart's bills that were scrutinized: the company said it spent $13m, $28m and $51m on Snowflake in 2020, 2021 and 2022 respectively, and plans to spend $15m in 2023.

    Some people supposed Instacart had found the magic solution to reduce costs, others said it had migrated to Databricks. But the main reason is: prepaid credits. The Snowflake press team even wrote a post.

    Still you can watch the perfectly timed video about How Instacart Optimized Snowflake Costs by 50% or Snowflake optimisation at HelloFresh.
  • HashiCorp changed Terraform's license model — HashiCorp decided to move from the Mozilla Public License to the Business Source License (BSL). BSL is source-available and not really open-source. Following the announcement, OpenTF forked the repo.

Data platform stuff

4 articles that give food for thought about the future of the data field.

Data Economy 💰


Feels good to be back, see you next week ❤️. I hope you enjoyed your summer.

Christophe Blefari

Senior Data Engineer. I like 🚲, 🪴 and 🎮. I can do everything with data, just ask.
