Data News — Week 23.17
Data News #23.17 — what happened to the Semantic Layer, an OpenAI demo that feels like the 2007 iPhone keynote, and the fast news.
Hey you, a new edition of the newsletter. This week summer arrived in Berlin and it was awesome. I also managed to move forward with my client projects, which feels relieving. So I'm pretty happy: sun and great projects 🙂.
Regarding the content, if you are in Paris on May 9th, we are organising the Paris Airflow Meetup in Algolia's offices. It will be in English, so you have no excuse not to come. I'll also be in Paris a lot in May, so if you want to have a 🍜 / 🍺, ping me.
What happened to the Semantic Layer?
This week dbt Labs disclosed their vision for the semantic layer, and especially what they want to do with the Transform acquisition. It's mainly a roadmap for the MetricFlow integration within the dbt ecosystem. At the moment we have the dbt Semantic Layer, which corresponds to YAML definitions, and MetricFlow—Transform's open-source project—which is able to understand those semantics to generate SQL.
A lot of changes will happen to MetricFlow incl. breaking changes:
- The dbt metrics spec will change—in its current state not a lot of people were using it anyway—and the dbt_metrics package will be deprecated; the dbt and MetricFlow syntaxes for defining semantics and metrics will probably be merged.
- "The core MetricFlow package will become a stand-alone library for processing metric queries, generating a query plan, and rendering SQL against a target dialect." (cf. the GitHub discussion)
- The license will change to BSL (Business Source License).
- The serving part of the system, aka the metric store, will be dbt Labs' paid service and part of the dbt Cloud offering. It means you will define metrics and dimensions in YAML and then plug all your tools into dbt Cloud; there doesn't seem to be any open-source solution for the serving—at least from dbt Labs' side. And with the license change on MetricFlow, dbt Labs is protecting itself against someone using MetricFlow's SQL generation to offer such a paid service.
- More details are described in the GitHub discussion.
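To make the "YAML definitions in, SQL out" idea concrete, here is a toy sketch of what a semantic layer engine does. All names and the structure are hypothetical for illustration; this is not MetricFlow's actual spec or API.

```python
# Toy semantic layer: turn a metric definition plus a list of requested
# dimensions into a SQL query against the metric's source model.
from dataclasses import dataclass

@dataclass
class Metric:
    name: str   # metric name exposed to consumers
    expr: str   # aggregation expression over a column
    table: str  # source model the metric is measured on

def compile_metric_query(metric: Metric, dimensions: list[str]) -> str:
    """Render a metric request into SQL for one target dialect."""
    select_cols = dimensions + [f"{metric.expr} AS {metric.name}"]
    sql = f"SELECT {', '.join(select_cols)} FROM {metric.table}"
    if dimensions:
        sql += f" GROUP BY {', '.join(dimensions)}"
    return sql

revenue = Metric(name="revenue", expr="SUM(amount)", table="fct_orders")
print(compile_metric_query(revenue, ["order_date", "country"]))
# SELECT order_date, country, SUM(amount) AS revenue FROM fct_orders GROUP BY order_date, country
```

The point of the paid serving layer is exactly this compilation step, done centrally with caching and access control, so every BI tool plugged into it gets the same SQL for the same metric.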
To add more spice, Carlin wrote about what happened to the Semantic Layer. Carlin works at Google on the Malloy team (Google's semantic layer, to put it quickly—tbh it's probably more than that) and he gives his views along with a small retrospective on semantic layers.
Gen AI 🤖
- DoorDash identifies Five big areas for using Generative AI — DoorDash is a food delivery platform, and they shared how they imagine Generative AI could help them in the future: either by assisting humans, whether customers (cart building, etc.) or employees (SQL writing, document drafting), or by improving existing AI features: search, discovery, information extraction.
- When it comes to SQL writing the field is on fire: a lot of companies are trying to raise from the dead the Slack chatbots that answer questions with insights. I think of Shape (YCombinator, out of stealth this week), Delphi Labs, and Promptimize. Promptimize is a toolkit to evaluate and test prompts—for instance, you can "unit test" your natural language to SQL prompts with it. It was open-sourced by Maxime Beauchemin (Airflow and Superset creator).
- Bard now helps you code — Google is finally going the Copilot way and proposes an alternative with Bard. Bard can now help you write code or Google Sheets functions, and it can do more by explaining or debugging code for you.
- 📺 The Inside Story of ChatGPT’s astonishing potential — A TED talk from OpenAI's president and co-founder sharing his vision, the potential, and the limits of the technology. In the video you can feel Steve Jobs's 2007 iPhone keynote vibes. It also greatly showcases ChatGPT plugins. I highly recommend watching it.
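In the spirit of what Promptimize enables, here is a minimal sketch of "unit testing" a natural-language-to-SQL prompt. This is not Promptimize's actual API; `fake_llm` is a stand-in for a real model call, and the canned answers are made up.

```python
# Stub model: a real implementation would call an LLM API here.
def fake_llm(prompt: str) -> str:
    canned = {
        "How many users signed up yesterday?":
            "SELECT COUNT(*) FROM users WHERE signup_date = CURRENT_DATE - 1",
    }
    return canned.get(prompt, "SELECT 1")

def eval_prompt(question: str, must_contain: list[str]) -> bool:
    """Pass if the generated SQL mentions every expected fragment."""
    sql = fake_llm(question).upper()
    return all(fragment.upper() in sql for fragment in must_contain)

# A prompt "test case": the question plus fragments the SQL must contain.
assert eval_prompt("How many users signed up yesterday?", ["COUNT(*)", "users"])
```

The interesting design choice is treating prompts like code under test: you keep a suite of question/expectation pairs and re-run it whenever you change the prompt template or the model, so regressions show up before users see them.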
Last but not least, a more "traditional" AI category:
- End-to-end ML modeling in BigQuery — Over the last few years BigQuery has added a lot of ML capabilities to the engine. This post showcases many of them (it uses an XGBoost model).
- Building a large scale unsupervised model anomaly detection system (part 2).
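For a flavour of what "ML in the engine" looks like, this is the general shape of a BigQuery ML boosted tree workflow, kept here as SQL strings. The dataset, table, and column names are made up for illustration.

```python
# Train a gradient-boosted tree classifier entirely in SQL: BigQuery ML
# accepts a CREATE MODEL statement whose SELECT provides the training data.
TRAIN_MODEL = """
CREATE OR REPLACE MODEL demo.churn_model
OPTIONS (
  model_type = 'BOOSTED_TREE_CLASSIFIER',
  input_label_cols = ['churned']
) AS
SELECT plan, country, monthly_spend, churned
FROM demo.customers
"""

# Prediction then also stays inside BigQuery, no model export needed.
PREDICT = """
SELECT * FROM ML.PREDICT(MODEL demo.churn_model, TABLE demo.new_customers)
"""
```

The appeal is operational: training data, model, and predictions all live in the warehouse, so there is no feature export pipeline to maintain.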
Fast News ⚡️
- From PostgreSQL to Snowflake: A data migration story — The migration lasted 9 months and included 8 steps. They went on this journey because in 2021 Postgres was already hitting read performance limits, degrading the downstream user experience in the BI tools. As Katia shares in the article, a 9-month migration is a long tunnel where you encounter a lot of roadblocks and frustration, but in the end everyone feels the difference: a 10x performance gain—at least—on dashboard execution time.
- Building dbt CI/CD at scale — Every week a new great article about someone else's dbt setup where you discover things. This time Damian shares how he designed checkout.com's CI/CD pipelines—on GitHub. In a nutshell: they fetch the current production manifest, run a SQL linter, validate model changes (by detecting which models were altered and running them), and deploy to Airflow.
- Making the Most of Airflow — I already shared Matt's article last week, and this week he continues with an awesome article about Airflow. In it he gives a great overview of Airflow's main concepts: DAGs and the TaskFlow API (I've also written something about dynamic DAGs last year), DRY and how to avoid redeveloping stuff, and how to test.
- Building a Kimball dimensional model with dbt — Jonathan from Canva wrote a large article about dimensional modeling and how to do it with dbt. It's a 7-part tutorial that shows you how to create fact and dimension tables.
- Data engineering design principles you should follow — It mainly covers software engineering principles like SOLID. Idempotency and determinism are missing from the article; if you want to go deeper on the topic you can read the most important article on it: functional data engineering.
- Real-time denormalized data streaming platform part 1 and part 2 — The Razorpay data team describes how and why they needed to move their ETL process from daily to near real-time. Technologically, that meant moving from Airflow batches to Spark running on top of Kafka.
- Toward declarative data orchestration with Kestra — A few weeks ago, in the Airflow alternatives meetup we organised, we invited Kestra, a YAML-based orchestrator written on the JVM. Recently Benoit joined Kestra as their PO, and in this article he shares his vision. It's mainly a question of vocabulary and reach: Kestra believes that with their declarative YAML syntax they can bring data pipelines to the masses. YAML is simple enough for your analysts (they already do dbt) or business users to write their own pipelines.
- Manage database schemas with Terraform in plain SQL — Atlas is an open-source schema management tool. The post showcases the atlas provider in Terraform that allows you to write SQL to manage your database in Terraform. I can't wait to see dbt reimplemented in Terraform.
- Automatically detecting breaking changes in SQL queries — When you alter a SQL query you can make either a breaking or a non-breaking change. What if, with SQLGlot, you could detect a breaking change before it happens in production?
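The "detect which models were altered" step from the dbt CI/CD article above can be sketched like this: compare the production manifest with the branch manifest and keep the models whose checksum changed. The manifest structure here is heavily simplified, not the full dbt manifest.json schema.

```python
# Toy slim-CI step: diff two dbt-style manifests to find modified or new
# models, which are the only ones the CI job then needs to build.
def modified_models(prod_manifest: dict, branch_manifest: dict) -> list[str]:
    changed = []
    for name, node in branch_manifest["nodes"].items():
        prod_node = prod_manifest["nodes"].get(name)
        if prod_node is None or prod_node["checksum"] != node["checksum"]:
            changed.append(name)
    return sorted(changed)

prod = {"nodes": {"model.orders": {"checksum": "aaa"},
                  "model.users": {"checksum": "bbb"}}}
branch = {"nodes": {"model.orders": {"checksum": "aaa"},
                    "model.users": {"checksum": "ccc"},    # edited model
                    "model.payments": {"checksum": "ddd"}}}  # new model
print(modified_models(prod, branch))  # ['model.payments', 'model.users']
```

dbt ships this natively as state comparison: `dbt build --select state:modified+ --state path/to/prod-artifacts` runs only the changed models and everything downstream of them.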
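To illustrate the breaking-change idea from the last link: the linked approach relies on SQLGlot's real parser, but a naive string split is enough to show the rule. Dropping or renaming a projected column breaks downstream consumers; only adding columns is safe. The SQL snippets are made up.

```python
# Toy breaking-change detector: a change is breaking if the old query's
# projected columns are not all still present in the new query.
def projected_columns(sql: str) -> list[str]:
    # Naive stand-in for real SQL parsing (works only for flat SELECTs).
    select_list = sql.upper().split("SELECT")[1].split("FROM")[0]
    return [col.strip() for col in select_list.split(",")]

def is_breaking(old_sql: str, new_sql: str) -> bool:
    old, new = projected_columns(old_sql), projected_columns(new_sql)
    return not set(old).issubset(set(new))

assert not is_breaking("SELECT id, amount FROM orders",
                       "SELECT id, amount, country FROM orders")  # additive: safe
assert is_breaking("SELECT id, amount FROM orders",
                   "SELECT id, total FROM orders")  # renamed column: breaking
```

Run in CI against the production version of each query, a check like this turns "we broke the dashboard" into a failed pull request instead.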
See you next week ❤️
Join the newsletter to receive the latest updates in your inbox.