
Data News — Week 23.16

Data News #23.16 — Analytics engineering future, a new Airflow meetup, data engineering at Adyen and Meta, dbterra and more.

Christophe Blefari
If this picture had been generated with AI it would have been boring (credits)

Dear readers, I hope you're doing well. We are close to the second anniversary of the newsletter, which is crazy. In retrospect, it means that I've written 900 words on average every week for the last 102 weeks. When you look at the first edition, we have come a long way, lmao.

This week we announced the May Paris Apache Airflow meetup. It will take place at the Algolia offices on the 9th of May. We will have 3 speakers and, for the first time, all the presentations will be held in English. So if you're in Paris or in France, do not hesitate to register.

Analytics engineering future

This week Tristan Handy, dbt Labs CEO, wrote a post about the future of analytics engineering: The next big step forwards for analytics engineering. As an introduction, Tristan recalls the original vision of dbt, which has become mainstream today. A lot of data teams have embraced dbt, or at least the idea of SQL with engineering practices to transform data in cloud data warehouses.

The post is mostly about the future and the vision of the next big thing in analytics engineering: new model capabilities. In dbt Core 1.5 we will be able to define:

  • Contracts — you will be able to define column types and constraints and ask dbt to enforce them. If a model does not respect its contract it will not build. In dbt vocabulary, build means run plus test, seed and snapshot.
  • Access — you will be able to namespace models with groups and visibility. Model visibility will be either private, protected or public. I guess this is a prelude to cross-project dependencies.
  • Versions — you will be able to define versions for models without breaking the downstream consumers. In order to do it you will have multiple SQL files suffixed with the version: _v<version>. To select a specific version you will have to do {{ ref('model_name', version=1) }}.
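
To make the three capabilities above more concrete, here is a minimal sketch of what a model's YAML properties could look like in dbt Core 1.5. The model name dim_customers, its columns, the group customer_success and the owner details are all hypothetical, and the exact keys are my reading of the dbt documentation, so double-check them against the version you run.

```yaml
# models/schema.yml (hypothetical file, all names are illustrative only)
groups:
  - name: customer_success        # hypothetical group used to namespace models
    owner:
      name: Customer Success Team
      email: cs@example.com

models:
  - name: dim_customers           # hypothetical model
    group: customer_success
    access: public                # private | protected | public
    latest_version: 2
    config:
      contract:
        enforced: true            # the model does not build if its output breaks the contract
    columns:
      - name: customer_id
        data_type: int
        constraints:
          - type: not_null
      - name: customer_name
        data_type: string
    versions:
      - v: 1                      # expects a dim_customers_v1.sql file
      - v: 2                      # expects a dim_customers_v2.sql file
```

With something like this in place, a downstream model would pin a version with {{ ref('dim_customers', version=1) }}, while a plain {{ ref('dim_customers') }} should resolve to the version declared in latest_version.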

I think these improvements are really important to bring analytics engineering to the next level. These new capabilities will bring software engineering practices to data asset management. If we add to this the new semantic layer (through dbt Labs' acquisition of Transform), we are going in the right direction.

Gen AI 🤖

  • If you want to understand LLMs there is a note written by an expert office of the French government. You can read it in French or in English. To be honest, this is a high-quality note that you can share with people who want to understand the main AI concepts. It might still be a bit too technical to share with your parents.
  • ChatClimate — This is a chat that has been trained with the last IPCC report (the GIEC for the French audience). It showcases well the search capabilities of ChatGPT-based systems because every answer comes with references to the report chapters.
  • How to train your own Large Language Models — Now that you tried the previous chat, let's say that you want to run your own LLM. The Replit team wrote a great overview of what you have to do.
  • Building a ChatGPT Plugin for Medium.
ChatClimate answer to the most important question.

Fast News ⚡️

  • Building a Flink self-serve platform on Kubernetes at scale — Instacart engineering team migrated from Flink on EMR to Flink on Kubernetes. This article gives you an overview of the Kubernetes platform they implemented.
  • fal-ai/isolate — Yet another package manager in Python. fal developed a new lightweight package manager to isolate environments at the function level. The project README is not very explicit yet.
  • Data Engineering at Adyen — "Data engineers at Adyen are responsible for creating high-quality, scalable, reusable and insightful datasets out of large volumes of raw data". This is a good definition of one of the possible responsibilities of a DE. This is a great article and they even included a flowchart to identify which role suits you best. It is interesting to read this post together with the future of data engineer at Meta, which gives another, very business-oriented perspective.
  • Announcing dbterra: easily sync your jobs with dbt Cloud™️ — Eric developed a tool called dbterra that mixes dbt and Terraform in order to deploy an open-source dbt project to dbt Cloud with configuration as code.
  • Test Driven Development for SQL — A small article that gives you a vanilla BigQuery framework with CTEs to write unit tests. I think it has to be improved but it gives a great boilerplate.
  • Saving 💵 With BigQuery & dbt — A few tips to save money when using dbt and BigQuery. Mainly it says that you should consider switching your models to incremental.

Data Economy 💰

  • Betterdata raises $1.65m seed round. A Singaporean company that provides a tool to generate synthetic data. Synthetic data is AI-generated data. In Betterdata's case you can use your own datasets and generate data that keeps all the statistical properties needed to do machine learning. This way you can work on data that is similar to yours but different. It's a technique to work with anonymised data.
  • CoreDB raises $6.5m seed round. CoreDB is a managed Postgres service that puts the emphasis on extensions in order to add more capabilities to your database cluster. CoreDB has been founded by the ex-CEO and CTO of Astronomer.
  • A lot of companies recently announced layoffs, sadly. The biggest one is Meta, with a new round of 4k people, which brings the total to 21,000 people laid off since last November. Astronomer has also let 100 people go recently. If you rely heavily on Airflow it might be interesting to reach out to these people.
  • Elon Musk, according to reports, founded a new AI company called X.AI Corp.

See you next week ❤️.


