Data News — Week 23.24

Data News #23.24 — AI Act, testing in dbt, data journey manifesto, SO survey, CDC with Clickhouse and fundraising.

Christophe Blefari
close up photography of round green fruit
The newsletter, a metaphor (credits)

Hello, after the good weather comes the storm. I'm now under the Berlin rain with 20°. When I write in these conditions I feel like a tortured author writing a depressing novel while actually today I'll speak about the AI Act, Python, SQL and data platforms. Casual day at the office finally.

Some personal news: next Monday and Tuesday I'll be at Berlin Buzzwords. If you're there, ping me, it would be a pleasure to meet and hang out together.

There are still seats for the June Airflow Paris Meetup (in French).

AI 🤖

Data and Analytics Engineering 🧑‍🔧

  • Testing frameworks in dbt — Robbert developed a small framework to run tests in dbt. He mainly unit tests macros (the logic) with his framework and tests data with Soda and dbt contracts.
  • The data journey manifesto — DataKitchen wrote a manifesto that sets out principles for the data journey to avoid a mess in production. There are 11 principles and 11 new ideas to create a healthy platform. For instance, you should not trust your data providers, and what worked last week will not work today.
  • Why data consumers do not trust your reporting — It is a good illustration of the data journey manifesto. Stakeholders often notice data issues before the data team does. This destroys any confidence they may have in the numbers. Data warehouses are mutable, this is one of the many root causes proposed by Lucas. The past often changes, whether because of code or data. This is metrics drift.
  • Data Documentation 101: Why? How? For Whom? — Marie wrote best practices for establishing complete and reliable data documentation. The first piece of advice is about the documentation's readers: the data team, business users or other stakeholders.
  • Change Data Capture (CDC) with PostgreSQL and ClickHouse — This is a nice vendor post about CDC with Kafka as the movement layer (using Debezium). The post explains the architecture you need to make it work well.
  • A deep dive into graph analytics — Petrica tries and showcases Memgraph in a long-form post. I'm fond of graph visualisations and analytics, as well as maps.
  • Experimenting at Scale, the Spotify Home way — Simple principles to run a good ol' experiment at Spotify scale.
  • The ultimate SQL guide — After the last canva on data interviews, here's a canva to learn SQL. From databases introduction to SQL writing. It covers simple SELECT and advanced concepts. This is neat.
  • The power of pre-commit and SQLFluff — SQL is a query language used to retrieve information from data stores, and like any programming language, you need to enforce checks at all times. This is where you should use pre-commit and SQLFluff.
  • Metis: building Airbnb’s next generation data management platform — The new manifesto for every data governance company /S.
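
To make the metrics drift idea from the reporting article concrete, here is a minimal sketch, not Lucas's method: it assumes you store a snapshot of a metric each time it is computed and compare the latest snapshot with the previous one. The function name `detect_metric_drift`, the `tolerance` parameter and the revenue figures are all hypothetical, made up for illustration.

```python
from datetime import date


def detect_metric_drift(previous, current, tolerance=0.0):
    """Return the periods whose already-reported value changed between two snapshots.

    Both arguments map a period to the metric value computed for it.
    Periods that only exist in `current` are new data, not drift.
    """
    drifted = {}
    for period, old_value in previous.items():
        new_value = current.get(period)
        if new_value is None:
            # The period was reported before and has now disappeared.
            drifted[period] = (old_value, None)
        elif abs(new_value - old_value) > tolerance:
            # The past changed: same period, different number.
            drifted[period] = (old_value, new_value)
    return drifted


# Hypothetical daily revenue, as computed on two consecutive runs.
snapshot_monday = {date(2023, 6, 1): 1000.0, date(2023, 6, 2): 1200.0}
snapshot_tuesday = {
    date(2023, 6, 1): 1000.0,
    date(2023, 6, 2): 1150.0,  # restated after the fact
    date(2023, 6, 3): 900.0,   # new day, not drift
}

drift = detect_metric_drift(snapshot_monday, snapshot_tuesday)
for period, (old, new) in drift.items():
    print(f"{period}: reported {old}, now {new}")
```

Running a check like this after each refresh lets the data team see a restated number before a stakeholder does. The snapshots could just as well live in a warehouse table; plain dictionaries keep the sketch self-contained.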

PS: I just split the Fast News to have a smaller one. Fast News contains lighter news and broad articles.

a man with glasses is looking at a laptop
When the stakeholder notices issues before you (credits)

Fast News ⚡️

  • Stack Overflow developer survey 2023 — Every year SO sends a survey to developers and it gives a great overview of technology usage across the space. This year ~90k people answered, and they also added a small AI category to measure the impact on dev work.

    What we see related to data engineering is mainly: Python and SQL are still shining at the top of technology popularity, with around 50% using them. Thanks to the AI hype Python is the second most desired technology behind JavaScript, which augurs well for the future. They also share salary figures, and data engineering and data science roles are well placed in the ecosystem: the best-paid jobs in Germany after management positions, but relatively less well paid in the US.
  • Generating income from open source — Vadim shares how he makes money from all his different open-source projects, what works and what does not. In the post he also shares the journey of the Sidekiq founder, who's making $10m ARR alone.
  • You can put spaces in BigQuery column names — The editors (me) have no comment. In fact, yes, you are all crazy?
  • Malloy's Near Term Roadmap — I recently shared a Malloy demo, which was awesome. The article shares the recent features and also says something I will never forget: "Malloy aims to be syntactically the same no matter what database contains the data".
  • The Astro Cloud IDE — Astronomer released a bunch of Airflow operators to their Cloud IDE (which was released in Dec. but I missed it). I get why companies want us to work in their Cloud IDE, but I hate this trend. Leave me alone in my PyCharm.
  • Cube announcements: Data Graph and Orchestration API — These are 2 announcements from Cube. I really like following them because they are thought leaders in the semantic layer space. Data Graph creates an entity diagram from the semantic definition, while the Orchestration API offers you an endpoint to launch pre-aggregation jobs from your scheduler.
gray concrete building under white clouds during daytime
We don't need spaces (credits)

Data Economy 🤖

  • Graphext raises $4.6m in a seed round (their second) to continue developing a data analysis platform built for exploration. The Spanish startup develops a tool where you quickly explore datasets and then build charts or AI models on top of them. Last year they built a graph with Data News links, in which we clearly see the different content categories I share.
  • Telmai raises $5.5m seed round. A new data observability platform enters the space. It looks like they propose the same features as the competition: add your data sources, get automated alerts on data drifts.
  • At the same time Masthead raises $1.3m, also as a data observability platform, but done differently. Masthead does not run SQL on your data, which would increase costs, but reads logs and metadata to identify anomalies.
  • Informatica acquires Privitar. This consolidation will bring new features to Informatica. As a reminder, Informatica was founded in 1993 and is one of the dinosaurs in the ETL space. Privitar will bring "data security" stuff.

See you next week ❤️


Christophe Blefari

Staff Data Engineer. I like 🚲, 🪴 and 🎮. I can do everything with data, just ask.

