
Data News — Week 23.04

Data News #23.04 — GPT safe place here, dbt, Airflow, Dagster, data modeling and contracts, data creative people, and a lot of news.

Christophe Blefari
5 min read
My view from the train window (credits)

Dear Data News readers, it's a joy to write this newsletter every week, and we are slowly approaching its second birthday. To celebrate together, I'd love to receive your stories about data: short or long, anonymous or not. This is an open box; just write to me with whatever is on your mind and I'll bundle an edition out of it.

This is fun because I'm usually not someone who's good at keeping habits. To be honest, every week Friday hits me. I don't write in advance, so every week you get a taste of my current mood. I often try to schedule my travels on Fridays; even if the internet is terrible on the train, it's still a good way to fill the 8+ hours of travel time I'm used to.

Today I make the following commitment: I will never use any generative algorithm to write this newsletter. Funny story: one year ago I had an intern working with me on the blog, and I gave him the task of writing code that could learn from my writing to generate a Data News edition. One year later, I see things differently. In ChatGPT times, my idea just looks boring.

On the other hand, at the moment I'm not really organised to check whether the articles I share were entirely written by humans, but the commitment is the same: I'll do as much as I can to avoid sharing empty articles, as I've always done. It might be a good use case for GPTZero.

As a data professional, this is probably the height of irony, not wanting to use AI. But right now the field feels like when cryptocurrencies arrived: awesome raw ideas, with sharks circling around waiting for a new productivity high.

PS: last week I made a (bad) joke about Apache naming, and a reader pointed me to an article about the ASF and non-Indigenous appropriation.

This is enough about my life, let's jump to the news.

Back to the roots, a few engineering articles

I did not know how to group these articles, so here are a few loose ones. In my manage and schedule dbt guide, I say in a nutshell that dbt projects have two lifecycles: the first is the development experience and the second is the dbt runtime. It means you have to run dbt somewhere:

  • Jonathan proposed a creative way to do it in Dagster: every dbt model becomes a software-defined asset, which means the whole data chain is reactive and every model is refreshed on a trigger rather than on a cron-based schedule.
  • The Astronomer team developed an awesome library meant to translate a dbt DAG into an Airflow DAG: astronomer-cosmos. You get either a DbtDag object or a DbtTaskGroup, which dynamically creates an Airflow DAG from your dbt project. It looks very promising. Cosmos reads the dbt model files directly and does not use the manifest.
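The reactive idea behind the Dagster approach can be sketched in plain Python. This is a toy illustration, not the Dagster or Cosmos API, and the model names are hypothetical: each model declares its upstreams, and a change to one model triggers a refresh of everything downstream instead of waiting for a cron schedule.

```python
from collections import defaultdict

# Toy dbt-style dependency graph: model -> its upstream models
# (hypothetical model names, for illustration only).
DEPS = {
    "stg_orders": [],
    "stg_customers": [],
    "orders_enriched": ["stg_orders", "stg_customers"],
    "daily_revenue": ["orders_enriched"],
}

# Invert the graph once: upstream model -> its downstream consumers.
DOWNSTREAM = defaultdict(list)
for model, upstreams in DEPS.items():
    for up in upstreams:
        DOWNSTREAM[up].append(model)

def models_to_refresh(changed: str) -> list[str]:
    """Return every model downstream of `changed`, in breadth-first order."""
    queue, seen, order = [changed], {changed}, []
    while queue:
        current = queue.pop(0)
        for down in DOWNSTREAM[current]:
            if down not in seen:
                seen.add(down)
                order.append(down)
                queue.append(down)
    return order

# A change to stg_orders cascades to every consumer downstream of it.
print(models_to_refresh("stg_orders"))  # ['orders_enriched', 'daily_revenue']
```

Dagster's asset graph does something conceptually similar, with the scheduler deciding which materializations are stale rather than a hand-rolled traversal like this one.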

In terms of data modeling, ThoughtSpot wrote about the best data modeling methods, and Chad—the pope of Data Contracts—wrote about data contracts for the warehouse. Mainly, contracts shift responsibility to data producers in order to enforce schemas and semantics, but in the data world this is sometimes closer to a utopia: producers are often software teams that, sadly, do not care about data teams.
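The producer-side enforcement that contracts call for can be illustrated with a minimal schema check. This is a sketch with hypothetical field names; real implementations typically lean on JSON Schema, protobuf, or a schema registry rather than hand-rolled checks.

```python
# Minimal data-contract check: the producer validates events against the
# agreed schema before emitting them (field names are hypothetical).
CONTRACT = {
    "order_id": int,
    "customer_id": int,
    "amount_eur": float,
}

def validate(event: dict) -> list[str]:
    """Return a list of contract violations; an empty list means the event passes."""
    errors = []
    for field, expected_type in CONTRACT.items():
        if field not in event:
            errors.append(f"missing field: {field}")
        elif not isinstance(event[field], expected_type):
            errors.append(
                f"{field}: expected {expected_type.__name__}, "
                f"got {type(event[field]).__name__}"
            )
    return errors

good = {"order_id": 1, "customer_id": 42, "amount_eur": 9.99}
bad = {"order_id": "1", "customer_id": 42}  # wrong type, missing amount_eur
print(validate(good))  # []
print(validate(bad))
```

The point of the contract is exactly where this check runs: in the producer's pipeline, so that a breaking change fails on their side instead of silently corrupting the warehouse downstream.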

Finally, Noah shared how he improved data quality by removing 80% of the tests, and Ronald proposed a framework to create data products in Airflow.

Data people are creatives 🪄

This is a new category that will appear in future Data News editions. Here I'll share things we can do with data; the idea is to inspire others by promoting the end use case rather than just the technology. I'll be more than happy to share what you do.

  • Are Airbnb guests less energy efficient than their hosts? — Max tries to find out whether Airbnb guests' energy consumption is higher than their hosts'. I'm always amazed by straight-to-the-point analyses like this.
  • Automated object detection in CSGO — PandaScore, a French company that generates data from public—and probably private—e-sports streams, showcases how they used OCR to extract data from CSGO live streams. I did something similar last year on Teamfight Tactics.
  • Football data pipeline project — This is more of a technical walk-through for building a Streamlit dashboard on Premier League data. Still, it's interesting.
This is us (credits)

Fast News ⚡️

Data Economy 💰

  • Microsoft finally announced their "multi-billion dollar" investment—probably $10b—in OpenAI. Nothing more to say, you might have guessed my opinion in the introduction.
  • whalesync raises $1.8m pre-seed to create another connector-based data movement SaaS, with bidirectional connectors. The difference from similar products is the ability to also sync to Postgres; usually tools like this only sync between SaaS. They also enable automated web page creation for SEO, which is unrelated to the data movement business.
  • Komprise raises $37m Series D to build yet another all-in-one data platform that does everything with data.

See you next week ❤️.

