
Data News — Week 22.49

Data News #22.49 — ChatGPT, Paris Airflow Meetup takeaways, GoCardless data contracts implementation, schema drift, Pathway and Husprey fundraise.

Christophe Blefari
This is what we call a Chat in French (credits)

Hello there, this is Christophe, live from the human world. Last week was totally driven by the ChatGPT frenzy: the social networks I follow are flooded with conversation screenshots and hype. I don't know what the future holds for us, but MaaS (Models as a Service) certainly does not look bright to me. OpenAI executed it perfectly: they dedicated a gigantic amount of computing power to offer a neat pay-as-you-query experience, like BigQuery. And I bet it will transform our industry as much as BigQuery did. But do we want big companies holding decision power in their own pre-trained models, leaving real data science to the big players?

I don't want to be alarmist, this is not the tone I use here in the Data News, but do we want a future where the support chat of our home train service or our mobile carrier is run under the hood by a Musk company? OK, it's a caricature, but imagine. I can't wait to see Excel sheets comparing the average cost per word written by a human and by a machine.

🎄 Let's switch topics. It's time for the Advent of Data heads-up. Since last week's edition, 6 new articles have been published in the calendar. Go taste your daily chocolates. In a nutshell, you can now develop an internal pip package for your data team, handle governance, explain to stakeholders what you're doing, send AI models to small devices, understand Rust for data engineering and learn 3 key geospatial metrics.

Paris Airflow Meetup 🧑‍🔧

On Tuesday I organised the 4th Paris Apache Airflow Meetup, the first one since 2019, and it was awesome: I met a lot of people, and the talks and the venue were great. The goal now is to hold one meetup per month in 2023. For this I'm looking for speakers and hosts, so if you live in France and want to share something with the French community, reach out to me. I have a lot of ideas.

After a short introduction, the evening started with a presentation by Clément and Steff from the leboncoin data engineering team. They shared the good practices they implemented to scale their Airflow development. To give a sense of scale, at leboncoin 7 teams use Airflow to operate more than 1000 DAGs. Here is a short takeaway of their presentation in English:

  • Stop using custom Operators or Hooks if a community one is available. This point is particularly relevant if you feel your custom code creates tech debt.
  • Be careful with Airflow Variables: each Variable.get makes a database call, which hurts performance. Their replacement was to use Jinja templating combined with something more traditional in app development: a constants file.
  • Use priority_weight. For this they created an enum with 5 human-readable priority levels.
  • And lastly: give ownership context to DAGs, develop custom macros for repetitive tasks like generate_s3_url, use the pendulum date library to avoid the pain of managing dates, use cluster policies and finally write tests. If you don't know how to write tests, have a look at how Airflow itself is tested and copy that.
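
Two of the points above, a constants file instead of parse-time Variable.get calls and an enum for priority_weight, can be sketched in plain Python. This is only an illustration and not leboncoin's code: the names Priority, DATA_BUCKET, API_URL_TEMPLATE and default_task_args are hypothetical, and the five levels and their values are assumptions. Only priority_weight, owner and the {{ var.value.<name> }} template syntax come from Airflow itself.

```python
from enum import IntEnum

# Constants file: plain values read at import time, no metadata database call.
DATA_BUCKET = "my-data-bucket"  # hypothetical value

# Airflow renders this Jinja template when the task runs, so the Variable is
# fetched at execution time instead of at every DAG parse.
API_URL_TEMPLATE = "{{ var.value.api_url }}"  # "api_url" is a hypothetical Variable name


class Priority(IntEnum):
    """Human-readable levels for priority_weight (names and values are assumptions)."""
    LOWEST = 1
    LOW = 25
    MEDIUM = 50
    HIGH = 75
    CRITICAL = 100


def default_task_args(priority: Priority, owner: str) -> dict:
    """Build keyword arguments that could be passed to an Airflow operator."""
    return {"priority_weight": int(priority), "owner": owner}


args = default_task_args(Priority.HIGH, owner="team-ads")
print(args)
```

In a DAG file the resulting dictionary could be unpacked into any operator, for example BashOperator(task_id="export", bash_command="...", **args), since priority_weight and owner are standard operator arguments.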

Then Charles & Charles from the Qonto data engineering team shared how they integrated dbt within Airflow. After a short introduction of the classic modern data stack combo (Snowflake, dbt, Tableau, Airflow), Charles presented what dbt is and the options for integrating dbt within Airflow.

In a nutshell you have 3 options to do it:

  • You use the DbtCloudRunJobOperator, but it requires dbt Cloud
  • You use a single BashOperator that runs the dbt run command
  • You use multiple BashOperators, each running a dbt run --select model command

Qonto decided to go for the last option. Then the other Charles detailed what it means and how they monitor what is happening. Obviously this approach has a few pros and cons:

  • cons: the Airflow UI does not cope well with too many tasks (especially the graph view); in their setup with a KubernetesExecutor it means a lot of cold starts, because each model run means a new pod with a dbt CLI bootstrap; and you have a lot of dependencies to manage
  • pros: you are very flexible because you can run one model at a time if you want; incident management is simplified because dbt's gaps on this topic are filled by Airflow standards; and monitoring can be done
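
The "one task per model" option can be sketched by reading dbt's manifest.json and deriving one shell command per model plus the dependency edges between them. This is a minimal sketch under assumptions, not Qonto's implementation: the function name dbt_tasks_from_manifest and the sample project "shop" are made up, and the manifest below is trimmed to the few keys the function reads (nodes, resource_type, name, depends_on.nodes).

```python
from typing import Dict, List, Tuple


def dbt_tasks_from_manifest(manifest: Dict) -> Tuple[Dict[str, str], List[Tuple[str, str]]]:
    """Return ({model_id: bash_command}, [(upstream_id, downstream_id), ...])."""
    # Keep only models; the manifest also lists tests, seeds, snapshots, etc.
    models = {
        unique_id: node
        for unique_id, node in manifest["nodes"].items()
        if node["resource_type"] == "model"
    }
    # One dbt invocation per model, suitable for one BashOperator each.
    commands = {
        unique_id: f"dbt run --select {node['name']}"
        for unique_id, node in models.items()
    }
    # Keep only model-to-model dependencies (sources and seeds are not tasks here).
    edges = [
        (upstream, unique_id)
        for unique_id, node in models.items()
        for upstream in node["depends_on"]["nodes"]
        if upstream in models
    ]
    return commands, edges


# Made-up sample with the shape of target/manifest.json.
manifest = {
    "nodes": {
        "model.shop.stg_orders": {
            "resource_type": "model",
            "name": "stg_orders",
            "depends_on": {"nodes": ["source.shop.raw.orders"]},
        },
        "model.shop.orders": {
            "resource_type": "model",
            "name": "orders",
            "depends_on": {"nodes": ["model.shop.stg_orders"]},
        },
    }
}

commands, edges = dbt_tasks_from_manifest(manifest)
for model_id, command in commands.items():
    print(model_id, "->", command)
print(edges)
```

In a DAG file, each command would become the bash_command of its own BashOperator and each edge an upstream >> downstream relation, which is also where the task count and the per-pod cold starts listed in the cons come from.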

In the end they showcased the Metabase dashboard that helps them understand every dbt run. It is very complete, mixing the dbt artifacts with data from Airflow thanks to a clever trick: they use XCom to save metadata in the database so that it can be used in Metabase.
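
As a rough idea of what such metadata could look like, the sketch below flattens dbt's run_results.json artifact into small rows. It is an assumption about how one might do this, not what Qonto showed: the function name summarize_run_results and the sample data are hypothetical, and only the results, unique_id, status and execution_time keys are read from the artifact.

```python
from typing import Dict, List


def summarize_run_results(run_results: Dict) -> List[Dict]:
    """Flatten a dbt run_results.json payload into one small row per node."""
    return [
        {
            "unique_id": result["unique_id"],
            "status": result["status"],
            "execution_time": round(result["execution_time"], 2),
        }
        for result in run_results["results"]
    ]


# Trimmed-down, made-up sample with the shape of target/run_results.json.
sample = {
    "results": [
        {"unique_id": "model.shop.stg_orders", "status": "success", "execution_time": 1.234},
        {"unique_id": "model.shop.orders", "status": "error", "execution_time": 0.5},
    ]
}

rows = summarize_run_results(sample)
failed = [row["unique_id"] for row in rows if row["status"] != "success"]
print(rows)
print(failed)
```

If a function like this is the return value of an Airflow task, Airflow pushes it to XCom by default, which stores it in the metadata database where a BI tool can query it.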

👀 See the slides

PS: shout-out to the people I met there who read the newsletter. Your kind words are important and give me a lot of motivation. See you soon ❤️.

Studious atmosphere while listening to Charles^2 (credits Alaeddine)

Fast News ⚡️

Data Fundraising 💰🇫🇷

  • Pathway raises $4.5m pre-seed round. This is an insane amount of money for a pre-seed. Pathway is a French startup in open beta providing real-time processing. You pip install their package and you can then transform your tables in Python. Transformations are operations like select, index, filter, join or map.
  • Husprey raises $3m seed round. Husprey provides an alternative to the dashboard world for data analyses with advanced SQL notebooks. They already have a large number of connectors and even integrate with dbt. Husprey is also a French-founded company.

See you next next week ❤️.


Christophe Blefari

Staff Data Engineer. I like 🚲, 🪴 and 🎮. I can do everything with data, just ask.
