Hello, dear Data News reader, I hope you'll enjoy this new edition. It's amazing how quickly time flies and this summer I passed the 3-year mark since I started my freelance adventure. I'm so happy with what it's brought me. But I've got this internal alarm that goes off every 3 years asking me for new things. It's time for me to search for my future paths.
Don't worry, the newsletter and the content stuff I do are something I enjoy, so they will probably stay as an invariant in this quest.
Also, this week I wrote R code for the first time. It's not an experience I'd recommend. I tried using ChatGPT to help me with this task and every answer it gave me was wrong: in 20 attempts, it never produced a correct snippet. On the other hand, I asked it to help me write a TCP proxy in Python and it worked the first time. Probably a training bias.
Going further, I looked at Stack Overflow trends to see whether there is a reason Python is better covered by ChatGPT than R (beyond the obvious one), and Python was 6 to 7 times more popular than R at the time of training. The graph also shows that Python has been losing popularity since 2022, although I don't really know why, and it still stays on top. Only C# gets a massive increase in the TIOBE index.
This week, the videos from the Airflow Summit 2023 have been released and, as always, I'd like to provide you with a list of the talks I found interesting. You can also watch the YouTube playlist and show support to the other speakers.
Airflow Summit 2023 🌬️
For the sake of readability, I've sorted the few talks I've selected into 3 categories: general stuff, Airflow internals and feedback from companies.
General — Get Airflow ideas
- The Summit opened with a panel about the past and the future of Airflow. It was also the time for the panelists to give a huge shoutout to all Airflow contributors. I personally join the shoutout because Airflow has been part of my professional journey for the last 5 years and it helped me grow and achieve so much.
- Then Marc Lamberti gave a huge update about Airflow, but done differently: it wasn't about slides with a list of new features but rather about how you can write, in 2023, a data pipeline with Airflow. It's a presentation that silences critics of Airflow's rigidity and complexity.
- Airflow operators need to die: this is a funny topic. Airflow operators are often criticised because they don't work, so people just use Python or Bash operators to orchestrate their own stuff, which leaves us with useless operator code. So Airflow needs a new vision. This talk from Bolke is probably the beginning of the operators' rebirth. Bolke proposed new storage and dataframe APIs to remove hardcoded operators and decouple sources from destinations.
- Airflow can also be at the center of data mesh discussions, with companies using multiple Airflow instances to give power to many teams. Kiwi.com showcased how they moved from a monolith to several smaller environments, while Delivery Hero explained how they run 500 Airflow instances with a lot of specificities.
- A microservice approach for DAG authoring using datasets: the idea is to apply software engineering patterns such as migration, broadcast and aggregate to pipelines. In addition, you should create micropipelines, which we can define as small, loosely coupled DAGs that each operate on one input Dataset and produce one output Dataset. Each micropipeline then implements a single pattern with a defined input and output.
- Dynamic task mapping to orchestrate dbt: dbt has changed the data world and is immensely popular, but dbt orchestration is still a problem. Many Airflow users have to integrate dbt within Airflow. This time the Xebia team proposes using dynamic task mapping to do it (link to GitHub repo with multiple solutions).
- The Astro team also showcased how you can deploy LLMs with Airflow, following the a16z infra guide.
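To make the micropipeline idea from the datasets talk a bit more concrete, here is a toy model in plain Python. It is only a sketch of the concept, not Airflow code and not the speakers' implementation: the `Micropipeline` class, the `triggered_by` helper and the dataset and pipeline names are all made up for illustration. In Airflow itself, the same chaining would be expressed by scheduling a DAG on a Dataset and declaring Dataset outlets on tasks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Micropipeline:
    """A small pipeline that reads one dataset and writes one dataset."""

    name: str
    input_dataset: str
    output_dataset: str


def triggered_by(dataset: str, pipelines: list[Micropipeline]) -> list[str]:
    """Return, in run order, every pipeline reached when `dataset` is updated."""
    order: list[str] = []
    pending = [dataset]
    seen = {dataset}
    while pending:
        current = pending.pop(0)
        for pipeline in pipelines:
            if pipeline.input_dataset == current:
                order.append(pipeline.name)
                # Queue the output so its consumers get triggered in turn.
                if pipeline.output_dataset not in seen:
                    seen.add(pipeline.output_dataset)
                    pending.append(pipeline.output_dataset)
    return order


# Hypothetical dataset names and pipelines, for illustration only.
PIPELINES = [
    Micropipeline("migrate_orders", "raw/orders", "clean/orders"),
    Micropipeline("broadcast_orders", "clean/orders", "mart/orders_by_day"),
    Micropipeline("audit_orders", "clean/orders", "audit/orders"),
]

print(triggered_by("raw/orders", PIPELINES))
```

Because each pipeline only knows its own input and output, you can add or remove a consumer of `clean/orders` without touching the pipeline that produces it, which is the loose coupling the talk is after.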
Understand Airflow internals
3 talks you should watch to learn things you don't know about Airflow internals.
- Airflow is made of 3 main components interacting together: the webserver, the scheduler and the executor, which use a database to communicate. Within the scheduler, there is a DAG parser process that reads files to understand what needs to be scheduled.
- This DAG parsing step has flaws.
- By default you have to wait 5 minutes to have a new DAG displayed in the UI.
- If you have 300 DAGs coming from a single file (generated with a for loop), it works way better than if you have 300 DAGs in 300 files.
- That's why we should probably move to event-based DAG parsing: in the presentation, Bas explains the 4 steps in the DAG parser and which configuration you can change to get better performance. He also demos an event-based DAG parsing that instantaneously displays DAGs in the UI.
- Then John also explained what he did to improve parsing performance, especially around Python imports: parsing a DAG means running the DAG's Python code, imports included, and imports wreck the parsing time.
- ➡️ In conclusion, you should consider running the DAG processor as a standalone process to remove the impact it could have on the scheduler, and follow the latest community improvements.
- Niko also discussed the executor decoupling, which unlocks the development of third-party executors like an ECS Executor.
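Since parsing a DAG file means executing its top-level Python code, one commonly recommended mitigation is to move heavy imports inside the task callables. The sketch below only illustrates that pattern and is not necessarily what was shown in the talk: `summarize` is a hypothetical task function, and the standard library `statistics` module stands in for a heavy dependency.

```python
# Top level of a DAG file: executed on every parse, so keep it cheap.
# (A heavy import placed here would be paid again on each parsing loop.)


def summarize(values: list[float]) -> dict[str, float]:
    """Task callable: the import below runs only when the task executes."""
    import statistics  # stand-in for a heavy library

    return {"mean": statistics.mean(values), "max": max(values)}


print(summarize([1, 2, 3, 6]))
```

The function behaves exactly the same as with a top-level import; the only difference is when the import cost is paid.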
To finish this newsletter, here are 3 company presentations about their Airflow setups that gave me inspiration.
- Bloomberg, leveraging dynamic DAGs for data ingestion: I'm a huge fan of dynamic DAGs. I think this is the way to go in Airflow because, as a data engineer, your role is to create a standardisation layer for data work rather than doing the actual data work, especially in a mesh context. Here the Bloomberg team creates a nice categorisation of data tasks to provide DAGs as a config.
- Reddit, How we migrated from Airflow 1 to Airflow 2: if there are still people out there on Airflow 1, you should migrate; newer Airflow versions are way simpler and more fun. But to be honest, Reddit's presentation can be generalised to every team that wants to migrate from an old piece of software to a fresher one. The migration recipes apply whatever software you use.
- Monzo, Evolving our data platform as the bank scales — This presentation is full of awesome ideas. It talks about dbt integration within Airflow (using a custom DAGBuilder), monitoring, alerting and Slack interaction with the data stack.
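The "DAGs as a config" idea can be sketched without Airflow at all. The example below is a minimal illustration under my own assumptions, not Bloomberg's or Monzo's implementation: the config layout, the category names and the `plan_from_config` helper are hypothetical. It validates a config and uses the standard library `graphlib` module to turn the declared dependencies into a runnable order, which is the part a real builder would then translate into Airflow tasks.

```python
from graphlib import TopologicalSorter

# Hypothetical config: each task declares a category and its upstream tasks.
CONFIG = {
    "dag_id": "ingest_prices",
    "tasks": {
        "extract": {"category": "ingest", "upstream": []},
        "validate": {"category": "quality", "upstream": ["extract"]},
        "load": {"category": "publish", "upstream": ["validate"]},
    },
}

# Hypothetical standard categories that the platform team would own.
ALLOWED_CATEGORIES = {"ingest", "quality", "publish"}


def plan_from_config(config: dict) -> list[str]:
    """Validate a pipeline config and return its task ids in a runnable order."""
    tasks = config["tasks"]
    for task_id, spec in tasks.items():
        if spec["category"] not in ALLOWED_CATEGORIES:
            raise ValueError(f"{task_id}: unknown category {spec['category']!r}")
        for upstream in spec["upstream"]:
            if upstream not in tasks:
                raise ValueError(f"{task_id}: unknown upstream {upstream!r}")
    # Map each task to the set of tasks that must run before it.
    graph = {task_id: set(spec["upstream"]) for task_id, spec in tasks.items()}
    # static_order() raises graphlib.CycleError if the config contains a cycle.
    return list(TopologicalSorter(graph).static_order())


print(plan_from_config(CONFIG))
```

The point of the design is that teams only write the config, while the validation and the standard categories live in one place.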
See you next week ❤️ This week's other articles will be blended into next week's Data News!