
Data News — Week 23.02

Data News #23.02 — Switch from pandas to Polars, hiring processes, new age of machine learning, how query engines work and data economy.

Christophe Blefari
6 min read
Abandoned Pandas (credits)

Hey. I've had busy weeks, and I'm sorry the Data News is coming out on Saturday again. It's a bit hard to travel by train, work and write at the same time. Plus I'm a slow context switcher, so things pile up. Also, a few of you have sent me messages recently that I haven't answered yet; I see you and I haven't forgotten you. Now that I'm back in Berlin it'll be easier.

Last week we organised the first Paris Airflow meetup of the year. It was a round table that I moderated with Benoit Pimpaud, Furcy Pin and Marc Lamberti. We talked about Airflow's place in 2023, the unbundling of Airflow and the best way to run your Airflow DAGs today.

The discussion was in French and the recording will be released next week. In the meantime you can still check my article Using Airflow the wrong way, which summarises the operators vs. containers debate. During the meetup we did not talk about Airflow alternatives; currently Mage is the rising tool that everyone is trying out as an Airflow replacement.

Enjoy the Data News.

Polars—Pandas are freezing

Recently, influencers have been betting that Rust will become the de facto language of data engineering. History repeats itself: we've seen the same with Scala, Go and even Julia at some scale. In the end, Python and SQL are still here for good. But with Rust the approach is different. The idea is not to replace Python but to replace the underlying bindings used by Python libraries.

And it makes sense: for instance, ruff is a Python linter built in Rust that claims to be dramatically faster than the usual tools.

On the data processing side there is Polars, a DataFrame library that could replace pandas. Let's have a quick look at it. In this overview I won't talk about performance because I don't have the time to do a proper benchmark, and I've never done one. This is just the experience of a beginner who knows pandas very well.

The installation is pretty straightforward: you can do it with pip. Compared to pandas this is great, because Polars has no dependencies, so the install is a single wheel rather than pandas and its whole dependency tree.

pip install polars

Regarding the imports, the documentation continues to treat me well. It looks like the pandas idioms I already know.

import polars as pl

Then I can do my first CSV import; in the example I load a French railway open dataset about lost and found objects in stations.

df = pl.read_csv("lost-objects-stations.csv", sep=";")

Then you can use the same code as in pandas to select data (head, ["col"], etc.). Now I want to try a group by.

df.groupby("Station").agg([pl.count()]).sort("count", reverse=True)

# Same code but in pandas
df.groupby("Station")["Date"].count().sort_values(ascending=False)

And lastly (because if I continue, the newsletter is going to be too long for you to read), I try to convert a string Series to datetime.

df = df.with_columns(
	df["Date"].str.strptime(pl.Datetime, fmt="%Y-%m-%dT%H:%M:%S%Z").alias("Date")
)

# Same code in pandas
pd_df["Date"] = pd.to_datetime(pd_df["Date"], format="%Y-%m-%dT%H:%M:%S%Z", utc=True)

We can already see the performance difference here.
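One detail worth flagging on the pandas side: if your timestamps carry a numeric UTC offset such as +01:00, the %z directive (lowercase) is the one that matches it. A minimal sketch with made-up timestamps:

```python
import pandas as pd

# Hypothetical timestamps with a numeric UTC offset, in the same shape as the export
s = pd.Series(["2023-01-02T08:45:00+01:00", "2023-01-02T09:10:00+01:00"])

# %z matches numeric offsets like +01:00; utc=True normalises everything to UTC
parsed = pd.to_datetime(s, format="%Y-%m-%dT%H:%M:%S%z", utc=True)
print(parsed.dtype)
```

With utc=True, pandas collapses the mixed offsets into a single UTC-aware datetime dtype instead of an object column.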

To be honest, I tried Polars for 15 minutes and I can already see how I could switch to it if I had the guarantee that it is way faster. The APIs are quite similar, so I'm far from being lost.

🫠 If after this small introduction you want a deeper comparison of Polars, you can check Modern Polars by Kevin Heavey or a 40-minute YouTube video that explains Polars internals.

Hiring processes

The current state of the data market is weird. We have a lot of lay-offs and, at the same time, a lot of companies still looking for data folks. Hiring data people is often critical for them, yet they struggle. There is a huge gap between the jobs on offer, what candidates are looking for and what companies are looking for.

This week Teads shared their engineering hiring process. The process is not focused entirely on data, but it's still relevant because it can give ideas to hiring companies or juniors looking for advice. They have a short four-touchpoint interview process, which looks like a good compromise.

Focusing more on data, Galen wrote about what he looks for in data analyst candidates. One of the most interesting pieces of advice he gives, which I can only second, is: you should spend time mastering the technologies you've chosen. In the current state of data it's easy to lose focus, so listen to him. Stop chasing the latest data trends and master what you use daily. I think mastery in one domain transfers easily to other domains.

Would you be interested by data job offers in the newsletter?

I would like to propose job offers that I personally validate, following an open checklist. Obviously companies would pay for this service, and it would be a means for me to get something in return for the curation and writing work I do every week.

AI Saturday

Credits Good Tech Things by @forrestbrazeal

Fast News ⚡️

Data Economy 💰

  • Metaplane raises an $8.4m seed round. This is a bold claim: Metaplane wants to be the Datadog for data. Operating in the data observability space, it offers the usual set of features: tests, data quality monitoring based on historical data, lineage and alerts.
  • XetHub raises a $7.5m seed round. XetHub brings git to data file management. They support repositories up to 1TB with git-like commands (checkout, push, commit, pull, etc.). I think XetHub is super useful in data science when we need to keep the data alongside the models. When you commit a change to a big file, their repo hub summarises the data diffs.
  • Generative AIs are booming. Following all the stories about a possible $10b Microsoft investment in OpenAI, Seek AI raises a $7.5m seed round. Seek AI's promise is a prompt where you can ask your data anything and the AI answers on top of the raw data directly.

See you next week, maybe on Friday ❤️.
