Saturday be like (credits)

Hello data news readers. I hope you had a great week. This is the Saturday edition of the data news because yesterday I had blank page syndrome. I hope you don't mind.

Before jumping into the news, I have two things to say. First, I've been listed as Best data science newsletter by Hackernoon. If you like this newsletter, I'd love to get your vote. Second, I'm organising an Airflow meetup in Paris on the 6th of December and I'm still looking for speakers, probably for 5-minute lightning talks, in French or English.

Have fun.

Build a data lake from scratch with DuckDB and Dagster

I've recently shared a lot of articles about DuckDB in the newsletter. If you missed it, DuckDB is a single-node in-memory OLAP database. In other words, DuckDB runs on a single machine, loads the data into memory (RAM) in a columnar format and applies transformations to it. DuckDB natively integrates with SQL and Python, which means you can query your data with either.
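To make "columnar format" a bit more concrete, here is a toy sketch in plain Python. It is not how DuckDB is implemented, and the table, column names and values are made up for the example. It only shows why an aggregate over one column is cheap when each column is stored on its own.

```python
# Toy illustration only: this is NOT how DuckDB is implemented.
# The table, its column names and its values are made up for the example.

# Row-oriented layout: one record per row, all fields stored together.
rows = [
    {"city": "Paris", "product": "duck", "amount": 10},
    {"city": "Lyon", "product": "goose", "amount": 5},
    {"city": "Paris", "product": "goose", "amount": 7},
]

# Column-oriented layout: one list per column, values of a column stored together.
columns = {
    "city": ["Paris", "Lyon", "Paris"],
    "product": ["duck", "goose", "goose"],
    "amount": [10, 5, 7],
}


def total_amount_row_wise(table):
    """Every full record is visited even though only `amount` is needed."""
    return sum(record["amount"] for record in table)


def total_amount_column_wise(table):
    """Only the `amount` column is scanned; the other columns are never touched."""
    return sum(table["amount"])


def amount_by_city(table):
    """Rough equivalent of: SELECT city, SUM(amount) FROM t GROUP BY city."""
    totals = {}
    for city, amount in zip(table["city"], table["amount"]):
        totals[city] = totals.get(city, 0) + amount
    return totals
```

Both layouts give the same total, but the column-wise version reads a single list. That is the kind of access pattern analytical (OLAP) queries tend to have.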

This database technology got a lot of traction because it is so simple to install and use. This also means that influencers and bloggers can easily experiment with it to show you how wonderful it is. This article is no exception.

Dagster, on the other hand, is another orchestration tool, designed for the cloud and for data orchestration. The team first popularized the software-defined assets concept, which is a way to define data assets as code. This way the orchestrator knows the data dependencies and can do reactive scheduling rather than CRON-based scheduling.
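To illustrate the idea of assets defined as code, here is a toy sketch in plain Python. It is not Dagster's API: the `asset` decorator, the `ASSETS` registry and the asset names are hypothetical. The sketch assumes each asset is a function whose parameter names are its upstream assets, so the dependency graph can be derived from the code itself.

```python
# Toy sketch of the idea behind software-defined assets.
# This is NOT Dagster's API: the `asset` decorator, the ASSETS registry and
# the asset names below are hypothetical and only illustrate the concept.
import inspect
from graphlib import TopologicalSorter

ASSETS = {}  # asset name -> function that computes it


def asset(fn):
    """Register a function as an asset; its parameter names are its upstream assets."""
    ASSETS[fn.__name__] = fn
    return fn


def upstream_of(name):
    """Names of the assets that `name` depends on."""
    return list(inspect.signature(ASSETS[name]).parameters)


@asset
def raw_orders():
    return [{"id": 1, "amount": 10}, {"id": 2, "amount": 5}]


@asset
def cleaned_orders(raw_orders):
    return [o for o in raw_orders if o["amount"] > 5]


@asset
def revenue(cleaned_orders):
    return sum(o["amount"] for o in cleaned_orders)


def materialize_all():
    """Compute every asset in dependency order."""
    graph = {name: set(upstream_of(name)) for name in ASSETS}
    values = {}
    for name in TopologicalSorter(graph).static_order():
        values[name] = ASSETS[name](*(values[dep] for dep in upstream_of(name)))
    return values


def downstream_of(changed):
    """Assets to refresh when `changed` is updated (the 'reactive' part)."""
    stale, frontier = set(), {changed}
    while frontier:
        frontier = {n for n in ASSETS if frontier & set(upstream_of(n))} - stale
        stale |= frontier
    return stale
```

Because the dependencies live in the code, a scheduler built on this idea can answer "what needs to be refreshed when `raw_orders` changes?" (here, `downstream_of("raw_orders")`) instead of relying on fixed CRON times.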

So, Pete and Sandy from the Dagster team showcase how you can create an S3 data lake with DuckDB as the query engine on top of it. I really like the article because it shows in a small amount of code how you can:

Obviously, what they did is purely experimental, but it gives ideas on how any company could create a lake with a smaller footprint and at a lower price. I mean, BigQuery and Snowflake are also launching on-demand processing, but here with DuckDB you really know what's running, and it's fairly simple, so you can measure all the costs.

PS: as I have never used DuckDB or Dagster (I plan to soon), all my comments are based on my theoretical understanding of the technologies and on everything I have read about them.

DucksDB (credits)

Databases time

It looks like a special edition about databases, but it is not. Dremio wrote an article to explain how a read query works with Iceberg tables. In a nutshell, a read query first uses the catalog to find the right metadata files. These point to the correct manifest files, which in turn lead to the correct data files. In even simpler words, it uses metadata to narrow the data search: the less data you read, the faster the query will be.
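The read path above can be sketched with a toy model in plain Python. This is not the Iceberg library or its file format: the dictionaries, file names and min/max statistics are hypothetical stand-ins for the real metadata files, and the real layout has more levels than shown here.

```python
# Toy model of the read path described above. This is NOT the Iceberg
# library or file format: the dictionaries, file names and min/max
# statistics below are hypothetical stand-ins for the real metadata files.

CATALOG = {"sales": "metadata-v2.json"}  # table name -> current metadata file

METADATA_FILES = {
    "metadata-v2.json": {"manifests": ["manifest-a", "manifest-b"]},
}

# Each manifest lists data files with min/max statistics for a `day` column.
MANIFESTS = {
    "manifest-a": [
        {"path": "data-1.parquet", "min_day": 1, "max_day": 10},
        {"path": "data-2.parquet", "min_day": 11, "max_day": 20},
    ],
    "manifest-b": [
        {"path": "data-3.parquet", "min_day": 21, "max_day": 31},
    ],
}


def plan_scan(table, day_from, day_to):
    """Return the data files that may contain rows with day in [day_from, day_to]."""
    metadata = METADATA_FILES[CATALOG[table]]  # 1. catalog -> metadata file
    files = []
    for manifest in metadata["manifests"]:  # 2. metadata -> manifest files
        for entry in MANIFESTS[manifest]:  # 3. manifest -> data files
            overlaps = entry["min_day"] <= day_to and entry["max_day"] >= day_from
            if overlaps:  # 4. skip files whose range cannot match the filter
                files.append(entry["path"])
    return files
```

With this made-up table, `plan_scan("sales", 12, 15)` only keeps `data-2.parquet`: two of the three data files are skipped without ever being opened, purely from the metadata.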

On a more exotic database side, the Redis team wrote a guide on things to consider when doing a database migration, and Mohammad wrote a retro on DynamoDB 10 years after its general release.

Playing dataviz tennis for collaboration and fun

This idea is so fun and I'd love to try it in a data team. For content purposes, Georgios and Lee played dataviz tennis. Every dataviz tennis match lasted 8 rounds, with 45 minutes per round, and the person who served picked the dataset. In other words, player 1 chooses a dataset, works on a viz for 45 minutes and then sends the viz over to player 2, who works on it for 45 minutes, and so on. All of this is done in R with ggplot2.

I think this is a fun way to collaborate, and we should try it in data teams for some projects. It is an alternative way to do pair programming, and it can be done with data pipelines as well.

ML Saturday 🤖

How would you rate your job satisfaction in your current role? (credits)

In bulk, here are a few cool articles:

Fast News ⚡️

Good self-service (credits)

Data Fundraising 💰


See you next week ❤️. PS: a survey about how you like the newsletter should appear below, please tell me what you think.