Data News — Week 22.46
Data News #22.46 — Paris Airflow meetup, DuckDB, data teams need to break out of their bubble, select * exclude and the fast news.
Hey you, a new Friday means data news. This week feels a bit like the old data news, with a variety of articles on different cool topics, while I navigate through the current data trends.
Next Monday I'll present "How to build a data dream team" at the Y42 meetup. I'll share a written version of my talk in next week's edition. But this week, as an appetizer, there are 2 articles I really liked about data team composition.
Last but not least, if you are in Paris on the 6th of December you can join us for the reboot of the Apache Airflow meetups—I'm the organizer. Talks will be given in French. The agenda:
- leboncoin will share best practices around Airflow
- Qonto will show how you can smoothly integrate dbt within Airflow
- I'll also open the meetup with the latest Airflow features
My two cents about DuckDB
Ok, right now the LinkedIn and Twitter data world is going one-way down the Rust and DuckDB street. While I don't have any opinion on Rust, except that it looks like yet another eternal programming-language debate I'm bored of, I do have one on DuckDB.
Here is a small description of DuckDB I wrote two newsletters ago:
If you missed it, DuckDB is a single-node, in-memory OLAP database. In other words, DuckDB runs on a single server, loads the data into memory (RAM) in a columnar format, and applies transformations to it. Natively DuckDB integrates with SQL and Python, which means you can query your data with either one.
First, let's decrypt the marketing. MotherDuck, the company built around DuckDB, says things like "Big Data is dead" or "Your laptop is faster than your data warehouse", which in theory reopens the door to single-instance processing for your data. It's brilliant, tbh. I buy it. Plus they add a fun tone with ducks, which creates sympathy for the product.
But is it really something?
I think it is, but I might have already been influenced by the marketing. When I think about DuckDB's simplicity, it's exhilarating.
pip install duckdb, then import duckdb, and you are good to go. You don't need to run a server: a database is available to you, and you can read files (CSV or Parquet) and execute SQL or DataFrame operations on them seamlessly.
I can imagine a list of use cases that will help improve the data engineering workflow, but at the same time I don't believe DuckDB can become the main processing engine of a data platform. By its single-node nature, the technology will surely serve decentralised teams with a central lake with brio, but I see mostly edge use cases: running data processing in the CI/CD to quickly validate things, providing a great local dev experience to every data developer, or powering small data analytics products.
I don't think it can replace the current data warehouse vision or technologies, and in my opinion it shouldn't be sold as, or compared with, them. It's more of a cool sidekick to the modern data stack. Still, with the huge amount of money invested and the current course of things, where everyone wants to try the hype, I'm afraid it will turn out differently.
Oliver also shared deeper views on the hype.
Data teams need to break out of their bubble
Mary MacCarthy published a great post. It's a wake-up call for data teams. In the current economic situation, all the intellectual discussions about the vision of the field are fun, but that is not really what data teams are built for. Data teams exist in most companies to empower other teams. I'd also bet that the semantic layer, DuckDB, Rust, or other trendy stuff is not what will empower your stakeholders.
Right now, according to Mary, the best move you can make to empower your stakeholders is to break out of your bubble and really work in pairs with them. In the article she takes the example of the relationship between marketing teams and data teams, which often looks like shadow IT: martech solutions are often yet another all-in-one data platform.
On the same topic, Mikkel Dengsøe came back with a great article about data people sitting outside of the data team. He gives a few tips and pitfalls to make this setup work.
Fast News ⚡️
- Notion announced Notion AI — Notion will introduce an AI assistant block able to generate text in your Notion pages. It's currently in alpha, behind a waitlist. Under the hood it uses OpenAI; in the FAQ Notion promises that your data will be protected and not used by OpenAI.
- Dataclasses: Supercharge your Python code — If you don't use Python's dataclasses, you should look at this article, which gives you usage examples. I personally use dataclasses a lot when creating configurations for my data pipelines. They let me type my configuration and trade bracket notation for objects, which is more comfortable when developing.
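For illustration, here is a hypothetical pipeline configuration written as a dataclass (the names and fields are made up for the example):

```python
from dataclasses import dataclass, field

# Hypothetical pipeline configuration: typed fields instead of a raw dict.
@dataclass
class PipelineConfig:
    name: str
    schedule: str = "@daily"
    retries: int = 3
    tables: list[str] = field(default_factory=list)

config = PipelineConfig(name="sales_export", tables=["orders", "customers"])

# Attribute access replaces bracket notation like config["retries"],
# and a type checker or IDE can now catch typos and wrong types.
print(config.retries)  # 3
print(config.tables)   # ['orders', 'customers']
```

You also get `__init__`, `__repr__`, and equality for free, which a plain dict of settings never gives you.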
- SELECT * EXCLUDE/RENAME — It was one of the features I missed the most when I switched from BigQuery to Snowflake. Here it is. You can now supercharge your Snowflake select * by either excluding unwanted columns or renaming some on the fly. It saves precious SQL lines when you have a lot of columns.
- Visualization tips for data story-telling — How to pick colors, how to display text and at what size, and how to emphasize one data point among others. This article is a good heads-up.
- StarRocks, a next-gen sub-second MPP database — I discovered a new open-source real-time OLAP database. Nothing more to say, except that I put it in the newsletter as a save-for-later.
- Revamping the Apache Airflow-based workflow orchestration platform at Coinbase — What to do when you have around 1000 pipelines and more than 1500 PRs every month on your project.
- Building Spark data pipelines in the cloud, what you need to get started — Spark has not disappeared yet, even if I don't share much content about it in the newsletter. This is a complete guide to Spark worth mentioning.
- Your data catalog shouldn’t be just one more UI — In today's data ecosystem, all data catalogs have been developed following the same concepts coming from SV big tech startups. In this article the author explains that a data catalog should be more than a search bar over entities. Rather, a data catalog should first be a central metadata repository with open APIs, allowing every data team to activate real use cases.
See also: More on semantics & databases. What if we could add more semantics directly into database schema comments?
- (I did not have the time to read these 2 articles) Simplifying 3NF & Data Skew: 101.
Data Fundraising 💰
- Quix raises $12.9m Series A. Quix is a serverless real-time platform that lets developers focus on building real-time apps rather than spending time on the underlying infra. Their SDK works with Python and C#.
- MotherDuck raises $47.5m Seed and Series A. Just a side note about the fundraising of the company behind DuckDB. I've already shared my thoughts about it in this newsletter's edito. The company seems to be on the track of the giants, fuelled with a16z money. As others are betting, we have a few months of trendy ducks ahead of us.
See you next week ❤️.
Join the newsletter to receive the latest updates in your inbox.