Data News — Week 23.04
Data News #23.04 — a GPT-safe place: dbt, Airflow, Dagster, data modeling and contracts, creative data people, and a lot of news.
Dear Data News readers, it's a joy every week to write this newsletter, and we are slowly approaching its second birthday. To celebrate together, I'd love to receive your stories about data—short or long, anonymous or not. This is an open box: just write me with whatever you have on your mind and I'll bundle an edition with it.
This is fun because I'm usually not someone who's good at keeping habits. Every week, to be honest, Friday hits me. I don't write in advance, so every week you get a taste of my current mood. I often try to schedule my travels on Fridays; even if the internet is terrible on the train, it's still a good way to fill the 8+ hours of travel time I'm used to.
Today I make the following commitment: I will never use any generative algorithm to write anything in this newsletter. Fun story: one year ago I had an intern working with me on the blog, to whom I had given the task of writing code that could learn from my writing and generate a Data News edition. One year later, I see it differently. In ChatGPT times, my idea is just boring.
On the other side, at the moment I'm not really organised to check whether the articles I share have been entirely written by humans, but still, I'll do as much as I can to avoid sharing empty articles, as I've always done. It might be a good use-case for GPTZero.
As a data professional, not wanting to use AI is probably the height of irony. But right now the field feels like when cryptocurrencies arrived: awesome raw ideas with sharks circling around, waiting for a new productivity high.
PS: last week I made a—bad—joke about Apache naming, and a reader pointed me to an article about the ASF and non-Indigenous appropriation.
This is enough about my life, let's jump to the news.
Back to the roots, a few engineering articles
I did not know how to put these articles together, so here are a few loose ones. In a nutshell, in my manage and schedule dbt guide I say that dbt projects have 2 lifecycles: the first is the development experience and the second is the dbt runtime. It means you have to run dbt somewhere:
- Jonathan proposed a creative way to do it in Dagster — every dbt model is a software-defined asset, which means the whole data chain is reactive and every model is refreshed on a trigger rather than on a cron-based schedule.
- The Astronomer team developed an awesome library meant to translate a dbt DAG into an Airflow DAG: astronomer-cosmos. You get either a DbtDag object or a DbtTaskGroup that dynamically creates an Airflow DAG from your dbt project. It looks very promising. Cosmos reads the dbt model files and does not use the manifest.
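Both approaches boil down to the same idea: turn the dependency graph between dbt models into orchestrator tasks. Here is a toy sketch of that translation (this is not the cosmos or Dagster API, and the model names are made up), using only the standard library:

```python
from graphlib import TopologicalSorter

# Hypothetical dbt-style dependencies: model -> set of upstream models.
# In a real project these edges come from ref() calls in the model files
# (what Cosmos parses) or from dbt's manifest.json.
dbt_deps = {
    "stg_orders": set(),
    "stg_customers": set(),
    "orders_enriched": {"stg_orders", "stg_customers"},
    "daily_revenue": {"orders_enriched"},
}

def build_task_order(deps: dict[str, set[str]]) -> list[str]:
    """Flatten the model DAG into a valid run order, one task per model."""
    return list(TopologicalSorter(deps).static_order())

order = build_task_order(dbt_deps)
print(order)  # staging models first, daily_revenue last
```

An orchestrator does the same walk, except each model becomes a task (or an asset) instead of a list entry, and independent models can run in parallel.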
In terms of data modeling, ThoughtSpot wrote about the best data modeling methods and Chad—the pope of Data Contracts—wrote about data contracts for the warehouse. Mainly, contracts shift responsibility to data producers in order to enforce schema and semantics, but in the data world this is sometimes rather a utopia: producers are often software teams that, sadly, do not care about data teams.
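To make the "enforce schema and semantics" part concrete, here is a minimal producer-side check. This is a sketch only: the field names, types and contract shape are invented for illustration, and real contract tooling does far more (versioning, semantic rules, CI enforcement).

```python
# A toy data contract: expected column names and types for an orders event.
ORDERS_CONTRACT = {"order_id": int, "customer_id": int, "amount": float}

def validate(record: dict, contract: dict) -> list[str]:
    """Return a list of contract violations (empty list = record is valid)."""
    errors = []
    for field, expected_type in contract.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            errors.append(f"{field}: expected {expected_type.__name__}, "
                          f"got {type(record[field]).__name__}")
    return errors

# amount arrives as a string, so the producer gets a violation before emitting
print(validate({"order_id": 1, "customer_id": 2, "amount": "10"}, ORDERS_CONTRACT))
```

The point of a contract is exactly this: the check runs on the producer's side, before bad data ever reaches the warehouse.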
Finally Noah shared how he improved data quality by removing 80% of the tests and Ronald proposed a framework to create data products in Airflow.
Data people are creatives 🪄
This is a new category that will appear in the next Data News editions. In it I'll share things people actually do with data. The idea is to inspire others by promoting the end use-case rather than just the technology. I'll be more than happy to share what you do.
- Are Airbnb guests less energy efficient than their hosts? — Max tries to find out whether Airbnb guests' energy consumption is higher than their hosts'. I'm always amazed by straight-to-the-point analyses like this.
- Automated object detection in CSGO — PandaScore, a French company that generates data from public—and probably private—e-sports sources, showcases how they used OCR to extract data from CSGO live streams. I did something similar last year on Teamfight Tactics.
- Football data pipeline project — This is more of a technical walk-through to build a Streamlit dashboard on the Premier League. Still, it's interesting.
Fast News ⚡️
- Airbyte announced a free sync plan. Starting today, connectors that are in alpha or beta will be free to use in Airbyte Cloud. Only one side of the sync needs to be in alpha/beta for it to be free. Once a connector goes GA, you'll have 2 weeks before being charged.
- Earlier in January, Fivetran also announced a free plan. Starting in February you will be able to use it to sync up to 500k distinct rows for free, plus other perks.
- SQLAlchemy 2.0 released — This is a major release with a lot of breaking changes. As I'm far from being an expert in SQLAlchemy, I can't say more than that it seems to be a shiny new, better ORM.
- Metaplane announced data tests preview in pull requests — This is a way to compare the SQL code in a PR against live production data, to see directly in GitHub what has changed. It gives ideas.
- Snowflake released min_by and max_by functions — With these new functions you can, in a single SELECT statement, get the value of one column at the row where another column is at its minimum/maximum — for instance the first/last status for an id. This is a great shortcut.
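In plain Python the semantics of max_by look like this (toy rows, and the column layout is hypothetical):

```python
# Emulating max_by(status, updated_at) grouped by id: for each id, return
# the status from the row with the greatest timestamp.
rows = [  # (id, updated_at, status) — made-up data
    (1, "2023-01-10", "pending"),
    (1, "2023-01-15", "shipped"),
    (2, "2023-01-12", "pending"),
]

def max_by(rows, group_idx=0, order_idx=1, value_idx=2):
    """Keep, per group key, the value from the row with the max order column."""
    latest = {}
    for row in rows:
        key = row[group_idx]
        if key not in latest or row[order_idx] > latest[key][order_idx]:
            latest[key] = row
    return {key: row[value_idx] for key, row in latest.items()}

print(max_by(rows))  # {1: 'shipped', 2: 'pending'}
```

The shortcut is that the SQL function replaces the usual window-function or self-join dance with a single aggregate.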
- How to compare two tables for quality in BigQuery — Giorgios proposes a simple query to compare 2 tables in BigQuery. If you are a Snowflake user there is a MINUS operator to do it even more easily, and if you use dbt you can avoid this boilerplate by using the dbt_utils.equality test.
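The same two-way diff idea can be sketched with SQLite's EXCEPT, which plays the role of Snowflake's MINUS (table names and rows here are made up):

```python
import sqlite3

# Two toy tables that agree on row 1 but disagree on row 2.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE a (id INTEGER, val TEXT);
    CREATE TABLE b (id INTEGER, val TEXT);
    INSERT INTO a VALUES (1, 'x'), (2, 'y');
    INSERT INTO b VALUES (1, 'x'), (2, 'z');
""")

# Rows in a but not b, plus rows in b but not a: empty result = tables match.
diff = con.execute("""
    SELECT 'only_in_a' AS side, * FROM (SELECT * FROM a EXCEPT SELECT * FROM b)
    UNION ALL
    SELECT 'only_in_b', * FROM (SELECT * FROM b EXCEPT SELECT * FROM a)
""").fetchall()
print(diff)  # two diff rows: (2, 'y') only in a, (2, 'z') only in b
```

dbt_utils.equality generates essentially this kind of symmetric-difference query for you, which is why it removes the boilerplate.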
- How misused terminology is damaging the data field — The title is a bit exaggerated, and terminology gatekeeping damages the field even more. In the end we all do stuff with data, right?
- How you can have impact as an Engineering Manager — Good question and good article. In a nutshell, it's about your team and other teams, and how you interact with other people in terms of behaviour, processes and practices.
Data Economy 💰
- Microsoft finally announced their "multi-billion dollar" investment—reportedly $10b—in OpenAI. Nothing more to say; you might have guessed my opinion from the introduction.
- Whalesync raises $1.8m pre-seed to create another connector-based data movement SaaS, with bidirectional connectors. The difference from similar products is the ability to also sync to Postgres; usually tools like this only sync between SaaS apps. They also enable automated web page creation for SEO, which is unrelated to the data movement business.
- Komprise raises $37m Series D to build yet another all-in-one data platform.
See you next week ❤️.
Join the newsletter to receive the latest updates in your inbox.