Hey, new Friday, new Data News edition. I'm so happy to see new people coming every week. Thank you for every recommendation you make about the blog or the Data News. This kindness towards my content gives me wings.
This week I don't want to be late, so let's start the weekly wrap-up. I was less inspired this week, which means a shorter edition.
As a side note, we are looking for speakers for a late-February Airflow Meetup. Topics are still open, so whatever you want to share (as long as it relates to Airflow at some point), we'll be happy to welcome you as a speaker.
The current state of data
This week Benjamin Rogojan livestreamed an online conference featuring awesome data voices: state of data infra. Matt wrote up his takeaways from the conference on Medium. In parallel, Ben released the results of a survey about data infrastructure that he ran among his followers. The main thing to notice is that the average company is a finance company using Airflow with BigQuery, and they struggle (like you, probably) to hire people.
This is also the time for my views on the state of data. After 2 years of running the newsletter, writing every week about trends and following "influencers" for you, I'm bored. If I'm being honest, I'm French and I was probably born bored, but still. When I was a young professional I was so hyped by new technologies; right now it's harder for me. I personally feel that the data ecosystem is in an in-between state: in between the Hadoop era, the modern data stack and the machine learning revolution that everyone (but me) is waiting for. But, funnily enough, in the end we are still copying data from database to database using CSVs, like 40 years ago.
If we go back to this week's articles:
- Matt Hawkins tried to find the origins of the term "modern data stack".
- Pedram wrote about the state of data testing. At the end of the article they pitch data-diff, obviously, because it's on the Datafold blog. Still, the article is relevant on the four facets of data quality: accuracy, completeness, consistency and integrity.
- Apache Doris: to me it sounds like a character from Nemo, but it's actually the new real-time warehouse of the Apache Foundation.
- There is an introduction post about DataHub. When you look at what you have to run to launch a data catalog (4 components and 4 different data stores), don't be surprised if no one uses data catalogs. And to think that some people say Airflow is complex to launch.
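To make the four facets of data quality mentioned above a bit more concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the `orders` and `customers` rows are invented, `check_quality` is a hypothetical helper, and the one-line rule behind each facet is my own simplification, not the definition used in Pedram's article or in data-diff.

```python
# Hypothetical rows, invented for illustration only.
customers = [
    {"id": 1, "country": "FR"},
    {"id": 2, "country": "SE"},
]
orders = [
    {"id": 10, "customer_id": 1, "amount": 42.0, "currency": "EUR"},
    {"id": 11, "customer_id": 2, "amount": None, "currency": "eur"},
    {"id": 12, "customer_id": 3, "amount": -5.0, "currency": "EUR"},
]


def check_quality(orders, customers):
    """Return, for each facet, the ids of the orders failing the check."""
    known_customers = {c["id"] for c in customers}
    return {
        # Completeness: required fields are filled in.
        "completeness": [o["id"] for o in orders if o["amount"] is None],
        # Accuracy: values are plausible (assumed rule: no negative amount).
        "accuracy": [
            o["id"]
            for o in orders
            if o["amount"] is not None and o["amount"] < 0
        ],
        # Consistency: one representation per concept (upper-case currency).
        "consistency": [
            o["id"] for o in orders if o["currency"] != o["currency"].upper()
        ],
        # Integrity: every order points to an existing customer.
        "integrity": [
            o["id"] for o in orders if o["customer_id"] not in known_customers
        ],
    }


report = check_quality(orders, customers)
print(report)
```

In a real stack these rules would more likely live as tests in your warehouse or transformation tool; the point is only that each facet maps to a different kind of check.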
In a nutshell, I just want to solve problems and empower people with what I build. I don't care if my stack is a post-modern aquarium, I just want it to be blazingly boring.
Data modeling techniques
Data modeling is probably the most important skill of every data practitioner today. Your role and your tools don't really matter. This is about optimisation. Optimisation at different levels: it can be performance optimisation, cost optimisation or business understanding optimisation. Yeah, in the end, optimisation.
There are many techniques out there to do it. I don't want to enumerate them because that's not really the intention. Still, aim for simplicity: keep it simple, stupid, and think about your consumers.
PS: this feedback about the Medallion architecture—bronze, silver, gold—might be interesting for you.
Fast News ⚡️
- Why I moved my dbt workloads to GitHub and saved over $65,000: With the dbt Cloud price increase I already shared, companies have started to look for innovative ways to run dbt. This time it's an example demonstrating that you can do it in GitHub Actions.
- 10 Common Misconceptions about Airflow: Airflow has grown a lot, and users who lost faith in Airflow a while back will probably never come back. Still, this post tries to make the case for Airflow again. In short, in recent Airflow versions it's easy to get started, the UI is great (and tbh always has been), and the scheduler is stable.
- Lights on Versatile Data Kit: A YouTube video about a tool developed by VMware that is an alternative to dbt (yeah, sorry, this is the best way to define it).
- Data Engineering job market in Stockholm: Alexander shared his job search in Sweden on a personal blog. Spoiler: out of 43 applications he got 6 offers. It's a short post, but it describes his experience well.
- Why the super rich are inevitable: Setting aside the fact that we should eat the rich, I just want to talk about the way the information is displayed. Alvin, the author, explains economic concepts with a scrollable visualisation and some simulations to help people understand them. I found it very pleasant, and it looks like something data teams could do to package data analyses.
- All you need to know to get started with Vertex AI Pipelines: Will people continue to do Data Science by themselves in 2023? Probably not like before, and with more APIs in it. For that you can follow this overview of Vertex AI, the Google Cloud Platform managed machine learning product.
- BigQuery Ingestion-Time Partitioning and Partition Copy With dbt: Christophe from Teads wrapped up how they contributed to dbt 1.4 by adding ingestion-time partitioned table support for BigQuery.
- Choose your adventure: How changing how you spend your free time can genuinely make you feel like you have more of it and take care of your well-being.
Data Economy 💰
- Cumul.io raises €10m Series A. Embedded analytics is the ability to bring Business Intelligence apps into "traditional" software platforms like SaaS applications or public websites. Cumul.io provides a complete SDK to integrate analytics in your app, either by doing it yourself or by letting your customers do it.
- Lay-offs are continuing at big tech. Google and Microsoft announced 6% and ~5% job cuts respectively. According to layoffs.fyi, around 40k people got laid off in tech in January this year, which is roughly a quarter of last year's total lay-offs (150k). If it happened to you recently, you can reach out to me, I'll do whatever I can to help you.
See you next week ❤️.