Data News — Week 23.05
Data News #23.05 — machine learning at big tech, Airflow in Azure, think in SQL, dbt and Snowflake clones, generative Seinfeld.
Hey you, it's already February. Every week it's the same story: I plan too many tasks and deliver slowly. I guess that's how it is. Still, I love this Friday rendezvous we have together. I'm still amazed by how I changed my old habits to fit writing into my workflow, and it brings me a lot of joy.
It's also funny because I don't consider newsletter writing to be work. Which is maybe a bit stupid, because when I work on the newsletter I upskill myself, I read, I discover stuff, I meet people. Still, the newsletter takes one day per week to produce, which is significant enough to call it work. I hope everyone finds that little thing that is actually work but makes work feel less like work.
I'd like to write more about how I organise my time, and especially about my freelancing activities, but today I have less time for the newsletter, so consider this an appetizer for later. Let's jump directly to the news.
ML Friday 🤖
- Netflix, discovering creative insights in promotional artwork — That's probably the reason Netflix artwork now feels so conventional. The article shows how Netflix art creators use past data to create new artwork. In the end it's a feedback loop where everything ends up looking similar.
- eBay, Variable Hub: a data access layer for risk decisioning — It looks like a feature store, but for risk topics. The idea is to create a unified layer that stores all the data needed to make decisions.
- Lyft, powering millions of real-time decisions with LyftLearn Serving — The architecture of the decentralized system Lyft uses to deploy and serve ML models.
- Spotify, Unleashing ML Innovation at Spotify with Ray — I've never used Ray, but it looks promising as a unified way to describe machine learning pipelines no matter which framework you want to use.
It's refreshing to see big tech machine learning articles that still look like the machine learning we were doing two years ago.
Fast News ⚡️
- What's the Modern Data Stack? — Another post about what the modern data stack is. The article is a good summary of the building blocks composing a modern data stack. You can also get inspired by Stuart's modern data stack.
- Analytics Engineer - A Glorified BI Engineer? — I feel guilty: I still think that Analytics Engineers are BI Engineers, but BI Engineers for modern data stack times. In this post Madison compares the two roles. In the end, the answer is: it depends. The Analytics Engineer role is still unclear and varies from company to company. What often holds is that the AE sits between the DE and the DA, so the role is usually defined relative to the other positions.
- Microsoft Azure announced managed Airflow — Starting this week you'll be able to launch Apache Airflow within Azure Data Factory. The feature is in public preview. The way they integrated it within Azure looks a bit weird, but at least it exists now.
- Change data capture with DuckDB — Pedram got a sneak peek of the future: he tried a CDC setup (with Striim) that writes to GCS, with DuckDB computing metrics downstream.
- Data team as % of workforce — Mikkel is a reference when it comes to data team size. This week he categorised companies by data team size as a percentage of workforce. For instance, he found that marketplace companies have bigger data teams than B2B ones, which makes sense.
- 2023 state of databases for Serverless & Edge — I did not know the serverless database field was so innovative right now. All things considered, it's a natural evolution: database connections come from another era and web developers want direct access to databases. It's interesting to see how serverless Postgres is evolving.
- Think in SQL, avoid writing SQL in a top-to-bottom approach — A nice post about the mismatch between the logical query processing order and the syntactic order of SQL queries.
- Parquet best practices: the art of filtering — How to leverage Parquet filtering to save processing time.
- Optimizing dbt development with Snowflake clones — dbt development in a large data warehouse can become expensive if you ask every dbt developer to dbt run the whole SQL tree. Montreal Analytics proposes a solution based on Snowflake database clones. You can also use dbt's --defer option, which does something similar.
- What if we used a CHANGELOG in our data projects? — It's important to have consistent nomenclature when naming commits and changes. Ideally the same would apply to dashboards, but that's hard to do.
- How we deployed a simple wildlife monitoring system on Google Cloud — Artefact engineered a serverless platform on GCP for wildlife monitoring.
- 📺 Seinfeld-like sitcom generated by AI 24/7 live on Twitch — It's amazing how far we can go today in terms of content generation.
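On the Snowflake clones bullet: the trick relies on Snowflake's zero-copy cloning, so each developer gets a cheap copy of production to build only their modified models against. A hedged sketch, with hypothetical database and developer names (`analytics`, `dev_alice`):

```sql
-- Hypothetical names: cloning is zero-copy in Snowflake, so this is
-- near-instant and cheap regardless of the size of the source database.
CREATE DATABASE dev_alice CLONE analytics;
```

The dbt-native alternative mentioned above is `dbt run --select state:modified+ --defer --state <path-to-prod-artifacts>`, which resolves refs to unmodified models against the production schema instead of rebuilding them.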
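To illustrate the "Think in SQL" point above: the engine logically evaluates a query as FROM → WHERE → GROUP BY → HAVING → SELECT → ORDER BY, which is not the order you write it in. A minimal sketch using Python's built-in sqlite3 and an invented orders table:

```python
import sqlite3

# Tiny in-memory demo table (hypothetical data, just for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 120.0), ("alice", 80.0), ("bob", 30.0), ("bob", 10.0), ("carol", 300.0)],
)

# Written top to bottom the query starts with SELECT, but it is evaluated as:
#   1. FROM orders                    -- pick the source rows
#   2. WHERE amount > 20              -- filter individual rows
#   3. GROUP BY customer              -- build groups
#   4. HAVING SUM(amount) > 100       -- filter whole groups
#   5. SELECT customer, SUM(amount)   -- project columns
#   6. ORDER BY total DESC            -- sort the final result
rows = conn.execute(
    """
    SELECT customer, SUM(amount) AS total
    FROM orders
    WHERE amount > 20
    GROUP BY customer
    HAVING SUM(amount) > 100
    ORDER BY total DESC
    """
).fetchall()
print(rows)  # [('carol', 300.0), ('alice', 200.0)]: bob is dropped by WHERE + HAVING
```

Reading the query in evaluation order explains, for instance, why you can't reference the SELECT alias `total` inside WHERE: the projection hasn't happened yet when rows are filtered.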
Data Economy 💰
- Select Star raises a $15m Series A. Select Star is another data catalog that automatically connects to your tools and provides the usual data catalog UI: a search bar with metadata management behind it. Nothing new under the sun.
See you next week ❤️.
PS: sorry it was a fast Data News today. I have a big presentation to prepare for Monday. I wish you a great weekend.
Join the newsletter to receive the latest updates in your inbox.