Data News — Week 23.15
Data News #23.15 — Yann LeCun interview, hot takes on the modern data stack, cost savings and metrics layer.
Hey you, the newsletter might be late today again, but this time it is not my fault. The Ghost editor was down when I wanted to write. Anyway, here is the weekly Data News, written faster than usual.
AI News 🤖
Yann LeCun gave a 10-minute interview on a major French radio station. If you want to read the French transcript you can do it here. Here are his main points:
- There is no doubt that one day there will be machines at least as intelligent as humans. But ChatGPT isn't one of them: it gives that impression, but it is not.
- AI can amplify human intelligence like machines amplify human strength.
- Technology shifts jobs. For instance, before the industrial revolution a major part of the French population worked in the fields; now it's less than 2%. It means we shouldn't be afraid of technology replacing jobs. He thinks that this will also allow more people to be creative.
- Regarding fake news and ethics, he draws a comparison with e-mail. He thinks that, just as we developed spam filters to block junk mail, we will develop similar filters against fake news.
- For Yann, there is nothing revolutionary about ChatGPT, but he admits it's good engineering. It is just a normal evolution of deep learning systems.
- (Last one because it's funny.) He bets that in 10-15 years (or more) we will no longer have smartphones but augmented reality glasses. We will also use voice to interact with machines, so we can interact with them with our hands in our pockets. I can't wait to use Siri and Alexa 2.0.
As a side project, if you want to practice machine learning this weekend you can replicate Rihab's project: detecting wildfire smoke with a YOLOv8 model.
Fast News ⚡️
- A tour of Airbyte’s Octavia CLI — Airbyte, an open-source extract-load platform, released a CLI called Octavia a few months ago that lets you create integration pipelines. Jeremy wrote a post that showcases how to do it.
- Hot takes on the Modern Data Stack — Matt gives 5 hot takes about the MDS. I don’t totally agree with everything but this is a good read. He says that Redshift is no longer competing in the warehousing space, which I agree with. He also says that Airflow is obsolete; I disagree. It has recently become common to say bad things about Airflow, but as always the issue sits between the chair and the keyboard. He is also hard on Airbyte and dbt.
- The new philosophers — It's been a long time since I've shared Benn's posts. They are still my favorites: he says smart things, week after week. This time he writes about the new marketing approach of the modern data stack ecosystem: there are plenty of tools, so let's develop a new tool to avoid the other tools. He also adds his views about the ChatGPT disruption: "We'll initially try to insert LLMs into the game we're currently playing [...]. Our data models won’t be augmented by LLMs; they’ll be built for LLMs". Probably no one knows, yet, what it means.
- Castor announced Castor AI — Going the other way around, Castor released a feature that explains a SQL query in natural language. This is a good way to help business users understand what's happening in the transformation layer.
- How we made our reporting engine 17x faster — The Teads engineering team explains how they significantly sped up their ad report generation. In a nutshell, they replaced Spark (EMR) in-memory transformations with BigQuery.
- Large-Scale generation of ML podcast previews at Spotify with Google Dataflow — Generating previews at scale has become a common issue for vast content platforms. This time Spotify explains how they did it with Apache Beam. They take audio and transcript data as input and generate podcast previews that will appear in your feed.
- Big savings on Big Data — This is the current trend: with the current economic situation we have to do more with less (or at least with what we have). At Lyft they optimised their ML platform to save time and money on workloads. In particular, they lowered all the dev costs.
- LocalStack: Why local development for cloud workloads makes sense — This ties in with the previous bullet point. This time Corey writes about LocalStack, a tool that emulates AWS APIs locally. Emulation could be the future, mainly because it avoids running up cloud costs for development.
- Using DuckDB with Polars — A nice showcase of the 2 new kids on the block working together. Mainly, what you will do is query Polars dataframes in SQL with DuckDB.
- Using Metrics Layer to standardize and scale experimentation at DoorDash — A very good, exhaustive article about a metrics layer. At DoorDash a lot of teams run experiments and they needed common ground on metric definitions. That’s why they built this system. Mainly they define measures, dimensions and metrics in YAML that will be materialised and made accessible to Curie (their experimentation platform).
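To make the metrics layer idea from the DoorDash item a bit more concrete, here is a minimal sketch of what "define measures, dimensions and metrics, then materialise them" can look like. Everything in it is an assumption for illustration: the `DEFINITIONS` structure, the `orders` table, the column names and the `compile_metric` helper are hypothetical and do not come from DoorDash's article. A plain Python dict stands in for the YAML file so the example runs with the standard library only.

```python
# Hypothetical definitions: a dict standing in for a YAML file.
# None of these names come from DoorDash's article.
DEFINITIONS = {
    "source": "orders",
    "measures": {
        "order_count": "COUNT(*)",
        "revenue": "SUM(amount)",
    },
    "dimensions": ["country", "platform"],
    "metrics": {
        "average_order_value": {
            "numerator": "revenue",
            "denominator": "order_count",
        },
    },
}


def compile_metric(definitions: dict, metric: str, group_by: list[str]) -> str:
    """Render one ratio metric as a SQL query grouped by declared dimensions."""
    # Only dimensions declared in the definitions may be used for grouping.
    unknown = [d for d in group_by if d not in definitions["dimensions"]]
    if unknown:
        raise ValueError(f"unknown dimensions: {unknown}")

    spec = definitions["metrics"][metric]
    measures = definitions["measures"]
    numerator = measures[spec["numerator"]]
    denominator = measures[spec["denominator"]]

    # NULLIF guards against division by zero; * 1.0 forces decimal division.
    expression = f"{numerator} * 1.0 / NULLIF({denominator}, 0) AS {metric}"
    columns = [*group_by, expression]

    sql = f"SELECT {', '.join(columns)} FROM {definitions['source']}"
    if group_by:
        sql += f" GROUP BY {', '.join(group_by)}"
    return sql


if __name__ == "__main__":
    print(compile_metric(DEFINITIONS, "average_order_value", ["country"]))
```

The point of the design is that every team reuses the same measure expressions, so "revenue" or "average order value" means the same thing in every experiment. A real system would add validation, materialisation schedules and many more metric types.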
Data Economy 💰
- Cybersyn raises $62.9m Series A. Cybersyn is a data-as-a-service platform that provides public datasets for everyone. You can see it as a marketplace of common public datasets. They are heavily supported by Snowflake, so the datasets are accessible in the Snowflake Marketplace. For instance you can freely query the US Addresses dataset to get all the addresses in a zipcode.
- Rupert raises $8m in funding. Rupert wants to bridge the gap between data analysts and business users by providing a no-code UI to create data alerts on top of your semantic layer.