Hey, it's Saturday. I hope you're enjoying July, taking a well-deserved break, reading data engineering articles at the beach or traveling to unknown places. Sometimes there are Fridays when I can't find any glue between the articles for the newsletter. I then get an idea for something to compensate, but exploring it takes me the whole Friday.
And here we are on Saturday. Yesterday I found a way to get sensor data for half of the Tour de France peloton, and I was sure it would be a good dataset to explore new tools with. It's honestly a great dataset, but it's a bit hard to download and format all the data for exploration. So it will be for later.
Anyway, here is a quick press roundup of a few news items and articles.
Gen AI 🤖
- Elon Musk announced xAI, his new company, to show that he's better than the rest. He hired alumni from all the AI companies (e.g. DeepMind, Google, OpenAI, etc.). They held a 2-hour Twitter Space in which they detailed the vision a little. It's mainly about building an AGI capable of understanding the universe. They say we are a few weeks away from their first release. Here is a great summary of the Space.
- Associated Press signs with OpenAI to share AP's text archive — Interesting to see, as it's one of the first deals of this kind. It reminds me of when the press gave up on their own platforms years ago to write for Google's and Facebook's news platforms. At least this time we will know what OpenAI uses for training.
- Shopify introduces Sidekick — Once again Gen AI is used as a copilot. Shopify introduced a right-hand panel in the UI to help vendors in any way. In the video we see Sidekick generating a chart to answer a sales question.
- Hollywood actors go on strike — They don't want AI and computer-generated faces and voices to replace actors.
- Clibrain, a Spanish startup, launches to build LLMs for Spanish. They released LINCE-ZERO. Spanish is the second most spoken language by number of native speakers and the fourth most spoken by total speakers.
Fast News ⚡️
- How we cut BigQuery costs 80% by hunting down costly queries — The Mixpanel team hugely reduced their BigQuery spending. They use Fivetran, dbt and Census. To get started, they first built a cost dashboard using the information_schema.jobs tables. Then they took action, mainly: avoiding SELECT *, materialising intermediate results, adding partitions and going incremental. Nothing new, but a good reminder.
- Data Contracts in the Modern Data Stack — Whatnot is one of the companies that embraced Data Contracts last year. This article details what they shared in their excellent Data Council talk. Their implementation is mainly a Protobuf schema registry and an interface at event production and consumption.
- Introduction to dimensionality reduction — I gave up on machine learning a few years ago, so I really like articles that explain machine learning concepts visually. This article explains dimensionality reduction, which is often mandatory when datasets grow. There is a part two with live Python examples.
- Make Python free-threading — This is how open source is made. In a community discussion about removing the Python GIL, someone from Meta said they can dedicate 3 CPython internals engineers to work for 2+ years on breaking the barriers. GIL stands for Global Interpreter Lock, a lock that allows only one thread at a time to execute Python bytecode. Interesting to see.
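
If you want to feel the GIL yourself, here is a minimal Python sketch. It is not taken from the discussion, and the function names (`count_down`, `run_sequential`, `run_threaded`) and the loop size are made up for illustration. It runs the same CPU-bound countdown first one after the other, then in two threads. On a standard CPython build with the GIL, I would expect both timings to be roughly the same, but this depends on your machine and your Python build, so treat it as an experiment rather than a benchmark.

```python
import threading
import time


def count_down(n):
    """Pure-Python, CPU-bound loop: the kind of work the GIL serialises."""
    while n > 0:
        n -= 1


def run_sequential(n, workers):
    """Run count_down `workers` times in a row and return the elapsed seconds."""
    start = time.perf_counter()
    for _ in range(workers):
        count_down(n)
    return time.perf_counter() - start


def run_threaded(n, workers):
    """Run count_down in `workers` threads and return the elapsed seconds."""
    threads = [
        threading.Thread(target=count_down, args=(n,)) for _ in range(workers)
    ]
    start = time.perf_counter()
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return time.perf_counter() - start


if __name__ == "__main__":
    # Arbitrary size, chosen only so the loop takes a noticeable time.
    n = 5_000_000
    print(f"sequential: {run_sequential(n, 2):.2f}s")
    print(f"threaded:   {run_threaded(n, 2):.2f}s")
```

Threads still help for I/O-bound work (network calls, file reads), because the lock is released while waiting. It is the CPU-bound case above that the free-threading effort is about.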
See you next week ❤️