Hey you, this is a Saturday edition of the Data News. I hope this email finds you well. This week you'll get less editorial content because I'm late, but you'll still find awesome articles that have been written recently.
As a reminder, on Tuesday next week I'm organising the Apache Airflow Paris meetup, which you should consider joining if you are in Paris. Also next week, I'll publish my first ever podcast episode, recorded with Joe Reis, the co-author of the famous Fundamentals of Data Engineering. I'm still looking for a name for the podcast, so if you have ideas, shoot.
Gen AI 🤖
- Google "We have no moat, and neither does OpenAI" - This is an internal note from a Google employee (which does not reflect Google's views). It mainly says that open-source models will win over Google and OpenAI, and that a closed-source policy for models might be a mistake, especially in a world where some models leak (e.g. Meta's).
- If you already have access to OpenAI in Azure you can now use GPT-4, though only in preview for now.
And more traditional AI:
- YOLO-NAS, a new object detection model - You have probably already seen this kind of model, which detects people in real time in videos. This new one seems to be better than the previous one.
- Mojo, a new programming language ready for AI - Mojo is a new programming language that looks like Python but works at a lower level, which could unlock performance gains and new heights in AI model development.
- eBay’s blazingly fast billion-scale vector similarity engine.
Fast News ⚡️
- PayPal, template for data contract - PayPal is implementing a Data Mesh and has shared in the open all its thinking on data contracts. The GitHub repo includes a YAML template describing what's in the contract. It is insanely exhaustive.
- Even Amazon can't make sense of serverless or microservices - The Prime Video tech team wrote an article that could be summarised as: we migrated from a functions-based approach to a monolith in a VM. The Internet found this ironic. By doing this they reduced costs by 90%.
- Lakehouse at Walmart - Samuel from Walmart describes the research they did and why they picked Hudi over Delta to implement a Lakehouse architecture. As a reminder, the Lakehouse is the merge of the data lake and the data warehouse, which is mainly a way to add a SQL-friendly processing engine with ACID transactions on top of a data lake.
- Safer deployment of streaming applications - This is how Grab deploys Flink applications.
- Why you should reconsider Debezium: challenges and alternatives - Warning: this article was written by a CDC solution vendor, but it is still relevant because it shows the reality of managing Debezium.
- Dataform: schedule daily updates using Cloud Functions - Dataform is a solution Google bought a few years ago that is a dbt alternative, but for BigQuery. This article gives a great overview of the product. To be honest, it looks a bit hacky.
- 📺 Dev Deletes Entire Production Database, Chaos Ensues - If you want a greatly told story you should watch it. This is a YouTube video explaining how GitLab deleted its production database and how they fixed it. It reminds me of my own horror story of deleting the whole /data folder in HDFS.
- Oracle is taking on Snowflake - I often say that Snowflake will become the new Oracle. It is fun to see that Oracle is still trying to catch up. They came up with a lot of news: they will implement the Delta Sharing protocol, lower the storage price from $118/TB to $25, partner with AWS and offer a low-code data integration tool.
- Data modeling, again - Simon published the second part of his data modeling guide. This time he covers the different techniques you can use when modeling data: dimensional, vault, anchor and more. You might also want to see practical examples of data modeling: Sonny wrote a nice article using a hotel business as an example.
- Crafting your data team - Practical tips on how to get started with your data team in a new startup. In the post, Marc describes the qualities you should look for and which hires you should prioritise first.
- 🎮 The CS:GO Liquid team announced a new data analyst. DeMars previously worked on a predictive analytics approach for Valorant, trying to predict who would win a round in different situations. It is fun to see our beloved data analyst position reaching other fields.
- Data projects on personal data - Petrica dives into her Medium data with DuckDB and Plotly, and Stefen analysed his Uber spending with dbt and Postgres. As a reminder, doing personal data projects is still the best way to learn about technical stuff.
Data Economy 💰
- AuraML, an India-based company, raises $230k in a pre-seed round. AuraML is a 3D synthetic data company whose engine can generate realistic-looking 3D environments you might want to use in other models.
- Mistral AI, a French Gen AI company, will probably raise €100m (link in French) in the coming weeks. At the moment the company seems to have hired only a few French people who previously worked at Meta or Alphabet, on LLaMA or at DeepMind. The goal of the company is to provide the first French (hence European) alternative to OpenAI. Obviously this is heavily political and strategic for Europe, so we will follow it in the coming weeks.
- Anaconda is expanding and buying EduBlocks. EduBlocks is a Scratch-like platform for writing Python or HTML code. This is a cool way to continue democratising code.
- Open-source done differently. Sequoia, a VC, will support Sebastián Ramírez with an open-source fellowship. Sebastián is the creator of FastAPI, SQLModel and Typer. There are no more details in the press release, but this is awesome to see.
See you next week ❤️.