Hey you, new Friday means Data News. This week is pretty stacked in terms of content, especially video and audio content. I hope you will enjoy it as much as I did.
Let's start with my newly created podcast, Minds of Data. In Minds of Data I'll meet people from the data ecosystem in order to learn more about them. In the first episode I sat down with Joe Reis and we discussed his professional journey before he became the thought leader he is today; we also chatted about data engineering. You can listen to the episode on Spotify, Apple Podcasts and Deezer.
PS: this is my first episode ever, so feedback is more than welcome.
At the same time, in Paris, we organised the May Airflow meetup last Tuesday. We had 3 talks, which you can find on YouTube. I really liked Benoit and Samy's presentation about Cloud Composer, the managed Airflow on GCP. They shared good practices on how to manage Composer in the cloud, things like:
- Use the same configuration for staging and prod
- Use a secret manager to manage your Airflow connections
- Use IAM restrictions on the DAGs bucket
- Use operators and define the company policy around them
- Define clear policies to govern your Airflow
Also, Airflow 2.6 came out this week with a new parameterizable trigger DAG UI, a new alert notifications framework (callbacks) and a new graph interface in the grid view.
Gen AI 🤖
The pace of innovation and announcements in the (Gen) AI field isn't slowing down. I can't really cover the whole field because it moves so fast that I can't even keep up. This week the Google I/O Keynote was a major milestone.
Google I/O Keynote takeaways
What amazed me in the Google Keynote is that Generative AI is treated like a product, like the 2007 iPhone (look at this ad). When you think about it, AI has always been something hidden, like an API call, a score or a recommendation in a larger UI. In Google's Keynote, AI gets a 26-minute segment, and then come all the derivations, lasting 2 hours.
To me, Google's annual conference is a sign that the party is over, especially for OpenAI. Actually, OpenAI's deal with Microsoft was probably the best deal they could have gone for. Even if, as humans, we want to send models into the arena to find the most performant one, or indulge ourselves comparing parameter counts, in the end the best-integrated models will win. And Google has a head start, as does Microsoft. As Google reminded us in the Keynote, they have 15 products used by billions of people: they have our e-mails, our photos, our maps and more. AI is just a feature in their products; even if it needs a UI rethink, it is just a feature.
So in the end Google, an AI-first company from the beginning, wants to put AI everywhere and wants to offer you an AI collaborator. Here are the major takeaways from the Keynote:
- They released PaLM 2, their latest foundation model. It will exist in 4 sizes (Gecko, Otter, Bison and Unicorn), each requiring different hardware resources to run.
- PaLM 2 will be natively integrated in Google products. Gmail will get enhanced smart reply features, Maps will offer an immersive view of a route, and Photos will have a magic editor that will let you edit a picture with a single drag-and-drop.
- Google will create a sidekick called Duet AI, available in Workspace (Sheets, Docs and Slides). You'll be able to ask the AI to create content for you, unlocking productivity gains. Duet AI will also work in GCP (in the console and within the web IDE).
- According to the announcement, PaLM 2 will particularly shine when fine-tuned (e.g. for IT security or medicine). You'll be able to do it yourself within your own GCP instance in Vertex AI. They also released Imagen, Codey and Chirp, respectively for image generation, code generation and speech-to-text.
- Bard, the conversational model (the ChatGPT equivalent), is now open to everyone (actually not in all countries). Bard works great for code generation, debugging and code explanation.
- Bard might also be the Zero-ETL solution we were all waiting for. In the demo the speaker asks Bard to find schools in an area, then asks for the result to be saved in a Google Sheet, then asks for a new column in the sheet stating whether each school is public or private. To be honest, what prevents Bard from doing the same in a database in the future?
- Finally, Google teased their next-gen model, Gemini, which will obviously be awesome (to hear them tell it), and announced an evolution of the search interface with Gen AI as a new interactive way to search.
In the end I really liked the keynote because it sets a new milestone for what we can expect in terms of integration in the products we use daily.
- Hugging Face released an open model called StarCoder, trained on GitHub code and meant to act as a Copilot. Still, the model is not yet ready to be used as an instruction model, the ChatGPT way.
- At the same time HF also introduced an open-source Chat UI.
- After Bill Gates, it is Steve Wozniak, Apple co-founder, who gives his take on the AI breakthroughs in a BBC interview. Mainly: we can't stop the march of progress, AI will be used to scam people, and we still have to put guardrails in place, but human guardrails.
- Salesforce does not want to be left behind in the battle: they announced Slack GPT, natively integrated in Slack to summarise or compose messages, but also a way for partners to bring new kinds of Gen AI apps.
- Salesforce also gave Tableau a makeover with Tableau GPT, a way to provide AI-powered analytics. In Tableau Pulse you'll have access to auto-generated insights on your data, with a "For You" tab as if you were on TikTok.
Fast News ⚡️
- Zero ELT could be the death of the modern data stack — Amazon launched this trend a few months ago. In the current situation we're far from killing any ELT processes, but it might come. For instance, Zapier launched Zapier Tables, a kind of data storage within your zaps.
- We need to talk about Excel — Let's be honest, the harder we try to kill Excel, the stronger it comes back. David shares interesting stories around Excel usage at companies that I can relate to. He finally mentions Count and Equals, two companies that build on top of tabular interfaces to work with data.
- Determine BigQuery storage costs across an org — A SQL query that I have not tried. Please read it twice rather than running it blindly.
- Polars, laziness and SQL context — Daniel showcases the 2 features which should make you want to migrate to Polars.
- Building the seller analytics dashboard — A great example of what you should consider when building an analytics dashboard in the product and how you can combine dbt and GraphQL APIs to build a pragmatic metrics store.
- OLTP vs. OLAP — One of the best explanations of the differences between the two. It mainly comes down to data storage: one is row-oriented while the other is column-oriented, though this is not the only difference.
- Correctly loading incremental data at scale & real-time denormalized data streaming platform.
- ExternalTaskSensor in Apache Airflow: how to calculate execution delta — I've seen multiple times that the delta computation was annoying for data engineering teams. This article dives deep into it.
- Upscaling LinkedIn's profile datastore while reducing costs — For optimisation geeks.
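On the ExternalTaskSensor item above, here is a minimal sketch of the delta arithmetic in plain Python (no Airflow import). It assumes, as I understand the sensor's behaviour, that it looks for the upstream run whose logical date equals its own logical date minus `execution_delta`. The two schedules (upstream daily at 02:00, downstream daily at 05:00) and the helper names are hypothetical, made up for the illustration, and are not taken from the linked article.

```python
from datetime import datetime, timedelta


def execution_delta(downstream_logical_date: datetime,
                    upstream_logical_date: datetime) -> timedelta:
    """Delta to give the sensor so that it targets the upstream run."""
    return downstream_logical_date - upstream_logical_date


def targeted_upstream_date(downstream_logical_date: datetime,
                           delta: timedelta) -> datetime:
    """Logical date the sensor waits for: its own logical date minus the delta."""
    return downstream_logical_date - delta


# Hypothetical schedules: upstream DAG daily at 02:00, downstream DAG daily at 05:00.
upstream_run = datetime(2023, 5, 12, 2, 0)
downstream_run = datetime(2023, 5, 12, 5, 0)

delta = execution_delta(downstream_run, upstream_run)
print(delta)  # 3:00:00
```

If the assumption holds, this `delta` is the value you would pass as `execution_delta` to the sensor in the downstream DAG. When the two DAGs do not share the same frequency a fixed delta is not enough, which is probably where the pain comes from.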
See you in a few days with Data Council takeaways ❤️.