Data News — Week 24.24

Data News #24.24 — I'm back, sorry for the late news. I'm co-organising a conference in Paris in November, the CfP is open, AI news with OpenAI and Apple and a lot of Fast News.

Christophe Blefari

15 Jun 2024 — 8 min read

🥹 It's been a long time since I've put words down on paper or hit the keyboard to send bytes across the network. We're in the age of AI, and my lord, computer science has evolved over the last 30 years. I'm writing this edition from my childhood home, and it brings back memories. I got my first computer at the age of 6 and spent my days installing Windows 98 over and over again, getting lost between the BIOS and the Windows installation pages, playing with Word, Dreamweaver and Adobe Premiere.

My first website is still up somewhere on the internet 🥹 — it was to help my aunt sell her house.

Who would have thought that 25 years later, I'd be celebrating 10 years working with computers? June also marks the third anniversary of this newsletter. Three years ago I started the newsletter to share my expertise with people, and I'm so happy with how it turned out. More than 5000 members subscribed to the newsletter and the blog generated almost 100k unique visitors.

Recently a lot of people subscribed and never received a Data News. I want to give you a warm welcome: this edition marks the start of the journey we embark on together. You will enjoy what's coming next, I'm sure.

I've taken a little forced break because I've been overwhelmed with work lately, juggling a lot of requests and my customers' work. In order to deliver I had to reclaim my Fridays. Around the newsletter there are unfinished projects, the Recommendations page and Qrators, and I'll get back to them starting in July once I'm done with the rest.

Forward data conference ⏩

I'm excited to announce that I am co-organising the Forward Data Conference, a one-day event in Paris. Join us on November 25th as we bring together around 350 attendees and an impressive lineup of speakers. It's going to be an incredible opportunity to connect, learn, and explore the latest in data. We will do our best to make the conference friendly for English speakers.

Forward Data aims to be a hub for knowledge sharing and best practices, offering you the chance to expand your horizons, explore new facets of the data ecosystem, and connect with key international community leaders.

AI News 🤖

A lot of AI news and changes came out in the last 3 weeks. Here is a small recap.

OpenAI
- The superalignment team was disbanded — The goal of the superalignment team was to research all topics related to AGI security. But it seems priorities have been reshuffled. Then OpenAI appointed a former NSA leader (nominated by Donald Trump), who will probably work with the Safety and Security Committee.
- Annualised revenue projected to be $3.4b — It is crazy how the company successfully reached this amount, mainly by selling to enterprise customers. By comparison, Snowflake revenue was $2.8b in 2023.
- Extracting concepts from GPT-4
Apple announced iOS 18 and their own AI — AI will stand for Apple Intelligence. With great ego, Apple appropriates the letters AI. At their annual developer conference (the WWDC) they showcased how AI will be integrated everywhere in iOS:
- Siri has been revamped — now looking like a Microsoft AI copilot, Siri will be able to sort notifications, help you write better or give better contextualised answers. Siri will also integrate with OpenAI through ChatGPT when needed.
- At the same time they announced their models will run on-device (keeping your data safe and private) and when more compute is required they will use a private cloud.
- Writing tools — to bring a few of the best GenAI features: proofreading and rewriting. When selecting a text you will be able to ask the model to rewrite it more professionally, etc.
- Genmoji — a way for your parents to be even cringier in their emoji usage by generating emoji from a sentence.
- Finally, with the new Siri and Writing tools they reworked one of the worst Apple applications: Mail. It gets a better look and new capabilities for email writing.

These join other AI (and GenAI) features that Apple will introduce throughout its products (audio transcription, image generation from tags, better natural language search on photos, etc.). But this anchors Apple as a consumer products company, not an AI company like Google, Microsoft or Meta. For years Apple has chosen to keep its users' data safe and private, which means they don't have a pool of data to train large language models.

How to rethink the recommendation for social networks — This is a short video of Jack Dorsey (Twitter co-founder) about recommendation algorithms and how platforms today should give the choice back to users. It is about free will and how biases and filter bubbles get built: why we should have transparency on what rules the recommendations, and why platforms should offer multiple algorithms and let users decide, like a marketplace.
Changing the GPU is changing the behaviour of your LLM — A cool experiment that shows how the GPU impacts inference.
MLOps coding course — Great MLOps course! It contains 6 chapters and covers all the topics needed to put models in production with the correct choices.
RAG in BigQuery — When you do RAG in a database, it usually comes down to embedding functions and being able to query the resulting vectors efficiently. BigQuery has the whole toolkit to do it and this article showcases it well (and let's be honest, all the competition does the same).
What makes a Gen AI system open? — A paper that surveys 45 models across 14 elements that could define them as open. OLMo 7B Instruct is the most open according to the paper and ChatGPT the least. On the same topic Mozilla released a paper about a framework for openness in foundation models.
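
To make the RAG-in-a-database item above a bit more concrete, here is a minimal sketch of the retrieval step: rank stored vectors by cosine similarity to a query vector and keep the best ones. Everything here is an assumption for illustration. The vectors are hand-written toys rather than real embeddings, the names `DOCUMENTS`, `cosine_similarity` and `top_k` are made up, and this is plain Python, not BigQuery's API.

```python
import math

# Toy 3-dimensional "embeddings" (hypothetical values); a real system
# would compute these with an embedding model and store them in the database.
DOCUMENTS = {
    "doc_iceberg": [0.9, 0.1, 0.0],
    "doc_airflow": [0.1, 0.9, 0.1],
    "doc_duckdb": [0.2, 0.1, 0.9],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, documents, k=2):
    """Return the ids of the k documents most similar to the query, best first."""
    scored = [
        (cosine_similarity(query_vector, vector), doc_id)
        for doc_id, vector in documents.items()
    ]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]


# Hypothetical query embedding, deliberately close to the "doc_duckdb" vector.
print(top_k([0.1, 0.0, 1.0], DOCUMENTS))
```

A database does the same thing at scale, usually with an approximate index so it does not have to score every row.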

Fast News ⚡️

Solving probabilistic Tic-Tac-Toe — Probabilistic tic-tac-toe is like tic-tac-toe but each cell is given a probability distribution. So when you make a play the outcome is random: the cell can end up as x, o, or stay empty. Someone developed a Unity version of the game and someone else wrote a math solver giving the best play at every turn.
Amphi ETL — Amphi is a low-code visual ETL that you can run in JupyterLab. This is super clever. This is the first time I have seen this kind of application running as a Jupyter extension. Worth watching in the future, this is still early.
Compare Airbyte and dlt ways to create custom sources — A large article that compares Airbyte and dlt when it comes to creating custom sources. Both extract and load tools can create custom sources, via the Airbyte low-code CDK or the dlt REST API Source toolkit respectively.
trip.com migrated from 50PB Elastic to ClickHouse — I've never been a fan of NoSQL platforms like ES for data work. This article on the ClickHouse blog showcases how a client migrated their ES cluster to ClickHouse to improve their log querying capabilities. The article also focuses on how to correctly route queries once at scale with multiple CH clusters.
Hunting non optimised queries in ClickHouse — The talk is about ClickHouse but can apply to every engine. In the talk Yohann explains the mechanism he put in place to find non-optimised SELECT queries. He did it with a machine learning model, which means that he identified the features slowing the queries down, like nesting, subqueries, joins and WHERE clauses.
BigTesty — A framework that lets you create integration tests with BigQuery on a real and short-lived infrastructure. It uses Pulumi (an infra-as-code tool) and requires you to give inputs, SQL queries and outputs, then tests them against a dedicated BigQuery project.
Data platform explained part II — Part 2 of the Spotify article about data platforms. They name 3 different steps: data collection, management and processing (and they even mention GDPR), and finally explain how they treat data culture.
What is really Apache Iceberg? — Iceberg has been at the center of the discussion this week. Julien wrote the greatest deep dive you can find on the topic.
Cron expressions with DuckDB — A handy function in DuckDB that can generate time arrays when given a cron syntax; it's more understandable than generate_series().
Serverless Jupyter notebooks at Meta — They developed a system called Bento which allows notebooks to run either with classic kernels or with an in-browser kernel (being really serverless) using Pyodide. They have handy functions to get SQL, Google Sheets or GraphQL data into the browser memory to then work on it.
Airflow new youth — If you stayed with Airflow 1.x or a version prior to 2.6 you might have missed Airflow's new youth. This presentation from Jarek showcases all the recent improvements: data-aware scheduling, deferrable operators, object storage, etc.
A hybrid information retriever with DuckDB — How you can fuse semantic and lexical search with DuckDB. Looks neat.
dbt-score, lint metadata and get max score — Lint your dbt metadata, get a score and be happy in the CI/CD.
Automatically detecting breaking changes in SQL queries — Use the SQLGlot diff function (on the AST) to get what changed in a SQL query and act accordingly.
How I failed to implement dbt — Benoit explains why he failed to implement dbt in his previous role. He identifies 5 errors that led to a failure. As always this is not about a technical issue.
250 European data infrastructure startups and what we learned from them — Another perspective on data infrastructure that nicely complements the MAD landscape. At the end of the page it gives great definitions of every part of a data platform.
The rise of medium code — Between low-code practitioners and software engineers there are medium code practitioners like analytics engineers and data scientists. This code often lives in Python orchestrators and has to be treated correctly because it's production code as well.
Write-Audit-Publish pattern — Once again a great article about this pattern.
How Monzo uses incremental modelling to handle billions of events every day.
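
For readers new to the Write-Audit-Publish pattern listed above, here is a minimal sketch of the idea: new data lands in a staging table, gets checked, and only then reaches the table consumers read. It assumes an in-memory SQLite database, and the table names (`events`, `events_staging`), the function `write_audit_publish` and the quality rule are all made up for illustration. The linked article may implement it quite differently.

```python
import sqlite3


def write_audit_publish(conn, rows):
    """Stage rows, audit them, and publish only if the audit passes."""
    cur = conn.cursor()
    cur.execute("CREATE TABLE IF NOT EXISTS events (id INTEGER, amount REAL)")
    cur.execute("DROP TABLE IF EXISTS events_staging")
    cur.execute("CREATE TABLE events_staging (id INTEGER, amount REAL)")

    # 1. Write: the batch lands in staging, invisible to readers of `events`.
    cur.executemany("INSERT INTO events_staging (id, amount) VALUES (?, ?)", rows)

    # 2. Audit: count rows breaking the (illustrative) quality rule.
    bad_rows = cur.execute(
        "SELECT COUNT(*) FROM events_staging "
        "WHERE id IS NULL OR amount IS NULL OR amount < 0"
    ).fetchone()[0]
    if bad_rows:
        # Discard the staged batch; the published table is left untouched.
        conn.rollback()
        return False

    # 3. Publish: only audited rows are copied to the consumer-facing table.
    cur.execute("INSERT INTO events SELECT id, amount FROM events_staging")
    conn.commit()
    return True


conn = sqlite3.connect(":memory:")
print(write_audit_publish(conn, [(1, 10.0), (2, 5.5)]))  # valid batch
print(write_audit_publish(conn, [(3, -4.0)]))  # negative amount fails the audit
print(conn.execute("SELECT COUNT(*) FROM events").fetchone()[0])
```

The point of the pattern is that a bad batch never becomes visible downstream; in a warehouse the publish step is often a table swap or a branch merge rather than an insert.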

💡

I'm working on a dedicated article about Snowflake and Databricks latest advancements which should be published on Monday.

Data Economy 💰

Mistral raises €600m — Mistral has not really been a French company since its first rounds, but it raises a lot of cash again to go faster.
xAI raises $6b — Late to the party, and it seems no one cares, but Musk tries to fight.
Cube raises $25m — Cube has the most advanced piece of technology today when it comes to the semantic layer and they raised enough money to keep going in this direction.
Snowflake invests in Omni — Omni is a refreshed version of Looker with a fresher LookML.
Databricks acquires Tabular — It created waves last week in the data community. I'll write more about it on Monday.
Tobiko raises $17.3m — The company behind SQLMesh and SQLGlot raises cash to create a suite of tools to invent the data development of tomorrow.
Redpanda acquires Benthos — In the streaming world it was big.

I want to address something weighing on my mind. We've all seen the results of recent European elections and how the far right has influenced public debate and opinion. I strongly believe we should not fall for their tactics or their so-called solutions. In the tech community, many of us are privileged, often due to our financial stability. However, we cannot build a society with only people like us. Because of our privilege, we (1) should vote, (2) should use our vote to support those marginalised by the system.

For my French readers, there are parliamentary elections in France in 15 days. I urge you to vote and to vote against the far right. Hate and division are not solutions. Cutting public services through tax reductions is not a solution. Pushing for more productivity when AI is on the rise is not a solution. Individualism is not a solution. They don't bring any solution.

Consider what the tech ecosystem would look like under far-right principles: diversity stifled, innovation hindered, and global collaboration restricted. These ideologies could limit talent flow, reduce educational programs, and promote censorship and surveillance (which is almost already here; we work in big data, let's face reality), undermining our core values of privacy and open access.

If you feel this message doesn't belong in a tech newsletter or professional sphere, I don't care and you can unsubscribe. However, I believe that advocating for openness and tolerance is essential, and accepting hate speech is unacceptable.

See you next week ❤️.

Data News

Christophe Blefari

Staff Data Engineer. I like 🚲, 🪴 and 🎮. I can do everything with data, just ask.
