Hi, it's been a while since I last posted something here. Happy new year 🎉. I hope you haven't forgotten about me. A lot of things have been happening at the same time in my professional and personal life. To be honest, everything's been going well, but I've found it hard to find time to write among other things.
And that's the problem. I want to do so many things at once. It's quite funny because when I'm coaching someone, one of the first pieces of advice I give them is to stay focused and avoid multitasking, but when it comes to me... Yeah, you know.
However, some excellent articles have been written and I want to end 2023 with one last big wrap on these December articles. I'd also like to say hello to all the newcomers who arrived in December, thank you for your trust. We're going to get to know each other.
Before moving on to the Data News, a bit of personal news: in December, I took part in the MotherDuck meetup in Berlin. I presented what I believe to be the future, based on my DuckDB experiments. I've been especially amazed by DuckDB in the browser with WASM. I'll also go to DuckCon in Amsterdam on February 2nd, so PM me if you're going.
At the end of January, on the 31st, I'll speak at a Modern Data Stack conference in Paris, still about DuckDB, but this time in French. I also took part in my friend's podcast, where we discussed 3 trends in data: data modeling, real-time analytics and DataOps.
My retroprojective—a retro 2023 with a projection into 2024—will soon be written. It will talk about my search for a new spicy adventure, the fact that I've finally taken up running again, my new journey as an angel investor, and so on.
Enjoy this last 2023 Data News.
AI News 🤖
- An interactive 3D explanation of LLMs — Explaining complex things visually is the best way. This one details all the components of an LLM, and a big part explains what a Transformer is.
- LLMs for builders: jargons, theory & history — Mehdi compiled in one large article all the vocabulary needed to follow a basic conversation about generative AI. He even quickly explains how you can run a model on your computer.
- Cocorico 🐓. Mistral AI, one of the French "OpenAI" startups, entered the field by setting new standards and earning recognition. They released their first AI endpoints: generative and embedding. On the generative side they currently have 3 models: tiny, small and medium, which perform well against GPT-3.5. At the same time they released Mixtral 8x7B, the first open-source model of this calibre under an Apache licence. And the weights are open as well.
- What I wish someone had told me — It's borderline AI news, but as the author is Sam Altman, I think it belongs here. After the whole Hollywood thing around Sam being pushed out and then coming back, Sam clickbaited us. He's written 17 great HR / team building tips—but they have nothing to do with the drama we're all living for.
- People underestimate how impactful Scikit-learn continues to be — The year is coming to an end and LinkedIn is playing the 2024 predictions game. Obviously no one will get it right. Meanwhile, one of the Scikit-learn co-founders put the church back in the middle of the village (a French expression, poorly translated). Scikit-learn is still the most used library when you look at some numbers, and LLMs still have to bridge the gap in usage.
- OpenAI prompt engineering guide — Wow, an official guide to become a prompt engineer /s. Seriously, it seems to contain good tips on how to communicate with the algorithm.
- Google announced Gemini, their new multimodal model "beating" GPT-4, but fooled us with an edited video.
Fast News ⚡️
- Airflow 2.8 is out — The Airflow release rhythm is crazy, and I can't keep up with the awesome features that have been added this year. To finish the year, the Airflow team has released improvements to Datasets and a major step forward with the new Object Storage API, which provides a generic abstraction over cloud storage to transfer data from one to another.
- The EU AI Act has passed — After many years of work on the text, the EU has voted for the AI Act to regulate the usage of AI when using European citizens' data. It points to a cheat sheet that summarises what you need to know. In a few words: the AI Act provides the glossary to define what an AI is and defines the boundaries of prohibited and high-risk AIs.
- BigQuery now integrates DuetAI — to help you generate or complete SQL queries.
- AWS announced S3 Express — S3 Express is a new zone with 10x better performance (latency and parallelisation). Paul wrote a few speculations about the new S3 tier; the post is highly detailed and explains very well what to expect. Data Engineering Weekly also wrote some thoughts about it. S3 will still be the king, or the GOAT.
- Idempotence — Matt wrote an article explaining what idempotence is and why it matters in data engineering. Idempotence can be summarised mathematically as f(f(x)) = f(x). It's important in data engineering because re-running a pipeline on the same input should produce the same output. Always keep it in mind when designing a pipeline: it leads to great questions.
- Have I Resolved the Pie Chart Debate? — We all know pie charts are terrible. Nick proposes how to fix the pie chart dilemma.
- How to know if your data team is successful? — Reflections around team performance and how to measure it.
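To make the idempotence item above more concrete, here is a minimal Python sketch. It is not taken from Matt's article: the in-memory "warehouse" (a plain dict keyed by partition date) and the function names `load_append` and `load_overwrite` are hypothetical, chosen only to show why a rerun-safe load behaves like f(f(x)) = f(x).

```python
# Hypothetical in-memory "warehouse": a dict mapping partition date -> list of rows.

def load_append(table, day, rows):
    """Non-idempotent load: every rerun appends the same rows again."""
    table.setdefault(day, []).extend(rows)
    return table


def load_overwrite(table, day, rows):
    """Idempotent load: the partition is replaced, so a rerun changes nothing."""
    table[day] = list(rows)
    return table


rows = [{"order_id": 1, "amount": 10}, {"order_id": 2, "amount": 25}]

appended = {}
load_append(appended, "2023-12-31", rows)
load_append(appended, "2023-12-31", rows)  # a retry duplicates the data

overwritten = {}
load_overwrite(overwritten, "2023-12-31", rows)
load_overwrite(overwritten, "2023-12-31", rows)  # a retry is harmless

print(len(appended["2023-12-31"]))     # 4: the rows were loaded twice
print(len(overwritten["2023-12-31"]))  # 2: same state as after a single run
```

The overwrite variant is the one you can safely retry or backfill, which is usually the question worth asking of every pipeline step.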
Engineering stuff ⚙️
- Netflix internal data engineering Summit — The Netflix team organised an internal conference about DE topics. And they recorded it. 8 videos are on YouTube and, to be honest, this is awesome content to learn patterns and get ideas from the best. They still use technologies around the JVM (Spark and Flink), but unsurprisingly everything revolves around Iceberg, which was created at Netflix.
- Using Netflix Maestro and Apache Iceberg — Going deeper into incremental processing, the engineering team details how they implemented it.
- Introducing WAP pattern support with Apache Iceberg (with SQLMesh) — A small article about an important pattern to avoid putting bad data in production. The WAP pattern (Write-Audit-Publish) lets you first write the data to a staging layer where it is audited; if the audit is green, the data is then published to the production layer. This article is just an entry point to SQLMesh, a dbt alternative, which enables you to do it.
- Use Databricks to read Iceberg tables in Snowflake 🙃 — This post was written by the Snowflake team and reflects Snowflake's strategy to attract customers by being open. Iceberg is the glue here, and it is winning the table format battle. Still, don't do it, and try to avoid a spaghetti data platform.
- Efficient ELT refreshes — Max detailed how he designed his ELT pipelines.
- Run dlt on Lambda to save on extract and load costs — dlt is an open-source Python library to do extract-load. If you want to save on the cost of the different cloud services that move data, it might be an alternative.
- Druid deprecation and ClickHouse adoption at Lyft — Data engineers love migrations. They love talking about the migrations they have done even more. Moving from Druid to ClickHouse looks like a good improvement.
- Data Quality Score: next chapter of data quality at Airbnb — After all the data cataloging vision and trends Airbnb launched, this time they explained how they see dataset quality and how they score it.
- The state of SQL-based observability, on ClickHouse blog.
- Designing OBT and comparing OBT with Star Schema.
- Code review best practices for Analytics Engineers.
- Self-Service data analytics as a hierarchy of needs.
- Easy GCP cost anomaly detection.
- An elegant platform, Grab.
- A guide to MLOps with Airflow and MLflow, TheFork.
- What made Apollo a success — A NASA PDF with 8 articles reprinted from the March 1970 issue of Astronautics & Aeronautics.
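The Write-Audit-Publish pattern mentioned in the SQLMesh item above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: plain lists stand in for the staging and production layers, and `audit` and `write_audit_publish` are hypothetical names with made-up checks. It is not the SQLMesh or Iceberg API.

```python
def audit(rows):
    """Return the list of failed checks; an empty list means the audit is green."""
    failures = []
    if not rows:
        failures.append("batch is empty")
    if any(r.get("amount") is None or r["amount"] < 0 for r in rows):
        failures.append("amount must be a non-negative number")
    ids = [r.get("order_id") for r in rows]
    if len(ids) != len(set(ids)):
        failures.append("order_id must be unique")
    return failures


def write_audit_publish(batch, staging, production):
    """Write the batch to staging, audit it, and publish only if the audit passes."""
    staging.clear()
    staging.extend(batch)        # Write: data lands in staging only
    failures = audit(staging)    # Audit: run the checks on the staged data
    if failures:
        return failures          # production is left untouched
    production[:] = staging      # Publish: swap in the audited data
    return []


production = [{"order_id": 0, "amount": 5}]
staging = []

bad_batch = [{"order_id": 1, "amount": -3}]
good_batch = [{"order_id": 1, "amount": 10}, {"order_id": 2, "amount": 25}]

bad_result = write_audit_publish(bad_batch, staging, production)
after_bad = list(production)     # snapshot: still the previous production data

good_result = write_audit_publish(good_batch, staging, production)
```

The point of the design is that consumers only ever read `production`, so a failed audit costs you freshness, not correctness.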
Data Economy 💰
- Mistral AI raised another €415m at a $2B valuation. The money comes mainly from US-based investors, which will probably change the governance of the company. Is it still French?
- Elon Musk’s generative AI startup xAI looks to raise $1bn.
- AssemblyAI raises $50m. API endpoints to convert voice data to text in all its forms (transcripts, chapters, summaries, etc.).
- Keboola raises $32m in Series A. This is an all-in-one data platform for non-technical data users.
- London-based Harriet raises €1.4 million pre-seed. An AI assistant using HR data to help employees.
- AI data platform VAST Data raises $118m. An all-in-one platform for big corporations to do AI and engineering in the same place.
- Octolis has been acquired by Brevo (ex-SendinBlue). Octolis is a CDP / reverse-ETL solution and Brevo is a CRM, the join makes total sense.
See you this Friday with a post opening 2024 🎊.