<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
    <channel>
        <title><![CDATA[ blef.fr ]]></title>
        <description><![CDATA[ I put words on data engineering. ]]></description>
        <link>https://www.blef.fr</link>
        <atom:link href="https://www.blef.fr" rel="self" type="application/rss+xml"/>


                <item>
                    <title><![CDATA[ Data News — Week 25.43 ]]></title>
                    <description><![CDATA[ Data News #25.43 — A best-of of the last 6 months of articles: AI and data eng stuff that happened. ]]></description>
                    <link><![CDATA[ /data-news-week-25-43/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 68f3cdbc197a63000182f6da ]]></guid>
                    <pubDate><![CDATA[ 2025-10-26 ]]></pubDate>
                    <content>
                        <![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/photo-1471229058801-75ee9a43ef35?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDN8fHN1dHJvJTIwdG93ZXJ8ZW58MHx8fHwxNzYxNTAzOTU5fDA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=2000" class="kg-image" alt="tower surrounded by clouds" loading="lazy" width="4290" height="2802" srcset="https://images.unsplash.com/photo-1471229058801-75ee9a43ef35?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDN8fHN1dHJvJTIwdG93ZXJ8ZW58MHx8fHwxNzYxNTAzOTU5fDA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=600 600w, https://images.unsplash.com/photo-1471229058801-75ee9a43ef35?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDN8fHN1dHJvJTIwdG93ZXJ8ZW58MHx8fHwxNzYxNTAzOTU5fDA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1000 1000w, https://images.unsplash.com/photo-1471229058801-75ee9a43ef35?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDN8fHN1dHJvJTIwdG93ZXJ8ZW58MHx8fHwxNzYxNTAzOTU5fDA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1600 1600w, https://images.unsplash.com/photo-1471229058801-75ee9a43ef35?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDN8fHN1dHJvJTIwdG93ZXJ8ZW58MHx8fHwxNzYxNTAzOTU5fDA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=2400 2400w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Stand out from the cloud (</span><a href="https://unsplash.com/?utm_source=ghost&utm_medium=referral&utm_campaign=api-credit"><span style="white-space: pre-wrap;">Credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p>Hey you. It's been a while! The newsletter is back. So, expect Data News to land in your inbox every week between Friday and Sunday. 
Same recipe as before: a bunch of links about data and AI, topped with my usual spicy opinions.</p><p>Below is a best-of of the last months of Data News, mainly the best articles about the AI and data ecosystem that I've come across. It's a great reading list.</p><h1 id="ai-news-%F0%9F%A4%96">AI News 🤖</h1><ul><li>The consumer AI companies are working on changing the way we browse and consume the internet<ul><li>OpenAI brought new integrations to ChatGPT like <a href="https://openai.com/index/buy-it-in-chatgpt/?ref=blef.fr">shopping</a> and <a href="https://investor.coursera.com/news/news-details/2025/Coursera-Partners-with-OpenAI-to-Bring-Learning-Capabilities-into-the-First-Generation-of-Apps-in-ChatGPT/default.aspx?ref=blef.fr">courses</a>. This is a new way to consume the web: ChatGPT shopping will be a way to monetise, but also a paradigm shift in how we use the internet. OpenAI is trying to rebuild the web from within a chat, shortcutting the browser. But they also released a browser this week, named <a href="https://openai.com/index/introducing-chatgpt-atlas/?ref=blef.fr">Atlas</a>.</li><li>Browsers are getting more and more AI capabilities. Whether it's <a href="https://www.diabrowser.com/?ref=blef.fr">Dia</a> or <a href="https://www.perplexity.ai/comet?ref=blef.fr">Comet</a>, the goal is to give AI browsing capabilities as if it were human. Might this be a transitional phase until the whole web gets destroyed <a href="https://blog.cloudflare.com/introducing-pay-per-crawl/?ref=blef.fr">because</a> of <a href="https://www.zdnet.com/article/cloudflare-just-changed-the-internet-and-its-bad-new-for-the-ai-giants/?ref=blef.fr">bots</a> and regular websites disappear?</li></ul></li><li><a href="https://magazine.sebastianraschka.com/p/from-gpt-2-to-gpt-oss-analyzing-the?ref=blef.fr">From GPT-2 to gpt-oss: analysing the architectural advances</a> — A great deep-dive to understand the architecture behind all the GPTs. 
What layers are used, how gpt-oss compares with Qwen, etc. If you speak French, Defend Intelligence <a href="https://www.youtube.com/watch?v=v5JPwgLKb4Q&ref=blef.fr">redeveloped a GPT</a> for a YouTube video.</li><li><a href="https://www.mechanize.work/blog/the-upcoming-gpt-3-moment-for-rl/?ref=blef.fr">The upcoming GPT-3 moment for RL</a> — A short essay about the current state of reinforcement learning, which needs to move beyond being task-specific in order to scale. Scaling RL would require something like replication training: a set of specs to reproduce complex RL scenarios.</li><li><a href="https://stackoverflow.blog/2025/09/10/ai-vs-gen-z/?ref=blef.fr">AI vs Gen Z</a> — How AI has changed the career pathway for junior developers. It was posted on the Stack Overflow blog, which ironically has also been heavily impacted by AI over the last 2 years. It describes the current situation well: being a junior developer was already difficult and AI made it worse (a 25% decrease in junior job postings in 2024), and employment for software engineers has dropped nearly 20% since its 2022 peak. <br><br>After years of seeing software engineering as a promising career, AI is changing everything: we don't learn as much as before, we don't need interns or juniors, and salaries might decrease if the job becomes less complex. But if you don’t hire junior developers, someday you’ll have no senior developers.</li><li><a href="https://engineering.zalando.com/posts/2025/09/dead-ends-or-data-goldmines-ai-powered-postmortem-analysis.html?ref=blef.fr">Use LLMs to analyse postmortems at Zalando</a> — Large companies often have a large backlog of postmortems (memos written after incidents), and analysing them might be a great use of AI. 
They designed a multi-stage pipeline: summarisation, classification, analysis, patterns and opportunities.</li><li><a href="https://jeremyberman.substack.com/p/how-i-got-the-highest-score-on-arc-agi-again?ref=blef.fr">How I got the highest score on ARC-AGI again swapping Python for English</a> — ARC-AGI is a benchmark, an intelligence test designed to measure pattern recognition over puzzles that humans can easily solve.<br><br>Currently a human panel scores 98%, while GPT-5 Pro scores 18%. The author of the article scored 29% by switching from code to English.</li><li><a href="https://www.praf.me/ai-coding?ref=blef.fr">An unusual consequence of AI coding</a> — <em>"What AI coding has taken away is the time where you know exactly what you want to implement and have a rough mental model of how to do it [...] There was a beauty and joy to this part that I miss, a flow state you can hit with a nice linear progression"</em>. Probably what factory workers might have said when their factories got automated? We no longer have to think the way we used to. <br><br>Related: <a href="https://kix.dev/dumb-cursor-is-the-best-cursor/?ref=blef.fr">Dumb Cursor is the best Cursor</a>.</li><li><a href="https://damek.github.io/random/basic-facts-about-gpus/?ref=blef.fr">Basic facts about GPUs</a> — Explains how GPU compute and memory work, and the different performance regimes: memory-bound, compute-bound and overhead.</li><li><a href="https://blog.trailofbits.com/2025/08/21/weaponizing-image-scaling-against-production-ai-systems/?ref=blef.fr">Prompt injection attacks through images</a> — Hide text in an image so that it becomes readable when the image gets downsampled or filtered. 
If an LLM interprets this text, it's an attack surface whenever people add images to their chat conversations.</li><li><a href="https://about.datnguyen.de/blog/internal/context-engineering-modern-llm-ecosystem/?ref=blef.fr">Context Engineering: How RAG, agents, and memory make LLMs actually useful</a> and <a href="https://thenewaiorder.substack.com/p/learn-agentic-ai-a-beginners-guide?ref=blef.fr">Learn Agentic AI: A Beginner’s Guide to RAG, MCPs, and AI Agents</a> — Two guides to explore agentic concepts.</li><li><a href="https://www.databricks.com/blog/building-state-art-enterprise-agents-90x-cheaper-automated-prompt-optimization?ref=blef.fr">Use GEPA automated prompt optimisation to surpass Claude Opus 4.1</a> —&nbsp;Databricks achieved great performance after prompt optimisation on gpt-oss-120b.</li><li>[study] <a href="https://andymasley.substack.com/p/a-cheat-sheet-for-conversations-about?ref=blef.fr">Using ChatGPT is not bad for the environment</a> — A cheat sheet about carbon emissions related to LLMs. </li><li>[paper] <a href="https://arxiv.org/pdf/2505.23836?ref=blef.fr">Large Language Models often know when they are being evaluated</a>.</li><li>[podcast] <a href="https://www.youtube.com/watch?v=RqWIvvv3SnQ&ref=blef.fr">How GPT-5 thinks</a> — From OpenAI’s VP of Research Jerry Tworek. He explains how reasoning works.</li><li>[paper] <a href="https://cdn.openai.com/pdf/a253471f-8260-40c6-a2cc-aa93fe9f142e/economic-research-chatgpt-usage-paper.pdf?ref=blef.fr">How people use ChatGPT</a> — OpenAI ran a classifier on 1.1m sample conversations to understand how their 800m+ weekly active chatters are using the AI. 
It shows how broadly people use AI in their everyday lives.</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2025/10/Screenshot-2025-10-26-at-09.57.08.png" class="kg-image" alt="" loading="lazy" width="1212" height="604" srcset="https://www.blef.fr/content/images/size/w600/2025/10/Screenshot-2025-10-26-at-09.57.08.png 600w, https://www.blef.fr/content/images/size/w1000/2025/10/Screenshot-2025-10-26-at-09.57.08.png 1000w, https://www.blef.fr/content/images/2025/10/Screenshot-2025-10-26-at-09.57.08.png 1212w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Breakdown of granular conversation topic shares from approximately 1.1 million sampled conversations from May 15, 2024 through June 26, 2025 (extracted from the paper How people use ChatGPT).</span></figcaption></figure><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li><a href="https://www.youtube.com/watch?v=GfH4QL4VqJ0&ref=blef.fr">Python, the documentary</a> — A great documentary about Python and its origins: how the initial community was built and what it takes to create such a widely used piece of open-source software. Python scores 25% on the popularity index (<a href="https://www.tiobe.com/tiobe-index/?ref=blef.fr">TIOBE</a>), while the numbers 2 and 3, C and C++, are at 9% each.</li><li>The Apache Airflow Summit took place a few weeks ago; the videos are not out yet, but Marc Lamberti shared a few takeaways on LinkedIn, like how <a href="https://www.linkedin.com/feed/update/urn:li:activity:7381937258410455040/?updateEntityUrn=urn%3Ali%3Afs_updateV2%3A%28urn%3Ali%3Aactivity%3A7381937258410455040%2CFEED_DETAIL%2CEMPTY%2CDEFAULT%2Cfalse%29&ref=blef.fr">Duolingo</a> is using Airflow. 
<a href="https://airflow.apache.org/blog/airflow-3.1.0/?ref=blef.fr">Airflow 3.1</a> has also been released.</li><li>Python 3.14 is out — it gets a natural <a href="https://blog.miguelgrinberg.com/post/python-3-14-is-here-how-fast-is-it?ref=blef.fr">performance uplift</a> and paves the way for the <a href="https://realpython.com/python-gil/?ref=blef.fr">GIL</a> changes.</li><li>Astral's innovations in the Python ecosystem — Astral is changing Python tooling forever with greatly crafted products. Recently they released:<ul><li><a href="https://github.com/astral-sh/ty?ref=blef.fr">astral/ty</a> — A Python type checker, written in Rust (obviously), that runs faster than anything else.</li><li>uv — <a href="https://pydevtools.com/blog/uv-format-code-formatting-comes-to-uv-experimentally/?ref=blef.fr">uv format</a> (which might replace black). And, funny thing, someone <a href="https://mildbyte.xyz/blog/solving-wordle-with-uv-dependency-resolver/?ref=blef.fr">solved Wordle using uv</a>'s dependency resolver.</li><li><a href="https://astral.sh/blog/introducing-pyx?ref=blef.fr">pyx</a> — If you need a private package registry, Astral created pyx. It might be their way to make money at the Enterprise level to keep working on this great tooling.</li></ul></li><li><a href="https://luminousmen.substack.com/p/how-not-to-partition-data-in-s3-and?ref=blef.fr">How not to partition data in S3 and what to do instead</a> — When you need to partition by date on S3, you should partition using the <code>YYYY-MM-DD</code> format.</li><li><a href="https://clickhouse.com/blog/moosestack-does-olap-need-an-orm?ref=blef.fr">Does OLAP need an ORM?</a>&nbsp;— Great question. An ORM can bring type-safety to SQL generation because database objects are translated into the native programming language. This way, AI knows the types when generating objects and might know if something will fail before it hits the database. 
As chat-with-your-data gets tried at more and more companies, this is maybe a requirement we actually need.</li><li>Some news about the Iceberg / lakehouse ecosystem.<ul><li><a href="https://github.com/ClickHouse/ClickHouse/pull/82692?ref=blef.fr">ClickHouse</a> and <a href="https://ducklake.select/2025/09/17/ducklake-03/?ref=blef.fr">DuckLake</a> now support writing to Iceberg.</li><li><a href="https://maxhalford.github.io/blog/ducklake-thoughts/?ref=blef.fr">Thoughts on DuckLake</a> — Max explains why DuckLake might be a big thing when it comes to improving the local developer experience, as DuckLake can make DuckDB function as a data warehouse. Imagine if, while developing, you could run your usual BigQuery pipelines locally on the production data (that is available on GCS).</li><li><a href="https://blog.cloudflare.com/cloudflare-data-platform/?ref=blef.fr">Cloudflare data platform</a> — Cloudflare announced their lakehouse platform based on <a href="https://www.cloudflare.com/developer-platform/products/r2/?ref=blef.fr">R2</a> (S3-compatible storage). They released R2 Catalog (a fully managed Iceberg catalog) and R2 SQL. R2 SQL <a href="https://blog.cloudflare.com/r2-sql-deep-dive/?ref=blef.fr">relies on Apache DataFusion</a>.</li><li><a href="https://tobilg.com/the-age-of-10-dollar-a-month-lakehouses?ref=blef.fr">The age of the 10$ lakehouse</a> — A great deep-dive into the combination of the 2 previous bullet points. It's awesome to see this new kind of data platform. Back then, they moved away from Fivetran + Snowflake to CDC with Debezium + Hudi (an Iceberg alternative). 
</li><li><a href="https://www.notion.com/blog/building-and-scaling-notions-data-lake?ref=blef.fr">Building and scaling Notion's data lake</a> — An old article about how Notion structured their data lake.</li></ul></li><li><a href="https://perspectives.datainstitute.io/the-minimalists-data-stack-19b0a0aeef3e?ref=blef.fr">The minimalist data stack</a> — A 5-part article describing a dltHub + dbt + BigQuery data stack.</li><li>If you missed it, Fivetran and dbt Labs are merging; <a href="https://www.blef.fr/data-news-dbt-coalesce-2025/">here are my thoughts</a>.</li><li><a href="https://medium.com/@hugo.hauraix/redefining-analytics-roles-at-decathlon-aligning-skills-and-practices-for-future-ready-insights-9abfc00d01b1?ref=blef.fr">Redefining analytics roles and aligning skills and practices for future-ready insights</a> — How to rebalance the skills and responsibilities when analytics engineering becomes a bottleneck.</li><li><a href="https://www.datacult.com/post/the-data-modeling-framework-every-analytics-engineer-should-know?ref=blef.fr">Data modeling framework</a> + <a href="https://www.dataengineeringweekly.com/p/revisiting-medallion-architecture-760?ref=blef.fr">revisiting medallion architecture</a> — Would you take a bit of data modeling content?</li><li><a href="https://medium.com/doctolib/analytics-at-scale-the-frameworks-behind-monitoring-100-features-1-2-1012d3c0bbd3?ref=blef.fr">Analytics at scale</a> — How to do product analytics at scale when tens of new features are released every week and product teams want to understand what's happening. 
The article shares the organisation Doctolib implemented and the data modeling that was put in place to make it work.</li><li><a href="https://medium.com/blablacar/scaling-success-the-dbt-ecosystem-at-blablacar-c214c4b8f0cb?ref=blef.fr">Scaling Success: The dbt ecosystem at BlaBlaCar</a> — What a team of 45+ engineers had to put in place to make their dbt setup work for everyone: dev containers + extensions + a few dbt packages. If you want the same setup without doing anything, you can use <a href="https://getnao.io/?ref=blef.fr">nao</a>.</li><li><a href="https://netflixtechblog.medium.com/data-as-a-product-applying-a-product-mindset-to-data-at-netflix-4a4d1287a31d?ref=blef.fr">Data as a product</a>, applying a product mindset to data at Netflix.</li><li><a href="https://www.linkedin.com/feed/update/urn:li:activity:7364294652977364993/?updateEntityUrn=urn%3Ali%3Afs_updateV2%3A%28urn%3Ali%3Aactivity%3A7364294652977364993%2CFEED_DETAIL%2CEMPTY%2CDEFAULT%2Cfalse%29&ref=blef.fr">RIP Tableau</a> — 2 months ago Voi killed Tableau and switched to an LLM as a bridge in Slack and Sheets to accomplish what was possible in Tableau before. It required an effort on metrics definitions though.</li><li><a href="https://towardsdatascience.com/why-bi-in-the-ai-age/?ref=blef.fr">Why BI in the AI age</a> — "<em>Great analytics isn’t about generating charts quickly, it’s about building confidence in decisions through rigorous investigation of data. 
Every discovery, design choice, and contextual annotation represents a human analyst’s business intelligence."</em></li><li><a href="https://www.danhock.co/p/vibe-analysis?ref=blef.fr">Vibe Analysis</a> — The other side of the coin.</li><li><a href="https://www.counting-stuff.com/doing-sql-work-with-llm-aids-as-a-sql-addict/?ref=blef.fr">Doing SQL work with LLM aids as a SQL addict</a>.</li></ul><p>I'll be speaking at <a href="https://odsc.ai/?ref=blef.fr">ODSC AI</a> next Tuesday about <em>Building AI Agents is data engineering</em>.</p><hr><p>See you next week!</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — dbt Coalesce 2025 ]]></title>
                    <description><![CDATA[ Data News — dbt Coalesce 2025. Thoughts on the Fivetran + dbt Labs merger, what it means for the data ecosystem, and more. ]]></description>
                    <link><![CDATA[ /data-news-dbt-coalesce-2025/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 68f5412959b34a00012925ff ]]></guid>
                    <pubDate><![CDATA[ 2025-10-20 ]]></pubDate>
                    <content>
                        <![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/photo-1605833556294-ea5c7a74f57d?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDN8fHZlZ2FzfGVufDB8fHx8MTc2MDkxNTQzM3ww&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=2000" class="kg-image" alt="welcome to fabulous las vegas nevada signage" loading="lazy" width="4288" height="2848" srcset="https://images.unsplash.com/photo-1605833556294-ea5c7a74f57d?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDN8fHZlZ2FzfGVufDB8fHx8MTc2MDkxNTQzM3ww&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=600 600w, https://images.unsplash.com/photo-1605833556294-ea5c7a74f57d?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDN8fHZlZ2FzfGVufDB8fHx8MTc2MDkxNTQzM3ww&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1000 1000w, https://images.unsplash.com/photo-1605833556294-ea5c7a74f57d?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDN8fHZlZ2FzfGVufDB8fHx8MTc2MDkxNTQzM3ww&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1600 1600w, https://images.unsplash.com/photo-1605833556294-ea5c7a74f57d?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDN8fHZlZ2FzfGVufDB8fHx8MTc2MDkxNTQzM3ww&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=2400 2400w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Vegas baby (</span><a href="https://unsplash.com/?utm_source=ghost&utm_medium=referral&utm_campaign=api-credit"><span style="white-space: pre-wrap;">Credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p></p><p>Hey here. I hope this email finds you well. My dear Data News has been a bit neglected these past few months—I’ve been busy with my other gig (that you might <a href="https://getnao.io/?ref=blef.fr">nao</a>). But don’t think I forgot you. 
Every Friday, I thought of you and this little corner of data passion we share.</p><p>I’ve decided the weekly write-ups are coming back—they have to. Earlier this year, I went through YC, and the aftermath took up way more of my time than I expected (especially the unplanned detour into hiring). But that’s over now. It’s time to get back to basics.</p><p>So, expect Data News to land in your inbox every week between Friday and Sunday. Same recipe as before: a bunch of links about data and AI, topped with my usual spicy opinions.</p><p>To reboot the machine this week, I’m sharing my take on <a href="https://coalesce.getdbt.com/event/21662b38-2c17-4c10-9dd7-964fd652ab44/summary?ref=blef.fr"><em>dbt Coalesce</em></a> and the <a href="https://www.getdbt.com/blog/dbt-labs-and-fivetran-merge-announcement?ref=blef.fr">dbt Labs + Fivetran</a> merger that caught everyone by surprise over the past few weeks. Next week will be a “best of” edition—a curated collection of the most interesting articles from the last six months. It’s a great one, don't miss it.</p><h1 id="a-bit-of-history">A bit of history</h1><p>If you were living in a cave last month, you might have missed some big news. But before jumping in, a bit of history.</p><p>I started my journey in data back in 2014, at the height of the Big Data era—when Hadoop was on everyone’s lips and companies were throwing hundreds of thousands of euros at infrastructure, teams, and software. It feels like another lifetime, when building a recommendation system was a multi-month, six-figure project.</p><p>But that wasn’t even the beginning. The story starts in the late ’70s and early ’80s when the term <em>data warehouse</em> was coined (did you know Excel was created in 1985??). From early Oracle data warehouses to Hadoop, one pattern stayed the same: these tools were painful to use. 
Getting into data required obscure knowledge you couldn’t find in school, and the technology itself was… well, a bit of a nightmare (or a Java nightmare).</p><p>Then came AWS and the cloud, which simplified a lot of what we were doing. BigQuery made it even easier: just throw in your data, query it, and pay per query. In 2018, I migrated a 3 TB exploding Postgres warehouse to BigQuery, cutting query times from hours to seconds. Everything ran through Airflow, orchestrating extraction and transformation. Like thousands of others, I had unknowingly built a quasi-dbt.</p><p>At that time, Airflow was the glue. Every issue, every new need meant extending our internal Airflow framework — even reverse ETL was just another DAG in our dynamic DAG factory. Then, after being laid off following an <a href="https://techcrunch.com/2020/04/16/kapten-merges-with-parent-company-free-now-starts-restructuring-plan/?ref=blef.fr">acquisition</a>, I went freelance and worked on my first dbt project. At first, I wasn’t convinced — then it clicked. It was exactly what I’d built internally, but open-source, standardised, and ready to become the industry norm. It empowered less-technical users while letting data engineers focus on keeping the platform running.</p><p>dbt also helped make SQL-first thinking mainstream. For years, it was SQL data engineers vs. JVM data engineers. The former chilling, the latter raging about our pipelines not being type-safe. Then came the “Modern Data Stack”: ingest with a paid tool (and a few custom scripts when it fails), transform with dbt on your warehouse, and visualize with two BI tools — Tableau for execs, Metabase for everyone else. We’d unbundled Airflow into a set of specialized tools.</p><p>It was a great run. But the fun had to end sometime. After a decade of building the foundations of the modern data stack, did we finally get it right? So that when AI arrived, we could just plug it in and have it magically work? 
Read clean metric definitions from a central warehouse and deliver a single, shared version of “revenue”?</p><p>We <em>did</em> nail that… right? Right?</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2025/10/IMG_7123.jpg" class="kg-image" alt="" loading="lazy" width="2000" height="1105" srcset="https://www.blef.fr/content/images/size/w600/2025/10/IMG_7123.jpg 600w, https://www.blef.fr/content/images/size/w1000/2025/10/IMG_7123.jpg 1000w, https://www.blef.fr/content/images/size/w1600/2025/10/IMG_7123.jpg 1600w, https://www.blef.fr/content/images/size/w2400/2025/10/IMG_7123.jpg 2400w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Coalesce 2025.</span></figcaption></figure><h1 id="nobody-got-fired-for-choosing-dbt">Nobody got fired for choosing dbt</h1><p>I’d say AI has arrived and it’s forcing companies to move faster than ever. This is the era of building AI apps on top of AI, not the time to pour more resources into data pipelines. That’s why we’re seeing consolidation: companies want bundled services, a single invoice, fancy certifications, and the comfort of validation from expensive tools.</p><p>All of this sets the stage for where we are now. The Modern Data Stack has become the useful idiot of the moment, replaced by the Analytics and AI Stack. Because, of course, <a href="https://www.instagram.com/p/DCU9CCFT2Le/?ref=blef.fr">AI runs on data, data runs on dbt</a>.</p><p>Every year since 2020, dbt Labs has held an annual conference called Coalesce—not to be mixed up with <a href="https://coalesce.io/?ref=blef.fr">Coalesce.io</a> (one of their Enterprise competitors, and a <a href="https://www.govinfo.gov/content/pkg/USCOURTS-paed-2_22-cv-03324/pdf/USCOURTS-paed-2_22-cv-03324-0.pdf?ref=blef.fr">lawsuit</a> one at that). 
I covered the <a href="https://www.blef.fr/dbt-coalesce-takeaways/">2021</a> and <a href="https://www.blef.fr/dbt-coalesce-takeaways-2022/">2022</a> Coalesces from abroad, and this year I had the chance to go in person and live the hype. Here are my main takeaways:</p><ul><li><strong>The merge</strong> — Fivetran and dbt Labs are merging (an all-stock deal) and will provide a first-of-its-kind open data infrastructure (see below). Pay attention: the merged company will do a lot of things "open", but in their words open means open standards, not necessarily open-source.</li></ul><figure class="kg-card kg-image-card"><img src="https://www.blef.fr/content/images/2025/10/image.png" class="kg-image" alt="" loading="lazy" width="1974" height="1152" srcset="https://www.blef.fr/content/images/size/w600/2025/10/image.png 600w, https://www.blef.fr/content/images/size/w1000/2025/10/image.png 1000w, https://www.blef.fr/content/images/size/w1600/2025/10/image.png 1600w, https://www.blef.fr/content/images/2025/10/image.png 1974w" sizes="(min-width: 720px) 720px"></figure><ul><li><strong>The vision</strong> — The follow-up to the open data infra vision: their biggest competitors (Databricks, Snowflake, Fabric, BigQuery) are all selling storage and compute, but dbt Labs/Cloud/Platform is the only valid and trustworthy platform to do data on, because it doesn't vendor-lock you into a compute engine and provides open standards for all other parts of the stack. So you can switch to something else.</li><li><strong>dbt Fusion</strong> — There is an economic reality: dbt acquired SDF Labs to develop Fusion. The 2 main selling points of Fusion are cost cutting and a better developer experience. 
I guess it's schizophrenic to sell cost cutting (~50%) while selling compute.</li><li><strong>Open standards</strong> — dbt Labs is rooting for and supporting open standards<ul><li>Iceberg and the lake — with a Fivetran EL destination as a data lake and dbt capabilities to support Iceberg catalogs / adapters. Fivetran is now also a data lake company.</li><li>OSI — the <a href="https://www.snowflake.com/en/blog/open-semantic-interchange-ai-standard/?ref=blef.fr">Open Semantic Interchange</a>, to create a unified way to define metrics with other big companies (Snowflake, Salesforce, etc.) and to (re)open-source MetricFlow (after taking it private following the <a href="https://www.getdbt.com/blog/dbt-acquisition-transform?ref=blef.fr">Transform acquisition</a>)</li><li>SQL and ADBC — dbt obviously relies heavily on these</li><li>MCPs — because it's AI, the <a href="https://github.com/dbt-labs/dbt-mcp?ref=blef.fr">dbt MCP</a> provides new capabilities to do stuff with AI.</li></ul></li><li><strong>But what about open-source</strong> — This is our friend who stayed out of the party. They tried the magic trick of making us believe Fivetran is a true open-source contributor with 100+ open-source repos (they have 278, but their most-starred repo has 184 stars; even my <a href="https://github.com/Bl3f/yato?ref=blef.fr">yato</a> thing has more).<ul><li>dbt Language — it was the best demonstration of the Community keynote: what makes dbt dbt is the language, and the fact that everyone speaks the same way with a bunch of SQL and YAML files. It unified the way we define transformations. The dbt language is an open standard. 
It's a way to organise files that can be picked up by whatever engine; today we have dbt Core and dbt Fusion as engines.</li><li>They announced that dbt Core will be maintained for the foreseeable future while maintaining dbt Fusion at the same time, which means spending twice the effort to support language evolution in a codebase that wasn't meant for this.</li><li>Just remember that dbt Core is stupid as fuck: it's just a templating engine organising files in a DAG thanks to manually declared relationships, whereas Fusion understands SQL by parsing it.</li><li>dbt is the common language of 90,000 data teams around the world. That's a lot.</li></ul></li><li><strong>Coalesce</strong> — I was a bit disappointed by the quality of the talks at Coalesce this year. Some speakers didn’t seem to even be using dbt, while others gave entry-level presentations aimed at… well, I’m not sure who. In the past, I always learned something new from Coalesce, but this year felt like a turning point. The tool has gone mainstream, reaching Enterprise™ levels and drifting away from its original community of people hacking around dbt Core. That said, the people I met and the conversations I had were great.</li></ul><h1 id="conclusion">Conclusion</h1><p>I’ve been a dbt (Core) advocate for years. If you look through this blog, it’s probably the most mentioned technology here—alongside Airflow and DuckDB. Those three tools share something fundamental: they’re open-source and community-driven. In France, I helped run the Airflow community for a few years, later became known as a dbt expert, and at one point people even thought I worked for the ducks.</p><p>The reason I’ve spent the past eight years sharing and writing about these tools is simple: they were open-source. I was happy to give my time with no direct return because it felt like my own way of contributing back. But lately, something feels broken in my relationship with dbt. 
It’s not the merger itself—it’s the direction, the shift in strategy.</p><p>dbt Labs now seems focused on the Fortune 500. The new features aren’t made for someone like me anymore. Why would I need a drag-and-drop UI when that’s exactly what I tried to escape early in my career (hello, Talend)? Why would I pay $10,000 to run a simple SQL-only DAG? The new company’s focus just doesn’t speak to me as a data engineer.</p><p>Of course, as a founder, I understand why they’re doing it. They have to make money eventually, and I don’t have a solution to this. This is just my perspective.</p><p>And we shouldn’t forget SQLMesh—the only real open-source alternative to dbt Core—which quietly disappeared after an acquisition not long before all this. I can’t help but think that was part of a larger chess game, by Fivetran, to smooth the path for the dbt Labs deal and remove the one viable option that could have welcomed dbt users looking for an exile.</p><p>I bet the consolidation is not finished yet: it's either a bigger fish acquiring the new venture or dbtran acquiring a catalog and/or an orchestrator.
Dagster would be, I think, a match made in heaven.</p><hr><p>If you're looking for an exile, there are some alternatives when it comes to transformation: <a href="https://www.bauplanlabs.com/?ref=blef.fr">bauplan</a>, <a href="https://getbruin.com/?ref=blef.fr">bruin</a>, <a href="https://github.com/carbonfact/lea?ref=blef.fr">lea</a> and for ingestion: <a href="https://dlthub.com/?ref=blef.fr">dltHub</a> (and all the tools based on it).</p><h3 id="other-writers">Other writers</h3><ul><li>You can read Hugo's <a href="https://medium.com/@hugolu87/is-dbt-%EF%B8%8F-tbd-everything-you-need-to-know-post-coalesce-2025-02f93cbc19cc?ref=blef.fr">views on the matter here</a> — he goes a bit deeper into the Iceberg / open compute topic, which I squeezed a bit because this post is already too long.</li><li><a href="https://www.linkedin.com/posts/christopheblefari_wrap-up-of-day-1-of-dbt-coalesce-takeaways-activity-7384046814553014273-Vlnp/?ref=blef.fr">My views on LinkedIn after Day 1 of Coalesce</a>.</li><li><a href="https://benn.substack.com/p/in-the-air?ref=blef.fr">Benn's poetic views</a>.</li></ul><p>See you next week ❤️ (and this time it's real).</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Forward Data Conference + some news ]]></title>
                    <description><![CDATA[ Data News are coming back and Forward Data Conference CfP still open until next Sunday! ]]></description>
                    <link><![CDATA[ /forward-data-conference-some-news/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 686257b8dda8f90001acbb9d ]]></guid>
                    <pubDate><![CDATA[ 2025-06-30 ]]></pubDate>
                    <content>
<![CDATA[ <p>Hey Data News readers. Sorry for being absent for the last 2 months; I was in SF working on <a href="https://getnao.io/?ref=blef.fr">nao</a> because we went through <a href="https://www.ycombinator.com/?ref=blef.fr">Y Combinator</a>. To be honest it was an intense 3 months and an awesome experience. </p><p>Small heads-up: I'm organising the Forward Data Conference (2nd edition) on November 24th in Paris and we are cooking a great program!</p><p><strong>The </strong><a href="https://conference-hall.io/forward-data-conference-2025?ref=blef.fr"><strong>call for talk proposals</strong></a><strong> ends this Sunday (July 6th), so make sure to propose a talk this week if you wanna join this awesome moment! </strong>We are welcoming speakers of all levels for everything about data; English and French submissions are welcome.</p><p>We are also looking for sponsors to make this event awesome and unforgettable! We have announced <a href="https://omni.co/?ref=blef.fr">Omni</a> as our first platinum sponsor.</p><hr><p>And one last thing. Big news, starting this <strong>Friday Data News are coming back</strong>! Be ready, I miss you.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 25.15 ]]></title>
                    <description><![CDATA[ Data News #25.15 — Arrived in San Francisco, Llama 4 is out, reasoning hacks, MCP hype, Iceberg stuff. ]]></description>
                    <link><![CDATA[ /data-news-week-25-15/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 67f94fb975ec530001cce8e2 ]]></guid>
                    <pubDate><![CDATA[ 2025-04-14 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2025/04/image.png" class="kg-image" alt="" loading="lazy" width="900" height="600" srcset="https://www.blef.fr/content/images/size/w600/2025/04/image.png 600w, https://www.blef.fr/content/images/2025/04/image.png 900w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Painted ladies in SF (</span><a href="https://unsplash.com/photos/lined-of-white-and-blue-concrete-buildings-HadloobmnQs?ref=blef.fr"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p>Hey here. What's up? While you're all data vibing I'm sliding into your inbox with the fresh Data News of the last month.</p><p>I have moved to San Francisco for the next 3 months, so if you're in town and wanna talk data or go for a run, you know where to find me. It's been a week since we arrived with the <em>nao Labs</em> team in SF and it has been a blast. We will be at the <a href="https://www.datacouncil.ai/bay-2025?ref=blef.fr">Data Council</a> pitching at the <a href="https://www.datacouncil.ai/talks25/ai-launchpad-2025-nao?ref=blef.fr">AI Launchpad</a> on the 22nd.</p><p>I'm planning to create content for you to follow the Data Council from the inside, as it has always been great to write takeaways about the talks these last years (<a href="https://www.blef.fr/data-council-austin-takeaways/">2023</a> and <a href="https://www.blef.fr/data-news-week-24-20/">2024</a>).</p><h1 id="ai-news-%F0%9F%A4%96">AI News 🤖</h1><ul><li><a href="https://platform.openai.com/docs/guides/audio?ref=blef.fr">New OpenAI text-to-speech model</a> — OpenAI released a new text-to-speech model, available through their API; it looks better than the Whisper baseline. There is a <a href="https://www.openai.fm/?ref=blef.fr">demo</a> website which is quite impressive.
</li><li><a href="https://www.llama.com/?ref=blef.fr">Llama 4 is out</a> — Meta has released the new iteration of their open models. This time it includes 4 models:<ul><li>Llama 4 Scout, a small 17B model. Natively multimodal, it achieves an industry-leading 10M+ token context window and can also run on a single GPU.</li><li>Llama 4 Maverick, a multimodal model, beating GPT-4o and Gemini 2.0 Flash across a broad range of widely reported benchmarks. It can also run on a single host.</li><li>And soon they will release Llama 4 Behemoth (outperforming GPT4.5, Claude Sonnet 3.7, and Gemini 2.0 Pro on several STEM benchmarks) and Reasoning. </li><li>The cool part is that they partnered from the beginning with <a href="https://www.databricks.com/blog/introducing-metas-llama-4-databricks-data-intelligence-platform?ref=blef.fr">Databricks</a> and <a href="https://www.snowflake.com/en/blog/meta-llama-4-now-available-snowflake-cortex-ai/?ref=blef.fr">Snowflake</a> to bring LLMs to your data.</li></ul></li><li><a href="https://www.anthropic.com/research/reasoning-models-dont-say-think?ref=blef.fr">Reasoning models don't always say what they think</a> — A summary of research about the faithfulness of reasoning models. Anthropic discovered that these models, which are not really "reasoning" but rather doing Chain-of-Thought (CoT), are not always honest when they are given a hint (or hacked) about the answer they should provide.</li><li><a href="https://medium.com/gorgias-engineering/how-to-roll-out-a-data-conversational-agent-c6a4b600e4e5?ref=blef.fr">How to roll-out a data conversational agent?</a> — The Gorgias engineering team released a conversational agent to the company using <a href="https://www.getdot.ai/?ref=blef.fr">Dot</a>, an AI Slack bot that answers data warehouse questions.
The article explains super well how Dot fits in the data team and how you can evaluate the AI.</li><li>Google announced AI in their cloud products at Next'25<ul><li><a href="https://cloud.google.com/bigquery/docs/generate-table?ref=blef.fr">AI.GENERATE_TABLE</a> in BigQuery — The setup looks a bit weird because, in order for everything to be accessible, you have to have a model in a dataset and prompts in a table, but this is a great way to extract information from strings <a href="https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-generate-table?ref=blef.fr#example">as per the example</a>.</li><li><a href="https://cloud.google.com/blog/products/data-analytics/looker-bi-platform-gets-ai-powered-data-exploration?hl=en&ref=blef.fr">Talk to your data with Looker</a> — BI tools are already widely deployed in most organizations, and conversational analytics serves as another interface to access the same insights. That's why embedding this "talk to your data" capability within existing BI tools is likely the best path for adoption.
However, this approach might also introduce additional complexity, given that many BI tools are already somewhat cumbersome, cluttered repositories of charts.</li><li>Google announced their <a href="https://github.com/google/adk-python?ref=blef.fr">Agent Development Kit (ADK)</a> and an open protocol to enable communication between agentic apps called <a href="https://github.com/google/A2A?ref=blef.fr">Agent2Agent</a> (A2A).</li><li>They also announced a BigQuery AI engine that does something I don't understand completely: analysts could ask to extract info from an image and match it to a product catalog, and a copilot for Jupyter Notebooks (in Colab) — <a href="https://cloud.google.com/blog/products/data-analytics/data-analytics-innovations-at-next25?hl=en&ref=blef.fr">more about analytics announcements</a>.</li></ul></li><li><a href="https://petrjanda.substack.com/p/lessons-learned-from-building-agent?ref=blef.fr">Lessons learned from building agent that can code like Composer</a> — Petr wrote a great article that explains how coding agents work (which is dead simple). To build the simplest version of a coding agent you need to give it 3 capabilities: <em>list files</em>, <em>read file</em> and <em>write file</em>. With these and a few prompts you can have a working demo in a few minutes, but how do you move forward? He shared the 5 important lessons he got out of this.</li><li><a href="https://www.oxy.tech/blog/introducing-oxy-and-the-future-of-agentic-analytics?ref=blef.fr">Oxy, an open-source agentic analytics framework</a> — Robert and Joseph co-founded <a href="https://www.hyperquery.ai/?ref=blef.fr">Hyperquery</a> back in the day (at the peak of the notebook hype) and they are back with a new journey: Oxy. An open-source framework to create analytics workflows in a friendly way.
Looks promising.</li></ul><p></p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2025/04/image-1.png" class="kg-image" alt="" loading="lazy" width="900" height="506" srcset="https://www.blef.fr/content/images/size/w600/2025/04/image-1.png 600w, https://www.blef.fr/content/images/2025/04/image-1.png 900w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">4 lama (</span><a href="https://unsplash.com/photos/four-beige-camels-R0g6wtDN1M8?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p></p><h3 id="navigate-the-mcp-hype">Navigate the MCP hype</h3><p>If you've been on the internet lately you've surely seen the massive MCP hype: everyone is either building an MCP server, an MCP registry or even a <a href="https://mastra.ai/mcp-registry-registry?ref=blef.fr">registry of registries</a>.</p><p><em>But what's an MCP?</em></p><p>MCP means <a href="https://modelcontextprotocol.io/introduction?ref=blef.fr">Model Context Protocol</a> and is an open protocol created by folks at Anthropic. An MCP is most of the time referenced as a server that encapsulates discoverable tools, prompts and data to be used by an LLM. MCP clients sit on the LLM side and make requests to MCP servers.
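</p><p>Conceptually, the "discoverable tools" part can be sketched with a toy dispatcher (a stdlib-only illustration of the idea, not the official MCP SDK; the <code>tools/list</code> / <code>tools/call</code> method names mirror the protocol, everything else here is made up):</p>

```python
import json

# Toy tool catalog an "MCP server" could expose; the tool and its
# hard-coded answer are hypothetical, for illustration only.
TOOLS = {
    "get_table_metadata": {
        "description": "Return metadata for a table (toy, hard-coded data)",
        "handler": lambda table: {"table": table, "columns": ["id", "amount"]},
    },
}

def handle(request: str) -> str:
    """Dispatch a JSON request the way an MCP server dispatches JSON-RPC calls."""
    req = json.loads(request)
    if req["method"] == "tools/list":
        # The client discovers which tools exist before calling any of them.
        result = [{"name": n, "description": t["description"]} for n, t in TOOLS.items()]
    elif req["method"] == "tools/call":
        result = TOOLS[req["params"]["name"]]["handler"](**req["params"]["arguments"])
    else:
        result = {"error": "unknown method"}
    return json.dumps({"result": result})

# The "LLM side" first discovers tools, then invokes one:
print(handle('{"method": "tools/list"}'))
print(handle('{"method": "tools/call", "params": {"name": "get_table_metadata", "arguments": {"table": "orders"}}}'))
```

<p>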
</p><p>For instance there are a few Snowflake MCP servers; if you add them to Claude, you will be able to query Snowflake from a Claude prompt or get table metadata.</p><ul><li><a href="https://neo4j.com/blog/developer/model-context-protocol/?ref=blef.fr">Everything a Developer needs to know about the MCP</a> — If you want to deep-dive, there is this article about it.</li><li><a href="https://x.com/sama/status/1904957253456941061?ref=blef.fr">OpenAI supports MCP in the agents SDK</a> — The biggest news recently is OpenAI validating the protocol and supporting it (not yet in ChatGPT tho).</li></ul><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li><a href="https://discord.com/blog/overclocking-dbt-discords-custom-solution-in-processing-petabytes-of-data?ref=blef.fr">Overclocking dbt, Discord custom solution</a> — Discord's data platform is huge and they reached a few dbt limitations, especially on the backfilling side (which is not really a dbt strength), so they built their own way to overcome this leveraging the <code>meta</code> tag. They also managed to create isolated environments for the whole team and have a bunch of CI/CD jobs running on each PR validating their own internal rules.
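 The <code>meta</code> tag works for this kind of hack because dbt passes it through without interpreting it, so custom tooling can hang its own config on models; roughly like this (the keys under <code>meta</code> are hypothetical, not Discord's actual setup):

```yaml
# schema.yml sketch: dbt only stores what's under meta and exposes it
# in the manifest; the backfill keys below are invented for illustration.
version: 2
models:
  - name: activity
    config:
      meta:
        backfill:
          strategy: partition_by_day   # hypothetical custom key
          start_date: "2020-01-01"     # read by custom tooling, not by dbt
```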
</li><li><a href="https://squadrondata.com/Databricks-SQL-Warehouse-Limitations/?ref=blef.fr">Current state of Databricks SQL warehouse</a> — Does Databricks SQL outperform Snowflake?</li><li><a href="https://medium.com/google-cloud/deduplication-in-bigquery-tables-a-comparative-study-of-7-approaches-f48966eeea2b?ref=blef.fr">Deduplication in BigQuery, 7 ways to do it</a> — If you usually do deduplication in BigQuery (or elsewhere), here are 7 patterns to achieve it.</li><li><a href="https://cleandataarchitecture.substack.com/p/ensuring-data-contracts-adoption?r=48edk3&utm_campaign=post&utm_medium=web&triedRedirect=true&ref=blef.fr">Ensuring data contracts adoption across an organization</a>.</li><li><a href="https://www.gable.ai/blog/shift-left-data-manifesto?ref=blef.fr">The shift left data Manifesto</a> — I did not read it because it's too long, but Chad has been a shift left advocate for a long time. Which means "Shifting Left means moving ownership, accountability, quality and governance from reactive downstream teams, to proactive upstream teams". Put another way: give software engineers responsibility for the data.</li><li><a href="https://www.linkedin.com/pulse/bi-dead-change-my-mind-dmitry-pavlov-2otae/?trackingId=P2egxlMzTNC6TAPisnxEyA%3D%3D&ref=blef.fr">BI is dead, change my mind</a> — It's ClickHouse's director of engineering's turn to say BI is dead; he saw the light while chatting with ClickHouse using LibreChat + ClickHouse and GitHub MCP servers.
Looking at how chat-for-everything is taking over all over the place, it's only a few months until stakeholders ask for data interfaces using chat.</li><li><a href="https://duckdb.org/2025/04/04/dbt-duckdb?ref=blef.fr">Local data transformation with dbt and DuckDB</a> —&nbsp;Great article showcasing how you can locally do all your transformations today with dbt and DuckDB, and we even got a great <a href="https://duckdb.org/2025/03/12/duckdb-ui?ref=blef.fr">DuckDB local UI</a>.</li><li><a href="https://fromanengineersight.substack.com/p/issue-46-software-is-now-content?ref=blef.fr">Software is now content</a> — I really liked Benoit's post this week.</li><li><a href="https://blog.duolingo.com/dataset-development/?ref=blef.fr">How we built a robust ecosystem for dataset development</a> — Duolingo's process for applying software engineering practices to data modeling, in the sense that datasets are assets that can be treated like APIs.</li></ul><p></p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2025/04/image-2.png" class="kg-image" alt="" loading="lazy" width="2000" height="1500" srcset="https://www.blef.fr/content/images/size/w600/2025/04/image-2.png 600w, https://www.blef.fr/content/images/size/w1000/2025/04/image-2.png 1000w, https://www.blef.fr/content/images/size/w1600/2025/04/image-2.png 1600w, https://www.blef.fr/content/images/size/w2400/2025/04/image-2.png 2400w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">A rare Iceberg table in real life (</span><a href="https://unsplash.com/photos/white-and-gray-rock-formation-on-blue-sea-under-blue-sky-during-daytime-l6OraG-v0d8?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p></p><h3 id="navigating-iceberg-landscape">Navigating Iceberg landscape</h3><p>A lot also happened over the last month in the data engineering space,
especially around Iceberg, which is taking over a lot of discussions when it comes to data storage. </p><p><em>Why is Iceberg so important right now?</em></p><p>Iceberg is a way to escape data warehouses and build your own warehouse as a kit on top of bucket storage. Iceberg being open-source, it allows us to build interoperability between all systems while supporting some kind of transactional semantics on top of Parquet files.</p><ul><li>The Iceberg Summit took place in San Francisco (but I could not go), though Neelesh published a small <a href="https://www.linkedin.com/feed/update/urn:li:activity:7315622856141279235/?updateEntityUrn=urn%3Ali%3Afs_updateV2%3A%28urn%3Ali%3Aactivity%3A7315622856141279235%2CFEED_DETAIL%2CEMPTY%2CDEFAULT%2Cfalse%29&ref=blef.fr">recap of the Summit</a>. I guess the videos will be on the <a href="https://www.youtube.com/@ApacheIceberg?ref=blef.fr">YouTube channel</a> soon.</li><li>I personally think that DuckDB might be the easiest developer interface to interact with the Iceberg ecosystem, as it's dead simple to spin up a Duck instance. Recently we got <a href="https://www.linkedin.com/feed/update/urn:li:activity:7310671698276564993/?updateEntityUrn=urn%3Ali%3Afs_updateV2%3A%28urn%3Ali%3Aactivity%3A7310671698276564993%2CFEED_DETAIL%2CEMPTY%2CDEFAULT%2Cfalse%29&ref=blef.fr">DuckDB to attach to Iceberg</a> and a preview of the <a href="https://duckdb.org/2025/03/14/preview-amazon-s3-tables.html?ref=blef.fr">Amazon S3 Tables</a> capabilities.</li><li><a href="https://medium.com/@yogevyuval/athena-vs-snowflake-on-iceberg-performance-and-cost-comparison-on-tpc-h-03b96fa6dbf9?ref=blef.fr">Athena vs. Snowflake on Iceberg</a>, performance comparison. In the end Snowflake won, being 2x less expensive; the tests use the engines on top of Iceberg datasets to see how they handle working with Iceberg.
Would have been cool to compare it to the same using native tables.</li><li><a href="https://arrow.apache.org/blog/2025/02/28/data-wants-to-be-free/?ref=blef.fr">Data wants to be free: fast data exchange with Apache Arrow</a> — How Arrow compares to Postgres when it comes to serialisation and why it is so fast.</li><li><a href="https://blog.cloudflare.com/r2-data-catalog-public-beta/?ref=blef.fr">Cloudflare R2 data catalog</a> — Cloudflare R2 is a global object storage (like S3) with free egress (meaning free data reads from external systems), which is a paradise for Iceberg as lakehouses rely heavily on data reads on buckets and most of the engines live elsewhere. So Cloudflare announced an Iceberg catalog that can live close to your tables.</li><li><a href="https://aws.amazon.com/about-aws/whats-new/2025/04/amazon-s3-express-one-zone-reduces-storage-request-prices/?ref=blef.fr">Amazon reduces prices for S3 Express One Zone</a> —&nbsp;Following Cloudflare's announcement, Amazon decided to reduce the price of their Iceberg offering.</li><li><a href="https://www.xorq.dev/posts/introducing-xorq?ref=blef.fr">xorq, declarative, multi-engine pipelines</a> — This new world opened by Iceberg brings us to the multi-engine data stack, where we use different engines (Snowflake, DuckDB, BigQuery) for what they are great at and store the underlying data in buckets using Iceberg, unifying everything in a catalog.
xorq is one of the first multi-engine pipeline systems for ML use-cases.</li></ul><p>Because we have to unify all the trends, an <a href="https://github.com/ryft-io/iceberg-mcp?ref=blef.fr">Iceberg MCP server</a> has been developed.</p><p></p><h4 id="examples-and-thoughts">Examples and thoughts</h4><p>Just to go further and connect everything, a few posts about the relationship between Iceberg and the lakehouse, where all this fuss is going, and what it could mean for your actual data stack.</p><ul><li><a href="https://roundup.getdbt.com/p/iceberg-give-it-a-rest?ref=blef.fr">Iceberg?? Give it a REST!</a>.</li><li><a href="https://medium.com/@coreycheung/we-built-a-data-lakehouse-to-help-sell-dog-food-a94f6ea9c648?ref=blef.fr">We built a data lakehouse to help dogs live longer</a>.</li><li><a href="https://www.dataengineeringweekly.com/p/towards-composable-data-infrastructure?ref=blef.fr">Towards Composable Data Infrastructure</a>.</li><li><a href="https://www.bvp.com/atlas/roadmap-data-3-0-in-the-lakehouse-era?ref=blef.fr">Roadmap: data 3.0 in the lakehouse era</a> —&nbsp;4 possible theses on what could be the next revolution in your data stack.</li></ul><p>My two cents about this: this is mainly experimental and not yet relevant at the scale most companies operate at.
Warehouse + native tables is the easiest user experience you can find, and as data engineers what we want is users using our platforms, right?</p><p></p><h1 id="data-economy-%F0%9F%92%B0">Data Economy 💰</h1><ul><li><a href="https://www.bloomberg.com/news/articles/2025-03-26/openai-close-to-finalizing-its-40-billion-softbank-led-funding?ref=blef.fr">OpenAI raises $40b at $300b valuation</a>.</li><li><a href="https://www.linkedin.com/feed/update/urn:li:activity:7305937479319044098/?updateEntityUrn=urn%3Ali%3Afs_updateV2%3A%28urn%3Ali%3Aactivity%3A7305937479319044098%2CFEED_DETAIL%2CEMPTY%2CDEFAULT%2Cfalse%29&ref=blef.fr">Omni raises $69m Series B.</a></li></ul><p></p><hr><p>See you soon ❤️</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 25.10 ]]></title>
                    <description><![CDATA[ Data News 25.10 — Super large edition, all new models releases, events, dbt Core vs. SQLMesh, benchmark your data team, and more. ]]></description>
                    <link><![CDATA[ /data-news-week-25-10/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 67cc02ec59959d000171f0e6 ]]></guid>
                    <pubDate><![CDATA[ 2025-03-08 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2025/03/image.png" class="kg-image" alt="" loading="lazy" width="900" height="598" srcset="https://www.blef.fr/content/images/size/w600/2025/03/image.png 600w, https://www.blef.fr/content/images/2025/03/image.png 900w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Joy (</span><a href="https://unsplash.com/photos/small-tree-67-CqTBwNI0?ref=blef.fr"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><h3 id="hello-here-%E2%98%80%EF%B8%8F">Hello here ☀️</h3><p>I feel ashamed for not posting any Data News for the last 2 months; a lot was going on and I did not manage to find time every Friday to write the news. I'm so sorry about it.</p><p><strong>Hello to all the new subscribers who arrived since January, I want to warmly welcome you ❤️. </strong>This is your first Data News ever, enjoy the moment, read whatever you feel curious about, at your own rhythm.</p><p>At the moment, I don't want to promise anything about being back to our regular weekly schedule, but I'm trying as hard as I can to organise my new routines/life as a content creator and company founder.</p><p>I've always worked on multiple projects at the same time, but since I started nao things have changed. There's a truth you only grasp when you've lived it: you are thinking about your company all the damn time.</p><h3 id="events-%F0%9F%AA%AD">Events 🪭</h3><p>While being less present online, I've done a lot of things in real life in the last weeks and I'll continue to in the weeks to come.
I was at the <a href="https://duckdb.org/events/2025/01/31/duckcon6/?ref=blef.fr">DuckCon #6</a> in Amsterdam to talk about <a href="https://www.youtube.com/watch?v=m7ACh3DRVW0&ref=blef.fr">yato, the smallest DuckDB SQL orchestrator</a> and Robin published 3 podcast episodes—in French—that I hope you'll listen to while running this weekend 🤭:</p><ul><li><a href="https://www.youtube.com/watch?v=jj8jvy1Eu4U&ref=blef.fr">3 data trends to follow in 2025</a></li><li><a href="https://www.youtube.com/watch?v=wxyVl-1Cr0U&ref=blef.fr">A comparison between SQLMesh and dbt</a></li><li><a href="https://www.youtube.com/watch?v=TRPXKyThtIo&ref=blef.fr">The 3 priorities of a VP data in 2025</a></li></ul><p>At the end of the month, on March 31st, I'll co-organise the <a href="https://www.ai-product-day.com/en?ref=blef.fr">AI Product Day</a> in Paris. We are sold out, but we still have slots for sponsors if you want to help us organise the event and get massive visibility with AI and product teams.</p><p>I'm going to Barcelona 🇪🇸 (from March 19 to 22)—I'd love to hang out with data people there. I'll give a talk at the French Tech Barcelona on March 20, you can <a href="https://lu.ma/72121psm?ref=blef.fr">register here</a>. I might plan a day-trip to Madrid (?).</p><p>The biggest news as a data fan is also that I'll be at <a href="https://www.datacouncil.ai/bay-2025?ref=blef.fr">Data Council</a> this year as your news reporter on duty 🤓.
So if you plan to go, or if you're in San Francisco around April, let's have a coffee.</p><p></p><h1 id="ai-news-%F0%9F%A4%96">AI News 🤖</h1><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2025/03/image-1.png" class="kg-image" alt="" loading="lazy" width="2000" height="1125" srcset="https://www.blef.fr/content/images/size/w600/2025/03/image-1.png 600w, https://www.blef.fr/content/images/size/w1000/2025/03/image-1.png 1000w, https://www.blef.fr/content/images/size/w1600/2025/03/image-1.png 1600w, https://www.blef.fr/content/images/size/w2400/2025/03/image-1.png 2400w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Running after all the models releases (</span><a href="https://unsplash.com/photos/man-running-down-on-desert-pizgoJNQ-xY?ref=blef.fr"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p>The pace of change is nothing short of extraordinary. I haven't published in two months, and it feels like two years. Here's a recap.</p><ul><li>Timeline of the major model releases. When I say major it's obviously subjective. It's mainly related to the noise they made online.<ul><li><a href="https://huggingface.co/microsoft/phi-4?ref=blef.fr">phi-4</a> — Microsoft continues to release their small open models. I never came across someone using it, however.</li><li><a href="https://github.com/deepseek-ai/DeepSeek-R1?ref=blef.fr">DeepSeek R1</a> — DeepSeek is a Chinese startup building foundational models; they released <a href="https://api-docs.deepseek.com/news/news1226?ref=blef.fr">v3</a> previously, then R1, a reasoning model that made OpenAI and American AI companies panic because they claimed major cost reductions in model training. Moreover, DeepSeek's code and models are open-source under the MIT license.
R1 is built on top of v3 using reinforcement learning combined with <a href="https://www.promptingguide.ai/techniques/cot?ref=blef.fr">chain-of-thought</a> (CoT) to "reason".<br><br>HuggingFace created <a href="https://github.com/huggingface/open-r1?ref=blef.fr">open-r1</a>, a fully open (for what that means) version of R1, in Python, where every step is detailed.<br><br>There is also a good analysis of <a href="https://albertoai.substack.com/p/ai-update-22?ref=blef.fr">DeepSeek vs. the world</a>.</li></ul></li><ul><li><a href="https://mistral.ai/en/news/mistral-small-3?ref=blef.fr">Mistral Small 3</a> — A small model that can be used to do CoT, under the Apache license.</li></ul><ul><li>Google <a href="https://deepmind.google/technologies/gemini/flash/?ref=blef.fr">Gemini Flash 2.0</a> — Multimodal reasoning.</li><li>Anthropic <a href="https://www.anthropic.com/news/claude-3-7-sonnet?ref=blef.fr">Claude 3.7</a> — Claude 3.5 has been by far the most used model for code generation for the last 6 months. 3.7 should be an uplift, and to be honest I feel it's not. They also released <a href="https://docs.anthropic.com/en/docs/agents-and-tools/claude-code/overview?ref=blef.fr">Claude Code</a>, an agentic coding tool that you can use in the command line to make changes, commit and fix issues.
Simon created the <a href="https://simonwillison.net/2025/Feb/24/claude-37-sonnet-and-claude-code/?ref=blef.fr">pelican bicycle test</a>, which is fairly good for evaluating models.</li><li><a href="https://qwenlm.github.io/blog/qwen2.5-max/?ref=blef.fr">Alibaba Qwen2.5</a> — Nothing much to say to be honest.</li><li><a href="https://openai.com/index/openai-o3-mini/?ref=blef.fr">Open AI o3-mini</a> — OpenAI's fast reasoning model series.</li><li><a href="https://x.ai/blog/grok-3?ref=blef.fr">Grok 3</a> — Musk fans say it's the best model.</li><li><a href="https://openai.com/index/introducing-gpt-4-5/?ref=blef.fr">OpenAI GPT 4.5</a> — One month later OpenAI released GPT 4.5, and I feel like a teleshopping presenter. Now we have a selector with 6 models in ChatGPT; I'm kinda lost to be honest. There is also the <a href="https://simonwillison.net/2025/Feb/27/introducing-gpt-45/?ref=blef.fr">pelican test.</a></li><li><a href="https://openai.com/index/introducing-deep-research/?ref=blef.fr">OpenAI deep research</a> — OpenAI's mode to replace McKinsey consultants or PhD people, because why not.</li><li><a href="https://mistral.ai/en/news/mistral-ocr?ref=blef.fr">Mistral OCR</a> — The promise is crazy; have a quick look at their examples: from a PDF or a photo it can extract information so you can use it.
It's even "multimodal" because it keeps the figures in the output.</li></ul><li><a href="https://www.linkedin.com/feed/update/urn:li:activity:7303504840209358849/?updateEntityUrn=urn%3Ali%3Afs_updateV2%3A%28urn%3Ali%3Aactivity%3A7303504840209358849%2CFEED_DETAIL%2CEMPTY%2CDEFAULT%2Cfalse%29&ref=blef.fr">ChatGPT for MacOS can interact with your code (in the IDE)</a> — in the demo it works with XCode or VS Code and directly changes the files on disk so they change in your editor.</li><li><a href="https://www.twitch.tv/claudeplayspokemon?ref=blef.fr">Claude 3.7 plays Pokemon on Twitch</a> — finally something useful.</li><li>I'm not a Perplexity user but I see more and more people switching their Google search usage to Perplexity, which announced a new <a href="https://www.perplexity.ai/comet?ref=blef.fr">web browser called Comet</a> and <a href="https://www.perplexity.ai/hub/blog/introducing-perplexity-deep-research?ref=blef.fr">deep research</a>. Deep research has been built to generate ideas, summaries or takeaways.</li><li><a href="https://www.linkedin.com/pulse/streaming-ai-agents-why-kafka-flink-foundations-scale-derosiaux-u24se?ref=blef.fr">Streaming AI agents: why Kafka and Flink are foundations</a> — A small bridge to the data engineering world.</li><li><a href="https://www.anthropic.com/engineering/building-effective-agents?ref=blef.fr">Building effective AI agents</a> — This is a great article from Anthropic if you wanna learn how to build AI agents. It explains well the flow between the user, the UI and the LLM.</li><li>📺 <a href="https://www.youtube.com/watch?v=7xTGNNLPyMI&ref=blef.fr">Deep dive into LLMs</a> — 1.5m views and it has been recommended by a lot of people, so it should be good (I did not watch it).
It goes from GPT-2 to DeepSeek R1 and gives a mental model of what an LLM is.</li><li>RAG stuff<ul><li><a href="https://www.linkedin.com/feed/update/urn:li:activity:7285322375603134466/?updateEntityUrn=urn%3Ali%3Afs_updateV2%3A%28urn%3Ali%3Aactivity%3A7285322375603134466%2CFEED_DETAIL%2CEMPTY%2CDEFAULT%2Cfalse%29&ref=blef.fr">Forget RAG, welcome Agentic RAG</a></li><li><a href="https://www.linkedin.com/feed/update/urn:li:activity:7295766165363077120/?updateEntityUrn=urn%3Ali%3Afs_updateV2%3A%28urn%3Ali%3Aactivity%3A7295766165363077120%2CFEED_DETAIL%2CEMPTY%2CDEFAULT%2Cfalse%29&ref=blef.fr">RAG is dead, long live RAG</a>.</li><li><a href="https://engineering.ramp.com/industry_classification?ref=blef.fr">From RAG to richness: how Ramp revamped industry classification</a>.</li></ul></li></ul><p></p><h1 id="dbt-core-and-sqlmesh-wat-%F0%9F%A7%AD">dbt Core and SQLMesh, wat 🧭</h1><p>dbt Core has become one of the most used tools across data teams all around the world. Because of its success, companies might feel the dbt fatigue, which happens when your dbt project has been a success but spread widely within the company, leading to A LOT of tables—we call them models, the dbt way.</p><p>When you have a lot of tables, dbt projects tend to become less manageable: the CLI becomes slow, the local development experience isn't great and more and more features are going into the Cloud version. SQLMesh has been created to fix dbt Core's issues and to compete with dbt Cloud.</p><p>A few weeks ago, dbt Labs acquired <a href="https://www.sdf.com/?ref=blef.fr">SDF</a>—which I had been watching closely for more than a year, see <a href="https://www.blef.fr/data-news-week-24-07/">DN#24.07</a>. SDF is a Rust binary that understands dbt projects and speeds up everything, delivering up to 100x performance gains. 
Under the hood, SDF parses the SQL queries, builds syntax trees, then compiles and executes them to find issues even before they hit the data warehouse.</p><p>We will know very soon what this acquisition brings to dbt, and we all pray for the best improvements to land in the open-source codebase (spoiler: not sure). </p><p>On the other side, SQLMesh answered with the <a href="https://tobikodata.com/tobiko-acquires-quary.html?ref=blef.fr">acquisition of Quary</a>, a Rust-savvy team that made significant improvements to SQLGlot, the underlying SQL parser of SQLMesh.</p><p>There is fierce competition between the 2 companies and <a href="https://tobikodata.com/dbt_sdf.html?ref=blef.fr">shots</a> are fired. The SQLMesh team is also organising <a href="https://groupby.tobikodata.com/?ref=blef.fr">GROUP BY</a>, their annual conference, in a few days; any resemblance to <a href="https://coalesce.getdbt.com/?ref=blef.fr">another event</a> is fortuitous. This week Tobiko also published a benchmark claiming that SQLMesh on top of Databricks delivers <a href="https://www.linkedin.com/feed/update/urn:li:activity:7303513859485483008/?ref=blef.fr">9x cost savings</a>.</p><p>Time will tell where this leads, but ultimately it will benefit data professionals as both strive to build the best SQL orchestrators. However, I believe there are still unresolved issues with the developer experience in the age of AI—challenges that I'm actively working to address with <a href="https://getnao.io/?ref=blef.fr">nao</a> 🤭.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2025/03/image-2.png" class="kg-image" alt="" loading="lazy" width="900" height="600" srcset="https://www.blef.fr/content/images/size/w600/2025/03/image-2.png 600w, https://www.blef.fr/content/images/2025/03/image-2.png 900w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Will dbt stay on top? 
(</span><a href="https://unsplash.com/photos/white-and-brown-concrete-house-fbCtFV3FkfE?ref=blef.fr"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li><a href="https://www.databenchmarks.com/?ref=blef.fr">Benchmark your data team</a> — Mikkel has been a great contributor to metrics about data teams worldwide: size, ratio to software engineering, team composition, salaries. This time it's a dynamic website where you can explore all these metrics to compare your team with what's out there.</li><li><a href="https://hex.tech/blog/myth-of-data-team-roi/?ref=blef.fr">The myth of measuring data team ROI</a> — The ROI of a data team is one of the most difficult things to measure. Hex's view on this is to ask others to state the ROI for you, especially via an NPS of your users.</li><li><a href="https://fivetran.com/docs/usage-based-pricing/2025-pricing-faq?ref=blef.fr">Fivetran</a> and <a href="https://airbyte.com/blog/introducing-capacity-based-pricing?ref=blef.fr">Airbyte</a> pricing changes — The 2 data ingestion services changed their billing methods. Fivetran is doing something I don't understand, but they have charts explaining it, and Airbyte switched to capacity-based pricing—which means it's based on the number of pipelines you run rather than the volume you move. <a href="https://www.linkedin.com/feed/update/urn:li:activity:7294706306827862019/?updateEntityUrn=urn%3Ali%3Afs_updateV2%3A%28urn%3Ali%3Aactivity%3A7294706306827862019%2CFEED_DETAIL%2CEMPTY%2CDEFAULT%2Cfalse%29&ref=blef.fr">Benjamin analysed the pricing changes on LinkedIn</a>, it's a competitor's perspective. </li><li><a href="https://evidence.dev/blog/what-is-a-flat-file?ref=blef.fr">What is a flat file?</a> — A long article explaining all the flat file formats. 
I do not miss fixed-width files.</li><li><a href="https://duckdb.org/2025/01/22/parquet-encodings?ref=blef.fr">Query engines: gatekeepers of the Parquet file format</a> — The DuckDB team is unhappy because most query engines don't support the latest Parquet advancements, forcing the duck to write the old spec, which lowers performance.</li><li><a href="https://sharon-53595.medium.com/how-we-migrated-to-apache-iceberg-utilizing-athena-trino-and-spark-58c6875b5641?ref=blef.fr">How we migrated to Iceberg using Athena, Trino and Spark</a> — How you can plan a migration to Iceberg. It lasted 4 months and reduced the data volume from 70TB to 40TB.</li><li><a href="https://luminousmen.com/post/how-not-to-partition-data-in-s3-and-what-to-do-instead?ref=blef.fr">How not to partition data in S3</a> — You should partition by folder/date=2025-03-08, rather than with subfolders (sorry American readers, we put the month before the day 🙃).</li><li><a href="https://pola.rs/posts/polars-cloud-what-we-are-building/?ref=blef.fr">Polars launches Polars Cloud</a> — Run stuff remotely, why not 🤷‍♂️, on your own Polars cluster. Looks like a 2025 Spark.</li><li>Tobi launched a DuckDB newsletter: <a href="https://learningduckdb.com/newsletters/welcome-to-learning-duckdb/?ref=blef.fr">learning DuckDB by example</a>.</li><li><a href="https://mehdio.substack.com/p/duckdb-goes-distributed-deepseeks?ref=blef.fr">DuckDB goes distributed</a> — DeepSeek released <a href="https://github.com/deepseek-ai/smallpond?ref=blef.fr">smallpond</a>, a lightweight data processing framework built on DuckDB and <a href="https://github.com/deepseek-ai/3FS?ref=blef.fr">3FS</a>—their distributed storage tech. <em>smallpond</em> is an alternative to Daft or Spark. I'm skeptical: distributed processing is not really the initial purpose of DuckDB, which is made to remove the communication burden between client &lt;&gt; server through single-node processing. 
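Back on the S3 partitioning item above, the recommended Hive-style layout can be sketched in a few lines. This is a hypothetical helper for illustration only, not code from the linked article:

```python
from datetime import date

def partition_key(prefix: str, day: date, part: int) -> str:
    """Build a Hive-style partition path (folder/date=YYYY-MM-DD/...),
    the layout recommended above instead of nested year/month/day
    subfolders. Hypothetical helper, for illustration."""
    return f"{prefix}/date={day.isoformat()}/part-{part:05d}.parquet"

# Engines such as Athena, Trino or DuckDB can prune files on the
# date= key when a query filters on that column.
print(partition_key("s3://my-bucket/events", date(2025, 3, 8), 0))
# s3://my-bucket/events/date=2025-03-08/part-00000.parquet
```

The point of the single `date=` key is that a filter like `WHERE date = '2025-03-08'` maps to exactly one folder, instead of the engine having to walk year/month/day levels.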
<br><br>📅<em> Mehdi is organising an online event about </em><a href="https://lu.ma/5946jam3?ref=blef.fr"><em>Scaling DuckDB</em></a><em>.</em></li><li><a href="https://count.co/blog/announcing-duckdb-on-the-server?ref=blef.fr">Count.co combines DuckDB processing in-browser and on the server</a>.</li><li><a href="https://medium.com/@petrica.leuca/what-ive-discovered-while-using-uv-436b4085b6d6?ref=blef.fr">uv is becoming a thing, how to use it in PyCharm</a> — uv is a Python package manager written in Rust that aims to fix all the issues we've all faced one day. uv also brings on-the-fly package management for scripts, which is freaking cool.</li><li><a href="https://medium.com/skello-engineering/building-a-robust-ci-cd-pipeline-for-dbt-at-skello-e59d685292da?ref=blef.fr">Building robust CI/CD pipeline for dbt</a> — Ideas of things you can put in your CI to test your dbt projects before production. Even though I'm personally convinced that the CI arrives too late in the process and that checks should run even before you push, this is a great start.</li><li><a href="https://maxhalford.github.io/blog/minimizing-sql-dag-runtime/?ref=blef.fr">Minimising the runtime of a SQL DAG</a> — What if you could theoretically save time in your SQL DAG by looking at the durations and the dependencies? This is what Max did, and he found a 26% uplift in performance. This guy never ceases to amaze me.</li><li><a href="https://github.com/ashish10alex/vscode-dataform-tools?ref=blef.fr">VS Code extension for Google Dataform users</a> — For the first time in 2025 I've met Dataform users; it's cool to have another alternative on the table. Though, it's strictly coupled to BigQuery. 
<em>Dataform is dbt but for BigQuery, with another syntax (SQLX).</em></li><li><a href="https://www.linkedin.com/feed/update/urn:li:activity:7298633823070683136/?updateEntityUrn=urn%3Ali%3Afs_updateV2%3A%28urn%3Ali%3Aactivity%3A7298633823070683136%2CFEED_DETAIL%2CEMPTY%2CDEFAULT%2Cfalse%29&ref=blef.fr">Looks like BigQuery is getting a git integration</a> — BigQuery has been getting a lot of new features over the last months: notebooks, lineage, data profiling, etc.</li><li><a href="https://thenewaiorder.substack.com/p/a-head-of-datas-take-on-ai-code-editors?ref=blef.fr">A head of data's take on AI code editors</a> — AI code editors like Cursor and Windsurf are everywhere and a lot of engineering teams are starting to use them, but what's the equivalent for data teams? How can we, as data workers, benefit from these innovations?<br><br><em>the post has been written by my co-founder Claire—it's her first post, send her love ✨</em></li><li><a href="https://dlthub.com/blog/compound-ai-systems-data-engineering?ref=blef.fr">How dlt enters the AI code generated pipelines world</a> — dlt is becoming the best Python toolkit to ingest data into whatever destination. 
In this AI-assisted code writing era, because dlt is just code, MCP servers or LLMs can really shine in helping data engineers write ingestion pipelines.</li><li>SQL related stuff<ul><li><a href="https://ibis-project.org/posts/does-ibis-understand-sql/?ref=blef.fr">Does Ibis understand SQL?</a></li><li><a href="https://fromanengineersight.substack.com/p/beyond-sql-as-a-pure-database-syntax?ref=blef.fr">Beyond SQL as a pure database syntax</a>.</li><li><a href="https://medium.com/google-cloud/sql-is-all-you-need-77554fea90c0?ref=blef.fr">SQL is all you need</a>.</li><li><a href="https://medium.com/google-cloud/detecting-similar-sql-queries-with-vertex-ai-and-vector-search-5356928074b0?ref=blef.fr">Detecting similar SQL queries with vector search</a>.</li></ul></li><li>📺 <a href="https://www.youtube.com/watch?v=X_RFo616M_U&ref=blef.fr">Graph databases after 15 years?</a></li></ul><p>🕵️ What if you could become a SQL detective: <a href="https://www.sqlnoir.com/?ref=blef.fr">SQL Noir</a>. 
It's a fun game to practice your SQL skills by solving mysteries.</p><p></p><h1 id="data-economy-%F0%9F%92%B0">Data Economy  💰</h1><p>Because it's already too long, only headlines.</p><ul><li><a href="https://www.anthropic.com/news/anthropic-raises-series-e-at-usd61-5b-post-money-valuation?ref=blef.fr">Anthropic has raised a $3.5b Series E</a>.</li><li><a href="https://www.qlik.com/us/news/company/press-room/press-releases/qlik-acquires-upsolver-to-deliver-low-latency-ingestion-and-optimization-for-apache-iceberg?ref=blef.fr">Upsolver has been acquired by Qlik</a>.</li><li><a href="https://blog.fal.ai/fal-raises-49m-series-b-to-power-the-future-of-ai-video/?ref=blef.fr">fal.ai raised a $49m Series B</a>.</li><li><a href="https://groq.com/leap2025/?ref=blef.fr">Groq gets $1.5b funding from Saudi Arabia</a>.</li><li><a href="https://www.getdbt.com/blog/dbt-labs-acquires-sdf-labs?ref=blef.fr">dbt Labs acquired SDF</a> and dbt Labs reached <a href="https://www.getdbt.com/blog/dbt-labs-100m-arr-milestone?ref=blef.fr">$100m ARR</a>.</li><li><a href="https://tobikodata.com/tobiko-acquires-quary.html?ref=blef.fr">Tobiko acquired Quary</a>.</li><li>Databricks <a href="https://techcrunch.com/2024/12/17/databricks-raises-10b-as-it-barrels-toward-an-ipo/?ref=blef.fr">raised a $10b Series J</a>, and <a href="https://www.bloomberg.com/news/articles/2025-01-13/databricks-inks-5-billion-financing-from-private-credit-banks?embedded-checkout=true&ref=blef.fr">$5b more in debt</a>.</li><li><a href="https://hightouch.com/blog/hightouch-funding-series-c?ref=blef.fr">Hightouch raised an $80m Series C</a>.</li><li><a href="https://elevenlabs.io/blog/series-c?ref=blef.fr">Eleven Labs raised a $180m Series C</a>; at the same time they released <a href="https://elevenlabs.io/blog/meet-scribe?ref=blef.fr">Scribe</a>, their new cloud-based speech-to-text model.</li></ul><hr><p>Sorry for the long edition, I also feel a bit rusty after 2 months of not writing. See you soon folks ❤️.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 25.02 ]]></title>
                    <description><![CDATA[ Data News #25.02 — New conference AI Product Day, what are AI agents, does size matter, and awesome analytics engineering content. ]]></description>
                    <link><![CDATA[ /data-news-week-25-02/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 67823a18652a1600015a65b1 ]]></guid>
                    <pubDate><![CDATA[ 2025-01-11 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2025/01/image-1.png" class="kg-image" alt="" loading="lazy" width="2000" height="1333" srcset="https://www.blef.fr/content/images/size/w600/2025/01/image-1.png 600w, https://www.blef.fr/content/images/size/w1000/2025/01/image-1.png 1000w, https://www.blef.fr/content/images/size/w1600/2025/01/image-1.png 1600w, https://www.blef.fr/content/images/size/w2400/2025/01/image-1.png 2400w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">HNY 2025 (</span><a href="https://unsplash.com/photos/person-holding-a-light-during-nighttime-vwYrQQFoE-k?ref=blef.fr"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p>Happy new year ✨. <strong>I wish you the best for 2025</strong>. There are multiple ways to start a new year: with new projects, new ideas, new resolutions, or by just keeping the same music playing. I hope you will enjoy 2025.</p><p>The Data News is here to stay; the format might vary during the year, but here we are for another year. Thank you so much for your support through the years.</p><p>Some personal news:</p><ul><li> I will be in Amsterdam for the <a href="https://duckdb.org/2025/01/31/duckcon6.html?ref=blef.fr">DuckCon</a> on Jan 31, where I'll give a 5-minute talk about <a href="https://github.com/Bl3f/yato?ref=blef.fr">yato</a>; if you're also going or living there, reach out so we can chat!</li><li>We announced the <a href="https://www.ai-product-day.com/en?ref=blef.fr">AI Product Day</a>, a 1-day conference that will take place in Paris on March 31. It will be a day dedicated to product teams who want to fully exploit the potential of AI. We are looking for sponsors and the <a href="https://www.billetweb.fr/ai-product-conference?ref=blef.fr">ticketing</a> is open. 
I have a 15% discount code if you're interested: <em>BLEF_AIProductDay25</em>.</li><li>We published videos from the Forward Data Conference; you can watch the keynote by Hannes, DuckDB co-creator, about <a href="https://www.youtube.com/watch?v=1QSs5XY8Hvc&ref=blef.fr">Changing Large Tables</a>.</li><li>Over the past four weeks, I took a break from blogging and LinkedIn to focus on building nao. It was refreshing to recharge and kick off the year; there’s nothing quite like diving back into the joy of hacking and creating.</li></ul><p>Let's jump to the news, and have fun reading: it's a large wrap-up of everything that happened at the end of the year + how 2025 started.</p><figure class="kg-card kg-image-card kg-card-hascaption"><a href="https://www.ai-product-day.com/en?ref=blef.fr"><img src="https://www.blef.fr/content/images/2025/01/Header-PROFIL-AI-PRODUCT-DAY.png" class="kg-image" alt="" loading="lazy" width="1585" height="397" srcset="https://www.blef.fr/content/images/size/w600/2025/01/Header-PROFIL-AI-PRODUCT-DAY.png 600w, https://www.blef.fr/content/images/size/w1000/2025/01/Header-PROFIL-AI-PRODUCT-DAY.png 1000w, https://www.blef.fr/content/images/2025/01/Header-PROFIL-AI-PRODUCT-DAY.png 1585w" sizes="(min-width: 720px) 720px"></a><figcaption><span style="white-space: pre-wrap;">AI Product Day on March 31 (</span><a href="https://www.ai-product-day.com/en?ref=blef.fr"><span style="white-space: pre-wrap;">register</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p></p><h1 id="ai-news-%F0%9F%A4%96">AI News 🤖</h1><p>The current economic uncertainties are affecting the tech and data worlds. Meanwhile, the AI landscape remains unpredictable. 
AI companies are aiming for the moon—AGI—promising it will arrive once OpenAI develops a system capable of generating at least <a href="https://gizmodo.com/leaked-documents-show-openai-has-a-very-clear-definition-of-agi-2000543339?ref=blef.fr">$100 billion in profits</a>.</p><ul><li><a href="https://time.com/7205359/why-ai-progress-is-increasingly-invisible/?ref=blef.fr">Why AI progress is increasingly invisible</a>.</li><li>It's happening: leaders want AI agents to take the jobs of human employees. Nvidia's CEO said "<em>IT department of every company is going to be the HR department of AI agents in the future" </em>(cf. <a href="https://www.youtube.com/live/k82RwXqZHY8?feature=shared&t=2409&ref=blef.fr">Keynote video</a>) and <a href="https://www.firecrawl.dev/?ref=blef.fr">Firecrawl</a>, a tool for turning websites into LLM-ready data, posted a $15K <a href="https://www.ycombinator.com/companies/firecrawl/jobs/1vMVVCc-firecrawl-example-creator-ai-agents-only?ref=blef.fr">job for AI agents</a>. It's actually a modern Kaggle for agentic AI; in the end it's a mechanism to lower human labor costs, because, spoiler, humans will code to create these agents.</li><li><a href="https://huyenchip.com//2025/01/07/agents.html?ref=blef.fr">Agents</a> — Chip Huyen wrote a very large guide about AI agents. It is very detailed: it covers the necessary tooling, how planning works with agents and how you evaluate them. This is great quality material to be honest. There is also a Google introduction <a href="https://www.linkedin.com/feed/update/urn:li:activity:7282707916841795584/?ref=blef.fr">video about AI Agents</a>.</li><li><a href="https://arxiv.org/html/2412.15605v1?ref=blef.fr">Don't do RAG, use CAG</a> — A paper about another way to think about information retrieval for AI knowledge tasks. 
The goal is to use a key-value (KV) cache that eliminates the latency issues traditional RAG might incur.</li><li><a href="https://arxiv.org/pdf/2409.14160?ref=blef.fr">Does size matter?</a> — A paper written by Gael Varoquaux (sklearn), Meredith Whittaker (Signal) and Alexandra Sasha Luccioni (HuggingFace) about the negative impact of the <em>bigger-is-better </em>paradigm<em>. </em>It's easily readable (mildly large, ~10 pages) and gives metrics about the performance plateau that we start to see at scale.</li><li>A large international scientist collaboration released <a href="https://www.linkedin.com/feed/update/urn:li:activity:7269446402739515393/?ref=blef.fr">The Well</a>: 2 massive datasets, from <a href="https://github.com/PolymathicAI/the_well?ref=blef.fr">physics simulation</a> (15TB) to <a href="https://github.com/MultimodalUniverse/MultimodalUniverse?ref=blef.fr">astronomical scientific data</a> (100TB). They aim to produce the same innovation that <a href="https://en.wikipedia.org/wiki/ImageNet?ref=blef.fr">ImageNet</a> produced for image recognition. </li><li>Models news and tour<ul><li><a href="https://github.com/deepseek-ai/DeepSeek-V3/blob/main/DeepSeek_V3.pdf?ref=blef.fr">DeepSeek-v3</a> — It entered the space with a bang. DeepSeek is a <a href="https://huggingface.co/collections/deepseek-ai/deepseek-v3-676bc4546fb4876383c4208b?ref=blef.fr">model</a> trained by the Chinese company of the same name; they compete directly with OpenAI and the others to build foundational models. They released v3 as open source and it outperforms every other model.</li><li><a href="https://techcrunch.com/2024/12/20/openai-announces-new-o3-model/?ref=blef.fr">OpenAI o3</a> — OpenAI announced their advanced reasoning model called o3, which can achieve large tasks. 
o3 is kinda a waste of energy when you look at the numbers: <a href="https://www.linkedin.com/feed/update/urn:li:activity:7276250095019335680/?updateEntityUrn=urn%3Ali%3Afs_updateV2%3A%28urn%3Ali%3Aactivity%3A7276250095019335680%2CFEED_DETAIL%2CEMPTY%2CDEFAULT%2Cfalse%29&ref=blef.fr">estimated carbon impacts</a> (estimated via kWh), based on François Chollet's research on <a href="https://arcprize.org/blog/oai-o3-pub-breakthrough?ref=blef.fr">ARC-AGI benchmarks</a>.</li><li><a href="https://github.com/huggingface/smolagents?ref=blef.fr">smolagents</a> — HuggingFace released a barebones library for agents. Agents write Python code to call tools and orchestrate other agents.</li></ul></li><li>❤️ <a href="https://goyalpramod.github.io/blogs/Transformers_laid_out/?ref=blef.fr">Transformers laid out</a> — The best article out there to understand Transformers (which are key to understanding LLMs).</li><li><a href="https://simonwillison.net/2024/Dec/31/llms-in-2024/?ref=blef.fr">Things we learned about LLMs in 2024</a> — I discovered Simon's content during the Christmas break and to be honest it's some of the best out there. He compiled a list of things we learned in 2024 about LLMs. <strong>This is a must-read</strong>.</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2025/01/image-2.png" class="kg-image" alt="" loading="lazy" width="2000" height="1333" srcset="https://www.blef.fr/content/images/size/w600/2025/01/image-2.png 600w, https://www.blef.fr/content/images/size/w1000/2025/01/image-2.png 1000w, https://www.blef.fr/content/images/size/w1600/2025/01/image-2.png 1600w, https://www.blef.fr/content/images/size/w2400/2025/01/image-2.png 2400w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Does size matter for LLMs? 
(</span><a href="https://unsplash.com/photos/gourd-and-white-tape-measure-on-blue-surface-GTUwF3agcI0?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p></p><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li><a href="https://www.oreilly.com/pub/pr/3465?ref=blef.fr">O'Reilly 2025 tech trend report</a> — Every year O'Reilly releases a report based on searches on their skills platform. From the traffic they get they draw market trends. A few things to notice:<ul><li>Interest in AI grew by 190%, Prompt Engineering by 456%.</li><li>Python and Java still lead the programming language interest, but with a decrease (-5% and -13%) while Rust is gaining traction (+13%); not sure it's related, tho.</li></ul></li><ul><li>Read the <a href="https://ae.oreilly.com/l/1009792/2024-12-06/332nf/1009792/1733515474UOvDN6IM/OReilly_Technology_Trends_for_2025.pdf?ref=blef.fr">pdf version</a> directly. Not really digestible.</li></ul><li><a href="https://issues.org/limits-of-data-nguyen/?ref=blef.fr">The limits of data</a> — The article argues that while data's universality offers significant power, it often sacrifices contextual nuances, leading to oversimplified representations of complex human experiences. (Summary generated by AI, as the article is too long for me to read atm.)</li><li><a href="https://uncultureddata.substack.com/p/the-hidden-cost-of-over-abstraction?ref=blef.fr">The hidden cost of over-abstraction in data teams</a> — It's an interesting take about the layers of abstraction we tend to create to build data platforms, whether it's dbt macros, CLI wrappers, etc. 
In the end, it often adds more complexity than it removes.</li><li><a href="https://www.rilldata.com/blog/designing-a-declarative-data-stack-from-theory-to-practice?ref=blef.fr">Designing a declarative data stack: from theory to practice</a> — Related to the previous article, Simon wrote a great piece about the things to keep in mind when building a proprietary DSL for a declarative data stack. Meaning: a YAML configuration system for ingestion and transformations, and now visualisation with BI-as-code.</li><li><a href="https://sqlpatterns.com/p/lessons-learned-implementing-metric?ref=blef.fr">Lessons learned implementing Metrics Trees</a> — An article in the form of an interview with someone who implemented Metrics Trees; mainly it's not about the visual representation but about the process of translating business needs into an equation (I mainly see the tree as an equation).</li><li><a href="https://www.linkedin.com/pulse/evolution-olap-artyom-keydunov-hrkgc/?ref=blef.fr">The evolution of OLAP</a> — What is OLAP in the modern data stack? As Cube is preaching for their tooling, obviously the semantic layer is the OLAP layer.</li><li><a href="https://www.brooklyndata.co/ideas/2025/01/08/our-hybrid-kimball-and-obt-data-modeling-approach?ref=blef.fr">Hybrid Kimball &amp; OBT data modeling approach</a> —&nbsp;This is maybe the most common setup I've seen over the last 3 years: a combination of star schema with OBT in the marts for ease of consumption.</li><li><a href="https://netflixtechblog.com/part-1-a-survey-of-analytics-engineering-work-at-netflix-d761cfd551ee?ref=blef.fr">Analytics engineering at Netflix</a> —&nbsp;(and <a href="https://netflixtechblog.com/part-2-a-survey-of-analytics-engineering-work-at-netflix-4f1f53b4ab0f?ref=blef.fr">part 2</a>). An internal survey of analytics engineering practices at Netflix. 
They developed an internal /data command that answers questions about everything, and structured analytics around a foundational data platform with a company-wide analytics data layer that provides time-series efficiency metrics across various business use cases.</li><li><a href="https://cube.dev/blog/semantic-layer-and-ai-the-future-of-data-querying-with-natural-language?ref=blef.fr">The future of data querying with Natural Language</a> — What are all the architecture blocks needed to make natural language querying work with data (esp. when you have a semantic layer). </li><li><a href="https://maxhalford.github.io/blog/hard-data-integration-problems-at-carbonfact/?ref=blef.fr">Hard data integration problems</a> — As always, Max describes reality best. He listed the 4 most difficult data integration tasks: from mutable data to IT migrations, everything adds complexity to ingestion systems.</li><li><a href="https://handsondata.substack.com/p/materialization-of-data-warehouse?ref=blef.fr">Materialization of data warehouse layers</a> — What are the considerations behind each materialisation you can pick in your data warehouse layers: views, tables, schemas vs. databases, etc.</li><li><a href="https://medium.com/@jairus-m/the-software-development-lifecycle-within-a-modern-data-engineering-framework-11c44a2f7189?ref=blef.fr">The software development lifecycle within a modern data engineering framework</a> — A great deep-dive into a data platform using dltHub, dbt and Dagster.</li><li><a href="https://fromanengineersight.substack.com/p/issue-43-the-best-code-you-never?ref=blef.fr">The best code is the code you never wrote</a> — Every line of code is a form of debt—a liability that must be maintained and understood. 
As we move toward a future dominated by AI-generated code, the balance will shift dramatically, making human-written code an increasingly scarce and valuable resource.</li></ul><p></p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2025/01/image-3-1.png" class="kg-image" alt="" loading="lazy" width="2000" height="1111" srcset="https://www.blef.fr/content/images/size/w600/2025/01/image-3-1.png 600w, https://www.blef.fr/content/images/size/w1000/2025/01/image-3-1.png 1000w, https://www.blef.fr/content/images/size/w1600/2025/01/image-3-1.png 1600w, https://www.blef.fr/content/images/2025/01/image-3-1.png 2000w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Snowflake (</span><a href="https://unsplash.com/photos/snowflakes-gOOaMbsrdyI?ref=blef.fr"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><ul><li>Canva's Snowflake journey:<ul><li><a href="https://www.canva.dev/blog/engineering/our-journey-to-snowflake-monitoring-mastery/?ref=blef.fr">Our journey to Snowflake monitoring mastery</a>.</li><li><a href="https://www.canva.dev/blog/engineering/snowpipe-streaming/?ref=blef.fr">Continuous data platform with Snowpipe Streaming</a>.</li></ul></li><li><a href="https://szarnyasg.org/posts/duckdb-vs-coreutils/?ref=blef.fr">DuckDB vs. coreutils</a> — A side-by-side comparison of DuckDB's counting features with grep and wc.</li><li><a href="https://aws.amazon.com/blogs/aws/new-amazon-s3-tables-storage-optimized-for-analytics-workloads/?ref=blef.fr">Amazon S3 Tables</a> —&nbsp;Amazon released S3 Tables, out-of-the-box support for Iceberg within S3, a few weeks ago. 
It came with a bang but with <a href="https://dataengineeringcentral.substack.com/p/amazon-s3-tables?ref=blef.fr">a few concerns</a>.</li><li><a href="https://www.cs.cmu.edu/~pavlo/blog/2025/01/2024-databases-retrospective.html?ref=blef.fr">Databases in 2024, a year in review</a> — Mainly, last year was about licensing issues, Databricks vs. Snowflake, and DuckDB trying to decrown pandas as the default.</li><li><a href="https://medium.com/wrenai/how-uber-is-saving-140-000-hours-each-month-using-text-to-sql-and-how-you-can-harness-the-same-fb4818ae4ea3?ref=blef.fr">How Uber is saving 140,000 hours/month using text-to-SQL</a> — They developed QueryGPT within their tooling, which helps Uber employees find the best stuff according to their needs. They have a table agent that lists the best tables according to the user intent.</li><li><a href="https://www.datastackdiagram.com/?ref=blef.fr">A tool to make data stack diagrams</a> — Great tool for stack diagrams.</li><li><a href="https://blog.alexewerlof.com/p/staff-engineer-vs-engineering-manager?ref=blef.fr">Staff Engineer vs Engineering Manager</a> — One of the best articles about the topic: when it comes to expertise, staff engineers are still rare. 
How should we structure the career ladder to make it work?</li><li><a href="https://www.snowflake.com/en/blog/anthropic-claude-sonnet-cortex-ai/?ref=blef.fr">Claude Sonnet 3.5 within Snowflake Cortex</a>.</li><li><a href="https://tech.kakao.com/posts/681?ref=blef.fr">Journey with Apache Flink &amp; Flink CDC</a>.</li></ul><p></p><p><a href="https://simonwillison.net/2025/Jan/2/they-spy-on-you-but-not-like-that/?ref=blef.fr">I still don’t think companies serve you ads based on spying through your microphone</a>.</p><p></p><h1 id="economy-%F0%9F%92%B0">Economy 💰</h1><ul><li><a href="https://www.calcalistech.com/ctechnews/article/h1eodtavjl?ref=blef.fr"><strong>Boomi</strong> acquires <strong>Rivery</strong></a> in a $100m deal.</li><li><a href="https://techcrunch.com/2024/11/20/snowflake-snaps-up-data-management-company-datavolo/?ref=blef.fr"><strong>Snowflake</strong> acquires <strong>Datavolo</strong></a>.</li><li><a href="https://techcrunch.com/2024/12/17/databricks-raises-10b-as-it-barrels-toward-an-ipo/?ref=blef.fr"><strong>Databricks</strong> raises $10B towards an IPO</a>.</li><li><a href="https://techcrunch.com/2024/12/19/in-just-4-months-ai-coding-assistant-cursor-raised-another-100m-at-a-2-5b-valuation-led-by-thrive-sources-say/?ref=blef.fr"><strong>Cursor</strong> raised $100m in Series B</a>.</li></ul><p></p><hr><p>I missed you ❤️ — and see you next week.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Small break until January ]]></title>
                    <description><![CDATA[ Data News #50 — Small break until January next year. ]]></description>
                    <link><![CDATA[ /data-news-week-50-2/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 675be18c6b1db2000121fa09 ]]></guid>
                    <pubDate><![CDATA[ 2024-12-13 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card"><img src="https://www.blef.fr/content/images/2024/12/image.png" class="kg-image" alt="" loading="lazy" width="2000" height="1325" srcset="https://www.blef.fr/content/images/size/w600/2024/12/image.png 600w, https://www.blef.fr/content/images/size/w1000/2024/12/image.png 1000w, https://www.blef.fr/content/images/size/w1600/2024/12/image.png 1600w, https://www.blef.fr/content/images/size/w2400/2024/12/image.png 2400w" sizes="(min-width: 720px) 720px"></figure><p>Hey, it's been a few weeks since something has been published here—I hope you haven’t forgotten about me 😊.</p><p>In the last few weeks I've been all over the place and worked on a lot of topics except this newsletter, so I've decided to take a break and catch the rhythm back up in January!</p><p>The <a href="https://www.forward-data-conference.com/?ref=blef.fr">Forward Data Conference</a> was a huge success and I want to thank again all the attendees, speakers, sponsors and my co-organisers. I can't wait to work on next year's edition. I also organised the first dlt Paris community meetup and I plan to do more events like this next year because I really like the IRL part of content sharing.</p><p>I'm also switching from freelancing to building a <a href="https://getnao.io/?ref=blef.fr">company</a> and I can't wait to share more about it.</p><p>About the Data News, I'd like to use the blog as a place for others to share stories, so if you wanna write or co-write articles with me, hmu.</p><p>I wish you all the best in advance, and see you next year ✨.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 24.45 ]]></title>
                    <description><![CDATA[ Data News #24.45 — dlt Paris meetup and Forward Data Conference approaching soon, SearchGPT, new Mistral API, dbt Coalesce and announcements and more. ]]></description>
                    <link><![CDATA[ /data-news-week-24-45/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 672496f2c3d7d400019e806e ]]></guid>
                    <pubDate><![CDATA[ 2024-11-08 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2024/11/image.png" class="kg-image" alt="" loading="lazy" width="2000" height="1125" srcset="https://www.blef.fr/content/images/size/w600/2024/11/image.png 600w, https://www.blef.fr/content/images/size/w1000/2024/11/image.png 1000w, https://www.blef.fr/content/images/size/w1600/2024/11/image.png 1600w, https://www.blef.fr/content/images/2024/11/image.png 2361w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Métro-boulot-dodo (</span><a href="https://unsplash.com/photos/time-lapse-photo-of-woman-inside-a-train-im7Tiw1OY7c?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p>It's Data News time. Time really flies, and apart from the bad news from across the Atlantic, all is well on my side. To be honest, I miss you folks. Writing here has been my little thing for the last 3 years and because I haven't been able to get back to my previous frequency since July, I feel empty every Friday.</p><p>I'm back in Paris and, wow, the way I live my life in Paris is so different from Berlin; Paris demands speed at every level. I've only been back 6 weeks and I feel like I never left these last 2 years. 
I haven't yet settled back into my routines, jumping between all the hats I've decided to accumulate over the last few years: <a href="https://www.blef.fr/">content</a>, freelancing, <a href="https://getnao.io/?ref=blef.fr">founder</a> and <a href="https://www.forward-data-conference.com/?ref=blef.fr">conference organiser</a>.</p><p>I don't know when I'll be able to start writing here once a week again, but I'm doing my best to do it as soon as possible.</p><p>Enough.</p><hr><p>On November 19th, I'm organising with the <a href="https://dlthub.com/?ref=blef.fr">dltHub</a> folks the first <a href="https://lu.ma/gsf3mjbz?ref=blef.fr">dlt Paris community meetup</a>; the event will take place at <a href="https://www.google.com/maps/search/?api=1&query=42&query_place_id=ChIJ9x4656lv5kcRgqM23RKIgE4&ref=blef.fr">42</a> and will start at 16h. It will feature:</p><ul><li>Navigating the complexities of enterprise ELT, towards data democracy and cost efficiency</li><li>Me — Towards a simple future (dlt, DuckDB, yato and more)</li><li>dltHub CTO, Marcin Rudolf — A teaser of the upcoming dltHub "Portable Data Lake"</li><li>Lightning community talks (reach out if you want to present something)</li></ul><p>It will be only a few days before the <a href="https://www.forward-data-conference.com/?ref=blef.fr">Forward Data Conference</a>, which sold out a few weeks ago; the <a href="https://www.forward-data-conference.com/programme/talks?ref=blef.fr">program is out</a>—the official schedule is coming soon. We are very proud and honoured, along with the organising committee, that around 300 people bought a paid ticket to the event. 
I'm sorry for those who weren't able to get a ticket; we've set up a waiting list and we're doing our best to find a way to push the walls.</p><p>I would also like to thank the sponsors who are accompanying us on this exciting adventure: <a href="https://www.castordoc.com/fr?ref=blef.fr">Castordoc</a>, <a href="https://omni.co/?ref=blef.fr">Omni</a>, <a href="https://www.corailanalytics.com/?ref=blef.fr">Corail Analytics</a>, <a href="https://www.mirakl.com/fr-FR?ref=blef.fr">Mirakl</a>, <a href="https://www.synq.io/?ref=blef.fr">SYNQ</a>, <a href="https://nibble.ai/?ref=blef.fr">nibble</a>, <a href="https://www.sparkline.fr/?ref=blef.fr">Sparkline</a> and <a href="https://www.montecarlodata.com/?ref=blef.fr">Monte Carlo</a>.</p><p>Whether at the dlt community meetup or at Forward, I look forward (no pun intended) to meeting you all.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2024/11/image-1.png" class="kg-image" alt="" loading="lazy" width="2000" height="1333" srcset="https://www.blef.fr/content/images/size/w600/2024/11/image-1.png 600w, https://www.blef.fr/content/images/size/w1000/2024/11/image-1.png 1000w, https://www.blef.fr/content/images/size/w1600/2024/11/image-1.png 1600w, https://www.blef.fr/content/images/size/w2400/2024/11/image-1.png 2400w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Conferences coming—I'm hyped (</span><a href="https://unsplash.com/photos/crowd-of-people-in-building-lobby-nOvIa_x_tfo?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p></p><h1 id="ai-news-%F0%9F%A4%96">AI News 🤖</h1><p>I'm a bit offended: AI news is not what it used to be 🙃 — we were used to more exciting news, competition and drama in the space.</p><ul><li><a href="https://openai.com/index/introducing-chatgpt-search/?ref=blef.fr">ChatGPT Search</a> — OpenAI finally 
plugged ChatGPT into the internet and live data. You can now switch on the web logo and ask the model to search the web alongside its training knowledge. When you mix it with the new <a href="https://openai.com/index/introducing-canvas/?ref=blef.fr">Canvas</a> UI, ChatGPT looks more and more like Google Search results.</li><li>HubSpot's co-founder bought chat.com earlier this year and sold it to OpenAI for shares [<a href="https://www.linkedin.com/posts/olivermolander_openai-activity-7260596174301057026-rbWS/?utm_source=share&utm_medium=member_desktop">via Oliver</a>].</li><li><a href="https://www.youtube.com/watch?v=jqx18KgIzAE&ref=blef.fr">Claude Computer Use</a> — Like a soufflé, everyone was hyped when Anthropic released <a href="https://www.anthropic.com/news/3-5-models-and-computer-use?ref=blef.fr">Computer Use</a>, a chat interacting with an operating system, but a few days later it looks like almost everyone had forgotten it, like the <a href="https://github.com/OpenInterpreter/01?ref=blef.fr">01 interpreter</a>. If you wanna try Computer Use (at your own risk), there is a <a href="https://github.com/anthropics/anthropic-quickstarts/tree/main/computer-use-demo?ref=blef.fr">repo with a Docker</a> image launching a VNC and a Streamlit app—it works fine.</li><li>New Mistral APIs — a <a href="https://mistral.ai/news/batch-api/?ref=blef.fr">batch API</a>, for batching calls rather than making them synchronously, lowering costs by 50%, and a <a href="https://mistral.ai/news/mistral-moderation/?ref=blef.fr">moderation API</a>, a classifier scoring text across several policy categories.</li><li><a href="https://github.com/ibm-granite/granite-3.0-language-models/tree/main?ref=blef.fr">IBM released lightweight open foundation models</a> — 2 sets of "small" models: 2B / 8B dense models and mixture-of-experts 1B / 3B models. 
IBM has <a href="https://www.linkedin.com/feed/update/urn:li:activity:7259535100927725569/?ref=blef.fr">proudly shared the datasets</a> they used to train their models.</li><li><a href="https://www.youtube.com/watch?v=nU_WaPpnlZA&ref=blef.fr">Skrub: Less data wrangling, more machine learning</a> — skrub is a preprocessing / feature engineering library for tabular machine learning. The video emphasises something critical: even if we often talk about training impact—time, carbon footprint—we tend to forget that inference is also a critical part, largely because of the preprocessing. So your preprocessing matters.</li><li><a href="https://www.youtube.com/watch?v=9vM4p9NN0Ts&ref=blef.fr">Stanford, "Building Large Language Models (LLMs)"</a> — 1h44 of a Stanford class about building LLMs. I did not watch it, but I bet you're gonna learn at least a thing watching it. </li><li><a href="https://www.sequoiacap.com/article/generative-ais-act-o1/?ref=blef.fr">Generative AI’s Act o1</a> — Sequoia's essay deep dives into the current state of LLM apps and infra, which has stabilised around Microsoft/OpenAI, AWS/Anthropic, Meta and Google/DeepMind. The text touches on fast vs. slow reasoning, System 1 and System 2 thinking, while awaiting the almighty AGI to come plunge us into a new technology era.</li><li><a href="https://www.latimes.com/business/story/2024-11-01/column-these-apple-researchers-just-proved-that-ai-bots-cant-think-and-possibly-never-will?ref=blef.fr">Apple proved LLMs do not reason</a> (at least mathematically) — Did we actually need Apple for this? It's by design, neither a bug nor a feature. Apple researchers published a <a href="https://arxiv.org/pdf/2410.05229?ref=blef.fr">paper</a> saying: <em>"our work underscores significant limitations in the ability of LLMs to perform genuine mathematical reasoning". 
</em>No way.</li><li>The future is robots — In recent weeks a lot of new robots were (re)announced, sometimes with new features, like <a href="https://x.com/TheHumanoidHub/status/1849352562480394291?ref=blef.fr">Humanoid robots</a>, <a href="https://en.wikipedia.org/wiki/Optimus_(robot)?ref=blef.fr">Tesla Optimus</a> and <a href="https://www.youtube.com/watch?v=F_7IPm7f1vI&t=22s&ref=blef.fr">Boston Dynamics</a>. Happy to learn that we really need robots to fix the jobs market.</li></ul><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2024/11/image-2.png" class="kg-image" alt="" loading="lazy" width="2000" height="1333" srcset="https://www.blef.fr/content/images/size/w600/2024/11/image-2.png 600w, https://www.blef.fr/content/images/size/w1000/2024/11/image-2.png 1000w, https://www.blef.fr/content/images/size/w1600/2024/11/image-2.png 1600w, https://www.blef.fr/content/images/size/w2400/2024/11/image-2.png 2400w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Google lacks money (</span><a href="https://unsplash.com/photos/fan-of-100-us-dollar-banknotes-lCPhGxs7pww?ref=blef.fr"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><ul><li><a href="https://edition.cnn.com/2024/10/31/tech/google-fines-russia/index.html?ref=blef.fr">Russia fines Google $20,000,000,000,000,000,000,000,000,000,000,000</a> — because a few Russian YouTube channels that were blocked 2 years ago are still blocked.</li><li><a href="https://www.linkedin.com/pulse/why-kafka-always-late-st%C3%A9phane-derosiaux-fq9ne/?trackingId=NN8zbaNTTDSaYqwip2cYiQ%3D%3D&ref=blef.fr">Why is Kafka always late?</a> — A great long article about Kafka's concept of what <strong>Time</strong> is, and which configuration levers can be pulled to fix potential issues.</li><li><a 
href="https://dlthub.com/blog/portability?ref=blef.fr">The path to vendor-agnostic data platforms</a> — The Iceberg trend and the competition between Snowflake and Databricks might be scary for the future; thinking about vendor-agnostic data platforms based on open-source technology is becoming a thing again. dlt + DuckDB is a combo that bridges a lot of gaps.</li><li>Snowflake and Databricks <a href="https://www.linkedin.com/posts/sridhar-ramaswamy_most-organizations-spend-70-of-their-budget-activity-7252356050098495490-PEHO?utm_source=share&utm_medium=member_desktop">finger</a> <a href="https://www.linkedin.com/feed/update/urn:li:activity:7257314057454510080/?updateEntityUrn=urn%3Ali%3Afs_updateV2%3A%28urn%3Ali%3Aactivity%3A7257314057454510080%2CFEED_DETAIL%2CEMPTY%2CDEFAULT%2Cfalse%29&ref=blef.fr">pointing</a> like kids — Recently both CEOs played the card of "we are better than them and we are so different". Someone wrote a <a href="https://medium.com/@robert.thompson75/databrick-vs-snowflake-by-the-numbers-82744dd4cb51?ref=blef.fr">side-by-side comparison with numbers</a>. Regarding performance, if you're going the Iceberg way, here's a <a href="https://www.prequel.co/blog/how-fast-is-iceberg-on-snowflake?ref=blef.fr">benchmark with Snowflake</a>.</li><li><a href="https://duckdb.org/2024/10/30/analytics-optimized-concurrent-transactions?ref=blef.fr">Analytics-optimized concurrent transactions in DuckDB</a> — Technical writeup about the concurrency concepts at stake within DuckDB and a good <a href="https://www.linkedin.com/feed/update/urn:li:activity:7257907053703114752/?updateEntityUrn=urn%3Ali%3Afs_updateV2%3A%28urn%3Ali%3Aactivity%3A7257907053703114752%2CFEED_DETAIL%2CEMPTY%2CDEFAULT%2Cfalse%29&ref=blef.fr">Hannes</a> interview.</li><li><a href="https://dataengineeringcentral.substack.com/p/duckdb-inside-postgres?ref=blef.fr">DuckDB inside Postgres!!??</a> — Daniel Beach showcased what <em>pg_duckdb</em> is and its out-of-the-box performance. 
He noticed that <em>pg_duckdb</em>'s performance is slower than raw Postgres, mainly because the extension does not support indexing yet.</li><li><a href="https://www.linkedin.com/pulse/data-warehousing-dead-vincent-rainardi-dfxre/?ref=blef.fr">Data warehousing is dead</a> — Actually the article says the opposite: with all the fuss around lakes, real big data warehouses with HUGE DATA are still out there. Funny article.</li><li><a href="https://www.economist.com/business/2024/10/15/why-microsoft-excel-wont-die?ref=blef.fr">Why Microsoft Excel won’t die</a> — A small reminder for me (and my co-founder).</li><li><a href="https://github.blog/news-insights/octoverse/octoverse-2024/?ref=blef.fr">Python becomes the most popular language on GitHub</a> — Who would have guessed. According to GitHub, Python overtook JavaScript as the #1 language. A lot of metrics are shared in the post, like the 5B contributions made on GitHub in 2024 🤯.</li><li><a href="https://githubnext.com/projects/github-spark?ref=blef.fr">GitHub announced GitHub Spark</a> — Can we enable anyone to create or adapt software for themselves, using AI and a fully-managed runtime? With a natural-language-based editor, everyone can spark an application ✨.</li><li><a href="https://www.youtube.com/watch?v=2g1nBbHgZbY&list=PLGudixcDaxY2NIjMYT8t5zA9KJ47wTCkM&index=19&ref=blef.fr">The road ahead: What’s coming in Airflow 3 and beyond?</a> — Keynote from the Airflow Summit about Airflow moving towards assets and more.</li><li><a href="https://www.snowflake.com/en/blog/govern-open-lakehouse-snowflake-open-catalog/?ref=blef.fr">Managed Polaris by Snowflake</a> — The open-source Iceberg catalog can be managed from the Snowflake UI.</li><li><a href="https://www.fivetran.com/blog/the-easy-button-for-replicating-postgresql-data-into-snowflake?ref=blef.fr">Fivetran is now in the Snowflake marketplace</a> — Tomorrow, the warehouses will ingest their data themselves; Fivetran can be used from the Snowflake marketplace. 
</li><li><a href="https://www.fivetran.com/blog/a-change-to-our-transformations-pricing-structure?ref=blef.fr">dbt runs in Fivetran become a paid feature</a> — you start to pay when you run more than 5000 models a month. In my local community in France I don't know a lot of companies orchestrating dbt this way.</li><li><a href="https://dlthub.com/blog/sql-benchmark-saas?ref=blef.fr">Ingestion performance comparison</a> — dlt compared their performance against other tools. Tbh, I'm quite impressed by dlt's perf.</li><li><strong>dbt Coalesce 2024</strong> — Sorry, I wasn't able to pull together a large recap of all the Coalesce talks this year, still on my todo tho. Gleb wrote about it on <a href="https://www.linkedin.com/feed/update/urn:li:activity:7252325370299846656/?updateEntityUrn=urn%3Ali%3Afs_updateV2%3A%28urn%3Ali%3Aactivity%3A7252325370299846656%2CFEED_DETAIL%2CEMPTY%2CDEFAULT%2Cfalse%29&ref=blef.fr">LinkedIn</a>; mainly, <a href="https://www.getdbt.com/blog/whats-new-in-dbt-cloud-november-2024?ref=blef.fr">dbt Explorer + mesh</a> become more complete, there is a new visual experience (like Alteryx or Knime) to write models, and Iceberg. All of this illustrates dbt Labs' strategy: targeting large companies (visual editor for non-tech people) and a move towards emancipation from the warehouse with Iceberg.</li><li><a href="https://tobikodata.com/dbt-incremental-but-incomplete.html?ref=blef.fr">dbt: incremental but incomplete</a> — A critique of the new dbt <a href="https://docs.getdbt.com/docs/build/incremental-microbatch?ref=blef.fr">microbatch</a> incremental feature; it works with Postgres, BigQuery, Spark and Snowflake. 
If I understand it right, microbatches are a way to batch your large incremental queries by a time dimension that you specify—see it as chunking.</li><li><a href="https://engineeringblog.yelp.com/2024/11/loading-data-into-redshift-with-dbt.html?ref=blef.fr">Loading data into Redshift with dbt</a> — lmao, it's been ages since I last heard about Redshift, so I wanted to give it a shout-out. </li><li><a href="https://github.com/canva-public/dbt-column-lineage-extractor?ref=blef.fr">dbt-column-lineage-extractor</a> — A Python CLI tool from the Canva team.</li><li><a href="https://cloud.google.com/blog/products/data-analytics/introducing-ai-driven-bigquery-data-preparation?hl=en&ref=blef.fr">BigQuery's AI-assisted data preparation</a> — Do data preparation in the fresh new BigQuery editor with natural language descriptions of the transformations.</li><li><a href="https://cloud.google.com/looker/docs/enabling-studio-in-looker?ref=blef.fr">Enabling and disabling Studio in Looker&nbsp;</a> — Looker Studio will become a feature that you activate / deactivate in your Looker setup.</li><li>❤️ <a href="https://www.rilldata.com/blog/bi-as-code-and-the-new-era-of-genbi?ref=blef.fr">In the era of GenBI</a> — one of the best posts of this week.</li><li><a href="https://www.lightdash.com/blogpost/lightdash-raises-series-a?ref=blef.fr"><strong>Lightdash</strong> raises $11m</a> —&nbsp;I really like Lightdash, it's a promising mix of a BI tool and a semantic layer, as code.</li></ul><h3 id="food-for-thoughts">Food for thoughts</h3><ul><li><a href="https://stkbailey.substack.com/p/docker-for-data-products?ref=blef.fr">Docker for data products</a>.</li><li><a href="https://craftingdataproducts.substack.com/p/why-agile-and-product-management?utm_campaign=post&utm_medium=web&triedRedirect=true&ref=blef.fr">Why Agile and Product Management fail with Data &amp; AI?</a></li><li><a href="https://luminousmen.com/post/whos-really-responsible-for-team-failures/?ref=blef.fr">Who’s really responsible for 
team failures?</a></li><li><a href="https://slack.engineering/empowering-engineers-with-ai/?ref=blef.fr">Empowering Engineers with AI at Slack</a>.</li></ul><hr><p>See you soon ❤️</p><p>PS: for our product research with nao we are looking for analytics engineers working on modeling daily. If you fall into this bucket, answer by saying "hi i'm an analytics engineer" and we will follow up on this 🤗.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 24.40 ]]></title>
                    <description><![CDATA[ Data News #24.40 — Back in Paris, Forward Data Conference program is out, OpenAI and Meta new stuff, DuckCon and a lot of things. ]]></description>
                    <link><![CDATA[ /data-news-week-24-40/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 6700d40dc6618b000118a409 ]]></guid>
                    <pubDate><![CDATA[ 2024-10-06 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2024/10/image-1.png" class="kg-image" alt="" loading="lazy" width="2000" height="1319" srcset="https://www.blef.fr/content/images/size/w600/2024/10/image-1.png 600w, https://www.blef.fr/content/images/size/w1000/2024/10/image-1.png 1000w, https://www.blef.fr/content/images/size/w1600/2024/10/image-1.png 1600w, https://www.blef.fr/content/images/size/w2400/2024/10/image-1.png 2400w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Back in Paris (</span><a href="https://unsplash.com/photos/a-building-with-a-bunch-of-signs-on-the-side-of-it-MpAgdPjCwbo?ref=blef.fr"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p>Hey, hey, hey. I'm so sorry for this small break in the news. I was in the middle of starting my new company, <a href="https://getnao.io/?ref=blef.fr">nao</a>, and moving back from Berlin to Paris. Still, I hope this edition finds you well; it will be a mix of personal news, the OpenAI saga and the usual data engineering stuff that I enjoy reading.</p><p>First things first, yes, I'm co-founding a company. We called the company nao and you can see it as a no-code semantic layer. I'll keep a full post about it for later, but if you're interested, hmu.</p><p>Then, my girlfriend and I decided to move back from Berlin to Paris after 2 years there. It's a professional move for both of us; we will miss Berlin, to be honest, but a big part of our social life is in Paris. 
Being in Paris will ease all the events and IRL stuff I go to / organise.</p><p></p><h1 id="forward-data-conference-%E2%9C%A8">Forward Data Conference ✨</h1><figure class="kg-card kg-image-card"><img src="https://www.blef.fr/content/images/2024/10/image.png" class="kg-image" alt="" loading="lazy" width="700" height="200" srcset="https://www.blef.fr/content/images/size/w600/2024/10/image.png 600w, https://www.blef.fr/content/images/2024/10/image.png 700w"></figure><p>As a reminder, on November 25th I'm organising the Forward Data Conference. It will be a day to shape the future of the data community, where teams can come to learn and grow together. There are still tickets left—we've sold around 80% of them.</p><p>This week we announced the program, you can find it on our <a href="https://www.forward-data-conference.com/en/programme/talks?ref=blef.fr">website</a>. I really like the program we put in place, it's a mix of Engineering and Strategic / Vision talks.</p><p>The conference will be held in French + English: a few talks will be given in French but we will subtitle them live, and we will also make sure there is always something in English in parallel for native English speakers.</p><p>You can use the <strong>BLEF_FWD24</strong> promo code to get a 15% reduction on your ticket.</p><div class="kg-card kg-button-card kg-align-center"><a href="https://www.billetweb.fr/forward-data-conference?ref=blef.fr" class="kg-btn kg-btn-accent">Get tickets for Forward Data Conference</a></div><p><em>PS: dear readers, if you proposed a talk to the FDC which has been rejected, I'm so sorry you did not get a detailed explanation; we received a lot of talks and I wasn't able to write a personal message for every talk that was rejected. 
Tho, if you're wondering why, reach out and I will explain.</em></p><p></p><h1 id="ai-news-%F0%9F%A4%96">AI News 🤖</h1><ul><li>OpenAI is our best saga of drama and tech, when is the Netflix show coming out?<ul><li><a href="https://www.linkedin.com/feed/update/urn:li:activity:7247656888023027712/?updateEntityUrn=urn%3Ali%3Afs_updateV2%3A%28urn%3Ali%3Aactivity%3A7247656888023027712%2CFEED_DETAIL%2CEMPTY%2CDEFAULT%2Cfalse%29&ref=blef.fr">DevDay recap</a> — OpenAI DevDay was the developer conference where they announce features, models and stuff about their product. The "biggest" announcement was around the <a href="https://openai.com/index/introducing-the-realtime-api/?ref=blef.fr">Realtime API</a> targeting speech-to-speech applications.<br><br>In addition they introduced <a href="https://openai.com/index/api-prompt-caching/?ref=blef.fr">prompt caching</a> to save token costs, and the possibility to <a href="https://openai.com/index/introducing-vision-to-the-fine-tuning-api/?ref=blef.fr">fine-tune vision for GPT-4o</a>. The last thing is <a href="https://openai.com/index/introducing-canvas/?ref=blef.fr">Canvas</a>, which is a new way to interact with the models; I'd say it's a mix of Notion and Anthropic's better UI. This is mandatory for OpenAI to improve and diversify their public UI/UX in order to compete with large app ecosystems.</li></ul></li><ul><li><a href="https://huggingface.co/openai/whisper-large-v3-turbo?ref=blef.fr">Whisper large v3 turbo</a> — A new turbo version of Whisper has been released on Hugging Face (<a href="https://github.com/openai/whisper/discussions/2363?ref=blef.fr">announcement</a>). 
Following the Realtime voice API, it's great to see improvements in Whisper, the voice model.</li><li><a href="https://www.reuters.com/technology/artificial-intelligence/openai-remove-non-profit-control-give-sam-altman-equity-sources-say-2024-09-25/?ref=blef.fr">OpenAI to remove non-profit control and give Sam Altman equity</a> — After a magic trick, Sam could receive equity in a company valued around $150b. The important note is also that OpenAI is moving its core business to for-profit, which will no longer be controlled by the non-profit board.</li></ul><ul><li><a href="https://x.com/OpenAI/status/1838642453391511892?ref=blef.fr">Advanced Voice not available in EU</a> — Advanced Voice is a Siri-like interface on top of ChatGPT capabilities. The unavailability in the EU is lobbying at its finest, fearing the AI Act or GDPR could harm innovation. Explain to me why companies with the best engineers in the world can't find a way to make things legal.</li><li><a href="https://www.theguardian.com/technology/2024/oct/02/openai-raises-66bn-in-funding-is-valued-at-157bn?ref=blef.fr">They raised $6.6b at a $157b valuation</a> (and <a href="https://www.crunchbase.com/funding_round/openai-debt-financing--82b1aa6b?ref=blef.fr">$4b in debt</a>). Another $10b after the first one in Jan 2023.</li></ul><li>Meta — if there was a race, Meta would be well positioned; who would have thought, after the Metaverse choices?<ul><li><a href="https://ai.meta.com/research/movie-gen/?ref=blef.fr">Meta Movie Gen</a> — Meta announces new research on movie generation models. Let's be honest, for the moment it just feels unreal, like a video game or something in virtual reality. 
But in the end, this is maybe what we need?</li><li>New hardware (powered with AI) — Two promising products have been demonstrated: a pair of <a href="https://x.com/altryne/status/1839007699255832583?ref=blef.fr">glasses</a> and a <a href="https://www.linkedin.com/feed/update/urn:li:activity:7245722659819311104/?updateEntityUrn=urn%3Ali%3Afs_updateV2%3A%28urn%3Ali%3Aactivity%3A7245722659819311104%2CFEED_DETAIL%2CEMPTY%2CDEFAULT%2Cfalse%29&ref=blef.fr">wristband</a> that lets you interact with virtual interfaces through your finger movements.</li><li>SAM 2, Segment Anything Model 2, <a href="https://www.linkedin.com/feed/update/urn:li:activity:7243618695032225793/?updateEntityUrn=urn%3Ali%3Afs_updateV2%3A%28urn%3Ali%3Aactivity%3A7243618695032225793%2CFEED_DETAIL%2CEMPTY%2CDEFAULT%2Cfalse%29&ref=blef.fr">can run on-device on Apple CoreML</a> — A demo of image segmentation that runs 100% offline and on-device. Industrial applications might easily follow from this.</li><li>Mark Zuckerberg says <a href="https://www.yahoo.com/tech/mark-zuckerberg-says-leaders-technical-092702401.html?guccounter=1&ref=blef.fr">leaders should have technical skills if they want to call themselves a tech company</a>. Yes, but technical leaders are also sometimes not the best ones, maybe the crazy ones, so other skills are required.</li></ul></li><li><a href="https://www.anthropic.com/news/contextual-retrieval?ref=blef.fr">Introducing contextual retrieval</a> — Anthropic introduced a new way to do RAG with more context that performs better than the standard approach.</li><li>Meta and Google announced automatic dubbing for <a href="https://www.techradar.com/computing/artificial-intelligence/meta-announces-an-ai-translation-tool-that-could-change-the-way-you-watch-instagram-and-facebook-reels-forever?ref=blef.fr">Reels</a> and <a href="https://www.socialmediatoday.com/news/youtube-announces-expansion-auto-dubbing-more-creators-languages/727573/?ref=blef.fr">YouTube videos</a> respectively, this is something. 
Translation looks like a use-case that is <em>almost</em> solved with LLMs. It unlocks a world where languages are no longer barriers, giving us instant access to content and discussions from all around the world, especially if it can run on-device, cheaply.</li><li><a href="https://github.com/fmind/bromate?ref=blef.fr">Web browser automation through agentic workflows</a> — A Github repo with a demo using Gemini and Selenium to automate browser actions.</li><li><a href="https://microsoft.github.io/autogen/blog/2024/10/02/new-autogen-architecture-preview/?ref=blef.fr">New AutoGen architecture</a> — AutoGen is an open-source programming framework for agentic workflows; they designed a new architecture (to be honest I don't know what it means).</li><li><a href="https://www.linkedin.com/feed/update/urn:li:activity:7240795734436945920/?updateEntityUrn=urn%3Ali%3Afs_updateV2%3A%28urn%3Ali%3Aactivity%3A7240795734436945920%2CFEED_DETAIL%2CEMPTY%2CDEFAULT%2Cfalse%29&ref=blef.fr">Klarna drama</a> — Klarna's CEO announced he will shut down Salesforce and Workday to replace them with internal initiatives + AI. Let's see where it goes.</li><li><a href="https://www.rfi.fr/en/france/20240927-paris-police-chief-backs-keeping-ai-surveillance-in-place-post-olympics?ref=blef.fr">Paris police wants to keep AI surveillance in place post-Olympics</a> — Who could have predicted?</li><li><a href="https://trends.malt.com/en/ai-report?ref=blef.fr">Malt AI report</a> — Malt is a French / European freelance marketplace and they dropped their new AI report. A few things I noted going through the report are below.<ul><li><em>Snowflake</em> demand has largely increased and it's close to <em>Databricks</em> in volume, tho <em>Hadoop</em> demand is still larger 🙃</li><li>The biggest demand concerns stuff around AI like LLM, Deep Learning, Machine Learning, scikit-learn, etc. 
— in 2024 there are <em>16k AI freelancer profiles</em></li><li><em>dbt</em> pops out as a specific skill on freelancer profiles</li><li>AI engineers and scientists have an average daily rate around 500€, which is 100€ more than the general tech and data category.</li><li>AI supply is half data scientists, half all other tech positions (DA, DE, Back-end, SE, DevOps).</li></ul></li></ul><p></p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2024/10/image-2.png" class="kg-image" alt="" loading="lazy" width="2000" height="1125" srcset="https://www.blef.fr/content/images/size/w600/2024/10/image-2.png 600w, https://www.blef.fr/content/images/size/w1000/2024/10/image-2.png 1000w, https://www.blef.fr/content/images/size/w1600/2024/10/image-2.png 1600w, https://www.blef.fr/content/images/size/w2400/2024/10/image-2.png 2400w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Build the foundations (</span><a href="https://unsplash.com/photos/an-aerial-view-of-a-building-under-construction-eMs5ghrVW7M?ref=blef.fr"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li><a href="https://duckdb.org/2025/01/31/duckcon6.html?ref=blef.fr">CfP for DuckCon</a> in Amsterdam on January 31, 2025 — In January next year, DuckCon will take place; the call for papers is still open until Oct 18th. I might propose something about yato (?).</li><li><a href="https://dlthub.com/blog/dlt-v1?ref=blef.fr">dlt goes 1.0.0</a> — dlt announced their 1.0.0 version, as well as 1000 open-source customers in production. 
This version brings stability and marks a new milestone for the library.<br><br><em>Side note, I'm a dltHub investor.</em></li><li><a href="https://airbyte.com/blog/1-0-prime-time?ref=blef.fr">Airbyte is also going 1.0</a> — Following dlt (?), Airbyte is also going 1.0 with 3 objectives: more use-cases, reliability and better throughput performance.</li><li>❤️ <a href="https://docs.google.com/spreadsheets/d/1Wx6S3qUjjSuK-VX2tkoydTZGb1LzcYnht4N_WkBwApI/edit?gid=0&ref=blef.fr#gid=0">NO SLIDES conference</a> — Be careful before clicking on this link: you might lose yourself in a rabbit hole. Recently Timo organised a NO SLIDES conference, a conference where people would only share their screen, with no slides. I participated to demo nao, but the demo failed, so the recording does not exist anymore (oops); still, I've watched a few other talks and really enjoyed them.</li><li><a href="https://medium.pimpaudben.fr/elt-with-kestra-duckdb-dbt-neon-and-resend-5bfd62160190?ref=blef.fr">ELT with Kestra, DuckDB, dbt, Neon and Resend</a> — How you can create declarative data pipelines with Kestra to move data using the trendy libraries.</li><li>DuckDB is the <a href="https://davidsj.substack.com/p/foundation?ref=blef.fr">foundation</a>.</li><li><a href="https://www.linkedin.com/feed/update/urn:li:activity:7243008879301664768/?updateEntityUrn=urn%3Ali%3Afs_updateV2%3A%28urn%3Ali%3Aactivity%3A7243008879301664768%2CFEED_DETAIL%2CEMPTY%2CDEFAULT%2Cfalse%29&ref=blef.fr">Fast feedback when SQL writing</a> — A nice experiment showcasing how writing SQL tomorrow could look. 
Imagine getting results directly while typing to have a faster iteration loop.</li><li><a href="https://cloud.google.com/blog/products/data-analytics/bigquery-jobs-explorer-is-now-ga?hl=en&ref=blef.fr">BigQuery jobs explorer refreshed</a> — The Google team released a fresh new explorer for BigQuery Jobs.</li><li>Coursera and <em>Joe Reis</em> launched a <a href="https://www.coursera.org/professional-certificates/data-engineering?ref=blef.fr">Data Engineering Professional Certificate</a> — I can't recommend Joe enough; he's one of the best when it comes to capturing the data engineering job, and the syllabus is great.</li><li><a href="https://www.databricks.com/blog/whats-new-with-databricks-sql?ref=blef.fr">Current state of Databricks SQL</a> — "The best data warehouse is a lakehouse", lmao. Episode 21425325 in the competition between <a href="https://www.blef.fr/databricks-snowflake-and-the-future/">Snowflake and Databricks</a>.</li><li><a href="https://craftingdataproducts.substack.com/p/the-data-death-cycle?ref=blef.fr">The data death cycle</a> —&nbsp;5 traps you wanna avoid to deliver value with Data &amp; AI products: the tech trap, the doing trap, the project trap, the silo trap and the performance-first trap. 
And <a href="https://www.getorchestra.io/blog/the-data-death-cycle-avoiding-the-silo-trap?ref=blef.fr">follow-up about silos</a> by Hugo.</li></ul><p></p><h3 id="no-comments">No comments</h3><p>Mainly because of time and the length of this issue.</p><ul><li><a href="https://benn.substack.com/p/is-excel-immortal?ref=blef.fr">Is Excel immortal?</a> —&nbsp;Benn</li><li><a href="https://engineering.grab.com/catwalk-evolution?ref=blef.fr">Evolution of catwalk: model serving platform at Grab</a>.</li><li><a href="https://engineering.hometogo.com/how-hometogo-improved-our-superset-monitoring-framework-60eb98e1a650?ref=blef.fr">How HomeToGo improved our Superset monitoring framework</a>.</li><li><a href="https://luminousmen.com/post/the-importance-of-clear-software-requirements/?ref=blef.fr">The importance of clear software requirements</a>.</li></ul><p></p><h1 id="data-economy-%F0%9F%92%B0">Data Economy 💰</h1><ul><li><a href="https://www.theguardian.com/technology/2024/oct/02/openai-raises-66bn-in-funding-is-valued-at-157bn?ref=blef.fr"><strong>OpenAI</strong> raises $6.6b at $157b valuation</a>. SoftBank <a href="https://www.theinformation.com/articles/softbank-to-invest-500-million-in-openai?ref=blef.fr">goes in with half a billion</a>.</li><li><a href="https://supabase.com/?ref=blef.fr"><strong>Supabase</strong></a> <a href="https://techcrunch.com/2024/09/25/supabase-a-postgres-centric-developer-platform-raises-80m-series-c/?ref=blef.fr">raises $80m Series C</a>. It's an open-source <a href="https://en.wikipedia.org/wiki/Firebase?ref=blef.fr">Firebase</a> alternative built on top of Postgres.</li><li><a href="https://kestra.io/?ref=blef.fr"><strong>Kestra</strong></a> <a href="https://medium.com/@edarras/how-kestra-raised-8m-our-seed-deck-now-public-b3493f5a9fbb?ref=blef.fr">raises $8m Series A</a>. 
Kestra is an open-source orchestration engine, written in Java, where you create workflows using a <a href="https://www.linkedin.com/feed/update/urn:li:activity:7242071652748972032/?ref=blef.fr">declarative</a> model. Ludovic, the CTO, wrote about turning an <a href="https://www.linkedin.com/pulse/lessons-learned-from-turning-open-source-project-viable-ludovic-dehon-e876e/?trackingId=7E87MGnbR%2BqpR4SDM7nhyw%3D%3D&ref=blef.fr">open-source project into a viable business</a>.</li><li><a href="https://fal.ai/?ref=blef.fr"><strong>fal.ai</strong></a> <a href="https://blog.fal.ai/generative-media-needs-speed-fal-has-raised-23m-to-accelerate/?ref=blef.fr">raises $14m Series A</a>. For readers who have been here a long time, you might remember fal.ai: they were the first to propose a way to mix Python and dbt models with specific tooling, and they have since pivoted into a super-fast GenAI inference platform.</li><li><a href="https://www.forbes.com/sites/janakirammsv/2024/09/30/nvidia-acquires-octoai-to-dominate-enterprise-generative-ai-solutions/?ref=blef.fr"><strong>NVidia</strong> acquires <strong>OctoAI</strong></a>.</li><li><strong>BlackRock</strong> and <strong>Microsoft</strong> <a href="https://www.ft.com/content/4441114b-a105-439c-949b-1e7f81517deb?ref=blef.fr">plan $30bn fund to invest in AI infrastructure</a>.</li><li><strong>Voltron Data</strong> <a href="https://x.com/_Felipe/status/1840759201909318097?ref=blef.fr">laid off 50+ employees recently</a>. Voltron engineers are among the best when it comes to the under-the-hood engines powering our modern data platforms.</li><li><a href="https://probabl.ai/?ref=blef.fr"><strong>:probabl</strong>.</a> <a href="https://papers.probabl.ai/announcing-major-milestone-empowering-the-future-of-data-science?ref=blef.fr">raised €5.5m Seed round</a>. probabl is the official operator of the scikit-learn brand and will develop products and services around the library. 
Because we need the data science tooling to be and stay open-source.<br><br><em>Side note, I'm a :probabl investor.</em></li></ul><hr><p>See you soon ❤️</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 24.37 ]]></title>
                    <description><![CDATA[ Data News #24.37 — OpenAI o1 new series, building low cost platform with Model dlt and dbt, Data teams survey, feature store, Ibis without pandas. ]]></description>
                    <link><![CDATA[ /data-news-week-24-37/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 66e3f169c715700001fc574f ]]></guid>
                    <pubDate><![CDATA[ 2024-09-13 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2024/09/image.png" class="kg-image" alt="" loading="lazy" width="2000" height="1333" srcset="https://www.blef.fr/content/images/size/w600/2024/09/image.png 600w, https://www.blef.fr/content/images/size/w1000/2024/09/image.png 1000w, https://www.blef.fr/content/images/size/w1600/2024/09/image.png 1600w, https://www.blef.fr/content/images/size/w2400/2024/09/image.png 2400w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Back to work (</span><a href="https://unsplash.com/photos/a-group-of-boats-on-a-beach-1stcb527UGU?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p>Hey you, can you believe it's already September? This year has been flying. It feels like I just blinked, and here we are. In August, I've been focusing mainly on my next big journey—if you follow me on LinkedIn, you might have caught a sneak peek! I'll be making a full announcement next week. I want to take the time to explain my thought process and ideas behind it. I hope you will like it.</p><p>Below are the Data News wrapping up summer and the first two weeks of Sept. </p><h1 id="ai-news-%F0%9F%A4%96">AI News 🤖</h1><ul><li>OpenAI released 2 new models, <a href="https://openai.com/index/introducing-openai-o1-preview/?ref=blef.fr">OpenAI o1-preview and o1-mini</a> — These models bring changes and a break in model naming. OpenAI decided to give up on the GPT naming, which means GPT-5 will never see the light of day. The <a href="https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf?ref=blef.fr">GPT paper</a> was co-authored by 4 people, 3 of whom are no longer at OpenAI; leaving GPTs behind also marks a change in paradigm. 
<br><br>The o1 series brings more “<em>reasoning</em>”; it looks like a pre-prompt that does a <a href="https://www.promptingguide.ai/techniques/cot?ref=blef.fr">chain of thought</a> on top of what they already did best. Lots of stories about exceptional things the model can do have been published today—e.g. the OpenAI <a href="https://openai.com/index/openai-o1-system-card/?ref=blef.fr">system card</a> explains that, during a cybersecurity challenge (a CTF), the model was able to understand a failing Docker environment (due to infra issues) and still find the flag.<br><br>Here is a <a href="https://www.youtube.com/playlist?list=PLOXw6I10VTv_T9QV-DKXhq7HFUQRkGQLI&ref=blef.fr">YouTube playlist</a> demonstrating o1's capabilities.<br><br>As clem mentioned on Twitter, it's always important to pay attention to words: even if the model “<em>reasons</em>”, <a href="https://x.com/ClementDelangue/status/1834283206474191320?ref=blef.fr">it doesn't think, it processes</a>.</li><li>More news about OpenAI<ul><li><a href="https://azure.microsoft.com/en-us/blog/introducing-o1-openais-new-reasoning-model-series-for-developers-and-enterprises-on-azure/?ref=blef.fr">New models are already available on Azure</a> ; but be aware, Microsoft's open-source <a href="https://huggingface.co/microsoft/Phi-3.5-mini-instruct?ref=blef.fr">Phi-3.5-mini</a> is out.</li><li>Ilya Sutskever, previously Chief Scientist at OpenAI, raised $1b to co-found Safe Superintelligence with a <a href="https://ssi.inc/?ref=blef.fr">manifesto</a>.</li><li>Alexis Conneau, ex-research lead of “Her” at OpenAI, decided to create a new company and got a lot of <a href="https://x.com/alex_conneau/status/1833535309902189015?ref=blef.fr">Tweet impressions</a>. 
Former OpenAI members are quite popular when it comes to founding companies.</li><li>Bloomberg reported that <a href="https://www.bloomberg.com/news/articles/2024-09-11/openai-fundraising-set-to-vault-startup-s-value-to-150-billion?ref=blef.fr">OpenAI seeks to raise $11.5b more at $150b valuation</a>, making it the third most valuable private company [paywall article].</li><li>NEO Beta, a humanoid company backed by OpenAI, released a first <a href="https://www.linkedin.com/feed/update/urn:li:activity:7235643741191995392/?updateEntityUrn=urn%3Ali%3Afs_updateV2%3A%28urn%3Ali%3Aactivity%3A7235643741191995392%2CFEED_DETAIL%2CEMPTY%2CDEFAULT%2Cfalse%29&ref=blef.fr">video demo</a>. And it's impressive (🙃), the robot is able to hand over a bag to a human!</li><li>We hope the next OpenAI model is not o7. /s</li></ul></li><li><a href="https://www.nist.gov/news-events/news/2024/08/us-ai-safety-institute-signs-agreements-regarding-ai-safety-research?ref=blef.fr">OpenAI and Anthropic will give their models first to the US government</a> (NIST) to help advance safe and trustworthy AI innovation for all. But they cry when Europe votes the AI Act, claiming it threatens innovation.</li><li><a href="https://github.com/NVlabs/EAGLE?ref=blef.fr">NVidia released Eagle, a vision-centric multimodal LLM</a> — Look at the example in the Github repo: given an image and a user input, the LLM is able to answer things like "Describe the image in detail" or "Which car in the picture is more aerodynamic" based on a drawing.</li><li><a href="https://aleph-alpha.com/introducing-pharia-1-llm-transparent-and-compliant/?ref=blef.fr">Aleph Alpha introduced Pharia-1-LLM</a> — it's a <a href="https://huggingface.co/Aleph-Alpha/Pharia-1-LLM-7B-control?ref=blef.fr">7B model</a> and the license explicitly targets non-commercial and research usage. 
Aleph Alpha, a German company funded by German VCs (with $500m), was trying to compete with US companies (like Mistral and OpenAI 🤭) in the models race, but gave up this competition to pivot to an AI-support company for the public sector.</li></ul><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2024/09/image-1.png" class="kg-image" alt="" loading="lazy" width="2000" height="1330" srcset="https://www.blef.fr/content/images/size/w600/2024/09/image-1.png 600w, https://www.blef.fr/content/images/size/w1000/2024/09/image-1.png 1000w, https://www.blef.fr/content/images/size/w1600/2024/09/image-1.png 1600w, https://www.blef.fr/content/images/size/w2400/2024/09/image-1.png 2400w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Calm data flows (</span><a href="https://unsplash.com/photos/brown-wooden-boat-on-body-of-water-overlooking-houses-by-the-shore-at-daytime-Pnc2Uxb7PG0?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><ul><li><a href="https://policy.trade.ec.europa.eu/news/eu-and-china-launch-cross-border-data-flow-communication-mechanism-2024-08-28_en?ref=blef.fr">EU and China launch cross-border data flow communication mechanism</a> — It's an official statement saying that the EU and China will re-discuss the policy about data transfer out of China for European companies, which is difficult. </li><li><a href="https://modal.com/blog/analytics-stack?ref=blef.fr">Building a cost-effective analytics stack with Modal, dlt, and dbt</a> — A great example of how you can build a small analytics stack in today's world with dlt, dbt and Modal, a serverless platform to run Python stuff. 
The article contains a lot of code snippets to understand what's under the hood.</li><li><a href="https://www.jesse-anderson.com/2024/08/data-teams-survey-2024-results/?ref=blef.fr">Data teams survey 2024</a> — Jesse Anderson released the results of his survey about the state of data teams in 2024.</li><li><a href="https://maxhalford.github.io/blog/python-daily-cache/?ref=blef.fr">Daily cache implementation in Python</a> — A highly effective approach for caching when working with large datasets stored in distant buckets is to implement a local cache. It avoids the need to repeatedly download the data.</li><li><a href="https://www.tweag.io/blog/2024-06-06-safer-composable-python/?ref=blef.fr">Safe composable Python</a> — A good article about function composition and testing in Python and how it all articulates together.</li><li><a href="https://www.startdataengineering.com/post/parts-of-dataengineering/?ref=blef.fr">What are the key parts of data engineering</a> — A simple way to present the key parts of data engineering.</li><li><a href="https://blogs.halodoc.io/automation-for-error-handling-in-data-warehouse/?ref=blef.fr">Automation strategies for monitoring and self-healing of data pipelines</a> — I like the concept of self-healing pipelines, tho I'm not sure it's really idempotent or that it leads to great management of data assets; still, the article also relates to data contracts and the problem might be solved another way.</li><li><a href="https://medium.engineering/laying-the-foundations-lists-in-mediums-feature-store-part-1-8fc075b5d355?ref=blef.fr">Medium feature store, how do they store lists</a> — Medium built a feature store powering their recommendation system (which could work better tbh); in this blog they explain how they decided to store features of type list.</li><li><a href="https://www.nytimes.com/athletic/5697684/2024/09/03/football-analytics-uk-evolution/?ref=blef.fr">How UK football relies heavily on data</a> — It's common knowledge that 
Liverpool FC won multiple titles recently by being data-driven. This article shows how data teams are becoming larger and larger in the clubs. In the Premier League top 6, the average data team headcount is 14.</li><li><a href="https://engineering.atspotify.com/2024/09/are-you-a-dalia-how-we-created-data-science-personas-for-spotifys-analytics-platform/?ref=blef.fr">Spotify data science personas</a> — The data science role is evolving and Spotify proposed multiple personas among their data teams.</li><li><a href="https://engineering.atspotify.com/2024/08/unlocking-insights-with-high-quality-dashboards-at-scale/?ref=blef.fr">Unlocking insights with high-quality dashboards at scale</a> — A checklist of stuff that you should have a look at to build high-quality dashboards. They even developed a Dashboard portal to improve dashboard usage and discovery.</li><li><a href="https://ibis-project.org/posts/farewell-pandas/?ref=blef.fr">Ibis drops pandas backend and fully embraces DuckDB</a> — It's a big choice: moving forward, Ibis, a multi-backend dataframe library, decided to drop pandas support and use DuckDB by default. The article says that DuckDB is way faster and covers the feature gap, and that pandas was mostly annoying because of <em>NaN</em> for null values, whereas it's <em>NULL</em> for all the other backends.</li><li><a href="https://duckdb.org/2024/08/15/duckcon5?ref=blef.fr">DuckCon #5 videos</a> — All the videos from the 5th DuckCon in Seattle are on YouTube. 
I have not had the time yet to look at them, but I think awesome things are waiting for us behind a click.</li><li><a href="https://hightouch.com/blog/migrating-to-iceberg-lakehouse?ref=blef.fr">Should you be migrating to an Iceberg Lakehouse?</a> — This is an excellent question, and a good starting point for considering whether you should change all your assets to Iceberg.</li><li><a href="https://aws.amazon.com/about-aws/whats-new/2024/08/amazon-s3-conditional-writes?ref=blef.fr">Amazon S3 now supports conditional writes</a> — I think it's a great start for table formats.</li><li><a href="https://www.getdbt.com/resources/guides/the-analytics-development-lifecycle?ref=blef.fr">The analytics development lifecycle by dbt</a> — Tristan Handy proposed a framework (very Enterprise) to rethink how the analytics workflow should work as of today. The article is 37 minutes long so I did not read it all, but I saw the holy DevOps/DataOps infinity sign, which gave me an instant headache. Toby from SQLMesh answered on <a href="https://www.linkedin.com/feed/update/urn:li:activity:7240183170715820033/?updateEntityUrn=urn%3Ali%3Afs_updateV2%3A%28urn%3Ali%3Aactivity%3A7240183170715820033%2CFEED_DETAIL%2CEMPTY%2CDEFAULT%2Cfalse%29&ref=blef.fr">LinkedIn</a>, drama time.</li><li><a href="https://tobikodata.com/making-sqlmesh-faster.html?ref=blef.fr">Making SQLMesh faster</a> — the road for SQLMesh is crystal clear: they want to be the faster alternative to unmanageable large dbt projects, so they work on execution speed.</li><li><a href="https://fromanengineersight.substack.com/p/issue-39-whats-your-question?ref=blef.fr">What's your question?</a></li></ul><hr><p>See you next week ❤️</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 24.34 ]]></title>
                    <description><![CDATA[ Data News #24.34 — Forward Data Conference guest speakers, Data Engineering for AI/ML, AI news and a lot of great fast news. ]]></description>
                    <link><![CDATA[ /data-news-week-24-34/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 66c82ba23343f50001d2ee17 ]]></guid>
                    <pubDate><![CDATA[ 2024-08-24 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2024/08/image-1.png" class="kg-image" alt="" loading="lazy" width="1500" height="1000" srcset="https://www.blef.fr/content/images/size/w600/2024/08/image-1.png 600w, https://www.blef.fr/content/images/size/w1000/2024/08/image-1.png 1000w, https://www.blef.fr/content/images/2024/08/image-1.png 1500w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">News again.. (</span><a href="https://diymag.com/review/live/fred-again-alexandra-palace-london-september-2023?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p>It's been 3 weeks. </p><p>Summer continues and I hope this new edition finds you well, having had a great vacation and a nice break before getting back to business in September. Content and articles have been a little slow over the last few weeks and that's to be expected, but I feel it's gonna get back to business as usual soon.</p><p>Some personal news: in September things will be changing professionally on my side; I'll be slowly leaving the freelancing world. More details soon in a 2-part article I'm writing about it. I can't wait to tell you more, to be honest. Still, the newsletter is gonna keep the same formula.</p><h1 id="events-%E2%9C%A8">Events ✨</h1><figure class="kg-card kg-image-card"><a href="https://www.billetweb.fr/forward-data-conference?ref=blef.fr"><img src="https://www.blef.fr/content/images/2024/08/image.png" class="kg-image" alt="" loading="lazy" width="700" height="200" srcset="https://www.blef.fr/content/images/size/w600/2024/08/image.png 600w, https://www.blef.fr/content/images/2024/08/image.png 700w"></a></figure><p>As you may know I'm co-organising the <a href="https://www.forward-data-conference.com/?ref=blef.fr">Forward Data Conference</a> on November 25th in Paris. 
The Forward Data Conference will be a day to shape the future of the data community, where teams can come to learn and grow together. There are still tickets left—we have sold around 60% of them.</p><p>We have started to announce a few guest speakers for the conference that I can't wait to hear on stage. At the moment we have announced:</p><ul><li><a href="https://www.linkedin.com/in/josephreis/?ref=blef.fr">Joe Reis</a>, best-selling author and data engineer; he will speak about the new art of data modeling</li><li><a href="https://www.linkedin.com/in/hfmuehleisen/?ref=blef.fr">Hannes Mühleisen</a>, co-creator of DuckDB</li><li><a href="https://www.linkedin.com/in/clairelebarz/?ref=blef.fr">Claire Lebarz</a>, Chief Data and AI at Malt, who previously worked at Airbnb</li><li><a href="https://www.linkedin.com/in/virginiecornu/?ref=blef.fr">Virginie Cornu</a>, Co-founder / CTPO and previously VP data at Jellysmack</li></ul><p>You can use the <strong>BLEF_FWD24</strong> promo-code to get a 15% reduction on your ticket.</p><div class="kg-card kg-button-card kg-align-center"><a href="https://www.billetweb.fr/forward-data-conference?ref=blef.fr" class="kg-btn kg-btn-accent">Get tickets for Forward Data Conference</a></div><hr><p>Transition to another event. Demetrios Brinkmann is organising on Sep 12—in 18 days—a <strong>free</strong> online conference called <a href="https://home.mlops.community/public/events/dataengforai?ref=blef.fr">Data Engineering for AI/ML</a>; the agenda is pretty packed and the lineup is full of awesome speakers (Joe and Hannes will be there ☺️). 
The idea of the conference is to go deeper into the current state of AI/ML and how data engineering in 2024 serves ML and AI teams.</p><div class="kg-card kg-button-card kg-align-center"><a href="https://home.mlops.community/public/events/dataengforai?ref=blef.fr" class="kg-btn kg-btn-accent">Register for Data Engineering AI/ML</a></div><h1 id="ai-news-%F0%9F%A4%96">AI News 🤖</h1><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2024/08/image-2.png" class="kg-image" alt="" loading="lazy" width="1296" height="730" srcset="https://www.blef.fr/content/images/size/w600/2024/08/image-2.png 600w, https://www.blef.fr/content/images/size/w1000/2024/08/image-2.png 1000w, https://www.blef.fr/content/images/2024/08/image-2.png 1296w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">The 3 last OpenAI co-founders (credits: HBO)</span></figcaption></figure><ul><li><a href="https://www.entrepreneur.com/business-news/chatgpt-cofounders-leaders-leaving-openai-3-left-of-11/478125?ref=blef.fr">Drama at OpenAI, people leaving</a> (again) — Only 3 of the 11 original co-founders are still at OpenAI. 
In July, reports were saying that OpenAI could be on track to make a $5b loss.</li><li><a href="https://openai.com/index/introducing-structured-outputs-in-the-api/?ref=blef.fr">OpenAI structured outputs</a> — You can now force the OpenAI API to return output conforming to a specific enforced JSON schema when calling.</li><li>New image generation capabilities — The German lab <a href="https://www.linkedin.com/feed/update/urn:li:activity:7226119613372108800/?updateEntityUrn=urn%3Ali%3Afs_updateV2%3A%28urn%3Ali%3Aactivity%3A7226119613372108800%2CFEED_DETAIL%2CEMPTY%2CDEFAULT%2Cfalse%29&ref=blef.fr">Black Forest</a> created more-realistic-than-ever images with their model Flux.</li><li><a href="https://engineering.fb.com/2024/08/14/production-engineering/how-meta-animates-ai-generated-images-at-scale/?ref=blef.fr">How Meta animates AI-generated images at scale</a>.</li><li>Meta's <a href="https://x.com/ylecun/status/1818167736813711686?ref=blef.fr">segment anything model</a> (SAM v2) is impressive at identifying anything in images.</li><li><a href="https://www.etsy.com/codeascraft/machine-learning-in-content-moderation-at-etsy?ref=blef.fr">ML in content moderation at Etsy</a>.</li><li><a href="https://www.linkedin.com/blog/engineering/search/introducing-semantic-capability-in-linkedins-content-search-engine?ref=blef.fr">Semantics in the LinkedIn search engine</a>.</li><li><a href="https://survey.stackoverflow.co/2024/ai/?ref=blef.fr">StackOverflow AI survey</a> — a few insights quoted from the survey<ul><li>76% of all respondents are using or are planning to use AI tools in their development process this year, up from 70% last year</li><li>81% agree increasing productivity is the biggest benefit that developers identify for AI tools.</li><li>70% of professional developers do not perceive AI as a threat to their job</li></ul></li><li><a href="https://raphaelvienne.substack.com/p/watermarking-generative-ai?ref=blef.fr">Watermarking Generative AI</a> — I believe in 
watermarking, and I root for user apps in the future that identify watermarks and inform users whether content is AI, not AI or unknown.</li><li><a href="https://opensource.org/deepdive/drafts/open-source-ai-definition-draft-v-0-0-9?ref=blef.fr">The open-source AI definition (v0.0.9)</a> — An attempt to put words on a definition, in the open, of what AI is: "an AI system is a machine-based system that, for explicit or implicit objectives, infers, from the input it receives, how to generate outputs such as predictions, content, recommendations, or decisions that can influence physical or virtual environments".</li><li><a href="https://towardsdatascience.com/what-nobody-tells-you-about-rags-b35f017e1570?ref=blef.fr">What nobody tells you about RAG</a> — Large deep-dive about RAGs.</li></ul><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li><a href="https://seattledataguy.substack.com/p/timeless-skills-for-data-engineers?ref=blef.fr">Timeless skills for data engineers and analysts</a> — Benjamin proposes 4 non-technical skills that are useful when doing the data work; the first two are "thinking in systems" and "data intuition". With the current resurgence of data modeling, I think both are critical parts of the modeling work and need to be looked at.</li><li><a href="https://mikkeldengsoe.substack.com/p/how-top-data-teams-are-structured?ref=blef.fr">How top data teams are structured</a> — Mikkel had a look at 40 data teams and analysed the way they are structured. As an output we get role ratios, giving a better understanding of team composition.</li><li><a href="https://www.wsj.com/articles/mainframes-find-new-life-in-ai-era-1e32b951?ref=blef.fr">Mainframes find new life with AI</a> — Obviously we will never get rid of mainframes, and IBM says clients want to run AI on them. 
I think it's time to learn machine learning in COBOL.</li><li><a href="https://location.foursquare.com/resources/blog/leadership/modern-data-platform-an-unbundling-of-a-traditional-data-warehouse/?ref=blef.fr">Foursquare modern data platform</a> — A classic modern data stack, but still interesting to see that Foursquare—like large companies—is going multi-technology by having multiple storage, processing engine and notebook technologies. </li><li><a href="https://aws.amazon.com/blogs/opensource/amazons-exabyte-scale-migration-from-apache-spark-to-ray-on-amazon-ec2/?ref=blef.fr">Amazon migration from Spark to Ray</a> — An exabyte-scale migration; in the end Amazon saves a lot of processing time when it comes to file compaction.</li><li><a href="https://bytes.swiggy.com/hermes-a-text-to-sql-solution-at-swiggy-81573fb4fb6e?ref=blef.fr">Design a text-to-SQL solution</a> — An Indian food delivery company, called Swiggy, developed an internal text-to-SQL solution that users can interact with via Slack. The article describes all their thoughts and the challenges they faced while working on it.</li><li><a href="https://github.com/supabase-community/postgres-new?ref=blef.fr">In browser WASM Postgres</a> — Recently a company ported Postgres to WASM and now you can run Postgres within your browser without any server; you can test it at <a href="https://postgres.new/?ref=blef.fr">postgres.new</a> (only desktop for the moment). You can see more on the <a href="https://pglite.dev/?ref=blef.fr">pglite</a> website.</li><li><a href="https://www.geteppo.com/blog/why-we-replaced-airflow-in-our-experimentation-platform?ref=blef.fr">Why we replaced Airflow in our experiment platform</a> — Eppo reached a job concurrency limit with Airflow (they wanted to run 50k concurrent experiments) and decided to switch to something else. 
After trying a few different things, they decided to develop their own tech.</li><li><a href="https://airflowsummit.org/?ref=blef.fr">Airflow Summit in San Francisco</a> — Airflow turns 10 this year. The Airflow Summit takes place in San Francisco on September 10-12. This year will also mark the start of <a href="https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+3.0?ref=blef.fr">Airflow 3.0</a>, aiming for a release in March next year. If we look at the Confluence page, Airflow 3.0 will feature a new web UI, DAG versioning, remote execution (from cloud to on-prem), data assets (a renaming of Datasets, like Dagster's) and more. That's a great milestone. </li><li><a href="https://www.astronomer.io/blog/airflow-dbt-next-chapter/?ref=blef.fr">Airflow and dbt, in Astronomer</a> — Orchestrating dbt within an orchestrator is one of the most discussed topics among data teams using dbt. It's also a large adoption lever for companies like Dagster, judging from the private discussions I have (the Dagster-dbt integration is top-notch). Astronomer had to go in the same direction and, through <a href="https://github.com/astronomer/astronomer-cosmos?ref=blef.fr">cosmos</a> and a better integration, now proposes it as well.</li><li><a href="https://blog.det.life/no-data-engineers-dont-need-dbt-30573eafa15e?ref=blef.fr">No, data engineers don't need dbt</a> — Common sense but good reminders. If you do ETL instead of ELT and you don't have a warehouse, dbt might not be the best fit.</li><li><a href="https://www.uber.com/en-DE/blog/sparkle-modular-etl/?ref=blef.fr">Sparkle: write Spark pipelines in YAML at Uber</a> — Uber doing uber things.
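The declarative idea behind such tools can be sketched in a few lines of Python: the pipeline is described as data (what a parsed YAML file would give you) and a small runner interprets it. A toy illustration only — the spec shape and step names are invented, not Sparkle's actual format:

```python
# A pipeline described as data; this dict is what a parsed YAML file would give you.
pipeline_spec = {
    "source": [1, 2, 3, 4],
    "steps": [
        {"op": "filter", "predicate": "even"},
        {"op": "map", "fn": "square"},
    ],
}

# The runner maps declarative op names onto actual code.
PREDICATES = {"even": lambda x: x % 2 == 0}
FUNCTIONS = {"square": lambda x: x * x}

def run(spec):
    rows = list(spec["source"])
    for step in spec["steps"]:
        if step["op"] == "filter":
            rows = [r for r in rows if PREDICATES[step["predicate"]](r)]
        elif step["op"] == "map":
            rows = [FUNCTIONS[step["fn"]](r) for r in rows]
        else:
            raise ValueError(f"unknown op: {step['op']}")
    return rows

result = run(pipeline_spec)  # keeps 2 and 4, then squares them: [4, 16]
```

The benefit is that the YAML stays reviewable and standardised while the runner evolves independently — which is exactly the pitch of these declarative platforms.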
Still, <a href="https://jobs.picnic.app/en/blogs/yaml-developers-and-the-declarative-data-platforms?ref=blef.fr">declarative platforms</a> defined in YAML are the way to go to standardise data work.</li><li><a href="https://netflixtechblog.medium.com/etl-development-life-cycle-with-dataflow-9c70c64aba7b?ref=blef.fr">ETL development life-cycle with Dataflow</a> — How Netflix also uses YAML to write Dataflow jobs.</li><li><a href="https://medium.com/pinterest-engineering/delivering-faster-analytics-at-pinterest-a639cdfad374?ref=blef.fr">StarRocks usage at Pinterest for faster analytics</a>.</li><li><a href="https://jack-vanlightly.com/blog/2024/8/7/table-format-comparisons-how-do-the-table-formats-represent-the-canonical-set-of-files?ref=blef.fr">Table format comparisons</a> — An honest, linear review of 4 formats (the 3 main ones plus Apache Paimon). The first part details how formats manage physical files; the second part is about <a href="https://jack-vanlightly.com/blog/2024/8/13/table-format-comparisons-append-only-tables-and-incremental-reads?ref=blef.fr">append-only tables and incremental reads</a>.</li><li><a href="https://www.snowflake.com/en/blog/polaris-catalog-open-source/?ref=blef.fr">Polaris catalog is open-source</a> — Snowflake released the expected Polaris catalog.
On this <a href="https://www.blef.fr/databricks-snowflake-and-the-future/">matter</a>, it has been reported that Databricks finally acquired Tabular for $2b while Snowflake tried to get them for $600m.</li><li><a href="https://cloud.google.com/blog/products/databases/announcing-sql-support-for-bigtable?ref=blef.fr">BigTable supports SQL</a> (and this is something) and <a href="https://cloud.google.com/blog/products/data-analytics/new-managed-service-for-apache-kafka/?ref=blef.fr">GCP can run managed Kafka</a>.</li><li><a href="https://datafordoers.substack.com/p/the-revolving-door-of-bi?ref=blef.fr">The revolving doors of BI</a> — I frequently observe companies switching BI tools every 2-3 years, hoping that the latest solution will resolve all their issues. While these migrations often provide short-term relief, they inevitably lead to another dead end. The initial success of the migration is typically due to the fact that, during the transition, companies address their technical debt and apply the lessons learned from previous mistakes. 
However, without addressing the underlying issues, the cycle is likely to repeat itself.</li><li><a href="https://blog.duolingo.com/growth-model-duolingo/?ref=blef.fr">Metrics management at Duolingo</a> — I should go back to learning German in order to improve Duolingo metrics.</li><li><a href="https://airbyte.com/blog/how-we-test-airbyte-and-marketplace-connectors?ref=blef.fr">How we test Airbyte and marketplace connectors</a> — The exhaustive test suite Airbyte put in place to check that connectors behave correctly.</li><li><a href="https://blog.dagworks.io/p/slack-summary-pipeline-with-dlt-ibis?ref=blef.fr">Slack summary ELT pipeline</a> — It showcases how you can create an ELT pipeline with dlt, Ibis and <a href="https://github.com/dagworks-inc/hamilton?ref=blef.fr">Hamilton</a>—a Python library to create transformation DAGs.</li><li><a href="https://github.com/dbecorp/snowflakecli?ref=blef.fr">snowflakecli</a> — A DuckDB-powered command line interface for Snowflake security, governance, operations, and cost optimization.</li><li>A Postgres extension (<a href="https://github.com/duckdb/pg_duckdb?ref=blef.fr">pg_duckdb</a>) that brings DuckDB as an analytical engine within Postgres has been backed by <a href="https://www.theregister.com/2024/08/20/postgresql_duckdb_extension/?ref=blef.fr">Microsoft</a> and sparked a bit of <a href="https://motherduck.com/blog/pg_duckdb-postgresql-extension-for-duckdb-motherduck/?ref=blef.fr">discussion</a> in the <a href="https://davidsj.substack.com/p/the-unrealised-promise-of-htap?ref=blef.fr">community</a>.</li><li><a href="https://hpi.de/fileadmin/user_upload/fachgebiete/naumann/projekte/RDBMSGenealogy/RDBMS_Genealogy_V6.pdf?ref=blef.fr">Genealogy of databases</a> — Like a subway map depicting how databases have evolved since prehistory.</li><li>Martin Kleppmann is working on a version 2 of <a href="https://www.oreilly.com/library/view/designing-data-intensive-applications/9781491903063/?ref=blef.fr">Designing
Data-Intensive Applications</a>.</li></ul><p></p><h4 id="final-stuff">Final stuff</h4><ul><li><a href="https://engineeringblog.yelp.com/2024/08/dbt-Generic-Tests-in-Sessions-Validation-at-Yelp.html?ref=blef.fr">dbt Generic Tests in Sessions Validation at Yelp</a>.</li><li><a href="https://www.letsql.com/posts/builtin-predict-udf/?ref=blef.fr">LETSQL inference for DataFrames</a> — I don't fully understand the post, but it looks cool.</li></ul> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 24.30 ]]></title>
                    <description><![CDATA[ Data News #24.30 — TV shopping for foundational models (OpenAI, Mistral, Meta, Microsoft, HF), BigQuery newly released stuff, and more obviously. ]]></description>
                    <link><![CDATA[ /data-news-week-24-30/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 669a408fb1f34700011776ee ]]></guid>
                    <pubDate><![CDATA[ 2024-07-26 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/photo-1685788467854-85a80ce5dd29?fm=jpg&amp;q=60&amp;w=3000&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D" class="kg-image" alt="a view of a city at sunset from a high rise" loading="lazy" width="3000" height="2001" srcset="https://images.unsplash.com/photo-1685788467854-85a80ce5dd29?fm=jpg&amp;q=60&amp;w=600&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D 600w, https://images.unsplash.com/photo-1685788467854-85a80ce5dd29?fm=jpg&amp;q=60&amp;w=1000&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D 1000w, https://images.unsplash.com/photo-1685788467854-85a80ce5dd29?fm=jpg&amp;q=60&amp;w=1600&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D 1600w, https://images.unsplash.com/photo-1685788467854-85a80ce5dd29?fm=jpg&amp;q=60&amp;w=2400&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D 2400w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Tallinn (</span><a href="https://unsplash.com/photos/a-view-of-a-city-at-sunset-from-a-high-rise-0mR6KB4eDqM?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p>Dear members, it's Summer Data News, the only news you can consume by the pool, the beach or at the office—if you're not lucky. This week, I'm writing from the Baltics, nomading a bit in Eastern and Northern Europe.</p><p>I'm pleased to announce that we have successfully closed the CfP for Forward Data Conf. We received nearly 100 submissions and the program committee is currently reviewing them all.
Many thanks to everyone who trusted us and submitted a talk for the conference (especially the DN members!).</p><p>We also announced our first guest speaker, <a href="https://www.linkedin.com/in/josephreis/?ref=blef.fr">Joe Reis</a>. Joe is a great speaker; he wrote <a href="https://www.oreilly.com/library/view/fundamentals-of-data/9781098108298/?ref=blef.fr">Fundamentals of Data Engineering</a>, one of the bibles of data engineering, and I can't wait to hear him at Forward Data. He is currently writing his second book, about data modeling.</p><p><strong>Forward Data is a 1-day conference I will co-organise on November 25th, in Paris.</strong> It will be a day to shape the future of the data community, where teams can come to learn and grow together.</p><div class="kg-card kg-button-card kg-align-center"><a href="https://www.billetweb.fr/forward-data-conference?ref=blef.fr" class="kg-btn kg-btn-accent">Buy tickets for Forward Data Conf</a></div><p></p><h1 id="ai-news-%F0%9F%A4%96">AI News 🤖</h1><p>Some days, AI News is like a TV shopping show. Over the past two weeks, a few dozen models have been released, and I'd like to introduce them to you.</p><h2 id="new-models-stuff">New models &amp; stuff</h2><ul><li>OpenAI — OpenAI is trying to keep leading the charge, releasing models the way Apple releases products.<ul><li><a href="https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/?ref=blef.fr">GPT-4o mini: advancing cost-efficient intelligence</a> — After GPT-4o, which brought great performance and became the new flagship model, available in the free tier, OpenAI released a smaller version of it, the mini. According to the benchmarks, GPT-4o mini is close to GPT-4o in performance and best in class among the small models.
Even if OpenAI did not disclose how small it is, a few people are claiming it's an 8B model.</li><li><a href="https://www.linkedin.com/feed/update/urn:li:activity:7221852875428122626/?updateEntityUrn=urn%3Ali%3Afs_updateV2%3A%28urn%3Ali%3Aactivity%3A7221852875428122626%2CFEED_DETAIL%2CEMPTY%2CDEFAULT%2Cfalse%29&ref=blef.fr">Fine-tune GPT-4o for free</a> — Until September 23, 2024, GPT-4o mini is free to fine-tune. This means each organization will get 2M tokens per 24-hour period to train the model, and any overage will be charged at $3.00/1M tokens. Worth trying [<a href="https://platform.openai.com/docs/guides/fine-tuning/preparing-your-dataset?ref=blef.fr">docs</a>].</li><li><a href="https://openai.com/index/searchgpt-prototype/?ref=blef.fr">SearchGPT, new OpenAI product</a> — Yesterday, OpenAI unveiled their latest product, SearchGPT, a prototype AI search application. The system generates answers while providing reliable sources. This announcement coincides with Google Search's recent report of an 11% increase in revenue for the last quarter, reaching $64 billion. It shows that search did not disappear with the advent of GPTs.</li></ul></li><li>Meta — It's crazy how Meta, which suffered an unintentional leak of the LLaMA weights on torrents a year ago, is now the company advocating for open models and leading this part of the ecosystem. <ul><li><a href="https://llama.meta.com/?ref=blef.fr">LLaMA 3.1 is out</a> — The model is out in 3 versions: the largest one with 405B parameters and 2 smaller ones (70B and 8B). They even released a 92-page <a href="https://scontent.ftll3-1.fna.fbcdn.net/v/t39.2365-6/452387774_1036916434819166_4173978747091533306_n.pdf?_nc_cat=104&ccb=1-7&_nc_sid=3c67a6&_nc_ohc=7qSoXLG5aAYQ7kNvgFmXYFc&_nc_ht=scontent.ftll3-1.fna&oh=00_AYCLzWIqE5zXXZcETIDOED5eNaSHiQ9eb2XC_IDDSsCY7g&oe=66A91E0D&ref=blef.fr">whitepaper</a> explaining how they trained it, the expected performance and what you can do with it.
Our dear friend Mark even wrote an ode to open source with some kind of <a href="https://about.fb.com/news/2024/07/open-source-ai-is-the-path-forward/?ref=blef.fr">manifesto</a>. The <a href="https://ai.meta.com/blog/meta-llama-3-1/?ref=blef.fr">announcement</a> works as a summary if you want the short version.</li><li><a href="https://www.theverge.com/2024/7/18/24201041/meta-multimodal-llama-ai-model-launch-eu-regulations?ref=blef.fr">Meta won’t release its multimodal Llama AI model in the EU</a> — It would have been perfect if Meta complied with the rules, but in the end Meta is Meta: lobbying, and then, like a crying kid punished for misbehaving, announcing they will not release their super multimodal AI in Europe because the regulatory environment is too "unpredictable". A way of saying they used training data they should not have.<br><br>Last point related to tech giants (Apple, Nvidia, Salesforce) <a href="https://www.newsnationnow.com/business/tech/ai/ai-youtube-stolen-subtitles/?ref=blef.fr#:~:text=Tech%20giants%20Apple%2C%20Anthropic%2C%20Nvidia%20and%20Salesforce%20pilfered%20data%20from,48%2C000%20channels%20to%20AI%20programs.">stealing YouTube subtitles</a> to train foundational models.</li><li><a href="https://huggingface.co/collections/facebook/chameleon-668da9663f80d483b4c61f58?ref=blef.fr">Meta Chameleon</a> — Finally, Chameleon is available on HuggingFace. It's Meta's mixed-modal early-fusion foundation model, which means it can understand and generate both text and images.</li></ul></li><li>Mistral — The French company is keeping pace with the other giants on open models.<ul><li><a href="https://mistral.ai/news/mathstral/?ref=blef.fr">MathΣtral</a> — A 7B model for math reasoning and scientific discovery, under an Apache license.
I'll try it soon for something I'm cooking.</li><li><a href="https://mistral.ai/news/codestral-mamba/?ref=blef.fr">Codestral Mamba</a> — A 7B model for code generation, under an Apache license.</li><li><a href="https://mistral.ai/news/mistral-large-2407/?ref=blef.fr">Mistral Large 2</a> — Competing directly with the large LLaMA 3.1, Mistral Large 2 has 123B parameters and is at the moment the closest model to GPT-4o, which still sets the benchmark.</li></ul></li><li><a href="https://azure.microsoft.com/en-us/blog/announcing-phi-3-fine-tuning-new-generative-ai-models-and-other-azure-ai-updates-to-empower-organizations-to-customize-and-scale-ai-applications/?ref=blef.fr">Microsoft Phi-3 models</a> — Microsoft continues to try hard at the game with their Phi-3 models available in Azure. But who cares?</li><li><a href="https://huggingface.co/blog/smollm?ref=blef.fr">SmolLM - blazingly fast and remarkably powerful</a> — HuggingFace released new state-of-the-art small models (135M, 360M and 1.7B parameters) trained on an open corpus.</li></ul><p></p><h2 id="articles">Articles</h2><p>Because AI and GenAI are not only about models, a few great articles have been written as well.</p><ul><li>Twitter uses your data to train xAI Grok — Twitter recently added an <a href="https://x.com/settings/grok_settings?ref=blef.fr">opt-in</a> to utilise your X posts as well as your user interactions, inputs and results with Grok for training and fine-tuning purposes.
The opt-in is only available on desktop.</li><li><a href="https://engineering.fb.com/2024/07/16/developer-tools/ai-lab-secrets-machine-learning-engineers-moving-fast/?ref=blef.fr">AI Lab: The secrets to keeping machine learning engineers moving fast</a> — About the AI Lab Meta put in place to maintain ML engineers' velocity, giving them the ability to A/B test models and avoid regressions.</li><li><a href="https://medium.com/pinterest-engineering/building-pinterest-canvas-a-text-to-image-foundation-model-aa34965e84d9?ref=blef.fr">Building Pinterest Canvas, a text-to-image foundation model</a> — Great article about creating an image generation model for product backgrounds.</li><li><a href="https://github.com/run-llama/llama_parse/blob/main/examples/multimodal/multimodal_rag_slide_deck.ipynb?ref=blef.fr">Multimodal RAG pipeline</a> — A notebook explaining how you can index and build a RAG on deck slides.</li><li><a href="https://raphaelvienne.substack.com/p/watermarking-generative-ai?ref=blef.fr">Watermarking Generative AI: ensuring ownership and transparency</a> — This is the future: being able to watermark all generated content to ensure trust and transparency for end consumers. </li><li><a href="https://www.cio.bund.de/SharedDocs/kurzmeldungen/Webs/CIO/DE/startseite/2024/ozg_aendg.html?ref=blef.fr">Germany</a> (in German, sorry) and <a href="https://www.zdnet.com/article/switzerland-now-requires-all-government-software-to-be-open-source/?ref=blef.fr">Switzerland</a> both added to their laws some kind of preference for open-source software (it's even a requirement in Switzerland). </li><li><a href="https://docs.google.com/spreadsheets/d/1BbibWUwJ5bX8Q6u_juutqRWdMFg11VAa_36NQjl9vbc/copy?ref=blef.fr">Understand Vector database in Google Sheets</a> — Playful.
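If you want the mechanics without the spreadsheet: vector search boils down to storing items as numeric embeddings and ranking them by cosine similarity against a query vector. A minimal pure-Python sketch (the 3-dimensional "embeddings" are made up; a real model outputs hundreds of dimensions, and real vector databases use approximate indexes instead of a full scan):

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity: dot product divided by the product of vector norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def search(query_vector, documents, top_k=2):
    # Score every stored vector against the query and keep the best matches.
    scored = [(doc, cosine_similarity(query_vector, vec)) for doc, vec in documents.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]

# Toy corpus: document -> embedding.
docs = {
    "data engineering": [0.9, 0.1, 0.0],
    "cooking recipes": [0.0, 0.2, 0.9],
    "sql pipelines": [0.8, 0.3, 0.1],
}
results = search([1.0, 0.2, 0.0], docs)  # "data engineering" ranks first
```

Everything else a vector database adds (indexing, filtering, persistence) is engineering around this one ranking operation.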
This is a Google Sheets template you can copy that explains how a vector database works, including search.</li><li><a href="https://huggingface.co/spaces/infinite-dataset-hub/infinite-dataset-hub?ref=blef.fr">Infinite dataset hub</a> — A generative app that creates datasets for you from a few words.</li></ul><p></p><h1 id="fast-news-%E2%9A%A1">Fast News ⚡</h1><p>Because the fast news are always the best.</p><ul><li><a href="https://towardsdatascience.com/why-it-feels-impossible-to-get-a-data-science-job-398d57de464c?ref=blef.fr">Why it feels impossible to get a data science job</a> — The data science market has become highly competitive in recent years, even more so with everyone rushing to AI and jobs shifting from being great at machine learning to being good at maintaining API orchestration. This article tries to explain why, and what to do about it.</li><li><a href="https://www.snowflake.com/engineering-blog/snowflake-brings-seamless-postgresql-and-mysql-integration-with-new-connectors/?ref=blef.fr">Snowflake brings seamless PostgreSQL and MySQL</a> — It was announced at the Snowflake summit: you can now directly ingest Postgres and MySQL from the Snowflake UI, removing the need for any other tool for these sources. The way they did it requires you to run a Docker container, which is kinda meh.</li><li><a href="https://buremba.com/blog/use-snowflake-and-duckdb-with-iceberg?ref=blef.fr">Query Snowflake Iceberg tables with DuckDB &amp; Spark to save costs</a> — That's what Iceberg tables on Snowflake unlock.
The capability to offload compute to DuckDB or Spark to save costs (or to move costs, actually).</li><li>The BigQuery team is on fire and released a lot of cool new stuff<ul><li><a href="https://cloud.google.com/bigquery/docs/table-explorer?ref=blef.fr">Table explorer</a> — an automated way to visually explore table data and create queries based on your selection of table fields.</li><li><a href="https://cloud.google.com/bigquery/docs/continuous-queries-introduction?ref=blef.fr">Continuous queries</a> — An answer to Snowflake dynamic tables: continuous queries are SQL statements that run continuously. Google announces that CQ can be used for low-latency tasks.</li><li>On the same topic as CQ, they released the <a href="https://cloud.google.com/bigquery/docs/reference/standard-sql/table-functions-built-in?ref=blef.fr#changes">changes</a> function, a SQL function that returns all rows that have changed in a table for a given time range. I think it will unlock a lot of use-cases in BigQuery.</li></ul></li><li><a href="https://engineering.mixpanel.com/how-mixpanel-delivers-funnels-up-to-7x-faster-than-the-data-warehouse-af6da1f5a982?ref=blef.fr">How Mixpanel delivers funnels up to 7x faster than the data warehouse</a> — The Mixpanel team is proud to say that they get better performance than Snowflake.</li><li>You can run Clickhouse functions in DuckDB with the <a href="https://community-extensions.duckdb.org/extensions/chsql.html?ref=blef.fr">chsql extension</a>, and there is a great post about how <a href="https://duckdb.org/2024/07/09/memory-management?ref=blef.fr">DuckDB manages memory</a>.</li><li><a href="https://ibis-project.org/posts/1tbc/?ref=blef.fr">Querying 1TB on a laptop with Python dataframes</a> — A benchmark on a laptop with 96 GB of memory, using DuckDB, DataFusion and Polars.
Crazy what we can do nowadays.</li><li><a href="https://towardsdatascience.com/data-modeling-techniques-for-the-post-modern-data-stack-03fc2e4a210c?ref=blef.fr">Data modeling techniques for the post-modern data stack</a> — A great recap of all the modeling techniques that exist out there (medallion and dimensional).</li><li><a href="https://towardsdatascience.com/parquet-file-format-everything-you-need-to-know-ea54e27ffa6e?ref=blef.fr">Parquet File Format: everything you need to know</a> — How Parquet files are written.</li><li><a href="https://developer.nvidia.com/blog/encoding-and-compression-guide-for-parquet-string-data-using-rapids/?ref=blef.fr">Encoding and Compression Guide for Parquet String Data Using RAPIDS</a> — For Nvidia geeks.</li><li><a href="https://hudi.apache.org/docs/table_types/?ref=blef.fr#merge-on-read-table">Hudi merge-on-read</a>  — Iceberg has been all over the place, but there is Hudi as well, and merge-on-read is a great feature tbh.</li><li><a href="https://netflixtechblog.com/maestro-netflixs-workflow-orchestrator-ee13a06f9c78?ref=blef.fr">Maestro: Netflix’s workflow orchestrator</a> — Five years later, Netflix finally open-sourced their orchestrator. Curious to see if it will pick up. It's written in Java and does what other orchestrators are already doing.</li><li><a href="https://roundup.getdbt.com/p/the-analytics-development-lifecycle?ref=blef.fr">The Analytics Development Lifecycle</a> — dbt Labs needs to reinvent itself: now that dbt is everywhere, they need to define the next vision, as they have pure gold in their hands.
Tristan, CEO of dbt Labs, is aiming for a new acronym, ADLC (Analytics Development Lifecycle), and provides a draft manifesto with user stories of what analytics engineers should be able to do tomorrow.</li><li><a href="https://towardsdatascience.com/deliver-your-data-as-a-product-but-not-as-an-application-99c4af23c0fb?ref=blef.fr">Deliver your data as a product, but not as an application</a>.</li></ul><p></p><h1 id="data-economy-%F0%9F%92%B0">Data economy 💰</h1><ul><li><a href="https://www.calcalistech.com/ctechnews/article/hjvuvyb000?ref=blef.fr">Google in negotiations to acquire Wiz in $23 billion deal</a>, <a href="https://www.bbc.com/news/articles/c3gdlng47k7o?ref=blef.fr">actually no</a>. Wiz is a cloud security firm.</li></ul><hr><p>See you next week ❤️</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 24.28 ]]></title>
                    <description><![CDATA[ Data News #24.28 — Catching up the news, OpenAI, Claude, kyutai and all the engineering stuff from the last 3 weeks. ]]></description>
                    <link><![CDATA[ /data-news-week-24-28/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 6688ded36a35500001f5ddd3 ]]></guid>
                    <pubDate><![CDATA[ 2024-07-13 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/photo-1514412076816-d228b5c0973c?fm=jpg&amp;w=3000&amp;auto=format&amp;fit=crop&amp;q=60&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D" class="kg-image" alt="gull flying above body of water" loading="lazy" width="3000" height="2000" srcset="https://images.unsplash.com/photo-1514412076816-d228b5c0973c?fm=jpg&amp;w=600&amp;auto=format&amp;fit=crop&amp;q=60&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D 600w, https://images.unsplash.com/photo-1514412076816-d228b5c0973c?fm=jpg&amp;w=1000&amp;auto=format&amp;fit=crop&amp;q=60&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D 1000w, https://images.unsplash.com/photo-1514412076816-d228b5c0973c?fm=jpg&amp;w=1600&amp;auto=format&amp;fit=crop&amp;q=60&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D 1600w, https://images.unsplash.com/photo-1514412076816-d228b5c0973c?fm=jpg&amp;w=2400&amp;auto=format&amp;fit=crop&amp;q=60&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D 2400w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">EuroSeagull (</span><a href="https://unsplash.com/photos/gull-flying-above-body-of-water-btQt9i0Krag?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p>Dear members, it's been a few weeks since I last caught you with a proper Data News and its collection of links. Here we are.</p><p>This week, I attended <a href="https://ep2024.europython.eu/?ref=blef.fr">EuroPython</a> in Prague. I spent most of my time at the <a href="https://dlthub.com/?ref=blef.fr">dltHub</a> booth in the sponsors hall, so I didn't attend many talks.
However, I did give a few presentations on my SQL orchestration library, <a href="https://github.com/Bl3f/yato?ref=blef.fr">yato</a>, which pairs well with dlt. A YouTube video might come out soon.</p><p>Additionally, I attended an interesting talk by a Data News reader about <a href="https://matthieu.io/dl/talks/2024-07-11-europython-yaml-engineer.pdf?ref=blef.fr">the rise of YAML engineers</a>; Matthieu has also written an <a href="https://jobs.picnic.app/en/blogs/yaml-developers-and-the-declarative-data-platforms?ref=blef.fr">article</a> about this in the past. I'm so happy to have met a few of you there 😊.</p><p>This is a great transition to remind you that I'm co-organising a 1-day conference on Nov 25th in Paris. The Forward Data Conference will be a day to shape the future of the data community, where teams can come to learn and grow together. <strong>The Call for Papers (CfP) closes in a few hours, on Sunday at 23:59</strong>. So <a href="https://conference-hall.io/public/event/9YgSSWq5AKeuQAcLyVHO?ref=blef.fr">propose a talk</a>.
Submissions are welcome in French and English.</p><figure class="kg-card kg-image-card kg-card-hascaption"><a href="https://conference-hall.io/public/event/9YgSSWq5AKeuQAcLyVHO?ref=blef.fr"><img src="https://www.blef.fr/content/images/2024/07/700x200-blef.png" class="kg-image" alt="" loading="lazy" width="700" height="200" srcset="https://www.blef.fr/content/images/size/w600/2024/07/700x200-blef.png 600w, https://www.blef.fr/content/images/2024/07/700x200-blef.png 700w"></a><figcaption><span style="white-space: pre-wrap;">Submit your talk to the Forward Data Conference!</span></figcaption></figure><p></p><h1 id="ai-news-%F0%9F%A4%96">AI News 🤖</h1><ul><li>OpenAI — Always the biggest news provider, whether announcements or dramas<ul><li><a href="https://www.tomshardware.com/tech-industry/artificial-intelligence/openai-was-hacked-revealing-internal-secrets-and-raising-national-security-concerns-year-old-breach-wasnt-reported-to-the-public?utm_campaign=socialflow&utm_medium=social&utm_source=twitter.com">OpenAI was hacked, revealing internal secrets and raising national security concerns</a> — The hacker reached OpenAI’s internal messaging systems early last year, stealing details of how OpenAI's technologies work from employees.</li><li><a href="https://time.com/6996842/microsoft-quits-openai-board-seat-antitrust-scrutiny-ai-partnerships/?ref=blef.fr">Microsoft quits OpenAI board</a> — Microsoft said they are no longer needed because the governance has improved, and at the same time they might want to avoid the antitrust issues raised by governments around the world. Apple, as expected, will not join the board either.</li><li><a href="https://spectrum.ieee.org/chatgpt-for-coding?ref=blef.fr">How good is GPT at coding, really?</a> — A research team evaluated the capabilities of GPT-3.5 in solving LeetCode problems. Although GPT-3.5 might be outdated, the findings are still somewhat relevant.
The team discovered that GPT-3.5 performed significantly better on problems that existed before its training cut-off date. However, the model struggled with correcting its own mistakes.</li><li><a href="https://medium.com/@yingjunwu/openais-acquisition-of-rockset-what-it-means-for-the-industry-c5fcfc4f1718?ref=blef.fr">OpenAI’s acquisition of Rockset, what it means</a> — I announced it a few weeks ago: OpenAI bought Rockset, a real-time analytical vector database. Customers have 2 months left to migrate away from the database, which will probably become the core of the OpenAI architecture.</li></ul></li><li><a href="https://www.youtube.com/live/hm2IJSKcYvo?ref=blef.fr">kyutai released Moshi</a> — Moshi is a "voice-enabled AI". The team at kyutai developed the model audio-first, with an audio language model, which makes the conversation with the AI feel more real (demo at 5:00 min) as it can interrupt you or kinda "think" (i.e. predict the next audio segment) while it speaks. Moshi will be part of kyutai's open-source releases and is purely local.</li><li><a href="https://www.anthropic.com/news/claude-3-5-sonnet?ref=blef.fr">Claude 3.5 Sonnet</a> — To end the tour of recent models: if you missed it, Claude 3.5 came out in June and featured great performance when "reasoning". Claude is capable of splitting the screen and building some kind of CodePen playground with a React app implementing what you're asking. There is a demo where <a href="https://x.com/Saboo_Shubham_/status/1805789967203156357?ref=blef.fr">Sonnet transformed a research paper into a simulator app</a> about the paper in one prompt.</li><li><a href="https://github.com/Sinaptik-AI/pandas-ai?ref=blef.fr">pandas-ai</a> — Give a dataframe to pandas-ai and configure a model; then you'll be able to chat with your data to get answers or chat about your questions.
Nothing new I'd say; the only difference is that the API is fairly simple.</li><li><a href="https://engineering.fb.com/2024/07/10/data-infrastructure/machine-learning-ml-prediction-robustness-meta/?ref=blef.fr">Meta’s approach to machine learning prediction robustness</a> — The principles Meta applies to bring robustness to ML.</li><li><a href="https://dropbox.tech/machine-learning/bringing-ai-powered-answers-and-summaries-to-file-previews-on-the-web?ref=blef.fr">Bringing AI-powered answers and summaries to file previews on the web</a> — Dropbox has developed a feature that generates summaries for every file in your storage. This process involves converting files, regardless of their format, into text. The text is then transformed into embeddings, which make the content easily summarisable and queryable for Q&amp;A purposes.</li><li><a href="https://blog.malt.engineering/super-powering-our-freelancer-recommendation-system-using-a-vector-database-add643fcfd23?ref=blef.fr">Recommendation system using a vector database</a> — How Malt (a freelancer platform) built a recommendation engine using current vector database technologies (with Qdrant).</li><li><a href="https://do4ds.com/?ref=blef.fr">DevOps for data science</a> — An open-source and free book covering what data scientists need to know about DevOps. It was written by someone at Posit (the company behind RStudio). It covers general knowledge about infra + code snippets.</li></ul><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li><a href="https://aws.amazon.com/blogs/aws/introducing-end-to-end-data-lineage-preview-visualization-in-amazon-datazone/?ref=blef.fr">End-to-end data lineage in AWS</a> — <strong>AWS announced DataZone to bring lineage to your data assets</strong>; from the picture it can mix datasets (?), Glue tables and jobs while giving you a green/red vision of what's up to date.
They mention column lineage, but from the picture it looks like they track columns without proper column-level lineage. The UI is AWS-tier.</li><li><a href="https://www.dataengineeringweekly.com/p/a-brief-history-of-modern-data-stack?ref=blef.fr">A brief history of modern data stack</a> — Ananth from Data Engineering Weekly wrote his views a few weeks after the modern data stack debate (read <a href="https://www.blef.fr/modern-data-stack-disappearing/">my opinion</a> on this); he considers that we are in the post-modern data stack era, with a few points that will be (or are being) implemented everywhere, especially interoperability.</li><li><a href="https://www.youtube.com/watch?app=desktop&v=T-ee0xdJ7yM&ref=blef.fr">Apache XTable</a> — XTable is a new layer that provides cross-table interoperability, so you don't need to choose only one table format out of Hudi, Delta and Iceberg. It provides abstractions and tools for the translation of lakehouse table format metadata.</li><li><a href="https://docs.snowflake.com/en/user-guide/dynamic-tables-tasks-create-iceberg?ref=blef.fr">Create dynamic Iceberg tables</a> — Snowflake added support for dynamic tables in Iceberg format. Dynamic tables are tables based on "real-time data" (or streams, or continuous pipelines). It means Snowflake can now be used simply as an engine writing continuous tables to blob storage in an open format like Iceberg — <a href="https://www.blef.fr/databricks-snowflake-and-the-future/">the future is coming</a>.</li><li><a href="https://blog.devgenius.io/creating-a-file-format-in-rust-92201498df0a?ref=blef.fr">Creating a file format in Rust</a> — An experiment showing what you need to create a new file format.
It's super interesting for understanding what's under the hood of the popular tools we often use.</li><li><a href="https://juhache.substack.com/p/data-pipelines-and-scds?ref=blef.fr">Data pipelines and SCDs</a> — Slowly changing dimensions are an important pattern to know when it comes to data engineering. Julien wrote a great article about them, explaining the 3 possible forms and the snapshot approach. His charts are great. You can also read Timo's detailed post on the Mixpanel blog on why <a href="https://mixpanel.com/blog/slowly-changing-dimension-tables-in-product-analytics/?ref=blef.fr">SCDs are the best thing for product analytics</a>.</li><li><a href="https://docs.getdbt.com/blog/semantic-layer-in-pieces?ref=blef.fr">How to build a Semantic Layer</a> — A great small guide that gives you the things to consider when going down the Semantic Layer road. Gwen gives a step-by-step method to migrate from marts to a dbt Semantic Layer.</li><li><a href="https://jorritsandbrink.substack.com/p/how-dlt-uses-apache-arrow-for-fast-pipelines?ref=blef.fr">How dlt uses Apache Arrow</a> — A great post explaining why the next generation of data tooling needs to use Arrow and how it impacts performance. The article then explains how dlt (extract and load) leverages Arrow.</li><li><a href="https://dewey.dunnington.ca/slides/scipy2024/?ref=blef.fr#/title-slide">nanoarrow, a way to technically understand Arrow</a> — Slides about a re-implementation of the Arrow framework (to be honest, it's highly technical without the video).</li><li><a href="https://www.linkedin.com/pulse/duckdb-x-dbt-make-psyduck-great-again-jean-guinvarch-bbqke/?trackingId=G2Rg6aifSUqFW9crzR%2BFvA%3D%3D&ref=blef.fr">DuckDB and dbt</a> — How, with DuckDB and dbt, you can build the transformation layer of a BI application (e.g. a Pokemon dashboard).</li><li><a href="https://duckdb.org/2024/07/05/community-extensions?ref=blef.fr">DuckDB extension mechanism</a> — DuckDB wants to provide a repository for community extensions.
This way the community will be able to extend DuckDB easily, and it will also reduce the minimal size of DuckDB, allowing for an even more portable database/engine.</li><li><a href="https://seattledataguy.substack.com/p/dont-lead-a-data-team-before-reading?ref=blef.fr">Don’t lead a data team before reading this</a> — 5 important points you should consider when leading a data team. I really like "<em>The business doesn’t care about how you solve the problem</em>" because it's a good reminder for my technical audience that your role as a data person is to empower others with data, so boring tech is often the best.</li><li><a href="https://vutr.substack.com/p/apache-kafka-part-1-overview?ref=blef.fr">Apache Kafka overview</a> — If you're not familiar with Kafka this is a great overview.</li><li><a href="https://semyonsinchenko.github.io/ssinchenko/post/porting_deequ_to_sparkconnect/?ref=blef.fr">Spark-connect, what's this</a> — Very detailed post about what spark-connect is and why it will change the way we do Spark. It highlights how it simplifies and enhances the development process, particularly through its compatibility with various languages and the potential it unlocks for creating a data quality process.</li><li><a href="https://discord.com/blog/how-discord-uses-open-source-tools-for-scalable-data-orchestration-transformation?ref=blef.fr">How Discord uses Dagster</a> — 2000 dbt tables, covered by over 12000 dbt tests. Discord uses the dbt &lt;&gt; Dagster integration to power their whole data asset management.</li></ul><p></p><h2 id="stories">Stories</h2><ul><li><a href="https://www.canva.dev/blog/engineering/product-analytics-event-collection/?ref=blef.fr">How Canva collects 25 billion events per day</a> — Protobuf + Amazon Kinesis.
</li><li><a href="https://yokota.blog/2024/07/11/in-memory-analytics-for-kafka-using-duckdb/?ref=blef.fr">In-memory analytics for Kafka using DuckDB</a> — The author developed <a href="https://github.com/rayokota/kwack?ref=blef.fr">kwack</a>, a small utility that allows you to run SQL queries on top of Kafka streams (in-memory).</li><li><a href="https://www.atlassian.com/blog/artificial-intelligence/ai-prompts-for-marketing?ref=blef.fr">40 AI prompts to boost your marketing team’s creativity</a> — An Atlassian collection of 40 prompts for marketing tasks. I'm not sure I'm happy to see this on the Atlassian blog.</li><li><a href="https://netflixtechblog.com/a-recap-of-the-data-engineering-open-forum-at-netflix-6b4d4410b88f?ref=blef.fr">A Recap of the Data Engineering Open Forum at Netflix</a> — Videos from the Netflix Data Engineering Open Forum are out on YouTube, and this post is a recap + takeaways.</li><li><a href="https://luminousmen.com/post/senior-engineer-fatigue/?ref=blef.fr">Senior engineer fatigue</a> — As you gain experience and your career progresses, you will start to feel fatigue as an engineer. Senior fatigue is characterised not by a decline in productivity but by a deliberate deceleration. I find the first part, about the paradox of slowing down to speed up, so true that I warmly recommend reading it.</li></ul><div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">💡</div><div class="kg-callout-text">Going back in time — <a href="https://blog.cleancoder.com/uncle-bob/2016/01/04/ALittleArchitecture.html?ref=blef.fr">A Little Architecture</a></div></div><p></p><hr><p>See you next week (probably) ❤️ — I'll take random breaks this summer in order to prepare for the changes coming in my professional and personal life in September.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Databricks, Snowflake and the future ]]></title>
                    <description><![CDATA[ Databricks and Snowflake summits featured major announcements, including open-sourcing their catalogs and enhancing Iceberg compatibility. This article covers all the key updates you need to know. ]]></description>
                    <link><![CDATA[ /databricks-snowflake-and-the-future/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 666c7b3b5d699d00018ca4bd ]]></guid>
                    <pubDate><![CDATA[ 2024-06-21 ]]></pubDate>
                    <content>
                        <![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2024/06/image-2.png" class="kg-image" alt="" loading="lazy" width="2000" height="1500" srcset="https://www.blef.fr/content/images/size/w600/2024/06/image-2.png 600w, https://www.blef.fr/content/images/size/w1000/2024/06/image-2.png 1000w, https://www.blef.fr/content/images/size/w1600/2024/06/image-2.png 1600w, https://www.blef.fr/content/images/size/w2400/2024/06/image-2.png 2400w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Welcome to the snow world (</span><a href="https://unsplash.com/photos/person-holding-ski-poles-in-the-middle-of-snow-during-winter-season-Dzd_O5cnr0Y?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p>Every year, the competition between Snowflake and Databricks intensifies, using their annual conferences as a platform for demonstrating their power. This year, the Snowflake Summit was held in San Francisco from June 2 to 5, while the Databricks Data+AI Summit took place 5 days later, from June 10 to 13, also in San Francisco. The conferences were expecting 20,000 and 16,000 participants respectively.</p><p>Snowflake is listed and had annual <a href="https://www.macrotrends.net/stocks/charts/SNOW/snowflake/revenue?ref=blef.fr#:~:text=Snowflake%20annual%20revenue%20for%202024,a%20105.95%25%20increase%20from%202021.">revenue of $2.8 billion</a>, while Databricks achieved $2.4 billion—Databricks figures are not public and are therefore <a href="https://www.cnbc.com/2024/06/12/databricks-says-annualized-revenue-to-reach-2point4-billion-in-first-half.html?ref=blef.fr">projected</a>. 
Snowflake was founded in 2012 around its data warehouse product, which is still its core offering, while Databricks was founded in 2013 out of academia by the researchers who co-created Spark (which became Apache Spark in 2014).</p><p>Snowflake and Databricks have the same goal: both are selling a cloud on top of <em>classic</em><sup>1</sup>&nbsp;cloud vendors. In the data world Snowflake and Databricks are our dedicated platforms and we consider them big, but when we look at the whole tech ecosystem they are (so) small: AWS revenue is $80b, Azure is $62b and GCP is $37b.</p><figure class="kg-card kg-image-card"><img src="https://www.blef.fr/content/images/2024/06/Frame-27-3-.png" class="kg-image" alt="" loading="lazy" width="2000" height="1007" srcset="https://www.blef.fr/content/images/size/w600/2024/06/Frame-27-3-.png 600w, https://www.blef.fr/content/images/size/w1000/2024/06/Frame-27-3-.png 1000w, https://www.blef.fr/content/images/size/w1600/2024/06/Frame-27-3-.png 1600w, https://www.blef.fr/content/images/2024/06/Frame-27-3-.png 2121w" sizes="(min-width: 720px) 720px"></figure><p>The Google search results give an idea of the market both tools are trying to reach. Using a quick semantic analysis, "The" means both want to be THE platform you need when you're doing data.
Both companies have added Data and AI to their slogans: Snowflake used to be The Data Cloud and now they're The AI Data Cloud.</p><p>Below is a diagram describing how I think data platforms can be schematised:</p><ul><li><strong>Data storage</strong> — you need to store data in an efficient, interoperable manner, from the fresh to the old, with the metadata.</li><li><strong>Data engine</strong> — you need to make computations on data; the computation can be volatile or materialised back to the storage</li><li>Programmable — you need to run <strong>code</strong> on your platform; whatever the language or the technology, at some point you need to translate your business logic into a programmatic logic</li><li><strong>Visualisation</strong> — you need to visualise the output of the computed data because charts are often better than tables</li><li><strong>AI</strong> — you need to be proactive or predictive; that's when <strong>machine learning or deep learning</strong> enters, more generally today AI.</li><li>In order to make all of this work, data flows, going <strong>IN and OUT</strong>.</li><li><strong>Edge stuff</strong> — and then everything else that goes with it like privacy, observability, orchestration, scheduling, governance, etc.
which might or might not be required depending on the company's maturity.</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2024/06/Screenshot-2024-06-20-at-19.07.05.png" class="kg-image" alt="" loading="lazy" width="2000" height="963" srcset="https://www.blef.fr/content/images/size/w600/2024/06/Screenshot-2024-06-20-at-19.07.05.png 600w, https://www.blef.fr/content/images/size/w1000/2024/06/Screenshot-2024-06-20-at-19.07.05.png 1000w, https://www.blef.fr/content/images/size/w1600/2024/06/Screenshot-2024-06-20-at-19.07.05.png 1600w, https://www.blef.fr/content/images/2024/06/Screenshot-2024-06-20-at-19.07.05.png 2036w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">One way to read data platforms</span></figcaption></figure><p>When we look at platform history, what characterises evolution is the separation (or not) between the engine and the storage. Good old data warehouses like Oracle were engine + storage; then Hadoop arrived and was almost the same: you had an engine (MapReduce, Pig, Hive, Spark) and HDFS, everything in the same cluster, with data co-location. Then the cloud changed everything and created a new way, separating the storage from the engine, leading to ephemeral Spark clusters with S3, and then, <a href="https://www.getdbt.com/blog/future-of-the-modern-data-stack?ref=blef.fr">Cambrian explosion</a>, engines and storages multiplied.</p><p>This is the fundamental difference between Snowflake and Databricks.</p><p><strong>Snowflake sells a warehouse, but it's really more of a UX</strong>. A UX where you buy a single tool combining engine and storage, where all you have to do is flow data in, write SQL, and it's done. <strong>Databricks sells a toolbox; you don't buy any UX</strong>. Databricks is terribly designed: it's an amalgam of tools, with a lot of products doing the same thing—e.g.
you could write the same pipeline in Java, in Scala, in Python, in SQL, etc.—with Databricks you buy an engine.</p><p>At least, that's what the two platforms are all about. <strong>Ultimately, they both want to become everything between the left and the right arrows.</strong></p><p>Now that I've introduced the two competitors, let's get down to business. In this article I'll cover what Snowflake and Databricks announced at their respective summits and why Apache Iceberg, in the middle, crystallised all the hype.</p><p></p><h1 id="snowflake-summit">Snowflake Summit</h1><p>Snowflake took the lead, setting the tone. I won't delve into every announcement here, but for more details, SELECT has written a blog covering the <a href="https://select.dev/posts/snowflake-summit-2024?ref=blef.fr">28 announcements and takeaways from the Summit</a>. If you're a Snowflake customer, I recommend reading Ian's insights. His business is centered on Snowflake, and he always offers the best perspectives.</p><p>Here is what I think summarises the summit well:</p><ul><li><a href="https://docs.snowflake.com/en/user-guide/tables-iceberg?ref=blef.fr"><strong>Apache Iceberg</strong></a><strong> support</strong> — it means the Snowflake engine is now able to read Iceberg files. In order to read Iceberg files you need a catalog; Snowflake supports external catalogs—like AWS Glue—and they will <a href="https://www.snowflake.com/blog/introducing-polaris-catalog/?ref=blef.fr">open-source Polaris</a>, their own Apache Iceberg catalog, in the next 90 days. <br><br>If you're not familiar with Iceberg, it's an open-source table format built on top of Parquet. It adds metadata, reads, writes and transactions that allow you to treat Parquet files as a table.
For a comprehensive introduction to Iceberg, I recommend reading my friend <a href="https://seattledataguy.substack.com/p/apache-iceberg-what-is-it?ref=blef.fr">Julien's Iceberg guide</a>.</li><li><strong>Native CDC for </strong><a href="https://www.snowflake.com/blog/ingest-data-faster-easier-new-connectors-updates/?ref=blef.fr"><strong>Postgres and MySQL</strong></a> — Snowflake will be able to connect to Postgres and MySQL to natively move data from your databases to the warehouse. This could be a significant blow to Fivetran and Airbyte's business. While the exact pricing hasn't been revealed yet, the announcement emphasises cost-effectiveness.</li><li><strong>Store and run whatever you want on Snowflake</strong> — They bring a serverless / container philosophy to Snowflake, as you will be able to store your <a href="https://docs.snowflake.com/en/developer-guide/snowpark-ml/model-registry/overview?ref=blef.fr">AI models</a>, run <a href="https://docs.snowflake.com/en/developer-guide/snowpark/python/snowpark-pandas?ref=blef.fr">pandas</a> code or any <a href="https://docs.snowflake.com/en/developer-guide/snowpark-container-services/overview?ref=blef.fr">container</a>.</li><li><strong>Dark mode interface</strong> — Ironically it was their closing announcement, their most requested feature and their most-liked Reddit post following the announcement. I found it a bit ridiculous, but it showcases how much Snowflake is a UX-first platform.</li></ul><p>From the start, Snowflake has been a straightforward platform: load data, write SQL, period. This approach has always appealed to analysts, analytics engineers, and pragmatic data engineers. However, to capture a larger market and address AI use-cases, Snowflake needed to break through its glass ceiling. To me, that's what these major announcements are. Snowflake becomes Databricks.</p><p></p><h1 id="databricks-dataai">Databricks Data+AI</h1><p>I didn't attend either summit in person.
While I enjoy these events, I prefer to avoid flying for ecological reasons, and large gatherings can be challenging for an introvert like me. Watching the Data+AI Summit from home did give me a bit of <a href="https://en.wikipedia.org/wiki/Fear_of_missing_out?ref=blef.fr">FOMO</a>, but the Snowflake Summit did not. Databricks successfully built hype during the event, announcement after announcement.</p><p>Once again it boils down to the nature of the platform. Snowflake is insanely boring: even if use-cases are different, the Snowflake solution standardises everything. When it comes to Databricks, creativity arises—or we can call it tech debt. Through the multiplicity of products and ways to handle data, shiny stuff can appeal to everyone.</p><p>Here is what Databricks brought this year:</p><ul><li><a href="https://www.youtube.com/watch?v=S1B0J-uzSDE&ref=blef.fr">Spark 4.0</a> — (1) PySpark erases the differences with the Scala version, creating a first-class experience for Python users. (2) Spark versions will become even easier to manage with Spark Connect, allowing other languages to run Spark code—because Spark Connect decouples the client and the server. (3) Spark 4.0 will support ANSI SQL and <a href="https://spark.apache.org/news/spark-4.0.0-preview1.html?ref=blef.fr#:~:text=There%20are%20a%20lot%20of,by%20default%2C%20and%20many%20more." rel="noreferrer">many other things</a>.</li><li><a href="https://www.databricks.com/blog/introducing-aibi-intelligent-analytics-real-world-data?ref=blef.fr">Databricks AI/BI</a> — Databricks has introduced AI/BI, a smart business intelligence tool that blends an AI-powered low-code dashboarding solution with Genie, a conversational interface. AI/BI will be able to semantically understand and use all the objects you have in your Databricks instance.
Visually, the dashboarding solution looks like a mix between Tableau and Preset.</li><li><a href="https://docs.databricks.com/en/release-notes/serverless.html?ref=blef.fr">Serverless compute</a> — This keeps bridging the gap in terms of user experience: because managing Spark clusters is painful, serverless Spark lets you run a Spark job without worrying about the execution. Still, serverless compute does not support SQL.</li><li><a href="https://www.databricks.com/blog/databricks-tabular?ref=blef.fr">Buying Tabular</a> — Before the last bullet point, this was already something big. Databricks bought Tabular for $1b. Tabular was founded in 2021, had fewer than 50 employees and raised $37m. Jackpot. According to the press, Snowflake and Confluent (Kafka) were also trying to buy Tabular.<br><br>But what does Tabular do? Tabular is building a catalog for Apache Iceberg, and it employs a good share of the Iceberg open-source contributors. By getting Tabular, Databricks gets all the intellectual knowledge about Iceberg and how to build a catalog around it.</li><li><a href="https://www.databricks.com/blog/open-sourcing-unity-catalog?ref=blef.fr">Open-sourcing Unity Catalog</a> — Finally, on stage, Databricks' CEO hit the button to open-source Unity Catalog, directly responding to Snowflake’s open-sourcing of Polaris. Unity Catalog, previously a closed product, is now a key part of Databricks' strategy to become THE data platform. This move, combined with the Tabular acquisition, will help Databricks achieve top-notch support for Iceberg.</li></ul><p>If you've made it this far, you probably understand the story. Databricks is focusing on simplification (serverless, auto BI<sup>2</sup>, improved PySpark) while evolving into a data warehouse.
With the open-sourcing of Unity Catalog and the adoption of Iceberg, Databricks is equipping users with the toolbox to build their own data warehouses.</p><p></p><h1 id="apache-iceberg-and-the-catalogs">Apache Iceberg and the catalogs</h1><p>We finally get down to Iceberg. What's Iceberg? Why are catalogs so important? How do they differ from the data catalogs we are used to?</p><p>Iceberg was started at Netflix by Ryan Blue and Dan Weeks around 2017. Both later co-founded Tabular (which got acquired by Databricks). Iceberg was designed to fix the flaws of Hive around table management, especially around <a href="https://en.wikipedia.org/wiki/ACID?ref=blef.fr">ACID transactions</a>. The project became a top-level Apache project in Nov 2018.</p><p>Currently Apache Iceberg competes with Delta Lake and Apache Hudi, and it has become the leading format in the community when looking at all metrics. Newcomers like <a href="https://github.com/facebookincubator/nimble?ref=blef.fr">nimble</a> or the <a href="https://duckdb.org/docs/internals/storage?ref=blef.fr">DuckDB</a> table format are also arriving late to the party and could be a thing in the future.</p><p><strong>What is Iceberg?</strong></p><p>Over the last years the community settled on Parquet as the go-to file format for storing data. Parquet has many advantages: it's columnar, compressed, can push down predicates, owns the schema at file level and more. But there are a few issues with Parquet. Parquet is a storage format: except for a few metadata fields and the schema, Parquet lacks information about the <em>table</em>.</p><p>A <em>table format</em> creates an abstraction layer between you and the storage format, allowing you to interact with files in storage as if they were tables.
This enables easier data management and query operations, making it possible to perform SQL-like operations and transactions directly on data files.</p><p>Iceberg is composed of 2 layers, with sublayers, like an onion:</p><ul><li>the data layer — contains the raw data in Parquet; Iceberg manages the way the Parquet files are partitioned, etc.</li><li>the metadata layer<ul><li>manifest file — A manifest is an immutable Avro file that lists data files or delete files, along with each file’s partition data tuple, metrics, and tracking information.</li><li>manifest list (or snapshot) — A new manifest list is written for each attempt to commit a snapshot, because the list of manifests always changes to produce a new snapshot. This is simply a collection of manifests describing a state or a partial state of the table.</li><li>metadata file — Table metadata is stored as JSON. Each table metadata change creates a new table metadata file that is committed by an atomic operation.</li></ul></li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://iceberg.apache.org/assets/external/iceberg.apache.org/assets/images/iceberg-metadata.png" class="kg-image" alt="Iceberg snapshot structure" loading="lazy" width="1248" height="1290"><figcaption><span style="white-space: pre-wrap;">Official Iceberg schema (</span><a href="https://iceberg.apache.org/spec/?ref=blef.fr#overview" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p>That's what it is: if you have to remember one thing, <strong>Iceberg creates tables on top of raw Parquet files</strong>.</p><p>So once you have Iceberg you're able to create multiple tables, but you need a place to store all the metadata about your tables. Iceberg manages each table individually, but obviously you need more than one table. That's why you need a catalog.
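</p><p>The metadata layering described above can be sketched as a toy model in Python. This is purely a conceptual illustration (plain dictionaries, not the real Iceberg spec objects nor the PyIceberg API): a catalog points to a table's latest metadata file, which points to a manifest list (snapshot), which points to manifests, which finally list the Parquet data files.</p>

```python
# Toy model of Iceberg's metadata layering (conceptual sketch only).
# Catalog -> table metadata -> manifest list (snapshot) -> manifests -> data files.

manifest = {  # an (immutable) file listing data files with per-file stats
    "entries": [
        {"data_file": "s3://lake/orders/part-0.parquet", "record_count": 1200},
        {"data_file": "s3://lake/orders/part-1.parquet", "record_count": 800},
    ]
}

manifest_list = {  # one per snapshot: a collection of manifests
    "snapshot_id": 42,
    "manifests": [manifest],
}

table_metadata = {  # JSON file, replaced atomically on every table change
    "table": "orders",
    "schema": {"order_id": "long", "amount": "double"},
    "current_snapshot": manifest_list,
}

catalog = {"orders": table_metadata}  # the catalog tracks many such tables


def record_count(cat, table_name):
    """Walk the layers to answer a simple question: how many rows?"""
    snapshot = cat[table_name]["current_snapshot"]
    return sum(
        entry["record_count"]
        for m in snapshot["manifests"]
        for entry in m["entries"]
    )


print(record_count(catalog, "orders"))  # 2000
```

<p>In the real world that walk is performed by the engine (Spark, Trino, Snowflake...) through the Iceberg libraries; the key point is that every layer is just immutable files in object storage, and only the catalog pointer changes on commit. 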
This catalog is like the <a href="https://en.wikipedia.org/wiki/Apache_Hive?ref=blef.fr">Hive</a> Metastore. I've read somewhere that we should call it a <em>super metastore</em> rather than a catalog, a term already used to describe another product in the data community.</p><p>Still, we need a place to keep track of all our Iceberg tables. That's what <a href="https://www.unitycatalog.io/?ref=blef.fr">Unity Catalog</a>, <a href="https://docs.aws.amazon.com/glue/latest/dg/catalog-and-crawler.html?ref=blef.fr">AWS Glue Data Catalog</a>, <a href="https://www.snowflake.com/blog/introducing-polaris-catalog/?ref=blef.fr">Polaris</a>, the <a href="https://github.com/kevinjqliu/iceberg-rest-catalog?ref=blef.fr">Iceberg Rest Catalog</a> and <a href="https://tabular.io/?ref=blef.fr">Tabular</a> (RIP) are. Actually, all of these catalogs implement the <a href="https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml?ref=blef.fr">Iceberg REST Open API</a> specification.</p><div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">💡</div><div class="kg-callout-text">Read <a href="https://seattledataguy.substack.com/p/apache-iceberg-what-is-it?ref=blef.fr">Julien's post about Apache Iceberg</a> if you want to go deeper.</div></div><p></p><h1 id="conclusion">Conclusion</h1><p>Databricks and Snowflake embracing Iceberg, by open-sourcing compatible catalogs and opening their engines to Iceberg, shows how far ahead Iceberg is. I don't think Databricks or Snowflake really won the competition.</p><p>On Snowflake's side, they mitigated the impact by open-sourcing Polaris and embracing the Iceberg format. However, most Snowflake end-users won't be concerned with these changes; they simply want to write SQL queries on their data. These format details are more relevant to data engineers. Snowflake finds itself between Databricks' innovation and BigQuery's simplicity<sup>3</sup> (ingest data, query).
To grow, Snowflake needs to expand in both directions.</p><p>With this move Databricks will finally provide a data warehouse to their customers; it will be a data warehouse in kit form, but a data warehouse nonetheless. Because this is what it is: the Iceberg + catalog combo just creates a data warehouse. It mimics what databases have been doing for ages, but more in the open, with you pulling all the levers, rather than something hidden in a black box written in a compiled database language like C.</p><p>Wait, Iceberg is written in Java, and honestly, PyIceberg is lagging significantly behind the Java version... Here we go again.</p><hr><p>1 — I don't like the classic term to qualify AWS, Google and Microsoft, but actually that's what they are right now. Leaders and commodities.</p><p>2 — I just made this term up; it doesn't seem to exist for data really, but I like it a lot.</p><p>3 — Actually, BigQuery recently added a lot of features to extend the compute, with more ways to interact with data (<a href="https://cloud.google.com/bigquery/docs/create-notebooks?ref=blef.fr">notebooks</a>, <a href="https://cloud.google.com/bigquery/docs/data-canvas?ref=blef.fr">canvas</a>, etc.)</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 24.24 ]]></title>
                    <description><![CDATA[ Data News #24.24 — I&#39;m back sorry for the late news. I&#39;m co-organising a conference in Paris in Nov, CfP is open, AI news with OpenAI and Apple and a lot of Fast News. ]]></description>
                    <link><![CDATA[ /data-news-week-24-24/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 666bee985d699d00018ca466 ]]></guid>
                    <pubDate><![CDATA[ 2024-06-15 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2024/06/image-1.png" class="kg-image" alt="" loading="lazy" width="800" height="533" srcset="https://www.blef.fr/content/images/size/w600/2024/06/image-1.png 600w, https://www.blef.fr/content/images/2024/06/image-1.png 800w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">hey (</span><a href="https://unsplash.com/photos/focus-photography-of-standing-gray-rodent-uWCGd6BY-zU?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p>🥹It's been a long time since I've put words down on paper or hit the keyboard to send bytes across the network. We're in the age of AI, and my lord, computer science has evolved over the last 30 years. I'm writing this edition from my childhood home, and it brings back memories. I got my first computer at the age of 6 and spent my days installing Windows 98 over and over again, getting lost between the BIOS and the Windows installation pages, playing with Word, Dreamweaver and Adobe Premiere.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2024/06/Screenshot-2024-06-15-at-10.31.50.png" class="kg-image" alt="" loading="lazy" width="1704" height="822" srcset="https://www.blef.fr/content/images/size/w600/2024/06/Screenshot-2024-06-15-at-10.31.50.png 600w, https://www.blef.fr/content/images/size/w1000/2024/06/Screenshot-2024-06-15-at-10.31.50.png 1000w, https://www.blef.fr/content/images/size/w1600/2024/06/Screenshot-2024-06-15-at-10.31.50.png 1600w, https://www.blef.fr/content/images/2024/06/Screenshot-2024-06-15-at-10.31.50.png 1704w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">My first website is still up somewhere on the internet 🥹 — it was to help my aunt sell her 
house</span></figcaption></figure><p>Who would have thought that 25 years later, I'd be celebrating 10 years working with computers? June also marks the third anniversary of this newsletter. 3 years ago I started the newsletter in order to share my expertise with people, and I'm so happy with how it turned out. <strong>More than 5000 members subscribed to the newsletter and the blog generated almost 100k unique visitors.</strong></p><p>Recently a lot of people subscribed but never received a Data News. I want to give you a warm welcome; this edition marks the start of the journey we embark on together, and you will enjoy what's coming next, I'm sure.</p><p>I've taken a little forced break because I've been overwhelmed with work lately, juggling a lot of requests and my customers' work. In order to deliver I had to reclaim my Fridays. Around the newsletter there are unfinished projects with the <a href="https://www.blef.fr/explorer/reco/">Recommendations</a> page and <a href="https://www.qrators.io/?ref=blef.fr">Qrators</a>, and I'll get back to them starting in July once I'm done with the rest.</p><p></p><h1 id="forward-data-conference-%E2%8F%A9">Forward data conference ⏩</h1><p>I'm excited to announce that I am co-organising the <a href="https://www.forward-data-conference.com/?ref=blef.fr">Forward Data Conference</a>, a one-day event in Paris. Join us on November 25th as we bring together around 350 attendees and an impressive lineup of speakers. It's going to be an incredible opportunity to connect, learn, and explore the latest in data.
We will do our best to make the conference friendly for native English speakers.</p><p>Forward Data aims to be a hub for knowledge sharing and best practices, offering you the chance to expand your horizons, explore new facets of the data ecosystem, and connect with key international community leaders.</p><figure class="kg-card kg-image-card kg-card-hascaption"><a href="https://www.forward-data-conference.com/?ref=blef.fr"><img src="https://www.blef.fr/content/images/2024/06/Screenshot-2024-06-15-at-14.54.10.png" class="kg-image" alt="" loading="lazy" width="2000" height="1155" srcset="https://www.blef.fr/content/images/size/w600/2024/06/Screenshot-2024-06-15-at-14.54.10.png 600w, https://www.blef.fr/content/images/size/w1000/2024/06/Screenshot-2024-06-15-at-14.54.10.png 1000w, https://www.blef.fr/content/images/size/w1600/2024/06/Screenshot-2024-06-15-at-14.54.10.png 1600w, https://www.blef.fr/content/images/size/w2400/2024/06/Screenshot-2024-06-15-at-14.54.10.png 2400w" sizes="(min-width: 720px) 720px"></a><figcaption><span style="white-space: pre-wrap;">Be ready for Forward Data!</span></figcaption></figure><p></p><h1 id="ai-news-%F0%9F%A4%96">AI News 🤖</h1><p>A lot of AI news happened in the last 3 weeks. Here is a small recap.</p><ul><li>OpenAI<ul><li><a href="https://www.vox.com/future-perfect/2024/5/17/24158403/openai-resignations-ai-safety-ilya-sutskever-jan-leike-artificial-intelligence?ref=blef.fr">The superalignment team was fired</a> — The goal of the superalignment team was to research all topics related to AGI safety. But it seems priorities got reshuffled. 
Then <a href="https://www.theverge.com/2024/6/13/24178079/openai-board-paul-nakasone-nsa-safety?ref=blef.fr">OpenAI appointed a former NSA leader</a> (nominated by Donald Trump); he will probably work with the Safety and Security committee.</li><li><a href="https://zeenews.india.com/companies/openai-doubles-annualised-revenue-to-3-4-billion-report-2757398.html?ref=blef.fr">Annualised revenue projected to be $3.4b</a> — It's crazy how the company reached this amount, mainly by selling to enterprise customers. By comparison, Snowflake's revenue was $2.8b in 2023.</li><li><a href="https://openai.com/index/extracting-concepts-from-gpt-4/?ref=blef.fr">Extracting concepts from GPT-4</a></li></ul></li><li>Apple announced iOS 18 and their own AI — AI will stand for <a href="https://www.apple.com/apple-intelligence/?ref=blef.fr">Apple Intelligence</a>. With great ego, Apple appropriated the letters AI. At their annual developer conference (the <a href="https://developer.apple.com/wwdc24/?ref=blef.fr">WWDC</a>) they showcased how AI will be integrated everywhere in iOS:<ul><li><strong>Siri has been revamped </strong>— now looking like a Microsoft AI copilot, Siri will be able to sort notifications, help you write better and give better contextualised answers. Siri will also integrate with OpenAI through ChatGPT when needed.</li><li>At the same time they announced their <strong>model will run on-device</strong> (keeping your data safe and private), and when more compute is required they will use a private cloud. </li></ul></li><ul><li><strong>Writing tools</strong> — bringing a few of the best GenAI features: <strong>proofreading and rewriting</strong>. 
When selecting text you will be able to ask the model to rewrite it more professionally, etc.</li></ul><ul><li><strong>Genmoji</strong> — a way for your parents to be even cringier in their emoji usage, by generating emoji from a sentence.</li><li>Finally, with the new Siri and Writing tools they <strong>reworked one of the worst Apple applications: Mail</strong>, giving it a better look and new email-writing capabilities.</li><li>This joins other features for which Apple will introduce AI (and GenAI) throughout its products (audio transcription, image generation from tags, better natural language search on photos, etc.). But this anchors Apple as a consumer products company, not an AI company like Google, Microsoft or Meta. Apple has decided for years to keep its users' data safe and private, which means it doesn't have a pool of data to train large language models.</li></ul><li><a href="https://x.com/elonmusk/status/1798504201196368219?ref=blef.fr">How to rethink recommendations for social networks</a> — A short video of Jack Dorsey (Twitter co-founder) about recommendation algorithms and how platforms today should give the choice back to users; this is about free will and about the biases / <a href="https://en.wikipedia.org/wiki/Filter_bubble?ref=blef.fr#:~:text=A%20filter%20bubble%20or%20ideological,recommendation%20systems%2C%20and%20algorithmic%20curation.">filter bubbles</a> we build. We should have transparency on the rules driving recommendations, and platforms should propose multiple algorithms and let users decide, like a marketplace.</li><li><a href="https://medium.com/@anis.zakari/changing-the-gpu-is-changing-the-behaviour-of-your-llm-0e6dd8dfaaae?ref=blef.fr">Changing the GPU is changing the behaviour of your LLM</a> — A cool experiment that shows how the GPU impacts inference.</li><li><a href="https://mlops-coding-course.fmind.dev/?ref=blef.fr">MLOps coding course</a> — Great MLOps course! 
It contains 6 chapters and covers all the topics needed to put models in production while making the right choices.</li><li><a href="https://cloud.google.com/blog/products/data-analytics/how-to-use-rag-in-bigquery-to-bolster-llms?hl=en&ref=blef.fr">RAG in BigQuery</a> — When you do RAG in a database, it often comes down to embedding functions and being able to query those vectors with good performance. BigQuery has the whole toolkit for it and this article showcases it well (and, let's be honest, all the competition does the same).</li><li><a href="https://pure.mpg.de/rest/items/item_3588217_2/component/file_3588218/content?ref=blef.fr">What makes a Gen AI system open?</a> — A paper that surveys 45 models across 14 elements that could define them as open. <a href="https://huggingface.co/allenai/OLMo-7B-Instruct?ref=blef.fr">OLMo 7B Instruct</a> is the most open according to the paper, and ChatGPT the least. On the same topic, Mozilla released a paper about a <a href="https://assets.mofoprod.net/network/documents/Towards_a_Framework_for_Openness_in_Foundation_Models.pdf?ref=blef.fr">framework for Openness in Foundation Models</a>.</li></ul><p></p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2024/06/image.png" class="kg-image" alt="" loading="lazy" width="800" height="539" srcset="https://www.blef.fr/content/images/size/w600/2024/06/image.png 600w, https://www.blef.fr/content/images/2024/06/image.png 800w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">(</span><a href="https://unsplash.com/photos/white-and-green-box-on-table-iCp8p7wVXS0?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li><a href="https://louisabraham.github.io/articles/probabilistic-tic-tac-toe?ref=blef.fr">Solving probabilistic Tic-Tac-Toe</a> — Probabilistic 
tic-tac-toe is like tic-tac-toe, but each cell is given a probability distribution: when you make a play, you randomly get an <em>x</em>, an <em>o</em>, or nothing. Someone developed a Unity version of the game and someone else wrote a math solver giving the best play at every turn.</li><li><a href="https://amphi.ai/?ref=blef.fr">Amphi ETL</a> — Amphi is a low-code visual ETL tool that you can run in JupyterLab. This is super clever; it's the first time I've seen this kind of application run as a Jupyter extension. Still early, but worth watching in the future.</li><li><a href="https://untitleddata.company/blog/How-to-create-a-dlt-source-with-a-custom-authentication-method-rest-api-vs-airbyte-low-code?ref=blef.fr">Compare Airbyte and dlt ways to create custom sources</a> — A long article that compares Airbyte and dlt when it comes to creating custom sources. Both extract-and-load tools can create custom sources, via either Airbyte's low-code CDK or dlt's REST API Source toolkit.</li><li><a href="https://clickhouse.com/blog/how-trip.com-migrated-from-elasticsearch-and-built-a-50pb-logging-solution-with-clickhouse?ref=blef.fr">trip.com migrated from 50PB Elastic to ClickHouse</a> — I've never been a fan of NoSQL platforms like Elasticsearch for data work. This article on the ClickHouse blog showcases how a client migrated their ES cluster to ClickHouse to improve their log-querying capabilities. The article also covers how to correctly route queries once at scale with multiple ClickHouse clusters.</li><li><a href="https://clickhouse.com/videos/hunting-non-optimized-queries-clickhouse?ref=blef.fr">Hunting non-optimised queries in ClickHouse</a> — The talk is about ClickHouse but applies to every engine. In the talk, Yohann explains the mechanism he put in place to find non-optimised SELECTs. 
He did it with a machine learning model, which means he identified the features that slow queries down, like nesting, subqueries, joins and WHERE clauses.</li><li><a href="https://github.com/tosun-si/bigtesty?ref=blef.fr">BigTesty</a> — a framework for creating BigQuery integration tests on real, short-lived infrastructure. It uses Pulumi (an infra-as-code tool); you provide inputs, SQL queries and expected outputs, and it runs the tests against a dedicated BigQuery project.</li><li><a href="https://engineering.atspotify.com/2024/05/data-platform-explained-part-ii/?ref=blef.fr">Data platform explained part II</a>&nbsp;— Part 2 of the Spotify article about data platforms. They name 3 different steps: data collection, management and processing (and they even mention GDPR), and finally explain how they approach data culture.</li><li><a href="https://seattledataguy.substack.com/p/apache-iceberg-what-is-it?ref=blef.fr">What is really Apache Iceberg?</a> — Iceberg has been at the center of discussions this week. Julien wrote the greatest deep dive you can find on the topic.</li><li><a href="https://www.linkedin.com/pulse/cron-expressions-duckdb-rusty-conover-6bole/?trackingId=o0MmGrYtQbqqZ3mybY2kmQ%3D%3D&ref=blef.fr">Cron expressions with DuckDB</a> — A handy DuckDB function that generates time arrays from a cron expression; it's more understandable than generate_series().</li><li><a href="https://engineering.fb.com/2024/06/10/data-infrastructure/serverless-jupyter-notebooks-bento-meta/?ref=blef.fr">Serverless Jupyter notebooks at Meta</a> — They developed a system called Bento which allows notebooks to run either with classic kernels or with an in-browser kernel (truly serverless) using Pyodide. 
They have handy functions to pull SQL, Google Sheets or GraphQL data into browser memory and then work on it.</li><li><a href="https://www.youtube.com/watch?v=jXmRrChXUrI&ref=blef.fr">Airflow's new youth</a> — If you stayed on Airflow 1.x or a pre-2.6 release, you might have missed Airflow's new youth. This presentation from Jarek showcases all the recent improvements: data-aware scheduling, deferrable operators, object storage, etc.</li><li><a href="https://aetperf.github.io/2024/05/30/A-Hybrid-information-retriever-with-DuckDB.html?ref=blef.fr">A hybrid information retriever with DuckDB</a> — how you can fuse semantic and lexical search with DuckDB. Looks neat.</li><li><a href="https://blog.picnic.nl/picnic-open-sources-dbt-score-linting-model-metadata-with-ease-428278f9f05b?ref=blef.fr">dbt-score, lint metadata and get max score</a> — Lint your dbt metadata, get a score and be happy in your CI/CD.</li><li><a href="https://tobikodata.com/automatically-detecting-breaking-changes-in-sql-queries.html?ref=blef.fr">Automatically detecting breaking changes in SQL queries</a> — Use SQLGlot's diff function (on the AST) to get what changed in a SQL query and act accordingly.</li><li><a href="https://medium.pimpaudben.fr/how-i-failed-to-implement-dbt-in-my-previous-job-0b168f59e150?ref=blef.fr">How I failed to implement dbt</a> — Benoit explains why he failed to implement dbt in his previous role. He identifies 5 errors that led to the failure. As always, it's not about a technical issue.</li><li><a href="https://mmc.vc/research/250-european/?ref=blef.fr">250 European data infrastructure startups and what we learned from them</a> — Another perspective on data infrastructure that greatly complements the <a href="https://mattturck.com/mad2024/?ref=blef.fr">MAD landscape</a>. 
At the end of the page it gives great definitions of every part of a data platform.</li><li><a href="https://dagster.io/blog/the-rise-of-medium-code?ref=blef.fr">The rise of medium code</a> — Between low-code practitioners and software engineers there are medium-code practitioners, like analytics engineers and data scientists. This code often lives in Python orchestrators and has to be treated properly because it's production code as well.</li><li><a href="https://lakefs.io/blog/data-engineering-patterns-write-audit-publish/?ref=blef.fr" rel="bookmark kk">Write-Audit-Publish pattern</a> — Once again a great article about this pattern.</li><li><a href="https://medium.com/data-monzo/how-monzo-uses-incremental-modelling-to-handle-billions-of-events-every-day-45b2bc9ebe89?ref=blef.fr">How Monzo uses incremental modelling to handle billions of events every day</a>.</li></ul><div class="kg-card kg-callout-card kg-callout-card-grey"><div class="kg-callout-emoji">💡</div><div class="kg-callout-text">I'm working on a dedicated article about Snowflake's and Databricks' latest advancements, which should be published on Monday.</div></div><p></p><h1 id="data-economy-%F0%9F%92%B0">Data Economy 💰</h1><ul><li><a href="https://sifted.eu/articles/mistral-468m-round-news?ref=blef.fr">Mistral raises €600m</a> — Mistral has never been a French company since the first rounds, but it is raising a lot of cash again to go faster.</li><li><a href="https://x.ai/blog/series-b?ref=blef.fr">xAI raises $6b</a> — Late to the party, and it seemed no one cared, but Musk is trying to fight.</li><li><a href="https://cube.dev/blog/cubes-raises-25-million?ref=blef.fr">Cube raises $25m</a> — Cube has the most advanced piece of technology today when it comes to the semantic layer, and they raised enough money to keep going in this direction.</li><li><a href="https://www.snowflake.com/blog/snowflake-ventures-invests-in-omni-to-empower-self-service-business-intelligence-and-data-modeling/?ref=blef.fr">Snowflake invests 
in Omni</a> — Omni is a refreshed version of Looker with a fresher LookML. </li><li><a href="https://www.databricks.com/company/newsroom/press-releases/databricks-agrees-acquire-tabular-company-founded-original-creators?ref=blef.fr">Databricks acquires Tabular</a> — It created waves last week in the data community. I'll write more about it on Monday.</li><li><a href="https://tobikodata.com/the_future_of_tobiko.html?ref=blef.fr">Tobiko raises $17.3m</a> — The company behind SQLMesh and SQLGlot raises cash to create a suite of tools to invent the data development of tomorrow.</li><li><a href="https://redpanda.com/press/redpanda-acquires-benthos?ref=blef.fr">Redpanda acquires Benthos</a> — In the streaming world this was big news.</li></ul><p></p><hr><p>I want to address something weighing on my mind. We've all seen the results of recent European elections and how the far right has influenced public debate and opinion. I strongly believe we should not fall for their tactics or their so-called solutions. In the tech community, many of us are privileged, often due to our financial stability. However, we cannot build a society with only people like us. Because of our privilege, we (1) should vote, (2) should use our vote to support those marginalised by the system.</p><p>For my French readers, there are parliamentary elections in France in 15 days. I urge you to vote and to vote against the far right. Hate and division are not solutions. Cutting public services through tax reductions is not a solution. Pushing for more productivity when AI is on the rise is not a solution. Individualism is not a solution. They don't bring any solution.</p><p>Consider what the tech ecosystem would look like under far-right principles: diversity stifled, innovation hindered, and global collaboration restricted. 
These ideologies could limit talent flow, reduce educational programs, and promote censorship and surveillance (which is almost already here; we work in big data, let's face reality), undermining our core values of privacy and open access.</p><p>If you feel this message doesn't belong in a tech newsletter or professional sphere, I don't care and you can unsubscribe. However, I believe that advocating for openness and tolerance is essential, and accepting hate speech is unacceptable.</p><hr><p>See you next week ❤️.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 24.20 ]]></title>
                    <description><![CDATA[ Data News #24.20 — Big edition, 5000 members ❤️, launching Qrators to search in videos, Data Council, OpenAI and Google I/O stuff and data eng stuff. ]]></description>
                    <link><![CDATA[ /data-news-week-24-20/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 6647284168a7850001d2407a ]]></guid>
                    <pubDate><![CDATA[ 2024-05-17 ]]></pubDate>
                    <content>
                        <![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/photo-1471877325906-aee7c2240b5f?q=80&amp;w=1000&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D" class="kg-image" alt="photography of spot light turned on" loading="lazy" width="1000" height="667" srcset="https://images.unsplash.com/photo-1471877325906-aee7c2240b5f?q=80&amp;w=600&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D 600w, https://images.unsplash.com/photo-1471877325906-aee7c2240b5f?q=80&amp;w=1000&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D 1000w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Lights on (</span><a href="https://unsplash.com/photos/photography-of-spot-light-turned-on-mln2ExJIkfc?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p>Hello you. The sun is out, the days are getting longer and Data News is still here. Next week marks 3 years of this newsletter/blog (yay 🎉 ). It'll be a time for looking back, reflecting and celebrating, but next week. This week, we reached 5000 members.</p><p><strong>Yes, 5000 of you read my content periodically. Just thank you ❤️.</strong></p><p>In recent days I've been working on a new side project. What if you could search in video content and get the exact timestamp of what you're looking for?</p><p>Let me introduce an application of this to the 80 videos of Data Council 2024.</p><h1 id="data-council-2024-%E2%9C%A8">Data Council 2024 ✨</h1><p>Data Council Austin is, in my opinion, one of the best conferences when it comes to thinking about the future of data. Every year the talks given at DC are full of quality content. 
There is one main drawback: it's 80 videos of ~30 minutes each, and not everyone has the time to watch everything or search through them.</p><p>So I developed an app that lets you <strong>search for words in the Data Council video playlist</strong>, and with <a href="https://juhache.substack.com/?ref=blef.fr">Julien</a> we've <strong>curated highlights</strong> so you can watch only the best parts.</p><p>It's available on <a href="https://qrators.io/?ref=blef.fr">qrators</a> (can be pronounced curators / creators). For the moment it works only on desktop.</p><div class="kg-card kg-button-card kg-align-center"><a href="https://www.qrators.io/?ref=blef.fr" class="kg-btn kg-btn-accent">Search content on qrators</a></div><p>The search works great: you can do full-text or exact-term searches, for instance <a href="https://www.qrators.io/?search=Airflow&ref=blef.fr">Airflow</a>, <a href="https://www.qrators.io/?search=dbt&ref=blef.fr">dbt</a>, <a href="https://www.qrators.io/?search=backfill&ref=blef.fr">backfill</a>, <a href="https://www.qrators.io/?search=%22data+mesh%22&ref=blef.fr">"data mesh"</a> or <a href="https://www.qrators.io/?search=%22SQL+Glot%22&ref=blef.fr">"SQL Glot"</a>. Quotes mean an exact-term search.</p><p>I'll write another post later about the behind-the-scenes and how this app was built, but because I'm your humble servant: this app uses DuckDB WASM and requires no backend to work (except a bucket with the data).</p><p>Still, as always I want you to get a few takeaways from the conference, so here are my favourite talks with a few highlights:</p><ul><li><a href="https://youtu.be/cylAr9oUluI?ref=blef.fr">Data culture as a product</a> — Abhi already gave one of my favourite talks of <a href="https://www.blef.fr/data-council-austin-takeaways/">Data Council 2023</a>, about metrics trees. 
Following on from his work on metrics, this time he attempts to give advice on creating a good data culture in order to create a good decision culture in companies. After all, companies need to make decisions, and these decisions need to be informed by data. [<a href="https://www.qrators.io/?videoId=cylAr9oUluI&ref=blef.fr">highlights</a>]</li><li><a href="https://youtu.be/TrmJilG4GXk?ref=blef.fr">Processing trillions of records at Okta with DuckDB instead of Snowflake</a> — it was one of my most anticipated talks of the council, because a few months ago Jake posted on LinkedIn that his team had reduced their Snowflake bill by hundreds of thousands of dollars by shifting to DuckDB. In the talk he explained what the issue with Snowflake was and how a multi-engine data stack built on top of S3 + Lambda drastically reduced dollars spent. [<a href="https://www.qrators.io/?videoId=TrmJilG4GXk&ref=blef.fr">highlights</a>]</li></ul><p>I liked a few of the other talks, but I think I'll do a dedicated post about them because the Data News is already super dense.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2024/05/Screenshot-2024-05-17-at-14.30.45.png" class="kg-image" alt="" loading="lazy" width="2000" height="1238" srcset="https://www.blef.fr/content/images/size/w600/2024/05/Screenshot-2024-05-17-at-14.30.45.png 600w, https://www.blef.fr/content/images/size/w1000/2024/05/Screenshot-2024-05-17-at-14.30.45.png 1000w, https://www.blef.fr/content/images/size/w1600/2024/05/Screenshot-2024-05-17-at-14.30.45.png 1600w, https://www.blef.fr/content/images/size/w2400/2024/05/Screenshot-2024-05-17-at-14.30.45.png 2400w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Launching Qrators, a place to search for stuff in videos</span></figcaption></figure><p></p><h1 id="ai-news-%F0%9F%A4%96">AI News 🤖</h1><ul><li>OpenAI's recent announcements — The company behind ChatGPT announced a few things hyping everyone 
recently. Especially their <a href="https://openai.com/index/gpt-4o-and-more-tools-to-chatgpt-free/?ref=blef.fr">GPT-4o</a> model (it's 4-o with the letter o, not the number 40), which adds new capabilities to ChatGPT around photos, videos and audio. The model can talk, understand what's in an image or video, and answer questions about it. They also released a macOS app that you can summon with Option+Space to ask ChatGPT. OpenAI also detailed a bit their <a href="https://openai.com/index/introducing-the-model-spec/?ref=blef.fr">model specs</a> and what principles they implemented to put guardrails around answers.</li><li><a href="https://www.businessinsider.com/satya-nadella-bill-gates-microsoft-concern-google-rivals-ai-emails-2024-5?ref=blef.fr">Why Microsoft invested in OpenAI in 2019</a> — Emails explaining why Satya Nadella (CEO) and Kevin Scott (CTO) pushed Microsoft to invest in OpenAI have been made public, and are worth a look. It mainly reads that Microsoft was "several years behind the competition in terms of ML scale" (compared to Google, in search / ML in applications) and that to get there, they needed someone with gigantic ambition, from silicon chips to high-level programming abstractions. And the OpenAI team was that someone.</li><li><a href="https://www.theinformation.com/articles/openais-new-tack-in-talent-war-with-google-promising-recruits-a-quick-stock-bump?ref=blef.fr">OpenAI is offering $10m packages</a> to top AI researchers. There is a paywall, so I can't say more.</li><li><a href="https://www.politico.com/news/2024/05/12/ai-lobbyists-gain-upper-hand-washington-00157437?ref=blef.fr">AI lobbyists are everywhere now</a> — A bit more political, but with the stakes around AI (money, power, content moderation and generation, privacy, etc.), lobbying around it is through the roof.</li><li><a href="https://www.youtube.com/watch?v=XEzRZ35urlk&ref=blef.fr">Google I/O keynote</a> — Google I/O was Google's response to OpenAI's announcements around models. 
They showcased agents that can help you do more in your favourite Google apps, then DeepMind showcased new capabilities around image and music processing / generation. But one of the most important announcements took only a few seconds: search <a href="https://www.wired.com/story/google-io-end-of-google-search/?ref=blef.fr">might change forever</a> (the paywall can be avoided with a page reader). Google introduced AI Overviews, which will be presented first in search answers, pushing traditional results far below. </li><li><a href="https://x.com/fchollet/status/1791168963445223543?ref=blef.fr">LLMs with Keras</a> — The Keras team demoed various workflows around LLMs (Gemma) with Keras.</li><li><a href="https://x.com/JoshuaSteinman/status/1790942018077966409?ref=blef.fr">Opt out to avoid Slack training LLM models on your private data</a> — Slack (acquired by Salesforce) could train their LLM models on your data. Still, they answered in the Twitter thread, but it's legal stuff I don't understand.</li><li><a href="https://huggingface.co/HuggingFaceM4/idefics2-8b?ref=blef.fr">HuggingFace releases Idefics2</a> — An open multimodal model that accepts arbitrary sequences of image and text inputs and produces text outputs. 
It works with multiple images as well, to create stories.</li><li><a href="https://doordash.engineering/2024/04/23/building-doordashs-product-knowledge-graph-with-large-language-models/?ref=blef.fr">Building DoorDash’s product knowledge graph with LLMs</a> — A good graph is like good wine, and DoorDash used LLMs' information-extraction capabilities to improve their product catalog graph.</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/photo-1473090928358-00fcead4f08c?q=80&amp;w=1000&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D" class="kg-image" alt="man using welding machine" loading="lazy" width="1000" height="731" srcset="https://images.unsplash.com/photo-1473090928358-00fcead4f08c?q=80&amp;w=600&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D 600w, https://images.unsplash.com/photo-1473090928358-00fcead4f08c?q=80&amp;w=1000&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D 1000w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Fusion (</span><a href="https://unsplash.com/photos/man-using-welding-machine-9sJMyPKlKhw?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li><a href="https://arrow.apache.org/blog/2024/05/07/datafusion-tlp/?ref=blef.fr">Apache Arrow DataFusion becomes Apache DataFusion</a> — DataFusion, a query engine built in Rust that uses Arrow for its in-memory structures, has been promoted to a top-level Apache project. DataFusion is one of the most important alternatives to DuckDB when it comes to engines (not mentioning Polars here). 
On that topic, this week I met people from <a href="https://www.sdf.com/?ref=blef.fr">SDF</a> who are betting on DataFusion as their core execution engine.</li><li><a href="https://github.com/facebookincubator/nimble?ref=blef.fr">facebook/nimble, a new columnar file format</a> — A new columnar file format is out. They announce it as "a replacement for file formats such as Apache Parquet". Ok, but why?</li><li><a href="https://medium.com/blablacar/unexpected-tips-for-data-managers-c44a71db6594?ref=blef.fr">Unexpected tips for data managers</a> — A comprehensive and pragmatic list of tips for being a great data manager. This is pure gold.</li><li><a href="https://mikkeldengsoe.substack.com/p/data-about-data-from-1000-conversations?ref=blef.fr">Data about data from 1,000 conversations with data teams</a> — Mikkel shares the output of his interviews with a lot of data teams and which topics are important.</li><li><a href="https://medium.com/israeli-tech-radar/how-to-save-90-on-bigquery-storage-a1ca99582c5c?ref=blef.fr">How to save 90% of BigQuery’s storage cost</a> and <a href="https://www.startdataengineering.com/post/optimize-snowflake-cost/?ref=blef.fr">how to reduce your Snowflake cost</a>. On the same topic, if you don't know about GROUP BY ROLLUP you should use it in <a href="https://docs.snowflake.com/en/sql-reference/constructs/group-by-rollup?ref=blef.fr">Snowflake</a> or <a href="https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax?ref=blef.fr#group_by_rollup">BigQuery</a>.</li><li><a href="https://kunalbhat.notion.site/Reverse-Engineering-Connections-by-NYT-b325a3ed84a14ddb90322887aa1cb7be?ref=blef.fr">Reverse engineering exercise</a> — Awesome idea. In order to learn concepts, the author decided to reverse engineer a NYT game, and he documented the process and what he understood. 
I find this exercise super insightful and I'd love to do something similar.</li><li><a href="https://listed.to/@mattcarter/51660/initial-thoughts-on-sqlmesh?ref=blef.fr">Initial thoughts on SQLMesh</a> — A post describing the key concepts of SQLMesh (especially around envs, plans and projects); this is a great introduction. Last week the SQLMesh team also released features around <a href="https://sqlmesh.readthedocs.io/en/stable/concepts/tests/?ref=blef.fr">testing</a>: similar to dbt unit tests, you can define inputs and outputs to test your models.</li><li><a href="https://dlthub.com/docs/blog/rest-api-source-client?ref=blef.fr">dltHub REST API source toolkit</a> — dlt released a toolkit to build extract-and-load pipelines on top of custom APIs. With the toolkit you can declare your endpoints, resources and auth, and then you'll be able to extract and load your data.</li><li><a href="https://cube.dev/blog/a-practical-guide-to-getting-started-with-cubes-ai-api?ref=blef.fr">Cube releases their AI API</a> — Now you can query your semantic layer in natural language and get answers (it uses OpenAI). This is close to what <a href="https://youtu.be/BUYrm_O0vFk?t=2182&ref=blef.fr">I had demoed last year in a talk</a>.</li><li><a href="https://motherduck.com/product/pricing/?ref=blef.fr">MotherDuck pricing page</a> — Great pricing page; competitors should take inspiration from it. 
It's fun to play with it to see how many hundreds of thousands of dollars you would have spent.</li><li><a href="https://www.uber.com/en-DE/blog/auto-categorizing-data-through-ai-ml/?ref=blef.fr">Uber, auto-categorizing an exabyte of data at field level through AI/ML</a> — Reminds me of the <a href="https://blog.sdf.com/p/automating-data-classification-for?ref=blef.fr">SDF article</a> about end-to-end classification of your data models, but at Uber scale.</li><li><a href="https://posit-dev.github.io/great-tables/articles/intro.html?ref=blef.fr">great_tables</a> — A great tool to create nice-looking tables in Python on top of your dataframes.</li></ul><p><strong>Food for thought to end (because it's already too long)</strong></p><ul><li><a href="https://dlthub.com/docs/blog/on-orchestrators?ref=blef.fr">On Orchestrators: you are all right, but you are all wrong too</a></li><li><a href="https://omni.co/blog/do-you-model-in-dbt-or-bi?ref=blef.fr">Do you model in dbt or BI?</a></li><li><a href="https://slack.engineering/how-women-lead-data-engineering-at-slack/?ref=blef.fr">How women lead data engineering at Slack</a></li><li><a href="https://glossgenius.com/blog/how-we-migrated-from-dbt-cloud-and-scaled-our-data-development?ref=blef.fr">How we migrated from dbt Cloud</a></li><li><a href="https://eng.lyft.com/technical-learning-at-lyft-build-a-strong-data-science-team-a6628215513c?ref=blef.fr">Lyft, build a strong data science team</a></li></ul><hr><p>See you next week for the anniversary 🎂</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ How to build a data team ]]></title>
                    <description><![CDATA[ This article will give you a list of the top resources to follow when building a data team. ]]></description>
                    <link><![CDATA[ /how-to-build-a-data-team/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 66340334f16ccc00018beaf7 ]]></guid>
                    <pubDate><![CDATA[ 2024-05-03 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/photo-1522071820081-009f0129c71c?q=80&amp;w=1000&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D" class="kg-image" alt="group of people using laptop computer" loading="lazy" width="1000" height="667" srcset="https://images.unsplash.com/photo-1522071820081-009f0129c71c?q=80&amp;w=600&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D 600w, https://images.unsplash.com/photo-1522071820081-009f0129c71c?q=80&amp;w=1000&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D 1000w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">And it's a team... (</span><a href="https://unsplash.com/photos/group-of-people-using-laptop-computer-QckxruozjRg?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p>Hey, new Friday and a special Data News this week. This week has been pretty packed in terms of work for me, so here's a joker as the weekly newsletter. This is a compilation of great resources about building a data team.</p><p>This is a collection I created while working on a talk called "<a href="https://docs.google.com/presentation/d/1hTqtvGOoVyJ7whYpQ2jRLFLJliJHwuC473xo0iI0Ons/edit?usp=sharing&ref=blef.fr">How to build a data dream team</a>" that I gave last year. All the articles are different and give a broad spectrum of perspectives on creating a data team.</p><p>In my experience, building a data team is a mixture of everything; there's no single recipe, but for it to work you need to adapt the technology you choose to the people you have.
These days, it's very easy to technically build a data platform, but building a data team goes further than that: it's about processes, communication and prioritisation, how to build trust with stakeholders, etc.</p><p><strong>10 great resources to build a data team</strong></p><ul><li><a href="https://www.castordoc.com/blog/how-to-build-your-data-team?ref=blef.fr">How to build your data team?</a> — This article from the Castor team brings all the vocabulary needed. It explains the different models (centralised, embedded or federated) and the pros and cons of each. It also covers the topics of team size and roles.</li><li><a href="https://www.secoda.co/blog/net-promoter-score-for-data-teams?ref=blef.fr">Net promoter score for data teams</a> — A very important topic, I guess. A reminder: one of the most common data team missions is to empower stakeholders. So face the truth and compute an NPS to learn what your stakeholders think of you.</li><li><a href="https://medium.com/alan/vision-for-a-data-team-2eae845b8052?ref=blef.fr">Vision for a data team</a> — Probably the most pragmatic one, full of handy advice. This blog from the Alan data team explains what a data team should do.</li><li><a href="https://erikbern.com/2021/07/07/the-data-team-a-short-story.html?ref=blef.fr">Building a data team at mid-stage startup: a short story</a> — A view of the whole journey your data team will go through in a startup, from the first day to one year later.</li><li><a href="https://about.gitlab.com/handbook/business-technology/data-team/how-we-work/?ref=blef.fr">Data team, how we work</a> — The GitLab handbook, a big bible of resources when it comes to data. Everything is detailed: how they work, how they triage, how they prioritise, etc.</li><li><a href="https://www.typeform.com/blog/inside-story/data-team/?ref=blef.fr">How Typeform built a data team in under 6 months</a> — 5 key insights and 7 top pieces of advice about what to do.
</li><li><a href="https://medium.com/younited-tech-blog/data-organisation-why-are-there-so-many-roles-9c3992d0a436?ref=blef.fr">Data organisation: why are there so many roles?</a> — A great guide to the roles and responsibilities of people in a data team.</li><li><a href="https://mitsloan.mit.edu/ideas-made-to-matter/how-to-build-a-data-analytics-dream-team?ref=blef.fr">How to build a data analytics dream team</a> — Goes a bit further than the previous article, opening up to new (weird) roles.</li><li><a href="https://locallyoptimistic.com/post/the-next-big-challenge-for-data-is-organizational/?ref=blef.fr">The next big challenge for data is organisational</a> — Yes, technically this is just about alignment. The rest is human collaboration and change management, which is quite hard.</li><li><a href="https://www.getdbt.com/data-teams/?ref=blef.fr">Building a data team, dbt recommendation guide</a> — dbt Labs wrote a great guide about building a data team.</li></ul><hr><p>Sorry about the irregularity of the Data News lately, I promise next week I'll be back ❤️.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 24.16 ]]></title>
                    <description><![CDATA[ Data News #24.16 — Llama the Third, Mistral probable $5B valuation, structured Gen AI, principal engineers, big data scale to count billions and benchmarks. ]]></description>
                    <link><![CDATA[ /data-news-week-24-16/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 662127a11d5eca000181b599 ]]></guid>
                    <pubDate><![CDATA[ 2024-04-19 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/photo-1636371449439-e19a1b5a25b2?q=80&amp;w=1000&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D" class="kg-image" alt="a couple of llamas are standing in a field" loading="lazy" width="1000" height="667" srcset="https://images.unsplash.com/photo-1636371449439-e19a1b5a25b2?q=80&amp;w=600&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D 600w, https://images.unsplash.com/photo-1636371449439-e19a1b5a25b2?q=80&amp;w=1000&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D 1000w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">easy (</span><a href="https://unsplash.com/photos/a-couple-of-llamas-are-standing-in-a-field-NJfWUwyUI5M?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p>Hey, new Friday, new Data News. This week, I feel like the selection is smaller than usual, so enjoy the links. I'm a bit late with the Recommendations emails; I'm sorry about that, I got a few new leads as a freelancer that I had to prioritise, which changed my schedule a bit. But don't worry, they'll be out soon.</p><p></p><h1 id="ai-news-%F0%9F%A4%96">AI News 🤖</h1><p><em>When do models get the same hype as the 2007 iPhone release? I did not get the memo.</em></p><ul><li><a href="https://ai.meta.com/blog/meta-llama-3/?ref=blef.fr">Meta releases Llama 3</a> — After last week's <a href="https://mistral.ai/news/mixtral-8x22b/?ref=blef.fr">new Mistral models</a>, this week ends with the new Meta open-source models. Llama the Third is online.
One thing to note is that the model created more hype than <a href="https://about.fb.com/news/2024/04/meta-ai-assistant-built-with-llama-3/?ref=blef.fr">Meta AI</a>, the new ChatGPT competitor run by Zuck's company, which is <a href="https://fortune.com/2024/04/18/meta-ai-llama-3-open-source-ai-increasing-competition/?ref=blef.fr">going all-in on his AI vision</a>. It shows something has changed now that generative models have reached massive adoption: in my bubble at least, people care more about a new model than about an assistant available across the Meta ecosystem (Insta, WhatsApp, Facebook and <a href="https://www.meta.ai/?ref=blef.fr">more</a>). Until we reach model fatigue, the hype is real.<br><br>Personally, I can't comment on the performance of the models; it's like comparing the performance of two cars, as long as I can drive, it's fine. You can try Llama 3 on <a href="https://modal.chat/?ref=blef.fr">Modal</a> or <a href="https://huggingface.co/chat/?ref=blef.fr">HuggingChat</a>.<br><br>To go further you can read this <a href="https://twitter.com/karpathy/status/1781028605709234613?ref=blef.fr">excellent analysis</a> on Twitter or the <a href="https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md?ref=blef.fr">model card</a>—they even give the estimated tCO2eq emitted during the training phase.
In a nutshell it says:<ul><li>Llama is available in 8B and 70B; a 400B version is coming once training is completed—and it approaches GPT-4 performance.</li><li>Llama has a larger tokeniser and the context window grew to 8192 input tokens.</li><li>It was trained on a large dataset containing 15T tokens (compared to 2T for Llama 2).</li></ul></li><li><a href="https://www.theinformation.com/articles/mistral-an-openai-rival-in-europe-in-talks-to-raise-capital-at-a-5-billion-valuation?ref=blef.fr">Mistral wants to raise again at a $5B valuation.</a></li><li><a href="https://www.microsoft.com/en-us/research/project/vasa-1/?ref=blef.fr">Microsoft VASA-1</a> — Microsoft published a paper about a model generating talking avatars from an image and an audio clip. This is quite impressive. They did not release the code, so I tried the closest open-source solution, called <a href="https://github.com/OpenTalker/SadTalker?ref=blef.fr">SadTalker</a>, <a href="https://drive.google.com/file/d/1qUeK1V7mj3CELW8LAistyVgzllCdFzWH/view?usp=drive_link&ref=blef.fr">on</a> <a href="https://drive.google.com/file/d/1IZlEPVdl7vzJ7i64XfhpwcwU-VVYKEYm/view?usp=sharing&ref=blef.fr">me</a>. It is a bit creepy, but impressive given the low quality of my inputs.</li><li><a href="https://towardsdatascience.com/structured-generative-ai-e772123428e4?ref=blef.fr">Structured generative AI</a> — Oren explains how you can constrain generative algorithms to produce structured outputs (like JSON or SQL—seen as an AST). This is super interesting because it details important steps of the generative process.</li><li><a href="https://towardsdatascience.com/evaluate-anything-you-want-creating-advanced-evaluators-with-llms-e2d540af6090?ref=blef.fr">Evaluate anything you want with LLMs</a> — I really like how LLMs can be used for tasks that are not the ones we first think of.
This blog shows how you can use Gen AI to evaluate inputs like translations, with reasons attached.</li><li><a href="https://slack.engineering/how-we-built-slack-ai-to-be-secure-and-private/?ref=blef.fr">How we built Slack AI to be secure and private</a> — How Slack uses a VPC and Amazon SageMaker to keep your data secure and private.</li><li><a href="https://twitter.com/OpenAIDevs/status/1779922566091522492?ref=blef.fr">OpenAI batches</a> — OpenAI opened a new API endpoint to batch requests.</li></ul><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://nealabbott.files.wordpress.com/2020/06/theseus.jpg?w=500" class="kg-image" alt="theseus" loading="lazy" width="500" height="400"><figcaption><span style="white-space: pre-wrap;">Theseus against really big data (</span><a href="https://www.tes.com/teaching-resource/theseus-and-the-minotaur-12477053?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><ul><li><a href="https://blog.alexewerlof.com/p/principal-engineer?ref=blef.fr">Principal Engineer</a> — Although staff and principal roles have been on the career ladder for a long time, there are very few articles on what it takes to become one of the greats. This article covers the whole ladder and the mix of skills needed to reach the top: hard, soft and business skills.</li><li><a href="https://www.junaideffendi.com/p/data-pipeline-incremental-vs-full?ref=blef.fr">Data pipeline, incremental vs. full load</a> — A comprehensive comparison between the two ingestion modes, with a decision tree about which one to pick.</li><li>❤️ <a href="https://www.canva.dev/blog/engineering/scaling-to-count-billions/?ref=blef.fr">Scaling to count billions</a> — An awesome retrospective of the Canva OLAP architecture used to count marketplace usage, from MySQL to Snowflake + buckets.
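A counting system like Canva's boils down to collect, deduplicate, then aggregate; as a toy, stdlib-only sketch of the dedupe-and-count step (the event shape and names here are hypothetical, not Canva's actual schema):

```python
from collections import Counter

# Hypothetical usage events; "e1" arrives twice because delivery
# is typically at-least-once, so we track seen event ids.
events = [
    {"event_id": "e1", "item": "template_a"},
    {"event_id": "e2", "item": "template_b"},
    {"event_id": "e1", "item": "template_a"},  # duplicate delivery
]

seen = set()
counts = Counter()
for e in events:
    if e["event_id"] not in seen:   # deduplication
        seen.add(e["event_id"])
        counts[e["item"]] += 1      # aggregation

print(counts["template_a"], counts["template_b"])  # 1 1
```

At scale the "seen" set becomes a dedup table or window in the warehouse, but the logic stays the same.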
It nicely breaks down every important part of an OLAP platform: collection, deduplication and aggregation.</li><li><a href="https://voltrondata.com/benchmarks/theseus?ref=blef.fr">Spark (and Theseus) on GPUs benchmark</a> — A detailed benchmark by Voltron Data about running Spark and Theseus (their GPU data processing engine) workloads on GPUs. It's crazy how much Theseus outperforms Spark. The conclusion reads like a great summary to me:<ul><li>For less than 2TBs &gt; use DuckDB, Polars, DataFusion or Arrow backed projects.</li><li>Up to 30TBs &gt; Cloud warehouse or Spark</li><li>Over 30TBs &gt; Go Theseus. [Theseus]&nbsp;"<em>prefer to operate when queries exceed 100TBs"</em>. 😅</li></ul></li><li><a href="https://pola.rs/posts/benchmarks/?ref=blef.fr">Polars new benchmarks</a> — Polars released new benchmarks on the TPC-H dataset. Polars and DuckDB are the cool kids, and the benchmarks show you should stop using pandas and switch to Polars for a 10x performance gain.</li><li><a href="https://www.hydra.so/blog-posts/2022-03-21-announcing-hydra-postgres-data-warehouse?ref=blef.fr">Hydra: the Postgres data warehouse</a> — Postgres is one of the most used databases; this week I discovered Hydra, an open-source columnar port of Postgres aiming to create an open-source Snowflake. One to watch.</li><li><a href="https://neon.tech/blog/neon-ga?ref=blef.fr">Neon GA</a> — Neon, another Postgres fork, is generally available.
Neon wants to provide a serverless, autoscaling Postgres for devs.</li><li><a href="https://kestra.io/blogs/2024-18-04-clever-cloud-use-case?ref=blef.fr">Clever Cloud offloading 20TB every month</a> — Kestra showcases how one of their clients uses the declarative orchestrator to offload TBs of data every month.</li><li><a href="https://medium.com/criteo-engineering/kubecon-cloudnativecon-europe24-notes-d8d9f4d77c6d?ref=blef.fr">KubeCon + CloudNativeCon Europe’24 notes</a> — A few notes from the big 2024 Kube mass.</li><li><a href="https://smallbigdata.substack.com/p/is-sqlmesh-the-dbt-core-20-a-feet?ref=blef.fr">Is SQLMesh the dbt Core 2.0</a>? — A great blog answering a great question. SQLMesh is bringing fresh ideas to the SQL transformation landscape. The post covers a lot of topics and explains the conceptual similarities between the two tools.</li><li><a href="https://github.com/gwenwindflower/tbd?ref=blef.fr">gwenwindflower/tbd</a> — A code generator for dbt. Winnie developed a great tool to save time documenting your dbt projects using Gen AI models.</li><li><a href="https://www.snowflake.com/blog/introducing-snowflake-arctic-embed-snowflakes-state-of-the-art-text-embedding-family-of-models/?ref=blef.fr">Snowflake text embeddings for retrieval</a>.</li><li><a href="https://xebia.com/blog/distributed-dashboarding-with-duckdb-wasm/?ref=blef.fr">Distributed dashboarding with DuckDB WASM</a> — Ramon put words to ideas I have had in my mind for months: <strong>distributed dashboarding</strong>.
I really buy into this concept, especially with DuckDB WASM and what it unlocks in terms of autonomy and privacy for users.</li><li><a href="https://juhache.substack.com/p/write-audit-publish-wap-pattern?ref=blef.fr">WAP with dbt, Iceberg or Nessie</a> — Julien showcases how you can achieve the WAP pattern with different technologies.</li><li><a href="https://csv-to-db-six.vercel.app/?ref=blef.fr">CSV to DB</a> — What if you could open a CSV and re-order or rename the columns directly in the browser? Without any backend call—with DuckDB, obviously. s/o to Théodore, a Data News subscriber, who developed this. This is a great idea.</li></ul><hr><p>See you next week ❤️.</p><p></p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 24.15 ]]></title>
                    <description><![CDATA[ Data News #24.15 — MDSFest quick recap, LLM news, Airbnb Chronon, AST, Beam YAML, WAP and more. ]]></description>
                    <link><![CDATA[ /data-news-week-24-15/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 661906df338923000186a0e9 ]]></guid>
                    <pubDate><![CDATA[ 2024-04-12 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/photo-1450044804117-534ccd6e6a3a?q=80&amp;w=1000&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D" class="kg-image" alt="crowd of people at concert" loading="lazy" width="1000" height="750" srcset="https://images.unsplash.com/photo-1450044804117-534ccd6e6a3a?q=80&amp;w=600&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D 600w, https://images.unsplash.com/photo-1450044804117-534ccd6e6a3a?q=80&amp;w=1000&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D 1000w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">The fest we deserve (</span><a href="https://unsplash.com/photos/crowd-of-people-at-concert-rdmJc2Os4EM?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p>I hope this Data News finds you well. In today's edition we have a large selection of links; I think you will enjoy it.</p><p>But first I want to welcome all the new members joining this week after my new <a href="https://youtu.be/LNC0SbHknxw?si=RZ2IRcCTIl0DMog7&ref=blef.fr">episode on DataGen</a> with Robin Conquet. This episode is in French and we mainly talked about the possible end of the modern data stack, which I had already condensed in <a href="https://www.blef.fr/modern-data-stack-disappearing/">a post a few weeks ago</a> (in English).</p><p></p><h1 id="mds-fest-%F0%9F%A5%B3">MDS Fest 🥳</h1><p>As announced last week, I participated in MDS Fest 2.0 this Thursday.
I shared my journey with Apache Superset and why I consider Superset the best open-source alternative when it comes to building BI applications.</p><p>Yes, because you should <strong>stop building dashboards and build BI apps instead</strong>. This is part of the productisation of data, but mainly I think you should consider your BI tool as a way for your users to interact with data, not only to monitor metrics. With the customisation it allows, Superset is the best tool for it.</p><p>You can have a look at my <a href="https://docs.google.com/presentation/d/1GaIN0p6msfYm3ZzwPoV6q4HqARXi0003AxVyxqDs_jU/edit?usp=sharing&ref=blef.fr">slides</a> or watch the <a href="https://www.youtube.com/watch?v=3BQBnE8jYsI&ref=blef.fr">replay on YouTube</a>.</p><p>A lot of other talks took place at the same conference; here is a small selection you should check out:</p><ul><li><a href="https://www.youtube.com/watch?v=fetXTKA1U9o&ref=blef.fr">How to pivot your data team from a service team to a value-generator</a> — Very often data teams struggle to deliver value or to find their real identity. Taylor identified patterns and gives great advice to help you find yours.</li><li><a href="https://www.youtube.com/watch?v=bQJ3wMqJB0M&ref=blef.fr">Data contracts: federated data governance</a> — Another talk by Chad about data contracts, always on point in describing the pains around the "data supply chain".</li><li><a href="https://www.youtube.com/watch?v=L0M_RWSp4RE&ref=blef.fr">Deliver reporting in pure SQL with dbt + Evidence</a> — A great showcase of what you can build with Evidence (a BI-as-code solution).</li><li><a href="https://www.youtube.com/watch?v=c7XvlQ3s5Yg&ref=blef.fr">Build analytics at Hive.co</a> — The journey Oleg and his team went through to implement a modern data stack.
They used RFCs to document where they were heading.</li></ul><p><em>PS: </em><a href="https://preset.io/blog/apache-superset-4-0-release-notes/?ref=blef.fr"><em>Apache Superset hit 4.0</em></a><em> this week with a lot of new features.</em></p><p></p><h1 id="ai-news-%F0%9F%A4%96">AI News 🤖</h1><ul><li><a href="https://blog.siemens.com/2024/04/open-source-llms-for-everyone/?ref=blef.fr">Open-source LLMs for everyone</a> — A great post from the Siemens AI team about open LLM initiatives that bring new usages to the dev workflow, whether code completion or pull request / crash report summarisation; it looks neat.</li><li><a href="https://blog.replit.com/code-repair?ref=blef.fr">Building LLMs for code repair</a> —&nbsp;Replit is an AI-driven workspace for developers (think of a supercharged IDE). They wrote a blog about what they developed to create LLM-driven fix suggestions for LSP (<a href="https://en.wikipedia.org/wiki/Language_Server_Protocol?ref=blef.fr">Language Server Protocol</a>), a protocol between your IDE and a server that understands and analyses the code to find errors or highlight the code.</li><li><a href="https://huggingface.co/spaces/lhoestq/LLM_DataGen?ref=blef.fr">LLM DataGen</a> — A small demo of an LLM based on Gemma that generates JSONL from a given name.
It doesn't work super well, and it would be better if we could specify the column names and types, for instance, but it showcases another great usage of generative algorithms.</li><li><a href="https://www.metacareers.com/life/behind-gen-ai-building-an-infrastructure-for-the-future?ref=blef.fr">Meta, building an infrastructure for the future</a> — It explains how Meta is partnering with GPU vendors to design new chips, and how incredibly hard it is to connect thousands of GPUs in a cluster where everything can fail at any moment.</li><li><a href="https://twitter.com/deedydas/status/1778621375592485076?ref=blef.fr">Can Gemini 1.5 actually read all the Harry Potter books at once?</a> —&nbsp;A nice Graphviz chart spotted on Twitter mapping all the Harry Potter relationships in a poster, done by Gemini with the content of all the books. Obviously Gemini already knows some of the Hogwarts lore from its training, but this is still impressive. Sadly we don't have the complete prompt / code.</li><li>Speaking of prompts, PromptLayer organised a tournament and blogged about their <a href="https://blog.promptlayer.com/our-favorite-prompts-from-the-tournament-b9d99464c1dc?ref=blef.fr">favourite prompts of the competition</a>.
Once again, speaking to an LLM is like speaking to children: USE CAPITAL LETTERS TO CAPTURE THEIR ATTENTION.</li><li><a href="https://github.com/openai/simple-evals?ref=blef.fr">OpenAI open-sourced a light library to evaluate language models</a>&nbsp;— you can use 7 different evals and check the results on OpenAI or Claude models.</li><li><a href="https://huggingface.co/mistral-community/Mixtral-8x22B-v0.1?ref=blef.fr">Mixtral-8x22B is out</a> — a new model that does something probably awesome.</li></ul><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li>Last week I forgot to share the <a href="https://www.getdbt.com/resources/reports/state-of-analytics-engineering-2024?ref=blef.fr">2024 state of analytics engineering</a> by dbt Labs. At a glance it depicts well the trends I also see in my local market: in 2024, more than 50% of data practitioners' time is spent maintaining or organising data assets.</li><li><a href="https://beam.incubator.apache.org/blog/beam-yaml-release/?ref=blef.fr">Introducing Beam YAML</a> — Apache Beam is a unified processing framework (meaning it unifies streaming and batch) that runs on many different engines. Today they introduce Beam YAML, a way to write pipelines declaratively. Reminds me of <a href="https://kestra.io/?ref=blef.fr">Kestra</a> so much.</li><li><a href="https://tobikodata.com/ast_journey.html?ref=blef.fr">How I became an AST convert</a> — I've been an AST convert for a long time, and I'm so happy someone wrote about this. AST stands for abstract syntax tree: an abstract representation of a program in a given language, and this is what SQLGlot (hence SQLMesh) builds for SQL. Afzal explains in this blog what that means, especially in a diff context.
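To make the AST idea concrete, here is a tiny sketch with Python's built-in ast module, as an analogy for what SQLGlot does for SQL: two cosmetically different spellings of the same statement parse to the same tree, which is why diffing at the AST level ignores formatting noise.

```python
import ast

# Two cosmetically different spellings of the same statement.
a = ast.parse("x = 1 + 2")
b = ast.parse("x   =   (1 + 2)")  # extra whitespace and parentheses

# Whitespace and redundant parentheses don't exist in the tree,
# so the dumped structures compare equal.
print(ast.dump(a) == ast.dump(b))  # True
```

Swap ast for SQLGlot's SQL parser and you get structural diffs of SQL models, the mechanism behind SQLMesh's change detection.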
<br><br>You can also listen to Toby's interview on Joe's podcast about <a href="https://podcasters.spotify.com/pod/show/joereis/episodes/Toby-Mao---SQLMesh--Simplifying-Data-Transformations--and-more-e2ht7mt?ref=blef.fr">SQLMesh and SQL transformations</a>.</li><li><a href="https://github.com/airbnb/chronon?ref=blef.fr">Airbnb open-sources Chronon</a> — A data platform for serving features to AI/ML applications. In Chronon you define <a href="https://chronon.ai/getting_started/Introduction.html?ref=blef.fr#example">sources and GroupBys</a>—collections of aggregations on keys—which in the end represent features, and the platform handles the downstream management.</li><li><a href="https://cloud.google.com/bigquery/docs/data-canvas?ref=blef.fr">BigQuery releases data canvas</a> — This is a large open canvas (like <a href="https://count.co/?ref=blef.fr">count.co</a>) in which you can write SQL queries, assisted by Gemini, and link them in a DAG fashion.</li><li><a href="https://uncledata.substack.com/p/write-audit-publish-pattern-in-modern?ref=blef.fr">Write-Audit-Publish pattern in modern data pipelines</a> — A pattern worth knowing better because it can prevent your pipelines from pushing wrong data into your users' tools.</li><li><a href="https://preset.io/blog/exploring-the-dbt-cloud-semantic-layer-in-preset/?ref=blef.fr">Using Preset (Superset) to explore the dbt Cloud semantic layer</a> — You can configure a sync between the two clouds or use a CLI; then you will be able to explore metrics in your BI tool.</li><li><a href="https://thorben-janssen.com/book-review-duckdb-in-action/?ref=blef.fr">Book review of DuckDB in Action</a>.</li><li><a href="https://youtu.be/YrqSp8m7fmk?si=6UFP0F034DNbEt4c&ref=blef.fr">Efficient CSV parsing</a> — A YouTube talk about the DuckDB CSV parser and what it means to parse unstructured files.</li><li>To conclude this edition, two project walkthroughs:<ul><li><a href="https://blog.dagworks.io/p/slack-summary-pipeline-with-dlt-ibis?ref=blef.fr">Slack summary pipeline with dlt, Ibis, and Hamilton</a>.</li><li><a href="https://www.linkedin.com/pulse/local-pipeline-development-sqlmesh-airflow-postgres-alexis-chicoine-oiyte/?trackingId=pvbvtn8DSUmPiHikq660Yw%3D%3D&ref=blef.fr">Local pipeline development with SQLMesh, Airflow, and Postgres</a>.</li></ul></li></ul><p></p><p><em>✨ s/o to Hugo who runs a weekly data round-up; he published before me this week, so you can also check out his </em><a href="https://orchestra.substack.com/p/roundup-29-we-15-april-2024?ref=blef.fr"><em>great link selection</em></a><em>.</em></p><hr><p>See you next week ❤️</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 24.14 ]]></title>
                    <description><![CDATA[ Data News #24.14 — New MAD landscape, polars on GPU, git in Snowflake, open data portals and more. ]]></description>
                    <link><![CDATA[ /data-news-week-24-14/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 660a785f1f175400014d011c ]]></guid>
                    <pubDate><![CDATA[ 2024-04-05 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/photo-1524311583145-d5593bd3502a?q=80&amp;w=1000&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D" class="kg-image" alt="person carrying backpack inside library" loading="lazy" width="1000" height="667" srcset="https://images.unsplash.com/photo-1524311583145-d5593bd3502a?q=80&amp;w=600&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D 600w, https://images.unsplash.com/photo-1524311583145-d5593bd3502a?q=80&amp;w=1000&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D 1000w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Lost between ideas (</span><a href="https://unsplash.com/photos/person-carrying-backpack-inside-library-W_ZYCEUapF0?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p>Hey, new Data News edition. I hope you will enjoy this week's selection after skipping last week's. I was a bit overwhelmed with the amount of tasks on my desk (and I still am). But here we are.</p><p>Before jumping to the news, I want to let you know that I have improved the <a href="https://www.blef.fr/explorer/reco/">Recommendations</a> page, and the weekly emails with the recommendations should arrive soon. The new page better supports mobile and gives you GPT-4-generated titles and overviews of the links.</p><p>I'll speak at the&nbsp;<a href="https://www.mdsfest.com/?ref=blef.fr">MDS Fest 2.0</a> next week on April 10. MDS Fest is a free virtual 5-day conference about Modern Data Stack topics with a lot of awesome speakers; there are a few talks I can't wait to watch.
On my side I'll talk about Apache Superset and what you can do to build a complete application with it.</p><figure class="kg-card kg-image-card"><img src="https://www.blef.fr/content/images/2024/04/frame_80424.jpg" class="kg-image" alt="" loading="lazy" width="2000" height="433" srcset="https://www.blef.fr/content/images/size/w600/2024/04/frame_80424.jpg 600w, https://www.blef.fr/content/images/size/w1000/2024/04/frame_80424.jpg 1000w, https://www.blef.fr/content/images/size/w1600/2024/04/frame_80424.jpg 1600w, https://www.blef.fr/content/images/2024/04/frame_80424.jpg 2000w" sizes="(min-width: 720px) 720px"></figure><p></p><h1 id="ai-news-%F0%9F%A4%96">AI News 🤖</h1><ul><li><a href="https://docs.google.com/presentation/d/1dbfoxzNcoI-D45RKZfO1UfBJIr4v0YtHhj1cwuCj020/edit?ref=blef.fr#slide=id.p">LlamaIndex slides, examples with Mistral AI</a> — A few slides with a lot of examples of how you can use LlamaIndex with Mistral AI models. I guess there is a video associated with the slides, but I don't have it. It shows a few RAGs, agents and document parsers to retrieve the data you need.</li><li><a href="https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm?ref=blef.fr">DBRX, a new state-of-the-art open LLM</a> — Databricks has to be an AI company (<a href="https://twitter.com/arny_trezzi/status/1775972218716995776?ref=blef.fr">bragging vs. Snowflake</a>). This week they released a new open model that performs great.</li><li><a href="https://medium.com/pinterest-engineering/how-we-built-text-to-sql-at-pinterest-30bad30dabff?ref=blef.fr">How we built Text-to-SQL at Pinterest</a> — Pinterest open-sourced a tool called Querybook that they use to access Pinterest data every day. In order to boost usage they developed a text-to-SQL feature.
This article explains in detail how they did it.</li></ul><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li><a href="https://mattturck.com/mad2024/?ref=blef.fr">MAD 2024 landscape</a> — The new edition of the Machine learning, AI and Data Landscape is out, with many logos and obvious changes since last year given the GenAI hype. I haven't analysed the new map yet, but I'll try to do it soon.</li><li><a href="https://pola.rs/posts/polars-on-gpu/?ref=blef.fr">Polars on GPU</a> — Polars announced a collaboration with RAPIDS to bring GPU performance to Polars and push it to another summit. Being able to switch Polars engines like this looks cool. If you are using Polars, reach out to me—I'm curious to know how people are using it.</li><li><a href="https://medium.com/snowflake/connect-git-to-snowflake-now-in-public-preview-25b0456ce02c?ref=blef.fr">Git in Snowflake</a> — Snowflake is getting more and more features month after month, becoming a complete suite of applications reachable directly from SQL in your warehouse. It reminds me of Oracle and I don't like this centralisation, but the future always goes the bundling way. Now you can read a Git repository when creating a procedure in SQL.</li><li><a href="https://www.pgrs.net/2024/03/21/duckdb-as-the-new-jq/?ref=blef.fr">DuckDB is the new jq</a> — The author shows how you can manipulate a JSON file with a DuckDB one-liner. I really like this take; it gives a great perspective on DuckDB and how you can use it locally for fast manipulation. But contrary to jq, which has a non-trivial syntax, DuckDB is SQL.</li><li><a href="https://www.linkedin.com/posts/lakehouse_deltalake-apacheiceberg-apachehudi-activity-7179573552331837441-IBYH/?ref=blef.fr">Survey about query engines used by companies</a> — Data Council happened recently. 
It's a US-based conference that I really like because the talks and ideas discussed there often shape what we do in the data industry—at least from what I see in the YouTube videos; I've never been there myself. A speaker ran a survey during his keynote about the query engines used by the audience, and Spark still leads ahead of BigQuery/Snowflake/Athena.</li><li><a href="https://clickhouse.com/blog/building-a-logging-platform-with-clickhouse-and-saving-millions-over-datadog?ref=blef.fr" rel="noreferrer">How we built a 19 PiB logging platform with ClickHouse</a> — ClickHouse is a tech company, and you can see it in the blog post. They explain in depth why they chose ClickHouse to monitor their ClickHouse Cloud offering, saving money on their Datadog bill.</li><li>❤️ <a href="https://davidgasquez.com/modern-open-data-portals/?ref=blef.fr">Building open data portals in 2024</a> — David open-sourced an end-to-end framework to build open data portals. This is awesome (<a href="https://filecoindataportal.davidgasquez.com/?ref=blef.fr">example</a>): you can easily ingest, transform and share data. It looks like yato but with many more features pieced together to create a local-first data platform.</li><li><a href="https://engineering.atspotify.com/2024/04/data-platform-explained/?ref=blef.fr">Spotify, data platform explained</a> — The beginning of a series explaining the Spotify data platform.</li><li><a href="https://towardsdatascience.com/navigating-your-data-platforms-growing-pains-a-path-from-data-mess-to-data-mesh-c16df72f5463?ref=blef.fr">A path from data mess to data mesh</a> — 5 key principles you should apply to avoid the data mess.</li><li><a href="https://davidsj.substack.com/p/semantic-layers-a-buyers-guide?ref=blef.fr">Semantic layers, a buyers guide</a> — This is an exhaustive comparison between dbt Cloud's metrics offering and Cube. 
In a nutshell, I'd say that both technologies are not yet mature, with a slight advantage to Cube for being open.</li><li><a href="https://fromanengineersight.substack.com/p/the-data-analyst-every-ceo-wants?ref=blef.fr">The data analyst every CEO wants</a> — I really like this blog from Benoit; he gives practical advice about what to focus on if you're working as a data analyst for the C-level of your company. </li><li><a href="https://blog.picnic.nl/yaml-developers-and-the-declarative-data-platforms-4719b7a1311c?ref=blef.fr">YAML developers and the declarative data platforms</a> — A good introduction to why declarative languages are perfect for creating data platforms. To be honest, I think this is a topic that separates good data engineers from great ones. Creating a declarative data platform is easy, but creating the right level of abstraction that describes reality without creating debt and over-engineered solutions is much harder.</li><li><a href="https://github.com/datarecce/recce?ref=blef.fr">PR review tool for dbt projects</a> — A nice tool creating visual representations comparing 2 dbt artifacts that you can embed in a CI to validate changes before they get merged into production code.</li><li><a href="https://www.linkedin.com/pulse/when-data-model-finished-bill-inmon-cqkhc/?trackingId=QiPwlC8rT5uFGEvndODVVQ%3D%3D&ref=blef.fr">When is the data model finished?</a> — Spoiler: a data model is never finished. A data model needs to depict your company's business and activities; as time goes by, activities grow, and you obviously have to manage this asset over time.</li><li><a href="https://maxhalford.github.io/blog/bike-sharing-forecasting-training-set/?ref=blef.fr">A training set for bike sharing forecasting</a> — Max has created a large dataset of bike sharing providers in ~50 cities around the world. 
If you want to play with DuckDB and visualisations, this is a good start.</li><li><a href="https://www.architecture-performance.fr/ap_blog/calculating-walking-isochrones-with-python/?ref=blef.fr">Calculating walking isochrones in Python</a> —&nbsp;A cool way to produce Python viz.</li></ul><p></p><hr><p>See you next week ❤️</p><p>Don't forget to check out the new Recommendations page (below is an overview of mine).</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2024/04/Frame-22.png" class="kg-image" alt="" loading="lazy" width="2000" height="1047" srcset="https://www.blef.fr/content/images/size/w600/2024/04/Frame-22.png 600w, https://www.blef.fr/content/images/size/w1000/2024/04/Frame-22.png 1000w, https://www.blef.fr/content/images/size/w1600/2024/04/Frame-22.png 1600w, https://www.blef.fr/content/images/2024/04/Frame-22.png 2000w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Overview of Recommendations and email (resp. left and right)</span></figcaption></figure> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 24.12 ]]></title>
                    <description><![CDATA[ Data News #24.12 — My Friday routine, the 01 interpreter, RAG, xAI Grok-1, Apple entering the course, run Spark in BigQuery, Williams F1 using Excel BigData (lol) and more. ]]></description>
                    <link><![CDATA[ /data-news-week-24-12/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 65fd79b0f7bf6400015e2ec0 ]]></guid>
                    <pubDate><![CDATA[ 2024-03-22 ]]></pubDate>
                    <content>
                        <![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/photo-1581269632459-409ff09f73de?q=80&amp;w=1000&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D" class="kg-image" alt="woman in white t-shirt holding black ceramic mug" loading="lazy" width="1000" height="667" srcset="https://images.unsplash.com/photo-1581269632459-409ff09f73de?q=80&amp;w=600&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D 600w, https://images.unsplash.com/photo-1581269632459-409ff09f73de?q=80&amp;w=1000&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D 1000w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Friday routine (</span><a href="https://unsplash.com/photos/woman-in-white-t-shirt-holding-black-ceramic-mug-KF96lDEvqwY?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p>It's Friday and it's Data News. I don't usually go into much detail about the magic behind Data News, but every Friday is the same. At first I'm: <em>oh shit, here we go again</em>, and 10 minutes later I'm lost in reading the content and picking too many articles to fit into a thousand-word edition.</p><p>Usually the whole process takes me a full Friday. I organise myself as follows:</p><ul><li><strong>During the whole week I scroll</strong>—too much—LinkedIn. I save posts without reading them. Sometimes I also save stuff on Twitter by liking it. 
The reason I do this is to avoid <a href="https://en.wikipedia.org/wiki/Context_switch?ref=blef.fr">context switching</a>—let's be honest, it works for the DN context, but it does not work in my life in general.</li><li><strong>Exploration, Friday morning</strong><ul><li>I read the last 7 days of 2 Twitter lists (<a href="https://twitter.com/i/lists/1463573327868481540?ref=blef.fr">MDS</a>, <a href="https://twitter.com/i/lists/1484841091828432896?ref=blef.fr">Data voices</a>) and I open interesting stuff in tabs.</li><li>Then I use Feedly, which is connected to ~500 websites, Reddit and Medium, and I open interesting articles in tabs.</li><li>Then I open the items saved from LinkedIn.</li></ul></li><li><strong>Reading and writing, Friday afternoon</strong><ul><li>I read the articles and remove what I find irrelevant (context, values, quality, etc.). I make a first connection between all the links, trying to sort them into a fluid path between the articles' ideas.</li><li>I usually go from ~50 links down to 25 after the reading part.</li><li>I write in one go, from top to bottom.</li></ul></li><li><strong>Publication</strong>—Once the Data News is ready, I just click publish; I don't proofread much (sorry for the typos). I already spend so much time selecting and writing that I can't be stuck in revision mode for long.</li><li><strong>Post-publication</strong>—After publication I do my homework of promoting my own work (mainly on LinkedIn), and I run a few post-publication scripts for the Explorer / Recommendations. I also watch the click / open stats, and that's all. But I think I could do it better.</li></ul><p>The process works well, but as you can see, because I use fresh news, it's just-in-time. Which puts pressure on my Fridays. 
I'd like to have a few articles in stock to remove the pressure of having to write something on certain Fridays and take those off.</p><p><em>❤️ I rarely say it: if Data News helps you save time you should consider taking a </em><a href="https://www.blef.fr/#/portal/signup/60817789b7677e002ff7b655/yearly"><em>paid subscription</em></a><em> (60€/year) to help me cover the blog fees and my writing Fridays.</em></p><p>Just before I jump to the news: I'll speak at the <a href="https://www.mdsfest.com/?ref=blef.fr">MDS Fest 2.0</a> on April 10. MDS Fest is a free five-day virtual conference about Modern Data Stack topics with a lot of awesome speakers; there are a few talks I can't wait to watch. On my side I'll talk about Apache Superset and what you can do to build a complete application with it.</p><p><strong>Ok. Now give me the news.</strong></p><p></p><h1 id="ai-news-%F0%9F%A4%96">AI News 🤖</h1><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2024/03/Screenshot-2024-03-22-at-18.00.56.png" class="kg-image" alt="" loading="lazy" width="2000" height="1469" srcset="https://www.blef.fr/content/images/size/w600/2024/03/Screenshot-2024-03-22-at-18.00.56.png 600w, https://www.blef.fr/content/images/size/w1000/2024/03/Screenshot-2024-03-22-at-18.00.56.png 1000w, https://www.blef.fr/content/images/size/w1600/2024/03/Screenshot-2024-03-22-at-18.00.56.png 1600w, https://www.blef.fr/content/images/size/w2400/2024/03/Screenshot-2024-03-22-at-18.00.56.png 2400w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">The 01 light</span></figcaption></figure><ul><li><a href="https://twitter.com/OpenInterpreter/status/1770821439458840846?ref=blef.fr">01 open interpreter</a> — The 01 light is a small device, operable with your voice, that controls your home computer. I've rarely been amazed by the latest physical AI devices startups have produced, but this one is different. 
It's a small white sphere that understands what you say and then controls your computer's mouse to execute actions for you, whether you're in front of your computer or elsewhere.<br><br>The initiative wants to be open(-source) and they provided the <a href="https://github.com/OpenInterpreter/01?tab=readme-ov-file&ref=blef.fr">code on Github</a>. They actually trained a "computer LLM". And they are the reason this newsletter was late: you can build the physical device yourself with Arduino stuff—<a href="https://github.com/OpenInterpreter/01/blob/main/hardware/light/BOM.md?ref=blef.fr">list of materials</a>—and I wanted to do it today, but a part was not available at the electronics shop 🥲.<br><br>Under the hood it uses a <a href="https://github.com/OpenInterpreter/01/blob/main/software/source/server/system_messages/BaseSystemMessage.py?ref=blef.fr">big prompt</a> to instruct their LLM, because in the end <a href="https://hamel.dev/blog/posts/prompt/?ref=blef.fr">fuck you, show me the prompt</a>. It's always fun to read the prompts companies use to do specific tasks. Sometimes it looks like you're speaking to a child: caps lock and repetition to make the algorithm understand.</li><li><a href="https://github.com/xai-org/grok-1?ref=blef.fr">Finally, xAI released Grok-1 in the open</a> — The weights are available via torrent / HF and everything is under the Apache License. The repo was released last Sunday, after Musk publicly announced the release the week before; I feel bad for the sweaty engineers who worked on it the whole week. 
I haven't seen much feedback on it since.</li><li>Apple is trying to enter the LLM game (<a href="https://www.rfi.fr/en/international/20240318-tech-giants-grilled-on-their-compliance-with-eu-s-new-digital-markets-act?ref=blef.fr">while</a> <a href="https://apnews.com/article/apple-antitrust-monopoly-app-store-justice-department-822d7e8f5cf53a2636795fcc33ee1fc3?ref=blef.fr">facing</a> <a href="https://www.rfi.fr/en/science-and-technology/20240304-apple-faces-%E2%82%AC1-8bn-eu-fine-for-breaking-music-streaming-competition-laws?ref=blef.fr">fines</a>) — Rumours say they will partner with Google to use <a href="https://www.bloomberg.com/news/articles/2024-03-18/apple-in-talks-to-license-google-gemini-for-iphone-ios-18-generative-ai-tools?embedded-checkout=true&ref=blef.fr">Gemini to power iPhone AI features</a>; at the same time they wrote a paper about <a href="https://arxiv.org/abs/2403.09611?ref=blef.fr">MM1, a family of multimodal models up to 30B parameters</a>.</li><li><a href="https://blogs.microsoft.com/blog/2024/03/19/mustafa-suleyman-deepmind-and-inflection-co-founder-joins-microsoft-to-lead-copilot/?ref=blef.fr">Microsoft hires DeepMind co-founder</a> — <strong>Mustafa Suleyman will lead a new organisation called Microsoft AI</strong>. Following the announcement, the Copilot, Bing, Edge and GenAI teams will all move to the new organisation. Satya Nadella is going all-in on AI. 
It's important to say that Mustafa is joining Microsoft <a href="https://inflection.ai/the-new-inflection?ref=blef.fr">from Inflection</a>, an LLM company in which Microsoft invested a year ago.</li><li>OpenAI is closing partnerships with major newspapers in Europe — After <a href="https://openai.com/blog/axel-springer-partnership?ref=blef.fr">Axel Springer in Germany</a>, they signed with <a href="https://openai.com/blog/global-news-partnerships-le-monde-and-prisa-media?ref=blef.fr">Prisa Media</a> (which groups El País in Spain and the Huffington Post worldwide) and with <a href="https://www.lemonde.fr/en/about-us/article/2024/03/13/le-monde-signs-artificial-intelligence-partnership-agreement-with-open-ai_6615418_115.html?ref=blef.fr">Le Monde</a> in France. All these partnerships will help OpenAI train GPTs on media corpora to <em>enhance the reliability of the answers in return for a significant source of additional revenue</em>.</li><li><a href="https://huggingface.co/collections/PleIAs/common-corpus-65d46e3ea3980fdcd66a5613?ref=blef.fr">Common Corpus</a> — A HuggingFace dataset collection including public domain texts, newspapers and books in a lot of languages. Terabytes in size.</li><li><a href="https://towardsdatascience.com/designing-rags-dbb9a7c1d729?ref=blef.fr">Designing RAGs</a> — A super long and detailed article about RAG. It covers the 5 main components: indexing, storing, retrieval, synthesis and evaluation. 
Let's be honest: it contains everything you need to know about this new trend and the key considerations.</li><li><a href="https://superlinked.com/vector-db-comparison/?ref=blef.fr">Vector DB comparison</a> — A table comparing all the different vector technologies on different axes like search, models, APIs and technical details.</li><li><a href="https://github.com/fmind/mlops-python-package?ref=blef.fr">Python codebase with best practices to support MLOps</a> — This is a Github repository with a lot, I mean a lot, of tools and tips to create a production-grade repository.</li></ul><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li><a href="https://cloud.google.com/blog/products/data-analytics/apache-spark-stored-procedures-in-bigquery-are-ga?hl=en&ref=blef.fr">Run Spark procedures in BigQuery</a> — BigQuery released a way to write PySpark code in the web editor and to run / deploy it from there, creating a new serverless way to build BigQuery assets. This is a nice way to mix SQL and Python code.</li><li><a href="https://juhache.substack.com/p/pip-install-data-stack?ref=blef.fr">pip install data-stack</a> —&nbsp;This is a title I could have written myself. In this blog Julien covers the new Pythonic tooling and how far it can bring us in building lightweight programmatic data stacks. He also mentions my baby <a href="https://github.com/Bl3f/yato?ref=blef.fr">yato</a>.</li><li><a href="https://github.com/pretzelai/pretzelai?ref=blef.fr">Pretzel notebooks</a> — A new open-source notebook / exploration tool built on top of DuckDB WASM and PRQL; it allows you to chain operations like file upload, SQL, charting, filtering, sorting, etc. 
You can explore the <a href="https://pretzelai.github.io/?ref=blef.fr">demo</a>.</li><li>On the same topic, <a href="https://twitter.com/trucklos/status/1770490894581485756?ref=blef.fr">Hashquery</a> launched — a Python framework to create semantic data models.</li><li><a href="https://www.the-race.com/formula-1/shocking-details-behind-painful-williams-f1-revolution/?ref=blef.fr">Williams F1 used Excel to build their car</a> — F1 parts (thousands of them) were managed in a spreadsheet. These Excel files were unmanageable and explain why Williams had delivery delays. That's not surprising, because I think during an F1 season there aren't a lot of breaks you can use to pay down technical debt.</li><li><a href="https://github.com/dbt-labs/dbt-core/blob/v1.8.0b1/CHANGELOG.md?ref=blef.fr">dbt Core unit testing in v1.8</a> — dbt Core has implemented unit testing and it's coming soon. When unit testing a model you can give input rows and say what you expect as output rows <a href="https://docs.getdbt.com/docs/build/unit-tests?ref=blef.fr">in the YAML definition</a>. dbt will run and validate the model for you. This is a game changer.</li><li><a href="https://duckdbsnippets.com/page/1/most-popular?ref=blef.fr">Awesome DuckDB snippets</a> — A website that collects cool DuckDB snippets. The most popular is a 4-line bash command that you can add to your bashrc to convert a CSV to Parquet.</li><li><a href="https://medium.com/@mikldd/the-cost-of-data-incidents-53646b588601?ref=blef.fr">The cost of data incidents</a> — Mikkel is one of my favourite authors; he carefully picks his titles so that they resonate deeply with me. He proposes a formula to compute the cost of your data incidents, turning downtime numbers into $.</li><li><a href="https://erdavis.com/2024/03/07/my-2023-in-reading/?ref=blef.fr">2023 in reading</a> — This is a great side project idea: a visualisation of the hours Erin spent reading books in 2023. 
Personally, I just finished my first book of 2024 😅.</li></ul><p></p><hr><p>This newsletter edition is already too long and I have 10 other deep articles that I'll keep for next week ❤️.</p><p><a href="https://www.blef.fr/explorer/reco/">Recommendations</a> were computed this Wed., go check what the algorithm prepared for you. The email notification feature is almost ready, so opt in on the reco page to get your recommended links by email once it ships. I know the mobile version of the reco page is buggy; I'll work on it next week as well.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 24.11 ]]></title>
                    <description><![CDATA[ Data News #24.11 — OpenAI CTO, Musk vs. LeCun, Grok open-source?, French report about AI ambition, RAG is hype, and data engineering stuff. ]]></description>
                    <link><![CDATA[ /data-news-week-24-11/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 65f40fc05d21e60001b641bb ]]></guid>
                    <pubDate><![CDATA[ 2024-03-15 ]]></pubDate>
                    <content>
                        <![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2024/03/IMG_4072-1-1-1.png" class="kg-image" alt="" loading="lazy" width="1354" height="903" srcset="https://www.blef.fr/content/images/size/w600/2024/03/IMG_4072-1-1-1.png 600w, https://www.blef.fr/content/images/size/w1000/2024/03/IMG_4072-1-1-1.png 1000w, https://www.blef.fr/content/images/2024/03/IMG_4072-1-1-1.png 1354w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Mountains</span></figcaption></figure><p>I hope this e-mail finds you well, wherever you are. I'd like to thank you for the excellent comments you sent me last week after the publication of the first version of the Recommendations. This is just the beginning!</p><p>This week I've added a subscribe button to the <a href="https://www.blef.fr/explorer/reco/">Recommendations</a> page so you can opt in to the weekly recommendation email—every Tuesday. 
You can subscribe starting today on the page and you'll get emails as soon as I've developed the email sending—expected to be out at the end of the month.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2024/03/Screenshot-2024-03-15-at-18.20.13.png" class="kg-image" alt="" loading="lazy" width="2000" height="540" srcset="https://www.blef.fr/content/images/size/w600/2024/03/Screenshot-2024-03-15-at-18.20.13.png 600w, https://www.blef.fr/content/images/size/w1000/2024/03/Screenshot-2024-03-15-at-18.20.13.png 1000w, https://www.blef.fr/content/images/size/w1600/2024/03/Screenshot-2024-03-15-at-18.20.13.png 1600w, https://www.blef.fr/content/images/2024/03/Screenshot-2024-03-15-at-18.20.13.png 2266w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">You can opt-in for the recommendations</span></figcaption></figure><p>Second point: I passed 100 stars on Github for <a href="https://github.com/Bl3f/yato?ref=blef.fr">yato</a>, which is a crazy amount! I'd like to do a bit of user research about yato, so if you're considering using it, please drop me a message.</p><p><em>yato is a small Python library that I've developed; it stands for yet another transformation orchestrator. You give yato a folder of SQL queries and it guesses the DAG and runs the queries in the right order.</em></p><h1 id="ai-news-%F0%9F%A4%96">AI News 🤖</h1><ul><li>Mira Murati <a href="https://www.wsj.com/video/series/joanna-stern-personal-technology/openai-made-me-crazy-videosthen-the-cto-answered-most-of-my-questions/C2188768-D570-4456-8574-9941D4F9D7E2?ref=blef.fr">answers the Wall Street Journal</a> about OpenAI Sora — OpenAI's CTO was <strong>asked a few questions about the underlying technology in Sora</strong>. She revealed a few insights. 
For the moment OpenAI considers Sora a research output that might eventually be released later this year; it required "<em>much much more</em>" compute power than DALL-E to generate a video, and they have a lot of open questions regarding the impact on elections or the film industry. She mainly said that "<em>Sora is a tool to extend creativity</em>". <br><br>Last point: Mira was mocked and criticised online because, as CTO, <strong>she wasn't able to say which public / licensed data Sora was trained on</strong>. When asked if it was YouTube videos, Facebook or Instagram she said "<em>I'm actually not sure about that</em>".<br><br>I personally really recommend this interview, which covers a lot of interesting topics in 10 minutes.</li><li>Elon Musk said out loud that <a href="https://twitter.com/elonmusk/status/1767108624038449405?ref=blef.fr">xAI will open-source Grok this week</a>. It's Friday and it seems they are even later than me when it comes to releasing stuff. Just in time for a reminder that <strong>open-source ≠ open-weights</strong> when it comes to <a href="https://opencoreventures.com/blog/2023-06-27-ai-weights-are-not-open-source/?ref=blef.fr">AI licensing</a>, although differences in weights licensing <a href="https://web.archive.org/web/20230722024435/https://www.alessiofanelli.com/blog/llama2-isnt-open-source">are not as important as they seem</a>.</li><li><a href="https://www.databricks.com/blog/databricks-invests-mistral-ai-and-integrates-mistral-ais-models-databricks-data-intelligence?ref=blef.fr">Databricks invests in Mistral AI</a> — Mistral has successfully positioned itself as the main OpenAI rival by being integrated into all the major data platforms (Azure and Snowflake previously).</li><li>A French commission released a 130-page report titled <strong>"Our AI: our ambition for France"</strong>. 
You can <a href="https://www.gouvernement.fr/actualite/25-recommandations-pour-lia-en-france?ref=blef.fr">download</a> the French version and a 16-page English summary. The report includes 25 recommendations from French-speaking AI leaders (Yann LeCun, Arthur Mensch, etc.).</li><li>Assisted AI wars are around the corner — I only follow the French news, but the government is proudly doubling its budget for "AI defense". From what I know, AI is mainly used as an information companion to find signals in the huge amount of data we generate, creating more efficient agents. <br><br>This is related to Paris testing <a href="https://www.lemonde.fr/en/pixels/article/2024/03/03/paris-olympics-2024-testing-on-algorithmic-video-surveillance-of-the-games-begins_6580505_13.html?ref=blef.fr">automated video surveillance during the Olympics</a>. The technology behind this is <a href="https://wintics.com/en/cityvision/?ref=blef.fr">Cityvision</a>.</li><li>Yann LeCun <a href="https://twitter.com/ylecun/status/1768330052570173471?ref=blef.fr">clashed</a> with Elon Musk on Twitter about the future of AI. <strong>Musk thinks AI will be smarter than any single human next year</strong>, while LeCun said "<em>No</em>", taking as an example the <a href="https://www.theverge.com/2023/8/23/23837598/tesla-elon-musk-self-driving-false-promises-land-of-the-giants?ref=blef.fr">false self-driving car promise</a>. Moreover, LeCun believes that human information compression capabilities are still so far ahead of AI that AGI is not even close.</li><li><a href="https://twitter.com/cognition_labs/status/1767548763134964000?ref=blef.fr">Cognition AI introduced Devin</a> — Devin is the first AI software engineer. Devin can, unassisted, do software engineering tasks like fixing Github issues (13% success rate, where the previous best was ~5%), apply to jobs on Upwork, and train and fine-tune its own models. 
I'm speechless.</li><li><a href="https://engineering.fb.com/2024/03/12/data-center-engineering/building-metas-genai-infrastructure/?ref=blef.fr">Building Meta’s GenAI infrastructure</a> — 2x 24k-GPU clusters, and it's growing. I like how Meta tries to do stuff out in the open (or at least with some kind of transparency), but the number of GPUs is just disconcerting.</li><li><strong>RAG is the new trend</strong> —&nbsp;RAG means retrieval-augmented generation; the term was coined in 2020 (<a href="https://blogs.nvidia.com/blog/what-is-retrieval-augmented-generation/?ref=blef.fr">see more</a>) and it lets you ground AI models with facts fetched from external sources.<ul><li><a href="https://blog.streamlit.io/build-a-real-time-rag-chatbot-google-drive-sharepoint/?ref=blef.fr">A real-time RAG chatbot built on Sharepoint and Google Drive</a></li><li><a href="https://docs.superduperdb.com/blog/rag-system-on-duckdb-using-jinaai-and-superduperdb/?ref=blef.fr">RAG on-top of DuckDB</a></li><li><a href="https://decodingml.substack.com/p/a-real-time-retrieval-system-for?ref=blef.fr">RAG on LinkedIn data</a></li></ul></li><ul><li>There is an exponential number of technologies in the RAG space, especially vector databases, so many that I don't even mention them, but obviously the posts all say "<em>ours is the best</em>".</li></ul><li><a href="https://blog.research.google/2024/03/croissant-metadata-format-for-ml-ready.html?ref=blef.fr">Croissant: a metadata format for ML-ready datasets</a> —&nbsp;To move forward faster in AI and model building we need an interoperable and easy-to-use metadata format for ML datasets. This is Croissant. Starting today it is supported by 3 major platforms: Kaggle, HuggingFace and OpenML. 
Croissant is under mlcommons and you can have a look at the <a href="https://mlcommons.github.io/croissant/docs/croissant-spec.html?ref=blef.fr">specification</a>.</li><li><a href="https://mlcontests.com/state-of-competitive-machine-learning-2023/?ref=blef.fr">The State of competitive machine learning</a> — a study about ML competition platforms. It gives a lot of insight into the market.</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/photo-1587912001191-0cd4f14fd89e?q=80&amp;w=1000&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D" class="kg-image" alt="brown bread on white table" loading="lazy" width="1000" height="667" srcset="https://images.unsplash.com/photo-1587912001191-0cd4f14fd89e?q=80&amp;w=600&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D 600w, https://images.unsplash.com/photo-1587912001191-0cd4f14fd89e?q=80&amp;w=1000&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D 1000w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">A new standard full of butter (</span><a href="https://unsplash.com/photos/brown-bread-on-white-table-dCKQMAzy8II?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li>Since the end of Feb. 
BigQuery supports <a href="https://cloud.google.com/bigquery/docs/using-dml-with-partitioned-tables?ref=blef.fr#using_dml_delete_to_delete_partitions">DELETE</a> to delete partitions in a SQL query.</li><li><a href="https://www.junaideffendi.com/p/how-i-saved-70k-a-month-in-bigquery?ref=blef.fr">How I saved $70k a month in BigQuery</a> — Junaid shared a few techniques he used to save a bunch of dollars on the BigQuery bill. Nothing new, more common sense, but it always works. In a nutshell: smarter schedules, table optimisations, incremental models, avoiding views and precomputing.</li><li><a href="https://www.linkedin.com/feed/update/urn:li:activity:7172876370606243840/?updateEntityUrn=urn%3Ali%3Afs_updateV2%3A%28urn%3Ali%3Aactivity%3A7172876370606243840%2CFEED_DETAIL%2CEMPTY%2CDEFAULT%2Cfalse%29&ref=blef.fr">Attributing Snowflake cost to whom it belongs</a> — Fernando gives ideas about metadata management to better attribute Snowflake costs. Whether it's a dbt model, a Tableau dashboard or a Metabase question, it has to be tracked to understand what drives your bills.</li><li><a href="https://blog.det.life/i-spent-3-hours-figuring-out-how-bigquery-inserts-deletes-and-updates-data-internally-0b04d11a274a?ref=blef.fr">Understand how BigQuery inserts, deletes and updates</a> — Once again Vu took the time to deep dive into BigQuery internals, this time to explain how data management is done.</li><li><a href="https://pandera--1373.org.readthedocs.build/en/1373/polars.html?ref=blef.fr#polars">Pandera, a data validation library for dataframes, now supports Polars</a>.</li><li><a href="https://medium.com/@PyDataParis/announcing-pydata-paris-2024-700220accc72?ref=blef.fr">PyData is coming to Paris in 2024</a> —&nbsp;The CFP is open and I submitted a talk there about yato.</li><li><a href="https://medium.pimpaudben.fr/airflow-kestra-a-simple-benchmark-ffc5a533aa85?ref=blef.fr">A comparison between Kestra and Airflow</a> —&nbsp;Benoit (who works at Kestra) did a great 
comparison between the 2 tools, comparing the syntax to write DAGs and the performance in terms of scheduling capacity—tasks per second. Obviously Benoit prefers Kestra, at the expense of writing YAML and running a Java application.</li><li>New Apache Arrow engines — Arrow has become one of the most used libraries when it comes to building in-memory engines, doing a lot of the heavy lifting for data operations.<ul><li><a href="https://arrow.apache.org/blog/2024/03/06/comet-donation/?ref=blef.fr">Apache Arrow DataFusion Comet</a> — a native Spark SQL accelerator; the idea is to improve Spark performance by replacing the Spark executor, delegating execution to Comet. On the matter there is also <a href="https://gluten.apache.org/?ref=blef.fr">Apache Gluten</a>, a plugin aiming to double SparkSQL performance.</li><li>Arroyo, a stream-processing platform, <a href="https://www.arroyo.dev/blog/why-arrow-and-datafusion?ref=blef.fr">rebuilt their engine using DataFusion</a>.</li></ul></li><li>Postgres creator launches <a href="https://www.dbos.dev/blog/announcing-dbos?ref=blef.fr">DBOS, a transactional serverless computing platform</a> — Mike sees DBOS as a cloud-native OS that runs on top of the database in order to rethink application development and deployment.</li><li><a href="https://blog.allegro.tech/2024/03/kafka-performance-analysis.html?ref=blef.fr">Unlocking Kafka's potential: tackling tail latency with eBPF</a>.</li></ul><h3 id="forward-thinking">Forward thinking</h3><ul><li><a href="https://docs.malloydata.dev/blog/2024-02-29-hierarchical-viz/?ref=blef.fr#dataviz-is-hierarchical">Dataviz is hierarchical</a> — Malloy, once again, provides an excellent article about a new way to see data visualisations. It's inspirational.</li><li><a href="https://dlthub.com/docs/blog/code-vs-buy?ref=blef.fr">Coding data pipelines is faster than renting connector catalogs</a> — This is something I've always believed.
The devil is in the details and when it comes to data pipelines there are a lot of details, which often keeps us from buying and leads us to build (or code). Matthaus gives the dlt vision: creating the foundation for developers to create sources in a wink, resulting in a large ecosystem of easily maintainable API datasets.</li><li><a href="https://motherduck.com/blog/differential-storage-building-block-for-data-warehouse/?ref=blef.fr">Differential storage, a building block for a DuckDB-based data warehouse</a> — It's MotherDuck's vision: creating the next data warehouse on top of DuckDB, leveraging DuckDB's capacity to morph between a single machine and a production ecosystem. In the article Joseph explains how MotherDuck extended DuckDB to add time travel and zero-copy snapshots, opening the door for more collaboration and concurrency.</li></ul><hr><p>See you next week ❤️ — recommendations for this week have been computed, <a href="https://www.blef.fr/explorer/reco/">go check them out</a>.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Recommendations ]]></title>
                    <description><![CDATA[ Data News #24.10 — A special announcement this week I introduce you to a new Data News feature: the recommendations. ]]></description>
                    <link><![CDATA[ /introduce-recommendations/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 65eadf27e214bb000152b838 ]]></guid>
                    <pubDate><![CDATA[ 2024-03-08 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/photo-1504807959081-3dafd3871909?q=80&amp;w=1000&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D" class="kg-image" alt="two person holding map and clear compass" loading="lazy" width="1000" height="662" srcset="https://images.unsplash.com/photo-1504807959081-3dafd3871909?q=80&amp;w=600&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D 600w, https://images.unsplash.com/photo-1504807959081-3dafd3871909?q=80&amp;w=1000&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D 1000w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">We all need recommendations (</span><a href="https://unsplash.com/photos/two-person-holding-map-and-clear-compass-ioYwosPYC0U?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p>When I started writing this newsletter nearly three years ago, I never imagined that the words I write on my keyboard would take such an important place in my life. All the interactions I have with you, whether online or offline, are always amazing and give me wings.</p><p><strong>Today I want to introduce a new feature in the Data News galaxy.</strong></p><p>I don't talk much about my freelance life in Data News because sometimes I think that's not the contract we have together. The Data News promise is to give you, every week, the links I've hand-picked with my spicy opinion about them. Since the beginning of the year the balance between freelancing and content has gone from 80/20—80% client stuff and 20% content—to 30/70.
This is mainly due to the fact that I've done my annual University lectures and spoken at <a href="https://www.blef.fr/talks/">7 events</a> since the beginning of the year.</p><p>Let's be honest, I'm also a bit stupid. At every event I speak at, I decide to do a new presentation. That's great because it helps me innovate and pushes me to new horizons every time, but it takes time to assimilate chunks of work in order to produce creative keynotes.</p><p>All of this is made possible thanks to my Data News curation. Thanks to the time I spend reading content, forging ideas and chatting with all of you, I get inspired and my crazy brain invents things. And I want you to have the same superpowers as me. This is what motivates me.</p><p><em>PS: Fast News ⚡️ at the very end if you want to skip this story. Which will make me sad, but I understand.</em></p><p></p><h1 id="there-is-a-problem">There is a problem</h1><p>Data News has grown so much since the beginning: I currently have 4500 members on blef.fr. I have sent 132 Data News editions, which represents 2500 links (~20 links per edition).</p><p><strong>But there's a big problem: all my old Data News is dead content.</strong></p><p>I mean, there is a big difference between a podcast, for instance, and news blogging like I'm doing. When you subscribe to a new podcast you often scroll through the creator's past episodes.
When someone subscribes to the Data News, they rarely go through my old editions.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2024/03/Screenshot-2024-03-08-at-17.14.09.png" class="kg-image" alt="" loading="lazy" width="2000" height="1029" srcset="https://www.blef.fr/content/images/size/w600/2024/03/Screenshot-2024-03-08-at-17.14.09.png 600w, https://www.blef.fr/content/images/size/w1000/2024/03/Screenshot-2024-03-08-at-17.14.09.png 1000w, https://www.blef.fr/content/images/size/w1600/2024/03/Screenshot-2024-03-08-at-17.14.09.png 1600w, https://www.blef.fr/content/images/2024/03/Screenshot-2024-03-08-at-17.14.09.png 2000w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">A few numbers</span></figcaption></figure><p>All these 2500 links are links that I've liked and commented on. Looking at them, most are timeless and I think they can still bring a lot of value to all of you.</p><p>That's why I want to re-activate my old content.</p><p></p><h1 id="the-explorer">The Explorer</h1><p>A year and a half ago I developed <a href="https://www.blef.fr/explorer/">the Explorer</a>. The Explorer is a search bar that lets you search over all the links that I have shared in the 132 Data News editions.</p><p>It was my first step in this journey to make my handpicked links browsable and usable by everyone.
While I'm not good at marketing it, a few of you use it every month, but I think it could be used way more.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/assets/img/overview.png?v=aaf514f082" class="kg-image" alt="" loading="lazy" width="3298" height="2452"><figcaption><span style="white-space: pre-wrap;">The Explorer (https://blef.fr/explorer)</span></figcaption></figure><p>But I want to go further.</p><h1 id="introducing-the-recommendation">Introducing the Recommendation</h1><p>2500 links is a huge amount and sometimes this is like finding a needle in a haystack. That's why I've developed a new feature: a recommendation module.</p><p><strong>The Data News recommendation will give you every week a single link that you should have clicked on</strong>.</p><p>For the moment the recommender is based on your click history. In every Data News email I send you I know which links you clicked on, so I'm able to leverage this information to recommend content to you.</p><p>This is just the beginning and for the moment the algorithm is very trivial: a collaborative filtering algorithm that recommends links you did not click on but that have been clicked on by members with the same click behaviour as you.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2024/03/Screenshot-2024-03-08-at-17.54.13.png" class="kg-image" alt="" loading="lazy" width="2000" height="1435" srcset="https://www.blef.fr/content/images/size/w600/2024/03/Screenshot-2024-03-08-at-17.54.13.png 600w, https://www.blef.fr/content/images/size/w1000/2024/03/Screenshot-2024-03-08-at-17.54.13.png 1000w, https://www.blef.fr/content/images/size/w1600/2024/03/Screenshot-2024-03-08-at-17.54.13.png 1600w, https://www.blef.fr/content/images/2024/03/Screenshot-2024-03-08-at-17.54.13.png 2000w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Data News
recommendations</span></figcaption></figure><p></p><p>As you can see in the screenshot, in the Recommendation panel you can see the links that have been recommended to you and the links you've clicked on. For me to get your feedback, you have the possibility to like / dislike all the links (whether recommendations or clicked links).</p><div class="kg-card kg-button-card kg-align-center"><a href="https://www.blef.fr/explorer/reco/" class="kg-btn kg-btn-accent">See your recommendation</a></div><p><strong>Christophe, why did you make this? No one asked for it.</strong></p><p>Yes, no one asked for it, but let me expand on the why:</p><ul><li>Frustration — Like I said before, I'm super frustrated that all the content I've referenced is "dead". I'm pretty sure that if I successfully reactivate this content I can generate more traffic on blef.fr, diversify my revenue and bring more knowledge to the data community.</li><li>It's a showcase — It can be an educational project showing others how you can orchestrate and schedule a small-scale AI application.</li><li>It's fun and rewarding — From my side, I like the fact that every week members will have a <em>gift</em> coming from me in the form of this recommendation.</li><li>Why not? — Finally, I don't run any playbook, so why not try stuff?</li></ul><p></p><h1 id="architecture">Architecture</h1><p>As I said, while being a new feature of the blog, this is also an educational project I can use to showcase technologies.
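The collaborative filtering described above can be sketched in a few lines. This is a toy illustration, not the actual blef.fr code: member names, link ids and the click matrix are made up, and the similarity measure is a simple cosine over click sets.

```python
# Toy user-based collaborative filtering over click history.
from math import sqrt

# Hypothetical click history: member -> set of links clicked.
clicks = {
    "alice": {"duckdb-post", "dbt-post", "arrow-post"},
    "bob":   {"duckdb-post", "dbt-post", "kafka-post"},
    "carol": {"kafka-post"},
}

def similarity(a, b):
    """Cosine similarity between two sets of clicked links."""
    return len(a & b) / (sqrt(len(a)) * sqrt(len(b))) if a and b else 0.0

def recommend(member):
    """Score links the member has not clicked, weighted by how similar
    the members who did click them are to this member."""
    mine = clicks[member]
    scores = {}
    for other, theirs in clicks.items():
        if other == member:
            continue
        sim = similarity(mine, theirs)
        for link in theirs - mine:
            scores[link] = scores.get(link, 0.0) + sim
    return max(scores, key=scores.get) if scores else None

print(recommend("alice"))  # → kafka-post (bob is most similar to alice)
```

The real recommender works the same way in spirit: members with overlapping click behaviour vote for the links you haven't seen yet.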
See below the global architecture I've used to make this link recommender work.</p><figure class="kg-card kg-image-card"><img src="https://lh7-us.googleusercontent.com/RrWAdbpEO2ATYy20nD7VdPGpw38syPFTsSvuT86RwRxEc6taoQS9FkGFTIQQMwjP094baNqBrtEdkJA-_Eyp4uyvNzrMj0EKFWj3kqbKTBXmMTbkNP1LHvPgH8Z76bFT4PmBQpCz_tt21RJKk-ih1gQeGg=nw" class="kg-image" alt="" loading="lazy" width="1600" height="1225"></figure><ul><li><strong>Ghost</strong> — My blog is hosted on <a href="https://ghost.org/?ref=blef.fr">ghost.org</a>. I really like Ghost because it's open-source (but I use the paid hosted version) and gives me the possibility to extend the blog with custom code. The main part of the blog is just a bunch of <a href="https://handlebarsjs.com/?ref=blef.fr">Handlebars</a> templates connected to the Ghost Content API. I extended the website by embedding a React application that powers the custom frontend of the Explorer and Recommendation.</li><li><strong>blefapi</strong> — In order to make the React apps work I need a custom backend, which I've developed with Django. This backend connects to Ghost using some kind of SSO (with JWT), which means I don't need to create another login page: once you're a member you can use all my extended features. The Django app uses Postgres as a database and a bucket to host a few static files. Everything is hosted on Scaleway (a French cloud company).</li><li><strong>CI/CD</strong> — Everything is just deployed from GitHub Actions; whether it's the React application or the Django API, I just need to push and it will deploy a new version.</li><li><strong>newsletter-reco</strong> — This is where the recommendation magic happens. This is a small pipeline that gets the activity data from the blog API, does a bit of feature engineering, recommends an article for every member and then publishes the recommendations to the blefapi.
Under the hood (see below) it uses dlt, DuckDB / pandas and GitHub Actions.</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://lh7-us.googleusercontent.com/-pZYzro9fzkaMFuYux5JRIh2TJ7BJ9-MWTFy_zeH2ipqRZ5EdA4puFnKjENIUGoSAk9_fTht4LhUQTT66k6U5IgImnf-4X1v-zaZQ-pKAqigJoTlR4jlJsZmIHk_RzQo1KRtidA-HPWBY8GWrs6fvyjiWg=nw" class="kg-image" alt="" loading="lazy" width="1600" height="1136"><figcaption><span style="white-space: pre-wrap;">How the recommendation works</span></figcaption></figure><p>The recommendation pipeline is fairly simple: it uses <a href="https://dlthub.com/?ref=blef.fr">dlt</a> to do the <a href="https://github.com/Bl3f/newsletter-reco/blob/main/ghost.py?ref=blef.fr">extract-load</a> from the Ghost API, dlt loads the data into a <a href="https://github.com/Bl3f/newsletter-reco/blob/main/pipeline.py?ref=blef.fr#L25-L34">DuckDB database</a>, then this data is transformed using <a href="https://github.com/Bl3f/newsletter-reco/tree/main/transform/sql?ref=blef.fr">SQL / Python transformations</a> <a href="https://github.com/Bl3f/newsletter-reco/blob/main/pipeline.py?ref=blef.fr#L13-L21">orchestrated</a> by <a href="https://github.com/Bl3f/newsletter-reco/blob/main/pipeline.py?ref=blef.fr#L40">yato</a>. To publish the recommendations to the API it uses DuckDB's ATTACH capability, <a href="https://github.com/Bl3f/newsletter-reco/blob/main/transform/sql/export/insert_ghostapi_recommendation.sql?ref=blef.fr">directly inserting records</a> into the Postgres database (it's a hack, but it works).
All of this runs in GitHub Actions every week to produce a new recommendation for everyone.</p><div class="kg-card kg-button-card kg-align-center"><a href="https://github.com/Bl3f/newsletter-reco?ref=blef.fr" class="kg-btn kg-btn-accent">Browse the recommendation code on Github</a></div><p></p><h1 id="next-steps">Next steps</h1><p>I'll keep working incrementally on the recommendation in the coming weeks. I'm open to all suggestions and I'd love to get your feedback on this; you can even open Pull Requests on the code if you feel like it. Here is what I plan to add in the following weeks:</p><ul><li>Subscribe to an additional email to receive the recommendation on Tuesday (if you really want to receive recommendations by email, reply to this email and I'll opt you in directly).</li><li>Use GenAI to summarise the links database to give you a summary of each link that has been recommended to you—maybe saving you one click</li><li>Improve the recommendation algorithm by using an item-based approach and embeddings</li><li>Take into account the likes / dislikes from the Timeline</li><li>Develop a public BI-as-code dashboard showing metrics about the content and showcasing Evidence and Observable</li></ul><p></p><h1 id="bonus-yato">Bonus: yato</h1><p>While working on the recommender I've developed something else called <a href="https://github.com/Bl3f/yato?ref=blef.fr"><strong>yato</strong></a>. yato stands for yet another transformation orchestrator and is the smallest DuckDB SQL orchestrator on Earth.</p><p>The idea behind yato is to provide a Python library (<code>pip install yato-lib</code>) that you can use either from Python code or via the CLI, and that runs all the transformations in a given folder against a DuckDB database.</p><p>yato uses SQLGlot to guess the underlying DAG and run the transformations in the right order.
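To give an intuition of the "guess the DAG" part: yato does it properly with SQLGlot's parser, but a toy version of the idea fits in a few lines with the standard library alone. The SQL snippets and model names below are invented, and the regex only handles simple FROM/JOIN clauses; it is a sketch of the technique, not yato's implementation.

```python
import re
from graphlib import TopologicalSorter

# One transformation per "file": the key is the table it creates,
# the SQL body references upstream tables (a toy stand-in for a
# folder of .sql files).
transformations = {
    "staging_clicks": "SELECT * FROM raw_clicks",
    "member_features": "SELECT member, COUNT(*) AS n FROM staging_clicks GROUP BY member",
    "recommendations": "SELECT * FROM member_features JOIN staging_clicks USING (member)",
}

def referenced_tables(sql):
    """Naively extract table names that follow FROM or JOIN keywords."""
    return set(re.findall(r"(?:FROM|JOIN)\s+([a-zA-Z_]\w*)", sql, re.IGNORECASE))

# Build the dependency graph, keeping only edges between our own models
# (external tables like raw_clicks have no transformation to run).
graph = {
    name: referenced_tables(sql) & transformations.keys()
    for name, sql in transformations.items()
}

# Topological order: every model runs after its dependencies.
order = list(TopologicalSorter(graph).static_order())
print(order)
```

Swap the regex for SQLGlot's parser and run each statement against DuckDB in that order, and you have the skeleton of a tiny SQL orchestrator.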
For the moment yato is tied to DuckDB. Philosophically, yato has been developed like black (the formatter): you have just one required parameter, a transformation folder, and then you can do <code>yato run</code>.</p><p>I don't think yato will ever replace dbt Core, SQLMesh or lea; yato is just a lighter alternative that you can use with your messy SQL folder.</p><div class="kg-card kg-button-card kg-align-center"><a href="https://github.com/Bl3f/yato/?ref=blef.fr" class="kg-btn kg-btn-accent">See yato on Github</a></div><hr><p>It was a special announcement for me, I hope you'll understand and receive this news with as much excitement as I have.</p><p>And because I still want you to get a bit of news, below is a very fast news section.</p><h1 id="very-fast-news-%E2%9A%A1%EF%B8%8F">Very Fast News ⚡️</h1><ul><li><a href="https://www.nytimes.com/2024/03/01/technology/elon-musk-openai-sam-altman-lawsuit.html?ref=blef.fr">Elon Musk decided to sue OpenAI</a> for violating company principles by putting profits and commercial interest first. Funny to see this from Elon Musk the philanthropist.</li><li>Google <em>is slowly losing</em> the race for (Gen)AI, so people are <a href="https://www.businessinsider.com/calls-for-google-ceo-sundar-pichai-alphabet-step-down-ai-2024-3?r=US&IR=T&ref=blef.fr">starting to call for Sundar Pichai to step down</a>.</li><li><a href="https://www.anthropic.com/news/claude-3-family?ref=blef.fr">Anthropic released Claude 3</a> — that seems to achieve great results in benchmarks with "sophisticated vision capabilities".</li><li><a href="https://huggingface.co/enterprise?ref=blef.fr">HuggingFace released Enterprise Hub</a> — A private, dedicated space to use HF features.</li><li><a href="https://www.youtube.com/watch?v=5t1vTLU7s40&ref=blef.fr">Yann LeCun went on the Lex Fridman podcast</a> — He chatted for almost 3h.
I have not listened to the podcast yet but I guess he chatted about the concept of intelligence like he used to.</li><li>Sicara released a <a href="https://www.sicara.fr/en/tech-radar?ref=blef.fr">tech radar about AI technologies</a>. It includes 4 pillars: algorithms, data, methods and industrialisation. It's funny to see Parquet as a technology still to adopt.</li><li><a href="https://hubertdulay.substack.com/p/easy-introduction-to-real-time-rag?r=46sqk&utm_campaign=post&utm_medium=web&triedRedirect=true&ref=blef.fr">Easy introduction to real-time RAG</a> — Showcases how you can include your LangChain / OpenAI pipeline in a classic Kafka / Pinot infrastructure.</li><li>ClickHouse <a href="https://clickhouse.com/blog/chdb-joins-clickhouse-family?ref=blef.fr">acquired chDB</a>, a DuckDB alternative, and <a href="https://clickhouse.com/blog/clickhouse-1-trillion-row-challenge?ref=blef.fr" rel="noreferrer">achieved the 1 trillion row challenge</a> (with classic ClickHouse) in under 3 minutes for $0.56.</li><li><a href="https://www.linkedin.com/feed/update/urn:li:activity:7171216203414167553/?updateEntityUrn=urn%3Ali%3Afs_updateV2%3A%28urn%3Ali%3Aactivity%3A7171216203414167553%2CFEED_DETAIL%2CEMPTY%2CDEFAULT%2Cfalse%29&ref=blef.fr">Snowflake now supports trailing commas</a> and <a href="https://investors.snowflake.com/news/news-details/2024/Snowflake-Partners-with-Mistral-AI-to-Bring-Industry-Leading-Language-Models-to-Enterprises-Through-Snowflake-Cortex/default.aspx?ref=blef.fr">partners with Mistral AI</a> to bring models to the warehouse; we also learn that Snowflake Ventures invested in Mistral AI. Long gone are the days when Mistral was French.</li><li><a href="https://www.getorchestra.io/blog/introducing-orchestra-rapidly-build-and-monitor-data-and-ai-products?ref=blef.fr">Orchestra released a free-tier platform</a> to rapidly build and monitor data products.
Orchestra is a graphical solution to define DAGs and orchestrate different parts of the Modern Data Stack.</li><li>Use <a href="https://ibis-project.org/posts/into-snowflake/?ref=blef.fr">Ibis to load data from other databases</a> to Snowflake. This is similar to the ATTACH I did in my recommender with DuckDB.</li><li><a href="https://substack.timodechau.com/p/how-to-measure-a-data-platform?ref=blef.fr">How to measure a data platform</a> — A great article discussing the metrics tree we need to put in place as a data team. I really like it.</li></ul><p></p><hr><p>See you next week ❤️ — and please give me feedback whether you like it or not.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 24.09 ]]></title>
                    <description><![CDATA[ Data News #24.09 — Mistral AI, Klarna AI customer support agent, extract and load still unsolved ]]></description>
                    <link><![CDATA[ /data-news-week-24-09/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 65e2db310eef200001574974 ]]></guid>
                    <pubDate><![CDATA[ 2024-03-02 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/photo-1505672678657-cc7037095e60?q=80&amp;w=1000&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D" class="kg-image" alt="trees with wind photo" loading="lazy" width="1000" height="667" srcset="https://images.unsplash.com/photo-1505672678657-cc7037095e60?q=80&amp;w=600&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D 600w, https://images.unsplash.com/photo-1505672678657-cc7037095e60?q=80&amp;w=1000&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D 1000w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Mistral (</span><a href="https://unsplash.com/photos/trees-with-wind-photo-WtwSsqwYlA0?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p>Hello all, this is the Data News. This week's edition might be smaller than usual in terms of comments as I'm working on a Data News related project that takes a bit of my time and will probably lead to a series of articles.</p><p>Before I forget, I appeared on <a href="https://open.spotify.com/episode/4Rs4Xqovqs1mrI18FZXHZi?si=u6q7NtovTX6sS7uv0FzLeQ&nd=1&dlsi=f2406027f65043da&ref=blef.fr">The Joe Reis Show</a>; Joe and I chatted about teaching data engineering, why it is hard, and how generative AI will change education forever. This is a 1h podcast, I hope you will enjoy listening to it.</p><p>Final reminder: next week there is <a href="https://conference-mlops.com/?ref=blef.fr">La Conférence MLOps</a>, which will take place in Paris on March 7th. If you want to register I still have a 40% promocode: <strong>mlops-blef-40</strong>.
I'll give a talk—in French—about <em>how to put machine learning in production at a small scale</em>, a topic related to the Data News project 😬.</p><p></p><h1 id="ai-news-%F0%9F%A4%96">AI News 🤖</h1><ul><li>Mistral AI announcements<ul><li><a href="https://mistral.ai/news/mistral-large/?ref=blef.fr">Mistral Large</a>, their new <em>flagship</em> model, which outperforms competing models except GPT-4. At the same time Microsoft closed a <a href="https://azure.microsoft.com/en-us/blog/microsoft-and-mistral-ai-announce-new-partnership-to-accelerate-ai-innovation-and-introduce-mistral-large-first-on-azure/?ref=blef.fr">partnership</a> with Mistral to make Large available on Azure, as their <em>first distribution partner</em>. It has led to a lot of discussion in French politics about Mistral AI being more American than French. With the partnership Microsoft entered the Series A with a <a href="https://techcrunch.com/2024/02/27/microsoft-made-a-16-million-investment-in-mistral-ai/?ref=blef.fr">€15m addition</a>, joining a16z.</li><li>They also released a smaller model called Mistral Small.</li><li><a href="https://mistral.ai/news/le-chat-mistral/?ref=blef.fr">Le Chat</a>, the conversational interface to interact with Mistral models.</li><li>Final comment: with these 2 announcements Mistral left the open side to go <a href="https://sifted.eu/articles/mistral-microsoft-deal-controversy?ref=blef.fr">commercial</a> / <a href="https://twitter.com/KeldonB/status/1762183708738523379?ref=blef.fr">closed</a>. It led to conversations where people felt <a href="https://old.reddit.com/r/LocalLLaMA/comments/1b0o41v/top_10_betrayals_in_anime_history/?ref=blef.fr">betrayed</a> by Mistral, which built their differentiator—or should I say marketing—on top of open-source / open-weight models.
<a href="https://www.youtube.com/watch?v=_YqzuE-5RE8&ref=blef.fr">Mistral perdant</a>.</li></ul></li><li><a href="https://github.blog/2024-02-27-github-copilot-enterprise-is-now-generally-available/?ref=blef.fr">GitHub Copilot Enterprise is now generally available</a> —&nbsp;This week I've started to use GitHub Copilot (not the Enterprise version). And let's be honest, it is a productivity boost, especially when you want to write docstrings and comments. Still, there is an annoying interaction in PyCharm where Copilot takes <em>too much space.</em> Copilot Enterprise mainly comes with 3 features: understanding your whole org codebase, a chat to ask questions about the codebase, and pull request summaries.</li><li><a href="https://www.sievedata.com/blog/fast-active-speaker-detection?ref=blef.fr">Fast, efficient active speaker detection on videos</a> — This is a great introduction to active speaker detection: being able to detect speakers' faces in a video and whether they are actually speaking or not.</li><li><a href="https://www.klarna.com/international/press/klarna-ai-assistant-handles-two-thirds-of-customer-service-chats-in-its-first-month/?ref=blef.fr">Klarna's AI customer support agent does the equivalent of 700 agents</a> — Klarna developed an AI agent that interacts automatically with customers, driving profit. It has to be put in <a href="https://twitter.com/GergelyOrosz/status/1762755589527015537?ref=blef.fr">context</a>.</li><li><a href="https://ibis-project.org/posts/duckdb-for-rag/?ref=blef.fr">Using DuckDB + Ibis for RAG</a> — A handy code snippet explaining why DuckDB is a good solution bringing the best of both worlds when it comes to RAG.</li></ul><p></p><h1 id="extract-and-load-still-unsolved-%F0%9F%A4%AD">Extract and load, still unsolved 🤭</h1><p>I started writing data pipelines in 2014 and the movement from sources to destinations has always been one of the most discussed topics in my data engineering spaces.
Personally I'm the kind of guy who likes to build it custom because I think an out-of-the-box solution does not exist. In the end you finish with a composable solution mixing 2 or 3 technologies to extract and load your data into your central storage, ready for transformations.</p><p>In 2024 we have more tools than ever to move data from sources to destinations. But the field has taken a new direction.</p><p>Until now, solutions were mainly full platforms (often in the cloud) with the promise to do everything, in search of rebundling the data platform (cf. <a href="https://web.archive.org/web/20230202214350/https://blog.fal.ai/the-unbundling-of-airflow-2/">The unbundling of Airflow</a>). Recently, it has reached new heights: <strong>what if the extract and load is just a small library layer that integrates with whatever you're doing</strong>—for people reading me carefully this is what I was calling for in <a href="https://www.adventofdata.com/using-airflow-the-wrong-way/?ref=blef.fr">using Airflow the wrong way</a>, but the fun way.</p><p>Enter the new kids on the block:</p><ul><li><a href="https://dlthub.com/?ref=blef.fr">dlt</a> — it stands for <em>data load tool</em>, it's a Python <em>library</em> installable with pip. It provides a framework to do the extract and load: you define sources and resources with the specificities of the data you want to load—primary keys, write disposition, incremental mode, etc.—and the library does the heavy lifting accordingly.</li><li><a href="https://airbyte.com/blog/announcing-pyairbyte?ref=blef.fr">PyAirbyte</a> — Airbyte announced their Python <em>library</em> in beta. Currently it supports around 250 sources, which is a subset of all Airbyte sources (only the ones written in Python), and it seems it does not support connecting to classic databases. They call a destination a Cache, which is a terrible name.
Even if the library is a great idea, I feel it's sad that the interoperability with Airbyte is not 100%.<br><br>Adrian from dlt wrote a <a href="https://dlthub.com/docs/blog/what-is-pyairbyte?ref=blef.fr">small post about PyAirbyte</a>.</li><li><a href="https://github.com/cloudquery/cloudquery?ref=blef.fr">CloudQuery</a> — Written in Go, with YAML-driven configuration to move data.</li><li><a href="https://github.com/bruin-data/ingestr?ref=blef.fr">ingestr</a> — ingestr is a CLI tool to copy data between any databases with a single command, seamlessly. It's built on top of dlt.</li><li><a href="https://github.com/slingdata-io/sling-cli?ref=blef.fr">Sling</a> — Sling is a CLI tool that extracts data from a source storage/database and loads it into a target storage/database. Written in Go.</li><li>Let's not forget <a href="https://github.com/meltano/meltano?ref=blef.fr">Meltano</a>.</li></ul><p>We see a pattern here: when we talk about extract and load there are 2 kinds of sources, databases and APIs, and being able to do both correctly is the key.</p><p>On the other side of the movement there is a new open-source reverse-ETL technology called <a href="https://github.com/Multiwoven/multiwoven?ref=blef.fr">Multiwoven/multiwoven</a>. This is built in Ruby (haha).
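To make the "write disposition" idea from the dlt bullet above concrete: a "merge" disposition is conceptually an upsert keyed on the primary key. This is a stdlib-only sketch of that behaviour with invented rows, not dlt's actual implementation (dlt does this inside the destination database).

```python
def merge(existing, incoming, primary_key):
    """Upsert incoming rows into existing rows: a row replaces any
    existing row sharing its primary key, otherwise it is appended.
    This is the contract a 'merge' write disposition promises."""
    by_key = {row[primary_key]: row for row in existing}
    for row in incoming:
        by_key[row[primary_key]] = row
    return list(by_key.values())

# Hypothetical destination table and a new extracted batch.
destination = [{"id": 1, "status": "open"}, {"id": 2, "status": "open"}]
batch = [{"id": 2, "status": "closed"}, {"id": 3, "status": "open"}]

print(merge(destination, batch, "id"))
# id 2 is updated in place, id 3 is appended, id 1 is untouched
```

"Replace" and "append" dispositions are the two degenerate cases: drop everything and keep only the batch, or concatenate blindly.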
At the moment it can sync to Facebook, Salesforce and Slack.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/photo-1590329431219-34a09cabf8b2?q=80&amp;w=1000&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D" class="kg-image" alt="green trees and plants under blue sky and white clouds during daytime" loading="lazy" width="1000" height="667" srcset="https://images.unsplash.com/photo-1590329431219-34a09cabf8b2?q=80&amp;w=600&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D 600w, https://images.unsplash.com/photo-1590329431219-34a09cabf8b2?q=80&amp;w=1000&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D 1000w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Rare footage of a Roman extract and load pipeline (</span><a href="https://unsplash.com/photos/green-trees-and-plants-under-blue-sky-and-white-clouds-during-daytime-G-jzc1YTk4M?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li><a href="https://more-than-numbers.count.co/p/your-first-90-days-as-a-head-of-data?ref=blef.fr">Your first 90 days as a head of data</a> — Handbook and roadmap with pragmatic insights on what to do in your new journey as head of data.</li><li><a href="https://blog.det.life/career-pathways-of-data-engineers-2bc4465483d0?ref=blef.fr">Career pathways of data engineers</a> — IC, manager, being a data engineer or a data full stack. It covers great topics.</li><li>Google Search <em>filetype:pdf</em> was not working for a moment — The internet panicked and believed Google's downfall was continuing.
But actually <a href="https://twitter.com/searchliaison/status/1762866266585620922?ref=blef.fr">it was a bug</a>.</li><li><a href="https://cloud.google.com/bigquery/docs/working-with-time-series?ref=blef.fr">BigQuery time series data</a> — BigQuery now supports time series analyses.</li><li><a href="https://applytitan.com/?ref=blef.fr">Snowflake access management</a> — I've already shared Teej's work in the past, but now he has launched a company to solve Snowflake access management using code. I bet it's going to become the best solution out there for this issue.</li><li><a href="https://www.rilldata.com/blog/operational-bi-embedded-dashboards-for-clickhouse?ref=blef.fr">Rill dashboards for ClickHouse</a> — Rill now works with ClickHouse.</li><li><a href="https://maxhalford.github.io/blog/fast-poetry-pre-commit-github-actions/?ref=blef.fr">Fast Poetry and pre-commit with GitHub Actions</a> — An efficient and useful GitHub Actions setup to cache Poetry installs in the CI.</li><li><a href="https://www.bigquerycost.com/?ref=blef.fr">BigQuery cost dashboard app</a> — Hashboard developed a dashboard to help you follow your BigQuery costs, and it's free. It's built with Hashboard, which is a BI tool. Even if you don't use it, it gives good ideas about what to track.</li><li><a href="https://doordash.engineering/2024/02/27/introducing-doordashs-in-house-search-engine/?ref=blef.fr">Introducing DoorDash’s in-house search engine</a> — A custom search engine built on top of S3.</li></ul><p></p><h3 id="tech-stuff">Tech stuff</h3><ul><li><a href="https://github.com/adidas/lakehouse-engine?ref=blef.fr">Adidas Lakehouse engine</a>.</li><li><a href="https://luminousmen.com/post/why-apache-spark-rdd-is-immutable/?ref=blef.fr">Why Apache Spark RDD is immutable?</a></li><li><a href="https://medium.com/israeli-tech-radar/high-order-and-partially-applied-functions-in-python-0c9fa0459089?ref=blef.fr">High-order and partially applied functions in Python</a>.</li></ul><hr><p>See you next week ❤️</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 24.08 ]]></title>
                    <description><![CDATA[ Data News #24.08 — Presentation about Engines leading to DuckDB, Gemma and Gemini, Mistral Next, MDS follow-up and more. ]]></description>
                    <link><![CDATA[ /data-news-week-24-08/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 65d86faf89fbd00001d59fcb ]]></guid>
                    <pubDate><![CDATA[ 2024-02-23 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/photo-1472068996216-8c972a0af9bd?q=80&amp;w=1000&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D" class="kg-image" alt="woman sitting on bed with flying books" loading="lazy" width="1000" height="664" srcset="https://images.unsplash.com/photo-1472068996216-8c972a0af9bd?q=80&amp;w=600&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D 600w, https://images.unsplash.com/photo-1472068996216-8c972a0af9bd?q=80&amp;w=1000&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D 1000w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">My ideas these days (</span><a href="https://unsplash.com/photos/woman-sitting-on-bed-with-flying-books-yHG6llFLjS0?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p>Hey, fresh Data News edition. This week I participated in a round table about data and did a cool presentation about <a href="https://docs.google.com/presentation/d/1b-MpgqdNuGvlVqMV0WoOpQp7sSdJ9jgLK2Xhd8nzmNA/edit?usp=sharing&ref=blef.fr">Engines</a>. The idea was to depict the history of engines over the last 40 years and what led to Polars and DuckDB. 
Obviously I forgot a few things and I'll do a more complete v2 soon.</p><p>This is my third presentation about DuckDB in the last 3 months and I think I'll slow down a bit until I get new crazy things to share.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2024/02/Screenshot-2024-02-21-at-14.15.27.png" class="kg-image" alt="" loading="lazy" width="2000" height="1125" srcset="https://www.blef.fr/content/images/size/w600/2024/02/Screenshot-2024-02-21-at-14.15.27.png 600w, https://www.blef.fr/content/images/size/w1000/2024/02/Screenshot-2024-02-21-at-14.15.27.png 1000w, https://www.blef.fr/content/images/size/w1600/2024/02/Screenshot-2024-02-21-at-14.15.27.png 1600w, https://www.blef.fr/content/images/2024/02/Screenshot-2024-02-21-at-14.15.27.png 2158w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Engines evolution (me)</span></figcaption></figure><p>There are 3 points that triggered discussion about the visualisation I made:</p><ul><li>What about Arrow? — Apache Arrow is an awesome library that has powered a lot of innovation in the data space in recent years. But UX is where DuckDB differs from the others: its user experience is insanely magical. So yeah. For sure I'll add Arrow in the v2.</li><li>Spark future — I'm convinced that Apache Spark will have to transform itself if it is not to disappear (disappear in the sense of Hadoop, still present but niche). This is already happening, according to the feedback I've had, but Spark requires more infrastructure and investment, which will continue to drive adoption down, whereas the current trend is towards simplification.</li><li>JVM vs. SQL data engineer — There's a big discussion in the community about what real data engineering is. Is it Java/Scala or Python? Is it DataFrames or SQL? Is it lake or warehouse? 
It's a sterile debate: both are useful and can serve different organisations with different service levels for data users and stakeholders. Still, as you know, I prefer SQL/Python data engineering.</li></ul><p>Small reminder, I'm partnering with&nbsp;<a href="https://conference-mlops.com/?ref=blef.fr">La Conférence MLOps</a>, a half-day conference on the challenges of industrialising AI. It will take place on March 7 in Paris. The list of speakers includes many important figures from the French data ecosystem, and I'm very excited about it. You can get a ticket with a 40% discount with the following promo code: <strong>mlops-blef-40</strong>. We have only a few seats left.</p><p></p><h1 id="ai-news-%F0%9F%A4%96">AI News  🤖</h1><ul><li>Mistral AI will release Mistral Next, a ChatGPT alternative, next week. We don't have a lot of detail because it has not been announced publicly—I got the news in a French political newspaper. Still, you can test mistral-next on <a href="https://chat.lmsys.org/?ref=blef.fr">lmsys</a>. Here is a <a href="https://medium.com/@ingridwickstevens/mistral-next-first-impressions-of-mistrals-latest-stealth-release-73086187a656?ref=blef.fr">first review</a>.</li><li><a href="https://blog.google/technology/developers/gemma-open-models/?ref=blef.fr">Google releases Gemma</a> — Gemma is a family of <em>open models</em>. Available in 2 sizes, 2B and 7B, it seems to have baseline performance comparable to Llama-2.</li><li>The same <a href="https://www.livemint.com/technology/tech-news/us-presidential-candidate-vivek-ramaswamy-slams-google-gemini-globally-embarrassing-rollout-blatantly-racist-11708660194299.html?ref=blef.fr">Google got a backlash</a> after the Gemini image generation rollout — Conservative people on social networks were upset because Gemini wasn't capable of generating images of white people. 
Google rolled back Gemini pending further improvements.</li><li><a href="https://artificialanalysis.ai/models?ref=blef.fr">Models comparison across key metrics</a> — I found it via <a href="https://www.linkedin.com/feed/update/urn:li:activity:7166149579208441858/?updateEntityUrn=urn%3Ali%3Afs_updateV2%3A%28urn%3Ali%3Aactivity%3A7166149579208441858%2CFEED_DETAIL%2CEMPTY%2CDEFAULT%2Cfalse%29&ref=blef.fr">Guido on LinkedIn</a>; it shows a lot of cool metrics, like the price per token, the speed or the model quality.</li></ul><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li><a href="https://www.youtube.com/watch?v=eCI1wtLw4Fo&ref=blef.fr">Is the modern data stack dead?</a> — This is a follow-up podcast of Tristan Handy with Matt Turck—famous VC guy producing the <a href="https://mad.firstmark.com/?ref=blef.fr">MAD landscape</a>—following last week's post about the MDS. In this 40-minute podcast they chat in more detail about the dynamics behind the end of the MDS hype, AI implications and the future of analytics engineering work.</li><li><a href="https://www.blef.fr/modern-data-stack-disappearing/">Is the modern data stack disappearing?</a> —&nbsp;An article I wrote 4 days ago as an answer to the trend. Pragmatic and easy to read. Essentially I analyse why the semantics of "modern" is an issue.</li><li><a href="https://www.youtube.com/watch?v=cyZfpXxXojE&ref=blef.fr">State of the Duck</a> — The introductory keynote of the DuckCon that gives an overview of the current ecosystem and what's to come.</li><li><a href="https://tabular.io/blog/pyiceberg-0-6-0-write-support/?ref=blef.fr">PyIceberg 0.6.0: Write support</a> — Yeah, finally I'll be able to play a bit more with Iceberg. Still, you need a catalog to make it work.</li><li><a href="https://marcogorelli.github.io/polars-plugins-tutorial/?ref=blef.fr">How you can write a Polars plugin</a> — A dedicated website that explains how to write Polars plugins to extend the library's capabilities. 
In order to do it you'll have to write Rust and Python code. This is a good way to enter the Rust world, I guess.</li><li><a href="https://dataengineeringcentral.substack.com/p/unit-testing-for-data-engineers-43b?ref=blef.fr">Unit testing for data engineers</a> — Daniel describes what you need to know as a data engineer to write tests. He mainly covers BDD (behavior-driven development) as opposed to TDD (test-driven development).</li><li><a href="https://blog.det.life/i-spent-another-6-hours-understanding-the-design-principles-of-snowflake-heres-what-i-found-dea9fd74ae96?ref=blef.fr">Understand the design principles of Snowflake</a> — Someone took a few hours to understand Snowflake internals and this is a great wrap-up.</li><li><a href="https://engineering.fb.com/2024/02/20/developer-tools/velox-apache-arrow-15-composable-data-management/?ref=blef.fr">Aligning Velox and Apache Arrow</a> — Goes deeper into memory management and how you can create open standards across the different libraries.</li><li><a href="https://engineering.grab.com/enabling-near-realtime-data-analytics?ref=blef.fr">Enabling near real-time data analytics on the data lake</a> — Grab showcases what they did with Flink and Hudi to enable real-time use-cases.</li><li><a href="https://arxiv.org/abs/2402.06282?ref=blef.fr">Retrieve, merge, predict: augmenting tables with data lakes</a> — A paper that explains how you can improve data discovery on data lakes to finally augment a given table with new data. 
I did not read the paper except the introduction and the first schema, but it looks awesome.</li></ul><figure class="kg-card kg-image-card"><img src="https://www.blef.fr/content/images/2024/02/Screenshot-2024-02-23-at-16.20.08.png" class="kg-image" alt="" loading="lazy" width="1286" height="776" srcset="https://www.blef.fr/content/images/size/w600/2024/02/Screenshot-2024-02-23-at-16.20.08.png 600w, https://www.blef.fr/content/images/size/w1000/2024/02/Screenshot-2024-02-23-at-16.20.08.png 1000w, https://www.blef.fr/content/images/2024/02/Screenshot-2024-02-23-at-16.20.08.png 1286w" sizes="(min-width: 720px) 720px"></figure><ul><li><a href="https://datamonkeysite.com/2024/02/22/building-a-cost-effective-solution-using-fabric/?ref=blef.fr">Building a cost effective solution using&nbsp;Fabric</a> — Another look at Fabric. In the end the author creates a workspace and transforms the data with pandas and DuckDB in notebooks. Thank you Microsoft.</li><li><a href="https://www.decodable.co/blog/checkpoint-chronicle-february-2024?ref=blef.fr">A newsletter about the streaming data space</a> — Robin collected a lot of cool articles about the streaming ecosystem.</li></ul><h3 id="cool-ideas">Cool ideas</h3><ul><li><a href="https://www.brainfart.dev/blog/foss-state-in-2024?ref=blef.fr">Open-source, current state and future hopes</a>.</li><li><a href="https://mikkeldengsoe.substack.com/p/data-will-not-tell-you-what-to-do?ref=blef.fr">Data will not tell you what to do</a>.</li><li><a href="https://medium.com/ft-product-technology/turning-ideas-into-ai-use-cases-the-product-manager-point-of-view-f5e4aa7fe0af?ref=blef.fr">Turning ideas into AI use cases</a> — the Product Manager point of view.</li></ul><p></p><h1 id="data-economy-%F0%9F%92%B0">Data Economy 💰</h1><ul><li><a 
href="https://techcrunch.com/2024/02/19/struggling-database-company-mariadb-could-be-taken-private-in-a-37m-deal/?ref=blef.fr"><strong>MariaDB</strong> takeover at $37m</a>. MariaDB is a public company and could be taken private by an investment company.</li><li><a href="https://www.neurelo.com/?ref=blef.fr"><strong>Neurelo</strong></a> <a href="https://www.businesswire.com/news/home/20240131575739/en/Neurelo-Data-Access-Platform-S%5B%E2%80%A6%5Difies-and-Accelerates-Modern-Cloud-Application-Development?ref=blef.fr">raises $5m seed</a> to provide HTTP APIs on top of databases (PostgreSQL, MongoDB and MySQL). We can see it as a semantic layer, but on the software engineering side.</li><li><a href="https://www.motifanalytics.com/?ref=blef.fr"><strong>Motif Analytics</strong></a> <a href="https://techcrunch.com/2024/02/12/motif-analytics-brings-sequence-analytics-to-growth-teams/?ref=blef.fr">raises $5.7m seed</a>. This is a tool made to analyse sequences, especially useful in web analytics / acquisition. They provide tooling to do it without writing awful SQL queries.</li></ul><hr><p>See you next week ❤️.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Is the modern data stack disappearing? ]]></title>
                    <description><![CDATA[ Today we answer the most important question. Is the modern data stack coming to an end? ]]></description>
                    <link><![CDATA[ /modern-data-stack-disappearing/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 65d343b3a0ae420001db5822 ]]></guid>
                    <pubDate><![CDATA[ 2024-02-19 ]]></pubDate>
                    <content>
<![CDATA[ <p></p><p><strong>No.</strong></p><p></p><hr><p>This question generated a lot of content last week, and a lot of words were written. I wanted to keep my answer short so as not to burden you with a few thousand more words to read.</p><ul><li>Modern data stack was coined by US companies and VCs—mainly Fivetran / dbt Labs—as a term to quickly emphasise a way to build a data stack in the cloud, centered on ELT. It was a well-suited marketing term, let's be honest.</li><li>The time came and everyone took their place at the table to eat a slice of cake.</li><li>A lot of people have issues with the <strong>modern</strong> word. Probably because it's not an explicit semantic; the definition is <em>relating to the present or recent times as opposed to the remote past</em>. In this definition there are 2 issues.<ul><li><em>is relating to the present</em> — not all companies are in the same present</li><li><em>as opposed</em> — the term creates an opposition between 2 worlds, creating something we always like in tech: a debate between 2 kinds of technologies.</li></ul></li><li>Actually modern creates some kind of exclusion between new technologies and old technologies. It was useful at first for Fivetran or dbt Labs to be disruptive, but now that everyone is using the MDS, is it still a good idea to create this competition? Especially if you want to enter the Fortune 500, where they actually use old tech?</li><li>And, we should stop being cynical: who in the hell—among my readers at least—wants to work with SAP, IBM or mainframes in 2024? Because they still exist; numbers show that <a href="https://www.statista.com/statistics/1308367/share-server-format-companies-worldwide/?ref=blef.fr">50% of companies are still on-premise</a>, and when it comes to publicly listed companies or government stuff it's probably way higher.</li><li>For these organisations the ideal of a modern data stack still resonates. Employees are stuck in hell regarding data tooling. 
Data projects are still failing to go to production.</li><li>Personally, my vision of the modern data stack has always changed over the years. As always, I don't blindly apply the principles by the book. The idea of a dedicated storage where all the data lies, with SQL transformations on top, top-notch CI/CD processes with everything-as-code, and a galaxy of convenient tools around for observability, sums up what's modern about our data ecosystem.</li><li>That's why I think the modern data stack vision isn't going anywhere.</li></ul><p>In my four years of freelancing, I've always said I build <strong>data platforms</strong> or <strong>data stacks</strong>, because who am I to judge whether I'm modern?</p><hr><p>As a reference, read my online friends' views:</p><ul><li><a href="https://roundup.getdbt.com/p/is-the-modern-data-stack-still-a?ref=blef.fr">Is the "Modern Data Stack" still a useful idea?</a> — Tristan Handy, dbt Labs CEO. He coined the term and is now whistling the end of playtime. MDS was previously useful to align practices but now he thinks we should move on to the <strong>analytics stack</strong>. And AI is around the corner to take all the spotlight while we actually do stuff at the bottom of the pyramid.</li><li><a href="https://benn.substack.com/p/the-problem-was-the-product?ref=blef.fr">The problem was the product</a> — Benn Stancil, Mode CTO; scroll to mid-article. </li><li><a href="https://joereis.substack.com/p/everything-ends-my-journey-with-the?ref=blef.fr">Everything Ends - My journey with the modern data stack</a> — Joe Reis, author of Fundamentals of Data Engineering. Joe depicts his own journey and views and why it became a mess with too many companies on the radar, finally creating the most fragmented platforms with no coherence at all, negating all the good MDS aspects.</li></ul> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 24.07 ]]></title>
                    <description><![CDATA[ Data News #24.07 — OpenAI Sora, Gemini, boximator, models competition is fierce, new Observable and BI as Code and more stuff. ]]></description>
                    <link><![CDATA[ /data-news-week-24-07/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 65c71ec598601300016951e0 ]]></guid>
                    <pubDate><![CDATA[ 2024-02-16 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/photo-1606140955270-43cc65e06870?q=80&amp;w=1000&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D" class="kg-image" alt="cars parked on side of the road near building during daytime" loading="lazy" width="1000" height="667" srcset="https://images.unsplash.com/photo-1606140955270-43cc65e06870?q=80&amp;w=600&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D 600w, https://images.unsplash.com/photo-1606140955270-43cc65e06870?q=80&amp;w=1000&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D 1000w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Italy Sora (</span><a href="https://unsplash.com/photos/cars-parked-on-side-of-the-road-near-building-during-daytime-Bf49iOwtpWA?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p>Hey you, time for the Data News. Because I did not send the news last week you will get articles from the last 2 weeks. The last few days have been heavily packed with AI news as well.</p><p><em>Disclaimer, the 2 events below will be in French.</em></p><p>Before jumping to the news there are a few events I want to write about. Next Wednesday I will participate in a <a href="https://www.linkedin.com/feed/update/urn:li:activity:7164176804176478208/?ref=blef.fr">Data Night Talk</a>, an open discussion about AI &amp; data engineering with other content creators. We will do it online / in-person. So tune in. 
I'm working on a 10-minute light talk about data engines (🦆) and a funny game.</p><p>✨ Second, I'm partnering with <a href="https://conference-mlops.com/?ref=blef.fr">La Conférence MLOps</a>, a half-day conference on the challenges of industrializing AI. It will take place on March 7 in Paris. The list of speakers includes many important figures from the French data ecosystem, and I'm very excited about it. I may give a talk—but I'm not sure yet.</p><p><em>PS: This is not a paid partnership with nibble—the company behind the conference—but in fact they've been a client of mine for 2 years and have been a huge supporter of the newsletter since day one and they're good humans. I'm happy to partner with them.</em></p><p></p><h1 id="ai-news-%F0%9F%A4%96">AI News 🤖</h1><p>The last few days have been stacked in terms of AI news. Have fun getting through everything 😊, it's really cool how fast things are improving.</p><ul><li><a href="https://openai.com/sora?ref=blef.fr">OpenAI Sora, generate 1-minute long videos</a> — OpenAI released yesterday a new generative model that is able to create 1-minute long videos from a text prompt. At the moment this is not public and only in the hands of a limited set of testers, but the first look shows that OpenAI might already be ahead of the competition. You can see a few videos on their landing page as well as the current limitations.</li><li><a href="https://boximator.github.io/?ref=blef.fr">ByteDance boximator, create motion on images</a> — Boximator is a friendly method to instruct generative algorithms with boxes. With the boxes you define a motion on an image and the models create a video out of it. ByteDance is the company behind TikTok. 
At the end of the page you have a comparison of current generative video models, so you can form your own opinion compared to Sora.</li><li><a href="https://ai.meta.com/blog/v-jepa-yann-lecun-ai-model-video-joint-embedding-predictive-architecture/?ref=blef.fr">Meta V-JEPA, fills the void in videos</a> —&nbsp;V-JEPA is not a generative model. With this model you can "fill in the blanks of what’s missing—not in terms of the actual pixels, but rather as a more abstract description in this representation space".</li><li><a href="https://www.wsj.com/tech/ai/sam-altman-seeks-trillions-of-dollars-to-reshape-business-of-chips-and-ai-89ab3db0?ref=blef.fr">Sam Altman wants to raise $7 trillion</a> — It was the WSJ news of last week about the OpenAI CEO seeking 7,000,000,000,000 dollars. Obviously everyone tried to guess why he wants this money—which could come from the UAE government—probably to enter the semi-conductor industry. I'm happy we correctly use money to save our planet /s.</li><li><a href="https://blogs.nvidia.com/blog/canada/?ref=blef.fr">Canada partners with NVIDIA to bring more computing power</a>.</li><li><a href="https://wow.groq.com/why-groq/?ref=blef.fr">Groq, which speeds up LLM inference</a> — This week I discovered Groq, a company that created the LPU™ inference engine, claiming to be the <a href="https://wow.groq.com/groq-lpu-inference-engine-crushes-first-public-llm-benchmark/?ref=blef.fr">most efficient</a> cloud-based provider (18x faster) in terms of tokens/s.</li><li><a href="https://techcrunch.com/2024/02/15/googles-new-ai-hub-in-paris-proves-that-google-feels-insecure-about-ai/?ref=blef.fr">Google announced a "new" AI hub in Paris</a> — TechCrunch is mocking the US company about this announcement, considered a communication effort, because the 300 members of the new AI hub are already working for Google and it's just a new office space.</li><li><a href="https://blog.google/technology/ai/google-gemini-next-generation-model-february-2024/?ref=blef.fr">Still 
Google announced Gemini 1.5</a> — Gemini is the <a href="https://blog.google/products/gemini/bard-gemini-advanced-app/?ref=blef.fr">new name of Bard</a> (as of last week) and blablabla Gemini is awesome blablabla. It's crazy how outdated Google feels compared to smaller AI companies in terms of the hype or magic they are able to build. <a href="https://twitter.com/JeffDean/status/1758146022726041615?ref=blef.fr">More details on Twitter</a>.</li><li><a href="https://www.linkedin.com/feed/update/urn:li:activity:7161088523264020480/?ref=blef.fr">HuggingFace model usage on HuggingChat</a> —&nbsp;HuggingChat is a chat UI from HF that lets you play with whatever model works with it. It depicts how fluid the model market is. Mainly it shows that Mistral is currently replacing LLaMa. You can also see the models' "market share" among the major APIs / cloud vendors in a nice <a href="https://miro.com/app/board/uXjVNz_4nrc=/?ref=blef.fr">Sankey diagram</a>.</li><li><a href="https://www.nvidia.com/en-us/ai-on-rtx/chat-with-rtx-generative-ai/?ref=blef.fr">NVIDIA Chat with RTX</a> — It's a Windows app (~37GB) that locally runs a GPT model to unlock chat with your files in a secure way, out of the cloud. Happy gamers.</li><li><a href="https://gitlab.adullact.net/dgfip/projets-ia/llamandement?ref=blef.fr">French Finance ministry released an LLM to summarise legislative proposals</a> — Called LLaMandement, it's a fine-tuned LLaMa designed to produce neutral summaries of law proposals to help the government with preliminary notes. All the data used for the fine-tuning is available on GitLab, as is the FastChat command used. 
Here is the English paper on <a href="https://arxiv.org/abs/2401.16182?ref=blef.fr">arxiv</a>.</li><li><a href="https://medium.com/artefact-engineering-and-data-science/why-you-need-llmops-48c0925827de?ref=blef.fr">Why you need LLMOps</a> — A great post that encapsulates all the words needed to understand what needs to be done when it comes to putting LLMs in production.</li></ul><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li>❤️ <a href="https://observablehq.com/blog/observable-2-0?ref=blef.fr">Observable 2.0</a> — Observable has a special place in my heart. Observable was created by Mike Bostock, the creator of D3js, which is my madeleine de Proust. Today they announced the 2.0 version, which is mainly Charts as Code. It goes beyond notebooks and becomes a static site generator for building fast, beautiful data apps, dashboards, and reports. I'm so excited to play with it.</li><li><a href="https://evidence.dev/blog/why-we-built-usql/?ref=blef.fr">Introducing universal SQL</a> — I have to talk about Evidence now, which is also a static site generator for building data front-ends. They introduce universal SQL as a way to connect to all kinds of data sources, adding interaction in the frontend while staying fast. Mainly it means data is exported to Parquet and computed with DuckDB WASM in the browser.</li><li><a href="https://astral.sh/blog/uv?ref=blef.fr">uv: Python packaging in Rust</a> — I've been using Poetry for the last 2 years and I'm quite satisfied with it. Seeing a new kid on the block is good because it renews the ideas; let's be honest, we will never have a de facto packaging tool.</li><li><a href="https://github.com/quarylabs/sqruff?ref=blef.fr">sqruff: SQL linter written in Rust</a> — This is the result of the Rust hype: people are now porting more and more tools to Rust for efficiency. 
And it's for the better.</li><li><a href="https://motherduck.com/blog/introducing-column-explorer/?ref=blef.fr">Introducing the column explorer in MotherDuck</a> — A cool feature in MotherDuck (DuckDB in the cloud) that adds sparklines and column distributions when looking at a dataset.</li><li>dbt Labs announced a few new things in their dbt Explorer (which is only available in dbt Cloud). In a nutshell they announced <a href="https://docs.getdbt.com/blog/dbt-explorer?ref=blef.fr#wheres-this-data-coming-from">column-level lineage</a>, <a href="https://docs.getdbt.com/blog/dbt-explorer?ref=blef.fr#recommendations">recommendations</a> and <a href="https://www.getdbt.com/blog/announcing-exports-for-the-dbt-semantic-layer?ref=blef.fr">semantic layer exports</a>. It's fair to say that the lineage is powered by SQLGlot (and <a href="https://twitter.com/Captaintobs/status/1757601463852023876?ref=blef.fr">Toby is not happy about it</a>).</li><li><a href="https://cloud.google.com/blog/products/data-analytics/introducing-new-vector-search-capabilities-in-bigquery?hl=en&ref=blef.fr">Introducing vector search in BigQuery</a> — RAG is everywhere and BigQuery enters the game.</li><li><a href="https://docs.snowflake.com/en/user-guide/tasks-intro?ref=blef.fr#label-billing-task-runs">Snowflake lowers the cost of tasks from x1.5 to x1.2.</a></li></ul><p></p><h1 id="engineering-%E2%9A%99%EF%B8%8F">Engineering ⚙️</h1><ul><li><a href="https://substack.timodechau.com/p/eventify-everything-data-modeling?ref=blef.fr">Eventify everything</a> — This is an ode to event modeling and a different way to think about data modeling. 
Timo showcases how you can eventify your data model to think differently about your business activity.</li><li><a href="https://medium.com/apache-airflow/what-we-learned-after-running-airflow-on-kubernetes-for-2-years-0537b157acfd?ref=blef.fr">What we learned after running Airflow on Kubernetes for 2 years</a> — Outstanding article with great insights about the journey of running Airflow in production. It breaks down how to handle dynamic DAG generation, multiple DAG repositories, configuration fine-tuning and observability. </li><li><a href="https://medium.com/@cautaerts/a-dataframe-is-a-bad-abstraction-8b2d84fa373f?ref=blef.fr">A dataframe is a bad abstraction</a> — The article is too long for me to read right now but the title is catchy enough for me to put it in the newsletter. If you read it I'm curious to know what you think about it.</li><li><a href="https://engineering.backmarket.com/back-markets-journey-towards-data-self-service-89b278d6617a?ref=blef.fr">Back Market’s journey towards data self-service</a> — How to be a data self-service company and which initiatives they tried on this journey.</li><li><a href="https://moderndatanetwork.medium.com/how-to-leverage-metabase-for-efficient-self-service-analytics-ac60f855299c?ref=blef.fr">How to leverage Metabase for efficient self-service analytics?</a> — 3 companies joined to share tips about Metabase governance and monitoring. This is a goldmine if you're using Metabase and struggling to understand how users are using it.</li><li><a href="https://blog.sdf.com/p/automating-data-classification-for?ref=blef.fr">Automating data classification for the 21st Century</a> — How Semantic Data Fabric (SDF) is able to statically infer data types and lineage out of a SQL codebase. In a sense SDF is a dbt alternative. 
I really like this article.</li></ul><p></p><h1 id="food-for-thoughts-%F0%9F%8D%B1">Food for thoughts 🍱</h1><ul><li><a href="https://www.bvp.com/atlas/what-founders-need-to-know-to-build-a-high-performing-data-team?ref=blef.fr">What founders need to know to build a high-performing data team</a> — Centralised, distributed or hybrid data team? This article discusses it.</li><li><a href="https://petrjanda.substack.com/p/elevate-the-role-of-analytics-in?ref=blef.fr">Let's elevate the role of analytics</a>.</li><li><a href="https://www.synq.io/blog/data-ownership-guide?ref=blef.fr">Data ownership: A practical guide</a>.</li><li><a href="https://roundup.getdbt.com/p/is-the-modern-data-stack-still-a?ref=blef.fr">Is the Modern Data Stack still a useful idea</a> — I'll write more about this later; I guess this is too big to write my views in a bullet point. The article is about Tristan Handy's views on the future of the modern data stack.</li></ul><hr><p>See you next week ❤️.</p><p><em>PS: I'd love to get your feedback about the newsletter, whether you recently joined or have been here since the beginning. I'd also love to understand who you are and what would eventually make you financially support my content creation activity.</em></p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 24.05 ]]></title>
                    <description><![CDATA[ Data News #24.05 — text-to-sql problem, state of French data market, DuckCon and the fast news. ]]></description>
                    <link><![CDATA[ /data-news-week-24-05/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 65b3e5c03e7aba00011c9a2f ]]></guid>
                    <pubDate><![CDATA[ 2024-02-03 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/photo-1576924542622-772281b13aa8?q=80&amp;w=1000&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D" class="kg-image" alt="a group of boats that are sitting in the water" loading="lazy" width="1000" height="667" srcset="https://images.unsplash.com/photo-1576924542622-772281b13aa8?q=80&amp;w=600&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D 600w, https://images.unsplash.com/photo-1576924542622-772281b13aa8?q=80&amp;w=1000&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D 1000w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">hey (</span><a href="https://unsplash.com/photos/a-group-of-boats-that-are-sitting-in-the-water-3Ze88tZX-p0?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p>Hello here, this is Christophe from Amsterdam. I hope you're doing well. I'm in Amsterdam for the day for DuckCon #4, the DuckDB annual conference, and god I like Europe. Being able to travel by train from Berlin to Paris to Amsterdam while going to the west of France for a lecture in a week is something truly awesome.</p><p>Anyway this week will be a mixed Data News with links, stuff and ideas, and a small wrap-up of the DuckCon + the stuff I presented on Wed. at a Modern Data Stack meetup in Paris about DuckDB WASM. I hope you'll enjoy it.</p><p></p><h1 id="the-text-to-sql-problem">The text-to-sql problem</h1><p>Every once in a while people give a shot at the text-to-SQL problem. Each time a new breakthrough happens (meaning a new LLM), companies launch and people try. 
2 weeks ago TextQL <a href="https://www.textql.com/blog/announcing-our-4-1m-fundraise?ref=blef.fr">raised a $4.1m seed</a> trying to solve this issue.</p><p>But what problem are we trying to solve?</p><p>In fact, I think we're trying to solve two different problems. The first is self-service: we want our stakeholders to be able to access information on their own and with no errors, once again chasing the dream that our clients can navigate the data jungle on their own; in fact this problem is "text-to-insights". And there's the second part of the problem which is much simpler, a data copilot, a tool that accelerates the productivity of data workers by bootstrapping SQL writing or analysis.</p><p>Obviously when it comes to self-service we need a layer that does a text-to-SQL conversion. In the current hype cycle it can be done with LLMs, like <a href="https://motherduck.com/blog/duckdb-text2sql-llm/?ref=blef.fr">DuckDB-NSQL-7B</a>, the one MotherDuck provided recently. Like every model you have to analyse the <a href="https://arxiv.org/abs/2401.12379?ref=blef.fr">efficiency</a> of these generation layers. </p><p>From my own little experiments in this field, here is what I can say: a generation layer can behave like an analyst but will be way more stupid than an analyst. I mean, an LLM cannot get a thousand-line query right the first time; like an analyst, it has to work incrementally, either through prompting for the LLM or through test-and-run for the analyst. </p><p>But there is something that limits the LLM: its business understanding. 
Even if you give your LLM access to the database, the codebase and the docs, there is something the LLM does not have: the implicit (vocal) business rules that are written nowhere.</p><p>I have 2 things for the conclusion:</p><ul><li>Have a look at what Alan did as a <a href="https://www.linkedin.com/feed/update/urn:li:activity:7156555784661704704/?updateEntityUrn=urn%3Ali%3Afs_updateV2%3A%28urn%3Ali%3Aactivity%3A7156555784661704704%2CFEED_DETAIL%2CEMPTY%2CDEFAULT%2Cfalse%29&ref=blef.fr">Copilot / Metabase bot</a> to help people get insights — by people, in this case, it means the CEO, who is explicitly saying on LinkedIn "<em>It's incredible I don't need to ask anymore my People or Data Analysts team</em>" — 😬</li><li><strong>Having a data catalog does not mean that people know what to do with the data</strong> [they just know it exists] — this is like an aggregate of quotes from my Wed. conference.</li></ul><p></p><h1 id="state-of-the-french-data-market">State of the French data market</h1><p>2 benchmarks have been published recently about the French data market.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2024/02/Screenshot-2024-02-03-at-10.33.01.png" class="kg-image" alt="" loading="lazy" width="1764" height="384" srcset="https://www.blef.fr/content/images/size/w600/2024/02/Screenshot-2024-02-03-at-10.33.01.png 600w, https://www.blef.fr/content/images/size/w1000/2024/02/Screenshot-2024-02-03-at-10.33.01.png 1000w, https://www.blef.fr/content/images/size/w1600/2024/02/Screenshot-2024-02-03-at-10.33.01.png 1600w, https://www.blef.fr/content/images/2024/02/Screenshot-2024-02-03-at-10.33.01.png 1764w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">French public market salary grid in data (compared to software engineer) (</span><a 
href="https://www.numerique.gouv.fr/uploads/Circulaire%20n%C2%B06434-SG%20du%203%20janvier%202024%20-%20r%C3%A9f%C3%A9rentiel%20num%C3%A9rique.pdf?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">source</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><ul><li>The public sector released their <a href="https://www.numerique.gouv.fr/uploads/Circulaire%20n%C2%B06434-SG%20du%203%20janvier%202024%20-%20r%C3%A9f%C3%A9rentiel%20num%C3%A9rique.pdf?ref=blef.fr">salary grid for all tech workers</a> — this is in French but scroll to the last page of the PDF to see the table<ul><li>We have 4 experience buckets: &lt;5, &lt;10, &gt;10 and &gt;20 years. Which is completely relevant for the tech / data field I think; only a few people are 20+ years in, from what I see.</li><li>It is crazy how badly data engineers are paid compared to all other positions — especially when you know that other positions end up doing data engineering when there are no data engineers</li><li>The comparison to data scientists is nevertheless not relevant because data scientists very often have PhDs, so it makes sense they start higher than other positions</li><li>What do you think of it?</li></ul></li><li>At the same time the Modern Data Network released the <a href="https://moderndatanetwork.medium.com/how-much-data-professionals-make-in-france-the-mdn-annual-benchmark-0f77f706b79c?ref=blef.fr">annual benchmark of data professionals</a><ul><li>An Analytics Engineer role enters the chat — this is explained by the fact that the MDN is full of startups, where roles evolve faster than elsewhere. 
</li></ul></li><li>It is complicated to compare the 2 benchmarks because the experience ranges are not the same, but we still see similar trends across positions.</li><li>My main takeaway is that the Data Analyst role is finally taking the place it should as a full role and not a transition role before a DS or a DE role — being paid higher than others at entry level for instance.</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2024/02/Screenshot-2024-02-03-at-10.46.31.png" class="kg-image" alt="" loading="lazy" width="1378" height="306" srcset="https://www.blef.fr/content/images/size/w600/2024/02/Screenshot-2024-02-03-at-10.46.31.png 600w, https://www.blef.fr/content/images/size/w1000/2024/02/Screenshot-2024-02-03-at-10.46.31.png 1000w, https://www.blef.fr/content/images/2024/02/Screenshot-2024-02-03-at-10.46.31.png 1378w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Modern Data Network annual benchmark of data professionals (</span><a href="https://moderndatanetwork.medium.com/how-much-data-professionals-make-in-france-the-mdn-annual-benchmark-0f77f706b79c?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">source</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><div class="kg-card kg-callout-card kg-callout-card-grey"><div class="kg-callout-emoji">😥</div><div class="kg-callout-text">Currently there is a huge wave of layoffs in tech startups in Europe and in the US. 
When looking at the numbers on <a href="https://layoffs.fyi/?ref=blef.fr">layoffs.fyi</a>, this January has more layoffs than the last 6 months of 2023.<br><br>If you have been impacted by a layoff and you need help finding your new journey, write to me.</div></div><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><p>To avoid fragmenting the news too much, and because I already wrote a lot, the AI News is blended with the Fast News.</p><ul><li><a href="https://blog.allenai.org/hello-olmo-a-truly-open-llm-43f7e7359222?ref=blef.fr">OLMo, a new open-source LLM</a> — The Allen Institute in Seattle released what they called a truly open LLM. For the first time we have the model, the weights <strong>and the training data</strong>. I can't wait to see how it compares and how people use it.</li><li><a href="https://huggingface.co/spaces/vikhyatk/moondream1?ref=blef.fr">hf/moondream1</a> — This is really awesome: a tiny LLM that can answer questions about a given image.</li><li><a href="https://www.probabl.ai/?ref=blef.fr">:probabl. 
launch</a> — The team behind scikit-learn is joining forces to create a new venture with the goal of maintaining a state-of-the-art data science tooling suite benefiting France, the EU and the world.</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2024/02/Screenshot-2024-02-03-at-11.11.48.png" class="kg-image" alt="" loading="lazy" width="2000" height="1331" srcset="https://www.blef.fr/content/images/size/w600/2024/02/Screenshot-2024-02-03-at-11.11.48.png 600w, https://www.blef.fr/content/images/size/w1000/2024/02/Screenshot-2024-02-03-at-11.11.48.png 1000w, https://www.blef.fr/content/images/size/w1600/2024/02/Screenshot-2024-02-03-at-11.11.48.png 1600w, https://www.blef.fr/content/images/2024/02/Screenshot-2024-02-03-at-11.11.48.png 2068w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Thank you moondream</span></figcaption></figure><ul><li><a href="https://github.com/yohannj/cybersec-ctf-box?ref=blef.fr">github/cybersec-ctf-box</a> —&nbsp;A cybersecurity CTF. A friend of mine created this repo so you can train against a few attacks you might face. The first one is around the Chart.js library.</li><li><a href="https://www.getdbt.com/blog/dbt-labs-names-data-industry-veteran-mark-porter-as-chief-technology-officer?ref=blef.fr">dbt Labs names a new CTO</a> — He was previously CTO at MongoDB.</li><li><a href="https://towardsdatascience.com/dont-fix-bad-data-do-this-instead-d45262444cf2?ref=blef.fr">Don't fix bad data, do this instead</a> — It's never a good idea to apply patches on bad data. Always remember to identify root causes before jumping on the fixing wagon.</li><li><a href="https://medium.com/@bxh_io/our-transformation-journey-toward-an-open-data-platform-b6f869b6a173?ref=blef.fr">Our transformation journey toward an open data platform</a> — Condé Nast data platform walkthrough. 
The platform is built on top of Databricks with a lot of other logos revolving around making sense of the lakehouse platform.</li><li><a href="https://towardsdatascience.com/mastering-airflow-variables-32548a53b3c5?ref=blef.fr">Mastering Airflow variables</a> — All the different techniques to master Airflow variables.</li><li><a href="https://engineering.grab.com/rethinking-streaming-processing-data-exploration?ref=blef.fr">Grab, rethinking stream processing: data exploration</a> — How to unlock analysts' superpowers by giving them the capability to analyse real-time data directly on streams rather than on offloaded lake data.</li><li><a href="https://luminousmen.com/post/how-to-build-highperformance-engineering-teams/?ref=blef.fr">How to build high-performance engineering teams</a> — Get click-baited like me. If I can add one point: step 0 is important, but then you need to give the team enough freedom and vision.</li><li><a href="https://mikkeldengsoe.substack.com/p/the-business-critical-data-warehouse?ref=blef.fr">The business-critical data warehouse</a> — Putting the church back at the center of the village.</li><li><a href="https://www.databricks.com/blog/welcome-data-intelligence-platform-databricks-einblick?ref=blef.fr">Databricks acquires Einblick</a> — It goes back to the text-to-insights problem. Einblick is a drag-n-drop solution to "solve any data problem in one solution". LMAO, marketing teams at their finest.</li></ul><p></p><h1 id="duckcon-my-duck-stuff">DuckCon + my Duck stuff</h1><p>Because this Data News is already too long I split the content into 2 articles. Read my <a href="https://www.blef.fr/duckcon-4-takeaways/" rel="noreferrer">DuckCon takeaways</a> 🦆.</p><p>Still, last Wed. I presented DuckDB to a French audience; during this <a href="https://docs.google.com/presentation/d/1QAME-RYonvNp-qfga2vpXhNtkIg44sdKnVZt4xDaf0E/edit?usp=sharing&ref=blef.fr">presentation</a> I showcased what you can do with DuckDB and DuckDB WASM. 
WASM is a portable way to run DuckDB in the browser.</p><p>You can play with the SQL editor I've worked on <a href="https://blef.fr/mds-criteo?ref=blef.fr" rel="noreferrer">here</a> (mobile + desktop): try to run a small GROUP BY query after loading the tables; everything you do runs on your device. This is the WASM magic. There is also the <a href="https://github.com/Bl3f/parquet-info?ref=blef.fr">Firefox extension</a> that lets you hover over parquet files in the cloud console to get the schema, but more on this later as I plan to push it forward this month.</p><p><em>PS: I'm so happy to have met a few readers IRL, it anchors my content and my work in reality. So once again, to the few people who came to me, thank you so much.</em></p><hr><p>See you next week ❤️.</p><p></p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ DuckCon #4 takeaways ]]></title>
                    <description><![CDATA[ My DuckCon #4 takeaways, enhanced raw notes about the Duck conference that happened on Feb 2 in Amsterdam. DuckDB feels magical and people like it. ]]></description>
                    <link><![CDATA[ /duckcon-4-takeaways/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 65be1704b4a4050001bddae0 ]]></guid>
                    <pubDate><![CDATA[ 2024-02-02 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/photo-1550001683-57add9a997bf?q=80&amp;w=1000&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D" class="kg-image" alt="flock of white ducks on brown soil" loading="lazy" width="1000" height="667" srcset="https://images.unsplash.com/photo-1550001683-57add9a997bf?q=80&amp;w=600&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D 600w, https://images.unsplash.com/photo-1550001683-57add9a997bf?q=80&amp;w=1000&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D 1000w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">A picture of people chatting at DuckCon (</span><a href="https://unsplash.com/photos/flock-of-white-ducks-on-brown-soil-0wk5wCTfyfs?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p>Hey, this is a straightforward post about the ideas and takeaways I got from <a href="https://duckdb.org/2023/10/06/duckcon4.html?ref=blef.fr">DuckCon</a>. I guess the recording will be posted online in a few days / weeks. </p><p>It took place in Amsterdam in a wonderful location. The agenda of the afternoon was quite small (because it is still a small conference) but interesting. There is something awesome about meeting the DuckDB community at this stage. The tool has not yet reached its peak so you meet people who are early adopters and fans of it — it's a nerds (male — diversity might come later I hope) conference actually.</p><h1 id="duckdb-announcements">DuckDB announcements</h1><p>The Duck creators announced that v0.10.1 is coming soon and before the end of July we might get v1.0.0. 
DuckDB adoption numbers are demonstrating a real trend behind the "hype". The DuckDB docs website gets 500k unique visitors per month and DuckDB has a <a href="https://duckdb.org/?ref=blef.fr">new shiny website</a>.</p><p>Soon we will get things like:</p><ul><li>Forward (best effort) and backward (guaranteed) compatibility between DuckDB file formats</li><li>Attaching a Postgres database to execute Postgres queries from the DuckDB prompt</li><li>A new fixed-length array data type</li><li>A new unified memory manager</li><li>A secret manager that can persist between sessions</li><li>A <a href="https://ir.cwi.nl/pub/33334?ref=blef.fr">new compression algorithm</a> called ALP that brings faster compression / decompression and higher compression ratios</li><li>v1.0.0 will have no new features compared to v0.10, focusing on stability and bugfixes</li></ul><figure class="kg-card kg-embed-card kg-card-hascaption"><iframe width="200" height="113" src="https://www.youtube.com/embed/cyZfpXxXojE?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen="" title="State of the Duck (DuckCon #4, Amsterdam, 2024)"></iframe><figcaption><p dir="ltr"><span style="white-space: pre-wrap;">See the State of the Duck introductory keynote</span></p></figcaption></figure><h1 id="ideas-from-talks">Ideas from talks</h1><p>I'll just throw out in the wild the ideas and stuff I've seen in the talks.</p><ul><li>HuggingFace is using DuckDB in multiple features to power data exploration in the frontend. In their datasets product, when looking at a dataset you can do full-text search or see distributions (with bars at the top of columns), and this is powered by DuckDB. 
Lastly they pre-compute statistics on datasets with DuckDB.</li><li>Fivetran uses DuckDB as the tech to do file merges in its data lake offering</li><li>Datacamp uses DuckDB in notebooks to query dataframes in SQL and considers it for teaching SQL — I might have something in the making about this on my side.</li><li>A dbt Core developer is using DuckDB in pdb to debug what's happening in the database pretty easily and can create "debug packages" to send to other people.</li><li>DuckDB feels magical for a few people (Liverpool FC) because it does stuff faster than other technologies with a smaller technical footprint — you just write SQL and it works.</li><li>The pattern might be<ul><li>Get the data out of the db</li><li>Query it with DuckDB</li><li>Put the data back into the db</li></ul></li></ul><h1 id="in-conclusion">In conclusion</h1><p>There is something between the lines: even if DuckDB is used differently by everyone, it just runs and creates something universal (thanks to SQL). Actually this might be the final tool that breaks the wall between tech teams and data teams.</p><p>With DuckDB you offload business logic that would be embedded in a backend app into SQL queries. You can use DuckDB as a library and not a service, which changes everything: what you need to do is <code>import duckdb</code>, not launch a Docker service, manage connection strings, etc.</p><p>Last point: parquet was the starting point of a lot of use-cases because the Duck works well with columnar files. But among all the questions and feelings, people seem to like the idea of the DuckDB file format becoming the de facto data format. </p><p>Let's see.</p><hr><p>I'm sorry I've written this as enhanced raw notes, I hope you'll like it.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 24.04 ]]></title>
                    <description><![CDATA[ Data News #24.04 — Let&#39;s talk AI podcast interview, data &amp; AI products conference, Disney VR floor, dbt awesome community projects. ]]></description>
                    <link><![CDATA[ /data-news-week-24-04/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 65b386923e7aba00011c9942 ]]></guid>
                    <pubDate><![CDATA[ 2024-01-26 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/photo-1550948390-6eb7fa773072?q=80&amp;w=1000&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D" class="kg-image" alt="single perspective of pathway leading to house" loading="lazy" width="1000" height="664" srcset="https://images.unsplash.com/photo-1550948390-6eb7fa773072?q=80&amp;w=600&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D 600w, https://images.unsplash.com/photo-1550948390-6eb7fa773072?q=80&amp;w=1000&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D 1000w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Hey (</span><a href="https://unsplash.com/photos/single-perspective-of-pathway-leading-to-house-qYwyRF9u-uo?ref=blef.fr"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p>Hey, new week, new email. It is already the end of January but I took time to travel and see people I had not seen for a long time, so I'm super happy with how this new year is starting.</p><p>Next week, I'll be wrapping up my <em>DataOps</em> lecture by incorporating how to deploy machine learning models. This is a fun part where students learn how to serve a simple classifier in production: building a custom HTTP API, a Docker image and CI/CD processes, making it accessible on the internet. For the modern part this year, I'm going to integrate an LLM "classifier" part; it might spark their curiosity. We'll see.</p><p>Yesterday an interview I did for the podcast <a href="https://smartlink.ausha.co/let-s-talk-ai/55-data-software-engineering-freelance-career-and-teaching-with-christophe-blefari?ref=blef.fr">Let's talk AI</a> was published. Available everywhere. 
We talked about data engineering, freelancing and career stuff. </p><figure class="kg-card kg-image-card kg-card-hascaption"><a href="https://smartlink.ausha.co/let-s-talk-ai/55-data-software-engineering-freelance-career-and-teaching-with-christophe-blefari?ref=blef.fr"><img src="https://www.blef.fr/content/images/2024/01/56-CHRISTOPHE-BLEFARI.png" class="kg-image" alt="" loading="lazy" width="2000" height="1125" srcset="https://www.blef.fr/content/images/size/w600/2024/01/56-CHRISTOPHE-BLEFARI.png 600w, https://www.blef.fr/content/images/size/w1000/2024/01/56-CHRISTOPHE-BLEFARI.png 1000w, https://www.blef.fr/content/images/size/w1600/2024/01/56-CHRISTOPHE-BLEFARI.png 1600w, https://www.blef.fr/content/images/2024/01/56-CHRISTOPHE-BLEFARI.png 2000w" sizes="(min-width: 720px) 720px"></a><figcaption><span style="white-space: pre-wrap;">Let's Talk AI podcast </span><a href="https://smartlink.ausha.co/let-s-talk-ai/55-data-software-engineering-freelance-career-and-teaching-with-christophe-blefari?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">new episode</span></a></figcaption></figure><h1 id="data-ai-products">Data &amp; AI products</h1><p>Yesterday I went to a 5h conference organised in Paris about <a href="https://www.hymaia.com/event/data-product?ref=blef.fr">Data &amp; AI products</a>—in French. The idea of the conference was to mix people coming from the data and product ecosystems, which is, let's be honest, the key enabler for AI in production. The recordings will be online in a few days / weeks and I'll share them once available. </p><p>Here are a few takeaways, in a messy way:</p><ul><li>Data products and organisational impacts<ul><li>Data engineers are still the limiting human resource.</li><li>Data mesh by the book will not work; if you want to scale you can't just add more people to a central team.</li><li>Data mesh means decentralisation but more importantly ownership and responsibilities given to teams (esp. 
data producers)—if every team has to be responsible you need an easy-to-use platform and you have to <strong>explicitly</strong> give them responsibilities.</li></ul></li><li><a href="https://docs.google.com/presentation/d/1B5M5Dy1zQCFbbRNyhOyEAuz2STbxuTDs6xHf11b9KPA/edit?ref=blef.fr#slide=id.g2b2e220b133_0_482">UX for data products</a> — This is a presentation I really enjoyed, by Claire Lebarz, VP data at Malt. Without the voice you will miss a lot of things, still it contains great practical tips.<ul><li>Before jumping to AI projects you first need to <strong>start with words</strong> and to define <strong>metrics reflecting</strong> [your] <strong>values</strong>. You don't want your AI to give a bad product experience. So define—as a metric—what you don't want to have.</li><li>Then Claire schematised human interaction with models (slide 8) via an interface with inputs and outputs. Inputs and outputs can be instrumented with multiple techniques that will empower people in their interaction with the AI algorithm.<ul><li>Inputs — This is what you ask from the users to feed your algorithm. It can be done with <em>calibration</em>, <em>implicit or explicit feedback</em> and <em>corrections</em>.</li><li>Outputs — Product design choices where you give power over the algorithm. 
It can be done with <em>multiple options</em> (like trip alternatives on Google Maps), <em>attributions</em> (why something has been recommended), <em>confidence intervals</em> (weather) or <em>limitations</em>.</li></ul></li><li>As data people you need to build relationships with designers to converge on common terms about human-AI interactions</li></ul></li><li>And other bits I got from the other talks<ul><li>OKRs mean metrics alignment across the company, which leads to team autonomy—AI teams should be autonomous in finding solutions to move indicators</li><li>It's critical to have dashboards measuring success when A/B testing models</li><li>"Product is about people crafting together to best solutions and experiences to solve a customer problem" — <a href="https://www.linkedin.com/in/anneclairefortinbaschet/?ref=blef.fr">Anne-Claire Baschet</a>.</li></ul></li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2024/01/Screenshot-2024-01-26-at-17.46.09.png" class="kg-image" alt="" loading="lazy" width="2000" height="1131" srcset="https://www.blef.fr/content/images/size/w600/2024/01/Screenshot-2024-01-26-at-17.46.09.png 600w, https://www.blef.fr/content/images/size/w1000/2024/01/Screenshot-2024-01-26-at-17.46.09.png 1000w, https://www.blef.fr/content/images/size/w1600/2024/01/Screenshot-2024-01-26-at-17.46.09.png 1600w, https://www.blef.fr/content/images/2024/01/Screenshot-2024-01-26-at-17.46.09.png 2000w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">How to interact with AI models — Claire Lebarz, </span><a href="https://docs.google.com/presentation/d/1B5M5Dy1zQCFbbRNyhOyEAuz2STbxuTDs6xHf11b9KPA/edit?ref=blef.fr#slide=id.g27c2f639e3c_0_886" rel="noreferrer"><span style="white-space: pre-wrap;">UX for data products</span></a></figcaption></figure><p></p><h1 id="ai-news-%F0%9F%A4%96">AI News 🤖</h1><ul><li><a 
href="https://www.wired.com/story/chinese-startup-01-ai-is-winning-the-open-source-ai-race/?ref=blef.fr">This Chinese startup is winning the open source AI race</a> — Thanks to Mistral AI, open source became the new standard in the community. There is a Chinese company called 01.ai that wants to build the first killer app of Gen AI. (see also <a href="https://www.sequoiacap.com/article/clem-delangue-spotlight/?ref=blef.fr">open-sourcing the future of AI</a>, which is a HuggingFace praising post at some point).</li><li><a href="https://huggingface.co/blog/gcp-partnership?ref=blef.fr">Hugging Face and Google partner for open AI</a> —&nbsp;Don't be mistaken, it's open AI and not OpenAI 😬. This partnership will give Google Cloud customers unique hardware to train models, and HuggingFace users will get some benefits too, but I did not understand the corporate sentences from the press release.</li><li><a href="https://openai.com/blog/new-embedding-models-and-api-updates?ref=blef.fr">OpenAI new embedding models and API updates</a> — new Turbo models for GPT-4 and 3.5 and 2 new embedding models.</li><li><a href="https://medium.com/artefact-engineering-and-data-science/unleashing-the-power-of-langchain-expression-language-lcel-from-proof-of-concept-to-production-8ad8eebdcb1d?ref=blef.fr">Unleashing the power of LangChain</a> — From POC to production, it showcases the LangChain expression language that helps developers chain prompts in a nicer way.</li></ul><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li><a href="https://www.ign.com/articles/disney-unveils-the-holotile-floor-inching-us-closer-to-a-real-life-holodeck?ref=blef.fr">Disney Holotile VR floor</a> — Disney developed a "dynamic" floor for VR use-cases. With it you can walk without really moving. 
This is a bit disturbing but it can unlock the <a href="https://twitter.com/Guglielminetti/status/1749464272424354085?ref=blef.fr">metaverse future</a>.</li><li><a href="https://clickhouse.com/blog/clickhouse-one-billion-row-challenge?ref=blef.fr">ClickHouse and the one billion row challenge</a> — ClickHouse proposed a SQL solution with ClickHouse Local to the challenge, which consists in aggregating 1B rows from a text file. Initially this <a href="https://github.com/gunnarmorling/1brc?ref=blef.fr">challenge</a> had to be answered in Java. The leader submitted a solution running in less than 2s—<a href="https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_artsiomkorzun.java?ref=blef.fr">have fun</a>—while ClickHouse took 19s.</li><li><a href="https://tobikodata.com/sqlglot-jumps-on-the-rust-bandwagon.html?ref=blef.fr">SQLGlot jumps on the Rust bandwagon</a> —&nbsp;I really like SQLGlot, this is a SQL parser that gives you back the AST to do stuff. They ported the parser from Python to Rust and got a 30-40% performance improvement.</li><li><a href="https://kestra.io/blogs/2024-01-24-2024-data-engineering-trends?ref=blef.fr">2024 data engineering trends</a> — We are still in January so it's still valid: Anna captured a few things that will keep data teams busy this year. Firstly the reduction in resources, leading teams to do more with less (or at least doing the same with less).</li><li><a href="https://select.dev/posts/snowflake-batch-loading?ref=blef.fr">Snowflake batch data loading</a> — A good explanation of the Snowflake COPY INTO command and what you need to set up around it to make it work.</li><li><a href="https://store.metasnake.com/effective-pandas-book?ref=blef.fr">Effective pandas 2 is out</a> — <em>I did not read the book</em>. 
As pandas is, still, everywhere, it can be a good resource if you need to learn the 2.0 version.</li><li><a href="https://nightingaledvs.com/introducing-girls-to-code-one-flower-at-a-time/?ref=blef.fr">Introducing girls to code, one flower at a time</a> — An awesome initiative to introduce girls to code and data visualisation through creative coding projects; there is a <a href="https://data-garden.notion.site/Data-Garden-Guidebook-47a11bf555ab40bfbf68540d85067e9f?p=3e186c5b3c864f549fc8abf25b65c404&pm=s&ref=blef.fr">Notion guidebook</a> to do data visualisations with p5.js.</li><li><a href="https://github.com/kanton-bern/hellodata-be?ref=blef.fr">The open-source enterprise data platform in a single portal</a> — The canton of Bern community open-sourced a data platform blueprint to launch an all-in-one data platform with dbt, Airflow and Superset on top of Postgres and K.</li><li><a href="https://github.com/AxelThevenot/dbt-assertions?ref=blef.fr">Github/dbt-assertions</a> — A dbt package to write dbt tests at row-level and to save exceptions alongside your failing rows (cf. <a href="https://github.com/AxelThevenot/dbt-assertions/tree/main/models/examples/basic_example?ref=blef.fr">example</a>).</li><li><a href="https://medium.com/inthepipeline/use-this-updated-pull-request-comment-template-for-your-dbt-data-projects-de06f12fc38d?ref=blef.fr">PR comment template for dbt data projects</a> — Great stuff. This is a proposed GitHub pull request template for when you modify dbt models. It includes a description, a lineage diff, an illustration of model changes or impacts, and more.</li></ul><hr><p>See you next week ❤️.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ How to learn data engineering ]]></title>
                    <description><![CDATA[ How to learn data engineering in 2024? This article will help you understand everything related to data engineering. ]]></description>
                    <link><![CDATA[ /learn-data-engineering/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 6298c738e7d1fa003d604ac5 ]]></guid>
                    <pubDate><![CDATA[ 2024-01-20 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/06/image-1.png" class="kg-image" alt="" loading="lazy" width="900" height="675" srcset="https://www.blef.fr/content/images/size/w600/2022/06/image-1.png 600w, https://www.blef.fr/content/images/2022/06/image-1.png 900w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Learn data engineering, all the references (</span><a href="https://unsplash.com/photos/xFcoLGuhdGs?ref=blef.fr"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p>This is a special edition of the Data News. Right now I'm on holiday, finishing a hiking week in Corsica 🥾. So I wrote this special edition about: <strong>how to learn data engineering in 2024</strong>.</p><p>The aim of this post is to create a repository of important links and concepts we should care about when we do data engineering. Obviously I'm full of biases, so if you feel I missed something do not hesitate to ping me with stuff to add. The idea is to create a living reference about Data Engineering.</p><p></p>
<!--kg-card-begin: html-->
<p style="text-align:center;"><a data-portal="signup" style="cursor: pointer; background-color: #E4E6E1; padding: 10px 20px; border-radius: 5px;">📬 Subscribe to the excellent weekly newsletter 📬</a></p>
<!--kg-card-end: html-->
<p></p><h2 id="a-bit-of-context">A bit of context</h2><p>It's important to take a step back and understand where data engineering comes from. Data engineering inherits from years of data practices in big US companies. Hadoop initially led the way with Big Data and distributed computing on-premise, to finally land on the Modern Data Stack — in the cloud — with a data warehouse at the center.</p><p>In order to understand today's data engineering I think it's important to at least know the Hadoop concepts and context, and some computer science basics. </p><ul><li><a href="https://betterprogramming.pub/what-is-hadoop-b90591ffae89?ref=blef.fr">What is Hadoop?</a> A quick overview of what everyone used for years (and what some of us are still using). It's important to understand the distributed computing concepts, <a href="https://www.youtube.com/watch?v=PhdRyrmbRYQ&ref=blef.fr">MapReduce</a>, <a href="https://thisdataguy.com/2015/10/01/hortonworks-cloudera-or-mapr/?ref=blef.fr">Hadoop distributions</a>, <a href="https://www.thoughtworks.com/es-es/insights/decoder/d/data-locality?ref=blef.fr#:~:text=Data%20locality%20is%20the%20process,on%20your%20network%20and%20systems.">data locality</a>, HDFS.</li><li><a href="https://medium.com/@eczachly/data-data-engineering-the-past-present-and-future-ac3ad5795ddf?ref=blef.fr">Data &amp; Data Engineering — the past, present, and future</a>; this is a good overview of data engineering history.</li><li>This one is a gitbook with a lot of content, but I specifically recommend you read the <a href="https://jheck.gitbook.io/hadoop/introduction-to-data-engineering?ref=blef.fr">introduction to data engineering</a>.</li><li>In order to become a great data engineer you'll also need to understand computer science. <a href="https://www.explainthatstuff.com/howcomputerswork.html?ref=blef.fr">How do computers work?</a> Additionally, understand how the web works — frontend &amp; backend, deployment, etc.
This is oversimplified, but I did not find a simple resource on this topic, so if you have something, I'm interested.</li></ul><p></p><h2 id="who-are-the-data-engineers">Who are the data engineers?</h2><p>Every company out there has its own definition of the data engineer role. In my opinion we can easily say <strong>a data engineer is a software engineer working with data</strong>. The idea is to solve data problems by building software. Obviously, as data is different from a "traditional" product — in terms of users, for instance — a data engineer uses other tools.</p><p>In order to define the data engineer profile, here are some resources defining data roles and their borders.</p><ul><li><a href="https://medium.com/younited-tech-blog/data-organisation-why-are-there-so-many-roles-9c3992d0a436?ref=blef.fr">Data Organization: why are there so many roles ? — And why it is important to understand them</a>. This is one of the best syntheses about data roles. Furcy defined <em>Programming</em> as the core skill for data engineers.</li><li>To complete the picture, here are some <a href="https://www.lewagon.com/tech-jobs/data-science/data-engineer?ref=blef.fr">missions and skills</a> expected of data engineers. Warning: the article is from an online bootcamp, but they summarize everything pretty well. You can also have a look at the <a href="https://www.gov.uk/guidance/data-engineer?ref=blef.fr">gov.uk data engineer job card</a>; they detail the expectations for every seniority level.</li><li><a href="https://www.mihaileric.com/posts/we-need-data-engineers-not-data-scientists/?ref=blef.fr">We don't need data scientists, we need data engineers</a> —&nbsp;for years companies were hiring data scientists because it was booming, then realized they needed data engineers to team up with the scientists.
This post shows the data job market with numbers.</li></ul><p></p><h2 id="what-is-data-engineering">What is data engineering</h2><p>As I said before, data engineering is still a young discipline with many different definitions. Still, we can find a common ground by mixing software engineering, DevOps principles, an understanding of cloud — or on-prem — systems, and <a href="https://zeenea.com/what-is-data-literacy-tips-on-becoming-data-literate/?ref=blef.fr">data literacy</a>.</p><p>If you are new to data engineering <strong>you should start by reading the holy trinity from Maxime Beauchemin</strong>. Some years ago he wrote 3 articles defining the data engineering field.</p><ul><li><a href="https://medium.com/free-code-camp/the-rise-of-the-data-engineer-91be18f1e603?ref=blef.fr">The Rise of the Data Engineer</a></li><li><a href="https://maximebeauchemin.medium.com/the-downfall-of-the-data-engineer-5bfb701e5d6b?ref=blef.fr">The Downfall of the Data Engineer</a></li><li><a href="https://maximebeauchemin.medium.com/functional-data-engineering-a-modern-paradigm-for-batch-data-processing-2327ec32c42a?ref=blef.fr">Functional Data Engineering — a modern paradigm for batch data processing</a></li></ul><p><strong>There is a global consensus stating that you need to master a programming language (Python or Java based) and SQL in order to be self-sufficient.</strong> </p><p></p><h4 id="some-concepts">Some concepts</h4><p>When doing data engineering you touch a lot of different concepts.
<strong>Firstly, read the </strong><a href="https://connectingdots.xyz/blog/posts/2021/05/the-data-engineering-manifesto/?ref=blef.fr"><strong>Data Engineering Manifesto</strong></a>; this is not something <em>official</em> in any way, but it greatly depicts all the concepts data engineers face daily.</p><p>Then here is a list of global resources that can help you navigate the field:</p><ul><li><a href="https://github.com/datastacktv/data-engineer-roadmap?ref=blef.fr">The Data Engineer Roadmap</a> — An image with advice and technology names to watch.</li><li>The Reddit <a href="https://dataengineering.wiki/Concepts/Concepts?ref=blef.fr">r/dataengineering wiki</a>, a place where some data eng definitions are written down.</li><li>This book, <a href="https://www.oreilly.com/library/view/data-pipelines-pocket/9781492087823/?ref=blef.fr"><em>📘 Data Pipelines Pocket Reference</em></a><em>, </em>defines everything related to data pipelines and how to handle data movement from source to target.</li></ul><p></p><p>If we go a bit deeper, I think that every data engineer should have a basis in:</p><ul><li><a href="https://dataengineering.wiki/Concepts/Data+Modeling?ref=blef.fr">data modeling</a> — this relates to the way data is stored in a data warehouse; the field was cracked years ago by <a href="https://www.amazon.com/Data-Warehouse-Toolkit-Definitive-Dimensional/dp/1118530802?ref=blef.fr">Kimball dimensional modeling</a> and the <a href="https://www.youtube.com/watch?v=jXXfdscVyLc&ref=blef.fr">Inmon model</a>, but it recently got <a href="https://discourse.getdbt.com/t/is-kimball-dimensional-modeling-still-relevant-in-a-modern-data-warehouse/225?ref=blef.fr">challenged</a>, thanks to "infinite" cloud power, by the <a href="https://www.fivetran.com/blog/star-schema-vs-obt?ref=blef.fr">OBT</a> (one big table, or flat) model.
In order to complete your understanding of data modeling you should learn <a href="https://analyticsengineers.club/whats-an-olap-cube/?ref=blef.fr">what's an OLAP cube</a>. The cherry on the cake here is the <a href="https://www.holistics.io/blog/scd-cloud-data-warehouse/?ref=blef.fr">Slowly Changing Dimensions</a> — SCDs — concept.</li><li><a href="https://luminousmen.com/post/big-data-file-formats?ref=blef.fr">formats</a> — This is a huge part of data engineering: picking the right format for your data storage. The wrong format often means bad query performance and user experience. In a nutshell you have: text-based formats (CSV, JSON and raw stuff), columnar file formats (Parquet, ORC), memory formats (<a href="https://arrow.apache.org/docs/format/Columnar.html?ref=blef.fr">Arrow</a>), transport protocols and formats (Protobuf, Thrift, gRPC, Avro), table formats (<a href="https://www.dremio.com/subsurface/comparison-of-data-lake-table-formats-iceberg-hudi-and-delta-lake/?ref=blef.fr">Hudi, Iceberg, Delta</a>), and database and vendor formats (Postgres, Snowflake, BigQuery, etc.). Here is a small <a href="https://www.adaltas.com/en/2020/07/23/benchmark-study-of-different-file-format/?ref=blef.fr">benchmark</a> between some popular formats.</li><li><a href="https://en.wikipedia.org/wiki/Batch_processing?ref=blef.fr">batch</a> — Batch processing is at the core of data engineering. One of the major tasks is to move data from a source storage to a destination storage. In batch. On a regular schedule. Sometimes with transformation. This is close to what we also call ETL or ELT. The main difference between the two is whether your computation resides in your warehouse with SQL rather than outside of it, with a programming language loading data in memory. In this category I also recommend having a look at data ingestion (Airbyte, Fivetran, etc.), workflow (Airflow, Prefect, Dagster, etc.) and transformation (Spark, dbt, Pandas, etc.)
tools.</li><li><a href="https://en.wikipedia.org/wiki/Stream_processing?ref=blef.fr">stream</a> — Stream processing can be seen as the evolution of batch. <a href="https://towardsdatascience.com/the-magical-fusion-between-batch-and-streaming-insights-8f1353bfe4a?ref=blef.fr">It is not</a>. It addresses different use-cases and is often linked to real-time. The main technologies around streaming are message buses like Kafka and processing frameworks like Flink or Spark on top of the bus. Recently, all-in-one cloud services appeared to simplify real-time work. Also understand <a href="https://netflixtechblog.com/dblog-a-generic-change-data-capture-framework-69351fb9099b?ref=blef.fr">Change Data Capture</a> — CDC.</li><li>infrastructure — When you do data engineering it is important to master data infrastructure concepts. You'll be seen as the most technical person of the data team and you'll need to help your team with "low-level" stuff. You'll also be asked to put a data infrastructure in place: a data warehouse, a data lake or other concepts starting with "data". <strong>My advice on this point is to learn from others</strong>.
Read technical blogs, watch conferences and read 📘 <a href="https://www.oreilly.com/library/view/designing-data-intensive-applications/9781491903063/?ref=blef.fr"><em>Designing Data-Intensive Applications</em></a> (even if it could be overkill).</li><li>new concepts — in today's data engineering a lot of new concepts enter the field every year: quality, lineage, metadata management, governance, privacy, sharing, etc.</li></ul><p></p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/06/tobias-keller-2ecH5Lw3zSk-unsplash.jpg" class="kg-image" alt="" loading="lazy" width="2000" height="1429" srcset="https://www.blef.fr/content/images/size/w600/2022/06/tobias-keller-2ecH5Lw3zSk-unsplash.jpg 600w, https://www.blef.fr/content/images/size/w1000/2022/06/tobias-keller-2ecH5Lw3zSk-unsplash.jpg 1000w, https://www.blef.fr/content/images/size/w1600/2022/06/tobias-keller-2ecH5Lw3zSk-unsplash.jpg 1600w, https://www.blef.fr/content/images/2022/06/tobias-keller-2ecH5Lw3zSk-unsplash.jpg 2000w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Is it really modern? (</span><a href="https://unsplash.com/photos/2ecH5Lw3zSk?ref=blef.fr"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><h2 id="the-modern-and-the-future-data-stack">The modern (and the future) data stack</h2><p>Coming from Hadoop — also called the old data stack — people are now building modern data stacks. This is a new way to describe data platforms, with a warehouse at the core where all the company data and KPIs sit.
Below are some key articles defining this new paradigm.</p><ul><li>👍 <a href="https://www.getdbt.com/blog/future-of-the-modern-data-stack/?ref=blef.fr">The Modern Data Stack: Past, Present, and Future</a></li><li><a href="https://future.com/emerging-architectures-modern-data-infrastructure/?ref=blef.fr">Emerging Architectures for Modern Data Infrastructure</a></li><li><a href="https://towardsdatascience.com/the-new-generation-data-lake-54e10e02b757?ref=blef.fr">The New Generation Data Lake</a></li><li><a href="https://towardsdatascience.com/bootstrap-a-modern-data-stack-in-5-minutes-with-terraform-32342ee10e79?ref=blef.fr">Bootstrap a Modern Data Stack in 5 minutes with Terraform</a></li><li><a href="https://medium.com/alexandre-beauvois/modern-data-stack-as-a-service-1-3-1a1813c38633?ref=blef.fr">Modern Data Stack as a Service</a></li><li><a href="https://erikbern.com/2021/11/30/storm-in-the-stratosphere-how-the-cloud-will-be-reshuffled.html?ref=blef.fr">Storm in the stratosphere: how the cloud will be reshuffled</a></li><li>❤️ <a href="https://petrjanda.substack.com/p/a-path-towards-a-data-platform-that?ref=blef.fr">A path towards a data platform that aligns data, value, and people</a></li></ul><p>And now some articles I like that will help you find inspiration.</p><ul><li><a href="https://about.gitlab.com/handbook/business-technology/data-team/?ref=blef.fr">Gitlab Data Team Handbook</a> — One of the best data resources. This is public documentation on how the Gitlab data team does stuff.</li><li>Airbnb is great at exposing what they are doing in terms of data.
For instance with these 2 articles: <a href="https://medium.com/airbnb-engineering/how-airbnb-achieved-metric-consistency-at-scale-f23cc53dea70?ref=blef.fr">How Airbnb achieved metric consistency at scale</a> &amp; <a href="https://medium.com/airbnb-engineering/how-airbnb-built-wall-to-prevent-data-bugs-ad1b081d6e8f?ref=blef.fr">How Airbnb built “Wall” to prevent data bugs</a></li><li>Data engineering patterns are important — Dagster tried to <a href="https://dagster.io/blog/software-defined-assets?ref=blef.fr">introduce Software-Defined Assets</a> and Prefect spoke about <a href="https://www.prefect.io/guide/blog/positive-and-negative-engineering/?ref=blef.fr">Positive and Negative engineering</a>.</li><li><a href="https://building.nubank.com.br/scaling-data-analytics-with-software-engineering-best-practices/?ref=blef.fr">Scaling data analytics with software engineering best practices</a></li><li>Jesse Anderson; <a href="https://www.jesse-anderson.com/2018/11/creating-a-data-engineering-culture/?ref=blef.fr">Creating a Data Engineering Culture</a> and his book 📘 <a href="https://content.bigdatainstitute.io/books/data_engineering_teams/Data_Engineering_Teams.pdf?ref=blef.fr">Data Engineering Teams</a></li><li>What is MLOps? Some people wrote a white paper, <a href="https://arxiv.org/ftp/arxiv/papers/2205/2205.02302.pdf?ref=blef.fr">Machine Learning Operations (MLOps): Overview, Definition, and Architecture</a>, in which they write about roles and missions.</li></ul><hr><p>Once again, if you feel I forgot something important do not hesitate to tell me. I'll add more and more stuff to this article in the future.</p>
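To make the batch and functional-data-engineering ideas above concrete, here is a minimal, stdlib-only Python sketch of an idempotent batch job: pure extract/transform steps and a load that overwrites one date partition per run, so a retry or backfill never duplicates data. Everything in it (the extract/transform/load names, the in-memory storage dict standing in for a lake or warehouse stage) is illustrative and not taken from any specific tool.

```python
import csv
import io

def extract(raw_rows, day):
    # Filter the source down to a single date partition.
    return [r for r in raw_rows if r["day"] == day]

def transform(rows):
    # Pure function: aggregate revenue per country, deterministic output order.
    totals = {}
    for r in rows:
        totals[r["country"]] = totals.get(r["country"], 0) + int(r["amount"])
    return sorted(totals.items())

def load(aggregates, day, storage):
    # Idempotent load: overwrite the whole partition for `day`,
    # never append, so re-running the job leaves the same result.
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["country", "revenue"])
    writer.writerows(aggregates)
    storage[f"revenue/day={day}.csv"] = buf.getvalue()

# `storage` stands in for an object-storage bucket or a warehouse stage.
storage = {}
raw = [
    {"day": "2024-01-15", "country": "FR", "amount": "10"},
    {"day": "2024-01-15", "country": "FR", "amount": "5"},
    {"day": "2024-01-15", "country": "US", "amount": "7"},
    {"day": "2024-01-16", "country": "FR", "amount": "3"},
]
# Running the same day twice leaves storage unchanged: that's the idempotency.
for _ in range(2):
    load(transform(extract(raw, "2024-01-15")), "2024-01-15", storage)
```

In a real pipeline the schedule, retries and backfills would come from an orchestrator like Airflow, Prefect or Dagster, and the partition would live in object storage or a warehouse table.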
<!--kg-card-begin: html-->
<p>If you enjoyed this article please consider <a data-portal="signup" style="cursor: pointer;">subscribing</a> to my weekly newsletter about data where I demystify all these concepts. I help you save 5 hours of curation per week.</p>
<!--kg-card-end: html-->
 ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 24.03 ]]></title>
                    <description><![CDATA[ Data News #24.03 — ChatGPT in classes, Zuckerberg announcements, Bard and awesome news. ]]></description>
                    <link><![CDATA[ /data-news-week-24-03/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 65ab6d9ee4dcbc000139fdff ]]></guid>
                    <pubDate><![CDATA[ 2024-01-20 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/photo-1586348278474-312d4266bbc3?q=80&amp;w=1000&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxzZWFyY2h8NzV8fGFuZ2Vyc3xlbnwwfHwwfHx8MA%3D%3D" class="kg-image" alt="ice hockey players on ice hockey field" loading="lazy"><figcaption><span style="white-space: pre-wrap;">Walking in the street be like recently (</span><a href="https://unsplash.com/photos/ice-hockey-players-on-ice-hockey-field-W7t3cNm8LXk?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p>Hey, I hope this new edition finds you well. We are deep in the winter; it's time for comfy Data News to read near the fire 🔥.</p><p>This week, on Monday, I started my annual university lecture. It's been 9 years since I started teaching, and this year something was different. The students were incredibly calm. Obviously my course is a bit difficult at the beginning because it touches on concepts they are not used to—cloud, data in production, data engineering, etc.—so it's normal that they don't have questions at first. But still, even during exercises hands stayed down, when the previous year students were asking me for debugging help.</p><p>This year something was off.</p><p>On Wednesday I finally understood what had changed. It was ChatGPT. The whole class was using ChatGPT—I did a raised-hands survey and everyone said yes. So now the default go-to is to ask ChatGPT questions rather than ask me, and then, if ChatGPT does not have the answer, they might ask me.</p><p>I still don't know how to react to this. I think it does not make sense to ban ChatGPT, just like it was stupid to ban Google Search in my time. But still there is something to do; I need to research and think more about it.</p><p>I assume that education will be radically transformed.
Both ways. The way students learn will be different, but the way teachers teach will also have to be different. This will force us to bring something to the class that ChatGPT can't: humanity.</p><p></p><h1 id="ai-news-%F0%9F%A4%96">AI News 🤖</h1><ul><li><a href="https://openai.com/careers/elections-program-manager?ref=blef.fr">OpenAI opened an Elections Program Manager</a> — The role is to support the "efforts around elections security and integrity for the EMEA region". This year we have the European Parliament election. It deeply shows how OpenAI products are—or might be—used to win races. I guess they already have people for the US elections.</li><li><a href="https://www.axios.com/2024/01/17/alex-karp-davos-ai-us-advantage?ref=blef.fr">Palantir CEO: U.S. eating everyone's lunch on AI</a> — Let's continue with politics: at the World Economic Forum in Davos, Palantir's CEO said that within 10 years 95% of the world's top tech companies will be American. I don't see any difference from today.</li><li><a href="https://github.com/facebookresearch/audiocraft/blob/main/docs/MAGNET.md?ref=blef.fr">Meta released MAGNeT</a> —&nbsp;MAGNeT is a text-to-music and text-to-sound model capable of generating high-quality audio samples conditioned on text descriptions. It seems to work well for generating sound effects.</li><li><a href="https://www.instagram.com/zuck/reel/C2QARHJR1sZ/?hl=en&ref=blef.fr">Zuckerberg teased 2024 Meta AI strategy</a> — In a selfie video on Facebook / Instagram, Zucky explained that Llama 3 is coming and that Meta is building a massive 600k H100 NVIDIA GPU infrastructure. That represents around $27b in raw GPUs alone. But luckily for us Meta will open-source everything they do because they love us so much &lt;3. In exchange we can wear the new Ray-Ban Meta glasses with AI inside to give Meta more training data.
We are just seeing the world shifting, and we are all ok with it.<br><br>If you want a less salty opinion than mine, <a href="https://www.linkedin.com/feed/update/urn:li:activity:7153821341538824193/?updateEntityUrn=urn%3Ali%3Afs_updateV2%3A%28urn%3Ali%3Aactivity%3A7153821341538824193%2CFEED_DETAIL%2CEMPTY%2CDEFAULT%2Cfalse%29&ref=blef.fr">Oliver as always aced it</a>.</li><li><a href="https://www.youtube.com/watch?v=i01cizb6Txg&ref=blef.fr">Prompt engineering with Bard</a> — A recording from a local Google Developers meetup last November. In this talk we discover a few concepts on how to talk to Bard. Mainly, you can do it through the API or the UI, and Peter explains that Bard shines in creativity, factuality and reasoning. He also explains well the concept of grounding and why it matters. <br><br>Right after, he shows how you can "teach" Bard to reason with a reversed-word example in which Bard fails. In order to do it you have to ask Bard to write and execute code in the background, but to activate the code execution feature "you're at the mercy of the classifier". <strong>This is our future, being at the mercy of classifiers</strong>.</li><li><a href="https://www.404media.co/google-news-is-boosting-garbage-ai-generated-articles/?ref=blef.fr">Google News is boosting garbage AI-generated articles</a> — <em>This is a paid article</em>. The title speaks for itself.</li><li><a href="https://arxiv.org/pdf/2401.05566.pdf?ref=blef.fr">Paper, Sleeper agents</a> — A not-so-reassuring paper.
Anthropic's research team proved that it is possible to insert backdoors into models, and that the backdoor persists despite safety and adversarial training.</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/photo-1588581939864-064d42ace7cd?q=80&amp;w=1000&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D" class="kg-image" alt="books on shelves in library" loading="lazy"><figcaption><span style="white-space: pre-wrap;">I'm writing from a library today, I feel like a student (</span><a href="https://unsplash.com/photos/books-on-shelves-in-library-sd8uJsf4XM4?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li><a href="https://benn.substack.com/p/its-time-to-build?ref=blef.fr">It's time to build</a> — Still a big fan of Benn's content. This week he looks back at why his content shifted from the modern data stack to AI. Then there's all the marketing that goes into selling data tools to <em>unleash the power of your data</em>. Far from the trends and the spotlight, it's actually time to build tools.</li><li><a href="https://roundup.getdbt.com/p/my-thoughts-going-into-a-new-year?ref=blef.fr">My thoughts going into a New Year</a> — Tristan, dbt Labs CEO, wrote his thoughts about 2024.
He covers why AI has not yet impacted data jobs and writes about OSS licensing after <a href="https://snowplow.io/blog/introducing-snowplow-limited-use-license/?ref=blef.fr">Snowplow's recent changes</a>, renewing a statement that dbt Labs does not need to make such a change because they have a solid commercial path.</li><li><a href="https://www.castordoc.com/blog/understanding-the-eu-ai-act?ref=blef.fr">EU AI Act</a> — A nice-looking summary of what matters in the new EU AI Act, from the risk definitions to the potential fines. Then CastorDoc explains how a data catalog can help you oversee compliance.</li><li><a href="https://chriswarrick.com/blog/2024/01/15/python-packaging-one-year-later/?ref=blef.fr">Packaging, one year later: a look back at 2023 in Python packaging</a> — <strong>Understanding Python packaging is one of the most important skills to master when you want to enter the Python world.</strong> I've seen too many people struggle with their development workflow because they are not used to pip and all. Chris wrote a follow-up to last year's post about the <a href="https://chriswarrick.com/blog/2023/01/15/how-to-improve-python-packaging/?ref=blef.fr">sad state of Python packaging</a>, explaining standards and proposing things for the future.</li><li><a href="https://engineering.fb.com/2024/01/18/developer-tools/lazy-imports-cinder-machine-learning-meta/?ref=blef.fr">How lazy imports accelerate machine learning at Meta</a> — Meta developed their own implementation of CPython called Cinder. In order to speed up model training time they switched to Cinder and decided to use lazy imports.</li><li><a href="https://www.synq.io/blog/measuring-data-quality?ref=blef.fr">Measuring data quality: bringing theory into practice</a> — Mikkel is one of the best when it comes to putting the correct words on data quality issues.
You should read this article to clarify these concepts.</li><li><a href="https://towardsdatascience.com/big-o-a-practical-approach-319a6a3c8b27?ref=blef.fr">Big O — A practical approach</a> —&nbsp;Big O notation is taught at school and is super important when programming, especially in data, where complexity has to be understood to speed up data transformations. This article gives you what's important. o/</li><li><a href="https://doordash.engineering/2024/01/16/staying-in-the-zone-how-doordash-used-a-service-mesh-to-manage-data-transfer-reducing-hops-and-cloud-spend/?ref=blef.fr">How DoorDash used a service mesh and saved costs</a>.</li><li><a href="https://robertsahlin.substack.com/p/datahem-odyssey-the-evolution-of-95f?ref=blef.fr">The evolution of a data platform</a>, part 2 — Part 2 on MatHem's analytical platform. On GCP: BigQuery at the center, with events flowing from Pub/Sub / Dataflow, and great usage of all the Google Cloud services.</li><li><a href="https://medium.com/apache-airflow/airflow-evolution-at-snap-c988cdd95abd?ref=blef.fr">Airflow evolution at Snap</a> — Large companies need multi-tenancy and Snap is one of them.
This article shows all the different architectures Snap put in place to deploy Airflow at scale.</li><li><a href="https://airbyte.com/blog/integrating-airbyte-with-data-orchestrators-airflow-dagster-and-prefect?ref=blef.fr">Integrating Airbyte with data orchestrators: Airflow, Dagster and Prefect</a> — An orchestrator comparison and how Airbyte can be used for extract-and-load within them.</li><li><a href="https://medium.com/snowflake/convert-your-pyspark-code-to-snowpark-code-using-snowconvert-c2234691cc5e?ref=blef.fr">Convert your PySpark code to Snowpark code using SnowConvert</a> — Snowflake trying to attract Databricks customers.</li><li><a href="https://hoffa.medium.com/hey-snowflake-send-me-a-fancy-email-fe04ad2c9888?ref=blef.fr">Hey Snowflake, send me a &lt;fancy&gt; HTML email</a> — This is one of the features I'm the most unsure about. Do I want my warehouse to be able to send emails without my global orchestration system being aware of it? Yes, because it's cool to give freedom to users... but no, because as a data engineer I want a platform where flows are controlled.
What's your take on this?<br><br>If you want to do it with BigQuery, you should take a look at my friend's <a href="https://github.com/unytics/bigfunctions?ref=blef.fr">BigFunctions</a> (see the <a href="https://unytics.io/bigfunctions/bigfunctions/?ref=blef.fr#send_mail">send_email</a> function).</li><li><a href="https://github.com/borjavb/bq-lineage-tool?ref=blef.fr">bq-lineage-tool</a> — Java code that uses ZetaSQL to build a column-level lineage parser for BigQuery.</li><li><a href="https://dlthub.com/docs/blog/dlt-dbt-runner-on-cloud-functions?ref=blef.fr">Comparison running dbt-core and dlt-dbt runner on Functions</a> — If you're running dbt-core within Cloud Functions, check this article integrating dbt with dlt to avoid a weird hacky subprocess.</li><li><a href="https://docs.getdbt.com/blog/serverless-dlt-dbt-stack?ref=blef.fr">Build a personal real estate dashboard with dlt and dbt</a> — Another example where you can chain dlt—to extract and load data—and dbt—to transform data—this time to build a real estate dashboard to find your dream property in Portugal.</li></ul><p></p><h1 id="data-economy-%F0%9F%92%B0">Data Economy 💰</h1><ul><li><a href="https://phospho.ai/?ref=blef.fr"><strong>Phospho</strong></a> <a href="https://tech.eu/2024/01/17/elaia-and-ycombinator-back-phospho-with-1-7m-for-genai-application-monitoring/?ref=blef.fr">raises €1.7m in pre-seed</a> to build GenAI monitoring applications.</li><li><a href="https://skyengine.ai/se/?ref=blef.fr"><strong>SKY ENGINE AI</strong></a> <a href="https://skyengine.ai/se/skyengine-blog/136-sky-engine-ai-raises-7m-to-accelerate-vision-ai-development-for-automotive-robotics-medical-diagnosis-more?ref=blef.fr">raises $7m Series A</a>, with a platform that generates synthetic data for deep learning vision algorithms. It lets you create 3D stuff that you can use to train algorithms.</li></ul><hr><p>See you next week ❤️.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 24.02 ]]></title>
                    <description><![CDATA[ Data News #24.02 (late) — First DN edition of the year, let&#39;s catchup with awesome content written these last weeks. ]]></description>
                    <link><![CDATA[ /data-news-week-24-02/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 659ad4dce4dcbc000139babe ]]></guid>
                    <pubDate><![CDATA[ 2024-01-15 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/photo-1570616969692-54d6ba3d0397?q=80&amp;w=1000&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D" class="kg-image" alt="people sitting on chair" loading="lazy"><figcaption><span style="white-space: pre-wrap;">Back to school (</span><a href="https://unsplash.com/photos/people-sitting-on-chair-w1FwDvIreZU?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p>Hello you. Back to the usual Data News—with a little delay, I'm sorry.</p><p>First of all, I'd like to thank you for your positive comments on <a href="https://www.blef.fr/2024/">last week</a>'s article. It's a subject close to my heart and I was very happy to share it with you, because I never thought that Data News would become such a big part of my life.</p><p>I'm starting my annual university lectures today. It's always very exciting to go back and teach students, to help them discover the world of data from another perspective. The details: it's a 27-hour course called <em>DataOps</em>. It's quite a broad subject. I actually cover data engineering and how to put data stuff into production.</p><p>For years I gave a 30-hour lecture called <em>Python for Data Science</em> in which I covered the basics of Python, pandas and scikit-learn. But I stopped 2 years ago because it was too much and too repetitive for me. I'm very happy with this new <em>DataOps</em> lecture because it's much closer to what's really going on in the data world.</p><p>Over the years I've accumulated exercises and one day—I hope this year—I'll share them with everyone in a nice way. </p><p>It's funny because in the days leading up to the lecture, I'm always stressed about something: <strong>I'm always afraid I'm going to run out of content</strong>. 
The last thing I want to do is give a boring class.</p><p>Wish me luck and have fun reading the news.</p><h1 id="ai-news-%F0%9F%A4%96">AI News 🤖</h1><ul><li><a href="https://www.youtube.com/watch?v=PkXELH6Y2lM&ref=blef.fr">Bill Gates talks with Sam Altman</a> — A 30-minute episode of Bill Gates' podcast where he chats with Sam Altman.</li><li><a href="https://towardsdatascience.com/navigating-the-ai-landscape-of-2024-trends-predictions-and-possibilities-41e0ac83d68f?ref=blef.fr">14 predictions about AI</a> — In a long form article, Vincent shares his predictions about AI and the trends we might see in 2024. Garbage in, garbage out, still one of the most important issues. Personally, I have a question for authors in 2024: when are you going to stop generating images to illustrate articles? They're horrible and destroy the content. If I have to predict something, it would be that this trend stops.</li><li><a href="https://people.eecs.berkeley.edu/~evonne_ng/projects/audio2photoreal/?ref=blef.fr">Meta, from audio to synthesized humans in conversation</a> — Do we finally see an outcome of the billions Meta invested in the Metaverse 🙃. To be honest this is impressive, from audio alone Meta is capable of generating a photorealistic avatar that behaves as if it were you speaking.</li><li><a href="https://engineering.fb.com/2024/01/11/ml-applications/meta-advancing-genai/?ref=blef.fr">How Meta is advancing Gen AI</a> —&nbsp;a podcast about Meta GenAI breakthroughs. </li><li><a href="https://www.theverge.com/2024/1/4/24023809/microsoft-copilot-key-keyboard-windows-laptops-pcs?ref=blef.fr">Microsoft will add a Copilot key to Windows keyboards</a> — This might be a major change to Windows computers and keyboards: Microsoft wants to add a physical AI trigger on every keyboard. 
Might be the best adoption trigger we ever saw.</li><li><a href="https://bytesdataaction.substack.com/p/coding-with-chatgpt?ref=blef.fr">I coded exclusively with ChatGPT for 30 Days</a> — Good takeaways about a nice experiment.</li><li><a href="https://www.linkedin.com/pulse/why-we-invested-hugging-face-ibmventures-dyexc%3FtrackingId=bJZR%252F8YdTWynXWKMrrUdaQ%253D%253D/?trackingId=bJZR%2F8YdTWynXWKMrrUdaQ%3D%3D&ref=blef.fr">IBM explaining why they invested in HuggingFace</a> — <em>During a gold rush, sell shovels</em>. It explains NVIDIA's 2023 success, and HuggingFace thrives for the same reason. HF became the de facto platform for sharing and showcasing AI models.</li><li><a href="https://osanseviero.github.io/hackerllama/blog/posts/sentence_embeddings/?ref=blef.fr">Sentence embeddings</a> — After reading this article you will be able to do a PhD in embeddings. Personally I did not read it, but if you want to understand embeddings you should. </li></ul><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li>dbt related stuff<ul><li><a href="https://dbterd.datnguyen.de/latest/nav/guide/dbt-cloud/download-artifact-from-a-job-run.html?ref=blef.fr">Download artifacts from your dbt Cloud job runs</a> —&nbsp;a tutorial for a CLI tool that generates ERD diagrams for dbt Cloud projects.</li><li><a href="https://leo-godin.medium.com/testing-dbt-macros-a80e76243ae4?ref=blef.fr">Testing dbt macros</a> — A clever pattern to write unit tests on dbt macros with a model computing all the possible macro values and a dbt test checking all the possible cases.</li><li><a href="https://medium.com/teads-engineering/unit-testing-with-dbt-fb84f2ef7dd6?ref=blef.fr">Unit testing dbt models</a> — Using the <a href="https://github.com/EqualExperts/dbt-unit-testing?ref=blef.fr">dbt-unit-testing</a> package Matthieu showcases how you can easily test your models.</li><li><a href="https://tayloramurphy.substack.com/p/the-dbt-meta-tag?ref=blef.fr">dbt meta tag</a> —&nbsp;A list 
of the companies having product features depending on the <code>meta</code> tag. It shows how deeply dbt changed the data world.</li></ul></li><li><a href="https://www.arecadata.com/what-would-i-do-differently-about-getting-into-data-engineering-2024/?ref=blef.fr">What I would do differently getting into Data Engineering</a> — Data engineering has changed a lot in recent years and Daniel gives three pieces of advice that you should consider to get into data engineering: learn SQL, be social and learn to say no.</li><li><a href="https://towardsdatascience.com/lead-data-engineer-career-guide-699e806111b4?ref=blef.fr">Lead Data Engineer career guide</a> — Detailed skillsets needed to be a lead data engineer.</li><li><a href="https://leaddev.com/team/effectively-managing-junior-developers-remote-teams?ref=blef.fr">Effectively managing junior developers on remote teams</a> — In the current state of the ecosystem it is super important to give juniors a proper introduction to the data world.</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/photo-1610534440162-e0e68fbdeca3?q=80&amp;w=1000&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D" class="kg-image" alt="grayscale photo of people walking on street" loading="lazy"><figcaption><span style="white-space: pre-wrap;">Time to sleep (</span><a href="https://unsplash.com/photos/grayscale-photo-of-people-walking-on-street-D9FQYwAclwQ?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><div class="kg-card kg-callout-card kg-callout-card-green"><div class="kg-callout-emoji">🫠</div><div class="kg-callout-text">I'm sorry, it's midnight when I'm writing this. 
To be able to publish on Monday morning I don't have the time to read all the following articles.</div></div><ul><li><a href="https://andrew-jones.medium.com/every-data-transform-is-technical-debt-a6d09d3961e5?ref=blef.fr">Every data transform is technical debt</a>.</li><li><a href="https://medium.com/@vutrinh274/you-dont-know-this-for-sure-how-bigquery-stores-semi-structured-data-a80adc6060de?ref=blef.fr">How BigQuery stores semi-structured data?</a> —&nbsp;It relates to Dremel and parquet structures.</li><li><a href="https://engineering.mixpanel.com/how-mixpanel-built-a-fast-lane-for-our-modern-data-stack-680701736f8c?ref=blef.fr" rel="noreferrer">Mixpanel modern data stack <strong>fast lane</strong></a><strong>.</strong></li><li><a href="https://netflixtechblog.com/rebuilding-netflix-video-processing-pipeline-with-microservices-4e5e6310e359?ref=blef.fr">Netflix video processing rebuilt with microservices</a>.</li><li><a href="https://medium.com/data-monzo/how-we-built-year-in-monzo-unlocking-the-data-magic-74c880e32378?ref=blef.fr">How Monzo built <em>Year in Monzo</em></a><em>.</em></li><li><a href="https://engineering.hometogo.com/part-iii-a-b-testing-at-hometogo-running-the-whole-a-b-pipeline-on-snowflake-70f0996b12e6?ref=blef.fr">A/B Testing at HomeToGo</a>.</li><li><a href="https://www.datadoghq.com/blog/engineering/crunchconf-talk-self-serve-analytics/?ref=blef.fr">Datadog, scaling self-serve analytics, serving 5000 employees</a> — 🤯.</li><li><a href="https://towardsdatascience.com/2024-the-year-of-the-value-driven-data-person-f7f2b6344a5a?ref=blef.fr">2024: the year of the value-driven data person</a>.</li><li><a href="https://datamonkeysite.com/2024/01/08/using-arrow-and-delta-rust-to-transfer-data-from-bigquery-to-fabric-onelake/?ref=blef.fr">Transfer data from BigQuery to Fabric with Arrow and Rust</a>.</li><li><a 
href="https://cloud.google.com/blog/products/networking/eliminating-data-transfer-fees-when-migrating-off-google-cloud?hl=en&ref=blef.fr">Removing egress fees when moving off Google Cloud</a>.</li><li><a href="https://robertsahlin.substack.com/p/datahem-odyssey-the-evolution-of?r=7bvua&utm_campaign=post&utm_medium=web&ref=blef.fr">The evolution of a data platform</a>.</li><li><a href="https://motherduck.com/blog/introducing-fixit-ai-sql-error-fixer/?ref=blef.fr">Fixit, MotherDuck SQL AI error fixer</a>.</li><li><a href="https://docs.malloydata.dev/blog/2024-01-09-whats-next-in-2024?ref=blef.fr">What's next for Malloy in 2024</a>.</li></ul><p></p><h1 id="data-economy-%F0%9F%92%B0">Data Economy 💰</h1><ul><li><a href="https://www.talend.com/blog/update-on-the-future-of-talend-open-studio/?ref=blef.fr"><strong>Talend</strong> will shut down Talend Open Studio</a>, their open-source version, on January 31. As a reminder, Talend was acquired by Qlik 9 months ago. This is probably a strategy to keep money flowing. See you Talend 👋.</li><li><a href="https://siliconangle.com/2023/12/18/alteryx-acquired-private-equity-firms-4-4b-deal/?ref=blef.fr" rel="noreferrer"><strong>Alteryx</strong> to be acquired by private equity firms in $4.4B deal</a>. OK.</li></ul><p></p><hr><p>See you soon ❤️.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — 2024 ]]></title>
                    <description><![CDATA[ 2024 — Let&#39;s conclude 2023 and open 2024 with an open article about what I do and what I&#39;d love to improve. This is a bit personal I hope you&#39;ll like it. ]]></description>
                    <link><![CDATA[ /2024/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 6596b93ce4dcbc000139b61d ]]></guid>
                    <pubDate><![CDATA[ 2024-01-07 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/photo-1491382825904-a4c6dca98e8c?q=80&amp;w=1000&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D" class="kg-image" alt="silhouette of person standing on sea dock under cloudy sky" loading="lazy"><figcaption><span style="white-space: pre-wrap;">Thoughts. Backward and forward. (</span><a href="https://unsplash.com/photos/silhouette-of-person-standing-on-sea-dock-under-cloudy-sky-g1TWbj5XYb4?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p><strong>Hello, it's 2024.</strong> I hope you're well and that you've ended 2023 on a high note with your loved ones. I wish you a Happy New Year and all the best for 2024. I'm very happy to have the privilege of corresponding with you and it honours me.</p><p>This edition of Data News will focus on the end of 2023 with a good retrospective about me and my activities—content and freelancing. Of course, it will also look ahead to 2024, and I'll try to set the vision for 2024. But you know how bad I am with goals.</p><h1 id="lets-wrap-up-2023">Let's wrap-up 2023</h1><p>In technology, we live in a fast-paced environment and when you compare yourself to others or try to keep up with all the news, it's easy to get FOMO. This year was perhaps the key moment when I managed to step away from vanity metrics and take breaks away from work. With the exception of my trip to Japan in 2019, I think this is the first time in my life it's happened in this way. 
Even my parents noticed I was away from the computer for a week over Christmas.</p><p>The next step for me is finally to recognise that even if I have a deep feeling that I haven't achieved anything in 2023, that I'm always behind on my ideas, that feeling is wrong and I should be proud.</p><p><strong>Proud about my content, my professional and my personal life.</strong></p><h3 id="content-%E2%80%94-i-dont-do-it-for-fame">Content — I don't do it for fame</h3><p>When I started content creation, my North Star was to create an international audience while being in France—hence in Europe—aiming, at my level, to balance everything in data being said/made in the US.</p><p>Then, I also stated that I wanted to produce content that I'd like to read; actually all the stuff I produce is something that helps me sharpen my ideas and save them for later. If I like my own content others will too, because I'm just a normal person.</p><p>What 2023 brought:</p><ul><li><strong>Followers</strong> —&nbsp;I doubled my followers on my 3 main platforms: I reached 4000 people on the blog, 8000 on <a href="https://www.linkedin.com/in/christopheblefari/?ref=blef.fr">LinkedIn</a> and almost 600 on <a href="https://twitter.com/_Blef?ref=blef.fr">Twitter</a> (even if I don't post that much there). My only target is, to be honest, to grow my blog; having 4000 people who trusted me enough to enter their email and validate the subscription is just crazy.</li><li><strong>The blog</strong> —&nbsp;46 articles published in 2023, this is way less than in 2022 but it's ok. In terms of views my blog got an increase of 67%, going up to 36k unique visitors, just wtf.</li><li><strong>Video &amp; audio</strong> — This year I've published 3 podcast episodes, which is way less than what I initially wanted, and I participated in 2 podcast episodes with <a href="https://www.datageneration.co/?ref=blef.fr">DataGen</a>. 
This is something I want to change in 2024.</li><li><strong>Conferences</strong> — I've spoken at a few meetups and online conferences; my main issue here is that every time I do a talk I want it to be a unique experience, so it takes so many prep hours. Something I should change maybe. On the same topic we ran the Paris Airflow meetup for 6 months but took a never-ending break after the summer holidays.</li></ul><p>In the end, I've done more things IRL and I've met a lot of people I wouldn't have met if I hadn't been visible online and I think that's the big W of 2023. That's what brings me the most joy in fact. <strong>Thank you all for being so nice and supportive with me ❤️.</strong></p><p>In conclusion I'm happy with this. But frustrated not to have done more. Still, I decided not to focus entirely on content creation so I kept room for freelancing and personal life.</p><h3 id="professional-%E2%80%94-the-limits-of-freelancing">Professional — The limits of freelancing</h3><p>I started my freelance career almost 4 years ago, in a world affected by COVID. At that time, remote work was the new norm in tech and people were surprised that I decided to go down the unstable route when COVID was already enough.</p><p>And I don't regret it: over the last few years I've worked on the most pleasant projects with people I really enjoyed working with. I'm lucky enough to be able to choose the companies I work with, people who understand my requirements. Being able to take time for myself while working wherever I want on projects I choose is something I wish everyone could do. It changed the way I see work.</p><p>It's time for a review of 2023.</p><ul><li><strong>Revenue</strong> — My freelancing activity is stable, last year I billed almost the same revenue as in 2022, around 140k€, while taking more big breaks. 
In terms of clients, I had fewer than in 2022 because my main client kept me busy.</li><li><strong>Projects</strong> — The few noticeable projects I've worked on<ul><li>I designed, developed and deployed a reporting application with Apache Superset. This is for the French gov, for more than 60k users (+12k weekly), it contains more than 10 dashboards with 5 custom visualisations in React—you can see an <a href="https://pad.numerique.gouv.fr/liivpMjUR4e0WcGTUpavkw?both&ref=blef.fr">example screenshot here</a>.</li><li>At the same time for the gov I've worked on a larger project to develop a private datalake to work on datasets with on-demand RStudio and Jupyter containers. For this I deployed a private Kubernetes cluster with MinIO, Keycloak, LDAP (for auth) and <a href="https://www.onyxia.sh/?ref=blef.fr">Onyxia</a> on top to deploy containers.</li><li>Then I've worked to implement a few small data stacks (revolving around ELT and a warehouse) and helped 3 companies migrate from something to dbt.</li></ul></li><li><strong>Partnerships</strong> — I had a few discussions with people about partnerships in 2023 and did not really push it forward, but I should next year.</li><li><strong>Angel investment</strong> — I did my first two investments recently; it continues my content North Star, putting light on stuff made in Europe. I'm so happy to finally open this path. Welcome <em>blef ventures.</em> More on this soon.</li><li><strong>Data engineering</strong> — Data engineering is changing and my work is changing. When I started data engineering in 2014, the term didn't even exist. Moving data from A to B has always been something fun for me. But over the years something changed and I might want something else, as evidenced by my work for the gov, where I do engineering. Data engineering is moving towards the left, creating a deeper gap between data users and the underlying layers. 
More on my views about this in a coming article.</li><li><strong>Other freelancers</strong> —&nbsp;When I started, in France we were only a few doing freelance data engineering. Now, because of all the layoffs, the way work is changing and the promise of money, a lot of people have entered the game. I've met a lot of them and tried to give advice, but it probably means I need to renew my offering as well.</li></ul><p>I'm happy about 2023, but it brought a few big issues in my daily routine that I want to fix next year:</p><ul><li>I feel alone in my daily work —&nbsp;working partially and remotely for companies isolated me a bit and after 4 years it's time for a social boost</li><li>Fuck dopamine —&nbsp;all the attention business distracts me so much and I lose focus so fast, especially when I open Twitch. I changed my phone routine and spend way less time on it but I have to change something on my computer as well.</li><li>Get things done —&nbsp;When it comes to finishing tasks, I'm good when it's for a client, but when it's for myself, there is huge room for improvement.</li><li>Administrative tasks —&nbsp;...</li></ul><h3 id="personal-%E2%80%94-catch-me-in-a-train">Personal —&nbsp;Catch me in a train</h3><p>In 2023 I've achieved a great Work-Life-Balance. I'm so happy and in love with my girlfriend, who is freelancing as well, so the rhythm is kinda the same for us and we have the same kind of issues even if we are not working in the same areas—she's in the movie industry.</p><ul><li><strong>Travel</strong> — We still mainly live in Berlin, and I travelled a lot between Paris—where my business is—and Berlin. These ~10 trips represent altogether around 80kg in carbon emissions. In comparison I went once to Malaga last year to follow my gf for work and it's just insanely more (300kg eCO2).</li><li><strong>Sport</strong> — Since August I started running again, 550 km since. Twice a week for 2 months then 4 times a week and I've never been happier. 
Some people might remember but it was my 2022 goal to run once a week. It took me 1.5 years to reach it. The target is 45 mins for a 10k next year. I also started bouldering, unexpectedly, and I like it. Still, in 2023 I cycled less than in previous years.</li><li><strong>Friendships</strong> — Met a few new friends and I'm so happy with this because once you pass 30 I feel that creating new relationships becomes way more difficult.</li></ul><h3 id="a-few-articles">A few articles</h3><p>That's a wrap for 2023, and because it's the Data News, here are a few articles that people (and I) liked in 2023 that you might find interesting:</p><ul><li><a href="https://count.co/canvas/pB7iGb4yyi2?ref=blef.fr">Count.co SQL guide</a> — An infinite canvas with content to discover and improve your SQL.</li><li><a href="https://maxhalford.github.io/blog/kpi-evolution-decomposition/?ref=blef.fr">Answering "Why did the KPI change?" using decomposition</a> — An excellent article by Max, I really enjoyed both the content and the form.</li><li><a href="https://docs.malloydata.dev/blog/2023-10-03-malloy-four/?ref=blef.fr#announcing-malloy-4-0">Malloy 4.0 announcement</a> — Malloy is a new language created to transform and analyse data, it transpiles to SQL. 
I did not have the time to play with it but I will in 2024.</li><li><a href="https://www.startdataengineering.com/post/code-patterns/?ref=blef.fr">Coding patterns in Python</a> — a great list of Python patterns to know.</li></ul><p></p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/photo-1699275303964-a9a1a8ae8c6b?q=80&amp;w=1000&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D" class="kg-image" alt="a close up of a cell phone screen with numbers on it" loading="lazy"><figcaption><span style="white-space: pre-wrap;">Here we go again (</span><a href="https://unsplash.com/photos/a-close-up-of-a-cell-phone-screen-with-numbers-on-it-mis7syjThUU?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><h1 id="2024">2024</h1><p>2024 marks the tenth anniversary of my entry into working life. My 6-month internship started in April 2014 and I was developing a custom drag-n-drop dashboard application with Django and D3 as a project. Fast-forward 10 years and projects haven't changed 😅.</p><p>Once again I wish you the best for 2024.</p><p>If you've been following me for a long time you know that I'm super bad at resolutions. Let's instead list ideas and stuff I'll be proud of in January 2025 when writing the 2024 post.</p><ul><li><strong>Keeping the habits</strong> — I often repeat this is not about motivation but discipline. Let's continue the habits I have: running and the newsletter.</li><li><strong>Adding new habits</strong> — I'd like to add at least 2 habits, especially in content creation: this year I want to reboot my YouTube channel and stick to podcast publication.</li><li><strong>Create courses</strong> — For all my pro career I've written courses—I've been teaching since 2015—but I've never really created something for people online. 
It's time.</li><li><strong>Release 2 products</strong> — I want to release 2 products that can live by themselves next year, one around the blog, the other we will see.</li><li><strong>I want to find my new pro journey. If you have ideas, hit me up.</strong></li><li><strong>Invest in 4 companies</strong> — If you are on this journey, same, hit me up.</li></ul><hr><p>Let's go and see you this Friday for a more traditional newsletter. I'm sorry for this long format but the change of year is noteworthy enough.</p><p>Thanks again ❤️. I wish you a good end of Sunday.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — December 2023 ]]></title>
                    <description><![CDATA[ Data News #23.52 — Last Data News of 2023, a curation of articles from December 2023 and a few news from my side. I wish you happy new year. ]]></description>
                    <link><![CDATA[ /data-news-week-23-52/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 6569e54ef8c8580001b82eef ]]></guid>
                    <pubDate><![CDATA[ 2023-12-31 ]]></pubDate>
                    <content>
<![CDATA[ <p>Hi, it's been a while since I last posted something here. Happy new year 🎉. I hope you haven't forgotten about me. A lot of things have been happening at the same time in my professional and personal life. To be honest, everything's been going well, but I've found it hard to find time to write among other things.</p><p>And that's the problem. I want to do so many things at once. It's quite funny because when I'm coaching someone, one of the first pieces of advice I give them is to stay focused and avoid multitasking, but when it comes to me... Yeah, you know.</p><p>However, some excellent articles have been written and I want to end 2023 with one last big wrap on these December articles. I'd also like to say hello to all the newcomers who arrived in December, thank you for your trust. We're going to get to know each other.</p><p>Before moving on to the Data News, a bit of personal news: in December, I took part in the MotherDuck meetup in Berlin. I presented what I believe to be the <a href="https://www.youtube.com/watch?v=eqyIiWMbXv4&ref=blef.fr">future, based on my DuckDB experiments</a>. I've especially been amazed by DuckDB in the browser with WASM. I'll also go to the <a href="https://duckdb.org/2023/10/06/duckcon4.html?ref=blef.fr">DuckCon in Amsterdam</a> on February 2nd—pm me if you're going.</p><p>End of January, on the 31st I'll speak at a <a href="https://datanosco.com/modern-data-stack/?ref=blef.fr">Modern Data Stack conf</a> in Paris, still about DuckDB, but this time in French. I also took part in my friend's podcast where <a href="https://www.youtube.com/watch?v=vEguK-J2QIg&ref=blef.fr">we discussed 3 trends in data</a>: data modeling, real-time analytics and DataOps.</p><p>My retroprojective—a retro of 2023 with a projection into 2024—will soon be written. 
It will talk about my search for a new spicy adventure, the fact that I've finally taken up running again, my new journey as an angel investor, and so on.</p><p>Enjoy this last 2023 Data News.</p><p></p><h1 id="ai-news-%F0%9F%A4%96">AI News 🤖</h1><ul><li><a href="https://bbycroft.net/llm?ref=blef.fr">An interactive 3D explanation of LLMs</a> — Explaining complex things the visual way is the best. This one details all the components in an LLM—a big part explains what a Transformer is.</li><li><a href="https://mehdio.substack.com/p/llms-for-builders-jargons-theory?ref=blef.fr">LLMs for builders: jargons, theory &amp; history</a> — Mehdi compiled in a large article all the necessary vocab to follow a basic conversation about generative AI. He even quickly explains how you can run a model on your computer.</li><li>Cocorico 🐓. Mistral AI, one of the French "OpenAI" startups, entered the field, setting new standards and gaining recognition. They released their first <a href="https://mistral.ai/news/la-plateforme/?ref=blef.fr">AI endpoints</a>: generative and embedding. When it comes to generation they currently have 3 models: tiny, small and medium, which perform well against GPT-3.5. <strong>At the same time they released Mixtral 8x7B, the first open-source model of this calibre under an Apache Licence</strong>. And the weights are open-source as well.</li><li><a href="https://blog.samaltman.com/what-i-wish-someone-had-told-me?ref=blef.fr">What I wish someone had told me</a> — It's borderline AI news, but as the author is Sam Altman, I think it belongs here. After the whole Hollywood thing around Sam being pushed out and then coming back, Sam clickbaited us. 
He's written 17 great HR / team building tips—but they have nothing to do with the drama we were all here for.</li><li><a href="https://blog.fal.ai/building-applications-with-real-time-stable-diffusion-apis/?ref=blef.fr">Building applications with real-time stable diffusion APIs</a> — fal has written a great article about how you can use WebSockets in Javascript to interact in real-time with a Python backend and stable diffusion. It includes a great demo of image generation from a sketch. It gives so many ideas.</li><li><a href="https://gael-varoquaux.info/programming/people-underestimate-how-impactful-scikit-learn-continues-to-be.html?ref=blef.fr">People underestimate how impactful Scikit-learn continues to be</a> — The year is coming to an end and LinkedIn is playing the 2024 predictions game. Obviously no one will get it right. At the same time one of the Scikit-learn co-founders put the church back at the city center—this is a French expression poorly translated. Scikit is still the most used library when you look at some numbers and LLMs have yet to bridge the gap in usage.</li><li><a href="https://platform.openai.com/docs/guides/prompt-engineering?ref=blef.fr">OpenAI prompt engineering guide</a> — Wow, an official guide to become a prompt engineer /s. 
Seriously, it seems it contains good tips for communicating with the model.</li><li>Google announced <a href="https://www.youtube.com/watch?v=UIZAiXYceBI&t=171s&ref=blef.fr">Gemini</a>, their new multimodal model "beating" GPT-4, but fooled us with an <a href="https://www.cnbc.com/2023/12/08/google-faces-controversy-over-edited-gemini-ai-demo-video.html?ref=blef.fr">edited video</a>.</li></ul><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li><a href="https://airflow.apache.org/docs/apache-airflow/stable/release_notes.html?ref=blef.fr#airflow-2-8-0-2023-12-14">Airflow 2.8 is out</a> — Airflow's release rhythm is crazy; I can't keep up with the awesome features that have been added this year. To finish the year the Airflow team released improvements to Datasets and a major step forward with the new <a href="https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/objectstorage.html?ref=blef.fr">Object Storage API</a>, which provides a generic abstraction over cloud storage to transfer data from one store to another.</li><li><a href="https://www.linkedin.com/feed/update/urn:li:activity:7139895340735844352/?updateEntityUrn=urn%3Ali%3Afs_updateV2%3A%28urn%3Ali%3Aactivity%3A7139895340735844352%2CFEED_DETAIL%2CEMPTY%2CDEFAULT%2Cfalse%29&ref=blef.fr">The EU AI Act has passed</a> — After many years working on the text the EU has voted for the AI Act to regulate the usage of AI on European citizens' data. It points to a cheat sheet that summarises what you need to know. 
In a few words: the AI Act provides a glossary defining what counts as AI and draws the boundaries between prohibited and high-risk AIs.</li><li><a href="https://cloud.google.com/bigquery/docs/write-sql-duet-ai?ref=blef.fr">BigQuery now integrates Duet AI</a> — to help you generate or complete SQL queries.</li><li><a href="https://aws.amazon.com/blogs/aws/new-amazon-s3-express-one-zone-high-performance-storage-class/?ref=blef.fr">AWS announced S3 Express</a> — S3 Express is a new single-zone storage class with 10x better performance (latency and parallelisation). Paul wrote a few <a href="https://quickwit.io/blog/s3-express-speculations?ref=blef.fr">speculations</a> about the new S3 tier—highly detailed, explaining very well what to expect—and DataEngineeringWeekly also wrote <a href="https://www.dataengineeringweekly.com/p/thoughts-on-amazon-express-one-and??ref=blef.fr">thoughts</a> about it. S3 will still be the king, or the <a href="https://juhache.substack.com/p/s3-is-the-goat?ref=blef.fr">GOAT</a>.</li><li><a href="https://newsletter.casewhen.xyz/p/data-explained-idempotence?ref=blef.fr">Idempotence</a> — Matt wrote an article explaining what idempotence is and why it matters in data engineering. Idempotence can be mathematically summarised as <em>f(f(x)) = f(x)</em>; it matters because for the same input you want a pipeline to produce the same output. Keep it in mind when designing a pipeline—it leads to great questions.</li><li><a href="https://nightingaledvs.com/have-i-resolved-the-pie-chart-debate/?ref=blef.fr">Have I Resolved the Pie Chart Debate?</a> —&nbsp;We all know pie charts are terrible.
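Going back to the idempotence bullet above, here is a minimal Python sketch of the f(f(x)) = f(x) property applied to a load step. The toy warehouse, table, and names are purely illustrative, not any particular tool's API: the point is that the load overwrites a partition instead of appending, so an accidental re-run changes nothing.

```python
from datetime import date

# Toy "warehouse": partition key -> rows (illustrative stand-in for real tables).
warehouse: dict[str, list[dict]] = {}

def load_day(day: date, rows: list[dict]) -> None:
    """Idempotent load: overwrite the day's partition instead of appending.

    Running it twice with the same input leaves the warehouse in the same
    state as running it once -- f(f(x)) = f(x).
    """
    warehouse[day.isoformat()] = rows  # replace the partition, never append

rows = [{"user": "a", "amount": 10}, {"user": "b", "amount": 5}]
load_day(date(2023, 12, 1), rows)
load_day(date(2023, 12, 1), rows)  # accidental re-run: still no duplicates
```

An append-based load (`warehouse[key].extend(rows)`) would double the data on every retry, which is exactly the failure mode idempotent design avoids.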
Nick proposes a way to fix the pie chart dilemma.</li><li><a href="https://more-than-numbers.count.co/p/how-to-know-if-your-data-team-is?ref=blef.fr">How to know if your data team is successful?</a> — Reflections on team performance and how to measure it.</li></ul><p></p><p></p><h1 id="engineering-stuff-%E2%9A%99%EF%B8%8F">Engineering stuff ⚙️</h1><ul><li><a href="https://netflixtechblog.com/our-first-netflix-data-engineering-summit-f326b0589102?ref=blef.fr">Netflix internal data engineering Summit</a> — The Netflix team organised an internal conference about data engineering topics. And they recorded it. 8 videos are on YouTube, and honestly this is awesome content for learning patterns and getting ideas from the best. They <a href="https://www.youtube.com/watch?v=QxaOlmv79ls&ref=blef.fr">still use technologies around the JVM</a> (Spark and Flink), but unsurprisingly everything revolves around Iceberg—which was created at Netflix.</li><li><a href="https://netflixtechblog.com/incremental-processing-using-netflix-maestro-and-apache-iceberg-b8ba072ddeeb?ref=blef.fr">Using Netflix Maestro and Apache Iceberg</a> — Going deeper into incremental processing, the engineering team details how they implemented it.</li><li><a href="https://tobikodata.com/introducing-wap-pattern-support.html?ref=blef.fr">Introducing WAP pattern support with Apache Iceberg</a> (with SQLMesh) — A small article about an important pattern to avoid putting bad data in production. The WAP pattern—Write-Audit-Publish—lets you first <em>write</em> the data to a staging layer where it is <em>audited</em>; if the audit is green, the data is <em>published</em> to the production layer.
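The three Write-Audit-Publish steps just described can be sketched in a few lines of Python. This is a toy illustration, not the SQLMesh or Iceberg API: the not-null check stands in for real data tests, and the lists stand in for staging and production tables.

```python
def write_audit_publish(rows, staging, production):
    """WAP: write to staging, audit there, publish only if the audit passes."""
    staging.clear()
    staging.extend(rows)  # 1. write: land the data in a staging layer
    # 2. audit: a simple not-null check stands in for real data tests
    if any(r.get("id") is None for r in staging):
        raise ValueError("audit failed: bad data stays out of production")
    production.clear()
    production.extend(staging)  # 3. publish: promote only audited data
    return production

staging, production = [], []
write_audit_publish([{"id": 1}, {"id": 2}], staging, production)
```

The key property is that a failing audit raises before production is touched, so consumers never see the bad batch; a real implementation would stage into a branch or staging table and swap it in atomically.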
This article is mainly an entry point to SQLMesh—a dbt alternative—that enables you to do it.</li><li><a href="https://medium.com/snowflake/how-to-integrate-databricks-with-snowflake-managed-iceberg-tables-7a8895c2c724?ref=blef.fr">Use Databricks to read Iceberg tables in Snowflake</a> 🙃 — This post was written by the Snowflake team, but it reflects Snowflake's strategy of attracting customers by being open, with Iceberg as the glue, winning the table format war. Still, don't do it; try to avoid a spaghetti data platform.</li><li><a href="https://maxhalford.github.io/blog/efficient-data-transformation/?ref=blef.fr">Efficient ELT refreshes</a> — Max details how he designed his ELT pipelines.</li><li><a href="https://dlthub.com/docs/blog/dlt-aws-taktile-blog?ref=blef.fr">Run dlt on Lambda to save on extract and load costs</a> — dlt is an open-source Python library for extract-load; if you want to cut the cost of the various cloud services that move data, it might be an alternative.</li><li><a href="https://eng.lyft.com/druid-deprecation-and-clickhouse-adoption-at-lyft-120af37651fd?ref=blef.fr">Druid deprecation and ClickHouse adoption at Lyft</a> — Data engineers love migrations. They love talking about the migrations they have done even more.
Moving from Druid to ClickHouse looks like a good improvement.</li><li><a href="https://medium.com/airbnb-engineering/data-quality-score-the-next-chapter-of-data-quality-at-airbnb-851dccda19c3?ref=blef.fr">Data Quality Score: next chapter of data quality at Airbnb</a> — After all the data cataloging visions and trends Airbnb has launched, this time they explain how they see dataset quality and how they score it.</li></ul><h3 id="other-reads">Other reads</h3><ul><li><a href="https://clickhouse.com/blog/the-state-of-sql-based-observability?ref=blef.fr">The state of SQL-based observability</a>, on the ClickHouse blog.</li><li><a href="https://leo-godin.medium.com/designing-one-big-table-obt-c1dd797d60ac?ref=blef.fr">Designing OBT</a> and comparing <a href="https://hubertdulay.substack.com/p/one-big-table-obt-vs-star-schema?ref=blef.fr">OBT with Star Schema</a>.</li><li><a href="https://www.datafold.com/blog/code-review-best-practices-for-analytics-engineers?utm_source=linkedin&utm_medium=social&utm_campaign=evergreen-datafold_ci">Code review best practices for Analytics Engineers</a>.</li><li><a href="https://towardsdatascience.com/self-service-data-analytics-as-a-hierarchy-of-needs-19bb68551640?ref=blef.fr">Self-Service data analytics as a hierarchy of needs</a>.</li><li><a href="https://robertsahlin.substack.com/p/easy-gcp-cost-anomaly-detection?r=7bvua&utm_campaign=post&utm_medium=web&ref=blef.fr">Easy GCP cost anomaly detection</a>.</li><li><a href="https://engineering.grab.com/an-elegant-platform?ref=blef.fr">An elegant platform</a>, Grab.</li><li><a href="https://medium.com/thefork/a-guide-to-mlops-with-airflow-and-mlflow-e19a82901f88?ref=blef.fr">A guide to MLOps with Airflow and MLflow</a>, TheFork.</li><li><a href="https://ntrs.nasa.gov/api/citations/19720005243/downloads/19720005243.pdf?ref=blef.fr">What made Apollo a success</a> —&nbsp;A NASA PDF with 8 articles reprinted from the March 1970 issue of Astronautics &amp; Aeronautics.</li></ul><p></p><h1 
id="data-economy-%F0%9F%92%B0">Data Economy 💰</h1><ul><li><a href="https://techcrunch.com/2023/12/11/mistral-ai-a-paris-based-openai-rival-closed-its-415-million-funding-round/?ref=blef.fr" rel="noreferrer"><strong>Mistral AI</strong> raised another €415m at a $2B valuation</a>. Mainly from US-based funds; it will probably change the governance of the company—is it still French?</li><li><a href="https://www.datacenterdynamics.com/en/news/elon-musks-generative-ai-startup-xai-looks-to-raise-1bn/?ref=blef.fr">Elon Musk’s generative AI startup <strong>xAI</strong> looks to raise $1bn</a>.</li><li><a href="https://siliconangle.com/2023/12/04/assemblyai-raises-50m-cloud-based-ai-speech-models/?ref=blef.fr"><strong>AssemblyAI</strong> raises $50m.</a> API endpoints to convert voice data to text in all its forms (transcripts, chapters, summaries, etc.).</li><li><a href="https://www.keboola.com/blog/keboola-data-operations-supercharger-raises-32m-in-series-a-funding?ref=blef.fr"><strong>Keboola</strong> raises $32m in Series A</a>. An all-in-one data platform for non-technical data users.</li><li><a href="https://www.eu-startups.com/2023/12/london-based-harriet-raises-e1-4-million-pre-seed-to-deliver-a-full-stack-ai-offering-to-hr-teams/?ref=blef.fr">London-based <strong>Harriet</strong> raises €1.4 million pre-seed</a>. An AI assistant using HR data to help employees.</li><li><a href="https://en.globes.co.il/en/article-ai-data-platform-vast-data-raises-118m-at-9b-valuation-1001464366?ref=blef.fr">AI data platform <strong>VAST Data</strong> raises $118m</a>.
An all-in-one platform for big corporations to do AI and engineering in the same place.</li><li><strong>Octolis</strong> <a href="https://www.linkedin.com/feed/update/urn:li:activity:7139959931775922176/?updateEntityUrn=urn%3Ali%3Afs_updateV2%3A%28urn%3Ali%3Aactivity%3A7139959931775922176%2CFEED_DETAIL%2CEMPTY%2CDEFAULT%2Cfalse%29&ref=blef.fr">has been acquired by <strong>Brevo</strong></a> (ex-SendinBlue). Octolis is a CDP / reverse-ETL solution and Brevo is a CRM, so the join makes total sense.</li></ul><hr><p>See you this Friday with a post opening 2024 🎊.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 23.46 ]]></title>
                    <description><![CDATA[ Data News #23.46 — Sam Altman has been fired as CEO of OpenAI, all the AI news, and catching up on the news from the last month. ]]></description>
                    <link><![CDATA[ /data-news-week-23-46/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 653ba5092d38dc000188ee17 ]]></guid>
                    <pubDate><![CDATA[ 2023-11-18 ]]></pubDate>
                    <content>
                        <![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/photo-1512617835784-a92626c0a554?q=80&amp;w=1000&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D" class="kg-image" alt="person in gray shirt with backpack walking on street between houses" loading="lazy"><figcaption><span style="white-space: pre-wrap;">Back in town (</span><a href="https://unsplash.com/photos/person-in-gray-shirt-with-backpack-walking-on-street-between-houses-YQSXw2YVqyU?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p>Hey, it's been a few weeks since I last wrote any news. It was a necessary break for me, combined with a bit of blank-page syndrome. Still, I've accumulated a lot of articles that I think belong in the Data News, so this week is a huge recap of the content produced over the last month.</p><p>I hope you will enjoy the selection.</p><p>On Monday I'll also give a talk at the <a href="https://www.eventbrite.com/e/motherduck-duckdb-user-meetup-de-november-2023-edition-2-tickets-742532794577?ref=blef.fr">Berlin MotherDuck meetup</a>: <em>DuckDB experiments, a glimpse of the future</em>. 
It won't be streamed live, but I think the recording will be published on YouTube after the event.</p><figure class="kg-card kg-image-card kg-card-hascaption"><a href="https://t.co/ErR21C55I4?ref=blef.fr"><img src="https://www.blef.fr/content/images/2023/11/gUwPXx9P.png" class="kg-image" alt="" loading="lazy" width="1000" height="500" srcset="https://www.blef.fr/content/images/size/w600/2023/11/gUwPXx9P.png 600w, https://www.blef.fr/content/images/2023/11/gUwPXx9P.png 1000w" sizes="(min-width: 720px) 720px"></a><figcaption><span style="white-space: pre-wrap;">Not sure there are still free seats, but if you want to come, reach out to me.</span></figcaption></figure><h1 id="ai-news-%F0%9F%A4%96">AI News 🤖</h1><ul><li><strong>Sam Altman has </strong><a href="https://www.theverge.com/2023/11/17/23965982/openai-ceo-sam-altman-fired?ref=blef.fr"><strong>been fired as CEO of OpenAI</strong></a><strong>.</strong><ul><li>OpenAI announced this <a href="https://openai.com/blog/openai-announces-leadership-transition?ref=blef.fr"><em>leadership transition</em></a> yesterday. At the same time, Greg Brockman (current President and co-founder) will step down as chairman of the board and Mira Murati (current CTO) will become interim CEO. It was a <a href="https://twitter.com/gdb/status/1725736242137182594?ref=blef.fr">brutal</a> <a href="https://x.com/karaswisher/status/1725718391548207246?s=20&ref=blef.fr">decision</a>.</li><li>The official public reason given was "[Sam] <em>was not consistently candid in his communications with the board, hindering its ability to exercise its responsibilities.</em> The board no longer has confidence.".</li><li>The Internet has spent the last 15 hours guessing what this really meant. 
Here are a few theories I've read: a security leak occurred and Sam/Greg hid it from the board, Sam is <a href="https://www.lesswrong.com/posts/QDczBduZorG4dxZiW/sam-altman-s-sister-annie-altman-claims-sam-has-severely?ref=blef.fr">publicly accused of sexual abuse</a> by his sister, Sam has different views on the company vision that don't please the board—esp. regarding <a href="https://twitter.com/karaswisher/status/1725678074333635028?ref=blef.fr">profits</a> or <a href="https://www.indiatoday.in/technology/news/story/openai-boss-sam-altman-confirms-they-are-working-on-chatgpt-5-says-ai-does-not-need-heavy-regulation-yet-2464153-2023-11-17?ref=blef.fr">AI regulations</a>, or Sam invested in an OpenAI competitor. Either way, we'll see in a few days.</li></ul></li><ul><li>People are mostly saddened by the news because Sam was a publicly beloved and transparent CEO who changed AI. Comparisons with the coup that overthrew Steve Jobs back in the day abound.</li></ul><li>The news arrived a few days after <a href="https://www.youtube.com/watch?v=U9mJuUkhUzk&ref=blef.fr">OpenAI dev-day</a>, a public conference announcing new products and features. Mainly they announced <a href="https://openai.com/blog/introducing-gpts?ref=blef.fr">GPTs</a>, a no-code UI to create custom versions of ChatGPT.</li><li>Other AI announcements<ul><li><a href="https://www.youtube.com/watch?v=NrQkdDVupQE&ref=blef.fr">Github Universe</a> was the moment to announce more Copilot everywhere in the GitHub ecosystem. The most interesting thing was that GitHub will introduce <a href="https://github.blog/2023-10-02-introducing-the-new-apple-silicon-powered-m1-macos-larger-runner-for-github-actions/?ref=blef.fr">M1 and GPU runners</a>.</li><li>xAI—the company founded by Musk after quitting OpenAI—announced <a href="https://x.ai/?ref=blef.fr">Grok</a>. 
It's a 33B-parameter LLM.</li><li>Germany wants to build the European OpenAI competitor and invested $500m in <a href="https://aleph-alpha.com/?ref=blef.fr">Aleph Alpha</a>, a startup. On the landing page it's clear that the focus is on building <em>safe AI</em>.</li><li><a href="https://kyutai.org/?ref=blef.fr">Kyutai</a> was announced at an AI Pulse event at Station F, Paris. <strong>Kyutai is an open science lab to build and democratize AGI—</strong>artificial general intelligence<strong>—through open science</strong>. They carefully picked open science rather than open source. The team looks great.</li><li>The GPU availability competition is on. Y Combinator announced a <a href="https://twitter.com/ycombinator/status/1721920476694634709?ref=blef.fr">Microsoft partnership and priority access</a> to compute resources. This is also linked to <a href="https://www.theverge.com/2023/11/15/23960345/microsoft-cpu-gpu-ai-chips-azure-maia-cobalt-specifications-cloud-infrastructure?ref=blef.fr">Microsoft making custom AI chips</a>.</li><li><a href="https://www.whitehouse.gov/briefing-room/statements-releases/2023/10/30/fact-sheet-president-biden-issues-executive-order-on-safe-secure-and-trustworthy-artificial-intelligence/?ref=blef.fr">Biden issues executive order on safe, secure, and trustworthy AI</a>.</li></ul></li><li>2 reports with hundreds of pages about AI were published — The <a href="https://www.stateof.ai/?ref=blef.fr">State of AI report</a> and <a href="https://www.coatue.com/blog/perspective/ai-the-coming-revolution-2023?ref=blef.fr">AI: The Coming Revolution</a>. Both look full of interesting things, but I haven't read them yet.</li><li><a href="https://arxiv.org/abs/2311.00871?ref=blef.fr">A Google team wrote a paper</a> "demonstrating various failure modes of transformers and degradation of their generalization for even simple extrapolation tasks". 
In a nutshell, LLMs can't generalize.</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://m.media-amazon.com/images/M/MV5BODQwODk5NjcxOF5BMl5BanBnXkFtZTgwMDMwMDgyNTM@._V1_.jpg" class="kg-image" alt="Silicon Valley (TV Series 2014–2019) - IMDb" loading="lazy"><figcaption><span style="white-space: pre-wrap;">🍿 (© Silicon Valley HBO series)</span></figcaption></figure><p></p><p>Now that I've given you the general news, let's jump to a few AI use-cases.</p><ul><li><a href="https://ai.meta.com/blog/brain-ai-image-decoding-meg-magnetoencephalography/?ref=blef.fr">Towards a real-time decoding of images from brain activity</a> — This is crazy: Meta researchers have been able to create a system that predicts the image a person is seeing from brain magnetoencephalography signals.</li><li><a href="https://engineering.grab.com/llm-powered-data-classification?ref=blef.fr">LLM-powered data classification for data entities at scale</a> — Grab explains how you can use LLMs for classification, in this case identifying PII in databases. They detail the real-time architecture behind the system and give an example of the prompt they use.</li><li><a href="https://blog.developer.atlassian.com/generative-ai-the-intern-you-cant-trust/?ref=blef.fr">Generative AI, the intern you can’t trust</a> — A small post from the Atlassian blog giving 3 ways to improve LLM accuracy.</li><li><a href="https://www.canva.dev/blog/engineering/summarise-post-incident-reviews-with-gpt4/?ref=blef.fr">Summarizing post incident reviews with GPT-4</a> — Canva has so many incidents that they need an LLM to summarize them for reporting purposes 🙃. 
Obviously I'm joking, but while the use-case is interesting I question the real need behind it.</li><li><a href="https://netflixtechblog.com/building-in-video-search-936766f0017c?ref=blef.fr">Building in-video search at Netflix</a> — What if you could prompt for a specific situation and get all the movies—at the relevant timecodes—presenting that situation? This is so cool.</li><li><a href="https://medium.com/artefact-engineering-and-data-science/llms-deployment-a-practical-cost-analysis-e0c1b8eb08ca?ref=blef.fr">Cost analysis of deploying LLMs</a> — All of this is cool but pricey; this post does a good exploration of the costs.</li></ul><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><p>Because the AI News is pretty packed and I still want you to enjoy this newsletter, articles will get shorter comments than usual. But still spicy opinions, because, you know, it's me.</p><ul><li>Data contracts are undoubtedly a new growth lever for data observability companies and data VCs. Soda announced their <a href="https://www.soda.io/resources/soda-releases-oss-data-contract-engine?ref=blef.fr" rel="noreferrer">open-source data contracts</a> engine. It's done in YAML. Here's another example of contracts with <a href="https://dataqualityguru.substack.com/p/data-contracts-schema-validation?ref=blef.fr">msgspec</a>.</li><li>NVIDIA research has been able to supercharge pandas with cuDF to <a href="https://colab.research.google.com/drive/12tCzP94zFG2BRduACucn5Q_OcX1TUKY3?ref=blef.fr">run pandas on GPUs</a>.</li><li>Wes McKinney, pandas and Arrow creator, <a href="https://wesmckinney.com/blog/joining-posit/?ref=blef.fr">will join Posit</a>—the company behind RStudio—as a Principal Architect. 
His new role will probably ease the integration of all the Python tooling into the Posit ecosystem, even if that has already been underway for months.</li><li><a href="https://www.getdbt.com/blog/dbt-labs-appoints-tech-veteran-brandon-sweeney-as-president-and-chief-operating-officer?ref=blef.fr">dbt Labs hired Brandon Sweeney as new President and COO</a>. Brandon previously led Revenue at HashiCorp—the same company that recently <a href="https://thenewstack.io/hashicorp-abandons-open-source-for-business-source-license/?ref=blef.fr">changed its licensing to BSL</a> and got backlash from the tech community for it. Our prayers go to dbt Core.</li><li>Onehouse, Microsoft and Google are working on a table format standard called <a href="https://onetable.dev/?ref=blef.fr">Onetable</a>. This isn't a new format but a way to create interoperability between Delta, Iceberg and Hudi.</li><li>If you are curious about <a href="https://tabular.io/blog/iceberg-hudi-acid-guarantees/?ref=blef.fr">Iceberg and Hudi ACID guarantees</a>, read the article.</li><li>Code faster with <a href="https://astral.sh/blog/the-ruff-formatter?ref=blef.fr">Ruff</a>, a Python formatter written in Rust. All the time you used to waste waiting for black to reformat your code can now be put to good use.</li></ul><p></p><p>Taking other companies as examples is often a good way to get ideas:</p><ul><li><em>Gusto, </em><a href="https://engineering.gusto.com/data-engineering-on-people-data/?ref=blef.fr"><em>data platform to generate HR insights</em></a> — All data is sent to OneModel—a paid HR tool—and to Redshift, with Tableau for visualisation.</li><li><em>Criteo, </em><a href="https://medium.com/criteo-engineering/how-we-compute-data-lineage-at-criteo-b3f09fc5c577?ref=blef.fr"><em>how to compute data lineage</em></a> — Criteo has a homemade application for data documentation called... 
Datadoc, in which they compute their cross-asset lineage.</li><li><em>Picnic, </em><a href="https://blog.picnic.nl/the-art-of-master-data-management-at-picnic-48b5cf978221?ref=blef.fr"><em>master data management</em></a> — Creating MDM for retailers is like the One Piece.</li><li><em>LinkedIn, </em><a href="https://engineering.linkedin.com/blog/2023/revolutionizing-real-time-streaming-processing--4-trillion-event?ref=blef.fr"><em>how to use 4 trillion events daily</em></a> — Leveraging Apache Beam and Samza.</li><li><em>Netflix, </em><a href="https://netflixtechblog.com/streaming-sql-in-data-mesh-0d83f5a00d08?ref=blef.fr"><em>streaming SQL</em></a> — Flink architecture in a data mesh organisation.</li><li><em>Zalando, </em><a href="https://engineering.zalando.com/posts/2023/11/patching-pgjdbc.html?ref=blef.fr"><em>how to patch Postgres and fix WAL</em></a> — The Zalando team explains the patch they made to the Postgres JDBC driver to fix growth in the write-ahead log.</li><li><em>GoDaddy, </em><a href="https://www.godaddy.com/engineering/2023/10/26/layered-architecture-for-a-data-lake/?ref=blef.fr"><em>layered architecture for a data lake</em></a> — Naming convention ideas and 5 data layers: source, raw, clean, enterprise and analytical.</li></ul><p></p><p>A few food-for-thought articles about data concepts and roles.</p><ul><li><a href="https://towardsdatascience.com/from-data-platform-to-ml-platform-4a8192edab5d?ref=blef.fr">From data platform to ML platform</a> — How data platforms are built incrementally, first for analytical use-cases and then adding ML capabilities.</li><li><a href="https://www.patch.tech/blog/why-you-should-not-build-directly-on-data-warehouse/?ref=blef.fr">Why you should not build apps directly on the data warehouse</a>.</li><li><a href="https://medium.pimpaudben.fr/sql-is-not-designed-for-analytics-079fc97b139c?ref=blef.fr">SQL is not designed for analytics</a> and why <a 
href="https://whynowtech.substack.com/p/malloy-data?ref=blef.fr">Malloy</a> is paving the way for the future.</li><li><a href="https://towardsdatascience.com/would-you-become-a-data-strategist-59c0a179df44?ref=blef.fr">Would you become a data strategist?</a> — A great post from Marie about a key analytical role shaping company strategies.</li><li><a href="https://luminousmen.com/post/two-archetypes-of-data-engineers/?ref=blef.fr">Two archetypes of data engineers</a> — Closer to the business or closer to the tech. The best data engineering teams successfully blend the two archetypes.</li><li><a href="https://tech.instacart.com/the-economics-team-at-instacart-94c48db951e8?ref=blef.fr">The Economics team at Instacart</a> — Or how economists and PhDs become more tech-savvy, enabling more and more relevant usage of data.</li></ul><p></p><h1 id="data-economy-%F0%9F%92%B0">Data Economy 💰</h1><ul><li><strong>ZenML</strong> <a href="https://www.zenml.io/blog/were-revving-up-zenml-exciting-news-and-what-it-means-for-you?ref=blef.fr">raises $3.7m additional Seed</a>. An MLOps platform that works with all clouds and tools.</li><li>Snowflake acquires <a href="https://www.linkedin.com/posts/sisu-data_sisu-is-joining-forces-with-snowflake-we-activity-7119690491054485505-7qtI/?ref=blef.fr" rel="noreferrer"><strong>Sisu</strong></a> and <a href="https://www.snowflake.com/blog/snowflake-to-acquire-ponder/?ref=blef.fr"><strong>Ponder</strong></a>. The first is an engine to monitor business metrics while the second is a tool to run pandas at scale.</li><li><a href="https://techcrunch.com/2023/11/01/yahoo-spin-out-vespa-lands-31m-investment-from-blossom/?guccounter=1&ref=blef.fr">Yahoo spin-out <strong>Vespa</strong> raises $31m</a>. Vespa is a search engine and a vector database. 
This is good timing to open-source it for AI use-cases.</li><li><a href="https://aleph-alpha.com/aleph-alpha-raises-a-total-investment-of-more-than-half-a-billion-us-dollars-from-a-consortium-of-industry-leaders-and-new-investors/?ref=blef.fr"><strong>Aleph Alpha</strong> raises $500m Series B</a> to build the German OpenAI.</li><li><a href="https://kyutai.org/CP_Kyutai_AI_EN.pdf?ref=blef.fr" rel="noreferrer"><strong>Kyutai</strong> is funded with $330m</a> from 2 French billionaires and Eric Schmidt—ex-Google CEO. Kyutai is an open science lab that wants to build AGI. The team has a good résumé and the science committee looks awesome (Yejin Choi, Yann LeCun and Bernhard Schölkopf).</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/photo-1572817544472-5fa378349697?q=80&amp;w=1000&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D" class="kg-image" alt="green palm trees beside building during daytime" loading="lazy"><figcaption><span style="white-space: pre-wrap;">Dreaming of sun (</span><a href="https://unsplash.com/photos/green-palm-trees-beside-building-during-daytime-oq9XjYRrLaI?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><hr><p>Ghost recently implemented a recommendation feature, so I've added a few folks I like to read on the internet.</p><div class="kg-card kg-button-card kg-align-center"><a href="#/portal/recommendations" class="kg-btn kg-btn-accent">Read a few friends</a></div><p>See you next week ❤️.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 23.42 ]]></title>
                    <description><![CDATA[ Data News #23.42 — dbt Mesh and a new dbt alternative, a few fundraises, OpenAI's crazy numbers, Meta banning Python ads, and more. ]]></description>
                    <link><![CDATA[ /data-news-week-23-42/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 65322dc27515250001782524 ]]></guid>
                    <pubDate><![CDATA[ 2023-10-20 ]]></pubDate>
                    <content>
                        <![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/photo-1598181162450-56296491375c?auto=format&amp;fit=crop&amp;q=80&amp;w=1000&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D" class="kg-image" alt="white sheep on green grass field near body of water during daytime" loading="lazy"><figcaption><span style="white-space: pre-wrap;">Writing about dbt like a sheep (</span><a href="https://unsplash.com/photos/white-sheep-on-green-grass-field-near-body-of-water-during-daytime-PdmZgghWImI?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p>Hey, this week Coalesce—the dbt Labs annual conference—took place. Over 3 days, people from around the world shared how they use dbt. As usual, I'll write a takeaway post after binge-watching all the keynotes, but that's for next week. Still, the dbt Labs <a href="https://www.getdbt.com/blog/new-dbt-cloud-features-announced-at-coalesce-2023?ref=blef.fr">announcements</a> were mainly about dbt Cloud, with great features to drive adoption of the paid product.</p><p>They announced dbt Mesh, a product enabling cross-project dependencies for teams with multiple dbt projects. They also released an Explorer view that lets you navigate through all your projects and see models, macros and more directly in one nice graph.</p><p>Does this mean that you have to use dbt Cloud to get a multi-project setup? No, you can activate <a href="https://www.blef.fr/dbt-multi-project-collaboration/">multi-project collaboration</a> with dbt Core. 
I've written a guide that helps you do it.</p><div class="kg-card kg-button-card kg-align-center"><a href="https://www.blef.fr/dbt-multi-project-collaboration/" class="kg-btn kg-btn-accent">Read my dbt multi-project guide</a></div><p>📺 On the content side, I'll also present the Fancy Data Stack project next week at the <a href="https://www.accelevents.com/e/deml-summit-2023?ref=blef.fr#about">Data Engineering And Machine Learning Summit 2023</a> organised by Seattle Data Guy. I'll be online on Thursday the 26th at 5PM CEST. Add it to your calendar and sign up for the conference—the list of speakers is insane.</p><p>Data News is packed this week; take time to enjoy it. Rainy times are coming, so you can see it as a gift 🎁.</p><p></p><h1 id="enough-dbt-use-lea-%F0%9F%A5%B0">Enough dbt, use lea 🥰</h1><p>Max—the first Data News member 🤗—open-sourced <a href="https://github.com/carbonfact/lea?ref=blef.fr">carbonfact<strong>/lea</strong></a> this week. lea aims to be a minimalist alternative to dbt, fixing a few flaws that come with dbt. You can even see the traditional Jaffle Shop example done in lea.</p><p>What are the main differences?</p><ul><li>You configure lea with env variables.</li><li>A <code>lea prepare</code> command creates the database objects that need to exist (datasets, schemas, etc.). Schemas are interpreted from the folder structure (with DuckDB).</li><li>lea understands the relationships between views, so you don't need a ref. Jinja templating is still supported, though.</li><li>Tests are added directly in the SQL code at the target column. For instance, if you need to test uniqueness on a column, you add the @UNIQUE decorator. 
Singular tests are still supported.</li><li>lea generates documentation as Markdown in the workdir.</li><li>Other cool features: <code>lea teardown</code> deletes database objects, lea diff shows table schema differences, and you can write Python models as long as they return a DataFrame.</li></ul><p>Max also wrote a nice post about downstream data issues—the main problem driving the data contracts space: <a href="https://maxhalford.github.io/blog/shit-flows-downhill-but-not-at-carbonfact/?ref=blef.fr">Sh*t flows downhill, but not at Carbonfact</a>. You should read it because it gives another perspective on how to fix the problem.</p><h1 id="gen-ai-%F0%9F%A4%96">Gen AI 🤖</h1><ul><li><a href="https://huggingface.co/spaces/Vokturz/can-it-run-llm?ref=blef.fr">Can you run it?</a> — A HuggingFace app that, given your machine specs, tells you what you need to run an LLM for inference or training.</li><li><a href="https://fondant.ai/en/latest/announcements/CC_25M_community/?ref=blef.fr">25 million Creative Commons image dataset released</a> — Fondant, an open-source processing framework, released publicly available images from web crawling along with their associated licenses.</li><li><a href="https://cloud.google.com/vertex-ai/docs/featurestore/latest/overview?ref=blef.fr">New Vertex AI Feature Store</a> — GCP Vertex AI is the place to do "serverless" AI. It's awesome to see this directly integrated with BigQuery, as it obviously brings simplicity. 
In public preview.</li></ul><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/photo-1597953601389-5f66416e47c4?auto=format&amp;fit=crop&amp;q=80&amp;w=1000&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D" class="kg-image" alt="panda bear on green grass during daytime" loading="lazy"><figcaption><span style="white-space: pre-wrap;">Pandas appreciation post (</span><a href="https://unsplash.com/photos/panda-bear-on-green-grass-during-daytime-KhD1zZIieJ0?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><ul><li><a href="https://lerner.co.il/2023/10/19/im-banned-for-life-from-advertising-on-meta-because-i-teach-python/?ref=blef.fr">Meta banned a creator for selling Python and Pandas courses</a> — The automated AI filters identified the ads as violating wildlife protection rules. The irony is that these algorithms are probably written in Python. Do we still want a future where AI decides for us?</li><li>At the same time, luckily for us, <a href="https://engineering.fb.com/2023/10/18/ml-applications/meta-ai-custom-silicon-olivia-wu/?ref=blef.fr">Meta is creating custom silicon for AI</a>.</li><li><a href="https://twitter.com/DrJimFan/status/1711797997791838394?ref=blef.fr">Disney's new intelligent robot / toy</a> — The entertainment company showcased a new toy with impressive capabilities, opening doors for a fun future for kids.</li><li><a href="https://medium.com/blablacar/11-lessons-learned-managing-a-platform-team-within-a-data-mesh-0b191b7652ce?ref=blef.fr">11 lessons learned managing a platform team within a data mesh</a> — BlaBlaCar, a carpooling company, is well known in France for recently adopting a data mesh organisation.
This post gives great insights about the impact on the data platform team.</li><li><a href="https://cube.dev/blog/the-need-for-an-open-standard-for-the-semantic-layer?ref=blef.fr">The need for an open standard for the semantic layer</a> — Following news about dbt's semantic layer, this post from Cube opens the door to defining a standard when it comes to semantics. What should be the main entity type at the center of the semantics: metrics or datasets?</li><li><a href="https://kestra.io/blogs/2023-10-11-why-ingestion-will-never-be-solved?ref=blef.fr">Why data integration will never be fully solved</a> — Anna covers a few data integration tools and tries to explain why this is such a tricky field that cannot be solved with a single cloud tool.</li><li><a href="https://www.popsink.com/?ref=blef.fr">Popsink, a real-time ingestion and processing platform</a>, released their self-service offering this week. They are French and they built a great platform on top of Redpanda and Flink, claiming to be 4x cheaper than Fivetran for data replication. An echo of the previous bullet point.</li><li>In the same vein as Popsink, an example of how Fortis Games, a game studio, <a href="https://thenewstack.io/a-real-time-data-platform-for-player-driven-game-experiences/?ref=blef.fr">developed a real-time platform with the same technologies</a>.</li><li><a href="https://www.5x.co/articles/rise-of-the-data-generalist?ref=blef.fr">Rise of the data generalist: smaller teams, bigger impact</a> — You don't need to convince me. In all my experience and the talks I have with people, smaller teams obviously drive bigger impact.</li><li><a href="https://www.entreprises.gouv.fr/fr/numerique/enjeux/la-strategie-nationale-pour-l-ia?ref=blef.fr">La stratégie nationale pour l'intelligence artificielle</a> — In French.
This is about what France wants to do by 2025 to drive AI adoption.<ul><li>3500 new students and at least 200 additional theses on AI topics</li><li>Capture between 10% and 15% of the world market share when it comes to embedded AI</li><li>and more measures to attract foreign talent and help companies</li></ul></li></ul><p></p><h1 id="engineering-stuff">Engineering stuff</h1><ul><li><a href="https://github.com/dagster-io/dagster-open-platform?ref=blef.fr">Dagster released their internal data platform in the open</a> — Surprise: they use Dagster as the orchestrator.</li><li>dbt related stuff<ul><li><a href="https://medium.com/intercom-rad/to-dbt-or-not-to-dbt-4e2d04f27d3a?ref=blef.fr">To dbt or not to dbt</a> — A few lessons learned while implementing dbt at Intercom.</li><li><a href="https://steep.app/blog/metricflow?ref=blef.fr">dbt MetricFlow, semantic layer 2.0</a> — A quick analysis of the new semantic layer vision.</li><li><a href="https://xebia.com/blog/data-contracts-and-schema-enforcement-with-dbt/?ref=blef.fr#:~:text=Data%20contracts%2C%20much%20like%20an,of%20data%20and%20output%20models">Data contracts and schema enforcement with dbt</a> — It comes with dbt Mesh and gives a lot of new metadata over your models to bring more software engineering practices to dbt development.</li></ul></li><li><a href="https://cassio-bolba.medium.com/data-modelling-x-one-big-table-obt-the-end-of-data-models-4e8739b3937e?ref=blef.fr">Pros and cons of One Big Table data modeling</a> — I really like OBT, it brings a lot of simplicity, especially in downstream usage, but obviously it has known issues.</li><li><a href="https://blog.devgenius.io/what-is-data-versioning-and-3-ways-to-implement-it-4b6377bbdf93?ref=blef.fr">What is data versioning and 3 ways to implement it</a> — A comparison between change data capture (CDC), dimensional modeling and slowly changing dimensions (SCD).</li><li><a 
href="https://medium.com/@patrick.ml.walsh/mage-bigquery-and-bundled-up-bike-trips-672c041f808a?ref=blef.fr">Mage, BigQuery and bundled-up bike trips</a> — A homemade project where Patrick used Montreal's public bike-counter data.</li><li><a href="https://itnext.io/replace-dockerfile-with-buildpacks-f7e435ad2bfc?ref=blef.fr">Replace Dockerfile with Buildpacks</a>.</li></ul><p></p><h1 id="data-economy-%F0%9F%92%B0">Data Economy 💰</h1><ul><li><a href="https://seekingalpha.com/news/4020481-microsoft-backed-openai-nears-stock-sale-at-90b-valuation-report?ref=blef.fr" rel="noreferrer"><strong>OpenAI</strong> is near $90b valuation</a>. With a product launched only in late 2022. It seems OpenAI is doing more than $100m in revenue per month. The numbers are just crazy.</li><li><a href="https://www.lonestarlunar.com/copy-of-declaration-of-independence?ref=blef.fr" rel="noreferrer"><strong>Lonestar</strong> raises additional $825k in Seed</a>. Lonestar provides immutable storage to be sent to the moon as a backup service. Yep, on the moon 🌕.</li><li><a href="https://blog.aindo.com/posts/fundingA?ref=blef.fr" rel="noreferrer"><strong>Aindo</strong> raises €6m Series A</a>. Aindo is a synthetic data solution: it provides a platform to generate synthetic data from your real data in order to preserve statistical relevance while removing sensitive information. With synthetic data you can then publicly seek help from the world's data scientists.</li><li><a href="https://techcrunch.com/2023/10/17/scylladb-raises-43m-to-scale-its-nosql-database-platform/?ref=blef.fr" rel="noreferrer"><strong>ScyllaDB</strong> raises $43M Series C</a>.
It's a NoSQL database compatible with Apache Cassandra interfaces, and <a href="https://github.com/scylladb/scylladb?ref=blef.fr">open-source</a>.</li><li><a href="https://www.getpantomath.com/post/pantomath-raises-14-million-in-series-a-led-by-sierra-ventures?ref=blef.fr" rel="noreferrer"><strong>Pantomath</strong> raises $14m Series A</a>. A new data pipeline observability solution enters the game.</li><li><a href="https://techcrunch.com/2023/10/11/data-transformation-startup-prophecy-lands-35m-investment/?ref=blef.fr" rel="noreferrer"><strong>Prophecy</strong> raises $35m Series B</a>. This is a drag-n-drop data transformation product that I had never heard of.</li></ul><hr><p>See you next week ❤️.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ dbt multi-project collaboration ]]></title>
                    <description><![CDATA[ Use cross-project references without dbt Cloud. This article showcases what you can do to activate dbt multi-project collaboration. ]]></description>
                    <link><![CDATA[ /dbt-multi-project-collaboration/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 652fe892e71341000187073e ]]></guid>
                    <pubDate><![CDATA[ 2023-10-19 ]]></pubDate>
                    <content>
                        <![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/photo-1557133285-a2b6b21f6e13?auto=format&amp;fit=crop&amp;q=80&amp;w=1000&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D" class="kg-image" alt="yellow and blue metal machine" loading="lazy" width="1000" height="667" srcset="https://images.unsplash.com/photo-1557133285-a2b6b21f6e13?auto=format&amp;fit=crop&amp;q=80&amp;w=600&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D 600w, https://images.unsplash.com/photo-1557133285-a2b6b21f6e13?auto=format&amp;fit=crop&amp;q=80&amp;w=1000&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D 1000w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">cross-project dependencies (</span><a href="https://unsplash.com/photos/yellow-and-blue-metal-machine-MlpVwIvHyGM?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p>Over the last few years, dbt has become a de facto standard enabling companies to collaborate easily on data transformations. With dbt, you can apply software engineering practices to SQL development. Managing your SQL codebase has never been easier.</p><p>So, yes, dbt is cool but there is a common pattern with it: you accumulate SQL queries. If your implementation of dbt is successful, many teams will use it and many business use cases will end up as SQL queries in your warehouse. Fast forward 2 years and you find yourself with hundreds or thousands of SQL queries. 
Whatever the number, there will be a critical point at which a single project no longer scales.</p><div class="kg-card kg-callout-card kg-callout-card-grey"><div class="kg-callout-emoji">❓</div><div class="kg-callout-text">Read my guides <a href="https://www.blef.fr/get-started-dbt/" rel="noreferrer">How to get started with dbt</a> and <a href="https://www.blef.fr/manage-and-schedule-dbt/" rel="noreferrer">how to manage and schedule dbt</a> as a preview about dbt.</div></div><p>Having too many models in a single repository becomes unmanageable:</p><ul><li>Governance — <em>many data owners</em></li><li>Data domains — <em>a lot of different concepts that you would like to isolate as single units</em></li><li>Name clashes — <em>you can't have 2 models with the same name in a project</em></li><li><em>and more </em>😅</li></ul><p>This is when you consider a multi-project configuration for your dbt implementation. With a multi-project configuration, you can imagine isolated dbt projects with possible connections between them. We can draw a parallel with <a href="https://en.wikipedia.org/wiki/Microservices?ref=blef.fr">microservice</a> architecture. Each dbt project is like a microservice and instead of exposing an HTTP API, it exposes tables with enforced contracts.</p><p>Initially <strong>cross-project references were a feature intended for dbt Core</strong> (cf. roadmaps <a href="https://github.com/dbt-labs/dbt-core/blob/main/docs/roadmap/2022-08-back-for-more.md?ref=blef.fr#v15-next-year">2022-08</a> and <a href="https://github.com/dbt-labs/dbt-core/blob/main/docs/roadmap/2023-02-back-to-basics.md?ref=blef.fr#multi-project-deployments-v15" rel="noreferrer">2023-02</a>). But after research and first developments, dbt Labs decided that multi-project collaboration <a href="https://github.com/dbt-labs/dbt-core/discussions/6725?ref=blef.fr#discussioncomment-6905854">would become a feature of dbt Cloud</a>. Which I understand perfectly. 
It's the best feature for creating a differentiating commercial offering. What's more, multi-project collaboration is by its very nature an Enterprise—with a big <strong>E</strong>—feature, which makes it relevant for a paid-for solution.</p><p>Hence <a href="https://docs.getdbt.com/guides/best-practices/how-we-mesh/mesh-1-intro?ref=blef.fr">dbt Mesh</a>, which was announced this week at Coalesce—dbt Labs' annual conference. dbt Mesh is the dbt Cloud solution that manages <a href="https://docs.getdbt.com/docs/collaborate/govern/project-dependencies?ref=blef.fr">cross-project references</a>, <a href="https://docs.getdbt.com/docs/collaborate/explore-projects?ref=blef.fr">a multi-project node explorer</a> and all the governance.</p><p>Cross-project references are a key enabler of data team decentralisation. Let's imagine you have a <strong>core</strong> project, managed by the central data team. In this core project you have an <strong>orders</strong> model. On the other side, the finance data team wants to build a revenue model on top of the <strong>core.orders</strong> model. With cross-project references you can declare a model as public in core and use it elsewhere.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/10/Screenshot-2023-10-20-at-13.57.22.png" class="kg-image" alt="" loading="lazy" width="1738" height="792" srcset="https://www.blef.fr/content/images/size/w600/2023/10/Screenshot-2023-10-20-at-13.57.22.png 600w, https://www.blef.fr/content/images/size/w1000/2023/10/Screenshot-2023-10-20-at-13.57.22.png 1000w, https://www.blef.fr/content/images/size/w1600/2023/10/Screenshot-2023-10-20-at-13.57.22.png 1600w, https://www.blef.fr/content/images/2023/10/Screenshot-2023-10-20-at-13.57.22.png 1738w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">dbt cross-project references use-case</span></figcaption></figure><p>All this is possible natively with dbt Cloud. 
But dbt Cloud multi-project is expensive. At the very least $100/month per project—it's Enterprise pricing, so it's not possible to get actual figures. But from what I know, it's expensive.</p><p><strong>What if we could do it with dbt Core?</strong></p><div class="kg-card kg-button-card kg-align-center"><a href="#/portal/signup/free" class="kg-btn kg-btn-accent">Join blef.fr for free</a></div><h1 id="enters-dbt-loom">Enters dbt-loom</h1><p>Obviously the community did not welcome this announcement well, as it converged with the new <a href="https://www.getdbt.com/blog/consumption-based-pricing-and-the-future-of-dbt-cloud?ref=blef.fr">pricing</a>. It's a bit frustrating to see a product you truly love, and in which you believe, keeping awesome features behind closed doors. But dbt is still open-source, so it's up to the community to adapt.</p><p>And the community adapted.</p><p>On my side I tried to fork dbt-core to inject what was needed to make multi-project work, but it was a burden and not very successful. On the other side, Nicholas Yager worked on <a href="https://github.com/nicholasyager/dbt-loom?ref=blef.fr">dbt-loom</a>, which leverages the new <a href="https://github.com/dbt-labs/dbt-core/pull/7955?ref=blef.fr">dbt Plugins mechanism</a> introduced with v1.6. Nicholas wrote a <a href="https://nicholasyager.com/2023/08/dbt_plugin_api.html?ref=blef.fr">great explanation of the plugin API</a>.</p><p>Under the hood, you need to write a Plugin class, inheriting from <a href="https://github.com/dbt-labs/dbt-core/blob/main/core/dbt/plugins/manager.py?ref=blef.fr#L23-L63"><code>DbtPlugin</code></a>, and implementing one or both of the 2 available hooks: <code>get_nodes</code> and <code>get_manifest_artifacts</code>. The first hook is called every time dbt needs to get nodes, and the returned nodes are injected as external nodes; this is the one that interests us. 
Actually, if we want to implement cross-project dependencies, we need to add to a dbt project's context the external nodes it depends on.</p><p>Here is what you can do with dbt-loom.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/10/Screenshot-2023-10-20-at-13.30.32.png" class="kg-image" alt="" loading="lazy" width="2000" height="1369" srcset="https://www.blef.fr/content/images/size/w600/2023/10/Screenshot-2023-10-20-at-13.30.32.png 600w, https://www.blef.fr/content/images/size/w1000/2023/10/Screenshot-2023-10-20-at-13.30.32.png 1000w, https://www.blef.fr/content/images/size/w1600/2023/10/Screenshot-2023-10-20-at-13.30.32.png 1600w, https://www.blef.fr/content/images/2023/10/Screenshot-2023-10-20-at-13.30.32.png 2106w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">dbt-loom in action with multi-project</span></figcaption></figure><h1 id="multi-project-collaboration-example">Multi-project collaboration example</h1><p>To help you understand what it really means, here is a working example with dbt-loom on a two-project setup—core and finance. Let's start with the <code>core</code> project. For reproducibility I use the dbt-duckdb connector so everyone can try it at home. I have 1 seed that loads a few rows and 2 models: <code>stg_orders</code> and <code>orders</code>.</p><p>Obviously <code>orders</code> depends on <code>stg_orders</code>; the former is public while the latter stays protected.</p><figure class="kg-card kg-code-card"><pre><code class="language-SQL">-- raw_orders.csv (dbt seed)
order_id,order_date,amount,customer_id
1,2023-01-01,340,c1
2,2023-01-02,13,c2
3,2023-01-03,1456,c1
4,2023-01-04,765,c3

-- stg_orders.sql
WITH raw AS (
    SELECT
        order_id,
        order_date::DATE AS order_date,
        customer_id,
        amount
    FROM {{ ref('raw_orders') }}
)

SELECT *
FROM raw

-- orders.sql
SELECT
    order_id,
    order_date,
    customer_id,
    amount::DECIMAL(8,2) AS amount_incl_vat,
    (amount / 1.2)::DECIMAL(8,2) AS amount_excl_vat -- assumes a 20% VAT rate
FROM {{ ref("stg_orders") }}</code></pre><figcaption><p><span style="white-space: pre-wrap;">The seed, the stg model and the final public model.</span></p></figcaption></figure><p>To declare these models as available for cross-project dependencies, you need to specify it in the YAML. In our case <code>stg_orders</code> will be protected and <code>orders</code> will be public with an enforced contract. The contract is super important because as soon as you expose a model, you potentially have downstream consumers building stuff on your models: you can't delete a column or change a type without notifying them. Or even better, you can start <a href="https://docs.getdbt.com/docs/collaborate/govern/model-versions?ref=blef.fr">versioning</a> models.</p><figure class="kg-card kg-code-card"><pre><code class="language-YAML">version: 2

models:
  - name: stg_orders
    access: protected
  - name: orders
    access: public
    config:
      contract:
        enforced: true
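        # with an enforced contract, dbt verifies at build time that the model
        # matches the declared columns and data types, and fails the build on drift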
    columns:
      - name: order_id
        data_type: int
        constraints:
          - type: not_null
      - name: order_date
        data_type: date
      - name: customer_id
        data_type: string
        constraints:
          - type: not_null
      - name: amount_incl_vat
        data_type: numeric(8,2)
      - name: amount_excl_vat
        data_type: numeric(8,2)</code></pre><figcaption><p><span style="white-space: pre-wrap;">models.yml that declares access and contracts for the public model</span></p></figcaption></figure><div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">💡</div><div class="kg-callout-text">There are 3 kinds of access for a model. <b><strong style="white-space: pre-wrap;">It can be private, protected or public</strong></b>. Private means the model is accessible only within the same group—a model can only be in one group. Protected means the model can only be referenced within its own project, and public means it can be referenced from everywhere. <a href="https://docs.getdbt.com/docs/collaborate/govern/model-access?ref=blef.fr" rel="noreferrer">See the doc</a>.</div></div><p>That's all for the core project. Once you have run <code>dbt build</code> on the core project, a <code>manifest.json</code> will be generated and tables will be created in the database. On the finance project, with dbt-loom installed—<code>pip install dbt-loom</code>—you need to declare the core project as a dependent manifest.</p><figure class="kg-card kg-code-card"><pre><code class="language-YAML">manifests:
  - name: core
    type: file
    config:
      path: ../core/target/manifest.json
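      # assumption: dbt-loom also supports remote manifest locations (e.g. object
      # storage or dbt Cloud artifacts); check the dbt-loom README for the exact
      # supported `type` values and their config keys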
</code></pre><figcaption><p><span style="white-space: pre-wrap;">dbt_loom.config.yml</span></p></figcaption></figure><p>Then you can write a few models that are using cross-project references.</p><figure class="kg-card kg-code-card"><pre><code class="language-SQL">-- stg_revenue.sql
WITH orders AS (
    SELECT *
    FROM {{ ref('core', 'orders') }} -- this is cross-project reference
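    -- dbt-loom reads core's manifest, injects `orders` as an external node,
    -- and this ref compiles to the table built by the core project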
)

SELECT *
FROM orders
LEFT JOIN {{ ref('margins') }} ON 1 = 1 -- ON 1 = 1 cross-joins every order with the margins model

-- revenue.sql
SELECT
    order_date,
    SUM(amount_excl_vat * margin) AS revenue
FROM {{ ref('stg_revenue') }}
GROUP BY order_date</code></pre><figcaption><p><span style="white-space: pre-wrap;">dbt finance project SQL models</span></p></figcaption></figure><p>Now you can dbt build this project as well, and dbt-loom will extend the dbt model list thanks to the plugin, adding the <code>core.orders</code> model.</p><p>So you can try it at home, I've created a <a href="https://github.com/Bl3f/dbt-loom-example?ref=blef.fr">Github repository</a> with a working example using DuckDB as the database.</p><div class="kg-card kg-button-card kg-align-center"><a href="#/portal/signup/free" class="kg-btn kg-btn-accent">Join for free to not miss any updates</a></div><h1 id="conclusion">Conclusion</h1><p>Multi-project collaboration is probably the best feature dbt Labs has introduced in recent times. This feature has huge potential to structure dbt projects and avoid chaos.</p><p>As a data engineer who loves open-source and community stuff, dbt-loom is a great workaround, but be aware that it's all experimental at the moment; if large workflows rely on this functionality, you should consider using the paid version with dbt Mesh.</p><p>To go further you can <a href="https://coalesce.getdbt.com/agenda/take-chances-make-mistakes-and-get-meshy-unlocking-model-governance-and-multi-project-deployments-with-dbt-meshify?ref=blef.fr">watch</a> a Coalesce 2023 talk about <a href="https://github.com/dbt-labs/dbt-meshify?ref=blef.fr">dbt-meshify</a>, a tool that helps you automate your journey from a monolith to a multi-project dbt setup—<a href="https://attendees.bizzabo.com/433222/agenda/activity/1179792?ref=blef.fr">here is the direct link to the video</a>.</p><p></p><p></p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Airflow Summit 2023 takeaways ]]></title>
                    <description><![CDATA[ Data News #23.41 — Airflow Summit takeaways — Get Airflow vision, understand internals and a few companies giving feedbacks about their Airflow usage. ]]></description>
                    <link><![CDATA[ /data-news-week-23-41/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 65286b44828b9a00014afd4d ]]></guid>
                    <pubDate><![CDATA[ 2023-10-14 ]]></pubDate>
                    <content>
                        <![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/photo-1524281423221-234569bc0438?ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&amp;auto=format&amp;fit=crop&amp;w=1000&amp;q=80" class="kg-image" alt="man standing on top of rock formation" loading="lazy"><figcaption><span style="white-space: pre-wrap;">(</span><a href="https://unsplash.com/photos/pqHRNS8Mojc?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p>Hello, dear Data News reader, I hope you'll enjoy this new edition. It's amazing how quickly time flies, and this summer I passed the 3-year mark since I started my freelance adventure. I'm so happy with what it's brought me. But I've got this internal alarm that goes off every 3 years asking me for new things. It's time for me to search for my future paths.</p><p>Don't worry, the newsletter and the content stuff I do is something I enjoy, so it will probably stay as an invariant in this quest.</p><p>Also, this week I wrote R code for the first time. It's not an experience I'd recommend. I tried using ChatGPT to help me with this task and every answer it gave me was wrong. In 20 attempts, it never gave me a correct snippet. On the other hand, I asked the AI to help me write a TCP proxy in Python and it worked the first time. Probably a training bias.</p><p>Going further, I've looked at StackOverflow trends to see if there is a reason Python is better covered by ChatGPT than R—beyond the obvious one—and Python was <a href="https://insights.stackoverflow.com/trends?tags=java%2Cc%2Cc%2B%2B%2Cpython%2Cc%23%2Cvb.net%2Cjavascript%2Cassembly%2Cphp%2Cperl%2Cruby%2Cswift%2Cr%2Cobjective-c&ref=blef.fr">6 to 7 times</a> more popular than R at the time of training. 
The graph also shows that Python has been losing popularity since 2022, although I don't really know why, and it stays on top. Only C# got a massive increase in the <a href="https://www.tiobe.com/tiobe-index/?ref=blef.fr">TIOBE index</a>.</p><p>This week, the videos from the Airflow Summit 2023 have been released and as always, I'd like to provide you with a list of the talks I found interesting. You can also watch the <a href="https://www.youtube.com/watch?v=pi8V077KjEY&list=PLGudixcDaxY29qXIXhd90htHp_BFk-Bqf&pp=iAQB&ref=blef.fr">YouTube playlist</a> and show support to the other speakers.</p><p></p><h1 id="airflow-summit-2023-%F0%9F%8C%AC%EF%B8%8F">Airflow Summit 2023 🌬️</h1><p>For ease of reading I've sorted the talks I've selected into 3 categories: general stuff, Airflow internals and feedback from companies.</p><h2 id="general-%E2%80%94-get-airflow-ideas">General — Get Airflow ideas</h2><ul><li><strong>The Summit opened with a panel about the <a href="https://www.youtube.com/watch?v=pi8V077KjEY&ref=blef.fr"><strong>past and the future of Airflow</strong></a></strong>. It was also the time for the panelists to give a huge shoutout to all Airflow contributors. I personally join the shoutout because Airflow has been part of my professional journey for the last 5 years and it helped me grow and achieve so much.</li><li><strong>Then Marc Lamberti gave a huge <a href="https://www.youtube.com/watch?v=y9rSCboE6BY&list=PLGudixcDaxY29qXIXhd90htHp_BFk-Bqf&index=4&ref=blef.fr"><strong>update about Airflow</strong></a> but done differently</strong> — It wasn't about slides with a list of new features but rather about how you can write, in 2023, a data pipeline with Airflow. It's a presentation that silences critics about Airflow's rigidity and complexity.</li><li><a href="https://www.youtube.com/watch?v=J5pbH1TUv0U&ref=blef.fr"><strong>Airflow operators need to die</strong></a> — This is a funny topic. 
Airflow operators are often criticised because they don't work, so people just use the Python or Bash operators to orchestrate their own stuff, which leaves us with useless operator code. So Airflow needs a new vision. This talk from Bolke is probably the beginning of an operator rebirth. Bolke proposed new storage and dataframe APIs to remove hardcoded operators and decouple sources from destinations.</li><li><strong>Airflow can also be at the center of the data mesh</strong> discussion, with companies using multiple Airflow instances to give power to many teams. Kiwi.com showcases how they moved from a <a href="https://www.youtube.com/watch?v=Ib8lgj9Xa2U&ref=blef.fr">monolith to several smaller envs</a> while Delivery Hero explained how they run <a href="https://www.youtube.com/watch?v=Or0nlM95b_o&ref=blef.fr">500 Airflow instances</a> with a lot of unique specificities.</li><li><a href="https://www.youtube.com/watch?v=Tagr4IqbqDI&ref=blef.fr"><strong>A microservice approach for DAG authoring using datasets</strong></a> — The idea is to apply SE patterns to pipelines like <a href="https://blogs.mulesoft.com/api-integration/patterns/data-integration-patterns-migration/?ref=blef.fr">migration</a>, <a href="https://blogs.mulesoft.com/api-integration/patterns/data-integration-patterns-broadcast/?ref=blef.fr">broadcast</a> and <a href="https://deviq.com/domain-driven-design/aggregate-pattern?ref=blef.fr">aggregate</a>. In addition you should create micropipelines, which we can define <em>as a small, loosely coupled DAG which operates on one input Dataset and produces one output Dataset.</em> Each micropipeline then implements a unique pattern with a defined input and output.</li><li><a href="https://www.youtube.com/watch?v=CjjZyxnHfdk&ref=blef.fr"><strong>Dynamic task mapping to orchestrate dbt</strong></a> — dbt has changed the data world and is immensely popular, but dbt orchestration is still a <a href="https://www.blef.fr/manage-and-schedule-dbt/">problem</a>. 
Many Airflow users have to integrate dbt within Airflow. This time the Xebia team proposes using dynamic task mapping to do it (link to the <a href="https://github.com/pgoslatara/dynamic_task_mapping_for_dbt?ref=blef.fr">Github</a> repo with multiple solutions).</li><li>The Astro team also showcased how you can <a href="https://www.youtube.com/watch?v=mgA6m3ggKhs&ref=blef.fr">deploy LLMs with Airflow</a> —&nbsp;following the <a href="https://a16z.com/emerging-architectures-for-llm-applications/?ref=blef.fr">a16z infra guide</a>.</li></ul><p></p><h2 id="understand-airflow-internals">Understand Airflow internals</h2><p>3 talks you should watch to learn things you don't know about Airflow internals.</p><ul><li>Airflow is made of 3 main components interacting together: the <em>webserver</em>, the <em>scheduler</em> and the <em>executor</em>; they use a database to communicate. Within the scheduler there is a DAG parser process reading files to understand what needs to be scheduled. <ul><li>This DAG parsing step has flaws.<ul><li>By default you have to wait 5 minutes to have a new DAG displayed in the UI.</li><li>If you have 300 DAGs coming from a single file (for loop) it works way better than if you have 300 DAGs in 300 files.</li></ul></li><li>That's why we should probably <a href="https://www.youtube.com/watch?v=UkV1CAOul2w&ref=blef.fr">move to event-based DAG parsing</a> — In the presentation Bas explains the 4 steps in the DAG parser and what configuration you can change to get better performance. He also demos an event-based DAG parsing that instantly displays DAGs in the UI.</li><li> Then John also explained what he did to <a href="https://www.youtube.com/watch?v=8gcBdknaM8I&ref=blef.fr">improve parsing performance</a> — Especially around Python imports. 
Because parsing a DAG means running the Python DAG code (and its imports), and heavy imports wreck the parsing time.</li><li>➡️ In conclusion you should consider running the DAG processor standalone to remove the impact it could have on the scheduler, and follow the latest community improvements.</li></ul></li><li>Niko also discussed the <a href="https://www.youtube.com/watch?v=VFC0E6Oyj7A&ref=blef.fr">executor decoupling</a> to unlock the development of third-party executors like an ECS executor.</li></ul><p></p><h2 id="companies-feedback">Companies feedback</h2><p>To finish this newsletter, 3 company presentations about their Airflow usage that gave me inspiration.</p><ul><li><a href="https://www.youtube.com/watch?v=rF2Dz33TCsY&ref=blef.fr">Bloomberg, leveraging dynamic DAGs for data ingestion</a> — I'm a huge fan of dynamic DAGs; I think this is the way to go in Airflow because as a data engineer your role is to create a standardisation layer for data work rather than doing the actual data work, especially in a mesh concept. Here the Bloomberg team creates a nice categorisation of data tasks to provide DAGs as config.</li><li><a href="https://www.youtube.com/watch?v=FExVqjvDjvw&ref=blef.fr">Reddit, How we migrated from Airflow 1 to Airflow 2</a> — If there are still people out there on Airflow 1, you should migrate; newer Airflow versions are way simpler and more fun. But to be honest, Reddit's presentation can be generalised to every team that wants to migrate from an old piece of software to a fresher one. Migration recipes apply whatever software you use.</li><li><a href="https://www.youtube.com/watch?v=1b6Uu4M0ExY&ref=blef.fr">Monzo, Evolving our data platform as the bank scales</a> — This presentation is full of awesome ideas. It talks about dbt integration within Airflow (using a custom DAGBuilder), monitoring, alerting and Slack interaction with the data stack.</li></ul><hr><p>See you next week ❤️ — this week's other articles will be blended into next week's Data News!</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 23.40 ]]></title>
                    <description><![CDATA[ Data News #23.40 — OpenAI iPhone?, Python 3.12, chat with BigQuery, save costs and awesome other stuff. ]]></description>
                    <link><![CDATA[ /data-news-week-23-40/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 6523cb67b3f9a70001b459d6 ]]></guid>
                    <pubDate><![CDATA[ 2023-10-10 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/photo-1473830394358-91588751b241?ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&amp;auto=format&amp;fit=crop&amp;w=1000&amp;q=80" class="kg-image" alt="person looking out through window" loading="lazy"><figcaption><span style="white-space: pre-wrap;">(</span><a href="https://unsplash.com/photos/gzhyKEo_cbU?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p>Hey, I'm a bit late once again. I hope this newsletter edition finds you well. This is almost a raw edition: I had quite a big amount of links, and I hope you'll like this selection.</p><p></p><h1 id="gen-ai-%F0%9F%A4%96">Gen AI 🤖</h1><ul><li>OpenAI’s plan to build the <a href="https://www.theverge.com/2023/9/28/23893939/jony-ive-openai-sam-altman-iphone-of-artificial-intelligence-device?ref=blef.fr">"iPhone of artificial intelligence"</a> —&nbsp;Obviously this is one of the main struggles for OpenAI. In order to stay forever in the B2C market they need more than a chat interface: they need hardware, they need to enter users' everyday lives. Still not sure we need a new addictive device.</li><li>❤️ <a href="https://ig.ft.com/generative-ai/?ref=blef.fr">Generative&nbsp;AI&nbsp;exists because of the&nbsp;transformer</a> — A scroll story by the Financial Times explaining what Generative AI is. Good for everyone.</li><li><a href="https://www.cs.princeton.edu/~arvindn/talks/evaluating_llms_minefield/?ref=blef.fr">Evaluating LLMs is a minefield</a> —&nbsp;Slides from Princeton Uni about LLMs evaluation and why it's hard to understand how it evolves.
You can find the video version on the <a href="https://pli.princeton.edu/events/aiprinceton/pli-launch-talks?ref=blef.fr">Princeton</a> website, named "Societal Impact of AI".</li><li><a href="https://www.cnbc.com/2023/10/03/jpmorgan-ceo-jamie-dimon-says-ai-could-bring-a-3-day-workweek.html?ref=blef.fr">JPMorgan CEO says AI could bring a 3½-day workweek</a> —&nbsp;<em>blablabla, AI is awesome, we want AI everywhere and pay people less blablala /s.</em></li><li><a href="https://browse.arxiv.org/pdf/2309.10668.pdf?ref=blef.fr">Language Modeling is compression</a> —&nbsp;Paper from Deepmind. Title looks cool. To be honest this is not the first time I've seen LLMs and compression in the same paper, and at the very least it opens the door to fun experiments.</li><li><a href="https://www.nature.com/articles/s42256-023-00714-5?ref=blef.fr">Decoding speech perception from non-invasive brain recordings</a> — Even crazier. This article describes the state of the art in decoding speech from brain activity. It covers multiple models and shows what we can achieve by just looking at electric or magnetic recordings of the brain.</li></ul><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li><a href="https://www.vantage.sh/blog/databricks-vs-microsoft-fabric-pricing-analysis?ref=blef.fr">Microsoft Fabric: should Databricks be worried?</a> — Vantage did a price analysis between Microsoft Fabric and Databricks. Generally Fabric pricing is simpler because it's fresh and new, but for more complex stuff Databricks still shines.</li><li><a href="https://cube.dev/blog/introducing-python-and-jinja-for-data-modeling?ref=blef.fr">Introducing Python and Jinja in Cube</a> — Cube, an open source semantic layer, has released new writing capabilities in Python with Jinja in the YAML definitions. Something reminiscent of dbt.
You can now write macros to generate YAML.</li><li>Confluent announced <a href="https://www.jesse-anderson.com/2023/10/current-2023-announcements/?ref=blef.fr">Kafka roadmap</a> and <a href="https://siliconangle.com/2023/09/26/confluent-debuts-managed-apache-flink-service-generative-ai-features/?ref=blef.fr">Flink as a Cloud service</a> —&nbsp;this is the result of Confluent's acquisition of Immerok. Confluent is still growing but struggles to become a real competitor to Databricks or Snowflake.</li><li><a href="https://docs.python.org/3/whatsnew/3.12.html?ref=blef.fr">Python 3.12 is out</a> — Every year a new Python minor version is released and this year it brings a few cool features. Mainly you get a new generic <a href="https://docs.python.org/3/whatsnew/3.12.html?ref=blef.fr#whatsnew312-pep695">type parameter</a>, better <a href="https://docs.python.org/3/whatsnew/3.12.html?ref=blef.fr#whatsnew312-pep701">f-strings</a> with multilines in curly brackets and quote reuse, and the <a href="https://docs.python.org/3/whatsnew/3.12.html?ref=blef.fr#whatsnew312-pep684">per-interpreter GIL</a>—which <a href="https://engineering.fb.com/2023/10/05/developer-tools/python-312-meta-new-features/?ref=blef.fr">Meta is proud</a> to say it contributed to.</li><li><a href="https://malloydata.github.io/blog/2023-10-03-malloy-four/?ref=blef.fr#announcing-malloy-4-0">Announcing Malloy 4.0</a> — This is the tool I have to try soon. Malloy is out with a new version and a lot of new features.
As a reminder, Malloy is a new analytical language meant to generate SQL to query databases.</li><li>4 tips to save warehouse money — Paul posted on LinkedIn about <a href="https://www.linkedin.com/feed/update/urn:li:activity:7115268234395676672/?updateEntityUrn=urn%3Ali%3Afs_updateV2%3A%28urn%3Ali%3Aactivity%3A7115268234395676672%2CFEED_DETAIL%2CEMPTY%2CDEFAULT%2Cfalse%29&ref=blef.fr">search index</a>, <a href="https://www.linkedin.com/feed/update/urn:li:activity:7115625411715178497/?ref=blef.fr">avoiding rerunning dbt tests when possible</a> or <a href="https://www.linkedin.com/feed/update/urn:li:activity:7115977007707893761/?ref=blef.fr">just deleting tables</a>. Ian also proposed that you identify and <a href="https://www.linkedin.com/feed/update/urn:li:activity:7115678942153375744/?updateEntityUrn=urn%3Ali%3Afs_updateV2%3A%28urn%3Ali%3Aactivity%3A7115678942153375744%2CFEED_DETAIL%2CEMPTY%2CDEFAULT%2Cfalse%29&ref=blef.fr">remove things you don't need anymore</a>.</li></ul><p></p><h1 id="tech-and-data-engineering-stuff-%E2%9A%99%EF%B8%8F">Tech and data engineering stuff ⚙️</h1><ul><li><a href="https://slack.engineering/executing-cron-scripts-reliably-at-scale/?ref=blef.fr">CRON jobs at Slack scale</a> —&nbsp;Why do you need an orchestrator when you can run CRONs? The Slack engineering team details how they wrapped CRON jobs on top of Kubernetes with a database table to get monitoring.</li><li><a href="https://dropbox.tech/machine-learning/using-ml-to-identify-date-formats-in-file-names?ref=blef.fr">Using ML to identify date formats in file names</a> — Dropbox developed a classifier to identify date formats in file names. This is the backbone of a <em>naming convention</em> feature. It gives ideas. Based on <a href="https://huggingface.co/distilroberta-base?ref=blef.fr">DistilRoberta</a>, this is something to look at to fix the mess of a data lake.</li><li><a href="https://www.youtube.com/watch?v=OCClTPOEe5s&ref=blef.fr">Data modeling is dead!
Long live data modeling!</a> — Joe Reis's keynote at Big Data London about his next book topic: data modeling. Joe covers why data modeling was put aside in recent years and why we need it back today, showcasing a few useful patterns and definitions.</li><li><a href="https://pub.towardsai.net/chat-bigquery-using-english-c9bd4bb1b127?ref=blef.fr">Chat with BigQuery data</a> — this is a recycling of all the chatbot use-cases, once again. It's an example where you can use natural language to access BigQuery data. There is also a walkthrough example on <a href="https://airbyte.com/tutorials/airbyte-and-llamaindex-elt-and-chat-with-your-data-warehouse-without-writing-sql?ref=blef.fr">Airbyte with LLamaindex</a>. </li><li><a href="https://medium.com/apache-airflow/creating-an-airflow-custom-hook-for-reliable-api-calls-975a4710c7cd?ref=blef.fr">Creating an Airflow custom hook for API calls</a> — A guide showing you how you can extend Airflow hooks to have a custom way to call APIs.</li><li><a href="https://dataengineeringcentral.substack.com/p/goodbye-spark-hello-polars-delta??ref=blef.fr">Goodbye Spark. Hello Polars + Delta Lake</a> — Spark is under attack. In recent years Spark has been powering a lot of data use cases, but the modern data stack, and more recently DuckDB, Polars and smaller-scale OLAP technologies, enable a new way to do data processing.</li><li><a href="https://eng.lyft.com/from-big-data-to-better-data-ensuring-data-quality-with-verity-a996b49343f6?ref=blef.fr">Ensuring data quality with Verity</a> — Lyft's definition of data quality and a tour of the in-house product that addresses data quality in the data platform: Verity.
This is a must-read and a good showcase of what you can do.</li><li><a href="https://engineering.mixpanel.com/database-file-format-optimization-per-column-dictionary-2e108df1d706?ref=blef.fr">Database file format optimization: per column dictionary</a> — Mixpanel developed a proprietary columnar database and this article shows what they did to improve compaction and increase performance.</li><li><a href="https://luminousmen.com/post/exploring-the-power-of-graph-databases?ref=blef.fr">Exploring the power of graph databases</a>.</li></ul><p></p><h1 id="data-economy-%F0%9F%92%B0">Data Economy 💰</h1><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/photo-1596733541604-ee7020be9fdb?ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&amp;auto=format&amp;fit=crop&amp;w=1000&amp;q=80" class="kg-image" alt="motorcycle parked on the side of the road" loading="lazy"><figcaption><span style="white-space: pre-wrap;">The new search index (</span><a href="https://unsplash.com/photos/EU6_2jY0_rs?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><ul><li><a href="https://techcrunch.com/2023/10/04/yahoo-spins-out-vespa-its-search-tech-into-an-independent-company/?ref=blef.fr">Yahoo spins out Vespa</a>. <a href="https://vespa.ai/?ref=blef.fr"><strong>Vespa</strong></a> is the tech behind Yahoo search engine, it's a search engine and a vector database. In the current Gen AI times, it looks like a good time to do it.</li><li><a href="https://contentsquare.com/blog/contentsquare-signs-agreement-acquire-heap/?ref=blef.fr">Contentsquare acquires Heap</a>. 
<a href="https://www.heap.io/?ref=blef.fr"><strong>Heap</strong></a> is a product analytics solution to better understand your acquisition funnel performance.</li><li><a href="https://kestra.io/?ref=blef.fr"><strong>Kestra</strong></a> <a href="https://kestra.io/blogs/2023-10-05-announcing-kestra-funding-to-build-the-universal-open-source-orchestrator?ref=blef.fr">raises $3m Seed funding</a>. Kestra is the new kid in the open-source orchestration space, disrupting the Python status quo: it's written in Java and requires you to write pipelines in a declarative way, in YAML. If you want to know more you can watch this <a href="https://youtu.be/sAc-uNvlveY?t=2709&ref=blef.fr">YouTube live</a> I did with Kestra's CTO demonstrating its capabilities.</li></ul><hr><p>See you soon ❤️.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Upgrade your Modern Data Stack ]]></title>
                    <description><![CDATA[ Data News #23.39 — What can you do to upgrade your modern data stack without thinking first about technologies, Fast News and more. ]]></description>
                    <link><![CDATA[ /modern-data-stack-upgrade/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 6511bbb60fad7400010c7f5f ]]></guid>
                    <pubDate><![CDATA[ 2023-09-29 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/41/bXoAlw8gT66vBo1wcFoO_IMG_9181.jpg?ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&amp;auto=format&amp;fit=crop&amp;w=1000&amp;q=80" class="kg-image" alt="person riding on hot air balloon" loading="lazy"><figcaption><span style="white-space: pre-wrap;">Make your data stack take-off (</span><a href="https://unsplash.com/photos/0fjGQmYCRW8?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p>Hello, another edition of Data News. This week, we're going to take a step back and look at the current state of data platforms. What are the current trends, and why are people fighting over the concept of the modern data stack?</p><p>Early September is usually conference season. All over the world, people gather in huge venues to attend conferences. Last week it was <a href="https://bigdataldn.com/?ref=blef.fr">Big Data London</a>, this week it was <a href="https://www.bigdataparis.com/?ref=blef.fr">Big Data &amp; AI Paris</a>. I wasn't able to go. But every time I went to a conference in the past, I came back with ideas to change everything because someone introduced me to some fancy new stuff.</p><p>This feeling is right. But you should temper your excitement. Let's go through the current state of data to understand what you should do next.</p><p></p><h1 id="big-data-is-really-dead">Big Data is really dead</h1><p>Although the term Big Data is no longer very popular, London probably counted over 10,000 visitors and more than 160 vendors (2022 figures). Big Data London has existed since 2016, and when you look at the sponsors it's like a history book. Over the years the Cloudera logo has been replaced by the Snowflake and Databricks ones. The Microsoft logo is still standing.
<em>When everybody is digging for gold, it’s good to be in the pick and shovel business.</em></p><p>The era of Big Data was characterised by Hadoop, HDFS and distributed computing (Spark), on top of the JVM. This era was necessary and opened doors to the future, fostering innovation. <strong>But there was a big problem: it was hard to manage.</strong></p><p>That's why big data technologies got swooshed by the modern data stack when it arrived on the market—except Spark. We jumped from HDFS to Cloud Storage (S3, GCS) for storage and from Hadoop and Spark to Cloud warehouses (Redshift, BigQuery, Snowflake) for processing.</p><p>In fact, we're still doing the same thing we did 10 or 20 years ago. We need to store, process and visualise data; everything else is just marketing. I often say that data engineering is boring, insanely boring. When you are a data engineer you're getting paid to build systems that people can rely on. By nature it should be simple—to maintain, to develop—it should be stable, it should be proven. Something boring.</p><p>Big data technologies are dead—bye <a href="https://zookeeper.apache.org/?ref=blef.fr">Zookeeper</a> 👋—but the data generated by systems is still massive. Is the modern data stack relevant to answer this need for storage and processing?</p><p></p><h1 id="is-the-modern-data-stack-dying">Is the modern data stack dying?</h1><p>The modern data stack has always been a nice phrase bundling a philosophy for building data platforms. Cloud-first. With a handy warehouse at the center and multiple SaaS tools revolving around it to answer useful—sometimes not—use-cases. Following an E(T)LT approach.</p><p>Historically, data pipelines were designed with an ETL approach: storage was expensive and we had to transform the data before using it.
With the cloud, we got the—false—impression that resources were infinite and cheap, so we switched to ELT by pushing everything into a central data storage.</p><p>If we summarise <a href="https://www.getdbt.com/blog/future-of-the-modern-data-stack?ref=blef.fr">the initial modern data stack vision</a>, this is something like:</p><ul><li>move data with Fivetran</li><li>store data in Snowflake</li><li>transform data with dbt</li><li>visualise with Looker</li><li>document with a catalog, prevent with data observability, orchestrate</li></ul><p>So what's left of the original vision of the modern data stack that can be applied in 2023 and beyond? <strong>An easy-to-manage central storage and a querying and transforming layer in SQL</strong>. When you put it like this, it opens doors and does not limit the modern data stack to 4 vendors.</p><p>The central storage can be cloud storage, a warehouse or a real-time system, while the SQL engine can be a data warehouse or a dedicated processing engine. It can go further than that: you can—in fact you should—<a href="https://juhache.substack.com/i/136841647/multi-compute-engine-data-stack?ref=blef.fr">compose storages and engines</a>, as there are too many use cases for any one solution to address. More importantly, the modern 4-vendor data stack <a href="https://win.hyperquery.ai/p/does-the-modern-data-stack-work-at?ref=blef.fr">is too expensive to scale</a>.</p><p>The modern data stack is not about to disappear: it's so simple to use in the first place and it's at the core of too many data stacks and practices today. But it needs to adapt to today's needs, hence its incremental evolution.</p><p></p><h1 id="i-believe-in-incremental-evolution">I believe in incremental evolution</h1><p>What do you need to do? Well, it all depends on whether you're a newcomer who wants to start building your data platform, or whether you already have a stack and are wondering what to do next.
If you're starting your data stack in 2023, simply choose the solution that will be the quickest to implement to discover your business use cases; you'll build something later. A lot of companies started with Postgres + dbt + Metabase, don't be ashamed.</p><p>When it comes to incrementally changing a data platform this is a bit different: you need to find what is going wrong and what could be improved. Like:</p><ul><li><strong>data workflows are always failing, are always late</strong>—Identify why workflows fail; data contracts might help to bring <a href="https://www.youtube.com/watch?v=L9mEGb31snk&t=91s&ref=blef.fr">consensus as code</a> if they fail because of upstream producers; create metrics about failure or latency and aim for a 30-day streak with no issues. Define SLAs, criticality and ownership. For downstream data quality there are also a lot of tools.</li><li><strong>data stack is too expensive</strong>—With the current economic situation a lot of data teams needed to stop spending crazy amounts on compute and to introspect storage to remove useless data archives. DuckDB can help <a href="https://www.linkedin.com/feed/update/urn:li:activity:7110630962144649216/?ref=blef.fr">save tons of money</a>.</li><li><strong>developer experience to add new workflows</strong>—This is something often neglected by data engineers: you need to build the best dev experience for other data people, as not everyone is fluent with the CLI.</li><li><strong>data debt</strong>—You might have too many dashboards or tables, spaghetti workflows. For this you need to do recurrent data cleaning. Find, tag and remove what is useless and what can be factorised. Only healthy routines can prevent this.</li><li><strong>poor data modeling</strong>—This topic might be too large to handle in one bullet. <a href="https://towardsdev.com/data-modeling-in-the-modern-data-stack-d29be964b3a7?ref=blef.fr">Data modeling</a> is the part that really doesn't scale in data stacks.
As you grow, your stock of SQL queries will inflate, and only data modeling will prevent data from becoming unusable, repetitive or false. <a href="https://medium.com/@sivailango.s/principles-of-data-layers-in-data-platform-a336a0ff9e1e?ref=blef.fr">Good data layers</a> are a good start.</li><li><strong>there is no data documentation</strong>—Rare are the people who are happy to document what they are doing. The best thing to do is to define what good documentation is and then enforce the requirements before going to production. Think of the <a href="https://deezer.io/rethinking-your-data-platform-documentation-so-that-people-actually-read-it-84baff70b9a4?ref=blef.fr">documentation for your readers</a>.</li><li><strong>data is not easily accessible for humans or AI</strong>—We build data platforms to be used. You should create usage metrics on top of your platform: about business users' conversion in the downstream tools, about SQL query writers, but also about how AI is using the data. How does the <a href="https://a16z.com/emerging-architectures-for-llm-applications/?ref=blef.fr">AI platform</a> combine with the analytics platform?</li></ul><p>This list is probably not exhaustive, but it's a good start. If you think you're good on all counts, you've probably finished the game, and that means your data team has built something that works. Don't forget the stakeholders though, as it's probably more useful to have a platform that barely works but serves users perfectly than the other way around.</p><p></p><h1 id="conclusion">Conclusion</h1><p>This post is a reflection on the changes in the data ecosystem. Marketing would have you believe that your data infrastructure may be obsolete, but you shouldn't worry about it; if you're still using a crontab to run your jobs that's fine. Just use the right tool for the right job and identify what your data needs are.
Tip: data needs are rarely a technology name.</p><p>I hope you like this different Data News edition. I'm curious to know what you think about it; I wanted to keep it short while giving a few practical links and ideas.</p><p><em>Your data stack won't explode if you don't use dbt.</em></p><div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">💡</div><div class="kg-callout-text">Going deeper: <a href="https://wesmckinney.com/blog/looking-back-15-years/?ref=blef.fr" rel="noreferrer">The road to composable data systems: thoughts on the last 15 years and the future</a>. Wes McKinney—pandas and Arrow co-creator—is one of the best thought leaders in the data space. This article depicts well how composable our platforms will be in the future and why Apache Arrow has to be everywhere.</div></div><p>PS: I wanted to write also about the interoperability of data storage and file formats but that's for another time.</p><hr><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li><a href="https://motherduck.com/pricing/?ref=blef.fr">Motherduck has announced their pricing</a> — The model's simplicity reminds me a lot of BigQuery in its early days. You pay for cold and hot storage: respectively $0.04 per GB per month and $0.02 per GB per hour. But it looks way more expensive than BigQuery.</li><li><a href="https://cloud.google.com/blog/products/data-analytics/announcing-bigquery-omni-cross-cloud-joins?hl=en&ref=blef.fr">Announcing BigQuery Omni cross-cloud joins</a><strong> —</strong> Join datasets located in BigQuery with datasets located in AWS or Azure. This is part of the BigQuery Omni offering, which is 37% more expensive (in EU).</li><li><a href="https://moderndatanetwork.medium.com/3-lessons-to-learn-before-creating-your-own-data-team-1a64a5e22bca?ref=blef.fr">3 lessons to learn before creating your own data team</a> — Christelle wrote 3 lessons learned from a survey that was run in a private French data community.
Mainly it shows that the first hires in a data team have to be picked cautiously. </li><li><a href="https://medium.com/qonto-way/how-to-prioritize-projects-and-scale-your-data-science-team-efficiently-d4694f22eb49?ref=blef.fr">How to prioritise projects and scale your Data Science team efficiently</a> — A nice article about how to understand an OKR and make it your own to lead data science projects.</li><li><a href="https://mistral.ai/news/announcing-mistral-7b/?ref=blef.fr">Mistral 7B, the best 7B model so far and open-source</a> — Mistral AI is the French company that wants to compete with OpenAI, and they released a first 7B model under the Apache license.</li><li><a href="https://dataanalysis.substack.com/p/a-selection-of-sql-tutorials-issue-cf9?ref=blef.fr">A selection of SQL tutorials</a> — a long list.</li></ul><p></p><h1 id="data-economy-%F0%9F%92%B0">Data Economy 💰</h1><ul><li><a href="https://www.rollstack.com/?ref=blef.fr"><strong>Rollstack</strong></a><strong> <a href="https://www.rollstack.com/articles/rollstack-raises-1-8m?ref=blef.fr">raises $1.8m Seed</a></strong>. This is a YC company and they propose a product that automates slide decks with data coming from your data stack, without engineering or manual work. This is an awesome idea my younger self would have loved 8 years ago when I was generating PowerPoints in Python.</li><li><a href="https://www.kolena.io/?ref=blef.fr"><strong>Kolena</strong></a> <a href="https://techcrunch.com/2023/09/26/kolena-a-startup-building-tools-to-test-ai-models-raises-15m/?ref=blef.fr">raises $15m Series A</a>. Kolena proposes an end-to-end framework to test and debug ML models to identify failures and regressions.</li></ul><hr><p>See you next week ❤️.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 23.38 (late) ]]></title>
                    <description><![CDATA[ Data News #23.38 — Usual data news with Microsoft Copilot, DALL·E 3, Postgres 16, the fast news and a lot of money spent. ]]></description>
                    <link><![CDATA[ /data-news-week-23-38/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 650d63170fad7400010c7d8c ]]></guid>
                    <pubDate><![CDATA[ 2023-09-26 ]]></pubDate>
                    <content>
                        <![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/photo-1571008887538-b36bb32f4571?ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&amp;auto=format&amp;fit=crop&amp;w=1000&amp;q=80" class="kg-image" alt="pair of blue-and-white Adidas running shoes" loading="lazy"><figcaption><span style="white-space: pre-wrap;">Early like my run (</span><a href="https://unsplash.com/photos/XiZ7pRvCzro?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure>
<p>Hey. This is a super late Data News, I wanted to send it earlier but I was travelling then enjoying time with friends and family. I'm still struggling a bit to write as fast as I would like, but 🤷‍♂️.</p>
<p>So, sorry for the late edition and enjoy.</p>
<h1 id="gen-ai-%F0%9F%A4%96">Gen AI 🤖</h1>
<ul><li><a href="https://blogs.microsoft.com/blog/2023/09/21/announcing-microsoft-copilot-your-everyday-ai-companion/?ref=blef.fr">Announcing Microsoft Copilot</a> — Having everything under a common brand is great and Copilot is a great name. Microsoft announced that your AI companion called Copilot will be everywhere in the next Windows 11 update. For instance in Paint, Photos and in your web search (Edge and Bing).</li><li><a href="https://www.pcmag.com/news/microsoft-ai-employee-accidentally-leaks-38tb-of-data?ref=blef.fr">At the same time Microsoft leaked 38TB of data</a> — through a Github repository containing a link to an Azure storage account with public access open.</li><li><a href="https://openai.com/dall-e-3?ref=blef.fr">OpenAI announced DALL·E 3</a> — natively built with ChatGPT to create more impressive images from user prompts.</li><li>I recommend following <a href="https://www.linkedin.com/in/olivermolander/?ref=blef.fr">Oliver</a> on LinkedIn if you don't want to miss anything related to Gen AI. He writes the best takeaways multiple times a week.</li></ul>
<p></p>
<h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1>
<ul><li><a href="https://www.postgresql.org/docs/current/release-16.html?ref=blef.fr">Postgres 16 has been released</a> — featuring a few <a href="https://www.enterprisedb.com/blog/highlights-postgresql-16-beta-release?ref=blef.fr">performance improvements</a> in parallel executions (<em>string_agg</em> and <em>array_agg</em>) but also with the <em>SELECT DISTINCT</em> and <em>COPY</em> commands.</li><li><a href="https://ask.astronomer.io/?ref=blef.fr">Astronomer released Ask Astro</a> — An LLM application that is able to understand the Astro docs to answer most Apache Airflow questions. The source code is on <a href="https://github.com/astronomer/ask-astro?ref=blef.fr">Github</a>.</li><li><a href="https://www.prefect.io/blog/implications-of-scaling-airflow?ref=blef.fr">The implications of scaling Airflow</a> — Sarah, who's working at Prefect, wrote a post about Airflow's downsides at scale and how Prefect mitigates them. I'd not say that all the downsides are relevant blockers, but it still outlines one of the biggest Airflow issues: everything is implicit. Airflow is a framework allowing a wide range of code, easily leading to debt.</li><li><a href="https://leo-godin.medium.com/quick-dbt-patterns-d9173700c08a?ref=blef.fr">dbt pattern, test-transform-publish</a> —&nbsp;Often called the staging pattern. The idea is to publish the data only once tests have validated that it is valid. What Leo proposes is an incremental transformation with tests on top. If the tests pass, then a view runs and selects the last update.</li><li><a href="https://teej.ghost.io/a-guide-to-the-snowflake-results-cache/?ref=blef.fr">A guide to the Snowflake results cache</a> — Caching is a critical piece of every data warehouse, either for reusing data between runs or between stages in the same run.
This article details what you have to understand to optimise your Snowflake query writing.</li><li><a href="https://aws.amazon.com/blogs/big-data/use-the-new-sql-commands-merge-and-qualify-to-implement-and-validate-change-data-capture-in-amazon-redshift/?ref=blef.fr">Use the new SQL commands MERGE and QUALIFY in Redshift</a> — Redshift still exists and tries to catch up with the competition. MERGE allows you to deduplicate data by writing what you want to keep when rows match, and QUALIFY filters the results of a previously computed window function.</li><li><a href="https://www.arecadata.com/real-time-analytics-with-dynamic-tables-in-snowflake-redpanda/?ref=blef.fr">Real-time analytics with Snowflake dynamic tables &amp; Redpanda</a> — A good showcase of Snowflake dynamic tables with Wikipedia data.</li></ul>
<p></p>
<h1 id="data-economy-%F0%9F%92%B0">Data Economy 💰</h1>
<ul><li><a href="https://www.cnbc.com/2023/09/21/cisco-acquiring-splunk-for-157-a-share-in-cash.html?ref=blef.fr">Cisco acquired Splunk</a> for $28b in cash. Crazy amount. Splunk has been here for a while, providing an all-in-one platform for tech observability by ingesting logs and events to provide insights on a tech stack.</li><li><a href="https://www.secoda.co/?ref=blef.fr"><strong>Secoda</strong></a><strong> <a href="https://www.secoda.co/blog/secoda-series-a-monitoring?ref=blef.fr">raises a $14m Series A</a></strong>. Secoda is a data catalog tool with lineage and monitoring capabilities. Fresh money will help them add AI capabilities to the product and increase monitoring capabilities.</li><li><a href="https://motherduck.com/?ref=blef.fr"><strong>Motherduck</strong></a><strong> <a href="https://motherduck.com/blog/motherduck-open-for-all-with-series-b/?ref=blef.fr">raises $52.5m Series B</a></strong>. In total they raised $100m and announced that the Motherduck product is open for everyone and no longer behind a waitlist. Mainly Motherduck is the company providing DuckDB as a Cloud product, but they are not developing DuckDB. Their product is quite young but works as expected: with a simple string you can get an analytical cloud database that just works and that can be instantly replaced by a local one if needed.</li><li><a href="https://tabular.io/?ref=blef.fr"><strong>Tabular</strong></a> <a href="https://www.businesswire.com/news/home/20230919876739/en/Tabular-Secures-26M-for-Independent-Data-Platform-based-on-Apache-Iceberg?ref=blef.fr">raised $26m Series B</a>. Tabular is the company providing a cloud platform on top of Apache Iceberg—developed by Iceberg's founders. I'd say that Iceberg (or table formats) is probably one of the technologies that will incrementally change the way we write data pipelines for the better.
Providing <a href="https://tabular.io/blog/the-case-for-independent-storage/?ref=blef.fr">more control</a> over data storage. Yet I think Iceberg is not yet ready to be widely used (<a href="https://github.com/apache/iceberg/issues/6564?ref=blef.fr">Python write support</a> still missing, you need Spark).</li><li><strong>Anthropic</strong> <a href="https://www.anthropic.com/index/anthropic-amazon?ref=blef.fr">could get $4b from Amazon</a>. Amazon did a first $1.3b in a corporate round to bring a lot of money to one of the biggest OpenAI. The ChatGPT alternative, <a href="https://www.anthropic.com/product?ref=blef.fr">Claude</a>, is already out there.</li></ul>
<hr>
<p>See you on Friday ✨.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 23.37 ]]></title>
<description><![CDATA[ Data News #23.37 — A lot of articles this week, Falcon 180B, HuggingFac(ing) the senate, Snowflake and BigQuery tips, Databricks still burning cash and raising, etc. ]]></description>
                    <link><![CDATA[ /data-news-week-23-37/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 64faeddb92b9c00001df3c3c ]]></guid>
                    <pubDate><![CDATA[ 2023-09-15 ]]></pubDate>
                    <content>
                        <![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/photo-1476164933423-150b771b627f?ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&amp;auto=format&amp;fit=crop&amp;w=1000&amp;q=80" class="kg-image" alt="man walking near tall trees" loading="lazy"><figcaption><span style="white-space: pre-wrap;">Facing the News (</span><a href="https://unsplash.com/photos/oDiU9WRz5CI?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure>
<p>Hello Data News readers. I'm still struggling to get back into my usual work rhythm. If you add the fact that last week I came up with fewer articles than I expected, this has led me to another blank page. Anyway, after 2 years of work, I have to accept and let go when necessary. But don't worry, I haven't forgotten about you.</p>
<p>Let's quickly jump to the news, because it's rather busy.</p>
<p></p>
<h1 id="gen-ai-news-%F0%9F%A4%96">(Gen) AI News 🤖</h1>
<ul><li><a href="https://towardsdatascience.com/reinforcement-learning-an-easy-introduction-to-value-iteration-e4cfe0731fd5?ref=blef.fr">Reinforcement Learning: an easy introduction to value iteration</a> — The title says easy, but the article contains maths formulas. RL always feels a bit like magic and this article explains it well through golf concepts.</li><li><a href="https://huggingface.co/blog/falcon-180b?ref=blef.fr">Falcon 180B has been released on HF</a> — It's interesting to note that Falcon has been developed at the Technology Innovation Institute (TII) in Abu Dhabi. It brings diversity to foundation models, which usually come from the US. But given the number of parameters (180B), <a href="https://towardsdatascience.com/falcon-180b-can-it-run-on-your-computer-c3f3fb1611a9?ref=blef.fr">can it run on your computer</a>? Spoiler: according to Benjamin it needs 100GB of RAM to run and good GPUs to fine-tune.</li><li>If you're late to the party and you need fresh views on LLMs, Daniel wrote an introduction <a href="https://dataengineeringcentral.substack.com/p/demystifying-the-large-language-models?ref=blef.fr">demystifying the Large Language Models</a> and Jesse wrote about <a href="https://www.jesse-anderson.com/2023/09/gpt-and-llms-from-a-data-engineering-perspective/?ref=blef.fr">LLMs impact from a Data Engineering perspective</a>.</li><li>At the same time GitHub Research <a href="https://github.blog/2022-09-07-research-quantifying-github-copilots-impact-on-developer-productivity-and-happiness/?ref=blef.fr">quantified GitHub Copilot’s impact on developer productivity and happiness</a> — Developer productivity is a difficult measure to compute. Also productivity ≠ speed, but speed is important. The research also showed that people using GitHub Copilot feel 88% more productive and are more efficient and less frustrated.</li><li>HuggingFace CEO and co-founder <a href="https://twitter.com/ClementDelangue/status/1702095553503412732?ref=blef.fr">opening statement</a> at the AI insight forum — This week US AI giants went to a 6-hour private meeting with 60 US senators to explore AI regulation. Clement Delangue transparently shared his speech on Twitter. Mainly he talks about openness, risk measurement—like misinformation, election manipulation or carbon emissions increase—and finally safeguard implementation.</li><li>Meta developed <a href="https://engineering.fb.com/2023/09/07/data-infrastructure/arcadia-end-to-end-ai-system-performance-simulator/?ref=blef.fr">an end-to-end AI system performance simulator</a> called Arcadia. From what I understand this performance simulator unlocks capabilities in finding the best parameters for training.</li></ul>
<div class="kg-card kg-callout-card kg-callout-card-grey"><div class="kg-callout-emoji">💡</div><div class="kg-callout-text">Additional big tech stuff to check: <a href="https://www.etsy.com/codeascraft/the-so-fine-real-time-ml-paradigm?ref=blef.fr" rel="noreferrer">real-time ML training</a> at Etsy and <a href="https://medium.com/pinterest-engineering/last-mile-data-processing-with-ray-629affbf34ff?ref=blef.fr" rel="noreferrer">last mile data processing with Ray</a> at Pinterest.</div></div>
<p></p>
<h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1>
<figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/photo-1624628564627-89a340e05cdf?ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&amp;auto=format&amp;fit=crop&amp;w=1000&amp;q=80" class="kg-image" alt="white and purple card on white surface" loading="lazy"><figcaption><span style="white-space: pre-wrap;">I can predict a project failure (</span><a href="https://unsplash.com/photos/mnf5Q9nTkhs?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure>
<ul><li><a href="https://www.theregister.com/2023/09/05/birmingham_city_council_oracle/?ref=blef.fr">Birmingham City Council has to pay 5x the initial price</a> of the new Oracle ERP project. From £20 million to around £100 million. Crazy amounts.</li><li>I just discovered this week that in June <a href="https://cloud.google.com/blog/products/data-analytics/join-optimizations-with-bigquery-primary-and-foreign-keys?hl=en&ref=blef.fr">BigQuery introduced primary keys and foreign keys</a>.</li><li>How to reduce warehouse costs? — Hugo proposes <a href="https://medium.com/@hugolu87/5-minute-hacks-to-optimise-data-warehouse-cost-and-speed-snowflake-bigquery-postgres-etc-314e5d6444ac?ref=blef.fr">7 hacks to optimise data warehouse</a> cost. And if you can read French (🇫🇷) there is the super post by a French data collective about <a href="https://moderndatanetwork.medium.com/comment-r%C3%A9duire-ses-co%C3%BBts-google-bigquery-99f34d4fd2f0?ref=blef.fr">comment réduire ses coûts Google BigQuery?</a> (how to reduce your Google BigQuery costs).</li><li><a href="https://medium.com/@alvaroparra/snowflake-cron-format-conflicts-and-alternatives-to-solve-them-8b4cc4d34995?ref=blef.fr">* * * * * schedule Snowflake queries</a> —&nbsp;If you want to live dangerously you can use Snowflake table schedules to compute tables periodically. I don't recommend it, it's a Pandora's box we don't want to open.</li><li><a href="https://www.y42.com/blog/dimensional-modeling/?ref=blef.fr">Dimensional data modeling with dbt</a> — A great 6-step process to create a simple dim-fact model with dbt. It also uses the dbt_utils macro to generate a surrogate key.</li><li><a href="https://medium.com/datamindedbe/head-to-head-comparison-of-dbt-sql-engines-497d71535881?ref=blef.fr">Head-to-head comparison of 3 dbt SQL engines</a> — A comparison between DuckDB, Spark and Trino where DuckDB wins almost every fight. Obviously it's biased by the fact that the comparison is done on a single node, which is exactly what DuckDB is built for.</li><li><a href="https://medium.pimpaudben.fr/scrape-analyze-football-data-with-kestra-duckdb-and-malloy-a0fbde7c2d31?ref=blef.fr">Scrape &amp; analyse football data</a> — Benoit nicely puts in perspective how to use Kestra, Malloy and DuckDB to analyse data.</li><li><a href="https://dagster.io/blog/python-factory-patterns?ref=blef.fr">Factory Patterns in Python</a> — It reminds me of Java design pattern classes at engineering school. A bittersweet feeling. Still, I think the Factory pattern is probably the one I've used the most since the beginning of my career and this post explains it well.</li><li><a href="https://nightingaledvs.com/spaghetti-dashboard-chart-solutions/?ref=blef.fr">When charts look like spaghetti, try these saucy solutions</a> —&nbsp;Great tips to enhance your dashboards.</li><li>❤️ <a href="https://sambail.com/2023/09/01/the-key-to-building-a-high-performing-data-team-is-structured-onboarding/?ref=blef.fr">The key to building a high-performing data team is structured&nbsp;onboarding</a> — The title says it all. The article mentions 2 key pieces: first you need a great onboarding doc, then you need to successfully pass the "bootcamp" phase, which covers the first 2 weeks.</li></ul>
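Since the Factory pattern gets a mention above, here is a minimal Python sketch of the idea as it tends to show up in data pipelines: one function maps a config string to a concrete class, so callers never import the concrete classes. The connector names and classes are invented for illustration, they are not from the Dagster post:

```python
# Minimal factory: map a config string to a concrete connector class.
# All names here are hypothetical, for illustration only.
class PostgresConnector:
    def url(self):
        return "postgresql://host:5432/db"

class BigQueryConnector:
    def url(self):
        return "bigquery://project/dataset"

_CONNECTORS = {"postgres": PostgresConnector, "bigquery": BigQueryConnector}

def connector_factory(kind: str):
    """Return a connector instance for the given kind, hiding the concrete class."""
    try:
        return _CONNECTORS[kind]()
    except KeyError:
        raise ValueError(f"unknown connector: {kind}")
```

The win is that adding a new connector is one class plus one dictionary entry; nothing that calls `connector_factory` has to change.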
<blockquote>Of course, great onboarding isn’t the only thing necessary to build a high performing team, but it’s almost impossible to build one without great onboarding</blockquote>
<p></p>
<h1 id="github-gems-%F0%9F%92%8E">Github gems 💎</h1>
<ul><li><a href="https://github.com/Nike-Inc/brickflow?ref=blef.fr"><strong>nike-inc/brickflow</strong></a> — Nike's engineering team released a Python framework to orchestrate jobs in Databricks Workflows. Mainly <a href="https://engineering.nike.com/brickflow/v0.10.1/highlevel/?ref=blef.fr">it maps Airflow concepts</a> to a declarative interface over Databricks objects like Clusters, Workflows or Notebooks in order to orchestrate them.</li><li><a href="https://github.com/sourcegraph/cody?ref=blef.fr"><strong>sourcegraph/cody</strong></a> — <em>Cody is a free, open-source AI coding assistant that can write and fix code, provide AI-generated autocomplete, and answer your coding questions. </em>Under the hood it uses either Anthropic or OpenAI LLMs and requires a free cody.dev account.</li><li><a href="https://github.com/teej/titan?ref=blef.fr"><strong>teej/titan</strong></a> — <em>Titan is a Python library to manage data warehouse infrastructure</em>. Titan allows you to create Snowflake Databases, Warehouses, Roles and RoleGrants programmatically.</li></ul>
<p></p>
<h1 id="data-economy-%F0%9F%92%B0">Data Economy 💰</h1>
<figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/photo-1553285991-4c74211f5097?ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&amp;auto=format&amp;fit=crop&amp;w=1000&amp;q=80" class="kg-image" alt="rectangular red Supreme container" loading="lazy"><figcaption><span style="white-space: pre-wrap;">Databricks atm (</span><a href="https://unsplash.com/photos/I9qcFjyuJGw?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure>
<ul><li><a href="https://sqream.com/?ref=blef.fr"><strong>SQream</strong></a> <a href="https://techcrunch.com/2023/09/12/sqream-series-c/?ref=blef.fr">raises $45m Series C</a>. SQream is a GPU-based SQL database that can act as a data warehouse, promising peak performance at PB scale thanks to the GPU architecture. It also works well for machine learning use-cases.</li><li><a href="https://www.gable.ai/?ref=blef.fr"><strong>Gable</strong></a> <a href="https://www.linkedin.com/feed/update/urn:li:activity:7107413267072917504/?updateEntityUrn=urn%3Ali%3Afs_updateV2%3A%28urn%3Ali%3Aactivity%3A7107413267072917504%2CFEED_DETAIL%2CEMPTY%2CDEFAULT%2Cfalse%29&ref=blef.fr">raises $7m in seed funding</a>. Chad Sanderson launched his data contracts product / platform with 2 other co-founders. Chad has produced a lot of content around contracts in the last 2 years. It seems Gable is here to fix upstream data quality with contracts. Alerts will be sent in GitHub to notify owners when something breaks enforced rules.</li><li><strong>Databricks</strong> <a href="https://techcrunch.com/2023/09/14/databricks-raises-500m-more-boosting-valuation-to-43b-despite-late-stage-gloom/?ref=blef.fr">raises, another, $500m in Series I</a>. Soon there will be no letters left in the alphabet to associate with Databricks fundraising. Since the beginning they have raised $4b and are today valued at $43b. Nothing to say except that they love to <a href="https://www.theinformation.com/articles/inside-databricks-contrarian-playbook-burn-1-5-billion-to-buy-big-growth?ref=blef.fr">burn cash</a>. Be ready for a downhill ride in 2025 if you have picked Databricks.</li><li><strong>Treefera</strong> <a href="https://www.treefera.com/blog/treefera-pre-seed-funding-round?ref=blef.fr">raises $2.2m in pre-seed</a> to develop a data platform that monitors forests, built for carbon offsetting and reforestation. I really like their "data products" approach and the geo visuals over forest risks.</li><li><a href="https://www.collibra.com/us/en/company/newsroom/press-releases/collibra-acquires-sql-data-notebook-vendor-husprey?ref=blef.fr">Collibra acquires SQL data notebook</a> <a href="https://www.husprey.com/?ref=blef.fr"><strong>Husprey</strong></a>. Husprey is a Notion-like notebook plugged directly into the warehouse, to write stories on top of interesting tables or facts. It will become a nice product in the Collibra data governance ecosystem.</li></ul>
<hr>
<p>See you next week ❤️.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 23.35 ]]></title>
                    <description><![CDATA[ Data News #23.35 — I&#39;m back. Let&#39;s digest what happened in August: dbt tests, Gen AI with Meta new models release, Python into Excel, Airflow new features, Terraform, etc. ]]></description>
                    <link><![CDATA[ /data-news-week-23-35/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 64f1a2e052183200010f83e5 ]]></guid>
                    <pubDate><![CDATA[ 2023-09-01 ]]></pubDate>
                    <content>
                        <![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/photo-1535982330050-f1c2fb79ff78?ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&amp;auto=format&amp;fit=crop&amp;w=1000&amp;q=80" class="kg-image" alt="flat lay photography of blue backpack beside book and silver MacBook" loading="lazy"><figcaption><span>Back to school (</span><a href="https://unsplash.com/photos/02z1I7gv4ao?ref=blef.fr" rel="noreferrer"><span>credits</span></a><span>)</span></figcaption></figure>
<p>Hey, I'm back.</p>
<p>I've taken an unplanned 3-week break since the last Data News, let's be honest, it was necessary! I spent a few hours working on the <a href="https://www.blef.fr/the-fancy-data-stack/">fancy data stack</a> project and articles are in the works, but it was unrealistic to produce quality code and content while enjoying the summer. Like wine, it takes time to get it right. If you want a first glimpse of the Dagster code, you can look at it on <a href="https://github.com/Bl3f/tdf?ref=blef.fr">GitHub</a>; not yet documented, but the commit messages are clean.</p>
<p>On September 1, I'm still getting used to the school rhythm. A new year starts in September: new friends, new classes and new things. Even if, as an adult, things are different now. <strong>Data News is back, but with the same recipe: a weekly newsletter to let you catch up on the previous weeks' articles</strong>. I make the selection myself; I choose things I like while being under others' influence. But I'm not an influencer. I just create content.</p>
<figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/09/Screenshot-2023-09-01-at-15.00.14.png" class="kg-image" alt="" loading="lazy" width="2000" height="1458" srcset="https://www.blef.fr/content/images/size/w600/2023/09/Screenshot-2023-09-01-at-15.00.14.png 600w, https://www.blef.fr/content/images/size/w1000/2023/09/Screenshot-2023-09-01-at-15.00.14.png 1000w, https://www.blef.fr/content/images/size/w1600/2023/09/Screenshot-2023-09-01-at-15.00.14.png 1600w, https://www.blef.fr/content/images/size/w2400/2023/09/Screenshot-2023-09-01-at-15.00.14.png 2400w" sizes="(min-width: 720px) 720px"><figcaption><span>A glimpse into a fancy assets graph.</span></figcaption></figure>
<p></p>
<p>This week features what happened in August; even though it was the summer holidays, news, features and drama hit the data world. Enjoy the news recap.</p>
<p></p>
<h1 id="dbt-tests-%F0%9F%A7%AA">dbt tests 🧪</h1>
<p>dbt Core's proposition has been to bring software engineering practices to SQL development. Obviously testing is invited to the party, but tests are hard and everyone does and understands tests differently. There are unit, integration, functional and end-to-end tests. </p>
<p>This summer a lot of people wrote about testing with dbt.</p>
<div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">💡</div><div class="kg-callout-text">Before you start reading something else I recommend you the excellent video <a href="https://www.youtube.com/watch?v=hxvVhmhWRJA&ref=blef.fr" rel="noreferrer"><i><em class="italic">Testing: Our assertions vs. reality</em></i></a> from last Coalesce on YouTube.</div></div>
<ul><li><a href="https://www.elementary-data.com/post/dbt-tests?ref=blef.fr">dbt tests: How to write fewer and better data tests?</a> — Ari catalogs the kinds of tests you can write with dbt. <strong>Do you want to test data or code changes?</strong> (<em>this is the most important question tbh</em>) Do you want to test schema changes, missing data, volume or value anomalies? He covers everything.</li><li><a href="https://datacoves.com/post/dbt-test-options?ref=blef.fr">An overview of testing options for dbt</a> — Another exhaustive and less opinionated list of the options out there to write tests on data.</li><li><a href="https://towardsdatascience.com/a-simple-yet-effective-approach-to-implementing-unit-tests-for-dbt-models-da2583ea8e79?ref=blef.fr">A simple approach to implementing unit tests for dbt Models</a> — Mahdi proposes a CTE nomenclature to create inputs and outputs in dbt models to unit test them.</li><li><a href="https://github.com/dbt-labs/dbt-core/discussions/8275?ref=blef.fr">dbt Core unit tests are coming</a> — A discussion on Github about unit tests and fixture definitions in YAML to test models. If implemented within dbt Core it would be the most awesome feature, because hacking with seeds and custom macros looks nasty.</li></ul>
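To make the "test data vs. test code" vocabulary concrete, this is roughly what dbt's two most common generic data tests (not_null and unique) assert, rewritten as plain Python over a list of rows. This is a sketch of the semantics only, not dbt's actual implementation, which compiles these tests to SQL:

```python
# What dbt's not_null and unique generic data tests assert, in plain Python.
from collections import Counter

def failing_not_null(rows, column):
    """Rows where the column is NULL; dbt's not_null test fails if any exist."""
    return [r for r in rows if r.get(column) is None]

def failing_unique(rows, column):
    """Non-NULL values appearing more than once; dbt's unique test fails if any exist."""
    counts = Counter(r.get(column) for r in rows if r.get(column) is not None)
    return [value for value, n in counts.items() if n > 1]

# A toy model output with one NULL and one duplicated key.
rows = [{"id": 1}, {"id": 2}, {"id": 2}, {"id": None}]
```

Both are data tests: they run against what landed in the warehouse, not against your SQL logic, which is exactly the gap the unit-test discussion above tries to fill.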
<p></p>
<h1 id="generative-ai-%F0%9F%A4%96">Generative AI 🤖</h1>
<p>I haven't really been keeping up with the news because it moves too fast, but here are a few things that have stood out:</p>
<ul><li><strong>Meta releasing models faster than before</strong> — <a href="https://ai.meta.com/blog/dinov2-facet-computer-vision-fairness-evaluation/?utm_source=twitter&utm_medium=organic_social&utm_campaign=blog&utm_content=video">Expanding DINOv2</a>, a computer vision model (<a href="https://twitter.com/MetaAI/status/1697233910135148562?ref=blef.fr">on X</a>), releasing <a href="https://ai.meta.com/resources/models-and-libraries/seamless-communication/?utm_source=twitter&utm_medium=organic_social&utm_campaign=seamless&utm_content=card">SeamlessM4T</a>, a multilingual multimodal translation model (<a href="https://twitter.com/MetaAI/status/1694020437532151820?ref=blef.fr">on X</a>), and releasing <a href="https://ai.meta.com/blog/code-llama-large-language-model-coding/?ref=blef.fr">Code Llama</a>, an LLM for coding.</li><li><strong>Snowflake <a href="https://www.snowflake.com/blog/meta-code-llama-testing/?ref=blef.fr"><strong>fine-tuning Code Llama</strong></a> for SQL generation</strong> — With this fine-tuning it seems they are close to GPT-4 accuracy in text-to-SQL.</li><li>Llama 2 is about as factually accurate as GPT-4 for summaries and is <a href="https://www.anyscale.com/blog/llama-2-is-about-as-factually-accurate-as-gpt-4-for-summaries-and-is-30x-cheaper?ref=blef.fr">30X cheaper</a>.</li><li>A <a href="https://twitter.com/DFintelligence?ref=blef.fr">French Youtuber</a> released on <a href="https://twitter.com/matteoepik/status/1695345336213295378?ref=blef.fr">Twitch a 24/7 AI deep-faking French presidents</a> (Macron, De Gaulle, Chirac) answering the Twitch chat's questions, but his channel got banned by a Twitch bot after AI-Macron said something illegal while answering a question about the worst French cities. AI fights: this is the future we want.</li></ul>
<p></p>
<h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1>
<figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.anaconda.com/wp-content/uploads/2023/08/Untitled-2.gif" class="kg-image" alt="" loading="lazy"><figcaption><span>A certain idea of hell.</span></figcaption></figure>
<ul><li><a href="https://www.anaconda.com/blog/announcing-python-in-excel-next-level-data-analysis-for-all?ref=blef.fr"><strong>Python into Excel</strong></a> — Microsoft and Anaconda announced Python coming to Excel. I'm bittersweet about it: on one side I don't think Excel is a good platform for software development, on the other side, let's be honest and face the truth, Excel is the only data platform business users want. Still, the big winner here is Microsoft, because the Python code will run on Azure.</li><li><strong>After Excel, Notebooks get a second youth</strong> — Meta explained how they schedule <a href="https://engineering.fb.com/2023/08/29/security/scheduling-jupyter-notebooks-meta/?ref=blef.fr">Jupyter Notebooks in production</a>, Google announced BigQuery Studio with <a href="https://cloud.google.com/blog/products/data-analytics/whats-new-with-data-analytics-and-ai-at-next23?hl=en&ref=blef.fr">embedded Notebooks</a> in the UI and Jupyter released <a href="https://jupyter-ai.readthedocs.io/en/latest/?ref=blef.fr">Jupyter AI</a> (you call it with <code>%ai</code>) to bring Gen AI to the notebook.</li><li><strong>New features in Airflow</strong> — with 2.7 you get a <a href="https://airflow.apache.org/blog/airflow-2.7.0/?ref=blef.fr">Cluster Activity UI</a> and with the new <a href="https://github.com/kaxil/airflowctl?ref=blef.fr">airflowctl</a> CLI you can spin up Airflow instances in a wink.</li><li><strong>Introducing the revamped <a href="https://www.getdbt.com/blog/introducing-new-look-dbt-semantic-layer/?ref=blef.fr"><strong>dbt Semantic Layer</strong></a></strong> — dbt Labs announced the Beta of the Semantic Layer, which will be a paid product in dbt Cloud. I've already written a lot about the semantic layer and more is to come. So let's see where it goes.</li><li><strong>Introducing SOL: <a href="https://motifanalytics.medium.com/introducing-sol-sequence-operations-language-87a0d1d73497?ref=blef.fr"><strong>Sequence Operations Language</strong></a></strong> — A new language dedicated to sequence analyses, which can be useful when working with web traffic data.</li><li><strong>Answering "<a href="https://maxhalford.github.io/blog/kpi-evolution-decomposition/?ref=blef.fr"><strong>Why did the KPI change?</strong></a>" using decomposition</strong> — If you are an analyst who needs to explain every day why a metric increased or decreased, this article is for you. Max explores metric decomposition for sums and ratios. This is brilliant.</li><li><a href="https://blog.datumagic.com/p/apache-hudi-from-zero-to-one-110?ref=blef.fr">Apache Hudi: From Zero To One (1/10)</a>.</li></ul>
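On the KPI decomposition topic, the ratio case can be sketched in a few lines. Note the caveats: this uses one of several valid exact splits into a numerator effect and a denominator effect, the split convention is a choice (not *the* canonical one, and not necessarily the one Max uses), and the example numbers are invented:

```python
# Exact decomposition of a ratio KPI change into numerator and denominator effects.
# One of several valid exact splits; the convention here is a choice, for illustration.
def decompose_ratio_change(n0, d0, n1, d1):
    numerator_effect = (n1 - n0) / d0             # numerator moved, denominator held at old value
    denominator_effect = n1 * (1 / d1 - 1 / d0)   # denominator moved, numerator held at new value
    return numerator_effect, denominator_effect

# Invented example: conversion rate moves from 50/1000 (5%) to 66/1100 (6%).
num_eff, den_eff = decompose_ratio_change(50, 1000, 66, 1100)
# num_eff + den_eff equals the observed change (0.06 - 0.05 = 0.01) exactly,
# which is what makes this kind of split useful for "why did the KPI change?".
```

The two effects always sum to the total change, so an analyst can report "the rate moved +1pt: +1.6pt from more conversions, -0.6pt from more traffic" without any residual term.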
<h3 id="drama">Drama</h3>
<ul><li><strong>Instacart's Snowflake bills </strong>— When public companies publish results, the numbers get looked at. This time Instacart's bills have been scrutinized. The company said it <a href="https://twitter.com/modestproposal1/status/1695177654822191184?ref=blef.fr">spent</a> $13m, $28m and $51m on Snowflake in 2020, 2021 and 2022 respectively, and plans to spend $15m in 2023. <br><br>Some people supposed Instacart found the magic solution to reduce costs, others said it <a href="https://twitter.com/GergelyOrosz/status/1697192807801184561?ref=blef.fr">migrated</a> to Databricks. But the main reason is: prepaid credits. The <a href="https://www.snowflake.com/blog/snowflake-and-instacart-the-facts/?ref=blef.fr">Snowflake press</a> team even wrote a post.<br><br>Still, you can watch the perfectly timed video about <a href="https://www.youtube.com/watch?v=up3bTjrBvTA&ref=blef.fr">How Instacart Optimized Snowflake Costs by 50%</a> or read about <a href="https://engineering.hellofresh.com/data-driven-snowflake-optimisation-at-hellofresh-55a5b56aa9af?ref=blef.fr">Snowflake optimisation at HelloFresh</a>.</li><li><a href="https://thenewstack.io/hashicorp-abandons-open-source-for-business-source-license/?ref=blef.fr"><strong>Hashicorp changed Terraform's license model</strong></a> — Hashicorp decided to move from the Mozilla Public License to the Business Source License (BSL). BSL is source-available and not really open-source. Following the announcement, OpenTF <a href="https://www.theregister.com/2023/08/28/opentf_forks_terraform_code/?ref=blef.fr">forked</a> the repo.</li></ul>
<h3 id="data-platform-stuff">Data platform stuff</h3>
<p>4 articles that give food for thought about the future of the data field.</p>
<ul><li><a href="https://materialize.com/blog/warehouse-abuse/?ref=blef.fr">The uses and abuses of cloud data warehouses</a> — A streaming database saying to a batch database: "you're not suited for operational use-cases, only analytical". The batch database answered one day later.</li><li><a href="https://mattpalmer.io/posts/level-up-medallion-architecture/?ref=blef.fr">Level-up with a Medallion architecture</a> — bronze, silver and gold are the structuring layers of the Medallion architecture. Matt explains it for you.</li><li><a href="https://moderndata101.substack.com/p/the-data-contract-pivot-in-data-engineering-8bb?ref=blef.fr">The data contract pivot in data engineering</a> — It's a fancy name, but it aims to solve upstream data problems with a technical + process solution.</li><li><a href="https://substack.timodechau.com/p/after-the-modern-data-stack-welcome?ref=blef.fr">After the modern data stack: welcome back, data platforms</a>.</li></ul>
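To picture the bronze, silver, gold layering mentioned above, here is a toy Python sketch of the flow (the fields and cleaning rules are invented for illustration): bronze keeps the raw rows as ingested, silver cleans and deduplicates, gold aggregates for the business.

```python
# Toy medallion flow: bronze (raw) -> silver (cleaned, deduplicated) -> gold (aggregated).
from collections import defaultdict

bronze = [  # raw events as ingested, duplicates and bad rows included
    {"order_id": "o1", "amount": "10.0", "country": "FR"},
    {"order_id": "o1", "amount": "10.0", "country": "FR"},  # duplicate
    {"order_id": "o2", "amount": None, "country": "FR"},    # bad row, no amount
    {"order_id": "o3", "amount": "5.5", "country": "DE"},
]

def to_silver(bronze_rows):
    """Drop bad rows, cast types, deduplicate on the business key."""
    seen, silver = set(), []
    for row in bronze_rows:
        if row["amount"] is None or row["order_id"] in seen:
            continue
        seen.add(row["order_id"])
        silver.append({**row, "amount": float(row["amount"])})
    return silver

def to_gold(silver_rows):
    """Business-level aggregate: revenue per country."""
    revenue = defaultdict(float)
    for row in silver_rows:
        revenue[row["country"]] += row["amount"]
    return dict(revenue)

gold = to_gold(to_silver(bronze))
```

Keeping bronze untouched is the point of the architecture: when a silver rule turns out to be wrong, you replay from the raw layer instead of re-ingesting from sources.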
<p></p>
<h1 id="data-economy-%F0%9F%92%B0">Data Economy 💰</h1>
<ul><li><a href="https://techcrunch.com/2023/08/24/hugging-face-raises-235m-from-investors-including-salesforce-and-nvidia/?ref=blef.fr" rel="noreferrer"><strong>Hugging Face</strong> raises $235m in Series D</a>. You can see Hugging Face as the GitHub of machine learning models, but it's much more today: a global platform to distribute AI—in every form possible. Obviously, with the new popularity of Generative AI models, HF is playing a key distribution role.</li><li><a href="https://www.stemma.ai/blog-post/stemma-teradata?ref=blef.fr"><strong>Stemma</strong> has been acquired by Teradata</a>. Stemma is a company founded by ex-Lyft employees who worked on the company's data catalog <a href="https://github.com/amundsen-io/amundsen?ref=blef.fr">Amundsen</a>. Mainly, Stemma is built on top of Amundsen with enterprise features. Consolidation.</li><li><a href="https://rockset.com/?ref=blef.fr"><strong>Rockset</strong></a> <a href="https://rockset.com/press/rockset-raises-44-million-to-power-search-analytics-and-ai-applications/?ref=blef.fr">raises $44m in Series B</a>. Rockset is a real-time search (and analytics) database aiming to replace Elastic. Like Elastic, but in the cloud.</li><li><a href="https://www.prnewswire.com/news-releases/ikigai-labs-announces-25m-in-series-a-funding-to-bring-generative-ai-for-tabular-data-to-all-enterprises-301908366.html?tc=eml_cleartime&ref=blef.fr"><strong>Ikigai Labs</strong> raises $25m in Series A</a>. Ikigai provides a web platform to do data transformations in a visual way on top of tabular data. You can do entity resolution or forecasting, for instance.</li><li><a href="https://dagster.io/blog/introducing-dagster-labs?ref=blef.fr">Elementl becomes <strong>Dagster Labs</strong></a>, to make it clear. I'm announcing blef Labs soon.</li><li>The Information reported that <a href="https://www.theinformation.com/articles/openai-passes-1-billion-revenue-pace-as-big-companies-boost-ai-spending?ref=blef.fr"><strong>Open AI</strong> will pass $1b in annual revenue</a> "over the next 12 months".</li></ul>
<hr>
<p>Feels good to be back, see you next week ❤️. I hope you enjoyed your summer.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ The fancy data stack—batch version ]]></title>
                    <description><![CDATA[ Data News Summer Edition — Design the fancy data stack to explore the Tour de France data. ]]></description>
                    <link><![CDATA[ /the-fancy-data-stack/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 64c38732ebe10c0001212e1b ]]></guid>
                    <pubDate><![CDATA[ 2023-08-04 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/photo-1688325923282-f75db5d6695e?ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&amp;auto=format&amp;fit=crop&amp;w=1000&amp;q=80" class="kg-image" alt="a harbor filled with lots of boats on top of water" loading="lazy"><figcaption><span style="white-space: pre-wrap;">Summer Edition (</span><a href="https://unsplash.com/photos/a-harbor-filled-with-lots-of-boats-on-top-of-water-uORt2vJMTSk?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p>This is the first article of the <strong>Data News Summer Edition: how to build a data platform</strong>. I tried to be as short as possible in this first article; details will come in the following ones.</p><p>The modern data stack has been criticised a lot: a few say it's dead, others say we are in the post-modern era. The modern data stack, as a collection of tools that interact to serve data to consumers, is still relevant. Personally I think that the modern data stack is characterised by having a central data storage in which everything happens.</p><p><strong>Let's design the most complete modern data stack, or rather the fancy data stack.</strong></p><p>In this article we will try to design the fancy data stack for a batch usage. A lot of logos and products will be mentioned. This is not a paid article. However, over the years I've met people working at these companies so I might have a few biases.</p><p>As a disclaimer, this may not quite make sense in a corporate context, but since this is my blog, I'll do what I want. 
Still, the idea of this post is to give you an overview of existing tools and how everything fits together.</p><div class="kg-card kg-callout-card kg-callout-card-yellow"><div class="kg-callout-emoji">💡</div><div class="kg-callout-text">If you just want a few articles to read, just go to the bottom of the email.</div></div><p></p><h1 id="a-few-requirements">A few requirements</h1><ul><li>The source data lies in a Postgres database, in flat CSVs and in Google Sheets.</li><li>I want something cloud agnostic—when possible.</li><li>I want to use open-source tooling.</li><li>Everything I do should be production-ready and public. At the end of the experiment you should be able to access the tools—when possible.</li></ul><p></p><h1 id="source-data">Source data</h1><p>When I was looking for data, I wanted a bit of volume, something geographical and without PII. I personally like the <a href="https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page?ref=blef.fr">NYC Taxi trip data</a> but sadly it has been used many times, which removes a bit of the fun. At the same time the Tour de France was ongoing and I found a "way" to get Strava data for Tour athletes. So I thought it was the perfect data to build a data platform.</p><p>Mainly there are 3 datasets:</p><ul><li><strong>Athletes</strong> — all the data about the athletes like their race ids, teams, their profile but also their body size. It will be a Google Sheets.</li><li><strong>Stages</strong> — le Tour de France is a 3-week race with 21 stages; every stage is a GPS path with a few checkpoints. It will be 21 CSVs.</li><li><strong>Race</strong> — the actual race data, which is a GPS data point every second for each athlete on Strava + other data points sometimes. It represents almost half of the peloton. It will be a table in Postgres. 
Postgres is not the best solution for this, but as I want to mimic an enterprise context, having a Postgres database is kinda mandatory.</li></ul><p>Race data will be partitioned per day, but as the Tour is already done, it will be a bit different from a real-life environment. Still, this is something I keep in mind for future trainings. <strong>Because I'm convinced that to learn data engineering you need to experience real-life pipelines running every day, including the morning firefighting</strong>.</p><p>I'll delve into the data in the next article, but I won't detail how I got the data because, you know.... 🏴‍☠️. Actually, it's just a few Python scripts and a bit of F12, but that's not the point of this article.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/08/00---Data-platform-1-.png" class="kg-image" alt="" loading="lazy" width="1970" height="1028" srcset="https://www.blef.fr/content/images/size/w600/2023/08/00---Data-platform-1-.png 600w, https://www.blef.fr/content/images/size/w1000/2023/08/00---Data-platform-1-.png 1000w, https://www.blef.fr/content/images/size/w1600/2023/08/00---Data-platform-1-.png 1600w, https://www.blef.fr/content/images/2023/08/00---Data-platform-1-.png 1970w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Source data (Postgres, CSV and Sheets)</span></figcaption></figure><p></p><h1 id="the-fancy-data-platform">The fancy data platform</h1><p>In order to have a complete data platform we will need to move the data from source to consumption. 
But what will the consumption look like?</p><p>I want to answer multiple use-cases:</p><ul><li>Create a dashboard to explore stage results</li><li>Provide an LLM-driven bot that answers common questions about the race</li><li>Compare 2 athletes' performance on a specific segment and generate a GIF</li></ul><p>In order to answer this we will need to <strong>ingest data from the multiple sources</strong>, then <strong>transform and model the data in the chosen data storage</strong> and finally <strong>develop consumer apps</strong> to answer the business needs.</p><p>Let's try to throw out a first design of our application—with logos. Obviously this can be subject to change, either because it's too complicated or because I want to change something. Once again this is fiction so I can afford to change stuff. </p><p>Actually, one of my main pieces of advice is that <strong>you should never be strict about tech choices because you can't plan the unexpected</strong>. So do yourself a favour and be willing to throw away something that does not work for you.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/08/00---Data-platform-2-.png" class="kg-image" alt="" loading="lazy" width="2000" height="839" srcset="https://www.blef.fr/content/images/size/w600/2023/08/00---Data-platform-2-.png 600w, https://www.blef.fr/content/images/size/w1000/2023/08/00---Data-platform-2-.png 1000w, https://www.blef.fr/content/images/size/w1600/2023/08/00---Data-platform-2-.png 1600w, https://www.blef.fr/content/images/size/w2400/2023/08/00---Data-platform-2-.png 2400w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">The fancy data stack</span></figcaption></figure><p>Just for the sake of being open, there are a lot of alternatives and my choices could have been different. 
Here is what you can also consider if you're building your own platform.</p><ul><li><strong>Extraction</strong><br>Open-source — <a href="https://dagster.io/?ref=blef.fr">Dagster</a>, <a href="https://airbyte.com/?ref=blef.fr">Airbyte</a>, <a href="https://airflow.apache.org/?ref=blef.fr">Airflow</a>, <a href="https://www.prefect.io/?ref=blef.fr">Prefect</a>, <a href="https://mage.ai/?ref=blef.fr">Mage</a>, <a href="https://kestra.io/?ref=blef.fr">Kestra</a>, <a href="https://dlthub.com/?ref=blef.fr" rel="noreferrer">dltHub</a><br>SaaS ($) — <a href="https://www.stitchdata.com/?ref=blef.fr">Stitch</a>, <a href="https://portable.io/?ref=blef.fr">Portable</a>, <a href="https://www.getorchestra.io/?ref=blef.fr" rel="noreferrer">Orchestra</a> and the cloud versions of the OS tools<br>SaaS ($$) — <a href="https://www.fivetran.com/?ref=blef.fr">Fivetran</a></li><li><strong>Transformation</strong><br>SQL — <a href="https://github.com/dbt-labs/dbt-core?ref=blef.fr">dbt</a>, <a href="https://sqlmesh.readthedocs.io/en/stable/?ref=blef.fr">SQLMesh</a><br>Python — <a href="https://pandas.pydata.org/?ref=blef.fr">pandas</a>, <a href="https://www.pola.rs/?ref=blef.fr">polars</a><br>Distributed — <a href="https://spark.apache.org/?ref=blef.fr">Spark</a>, <a href="https://github.com/pathwaycom/pathway?ref=blef.fr">Pathway</a></li><li><strong>Datalake</strong><br>Open-source — <a href="https://min.io/?ref=blef.fr">MinIO</a>, <a href="https://ceph.io/en/?ref=blef.fr">Ceph</a>, <a href="https://lakefs.io/?ref=blef.fr">LakeFS</a>, <a href="https://github.com/open-io?ref=blef.fr">OpenIO</a><br>SaaS ($) — S3, Google Cloud Storage, Azure Blob Storage<br>Table format — <a href="https://iceberg.apache.org/?ref=blef.fr">Apache Iceberg</a>, <a href="https://hudi.apache.org/?ref=blef.fr">Apache Hudi</a>, <a href="https://delta.io/?ref=blef.fr">Delta</a></li><li><strong>Warehouse</strong><br>Open-source — <a href="https://duckdb.org/?ref=blef.fr">DuckDB</a>, <a 
href="https://clickhouse.com/?ref=blef.fr">ClickHouse</a>, <a href="https://pinot.apache.org/?ref=blef.fr">Apache Pinot</a>, <a href="https://kylin.apache.org/?ref=blef.fr">Apache Kylin</a>, <a href="https://doris.apache.org/?ref=blef.fr">Apache Doris</a><br>SaaS ($) — <a href="https://cloud.google.com/bigquery?ref=blef.fr">BigQuery</a>, <a href="http://snowflake.com/?ref=blef.fr">Snowflake</a></li><li><strong>Semantic Layer</strong><br>Open-source — <a href="https://cube.dev/?ref=blef.fr">Cube</a>, <a href="https://www.malloydata.dev/?ref=blef.fr">Malloy</a>, <a href="https://github.com/alash3al/sqler?ref=blef.fr">sqler</a><br>SaaS ($) — <a href="https://www.getdbt.com/product/semantic-layer/?ref=blef.fr">dbt Cloud</a>, <a href="https://cloud.google.com/looker/docs/what-is-lookml?ref=blef.fr#:~:text=LookML%20stands%20for%20Looker%20Modeling,relationships%20in%20your%20SQL%20database.">LookML</a></li><li><strong>Governance</strong><br>Open-source — <a href="https://datahubproject.io/?ref=blef.fr">Datahub</a>, <a href="https://openlineage.io/?ref=blef.fr">OpenLineage</a>, <a href="https://open-metadata.org/?ref=blef.fr">OpenMetadata</a><br>SaaS ($) — <a href="https://www.castordoc.com/?ref=blef.fr">CastorDoc</a>, <a href="https://atlan.com/?ref=blef.fr">Atlan</a></li><li><strong>Analytics</strong><br>Open-source — <a href="https://superset.apache.org/?ref=blef.fr">Superset</a>, <a href="https://www.metabase.com/?ref=blef.fr">Metabase</a>, <a href="https://www.lightdash.com/?ref=blef.fr">Lightdash</a><br>SaaS ($) — <a href="https://www.tableau.com/?ref=blef.fr">Tableau</a>, <a href="https://cloud.google.com/looker?ref=blef.fr">Looker</a>, <a href="https://powerbi.microsoft.com/fr-fr/?ref=blef.fr">PowerBI</a>, <a href="https://whaly.io/?ref=blef.fr">Whaly</a> and the cloud version of the open-source tools</li><li><strong>Exploration</strong><br>Open-source — <a href="https://streamlit.io/?ref=blef.fr">Streamlit</a>, <a 
href="https://jupyter.org/?ref=blef.fr">Jupyter</a><br>SaaS — <a href="https://hex.tech/?ref=blef.fr">Hex</a>, <a href="https://www.graphext.com/?ref=blef.fr">Graphext</a>, <a href="https://www.husprey.com/?ref=blef.fr">Husprey</a>, <a href="https://count.co/?ref=blef.fr">Count</a> (etc. this list can become infinite)</li></ul><p></p><h1 id="conclusion">Conclusion</h1><p>After this design exercise I have mixed feelings. I think this is a fancy stack because I tried to put everything inside, but at the same time I find it quite boring. Like this is just stuff that works. This is linear: I'll move data from A to B to C in order to use it with D. Actually this is just modern data engineering.</p><p>In the following parts of this series you'll follow my adventures in extraction, transformation and serving for analytics and Gen AI usage.</p><p>I hope you'll enjoy this Data News Summer Edition.</p><p></p><h1 id="faq-and-remarks">FAQ and remarks</h1><ul><li><strong>Why do you use Google Cloud?</strong><br>Because my credit card is already in place and I'll be much faster. My opinion on the matter is this: all clouds are born equal, you just have to find the one you're most comfortable with, or suffer your company's choices.</li><li><strong>DuckDB is not really a data warehouse.</strong><br>I picked DuckDB because it's fancy. 
I think I'm gonna hit some limitations, especially in geo compute, so I might switch to ClickHouse or BigQuery if I run out of time.</li><li><strong>I hate Github actions, but I prefer putting code in public on Github.</strong></li><li><strong>I used the way of visualising data platforms </strong><a href="https://about.gitlab.com/handbook/business-technology/data-team/platform/?ref=blef.fr#our-data-stack"><strong>the Gitlab data team is using</strong></a><strong>.</strong></li><li><strong>What about the performance of the platform?</strong><br>I don't really care about performance, because this is not large data and I don't want to spend hours optimising for performance.</li><li><strong>Do you have a budget?</strong><br>Something reasonable. I think ~100€ / month is ok for this experiment.</li><li><strong>What will you do in the LLM category?</strong><br>I don't know yet. If you have ideas about what I can do, reach out.</li><li><strong>Why Dagster?</strong><br>I've been building things with Airflow for almost 5 years, I love trying new things and in the list of orchestrators that have hyped me the most, Dagster is number one. Software-defined assets are something I wanted to play with.</li></ul><hr><h3 id="small-fast-news-%E2%9A%A1%EF%B8%8F">Small Fast News ⚡️</h3><p>If you don't care about this, here are a few articles you might want to read by the pool.</p><ul><li><a href="https://eczachly.substack.com/p/how-to-data-model-correctly-kimball?ref=blef.fr">How to model: Kimball vs One Big Table</a> — This is one of the main topics of discussion in the data space. 
Should you go for dimensional modeling, for OBT, or even for <a href="https://seattledataguy.substack.com/?ref=blef.fr">query-driven data modeling</a> (coined by Joe Reis—who's writing a book about data modeling)?</li><li><a href="https://engineering.linkedin.com/blog/2023/costwiz--saving-cost-for-linkedin-enterprise-on-azure?ref=blef.fr">Costwiz, Saving cost for LinkedIn enterprise on Azure</a> — LinkedIn developed a complete data platform to save costs on Azure.</li><li><a href="https://confidence.spotify.com/?ref=blef.fr">Confidence — An experimentation platform from Spotify</a> — After years of experience building experiments, Spotify decided to release a product for others to do the same. This is in private beta and the move is interesting.</li><li><a href="https://medium.com/walmartglobaltech/duckdb-vs-the-titans-spark-elasticsearch-mongodb-a-comparative-study-in-performance-and-cost-5366b27d5aaa?ref=blef.fr">DuckDB vs. Spark, ElasticSearch and MongoDB</a> — Even if comparing it to NoSQL databases is not really relevant, the tests show DuckDB in a good light.</li><li><a href="https://engineeringblog.yelp.com/2023/07/overview-of-jupyterhub-ecosystem.html?ref=blef.fr">Overview of JupyterHub ecosystem</a> — Just saving this for me because I do stuff on it.</li><li>Read <a href="https://www.dataengineeringweekly.com/?ref=blef.fr">Data Engineering Weekly</a>.</li></ul><hr><p>See you next week ❤️. 
</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/photo-1602566178436-8cf72756f4cb?ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&amp;auto=format&amp;fit=crop&amp;w=1000&amp;q=80" class="kg-image" alt="red and white floral gift boxes" loading="lazy"><figcaption><span style="white-space: pre-wrap;">Be kind to me, this is my birthday (</span><a href="https://unsplash.com/photos/1HIKnKtXEU0?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — mid-2023 popular articles ]]></title>
                    <description><![CDATA[ Data News #23.30 — popular articles since the beginning of the year. ]]></description>
                    <link><![CDATA[ /data-news-week-23-30/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 64c245200c042f0001f16179 ]]></guid>
                    <pubDate><![CDATA[ 2023-07-28 ]]></pubDate>
                    <content>
                        <![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/photo-1513622470522-26c3c8a854bc?ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&amp;auto=format&amp;fit=crop&amp;w=1000&amp;q=80" class="kg-image" alt="two gray and black boats near dock" loading="lazy"><figcaption><span>🧜‍♂️ (</span><a href="https://unsplash.com/photos/3_ZGrsirryY?ref=blef.fr" rel="noreferrer"><span>credits</span></a><span>)</span></figcaption></figure>
<p>Hey, this is a mid-2023 edition with some of my favourite articles and the popular articles that have been shared this year in the newsletter. There isn't any fancy calculation on how to find the popular articles. Here is how it's done.</p>
<p>Every link sent in each newsletter is tracked in 2 ways:</p>
<ul><li>when you click on a link it first redirects you to my blog, so I know that you've clicked on it</li><li>it adds <em>ref=blef.fr</em> to the url, so the original article knows that the traffic comes from me; it's also a great way to support me by making me discoverable to others</li></ul>
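<p>The two mechanisms above can be sketched in a few lines of Python. Note this is a hypothetical reconstruction: the redirect endpoint (<em>blef.fr/r/?to=</em>) is made up for illustration; only the <em>ref=blef.fr</em> query parameter is real.</p>

```python
# Hypothetical sketch of the newsletter's link tracking.
from urllib.parse import parse_qsl, quote_plus, urlencode, urlsplit, urlunsplit

def add_ref(url: str, ref: str = "blef.fr") -> str:
    """Append ref=<ref> to the query string so the target site
    can attribute the traffic to the newsletter."""
    parts = urlsplit(url)
    query = dict(parse_qsl(parts.query))
    query["ref"] = ref
    return urlunsplit(parts._replace(query=urlencode(query)))

def tracked(url: str, redirect_base: str = "https://www.blef.fr/r/?to=") -> str:
    """Wrap the link in a first-party redirect so the click is
    counted before the reader lands on the article."""
    return redirect_base + quote_plus(add_ref(url))

print(tracked("https://example.com/post"))
```

<p>Click counting then reduces to tallying hits on the redirect endpoint, grouped by the <em>to</em> parameter.</p>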
<p>I've used the click data to sort articles by popularity. Obviously it has a few biases—recent editions get more clicks because I have more subscribers—but the impact is minimal.</p>
<p>A few numbers. Since the beginning of the year I've shared around <strong>500 articles</strong>, which generated at least <strong>22k views</strong> on creators' articles. I say at least because this is a low estimate; extrapolating from experience, I think the real number is twice that.</p>
<p>If you have travel time, I also recommend the first episode of <a href="https://podcasters.spotify.com/pod/show/blef/episodes/Episode-1--Joe-Reis-e23mt2h?ref=blef.fr">Data Minds, my podcast, with Joe Reis</a>.</p>
<h1 id="popular-articles">Popular articles</h1>
<p>I have sorted the articles by bucket. The order does not really make sense; they were all popular.</p>
<h3 id="general">General</h3>
<ul><li>💰 Because we all love money, Mikkel's <a href="https://www.synq.io/blog/europe-data-salary-benchmark-2023?ref=blef.fr">Europe data salary benchmark</a> was the most viewed. In the article he shares salaries extracted from job listings, using dimensions like seniority, location and companies.</li><li>📃 In every data team it is super important to write documentation, and Marie wrote an awesome <a href="https://towardsdatascience.com/data-documentation-101-why-how-for-whom-927311354a92?ref=blef.fr">101 about data documentation</a>. The article gives best practices for establishing complete and reliable data documentation. </li><li>🎰 <a href="https://locallyoptimistic.com/post/reducing-the-lottery-factor-for-data-teams/?ref=blef.fr">Reducing the lottery factor</a>, also named the bus factor, is a risk measurement about knowledge sharing. In data teams a lot of work has to be done in the early days to avoid knowledge being lost later on. The article gives ~10 pieces of advice to lower the risks. Among them I like the changelog, the pair-programming, the pre-recorded videos and the stable credentials.</li><li>🌎 <a href="https://datajourneymanifesto.org/?ref=blef.fr">The data journey manifesto</a> puts principles on the data journey to avoid the mess in production. There are 11 principles and 11 new ideas to create a healthy platform. For instance <em>you should not trust your data providers</em> and <em>what worked last week will not work today</em>.</li></ul>
<h3 id="modern-data-stack">Modern data stack</h3>
<ul><li>🔮 <a href="https://databased.pedramnavid.com/p/the-future-of-data?ref=blef.fr">The future of data</a> by Pedram. 3 takes on the future of data teams. I really like Pedram, he tweets a lot—or should we say posts on X—and gives great advice with humour. Mainly the article says that we finally address ops teams, that the semantic layer is the next big battle and that business logic management is a mess. He also recently joined the Dagster team in DevRel.</li><li>🔥 Matt gives <a href="https://mattpalmer.io/posts/hot-takes?ref=blef.fr">5 hot takes on the modern data stack</a>. I don’t totally agree with everything. This is about Redshift, Airflow, Airbyte, dbt and production.</li><li>🧱 A good summary of the required blocks composing <a href="https://technically.substack.com/p/whats-the-modern-data-stack?ref=blef.fr">the modern data stack</a>.</li></ul>
<h3 id="technical-deep-dive">Technical deep-dive</h3>
<ul><li>🏗️ Simon wrote an excellent 3-part data modeling deep-dive: an <a href="https://airbyte.com/blog/data-modeling-unsung-hero-data-engineering-introduction?ref=blef.fr">introduction to data modeling</a>, the <a href="https://airbyte.com/blog/data-modeling-unsung-hero-data-engineering-approaches-and-techniques?ref=blef.fr">different techniques</a> and the <a href="https://airbyte.com/blog/data-modeling-unsung-hero-data-engineering-architecture-pattern-tools?ref=blef.fr">tools and future</a>.</li><li>📑 Data contracts were very trendy this year. I also think they are quite useful. <a href="https://github.com/paypal/data-contract-template?ref=blef.fr">PayPal released their template for data contracts</a>. This is an exhaustive list of what you can expect in a contract: schema, quality, SLAs, security and custom properties.</li><li>👨‍🏫 Count.co designed 2 amazing boards. You can <a href="https://count.co/canvas/pB7iGb4yyi2?ref=blef.fr">learn SQL</a> or follow a guide to <a href="https://count.co/canvas/vWnN0JCglDd?ref=blef.fr">hire your data team</a>.</li><li>🐍 Finally, a few useful <a href="https://www.startdataengineering.com/post/code-patterns?ref=blef.fr">code patterns in Python</a>.</li></ul>
<hr>
<p>See you next week ❤️ and I wish you great holidays.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 23.29 ]]></title>
                    <description><![CDATA[ Data News #23.29 — Hightouch and Unstructured fundraising, data as a game, dbt and ChatGPT, OpenHouse the new warehouse. ]]></description>
                    <link><![CDATA[ /data-news-week-23-29/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 64ba770ff0d3f20001236ef1 ]]></guid>
                    <pubDate><![CDATA[ 2023-07-22 ]]></pubDate>
                    <content>
                        <![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/photo-1469854523086-cc02fe5d8800?ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&amp;auto=format&amp;fit=crop&amp;w=1000&amp;q=80" class="kg-image" alt="yellow Volkswagen van on road" loading="lazy"><figcaption><span>See you on the road (</span><a href="https://unsplash.com/photos/A5rCN8626Ck?ref=blef.fr" rel="noreferrer"><span>credits</span></a><span>)</span></figcaption></figure>
<p>Hey, I hope this newsletter finds you well. This is a small blogpost to give you a few reads while waiting for your next travel. We can already feel summer: I found fewer articles entering the selection this week.</p>
<p>Also be ready for the <em>Data News: Summer Edition</em>. For the next 5 releases it will be a bit different from usual: less curation and more original articles written in advance to allow me to take a break.</p>
<p>You'll—probably—get:</p>
<ul><li>A 2023 must-read articles list</li><li>How to create a batch data platform—using Tour de France data—from ingestion to visualisation using all the fancy tools the data world can offer (in 2 or 3 parts)</li><li>Docker for data people</li><li>The disappearance of the data engineer</li></ul>
<p></p>
<h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1>
<figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/photo-1577741314755-048d8525d31e?ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&amp;auto=format&amp;fit=crop&amp;w=1000&amp;q=80" class="kg-image" alt="red Sony PS DualShock 4" loading="lazy"><figcaption><span>Give a controller to your stakeholders (</span><a href="https://unsplash.com/photos/YsPnamiHdmI?ref=blef.fr" rel="noreferrer"><span>credits</span></a><span>)</span></figcaption></figure>
<ul><li><a href="https://roundup.getdbt.com/p/for-the-love-of-the-game?ref=blef.fr">For the love of the game</a> — Winnie from dbt Labs wrote a great post about seeing data as a game, analytics being the game design. What if we conceived data as a game for our consumers and not as a linear tool for boring actions? In the article the author also shares <a href="https://www.aranke.org/dbt-jquery/?ref=blef.fr">dbt is jQuery, not Terraform</a>, which awesomely describes how dbt helps you enter flow state for data work.</li><li><a href="https://benn.substack.com/p/how-an-acquisition-fails?ref=blef.fr">How an acquisition fails</a> — It's been a long time since I've shared Benn's articles, but as always I can't recommend him enough. This time it's about tech acquisitions and what can be done to fail—or succeed.</li><li><a href="https://www.advancinganalytics.co.uk/blog/2023/7/17/fabric-end-to-end-implementation?ref=blef.fr">Microsoft Fabric: An end to end implementation</a> — A first—blurred—glimpse of Microsoft Fabric capabilities: Jordan reads data from Sharepoint and Azure Storage, then transforms it using PySpark to visualise stuff in PowerBI. Classically boring stuff.</li><li><a href="https://www.entechlog.com/blog/data/chat-with-data-in-snowflake-using-chatgpt-dbt-and-streamlit/?ref=blef.fr">How to chat with data in Snowflake using ChatGPT, dbt, and Streamlit</a> — Less boring; obviously when you put ChatGPT and dbt in the same sentence it creates buzz instantly. This is an interesting demo of how you can quickly build a chat experience—using OpenAI—on top of your data models.</li><li><a href="https://postgresml.org/blog/llm-based-pipelines-with-postgresml-and-dbt?ref=blef.fr">LLM based pipelines with PostgresML and dbt</a> — Mainly for me this is a discovery of PostgresML, an open-source extension that brings ML functions to the database. As cloud databases like Snowflake and BigQuery brought this years ago, it was mandatory for the Postgres stack. 
The article shows that you can then run transformers or embeddings directly from dbt.</li><li><a href="https://engineering.linkedin.com/blog/2023/taking-charge-of-tables--introducing-openhouse-for-big-data-mana?ref=blef.fr">Taking charge of tables: introducing OpenHouse for big data management</a> — New data product at LinkedIn: OpenHouse. OpenHouse sits on top of the LakeHouse to bring a control plane to manage Iceberg files. It reminds me of something... We used to call it a warehouse back in the days.</li><li><a href="https://twitter.com/ClementDelangue/status/1680942084855943168?ref=blef.fr">Models on HuggingFace</a> — Clement, the CEO of HuggingFace, congratulates the community and himself because a lot of public models are hosted on HuggingFace; it shows how fast and deep things are going.</li><li><a href="https://observablehq.com/@observablehq/plot-gallery?ref=blef.fr">Plot Gallery on Observable</a> — I'm not often a fan, but Mike Bostock is different. He created d3.js while at the New York Times, where he brought something unique to digital data visualisation. More recently he co-founded Observable, which is an awesome tool to do visualisations, and the plot gallery makes me envious—while quite simplistic. </li></ul>
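<p>The "chat with your data" demo mentioned above boils down to one trick: hand the LLM your warehouse schema and let it write the SQL. A minimal sketch of the prompt-building step, with a made-up schema for illustration (a real setup would send <em>prompt</em> to the OpenAI API and execute the returned query against Snowflake):</p>

```python
# Hypothetical text-to-SQL prompt builder; schema and wording are
# illustrative, not taken from the linked article.
SCHEMA = """\
table orders(order_id int, customer_id int, amount float, ordered_at date)
table customers(customer_id int, name text, country text)"""

def build_prompt(question: str, schema: str = SCHEMA) -> str:
    # The LLM only needs the table definitions and the question;
    # the last line constrains the answer to raw SQL.
    return (
        "You are a SQL assistant for Snowflake.\n"
        f"Schema:\n{schema}\n"
        f"Question: {question}\n"
        "Answer with a single SQL query, nothing else."
    )

prompt = build_prompt("What is the total order amount per country?")
print(prompt)
```

<p>The dbt angle in the article is that the schema fed to the model comes from your documented dbt models, which keeps the bot aligned with your actual semantics.</p>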
<p></p>
<h1 id="data-economy-%F0%9F%92%B0">Data Economy 💰</h1>
<ul><li><a href="https://hightouch.com/blog/funding-announcment-customer-360-toolkit?ref=blef.fr"><strong>Hightouch</strong> raises $38m</a> in a Venture round. Hightouch has been primarily known for its reverse ETL solution. With the money the team announced a new suite of tools to activate customers in the warehouse. You can see it as a CDP—customer data platform—in your warehouse. It means you get a unified view of customers across all your tables.</li><li><a href="https://www.polaranalytics.com/?ref=blef.fr"><strong>Polar Analytics</strong></a> <a href="https://techcrunch.com/2023/07/18/polar-analytics-9m-shopify-brands-ecommerce/?ref=blef.fr">raises $9m Series A</a>. Polar Analytics is a vertical SaaS to provide analytics for Shopify vendors. This is less data engineering oriented, but I still find it interesting to see a "reporting" product raising money. Also, vertical products like this can give marketplaces ideas on what great reporting can look like.</li><li><a href="https://unstructured.io/?ref=blef.fr"><strong>Unstructured</strong></a> <a href="https://techcrunch.com/2023/07/19/unstructured-which-offers-tools-to-prep-enterprise-data-for-llms-raises-25m/?guccounter=1&ref=blef.fr">raises $25m Series A</a> to build ETL for LLMs. Unstructured wants to give you the ETL toolkit to use complex company data like HTML, PDF, CSV, PNG, PPTX, as they say on their site. Personally I did not know that CSV was a complex source of data but ok. To be honest at the moment it looks like a fancy text extractor.</li></ul>
<hr>
<p>See you next week ❤️.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 23.28 ]]></title>
                    <description><![CDATA[ Data News #23.28 — Elon Musk new company xAI, AP gives access to text archive to OpenAI, Sidekick, strikes and BigQuery costs. ]]></description>
                    <link><![CDATA[ /data-news-week-23-28/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 64aff9c023904c00010c5525 ]]></guid>
                    <pubDate><![CDATA[ 2023-07-15 ]]></pubDate>
                    <content>
                        <![CDATA[ <figure class="kg-card kg-image-card"><img src="https://images.unsplash.com/photo-1554900773-4dd76725f876?ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&amp;auto=format&amp;fit=crop&amp;w=1000&amp;q=80" class="kg-image" alt="four floors building with stairs" loading="lazy"><figcaption><span>Have fun training models on this (</span><a href="https://unsplash.com/photos/j-0olYcaihg?ref=blef.fr" rel="noreferrer"><span>credits</span></a><span>)</span></figcaption></figure>
<p>Hey, it's Saturday. I hope you're enjoying July, taking a deserved break, reading data engineering articles at the beach or traveling to unknown places. Sometimes there are Fridays when I don't find any glue between the articles for the newsletter; I have an idea of something to compensate, but it takes me the whole Friday to explore it.</p>
<p>And here we are on Saturday. Yesterday I found a way to get sensor data for half of the Tour de France peloton, and I was sure it was a good dataset to explore new tools with. It's honestly a great dataset, but it's a bit hard to download and format all the data for exploration. So it will be for later.</p>
<p>Anyway, here is a quick press roundup of a few news items and articles.</p>
<p></p>
<h1 id="gen-ai-%F0%9F%A4%96">Gen AI 🤖</h1>
<ul><li>Elon Musk announced <a href="https://x.ai/?ref=blef.fr">xAI</a>, his new company, to show that he's better than the rest. He hired alumni from all the AI companies (e.g. Deep Mind, Google, OpenAI, etc.). They held a 2-hour Twitter Space in which they detailed the vision a little. It's mainly about building an AGI capable of understanding the universe. They say we are a few weeks away from their first release. Here is a great <a href="https://twitter.com/EdKrassen/status/1679971231280365568?ref=blef.fr">summary of the space</a>.</li><li><a href="https://www.ap.org/press-releases/2023/ap-open-ai-agree-to-share-select-news-content-and-technology-in-new-collaboration?ref=blef.fr">Associated Press signs with OpenAI to share AP's text archive</a> — Interesting, as it's one of the first deals like this. It reminds me of when the press gave up on their own platforms years ago to write for Google's and Facebook's news platforms. At least this time we will know what OpenAI uses for training.</li><li><a href="https://twitter.com/tobi/status/1679114154756669441?s=46&t=SidQqxd-lfVcXGrROSrmzg&ref=blef.fr">Shopify introduces Sidekick</a> — Once again Gen AI is a Copilot. Shopify introduced a right panel in the UI to help vendors in any way. In the video we see Sidekick generating a chart to answer a sales question.</li><li><a href="https://www.bbc.com/news/entertainment-arts-66196357?ref=blef.fr">Hollywood actors taking a strike action</a> — They don't want AI and computer-generated faces and voices to replace actors.</li><li>Clibrain, a Spanish startup, launches to build LLMs for Spanish. They released <a href="https://huggingface.co/clibrain/lince-zero?ref=blef.fr">LINCE-ZERO</a>. Spanish is the second most spoken language by native speakers and the fourth most spoken by all speakers.</li></ul>
<p></p>
<h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️ </h1>
<ul><li><a href="https://engineering.mixpanel.com/how-we-cut-bigquery-costs-by-80-by-identifying-and-optimizing-costly-query-patterns-1a297b46bd33?ref=blef.fr">How we cut BigQuery costs 80% by hunting down costly queries</a> — The Mixpanel team hugely reduced their BigQuery spending. They use Fivetran, dbt and Census. To get started they first built a cost dashboard using the information_schema.jobs tables. Then they took action, mainly: <a href="https://cloud.google.com/bigquery/docs/best-practices-performance-compute?ref=blef.fr#avoid_select_">avoiding SELECT *</a>, materialising intermediate results, adding partitions and going incremental. Nothing new, but a good reminder.</li><li><a href="https://medium.com/whatnot-engineering/data-contracts-in-the-modern-data-stack-d42cb2442dbd?ref=blef.fr">Data Contracts in the Modern Data Stack</a> — Whatnot is one of the companies that embraced Data Contracts last year. This article details what they shared in their excellent Data Council talk. Mainly their implementation is a Protobuf Schema Registry and interfaces at event production and consumption.</li><li><a href="https://hex.tech/blog/dimensionality-reduction/?ref=blef.fr">Introduction to dimensionality reduction</a> — I gave up on machine learning a few years ago, so I really like every article explaining machine learning concepts visually. This article explains the dimensionality reduction that is often mandatory when datasets grow. There is a part two with live <a href="https://hex.tech/blog/dimensionality-reduction-techniques/?ref=blef.fr">Python examples</a>.</li><li><a href="https://discuss.python.org/t/a-fast-free-threading-python/27903/99?ref=blef.fr">Make Python free-threading</a> — This is how open-source is made: in a community discussion about removing the Python GIL, someone from Meta said they can dedicate 3 <em>CPython internals</em> engineers to work 2+ years on breaking the barriers. 
GIL stands for Global Interpreter Lock, a lock that allows Python to execute bytecode on only one thread at a time. Interesting to see.</li></ul>
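<p>To see why removing the GIL matters, here is a minimal sketch: two threads running a pure-Python, CPU-bound countdown take roughly as long as running the countdowns one after the other, because only one thread holds the interpreter lock at a time (the iteration count is arbitrary).</p>

```python
# Demonstrating the GIL: CPU-bound threads do not run in parallel
# in CPython today, they just take turns holding the lock.
import threading

def countdown(n: int, results: list) -> None:
    while n > 0:          # pure-Python loop, never releases the GIL for long
        n -= 1
    results.append("done")

results: list = []
threads = [threading.Thread(target=countdown, args=(2_000_000, results))
           for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)  # both threads finish, but without true parallelism
```

<p>Timing this against a sequential version shows near-identical wall-clock times, which is exactly the limitation the free-threading proposal wants to remove.</p>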
<p></p>
<figure class="kg-card kg-image-card"><img src="https://images.unsplash.com/photo-1579621970795-87facc2f976d?ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&amp;auto=format&amp;fit=crop&amp;w=1000&amp;q=80" class="kg-image" alt="green plant in clear glass vase" loading="lazy"><figcaption><span>My savings on BigQuery money (</span><a href="https://unsplash.com/photos/ZVprbBmT8QA?ref=blef.fr" rel="noreferrer"><span>credits</span></a><span>)</span></figcaption></figure>
<hr>
<p>See you next week ❤️</p>
<p></p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 23.27 ]]></title>
                    <description><![CDATA[ Data News #23.27 — My new French podcast, New vision for dbt Core semantic layer, langchain explained, carbon footprint of pizza. ]]></description>
                    <link><![CDATA[ /data-news-week-23-27/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 64a6883f4233c000019601d6 ]]></guid>
                    <pubDate><![CDATA[ 2023-07-08 ]]></pubDate>
                    <content>
                        <![CDATA[ <p></p>
<figure class="kg-card kg-image-card"><img src="https://www.blef.fr/content/images/2023/07/photo-1510005294384-c03e247f0542.jpeg" class="kg-image" alt="group of cyclists marching on highway" loading="lazy" width="1000" height="717" srcset="https://www.blef.fr/content/images/size/w600/2023/07/photo-1510005294384-c03e247f0542.jpeg 600w, https://www.blef.fr/content/images/2023/07/photo-1510005294384-c03e247f0542.jpeg 1000w" sizes="(min-width: 720px) 720px"><figcaption><span>Who's leading the data peloton? (</span><a href="https://unsplash.com/photos/IlUqSRJYp8c?ref=blef.fr" rel="noopener"><span>credits</span></a><span>)</span></figcaption></figure>
<p>Hey you, this is the Saturday Data News edition 🥲. Time flies. I'm writing the August series of articles about "creating data platforms" in advance, and I'm looking for ideas about the data I could use for it. Some kind of simulated real-time data would be best, but that requires writing a simulation, which is complicated enough. What would you use?</p>
<p></p>
<h1 id="small-french-aside-%F0%9F%87%AB%F0%9F%87%B7">Small French aside 🇫🇷</h1>
<p><em>(This section was originally a small aside in French.)</em></p>
<p>This week I launched my French-language podcast, called <strong>À l'heure des données</strong>. In this podcast, which will be monthly, I will talk with French-speaking experts who shape the ecosystem. We'll discuss the present, but also the future.</p>
<p>In the first episode I talked with <a href="https://www.linkedin.com/in/pimpaudben?miniProfileUrn=urn%3Ali%3Afs_miniProfile%3AACoAABvNCPEBftr20GrhxU-gwoNTnOWkjKfBSHc&lipi=urn%3Ali%3Apage%3Ad_flagship3_search_srp_all%3B6gSYLYxbSMqZDF4Eeq2xrA%3D%3D&ref=blef.fr">Benoit Pimpaud</a>, who was a data scientist at Olympique de Marseille and later retrained as a data engineer at Deezer. Today he runs product at Kestra, an open-source orchestrator developed in France.</p>
<p>🎧 Listen to us: <a href="https://podcasts.apple.com/fr/podcast/1-quel-est-le-futur-de-lorchestration-benoit-pimpaud-kestra/id1695911147?i=1000619384646&l=en-GB&ref=blef.fr">Apple</a> — <a href="https://open.spotify.com/episode/4ki4LvSBgjNezqdDq3Vc1J?ref=blef.fr">Spotify</a> — <a href="https://deezer.page.link/LHaF3dimKNrhPfCW8?ref=blef.fr">Deezer</a> — <a href="https://music.amazon.co.uk/podcasts/4cff4cc4-9eff-495b-b8e9-aef7f3f9f4a2/episodes/9b839558-ad02-4563-8055-f431b6a40c63/%C3%A0-l'heure-des-donn%C3%A9es-1-%E2%80%94-quel-est-le-futur-de-l'orchestration-%E2%80%94-benoit-pimpaud-kestra?ref=blef.fr">Amazon</a></p>
<p>On a completely different subject, Stéphane Bortzmeyer took part in the CNRS colloquium on <em>Penser et Créer avec les IA génératives</em> and wrote a <a href="https://www.bortzmeyer.org/ia-generatives-colloque.html?ref=blef.fr">report on those two days</a>.</p>
<p>PS: would a French version of my content interest you?</p>
<p></p>
<h1 id="the-new-dbt-semantic-layer">The new dbt Semantic Layer</h1>
<p>Following dbt Labs' acquisition of Transform a few months ago, dbt Core now integrates MetricFlow, the acquired company's semantic layer. This week, Nick Handel, co-founder of Transform, wrote about how the dbt Core specs will adapt.</p>
<p>As a reminder, <strong>a semantic layer is a set of reusable definitions on top of your models. The idea is then to use these semantics to generate SQL queries</strong>. You can read <a href="https://www.blef.fr/metrics-store/">my article on the semantic layer</a>.</p>
<p>In the new <a href="https://www.getdbt.com/blog/new-dbt-semantic-layer-spec-dna/?ref=blef.fr">vision</a> it will be possible to define multiple things:</p>
<ul><li>entities — The nodes of your business model. In a dbt model, you can define primary and foreign entities. A foreign entity defines an edge between models, hence a join in the final query.</li><li>measures — A value aggregation.</li><li>dimensions — A categorical or time field that can be used either in a group by or in a filter.</li><li>metrics — A pre-defined object that combines entities, measures and dimensions.</li></ul>
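<p>The core idea — combining a model, a measure and dimensions into a metric and compiling that into SQL — can be sketched in a few lines. This is a minimal illustration, not the actual MetricFlow implementation, and the revenue_usd definition below is a hypothetical dict loosely following the fact_transaction example:</p>

```python
# Minimal sketch (not MetricFlow itself) of how a semantic layer can
# compile a metric definition into a SQL query.

def compile_metric(metric: dict) -> str:
    """Generate a GROUP BY query from a metric definition."""
    dims = ", ".join(metric["dimensions"])
    return (
        f"SELECT {dims}, {metric['agg']}({metric['measure']}) AS {metric['name']}\n"
        f"FROM {metric['model']}\n"
        f"GROUP BY {dims}"
    )

# Hypothetical metric definition on top of a fact_transaction model.
revenue_usd = {
    "name": "revenue_usd",
    "model": "fact_transaction",
    "measure": "amount_usd",
    "agg": "SUM",
    "dimensions": ["transaction_date", "country"],
}

print(compile_metric(revenue_usd))
```

The point is that consumers query the metric by name; the join logic, aggregation and grouping live in one shared definition instead of being re-written in every downstream query.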
<figure class="kg-card kg-image-card kg-width-wide"><img src="https://www.blef.fr/content/images/2023/07/Semantic-Layer-new-vision.png" class="kg-image" alt="" loading="lazy" width="2000" height="1193" srcset="https://www.blef.fr/content/images/size/w600/2023/07/Semantic-Layer-new-vision.png 600w, https://www.blef.fr/content/images/size/w1000/2023/07/Semantic-Layer-new-vision.png 1000w, https://www.blef.fr/content/images/size/w1600/2023/07/Semantic-Layer-new-vision.png 1600w, https://www.blef.fr/content/images/size/w2400/2023/07/Semantic-Layer-new-vision.png 2400w" sizes="(min-width: 1200px) 1200px"><figcaption><span>Semantics and metrics in dbt Core explained. (credits: the example is reworked from Nick's examples)</span></figcaption></figure>
<p>Just above I gave you a precise example of how the new nomenclature behaves in a simple case with a fact_transaction model. It's important to notice that the semantic layer sits on top of your current dbt model definitions.</p>
<p>To complete the picture, note that the revenue_usd metric can currently be queried either with a <a href="https://docs.getdbt.com/docs/build/sl-getting-started?ref=blef.fr#test-and-query-your-metrics">CLI</a> or via the API that dbt Labs will release through their dbt Cloud offering.</p>
<div class="kg-card kg-button-card kg-align-center"><a href="https://docs.getdbt.com/docs/build/build-metrics-intro?ref=blef.fr" class="kg-btn kg-btn-accent">Read dbt metrics documentation</a></div>
<p>As an extension, I've seen two things this week that I feel make sense here:</p>
<ul><li><a href="https://github.com/Canner/vulcan-sql?ref=blef.fr">VulcanSQL</a> — A data API framework for DuckDB, Snowflake, BigQuery, PostgreSQL. Actually Vulcan let's you define in a blink parametrise SQL that you can expose through an API. It comes then with a catalog, a documentation and a way to connect downstream consumers tools (e.g. CSV exports, Excel, Sheets, etc.)</li><li>A Rill Data <a href="https://ui.rilldata.com/demo/rill-github-analytics/duckdb_commits?ref=blef.fr">dashboard about DuckDB commits</a> — DuckDB commits is just an example. What I want to show here is Rill Data UI, while being relatively simple offers a standardise way to explore a dataset. On the left you get the metrics, on the right the dimensions, everything can be clickable and allows you to drill down. Under the hood it's "BI-as-code", YAML defining this dashboard can be found on <a href="https://github.com/rilldata/rill-examples/tree/main/rill-github-analytics?ref=blef.fr">Github</a>.</li></ul>
<p>These two examples are not really semantic layers in the strict sense, but revolve around the concept.</p>
<div class="kg-card kg-signup-card kg-width-wide " data-lexical-signup-form="" style="background-color: #F0F0F0; display: none;">
            
            <div class="kg-signup-card-content">
                
                <div class="kg-signup-card-text ">
                    <h2 class="kg-signup-card-heading" style="color: #000000;"><span>Sign up for blef.fr</span></h2>
                    <h3 class="kg-signup-card-subheading" style="color: #000000;"><span>I put words on data engineering.</span></h3>
                    
        <form class="kg-signup-card-form" data-members-form="signup">
            
            <div class="kg-signup-card-fields">
                <input class="kg-signup-card-input" id="email" data-members-email="" type="email" required="true" placeholder="Your email">
                <button class="kg-signup-card-button kg-style-accent" style="color: #FFFFFF;" type="submit">
                    <span class="kg-signup-card-button-default">Subscribe</span>
                    <span class="kg-signup-card-button-loading"><svg xmlns="http://www.w3.org/2000/svg" height="24" width="24" viewBox="0 0 24 24">
        <g stroke-linecap="round" stroke-width="2" fill="currentColor" stroke="none" stroke-linejoin="round" class="nc-icon-wrapper">
            <g class="nc-loop-dots-4-24-icon-o">
                <circle cx="4" cy="12" r="3"></circle>
                <circle cx="12" cy="12" r="3"></circle>
                <circle cx="20" cy="12" r="3"></circle>
            </g>
            <style data-cap="butt">
                .nc-loop-dots-4-24-icon-o{--animation-duration:0.8s}
                .nc-loop-dots-4-24-icon-o *{opacity:.4;transform:scale(.75);animation:nc-loop-dots-4-anim var(--animation-duration) infinite}
                .nc-loop-dots-4-24-icon-o :nth-child(1){transform-origin:4px 12px;animation-delay:-.3s;animation-delay:calc(var(--animation-duration)/-2.666)}
                .nc-loop-dots-4-24-icon-o :nth-child(2){transform-origin:12px 12px;animation-delay:-.15s;animation-delay:calc(var(--animation-duration)/-5.333)}
                .nc-loop-dots-4-24-icon-o :nth-child(3){transform-origin:20px 12px}
                @keyframes nc-loop-dots-4-anim{0%,100%{opacity:.4;transform:scale(.75)}50%{opacity:1;transform:scale(1)}}
            </style>
        </g>
    </svg></span>
                </button>
            </div>
            <div class="kg-signup-card-success" style="color: #000000;">
                Email sent! Check your inbox to complete your signup.
            </div>
            <div class="kg-signup-card-error" style="color: #000000;" data-members-error=""></div>
        </form>
        
                    <p class="kg-signup-card-disclaimer" style="color: #000000;"><span>No spam. Unsubscribe anytime.</span></p>
                </div>
            </div>
        </div>
<p></p>
<h1 id="gen-ai-%F0%9F%A4%96">Gen AI 🤖</h1>
<ul><li><a href="https://towardsdatascience.com/deploying-falcon-7b-into-production-6dd28bb79373?ref=blef.fr">Deploying Falcon-7B into production</a> — If you want to launch your own open-source model on Kubernetes, this is a tutorial to do it.</li><li><a href="https://blog.devgenius.io/langchain-explained-and-getting-started-8f1ea40ab95d?ref=blef.fr">Langchain: explained and getting started</a> — Langchain is a toolkit that lets you <strong>chain</strong>—what a surprise—components. Actually it's some kind of pipelines, every component as inputs and outputs and Langchain do the glue. Components includes stuff like prompts, LLMs, agents or memory.</li><li><a href="https://python.langchain.com/docs/modules/data_connection/document_loaders/integrations/cube_semantic?ref=blef.fr">Langchain integrates Cube (the semantic layer)</a> — Wrapping-up with previous category, Langchain can use Cube as a data loader.</li><li><a href="https://www.indexventures.com/perspectives/the-rise-of-vertical-ai/?ref=blef.fr">The rise of Vertical AI</a> — Verticality in business always existed because it brings contextualisation. This articles described what will arrive on the market on top of Foundations and horizontal models that tries to be generic.</li><li><a href="https://www.numbersstation.ai/post/introducing-nsql-open-source-sql-copilot-foundation-models?ref=blef.fr">Introducing NSQL: Open-source SQL Copilot Foundation models</a> — This is a Foundation models that generates SQL, claiming to outperform others.</li><li><a href="https://openai.com/blog/introducing-superalignment?ref=blef.fr">Introducing Superalignment</a> — Some stuff OpenAI wrote about the future (I did not read).</li><li><a href="https://blog.salesforceairesearch.com/codegen25/?ref=blef.fr">CodeGen2.5: Small, but mighty</a> — Salesforce released a new version of the CodeGen model. I hope they did not trained it on their internal code 🫠</li></ul>
<p></p>
<p></p>
<figure class="kg-card kg-image-card"><img src="https://www.blef.fr/content/images/2023/07/photo-1564936281291-294551497d81.jpeg" class="kg-image" alt="cooked food on round white ceramic plate" loading="lazy" width="1000" height="698" srcset="https://www.blef.fr/content/images/size/w600/2023/07/photo-1564936281291-294551497d81.jpeg 600w, https://www.blef.fr/content/images/2023/07/photo-1564936281291-294551497d81.jpeg 1000w" sizes="(min-width: 720px) 720px"><figcaption><span>Now you want to think twice before eating a pizza (</span><a href="https://unsplash.com/photos/cC0_UO1Obg4?ref=blef.fr" rel="noopener"><span>credits</span></a><span>)</span></figcaption></figure>
<h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1>
<ul><li><a href="https://www.brittanybennett.com/post/career-advice-for-aspiring-progressive-data-professionals?ref=blef.fr">Career advice for aspiring progressive data professionals</a> — Brittany has been working in progressive data for years and she's giving advices for people who wants to follow her path.</li><li><a href="https://engineering.linkedin.com/blog/2023/declarative-data-pipelines-with-hoptimator?ref=blef.fr">Declarative data pipelines with Hoptimator</a> — After trying to bring self-service for data pipelines at LinkedIn, they decided to go for declarative data pipelines supporting only a specific data movements. With YAML. We were visionary when we <a href="https://docs.google.com/presentation/d/1HPVwWSZAmOSCNy1uWTx7ecTS-e9l9Ize3s29SQdfqAE/edit?ref=blef.fr#slide=id.g48298f4f5f_0_56">designed and developed</a> this at Kapten 5 years ago.</li><li><a href="https://medium.com/apache-airflow/airflow-scalable-and-cost-effective-architecture-8edb4f8aed65?ref=blef.fr">Airflow: scalable and cost-effective architecture</a> — Hussein, an Airflow committer and PMC member, proposes an ideal architecture for big Airflow projects.</li><li><a href="https://medium.com/blablacar/scaling-data-teams-5-learnings-from-blablacar-9e00949957f3?ref=blef.fr">Scaling data teams: 5 learnings</a> — BlaBlaCar data team is well known in France now and recently embraced a data mesh organisation. Manu, the VP shares 5 learnings you should as a manager be aware of.</li><li><a href="https://maxhalford.github.io/blog/carbon-footprint-pizzas/?ref=blef.fr">Measuring the carbon footprint of pizzas</a> 🍕 — Shit I've eaten a pizza yesterday. Max includes in the study 4 axes: agriculture, transformation, packaging, and transport. With this Margharita obviously is the less emitting one. 
4x less than a Calzone with meat.</li><li><a href="https://towardsdatascience.com/parquet-file-format-everything-you-need-to-know-4eed5c0019e7?ref=blef.fr">Parquet file format explained</a> — and how it compares with <a href="https://medium.com/@rahul.nanavaty/parquet-format-vs-orc-format-vs-avro-format-2af72b887903?ref=blef.fr">Avro &amp; ORC</a>.</li><li><a href="https://bitsondatadev.substack.com/p/iceberg-won-the-table-format-war?ref=blef.fr">Iceberg won the table format war</a> — Don't be click baited by the title, the article has been written by a dev rel at the company who mainly maintains Iceberg.</li><li><a href="https://engineering.razorpay.com/reducing-data-platform-cost-by-2m-d8f82285c4ae?ref=blef.fr">Reducing data platform cost by $2m</a> — How Razorpay optimised (mainly) their S3 storage (deletion, relocation) to save a lot of money.</li><li><a href="https://select.dev/posts/summit-2023?ref=blef.fr">Every major announcement at Snowflake Summit</a> — Another view than the one I shared last week by someone who actually was at the Summit.</li><li>An intro video to <a href="https://www.youtube.com/watch?v=rO3BPqUtWrI&ref=blef.fr">open lineage</a>, which is a important topic to give visibility over your data platform.</li><li><a href="https://ricardoanderegg.com/posts/makefile-python-project-tricks/?ref=blef.fr">Makefile tricks for Python projects</a> — One of the best data magical trick. We repurposed old good Makefile to create simpler CLI on top of our daily tool. This is an article giving tips to make your best Makefiles.</li><li>You can now <a href="https://www.reddit.com/r/dataengineering/comments/14midyu/now_in_snowflake_group_by_all/?ref=blef.fr">GROUP BY ALL in Snowflake</a>.</li></ul>
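<p>On that last item, GROUP BY ALL just means the engine groups by every selected expression that is not an aggregate, so you stop repeating the column list. A rough sketch of that expansion in Python (the real expansion happens inside Snowflake's parser; this only mimics the semantics):</p>

```python
# Rough sketch of GROUP BY ALL semantics: group by every selected
# expression that is not an aggregate.

def expand_group_by_all(select_list: list, aggregates: set) -> list:
    """Return the grouping keys GROUP BY ALL would infer."""
    return [expr for expr in select_list if expr not in aggregates]

keys = expand_group_by_all(["country", "city", "SUM(amount)"], {"SUM(amount)"})
print(f"GROUP BY {', '.join(keys)}")  # GROUP BY country, city
```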
<p></p>
<h1 id="data-economy-%F0%9F%92%B0">Data Economy 💰</h1>
<ul><li><a href="https://finance.yahoo.com/news/digitalocean-acquires-paperspace-expand-ai-120000933.html?ref=blef.fr">DigitalOcean acquires <strong>Paperspace</strong>.</a> Paperspace is an all-in-one SaaS product to develop, train and deploy AI applications. With a custom Notebook UI based on Jupyter you can develop your models while checking at ressources, when the models is reading you can deploy it within containers.</li><li><strong>Redpanda</strong> <a href="https://redpanda.com/press/redpanda-raises-100m-in-series-c-funding?ref=blef.fr">raises $100m in Series C</a>. Redpanda is a great product for developers. The best way to describe it is: this is a Kafka alternative. Built for modern times it removes most of the Kafka complexity by implementing all Kafka APIs.</li></ul>
<hr>
<p>See you next week ❤️</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Snowflake and Databricks summits ]]></title>
                    <description><![CDATA[ Data News #23.26 — Snowflake and Databricks summits wrap-up and a few fundraising. ]]></description>
                    <link><![CDATA[ /data-news-week-23-26/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 649f0426442df8000199cb54 ]]></guid>
                    <pubDate><![CDATA[ 2023-07-03 ]]></pubDate>
                    <content>
                        <![CDATA[ <figure class="kg-card kg-image-card"><img src="https://www.blef.fr/content/images/2023/07/photo-1542692847287-8432313be7a5.jpeg" class="kg-image" alt="mountain peak" loading="lazy" width="1000" height="503" srcset="https://www.blef.fr/content/images/size/w600/2023/07/photo-1542692847287-8432313be7a5.jpeg 600w, https://www.blef.fr/content/images/2023/07/photo-1542692847287-8432313be7a5.jpeg 1000w" sizes="(min-width: 720px) 720px"><figcaption><span>2 summits (</span><a href="https://unsplash.com/photos/IjBgUHrcuWQ?ref=blef.fr" rel="noopener"><span>credits</span></a><span> I cropped the image)</span></figcaption></figure>
<p>Hey, ever since I said I should try to send the newsletter on a fixed schedule, I haven't. Haha. Still, here is the newsletter for last week: a small wrap-up of the Snowflake and Databricks Data + AI summits, which took place last week.</p>
<p>There are so many sessions at both summits that it's impossible to watch everything; moreover, Databricks and Snowflake don't put everything online for free, so I couldn't watch it all. I'll try to recap the major announcements by reading between the lines and through social network posts.</p>
<div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">💡</div><div class="kg-callout-text"><p><span>If you want another view on both the conferences Ananth from Data Engineering Weekly wrote about the </span><a href="https://www.dataengineeringweekly.com/p/the-week-of-data-conference-extravaganza?ref=blef.fr" rel="noopener"><span>conferences extravaganza</span></a><span> and a few trends he wanted to chat about.</span></p></div></div>
<p></p>
<div class="kg-card kg-signup-card kg-width-wide " data-lexical-signup-form="" style="background-color: #F0F0F0; display: none;">
            
            <div class="kg-signup-card-content">
                
                <div class="kg-signup-card-text ">
                    <h2 class="kg-signup-card-heading" style="color: #000000;"><span>Sign up for blef.fr</span></h2>
                    <h3 class="kg-signup-card-subheading" style="color: #000000;"><span>Words on data engineering.</span></h3>
                    
        <form class="kg-signup-card-form" data-members-form="signup">
            <input data-members-label="" type="hidden" value="summits">
            <div class="kg-signup-card-fields">
                <input class="kg-signup-card-input" id="email" data-members-email="" type="email" required="true" placeholder="Your email">
                <button class="kg-signup-card-button kg-style-accent" style="color: #FFFFFF;" type="submit">
                    <span class="kg-signup-card-button-default">Join us</span>
                    <span class="kg-signup-card-button-loading"><svg xmlns="http://www.w3.org/2000/svg" height="24" width="24" viewBox="0 0 24 24">
        <g stroke-linecap="round" stroke-width="2" fill="currentColor" stroke="none" stroke-linejoin="round" class="nc-icon-wrapper">
            <g class="nc-loop-dots-4-24-icon-o">
                <circle cx="4" cy="12" r="3"></circle>
                <circle cx="12" cy="12" r="3"></circle>
                <circle cx="20" cy="12" r="3"></circle>
            </g>
            <style data-cap="butt">
                .nc-loop-dots-4-24-icon-o{--animation-duration:0.8s}
                .nc-loop-dots-4-24-icon-o *{opacity:.4;transform:scale(.75);animation:nc-loop-dots-4-anim var(--animation-duration) infinite}
                .nc-loop-dots-4-24-icon-o :nth-child(1){transform-origin:4px 12px;animation-delay:-.3s;animation-delay:calc(var(--animation-duration)/-2.666)}
                .nc-loop-dots-4-24-icon-o :nth-child(2){transform-origin:12px 12px;animation-delay:-.15s;animation-delay:calc(var(--animation-duration)/-5.333)}
                .nc-loop-dots-4-24-icon-o :nth-child(3){transform-origin:20px 12px}
                @keyframes nc-loop-dots-4-anim{0%,100%{opacity:.4;transform:scale(.75)}50%{opacity:1;transform:scale(1)}}
            </style>
        </g>
    </svg></span>
                </button>
            </div>
            <div class="kg-signup-card-success" style="color: #000000;">
                Email sent! Check your inbox to complete your signup.
            </div>
            <div class="kg-signup-card-error" style="color: #000000;" data-members-error=""></div>
        </form>
        
                    <p class="kg-signup-card-disclaimer" style="color: #000000;"><span>No spam. Unsubscribe anytime.</span></p>
                </div>
            </div>
        </div>
<h1 id="snowflake-summit-%E2%9D%84%EF%B8%8F">Snowflake Summit ❄️</h1>
<p>Snowflake's marketing tagline has always been "the Data Cloud", and with this year's announcements we can feel they have really accelerated towards this vision. Snowflake wants you to send whatever data you have to their cloud, where a lot of different features now let you act on it. They announced:</p>
<ul><li><a href="https://www.youtube.com/watch?v=OTycMK18d2M&ref=blef.fr">Document AI</a> — A new integrated product where you can ask questions in natural language on documents (PDF, etc.). With LLMs they will try to answer questions. Once you are happy with the quality of answer you'll be able to publish the model and use it in SQL queries and write pipelines on top of it to infer on new documents and send emails when needed.</li><li><a href="https://www.snowflake.com/blog/native-app-framework-available-developers-aws/?ref=blef.fr">Snowflake Native App framework</a> — Via the Snowflake marketplace vendors and developers will be able to create apps that you can run on your data. In the UI you pick the tables you want the app to run on. Here <a href="https://app.snowflake.com/marketplace?shareType=application&ref=blef.fr">the native apps marketplace</a>, there are only 25 apps and it only works on AWS at the moment.</li><li>Container Services &amp; <a href="https://techcrunch.com/2023/06/27/snowflake-nvidia-partnership-could-make-it-easier-to-build-generative-ai-applications/?ref=blef.fr">Nvidia partnership</a> — Snowflake is slowly becoming a one-stop shop, with container services you will be able to run your own apps in a Kubernetes cluster managed by Snowflake. For instance tomorrow you'll be able to launch Airflow (via <a href="https://www.astronomer.io/blog/astronomer-and-snowflake-unleash-the-power-of-snowpark-container-services-and-apache-airflow/?ref=blef.fr">Astronomer</a>) within Snowflake. On the same topic Nvidia partnership will bring GPUs to Snowflake offering for users in need of large compute for AI training. Thanks to this data do not move out of Snowflake, or if we say the truth, out of your underlying cloud.</li><li><a href="https://docs.snowflake.com/en/user-guide/dynamic-tables-about?ref=blef.fr">Dynamic Tables</a> — Dynamic tables are streaming tables. 
With Snowflake you can send real time data coming from Kafka, for instance, with dynamic tables you can create a table on top of the real time data that refreshes in real time, using only what's needed to compute the new state. Dynamic tables has been announced last year, but looks finally in preview. In the demo there is also how the SQL UI integrates <a href="https://youtu.be/fZ5mCmVZAQ0?t=277&ref=blef.fr">LLMs generating SQL from a comment</a>.</li></ul>
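<p>The incremental-refresh idea behind dynamic tables — fold only the new rows since the last refresh into the existing state, instead of recomputing the whole table — can be shown with a toy example. This is an illustration of the concept, not Snowflake's implementation:</p>

```python
# Toy illustration (not Snowflake's implementation) of incremental
# refresh: each refresh folds only the delta since the last refresh
# into the aggregate state, instead of recomputing from scratch.

def refresh(state: dict, new_rows: list) -> dict:
    """Fold a micro-batch of (key, amount) rows into the aggregate state."""
    for key, amount in new_rows:
        state[key] = state.get(key, 0.0) + amount
    return state

totals: dict = {}
refresh(totals, [("eu", 10.0), ("us", 5.0)])  # initial load
refresh(totals, [("eu", 2.0)])                # later refresh: delta only
print(totals)  # {'eu': 12.0, 'us': 5.0}
```

The second refresh touches one row, not the full history — that is what makes the "real-time table" cheap to keep up to date.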
<p></p>
<p>PS: s/o to David who also <a href="https://davidsj.substack.com/p/all-change?r=125hnz&utm_medium=ios&utm_campaign=post&ref=blef.fr">covered Snowflake changes</a>.</p>
<p></p>
<h1 id="data-ai-summit-%F0%9F%97%BB">Data + AI Summit 🗻</h1>
<p>The theme of the Databricks summit is <em>Generation AI</em>; a well-chosen title given the current state of data. I watched the 3 keynotes to find announcements, but they looked less structured than Snowflake's. Still, here are a few takeaways:</p>
<ul><li>Microsoft and Databricks are still best friends, even after <a href="https://www.microsoft.com/fr-fr/microsoft-fabric?ref=blef.fr">Fabric</a>. In a quick Skype call, Satya Nadella, Microsoft's CEO, said that discussing responsible AI while developing it is a good thing. We should explore 3 parallel tracks at the same time: misinformation, real-world harms (incl. bias), and AI takeoff.</li><li>The CEO of Databricks was on stage and used words that I like; he said:</li><li><ul><li>data <em>should be democratised to every employee</em></li><li><em>AI should be democratised in every product</em></li></ul></li></ul>
<figure class="kg-card kg-image-card"><img src="https://www.blef.fr/content/images/2023/07/Screenshot-2023-07-03-at-11.20.32.png" class="kg-image" alt="" loading="lazy" width="1946" height="1018" srcset="https://www.blef.fr/content/images/size/w600/2023/07/Screenshot-2023-07-03-at-11.20.32.png 600w, https://www.blef.fr/content/images/size/w1000/2023/07/Screenshot-2023-07-03-at-11.20.32.png 1000w, https://www.blef.fr/content/images/size/w1600/2023/07/Screenshot-2023-07-03-at-11.20.32.png 1600w, https://www.blef.fr/content/images/2023/07/Screenshot-2023-07-03-at-11.20.32.png 1946w" sizes="(min-width: 720px) 720px"><figcaption><span>Databricks vision about LLMs (in Wed. Keynote 2023 Data + AI Summit)</span></figcaption></figure>
<ul><li><a href="https://www.databricks.com/blog/introducing-lakehouseiq-ai-powered-engine-uniquely-understands-your-business?ref=blef.fr">LakehouseIQ</a> — Matei Zaharia presented it on stage. LakehouseIQ is a way to use your Enterprise signals (org charts, lineage, docs, queries, catalog, etc.) to contextualise LLMs used in UI assistants. In the demo LakehouseIQ is asked to "get revenue for Europe" but understand that Europe is not the exact name of the region for this company but EMEA. Here a <a href="https://youtu.be/h4z4vBoxQ6s?t=3151&ref=blef.fr">demo of LakehouseIQ</a>. In the demo we also sees that you can generate SQL from a comment in the UI.<br><br>This is their way to democratise data to every employee.</li><li><a href="https://www.mosaicml.com/blog/mosaicml-databricks-generative-ai-for-all?ref=blef.fr">Databricks acquires MosaicML</a> for $1.3b— It should land in data economy category but you know. I've shared MosaicML <a href="https://www.blef.fr/data-news-week-23-25/">last week</a> because they are the ones behind the first open-source LLMs, the MPT models, on Apache License. This is a great move from Databricks to set themselves in the AI ecosystem for real. As a side note Naveen Rao, Mosaic CEO, said that to train MPT-30B from scratch you need around 12 days and less than $1m.</li><li><a href="https://youtu.be/h4z4vBoxQ6s?t=6560&ref=blef.fr">LakehouseAI</a> — Research shown that 25% of the queries get their costs misestimated by the query optimisers and the error can be 10<sup>6</sup>. Databricks built a new way to do I/O with AI, they promise that you don't have to do any kind of indexes and the engine can "triangulate" where the data is to be faster than before. Mainly you have to see LakehouseAI like an AI DBA that does magical stuff to your engine by learning on all your queries telemetry.</li><li>They also announced a lot of stuff around <a href="https://www.youtube.com/watch?v=yj7XlTB1Jvc&ref=blef.fr">Spark</a>.</li></ul>
<p>As you can see Lakehouse is becoming more than ever a marketing brand around Databricks. In the end what we want is a place to store data and an engine to query data. That's all.</p>
<p></p>
<h1 id="data-economy-%F0%9F%92%B0">Data Economy 💰</h1>
<ul><li><a href="https://www.thoughtspot.com/press-releases/thoughtspot-acquires-mode-analytics-for-200m?ref=blef.fr">ThoughtSpot acquires Mode analytics for $200m</a> — This is consolidation at work. ThoughtSpot is a company who tries to bring AI in the analytics domain. With TS you can define insights and access to it, with Mode they gain a end-user application that people are already using. Also you might know Mode through <a href="https://benn.substack.com/p/to-my-parents?ref=blef.fr">Benn Stancil blog</a>.</li><li><a href="https://www.globenewswire.com/en/news-release/2023/06/29/2696702/0/en/Hopsworks-reports-record-growth-and-raises-6-5M.html?ref=blef.fr">Hopsworks raises $6.5m</a> — Hopsworks is a feature store.</li><li><a href="https://www.forbes.com/sites/alexkonrad/2023/06/29/inflection-ai-raises-1-billion-for-chatbot-pi/?sh=66cd5acd1d7e&ref=blef.fr">Inflection AI raises $1.3b</a> from Bill Gates, Eric Schmidt, Microsoft and Nvidia. They developed a <a href="https://inflection.ai/?ref=blef.fr">personal AI called Pi</a> who's designed to be supportive, smart and here for you at anytime. Let's see where it goes.</li></ul>
<hr>
<p>See you soon ❤️.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 23.25 ]]></title>
                    <description><![CDATA[ Data News #23.25 — Yes I was late. A bit of Gen AI and the usual Fast News + Acryl Data fundraising. ]]></description>
                    <link><![CDATA[ /data-news-week-23-25/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 6495450b3b554a00015d9db7 ]]></guid>
                    <pubDate><![CDATA[ 2023-06-24 ]]></pubDate>
                    <content>
                        <![CDATA[ <p></p>
<figure class="kg-card kg-image-card"><img src="https://images.unsplash.com/photo-1490750967868-88aa4486c946?ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&amp;auto=format&amp;fit=crop&amp;w=1000&amp;q=80" class="kg-image" alt="orange petaled flowers" loading="lazy"><figcaption><span>(</span><a href="https://unsplash.com/photos/koy6FlCCy5s?ref=blef.fr" rel="noopener"><span>credits</span></a><span>)</span></figcaption></figure>
<p>Hey, this is the Data News. It's super hard to change habits, but it is what it is: the newsletter is going out on Saturday. I hope this edition finds you well. Summer is coming ☀️.</p>
<p>Thank you all, because we crossed the 3,000-subscriber mark last week. Let's go for 4,000 before the end of the year 🤗.</p>
<p>This is an almost-raw edition this week.</p>
<p></p>
<h1 id="gen-ai-%F0%9F%A4%96">Gen AI 🤖</h1>
<ul><li><a href="https://huggingface.co/spaces/mosaicml/mpt-30b-chat?ref=blef.fr">MPT-30B-Chat</a> — This is a chat interface hosted on HuggingFace on top of the MPT-30B model. The <a href="https://www.mosaicml.com/blog/mpt-30b?ref=blef.fr">MPT models</a> are interesting because they are under the Apache license, which means true open source, unlike others.</li><li>Continuing on the license topic, you can watch this great video about <a href="https://www.youtube.com/watch?v=rOd9UteupGA&list=PLq-odUc2x7i-q7sHxBbIVFtMOwChWmIKF&index=29&ref=blef.fr">laptop-sized ML for text, with Open Source</a>, where Nick Burch explores what you can do today on a laptop and gives a great introduction to the Gen AI field.</li><li><a href="https://engineering.linkedin.com/blog/2023/new-approaches-for-detecting-ai-generated-profile-photos?ref=blef.fr">New approaches for detecting AI-Generated profile photos</a> — This is the era we're going to live in. We'll be writing models moderating generative models. Am I the only one who thinks this is a waste of energy?</li><li><a href="https://davidgerard.co.uk/blockchain/2023/06/03/crypto-collapse-get-in-loser-were-pivoting-to-ai/?ref=blef.fr">Crypto collapse? Get in loser, we’re pivoting to AI</a> — It's a rant that begins with the fact that many opportunists are getting into AI now that VCs have left crypto. ChatGPT "is a stupendously scaled-up autocomplete", which leads to questions about intelligence in AI. I really like the conclusion: "The <em>real</em> threat of AI is the bozos promoting AI doom who want to use it as an excuse to ignore real-world problems — like the risk of climate change to humanity (...) The VCs’ actual use case for AI is treating workers badly".</li></ul>
<figure class="kg-card kg-image-card"><img src="https://images.unsplash.com/photo-1492562080023-ab3db95bfbce?ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&amp;auto=format&amp;fit=crop&amp;w=1000&amp;q=80" class="kg-image" alt="smiling man standing near green trees" loading="lazy"><figcaption><span>Too perfect to be a real picture (</span><a href="https://unsplash.com/photos/VVEwJJRRHgk?ref=blef.fr" rel="noopener"><span>credits</span></a><span>)</span></figcaption></figure>
<p></p>
<h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1>
<ul><li><a href="https://motherduck.com/blog/announcing-motherduck-duckdb-in-the-cloud/?ref=blef.fr">MotherDuck announcing DuckDB in the cloud</a> — First, context. DuckDB is an in-memory analytics database, so it's single-server. DuckDB was open-sourced by DuckDB Labs. Then comes MotherDuck, a commercial company, with a <a href="https://duckdblabs.com/news/2022/11/15/motherduck-partnership.html?ref=blef.fr">partnership</a> with DuckDB Labs aiming to build a modern serverless cloud analytics platform based on DuckDB. That's for the context.<br><br>So this week MotherDuck finally announced their cloud offering. It's invite-only for the moment —&nbsp;and I did not get my invite yet. In a nutshell the announcement is: you can connect to a remote DuckDB by using <code>md:</code> in the connection string and you can join local and remote data (also seen on <a href="https://twitter.com/criccomini/status/1672024134648475651?ref=blef.fr">Twitter</a>).</li><li>Iceberg in the clouds — Last week BigQuery announced <a href="https://cloud.google.com/bigquery/docs/release-notes?ref=blef.fr">Iceberg support</a> in GA. At the same time James from Snowflake wrote a blog post helping you <a href="https://medium.com/snowflake/apache-iceberg-or-snowflake-table-format-299eb9fb7b0c?ref=blef.fr">choose between the Snowflake or Iceberg</a> table format.
Mainly he says: pick Iceberg if you know what you're doing.</li><li><a href="https://www.youtube.com/watch?v=jCXpFagJsbo&list=PLq-odUc2x7i-q7sHxBbIVFtMOwChWmIKF&index=20&ref=blef.fr">An introductory video about Iceberg</a> — If you want a great Iceberg introduction, go watch Fokko's talk from Berlin Buzzwords.</li><li><a href="https://leo-godin.medium.com/understanding-dbt-runtime-environment-1fd28592bbd?ref=blef.fr">Understanding dbt runtime environment</a> — Leo takes the time to explain what the dbt CLI messages are telling you.</li><li><a href="https://blog.devgenius.io/replacing-apache-hive-elasticsearch-and-postgresql-with-apache-doris-de3840cdc792?ref=blef.fr">Replacing Apache Hive, Elasticsearch and PostgreSQL with Apache Doris</a> — This is technology bingo. You can replace 3 technologies with only one! This post details the choices behind a migration to Apache Doris. Doris is a real-time analytical database.</li><li><a href="https://medium.com/@timwebster85/beyond-data-pipelines-how-data-engineers-drive-data-culture-and-empower-users-953abc5418ac?ref=blef.fr">How data engineers drive data culture and empower users</a> — This article reminds all data engineers that you're part of the team that brings data culture to a company, so you need to play your part.</li><li><a href="https://www.startdataengineering.com/post/valuable-de-guide/?ref=blef.fr">How to become a valuable data engineer</a> — A post that aggregates great resources and advice on becoming a data engineer. I'll also mention that I have a similar one on the blog: <a href="https://www.blef.fr/learn-data-engineering/">how to learn data engineering</a>.</li><li><a href="https://www.carbonfact.com/blog/platform/missing-weight-data?ref=blef.fr">Dealing with missing weight data</a> — Carbonfact tries to measure the environmental footprint of clothing. This is not an easy task and requires working with missing data.
</li><li><a href="https://www.thoughtspot.com/data-trends/data-modeling/conceptual-vs-logical-vs-physical-data-models?ref=blef.fr">Conceptual vs logical vs physical data models</a> — The author presents 3 ways to model data, with different layers of understanding, and concludes that you should model your data across all 3 layers: conceptual, logical and physical.</li></ul>
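<p>The missing-data problem Carbonfact describes can be illustrated with a minimal sketch: fill a garment's missing weight with the median weight of its category. The category names, weights and the <code>impute_weights</code> helper below are hypothetical, purely for illustration, and not Carbonfact's actual model.</p>

```python
from statistics import median

def impute_weights(items):
    """Fill missing garment weights with the median weight of their category.

    `items` is a list of dicts with "category" and "weight" (grams, or None).
    A minimal illustration of median imputation; not Carbonfact's method.
    """
    # Collect the known weights per category.
    known = {}
    for item in items:
        if item["weight"] is not None:
            known.setdefault(item["category"], []).append(item["weight"])
    # Replace each missing weight with its category's median.
    filled = []
    for item in items:
        weight = item["weight"]
        if weight is None:
            weight = median(known[item["category"]])
        filled.append({**item, "weight": weight})
    return filled

items = [
    {"category": "t-shirt", "weight": 150},
    {"category": "t-shirt", "weight": 170},
    {"category": "t-shirt", "weight": None},  # imputed below
]
print(impute_weights(items)[2]["weight"])  # prints 160.0
```

<p>A real pipeline would of course use richer features than the category alone, but the shape of the problem is the same: you need a defensible default for every hole in the data.</p>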
<p></p>
<h1 id="data-economy-%F0%9F%92%B0">Data Economy 💰</h1>
<ul><li><strong>Acryl Data</strong> <a href="https://www.acryldata.io/blog/a-control-plane-for-data-and-a-new-era-for-acryl?ref=blef.fr">raises $21m Series A</a>. Acryl Data is the company behind DataHub, the data catalog that has been open-sourced out of LinkedIn.</li></ul>
<hr>
<p>See you next week ❤️</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 23.24 ]]></title>
                    <description><![CDATA[ Data News #23.24 — AI Act, testing in dbt, data journey manifesto, SO survey, CDC with Clickhouse and fundraising. ]]></description>
                    <link><![CDATA[ /data-news-week-23-24/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 6489af956e92b0000173ea81 ]]></guid>
                    <pubDate><![CDATA[ 2023-06-16 ]]></pubDate>
                    <content>
                        <![CDATA[ <figure class="kg-card kg-image-card"><img src="https://images.unsplash.com/photo-1523349122880-44486ffa7b14?ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&amp;auto=format&amp;fit=crop&amp;w=1000&amp;q=80" alt="close up photography of round green fruit" loading="lazy"><figcaption><span> The newsletter, a metaphor (</span><a href="https://unsplash.com/photos/O70hwncRDC8?ref=blef.fr" rel="noopener"><span>credits</span></a><span>)</span></figcaption></figure>
<p>Hello, after the good weather comes the storm. I'm now under the Berlin rain at 20°C. When I write in these conditions I feel like a tortured author writing a depressing novel, while actually today I'll speak about the AI Act, Python, SQL and data platforms. A casual day at the office, after all.
</p>
<p>Some personal news: next Monday and Tuesday I'll be at Berlin Buzzwords. If you're around, ping me, it would be a pleasure to meet and hang out together.</p>
<p>There are still seats for the June Airflow Paris <a href="https://www.meetup.com/fr-FR/paris-apache-airflow-meetup/events/293888353/?ref=blef.fr">Meetup</a> (in French).</p>
<figure class="kg-card kg-image-card"><a href="https://www.meetup.com/fr-FR/paris-apache-airflow-meetup/events/293888353/?ref=blef.fr"><img src="https://www.blef.fr/content/images/2023/06/Meetup--4-26-.png" alt="" loading="lazy" width="2000" height="433" srcset="https://www.blef.fr/content/images/size/w600/2023/06/Meetup--4-26-.png 600w, https://www.blef.fr/content/images/size/w1000/2023/06/Meetup--4-26-.png 1000w, https://www.blef.fr/content/images/size/w1600/2023/06/Meetup--4-26-.png 1600w, https://www.blef.fr/content/images/size/w2400/2023/06/Meetup--4-26-.png 2400w" sizes="(min-width: 720px) 720px"></a></figure>
<p></p>
<h1 id="ai-%F0%9F%A4%96">AI 🤖</h1>
<ul><li><a href="https://www.nytimes.com/2023/06/14/technology/europe-ai-regulation.html?ref=blef.fr">The AI Act 🇪🇺 has been voted through</a> the European Parliament. Also called GDPR 2.0, the AI Act is meant to regulate the usage of AI in tomorrow's world. It has been widely criticised by <a href="https://techcrunch.com/2023/06/13/google-delays-eu-launch-of-its-ai-chatbot-after-privacy-regulator-raises-concerns/?ref=blef.fr">lobbyists, companies and developers</a>. I'm not informed enough so I'll wait before giving my opinion on it.</li><li><a href="https://ai.facebook.com/blog/yann-lecun-ai-model-i-jepa/?ref=blef.fr">I-JEPA: The first AI model based on Yann LeCun’s vision for more human-like AI</a> — Meta is in a frenzy to release new models. Yann's vision goes toward AI systems learning and reasoning like animals and humans.</li><li><a href="https://medium.com/pinterest-engineering/deep-multi-task-learning-and-real-time-personalization-for-closeup-recommendations-1030edfe445f?ref=blef.fr">Deep multi-task learning and real-time personalisation for closeup recommendations</a> — Pinterest is still doing deep learning.</li><li>Last week I shared nice QR Codes generated with ControlNet; this week someone released a model on HuggingFace to do it, <a href="https://huggingface.co/DionTimmer/controlnet_qrcode?ref=blef.fr">QR Code Conditioned ControlNet</a> (not related to the original Chinese work), and you can even use the <a href="https://huggingface.co/spaces/huggingface-projects/QR-code-AI-art-generator?ref=blef.fr">generator web UI</a>.</li><li><a href="https://arxiv.org/abs/2306.03714?ref=blef.fr">DashQL – Complete analysis workflows with SQL</a> — A crazy paper about a new language that mixes SQL with analyses and graphs.
It looks sexy but my brain can't read a 9-page PDF without overheating.</li><li><a href="https://medium.com/walmartglobaltech/model-and-data-versioning-an-introduction-to-mlflow-and-dvc-260347cd0f6e?ref=blef.fr">Model and Data Versioning: An Introduction to mlflow and DVC</a> — If you want to understand model versioning this is for you.</li></ul>
<p></p>
<h1 id="data-and-analytics-engineering-%F0%9F%A7%91%E2%80%8D%F0%9F%94%A7">Data and Analytics Engineering 🧑‍🔧</h1>
<ul><li><a href="https://medium.com/datamindedbe/testing-frameworks-in-dbt-3fa8933a5807?ref=blef.fr">Testing frameworks in dbt</a> — Robbert developed a small framework to do tests in dbt. Mainly he unit-tests macros (the logic) with his framework and tests data with Soda and dbt contracts.</li><li><a href="https://datajourneymanifesto.org/?ref=blef.fr">The data journey manifesto</a> — <a href="https://datakitchen.io/why-the-data-journey-manifesto/?ref=blef.fr">DataKitchen</a> wrote a manifesto to put principles on the data journey to avoid a mess in production. There are 11 principles and 11 new ideas to create a healthy platform. For instance <em>you should not trust your data providers</em> and <em>what worked last week will not work today</em>.</li><li><a href="https://www.data-drift.io/blog/why-data-consumers-do-not-trust-your-reporting-and-you-might-not-even-know-it?ref=blef.fr">Why data consumers do not trust your reporting</a> — It is a good illustration of the data journey manifesto. <strong>Stakeholders often notice data issues before the data team does</strong>. This destroys any confidence they may have in the numbers. Data warehouses are mutable; this is one of the many root causes proposed by Lucas: the past often changes, whether because of code or data. This is metrics drift.</li><li><a href="https://towardsdatascience.com/data-documentation-101-why-how-for-whom-927311354a92?ref=blef.fr">Data Documentation 101: Why? How? For Whom?</a> — Marie wrote best practices for establishing complete and reliable data documentation. The first piece of advice is about the documentation readers: the data team, business users or other stakeholders.</li><li><a href="https://clickhouse.com/blog/clickhouse-postgresql-change-data-capture-cdc-part-1?utm_source=twitter&utm_medium=social&utm_campaign=blog">Change Data Capture (CDC) with PostgreSQL and ClickHouse</a> — This is a nice vendor post about CDC with Kafka as the movement layer (using Debezium).
The post explains well the architecture you need to make it work.</li><li><a href="https://betterprogramming.pub/a-deep-dive-into-graph-analytics-part-1-with-memgraph-5e3134609d86?ref=blef.fr">A deep dive into graph analytics</a> — Petrica tries out and showcases Memgraph in a long-form post. I'm fond of graph visualisations and analytics—as well as maps.</li><li><a href="https://engineering.atspotify.com/2023/06/experimenting-at-scale-the-spotify-home-way/?ref=blef.fr">Experimenting at Scale, the Spotify Home way</a> — Simple principles to run a good ol' experiment at Spotify scale.</li><li><a href="https://count.co/canvas/pB7iGb4yyi2?ref=blef.fr">The ultimate SQL guide</a> — After the last canvas on data interviews, here's a canvas to learn SQL. From an introduction to databases to SQL writing, it covers simple SELECTs and advanced concepts. This is neat.</li><li><a href="https://parakeet.solutions/the-power-of-pre-commit-and-sql-fluff/?ref=blef.fr">The power of pre-commit and SQLFluff</a> — SQL is a query language used to retrieve information from data storage, and like any other programming language, you need to enforce checks at all times. This is where you should use pre-commit and SQLFluff.</li><li><a href="https://medium.com/airbnb-engineering/metis-building-airbnbs-next-generation-data-management-platform-d2c5219edf19?ref=blef.fr">Metis: building Airbnb’s next generation data management platform</a> — The new manifesto for every data governance company /S.</li></ul>
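<p>To make the CDC idea concrete, here is a tiny sketch of what a downstream consumer does with a change log: replay each event into a copy of the table. The event shape is hypothetical, loosely modeled on Debezium's "c"/"u"/"d" operations; the real pipeline in the post reads these from Kafka into ClickHouse, not into a Python dict.</p>

```python
def apply_change_events(events):
    """Replay Debezium-style change events into an in-memory table copy.

    Each event carries an op ("c" create, "u" update, "d" delete), the row
    key, and the row state after the change. Hypothetical event shape, for
    illustration; a real consumer would read these from a Kafka topic.
    """
    table = {}
    for event in events:
        if event["op"] in ("c", "u"):
            table[event["key"]] = event["after"]   # upsert the new row state
        elif event["op"] == "d":
            table.pop(event["key"], None)          # delete: drop the row
    return table

events = [
    {"op": "c", "key": 1, "after": {"id": 1, "name": "Ada"}},
    {"op": "u", "key": 1, "after": {"id": 1, "name": "Ada L."}},
    {"op": "c", "key": 2, "after": {"id": 2, "name": "Grace"}},
    {"op": "d", "key": 2, "after": None},
]
print(apply_change_events(events))  # {1: {'id': 1, 'name': 'Ada L.'}}
```

<p>The whole difficulty of production CDC is everything around this loop: ordering guarantees, snapshots, and schema changes, which is exactly what the post's architecture addresses.</p>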
<p><em>PS: I just split the Fast News to have a smaller one. Fast News contains lighter news and broad articles.</em></p>
<figure class="kg-card kg-image-card"><img src="https://images.unsplash.com/photo-1675266873434-5ba73c38ce6f?ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&amp;auto=format&amp;fit=crop&amp;w=1000&amp;q=80" alt="a man with glasses is looking at a laptop" loading="lazy"><figcaption><span>When the stakeholder notices issues before you (</span><a href="https://unsplash.com/photos/hHg9MC-G8_Y?ref=blef.fr" rel="noopener"><span>credits</span></a><span>)</span></figcaption></figure>
<p></p>
<h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1>
<ul><li><a href="https://survey.stackoverflow.co/2023/?ref=blef.fr#work-coding-outside-of-work">Stack Overflow developer survey 2023</a> — Every year SO sends a survey to developers and it gives a great overview of technology usage across the space. This year ~90k people answered; they also added a small AI category to measure the impact on dev work.<br><br>What we see related to data engineering is mainly: <strong>Python and SQL are still shining at the top of technology popularity</strong>—around 50% use them. Thanks to the AI hype Python is the second most desired technology behind Javascript, which augurs well for the future. They also share salary figures, and data engineering / science roles are well situated in the ecosystem: the best-paid jobs in Germany after management positions, but paid less in the US.</li><li><a href="https://vadimdemedes.com/posts/generating-income-from-open-source?ref=blef.fr">Generating income from open source</a> — Vadim shares how he makes money from all the different open-source projects he has, what works and what does not. In the post he also shares the journey of Sidekiq's founder, who's making $10m ARR alone.</li><li><a href="https://twitter.com/lloydtabb/status/1669049723549020160?ref=blef.fr">You can put spaces in BigQuery column names</a> — <em>The editors of blef.fr (me) have no comment</em>. In fact, yes, you are all crazy.</li><li><a href="https://lloydtabb.substack.com/p/malloys-near-term-roadmap?ref=blef.fr">Malloy's Near Term Roadmap</a> — I recently shared the <a href="https://www.blef.fr/data-council-austin-takeaways/">Malloy demo</a>, which was awesome.
The article shares the recent features and also says something I will never forget: "<em>Malloy aims to be syntactically the same no matter what database contains the data</em>".</li><li><a href="https://www.astronomer.io/blog/cloud-ide-new-cell-types?ref=blef.fr">The Astro Cloud IDE</a> — Astronomer released a bunch of Airflow operators in their Cloud IDE (which was released in Dec. but I missed it). I get why companies want us to move into their Cloud IDEs, but I hate this trend. Leave me alone in my PyCharm.</li><li>Cube announcements: <a href="https://cube.dev/blog/introducing-data-graph?ref=blef.fr">Data Graph</a> and <a href="https://cube.dev/blog/introducing-orchestration-api?ref=blef.fr">Orchestration API</a> — These are 2 announcements from Cube. I really like following them because they are thought leaders in the semantic layer space. Data Graph creates an entity diagram from the semantic definitions, while the Orchestration API offers you an endpoint to launch pre-aggregation jobs from your scheduler.</li></ul>
<figure class="kg-card kg-image-card"><img src="https://images.unsplash.com/photo-1605882171181-e31b036e4ceb?ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&amp;auto=format&amp;fit=crop&amp;w=1000&amp;q=80" alt="gray concrete building under white clouds during daytime" loading="lazy"><figcaption><span>We don't need spaces (</span><a href="https://unsplash.com/photos/dsQiZoO1Q4Q?ref=blef.fr" rel="noopener"><span>credits</span></a><span>)</span></figcaption></figure>
<p></p>
<h1 id="data-economy-%F0%9F%A4%96">Data Economy 🤖</h1>
<ul><li><a href="https://www.graphext.com/?ref=blef.fr"><strong>Graphext</strong></a> <a href="https://www.graphext.com/post/graphext-raised-4M-seed-round?ref=blef.fr">raises $4.6m in a seed round</a> (their second) to continue developing a data analysis platform built for exploration. The Spanish startup develops a tool where you quickly explore datasets and then build charts or AI models on top of them. Last year they built a <a href="https://public.graphext.com/f3d05874591c2c0d/index.html?section=graph&colorMap=graphext_cluster&areaMap=null&ref=blef.fr">graph</a> with Data News links, where we clearly see the different content categories I share.</li><li><a href="https://www.telm.ai/?ref=blef.fr"><strong>Telmai</strong></a> <a href="https://www.telm.ai/blog/open-architecture-ai-driven-data-observability-startup-telmai-raises-oversubscribed-seed-funding-of-5-5-million?ref=blef.fr">raises a $5.5m seed round</a>. A new data observability platform enters the space; it looks like they propose the same features as the competition: add your data sources, get automated alerts on data drifts.</li><li>At the same time <a href="https://mastheadata.com/?ref=blef.fr"><strong>Masthead</strong></a><strong> <a href="https://finance.yahoo.com/news/masthead-data-raises-1-3m-130000610.html?ref=blef.fr">raises $1.3m</a></strong>, also as a data observability platform, but done differently. Masthead does not run SQL on your data—which generates a cost uplift—but reads logs and metadata to identify anomalies.</li><li><a href="https://techcrunch.com/2023/06/14/informatica-acquires-privitar-once-valued-at-400m-to-expand-its-data-management-stack/?ref=blef.fr">Informatica acquires Privitar</a>. This consolidation will bring new features to Informatica. As a reminder, Informatica was founded in 1993 and is one of the dinosaurs in the ETL space. Privitar will bring "data security" stuff.</li></ul>
<hr>
<p>See you next week ❤️</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 23.23 ]]></title>
                    <description><![CDATA[ Data News #23.23 — dbt, data contracts, modeling, why AI will save the world, generate QR Codes with AI and more. ]]></description>
                    <link><![CDATA[ /data-news-week-23-23/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 6482ddba56340300016db7eb ]]></guid>
                    <pubDate><![CDATA[ 2023-06-09 ]]></pubDate>
                    <content>
                        <![CDATA[ <figure class="kg-card kg-image-card"><img src="https://images.unsplash.com/photo-1544280124-2f0a80ccee73?ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&amp;auto=format&amp;fit=crop&amp;w=1000&amp;q=80" alt="man holding his eyeglasses" loading="lazy"><figcaption><span>Rethinking the newsletter (</span><a href="https://unsplash.com/photos/DpdTfB8lQTc?ref=blef.fr" rel="noopener"><span>credits</span></a><span>)</span></figcaption></figure>
<p>Here's a new edition of the Data News newsletter. Since my <a href="https://www.blef.fr/data-news-week-23-20/">2-year anniversary</a> post, I've been struggling to find the right writing rhythm. I've been sick and I've been stuck on a client project. Writing the newsletter was not an easy exercise, even though I keep telling myself "it's not a question of motivation, it's a question of discipline" like a LinkedIn guy. I do things because I enjoy the process of doing things, not for the results.</p>
<p>That's why I'll try to change a bit the way things are done for the next 3 months. As of today I do the newsletter every Friday: I search and read articles first and then I write. Starting next week I'll do it on Thursday, to schedule the sending at the same hour every Friday, at 2PM.</p>
<p>This way, I'll dedicate my Fridays to writing original articles, exploring ideas and preparing a stock of articles for the summer holidays. I plan to take a 1-month break during August, but at the same time I have the FOMO—fear of missing out. So I need to schedule articles in advance. I can tease you that I'll create content about "Create a data platform in 2023", with live examples.</p>
<p>In September I will do a retro and decide if this is the right way to continue or not.</p>
<hr>
<p>In terms of content, I've recorded a new podcast episode (in French) that will be out next week. The French version will be a bit different from <a href="https://podcasters.spotify.com/pod/show/blef?ref=blef.fr">Minds of data</a>. It'll be more round tables and discussions about the present and the future of our ecosystem.</p>
<p>We also scheduled the next <a href="https://www.meetup.com/fr-FR/paris-apache-airflow-meetup/events/293888353/?ref=blef.fr">Paris Airflow Meetup</a> in Mirakl offices. Pierre, an Airflow committer and PMC member, will present his Airflow journey. Join us!</p>
<p></p>
<h1 id="data-contracts-dbt-and-modeling">Data contracts, dbt and modeling</h1>
<p>Back to the roots: it's been a long time since I shared dedicated stuff about dbt. This week a natural cluster of articles emerged. A few people have already implemented things with the <a href="https://docs.getdbt.com/docs/collaborate/govern/model-contracts?ref=blef.fr">new model governance</a> dbt introduced last month in v1.5.</p>
<p>Julian shared a nice way to use dbt <a href="https://blog.datadrivers.de/how-we-use-dbt-s-model-governance-features-in-large-projects-ca524e366650?ref=blef.fr">model governance when you have 1000+ models</a>. In a nutshell, you can add new characteristics to models that give more context to dbt: models can have a group, access, a contract and versions. In the article Julian draws a great comparison with software development, where managing models is like managing programmatic APIs with public or private visibility. Finally he also proposes 6 logical data layers to sort your models: source, base, cleanse, core, business and marts.</p>
<p>This structure also gives the team more visibility, because you can draw clear boundaries like: <em>data engineers are responsible for the first 3 layers, analytics engineers for the others</em>.</p>
<p>To go deeper into data contract concepts applied to the warehouse and dbt, you can <a href="https://medium.com/@mikldd/activating-ownership-with-data-contracts-in-dbt-4f2de41c4657?ref=blef.fr">activate ownership with dbt data contracts</a>. Mikkel also showcases his tool synq.io, which runs tests and alerts on top of dbt.</p>
<p>In addition there are 2 awesome articles about related topics:</p>
<ul><li><a href="https://tobikodata.com/simplicity-or-efficiency-how-dbt-makes-you-choose.html?ref=blef.fr">Simplicity or efficiency: how dbt makes you choose</a> — This is a side-by-side comparison of dbt and SQLMesh, a growing alternative to dbt. The comparison is done using a project with 50 models, on 3 aspects: making a change, deploying in dev and deploying in prod. In the end the article is obviously biased towards SQLMesh (it's on the company blog), but it reveals real issues with dbt.</li><li><a href="https://carlineng.com/?postid=data-modeling-divide&ref=blef.fr#blog">The data modeling divide</a> — A discussion about different modeling techniques (OBT, star schema, activity schema, etc.) and the divide within the community and tooling companies over a consensus.</li></ul>
<p></p>
<h1 id="gen-ai-%F0%9F%A4%96">Gen AI 🤖</h1>
<ul><li><a href="https://a16z.com/2023/06/06/ai-will-save-the-world/?ref=blef.fr">Why AI will save the world</a> — Marc Andreessen writes about the prevailing panic and 5 risks associated with AI, asserting that AI will probably do the world more good than harm. Still, it has Cold War vibes inside 🙃.</li></ul>
<blockquote><em>The single greatest risk of AI is that China wins global AI dominance and we – the United States and the West – do not.<br><br>I propose a simple strategy for what to do about this – in fact, the same strategy President Ronald Reagan used to win the first Cold War with the Soviet Union.</em></blockquote>
<ul><li><a href="https://towardsdatascience.com/the-golden-age-of-open-source-in-ai-is-coming-to-an-end-7fd35a52b786?ref=blef.fr">The golden age of open source in AI is coming to an end</a> — An article about changes in open-source code licenses creating less permissive models.</li><li><a href="https://www.wsj.com/articles/rush-to-use-generative-ai-pushes-companies-to-get-data-in-order-c34a7e13?st=c5brvz1f3uh1n9w&ref=blef.fr">Rush to use Generative AI pushes companies to get data in order</a> — Garbage in, garbage out. An article from the Wall Street Journal: obviously, if you want to fine-tune generative models you will have to be sure to have correct training datasets.</li><li><a href="https://mp.weixin.qq.com/s/i4WR5ULH1ZZYl8Watf3EPw?ref=blef.fr">Use ControlNet to generate QR Codes</a> — A Chinese engineer used ControlNet to generate visually appealing and hidden QR Codes. The result is quite impressive and works most of the time.</li></ul>
<figure class="kg-card kg-image-card"><img src="https://mmbiz.qpic.cn/mmbiz_png/xSnEeickjxibJqYibicHBeyMEaskfIOA517AKHQBeJgRaLibN43YiapJH09Rw4Tj1F09yibg9gRTswTFWTG4IuADX55KQ/640?wx_fmt=png&amp;wxfrom=5&amp;wx_lazy=1&amp;wx_co=1" alt="Image" loading="lazy"><figcaption><span>A ControlNet generated QR Code, the link sends to a website to personalise QR codes developed by the author</span></figcaption></figure>
<p></p>
<h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1>
<ul><li><a href="https://towardsdatascience.com/which-team-should-own-data-quality-44f1d6996eb8?ref=blef.fr">Which team should own data quality?</a> — Whether it's data engineering, analytics engineering or more specialised functions supervised by a central governance, this is a good question to ask.</li><li><a href="https://www.castordoc.com/blog/the-next-chapter-for-castordoc?ref=blef.fr">The next chapter for CastorDoc</a> — CastorDoc, previously Castor, is a data catalog. They recently did a rebrand and Tristan shared the new associated vision. They unveiled 5 pillars to achieve the new vision, in which AI-powered insights is the second one.</li><li><a href="https://maxhalford.github.io/blog/graph-components-duckdb/?ref=blef.fr">Graph components with DuckDB</a> — Max always amazes me with his experiments. This time he writes a graph algorithm in SQL to identify connections.</li><li><a href="https://eng.lyft.com/gotchas-of-streaming-pipelines-profiling-performance-improvements-301439f46412?ref=blef.fr">Gotchas of streaming pipelines: profiling &amp; performance</a> — Feedback on how the Lyft team increased performance on their streaming pipelines.</li><li><a href="https://www.figma.com/blog/how-figma-scaled-to-multiple-databases/?ref=blef.fr">The growing pains of database architecture</a> — The Figma team shared learnings about scaling Postgres instances.</li><li><a href="https://dagster.io/blog/backfills-in-ml?ref=blef.fr">Backfills in data &amp; machine learning</a> — Backfilling is when you write or overwrite historical data. Backfilling is one of the most complicated tasks in data engineering because it often requires designing way ahead of problems. Dagster wrote a small guide about considerations you might have when doing backfills.</li><li><a href="https://blog.getdaft.io/p/introducing-daft-a-high-performance?ref=blef.fr">Daft: a high-performance distributed dataframe library</a> — Recently Polars took all the attention regarding dataframe manipulation.
But this new library called Daft could also be a game changer. Daft is written in Rust, uses Arrow, can be distributed and can use complex types.</li><li><a href="https://www.ssp.sh/brain/select-insights-bundling-with-microsoft-fabric-and-orchestration/?ref=blef.fr">SELECT Insights</a> — A fresh new newsletter by Simon Späti. He shared a long list of links and cleverly structured the newsletter like a SQL query.</li></ul>
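<p>Max's graph-components post does this in SQL; as a rough Python analogue, connected components can be found with a small union-find over an edge list. This is a generic illustration of the idea, under the assumption of a simple (a, b) edge representation, not a translation of Max's actual queries.</p>

```python
def connected_components(edges):
    """Group nodes into connected components using union-find.

    `edges` is an iterable of (a, b) pairs; returns a list of components
    as sorted lists. A Python analogue of the SQL approach, for illustration.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:            # walk to the root, halving the path
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)        # union the two components

    groups = {}
    for node in parent:
        groups.setdefault(find(node), []).append(node)
    return [sorted(group) for group in groups.values()]

print(connected_components([(1, 2), (2, 3), (4, 5)]))  # [[1, 2, 3], [4, 5]]
```

<p>The fun part of the SQL version is that the warehouse has no mutable `parent` array, so the same fixed point has to be reached with iterative self-joins instead.</p>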
<p></p>
<h1 id="data-economy-%F0%9F%92%B0">Data Economy 💰</h1>
<ul><li><a href="https://cohere.com/?ref=blef.fr"><strong>Cohere</strong></a> <a href="https://txt.cohere.com/announcement/?ref=blef.fr">announces $270M Series C</a>. Cohere is an OpenAI alternative; they propose an API and Python, Go or Node SDKs to add "language" to your traditional app.</li></ul>
<hr>
<p>See you next week ❤️.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 23.22 ]]></title>
                    <description><![CDATA[ Data News #23.22 — Japan views on copyright for AI, a new AI camera, what&#39;s the hype behind DuckDB?. ]]></description>
                    <link><![CDATA[ /data-news-week-23-22/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 6479bb3758799d0001efda72 ]]></guid>
                    <pubDate><![CDATA[ 2023-06-03 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/06/image-1.png" class="kg-image" alt loading="lazy" width="2000" height="1333" srcset="https://www.blef.fr/content/images/size/w600/2023/06/image-1.png 600w, https://www.blef.fr/content/images/size/w1000/2023/06/image-1.png 1000w, https://www.blef.fr/content/images/size/w1600/2023/06/image-1.png 1600w, https://www.blef.fr/content/images/size/w2400/2023/06/image-1.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>Sun is coming in Berlin (<a href="https://unsplash.com/photos/nphovVuT9OE?ref=blef.fr">credits</a>)</figcaption></figure><p>Hey, I've been sick longer than I expected, but I'm finally well. I hope this email finds you all well, as well. I've had to catch up on almost 3 weeks of content. When I step back, the amount of articles shared each week is insane; there are countless articles about things that have already been written. Sometimes I feel like I'm trying to find a needle in a haystack. Or several needles.</p><p>I wanted to write more about Microsoft Fabric and the states of data that were <a href="https://www.blef.fr/data-news-week-23-21/">published last week</a> but I'll do it another time.</p><p></p><h1 id="gen-ai-%F0%9F%A4%96">Gen AI 🤖</h1><p>As always the pace of innovation in this field is incredibly fast, so here are a few news items I found worth it:</p><ul><li><a href="https://technomancers.ai/japan-goes-all-in-copyright-doesnt-apply-to-ai-training/?ref=blef.fr">Japan goes all in: copyright doesn’t apply to AI training</a> — I'm far from being a law expert but it looks like something that will set a precedent. The article says this fits with Japan's new strategy to become a leader in AI technologies: by removing barriers on training data they hope to open doors.
Obviously artists (especially mangakas) were not happy about it.</li><li><a href="https://www.politico.eu/article/open-ai-chatgpt-sam-altman-kicks-off-eu-charm-offensive-artifical-intelligence/?ref=blef.fr">Sam Altman, OpenAI's CEO, did a Europe tour</a> — Sam went to Europe recently (Spain, France, Poland, Germany and the UK) in order to meet country representatives. I guess he did some lobbying around the AI Act, but he was also scouting real estate because OpenAI wants a European office.</li><li><a href="https://www.theverge.com/2023/5/29/23741011/this-is-what-a-144tb-nvidia-gpu-looks-like?ref=blef.fr">New Nvidia 144TB GPU</a> — Nvidia is the clear winner of the AI race. They announced an insane new GPU and Google, Meta and Microsoft are already customers. Surprising.</li><li><a href="https://doordash.engineering/2023/05/31/how-doordash-uses-xcodegen-to-eliminate-project-merge-conflicts/?ref=blef.fr">How DoorDash uses XcodeGen to eliminate project merge conflicts</a> — Ok now I don't want to resolve a Git conflict anymore 😅.</li><li>US researchers developed an LLM-powered Minecraft agent: <a href="https://voyager.minedojo.org/?ref=blef.fr">Voyager</a>. Minecraft is a survival game and the agent has been designed to learn life skills in Minecraft incrementally. In the end it generates code that is used to send the agent into the cubic world.</li><li><a href="https://bjoernkarmann.dk/project/paragraphica?ref=blef.fr">A new kind of camera</a> — An artist developed an AI camera, the Paragraphica, a context-to-image camera. 
The camera is using location data to feed context to a generative algorithm.</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/06/image.png" class="kg-image" alt loading="lazy" width="2000" height="767" srcset="https://www.blef.fr/content/images/size/w600/2023/06/image.png 600w, https://www.blef.fr/content/images/size/w1000/2023/06/image.png 1000w, https://www.blef.fr/content/images/size/w1600/2023/06/image.png 1600w, https://www.blef.fr/content/images/size/w2400/2023/06/image.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>A dynamic prompt — (Paragraphica camera)</figcaption></figure><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li><a href="https://meltano.com/blog/introducing-meltano-cloud-you-build-the-pipelines-we-manage-the-infrastructure/?ref=blef.fr">Meltano announced their Cloud</a> — Meltano is an open-source data integration project that has been started at Gitlab. With a few configuration and a CLI you can write data pipelines using hundreds of connectors (using Singer spec). The pricing is based on the number of runs and not the volume of data. This is a major difference with the competition (Airbyte, Fivetran, Stitch).</li><li><a href="https://rides.jurajmajerik.com/map?ref=blef.fr">A ridesharing app simulation</a> — Juraj developed over the last months a complete simulation of a ridesharing app (like Uber), he shared everything he did in blog posts and the results is kinda amazing. 
I recently spent hours on <a href="https://dinopoloclub.com/games/mini-motorways/?ref=blef.fr">Mini Motorways</a> so this is the kind of side project I like.</li><li><a href="https://moderndataengineering.substack.com/p/breaking-into-data-engineering-as?ref=blef.fr">Breaking into data engineering as a self-taught developer</a> — Some advice from a fellow data engineer who was a data analyst before.</li><li><a href="https://mattpalmer.io/posts/whats-the-hype-duckdb/?ref=blef.fr">What's the hype behind DuckDB?</a> — This is a great post from Matt Palmer about DuckDB. If you want a quick intro to the tool this is the way to start. In the article Matt also showcases how you could use DuckDB to write a transfer pipeline, like moving a Parquet file from a disk to S3.</li><li><a href="https://tech.instacart.com/how-instacart-ads-modularized-data-pipelines-with-lakehouse-architecture-and-spark-e9863e28488d?ref=blef.fr">How Instacart Ads modularized data pipelines with Spark</a> — A great deep dive on a Lakehouse architecture for streaming. The article describes a migration from "thousands of complex SQL lines" to composable Spark SQL.</li><li><a href="https://zendesk.engineering/dbt-at-zendesk-part-i-setting-foundations-for-scalability-34b55e6a6aa1?ref=blef.fr">dbt at Zendesk; setting foundations for scalability</a>.</li></ul><p></p><h1 id="data-economy-%F0%9F%92%B0">Data Economy 💰</h1><ul><li><a href="https://www.databricks.com/blog/welcoming-bit-io-databricks-investing-developer-experience?ref=blef.fr">Databricks acquires bit.io</a> — bit.io was "the fastest way to get a Postgres database". In order to start you just had to send data and your database was already set up. Looking at the press release, Databricks' acquisition is a team acquisition to improve their own developer experience.</li></ul><hr><p>Now I'm going back to Diablo — See you next week ❤️.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 23.21 ]]></title>
                    <description><![CDATA[ Data News #23.21 — Raw news, Gen AI, Microsoft Fabric, states of data, dbt Labs layoffs and fundraising. ]]></description>
                    <link><![CDATA[ /data-news-week-23-21/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 64705a686abd7b00016c0e78 ]]></guid>
                    <pubDate><![CDATA[ 2023-05-29 ]]></pubDate>
                    <content>
                        <![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/05/image-6.png" class="kg-image" alt="" loading="lazy" width="2000" height="1305" srcset="https://www.blef.fr/content/images/size/w600/2023/05/image-6.png 600w, https://www.blef.fr/content/images/size/w1000/2023/05/image-6.png 1000w, https://www.blef.fr/content/images/size/w1600/2023/05/image-6.png 1600w, https://www.blef.fr/content/images/size/w2400/2023/05/image-6.png 2400w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Me (</span><a href="https://unsplash.com/photos/BuNWp1bL0nc?ref=blef.fr"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p>Hey, I've been sick for the last 3 days and it was impossible to write anything. As I still want to send something, here is a raw edition with no comments. See you on Friday.</p><h1 id="gen-ai-%F0%9F%A4%96">Gen AI 🤖</h1><ul><li><a href="https://github.com/artidoro/qlora?ref=blef.fr">QLoRA: Efficient Finetuning of Quantized LLMs</a> — a 65B parameter model on a single 48GB GPU reaching 99.3% of the performance level of ChatGPT on Vicuna.</li><li><a href="https://www.engine.study/blog/modding-age-of-empires-ii-with-a-sprite-diffuser/?ref=blef.fr">Modding Age of Empires II with a Sprite-Diffuser</a>.</li><li><a href="https://www.linkedin.com/feed/update/urn:li:activity:7067532623547432962/?updateEntityUrn=urn%3Ali%3Afs_feedUpdate%3A%28V2%2Curn%3Ali%3Aactivity%3A7067532623547432962%29&ref=blef.fr">Github Copilot Chat announcement</a>.</li></ul><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li><a href="https://www.vantage.sh/blog/clickhouse-local-vs-duckdb?ref=blef.fr">clickhouse-local vs DuckDB</a> — DuckDB is not the only one to work great locally. 
In this test clickhouse works better.</li><li><a href="https://databased.pedramnavid.com/p/the-future-of-data?ref=blef.fr">The Future of Data </a>— Everyone wants a piece of the pie; no one wants to bake.</li><li><a href="https://airbyte.com/blog/data-modeling-unsung-hero-data-engineering-architecture-pattern-tools?ref=blef.fr">Data Modeling, architecture Pattern, tools and the future</a> — part 3 of Simon's guide.</li><li><a href="https://www.microsoft.com/en-us/microsoft-fabric?ref=blef.fr">Microsoft Fabric</a> — everyone was talking about it on LinkedIn. This is the Lakehouse integration for Analytics into Azure. Here are <a href="https://datamonkeysite.com/2023/05/27/first-impression-of-microsoft-fabric/?ref=blef.fr">first impressions</a>, how it <a href="https://powerbi.microsoft.com/en-us/blog/introducing-microsoft-fabric-and-copilot-in-microsoft-power-bi/?ref=blef.fr">includes with Power BI</a> and a <a href="https://www.linkedin.com/pulse/answering-early-questions-fabrics-place-your-stack-luke-fangman%3FtrackingId=hOm50xIqSgiQHyTQfjVrZw%253D%253D/?trackingId=hOm50xIqSgiQHyTQfjVrZw%3D%3D&ref=blef.fr">few remarks</a>. 
Honestly this looks like a disguised Databricks.</li><li>States of data season — <a href="https://state-of-data.com/?ref=blef.fr">Airbyte's state of data</a>, <a href="https://www.databricks.com/sites/default/files/2023-05/databricks-2023-state-of-data-report.pdf?ref=blef.fr">Databricks's</a>, <a href="https://lakefs.io/blog/the-state-of-data-engineering-2023?ref=blef.fr">lakeFS's</a>.</li><li><a href="https://newsletter.engineering.land/p/engineering-levels-a-simple-framework?ref=blef.fr">Engineering Levels: a simple framework for startups</a>.</li><li><a href="https://towardsdatascience.com/writing-design-docs-for-data-pipelines-d49550f95580?ref=blef.fr">Writing design docs for data pipelines</a>.</li><li><a href="https://datamonkeysite.com/2023/05/22/databend-and-the-rise-of-data-warehouse-as-a-code/?ref=blef.fr">Databend and the rise of Data warehouse as a code</a>.</li></ul><p></p><h1 id="data-economy-%F0%9F%92%B0">Data Economy 💰</h1><ul><li><a href="https://www.getdbt.com/blog/dbt-labs-update-a-message-from-ceo-tristan-handy/?ref=blef.fr">dbt Labs reduced 15% of their staff</a>. Tristan announced it on the blog and the company provided transition perks. It was a sad announcement.</li><li><a href="https://www.snowflake.com/blog/snowflake-acquires-neeva-to-accelerate-search-in-the-data-cloud-through-generative-ai/?ref=blef.fr">Snowflake acquired Neeva</a> — A generative AI search company that was in difficulty got acquired by Snowflake.</li><li><a href="https://www.politico.eu/article/eu-hits-meta-with-record-e1-2b-privacy-fine/?ref=blef.fr">EU hits Meta with record €1.2B privacy fine</a> — under GDPR.</li><li><a href="https://dagster.io/blog/elementl-series-b?ref=blef.fr">Elementl (Dagster) Raised $33m</a> — to continue building the data orchestrator.</li></ul><hr><p>See you soon. ❤️</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — 2 years anniversary ]]></title>
                    <description><![CDATA[ A personal letter to share my freelance / content creation journey publicly. To say thank you. ]]></description>
                    <link><![CDATA[ /data-news-week-23-20/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 6465dc827bf870000134c168 ]]></guid>
                    <pubDate><![CDATA[ 2023-05-19 ]]></pubDate>
                    <content>
                        <![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/05/hbd.gif" class="kg-image" alt loading="lazy" width="690" height="388" srcset="https://www.blef.fr/content/images/size/w600/2023/05/hbd.gif 600w, https://www.blef.fr/content/images/2023/05/hbd.gif 690w"><figcaption>TWO YEARS —&nbsp;HAPPY BIRTHDAY</figcaption></figure><p>👋 Here is a special edition for me. Exactly 2 years ago, I sent out my <a href="https://www.blef.fr/data-news-2021-20/">first</a> email newsletter. At the time, only 3 people received it. I already told the story in <a href="https://www.blef.fr/blef-datagen-podcast/">Robin's podcast</a>, here is a written version. In 2021, I was doing Twitch lives twice a week, every Wednesday I was doing a data news round-up. One day, I decided to save the links on a blog created for the occasion, a few days later, 3 people subscribed. This is what made me decide to send emails containing my round-up. By chance.</p><p>So I want to thank Max, Théodore and Emiel, it is largely thanks to you that this newsletter exists. If you had not joined so early, I would never have realized that people would like to read my content. These bookmarks that I was saving mostly for myself.</p><p>Today, 104 editions later, I want to take a look back at my content creation journey, but also at my freelance journey that started one year earlier, in 2020.</p><div class="kg-card kg-callout-card kg-callout-card-red"><div class="kg-callout-emoji">😱</div><div class="kg-callout-text">If you only want to read Data News you can read my selection of talks from the <a href="https://www.blef.fr/data-council-austin-takeaways/">Data Council</a>.</div></div><p></p><h1 id="the-beginning">The beginning</h1><p>Before becoming a freelancer, I was working at Kapten, a French PHV company—an Uber competitor—where I was leading the data engineering team. 
We were a team of 6 people and our goal was to build the data platform for the company. During my time at Kapten, we built a data stack with Airflow, BigQuery and Metabase + Tableau. I was coming from the Hadoop world and BigQuery was a breath of fresh air. The component I'm most proud of is the ELT framework we built on top of Airflow to give total autonomy to analysts and scientists on the data loading and transformation processes.</p><p>In a nutshell it was an ETL-as-configuration on top of Airflow. You were able to define <a href="https://docs.google.com/presentation/d/1HPVwWSZAmOSCNy1uWTx7ecTS-e9l9Ize3s29SQdfqAE/edit?ref=blef.fr#slide=id.g48298f4f5f_0_56">configs</a> in Python to do full or incremental loading from different sources, processing in SQL or Python and exports. The framework and the processes were pretty strict, but it worked and gave analysts full autonomy to build whatever they wanted. All the ownership was given back to others; we were just writing software and maintaining a platform.</p><p>I think it took almost a year to build the entire platform. We had set a goal: no broken Airflow pipelines in a 30-day sliding window. We achieved that. And we hit a plateau. We were doing less data engineering because everything was working well, less firefighting, looking for a new vision. As human beings, we wanted to fill the void, so we explored different things: real-time feature store, data lineage or data contracts—we call it that today, but back in the day it was only schema management. But what was the next step for us?</p><p>I had done what I was hired to do: build a data platform for analytics and analysts. It was time for me to leave, and at the same time the context changed: we got acquired and laid off. That's where my freelance journey started.</p><p></p><h1 id="going-into-freelance">Going into freelance</h1><p>I left when COVID was at its peak and a few people did not understand the move. 
To be honest I didn't even know where I was going but I was confident in my skillset and in my ability to sell my data engineering expertise. In retrospect I was just naive.</p><p>The Kapten experience brought me expertise on Airflow and GCP, a good knowledge about Kubernetes and a lead experience. In addition to my solid engineering and infra skills it creates a good resume.</p><p>By chance 2 of my former bosses heard about my freelancing and proposed me work. It led respectively to a 3-months and a 1-year mission with <a href="https://www.equancy.fr/fr/?ref=blef.fr">Equancy</a> and <a href="https://qonto.com/en?ref=blef.fr">Qonto</a>. Then I did a mission with <a href="https://yousign.com/?ref=blef.fr">Yousign</a> with Faouz that I met a few years earlier thanks to a LinkedIn message. The common point of the 3 missions was to build stuff around Airflow. In a blink my first company fiscal year was already done, with around €180k in revenue.</p><p>While I was at Qonto, we migrated to dbt, which was rapidly being adopted by French startups. This allowed me to add a new tool to my belt. Then it became a new expertise.</p><p>In my second fiscal year (2022), I had the privilege of working with the French tax authority to help them define the vision for the 2027 data platform and with the Ministry of Education to implement Superset and dashboards on that platform. In the blink of an eye, my second year was already over with less revenue (160k€) but in less time.</p><p>Along the way I also helped startups hiring—<a href="https://www.folk.app/?ref=blef.fr">Folk</a>, <a href="https://en.modjo.ai/?ref=blef.fr">Modjo</a>, <a href="https://www.kard.eu/?ref=blef.fr">Kard</a>—and did mentoring—<a href="https://blent.ai/?ref=blef.fr">Blent.ai</a>, <a href="https://libeo.io/en?ref=blef.fr">Libeo</a>, <a href="https://ibanfirst.com/?ref=blef.fr">iBanFirst</a>, <a href="https://nibble.ai/?ref=blef.fr">nibble</a>. 
I even hired 2 awesome interns who helped me on the blog for a few months. As 2023 is still running I'll keep it for another retrospective.</p><p>While my story is exciting, here are a few things to learn from it:</p><ul><li>Former co-workers are part of your network and are probably the ones who vouch for you the most.</li><li>In terms of networking, participate in events, give to the community and you will receive something at some point. Don't be afraid to solicit people on LinkedIn, people respond more often than you'd think.</li><li>Find the main reason why you want to freelance. It can be many things like money, freedom, issues with authority, digital nomadism, etc.</li><li>If it's money I think you gonna miss the freedom part of being freelance. If you want to do a lot of ca$h you will work every day, in a long-term mission for a big company. Which is actually like a permanent position without the perks of it (at least in countries where we have a social system).</li><li>Set your daily rate and (try to) stick to it. I started at €800, then went up to €1000 and now I'm at €1200. Don't forget that you are competing with agencies, often charging high prices.</li><li>One of my strict conditions is to work only part-time. In fact, I work an average of 2.5 days a week. To be successful, you have to be organized and be aware of <a href="https://en.wikipedia.org/wiki/Context_switch?ref=blef.fr">context switching</a>. To be honest, this is very difficult and I am bad at it.</li><li>In my opinion, to freelance in data engineering, you need at least two or three proven experiences in data engineering. Very often, as a freelancer, you are perceived as someone who knows things. To be assertive, you'll need to be confident in your recommendations.</li><li>Identify your strengths and communicate clearly about it. 
Here's how I say it: <em>I'm a data engineer who has built a lot of data platforms for analytics, with expertise in Airflow, dbt, Superset and infrastructure</em>.</li></ul><p></p><h1 id="juggling-with-content-creation">Juggling with content creation</h1><figure class="kg-card kg-image-card"><img src="https://www.blef.fr/content/images/2023/05/Untitled-Project-1-1-.gif" class="kg-image" alt loading="lazy" width="690" height="388" srcset="https://www.blef.fr/content/images/size/w600/2023/05/Untitled-Project-1-1-.gif 600w, https://www.blef.fr/content/images/2023/05/Untitled-Project-1-1-.gif 690w"></figure><p>Doing freelance data engineering is a great thing for me; I've been working with computers since I was young. It's always better when passion meets your work. Alongside this, I also started creating content in January 2021. This was one of my goals when I decided to go part-time, so I could have time for content creation.</p><p>I did not set clear business objectives for my content creation. After all, I went to engineering school, not business school. That's probably why I often lose focus and do multiple things. Here is a small selection of what I tried:</p><ul><li>Twitch — I did 4 months of Twitch at the beginning, but my 2-month holidays with no internet broke my routine. I don't think I'll go back to solo lives.</li><li>I made YouTube videos — I have 7 videos, and each video took me about 20 hours; hard to fit into my daily routine but it will come back.</li><li>Twitter — even if I went from 200 followers to 400 followers on Twitter, I can't find my voice there. This is sad because Twitter is the social network I consume the most.</li><li>LinkedIn — I tried multiple things on LinkedIn but I don't have the discipline to publish one post a day. In the end I went from 2000 followers to 6000+ in 2 years.</li><li>Podcasts — the new thing I've recently started. 
Once again I lose focus, but the podcast format is so satisfying to do.</li></ul><p>And finally, the newsletter, which is my safe place. I've found discipline in writing my own content with my own tone. It takes me about a day of work per week. Basically, I spend 2 hours selecting content, 1 hour reading the content, 2 hours writing, and 1 hour post-processing. In the end, I'm proud of the quality of the newsletter, but one day is a lot and after 3 years, I have to wonder which direction to go in.</p><p><strong>Actually, I don't care, I'll continue like this</strong>. But why do I do content:</p><ul><li>I like to share / transmit to others; when I was a kid I wanted to be a maths teacher.</li><li>It creates visibility for me and as a freelancer I need to be visible.</li><li>It helps me shape my ideas.</li><li>I love the adrenaline rush I get when I do things publicly. Even though there are serious downsides to it, like <a href="https://fr.wikipedia.org/wiki/Syndrome_FOMO?ref=blef.fr">FOMO</a> or addiction, I love it.</li><li>I hope that in the long run it will generate enough money for me to do less consulting. Blog subscriptions bring me 300 € / month. Which is less than 2% of my revenue 🫠.</li></ul><p></p><h1 id="conclusion">Conclusion</h1><p>This is a post that is more personal than what I usually do. This time I did not make promises like I did in the past. Promises I didn't keep because I'm lazy. At least I learn from my mistakes.</p><p>Whether you are a customer, a friend or a subscriber, thank you very much for your support over the past 3 years. Let's continue for another 3 years? ❤️</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data Council 2023 ]]></title>
                    <description><![CDATA[ A selection of 10 talks I really enjoyed among the Data Council forward thinking presentations. ]]></description>
                    <link><![CDATA[ /data-council-austin-takeaways/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 645e7208b647d00001c4e0ad ]]></guid>
                    <pubDate><![CDATA[ 2023-05-18 ]]></pubDate>
                    <content>
                        <![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/05/image-5.png" class="kg-image" alt loading="lazy" width="2000" height="1502" srcset="https://www.blef.fr/content/images/size/w600/2023/05/image-5.png 600w, https://www.blef.fr/content/images/size/w1000/2023/05/image-5.png 1000w, https://www.blef.fr/content/images/size/w1600/2023/05/image-5.png 1600w, https://www.blef.fr/content/images/size/w2400/2023/05/image-5.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>(<a href="https://unsplash.com/photos/p7av1ZhKGBQ?ref=blef.fr">credits</a>)</figcaption></figure><p>Data Council Austin is a yearly conference that features a great panel of speakers giving talks about the future of the data field. As I often do, I've looked over the 70 presentations and here's a medley of what I liked.</p><div class="kg-card kg-button-card kg-align-center"><a href="https://www.youtube.com/watch?v=yNQWjCGHV88&list=PLAesBe-zAQmF-GpvZ3ba5YpVzoVbgzl8M&ref=blef.fr" class="kg-btn kg-btn-accent">Data Council 2023 YouTube playlist</a></div><p></p><h1 id="my-personal-selection">My personal selection</h1><p>If you had only 3 videos to watch, it should be these 3:</p><ul><li><a href="https://www.youtube.com/watch?v=zmmJgwc3oPI&ref=blef.fr">Malloy an experimental language</a> — This is my favourite talk. Lloyd, founder of Looker, puts 30 years of data warehousing into perspective in 30 minutes, especially the fact that we see "data in rectangles." Since joining Google, he's been working on Malloy, a new way to query data. Malloy compiles to SQL and works on data semantics. The presentation gives another look at the semantic layer. During the demo, Lloyd does some data analysis in the browser and it's just mind-blowing 🤯. 
<br><br>At the same time someone at Google also did a <a href="https://www.youtube.com/watch?v=oo1uwJ3qHwE&ref=blef.fr">Calcite</a> presentation.</li><li><a href="https://www.youtube.com/watch?v=qT-Atu9mfvM&ref=blef.fr">Data contracts, Accountable data quality</a> — Data contracts are a trendy concept that covers a lot of things. Chad Sanderson did the best recap of it. DE is often constant firefighting, with a lot of (spaghetti) SQL to maintain. A lot of breaking changes come from upstream producers (form or content).<br><br>At scale everything breaks without data quality; the modern data stack is good because it's self-service and easy to implement, but it lacks everything needed to be mature in the future: ownership, data quality, context. It creates a non-consensual API: we pull data but never agreed on a contract (SLA, schema, etc.).<br><br>The root cause is mainly miscommunication between producers and consumers. Data contracts aim to fix this with API-based agreements between producers and consumers that capture the schema, semantics, distributions and enforcement policies of the data. <br> <br>You can also watch Whatnot's data contracts <a href="https://www.youtube.com/watch?v=h1IU8Q6KD2g&ref=blef.fr">implementation</a>.</li><li><a href="https://www.youtube.com/watch?v=Dbr8jmtfZ7Q&ref=blef.fr">Metric trees</a> — It reminds me of the KPI frameworks people were building when I started working at a consultancy firm. This is a nice way to represent your company's business. Still today 90% of the value a data team delivers is in the analytics. The analytics goal is to model the business correctly. You should answer 4 questions: what happened, why did it happen, what's going to happen, what should we do next.<br><br>Organisations are systems with inputs, outputs and a formula. Formulas have metrics, relationships and weights. In the end you can depict all your KPIs with formulas.<br><br>The data team strategy should mainly be to define and operationalise the company growth model. 
Using a metric tree as a logical representation of a growth model. You have 3 types of outputs: customer value, financial and strategic.</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/05/Screenshot-2023-05-12-at-15.23.06.png" class="kg-image" alt loading="lazy" width="1858" height="796" srcset="https://www.blef.fr/content/images/size/w600/2023/05/Screenshot-2023-05-12-at-15.23.06.png 600w, https://www.blef.fr/content/images/size/w1000/2023/05/Screenshot-2023-05-12-at-15.23.06.png 1000w, https://www.blef.fr/content/images/size/w1600/2023/05/Screenshot-2023-05-12-at-15.23.06.png 1600w, https://www.blef.fr/content/images/2023/05/Screenshot-2023-05-12-at-15.23.06.png 1858w" sizes="(min-width: 720px) 720px"><figcaption>Screenshot of Metric trees presentation.</figcaption></figure><p></p><h1 id="other-stuff-i-liked">Other stuff I liked</h1><ul><li><a href="https://www.youtube.com/watch?v=z6sbY-c6gAQ&ref=blef.fr">Snowflake optimisation guide</a> — This is a pragmatic guide on how you can lower your Snowflake costs. In the current context we have to do more with less. The talk starts with a great introduction to Snowflake architecture. In a nutshell the speakers share tips about warehouse sizing and design, and performance optimisation with pruning, clustering and query design.</li><li><a href="https://www.youtube.com/watch?v=TCoX7FQ1Jdc&ref=blef.fr">LLMs and Semantic layer</a> — This is something I've had in mind for some time. This is a tool presentation but still relevant. On the same topic of self-service, Whatnot shared how they turned <a href="https://www.youtube.com/watch?v=wyW6hQGZxgY&ref=blef.fr">data consumers into data constructors</a>.</li><li><a href="https://www.youtube.com/watch?v=u82r_eqUaiI&ref=blef.fr">Scaling Uber metrics systems</a> (w/ Pinot) — the uMetric migration from ES to Pinot. They created a unified layer where metrics use the same logic for downstream consumers. 
uMetric manages definition, discovery, computation, verification and serving.</li><li><a href="https://www.youtube.com/watch?v=WR7e7dQgk7I&ref=blef.fr">Writing unit tests for data science</a> — A pragmatic guide about unit tests.</li><li><a href="https://www.youtube.com/watch?v=yNQWjCGHV88&ref=blef.fr">Retro on data science by DJ Patil</a> — DJ Patil has been US Chief Data Scientist. He coined the "data scientist" term back in 2008. He does a great retro.</li><li><a href="https://www.youtube.com/watch?v=n2GO1EN5If8&ref=blef.fr">Dashboards as code</a> — Using code to make BI development better; this is DataOps. We have almost everything as code in the whole data chain, only dashboards lack it.</li><li><a href="https://www.youtube.com/watch?v=_mpWp_1kqKY&ref=blef.fr">Growing the data Team and data Culture at GitLab</a> — GitLab's data playbook is well-known. Also the eng–director gap problem: this is when you have a director that manages an individual contributor.</li><li><a href="https://www.youtube.com/watch?v=cGgzHN6MG8E&ref=blef.fr">A deep-dive into the dbt manifest</a> — How to do a dry-run in a cloud data warehouse, load the manifest as dynamic DAGs, enforce policies or build monitoring.</li><li><a href="https://www.youtube.com/watch?v=Kwo4ltNroak&ref=blef.fr">Augmenting the modern data stack</a> — by merging batch and real-time technologies in one database.</li></ul><hr><p>See you soon ❤️.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 23.19 ]]></title>
                    <description><![CDATA[ Data News #23.19 — Minds of data my new podcast, Google I/O takeaways, HuggingFace releases, Salesforce GPT and the Fast News ⚡️. ]]></description>
                    <link><![CDATA[ /data-news-week-23-19/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 645b9f11b647d00001c4dc32 ]]></guid>
                    <pubDate><![CDATA[ 2023-05-12 ]]></pubDate>
                    <content>
                        <![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/05/image-2.png" class="kg-image" alt loading="lazy" width="2000" height="1334" srcset="https://www.blef.fr/content/images/size/w600/2023/05/image-2.png 600w, https://www.blef.fr/content/images/size/w1000/2023/05/image-2.png 1000w, https://www.blef.fr/content/images/size/w1600/2023/05/image-2.png 1600w, https://www.blef.fr/content/images/size/w2400/2023/05/image-2.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>Sorting the news (<a href="https://unsplash.com/photos/1hUY8SpJ8Cw?ref=blef.fr">credits</a>)</figcaption></figure><p>Hey you, new Friday means Data News. This week is pretty stacked in terms of content, especially video / audio content. I hope you will enjoy it as much as I did. </p><p>Let's start with my newly created podcast Minds of Data. In Minds of Data I meet people from the data ecosystem in order to learn more about them. In the first episode I sat down with Joe Reis and we discussed his professional journey before becoming the thought leader he is today; we also chatted about data engineering. You can listen to the episode on <a href="https://open.spotify.com/show/7bkiM0BFwhgXdHvBVaThrB?ref=blef.fr">Spotify</a>, <a href="https://podcasts.apple.com/us/podcast/minds-of-data/id1686939820?ref=blef.fr">Apple Podcast</a> and <a href="https://www.deezer.com/us/show/6045437?ref=blef.fr">Deezer</a>.</p><!--kg-card-begin: html--><iframe src="https://podcasters.spotify.com/pod/show/blef/embed/episodes/Episode-1--Joe-Reis-e23mt2h" height="102px" width="400px" frameborder="0" scrolling="no"></iframe><!--kg-card-end: html--><p><em>PS: this is my first episode ever so feedback is more than welcome.</em></p><p>At the same time, in Paris we organised the May Airflow meetup last Tuesday. We had 3 talks, which you can find on <a href="https://www.youtube.com/@parisairflow?ref=blef.fr">YouTube</a>. 
I really liked Benoit and Samy's <a href="https://www.youtube.com/watch?v=xULkJUEaEsA&ref=blef.fr">presentation about Cloud Composer</a>—Managed Airflow on GCP. They shared good practices on how to manage Composer in the cloud, things like:</p><ul><li>Use the same configuration for staging and prod</li><li>Use a secret manager to manage your Airflow connections</li><li>Use IAM restrictions on the DAGs bucket</li><li>Use operators and define the company policy around them</li><li>Define clear policies to govern your Airflow</li></ul><p>Also, <a href="https://airflow.apache.org/blog/airflow-2.6.0/?ref=blef.fr">Airflow 2.6</a> came out this week with a new parameterizable trigger-DAG UI, a new alert notification framework (callbacks) and a new graph interface in the grid view.</p><p></p><h1 id="gen-ai-%F0%9F%A4%96">Gen AI 🤖</h1><p>The pace of innovation and announcements in the (Gen) AI field isn't slowing down. I can't really cover the whole field because it moves so fast that I can't even keep up. This week the <a href="https://www.youtube.com/watch?v=cNfINi5CNbY&ref=blef.fr">Google I/O Keynote</a> was a major milestone.</p><h3 id="google-io-keynote-takeaways">Google I/O Keynote takeaways</h3><p>What amazed me in the Google Keynote is the fact that Generative AI is treated like a product, like the 2007 iPhone—look at this <a href="https://youtu.be/cNfINi5CNbY?t=3050&ref=blef.fr">ad</a>. When you think about it, AI has always been something hidden, like an API call, a score or a recommendation in a larger UI. 
In Google's Keynote, AI gets a 26-minute segment, and then all its derivations last for 2 hours.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/05/Screenshot-2023-05-12-at-11.55.50.png" class="kg-image" alt loading="lazy" width="1190" height="546" srcset="https://www.blef.fr/content/images/size/w600/2023/05/Screenshot-2023-05-12-at-11.55.50.png 600w, https://www.blef.fr/content/images/size/w1000/2023/05/Screenshot-2023-05-12-at-11.55.50.png 1000w, https://www.blef.fr/content/images/2023/05/Screenshot-2023-05-12-at-11.55.50.png 1190w" sizes="(min-width: 720px) 720px"><figcaption>Bold tagline &amp; Google ego speaking (screenshot from the Keynote)</figcaption></figure><p>To me, Google's annual conference is a sign that the party is over, especially for OpenAI. Actually, OpenAI's deal with Microsoft was probably the best deal they could have gone for. Even if, as humans, we want to send models into the <a href="https://lmsys.org/blog/2023-05-10-leaderboard/?ref=blef.fr">arena</a> to find the most performant one, or indulge in intellectual masturbation comparing parameter counts, in the end the best-integrated models will win. And Google has a head start—as does Microsoft. As they remind us in the Keynote, they have 15 products used by billions of people: they have our e-mails, our photos, our maps and more. AI is just a feature in their products; even if it needs a UI rethink, it is still just a feature.</p><p>So in the end Google, an AI-first company from the beginning, wants to put AI everywhere and wants to offer you an AI collaborator. Here are the major takeaways from the Keynote:</p><ul><li>They released PaLM 2, their latest foundation model. It will exist in 4 sizes: Gecko, Otter, Bison and Unicorn, each requiring different hardware resources to run.</li><li>PaLM 2 will be natively integrated in Google products. 
Gmail will get enhanced smart-reply features, Maps will propose an immersive view of a route and Photos will have a magic editor that will let you edit a picture in a single drag-n-drop.</li><li>Google will create a sidekick called Duet AI that will be available in Workspace—Sheets, Docs and Slides. You'll be able to ask the AI to create content for you, unlocking productivity gains. Duet AI will also work in GCP (in the console and within the web IDE).</li><li>According to the announcement, PaLM 2 will particularly shine when fine-tuned (e.g. for IT security or medicine). You'll be able to do it yourself within your own GCP instance in Vertex AI. They also released Imagen, Codey and Chirp, respectively for image generation, code generation and speech-to-text.</li><li>Bard, the conversational model—the ChatGPT equivalent—is now open to everyone (though not in all countries). Bard works great for code generation, debugging and code explainability.</li><li><strong>Bard might also be the Zero-ETL solution </strong>we were all waiting for. In the demo the speaker asks Bard to find schools in an area, then asks for the result to be saved in a Google Sheet, then asks for a new column in the sheet indicating whether the school is public or private. 
To be honest, what prevents Bard from doing the same in a database in the future?</li><li>Finally, Google teased their next-gen model Gemini—which, to hear them, will obviously be awesome—and announced an evolution of the search interface with Gen AI as a new interactive way to search.</li></ul><p>In the end I really liked the keynote because it sets a new milestone for what we can expect as integrations in the products we use daily.</p><div class="kg-card kg-button-card kg-align-center"><a href="https://www.youtube.com/watch?v=QpBTM0GO6xI&ref=blef.fr" class="kg-btn kg-btn-accent">📺 Watch the 10 mins recap (by Google)</a></div><h3 id="other-stuff">Other stuff</h3><ul><li>Hugging Face released an open model called <a href="https://huggingface.co/bigcode/starcoder?ref=blef.fr">StarCoder</a> that has been trained on GitHub code and is meant to act as a Copilot. Still, the model is not yet ready to be used as an instruction model—the ChatGPT way.</li><li>At the same time HF also introduced an <a href="https://github.com/huggingface/chat-ui?ref=blef.fr">open-source Chat UI</a>.</li><li>After Bill Gates, it's Steve Wozniak—Apple co-founder—who gives his take on the AI breakthroughs in a <a href="https://www.bbc.com/news/technology-65496150?ref=blef.fr">BBC interview</a>; mainly: we can't stop the march of progress, AI will be used to scam people, and we still have to put up guardrails—human guardrails.</li><li>Salesforce does not want to be left behind in the battle; they announced <a href="https://slack.com/blog/news/introducing-slack-gpt?ref=blef.fr">Slack GPT</a>, natively integrated in Slack to summarise or compose messages, but also a way for partners to bring new kinds of Gen AI apps.</li><li>Salesforce also gave Tableau a makeover with <a href="https://www.salesforce.com/news/stories/tableau-einstein-gpt-user-insights/?ref=blef.fr">Tableau GPT</a>, a way to provide <em>AI-powered analytics</em>. In Tableau Pulse you'll have access to auto-generated insights on your data. 
With a "For You" tab, like you were on TikTok.</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/05/image-3.png" class="kg-image" alt loading="lazy" width="2000" height="1333" srcset="https://www.blef.fr/content/images/size/w600/2023/05/image-3.png 600w, https://www.blef.fr/content/images/size/w1000/2023/05/image-3.png 1000w, https://www.blef.fr/content/images/size/w1600/2023/05/image-3.png 1600w, https://www.blef.fr/content/images/size/w2400/2023/05/image-3.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>The StarCoder (<a href="https://unsplash.com/photos/d1Wj9qU5C-o?ref=blef.fr">credits</a>)</figcaption></figure><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li><a href="https://medium.com/@hugolu87/zero-elt-could-be-the-death-of-the-modern-data-stack-cfdd56c9246d?ref=blef.fr">Zero ELT could be the death of the modern data stack</a> — Amazon launched this trend a few months ago. In the current situation we're far from killing any ELT processes, but it might come. For instance Zapier launched <a href="https://zapier.com/tables?ref=blef.fr">Zapier Tables</a>, a kind of data storage within your zaps.</li><li><a href="https://davidsj.substack.com/p/we-need-to-talk-about-excel?sd=pf&ref=blef.fr">We need to talk about Excel</a> — Let's be honest: however hard we try to kill Excel, it comes back just as strong. David shares interesting stories around Excel usage at companies that I can relate to. He finally mentions Count and Equals, two companies that build on top of tabular interfaces to do data.</li><li><a href="https://gist.github.com/sayle-doit/264d28dd990c478beb90b90ac3923681?ref=blef.fr">Determine BigQuery storage costs across an org</a> — A SQL query that I have not tried. 
Please read it twice before running it blindly.</li><li><a href="https://www.confessionsofadataguy.com/polars-laziness-and-sql-context/?ref=blef.fr" rel="bookmark">Polars, laziness and SQL context</a> — Daniel showcases the 2 features which should make you want to migrate to Polars.</li><li><a href="https://medium.com/whatnot-engineering/building-the-seller-analytics-dashboard-ccffd2a0151a?ref=blef.fr">Building the seller analytics dashboard</a> — A great example of what you should consider when building an analytics dashboard into the product, and of how to combine dbt and GraphQL APIs to build a pragmatic metrics store.</li><li><a href="https://www.theseattledataguy.com/oltp-vs-olap-what-is-the-difference/?ref=blef.fr">OLTP vs. OLAP</a> — One of the best explanations of the differences between the two. The main one resides in the data storage—one being row-oriented while the other is column-oriented—but this is not the only difference.</li><li><a href="https://medium.com/data-engineer-things/correctly-loading-incremental-data-at-scale-c656704da86d?ref=blef.fr">Correctly loading incremental data at scale</a> &amp; <a href="https://engineering.razorpay.com/real-time-denormalized-data-streaming-platform-part-3-optimisations-and-monitoring-5f7a58d9d97?ref=blef.fr">real-time denormalized data streaming platform</a>.</li><li><a href="https://towardsdatascience.com/mastering-externaltasksensor-in-apache-airflow-how-to-calculate-execution-delta-425093323758?ref=blef.fr">ExternalTaskSensor in Apache Airflow: how to calculate execution delta</a> — I've seen multiple times that the delta computation was annoying for data engineering teams. 
This article deep-dives into it well.</li><li><a href="https://engineering.linkedin.com/blog/2023/upscaling-profile-datastore-while-reducing-costs?ref=blef.fr">Upscaling LinkedIn's profile datastore while reducing costs</a> — For optimisation geeks.</li></ul><p></p><div class="kg-card kg-callout-card kg-callout-card-purple"><div class="kg-callout-emoji">👋</div><div class="kg-callout-text">The newsletter is much longer than expected—I got lost today in watching fascinating videos—so I'll be sending out a second part over the weekend or early next week with a recap of the best talks from Data Council 2023. If you want to get a head start, my favourite talk was Lloyd's demonstration of <a href="https://www.youtube.com/watch?v=zmmJgwc3oPI&ref=blef.fr">Malloy, an experimental language for data</a>.</div></div><hr><p>See you in a few days with Data Council takeaways ❤️.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ 🎙️ Episode 1 — Joe Reis ]]></title>
                    <description><![CDATA[ Episode 1 of Minds of Data. In this episode we discover who Joe Reis is and why he ended up being the awesome data creator he is now. ]]></description>
                    <link><![CDATA[ /minds-of-data/episode-1-joe-reis/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 64592bfa82a526000104eb5a ]]></guid>
                    <pubDate><![CDATA[ 2023-05-08 ]]></pubDate>
                    <content>
                        <![CDATA[ <!--kg-card-begin: html--><iframe src="https://podcasters.spotify.com/pod/show/blef/embed/episodes/Episode-1--Joe-Reis-e23mt2h" height="102px" width="400px" frameborder="0" scrolling="no"></iframe><!--kg-card-end: html--> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 23.18 ]]></title>
                    <description><![CDATA[ Data News #23.18 — Gen AI news, PayPal data contract, Prime Video stopped using microservices, Gitlab production database deletion explained, and more. ]]></description>
                    <link><![CDATA[ /data-news-week-23-18/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 6453e2eb82a526000104d2b1 ]]></guid>
                    <pubDate><![CDATA[ 2023-05-06 ]]></pubDate>
                    <content>
                        <![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/05/image-1.png" class="kg-image" alt loading="lazy" width="2000" height="1333" srcset="https://www.blef.fr/content/images/size/w600/2023/05/image-1.png 600w, https://www.blef.fr/content/images/size/w1000/2023/05/image-1.png 1000w, https://www.blef.fr/content/images/size/w1600/2023/05/image-1.png 1600w, https://www.blef.fr/content/images/size/w2400/2023/05/image-1.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>It's wedding weekend (as you'll probably read it, congrats) (<a href="https://unsplash.com/photos/ULHxWq8reao?ref=blef.fr">credits</a>)</figcaption></figure><p>Hey you, this is a Saturday edition of the Data News. I hope this email finds you well. This week you'll have less editorial content because I'm late. But you'll still find awesome articles that have been written recently.</p><p>As a reminder, on Tuesday next week I'm organising the Apache Airflow Paris <a href="https://www.meetup.com/fr-FR/paris-apache-airflow-meetup/events/292891570/?ref=blef.fr">meetup</a> that you should consider joining if you're in Paris. Also next week I'll publish my first podcast episode ever, which I've recorded with Joe Reis—the co-author of the famous Fundamentals of Data Engineering. I'm still looking for a name for the podcast; if you have ideas, shoot.</p><p></p><h1 id="gen-ai-%F0%9F%A4%96">Gen AI 🤖</h1><ul><li><a href="https://www.semianalysis.com/p/google-we-have-no-moat-and-neither?ref=blef.fr">Google "We have no moat, and neither does OpenAI"</a> — This is an internal note from a Google employee (which does not reflect Google's views) that mainly says open-source models will win over Google and OpenAI, and that a closed-source policy for models might be a mistake, especially in a world where some models leak (e.g. 
Meta ones).</li><li>If you already have access to OpenAI in Azure you can now use <a href="https://azure.microsoft.com/en-us/blog/introducing-gpt4-in-azure-openai-service/?ref=blef.fr">GPT-4</a>—still in preview only.</li></ul><p>And more <em>traditional AI</em>:</p><ul><li><a href="https://dagshub.com/blog/yolo-nas-by-deci/?ref=blef.fr">YOLO-NAS</a> a new object detection model — you have probably already seen this model, which detects people in videos in real time. This new one seems to be better than the previous one.</li><li><a href="https://www.fast.ai/posts/2023-05-03-mojo-launch.html?ref=blef.fr">Mojo, a new programming language ready for the AI</a> — Mojo is a new programming language that looks like Python but sits at a lower level, which could unlock performance gains and new heights in AI model development.</li><li><a href="https://tech.ebayinc.com/engineering/ebays-blazingly-fast-billion-scale-vector-similarity-engine/?ref=blef.fr">eBay’s blazingly fast billion-scale vector similarity engine</a>.</li></ul><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li><a href="https://github.com/paypal/data-contract-template?ref=blef.fr">Paypal, template for data contract</a> — PayPal is implementing a Data Mesh and they shared in the open all their thinking on data contracts. In the GitHub repo they share a YAML template describing what's in the contract. This is insanely exhaustive. </li><li><a href="https://world.hey.com/dhh/even-amazon-can-t-make-sense-of-serverless-or-microservices-59625580?ref=blef.fr">Even Amazon can't make sense of serverless or microservices</a> — Prime Video's tech team wrote an article that could be summarised as: <em>we migrated from a functions-based approach to a monolith in a VM</em>. The internet found this ironic. 
By doing this they reduced costs by 90%.</li><li><a href="https://medium.com/walmartglobaltech/lakehouse-at-fortune-1-scale-480bcb10391b?ref=blef.fr">Lakehouse at Walmart</a> — Samuel from Walmart describes the research they did and why they picked Hudi over Delta to implement a Lakehouse architecture. As a reminder, the Lakehouse is the merger of the datalake and the data warehouse—mainly a way to add a SQL-friendly processing engine with ACID transactions on top of a datalake.</li><li><a href="https://engineering.grab.com/safer-flink-deployments?ref=blef.fr">Safer deployment of streaming applications</a> — This is how Grab deploys Flink applications.</li><li><a href="https://www.estuary.dev/debezium-alternatives/?ref=blef.fr#the-challenges-with-debezium">Why you should reconsider Debezium: challenges and alternatives</a> — Warning: this article has been written by a CDC solution vendor, but it is still relevant because it shows the reality of managing Debezium.</li><li><a href="https://gtm-gear.com/posts/dataform-cloud-functions/?ref=blef.fr">Dataform: schedule daily updates using Cloud Functions</a> — Dataform is a solution Google bought a few years ago; it is a dbt alternative, but for BigQuery. This article gives a great overview of the product. To be honest it looks a bit hacky.</li><li>📺 <a href="https://www.youtube.com/watch?v=tLdRBsuvVKc&ref=blef.fr">Dev Deletes Entire Production Database, Chaos Ensues</a> — If you want a greatly told story you should watch it; this is a YouTube video explaining how GitLab deleted its production database and how they fixed it. 
It reminds me of my own <a href="https://www.blef.fr/data-deleted-from-production/">horror story</a> of deleting the whole <code>/data</code> folder in HDFS.</li><li><a href="https://www.infoworld.com/article/3695210/how-oracle-is-taking-on-aws-snowflake-with-autonomous-data-warehouse-updates.html?ref=blef.fr">Oracle is taking on Snowflake</a> — I often say that Snowflake will become the new Oracle. It's fun to see that Oracle is still trying to catch up. They came up with a lot of news: they will implement the Delta Sharing protocol, lower storage pricing from $118/TB to $25, partner with AWS and propose a low-code data integration tool.</li><li>Data modeling, again — Simon published the second part of his <a href="https://airbyte.com/blog/data-modeling-unsung-hero-data-engineering-approaches-and-techniques?ref=blef.fr">data modeling guide</a>; this time he covered the different techniques you can use when modeling data: dimensional, vault, anchor and more. You might also want to see <a href="https://www.thoughtspot.com/data-trends/data-modeling/conceptual-data-model-examples?ref=blef.fr">practical examples</a> of data modeling—Sonny wrote a nice article using a hotel business as an example.</li><li><a href="https://marcstone.substack.com/p/crafting-your-data-team?sd=pf&ref=blef.fr">Crafting your data team</a> — Practical tips on how to get started with your data team in a new startup. In the post Marc gives you the qualities you should look for and which hires you should prioritise first.</li><li>🎮 The CS:GO Liquid team announced a <a href="https://twitter.com/TeamLiquidCS/status/1653869015033577474?ref=blef.fr">new data analyst</a>. DeMars previously worked on a <a href="https://twitter.com/DeMarsDeRover/status/1499401407845179399?ref=blef.fr">predictive analytics approach</a> to Valorant, trying to predict who would win a round in different situations. 
It's fun to see our beloved data analyst position reaching other fields.</li><li>Data projects on personal data — Petrica <a href="https://betterprogramming.pub/from-traffic-to-revenue-a-deep-dive-into-my-medium-data-98ac5b405605?ref=blef.fr">dives into her Medium data</a> with DuckDB and Plotly, and Stefen <a href="https://medium.com/@stefentaime_10958/uber-project-analyzing-personal-uber-and-uber-eats-expenses-with-elt-data-pipeline-using-dbt-91ead4aea5df?ref=blef.fr">analysed his Uber spending</a> with dbt and Postgres. As a reminder, doing personal data projects is still the best way to learn about technical stuff.</li></ul><p></p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/05/image.png" class="kg-image" alt loading="lazy" width="2000" height="1334" srcset="https://www.blef.fr/content/images/size/w600/2023/05/image.png 600w, https://www.blef.fr/content/images/size/w1000/2023/05/image.png 1000w, https://www.blef.fr/content/images/size/w1600/2023/05/image.png 1600w, https://www.blef.fr/content/images/size/w2400/2023/05/image.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>French mistral taking on OpenAI (<a href="https://unsplash.com/photos/WtwSsqwYlA0?ref=blef.fr">credits</a>)</figcaption></figure><h1 id="data-economy-%F0%9F%92%B0">Data Economy 💰</h1><ul><li><strong>AuraML</strong>, an Indian-based company, <a href="https://www.indianweb2.com/2023/05/auraml-synthetic-image-data-platform.html?ref=blef.fr">raises $230k in a pre-seed round</a>. AuraML is a 3D synthetic data company; their engine is capable of generating realistic-looking 3D environments you might want to use in other models.</li><li><strong>Mistral AI</strong>, a French Gen AI company, <a href="https://www.bfmtv.com/economie/entreprises/mistral-ai-start-up-francaise-d-intelligence-artificielle-prepare-une-grosse-levee-de-fonds_AD-202305050805.html?ref=blef.fr">will probably raise €100m</a> (link in French) in the following weeks. 
It looks like, at the moment, the company has only hired a few French people who previously worked on LLaMA at Meta or at Alphabet's DeepMind. The goal of the company is to provide the first French—hence European—alternative to OpenAI. Obviously this is heavily political and strategic for Europe, so we will follow it in the coming weeks.</li><li>Anaconda is expanding and <a href="https://www.anaconda.com/press/anaconda-acquires-edublocks-to-empower-k-12-data-literacy-and-expand-educational-offerings?ref=blef.fr">buying EduBlocks.</a> EduBlocks is a Scratch-like platform for writing Python or HTML code. This is a cool thing for continuing code democratisation.</li><li><em>Open-source done differently</em>. Sequoia—a VC—<a href="https://www.sequoiacap.com/article/sequoia-open-source-fellowship/?ref=blef.fr">will support Sebastián Ramírez</a> with an open-source fellowship. Sebastián is the creator of FastAPI, SQLModel and Typer. There isn't much more detail in the press release, but this is awesome to see.</li></ul><hr><p>See you next week ❤️.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 23.17 ]]></title>
                    <description><![CDATA[ Data News #23.17 — what happened to the Semantic Layer, OpenAI demo that feels like 2007 iPhone and the fast news. ]]></description>
                    <link><![CDATA[ /data-news-week-23-17/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 64490643745333003dbdd615 ]]></guid>
                    <pubDate><![CDATA[ 2023-04-28 ]]></pubDate>
                    <content>
                        <![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/04/image-5.png" class="kg-image" alt loading="lazy" width="2000" height="1500" srcset="https://www.blef.fr/content/images/size/w600/2023/04/image-5.png 600w, https://www.blef.fr/content/images/size/w1000/2023/04/image-5.png 1000w, https://www.blef.fr/content/images/size/w1600/2023/04/image-5.png 1600w, https://www.blef.fr/content/images/size/w2400/2023/04/image-5.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>Berlin (<a href="https://unsplash.com/photos/TK5I5L5JGxY?ref=blef.fr">credits</a>)</figcaption></figure><p>Hey you, new edition of the newsletter. This week summer weather arrived in Berlin and it was awesome. I managed to move forward with my client projects this week, which also feels like a relief. So I'm pretty happy: sun and great projects 🙂.</p><p>Regarding the content, if you are in Paris on May 9th, we are organising the <a href="https://www.meetup.com/fr-FR/paris-apache-airflow-meetup/events/292891570/?ref=blef.fr">Paris Airflow Meetup</a> in Algolia's offices; it will be in English, so you have no excuse not to come. Also, I'll be in Paris a lot in May, so if you want to have a 🍜 / 🍺 ping me.</p><p></p><h1 id="what-happened-to-the-semantic-layer-%F0%9F%AB%A0">What happened to the Semantic Layer? 🫠</h1><p>This week dbt Labs disclosed their vision for the semantic layer and especially <a href="https://www.getdbt.com/blog/dbt-semantic-layer-whats-next/?ref=blef.fr">what they want to do with the Transform acquisition</a>. This is mainly a roadmap for the MetricFlow integration within the dbt ecosystem. At the moment we have the dbt Semantic Layer, which corresponds to YAML definitions, and MetricFlow—which was Transform's open-source project—which is able to understand the semantics to generate SQL.</p><p>A lot of changes will happen to MetricFlow incl. 
breaking changes:</p><ul><li>the dbt metrics spec will change—in its current state not a lot of people were actually using it—the dbt_metrics package will be deprecated, and they will probably merge the dbt and MetricFlow syntaxes to define semantics and metrics</li><li><em>"The core MetricFlow package will become a stand-alone library for processing metric queries, generating a query plan, and rendering SQL against a target dialect." (cf. <a href="https://github.com/dbt-labs/metricflow/discussions/478?ref=blef.fr">Github discussion</a>)</em></li><li>The license will change to BSL.</li><li>The serving part of the system, aka the metrics store, will be the paid service of dbt Labs and part of the dbt Cloud offering. It means that you will define metrics and dimensions in YAML and then plug all your tools into dbt Cloud; it seems there isn't any open-source solution to do the serving—at least from dbt Labs' side. And with the license change on MetricFlow, dbt Labs is protecting itself against someone using MetricFlow's generation to propose such a paid service.</li><li>More is described in the <a href="https://github.com/dbt-labs/metricflow/discussions/478?ref=blef.fr">Github discussion</a>.</li></ul><p>To add more spice to this, Carlin <a href="https://carlineng.com/?postid=semantic-layer&ref=blef.fr#blog">wrote about what happened to the Semantic Layer</a>. Carlin works at Google on the Malloy team (Google's semantic layer, to put it quickly—tbh it's probably more) and he gives his views along with a small retrospective on semantic layers.</p><p></p><h1 id="gen-ai-%F0%9F%A4%96">Gen AI 🤖</h1><ul><li><a href="https://doordash.engineering/2023/04/26/doordash-identifies-five-big-areas-for-using-generative-ai/?ref=blef.fr">DoorDash identifies Five big areas for using Generative AI</a> — Doordash is a food delivery platform and they shared how they imagine Generative AI could help them in the future. Either by assisting humans, be it customers (cart building, etc.) 
or employees (SQL writing or document drafting); or by improving actual AI stuff: search, discovery, information extraction. </li><li>When it comes to SQL writing the field is on fire; a lot of companies are trying to raise from the dead the Slack chatbots answering insight questions. I think of <a href="https://shape.xyz/?ref=blef.fr">Shape</a> (YCombinator, out of stealth this week) and <a href="https://www.linkedin.com/company/delphilabs/?ref=blef.fr">Delphi Labs</a> or <a href="https://preset.io/blog/introducing-promptimize/?ref=blef.fr">Promptimize</a>. Promptimize is a toolkit to evaluate and test prompts; for instance you can "unit test" your natural-language-to-SQL prompts with it—it has been open-sourced by Maxime Beauchemin (Airflow and Superset creator).</li><li><a href="https://blog.google/technology/ai/code-with-bard/?ref=blef.fr">Bard now helps you code</a> — Google is finally going the Copilot way and proposes an alternative with Bard. Bard can now help you write code or Google Sheets functions, but it can do more by explaining or debugging code for you.</li><li>📺 <a href="https://www.youtube.com/watch?v=C_78DM8fG6E&ref=blef.fr">The Inside Story of ChatGPT’s astonishing potential</a> — A TED talk from OpenAI's President and co-founder sharing his vision, the potential and the limits of the technology. In the video you can feel Steve Jobs's 2007 <a href="https://www.youtube.com/watch?v=x7qPAY9JqE4&ref=blef.fr">iPhone keynote</a> vibes. The video also greatly showcases ChatGPT plugins. I highly recommend watching it.</li></ul><p>Last but not least, a more "traditional" AI category:</p><ul><li><a href="https://artificialcorner.com/end-to-end-machine-learning-modelling-in-bigquery-google-cloud-a0d9e7eca20b?ref=blef.fr">End-to-end ML modeling in BigQuery</a> — Over the last few years BigQuery has added a lot of ML capabilities to the engine. 
This post showcases a lot of it (it uses an XGBoost model).</li><li><a href="https://eng.lyft.com/building-a-large-scale-unsupervised-model-anomaly-detection-system-part-2-3690f4c37c5b?ref=blef.fr">Building a large scale unsupervised model anomaly detection system (part 2)</a>.<br></li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/04/image-6.png" class="kg-image" alt loading="lazy" width="2000" height="1333" srcset="https://www.blef.fr/content/images/size/w600/2023/04/image-6.png 600w, https://www.blef.fr/content/images/size/w1000/2023/04/image-6.png 1000w, https://www.blef.fr/content/images/size/w1600/2023/04/image-6.png 1600w, https://www.blef.fr/content/images/size/w2400/2023/04/image-6.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>AGI (<a href="https://unsplash.com/photos/w2DsS-ZAP4U?ref=blef.fr">credits</a>)</figcaption></figure><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li><a href="https://medium.com/wttj-tech/from-postgresql-to-snowflake-a-data-migration-story-5fd17f778019?ref=blef.fr">From PostgreSQL to Snowflake: A data migration story</a> — The migration lasted 9 months and included 8 steps. They went on this journey because in 2021 Postgres was already hitting read performance limits, degrading the downstream user experience in the BI tools. As Katia shares in the article, a 9-month migration is a long tunnel where you encounter a lot of roadblocks and frustration, but in the end everyone feels the difference: a 10x performance gain—at least—on dashboard execution time.</li><li><a href="https://medium.com/checkout-com-techblog/building-dbt-ci-cd-at-scale-365358f64b6f?ref=blef.fr">Building dbt CI/CD at scale</a> — Every week there's a new great article about someone else's dbt setup where you discover things. This time Damian shares how he designed checkout.com's CI/CD pipelines—in GitHub. 
In a nutshell, they get the actual production manifest, run a SQL linter, validate model changes (by detecting the altered models and running them) and deploy to Airflow.</li><li><a href="https://mattpalmer.io/posts/making-the-most-of-airflow/?ref=blef.fr">Making the Most of Airflow</a> — I already shared Matt's article last week and this week he continues with an awesome article about Airflow. In the article he gives a great overview of Airflow's main concepts: DAGs and the TaskFlow API (I also wrote something about <a href="https://www.blef.fr/airflow-dynamic-dags/">dynamic DAGs</a> last year), DRY and how not to redevelop stuff, and how to test.</li><li><a href="https://docs.getdbt.com/blog/kimball-dimensional-model?ref=blef.fr">Building a Kimball dimensional model with dbt</a> — Jonathan from Canva wrote a large article about dimensional modeling and how to do it with dbt. This is a 7-part tutorial that shows you how to create fact and dimension tables.</li><li><a href="https://moderndataengineering.substack.com/p/data-engineering-design-principles?ref=blef.fr">Data engineering design principles you should follow</a> — It mainly deals with software engineering principles like SOLID. Idempotence and determinism are missing from the article; if you want to go deeper you can read the most important article on this topic: <a href="https://maximebeauchemin.medium.com/functional-data-engineering-a-modern-paradigm-for-batch-data-processing-2327ec32c42a?ref=blef.fr">functional data engineering</a>.</li><li><a href="https://engineering.razorpay.com/real-time-denormalized-data-streaming-platform-part-1-9f3c730dd9c6?ref=blef.fr">Real-time denormalized data streaming platform part 1</a> and <a href="https://engineering.razorpay.com/real-time-denormalized-data-streaming-platform-part-2-97dfff40fd8d?ref=blef.fr">part 2</a> — Razorpay's data team describes how and why they needed to move their ETL process from daily to near real-time. 
Technologically moving from Airflow batches to Spark running on top of Kafka.</li><li><a href="https://medium.pimpaudben.fr/toward-declarative-data-orchestration-with-kestra-3b17264fbaab?ref=blef.fr">Toward declarative data orchestration with Kestra</a> — A few weeks ago in the Airflow alternatives meetup we organised, we invited Kestra. A YAML-based orchestrator written on the JVM. Recently Benoit joined Kestra as their PO. In this article he shares his vision. It's mainly a question of vocabulary and reach, Kestra believes that with their own declarative YAML syntax they can offer data pipelines to the masses. YAML is enough simple for your analysts (they already do dbt) or business to write their own pipelines.</li><li><a href="https://medium.pimpaudben.fr/toward-declarative-data-orchestration-with-kestra-3b17264fbaab?ref=blef.fr"><a href="https://atlasgo.io/blog/2023/04/21/terraform-v050?ref=blef.fr">Manage database schemas with Terraform in plain SQL</a></a> — Atlas is an open-source schema management tool. The post showcases the atlas provider in Terraform that allows you to write SQL to manage your database in Terraform. I can't wait to see dbt reimplemented in Terraform.</li><li><a href="https://tobikodata.com/automatically-detecting-breaking-changes-in-sql-queries.html?ref=blef.fr">Automatically detecting breaking changes in SQL queries</a> — When you alter a SQL query you can either do a breaking or a non-breaking change. What if with SQLglot you could detect a breaking change before it happens in production?</li></ul><hr><p>See you next week ❤️</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 23.16 ]]></title>
                    <description><![CDATA[ Data News #23.16 — Analytics engineering future, a new Airflow meetup, data engineering at Adyen and Meta, dbterra and more. ]]></description>
                    <link><![CDATA[ /data-news-week-23-16/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 64419674745333003dbdbdc7 ]]></guid>
                    <pubDate><![CDATA[ 2023-04-21 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/04/image-4.png" class="kg-image" alt loading="lazy" width="2000" height="1125" srcset="https://www.blef.fr/content/images/size/w600/2023/04/image-4.png 600w, https://www.blef.fr/content/images/size/w1000/2023/04/image-4.png 1000w, https://www.blef.fr/content/images/size/w1600/2023/04/image-4.png 1600w, https://www.blef.fr/content/images/size/w2400/2023/04/image-4.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>If this picture had been generated with AI it would have been boring (<a href="https://unsplash.com/photos/U6WvLJU0l6o?ref=blef.fr">credits</a>)</figcaption></figure><p>Dear readers, I hope you're doing good. We are close to the second anniversary of the newsletter. Which is crazy. Retrospectively it means that I've written 900 words on average every week for the last 102 weeks. When you look at the <a href="https://www.blef.fr/news-week-2021-18/">first edition</a> we've come a long way—lmao.</p><p>We announced this week the <a href="https://www.meetup.com/fr-FR/paris-apache-airflow-meetup/events/292891570/?ref=blef.fr">May Paris Apache Airflow meetup</a>. It will take place in Algolia's offices on the 9th of May. We will have 3 speakers and for the first time all the presentations will be held in English. So if you're in Paris or in France do not hesitate to register.</p><h1 id="analytics-engineering-future">Analytics engineering future</h1><p>This week Tristan Handy—dbt Labs CEO—wrote a post about the future of analytics engineering: <a href="https://www.getdbt.com/blog/analytics-engineering-next-step-forwards/?ref=blef.fr">The next big step forwards for analytics engineering</a>. As an introduction Tristan restates the original vision of dbt, which became mainstream today. 
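That original vision—SQL managed with software engineering practices—boils down to models like the following minimal sketch (the model, source, and column names here are invented for illustration, not taken from any real project):

```sql
-- models/staging/stg_orders.sql — a minimal dbt model (hypothetical names).
-- dbt resolves source()/ref() calls into fully qualified table names and
-- derives the dependency graph (the DAG) from them.
select
    order_id,
    customer_id,
    ordered_at
from {{ source('shop', 'raw_orders') }}
where ordered_at is not null
```

Because dependencies are declared through these macros rather than hard-coded table names, dbt can build models in the right order, test them, and document the lineage.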
A lot of data teams have embraced dbt, or at least SQL with engineering practices, to transform data in cloud data warehouses.</p><p>The content of the post is more about the future and the vision of the next big thing in analytics engineering: new model capabilities. In dbt Core 1.5 we will be able to define:</p><ul><li><strong>Contracts</strong> — you will be able to define column types and constraints and ask dbt to enforce them. If a model does not respect its contract it will not build. In dbt vocabulary <a href="https://docs.getdbt.com/reference/commands/build?ref=blef.fr">build</a> means run + other things.</li><li><strong>Access</strong> — you will be able to namespace models with groups and visibility. Model visibility will be either private, protected or public. This is a preamble to cross-project dependencies, I guess.</li><li><strong>Versions</strong> — you will be able to define versions for models without breaking the downstream consumers. In order to do it you will have multiple SQL files suffixed with the version—<code>_v&lt;version&gt;</code>. To select a specific version you will have to do <code>{{ ref('model_name', version=1) }}</code>.</li></ul><p>I think these improvements are really important to bring analytics engineering to the next level: these new capabilities bring software engineering practices to data asset management. If we add to this the semantic layer news (through <a href="https://thdpth.substack.com/p/why-dbt-labs-acquired-transform?ref=blef.fr">dbt Labs' acquisition of Transform</a>) we are going in the right direction.</p><p></p><h1 id="gen-ai-%F0%9F%A4%96">Gen AI 🤖</h1><ul><li>If you want to understand LLMs, there is a note written by an expert office of the French government. 
You can read it in <a href="https://www.peren.gouv.fr/rapports/2023-04-06_Eclairage%20sur_CHATGPT_FR.pdf?ref=blef.fr">French</a> or in <a href="https://www.peren.gouv.fr/rapports/2023-04-06_Eclairage_sur_CHATGPT_EN.pdf?ref=blef.fr">English</a>. To be honest this is a high-quality note that you can share with people who want to understand all the AI concepts. It might still be a bit too technical to share with your parents.</li><li><a href="https://www.chatclimate.ai/?ref=blef.fr">ChatClimate</a> — This is a chat trained on the latest IPCC report (the GIEC for the French audience). It showcases well the search capabilities of ChatGPT-based systems because every answer is completed with references to the report's chapters.</li><li><a href="https://blog.replit.com/llm-training?ref=blef.fr">How to train your own Large Language Models</a> — Now that you've tried the previous chat, let's say you want to run your own LLM. The Replit team wrote a great overview of what you have to do.</li><li><a href="https://medium.engineering/building-a-chatgpt-plugin-for-medium-6813b59e4b24?ref=blef.fr">Building a ChatGPT Plugin for Medium</a>.</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/04/Screenshot-2023-04-21-at-14.12.24.png" class="kg-image" alt loading="lazy" width="2000" height="670" srcset="https://www.blef.fr/content/images/size/w600/2023/04/Screenshot-2023-04-21-at-14.12.24.png 600w, https://www.blef.fr/content/images/size/w1000/2023/04/Screenshot-2023-04-21-at-14.12.24.png 1000w, https://www.blef.fr/content/images/size/w1600/2023/04/Screenshot-2023-04-21-at-14.12.24.png 1600w, https://www.blef.fr/content/images/size/w2400/2023/04/Screenshot-2023-04-21-at-14.12.24.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>ChatClimate answer to the most important question.</figcaption></figure><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li><a 
href="https://tech.instacart.com/building-a-flink-self-serve-platform-on-kubernetes-at-scale-c11ef19aef10?ref=blef.fr">Building a Flink self-serve platform on Kubernetes at scale</a> — Instacart's engineering team migrated from Flink on EMR to Flink on Kubernetes. This article gives you an overview of the Kubernetes platform they implemented.</li><li><a href="https://github.com/fal-ai/isolate?ref=blef.fr">fal-ai/isolate</a> — Yet another package manager in Python. fal developed a new lightweight package manager to isolate environments at the function level. The project README is not really explicit yet.</li><li><a href="https://adyen.medium.com/data-engineering-at-adyen-ccded12a6eb?ref=blef.fr">Data Engineering at Adyen</a> — "Data engineers at Adyen are responsible for creating high-quality, scalable, reusable and insightful datasets out of large volumes of raw data". This is a good definition of one of the possible responsibilities of DE. This is a great article and they even included a flowchart to identify which role will suit you the most. It is interesting to read this post jointly with <a href="https://medium.com/@AnalyticsAtMeta/the-future-of-the-data-engineer-part-i-32bd125465be?ref=blef.fr">the future of the data engineer at Meta</a>, which gives another, very business-oriented, perspective.</li><li><a href="https://engineering.instawork.com/announcing-dbterra-feca4fb398a5?ref=blef.fr">Announcing dbterra: easily sync your jobs with dbt Cloud™️</a> — Eric developed a tool called dbterra that mixes dbt and Terraform in order to deploy open-source dbt projects to dbt Cloud with configuration as code.</li><li><a href="https://medium.com/@corymaklin/test-driven-development-for-sql-539ed30164ed?ref=blef.fr">Test Driven Development for SQL</a> — A small article that gives you a vanilla BigQuery framework using CTEs to write unit tests. 
I think it has to be improved but it gives a great boilerplate.</li><li><a href="https://medium.com/@alexroperez4/saving-with-bigquery-dbt-35937b1cf628?ref=blef.fr">Saving 💵 With BigQuery &amp; dbt</a> — A few tips to save money when using dbt and BigQuery. Mainly it says that you should consider switching your models to incremental.</li></ul><p></p><h1 id="data-economy-%F0%9F%92%B0">Data Economy 💰</h1><ul><li><strong><a href="https://www.betterdata.ai/?ref=blef.fr">Betterdata</a></strong> <a href="https://techcrunch.com/2023/04/20/betterdata/?ref=blef.fr">raises $1.65m seed round</a>. A Singaporean company that provides a tool that generates synthetic data. Synthetic data is AI-generated data. In Betterdata's case you can use your own datasets and generate data that keeps all the statistical properties needed to do machine learning. This way you can work on data that is similar to yours but different. It's a technique to work with anonymised data.</li><li><strong><a href="https://www.coredb.io/?ref=blef.fr">CoreDB</a></strong> <a href="https://www.coredb.io/blog/introducing-coredb?ref=blef.fr">raises $6.5m seed round</a>. CoreDB is a managed Postgres service that puts the emphasis on extensions in order to add more capabilities to your database cluster. CoreDB has been funded by the ex-CEO-CTO of Astronomer.</li><li>Sadly, a lot of companies recently announced layoffs. The biggest one being Meta, with a new round of 4k, bringing the total to 21,000 people laid off since last November. Astronomer has also let 100 people go recently; if you rely heavily on Airflow it might be interesting to reach out to them.</li><li>Elon Musk, according to reports, founded a <a href="https://www.theverge.com/2023/4/14/23684005/elon-musk-new-ai-company-x?ref=blef.fr">new AI company</a> called X.AI Corp.</li></ul><hr><p>See you next week ❤️.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 23.15 ]]></title>
                    <description><![CDATA[ Data News #23.15 — Yann le Cun interview, hot takes on the modern data stack, costs saving and metrics layer. ]]></description>
                    <link><![CDATA[ /data-news-week-23-15/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 643851f8a39805003d922cf6 ]]></guid>
                    <pubDate><![CDATA[ 2023-04-14 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/04/image-2.png" class="kg-image" alt loading="lazy" width="2000" height="1333" srcset="https://www.blef.fr/content/images/size/w600/2023/04/image-2.png 600w, https://www.blef.fr/content/images/size/w1000/2023/04/image-2.png 1000w, https://www.blef.fr/content/images/size/w1600/2023/04/image-2.png 1600w, https://www.blef.fr/content/images/size/w2400/2023/04/image-2.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>The only AI I'm eager to see (<a href="https://unsplash.com/photos/HBGYvOKXu8A?ref=blef.fr">credits</a>)</figcaption></figure><p>Hey you, the newsletter might be late today again, but this time it is not my fault. The Ghost editor was down when I wanted to write. Anyway, here is the weekly Data News, written faster than usual.</p><h1 id="ai-news-%F0%9F%A4%96">AI News 🤖</h1><p>Yann le Cun gave a 10-minute interview on a major French radio station. If you want to read the French transcript you can do it <a href="https://www.radiofrance.fr/franceinter/yann-le-cun-la-technologie-cree-de-nouveaux-metiers-en-supprime-d-autres-reconnait-l-un-des-peres-des-ia-5596389?ref=blef.fr">here</a>. Mainly what he says:</p><ul><li>There is no doubt that one day there will be machines at least as intelligent as humans. But ChatGPT isn't: it gives the impression, but it is not.</li><li>AI can amplify human intelligence like machines amplify human strength.</li><li>Technology shifts jobs. For instance, before the industrial revolution the major part of the French population was working in the fields; now it's less than 2%. It means we shouldn't be afraid of technology replacing jobs. He thinks this will also allow more people to be creative.</li><li>Regarding fake news and ethics he draws a comparison with e-mails. 
He thinks that just as we developed spam filters to catch fake mails, we will develop the same to catch fake news.</li><li>For Yann there is nothing revolutionary about ChatGPT, but he admits it's good engineering. It is just a normal evolution of deep learning systems.</li><li>(<em>Last one because it's funny</em>). He bets that in 10-15 years (or more) we will not have smartphones anymore but augmented reality glasses. We will also use voice to interact with machines, so we can interact with them hands in pockets—I can't wait to use Siri and Alexa 2.0.</li></ul><p>As a side project, if you want to practice machine learning this weekend you can replicate Rihab's project: <a href="https://rihab-feki.medium.com/ml-project-using-yolov8-roboflow-dvc-and-mlflow-on-dagshub-3e5c0b026297?ref=blef.fr">detect wildfire smoke with a YOLOv8 model</a>.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/04/image-3.png" class="kg-image" alt loading="lazy" width="2000" height="1333" srcset="https://www.blef.fr/content/images/size/w600/2023/04/image-3.png 600w, https://www.blef.fr/content/images/size/w1000/2023/04/image-3.png 1000w, https://www.blef.fr/content/images/size/w1600/2023/04/image-3.png 1600w, https://www.blef.fr/content/images/size/w2400/2023/04/image-3.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>:/ (<a href="https://unsplash.com/photos/_AwSiaesk40?ref=blef.fr">credits</a>)</figcaption></figure><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li><a href="https://medium.com/@jeremysrgt/airbyte-configuration-as-code-with-octavia-cli-dccd2046b764?ref=blef.fr">A tour of Airbyte’s Octavia CLI</a> — Airbyte, an open-source extract-load platform, released a CLI called Octavia a few months ago that lets you create integration pipelines. 
Jeremy wrote a post that showcases how to do it.</li><li><a href="https://mattpalmer.io/posts/hot-takes/?ref=blef.fr">Hot takes on the Modern Data Stack</a> — Matt gives 5 hot takes about the MDS. I don’t totally agree with everything but this is a good read. He says that Redshift is no longer competing in the warehousing space, which I agree with. He also says that Airflow is obsolete; I disagree. It has become common recently to say bad things about Airflow, but as always the issue is between the chair and the keyboard. He is also hard on Airbyte and dbt.</li><li><a href="https://benn.substack.com/p/the-new-philosophers?ref=blef.fr">The new philosophers</a> — It's been a long time since I've shared Benn's posts. Still my favorites. Saying smart things, week after week. This time he writes about the new marketing approach of the modern data stack ecosystem. Plenty of tools, so let's develop new tools to avoid the other tools. He also adds his views about the ChatGPT disruption: "<em>We'll initially try to insert LLMs into the game we're currently playing [...]. Our data models won’t be augmented by LLMs; they’ll be built for LLMs</em>". Probably no-one knows, yet, what it means.</li></ul><div class="kg-card kg-callout-card kg-callout-card-grey"><div class="kg-callout-emoji">💡</div><div class="kg-callout-text">In a presentation I made this week I wrote "<strong>Gen AI + Semantic Layer = self-service ?"</strong>. I think it sums up very well where we are today. But as Robert says "No tool can fix people, behaviors, process, and the semantic layer, however conceptually elegant or impactful, is no exception" (<a href="https://win.hyperquery.ai/p/the-semantic-layer-and-the-self-service?ref=blef.fr">read here</a>).</div></div><ul><li><a href="https://www.castordoc.com/blog/now-live-castor-ai?ref=blef.fr">Castor announced Castor AI</a> — Going the other way, Castor released a feature that explains a SQL query in natural language. 
This is a good way to help business users understand what's happening in the transformation layer.</li><li><a href="https://medium.com/teads-engineering/how-we-made-our-reporting-engine-17x-faster-652b9e316ca4?ref=blef.fr">How we made our reporting engine 17x faster</a> — Teads' engineering team explains how they significantly sped up their ads report generation. In a nutshell they replaced Spark (EMR) in-memory transformations with BigQuery.</li><li><a href="https://engineering.atspotify.com/2023/04/large-scale-generation-of-ml-podcast-previews-at-spotify-with-google-dataflow/?ref=blef.fr">Large-Scale generation of ML podcast previews at Spotify with Google Dataflow</a> — Generating previews at scale has become a common challenge for vast content platforms. This time Spotify explains how they did it with Apache Beam. As an input they take audio and transcript data and they generate podcast previews that will appear in your feed.</li><li><a href="https://eng.lyft.com/big-savings-on-big-data-9c74b7a35326?ref=blef.fr">Big savings on Big Data</a> — This is the current trend: with the current economic situation we have to do more with less (or at least with what we have). At Lyft they optimised their ML platform to save time and money on workloads. In particular they lowered all the dev costs.</li><li><a href="https://www.lastweekinaws.com/blog/localstack-why-local-development-for-cloud-workloads-makes-sense/?ref=blef.fr">LocalStack: Why local development for cloud workloads makes sense</a> — It ties in with the previous bullet point. This time Corey writes about LocalStack, a tool that emulates AWS APIs locally. Emulation could be the future, mainly because it avoids increasing cloud costs for development.</li><li><a href="https://towardsdatascience.com/using-duckdb-with-polars-e15a865e48a3?ref=blef.fr">Using DuckDB with Polars</a> — A nice showcase of the 2 new kids on the block working together. 
Mainly, what you will do is query Polars dataframes in SQL with DuckDB.</li><li><a href="https://doordash.engineering/2023/04/12/using-metrics-layer-to-standardize-and-scale-experimentation-at-doordash/?ref=blef.fr">Using Metrics Layer to standardize and scale experimentation at DoorDash</a> — A very good exhaustive article about a metrics layer. At DoorDash a lot of teams are doing experimentation and they were in need of a common ground for metric definitions. That’s why they built this system. Mainly they define measures, dimensions and metrics in YAML that will be materialised and made accessible to Curie (their experimentation platform).</li></ul><p></p><h1 id="data-economy-%F0%9F%92%B0">Data Economy 💰</h1><ul><li><strong>Cybersyn</strong> <a href="https://www.cybersyn.com/blog-series-a/?ref=blef.fr">raises $62.9m Series A</a>. Cybersyn is a data-as-a-service platform that provides public datasets for everyone. You can see it as a datasets marketplace of common public data. They are heavily supported by Snowflake so the datasets are accessible in the Snowflake marketplace. For instance you can freely query the <a href="https://app.snowflake.com/marketplace/listing/GZTSZAS2KEE/cybersyn-inc-us-addresses?ref=blef.fr">US Addresses</a> dataset to get all the addresses in a zipcode.</li><li><strong>Rupert</strong> <a href="https://blog.hirupert.com/meet-rupert-delivering-business-outcomes-from-your-analytics/?ref=homepage">raises $8m in funding</a>. Rupert wants to fill the gap between the data analyst and the business users by providing a no-code UI to create data alerts on top of your semantic layer.</li></ul> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 23.14 ]]></title>
                    <description><![CDATA[ Data News #23.14 — Data modeling guide, entity-centric modeling, SQLMesh, GenAI: Italy bans, Samsung leak, Vicuna open-source model, reducing the lottery factor. ]]></description>
                    <link><![CDATA[ /data-news-week-23-14/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 642e6b08758865003d6bd146 ]]></guid>
                    <pubDate><![CDATA[ 2023-04-08 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/04/image.png" class="kg-image" alt loading="lazy" width="800" height="533" srcset="https://www.blef.fr/content/images/size/w600/2023/04/image.png 600w, https://www.blef.fr/content/images/2023/04/image.png 800w" sizes="(min-width: 720px) 720px"><figcaption>Data News entering town (<a href="https://unsplash.com/photos/2v2Mbo6ibrw?ref=blef.fr">credits</a>)</figcaption></figure><p>Hey you, if I wasn't late in my newsletter writing it wouldn't be me. But here is your usual Data News. The main reason behind this delay is that I played with LLMs yesterday. I tried to run open-source models locally on my own laptop. There are still a few bugs and the results are not really at OpenAI level but it is fun to do.</p><p>This Tuesday we hosted the second part of the Airflow alternatives meetup with Prefect and Dagster. You can find the replay on <a href="https://www.youtube.com/watch?v=2f7KJcFbUs0&ref=blef.fr">YouTube</a>.</p><p></p><h1 id="data-modeling">Data modeling</h1><p>Dear readers, I have to confess something. I did not care about data modeling for years. I mean, in the sense everyone understands it today: for 7 professional years I never built a star schema or anything similar. I was in the Hadoop world and all I was doing was denormalisation. Denormalisation everywhere. The only normalisation I did was back at engineering school while learning SQL with <a href="https://en.wikipedia.org/wiki/Database_normalization?ref=blef.fr#Normal_forms">Normal Forms</a>.</p><p>Actually what I cared about was physical storage, data formats, logical partitioning or indexing. </p><p>But actually it's normal: my role was not to translate the business into tables. I still firmly believe that this is not the role of a data engineer. A data engineer should still be a software engineer working with data, empowering others with tooling and apps. 
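The denormalisation I'm talking about can be sketched in a few lines; here is a toy example using Python's built-in sqlite3 (the tables and columns are invented for illustration, not any production schema):

```python
import sqlite3

# Toy example: a normalised pair of tables, then the denormalised
# ("pre-joined") table a Hadoop-era pipeline would typically materialise.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, country TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY,
                         customer_id INTEGER REFERENCES customers(customer_id),
                         amount REAL);
    INSERT INTO customers VALUES (1, 'FR'), (2, 'US');
    INSERT INTO orders VALUES (10, 1, 9.9), (11, 1, 5.0), (12, 2, 42.0);

    -- Denormalisation: flatten the join once at write time, so downstream
    -- reads never join again (at the cost of repeating customer attributes).
    CREATE TABLE orders_denorm AS
    SELECT o.order_id, o.amount, c.customer_id, c.country
    FROM orders o JOIN customers c USING (customer_id);
""")
rows = con.execute(
    "SELECT order_id, country FROM orders_denorm ORDER BY order_id"
).fetchall()
print(rows)  # [(10, 'FR'), (11, 'FR'), (12, 'US')]
```

The trade-off is the classic one: redundant storage and update anomalies in exchange for read paths that need no joins.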
Data modeling should not be a required data engineer skill. Enter the analytics engineer.</p><p>Still, I feel that there is a hole in my skillset because I can't give relevant advice when it comes to modeling a business with 3 fact tables instead of 5. And to be honest there isn't any good <em>modern</em> literature to answer this question. Simon started a multipart <a href="https://airbyte.com/blog/data-modeling-unsung-hero-data-engineering-introduction?ref=blef.fr">guide about data modeling</a>. I hope he will fill the gaps. In the first part he covers the history of modeling and the main concepts.</p><p>At the same time Maxime Beauchemin wrote a post about <a href="https://preset.io/blog/introducing-entity-centric-data-modeling-for-analytics/?ref=blef.fr">Entity-Centric data modeling</a>. In comparison to dimensional modeling it uses entities instead of facts, which is easier to understand conceptually but also easier to use in machine learning.</p><p>When it comes to modeling it's hard not to mention dbt. In recent years dbt simplified and revolutionised the tooling to create data models. dbt, as of today, is the leading framework. But alternatives are coming. This week I discovered <a href="https://github.com/TobikoData/sqlmesh?ref=blef.fr">SQLMesh</a>, an all-in-one data pipeline tool. SQLMesh lets you define models like dbt but spares you the burden of the Jinja ref/sources macros. Under the hood it uses <a href="https://github.com/tobymao/sqlglot?ref=blef.fr">sqlglot</a>, the SQL parser developed by the same developer. It seems there is also a scheduler and a web UI included in the open-source version.</p><p></p><h1 id="gen-ai-%F0%9F%A4%96">Gen AI 🤖</h1><ul><li><a href="https://gizmodo.com/chatgpt-ai-samsung-employees-leak-data-1850307376?ref=blef.fr">It seems that Samsung employees leaked data to ChatGPT</a> — Unsurprisingly OpenAI saves all the prompts we type (🫠) and can eventually improve models incrementally. 
It seems that Samsung employees gave confidential information to ChatGPT. Which means that OpenAI owns Samsung data. But is it really different than what we already have with Gmail or AWS? Or like when <a href="https://jalopnik.com/tesla-employees-share-video-inside-customer-cars-1850307909?ref=blef.fr">Tesla employees were watching customers' in-car footage for years</a>.</li><li><a href="https://www.cnbc.com/2023/04/04/italy-has-banned-chatgpt-heres-what-other-countries-are-doing.html?ref=blef.fr">Italy decided to ban ChatGPT</a> — In order to do it the Italian data protection watchdog ordered OpenAI to temporarily cease processing Italian users' data. France and Germany might follow.</li><li><a href="https://openai.com/blog/our-approach-to-ai-safety?ref=blef.fr">OpenAI: Our approach to AI safety</a> — 4 axes in which OpenAI wants to invest: improve safeguards, protect children, respect privacy and improve factual accuracy.</li><li><a href="https://cims.nyu.edu/~sbowman/eightthings.pdf?ref=blef.fr">Eight things to know about Large Language Models</a> — A PDF that will give me a headache.</li><li>On the practical side I tried to run an LLM locally on my M1 Mac for the first time and it was a fun ride. In a nutshell I wanted to first run <a href="https://github.com/lm-sys/FastChat/?ref=blef.fr">Vicuna</a>, an open-source chatbot that has <a href="https://vicuna.lmsys.org/?ref=blef.fr">great results when compared to GPT3.5</a>. In order to run Vicuna (or other similar open-source models) you need to get the weights of LLaMA, Meta's 65B params foundation model. You can get the model either by completing a Google form and waiting, or via other channels reminding me of the early days of the internet 🧲. Apart from the fact that the inference was super slow—while using dozens of GB of RAM—the results were not as good as ChatGPT but still great. 
If you find it interesting, tell me and I'll write a post about what I launched and how.</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/04/image-1.png" class="kg-image" alt loading="lazy" width="800" height="533" srcset="https://www.blef.fr/content/images/size/w600/2023/04/image-1.png 600w, https://www.blef.fr/content/images/2023/04/image-1.png 800w" sizes="(min-width: 720px) 720px"><figcaption>Rare footage of a foundation model (<a href="https://unsplash.com/photos/73FOXT1DvjI?ref=blef.fr">credits</a>)</figcaption></figure><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li><a href="https://blog.twitter.com/engineering/en_us/topics/open-source/2023/twitter-recommendation-algorithm?ref=blef.fr">Twitter's recommendation algorithm</a> — It started with an Elon tweet. Twitter published their recommendation algorithm on GitHub (<a href="https://github.com/twitter/the-algorithm?ref=blef.fr">here</a> and <a href="https://github.com/twitter/the-algorithm-ml?ref=blef.fr">here</a>) and they wrote a blogpost explaining how the recommendation works. The machine learning is mainly in Python and uses PyTorch. But the algorithm as a whole contains a lot of features, filters and network algorithms.</li><li><a href="https://learn.microsoft.com/en-us/power-platform/release-plan/2023wave1/data-integration/?ref=blef.fr">Microsoft data integration new capabilities</a> — A few months ago I entered the Azure world. Not really without pain. Today, Microsoft announces new low-code capabilities for Power Query in order to do "data preparation" from multiple sources. Disclaimer: I don't use Power Query and I don't plan to ever use it.</li><li><a href="https://erdavis.com/2023/04/05/one-year-as-a-dataviz-journalist/?ref=blef.fr">One year as a dataviz journalist</a> — Saturday is a good day to have a look at great data visualisations. 
Erin celebrates his 1-year anniversary as a viz journalist by highlighting the work he is proud of. I really like the "Farthest distance between World Cup stadiums" or the paths to become CCO.</li><li><a href="https://stkbailey.substack.com/p/life-after-orchestrators?ref=blef.fr">Life after orchestrators</a> — Benjamin thinks that orchestrators are legacy systems and that we should all move to the real-time world where everything is simpler. No need to add triggers and synchronise workflows together. Side note: Ben co-founded Popsink, a real-time ETL company.</li><li><a href="https://ai.facebook.com/blog/segment-anything-foundation-model-image-segmentation/?ref=blef.fr">Meta introduces Segment Anything</a> — A new <a href="https://en.wikipedia.org/wiki/Foundation_models?ref=blef.fr">Foundation model</a> enters the game. Its name is SAM, and SAM wants to identify which image pixels belong to an object. Will traditional computer vision be the next space to become has-been with the new AI innovations?</li><li>❤️ <a href="https://locallyoptimistic.com/post/reducing-the-lottery-factor-for-data-teams/?ref=blef.fr">Reducing the lottery factor, for data teams</a> — If you had to read only one article today you should read this one. The lottery factor, also known as the bus factor, is a risk measurement about knowledge sharing. In data teams a lot of work has to be done in the early days to avoid knowledge being lost later on. The article gives ~10 pieces of advice to apply to lower the risks. Among them I like the changelog, the pair-programming, the pre-recorded video and the stable credentials.</li><li><a href="https://count.co/canvas/vWnN0JCglDd?ref=blef.fr">The ultimate guide to hire your data team</a> — An awesome canvas to conduct data interviews. This guide will help you before and during the interview. 
It includes a great list of example questions that you could ask in interviews.</li></ul><p><em>PS: <a href="https://www.datacouncil.ai/austin?ref=blef.fr">Data Council</a> took place in Austin a few days ago. As soon as the videos are out on YouTube I'll do a wrap-up of the sessions. Data Council is usually a moment of the year when the US data gratin gathers to discuss.</em></p><p></p><h1 id="data-economy-%F0%9F%92%B0">Data Economy 💰</h1><ul><li><strong><a href="https://getdozer.io/?ref=blef.fr">Dozer</a></strong> <a href="https://techcrunch.com/2023/04/03/dozer-exits-stealth-to-help-any-developer-build-real-time-data-apps-in-minutes/?ref=blef.fr">raises $3m seed round</a>. Dozer is a platform to develop real-time data apps, looking like a real-time ETL platform. With Dozer you can connect to multiple sources, do transformations (SQL, Python or JS) and then expose the output in APIs for frontend consumers (React, Vue or Python), all configured in YAML. It also looks like Dozer is not really under a proper open-source license. If you want to go deeper: to me Dozer looks like <a href="https://materialize.com/?ref=blef.fr">Materialize</a> or <a href="https://www.popsink.com/?ref=blef.fr">Popsink</a> but with a different vision, offering an API as a serving layer rather than a database.</li><li><strong><a href="https://www.roboto.ai/?ref=blef.fr">Roboto AI</a></strong> <a href="https://www.roboto.ai/post/roboto-raises-seed-funding-4-8m?ref=blef.fr">raises $4.8m seed round</a>. I hate this as much as I find it interesting. Roboto AI wants to create an AI-powered toolbox for people in robotics. In their demo you can use prompts to search over images or timeseries.</li></ul><hr><p>See you next week ❤️.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 23.13 ]]></title>
                    <description><![CDATA[ Data News #23.13 — Google BigQuery pricing changes, Looker Modeler, Bill Gates AI vision, open-letter to pause AI experiments, and usual Fast News. ]]></description>
                    <link><![CDATA[ /data-news-week-23-13/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 64256d0d3b0c15003d2135dc ]]></guid>
                    <pubDate><![CDATA[ 2023-03-31 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/03/image-10.png" class="kg-image" alt loading="lazy" width="800" height="534" srcset="https://www.blef.fr/content/images/size/w600/2023/03/image-10.png 600w, https://www.blef.fr/content/images/2023/03/image-10.png 800w" sizes="(min-width: 720px) 720px"><figcaption>This newsletter is about money (<a href="https://unsplash.com/photos/2FWZEVs3XDE?ref=blef.fr">credits</a>)</figcaption></figure><p>Dear readers, already 3 months done in 2023. We are slowly approaching the 2-year anniversary of the blog and the newsletter. We are almost 3000 and once again I want to thank you for the trust. To be honest time flies and I’d have preferred to do more for the blog at the start of the year, but my freelancing activities and my laziness took too much of my time.</p><p>By the way, recently I’ve worked with Azure tooling and I changed my mind a bit. I had tried Azure years ago and the only memory I had of it was that it was not working. Like you ask for a VM and you don’t get a VM. But obviously it changed. Except for the fact that they have a pretty bad vocabulary for things, it works, and the UI is surprisingly pleasant to use.</p><p>With this experience my personal preference hierarchy, which is subjective, changed to GCP &gt; Azure &gt; AWS. 
Still, one complaint I have is about the documentation: sometimes docs pages are not of great quality, the presentation is pretty bad and full of usage examples when we only need complete documentation of resources.</p><p>If you did not register yet: next week we’ll host an online meetup about Airflow alternatives, and the Prefect and Dagster teams will do a demo.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/03/Meetup--4-18--1.png" class="kg-image" alt loading="lazy" width="2000" height="433" srcset="https://www.blef.fr/content/images/size/w600/2023/03/Meetup--4-18--1.png 600w, https://www.blef.fr/content/images/size/w1000/2023/03/Meetup--4-18--1.png 1000w, https://www.blef.fr/content/images/size/w1600/2023/03/Meetup--4-18--1.png 1600w, https://www.blef.fr/content/images/size/w2400/2023/03/Meetup--4-18--1.png 2400w" sizes="(min-width: 720px) 720px"><figcaption><a href="https://www.linkedin.com/events/7044560795581526017/comments/?ref=blef.fr">Join the event on LinkedIn</a> on April 4th 7PM CET (UTC+2)</figcaption></figure><p>PS: brace yourself, April Fools' is tomorrow. I've already seen a few "jokes" on LinkedIn.</p><p></p><h1 id="google-data-cloud-ai-summit">Google Data Cloud &amp; AI Summit</h1><p>Two days ago Google announced new things at their Data Cloud &amp; AI Summit. Here is a small recap of what has been announced.</p><h3 id="pricing-changes">Pricing changes</h3><p>First, a new <a href="https://cloud.google.com/blog/products/data-analytics/introducing-new-bigquery-pricing-editions?hl=en&ref=blef.fr">BigQuery pricing model</a>. Big changes—or should I say BigChanges—the flat-rate pricing will no longer be accessible starting on July 5 and will be replaced by a <a href="https://cloud.google.com/bigquery/pricing?ref=blef.fr#capacity_compute_analysis_pricing">capacity pricing</a> similar to what Snowflake is doing. <strong>It will start at $0.04 per slot hour. 
It is hard to compare with the previous flat-rate pricing, but the previous pricing was more around $0.028 per slot hour, so a ~42% increase. </strong>Still, Google says that it will lower your BigQuery costs because they have the smartest autoscaler on earth 🫠 and will run only what's perfectly needed for your queries.</p><p>Let's take an example: previously you could run 100 BigQuery slots at every moment in time for $2,000 a month. Tomorrow, for the same amount ($2,000), you would be able to run roughly 160 slots if you use BigQuery only 10h per day (at $0.04 per slot hour, $2,000 buys 50,000 slot hours a month; spread over 10h × 30 days = 300 hours, that's roughly 160 slots). This change means less computing power on average for the same price, but with higher peaks.</p><p>As good news never comes alone, they will also increase the on-demand pricing by 25% starting July 5. It will cost $6.25 per TB compared to $5 before.</p><p>They also announced a "significant" increase in compression performance, so you should switch your storage pricing from logical (uncompressed) to physical (compressed—the actual bytes stored on disk). Compressed storage is at least twice as expensive as uncompressed, but as they announce a 12:1 (previously 10:1) compression ratio your company wallet will be the winner. The icing on the cake.</p><p>This new pricing is sad to see. Excluding the increase, I believe that for years one of the strongest advantages of BigQuery was its apparent pricing transparency. Now you need to do multiplications and read 5 pages to understand the pricing.</p><p>5 paragraphs about the pricing, it was unexpected.</p><h3 id="looker-modeler">Looker Modeler</h3><p>They also announced <a href="https://cloud.google.com/blog/products/data-analytics/introducing-looker-modeler?hl=en&ref=blef.fr">Looker Modeler</a>, a single source of truth for BI metrics. The wait is finally over: we have Google's take on the semantic layer, an evolution of LookML, which was one of the first semantic layers. 
They created Looker Modeler as a metrics layer that will be accessible by all the applications downstream.</p><p>In a nutshell it will mean—in Looker vocabulary:</p><ul><li>Data Engineers will create sources that will be available in Looker</li><li>Analytics Engineers (or Data Analysts) will create <em>Views</em> from a table with <em>Dimensions</em> and <em>Metrics</em> thanks to LookML</li><li>Then AEs will create <em>Models</em> on top of <em>Explores</em>—Explores are joined Views</li><li>Then you will be able to access the <em>Models</em> through Looker Modeler via a JDBC interface or a REST API.</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/03/Screenshot-2023-03-31-at-17.04.25.png" class="kg-image" alt loading="lazy" width="1930" height="1086" srcset="https://www.blef.fr/content/images/size/w600/2023/03/Screenshot-2023-03-31-at-17.04.25.png 600w, https://www.blef.fr/content/images/size/w1000/2023/03/Screenshot-2023-03-31-at-17.04.25.png 1000w, https://www.blef.fr/content/images/size/w1600/2023/03/Screenshot-2023-03-31-at-17.04.25.png 1600w, https://www.blef.fr/content/images/2023/03/Screenshot-2023-03-31-at-17.04.25.png 1930w" sizes="(min-width: 720px) 720px"><figcaption>Screenshot from "<a href="https://cloudonair.withgoogle.com/events/summit-data-cloud-2023/watch?talk=t2_s5_trustedmetricseverywhere&ref=blef.fr">Trusted metrics everywhere</a>" keynote.</figcaption></figure><p>Thanks to this we will be able to read LookML data from Tableau. Awesome!</p><blockquote>If you teach someone SQL they can help themselves, if you teach them LookML they help everyone.</blockquote><h3 id="gen-app-builder">Gen App Builder</h3><p>As an answer to the OpenAI offering, Google Cloud started to propose cloud offerings around generative AI. 
They announced a <a href="https://cloud.google.com/blog/products/ai-machine-learning/create-generative-apps-in-minutes-with-gen-app-builder/?hl=en&ref=blef.fr">web UI to create conversational AIs</a>. In the demo you upload a FAQ (in CSV) and a "How to guide" (in PDF), you pick either "Chat" or "Search" mode, and it will generate an app, personalised with your data, that you can give to your customers. You can also feed the pre-trained models with BigQuery tables, GCS buckets or website URLs.</p><p></p><h1 id="gen-ai-%F0%9F%A4%96">Gen AI 🤖</h1><p>The Google summit's last section was a perfect transition to the Gen AI category. Every week is richer than the previous one.</p><ul><li>Bill Gates published a note: <a href="https://www.gatesnotes.com/The-Age-of-AI-Has-Begun?WT.mc_id=20230321100000_Artificial-Intelligence_BG-TW_&WT.tsrc=BGTW&ref=blef.fr">The Age of AI has begun</a>. He says that GPT being able to ace a university Bio exam is the <em>most important advance in technology since the graphical user interface</em> in 1980. He thinks that AI will help reduce the world's inequities, like health inequities—I doubt that; like pills and medicine, only rich and educated people will benefit from AI in the end. Then he enlightens us on how AI can change health and education systems, what the risks are and what the next frontiers could be.</li><li>Then everyone started to freak out. An open letter has been written to ask for a pause—at least 6 months—in <a href="https://futureoflife.org/open-letter/pause-giant-ai-experiments/?ref=blef.fr">giant AI experiments</a>; it has been signed by almost 2000 people, including some notorious CEOs, researchers and Elon Musk. This pause should be used to develop AI governance systems with policymakers.</li><li>Gary Marcus debated the<a href="https://garymarcus.substack.com/p/ai-risk-agi-risk?ref=blef.fr"> AI risk ≠ AGI risk</a> question. AGI means artificial general intelligence. 
He mainly thinks that LLMs are an “off-ramp” on the road to AGI.</li><li><a href="https://a16z.com/2023/03/30/b2b-generative-ai-synthai/?ref=blef.fr">For B2B generative AI apps, is less more?</a> — a16z, a huge US VC, predicts that we will enter a second wave of AI called SynthAI. <strong>Currently we generate information based on prompts; in wave 2 we will generate insights based on information</strong>. This wave seems critical for B2B because AI should help decision making, and it needs to be concise for that.</li><li>As a fun use-case, John wrote how we can<a href="https://data.blacksquare.io/rethinking-product-search-in-the-world-of-chatgpt-57b435b5ce3c?ref=blef.fr"> rethink whisky search in the world of ChatGPT</a>.</li></ul><div class="kg-card kg-callout-card kg-callout-card-grey"><div class="kg-callout-emoji">🙃</div><div class="kg-callout-text">Last week I said I wanted to write about the current state of self-service; you will have to wait a bit more because I got lost in today's set of news. But if you want a small takeaway: I don't think we will achieve self-service only with ChatGPT, the issue is not only in the technology but also in the people.<br><br>Still, LLMs and semantic layers are good initiatives to achieve this everlasting dream. But first ask yourself: will my CEO trust a bot answering on Slack or someone from the accounting team delivering an overcrowded Excel?</div></div><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li><a href="https://robertsahlin.substack.com/p/the-data-engineer-is-dead-long-live?ref=blef.fr">The data engineer is dead, long live the data platform engineer</a> — This is a current trend of the market that has been accelerated by the appearance of the analytics engineer. As AEs are, theoretically, in between DEs and DAs, it pushes the actual role borders further, meaning DEs will have to do more infra stuff to support analytics initiatives. This is normal and Robert brings more to the table in his blogpost. 
You can also check this <a href="https://medium.com/@loresowhat/a-map-to-explain-all-data-roles-970ed69ba1?ref=blef.fr">map that explains all data roles</a>.</li><li><a href="https://www.bbc.co.uk/ideas/videos/five-charts-that-changed-the-world/p0fb69c1?playlist=made-in-partnership-with-the-royal-society&ref=blef.fr">Five charts that changed the world</a> — A 5-minute video by the BBC that shows 5 awesome charts that changed the course of the world.</li><li><a href="https://github.com/datafuselabs/databend?ref=blef.fr">Databend, an open-source version of Snowflake</a> — at least this is what they claim. This week I discovered this "open-source data warehouse written in Rust". I'll try it out when I have time. If you have tried it I'd love to get your feedback.</li><li><a href="https://docs.getdbt.com/blog/audit-helper-for-migration?ref=blef.fr">audit_helper in dbt</a> — A blog post that showcases how you can use the dbt audit_helper package to improve your models. As the competition gets harder every week, Datafold also wrote a blogpost comparing <a href="https://www.datafold.com/blog/dbt-audit-helper-vs-data-diff?ref=blef.fr">audit_helper vs. data-diff</a>.</li><li><a href="https://github.com/dbt-checkpoint/dbt-checkpoint?ref=blef.fr">dbt-checkpoint, a list of pre-commit hooks to ensure dbt quality</a> — A list of 40 pre-commit hooks written in Python that you can use to improve the quality of your dbt projects. It includes the useful <code>check-script-has-no-table-name</code> to check that there are no table name leftovers.</li><li><a href="https://engineering.mixpanel.com/strategies-for-effective-data-compaction-a48718021b7b?ref=blef.fr">Strategies for effective data compaction</a> — From Mixpanel: how they developed an event compactor system with PubSub.</li><li>The <a href="https://twitter.com/erlichya/status/1639973591214182400?ref=blef.fr">2 inventors of the Lempel-Ziv algorithm</a> that is used in all ZIP files died recently. 
They wrote their proposal in <strong>1977</strong>. RIP.</li></ul><p></p><hr><p>Sorry for this longer edition, and see you next week ❤️.</p><p>PS: if you follow me on LinkedIn you might see this content recycled there because.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 23.12 ]]></title>
                    <description><![CDATA[ Data News #23.12 — Mage and Kestra takeaways, OpenAI plugins system and impact on job market, Reddit outage post mortem, etc. ]]></description>
                    <link><![CDATA[ /data-news-week-23-12/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 641cd59c0df1c6003d438548 ]]></guid>
                    <pubDate><![CDATA[ 2023-03-24 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/03/image-7.png" class="kg-image" alt loading="lazy" width="2000" height="1125" srcset="https://www.blef.fr/content/images/size/w600/2023/03/image-7.png 600w, https://www.blef.fr/content/images/size/w1000/2023/03/image-7.png 1000w, https://www.blef.fr/content/images/size/w1600/2023/03/image-7.png 1600w, https://www.blef.fr/content/images/size/w2400/2023/03/image-7.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>The Earth can also generate great images (<a href="https://unsplash.com/photos/8A8NJYytdFo?ref=blef.fr">credits</a>)</figcaption></figure><p>Dear readers, I hope this new edition finds you well. It seems that you really liked the recent editions, which is perfect because they were fun to write. I feel that this week all the articles I found relevant for the newsletter are either AI-related or technical. I really don't know how to deal with the news overflow about the Gen AI landscape. Do you like all the GenAI hype? <a href="https://www.blef.fr/survey?vote=2312_yes">👍</a> or <a href="https://www.blef.fr/survey?vote=2312_no">👎</a></p><p></p><h1 id="airflow-alternatives-meetup">Airflow alternatives meetup</h1><p>This Tuesday the first part of the Airflow alternatives meetup, with <strong>Mage and Kestra</strong>, took place. It was an awesome online meetup. I really liked the presentations from Mage and Kestra, and even if I was focused on hosting the event it was great to see 2 other visions about the future of orchestration. 
Which, to be honest, are not really far from Airflow's.</p><div class="kg-card kg-button-card kg-align-center"><a href="https://www.youtube.com/watch?v=sAc-uNvlveY&ref=blef.fr" class="kg-btn kg-btn-accent">📺 Watch the full replay</a></div><p>Here are my takeaways about the event:</p><ul><li>Mage and Kestra have both been developed with Airflow's flaws in mind, especially deployment complexity, reusability and data sharing between tasks.</li><li>The tagline "<em>Modern replacement for Airflow</em>" on Mage's side makes sense. Out of the box Mage provides an all-in-one web editor to write data pipelines with a great UX. In small browser text areas you will be able to write Python, SQL or R code and orchestrate these transformations with drag-n-drop. I personally hate developing in the browser, but the promise looks good. Actually, Mage and the current Airflow version are—almost—the same; the main difference is the UX when developing pipelines. </li><li>Tommy, Mage's CEO, said that for the moment they will focus on building the best open-source data pipeline tool. They have enough funding for the next 2 years. </li><li>Facing reality, even if worker management seems easier in Mage, the deployment is not yet ready to go: either you go with a Terraform script that will launch elastic containers, or you go with Helm, which requires Kubernetes.</li><li>Now Kestra, one of the latest kids on the block. Ludovic, the CTO, who presented Kestra at the event, said that he started the development while on a mission at Leroy Merlin where people were <a href="https://kestra.io/blogs/2022-02-22-leroy-merlin-usage-kestra.html?ref=blef.fr">heavily unhappy about Airflow</a>. Kestra is a YAML-based data pipeline tool mixed with string templating. The YAML approach allows less-technical users to write pipelines.</li><li>Kestra's vision is also very open: everything is accessible through APIs, which leads to a variety of usages within a company. 
Under the hood Kestra is developed in Java, which is totally different from the other alternatives.</li><li>In the future Kestra could easily look like Mage, YAML being the mid-step before a "drag-n-drop"-like UI.</li></ul><p>It was so fun to organise this event and I'd love to do more lives in the future with blef.fr. Still, in 2 weeks, on April 4th, part 2 of the event with Prefect and Dagster will take place; I hope I'll see you there.</p><figure class="kg-card kg-image-card kg-card-hascaption"><a href="https://www.linkedin.com/events/7044560795581526017/comments/?ref=blef.fr"><img src="https://www.blef.fr/content/images/2023/03/Meetup--4-18-.png" class="kg-image" alt loading="lazy" width="2000" height="433" srcset="https://www.blef.fr/content/images/size/w600/2023/03/Meetup--4-18-.png 600w, https://www.blef.fr/content/images/size/w1000/2023/03/Meetup--4-18-.png 1000w, https://www.blef.fr/content/images/size/w1600/2023/03/Meetup--4-18-.png 1600w, https://www.blef.fr/content/images/size/w2400/2023/03/Meetup--4-18-.png 2400w" sizes="(min-width: 720px) 720px"></a><figcaption>You should <a href="https://www.linkedin.com/events/7044560795581526017/comments/?ref=blef.fr">register for part 2</a></figcaption></figure><p></p><h1 id="gen-ai-%F0%9F%A4%96">Gen AI 🤖</h1><p>The newsletter is already too big for today so I'll try to keep it short, especially on Gen AI, which is already spammed everywhere.</p><p>OpenAI is slowly starting to create a gigantic ecosystem and could become the next GAFA-like company. The <a href="https://openai.com/blog/introducing-openai?ref=blef.fr">non-profit research company manifesto</a> is already far away. OpenAI released a <a href="https://arxiv.org/pdf/2303.10130.pdf?ref=blef.fr">study</a> about the impact of Large Language Models—LLMs—on the job market (sorry, I wanted to read the pdf but my brain is already grilled) and announced <a href="https://openai.com/blog/chatgpt-plugins?ref=blef.fr">ChatGPT plugins</a>. 
In a nutshell, OpenAI has created an AI interface that everyone likes and will add on top of it an App Store experience with plugins. It reminds me of something.</p><p>Because OpenAI is not everything, some news from the alternative world. Mozilla announced <a href="https://blog.mozilla.org/en/mozilla/introducing-mozilla-ai-investing-in-trustworthy-ai/?ref=blef.fr">Mozilla.ai</a>, a community-based open-source AI ecosystem; Stanford researchers released <a href="https://crfm.stanford.edu/2023/03/13/alpaca.html?ref=blef.fr">Alpaca</a>, a model that behaves similarly to OpenAI's text-davinci-003 but costs a lot less ($600 to train it); there is also a list of <a href="https://github.com/nichtdax/awesome-totally-open-chatgpt?ref=blef.fr">open alternatives to ChatGPT</a>.</p><p>I'd have loved to speak about tools offering to translate human language to SQL, like <a href="https://dev.to/trinly01/how-to-use-sequelai-to-convert-natural-language-queries-into-sql-queries-3f7o?ref=blef.fr">sequel.ai</a> or <a href="https://www.sqltranslate.app/?ref=producthunt">SQL translator</a>, but it would open the Pandora's box of self-service analytics and this is for next week.</p><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/03/image-8.png" class="kg-image" alt loading="lazy" width="800" height="450" srcset="https://www.blef.fr/content/images/size/w600/2023/03/image-8.png 600w, https://www.blef.fr/content/images/2023/03/image-8.png 800w" sizes="(min-width: 720px) 720px"><figcaption>Pi? 
(<a href="https://unsplash.com/photos/vqRMXgVtGXM?ref=blef.fr">credits</a>)</figcaption></figure><ul><li><a href="https://www.reddit.com/r/RedditEng/comments/11xx5o0/you_broke_reddit_the_piday_outage/?ref=blef.fr">You broke Reddit: The Pi-day outage</a> — a good retrospective post on the Reddit outage on Pi-day—the 14th of March, 3/14 in US date format; unlucky are all the rest of us who don't have a Pi-day. Jayme, a staff software engineer, shares that a Kubernetes version upgrade from 1.23 to 1.24 led to the outage. Kubernetes introduced in 1.24 a terminology change from <em>master</em> to <em>control-plane</em>, which was the trigger of the issue.</li><li><a href="https://arrow.apache.org/blog/2023/03/07/nanoarrow-0.1.0-release/?ref=blef.fr">Apache Arrow releases Arrow nanoarrow</a> — Recently Arrow got a lot of light because of DuckDB and Pandas 2.0, and it's good. Arrow is a multi-language interface for in-memory data structures. Nanoarrow is a C library that acts as a simplified interface for application developers in order to put Arrow everywhere.</li><li><a href="https://www.linkedin.com/pulse/avoiding-data-pipeline-failures-importance-capability-ravindra-kumar%3FtrackingId=nBR%252FEpxUQy%252BgG4xZRFJXgw%253D%253D/?trackingId=nBR%2FEpxUQy%2BgG4xZRFJXgw%3D%3D&ref=blef.fr">Avoiding data pipeline failures: the importance of backfilling capability</a> — As I've already said in the past, backfilling is one task that separates data engineers from great data engineers. Backfilling has to be thought about at every step of a data pipeline's design and development. 
This is a small LinkedIn article, but it's a good reminder.</li><li><a href="https://www.datafold.com/blog/dbt-development-testing-snowflake?ref=blef.fr">Datafold's data-diff now integrates dbt</a> — You can now run a data diff after a dbt run to compare your models to the production state and get a summarised view of the rows impacted.</li><li><a href="https://engineering.linkedin.com/blog/2023/unified-streaming-and-batch-pipelines-at-linkedin--reducing-proc?ref=blef.fr">How LinkedIn reduced processing time with Apache Beam</a> — Beam is a distributed processing framework that proposes a unified execution engine for batch and real-time. The LinkedIn team decided to migrate their lambda architecture and got a 94% uplift in performance.</li><li><a href="https://www.fivetran.com/blog/how-fast-is-duckdb-really?ref=blef.fr">How fast is DuckDB really?</a> — George, Fivetran's CEO, ran a performance test to get metrics on DuckDB's performance. The article's conclusion is that a MacBook M1 (and probably M2) can have better performance than a server. I can relate.</li><li><a href="https://medium.com/alvin-ai/if-data-lineage-is-the-answer-what-is-the-question-bad7f5f44fb5?ref=blef.fr">If data lineage is the answer, what is the question?</a> — A good list of use-cases where data lineage would be useful.</li><li><a href="https://doordash.engineering/2023/03/21/using-cockroachdb-to-reduce-feature-store-costs-by-75/?ref=blef.fr">Using CockroachDB to reduce feature store costs by 75%</a> — More and more articles about cost optimisation; 2023 is the year where data engineers' skills will be used to lower platform costs. 
And this is a good point.</li><li><a href="https://medium.com/@diogo22santos/how-shadow-data-teams-are-creating-massive-data-debt-d432113f4632?ref=blef.fr">How shadow data teams are creating massive data debt</a>.</li><li><a href="https://tech.instacart.com/distributed-machine-learning-at-instacart-4b11d7569423?ref=blef.fr">Distributed Machine Learning at Instacart</a>.</li></ul><p></p><h1 id="data-economy-%F0%9F%92%B0">Data Economy 💰</h1><ul><li><strong>Sifflet</strong> <a href="https://www.siffletdata.com/blog/sifflet-secures-eu12m-in-series-a-financing-to-put-an-end-to-data-entropy?ref=blef.fr">raises €12.8m Series A</a>. Initially a data observability tool, but it turns out they added features like lineage and cataloging, often needed to better contextualise alerts but also to avoid tool multiplicity when working with big corporations.</li><li><strong>Hex</strong> <a href="https://hex.tech/blog/funding-round-march-2023/?ref=blef.fr">raises $28m in a Venture Round</a>[1]. Hex is a notebook-based analytics application. Cells are at the center of the analytics: they produce outputs that can be used later in other cells or in visualisations. The visualisations can be organised in a Notion-like document, but with live data. I recently tried Hex, the UX is neat and I think the tool is worth it for production-ready explorations[2]. Here is an example with the <a href="https://app.hex.tech/79b8d7af-cf83-4c25-b4e5-0e132af2df36/app/85c5d79c-e6c4-4642-8317-4d0ebe9520dd/latest?ref=blef.fr">MAD data</a>—I made no presentation effort.</li><li><strong>DragonflyDB</strong> <a href="https://siliconangle.com/2023/03/21/dragonflydb-reels-21m-speedy-memory-database/?ref=blef.fr">raises $21m Series A</a>. Dragonfly is a replacement for Redis claiming to outperform it in many ways (throughput, snapshotting speed, scaling). 
I don't have a lot to say except that we are heading toward a future with a lot of database choices.</li></ul><hr><ol><li>A venture round is when the series has not been specified.</li><li>I just coined the term; I mean, it's when you do a great exploration and you want to share a professional result with your stakeholders.</li></ol><hr><p>See you next week ❤️.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 23.11 ]]></title>
                    <description><![CDATA[ Data News #23.11 — Airflow alternatives meetup, Gen AI new category, online gradient descent in SQL, data with Rust, dbt exposures to sources, etc. ]]></description>
                    <link><![CDATA[ /data-news-week-23-11/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 64134b63fc17eb003d390a66 ]]></guid>
                    <pubDate><![CDATA[ 2023-03-17 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/03/image-5.png" class="kg-image" alt loading="lazy" width="2000" height="1333" srcset="https://www.blef.fr/content/images/size/w600/2023/03/image-5.png 600w, https://www.blef.fr/content/images/size/w1000/2023/03/image-5.png 1000w, https://www.blef.fr/content/images/size/w1600/2023/03/image-5.png 1600w, https://www.blef.fr/content/images/size/w2400/2023/03/image-5.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>Took a few days with the ☀️ (<a href="https://unsplash.com/photos/Nx0C3cDKRLw?ref=blef.fr">credits</a>)</figcaption></figure><p>Hey you, I hope you had a great week. On my side I'm slowly starting to get on top of the things I had in the queue. But, sadly, I work in <a href="https://es.wikipedia.org/wiki/Last_in,_first_out?ref=blef.fr">LIFO</a> so I feel that I'm never done. For people that are not used to it, it means <strong>last in, first out</strong>: I get easily disturbed by a notification—or even a thought—and do something that I did not plan to do at first. It probably explains why you always get the newsletter late on Fridays—or Saturdays.</p><p>Thank you for the feedback about last week's issue, it seems you liked it. 
I'll try to continue doing deep-dives on articles from time to time.</p><p></p><h1 id="airflow-alternatives-meetup">Airflow alternatives meetup</h1><figure class="kg-card kg-image-card kg-card-hascaption"><a href="https://www.linkedin.com/events/7041351388609576960/about/?ref=blef.fr"><img src="https://www.blef.fr/content/images/2023/03/Meetup--4-11-.png" class="kg-image" alt loading="lazy" width="2000" height="433" srcset="https://www.blef.fr/content/images/size/w600/2023/03/Meetup--4-11-.png 600w, https://www.blef.fr/content/images/size/w1000/2023/03/Meetup--4-11-.png 1000w, https://www.blef.fr/content/images/size/w1600/2023/03/Meetup--4-11-.png 1600w, https://www.blef.fr/content/images/size/w2400/2023/03/Meetup--4-11-.png 2400w" sizes="(min-width: 720px) 720px"></a><figcaption>Click on the image to go to the LinkedIn event.</figcaption></figure><p>Next week, with the Paris Apache Airflow Meetup group, we are organising an online event to discuss Airflow alternatives. At every Airflow meetup we get questions about Airflow's competition, so we decided to give a voice to the alternatives in order to understand how they compare with Airflow and more.</p><p>The first event will take place next week, on March 21st at 7PM CET (UTC+1), and we invited Mage and Kestra. We will host another event soon after with the others. You can either register on <a href="https://www.linkedin.com/events/7041351388609576960/about/?ref=blef.fr">LinkedIn</a> or join the <a href="https://www.meetup.com/fr-FR/paris-apache-airflow-meetup/events/291965622/?ref=blef.fr">meetup event</a>.</p><p>How lucky you are: I will host the event, so you'll hear my awesome French accent. 
It also means that if you have any questions that you want me to ask, you can send them to me beforehand 🫠.</p><p></p><h1 id="gen-ai-%F0%9F%A4%96">Gen AI 🤖</h1><p><em>I will create a specific category for generative AI.</em></p><p>If you live in a cave, or if you only read my newsletter to get news about the data world, you might have missed that <a href="https://openai.com/research/gpt-4?ref=blef.fr">GPT-4 has been announced and released</a> this week. I even had a hard time navigating between data engineering memes and GPT-4 tips on LinkedIn, and my Twitter is divided between GPT-4 threads and protests in France. What a time to be alive. Politicians think we should work longer when we are slowly starting to discover new AI capabilities that will for sure impact workplaces.</p><p>I don't want to take the usual shortcut—but how could I not. Will AI replace jobs? I do think that AI should empower people, but will capitalism think like this when an API call will be able to do the same job as a human? Does capitalism even think? Actually it's probably human decisions about AI that will lead to AI replacing people.</p><p>One field that has been totally impacted by generative AI is Natural Language Processing (NLP). On Reddit someone asked if others were also <a href="https://www.reddit.com/r/MachineLearning/comments/11rizyb/d_anyone_else_witnessing_a_panic_inside_nlp_orgs/?ref=blef.fr">witnessing panic in NLP orgs</a>. 
The general feeling is that GPT made years of NLP research outdated.</p><div class="kg-card kg-callout-card kg-callout-card-grey"><div class="kg-callout-emoji">ℹ️</div><div class="kg-callout-text">BTW, technically GPT-4 will be multimodal: you will be able to use text and images as inputs and the model will give you text outputs.</div></div><p>A few other news items:</p><ul><li>The LinkedIn team also wrote a blog post about <a href="https://engineering.linkedin.com/blog/2023/aI-at-linkedin-it-is-all-about-foundations?ref=blef.fr">AI principles</a> stating that AI is like oxygen for the engineering team—I personally would have said that data was the oxygen, but who cares—and that with great power comes great responsibility. The same week Microsoft (which owns LinkedIn) reportedly <a href="https://www.theverge.com/2023/3/13/23638823/microsoft-ethics-society-team-responsible-ai-layoffs?ref=blef.fr">laid off the AI ethics and society teams</a>. Great timing.</li><li><a href="https://glaze.cs.uchicago.edu/?ref=blef.fr">Glaze, protecting artists from style mimicry</a> — A tool developed by researchers at the University of Chicago that will help digital artists by cloaking their art to avoid mimicry by deep learning training.</li><li>Google and Microsoft will compete to include AI copilots in their office suites — Microsoft announced <a href="https://blogs.microsoft.com/blog/2023/03/16/introducing-microsoft-365-copilot-your-copilot-for-work/?ref=blef.fr">365 Copilot</a> that will work in Word, Excel, PowerPoint and Outlook.
On the other side <a href="https://blog.google/technology/ai/ai-developers-google-cloud-workspace/?ref=blef.fr">Google announced the same for Google Docs and Gmail</a>.</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/03/image-6.png" class="kg-image" alt loading="lazy" width="2000" height="1292" srcset="https://www.blef.fr/content/images/size/w600/2023/03/image-6.png 600w, https://www.blef.fr/content/images/size/w1000/2023/03/image-6.png 1000w, https://www.blef.fr/content/images/size/w1600/2023/03/image-6.png 1600w, https://www.blef.fr/content/images/size/w2400/2023/03/image-6.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>Can we develop a GenAI that generates protest slogans? (<a href="https://unsplash.com/photos/It3dmqBbKRQ?ref=blef.fr">credits</a>)</figcaption></figure><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li><a href="https://engineering.grab.com/migrating-to-abac?ref=blef.fr">Migrating from role to attribute-based access control</a> — RBAC is probably one of the most used paradigms when it comes to authorisation, especially because role-based authorisations are faster to put in place. In the article the Grab team explains how they migrated from role-based to attribute-based authorisation on Kafka.</li><li><a href="https://medium.com/data-science-at-microsoft/speeding-up-reverse-etl-3af04e069fd1?ref=blef.fr">Speeding up “Reverse ETL”</a> — Ziqi works at Microsoft and details in this article what they had to consider to improve their Lakehouse exports to downstream databases. In short they switched SQL Server to columnar storage, disabled indexes and locks when copying, and played with parallelisation and batch size.</li><li><a href="https://maxhalford.github.io/blog/ogd-in-sql/?ref=limit-5">Online gradient descent written in SQL</a> — Max is one of the best when it comes to great experiments. This time he shows that everything can be done in SQL.
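For intuition, here is a hedged Python sketch of what an online gradient descent update does (toy noiseless data and learning rate are made up; the post's point is that the same loop can be expressed with a recursive CTE in SQL):

```python
# Online gradient descent for least squares, one (x, y) sample at a time.
def ogd(stream, lr=0.1):
    w, b = 0.0, 0.0
    for x, y in stream:
        err = (w * x + b) - y   # prediction error on this single sample
        w -= lr * err * x       # gradient step for the slope
        b -= lr * err           # gradient step for the intercept
    return w, b

# Noiseless stream following y = 2x + 1; cycling over it many times
# drives the estimates towards w = 2 and b = 1.
stream = [(i / 100, 2 * (i / 100) + 1) for i in range(100)] * 200
w, b = ogd(stream)
```

Each row of the stream corresponds to one step of the recursion in the SQL version.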
With recursive CTEs he implemented a scikit-learn-like linear model, and the code is not even that big.</li><li><a href="https://datawithrust.com/?ref=blef.fr">Data with Rust</a> — This is a handbook that showcases how to do data engineering with Rust. At the moment only parts 1 and 2 are written but it looks promising.</li><li><a href="https://medium.com/@manish.ramrakhiani/automating-dbt-producer-consumer-pattern-57161ad178d5?ref=blef.fr">Sharing data between dbt projects, dbt exposures to sources</a> — When you have multiple dbt projects it can be a mess to reference a model from another project. This blog shows how you can automate it with a CI and definitions in exposures.</li><li><a href="https://www.sicara.fr/blog-technique/polars-vs-pandas?ref=blef.fr">Polars vs pandas: A new era for Python DataFrames</a> — This comparison is slowly becoming a great debate in the data world. Will Polars overtake pandas in the coming years? Guillaume wrote yet another great comparison.</li><li><a href="https://dagster.io/blog/fake-stars?ref=blef.fr">Tracking the fake GitHub star black market with Dagster, dbt and BigQuery</a> — Things are getting spicy here.
The Dagster team proposed a way to potentially identify GitHub projects buying stars.</li></ul><p></p><p>A few other articles, without comment:</p><ul><li><a href="https://www.onehouse.ai/blog/introducing-multi-modal-index-for-the-lakehouse-in-apache-hudi?ref=blef.fr">Introducing multi-modal index for the Lakehouse in Apache Hudi</a>.</li><li><a href="https://megandibble.medium.com/how-to-be-a-good-data-analyst-without-good-data-e4459f2e8585?ref=blef.fr">How to be a good data analyst without good data</a>.</li><li><a href="https://pedram.substack.com/p/dbt-reimagined?ref=blef.fr">dbt Reimagined</a>.<br></li></ul><p></p><h1 id="data-economy-%F0%9F%92%B0">Data Economy 💰</h1><ul><li>The Austrian data protection authority has decided that <a href="https://noyb.eu/en/austrian-dsb-meta-tracking-tools-illegal?ref=blef.fr">Meta tracking tools are in violation of the GDPR</a>. It will create a precedent.</li><li><strong>Seldon </strong><a href="https://www.seldon.io/announcing-our-series-b?ref=blef.fr">raises $20m Series B</a>. Seldon is an MLOps platform that helps you deploy models in production. At its core Seldon provides a framework that you can configure to serve your models on top of Kubernetes.</li><li><strong>👀 </strong><a href="https://www.adept.ai/?ref=blef.fr"><strong>Adept</strong></a> <a href="https://www.adept.ai/blog/series-b?ref=blef.fr">raises $350m Series B</a>. This is again a testimony to the frenzy around generative AI, and to me the most impressive one. Adept wants to create a general-purpose AI teammate for everyone. At the moment it takes the form of a browser extension in which you can ask for stuff as you navigate on Salesforce, Google Sheets or Craigslist.</li><li><strong>Cast AI</strong> <a href="https://cast.ai/press-release/cast-ai-receives-20m-in-new-funding-led-by-early-stage-vc-creandum/?ref=blef.fr">raises $20m in funding</a>. They propose an AI to cut your Kubernetes costs in half. Bold promise.</li></ul><hr><p>See you next week ❤️.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 23.10 ]]></title>
                    <description><![CDATA[ Data News #23.10 — The MAD landscape explained and the Silicon Valley Bank collapse. ]]></description>
                    <link><![CDATA[ /data-news-week-23-10/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 640a17c985afdd003d68caaf ]]></guid>
                    <pubDate><![CDATA[ 2023-03-11 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/03/image-4.png" class="kg-image" alt loading="lazy" width="2000" height="1333" srcset="https://www.blef.fr/content/images/size/w600/2023/03/image-4.png 600w, https://www.blef.fr/content/images/size/w1000/2023/03/image-4.png 1000w, https://www.blef.fr/content/images/size/w1600/2023/03/image-4.png 1600w, https://www.blef.fr/content/images/size/w2400/2023/03/image-4.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>Sorting all the eggs of the landscape (<a href="https://unsplash.com/photos/auEPahZjT40?ref=blef.fr">credits</a>)</figcaption></figure><p>Dear readers, this week Data News lands on Saturday and will be a little bit different than usual, because I found fewer relevant articles and, as promised last week, I wanted to speak about the MAD Landscape.</p><p>I hope you will enjoy this topic-focused edition where I speak about economics even though I'm a newbie in the field. At the last minute I also added stuff about Silicon Valley Bank, which has been seized by the US FDIC—something that will generate a crisis in the scale-up/startup world.</p><h1 id="the-mad-landscape">The MAD landscape</h1><p>The Machine learning, Artificial intelligence &amp; Data (MAD) Landscape is a company index that was initiated in 2012 by Matt Turck, a Managing Director at First Mark.
First Mark is a NYC VC; their portfolio includes Dataiku, ClickHouse and Astronomer among other tech or B2C companies.</p><figure class="kg-card kg-image-card kg-width-full kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/03/Frame-12.png" class="kg-image" alt loading="lazy" width="2000" height="685" srcset="https://www.blef.fr/content/images/size/w600/2023/03/Frame-12.png 600w, https://www.blef.fr/content/images/size/w1000/2023/03/Frame-12.png 1000w, https://www.blef.fr/content/images/size/w1600/2023/03/Frame-12.png 1600w, https://www.blef.fr/content/images/size/w2400/2023/03/Frame-12.png 2400w"><figcaption>Evolution between 2012 and 2023. We jumped from 142 logos to 1414; the world changed but Pig remains. (credits: <a href="https://mattturck.com/?ref=blef.fr">mattturck.com</a>)</figcaption></figure><p>Year after year the MAD Landscape has become an important tool for indexing the whole data landscape. The choice of categories is also a very clear way to categorise companies and to capture how the data field is changing. Obviously this kind of index is opinionated, and they—Matt and his team—make editorial choices when they decide to include—or not—a company, but still, their selection depicts a reality.</p><p>Today I want to do a second reading of the 4-part article Matt wrote and give my views on it. As Matt said in a <a href="https://www.linkedin.com/events/mattturck-sgonemad-the2023ml-ai7036336040386723840/comments/?ref=blef.fr">LinkedIn live with Joe Reis</a>, the MAD landscape was not published last year (2022) because of time, and the landscape has been totally shaken by 2 major events: the massive layoff wave and the generative AI hype.
As a reminder, in the 2021 edition money was flowing: Databricks did 2 huge rounds with $2.6b raised, and the Snowflake IPO was a success one year after.</p><p>In the MAD landscape there are 3 main parts that I will discuss today:</p><ul><li><strong>Infrastructure and open source infrastructure</strong> — all the data tools everyone wants to use (<em>or not, Talend appears twice in the list </em>🙃); this part depicts well what a data engineer needs to create a data stack.</li><li><strong>Analytics</strong> — this is about the tools we use to query the data lying in the infrastructure.</li><li><strong>Machine learning &amp; AI</strong> — this category has been totally shaken by the generative AI trend; enterprise machine learning in 2023 is not the same as before.</li></ul><p>Before going more into category changes and the macro trends this MAD captures, there are a few interesting facts highlighting some biases this index might have:</p><ul><li>933 companies out of 1414 (65%) are US-based companies</li><li>The continent repartition is 965 (68%) in North America, 182 (12%) in Europe, 74 (5%) in Asia, and 192 companies are open-source, so they don't have a base country</li><li>Median founding year is 2015, which means that half of the companies are less than 7 years old, and 20% are less than 3 years old</li><li>GAFAM have logos everywhere. Amazon is the most represented one with 33 logos, then Google with 30, then Microsoft with 21. Apple and Meta are lower with 2 logos each. It is important to mention that IBM has 12 logos and IBM is the oldest company — 1911.</li></ul><p>Mainly what these facts are saying is that the MAD landscape is dominated by US-based companies, and US-based companies are nowadays deciding how the world should do data, trying to replicate their problems and their vision everywhere. Which is kinda broken. Obviously there are companies and VCs in Europe/Asia, but rare are the ones with the same impact.
Diversity-wise this is a world dominated by the Northern Hemisphere (as always); there is no company in Africa or South America for instance.</p><h4 id="key-insights">Key insights</h4><p>In a nutshell, here are the key insights you need to know if you don't want to read Matt's notes. First regarding data infrastructure:</p><ul><li><strong>The consolidation will come in the next months/years</strong> — every sub-category has between 20 and 30 logos; even if every company thinks it's unique, they often do the same as the others, and the market might not be as large as thought. Also there are a lot of <em>"single-feature companies"</em> which will compete with broader ones and more likely fail because of their offering. Snowflake and Databricks are the adults who will whistle the end of recess.</li><li><strong>Quality and observability are the same</strong> — sorry, but everyone wants to be the <em>"Datadog of data"</em>. Looking at the trend, they all want to do the same.</li><li><strong>The future of data catalogs is unclear</strong> — I really like the definition of catalog Matt gives: "<em>there is a need for an organised inventory of all data assets</em>". Catalogs are still struggling to get adopted even if they seem to be asked for by a part of the industry. There are also too many alternatives.</li><li><strong>With the recession, the modern data stack is attacked</strong> — This is a big shortcut but true. The modern data stack is tightly coupled to ELT, which means load first and think second. When you load first you have more data than you need, which leads to avoidable costs.
The current MDS, with unlimited computing power and storage, might come to an end.</li><li>If you want another perspective with a more exhaustive list of changes you can read <a href="https://annageller.medium.com/2023-state-of-data-infrastructure-key-trends-from-matt-turcks-mad-landscape-7dc24e14a815?ref=blef.fr">Anna's takeaways</a> about the MAD 2023 infra category.</li></ul><div class="kg-card kg-callout-card kg-callout-card-grey"><div class="kg-callout-emoji">ℹ️</div><div class="kg-callout-text">tl;dr — Less money everywhere, optimisation everywhere. The golden age of flooding money is done. We will aim for a simplification of everything because often simple means less cash burn. Less everything.</div></div><div class="kg-card kg-button-card kg-align-center"><a href="https://mattturck.com/mad2023-part-iii/?ref=blef.fr" class="kg-btn kg-btn-accent">Read MAD 2023 — TRENDS IN DATA INFRA</a></div><p>After infrastructure Matt also writes about all the AI impacts:</p><ul><li>The index this year depicts the generative AI hype, with a lot of early-stage startups doing almost everything possible with generative algorithms.</li><li>According to Matt we are now in the 3rd AI hype cycle. This is the largest one because it has reached mainstream coverage. As proof, my father is using ChatGPT (in French "chat" means "cat" and he says CatGPT, which is a bit funny). But yeah, AI became mainstream even if it was already everywhere before, as vertical models.
But now everyone experiences general-purpose intelligence.</li><li>Startups might have difficulties catching up with tech giants on this because they need data, and probably a lot of computing power they might not have.</li><li>There are many backlashes AI companies will have to navigate through: impact on the job market, algorithm bias, disinformation, hallucination—a word for when the AI is just wrong—and lastly, AI is just boring.</li></ul><div class="kg-card kg-button-card kg-align-center"><a href="https://mattturck.com/mad2023-part-iv/?ref=blef.fr" class="kg-btn kg-btn-accent">Read MAD 2023 — TRENDS IN ML/AI</a></div><p>In addition to this, O'Reilly released their <a href="https://www.oreilly.com/radar/technology-trends-for-2023/?ref=blef.fr">technology trends</a> based on the searches on their website. When we focus only on the data field, what we see is:</p><ul><li>Overall Python is the most popular concept and the fastest-growing one — I think this is because Python is the best entry-level language for the IT world.</li><li>When it comes to data, data engineering is the most searched concept and it is growing</li><li>Spark and Hadoop have been searched less than last year</li><li>PowerBI is the 3rd most searched concept and I'm sad about it</li></ul><p></p><h1 id="silicon-valley-bank%E2%80%94wat">Silicon Valley Bank—wat?</h1><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/03/image-3.png" class="kg-image" alt loading="lazy" width="800" height="533" srcset="https://www.blef.fr/content/images/size/w600/2023/03/image-3.png 600w, https://www.blef.fr/content/images/2023/03/image-3.png 800w" sizes="(min-width: 720px) 720px"><figcaption>🤞(<a href="https://unsplash.com/photos/utWyPB8_FU8?ref=blef.fr">credits</a>)</figcaption></figure><p>This is a bit last minute but this is freaking huge.
Let me do a recap for you and explain why it matters.</p><p>Silicon Valley Bank (SVB) is a deposit bank based in California with the biggest market share in the startup world. SVB manages billions of dollars of assets. Mainly the assets come from Silicon Valley startups, founders and employees. In a nutshell, if you are a startup founder and you get millions from a seed round, you put the money in SVB.</p><p>2-3 years ago a lot of money was raised and SVB got around $200 billion in deposits. SVB wanted to put $80 billion of this money to work using Mortgage-Backed Securities (MBS)—just as a reminder, MBS were at the center of the 2008 financial crisis. The MBS guaranteed a 1.5% return, and because interest rates were low due to the pandemic, it was OK. </p><p>In the last months the Fed increased rates, recently crossing the 4.5% mark, which was still OK but started triggering a chain reaction among all the actors. SVB made a first <a href="https://www.prnewswire.com/news-releases/svb-financial-group-announces-proposed-offerings-of-common-stock-and-mandatory-convertible-preferred-stock-301766247.html?ref=blef.fr">mistake</a> that I'm not able to explain.</p><p>Then VCs started panicking (e.g. <a href="https://www.bloomberg.com/news/articles/2023-03-11/thiel-s-founders-fund-withdrew-millions-from-silicon-valley-bank?ref=blef.fr#xj4y7vzkg">Peter Thiel's</a>), advising founders and startups to get their money out of SVB. Which led to a <a href="https://en.wikipedia.org/wiki/Bank_run?ref=blef.fr">bank run</a>.</p><blockquote>A <strong>bank run</strong> or <strong>run on the bank</strong> occurs when many clients withdraw their money from a bank, because they believe the bank may cease to function in the near future</blockquote><p>Then SVB made another <a href="https://twitter.com/mbdailyshow/status/1634225082267516935?s=20&ref=blef.fr">mistake</a>.
One day later the stock was down 60%, and later the same day the bank collapsed.</p><p>What happened here is huge and will have a big impact on every US-based scale-up/startup—it's very well linked to the MAD landscape. Mainly, deposits were only insured up to $250k, which means that a lot of companies will lack cash and probably have difficulties paying salaries and/or vendors soon.</p><p>As a reaction it will, sadly, imply more layoffs in the coming days and weeks. Others are also afraid of a contagion to the whole banking system, as the SVB collapse became the <a href="https://en.wikipedia.org/wiki/List_of_largest_U.S._bank_failures?ref=blef.fr">second-largest bank failure in US history</a>.</p><p></p><h1 id="data-economy-%F0%9F%92%B0">Data Economy 💰</h1><ul><li><a href="https://spacenews.com/lonestar-raises-5-million-for-lunar-data-centers/?ref=blef.fr">Lonestar raises $5m in seed</a> to put data centers on the Moon. Yep, you read that right. Apparently the moon market is projected to generate $105B in revenue over the next decade. While in France we are fighting to retire earlier, people want to send my Twitter history backups to the moon.</li><li><a href="https://www.darkreading.com/risk/employees-feeding-sensitive-business-data-chatgpt-raising-security-fears?ref=blef.fr">Employees are feeding sensitive business data to ChatGPT, raising security fears</a>.</li></ul><hr><p>See you next week with a usual Data News ❤️.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 23.09 ]]></title>
                    <description><![CDATA[ Data News #23.09 — How to get started with dbt, machine learning Saturday, writing as a data eng, SCDs, Snowflake announcements, etc. ]]></description>
                    <link><![CDATA[ /data-news-week-23-09/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 6401b89c0cb29f003dce4779 ]]></guid>
                    <pubDate><![CDATA[ 2023-03-04 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/03/image.png" class="kg-image" alt loading="lazy" width="2000" height="1333" srcset="https://www.blef.fr/content/images/size/w600/2023/03/image.png 600w, https://www.blef.fr/content/images/size/w1000/2023/03/image.png 1000w, https://www.blef.fr/content/images/size/w1600/2023/03/image.png 1600w, https://www.blef.fr/content/images/size/w2400/2023/03/image.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>Formula 1 is back (trying to jinx before it happens) (yes there is no link with the data news) (<a href="https://unsplash.com/photos/7gsDyd2gskA?ref=blef.fr">credits</a>)</figcaption></figure><p>Hello you, I hope this new Data News finds you well. After last week's question about whether you would consider a paying subscription, I got some feedback and it helped me a lot to realise how you see the newsletter and what it means for you. So thank you for that. I'll think about it in the following weeks to figure out where I go for the third year of the newsletter and the blog.</p><p>Stay tuned and let's jump to the content.</p><p>This week I've published a compact article about <a href="https://www.blef.fr/get-started-dbt/">how to get started with dbt</a>. The idea behind this article is to define every dbt concept and object, from the CLI to Jinja templating, models and sources.
The article has been written as something you can add to your own internal dbt onboarding process for every newcomer.</p><div class="kg-card kg-button-card kg-align-center"><a href="https://www.blef.fr/get-started-dbt/" class="kg-btn kg-btn-accent">Read my article — How to get started with dbt</a></div><p></p><h1 id="machine-learning-saturday-%F0%9F%A4%96">Machine Learning Saturday 🤖</h1><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/03/image-1.png" class="kg-image" alt loading="lazy" width="2000" height="1379" srcset="https://www.blef.fr/content/images/size/w600/2023/03/image-1.png 600w, https://www.blef.fr/content/images/size/w1000/2023/03/image-1.png 1000w, https://www.blef.fr/content/images/size/w1600/2023/03/image-1.png 1600w, https://www.blef.fr/content/images/size/w2400/2023/03/image-1.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>Was it a boost ride? (<a href="https://unsplash.com/photos/hd0ZKh8VjhQ?ref=blef.fr">credits</a>)</figcaption></figure><ul><li><a href="https://medium.com/blablacar/how-blablacar-matches-passengers-and-drivers-with-machine-learning-1cf151451f?ref=blef.fr">How BlaBlaCar leverages machine learning to match passengers and drivers</a> — BlaBlaCar is a carpooling company, and in this article they detail what they did—in terms of machine learning—to improve trip listings with a Boost feature that proposes detours to drivers in order to cover more countryside cities. It does not include any generative AI but it greatly shows how machine learning can impact business problems.</li><li><a href="https://engineering.linkedin.com/blog/2023/linkedin-s-responsible-ai-principles-help-meet-the-big-moments-i?ref=blef.fr">Sharing LinkedIn’s Responsible AI Principles</a> — Very short article that lists the 5 principles LinkedIn aims to follow.
In a nutshell: AI should be used as a tool to empower members and augment their success, while prioritising trust, privacy, security and fairness, providing transparency in AI usage, and putting the right governance in place to maintain accountability over AI algorithms.</li><li><a href="https://medium.com/data-monzo/designing-a-regional-experiment-to-measure-incrementality-9326ce6f9248?ref=blef.fr">Designing a regional experiment to measure incrementality</a> — The Monzo team ran a geographical experiment in order to understand how their referral program works.</li></ul><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li><a href="https://medium.com/@luukmes/writing-well-a-data-engineers-advantage-2fd08efaedb0?ref=blef.fr">Writing well: a data engineer’s advantage</a> — This is probably an overlooked part of the data engineer toolkit, but writing is an essential skill. Luuk gives some advice on how to improve your email communications with coworkers, whether to announce a new release or to seek budget for a refactoring project. </li><li><a href="https://towardsdatascience.com/heres-why-your-efforts-extract-value-from-data-are-going-nowhere-8e4ffacbdbc0?ref=blef.fr">Here’s why your efforts to extract value from data are going nowhere</a> — <em>If data science is “making data useful,” then data engineering is “making data usable.” </em>This is a quote from Cassie's article, and I find it awesome. But still, in order to make data work we need to praise the other data coworkers who have to do the documentation and all the governance burden that no one wants to do.</li><li><a href="https://python.plainenglish.io/understanding-slowly-changing-dimensions-scd-in-data-warehousing-20a566ae3fdd?ref=blef.fr">Understanding slowly changing dimensions (SCD) in data warehousing</a> — SCD modeling is an old technique but it is more and more relevant today as we need to keep track of transactional data. The article proposes 6 types of SCDs.
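To make the SCD idea concrete, here is a minimal, hypothetical Python sketch of a type 2 upsert (the record layout and field names are made up; in practice a warehouse MERGE statement or a dbt snapshot does this for you):

```python
from datetime import date

# SCD type 2: a change never overwrites a row; it closes the current version
# (sets valid_to) and appends a new open-ended one, so history is preserved.
def scd2_upsert(history, key, new_value, as_of):
    for row in history:
        if row["key"] == key and row["valid_to"] is None:
            if row["value"] == new_value:
                return history            # no change, nothing to do
            row["valid_to"] = as_of       # close the current version
    history.append({"key": key, "value": new_value,
                    "valid_from": as_of, "valid_to": None})
    return history

h = []
scd2_upsert(h, "cust_1", "Paris", date(2023, 1, 1))
scd2_upsert(h, "cust_1", "Lyon", date(2023, 3, 1))
```

After these two calls the history holds both values, with the Paris row closed on the date the customer moved to Lyon—nothing is lost, which is why type 2 is called lossless.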
I think SCD type 2 is the most common and lossless one, but the others are worth mentioning. As a side note, if you want to understand quickly what SCDs are, the <a href="https://docs.getdbt.com/docs/build/snapshots?ref=blef.fr">dbt snapshots</a> documentation page is the best path to go.</li><li><a href="https://blog.infuseai.io/how-to-run-dbt-with-bigquery-in-github-actions-97ccb1761f4b?ref=blef.fr">How to run dbt with BigQuery in GitHub Actions</a> — When you're starting with dbt you don't need any orchestrator or dbt Cloud; a CI/CD will do it for sure. This article gives you the GitHub Action you need to set it up.</li><li><a href="https://medium.com/snowflake/snowflake-query-acceleration-service-the-warehouse-booster-f24bc41b15b?ref=blef.fr">Snowflake: query acceleration service</a> — Snowflake invented a boost that you activate with a flag at warehouse creation (in Snowflake a warehouse is the compute isolation your queries run in; the bigger the warehouse is, the more compute you use and pay for). When the query acceleration service is activated and Snowflake thinks a query can be accelerated, it will launch more compute than actually specified by your warehouse. Not related, but they also announced <a href="https://www.snowflake.com/blog/snowpipe-streaming-public-preview/?ref=blef.fr">Snowpipe Streaming</a> this week.</li><li><a href="https://netflixtechblog.medium.com/data-ingestion-pipeline-with-operation-management-3c5c638740a8?ref=blef.fr">Data ingestion pipeline with Operation Management</a> — At Netflix they annotate videos, which can lead to thousands of annotations, and they need to manage the annotation lifecycle each time the annotation algorithm runs. This article explains how they did it.</li><li><a href="https://engineering.mixpanel.com/ensuring-data-consistency-across-replicas-cb7d650cb40?ref=blef.fr">Ensuring Data Consistency Across Replicas</a> — Mixpanel details how they ensure that Kafka consumers in different zones are writing the data in the same manner.
This way, when a zone is unavailable they can use the other zone and still have the data without any duplicated or missing messages.</li><li>Pandas 2.0.0 — A new <a href="https://pandas.pydata.org/docs/dev/whatsnew/v2.0.0.html?ref=blef.fr#backwards-incompatible-api-changes">major pandas release</a> is out. In the <a href="https://www.blef.fr/data-news-week-23-02/#polars%E2%80%94pandas-are-freezing">shadow of Polars</a>, which seems to revolutionise DataFrame computation, pandas came with <a href="https://medium.com/@darshilp/pandas-2-0-is-here-427b026ab913?ref=blef.fr">a lot of optimisations and changes</a>.</li><li><a href="https://www.lastweekinaws.com/blog/aws-is-asleep-at-the-lambda-wheel/?ref=blef.fr">AWS Lambdas are still on Python 3.9</a> — Corey rants about AWS Lambdas still using Python 3.9 while all the competition has upgraded to at least Python 3.10.</li><li>A small heads-up: the Apache Airflow team has announced the Airflow Summit for 2023, which will be held in Toronto in September.
They recently opened the <a href="https://sessionize.com/airflow-summit-2023/?ref=blef.fr">call for presentations</a>.</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/03/image-2.png" class="kg-image" alt loading="lazy" width="2000" height="1333" srcset="https://www.blef.fr/content/images/size/w600/2023/03/image-2.png 600w, https://www.blef.fr/content/images/size/w1000/2023/03/image-2.png 1000w, https://www.blef.fr/content/images/size/w1600/2023/03/image-2.png 1600w, https://www.blef.fr/content/images/size/w2400/2023/03/image-2.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>Footage of the new Snowflake query acceleration service—be careful, it burns cash faster than ever (<a href="https://unsplash.com/photos/uj3hvdfQujI?ref=blef.fr">credits</a>)</figcaption></figure><p></p><h1 id="data-economy-%F0%9F%92%B0">Data Economy 💰</h1><ul><li><a href="https://www.qwak.com/?ref=blef.fr"><strong>Qwak</strong></a> <a href="https://techcrunch.com/2023/03/01/qwak-raises-12m-for-its-mlops-platform/?ref=blef.fr">raises $12m Series A</a>. Are ducks the new elephants? Qwak proposes an all-in-one platform to manage all the operations in a machine learning project. In the platform you do the feature engineering, model creation, versioning, deployment and monitoring, with all pipelines automated. I think a lot of platforms like this exist today.</li><li><a href="https://tabular.io/blog/announcing-tabular/?ref=blef.fr">Announcing Tabular</a> — Tabular has been released publicly this week. Tabular is a cloud offering built on Apache Iceberg. It is funny to see their offering because they sell "managed data warehouse storage", which means without the compute: you bring your own compute. Some companies also call this a lakehouse or a data lake, but the word shift is interesting enough to notice.
At least for me.</li><li><a href="https://gradientflow.com/insights-from-new-data-and-ai-pegacorns/?ref=blef.fr">Insights from new data and AI Pegacorns</a> — Ben from GradientFlow gave a few economic insights about the data Pegacorns (companies with more than $100m annual revenue). I don't have much to say except that next year we'll probably see generative AI companies on track to enter the selection.</li></ul><p>I wanted to include a review of the <a href="https://mattturck.com/mad2023/?ref=blef.fr">2023 MAD landscape</a> in this newsletter, but as I was late and it would have become a huge edition, I'll try to write something on it specifically next week.</p><hr><p>See you next week ❤️. </p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ How to get started with dbt ]]></title>
                    <description><![CDATA[ What&#39;s a dbt model, a source and a macro? Learn how to get started with dbt concepts. ]]></description>
                    <link><![CDATA[ /get-started-dbt/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 63ec9ff1ee02be003d51560c ]]></guid>
                    <pubDate><![CDATA[ 2023-03-01 ]]></pubDate>
                    <content>
<![CDATA[ <p>This article is meant to be a resource hub to help you understand dbt basics and get started on your dbt journey.</p><p>When I write dbt, I often mean <a href="https://github.com/dbt-labs/dbt-core?ref=blef.fr">dbt Core</a>. dbt Core is an open-source framework that helps you organise your data warehouse SQL transformations. dbt Core has been developed by dbt Labs, which was previously named <a href="https://www.getdbt.com/blog/welcome-to-fishtown-analytics/?ref=blef.fr">Fishtown Analytics</a>. The company was founded in May 2016. dbt Labs also develops dbt Cloud, a cloud product that hosts and runs dbt Core projects.</p><p>In this resource hub I'll mainly focus on dbt Core—<em>i.e.</em> dbt.</p><p>First let's understand why dbt exists. dbt was born out of the observation that more and more companies were switching from on-premise Hadoop data infrastructure to cloud data warehouses. This switch has been led by the modern data stack vision. In terms of paradigms, before 2012 we were doing <a href="https://en.wikipedia.org/wiki/Extract,_transform,_load?ref=blef.fr">ETL</a> because storage was expensive, so it was a requirement to transform data before storing it—mainly in a data warehouse—in order to have the most optimised data for querying. </p><p>With the public clouds—e.g. AWS, GCP, Azure—the storage price dropped and we became data insatiable: we needed all the company data, in one place, in order to join and compare everything. Enter the ELT. 
In the ELT, the load is done before the transform, without any alteration of the data, leaving the raw data ready to be transformed in the data warehouse.</p><figure class="kg-card kg-image-card kg-width-wide kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/02/image-3.png" class="kg-image" alt loading="lazy" width="1200" height="439" srcset="https://www.blef.fr/content/images/size/w600/2023/02/image-3.png 600w, https://www.blef.fr/content/images/size/w1000/2023/02/image-3.png 1000w, https://www.blef.fr/content/images/2023/02/image-3.png 1200w" sizes="(min-width: 1200px) 1200px"><figcaption>dbt purpose as conceptualised in 2017—which is the same today (<a href="https://www.getdbt.com/blog/what-exactly-is-dbt/?ref=blef.fr">What, exactly is dbt?</a>)</figcaption></figure><p>In simple words, dbt sits on top of your raw data to organise all the SQL queries that define your data assets. And dbt only does the T of the ELT, which makes it really clear in terms of responsibilities.</p><blockquote><em>dbt is a development framework that combines modular SQL with software engineering best practices to make data transformation reliable, fast, and fun.</em></blockquote><p>This was the previous tagline dbt Labs had on their website. It is important to understand that dbt is a framework. Like every framework there are multiple hidden pieces to know before becoming proficient with it. Still, it is very easy to get started.</p><p></p><h1 id="dbt-concepts">dbt concepts</h1><p>There are a few concepts that are super important and we need to define them before going further:</p><ul><li><strong>dbt CLI</strong> — CLI stands for Command Line Interface. When you have <a href="https://docs.getdbt.com/docs/get-started/installation?ref=blef.fr">installed dbt</a> you have the <code>dbt</code> command available in your terminal. 
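To give an idea, a typical session could look like this (a sketch; exact selections and flags depend on your project setup, and <code>my_model</code> is a hypothetical model name):<pre><code># check the warehouse connection and project configuration
dbt debug
# build all the models
dbt run
# run the tests defined in the project
dbt test
# build a single model and nothing else
dbt run --select my_model</code></pre>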
Thanks to this you can run <a href="https://docs.getdbt.com/reference/dbt-commands?ref=blef.fr">a lot of different commands</a>.</li><li><strong>a dbt project</strong> — <a href="https://docs.getdbt.com/docs/build/projects?ref=blef.fr">a dbt project</a> is a folder that contains all the dbt objects needed to work. You can initialise a project with the CLI command: <code>dbt init</code>.</li><li><strong>YAML</strong> — in the modern data era <a href="https://en.wikipedia.org/wiki/YAML?ref=blef.fr">YAML</a> files are everywhere. In dbt you define a lot of configurations in YAML files. In a dbt project you can define YAML files everywhere. You have to imagine that in the end dbt will concatenate all the files to create one big configuration out of them. In dbt we use the <em>.yml</em> extension.</li><li><strong>profiles.yml</strong> — <a href="https://docs.getdbt.com/reference/profiles.yml?ref=blef.fr">This file contains the credentials</a> to connect your dbt project to your data warehouse. By default this file is located in your <code>$HOME/.dbt/</code> folder. I recommend creating your own profiles file and specifying the <code>--profiles-dir</code> <a href="https://docs.getdbt.com/docs/get-started/connection-profiles?ref=blef.fr#advanced-customizing-a-profile-directory">option</a> to the dbt CLI. A connection to a warehouse requires a <a href="https://docs.getdbt.com/docs/supported-data-platforms?ref=blef.fr">dbt adapter</a> to be installed.</li><li><strong>a model</strong> — a model is a select statement that can be materialised as a table or as a view. Models are the most important dbt objects because they are your data assets. All your business logic will live in the model select statements. You should also know that models are defined in <em>.sql</em> files and that the filename is the name of the model by default. 
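For example, a hypothetical file <code>models/stg_orders.sql</code> containing only a select statement defines a model named <code>stg_orders</code>:<pre><code>-- models/stg_orders.sql (hypothetical example)
select
    id as order_id,
    customer_id,
    created_at
from raw_data.orders</code></pre>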
You can also add metadata on models (in YAML).</li><li><strong>a source</strong> — a source refers to a table that has been extracted and loaded—EL—by something outside of dbt. You have to define sources in YAML files.</li><li><strong>Jinja templating</strong> — <a href="https://en.wikipedia.org/wiki/Jinja_(template_engine)?ref=blef.fr">Jinja is a templating engine</a> that has seemingly existed forever in the Python world. A templating engine is a mechanism that takes a template containing "stuff" that will be replaced when the template is rendered by the engine. In the dbt context it means that a SQL query is a template that will be rendered—or compiled—into a SQL query ready to be executed against your data warehouse. By default you can recognise Jinja syntax by the double curly brackets—e.g. <code>{{ something }}</code>.</li><li><strong>a macro</strong> — a macro is a Jinja function that either does something or returns SQL or partial SQL code. Macros can be imported from other dbt packages or defined within a dbt project.</li><li><strong>ref / source macros</strong> — <code>ref</code> and <code>source</code> macros are the most important macros you'll use. When writing a model you'll use these macros to define the relationships between models. Thanks to that, dbt is able to create a dependency tree of all the relations between the models. We call this a DAG. 
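As a sketch, with a hypothetical <code>shop</code> source declared in YAML:<pre><code># models/sources.yml (hypothetical example)
version: 2

sources:
  - name: shop
    tables:
      - name: orders</code></pre>a model can then reference both a source and another model:<pre><code>select o.*
from {{ source('shop', 'orders') }} as o
left join {{ ref('stg_customers') }} as c
    on o.customer_id = c.customer_id</code></pre>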
Obviously <a href="https://docs.getdbt.com/reference/dbt-jinja-functions/source?ref=blef.fr">source</a> defines a relation to a source and <a href="https://docs.getdbt.com/reference/dbt-jinja-functions/ref?ref=blef.fr">ref</a> to another model—it can also be other kinds of dbt resources.</li></ul><p>In a nutshell, the dbt journey starts with source definitions, on top of which you define models that transform these sources into something else you'll need in your downstream usage of the data.</p><div class="kg-card kg-callout-card kg-callout-card-grey"><div class="kg-callout-emoji">ℹ️</div><div class="kg-callout-text">I want to mention that the dbt documentation is one of the best tool documentations out there. So do not hesitate to go there to better understand the concepts we covered. You just have to know that there is the <a href="https://docs.getdbt.com/reference/dbt_project.yml?ref=blef.fr">reference</a> part, which is the detailed documentation of each function or configuration, and the <a href="https://docs.getdbt.com/docs/introduction?ref=blef.fr">documentation</a> part, which is more about concepts and tutorials.</div></div><!--members-only--><h1 id="dbt-entities">dbt entities</h1><p>I don't want to copy-paste the dbt documentation here because I think they did it well. There are multiple dbt entities—or objects; dbt names them resources, but I don't want to clash with resource as in a link. So here are the dbt entities you should be aware of before starting any project; the list below is exhaustive (I hope) and sorted by priority:</p><ul><li><strong>sources / models</strong> — you already know them, this is the key part of your data modelisation.</li><li><strong>tests</strong> — a way to define SQL tests, either at column level or with a query. 
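A column-level example in YAML, using the built-in <code>unique</code> and <code>not_null</code> tests on a hypothetical model:<pre><code># models/schema.yml (hypothetical example)
version: 2

models:
  - name: stg_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null</code></pre>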
The trick is that if the query returns results, the test has failed.</li><li><strong>seeds</strong> — a way to quickly ingest static or reference files defined in CSV.</li><li><strong>incremental models</strong> — a way to define incremental models with an if/else Jinja syntax. Here is the <a href="https://docs.getdbt.com/docs/build/incremental-models?ref=blef.fr">reference</a>. You can choose the strategy you want depending on your adapter (cf. <a href="https://towardsdatascience.com/two-completely-different-types-of-dbt-incremental-models-in-bigquery-db794cbe022c?ref=blef.fr">examples on BigQuery</a>).</li><li><strong>snapshots</strong> — this is how you do slowly changing dimensions, a methodology designed more than 20 years ago to optimise the storage used. The <a href="https://docs.getdbt.com/docs/build/snapshots?ref=blef.fr">dbt snapshot page is the best illustration</a> I know of the SCD.</li><li><strong>macros</strong> — a way to create re-usable functions.</li><li><strong>docs</strong> — in dbt you can add metadata on everything; some of the metadata is already expected by the framework and thanks to it you can generate a small web page with your light catalog inside: you only need to run <code>dbt docs generate</code> and <code>dbt docs serve</code>.</li><li><strong>exposures</strong> — a way to define downstream data usage.</li><li><strong>metrics</strong> — in your modelisation you mainly create dimensions and measures; in dbt you can then define metrics, which are measures grouped by dimensions. The idea is to use metrics downstream to avoid materialising everything. 
You can read my <a href="https://www.blef.fr/metrics-store/">What is a metrics store</a> article to help you understand.</li><li><strong>analyses</strong> — a place to store queries that are either not finished or that you don't want to add to the main modelisation.</li></ul><p>You can read <a href="https://docs.getdbt.com/docs/build/projects?ref=blef.fr">dbt's official definitions</a>.</p><div class="kg-card kg-callout-card kg-callout-card-red"><div class="kg-callout-emoji">⚠️</div><div class="kg-callout-text">I feel it is important to mention again that dbt Core is a framework to organise SQL files and <strong>not a scheduler that can run your transformations on a fixed schedule out of the box</strong>.<br><br>Also, dbt only does a pass-through to your underlying data compute technology; there isn't any kind of processing within dbt. Actually dbt can be seen as an orchestrator with no scheduling capabilities.</div></div><p></p><h1 id="analytics-engineering">Analytics engineering</h1><p>dbt is becoming a popular framework, partly because it is extremely usable. A lot of companies have already picked dbt or aim to. There are multiple technological reasons for this, but technology is rarely the real reason. I think the reasons dbt is becoming the go-to are mainly organisational:</p><ul><li>dbt is a complete tool that you can give to analytics teams; it can become their single playground. Within it they can do almost everything.</li><li>The network effect. Because more and more companies are betting on it, there will be more and more trained people in the market. It's also a strategic choice when it comes to being able to hire people. </li><li>The documentation, as I said earlier, is top-notch.</li></ul><p>dbt Labs also popularised the analytics engineer role. We can quickly summarise the role as sitting in between the data engineer and the data analyst. 
But because companies can have very different definitions of roles, <strong>I'd say that analytics engineering is the practice of creating a data model that accurately represents the business and that is optimised for a variety of downstream consumers</strong>. So the analytics engineers are the ones doing this.</p><p>Given the position and the freshness of this role, people are coming into analytics engineering from data analytics. Usually they don't have a lot of software engineering knowledge and good practices, which is understandable, but the dbt framework is also meant to bring this to the table.</p><p>It is also fair to say that dbt as a tool is very easy to use, and very often the complexity of dbt usage will lie in the SQL writing rather than in the tool itself. There are also a few questions in terms of <a href="https://docs.getdbt.com/guides/best-practices/how-we-structure/1-guide-overview?ref=blef.fr">project structure</a> that need to be answered.</p><div class="kg-card kg-button-card kg-align-center"><a href="https://www.blef.fr/#/portal/signup/free" class="kg-btn kg-btn-accent">Subscribe to the blef.fr newsletter ❤️</a></div><p><em>If you like this article you should subscribe to my weekly newsletter so you don't miss any other articles of this kind.</em></p><h1 id="resources">Resources</h1><p>As I only want to help you get started with the concepts, I now want to redirect you to other articles that I find relevant to go deeper:</p><ul><li><strong>dbt annual conferences </strong>— Every year dbt Labs holds its annual conference, called Coalesce, which features a lot of dbt users and use cases. I've covered the last two with takeaways: <a href="https://www.blef.fr/dbt-coalesce-takeaways/">Coalesce 2021</a> and <a href="https://www.blef.fr/dbt-coalesce-takeaways-2022/">Coalesce 2022</a>. 
In these articles there are a lot of cool presentations you should watch to understand more deeply how dbt works.</li><li><a href="https://docs.google.com/presentation/d/1MKjgNU_2hpq0XalSJAE8FmDATfxfJtu6jZiC8ZrekPc/edit?ref=blef.fr#slide=id.g13de222be64_0_0">Introduction slides about dbt</a> — This is a presentation I often give; you can also watch <a href="https://www.youtube.com/watch?v=Wsl9ExQBgyE&ref=blef.fr">a talk I gave in French</a>, and there is also a <a href="https://www.youtube.com/watch?v=8FZZivIfJVo&ref=blef.fr">great introduction by Seattle Data Guy</a> that I recommend.</li><li>You can do tests in dbt — like: <a href="https://medium.com/hiflylabs/environment-dependent-unit-testing-in-dbt-c081a0a5ff1e?ref=blef.fr">environment-dependent unit testing in dbt</a>, <a href="https://www.datafold.com/blog/7-dbt-testing-best-practices?ref=blef.fr">7 dbt testing best practices</a> or <a href="https://www.synq.io/blog/the-complete-guide-to-building-reliable-data-with-dbt-tests?ref=blef.fr">a guide to building reliable data with dbt tests</a>.</li><li>You should get inspiration from other dbt projects — <a href="https://build.thebeat.co/data-build-tool-dbt-the-beat-story-a5c09471cf66?ref=blef.fr">dbt @Beat</a>, <a href="https://medium.com/vimeo-engineering-blog/dbt-development-at-vimeo-fe1ad9eb212?ref=blef.fr">dbt @Vimeo</a>, <a href="https://medium.com/@imweijian/lessons-learned-after-1-year-with-dbt-a7f0ccf85b12?ref=blef.fr">dbt @ShopBack</a>.</li><li>Optimisation — An issue with dbt is that everything runs in SQL, which means you'll have to optimise a lot of things. 
The dbt Labs team wrote about an <a href="https://docs.getdbt.com/blog/how-we-shaved-90-minutes-off-model?ref=blef.fr">optimisation of a long-running model</a>.</li><li><a href="https://maxhalford.github.io/blog/dbt-ref-rant/?ref=blef.fr">A rant against dbt ref</a> — A great article to make you think about dbt principles.</li><li><a href="https://medium.com/@oravidov/dbt-observability-101-how-to-monitor-dbt-run-and-test-results-f7e5f270d6b6?ref=blef.fr">How to monitor dbt models</a>.</li><li><a href="https://medium.com/snowflake/dbt-constraints-automatic-primary-keys-unique-keys-and-foreign-keys-for-snowflake-d78cbfdec2f9?ref=blef.fr">Generate database constraints with dbt</a>.</li><li>🧑‍🏫 Online courses — I haven't tried any of the courses I'll recommend, but from the background of the mentors I think they are very relevant. There is first a CoRise course, "<a href="https://corise.com/course/data-modeling?ref=blef.fr">Data modeling for the modern data warehouse</a>", that lightly covers dbt and mainly teaches how to do data modeling, and the <a href="https://analyticsengineers.club/?ref=blef.fr">analytics engineers club</a>, which sells a training program to go "from analysts to engineer" in 10 weeks, taught by an ex-dbt Labs employee. You can also contact me if you want something more personalised.</li></ul> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 23.08 ]]></title>
                    <description><![CDATA[ Data News #23.08 — A bit of infrastructure, analytics dev experience, measure everything and usual fast news. ]]></description>
                    <link><![CDATA[ /data-news-week-23-08/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 63f88954e50e38004d9f5381 ]]></guid>
                    <pubDate><![CDATA[ 2023-02-24 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/02/image-7.png" class="kg-image" alt loading="lazy" width="2000" height="1334" srcset="https://www.blef.fr/content/images/size/w600/2023/02/image-7.png 600w, https://www.blef.fr/content/images/size/w1000/2023/02/image-7.png 1000w, https://www.blef.fr/content/images/size/w1600/2023/02/image-7.png 1600w, https://www.blef.fr/content/images/size/w2400/2023/02/image-7.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>Data engineering team moving data manually (<a href="https://unsplash.com/photos/J3W7Kfcj6gM?ref=blef.fr">credits</a>)</figcaption></figure><p>Dear readers, I hope you had a great week. Each time I look back and see the number of Fridays I've spent reading and writing, I'm still surprised. For the last 2 newsletters I've tried to ask you for paid support. From the number of people who actually paid, I can see that I failed either to word it correctly or to propose a newsletter where you see enough value to pay for it.</p><p><strong>In any case, I need your honest feedback: what would make you consider paying for the content I create?</strong></p><p>This is something I struggle with. I really like writing, I really like this newsletter, I really like the blog, but it takes me one day per week. If I want to continue for years I have to find a way to make it sustainable for me, and if I want to go further in this direction I have to find a model that works. I'm <a href="mailto:christophe@blef.fr">open to all honest feedback</a>.</p><p></p><h1 id="a-bit-of-infrastructure">A bit of infrastructure</h1><p>This week I've seen a lot of articles that I can put under the infrastructure category, so here we are. 
The current data landscape is heavily dependent on infrastructure; whether it's cloud, on-premise or somewhere in between, we need to understand where the data lands and where the code runs.</p><p>First Bucky gave his <a href="https://www.kleinerperkins.com/perspectives/infrastructure-in-23/?ref=blef.fr">thoughts about the state of infra in 2023</a>. In a nutshell: JavaScript is the future of everything—we've said it for years, you write once and you run it everywhere, in the browser, on servers; workflow systems are a key piece of every software architecture, as we have a fragmentation of tooling and we want to run tasks one after the other, which means we need something to orchestrate them; finally, the OLAP databases are evolving into something different with many more features.</p><p>In order to improve your data infra you should try to <a href="https://medium.com/geekculture/why-you-should-occasionally-kill-your-data-stack-613143c986ea?ref=blef.fr">occasionally kill your data stack</a>; chaos engineering is something that helps discover issues. Monte Carlo also wrote this week about <a href="https://www.montecarlodata.com/blog-chaos-data-engineering-manifesto/?ref=blef.fr">chaos engineering</a>, with a manifesto.</p><p>When it comes to data storage, the real-time ecosystem has also changed a lot in the last few years and a lot of tooling has come out to ease the burden of managing Kafka clusters; <a href="https://materialize.com/blog/materialize-architecture?ref=blef.fr">Materialize—a real-time platform—detailed their architecture</a>. 
But if you want to keep using the underlying tools, here is an <a href="https://medium.com/@DavidElvis/apache-flink-101-understanding-the-architecture-3a36325035f3?ref=blef.fr">overview of the Flink architecture</a> or a few <a href="https://medium.com/@kestra-io/techniques-you-should-know-as-a-kafka-streams-developer-32442ac39925?ref=blef.fr">techniques you should know as a Kafka streams developer</a>.</p><p>Finally Whatnot shared how they migrated their <a href="https://medium.com/@whatnotengineering/signed-sealed-delivered-its-shipped-ee5befc4bcba?ref=blef.fr">CD processes to ArgoCD</a> and <a href="https://medium.com/pinterest-engineering/pinterest-is-now-on-http-3-608fb5581094?ref=blef.fr">Pinterest now uses HTTP/3</a>, which I didn't even know existed.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/02/image-8.png" class="kg-image" alt loading="lazy" width="800" height="555" srcset="https://www.blef.fr/content/images/size/w600/2023/02/image-8.png 600w, https://www.blef.fr/content/images/2023/02/image-8.png 800w" sizes="(min-width: 720px) 720px"><figcaption>Is it Kafka? (<a href="https://unsplash.com/photos/lRoX0shwjUQ?ref=blef.fr">credits</a>)</figcaption></figure><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li><a href="https://petrjanda.substack.com/p/modern-analytics-developer-experience?ref=blef.fr">The future analytics developer experience</a> — For a few months now we've seen articles complaining about the current analytics development experience. Often they are right. At the moment the best way to develop in your data warehouse is still the query editor of BigQuery / Snowflake / etc.; even if we have tools trying to provide a great experience, as Petr says in the article, we still lack something. 
I hope it will change.</li><li><a href="https://blog.dashlane.com/measuring-b2b-customer-satisfaction-how-dashlane-leverages-a-unique-data-driven-approach/?ref=blef.fr">Measuring B2B customer satisfaction</a> — The Dashlane team shares how they measure customer satisfaction. I really like the KPI framework they put in place and how it translates into charts. </li><li><a href="https://eventuallycoding.com/en/2023/02/measuring-everything?ref=blef.fr">Measuring everything</a> — This post is a proposition and a signal that you should measure absolutely everything to understand what is happening in your product. This goes further than being a <a href="https://www.linkedin.com/pulse/data-dead-why-data-driven-enterprise-doa-wouter-van-aerle%3FtrackingId=mq%252BX%252BHKiTuevcoxh0QpWew%253D%253D/?trackingId=mq%2BX%2BHKiTuevcoxh0QpWew%3D%3D&ref=blef.fr">data-driven enterprise</a>: you have to put in place a framework that puts data measurement into every product choice, resulting in increased maturity.</li><li><a href="https://hubertdulay.substack.com/p/stream-processing-vs-real-time-olap?ref=blef.fr">Stream processing vs real-time OLAP vs streaming database</a> — The data storage + compute field is slowly becoming a mess: a lot of technologies that are so close yet so far apart at the same time. Hubert tries to explain the real-time category.</li><li><a href="https://blog.devgenius.io/data-engineers-and-kubernetes-do-you-really-need-to-know-it-all-4eb81ee48ee7?ref=blef.fr">Data Engineers and Kubernetes</a> — A 101 guide to Kubernetes concepts and why, as a data engineer, you should understand basic Kubernetes entities. 
</li><li><a href="https://www.startdataengineering.com/post/code-patterns/?ref=blef.fr">Coding patterns in Python</a> — Start Data Engineering is one of the best data engineering blogs, and this time he proposes a few patterns you might need to implement in Python when building data pipelines.</li><li><a href="https://www.etsy.com/codeascraft/scaling-etsy-payments-with-vitess-part-1--the-data-model?utm_source=OpenGraph&utm_medium=PageTools&utm_campaign=Share">Etsy payments data model</a> — Articles are often about technologies and rarely about the actual data modeling; this time the Etsy team shared their thinking while re-modeling payments. Sadly this is more about transactional improvements and choices than analytics.</li><li><a href="https://louishourcade.github.io/aws-toucan-website/?ref=blef.fr">Shark attacks visualisation</a> — This is a great example of embedded analytics. Louis deployed a version of ToucanToco—a BI tool—using Redshift to visualise data about shark attacks. Surprisingly, 3 shark attacks in Italy were deadly; I'll be more careful next time I swim in the Mediterranean.</li></ul><p></p><h1 id="data-economy-%F0%9F%92%B0">Data Economy 💰</h1><ul><li><strong>Qbeast</strong> <a href="https://qbeast.io/qbeast-raises-e2-5m-to-make-data-lakes-fast-and-easy-to-use/?ref=blef.fr">raises €2.5m seed</a>. It is interesting to see that data lake platforms can still raise money in 2023. 
Qbeast proposes a different way to organise data to optimise query performance; still, it seems they use Spark.</li><li><a href="https://techcrunch.com/2023/02/21/openai-foundry-will-let-customers-buy-dedicated-capacity-to-run-its-ai-models/?guccounter=1&guce_referrer=aHR0cHM6Ly93d3cuZ29vZ2xlLmNvbS8&guce_referrer_sig=AQAAADK6RWfl9CX34b2E1bokCJapmssUubREhN3BxW8gWZq_zeTUFm3AWdtPmXUfbfGDjHn3xFoGV2_70Fmao7Cw9_ZrURTM8l1ZQpykti_Ex1lVpImicGvRl42CmdQE0-SykO5amA-KrX9L0hXXV3Zkb7y9E5Apps-80ye5sNqG-kxZ&ref=blef.fr">OpenAI new strategy</a> — Someone on Twitter reported that OpenAI privately announced a product called Foundry that would enable customers to run OpenAI models on dedicated capacity with full control over the model configuration and profile.</li></ul><hr><p>See you next week ❤️ — small edition today, blank page issues 🫠</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 23.07 ]]></title>
                    <description><![CDATA[ Data News #23.07 — What&#39;s DataOps, decrease ETL costs with Arrow, the case for being biased, data validation framework... ]]></description>
                    <link><![CDATA[ /data-news-week-23-07/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 63ef59136a405f003d38f316 ]]></guid>
                    <pubDate><![CDATA[ 2023-02-18 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/02/image-4.png" class="kg-image" alt loading="lazy" width="2000" height="1335" srcset="https://www.blef.fr/content/images/size/w600/2023/02/image-4.png 600w, https://www.blef.fr/content/images/size/w1000/2023/02/image-4.png 1000w, https://www.blef.fr/content/images/size/w1600/2023/02/image-4.png 1600w, https://www.blef.fr/content/images/size/w2400/2023/02/image-4.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>When the Data News lands on Saturday (<a href="https://unsplash.com/photos/msC14JchkKU?ref=blef.fr">credits</a>)</figcaption></figure><p>In last week's newsletter I also shared what a metrics store is, which led to a longer edition than usual, and I saw that a few people did not like it this way. It was an experiment; I'll see in the future how I can do it better. Still, <a href="https://www.blef.fr/metrics-store/">what is a metrics store?</a> You can check out the post extracted from the newsletter. </p><p>On the same topic this week Pierre shared <a href="https://medium.com/plum-living/building-a-semantic-layer-in-preset-superset-with-dbt-71ee3238fc20?ref=blef.fr">how to create a semantic layer in Preset</a>—<em>i.e.</em> managed Apache Superset. To do so, he first defines metrics within dbt and then, thanks to CI/CD, pushes the metrics definitions to Preset. This is a great example of a simple way to push down metrics to visualisation tools.</p><p></p><h1 id="is-dataops-really-a-thing">Is DataOps really a thing?</h1><p>Last year DataOps was used in many different ways to describe many different data-related tasks. When you look deeply at it, some companies put just generic data work behind the DataOps word. 
Which is a bit misleading when you read that <a href="https://servian.dev/the-real-definition-of-dataops-9016ccee2f1b?ref=blef.fr"><em>DataOps is "DevOps for data"</em></a>—because, with everything it wraps, DevOps is something different from software engineering. </p><p>I personally do share this perspective. Data engineering is mainly software engineering applied to data, or at least we try. If we see it this way, it is logical to say that DataOps is the movement to smooth the operations side, which technically means the infrastructure side—the IT, as previous generations said; I don't like "IT", it makes me feel old. Data engineering is also an infrastructure-heavy field with a lot of technologies to put together to create something that works. This is why DataOps is important. This is why Infrastructure as Code is mandatory.</p><p>To me it stops here; all the marketing derivations saying we do data products using a DataOps methodology are just marketing. Actually you are just writing code applied to data and using Docker containers to deploy it in the cloud. I think we should stick to software engineering vocabulary.</p><p>It also means that the <a href="https://siliconangle.com/2023/02/10/evolving-role-data-engineer/?ref=blef.fr">data engineer role is constantly evolving</a>. Especially with the appearance of the analytics engineer role. Analytics engineers are taking tasks off data engineers' plates—which is for the better, tbh. Data engineers will have to focus more on software and on infrastructure, shifting the expertise. Analytics engineers will become the data modeling experts. Data engineers will own the infrastructure side and the software related to the data team—which is already too broad a field with different ownerships (DS, MLE, etc.).</p><p>In the end, when I deploy data apps I end up writing Dockerfiles with CI/CD processes and looking for cloud services to host my containers. 
If this is not DevOps, what is it?</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/02/image-5.png" class="kg-image" alt loading="lazy" width="2000" height="1335" srcset="https://www.blef.fr/content/images/size/w600/2023/02/image-5.png 600w, https://www.blef.fr/content/images/size/w1000/2023/02/image-5.png 1000w, https://www.blef.fr/content/images/size/w1600/2023/02/image-5.png 1600w, https://www.blef.fr/content/images/size/w2400/2023/02/image-5.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>I do stuff in prod (<a href="https://unsplash.com/photos/zWOgsj3j0wA?ref=blef.fr">credits</a>)</figcaption></figure><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li><a href="https://www.castordoc.com/blog/three-faces-of-documentation?ref=blef.fr">Unveiling the three faces of documentation</a> — Practical advice about data documentation and how you can leverage it through 3 main axes: assets knowledge, business knowledge and team onboarding.</li><li><a href="https://www.databricks.com/blog/2023/02/14/announcing-a-native-visual-studio-code-experience-for-databricks.html?ref=blef.fr">Databricks announced a VS Code extension</a> — This is small news, but still interesting to see an all-in-one platform like Databricks going in this direction, providing end-users an extension that supports their way of writing code rather than the vendor's. 
</li><li>📺 <a href="https://open.spotify.com/episode/5a9ONFThoYut90H3wxU3zH?ref=blef.fr">Understanding the business as a data analyst</a> — A podcast about the privileged business position data analysts have, but also their responsibility to understand and model the business correctly in order to provide the best value to data users.</li><li><a href="https://medium.com/@rcpassos/how-i-decreased-etl-cost-by-leveraging-the-apache-arrow-ecosystem-37b6d076bd54?ref=blef.fr">Decrease ETL costs with Apache Arrow</a> — I've often written data extractions with pandas using <code>pd.read_sql</code> because it's super handy and you can have something that works quickly, but the memory cost can be high. This article shows how you can do it with Polars, which leverages Arrow to use less memory.</li><li><a href="https://blog.picnic.nl/deploying-data-pipelines-using-the-saga-pattern-ffc1cbe29cee?ref=blef.fr">Deploying data pipelines using the Saga pattern</a> — When you enter the real-time world, the way you think about data pipelines is a bit different, and it can be overwhelming when you come from the batch world. The Saga pattern is a pattern meant to ensure consistency in the system first. Here Picnic showcases the usage of dead letter queues. </li><li><a href="https://benn.substack.com/p/the-case-for-being-biased?ref=blef.fr">The case for being biased</a> — It's been a long time since I last featured Benn's posts; they are still awesomely written. It responds well to "<a href="https://www.winwithdata.io/p/analytics-is-not-about-data-its-about?ref=blef.fr">Analytics is not about data. It's about truth</a>", which I shared last week. Benn thinks about the role of a data team in the business decision-making journey.</li><li><a href="https://dropbox.tech/infrastructure/balancing-quality-and-coverage-with-our-data-validation-framework?ref=blef.fr">Balancing quality and coverage with our data validation framework</a> — The Dropbox tech team developed a data validation framework in SQL.
The validation runs as an Airflow operator every time new data has been ingested. In terms of design, only one query runs—for performance reasons—and if the query returns anything other than zeros, it means something is going wrong. This validation process is also a staging step before sending a table to production.</li><li><a href="https://chengzhizhao.medium.com/i-built-a-game-for-data-visualization-with-streaming-data-fe05ce6018f?ref=blef.fr">I built a game for data visualization with streaming data</a> — Fun project. How to use streaming data to create a real-time JavaScript visualisation as a video game.</li><li>Pedram developed a <a href="https://github.com/PedramNavid/dbtpal?ref=blef.fr">NeoVim extension for dbt users</a>. If you're not familiar with Vim or NeoVim, <a href="https://www.freecodecamp.org/news/vim-language-and-motions-explained/?ref=blef.fr">Simon explained what Vim is</a>, and why it is more than an editor.</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/02/image-6.png" class="kg-image" alt loading="lazy" width="2000" height="1125" srcset="https://www.blef.fr/content/images/size/w600/2023/02/image-6.png 600w, https://www.blef.fr/content/images/size/w1000/2023/02/image-6.png 1000w, https://www.blef.fr/content/images/size/w1600/2023/02/image-6.png 1600w, https://www.blef.fr/content/images/size/w2400/2023/02/image-6.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>There is a village called Vim in Indonesia—originally Vim stands for vi iMproved (<a href="https://unsplash.com/photos/vOTBmRh3-7I?ref=blef.fr">credits</a>)</figcaption></figure><p></p><h1 id="data-economy-%F0%9F%92%B0">Data Economy 💰</h1><ul><li><a href="https://www.synq.io/blog/europe-data-salary-benchmark-2023?ref=blef.fr">Europe data salary benchmark 2023</a> — Mikkel has become one of the best in Europe at picturing the data field accurately, by running benchmarks and studies across the whole market.
This time he is looking at salaries. To me, as a French person, the craziest number is seeing that senior positions—5+ years—in Europe are compensated with six figures.</li><li>Side note: this week I realised that <a href="https://duckdblabs.com/?ref=blef.fr">DuckDB Labs</a> is the team behind DuckDB, not MotherDuck, which partnered with them to bring the duck technology to everyone.</li></ul><hr><p>See you next week.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ What is the metrics store ]]></title>
<description><![CDATA[ What is the metrics store? What are the key differences with the metrics layer or the semantic layer? ]]></description>
                    <link><![CDATA[ /metrics-store/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 63e6a250ee02be003d514a53 ]]></guid>
                    <pubDate><![CDATA[ 2023-02-13 ]]></pubDate>
                    <content>
<![CDATA[ <p>This week <a href="https://www.getdbt.com/blog/dbt-acquisition-transform/?ref=blef.fr">dbt Labs announced the intention to acquire Transform</a>. While you should already be aware of what dbt is, there are still unknowns about what Transform is. Transform is a company founded by ex-Airbnb employees—which is important here—that proposes an open-source metrics framework and a SaaS metrics store. </p><p>At the moment Transform is a small company compared to dbt Labs: only 40 employees according to LinkedIn, and they raised around $25m. That is only 10% of dbt Labs' current workforce. But I think this acquisition matters and will shape our data stacks.</p><p>In the past I've made jokes about the naming confusion the data field was in, especially with the following terms: semantic layer, metrics layer, metrics store, headless BI, feature store. This is what I want to demystify today. I've spent the whole day reading and watching content in this category and I want to help you understand what it means for us. As a side note, it's fair to say that I also wasn't a believer in the actual necessity of this infrastructure piece. After a full day of research I'm more into it, but we have to be careful.</p><p></p><h1 id="first-definitions">First, definitions</h1><p>Before going further I have to write down some definitions. These definitions are mine, and if you think I'm wrong I'd be more than happy to get your feedback. It is also super hard to have a universal definition across all vendors—as can be seen in this <a href="https://www.youtube.com/watch?v=Toqg0Yuz9b4&ref=blef.fr">discussion</a>.</p><ul><li><strong>Measure</strong> — a measure is a value on which we can do all sorts of computations (addition, multiplication, etc.); in a warehouse context we do aggregations on measures (sum, count, avg). A measure is often numerical, but not necessarily.
As an example, the <em>order price</em> is a measure.</li><li><strong>Dimension</strong> — a dimension is something that categorises a measure; it adds context to it. You can use a dimension to filter or group the data. For instance, the <em>order date</em> is a dimension.</li></ul>
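<p>To make these two definitions concrete, here is a minimal sketch using a hypothetical <code>orders</code> table in an in-memory SQLite database—the table and values are purely illustrative, not from any real warehouse:</p>

```python
import sqlite3

# Hypothetical orders table: order_price is the measure, order_date the dimension.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (order_id INTEGER, order_date TEXT, order_price REAL)")
con.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "2023-02-01", 10.0), (2, "2023-02-01", 25.0), (3, "2023-02-02", 40.0)],
)

# The measure is what gets aggregated (SUM); the dimension is what we group by.
rows = con.execute(
    "SELECT order_date, SUM(order_price) AS revenue "
    "FROM orders GROUP BY order_date ORDER BY order_date"
).fetchall()
print(rows)  # [('2023-02-01', 35.0), ('2023-02-02', 40.0)]
```

<p>In a real warehouse you would run the same <code>GROUP BY</code> directly in SQL; the point is only that the measure is the aggregated value while the dimension is what you group or filter by.</p>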
<aside class="gh-post-upgrade-cta">
    <div class="gh-post-upgrade-cta-content" style="background-color: #373f48">
                <h2>This post is for subscribers only</h2>
            <a class="gh-btn" data-portal="signup" href="#/portal/signup" style="color:#373f48">Subscribe now</a>
            <p><small>Already have an account? <a data-portal="signin" href="#/portal/signin">Sign in</a></small></p>
    </div>
</aside>
 ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 23.06 ]]></title>
                    <description><![CDATA[ Data News #23.06 — Understand the metrics store, Bard, migrate from Airflow to Dagster, lower Snowflake costs and data economy news. ]]></description>
                    <link><![CDATA[ /data-news-week-23-06/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 63e5fdb5ee02be003d513b1f ]]></guid>
                    <pubDate><![CDATA[ 2023-02-10 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/02/image-2.png" class="kg-image" alt="" loading="lazy" width="2000" height="1333" srcset="https://www.blef.fr/content/images/size/w600/2023/02/image-2.png 600w, https://www.blef.fr/content/images/size/w1000/2023/02/image-2.png 1000w, https://www.blef.fr/content/images/size/w1600/2023/02/image-2.png 1600w, https://www.blef.fr/content/images/2023/02/image-2.png 2000w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">This is what the metrics store inspires in me (</span><a href="https://unsplash.com/photos/JsdvKIcvAGo?ref=blef.fr"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p>Dear Data News friend, every week there is a bit of randomness about when this email will actually land in your mailbox—which, btw, breaks all the rules of newsletter writing. Yeah, you know, you have to get your readers used to a fixed schedule, which they can trust and bla, bla, bla. The good news is that at least with me you can trust that I have no schedule, except that you should get the newsletter on Friday or Saturday.</p><p>While I feel privileged to be able to send my thoughts to so many people every week, it takes me a significant amount of time to craft and write the newsletter. I ask you to consider supporting me by becoming a paying subscriber. Especially if you think, like me, that the newsletter is great.</p><div class="kg-card kg-button-card kg-align-center"><a href="https://www.blef.fr/#/portal/signup" class="kg-btn kg-btn-accent">Become a paid subscriber 💰</a></div><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li>News from the generative AI universe — Google announced <a href="https://blog.google/technology/ai/bard-google-ai-search-updates/?ref=blef.fr">Bard</a>, a competitor to ChatGPT, but with better ethics, etc.
At the same time, Microsoft opened the <a href="https://www.theverge.com/2023/2/8/23590873/microsoft-new-bing-chatgpt-ai-hands-on?ref=blef.fr">ChatGPT integration with Bing</a> in beta. Closer to us, in the data space Hex proposed a <a href="https://hex.tech/blog/magic-private-beta/?ref=blef.fr">prompt that can do magic for you</a>.</li><li><a href="https://motherduck.com/blog/big-data-is-dead/?ref=blef.fr">Big Data is Dead</a> — A retrospective on why we no longer need as much computing power as before. Obviously the article is biased because it comes from DuckDB's mother company. As a reminder, DuckDB runs on a single node, fitting all compute in memory. But the article is relevant nonetheless.</li><li><a href="https://dagster.io/blog/dagster-airflow-migration?ref=blef.fr">Migrating from Airflow to Dagster is now a breeze</a> — In the orchestration competition Dagster made a step forward: they developed tooling to ease the migration from one to the other, and one side-effect is that you can orchestrate Dagster DAGs from Airflow. In order to understand the Dagster philosophy you should <a href="https://askvinnie.substack.com/p/now-youre-thinking-with-assets?ref=blef.fr">now think with assets</a>.</li><li><a href="https://medium.com/ovrsea/data-analytics-framework-in-python-from-scientific-approach-to-actionable-implementation-d47737382769?ref=blef.fr">Data Analytics framework in Python: from scientific approach to actionable implementation</a> — A framework to conduct data analysis in Python.</li><li><a href="https://medium.com/the-prefect-blog/should-you-measure-the-value-of-a-data-team-95c447f28d4a?ref=blef.fr">Should you measure the value of a data team?</a> — Considerations about measuring the job a data team is doing and which metrics you should go for.</li><li><a href="https://www.winwithdata.io/p/analytics-is-not-about-data-its-about?ref=blef.fr">Analytics is not about data.
It's about truth.</a> — This one is a hot take, because what's the truth?</li><li><a href="https://engineeringblog.yelp.com/2023/01/rebuilding-a-cassandra-cluster-using-yelps-data-pipeline.html?ref=blef.fr">Rebuilding a Cassandra cluster using Yelp’s Data Pipeline</a> — It's awesome when we can use our data engineering skills not only to do analytics but also to help fellow tech teams with tasks that are hard to do.</li><li><a href="https://stemma.webflow.io/blog-post/how-to-fix-your-etl-to-lower-snowflake-costs?ref=blef.fr">How to fix your ETL to lower Snowflake Costs</a> — Mark shares 3 Snowflake queries that you can run to get table usage in order to identify what costs a lot.</li><li><a href="https://www.dataengineeringpodcast.com/six-year-retrospective-episode-361?ref=blef.fr">Reflecting on the past 6 years of data engineering</a> — This is a podcast episode (which I did not listen to for lack of time).</li><li><a href="https://www.synq.io/blog/the-complete-guide-to-building-reliable-data-with-dbt-tests?ref=blef.fr">The complete guide to building reliable data with dbt tests</a> — 10 practical points to improve your dbt tests.</li></ul><p></p><h1 id="data-economy-%F0%9F%92%B0">Data Economy 💰</h1><ul><li><a href="https://www.acceldata.io/?ref=blef.fr"><strong>Acceldata</strong></a> <a href="https://www.acceldata.io/newsroom/acceldata-raises-fifty-million-in-series-c-funding?ref=blef.fr">raises $50m in Series C</a>. Acceldata looks like an enterprise data observability tool that does everything other data observability tools are doing. Like drawing charts that show that you probably have issues 🫠.</li><li>Recently the Kafka company (Confluent) acquired the Flink company (Immerok); economically it means a lot and <a href="https://hubertdulay.substack.com/p/the-stream-processing-shuffle?ref=blef.fr">reshuffles company strategies</a>.
In addition, RisingWave also shared views on <a href="https://www.risingwave-labs.com/blog/Rethinking_stream_processing_and_streaming_databases/?ref=blef.fr">why you probably need a stream processing system</a>.</li><li><a href="https://thebuilderjr.substack.com/p/why-big-tech-companies-need-so-many?ref=blef.fr">Why big tech companies need so many people</a> — this is a good economic question. For instance, Twitter should be easy to copy. Why do they need thousands of engineers to develop a website that I could re-develop over a weekend?</li><li><a href="https://www.getdbt.com/blog/dbt-acquisition-transform/?ref=blef.fr">dbt Labs intends to acquire Transform</a>. I just put this here for people who do not read the first part of the newsletter 🫠.</li></ul><hr><p>See you next week ❤️</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 23.05 ]]></title>
                    <description><![CDATA[ Data News #23.05 — machine learning at big tech, Airflow in Azure, think in SQL, dbt and snowflake clones, generative Seinfeld. ]]></description>
                    <link><![CDATA[ /data-news-week-23-05/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 63db499fc33306003db24243 ]]></guid>
                    <pubDate><![CDATA[ 2023-02-03 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/02/image.png" class="kg-image" alt loading="lazy" width="900" height="600" srcset="https://www.blef.fr/content/images/size/w600/2023/02/image.png 600w, https://www.blef.fr/content/images/2023/02/image.png 900w" sizes="(min-width: 720px) 720px"><figcaption>Delivering the data news (<a href="https://unsplash.com/photos/hE1MjkZQPpI?ref=blef.fr">credits</a>)</figcaption></figure><p>Hey you, it's already February. Every week, the same analysis for me: I plan too many tasks but I slowly deliver. I guess that's how it is. Still, I love this Friday <em>rendezvous</em> that we have together. I'm still amazed by how I changed my old habits to add writing to my workflow. And it brings me a lot of joy.</p><p>This is also funny because I don't consider newsletter writing to be work. Which is maybe a bit stupid, but when I work on the newsletter I upskill myself, I read, I discover stuff, I meet with people. Still, the newsletter takes 1 day per week to be done, which is significant enough to call it work. I wish everyone finds that little thing that is actually work but makes work feel less like work.</p><p>I'd like to write more about my time organisation and especially about my freelancing activities, but today is a day where I have less time for the newsletter, so this is more an appetizer for later. Let's jump directly to the news.</p><p></p><h1 id="ml-friday-%F0%9F%A4%96">ML Friday 🤖</h1><ul><li><a href="https://netflixtechblog.com/discovering-creative-insights-in-promotional-artwork-295e4d788db5?ref=blef.fr">Netflix, discovering creative insights in promotional artwork</a> — That's probably the reason Netflix is now very conventional in terms of artwork. The article shows how Netflix art creators use past data to create new artworks.
In the end this is a feedback loop, where everything ends up looking similar.</li><li><a href="https://tech.ebayinc.com/engineering/variable-hub-easier-data-integration-for-risk-decisioning/?ref=blef.fr">ebay, Variable Hub a data access layer for risk decisioning</a> — Looks like a feature store but for risk topics. The idea is to create a unified layer that stores all the data needed to take decisions.</li><li><a href="https://eng.lyft.com/powering-millions-of-real-time-decisions-with-lyftlearn-serving-9bb1f73318dc?ref=blef.fr">Lyft, powering millions of real-time decisions with LyftLearn Serving</a> — The architecture of the decentralized system Lyft uses to deploy and serve ML models.</li><li><a href="https://engineering.atspotify.com/2023/02/unleashing-ml-innovation-at-spotify-with-ray/?ref=blef.fr">Spotify, Unleashing ML Innovation at Spotify with Ray</a> — I've never used Ray in the past, but it looks promising as a unified way to describe machine learning pipelines no matter which framework you want to use.</li></ul><p>It is refreshing to see big tech machine learning articles that still look like the machine learning we were doing 2 years ago.</p><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li><a href="https://technically.substack.com/p/whats-the-modern-data-stack?ref=blef.fr">What's the Modern Data Stack?</a> — Another post about what the modern data stack is. The article is a good summary of the required blocks composing a modern data stack. You can also get inspired by <a href="https://medium.com/stuart-engineering/stuarts-data-journey-how-gustavo-and-the-bi-engineering-team-polished-the-t-in-elt-9f1c17402abe?ref=blef.fr">Stuart's modern data stack</a>.</li><li><a href="https://madisonmae.substack.com/p/analytics-engineer-a-glorified-bi?ref=blef.fr">Analytics Engineer- A Glorified BI Engineer?</a> — I feel guilty: I still think that Analytics Engineers are BI Engineers. But BI Engineers for modern data stack times.
In this post Madison tries to compare the two roles. In the end, the answer is: it depends. The Analytics Engineer role is still unclear and varies from company to company. What often stays is that the AE sits between DE and DA, so the role is often defined in complement to other positions. </li><li><a href="https://techcommunity.microsoft.com/t5/azure-data-factory-blog/introducing-managed-airflow-in-azure-data-factory/ba-p/3730151?ref=blef.fr">Microsoft Azure announced managed Airflow</a> — Starting this week you'll be able to launch Apache Airflow within Azure Data Factory. The feature is in public preview. The way they integrated it within Azure looks a bit weird, but at least it exists now.</li><li><a href="https://pedram.substack.com/p/streaming-data-pipelines-with-striim?ref=blef.fr">Change data capture with DuckDB</a> — Pedram had a sneak peek of the future: he tried a CDC setup (with Striim) that writes to GCS, with DuckDB computing metrics downstream.</li><li><a href="https://www.synq.io/blog/data-team-size-at-100-scaleups?ref=blef.fr">Data team as % of workforce</a> — Mikkel is a reference when speaking about data team size. This week he categorised companies by data team size as a % of workforce. For instance he found that Marketplace companies have bigger data teams than B2B ones. It makes sense.</li><li><a href="https://leerob.substack.com/p/databases-serverless-edge?ref=blef.fr">2023 state of databases for Serverless &amp; Edge</a> — I did not know the serverless database field was so innovative right now. All things considered this is a normal evolution: database connections are from an older time and web developers want direct access to databases.
It is interesting to see how serverless Postgres is going.</li><li><a href="https://towardsdatascience.com/think-in-sql-avoid-writing-sql-in-a-top-to-bottom-approach-476a67f53a59?ref=blef.fr">Think in SQL, avoid writing SQL in a top to bottom approach</a> — A nice post about the mismatch between the logical query processing order and the syntactic order of SQL queries. </li><li><a href="https://pub.towardsai.net/parquet-best-practices-the-art-of-filtering-d729357e441d?ref=blef.fr">Parquet best practices: the art of filtering</a> — How to leverage Parquet filtering to save processing time.</li><li><a href="https://blog.montrealanalytics.com/optimizing-dbt-development-with-snowflake-clones-9bce961db64d?ref=blef.fr">Optimizing dbt development with Snowflake clones</a> — dbt development in a large data warehouse can become expensive if you ask every dbt developer to <em>dbt run</em> the whole SQL tree. Montreal Analytics proposes a solution with Snowflake db clones. You can also use the dbt <a href="https://docs.getdbt.com/reference/node-selection/defer?ref=blef.fr">--defer</a> option, which does something similar.</li><li><a href="https://betterprogramming.pub/great-data-platforms-use-conventional-commits-51fc22a7417c?ref=blef.fr">What if we use CHANGELOG in our data projects?</a> — It is important to have a consistent nomenclature when naming commits and changes; sadly the same should apply to dashboards, but that is hard to do.</li><li><a href="https://medium.com/artefact-engineering-and-data-science/how-we-deployed-a-simple-wildlife-monitoring-system-on-google-cloud-78b847cab10c?ref=blef.fr">How we deployed a simple wildlife monitoring system on Google Cloud</a> — Artefact engineered a serverless platform on GCP for wildlife monitoring.</li><li>📺 <a href="https://www.twitch.tv/watchmeforever?ref=blef.fr">Seinfeld-like sitcom generated by AI 24/7 live on Twitch</a> — It is amazing how far we are able to go today in terms of content generation.</li></ul><figure
class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/02/image-1.png" class="kg-image" alt loading="lazy" width="2000" height="1500" srcset="https://www.blef.fr/content/images/size/w600/2023/02/image-1.png 600w, https://www.blef.fr/content/images/size/w1000/2023/02/image-1.png 1000w, https://www.blef.fr/content/images/size/w1600/2023/02/image-1.png 1600w, https://www.blef.fr/content/images/size/w2400/2023/02/image-1.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>Few Snowflake clones (<a href="https://unsplash.com/s/photos/snowflake-clones?ref=blef.fr">credits</a>)</figcaption></figure><p></p><h1 id="data-economy-%F0%9F%92%B0">Data Economy 💰</h1><ul><li><strong><a href="https://www.selectstar.com/?ref=blef.fr">Select Star</a></strong> <a href="https://www.businesswire.com/news/home/20230131005354/en/Select-Star-Raises-15-Million-in-Series-A-Funding-Led-by-Lightspeed-Venture-Partners?ref=blef.fr">raises $15m in Series A</a>. Select Star is another data catalog that automatically connects to your tools and provides the usual data catalog UI based on a search bar with metadata management inside. Nothing new under the sun.</li></ul><p></p><hr><p>See you next week ❤️.</p><p><em>PS: and sorry it was a fast data news today. I have a big presentation to prepare for Monday. I wish you a great weekend.</em></p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 23.04 ]]></title>
                    <description><![CDATA[ Data News #23.04 — GPT safe place here, dbt, Airflow, Dagster, data modeling and contracts, data creative people a lot of news. ]]></description>
                    <link><![CDATA[ /data-news-week-23-04/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 63d2973cdbf070003dba81eb ]]></guid>
                    <pubDate><![CDATA[ 2023-01-27 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/01/image-9.png" class="kg-image" alt loading="lazy" width="900" height="600" srcset="https://www.blef.fr/content/images/size/w600/2023/01/image-9.png 600w, https://www.blef.fr/content/images/2023/01/image-9.png 900w" sizes="(min-width: 720px) 720px"><figcaption>My view from the train window (<a href="https://unsplash.com/photos/OY5zbCCrWN4?ref=blef.fr">credits</a>)</figcaption></figure><p>Dear Data News readers, it's a joy to write this newsletter every week. We are slowly approaching its second birthday. <strong>In order to celebrate this together I'd love to receive your stories about data</strong>—they can be short or long, anonymous or not. This is an open box: just write me with what you have on your mind and I'll bundle an edition with it.</p><p>This is fun because I'm usually not someone who's good at keeping habits. Every week, to be honest, I get hit by Friday. I don't write in advance. Every week you get a taste of my current mood. I often try to sync my travels on Fridays; even if the internet is terrible on the train, it is still a good way to fill the 8+ hours of travel time I'm used to.</p><p>Today I take the following commitment:<strong> I will never use any generative algorithm to write something in the newsletter</strong>. Fun story: one year ago I had an intern working with me on the blog, to whom I had given the task of writing code that could learn from my writing and generate a Data News edition. One year later, different views. In ChatGPT times, my idea is just boring.</p><p>On the other side, at the moment I'm not really organised to check whether the articles I share have been entirely written by humans, but same shit, I'll do as much as I can to avoid sharing empty articles, like I've always done.
It might be a good use-case for <a href="https://gptzero.me/?ref=blef.fr">GPTZero</a>.</p><p>As a data professional, it is probably the height of irony to not want to use AI. But right now the field feels like when cryptocurrencies arrived: awesome raw ideas with sharks circling around, waiting for a new productivity high.</p><p><em>PS: last week I made a—bad—joke about Apache naming and a reader pointed me to an article about the <a href="https://blog.nativesintech.org/apache-appropriation/?ref=blef.fr">ASF and non-Indigenous appropriation</a>.</em></p><p>This is enough about my life, let's jump to the news.</p><p></p><h1 id="back-to-the-roots-a-few-engineering-articles">Back to the roots, a few engineering articles</h1><p>I did not know how to put these articles together, so here are a few loose articles. In my <a href="https://www.blef.fr/manage-and-schedule-dbt/">manage and schedule dbt</a> guide, in a nutshell, I say that dbt projects have 2 lifecycles. The first one is the development experience and the second is the dbt runtime. It means you have to run dbt somewhere:</p><ul><li>Jonathan proposed a <a href="https://github.com/jonathanneo/data-aware-orchestration?ref=blef.fr">creative way to do it in Dagster</a> — every dbt model is a software-defined asset, which means that the whole data chain is reactive and every model is refreshed on a trigger rather than on a cron-based schedule.</li><li>The Astronomer team developed an awesome library meant to translate a dbt DAG to an Airflow DAG: <a href="https://github.com/astronomer/astronomer-cosmos?ref=blef.fr">astronomer-cosmos</a>. You either have a DbtDag object or a DbtTaskGroup that dynamically creates an Airflow DAG from your dbt project. It looks very promising.
Cosmos reads dbt model files and does not use the manifest.</li></ul><p>In terms of data modeling, ThoughtSpot wrote about the <a href="https://www.thoughtspot.com/blog/data-modeling-best-practices-analytics-engineers?ref=blef.fr">best data modeling methods</a> and Chad—the pope of Data Contracts—wrote about <a href="https://dataproducts.substack.com/p/data-contracts-for-the-warehouse?ref=blef.fr">data contracts for the warehouse</a>; mainly it shifts the responsibility to data producers in order to enforce schemas and semantics, but in the data world this is sometimes rather a utopia. Producers are often software teams that, sadly, do not care about data teams.</p><p>Finally Noah shared how he <a href="https://noahlk.medium.com/dbt-how-we-improved-our-data-quality-by-cutting-80-of-our-tests-78fc35621e4e?ref=blef.fr">improved data quality by removing 80% of the tests</a> and Ronald proposed a <a href="https://medium.com/miro-engineering/writing-data-product-pipelines-with-airflow-1ace222f8f5a?ref=blef.fr">framework to create data products in Airflow</a>.</p><p></p><h1 id="data-people-are-creatives-%F0%9F%AA%84">Data people are creatives 🪄</h1><p><em>This is a new category that will appear in the next Data News editions. In this category I'll share stuff that we can do with data. The idea is to inspire others by promoting the end use-case rather than just the technology. I'll be more than happy to share what you do.</em></p><ul><li><a href="https://maxhalford.github.io/blog/airbnb-energy-usage/?ref=blef.fr">Are Airbnb guests less energy efficient than their host?</a> — Max tries to find out whether Airbnb guests' energy consumption is higher than their hosts'.
I'm always amazed by straight-to-the-point analyses like this.</li><li><a href="https://pandascore.co/blog/automated-object-localisation-in-esports-video-streams?ref=blef.fr">Automated object detection in CSGO</a> — PandaScore, a French company that generates data from public—and probably private—e-sports video streams, showcases how they used OCR to extract data from CSGO live streams. I did something similar last year on Teamfight Tactics.</li><li><a href="https://storage.googleapis.com/website-storage-bucket/docs/football-data-pipeline-doc.html?ref=blef.fr">Football data pipeline project</a> — This is more of a technical walk-through for building a Streamlit dashboard on the Premier League. Still, it is interesting.</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/01/image-10.png" class="kg-image" alt loading="lazy" width="2000" height="1333" srcset="https://www.blef.fr/content/images/size/w600/2023/01/image-10.png 600w, https://www.blef.fr/content/images/size/w1000/2023/01/image-10.png 1000w, https://www.blef.fr/content/images/size/w1600/2023/01/image-10.png 1600w, https://www.blef.fr/content/images/size/w2400/2023/01/image-10.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>This is us (<a href="https://unsplash.com/photos/oMpAz-DN-9I?ref=blef.fr">credits</a>)</figcaption></figure><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li><a href="https://airbyte.com/free-connector-program?ref=blef">Airbyte announced a free sync plan</a>. Starting today, the connectors that are in alpha and beta will be free to use in Airbyte Cloud. Only one side of the sync needs to be in alpha/beta for it to be free. Once GA, you'll have 2 weeks before being charged.</li><li>Earlier in January <a href="https://fivetran.com/docs/getting-started/consumption-based-pricing/2023-cbp-faq?ref=blef.fr">Fivetran also announced a free plan</a>.
Starting in February you will be able to sync up to 500k distinct rows for free, plus other perks.</li><li><a href="https://www.sqlalchemy.org/blog/2023/01/26/sqlalchemy-2.0.0-released/?ref=blef.fr">SQLAlchemy 2.0 released</a> — This is a major release with a lot of breaking changes. As I'm far from being an expert in SQLAlchemy, I can't say much more than that it seems to be a shiny new, better ORM.</li><li><a href="https://www.metaplane.dev/blog/announcing-data-test-previews-in-pull-requests?ref=blef.fr">Metaplane announced data test previews in pull requests</a> — This is a way to compare the SQL code in a PR to the live production data, to see directly in GitHub what has changed. It gives ideas.</li><li><a href="https://medium.com/snowflake/snowflake-min-by-and-max-by-aggregate-functions-8c0c7f30058e?ref=blef.fr">Snowflake released min_by and max_by functions</a> — With these new min/max functions you can, in a single select statement, get the first/last status for an id. This is a great shortcut.</li><li><a href="https://towardsdatascience.com/compare-tables-bigquery-1419ff1b3a2c?ref=blef.fr">How to compare two tables for quality in BigQuery</a> — Giorgios proposes a simple query to compare 2 tables in BigQuery. If you are a Snowflake user there is a <a href="https://docs.snowflake.com/en/sql-reference/operators-query.html?ref=blef.fr#minus-except">minus</a> operator that makes it even easier, and if you use dbt you can avoid this boilerplate by using the <a href="https://github.com/dbt-labs/dbt-utils?ref=blef.fr#equality-source">dbt_utils.equality</a> test.</li><li><a href="https://medium.com/@ivanreznikov/how-misused-terminology-is-damaging-the-data-field-28881a96c7f?ref=blef.fr">How misused terminology is damaging the data field</a> — The title is a bit exaggerated, and terminology gatekeeping damages the field even more. 
Actually in the end we all do stuff with data, right?</li><li><a href="https://engineering.zalando.com/posts/2023/01/how-you-can-have-impact-as-an-engineering-manager.html?ref=blef.fr">How you can have impact as an Engineering Manager</a> — Good question and good article. In a nutshell it's about your team and other teams, and how you interact with other people in terms of behaviour, processes and practices.</li></ul><p></p><h1 id="data-economy-%F0%9F%92%B0">Data Economy 💰</h1><ul><li>Microsoft <a href="https://openai.com/blog/openai-and-microsoft-extend-partnership/?ref=blef.fr">finally announced</a> their "multi-billion dollar" investment—probably $10b—in OpenAI. Nothing more to say, you might have guessed my opinion in the introduction.</li><li><a href="https://www.whalesync.com/?ref=blef.fr"><strong>whalesync</strong></a> <a href="https://www.whalesync.com/blog/announcing-our-1-8m-pre-seed-round?ref=blef.fr">raises $1.8m pre-seed</a> to create another connector-based data movement SaaS, with bidirectional connectors. The difference with similar products is the ability to also sync to Postgres; usually tools like this only do it between SaaS apps. They also enable automated web page creation for SEO, which is unrelated to the data movement business.</li><li><strong><a href="https://www.komprise.com/?ref=blef.fr">Komprise</a></strong> <a href="https://www.komprise.com/komprise-raises-37m-to-fuel-growth-and-advance-leadership-in-unstructured-data-management/?ref=blef.fr">raises $37m Series D</a> to build yet another all-in-one data platform that does everything about data.</li></ul><hr><p>See you next week ❤️.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 23.03 ]]></title>
                    <description><![CDATA[ Data News #23.03 — Looking for Airflow speakers, the current state of data, data modeling techniques, Airflow misconceptions, don&#39;t target 100% coverage. ]]></description>
                    <link><![CDATA[ /data-news-week-23-03/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 63ca96360a948f003d391891 ]]></guid>
                    <pubDate><![CDATA[ 2023-01-20 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/01/image-8.png" class="kg-image" alt loading="lazy" width="900" height="599" srcset="https://www.blef.fr/content/images/size/w600/2023/01/image-8.png 600w, https://www.blef.fr/content/images/2023/01/image-8.png 900w" sizes="(min-width: 720px) 720px"><figcaption>Summer is coming (<a href="https://unsplash.com/photos/wtBex4wQw60?ref=blef.fr">credits</a>)</figcaption></figure><p>Hey, new Friday, new Data News edition. I'm so happy to see new people coming every week. Thank you for every recommendation you make about the blog or the Data News. This kindness for my content gives me wings. </p><p>This week I don't want to be late, so let's start the weekly wrap-up. I got less inspired this week, which means a shorter edition.</p><p>As a side note, we are looking for speakers for a late February Airflow Meetup. Topics are still open, so whatever you want to share—it has to be related to Airflow at some point—we'll be happy to welcome you as a speaker.</p><p></p><h1 id="the-current-state-of-data">The current state of data</h1><p>This week Benjamin Rogojan livestreamed an online conference featuring awesome data voices: <a href="https://www.youtube.com/watch?v=j-gruNSEd80&ref=blef.fr">state of data infra</a>. Matt wrote his <a href="https://medium.com/@matt_weingarten/state-of-data-takeaways-e19570957a3e?ref=blef.fr">takeaways</a> on Medium about the conference. In parallel Ben released the <a href="https://seattledataguy.substack.com/p/the-state-of-data-engineering-part?ref=blef.fr">results of a survey about data infras</a> he ran among his followers. The main thing to notice is that the average company is a Finance company using Airflow with BigQuery, and they struggle—like you, probably—to hire people.</p><p>It is also time for my views about the state of data. 
After 2 years of running the newsletter, writing every week about trends and following "influencers" for you, I'm bored. If I'm being honest, I'm French and was probably born bored, but still. When I was a young professional I was so hyped by new technologies; right now it's harder for me. I personally feel that the data ecosystem is in an in-between state. In between the Hadoop era, the modern data stack and the machine learning revolution everyone—but me—waits for. But, funnily enough, in the end we are still copying data from database to database using CSVs, like 40 years ago.</p><p>If we go back to this week's articles:</p><ul><li>Matt Hawkins <a href="https://hotcrossjoin.substack.com/p/the-unhappy-marriage-of-data-stacks?ref=blef.fr">tried to find the origins of the term "modern data stack"</a>.</li><li>Pedram wrote about the <a href="https://www.datafold.com/blog/the-state-of-data-testing?ref=blef.fr">state of data testing</a> — at the end of the article—obviously, because it's on the Datafold blog—they share data-diff; still, the article is relevant around the four facets of data quality: accuracy, completeness, consistency and integrity.</li><li><a href="https://dev.to/apachedoris/a-glimpse-of-the-next-generation-analytical-database-5dob?ref=blef.fr">Apache Doris</a> — to me it sounds like a character from Nemo; actually it's the new real-time warehouse of the Apache Foundation.</li><li>There is an <a href="https://blog.devgenius.io/datahub-an-introduction-a418d442383c?ref=blef.fr">introduction post about DataHub</a> — when you look at what you have to run to launch a data catalog—4 components and 4 different data stores—don't be surprised if no one uses data catalogs. 
And to think that some people say Airflow is complex to launch.</li></ul><p>In a nutshell I just want to solve problems and empower people with what I build, and I don't care if my stack is a post-modern aquarium, I just want it to be blazingly boring.</p><p></p><h1 id="data-modeling-techniques">Data modeling techniques</h1><p>Data modeling is probably the most important skill of every data practitioner today. We don't really care about your role or your tools. This is about optimisation. Optimisation at different levels: it can be <a href="https://select.dev/posts/snowflake-range-join-optimization?ref=blef.fr">performance optimisation</a>, cost optimisation, <a href="https://moderndata101.substack.com/p/optimizing-data-modeling-for-the?ref=blef.fr">business understanding optimisation</a>. Yeah, <em>in fine,</em> optimisation<em>.</em></p><p>There are many techniques out there to do it; I don't want to enumerate them because that's not really the intention. Still, aim for simplicity, keep it simple stupid and think about your consumers.</p><p><em>PS: this feedback about the Medallion architecture—bronze, silver, gold—might be interesting for you.</em></p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/01/image-7.png" class="kg-image" alt loading="lazy" width="900" height="566" srcset="https://www.blef.fr/content/images/size/w600/2023/01/image-7.png 600w, https://www.blef.fr/content/images/2023/01/image-7.png 900w" sizes="(min-width: 720px) 720px"><figcaption>Perfect your modeling techniques (<a href="https://unsplash.com/photos/Xl-ilWBKJNk?ref=blef.fr">credits</a>)</figcaption></figure><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li><a href="https://medium.com/@datajuls/why-i-moved-my-dbt-workloads-to-github-and-saved-over-65-000-759b37486001?ref=blef.fr">Why I moved my dbt workloads to GitHub and saved over $65,000</a> — With the dbt Cloud price increase I already shared, 
companies started to look for innovative ways to run dbt. This time it is an example demonstrating that you can do it in GitHub Actions.</li><li><a href="https://medium.com/@henryweller/10-common-misconceptions-about-airflow-b5f86d9bc1e?ref=blef.fr">10 Common Misconceptions about Airflow</a> — Airflow has grown a lot, and users that lost faith in Airflow a while back will probably never come back. Still, this post tries to rehabilitate Airflow. In short, in recent Airflow versions it's easy to get started, the UI is great—and tbh it always has been—and the scheduler is stable.</li><li><a href="https://www.youtube.com/watch?v=beLo1BGcRpI&ref=blef.fr">Lights on Versatile Data Kit</a> — A YouTube video about a tool developed by VMware that is an alternative to dbt—yeah, sorry, this is the best way to define it.</li><li><a href="https://blog.dahl.dev/posts/data-engineering-interviews-in-stockholm/?ref=blef.fr">Data Engineering job market in Stockholm</a> — Alexander shared on his personal blog his job search in Sweden. Spoiler: out of 43 applications he got 6 offers. This is a short post but it describes his experience well.</li><li><a href="https://pudding.cool/2022/12/yard-sale/?ref=blef.fr">Why the super rich are inevitable</a> — Aside from the fact that we should <a href="https://en.wikipedia.org/wiki/Eat_the_rich?ref=blef.fr">eat the rich</a>, I just want to talk about the way the information is displayed. Alvin—the author—explains economic concepts with a scrollable visualisation and some simulations to help people understand them. I found it very pleasant and it looks like something data teams could do to package data analyses.</li><li><a href="https://medium.com/artefact-engineering-and-data-science/all-you-need-to-know-to-get-started-with-vertex-ai-pipelines-615e126ea00b?ref=blef.fr">All you need to know to get started with Vertex AI Pipelines</a> — Will people continue to do Data Science by themselves in 2023? 
Probably not like before, and with more APIs involved. For that you can follow this overview of Vertex AI—the Google Cloud Platform managed machine learning product.</li><li><a href="https://medium.com/teads-engineering/bigquery-ingestion-time-partitioning-and-partition-copy-with-dbt-cc8a00f373e3?ref=blef.fr">BigQuery Ingestion-Time Partitioning and Partition Copy With dbt</a> — Christophe from Teads wrapped up how they contributed to dbt 1.4 by adding ingestion-time partitioned table support for BigQuery.</li><li><a href="https://dev.to/antoinecoulon/dont-target-100-coverage-387o?ref=blef.fr">Don't target 100% coverage</a> — Yes. This is about JavaScript, but you can still send it to your boss who is asking for 100% coverage for data tests.</li><li><a href="https://emilie.substack.com/p/choose-your-adventure?ref=blef.fr">Choose your adventure</a>: <em>How changing how you spend your free time can genuinely make you feel like you have more of it and take care of your well-being.</em></li></ul><p></p><h1 id="data-economy-%F0%9F%92%B0">Data Economy 💰</h1><ul><li><a href="https://www.cumul.io/?ref=blef.fr"><strong>Cumul.io</strong></a> <a href="https://blog.cumul.io/2023/01/17/cumul-io-raises-e10m-series-a-funding-to-drive-confident-business-decisions-with-embedded-analytics/?ref=blef.fr">raises €10m Series A</a>. Embedded analytics is the capability to introduce Business Intelligence apps within "traditional" software platforms like SaaS applications or public websites. Cumul.io provides a complete SDK to integrate analytics in your app, either by doing it yourself or by letting your customers do it.</li><li>Lay-offs are continuing at big tech. Google and Microsoft announced respectively 6% and ~5% job cuts. According to <a href="https://layoffs.fyi/?ref=blef.fr">layoffs.fyi</a>, in January this year around 40k people got laid off in tech; that represents 25% of last year's total lay-offs—150k. 
<strong>If it happened to you recently, you can reach me, I'll do whatever I can do to help you</strong>.</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/01/image-6.png" class="kg-image" alt loading="lazy" width="2000" height="1125" srcset="https://www.blef.fr/content/images/size/w600/2023/01/image-6.png 600w, https://www.blef.fr/content/images/size/w1000/2023/01/image-6.png 1000w, https://www.blef.fr/content/images/size/w1600/2023/01/image-6.png 1600w, https://www.blef.fr/content/images/size/w2400/2023/01/image-6.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>Almost in time today (<a href="https://unsplash.com/photos/iwW9PaAmC3E?ref=blef.fr">credits</a>)</figcaption></figure><p></p><hr><p>See you next week ❤️.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 23.02 ]]></title>
                    <description><![CDATA[ Data News #23.02 — Switch from pandas to Polars, hiring processes, new age of machine learning, how query engines work and data economy. ]]></description>
                    <link><![CDATA[ /data-news-week-23-02/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 63c11fdb076542003dc0988b ]]></guid>
                    <pubDate><![CDATA[ 2023-01-14 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/01/image-5.png" class="kg-image" alt loading="lazy" width="2000" height="1313" srcset="https://www.blef.fr/content/images/size/w600/2023/01/image-5.png 600w, https://www.blef.fr/content/images/size/w1000/2023/01/image-5.png 1000w, https://www.blef.fr/content/images/size/w1600/2023/01/image-5.png 1600w, https://www.blef.fr/content/images/size/w2400/2023/01/image-5.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>Abandoned Pandas (<a href="https://unsplash.com/photos/e3icLEb-z-M?ref=blef.fr">credits</a>)</figcaption></figure><p>Hey. I'm having busy weeks; I'm sorry the Data News is coming on Saturday again. It is a bit hard to travel by train, work and write at the same time. Plus I'm a fast context switcher, so it piles up. Also, a few of you have sent me messages recently and I haven't answered yet—I see you and I did not forget you. Now that I'm back in Berlin it'll be easier.</p><p>Last week we organised the first Paris Airflow meetup of the year. It was a round table that I moderated with <a href="https://fromanengineersight.substack.com/?ref=blef.fr">Benoit Pimpaud</a>, <a href="https://medium.com/@pin.furcy?ref=blef.fr">Furcy Pin</a> and <a href="https://www.youtube.com/c/MarcLamberti?ref=blef.fr">Marc Lamberti</a>. We talked about the place of Airflow in 2023, the <a href="https://blog.fal.ai/the-unbundling-of-airflow-2/?ref=blef.fr">unbundling</a> of Airflow and the best way to run your Airflow DAGs today.</p><p>The discussion was in French and the recording will be released next week. In the meantime you can still check my article <a href="https://www.adventofdata.com/using-airflow-the-wrong-way/?ref=blef.fr">Using Airflow the wrong way</a> that summarizes a bit the operators vs. containers debate. 
During the meetup we did not talk about Airflow alternatives; currently Mage is the rising tool that everyone tries out <a href="https://chengzhizhao.com/is-apache-airflow-due-for-replacement-the-first-impression-of-mage-ai/?ref=blef.fr">as a replacement for Airflow</a>.</p><p>Enjoy the Data News.</p><p></p><h1 id="polars%E2%80%94pandas-are-freezing">Polars—Pandas are freezing</h1><p>Recently influencers have been betting that Rust will be the de-facto language in data engineering. History repeats itself: we've seen it with Scala, Go or even Julia at some scale. In the end Python and SQL are still here for good. But with Rust the approach is different. The idea is not to replace Python but to replace the underlying bindings that are used by Python libraries.</p><p>And it makes sense: for instance <a href="https://github.com/charliermarsh/ruff?ref=blef.fr">ruff</a>, a Python linter built in Rust, claims to be dramatically faster than the usual tools.</p><p>On the data processing side there is Polars, a DataFrame library that could replace pandas. Let's have a quick look at it. In this overview I'll not talk about performance because I don't have the time to do a proper benchmark—and I've never done one. Just the experience of a beginner who knows pandas very well.</p><p>The installation is pretty straightforward, you can do it with pip. Compared to pandas this is awesome because it seems Polars has no dependencies, so it does not need to build wheels like pandas.</p><pre><code class="language-bash">pip install polars</code></pre><p>Regarding the imports the documentation continues to treat me well. 
It looks like stuff I know from pandas.</p><pre><code class="language-Python">import polars as pl</code></pre><p>Then I can do my first CSV import; in the example I load a French railway open dataset about lost and found objects in stations.</p><pre><code class="language-Python">df = pl.read_csv("lost-objects-stations.csv", sep=";")</code></pre><p>Then you can use the same code as pandas to select the data (head, ["col"], etc.). Now I want to try a group by.</p><pre><code class="language-Python">df.groupby("Station").agg([pl.count()]).sort("count", reverse=True)
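
# A sketch on my side (not in the original snippet): the same query through
# Polars' lazy API, which lets Polars optimise the whole plan before running it.
# df.lazy().groupby("Station").agg([pl.count()]).sort("count", reverse=True).collect()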

# Same code but in pandas
df.groupby("Station")["Date"].count().sort_values(ascending=False)</code></pre><p>And lastly (because if I continue the newsletter is gonna be too long for you to read), I just try to convert a str Series to datetime.</p><figure class="kg-card kg-code-card"><pre><code class="language-Python">df = df.with_columns(
	df["Date"].str.strptime(pl.Datetime, fmt="%Y-%m-%dT%H:%M:%S%Z").alias("Date")
)
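
# Again a sketch of mine, assuming the parse succeeded: the typed column now
# supports datetime expressions, e.g. keeping only one year.
# df.filter(pl.col("Date").dt.year() == 2022)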

# Same code in Pandas
pd_df["Date"] = pd.to_datetime(pd_df["Date"], format="%Y-%m-%dT%H:%M:%S%Z", utc=True)</code></pre><figcaption>We can already see the performance difference here.</figcaption></figure><p>To be honest I tried Polars for 15 minutes and I can already see how I could switch to it if I have the guarantee it is way faster. The APIs are quite similar so I'm far from being lost.</p><p>🫠 If after this small introduction you want a deeper comparison of Polars you can check <a href="https://kevinheavey.github.io/modern-polars/?ref=blef.fr">Modern Polars</a> by Kevin Heavey or a 40-minute <a href="https://www.youtube.com/watch?v=kVy3-gMdViM&ref=blef.fr">YouTube video that explains Polars internals</a>.</p><p></p><h1 id="hiring-processes">Hiring processes</h1><p>The current state of the data market is weird. At the same time we have a lot of lay-offs and a lot of companies that are still looking for data folks—often a critical hire for them—but they struggle. There is a huge gap between jobs, what folks are looking for and what companies are looking for.</p><p>This week <a href="https://medium.com/teads-engineering/our-engineering-hiring-process-at-teads-bc2975141c15?ref=blef.fr">Teads shared their engineering hiring process</a>. The process is not focused entirely on data, but it is still relevant because it can give ideas to hiring companies or juniors looking for advice. They have a short 4-touchpoint interview process, which looks like a good compromise.</p><p>When focusing more on data, Galen wrote about <a href="https://towardsdatascience.com/what-i-look-for-in-every-data-analyst-candidate-7d05c52bb19e?ref=blef.fr">what he looks for in data analyst candidates</a>. One of the most interesting pieces of advice he gives, which I want to stress, is: you should spend time mastering the technologies you've chosen. With the current state of data it is easy to lose focus, so listen to him. Stop chasing the latest data trends and master what you use daily. 
I think that mastery in one domain can easily transfer to another domain.</p><div class="kg-card kg-callout-card kg-callout-card-green"><div class="kg-callout-emoji">❓</div><div class="kg-callout-text"><strong>Would you be interested in data job offers in the newsletter?</strong> <br><br>I would like to propose job offers that I personally validate—following an open checklist. Obviously companies would pay for this service and it would be a means for me to get something in return for the curation/writing work I do every week.</div></div><p></p><h1 id="ai-saturday">AI Saturday</h1><ul><li><a href="https://doordash.engineering/2023/01/10/how-doordash-upgraded-a-heuristic-with-ml-to-save-thousands-of-canceled-orders/?ref=blef.fr">How DoorDash upgraded a heuristic with ML to save thousands of cancelled orders</a> — When running a marketplace this is a common problem to deal with. DoorDash shares the models they used to replace their intuition.</li><li>👀 <a href="https://vietle.substack.com/p/defensible-machine-learning?ref=blef.fr">Building a defensible Machine Learning company in the age of foundation models</a> — This article is very complete; it is probably the best-written article about the current trends in machine learning. 
A whole ecosystem is shifting from building it yourself to consuming foundation models and APIs built by others.</li></ul><p></p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/01/image-3.png" class="kg-image" alt loading="lazy" width="1227" height="1280" srcset="https://www.blef.fr/content/images/size/w600/2023/01/image-3.png 600w, https://www.blef.fr/content/images/size/w1000/2023/01/image-3.png 1000w, https://www.blef.fr/content/images/2023/01/image-3.png 1227w" sizes="(min-width: 720px) 720px"><figcaption>Credits Good Tech Things by @<a href="https://twitter.com/forrestbrazeal/status/1612473738259316736?ref=blef.fr">forrestbrazeal</a></figcaption></figure><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li><a href="https://blog.fal.ai/announcing-dbt-fal-adapter/?ref=blef.fr">Announcing dbt-fal adapter</a> — I shared fal months ago when they launched. I'm still on their Discord, and when dbt finally announced Python models support I was a bit sceptical about fal offering the same thing. But dbt's solution is narrowly scoped—Python code only runs in the warehouse. 
With this release you can really mix Python and SQL code.</li><li><a href="https://medium.com/similarweb-engineering/how-we-cut-our-databricks-costs-by-50-7c60d6b6c069?ref=blef.fr">How we cut our Databricks costs by 50%</a> — We can always find optimizations in our cloud setup to save costs.</li><li><a href="https://www.brittanybennett.com/post/how-to-land-a-job-in-progressive-data?ref=blef.fr">How to land a job in progressive data</a> — If you want to use your skills to Do Good you should look at Brittany's post about progressive data.</li><li><a href="https://www.bundeskartellamt.de/SharedDocs/Meldung/EN/Pressemitteilungen/2023/11_01_2023_Google_Data_Processing_Terms.html?ref=blef.fr">Statement of objections issued against Google’s data processing terms</a> — The German competition authority said that Google should do more to be explicit about how data is processed to help Google's business.</li><li><a href="https://howqueryengineswork.com/?ref=blef.fr">How query engines work</a> — This is a web book that explains how query engines work. I have not read it yet but it looks great.</li><li><a href="https://www.jesse-anderson.com/2023/01/analysis-of-confluent-buying-immerok/?ref=blef.fr">Analysis of Confluent buying Immerok</a> — Jesse Anderson analyses last week's news of Confluent (Kafka) buying Immerok (Flink) and what it implies in the real-time low-level technologies competition between Kafka / Flink / Spark.</li><li><a href="https://medium.com/@maxillis/on-data-contracts-data-products-and-muesli-84fe2d143e2c?ref=blef.fr">On Data Contracts, Data Products and Muesli</a> — Another post on data contracts, a bit too long for me to read. 
Sorry.</li><li><a href="https://clickhouse.com/blog/extracting-converting-querying-local-files-with-sql-clickhouse-local?utm_campaign=SF%20Data%20Weekly&utm_medium=email&utm_source=Revue%20newsletter">Extracting, converting, and querying data in local files using clickhouse-local</a> — It's awesome how far ClickHouse can go. It looks like a wider alternative to DuckDB but also a good trend for other warehouses: providing a local experience that lives outside of the cloud.</li><li><a href="https://www.mydistributed.systems/2023/01/bytegraph-graph-database-for-tiktok.html?ref=blef.fr">ByteGraph: A Graph Database for TikTok</a> — ByteGraph is the open-source graph database developed by the company behind TikTok. This article shows you the key concepts needed to understand it. To be honest I'm quite impressed by the first line stating that it has been designed to support OLAP, OLSP and OLTP workloads.</li></ul><p></p><h1 id="data-economy-%F0%9F%92%B0">Data Economy 💰</h1><ul><li><strong><a href="https://www.metaplane.dev/?ref=blef.fr">Metaplane</a></strong> <a href="https://www.metaplane.dev/blog/the-next-stage-of-metaplane?ref=blef.fr">raises $8.4m seed funding</a>. This is a bold claim: Metaplane wants to be the Datadog for data. Operating in the data observability space, it has the usual set of features: tests, data quality monitoring based on historical data, lineage and alerts.</li><li><a href="https://xetdata.com/?ref=blef.fr"><strong>XetHub</strong></a> <a href="https://xetdata.com/blog/2022/12/13/introducing-xethub/?ref=blef.fr">raises $7.5m seed round</a>. XetHub brings git to data file management. They support up to 1TB repositories with git-like commands (checkout, push, commit, pull, etc.). I think that XetHub is super useful when in data science we need to keep the data alongside the models. 
When you commit a change to a big file, their repo hub summarises data diffs.</li><li>Generative AIs are booming. Following all the stories about a possible <a href="https://pitchbook.com/news/articles/microsoft-openai-largest-vc-deal?ref=blef.fr">Microsoft $10b investment</a> in OpenAI, <a href="https://www.seek.ai/?ref=blef.fr"><strong>Seek AI</strong></a> <a href="https://www.seek.ai/press-01-11-23?ref=blef.fr">raises $7.5m seed round</a>. Seek AI's promise is a prompt where you ask your data anything and the AI responds on top of the raw data directly.</li></ul><hr><p>See you next week, maybe on Friday ❤️.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 23.01 ]]></title>
                    <description><![CDATA[ Data News #23.01 — First edition of the year (late to start the year on the right foot), 2022 throwback, data team role, data science, fast news and lay-offs. ]]></description>
                    <link><![CDATA[ /data-news-week-23-01/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 63b7dabd66b8fc003da070b7 ]]></guid>
                    <pubDate><![CDATA[ 2023-01-07 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/01/image.png" class="kg-image" alt loading="lazy" width="2000" height="1335" srcset="https://www.blef.fr/content/images/size/w600/2023/01/image.png 600w, https://www.blef.fr/content/images/size/w1000/2023/01/image.png 1000w, https://www.blef.fr/content/images/size/w1600/2023/01/image.png 1600w, https://www.blef.fr/content/images/size/w2400/2023/01/image.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>You and me celebrating 2023 (<a href="https://unsplash.com/photos/PAykYb-8Er8?ref=blef.fr">credits</a>)</figcaption></figure><p>Happy new year 🎆. For those who were already subscribed at the start of last year: I tried to set resolutions and objectives for the year and did not manage to follow them. The year was so different from what I expected. Maybe this is an excuse. Anyway, I did not reach my goals. What if we don't care this year?</p><p>Still, what happened was awesome, and here is a small personal / professional throwback:</p><ul><li>I worked for the French public sector as a freelancer: the tax administration and the education ministry. It makes sense for me and this is something I also really care about.</li><li>I bootstrapped a coaching activity with companies and individuals—this is a new exercise but I feel it's close to the management work I can't do as a freelancer.</li><li>I moved to Berlin, spoke at my first ever meetup in English and met awesome people there, but I'd like to meet more.</li><li>We restarted the Paris Airflow Meetup and people liked it. There are still a few seats left for <a href="https://www.meetup.com/fr-FR/paris-apache-airflow-meetup/events/290522528/?ref=blef.fr">next Tuesday's meetup</a>.</li><li>I started to pay myself after a year and a half on unemployment pay. This is maybe my main source of stress. Will I be able next year to find missions that pay me for the whole year? 
My business plan asks for 100k€ in revenue.</li><li>This year I deeply learned Superset—adding the tool to my tools expertise list.</li><li>My written content got around 100k views last year. The blog crossed the 2000-member mark (❤️) and I won the <a href="https://noonies.hackernoon.com/2022/emerging-tech/2022-best-data-science-newsletter?ref=blef.fr">best data science newsletter award</a>. On LinkedIn and Twitter I multiplied my followers by 2. Everywhere I was starting from the bottom and now we're here.</li><li>I talked on <a href="https://www.blef.fr/blef-datagen-podcast/">Robin's podcast</a> about the newsletter and my data engineering journey.</li></ul><p>I'm also sorry to start the year late with my newsletter sending. Over the last 3 days I was teaching DataOps at a French school and I did not manage to find the time to write to you. <strong>And you know what, this is the first time in 7 years of teaching that more than 80% of the class wants to become data engineers.</strong></p><p>As a conclusion to this introduction, I want to thank everyone reading this newsletter and sharing feedback or good words about it. It means so much to me and it fuels me. For sure the Data News will be here for a new year and new stuff is coming.</p><p>Time for the news—I have around 30 links to share today so it might be less opinionated than usual. Happy reading.</p><p></p><h1 id="data-team-role">Data team role</h1><p>I really like all the thoughts around data team roles, missions, vision and strategy. I still think that we have not reached any form of consensus about data teams. In terms of tooling the modern data stack proposed something that works, but the modern data team is still behind. Here are the latest ideas I've seen this week:</p><ul><li><a href="https://petrjanda.substack.com/p/should-software-teams-start-learning?ref=blef.fr">Should software teams start learning from analytics engineers?</a> — Petr reverses the common idea that analytics teams should learn from software. 
Actually, everyone is just part of engineering, which helps all of us get better at data <em>and</em> software.</li><li><a href="https://towardsdatascience.com/data-teams-as-support-teams-2bb1f1ed31b?ref=blef.fr">Data Teams as support teams</a> — Chad from Zendesk thinks that data teams are often misaligned with their customers and that, because of the supportive nature of the relationship, something does not work. He then digs into modeling and analytics value to understand the impact on the relationship—<em>it is fun to read someone from Zendesk who does not want to be on a support team!</em></li><li>❤️ <a href="https://wrongbutuseful.substack.com/p/elbows-of-data?ref=blef.fr">Elbows of data</a> — This is a good follow-up to Chad's post. Katie coins the term <em>elbows of data</em> for "folks who have insisted on being involved in driving the company forward, whether they were invited to or not". When we do data we have the skills and understanding to help our company.
Once again, our main role should be to empower stakeholders.</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/01/image-1.png" class="kg-image" alt loading="lazy" width="900" height="600" srcset="https://www.blef.fr/content/images/size/w600/2023/01/image-1.png 600w, https://www.blef.fr/content/images/2023/01/image-1.png 900w" sizes="(min-width: 720px) 720px"><figcaption>You and your stakeholder, bff (<a href="https://unsplash.com/s/photos/empower?ref=blef.fr">credits</a>)</figcaption></figure><p></p><h1 id="data-science-saturday-%F0%9F%A4%96">Data Science Saturday 🤖</h1><ul><li><a href="https://medium.com/qonto-way/how-to-invest-better-in-acquisition-channels-a-1-million-question-for-data-science-591c82b3e0e4?ref=blef.fr">How to invest better in acquisition channels?</a> — Marianne detailed how data science helped Qonto understand their acquisition channel investments.</li><li><a href="https://counting.substack.com/p/data-science-has-a-tool-obsession?ref=blef.fr">Data science has a tool obsession</a>.</li><li><a href="https://doordash.engineering/2023/01/04/selecting-the-best-image-for-each-merchant-using-exploration-and-machine-learning/?ref=blef.fr">Selecting the best image for each merchant using exploration and ml</a>.</li><li><a href="https://huggingface.co/blog/intro-graphml?ref=blef.fr">Introduction to Graph Machine Learning</a> (related: Grab's <a href="https://engineering.grab.com/graph-service-platform?ref=blef.fr">Graph service platform</a>).</li></ul><p>We are in the middle of a ChatGPT frenzy. Each new day brings a new question about our future. Our future as developers, but also our future as humans. OpenAI is seeking money at a <a href="https://www.wsj.com/articles/chatgpt-creator-openai-is-in-talks-for-tender-offer-that-would-value-it-at-29-billion-11672949279?ref=blef.fr">high valuation</a>.
Still, should we trust OpenAI to be as open as the name says 🫠?<br><br>If you want to better understand what's behind ChatGPT, you can have a look at <a href="https://github.com/karpathy/minGPT?ref=blef.fr">minGPT</a>, a minimal re-implementation in PyTorch.<br><br>At the same time <a href="https://twitter.com/AiBreakfast/status/1610620787052130305?ref=blef.fr">it seems</a> that, following Microsoft's initial investment in OpenAI, Bing will use GPT models to improve their text and image search. Who would have said that Bing might kill Google?</p><p>Final note: <a href="https://techcrunch.com/2022/12/31/how-china-is-building-a-parallel-generative-ai-universe/?ref=blef.fr">How China is building a parallel generative AI universe</a>.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/01/image-2.png" class="kg-image" alt loading="lazy" width="2000" height="1333" srcset="https://www.blef.fr/content/images/size/w600/2023/01/image-2.png 600w, https://www.blef.fr/content/images/size/w1000/2023/01/image-2.png 1000w, https://www.blef.fr/content/images/size/w1600/2023/01/image-2.png 1600w, https://www.blef.fr/content/images/size/w2400/2023/01/image-2.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>You and ChatGPT being friends (<a href="https://unsplash.com/photos/0E_vhMVqL9g?ref=blef.fr">credits</a>)</figcaption></figure><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li><a href="https://www.sspaeti.com/blog/why-using-neovim-data-engineer-and-writer-2023/?ref=blef.fr">Why I'm using (Neo)vim as a Data Engineer and Writer in 2023</a> — If you want to take the beginning of 2023 as a sign to move to vim, Simon wrote a great post for you.</li><li><a href="https://newsletter.pragmaticengineer.com/p/circlecis-unnoticed-holiday-security?ref=blef.fr">CircleCI’s unnoticed holiday security breach</a> — CircleCI had a security breach a few days ago.</li><li><a
href="https://blog.malt.engineering/what-if-we-rewrite-everything-e1662e86da41?ref=blef.fr">What if we rewrite everything?</a> — Navigating technical debt and spending our entire careers doing the same stuff over again. What is the right strategy? Probably Keep It Simple, Stupid.</li><li><a href="https://jkebertz.medium.com/why-its-so-hard-to-become-a-staff-engineer-c4b94864a373?ref=blef.fr">Why It’s So Hard to Become a Staff Engineer</a> — Feedback to help people bridge the gap between senior and staff. I think this is relevant to the data world too.</li><li><a href="https://arrow.apache.org/blog/2023/01/05/introducing-arrow-adbc/?ref=blef.fr">Introducing ADBC: Database Access for Apache Arrow</a> — When I see "minimal-overhead alternative to JDBC/ODBC for analytical applications" I'm instantly in. All my professional life I've heard architects say JDBC is bad, so I welcome anything better that lets us stop talking about it. You can also <a href="https://open.spotify.com/episode/0gKbMmA8MPE4oGHDn6HxkZ?si=Trl4Bs2lQmurZGZWxhTNNQ&nd=1&ref=blef.fr">listen to a related podcast</a> about the Arrow vision.</li><li><a href="https://cnr.sh/essays/recap-for-people-who-hate-data-catalogs?ref=blef.fr">Recap: a data catalog for people who hate data catalogs</a> — This one hurts. You may have noticed, if you read me, that I'm not very tender with the current state of data catalogs. This week Chris started a small-footprint data catalog written in Python called <a href="https://github.com/recap-cloud/recap?ref=blef.fr">Recap</a>. I'll have a look at it soon.</li><li><a href="https://medium.com/@nigel.vining_19228/observability-tick-5370982eb804?ref=blef.fr">Observability, Tick</a> — Nigel wrote a small post detailing how a small startup can do observability without spending a lot of money.</li></ul><p></p><h1 id="data-economy-%F0%9F%92%B0">Data Economy 💰</h1><p>The economic situation is obviously not at its best.
Previously, data was not always hit by economic difficulties, but the downturn is now reaching the data world too. That's why the data fundraising section is becoming a data economy wrap-up.</p><ul><li><a href="https://www.astronomer.io/blog/astronomer-update/?ref=blef.fr">Astronomer laid off 20% of their staff</a>—which represents 76 folks—and moved from a co-CEO structure to a single CEO. I appreciate the transparency effort that went into making this note public. I still struggle to see Astronomer's value and strategy, but judging is hard because Astronomer hires a lot of core Airflow contributors and makes important contributions to the data community.</li><li><a href="https://www.bloomberg.com/news/articles/2023-01-05/salesforce-crm-guts-tableau-after-spending-15-7-billion-in-2019-deal?ref=blef.fr">Salesforce is laying off 10% of their staff</a>—roughly 8000 people—including folks at Tableau. They acquired Tableau in 2019 and analysts are saying that Tableau ex-employees are disproportionately impacted by the layoffs.</li></ul><p>In search of consolidation and new levers, companies are also merging:</p><ul><li><a href="https://www.qlik.com/us/company/press-room/press-releases/qlik-intends-to-acquire-talend?ref=blef.fr">Qlik wants to acquire Talend</a>. Qlik and Talend are two old BI giants, founded in 1993 and 2005 respectively. They have obviously been challenged by the cloud vendors and by the modern data stack vision that does not include them.</li><li><a href="https://www.confluent.io/blog/cloud-kafka-meets-cloud-flink-with-confluent-and-immerok/?ref=blef.fr">Confluent signed a deal to acquire Immerok</a>. They are respectively the home companies of Kafka and Flink. This is, to be honest, a natural move, because the two technologies work best together and ksqlDB never took the place in the market it should have.
Sadly, right now they are also challenged by real-time tooling that is way easier to set up.</li></ul><p>Finally, a fundraise:</p><ul><li><a href="https://www.chaosgenius.io/?ref=blef.fr">Chaos Genius</a> is <a href="https://www.chaosgenius.io/blog/chaos-genius-raises-3-3m-seed-round-to-help-companies-cut-data-costs/?ref=blef.fr">raising a $3.3m Seed round</a>. They propose an optimisation platform for Snowflake that helps you save up to 30% of your warehouse costs.</li></ul><hr><p>See you next week ❤️.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ I talked to DataGen podcast ]]></title>
                    <description><![CDATA[ In 2022 I talked in DataGen podcast about the newsletter curation process and why Data Engineering is so cool. ]]></description>
                    <link><![CDATA[ /blef-datagen-podcast/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 63b93b9766b8fc003da072b6 ]]></guid>
                    <pubDate><![CDATA[ 2023-01-04 ]]></pubDate>
                    <content>
                        <![CDATA[ <p>🎙 A few weeks ago I recorded my first podcast with <a href="https://www.linkedin.com/in/ACoAABOXu60BNWu22glLrpJCM_6wVqD4SszpF1Y?ref=blef.fr">Robin</a>. We talked about data engineering and everything that goes into a weekly curation.<br><br>This is the first episode of Robin's podcast in English and you should follow him because more are coming!<br><br>In the podcast we talked about:<br>🔥 My journey before launching the newsletter<br>🔥 Why and how I write<br>🔥 My main challenges as a Data Engineer<br>🔥 My favorite contents<br>🔥 What I like about data<br>🔥 A few tips for Data folks</p><p>You can listen to the podcast on all the platforms:</p><ul><li>Apple Podcasts/Itunes: <a href="http://bit.ly/3X3qlOQ?ref=blef.fr">bit.ly/3X3qlOQ</a></li><li>Spotify: <a href="http://bit.ly/3GnfWXb?ref=blef.fr">bit.ly/3GnfWXb</a></li><li>Google Podcast: <a href="http://bit.ly/3VPSAPR?ref=blef.fr">bit.ly/3VPSAPR</a></li><li>Deezer: <a href="http://bit.ly/3ZjWsM6?ref=blef.fr">bit.ly/3ZjWsM6</a></li></ul> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — must-read 2022 articles ]]></title>
                    <description><![CDATA[ A collection of data articles that you should read to remember 2022. Best data articles of 2022. ]]></description>
                    <link><![CDATA[ /best-data-articles-2022/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 63a044d087c8c8003d9bd503 ]]></guid>
                    <pubDate><![CDATA[ 2022-12-30 ]]></pubDate>
                    <content>
                        <![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/12/image-8.png" class="kg-image" alt loading="lazy" width="2000" height="1334" srcset="https://www.blef.fr/content/images/size/w600/2022/12/image-8.png 600w, https://www.blef.fr/content/images/size/w1000/2022/12/image-8.png 1000w, https://www.blef.fr/content/images/size/w1600/2022/12/image-8.png 1600w, https://www.blef.fr/content/images/2022/12/image-8.png 2074w" sizes="(min-width: 720px) 720px"><figcaption>kitsch moment, from me to you (<a href="https://unsplash.com/photos/2PODhmrvLik?ref=blef.fr">credits</a>)</figcaption></figure><p>Hey you, this is the last article of the year and it's gonna be about the articles and trends that made 2022, according to me. You'll see articles that I've already shared during the year.</p><div class="kg-card kg-callout-card kg-callout-card-grey"><div class="kg-callout-emoji">💡</div><div class="kg-callout-text">You can also read the <a href="https://www.blef.fr/data-news-must-read-articles/">2021 must-read list</a> that I did a year and a half ago, or <a href="https://www.blef.fr/learn-data-engineering/">how to learn data engineering</a>, which contains key articles to understand the field.</div></div><p>Once again, thank you everyone for your support this year, and see you next week for the first Data News of 2023. Sorry for the delay, I had blank page syndrome today. Now let's jump to my selection.</p><p></p><hr><!--kg-card-begin: html--><h2 style="text-align: center;">ANALYTICS ENGINEERING</h2>
<!--kg-card-end: html--><p></p><p>We have to be honest: in 2022 Analytics Engineering shaped the data field and concentrated a lot of the data discussions. Analytics Engineering can be seen as a renaming of BI Engineering; looked at more precisely, it mainly comes out of the specialisation of data roles. Analytics Engineer is a specialized role between the Data Engineer and the Data Analyst. Madison had a look at job postings to see <a href="https://medium.com/geekculture/what-companies-really-want-in-an-analytics-engineer-1ac03ff4494a?ref=blef.fr">what skills companies really want in Analytics Engineers</a>.</p><blockquote>Analytics engineers provide clean data sets to end users, modeling data in a way that empowers end users to answer their own questions. [...], an analytics engineer spends their time transforming, testing, deploying, and documenting data. Analytics engineers apply software engineering best practices like version control and continuous integration to the analytics code base.<sup>1</sup></blockquote><p>Analytics Engineering brought the spotlight back to data modeling. Preset wrote a <a href="https://preset.io/blog/intro-data-modeling/?ref=blef.fr">gentle introduction to data modeling</a>. In a nutshell, data modeling is the set of techniques we use to structure data in data warehouses. Nowadays we have:</p><ul><li><strong>Dimensional modeling</strong> — Introduced in 1996 by Ralph Kimball. We often use the <em>Snowflake Schema</em> or the <em>Star Schema</em> (a special case of the previous one; here Snowflake is not the data warehouse technology but the shape of the table relationships, which draw a snowflake).</li><li><strong>Entity modeling</strong> — Introduced by Bill Inmon. In this methodology you use 3NF (third normal form) to model your business entities and avoid redundancy.
This approach is less flexible than the previous one.</li><li><a href="https://www.fivetran.com/blog/star-schema-vs-obt?ref=blef.fr">OBT—One big table</a> — I don't really know who introduced OBT, except that Fivetran mentioned it in 2020. This is often the easiest approach to start with: everything in one table, denormalised.</li></ul><p>As a final note, a Reddit thread discussing <a href="https://www.reddit.com/r/dataengineering/comments/uhohlv/is_kimballs_dimensional_modelling_dead_in_2022_is/?ref=blef.fr">is Kimball's Dimensional Modelling dead in 2022?</a></p><p>To complete the AE list, here are a few articles I recommend as the best of 2022 analytics engineering:</p><ul><li><a href="https://graflinger.medium.com/factless-fact-table-not-so-absurd-it-may-sound-at-first-9c2aab68089?ref=blef.fr">Factless Fact table — not so absurd it may sound at first</a></li><li><a href="https://www.youtube.com/watch?v=hxvVhmhWRJA&list=PL0QYlrC86xQlj9UDGiEwhXQuSjuSyPJHl&index=20&ref=blef.fr">Testing: Our assertions vs. reality</a> — Probably the best talk of 2022 about testing. This is a YouTube video.</li><li><a href="https://engineering.linkedin.com/blog/2022/super-tables--the-road-to-building-reliable-and-discoverable-dat?ref=blef.fr">Super Tables: The road to building reliable and discoverable data products</a> — LinkedIn's data modeling choices explained, and the introduction of the Super Tables concept.</li><li><a href="https://teej.ghost.io/understanding-the-snowflake-query-optimizer/?ref=blef.fr">Understanding the Snowflake Query Optimizer</a> — To become better at data modeling you'll need to understand how the underlying warehouse engine works. This article is a good way to understand how Snowflake works.</li><li><a href="https://hex.tech/blog/stop-using-so-many-ctes/?ref=blef.fr">Stop using so many CTEs</a> — This is a vendor article that showcases "Chained CTEs" in Hex.
Still relevant, because in today's data world CTEs are everywhere and a lot of data transformations are just SQL queries a few hundred lines long with many CTEs. But CTEs are untestable blocks of code.</li><li><a href="https://blog.picnic.nl/7-antifragile-principles-for-a-successful-data-warehouse-574b655f0bc6?ref=blef.fr">7 Antifragile Principles for a Successful Data Warehouse</a> — Something to look at to create a healthy data warehouse.</li><li>My guide about <a href="https://www.blef.fr/manage-and-schedule-dbt/">managing and scheduling dbt from dev to production</a>.</li></ul><p></p><p></p><hr><!--kg-card-begin: html--><h2 style="text-align: center;">DATA TEAMS</h2>
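A quick aside to make the modeling part of the analytics engineering section above concrete: here is a minimal, self-contained sketch of the same aggregate computed against a tiny star schema and against its one-big-table (OBT) equivalent. All table and column names (`dim_customer`, `fct_orders`, `obt_orders`) are made up for illustration, and SQLite stands in for a real warehouse:

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Star schema: a fact table holding measures and foreign keys,
# plus a dimension table describing customers.
con.executescript("""
CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, country TEXT);
CREATE TABLE fct_orders (order_id INTEGER, customer_id INTEGER, amount REAL);
INSERT INTO dim_customer VALUES (1, 'FR'), (2, 'DE');
INSERT INTO fct_orders VALUES (10, 1, 20.0), (11, 1, 5.0), (12, 2, 7.5);
""")

# Analytical query on the star schema: join the fact to the dimension.
star = con.execute("""
SELECT c.country, SUM(f.amount)
FROM fct_orders f JOIN dim_customer c USING (customer_id)
GROUP BY c.country ORDER BY c.country
""").fetchall()

# OBT: the same data denormalised into a single wide table,
# so the query needs no join at all.
con.executescript("""
CREATE TABLE obt_orders AS
SELECT f.order_id, f.amount, c.country
FROM fct_orders f JOIN dim_customer c USING (customer_id);
""")
obt = con.execute("""
SELECT country, SUM(amount) FROM obt_orders
GROUP BY country ORDER BY country
""").fetchall()

print(star)         # [('DE', 7.5), ('FR', 25.0)]
print(obt == star)  # True
```

The star schema keeps measures and descriptive attributes separate and joins at query time; OBT pre-joins them, trading storage and rebuild cost for simpler queries.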
<!--kg-card-end: html--><p></p><p>3 pieces of content that I feel are relevant and not really trendy. These are more long-term things to keep in mind:</p><ul><li><a href="https://locallyoptimistic.com/post/building-more-effective-data-teams-using-the-jtbd-framework/?ref=blef.fr">Building more effective data teams using the JTBD framework</a> — Data teams are still in between, with no really good practices when it comes to routines or organisation. The Jobs To Be Done framework can be something to look at.</li><li><a href="https://datateams.amplifypartners.com/?ref=blef.fr">Building Modern Data Teams</a> — The most complete resource hub, with around 40 articles on how to build data teams and data strategies, or think about data work and hiring.</li><li><a href="https://a16z.com/2020/10/15/emerging-architectures-for-modern-data-infrastructure/?ref=blef.fr">Emerging Architectures for Modern Data Infrastructure</a> — The updated version of the a16z vision of modern data infrastructure.</li></ul><p></p><hr><!--kg-card-begin: html--><h2 style="text-align: center;"><strong>ENGINEERING</strong></h2>
<!--kg-card-end: html--><p></p><p>In no particular order, a few of the best 2022 data engineering articles:</p><ul><li><a href="https://dagster.io/blog/software-defined-assets?ref=blef.fr">Introducing Software-Defined Assets</a> — The best article for rethinking data pipelines and considering datasets as assets.</li><li><a href="https://dev.to/alvinslee/the-rise-of-the-data-reliability-engineer-pno?ref=blef.fr">The rise of the data reliability engineer</a> — A large part of a Data Engineer's daily job is close to SRE work, while not being an SRE.</li><li><a href="https://mlu-explain.github.io/?ref=blef.fr">The best website to understand machine learning models visually</a>.</li><li><a href="https://joereis.substack.com/?ref=blef.fr">Joe Reis's blog</a>. He started blogging recently after writing the excellent <em>Fundamentals of Data Engineering</em> book. I often surprise myself agreeing with everything he says; if you have to follow someone other than me, I think it should be him.</li><li><a href="https://medium.com/data-monzo/the-many-layers-of-data-lineage-2eb898709ad3?ref=blef.fr">The many layers of data lineage</a> — The best metaphor for understanding what you can do with data lineage.</li><li><a href="https://eugeneyan.com/writing/design-patterns/?ref=blef.fr">Design Patterns in Machine Learning Code and Systems</a> — Because we need design patterns, even if I disliked the design pattern classes I had back in engineering school.</li><li><a href="https://medium.com/inato/3-tips-to-take-back-control-of-your-time-2016dc6308c2?ref=blef.fr">3 tips to take back control of your time</a>.</li></ul><p></p><hr><!--kg-card-begin: html--><h2 style="text-align: center;"><strong>A GLIMPSE INTO THE FUTURE</strong></h2>
<!--kg-card-end: html--><p></p><p>This year people talked about a lot of things; with no research, here is what I can remember:</p><ul><li>Data Mesh — The Mesh has been assimilated and tried by <a href="https://medium.com/blablacar/dos-and-don-ts-of-data-mesh-e093f1662c2d?ref=blef.fr">multiple</a> <a href="https://netflixtechblog.com/data-mesh-a-data-movement-and-processing-platform-netflix-1288bcab2873?ref=blef.fr">organisations</a>. What we've seen is that it requires a minimal size to get started, and we have yet to figure out if the organisational changes are worth it.</li><li>Data contracts — An interface between data producers and data consumers. The interface can take multiple forms; we often summarize it as a schema registry. Very useful in a mesh organisation.</li><li><a href="https://en.wikipedia.org/wiki/Semantic_layer?ref=blef.fr">Semantic Layer</a> / Metric Layer / Headless BI — <strong>"Something"<sup>2</sup> between the data warehouse and the BI tool that will probably shape trends next year</strong>.</li><li><a href="https://blog.fal.ai/the-unbundling-of-airflow-2/?ref=blef.fr">Unbundling of Airflow</a> — This year many Airflow alternatives went public, all with their own vision and great promises. In addition, the one-DAG-to-rule-them-all strategy has been challenged and execution has been offloaded to other systems, leaving Airflow like an empty shell. But in the end <a href="https://en.wikipedia.org/wiki/I%27ll_be_back?ref=blef.fr">he'll be back</a>.</li><li>GPT-3 applications — It has the potential to revolutionize industries through automation and augmenting human intelligence, but has also raised concerns about its potential negative impact on employment (this bullet has been generated by ChatGPT).</li></ul><p></p><p><strong>Now that I've said this, I think 3 technologies will shape data engineering next year:</strong></p><ul><li>Wasm — WebAssembly is a portable compilation target in the browser.
In human words, it means you can run code in your favourite language in a Firefox tab. One example is <a href="https://pyscript.net/?ref=blef.fr">PyScript</a>, which allows us to run Python in HTML. Thanks to Wasm <strong>we can use a decentralised source of power: your stakeholders' laptops</strong>.</li><li>DuckDB — A single-node in-memory OLAP database. We have not yet seen its full potential. <a href="https://www.blef.fr/data-news-week-22-46/#my-two-cents-about-duckdb">What I think about DuckDB</a>.</li><li><a href="https://dagger.io/?ref=blef.fr">Dagger</a> — A programmable CI/CD engine that you can run everywhere.</li></ul><p></p><p></p><hr><ol><li><a href="https://www.getdbt.com/what-is-analytics-engineering/?ref=blef.fr#what-is-an-analytics-engineer">What is an analytics engineer?</a> (Claire Carroll)</li><li>The Semantic Layer is more than just "something". To be honest, for the moment I take it sarcastically, because I'm not sure this is something really important—at least when I look at my own French market.</li></ol> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 22.51 ]]></title>
                    <description><![CDATA[ Data News #22.51 — Advent of Data wrap-up, how to manage and schedule dbt, welcome new members, buy a data book for Christmas, I command you to hire junior data engineers. ]]></description>
                    <link><![CDATA[ /data-news-week-22-51/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 639c6703dfb0d5003db163fe ]]></guid>
                    <pubDate><![CDATA[ 2022-12-23 ]]></pubDate>
                    <content>
                        <![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/12/image-6.png" class="kg-image" alt loading="lazy" width="2000" height="1333" srcset="https://www.blef.fr/content/images/size/w600/2022/12/image-6.png 600w, https://www.blef.fr/content/images/size/w1000/2022/12/image-6.png 1000w, https://www.blef.fr/content/images/size/w1600/2022/12/image-6.png 1600w, https://www.blef.fr/content/images/size/w2400/2022/12/image-6.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>A gift from me to you (<a href="https://unsplash.com/photos/IPx7J1n_xUc?ref=blef.fr">credits</a>)</figcaption></figure><p>Hey you, if you just subscribed to the Data News yesterday, I wish you a warm welcome ❤️‍🔥. The Data News is your Friday weekly data curation in which I select for you the most interesting—according to me—data articles of the last week. I hope you'll enjoy it ✨.</p><p>Christmas is coming, so whether you celebrate it or not, I wish you a great end of the year and a good time with family and/or friends. There will be a last Data News next week: my 10 must-read articles of 2022. In the meantime you can read Prukalpa's <a href="https://metadataweekly.substack.com/p/reading-list-the-top-5-must-read?ref=blef.fr">5 must-read data blogs from 2022</a>.</p><p>The <a href="https://www.adventofdata.com/?ref=blef.fr">Advent of Data</a> is also coming to an end tomorrow. It has been an awesome ride; I'm so happy we put together such an awesome list of content, and I'm so grateful to the 24 creators who accepted the rules and wrote something for this first year. I'll do a wrap-up of the Advent in January to celebrate what we achieved together.</p><p><em>Remember: the Advent of Data was your daily spark of data joy in December.
Every day a new data article was published by a data creator.</em></p><p></p><h1 id="guide%E2%80%94manage-and-schedule-dbt">Guide—manage and schedule dbt</h1><p>Two days ago I published the most <a href="https://www.blef.fr/manage-and-schedule-dbt/">complete guide about dbt management and scheduling</a>; in case you missed it, you have to check it out! Original deep posts exclusive to Data News members are something I'm willing to do more of next year, to bring additional value to this newsletter.</p><p>Next year I plan to talk about:</p><ul><li>Data engineering and analytics engineering career paths</li><li>The state of data integration—related to another 2023 project 📚</li><li>We have too many choices: my framework for making a decision</li><li>Something you want me to write on?</li></ul><p>Let's go back to dbt. In a nutshell, this guide will give you ideas on how to manage dbt repositories and projects, what to think about to provide a top-notch developer experience, and how to host and schedule your dbt code.</p><p>I'm really proud of the development experience part of the guide, because I think this is still an unresolved part of every dbt project; something is still broken. From the first contact, the local installation, the (web?) IDE, the useless copy-pastes, the code reviews, and the tooling to the development environments, there is a lot to say.</p><div class="kg-card kg-button-card kg-align-center"><a href="https://www.blef.fr/manage-and-schedule-dbt/" class="kg-btn kg-btn-accent">👀 Check the dbt guide</a></div><p>As an extension, two great articles were written this week about custom dbt setups.
The Monzo team detailed how they created their own <a href="https://monzo.com/blog/2022/12/15/building-an-extension-framework-for-dbt?ref=blef.fr">framework on top of dbt</a> to keep up with their growth, and Albert from Superside explained how they migrated <a href="https://medium.com/albert-franzi/dbt-core-airflow-7d94edac9cdf?ref=blef.fr">from dbt Cloud to a custom setup with CI/CD, S3, Docker and Airflow</a>.</p><p><em>PS: a small question. I did not email you about the guide; would you have wanted to receive an email for it?</em></p><p></p><h1 id="give-yourself-a-book-christmas-%F0%9F%8E%81">Give yourself a book Christmas 🎁</h1><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/12/image-7.png" class="kg-image" alt loading="lazy" width="2000" height="1330" srcset="https://www.blef.fr/content/images/size/w600/2022/12/image-7.png 600w, https://www.blef.fr/content/images/size/w1000/2022/12/image-7.png 1000w, https://www.blef.fr/content/images/size/w1600/2022/12/image-7.png 1600w, https://www.blef.fr/content/images/size/w2400/2022/12/image-7.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>Close your screen and read good old books (<a href="https://unsplash.com/photos/lUaaKCUANVI?ref=blef.fr">credits</a>)</figcaption></figure><p>If you need gift ideas for yourself, I have a few books to propose.
The selection is a mix of 2 things I love—data engineering and visualisation.</p><p>Here is the selection 📚:</p><ul><li><a href="https://www.oreilly.com/library/view/fundamentals-of-data/9781098108298/?ref=blef.fr">Fundamentals of Data Engineering</a> — It rapidly became a best-seller. Joe and Matt wrote a well-structured book that covers all the data engineering topics; I firmly recommend it to everyone from juniors to seniors.</li><li><a href="https://www.kimballgroup.com/data-warehouse-business-intelligence-resources/books/data-warehouse-dw-toolkit/?ref=blef.fr">The Data Warehouse Toolkit, 3rd Edition</a> — With the rapid rise of the Analytics Engineering role, data modeling came back as the number one priority for a lot of data teams. Dimensional modeling, the core of the Kimball method, has been a reference for years.</li><li><a href="https://www.effectivedatastorytelling.com/?ref=blef.fr">Effective Data Storytelling</a> — Data storytelling has been really trendy in recent years, but in a lot of data teams, because of the dashboard constraint, we often lack creativity, context or storytelling. This book is a must-read if you want to drive action with data.</li></ul><p>Obviously there are <a href="https://twitter.com/NadiehBremer/status/1605225408542162945?ref=blef.fr">more</a> <a href="https://www.thoughtspot.com/blog/top-10-must-read-books-for-data-and-analytics-leaders-in-2022?ref=blef.fr">books</a> released this year that are awesome, but I just mentioned the ones you should absolutely have.</p><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li><a href="https://www.dataengineeringweekly.com/p/functional-data-engineering-a-blueprint?ref=blef.fr">Functional Data Engineering - A Blueprint</a> — Ananth, from Data Engineering Weekly, the best data engineering newsletter, wrote a great follow-up to Maxime's functional data engineering post.
In the post he shows how we can apply entity and event schematisation to the Lakehouse architecture.</li><li><a href="https://seattledataguy.substack.com/p/tips-for-hiring-junior-data-engineers?ref=blef.fr">Tips for hiring junior Data Engineers</a> — MOST. IMPORTANT. POST. OF. 2022. Every data engineer was a junior once; it's important not to <a href="https://towardsdatascience.com/gatekeeping-and-elitism-in-data-science-74cf19cd5744?ref=blef.fr">gatekeep</a> others by forgetting we once knew nothing about data engineering. It is our duty as seniors to hire juniors and to help them. I guarantee you this is the most satisfying feeling. Aside from my recommendation, the article is awesome and speaks the truth. Last point: I think the max ratio is 3 juniors for 1 senior.</li><li>💥 <a href="https://cloud.google.com/bigquery/docs/data-catalog?ref=blef.fr#data_lineage">BigQuery data lineage</a> — It looks like something huge, but I'm not sure tbh. Soon we will have a <em>data lineage</em> tab in the BigQuery UI. To get it you'll have to activate Data Catalog/Dataplex. This is in public preview.</li><li><a href="https://www.youtube.com/watch?v=7qY17c6Eiio&ref=blef.fr">Panel discussion about licenses in open-source</a>, relations with VCs, etc. Between Doug Cutting (Hadoop co-founder), Maxime Beauchemin (Airflow &amp; Superset creator) and David Nalley (Apache Foundation president). This is really geeky licensing talk, but a few of you might find it interesting.</li><li><a href="https://medium.com/@babak4/maybe-snowflake-isnt-for-you-67069a6dbeca?ref=blef.fr">Maybe Snowflake isn’t for you!</a> — Thoughts on the expensive price of Snowflake, which is reminiscent of Oracle.
tl;dr: take back control of your tools to find the holy added value every data person has spoken about.</li><li><a href="https://www.philschmid.de/whisper-inference-endpoints?ref=blef.fr">Managed transcription with OpenAI whisper and Hugging Face inference endpoints</a> — I don't even understand the first chart of the article, but it looks cool.</li><li><a href="https://eliasbenaddouidrissi.dev/posts/data_engineering_project_monzo/?ref=blef.fr">Personal Finances with Airflow, Docker, Great Expectations and Metabase</a> — when you're a nerd and you like to extend the data pleasure on Saturday.</li><li><a href="https://survey.stackoverflow.co/2022/?ref=blef.fr">StackOverflow 2022 developer survey</a> — 15% of respondent developers are in data roles (but they can wear multiple hats), and when it comes to technologies SQL and Python come just after JavaScript and HTML/CSS, which are everywhere. Last number: Spark is the framework that pays the most nowadays—not a good sign for Spark's future.</li><li><a href="https://coraspe-ramses.medium.com/working-with-large-csv-files-in-python-from-scratch-134587aed5f7?ref=blef.fr">Working with large CSV files in Python from Scratch</a> — A good pattern to optimize your pandas computations by leveraging partitioning.</li><li><a href="https://netflixtechblog.medium.com/data-reprocessing-pipeline-in-asset-management-platform-netflix-46fe225c35c9?ref=blef.fr">Data Reprocessing Pipeline in Asset Management Platform @Netflix</a> (<em>I did not read it, but I want to keep track of it—looks interesting</em>).</li></ul><p></p><h1 id="data-fundraising-%F0%9F%92%B0">Data Fundraising 💰</h1><ul><li><strong><a href="https://qualytics.co/?ref=blef.fr">Qualytics</a></strong> <a 
href="https://www.prnewswire.com/news-releases/qualytics-raises-2-5m-to-help-enterprises-improve-data-quality-301708801.html?ref=blef.fr">raised a $2.5m Seed round</a>. A newcomer in the data quality space, which is already quite crowded. Qualytics is a small US-based team—8 employees on LinkedIn—proposing a "data firewall" that protects and compares your data to detect drifts, anomalies and historical discrepancies.</li></ul><hr><p>See you next week for the last edition of 2022 ❤️. Enjoy the holidays.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ How to manage and schedule dbt ]]></title>
                    <description><![CDATA[ The most complete guide about everything you need to know when you manage and schedule dbt. It features an exhaustive list of solutions. ]]></description>
                    <link><![CDATA[ /manage-and-schedule-dbt/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 63a03d8187c8c8003d9bd4db ]]></guid>
                    <pubDate><![CDATA[ 2022-12-19 ]]></pubDate>
                    <content>
<![CDATA[ <p>Last week <em>dbt Labs</em> decided to change the pricing of their Cloud offering. I've already analysed this in <a href="https://www.blef.fr/data-news-week-22-50/">week #22.50 of the Data News</a>. In a nutshell, <a href="https://www.getdbt.com/pricing/?ref=blef.fr">dbt Cloud pricing</a> is seat-based, which means you pay for each dbt developer. Previously for a team it was $50/month/dev and they increased it to $100/month/dev, a 100% increase, with a team limit of 8 devs and only one project. To go beyond this limit you'll need the Enterprise pricing, which is opaque, as all pricing of this kind is.</p><p>But this article is not about the pricing, which can be very subjective depending on the context—what is $1200 for dev tooling when you pay developers more than $150k per year? Yes it's US-centric, but relevant.</p><p>Let's go deeper than this and list the options out there today to schedule dbt in production. We will also cover what it means to manage dbt<sup>1</sup>. This article is written like a guide that aims to be exhaustive by listing all the possible solutions, but if you feel I missed something do not hesitate to <a href="mailto:christophe@blef.fr">ping me</a>.</p><p></p><h2 id="dbt-a-small-reminder">dbt, a small reminder</h2><p>Everyone—incl. me—is speaking about dbt, but what the heck is dbt? In simple words, dbt Core is a framework that helps you organise all your warehouse transformations. The framework's usage grew a lot over the last years. It's important to say that a lot of the usages we have today were not initially designed by <a href="https://www.getdbt.com/blog/welcome-to-fishtown-analytics/?ref=blef.fr">Fishtown Analytics</a>.</p><p>At first dbt transformations were only SQL queries, but in recent versions, with supported warehouses, it has become possible to add Python transformations. dbt's responsibility is to transform the collection of queries into a usable DAG. 
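To make the DAG idea concrete, here is a toy Python sketch (this is not dbt internals: dbt uses real Jinja, and the model names and SQL below are made up) showing how rendering templated queries can at the same time produce runnable SQL and record the dependency edges:

```python
import re

# Hypothetical models: each value is a templated SQL query, dbt-style.
MODELS = {
    "stg_orders": "select * from {{ source('shop', 'orders') }}",
    "orders_daily": "select order_date, count(*) as n from {{ ref('stg_orders') }} group by 1",
}

REF = re.compile(r"\{\{\s*ref\('([^']+)'\)\s*\}\}")
SOURCE = re.compile(r"\{\{\s*source\('([^']+)',\s*'([^']+)'\)\s*\}\}")

def render_all(models):
    rendered, deps = {}, {}
    for name, sql in models.items():
        # Every ref() occurrence becomes a dependency edge in the DAG.
        deps[name] = set(REF.findall(sql))
        sql = REF.sub(lambda m: m.group(1), sql)
        sql = SOURCE.sub(lambda m: m.group(1) + "." + m.group(2), sql)
        rendered[name] = sql
    return rendered, deps

rendered, deps = render_all(MODELS)
print(deps)  # {'stg_orders': set(), 'orders_daily': {'stg_orders'}}
```

From the recorded edges dbt can then run the queries in dependency order against the warehouse.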
The dependencies between the queries are humanly defined—which means prone to error—thanks to 2 handy functions, <em>source</em> and <em>ref</em>. These 2 functions are called macros because they use Jinja, a Python templating engine; in dbt, macros transform Python+SQL code into SQL, so we can say that we have templated queries.</p><p>Everything I just mentioned can be considered static. If we draw a parallel with software development, this is your codebase. Python and SQL together in the dbt framework form your codebase. You can do development on your codebase. To go to production you'll have to manage and schedule dbt.</p><p>To manage dbt you will have to answer multiple questions; mainly, dbt management is how the data team develops on dbt, how the project is validated/deployed, how you get alerted when something goes wrong, and how you monitor.</p><p>In addition to the dbt management you will have to find the place where dbt will be scheduled. Where dbt will run. dbt scheduling is tricky but not really complicated. If you followed what we've just seen, dbt is a SQL query orchestrator. dbt does not run the queries; all the queries are sent to the underlying warehouse, which means that theoretically dbt does not need a lot of computing power—CPU/RAM—because it only sends SQL queries sequentially to your data warehouse, which does the work.</p><p>Obviously every dbt project has been designed differently, but if we simplify the workflow, every dbt project will need at some point to run one or multiple <a href="https://docs.getdbt.com/reference/dbt-commands?ref=blef.fr">dbt CLI commands</a>.</p><p><strong>In this guide we will first see how we can manage dbt, <em>i.e.</em> git structures, how to code, the CI/CD and the deployment, then in the second part how we schedule dbt code, <em>i.e.</em> on which server and with which triggers. 
</strong></p><div class="kg-card kg-callout-card kg-callout-card-grey"><div class="kg-callout-emoji">💡</div><div class="kg-callout-text">This is a big guide, do not hesitate to use the table of contents to jump to the interesting parts.</div></div><p></p><h2 id="how-to-manage-dbt-%F0%9F%A7%91%E2%80%8D%F0%9F%94%A7">How to manage dbt 🧑‍🔧</h2><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/12/image-4.png" class="kg-image" alt loading="lazy" width="900" height="600" srcset="https://www.blef.fr/content/images/size/w600/2022/12/image-4.png 600w, https://www.blef.fr/content/images/2022/12/image-4.png 900w" sizes="(min-width: 720px) 720px"><figcaption>Data team workshop to set up dbt (<a href="https://unsplash.com/photos/SYTO3xs06fU?ref=blef.fr">credits</a>)</figcaption></figure><p>One of dbt's founding principles is to bring software engineering practices to data development work, especially to the SQL development world. To follow up on this we will try to treat the workflow like an engineering project, even if it can sometimes feel over-engineered.</p><p>You have to consider development and deployment when managing dbt project(s):</p><ul><li>Like every engineering project, the management will obviously start with a git repository—depending on your scale it can be multiple repositories, but if you're just starting I recommend going with a single one.</li><li>The next step is the development experience. What we often call DevEx. Sometimes data teams forget it. To understand this point we have to ask ourselves who the dbt developers are and what they need.</li><li>After development often comes deployment. It can be deployment in all environments or, as for a lot of data teams, only in production, because only production exists. 
But before sending your code to production you still want to validate some things, static or not, in the CI/CD pipelines.</li></ul><h3 id="git-repositories-considerations">Git repositories considerations</h3><p>This is the everlasting debate of every software engineering team: <a href="https://en.wikipedia.org/wiki/Monorepo?ref=blef.fr">monorepo</a> or multirepo? It is tightly linked to another dbt-related question, which is mono-project or multi-project. By default and by design dbt is meant to work mono-project, but when you're starting to grow or when you want clear domain borders the single project can quickly reach its limits.</p><p>As I said previously, if you're just starting with dbt and you're a small team <strong>I still recommend going with one repo, one project.</strong> Try first to organise the <em>models</em> folder correctly before trying to structure at a higher level.</p><div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">❓</div><div class="kg-callout-text">By definition, here a dbt project corresponds to the folder that has been generated by the command <em>dbt init</em>, while a repo is a folder that can be larger than this. That's why a repo can contain multiple projects.</div></div><p>The first question you'll probably hit is: how do I put models in different schemas/datasets? This is the first step of project organisation. The solution is to override the <a href="https://docs.getdbt.com/docs/build/custom-schemas?ref=blef.fr">generate_schema_name</a> macro.</p><p>Then if you want to go for multiple projects you'll maybe have to decide how you build the interface between projects; within the dbt toolkit you have 2 solutions:</p><ul><li>Every project can define exposures<sup>2</sup>. Exposures are then a way to define the downstream usage of the project's models. 
With the exposure nomenclature you can regroup multiple models in the <em>depends_on</em> of a <em>type: application</em> entry that is supposed to use them. If we imagine <em>some-company</em>, with 2 projects—domains—Ops and Marketing, we can define in the Ops exposures the models that we want the outside world to be aware of. Then with some kind of automation we can generate sources accordingly in the Marketing project. To go further, the GoDataDriven team did an awesome talk at Coalesce explaining how you can achieve this: <a href="https://www.youtube.com/watch?v=P1erB7GfIUY&ref=blef.fr">dbt &amp; data mesh: the perfect pair (?)</a>.</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/12/carbon-4--1.png" class="kg-image" alt loading="lazy" width="1828" height="818" srcset="https://www.blef.fr/content/images/size/w600/2022/12/carbon-4--1.png 600w, https://www.blef.fr/content/images/size/w1000/2022/12/carbon-4--1.png 1000w, https://www.blef.fr/content/images/size/w1600/2022/12/carbon-4--1.png 1600w, https://www.blef.fr/content/images/2022/12/carbon-4--1.png 1828w" sizes="(min-width: 720px) 720px"><figcaption>This is a way to define exposures for downstream Marketing usage</figcaption></figure><ul><li>The other solution is to go for a dbt packages structure. In this solution every project—domain—can be installed as a dependency in other projects, but I think it will end up in a nightmare of dependency management. In addition you'll have to be smart in the way you run the models in the end, because package installation could duplicate model execution.</li></ul><p></p><p>Once the project/repo structure has been defined there are still open questions; here are a few:</p><ul><li>How do I structure my dbt models folder? 
You can opt for the <a href="https://docs.getdbt.com/guides/best-practices/how-we-structure/1-guide-overview?ref=blef.fr">dbt recommended solution</a> or for <a href="https://www.adventofdata.com/modern-data-modeling-start-with-the-end/?ref=blef.fr">Brice's recommendations</a>. Personally my only advice here is: don't be shy about creating folders to separate concerns.</li><li>One YAML to rule them all — Do you want to create only one big YAML file that describes all the sources and all the models, or do you split it? In my opinion sources have to be described at the schema/database level and model YAML files at the model level. So it means one YAML per SQL file.</li><li>Who is the real owner of the git repo? The data engineering or the analytics team? — It depends, but I'm in favour, if possible, of giving the responsibility and ownership to the analytics team; dbt is their playground. As a data—platform—engineer it's your responsibility to help them, but it's up to them to learn by doing. Under the hood it also means that dbt project(s) have to be independent from other tools (<em>e.g. the dbt repo should not be in the Airflow repo</em>).</li></ul><h3 id="development-experience-with-dbt">Development Experience with dbt</h3><p>First, an important thing to say: I'm a data engineer and I truly think that my main mission in a data team is to empower others through data tools. In the dbt context it means you have to understand how your analytics team is working. I've also noticed over the years that analytics teams are often not able to identify that they are under-equipped or doing something inefficient. It is your role as a data engineer to identify these issues. It is your role to provide a neat developer experience for every dbt user.</p><p>But who are your dbt users?</p><ul><li>They can be data engineers—working on the founding layers of the modeling. 
Probably the sources and the staging tables.</li><li>They can be analytics engineers—doing the same as the data engineers in the previous point and going deeper into the modeling layers: the core, intermediate and mart models.</li><li>They can be data analysts, business analysts, web analysts—people using the final mart models, and sometimes also building them. They mainly want to be able to understand where or how a column is computed, or to make small changes. They also need a place to store their <a href="https://docs.getdbt.com/docs/build/analyses?ref=blef.fr">analyses</a>, or all the modeling they were doing before in their BI tool, which is often their main playground.</li><li>Management roles (head of data, VP tech, etc.)—they want to be sure dbt is the right tool, but they also want a higher-level view of the modeling; the dbt docs are sometimes a good first entry point for them.</li><li>Stakeholders—I'm not sure they are dbt users; dbt is too technical, and you don't want them to see the whole complexity that exists in it.</li></ul><p>Now that we have listed a few of the dbt users, let's focus on the development experience, especially for the analytics team—analytics engineers and data analysts. It is super important to provide a smooth experience for these users because they will spend a lot of working hours in the models; the neater the workflow is, the happier people will be.</p><p>What are the levers you can act on to provide this great experience:</p>
<aside class="gh-post-upgrade-cta">
    <div class="gh-post-upgrade-cta-content" style="background-color: #373f48">
                <h2>This post is for subscribers only</h2>
            <a class="gh-btn" data-portal="signup" href="#/portal/signup" style="color:#373f48">Subscribe now</a>
            <p><small>Already have an account? <a data-portal="signin" href="#/portal/signin">Sign in</a></small></p>
    </div>
</aside>
 ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 22.50 ]]></title>
                    <description><![CDATA[ Data News #22.50 — dbt Cloud pricing x2, Facebook ads trials in California and usual fast news, Dataiku fundraising and the Advent of data. ]]></description>
                    <link><![CDATA[ /data-news-week-22-50/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 639c3338dfb0d5003db15a81 ]]></guid>
                    <pubDate><![CDATA[ 2022-12-16 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/12/image-3.png" class="kg-image" alt loading="lazy" width="900" height="595" srcset="https://www.blef.fr/content/images/size/w600/2022/12/image-3.png 600w, https://www.blef.fr/content/images/2022/12/image-3.png 900w" sizes="(min-width: 720px) 720px"><figcaption>Prepping me to deliver Christmas' Data News (<a href="https://unsplash.com/photos/EC92VYoYwC4?ref=blef.fr">credits</a>)</figcaption></figure><p>Hey you, the end of the year is coming soon. I really liked this year with you. It was super fun to write my opinion on data topics every Friday of the year. I don't know yet if next year I'll be able to pull out stuff without repeating myself, I hate repeating myself, but for sure I'll try and I'll continue.</p><p>We still have 2 Fridays left until the end of the year; I'll try, like last year, to do special editions, but no promise.</p><p>As a small reminder, the <a href="https://www.adventofdata.com/?ref=blef.fr">Advent of Data</a> 🎄 is still running and this week we got awesome articles again! So go check them out. For instance Marie and Bryan wrote great pieces to help you get started with data: <a href="https://www.adventofdata.com/is-tourism-back-to-its-pre-covid/?ref=blef.fr">Is tourism back to its pre-COVID-crisis level?</a> and <a href="https://www.adventofdata.com/get-started-with-data/?ref=blef.fr">How to get started with data and help your local community</a>.</p><p></p><h1 id="dbt-cloud-pricing-update-%F0%9F%8E%81">dbt Cloud pricing update 🎁</h1><p>dbt Labs announced yesterday a nice Christmas present for all dbt Cloud customers: <a href="https://www.getdbt.com/blog/dbt-cloud-package-update/?ref=blef.fr">a new pricing model</a>. But you know, this is the kind of Christmas present your uncle offers you that you don't like. Something you want to return directly because it does not suit you.</p><p>Let's have a look at it. 
Below are listed the major changes:</p><ul><li>Team plan x2. From <em>$50/month/per dev</em> to <em>$100/month/per dev</em>, but limited to 8 devs</li><li>Team plan is now limited to only one project</li><li>Team plan will include the Semantic Layer no one is asking for</li><li>The free tier is now announced as US-based only</li></ul><p>Small teams will see their dbt Cloud budget increase by 100%. For instance a small team of 2 analytics engineers will now pay $2400/year just to have a server running their SQL queries and a web IDE that is yet to be perfected.<br><br>Obviously, dbt Labs has all the data points regarding activity and feature usage to take this decision, but it feels weird, as dbt Cloud was a simple and costless way for small users to enter the dbt world. </p><p>In terms of strategy it also means that dbt Labs wants to push companies towards the Enterprise plan with hidden pricing—don't forget <em>transparency always wins</em> is one of dbt Labs' core values.</p><p>Usual readers of the Data News might notice that I don't go softly on dbt Labs when it comes to their Cloud product, but this is the reality: if I caricature a bit, right now dbt Cloud is only a web IDE with the capability to run your models. It should be a commodity; for the moment the real value of dbt exists only in Core and in the community. In the open-source part.</p><p>As a comparison, I have paid for PyCharm for years; it costs me €99/year and I can <em>almost</em> do everything that is included in the dbt web IDE, plus I have all my comfortable developer setup. 
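The back-of-the-envelope maths behind that $2400 figure, using the Team plan numbers quoted above:

```python
old_rate, new_rate = 50, 100      # $/month per developer on the Team plan
devs, months = 2, 12              # the small team from the example

old_budget = old_rate * devs * months   # previous yearly budget
new_budget = new_rate * devs * months   # new yearly budget
increase_pct = 100 * (new_budget - old_budget) / old_budget
print(new_budget, increase_pct)  # 2400 100.0
```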
The pricing difference is not worth it.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/12/Frame-11.png" class="kg-image" alt loading="lazy" width="1985" height="1858" srcset="https://www.blef.fr/content/images/size/w600/2022/12/Frame-11.png 600w, https://www.blef.fr/content/images/size/w1000/2022/12/Frame-11.png 1000w, https://www.blef.fr/content/images/size/w1600/2022/12/Frame-11.png 1600w, https://www.blef.fr/content/images/2022/12/Frame-11.png 1985w" sizes="(min-width: 720px) 720px"><figcaption>dbt Cloud's new pricing compared to previous one</figcaption></figure><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li>Meta—Facebook—has been sued in the Northern District of California, following Cambridge Analytica scandal leftovers, by a Californian law firm. You'll probably say: "<em>there is nothing new under the sun</em>". OK. Then <a href="https://storage.courtlistener.com/recap/gov.uscourts.cand.327471/gov.uscourts.cand.327471.1085.29.pdf?ref=blef.fr">court files went public</a> and <strong>listed the tables storing user identifiers for ads: 11051 Hive tables and 1190 Python pipelines</strong>. Nothing new under the sun.</li><li>Yep, 11051 Hive tables in the previous bullet point: you didn't misread it. They need 11051 tables to run their ads system.</li><li><a href="https://www.timeplus.com/post/query-your-data-in-kafka-using-sql?ref=blef.fr">Query your data in Kafka using SQL</a> — This is a post that compares Flink, ksqlDB, Trino, Materialize, RisingWave and Timeplus (the authors) for querying Kafka. 
Even if it's vendor oriented, this is a good starting point to get an overview of what you can expect from these tools.</li><li><a href="https://ownyourdata.ai/wp/traditional-vs-modern-analytics-data-processing-part-2/?ref=blef.fr">Traditional vs modern analytics data processing (part 2)</a> — Petrica compares two ways to write data models, with schema auto-discovery on and off.</li><li><a href="https://www.youtube.com/playlist?list=PLgyvStszwUHjko19Z3PxkBxApbxgVjWp8&ref=blef.fr">Airbyte move(data) conf videos</a> — A YouTube playlist with 38 videos, which I did not watch for lack of time, from the online data engineering conference Airbyte organised a few weeks ago. You can read <a href="https://medium.com/@matt_weingarten/move-data-takeaways-866b3d36ddc2?ref=blef.fr">Matt's takeaways</a>.</li><li><a href="https://seattledataguy.substack.com/p/a-zero-etl-future?ref=blef.fr">A Zero ETL Future</a> — Benjamin explores the promise of Zero ETL following announcements from AWS and Snowflake.</li><li><a href="https://engineering.hometogo.com/how-hometogo-has-connected-superset-dashboards-to-dbt-exposures-to-improve-data-discoverability-3d0add162e4a?ref=blef.fr">How HomeToGo has connected Superset Dashboards to dbt Exposures</a> — A small article but great ideas.</li><li><a href="https://dataengineeringcentral.substack.com/p/why-is-everyone-trying-to-kill-airflow?sd=pf&ref=blef.fr">Why is everyone trying to kill Airflow?</a> — Imagine a Cluedo where Airflow is Dr. Black. 
Who did it, when and with which weapon?</li><li><a href="https://medium.com/@zanasimsek/migration-of-postgres-from-9-6-to-10-via-pglogical-for-a-debezium-included-tech-stack-61114cb3f783?ref=blef.fr">Migration of Postgres from 9.6 to 10 via PgLogical for a Debezium-included tech stack</a>.</li><li><a href="https://medium.com/@TianchenW/unit-test-sql-using-dbt-1b8aa214365e?ref=blef.fr">Unit Test SQL using dbt</a> — A small setup using seeds and tests to create a unit-testing framework.</li></ul><p></p><h1 id="data-fundraising-%F0%9F%92%B0">Data Fundraising 💰</h1><ul><li><a href="https://tech.eu/2022/12/13/french-founded-dataiku-raises-200-million/?ref=blef.fr"><strong>Dataiku</strong> raised, once again, a $200m Series F</a>. This new round brings the total amount of money raised to $846m, but with the global economic slowdown they did it at a lower valuation—$3.7b. Dataiku has been one of the first companies to take the AI path with an all-in-one product. But it seems that over the years, as they focused on big corporations, they struggled to sell their graphical drag-n-drop UI to smaller businesses.</li></ul><p><em>As a side note, it is crazy to compare dbt Labs' valuation with Dataiku's. Almost the same, but even if I don't like Dataiku, the depth of the two products is by far not comparable.</em></p><hr><p>See you next week ❤️</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 22.49 ]]></title>
                    <description><![CDATA[ Data News #22.49 — ChatGPT, Paris Airflow Meetup takeaways, GoCardless data contracts implementation, schema drift, Pathway and Husprey fundraise. ]]></description>
                    <link><![CDATA[ /data-news-week-22-49/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 6392edbac5b576003d799be6 ]]></guid>
                    <pubDate><![CDATA[ 2022-12-09 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/12/image-2.png" class="kg-image" alt loading="lazy" width="900" height="600" srcset="https://www.blef.fr/content/images/size/w600/2022/12/image-2.png 600w, https://www.blef.fr/content/images/2022/12/image-2.png 900w" sizes="(min-width: 720px) 720px"><figcaption>This is what we call a Chat in French (<a href="https://unsplash.com/photos/9UUoGaaHtNE?ref=blef.fr">credits</a>)</figcaption></figure><p>Hello there, this is Christophe, live from the human world. Last week has been totally driven by the <a href="https://openai.com/blog/chatgpt/?ref=blef.fr">ChatGPT</a> frenzy; the social networks I follow are spammed with conversation screenshots and hype. On my side I don't know what the future holds for us, but for sure MaaS—Models as a Service—does not look bright to me. OpenAI executed it perfectly: they dedicated a gigantic amount of computing power to offer a neat pay-as-you-query experience, like BigQuery. And I bet it will transform our industry as much as BigQuery did. But do we want big companies holding decision power in their own pre-trained models, leaving real data science only to the big ones?</p><p>I don't want to be alarmist, this is not the tone I have here in the Data News, but do we want a future where the support chat of our home train service or our mobile carrier is, under the hood, run by a Musk company? OK, it's a caricature, but imagine. I can't wait to see Excel sheets comparing the average cost per word written by a human and by a machine.</p><p>🎄 Let's switch topics. It's time for the <a href="https://www.adventofdata.com/?ref=blef.fr">Advent of Data</a> heads-up. Since last week's edition we have had 6 new articles published in the calendar. Go taste your daily chocolates. 
In a nutshell you can now <a href="https://www.adventofdata.com/python-pip-package-for-data-team/?ref=blef.fr">develop an internal pip package for your data team</a>, <a href="https://www.adventofdata.com/clean-up-your-data-swap-but-make-it-a-team-sport/?ref=blef.fr">handle governance</a>, <a href="https://www.adventofdata.com/the-go-to-guide-for-how-to-work-with-data-people/?ref=blef.fr">explain to stakeholders what you're doing</a>, <a href="https://www.adventofdata.com/embedded-machine-learning/?ref=blef.fr">send AI models to small devices</a> while understanding <a href="https://www.adventofdata.com/rust-for-data-engineering/?ref=blef.fr">Rust for data engineering</a> and <a href="https://www.adventofdata.com/geospatial-metrics/?ref=blef.fr">3 key geospatial metrics</a>.</p><p></p><h1 id="paris-airflow-meetup-%F0%9F%A7%91%E2%80%8D%F0%9F%94%A7">Paris Airflow Meetup 🧑‍🔧</h1><p>On Tuesday I organised the 4th Paris Apache Airflow Meetup. The first one since 2019, and it was awesome: I met a lot of people, and the talks and the venue were great. The goal now is to do one meetup per month in 2023. For this I'll look for speakers and hosts, so if you live in France and you want to share something with the French community, reach out to me; I have a lot of ideas.</p><p>After a small introduction the evening started with a presentation by <a href="https://www.linkedin.com/in/clementdelpech?miniProfileUrn=urn%3Ali%3Afs_miniProfile%3AACoAAA3sBvYBCpfgCubxNNe0TGvi1rUEEm8ii6A&lipi=urn%3Ali%3Apage%3Ad_flagship3_search_srp_all%3BQ2aa6pRFQaW8ky7agjLg5A%3D%3D&ref=blef.fr">Clément</a> and Steff from the leboncoin data engineering team. They shared the good practices they implemented to scale their Airflow development. As a figure, at leboncoin 7 teams are using Airflow to operate more than 1000 DAGs. 
Here is a short takeaway in English of their presentation:</p><ul><li>Stop using custom Operators or Hooks if there is a community one available—this point is particularly relevant if you feel your custom stuff creates tech debt</li><li>Be careful with <a href="https://airflow.apache.org/docs/apache-airflow/stable/howto/variable.html?ref=blef.fr">Airflow's variables</a>: each <em>Variable.get</em> does a database call and leads to bad performance. The replacement solution is to use Jinja templating combined with something more traditional in app development: a constants file.</li><li>Use <em>priority_weight</em>; for this they created an enum with 5 different humanly understandable priorities.</li><li>And lastly: give ownership context to DAGs, develop custom macros for repeated tasks like <em>generate_s3_url,</em> <strong>use the pendulum date library to avoid the pain of managing dates</strong>, use cluster policies and, finally, write tests. And if you don't know how to write tests, have a look at how Airflow is written and copy how they do it.</li></ul><p>Then the Qonto data engineering team, with <a href="https://www.linkedin.com/in/charles-cazals/?ref=blef.fr">Charles</a> &amp; <a href="https://www.linkedin.com/in/charles-andre/?ref=blef.fr">Charles</a>, shared how they integrated dbt within Airflow. After a small introduction of the classic modern data stack combo—Snowflake-dbt-Tableau-Airflow—Charles presented what dbt is and what the alternatives are to integrate dbt within Airflow. </p><p>In a nutshell you have 3 options to do it:</p><ul><li>You use the <code>DbtCloudRunJobOperator</code>, but it requires dbt Cloud</li><li>You use a <code>BashOperator</code> that runs the <code>dbt run</code> command</li><li>You use multiple <code>BashOperator</code>s running <code>dbt run --select model</code> commands</li></ul><p>Qonto decided to go for the last option.  Then the other Charles detailed what it means and how they monitor what is happening. 
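To give an idea of that last option, here is a minimal stdlib-only sketch (model names and the dependency map are hypothetical; in a real Airflow DAG each command below would be wrapped in a <code>BashOperator</code>, and the dependencies would typically be derived from dbt's manifest.json rather than hardcoded):

```python
from graphlib import TopologicalSorter

# model -> set of upstream models it depends on (hypothetical example)
DEPS = {
    "stg_orders": set(),
    "stg_customers": set(),
    "orders_daily": {"stg_orders", "stg_customers"},
}

# One `dbt run --select <model>` per task, in a valid dependency order.
order = list(TopologicalSorter(DEPS).static_order())
commands = ["dbt run --select " + model for model in order]
print(commands[-1])  # dbt run --select orders_daily
```

One task per model is what gives the flexibility (and the task explosion) discussed next.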
Obviously there are a few pros/cons to this approach:</p><ul><li><strong>cons</strong>: the Airflow UI does not like having too many tasks (especially the graph view); in their setup with a KubernetesExecutor it means a lot of cold starts, because a model run means a new pod with a dbt CLI bootstrap; and you have a lot of dependencies to manage</li><li><strong>pros</strong>: you are very flexible because you can run one model at a time if you want; incident management is simplified because dbt's flaws on this topic are filled by Airflow standards; and monitoring can be done</li></ul><p>In the end they showcased the Metabase dashboard helping them understand every dbt run. It is very complete, mixing data from Airflow—with a clever trick: they use XCom to save metadata in the database to be able to use it in Metabase—and the dbt artifacts.</p><p></p><!--kg-card-begin: html--><p style="text-align:center;"><a href="https://drive.google.com/drive/u/0/folders/1obCKu97ifdt4SvErBZg0it5rv35ZKnIh?ref=blef.fr" style="cursor: pointer; background-color: #E4E6E1; padding: 10px 20px; border-radius: 5px;">👀 See the slides</a></p><!--kg-card-end: html--><p></p><p><em>PS: shout-out to the people I met there who read the newsletter; your kind words are important and give me a lot of motivation. 
See you soon ❤️.</em></p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/12/1670369787586-6-.jpg" class="kg-image" alt loading="lazy" width="2000" height="1333" srcset="https://www.blef.fr/content/images/size/w600/2022/12/1670369787586-6-.jpg 600w, https://www.blef.fr/content/images/size/w1000/2022/12/1670369787586-6-.jpg 1000w, https://www.blef.fr/content/images/size/w1600/2022/12/1670369787586-6-.jpg 1600w, https://www.blef.fr/content/images/2022/12/1670369787586-6-.jpg 2048w" sizes="(min-width: 720px) 720px"><figcaption>Studious atmosphere to listen Charles^2 (credits Alaeddine)</figcaption></figure><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li><a href="https://medium.com/memphis-dev/how-to-avoid-schema-drift-a36bd06ed622?ref=blef.fr">How to avoid “schema drift”</a> — This article put word on the schema drift concept, which is the same as configuration drift (e.g. in Terraform) but for data. It happens for instance when you have 2 producers of the same event but they are not using the same type for a column. Although the article is a bit vendor oriented it is still relevant and will ring a bell to a lot of engineers.</li><li><a href="https://www.montecarlodata.com/blog-data-contracts/?ref=blef.fr">7 lessons from GoCardless’ implementation of data contracts</a> — Before ChatGPT hype the whole LinkedIn was speaking of data contracts. Here are takeaways from GoCardless. 
To be honest I should take more space than a bullet point to detail what they are doing; the key learnings are worth reading.</li><li><a href="https://towardsdatascience.com/what-i-learned-in-my-first-6-months-as-a-director-of-data-science-d9b7b98a48f7?ref=blef.fr">What I learned in my first 6 months as a director of data science</a> — tl;dr be ready to rumble for the hiring competition.</li><li><a href="https://shopifyengineering.myshopify.com/blogs/engineering/server-sent-events-data-streaming?ref=blef.fr">Using server sent events to simplify real-time streaming at scale</a> — An interesting discussion about concepts around real-time communication for apps.</li><li><a href="https://engineering.grab.com/zero-trust-with-kafka?ref=blef.fr">Zero trust with Kafka</a> — Sorry, I've read too many articles these days and my brain can't process this one, but I like the diagrams.</li><li><a href="https://ergestx.substack.com/p/learning-advanced-sql?ref=blef.fr">How to get REALLY good at advanced SQL</a> — It may be interesting for a few of us; this article touches on how to level up your SQL expertise.</li><li><a href="https://medium.com/helpshift-engineering/generating-chatbot-performance-insights-using-spark-sql-at-helpshift-6cf15e905604?ref=blef.fr">Generating Chatbot performance insights using Spark SQL at Helpshift</a>.</li></ul><p></p><h1 id="data-fundraising-%F0%9F%92%B0%F0%9F%87%AB%F0%9F%87%B7">Data Fundraising 💰🇫🇷</h1><ul><li><a href="https://pathway.com/?ref=blef.fr"><strong>Pathway</strong></a> <a href="https://sifted.eu/articles/female-led-deeptech-pathway-ai/?ref=blef.fr">raises $4.5m pre-seed round</a>. This is an insane amount of money for a pre-seed. Pathway is a French startup in open beta providing real-time processing. You pip install their package and then you're able to transform your tables in Python. 
Transformations are operations like select, index, filter, join or map.</li><li><a href="https://www.husprey.com/blog/seed?ref=blef.fr"><strong>Husprey</strong> raises $3m seed round</a>. Husprey provides an alternative to the dashboard world for data analyses with advanced SQL notebooks. They already have a large number of connectors and even integrate with dbt. Husprey is also a French-founded company.</li></ul><hr><p>See you next week ❤️.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 22.48 ]]></title>
                    <description><![CDATA[ Data News #22.48 — Very fast news, Advent of Data debuts, Snowflake sends emails, deprecates dashboards. ]]></description>
                    <link><![CDATA[ /data-news-week-22-48/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 638a303db41d25003d56f045 ]]></guid>
                    <pubDate><![CDATA[ 2022-12-03 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/12/image.png" class="kg-image" alt loading="lazy" width="2000" height="1499" srcset="https://www.blef.fr/content/images/size/w600/2022/12/image.png 600w, https://www.blef.fr/content/images/size/w1000/2022/12/image.png 1000w, https://www.blef.fr/content/images/size/w1600/2022/12/image.png 1600w, https://www.blef.fr/content/images/size/w2400/2022/12/image.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>Train(s) (<a href="https://unsplash.com/photos/rbBEs6Hljyg?ref=blef.fr">credits</a>)</figcaption></figure><p>Hey you, this is an unusual Saturday. I'm terribly late with this newsletter. This week I had a huge amount of work to deal with and <strong>we've launched the <a href="https://www.adventofdata.com/?ref=blef.fr">Advent of Data</a>, your daily spark of data in December</strong>. Thanks to everyone who accepted to participate; we already published the first 3 articles and I can't wait to read everything else the writers are working on.</p><p>In a nutshell the first 3 articles are:</p><ul><li><strong>MLOps isn’t DevOps for ML</strong> — Abi strongly answers thenewstack.io's claim that the machine learning field should find DevOps practitioners to fill the lack of people in ML operations. </li><li><strong>Using Airflow the wrong way</strong> — An experimental article I wrote where I explore Airflow as a framework rather than an all-in-one scheduler/orchestrator tool. 
What if we decide to schedule Airflow DAGs in Gitlab?</li><li><strong>Modern Data Modeling: Start with the End?</strong> — Brice wrote about dbt project structure and the foundations of a working modelling approach.</li></ul><p></p><!--kg-card-begin: html--><p style="text-align:center;"><a href="https://www.adventofdata.com/?ref=blef.fr" style="cursor: pointer; background-color: #E4E6E1; padding: 10px 20px; border-radius: 5px;">🎄 Go to the Advent of Data 🎄</a></p><!--kg-card-end: html--><p></p><p>On a side note, we are 200 members away from 2,000 and it'd be an awesome gift to reach this number before next year. So if you like the newsletter maybe recommend it to your co-workers 🙃.</p><p></p><h1 id="fast-news-very-fast-this-time-%E2%9A%A1">Fast News (very fast this time) ⚡</h1><p>Because I want to deliver the news as soon as possible after my initial delay and a few IRL adventures—I'm currently stuck between Germany and France—this edition will only be a collection of bullet points with opinions. </p><ul><li>Joe Reis launched his Substack — Joe is the co-author of the great <em>The Fundamentals of Data Engineering</em> and his blog already has 2 articles I deeply recommend: <a href="https://joereis.substack.com/p/no-extra-credit-for-complexity?ref=blef.fr">No extra credit for complexity</a> &amp; <a href="https://joereis.substack.com/p/groundhog-days?ref=blef.fr">Groundhog Days</a>. The first article can be summed up as: aim for simplicity in every system you build; the second one tries to answer how data can be believable and add value.</li><li><a href="https://hoffa.medium.com/hey-snowflake-send-me-an-email-243741a0fe3?ref=blef.fr">Hey Snowflake, send me an email</a> — Christmas is sometimes the time of the year when magic happens. And once again it happened. Felipe showcases how you can send an email from Snowflake. But, sadly, let's be honest, I hate the way it has to be done. 
Stored procedures, meh, feels like Oracle.</li><li><a href="https://doordash.engineering/2022/11/29/how-doordash-secures-data-transfer-between-cloud-and-on-premise-data-centers/?ref=blef.fr">How DoorDash secures data transfer</a> — This is network stuff, but still interesting for a large part of data engineers. DoorDash needed to send traffic from AWS to the on-premise datacenters of their payment providers.</li><li><a href="https://sarahsnewsletter.substack.com/p/the-thrill-of-deprecating-dashboards?ref=blef.fr">The thrill of deprecating dashboards</a> — Last week, while summarizing my data dream team presentation, I said that every data team should clean their BI tool every 6 months. This week Sarah shares a few tips on how to do it, from dumping BI data to the warehouse to a stats report before cleaning.</li><li><a href="https://towardsdatascience.com/how-data-and-finance-teams-can-be-friends-and-stop-being-frenemies-7ecc357f51ef?ref=blef.fr">How data and finance teams can be friends</a> — As weird as it can be, the data team is sometimes seen as an annoying stakeholder by business teams, often because we are stuck between the search for engineering stability and keeping a fast delivery pace to follow growth. This post tries to show how the data team should work with the finance team to avoid this spat.</li><li><a href="https://annageller.medium.com/how-to-manage-data-teams-build-a-reliable-platform-ensure-data-quality-bd56ab81f0bf?ref=blef.fr">How to manage data teams, build a reliable platform &amp; ensure data quality</a> — 20 bullet points split into 4 categories. Anna shares a good checklist to build data team foundations.</li><li><a href="https://preset.io/blog/intro-data-modeling/?ref=blef.fr">Introduction to Data Modeling</a> — Data modeling is the trendy skill everyone wants to learn today; dbt and the Analytics Engineering trend put modeling back in the front seat. 
This article by the Preset team is a good introduction.</li><li><a href="https://www.technologyreview.com/2022/11/25/1063707/ai-minecraft-video-unlock-next-big-thing-openai-imitation-learning/?ref=blef.fr">A bot that watched 70,000 hours of Minecraft could unlock AI’s next big thing</a> — Yes, generative models are fun, but have you looked at imitation or reinforcement learning? It's crazy how impressive these kinds of models are, and I really like the fact that video games are the medium for this research. <br><em>PS: <a href="https://upcommons.upc.edu/bitstream/handle/2117/367221/164814.pdf?sequence=1&isAllowed=y&ref=blef.fr">the same stuff is happening</a> in Rocket League and it is also impressive.</em></li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/12/image-1.png" class="kg-image" alt loading="lazy" width="900" height="540" srcset="https://www.blef.fr/content/images/size/w600/2022/12/image-1.png 600w, https://www.blef.fr/content/images/2022/12/image-1.png 900w" sizes="(min-width: 720px) 720px"><figcaption>Generative trains (<a href="https://unsplash.com/photos/ZJKE4XVlKIA?ref=blef.fr">credits</a>)</figcaption></figure><p></p><h4 id="big-tech-watch">Big tech watch</h4><ul><li><a href="https://trino.io/blog/2022/11/28/trino-summit-2022-apple-recap.html?ref=blef.fr">Trino at Apple</a> — Getting inspiration from big tech companies has always been a great way to discover patterns. 
This time Apple engineers shared at Trino—good old Presto for lost people—Summit how they use it.</li><li><a href="https://engineering.linkedin.com/blog/2022/topicgc_how-linkedin-cleans-up-unused-metadata-for-its-kafka-clu?ref=blef.fr">How LinkedIn cleans up unused metadata for its Kafka clusters</a> — Kafka garbage collection, if you like to understand Kafka internals this post is for you.</li><li><a href="https://engineering.fb.com/2022/11/30/data-infrastructure/static-analysis-sql-queries/?ref=blef.fr">Enabling static analysis of SQL queries at Meta</a> — reducing the feedback loop on your data models edition is probably the biggest challenge data teams are facing today. I'm not afraid to say it. This is not data contracts or Rust, I think that the most annoying thing for a data team is the time lost on data models development. This is why having a great static SQL analysis is a good starting point in reducing the amount of manual steps. Obviously Meta is Meta and they redeveloped everything from the ground up.</li></ul><p></p><hr><p>Thank you all and see you next week ❤️.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 22.47 ]]></title>
                    <description><![CDATA[ Data News #22.47 — Advent of data 2022, how to build the data dream team, Postgres to DynamoDB, graphs and scaled data mesh. ]]></description>
                    <link><![CDATA[ /data-news-week-22-47/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 6380a8b411f4c8003d569b6e ]]></guid>
                    <pubDate><![CDATA[ 2022-11-25 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/11/image-10.png" class="kg-image" alt loading="lazy" width="900" height="612" srcset="https://www.blef.fr/content/images/size/w600/2022/11/image-10.png 600w, https://www.blef.fr/content/images/2022/11/image-10.png 900w" sizes="(min-width: 720px) 720px"><figcaption>Capturing the news (<a href="https://unsplash.com/photos/LZ4EQjr-aHE?ref=blef.fr">credits</a>)</figcaption></figure><p>Hello you, I hope this data news finds you well. Time flies to be honest.</p><p>I've launched in a rush an <strong>Advent of Data</strong>. The goal is simple: in December, 24 data people will produce 24 data gems. Every day a new piece of content will be released on a dedicated website. If you wanna join the initiative please reply, we are still looking to fill a few slots. I know it's late notice, but this is a good occasion to contribute to the data community.</p><p></p><h1 id="how-to-build-the-data-dream-team">How to build the data dream team</h1><p>This Monday I did my first ever presentation at an international meetup, in English. The experience was great and I enjoyed it, and I hope people in the audience liked it too. A video of the whole presentation should be out soon, but while waiting here is a small glimpse of the talk. </p><p>In this presentation I tried to share ideas on <a href="https://docs.google.com/presentation/d/1hTqtvGOoVyJ7whYpQ2jRLFLJliJHwuC473xo0iI0Ons/edit?ref=blef.fr#slide=id.gfc4a593a50_0_30">how you can create a data dream team</a>. It is meant to be a collection of ideas and concepts you have to think about rather than a go-to solution. I'd also say that you should avoid blindly following general advice, because implementation always depends. 
It always depends on so many things: the product, the resources you have, the company vision, the localisation, etc.</p><p>So yeah, right now the data market is pretty hot. A lot of companies are heavily looking for senior data engineers and analytics people—whether DA or AE—<a href="https://layoffs.fyi/?ref=blef.fr">while layoffs are as high as during the COVID period</a>. In my opinion, in order to create the data dream team you should understand your team creation funnel. Something like:</p><ul><li><strong>Attract</strong> — you need to make people apply, or at least reply to talent acquisition managers</li><li><strong>Welcome</strong> — you never get a second chance to make a first impression, so pay attention to the first week</li><li><strong>Onboard</strong> — after the welcoming part you need to pay attention to the first 6 months</li><li><strong>Keep</strong> — this is as important as the previous steps in the funnel; you have to work to keep people satisfied</li></ul><p>At the meetup I especially detailed what you can do to keep people. 
You have to build the data dream team <strong>everyone wants to join and no one wants to leave</strong>.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/11/Screenshot-2022-11-25-at-17.04.30.png" class="kg-image" alt loading="lazy" width="2000" height="1106" srcset="https://www.blef.fr/content/images/size/w600/2022/11/Screenshot-2022-11-25-at-17.04.30.png 600w, https://www.blef.fr/content/images/size/w1000/2022/11/Screenshot-2022-11-25-at-17.04.30.png 1000w, https://www.blef.fr/content/images/size/w1600/2022/11/Screenshot-2022-11-25-at-17.04.30.png 1600w, https://www.blef.fr/content/images/size/w2400/2022/11/Screenshot-2022-11-25-at-17.04.30.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>A few ideas on what you can do to create a great data team working environment</figcaption></figure><p>To be honest, it's impossible to have everything done instantly; this is more of a long-term game. I also think that there are 3 major levers that are very important in the happiness of a team.</p><ul><li>You need to find the correct role ratio. I mean, how many data engineers the team should have compared to scientists and analysts. In the past Jesse Anderson always advocated for 2-3 DE per DA/DS in a simple team and 4-5 in a more complex setup. I still think this is only a dream. As of today I believe a good ratio would be <strong>DE / (DA + DS) &gt; 1</strong>. Managing this ratio is really about managing frustration. Put simply: the fewer engineers, the more frustrated data people will be.</li><li>Define the vision, the strategy and the roadmap of the data team. Every data team goes through an identity crisis at some point. A lot of data teams started out doing Shadow IT, saying yes to every data-related project. But at some point it has to stop. The data team's mission should be clear and understood by everyone.</li><li>Last but not least, aim for no tech debt. Obviously this is easier said than done. 
But this is something that should be tackled early in a team, because it is another topic that leads to frustration. And frustration leads to resignation.</li></ul><p>Finally I have a slide that I really like, with strong opinions that are just meant to make people think. Here it is below:</p><ul><li>Automate everything (IaC)</li><li>Data engineers don’t write ETL</li><li>Standards, a straight pipe is easier to fix than a curved one</li><li>Data analysts know data better than everyone</li><li>Do Python, don't do Java</li><li>Real time is useless</li><li>Describe every warehouse field</li><li>SREs and software engineers are your best friends</li><li>Who has 0 pipeline issues in the last 30 days?</li><li>Who can’t answer this question in less than 5s?</li><li>Ask your DE to talk to stakeholders</li><li>GDPR—no one does it, right?</li></ul><p><em>This is my presentation in a nutshell. I'm curious to hear what you think about it. In the last slide of my presentation you have <a href="https://docs.google.com/presentation/d/1hTqtvGOoVyJ7whYpQ2jRLFLJliJHwuC473xo0iI0Ons/edit?ref=blef.fr#slide=id.gfc4a593a50_0_36">links to 10 articles</a> that will help you for sure.</em></p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/11/image-11.png" class="kg-image" alt loading="lazy" width="900" height="600" srcset="https://www.blef.fr/content/images/size/w600/2022/11/image-11.png 600w, https://www.blef.fr/content/images/2022/11/image-11.png 900w" sizes="(min-width: 720px) 720px"><figcaption>WOW MY DATA TEAM IS AWESOME 🤣🤣🤣🤣 (<a href="https://unsplash.com/photos/p74ndnYWRY4?ref=blef.fr">credits</a>)</figcaption></figure><p></p><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li><a href="https://www.strategy-business.com/article/In-a-data-led-world-intuition-still-matters?ref=blef.fr">In a data-led world, intuition still matters</a> — The title says it all. 
I mainly think this is a reminder that data-driven decisions are good but, as Alfred Sauvy once said, "<em>Numbers are fragile beings who, by dint of being tortured, end up confessing everything we want them to say</em>" (thanks to Pierre 😉).</li><li><a href="https://www.businessinsider.com/google-ai-write-fix-code-developer-assistance-pitchfork-generative-2022-11?ref=blef.fr">Google has a secret new project that is teaching artificial intelligence to write and fix code</a> — We are still waiting for self-driving cars to really replace drivers, so lmao.</li><li><a href="https://tech.instacart.com/from-postgres-to-amazon-dynamodb-4791220b2d5d?ref=blef.fr">From Postgres to Amazon DynamoDB</a> — Another migration story, this time by Instacart, who benchmarked DynamoDB to replace Postgres in their push notification system. In the article they detail the data model adaptations they made.</li><li><a href="https://blog.devgenius.io/versioning-in-analytics-platforms-7d9968d3e146?ref=blef.fr">Versioning in analytics platforms</a> — Petrica has a great sense when it comes to depicting analytical work. This time she shows at which steps of the analytical work you can add versioning. She also showcases Nessie, a data catalog that works with incremental changes like git.</li><li><a href="https://medium.com/@ugociraci/scaled-data-mesh-250e2fe5c36f?ref=blef.fr">Scaled data mesh</a> — The author tries to highlight the limitations every organisation will face with a mesh strategy. This is mainly a governance problem, but judging from companies already trying to implement it, I'd say this point is already identified.</li><li><a href="https://dantelore.com/posts/simplest-data-pipeline/?ref=blef.fr">World's Simplest Data Pipeline?</a> — "<em>Data Engineering is very simple.  It’s the business of moving data from one place to another.</em>" This is something I could have said. This article is so simple, but so true. A few bullet points to check. 
Every data folk should read it before writing any pipeline.</li><li><a href="https://jacobjustcoding.medium.com/retry-pattern-f56df433038c?ref=blef.fr">Retry pattern</a> — When writing a pipeline you also have to think about error resolution. Resolution can be automated with a few retry patterns. This is short and conceptual, but makes good points.</li><li><a href="https://engineering.grab.com/graph-for-fraud-detection?ref=blef.fr">Graph for fraud detection</a> — The Grab team explained how they used graphs to do fraud detection, which is, by the way, one of the best ways to handle fraud detection.</li></ul><div class="kg-card kg-callout-card kg-callout-card-grey"><div class="kg-callout-emoji">📗</div><div class="kg-callout-text"><strong>White paper</strong> — <a href="https://dl.acm.org/doi/pdf/10.1145/3542929.3563483?ref=blef.fr">Elastic cloud services: scaling Snowflake’s control plane</a>. Have fun reading this. It deeply details how the control plane works.</div></div><p></p><h1 id="data-fundraising-%F0%9F%92%B0">Data Fundraising 💰</h1><ul><li><a href="https://www.oneschema.co/blog/oneschema-announces-6m-fundraise?ref=blef.fr"><strong>OneSchema</strong> raises $6.3m in Seed</a>. OneSchema believes that even though we have awesome replacements, CSV files will stay forever in the tech world. So they developed a suite of tools to help engineers ingest CSVs. With their SDK you can add a drag-and-drop panel that auto-detects your CSV in the browser and lets the user fix issues, while you validate the data before inserting it in the database.</li></ul><hr><p>See you next week with a new edition and the Advent of Data 🎄.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 22.46 ]]></title>
                    <description><![CDATA[ Data News #22.46 — Paris Airflow meetup, DuckDB, data teams need to break out of their bubble, select * exclude and the fast news. ]]></description>
                    <link><![CDATA[ /data-news-week-22-46/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 637751269129f0003db9d7a6 ]]></guid>
                    <pubDate><![CDATA[ 2022-11-18 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/11/image-8.png" class="kg-image" alt loading="lazy" width="900" height="600" srcset="https://www.blef.fr/content/images/size/w600/2022/11/image-8.png 600w, https://www.blef.fr/content/images/2022/11/image-8.png 900w" sizes="(min-width: 720px) 720px"><figcaption>Scratching the surface (<a href="https://unsplash.com/photos/qMUCSiEkHIo?ref=blef.fr">credits</a>)</figcaption></figure><p>Hey you, a new Friday means data news. This week feels a bit like the old data news, with a variety of articles on different cool topics while I navigate through the actual data trends.</p><p>Next Monday I'll present "How to build a data dream team" at the Y42 meetup. I'll share a written form of my talk in next week's edition. But this week, as an appetizer, there are 2 articles I really liked about data team composition.</p><p>Last but not least, if you are in Paris on the 6th of December you can <a href="https://www.meetup.com/fr-FR/paris-apache-airflow-meetup/events/289492007/?ref=blef.fr">join us for the reboot of the Apache Airflow meetups</a>—I'm the organizer. Talks will be given in French. The agenda:</p><ul><li>leboncoin will share best practices around Airflow</li><li>Qonto will show how you can greatly integrate dbt within Airflow</li><li>I'll also introduce the meetup with the latest Airflow features</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/11/image-7.png" class="kg-image" alt loading="lazy" width="676" height="380" srcset="https://www.blef.fr/content/images/size/w600/2022/11/image-7.png 600w, https://www.blef.fr/content/images/2022/11/image-7.png 676w"><figcaption>I organised the Paris Apache Airflow Meetup — 6th of Dec. 
— <a href="https://www.meetup.com/fr-FR/paris-apache-airflow-meetup/events/289492007/?ref=blef.fr">JOIN US</a></figcaption></figure><p></p><h1 id="my-two-cents-about-duckdb">My two cents about DuckDB</h1><p>Ok, right now the LinkedIn and Twitter data world is going one-way down the Rust and DuckDB street. While I don't have any opinion on Rust—except that it looks like another eternal programming-language debate I'm bored of—I have one on DuckDB. </p><p>Here is a small description of DuckDB I wrote 2 newsletters ago:</p><blockquote>If you missed it DuckDB is a single-node in-memory OLAP database. In other words it means that DuckDB runs on a single server, loads the data using columnar format in the memory (RAM) and applies transformation on it. Natively DuckDB integrates with SQL and Python, which also means you can query your data with Python or SQL.</blockquote><p>First, let's decrypt the marketing. DuckDB's mother company, called MotherDuck, says stuff like "Big Data is dead" or "Your laptop is faster than your data warehouse", which theoretically opens the door back to single-instance processing for your data. This is brilliantly good, tbh. I buy it. Plus they add this fun tone with ducks, which creates sympathy for the product. </p><p><strong>But is it really something?</strong></p><p>I think it is, but I might have already been influenced by the marketing. When I think about DuckDB's simplicity, it's exhilarating.</p><p>You do <code>pip install duckdb</code> then <code>import duckdb</code> and you are good to go. You don't need to run a server. A database is available to you; you can read files (CSV or Parquet) and execute SQL or DataFrame operations on them seamlessly. </p><p>I can imagine a list of use cases that will help improve the data engineering workflow, but at the same time I don't believe DuckDB can become the main processing engine of a data platform. 
I mean, by its single-node nature the technology will for sure serve decentralised teams with a central lake brilliantly, but I see more edge use cases like: running data processing in the CI/CD to quickly validate stuff, providing a great local dev experience to every data developer, or powering small data analytics products.</p><p>I don't think it can replace the current data warehouse vision or technologies, and in my opinion it shouldn't be sold as or compared with them. It's more of a cool sidekick to the actual modern data stack. Still, with the huge amount of money invested and the current course of things where everyone wants to try the hype, I'm afraid it'll turn out differently.</p><p><a href="https://olivermolander.medium.com/duckdb-whats-the-hype-about-5d46aaa73196?ref=blef.fr">Oliver also shared deeper views on the hype</a>.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/11/image-9.png" class="kg-image" alt loading="lazy" width="900" height="600" srcset="https://www.blef.fr/content/images/size/w600/2022/11/image-9.png 600w, https://www.blef.fr/content/images/2022/11/image-9.png 900w" sizes="(min-width: 720px) 720px"><figcaption>Ducks on the horizon (<a href="https://unsplash.com/photos/t0QQAOKVqYU?ref=blef.fr">credits</a>)</figcaption></figure><p></p><h1 id="data-teams-need-to-break-out-of-their-bubble">Data teams need to break out of their bubble</h1><p>Mary MacCarthy published a great post. It's a wake-up call for data teams. In the current economic situation, all the intellectual discussions about the vision of the field are fun, but this is not really what data teams are built for. Data teams exist in most companies to empower other teams. I'd also bet that the semantic layer, DuckDB, Rust or other trendy stuff is not what will empower your stakeholders. 
</p><p>Right now, according to Mary, the best move you can make to empower your stakeholders is to <a href="https://hightouch.com/blog/data-teams-need-to-break-out?ref=blef.fr">break out of your bubble</a> and really work in pairs with them. In the article she takes the example of the relationship between the marketing team and the data team, which often looks like <a href="https://en.wikipedia.org/wiki/Shadow_IT?ref=blef.fr">shadow IT</a>. Martech solutions are often another all-in-one data platform. </p><!--kg-card-begin: html--><p style="text-align:center;"><a href="https://hightouch.com/blog/data-teams-need-to-break-out?ref=blef.fr" style="display: inline-block; cursor: pointer; background-color: #E4E6E1; padding: 10px 20px; margin: 30px auto; border-radius: 5px; text-decoration: none;">Read the article</a></p><!--kg-card-end: html--><p>On the same topic, Mikkel Dengsøe came back with a great article about <a href="https://mikkeldengsoe.substack.com/p/purple-people-outside-data?ref=blef.fr">data people outside of the data team</a>. He brings a few tips and pitfalls to make this setup work.</p><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li><a href="https://www.notion.so/product/ai?wr=dcdb13eeb837ea4e&ref=blef.fr">Notion announced Notion AI</a> — Notion will introduce an AI assist block that will be able to generate text in your Notion pages. Right now it's in alpha with a waitlist. Under the hood it uses OpenAI; in the FAQ Notion promises that your data will be protected and not used by OpenAI.</li><li><a href="https://heyashy.medium.com/supercharge-your-python-code-with-dataclasses-6965ddd7fb98?ref=blef.fr">Dataclasses: Supercharge your Python code</a> — If you don't use Python's dataclasses you should look at this article, which gives you usage examples. I personally use dataclasses a lot when it comes to creating configuration for my data pipelines. 
It allows me to type my configuration and to replace bracket notation with objects, which is more comfortable when developing.</li><li><a href="https://medium.com/snowflake/snowflake-select-exclude-rename-3e9c9a4073ed?ref=blef.fr">Snowflake <code>SELECT * EXCLUDE/RENAME</code></a> — It was one of the features I missed the most when I switched from BigQuery to Snowflake. Here it is. You'll now be able to supercharge your Snowflake select * by either excluding unwanted columns or renaming some on the fly. It saves precious SQL lines when you have a lot of columns.</li><li><a href="https://medium.com/mlearning-ai/visualization-tips-for-data-story-telling-1e99cccbb8c7?ref=blef.fr">Visualization tips for data story-telling</a> — How to pick colors, how to display text and at what size, and how to emphasize one data point among others. This article is a good heads-up.</li><li><a href="https://github.com/StarRocks/StarRocks?ref=blef.fr">StarRocks, a next-gen sub-second MPP database</a> — I discovered a new open-source real-time OLAP database. Nothing more to say except that I put it in the newsletter as a save-for-later.</li><li><a href="https://www.coinbase.com/blog/revamping-the-apache-airflow-based-workflow-orchestration-platform-at?ref=blef.fr">Revamping the Apache Airflow-based workflow orchestration platform at Coinbase</a> — What to do when you have around 1000 pipelines and more than 1500 PRs every month on your project.</li><li><a href="https://towardsdatascience.com/spark-data-pipelines-in-the-cloud-118f38ea90b7?ref=blef.fr">Building Spark data pipelines in the cloud, what you need to get started</a> — Spark has not yet disappeared even if I don't share that much content about it in the newsletter. 
This is a complete guide about Spark worth mentioning.</li><li><a href="https://towardsdatascience.com/your-data-catalog-shouldnt-be-just-one-more-ui-e6bffb793cf1?ref=blef.fr">Your data catalog shouldn’t be just one more UI</a> — In today's data ecosystem all data catalogs have been developed following the same concepts coming from SV big tech startups. In this article the author makes it explicit that a data catalog should be more than a search bar over the entities. Rather, a data catalog should firstly be a central metadata repository with open APIs allowing every data team to activate real use cases. <br><em>See also: <a href="https://nonodename.com/post/ddlsemantics/?ref=blef.fr">More on semantics &amp; databases</a>. What if we could add more semantics directly in the database schema comments?</em></li><li>(I did not have the time to read these 2 articles) <a href="https://medium.com/@chris.jackson_46175/simplifying-3nf-c0ad1090a2fc?ref=blef.fr">Simplifying 3NF</a> &amp; <a href="https://dishanka.medium.com/data-skew-101-e5a7bda36f76?ref=blef.fr">Data Skew : 101</a>.</li></ul><p></p><div class="kg-card kg-callout-card kg-callout-card-grey"><div class="kg-callout-emoji">⚖️</div><div class="kg-callout-text"><a href="https://blog.developer.atlassian.com/learn-how-to-prepare-for-new-european-data-privacy-requirements/?ref=blef.fr">Learn how to prepare for new European data privacy requirements</a> — A rare article about data privacy requirements. Atlassian shares legal matters that might resonate with your legal team if you do international data transfers.</div></div><p></p><h1 id="data-fundraising-%F0%9F%92%B0">Data Fundraising 💰</h1><ul><li><a href="https://www.quix.io/blog/quix-raises-12-9-million-series-a-funding/?ref=blef.fr"><strong>Quix</strong> raises $12.9m Series A</a>. Quix is a serverless real-time platform that allows developers to focus on developing real-time apps rather than spending time on the underlying infra.
Their SDK works with Python and C#.</li><li><a href="https://motherduck.com/blog/announcing-series-seed-and-a/?ref=blef.fr"><strong>MotherDuck</strong> raises $47.5m Seed and Series A</a>. Just a side note about the fundraising of DuckDB's mother company. I've already shared most of my thoughts about this in this newsletter's edito. The company seems to be on the track of the giants, fuelled with a16z money. <a href="https://data-folks.masto.host/@sspaeti/109351920363518396?ref=blef.fr">As others are betting</a>, we have a few months of trendy ducks ahead of us.</li></ul><hr><p>See you next week ❤️.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 22.45 ]]></title>
                    <description><![CDATA[ Data News #22.45 — I&#39;ve joined Mastodon, Equals and EdgeDB fundraising, how Riot do ML, schema change management, etc. ]]></description>
                    <link><![CDATA[ /data-news-week-22-45/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 636e367d29df9d004da10989 ]]></guid>
                    <pubDate><![CDATA[ 2022-11-11 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/11/image-4.png" class="kg-image" alt loading="lazy" width="900" height="596" srcset="https://www.blef.fr/content/images/size/w600/2022/11/image-4.png 600w, https://www.blef.fr/content/images/2022/11/image-4.png 900w" sizes="(min-width: 720px) 720px"><figcaption>Mastodon and Hadoop are on a boat... (<a href="https://unsplash.com/photos/j4ocWYAP_cs?ref=blef.fr">credits</a>)</figcaption></figure><p>Hey you, the 11th of November used to be a day off for me. Since I started my freelancing activities I don't really follow the usual calendar, working whenever I need/want. I mainly work 3 to 4 days a week. Which is awesome, but it has a major drawback: I've never taken a break longer than 1 week. Which, yeah, kinda sucks. Let's change this next year.</p><p>On a social note, today I've joined the data-folks Mastodon server, you can <a href="https://data-folks.masto.host/web/@blef?ref=blef.fr">follow me there</a>. I'll add this new community as a source for my curation and I'm gonna try to be active there.</p><p>Also, on the 21st of November I'm gonna speak at a meetup for the first time in English and in Berlin. So if you wanna listen to my terrible French accent, <a href="https://www.meetup.com/fr-FR/berlin-data-analytics-meetup-group/events/289313238/?ref=blef.fr">join us</a>. I'll speak about "How to build the data dream team".</p><p>Let's jump onto the news.</p><p></p><h1 id="ingredients-of-a-data-warehouse">Ingredients of a Data Warehouse</h1><p>Going back to basics. Kovid wrote an article that explains <a href="https://servian.dev/ingredients-of-a-data-warehouse-cd68b48f5306?ref=blef.fr">the ingredients of a data warehouse</a>. And he does it well. A data warehouse is a piece of technology that rests on 3 ideas: data modeling, data storage and the processing engine.</p><p>In the post Kovid details every idea.
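As a toy illustration of the storage idea (my own sketch, not from Kovid's article), here is the same table laid out row-wise and column-wise; an analytical aggregate only needs to touch one column in the columnar layout:

```python
# Toy illustration: the same table stored row-wise vs column-wise.
rows = [
    {"id": 1, "country": "FR", "revenue": 10.0},
    {"id": 2, "country": "DE", "revenue": 20.0},
    {"id": 3, "country": "FR", "revenue": 5.0},
]

# Column-based layout: each column is stored contiguously.
columns = {
    "id": [1, 2, 3],
    "country": ["FR", "DE", "FR"],
    "revenue": [10.0, 20.0, 5.0],
}

# An analytical aggregate only touches one column in the columnar layout...
total_columnar = sum(columns["revenue"])
# ...while the row layout forces the engine to walk every full record.
total_rows = sum(r["revenue"] for r in rows)
```

Both sums are equal, but the columnar scan reads a third of the data, which is why analytical engines favour column storage.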
In this cloud world where everything is serverless, good data modeling is still a key factor in the performance—which often means cost—of a data platform. Modeling is often led by dimensional modeling but you can also do 3NF or data vault. When it comes to storage it's mainly a row-based vs. column-based discussion, which in the end will impact how the engine processes data.</p><p></p><h1 id="schema-changes-management">Schema changes management</h1><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/11/image-6.png" class="kg-image" alt loading="lazy" width="900" height="902" srcset="https://www.blef.fr/content/images/size/w600/2022/11/image-6.png 600w, https://www.blef.fr/content/images/2022/11/image-6.png 900w" sizes="(min-width: 720px) 720px"><figcaption>A story of an int becoming a str (<a href="https://unsplash.com/photos/hoS3dzgpHzw?ref=blef.fr">credits</a>)</figcaption></figure><p>I bet that the most common data horror stories are about schema changes. It could be because the product team changed an integer to a varchar in a source Postgres table or because an analyst removed the tax field in the income table. Every time it means morning headaches with Slack messages, Airflow screaming at you with red circles and downstream pipelines to re-run.</p><p>Fast forward to today, more and more teams are trying to fix this. Here are a few articles that will give you some ideas about what to do—tbh, there isn't a one-stop solution to fix it:</p><ul><li><a href="https://blog.devgenius.io/programmatic-schema-management-1b5efd180e68?ref=blef.fr">Programmatic schema management</a> — Manage all your schemas with some kind of code. Petrica showcases Alembic at the end of the article, which works, but I think it adds too much overhead in the data warehousing world.
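To make the CI/CD diff idea concrete, here is a minimal sketch (entirely my own illustration, not taken from any of these articles) of a static schema diff that flags breaking changes:

```python
# Minimal sketch of a static schema diff for a CI check (illustrative only):
# a schema is a mapping of column name -> type, and a change is "breaking"
# when a column disappears or changes type.
def diff_schemas(old: dict[str, str], new: dict[str, str]) -> list[str]:
    breaking = []
    for column, col_type in old.items():
        if column not in new:
            breaking.append(f"removed column: {column}")
        elif new[column] != col_type:
            breaking.append(f"type change on {column}: {col_type} -> {new[column]}")
    return breaking

issues = diff_schemas(
    {"id": "int", "tax": "float"},
    {"id": "varchar"},  # id became a varchar, tax was dropped
)
```

A CI job would fail the pipeline whenever `issues` is non-empty, turning the morning headache into a blocked pull request.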
</li><li><a href="https://blog.infuseai.io/how-to-be-more-confident-making-data-model-changes-76a2f65feffa?ref=blef.fr">How to be more confident making data model changes</a> — This article is a hidden ad by the author, but still. It greatly depicts what you can do at the CI/CD level with a static diff that compares old schemas with new ones.</li><li><a href="https://engineering.fb.com/2022/11/09/developer-tools/tulip-schematizing-metas-data-platform/?ref=blef.fr">Tulip: Schematizing Meta’s data platform</a> — Shows a tool called Tulip that handles the schematization of messages while also handling schema evolution. </li></ul><p></p><h1 id="machine-learning-at-riot-games">Machine learning at Riot Games</h1><p>If you play video games like me you'll like this video. If not, you'll still like it I think. This is a morning coffee from the MLOps Community with Ian Schweer who works at Riot Games. <a href="https://www.youtube.com/watch?v=JjMc8TguPvQ&t=664s&ref=blef.fr">Ian describes how Riot Games uses data</a> and what machine learning means there.</p><p>Even if I recommend you to watch the video, here are a few points I wrote down that were interesting to me:</p><ul><li>A good part of the discussion was around the fact that DEs and MLEs should just copy what SREs have been doing for years. In the end, why should data management be different from config management—ok, except for the scale?</li><li>Riot has also embraced the concept of the feature store, but at the scale of the enterprise there isn't yet a standard way to do it. In their case it also means they embed the ML models in the game binaries.</li><li>This is probably the concept I liked the most from the video: <em>the end-game dataset</em>. It means that every game can be captured as a dataset, with a known schema, in immutable storage accessible to everyone.
I like this idea and it can be replicated in a lot of different businesses.</li></ul><p></p><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li><a href="https://www.madrona.com/dbt-labs-founder-tristan-handy-on-the-modern-data-stack-partnerships-and-creating-community/?ref=blef.fr">dbt Labs Founder Tristan Handy on the modern data stack, partnerships</a> — This is a cool (long) interview of dbt co-founder Tristan about his vision of the product. If you have time, listen to or read it. My main takeaway is around the fact that dbt (core at least) is community-led. The community created dbt as a framework. A framework to organise your data assets and your knowledge. As of today, dbt is the most advanced framework to do this. The rest is just implementation details.</li><li><a href="https://olivermolander.medium.com/is-it-time-to-rebrand-or-rethink-the-modern-data-stack-5d76366e3c95?ref=blef.fr">Is it time to rebrand (or rethink) the Modern Data Stack?</a> — It complements the previous interview well. 10 years after the "Redshift revolution", it's probably time to put words on today's stacks?</li><li><a href="https://towardsdatascience.com/2003-2023-a-brief-history-of-big-data-25712351a6bc?ref=blef.fr">2003–2023: A Brief History of Big Data</a> — If in parallel you need a great description of the last 20 years, Furcy wrote the whole data platform history from the Google File System in 2003 to the 2022 lakehouse swarm.</li><li><a href="https://medium.com/@cautaerts/data-engineering-is-not-software-engineering-af81eb8d3949?ref=blef.fr">Data engineering is not software engineering</a> — Even if the title is a bit clickbait, the article holds some truths. The author states that data pipelines are not applications and that pipelines are single-person tasks that have to be 100% completed, otherwise they're worthless.
IMHO, this is partially true and it'll mostly depend on how mature the team is in their data asset design.</li><li><a href="https://select.dev/posts/introduction-to-snowflake-micro-partitions?ref=blef.fr">Introduction to Snowflake's Micro-Partitions</a> — I think explanations of database internals are my favourite tech articles. It probably comes from the fact that I like to understand how the stuff I'm using works.</li><li><a href="https://medium.com/gooddata-developers/gooddata-and-dbt-metrics-aa8edd3da4e3?ref=blef.fr">GoodData and dbt Metrics</a> — Headless BI or the Semantic Layer will be the next big vocabulary discussion in the data ecosystem. BI tools will want to sell headless BI while transformation platforms will sell metrics or semantic layers; the idea is to capture the data warehouse exposition via proprietary code.</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/11/image-5.png" class="kg-image" alt loading="lazy" width="900" height="600" srcset="https://www.blef.fr/content/images/size/w600/2022/11/image-5.png 600w, https://www.blef.fr/content/images/2022/11/image-5.png 900w" sizes="(min-width: 720px) 720px"><figcaption>Delivering the fast news (<a href="https://unsplash.com/photos/XoAUPASBOdc?ref=blef.fr">credits</a>)</figcaption></figure><p></p><h1 id="data-fundraising-%F0%9F%92%B0">Data Fundraising 💰</h1><ul><li><strong><a href="https://equals.app/?ref=blef.fr">Equals</a></strong> <a href="https://wraptext.equals.app/equals-series-a/?ref=blef.fr">raises $16m Series A</a>. 4 months after a Seed round they once again got money to develop their Excel alternative. The SaaS app connects to your warehouse and displays your data in a tabular format after a query (graphically built or SQL).
It looks like a Google Sheets on steroids for data.</li><li><a href="https://www.edgedb.com/?ref=blef.fr"><strong>EdgeDB</strong></a> <a href="https://www.edgedb.com/blog/edgedb-series-a?ref=blef.fr">raises $15m Series A</a>. Slowly, year after year, graph databases' time is coming. Enterprises rely on a multitude of apps, each with a different view of their clients. Graph databases are a key piece of technology that provides a unified view over relationships. EdgeDB is a hybrid open-source graph database developed on top of Postgres.</li></ul><p><em>PS: Regarding database trends, Cloud Database Report wrote a great article about <a href="https://clouddb.substack.com/p/7-database-trends-driven-by-aws-google?ref=blef.fr">7 current database market trends</a>. More serverless, graph, vector, Postgres used everywhere, etc.</em></p><hr><p>See you next week.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 22.44 ]]></title>
                    <description><![CDATA[ Data News #22.44 — Datalake with DuckDB and Dagster, databases time, ML Saturday, autocomplete for analysts and good self-service. ]]></description>
                    <link><![CDATA[ /data-news-week-22-44/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 6364e8e35888fe003da5532c ]]></guid>
                    <pubDate><![CDATA[ 2022-11-05 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/11/image-1.png" class="kg-image" alt loading="lazy" width="2000" height="1593" srcset="https://www.blef.fr/content/images/size/w600/2022/11/image-1.png 600w, https://www.blef.fr/content/images/size/w1000/2022/11/image-1.png 1000w, https://www.blef.fr/content/images/size/w1600/2022/11/image-1.png 1600w, https://www.blef.fr/content/images/size/w2400/2022/11/image-1.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>Saturday be like (<a href="https://unsplash.com/photos/tnp3F6Nw6XI?ref=blef.fr">credits</a>)</figcaption></figure><p>Hello data news readers. I hope you had a great week. This is the Saturday data news; yesterday I had blank page syndrome. I hope you don't mind.</p><p>Before jumping into the news, I have 2 things to say. First, I've been listed as <a href="https://www.noonies.tech/2022/emerging-tech/2022-best-data-science-newsletter?ref=blef.fr">Best data science newsletter</a> by Hackernoon. If you like this newsletter I'd love to get your vote. Then, I'll organise an Airflow meetup in Paris on the 6th of December and I'm still looking for speakers. Probably for 5-min lightning talks—fr or en.</p><p>Have fun.</p><p></p><h1 id="build-a-data-lake-from-scratch-with-duckdb-and-dagster">Build a data lake from scratch with DuckDB and Dagster</h1><p>I've recently shared a lot of articles around DuckDB in the newsletter. If you missed it, DuckDB is a single-node in-memory OLAP database. In other words, DuckDB runs on a single server, loads the data in a columnar format into memory (RAM) and applies transformations to it. Natively DuckDB integrates with SQL and Python, which means you can query your data with either.</p><p>This database technology got a lot of traction because of its simplicity to install and use.
Which also means that influencers and bloggers can easily experiment to show you how wonderful it is. This article is no exception.</p><p>Dagster, on the other hand, is another orchestration tool, one that has been designed for the cloud and for data orchestration. They popularized the <em>software-defined assets</em> concept, which is a way to define data assets as code. This way the orchestrator knows the data dependencies and can do reactive scheduling rather than CRON-based scheduling.</p><p>So, Pete and Sandy from the Dagster team showcase how you can <a href="https://dagster.io/blog/duckdb-data-lake?ref=blef.fr">create an S3 data lake with DuckDB as the query engine</a> on top of it. I really like the article because it shows in a small amount of code how you can:</p><ul><li>ingest data from Wikipedia with Pandas</li><li>write a compact end-to-end pipeline test, with a simple test written before the code</li><li>define Dagster data assets</li><li>use DuckDB to read and write S3 assets</li></ul><p>Obviously what they did is purely experimental but it gives ideas on how every company could create a lake with a smaller footprint and a smaller price.
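As a toy sketch of what "the orchestrator knows data dependencies" buys you (my own illustration using the Python standard library, not Dagster's actual API), you can declare assets with their upstreams and derive a valid run order instead of hand-wiring CRONs:

```python
from graphlib import TopologicalSorter

# Hypothetical asset names; each asset maps to the set of assets it depends on.
asset_deps = {
    "raw_wikipedia": set(),
    "staging_pages": {"raw_wikipedia"},
    "daily_report": {"staging_pages"},
}

# Deriving the execution order from declared dependencies: upstream assets
# always run before the assets that consume them.
run_order = list(TopologicalSorter(asset_deps).static_order())
```

With the dependency graph explicit, refreshing `raw_wikipedia` can reactively trigger only its downstream assets rather than a whole CRON-scheduled run.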
I mean, BigQuery and Snowflake are also launching on-demand processing, but here with DuckDB you really know what's running and it's fairly simple, so you can measure all the costs.</p><p><em>PS: as I've never used DuckDB or Dagster—I plan to soon—all my comments are based on my theoretical understanding of the technologies and all the reading I've done about them.</em></p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/11/image-2.png" class="kg-image" alt loading="lazy" width="900" height="600" srcset="https://www.blef.fr/content/images/size/w600/2022/11/image-2.png 600w, https://www.blef.fr/content/images/2022/11/image-2.png 900w" sizes="(min-width: 720px) 720px"><figcaption>DucksDB (<a href="https://unsplash.com/photos/oJ_Reviogx8?ref=blef.fr">credits</a>)</figcaption></figure><p></p><h1 id="databases-time">Databases time</h1><p>It looks like a special edition about databases but it is not. Dremio wrote an article to explain <a href="https://www.dremio.com/subsurface/the-life-of-a-read-query-for-apache-iceberg-tables/?ref=blef.fr">how a read query works with Iceberg tables</a>. In a nutshell, a read query first uses the catalog to find the right metadata files. These point to the correct manifest files in order to get the correct data. In even simpler words, it uses metadata to narrow the data search: the less data you read, the faster the query will be. </p><p>Now let's go to the more exotic database side.
The Redis team wrote a guide on <a href="https://redis.com/blog/legacy-database-migration/?ref=blef.fr">things to consider when doing a database migration</a> and Mohammad wrote a <a href="https://www.mydistributed.systems/2022/10/dynamodb-ten-years-later.html?ref=blef.fr">retro on DynamoDB 10 years after the general release</a>.</p><p></p><h1 id="playing-dataviz-tennis-for-collaboration-and-fun">Playing dataviz tennis for collaboration and fun</h1><p>This idea is so fun and I'd love to try it in a data team. For content purposes, <a href="https://nightingaledvs.com/playing-dataviz-tennis-for-collaboration-and-fun/?ref=blef.fr">Georgios and Lee played dataviz tennis</a>. Every dataviz tennis match lasted 8 rounds, with 45 minutes per round, and the person who served picked the dataset. It means player 1 chooses a dataset, works on a viz for 45 minutes and then shoots the viz over to player 2, who works on it for 45 minutes, and so on. All of this in R with ggplot2.</p><p>I think this is a fun way to collaborate and we should try it in data teams for some projects. It is an alternative way to do pair programming and it can be done with data pipelines as well.</p><p></p><h1 id="ml-saturday-%F0%9F%A4%96">ML Saturday 🤖</h1><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/11/Group-2.png" class="kg-image" alt loading="lazy" width="1018" height="116" srcset="https://www.blef.fr/content/images/size/w600/2022/11/Group-2.png 600w, https://www.blef.fr/content/images/size/w1000/2022/11/Group-2.png 1000w, https://www.blef.fr/content/images/2022/11/Group-2.png 1018w" sizes="(min-width: 720px) 720px"><figcaption>How would you rate your job satisfaction in your current role?
(<a href="https://www.anaconda.com/state-of-data-science-report-2022?ref=blef.fr">credits</a>)</figcaption></figure><p>In bulk, here are a few cool articles:</p><ul><li><a href="https://medium.com/@ankurkaul_6335/anatomy-of-a-data-science-team-4547f6ed55bb?ref=blef.fr">Anatomy of a Data Science Team</a> — 9 roles that appear in a data science team. This is a bit of a caricature but it depicts well the forces at stake when creating a data science team. As a side note, I've also read the Anaconda survey about the state of data science; in the survey we can see that data engineers are slightly more dissatisfied with their jobs than data scientists (cf. picture above).</li><li><a href="https://deezer.io/how-deezer-built-the-first-emotional-ai-a2ad1ffc7294?ref=blef.fr">How Deezer built an emotional AI</a> — Deezer, a French music streaming app, shows how they adapted their music recommendation engine—Flow—to learn from their users' mood while identifying the mood in songs.</li><li><a href="https://engineering.fb.com/2022/10/31/ml-applications/instagram-notification-management-machine-learning/?ref=blef.fr">Improving Instagram notification management with machine learning and causal inference</a> — All this science just to send noisy notifications to make you addicted /s.</li><li><a href="https://towardsdatascience.com/fooled-by-statistical-significance-7fed1bc2caf9?ref=blef.fr">Fooled by statistical significance</a> &amp; <a href="https://medium.com/@helloheld/how-der-spiegel-uses-machine-learning-to-identify-its-most-valuable-potential-subscribers-8af4007d3a66?ref=blef.fr">How Der Spiegel uses Machine Learning</a>.</li></ul><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li><a href="https://www.deepchannel.com/posts/bringing-autocomplete-to-analytics-engineers?ref=blef.fr">Bringing autocomplete to Analytics Engineers</a> — Analysts are probably the least equipped team when it comes to productivity.
Navigating through thousands of lines of SQL or developing SQL in a web browser is not much fun. Today Deep Channel proposes a workspace to solve this hurdle by adding autocomplete and faster error detection—it integrates mainly with dbt. To be honest, the issue is larger than this and will not be solved by 1 tool, but still the idea is great.<br><em>PS: on this matter you can still try my </em><a href="https://github.com/Bl3f/dbt-helper?ref=blef.fr"><em>dbt-helper extension</em></a><em> to extend your BigQuery console.</em></li><li><a href="https://medium.com/yousign-engineering-product/snowflake-rbac-implementation-with-permifrost-3d30652825ad?ref=blef.fr">Snowflake RBAC Implementation with Permifrost</a> — Managing Snowflake rights can be a huge pain in the ass. You can do it with custom code or Terraform. Here the Yousign team details how they did it in YAML with Permifrost. They manage databases, warehouses, users and roles with configuration. The article also shows what the standard is when it comes to Snowflake rights at enterprise level.</li><li><a href="https://medium.com/geekculture/how-to-make-data-documentation-sexy-c0ef0d696f78?ref=blef.fr">How to make data documentation sexy</a> — Madison has written a lot of content when it comes to documenting data knowledge. This time she proposes rules to apply when writing documentation.</li><li><a href="https://www.dataduel.co/what-good-data-self-serve-looks-like/?ref=blef.fr">What good data self-serve looks like</a> — For years every data team has wanted to provide self-service to stakeholders in order to reach heaven. In heaven stakeholders write SQL, are autonomous, and the data team concentrates on delivering value and analyses justifying high salary costs. But this is only in heaven; most self-service is badly achieved, with data teams being support teams trying to create trust in the data.
In the article, Nate tries to define good self-service and the key levers to act on.</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/11/image-3.png" class="kg-image" alt loading="lazy" width="900" height="600" srcset="https://www.blef.fr/content/images/size/w600/2022/11/image-3.png 600w, https://www.blef.fr/content/images/2022/11/image-3.png 900w" sizes="(min-width: 720px) 720px"><figcaption>Good self-service (<a href="https://unsplash.com/photos/a-8LxEvb4kU?ref=blef.fr">credits</a>)</figcaption></figure><p></p><h1 id="data-fundraising-%F0%9F%92%B0">Data Fundraising 💰</h1><ul><li><strong><a href="https://dataloop.ai/blog/dataloop-raises-33-million-to-help-companies-build-data-engines-for-ai/?ref=blef.fr">Dataloop</a></strong><a href="https://dataloop.ai/blog/dataloop-raises-33-million-to-help-companies-build-data-engines-for-ai/?ref=blef.fr"> raises $33m Series B</a>. Dataloop is mainly a data labelling platform that focuses on quality. They propose an end-to-end platform to do everything about AI.</li><li><a href="https://www.alation.com/press-releases/alation-raises-series-e-funding/?ref=blef.fr"><strong>Alation</strong> raises $123m Series E</a>. The company, founded in 2012, raised another round to push forward their data catalog solution to enterprises. I don't have a lot to say except that they have too many products for my brain to understand what they really sell.</li><li><a href="https://techcrunch.com/2022/11/01/mlops-platform-galileo-lands-18m-to-launch-a-free-service/?ref=blef.fr"><strong>Galileo</strong> gets $18m Series A</a>. Galileo's package integrates with your Python machine learning stack to add debugging and tracking to your work.</li></ul><hr><p>See you next week ❤️ — PS<em>: below should appear a survey about how you like the newsletter, please tell me what you think</em>. </p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Coalesce 2022 ]]></title>
                    <description><![CDATA[ dbt Coalesce 2022 main takeaways from my perspective. A bit of Python, semantic layer and a lot of analytics engineering and data teams impact. ]]></description>
                    <link><![CDATA[ /dbt-coalesce-takeaways-2022/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 635ba33fd5024c004dd022fc ]]></guid>
                    <pubDate><![CDATA[ 2022-10-29 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/10/image-10.png" class="kg-image" alt loading="lazy" width="900" height="600" srcset="https://www.blef.fr/content/images/size/w600/2022/10/image-10.png 600w, https://www.blef.fr/content/images/2022/10/image-10.png 900w" sizes="(min-width: 720px) 720px"><figcaption>Me right now (<a href="https://unsplash.com/photos/dOnEFhQ7ojs?ref=blef.fr">credits</a>)</figcaption></figure><p>Hey dear members. I have to confess I'm lazy. Every week I want to create content, to work on a new article or video. The more ideas I have, the more I procrastinate. Every week Friday appears and I'm still here, late with the newsletter. For years I was convinced I could change this, but let's face the truth: I'm 30 now, it will never change. </p><p>Still, while procrastinating this week I decided to watch all the replays—around 120—from the dbt annual conference. This newsletter will give you <strong>my</strong> Coalesce 2022 takeaways.</p><div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">🔭</div><div class="kg-callout-text">Don't forget that this selection of talks represents my reading of the conference. It also represents my views and my understanding.
You can disagree with what I said, and if you feel I'm deeply wrong on something, once again hit reply or comment.<br><br>I have also added a ❤️ to my 3 favourite talks.</div></div><h1 id="coalesce-2022">Coalesce 2022</h1><p>The conference agenda was divided, according to me, into 5 categories similar to last year's:</p><ul><li>dbt future — which direction the data field is going with dbt at the center</li><li>Analytics engineering</li><li>HR — Grow your data career and fix your data team</li><li>Diversity — talks about how we can be more open in the data field</li><li>Partners — dbt's booming, everyone wants to be in</li></ul><!--kg-card-begin: html--><p style="text-align:center;"><a href="https://www.youtube.com/watch?v=W5hApW2GVqY&list=PL0QYlrC86xQlj9UDGiEwhXQuSjuSyPJHl&index=1&ref=blef.fr" style="display: inline-block; cursor: pointer; background-color: #E4E6E1; padding: 10px 20px; margin: 30px auto; border-radius: 5px; text-decoration: none;">📺 Watch the dbt YouTube playlist</a></p><!--kg-card-end: html--><h2 id="dbt-future">dbt future</h2><p>Obviously Coalesce has been the theatre for dbt Labs to announce new stuff. Nothing revolutionary or surprising, because it was already discussed or announced before the conference. During the 5 days, dbt Labs talks focused on 3 main topics: Python, the Semantic Layer and Community. In the modern data stack the warehouse is king, at the center, and dbt sits on top of it. In this privileged position dbt usage is growing. </p><p>Being at the center of a community of users and partners means a lot. You foster a variety of usages while your growth attracts a lot of partners in search of integrations with you. This is what dbt Labs has to juggle with.
As a personal opinion, I think too many tools were just demoing their product without any added value; still, this is not a big issue as I can skip them.</p><p>Technically speaking, being at the center of the data stack also leads to the next step for dbt: <a href="https://youtu.be/sEeJJ7qD9wA?list=PL0QYlrC86xQlj9UDGiEwhXQuSjuSyPJHl&t=1886&ref=blef.fr">the Semantic Layer</a>. This layer has been designed to be the all-in-one interface for all the tools in need of data. dbt Labs will open-source a new project called the dbt Server—not yet released—that will put an HTTP API on top of dbt Core to do dbt operations. In addition, dbt Cloud will offer a proprietary Metadata API and a Cloud proxy. The Cloud proxy will be able to translate YAML metrics definitions to SQL. As I already said, it feels a bit like their best attempt to generate revenue.</p><p>If I'm being sarcastic and defensive, I don't see it as a good sign that dbt wants to be my new data connector on top of my warehouse, adding a layer of complexity to my infrastructure.</p><p>Lastly the Python support, while being fairly simple, impressed me. In the form of a Jeremy vs. Cody duel, the dbt team demoed <a href="https://www.youtube.com/watch?v=rVprdyxcGUo&list=PL0QYlrC86xQlj9UDGiEwhXQuSjuSyPJHl&index=18&ref=blef.fr">what you can and can't do with Python models</a>. Pitting Python models against SQL models, we've seen usage of pandas describe and pivot, fuzzy matching and sklearn.</p><p>On a side note, the dbt team also presented their <a href="https://youtu.be/Qg9JA-SUdmo?list=PL0QYlrC86xQlj9UDGiEwhXQuSjuSyPJHl&t=2333&ref=blef.fr">focus for 2023 and 2024</a>, outlined by their user research. As Tristan said, dbt wants to be the <a href="https://youtu.be/sEeJJ7qD9wA?list=PL0QYlrC86xQlj9UDGiEwhXQuSjuSyPJHl&t=242&ref=blef.fr">open standard to create and disseminate knowledge</a>.
So 2023 will mean: better lineage support for data science, standardization around metrics and the semantic layer, and enriched dbt DAG capabilities to add more context to it—whatever that means, <a href="https://roundup.getdbt.com/p/bundled-or-unbundled-data-stack?ref=blef.fr">re-bundling</a> is coming.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/10/image-12.png" class="kg-image" alt loading="lazy" width="2000" height="1333" srcset="https://www.blef.fr/content/images/size/w600/2022/10/image-12.png 600w, https://www.blef.fr/content/images/size/w1000/2022/10/image-12.png 1000w, https://www.blef.fr/content/images/size/w1600/2022/10/image-12.png 1600w, https://www.blef.fr/content/images/size/w2400/2022/10/image-12.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>analytics engineers facing the future (<a href="https://unsplash.com/photos/XRcEsQKTWGk?ref=blef.fr">credits</a>)</figcaption></figure><p></p><h2 id="analytics-engineering">Analytics engineering</h2><p>2022 is probably the year analytics engineering got popularized. While the true frontiers of the role are still unclear, everyone knows that "dbt developers" are analytics engineers. But it goes deeper than this. It implies a mix of business understanding with technical expertise over SQL engines and data modeling.</p><p>At Coalesce we've seen that analytics engineering has a wide range of applications. But in the end <a href="https://www.youtube.com/watch?v=sDueM6pQNZI&list=PL0QYlrC86xQlj9UDGiEwhXQuSjuSyPJHl&index=46&ref=blef.fr">you don't build models, you construct knowledge</a>, and this knowledge is essential to <a href="https://www.youtube.com/watch?v=itcp28mup3c&list=PL0QYlrC86xQlj9UDGiEwhXQuSjuSyPJHl&index=44&ref=blef.fr">find the common ground</a> between the company verticals.
Even if AE is still new, it relies on <a href="https://www.youtube.com/watch?v=-yQa_DxEqaQ&list=PL0QYlrC86xQlj9UDGiEwhXQuSjuSyPJHl&index=63&ref=blef.fr">old principles</a> like <a href="https://www.youtube.com/watch?v=CPTao9jxLyg&list=PL0QYlrC86xQlj9UDGiEwhXQuSjuSyPJHl&index=13&ref=blef.fr">Kimball modeling, but is it still relevant?</a> <em>Spoiler</em>: yes, even if it's no longer applied like before for performance reasons, Kimball brings understandability.</p><p>Under the analytics engineering umbrella I really liked 3 presentations that I would recommend to anyone in analytics; while approaching technical concepts in an accessible way, they bring good food for thought to improve any dbt project:</p><ul><li><a href="https://www.youtube.com/watch?v=eAeIFFY5818&list=PL0QYlrC86xQlj9UDGiEwhXQuSjuSyPJHl&index=80&ref=blef.fr">Outgrowing a single `dbt run`</a> — at scale, schedule-based orchestration can fail: a CRON that runs dbt will lead to issues, so you need a smarter orchestration pattern. This is where reactive/proactive scheduling enters the room. In the Airflow world it means you have to use sensors to trigger runs. Prratek also recommends running staging models each time a source is refreshed, then running the marts once all the staging models have run. I think this is a good pattern.</li><li>❤️ <a href="https://www.youtube.com/watch?v=hxvVhmhWRJA&list=PL0QYlrC86xQlj9UDGiEwhXQuSjuSyPJHl&index=19&ref=blef.fr">Testing: Our assertions vs. reality</a> — Probably the best talk of the conference to me. Mariah shows how dbt's native design falls short when it comes to testing: dbt tests mix code quality and data quality, which are 2 different pieces of the testing framework.
She also greatly illustrates the difference between assumptions and assertions when it comes to data.</li><li><a href="https://www.youtube.com/watch?v=L97ao-GmBLA&list=PL0QYlrC86xQlj9UDGiEwhXQuSjuSyPJHl&index=37&ref=blef.fr">Efficient is the new sexy - A minimalist approach to growth</a> — Matthieu proposes a framework to handle team growth while tackling engineering problems. He also tackles issues like modularity (linked to mesh concepts) and testing, from another angle than the previous talk. </li></ul><p>Lastly, data contract concepts were on fire in the data community. This time Jake and Emily provide us with a practical example using <a href="https://www.youtube.com/watch?v=s6iy0hqjcLk&list=PL0QYlrC86xQlj9UDGiEwhXQuSjuSyPJHl&index=96&ref=blef.fr">jsonschema to define an interface between product and data teams</a>.</p><p></p><h2 id="grow-as-an-individual-and-fix-your-data-team">Grow as an individual and fix your data team</h2><p>A lot of talks this year tried to answer a simple question: how can a data team have an impact? This is obviously related to the fact that data teams around the world cost a lot while leaders are still struggling to find the Return on Investment (ROI).</p><p>In this introspective search for what a data team is, the picture seems to be the same for everyone. Cultural challenges are the main blockers for massive data adoption. 5 talks tried to propose something to help adoption:</p><ul><li><a href="https://www.youtube.com/watch?v=Fc6yy8nPdA4&list=PL0QYlrC86xQlj9UDGiEwhXQuSjuSyPJHl&index=26&ref=blef.fr">Know your worth: Unpacking business value delivered by data teams</a> — A framework to build knowledge to exploit data for stakeholders</li><li><a href="https://www.youtube.com/watch?v=i6mb_fkkfB8&list=PL0QYlrC86xQlj9UDGiEwhXQuSjuSyPJHl&index=4&ref=blef.fr">Data teams v. The recession</a> — <em>How to win the ROI battle</em>.
You have to act on at least 3 levers: core business reporting, avoiding people-pleasing and driving decisions that affect revenue. Chetan illustrates with Airbnb and Webflow examples.</li><li><a href="https://www.youtube.com/watch?v=VMlrT4wXTgg&list=PL0QYlrC86xQlj9UDGiEwhXQuSjuSyPJHl&index=24&ref=blef.fr">How to build data accessibility for everyone</a> — use the JTBD framework to know your data users and achieve self-service.</li><li><a href="https://www.youtube.com/watch?v=q1sIRhrFoeY&list=PL0QYlrC86xQlj9UDGiEwhXQuSjuSyPJHl&index=58&ref=blef.fr">Money, Python, and the Holy Grail: Designing Operational Data Models</a> — We need to simplify data models: a simple model of the business means that you've understood what's going on. Data teams should not be a consulting team that answers every question. <strong>The data team creates a simple, understandable view for everyone</strong>.</li><li>❤️ <a href="https://www.youtube.com/watch?v=mH4TK7XucPw&list=PL0QYlrC86xQlj9UDGiEwhXQuSjuSyPJHl&index=23&ref=blef.fr">Operations vs. product: The data definition showdown</a> — Every operational team is different and data should be the glue between stakeholders, even if it's hard. Words have different meanings across teams. Data alignment is a people and language problem, not a technical one.</li></ul><p>Being in an analytics team can be difficult because you're in the middle of everything without the power to make decisions. That's why data teams have to be empathetic. Empathy means "the action of understanding" (cf. 
<a href="https://www.youtube.com/watch?v=_2uLZFTR0DY&list=PL0QYlrC86xQlj9UDGiEwhXQuSjuSyPJHl&index=28&ref=blef.fr">Empathy-building in data work</a> and <a href="https://www.youtube.com/watch?v=hIt_FtgxzDY&list=PL0QYlrC86xQlj9UDGiEwhXQuSjuSyPJHl&index=68&ref=blef.fr">How insensitive: Increasing analytics capacity through empathy</a>).</p><h3 id="purple-people">Purple people</h3><p>The dbt blog mentioned the <a href="https://www.getdbt.com/blog/we-the-purple-people/?ref=blef.fr">purple people</a> concept last year. Purple people are those generalists who act as the glue between the business and the data stack. But being a generalist is often a solo job. You navigate between specialist worlds and you help these expert communities communicate with each other. This is what Stephen greatly depicted in ❤️ <a href="https://www.youtube.com/watch?v=wB0ulHmvU7E&list=PL0QYlrC86xQlj9UDGiEwhXQuSjuSyPJHl&index=62&ref=blef.fr">Excel at nothing: How to be an effective generalist</a>.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/10/image-11.png" class="kg-image" alt loading="lazy" width="2000" height="1333" srcset="https://www.blef.fr/content/images/size/w600/2022/10/image-11.png 600w, https://www.blef.fr/content/images/size/w1000/2022/10/image-11.png 1000w, https://www.blef.fr/content/images/size/w1600/2022/10/image-11.png 1600w, https://www.blef.fr/content/images/size/w2400/2022/10/image-11.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>(<a href="https://unsplash.com/photos/0aMMMUjiiEQ?ref=blef.fr">credits</a>)</figcaption></figure><p></p><h2 id="%F0%9F%AB%B6-cool-stuff">🫶 Cool stuff</h2><p>There were also open formats. This creativity shows how great the data community is. 
Tiankai sang a <a href="https://www.youtube.com/watch?v=Eu_Yb3BDPNo&list=PL0QYlrC86xQlj9UDGiEwhXQuSjuSyPJHl&index=51&ref=blef.fr">data jam 🎵</a>, <a href="https://www.youtube.com/watch?v=4U1LM2qYoZ4&list=PL0QYlrC86xQlj9UDGiEwhXQuSjuSyPJHl&index=77&ref=blef.fr">competitors battled to answer business questions</a> as fast as possible and <a href="https://www.youtube.com/watch?v=gzr4CbeVY5s&list=PL0QYlrC86xQlj9UDGiEwhXQuSjuSyPJHl&index=41&ref=blef.fr">Joe developed a Unity SQL game</a>.</p><p>Final shout-out to Mehdio, who did <a href="https://www.youtube.com/watch?v=UIcAfFag9E4&list=PLtIcsFR-XFuIx4lgOpdRVillrsHqtFKPY&ref=blef.fr">video interviews and highlights</a> of the conference as he was there in person.</p><p>The last thing I discovered is the <a href="https://github.com/dbt-labs/dbt-project-evaluator?ref=blef.fr">dbt-project-evaluator</a> package, which seems amazing for creating CI/CD rules to detect, for instance, direct joins to sources or gaps in documentation coverage.</p><p></p><!--kg-card-begin: html--><p style="text-align:center;"><a data-portal="signup" style="cursor: pointer; background-color: #E4E6E1; padding: 10px 20px; border-radius: 5px;">📬 Subscribe to my weekly newsletter 📬</a></p><p style="text-align: center; font-size:.7em; font-style: italic;">(to get data curation each week in your inbox, saving you 5 hours of tech watch)</p><!--kg-card-end: html--><p></p><hr><div class="kg-card kg-callout-card kg-callout-card-grey"><div class="kg-callout-emoji">🗒️</div><div class="kg-callout-text">You can also read my <a href="https://www.blef.fr/coalesce-2022-raw-notes/">raw notes about Coalesce</a>. 
This is for members only, as the format is quite rough I think.</div></div><hr><p>PS: I already did this last year for <a href="https://www.blef.fr/dbt-coalesce-takeaways/">Coalesce 2021</a> if you wanna check it out.</p><p>PS2: sorry for the length of this edition and for the delay; I hope the reading was enjoyable, I'm not really proud of my writing here.</p><p>See you next week.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ dbt Coalesce 2022 — my raw notes ]]></title>
                    <description><![CDATA[ This is my raw notes about the Coalesce conference. Sorry about the format, I publish it as it might be interesting for you. ]]></description>
                    <link><![CDATA[ /coalesce-2022-raw-notes/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 63525fa756210e004d4893ba ]]></guid>
                    <pubDate><![CDATA[ 2022-10-29 ]]></pubDate>
                    <content>
                        <![CDATA[ 
<aside class="gh-post-upgrade-cta">
    <div class="gh-post-upgrade-cta-content" style="background-color: #373f48">
                <h2>This post is for subscribers only</h2>
            <a class="gh-btn" data-portal="signup" href="#/portal/signup" style="color:#373f48">Subscribe now</a>
            <p><small>Already have an account? <a data-portal="signin" href="#/portal/signin">Sign in</a></small></p>
    </div>
</aside>
 ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 22.42 ]]></title>
                    <description><![CDATA[ Data News #22.42 — Women in data part 2, dbt Coalesce first glimpse, data contracts, fundraising and fast news. ]]></description>
                    <link><![CDATA[ /data-news-week-22-42/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 635254e356210e004d489320 ]]></guid>
                    <pubDate><![CDATA[ 2022-10-21 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/10/image-7.png" class="kg-image" alt loading="lazy" width="900" height="563" srcset="https://www.blef.fr/content/images/size/w600/2022/10/image-7.png 600w, https://www.blef.fr/content/images/2022/10/image-7.png 900w" sizes="(min-width: 720px) 720px"><figcaption>Navigating through the numbers (<a href="https://unsplash.com/photos/kI0qIeN4hLw?ref=blef.fr">credits</a>)</figcaption></figure><p>Hello, a lot of content was published this week with Coalesce, on top of the many articles I had kept from last week, so I had to navigate through all of that to produce this edition. I'm not that proud of the format but it's OK. </p><p>As a side note, I'm gonna do the <a href="https://lab.blef.fr/map-challenge-2022/?ref=blef.fr">30-day map challenge</a> in November. So if you're doing it, or if you want to, say hi.</p><h1 id="women-in-data-%E2%80%94-part-2-%F0%9F%91%A9%E2%80%8D%F0%9F%92%BB">Women in Data — part 2 👩‍💻</h1><p>Second part of the summary of the Women in Data meetup we organized 2 weeks ago. In the second round table the discussions were about parity in the data ecosystem.</p><p><strong>What can we collectively do to achieve parity in data ecosystems? 💪</strong></p><p>Several answers and ideas were proposed by the speakers. Let's dive in topic by topic.</p><ul><li><strong>Culture</strong>. The enterprise culture plays a big part in parity topics. Every manager should be trained and encouraged to address equality topics. Also, every incorrect behaviour should be mentioned and addressed—still, there was some debate on whether it should be addressed with humour or firmness. Gabrielle also described an internal collective she presides over to help women find their place. 
Along with their mission they identified 5 important points for these collectives to work: define a clear vision, find a sponsor, understand issues through interviews, plan actions that integrate with what already exists in the company, then develop content to infuse the culture.</li><li>Also on the culture topic—yes, I'm moving to another bullet because the first one is too big—there are also initiatives at Deezer to help women, by providing material or days off during periods. Last but not least, everything related to the words we use. We should use inclusive writing—in French this is more prevalent than in English. For instance <a href="https://heyguys.cc/?ref=blef.fr">"hey guys" should be banned</a>.</li><li><strong>Hiring</strong>. Everyone says it's hard to find women in the data field. This is probably true. But if you don't force yourself to actively search for diversity, it'll never change. So one solution is to set a ratio when searching for people: for instance you can ask your hiring agency to propose at least one woman per 3 candidates, and if they don't you won't look at the profiles, no matter what. Then you have to care about the whole hiring funnel.</li><li>Other hiring issues were discussed: the salary gap depending on gender, and the fact that studies have shown that women tend to apply less if they don't tick all the requirements.</li><li><strong>What else to change</strong>. All these differences can be fixed at a local level in the company, but this is something that needs deeper change in society. At the meetup speakers shared with us initiatives to promote tech/data jobs at school, for instance. The idea is to show role models to inspire younger generations. The tech industry is not a men's world.</li></ul><p>That's all for this Women in Data meetup. I hope I've transcribed the discussion with the right words and intention. I might have misinterpreted some chats and if so, I'm sorry. 
</p><p>My last point on this topic: let's not forget we're talking about diversity, so this is not only about men and women; there is more to being diverse and inclusive.</p><h1 id="dbt-coalesce-2022">dbt Coalesce 2022</h1><p>dbt Coalesce took place this week; it's the annual 4-day conference organised by dbt Labs. While all the data influencers were there to meet and chat about the next trends of the analytics industry, a few announcements were made.</p><p>dbt Labs took the time to announce the <a href="https://www.getdbt.com/blog/frontiers-of-the-dbt-semantic-layer/?ref=blef.fr">Semantic Layer</a>—what others call the metrics layer, or the feature store in the data science space. We'll see a lot of buzz around this unique layer to access metrics in 2023. dbt Labs will push this architecture forward, in search of revenue growth. They will add it as a product in their cloud offering—with a SQL proxy and a Metadata API.</p><p>If you want to see how the semantic layer can be used, <a href="https://hex.tech/blog/dbt-semantic-layer-integration/?ref=blef.fr">Hex demoed it</a>. You can also see this semantic rise from the BI perspective with <a href="https://www.lightdash.com/blogpost/bi-for-the-semantic-layer?ref=blef.fr">Semantic BI</a>. 
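</p><p>To make the semantic layer idea concrete: it compiles a declarative metric definition into SQL on the fly, so every tool queries the same definition. A toy sketch of that translation (the metric format here is invented for illustration; it is not dbt's actual spec):</p>

```python
def metric_to_sql(metric: dict) -> str:
    """Compile a declarative metric definition into a SQL query.
    The `metric` schema is a made-up, simplified stand-in."""
    expr = f"{metric['aggregation']}({metric['column']})"
    dims = ", ".join(metric.get("dimensions", []))
    select = f"{dims}, {expr}" if dims else expr
    sql = f"select {select} from {metric['table']}"
    if dims:
        sql += f" group by {dims}"
    return sql

revenue = {"table": "fct_orders", "column": "amount",
           "aggregation": "sum", "dimensions": ["order_date"]}
metric_to_sql(revenue)
# select order_date, sum(amount) from fct_orders group by order_date
```

<p>A proxy sitting between the BI tool and the warehouse does essentially this, with far more care about joins, time grains and SQL dialects.</p><p>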
In this new world everyone wants to see the issues from their own perspective, which is annoying for users but fun as an outsider 🙃.</p><p>I'll dedicate a full post to my highlights of the conference early next week, after watching all the replays.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/10/image-9.png" class="kg-image" alt loading="lazy" width="900" height="547" srcset="https://www.blef.fr/content/images/size/w600/2022/10/image-9.png 600w, https://www.blef.fr/content/images/2022/10/image-9.png 900w" sizes="(min-width: 720px) 720px"><figcaption>The metrics layer (<a href="https://unsplash.com/photos/ngZ4V-myG5s?ref=blef.fr">credits</a>)</figcaption></figure><p></p><h1 id="data-contracts-%F0%9F%91%BB">Data contracts 👻</h1><p>Even if I try not to fall for the hype in order to give a higher-level view on trends, when I see data contracts everywhere I still have to share it. In a nutshell, data contracts are contractualized interfaces between data producers and consumers. The most common pattern seems to be an API—http, file, event, table, etc.—between software engineers and the data team, with a way to enforce the contract. <strong>We have called this a schema for ages</strong>.</p><p>I've been convinced for a long time that data contracts are not a data problem but an IT problem. If the whole tech team is not aligned on the way data changes should be managed, you'll fix only a small part of the problem. Petr wrote a great piece about the way we <a href="https://petrjanda.substack.com/p/the-art-of-drawing-lines?ref=blef.fr">draw lines</a>. What belongs where?</p><blockquote>Data contracts aligned around business areas (domains) rather than technology layers. Contracts are technology-agnostic and can live anywhere inside the Data Platform.</blockquote><p>Andrew and Daniel each wrote up their own way of seeing data contract implementation. 
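</p><p>Whatever the implementation, the mechanics of a contract are simple: the producer publishes a machine-checkable schema, and records are validated against it before ingestion. A pure-Python sketch of the idea (real implementations typically lean on jsonschema or similar; this simplified stand-in only checks field presence and types):</p>

```python
# A hypothetical contract for an `orders` event, agreed between the
# producing service and the data team.
CONTRACT = {"order_id": int, "amount": float, "currency": str}

def violations(record: dict, contract: dict = CONTRACT) -> list:
    """Return the list of contract violations; empty means the record passes."""
    errors = []
    for field, expected in contract.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            errors.append(f"{field}: expected {expected.__name__}, "
                          f"got {type(record[field]).__name__}")
    return errors

violations({"order_id": 1, "amount": 9.99, "currency": "EUR"})  # []
```

<p>The hard part is not the check itself, it's the organizational agreement on who owns the schema and what happens when it breaks.</p><p>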
Andrew at <a href="https://medium.com/gocardless-tech/implementing-data-contracts-at-gocardless-3b5c49074d13?ref=blef.fr">GoCardless</a> and <a href="https://medium.com/@danthelion/implementing-data-contracts-82800b9186b?ref=blef.fr">Daniel by himself</a>.</p><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li><a href="https://www.sumup.com/careers/apache-kafka-blog/?ref=blef.fr">Apache Kafka SSL Security</a> — A simple explanation of how SSL handshakes work and why you should add SSL to your Kafka cluster.</li><li><a href="https://www.music-tomorrow.com/blog/towards-recommender-system-optimization-how-can-artists-influence-recommendation-algorithms?ref=blef.fr">How Can Artists Influence Recommendation Algorithms?</a> — Second part of the MusicTomorrow series about their tool that helps music artists become more viral on music platforms.</li><li><a href="https://medium.com/starschema-blog/ingestions-in-dbt-how-to-load-data-from-rest-apis-with-snowpark-a2c27c4b5315?ref=blef.fr">Load Github API data with Python model in dbt</a> — A new way to see data ingestion. In this article the author gets GitHub data with a dbt Python model running in Snowpark, demoing an extract-load orchestrated directly in your dbt project. This is a good example, though I'm not sure it should be reproduced at scale.</li><li><a href="https://medium.com/geekculture/druid-an-introduction-441c4af03107?ref=blef.fr">Is Druid still a thing?</a> — Druid is a distributed OLAP database that can be used for real-time workloads. In the past the main issue with Druid was the lack of SQL, but that has <a href="https://druid.apache.org/docs/latest/querying/sql.html?ref=blef.fr">changed</a>. This post is an introduction to the Druid architecture.</li><li><a href="https://medium.com/airbnb-engineering/mussel-airbnbs-key-value-store-for-derived-data-406b9fa1b296?ref=blef.fr">Airbnb’s key-value store for derived data</a> — Giants can't stop inventing new databases to solve problems at their scale. 
This time Airbnb created Mussel, a combination of other OSS components, to get a scalable key-value store.</li><li><a href="https://medium.com/data-engineer-things/data-engineering-excellency-at-netflix-7c12af609159?ref=blef.fr">Data Engineering Excellency at Netflix</a> — How Netflix empowers its data engineering team to reach excellence. They even compare data engineers to X-Men: they all have different superpowers to work on different villains. For instance to work on <a href="https://netflixtechblog.com/orchestrating-data-ml-workflows-at-scale-with-netflix-maestro-aaa2b41b800c?ref=blef.fr">Maestro, the data/ml orchestrator</a>.</li><li><a href="https://www.sicara.fr/blog-technique/end-to-end-data-pipeline-tests-on-databricks?ref=blef.fr">End-to-end data pipeline tests on Databricks</a> — I like all the testing topics, even in Spark (😬). Sicara detailed here how they did it for data quality and unit tests.</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/10/image-8.png" class="kg-image" alt loading="lazy" width="2000" height="1335" srcset="https://www.blef.fr/content/images/size/w600/2022/10/image-8.png 600w, https://www.blef.fr/content/images/size/w1000/2022/10/image-8.png 1000w, https://www.blef.fr/content/images/size/w1600/2022/10/image-8.png 1600w, https://www.blef.fr/content/images/size/w2400/2022/10/image-8.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>Data engineers are superheros (<a href="https://unsplash.com/photos/qJDkJRTedNw?ref=blef.fr">credits</a>)</figcaption></figure><p></p><h1 id="data-fundraising-%F0%9F%92%B0">Data Fundraising 💰</h1><p>This week a few data satellite companies raised money. 
When I say satellite I mean companies that are not really part of the data field, but that put data at the centre of their product.</p><ul><li><strong><a href="https://www.risingwave-labs.com/?ref=blef.fr">RisingWave</a></strong> <a href="https://www.risingwave-labs.com/blog/risingWave-labs-raises-36M-in-series-a-funding/?ref=blef.fr">raised a $36m Series A</a>. A cloud-native streaming database that uses SQL. You can either deploy your own Docker instance or use their new cloud offering. It works with materialized views that are refreshed in real-time on top of tables connected to real-time sources like Kafka, Redpanda or CDC.</li><li><strong>Tellius</strong> <a href="https://www.tellius.com/tellius-announces-series-b-funding/?ref=blef.fr">raised $16m in Series B</a>. Tellius offers an augmented analytics platform: a one-stop platform with insights discovery that does anomaly detection on your metrics.</li><li><strong>Keebo</strong> <a href="https://www.prnewswire.com/news-releases/keebo-raises-15-million-launches-automated-warehouse-optimization-to-reduce-cloud-data-warehousing-costs-301654278.html?ref=blef.fr">got $15m in a Series A</a>. Keebo provides a way to lower your warehouse costs by rewriting your SQL queries on the fly. With their solution, rather than connecting to Snowflake you connect to Keebo and let them do the magical optimisation. Even if I like the promise, I don't think it's a good idea to rely on a third party for optimisations. You're better off teaching people to write performant queries, with CI/CD checks for instance.</li><li>The "security" space got some traction this week with 3 companies raising money. <strong>Anonos</strong> <a href="https://www.anonos.com/anonos-secures-50-million-in-ip-backed-financing?ref=blef.fr">raised $50m in debt</a> and provides a compliant pseudonymization engine. 
<strong>OutThink</strong> <a href="https://www.securityweek.com/outthink-raises-10-million-human-risk-management-platform?ref=blef.fr">raised a $10m seed</a> to automatically tackle data breaches by highlighting company risks. <strong>Velotix</strong> raised a $10m seed to automate data access across the complete platform.</li></ul><hr><p>See you next week ❤️.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 22.41 ]]></title>
                    <description><![CDATA[ Data News #22.41 — Women in Data, develop your leadership, how Google fails, few fast news and fundraising. ]]></description>
                    <link><![CDATA[ /data-news-week-22-41/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 634413a1906c01003d99ae71 ]]></guid>
                    <pubDate><![CDATA[ 2022-10-14 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/10/image-3.png" class="kg-image" alt loading="lazy" width="2000" height="1331" srcset="https://www.blef.fr/content/images/size/w600/2022/10/image-3.png 600w, https://www.blef.fr/content/images/size/w1000/2022/10/image-3.png 1000w, https://www.blef.fr/content/images/size/w1600/2022/10/image-3.png 1600w, https://www.blef.fr/content/images/size/w2400/2022/10/image-3.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>Me in a few years after sending my weekly newsletter (<a href="https://unsplash.com/photos/d83kZqsNS7k?ref=blef.fr">credits</a>)</figcaption></figure><p>Dear members, it's already fall here, and the weather is getting chilly, but good news: your favourite newsletter is back! So put on your best sweater, make yourself a hot drink, take a seat and enjoy this week's reading 🍂☕️</p><h1 id="women-in-data-%E2%80%94-part-1-%F0%9F%91%A9%E2%80%8D%F0%9F%92%BB">Women in Data — part 1 👩‍💻</h1><p>This Tuesday I co-organized with <a href="https://twitter.com/DeezerDevs?ref=blef.fr">Deezer Devs</a>, <a href="https://www.datageneration.co/?ref=blef.fr">DataGen</a> and Modern Data Network a <em>Women in Data</em> meetup. We invited 8 inspiring women working in data to discuss their experiences during two 1-hour round tables. In the end the discussions were universal, not narrowly limited to the data field. I liked it a lot.</p><p>While I'm working on translating the whole discussion into English, because I feel it should be shared with everyone, here is a small summary of what was said during the evening. </p><p>They started first with leadership. How can women develop their leadership? During this round table they tackled 4 main topics:</p><ul><li>Behaviour. How should women behave when in a leadership position? Society often depicts women's leadership qualities as empathetic, adaptable, sensitive, but this is a stereotype. 
In contrast, men's management is seen as top-down and authoritarian. In the end everyone should find a personal leadership style; there is no one caricatural behaviour. Society has a lot to gain from encouraging diverse leadership styles—being empathetic is a great way to collect better feedback—and ultimately it is in the best interest of the company in a highly competitive job market.</li></ul><blockquote>I think we almost all heard it (as women): anyway, you are too nice, you are too emotional to manage a team or to take on responsibilities</blockquote><ul><li><a href="https://en.wikipedia.org/wiki/Impostor_syndrome?ref=blef.fr">Impostor syndrome</a>. A lot of women suffer from it, especially in a leadership position where you don't see many people like you. A lot was said about this syndrome. Whether it's questioning yourself, as a woman, when applying for or accepting a job with responsibilities when a man would never question himself; or seeing only geeks and gatekeepers in the tech industry and not recognizing yourself, not feeling in the right place; or being the only woman sharing content in front of a crowd at a meetup/conference and getting discredited afterwards.</li></ul><blockquote>Board meetings always started with "Hello gentlemen", finishing with "Questions gentlemen?". I felt that I did not have my place. I had a position with responsibilities but no-one asked for my opinion. I was invisible [...]. How do we come to a point where we are given a place somewhere but we are still made to feel that we are not legitimate to occupy it?</blockquote><ul><li>Public speaking. Regarding public appearances, tips were given as it's more practical. You can try a few tricks, like creating a character when on stage or finding allies in the audience—allies with whom you'll make eye contact to help you cope with the stage fright. People nodding are also a good help. 
It's important that every one of us helps people in difficulty when witnessing these situations.</li></ul><blockquote>OK, you don't trust yourself (...) for this presentation, but what matters is that you give the impression that you do. When you go on stage draw a line. You walk, and the moment you cross that line you are the character you want to be—it's like an acting class.</blockquote><ul><li>Relationships with managers, peers and team—who often are men in tech. A lot of the time women find they need to build tactics into their daily life to avoid awkward or possibly dangerous situations. Often it's making a joke to someone who is undermining their professional abilities simply because they are a woman, someone making a sexist joke, someone flirting.</li></ul><!--kg-card-begin: html--><p style="text-align:center;"><a href="https://www.youtube.com/watch?v=irl0s0o7EpY&ref=blef.fr" style="display: inline-block; cursor: pointer; background-color: #E4E6E1; padding: 10px 20px; margin: 30px auto; border-radius: 5px; text-decoration: none;">📺 Watch the meetup (🇫🇷)</a></p><!--kg-card-end: html--><p>Part 2 coming up next week. What can we collectively do to achieve parity in data ecosystems? 
💪</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/10/image-5.png" class="kg-image" alt loading="lazy" width="2000" height="1240" srcset="https://www.blef.fr/content/images/size/w600/2022/10/image-5.png 600w, https://www.blef.fr/content/images/size/w1000/2022/10/image-5.png 1000w, https://www.blef.fr/content/images/size/w1600/2022/10/image-5.png 1600w, https://www.blef.fr/content/images/size/w2400/2022/10/image-5.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>Robin, Elisa, Gabrielle, Fatimata, Virginie, Marion et Christelle</figcaption></figure><p>Thanks to all the amazing speakers: <a href="https://www.linkedin.com/in/ACoAAAU9cOsBcL9UhO_GOkzFwKfuYXbHkdBwzsM?ref=blef.fr">Marion B.</a>, <a href="https://www.linkedin.com/in/ACoAABLa5Y4BEEGZnZz6Q1Zp-EvMaFAT8jmVqMc?ref=blef.fr">Gabrielle Béranger</a>, <a href="https://www.linkedin.com/in/ACoAAADEYYQBHyMZO6CH7xj0pBiKYiNW6YW0bWQ?ref=blef.fr">Virginie Cornu</a>, <a href="https://www.linkedin.com/in/ACoAAAxQk44BufLnreLImNIs_eAtBkc2al7td94?ref=blef.fr">Nathalie Gémin</a>, <a href="https://www.linkedin.com/in/ACoAAA6M9_sB5VHa1dEq3xE6YHeXoGG_tnjAz2s?ref=blef.fr">Elisa GILLES</a>, <a href="https://www.linkedin.com/in/ACoAAA6XpIcByHDHU-6dLmEfTowGO0gy-h8wynA?ref=blef.fr">Christelle Marfaing</a>, <a href="https://www.linkedin.com/in/ACoAABsWaF4BrXhEGgmYx4hklhY0TQoROCEaBKI?ref=blef.fr">Arielle Marouani</a> and <a href="https://www.linkedin.com/in/ACoAABno5loBklMVlmVH-D8GN9SlvWeh8V3h5pk?ref=blef.fr">Fatimata Sall</a>. Moderated by <a href="https://www.linkedin.com/in/ACoAABOXu60BNWu22glLrpJCM_6wVqD4SszpF1Y?ref=blef.fr">Robin Conquet</a>.</p><div class="kg-card kg-callout-card kg-callout-card-yellow"><div class="kg-callout-emoji">📢</div><div class="kg-callout-text">I'd love to hear your experience on this topic. 
I also want to open the blog to guest writing on this topic, so if you're interested just hit reply; everything is welcome.</div></div><p></p><h1 id="how-google-fails-%E2%98%81%EF%B8%8F">How Google fails ☁️</h1><p>I'm finally adopting clickbait titles like other influencers. This week Google Next 22 took place. The main news for the data world was about Looker. Or should I say a non-news: <strong>Google decided to rename Data Studio to Looker Studio</strong>. The <a href="https://www.youtube.com/watch?v=Bc_hcLVyFJI&ref=blef.fr">YouTube replay</a> has been the most watched video from Next after the keynote.</p><p>This first news is simply a renaming. Besides this, they decided to create a paid version, Looker Studio Pro, that will include enterprise features with team workspaces and SLA stuff.</p><p>To be honest I'm still lost after this announcement; the Google BI catalog will now include:</p><ul><li>Looker (now Google Cloud core product)</li><li>Looker Studio</li><li>Looker Studio Pro</li><li>LookML</li><li>Dataform?</li><li>BI Engine</li></ul><p>Between the lines, Google also announced that the initial Looker product will be broken up and integrated within GCP. But to me this is not as clear as it should be. Looker Studio will also access the LookML layer. </p><p>Since I started this newsletter I've watched all the Google news around data, and although I've been a huge BigQuery fan from the first hour, I've always struggled to understand the strategy and the vision within the Google ecosystem. In the past GCP was the best solution to me because it was blazing simple. One solution for one problem. 
This vision seems very different today: while BigQuery remains the storage, there are way too many ways to move and transform data.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/10/Screenshot-2022-10-14-at-17.17.06.png" class="kg-image" alt loading="lazy" width="2000" height="1125" srcset="https://www.blef.fr/content/images/size/w600/2022/10/Screenshot-2022-10-14-at-17.17.06.png 600w, https://www.blef.fr/content/images/size/w1000/2022/10/Screenshot-2022-10-14-at-17.17.06.png 1000w, https://www.blef.fr/content/images/size/w1600/2022/10/Screenshot-2022-10-14-at-17.17.06.png 1600w, https://www.blef.fr/content/images/size/w2400/2022/10/Screenshot-2022-10-14-at-17.17.06.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>The BI Cloud</figcaption></figure><p>When you compare with how Snowflake positions itself in the market, GCP has become a complete suite of tools, but a complex one. What are your thoughts on this?</p><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><p>As I already wrote too many words I'll keep a few links for later, but here are 3 cool write-ups.</p><ul><li><a href="https://duckdb.org/2022/10/12/modern-data-stack-in-a-box.html?ref=blef.fr">Modern Data Stack in a Box with DuckDB</a> — DuckDB got a lot of traction recently, unlocking a new range of performance on a single node. This article shows conceptually how you can deploy a <em>Modern Data Stack</em> using DuckDB as storage.</li><li><a href="https://towardsdatascience.com/when-change-data-capture-wins-271875e3df1a?ref=blef.fr">When Change Data Capture Wins</a> — Sarah explains how to get started with Change Data Capture and how it can improve your integration SLAs.</li><li><a href="https://luciacerchie.hashnode.dev/introduction-to-key-apache-kafka-concepts?ref=blef.fr">Introduction to Key Apache KafkaⓇ Concepts</a> — Kafka has become over the last years an important piece of every data stack. 
This article details the main concepts you need to know about Kafka. Combine this with a CDC pattern and you've got your first real-time platform. </li></ul><h1 id="data-fundraising-%F0%9F%92%B0">Data fundraising 💰</h1><p><em>After a discussion with a reader I've decided to put the fundraising category at the end now.</em></p><ul><li><strong>Alvin</strong> <a href="https://venturebeat.com/data-infrastructure/alvin-nabs-6m-to-help-enterprises-map-data-flows-address-quality-issues/?ref=blef.fr">raised $6m in seed</a>. Alvin is a lineage-first data platform: it connects to all your sources, BI tools and operational tools to create the lineage from the logs. Then you can visualise or query the lineage data to adapt your own platform.</li><li><strong><a href="https://www.climatiq.io/?ref=blef.fr">Climatiq</a></strong> <a href="https://www.eu-startups.com/2022/10/carbon-intelligence-platform-climatiq-raises-e6-million-to-fuel-net-zero-future/?ref=blef.fr">raised €6m in seed funding</a>. What if we could put a sensor everywhere on our servers to measure the climate impact of what we do? Climatiq gives you this in real time.</li><li><strong>Homa</strong> <a href="https://www.homagames.com/blog/homa-raises-100-million-to-supercharge-game-developers-fortunes?ref=blef.fr">raised $100m Series B</a>. Homa is a tracking analytics SDK for game creators. I find it interesting to put it here because game analytics is something we often forget, as it doesn't rely on accepted cookies, but it is still very present.</li></ul><hr><p>See you next week, enjoy your weekend ❤️.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 22.40 ]]></title>
                    <description><![CDATA[ Data News #22.40 — Fundraises of Lightdash and Flink Cloud offering, ClickHouse Cloud launch, data engineering migrations projects and more. ]]></description>
                    <link><![CDATA[ /data-news-week-22-40/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 633fdddf317a46003de9965d ]]></guid>
                    <pubDate><![CDATA[ 2022-10-07 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/10/image.png" class="kg-image" alt loading="lazy" width="900" height="623" srcset="https://www.blef.fr/content/images/size/w600/2022/10/image.png 600w, https://www.blef.fr/content/images/2022/10/image.png 900w" sizes="(min-width: 720px) 720px"><figcaption>It's already sunset (<a href="https://unsplash.com/photos/MfNHW8vlbLs?ref=blef.fr">credits</a>)</figcaption></figure><p>Dear members. Once again a late Friday edition. I was travelling this week and I slept too much. But no more excuses, below is your Data News edition.</p><h1 id="data-fundraising-%F0%9F%92%B0">Data fundraising 💰</h1><ul><li><strong>Lightdash</strong> is finally launching their commercial product. <a href="https://www.lightdash.com/blogpost/lightdash-raises-seed-round?ref=blef.fr">They raised $8.4m in funding</a> (pre-seed + seed). Lightdash is a dbt-based BI tool. It leverages metrics and dimensions defined within dbt to provide an explore UI where you can create visualisations to answer questions, and later add these to dashboards. In my opinion Lightdash is conceptually very similar to Metabase.</li><li><strong>Immerok</strong> <a href="https://www.immerok.io/blog/immerok-cloud-early-access?ref=blef.fr">raised $17m seed round</a> to launch a serverless service for Apache Flink. The promise is to make real time mainstream by providing a no-operations platform while keeping all Flink APIs.</li><li><a href="https://clickhouse.com/cloud?ref=blef.fr"><strong>ClickHouse</strong> Cloud</a> launched, one year after their $250m Series B. ClickHouse is a real-time OLAP database developed within Yandex. The database's promise is to reunite the warehouse-first approach with real-time performance. The Cloud (only AWS for now) will charge you for storage, compute, writes and reads if you "pay as you go".</li></ul><p>What a crazy period we live in. 
Every open-source technology launches a cloud-based offering of their tool, expecting the money to finance development. Is it really sustainable?</p><p></p><h1 id="a-bit-of-data-engineering">A bit of data engineering</h1><p>I do not share a lot about what I do as a data engineer outside of this newsletter. Even if this is probably worth a dedicated post, I think today I'll do a category about the data engineer's life. At the moment I'm working on two projects that are migrations. For the first project I'm migrating from a 12-year-old custom-made analytical application to a new one built within Apache Superset. </p><p>I also feel that a lot of the projects I've worked on as a data engineer were migrations. Sometimes small migrations like changing a data pipeline, sometimes larger ones like migrating a warehouse technology or an orchestration tool.</p><p>Migrations fuel data engineering work today and Ben depicts it greatly in his new post <a href="https://medium.com/coriers/realities-of-being-a-data-engineer-migrations-3dd76c9c5357?ref=blef.fr"><em>Realities of being a data engineer — Migrations</em></a>. As Ben said we have different kinds of migrations: operating systems, hardware, cloud, analytics or data. Every migration obviously brings risk and that's why we do preparatory work to mitigate it. 
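A common way to do that preparatory work is a double run: execute the old and the new pipeline side by side and diff their outputs before cutting over. A minimal sketch of the comparison step, assuming you can fetch each system's results as rows (the sample rows and key names below are hypothetical, and real client calls would replace the hard-coded lists):

```python
# Double-run check: compare old vs new pipeline outputs before cutover.
# The two hard-coded row lists stand in for "query the old system" and
# "query the new system"; swap in real client calls.

def diff_outputs(old_rows, new_rows, key):
    """Return (missing_in_new, unexpected_in_new, mismatched) row keys."""
    old = {r[key]: r for r in old_rows}
    new = {r[key]: r for r in new_rows}
    missing = sorted(set(old) - set(new))
    unexpected = sorted(set(new) - set(old))
    mismatched = sorted(k for k in set(old) & set(new) if old[k] != new[k])
    return missing, unexpected, mismatched

old_rows = [{"id": 1, "total": 10}, {"id": 2, "total": 20}]
new_rows = [{"id": 1, "total": 10}, {"id": 2, "total": 21}, {"id": 3, "total": 5}]
missing, unexpected, mismatched = diff_outputs(old_rows, new_rows, "id")
print(missing, unexpected, mismatched)  # [] [3] [2]
```

In a real migration you would first compare cheap aggregates (row counts, checksums) per table, and only diff full rows on small samples.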
But even with good experience we can't plan for the unexpected and <a href="https://www.blef.fr/data-engineering-deadlines/">deadlines will slide</a>.</p><p>Later in the post Ben proposes a 5-step framework every migration should follow:</p><ul><li>Initiate — Justify the migration and get buy-in from stakeholders</li><li>Design and discovery — Do the product work and design what you expect, take time to explore the unknown</li><li>Execute implementation — Develop what you have to develop and automate the boring stuff (a lot of migrations contain boring stuff, so automate it)</li><li>Testing and validation — Check everything and do a double run with your old system and the new one</li><li>Roll out and the long tail — Decide when to stop the old system and use the opportunity to change the processes with the new system</li></ul><!--kg-card-begin: html--><p style="text-align:center;"><a href="https://medium.com/coriers/realities-of-being-a-data-engineer-migrations-3dd76c9c5357?ref=blef.fr" style="display: inline-block; cursor: pointer; background-color: #E4E6E1; padding: 10px 20px; margin: 30px auto; border-radius: 5px; text-decoration: none;">👉 Read Ben's article</a></p><!--kg-card-end: html--><p>After all the migrations I've done and read about, one piece of advice I can give you is to invest in developing custom tools to track and help the migration. For instance if you have to migrate 200 SQL queries from Postgres to BigQuery, develop a dashboard that shows the progress of the migration and provide automated scripts to do the dumb parts. Migration work is boring, gamify it.</p><p>To illustrate this post Ronnie from Airbnb described <a href="https://medium.com/airbnb-engineering/upgrading-data-warehouse-infrastructure-at-airbnb-a4e18f09b6d5?ref=blef.fr">how they upgraded their data warehouse infrastructure</a>. 
They migrated from Hive to Spark 3 + Iceberg.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/10/image-1.png" class="kg-image" alt loading="lazy" width="2000" height="1500" srcset="https://www.blef.fr/content/images/size/w600/2022/10/image-1.png 600w, https://www.blef.fr/content/images/size/w1000/2022/10/image-1.png 1000w, https://www.blef.fr/content/images/size/w1600/2022/10/image-1.png 1600w, https://www.blef.fr/content/images/size/w2400/2022/10/image-1.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>Data migration (<a href="https://unsplash.com/photos/1pZJqQlgpsY?ref=blef.fr">credits</a>)</figcaption></figure><p></p><h1 id="ml-friday-%F0%9F%A4%96">ML Friday 🤖</h1><ul><li><a href="https://doordash.engineering/2022/10/05/homepage-recommendation-with-exploitation-and-exploration/?ref=blef.fr">Homepage recommendation with exploitation and exploration</a> — How DoorDash created a personalized homepage with their custom ranking algorithms.</li><li>Also this week Etsy wrote about their <a href="https://www.etsy.com/codeascraft/deep-learning-for-search-ranking-at-etsy?ref=blef.fr">search ranking personalisation</a> with Deep Learning.</li><li>Finally, Walmart shared more details about their <a href="https://medium.com/walmartglobaltech/element-walmarts-machine-learning-platform-b8a1f7870784?ref=blef.fr">machine learning platform</a>. In a nutshell this is a big platform with a lot of fancy technologies involved. It sits on top of Kubernetes and, be ready, mentions BigQuery, Spark, Cassandra, Trino, Hive and GCS at least as data storage platforms.</li><li>📅 The <a href="https://www.featurestoresummit.com/fss-2022/agenda-2022?ref=blef.fr">feature store summit</a> will take place next week on Oct. 
11th.</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/10/image-2.png" class="kg-image" alt loading="lazy" width="2000" height="1335" srcset="https://www.blef.fr/content/images/size/w600/2022/10/image-2.png 600w, https://www.blef.fr/content/images/size/w1000/2022/10/image-2.png 1000w, https://www.blef.fr/content/images/size/w1600/2022/10/image-2.png 1600w, https://www.blef.fr/content/images/size/w2400/2022/10/image-2.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>A personalized homepage (<a href="https://unsplash.com/photos/K1NObYzL86k?ref=blef.fr">credits</a>)</figcaption></figure><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li><a href="https://www.technologyreview.com/2022/10/01/1060539/eu-tech-policy-harmful-ai-liability/?ref=blef.fr">The EU wants to put companies on the hook for harmful AI</a> — "<em>A new bill will allow consumers to sue companies for damages—if they can prove that a company’s AI harmed them."</em> Once again the EU regulates, probably for the best, while companies are trying AI everywhere. If it ripples out to others like the GDPR did, it could be good.</li><li><a href="https://www.insee.fr/en/statistiques/6530562?sommaire=6530621&ref=blef.fr">Recruitment Difficulties, an analysis of 2019 French companies data</a> — This is a study from the French statistical studies bureau. The study outlines a high mismatch between labour supply and demand.</li><li><a href="https://medium.com/insiderengineering/apache-iceberg-reduced-our-amazon-s3-cost-by-90-997cde5ce931?ref=blef.fr">Use Iceberg to reduce storage cost</a> — Deniz describes how migrating from ORC + Snappy to Iceberg with Parquet + Zstandard drastically reduced the S3 GetObject costs (by ~90%). 
As a side effect it also reduced the Spark compute cost by 20%.</li><li>❤️ <a href="https://dagster.io/blog/skip-kafka-use-postgres-message-queue?ref=blef.fr">Postgres: a better message queue than Kafka?</a> — Dagster recently launched their cloud offering. They decided to use Postgres as the foundation for their logging system. This post explains why. I really like the post because it's about technology choices and problem framing.</li><li><a href="https://github.com/matanolabs/matano?ref=blef.fr">matanolabs/matano</a> — <em>The open-source security lake platform for AWS</em>. Matano provides you with a way to query and alert on logs collected from all your sources. Matano stores everything as Iceberg files in S3 and you can write Python rules to get real-time alerts on top of it.</li><li><a href="https://techwithadrian.medium.com/dbt-repository-to-split-or-not-to-split-909d366d0998?ref=blef.fr">dbt repository — to split or not to split?</a> ; this is a hard question for every dbt developer. Should I go for a monorepo like dbt recommends or should I go for a modular approach? Adrian covers the 2 ways in the post. I personally think everyone should start with a monorepo. Once the data team moves to a mesh organisation the modular approach with packages should be considered.</li><li><a href="https://dshersh.medium.com/too-many-mlops-tools-c590430ba81b?ref=blef.fr">Another tool won’t fix your MLOps problems</a> — Whether it's MLOps or DataOps we have too many tools and yet more marketing than practitioners in the space. We need to reach the plateau, like DevOps did, to avoid collecting tools like Panini cards.</li><li><a href="https://medium.com/@Not4j/what-we-are-missing-in-data-ci-cd-pipelines-c3d7f02e0894?ref=blef.fr">What are we missing in data CI/CD pipelines?</a> — Thoughts around a CI/CD incremental approach for dbt.</li></ul><hr><p>See you next week ❤️.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 22.39 ]]></title>
                    <description><![CDATA[ Data News #22.39 — Unravel, Coalesce and Wasabi fundraise, are tables data products?, time travel, data masking and more. ]]></description>
                    <link><![CDATA[ /data-news-week-22-39/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 6336b25d63d0ad004d657b4c ]]></guid>
                    <pubDate><![CDATA[ 2022-09-30 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/09/image-12.png" class="kg-image" alt loading="lazy" width="900" height="603" srcset="https://www.blef.fr/content/images/size/w600/2022/09/image-12.png 600w, https://www.blef.fr/content/images/2022/09/image-12.png 900w" sizes="(min-width: 720px) 720px"><figcaption>Welcome to the 80 new members of this week (<a href="https://unsplash.com/photos/67L18R4tW_w?ref=blef.fr">credits</a>)</figcaption></figure><p>Tomorrow we'll enter the last quarter of the year. It's crazy how time flies. At the end of the year my freelancing activity will become my most significant professional experience. But at the same time I feel like I just started yesterday.</p><p>I'm so happy to see how the newsletter is doing these days. I really like to get feedback from you, so do not hesitate to reach out if you have something to say, it helps me a lot. I plan to write more original content—that will be only for members (free and paid). But I struggle to find the time to do it. <strong>I need to rethink my time management and prioritisation. I'm super bad at it. How do you do it?</strong></p><p></p><h1 id="data-fundraising-%F0%9F%92%B0">Data Fundraising 💰</h1><p>As opposed to the last 2 weeks, fundraises are back this week. Money is coming back. But first, bad news: <a href="https://newsletter.pragmaticengineer.com/p/the-scoop-layoffs-at-docusign?ref=blef.fr">Docusign is laying off 9% of its staff</a>.</p><ul><li><strong>Unravel Data</strong> <a href="https://www.unraveldata.com/welcome-third-point-ventures-series-d-funding/?ref=blef.fr">raised $50m Series D</a>. They tick a lot of buzzwords in their tag line: <em>DataOps Observability for the Modern data stack</em>. 
It feels like they do a lot of stuff to help data teams better understand their platform: monitoring cloud costs, recommending performance tuning for apps and pipelines, helping discover issues faster. In the end they do observability like the others. As a side note, they still mention Oozie in their demos. Modern data stack, they said.</li><li><strong>Coalesce</strong> <a href="https://coalesce.io/?ref=blef.fr">raised $26m Series A</a>. Coalesce is a boring drag-and-drop web UI to create data transformations for your Snowflake warehouse. Maybe they need money to pay for the trademark lawsuit with dbt Labs regarding the <em>Coalesce</em> term. They are fighting in court in the <a href="https://www.law.com/thelegalintelligencer/2022/08/22/data-engineering-companies-dispute-trademark-of-coalesce/?slreturn=20220830053439&ref=blef.fr">US</a> and the <a href="https://www.trademarkelite.com/uk/trademark/trademark-detail/UK00003741355/COALESCE?ref=blef.fr">UK</a> (cf. <a href="https://twitter.com/pdrmnvd/status/1566417768152322048?ref=blef.fr">Twitter</a>) 🙄.</li><li><strong>Wasabi</strong> <a href="https://wasabi.com/press-releases/wasabi-technologies-closes-250-million-in-new-funding-to-usher-in-the-future-of-cloud-storage/?ref=blef.fr">raised $250m Series D</a> to fuel their cloud storage alternative. Claiming 80% price cuts compared to AWS while being faster, it looks like a solid contender.</li></ul><p></p><h1 id="are-tables-data-products">Are tables data products?</h1><p>The data mesh initiative brings at its root domain ownership to data teams. In simple words, the major change is obviously organisational. It puts technical teams closer to their business. In this case you may have to look at Conway's law to define your <a href="https://carlosgrande.me/team-topologies/?ref=blef.fr">team topologies</a>.</p><p>In order to get your teams ready for the big change you'll need to identify the data products every team will deliver. Data products are entities on which you apply product principles. 
Data products, among other things, have to be interoperable, discoverable, shareable, bounded and owned.</p><p>And it applies very well to tables. Tables are highly interoperable, discoverable and shareable—ok, it depends on your storage/engine, but still it's more than decent. Also with some processes you can easily make the tables bounded and owned. So yes, we can say that tables can be considered a sufficient data product. BUT, not every table in the warehouse should be considered as such. LinkedIn decided to name these data products the <a href="https://engineering.linkedin.com/blog/2022/super-tables--the-road-to-building-reliable-and-discoverable-dat?ref=blef.fr">Super Tables</a>.</p><p>At LinkedIn, Super Tables are units of work like the <em>jobs</em> or the <em>ads_event</em> table. For instance their <em>jobs</em> table consolidates more than 57 sources into 158 columns. Which obviously means a lot: 57 sources into one table is probably more than the average data team uses in a whole warehouse. Every ST should enforce SLAs to reach 99%+ availability. It then creates datasets everyone in the company can trust and use in downstream data flows.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/09/image-11.png" class="kg-image" alt loading="lazy" width="1207" height="509" srcset="https://www.blef.fr/content/images/size/w600/2022/09/image-11.png 600w, https://www.blef.fr/content/images/size/w1000/2022/09/image-11.png 1000w, https://www.blef.fr/content/images/2022/09/image-11.png 1207w" sizes="(min-width: 720px) 720px"><figcaption>LinkedIn's move from Source-of-Truth tables to Super Tables (image from the source article)</figcaption></figure><p>Creating a Super Table is not an easy task. You'll need to clearly identify why people need the data to create this common asset that delivers value to the stakeholders. 
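Those product properties (owned, discoverable, with an SLA) can be made concrete as a small table-level contract that CI can check. A minimal sketch — all field names and thresholds here are illustrative, not LinkedIn's actual implementation:

```python
from dataclasses import dataclass, field

@dataclass
class TableProduct:
    """Minimal descriptor for a table treated as a data product."""
    name: str
    owner: str               # accountable team, makes the table owned
    description: str         # makes the table discoverable
    sla_availability: float  # e.g. 0.99 for 99%+ availability
    sources: list = field(default_factory=list)

    def violations(self):
        """Return the product principles this table fails to meet."""
        problems = []
        if not self.owner:
            problems.append("no owner")
        if not self.description:
            problems.append("no description")
        if self.sla_availability < 0.99:
            problems.append("SLA below 99%")
        return problems

jobs = TableProduct("jobs", owner="jobs-team",
                    description="Consolidated jobs data",
                    sla_availability=0.995, sources=["src_a", "src_b"])
print(jobs.violations())  # []
```

A CI job could load such descriptors from YAML and fail the build when `violations()` is non-empty, which is one way to enforce the "bounded and owned" part with a process.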
With domain data teams it's easier to do because teams are closer to their sources and dedicated per business, so they should know better what's needed. </p><p>But still, once you have all the requirements you'll need to apply <a href="https://ergestx.substack.com/p/why-data-modeling-is-a-super-skill?ref=blef.fr">data modeling super skills</a>.</p><blockquote>As a data modeler you can help leadership bring in millions of dollars in revenue by adjusting a few lines of code.</blockquote><p>As a final note on this, everyone is speaking about Kimball but no one reads him—I confess, myself included. Justin wrote a post about the <a href="https://medium.com/@jjghavami/data-engineer-must-kimballs-4-step-dimensional-design-process-a9cc7bdf8d4?ref=blef.fr">4-step dimensional design process</a> every data modeler should follow to create well-architected tables.</p><p></p><h1 id="ml-friday-%F0%9F%A4%96">ML Friday 🤖</h1><ul><li><a href="https://medium.com/artefact-engineering-and-data-science/forecasting-something-that-never-happened-how-we-estimated-past-promotions-profitability-5f55cfa1d477?ref=blef.fr">Forecasting something that never happened</a> — This is a good problem to have in machine learning and something I've seen multiple times. 
Luca describes how you can estimate the uplift that will be generated by a promotion when you've never run it.</li><li><a href="https://doordash.engineering/2022/09/27/five-common-data-quality-gotchas-in-machine-learning-and-how-to-detect-them-quickly/?ref=blef.fr">5 common data quality gotchas in Machine Learning</a> — Doordash developed a DataQualityReport Python package that will help you identify missing values, invalid values and sampling errors while finding distribution anomalies.</li></ul><p></p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/09/image-13.png" class="kg-image" alt loading="lazy" width="900" height="601" srcset="https://www.blef.fr/content/images/size/w600/2022/09/image-13.png 600w, https://www.blef.fr/content/images/2022/09/image-13.png 900w" sizes="(min-width: 720px) 720px"><figcaption>Going back in time within my warehouse (<a href="https://unsplash.com/photos/h0dngiRxMeA?ref=blef.fr">credits</a>)</figcaption></figure><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li><a href="https://ai.googleblog.com/2022/09/tensorstore-for-high-performance.html?ref=blef.fr">Google released TensorStore</a>, a new way to store and manipulate arrays. I don't fully get the hidden power of this kind of innovation, but I feel it becomes something when applied to petabytes of brain images.</li><li>How to use <a href="https://medium.com/google-cloud/quickly-restore-bigquery-dataset-with-time-travel-and-cloud-workflows-a66b868f4684?ref=blef.fr">time travel on BigQuery tables</a> — This is enabled by default and you can restore your table state at any point in time in the last 7 days.</li><li>Use <a href="https://right-triangle.com/2022/09/data-masking-snowflake-and-data-vault/?ref=blef.fr">Snowflake data masking</a> — Every warehouse should have a privacy layer. In Snowflake you can do it with masking policies. 
<a href="https://docs.snowflake.com/en/sql-reference/sql/create-masking-policy.html?ref=blef.fr">Masking policies</a> are functions that will mask the data if queried without privileges. Philosophically this can be applied to every database engine—for instance <a href="https://www.postgresql.org/about/news/postgresql-anonymizer-10-privacy-by-design-for-postgres-2452/?ref=blef.fr">Postgres</a>.</li><li><a href="https://eng.lyft.com/evolution-of-streaming-pipelines-in-lyfts-marketplace-74295eaf1eba?ref=blef.fr">Evolution of streaming pipelines in Lyft’s marketplace</a> — The Lyft engineering team has been a thought leader when it comes to feature engineering. In this post they detail the different phases they went through year after year.</li><li><a href="https://blog.twitter.com/engineering/en_us/topics/infrastructure/2022/data-quality-automation-at-twitter?ref=blef.fr">Data quality automation at Twitter</a> — A small article that details how Twitter developed their Data Quality Platform (DQP) on top of Great Expectations. In a nutshell they define rules in YAML files that are compiled into Airflow DAGs that run periodically to check if everything runs fine. In the end they show reports in Looker.</li><li><a href="https://www.lastweekinaws.com/blog/the-baffling-maze-of-kubernetes/?ref=blef.fr">The baffling maze of Kubernetes</a> — Kubernetes is the wild west. In the article Corey mentions that there isn't any consensus in the community as of now on how to develop iteratively on a Kube cluster. More than 25 products claim to do it. 
On my side, atm I'm deploying a bare-metal kube cluster and to be honest every day I'm facing new issues; it reminds me of the good old Hadoop days.</li><li><a href="https://dataanalysis.substack.com/p/saas-metrics-reporting-a-peek-behind?ref=blef.fr">SaaS metrics reporting</a> — What are the metrics you should follow when doing analytical work for a SaaS product.</li><li><a href="https://medium.com/event-driven-utopia/comparing-stateful-stream-processing-and-streaming-databases-c8c670f3f4bb?ref=blef.fr">Comparing stateful stream processing and streaming databases</a>.</li></ul> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 22.38 ]]></title>
                    <description><![CDATA[ Data News #22.38 — Hidden gems in dbt artifacts, understand the Snowflake query optimizer, Python untar vulnerability, fast news and ML Friday. ]]></description>
                    <link><![CDATA[ /data-news-week-22-38/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 632dc40213284d004d523307 ]]></guid>
                    <pubDate><![CDATA[ 2022-09-23 ]]></pubDate>
                    <content>
                        <![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/09/image-9.png" class="kg-image" alt loading="lazy" width="2000" height="1331" srcset="https://www.blef.fr/content/images/size/w600/2022/09/image-9.png 600w, https://www.blef.fr/content/images/size/w1000/2022/09/image-9.png 1000w, https://www.blef.fr/content/images/size/w1600/2022/09/image-9.png 1600w, https://www.blef.fr/content/images/size/w2400/2022/09/image-9.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>🇫🇷 (<a href="https://unsplash.com/photos/CVqdh5rlytc?ref=blef.fr">credits</a>)</figcaption></figure><p>Bonjour vous ! Like sometimes I'm late. Today, I write the first words of the newsletter at 5PM. Which is 8h later than usual. Pardon me. In term of content it has been a huge week for me, I've prepared a meetup presentation that I enjoyed giving this Wed. It feels good to present stuff in public.</p><p>So yeah, let's talk a bit of this presentation.</p><p></p><h1 id="find-the-hidden-gem-in-dbt-artifacts">Find the hidden gem in dbt artifacts</h1><p>On Wednesday I made a 30 minutes presentation looking for hidden gems in dbt artifacts. The talk was a bit experimental, the idea is to show that this is possible for everyone to add context to you data infrastructure by leveraging generated artifacts. It means you can use the 4 JSON files generated to create tooling around your dbt project. </p><blockquote><strong>Shoemakers children are the worst shod.</strong></blockquote><p>Why not using the data generated by dbt artifacts to create useful data models to self-improve our data platforms?</p><p>While leveraging the 4 JSON files (manifest, run_results, sources, catalog) we could:</p><ol><li>Sources monitoring like in dbt Cloud</li><li>Extends your dbt docs HTML</li><li>Send data in your BI tool. We already have Metabase or Preset integrations.</li><li>Enforce and visualise your data governance policy. 
Refuse every merge request if a model owner is not defined for instance.</li><li>dbt observability, monitoring and alerting, have fun with analytics on your analytics.</li><li>Create a dbt model time travel viewer. Create an automated changelog process than display your data model evolutions.</li><li><a href="https://github.com/Bl3f/dbt-helper?ref=blef.fr">dbt-helper</a> — Your SQL companion</li><li>dbt-doctor — It’s time to detect issues. Idea: a CLI tool to detect any dbt FROM leftovers to fail in CI if yes.</li></ol><p>I also shared that every data engineer should consider the artifacts like a way to understand their customers. If you manage to get the artifacts from every envs (local, ci, staging, prod) you have the data to understand how everyone is using the tool. Especially useful if you have junior analysts lost within the tool, it'll detect silent local issues.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/09/Screenshot-2022-09-23-at-18.59.13.png" class="kg-image" alt loading="lazy" width="2000" height="968" srcset="https://www.blef.fr/content/images/size/w600/2022/09/Screenshot-2022-09-23-at-18.59.13.png 600w, https://www.blef.fr/content/images/size/w1000/2022/09/Screenshot-2022-09-23-at-18.59.13.png 1000w, https://www.blef.fr/content/images/size/w1600/2022/09/Screenshot-2022-09-23-at-18.59.13.png 1600w, https://www.blef.fr/content/images/size/w2400/2022/09/Screenshot-2022-09-23-at-18.59.13.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>Send artifacts from every env to understand how everyone uses dbt.</figcaption></figure><p><em>🔗 <a href="https://docs.google.com/presentation/d/1ThZ6UnH5xVmSdVhMcc4zjC7X8ucAMzTu54Rk7iFbXrM/edit?usp=sharing&ref=blef.fr">Here the slides of my presentation</a>.</em></p><h3 id="closing-on-dbt">Closing on dbt</h3><p>To finish this edito about dbt here 3 other articles I found interesting. 
While we live our best lives creating dbt projects, the complexity of these projects will only rise in the future. By facilitating the way we create data models we encourage data model creation. So what does it mean when you have more than 700 models written by more than 43 humans? <a href="https://roundup.getdbt.com/p/complexity-the-new-analytics-frontier?ref=blef.fr">Anna from dbt Labs wrote an introspective post about it</a>. </p><p>Adrian also raised the <a href="https://techwithadrian.medium.com/manage-complexity-in-dbt-projects-ca1cb4e87a3?ref=blef.fr">complexity topic</a> on Medium. He states that with the modern data stack and the all-SQL paradigm we write complex code that risks becoming unmanageable.</p><p>Finally if you want a course on data modeling, Miles from GitLab will run a CoRise on <a href="https://corise.com/course/data-modeling?ref=blef.fr">Data Modeling for the Modern Warehouse</a>. It seems a good resource to get started with the Kimball methodology.</p><p></p><h1 id="understanding-the-snowflake-query-optimizer">Understanding the Snowflake query optimizer</h1><p>❤️ If you had to read only one article this week it would be this one. I think Teej is doing an awesome job demystifying Snowflake internals. And he struck once again. It's time to <a href="https://teej.ghost.io/understanding-the-snowflake-query-optimizer/?ref=blef.fr">understand how the Snowflake query optimizer works</a>. Even if you don't use Snowflake I recommend this article to you.</p><blockquote>The job of a <strong>query optimizer</strong> is to reduce the cost of queries without changing what they do. Optimizers cleverly manipulate the underlying data pipelines of a query to eliminate work, pare down expensive operations, and optimally re-arrange tasks.</blockquote><p>In a nutshell the query optimizer tries to transform the badly written 500-line query into optimized instructions for the database. 
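As a toy illustration of "cheaper but logically identical", here is one classic rewrite, predicate pushdown, applied to a miniature plan tree. This is a sketch of the general idea only, not Snowflake's actual internals:

```python
# Toy logical plan: Filter(pred) over Project(cols) over Scan(table).
# Pushing the filter below the projection lets the engine discard rows
# earlier, without changing the query's result (the predicate only uses
# columns the projection already exposes).

class Scan:
    def __init__(self, table): self.table = table

class Project:
    def __init__(self, cols, child): self.cols, self.child = cols, child

class Filter:
    def __init__(self, pred, child): self.pred, self.child = pred, child

def push_down_filters(plan):
    """Rewrite Filter(Project(x)) into Project(Filter(x)), recursively."""
    if isinstance(plan, Filter) and isinstance(plan.child, Project):
        proj = plan.child
        return Project(proj.cols,
                       push_down_filters(Filter(plan.pred, proj.child)))
    if isinstance(plan, (Filter, Project)):
        plan.child = push_down_filters(plan.child)
    return plan

plan = Filter("country = 'FR'",
              Project(["user_id", "country"], Scan("events")))
optimized = push_down_filters(plan)
# After the rewrite the projection sits on top and the filter is
# applied directly on the scan.
print(type(optimized).__name__, type(optimized.child).__name__)  # Project Filter
```

A real optimizer does this over a much richer algebra (joins, aggregates, views) and combines it with column pruning, but the mechanics are the same kind of tree rewrite.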
To run the query the database will need to load data in memory, and the query optimizer will try to find the minimal set of data the engine needs to scan in order to answer as fast as it can.</p><p>Once the database knows exactly what to read, the optimizer will rewrite the query into a more optimized but logically identical form. It will replace the views or functions with their underlying physical objects, drop the unused columns (called column pruning) and push down the predicates. Predicate pushdown is the step where the optimizer tries to move all the data filtering (WHEREs) as early as possible in the query.</p><p>Then it performs join optimization, but for this I'll let you read Teej's excellent post.</p><!--kg-card-begin: html--><p style="text-align:center;"><a href="https://teej.ghost.io/understanding-the-snowflake-query-optimizer/?ref=blef.fr" style="display: inline-block; cursor: pointer; background-color: #E4E6E1; padding: 10px 20px; margin: 30px auto; border-radius: 5px; text-decoration: none;">👉 Read the full article</a></p><!--kg-card-end: html--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/09/image-10.png" class="kg-image" alt loading="lazy" width="900" height="600" srcset="https://www.blef.fr/content/images/size/w600/2022/09/image-10.png 600w, https://www.blef.fr/content/images/2022/09/image-10.png 900w" sizes="(min-width: 720px) 720px"><figcaption>Inside Snowflake partitions system (<a href="https://unsplash.com/photos/qGgUsvwEY70?ref=blef.fr">credits</a>)</figcaption></figure><p></p><h1 id="ml-friday-%F0%9F%A4%96">ML Friday 🤖</h1><ul><li><strong>Netflix</strong> — <a href="https://netflixtechblog.medium.com/machine-learning-for-fraud-detection-in-streaming-services-b0b4ef3be3f6?ref=blef.fr">Machine learning for fraud detection in streaming services</a>. 
</li><li><strong>Snowflake &amp; Prophet (Meta)</strong> — <a href="https://hoffa.medium.com/facebook-prophet-forecasts-running-in-snowflake-with-snowpark-14fc870b56ae?ref=blef.fr">Run forecasts directly within the warehouse with Snowpark</a>.</li><li><strong>River, Redpanda and Materialize</strong> — Max developed a small Streamlit <a href="https://github.com/MaxHalford/taxi-demo-rp-mz-rv-rd-st?ref=blef.fr">application</a> <a href="https://www.linkedin.com/feed/update/urn:li:activity:6977917696159977473/?updateEntityUrn=urn%3Ali%3Afs_feedUpdate%3A%28V2%2Curn%3Ali%3Aactivity%3A6977917696159977473%29&ref=blef.fr">predicting taxi trip durations in real time</a>.</li><li><a href="https://mlu-explain.github.io/linear-regression/?ref=blef.fr"><strong>Linear Regression explained</strong></a> — Once again mlu-explain created the best resource to explain how linear regression works. As you scroll, you understand how the model works.</li><li><strong>OpenAI</strong> — <a href="https://openai.com/blog/whisper/?ref=blef.fr">Whisper, a new model released by OpenAI for automatic speech recognition</a>.</li></ul><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li><a href="https://airflow.apache.org/blog/airflow-2.4.0/?ref=blef.fr">Airflow 2.4 is out, with data-aware scheduling</a> — This new release features a new way to approach Airflow scheduling. You define datasets and relations between them. Airflow handles the logic to run the DAGs related to each dataset when needed. 
This behaviour was introduced by <a href="https://dagster.io/blog/software-defined-assets?ref=blef.fr">Dagster</a> months ago.</li><li>⚠️ <a href="https://securityaffairs.co/wordpress/136081/hacking/python-bug-cve-2007-4559.html?ref=blef.fr">350,000 Python projects subject to a 15-year-old vulnerability</a> — <a href="https://nvd.nist.gov/vuln/detail/CVE-2007-4559?ref=blef.fr">CVE-2007-4559</a> was discovered in August 2007 and allows an attacker to overwrite files when an archive containing <code>..</code> relative names is untarred (I've been told that this attack also exists for zip). </li><li><a href="https://towardsdatascience.com/bigquery-functions-for-data-cleaning-4b96181fbc3?ref=blef.fr">BigQuery SQL functions for data cleaning</a> — 4 useful functions: normalization, pattern matching, safe division and date formatting.</li><li>📺 <a href="https://www.youtube.com/watch?v=obgY1DBojbY&ref=blef.fr">Saving the planet one query at a time</a> — Part of the data ecosystem lives in a dream: the dream of infinite resources hidden at Google or AWS. But this is as wrong as the infinite oil principle our economy is based on. The time will come to reconsider running a fancy clustered Spark job and to replace it with local DuckDB compute. To go further, the French org <em>The Shift Project</em> wrote a <a href="https://theshiftproject.org/former-les-ingenieurs-a-la-transition/?ref=blef.fr">manifesto to help universities shape the next generation of engineers</a>.</li><li>On the DuckDB topic there is a — not self-explanatory — demo on how to <a href="https://djouallah.github.io/tcph_web/?ref=blef.fr">combine Malloy and DuckDB to do analytics in the web browser</a>.</li><li><a href="https://medium.com/coriers/the-evolution-of-data-companies-167ff4b65e1d?ref=blef.fr">The evolution of data companies</a> — Ben analyzes the extract-load connectors vision of Portable, Airbyte and Estuary. 
These are 3 companies whose founders come from Liveramp, and Ben tries to see which Liveramp problems helped them imagine the data products they run today.</li><li><a href="https://medium.com/@westlakealexa/gamification-of-data-knowledge-9c3c66213952?ref=blef.fr">Gamification of data knowledge</a> — How to create the best data documentation by adding gamification to the process.</li><li><a href="https://medium.com/checkout-com-techblog/testing-monitoring-the-data-platform-at-scale-e22d9cf433e8?ref=blef.fr">Testing &amp; monitoring the data platform at scale</a> — With Airflow and MonteCarlo inside.</li></ul><hr><p>See you next week 👻</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 22.37 ]]></title>
                    <description><![CDATA[ Data News #22.37 — Data roles: lead, analytics engineer, data engineer, the metrics layers, McDonald&#39;s event-driven and fast news. ]]></description>
                    <link><![CDATA[ /data-news-week-22-37/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 6321cd396c0338003d2b2a04 ]]></guid>
                    <pubDate><![CDATA[ 2022-09-16 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/09/image-6.png" class="kg-image" alt loading="lazy" width="2000" height="1333" srcset="https://www.blef.fr/content/images/size/w600/2022/09/image-6.png 600w, https://www.blef.fr/content/images/size/w1000/2022/09/image-6.png 1000w, https://www.blef.fr/content/images/size/w1600/2022/09/image-6.png 1600w, https://www.blef.fr/content/images/size/w2400/2022/09/image-6.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>My weeks are like (<a href="https://unsplash.com/photos/5UNYknY0MTA?ref=blef.fr">credits</a>)</figcaption></figure><p>Halo Data News readers. The weeks are pretty intense for me and every Friday comes in the blink of an eye. I write the introduction before the content of the newsletter, so I don't know how it'll turn out today. But I hope you'll enjoy it.</p><p><strong>For a future deep-dive, I'm looking for data engineering career paths. If you have one or something similar in your company I'd love to have a look at it — everything will be anonymous by default ofc.</strong></p><p>No fundraising this week. I did not find any news to put light on.</p><p></p><h1 id="data-roles">Data roles</h1><p>Every tech lead faces this identity issue one day or another. It is the same for every data lead. How should you divide your time between management, contribution and stakeholders? Mikkel describes well the <a href="https://towardsdatascience.com/the-difficult-life-of-the-data-lead-a31186ef0d27?ref=blef.fr">difficult life of the data lead</a>. 
I was previously in a lead role and the main advice I can give to people in the same situation is: <strong>grieve it and stop the contribution work except for code reviews</strong>.</p><p>In the same vein, 2 other posts I liked this week:</p><ul><li><a href="https://medium.com/gousto-engineering-techbrunch/what-is-the-difference-between-an-analytics-engineer-and-a-data-engineer-9a5ca0c7b7b1?ref=blef.fr">What is the difference between an Analytics Engineer and a Data Engineer?</a></li><li><a href="https://medium.com/pipeline-a-data-engineering-resource/5-lessons-that-helped-me-not-quit-my-data-job-in-week-1-8064363643ea?ref=blef.fr">Lessons that helped me not quit my data job in Week 1</a>; the best tip inside is: <em>Bother your seniors, that’s what they’re for.</em></li></ul><p></p><h1 id="the-metrics-layer">The metrics layer</h1><p>Pedram produced a <a href="https://pedram.substack.com/p/what-is-the-metrics-layer?ref=blef.fr">deep-dive on the metrics layer</a>. He explains what's behind it and which current solutions propose a metrics layer: Looker, dbt Metrics and Lightdash.</p><p>In the current state of the technology <strong>the metrics layer is nothing more than a declarative way (a file) to describe the metrics, dimensions, filters and segments in your warehouse</strong> tables. In Looker you write it in LookML, in dbt and Lightdash you use the dbt YAML, in Cube you use JavaScript.</p><p>The final vision of the metrics layer is to create an interoperable way to define metrics and dimensions that every BI tool will understand natively, avoiding the hours spent recreating this knowledge in each tool. 
But we are far from there.</p><p></p><h1 id="mcdonald%E2%80%99s-event-driven-architecture">McDonald’s event-driven architecture</h1><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/09/image-7.png" class="kg-image" alt loading="lazy" width="2000" height="1125" srcset="https://www.blef.fr/content/images/size/w600/2022/09/image-7.png 600w, https://www.blef.fr/content/images/size/w1000/2022/09/image-7.png 1000w, https://www.blef.fr/content/images/size/w1600/2022/09/image-7.png 1600w, https://www.blef.fr/content/images/size/w2400/2022/09/image-7.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>Event flows at McDonald's (<a href="https://unsplash.com/photos/0-SARiQX6NE?ref=blef.fr">credits</a>)</figcaption></figure><p>A two-post series details what's behind McDonald's event architecture. First, they <a href="https://medium.com/mcdonalds-technical-blog/behind-the-scenes-mcdonalds-event-driven-architecture-51a6542c0d86?ref=blef.fr">define what it means to develop such an architecture</a>: something that needs to be scalable, available, performant, secure, reliable, consistent and simple. They went with mainstream choices: Kafka (but managed by AWS), the Schema Registry, DynamoDB to store the events and API Gateway to create an API endpoint to receive events. Nothing fancy, but it looks solid.</p><p>In the second post they give the <a href="https://medium.com/mcdonalds-technical-blog/mcdonalds-event-driven-architecture-the-data-journey-and-how-it-works-4591d108821f?ref=blef.fr">global picture and how everything orchestrates together</a>, defining the typical data flow. 
We can summarize it as: define the event schema, produce the event, validate, publish, and if something goes wrong use a <a href="https://en.wikipedia.org/wiki/Dead_letter_queue?ref=blef.fr">dead letter topic</a> or write directly to DynamoDB.</p><p></p><h1 id="ml-friday-%F0%9F%A4%96">ML Friday 🤖</h1><ul><li><a href="https://madewithml.com/courses/mlops/data-stack/?ref=blef.fr">Data Stack for Machine Learning</a> — This is an MLOps course that contains a data stack chapter. It covers data storage, extract, load and transform. The whole course seems great.</li><li><a href="https://www.linkedin.com/pulse/how-ai-eat-perfume-industry-nikolaj-groeneweg/?ref=blef.fr">How AI will eat the perfume industry</a> — "<em>Google AI identifies scents more reliably than humans</em>". </li><li>📺 <a href="https://www.youtube.com/watch?v=Y9NUo_3cUIw&ref=blef.fr">Learned data augmentation for bias correction</a> — I really like the fact it's a PhD defence talk given at a technical university in Denmark by Pola Schwöbel.</li><li><a href="https://netflixtechblog.medium.com/new-series-creating-media-with-machine-learning-5067ac110bcd?ref=blef.fr">Creating media with Machine Learning</a> at Netflix — This is a new blog series where the Netflix tech team explains how they use machine learning to produce creative media content.</li></ul><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/09/image-8.png" class="kg-image" alt loading="lazy" width="2000" height="1333" srcset="https://www.blef.fr/content/images/size/w600/2022/09/image-8.png 600w, https://www.blef.fr/content/images/size/w1000/2022/09/image-8.png 1000w, https://www.blef.fr/content/images/size/w1600/2022/09/image-8.png 1600w, https://www.blef.fr/content/images/size/w2400/2022/09/image-8.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>The true Uber alternative (<a 
href="https://unsplash.com/photos/cOywAM6fsPo?ref=blef.fr">credits</a>)</figcaption></figure><ul><li><a href="https://www.nytimes.com/2022/09/15/technology/uber-hacking-breach.html?ref=blef.fr">Uber has been — apparently — hacked</a> last night. The attacker claims to be an 18-year-old. He got VPN access using social engineering on an IT person. He then scanned the intranet and found a Powershell script on the shared network. The script contained the username/password of Uber's access management platform. That's how he got in. This is a small reminder that "nothing is really secure".</li><li>Generative AI news — now that we have overly complicated generative AIs, people developed products to <a href="https://phraser.tech/?ref=blef.fr">generate prompts that will work with each AI</a>. There is also an <a href="https://www.tattoosai.com/?ref=blef.fr">AI to find your next tattoo</a>.</li><li><a href="https://cloud.google.com/blog/products/data-analytics/introducing-seamless-database-replication-to-bigquery?hl=en&ref=blef.fr">Introducing Datastream for BigQuery</a> — Google developed an integrated solution to do Change Data Capture on GCP. It can use MySQL, Oracle and Postgres as sources and GCS and BigQuery as destinations for the moment. This is a good solution to go real-time with minimal footprint.</li><li><a href="https://www.getbluesky.io/?ref=blef.fr">Bluesky, monitor your Snowflake cost and get alerted</a> — As I recently shared, we may see a lot of tools similar to this one in the future, as warehouses take a prominent place in current data stacks. It watches all SQL queries to identify queries with an unbalanced performance/cost ratio.</li><li><a href="https://engineering.monday.com/how-to-replace-your-database-while-running-full-speed/?ref=blef.fr">How to replace your database while running full speed</a> — Every data engineer has to face a migration one day or another. Lior from monday explains how they performed a migration from an analytical database to Snowflake with no downtime. 
It consisted of 4 steps: create all the schemas, migrate the writes, validate, migrate the reads.</li><li>Airbyte released a <a href="https://glossary.airbyte.com/?ref=blef.fr">data glossary</a> with a graph network to see relationships between articles.</li><li>Iceberg articles — A <a href="https://www.dremio.com/subsurface/apache-iceberg-101-your-guide-to-learning-apache-iceberg-concepts-and-practices/?ref=blef.fr">list of useful articles</a> when you want to understand what Iceberg is, and a post explaining <a href="https://www.dremio.com/subsurface/how-z-ordering-in-apache-iceberg-helps-improve-performance/?ref=blef.fr">Z-Ordering</a> with Iceberg. Regarding Z-order, it's a way to cluster data to optimise co-location when accessing data. But it obviously comes at a cost.</li><li><a href="https://engineering.linkedin.com/blog/2022/real-time-analytics-on-network-flow-data-with-apache-pinot?ref=blef.fr">Real-time analytics on network flow data with Apache Pinot</a> — How LinkedIn uses Kafka and Pinot to do real-time analytics on TBs of network data.</li><li><a href="https://towardsdatascience.com/its-time-to-set-sla-slo-sli-for-your-data-team-only-3-steps-ed3c93009aa5?ref=blef.fr">It’s time to set SLA, SLO, SLI for your data team</a> — It's time to apply SRE metrics to data teams.</li><li><a href="https://airtable.com/shrQMzHOF4hWfdTBG/tblA6Jm3vnbGCyLeC/viwTJgSb0J6YDUofu?ref=blef.fr">Connectors catalog</a> — Pierre created an Airtable detailing every connector out there. If you want to copy data from a specific source, have a look at it to find which tool you can use.</li></ul><hr><p>See you next week and please stop writing about data contracts.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Airflow dynamic DAGs ]]></title>
                    <description><![CDATA[ Learn how to create Apache Airflow dynamic DAGs (with and without TaskFlow API). ]]></description>
                    <link><![CDATA[ /airflow-dynamic-dags/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 6317162ca8b153003dbf2200 ]]></guid>
                    <pubDate><![CDATA[ 2022-09-13 ]]></pubDate>
                    <content>
<![CDATA[ <p>Airflow is a wonderful tool I have been using for the last 4 years. Sometimes I like it. Sometimes I don't. This post is dedicated to Airflow dynamic DAGs. I want to show you how to do it properly. Here we can see Airflow as a Python framework, so writing dynamic DAGs is just writing more Python code.</p><h1 id="why-should-i-use-dynamic-dags">Why should I use dynamic DAGs?</h1><p>Airflow dynamic DAGs are useful when, for instance, you have multiple tables and want a DAG per ingested table. To avoid creating multiple Python files and copy-pasting, you can factorize your code and create a dynamic structure.</p><p>Let's illustrate with an example. Imagine you have to copy your production Postgres database. To do it you create a list of the tables you want to fetch every morning. The <code>factory</code> will take this table list as an input and will dynamically produce a list of DAGs.</p><p>If, for instance, you want to do different things depending on the table type — e.g. incremental/full — you can go deeper by creating a configuration file per table and then looping over all the configuration files to create a DAG per table.</p><p>When you're doing an extract and load process I recommend creating a DAG per table rather than a DAG per schema or database. This way each DAG has a smaller scope and backfilling a table is easier. 
The main disadvantage of this solution is that you have to use more sensors in downstream dependencies.</p><p>In summary you can use dynamic DAGs for:</p><ul><li>Ingesting multiple tables from a database → a DAG per table</li><li>Running a list of SQL queries per domain → a DAG per domain</li><li>Scraping a list of websites → a DAG per website</li><li>Every time you are copy-pasting DAG code</li></ul><p></p><h1 id="dynamic-dags-with-taskflow-api">Dynamic DAGs with TaskFlow API</h1><p>We will use the latest Airflow version — 2.3.4 — here, but it'll work with every version that has the TaskFlow API. Let's say we have 3 sources and we want to create a DAG per source to do stuff on each source. These sources are <code>user</code>, <code>product</code> and <code>order</code>. For each source we want to apply a prepare and a load function.</p><figure class="kg-card kg-code-card"><pre><code class="language-python">import pendulum

from airflow.decorators import dag, task


@task
def prepare(source):
    print(f"Prepare {source}")
    pass


@task
def load(source):
    print(f"Load {source}")
    pass


def create_dag(source):
    @dag(
        schedule_interval="0 1 * * *",
        start_date=pendulum.datetime(2022, 1, 1, tz="UTC"),
        catchup=False,
        dag_id=f"prepare_and_load_{source}"
    )
    def template():
        """
        ### Prepare and load data
        This is the DAG that loads all the raw data
        """
        prepare_task = prepare(source)
        load_task = load(source)

        prepare_task &gt;&gt; load_task

    return template()


for source in ["user", "product", "order"]:
    globals()[source] = create_dag(source)
</code></pre><figcaption>dags/prepare_and_load.py</figcaption></figure><p>The important part of this code is the last line. It creates a global variable that contains the DAG object, which the Airflow DagBag will parse and register on every scheduler loop. Note that each generated DAG needs a unique <code>dag_id</code> and a unique global variable name, otherwise the last one wins.</p><pre><code class="language-python">globals()[source] = create_dag(source)</code></pre><p>If you want to go further you can also create a configuration per source. I recommend writing Python configurations rather than JSON. The main reason is that Python configuration can be linted and statically checked, and you can comment Python dicts.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/09/Screenshot-2022-09-06-at-16.50.47.png" class="kg-image" alt loading="lazy" width="2000" height="599" srcset="https://www.blef.fr/content/images/size/w600/2022/09/Screenshot-2022-09-06-at-16.50.47.png 600w, https://www.blef.fr/content/images/size/w1000/2022/09/Screenshot-2022-09-06-at-16.50.47.png 1000w, https://www.blef.fr/content/images/size/w1600/2022/09/Screenshot-2022-09-06-at-16.50.47.png 1600w, https://www.blef.fr/content/images/size/w2400/2022/09/Screenshot-2022-09-06-at-16.50.47.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>Airflow UI with dynamic DAGs</figcaption></figure><h1 id="dynamic-dags-with-configurations">Dynamic DAGs with configurations</h1><p>So you have a configuration folder called <code>config</code> in which you have the 3 source configurations.</p><figure class="kg-card kg-code-card"><pre><code class="language-python">config = {
    "name": "user",
    "type": "A",
}
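
# Every other source follows the same shape, one file per source.
# For example, a hypothetical config/product.py would contain:
#
# config = {
#     "name": "product",
#     "type": "B",
# }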
</code></pre><figcaption>config/user.py (as an example)</figcaption></figure><figure class="kg-card kg-code-card"><pre><code class="language-python">import os
from dataclasses import dataclass

import pendulum
from importlib.machinery import SourceFileLoader

from airflow.decorators import dag, task

CONFIG_FOLDER = "dags/config"


@dataclass
class Config:
    name: str
    type: str


@task
def prepare(source):
    print(f"Prepare {source}")
    pass


@task
def load(source):
    print(f"Load {source}")
    pass


def create_dag(source):
    @dag(
        schedule_interval="0 1 * * *",
        start_date=pendulum.datetime(2022, 1, 1, tz="UTC"),
        catchup=False,
        dag_id=f"prepare_and_load_{source.name}"
    )
    def template():
        """
        ### Load monthly data to the warehouse
        This is the DAG that loads all the raw data to the warehouse
        """
        prepare_task = prepare(source)
        load_task = load(source)

        prepare_task &gt;&gt; load_task

    return template()


for file in os.listdir(CONFIG_FOLDER):
    if file.endswith(".py"):
        filename = os.path.join(CONFIG_FOLDER, file)
        module = SourceFileLoader("module", filename).load_module()
        config = Config(**module.config)
        globals()[config.name] = create_dag(config)
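
# Note: SourceFileLoader(...).load_module() works but has been deprecated
# since Python 3.4; a sketch of the supported importlib equivalent:
#
#   import importlib.util
#   spec = importlib.util.spec_from_file_location("module", filename)
#   module = importlib.util.module_from_spec(spec)
#   spec.loader.exec_module(module)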

</code></pre><figcaption>dags/prepare_and_load_advanced.py</figcaption></figure><p>I decided to use a dataclass to parse every configuration and a module loader to load each file. This way every configuration will be statically checked, and if an error slips into a configuration the Python code will be invalid. You can then catch it in your CI/CD process for instance.</p><h1 id="dynamic-dags-without-taskflow">Dynamic DAGs without TaskFlow</h1><p>You can also do it without the TaskFlow API: you just need a <code>create_dag</code> function that returns a DAG and you're set. Below is a small example.</p><figure class="kg-card kg-code-card"><pre><code class="language-python">import os
from dataclasses import dataclass

import pendulum
from importlib.machinery import SourceFileLoader

from airflow import DAG
from airflow.operators.python import PythonOperator

CONFIG_FOLDER = "dags/config"


@dataclass
class Config:
    name: str
    type: str


def create_dag(source):
    dag = DAG(
        dag_id=f"prepare_and_load_{source.name}",
        start_date=pendulum.datetime(2022, 1, 1, tz="UTC"),
        catchup=False,
        schedule_interval="0 1 * * *",
    )

    prepare_task = PythonOperator(
        task_id="prepare",
        # no-arg callable: Airflow only passes arguments we ask for
        python_callable=lambda: print(f"Prepare {source.name}"),
        dag=dag
    )

    load_task = PythonOperator(
        task_id="load",
        python_callable=lambda: print(f"Load {source.name}"),
        dag=dag
    )

    prepare_task &gt;&gt; load_task

    return dag


for file in os.listdir(CONFIG_FOLDER):
    if file.endswith(".py"):
        filename = os.path.join(CONFIG_FOLDER, file)
        module = SourceFileLoader("module", filename).load_module()
        config = Config(**module.config)
        globals()[config.name] = create_dag(config)
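
# A quick way to sanity-check any of these factory files locally (assuming
# a working Airflow installation) is to let the DagBag parse the folder and
# inspect what it discovered:
#
#   from airflow.models import DagBag
#   bag = DagBag("dags/")
#   print(bag.dag_ids)        # one prepare_and_load_<name> per config file
#   print(bag.import_errors)  # should be empty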

</code></pre><figcaption>dags/main_without_taskflow.py</figcaption></figure><h1 id="conclusion">Conclusion</h1><p>Creating dynamic DAGs in Airflow is super easy. You can create DAG factories for all the repetitive tasks you may have, and thanks to this you'll be able to unit test your ETL code. </p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 22.36 ]]></title>
                    <description><![CDATA[ Data News #22.36 — Arize and Hebbia 💰, Firebolt lay-offs?, data mesh/contracts, dashboard explosion, a big ML Friday and news. ]]></description>
                    <link><![CDATA[ /data-news-week-22-36/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 6315ae3c3d7133003daced47 ]]></guid>
                    <pubDate><![CDATA[ 2022-09-09 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/09/image-3.png" class="kg-image" alt loading="lazy" width="2000" height="1125" srcset="https://www.blef.fr/content/images/size/w600/2022/09/image-3.png 600w, https://www.blef.fr/content/images/size/w1000/2022/09/image-3.png 1000w, https://www.blef.fr/content/images/size/w1600/2022/09/image-3.png 1600w, https://www.blef.fr/content/images/size/w2400/2022/09/image-3.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>👑 (<a href="https://unsplash.com/photos/3E3AVpvlpao?ref=blef.fr">credits</a>)</figcaption></figure><p>Hey, weeks are passing so fast. Every week I think I have time until Friday and suddenly it's already Friday.</p><p>On the 21st I'll give a talk in French at a meetup: <a href="https://www.meetup.com/fr-FR/IA-Engineering/events/288302156/?ref=blef.fr">dbt and the modern data stack</a>. I'll talk about the dbt artifacts and my extension <a href="https://github.com/Bl3f/dbt-helper?ref=blef.fr">dbt-helper</a>. I'd love to see you there 🤗.</p><p>Enjoy this week's edition.</p><h1 id="data-fundraising-%F0%9F%92%B0">Data fundraising 💰</h1><ul><li><strong><a href="https://arize.com/?ref=blef.fr">Arize</a></strong>, a machine learning observability platform, <a href="https://arize.com/blog/arize-ais-next-era-of-growth/?ref=blef.fr">raised a $38m Series B</a>. Used by big names. They integrate with the standard Python machine learning stack, with a free tier. If you need drift detection, model monitoring or explainability it's worth a look.</li><li><strong><a href="https://www.hebbia.ai/?ref=blef.fr">Hebbia</a></strong>, a document search engine, <a href="https://techcrunch.com/2022/09/07/hebbia-raises-30m-to-launch-an-ai-powered-document-search-tool/?ref=blef.fr">raised a $30m Series A</a>. Their website doesn't detail much about what they do and how. You can ingest PDFs, Office docs, etc. 
and then ask natural language questions to get answers.</li><li>😥 Firebolt is apparently doing a layoff, firing <a href="https://www.calcalistech.com/ctechnews/article/symoqnigj?ref=blef.fr">dozens of employees</a>. We don't have more information, but if it turns out to be true it'll be sad. It also shows that the data warehouse competition is harder than ever before and that their high valuation — $1.4b in January — is a tricky spot to deliver on.</li><li>On the same sad note, <a href="https://newsletter.pragmaticengineer.com/p/the-scoop-24?ref=blef.fr">Snap will shut down the Zenly app</a>, letting go the whole Paris team. Almost 3 years ago I was in the same situation with my former employer; I wish all the best to the Zenly team. As everyone is saying, Zenly was one of the best French tech teams, so if you are looking for talented people try to reach out to them.</li></ul><p></p><h1 id="dos-and-donts-of-data-mesh">Do's and don'ts of data mesh</h1><p>BlaBlaCar is one of the most advanced French companies when it comes to data. The travel company decided to implement a mesh organisation at the beginning of the year, rearranging 5 teams into 5 domains. Teams are cross-functional — like feature teams — across 5 domains: demand, supply (x2), marketing and infrastructure. </p><p>In the post Kineret details a few <a href="https://medium.com/blablacar/dos-and-don-ts-of-data-mesh-e093f1662c2d?ref=blef.fr">do's and don'ts</a> to consider when deciding to move to a mesh structure. As always for a migration, communication is one of the most important topics. With big changes, transparency should come first.</p><p>Continuing on the organisation aspect of a mesh: if you want your domain-oriented teams to succeed you'll need to create a way for teams to communicate with each other. Data contracts are a piece of the puzzle. 
As data contracts picked up again recently, mehdio explained <a href="https://towardsdatascience.com/data-contracts-from-zero-to-hero-343717ac4d5e?ref=blef.fr">how you can implement data contracts</a> and why they matter. </p><p>Small heads-up here: you can implement data contracts without an event bus, and even with an event bus you might still need to implement "contracts" that go deeper than just the messaging system, because you'll still have exceptions and a lot of stuff happening outside of the bus.</p><p></p><h1 id="what-if-every-dashboard-self-destructed">What if every dashboard self destructed</h1><p>The title says it all. This is a fun title but it means a lot. In data we have too many things. Many dashboards. Many tables. Many KPIs. <a href="https://counting.substack.com/p/what-if-every-dashboard-self-destructed?ref=blef.fr">What if we automatically destroyed dashboards</a>? What if we did it based on view counts? We could also remove and clean the whole data chain behind a dashboard. In real life I'm not a tidy person, but when it comes to data warehouses or BI tools I feel tidiness is way more important than in my bedroom.</p><p>When people try to predict the future of BI they often say that notebooks are the dashboard replacement. I don't think it'll be the case, but it's a move forward. In the future of the future, people say that <a href="https://blog.count.co/bye-bye-notebooks-hello-canvas/?ref=blef.fr">canvases are the notebook replacement</a>. I feel this is a good idea: it combines the creativity of the dashboard with the linear execution of the notebook to create a good story.</p><p>Small advice I heard this week in the excellent <a href="https://www.datageneration.co/?ref=blef.fr">DataGen podcast</a> — <a href="https://www.deezer.com/fr/episode/429389087?ref=blef.fr">Deezer episode</a>, in French. 
If you use a dashboard, a notebook, a canvas or whatever: when you release an analysis, record an additional video to put sound on it. It will for sure help people onboard faster onto your work.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/09/image-4.png" class="kg-image" alt loading="lazy" width="2000" height="1329" srcset="https://www.blef.fr/content/images/size/w600/2022/09/image-4.png 600w, https://www.blef.fr/content/images/size/w1000/2022/09/image-4.png 1000w, https://www.blef.fr/content/images/size/w1600/2022/09/image-4.png 1600w, https://www.blef.fr/content/images/size/w2400/2022/09/image-4.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>Tableau after you read the previous article (<a href="https://unsplash.com/photos/hLUTRzcVkqg?ref=blef.fr">credits</a>)</figcaption></figure><p></p><h1 id="ml-friday">ML Friday</h1><ul><li><a href="https://tech.instacart.com/lessons-learned-the-journey-to-real-time-machine-learning-at-instacart-942f3a656af3?ref=blef.fr">The journey to real-time machine learning at Instacart</a> — Whatever we say, today's data stacks are still mainly batch. The main reason is often that data is used for analytics, where batch is enough. That's also why machine learning often starts in batch. But if you want to go to production you'll need to be more reactive. Instacart details their journey from batch to real-time with a feature store at the center.</li><li><a href="https://medium.com/walmartglobaltech/unsung-saga-of-mlops-1b494f587638?ref=blef.fr">Unsung saga of MLOps</a> — Jaya from Walmart writes about the operational concepts around machine learning in production. 
Training, modeling and canary deployment are all covered in the post.</li><li><a href="https://doordash.engineering/2022/09/08/evolving-doordashs-substitution-recommendations-algorithm/?ref=blef.fr">Evolving DoorDash’s substitution recommendations algorithm</a> — How can a retailer recommend products when some are not available? This is a great machine learning exercise for aspiring data scientists.</li><li><a href="https://slack.engineering/recommend-api/?ref=blef.fr">Recommendations APIs at Slack</a> — This is a bit of an insider post that shows where Slack uses ML and also the API infrastructure behind it. Mainly batch, orchestrated by Airflow. <strong>Next time Slackbot suggests you leave a channel, you'll know what's behind it</strong>.</li><li><a href="https://www.music-tomorrow.com/blog/towards-recommender-system-optimization-data-tool-for-algorithmic-optimization-on-streaming-platforms?ref=blef.fr">Recommender System Optimization</a> — Music Tomorrow is a platform that gives knowledge to music professionals. 
They reverse-engineered the Spotify recommendation engine to help the music industry create more recommendable content ➰.</li><li><a href="https://building.nubank.com.br/data-science-interview-pratical-tips/?ref=blef.fr">Acing the data science interview: 8 practical tips with examples</a></li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/09/image-5.png" class="kg-image" alt loading="lazy" width="2000" height="1331" srcset="https://www.blef.fr/content/images/size/w600/2022/09/image-5.png 600w, https://www.blef.fr/content/images/size/w1000/2022/09/image-5.png 1000w, https://www.blef.fr/content/images/size/w1600/2022/09/image-5.png 1600w, https://www.blef.fr/content/images/size/w2400/2022/09/image-5.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>The newsletter feels like a bullet point collection these days (<a href="https://unsplash.com/photos/RLw-UC03Gwc?ref=blef.fr">credits</a>)</figcaption></figure><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li>✨ <a href="https://twitter.com/teej_m/status/1567622047739805696?ref=blef.fr">Funnel analysis, a presentation from the Snowflake Summit</a> — If you work in data you have written a funnel analysis at least once. Teej made a great presentation on how you can do it in Snowflake. It compares 3 methods: joins, windows and regexp, and it's clever.</li><li><a href="https://github.com/axa-group/Parsr?ref=blef.fr">Parsr</a> — An open-source document data extraction toolchain. With Parsr you can clean, parse and extract data from images, PDF, docx and eml files.</li><li><a href="https://blog.devgenius.io/metrics-of-a-data-platform-560aee4239d6?ref=blef.fr">Metrics of a data platform</a> — A long list of metrics you can track when running a data platform. If you are just starting, don't try to implement them all at once, do it incrementally. I really like the survey metrics like <em>Ease of getting data</em> and the <em>P90/Time to accommodate</em>. 
They represent well the areas where a data engineering team should perform.</li><li><a href="https://www.plural.sh/blog/pros-and-cons-of-kubernetes/?ref=blef.fr">The pros and cons of Kubernetes</a> — I hate working on Dockerfiles and YAML files. It's an infinite loop.</li><li><a href="https://blog.coinbase.com/building-a-python-ecosystem-for-efficient-and-reliable-development-d986c97a94a0?ref=blef.fr">Building a Python ecosystem for efficient and reliable development</a> — How Coinbase used Pants to develop a complete build system.</li><li><a href="https://pipebird.com/blog/why-every-saas-company-will-offer-native-data-pipelines?ref=blef.fr">Why every SaaS company will offer native data pipelines</a> — This is about a trend. I think data connectors are one of the hardest data businesses to run. So many competitors, so many open-source ways to do it, and as it's pipelines, random issues will pop up every day. Reversing the logic and saying "you have my data, so it's up to you to push it to me" can fix this, but tools will need help to do it.</li><li><a href="https://michaelberk.medium.com/how-to-automate-your-data-infrastructure-with-code-751b96355665?ref=blef.fr">Terraform 101</a> — A well-written post about what Terraform is, for beginners.</li><li><a href="https://fly.io/blog/sqlite-virtual-machine/?ref=blef.fr">How the SQLite virtual machine works</a> — I already spent too much time on this edition, and I have not read this article yet, but I want to.</li><li><a href="https://wrongbutuseful.substack.com/p/deciding-if-a-data-leadership-role?ref=blef.fr">Deciding if a data leadership role is something you actually want to do</a></li></ul><hr><p>See you next week.</p> ]]>
                    </content>
                </item>

    </channel>
</rss>
