Hello you, I hope this new Data News finds you well. After last week's question about a paying subscription I got a lot of feedback, and it helped me realise how you see the newsletter and what it means to you. So thank you for that. I'll keep thinking about it over the following weeks to figure out where I'm going for the third year of the newsletter and the blog.
Stay tuned and let's jump to the content.
This week I've published a compact article about how to get started with dbt. The idea behind this article is to define every dbt concept and object, from the CLI to Jinja templating, models, and sources. The article is written as something you can add to your own internal dbt onboarding process for every newcomer.
Machine Learning Saturday 🤖
- How BlaBlaCar leverages machine learning to match passengers and drivers — BlaBlaCar is a carpooling company, and in this article they detail what they did, in terms of machine learning, to improve trip listings with a Boost feature that proposes detours to drivers in order to cover more countryside cities. It does not involve any generative AI but nicely shows how machine learning can tackle business problems.
- Sharing LinkedIn’s Responsible AI Principles — A very short article that lists the 5 principles LinkedIn aims to follow. In a nutshell: AI should be used as a tool to empower members and augment their success, while prioritising trust, privacy, security, and fairness, providing transparency in AI usage, and putting the right governance in place to maintain accountability over AI algorithms.
- Designing a regional experiment to measure incrementality — The Monzo team ran a geographical experiment to understand how much incremental value their referral program actually drives.
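To make the BlaBlaCar Boost idea above concrete, here is a toy sketch (not BlaBlaCar's actual algorithm; the cities, coordinates, and threshold are all made up): propose a detour to the driver only when the extra distance stays under an acceptable limit.

```python
import math

def haversine_km(a, b):
    # Great-circle distance in km between two (lat, lon) points.
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(h))

def detour_km(origin, destination, pickup):
    # Extra kilometres if the driver passes through pickup instead of driving direct.
    direct = haversine_km(origin, destination)
    via_pickup = haversine_km(origin, pickup) + haversine_km(pickup, destination)
    return via_pickup - direct

# Hypothetical trip: Paris -> Lyon with a passenger waiting in Auxerre.
paris, lyon, auxerre = (48.86, 2.35), (45.76, 4.84), (47.80, 3.57)
extra = detour_km(paris, lyon, auxerre)
propose_detour = extra <= 25  # made-up acceptable-detour threshold, in km
```

A real matching system would use road distances and many more signals, but the trade-off it arbitrates is the same.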
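The core mechanics of a regional experiment like Monzo's can be sketched in a few lines (hypothetical numbers, not Monzo's data): compare mean signups in regions where the program ran against comparable regions where it was paused.

```python
from statistics import mean

# Hypothetical weekly signups per region (made-up numbers, not Monzo's data).
treatment = [120, 135, 128, 142]  # regions where the referral program kept running
control = [100, 110, 104, 118]    # comparable regions where it was switched off

lift = mean(treatment) - mean(control)  # absolute incremental signups per region
relative_lift = lift / mean(control)    # incrementality as a ratio of the baseline
```

A real analysis would also check that the regions are comparable pre-experiment and that the lift is statistically significant, but the difference in means is the heart of it.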
Fast News ⚡️
- Writing well: a data engineer’s advantage — Writing is probably an overlooked part of the data engineer toolkit, but it is an essential skill. Luuk gives a few pieces of advice on how to improve your email communications with coworkers, whether you're announcing a new release or asking for budget for a refactoring project.
- Here’s why your efforts to extract value from data are going nowhere — If data science is “making data useful,” then data engineering is “making data usable.” This is a quote from Cassie's article which I find awesome. Still, in order to make data work we also need to praise the data coworkers who take on documentation and all the governance burden that no one wants to do.
- Understanding slowly changing dimensions (SCD) in data warehousing — SCD modeling is an old technique but more and more relevant today, as we need to keep track of changes in our data over time. The article covers 6 types of SCDs. I think SCD type 2 is the most common and lossless one, but the others are worth mentioning. As a side note, if you want to quickly understand what SCDs are, the dbt snapshots documentation page is the best place to start.
- How to run dbt with BigQuery in GitHub Actions — When you're starting with dbt you don't need an orchestrator or dbt Cloud; a CI/CD pipeline does the job. This article gives you the GitHub Action you need to set it up.
- Snowflake: query acceleration service — Snowflake introduced a boost that you activate with a flag at warehouse creation (in Snowflake, a warehouse is the compute isolation your queries run in; the bigger the warehouse, the more compute you use and pay for). With the query acceleration service enabled, whenever Snowflake thinks a query can be accelerated it launches more compute than your warehouse actually specifies. Unrelated, but they also announced Snowpipe Streaming this week.
- Data ingestion pipeline with Operation Management — At Netflix they annotate videos, which can produce thousands of annotations, and they need to manage the annotation lifecycle each time the annotation algorithm runs. This article explains how they did it.
- Ensuring Data Consistency Across Replicas — Mixpanel details how they ensure that Kafka consumers in different zones write data in the same manner. This way, when one zone is unavailable they can serve the data from the other zone without any duplicated or missing messages.
- Pandas 2.0.0 — A new major pandas release is out. In the shadow of Polars, which seems to be revolutionising DataFrame computation, pandas arrives with a lot of optimisations and changes.
- AWS Lambdas are still on Python 3.9 — Corey rants about AWS Lambda still supporting only Python 3.9 while all the competition has upgraded to at least Python 3.10.
- A small heads-up: the Apache Airflow team has announced the Airflow Summit for 2023, which will be held in Toronto in September. They recently opened the call for presentations.
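The SCD type 2 idea mentioned above can be sketched in plain Python (a toy version of what dbt snapshots automate; the customer data is made up): instead of overwriting a changed attribute, close the current row and open a new one, so no history is lost.

```python
from datetime import date

def scd2_upsert(history, key, attrs, today):
    # Find the currently open row for this key (valid_to is None).
    current = next(
        (row for row in history if row["key"] == key and row["valid_to"] is None),
        None,
    )
    if current and current["attrs"] == attrs:
        return history  # nothing changed: keep the open row as-is
    if current:
        current["valid_to"] = today  # close the previous version
    # Open a new version; old rows are kept, so no history is lost.
    history.append({"key": key, "attrs": attrs, "valid_from": today, "valid_to": None})
    return history

history = []
scd2_upsert(history, "cust_1", {"city": "Paris"}, date(2023, 1, 1))
scd2_upsert(history, "cust_1", {"city": "Lyon"}, date(2023, 3, 1))  # the customer moved
```

After the move you keep two rows: the Paris row closed on 2023-03-01 and a new open Lyon row, which is exactly the lossless property of type 2.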
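The cross-replica consistency idea from the Mixpanel article can be sketched as deterministic, idempotent writes (a toy model, not Mixpanel's implementation): if the row key depends only on the message, every zone converges to the same data whatever the delivery order.

```python
def apply_messages(messages):
    # Store keyed by message id: the key depends only on the message itself,
    # so every replica that sees the same messages, in any order and with
    # duplicate deliveries, converges to exactly the same store.
    store = {}
    for msg in messages:
        store.setdefault(msg["id"], msg)  # duplicate deliveries become no-ops
    return store

msgs = [
    {"id": "m1", "event": "signup"},
    {"id": "m2", "event": "purchase"},
    {"id": "m1", "event": "signup"},  # the same message delivered twice
]
zone_a = apply_messages(msgs)
zone_b = apply_messages(list(reversed(msgs)))  # another zone, different order
```

Both stores end up identical, which is why a reader can fail over to the other zone without seeing duplicates or gaps.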
Data Economy 💰
- Qwak raises $12m Series A. Are ducks the new elephants? Qwak offers an all-in-one platform to manage every operation in a machine learning project. In the platform you do the feature engineering, model creation, versioning, deployment, and monitoring, with all pipelines automated. I think a lot of platforms like this exist today.
- Announcing Tabular — Tabular launched publicly this week. Tabular is a cloud offering built on Apache Iceberg. It's funny to look at their positioning: they offer "managed data warehouse storage", which means storage without the compute; you bring your own compute. Some companies would also call this a lakehouse or a data lake, but the shift in wording is interesting to notice. At least for me.
- Insights from new data and AI Pegacorns — Ben from GradientFlow shares a few economic insights about the data Pegacorns (companies with more than $100m in annual revenue). I don't have much to say except that next year we'll probably see generative AI companies on track to enter the selection.
See you next week ❤️.