<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
    <channel>
        <title><![CDATA[ blef.fr ]]></title>
        <description><![CDATA[ I put words on data engineering. ]]></description>
        <link>https://www.blef.fr</link>
        <atom:link href="https://www.blef.fr" rel="self" type="application/rss+xml"/>


                <item>
                    <title><![CDATA[ Data News — Week 25.43 ]]></title>
                    <description><![CDATA[ Data News #25.43 — A best-of of the last 6 months of articles: AI and data eng stuff that happened. ]]></description>
                    <link><![CDATA[ /data-news-week-25-43/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 68f3cdbc197a63000182f6da ]]></guid>
                    <pubDate><![CDATA[ 2025-10-26 ]]></pubDate>
                    <content>
                        <![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/photo-1471229058801-75ee9a43ef35?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDN8fHN1dHJvJTIwdG93ZXJ8ZW58MHx8fHwxNzYxNTAzOTU5fDA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=2000" class="kg-image" alt="tower surrounded by clouds" loading="lazy" width="4290" height="2802" srcset="https://images.unsplash.com/photo-1471229058801-75ee9a43ef35?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDN8fHN1dHJvJTIwdG93ZXJ8ZW58MHx8fHwxNzYxNTAzOTU5fDA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=600 600w, https://images.unsplash.com/photo-1471229058801-75ee9a43ef35?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDN8fHN1dHJvJTIwdG93ZXJ8ZW58MHx8fHwxNzYxNTAzOTU5fDA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1000 1000w, https://images.unsplash.com/photo-1471229058801-75ee9a43ef35?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDN8fHN1dHJvJTIwdG93ZXJ8ZW58MHx8fHwxNzYxNTAzOTU5fDA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1600 1600w, https://images.unsplash.com/photo-1471229058801-75ee9a43ef35?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDN8fHN1dHJvJTIwdG93ZXJ8ZW58MHx8fHwxNzYxNTAzOTU5fDA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=2400 2400w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Stand out from the cloud (</span><a href="https://unsplash.com/?utm_source=ghost&utm_medium=referral&utm_campaign=api-credit"><span style="white-space: pre-wrap;">Credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p>Hey you. It's been a while! The newsletter is back. So, expect Data News to land in your inbox every week between Friday and Sunday. 
Same recipe as before: a bunch of links about data and AI, topped with my usual spicy opinions.</p><p>Below is a best-of of the last months of Data News, mainly the best articles about the AI and data ecosystem that I've come across. It's a great reading list.</p><h1 id="ai-news-%F0%9F%A4%96">AI News 🤖</h1><ul><li>The consumer AI companies are working on changing the way we browse and consume the internet<ul><li>OpenAI brought new integrations to ChatGPT like <a href="https://openai.com/index/buy-it-in-chatgpt/?ref=blef.fr">shopping</a> and <a href="https://investor.coursera.com/news/news-details/2025/Coursera-Partners-with-OpenAI-to-Bring-Learning-Capabilities-into-the-First-Generation-of-Apps-in-ChatGPT/default.aspx?ref=blef.fr">courses</a>. This is a new way to consume the web: ChatGPT shopping will be a way to monetise, but also a paradigm shift in how we use the internet. OpenAI is trying to rebuild the web from within a chat, shortcutting the browser. But they also released a browser this week, named <a href="https://openai.com/index/introducing-chatgpt-atlas/?ref=blef.fr">Atlas</a>.</li><li>Browsers are getting more and more AI capabilities. Whether it's <a href="https://www.diabrowser.com/?ref=blef.fr">Dia</a> or <a href="https://www.perplexity.ai/comet?ref=blef.fr">Comet</a>, the goal is to give AI browsing capabilities as if it were human. Might this be a transitional phase until the whole web gets destroyed <a href="https://blog.cloudflare.com/introducing-pay-per-crawl/?ref=blef.fr">because</a> of <a href="https://www.zdnet.com/article/cloudflare-just-changed-the-internet-and-its-bad-new-for-the-ai-giants/?ref=blef.fr">bots</a> and regular websites disappear?</li></ul></li><li><a href="https://magazine.sebastianraschka.com/p/from-gpt-2-to-gpt-oss-analyzing-the?ref=blef.fr">From GPT-2 to gpt-oss: analysing the architectural advances</a> — A great deep-dive to understand the architecture behind all the GPTs. 
What layers are used, how gpt-oss compares with Qwen, etc. If you speak French, Defend Intelligence <a href="https://www.youtube.com/watch?v=v5JPwgLKb4Q&ref=blef.fr">redeveloped a GPT</a> for a YouTube video.</li><li><a href="https://www.mechanize.work/blog/the-upcoming-gpt-3-moment-for-rl/?ref=blef.fr">The upcoming GPT-3 moment for RL</a> — A short essay about the current state of reinforcement learning, which needs to move beyond being task-specific in order to scale. Scaling RL would require something like replication training: a set of specs to reproduce complex RL scenarios.</li><li><a href="https://stackoverflow.blog/2025/09/10/ai-vs-gen-z/?ref=blef.fr">AI vs Gen Z</a> — How AI has changed the career pathway for junior developers. It was posted on the Stack Overflow blog, which ironically has also been heavily impacted by AI over the last 2 years. It describes the current situation well: being a junior developer was already difficult and AI made it worse (a 25% decrease in junior job postings in 2024), and employment for software engineers has dropped nearly 20% since its 2022 peak. <br><br>After years of seeing software engineering as a promising career, AI is changing everything: we don't learn as much as before, we don't need interns or juniors, and salaries might decrease if the job becomes less complex. But if you don’t hire junior developers, someday you’ll have no senior developers.</li><li><a href="https://engineering.zalando.com/posts/2025/09/dead-ends-or-data-goldmines-ai-powered-postmortem-analysis.html?ref=blef.fr">Use LLMs to analyse postmortems at Zalando</a> — Large companies often have a large backlog of postmortems (memos written after incidents), and analysing them might be a great use of AI. 
They designed a multi-stage pipeline: summarisation, classification, analysis, patterns and opportunities.</li><li><a href="https://jeremyberman.substack.com/p/how-i-got-the-highest-score-on-arc-agi-again?ref=blef.fr">How I got the highest score on ARC-AGI again swapping Python for English</a> — ARC-AGI is a benchmark, an intelligence test designed to measure pattern recognition over puzzles that humans can easily solve.<br><br>Currently a human panel scores 98%, while GPT-5 Pro scores 18%. The author of the article scored 29% by switching from code to English.</li><li><a href="https://www.praf.me/ai-coding?ref=blef.fr">An unusual consequence of AI coding</a> — <em>"What AI coding has taken away is the time where you know exactly what you want to implement and have a rough mental model of how to do it [...] There was a beauty and joy to this part that I miss, a flow state you can hit with a nice linear progression"</em>. Probably what factory workers might have said when their factories got automated? We no longer have to think the way we used to. <br><br>Related: <a href="https://kix.dev/dumb-cursor-is-the-best-cursor/?ref=blef.fr">Dumb Cursor is the best Cursor</a>.</li><li><a href="https://damek.github.io/random/basic-facts-about-gpus/?ref=blef.fr">Basic facts about GPUs</a> — Explains how GPU compute and memory work, and the different performance regimes: memory-bound, compute-bound and overhead.</li><li><a href="https://blog.trailofbits.com/2025/08/21/weaponizing-image-scaling-against-production-ai-systems/?ref=blef.fr">Prompt injection attacks through images</a> — Hide text in an image so that it becomes readable when the image gets downsampled or filtered. 
If an LLM interprets this text, it's an attack surface whenever people add images to their chat conversations.</li><li><a href="https://about.datnguyen.de/blog/internal/context-engineering-modern-llm-ecosystem/?ref=blef.fr">Context Engineering: How RAG, agents, and memory make LLMs actually useful</a> and <a href="https://thenewaiorder.substack.com/p/learn-agentic-ai-a-beginners-guide?ref=blef.fr">Learn Agentic AI: A Beginner’s Guide to RAG, MCPs, and AI Agents</a> — Two guides to explore agentic concepts.</li><li><a href="https://www.databricks.com/blog/building-state-art-enterprise-agents-90x-cheaper-automated-prompt-optimization?ref=blef.fr">Use GEPA automated prompt optimisation to surpass Claude Opus 4.1</a> —&nbsp;Databricks achieved great performance after prompt optimisation on gpt-oss-120b.</li><li>[study] <a href="https://andymasley.substack.com/p/a-cheat-sheet-for-conversations-about?ref=blef.fr">Using ChatGPT is not bad for the environment</a> — A cheat sheet about carbon emissions related to LLMs. </li><li>[paper] <a href="https://arxiv.org/pdf/2505.23836?ref=blef.fr">Large Language Models often know when they are being evaluated</a>.</li><li>[podcast] <a href="https://www.youtube.com/watch?v=RqWIvvv3SnQ&ref=blef.fr">How GPT-5 thinks</a> — From OpenAI’s VP of Research Jerry Tworek. He explains how reasoning works.</li><li>[paper] <a href="https://cdn.openai.com/pdf/a253471f-8260-40c6-a2cc-aa93fe9f142e/economic-research-chatgpt-usage-paper.pdf?ref=blef.fr">How people use ChatGPT</a> — OpenAI ran a classifier on 1.1m sample conversations to understand how their 800m+ weekly active chatters are using the AI. 
It shows how broadly people use AI in their everyday lives.</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2025/10/Screenshot-2025-10-26-at-09.57.08.png" class="kg-image" alt="" loading="lazy" width="1212" height="604" srcset="https://www.blef.fr/content/images/size/w600/2025/10/Screenshot-2025-10-26-at-09.57.08.png 600w, https://www.blef.fr/content/images/size/w1000/2025/10/Screenshot-2025-10-26-at-09.57.08.png 1000w, https://www.blef.fr/content/images/2025/10/Screenshot-2025-10-26-at-09.57.08.png 1212w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Breakdown of granular conversation topic shares from approximately 1.1 million sampled conversations from May 15, 2024 through June 26, 2025 (extracted from the paper How people use ChatGPT).</span></figcaption></figure><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li><a href="https://www.youtube.com/watch?v=GfH4QL4VqJ0&ref=blef.fr">Python, the documentary</a> — A great documentary about Python and its origins: how the initial community was built and what it takes to create such a widely used piece of open-source software. Python scores 25% on the popularity index (<a href="https://www.tiobe.com/tiobe-index/?ref=blef.fr">TIOBE</a>), while the numbers 2 and 3, C and C++, are at 9% each.</li><li>The Apache Airflow Summit took place a few weeks ago; the videos are not out yet, but Marc Lamberti shared a few takeaways on LinkedIn, like how <a href="https://www.linkedin.com/feed/update/urn:li:activity:7381937258410455040/?updateEntityUrn=urn%3Ali%3Afs_updateV2%3A%28urn%3Ali%3Aactivity%3A7381937258410455040%2CFEED_DETAIL%2CEMPTY%2CDEFAULT%2Cfalse%29&ref=blef.fr">Duolingo</a> is using Airflow. 
<a href="https://airflow.apache.org/blog/airflow-3.1.0/?ref=blef.fr">Airflow 3.1</a> has also been released.</li><li>Python 3.14 is out — it gets a natural <a href="https://blog.miguelgrinberg.com/post/python-3-14-is-here-how-fast-is-it?ref=blef.fr">performance uplift</a> and paves the way for the <a href="https://realpython.com/python-gil/?ref=blef.fr">GIL</a> changes.</li><li>Astral's innovations in the Python ecosystem — Astral is changing Python tooling forever with greatly crafted products. Recently they released:<ul><li><a href="https://github.com/astral-sh/ty?ref=blef.fr">astral/ty</a> — A Python type checker, written in Rust (obviously), that runs faster than anything else.</li><li>uv — <a href="https://pydevtools.com/blog/uv-format-code-formatting-comes-to-uv-experimentally/?ref=blef.fr">uv format</a> (which might replace black). And, funny thing, someone <a href="https://mildbyte.xyz/blog/solving-wordle-with-uv-dependency-resolver/?ref=blef.fr">solved Wordle using uv</a>'s dependency resolver.</li><li><a href="https://astral.sh/blog/introducing-pyx?ref=blef.fr">pyx</a> — If you need a private package registry, Astral created pyx. It might be their way to make money at the Enterprise level to keep working on this great tooling.</li></ul></li><li><a href="https://luminousmen.substack.com/p/how-not-to-partition-data-in-s3-and?ref=blef.fr">How not to partition data in S3 and what to do instead</a> — When you need to partition by date on S3, you should partition using the <code>YYYY-MM-DD</code> format.</li><li><a href="https://clickhouse.com/blog/moosestack-does-olap-need-an-orm?ref=blef.fr">Does OLAP need an ORM?</a>&nbsp;— Great question. An ORM can bring type-safety to SQL generation because database objects are translated into the native programming language. This way, AI knows the types when generating objects and might know if something will fail before it hits the database. 
As chat-with-your-data gets tried at more and more companies, this is maybe a requirement we actually need.</li><li>Some news about the Iceberg / lakehouse ecosystem.<ul><li><a href="https://github.com/ClickHouse/ClickHouse/pull/82692?ref=blef.fr">ClickHouse</a> and <a href="https://ducklake.select/2025/09/17/ducklake-03/?ref=blef.fr">DuckLake</a> now support writing to Iceberg.</li><li><a href="https://maxhalford.github.io/blog/ducklake-thoughts/?ref=blef.fr">Thoughts on DuckLake</a> — Max explains why DuckLake might be a big thing when it comes to improving the local developer experience, as DuckLake can make DuckDB function as a data warehouse. Imagine if, while developing, you could run your usual BigQuery pipelines locally on the production data (that is available on GCS).</li><li><a href="https://blog.cloudflare.com/cloudflare-data-platform/?ref=blef.fr">Cloudflare data platform</a> — Cloudflare announced their lakehouse platform based on <a href="https://www.cloudflare.com/developer-platform/products/r2/?ref=blef.fr">R2</a> (S3-compatible storage). They released R2 Catalog (a fully managed Iceberg catalog) and R2 SQL. R2 SQL <a href="https://blog.cloudflare.com/r2-sql-deep-dive/?ref=blef.fr">relies on Apache DataFusion</a>.</li><li><a href="https://tobilg.com/the-age-of-10-dollar-a-month-lakehouses?ref=blef.fr">The age of the 10$ lakehouse</a> — A great deep-dive into the combination of the 2 previous bullet points. It's awesome to see this new kind of data platform. Back then, they moved away from Fivetran + Snowflake to CDC with Debezium + Hudi (an Iceberg alternative). 
</li><li><a href="https://www.notion.com/blog/building-and-scaling-notions-data-lake?ref=blef.fr">Building and scaling Notion's data lake</a> — An old article about how Notion structured their data lake.</li></ul></li><li><a href="https://perspectives.datainstitute.io/the-minimalists-data-stack-19b0a0aeef3e?ref=blef.fr">The minimalist data stack</a> — A 5-part article describing a dltHub + dbt + BigQuery data stack.</li><li>If you missed it, Fivetran and dbt Labs are merging; <a href="https://www.blef.fr/data-news-dbt-coalesce-2025/">here are my thoughts</a>.</li><li><a href="https://medium.com/@hugo.hauraix/redefining-analytics-roles-at-decathlon-aligning-skills-and-practices-for-future-ready-insights-9abfc00d01b1?ref=blef.fr">Redefining analytics roles and aligning skills and practices for future-ready insights</a> — How to rebalance the skills and responsibilities when analytics engineering becomes a bottleneck.</li><li><a href="https://www.datacult.com/post/the-data-modeling-framework-every-analytics-engineer-should-know?ref=blef.fr">Data modeling framework</a> + <a href="https://www.dataengineeringweekly.com/p/revisiting-medallion-architecture-760?ref=blef.fr">revisiting medallion architecture</a> — Would you take a bit of data modeling content?</li><li><a href="https://medium.com/doctolib/analytics-at-scale-the-frameworks-behind-monitoring-100-features-1-2-1012d3c0bbd3?ref=blef.fr">Analytics at scale</a> — How to do product analytics at scale when tens of new features are released every week and product teams want to understand what's happening. 
The article shares the organisation Doctolib implemented and the data modeling that was put in place to make it work.</li><li><a href="https://medium.com/blablacar/scaling-success-the-dbt-ecosystem-at-blablacar-c214c4b8f0cb?ref=blef.fr">Scaling Success: The dbt ecosystem at BlaBlaCar</a> — What a team of 45+ engineers had to put in place to make their dbt setup work for everyone: dev containers + extensions + a few dbt packages. If you want the same setup without doing anything, you can use <a href="https://getnao.io/?ref=blef.fr">nao</a>.</li><li><a href="https://netflixtechblog.medium.com/data-as-a-product-applying-a-product-mindset-to-data-at-netflix-4a4d1287a31d?ref=blef.fr">Data as a product</a>, applying a product mindset to data at Netflix.</li><li><a href="https://www.linkedin.com/feed/update/urn:li:activity:7364294652977364993/?updateEntityUrn=urn%3Ali%3Afs_updateV2%3A%28urn%3Ali%3Aactivity%3A7364294652977364993%2CFEED_DETAIL%2CEMPTY%2CDEFAULT%2Cfalse%29&ref=blef.fr">RIP Tableau</a> — 2 months ago Voi killed Tableau and switched to an LLM as a bridge in Slack and Sheets to accomplish what was possible in Tableau before. It required an effort on metrics definitions though.</li><li><a href="https://towardsdatascience.com/why-bi-in-the-ai-age/?ref=blef.fr">Why BI in the AI age</a> — "<em>Great analytics isn’t about generating charts quickly, it’s about building confidence in decisions through rigorous investigation of data. 
Every discovery, design choice, and contextual annotation represents a human analyst’s business intelligence."</em></li><li><a href="https://www.danhock.co/p/vibe-analysis?ref=blef.fr">Vibe Analysis</a> — The other side of the coin.</li><li><a href="https://www.counting-stuff.com/doing-sql-work-with-llm-aids-as-a-sql-addict/?ref=blef.fr">Doing SQL work with LLM aids as a SQL addict</a>.</li></ul><p>I'll be speaking at <a href="https://odsc.ai/?ref=blef.fr">ODSC AI</a> next Tuesday about <em>Building AI Agents is data engineering</em>.</p><hr><p>See you next week!</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — dbt Coalesce 2025 ]]></title>
                    <description><![CDATA[ Data News — dbt Coalesce 2025. Thoughts on the Fivetran + dbt Labs merger, what it means for the data ecosystem, and more. ]]></description>
                    <link><![CDATA[ /data-news-dbt-coalesce-2025/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 68f5412959b34a00012925ff ]]></guid>
                    <pubDate><![CDATA[ 2025-10-20 ]]></pubDate>
                    <content>
                        <![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/photo-1605833556294-ea5c7a74f57d?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDN8fHZlZ2FzfGVufDB8fHx8MTc2MDkxNTQzM3ww&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=2000" class="kg-image" alt="welcome to fabulous las vegas nevada signage" loading="lazy" width="4288" height="2848" srcset="https://images.unsplash.com/photo-1605833556294-ea5c7a74f57d?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDN8fHZlZ2FzfGVufDB8fHx8MTc2MDkxNTQzM3ww&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=600 600w, https://images.unsplash.com/photo-1605833556294-ea5c7a74f57d?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDN8fHZlZ2FzfGVufDB8fHx8MTc2MDkxNTQzM3ww&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1000 1000w, https://images.unsplash.com/photo-1605833556294-ea5c7a74f57d?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDN8fHZlZ2FzfGVufDB8fHx8MTc2MDkxNTQzM3ww&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1600 1600w, https://images.unsplash.com/photo-1605833556294-ea5c7a74f57d?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDN8fHZlZ2FzfGVufDB8fHx8MTc2MDkxNTQzM3ww&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=2400 2400w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Vegas baby (</span><a href="https://unsplash.com/?utm_source=ghost&utm_medium=referral&utm_campaign=api-credit"><span style="white-space: pre-wrap;">Credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p></p><p>Hey here. I hope this email finds you well. My dear Data News has been a bit neglected these past few months—I’ve been busy with my other gig (that you might <a href="https://getnao.io/?ref=blef.fr">nao</a>). But don’t think I forgot you. 
Every Friday, I thought of you and this little corner of data passion we share.</p><p>I’ve decided the weekly write-ups are coming back—they have to. Earlier this year, I went through YC, and the aftermath took up way more of my time than I expected (especially the unplanned detour into hiring). But that’s over now. It’s time to get back to basics.</p><p>So, expect Data News to land in your inbox every week between Friday and Sunday. Same recipe as before: a bunch of links about data and AI, topped with my usual spicy opinions.</p><p>To reboot the machine this week, I’m sharing my take on <a href="https://coalesce.getdbt.com/event/21662b38-2c17-4c10-9dd7-964fd652ab44/summary?ref=blef.fr"><em>dbt Coalesce</em></a> and the <a href="https://www.getdbt.com/blog/dbt-labs-and-fivetran-merge-announcement?ref=blef.fr">dbt Labs + Fivetran</a> merger that caught everyone by surprise over the past few weeks. Next week will be a “best of” edition—a curated collection of the most interesting articles from the last six months. It’s a great one, don't miss it.</p><h1 id="a-bit-of-history">A bit of history</h1><p>If you were living in a cave last month, you might have missed some big news. But before jumping in, a bit of history.</p><p>I started my journey in data back in 2014, at the height of the Big Data era—when Hadoop was on everyone’s lips and companies were throwing hundreds of thousands of euros at infrastructure, teams, and software. It feels like another lifetime, when building a recommendation system was a multi-month, six-figure project.</p><p>But that wasn’t even the beginning. The story starts in the late ’70s and early ’80s when the term <em>data warehouse</em> was coined (did you know Excel was created in 1985??). From early Oracle data warehouses to Hadoop, one pattern stayed the same: these tools were painful to use. 
Getting into data required obscure knowledge you couldn’t find in school, and the technology itself was… well, a bit of a nightmare (or a Java nightmare).</p><p>Then came AWS and the cloud, which simplified a lot of what we were doing. BigQuery made it even easier: just throw in your data, query it, and pay per query. In 2018, I migrated a 3 TB exploding Postgres warehouse to BigQuery, cutting query times from hours to seconds. Everything ran through Airflow, orchestrating extraction and transformation. Like thousands of others, I had unknowingly built a quasi-dbt.</p><p>At that time, Airflow was the glue. Every issue, every new need meant extending our internal Airflow framework — even reverse ETL was just another DAG in our dynamic DAG factory. Then, after being laid off following an <a href="https://techcrunch.com/2020/04/16/kapten-merges-with-parent-company-free-now-starts-restructuring-plan/?ref=blef.fr">acquisition</a>, I went freelance and worked on my first dbt project. At first, I wasn’t convinced — then it clicked. It was exactly what I’d built internally, but open-source, standardised, and ready to become the industry norm. It empowered less-technical users while letting data engineers focus on keeping the platform running.</p><p>dbt also helped make SQL-first thinking mainstream. For years, it was SQL data engineers vs. JVM data engineers. The former chilling, the latter raging about our pipelines not being type-safe. Then came the “Modern Data Stack”: ingest with a paid tool (and a few custom scripts when it fails), transform with dbt on your warehouse, and visualize with two BI tools — Tableau for execs, Metabase for everyone else. We’d unbundled Airflow into a set of specialized tools.</p><p>It was a great run. But the fun had to end sometime. After a decade of building the foundations of the modern data stack, did we finally get it right? So that when AI arrived, we could just plug it in and have it magically work? 
Read clean metric definitions from a central warehouse and deliver a single, shared version of “revenue”?</p><p>We <em>did</em> nail that… right? Right?</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2025/10/IMG_7123.jpg" class="kg-image" alt="" loading="lazy" width="2000" height="1105" srcset="https://www.blef.fr/content/images/size/w600/2025/10/IMG_7123.jpg 600w, https://www.blef.fr/content/images/size/w1000/2025/10/IMG_7123.jpg 1000w, https://www.blef.fr/content/images/size/w1600/2025/10/IMG_7123.jpg 1600w, https://www.blef.fr/content/images/size/w2400/2025/10/IMG_7123.jpg 2400w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Coalesce 2025.</span></figcaption></figure><h1 id="nobody-got-fired-for-choosing-dbt">Nobody got fired for choosing dbt</h1><p>I’d say AI has arrived and it’s forcing companies to move faster than ever. This is the era of building AI apps on top of AI, not the time to pour more resources into data pipelines. That’s why we’re seeing consolidation: companies want bundled services, a single invoice, fancy certifications, and the comfort of validation from expensive tools.</p><p>All of this sets the stage for where we are now. The Modern Data Stack has become the useful idiot of the moment, replaced by the Analytics and AI Stack. Because, of course, <a href="https://www.instagram.com/p/DCU9CCFT2Le/?ref=blef.fr">AI runs on data, data runs on dbt</a>.</p><p>Every year since 2020, dbt Labs has held an annual conference called Coalesce—not to be mixed up with <a href="https://coalesce.io/?ref=blef.fr">Coalesce.io</a> (one of their Enterprise competitors, and a <a href="https://www.govinfo.gov/content/pkg/USCOURTS-paed-2_22-cv-03324/pdf/USCOURTS-paed-2_22-cv-03324-0.pdf?ref=blef.fr">lawsuit</a> one at that). 
I covered the <a href="https://www.blef.fr/dbt-coalesce-takeaways/">2021</a> and <a href="https://www.blef.fr/dbt-coalesce-takeaways-2022/">2022</a> Coalesces from abroad, and this year I had the chance to go in person and live the hype. Here are my main takeaways:</p><ul><li><strong>The merge</strong> — Fivetran and dbt Labs are merging (an all-stock deal) and will provide a first-of-its-kind open data infrastructure (see below). Pay attention: the merged company will do a lot of things "open", but in their words open means open standards, not necessarily open-source.</li></ul><figure class="kg-card kg-image-card"><img src="https://www.blef.fr/content/images/2025/10/image.png" class="kg-image" alt="" loading="lazy" width="1974" height="1152" srcset="https://www.blef.fr/content/images/size/w600/2025/10/image.png 600w, https://www.blef.fr/content/images/size/w1000/2025/10/image.png 1000w, https://www.blef.fr/content/images/size/w1600/2025/10/image.png 1600w, https://www.blef.fr/content/images/2025/10/image.png 1974w" sizes="(min-width: 720px) 720px"></figure><ul><li><strong>The vision</strong> — The follow-up to the open data infra vision: their biggest competitors (Databricks, Snowflake, Fabric, BigQuery) are all selling storage and compute, but dbt Labs/Cloud/Platform is the only valid and trustworthy platform to do data on, because it doesn't vendor-lock you into a compute engine and provides open standards for all other parts of the stack. So you can switch to something else.</li><li><strong>dbt Fusion</strong> — There is an economic reality: dbt acquired SDF Labs to develop Fusion. The 2 main selling points of Fusion are cost cutting and a better developer experience. 
I guess it's schizophrenic to sell cost cutting (~50%) while selling compute.</li><li><strong>Open standards</strong> — dbt Labs is rooting for and supporting open standards<ul><li>Iceberg and the lake — with a Fivetran EL destination as a data lake and dbt capabilities to support Iceberg catalogs / adapters. Fivetran is now also a data lake company.</li><li>OSI — the <a href="https://www.snowflake.com/en/blog/open-semantic-interchange-ai-standard/?ref=blef.fr">Open Semantic Interchange</a>, to create a unified way to define metrics with other big companies (Snowflake, Salesforce, etc.) and to (re)open-source MetricFlow (after taking it private following the <a href="https://www.getdbt.com/blog/dbt-acquisition-transform?ref=blef.fr">Transform acquisition</a>)</li><li>SQL and ADBC — dbt obviously relies heavily on these</li><li>MCPs — because it's AI, the <a href="https://github.com/dbt-labs/dbt-mcp?ref=blef.fr">dbt MCP</a> provides new capabilities to do stuff with AI.</li></ul></li><li><strong>But what about open-source</strong> — This is our friend who stayed out of the party. They tried the magic trick of making us believe Fivetran is a true open-source contributor with 100+ open-source repos (they have 278, but their most-starred repo has 184 stars; even my <a href="https://github.com/Bl3f/yato?ref=blef.fr">yato</a> thing has more).<ul><li>dbt Language — it was the best demonstration of the Community keynote: what makes dbt dbt is the language, and the fact that everyone speaks the same way with a bunch of SQL and YAML files. It unified the way we define transformations. The dbt language is an open standard. 
It's a way to organise files that can be picked up by whatever engine; today we have dbt Core and dbt Fusion as engines.</li><li>They announced that dbt Core will be maintained for the foreseeable future while maintaining dbt Fusion at the same time, which means spending twice the effort to support language evolution in a codebase that wasn't meant for this.</li><li>Just remember that dbt Core is stupid as fuck: it's just a templating engine organising files in a DAG thanks to manually declared relationships, whereas Fusion understands SQL by parsing it.</li><li>dbt is the common language of 90,000 data teams around the world. That's a lot.</li></ul></li><li><strong>Coalesce</strong> — I was a bit disappointed by the quality of the talks at Coalesce this year. Some speakers didn’t seem to even be using dbt, while others gave entry-level presentations aimed at… well, I’m not sure who. In the past, I always learned something new from Coalesce, but this year felt like a turning point. The tool has gone mainstream, reaching Enterprise™ levels and drifting away from its original community of people hacking around dbt Core. That said, the people I met and the conversations I had were great.</li></ul><h1 id="conclusion">Conclusion</h1><p>I’ve been a dbt (Core) advocate for years. If you look through this blog, it’s probably the most mentioned technology here—alongside Airflow and DuckDB. Those three tools share something fundamental: they’re open-source and community-driven. In France, I helped run the Airflow community for a few years, later became known as a dbt expert, and at one point people even thought I worked for the ducks.</p><p>The reason I’ve spent the past eight years sharing and writing about these tools is simple: they were open-source. I was happy to give my time with no direct return because it felt like my own way of contributing back. But lately, something feels broken in my relationship with dbt. 
It’s not the merger itself—it’s the direction, the shift in strategy.</p><p>dbt Labs now seems focused on the Fortune 500. The new features aren’t made for someone like me anymore. Why would I need a drag-and-drop UI when that’s exactly what I tried to escape early in my career (hello, Talend)? Why would I pay $10,000 to run a simple SQL-only DAG? The new company’s focus just doesn’t speak to me as a data engineer.</p><p>Of course, as a founder, I understand why they’re doing it. They have to make money eventually, and I don’t have a solution to this. This is just my perspective.</p><p>And we shouldn’t forget SQLMesh—the only real open-source alternative to dbt Core—which quietly disappeared after an acquisition not long before all this. I can’t help but think that was part of a larger chess game, by Fivetran, to smooth the path for the dbt Labs deal and remove the one viable option that could have welcomed dbt users looking for an exile.</p><p>I bet the consolidation is not finished yet: it's either a bigger fish acquiring the new venture or dbtran acquiring a catalog and/or an orchestrator.
Dagster would be, I think, a match made in heaven.</p><hr><p>If you're looking for an exile, there are some alternatives when it comes to transformation: <a href="https://www.bauplanlabs.com/?ref=blef.fr">bauplan</a>, <a href="https://getbruin.com/?ref=blef.fr">bruin</a>, <a href="https://github.com/carbonfact/lea?ref=blef.fr">lea</a> and for ingestion: <a href="https://dlthub.com/?ref=blef.fr">dltHub</a> (and all the tools based on it).</p><h3 id="other-writers">Other writers</h3><ul><li>You can read Hugo's <a href="https://medium.com/@hugolu87/is-dbt-%EF%B8%8F-tbd-everything-you-need-to-know-post-coalesce-2025-02f93cbc19cc?ref=blef.fr">views on the matter here</a> — he goes a bit deeper into the Iceberg / open compute topic, which I squeezed a bit because this post is already too long.</li><li><a href="https://www.linkedin.com/posts/christopheblefari_wrap-up-of-day-1-of-dbt-coalesce-takeaways-activity-7384046814553014273-Vlnp/?ref=blef.fr">My views on LinkedIn after Day 1 of Coalesce</a>.</li><li><a href="https://benn.substack.com/p/in-the-air?ref=blef.fr">Benn's poetic views</a>.</li></ul><p>See you next week ❤️ (and this time it's real).</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Forward Data Conference + some news ]]></title>
                    <description><![CDATA[ Data News are coming back and Forward Data Conference CfP still open until next Sunday! ]]></description>
                    <link><![CDATA[ /forward-data-conference-some-news/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 686257b8dda8f90001acbb9d ]]></guid>
                    <pubDate><![CDATA[ 2025-06-30 ]]></pubDate>
                    <content>
<![CDATA[ <p>Hey Data News readers. Sorry for being absent for the last 2 months; I was in SF working on <a href="https://getnao.io/?ref=blef.fr">nao</a> because we went through <a href="https://www.ycombinator.com/?ref=blef.fr">Y Combinator</a>. To be honest it was an intense 3 months and an awesome experience. </p><p>Small heads-up: I'm organising the Forward Data Conference (2nd edition) on November 24th in Paris and we are cooking a great program!</p><p><strong>The </strong><a href="https://conference-hall.io/forward-data-conference-2025?ref=blef.fr"><strong>call for talk proposals</strong></a><strong> ends this Sunday (July 6th), so make sure to propose a talk this week if you wanna join this awesome moment! </strong>We are welcoming speakers of all levels for everything about data; English and French submissions are welcome.</p><p>We are also looking for sponsors to make this event awesome and unforgettable! We have announced <a href="https://omni.co/?ref=blef.fr">Omni</a> as our first platinum sponsor.</p><hr><p>And one last thing. Big news, starting this <strong>Friday Data News are coming back</strong>! Be ready, I miss you.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 25.15 ]]></title>
                    <description><![CDATA[ Data News #25.15 — Arrived in San Francisco, Llama 4 is out, reasoning hacks, MCP hype, Iceberg stuff. ]]></description>
                    <link><![CDATA[ /data-news-week-25-15/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 67f94fb975ec530001cce8e2 ]]></guid>
                    <pubDate><![CDATA[ 2025-04-14 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2025/04/image.png" class="kg-image" alt="" loading="lazy" width="900" height="600" srcset="https://www.blef.fr/content/images/size/w600/2025/04/image.png 600w, https://www.blef.fr/content/images/2025/04/image.png 900w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Painted ladies in SF (</span><a href="https://unsplash.com/photos/lined-of-white-and-blue-concrete-buildings-HadloobmnQs?ref=blef.fr"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p>Hey here. What's up? While you're all data vibing I'm sliding into your inbox with the fresh Data News of the last month.</p><p>I have moved to San Francisco for the next 3 months, so if you're in town and wanna talk data or go for a run, you know where to find me. It's been a week since we arrived with the <em>nao Labs</em> team in SF and it has been a blast. We will be at the <a href="https://www.datacouncil.ai/bay-2025?ref=blef.fr">Data Council</a> pitching at the <a href="https://www.datacouncil.ai/talks25/ai-launchpad-2025-nao?ref=blef.fr">AI Launchpad</a> on the 22nd.</p><p>I'm planning to create content for you to follow the Data Council from the inside, as it has always been great to write takeaways about the talks these last years (<a href="https://www.blef.fr/data-council-austin-takeaways/">2023</a> and <a href="https://www.blef.fr/data-news-week-24-20/">2024</a>).</p><h1 id="ai-news-%F0%9F%A4%96">AI News 🤖</h1><ul><li><a href="https://platform.openai.com/docs/guides/audio?ref=blef.fr">New OpenAI text-to-speech model</a> — OpenAI released a new text-to-speech model, available through their API; it looks better than the Whisper baseline. There is a <a href="https://www.openai.fm/?ref=blef.fr">demo</a> website which is quite impressive.
</li><li><a href="https://www.llama.com/?ref=blef.fr">Llama 4 is out</a> — Meta has released the new iteration of their open models. This time it includes 4 models:<ul><li>Llama 4 Scout, a small 17B model. Natively multimodal, it achieves an industry-leading 10M+ token context window and can also run on a single GPU.</li><li>Llama 4 Maverick, a multimodal model, beating GPT-4o and Gemini 2.0 Flash across a broad range of widely reported benchmarks. It can also run on a single host.</li><li>And soon they will release Llama 4 Behemoth (outperforming GPT4.5, Claude Sonnet 3.7, and Gemini 2.0 Pro on several STEM benchmarks) and Reasoning. </li><li>The cool part is that they partnered from the beginning with <a href="https://www.databricks.com/blog/introducing-metas-llama-4-databricks-data-intelligence-platform?ref=blef.fr">Databricks</a> and <a href="https://www.snowflake.com/en/blog/meta-llama-4-now-available-snowflake-cortex-ai/?ref=blef.fr">Snowflake</a> to bring LLMs to your data.</li></ul></li><li><a href="https://www.anthropic.com/research/reasoning-models-dont-say-think?ref=blef.fr">Reasoning models don't always say what they think</a> — A summary of research about the faithfulness of reasoning models. Anthropic discovered that these models, which are not really "reasoning" but rather doing Chain-of-Thought (CoT), are not always honest when they are given a hint (or hacked) about the answer they should provide.</li><li><a href="https://medium.com/gorgias-engineering/how-to-roll-out-a-data-conversational-agent-c6a4b600e4e5?ref=blef.fr">How to roll-out a data conversational agent?</a> — The Gorgias engineering team released a conversational agent to the company using <a href="https://www.getdot.ai/?ref=blef.fr">Dot</a>, an AI Slack bot that answers data warehouse questions.
The article explains super well how Dot fits in the data team and how you can evaluate the AI.</li><li>Google announced AI in their cloud products at Next'25<ul><li><a href="https://cloud.google.com/bigquery/docs/generate-table?ref=blef.fr">AI.GENERATE_TABLE</a> in BigQuery — The setup looks a bit weird because, in order for everything to be accessible, you have to have a model in a dataset and prompts in a table, but this is a great way to extract information from strings <a href="https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-generate-table?ref=blef.fr#example">as per the example</a>.</li><li><a href="https://cloud.google.com/blog/products/data-analytics/looker-bi-platform-gets-ai-powered-data-exploration?hl=en&ref=blef.fr">Talk to your data with Looker</a> — BI tools are already widely deployed in most organizations, and conversational analytics serves as another interface to access the same insights. That's why embedding this "talk to your data" capability within existing BI tools is likely the best path for adoption.
However, this approach might also introduce additional complexity, given that many BI tools are already somewhat cumbersome, cluttered repositories of charts.</li><li>Google announced their <a href="https://github.com/google/adk-python?ref=blef.fr">Agent Development Kit (ADK)</a> and an open protocol to enable communication between agentic apps called <a href="https://github.com/google/A2A?ref=blef.fr">Agent2Agent</a> (A2A).</li><li>They also announced a BigQuery AI engine that does something I don't understand completely: analysts could ask to extract info from an image and match it to a product catalog, and a copilot for Jupyter Notebooks (in Colab) — <a href="https://cloud.google.com/blog/products/data-analytics/data-analytics-innovations-at-next25?hl=en&ref=blef.fr">more about analytics announcements</a>.</li></ul></li><li><a href="https://petrjanda.substack.com/p/lessons-learned-from-building-agent?ref=blef.fr">Lessons learned from building agent that can code like Composer</a> — Petr wrote a great article that explains how coding agents work (which is dead simple). To build the simplest version of a coding agent you need to give it 3 capabilities: <em>list files</em>, <em>read file</em> and <em>write file</em>. With these and a few prompts you can have a working demo in a few minutes, but how do you move forward? He shared the 5 important lessons he got out of this.</li><li><a href="https://www.oxy.tech/blog/introducing-oxy-and-the-future-of-agentic-analytics?ref=blef.fr">Oxy, an open-source agentic analytics framework</a> — Robert and Joseph co-founded <a href="https://www.hyperquery.ai/?ref=blef.fr">Hyperquery</a> back in the day (at the peak of the notebook hype) and they are back with a new journey: Oxy. An open-source framework to create analytics workflows in a friendly way.
Looks promising.</li></ul><p></p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2025/04/image-1.png" class="kg-image" alt="" loading="lazy" width="900" height="506" srcset="https://www.blef.fr/content/images/size/w600/2025/04/image-1.png 600w, https://www.blef.fr/content/images/2025/04/image-1.png 900w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">4 lama (</span><a href="https://unsplash.com/photos/four-beige-camels-R0g6wtDN1M8?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p></p><h3 id="navigate-the-mcp-hype">Navigate the MCP hype</h3><p>If you've been on the internet lately you've surely seen the massive MCP hype: everyone is either building an MCP server, an MCP registry or even a <a href="https://mastra.ai/mcp-registry-registry?ref=blef.fr">registry of registries</a>.</p><p><em>But what's an MCP?</em></p><p>MCP means <a href="https://modelcontextprotocol.io/introduction?ref=blef.fr">Model Context Protocol</a> and is an open protocol created by folks at Anthropic. An MCP is most of the time referenced as a server that encapsulates discoverable tools, prompts and data to be used by an LLM. MCP clients sit on the LLM side and make requests to MCP servers.
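</p><p>Conceptually, the "discoverable tools" part can be sketched with a toy dispatcher (a stdlib-only illustration of the idea, not the official MCP SDK; the <code>tools/list</code> / <code>tools/call</code> method names mirror the protocol, everything else here is made up):</p>

```python
import json

# Toy tool catalog an "MCP server" could expose; the tool and its
# hard-coded answer are hypothetical, for illustration only.
TOOLS = {
    "get_table_metadata": {
        "description": "Return metadata for a table (toy, hard-coded data)",
        "handler": lambda table: {"table": table, "columns": ["id", "amount"]},
    },
}

def handle(request: str) -> str:
    """Dispatch a JSON request the way an MCP server dispatches JSON-RPC calls."""
    req = json.loads(request)
    if req["method"] == "tools/list":
        # The client discovers which tools exist before calling any of them.
        result = [{"name": n, "description": t["description"]} for n, t in TOOLS.items()]
    elif req["method"] == "tools/call":
        result = TOOLS[req["params"]["name"]]["handler"](**req["params"]["arguments"])
    else:
        result = {"error": "unknown method"}
    return json.dumps({"result": result})

# The "LLM side" first discovers tools, then invokes one:
print(handle('{"method": "tools/list"}'))
print(handle('{"method": "tools/call", "params": {"name": "get_table_metadata", "arguments": {"table": "orders"}}}'))
```

<p>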
</p><p>For instance there are a few Snowflake MCP servers; if you add them to Claude, you will be able to query Snowflake from a Claude prompt or get table metadata.</p><ul><li><a href="https://neo4j.com/blog/developer/model-context-protocol/?ref=blef.fr">Everything a Developer needs to know about the MCP</a> — If you want to deep-dive, there is this article about it.</li><li><a href="https://x.com/sama/status/1904957253456941061?ref=blef.fr">OpenAI supports MCP in the agents SDK</a> — The biggest news recently is OpenAI validating the protocol and supporting it (not yet in ChatGPT tho).</li></ul><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li><a href="https://discord.com/blog/overclocking-dbt-discords-custom-solution-in-processing-petabytes-of-data?ref=blef.fr">Overclocking dbt, Discord custom solution</a> — Discord's data platform is huge and they reached a few dbt limitations, especially on the backfilling side (which is not really a dbt strength), so they built their own way to overcome this leveraging the <code>meta</code> tag. They also managed to create isolated environments for the whole team and have a bunch of CI/CD jobs running on each PR validating their own internal rules.
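 The <code>meta</code> tag works for this kind of hack because dbt passes it through without interpreting it, so custom tooling can hang its own config on models; roughly like this (the keys under <code>meta</code> are hypothetical, not Discord's actual setup):

```yaml
# schema.yml sketch: dbt only stores what's under meta and exposes it
# in the manifest; the backfill keys below are invented for illustration.
version: 2
models:
  - name: activity
    config:
      meta:
        backfill:
          strategy: partition_by_day   # hypothetical custom key
          start_date: "2020-01-01"     # read by custom tooling, not by dbt
```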
</li><li><a href="https://squadrondata.com/Databricks-SQL-Warehouse-Limitations/?ref=blef.fr">Current state of Databricks SQL warehouse</a> — Does Databricks SQL outperform Snowflake?</li><li><a href="https://medium.com/google-cloud/deduplication-in-bigquery-tables-a-comparative-study-of-7-approaches-f48966eeea2b?ref=blef.fr">Deduplication in BigQuery, 7 ways to do it</a> — If you usually do deduplication in BigQuery (or elsewhere), here are 7 patterns to achieve it.</li><li><a href="https://cleandataarchitecture.substack.com/p/ensuring-data-contracts-adoption?r=48edk3&utm_campaign=post&utm_medium=web&triedRedirect=true&ref=blef.fr">Ensuring data contracts adoption across an organization</a>.</li><li><a href="https://www.gable.ai/blog/shift-left-data-manifesto?ref=blef.fr">The shift left data Manifesto</a> — I did not read it because it's too long, but Chad has been a shift left advocate for a long time. Which means "Shifting Left means moving ownership, accountability, quality and governance from reactive downstream teams, to proactive upstream teams". Put another way: give software engineers responsibility for the data.</li><li><a href="https://www.linkedin.com/pulse/bi-dead-change-my-mind-dmitry-pavlov-2otae/?trackingId=P2egxlMzTNC6TAPisnxEyA%3D%3D&ref=blef.fr">BI is dead, change my mind</a> — It's ClickHouse's director of engineering's turn to say BI is dead; he saw the light while chatting with ClickHouse using LibreChat + ClickHouse and GitHub MCP servers.
Looking at how chat-for-everything is taking over all over the place, it's only a few months until stakeholders ask for data interfaces using chat.</li><li><a href="https://duckdb.org/2025/04/04/dbt-duckdb?ref=blef.fr">Local data transformation with dbt and DuckDB</a> —&nbsp;Great article showcasing how you can locally do all your transformations today with dbt and DuckDB, and we even got a great <a href="https://duckdb.org/2025/03/12/duckdb-ui?ref=blef.fr">DuckDB local UI</a>.</li><li><a href="https://fromanengineersight.substack.com/p/issue-46-software-is-now-content?ref=blef.fr">Software is now content</a> — I really liked Benoit's post this week.</li><li><a href="https://blog.duolingo.com/dataset-development/?ref=blef.fr">How we built a robust ecosystem for dataset development</a> — Duolingo's process for applying software engineering practices to data modeling, in the sense that datasets are assets that can be treated like APIs.</li></ul><p></p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2025/04/image-2.png" class="kg-image" alt="" loading="lazy" width="2000" height="1500" srcset="https://www.blef.fr/content/images/size/w600/2025/04/image-2.png 600w, https://www.blef.fr/content/images/size/w1000/2025/04/image-2.png 1000w, https://www.blef.fr/content/images/size/w1600/2025/04/image-2.png 1600w, https://www.blef.fr/content/images/size/w2400/2025/04/image-2.png 2400w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">A rare Iceberg table in real life (</span><a href="https://unsplash.com/photos/white-and-gray-rock-formation-on-blue-sea-under-blue-sky-during-daytime-l6OraG-v0d8?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p></p><h3 id="navigating-iceberg-landscape">Navigating Iceberg landscape</h3><p>A lot also happened over the last month in the data engineering space,
especially around Iceberg, which is taking over a lot of discussions when it comes to data storage. </p><p><em>Why is Iceberg so important right now?</em></p><p>Iceberg is a way to escape data warehouses and build your own warehouse as a kit on top of bucket storage. Iceberg being open-source, it allows us to build interoperability between all systems while supporting some kind of transactional semantics on top of Parquet files.</p><ul><li>The Iceberg Summit took place in San Francisco (but I could not go), though Neelesh published a small <a href="https://www.linkedin.com/feed/update/urn:li:activity:7315622856141279235/?updateEntityUrn=urn%3Ali%3Afs_updateV2%3A%28urn%3Ali%3Aactivity%3A7315622856141279235%2CFEED_DETAIL%2CEMPTY%2CDEFAULT%2Cfalse%29&ref=blef.fr">recap of the Summit</a>. I guess the videos will be on the <a href="https://www.youtube.com/@ApacheIceberg?ref=blef.fr">YouTube channel</a> soon.</li><li>I personally think that DuckDB might be the easiest developer interface to interact with the Iceberg ecosystem, as it's dead simple to spin up a Duck instance. Recently we got <a href="https://www.linkedin.com/feed/update/urn:li:activity:7310671698276564993/?updateEntityUrn=urn%3Ali%3Afs_updateV2%3A%28urn%3Ali%3Aactivity%3A7310671698276564993%2CFEED_DETAIL%2CEMPTY%2CDEFAULT%2Cfalse%29&ref=blef.fr">DuckDB to attach to Iceberg</a> and a preview of the <a href="https://duckdb.org/2025/03/14/preview-amazon-s3-tables.html?ref=blef.fr">Amazon S3 Tables</a> capabilities.</li><li><a href="https://medium.com/@yogevyuval/athena-vs-snowflake-on-iceberg-performance-and-cost-comparison-on-tpc-h-03b96fa6dbf9?ref=blef.fr">Athena vs. Snowflake on Iceberg</a>, performance comparison. In the end Snowflake won, being 2x less expensive; the tests use the engines on top of Iceberg datasets to see how they handle working with Iceberg.
Would have been cool to compare it to the same using native tables.</li><li><a href="https://arrow.apache.org/blog/2025/02/28/data-wants-to-be-free/?ref=blef.fr">Data wants to be free: fast data exchange with Apache Arrow</a> — How Arrow compares to Postgres when it comes to serialisation and why it is so fast.</li><li><a href="https://blog.cloudflare.com/r2-data-catalog-public-beta/?ref=blef.fr">Cloudflare R2 data catalog</a> — Cloudflare R2 is a global object storage (like S3) with free egress (meaning free data reads from external systems), which is a paradise for Iceberg as lakehouses rely heavily on data reads on buckets and most of the engines live elsewhere. So Cloudflare announced an Iceberg catalog that can live close to your tables.</li><li><a href="https://aws.amazon.com/about-aws/whats-new/2025/04/amazon-s3-express-one-zone-reduces-storage-request-prices/?ref=blef.fr">Amazon reduces prices for S3 Express One Zone</a> —&nbsp;Following Cloudflare's announcement, Amazon decided to reduce the price of their Iceberg offering.</li><li><a href="https://www.xorq.dev/posts/introducing-xorq?ref=blef.fr">xorq, declarative, multi-engine pipelines</a> — This new world opened by Iceberg brings us to the multi-engine data stack, where we use different engines (Snowflake, DuckDB, BigQuery) for what they are great at and store the underlying data in buckets using Iceberg, unifying everything in a catalog.
xorq is one of the first multi-engine pipeline systems for ML use-cases.</li></ul><p>Because we have to unify all the trends, an <a href="https://github.com/ryft-io/iceberg-mcp?ref=blef.fr">Iceberg MCP server</a> has been developed.</p><p></p><h4 id="examples-and-thoughts">Examples and thoughts</h4><p>Just to go further and connect everything, a few posts about the relationship between Iceberg and the lakehouse, where all this fuss is going, and what it could mean for your actual data stack.</p><ul><li><a href="https://roundup.getdbt.com/p/iceberg-give-it-a-rest?ref=blef.fr">Iceberg?? Give it a REST!</a>.</li><li><a href="https://medium.com/@coreycheung/we-built-a-data-lakehouse-to-help-sell-dog-food-a94f6ea9c648?ref=blef.fr">We built a data lakehouse to help dogs live longer</a>.</li><li><a href="https://www.dataengineeringweekly.com/p/towards-composable-data-infrastructure?ref=blef.fr">Towards Composable Data Infrastructure</a>.</li><li><a href="https://www.bvp.com/atlas/roadmap-data-3-0-in-the-lakehouse-era?ref=blef.fr">Roadmap: data 3.0 in the lakehouse era</a> —&nbsp;4 possible theses on what could be the next revolution in your data stack.</li></ul><p>My two cents about this: this is mainly experimental and not yet relevant at the scale most companies operate at.
Warehouse + native tables is the easiest user experience you can find, and as data engineers what we want is users using our platforms, right?</p><p></p><h1 id="data-economy-%F0%9F%92%B0">Data Economy 💰</h1><ul><li><a href="https://www.bloomberg.com/news/articles/2025-03-26/openai-close-to-finalizing-its-40-billion-softbank-led-funding?ref=blef.fr">OpenAI raises $40b at $300b valuation</a>.</li><li><a href="https://www.linkedin.com/feed/update/urn:li:activity:7305937479319044098/?updateEntityUrn=urn%3Ali%3Afs_updateV2%3A%28urn%3Ali%3Aactivity%3A7305937479319044098%2CFEED_DETAIL%2CEMPTY%2CDEFAULT%2Cfalse%29&ref=blef.fr">Omni raises $69m Series B.</a></li></ul><p></p><hr><p>See you soon ❤️</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 25.10 ]]></title>
                    <description><![CDATA[ Data News 25.10 — Super large edition, all new models releases, events, dbt Core vs. SQLMesh, benchmark your data team, and more. ]]></description>
                    <link><![CDATA[ /data-news-week-25-10/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 67cc02ec59959d000171f0e6 ]]></guid>
                    <pubDate><![CDATA[ 2025-03-08 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2025/03/image.png" class="kg-image" alt="" loading="lazy" width="900" height="598" srcset="https://www.blef.fr/content/images/size/w600/2025/03/image.png 600w, https://www.blef.fr/content/images/2025/03/image.png 900w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Joy (</span><a href="https://unsplash.com/photos/small-tree-67-CqTBwNI0?ref=blef.fr"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><h3 id="hello-here-%E2%98%80%EF%B8%8F">Hello here ☀️</h3><p>I feel ashamed for not posting any Data News for the last 2 months; a lot was going on and I did not manage to find time every Friday to write the news. I'm so sorry about it.</p><p><strong>Hello to all the new subscribers who arrived since January, I want to warmly welcome you ❤️. </strong>This is your first Data News ever, enjoy the moment, read whatever you feel curious about, at your own rhythm.</p><p>At the moment, I don't want to promise anything about being back to our regular weekly schedule, but I'm trying as hard as I can to organise my new routines/life as a content creator and company founder.</p><p>I've always worked on multiple projects at the same time, but since I started nao things have changed. There's a truth you only grasp when you've lived it: you are thinking about your company all the damn time.</p><h3 id="events-%F0%9F%AA%AD">Events 🪭</h3><p>While being less present online, I've done a lot of things in real life in the last weeks and I'll continue to in the weeks to come.
I was at the <a href="https://duckdb.org/events/2025/01/31/duckcon6/?ref=blef.fr">DuckCon #6</a> in Amsterdam to talk about <a href="https://www.youtube.com/watch?v=m7ACh3DRVW0&ref=blef.fr">yato, the smallest DuckDB SQL orchestrator</a> and Robin published 3 podcast episodes—in French—that I hope you'll listen to while running this weekend 🤭:</p><ul><li><a href="https://www.youtube.com/watch?v=jj8jvy1Eu4U&ref=blef.fr">3 data trends to follow in 2025</a></li><li><a href="https://www.youtube.com/watch?v=wxyVl-1Cr0U&ref=blef.fr">A comparison between SQLMesh and dbt</a></li><li><a href="https://www.youtube.com/watch?v=TRPXKyThtIo&ref=blef.fr">The 3 priorities of a VP data in 2025</a></li></ul><p>At the end of the month, on March 31st, I'll co-organise the <a href="https://www.ai-product-day.com/en?ref=blef.fr">AI Product Day</a> in Paris. We are sold out, but we still have slots for sponsors if you want to help us organise the event and get massive visibility with AI and product teams.</p><p>I'm going to Barcelona 🇪🇸 (from March 19 to 22)—I'd love to hang out with data people there. I'll give a talk at the French Tech Barcelona on March 20, you can <a href="https://lu.ma/72121psm?ref=blef.fr">register here</a>. I might plan a day-trip to Madrid (?).</p><p>The biggest news as a data fan is also that I'll be at <a href="https://www.datacouncil.ai/bay-2025?ref=blef.fr">Data Council</a> this year as your news reporter on duty 🤓.
So if you plan to go, or if you're in San Francisco around April, let's have a coffee.</p><p></p><h1 id="ai-news-%F0%9F%A4%96">AI News 🤖</h1><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2025/03/image-1.png" class="kg-image" alt="" loading="lazy" width="2000" height="1125" srcset="https://www.blef.fr/content/images/size/w600/2025/03/image-1.png 600w, https://www.blef.fr/content/images/size/w1000/2025/03/image-1.png 1000w, https://www.blef.fr/content/images/size/w1600/2025/03/image-1.png 1600w, https://www.blef.fr/content/images/size/w2400/2025/03/image-1.png 2400w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Running after all the models releases (</span><a href="https://unsplash.com/photos/man-running-down-on-desert-pizgoJNQ-xY?ref=blef.fr"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p>The pace of change is nothing short of extraordinary. I haven't published in two months, and it feels like two years. Here's a recap.</p><ul><li>Timeline of the major model releases. When I say major it's obviously subjective. It's mainly related to the noise they made online.<ul><li><a href="https://huggingface.co/microsoft/phi-4?ref=blef.fr">phi-4</a> — Microsoft continues to release their small open models. I never came across someone using it, however.</li><li><a href="https://github.com/deepseek-ai/DeepSeek-R1?ref=blef.fr">DeepSeek R1</a> — DeepSeek is a Chinese startup building foundational models; they released <a href="https://api-docs.deepseek.com/news/news1226?ref=blef.fr">v3</a> previously, then R1, a reasoning model that made OpenAI and American AI companies panic because they claimed major cost reductions in model training. Moreover, DeepSeek's code and models are open-source under the MIT license.
R1 is built on top of v3 using reinforcement learning combined with <a href="https://www.promptingguide.ai/techniques/cot?ref=blef.fr">chain-of-thought</a> (CoT) to "reason".<br><br>HuggingFace created <a href="https://github.com/huggingface/open-r1?ref=blef.fr">open-r1</a>, a fully open (for what that means) version of R1, in Python, where every step is detailed.<br><br>There is also a good analysis of <a href="https://albertoai.substack.com/p/ai-update-22?ref=blef.fr">DeepSeek vs. the world</a>.</li></ul></li><ul><li><a href="https://mistral.ai/en/news/mistral-small-3?ref=blef.fr">Mistral Small 3</a> — A small model that can be used to do CoT, under the Apache license.</li></ul><ul><li>Google <a href="https://deepmind.google/technologies/gemini/flash/?ref=blef.fr">Gemini Flash 2.0</a> — Multimodal reasoning.</li><li>Anthropic <a href="https://www.anthropic.com/news/claude-3-7-sonnet?ref=blef.fr">Claude 3.7</a> — Claude 3.5 has been by far the most used model for code generation for the last 6 months. 3.7 should be an uplift, and to be honest I feel it's not. They also released <a href="https://docs.anthropic.com/en/docs/agents-and-tools/claude-code/overview?ref=blef.fr">Claude Code</a>, an agentic coding tool that you can use in the command line to make changes, commit and fix issues.
Simon created the <a href="https://simonwillison.net/2025/Feb/24/claude-37-sonnet-and-claude-code/?ref=blef.fr">pelican bicycle test</a>, which is fairly good for evaluating models.</li><li><a href="https://qwenlm.github.io/blog/qwen2.5-max/?ref=blef.fr">Alibaba Qwen2.5</a> — Nothing much to say to be honest.</li><li><a href="https://openai.com/index/openai-o3-mini/?ref=blef.fr">Open AI o3-mini</a> — OpenAI's fast reasoning model series.</li><li><a href="https://x.ai/blog/grok-3?ref=blef.fr">Grok 3</a> — Musk fans say it's the best model.</li><li><a href="https://openai.com/index/introducing-gpt-4-5/?ref=blef.fr">OpenAI GPT 4.5</a> — One month later OpenAI released GPT 4.5, and I feel like a teleshopping presenter. Now we have a selector with 6 models in ChatGPT; I'm kinda lost to be honest. There is also the <a href="https://simonwillison.net/2025/Feb/27/introducing-gpt-45/?ref=blef.fr">pelican test.</a></li><li><a href="https://openai.com/index/introducing-deep-research/?ref=blef.fr">OpenAI deep research</a> — OpenAI's mode to replace McKinsey consultants or PhD people, because why not.</li><li><a href="https://mistral.ai/en/news/mistral-ocr?ref=blef.fr">Mistral OCR</a> — The promise is crazy; have a quick look at their examples: from a PDF or a photo it can extract information so you can use it.
It's even "multimodal" because it keeps the figures in the output.</li></ul><li><a href="https://www.linkedin.com/feed/update/urn:li:activity:7303504840209358849/?updateEntityUrn=urn%3Ali%3Afs_updateV2%3A%28urn%3Ali%3Aactivity%3A7303504840209358849%2CFEED_DETAIL%2CEMPTY%2CDEFAULT%2Cfalse%29&ref=blef.fr">ChatGPT for MacOS can interact with your code (in the IDE)</a> — in the demo it works with XCode or VS Code and directly changes the files on disk so they change in your editor.</li><li><a href="https://www.twitch.tv/claudeplayspokemon?ref=blef.fr">Claude 3.7 plays Pokemon on Twitch</a> — finally something useful.</li><li>I'm not a Perplexity user but I see more and more people switching their Google search usage to Perplexity, which announced a new <a href="https://www.perplexity.ai/comet?ref=blef.fr">web browser called Comet</a> and <a href="https://www.perplexity.ai/hub/blog/introducing-perplexity-deep-research?ref=blef.fr">deep research</a>. Deep research has been built to generate ideas, summaries or takeaways.</li><li><a href="https://www.linkedin.com/pulse/streaming-ai-agents-why-kafka-flink-foundations-scale-derosiaux-u24se?ref=blef.fr">Streaming AI agents: why Kafka and Flink are foundations</a> — A small bridge to the data engineering world.</li><li><a href="https://www.anthropic.com/engineering/building-effective-agents?ref=blef.fr">Building effective AI agents</a> — This is a great article from Anthropic if you wanna learn how to build AI agents. It explains well the flow between the user, the UI and the LLM.</li><li>📺 <a href="https://www.youtube.com/watch?v=7xTGNNLPyMI&ref=blef.fr">Deep dive into LLMs</a> — 1.5m views and it has been recommended by a lot of people, so it should be good (I did not watch it).
It goes from GPT-2 to DeepSeek R1 and gives a mental model of what an LLM is.</li><li>RAG stuff<ul><li><a href="https://www.linkedin.com/feed/update/urn:li:activity:7285322375603134466/?updateEntityUrn=urn%3Ali%3Afs_updateV2%3A%28urn%3Ali%3Aactivity%3A7285322375603134466%2CFEED_DETAIL%2CEMPTY%2CDEFAULT%2Cfalse%29&ref=blef.fr">Forget RAG, welcome Agentic RAG</a></li><li><a href="https://www.linkedin.com/feed/update/urn:li:activity:7295766165363077120/?updateEntityUrn=urn%3Ali%3Afs_updateV2%3A%28urn%3Ali%3Aactivity%3A7295766165363077120%2CFEED_DETAIL%2CEMPTY%2CDEFAULT%2Cfalse%29&ref=blef.fr">RAG is dead, long live RAG</a>.</li><li><a href="https://engineering.ramp.com/industry_classification?ref=blef.fr">From RAG to richness: how Ramp revamped industry classification</a>.</li></ul></li></ul><p></p><h1 id="dbt-core-and-sqlmesh-wat-%F0%9F%A7%AD">dbt Core and SQLMesh, wat 🧭</h1><p>dbt Core has become one of the most used tools across data teams all around the world. Because of its success, companies might feel the dbt fatigue, which happens when your dbt project has been a success but spread widely within the company, leading to A LOT of tables—we call them models, the dbt way.</p><p>When you have a lot of tables, dbt projects tend to become less manageable: the CLI becomes slow, the local development experience isn't great and more and more features are going into the Cloud version. SQLMesh has been created to fix dbt Core's issues and to compete with dbt Cloud.</p><p>A few weeks ago, dbt Labs acquired <a href="https://www.sdf.com/?ref=blef.fr">SDF</a>—which I had been watching closely for more than a year, see <a href="https://www.blef.fr/data-news-week-24-07/">DN#24.07</a>. SDF is a Rust binary that understands dbt projects and speeds up everything, delivering up to 100x performance gains. 
Under the hood, SDF parses the SQL queries, builds syntax trees, then compiles and executes them to find issues even before they hit the data warehouse.</p><p>We will know very soon what this acquisition brings to dbt, and we all pray for the best improvements to land in the open-source codebase (spoiler: not sure). </p><p>On the other side, SQLMesh answered with the <a href="https://tobikodata.com/tobiko-acquires-quary.html?ref=blef.fr">acquisition of Quary</a>, a Rust-savvy team that made significant improvements to SQLGlot, the underlying SQL parser of SQLMesh.</p><p>There is fierce competition between the 2 companies and <a href="https://tobikodata.com/dbt_sdf.html?ref=blef.fr">shots</a> are fired. The SQLMesh team is also organising <a href="https://groupby.tobikodata.com/?ref=blef.fr">GROUP BY</a>, their annual conference, in a few days; any resemblance to <a href="https://coalesce.getdbt.com/?ref=blef.fr">another event</a> is fortuitous. This week Tobiko also published a benchmark claiming that SQLMesh on top of Databricks delivers <a href="https://www.linkedin.com/feed/update/urn:li:activity:7303513859485483008/?ref=blef.fr">9x cost savings</a>.</p><p>Time will tell where this leads, but ultimately it will benefit data professionals as both strive to build the best SQL orchestrators. However, I believe there are still unresolved issues with the developer experience in the age of AI—challenges that I'm actively working to address with <a href="https://getnao.io/?ref=blef.fr">nao</a> 🤭.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2025/03/image-2.png" class="kg-image" alt="" loading="lazy" width="900" height="600" srcset="https://www.blef.fr/content/images/size/w600/2025/03/image-2.png 600w, https://www.blef.fr/content/images/2025/03/image-2.png 900w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Will dbt stay on top? 
(</span><a href="https://unsplash.com/photos/white-and-brown-concrete-house-fbCtFV3FkfE?ref=blef.fr"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li><a href="https://www.databenchmarks.com/?ref=blef.fr">Benchmark your data team</a> — Mikkel has been a great contributor to metrics about data teams worldwide: size, ratio to software engineering, team composition, salaries. This time it's a dynamic website where you can explore all these metrics to compare your team with what's out there.</li><li><a href="https://hex.tech/blog/myth-of-data-team-roi/?ref=blef.fr">The myth of measuring data team ROI</a> — The ROI of a data team is one of the most difficult things to measure. Hex's view on this is to ask others to state the ROI for you, especially via an NPS of your users.</li><li><a href="https://fivetran.com/docs/usage-based-pricing/2025-pricing-faq?ref=blef.fr">Fivetran</a> and <a href="https://airbyte.com/blog/introducing-capacity-based-pricing?ref=blef.fr">Airbyte</a> pricing changes — The 2 data ingestion services changed their billing methods. Fivetran is doing something I don't understand, but they have charts explaining it, and Airbyte switched to capacity-based pricing—which means it's based on the number of pipelines you run rather than the volume you move. <a href="https://www.linkedin.com/feed/update/urn:li:activity:7294706306827862019/?updateEntityUrn=urn%3Ali%3Afs_updateV2%3A%28urn%3Ali%3Aactivity%3A7294706306827862019%2CFEED_DETAIL%2CEMPTY%2CDEFAULT%2Cfalse%29&ref=blef.fr">Benjamin analysed the pricing changes on LinkedIn</a>, it's a competitor's perspective. </li><li><a href="https://evidence.dev/blog/what-is-a-flat-file?ref=blef.fr">What is a flat file?</a> — A long article explaining all the flat file formats. 
I do not miss fixed-width files.</li><li><a href="https://duckdb.org/2025/01/22/parquet-encodings?ref=blef.fr">Query engines: gatekeepers of the Parquet file format</a> — The DuckDB team is unhappy because most query engines don't support the latest Parquet advancements, forcing the duck to write the old spec, which lowers performance.</li><li><a href="https://sharon-53595.medium.com/how-we-migrated-to-apache-iceberg-utilizing-athena-trino-and-spark-58c6875b5641?ref=blef.fr">How we migrated to Iceberg using Athena, Trino and Spark</a> — How you can plan a migration to Iceberg. It lasted 4 months and reduced the data volume from 70TB to 40TB.</li><li><a href="https://luminousmen.com/post/how-not-to-partition-data-in-s3-and-what-to-do-instead?ref=blef.fr">How not to partition data in S3</a> — You should partition by folder/date=2025-03-08, rather than with subfolders (sorry American readers, we put the month before the day 🙃).</li><li><a href="https://pola.rs/posts/polars-cloud-what-we-are-building/?ref=blef.fr">Polars launches Polars Cloud</a> — Run stuff remotely, why not 🤷‍♂️, on your own Polars cluster. Looks like a 2025 Spark.</li><li>Tobi launched a DuckDB newsletter: <a href="https://learningduckdb.com/newsletters/welcome-to-learning-duckdb/?ref=blef.fr">learning DuckDB by example</a>.</li><li><a href="https://mehdio.substack.com/p/duckdb-goes-distributed-deepseeks?ref=blef.fr">DuckDB goes distributed</a> — DeepSeek released <a href="https://github.com/deepseek-ai/smallpond?ref=blef.fr">smallpond</a>, a lightweight data processing framework built on DuckDB and <a href="https://github.com/deepseek-ai/3FS?ref=blef.fr">3FS</a>—their distributed storage tech. <em>smallpond</em> is an alternative to Daft or Spark. I'm skeptical: distributed processing is not really the initial purpose of DuckDB, which is made to remove the communication burden between client &lt;&gt; server through single-node processing. 
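Back on the S3 partitioning item above, the recommended Hive-style layout can be sketched in a few lines. This is a hypothetical helper for illustration only, not code from the linked article:

```python
from datetime import date

def partition_key(prefix: str, day: date, part: int) -> str:
    """Build a Hive-style partition path (folder/date=YYYY-MM-DD/...),
    the layout recommended above instead of nested year/month/day
    subfolders. Hypothetical helper, for illustration."""
    return f"{prefix}/date={day.isoformat()}/part-{part:05d}.parquet"

# Engines such as Athena, Trino or DuckDB can prune files on the
# date= key when a query filters on that column.
print(partition_key("s3://my-bucket/events", date(2025, 3, 8), 0))
# s3://my-bucket/events/date=2025-03-08/part-00000.parquet
```

The point of the single `date=` key is that a filter like `WHERE date = '2025-03-08'` maps to exactly one folder, instead of the engine having to walk year/month/day levels.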
<br><br>📅<em> Mehdi is organising an online event about </em><a href="https://lu.ma/5946jam3?ref=blef.fr"><em>Scaling DuckDB</em></a><em>.</em></li><li><a href="https://count.co/blog/announcing-duckdb-on-the-server?ref=blef.fr">Count.co combines DuckDB processing in-browser and on the server</a>.</li><li><a href="https://medium.com/@petrica.leuca/what-ive-discovered-while-using-uv-436b4085b6d6?ref=blef.fr">uv is becoming a thing, how to use it in PyCharm</a> — uv is a Python package manager written in Rust that aims to fix all the issues we've all faced one day. uv also brings on-the-fly package management for scripts, which is freaking cool.</li><li><a href="https://medium.com/skello-engineering/building-a-robust-ci-cd-pipeline-for-dbt-at-skello-e59d685292da?ref=blef.fr">Building robust CI/CD pipeline for dbt</a> — Ideas of things you can put in your CI to test your dbt projects before production. Even though I'm personally convinced that the CI arrives too late in the process and that checks should run even before you push, this is a great start.</li><li><a href="https://maxhalford.github.io/blog/minimizing-sql-dag-runtime/?ref=blef.fr">Minimising the runtime of a SQL DAG</a> — What if you could theoretically save time in your SQL DAG by looking at the durations and the dependencies? This is what Max did, and he found a 26% uplift in performance. This guy never ceases to amaze me.</li><li><a href="https://github.com/ashish10alex/vscode-dataform-tools?ref=blef.fr">VS Code extension for Google Dataform users</a> — For the first time in 2025 I've met Dataform users; it's cool to have another alternative on the table. Though, it's strictly coupled to BigQuery. 
<em>Dataform is dbt but for BigQuery, with another syntax (SQLX).</em></li><li><a href="https://www.linkedin.com/feed/update/urn:li:activity:7298633823070683136/?updateEntityUrn=urn%3Ali%3Afs_updateV2%3A%28urn%3Ali%3Aactivity%3A7298633823070683136%2CFEED_DETAIL%2CEMPTY%2CDEFAULT%2Cfalse%29&ref=blef.fr">Looks like BigQuery is getting a git integration</a> — BigQuery has been getting a lot of new features over the last months: notebooks, lineage, data profiling, etc.</li><li><a href="https://thenewaiorder.substack.com/p/a-head-of-datas-take-on-ai-code-editors?ref=blef.fr">A head of data's take on AI code editors</a> — AI code editors like Cursor and Windsurf are everywhere and a lot of engineering teams are starting to use them, but what's the equivalent for data teams? How can we, as data workers, benefit from these innovations?<br><br><em>the post has been written by my co-founder Claire—it's her first post, send her love ✨</em></li><li><a href="https://dlthub.com/blog/compound-ai-systems-data-engineering?ref=blef.fr">How dlt enters the AI code generated pipelines world</a> — dlt is becoming the best Python toolkit to ingest data into whatever destination. 
In this AI-assisted code writing era, because dlt is just code, MCP servers or LLMs can really shine in helping data engineers write ingestion pipelines.</li><li>SQL related stuff<ul><li><a href="https://ibis-project.org/posts/does-ibis-understand-sql/?ref=blef.fr">Does Ibis understand SQL?</a></li><li><a href="https://fromanengineersight.substack.com/p/beyond-sql-as-a-pure-database-syntax?ref=blef.fr">Beyond SQL as a pure database syntax</a>.</li><li><a href="https://medium.com/google-cloud/sql-is-all-you-need-77554fea90c0?ref=blef.fr">SQL is all you need</a>.</li><li><a href="https://medium.com/google-cloud/detecting-similar-sql-queries-with-vertex-ai-and-vector-search-5356928074b0?ref=blef.fr">Detecting similar SQL queries with vector search</a>.</li></ul></li><li>📺 <a href="https://www.youtube.com/watch?v=X_RFo616M_U&ref=blef.fr">Graph databases after 15 years?</a></li></ul><p>🕵️ What if you could become a SQL detective: <a href="https://www.sqlnoir.com/?ref=blef.fr">SQL Noir</a>. 
It's a fun game to practice your SQL skills by solving mysteries.</p><p></p><h1 id="data-economy-%F0%9F%92%B0">Data Economy  💰</h1><p>Because it's already too long, only headlines.</p><ul><li><a href="https://www.anthropic.com/news/anthropic-raises-series-e-at-usd61-5b-post-money-valuation?ref=blef.fr">Anthropic has raised a $3.5b Series E</a>.</li><li><a href="https://www.qlik.com/us/news/company/press-room/press-releases/qlik-acquires-upsolver-to-deliver-low-latency-ingestion-and-optimization-for-apache-iceberg?ref=blef.fr">Upsolver has been acquired by Qlik</a>.</li><li><a href="https://blog.fal.ai/fal-raises-49m-series-b-to-power-the-future-of-ai-video/?ref=blef.fr">fal.ai raised a $49m Series B</a>.</li><li><a href="https://groq.com/leap2025/?ref=blef.fr">Groq gets $1.5b funding from Saudi Arabia</a>.</li><li><a href="https://www.getdbt.com/blog/dbt-labs-acquires-sdf-labs?ref=blef.fr">dbt Labs acquired SDF</a> and dbt Labs reached <a href="https://www.getdbt.com/blog/dbt-labs-100m-arr-milestone?ref=blef.fr">$100m ARR</a>.</li><li><a href="https://tobikodata.com/tobiko-acquires-quary.html?ref=blef.fr">Tobiko acquired Quary</a>.</li><li>Databricks <a href="https://techcrunch.com/2024/12/17/databricks-raises-10b-as-it-barrels-toward-an-ipo/?ref=blef.fr">raised a $10b Series J</a>, and <a href="https://www.bloomberg.com/news/articles/2025-01-13/databricks-inks-5-billion-financing-from-private-credit-banks?embedded-checkout=true&ref=blef.fr">$5b more in debt</a>.</li><li><a href="https://hightouch.com/blog/hightouch-funding-series-c?ref=blef.fr">Hightouch raised an $80m Series C</a>.</li><li><a href="https://elevenlabs.io/blog/series-c?ref=blef.fr">Eleven Labs raised a $180m Series C</a>; at the same time they released <a href="https://elevenlabs.io/blog/meet-scribe?ref=blef.fr">Scribe</a>, their new cloud-based speech-to-text model.</li></ul><hr><p>Sorry for the long edition, I also feel a bit rusty after 2 months of not writing. See you soon folks ❤️.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 25.02 ]]></title>
                    <description><![CDATA[ Data News #25.02 — New conference AI Product Day, what are AI agents, does size matter, and awesome analytics engineering content. ]]></description>
                    <link><![CDATA[ /data-news-week-25-02/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 67823a18652a1600015a65b1 ]]></guid>
                    <pubDate><![CDATA[ 2025-01-11 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2025/01/image-1.png" class="kg-image" alt="" loading="lazy" width="2000" height="1333" srcset="https://www.blef.fr/content/images/size/w600/2025/01/image-1.png 600w, https://www.blef.fr/content/images/size/w1000/2025/01/image-1.png 1000w, https://www.blef.fr/content/images/size/w1600/2025/01/image-1.png 1600w, https://www.blef.fr/content/images/size/w2400/2025/01/image-1.png 2400w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">HNY 2025 (</span><a href="https://unsplash.com/photos/person-holding-a-light-during-nighttime-vwYrQQFoE-k?ref=blef.fr"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p>Happy new year ✨. <strong>I wish you the best for 2025</strong>. There are multiple ways to start a new year: with new projects, new ideas, new resolutions, or by just keeping the same music playing. I hope you will enjoy 2025.</p><p>The Data News is here to stay; the format might vary during the year, but here we are for another year. Thank you so much for your support through the years.</p><p>Some personal news:</p><ul><li> I will be in Amsterdam for the <a href="https://duckdb.org/2025/01/31/duckcon6.html?ref=blef.fr">DuckCon</a> on Jan 31, where I'll give a 5-minute talk about <a href="https://github.com/Bl3f/yato?ref=blef.fr">yato</a>; if you're also going or living there, reach out so we can chat!</li><li>We announced the <a href="https://www.ai-product-day.com/en?ref=blef.fr">AI Product Day</a>, a 1-day conference that will take place in Paris on March 31. It will be a day dedicated to product teams who want to fully exploit the potential of AI. We are looking for sponsors and the <a href="https://www.billetweb.fr/ai-product-conference?ref=blef.fr">ticketing</a> is open. 
I have a 15% discount code if you're interested: <em>BLEF_AIProductDay25</em>.</li><li>We published videos from the Forward Data Conference; you can watch the keynote by Hannes, DuckDB co-creator, about <a href="https://www.youtube.com/watch?v=1QSs5XY8Hvc&ref=blef.fr">Changing Large Tables</a>.</li><li>Over the past four weeks, I took a break from blogging and LinkedIn to focus on building nao. It was refreshing to recharge and kick off the year; there’s nothing quite like diving back into the joy of hacking and creating.</li></ul><p>Let's jump to the news, and have fun reading: it's a large wrap-up of everything that happened at the end of the year + how 2025 started.</p><figure class="kg-card kg-image-card kg-card-hascaption"><a href="https://www.ai-product-day.com/en?ref=blef.fr"><img src="https://www.blef.fr/content/images/2025/01/Header-PROFIL-AI-PRODUCT-DAY.png" class="kg-image" alt="" loading="lazy" width="1585" height="397" srcset="https://www.blef.fr/content/images/size/w600/2025/01/Header-PROFIL-AI-PRODUCT-DAY.png 600w, https://www.blef.fr/content/images/size/w1000/2025/01/Header-PROFIL-AI-PRODUCT-DAY.png 1000w, https://www.blef.fr/content/images/2025/01/Header-PROFIL-AI-PRODUCT-DAY.png 1585w" sizes="(min-width: 720px) 720px"></a><figcaption><span style="white-space: pre-wrap;">AI Product Day on March 31 (</span><a href="https://www.ai-product-day.com/en?ref=blef.fr"><span style="white-space: pre-wrap;">register</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p></p><h1 id="ai-news-%F0%9F%A4%96">AI News 🤖</h1><p>The current economic uncertainties are affecting the tech and data worlds. Meanwhile, the AI landscape remains unpredictable. 
AI companies are aiming for the moon—AGI—promising it will arrive once OpenAI develops a system capable of generating at least <a href="https://gizmodo.com/leaked-documents-show-openai-has-a-very-clear-definition-of-agi-2000543339?ref=blef.fr">$100 billion in profits</a>.</p><ul><li><a href="https://time.com/7205359/why-ai-progress-is-increasingly-invisible/?ref=blef.fr">Why AI progress is increasingly invisible</a>.</li><li>It's happening: leaders want AI agents to take the jobs of human employees. Nvidia's CEO said "<em>IT department of every company is going to be the HR department of AI agents in the future" </em>(cf. <a href="https://www.youtube.com/live/k82RwXqZHY8?feature=shared&t=2409&ref=blef.fr">Keynote video</a>) and <a href="https://www.firecrawl.dev/?ref=blef.fr">Firecrawl</a>, a tool for turning websites into LLM-ready data, posted a $15K <a href="https://www.ycombinator.com/companies/firecrawl/jobs/1vMVVCc-firecrawl-example-creator-ai-agents-only?ref=blef.fr">job for AI agents</a>. It's actually a modern Kaggle for agentic AI; in the end it's a mechanism to lower human labor costs, because, spoiler, humans will code to create these agents.</li><li><a href="https://huyenchip.com//2025/01/07/agents.html?ref=blef.fr">Agents</a> — Chip Huyen wrote a very large guide about AI agents. It is very detailed: it covers the necessary tooling, how planning works with agents and how you evaluate them. This is great quality material to be honest. There is also a Google introduction <a href="https://www.linkedin.com/feed/update/urn:li:activity:7282707916841795584/?ref=blef.fr">video about AI Agents</a>.</li><li><a href="https://arxiv.org/html/2412.15605v1?ref=blef.fr">Don't do RAG, use CAG</a> — A paper about another way to think about information retrieval for AI knowledge tasks. 
The goal is to use a key-value (KV) cache that eliminates the latency issues traditional RAG might incur.</li><li><a href="https://arxiv.org/pdf/2409.14160?ref=blef.fr">Does size matter?</a> — A paper written by Gael Varoquaux (sklearn), Meredith Whittaker (Signal) and Alexandra Sasha Luccioni (HuggingFace) about the negative impact of the <em>bigger-is-better </em>paradigm<em>. </em>It's easily readable (mildly large, ~10 pages) and gives metrics about the performance plateau that we start to see at scale.</li><li>A large international scientist collaboration released <a href="https://www.linkedin.com/feed/update/urn:li:activity:7269446402739515393/?ref=blef.fr">The Well</a>: 2 massive datasets, from <a href="https://github.com/PolymathicAI/the_well?ref=blef.fr">physics simulation</a> (15TB) to <a href="https://github.com/MultimodalUniverse/MultimodalUniverse?ref=blef.fr">astronomical scientific data</a> (100TB). They aim to produce the same innovation that <a href="https://en.wikipedia.org/wiki/ImageNet?ref=blef.fr">ImageNet</a> produced for image recognition. </li><li>Models news and tour<ul><li><a href="https://github.com/deepseek-ai/DeepSeek-V3/blob/main/DeepSeek_V3.pdf?ref=blef.fr">DeepSeek-v3</a> — It entered the space with a bang. DeepSeek is a <a href="https://huggingface.co/collections/deepseek-ai/deepseek-v3-676bc4546fb4876383c4208b?ref=blef.fr">model</a> trained by the Chinese company of the same name; they compete directly with OpenAI and the others to build foundational models. They released v3 as open source and it outperforms every other model.</li><li><a href="https://techcrunch.com/2024/12/20/openai-announces-new-o3-model/?ref=blef.fr">OpenAI o3</a> — OpenAI announced their advanced reasoning model called o3, which can achieve large tasks. 
o3 is kinda a waste of energy when you look at the numbers: <a href="https://www.linkedin.com/feed/update/urn:li:activity:7276250095019335680/?updateEntityUrn=urn%3Ali%3Afs_updateV2%3A%28urn%3Ali%3Aactivity%3A7276250095019335680%2CFEED_DETAIL%2CEMPTY%2CDEFAULT%2Cfalse%29&ref=blef.fr">estimated carbon impacts</a> (estimated via kWh), based on François Chollet's research on <a href="https://arcprize.org/blog/oai-o3-pub-breakthrough?ref=blef.fr">ARC-AGI benchmarks</a>.</li><li><a href="https://github.com/huggingface/smolagents?ref=blef.fr">smolagents</a> — HuggingFace released a barebones library for agents. Agents write Python code to call tools and orchestrate other agents.</li></ul></li><li>❤️ <a href="https://goyalpramod.github.io/blogs/Transformers_laid_out/?ref=blef.fr">Transformers laid out</a> — The best article out there to understand Transformers (which are key to understanding LLMs).</li><li><a href="https://simonwillison.net/2024/Dec/31/llms-in-2024/?ref=blef.fr">Things we learned about LLMs in 2024</a> — I discovered Simon's content during the Christmas break and to be honest it's some of the best out there. He compiled a list of things we learned in 2024 about LLMs. <strong>This is a must-read</strong>.</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2025/01/image-2.png" class="kg-image" alt="" loading="lazy" width="2000" height="1333" srcset="https://www.blef.fr/content/images/size/w600/2025/01/image-2.png 600w, https://www.blef.fr/content/images/size/w1000/2025/01/image-2.png 1000w, https://www.blef.fr/content/images/size/w1600/2025/01/image-2.png 1600w, https://www.blef.fr/content/images/size/w2400/2025/01/image-2.png 2400w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Does size matter for LLMs? 
(</span><a href="https://unsplash.com/photos/gourd-and-white-tape-measure-on-blue-surface-GTUwF3agcI0?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p></p><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li><a href="https://www.oreilly.com/pub/pr/3465?ref=blef.fr">O'Reilly 2025 tech trend report</a> — Every year O'Reilly releases a report based on searches on their skills platform. From the traffic they get they draw market trends. A few things to notice:<ul><li>Interest in AI grew by 190%, Prompt Engineering by 456%.</li><li>Python and Java still lead the programming language interest, but with a decrease (-5% and -13%) while Rust is gaining traction (+13%); not sure it's related, tho.</li></ul></li><ul><li>Read the <a href="https://ae.oreilly.com/l/1009792/2024-12-06/332nf/1009792/1733515474UOvDN6IM/OReilly_Technology_Trends_for_2025.pdf?ref=blef.fr">pdf version</a> directly. Not really digestible.</li></ul><li><a href="https://issues.org/limits-of-data-nguyen/?ref=blef.fr">The limits of data</a> — The article argues that while data's universality offers significant power, it often sacrifices contextual nuances, leading to oversimplified representations of complex human experiences. (Summary generated by AI, as the article is too long for me to read atm.)</li><li><a href="https://uncultureddata.substack.com/p/the-hidden-cost-of-over-abstraction?ref=blef.fr">The hidden cost of over-abstraction in data teams</a> — It's an interesting take about the layers of abstraction we tend to create to build data platforms, whether it's dbt macros, CLI wrappers, etc. 
In the end, it often adds more complexity than it removes.</li><li><a href="https://www.rilldata.com/blog/designing-a-declarative-data-stack-from-theory-to-practice?ref=blef.fr">Designing a declarative data stack: from theory to practice</a> — Related to the previous article, Simon wrote a great piece about the things to keep in mind when building a proprietary DSL for a declarative data stack. Meaning: a YAML configuration system for ingestion and transformations, and now visualisation with BI-as-code.</li><li><a href="https://sqlpatterns.com/p/lessons-learned-implementing-metric?ref=blef.fr">Lessons learned implementing Metrics Trees</a> — An article in the form of an interview with someone who implemented Metrics Trees; mainly it's not about the visual representation but about the process of translating business needs into an equation (I mainly see the tree as an equation).</li><li><a href="https://www.linkedin.com/pulse/evolution-olap-artyom-keydunov-hrkgc/?ref=blef.fr">The evolution of OLAP</a> — What is OLAP in the modern data stack? As Cube is preaching for their tooling, obviously the semantic layer is the OLAP layer.</li><li><a href="https://www.brooklyndata.co/ideas/2025/01/08/our-hybrid-kimball-and-obt-data-modeling-approach?ref=blef.fr">Hybrid Kimball &amp; OBT data modeling approach</a> —&nbsp;This is maybe the most common setup I've seen over the last 3 years: a combination of star schema with OBT in the marts for ease of consumption.</li><li><a href="https://netflixtechblog.com/part-1-a-survey-of-analytics-engineering-work-at-netflix-d761cfd551ee?ref=blef.fr">Analytics engineering at Netflix</a> —&nbsp;(and <a href="https://netflixtechblog.com/part-2-a-survey-of-analytics-engineering-work-at-netflix-4f1f53b4ab0f?ref=blef.fr">part 2</a>). An internal survey of analytics engineering practices at Netflix. 
They developed an internal /data command that answers questions about everything, and structured analytics around a foundational data platform with a company-wide analytics data layer that provides time-series efficiency metrics across various business use cases.</li><li><a href="https://cube.dev/blog/semantic-layer-and-ai-the-future-of-data-querying-with-natural-language?ref=blef.fr">The future of data querying with Natural Language</a> — What are all the architecture blocks needed to make natural language querying work with data (esp. when you have a semantic layer). </li><li><a href="https://maxhalford.github.io/blog/hard-data-integration-problems-at-carbonfact/?ref=blef.fr">Hard data integration problems</a> — As always, Max describes reality best. He listed the 4 most difficult data integration tasks: from mutable data to IT migrations, everything adds complexity to ingestion systems.</li><li><a href="https://handsondata.substack.com/p/materialization-of-data-warehouse?ref=blef.fr">Materialization of data warehouse layers</a> — What are the considerations behind each materialisation you can pick in your data warehouse layers: views, tables, schemas vs. databases, etc.</li><li><a href="https://medium.com/@jairus-m/the-software-development-lifecycle-within-a-modern-data-engineering-framework-11c44a2f7189?ref=blef.fr">The software development lifecycle within a modern data engineering framework</a> — A great deep-dive into a data platform using dltHub, dbt and Dagster.</li><li><a href="https://fromanengineersight.substack.com/p/issue-43-the-best-code-you-never?ref=blef.fr">The best code is the code you never wrote</a> — Every line of code is a form of debt—a liability that must be maintained and understood. 
As we move toward a future dominated by AI-generated code, the balance will shift dramatically, making human-written code an increasingly scarce and valuable resource.</li></ul><p></p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2025/01/image-3-1.png" class="kg-image" alt="" loading="lazy" width="2000" height="1111" srcset="https://www.blef.fr/content/images/size/w600/2025/01/image-3-1.png 600w, https://www.blef.fr/content/images/size/w1000/2025/01/image-3-1.png 1000w, https://www.blef.fr/content/images/size/w1600/2025/01/image-3-1.png 1600w, https://www.blef.fr/content/images/2025/01/image-3-1.png 2000w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Snowflake (</span><a href="https://unsplash.com/photos/snowflakes-gOOaMbsrdyI?ref=blef.fr"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><ul><li>Canva's Snowflake journey:<ul><li><a href="https://www.canva.dev/blog/engineering/our-journey-to-snowflake-monitoring-mastery/?ref=blef.fr">Our journey to Snowflake monitoring mastery</a>.</li><li><a href="https://www.canva.dev/blog/engineering/snowpipe-streaming/?ref=blef.fr">Continuous data platform with Snowpipe Streaming</a>.</li></ul></li><li><a href="https://szarnyasg.org/posts/duckdb-vs-coreutils/?ref=blef.fr">DuckDB vs. coreutils</a> — A side-by-side comparison of DuckDB's counting features with grep and wc.</li><li><a href="https://aws.amazon.com/blogs/aws/new-amazon-s3-tables-storage-optimized-for-analytics-workloads/?ref=blef.fr">Amazon S3 Tables</a> —&nbsp;Amazon released S3 Tables, out-of-the-box support for Iceberg within S3, a few weeks ago. 
It came with a bang but with <a href="https://dataengineeringcentral.substack.com/p/amazon-s3-tables?ref=blef.fr">a few concerns</a>.</li><li><a href="https://www.cs.cmu.edu/~pavlo/blog/2025/01/2024-databases-retrospective.html?ref=blef.fr">Databases in 2024, a year in review</a> — Mainly, last year was about licensing issues, Databricks vs. Snowflake, and DuckDB trying to decrown pandas as the default.</li><li><a href="https://medium.com/wrenai/how-uber-is-saving-140-000-hours-each-month-using-text-to-sql-and-how-you-can-harness-the-same-fb4818ae4ea3?ref=blef.fr">How Uber is saving 140,000 hours/month using text-to-SQL</a> — They developed QueryGPT within their tooling, which helps Uber employees find the best stuff according to their needs. They have a table agent that lists the best tables according to the user intent.</li><li><a href="https://www.datastackdiagram.com/?ref=blef.fr">A tool to make data stack diagrams</a> — Great tool for stack diagrams.</li><li><a href="https://blog.alexewerlof.com/p/staff-engineer-vs-engineering-manager?ref=blef.fr">Staff Engineer vs Engineering Manager</a> — One of the best articles about the topic: when it comes to expertise, staff engineers are still rare. 
How should we structure the career ladder to make it work?</li><li><a href="https://www.snowflake.com/en/blog/anthropic-claude-sonnet-cortex-ai/?ref=blef.fr">Claude Sonnet 3.5 within Snowflake Cortex</a>.</li><li><a href="https://tech.kakao.com/posts/681?ref=blef.fr">Journey with Apache Flink &amp; Flink CDC</a>.</li></ul><p></p><p><a href="https://simonwillison.net/2025/Jan/2/they-spy-on-you-but-not-like-that/?ref=blef.fr">I still don’t think companies serve you ads based on spying through your microphone</a>.</p><p></p><h1 id="economy-%F0%9F%92%B0">Economy 💰</h1><ul><li><a href="https://www.calcalistech.com/ctechnews/article/h1eodtavjl?ref=blef.fr"><strong>Boomi</strong> acquires <strong>Rivery</strong></a> in a $100m deal.</li><li><a href="https://techcrunch.com/2024/11/20/snowflake-snaps-up-data-management-company-datavolo/?ref=blef.fr"><strong>Snowflake</strong> acquires <strong>Datavolo</strong></a>.</li><li><a href="https://techcrunch.com/2024/12/17/databricks-raises-10b-as-it-barrels-toward-an-ipo/?ref=blef.fr"><strong>Databricks</strong> raises $10B towards an IPO</a>.</li><li><a href="https://techcrunch.com/2024/12/19/in-just-4-months-ai-coding-assistant-cursor-raised-another-100m-at-a-2-5b-valuation-led-by-thrive-sources-say/?ref=blef.fr"><strong>Cursor</strong> raised $100m in Series B</a>.</li></ul><p></p><hr><p>I missed you ❤️ — and see you next week.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Small break until January ]]></title>
                    <description><![CDATA[ Data News #50 — Small break until January next year. ]]></description>
                    <link><![CDATA[ /data-news-week-50-2/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 675be18c6b1db2000121fa09 ]]></guid>
                    <pubDate><![CDATA[ 2024-12-13 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card"><img src="https://www.blef.fr/content/images/2024/12/image.png" class="kg-image" alt="" loading="lazy" width="2000" height="1325" srcset="https://www.blef.fr/content/images/size/w600/2024/12/image.png 600w, https://www.blef.fr/content/images/size/w1000/2024/12/image.png 1000w, https://www.blef.fr/content/images/size/w1600/2024/12/image.png 1600w, https://www.blef.fr/content/images/size/w2400/2024/12/image.png 2400w" sizes="(min-width: 720px) 720px"></figure><p>Hey, it's been a few weeks since something has been published here—I hope you haven’t forgotten about me 😊.</p><p>In the last few weeks I've been all over the place and worked on a lot of topics except this newsletter, so I've decided to take a break and catch the rhythm back up in January!</p><p>The <a href="https://www.forward-data-conference.com/?ref=blef.fr">Forward Data Conference</a> was a huge success and I want to thank again all the attendees, speakers, sponsors and my co-organisers. I can't wait to work on next year's edition. I also organised the first dlt Paris community meetup and I plan to do more events like this next year because I really like the IRL part of content sharing.</p><p>I'm also switching from freelancing to building a <a href="https://getnao.io/?ref=blef.fr">company</a> and I can't wait to share more about it.</p><p>About the Data News, I'd like to use the blog as a place for others to share stories, so if you wanna write or co-write articles with me, hmu.</p><p>I wish you all the best in advance, and see you next year ✨.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 24.45 ]]></title>
                    <description><![CDATA[ Data News #24.45 — dlt Paris meetup and Forward Data Conference approaching soon, SearchGPT, new Mistral API, dbt Coalesce and announcements and more. ]]></description>
                    <link><![CDATA[ /data-news-week-24-45/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 672496f2c3d7d400019e806e ]]></guid>
                    <pubDate><![CDATA[ 2024-11-08 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2024/11/image.png" class="kg-image" alt="" loading="lazy" width="2000" height="1125" srcset="https://www.blef.fr/content/images/size/w600/2024/11/image.png 600w, https://www.blef.fr/content/images/size/w1000/2024/11/image.png 1000w, https://www.blef.fr/content/images/size/w1600/2024/11/image.png 1600w, https://www.blef.fr/content/images/2024/11/image.png 2361w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Métro-boulot-dodo (</span><a href="https://unsplash.com/photos/time-lapse-photo-of-woman-inside-a-train-im7Tiw1OY7c?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p>It's Data News time. Time really flies, and apart from the bad news from across the Atlantic, all is well on my side. To be honest, I miss you folks. Writing here has been my little thing for the last 3 years and because I haven't been able to get back to my previous frequency since July, I feel empty every Friday.</p><p>I'm back in Paris and, wow, the way I live my life in Paris is so different from Berlin; Paris demands speed at every level. I've only been back 6 weeks and I feel like I never left these last 2 years. 
I haven't yet settled back into my routines, jumping between all the hats I've decided to accumulate over the last few years: <a href="https://www.blef.fr/">content</a>, freelancing, <a href="https://getnao.io/?ref=blef.fr">founder</a> and <a href="https://www.forward-data-conference.com/?ref=blef.fr">conference organiser</a>.</p><p>I don't know when I'll be able to start writing here once a week again, but I'm doing my best to do it as soon as possible.</p><p>Enough.</p><hr><p>On November 19th, I'm organising with the <a href="https://dlthub.com/?ref=blef.fr">dltHub</a> folks the first <a href="https://lu.ma/gsf3mjbz?ref=blef.fr">dlt Paris community meetup</a>; the event will take place at <a href="https://www.google.com/maps/search/?api=1&query=42&query_place_id=ChIJ9x4656lv5kcRgqM23RKIgE4&ref=blef.fr">42</a> and will start at 16h. It will feature:</p><ul><li>Navigating the complexities of enterprise ELT, towards data democracy and cost efficiency</li><li>Me — Towards a simple future (dlt, DuckDB, yato and more)</li><li>dltHub CTO, Marcin Rudolf — A teaser of the upcoming dltHub "Portable Data Lake"</li><li>Lightning community talks (reach out if you want to present something)</li></ul><p>It will be only a few days before the <a href="https://www.forward-data-conference.com/?ref=blef.fr">Forward Data Conference</a>, which sold out a few weeks ago; the <a href="https://www.forward-data-conference.com/programme/talks?ref=blef.fr">program is out</a>—the official schedule is coming soon. We are very proud and honoured, along with the organising committee, that around 300 people bought a paid ticket to the event. 
I'm sorry for those who weren't able to get a ticket; we've set up a waiting list and we're doing our best to find a way to push the walls.</p><p>I would also like to thank the sponsors who are accompanying us on this exciting adventure: <a href="https://www.castordoc.com/fr?ref=blef.fr">Castordoc</a>, <a href="https://omni.co/?ref=blef.fr">Omni</a>, <a href="https://www.corailanalytics.com/?ref=blef.fr">Corail Analytics</a>, <a href="https://www.mirakl.com/fr-FR?ref=blef.fr">Mirakl</a>, <a href="https://www.synq.io/?ref=blef.fr">SYNQ</a>, <a href="https://nibble.ai/?ref=blef.fr">nibble</a>, <a href="https://www.sparkline.fr/?ref=blef.fr">Sparkline</a> and <a href="https://www.montecarlodata.com/?ref=blef.fr">Monte Carlo</a>.</p><p>Whether at the dlt community meetup or at Forward, I look forward (no pun intended) to meeting you all.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2024/11/image-1.png" class="kg-image" alt="" loading="lazy" width="2000" height="1333" srcset="https://www.blef.fr/content/images/size/w600/2024/11/image-1.png 600w, https://www.blef.fr/content/images/size/w1000/2024/11/image-1.png 1000w, https://www.blef.fr/content/images/size/w1600/2024/11/image-1.png 1600w, https://www.blef.fr/content/images/size/w2400/2024/11/image-1.png 2400w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Conferences coming—I'm hyped (</span><a href="https://unsplash.com/photos/crowd-of-people-in-building-lobby-nOvIa_x_tfo?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p></p><h1 id="ai-news-%F0%9F%A4%96">AI News 🤖</h1><p>I'm a bit offended: AI news is not what it used to be 🙃 — we were used to more exciting news, competition and drama in the space.</p><ul><li><a href="https://openai.com/index/introducing-chatgpt-search/?ref=blef.fr">ChatGPT Search</a> — OpenAI finally 
plugged ChatGPT into the internet and live data. You can now switch on the web logo and ask the model to search the web alongside its training knowledge. When you mix it with the new <a href="https://openai.com/index/introducing-canvas/?ref=blef.fr">Canvas</a> UI, ChatGPT looks more and more like Google Search results.</li><li>HubSpot's co-founder bought chat.com earlier this year and sold it to OpenAI for shares [<a href="https://www.linkedin.com/posts/olivermolander_openai-activity-7260596174301057026-rbWS/?utm_source=share&utm_medium=member_desktop">via Oliver</a>].</li><li><a href="https://www.youtube.com/watch?v=jqx18KgIzAE&ref=blef.fr">Claude Computer Use</a> — Like a soufflé, everyone was hyped when Anthropic released <a href="https://www.anthropic.com/news/3-5-models-and-computer-use?ref=blef.fr">Computer Use</a>, a chat interacting with an operating system, but a few days later it looks like almost everyone had forgotten it, like the <a href="https://github.com/OpenInterpreter/01?ref=blef.fr">01 interpreter</a>. If you wanna try Computer Use (at your own risk), there is a <a href="https://github.com/anthropics/anthropic-quickstarts/tree/main/computer-use-demo?ref=blef.fr">repo with a Docker</a> image launching a VNC and a Streamlit app—it works fine.</li><li>New Mistral APIs — a <a href="https://mistral.ai/news/batch-api/?ref=blef.fr">batch API</a>, for batching calls rather than making them synchronously, lowering costs by 50%, and a <a href="https://mistral.ai/news/mistral-moderation/?ref=blef.fr">moderation API</a>, a classifier scoring text across several policy categories.</li><li><a href="https://github.com/ibm-granite/granite-3.0-language-models/tree/main?ref=blef.fr">IBM released lightweight open foundation models</a> — 2 sets of "small" models: 2B / 8B dense models and mixture-of-experts 1B / 3B models. 
IBM has <a href="https://www.linkedin.com/feed/update/urn:li:activity:7259535100927725569/?ref=blef.fr">proudly shared the datasets</a> they used to train their models.</li><li><a href="https://www.youtube.com/watch?v=nU_WaPpnlZA&ref=blef.fr">Skrub: Less data wrangling, more machine learning</a> — skrub is a preprocessing / feature engineering library for tabular machine learning. The video emphasises something critical: even if we often talk about training impact—time, carbon footprint—we tend to forget that inference is also a critical part, largely because of the preprocessing. So your preprocessing matters.</li><li><a href="https://www.youtube.com/watch?v=9vM4p9NN0Ts&ref=blef.fr">Stanford, "Building Large Language Models (LLMs)"</a> — 1h44 of a Stanford class about building LLMs. I did not watch it, but I bet you're gonna learn at least a thing watching it. </li><li><a href="https://www.sequoiacap.com/article/generative-ais-act-o1/?ref=blef.fr">Generative AI’s Act o1</a> — Sequoia's essay deep dives into the current state of LLM apps and infra, which has stabilised around Microsoft/OpenAI, AWS/Anthropic, Meta and Google/DeepMind. The text touches on fast vs. slow reasoning, System 1 and System 2 thinking, while awaiting the almighty AGI to come plunge us into a new technology era.</li><li><a href="https://www.latimes.com/business/story/2024-11-01/column-these-apple-researchers-just-proved-that-ai-bots-cant-think-and-possibly-never-will?ref=blef.fr">Apple proved LLMs do not reason</a> (at least mathematically) — Did we actually need Apple for this? It's by design, neither a bug nor a feature. Apple researchers published a <a href="https://arxiv.org/pdf/2410.05229?ref=blef.fr">paper</a> saying: <em>"our work underscores significant limitations in the ability of LLMs to perform genuine mathematical reasoning". 
</em>No way.</li><li>The future is robots — In recent weeks a lot of new robots were (re)announced, sometimes with new features, like <a href="https://x.com/TheHumanoidHub/status/1849352562480394291?ref=blef.fr">Humanoid robots</a>, <a href="https://en.wikipedia.org/wiki/Optimus_(robot)?ref=blef.fr">Tesla Optimus</a> and <a href="https://www.youtube.com/watch?v=F_7IPm7f1vI&t=22s&ref=blef.fr">Boston Dynamics</a>. Happy to learn that we really need robots to fix the jobs market.</li></ul><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2024/11/image-2.png" class="kg-image" alt="" loading="lazy" width="2000" height="1333" srcset="https://www.blef.fr/content/images/size/w600/2024/11/image-2.png 600w, https://www.blef.fr/content/images/size/w1000/2024/11/image-2.png 1000w, https://www.blef.fr/content/images/size/w1600/2024/11/image-2.png 1600w, https://www.blef.fr/content/images/size/w2400/2024/11/image-2.png 2400w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Google lacks money (</span><a href="https://unsplash.com/photos/fan-of-100-us-dollar-banknotes-lCPhGxs7pww?ref=blef.fr"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><ul><li><a href="https://edition.cnn.com/2024/10/31/tech/google-fines-russia/index.html?ref=blef.fr">Russia fines Google $20,000,000,000,000,000,000,000,000,000,000,000</a> — because a few Russian YouTube channels that were blocked 2 years ago are still blocked.</li><li><a href="https://www.linkedin.com/pulse/why-kafka-always-late-st%C3%A9phane-derosiaux-fq9ne/?trackingId=NN8zbaNTTDSaYqwip2cYiQ%3D%3D&ref=blef.fr">Why is Kafka always late?</a> — A great long article about Kafka's concept of what <strong>Time</strong> is, and which configuration levers can be pulled to fix potential issues.</li><li><a 
href="https://dlthub.com/blog/portability?ref=blef.fr">The path to vendor-agnostic data platforms</a> — The Iceberg trend and the competition between Snowflake and Databricks might be scary for the future; thinking about vendor-agnostic data platforms based on open-source technology is becoming a thing again. dlt + DuckDB is a combo that bridges a lot of gaps.</li><li>Snowflake and Databricks <a href="https://www.linkedin.com/posts/sridhar-ramaswamy_most-organizations-spend-70-of-their-budget-activity-7252356050098495490-PEHO?utm_source=share&utm_medium=member_desktop">finger</a> <a href="https://www.linkedin.com/feed/update/urn:li:activity:7257314057454510080/?updateEntityUrn=urn%3Ali%3Afs_updateV2%3A%28urn%3Ali%3Aactivity%3A7257314057454510080%2CFEED_DETAIL%2CEMPTY%2CDEFAULT%2Cfalse%29&ref=blef.fr">pointing</a> like kids — Recently both CEOs played the card of "we are better than them and we are so different". Someone wrote a <a href="https://medium.com/@robert.thompson75/databrick-vs-snowflake-by-the-numbers-82744dd4cb51?ref=blef.fr">side-by-side comparison with numbers</a>. Regarding performance, if you're going the Iceberg way, here's a <a href="https://www.prequel.co/blog/how-fast-is-iceberg-on-snowflake?ref=blef.fr">benchmark with Snowflake</a>.</li><li><a href="https://duckdb.org/2024/10/30/analytics-optimized-concurrent-transactions?ref=blef.fr">Analytics-optimized concurrent transactions in DuckDB</a> — Technical writeup about the concurrency concepts at stake within DuckDB and a good <a href="https://www.linkedin.com/feed/update/urn:li:activity:7257907053703114752/?updateEntityUrn=urn%3Ali%3Afs_updateV2%3A%28urn%3Ali%3Aactivity%3A7257907053703114752%2CFEED_DETAIL%2CEMPTY%2CDEFAULT%2Cfalse%29&ref=blef.fr">Hannes</a> interview.</li><li><a href="https://dataengineeringcentral.substack.com/p/duckdb-inside-postgres?ref=blef.fr">DuckDB inside Postgres!!??</a> — Daniel Beach showcased what <em>pg_duckdb</em> is and its out-of-the-box performance. 
He noticed that <em>pg_duckdb</em>'s performance is slower than raw Postgres, mainly because the extension does not support indexing yet.</li><li><a href="https://www.linkedin.com/pulse/data-warehousing-dead-vincent-rainardi-dfxre/?ref=blef.fr">Data warehousing is dead</a> — Actually the article says the opposite: with all the fuss around lakes, real big data warehouses with HUGE DATA are still out there. Funny article.</li><li><a href="https://www.economist.com/business/2024/10/15/why-microsoft-excel-wont-die?ref=blef.fr">Why Microsoft Excel won’t die</a> — A small reminder for me (and my co-founder).</li><li><a href="https://github.blog/news-insights/octoverse/octoverse-2024/?ref=blef.fr">Python becomes the most popular language on GitHub</a> — Who would have guessed. According to GitHub, Python overtook JavaScript as the #1 language. A lot of metrics are shared in the post, like the 5B contributions made on GitHub in 2024 🤯.</li><li><a href="https://githubnext.com/projects/github-spark?ref=blef.fr">GitHub announced GitHub Spark</a> — Can we enable anyone to create or adapt software for themselves, using AI and a fully-managed runtime? With a natural-language-based editor, everyone can spark an application ✨.</li><li><a href="https://www.youtube.com/watch?v=2g1nBbHgZbY&list=PLGudixcDaxY2NIjMYT8t5zA9KJ47wTCkM&index=19&ref=blef.fr">The road ahead: What’s coming in Airflow 3 and beyond?</a> — Keynote from the Airflow Summit about Airflow moving towards assets and more.</li><li><a href="https://www.snowflake.com/en/blog/govern-open-lakehouse-snowflake-open-catalog/?ref=blef.fr">Managed Polaris by Snowflake</a> — The open-source Iceberg catalog can be managed from the Snowflake UI.</li><li><a href="https://www.fivetran.com/blog/the-easy-button-for-replicating-postgresql-data-into-snowflake?ref=blef.fr">Fivetran is now in the Snowflake marketplace</a> — Tomorrow, the warehouses will ingest their data themselves; Fivetran can be used from the Snowflake marketplace. 
</li><li><a href="https://www.fivetran.com/blog/a-change-to-our-transformations-pricing-structure?ref=blef.fr">dbt runs in Fivetran become a paid feature</a> — you start to pay when you run more than 5000 models a month. In my local community in France I don't know a lot of companies orchestrating dbt this way.</li><li><a href="https://dlthub.com/blog/sql-benchmark-saas?ref=blef.fr">Ingestion performance comparison</a> — dlt compared their performance against other tools. Tbh, I'm quite impressed by dlt's perf.</li><li><strong>dbt Coalesce 2024</strong> — Sorry, I wasn't able to pull together a large recap of all the Coalesce talks this year, still on my todo tho. Gleb wrote about it on <a href="https://www.linkedin.com/feed/update/urn:li:activity:7252325370299846656/?updateEntityUrn=urn%3Ali%3Afs_updateV2%3A%28urn%3Ali%3Aactivity%3A7252325370299846656%2CFEED_DETAIL%2CEMPTY%2CDEFAULT%2Cfalse%29&ref=blef.fr">LinkedIn</a>; mainly, <a href="https://www.getdbt.com/blog/whats-new-in-dbt-cloud-november-2024?ref=blef.fr">dbt Explorer + mesh</a> become more complete, there is a new visual experience (like Alteryx or Knime) to write models, and Iceberg. All of this illustrates dbt Labs' strategy: targeting large companies (visual editor for non-tech people) and a move towards emancipation from the warehouse with Iceberg.</li><li><a href="https://tobikodata.com/dbt-incremental-but-incomplete.html?ref=blef.fr">dbt: incremental but incomplete</a> — A critique of the new dbt <a href="https://docs.getdbt.com/docs/build/incremental-microbatch?ref=blef.fr">microbatch</a> incremental feature; it works with Postgres, BigQuery, Spark and Snowflake. 
If I understand it right, microbatches are a way to batch your large incremental queries by a time dimension that you specify—see it as chunking.</li><li><a href="https://engineeringblog.yelp.com/2024/11/loading-data-into-redshift-with-dbt.html?ref=blef.fr">Loading data into Redshift with dbt</a> — lmao, it's been ages since I last heard about Redshift, so I wanted to give it a shout-out. </li><li><a href="https://github.com/canva-public/dbt-column-lineage-extractor?ref=blef.fr">dbt-column-lineage-extractor</a> — A Python CLI tool from the Canva team.</li><li><a href="https://cloud.google.com/blog/products/data-analytics/introducing-ai-driven-bigquery-data-preparation?hl=en&ref=blef.fr">BigQuery's AI-assisted data preparation</a> — Do data preparation in the fresh new BigQuery editor with natural language descriptions of the transformations.</li><li><a href="https://cloud.google.com/looker/docs/enabling-studio-in-looker?ref=blef.fr">Enabling and disabling Studio in Looker&nbsp;</a> — Looker Studio will become a feature that you activate / deactivate in your Looker setup.</li><li>❤️ <a href="https://www.rilldata.com/blog/bi-as-code-and-the-new-era-of-genbi?ref=blef.fr">In the era of GenBI</a> — one of the best posts of this week.</li><li><a href="https://www.lightdash.com/blogpost/lightdash-raises-series-a?ref=blef.fr"><strong>Lightdash</strong> raises $11m</a> —&nbsp;I really like Lightdash, it's a promising mix of a BI tool and a semantic layer, as code.</li></ul><h3 id="food-for-thoughts">Food for thoughts</h3><ul><li><a href="https://stkbailey.substack.com/p/docker-for-data-products?ref=blef.fr">Docker for data products</a>.</li><li><a href="https://craftingdataproducts.substack.com/p/why-agile-and-product-management?utm_campaign=post&utm_medium=web&triedRedirect=true&ref=blef.fr">Why Agile and Product Management fail with Data &amp; AI?</a></li><li><a href="https://luminousmen.com/post/whos-really-responsible-for-team-failures/?ref=blef.fr">Who’s really responsible for 
team failures?</a></li><li><a href="https://slack.engineering/empowering-engineers-with-ai/?ref=blef.fr">Empowering Engineers with AI at Slack</a>.</li></ul><hr><p>See you soon ❤️</p><p>PS: for our product research with nao we are looking for analytics engineers working on modeling daily. If you fall into this bucket, answer by saying "hi i'm an analytics engineer" and we will follow up on this 🤗.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 24.40 ]]></title>
                    <description><![CDATA[ Data News #24.40 — Back in Paris, Forward Data Conference program is out, OpenAI and Meta new stuff, DuckCon and a lot of things. ]]></description>
                    <link><![CDATA[ /data-news-week-24-40/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 6700d40dc6618b000118a409 ]]></guid>
                    <pubDate><![CDATA[ 2024-10-06 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2024/10/image-1.png" class="kg-image" alt="" loading="lazy" width="2000" height="1319" srcset="https://www.blef.fr/content/images/size/w600/2024/10/image-1.png 600w, https://www.blef.fr/content/images/size/w1000/2024/10/image-1.png 1000w, https://www.blef.fr/content/images/size/w1600/2024/10/image-1.png 1600w, https://www.blef.fr/content/images/size/w2400/2024/10/image-1.png 2400w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Back in Paris (</span><a href="https://unsplash.com/photos/a-building-with-a-bunch-of-signs-on-the-side-of-it-MpAgdPjCwbo?ref=blef.fr"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p>Hey, hey, hey. I'm so sorry for this small break in the news. I was in the middle of starting my new company, <a href="https://getnao.io/?ref=blef.fr">nao</a>, and moving back from Berlin to Paris. Still, I hope this edition finds you well; it will be a mix of personal news, the OpenAI saga and the usual data engineering stuff that I enjoy reading.</p><p>First things first, yes, I'm co-founding a company. We called the company nao and you can see it as a no-code semantic layer. I'll keep a full post about it for later, but if you're interested, hmu.</p><p>Then, my girlfriend and I decided to move back from Berlin to Paris after 2 years there. It's a professional move for both of us; we will miss Berlin, to be honest, but a big part of our social life is in Paris. 
Being in Paris will ease all the events and IRL stuff I go to / organise.</p><p></p><h1 id="forward-data-conference-%E2%9C%A8">Forward Data Conference ✨</h1><figure class="kg-card kg-image-card"><img src="https://www.blef.fr/content/images/2024/10/image.png" class="kg-image" alt="" loading="lazy" width="700" height="200" srcset="https://www.blef.fr/content/images/size/w600/2024/10/image.png 600w, https://www.blef.fr/content/images/2024/10/image.png 700w"></figure><p>As a reminder, on November 25th I'm organising the Forward Data Conference. It will be a day to shape the future of the data community, where teams can come to learn and grow together. There are still tickets left—we've sold around 80% of them.</p><p>This week we announced the program, you can find it on our <a href="https://www.forward-data-conference.com/en/programme/talks?ref=blef.fr">website</a>. I really like the program we put in place, it's a mix of Engineering and Strategic / Vision talks.</p><p>The conference will be held in French + English: a few talks will be given in French but we will subtitle them live, and we will also make sure there is always something in English in parallel for native English speakers.</p><p>You can use the <strong>BLEF_FWD24</strong> promo code to get a 15% reduction on your ticket.</p><div class="kg-card kg-button-card kg-align-center"><a href="https://www.billetweb.fr/forward-data-conference?ref=blef.fr" class="kg-btn kg-btn-accent">Get tickets for Forward Data Conference</a></div><p><em>PS: dear readers, if you proposed a talk to the FDC which has been rejected, I'm so sorry you did not get a detailed explanation; we received a lot of talks and I wasn't able to write a personal message for every talk that was rejected. 
Tho, if you're wondering why, reach out and I will explain.</em></p><p></p><h1 id="ai-news-%F0%9F%A4%96">AI News 🤖</h1><ul><li>OpenAI is our best saga of drama and tech, when is the Netflix show coming out?<ul><li><a href="https://www.linkedin.com/feed/update/urn:li:activity:7247656888023027712/?updateEntityUrn=urn%3Ali%3Afs_updateV2%3A%28urn%3Ali%3Aactivity%3A7247656888023027712%2CFEED_DETAIL%2CEMPTY%2CDEFAULT%2Cfalse%29&ref=blef.fr">DevDay recap</a> — OpenAI DevDay was the developer conference where they announce features, models and stuff about their product. The "biggest" announcement was around the <a href="https://openai.com/index/introducing-the-realtime-api/?ref=blef.fr">Realtime API</a> targeting speech-to-speech applications.<br><br>In addition they introduced <a href="https://openai.com/index/api-prompt-caching/?ref=blef.fr">prompt caching</a> to save token costs, and the possibility to <a href="https://openai.com/index/introducing-vision-to-the-fine-tuning-api/?ref=blef.fr">fine-tune vision for GPT-4o</a>. The last thing is <a href="https://openai.com/index/introducing-canvas/?ref=blef.fr">Canvas</a>, which is a new way to interact with the models; I'd say it's a mix of Notion and Anthropic's better UI. This is mandatory for OpenAI to improve and diversify their public UI/UX in order to compete with large app ecosystems.</li></ul></li><ul><li><a href="https://huggingface.co/openai/whisper-large-v3-turbo?ref=blef.fr">Whisper large v3 turbo</a> — A new turbo version of Whisper has been released on Hugging Face (<a href="https://github.com/openai/whisper/discussions/2363?ref=blef.fr">announcement</a>). 
Following the Realtime voice API, it's great to see improvements in Whisper, the voice model.</li><li><a href="https://www.reuters.com/technology/artificial-intelligence/openai-remove-non-profit-control-give-sam-altman-equity-sources-say-2024-09-25/?ref=blef.fr">OpenAI to remove non-profit control and give Sam Altman equity</a> — After a magic trick, Sam could receive equity in a company valued around $150b. The important note is also that OpenAI is moving its core business to for-profit, which will no longer be controlled by the non-profit board.</li></ul><ul><li><a href="https://x.com/OpenAI/status/1838642453391511892?ref=blef.fr">Advanced Voice not available in EU</a> — Advanced Voice is a Siri-like interface on top of ChatGPT capabilities. The unavailability in the EU is lobbying at its finest, fearing the AI Act or GDPR could harm innovation. Explain to me why companies with the best engineers in the world can't find a way to make things legal.</li><li><a href="https://www.theguardian.com/technology/2024/oct/02/openai-raises-66bn-in-funding-is-valued-at-157bn?ref=blef.fr">They raised $6.6b at a $157b valuation</a> (and <a href="https://www.crunchbase.com/funding_round/openai-debt-financing--82b1aa6b?ref=blef.fr">$4b in debt</a>). Another $10b after the first one in Jan 2023.</li></ul><li>Meta — if there was a race, Meta would be well positioned; who would have thought, after the Metaverse choices?<ul><li><a href="https://ai.meta.com/research/movie-gen/?ref=blef.fr">Meta Movie Gen</a> — Meta announces new research on movie generation models. Let's be honest, for the moment it just feels unreal, like a video game or something in virtual reality. 
But in the end, this is maybe what we need?</li><li>New hardware (powered with AI) — Two promising products have been demonstrated: a pair of <a href="https://x.com/altryne/status/1839007699255832583?ref=blef.fr">glasses</a> and a <a href="https://www.linkedin.com/feed/update/urn:li:activity:7245722659819311104/?updateEntityUrn=urn%3Ali%3Afs_updateV2%3A%28urn%3Ali%3Aactivity%3A7245722659819311104%2CFEED_DETAIL%2CEMPTY%2CDEFAULT%2Cfalse%29&ref=blef.fr">wristband</a> that lets you interact with virtual interfaces through your finger movements.</li><li>SAM 2, Segment Anything Model 2, <a href="https://www.linkedin.com/feed/update/urn:li:activity:7243618695032225793/?updateEntityUrn=urn%3Ali%3Afs_updateV2%3A%28urn%3Ali%3Aactivity%3A7243618695032225793%2CFEED_DETAIL%2CEMPTY%2CDEFAULT%2Cfalse%29&ref=blef.fr">can run on-device on Apple CoreML</a> — A demo of image segmentation that runs 100% offline and on-device. Industrial applications might easily follow from this.</li><li>Mark Zuckerberg says <a href="https://www.yahoo.com/tech/mark-zuckerberg-says-leaders-technical-092702401.html?guccounter=1&ref=blef.fr">leaders should have technical skills if they want to call themselves a tech company</a>. Yes, but technical leaders are also sometimes not the best ones, maybe the crazy ones, so other skills are required.</li></ul></li><li><a href="https://www.anthropic.com/news/contextual-retrieval?ref=blef.fr">Introducing contextual retrieval</a> — Anthropic introduced a new way to do RAG with more context that performs better than the standard approach.</li><li>Meta and Google announced automatic dubbing for <a href="https://www.techradar.com/computing/artificial-intelligence/meta-announces-an-ai-translation-tool-that-could-change-the-way-you-watch-instagram-and-facebook-reels-forever?ref=blef.fr">Reels</a> and <a href="https://www.socialmediatoday.com/news/youtube-announces-expansion-auto-dubbing-more-creators-languages/727573/?ref=blef.fr">YouTube videos</a> respectively, this is something. 
Translation looks like a use-case that is <em>almost</em> solved with LLMs. It unlocks a world where languages are no longer barriers, giving us instant access to content and discussions from all around the world, especially if it can run on-device, cheaply.</li><li><a href="https://github.com/fmind/bromate?ref=blef.fr">Web browser automation through agentic workflows</a> — A Github repo with a demo using Gemini and Selenium to automate browser actions.</li><li><a href="https://microsoft.github.io/autogen/blog/2024/10/02/new-autogen-architecture-preview/?ref=blef.fr">New AutoGen architecture</a> — AutoGen is an open-source programming framework for agentic workflows; they designed a new architecture (to be honest I don't know what it means).</li><li><a href="https://www.linkedin.com/feed/update/urn:li:activity:7240795734436945920/?updateEntityUrn=urn%3Ali%3Afs_updateV2%3A%28urn%3Ali%3Aactivity%3A7240795734436945920%2CFEED_DETAIL%2CEMPTY%2CDEFAULT%2Cfalse%29&ref=blef.fr">Klarna drama</a> — Klarna's CEO announced he will shut down Salesforce and Workday to replace them with internal initiatives + AI. Let's see where it goes.</li><li><a href="https://www.rfi.fr/en/france/20240927-paris-police-chief-backs-keeping-ai-surveillance-in-place-post-olympics?ref=blef.fr">Paris police wants to keep AI surveillance in place post-Olympics</a> — Who could have predicted?</li><li><a href="https://trends.malt.com/en/ai-report?ref=blef.fr">Malt AI report</a> — Malt is a French / European freelance marketplace and they dropped their new AI report. A few things I noted going through the report are below.<ul><li><em>Snowflake</em> demand has largely increased and it's close to <em>Databricks</em> in volume, tho <em>Hadoop</em> demand is still larger 🙃</li><li>The biggest demand concerns stuff around AI like LLM, Deep Learning, Machine Learning, scikit-learn, etc. 
— in 2024 there are <em>16k AI freelancer profiles</em></li><li><em>dbt</em> pops out as a specific skill on freelancer profiles</li><li>AI engineers and scientists have an average daily rate around 500€, which is 100€ more than the general tech and data category.</li><li>AI supply is half data scientists, half all other tech positions (DA, DE, Back-end, SE, DevOps).</li></ul></li></ul><p></p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2024/10/image-2.png" class="kg-image" alt="" loading="lazy" width="2000" height="1125" srcset="https://www.blef.fr/content/images/size/w600/2024/10/image-2.png 600w, https://www.blef.fr/content/images/size/w1000/2024/10/image-2.png 1000w, https://www.blef.fr/content/images/size/w1600/2024/10/image-2.png 1600w, https://www.blef.fr/content/images/size/w2400/2024/10/image-2.png 2400w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Build the foundations (</span><a href="https://unsplash.com/photos/an-aerial-view-of-a-building-under-construction-eMs5ghrVW7M?ref=blef.fr"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li><a href="https://duckdb.org/2025/01/31/duckcon6.html?ref=blef.fr">CfP for DuckCon</a> in Amsterdam on January 31, 2025 — In January next year, DuckCon will take place; the call for papers is still open until Oct 18th. I might propose something about yato (?).</li><li><a href="https://dlthub.com/blog/dlt-v1?ref=blef.fr">dlt goes 1.0.0</a> — dlt announced their 1.0.0 version, as well as 1000 open-source customers in production. 
This version brings stability and marks a new milestone for the library.<br><br><em>Side note, I'm a dltHub investor.</em></li><li><a href="https://airbyte.com/blog/1-0-prime-time?ref=blef.fr">Airbyte is also going 1.0</a> — Following dlt (?), Airbyte is also going 1.0 with 3 objectives: more use-cases, reliability and better throughput performance.</li><li>❤️ <a href="https://docs.google.com/spreadsheets/d/1Wx6S3qUjjSuK-VX2tkoydTZGb1LzcYnht4N_WkBwApI/edit?gid=0&ref=blef.fr#gid=0">NO SLIDES conference</a> — Be careful before clicking on this link: you might lose yourself in a rabbit hole. Recently Timo organised a NO SLIDES conference, a conference where people would only share their screen, with no slides. I participated to demo nao, but the demo failed, so the recording does not exist anymore (oops); still, I've watched a few other talks and really enjoyed them.</li><li><a href="https://medium.pimpaudben.fr/elt-with-kestra-duckdb-dbt-neon-and-resend-5bfd62160190?ref=blef.fr">ELT with Kestra, DuckDB, dbt, Neon and Resend</a> — How you can create declarative data pipelines with Kestra to move data using the trendy libraries.</li><li>DuckDB is the <a href="https://davidsj.substack.com/p/foundation?ref=blef.fr">foundation</a>.</li><li><a href="https://www.linkedin.com/feed/update/urn:li:activity:7243008879301664768/?updateEntityUrn=urn%3Ali%3Afs_updateV2%3A%28urn%3Ali%3Aactivity%3A7243008879301664768%2CFEED_DETAIL%2CEMPTY%2CDEFAULT%2Cfalse%29&ref=blef.fr">Fast feedback when SQL writing</a> — A nice experiment showcasing how writing SQL tomorrow could look. 
Imagine getting results directly while typing to have a faster iteration loop.</li><li><a href="https://cloud.google.com/blog/products/data-analytics/bigquery-jobs-explorer-is-now-ga?hl=en&ref=blef.fr">BigQuery jobs explorer refreshed</a> — The Google team released a fresh new explorer for BigQuery Jobs.</li><li>Coursera and <em>Joe Reis</em> launched a <a href="https://www.coursera.org/professional-certificates/data-engineering?ref=blef.fr">Data Engineering Professional Certificate</a> — I can't recommend Joe enough; he's one of the best when it comes to capturing the data engineering job, and the syllabus is great.</li><li><a href="https://www.databricks.com/blog/whats-new-with-databricks-sql?ref=blef.fr">Current state of Databricks SQL</a> — "The best data warehouse is a lakehouse", lmao. Episode 21425325 in the competition between <a href="https://www.blef.fr/databricks-snowflake-and-the-future/">Snowflake and Databricks</a>.</li><li><a href="https://craftingdataproducts.substack.com/p/the-data-death-cycle?ref=blef.fr">The data death cycle</a> —&nbsp;5 traps you wanna avoid to deliver value with Data &amp; AI products: the tech trap, the doing trap, the project trap, the silo trap and the performance-first trap. 
And <a href="https://www.getorchestra.io/blog/the-data-death-cycle-avoiding-the-silo-trap?ref=blef.fr">follow-up about silos</a> by Hugo.</li></ul><p></p><h3 id="no-comments">No comments</h3><p>Mainly because of time and the length of this issue.</p><ul><li><a href="https://benn.substack.com/p/is-excel-immortal?ref=blef.fr">Is Excel immortal?</a> —&nbsp;Benn</li><li><a href="https://engineering.grab.com/catwalk-evolution?ref=blef.fr">Evolution of catwalk: model serving platform at Grab</a>.</li><li><a href="https://engineering.hometogo.com/how-hometogo-improved-our-superset-monitoring-framework-60eb98e1a650?ref=blef.fr">How HomeToGo improved our Superset monitoring framework</a>.</li><li><a href="https://luminousmen.com/post/the-importance-of-clear-software-requirements/?ref=blef.fr">The importance of clear software requirements</a>.</li></ul><p></p><h1 id="data-economy-%F0%9F%92%B0">Data Economy 💰</h1><ul><li><a href="https://www.theguardian.com/technology/2024/oct/02/openai-raises-66bn-in-funding-is-valued-at-157bn?ref=blef.fr"><strong>OpenAI</strong> raises $6.6b at $157b valuation</a>. SoftBank <a href="https://www.theinformation.com/articles/softbank-to-invest-500-million-in-openai?ref=blef.fr">goes in with half a billion</a>.</li><li><a href="https://supabase.com/?ref=blef.fr"><strong>Supabase</strong></a> <a href="https://techcrunch.com/2024/09/25/supabase-a-postgres-centric-developer-platform-raises-80m-series-c/?ref=blef.fr">raises $80m Series C</a>. It's an open-source <a href="https://en.wikipedia.org/wiki/Firebase?ref=blef.fr">Firebase</a> alternative built on top of Postgres.</li><li><a href="https://kestra.io/?ref=blef.fr"><strong>Kestra</strong></a> <a href="https://medium.com/@edarras/how-kestra-raised-8m-our-seed-deck-now-public-b3493f5a9fbb?ref=blef.fr">raises $8m Series A</a>. 
Kestra is an open-source orchestration engine, written in Java, where you create workflows using a <a href="https://www.linkedin.com/feed/update/urn:li:activity:7242071652748972032/?ref=blef.fr">declarative</a> model. Ludovic, the CTO, wrote about turning an <a href="https://www.linkedin.com/pulse/lessons-learned-from-turning-open-source-project-viable-ludovic-dehon-e876e/?trackingId=7E87MGnbR%2BqpR4SDM7nhyw%3D%3D&ref=blef.fr">open-source project into a viable business</a>.</li><li><a href="https://fal.ai/?ref=blef.fr"><strong>fal.ai</strong></a> <a href="https://blog.fal.ai/generative-media-needs-speed-fal-has-raised-23m-to-accelerate/?ref=blef.fr">raises $14m Series A</a>. For readers who have been here a long time, you might remember fal.ai: they were the first to propose a way to mix Python and dbt models with specific tooling, and they have since pivoted into a super-fast GenAI inference platform.</li><li><a href="https://www.forbes.com/sites/janakirammsv/2024/09/30/nvidia-acquires-octoai-to-dominate-enterprise-generative-ai-solutions/?ref=blef.fr"><strong>NVidia</strong> acquires <strong>OctoAI</strong></a>.</li><li><strong>BlackRock</strong> and <strong>Microsoft</strong> <a href="https://www.ft.com/content/4441114b-a105-439c-949b-1e7f81517deb?ref=blef.fr">plan $30bn fund to invest in AI infrastructure</a>.</li><li><strong>Voltron Data</strong> <a href="https://x.com/_Felipe/status/1840759201909318097?ref=blef.fr">laid off 50+ employees recently</a>. Voltron engineers are among the best when it comes to the under-the-hood engines powering our modern data platforms.</li><li><a href="https://probabl.ai/?ref=blef.fr"><strong>:probabl</strong>.</a> <a href="https://papers.probabl.ai/announcing-major-milestone-empowering-the-future-of-data-science?ref=blef.fr">raised €5.5m Seed round</a>. probabl is the official operator of the scikit-learn brand and will develop products and services around the library. 
Because we need the data science tooling to be and stay open-source.<br><br><em>Side note, I'm a :probabl investor.</em></li></ul><hr><p>See you soon ❤️</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 24.37 ]]></title>
                    <description><![CDATA[ Data News #24.37 — OpenAI o1 new series, building low cost platform with Model dlt and dbt, Data teams survey, feature store, Ibis without pandas. ]]></description>
                    <link><![CDATA[ /data-news-week-24-37/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 66e3f169c715700001fc574f ]]></guid>
                    <pubDate><![CDATA[ 2024-09-13 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2024/09/image.png" class="kg-image" alt="" loading="lazy" width="2000" height="1333" srcset="https://www.blef.fr/content/images/size/w600/2024/09/image.png 600w, https://www.blef.fr/content/images/size/w1000/2024/09/image.png 1000w, https://www.blef.fr/content/images/size/w1600/2024/09/image.png 1600w, https://www.blef.fr/content/images/size/w2400/2024/09/image.png 2400w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Back to work (</span><a href="https://unsplash.com/photos/a-group-of-boats-on-a-beach-1stcb527UGU?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p>Hey you, can you believe it's already September? This year has been flying. It feels like I just blinked, and here we are. In August, I've been focusing mainly on my next big journey—if you follow me on LinkedIn, you might have caught a sneak peek! I'll be making a full announcement next week. I want to take the time to explain my thought process and ideas behind it. I hope you will like it.</p><p>Below are the Data News wrapping up summer and the first two weeks of Sept. </p><h1 id="ai-news-%F0%9F%A4%96">AI News 🤖</h1><ul><li>OpenAI released 2 new models, <a href="https://openai.com/index/introducing-openai-o1-preview/?ref=blef.fr">OpenAI o1-preview and o1-mini</a> — These models bring changes and a break in model naming. OpenAI decided to give up on the GPT naming, which means GPT-5 will never see the light of day. The <a href="https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf?ref=blef.fr">GPT paper</a> was co-authored by 4 people, 3 of whom are no longer at OpenAI; leaving GPTs behind also marks a change in paradigm. 
<br><br>The o1 series brings more “<em>reasoning</em>”; it looks like a pre-prompt that does a <a href="https://www.promptingguide.ai/techniques/cot?ref=blef.fr">chain of thought</a> on top of what they already did best. Lots of stories about exceptional things the model can do have been published today—e.g. the OpenAI <a href="https://openai.com/index/openai-o1-system-card/?ref=blef.fr">system card</a> explains that, during a cybersecurity challenge (a CTF), the model was able to understand a failing Docker environment (due to infra issues) and still find the flag.<br><br>Here is a <a href="https://www.youtube.com/playlist?list=PLOXw6I10VTv_T9QV-DKXhq7HFUQRkGQLI&ref=blef.fr">YouTube playlist</a> demonstrating o1's capabilities.<br><br>As clem mentioned on Twitter, it's always important to pay attention to words: even if the model “<em>reasons</em>”, <a href="https://x.com/ClementDelangue/status/1834283206474191320?ref=blef.fr">it doesn't think, it processes</a>.</li><li>More news about OpenAI<ul><li><a href="https://azure.microsoft.com/en-us/blog/introducing-o1-openais-new-reasoning-model-series-for-developers-and-enterprises-on-azure/?ref=blef.fr">New models are already available on Azure</a> ; but be aware, Microsoft's open-source <a href="https://huggingface.co/microsoft/Phi-3.5-mini-instruct?ref=blef.fr">Phi-3.5-mini</a> is out.</li><li>Ilya Sutskever, previously Chief Scientist at OpenAI, raised $1b to co-found Safe Superintelligence with a <a href="https://ssi.inc/?ref=blef.fr">manifesto</a>.</li><li>Alexis Conneau, ex-research lead of “Her” at OpenAI, decided to create a new company and got a lot of <a href="https://x.com/alex_conneau/status/1833535309902189015?ref=blef.fr">Tweet impressions</a>. 
Former OpenAI members are quite popular when it comes to founding companies.</li><li>Bloomberg reported that <a href="https://www.bloomberg.com/news/articles/2024-09-11/openai-fundraising-set-to-vault-startup-s-value-to-150-billion?ref=blef.fr">OpenAI seeks to raise $11.5b more at $150b valuation</a>, making it the third most valuable private company [paywall article].</li><li>NEO Beta, a humanoid company backed by OpenAI, released a first <a href="https://www.linkedin.com/feed/update/urn:li:activity:7235643741191995392/?updateEntityUrn=urn%3Ali%3Afs_updateV2%3A%28urn%3Ali%3Aactivity%3A7235643741191995392%2CFEED_DETAIL%2CEMPTY%2CDEFAULT%2Cfalse%29&ref=blef.fr">video demo</a>. And it's impressive (🙃), the robot is able to hand over a bag to a human!</li><li>We hope the next OpenAI model is not o7. /s</li></ul></li><li><a href="https://www.nist.gov/news-events/news/2024/08/us-ai-safety-institute-signs-agreements-regarding-ai-safety-research?ref=blef.fr">OpenAI and Anthropic will give their models first to the US government</a> (NIST) to help advance safe and trustworthy AI innovation for all. But they cry when Europe votes the AI Act, claiming it threatens innovation.</li><li><a href="https://github.com/NVlabs/EAGLE?ref=blef.fr">NVidia released Eagle, a vision-centric multimodal LLM</a> — Look at the example in the Github repo: given an image and a user input, the LLM is able to answer things like "Describe the image in detail" or "Which car in the picture is more aerodynamic" based on a drawing.</li><li><a href="https://aleph-alpha.com/introducing-pharia-1-llm-transparent-and-compliant/?ref=blef.fr">Aleph Alpha introduced Pharia-1-LLM</a> — it's a <a href="https://huggingface.co/Aleph-Alpha/Pharia-1-LLM-7B-control?ref=blef.fr">7B model</a> and the license explicitly targets non-commercial and research usage. 
Aleph Alpha, a German company funded by German VCs (with $500m), was trying to compete with US companies (like Mistral and OpenAI 🤭) in the models race, but gave up this competition to pivot to an AI-support company for the public sector.</li></ul><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2024/09/image-1.png" class="kg-image" alt="" loading="lazy" width="2000" height="1330" srcset="https://www.blef.fr/content/images/size/w600/2024/09/image-1.png 600w, https://www.blef.fr/content/images/size/w1000/2024/09/image-1.png 1000w, https://www.blef.fr/content/images/size/w1600/2024/09/image-1.png 1600w, https://www.blef.fr/content/images/size/w2400/2024/09/image-1.png 2400w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Calm data flows (</span><a href="https://unsplash.com/photos/brown-wooden-boat-on-body-of-water-overlooking-houses-by-the-shore-at-daytime-Pnc2Uxb7PG0?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><ul><li><a href="https://policy.trade.ec.europa.eu/news/eu-and-china-launch-cross-border-data-flow-communication-mechanism-2024-08-28_en?ref=blef.fr">EU and China launch cross-border data flow communication mechanism</a> — It's an official statement saying that the EU and China will re-discuss the policy about data transfer out of China for European companies, which is difficult. </li><li><a href="https://modal.com/blog/analytics-stack?ref=blef.fr">Building a cost-effective analytics stack with Modal, dlt, and dbt</a> — A great example of how you can build a small analytics stack in today's world with dlt, dbt and Modal, a serverless platform to run Python stuff. 
The article contains a lot of code snippets to understand what's under the hood.</li><li><a href="https://www.jesse-anderson.com/2024/08/data-teams-survey-2024-results/?ref=blef.fr">Data teams survey 2024</a> — Jesse Anderson released the results of his survey about the state of data teams in 2024.</li><li><a href="https://maxhalford.github.io/blog/python-daily-cache/?ref=blef.fr">Daily cache implementation in Python</a> — A highly effective approach for caching when working with large datasets stored in distant buckets is to implement a local cache. It avoids the need to repeatedly download the data.</li><li><a href="https://www.tweag.io/blog/2024-06-06-safer-composable-python/?ref=blef.fr">Safe composable Python</a> — A good article about function composition and testing in Python and how it all articulates together.</li><li><a href="https://www.startdataengineering.com/post/parts-of-dataengineering/?ref=blef.fr">What are the key parts of data engineering</a> — A simple way to present the key parts of data engineering.</li><li><a href="https://blogs.halodoc.io/automation-for-error-handling-in-data-warehouse/?ref=blef.fr">Automation strategies for monitoring and self-healing of data pipelines</a> — I like the concept of self-healing pipelines, tho I'm not sure it's really idempotent or that it leads to great management of data assets; still, the article also relates to data contracts and the problem might be solved another way.</li><li><a href="https://medium.engineering/laying-the-foundations-lists-in-mediums-feature-store-part-1-8fc075b5d355?ref=blef.fr">Medium feature store, how do they store lists</a> — Medium built a feature store powering their recommendation system (which could work better tbh); in this blog they explain how they decided to store features of type list.</li><li><a href="https://www.nytimes.com/athletic/5697684/2024/09/03/football-analytics-uk-evolution/?ref=blef.fr">How UK football relies heavily on data</a> — It's common knowledge that 
Liverpool FC won multiple titles recently by being data-driven. This article shows how data teams are becoming larger and larger in the clubs. In the Premier League top 6, the average data team headcount is 14.</li><li><a href="https://engineering.atspotify.com/2024/09/are-you-a-dalia-how-we-created-data-science-personas-for-spotifys-analytics-platform/?ref=blef.fr">Spotify data science personas</a> — The data science role is evolving and Spotify proposed multiple personas among their data teams.</li><li><a href="https://engineering.atspotify.com/2024/08/unlocking-insights-with-high-quality-dashboards-at-scale/?ref=blef.fr">Unlocking insights with high-quality dashboards at scale</a> — A checklist of stuff that you should have a look at to build high-quality dashboards. They even developed a Dashboard portal to improve dashboard usage and discovery.</li><li><a href="https://ibis-project.org/posts/farewell-pandas/?ref=blef.fr">Ibis drops pandas backend and fully embraces DuckDB</a> — It's a big choice: moving forward, Ibis, a multi-backend dataframe library, decided to drop pandas support and use DuckDB by default. The article says that DuckDB is way faster and covers the feature gap, and that pandas was mostly annoying because of <em>NaN</em> for null values, whereas it's <em>NULL</em> for all the other backends.</li><li><a href="https://duckdb.org/2024/08/15/duckcon5?ref=blef.fr">DuckCon #5 videos</a> — All the videos from the 5th DuckCon in Seattle are on YouTube. 
I have not had the time yet to look at them, but I think awesome things are waiting for us behind a click.</li><li><a href="https://hightouch.com/blog/migrating-to-iceberg-lakehouse?ref=blef.fr">Should you be migrating to an Iceberg Lakehouse?</a> — This is an excellent question, and a good starting point for considering whether you should change all your assets to Iceberg.</li><li><a href="https://aws.amazon.com/about-aws/whats-new/2024/08/amazon-s3-conditional-writes?ref=blef.fr">Amazon S3 now supports conditional writes</a> — I think it's a great start for table formats.</li><li><a href="https://www.getdbt.com/resources/guides/the-analytics-development-lifecycle?ref=blef.fr">The analytics development lifecycle by dbt</a> — Tristan Handy proposed a framework (very Enterprise) to rethink how the analytics workflow should work as of today. The article is 37 minutes long so I did not read it all, but I saw the holy DevOps/DataOps infinity sign, which gave me an instant headache. Toby from SQLMesh answered on <a href="https://www.linkedin.com/feed/update/urn:li:activity:7240183170715820033/?updateEntityUrn=urn%3Ali%3Afs_updateV2%3A%28urn%3Ali%3Aactivity%3A7240183170715820033%2CFEED_DETAIL%2CEMPTY%2CDEFAULT%2Cfalse%29&ref=blef.fr">LinkedIn</a>, drama time.</li><li><a href="https://tobikodata.com/making-sqlmesh-faster.html?ref=blef.fr">Making SQLMesh faster</a> — the road for SQLMesh is crystal clear: they want to be the faster alternative to unmanageable large dbt projects, so they work on execution speed.</li><li><a href="https://fromanengineersight.substack.com/p/issue-39-whats-your-question?ref=blef.fr">What's your question?</a></li></ul><hr><p>See you next week ❤️</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 24.34 ]]></title>
                    <description><![CDATA[ Data News #24.34 — Forward Data Conference guest speakers, Data Engineering for AI/ML, AI news and a lot of great fast news. ]]></description>
                    <link><![CDATA[ /data-news-week-24-34/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 66c82ba23343f50001d2ee17 ]]></guid>
                    <pubDate><![CDATA[ 2024-08-24 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2024/08/image-1.png" class="kg-image" alt="" loading="lazy" width="1500" height="1000" srcset="https://www.blef.fr/content/images/size/w600/2024/08/image-1.png 600w, https://www.blef.fr/content/images/size/w1000/2024/08/image-1.png 1000w, https://www.blef.fr/content/images/2024/08/image-1.png 1500w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">News again.. (</span><a href="https://diymag.com/review/live/fred-again-alexandra-palace-london-september-2023?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p>It's been 3 weeks. </p><p>Summer continues and I hope this new edition finds you well, having had a great vacation and a nice break before getting back to business in September. Content and articles have been a little slow over the last few weeks and that's to be expected, but I feel it's gonna get back to business as usual soon.</p><p>Some personal news: in September things will be changing professionally on my side; I'll be slowly leaving the freelancing world. More details soon in a 2-part article I'm writing about it. I can't wait to tell you more, to be honest. Still, the newsletter is gonna keep the same formula.</p><h1 id="events-%E2%9C%A8">Events ✨</h1><figure class="kg-card kg-image-card"><a href="https://www.billetweb.fr/forward-data-conference?ref=blef.fr"><img src="https://www.blef.fr/content/images/2024/08/image.png" class="kg-image" alt="" loading="lazy" width="700" height="200" srcset="https://www.blef.fr/content/images/size/w600/2024/08/image.png 600w, https://www.blef.fr/content/images/2024/08/image.png 700w"></a></figure><p>As you may know I'm co-organising the <a href="https://www.forward-data-conference.com/?ref=blef.fr">Forward Data Conference</a> on November 25th in Paris. 
The Forward Data Conference will be a day to shape the future of the data community, where teams can come to learn and grow together. There are still tickets left—we have sold around 60% of them.</p><p>We have started to announce a few guest speakers for the conference that I can't wait to hear on stage. At the moment we have announced:</p><ul><li><a href="https://www.linkedin.com/in/josephreis/?ref=blef.fr">Joe Reis</a>, best-selling author and data engineer; he will speak about the new art of data modeling</li><li><a href="https://www.linkedin.com/in/hfmuehleisen/?ref=blef.fr">Hannes Mühleisen</a>, co-creator of DuckDB</li><li><a href="https://www.linkedin.com/in/clairelebarz/?ref=blef.fr">Claire Lebarz</a>, Chief Data and AI at Malt, who previously worked at Airbnb</li><li><a href="https://www.linkedin.com/in/virginiecornu/?ref=blef.fr">Virginie Cornu</a>, Co-founder / CTPO and previously VP data at Jellysmack</li></ul><p>You can use the <strong>BLEF_FWD24</strong> promo-code to get a 15% reduction on your ticket.</p><div class="kg-card kg-button-card kg-align-center"><a href="https://www.billetweb.fr/forward-data-conference?ref=blef.fr" class="kg-btn kg-btn-accent">Get tickets for Forward Data Conference</a></div><hr><p>Transition to another event. Demetrios Brinkmann is organising on Sep 12—in 18 days—a <strong>free</strong> online conference called <a href="https://home.mlops.community/public/events/dataengforai?ref=blef.fr">Data Engineering for AI/ML</a>; the agenda is pretty packed and the lineup is full of awesome speakers (Joe and Hannes will be there ☺️). 
The idea of the conference is to go deeper into the current state of AI/ML and how data engineering in 2024 serves ML and AI teams.</p><div class="kg-card kg-button-card kg-align-center"><a href="https://home.mlops.community/public/events/dataengforai?ref=blef.fr" class="kg-btn kg-btn-accent">Register for Data Engineering AI/ML</a></div><h1 id="ai-news-%F0%9F%A4%96">AI News 🤖</h1><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2024/08/image-2.png" class="kg-image" alt="" loading="lazy" width="1296" height="730" srcset="https://www.blef.fr/content/images/size/w600/2024/08/image-2.png 600w, https://www.blef.fr/content/images/size/w1000/2024/08/image-2.png 1000w, https://www.blef.fr/content/images/2024/08/image-2.png 1296w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">The 3 last OpenAI co-founders (credits: HBO)</span></figcaption></figure><ul><li><a href="https://www.entrepreneur.com/business-news/chatgpt-cofounders-leaders-leaving-openai-3-left-of-11/478125?ref=blef.fr">Drama at OpenAI, people leaving</a> (again) — Only 3 of the 11 original co-founders are still at OpenAI. 
In July, reports were saying that OpenAI could be on track to make a $5b loss.</li><li><a href="https://openai.com/index/introducing-structured-outputs-in-the-api/?ref=blef.fr">OpenAI structured outputs</a> — You can now force the OpenAI API to return output conforming to a specific enforced JSON schema when calling.</li><li>New image generation capabilities — The German lab <a href="https://www.linkedin.com/feed/update/urn:li:activity:7226119613372108800/?updateEntityUrn=urn%3Ali%3Afs_updateV2%3A%28urn%3Ali%3Aactivity%3A7226119613372108800%2CFEED_DETAIL%2CEMPTY%2CDEFAULT%2Cfalse%29&ref=blef.fr">Black Forest</a> created more-realistic-than-ever images with their model Flux.</li><li><a href="https://engineering.fb.com/2024/08/14/production-engineering/how-meta-animates-ai-generated-images-at-scale/?ref=blef.fr">How Meta animates AI-generated images at scale</a>.</li><li>Meta's <a href="https://x.com/ylecun/status/1818167736813711686?ref=blef.fr">segment anything model</a> (SAM v2) is impressive at identifying anything in images.</li><li><a href="https://www.etsy.com/codeascraft/machine-learning-in-content-moderation-at-etsy?ref=blef.fr">ML in content moderation at Etsy</a>.</li><li><a href="https://www.linkedin.com/blog/engineering/search/introducing-semantic-capability-in-linkedins-content-search-engine?ref=blef.fr">Semantics in the LinkedIn search engine</a>.</li><li><a href="https://survey.stackoverflow.co/2024/ai/?ref=blef.fr">StackOverflow AI survey</a> — a few insights quoted from the survey<ul><li>76% of all respondents are using or are planning to use AI tools in their development process this year, up from 70% last year</li><li>81% agree increasing productivity is the biggest benefit that developers identify for AI tools.</li><li>70% of professional developers do not perceive AI as a threat to their job</li></ul></li><li><a href="https://raphaelvienne.substack.com/p/watermarking-generative-ai?ref=blef.fr">Watermarking Generative AI</a> — I believe in 
watermarking, and I root for user apps in the future that identify watermarks and inform users whether content is AI, not AI or unknown.</li><li><a href="https://opensource.org/deepdive/drafts/open-source-ai-definition-draft-v-0-0-9?ref=blef.fr">The open-source AI definition (v0.0.9)</a> — An attempt to put words on a definition, in the open, of what AI is: "an AI system is a machine-based system that, for explicit or implicit objectives, infers, from the input it receives, how to generate outputs such as predictions, content, recommendations, or decisions that can influence physical or virtual environments".</li><li><a href="https://towardsdatascience.com/what-nobody-tells-you-about-rags-b35f017e1570?ref=blef.fr">What nobody tells you about RAG</a> — Large deep-dive about RAGs.</li></ul><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li><a href="https://seattledataguy.substack.com/p/timeless-skills-for-data-engineers?ref=blef.fr">Timeless skills for data engineers and analysts</a> — Benjamin proposes 4 non-technical skills that are useful when doing the data work; the first two are "thinking in systems" and "data intuition". With the current resurgence of data modeling, I think both are critical parts of the modeling work and need to be looked at.</li><li><a href="https://mikkeldengsoe.substack.com/p/how-top-data-teams-are-structured?ref=blef.fr">How top data teams are structured</a> — Mikkel had a look at 40 data teams and analysed the way they are structured. As an output we get role ratios, giving a better understanding of team composition.</li><li><a href="https://www.wsj.com/articles/mainframes-find-new-life-in-ai-era-1e32b951?ref=blef.fr">Mainframes find new life with AI</a> — Obviously we will never get rid of mainframes, and IBM says clients want to run AI on them. 
I think it's time to learn machine learning in COBOL.</li><li><a href="https://location.foursquare.com/resources/blog/leadership/modern-data-platform-an-unbundling-of-a-traditional-data-warehouse/?ref=blef.fr">Foursquare modern data platform</a> — A classic modern data stack, but still interesting to see that Foursquare—like large companies—is going multi-technology by having multiple storage, processing engine and notebook technologies. </li><li><a href="https://aws.amazon.com/blogs/opensource/amazons-exabyte-scale-migration-from-apache-spark-to-ray-on-amazon-ec2/?ref=blef.fr">Amazon migration from Spark to Ray</a> — An exabyte-scale migration; in the end Amazon saves a lot of processing time when it comes to file compaction.</li><li><a href="https://bytes.swiggy.com/hermes-a-text-to-sql-solution-at-swiggy-81573fb4fb6e?ref=blef.fr">Design a text-to-SQL solution</a> — An Indian food delivery company, called Swiggy, developed an internal text-to-SQL solution that users can interact with via Slack. The article describes all their thoughts and the challenges they faced while working on it.</li><li><a href="https://github.com/supabase-community/postgres-new?ref=blef.fr">In browser WASM Postgres</a> — Recently a company ported Postgres to WASM and now you can run Postgres within your browser without any server; you can test it at <a href="https://postgres.new/?ref=blef.fr">postgres.new</a> (only desktop for the moment). You can see more on the <a href="https://pglite.dev/?ref=blef.fr">pglite</a> website.</li><li><a href="https://www.geteppo.com/blog/why-we-replaced-airflow-in-our-experimentation-platform?ref=blef.fr">Why we replaced Airflow in our experiment platform</a> — Eppo reached a job concurrency limit with Airflow (they wanted to run 50k concurrent experiments) and decided to switch to something else. 
After trying a few different things, they decided to develop their own tech.</li><li><a href="https://airflowsummit.org/?ref=blef.fr">Airflow Summit in San Francisco</a> — Airflow turns 10 this year. The Airflow Summit takes place in San Francisco on September 10-12. This year will also mark the start of <a href="https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+3.0?ref=blef.fr">Airflow 3.0</a>, aiming for a release in March next year. If we look at the Confluence page, Airflow 3.0 will feature a new web UI, DAG versioning, remote execution (from cloud to on-prem), data assets (a renaming of Datasets, like Dagster's) and more. That's a great milestone. </li><li><a href="https://www.astronomer.io/blog/airflow-dbt-next-chapter/?ref=blef.fr">Airflow and dbt, in Astronomer</a> — Orchestrating dbt within an orchestrator is one of the most discussed topics among data teams using dbt. It's also a large adoption lever for companies like Dagster, judging from the private discussions I have (the Dagster-dbt integration is top-notch). Astronomer had to go in the same direction and, through <a href="https://github.com/astronomer/astronomer-cosmos?ref=blef.fr">cosmos</a> and a better integration, now proposes it as well.</li><li><a href="https://blog.det.life/no-data-engineers-dont-need-dbt-30573eafa15e?ref=blef.fr">No, data engineers don't need dbt</a> — Common sense but good reminders. If you do ETL instead of ELT and you don't have a warehouse, dbt might not be the best fit.</li><li><a href="https://www.uber.com/en-DE/blog/sparkle-modular-etl/?ref=blef.fr">Sparkle: write Spark pipelines in YAML at Uber</a> — Uber doing uber things.
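The declarative idea behind such tools can be sketched in a few lines of Python: the pipeline is described as data (what a parsed YAML file would give you) and a small runner interprets it. A toy illustration only — the spec shape and step names are invented, not Sparkle's actual format:

```python
# A pipeline described as data; this dict is what a parsed YAML file would give you.
pipeline_spec = {
    "source": [1, 2, 3, 4],
    "steps": [
        {"op": "filter", "predicate": "even"},
        {"op": "map", "fn": "square"},
    ],
}

# The runner maps declarative op names onto actual code.
PREDICATES = {"even": lambda x: x % 2 == 0}
FUNCTIONS = {"square": lambda x: x * x}

def run(spec):
    rows = list(spec["source"])
    for step in spec["steps"]:
        if step["op"] == "filter":
            rows = [r for r in rows if PREDICATES[step["predicate"]](r)]
        elif step["op"] == "map":
            rows = [FUNCTIONS[step["fn"]](r) for r in rows]
        else:
            raise ValueError(f"unknown op: {step['op']}")
    return rows

result = run(pipeline_spec)  # keeps 2 and 4, then squares them: [4, 16]
```

The benefit is that the YAML stays reviewable and standardised while the runner evolves independently — which is exactly the pitch of these declarative platforms.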
Still, <a href="https://jobs.picnic.app/en/blogs/yaml-developers-and-the-declarative-data-platforms?ref=blef.fr">declarative platforms</a> defined in YAML are the way to go to standardise data work.</li><li><a href="https://netflixtechblog.medium.com/etl-development-life-cycle-with-dataflow-9c70c64aba7b?ref=blef.fr">ETL development life-cycle with Dataflow</a> — How Netflix also uses YAML to write Dataflow jobs.</li><li><a href="https://medium.com/pinterest-engineering/delivering-faster-analytics-at-pinterest-a639cdfad374?ref=blef.fr">StarRocks usage at Pinterest for faster analytics</a>.</li><li><a href="https://jack-vanlightly.com/blog/2024/8/7/table-format-comparisons-how-do-the-table-formats-represent-the-canonical-set-of-files?ref=blef.fr">Table format comparisons</a> — An honest, linear review of 4 formats (the 3 main ones plus Apache Paimon). The first part details how formats manage physical files; the second part is about <a href="https://jack-vanlightly.com/blog/2024/8/13/table-format-comparisons-append-only-tables-and-incremental-reads?ref=blef.fr">append-only tables and incremental reads</a>.</li><li><a href="https://www.snowflake.com/en/blog/polaris-catalog-open-source/?ref=blef.fr">Polaris catalog is open-source</a> — Snowflake released the expected Polaris catalog.
On this <a href="https://www.blef.fr/databricks-snowflake-and-the-future/">matter</a>, it has been reported that Databricks finally acquired Tabular for $2b while Snowflake tried to get them for $600m.</li><li><a href="https://cloud.google.com/blog/products/databases/announcing-sql-support-for-bigtable?ref=blef.fr">BigTable supports SQL</a> (and this is something) and <a href="https://cloud.google.com/blog/products/data-analytics/new-managed-service-for-apache-kafka/?ref=blef.fr">GCP can run managed Kafka</a>.</li><li><a href="https://datafordoers.substack.com/p/the-revolving-door-of-bi?ref=blef.fr">The revolving doors of BI</a> — I frequently observe companies switching BI tools every 2-3 years, hoping that the latest solution will resolve all their issues. While these migrations often provide short-term relief, they inevitably lead to another dead end. The initial success of the migration is typically due to the fact that, during the transition, companies address their technical debt and apply the lessons learned from previous mistakes. 
However, without addressing the underlying issues, the cycle is likely to repeat itself.</li><li><a href="https://blog.duolingo.com/growth-model-duolingo/?ref=blef.fr">Metrics management at Duolingo</a> — I should go back to learning German in order to improve Duolingo metrics.</li><li><a href="https://airbyte.com/blog/how-we-test-airbyte-and-marketplace-connectors?ref=blef.fr">How we test Airbyte and marketplace connectors</a> — The exhaustive test suite Airbyte put in place to check that connectors behave correctly.</li><li><a href="https://blog.dagworks.io/p/slack-summary-pipeline-with-dlt-ibis?ref=blef.fr">Slack summary ELT pipeline</a> — It showcases how you can create an ELT pipeline with dlt, Ibis and <a href="https://github.com/dagworks-inc/hamilton?ref=blef.fr">Hamilton</a>—a Python library to create transformation DAGs.</li><li><a href="https://github.com/dbecorp/snowflakecli?ref=blef.fr">snowflakecli</a> — A DuckDB-powered command line interface for Snowflake security, governance, operations, and cost optimization.</li><li>A Postgres extension (<a href="https://github.com/duckdb/pg_duckdb?ref=blef.fr">pg_duckdb</a>) that brings DuckDB as an analytical engine within Postgres has been backed by <a href="https://www.theregister.com/2024/08/20/postgresql_duckdb_extension/?ref=blef.fr">Microsoft</a> and sparked a bit of <a href="https://motherduck.com/blog/pg_duckdb-postgresql-extension-for-duckdb-motherduck/?ref=blef.fr">discussion</a> in the <a href="https://davidsj.substack.com/p/the-unrealised-promise-of-htap?ref=blef.fr">community</a>.</li><li><a href="https://hpi.de/fileadmin/user_upload/fachgebiete/naumann/projekte/RDBMSGenealogy/RDBMS_Genealogy_V6.pdf?ref=blef.fr">Genealogy of databases</a> — Like a subway map depicting how databases have evolved since prehistory.</li><li>Martin Kleppmann is working on a version 2 of <a href="https://www.oreilly.com/library/view/designing-data-intensive-applications/9781491903063/?ref=blef.fr">Designing
Data-Intensive Applications</a>.</li></ul><p></p><h4 id="final-stuff">Final stuff</h4><ul><li><a href="https://engineeringblog.yelp.com/2024/08/dbt-Generic-Tests-in-Sessions-Validation-at-Yelp.html?ref=blef.fr">dbt Generic Tests in Sessions Validation at Yelp</a>.</li><li><a href="https://www.letsql.com/posts/builtin-predict-udf/?ref=blef.fr">LETSQL inference for DataFrames</a> — I don't fully understand the post, but it looks cool.</li></ul> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 24.30 ]]></title>
                    <description><![CDATA[ Data News #24.30 — TV shopping for foundational models (OpenAI, Mistral, Meta, Microsoft, HF), BigQuery newly released stuff, and more obviously. ]]></description>
                    <link><![CDATA[ /data-news-week-24-30/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 669a408fb1f34700011776ee ]]></guid>
                    <pubDate><![CDATA[ 2024-07-26 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/photo-1685788467854-85a80ce5dd29?fm=jpg&amp;q=60&amp;w=3000&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D" class="kg-image" alt="a view of a city at sunset from a high rise" loading="lazy" width="3000" height="2001" srcset="https://images.unsplash.com/photo-1685788467854-85a80ce5dd29?fm=jpg&amp;q=60&amp;w=600&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D 600w, https://images.unsplash.com/photo-1685788467854-85a80ce5dd29?fm=jpg&amp;q=60&amp;w=1000&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D 1000w, https://images.unsplash.com/photo-1685788467854-85a80ce5dd29?fm=jpg&amp;q=60&amp;w=1600&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D 1600w, https://images.unsplash.com/photo-1685788467854-85a80ce5dd29?fm=jpg&amp;q=60&amp;w=2400&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D 2400w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Tallinn (</span><a href="https://unsplash.com/photos/a-view-of-a-city-at-sunset-from-a-high-rise-0mR6KB4eDqM?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p>Dear members, it's Summer Data News, the only news you can consume by the pool, the beach or at the office—if you're not lucky. This week, I'm writing from the Baltics, nomading a bit in Eastern and Northern Europe.</p><p>I'm pleased to announce that we have successfully closed the CfP for Forward Data Conf. We received nearly 100 submissions and the program committee is currently reviewing them all.
Many thanks to everyone who trusted us and submitted a talk for the conference (especially the DN members!).</p><p>We also announced our first guest speaker, <a href="https://www.linkedin.com/in/josephreis/?ref=blef.fr">Joe Reis</a>. Joe is a great speaker; he wrote <a href="https://www.oreilly.com/library/view/fundamentals-of-data/9781098108298/?ref=blef.fr">Fundamentals of Data Engineering</a>, one of the bibles of data engineering, and I can't wait to hear him at Forward Data. He is currently writing his second book, about data modeling.</p><p><strong>Forward Data is a 1-day conference I will co-organise on November 25th, in Paris.</strong> It will be a day to shape the future of the data community, where teams can come to learn and grow together.</p><div class="kg-card kg-button-card kg-align-center"><a href="https://www.billetweb.fr/forward-data-conference?ref=blef.fr" class="kg-btn kg-btn-accent">Buy tickets for Forward Data Conf</a></div><p></p><h1 id="ai-news-%F0%9F%A4%96">AI News 🤖</h1><p>Some days, AI News is like a TV shopping show. Over the past two weeks, a few dozen models have been released, and I'd like to introduce them to you.</p><h2 id="new-models-stuff">New models &amp; stuff</h2><ul><li>OpenAI — OpenAI is trying to keep leading the charge, releasing models the way Apple releases products.<ul><li><a href="https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/?ref=blef.fr">GPT-4o mini: advancing cost-efficient intelligence</a> — After GPT-4o, which brought great performance and became the new flagship model, available in the free tier, OpenAI released a smaller version of it, the mini. According to the benchmarks, GPT-4o mini is close to GPT-4o in performance and best in class among the small models.
Even if OpenAI did not disclose how small it is, a few people are claiming it's an 8B model.</li><li><a href="https://www.linkedin.com/feed/update/urn:li:activity:7221852875428122626/?updateEntityUrn=urn%3Ali%3Afs_updateV2%3A%28urn%3Ali%3Aactivity%3A7221852875428122626%2CFEED_DETAIL%2CEMPTY%2CDEFAULT%2Cfalse%29&ref=blef.fr">Fine-tune GPT-4o for free</a> — Until September 23, 2024, GPT-4o mini is free to fine-tune. This means each organization will get 2M tokens per 24-hour period to train the model, and any overage will be charged at $3.00/1M tokens. Worth trying [<a href="https://platform.openai.com/docs/guides/fine-tuning/preparing-your-dataset?ref=blef.fr">docs</a>].</li><li><a href="https://openai.com/index/searchgpt-prototype/?ref=blef.fr">SearchGPT, new OpenAI product</a> — Yesterday, OpenAI unveiled their latest product, SearchGPT, a prototype AI search application. The system generates answers while providing reliable sources. This announcement coincides with Google Search's recent report of an 11% increase in revenue for the last quarter, reaching $64 billion. It shows that search did not disappear with the advent of GPTs.</li></ul></li><li>Meta — It's crazy how Meta, which suffered an unintentional leak of the LLaMA weights on torrents a year ago, is now the company advocating for open models and leading this part of the ecosystem. <ul><li><a href="https://llama.meta.com/?ref=blef.fr">LLaMA 3.1 is out</a> — The model is out in 3 versions: the largest one with 405B parameters and 2 smaller ones (70B and 8B). They even released a 92-page <a href="https://scontent.ftll3-1.fna.fbcdn.net/v/t39.2365-6/452387774_1036916434819166_4173978747091533306_n.pdf?_nc_cat=104&ccb=1-7&_nc_sid=3c67a6&_nc_ohc=7qSoXLG5aAYQ7kNvgFmXYFc&_nc_ht=scontent.ftll3-1.fna&oh=00_AYCLzWIqE5zXXZcETIDOED5eNaSHiQ9eb2XC_IDDSsCY7g&oe=66A91E0D&ref=blef.fr">whitepaper</a> explaining how they trained it, the expected performance and what you can do with it.
Our dear friend Mark even wrote an ode to open source with some kind of <a href="https://about.fb.com/news/2024/07/open-source-ai-is-the-path-forward/?ref=blef.fr">manifesto</a>. The <a href="https://ai.meta.com/blog/meta-llama-3-1/?ref=blef.fr">announcement</a> works as a summary if you want the short version.</li><li><a href="https://www.theverge.com/2024/7/18/24201041/meta-multimodal-llama-ai-model-launch-eu-regulations?ref=blef.fr">Meta won’t release its multimodal Llama AI model in the EU</a> — It would have been perfect if Meta complied with the rules, but in the end Meta is Meta: lobbying, and then, like a crying kid punished for misbehaving, announcing they will not release their super multimodal AI in Europe because the regulatory environment is too "unpredictable". A way of saying they used training data they should not have.<br><br>Last point related to tech giants (Apple, Nvidia, Salesforce) <a href="https://www.newsnationnow.com/business/tech/ai/ai-youtube-stolen-subtitles/?ref=blef.fr#:~:text=Tech%20giants%20Apple%2C%20Anthropic%2C%20Nvidia%20and%20Salesforce%20pilfered%20data%20from,48%2C000%20channels%20to%20AI%20programs.">stealing YouTube subtitles</a> to train foundational models.</li><li><a href="https://huggingface.co/collections/facebook/chameleon-668da9663f80d483b4c61f58?ref=blef.fr">Meta Chameleon</a> — Finally, Chameleon is available on HuggingFace. It's Meta's mixed-modal early-fusion foundation model, which means it can understand and generate both text and images.</li></ul></li><li>Mistral — The French company is keeping pace with the other giants on open models.<ul><li><a href="https://mistral.ai/news/mathstral/?ref=blef.fr">MathΣtral</a> — A 7B model for math reasoning and scientific discovery, under an Apache license.
I'll try it soon for something I'm cooking.</li><li><a href="https://mistral.ai/news/codestral-mamba/?ref=blef.fr">Codestral Mamba</a> — A 7B model for code generation, under an Apache license.</li><li><a href="https://mistral.ai/news/mistral-large-2407/?ref=blef.fr">Mistral Large 2</a> — Competing directly with the large LLaMA 3.1, Mistral Large 2 has 123B parameters and is at the moment the closest model to GPT-4o, which still sets the benchmark.</li></ul></li><li><a href="https://azure.microsoft.com/en-us/blog/announcing-phi-3-fine-tuning-new-generative-ai-models-and-other-azure-ai-updates-to-empower-organizations-to-customize-and-scale-ai-applications/?ref=blef.fr">Microsoft Phi-3 models</a> — Microsoft continues to try hard at the game with their Phi-3 models available in Azure. But who cares?</li><li><a href="https://huggingface.co/blog/smollm?ref=blef.fr">SmolLM - blazingly fast and remarkably powerful</a> — HuggingFace released new state-of-the-art small models (135M, 360M and 1.7B parameters) trained on an open corpus.</li></ul><p></p><h2 id="articles">Articles</h2><p>Because AI and GenAI are not only about models, a few great articles have been written as well.</p><ul><li>Twitter uses your data to train xAI Grok — Twitter recently added an <a href="https://x.com/settings/grok_settings?ref=blef.fr">opt-in</a> to utilise your X posts as well as your user interactions, inputs and results with Grok for training and fine-tuning purposes.
The opt-in is only available on desktop.</li><li><a href="https://engineering.fb.com/2024/07/16/developer-tools/ai-lab-secrets-machine-learning-engineers-moving-fast/?ref=blef.fr">AI Lab: The secrets to keeping machine learning engineers moving fast</a> — About the AI Lab Meta put in place to maintain ML engineers' velocity, giving them the ability to A/B test models and avoid regressions.</li><li><a href="https://medium.com/pinterest-engineering/building-pinterest-canvas-a-text-to-image-foundation-model-aa34965e84d9?ref=blef.fr">Building Pinterest Canvas, a text-to-image foundation model</a> — Great article about creating an image generation model for product backgrounds.</li><li><a href="https://github.com/run-llama/llama_parse/blob/main/examples/multimodal/multimodal_rag_slide_deck.ipynb?ref=blef.fr">Multimodal RAG pipeline</a> — A notebook explaining how you can index and build a RAG on deck slides.</li><li><a href="https://raphaelvienne.substack.com/p/watermarking-generative-ai?ref=blef.fr">Watermarking Generative AI: ensuring ownership and transparency</a> — This is the future: being able to watermark all generated content to ensure trust and transparency for end consumers. </li><li><a href="https://www.cio.bund.de/SharedDocs/kurzmeldungen/Webs/CIO/DE/startseite/2024/ozg_aendg.html?ref=blef.fr">Germany</a> (in German, sorry) and <a href="https://www.zdnet.com/article/switzerland-now-requires-all-government-software-to-be-open-source/?ref=blef.fr">Switzerland</a> both added to their laws some kind of preference for open-source software (it's even a requirement in Switzerland). </li><li><a href="https://docs.google.com/spreadsheets/d/1BbibWUwJ5bX8Q6u_juutqRWdMFg11VAa_36NQjl9vbc/copy?ref=blef.fr">Understand Vector database in Google Sheets</a> — Playful.
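If you want the mechanics without the spreadsheet: vector search boils down to storing items as numeric embeddings and ranking them by cosine similarity against a query vector. A minimal pure-Python sketch (the 3-dimensional "embeddings" are made up; a real model outputs hundreds of dimensions, and real vector databases use approximate indexes instead of a full scan):

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity: dot product divided by the product of vector norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def search(query_vector, documents, top_k=2):
    # Score every stored vector against the query and keep the best matches.
    scored = [(doc, cosine_similarity(query_vector, vec)) for doc, vec in documents.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]

# Toy corpus: document -> embedding.
docs = {
    "data engineering": [0.9, 0.1, 0.0],
    "cooking recipes": [0.0, 0.2, 0.9],
    "sql pipelines": [0.8, 0.3, 0.1],
}
results = search([1.0, 0.2, 0.0], docs)  # "data engineering" ranks first
```

Everything else a vector database adds (indexing, filtering, persistence) is engineering around this one ranking operation.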
This is a Google Sheets template you can copy that explains how a vector database works, including search.</li><li><a href="https://huggingface.co/spaces/infinite-dataset-hub/infinite-dataset-hub?ref=blef.fr">Infinite dataset hub</a> — A generative app that creates datasets for you from a few words.</li></ul><p></p><h1 id="fast-news-%E2%9A%A1">Fast News ⚡</h1><p>Because the fast news are always the best.</p><ul><li><a href="https://towardsdatascience.com/why-it-feels-impossible-to-get-a-data-science-job-398d57de464c?ref=blef.fr">Why it feels impossible to get a data science job</a> — The data science market has become highly competitive in recent years, even more so with everyone rushing to AI and jobs shifting from being great at machine learning to being good at maintaining API orchestration. This article tries to explain why, and what to do about it.</li><li><a href="https://www.snowflake.com/engineering-blog/snowflake-brings-seamless-postgresql-and-mysql-integration-with-new-connectors/?ref=blef.fr">Snowflake brings seamless PostgreSQL and MySQL</a> — It was announced at the Snowflake summit: you can now directly ingest Postgres and MySQL from the Snowflake UI, removing the need for any other tool for these sources. The way they did it requires you to run a Docker container, which is kinda meh.</li><li><a href="https://buremba.com/blog/use-snowflake-and-duckdb-with-iceberg?ref=blef.fr">Query Snowflake Iceberg tables with DuckDB &amp; Spark to save costs</a> — That's what Iceberg tables on Snowflake unlock.
The capability to offload compute to DuckDB or Spark to save costs (or to move costs, actually).</li><li>The BigQuery team is on fire and released a lot of cool new stuff<ul><li><a href="https://cloud.google.com/bigquery/docs/table-explorer?ref=blef.fr">Table explorer</a> — an automated way to visually explore table data and create queries based on your selection of table fields.</li><li><a href="https://cloud.google.com/bigquery/docs/continuous-queries-introduction?ref=blef.fr">Continuous queries</a> — An answer to Snowflake dynamic tables: continuous queries are SQL statements that run continuously. Google announces that CQ can be used for low-latency tasks.</li><li>On the same topic as CQ, they released the <a href="https://cloud.google.com/bigquery/docs/reference/standard-sql/table-functions-built-in?ref=blef.fr#changes">changes</a> function, a SQL function that returns all rows that have changed in a table for a given time range. I think it will unlock a lot of use-cases in BigQuery.</li></ul></li><li><a href="https://engineering.mixpanel.com/how-mixpanel-delivers-funnels-up-to-7x-faster-than-the-data-warehouse-af6da1f5a982?ref=blef.fr">How Mixpanel delivers funnels up to 7x faster than the data warehouse</a> — The Mixpanel team is proud to say that they get better performance than Snowflake.</li><li>You can run Clickhouse functions in DuckDB with the <a href="https://community-extensions.duckdb.org/extensions/chsql.html?ref=blef.fr">chsql extension</a>, and there is a great post about how <a href="https://duckdb.org/2024/07/09/memory-management?ref=blef.fr">DuckDB manages memory</a>.</li><li><a href="https://ibis-project.org/posts/1tbc/?ref=blef.fr">Querying 1TB on a laptop with Python dataframes</a> — A benchmark on a laptop with 96 GB of memory, using DuckDB, DataFusion and Polars.
Crazy what we can do nowadays.</li><li><a href="https://towardsdatascience.com/data-modeling-techniques-for-the-post-modern-data-stack-03fc2e4a210c?ref=blef.fr">Data modeling techniques for the post-modern data stack</a> — A great recap of all the modeling techniques that exist out there (medallion and dimensional).</li><li><a href="https://towardsdatascience.com/parquet-file-format-everything-you-need-to-know-ea54e27ffa6e?ref=blef.fr">Parquet File Format: everything you need to know</a> — How Parquet files are written.</li><li><a href="https://developer.nvidia.com/blog/encoding-and-compression-guide-for-parquet-string-data-using-rapids/?ref=blef.fr">Encoding and Compression Guide for Parquet String Data Using RAPIDS</a> — For Nvidia geeks.</li><li><a href="https://hudi.apache.org/docs/table_types/?ref=blef.fr#merge-on-read-table">Hudi merge-on-read</a>  — Iceberg has been all over the place, but there is Hudi as well, and merge-on-read is a great feature tbh.</li><li><a href="https://netflixtechblog.com/maestro-netflixs-workflow-orchestrator-ee13a06f9c78?ref=blef.fr">Maestro: Netflix’s workflow orchestrator</a> — Five years later, Netflix finally open-sourced their orchestrator. Curious to see if it will pick up. It's written in Java and does what other orchestrators are already doing.</li><li><a href="https://roundup.getdbt.com/p/the-analytics-development-lifecycle?ref=blef.fr">The Analytics Development Lifecycle</a> — dbt Labs needs to reinvent itself: now that dbt is everywhere, they need to define the next vision, as they have pure gold in their hands.
Tristan, CEO of dbt Labs, is aiming for a new acronym, ADLC (Analytics Development Lifecycle), and provides a draft manifesto with user stories of what analytics engineers should be able to do tomorrow.</li><li><a href="https://towardsdatascience.com/deliver-your-data-as-a-product-but-not-as-an-application-99c4af23c0fb?ref=blef.fr">Deliver your data as a product, but not as an application</a>.</li></ul><p></p><h1 id="data-economy-%F0%9F%92%B0">Data economy 💰</h1><ul><li><a href="https://www.calcalistech.com/ctechnews/article/hjvuvyb000?ref=blef.fr">Google in negotiations to acquire Wiz in $23 billion deal</a>, <a href="https://www.bbc.com/news/articles/c3gdlng47k7o?ref=blef.fr">actually no</a>. Wiz is a cloud security firm.</li></ul><hr><p>See you next week ❤️</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 24.28 ]]></title>
                    <description><![CDATA[ Data News #24.28 — Catching up the news, OpenAI, Claude, kyutai and all the engineering stuff from the last 3 weeks. ]]></description>
                    <link><![CDATA[ /data-news-week-24-28/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 6688ded36a35500001f5ddd3 ]]></guid>
                    <pubDate><![CDATA[ 2024-07-13 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/photo-1514412076816-d228b5c0973c?fm=jpg&amp;w=3000&amp;auto=format&amp;fit=crop&amp;q=60&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D" class="kg-image" alt="gull flying above body of water" loading="lazy" width="3000" height="2000" srcset="https://images.unsplash.com/photo-1514412076816-d228b5c0973c?fm=jpg&amp;w=600&amp;auto=format&amp;fit=crop&amp;q=60&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D 600w, https://images.unsplash.com/photo-1514412076816-d228b5c0973c?fm=jpg&amp;w=1000&amp;auto=format&amp;fit=crop&amp;q=60&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D 1000w, https://images.unsplash.com/photo-1514412076816-d228b5c0973c?fm=jpg&amp;w=1600&amp;auto=format&amp;fit=crop&amp;q=60&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D 1600w, https://images.unsplash.com/photo-1514412076816-d228b5c0973c?fm=jpg&amp;w=2400&amp;auto=format&amp;fit=crop&amp;q=60&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D 2400w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">EuroSeagull (</span><a href="https://unsplash.com/photos/gull-flying-above-body-of-water-btQt9i0Krag?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p>Dear members, it's been a few weeks since I last caught you with a proper Data News and its collection of links. Here we are.</p><p>This week, I attended <a href="https://ep2024.europython.eu/?ref=blef.fr">EuroPython</a> in Prague. I spent most of my time at the <a href="https://dlthub.com/?ref=blef.fr">dltHub</a> booth in the sponsors hall, so I didn't attend many talks.
However, I did give a few presentations on my SQL orchestration library, <a href="https://github.com/Bl3f/yato?ref=blef.fr">yato</a>, which pairs well with dlt. A YouTube video might come out soon.</p><p>Additionally, I attended an interesting talk by a Data News reader about <a href="https://matthieu.io/dl/talks/2024-07-11-europython-yaml-engineer.pdf?ref=blef.fr">the rise of YAML engineers</a>; Matthieu has also written an <a href="https://jobs.picnic.app/en/blogs/yaml-developers-and-the-declarative-data-platforms?ref=blef.fr">article</a> about this in the past. I'm so happy to have met a few of you there 😊.</p><p>This is a great transition to remind you that I'm co-organising a 1-day conference on Nov 25th in Paris. The Forward Data Conference will be a day to shape the future of the data community, where teams can come to learn and grow together. <strong>The Call for Papers (CfP) closes in a few hours, on Sunday at 23:59</strong>. So <a href="https://conference-hall.io/public/event/9YgSSWq5AKeuQAcLyVHO?ref=blef.fr">propose a talk</a>.
Submissions are welcome in French and English.</p><figure class="kg-card kg-image-card kg-card-hascaption"><a href="https://conference-hall.io/public/event/9YgSSWq5AKeuQAcLyVHO?ref=blef.fr"><img src="https://www.blef.fr/content/images/2024/07/700x200-blef.png" class="kg-image" alt="" loading="lazy" width="700" height="200" srcset="https://www.blef.fr/content/images/size/w600/2024/07/700x200-blef.png 600w, https://www.blef.fr/content/images/2024/07/700x200-blef.png 700w"></a><figcaption><span style="white-space: pre-wrap;">Submit your talk to the Forward Data Conference!</span></figcaption></figure><p></p><h1 id="ai-news-%F0%9F%A4%96">AI News 🤖</h1><ul><li>OpenAI — Always the biggest news provider, whether announcements or dramas<ul><li><a href="https://www.tomshardware.com/tech-industry/artificial-intelligence/openai-was-hacked-revealing-internal-secrets-and-raising-national-security-concerns-year-old-breach-wasnt-reported-to-the-public?utm_campaign=socialflow&utm_medium=social&utm_source=twitter.com">OpenAI was hacked, revealing internal secrets and raising national security concerns</a> — The hacker reached OpenAI’s internal messaging systems early last year, stealing details of how OpenAI's technologies work from employees.</li><li><a href="https://time.com/6996842/microsoft-quits-openai-board-seat-antitrust-scrutiny-ai-partnerships/?ref=blef.fr">Microsoft quits OpenAI board</a> — Microsoft said they are no longer needed because the governance has improved, and at the same time they might want to avoid the antitrust issues raised by governments around the world. Apple, as expected, will not join the board either.</li><li><a href="https://spectrum.ieee.org/chatgpt-for-coding?ref=blef.fr">How good is GPT at coding, really?</a> — A research team evaluated the capabilities of GPT-3.5 in solving LeetCode problems. Although GPT-3.5 might be outdated, the findings are still somewhat relevant.
The team discovered that GPT-3.5 performed significantly better on problems that existed before its training cut-off date. However, the model struggled with correcting its own mistakes.</li><li><a href="https://medium.com/@yingjunwu/openais-acquisition-of-rockset-what-it-means-for-the-industry-c5fcfc4f1718?ref=blef.fr">OpenAI’s acquisition of Rockset, what it means</a> — I announced it a few weeks ago: OpenAI bought Rockset, a real-time analytical vector database. Customers have 2 months left to migrate away from the database, which will probably become the core of the OpenAI architecture.</li></ul></li><li><a href="https://www.youtube.com/live/hm2IJSKcYvo?ref=blef.fr">kyutai released Moshi</a> — Moshi is a "voice-enabled AI". The team at kyutai developed the model audio-first, with an audio language model, which makes the conversation with the AI feel more real (demo at 5:00 min) as it can interrupt you or kinda "think" (i.e. predict the next audio segment) while it speaks. Moshi will be part of kyutai's open-source releases and is purely local.</li><li><a href="https://www.anthropic.com/news/claude-3-5-sonnet?ref=blef.fr">Claude 3.5 Sonnet</a> — To end the tour of recent models: if you missed it, Claude 3.5 came out in June and featured great performance when "reasoning". Claude is capable of splitting the screen and building some kind of CodePen playground with a React app implementing what you're asking. There is a demo where <a href="https://x.com/Saboo_Shubham_/status/1805789967203156357?ref=blef.fr">Sonnet transformed a research paper into a simulator app</a> about the paper in one prompt.</li><li><a href="https://github.com/Sinaptik-AI/pandas-ai?ref=blef.fr">pandas-ai</a> — Give a dataframe to pandas-ai and configure a model; then you'll be able to chat with your data to get answers or chat about your questions.
Nothing new I'd say; the only difference is that the API is fairly simple.</li><li><a href="https://engineering.fb.com/2024/07/10/data-infrastructure/machine-learning-ml-prediction-robustness-meta/?ref=blef.fr">Meta’s approach to machine learning prediction robustness</a> — The principles Meta applies to bring robustness to ML.</li><li><a href="https://dropbox.tech/machine-learning/bringing-ai-powered-answers-and-summaries-to-file-previews-on-the-web?ref=blef.fr">Bringing AI-powered answers and summaries to file previews on the web</a> — Dropbox has developed a feature that generates summaries for every file in your storage. This process involves converting files, regardless of their format, into text. The text is then transformed into embeddings, which make the content easily summarisable and queryable for Q&amp;A purposes.</li><li><a href="https://blog.malt.engineering/super-powering-our-freelancer-recommendation-system-using-a-vector-database-add643fcfd23?ref=blef.fr">Recommendation system using a vector database</a> — How Malt (a freelancer platform) built a recommendation engine using current vector database technologies (with Qdrant).</li><li><a href="https://do4ds.com/?ref=blef.fr">DevOps for data science</a> — An open-source and free book covering what data scientists need to know about DevOps. It was written by someone at Posit (the company behind RStudio). It covers general knowledge about infra + code snippets.</li></ul><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li><a href="https://aws.amazon.com/blogs/aws/introducing-end-to-end-data-lineage-preview-visualization-in-amazon-datazone/?ref=blef.fr">End-to-end data lineage in AWS</a> — <strong>AWS announced DataZone to bring lineage to your data assets</strong>; from the picture it can mix datasets (?), Glue tables and jobs while giving you a green/red vision of what's up to date.
They mention column lineage, but from the picture it looks like they track columns without proper column-level lineage. The UI is AWS-tier.</li><li><a href="https://www.dataengineeringweekly.com/p/a-brief-history-of-modern-data-stack?ref=blef.fr">A brief history of modern data stack</a> — Ananth from Data Engineering Weekly wrote his views a few weeks after the modern data stack debate (read <a href="https://www.blef.fr/modern-data-stack-disappearing/">my opinion</a> on this); he considers that we are in the post-modern data stack era, with a few points that will be (or are being) implemented everywhere, especially interoperability.</li><li><a href="https://www.youtube.com/watch?app=desktop&v=T-ee0xdJ7yM&ref=blef.fr">Apache XTable</a> — XTable is a new layer that provides cross-table interoperability, so you don't need to choose only one table format out of Hudi, Delta and Iceberg. It provides abstractions and tools for the translation of lakehouse table format metadata.</li><li><a href="https://docs.snowflake.com/en/user-guide/dynamic-tables-tasks-create-iceberg?ref=blef.fr">Create dynamic Iceberg tables</a> — Snowflake added support for dynamic tables in Iceberg format. Dynamic tables are tables based on "real-time data" (or streams, or continuous pipelines). It means Snowflake can now be used simply as an engine writing continuous tables to blob storage in an open format like Iceberg — <a href="https://www.blef.fr/databricks-snowflake-and-the-future/">the future is coming</a>.</li><li><a href="https://blog.devgenius.io/creating-a-file-format-in-rust-92201498df0a?ref=blef.fr">Creating a file format in Rust</a> — An experiment showing what you need to create a new file format.
It's super interesting for understanding what's under the hood of the popular tools we often use.</li><li><a href="https://juhache.substack.com/p/data-pipelines-and-scds?ref=blef.fr">Data pipelines and SCDs</a> — Slowly changing dimensions are an important pattern to know when it comes to data engineering. Julien wrote a great article about them, explaining the 3 possible forms and the snapshot approach. His charts are great. You can also read Timo's detailed post on the Mixpanel blog on why <a href="https://mixpanel.com/blog/slowly-changing-dimension-tables-in-product-analytics/?ref=blef.fr">SCDs are the best thing for product analytics</a>.</li><li><a href="https://docs.getdbt.com/blog/semantic-layer-in-pieces?ref=blef.fr">How to build a Semantic Layer</a> — A great small guide that gives you the things to consider when going down the Semantic Layer road. Gwen gives a step-by-step method to migrate from marts to a dbt Semantic Layer.</li><li><a href="https://jorritsandbrink.substack.com/p/how-dlt-uses-apache-arrow-for-fast-pipelines?ref=blef.fr">How dlt uses Apache Arrow</a> — A great post explaining why the next generation of data tooling needs to use Arrow and how it impacts performance. The article then explains how dlt (extract and load) leverages Arrow.</li><li><a href="https://dewey.dunnington.ca/slides/scipy2024/?ref=blef.fr#/title-slide">nanoarrow, a way to technically understand Arrow</a> — Slides about a re-implementation of the Arrow framework (to be honest, it's highly technical without the video).</li><li><a href="https://www.linkedin.com/pulse/duckdb-x-dbt-make-psyduck-great-again-jean-guinvarch-bbqke/?trackingId=G2Rg6aifSUqFW9crzR%2BFvA%3D%3D&ref=blef.fr">DuckDB and dbt</a> — How, with DuckDB and dbt, you can build the transformation layer of a BI application (e.g. a Pokemon dashboard).</li><li><a href="https://duckdb.org/2024/07/05/community-extensions?ref=blef.fr">DuckDB extension mechanism</a> — DuckDB wants to provide a repository for community extensions.
This way the community will be able to extend DuckDB easily, and it will also reduce the minimal size of DuckDB, allowing for an even more portable database/engine.</li><li><a href="https://seattledataguy.substack.com/p/dont-lead-a-data-team-before-reading?ref=blef.fr">Don’t lead a data team before reading this</a> — 5 important points you should consider when leading a data team. I really like "<em>The business doesn’t care about how you solve the problem</em>" because it's a good reminder for my technical audience that your role as a data person is to empower others with data, so boring tech is often the best.</li><li><a href="https://vutr.substack.com/p/apache-kafka-part-1-overview?ref=blef.fr">Apache Kafka overview</a> — If you're not familiar with Kafka this is a great overview.</li><li><a href="https://semyonsinchenko.github.io/ssinchenko/post/porting_deequ_to_sparkconnect/?ref=blef.fr">Spark-connect, what's this</a> — Very detailed post about what spark-connect is and why it will change the way we do Spark. It highlights how it simplifies and enhances the development process, particularly through its compatibility with various languages and the potential it unlocks for creating a data quality process.</li><li><a href="https://discord.com/blog/how-discord-uses-open-source-tools-for-scalable-data-orchestration-transformation?ref=blef.fr">How Discord uses Dagster</a> — 2000 dbt tables, covered by over 12000 dbt tests. Discord uses the dbt &lt;&gt; Dagster integration to power their whole data asset management.</li></ul><p></p><h2 id="stories">Stories</h2><ul><li><a href="https://www.canva.dev/blog/engineering/product-analytics-event-collection/?ref=blef.fr">How Canva collects 25 billion events per day</a> — Protobuf + Amazon Kinesis.
</li><li><a href="https://yokota.blog/2024/07/11/in-memory-analytics-for-kafka-using-duckdb/?ref=blef.fr">In-memory analytics for Kafka using DuckDB</a> — The author developed <a href="https://github.com/rayokota/kwack?ref=blef.fr">kwack</a>, a small utility that allows you to run SQL queries on top of Kafka streams (in-memory).</li><li><a href="https://www.atlassian.com/blog/artificial-intelligence/ai-prompts-for-marketing?ref=blef.fr">40 AI prompts to boost your marketing team’s creativity</a> — An Atlassian collection of 40 prompts for marketing tasks. I'm not sure I'm happy to see this on the Atlassian blog.</li><li><a href="https://netflixtechblog.com/a-recap-of-the-data-engineering-open-forum-at-netflix-6b4d4410b88f?ref=blef.fr">A Recap of the Data Engineering Open Forum at Netflix</a> — Videos from the Netflix Data Engineering Open Forum are out on YouTube, and this post is a recap + takeaways.</li><li><a href="https://luminousmen.com/post/senior-engineer-fatigue/?ref=blef.fr">Senior engineer fatigue</a> — As you gain experience and your career progresses, you will start to feel fatigue as an engineer. Senior fatigue is characterised not by a decline in productivity but by a deliberate deceleration. I find the first part, about the paradox of slowing down to speed up, so true that I warmly recommend reading it.</li></ul><div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">💡</div><div class="kg-callout-text">Going back in time — <a href="https://blog.cleancoder.com/uncle-bob/2016/01/04/ALittleArchitecture.html?ref=blef.fr">A Little Architecture</a></div></div><p></p><hr><p>See you next week (probably) ❤️ — I'll take random breaks this summer in order to prepare for the changes coming in my professional and personal life in September.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Databricks, Snowflake and the future ]]></title>
                    <description><![CDATA[ Databricks and Snowflake summits featured major announcements, including open-sourcing their catalogs and enhancing Iceberg compatibility. This article covers all the key updates you need to know. ]]></description>
                    <link><![CDATA[ /databricks-snowflake-and-the-future/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 666c7b3b5d699d00018ca4bd ]]></guid>
                    <pubDate><![CDATA[ 2024-06-21 ]]></pubDate>
                    <content>
                        <![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2024/06/image-2.png" class="kg-image" alt="" loading="lazy" width="2000" height="1500" srcset="https://www.blef.fr/content/images/size/w600/2024/06/image-2.png 600w, https://www.blef.fr/content/images/size/w1000/2024/06/image-2.png 1000w, https://www.blef.fr/content/images/size/w1600/2024/06/image-2.png 1600w, https://www.blef.fr/content/images/size/w2400/2024/06/image-2.png 2400w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Welcome to the snow world (</span><a href="https://unsplash.com/photos/person-holding-ski-poles-in-the-middle-of-snow-during-winter-season-Dzd_O5cnr0Y?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p>Every year, the competition between Snowflake and Databricks intensifies, using their annual conferences as a platform for demonstrating their power. This year, the Snowflake Summit was held in San Francisco from June 2 to 5, while the Databricks Data+AI Summit took place 5 days later, from June 10 to 13, also in San Francisco. The conferences were expecting 20,000 and 16,000 participants respectively.</p><p>Snowflake is listed and had annual <a href="https://www.macrotrends.net/stocks/charts/SNOW/snowflake/revenue?ref=blef.fr#:~:text=Snowflake%20annual%20revenue%20for%202024,a%20105.95%25%20increase%20from%202021.">revenue of $2.8 billion</a>, while Databricks achieved $2.4 billion—Databricks figures are not public and are therefore <a href="https://www.cnbc.com/2024/06/12/databricks-says-annualized-revenue-to-reach-2point4-billion-in-first-half.html?ref=blef.fr">projected</a>. 
Snowflake was founded in 2012 around its data warehouse product, which is still its core offering, while Databricks was founded in 2013 out of academia by the researchers who co-created Spark (which became Apache Spark in 2014).</p><p>Snowflake and Databricks have the same goal: both are selling a cloud on top of <em>classic</em><sup>1</sup>&nbsp;cloud vendors. In the data world Snowflake and Databricks are our dedicated platforms and we consider them big, but when we look at the whole tech ecosystem they are (so) small: AWS revenue is $80b, Azure is $62b and GCP is $37b.</p><figure class="kg-card kg-image-card"><img src="https://www.blef.fr/content/images/2024/06/Frame-27-3-.png" class="kg-image" alt="" loading="lazy" width="2000" height="1007" srcset="https://www.blef.fr/content/images/size/w600/2024/06/Frame-27-3-.png 600w, https://www.blef.fr/content/images/size/w1000/2024/06/Frame-27-3-.png 1000w, https://www.blef.fr/content/images/size/w1600/2024/06/Frame-27-3-.png 1600w, https://www.blef.fr/content/images/2024/06/Frame-27-3-.png 2121w" sizes="(min-width: 720px) 720px"></figure><p>The Google search results give an idea of the market both tools are trying to reach. Using a quick semantic analysis, "The" means both want to be THE platform you need when you're doing data.
Both companies have added Data and AI to their slogans: Snowflake used to be The Data Cloud and now they're The AI Data Cloud.</p><p>Below is a diagram describing how I think data platforms can be schematised:</p><ul><li><strong>Data storage</strong> — you need to store data in an efficient, interoperable manner, from the fresh to the old, with the metadata.</li><li><strong>Data engine</strong> — you need to make computations on data; the computation can be volatile or materialised back to the storage</li><li>Programmable — you need to run <strong>code</strong> on your platform; whatever the language or the technology, at some point you need to translate your business logic into a programmatic logic</li><li><strong>Visualisation</strong> — you need to visualise the output of the computed data because charts are often better than tables</li><li><strong>AI</strong> — you need to be proactive or predictive; that's when <strong>machine learning or deep learning</strong> enters, more generally today AI.</li><li>In order to make all of this work, data flows, going <strong>IN and OUT</strong>.</li><li><strong>Edge stuff</strong> — and then everything else that goes with it like privacy, observability, orchestration, scheduling, governance, etc.
which might or might not be required depending on the company's maturity.</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2024/06/Screenshot-2024-06-20-at-19.07.05.png" class="kg-image" alt="" loading="lazy" width="2000" height="963" srcset="https://www.blef.fr/content/images/size/w600/2024/06/Screenshot-2024-06-20-at-19.07.05.png 600w, https://www.blef.fr/content/images/size/w1000/2024/06/Screenshot-2024-06-20-at-19.07.05.png 1000w, https://www.blef.fr/content/images/size/w1600/2024/06/Screenshot-2024-06-20-at-19.07.05.png 1600w, https://www.blef.fr/content/images/2024/06/Screenshot-2024-06-20-at-19.07.05.png 2036w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">One way to read data platforms</span></figcaption></figure><p>When we look at platform history, what characterises evolution is the separation (or not) between the engine and the storage. Good old data warehouses like Oracle were engine + storage; then Hadoop arrived and was almost the same: you had an engine (MapReduce, Pig, Hive, Spark) and HDFS, everything in the same cluster, with data co-location. Then the cloud changed everything and created a new way, separating the storage from the engine, leading to ephemeral Spark clusters with S3, and then, <a href="https://www.getdbt.com/blog/future-of-the-modern-data-stack?ref=blef.fr">Cambrian explosion</a>, engines and storages multiplied.</p><p>This is the fundamental difference between Snowflake and Databricks.</p><p><strong>Snowflake sells a warehouse, but it's really more of a UX</strong>. A UX where you buy a single tool combining engine and storage, where all you have to do is flow data in, write SQL, and it's done. <strong>Databricks sells a toolbox; you don't buy any UX</strong>. Databricks is terribly designed: it's an amalgam of tools, with a lot of products doing the same thing—e.g.
you could write the same pipeline in Java, in Scala, in Python, in SQL, etc.—with Databricks you buy an engine.</p><p>At least, that's what the two platforms are all about. <strong>Ultimately, they both want to become everything between the left and the right arrows.</strong></p><p>Now that I've introduced the two competitors, let's get down to business. In this article I'll cover what Snowflake and Databricks announced at their respective summits and why Apache Iceberg, in the middle, crystallised all the hype.</p><p></p><h1 id="snowflake-summit">Snowflake Summit</h1><p>Snowflake took the lead, setting the tone. I won't delve into every announcement here, but for more details, SELECT has written a blog covering the <a href="https://select.dev/posts/snowflake-summit-2024?ref=blef.fr">28 announcements and takeaways from the Summit</a>. If you're a Snowflake customer, I recommend reading Ian's insights. His business is centered on Snowflake, and he always offers the best perspectives.</p><p>Here is what I think summarises the summit well:</p><ul><li><a href="https://docs.snowflake.com/en/user-guide/tables-iceberg?ref=blef.fr"><strong>Apache Iceberg</strong></a><strong> support</strong> — it means the Snowflake engine is now able to read Iceberg files. In order to read Iceberg files you need a catalog; Snowflake supports external catalogs—like AWS Glue—and they will <a href="https://www.snowflake.com/blog/introducing-polaris-catalog/?ref=blef.fr">open-source Polaris</a>, their own Apache Iceberg catalog, in the next 90 days. <br><br>If you're not familiar with Iceberg, it's an open-source table format built on top of Parquet. It adds metadata, reads, writes and transactions that allow you to treat Parquet files as a table.
For a comprehensive introduction to Iceberg, I recommend reading my friend <a href="https://seattledataguy.substack.com/p/apache-iceberg-what-is-it?ref=blef.fr">Julien's Iceberg guide</a>.</li><li><strong>Native CDC for </strong><a href="https://www.snowflake.com/blog/ingest-data-faster-easier-new-connectors-updates/?ref=blef.fr"><strong>Postgres and MySQL</strong></a> — Snowflake will be able to connect to Postgres and MySQL to natively move data from your databases to the warehouse. This could be a significant blow to Fivetran and Airbyte's business. While the exact pricing hasn't been revealed yet, the announcement emphasises cost-effectiveness.</li><li><strong>Store and run whatever you want on Snowflake</strong> — They bring a serverless / container philosophy to Snowflake, as you will be able to store your <a href="https://docs.snowflake.com/en/developer-guide/snowpark-ml/model-registry/overview?ref=blef.fr">AI models</a>, run <a href="https://docs.snowflake.com/en/developer-guide/snowpark/python/snowpark-pandas?ref=blef.fr">pandas</a> code or any <a href="https://docs.snowflake.com/en/developer-guide/snowpark-container-services/overview?ref=blef.fr">container</a>.</li><li><strong>Dark mode interface</strong> — Ironically it was their closing announcement, their most requested feature and their most-liked Reddit post following the announcement. I found it a bit ridiculous, but it showcases how much Snowflake is a UX-first platform.</li></ul><p>From the start, Snowflake has been a straightforward platform: load data, write SQL, period. This approach has always appealed to analysts, analytics engineers, and pragmatic data engineers. However, to capture a larger market and address AI use-cases, Snowflake needed to break through its glass ceiling. To me, that's what these major announcements are. Snowflake becomes Databricks.</p><p></p><h1 id="databricks-dataai">Databricks Data+AI</h1><p>I didn't attend either summit in person.
While I enjoy these events, I prefer to avoid flying for ecological reasons, and large gatherings can be challenging for an introvert like me. Watching the Data+AI Summit from home did give me a bit of <a href="https://en.wikipedia.org/wiki/Fear_of_missing_out?ref=blef.fr">FOMO</a>, but the Snowflake Summit did not. Databricks successfully built hype during the event, announcement after announcement.</p><p>Once again it boils down to the nature of the platform. Snowflake is insanely boring: even if use-cases are different, the Snowflake solution standardises everything. When it comes to Databricks, creativity arises—or we can call it tech debt. Through the multiplicity of products and ways to handle data, shiny stuff can appeal to everyone.</p><p>Here is what Databricks brought this year:</p><ul><li><a href="https://www.youtube.com/watch?v=S1B0J-uzSDE&ref=blef.fr">Spark 4.0</a> — (1) PySpark erases the differences with the Scala version, creating a first-class experience for Python users. (2) Spark versions will become even easier to manage with Spark Connect, allowing other languages to run Spark code—because Spark Connect decouples the client and the server. (3) Spark 4.0 will support ANSI SQL and <a href="https://spark.apache.org/news/spark-4.0.0-preview1.html?ref=blef.fr#:~:text=There%20are%20a%20lot%20of,by%20default%2C%20and%20many%20more." rel="noreferrer">many other things</a>.</li><li><a href="https://www.databricks.com/blog/introducing-aibi-intelligent-analytics-real-world-data?ref=blef.fr">Databricks AI/BI</a> — Databricks has introduced AI/BI, a smart business intelligence tool that blends an AI-powered low-code dashboarding solution with Genie, a conversational interface. AI/BI will be able to semantically understand and use all the objects you have in your Databricks instance.
Visually, the dashboarding solution looks like a mix between Tableau and Preset.</li><li><a href="https://docs.databricks.com/en/release-notes/serverless.html?ref=blef.fr">Serverless compute</a> — This keeps bridging the gap in terms of user experience: because managing Spark clusters is painful, serverless Spark lets you run a Spark job without worrying about the execution. Still, serverless compute does not support SQL.</li><li><a href="https://www.databricks.com/blog/databricks-tabular?ref=blef.fr">Buying Tabular</a> — Before the last bullet point, this was already something big. Databricks bought Tabular for $1b. Tabular was founded in 2021, had fewer than 50 employees and raised $37m. Jackpot. According to the press, Snowflake and Confluent (Kafka) were also trying to buy Tabular.<br><br>But what does Tabular do? Tabular is building a catalog for Apache Iceberg, and it employs a good share of the Iceberg open-source contributors. By getting Tabular, Databricks gets all the intellectual knowledge about Iceberg and how to build a catalog around it.</li><li><a href="https://www.databricks.com/blog/open-sourcing-unity-catalog?ref=blef.fr">Open-sourcing Unity Catalog</a> — Finally, on stage, Databricks' CEO hit the button to open-source Unity Catalog, directly responding to Snowflake’s open-sourcing of Polaris. Unity Catalog, previously a closed product, is now a key part of Databricks' strategy to become THE data platform. This move, combined with the Tabular acquisition, will help Databricks achieve top-notch support for Iceberg.</li></ul><p>If you've made it this far, you probably understand the story. Databricks is focusing on simplification (serverless, auto BI<sup>2</sup>, improved PySpark) while evolving into a data warehouse.
With the open-sourcing of Unity Catalog and the adoption of Iceberg, Databricks is equipping users with the toolbox to build their own data warehouses.</p><p></p><h1 id="apache-iceberg-and-the-catalogs">Apache Iceberg and the catalogs</h1><p>We finally get down to Iceberg. What's Iceberg? Why are catalogs so important? How do they differ from the data catalogs we are used to?</p><p>Iceberg was started at Netflix by Ryan Blue and Dan Weeks around 2017. Both later co-founded Tabular (which got acquired by Databricks). Iceberg was designed to fix the flaws of Hive around table management, especially around <a href="https://en.wikipedia.org/wiki/ACID?ref=blef.fr">ACID transactions</a>. The project became a top-level Apache project in Nov 2018.</p><p>Currently Apache Iceberg competes with Delta Lake and Apache Hudi, and it has become the leading format in the community when looking at all metrics. Newcomers like <a href="https://github.com/facebookincubator/nimble?ref=blef.fr">nimble</a> or the <a href="https://duckdb.org/docs/internals/storage?ref=blef.fr">DuckDB</a> table format are also arriving late to the party and could be a thing in the future.</p><p><strong>What is Iceberg?</strong></p><p>Over the last years the community settled on Parquet as the go-to file format for storing data. Parquet has many advantages: it's columnar, compressed, can push down predicates, owns the schema at file level and more. But there are a few issues with Parquet. Parquet is a storage format: except for a few metadata fields and the schema, Parquet lacks information about the <em>table</em>.</p><p>A <em>table format</em> creates an abstraction layer between you and the storage format, allowing you to interact with files in storage as if they were tables.
This enables easier data management and query operations, making it possible to perform SQL-like operations and transactions directly on data files.</p><p>Iceberg is composed of 2 layers, with sublayers, like an onion:</p><ul><li>the data layer — contains the raw data in Parquet; Iceberg manages the way the Parquet files are partitioned, etc.</li><li>the metadata layer<ul><li>manifest file — A manifest is an immutable Avro file that lists data files or delete files, along with each file’s partition data tuple, metrics, and tracking information.</li><li>manifest list (or snapshot) — A new manifest list is written for each attempt to commit a snapshot, because the list of manifests always changes to produce a new snapshot. This is simply a collection of manifests describing a state or a partial state of the table.</li><li>metadata file — Table metadata is stored as JSON. Each table metadata change creates a new table metadata file that is committed by an atomic operation.</li></ul></li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://iceberg.apache.org/assets/external/iceberg.apache.org/assets/images/iceberg-metadata.png" class="kg-image" alt="Iceberg snapshot structure" loading="lazy" width="1248" height="1290"><figcaption><span style="white-space: pre-wrap;">Official Iceberg schema (</span><a href="https://iceberg.apache.org/spec/?ref=blef.fr#overview" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p>That's what it is: if you have to remember one thing, <strong>Iceberg creates tables on top of raw Parquet files</strong>.</p><p>So once you have Iceberg you're able to create multiple tables, but you need a place to store all the metadata about your tables. Iceberg manages each table individually, but obviously you need more than one table. That's why you need a catalog.
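</p><p>The metadata layering described above can be sketched as a toy model in Python. This is purely a conceptual illustration (plain dictionaries, not the real Iceberg spec objects nor the PyIceberg API): a catalog points to a table's latest metadata file, which points to a manifest list (snapshot), which points to manifests, which finally list the Parquet data files.</p>

```python
# Toy model of Iceberg's metadata layering (conceptual sketch only).
# Catalog -> table metadata -> manifest list (snapshot) -> manifests -> data files.

manifest = {  # an (immutable) file listing data files with per-file stats
    "entries": [
        {"data_file": "s3://lake/orders/part-0.parquet", "record_count": 1200},
        {"data_file": "s3://lake/orders/part-1.parquet", "record_count": 800},
    ]
}

manifest_list = {  # one per snapshot: a collection of manifests
    "snapshot_id": 42,
    "manifests": [manifest],
}

table_metadata = {  # JSON file, replaced atomically on every table change
    "table": "orders",
    "schema": {"order_id": "long", "amount": "double"},
    "current_snapshot": manifest_list,
}

catalog = {"orders": table_metadata}  # the catalog tracks many such tables


def record_count(cat, table_name):
    """Walk the layers to answer a simple question: how many rows?"""
    snapshot = cat[table_name]["current_snapshot"]
    return sum(
        entry["record_count"]
        for m in snapshot["manifests"]
        for entry in m["entries"]
    )


print(record_count(catalog, "orders"))  # 2000
```

<p>In the real world that walk is performed by the engine (Spark, Trino, Snowflake...) through the Iceberg libraries; the key point is that every layer is just immutable files in object storage, and only the catalog pointer changes on commit. 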
This catalog is like the <a href="https://en.wikipedia.org/wiki/Apache_Hive?ref=blef.fr">Hive</a> Metastore. I've read somewhere that we should call it a <em>super metastore</em> rather than a catalog, a term already used to describe another product in the data community.</p><p>Still, we need a place to keep track of all our Iceberg tables. That's what <a href="https://www.unitycatalog.io/?ref=blef.fr">Unity Catalog</a>, <a href="https://docs.aws.amazon.com/glue/latest/dg/catalog-and-crawler.html?ref=blef.fr">AWS Glue Data Catalog</a>, <a href="https://www.snowflake.com/blog/introducing-polaris-catalog/?ref=blef.fr">Polaris</a>, the <a href="https://github.com/kevinjqliu/iceberg-rest-catalog?ref=blef.fr">Iceberg Rest Catalog</a> and <a href="https://tabular.io/?ref=blef.fr">Tabular</a> (RIP) are. Actually, all of these catalogs implement the <a href="https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml?ref=blef.fr">Iceberg REST Open API</a> specification.</p><div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">💡</div><div class="kg-callout-text">Read <a href="https://seattledataguy.substack.com/p/apache-iceberg-what-is-it?ref=blef.fr">Julien's post about Apache Iceberg</a> if you want to go deeper.</div></div><p></p><h1 id="conclusion">Conclusion</h1><p>Databricks and Snowflake embracing Iceberg, by open-sourcing compatible catalogs and opening their engines to Iceberg, shows how far ahead Iceberg is. I don't think Databricks or Snowflake really won the competition.</p><p>On Snowflake's side, they mitigated the impact by open-sourcing Polaris and embracing the Iceberg format. However, most Snowflake end-users won't be concerned with these changes; they simply want to write SQL queries on their data. These format details are more relevant to data engineers. Snowflake finds itself between Databricks' innovation and BigQuery's simplicity<sup>3</sup> (ingest data, query).
To grow, Snowflake needs to expand in both directions.</p><p>With this move Databricks will finally provide a data warehouse to their customers; it will be a data warehouse in kit form, but a data warehouse nonetheless. Because this is what it is: the Iceberg + catalog combo just creates a data warehouse. It mimics what databases have been doing for ages, but more in the open, with you pulling all the levers, rather than something hidden in a black box written in a compiled database language like C.</p><p>Wait, Iceberg is written in Java, and honestly, PyIceberg is lagging significantly behind the Java version... Here we go again.</p><hr><p>1 — I don't like the classic term to qualify AWS, Google and Microsoft, but actually that's what they are right now. Leaders and commodities.</p><p>2 — I just made this term up; it doesn't seem to exist for data really, but I like it a lot.</p><p>3 — Actually, BigQuery recently added a lot of features to extend the compute, with more ways to interact with data (<a href="https://cloud.google.com/bigquery/docs/create-notebooks?ref=blef.fr">notebooks</a>, <a href="https://cloud.google.com/bigquery/docs/data-canvas?ref=blef.fr">canvas</a>, etc.)</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 24.24 ]]></title>
                    <description><![CDATA[ Data News #24.24 — I&#39;m back sorry for the late news. I&#39;m co-organising a conference in Paris in Nov, CfP is open, AI news with OpenAI and Apple and a lot of Fast News. ]]></description>
                    <link><![CDATA[ /data-news-week-24-24/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 666bee985d699d00018ca466 ]]></guid>
                    <pubDate><![CDATA[ 2024-06-15 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2024/06/image-1.png" class="kg-image" alt="" loading="lazy" width="800" height="533" srcset="https://www.blef.fr/content/images/size/w600/2024/06/image-1.png 600w, https://www.blef.fr/content/images/2024/06/image-1.png 800w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">hey (</span><a href="https://unsplash.com/photos/focus-photography-of-standing-gray-rodent-uWCGd6BY-zU?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p>🥹It's been a long time since I've put words down on paper or hit the keyboard to send bytes across the network. We're in the age of AI, and my lord, computer science has evolved over the last 30 years. I'm writing this edition from my childhood home, and it brings back memories. I got my first computer at the age of 6 and spent my days installing Windows 98 over and over again, getting lost between the BIOS and the Windows installation pages, playing with Word, Dreamweaver and Adobe Premiere.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2024/06/Screenshot-2024-06-15-at-10.31.50.png" class="kg-image" alt="" loading="lazy" width="1704" height="822" srcset="https://www.blef.fr/content/images/size/w600/2024/06/Screenshot-2024-06-15-at-10.31.50.png 600w, https://www.blef.fr/content/images/size/w1000/2024/06/Screenshot-2024-06-15-at-10.31.50.png 1000w, https://www.blef.fr/content/images/size/w1600/2024/06/Screenshot-2024-06-15-at-10.31.50.png 1600w, https://www.blef.fr/content/images/2024/06/Screenshot-2024-06-15-at-10.31.50.png 1704w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">My first website is still up somewhere on the internet 🥹 — it was to help my aunt sell her 
house</span></figcaption></figure><p>Who would have thought that 25 years later, I'd be celebrating 10 years working with computers? June also marks the third anniversary of this newsletter. 3 years ago I started the newsletter in order to share my expertise with people, and I'm so happy with how it turned out. <strong>More than 5000 members subscribed to the newsletter and the blog generated almost 100k unique visitors.</strong></p><p>Recently a lot of people subscribed but never received a Data News. I want to give you a warm welcome; this edition marks the start of the journey we embark on together, and you will enjoy what's coming next, I'm sure.</p><p>I've taken a little forced break because I've been overwhelmed with work lately, juggling a lot of requests and my customers' work. In order to deliver I had to reclaim my Fridays. Around the newsletter there are unfinished projects with the <a href="https://www.blef.fr/explorer/reco/">Recommendations</a> page and <a href="https://www.qrators.io/?ref=blef.fr">Qrators</a>, and I'll get back to them starting in July once I'm done with the rest.</p><p></p><h1 id="forward-data-conference-%E2%8F%A9">Forward data conference ⏩</h1><p>I'm excited to announce that I am co-organising the <a href="https://www.forward-data-conference.com/?ref=blef.fr">Forward Data Conference</a>, a one-day event in Paris. Join us on November 25th as we bring together around 350 attendees and an impressive lineup of speakers. It's going to be an incredible opportunity to connect, learn, and explore the latest in data.
We will do our best to make the conference friendly for native English speakers.</p><p>Forward Data aims to be a hub for knowledge sharing and best practices, offering you the chance to expand your horizons, explore new facets of the data ecosystem, and connect with key international community leaders.</p><figure class="kg-card kg-image-card kg-card-hascaption"><a href="https://www.forward-data-conference.com/?ref=blef.fr"><img src="https://www.blef.fr/content/images/2024/06/Screenshot-2024-06-15-at-14.54.10.png" class="kg-image" alt="" loading="lazy" width="2000" height="1155" srcset="https://www.blef.fr/content/images/size/w600/2024/06/Screenshot-2024-06-15-at-14.54.10.png 600w, https://www.blef.fr/content/images/size/w1000/2024/06/Screenshot-2024-06-15-at-14.54.10.png 1000w, https://www.blef.fr/content/images/size/w1600/2024/06/Screenshot-2024-06-15-at-14.54.10.png 1600w, https://www.blef.fr/content/images/size/w2400/2024/06/Screenshot-2024-06-15-at-14.54.10.png 2400w" sizes="(min-width: 720px) 720px"></a><figcaption><span style="white-space: pre-wrap;">Be ready for Forward Data!</span></figcaption></figure><p></p><h1 id="ai-news-%F0%9F%A4%96">AI News 🤖</h1><p>A lot of AI news happened in the last 3 weeks. Here is a small recap.</p><ul><li>OpenAI<ul><li><a href="https://www.vox.com/future-perfect/2024/5/17/24158403/openai-resignations-ai-safety-ilya-sutskever-jan-leike-artificial-intelligence?ref=blef.fr">The superalignment team was fired</a> — The goal of the superalignment team was to research all topics related to AGI safety. But it seems priorities got reshuffled. 
Then <a href="https://www.theverge.com/2024/6/13/24178079/openai-board-paul-nakasone-nsa-safety?ref=blef.fr">OpenAI appointed a former NSA leader</a> (nominated by Donald Trump); he will probably work with the Safety and Security committee.</li><li><a href="https://zeenews.india.com/companies/openai-doubles-annualised-revenue-to-3-4-billion-report-2757398.html?ref=blef.fr">Annualised revenue projected to be $3.4b</a> — It's crazy how the company reached this amount, mainly by selling to enterprise customers. By comparison, Snowflake's revenue was $2.8b in 2023.</li><li><a href="https://openai.com/index/extracting-concepts-from-gpt-4/?ref=blef.fr">Extracting concepts from GPT-4</a></li></ul></li><li>Apple announced iOS 18 and their own AI — AI will stand for <a href="https://www.apple.com/apple-intelligence/?ref=blef.fr">Apple Intelligence</a>. With great ego, Apple appropriated the letters AI. At their annual developer conference (the <a href="https://developer.apple.com/wwdc24/?ref=blef.fr">WWDC</a>) they showcased how AI will be integrated everywhere in iOS:<ul><li><strong>Siri has been revamped </strong>— now looking like a Microsoft AI copilot, Siri will be able to sort notifications, help you write better and give better contextualised answers. Siri will also integrate with OpenAI through ChatGPT when needed.</li><li>At the same time they announced their <strong>model will run on-device</strong> (keeping your data safe and private), and when more compute is required they will use a private cloud. </li></ul></li><ul><li><strong>Writing tools</strong> — bringing a few of the best GenAI features: <strong>proofreading and rewriting</strong>. 
When selecting text you will be able to ask the model to rewrite it more professionally, etc.</li></ul><ul><li><strong>Genmoji</strong> — a way for your parents to be even cringier in their emoji usage, by generating emoji from a sentence.</li><li>Finally, with the new Siri and Writing tools they <strong>reworked one of the worst Apple applications: Mail</strong>, giving it a better look and new email-writing capabilities.</li><li>This joins other features for which Apple will introduce AI (and GenAI) throughout its products (audio transcription, image generation from tags, better natural language search on photos, etc.). But this anchors Apple as a consumer products company, not an AI company like Google, Microsoft or Meta. Apple has decided for years to keep its users' data safe and private, which means it doesn't have a pool of data to train large language models.</li></ul><li><a href="https://x.com/elonmusk/status/1798504201196368219?ref=blef.fr">How to rethink recommendations for social networks</a> — A short video of Jack Dorsey (Twitter co-founder) about recommendation algorithms and how platforms today should give the choice back to users; this is about free will and about the biases / <a href="https://en.wikipedia.org/wiki/Filter_bubble?ref=blef.fr#:~:text=A%20filter%20bubble%20or%20ideological,recommendation%20systems%2C%20and%20algorithmic%20curation.">filter bubbles</a> we build. We should have transparency on the rules driving recommendations, and platforms should propose multiple algorithms and let users decide, like a marketplace.</li><li><a href="https://medium.com/@anis.zakari/changing-the-gpu-is-changing-the-behaviour-of-your-llm-0e6dd8dfaaae?ref=blef.fr">Changing the GPU is changing the behaviour of your LLM</a> — A cool experiment that shows how the GPU impacts inference.</li><li><a href="https://mlops-coding-course.fmind.dev/?ref=blef.fr">MLOps coding course</a> — Great MLOps course! 
It contains 6 chapters and covers all the topics needed to put models in production while making the right choices.</li><li><a href="https://cloud.google.com/blog/products/data-analytics/how-to-use-rag-in-bigquery-to-bolster-llms?hl=en&ref=blef.fr">RAG in BigQuery</a> — When you do RAG in a database, it often comes down to embedding functions and being able to query those vectors with good performance. BigQuery has the whole toolkit for it and this article showcases it well (and, let's be honest, all the competition does the same).</li><li><a href="https://pure.mpg.de/rest/items/item_3588217_2/component/file_3588218/content?ref=blef.fr">What makes a Gen AI system open?</a> — A paper that surveys 45 models across 14 elements that could define them as open. <a href="https://huggingface.co/allenai/OLMo-7B-Instruct?ref=blef.fr">OLMo 7B Instruct</a> is the most open according to the paper, and ChatGPT the least. On the same topic, Mozilla released a paper about a <a href="https://assets.mofoprod.net/network/documents/Towards_a_Framework_for_Openness_in_Foundation_Models.pdf?ref=blef.fr">framework for Openness in Foundation Models</a>.</li></ul><p></p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2024/06/image.png" class="kg-image" alt="" loading="lazy" width="800" height="539" srcset="https://www.blef.fr/content/images/size/w600/2024/06/image.png 600w, https://www.blef.fr/content/images/2024/06/image.png 800w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">(</span><a href="https://unsplash.com/photos/white-and-green-box-on-table-iCp8p7wVXS0?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li><a href="https://louisabraham.github.io/articles/probabilistic-tic-tac-toe?ref=blef.fr">Solving probabilistic Tic-Tac-Toe</a> — Probabilistic 
tic-tac-toe is like tic-tac-toe, but each cell is given a probability distribution: when you make a play, you randomly get an <em>x</em>, an <em>o</em>, or nothing. Someone developed a Unity version of the game and someone else wrote a math solver giving the best play at every turn.</li><li><a href="https://amphi.ai/?ref=blef.fr">Amphi ETL</a> — Amphi is a low-code visual ETL tool that you can run in JupyterLab. This is super clever; it's the first time I've seen this kind of application run as a Jupyter extension. Still early, but worth watching in the future.</li><li><a href="https://untitleddata.company/blog/How-to-create-a-dlt-source-with-a-custom-authentication-method-rest-api-vs-airbyte-low-code?ref=blef.fr">Compare Airbyte and dlt ways to create custom sources</a> — A long article that compares Airbyte and dlt when it comes to creating custom sources. Both extract-and-load tools can create custom sources, via either Airbyte's low-code CDK or dlt's REST API Source toolkit.</li><li><a href="https://clickhouse.com/blog/how-trip.com-migrated-from-elasticsearch-and-built-a-50pb-logging-solution-with-clickhouse?ref=blef.fr">trip.com migrated from 50PB Elastic to ClickHouse</a> — I've never been a fan of NoSQL platforms like Elasticsearch for data work. This article on the ClickHouse blog showcases how a client migrated their ES cluster to ClickHouse to improve their log-querying capabilities. The article also covers how to correctly route queries once at scale with multiple ClickHouse clusters.</li><li><a href="https://clickhouse.com/videos/hunting-non-optimized-queries-clickhouse?ref=blef.fr">Hunting non-optimised queries in ClickHouse</a> — The talk is about ClickHouse but applies to every engine. In the talk, Yohann explains the mechanism he put in place to find non-optimised SELECTs. 
He did it with a machine learning model, which means he identified the features that slow queries down, like nesting, subqueries, joins and WHERE clauses.</li><li><a href="https://github.com/tosun-si/bigtesty?ref=blef.fr">BigTesty</a> — a framework for creating BigQuery integration tests on real, short-lived infrastructure. It uses Pulumi (an infra-as-code tool); you provide inputs, SQL queries and expected outputs, and it runs the tests against a dedicated BigQuery project.</li><li><a href="https://engineering.atspotify.com/2024/05/data-platform-explained-part-ii/?ref=blef.fr">Data platform explained part II</a>&nbsp;— Part 2 of the Spotify article about data platforms. They name 3 different steps: data collection, management and processing (and they even mention GDPR), and finally explain how they approach data culture.</li><li><a href="https://seattledataguy.substack.com/p/apache-iceberg-what-is-it?ref=blef.fr">What is really Apache Iceberg?</a> — Iceberg has been at the center of discussions this week. Julien wrote the greatest deep dive you can find on the topic.</li><li><a href="https://www.linkedin.com/pulse/cron-expressions-duckdb-rusty-conover-6bole/?trackingId=o0MmGrYtQbqqZ3mybY2kmQ%3D%3D&ref=blef.fr">Cron expressions with DuckDB</a> — A handy DuckDB function that generates time arrays from a cron expression; it's more understandable than generate_series().</li><li><a href="https://engineering.fb.com/2024/06/10/data-infrastructure/serverless-jupyter-notebooks-bento-meta/?ref=blef.fr">Serverless Jupyter notebooks at Meta</a> — They developed a system called Bento which allows notebooks to run either with classic kernels or with an in-browser kernel (truly serverless) using Pyodide. 
They have handy functions to pull SQL, Google Sheets or GraphQL data into browser memory and then work on it.</li><li><a href="https://www.youtube.com/watch?v=jXmRrChXUrI&ref=blef.fr">Airflow's new youth</a> — If you stayed on Airflow 1.x or a pre-2.6 release, you might have missed Airflow's new youth. This presentation from Jarek showcases all the recent improvements: data-aware scheduling, deferrable operators, object storage, etc.</li><li><a href="https://aetperf.github.io/2024/05/30/A-Hybrid-information-retriever-with-DuckDB.html?ref=blef.fr">A hybrid information retriever with DuckDB</a> — how you can fuse semantic and lexical search with DuckDB. Looks neat.</li><li><a href="https://blog.picnic.nl/picnic-open-sources-dbt-score-linting-model-metadata-with-ease-428278f9f05b?ref=blef.fr">dbt-score, lint metadata and get max score</a> — Lint your dbt metadata, get a score and be happy in your CI/CD.</li><li><a href="https://tobikodata.com/automatically-detecting-breaking-changes-in-sql-queries.html?ref=blef.fr">Automatically detecting breaking changes in SQL queries</a> — Use SQLGlot's diff function (on the AST) to get what changed in a SQL query and act accordingly.</li><li><a href="https://medium.pimpaudben.fr/how-i-failed-to-implement-dbt-in-my-previous-job-0b168f59e150?ref=blef.fr">How I failed to implement dbt</a> — Benoit explains why he failed to implement dbt in his previous role. He identifies 5 errors that led to the failure. As always, it's not about a technical issue.</li><li><a href="https://mmc.vc/research/250-european/?ref=blef.fr">250 European data infrastructure startups and what we learned from them</a> — Another perspective on data infrastructure that greatly complements the <a href="https://mattturck.com/mad2024/?ref=blef.fr">MAD landscape</a>. 
At the end of the page it gives great definitions of every part of a data platform.</li><li><a href="https://dagster.io/blog/the-rise-of-medium-code?ref=blef.fr">The rise of medium code</a> — Between low-code practitioners and software engineers there are medium-code practitioners, like analytics engineers and data scientists. This code often lives in Python orchestrators and has to be treated properly because it's production code as well.</li><li><a href="https://lakefs.io/blog/data-engineering-patterns-write-audit-publish/?ref=blef.fr" rel="bookmark kk">Write-Audit-Publish pattern</a> — Once again a great article about this pattern.</li><li><a href="https://medium.com/data-monzo/how-monzo-uses-incremental-modelling-to-handle-billions-of-events-every-day-45b2bc9ebe89?ref=blef.fr">How Monzo uses incremental modelling to handle billions of events every day</a>.</li></ul><div class="kg-card kg-callout-card kg-callout-card-grey"><div class="kg-callout-emoji">💡</div><div class="kg-callout-text">I'm working on a dedicated article about Snowflake's and Databricks' latest advancements, which should be published on Monday.</div></div><p></p><h1 id="data-economy-%F0%9F%92%B0">Data Economy 💰</h1><ul><li><a href="https://sifted.eu/articles/mistral-468m-round-news?ref=blef.fr">Mistral raises €600m</a> — Mistral has never been a French company since the first rounds, but it is raising a lot of cash again to go faster.</li><li><a href="https://x.ai/blog/series-b?ref=blef.fr">xAI raises $6b</a> — Late to the party, and it seemed no one cared, but Musk is trying to fight.</li><li><a href="https://cube.dev/blog/cubes-raises-25-million?ref=blef.fr">Cube raises $25m</a> — Cube has the most advanced piece of technology today when it comes to the semantic layer, and they raised enough money to keep going in this direction.</li><li><a href="https://www.snowflake.com/blog/snowflake-ventures-invests-in-omni-to-empower-self-service-business-intelligence-and-data-modeling/?ref=blef.fr">Snowflake invests 
in Omni</a> — Omni is a refreshed version of Looker with a fresher LookML. </li><li><a href="https://www.databricks.com/company/newsroom/press-releases/databricks-agrees-acquire-tabular-company-founded-original-creators?ref=blef.fr">Databricks acquires Tabular</a> — It created waves last week in the data community. I'll write more about it on Monday.</li><li><a href="https://tobikodata.com/the_future_of_tobiko.html?ref=blef.fr">Tobiko raises $17.3m</a> — The company behind SQLMesh and SQLGlot raises cash to create a suite of tools to invent the data development of tomorrow.</li><li><a href="https://redpanda.com/press/redpanda-acquires-benthos?ref=blef.fr">Redpanda acquires Benthos</a> — In the streaming world this was big news.</li></ul><p></p><hr><p>I want to address something weighing on my mind. We've all seen the results of recent European elections and how the far right has influenced public debate and opinion. I strongly believe we should not fall for their tactics or their so-called solutions. In the tech community, many of us are privileged, often due to our financial stability. However, we cannot build a society with only people like us. Because of our privilege, we (1) should vote, (2) should use our vote to support those marginalised by the system.</p><p>For my French readers, there are parliamentary elections in France in 15 days. I urge you to vote and to vote against the far right. Hate and division are not solutions. Cutting public services through tax reductions is not a solution. Pushing for more productivity when AI is on the rise is not a solution. Individualism is not a solution. They don't bring any solution.</p><p>Consider what the tech ecosystem would look like under far-right principles: diversity stifled, innovation hindered, and global collaboration restricted. 
These ideologies could limit talent flow, reduce educational programs, and promote censorship and surveillance (which is almost already here; we work in big data, let's face reality), undermining our core values of privacy and open access.</p><p>If you feel this message doesn't belong in a tech newsletter or professional sphere, I don't care and you can unsubscribe. However, I believe that advocating for openness and tolerance is essential, and accepting hate speech is unacceptable.</p><hr><p>See you next week ❤️.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 24.20 ]]></title>
                    <description><![CDATA[ Data News #24.20 — Big edition, 5000 members ❤️, launching Qrators to search in videos, Data Council, OpenAI and Google I/O stuff and data eng stuff. ]]></description>
                    <link><![CDATA[ /data-news-week-24-20/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 6647284168a7850001d2407a ]]></guid>
                    <pubDate><![CDATA[ 2024-05-17 ]]></pubDate>
                    <content>
                        <![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/photo-1471877325906-aee7c2240b5f?q=80&amp;w=1000&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D" class="kg-image" alt="photography of spot light turned on" loading="lazy" width="1000" height="667" srcset="https://images.unsplash.com/photo-1471877325906-aee7c2240b5f?q=80&amp;w=600&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D 600w, https://images.unsplash.com/photo-1471877325906-aee7c2240b5f?q=80&amp;w=1000&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D 1000w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Lights on (</span><a href="https://unsplash.com/photos/photography-of-spot-light-turned-on-mln2ExJIkfc?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p>Hello you. The sun is out, the days are getting longer and Data News is still here. Next week marks 3 years of this newsletter/blog (yay 🎉 ). It'll be a time for looking back, reflecting and celebrating, but next week. This week, we reached 5000 members.</p><p><strong>Yes, 5000 of you read my content periodically. Just thank you ❤️.</strong></p><p>In recent days I've been working on a new side project. What if you could search in video content and get the exact timestamp of what you're looking for?</p><p>Let me introduce an application of this to the 80 videos of Data Council 2024.</p><h1 id="data-council-2024-%E2%9C%A8">Data Council 2024 ✨</h1><p>Data Council Austin is, in my opinion, one of the best conferences when it comes to thinking about the future of data. Every year the talks given at DC are full of quality content. 
There is one main drawback: it's 80 videos of ~30 minutes each, and not everyone has the time to watch everything or search through them.</p><p>So I developed an app that lets you <strong>search for words in the Data Council video playlist</strong>, and with <a href="https://juhache.substack.com/?ref=blef.fr">Julien</a> we've <strong>curated highlights</strong> so you can watch only the best parts.</p><p>It's available on <a href="https://qrators.io/?ref=blef.fr">qrators</a> (can be pronounced curators / creators). For the moment it works only on desktop.</p><div class="kg-card kg-button-card kg-align-center"><a href="https://www.qrators.io/?ref=blef.fr" class="kg-btn kg-btn-accent">Search content on qrators</a></div><p>The search works great: you can do full-text or exact-term searches, for instance <a href="https://www.qrators.io/?search=Airflow&ref=blef.fr">Airflow</a>, <a href="https://www.qrators.io/?search=dbt&ref=blef.fr">dbt</a>, <a href="https://www.qrators.io/?search=backfill&ref=blef.fr">backfill</a>, <a href="https://www.qrators.io/?search=%22data+mesh%22&ref=blef.fr">"data mesh"</a> or <a href="https://www.qrators.io/?search=%22SQL+Glot%22&ref=blef.fr">"SQL Glot"</a>. Quotes mean an exact-term search.</p><p>I'll write another post later about the behind-the-scenes and how this app was built, but because I'm your humble servant: this app uses DuckDB WASM and requires no backend to work (except a bucket with the data).</p><p>Still, as always I want you to get a few takeaways from the conference, so here are my favourite talks with a few highlights:</p><ul><li><a href="https://youtu.be/cylAr9oUluI?ref=blef.fr">Data culture as a product</a> — Abhi already gave one of my favourite talks of <a href="https://www.blef.fr/data-council-austin-takeaways/">Data Council 2023</a>, about metrics trees. 
Following on from his work on metrics, this time he attempts to give advice on creating a good data culture in order to create a good decision culture in companies. After all, companies need to make decisions, and these decisions need to be informed by data. [<a href="https://www.qrators.io/?videoId=cylAr9oUluI&ref=blef.fr">highlights</a>]</li><li><a href="https://youtu.be/TrmJilG4GXk?ref=blef.fr">Processing trillions of records at Okta with DuckDB instead of Snowflake</a> — it was one of my most anticipated talks of the council, because a few months ago Jake posted on LinkedIn that his team had reduced their Snowflake bill by hundreds of thousands of dollars by shifting to DuckDB. In the talk he explained what the issue with Snowflake was and how a multi-engine data stack built on top of S3 + Lambda drastically reduced dollars spent. [<a href="https://www.qrators.io/?videoId=TrmJilG4GXk&ref=blef.fr">highlights</a>]</li></ul><p>I liked a few of the other talks, but I think I'll do a dedicated post about them because the Data News is already super dense.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2024/05/Screenshot-2024-05-17-at-14.30.45.png" class="kg-image" alt="" loading="lazy" width="2000" height="1238" srcset="https://www.blef.fr/content/images/size/w600/2024/05/Screenshot-2024-05-17-at-14.30.45.png 600w, https://www.blef.fr/content/images/size/w1000/2024/05/Screenshot-2024-05-17-at-14.30.45.png 1000w, https://www.blef.fr/content/images/size/w1600/2024/05/Screenshot-2024-05-17-at-14.30.45.png 1600w, https://www.blef.fr/content/images/size/w2400/2024/05/Screenshot-2024-05-17-at-14.30.45.png 2400w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Launching Qrators, a place to search for stuff in videos</span></figcaption></figure><p></p><h1 id="ai-news-%F0%9F%A4%96">AI News 🤖</h1><ul><li>OpenAI's recent announcements — The company behind ChatGPT announced a few things hyping everyone 
recently. Especially their <a href="https://openai.com/index/gpt-4o-and-more-tools-to-chatgpt-free/?ref=blef.fr">GPT-4o</a> model (it's 4-o with the letter o, not the number 40), which adds new capabilities to ChatGPT around photos, videos and audio. The model can talk, understand what's in an image or video, and answer questions about it. They also released a macOS app that you can summon with Option+Space to ask ChatGPT. OpenAI also detailed a bit their <a href="https://openai.com/index/introducing-the-model-spec/?ref=blef.fr">model specs</a> and what principles they implemented to put guardrails around answers.</li><li><a href="https://www.businessinsider.com/satya-nadella-bill-gates-microsoft-concern-google-rivals-ai-emails-2024-5?ref=blef.fr">Why Microsoft invested in OpenAI in 2019</a> — Emails explaining why Satya Nadella (CEO) and Kevin Scott (CTO) pushed Microsoft to invest in OpenAI have been made public, and are worth a look. It mainly reads that Microsoft was "several years behind the competition in terms of ML scale" (compared to Google, in search / ML in applications) and that to get there, they needed someone with gigantic ambition, from silicon chips to high-level programming abstractions. And the OpenAI team was that someone.</li><li><a href="https://www.theinformation.com/articles/openais-new-tack-in-talent-war-with-google-promising-recruits-a-quick-stock-bump?ref=blef.fr">OpenAI is offering $10m packages</a> to top AI researchers. There is a paywall, so I can't say more.</li><li><a href="https://www.politico.com/news/2024/05/12/ai-lobbyists-gain-upper-hand-washington-00157437?ref=blef.fr">AI lobbyists are everywhere now</a> — A bit more political, but with the stakes around AI (money, power, content moderation and generation, privacy, etc.), lobbying around it is through the roof.</li><li><a href="https://www.youtube.com/watch?v=XEzRZ35urlk&ref=blef.fr">Google I/O keynote</a> — Google I/O was Google's response to OpenAI's announcements around models. 
They showcased agents that can help you do more in your favourite Google apps, then DeepMind showcased new capabilities around image and music processing / generation. But one of the most important announcements took only a few seconds: search <a href="https://www.wired.com/story/google-io-end-of-google-search/?ref=blef.fr">might change forever</a> (the paywall can be avoided with a page reader). Google introduced AI Overviews, which will be presented first in search answers, pushing traditional results far below. </li><li><a href="https://x.com/fchollet/status/1791168963445223543?ref=blef.fr">LLMs with Keras</a> — The Keras team demoed various workflows around LLMs (Gemma) with Keras.</li><li><a href="https://x.com/JoshuaSteinman/status/1790942018077966409?ref=blef.fr">Opt out to avoid Slack training LLM models on your private data</a> — Slack (acquired by Salesforce) could train their LLM models on your data. Still, they answered in the Twitter thread, but it's legal stuff I don't understand.</li><li><a href="https://huggingface.co/HuggingFaceM4/idefics2-8b?ref=blef.fr">HuggingFace releases Idefics2</a> — An open multimodal model that accepts arbitrary sequences of image and text inputs and produces text outputs. 
It works with multiple images as well, to create stories.</li><li><a href="https://doordash.engineering/2024/04/23/building-doordashs-product-knowledge-graph-with-large-language-models/?ref=blef.fr">Building DoorDash’s product knowledge graph with LLMs</a> — A good graph is like good wine, and DoorDash used LLMs' information-extraction capabilities to improve their product catalog graph.</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/photo-1473090928358-00fcead4f08c?q=80&amp;w=1000&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D" class="kg-image" alt="man using welding machine" loading="lazy" width="1000" height="731" srcset="https://images.unsplash.com/photo-1473090928358-00fcead4f08c?q=80&amp;w=600&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D 600w, https://images.unsplash.com/photo-1473090928358-00fcead4f08c?q=80&amp;w=1000&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D 1000w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Fusion (</span><a href="https://unsplash.com/photos/man-using-welding-machine-9sJMyPKlKhw?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li><a href="https://arrow.apache.org/blog/2024/05/07/datafusion-tlp/?ref=blef.fr">Apache Arrow DataFusion becomes Apache DataFusion</a> — DataFusion, a query engine built in Rust that uses Arrow for its in-memory structures, has been promoted to a top-level Apache project. DataFusion is one of the most important alternatives to DuckDB when it comes to engines (not mentioning Polars here). 
On that topic, this week I met people from <a href="https://www.sdf.com/?ref=blef.fr">SDF</a> who are betting on DataFusion as their core execution engine.</li><li><a href="https://github.com/facebookincubator/nimble?ref=blef.fr">facebook/nimble, a new columnar file format</a> — A new columnar file format is out. They announce it as "a replacement for file formats such as Apache Parquet". Ok, but why?</li><li><a href="https://medium.com/blablacar/unexpected-tips-for-data-managers-c44a71db6594?ref=blef.fr">Unexpected tips for data managers</a> — A comprehensive and pragmatic list of tips for being a great data manager. This is pure gold.</li><li><a href="https://mikkeldengsoe.substack.com/p/data-about-data-from-1000-conversations?ref=blef.fr">Data about data from 1,000 conversations with data teams</a> — Mikkel shares the output of his interviews with a lot of data teams and which topics are important.</li><li><a href="https://medium.com/israeli-tech-radar/how-to-save-90-on-bigquery-storage-a1ca99582c5c?ref=blef.fr">How to save 90% of BigQuery’s storage cost</a> and <a href="https://www.startdataengineering.com/post/optimize-snowflake-cost/?ref=blef.fr">how to reduce your Snowflake cost</a>. On the same topic, if you don't know about GROUP BY ROLLUP you should use it in <a href="https://docs.snowflake.com/en/sql-reference/constructs/group-by-rollup?ref=blef.fr">Snowflake</a> or <a href="https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax?ref=blef.fr#group_by_rollup">BigQuery</a>.</li><li><a href="https://kunalbhat.notion.site/Reverse-Engineering-Connections-by-NYT-b325a3ed84a14ddb90322887aa1cb7be?ref=blef.fr">Reverse engineering exercise</a> — Awesome idea. In order to learn concepts, the author decided to reverse engineer a NYT game, and he documented the process and what he understood. 
I find this exercise super insightful and I'd love to do something similar.</li><li><a href="https://listed.to/@mattcarter/51660/initial-thoughts-on-sqlmesh?ref=blef.fr">Initial thoughts on SQLMesh</a> — A post describing the key concepts of SQLMesh (especially around envs, plans and projects); this is a great introduction. Last week the SQLMesh team also released features around <a href="https://sqlmesh.readthedocs.io/en/stable/concepts/tests/?ref=blef.fr">testing</a>: similar to dbt unit tests, you can define inputs and outputs to test your models.</li><li><a href="https://dlthub.com/docs/blog/rest-api-source-client?ref=blef.fr">dltHub REST API source toolkit</a> — dlt released a toolkit to build extract-and-load pipelines on top of custom APIs. With the toolkit you can declare your endpoints, resources and auth, and then you'll be able to extract and load your data.</li><li><a href="https://cube.dev/blog/a-practical-guide-to-getting-started-with-cubes-ai-api?ref=blef.fr">Cube releases their AI API</a> — Now you can query your semantic layer in natural language and get answers (it uses OpenAI). This is close to what <a href="https://youtu.be/BUYrm_O0vFk?t=2182&ref=blef.fr">I had demoed last year in a talk</a>.</li><li><a href="https://motherduck.com/product/pricing/?ref=blef.fr">MotherDuck pricing page</a> — Great pricing page; competitors should take inspiration from it. 
It's fun to play with it to see how many hundreds of thousands of dollars you would have spent.</li><li><a href="https://www.uber.com/en-DE/blog/auto-categorizing-data-through-ai-ml/?ref=blef.fr">Uber, auto-categorizing an exabyte of data at field level through AI/ML</a> — Reminds me of the <a href="https://blog.sdf.com/p/automating-data-classification-for?ref=blef.fr">SDF article</a> about end-to-end classification of your data models, but at Uber scale.</li><li><a href="https://posit-dev.github.io/great-tables/articles/intro.html?ref=blef.fr">great_tables</a> — A great tool to create nice-looking tables in Python on top of your dataframes.</li></ul><p><strong>Food for thought to end (because it's already too long)</strong></p><ul><li><a href="https://dlthub.com/docs/blog/on-orchestrators?ref=blef.fr">On Orchestrators: you are all right, but you are all wrong too</a></li><li><a href="https://omni.co/blog/do-you-model-in-dbt-or-bi?ref=blef.fr">Do you model in dbt or BI?</a></li><li><a href="https://slack.engineering/how-women-lead-data-engineering-at-slack/?ref=blef.fr">How women lead data engineering at Slack</a></li><li><a href="https://glossgenius.com/blog/how-we-migrated-from-dbt-cloud-and-scaled-our-data-development?ref=blef.fr">How we migrated from dbt Cloud</a></li><li><a href="https://eng.lyft.com/technical-learning-at-lyft-build-a-strong-data-science-team-a6628215513c?ref=blef.fr">Lyft, build a strong data science team</a></li></ul><hr><p>See you next week for the anniversary 🎂</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ How to build a data team ]]></title>
                    <description><![CDATA[ This article will give you a list of the top resources to follow when building a data team. ]]></description>
                    <link><![CDATA[ /how-to-build-a-data-team/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 66340334f16ccc00018beaf7 ]]></guid>
                    <pubDate><![CDATA[ 2024-05-03 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/photo-1522071820081-009f0129c71c?q=80&amp;w=1000&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D" class="kg-image" alt="group of people using laptop computer" loading="lazy" width="1000" height="667" srcset="https://images.unsplash.com/photo-1522071820081-009f0129c71c?q=80&amp;w=600&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D 600w, https://images.unsplash.com/photo-1522071820081-009f0129c71c?q=80&amp;w=1000&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D 1000w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">And it's a team... (</span><a href="https://unsplash.com/photos/group-of-people-using-laptop-computer-QckxruozjRg?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p>Hey, new Friday and a special Data News this week. This week has been pretty packed in terms of work for me, so here's a joker as the weekly newsletter. This is a compilation of great resources about building a data team.</p><p>This is a collection I created while working on a talk called "<a href="https://docs.google.com/presentation/d/1hTqtvGOoVyJ7whYpQ2jRLFLJliJHwuC473xo0iI0Ons/edit?usp=sharing&ref=blef.fr">How to build a data dream team</a>" that I gave last year. All the articles are different and give a broad spectrum of perspectives on creating a data team.</p><p>In my experience, building a data team is a mixture of everything; there's no single recipe, but for it to work you need to adapt the technology you choose to the people you have.
These days, it's very easy to technically build a data platform, but building a data team goes further than that: it's about processes, communication and prioritisation, how to build trust with stakeholders, etc.</p><p><strong>10 great resources to build a data team</strong></p><ul><li><a href="https://www.castordoc.com/blog/how-to-build-your-data-team?ref=blef.fr">How to build your data team?</a> — This article from the Castor team brings all the vocabulary needed. It explains the different models (centralised, embedded or federated) and the pros and cons of each. It also covers the topics of team size and roles.</li><li><a href="https://www.secoda.co/blog/net-promoter-score-for-data-teams?ref=blef.fr">Net promoter score for data teams</a> — A very important topic, I guess. A reminder: one of the most common data team missions is to empower stakeholders. So face the truth and compute an NPS to learn what your stakeholders think of you.</li><li><a href="https://medium.com/alan/vision-for-a-data-team-2eae845b8052?ref=blef.fr">Vision for a data team</a> — Probably the most pragmatic one, full of handy advice. This blog from the Alan data team explains what a data team should do.</li><li><a href="https://erikbern.com/2021/07/07/the-data-team-a-short-story.html?ref=blef.fr">Building a data team at mid-stage startup: a short story</a> — A view of the whole journey your data team will go through in a startup, from the first day to one year later.</li><li><a href="https://about.gitlab.com/handbook/business-technology/data-team/how-we-work/?ref=blef.fr">Data team, how we work</a> — The GitLab handbook, a big bible of resources when it comes to data. Everything is detailed: how they work, how they triage, how they prioritise, etc.</li><li><a href="https://www.typeform.com/blog/inside-story/data-team/?ref=blef.fr">How Typeform built a data team in under 6 months</a> — 5 key insights and 7 top pieces of advice about what to do.
</li><li><a href="https://medium.com/younited-tech-blog/data-organisation-why-are-there-so-many-roles-9c3992d0a436?ref=blef.fr">Data organisation: why are there so many roles?</a> — A great guide to the roles and responsibilities of people in a data team.</li><li><a href="https://mitsloan.mit.edu/ideas-made-to-matter/how-to-build-a-data-analytics-dream-team?ref=blef.fr">How to build a data analytics dream team</a> — Goes a bit further than the previous article, opening up to new (weird) roles.</li><li><a href="https://locallyoptimistic.com/post/the-next-big-challenge-for-data-is-organizational/?ref=blef.fr">The next big challenge for data is organisational</a> — Yes, technically this is just about alignment. The rest is human collaboration and change management, which is quite hard.</li><li><a href="https://www.getdbt.com/data-teams/?ref=blef.fr">Building a data team, dbt recommendation guide</a> — dbt Labs wrote a great guide about building a data team.</li></ul><hr><p>Sorry about the irregularity of the Data News lately, I promise next week I'll be back ❤️.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 24.16 ]]></title>
                    <description><![CDATA[ Data News #24.16 — Llama the Third, Mistral probable $5B valuation, structured Gen AI, principal engineers, big data scale to count billions and benchmarks. ]]></description>
                    <link><![CDATA[ /data-news-week-24-16/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 662127a11d5eca000181b599 ]]></guid>
                    <pubDate><![CDATA[ 2024-04-19 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/photo-1636371449439-e19a1b5a25b2?q=80&amp;w=1000&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D" class="kg-image" alt="a couple of llamas are standing in a field" loading="lazy" width="1000" height="667" srcset="https://images.unsplash.com/photo-1636371449439-e19a1b5a25b2?q=80&amp;w=600&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D 600w, https://images.unsplash.com/photo-1636371449439-e19a1b5a25b2?q=80&amp;w=1000&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D 1000w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">easy (</span><a href="https://unsplash.com/photos/a-couple-of-llamas-are-standing-in-a-field-NJfWUwyUI5M?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p>Hey, new Friday, new Data News. This week, I feel like the selection is smaller than usual, so enjoy the links. I'm a bit late with the Recommendations emails; I'm sorry about that, I got a few new leads as a freelancer that I had to prioritise, which changed my schedule a bit. But don't worry, they'll be out soon.</p><p></p><h1 id="ai-news-%F0%9F%A4%96">AI News 🤖</h1><p><em>When do models get the same hype as the 2007 iPhone release? I did not get the memo.</em></p><ul><li><a href="https://ai.meta.com/blog/meta-llama-3/?ref=blef.fr">Meta releases Llama 3</a> — After last week's <a href="https://mistral.ai/news/mixtral-8x22b/?ref=blef.fr">new Mistral models</a>, this week ends with the new Meta open-source models. Llama the Third is online.
One thing to note is that the model created more hype than <a href="https://about.fb.com/news/2024/04/meta-ai-assistant-built-with-llama-3/?ref=blef.fr">Meta AI</a>, the new ChatGPT competitor run by Zuck's company, which is <a href="https://fortune.com/2024/04/18/meta-ai-llama-3-open-source-ai-increasing-competition/?ref=blef.fr">going all-in on his AI vision</a>. It shows something has changed now that generative models have reached massive adoption: in my bubble at least, people care more about a new model than about an assistant available across the Meta ecosystem (Insta, WhatsApp, Facebook and <a href="https://www.meta.ai/?ref=blef.fr">more</a>). Until we reach model fatigue, the hype is real.<br><br>Personally, I can't comment on the performance of the models; it's like comparing the performance of two cars, as long as I can drive, it's fine. You can try Llama 3 on <a href="https://modal.chat/?ref=blef.fr">Modal</a> or <a href="https://huggingface.co/chat/?ref=blef.fr">HuggingChat</a>.<br><br>To go further you can read this <a href="https://twitter.com/karpathy/status/1781028605709234613?ref=blef.fr">excellent analysis</a> on Twitter or the <a href="https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md?ref=blef.fr">model card</a>—they even give the estimated tCO2eq emitted during the training phase.
In a nutshell it says:<ul><li>Llama is available in 8B and 70B; a 400B version is coming once training is completed—and it approaches GPT-4 performance.</li><li>Llama has a larger tokeniser and the context window grew to 8192 input tokens.</li><li>It was trained on a large dataset containing 15T tokens (compared to 2T for Llama 2).</li></ul></li><li><a href="https://www.theinformation.com/articles/mistral-an-openai-rival-in-europe-in-talks-to-raise-capital-at-a-5-billion-valuation?ref=blef.fr">Mistral wants to raise again at a $5B valuation.</a></li><li><a href="https://www.microsoft.com/en-us/research/project/vasa-1/?ref=blef.fr">Microsoft VASA-1</a> — Microsoft published a paper about a model generating talking avatars from an image and an audio clip. This is quite impressive. They did not release the code, so I tried the closest open-source solution, called <a href="https://github.com/OpenTalker/SadTalker?ref=blef.fr">SadTalker</a>, <a href="https://drive.google.com/file/d/1qUeK1V7mj3CELW8LAistyVgzllCdFzWH/view?usp=drive_link&ref=blef.fr">on</a> <a href="https://drive.google.com/file/d/1IZlEPVdl7vzJ7i64XfhpwcwU-VVYKEYm/view?usp=sharing&ref=blef.fr">me</a>. It is a bit creepy, but impressive given the low quality of my inputs.</li><li><a href="https://towardsdatascience.com/structured-generative-ai-e772123428e4?ref=blef.fr">Structured generative AI</a> — Oren explains how you can constrain generative algorithms to produce structured outputs (like JSON or SQL—seen as an AST). This is super interesting because it details important steps of the generative process.</li><li><a href="https://towardsdatascience.com/evaluate-anything-you-want-creating-advanced-evaluators-with-llms-e2d540af6090?ref=blef.fr">Evaluate anything you want with LLMs</a> — I really like how LLMs can be used for tasks that are not the ones we first think of.
This blog shows how you can use Gen AI to evaluate inputs like translations, with reasons attached.</li><li><a href="https://slack.engineering/how-we-built-slack-ai-to-be-secure-and-private/?ref=blef.fr">How we built Slack AI to be secure and private</a> — How Slack uses a VPC and Amazon SageMaker to keep your data secure and private.</li><li><a href="https://twitter.com/OpenAIDevs/status/1779922566091522492?ref=blef.fr">OpenAI batches</a> — OpenAI opened a new API endpoint to batch requests.</li></ul><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://nealabbott.files.wordpress.com/2020/06/theseus.jpg?w=500" class="kg-image" alt="theseus" loading="lazy" width="500" height="400"><figcaption><span style="white-space: pre-wrap;">Theseus against really big data (</span><a href="https://www.tes.com/teaching-resource/theseus-and-the-minotaur-12477053?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><ul><li><a href="https://blog.alexewerlof.com/p/principal-engineer?ref=blef.fr">Principal Engineer</a> — Although staff and principal roles have been on the career ladder for a long time, there are very few articles on what it takes to become one of the greats. This article covers the whole ladder and the mix of skills needed to reach the top: hard, soft and business skills.</li><li><a href="https://www.junaideffendi.com/p/data-pipeline-incremental-vs-full?ref=blef.fr">Data pipeline, incremental vs. full load</a> — A comprehensive comparison between the two ingestion modes, with a decision tree about which one to pick.</li><li>❤️ <a href="https://www.canva.dev/blog/engineering/scaling-to-count-billions/?ref=blef.fr">Scaling to count billions</a> — An awesome retrospective of the Canva OLAP architecture used to count marketplace usage, from MySQL to Snowflake + buckets.
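A counting system like Canva's boils down to collect, deduplicate, then aggregate; as a toy, stdlib-only sketch of the dedupe-and-count step (the event shape and names here are hypothetical, not Canva's actual schema):

```python
from collections import Counter

# Hypothetical usage events; "e1" arrives twice because delivery
# is typically at-least-once, so we track seen event ids.
events = [
    {"event_id": "e1", "item": "template_a"},
    {"event_id": "e2", "item": "template_b"},
    {"event_id": "e1", "item": "template_a"},  # duplicate delivery
]

seen = set()
counts = Counter()
for e in events:
    if e["event_id"] not in seen:   # deduplication
        seen.add(e["event_id"])
        counts[e["item"]] += 1      # aggregation

print(counts["template_a"], counts["template_b"])  # 1 1
```

At scale the "seen" set becomes a dedup table or window in the warehouse, but the logic stays the same.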
It nicely breaks down every important part of an OLAP platform: collection, deduplication and aggregation.</li><li><a href="https://voltrondata.com/benchmarks/theseus?ref=blef.fr">Spark (and Theseus) on GPUs benchmark</a> — A detailed benchmark by Voltron Data about running Spark and Theseus (their GPU data processing engine) workloads on GPUs. It's crazy how much Theseus outperforms Spark. The conclusion reads like a great summary to me:<ul><li>For less than 2TBs &gt; use DuckDB, Polars, DataFusion or Arrow backed projects.</li><li>Up to 30TBs &gt; Cloud warehouse or Spark</li><li>Over 30TBs &gt; Go Theseus. [Theseus]&nbsp;"<em>prefer to operate when queries exceed 100TBs"</em>. 😅</li></ul></li><li><a href="https://pola.rs/posts/benchmarks/?ref=blef.fr">Polars new benchmarks</a> — Polars released new benchmarks on the TPC-H dataset. Polars and DuckDB are the cool kids, and the benchmarks show you should stop using pandas and switch to Polars for a 10x performance gain.</li><li><a href="https://www.hydra.so/blog-posts/2022-03-21-announcing-hydra-postgres-data-warehouse?ref=blef.fr">Hydra: the Postgres data warehouse</a> — Postgres is one of the most used databases; this week I discovered Hydra, an open-source columnar port of Postgres aiming to create an open-source Snowflake. One to watch.</li><li><a href="https://neon.tech/blog/neon-ga?ref=blef.fr">Neon GA</a> — Neon, another Postgres fork, is generally available.
Neon wants to provide a serverless, autoscaling Postgres for devs.</li><li><a href="https://kestra.io/blogs/2024-18-04-clever-cloud-use-case?ref=blef.fr">Clever Cloud offloading 20TB every month</a> — Kestra showcases how one of their clients uses the declarative orchestrator to offload TBs of data every month.</li><li><a href="https://medium.com/criteo-engineering/kubecon-cloudnativecon-europe24-notes-d8d9f4d77c6d?ref=blef.fr">KubeCon + CloudNativeCon Europe’24 notes</a> — A few notes from the big 2024 Kube mass.</li><li><a href="https://smallbigdata.substack.com/p/is-sqlmesh-the-dbt-core-20-a-feet?ref=blef.fr">Is SQLMesh the dbt Core 2.0</a>? — A great blog answering a great question. SQLMesh is bringing fresh ideas to the SQL transformation landscape. The post covers a lot of topics and explains the conceptual similarities between the two tools.</li><li><a href="https://github.com/gwenwindflower/tbd?ref=blef.fr">gwenwindflower/tbd</a> — A code generator for dbt. Winnie developed a great tool to save time documenting your dbt projects using Gen AI models.</li><li><a href="https://www.snowflake.com/blog/introducing-snowflake-arctic-embed-snowflakes-state-of-the-art-text-embedding-family-of-models/?ref=blef.fr">Snowflake text embeddings for retrieval</a>.</li><li><a href="https://xebia.com/blog/distributed-dashboarding-with-duckdb-wasm/?ref=blef.fr">Distributed dashboarding with DuckDB WASM</a> — Ramon put words to ideas I have had in my mind for months: <strong>distributed dashboarding</strong>.
I really buy into this concept, especially with DuckDB WASM and what it unlocks in terms of autonomy and privacy for users.</li><li><a href="https://juhache.substack.com/p/write-audit-publish-wap-pattern?ref=blef.fr">WAP with dbt, Iceberg or Nessie</a> — Julien showcases how you can achieve the WAP pattern with different technologies.</li><li><a href="https://csv-to-db-six.vercel.app/?ref=blef.fr">CSV to DB</a> — What if you could open a CSV and re-order or rename the columns directly in the browser? Without any backend call—with DuckDB, obviously. s/o to Théodore, a Data News subscriber, who developed this. This is a great idea.</li></ul><hr><p>See you next week ❤️.</p><p></p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 24.15 ]]></title>
                    <description><![CDATA[ Data News #24.15 — MDSFest quick recap, LLM news, Airbnb Chronon, AST, Beam YAML, WAP and more. ]]></description>
                    <link><![CDATA[ /data-news-week-24-15/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 661906df338923000186a0e9 ]]></guid>
                    <pubDate><![CDATA[ 2024-04-12 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/photo-1450044804117-534ccd6e6a3a?q=80&amp;w=1000&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D" class="kg-image" alt="crowd of people at concert" loading="lazy" width="1000" height="750" srcset="https://images.unsplash.com/photo-1450044804117-534ccd6e6a3a?q=80&amp;w=600&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D 600w, https://images.unsplash.com/photo-1450044804117-534ccd6e6a3a?q=80&amp;w=1000&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D 1000w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">The fest we deserve (</span><a href="https://unsplash.com/photos/crowd-of-people-at-concert-rdmJc2Os4EM?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p>I hope this Data News finds you well. In today's edition we have a large selection of links; I think you will enjoy it.</p><p>But first I want to welcome all the new members joining this week after my new <a href="https://youtu.be/LNC0SbHknxw?si=RZ2IRcCTIl0DMog7&ref=blef.fr">episode on DataGen</a> with Robin Conquet. This episode is in French and we mainly talked about the possible end of the modern data stack, which I had already condensed in <a href="https://www.blef.fr/modern-data-stack-disappearing/">a post a few weeks ago</a> (in English).</p><p></p><h1 id="mds-fest-%F0%9F%A5%B3">MDS Fest 🥳</h1><p>As announced last week, I participated in MDS Fest 2.0 this Thursday.
I shared my journey with Apache Superset and why I consider Superset the best open-source alternative when it comes to building BI applications.</p><p>Yes, because you should <strong>stop building dashboards and build BI apps instead</strong>. This is part of the productisation of data, but mainly I think you should consider your BI tool as a way for your users to interact with data, not only to monitor metrics. With the customisation it allows, Superset is the best tool for it.</p><p>You can have a look at my <a href="https://docs.google.com/presentation/d/1GaIN0p6msfYm3ZzwPoV6q4HqARXi0003AxVyxqDs_jU/edit?usp=sharing&ref=blef.fr">slides</a> or watch the <a href="https://www.youtube.com/watch?v=3BQBnE8jYsI&ref=blef.fr">replay on YouTube</a>.</p><p>A lot of other talks took place at the same conference; here is a small selection you should check out:</p><ul><li><a href="https://www.youtube.com/watch?v=fetXTKA1U9o&ref=blef.fr">How to pivot your data team from a service team to a value-generator</a> — Very often data teams struggle to deliver value or to find their real identity. Taylor identified patterns and gives great advice to help you find yours.</li><li><a href="https://www.youtube.com/watch?v=bQJ3wMqJB0M&ref=blef.fr">Data contracts: federated data governance</a> — Another talk by Chad about data contracts, always on point in describing the pains around the "data supply chain".</li><li><a href="https://www.youtube.com/watch?v=L0M_RWSp4RE&ref=blef.fr">Deliver reporting in pure SQL with dbt + Evidence</a> — A great showcase of what you can build with Evidence (a BI-as-code solution).</li><li><a href="https://www.youtube.com/watch?v=c7XvlQ3s5Yg&ref=blef.fr">Build analytics at Hive.co</a> — The journey Oleg and his team went through to implement a modern data stack.
They used RFCs to document where they were heading.</li></ul><p><em>PS: </em><a href="https://preset.io/blog/apache-superset-4-0-release-notes/?ref=blef.fr"><em>Apache Superset hit 4.0</em></a><em> this week with a lot of new features.</em></p><p></p><h1 id="ai-news-%F0%9F%A4%96">AI News 🤖</h1><ul><li><a href="https://blog.siemens.com/2024/04/open-source-llms-for-everyone/?ref=blef.fr">Open-source LLMs for everyone</a> — A great post from the Siemens AI team about open LLM initiatives that bring new usages to the dev workflow, whether code completion or pull request / crash report summarisation; it looks neat.</li><li><a href="https://blog.replit.com/code-repair?ref=blef.fr">Building LLMs for code repair</a> —&nbsp;Replit is an AI-driven workspace for developers (think of a supercharged IDE). They wrote a blog about what they developed to create LLM-driven fix suggestions for LSP (<a href="https://en.wikipedia.org/wiki/Language_Server_Protocol?ref=blef.fr">Language Server Protocol</a>), a protocol between your IDE and a server that understands and analyses the code to find errors or highlight the code.</li><li><a href="https://huggingface.co/spaces/lhoestq/LLM_DataGen?ref=blef.fr">LLM DataGen</a> — A small demo of an LLM based on Gemma that generates JSONL from a given name.
It doesn't work super well, and it would be better if we could specify the column names and types, for instance, but it showcases another great usage of generative algorithms.</li><li><a href="https://www.metacareers.com/life/behind-gen-ai-building-an-infrastructure-for-the-future?ref=blef.fr">Meta, building an infrastructure for the future</a> — It explains how Meta is partnering with GPU vendors to design new chips, and how incredibly hard it is to connect thousands of GPUs in a cluster where everything can fail at any moment.</li><li><a href="https://twitter.com/deedydas/status/1778621375592485076?ref=blef.fr">Can Gemini 1.5 actually read all the Harry Potter books at once?</a> —&nbsp;A nice Graphviz chart spotted on Twitter mapping all the Harry Potter relationships in a poster, done by Gemini with the content of all the books. Obviously Gemini already knows some of the Hogwarts lore from its training, but this is still impressive. Sadly we don't have the complete prompt / code.</li><li>Speaking of prompts, PromptLayer organised a tournament and blogged about their <a href="https://blog.promptlayer.com/our-favorite-prompts-from-the-tournament-b9d99464c1dc?ref=blef.fr">favourite prompts of the competition</a>.
Once again, speaking to an LLM is like speaking to children: USE CAPITAL LETTERS TO CAPTURE THEIR ATTENTION.</li><li><a href="https://github.com/openai/simple-evals?ref=blef.fr">OpenAI open-sourced a light library to evaluate language models</a>&nbsp;— you can use 7 different evals and check the results on OpenAI or Claude models.</li><li><a href="https://huggingface.co/mistral-community/Mixtral-8x22B-v0.1?ref=blef.fr">Mixtral-8x22B is out</a> — a new model that does something probably awesome.</li></ul><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li>Last week I forgot to share the <a href="https://www.getdbt.com/resources/reports/state-of-analytics-engineering-2024?ref=blef.fr">2024 state of analytics engineering</a> by dbt Labs. At a glance it depicts well the trends I also see in my local market: in 2024, more than 50% of data practitioners' time is spent maintaining or organising data assets.</li><li><a href="https://beam.incubator.apache.org/blog/beam-yaml-release/?ref=blef.fr">Introducing Beam YAML</a> — Apache Beam is a unified processing framework (meaning it unifies streaming and batch) that runs on many different engines. Today they introduce Beam YAML, a way to write pipelines declaratively. Reminds me of <a href="https://kestra.io/?ref=blef.fr">Kestra</a> so much.</li><li><a href="https://tobikodata.com/ast_journey.html?ref=blef.fr">How I became an AST convert</a> — I've been an AST convert for a long time, and I'm so happy someone wrote about this. AST stands for abstract syntax tree: an abstract representation of a program in a given language, and this is what SQLGlot (hence SQLMesh) builds for SQL. Afzal explains in this blog what that means, especially in a diff context.
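To make the AST idea concrete, here is a tiny sketch with Python's built-in ast module, as an analogy for what SQLGlot does for SQL: two cosmetically different spellings of the same statement parse to the same tree, which is why diffing at the AST level ignores formatting noise.

```python
import ast

# Two cosmetically different spellings of the same statement.
a = ast.parse("x = 1 + 2")
b = ast.parse("x   =   (1 + 2)")  # extra whitespace and parentheses

# Whitespace and redundant parentheses don't exist in the tree,
# so the dumped structures compare equal.
print(ast.dump(a) == ast.dump(b))  # True
```

Swap ast for SQLGlot's SQL parser and you get structural diffs of SQL models, the mechanism behind SQLMesh's change detection.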
<br><br>You can also listen to Toby's interview on Joe's podcast about <a href="https://podcasters.spotify.com/pod/show/joereis/episodes/Toby-Mao---SQLMesh--Simplifying-Data-Transformations--and-more-e2ht7mt?ref=blef.fr">SQLMesh and SQL transformations</a>.</li><li><a href="https://github.com/airbnb/chronon?ref=blef.fr">Airbnb open-sources Chronon</a> — A data platform for serving features to AI/ML applications. In Chronon you define <a href="https://chronon.ai/getting_started/Introduction.html?ref=blef.fr#example">sources and GroupBys</a>—collections of aggregations on keys—which in the end represent features, and the platform handles the downstream management.</li><li><a href="https://cloud.google.com/bigquery/docs/data-canvas?ref=blef.fr">BigQuery releases data canvas</a> — This is a large open canvas (like <a href="https://count.co/?ref=blef.fr">count.co</a>) in which you can write SQL queries, assisted by Gemini, and link them in a DAG fashion.</li><li><a href="https://uncledata.substack.com/p/write-audit-publish-pattern-in-modern?ref=blef.fr">Write-Audit-Publish pattern in modern data pipelines</a> — A pattern worth knowing better because it can prevent your pipelines from pushing wrong data into your users' tools.</li><li><a href="https://preset.io/blog/exploring-the-dbt-cloud-semantic-layer-in-preset/?ref=blef.fr">Using Preset (Superset) to explore the dbt Cloud semantic layer</a> — You can configure a sync between the two clouds or use a CLI; then you will be able to explore metrics in your BI tool.</li><li><a href="https://thorben-janssen.com/book-review-duckdb-in-action/?ref=blef.fr">Book review of DuckDB in Action</a>.</li><li><a href="https://youtu.be/YrqSp8m7fmk?si=6UFP0F034DNbEt4c&ref=blef.fr">Efficient CSV parsing</a> — A YouTube talk about the DuckDB CSV parser and what it means to parse unstructured files.</li><li>To conclude this edition, two project walkthroughs:<ul><li><a href="https://blog.dagworks.io/p/slack-summary-pipeline-with-dlt-ibis?ref=blef.fr">Slack summary pipeline with dlt, Ibis, and Hamilton</a>.</li><li><a href="https://www.linkedin.com/pulse/local-pipeline-development-sqlmesh-airflow-postgres-alexis-chicoine-oiyte/?trackingId=pvbvtn8DSUmPiHikq660Yw%3D%3D&ref=blef.fr">Local pipeline development with SQLMesh, Airflow, and Postgres</a>.</li></ul></li></ul><p></p><p><em>✨ s/o to Hugo who runs a weekly data round-up; he published before me this week, so you can also check out his </em><a href="https://orchestra.substack.com/p/roundup-29-we-15-april-2024?ref=blef.fr"><em>great link selection</em></a><em>.</em></p><hr><p>See you next week ❤️</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 24.14 ]]></title>
                    <description><![CDATA[ Data News #24.14 — New MAD landscape, polars on GPU, git in Snowflake, open data portals and more. ]]></description>
                    <link><![CDATA[ /data-news-week-24-14/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 660a785f1f175400014d011c ]]></guid>
                    <pubDate><![CDATA[ 2024-04-05 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/photo-1524311583145-d5593bd3502a?q=80&amp;w=1000&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D" class="kg-image" alt="person carrying backpack inside library" loading="lazy" width="1000" height="667" srcset="https://images.unsplash.com/photo-1524311583145-d5593bd3502a?q=80&amp;w=600&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D 600w, https://images.unsplash.com/photo-1524311583145-d5593bd3502a?q=80&amp;w=1000&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D 1000w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Lost between ideas (</span><a href="https://unsplash.com/photos/person-carrying-backpack-inside-library-W_ZYCEUapF0?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p>Hey, new Data News edition. I hope you will enjoy this week's selection after skipping last week's. I was a bit overwhelmed with the amount of tasks on my desk (and I still am). But here we are.</p><p>Before jumping to the news, I want to let you know that I have improved the <a href="https://www.blef.fr/explorer/reco/">Recommendations</a> page, and the weekly emails with the recommendations should arrive soon. The new page better supports mobile and gives you GPT-4-generated titles and overviews of the links.</p><p>I'll speak at the&nbsp;<a href="https://www.mdsfest.com/?ref=blef.fr">MDS Fest 2.0</a> next week on April 10. MDS Fest is a free virtual 5-day conference about Modern Data Stack topics with a lot of awesome speakers; there are a few talks I can't wait to watch.
On my side I'll talk about Apache Superset and what you can do to build a complete application with it.</p><figure class="kg-card kg-image-card"><img src="https://www.blef.fr/content/images/2024/04/frame_80424.jpg" class="kg-image" alt="" loading="lazy" width="2000" height="433" srcset="https://www.blef.fr/content/images/size/w600/2024/04/frame_80424.jpg 600w, https://www.blef.fr/content/images/size/w1000/2024/04/frame_80424.jpg 1000w, https://www.blef.fr/content/images/size/w1600/2024/04/frame_80424.jpg 1600w, https://www.blef.fr/content/images/2024/04/frame_80424.jpg 2000w" sizes="(min-width: 720px) 720px"></figure><p></p><h1 id="ai-news-%F0%9F%A4%96">AI News 🤖</h1><ul><li><a href="https://docs.google.com/presentation/d/1dbfoxzNcoI-D45RKZfO1UfBJIr4v0YtHhj1cwuCj020/edit?ref=blef.fr#slide=id.p">LlamaIndex slides, examples with Mistral AI</a> — A few slides with a lot of examples of how you can use LlamaIndex with Mistral AI models. I guess there is a video associated with the slides, but I don't have it. It shows a few RAGs, agents and document parsers to retrieve the data you need.</li><li><a href="https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm?ref=blef.fr">DBRX, a new state-of-the-art open LLM</a> — Databricks has to be an AI company (<a href="https://twitter.com/arny_trezzi/status/1775972218716995776?ref=blef.fr">bragging vs. Snowflake</a>). This week they released a new open model that performs great.</li><li><a href="https://medium.com/pinterest-engineering/how-we-built-text-to-sql-at-pinterest-30bad30dabff?ref=blef.fr">How we built Text-to-SQL at Pinterest</a> — Pinterest open-sourced a tool called Querybook that they use to access Pinterest data every day. In order to boost usage they developed a text-to-SQL feature.
This article explains in detail how they did it.</li></ul><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li><a href="https://mattturck.com/mad2024/?ref=blef.fr">MAD 2024 landscape</a> — The new edition of the Machine learning, AI and Data Landscape is out, with many logos and obvious changes since last year given the GenAI hype. I haven't analysed the new map yet, but I'll try to do it soon.</li><li><a href="https://pola.rs/posts/polars-on-gpu/?ref=blef.fr">Polars on GPU</a> — Polars announced a collaboration with RAPIDS to bring GPU performance to Polars and push it to another summit. Being able to switch Polars engines like this looks cool. If you are using Polars, reach out to me—I'm curious to know how people are using it.</li><li><a href="https://medium.com/snowflake/connect-git-to-snowflake-now-in-public-preview-25b0456ce02c?ref=blef.fr">Git in Snowflake</a> — Snowflake is getting more and more features month after month, becoming a complete suite of applications reachable directly from SQL in your warehouse. It reminds me of Oracle and I don't like this centralisation, but the future always goes the bundling way. Now you can read a Git repository when creating a procedure in SQL.</li><li><a href="https://www.pgrs.net/2024/03/21/duckdb-as-the-new-jq/?ref=blef.fr">DuckDB is the new jq</a> — The author shows how you can manipulate a JSON file with a DuckDB one-liner. I really like this take; it gives a great perspective on DuckDB and how you can use it locally for fast manipulation. But contrary to jq, which has a non-trivial syntax, DuckDB is SQL.</li><li><a href="https://www.linkedin.com/posts/lakehouse_deltalake-apacheiceberg-apachehudi-activity-7179573552331837441-IBYH/?ref=blef.fr">Survey about query engines used by companies</a> — Data Council happened recently. 
It's a US-based conference that I really like because the talks and ideas discussed there often shape what we do in the data industry—at least from what I see in the YouTube videos; I've never been there myself. A speaker ran a survey during his keynote about the query engines used by the audience, and Spark still leads ahead of BigQuery/Snowflake/Athena.</li><li><a href="https://clickhouse.com/blog/building-a-logging-platform-with-clickhouse-and-saving-millions-over-datadog?ref=blef.fr" rel="noreferrer">How we built a 19 PiB logging platform with ClickHouse</a> — ClickHouse is a tech company, and you can see it in the blog post. They explain in depth why they chose ClickHouse to monitor their ClickHouse Cloud offering, saving money on their Datadog bill.</li><li>❤️ <a href="https://davidgasquez.com/modern-open-data-portals/?ref=blef.fr">Building open data portals in 2024</a> — David open-sourced an end-to-end framework to build open data portals. This is awesome (<a href="https://filecoindataportal.davidgasquez.com/?ref=blef.fr">example</a>): you can easily ingest, transform and share data. It looks like yato but with many more features pieced together to create a local-first data platform.</li><li><a href="https://engineering.atspotify.com/2024/04/data-platform-explained/?ref=blef.fr">Spotify, data platform explained</a> — The beginning of a series explaining the Spotify data platform.</li><li><a href="https://towardsdatascience.com/navigating-your-data-platforms-growing-pains-a-path-from-data-mess-to-data-mesh-c16df72f5463?ref=blef.fr">A path from data mess to data mesh</a> — 5 key principles you should apply to avoid the data mess.</li><li><a href="https://davidsj.substack.com/p/semantic-layers-a-buyers-guide?ref=blef.fr">Semantic layers, a buyers guide</a> — This is an exhaustive comparison between dbt Cloud's metrics offering and Cube. 
In a nutshell, I'd say that both technologies are not yet mature, with a slight advantage to Cube for being open.</li><li><a href="https://fromanengineersight.substack.com/p/the-data-analyst-every-ceo-wants?ref=blef.fr">The data analyst every CEO wants</a> — I really like this blog from Benoit; he gives practical advice about what to focus on if you're working as a data analyst for the C-level of your company. </li><li><a href="https://blog.picnic.nl/yaml-developers-and-the-declarative-data-platforms-4719b7a1311c?ref=blef.fr">YAML developers and the declarative data platforms</a> — A good introduction to why declarative languages are perfect for creating data platforms. To be honest, I think this is a topic that separates good data engineers from great ones. Creating a declarative data platform is easy, but creating the right level of abstraction that describes reality without creating debt and over-engineered solutions is much harder.</li><li><a href="https://github.com/datarecce/recce?ref=blef.fr">PR review tool for dbt projects</a> — A nice tool creating visual representations comparing 2 dbt artifacts that you can embed in a CI to validate changes before they get merged into production code.</li><li><a href="https://www.linkedin.com/pulse/when-data-model-finished-bill-inmon-cqkhc/?trackingId=QiPwlC8rT5uFGEvndODVVQ%3D%3D&ref=blef.fr">When is the data model finished?</a> — Spoiler: a data model is never finished. A data model needs to depict your company's business and activities; as time goes by, activities grow, and you obviously have to manage this asset over time.</li><li><a href="https://maxhalford.github.io/blog/bike-sharing-forecasting-training-set/?ref=blef.fr">A training set for bike sharing forecasting</a> — Max has created a large dataset of bike sharing providers in ~50 cities around the world. 
If you want to play with DuckDB and visualisations, this is a good start.</li><li><a href="https://www.architecture-performance.fr/ap_blog/calculating-walking-isochrones-with-python/?ref=blef.fr">Calculating walking isochrones in Python</a> —&nbsp;A cool way to produce Python viz.</li></ul><p></p><hr><p>See you next week ❤️</p><p>Don't forget to check out the new Recommendations page (below is an overview of mine).</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2024/04/Frame-22.png" class="kg-image" alt="" loading="lazy" width="2000" height="1047" srcset="https://www.blef.fr/content/images/size/w600/2024/04/Frame-22.png 600w, https://www.blef.fr/content/images/size/w1000/2024/04/Frame-22.png 1000w, https://www.blef.fr/content/images/size/w1600/2024/04/Frame-22.png 1600w, https://www.blef.fr/content/images/2024/04/Frame-22.png 2000w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Overview of Recommendations and email (resp. left and right)</span></figcaption></figure> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 24.12 ]]></title>
                    <description><![CDATA[ Data News #24.12 — My Friday routine, the 01 interpreter, RAG, xAI Grok-1, Apple entering the course, run Spark in BigQuery, Williams F1 using Excel BigData (lol) and more. ]]></description>
                    <link><![CDATA[ /data-news-week-24-12/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 65fd79b0f7bf6400015e2ec0 ]]></guid>
                    <pubDate><![CDATA[ 2024-03-22 ]]></pubDate>
                    <content>
                        <![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/photo-1581269632459-409ff09f73de?q=80&amp;w=1000&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D" class="kg-image" alt="woman in white t-shirt holding black ceramic mug" loading="lazy" width="1000" height="667" srcset="https://images.unsplash.com/photo-1581269632459-409ff09f73de?q=80&amp;w=600&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D 600w, https://images.unsplash.com/photo-1581269632459-409ff09f73de?q=80&amp;w=1000&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D 1000w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Friday routine (</span><a href="https://unsplash.com/photos/woman-in-white-t-shirt-holding-black-ceramic-mug-KF96lDEvqwY?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p>It's Friday and it's Data News. I don't usually go into much detail about the magic behind Data News, but every Friday is the same. At first I'm: <em>oh shit, here we go again</em>, and 10 minutes later I'm lost in reading the content and picking too many articles to fit into a thousand-word edition.</p><p>Usually the whole process takes me a full Friday. I organise myself as follows:</p><ul><li><strong>During the whole week I scroll</strong>—too much—LinkedIn. I save posts without reading them. Sometimes I also save stuff on Twitter by liking it. 
The reason I do this is to avoid <a href="https://en.wikipedia.org/wiki/Context_switch?ref=blef.fr">context switching</a>—let's be honest, it works for the DN context, but it does not work in my life in general.</li><li><strong>Exploration, Friday morning</strong><ul><li>I read the last 7 days of 2 Twitter lists (<a href="https://twitter.com/i/lists/1463573327868481540?ref=blef.fr">MDS</a>, <a href="https://twitter.com/i/lists/1484841091828432896?ref=blef.fr">Data voices</a>) and I open interesting stuff in tabs.</li><li>Then I use Feedly, which is connected to ~500 websites, Reddit and Medium, and I open interesting articles in tabs.</li><li>Then I open the items saved from LinkedIn.</li></ul></li><li><strong>Reading and writing, Friday afternoon</strong><ul><li>I read the articles and remove what I find irrelevant (context, values, quality, etc.). I make a first connection between all the links, trying to sort them into a fluid path between the articles' ideas.</li><li>I usually go from ~50 links down to 25 after the reading part.</li><li>I write in one go, from top to bottom.</li></ul></li><li><strong>Publication</strong>—Once the Data News is ready, I just click publish; I don't proofread much (sorry for the typos). I already spend so much time selecting and writing that I can't be stuck in revision mode for long.</li><li><strong>Post-publication</strong>—After publication I do my homework of promoting my own work (mainly on LinkedIn), and I run a few post-publication scripts for the Explorer / Recommendations. I also watch the click / open stats, and that's all. But I think I could do it better.</li></ul><p>The process works well, but as you can see, because I use fresh news, it's just-in-time. Which puts pressure on my Fridays. 
I'd like to have a few articles in stock to remove the pressure of having to write something on certain Fridays and take those off.</p><p><em>❤️ I rarely say it: if Data News helps you save time you should consider taking a </em><a href="https://www.blef.fr/#/portal/signup/60817789b7677e002ff7b655/yearly"><em>paid subscription</em></a><em> (60€/year) to help me cover the blog fees and my writing Fridays.</em></p><p>Just before I jump to the news: I'll speak at the <a href="https://www.mdsfest.com/?ref=blef.fr">MDS Fest 2.0</a> on April 10. MDS Fest is a free five-day virtual conference about Modern Data Stack topics with a lot of awesome speakers; there are a few talks I can't wait to watch. On my side I'll talk about Apache Superset and what you can do to build a complete application with it.</p><p><strong>Ok. Now give me the news.</strong></p><p></p><h1 id="ai-news-%F0%9F%A4%96">AI News 🤖</h1><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2024/03/Screenshot-2024-03-22-at-18.00.56.png" class="kg-image" alt="" loading="lazy" width="2000" height="1469" srcset="https://www.blef.fr/content/images/size/w600/2024/03/Screenshot-2024-03-22-at-18.00.56.png 600w, https://www.blef.fr/content/images/size/w1000/2024/03/Screenshot-2024-03-22-at-18.00.56.png 1000w, https://www.blef.fr/content/images/size/w1600/2024/03/Screenshot-2024-03-22-at-18.00.56.png 1600w, https://www.blef.fr/content/images/size/w2400/2024/03/Screenshot-2024-03-22-at-18.00.56.png 2400w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">The 01 light</span></figcaption></figure><ul><li><a href="https://twitter.com/OpenInterpreter/status/1770821439458840846?ref=blef.fr">01 open interpreter</a> — The 01 light is a small device, operable with your voice, that controls your home computer. I've rarely been amazed by the latest physical AI devices startups have produced, but this one is different. 
It's a small white sphere that understands what you say and then controls your computer's mouse to execute actions for you, whether you're in front of your computer or elsewhere.<br><br>The initiative wants to be open(-source) and they provided the <a href="https://github.com/OpenInterpreter/01?tab=readme-ov-file&ref=blef.fr">code on Github</a>. They actually trained a "computer LLM". And they are the reason this newsletter was late: you can build the physical device yourself with Arduino stuff—<a href="https://github.com/OpenInterpreter/01/blob/main/hardware/light/BOM.md?ref=blef.fr">list of materials</a>—and I wanted to do it today, but a part was not available at the electronics shop 🥲.<br><br>Under the hood it uses a <a href="https://github.com/OpenInterpreter/01/blob/main/software/source/server/system_messages/BaseSystemMessage.py?ref=blef.fr">big prompt</a> to instruct their LLM, because in the end <a href="https://hamel.dev/blog/posts/prompt/?ref=blef.fr">fuck you, show me the prompt</a>. It's always fun to read the prompts companies use to do specific tasks. Sometimes it looks like you're speaking to a child: caps lock and repetition to make the algorithm understand.</li><li><a href="https://github.com/xai-org/grok-1?ref=blef.fr">Finally, xAI released Grok-1 in the open</a> — The weights are available via torrent / HF and everything is under the Apache License. The repo was released last Sunday, after Musk publicly announced the release the week before; I feel bad for the sweaty engineers who worked on it the whole week. 
I haven't seen much feedback on it since.</li><li>Apple is trying to enter the LLM game (<a href="https://www.rfi.fr/en/international/20240318-tech-giants-grilled-on-their-compliance-with-eu-s-new-digital-markets-act?ref=blef.fr">while</a> <a href="https://apnews.com/article/apple-antitrust-monopoly-app-store-justice-department-822d7e8f5cf53a2636795fcc33ee1fc3?ref=blef.fr">facing</a> <a href="https://www.rfi.fr/en/science-and-technology/20240304-apple-faces-%E2%82%AC1-8bn-eu-fine-for-breaking-music-streaming-competition-laws?ref=blef.fr">fines</a>) — Rumours say they will partner with Google to use <a href="https://www.bloomberg.com/news/articles/2024-03-18/apple-in-talks-to-license-google-gemini-for-iphone-ios-18-generative-ai-tools?embedded-checkout=true&ref=blef.fr">Gemini to power iPhone AI features</a>; at the same time they wrote a paper about <a href="https://arxiv.org/abs/2403.09611?ref=blef.fr">MM1, a family of multimodal models up to 30B parameters</a>.</li><li><a href="https://blogs.microsoft.com/blog/2024/03/19/mustafa-suleyman-deepmind-and-inflection-co-founder-joins-microsoft-to-lead-copilot/?ref=blef.fr">Microsoft hires DeepMind co-founder</a> — <strong>Mustafa Suleyman will lead a new organisation called Microsoft AI</strong>. Following the announcement, the Copilot, Bing, Edge and GenAI teams will all move to the new organisation. Satya Nadella is going all-in on AI. 
It's important to say that Mustafa is joining Microsoft <a href="https://inflection.ai/the-new-inflection?ref=blef.fr">from Inflection</a>, an LLM company in which Microsoft invested a year ago.</li><li>OpenAI is closing partnerships with major newspapers in Europe — After <a href="https://openai.com/blog/axel-springer-partnership?ref=blef.fr">Axel Springer in Germany</a>, they signed with <a href="https://openai.com/blog/global-news-partnerships-le-monde-and-prisa-media?ref=blef.fr">Prisa Media</a> (which groups El País in Spain and the Huffington Post worldwide) and with <a href="https://www.lemonde.fr/en/about-us/article/2024/03/13/le-monde-signs-artificial-intelligence-partnership-agreement-with-open-ai_6615418_115.html?ref=blef.fr">Le Monde</a> in France. All these partnerships will help OpenAI train GPTs on media corpora to <em>enhance the reliability of the answers in return for a significant source of additional revenue</em>.</li><li><a href="https://huggingface.co/collections/PleIAs/common-corpus-65d46e3ea3980fdcd66a5613?ref=blef.fr">Common Corpus</a> — A HuggingFace dataset collection including public domain texts, newspapers and books in a lot of languages. Terabytes in size.</li><li><a href="https://towardsdatascience.com/designing-rags-dbb9a7c1d729?ref=blef.fr">Designing RAGs</a> — A super long and detailed article about RAG. It covers the 5 main components: indexing, storing, retrieval, synthesis and evaluation. 
Let's be honest: it contains everything you need to know about this new trend and the key considerations.</li><li><a href="https://superlinked.com/vector-db-comparison/?ref=blef.fr">Vector DB comparison</a> — A table comparing all the different vector technologies on different axes like search, models, APIs and technical details.</li><li><a href="https://github.com/fmind/mlops-python-package?ref=blef.fr">Python codebase with best practices to support MLOps</a> — This is a Github repository with a lot, I mean a lot, of tools and tips to create a production-grade repository.</li></ul><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li><a href="https://cloud.google.com/blog/products/data-analytics/apache-spark-stored-procedures-in-bigquery-are-ga?hl=en&ref=blef.fr">Run Spark procedures in BigQuery</a> — BigQuery released a way to write PySpark code in the web editor and to run / deploy it from there, creating a new serverless way to build BigQuery assets. This is a nice way to mix SQL and Python code.</li><li><a href="https://juhache.substack.com/p/pip-install-data-stack?ref=blef.fr">pip install data-stack</a> —&nbsp;This is a title I could have written myself. In this blog Julien covers the new Pythonic tooling and how far it can bring us in building lightweight programmatic data stacks. He also mentions my baby <a href="https://github.com/Bl3f/yato?ref=blef.fr">yato</a>.</li><li><a href="https://github.com/pretzelai/pretzelai?ref=blef.fr">Pretzel notebooks</a> — A new open-source notebook / exploration tool built on top of DuckDB WASM and PRQL; it allows you to chain operations like file upload, SQL, charting, filtering, sorting, etc. 
You can explore the <a href="https://pretzelai.github.io/?ref=blef.fr">demo</a>.</li><li>On the same topic, <a href="https://twitter.com/trucklos/status/1770490894581485756?ref=blef.fr">Hashquery</a> launched — a Python framework to create semantic data models.</li><li><a href="https://www.the-race.com/formula-1/shocking-details-behind-painful-williams-f1-revolution/?ref=blef.fr">Williams F1 used Excel to build their car</a> — F1 parts (thousands of them) were managed in a spreadsheet. These Excel files were unmanageable and explain why Williams had delivery delays. That's not surprising, because I think during an F1 season there aren't a lot of breaks you can use to pay down technical debt.</li><li><a href="https://github.com/dbt-labs/dbt-core/blob/v1.8.0b1/CHANGELOG.md?ref=blef.fr">dbt Core unit testing in v1.8</a> — dbt Core has implemented unit testing and it's coming soon. When unit testing a model you can give input rows and say what you expect as output rows <a href="https://docs.getdbt.com/docs/build/unit-tests?ref=blef.fr">in the YAML definition</a>. dbt will run and validate the model for you. This is a game changer.</li><li><a href="https://duckdbsnippets.com/page/1/most-popular?ref=blef.fr">Awesome DuckDB snippets</a> — A website that collects cool DuckDB snippets. The most popular is a 4-line bash command that you can add to your bashrc to convert a CSV to Parquet.</li><li><a href="https://medium.com/@mikldd/the-cost-of-data-incidents-53646b588601?ref=blef.fr">The cost of data incidents</a> — Mikkel is one of my favourite authors; he carefully picks his titles so that they resonate deeply with me. He proposes a formula to compute the cost of your data incidents, turning downtime numbers into $.</li><li><a href="https://erdavis.com/2024/03/07/my-2023-in-reading/?ref=blef.fr">2023 in reading</a> — This is a great side project idea: a visualisation of the hours Erin spent reading books in 2023. 
Personally, I just finished my first book of 2024 😅.</li></ul><p></p><hr><p>This newsletter edition is already too long and I have 10 other deep articles that I'll keep for next week ❤️.</p><p><a href="https://www.blef.fr/explorer/reco/">Recommendations</a> were computed this Wed., go check what the algorithm prepared for you. The email notification feature is almost ready, so opt in on the reco page to get your recommended links by email once it ships. I know the mobile version of the reco page is buggy; I'll work on it next week as well.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 24.11 ]]></title>
                    <description><![CDATA[ Data News #24.11 — OpenAI CTO, Musk vs. LeCun, Grok open-source?, French report about AI ambition, RAG is hype, and data engineering stuff. ]]></description>
                    <link><![CDATA[ /data-news-week-24-11/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 65f40fc05d21e60001b641bb ]]></guid>
                    <pubDate><![CDATA[ 2024-03-15 ]]></pubDate>
                    <content>
                        <![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2024/03/IMG_4072-1-1-1.png" class="kg-image" alt="" loading="lazy" width="1354" height="903" srcset="https://www.blef.fr/content/images/size/w600/2024/03/IMG_4072-1-1-1.png 600w, https://www.blef.fr/content/images/size/w1000/2024/03/IMG_4072-1-1-1.png 1000w, https://www.blef.fr/content/images/2024/03/IMG_4072-1-1-1.png 1354w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Mountains</span></figcaption></figure><p>I hope this e-mail finds you well, wherever you are. I'd like to thank you for the excellent comments you sent me last week after the publication of the first version of the Recommendations. This is just the beginning!</p><p>This week I've added a subscribe button to the <a href="https://www.blef.fr/explorer/reco/">Recommendations</a> page so you can opt in to the weekly recommendation email—every Tuesday. 
You can subscribe starting today on the page and you'll get emails as soon as I've developed the email sending—expected to be out at the end of the month.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2024/03/Screenshot-2024-03-15-at-18.20.13.png" class="kg-image" alt="" loading="lazy" width="2000" height="540" srcset="https://www.blef.fr/content/images/size/w600/2024/03/Screenshot-2024-03-15-at-18.20.13.png 600w, https://www.blef.fr/content/images/size/w1000/2024/03/Screenshot-2024-03-15-at-18.20.13.png 1000w, https://www.blef.fr/content/images/size/w1600/2024/03/Screenshot-2024-03-15-at-18.20.13.png 1600w, https://www.blef.fr/content/images/2024/03/Screenshot-2024-03-15-at-18.20.13.png 2266w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">You can opt-in for the recommendations</span></figcaption></figure><p>Second point: I passed 100 stars on Github for <a href="https://github.com/Bl3f/yato?ref=blef.fr">yato</a>, which is a crazy amount! I'd like to do a bit of user research about yato, so if you're considering using it, please drop me a message.</p><p><em>yato is a small Python library that I've developed; it stands for yet another transformation orchestrator. You give yato a folder of SQL queries and it guesses the DAG and runs the queries in the right order.</em></p><h1 id="ai-news-%F0%9F%A4%96">AI News 🤖</h1><ul><li>Mira Murati <a href="https://www.wsj.com/video/series/joanna-stern-personal-technology/openai-made-me-crazy-videosthen-the-cto-answered-most-of-my-questions/C2188768-D570-4456-8574-9941D4F9D7E2?ref=blef.fr">answers the Wall Street Journal</a> about OpenAI Sora — OpenAI's CTO was <strong>asked a few questions about the underlying technology in Sora</strong>. She revealed a few insights. 
For the moment OpenAI considers Sora a research output that might eventually be released later this year; it required "<em>much much more</em>" compute power than DALL-E to generate a video, and they have a lot of open questions regarding the impact on elections or the film industry. She mainly said that "<em>Sora is a tool to extend creativity</em>". <br><br>Last point: Mira was mocked and criticised online because, as CTO, <strong>she wasn't able to say which public / licensed data Sora was trained on</strong>. When asked if it was YouTube videos, Facebook or Instagram she said "<em>I'm actually not sure about that</em>".<br><br>I personally really recommend this interview, which covers a lot of interesting topics in 10 minutes.</li><li>Elon Musk said out loud that <a href="https://twitter.com/elonmusk/status/1767108624038449405?ref=blef.fr">xAI will open-source Grok this week</a>. It's Friday and it seems they are even later than me when it comes to releasing stuff. Just in time for a reminder that <strong>open-source ≠ open-weights</strong> when it comes to <a href="https://opencoreventures.com/blog/2023-06-27-ai-weights-are-not-open-source/?ref=blef.fr">AI licensing</a>, although differences in weights licensing <a href="https://web.archive.org/web/20230722024435/https://www.alessiofanelli.com/blog/llama2-isnt-open-source">are not as important as they seem</a>.</li><li><a href="https://www.databricks.com/blog/databricks-invests-mistral-ai-and-integrates-mistral-ais-models-databricks-data-intelligence?ref=blef.fr">Databricks invests in Mistral AI</a> — Mistral has successfully positioned itself as the main OpenAI rival by being integrated into all the major data platforms (Azure and Snowflake previously).</li><li>A French commission released a 130-page report titled <strong>"Our AI: our ambition for France"</strong>. 
You can <a href="https://www.gouvernement.fr/actualite/25-recommandations-pour-lia-en-france?ref=blef.fr">download</a> the French version and a 16-page English summary. The report includes 25 recommendations from French-speaking AI leaders (Yann LeCun, Arthur Mensch, etc.).</li><li>Assisted AI wars are around the corner — I only follow the French news, but the government is proudly doubling its budget for "AI defense". From what I know, AI is mainly used as an information companion to find signals in the huge amount of data we generate, creating more efficient agents. <br><br>This is related to Paris testing <a href="https://www.lemonde.fr/en/pixels/article/2024/03/03/paris-olympics-2024-testing-on-algorithmic-video-surveillance-of-the-games-begins_6580505_13.html?ref=blef.fr">automated video surveillance during the Olympics</a>. The technology behind this is <a href="https://wintics.com/en/cityvision/?ref=blef.fr">Cityvision</a>.</li><li>Yann LeCun <a href="https://twitter.com/ylecun/status/1768330052570173471?ref=blef.fr">clashed</a> with Elon Musk on Twitter about the future of AI. <strong>Musk thinks AI will be smarter than any single human next year</strong>, while LeCun said "<em>No</em>", taking as an example the <a href="https://www.theverge.com/2023/8/23/23837598/tesla-elon-musk-self-driving-false-promises-land-of-the-giants?ref=blef.fr">false self-driving car promise</a>. Moreover, LeCun believes that human information compression capabilities are still so far ahead of AI that AGI is not even close.</li><li><a href="https://twitter.com/cognition_labs/status/1767548763134964000?ref=blef.fr">Cognition AI introduced Devin</a> — Devin is the first AI software engineer. Devin can, unassisted, do software engineering tasks like fixing Github issues (13% success rate, where the previous best was ~5%), apply to jobs on Upwork, and train and fine-tune its own models. 
I'm speechless.</li><li><a href="https://engineering.fb.com/2024/03/12/data-center-engineering/building-metas-genai-infrastructure/?ref=blef.fr">Building Meta’s GenAI infrastructure</a> — 2x 24k-GPU clusters, and it's growing. I like how Meta tries to do stuff out in the open (or at least with some kind of transparency), but the number of GPUs is just disconcerting.</li><li><strong>RAG is the new trend</strong> —&nbsp;RAG means retrieval-augmented generation; the term was coined in 2020 (<a href="https://blogs.nvidia.com/blog/what-is-retrieval-augmented-generation/?ref=blef.fr">see more</a>) and it lets you ground AI models with facts fetched from external sources.<ul><li><a href="https://blog.streamlit.io/build-a-real-time-rag-chatbot-google-drive-sharepoint/?ref=blef.fr">A real-time RAG chatbot built on Sharepoint and Google Drive</a></li><li><a href="https://docs.superduperdb.com/blog/rag-system-on-duckdb-using-jinaai-and-superduperdb/?ref=blef.fr">RAG on-top of DuckDB</a></li><li><a href="https://decodingml.substack.com/p/a-real-time-retrieval-system-for?ref=blef.fr">RAG on LinkedIn data</a></li></ul></li><ul><li>There is an exponential number of technologies in the RAG space, especially vector databases, so many that I don't even mention them, but obviously the posts all say "<em>ours is the best</em>".</li></ul><li><a href="https://blog.research.google/2024/03/croissant-metadata-format-for-ml-ready.html?ref=blef.fr">Croissant: a metadata format for ML-ready datasets</a> —&nbsp;To move forward faster in AI and model building we need an interoperable and easy-to-use metadata format for ML datasets. This is Croissant. Starting today it is supported by 3 major platforms: Kaggle, HuggingFace and OpenML. 
Croissant is under mlcommons and you can have a look at the <a href="https://mlcommons.github.io/croissant/docs/croissant-spec.html?ref=blef.fr">specification</a>.</li><li><a href="https://mlcontests.com/state-of-competitive-machine-learning-2023/?ref=blef.fr">The State of competitive machine learning</a> — a study about ML competition platforms. It gives a lot of insight into the market.</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/photo-1587912001191-0cd4f14fd89e?q=80&amp;w=1000&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D" class="kg-image" alt="brown bread on white table" loading="lazy" width="1000" height="667" srcset="https://images.unsplash.com/photo-1587912001191-0cd4f14fd89e?q=80&amp;w=600&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D 600w, https://images.unsplash.com/photo-1587912001191-0cd4f14fd89e?q=80&amp;w=1000&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D 1000w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">A new standard full of butter (</span><a href="https://unsplash.com/photos/brown-bread-on-white-table-dCKQMAzy8II?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li>Since the end of Feb. 
BigQuery supports <a href="https://cloud.google.com/bigquery/docs/using-dml-with-partitioned-tables?ref=blef.fr#using_dml_delete_to_delete_partitions">DELETE</a> to delete partitions in a SQL query.</li><li><a href="https://www.junaideffendi.com/p/how-i-saved-70k-a-month-in-bigquery?ref=blef.fr">How I saved $70k a month in BigQuery</a> — Junaid shared a few techniques he used to save a bunch of dollars on the BigQuery bill. Nothing new, more common sense, but it always works. In a nutshell: smarter schedules, table optimisations, incremental models, avoiding views and precomputing.</li><li><a href="https://www.linkedin.com/feed/update/urn:li:activity:7172876370606243840/?updateEntityUrn=urn%3Ali%3Afs_updateV2%3A%28urn%3Ali%3Aactivity%3A7172876370606243840%2CFEED_DETAIL%2CEMPTY%2CDEFAULT%2Cfalse%29&ref=blef.fr">Attributing Snowflake cost to whom it belongs</a> — Fernando gives ideas about metadata management to better attribute Snowflake costs. Whether it's a dbt model, a Tableau dashboard or a Metabase question, it has to be tracked to understand what drives your bills.</li><li><a href="https://blog.det.life/i-spent-3-hours-figuring-out-how-bigquery-inserts-deletes-and-updates-data-internally-0b04d11a274a?ref=blef.fr">Understand how BigQuery inserts, deletes and updates</a> — Once again Vu took the time to deep dive into BigQuery internals, this time to explain how data management is done.</li><li><a href="https://pandera--1373.org.readthedocs.build/en/1373/polars.html?ref=blef.fr#polars">Pandera, a data validation library for dataframes, now supports Polars</a>.</li><li><a href="https://medium.com/@PyDataParis/announcing-pydata-paris-2024-700220accc72?ref=blef.fr">PyData is coming to Paris in 2024</a> —&nbsp;The CFP is open and I submitted a talk there about yato.</li><li><a href="https://medium.pimpaudben.fr/airflow-kestra-a-simple-benchmark-ffc5a533aa85?ref=blef.fr">A comparison between Kestra and Airflow</a> —&nbsp;Benoit (who works at Kestra) did a great 
comparison between the 2 tools, comparing the syntax to write DAGs and the performance in terms of scheduling capacity—tasks per second. Obviously Benoit prefers Kestra, at the expense of writing YAML and running a Java application.</li><li>New Apache Arrow engines — Arrow has become one of the most used libraries when it comes to building in-memory engines, doing a lot of the heavy lifting for data operations.<ul><li><a href="https://arrow.apache.org/blog/2024/03/06/comet-donation/?ref=blef.fr">Apache Arrow DataFusion Comet</a> — a native Spark SQL accelerator; the idea is to improve Spark performance by replacing the Spark executor, delegating execution to Comet. On the matter there is also <a href="https://gluten.apache.org/?ref=blef.fr">Apache Gluten</a>, a plugin aiming to double SparkSQL performance.</li><li>Arroyo, a stream-processing platform, <a href="https://www.arroyo.dev/blog/why-arrow-and-datafusion?ref=blef.fr">rebuilt their engine using DataFusion</a>.</li></ul></li><li>Postgres creator launches <a href="https://www.dbos.dev/blog/announcing-dbos?ref=blef.fr">DBOS, a transactional serverless computing platform</a> — Mike sees DBOS as a cloud-native OS that runs on top of the database in order to rethink application development and deployment.</li><li><a href="https://blog.allegro.tech/2024/03/kafka-performance-analysis.html?ref=blef.fr">Unlocking Kafka's potential: tackling tail latency with eBPF</a>.</li></ul><h3 id="forward-thinking">Forward thinking</h3><ul><li><a href="https://docs.malloydata.dev/blog/2024-02-29-hierarchical-viz/?ref=blef.fr#dataviz-is-hierarchical">Dataviz is hierarchical</a> — Malloy, once again, provides an excellent article about a new way to see data visualisations. It's inspirational.</li><li><a href="https://dlthub.com/docs/blog/code-vs-buy?ref=blef.fr">Coding data pipelines is faster than renting connector catalogs</a> — This is something I've always believed.
The devil is in the details and when it comes to data pipelines there are a lot of details, which often keeps us from buying and leads us to build (or code). Matthaus gives the dlt vision: creating the foundation for developers to create sources in a wink, resulting in a large ecosystem of easily maintainable API datasets.</li><li><a href="https://motherduck.com/blog/differential-storage-building-block-for-data-warehouse/?ref=blef.fr">Differential storage, a building block for a DuckDB-based data warehouse</a> — It's MotherDuck's vision: creating the next data warehouse on top of DuckDB, leveraging DuckDB's capacity to morph between a single machine and a production ecosystem. In the article Joseph explains how MotherDuck extended DuckDB to add time travel and zero-copy snapshots, opening the door for more collaboration and concurrency.</li></ul><hr><p>See you next week ❤️ — recommendations for this week have been computed, <a href="https://www.blef.fr/explorer/reco/">go check them out</a>.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Recommendations ]]></title>
                    <description><![CDATA[ Data News #24.10 — A special announcement this week I introduce you to a new Data News feature: the recommendations. ]]></description>
                    <link><![CDATA[ /introduce-recommendations/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 65eadf27e214bb000152b838 ]]></guid>
                    <pubDate><![CDATA[ 2024-03-08 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/photo-1504807959081-3dafd3871909?q=80&amp;w=1000&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D" class="kg-image" alt="two person holding map and clear compass" loading="lazy" width="1000" height="662" srcset="https://images.unsplash.com/photo-1504807959081-3dafd3871909?q=80&amp;w=600&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D 600w, https://images.unsplash.com/photo-1504807959081-3dafd3871909?q=80&amp;w=1000&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D 1000w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">We all need recommendations (</span><a href="https://unsplash.com/photos/two-person-holding-map-and-clear-compass-ioYwosPYC0U?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p>When I started writing this newsletter nearly three years ago, I never imagined that the words I write on my keyboard would take such an important place in my life. All the interactions I have with you, whether online or offline, are always amazing and give me wings.</p><p><strong>Today I want to introduce a new feature in the Data News galaxy.</strong></p><p>I don't talk much about my freelance life in Data News because sometimes I think that's not the contract we have together. The Data News promise is to give you, every week, the links I've hand-picked with my spicy opinion about them. Since the beginning of the year the balance between freelancing and content has gone from 80/20—80% client stuff and 20% content—to 30/70.
This is mainly due to the fact that I've done my annual University lectures and spoken at <a href="https://www.blef.fr/talks/">7 events</a> since the beginning of the year.</p><p>Let's be honest, I'm also a bit stupid. At every event I speak at, I decide to do a new presentation. That's great because it helps me innovate and pushes me to new horizons every time, but it takes time to assimilate chunks of work in order to produce creative keynotes.</p><p>All of this is made possible thanks to my Data News curation. Thanks to the time I spend reading content, forging ideas and chatting with all of you, I get inspired and my crazy brain invents things. And I want you to have the same superpowers as me. This is what motivates me.</p><p><em>PS: Fast News ⚡️ at the very end if you want to skip this story. Which will make me sad, but I understand.</em></p><p></p><h1 id="there-is-a-problem">There is a problem</h1><p>Data News has grown so much since the beginning: I currently have 4500 members on blef.fr. I have sent 132 Data News editions, which represents 2500 links (~20 links per edition).</p><p><strong>But there's a big problem: all my old Data News is dead content.</strong></p><p>I mean, there is a big difference between a podcast, for instance, and news blogging like I'm doing. When you subscribe to a new podcast you often scroll through the creator's past episodes.
When someone subscribes to the Data News, they rarely go through my old editions.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2024/03/Screenshot-2024-03-08-at-17.14.09.png" class="kg-image" alt="" loading="lazy" width="2000" height="1029" srcset="https://www.blef.fr/content/images/size/w600/2024/03/Screenshot-2024-03-08-at-17.14.09.png 600w, https://www.blef.fr/content/images/size/w1000/2024/03/Screenshot-2024-03-08-at-17.14.09.png 1000w, https://www.blef.fr/content/images/size/w1600/2024/03/Screenshot-2024-03-08-at-17.14.09.png 1600w, https://www.blef.fr/content/images/2024/03/Screenshot-2024-03-08-at-17.14.09.png 2000w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">A few numbers</span></figcaption></figure><p>All these 2500 links are links that I've liked and commented on. Looking at them, most are timeless and I think they can still bring a lot of value to all of you.</p><p>That's why I want to re-activate my old content.</p><p></p><h1 id="the-explorer">The Explorer</h1><p>A year and a half ago I developed <a href="https://www.blef.fr/explorer/">the Explorer</a>. The Explorer is a search bar that lets you search over all the links that I have shared in the 132 Data News editions.</p><p>It was my first step in this journey to make my handpicked links browsable and usable by everyone.
While I'm not good at marketing it, a few of you use it every month, but I think it could be used way more.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/assets/img/overview.png?v=aaf514f082" class="kg-image" alt="" loading="lazy" width="3298" height="2452"><figcaption><span style="white-space: pre-wrap;">The Explorer (https://blef.fr/explorer)</span></figcaption></figure><p>But I want to go further.</p><h1 id="introducing-the-recommendation">Introducing the Recommendation</h1><p>2500 links is a huge amount and sometimes this is like finding a needle in a haystack. That's why I've developed a new feature: a recommendation module.</p><p><strong>The Data News recommendation will give you every week a single link that you should have clicked on</strong>.</p><p>For the moment the recommender is based on your click history. In every Data News email I send you I know which links you clicked on, so I'm able to leverage this information to recommend content to you.</p><p>This is just the beginning and for the moment the algorithm is very trivial: a collaborative filtering algorithm that recommends links you did not click on but that have been clicked on by members with the same click behaviour as you.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2024/03/Screenshot-2024-03-08-at-17.54.13.png" class="kg-image" alt="" loading="lazy" width="2000" height="1435" srcset="https://www.blef.fr/content/images/size/w600/2024/03/Screenshot-2024-03-08-at-17.54.13.png 600w, https://www.blef.fr/content/images/size/w1000/2024/03/Screenshot-2024-03-08-at-17.54.13.png 1000w, https://www.blef.fr/content/images/size/w1600/2024/03/Screenshot-2024-03-08-at-17.54.13.png 1600w, https://www.blef.fr/content/images/2024/03/Screenshot-2024-03-08-at-17.54.13.png 2000w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Data News
recommendations</span></figcaption></figure><p></p><p>As you can see in the screenshot, in the Recommendation panel you can see the links that have been recommended to you and the links you've clicked on. For me to get your feedback, you have the possibility to like / dislike all the links (whether recommendations or clicked links).</p><div class="kg-card kg-button-card kg-align-center"><a href="https://www.blef.fr/explorer/reco/" class="kg-btn kg-btn-accent">See your recommendation</a></div><p><strong>Christophe, why did you make this? No one asked for it.</strong></p><p>Yes, no one asked for it, but let me expand on the why:</p><ul><li>Frustration — Like I said before, I'm super frustrated that all the content I've referenced is "dead". I'm pretty sure that if I successfully reactivate this content I can generate more traffic on blef.fr, diversify my revenue and bring more knowledge to the data community.</li><li>It's a showcase — It can be an educational project showing others how you can orchestrate and schedule a small-scale AI application.</li><li>It's fun and rewarding — From my side, I like the fact that every week members will have a <em>gift</em> coming from me in the form of this recommendation.</li><li>Why not? — Finally, I don't run any playbook, so why not try stuff?</li></ul><p></p><h1 id="architecture">Architecture</h1><p>As I said, while being a new feature of the blog, this is also an educational project I can use to showcase technologies.
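The collaborative filtering described above can be sketched in a few lines. This is a toy illustration, not the actual blef.fr code: member names, link ids and the click matrix are made up, and the similarity measure is a simple cosine over click sets.

```python
# Toy user-based collaborative filtering over click history.
from math import sqrt

# Hypothetical click history: member -> set of links clicked.
clicks = {
    "alice": {"duckdb-post", "dbt-post", "arrow-post"},
    "bob":   {"duckdb-post", "dbt-post", "kafka-post"},
    "carol": {"kafka-post"},
}

def similarity(a, b):
    """Cosine similarity between two sets of clicked links."""
    return len(a & b) / (sqrt(len(a)) * sqrt(len(b))) if a and b else 0.0

def recommend(member):
    """Score links the member has not clicked, weighted by how similar
    the members who did click them are to this member."""
    mine = clicks[member]
    scores = {}
    for other, theirs in clicks.items():
        if other == member:
            continue
        sim = similarity(mine, theirs)
        for link in theirs - mine:
            scores[link] = scores.get(link, 0.0) + sim
    return max(scores, key=scores.get) if scores else None

print(recommend("alice"))  # → kafka-post (bob is most similar to alice)
```

The real recommender works the same way in spirit: members with overlapping click behaviour vote for the links you haven't seen yet.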
See below the global architecture I've used to make this link recommender work.</p><figure class="kg-card kg-image-card"><img src="https://lh7-us.googleusercontent.com/RrWAdbpEO2ATYy20nD7VdPGpw38syPFTsSvuT86RwRxEc6taoQS9FkGFTIQQMwjP094baNqBrtEdkJA-_Eyp4uyvNzrMj0EKFWj3kqbKTBXmMTbkNP1LHvPgH8Z76bFT4PmBQpCz_tt21RJKk-ih1gQeGg=nw" class="kg-image" alt="" loading="lazy" width="1600" height="1225"></figure><ul><li><strong>Ghost</strong> — My blog is hosted on <a href="https://ghost.org/?ref=blef.fr">ghost.org</a>. I really like Ghost because it's open-source (but I use the paid hosted version) and gives me the possibility to extend the blog with custom code. The main part of the blog is just a bunch of <a href="https://handlebarsjs.com/?ref=blef.fr">Handlebars</a> templates connected to the Ghost Content API. I extended the website by embedding a React application that powers the custom frontend of the Explorer and Recommendation.</li><li><strong>blefapi</strong> — In order to make the React apps work I need a custom backend, which I've developed with Django. This backend connects to Ghost using some kind of SSO (with JWT), which means I don't need to create another login page: once you're a member you can use all my extended features. The Django app uses Postgres as a database and a bucket to host a few static files. Everything is hosted on Scaleway (a French cloud company).</li><li><strong>CI/CD</strong> — Everything is just deployed from GitHub Actions; whether it's the React application or the Django API, I just need to push and it will deploy a new version.</li><li><strong>newsletter-reco</strong> — This is where the recommendation magic happens. This is a small pipeline that gets the activity data from the blog API, does a bit of feature engineering, recommends an article for every member and then publishes the recommendations to the blefapi.
Under the hood (see below) it uses dlt, DuckDB / pandas and GitHub Actions.</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://lh7-us.googleusercontent.com/-pZYzro9fzkaMFuYux5JRIh2TJ7BJ9-MWTFy_zeH2ipqRZ5EdA4puFnKjENIUGoSAk9_fTht4LhUQTT66k6U5IgImnf-4X1v-zaZQ-pKAqigJoTlR4jlJsZmIHk_RzQo1KRtidA-HPWBY8GWrs6fvyjiWg=nw" class="kg-image" alt="" loading="lazy" width="1600" height="1136"><figcaption><span style="white-space: pre-wrap;">How the recommendation works</span></figcaption></figure><p>The recommendation pipeline is fairly simple: it uses <a href="https://dlthub.com/?ref=blef.fr">dlt</a> to do the <a href="https://github.com/Bl3f/newsletter-reco/blob/main/ghost.py?ref=blef.fr">extract-load</a> from the Ghost API, dlt loads the data into a <a href="https://github.com/Bl3f/newsletter-reco/blob/main/pipeline.py?ref=blef.fr#L25-L34">DuckDB database</a>, then this data is transformed using <a href="https://github.com/Bl3f/newsletter-reco/tree/main/transform/sql?ref=blef.fr">SQL / Python transformations</a> <a href="https://github.com/Bl3f/newsletter-reco/blob/main/pipeline.py?ref=blef.fr#L13-L21">orchestrated</a> by <a href="https://github.com/Bl3f/newsletter-reco/blob/main/pipeline.py?ref=blef.fr#L40">yato</a>. To publish the recommendations to the API it uses DuckDB's ATTACH capability, <a href="https://github.com/Bl3f/newsletter-reco/blob/main/transform/sql/export/insert_ghostapi_recommendation.sql?ref=blef.fr">directly inserting records</a> into the Postgres database (it's a hack, but it works).
All of this runs in GitHub Actions every week to produce a new recommendation for everyone.</p><div class="kg-card kg-button-card kg-align-center"><a href="https://github.com/Bl3f/newsletter-reco?ref=blef.fr" class="kg-btn kg-btn-accent">Browse the recommendation code on Github</a></div><p></p><h1 id="next-steps">Next steps</h1><p>I'll keep working incrementally on the recommendation in the coming weeks. I'm open to all suggestions and I'd love to get your feedback on this; you can even open Pull Requests on the code if you feel like it. Here is what I plan to add in the following weeks:</p><ul><li>Subscribe to an additional email to receive the recommendation on Tuesday (if you really want to receive recommendations by email, reply to this email and I'll opt you in directly).</li><li>Use GenAI to summarise the links database to give you a summary of each link that has been recommended to you—maybe saving you one click</li><li>Improve the recommendation algorithm by using an item-based approach and embeddings</li><li>Take into account the likes / dislikes from the Timeline</li><li>Develop a public BI-as-code dashboard showing metrics about the content and showcasing Evidence and Observable</li></ul><p></p><h1 id="bonus-yato">Bonus: yato</h1><p>While working on the recommender I've developed something else called <a href="https://github.com/Bl3f/yato?ref=blef.fr"><strong>yato</strong></a>. yato stands for yet another transformation orchestrator and is the smallest DuckDB SQL orchestrator on Earth.</p><p>The idea behind yato is to provide a Python library (<code>pip install yato-lib</code>) that you can use either from Python code or via the CLI, and that runs all the transformations in a given folder against a DuckDB database.</p><p>yato uses SQLGlot to guess the underlying DAG and run the transformations in the right order.
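To give an intuition of the "guess the DAG" part: yato does it properly with SQLGlot's parser, but a toy version of the idea fits in a few lines with the standard library alone. The SQL snippets and model names below are invented, and the regex only handles simple FROM/JOIN clauses; it is a sketch of the technique, not yato's implementation.

```python
import re
from graphlib import TopologicalSorter

# One transformation per "file": the key is the table it creates,
# the SQL body references upstream tables (a toy stand-in for a
# folder of .sql files).
transformations = {
    "staging_clicks": "SELECT * FROM raw_clicks",
    "member_features": "SELECT member, COUNT(*) AS n FROM staging_clicks GROUP BY member",
    "recommendations": "SELECT * FROM member_features JOIN staging_clicks USING (member)",
}

def referenced_tables(sql):
    """Naively extract table names that follow FROM or JOIN keywords."""
    return set(re.findall(r"(?:FROM|JOIN)\s+([a-zA-Z_]\w*)", sql, re.IGNORECASE))

# Build the dependency graph, keeping only edges between our own models
# (external tables like raw_clicks have no transformation to run).
graph = {
    name: referenced_tables(sql) & transformations.keys()
    for name, sql in transformations.items()
}

# Topological order: every model runs after its dependencies.
order = list(TopologicalSorter(graph).static_order())
print(order)
```

Swap the regex for SQLGlot's parser and run each statement against DuckDB in that order, and you have the skeleton of a tiny SQL orchestrator.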
For the moment yato is tied to DuckDB. Philosophically, yato has been developed like black (the formatter): you have just one required parameter, a transformation folder, and then you can do <code>yato run</code>.</p><p>I don't think yato will ever replace dbt Core, SQLMesh or lea; yato is just a lighter alternative that you can use with your messy SQL folder.</p><div class="kg-card kg-button-card kg-align-center"><a href="https://github.com/Bl3f/yato/?ref=blef.fr" class="kg-btn kg-btn-accent">See yato on Github</a></div><hr><p>It was a special announcement for me, I hope you'll understand and receive this news with as much excitement as I have.</p><p>And because I still want you to get a bit of news, below is a very fast news section.</p><h1 id="very-fast-news-%E2%9A%A1%EF%B8%8F">Very Fast News ⚡️</h1><ul><li><a href="https://www.nytimes.com/2024/03/01/technology/elon-musk-openai-sam-altman-lawsuit.html?ref=blef.fr">Elon Musk decided to sue OpenAI</a> for violating company principles by putting profits and commercial interest first. Funny to see this from Elon Musk the philanthropist.</li><li>Google <em>is slowly losing</em> the race for (Gen)AI, so people are <a href="https://www.businessinsider.com/calls-for-google-ceo-sundar-pichai-alphabet-step-down-ai-2024-3?r=US&IR=T&ref=blef.fr">starting to call for Sundar Pichai to step down</a>.</li><li><a href="https://www.anthropic.com/news/claude-3-family?ref=blef.fr">Anthropic released Claude 3</a> — that seems to achieve great results in benchmarks with "sophisticated vision capabilities".</li><li><a href="https://huggingface.co/enterprise?ref=blef.fr">HuggingFace released Enterprise Hub</a> — A private, dedicated space to use HF features.</li><li><a href="https://www.youtube.com/watch?v=5t1vTLU7s40&ref=blef.fr">Yann LeCun went on the Lex Fridman podcast</a> — He chatted for almost 3h.
I have not listened to the podcast yet but I guess he chatted about the concept of intelligence like he used to.</li><li>Sicara released a <a href="https://www.sicara.fr/en/tech-radar?ref=blef.fr">tech radar about AI technologies</a>. It includes 4 pillars: algorithms, data, methods and industrialisation. It's funny to see Parquet as a technology still to adopt.</li><li><a href="https://hubertdulay.substack.com/p/easy-introduction-to-real-time-rag?r=46sqk&utm_campaign=post&utm_medium=web&triedRedirect=true&ref=blef.fr">Easy introduction to real-time RAG</a> — Showcases how you can include your LangChain / OpenAI pipeline in a classic Kafka / Pinot infrastructure.</li><li>ClickHouse <a href="https://clickhouse.com/blog/chdb-joins-clickhouse-family?ref=blef.fr">acquired chDB</a>, a DuckDB alternative, and <a href="https://clickhouse.com/blog/clickhouse-1-trillion-row-challenge?ref=blef.fr" rel="noreferrer">achieved the 1 trillion row challenge</a> (with classic ClickHouse) in under 3 minutes for $0.56.</li><li><a href="https://www.linkedin.com/feed/update/urn:li:activity:7171216203414167553/?updateEntityUrn=urn%3Ali%3Afs_updateV2%3A%28urn%3Ali%3Aactivity%3A7171216203414167553%2CFEED_DETAIL%2CEMPTY%2CDEFAULT%2Cfalse%29&ref=blef.fr">Snowflake now supports trailing commas</a> and <a href="https://investors.snowflake.com/news/news-details/2024/Snowflake-Partners-with-Mistral-AI-to-Bring-Industry-Leading-Language-Models-to-Enterprises-Through-Snowflake-Cortex/default.aspx?ref=blef.fr">partners with Mistral AI</a> to bring models to the warehouse; we also learn that Snowflake Ventures invested in Mistral AI. Long gone are the days when Mistral was French.</li><li><a href="https://www.getorchestra.io/blog/introducing-orchestra-rapidly-build-and-monitor-data-and-ai-products?ref=blef.fr">Orchestra released a free-tier platform</a> to rapidly build and monitor data products.
Orchestra is a graphical solution to define DAGs and orchestrate different parts of the Modern Data Stack.</li><li>Use <a href="https://ibis-project.org/posts/into-snowflake/?ref=blef.fr">Ibis to load data from other databases</a> to Snowflake. This is similar to the ATTACH I did in my recommender with DuckDB.</li><li><a href="https://substack.timodechau.com/p/how-to-measure-a-data-platform?ref=blef.fr">How to measure a data platform</a> — A great article discussing the metrics tree we need to put in place as a data team. I really like it.</li></ul><p></p><hr><p>See you next week ❤️ — and please give me feedback whether you like it or not.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 24.09 ]]></title>
                    <description><![CDATA[ Data News #24.09 — Mistral AI, Klarna AI customer support agent, extract and load still unsolved ]]></description>
                    <link><![CDATA[ /data-news-week-24-09/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 65e2db310eef200001574974 ]]></guid>
                    <pubDate><![CDATA[ 2024-03-02 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/photo-1505672678657-cc7037095e60?q=80&amp;w=1000&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D" class="kg-image" alt="trees with wind photo" loading="lazy" width="1000" height="667" srcset="https://images.unsplash.com/photo-1505672678657-cc7037095e60?q=80&amp;w=600&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D 600w, https://images.unsplash.com/photo-1505672678657-cc7037095e60?q=80&amp;w=1000&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D 1000w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Mistral (</span><a href="https://unsplash.com/photos/trees-with-wind-photo-WtwSsqwYlA0?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p>Hello all, this is the Data News. This week's edition might be smaller than usual in terms of comments as I'm working on a Data News related project that takes a bit of my time and will probably lead to a series of articles.</p><p>Before I forget, I appeared on <a href="https://open.spotify.com/episode/4Rs4Xqovqs1mrI18FZXHZi?si=u6q7NtovTX6sS7uv0FzLeQ&nd=1&dlsi=f2406027f65043da&ref=blef.fr">The Joe Reis Show</a>; Joe and I chatted about teaching data engineering, why it is hard, and how generative AI will change education forever. This is a 1h podcast, I hope you will enjoy listening to it.</p><p>Final reminder: next week there is <a href="https://conference-mlops.com/?ref=blef.fr">La Conférence MLOps</a>, which will take place in Paris on March 7th. If you want to register I still have a 40% promocode: <strong>mlops-blef-40</strong>.
I'll give a talk—in French—about <em>how to put machine learning in production at a small scale</em>, a topic related to the Data News project 😬.</p><p></p><h1 id="ai-news-%F0%9F%A4%96">AI News 🤖</h1><ul><li>Mistral AI announcements<ul><li><a href="https://mistral.ai/news/mistral-large/?ref=blef.fr">Mistral Large</a>, their new <em>flagship</em> model, which outperforms competing models except GPT-4. At the same time Microsoft closed a <a href="https://azure.microsoft.com/en-us/blog/microsoft-and-mistral-ai-announce-new-partnership-to-accelerate-ai-innovation-and-introduce-mistral-large-first-on-azure/?ref=blef.fr">partnership</a> with Mistral to make Large available on Azure, as their <em>first distribution partner</em>. It has led to a lot of discussion in French politics about Mistral AI being more American than French. With the partnership Microsoft entered the Series A with a <a href="https://techcrunch.com/2024/02/27/microsoft-made-a-16-million-investment-in-mistral-ai/?ref=blef.fr">€15m addition</a>, joining a16z.</li><li>They also released a smaller model called Mistral Small.</li><li><a href="https://mistral.ai/news/le-chat-mistral/?ref=blef.fr">Le Chat</a>, the conversational interface to interact with Mistral models.</li><li>Final comment: with these 2 announcements Mistral left the open side to go <a href="https://sifted.eu/articles/mistral-microsoft-deal-controversy?ref=blef.fr">commercial</a> / <a href="https://twitter.com/KeldonB/status/1762183708738523379?ref=blef.fr">closed</a>. It led to conversations where people felt <a href="https://old.reddit.com/r/LocalLLaMA/comments/1b0o41v/top_10_betrayals_in_anime_history/?ref=blef.fr">betrayed</a> by Mistral, which built their differentiator—or should I say marketing—on top of open-source / open-weight models.
<a href="https://www.youtube.com/watch?v=_YqzuE-5RE8&ref=blef.fr">Mistral perdant</a>.</li></ul></li><li><a href="https://github.blog/2024-02-27-github-copilot-enterprise-is-now-generally-available/?ref=blef.fr">GitHub Copilot Enterprise is now generally available</a> —&nbsp;This week I've started to use GitHub Copilot (not the Enterprise version). And let's be honest, it is a productivity boost, especially when you want to write docstrings and comments. Still, there is an annoying interaction in PyCharm where Copilot takes <em>too much space.</em> Copilot Enterprise mainly comes with 3 features: understanding your whole org codebase, a chat to ask questions about the codebase, and pull request summaries.</li><li><a href="https://www.sievedata.com/blog/fast-active-speaker-detection?ref=blef.fr">Fast, efficient active speaker detection on videos</a> — This is a great introduction to active speaker detection: being able to detect speakers' faces in a video and whether they are actually speaking or not.</li><li><a href="https://www.klarna.com/international/press/klarna-ai-assistant-handles-two-thirds-of-customer-service-chats-in-its-first-month/?ref=blef.fr">Klarna's AI customer support agent does the equivalent of 700 agents</a> — Klarna developed an AI agent that interacts automatically with customers, driving profit. It has to be put in <a href="https://twitter.com/GergelyOrosz/status/1762755589527015537?ref=blef.fr">context</a>.</li><li><a href="https://ibis-project.org/posts/duckdb-for-rag/?ref=blef.fr">Using DuckDB + Ibis for RAG</a> — A handy code snippet explaining why DuckDB is a good solution bringing the best of both worlds when it comes to RAG.</li></ul><p></p><h1 id="extract-and-load-still-unsolved-%F0%9F%A4%AD">Extract and load, still unsolved 🤭</h1><p>I started writing data pipelines in 2014 and the movement from sources to destinations has always been one of the most discussed topics in my data engineering spaces.
Personally I'm the kind of guy who likes to build it custom because I think an out-of-the-box solution does not exist. In the end you finish with a composable solution mixing 2 or 3 technologies to extract and load your data into your central storage, ready for transformations.</p><p>In 2024 we have more tools than ever to move data from sources to destinations. But the field has taken a new direction.</p><p>Until now, solutions were mainly full platforms (often in the cloud) with the promise to do everything, in search of rebundling the data platform (cf. <a href="https://web.archive.org/web/20230202214350/https://blog.fal.ai/the-unbundling-of-airflow-2/">The unbundling of Airflow</a>). Recently, it has reached new heights: <strong>what if the extract and load is just a small library layer that integrates with whatever you're doing</strong>—for people reading me carefully this is what I was calling for in <a href="https://www.adventofdata.com/using-airflow-the-wrong-way/?ref=blef.fr">using Airflow the wrong way</a>, but the fun way.</p><p>Enter the new kids on the block:</p><ul><li><a href="https://dlthub.com/?ref=blef.fr">dlt</a> — it stands for <em>data load tool</em>, it's a Python <em>library</em> installable with pip. It provides a framework to do the extract and load: you define sources and resources with the specificities of the data you want to load—primary keys, write disposition, incremental mode, etc.—and the library does the heavy lifting accordingly.</li><li><a href="https://airbyte.com/blog/announcing-pyairbyte?ref=blef.fr">PyAirbyte</a> — Airbyte announced their Python <em>library</em> in beta. Currently it supports around 250 sources, which is a subset of all Airbyte sources (only the ones written in Python), and it seems it does not support connecting to classic databases. They call a destination a Cache, which is a terrible name.
Even if the library is a great idea, I feel it's sad that the interoperability with Airbyte is not 100%.<br><br>Adrian from dlt wrote a <a href="https://dlthub.com/docs/blog/what-is-pyairbyte?ref=blef.fr">small post about PyAirbyte</a>.</li><li><a href="https://github.com/cloudquery/cloudquery?ref=blef.fr">CloudQuery</a> — Written in Go, with YAML-driven configuration to move data.</li><li><a href="https://github.com/bruin-data/ingestr?ref=blef.fr">ingestr</a> — ingestr is a CLI tool to copy data between any databases with a single command, seamlessly. It's built on top of dlt.</li><li><a href="https://github.com/slingdata-io/sling-cli?ref=blef.fr">Sling</a> — Sling is a CLI tool that extracts data from a source storage/database and loads it into a target storage/database. Written in Go.</li><li>Let's not forget <a href="https://github.com/meltano/meltano?ref=blef.fr">Meltano</a>.</li></ul><p>We see a pattern here: when we talk about extract and load there are 2 kinds of sources, databases and APIs, and being able to do both correctly is the key.</p><p>On the other side of the movement there is a new open-source reverse-ETL technology called <a href="https://github.com/Multiwoven/multiwoven?ref=blef.fr">Multiwoven/multiwoven</a>. This is built in Ruby (haha).
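To make the "write disposition" idea from the dlt bullet above concrete: a "merge" disposition is conceptually an upsert keyed on the primary key. This is a stdlib-only sketch of that behaviour with invented rows, not dlt's actual implementation (dlt does this inside the destination database).

```python
def merge(existing, incoming, primary_key):
    """Upsert incoming rows into existing rows: a row replaces any
    existing row sharing its primary key, otherwise it is appended.
    This is the contract a 'merge' write disposition promises."""
    by_key = {row[primary_key]: row for row in existing}
    for row in incoming:
        by_key[row[primary_key]] = row
    return list(by_key.values())

# Hypothetical destination table and a new extracted batch.
destination = [{"id": 1, "status": "open"}, {"id": 2, "status": "open"}]
batch = [{"id": 2, "status": "closed"}, {"id": 3, "status": "open"}]

print(merge(destination, batch, "id"))
# id 2 is updated in place, id 3 is appended, id 1 is untouched
```

"Replace" and "append" dispositions are the two degenerate cases: drop everything and keep only the batch, or concatenate blindly.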
At the moment it can sync to Facebook, Salesforce and Slack.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/photo-1590329431219-34a09cabf8b2?q=80&amp;w=1000&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D" class="kg-image" alt="green trees and plants under blue sky and white clouds during daytime" loading="lazy" width="1000" height="667" srcset="https://images.unsplash.com/photo-1590329431219-34a09cabf8b2?q=80&amp;w=600&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D 600w, https://images.unsplash.com/photo-1590329431219-34a09cabf8b2?q=80&amp;w=1000&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D 1000w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Rare footage of a Roman extract and load pipeline (</span><a href="https://unsplash.com/photos/green-trees-and-plants-under-blue-sky-and-white-clouds-during-daytime-G-jzc1YTk4M?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li><a href="https://more-than-numbers.count.co/p/your-first-90-days-as-a-head-of-data?ref=blef.fr">Your first 90 days as a head of data</a> — Handbook and roadmap with pragmatic insights on what to do in your new journey as head of data.</li><li><a href="https://blog.det.life/career-pathways-of-data-engineers-2bc4465483d0?ref=blef.fr">Career pathways of data engineers</a> — IC, manager, being a data engineer or a data full stack. It covers great topics.</li><li>Google Search <em>filetype:pdf</em> was not working for a moment — The internet panicked and believed Google's downfall was continuing.
But actually <a href="https://twitter.com/searchliaison/status/1762866266585620922?ref=blef.fr">it was a bug</a>.</li><li><a href="https://cloud.google.com/bigquery/docs/working-with-time-series?ref=blef.fr">BigQuery time series data</a> — BigQuery now supports time series analyses.</li><li><a href="https://applytitan.com/?ref=blef.fr">Snowflake access management</a> — I've already shared Teej's work in the past, but now he has launched a company to solve Snowflake access management using code. I bet it's going to become the best solution out there for this issue.</li><li><a href="https://www.rilldata.com/blog/operational-bi-embedded-dashboards-for-clickhouse?ref=blef.fr">Rill dashboards for ClickHouse</a> — Rill now works with ClickHouse.</li><li><a href="https://maxhalford.github.io/blog/fast-poetry-pre-commit-github-actions/?ref=blef.fr">Fast Poetry and pre-commit with GitHub Actions</a> — An efficient and useful GitHub Actions setup to cache Poetry installs in the CI.</li><li><a href="https://www.bigquerycost.com/?ref=blef.fr">BigQuery cost dashboard app</a> — Hashboard developed a dashboard to help you follow your BigQuery costs, and it's free. It's built with Hashboard, which is a BI tool. Even if you don't use it, it gives good ideas about what to track.</li><li><a href="https://doordash.engineering/2024/02/27/introducing-doordashs-in-house-search-engine/?ref=blef.fr">Introducing DoorDash’s in-house search engine</a> — A custom search engine built on top of S3.</li></ul><p></p><h3 id="tech-stuff">Tech stuff</h3><ul><li><a href="https://github.com/adidas/lakehouse-engine?ref=blef.fr">Adidas Lakehouse engine</a>.</li><li><a href="https://luminousmen.com/post/why-apache-spark-rdd-is-immutable/?ref=blef.fr">Why Apache Spark RDD is immutable?</a></li><li><a href="https://medium.com/israeli-tech-radar/high-order-and-partially-applied-functions-in-python-0c9fa0459089?ref=blef.fr">High-order and partially applied functions in Python</a>.</li></ul><hr><p>See you next week ❤️</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 24.08 ]]></title>
                    <description><![CDATA[ Data News #24.08 — Presentation about Engines leading to DuckDB, Gemma and Gemini, Mistral Next, MDS follow-up and more. ]]></description>
                    <link><![CDATA[ /data-news-week-24-08/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 65d86faf89fbd00001d59fcb ]]></guid>
                    <pubDate><![CDATA[ 2024-02-23 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/photo-1472068996216-8c972a0af9bd?q=80&amp;w=1000&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D" class="kg-image" alt="woman sitting on bed with flying books" loading="lazy" width="1000" height="664" srcset="https://images.unsplash.com/photo-1472068996216-8c972a0af9bd?q=80&amp;w=600&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D 600w, https://images.unsplash.com/photo-1472068996216-8c972a0af9bd?q=80&amp;w=1000&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D 1000w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">My ideas these days (</span><a href="https://unsplash.com/photos/woman-sitting-on-bed-with-flying-books-yHG6llFLjS0?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p>Hey, fresh Data News edition. This week I participated in a round table about data and did a cool presentation about <a href="https://docs.google.com/presentation/d/1b-MpgqdNuGvlVqMV0WoOpQp7sSdJ9jgLK2Xhd8nzmNA/edit?usp=sharing&ref=blef.fr">Engines</a>. The idea was to depict the history of engines over the last 40 years and what led to Polars and DuckDB. 
Obviously I forgot a few things and I'll do a more complete v2 soon.</p><p>This is my third presentation about DuckDB in the last 3 months and I think I'll slow down a bit until I get new crazy things to share.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2024/02/Screenshot-2024-02-21-at-14.15.27.png" class="kg-image" alt="" loading="lazy" width="2000" height="1125" srcset="https://www.blef.fr/content/images/size/w600/2024/02/Screenshot-2024-02-21-at-14.15.27.png 600w, https://www.blef.fr/content/images/size/w1000/2024/02/Screenshot-2024-02-21-at-14.15.27.png 1000w, https://www.blef.fr/content/images/size/w1600/2024/02/Screenshot-2024-02-21-at-14.15.27.png 1600w, https://www.blef.fr/content/images/2024/02/Screenshot-2024-02-21-at-14.15.27.png 2158w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Engines evolution (me)</span></figcaption></figure><p>There are 3 points that triggered discussion about the visualisation I made:</p><ul><li>What about Arrow? — Apache Arrow is an awesome library that has powered a lot of innovation in the data space in recent years. But UX is where DuckDB differs from the others: its user experience is insanely magical. So yeah. For sure I'll add Arrow in the v2.</li><li>Spark future — I'm convinced that Apache Spark will have to transform itself if it is not to disappear (disappear in the sense of Hadoop, still present but niche). This is already happening, according to the feedback I've had, but Spark requires more infrastructure and investment, which will continue to drive adoption down, whereas the current trend is towards simplification.</li><li>JVM vs. SQL data engineer — There's a big discussion in the community about what real data engineering is. Is it Java/Scala or Python? Is it DataFrames or SQL? Is it lake or warehouse? 
It's a sterile debate: both are useful and can serve different organisations with different service levels for data users and stakeholders. Still, as you know, I prefer SQL/Python data engineering.</li></ul><p>Small reminder, I'm partnering with&nbsp;<a href="https://conference-mlops.com/?ref=blef.fr">La Conférence MLOps</a>, a half-day conference on the challenges of industrialising AI. It will take place on March 7 in Paris. The list of speakers includes many important figures from the French data ecosystem, and I'm very excited about it. You can get a ticket with a 40% discount with the following promo code: <strong>mlops-blef-40</strong>. We have only a few seats left.</p><p></p><h1 id="ai-news-%F0%9F%A4%96">AI News  🤖</h1><ul><li>Mistral AI will release Mistral Next, a ChatGPT alternative, next week. We don't have a lot of detail because it has not been announced publicly—I got the news in a French political newspaper. Still, you can test mistral-next on <a href="https://chat.lmsys.org/?ref=blef.fr">lmsys</a>. Here is a <a href="https://medium.com/@ingridwickstevens/mistral-next-first-impressions-of-mistrals-latest-stealth-release-73086187a656?ref=blef.fr">first review</a>.</li><li><a href="https://blog.google/technology/developers/gemma-open-models/?ref=blef.fr">Google releases Gemma</a> — Gemma is a family of <em>open models</em>. Available in 2 sizes, 2B and 7B, it seems to have baseline performance comparable to Llama-2.</li><li>The same <a href="https://www.livemint.com/technology/tech-news/us-presidential-candidate-vivek-ramaswamy-slams-google-gemini-globally-embarrassing-rollout-blatantly-racist-11708660194299.html?ref=blef.fr">Google got a backlash</a> after the Gemini image generation rollout — Conservative people on social networks were upset because Gemini wasn't capable of generating images of white people. 
Google rolled back Gemini pending further improvements.</li><li><a href="https://artificialanalysis.ai/models?ref=blef.fr">Models comparison across key metrics</a> — I found it via <a href="https://www.linkedin.com/feed/update/urn:li:activity:7166149579208441858/?updateEntityUrn=urn%3Ali%3Afs_updateV2%3A%28urn%3Ali%3Aactivity%3A7166149579208441858%2CFEED_DETAIL%2CEMPTY%2CDEFAULT%2Cfalse%29&ref=blef.fr">Guido on LinkedIn</a>; it shows a lot of cool metrics, like the price per token, the speed or the model quality.</li></ul><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li><a href="https://www.youtube.com/watch?v=eCI1wtLw4Fo&ref=blef.fr">Is the modern data stack dead?</a> — This is a follow-up podcast of Tristan Handy with Matt Turck—famous VC guy producing the <a href="https://mad.firstmark.com/?ref=blef.fr">MAD landscape</a>—following last week's post about the MDS. In this 40-minute podcast they chat in more detail about the dynamics behind the end of the MDS hype, AI implications and the future of analytics engineering work.</li><li><a href="https://www.blef.fr/modern-data-stack-disappearing/">Is the modern data stack disappearing?</a> —&nbsp;An article I wrote 4 days ago as an answer to the trend. Pragmatic and easy to read. Essentially I analyse why the semantics of "modern" is an issue.</li><li><a href="https://www.youtube.com/watch?v=cyZfpXxXojE&ref=blef.fr">State of the Duck</a> — The introductory keynote of the DuckCon that gives an overview of the current ecosystem and what's to come.</li><li><a href="https://tabular.io/blog/pyiceberg-0-6-0-write-support/?ref=blef.fr">PyIceberg 0.6.0: Write support</a> — Yeah, finally I'll be able to play a bit more with Iceberg. Still, you need a catalog to make it work.</li><li><a href="https://marcogorelli.github.io/polars-plugins-tutorial/?ref=blef.fr">How you can write a Polars plugin</a> — A dedicated website that explains how to write Polars plugins to extend the library's capabilities. 
In order to do it you'll have to write Rust and Python code. This is a good way to enter the Rust world, I guess.</li><li><a href="https://dataengineeringcentral.substack.com/p/unit-testing-for-data-engineers-43b?ref=blef.fr">Unit testing for data engineers</a> — Daniel describes what you need to know as a data engineer to write tests. He mainly covers BDD (behavior-driven development) as opposed to TDD (test-driven development).</li><li><a href="https://blog.det.life/i-spent-another-6-hours-understanding-the-design-principles-of-snowflake-heres-what-i-found-dea9fd74ae96?ref=blef.fr">Understand the design principles of Snowflake</a> — Someone took a few hours to understand Snowflake internals and this is a great wrap-up.</li><li><a href="https://engineering.fb.com/2024/02/20/developer-tools/velox-apache-arrow-15-composable-data-management/?ref=blef.fr">Aligning Velox and Apache Arrow</a> — Goes deeper into memory management and how you can create open standards across the different libraries.</li><li><a href="https://engineering.grab.com/enabling-near-realtime-data-analytics?ref=blef.fr">Enabling near real-time data analytics on the data lake</a> — Grab showcases what they did with Flink and Hudi to enable real-time use-cases.</li><li><a href="https://arxiv.org/abs/2402.06282?ref=blef.fr">Retrieve, merge, predict: augmenting tables with data lakes</a> — A paper that explains how you can improve data discovery on data lakes to finally augment a given table with new data. 
I did not read the paper except the introduction and the first schema, but it looks awesome.</li></ul><figure class="kg-card kg-image-card"><img src="https://www.blef.fr/content/images/2024/02/Screenshot-2024-02-23-at-16.20.08.png" class="kg-image" alt="" loading="lazy" width="1286" height="776" srcset="https://www.blef.fr/content/images/size/w600/2024/02/Screenshot-2024-02-23-at-16.20.08.png 600w, https://www.blef.fr/content/images/size/w1000/2024/02/Screenshot-2024-02-23-at-16.20.08.png 1000w, https://www.blef.fr/content/images/2024/02/Screenshot-2024-02-23-at-16.20.08.png 1286w" sizes="(min-width: 720px) 720px"></figure><ul><li><a href="https://datamonkeysite.com/2024/02/22/building-a-cost-effective-solution-using-fabric/?ref=blef.fr">Building a cost effective solution using&nbsp;Fabric</a> — Another look at Fabric. In the end the author creates a workspace and transforms the data with pandas and DuckDB in notebooks. Thank you Microsoft.</li><li><a href="https://www.decodable.co/blog/checkpoint-chronicle-february-2024?ref=blef.fr">A newsletter about the streaming data space</a> — Robin collected a lot of cool articles about the streaming ecosystem.</li></ul><h3 id="cool-ideas">Cool ideas</h3><ul><li><a href="https://www.brainfart.dev/blog/foss-state-in-2024?ref=blef.fr">Open-source, current state and future hopes</a>.</li><li><a href="https://mikkeldengsoe.substack.com/p/data-will-not-tell-you-what-to-do?ref=blef.fr">Data will not tell you what to do</a>.</li><li><a href="https://medium.com/ft-product-technology/turning-ideas-into-ai-use-cases-the-product-manager-point-of-view-f5e4aa7fe0af?ref=blef.fr">Turning ideas into AI use cases</a> — the Product Manager point of view.</li></ul><p></p><h1 id="data-economy-%F0%9F%92%B0">Data Economy 💰</h1><ul><li><a 
href="https://techcrunch.com/2024/02/19/struggling-database-company-mariadb-could-be-taken-private-in-a-37m-deal/?ref=blef.fr"><strong>MariaDB</strong> takeover at $37m</a>. MariaDB is a public company and could be taken private by an investment company.</li><li><a href="https://www.neurelo.com/?ref=blef.fr"><strong>Neurelo</strong></a> <a href="https://www.businesswire.com/news/home/20240131575739/en/Neurelo-Data-Access-Platform-S%5B%E2%80%A6%5Difies-and-Accelerates-Modern-Cloud-Application-Development?ref=blef.fr">raises $5m seed</a> to provide HTTP APIs on top of databases (PostgreSQL, MongoDB and MySQL). We can see it as a semantic layer, but on the software engineering side.</li><li><a href="https://www.motifanalytics.com/?ref=blef.fr"><strong>Motif Analytics</strong></a> <a href="https://techcrunch.com/2024/02/12/motif-analytics-brings-sequence-analytics-to-growth-teams/?ref=blef.fr">raises $5.7m seed</a>. This is a tool made to analyse sequences, especially useful in web analytics / acquisition. They provide tooling to do it without writing awful SQL queries.</li></ul><hr><p>See you next week ❤️.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Is the modern data stack disappearing? ]]></title>
                    <description><![CDATA[ Today we answer the most important question. Is the modern data stack coming to an end? ]]></description>
                    <link><![CDATA[ /modern-data-stack-disappearing/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 65d343b3a0ae420001db5822 ]]></guid>
                    <pubDate><![CDATA[ 2024-02-19 ]]></pubDate>
                    <content>
<![CDATA[ <p></p><p><strong>No.</strong></p><p></p><hr><p>This question generated a lot of content last week, and a lot of words were written. I wanted to keep my answer short so as not to burden you with a few thousand more words to read.</p><ul><li>Modern data stack was coined by US companies and VCs—mainly Fivetran / dbt Labs—as a term to quickly emphasise a way to build a data stack in the cloud, centered on ELT. It was a well-suited marketing term, let's be honest.</li><li>The time came and everyone took their place at the table to eat a slice of cake.</li><li>A lot of people have issues with the <strong>modern</strong> word. Probably because it's not an explicit semantic; the definition is <em>relating to the present or recent times as opposed to the remote past</em>. In this definition there are 2 issues.<ul><li><em>is relating to the present</em> — not all companies are in the same present</li><li><em>as opposed</em> — the term creates an opposition between 2 worlds, creating something we always like in tech: a debate between 2 kinds of technologies.</li></ul></li><li>Actually modern creates some kind of exclusion between new technologies and old technologies. It was useful at first for Fivetran or dbt Labs to be disruptive, but now that everyone is using the MDS, is it still a good idea to create this competition? Especially if you want to enter the Fortune 500, where they actually use old tech?</li><li>And, we should stop being cynical: who in the hell—among my readers at least—wants to work with SAP, IBM or mainframes in 2024? Because they still exist; numbers show that <a href="https://www.statista.com/statistics/1308367/share-server-format-companies-worldwide/?ref=blef.fr">50% of companies are still on-premise</a>, and when it comes to publicly listed companies or government stuff it's probably way higher.</li><li>For these organisations the ideal of a modern data stack still resonates. Employees are stuck in hell regarding data tooling. 
Data projects are still failing to go to production.</li><li>Personally, my vision of the modern data stack has always changed over the years. As always, I don't blindly apply the principles by the book. The idea of a dedicated storage where all the data lies, with SQL transformations on top, top-notch CI/CD processes with everything-as-code, and a galaxy of convenient tools around for observability, sums up what's modern about our data ecosystem.</li><li>That's why I think the modern data stack vision isn't going anywhere.</li></ul><p>In my four years of freelancing, I've always said I build <strong>data platforms</strong> or <strong>data stacks</strong>, because who am I to judge whether I'm modern?</p><hr><p>As a reference, read my online friends' views:</p><ul><li><a href="https://roundup.getdbt.com/p/is-the-modern-data-stack-still-a?ref=blef.fr">Is the "Modern Data Stack" still a useful idea?</a> — Tristan Handy, dbt Labs CEO. He coined the term and is now whistling the end of playtime. MDS was previously useful to align practices but now he thinks we should move on to the <strong>analytics stack</strong>. And AI is around the corner to take all the spotlight while we actually do stuff at the bottom of the pyramid.</li><li><a href="https://benn.substack.com/p/the-problem-was-the-product?ref=blef.fr">The problem was the product</a> — Benn Stancil, Mode CTO; scroll to mid-article. </li><li><a href="https://joereis.substack.com/p/everything-ends-my-journey-with-the?ref=blef.fr">Everything Ends - My journey with the modern data stack</a> — Joe Reis, author of Fundamentals of Data Engineering. Joe depicts his own journey and views and why it became a mess with too many companies on the radar, finally creating the most fragmented platforms with no coherence at all, negating all the good MDS aspects.</li></ul> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 24.07 ]]></title>
                    <description><![CDATA[ Data News #24.07 — OpenAI Sora, Gemini, boximator, models competition is fierce, new Observable and BI as Code and more stuff. ]]></description>
                    <link><![CDATA[ /data-news-week-24-07/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 65c71ec598601300016951e0 ]]></guid>
                    <pubDate><![CDATA[ 2024-02-16 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/photo-1606140955270-43cc65e06870?q=80&amp;w=1000&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D" class="kg-image" alt="cars parked on side of the road near building during daytime" loading="lazy" width="1000" height="667" srcset="https://images.unsplash.com/photo-1606140955270-43cc65e06870?q=80&amp;w=600&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D 600w, https://images.unsplash.com/photo-1606140955270-43cc65e06870?q=80&amp;w=1000&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D 1000w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Italy Sora (</span><a href="https://unsplash.com/photos/cars-parked-on-side-of-the-road-near-building-during-daytime-Bf49iOwtpWA?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p>Hey you, time for the Data News. Because I did not send the news last week you will get articles from the last 2 weeks. The last few days have been heavily packed with AI news as well.</p><p><em>Disclaimer, the 2 events below will be in French.</em></p><p>Before jumping to the news there are a few events I want to write about. Next Wednesday I will participate in a <a href="https://www.linkedin.com/feed/update/urn:li:activity:7164176804176478208/?ref=blef.fr">Data Night Talk</a>, an open discussion about AI &amp; data engineering with other content creators. We will do it online / in-person. So tune in. 
I'm working on a 10-minute light talk about data engines (🦆) and a funny game.</p><p>✨ Second, I'm partnering with <a href="https://conference-mlops.com/?ref=blef.fr">La Conférence MLOps</a>, a half-day conference on the challenges of industrializing AI. It will take place on March 7 in Paris. The list of speakers includes many important figures from the French data ecosystem, and I'm very excited about it. I may give a talk—but I'm not sure yet.</p><p><em>PS: This is not a paid partnership with nibble—the company behind the conference—but in fact they've been a client of mine for 2 years and have been a huge supporter of the newsletter since day one and they're good humans. I'm happy to partner with them.</em></p><p></p><h1 id="ai-news-%F0%9F%A4%96">AI News 🤖</h1><p>The last few days have been stacked in terms of AI news. Have fun getting through everything 😊, it's really cool how fast things are improving.</p><ul><li><a href="https://openai.com/sora?ref=blef.fr">OpenAI Sora, generate 1-minute long videos</a> — OpenAI released yesterday a new generative model that is able to create 1-minute long videos from a text prompt. At the moment this is not public and only in the hands of a limited set of testers, but the first look shows that OpenAI might already be ahead of the competition. You can see a few videos on their landing page as well as the current limitations.</li><li><a href="https://boximator.github.io/?ref=blef.fr">ByteDance boximator, create motion on images</a> — Boximator is a friendly method to instruct generative algorithms with boxes. With the boxes you define a motion on an image and the models create a video out of it. ByteDance is the company behind TikTok. 
At the end of the page you have a comparison of current generative video models, so you can form your own opinion compared to Sora.</li><li><a href="https://ai.meta.com/blog/v-jepa-yann-lecun-ai-model-video-joint-embedding-predictive-architecture/?ref=blef.fr">Meta V-JEPA, fills the void in videos</a> —&nbsp;V-JEPA is not a generative model. With this model you can "fill in the blanks of what’s missing—not in terms of the actual pixels, but rather as a more abstract description in this representation space".</li><li><a href="https://www.wsj.com/tech/ai/sam-altman-seeks-trillions-of-dollars-to-reshape-business-of-chips-and-ai-89ab3db0?ref=blef.fr">Sam Altman wants to raise $7 trillion</a> — It was the WSJ news of last week about the OpenAI CEO seeking 7,000,000,000,000 dollars. Obviously everyone tried to guess why he wants this money—which could come from the UAE government—probably to enter the semi-conductor industry. I'm happy we correctly use money to save our planet /s.</li><li><a href="https://blogs.nvidia.com/blog/canada/?ref=blef.fr">Canada partners with NVIDIA to bring more computing power</a>.</li><li><a href="https://wow.groq.com/why-groq/?ref=blef.fr">Groq, which speeds up LLM inference</a> — This week I discovered Groq, a company that created the LPU™ inference engine, claiming to be the <a href="https://wow.groq.com/groq-lpu-inference-engine-crushes-first-public-llm-benchmark/?ref=blef.fr">most efficient</a> cloud-based provider (18x faster) in terms of tokens/s.</li><li><a href="https://techcrunch.com/2024/02/15/googles-new-ai-hub-in-paris-proves-that-google-feels-insecure-about-ai/?ref=blef.fr">Google announced a "new" AI hub in Paris</a> — TechCrunch is mocking the US company about this announcement, considered a communication effort, because the 300 members of the new AI hub are already working for Google and it's just a new office space.</li><li><a href="https://blog.google/technology/ai/google-gemini-next-generation-model-february-2024/?ref=blef.fr">Still 
Google announced Gemini 1.5</a> — Gemini is the <a href="https://blog.google/products/gemini/bard-gemini-advanced-app/?ref=blef.fr">new name of Bard</a> (as of last week) and blablabla Gemini is awesome blablabla. It's crazy how outdated Google feels compared to smaller AI companies in terms of the hype or magic they are able to build. <a href="https://twitter.com/JeffDean/status/1758146022726041615?ref=blef.fr">More details on Twitter</a>.</li><li><a href="https://www.linkedin.com/feed/update/urn:li:activity:7161088523264020480/?ref=blef.fr">HuggingFace model usage on HuggingChat</a> —&nbsp;HuggingChat is a chat UI from HF that lets you play with whatever model works with it. It depicts how fluid the model market is. Mainly it shows that Mistral is currently replacing LLaMa. You can also see the models' "market share" among the major APIs / cloud vendors in a nice <a href="https://miro.com/app/board/uXjVNz_4nrc=/?ref=blef.fr">Sankey diagram</a>.</li><li><a href="https://www.nvidia.com/en-us/ai-on-rtx/chat-with-rtx-generative-ai/?ref=blef.fr">NVIDIA Chat with RTX</a> — It's a Windows app (~37GB) that locally runs a GPT model to unlock chat with your files in a secure way, out of the cloud. Happy gamers.</li><li><a href="https://gitlab.adullact.net/dgfip/projets-ia/llamandement?ref=blef.fr">French Finance ministry released an LLM to summarise legislative proposals</a> — Called LLaMandement, it's a fine-tuned LLaMa designed to produce neutral summaries of law proposals to help the government with preliminary notes. All the data used for the fine-tuning is available on GitLab, as is the FastChat command used. 
Here is the English paper on <a href="https://arxiv.org/abs/2401.16182?ref=blef.fr">arxiv</a>.</li><li><a href="https://medium.com/artefact-engineering-and-data-science/why-you-need-llmops-48c0925827de?ref=blef.fr">Why you need LLMOps</a> — A great post that encapsulates all the words needed to understand what needs to be done when it comes to putting LLMs in production.</li></ul><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li>❤️ <a href="https://observablehq.com/blog/observable-2-0?ref=blef.fr">Observable 2.0</a> — Observable has a special place in my heart. Observable was created by Mike Bostock, the creator of D3js, which is my madeleine de Proust. Today they announced the 2.0 version, which is mainly Charts as Code. It goes beyond notebooks and becomes a static site generator for building fast, beautiful data apps, dashboards, and reports. I'm so excited to play with it.</li><li><a href="https://evidence.dev/blog/why-we-built-usql/?ref=blef.fr">Introducing universal SQL</a> — I have to talk about Evidence now, which is also a static site generator for building data front-ends. They introduce universal SQL as a way to connect to all kinds of data sources, adding interaction in the frontend while staying fast. Mainly it means data is exported to Parquet and computed with DuckDB WASM in the browser.</li><li><a href="https://astral.sh/blog/uv?ref=blef.fr">uv: Python packaging in Rust</a> — I've been using Poetry for the last 2 years and I'm quite satisfied with it. Seeing a new kid on the block is good because it renews the ideas; let's be honest, we will never have a de facto packaging tool.</li><li><a href="https://github.com/quarylabs/sqruff?ref=blef.fr">sqruff: SQL linter written in Rust</a> — This is the result of the Rust hype: people are now porting more and more tools to Rust for efficiency. 
And it's for the better.</li><li><a href="https://motherduck.com/blog/introducing-column-explorer/?ref=blef.fr">Introducing the column explorer in MotherDuck</a> — A cool feature in MotherDuck (DuckDB in the cloud) that adds sparklines and column distributions when looking at a dataset.</li><li>dbt Labs announced a few new things in their dbt Explorer (which is only available in dbt Cloud). In a nutshell they announced <a href="https://docs.getdbt.com/blog/dbt-explorer?ref=blef.fr#wheres-this-data-coming-from">column-level lineage</a>, <a href="https://docs.getdbt.com/blog/dbt-explorer?ref=blef.fr#recommendations">recommendations</a> and <a href="https://www.getdbt.com/blog/announcing-exports-for-the-dbt-semantic-layer?ref=blef.fr">semantic layer exports</a>. It's fair to say that the lineage is powered by SQLGlot (and <a href="https://twitter.com/Captaintobs/status/1757601463852023876?ref=blef.fr">Toby is not happy about it</a>).</li><li><a href="https://cloud.google.com/blog/products/data-analytics/introducing-new-vector-search-capabilities-in-bigquery?hl=en&ref=blef.fr">Introducing vector search in BigQuery</a> — RAG is everywhere and BigQuery enters the game.</li><li><a href="https://docs.snowflake.com/en/user-guide/tasks-intro?ref=blef.fr#label-billing-task-runs">Snowflake lowers the cost of tasks from x1.5 to x1.2.</a></li></ul><p></p><h1 id="engineering-%E2%9A%99%EF%B8%8F">Engineering ⚙️</h1><ul><li><a href="https://substack.timodechau.com/p/eventify-everything-data-modeling?ref=blef.fr">Eventify everything</a> — This is an ode to event modeling and a different way to think about data modeling. 
Timo showcases how you can eventify your data model to think differently about your business activity.</li><li><a href="https://medium.com/apache-airflow/what-we-learned-after-running-airflow-on-kubernetes-for-2-years-0537b157acfd?ref=blef.fr">What we learned after running Airflow on Kubernetes for 2 years</a> — Outstanding article with great insights about the journey of running Airflow in production. It breaks down how to handle dynamic DAG generation, multiple DAG repositories, configuration fine-tuning and observability. </li><li><a href="https://medium.com/@cautaerts/a-dataframe-is-a-bad-abstraction-8b2d84fa373f?ref=blef.fr">A dataframe is a bad abstraction</a> — The article is too long for me to read right now but the title is catchy enough for me to put it in the newsletter. If you read it I'm curious to know what you think about it.</li><li><a href="https://engineering.backmarket.com/back-markets-journey-towards-data-self-service-89b278d6617a?ref=blef.fr">Back Market’s journey towards data self-service</a> — How to be a data self-service company and which initiatives they tried on this journey.</li><li><a href="https://moderndatanetwork.medium.com/how-to-leverage-metabase-for-efficient-self-service-analytics-ac60f855299c?ref=blef.fr">How to leverage Metabase for efficient self-service analytics?</a> — 3 companies joined to share tips about Metabase governance and monitoring. This is a goldmine if you're using Metabase and struggling to understand how users are using it.</li><li><a href="https://blog.sdf.com/p/automating-data-classification-for?ref=blef.fr">Automating data classification for the 21st Century</a> — How Semantic Data Fabric (SDF) is able to statically infer data types and lineage out of a SQL codebase. In a sense SDF is a dbt alternative. 
I really like this article.</li></ul><p></p><h1 id="food-for-thoughts-%F0%9F%8D%B1">Food for thoughts 🍱</h1><ul><li><a href="https://www.bvp.com/atlas/what-founders-need-to-know-to-build-a-high-performing-data-team?ref=blef.fr">What founders need to know to build a high-performing data team</a> — Centralised, distributed or hybrid data team? This article discusses it.</li><li><a href="https://petrjanda.substack.com/p/elevate-the-role-of-analytics-in?ref=blef.fr">Let's elevate the role of analytics</a>.</li><li><a href="https://www.synq.io/blog/data-ownership-guide?ref=blef.fr">Data ownership: A practical guide</a>.</li><li><a href="https://roundup.getdbt.com/p/is-the-modern-data-stack-still-a?ref=blef.fr">Is the Modern Data Stack still a useful idea</a> — I'll write more about this later; I guess this is too big to write my views in a bullet point. The article is about Tristan Handy's views on the future of the modern data stack.</li></ul><hr><p>See you next week ❤️.</p><p><em>PS: I'd love to get your feedback about the newsletter, whether you recently joined or have been here since the beginning. I'd also love to understand who you are and what would eventually make you financially support my content creation activity.</em></p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 24.05 ]]></title>
                    <description><![CDATA[ Data News #24.05 — text-to-sql problem, state of French data market, DuckCon and the fast news. ]]></description>
                    <link><![CDATA[ /data-news-week-24-05/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 65b3e5c03e7aba00011c9a2f ]]></guid>
                    <pubDate><![CDATA[ 2024-02-03 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/photo-1576924542622-772281b13aa8?q=80&amp;w=1000&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D" class="kg-image" alt="a group of boats that are sitting in the water" loading="lazy" width="1000" height="667" srcset="https://images.unsplash.com/photo-1576924542622-772281b13aa8?q=80&amp;w=600&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D 600w, https://images.unsplash.com/photo-1576924542622-772281b13aa8?q=80&amp;w=1000&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D 1000w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">hey (</span><a href="https://unsplash.com/photos/a-group-of-boats-that-are-sitting-in-the-water-3Ze88tZX-p0?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p>Hello here, this is Christophe from Amsterdam. I hope you're doing well. I'm in Amsterdam for the day for DuckCon #4, the DuckDB annual conference, and god I like Europe. Being able to travel by train from Berlin to Paris to Amsterdam while going to the west of France for a lecture in a week is something truly awesome.</p><p>Anyway this week will be a mixed Data News with links, stuff and ideas, and a small wrap-up of the DuckCon + the stuff I presented on Wed. at a Modern Data Stack meetup in Paris about DuckDB WASM. I hope you'll enjoy it.</p><p></p><h1 id="the-text-to-sql-problem">The text-to-sql problem</h1><p>Every once in a while people give a shot at the text-to-SQL problem. Each time a new breakthrough happens (meaning a new LLM), companies launch and people try. 
2 weeks ago TextQL <a href="https://www.textql.com/blog/announcing-our-4-1m-fundraise?ref=blef.fr">raised a $4.1m seed</a> trying to solve this issue.</p><p>But what problem are we trying to solve?</p><p>In fact, I think we're trying to solve two different problems. The first is self-service: we want our stakeholders to be able to access information on their own and with no errors, once again chasing the dream that our clients can navigate the data jungle on their own; in fact this problem is "text-to-insights". And there's the second part of the problem which is much simpler, a data copilot, a tool that accelerates the productivity of data workers by bootstrapping SQL writing or analysis.</p><p>Obviously when it comes to self-service we need a layer that does a text-to-SQL conversion. In the current hype cycle it can be done with LLMs, like <a href="https://motherduck.com/blog/duckdb-text2sql-llm/?ref=blef.fr">DuckDB-NSQL-7B</a>, the one MotherDuck provided recently. Like every model you have to analyse the <a href="https://arxiv.org/abs/2401.12379?ref=blef.fr">efficiency</a> of these generation layers. </p><p>From my own little experiments in this field, here is what I can say: a generation layer can behave like an analyst but will be way more stupid than an analyst. I mean, an LLM cannot get a thousand-line query right the first time; like an analyst, it has to work incrementally, either through prompting for the LLM or through test-and-run for the analyst. </p><p>But there is something that limits the LLM: its business understanding. 
Even if you give your LLM access to the database, the codebase and the docs, there is something the LLM does not have: the implicit (vocal) business rules that are written nowhere.</p><p>I have 2 things for the conclusion:</p><ul><li>Have a look at what Alan did as a <a href="https://www.linkedin.com/feed/update/urn:li:activity:7156555784661704704/?updateEntityUrn=urn%3Ali%3Afs_updateV2%3A%28urn%3Ali%3Aactivity%3A7156555784661704704%2CFEED_DETAIL%2CEMPTY%2CDEFAULT%2Cfalse%29&ref=blef.fr">Copilot / Metabase bot</a> to help people get insights — by people, in this case, it means the CEO, who is explicitly saying on LinkedIn "<em>It's incredible I don't need to ask anymore my People or Data Analysts team</em>" — 😬</li><li><strong>Having a data catalog does not mean that people know what to do with the data</strong> [they just know it exists] — this is like an aggregate of quotes from my Wed. conference.</li></ul><p></p><h1 id="state-of-the-french-data-market">State of the French data market</h1><p>2 benchmarks have been published recently about the French data market.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2024/02/Screenshot-2024-02-03-at-10.33.01.png" class="kg-image" alt="" loading="lazy" width="1764" height="384" srcset="https://www.blef.fr/content/images/size/w600/2024/02/Screenshot-2024-02-03-at-10.33.01.png 600w, https://www.blef.fr/content/images/size/w1000/2024/02/Screenshot-2024-02-03-at-10.33.01.png 1000w, https://www.blef.fr/content/images/size/w1600/2024/02/Screenshot-2024-02-03-at-10.33.01.png 1600w, https://www.blef.fr/content/images/2024/02/Screenshot-2024-02-03-at-10.33.01.png 1764w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">French public market salary grid in data (compared to software engineer) (</span><a 
href="https://www.numerique.gouv.fr/uploads/Circulaire%20n%C2%B06434-SG%20du%203%20janvier%202024%20-%20r%C3%A9f%C3%A9rentiel%20num%C3%A9rique.pdf?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">source</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><ul><li>The public sector released their <a href="https://www.numerique.gouv.fr/uploads/Circulaire%20n%C2%B06434-SG%20du%203%20janvier%202024%20-%20r%C3%A9f%C3%A9rentiel%20num%C3%A9rique.pdf?ref=blef.fr">salary grid for all tech workers</a> — this is in French but scroll to the last page of the PDF to see the table<ul><li>We have 4 experience buckets: &lt;5, &lt;10, &gt;10 and &gt;20 years. Which is completely relevant for the tech / data field I think; only a few people are 20+ years in, from what I see.</li><li>It is crazy how badly data engineers are paid compared to all other positions — especially when you know that other positions end up doing data engineering when there are no data engineers</li><li>The comparison to data scientists is nevertheless not relevant because data scientists very often have PhDs, so it makes sense they start higher than other positions</li><li>What do you think of it?</li></ul></li><li>At the same time the Modern Data Network released the <a href="https://moderndatanetwork.medium.com/how-much-data-professionals-make-in-france-the-mdn-annual-benchmark-0f77f706b79c?ref=blef.fr">annual benchmark of data professionals</a><ul><li>An Analytics Engineer role enters the chat — this is explained by the fact that the MDN is full of startups, where roles evolve faster than elsewhere. 
</li></ul></li><li>It is complicated to compare the 2 benchmarks because the experience ranges are not the same, but we still see similar trends across positions.</li><li>My main takeaway is that the Data Analyst role is finally taking the place it should as a full role and not a transition role before a DS or a DE role — being paid higher than others at entry level for instance.</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2024/02/Screenshot-2024-02-03-at-10.46.31.png" class="kg-image" alt="" loading="lazy" width="1378" height="306" srcset="https://www.blef.fr/content/images/size/w600/2024/02/Screenshot-2024-02-03-at-10.46.31.png 600w, https://www.blef.fr/content/images/size/w1000/2024/02/Screenshot-2024-02-03-at-10.46.31.png 1000w, https://www.blef.fr/content/images/2024/02/Screenshot-2024-02-03-at-10.46.31.png 1378w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Modern Data Network annual benchmark of data professionals (</span><a href="https://moderndatanetwork.medium.com/how-much-data-professionals-make-in-france-the-mdn-annual-benchmark-0f77f706b79c?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">source</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><div class="kg-card kg-callout-card kg-callout-card-grey"><div class="kg-callout-emoji">😥</div><div class="kg-callout-text">Currently there is a huge wave of layoffs in tech startups in Europe and in the US. 
When looking at the numbers on <a href="https://layoffs.fyi/?ref=blef.fr">layoffs.fyi</a>, this January has more layoffs than the last 6 months of 2023.<br><br>If you have been impacted by a layoff and you need help finding your new journey, write to me.</div></div><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><p>To avoid fragmenting the news too much, and because I already wrote a lot, the AI News is blended with the Fast News.</p><ul><li><a href="https://blog.allenai.org/hello-olmo-a-truly-open-llm-43f7e7359222?ref=blef.fr">OLMo, a new open-source LLM</a> — The Allen Institute in Seattle released what they called a truly open LLM. For the first time we have the model, the weights <strong>and the training data</strong>. I can't wait to see how it compares and how people use it.</li><li><a href="https://huggingface.co/spaces/vikhyatk/moondream1?ref=blef.fr">hf/moondream1</a> — This is really awesome: a tiny LLM that can answer questions about a given image.</li><li><a href="https://www.probabl.ai/?ref=blef.fr">:probabl. 
launch</a> — The team behind scikit-learn is joining forces to create a new venture with the goal of maintaining a state-of-the-art data science tooling suite benefiting France, the EU and the world.</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2024/02/Screenshot-2024-02-03-at-11.11.48.png" class="kg-image" alt="" loading="lazy" width="2000" height="1331" srcset="https://www.blef.fr/content/images/size/w600/2024/02/Screenshot-2024-02-03-at-11.11.48.png 600w, https://www.blef.fr/content/images/size/w1000/2024/02/Screenshot-2024-02-03-at-11.11.48.png 1000w, https://www.blef.fr/content/images/size/w1600/2024/02/Screenshot-2024-02-03-at-11.11.48.png 1600w, https://www.blef.fr/content/images/2024/02/Screenshot-2024-02-03-at-11.11.48.png 2068w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Thank you moondream</span></figcaption></figure><ul><li><a href="https://github.com/yohannj/cybersec-ctf-box?ref=blef.fr">github/cybersec-ctf-box</a> —&nbsp;A cybersecurity CTF. A friend of mine created this repo so you can train against a few attacks you might face. The first one is around the Chart.js library.</li><li><a href="https://www.getdbt.com/blog/dbt-labs-names-data-industry-veteran-mark-porter-as-chief-technology-officer?ref=blef.fr">dbt Labs names a new CTO</a> — He was previously CTO at MongoDB.</li><li><a href="https://towardsdatascience.com/dont-fix-bad-data-do-this-instead-d45262444cf2?ref=blef.fr">Don't fix bad data, do this instead</a> — It's never a good idea to apply patches on bad data. Always remember to identify root causes before jumping on the fixing wagon.</li><li><a href="https://medium.com/@bxh_io/our-transformation-journey-toward-an-open-data-platform-b6f869b6a173?ref=blef.fr">Our transformation journey toward an open data platform</a> — Condé Nast data platform walkthrough. 
The platform is built on top of Databricks with a lot of other logos revolving around making sense of the lakehouse platform.</li><li><a href="https://towardsdatascience.com/mastering-airflow-variables-32548a53b3c5?ref=blef.fr">Mastering Airflow variables</a> — All the different techniques to master Airflow variables.</li><li><a href="https://engineering.grab.com/rethinking-streaming-processing-data-exploration?ref=blef.fr">Grab, rethinking stream processing: data exploration</a> — How to unlock analysts' superpowers by giving them the capability to analyse real-time data directly on streams rather than on offloaded lake data.</li><li><a href="https://luminousmen.com/post/how-to-build-highperformance-engineering-teams/?ref=blef.fr">How to build high-performance engineering teams</a> — Get click-baited like me. If I can add one point: step 0 is important, but then you need to give the team enough freedom and vision.</li><li><a href="https://mikkeldengsoe.substack.com/p/the-business-critical-data-warehouse?ref=blef.fr">The business-critical data warehouse</a> — Putting the church back at the center of the village.</li><li><a href="https://www.databricks.com/blog/welcome-data-intelligence-platform-databricks-einblick?ref=blef.fr">Databricks acquires Einblick</a> — It goes back to the text-to-insights problem. Einblick is a drag-n-drop solution to "solve any data problem in one solution". LMAO, marketing teams at their finest.</li></ul><p></p><h1 id="duckcon-my-duck-stuff">DuckCon + my Duck stuff</h1><p>Because this Data News is already too long I split the content into 2 articles. Read my <a href="https://www.blef.fr/duckcon-4-takeaways/" rel="noreferrer">DuckCon takeaways</a> 🦆.</p><p>Still, last Wed. I presented DuckDB to a French audience; during this <a href="https://docs.google.com/presentation/d/1QAME-RYonvNp-qfga2vpXhNtkIg44sdKnVZt4xDaf0E/edit?usp=sharing&ref=blef.fr">presentation</a> I showcased what you can do with DuckDB and DuckDB WASM. 
WASM is a portable way to run DuckDB in the browser.</p><p>You can play with the SQL editor I've worked on <a href="https://blef.fr/mds-criteo?ref=blef.fr" rel="noreferrer">here</a> (mobile + desktop): try to run a small GROUP BY query after loading the tables; everything you do runs on your device. This is the WASM magic. There is also the <a href="https://github.com/Bl3f/parquet-info?ref=blef.fr">Firefox extension</a> that lets you hover over parquet files in the cloud console to get the schema, but more on this later as I plan to push it forward this month.</p><p><em>PS: I'm so happy to have met a few readers IRL, it anchors my content and my work in reality. So once again, to the few people who came to me, thank you so much.</em></p><hr><p>See you next week ❤️.</p><p></p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ DuckCon #4 takeaways ]]></title>
                    <description><![CDATA[ My DuckCon #4 takeaways, enhanced raw notes about the Duck conference that happened on Feb 2 in Amsterdam. DuckDB feels magical and people like it. ]]></description>
                    <link><![CDATA[ /duckcon-4-takeaways/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 65be1704b4a4050001bddae0 ]]></guid>
                    <pubDate><![CDATA[ 2024-02-02 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/photo-1550001683-57add9a997bf?q=80&amp;w=1000&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D" class="kg-image" alt="flock of white ducks on brown soil" loading="lazy" width="1000" height="667" srcset="https://images.unsplash.com/photo-1550001683-57add9a997bf?q=80&amp;w=600&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D 600w, https://images.unsplash.com/photo-1550001683-57add9a997bf?q=80&amp;w=1000&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D 1000w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">A picture of people chatting at DuckCon (</span><a href="https://unsplash.com/photos/flock-of-white-ducks-on-brown-soil-0wk5wCTfyfs?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p>Hey, this is a straightforward post about the ideas and takeaways I got from <a href="https://duckdb.org/2023/10/06/duckcon4.html?ref=blef.fr">DuckCon</a>. I guess the recording will be posted online in a few days / weeks. </p><p>It took place in Amsterdam in a wonderful location. The agenda of the afternoon was quite small (because it is still a small conference) but interesting. There is something awesome about meeting the DuckDB community at this stage. The tool has not yet reached its peak so you meet people who are early adopters and fans of it — it's a nerds (male — diversity might come later I hope) conference actually.</p><h1 id="duckdb-announcements">DuckDB announcements</h1><p>The Duck creators announced that v0.10.1 is coming soon and before the end of July we might get v1.0.0. 
DuckDB adoption numbers are demonstrating a real trend behind the "hype". The DuckDB docs website gets 500k unique visitors per month and DuckDB has a <a href="https://duckdb.org/?ref=blef.fr">new shiny website</a>.</p><p>Soon we will get things like:</p><ul><li>Forward (best effort) and backward (guaranteed) compatibility between DuckDB file formats</li><li>Attaching a Postgres database to execute Postgres queries from the DuckDB prompt</li><li>A new fixed-length array data type</li><li>A new unified memory manager</li><li>A secret manager that can persist between sessions</li><li>A <a href="https://ir.cwi.nl/pub/33334?ref=blef.fr">new compression algorithm</a> called ALP that brings faster compression / decompression and higher compression ratios</li><li>v1.0.0 will have no new features compared to v0.10, focusing on stability and bugfixes</li></ul><figure class="kg-card kg-embed-card kg-card-hascaption"><iframe width="200" height="113" src="https://www.youtube.com/embed/cyZfpXxXojE?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen="" title="State of the Duck (DuckCon #4, Amsterdam, 2024)"></iframe><figcaption><p dir="ltr"><span style="white-space: pre-wrap;">See the State of the Duck introductory keynote</span></p></figcaption></figure><h1 id="ideas-from-talks">Ideas from talks</h1><p>I'll just throw out in the wild the ideas and stuff I've seen in the talks.</p><ul><li>HuggingFace is using DuckDB in multiple features to power data exploration in the frontend. In their datasets product, when looking at a dataset you can do full-text search or see distributions (with bars at the top of columns), and this is powered by DuckDB. 
Lastly they pre-compute statistics on datasets with DuckDB.</li><li>Fivetran uses DuckDB as the tech to do file merges in its data lake offering</li><li>Datacamp uses DuckDB in notebooks to query dataframes in SQL and considers it for teaching SQL — I might have something in the making about this on my side.</li><li>A dbt Core developer is using DuckDB in pdb to debug what's happening in the database pretty easily and can create "debug packages" to send to other people.</li><li>DuckDB feels magical for a few people (Liverpool FC) because it does stuff faster than other technologies with a smaller technical footprint — you just write SQL and it works.</li><li>The pattern might be<ul><li>Get the data out of the db</li><li>Query it with DuckDB</li><li>Put the data back into the db</li></ul></li></ul><h1 id="in-conclusion">In conclusion</h1><p>There is something between the lines: even if DuckDB is used differently by everyone, it just runs and creates something universal (thanks to SQL). Actually this might be the final tool that breaks the wall between tech teams and data teams.</p><p>With DuckDB you offload business logic that would be embedded in a backend app into SQL queries. You can use DuckDB as a library and not a service, which changes everything: what you need to do is <code>import duckdb</code>, not launch a Docker service, manage connection strings, etc.</p><p>Last point: parquet was the starting point of a lot of use-cases because the Duck works well with columnar files. But among all the questions and feelings, people seem to like the idea of the DuckDB file format becoming the de facto data format. </p><p>Let's see.</p><hr><p>I'm sorry I've written this as enhanced raw notes, I hope you'll like it.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 24.04 ]]></title>
                    <description><![CDATA[ Data News #24.04 — Let&#39;s talk AI podcast interview, data &amp; AI products conference, Disney VR floor, dbt awesome community projects. ]]></description>
                    <link><![CDATA[ /data-news-week-24-04/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 65b386923e7aba00011c9942 ]]></guid>
                    <pubDate><![CDATA[ 2024-01-26 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/photo-1550948390-6eb7fa773072?q=80&amp;w=1000&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D" class="kg-image" alt="single perspective of pathway leading to house" loading="lazy" width="1000" height="664" srcset="https://images.unsplash.com/photo-1550948390-6eb7fa773072?q=80&amp;w=600&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D 600w, https://images.unsplash.com/photo-1550948390-6eb7fa773072?q=80&amp;w=1000&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D 1000w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Hey (</span><a href="https://unsplash.com/photos/single-perspective-of-pathway-leading-to-house-qYwyRF9u-uo?ref=blef.fr"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p>Hey, new week, new email. It is already the end of January but I took time to travel and see people I had not seen for a long time, so I'm super happy with how this new year is starting.</p><p>Next week, I'll be wrapping up my <em>DataOps</em> lecture by incorporating how to deploy machine learning models. This is a fun part where students learn how to serve a simple classifier in production: building a custom HTTP API, a Docker image and CI/CD processes, making it accessible on the internet. For the modern part this year, I'm going to integrate an LLM "classifier" part; it might spark their curiosity. We'll see.</p><p>Yesterday an interview I did for the podcast <a href="https://smartlink.ausha.co/let-s-talk-ai/55-data-software-engineering-freelance-career-and-teaching-with-christophe-blefari?ref=blef.fr">Let's talk AI</a> was published. Available everywhere. 
We talked about data engineering, freelancing and career stuff. </p><figure class="kg-card kg-image-card kg-card-hascaption"><a href="https://smartlink.ausha.co/let-s-talk-ai/55-data-software-engineering-freelance-career-and-teaching-with-christophe-blefari?ref=blef.fr"><img src="https://www.blef.fr/content/images/2024/01/56-CHRISTOPHE-BLEFARI.png" class="kg-image" alt="" loading="lazy" width="2000" height="1125" srcset="https://www.blef.fr/content/images/size/w600/2024/01/56-CHRISTOPHE-BLEFARI.png 600w, https://www.blef.fr/content/images/size/w1000/2024/01/56-CHRISTOPHE-BLEFARI.png 1000w, https://www.blef.fr/content/images/size/w1600/2024/01/56-CHRISTOPHE-BLEFARI.png 1600w, https://www.blef.fr/content/images/2024/01/56-CHRISTOPHE-BLEFARI.png 2000w" sizes="(min-width: 720px) 720px"></a><figcaption><span style="white-space: pre-wrap;">Let's Talk AI podcast </span><a href="https://smartlink.ausha.co/let-s-talk-ai/55-data-software-engineering-freelance-career-and-teaching-with-christophe-blefari?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">new episode</span></a></figcaption></figure><h1 id="data-ai-products">Data &amp; AI products</h1><p>Yesterday I went to a 5h conference organised in Paris about <a href="https://www.hymaia.com/event/data-product?ref=blef.fr">Data &amp; AI products</a>—in French. The idea of the conference was to mix people coming from the data and product ecosystems, which is, let's be honest, the key enabler for AI in production. The recordings will be online in a few days / weeks and I'll share them once available. </p><p>Here are a few takeaways, in a messy way:</p><ul><li>Data products and organisational impacts<ul><li>Data engineers are still the limiting human resource.</li><li>Data mesh by the book will not work; if you want to scale you can't just add more people to a central team.</li><li>Data mesh means decentralisation but more importantly ownership and responsibilities given to teams (esp. 
data producers)—if every team has to be responsible you need an easy-to-use platform and you have to <strong>explicitly</strong> give them responsibilities.</li></ul></li><li><a href="https://docs.google.com/presentation/d/1B5M5Dy1zQCFbbRNyhOyEAuz2STbxuTDs6xHf11b9KPA/edit?ref=blef.fr#slide=id.g2b2e220b133_0_482">UX for data products</a> — This is a presentation I really enjoyed, by Claire Lebarz, VP data at Malt. Without the voice you will miss a lot of things, still it contains great practical tips.<ul><li>Before jumping to AI projects you first need to <strong>start with words</strong> and to define <strong>metrics reflecting</strong> [your] <strong>values</strong>. You don't want your AI to give a bad product experience. So define—as a metric—what you don't want to have.</li><li>Then Claire schematised human interaction with models (slide 8) via an interface with inputs and outputs. Inputs and outputs can be instrumented with multiple techniques that will empower people in their interaction with the AI algorithm.<ul><li>Inputs — This is what you ask from the users to feed your algorithm. It can be done with <em>calibration</em>, <em>implicit or explicit feedback</em> and <em>corrections</em>.</li><li>Outputs — Product design choices where you give power over the algorithm. 
It can be done with <em>multiple options</em> (like trip alternatives on Google Maps), <em>attributions</em> (why something has been recommended), <em>confidence intervals</em> (weather) or <em>limitations</em>.</li></ul></li><li>As data people you need to build relationships with designers to converge on common terms about human-AI interactions</li></ul></li><li>And other bits I got from the other talks<ul><li>OKRs mean metrics alignment across the company, which leads to team autonomy—AI teams should be autonomous in finding solutions to move indicators</li><li>It's critical to have dashboards measuring success when A/B testing models</li><li>"Product is about people crafting together to best solutions and experiences to solve a customer problem" — <a href="https://www.linkedin.com/in/anneclairefortinbaschet/?ref=blef.fr">Anne-Claire Baschet</a>.</li></ul></li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2024/01/Screenshot-2024-01-26-at-17.46.09.png" class="kg-image" alt="" loading="lazy" width="2000" height="1131" srcset="https://www.blef.fr/content/images/size/w600/2024/01/Screenshot-2024-01-26-at-17.46.09.png 600w, https://www.blef.fr/content/images/size/w1000/2024/01/Screenshot-2024-01-26-at-17.46.09.png 1000w, https://www.blef.fr/content/images/size/w1600/2024/01/Screenshot-2024-01-26-at-17.46.09.png 1600w, https://www.blef.fr/content/images/2024/01/Screenshot-2024-01-26-at-17.46.09.png 2000w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">How to interact with AI models — Claire Lebarz, </span><a href="https://docs.google.com/presentation/d/1B5M5Dy1zQCFbbRNyhOyEAuz2STbxuTDs6xHf11b9KPA/edit?ref=blef.fr#slide=id.g27c2f639e3c_0_886" rel="noreferrer"><span style="white-space: pre-wrap;">UX for data products</span></a></figcaption></figure><p></p><h1 id="ai-news-%F0%9F%A4%96">AI News 🤖</h1><ul><li><a 
href="https://www.wired.com/story/chinese-startup-01-ai-is-winning-the-open-source-ai-race/?ref=blef.fr">This Chinese startup is winning the open source AI race</a> — Thanks to Mistral AI, open source became the new standard in the community. There is a Chinese company called 01.ai that wants to build the first killer app of Gen AI. (see also <a href="https://www.sequoiacap.com/article/clem-delangue-spotlight/?ref=blef.fr">open-sourcing the future of AI</a>, which is a HuggingFace praising post at some point).</li><li><a href="https://huggingface.co/blog/gcp-partnership?ref=blef.fr">Hugging Face and Google partner for open AI</a> —&nbsp;Don't be mistaken, it's open AI and not OpenAI 😬. This partnership will give Google Cloud customers unique hardware to train models, and HuggingFace users will get some benefits too, but I did not understand the corporate sentences from the press release.</li><li><a href="https://openai.com/blog/new-embedding-models-and-api-updates?ref=blef.fr">OpenAI new embedding models and API updates</a> — new Turbo models for GPT-4 and 3.5 and 2 new embedding models.</li><li><a href="https://medium.com/artefact-engineering-and-data-science/unleashing-the-power-of-langchain-expression-language-lcel-from-proof-of-concept-to-production-8ad8eebdcb1d?ref=blef.fr">Unleashing the power of LangChain</a> — From POC to production, it showcases the LangChain expression language that helps developers chain prompts in a nicer way.</li></ul><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li><a href="https://www.ign.com/articles/disney-unveils-the-holotile-floor-inching-us-closer-to-a-real-life-holodeck?ref=blef.fr">Disney Holotile VR floor</a> — Disney developed a "dynamic" floor for VR use-cases. With it you can walk without really moving. 
This is a bit disturbing but it can unlock the <a href="https://twitter.com/Guglielminetti/status/1749464272424354085?ref=blef.fr">metaverse future</a>.</li><li><a href="https://clickhouse.com/blog/clickhouse-one-billion-row-challenge?ref=blef.fr">ClickHouse and the one billion row challenge</a> — ClickHouse proposed a SQL solution with ClickHouse Local to the challenge, which consists in aggregating 1B rows from a text file. Initially this <a href="https://github.com/gunnarmorling/1brc?ref=blef.fr">challenge</a> had to be answered in Java. The leader submitted a solution running in less than 2s—<a href="https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_artsiomkorzun.java?ref=blef.fr">have fun</a>—while ClickHouse took 19s.</li><li><a href="https://tobikodata.com/sqlglot-jumps-on-the-rust-bandwagon.html?ref=blef.fr">SQLGlot jumps on the Rust bandwagon</a> —&nbsp;I really like SQLGlot, this is a SQL parser that gives you back the AST to do stuff. They ported the parser from Python to Rust and got a 30-40% performance improvement.</li><li><a href="https://kestra.io/blogs/2024-01-24-2024-data-engineering-trends?ref=blef.fr">2024 data engineering trends</a> — We are still in January so it's still valid: Anna captured a few things that will keep data teams busy this year. Firstly the reduction in resources, leading teams to do more with less (or at least doing the same with less).</li><li><a href="https://select.dev/posts/snowflake-batch-loading?ref=blef.fr">Snowflake batch data loading</a> — A good explanation of the Snowflake COPY INTO command and what you need to set up around it to make it work.</li><li><a href="https://store.metasnake.com/effective-pandas-book?ref=blef.fr">Effective pandas 2 is out</a> — <em>I did not read the book</em>. 
As pandas is, still, everywhere, it can be a good resource if you need to learn the 2.0 version.</li><li><a href="https://nightingaledvs.com/introducing-girls-to-code-one-flower-at-a-time/?ref=blef.fr">Introducing girls to code, one flower at a time</a> — An awesome initiative to introduce girls to code and data visualisation through creative coding projects; there is a <a href="https://data-garden.notion.site/Data-Garden-Guidebook-47a11bf555ab40bfbf68540d85067e9f?p=3e186c5b3c864f549fc8abf25b65c404&pm=s&ref=blef.fr">Notion guidebook</a> to do data visualisations with p5.js.</li><li><a href="https://github.com/kanton-bern/hellodata-be?ref=blef.fr">The open-source enterprise data platform in a single portal</a> — The canton of Bern community open-sourced a data platform blueprint to launch an all-in-one data platform with dbt, Airflow and Superset on top of Postgres and K.</li><li><a href="https://github.com/AxelThevenot/dbt-assertions?ref=blef.fr">Github/dbt-assertions</a> — A dbt package to write dbt tests at row-level and to save exceptions alongside your failing rows (cf. <a href="https://github.com/AxelThevenot/dbt-assertions/tree/main/models/examples/basic_example?ref=blef.fr">example</a>).</li><li><a href="https://medium.com/inthepipeline/use-this-updated-pull-request-comment-template-for-your-dbt-data-projects-de06f12fc38d?ref=blef.fr">PR comment template for dbt data projects</a> — Great stuff. This is a proposed GitHub pull request template for when you modify dbt models. It includes a description, a lineage diff, an illustration of model changes or impacts, and more.</li></ul><hr><p>See you next week ❤️.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ How to learn data engineering ]]></title>
                    <description><![CDATA[ How to learn data engineering in 2024? This article will help you understand everything related to data engineering. ]]></description>
                    <link><![CDATA[ /learn-data-engineering/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 6298c738e7d1fa003d604ac5 ]]></guid>
                    <pubDate><![CDATA[ 2024-01-20 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/06/image-1.png" class="kg-image" alt="" loading="lazy" width="900" height="675" srcset="https://www.blef.fr/content/images/size/w600/2022/06/image-1.png 600w, https://www.blef.fr/content/images/2022/06/image-1.png 900w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Learn data engineering, all the references (</span><a href="https://unsplash.com/photos/xFcoLGuhdGs?ref=blef.fr"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p>This is a special edition of the Data News. Right now I'm on holiday, finishing a hiking week in Corsica 🥾. So I wrote this special edition about: <strong>how to learn data engineering in 2024</strong>.</p><p>The aim of this post is to create a repository of important links and concepts we should care about when we do data engineering. Obviously I'm full of biases, so if you feel I missed something do not hesitate to ping me with stuff to add. The idea is to create a living reference about Data Engineering.</p><p></p>
<!--kg-card-begin: html-->
<p style="text-align:center;"><a data-portal="signup" style="cursor: pointer; background-color: #E4E6E1; padding: 10px 20px; border-radius: 5px;">📬 Subscribe to the excellent weekly newsletter 📬</a></p>
<!--kg-card-end: html-->
<p></p><h2 id="a-bit-of-context">A bit of context</h2><p>It's important to take a step back and understand where data engineering comes from. Data engineering inherits from years of data practices in big US companies. Hadoop initially led the way with Big Data and distributed computing on-premise, to finally land on the Modern Data Stack — in the cloud — with a data warehouse at the center.</p><p>In order to understand today's data engineering I think it's important to at least know the Hadoop concepts and context, and some computer science basics. </p><ul><li><a href="https://betterprogramming.pub/what-is-hadoop-b90591ffae89?ref=blef.fr">What is Hadoop?</a> A quick overview of what everyone used for years (and what some of us are still using). It's important to understand the distributed computing concepts, <a href="https://www.youtube.com/watch?v=PhdRyrmbRYQ&ref=blef.fr">MapReduce</a>, <a href="https://thisdataguy.com/2015/10/01/hortonworks-cloudera-or-mapr/?ref=blef.fr">Hadoop distributions</a>, <a href="https://www.thoughtworks.com/es-es/insights/decoder/d/data-locality?ref=blef.fr#:~:text=Data%20locality%20is%20the%20process,on%20your%20network%20and%20systems.">data locality</a>, HDFS.</li><li><a href="https://medium.com/@eczachly/data-data-engineering-the-past-present-and-future-ac3ad5795ddf?ref=blef.fr">Data &amp; Data Engineering — the past, present, and future</a>; this is a good overview of data engineering history.</li><li>This one is a gitbook with a lot of content, but I specifically recommend you read the <a href="https://jheck.gitbook.io/hadoop/introduction-to-data-engineering?ref=blef.fr">introduction to data engineering</a>.</li><li>In order to become a great data engineer you'll also need to understand computer science. <a href="https://www.explainthatstuff.com/howcomputerswork.html?ref=blef.fr">How do computers work?</a> Additionally, understand how the web works — frontend &amp; backend, deployment, etc.
This is oversimplified, but I did not find a simple resource on this topic, so if you have something, I'm interested.</li></ul><p></p><h2 id="who-are-the-data-engineers">Who are the data engineers?</h2><p>Every company out there has its own definition of the data engineer role. In my opinion we can easily say <strong>a data engineer is a software engineer working with data</strong>. The idea is to solve data problems by building software. Obviously, as data is different from a "traditional" product — in terms of users, for instance — a data engineer uses other tools.</p><p>In order to define the data engineer profile, here are some resources defining data roles and their borders.</p><ul><li><a href="https://medium.com/younited-tech-blog/data-organisation-why-are-there-so-many-roles-9c3992d0a436?ref=blef.fr">Data Organization: why are there so many roles ? — And why it is important to understand them</a>. This is one of the best syntheses about data roles. Furcy defined <em>Programming</em> as the core skill for data engineers.</li><li>To complete the picture, here are some <a href="https://www.lewagon.com/tech-jobs/data-science/data-engineer?ref=blef.fr">missions and skills</a> expected of data engineers. Warning: the article is from an online bootcamp, but they summarize everything pretty well. You can also have a look at the <a href="https://www.gov.uk/guidance/data-engineer?ref=blef.fr">gov.uk data engineer job card</a>; they detail the expectations for every seniority level.</li><li><a href="https://www.mihaileric.com/posts/we-need-data-engineers-not-data-scientists/?ref=blef.fr">We don't need data scientists, we need data engineers</a> —&nbsp;for years companies were hiring data scientists because it was booming, then realized they needed data engineers to team up with the scientists.
This post shows the data job market with numbers.</li></ul><p></p><h2 id="what-is-data-engineering">What is data engineering</h2><p>As I said before, data engineering is still a young discipline with many different definitions. Still, we can find a common ground by mixing software engineering, DevOps principles, an understanding of cloud — or on-prem — systems, and <a href="https://zeenea.com/what-is-data-literacy-tips-on-becoming-data-literate/?ref=blef.fr">data literacy</a>.</p><p>If you are new to data engineering <strong>you should start by reading the holy trinity from Maxime Beauchemin</strong>. Some years ago he wrote 3 articles defining the data engineering field.</p><ul><li><a href="https://medium.com/free-code-camp/the-rise-of-the-data-engineer-91be18f1e603?ref=blef.fr">The Rise of the Data Engineer</a></li><li><a href="https://maximebeauchemin.medium.com/the-downfall-of-the-data-engineer-5bfb701e5d6b?ref=blef.fr">The Downfall of the Data Engineer</a></li><li><a href="https://maximebeauchemin.medium.com/functional-data-engineering-a-modern-paradigm-for-batch-data-processing-2327ec32c42a?ref=blef.fr">Functional Data Engineering — a modern paradigm for batch data processing</a></li></ul><p><strong>There is a global consensus stating that you need to master a programming language (Python or Java based) and SQL in order to be self-sufficient.</strong> </p><p></p><h4 id="some-concepts">Some concepts</h4><p>When doing data engineering you touch a lot of different concepts.
<strong>Firstly, read the </strong><a href="https://connectingdots.xyz/blog/posts/2021/05/the-data-engineering-manifesto/?ref=blef.fr"><strong>Data Engineering Manifesto</strong></a>; this is not something <em>official</em> in any way, but it greatly depicts all the concepts data engineers face daily.</p><p>Then here is a list of global resources that can help you navigate the field:</p><ul><li><a href="https://github.com/datastacktv/data-engineer-roadmap?ref=blef.fr">The Data Engineer Roadmap</a> — An image with advice and technology names to watch.</li><li>The Reddit <a href="https://dataengineering.wiki/Concepts/Concepts?ref=blef.fr">r/dataengineering wiki</a>, a place where some data eng definitions are written down.</li><li>This book, <a href="https://www.oreilly.com/library/view/data-pipelines-pocket/9781492087823/?ref=blef.fr"><em>📘 Data Pipelines Pocket Reference</em></a><em>, </em>defines everything related to data pipelines and how to handle data movement from source to target.</li></ul><p></p><p>If we go a bit deeper, I think that every data engineer should have a basis in:</p><ul><li><a href="https://dataengineering.wiki/Concepts/Data+Modeling?ref=blef.fr">data modeling</a> — this relates to the way data is stored in a data warehouse; the field was cracked years ago by <a href="https://www.amazon.com/Data-Warehouse-Toolkit-Definitive-Dimensional/dp/1118530802?ref=blef.fr">Kimball dimensional modeling</a> and the <a href="https://www.youtube.com/watch?v=jXXfdscVyLc&ref=blef.fr">Inmon model</a>, but it recently got <a href="https://discourse.getdbt.com/t/is-kimball-dimensional-modeling-still-relevant-in-a-modern-data-warehouse/225?ref=blef.fr">challenged</a>, thanks to "infinite" cloud power, by the <a href="https://www.fivetran.com/blog/star-schema-vs-obt?ref=blef.fr">OBT</a> (one big table, or flat) model.
In order to complete your understanding of data modeling you should learn <a href="https://analyticsengineers.club/whats-an-olap-cube/?ref=blef.fr">what's an OLAP cube</a>. The cherry on the cake here is the <a href="https://www.holistics.io/blog/scd-cloud-data-warehouse/?ref=blef.fr">Slowly Changing Dimensions</a> — SCDs — concept.</li><li><a href="https://luminousmen.com/post/big-data-file-formats?ref=blef.fr">formats</a> — This is a huge part of data engineering: picking the right format for your data storage. The wrong format often means bad query performance and user experience. In a nutshell you have: text-based formats (CSV, JSON and raw stuff), columnar file formats (Parquet, ORC), memory formats (<a href="https://arrow.apache.org/docs/format/Columnar.html?ref=blef.fr">Arrow</a>), transport protocols and formats (Protobuf, Thrift, gRPC, Avro), table formats (<a href="https://www.dremio.com/subsurface/comparison-of-data-lake-table-formats-iceberg-hudi-and-delta-lake/?ref=blef.fr">Hudi, Iceberg, Delta</a>), and database and vendor formats (Postgres, Snowflake, BigQuery, etc.). Here is a small <a href="https://www.adaltas.com/en/2020/07/23/benchmark-study-of-different-file-format/?ref=blef.fr">benchmark</a> between some popular formats.</li><li><a href="https://en.wikipedia.org/wiki/Batch_processing?ref=blef.fr">batch</a> — Batch processing is at the core of data engineering. One of the major tasks is to move data from a source storage to a destination storage. In batch. On a regular schedule. Sometimes with transformation. This is close to what we also call ETL or ELT. The main difference between the two is whether your computation resides in your warehouse with SQL rather than outside of it, with a programming language loading data in memory. In this category I also recommend having a look at data ingestion (Airbyte, Fivetran, etc.), workflow (Airflow, Prefect, Dagster, etc.) and transformation (Spark, dbt, Pandas, etc.)
tools.</li><li><a href="https://en.wikipedia.org/wiki/Stream_processing?ref=blef.fr">stream</a> — Stream processing can be seen as the evolution of batch. <a href="https://towardsdatascience.com/the-magical-fusion-between-batch-and-streaming-insights-8f1353bfe4a?ref=blef.fr">It is not</a>. It addresses different use-cases and is often linked to real-time. The main technologies around streaming are message buses like Kafka and processing frameworks like Flink or Spark on top of the bus. Recently, all-in-one cloud services appeared to simplify real-time work. Also understand <a href="https://netflixtechblog.com/dblog-a-generic-change-data-capture-framework-69351fb9099b?ref=blef.fr">Change Data Capture</a> — CDC.</li><li>infrastructure — When you do data engineering it is important to master data infrastructure concepts. You'll be seen as the most technical person of the data team and you'll need to help your team with "low-level" stuff. You'll also be asked to put a data infrastructure in place: a data warehouse, a data lake or other concepts starting with "data". <strong>My advice on this point is to learn from others</strong>.
Read technical blogs, watch conferences and read 📘 <a href="https://www.oreilly.com/library/view/designing-data-intensive-applications/9781491903063/?ref=blef.fr"><em>Designing Data-Intensive Applications</em></a> (even if it could be overkill).</li><li>new concepts — in today's data engineering a lot of new concepts enter the field every year: quality, lineage, metadata management, governance, privacy, sharing, etc.</li></ul><p></p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/06/tobias-keller-2ecH5Lw3zSk-unsplash.jpg" class="kg-image" alt="" loading="lazy" width="2000" height="1429" srcset="https://www.blef.fr/content/images/size/w600/2022/06/tobias-keller-2ecH5Lw3zSk-unsplash.jpg 600w, https://www.blef.fr/content/images/size/w1000/2022/06/tobias-keller-2ecH5Lw3zSk-unsplash.jpg 1000w, https://www.blef.fr/content/images/size/w1600/2022/06/tobias-keller-2ecH5Lw3zSk-unsplash.jpg 1600w, https://www.blef.fr/content/images/2022/06/tobias-keller-2ecH5Lw3zSk-unsplash.jpg 2000w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Is it really modern? (</span><a href="https://unsplash.com/photos/2ecH5Lw3zSk?ref=blef.fr"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><h2 id="the-modern-and-the-future-data-stack">The modern (and the future) data stack</h2><p>Coming from Hadoop — also called the old data stack — people are now building modern data stacks. This is a new way to describe data platforms, with a warehouse at the core where all the company data and KPIs sit.
Below are some key articles defining this new paradigm.</p><ul><li>👍 <a href="https://www.getdbt.com/blog/future-of-the-modern-data-stack/?ref=blef.fr">The Modern Data Stack: Past, Present, and Future</a></li><li><a href="https://future.com/emerging-architectures-modern-data-infrastructure/?ref=blef.fr">Emerging Architectures for Modern Data Infrastructure</a></li><li><a href="https://towardsdatascience.com/the-new-generation-data-lake-54e10e02b757?ref=blef.fr">The New Generation Data Lake</a></li><li><a href="https://towardsdatascience.com/bootstrap-a-modern-data-stack-in-5-minutes-with-terraform-32342ee10e79?ref=blef.fr">Bootstrap a Modern Data Stack in 5 minutes with Terraform</a></li><li><a href="https://medium.com/alexandre-beauvois/modern-data-stack-as-a-service-1-3-1a1813c38633?ref=blef.fr">Modern Data Stack as a Service</a></li><li><a href="https://erikbern.com/2021/11/30/storm-in-the-stratosphere-how-the-cloud-will-be-reshuffled.html?ref=blef.fr">Storm in the stratosphere: how the cloud will be reshuffled</a></li><li>❤️ <a href="https://petrjanda.substack.com/p/a-path-towards-a-data-platform-that?ref=blef.fr">A path towards a data platform that aligns data, value, and people</a></li></ul><p>And now some articles I like that will help you find inspiration.</p><ul><li><a href="https://about.gitlab.com/handbook/business-technology/data-team/?ref=blef.fr">Gitlab Data Team Handbook</a> — One of the best data resources. This is public documentation on how the Gitlab data team does stuff.</li><li>Airbnb is great at exposing what they are doing in terms of data.
For instance with these 2 articles: <a href="https://medium.com/airbnb-engineering/how-airbnb-achieved-metric-consistency-at-scale-f23cc53dea70?ref=blef.fr">How Airbnb achieved metric consistency at scale</a> &amp; <a href="https://medium.com/airbnb-engineering/how-airbnb-built-wall-to-prevent-data-bugs-ad1b081d6e8f?ref=blef.fr">How Airbnb built “Wall” to prevent data bugs</a></li><li>Data engineering patterns are important — Dagster tried to <a href="https://dagster.io/blog/software-defined-assets?ref=blef.fr">introduce Software-Defined Assets</a> and Prefect spoke about <a href="https://www.prefect.io/guide/blog/positive-and-negative-engineering/?ref=blef.fr">Positive and Negative engineering</a>.</li><li><a href="https://building.nubank.com.br/scaling-data-analytics-with-software-engineering-best-practices/?ref=blef.fr">Scaling data analytics with software engineering best practices</a></li><li>Jesse Anderson; <a href="https://www.jesse-anderson.com/2018/11/creating-a-data-engineering-culture/?ref=blef.fr">Creating a Data Engineering Culture</a> and his book 📘 <a href="https://content.bigdatainstitute.io/books/data_engineering_teams/Data_Engineering_Teams.pdf?ref=blef.fr">Data Engineering Teams</a></li><li>What is MLOps? Some people wrote a white paper, <a href="https://arxiv.org/ftp/arxiv/papers/2205/2205.02302.pdf?ref=blef.fr">Machine Learning Operations (MLOps): Overview, Definition, and Architecture</a>, in which they write about roles and missions.</li></ul><hr><p>Once again, if you feel I forgot something important do not hesitate to tell me. I'll add more and more stuff to this article in the future.</p>
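To make the batch and functional-data-engineering ideas above concrete, here is a minimal, stdlib-only Python sketch of an idempotent batch job: pure extract/transform steps and a load that overwrites one date partition per run, so a retry or backfill never duplicates data. Everything in it (the extract/transform/load names, the in-memory storage dict standing in for a lake or warehouse stage) is illustrative and not taken from any specific tool.

```python
import csv
import io

def extract(raw_rows, day):
    # Filter the source down to a single date partition.
    return [r for r in raw_rows if r["day"] == day]

def transform(rows):
    # Pure function: aggregate revenue per country, deterministic output order.
    totals = {}
    for r in rows:
        totals[r["country"]] = totals.get(r["country"], 0) + int(r["amount"])
    return sorted(totals.items())

def load(aggregates, day, storage):
    # Idempotent load: overwrite the whole partition for `day`,
    # never append, so re-running the job leaves the same result.
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["country", "revenue"])
    writer.writerows(aggregates)
    storage[f"revenue/day={day}.csv"] = buf.getvalue()

# `storage` stands in for an object-storage bucket or a warehouse stage.
storage = {}
raw = [
    {"day": "2024-01-15", "country": "FR", "amount": "10"},
    {"day": "2024-01-15", "country": "FR", "amount": "5"},
    {"day": "2024-01-15", "country": "US", "amount": "7"},
    {"day": "2024-01-16", "country": "FR", "amount": "3"},
]
# Running the same day twice leaves storage unchanged: that's the idempotency.
for _ in range(2):
    load(transform(extract(raw, "2024-01-15")), "2024-01-15", storage)
```

In a real pipeline the schedule, retries and backfills would come from an orchestrator like Airflow, Prefect or Dagster, and the partition would live in object storage or a warehouse table.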
<!--kg-card-begin: html-->
<p>If you enjoyed this article please consider <a data-portal="signup" style="cursor: pointer;">subscribing</a> to my weekly newsletter about data where I demystify all these concepts. I help you save 5 hours of curation per week.</p>
<!--kg-card-end: html-->
 ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 24.03 ]]></title>
                    <description><![CDATA[ Data News #24.03 — ChatGPT in classes, Zuckerberg announcements, Bard and awesome news. ]]></description>
                    <link><![CDATA[ /data-news-week-24-03/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 65ab6d9ee4dcbc000139fdff ]]></guid>
                    <pubDate><![CDATA[ 2024-01-20 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/photo-1586348278474-312d4266bbc3?q=80&amp;w=1000&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxzZWFyY2h8NzV8fGFuZ2Vyc3xlbnwwfHwwfHx8MA%3D%3D" class="kg-image" alt="ice hockey players on ice hockey field" loading="lazy"><figcaption><span style="white-space: pre-wrap;">Walking in the street be like recently (</span><a href="https://unsplash.com/photos/ice-hockey-players-on-ice-hockey-field-W7t3cNm8LXk?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p>Hey, I hope this new edition finds you well. We are deep in the winter; it's time for comfy Data News to read near the fire 🔥.</p><p>This week, on Monday, I started my annual university lecture. It's been 9 years since I started teaching, and this year something was different. The students were incredibly calm. Obviously my course is a bit difficult at the beginning because it touches on concepts they are not used to—cloud, data in production, data engineering, etc.—so it's normal that they don't have questions at first. But still, even during exercises hands stayed down, when the previous year students were asking me for debugging help.</p><p>This year something was off.</p><p>On Wednesday I finally understood what had changed. It was ChatGPT. The whole class was using ChatGPT—I did a raised-hands survey and everyone said yes. So now the default go-to is to ask ChatGPT questions rather than ask me, and then, if ChatGPT does not have the answer, they might ask me.</p><p>I still don't know how to react to this. I think it does not make sense to ban ChatGPT, just like it was stupid to ban Google Search in my time. But still there is something to do; I need to research and think more about it.</p><p>I assume that education will be radically transformed.
Both ways. The way students learn will be different, but the way teachers teach will also have to be different. This will force us to bring something to the class that ChatGPT can't: humanity.</p><p></p><h1 id="ai-news-%F0%9F%A4%96">AI News 🤖</h1><ul><li><a href="https://openai.com/careers/elections-program-manager?ref=blef.fr">OpenAI opened an Elections Program Manager</a> — The role is to support the "efforts around elections security and integrity for the EMEA region". This year we have the European Parliament election. It deeply shows how OpenAI products are—or might be—used to win races. I guess they already have people for the US elections.</li><li><a href="https://www.axios.com/2024/01/17/alex-karp-davos-ai-us-advantage?ref=blef.fr">Palantir CEO: U.S. eating everyone's lunch on AI</a> — Let's continue with politics: at the World Economic Forum in Davos, Palantir's CEO said that within 10 years 95% of the world's top tech companies will be American. I don't see any difference from today.</li><li><a href="https://github.com/facebookresearch/audiocraft/blob/main/docs/MAGNET.md?ref=blef.fr">Meta released MAGNeT</a> —&nbsp;MAGNeT is a text-to-music and text-to-sound model capable of generating high-quality audio samples conditioned on text descriptions. It seems to work well for generating sound effects.</li><li><a href="https://www.instagram.com/zuck/reel/C2QARHJR1sZ/?hl=en&ref=blef.fr">Zuckerberg teased 2024 Meta AI strategy</a> — In a selfie video on Facebook / Instagram, Zucky explained that Llama 3 is coming and that Meta is building a massive 600k H100 NVIDIA GPU infrastructure. That represents around $27b in raw GPUs alone. But luckily for us Meta will open-source everything they do because they love us so much &lt;3. In exchange we can wear the new Ray-Ban Meta glasses with AI inside to give Meta more training data.
We are just seeing the world shifting, and we are all ok with it.<br><br>If you want a less salty opinion than mine, <a href="https://www.linkedin.com/feed/update/urn:li:activity:7153821341538824193/?updateEntityUrn=urn%3Ali%3Afs_updateV2%3A%28urn%3Ali%3Aactivity%3A7153821341538824193%2CFEED_DETAIL%2CEMPTY%2CDEFAULT%2Cfalse%29&ref=blef.fr">Oliver as always aced it</a>.</li><li><a href="https://www.youtube.com/watch?v=i01cizb6Txg&ref=blef.fr">Prompt engineering with Bard</a> — A recording from a local Google Developers meetup last November. In this talk we discover a few concepts on how to talk to Bard. Mainly, you can do it through the API or the UI, and Peter explains that Bard shines in creativity, factuality and reasoning. He also explains well the concept of grounding and why it matters. <br><br>Right after, he shows how you can "teach" Bard to reason with a reversed-word example in which Bard fails. In order to do it you have to ask Bard to write and execute code in the background, but to activate the code execution feature "you're at the mercy of the classifier". <strong>This is our future, being at the mercy of classifiers</strong>.</li><li><a href="https://www.404media.co/google-news-is-boosting-garbage-ai-generated-articles/?ref=blef.fr">Google News is boosting garbage AI-generated articles</a> — <em>This is a paid article</em>. The title speaks for itself.</li><li><a href="https://arxiv.org/pdf/2401.05566.pdf?ref=blef.fr">Paper, Sleeper agents</a> — A not-so-reassuring paper.
Anthropic's research team proved that it is possible to insert backdoors into models, and that the backdoor persists despite safety and adversarial training.</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/photo-1588581939864-064d42ace7cd?q=80&amp;w=1000&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D" class="kg-image" alt="books on shelves in library" loading="lazy"><figcaption><span style="white-space: pre-wrap;">I'm writing from a library today, I feel like a student (</span><a href="https://unsplash.com/photos/books-on-shelves-in-library-sd8uJsf4XM4?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li><a href="https://benn.substack.com/p/its-time-to-build?ref=blef.fr">It's time to build</a> — Still a big fan of Benn's content. This week he looks back at why his content shifted from the modern data stack to AI. Then there's all the marketing that goes into selling data tools to <em>unleash the power of your data</em>. Far from the trends and the spotlight, it's actually time to build tools.</li><li><a href="https://roundup.getdbt.com/p/my-thoughts-going-into-a-new-year?ref=blef.fr">My thoughts going into a New Year</a> — Tristan, dbt Labs CEO, wrote his thoughts about 2024.
He covers why AI has not yet impacted data jobs and writes about OSS licensing after <a href="https://snowplow.io/blog/introducing-snowplow-limited-use-license/?ref=blef.fr">Snowplow's recent changes</a>, renewing a statement that dbt Labs does not need to make such a change because they have a solid commercial path.</li><li><a href="https://www.castordoc.com/blog/understanding-the-eu-ai-act?ref=blef.fr">EU AI Act</a> — A nice-looking summary of what matters in the new EU AI Act, from the risk definitions to the potential fines. Then CastorDoc explains how a data catalog can help you oversee compliance.</li><li><a href="https://chriswarrick.com/blog/2024/01/15/python-packaging-one-year-later/?ref=blef.fr">Packaging, one year later: a look back at 2023 in Python packaging</a> — <strong>Understanding Python packaging is one of the most important skills to master when you want to enter the Python world.</strong> I've seen too many people struggle with their development workflow because they are not used to pip and all. Chris wrote a follow-up to last year's post about the <a href="https://chriswarrick.com/blog/2023/01/15/how-to-improve-python-packaging/?ref=blef.fr">sad state of Python packaging</a>, explaining standards and proposing things for the future.</li><li><a href="https://engineering.fb.com/2024/01/18/developer-tools/lazy-imports-cinder-machine-learning-meta/?ref=blef.fr">How lazy imports accelerate machine learning at Meta</a> — Meta developed their own implementation of CPython called Cinder. In order to speed up model training time they switched to Cinder and decided to use lazy imports.</li><li><a href="https://www.synq.io/blog/measuring-data-quality?ref=blef.fr">Measuring data quality: bringing theory into practice</a> — Mikkel is one of the best when it comes to putting the correct words on data quality issues.
You should read this article to clarify these concepts.</li><li><a href="https://towardsdatascience.com/big-o-a-practical-approach-319a6a3c8b27?ref=blef.fr">Big O — A practical approach</a> —&nbsp;Big O notation is taught at school and is super important when programming, especially in data, where complexity has to be understood to speed up data transformations. This article gives you what's important. o/</li><li><a href="https://doordash.engineering/2024/01/16/staying-in-the-zone-how-doordash-used-a-service-mesh-to-manage-data-transfer-reducing-hops-and-cloud-spend/?ref=blef.fr">How DoorDash used a service mesh and saved costs</a>.</li><li><a href="https://robertsahlin.substack.com/p/datahem-odyssey-the-evolution-of-95f?ref=blef.fr">The evolution of a data platform</a>, part 2 — Part 2 on MatHem's analytical platform. On GCP: BigQuery at the center, with events flowing from Pub/Sub / Dataflow, and great usage of all the Google Cloud services.</li><li><a href="https://medium.com/apache-airflow/airflow-evolution-at-snap-c988cdd95abd?ref=blef.fr">Airflow evolution at Snap</a> — Large companies need multi-tenancy and Snap is one of them.
This article shows all the different architectures Snap put in place to deploy Airflow at scale.</li><li><a href="https://airbyte.com/blog/integrating-airbyte-with-data-orchestrators-airflow-dagster-and-prefect?ref=blef.fr">Integrating Airbyte with data orchestrators: Airflow, Dagster and Prefect</a> — An orchestrator comparison and how Airbyte can be used for extract-and-load within them.</li><li><a href="https://medium.com/snowflake/convert-your-pyspark-code-to-snowpark-code-using-snowconvert-c2234691cc5e?ref=blef.fr">Convert your PySpark code to Snowpark code using SnowConvert</a> — Snowflake trying to attract Databricks customers.</li><li><a href="https://hoffa.medium.com/hey-snowflake-send-me-a-fancy-email-fe04ad2c9888?ref=blef.fr">Hey Snowflake, send me a &lt;fancy&gt; HTML email</a> — This is one of the features I'm the most unsure about. Do I want my warehouse to be able to send emails without my global orchestration system being aware of it? Yes, because it's cool to give freedom to users... but no, because as a data engineer I want a platform where flows are controlled.
What's your take on this?<br><br>If you want to do it with BigQuery, you should take a look at my friend's <a href="https://github.com/unytics/bigfunctions?ref=blef.fr">BigFunctions</a> (see the <a href="https://unytics.io/bigfunctions/bigfunctions/?ref=blef.fr#send_mail">send_email</a> function).</li><li><a href="https://github.com/borjavb/bq-lineage-tool?ref=blef.fr">bq-lineage-tool</a> — Java code that uses ZetaSQL to build a column-level lineage parser for BigQuery.</li><li><a href="https://dlthub.com/docs/blog/dlt-dbt-runner-on-cloud-functions?ref=blef.fr">Comparison running dbt-core and dlt-dbt runner on Functions</a> — If you're running dbt-core within Cloud Functions, check this article integrating dbt with dlt to avoid a weird hacky subprocess.</li><li><a href="https://docs.getdbt.com/blog/serverless-dlt-dbt-stack?ref=blef.fr">Build a personal real estate dashboard with dlt and dbt</a> — Another example where you can chain dlt—to extract and load data—and dbt—to transform data—this time to build a real estate dashboard to find your dream property in Portugal.</li></ul><p></p><h1 id="data-economy-%F0%9F%92%B0">Data Economy 💰</h1><ul><li><a href="https://phospho.ai/?ref=blef.fr"><strong>Phospho</strong></a> <a href="https://tech.eu/2024/01/17/elaia-and-ycombinator-back-phospho-with-1-7m-for-genai-application-monitoring/?ref=blef.fr">raises €1.7m in pre-seed</a> to build GenAI monitoring applications.</li><li><a href="https://skyengine.ai/se/?ref=blef.fr"><strong>SKY ENGINE AI</strong></a> <a href="https://skyengine.ai/se/skyengine-blog/136-sky-engine-ai-raises-7m-to-accelerate-vision-ai-development-for-automotive-robotics-medical-diagnosis-more?ref=blef.fr">raises $7m Series A</a>, with a platform that generates synthetic data for deep learning vision algorithms. It lets you create 3D stuff that you can use to train algorithms.</li></ul><hr><p>See you next week ❤️.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 24.02 ]]></title>
                    <description><![CDATA[ Data News #24.02 (late) — First DN edition of the year, let&#39;s catchup with awesome content written these last weeks. ]]></description>
                    <link><![CDATA[ /data-news-week-24-02/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 659ad4dce4dcbc000139babe ]]></guid>
                    <pubDate><![CDATA[ 2024-01-15 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/photo-1570616969692-54d6ba3d0397?q=80&amp;w=1000&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D" class="kg-image" alt="people sitting on chair" loading="lazy"><figcaption><span style="white-space: pre-wrap;">Back to school (</span><a href="https://unsplash.com/photos/people-sitting-on-chair-w1FwDvIreZU?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p>Hello you. Back to the usual Data News—with a little delay, I'm sorry.</p><p>First of all, I'd like to thank you for your positive comments on <a href="https://www.blef.fr/2024/">last week</a>'s article. It's a subject close to my heart and I was very happy to share it with you, because I never thought that Data News would become such a big part of my life.</p><p>I'm starting my annual university lectures today. It's always very exciting to go back and teach students, to help them discover the world of data from another perspective. The details: it's a 27-hour course called <em>DataOps</em>. It's quite a broad subject. I actually cover data engineering and how to put data stuff into production.</p><p>For years I gave a 30-hour lecture called <em>Python for Data Science</em> in which I covered the basics of Python, pandas and scikit-learn. But I stopped 2 years ago because it was too much and too repetitive for me. I'm very happy with this new <em>DataOps</em> lecture because it's much closer to what's really going on in the data world.</p><p>Over the years I've accumulated exercises and one day—I hope this year—I'll share them with everyone in a nice way. </p><p>It's funny because in the days leading up to the lecture, I'm always stressed about something: <strong>I'm always afraid I'm going to run out of content</strong>. 
The last thing I want to do is give a boring class.</p><p>Wish me luck and have fun reading the news.</p><h1 id="ai-news-%F0%9F%A4%96">AI News 🤖</h1><ul><li><a href="https://www.youtube.com/watch?v=PkXELH6Y2lM&ref=blef.fr">Bill Gates talks with Sam Altman</a> — A 30-minute episode of Bill Gates' podcast where he chats with Sam Altman.</li><li><a href="https://towardsdatascience.com/navigating-the-ai-landscape-of-2024-trends-predictions-and-possibilities-41e0ac83d68f?ref=blef.fr">14 predictions about AI</a> — In a long form article, Vincent shares his predictions about AI and the trends we might see in 2024. Garbage in, garbage out, still one of the most important issues. Personally, I have a question for authors in 2024: when are you going to stop generating images to illustrate articles? They're horrible and destroy the content. If I have to predict something, it would be that this trend stops.</li><li><a href="https://people.eecs.berkeley.edu/~evonne_ng/projects/audio2photoreal/?ref=blef.fr">Meta, from audio to synthesized humans in conversation</a> — Do we finally see an outcome of the billions Meta invested in the Metaverse 🙃. To be honest this is impressive, from audio alone Meta is capable of generating a photorealistic avatar that behaves as if it were you speaking.</li><li><a href="https://engineering.fb.com/2024/01/11/ml-applications/meta-advancing-genai/?ref=blef.fr">How Meta is advancing Gen AI</a> —&nbsp;a podcast about Meta GenAI breakthroughs. </li><li><a href="https://www.theverge.com/2024/1/4/24023809/microsoft-copilot-key-keyboard-windows-laptops-pcs?ref=blef.fr">Microsoft will add a Copilot key to Windows keyboards</a> — This might be a major change to Windows computers and keyboards: Microsoft wants to add a physical AI trigger on every keyboard. 
Might be the best adoption trigger we ever saw.</li><li><a href="https://bytesdataaction.substack.com/p/coding-with-chatgpt?ref=blef.fr">I coded exclusively with ChatGPT for 30 Days</a> — Good takeaways about a nice experiment.</li><li><a href="https://www.linkedin.com/pulse/why-we-invested-hugging-face-ibmventures-dyexc%3FtrackingId=bJZR%252F8YdTWynXWKMrrUdaQ%253D%253D/?trackingId=bJZR%2F8YdTWynXWKMrrUdaQ%3D%3D&ref=blef.fr">IBM explaining why they invested in HuggingFace</a> — <em>During a gold rush, sell shovels</em>. It explains NVIDIA's 2023 success, and HuggingFace thrives for the same reason. HF became the de facto platform for sharing and showcasing AI models.</li><li><a href="https://osanseviero.github.io/hackerllama/blog/posts/sentence_embeddings/?ref=blef.fr">Sentence embeddings</a> — After reading this article you will be able to do a PhD in embeddings. Personally I did not read it, but if you want to understand embeddings you should. </li></ul><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li>dbt related stuff<ul><li><a href="https://dbterd.datnguyen.de/latest/nav/guide/dbt-cloud/download-artifact-from-a-job-run.html?ref=blef.fr">Download artifacts from your dbt Cloud job runs</a> —&nbsp;a tutorial for a CLI tool that generates ERD diagrams for dbt Cloud projects.</li><li><a href="https://leo-godin.medium.com/testing-dbt-macros-a80e76243ae4?ref=blef.fr">Testing dbt macros</a> — A clever pattern to write unit tests on dbt macros with a model computing all the possible macro values and a dbt test checking all the possible cases.</li><li><a href="https://medium.com/teads-engineering/unit-testing-with-dbt-fb84f2ef7dd6?ref=blef.fr">Unit testing dbt models</a> — Using the <a href="https://github.com/EqualExperts/dbt-unit-testing?ref=blef.fr">dbt-unit-testing</a> package Matthieu showcases how you can easily test your models.</li><li><a href="https://tayloramurphy.substack.com/p/the-dbt-meta-tag?ref=blef.fr">dbt meta tag</a> —&nbsp;A list 
of the companies having product features depending on the <code>meta</code> tag. It shows how deeply dbt changed the data world.</li></ul></li><li><a href="https://www.arecadata.com/what-would-i-do-differently-about-getting-into-data-engineering-2024/?ref=blef.fr">What I would do differently getting into Data Engineering</a> — Data engineering has changed a lot in recent years and Daniel gives three pieces of advice that you should consider to get into data engineering: learn SQL, be social and learn to say no.</li><li><a href="https://towardsdatascience.com/lead-data-engineer-career-guide-699e806111b4?ref=blef.fr">Lead Data Engineer career guide</a> — Detailed skillsets needed to be a lead data engineer.</li><li><a href="https://leaddev.com/team/effectively-managing-junior-developers-remote-teams?ref=blef.fr">Effectively managing junior developers on remote teams</a> — In the current state of the ecosystem it is super important to give juniors a proper introduction to the data world.</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/photo-1610534440162-e0e68fbdeca3?q=80&amp;w=1000&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D" class="kg-image" alt="grayscale photo of people walking on street" loading="lazy"><figcaption><span style="white-space: pre-wrap;">Time to sleep (</span><a href="https://unsplash.com/photos/grayscale-photo-of-people-walking-on-street-D9FQYwAclwQ?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><div class="kg-card kg-callout-card kg-callout-card-green"><div class="kg-callout-emoji">🫠</div><div class="kg-callout-text">I'm sorry, it's midnight when I'm writing this. 
To be able to publish on Monday morning I don't have the time to read all the following articles.</div></div><ul><li><a href="https://andrew-jones.medium.com/every-data-transform-is-technical-debt-a6d09d3961e5?ref=blef.fr">Every data transform is technical debt</a>.</li><li><a href="https://medium.com/@vutrinh274/you-dont-know-this-for-sure-how-bigquery-stores-semi-structured-data-a80adc6060de?ref=blef.fr">How BigQuery stores semi-structured data?</a> —&nbsp;It relates to Dremel and parquet structures.</li><li><a href="https://engineering.mixpanel.com/how-mixpanel-built-a-fast-lane-for-our-modern-data-stack-680701736f8c?ref=blef.fr" rel="noreferrer">Mixpanel modern data stack <strong>fast lane</strong></a><strong>.</strong></li><li><a href="https://netflixtechblog.com/rebuilding-netflix-video-processing-pipeline-with-microservices-4e5e6310e359?ref=blef.fr">Netflix video processing rebuilt with microservices</a>.</li><li><a href="https://medium.com/data-monzo/how-we-built-year-in-monzo-unlocking-the-data-magic-74c880e32378?ref=blef.fr">How Monzo built <em>Year in Monzo</em></a><em>.</em></li><li><a href="https://engineering.hometogo.com/part-iii-a-b-testing-at-hometogo-running-the-whole-a-b-pipeline-on-snowflake-70f0996b12e6?ref=blef.fr">A/B Testing at HomeToGo</a>.</li><li><a href="https://www.datadoghq.com/blog/engineering/crunchconf-talk-self-serve-analytics/?ref=blef.fr">Datadog, scaling self-serve analytics, serving 5000 employees</a> — 🤯.</li><li><a href="https://towardsdatascience.com/2024-the-year-of-the-value-driven-data-person-f7f2b6344a5a?ref=blef.fr">2024: the year of the value-driven data person</a>.</li><li><a href="https://datamonkeysite.com/2024/01/08/using-arrow-and-delta-rust-to-transfer-data-from-bigquery-to-fabric-onelake/?ref=blef.fr">Transfer data from BigQuery to Fabric with Arrow and Rust</a>.</li><li><a 
href="https://cloud.google.com/blog/products/networking/eliminating-data-transfer-fees-when-migrating-off-google-cloud?hl=en&ref=blef.fr">Removing egress fees when moving off Google Cloud</a>.</li><li><a href="https://robertsahlin.substack.com/p/datahem-odyssey-the-evolution-of?r=7bvua&utm_campaign=post&utm_medium=web&ref=blef.fr">The evolution of a data platform</a>.</li><li><a href="https://motherduck.com/blog/introducing-fixit-ai-sql-error-fixer/?ref=blef.fr">Fixit, MotherDuck SQL AI error fixer</a>.</li><li><a href="https://docs.malloydata.dev/blog/2024-01-09-whats-next-in-2024?ref=blef.fr">What's next for Malloy in 2024</a>.</li></ul><p></p><h1 id="data-economy-%F0%9F%92%B0">Data Economy 💰</h1><ul><li><a href="https://www.talend.com/blog/update-on-the-future-of-talend-open-studio/?ref=blef.fr"><strong>Talend</strong> will shut down Talend Open Studio</a>, their open-source version, on January 31. As a reminder, Talend was acquired by Qlik 9 months ago. This is probably a strategy to keep money flowing. See you Talend 👋.</li><li><a href="https://siliconangle.com/2023/12/18/alteryx-acquired-private-equity-firms-4-4b-deal/?ref=blef.fr" rel="noreferrer"><strong>Alteryx</strong> to be acquired by private equity firms in $4.4B deal</a>. OK.</li></ul><p></p><hr><p>See you soon ❤️.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — 2024 ]]></title>
                    <description><![CDATA[ 2024 — Let&#39;s conclude 2023 and open 2024 with an open article about what I do and what I&#39;d love to improve. This is a bit personal I hope you&#39;ll like it. ]]></description>
                    <link><![CDATA[ /2024/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 6596b93ce4dcbc000139b61d ]]></guid>
                    <pubDate><![CDATA[ 2024-01-07 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/photo-1491382825904-a4c6dca98e8c?q=80&amp;w=1000&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D" class="kg-image" alt="silhouette of person standing on sea dock under cloudy sky" loading="lazy"><figcaption><span style="white-space: pre-wrap;">Thoughts. Backward and forward. (</span><a href="https://unsplash.com/photos/silhouette-of-person-standing-on-sea-dock-under-cloudy-sky-g1TWbj5XYb4?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p><strong>Hello, it's 2024.</strong> I hope you're well and that you've ended 2023 on a high note with your loved ones. I wish you a Happy New Year and all the best for 2024. I'm very happy to have the privilege of corresponding with you and it honours me.</p><p>This edition of Data News will focus on the end of 2023 with a good retrospective about me and my activities—content and freelancing. Of course, it will also look ahead to 2024, and I'll try to set the vision for 2024. But you know how bad I am with goals.</p><h1 id="lets-wrap-up-2023">Let's wrap-up 2023</h1><p>In technology, we live in a fast-paced environment and when you compare yourself to others or try to keep up with all the news, it's easy to get FOMO. This year was perhaps the key moment when I managed to step away from vanity metrics and take breaks away from work. With the exception of my trip to Japan in 2019, I think this is the first time in my life it's happened in this way. 
Even my parents noticed I was away from the computer for a week over Christmas.</p><p>The next step for me is finally to recognise that even if I have a deep feeling that I haven't achieved anything in 2023, that I'm always behind on my ideas, that feeling is wrong and I should be proud.</p><p><strong>Proud about my content, my professional and my personal life.</strong></p><h3 id="content-%E2%80%94-i-dont-do-it-for-fame">Content — I don't do it for fame</h3><p>When I started content creation, my North Star was to create an international audience while being in France—hence in Europe—aiming, at my level, to balance everything in data being said/made in the US.</p><p>Then, I also stated that I wanted to produce content that I'd like to read; actually all the stuff I produce is something that helps me sharpen my ideas and save them for later. If I like my own content others will too, because I'm just a normal person.</p><p>What 2023 brought:</p><ul><li><strong>Followers</strong> —&nbsp;I doubled my followers on my 3 main platforms: I reached 4000 people on the blog, 8000 on <a href="https://www.linkedin.com/in/christopheblefari/?ref=blef.fr">LinkedIn</a> and almost 600 on <a href="https://twitter.com/_Blef?ref=blef.fr">Twitter</a> (even if I don't post that much there). My only target is, to be honest, to grow my blog; having 4000 people who trusted me enough to enter their email and validate the subscription is just crazy.</li><li><strong>The blog</strong> —&nbsp;46 articles published in 2023, this is way less than in 2022 but it's ok. In terms of views my blog got an increase of 67%, going up to 36k unique visitors, just wtf.</li><li><strong>Video &amp; audio</strong> — This year I've published 3 podcast episodes, which is way less than what I initially wanted, and I participated in 2 podcast episodes with <a href="https://www.datageneration.co/?ref=blef.fr">DataGen</a>. 
This is something I want to change in 2024.</li><li><strong>Conferences</strong> — I've spoken at a few meetups and online conferences; my main issue here is that every time I do a talk I want it to be a unique experience, so it takes so many prep hours. Something I should change maybe. On the same topic we ran the Paris Airflow meetup for 6 months but took a never-ending break after the summer holidays.</li></ul><p>In the end, I've done more things IRL and I've met a lot of people I wouldn't have met if I hadn't been visible online and I think that's the big W of 2023. That's what brings me the most joy in fact. <strong>Thank you all for being so nice and supportive with me ❤️.</strong></p><p>In conclusion I'm happy with this. But frustrated not to have done more. Still, I decided not to focus entirely on content creation so I kept room for freelancing and personal life.</p><h3 id="professional-%E2%80%94-the-limits-of-freelancing">Professional — The limits of freelancing</h3><p>I started my freelance career almost 4 years ago, in a world affected by COVID. At that time, remote work was the new norm in tech and people were surprised that I decided to go down the unstable route when COVID was already enough.</p><p>And I don't regret it: over the last few years I've worked on the most pleasant projects with people I really enjoyed working with. I'm lucky enough to be able to choose the companies I work with, people who understand my requirements. Being able to take time for myself while working wherever I want on projects I choose is something I wish everyone could do. It changed the way I see work.</p><p>It's time for a review of 2023.</p><ul><li><strong>Revenue</strong> — My freelancing activity is stable, last year I billed almost the same revenue as in 2022, around 140k€, while taking more big breaks. 
In terms of clients, I had fewer than in 2022 because my main client kept me busy.</li><li><strong>Projects</strong> — The few noticeable projects I've worked on<ul><li>I designed, developed and deployed a reporting application with Apache Superset. This is for the French gov, for more than 60k users (+12k weekly), it contains more than 10 dashboards with 5 custom visualisations in React—you can see an <a href="https://pad.numerique.gouv.fr/liivpMjUR4e0WcGTUpavkw?both&ref=blef.fr">example screenshot here</a>.</li><li>At the same time for the gov I've worked on a larger project to develop a private datalake to work on datasets with on-demand RStudio and Jupyter containers. For this I deployed a private Kubernetes cluster with MinIO, Keycloak, LDAP (for auth) and <a href="https://www.onyxia.sh/?ref=blef.fr">Onyxia</a> on top to deploy containers.</li><li>Then I've worked to implement a few small data stacks (revolving around ELT and a warehouse) and helped 3 companies migrate from something to dbt.</li></ul></li><li><strong>Partnerships</strong> — I had a few discussions with people about partnerships in 2023 and did not really push it forward, but I should next year.</li><li><strong>Angel investment</strong> — I did my first two investments recently; it continues my content North Star, putting light on stuff made in Europe. I'm so happy to finally open this path. Welcome <em>blef ventures.</em> More on this soon.</li><li><strong>Data engineering</strong> — Data engineering is changing and my work is changing. When I started data engineering in 2014, the term didn't even exist. Moving data from A to B has always been something fun for me. But over the years something changed and I might want something else, as evidenced by my work for the gov, where I do engineering. Data engineering is moving towards the left, creating a deeper gap between data users and the underlying layers. 
More on my views about this in a coming article.</li><li><strong>Other freelancers</strong> —&nbsp;When I started, in France we were only a few doing freelance data engineering. Now, because of all the layoffs, the way work is changing and the promise of money, a lot of people have entered the game. I've met a lot of them and tried to give advice, but it probably means I need to renew my offering as well.</li></ul><p>I'm happy about 2023, but it brought a few big issues in my daily routine that I want to fix next year:</p><ul><li>I feel alone in my daily work —&nbsp;working partially and remotely for companies isolated me a bit and after 4 years it's time for a social boost</li><li>Fuck dopamine —&nbsp;all the attention business distracts me so much and I lose focus so fast, especially when I open Twitch. I changed my phone routine and spend way less time on it but I have to change something on my computer as well.</li><li>Get things done —&nbsp;When it comes to finishing tasks, I'm good when it's for a client, but when it's for myself, there is huge room for improvement.</li><li>Administrative tasks —&nbsp;...</li></ul><h3 id="personal-%E2%80%94-catch-me-in-a-train">Personal —&nbsp;Catch me in a train</h3><p>In 2023 I've achieved a great Work-Life-Balance. I'm so happy and in love with my girlfriend, who is freelancing as well, so the rhythm is kinda the same for us and we have the same kind of issues even if we are not working in the same areas—she's in the movie industry.</p><ul><li><strong>Travel</strong> — We still mainly live in Berlin, and I travelled a lot between Paris—where my business is—and Berlin. These ~10 trips represent altogether around 80kg in carbon emissions. In comparison I went once to Malaga last year to follow my gf for work and it's just insanely more (300kg eCO2).</li><li><strong>Sport</strong> — Since August I started running again, 550 km since. Twice a week for 2 months then 4 times a week and I've never been happier. 
Some people might remember but it was my 2022 goal to run once a week. It took me 1.5 years to reach it. The target is 45 mins for a 10k next year. I also started bouldering, unexpectedly, and I like it. Still, in 2023 I cycled less than in previous years.</li><li><strong>Friendships</strong> — Met a few new friends and I'm so happy with this because once you pass 30 I feel that creating new relationships becomes way more difficult.</li></ul><h3 id="a-few-articles">A few articles</h3><p>That's a wrap for 2023, and because it's the Data News, here are a few articles that people (and I) liked in 2023 that you might find interesting:</p><ul><li><a href="https://count.co/canvas/pB7iGb4yyi2?ref=blef.fr">Count.co SQL guide</a> — An infinite canvas with content to discover and improve your SQL.</li><li><a href="https://maxhalford.github.io/blog/kpi-evolution-decomposition/?ref=blef.fr">Answering "Why did the KPI change?" using decomposition</a> — An excellent article by Max, I really enjoyed both the content and the form.</li><li><a href="https://docs.malloydata.dev/blog/2023-10-03-malloy-four/?ref=blef.fr#announcing-malloy-4-0">Malloy 4.0 announcement</a> — Malloy is a new language created to transform and analyse data, it transpiles to SQL. 
I did not have the time to play with it but I will in 2024.</li><li><a href="https://www.startdataengineering.com/post/code-patterns/?ref=blef.fr">Coding patterns in Python</a> — a great list of Python patterns to know.</li></ul><p></p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/photo-1699275303964-a9a1a8ae8c6b?q=80&amp;w=1000&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D" class="kg-image" alt="a close up of a cell phone screen with numbers on it" loading="lazy"><figcaption><span style="white-space: pre-wrap;">Here we go again (</span><a href="https://unsplash.com/photos/a-close-up-of-a-cell-phone-screen-with-numbers-on-it-mis7syjThUU?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><h1 id="2024">2024</h1><p>2024 marks the tenth anniversary of my entry into working life. My 6-month internship started in April 2014 and I was developing a custom drag-n-drop dashboard application with Django and D3 as a project. Fast-forward 10 years and projects haven't changed 😅.</p><p>Once again I wish you the best for 2024.</p><p>If you've been following me for a long time you know that I'm super bad at resolutions. Let's instead list ideas and stuff I'll be proud of in January 2025 when writing the 2024 post.</p><ul><li><strong>Keeping the habits</strong> — I often repeat this is not about motivation but discipline. Let's continue the habits I have: running and the newsletter.</li><li><strong>Adding new habits</strong> — I'd like to add at least 2 habits, especially in content creation: this year I want to reboot my YouTube channel and stick to podcast publication.</li><li><strong>Create courses</strong> — For all my pro career I've written courses—I've been teaching since 2015—but I've never really created something for people online. 
It's time.</li><li><strong>Release 2 products</strong> — I want to release 2 products that can live by themselves next year, one around the blog, the other we will see.</li><li><strong>I want to find my new pro journey. If you have ideas, hit me up.</strong></li><li><strong>Invest in 4 companies</strong> — If you are on this journey, same, hit me up.</li></ul><hr><p>Let's go and see you this Friday for a more traditional newsletter. I'm sorry for this long format but the change of year is noteworthy enough.</p><p>Thanks again ❤️. I wish you a good end of Sunday.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — December 2023 ]]></title>
                    <description><![CDATA[ Data News #23.52 — Last Data News of 2023, a curation of articles from December 2023 and a few news from my side. I wish you happy new year. ]]></description>
                    <link><![CDATA[ /data-news-week-23-52/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 6569e54ef8c8580001b82eef ]]></guid>
                    <pubDate><![CDATA[ 2023-12-31 ]]></pubDate>
                    <content>
<![CDATA[ <p>Hi, it's been a while since I last posted something here. Happy new year 🎉. I hope you haven't forgotten about me. A lot of things have been happening at the same time in my professional and personal life. To be honest, everything's been going well, but I've found it hard to find time to write among other things.</p><p>And that's the problem. I want to do so many things at once. It's quite funny because when I'm coaching someone, one of the first pieces of advice I give them is to stay focused and avoid multitasking, but when it comes to me... Yeah, you know.</p><p>However, some excellent articles have been written and I want to end 2023 with one last big wrap on these December articles. I'd also like to say hello to all the newcomers who arrived in December, thank you for your trust. We're going to get to know each other.</p><p>Before moving on to the Data News, a bit of personal news: in December, I took part in the MotherDuck meetup in Berlin. I presented what I believe to be the <a href="https://www.youtube.com/watch?v=eqyIiWMbXv4&ref=blef.fr">future, based on my DuckDB experiments</a>. I've especially been amazed by DuckDB in the browser with WASM. I'll also go to the <a href="https://duckdb.org/2023/10/06/duckcon4.html?ref=blef.fr">DuckCon in Amsterdam</a> on February 2nd—pm me if you're going.</p><p>End of January, on the 31st I'll speak at a <a href="https://datanosco.com/modern-data-stack/?ref=blef.fr">Modern Data Stack conf</a> in Paris, still about DuckDB, but this time in French. I also took part in my friend's podcast where <a href="https://www.youtube.com/watch?v=vEguK-J2QIg&ref=blef.fr">we discussed 3 trends in data</a>: data modeling, real-time analytics and DataOps.</p><p>My retroprojective—a retro of 2023 with a projection into 2024—will soon be written. 
It will talk about my search for a new spicy adventure, the fact that I've finally taken up running again, my new journey as an angel investor, and so on.</p><p>Enjoy this last 2023 Data News.</p><p></p><h1 id="ai-news-%F0%9F%A4%96">AI News 🤖</h1><ul><li><a href="https://bbycroft.net/llm?ref=blef.fr">An interactive 3D explanation of LLMs</a> — Explaining complex things the visual way is the best. This one details all the components in an LLM—a big part explains what a Transformer is.</li><li><a href="https://mehdio.substack.com/p/llms-for-builders-jargons-theory?ref=blef.fr">LLMs for builders: jargons, theory &amp; history</a> — Mehdi compiled in a large article all the necessary vocab to follow a basic conversation about generative AI. He even quickly explains how you can run a model on your computer.</li><li>Cocorico 🐓. Mistral AI, one of the French "OpenAI" startups, entered the field, setting new standards and gaining recognition. They released their first <a href="https://mistral.ai/news/la-plateforme/?ref=blef.fr">AI endpoints</a>: generative and embedding. When it comes to generation they currently have 3 models: tiny, small and medium, which perform well against GPT-3.5. <strong>At the same time they released Mixtral 8x7B, the first open-source model of this calibre under an Apache Licence</strong>. And the weights are open-source as well.</li><li><a href="https://blog.samaltman.com/what-i-wish-someone-had-told-me?ref=blef.fr">What I wish someone had told me</a> — It's borderline AI news, but as the author is Sam Altman, I think it belongs here. After the whole Hollywood thing around Sam being pushed out and then coming back, Sam clickbaited us. 
He's written 17 great HR / team building tips—but they have nothing to do with the drama we were all here for.</li><li><a href="https://blog.fal.ai/building-applications-with-real-time-stable-diffusion-apis/?ref=blef.fr">Building applications with real-time stable diffusion APIs</a> — fal has written a great article about how you can use WebSockets in Javascript to interact in real-time with a Python backend and stable diffusion. It includes a great demo of image generation from a sketch. It gives so many ideas.</li><li><a href="https://gael-varoquaux.info/programming/people-underestimate-how-impactful-scikit-learn-continues-to-be.html?ref=blef.fr">People underestimate how impactful Scikit-learn continues to be</a> — The year is coming to an end and LinkedIn is playing the 2024 predictions game. Obviously no one will get it right. At the same time one of the Scikit-learn co-founders put the church back at the city center—this is a French expression poorly translated. Scikit is still the most used library when you look at some numbers and LLMs have yet to bridge the gap in usage.</li><li><a href="https://platform.openai.com/docs/guides/prompt-engineering?ref=blef.fr">OpenAI prompt engineering guide</a> — Wow, an official guide to become a prompt engineer /s. 
Seriously, it seems it contains good tips for communicating with the model.</li><li>Google announced <a href="https://www.youtube.com/watch?v=UIZAiXYceBI&t=171s&ref=blef.fr">Gemini</a>, their new multimodal model "beating" GPT-4, but fooled us with an <a href="https://www.cnbc.com/2023/12/08/google-faces-controversy-over-edited-gemini-ai-demo-video.html?ref=blef.fr">edited video</a>.</li></ul><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li><a href="https://airflow.apache.org/docs/apache-airflow/stable/release_notes.html?ref=blef.fr#airflow-2-8-0-2023-12-14">Airflow 2.8 is out</a> — Airflow's release rhythm is crazy; I can't keep up with the awesome features that have been added this year. To finish the year the Airflow team released improvements to Datasets and a major step forward with the new <a href="https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/objectstorage.html?ref=blef.fr">Object Storage API</a>, which provides a generic abstraction over cloud storage to transfer data from one store to another.</li><li><a href="https://www.linkedin.com/feed/update/urn:li:activity:7139895340735844352/?updateEntityUrn=urn%3Ali%3Afs_updateV2%3A%28urn%3Ali%3Aactivity%3A7139895340735844352%2CFEED_DETAIL%2CEMPTY%2CDEFAULT%2Cfalse%29&ref=blef.fr">The EU AI Act has passed</a> — After many years working on the text the EU has voted for the AI Act to regulate the usage of AI on European citizens' data. It points to a cheat sheet that summarises what you need to know. 
In a few words: the AI Act provides a glossary defining what counts as AI and draws the boundaries between prohibited and high-risk AIs.</li><li><a href="https://cloud.google.com/bigquery/docs/write-sql-duet-ai?ref=blef.fr">BigQuery now integrates Duet AI</a> — to help you generate or complete SQL queries.</li><li><a href="https://aws.amazon.com/blogs/aws/new-amazon-s3-express-one-zone-high-performance-storage-class/?ref=blef.fr">AWS announced S3 Express</a> — S3 Express is a new single-zone storage class with 10x better performance (latency and parallelisation). Paul wrote a few <a href="https://quickwit.io/blog/s3-express-speculations?ref=blef.fr">speculations</a> about the new S3 tier—highly detailed, explaining very well what to expect—and DataEngineeringWeekly also wrote <a href="https://www.dataengineeringweekly.com/p/thoughts-on-amazon-express-one-and??ref=blef.fr">thoughts</a> about it. S3 will still be the king, or the <a href="https://juhache.substack.com/p/s3-is-the-goat?ref=blef.fr">GOAT</a>.</li><li><a href="https://newsletter.casewhen.xyz/p/data-explained-idempotence?ref=blef.fr">Idempotence</a> — Matt wrote an article explaining what idempotence is and why it matters in data engineering. Idempotence can be mathematically summarised as <em>f(f(x)) = f(x)</em>; it matters because for the same input you want a pipeline to produce the same output. Keep it in mind when designing a pipeline—it leads to great questions.</li><li><a href="https://nightingaledvs.com/have-i-resolved-the-pie-chart-debate/?ref=blef.fr">Have I Resolved the Pie Chart Debate?</a> —&nbsp;We all know pie charts are terrible.
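Going back to the idempotence bullet above, here is a minimal Python sketch of the f(f(x)) = f(x) property applied to a load step. The toy warehouse, table, and names are purely illustrative, not any particular tool's API: the point is that the load overwrites a partition instead of appending, so an accidental re-run changes nothing.

```python
from datetime import date

# Toy "warehouse": partition key -> rows (illustrative stand-in for real tables).
warehouse: dict[str, list[dict]] = {}

def load_day(day: date, rows: list[dict]) -> None:
    """Idempotent load: overwrite the day's partition instead of appending.

    Running it twice with the same input leaves the warehouse in the same
    state as running it once -- f(f(x)) = f(x).
    """
    warehouse[day.isoformat()] = rows  # replace the partition, never append

rows = [{"user": "a", "amount": 10}, {"user": "b", "amount": 5}]
load_day(date(2023, 12, 1), rows)
load_day(date(2023, 12, 1), rows)  # accidental re-run: still no duplicates
```

An append-based load (`warehouse[key].extend(rows)`) would double the data on every retry, which is exactly the failure mode idempotent design avoids.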
Nick proposes a way to fix the pie chart dilemma.</li><li><a href="https://more-than-numbers.count.co/p/how-to-know-if-your-data-team-is?ref=blef.fr">How to know if your data team is successful?</a> — Reflections on team performance and how to measure it.</li></ul><p></p><p></p><h1 id="engineering-stuff-%E2%9A%99%EF%B8%8F">Engineering stuff ⚙️</h1><ul><li><a href="https://netflixtechblog.com/our-first-netflix-data-engineering-summit-f326b0589102?ref=blef.fr">Netflix internal data engineering Summit</a> — The Netflix team organised an internal conference about data engineering topics. And they recorded it. 8 videos are on YouTube, and honestly this is awesome content for learning patterns and getting ideas from the best. They <a href="https://www.youtube.com/watch?v=QxaOlmv79ls&ref=blef.fr">still use technologies around the JVM</a> (Spark and Flink), but unsurprisingly everything revolves around Iceberg—which was created at Netflix.</li><li><a href="https://netflixtechblog.com/incremental-processing-using-netflix-maestro-and-apache-iceberg-b8ba072ddeeb?ref=blef.fr">Using Netflix Maestro and Apache Iceberg</a> — Going deeper into incremental processing, the engineering team details how they implemented it.</li><li><a href="https://tobikodata.com/introducing-wap-pattern-support.html?ref=blef.fr">Introducing WAP pattern support with Apache Iceberg</a> (with SQLMesh) — A small article about an important pattern to avoid putting bad data in production. The WAP pattern—Write-Audit-Publish—lets you first <em>write</em> the data to a staging layer where it is <em>audited</em>; if the audit is green, the data is <em>published</em> to the production layer.
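The three Write-Audit-Publish steps just described can be sketched in a few lines of Python. This is a toy illustration, not the SQLMesh or Iceberg API: the not-null check stands in for real data tests, and the lists stand in for staging and production tables.

```python
def write_audit_publish(rows, staging, production):
    """WAP: write to staging, audit there, publish only if the audit passes."""
    staging.clear()
    staging.extend(rows)  # 1. write: land the data in a staging layer
    # 2. audit: a simple not-null check stands in for real data tests
    if any(r.get("id") is None for r in staging):
        raise ValueError("audit failed: bad data stays out of production")
    production.clear()
    production.extend(staging)  # 3. publish: promote only audited data
    return production

staging, production = [], []
write_audit_publish([{"id": 1}, {"id": 2}], staging, production)
```

The key property is that a failing audit raises before production is touched, so consumers never see the bad batch; a real implementation would stage into a branch or staging table and swap it in atomically.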
This article is mainly an entry point to SQLMesh—a dbt alternative—that enables you to do it.</li><li><a href="https://medium.com/snowflake/how-to-integrate-databricks-with-snowflake-managed-iceberg-tables-7a8895c2c724?ref=blef.fr">Use Databricks to read Iceberg tables in Snowflake</a> 🙃 — This post was written by the Snowflake team, but it reflects Snowflake's strategy of attracting customers by being open, with Iceberg as the glue, winning the table format war. Still, don't do it; try to avoid a spaghetti data platform.</li><li><a href="https://maxhalford.github.io/blog/efficient-data-transformation/?ref=blef.fr">Efficient ELT refreshes</a> — Max details how he designed his ELT pipelines.</li><li><a href="https://dlthub.com/docs/blog/dlt-aws-taktile-blog?ref=blef.fr">Run dlt on Lambda to save on extract and load costs</a> — dlt is an open-source Python library for extract-load; if you want to cut the cost of the various cloud services that move data, it might be an alternative.</li><li><a href="https://eng.lyft.com/druid-deprecation-and-clickhouse-adoption-at-lyft-120af37651fd?ref=blef.fr">Druid deprecation and ClickHouse adoption at Lyft</a> — Data engineers love migrations. They love talking about the migrations they have done even more.
Moving from Druid to ClickHouse looks like a good improvement.</li><li><a href="https://medium.com/airbnb-engineering/data-quality-score-the-next-chapter-of-data-quality-at-airbnb-851dccda19c3?ref=blef.fr">Data Quality Score: next chapter of data quality at Airbnb</a> — After all the data cataloging visions and trends Airbnb has launched, this time they explain how they see dataset quality and how they score it.</li></ul><h3 id="other-reads">Other reads</h3><ul><li><a href="https://clickhouse.com/blog/the-state-of-sql-based-observability?ref=blef.fr">The state of SQL-based observability</a>, on the ClickHouse blog.</li><li><a href="https://leo-godin.medium.com/designing-one-big-table-obt-c1dd797d60ac?ref=blef.fr">Designing OBT</a> and comparing <a href="https://hubertdulay.substack.com/p/one-big-table-obt-vs-star-schema?ref=blef.fr">OBT with Star Schema</a>.</li><li><a href="https://www.datafold.com/blog/code-review-best-practices-for-analytics-engineers?utm_source=linkedin&utm_medium=social&utm_campaign=evergreen-datafold_ci">Code review best practices for Analytics Engineers</a>.</li><li><a href="https://towardsdatascience.com/self-service-data-analytics-as-a-hierarchy-of-needs-19bb68551640?ref=blef.fr">Self-Service data analytics as a hierarchy of needs</a>.</li><li><a href="https://robertsahlin.substack.com/p/easy-gcp-cost-anomaly-detection?r=7bvua&utm_campaign=post&utm_medium=web&ref=blef.fr">Easy GCP cost anomaly detection</a>.</li><li><a href="https://engineering.grab.com/an-elegant-platform?ref=blef.fr">An elegant platform</a>, Grab.</li><li><a href="https://medium.com/thefork/a-guide-to-mlops-with-airflow-and-mlflow-e19a82901f88?ref=blef.fr">A guide to MLOps with Airflow and MLflow</a>, TheFork.</li><li><a href="https://ntrs.nasa.gov/api/citations/19720005243/downloads/19720005243.pdf?ref=blef.fr">What made Apollo a success</a> —&nbsp;A NASA PDF with 8 articles reprinted from the March 1970 issue of Astronautics &amp; Aeronautics.</li></ul><p></p><h1 
id="data-economy-%F0%9F%92%B0">Data Economy 💰</h1><ul><li><a href="https://techcrunch.com/2023/12/11/mistral-ai-a-paris-based-openai-rival-closed-its-415-million-funding-round/?ref=blef.fr" rel="noreferrer"><strong>Mistral AI</strong> raised another €415m at a $2B valuation</a>. Mainly from US-based funds; it will probably change the governance of the company—is it still French?</li><li><a href="https://www.datacenterdynamics.com/en/news/elon-musks-generative-ai-startup-xai-looks-to-raise-1bn/?ref=blef.fr">Elon Musk’s generative AI startup <strong>xAI</strong> looks to raise $1bn</a>.</li><li><a href="https://siliconangle.com/2023/12/04/assemblyai-raises-50m-cloud-based-ai-speech-models/?ref=blef.fr"><strong>AssemblyAI</strong> raises $50m.</a> API endpoints to convert voice data to text in all its forms (transcripts, chapters, summaries, etc.).</li><li><a href="https://www.keboola.com/blog/keboola-data-operations-supercharger-raises-32m-in-series-a-funding?ref=blef.fr"><strong>Keboola</strong> raises $32m in Series A</a>. An all-in-one data platform for non-technical data users.</li><li><a href="https://www.eu-startups.com/2023/12/london-based-harriet-raises-e1-4-million-pre-seed-to-deliver-a-full-stack-ai-offering-to-hr-teams/?ref=blef.fr">London-based <strong>Harriet</strong> raises €1.4 million pre-seed</a>. An AI assistant using HR data to help employees.</li><li><a href="https://en.globes.co.il/en/article-ai-data-platform-vast-data-raises-118m-at-9b-valuation-1001464366?ref=blef.fr">AI data platform <strong>VAST Data</strong> raises $118m</a>.
An all-in-one platform for big corporations to do AI and engineering in the same place.</li><li><strong>Octolis</strong> <a href="https://www.linkedin.com/feed/update/urn:li:activity:7139959931775922176/?updateEntityUrn=urn%3Ali%3Afs_updateV2%3A%28urn%3Ali%3Aactivity%3A7139959931775922176%2CFEED_DETAIL%2CEMPTY%2CDEFAULT%2Cfalse%29&ref=blef.fr">has been acquired by <strong>Brevo</strong></a> (ex-SendinBlue). Octolis is a CDP / reverse-ETL solution and Brevo is a CRM, so the join makes total sense.</li></ul><hr><p>See you this Friday with a post opening 2024 🎊.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 23.46 ]]></title>
                    <description><![CDATA[ Data News #23.46 — Sam Altman has been fired as CEO of OpenAI, all the AI news, and catching up on the news from the last month. ]]></description>
                    <link><![CDATA[ /data-news-week-23-46/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 653ba5092d38dc000188ee17 ]]></guid>
                    <pubDate><![CDATA[ 2023-11-18 ]]></pubDate>
                    <content>
                        <![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/photo-1512617835784-a92626c0a554?q=80&amp;w=1000&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D" class="kg-image" alt="person in gray shirt with backpack walking on street between houses" loading="lazy"><figcaption><span style="white-space: pre-wrap;">Back in town (</span><a href="https://unsplash.com/photos/person-in-gray-shirt-with-backpack-walking-on-street-between-houses-YQSXw2YVqyU?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p>Hey, it's been a few weeks since I last wrote any news. It was a necessary break for me, combined with a bit of blank-page syndrome. Still, I've accumulated a lot of articles that I think belong in the Data News, so this week is a huge recap of the content produced over the last month.</p><p>I hope you will enjoy the selection.</p><p>On Monday I'll also give a talk at the <a href="https://www.eventbrite.com/e/motherduck-duckdb-user-meetup-de-november-2023-edition-2-tickets-742532794577?ref=blef.fr">Berlin MotherDuck meetup</a>: <em>DuckDB experiments, a glimpse of the future</em>. 
It won't be streamed live, but I think the recording will be published on YouTube after the event.</p><figure class="kg-card kg-image-card kg-card-hascaption"><a href="https://t.co/ErR21C55I4?ref=blef.fr"><img src="https://www.blef.fr/content/images/2023/11/gUwPXx9P.png" class="kg-image" alt="" loading="lazy" width="1000" height="500" srcset="https://www.blef.fr/content/images/size/w600/2023/11/gUwPXx9P.png 600w, https://www.blef.fr/content/images/2023/11/gUwPXx9P.png 1000w" sizes="(min-width: 720px) 720px"></a><figcaption><span style="white-space: pre-wrap;">Not sure there are still free seats, but if you want to come, reach out to me.</span></figcaption></figure><h1 id="ai-news-%F0%9F%A4%96">AI News 🤖</h1><ul><li><strong>Sam Altman has </strong><a href="https://www.theverge.com/2023/11/17/23965982/openai-ceo-sam-altman-fired?ref=blef.fr"><strong>been fired as CEO of OpenAI</strong></a><strong>.</strong><ul><li>OpenAI announced this <a href="https://openai.com/blog/openai-announces-leadership-transition?ref=blef.fr"><em>leadership transition</em></a> yesterday. At the same time, Greg Brockman (current President and co-founder) will step down as chairman of the board and Mira Murati (current CTO) will become interim CEO. It was a <a href="https://twitter.com/gdb/status/1725736242137182594?ref=blef.fr">brutal</a> <a href="https://x.com/karaswisher/status/1725718391548207246?s=20&ref=blef.fr">decision</a>.</li><li>The official public reason given was "[Sam] <em>was not consistently candid in his communications with the board, hindering its ability to exercise its responsibilities.</em> The board no longer has confidence.".</li><li>The Internet has spent the last 15 hours guessing what this really meant. 
Here are a few theories I've read: a security leak occurred and Sam/Greg hid it from the board, Sam is <a href="https://www.lesswrong.com/posts/QDczBduZorG4dxZiW/sam-altman-s-sister-annie-altman-claims-sam-has-severely?ref=blef.fr">publicly accused of sexual abuse</a> by his sister, Sam has different views on the company vision that don't please the board—esp. regarding <a href="https://twitter.com/karaswisher/status/1725678074333635028?ref=blef.fr">profits</a> or <a href="https://www.indiatoday.in/technology/news/story/openai-boss-sam-altman-confirms-they-are-working-on-chatgpt-5-says-ai-does-not-need-heavy-regulation-yet-2464153-2023-11-17?ref=blef.fr">AI regulations</a>, or Sam invested in an OpenAI competitor. Either way, we'll see in a few days.</li></ul></li><ul><li>People are mostly saddened by the news because Sam was a publicly beloved and transparent CEO who changed AI. Comparisons with the coup that overthrew Steve Jobs back in the day abound.</li></ul><li>The news arrived a few days after <a href="https://www.youtube.com/watch?v=U9mJuUkhUzk&ref=blef.fr">OpenAI dev-day</a>, a public conference announcing new products and features. Mainly they announced <a href="https://openai.com/blog/introducing-gpts?ref=blef.fr">GPTs</a>, a no-code UI to create custom versions of ChatGPT.</li><li>Other AI announcements<ul><li><a href="https://www.youtube.com/watch?v=NrQkdDVupQE&ref=blef.fr">Github Universe</a> was the moment to announce more Copilot everywhere in the GitHub ecosystem. The most interesting thing was that GitHub will introduce <a href="https://github.blog/2023-10-02-introducing-the-new-apple-silicon-powered-m1-macos-larger-runner-for-github-actions/?ref=blef.fr">M1 and GPU runners</a>.</li><li>xAI—the company founded by Musk after quitting OpenAI—announced <a href="https://x.ai/?ref=blef.fr">Grok</a>. 
It's a 33B-parameter LLM.</li><li>Germany wants to build the European OpenAI competitor and invested $500m in <a href="https://aleph-alpha.com/?ref=blef.fr">Aleph Alpha</a>, a startup. On the landing page it's clear that the focus is on building <em>safe AI</em>.</li><li><a href="https://kyutai.org/?ref=blef.fr">Kyutai</a> was announced at an AI Pulse event at Station F, Paris. <strong>Kyutai is an open science lab to build and democratize AGI—</strong>artificial general intelligence<strong>—through open science</strong>. They carefully picked open science rather than open source. The team looks great.</li><li>The GPU availability competition is on. Y Combinator announced a <a href="https://twitter.com/ycombinator/status/1721920476694634709?ref=blef.fr">Microsoft partnership and priority access</a> to compute resources. This is also linked to <a href="https://www.theverge.com/2023/11/15/23960345/microsoft-cpu-gpu-ai-chips-azure-maia-cobalt-specifications-cloud-infrastructure?ref=blef.fr">Microsoft making custom AI chips</a>.</li><li><a href="https://www.whitehouse.gov/briefing-room/statements-releases/2023/10/30/fact-sheet-president-biden-issues-executive-order-on-safe-secure-and-trustworthy-artificial-intelligence/?ref=blef.fr">Biden issues executive order on safe, secure, and trustworthy AI</a>.</li></ul></li><li>2 reports with hundreds of pages about AI were published — The <a href="https://www.stateof.ai/?ref=blef.fr">State of AI report</a> and <a href="https://www.coatue.com/blog/perspective/ai-the-coming-revolution-2023?ref=blef.fr">AI: The Coming Revolution</a>. Both look full of interesting things, but I haven't read them yet.</li><li><a href="https://arxiv.org/abs/2311.00871?ref=blef.fr">A Google team wrote a paper</a> "demonstrating various failure modes of transformers and degradation of their generalization for even simple extrapolation tasks". 
In a nutshell, LLMs can't generalize.</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://m.media-amazon.com/images/M/MV5BODQwODk5NjcxOF5BMl5BanBnXkFtZTgwMDMwMDgyNTM@._V1_.jpg" class="kg-image" alt="Silicon Valley (TV Series 2014–2019) - IMDb" loading="lazy"><figcaption><span style="white-space: pre-wrap;">🍿 (© Silicon Valley HBO series)</span></figcaption></figure><p></p><p>Now that I've given you the general news, let's jump to a few AI use-cases.</p><ul><li><a href="https://ai.meta.com/blog/brain-ai-image-decoding-meg-magnetoencephalography/?ref=blef.fr">Towards a real-time decoding of images from brain activity</a> — This is crazy: Meta researchers have been able to create a system that predicts the image a person is seeing from brain magnetoencephalography signals.</li><li><a href="https://engineering.grab.com/llm-powered-data-classification?ref=blef.fr">LLM-powered data classification for data entities at scale</a> — Grab explains how you can use LLMs for classification, in this case identifying PII in databases. They detail the real-time architecture behind the system and give an example of the prompt they use.</li><li><a href="https://blog.developer.atlassian.com/generative-ai-the-intern-you-cant-trust/?ref=blef.fr">Generative AI, the intern you can’t trust</a> — A small post from the Atlassian blog giving 3 ways to improve LLM accuracy.</li><li><a href="https://www.canva.dev/blog/engineering/summarise-post-incident-reviews-with-gpt4/?ref=blef.fr">Summarizing post incident reviews with GPT-4</a> — Canva has so many incidents that they need an LLM to summarize them for reporting purposes 🙃. 
Obviously I'm joking, but while the use-case is interesting I question the real need behind it.</li><li><a href="https://netflixtechblog.com/building-in-video-search-936766f0017c?ref=blef.fr">Building in-video search at Netflix</a> — What if you could prompt for a specific situation and get all the movies—at the relevant timecodes—presenting that situation? This is so cool.</li><li><a href="https://medium.com/artefact-engineering-and-data-science/llms-deployment-a-practical-cost-analysis-e0c1b8eb08ca?ref=blef.fr">Cost analysis of deploying LLMs</a> — All of this is cool but pricey; this post does a good exploration of the costs.</li></ul><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><p>Because the AI News is pretty packed and I still want you to enjoy this newsletter, articles will get shorter comments than usual. But still spicy opinions, because, you know, it's me.</p><ul><li>Data contracts are undoubtedly a new growth lever for data observability companies and data VCs. Soda announced their <a href="https://www.soda.io/resources/soda-releases-oss-data-contract-engine?ref=blef.fr" rel="noreferrer">open-source data contracts</a> engine. It's done in YAML. Here's another example of contracts with <a href="https://dataqualityguru.substack.com/p/data-contracts-schema-validation?ref=blef.fr">msgspec</a>.</li><li>NVIDIA research has been able to supercharge pandas with cuDF to <a href="https://colab.research.google.com/drive/12tCzP94zFG2BRduACucn5Q_OcX1TUKY3?ref=blef.fr">run pandas on GPUs</a>.</li><li>Wes McKinney, pandas and Arrow creator, <a href="https://wesmckinney.com/blog/joining-posit/?ref=blef.fr">will join Posit</a>—the company behind RStudio—as a Principal Architect. 
His new role will probably ease the integration of all the Python tooling into the Posit ecosystem, even if that has already been underway for months.</li><li><a href="https://www.getdbt.com/blog/dbt-labs-appoints-tech-veteran-brandon-sweeney-as-president-and-chief-operating-officer?ref=blef.fr">dbt Labs hired Brandon Sweeney as new President and COO</a>. Brandon previously led Revenue at HashiCorp—the same company that recently <a href="https://thenewstack.io/hashicorp-abandons-open-source-for-business-source-license/?ref=blef.fr">changed its licensing to BSL</a> and got backlash from the tech community for it. Our prayers go to dbt Core.</li><li>Onehouse, Microsoft and Google are working on a table format standard called <a href="https://onetable.dev/?ref=blef.fr">Onetable</a>. This isn't a new format but a way to create interoperability between Delta, Iceberg and Hudi.</li><li>If you are curious about <a href="https://tabular.io/blog/iceberg-hudi-acid-guarantees/?ref=blef.fr">Iceberg and Hudi ACID guarantees</a>, read the article.</li><li>Code faster with <a href="https://astral.sh/blog/the-ruff-formatter?ref=blef.fr">Ruff</a>, a Python formatter written in Rust. All the time you used to waste waiting for black to reformat your code can now be put to good use.</li></ul><p></p><p>Taking other companies as examples is often a good way to get ideas:</p><ul><li><em>Gusto, </em><a href="https://engineering.gusto.com/data-engineering-on-people-data/?ref=blef.fr"><em>data platform to generate HR insights</em></a> — All data is sent to OneModel—a paid HR tool—and to Redshift, with Tableau for visualisation.</li><li><em>Criteo, </em><a href="https://medium.com/criteo-engineering/how-we-compute-data-lineage-at-criteo-b3f09fc5c577?ref=blef.fr"><em>how to compute data lineage</em></a> — Criteo has a homemade application for data documentation called... 
Datadoc, in which they compute their cross-asset lineage.</li><li><em>Picnic, </em><a href="https://blog.picnic.nl/the-art-of-master-data-management-at-picnic-48b5cf978221?ref=blef.fr"><em>master data management</em></a> — Creating MDM for retailers is like the One Piece.</li><li><em>LinkedIn, </em><a href="https://engineering.linkedin.com/blog/2023/revolutionizing-real-time-streaming-processing--4-trillion-event?ref=blef.fr"><em>how to use 4 trillion events daily</em></a> — Leveraging Apache Beam and Samza.</li><li><em>Netflix, </em><a href="https://netflixtechblog.com/streaming-sql-in-data-mesh-0d83f5a00d08?ref=blef.fr"><em>streaming SQL</em></a> — Flink architecture in a data mesh organisation.</li><li><em>Zalando, </em><a href="https://engineering.zalando.com/posts/2023/11/patching-pgjdbc.html?ref=blef.fr"><em>how to patch Postgres and fix WAL</em></a> — The Zalando team explains the patch they made to the Postgres JDBC driver to fix growth in the write-ahead log.</li><li><em>GoDaddy, </em><a href="https://www.godaddy.com/engineering/2023/10/26/layered-architecture-for-a-data-lake/?ref=blef.fr"><em>layered architecture for a data lake</em></a> — Naming convention ideas and 5 data layers: source, raw, clean, enterprise and analytical.</li></ul><p></p><p>A few food-for-thought articles about data concepts and roles.</p><ul><li><a href="https://towardsdatascience.com/from-data-platform-to-ml-platform-4a8192edab5d?ref=blef.fr">From data platform to ML platform</a> — How data platforms are built incrementally, first for analytical use-cases and then adding ML capabilities.</li><li><a href="https://www.patch.tech/blog/why-you-should-not-build-directly-on-data-warehouse/?ref=blef.fr">Why you should not build apps directly on the data warehouse</a>.</li><li><a href="https://medium.pimpaudben.fr/sql-is-not-designed-for-analytics-079fc97b139c?ref=blef.fr">SQL is not designed for analytics</a> and why <a 
href="https://whynowtech.substack.com/p/malloy-data?ref=blef.fr">Malloy</a> is paving the way for the future.</li><li><a href="https://towardsdatascience.com/would-you-become-a-data-strategist-59c0a179df44?ref=blef.fr">Would you become a data strategist?</a> — A great post from Marie about a key analytical role shaping company strategies.</li><li><a href="https://luminousmen.com/post/two-archetypes-of-data-engineers/?ref=blef.fr">Two archetypes of data engineers</a> — Closer to the business or closer to the tech. The best data engineering teams successfully blend the two archetypes.</li><li><a href="https://tech.instacart.com/the-economics-team-at-instacart-94c48db951e8?ref=blef.fr">The Economics team at Instacart</a> — Or how economists and PhDs become more tech-savvy, enabling more and more relevant usage of data.</li></ul><p></p><h1 id="data-economy-%F0%9F%92%B0">Data Economy 💰</h1><ul><li><strong>ZenML</strong> <a href="https://www.zenml.io/blog/were-revving-up-zenml-exciting-news-and-what-it-means-for-you?ref=blef.fr">raises $3.7m additional Seed</a>. An MLOps platform that works with all clouds and tools.</li><li>Snowflake acquires <a href="https://www.linkedin.com/posts/sisu-data_sisu-is-joining-forces-with-snowflake-we-activity-7119690491054485505-7qtI/?ref=blef.fr" rel="noreferrer"><strong>Sisu</strong></a> and <a href="https://www.snowflake.com/blog/snowflake-to-acquire-ponder/?ref=blef.fr"><strong>Ponder</strong></a>. The first is an engine to monitor business metrics while the second is a tool to run pandas at scale.</li><li><a href="https://techcrunch.com/2023/11/01/yahoo-spin-out-vespa-lands-31m-investment-from-blossom/?guccounter=1&ref=blef.fr">Yahoo spin-out <strong>Vespa</strong> raises $31m</a>. Vespa is a search engine and a vector database. 
This is good timing to open-source it for AI use-cases.</li><li><a href="https://aleph-alpha.com/aleph-alpha-raises-a-total-investment-of-more-than-half-a-billion-us-dollars-from-a-consortium-of-industry-leaders-and-new-investors/?ref=blef.fr"><strong>Aleph Alpha</strong> raises $500m Series B</a> to build the German OpenAI.</li><li><a href="https://kyutai.org/CP_Kyutai_AI_EN.pdf?ref=blef.fr" rel="noreferrer"><strong>Kyutai</strong> is funded with $330m</a> from 2 French billionaires and Eric Schmidt—ex-Google CEO. Kyutai is an open science lab that wants to build AGI. The team has a good résumé and the science committee looks awesome (Yejin Choi, Yann LeCun and Bernhard Schölkopf).</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/photo-1572817544472-5fa378349697?q=80&amp;w=1000&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D" class="kg-image" alt="green palm trees beside building during daytime" loading="lazy"><figcaption><span style="white-space: pre-wrap;">Dreaming of sun (</span><a href="https://unsplash.com/photos/green-palm-trees-beside-building-during-daytime-oq9XjYRrLaI?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><hr><p>Ghost recently implemented a recommendation feature, so I've added a few folks I like to read on the internet.</p><div class="kg-card kg-button-card kg-align-center"><a href="#/portal/recommendations" class="kg-btn kg-btn-accent">Read a few friends</a></div><p>See you next week ❤️.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 23.42 ]]></title>
                    <description><![CDATA[ Data News #23.42 — dbt Mesh and a new dbt alternative, a few fundraises, OpenAI's crazy numbers, Meta banning Python ads, and more. ]]></description>
                    <link><![CDATA[ /data-news-week-23-42/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 65322dc27515250001782524 ]]></guid>
                    <pubDate><![CDATA[ 2023-10-20 ]]></pubDate>
                    <content>
                        <![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/photo-1598181162450-56296491375c?auto=format&amp;fit=crop&amp;q=80&amp;w=1000&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D" class="kg-image" alt="white sheep on green grass field near body of water during daytime" loading="lazy"><figcaption><span style="white-space: pre-wrap;">Writing about dbt like a sheep (</span><a href="https://unsplash.com/photos/white-sheep-on-green-grass-field-near-body-of-water-during-daytime-PdmZgghWImI?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p>Hey, this week Coalesce—the dbt Labs annual conference—took place. Over 3 days, people from around the world shared how they use dbt. As usual, I'll write a takeaway post after binge-watching all the keynotes, but that's for next week. Still, the dbt Labs <a href="https://www.getdbt.com/blog/new-dbt-cloud-features-announced-at-coalesce-2023?ref=blef.fr">announcements</a> were mainly about dbt Cloud, with great features to drive adoption of the paid product.</p><p>They announced dbt Mesh, a product enabling cross-project dependencies for teams with multiple dbt projects. They also released an Explorer view that lets you navigate through all your projects and see models, macros and more directly in one nice graph.</p><p>Does this mean that you have to use dbt Cloud to get a multi-project setup? No, you can activate <a href="https://www.blef.fr/dbt-multi-project-collaboration/">multi-project collaboration</a> with dbt Core. 
I've written a guide that helps you do it.</p><div class="kg-card kg-button-card kg-align-center"><a href="https://www.blef.fr/dbt-multi-project-collaboration/" class="kg-btn kg-btn-accent">Read my dbt multi-project guide</a></div><p>📺 On the content side, I'll also present the Fancy Data Stack project next week at the <a href="https://www.accelevents.com/e/deml-summit-2023?ref=blef.fr#about">Data Engineering And Machine Learning Summit 2023</a> organised by Seattle Data Guy. I'll be online on Thursday the 26th at 5PM CEST. Add it to your calendar and sign up for the conference—the list of speakers is insane.</p><p>Data News is packed this week; take time to enjoy it. Rainy times are coming, so you can see it as a gift 🎁.</p><p></p><h1 id="enough-dbt-use-lea-%F0%9F%A5%B0">Enough dbt, use lea 🥰</h1><p>Max—the first Data News member 🤗—open-sourced <a href="https://github.com/carbonfact/lea?ref=blef.fr">carbonfact<strong>/lea</strong></a> this week. lea aims to be a minimalist alternative to dbt, fixing a few flaws that come with dbt. You can even see the traditional Jaffle Shop example done in lea.</p><p>What are the main differences?</p><ul><li>You configure lea with env variables.</li><li>A <code>lea prepare</code> command creates the database objects that need to exist (datasets, schemas, etc.). Schemas are interpreted from the folder structure (with DuckDB).</li><li>lea understands the relationships between views, so you don't need a ref. Jinja templating is still supported, though.</li><li>Tests are added directly in the SQL code at the target column. For instance, if you need to test uniqueness on a column, you add the @UNIQUE decorator. 
Singular tests are still supported.</li><li>lea generates documentation as Markdown in the workdir.</li><li>Other cool features: <code>lea teardown</code> deletes database objects, lea diff shows table schema differences, and you can write Python models as long as they return a DataFrame.</li></ul><p>Max also wrote a nice post about downstream data issues—the main problem driving the data contracts space: <a href="https://maxhalford.github.io/blog/shit-flows-downhill-but-not-at-carbonfact/?ref=blef.fr">Sh*t flows downhill, but not at Carbonfact</a>. You should read it because it gives another perspective on how to fix the problem.</p><h1 id="gen-ai-%F0%9F%A4%96">Gen AI 🤖</h1><ul><li><a href="https://huggingface.co/spaces/Vokturz/can-it-run-llm?ref=blef.fr">Can you run it?</a> — A HuggingFace app that, given your machine specs, tells you what you need to run an LLM for inference or training.</li><li><a href="https://fondant.ai/en/latest/announcements/CC_25M_community/?ref=blef.fr">25 million Creative Commons image dataset released</a> — Fondant, an open-source processing framework, released publicly available images from web crawling along with their associated licenses.</li><li><a href="https://cloud.google.com/vertex-ai/docs/featurestore/latest/overview?ref=blef.fr">New Vertex AI Feature Store</a> — GCP Vertex AI is the place to do "serverless" AI. It's awesome to see this directly integrated with BigQuery, as it obviously brings simplicity. 
In public preview.</li></ul><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/photo-1597953601389-5f66416e47c4?auto=format&amp;fit=crop&amp;q=80&amp;w=1000&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D" class="kg-image" alt="panda bear on green grass during daytime" loading="lazy"><figcaption><span style="white-space: pre-wrap;">Pandas appreciation post (</span><a href="https://unsplash.com/photos/panda-bear-on-green-grass-during-daytime-KhD1zZIieJ0?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><ul><li><a href="https://lerner.co.il/2023/10/19/im-banned-for-life-from-advertising-on-meta-because-i-teach-python/?ref=blef.fr">Meta banned a creator for selling Python and Pandas courses</a> — The automated AI filters identified the ads as violating wildlife protection rules. The irony is that these algorithms are probably written in Python. Do we still want a future where AI decides for us?</li><li>At the same time, luckily for us, <a href="https://engineering.fb.com/2023/10/18/ml-applications/meta-ai-custom-silicon-olivia-wu/?ref=blef.fr">Meta is creating custom silicon for AI</a>.</li><li><a href="https://twitter.com/DrJimFan/status/1711797997791838394?ref=blef.fr">Disney's new intelligent robot / toy</a> — The entertainment company showcased a new toy with impressive capabilities, opening doors for a fun future for kids.</li><li><a href="https://medium.com/blablacar/11-lessons-learned-managing-a-platform-team-within-a-data-mesh-0b191b7652ce?ref=blef.fr">11 lessons learned managing a platform team within a data mesh</a> — BlaBlaCar, a carpooling company, is well known in France for recently adopting a data mesh organisation.
This post gives great insights about the impact on the data platform team.</li><li><a href="https://cube.dev/blog/the-need-for-an-open-standard-for-the-semantic-layer?ref=blef.fr">The need for an open standard for the semantic layer</a> — Following news about dbt's semantic layer, this post from Cube opens the door to defining a standard when it comes to semantics. What should be the main entity type at the center of the semantics: metrics or datasets?</li><li><a href="https://kestra.io/blogs/2023-10-11-why-ingestion-will-never-be-solved?ref=blef.fr">Why data integration will never be fully solved</a> — Anna covers a few data integration tools and tries to explain why this is such a tricky field that cannot be solved with a single cloud tool.</li><li><a href="https://www.popsink.com/?ref=blef.fr">Popsink, a real-time ingestion and processing platform</a>, released their self-service offering this week. They are French and they built a great platform on top of Redpanda and Flink, claiming to be 4x cheaper than Fivetran for data replication. An echo of the previous bullet point.</li><li>In the same vein as Popsink, an example of how Fortis Games, a game studio, <a href="https://thenewstack.io/a-real-time-data-platform-for-player-driven-game-experiences/?ref=blef.fr">developed a real-time platform with the same technologies</a>.</li><li><a href="https://www.5x.co/articles/rise-of-the-data-generalist?ref=blef.fr">Rise of the data generalist: smaller teams, bigger impact</a> — You don't need to convince me. In all my experience and the talks I have with people, smaller teams obviously drive bigger impact.</li><li><a href="https://www.entreprises.gouv.fr/fr/numerique/enjeux/la-strategie-nationale-pour-l-ia?ref=blef.fr">La stratégie nationale pour l'intelligence artificielle</a> — In French.
This is about what France wants to do by 2025 to drive AI adoption.<ul><li>3500 new students and at least 200 additional theses on AI topics</li><li>Capture between 10% and 15% of the world market share when it comes to embedded AI</li><li>and more measures to attract foreign talent and help companies</li></ul></li></ul><p></p><h1 id="engineering-stuff">Engineering stuff</h1><ul><li><a href="https://github.com/dagster-io/dagster-open-platform?ref=blef.fr">Dagster released their internal data platform in the open</a> — Surprise: they use Dagster as the orchestrator.</li><li>dbt related stuff<ul><li><a href="https://medium.com/intercom-rad/to-dbt-or-not-to-dbt-4e2d04f27d3a?ref=blef.fr">To dbt or not to dbt</a> — A few lessons learned while implementing dbt at Intercom.</li><li><a href="https://steep.app/blog/metricflow?ref=blef.fr">dbt MetricFlow, semantic layer 2.0</a> — A quick analysis of the new semantic layer vision.</li><li><a href="https://xebia.com/blog/data-contracts-and-schema-enforcement-with-dbt/?ref=blef.fr#:~:text=Data%20contracts%2C%20much%20like%20an,of%20data%20and%20output%20models">Data contracts and schema enforcement with dbt</a> — It comes with dbt Mesh and gives a lot of new metadata over your models to bring more software engineering practices to dbt development.</li></ul></li><li><a href="https://cassio-bolba.medium.com/data-modelling-x-one-big-table-obt-the-end-of-data-models-4e8739b3937e?ref=blef.fr">Pros and cons of One Big Table data modeling</a> — I really like OBT, it brings a lot of simplicity, especially in downstream usage, but obviously it has known issues.</li><li><a href="https://blog.devgenius.io/what-is-data-versioning-and-3-ways-to-implement-it-4b6377bbdf93?ref=blef.fr">What is data versioning and 3 ways to implement it</a> — A comparison between change data capture (CDC), dimensional modeling and slowly changing dimensions (SCD).</li><li><a 
href="https://medium.com/@patrick.ml.walsh/mage-bigquery-and-bundled-up-bike-trips-672c041f808a?ref=blef.fr">Mage, BigQuery and bundled-up bike trips</a> — A homemade project where Patrick used Montreal's public bike-counter data.</li><li><a href="https://itnext.io/replace-dockerfile-with-buildpacks-f7e435ad2bfc?ref=blef.fr">Replace Dockerfile with Buildpacks</a>.</li></ul><p></p><h1 id="data-economy-%F0%9F%92%B0">Data Economy 💰</h1><ul><li><a href="https://seekingalpha.com/news/4020481-microsoft-backed-openai-nears-stock-sale-at-90b-valuation-report?ref=blef.fr" rel="noreferrer"><strong>OpenAI</strong> is near $90b valuation</a>. With a product launched only in late 2022. It seems OpenAI is doing more than $100m in revenue per month. The numbers are just crazy.</li><li><a href="https://www.lonestarlunar.com/copy-of-declaration-of-independence?ref=blef.fr" rel="noreferrer"><strong>Lonestar</strong> raises additional $825k in Seed</a>. Lonestar provides immutable storage to be sent to the moon as a backup service. Yep, on the moon 🌕.</li><li><a href="https://blog.aindo.com/posts/fundingA?ref=blef.fr" rel="noreferrer"><strong>Aindo</strong> raises €6m Series A</a>. Aindo is a synthetic data solution: it provides a platform to generate synthetic data from your real data in order to preserve statistical relevance while removing sensitive information. With synthetic data you can then publicly seek help from the world's data scientists.</li><li><a href="https://techcrunch.com/2023/10/17/scylladb-raises-43m-to-scale-its-nosql-database-platform/?ref=blef.fr" rel="noreferrer"><strong>ScyllaDB</strong> raises $43M Series C</a>.
It's a NoSQL database compatible with Apache Cassandra interfaces, and <a href="https://github.com/scylladb/scylladb?ref=blef.fr">open-source</a>.</li><li><a href="https://www.getpantomath.com/post/pantomath-raises-14-million-in-series-a-led-by-sierra-ventures?ref=blef.fr" rel="noreferrer"><strong>Pantomath</strong> raises $14m Series A</a>. A new data pipeline observability solution enters the game.</li><li><a href="https://techcrunch.com/2023/10/11/data-transformation-startup-prophecy-lands-35m-investment/?ref=blef.fr" rel="noreferrer"><strong>Prophecy</strong> raises $35m Series B</a>. This is a drag-n-drop data transformation product that I had never heard of.</li></ul><hr><p>See you next week ❤️.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ dbt multi-project collaboration ]]></title>
                    <description><![CDATA[ Use cross-project references without dbt Cloud. This article showcases what you can do to activate dbt multi-project collaboration. ]]></description>
                    <link><![CDATA[ /dbt-multi-project-collaboration/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 652fe892e71341000187073e ]]></guid>
                    <pubDate><![CDATA[ 2023-10-19 ]]></pubDate>
                    <content>
                        <![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/photo-1557133285-a2b6b21f6e13?auto=format&amp;fit=crop&amp;q=80&amp;w=1000&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D" class="kg-image" alt="yellow and blue metal machine" loading="lazy" width="1000" height="667" srcset="https://images.unsplash.com/photo-1557133285-a2b6b21f6e13?auto=format&amp;fit=crop&amp;q=80&amp;w=600&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D 600w, https://images.unsplash.com/photo-1557133285-a2b6b21f6e13?auto=format&amp;fit=crop&amp;q=80&amp;w=1000&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D 1000w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">cross-project dependencies (</span><a href="https://unsplash.com/photos/yellow-and-blue-metal-machine-MlpVwIvHyGM?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p>Over the last few years, dbt has become a de facto standard enabling companies to collaborate easily on data transformations. With dbt, you can apply software engineering practices to SQL development. Managing your SQL codebase has never been easier.</p><p>So, yes, dbt is cool but there is a common pattern with it: you accumulate SQL queries. If your implementation of dbt is successful, many teams will use it and many business use cases will end up as SQL queries in your warehouse. Fast forward 2 years and you find yourself with hundreds or thousands of SQL queries. 
Whatever the number, there will be a critical point at which a single project no longer scales.</p><div class="kg-card kg-callout-card kg-callout-card-grey"><div class="kg-callout-emoji">❓</div><div class="kg-callout-text">Read my guides <a href="https://www.blef.fr/get-started-dbt/" rel="noreferrer">How to get started with dbt</a> and <a href="https://www.blef.fr/manage-and-schedule-dbt/" rel="noreferrer">how to manage and schedule dbt</a> as a preview about dbt.</div></div><p>Having too many models in a single repository becomes unmanageable:</p><ul><li>Governance — <em>many data owners</em></li><li>Data domains — <em>a lot of different concepts that you would like to isolate as single units</em></li><li>Name clashes — <em>you can't have 2 models with the same name in a project</em></li><li><em>and more </em>😅</li></ul><p>This is when you consider a multi-project configuration for your dbt implementation. With a multi-project configuration, you can imagine isolated dbt projects with possible connections between them. We can draw a parallel with <a href="https://en.wikipedia.org/wiki/Microservices?ref=blef.fr">microservice</a> architecture. Each dbt project is like a microservice and instead of exposing an HTTP API, it exposes tables with enforced contracts.</p><p>Initially <strong>cross-project references were a feature intended for dbt Core</strong> (cf. roadmaps <a href="https://github.com/dbt-labs/dbt-core/blob/main/docs/roadmap/2022-08-back-for-more.md?ref=blef.fr#v15-next-year">2022-08</a> and <a href="https://github.com/dbt-labs/dbt-core/blob/main/docs/roadmap/2023-02-back-to-basics.md?ref=blef.fr#multi-project-deployments-v15" rel="noreferrer">2023-02</a>). But after research and first developments, dbt Labs decided that multi-project collaboration <a href="https://github.com/dbt-labs/dbt-core/discussions/6725?ref=blef.fr#discussioncomment-6905854">would become a feature of dbt Cloud</a>. Which I understand perfectly. 
It's the best feature for creating a differentiating commercial offering. What's more, multi-project collaboration is by its very nature an Enterprise—with a big <strong>E</strong>—feature, which makes it relevant for a paid-for solution.</p><p>Hence <a href="https://docs.getdbt.com/guides/best-practices/how-we-mesh/mesh-1-intro?ref=blef.fr">dbt Mesh</a>, which was announced this week at Coalesce—dbt Labs' annual conference. dbt Mesh is the dbt Cloud solution that manages <a href="https://docs.getdbt.com/docs/collaborate/govern/project-dependencies?ref=blef.fr">cross-project references</a>, <a href="https://docs.getdbt.com/docs/collaborate/explore-projects?ref=blef.fr">a multi-project node explorer</a> and all the governance.</p><p>Cross-project references are a key enabler of data team decentralisation. Let's imagine you have a <strong>core</strong> project, managed by the central data team. In this core project you have an <strong>orders</strong> model. On the other side, the finance data team wants to build a revenue model on top of the <strong>core.orders</strong> model. With cross-project references you can declare a model as public in core and use it elsewhere.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/10/Screenshot-2023-10-20-at-13.57.22.png" class="kg-image" alt="" loading="lazy" width="1738" height="792" srcset="https://www.blef.fr/content/images/size/w600/2023/10/Screenshot-2023-10-20-at-13.57.22.png 600w, https://www.blef.fr/content/images/size/w1000/2023/10/Screenshot-2023-10-20-at-13.57.22.png 1000w, https://www.blef.fr/content/images/size/w1600/2023/10/Screenshot-2023-10-20-at-13.57.22.png 1600w, https://www.blef.fr/content/images/2023/10/Screenshot-2023-10-20-at-13.57.22.png 1738w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">dbt cross-project references use-case</span></figcaption></figure><p>All this is possible natively with dbt Cloud. 
But dbt Cloud multi-project is expensive. At the very least $100/month per project—it's Enterprise pricing, so it's not possible to get actual figures. But from what I know, it's expensive.</p><p><strong>What if we could do it with dbt Core?</strong></p><div class="kg-card kg-button-card kg-align-center"><a href="#/portal/signup/free" class="kg-btn kg-btn-accent">Join blef.fr for free</a></div><h1 id="enters-dbt-loom">Enters dbt-loom</h1><p>Obviously the community did not welcome this announcement well, as it converged with the new <a href="https://www.getdbt.com/blog/consumption-based-pricing-and-the-future-of-dbt-cloud?ref=blef.fr">pricing</a>. It's a bit frustrating to see a product you truly love, and in which you believe, keeping awesome features behind closed doors. But dbt is still open-source, so it's up to the community to adapt.</p><p>And the community adapted.</p><p>On my side I tried to fork dbt-core to inject what was needed to make multi-project work, but it was a burden and not very successful. On the other side, Nicholas Yager worked on <a href="https://github.com/nicholasyager/dbt-loom?ref=blef.fr">dbt-loom</a>, which leverages the new <a href="https://github.com/dbt-labs/dbt-core/pull/7955?ref=blef.fr">dbt Plugins mechanism</a> introduced with v1.6. Nicholas wrote a <a href="https://nicholasyager.com/2023/08/dbt_plugin_api.html?ref=blef.fr">great explanation of the plugin API</a>.</p><p>Under the hood, you need to write a Plugin class, inheriting from <a href="https://github.com/dbt-labs/dbt-core/blob/main/core/dbt/plugins/manager.py?ref=blef.fr#L23-L63"><code>DbtPlugin</code></a>, and implementing one or both of the 2 available hooks: <code>get_nodes</code> and <code>get_manifest_artifacts</code>. The first hook is called every time dbt needs to get nodes, and the returned nodes are injected as external nodes; this is the one that interests us. 
Actually, if we want to implement cross-project dependencies, we need to add to a dbt project's context the external nodes it depends on.</p><p>Here is what you can do with dbt-loom.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/10/Screenshot-2023-10-20-at-13.30.32.png" class="kg-image" alt="" loading="lazy" width="2000" height="1369" srcset="https://www.blef.fr/content/images/size/w600/2023/10/Screenshot-2023-10-20-at-13.30.32.png 600w, https://www.blef.fr/content/images/size/w1000/2023/10/Screenshot-2023-10-20-at-13.30.32.png 1000w, https://www.blef.fr/content/images/size/w1600/2023/10/Screenshot-2023-10-20-at-13.30.32.png 1600w, https://www.blef.fr/content/images/2023/10/Screenshot-2023-10-20-at-13.30.32.png 2106w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">dbt-loom in action with multi-project</span></figcaption></figure><h1 id="multi-project-collaboration-example">Multi-project collaboration example</h1><p>To help you understand what it really means, here is a working example with dbt-loom on a two-project setup—core and finance. Let's start with the <code>core</code> project. For reproducibility I use the dbt-duckdb connector so everyone can try it at home. I have 1 seed that loads a few rows and 2 models: <code>stg_orders</code> and <code>orders</code>.</p><p>Obviously <code>orders</code> depends on <code>stg_orders</code>; the former is public while the latter stays protected.</p><figure class="kg-card kg-code-card"><pre><code class="language-SQL">-- raw_orders.csv (dbt seed)
order_id,order_date,amount,customer_id
1,2023-01-01,340,c1
2,2023-01-02,13,c2
3,2023-01-03,1456,c1
4,2023-01-04,765,c3

-- stg_orders.sql
WITH raw AS (
    SELECT
        order_id,
        order_date::DATE AS order_date,
        customer_id,
        amount
    FROM {{ ref('raw_orders') }}
)

SELECT *
FROM raw

-- orders.sql
SELECT
    order_id,
    order_date,
    customer_id,
    amount::DECIMAL(8,2) AS amount_incl_vat,
    (amount / 1.2)::DECIMAL(8,2) AS amount_excl_vat -- assumes a 20% VAT rate
FROM {{ ref("stg_orders") }}</code></pre><figcaption><p><span style="white-space: pre-wrap;">The seed, the stg model and the final public model.</span></p></figcaption></figure><p>To declare these models as available for cross-project dependencies, you need to specify it in the YAML. In our case <code>stg_orders</code> will be protected and <code>orders</code> will be public with an enforced contract. The contract is super important because as soon as you expose a model, you potentially have downstream consumers building stuff on your models: you can't delete a column or change a type without notifying them. Or even better, you can start <a href="https://docs.getdbt.com/docs/collaborate/govern/model-versions?ref=blef.fr">versioning</a> models.</p><figure class="kg-card kg-code-card"><pre><code class="language-YAML">version: 2

models:
  - name: stg_orders
    access: protected
  - name: orders
    access: public
    config:
      contract:
        enforced: true
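        # with an enforced contract, dbt verifies at build time that the model
        # matches the declared columns and data types, and fails the build on drift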
    columns:
      - name: order_id
        data_type: int
        constraints:
          - type: not_null
      - name: order_date
        data_type: date
      - name: customer_id
        data_type: string
        constraints:
          - type: not_null
      - name: amount_incl_vat
        data_type: numeric(8,2)
      - name: amount_excl_vat
        data_type: numeric(8,2)</code></pre><figcaption><p><span style="white-space: pre-wrap;">models.yml that declares access and contracts for the public model</span></p></figcaption></figure><div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">💡</div><div class="kg-callout-text">There are 3 kinds of access for a model. <b><strong style="white-space: pre-wrap;">It can be private, protected or public</strong></b>. Private means the model is accessible only within the same group—a model can only be in one group. Protected means the model can only be referenced within its own project, and public means it can be referenced from everywhere. <a href="https://docs.getdbt.com/docs/collaborate/govern/model-access?ref=blef.fr" rel="noreferrer">See the doc</a>.</div></div><p>That's all for the core project. Once you have run <code>dbt build</code> on the core project, a <code>manifest.json</code> will be generated and tables will be created in the database. On the finance project, with dbt-loom installed—<code>pip install dbt-loom</code>—you need to declare the core project as a dependent manifest.</p><figure class="kg-card kg-code-card"><pre><code class="language-YAML">manifests:
  - name: core
    type: file
    config:
      path: ../core/target/manifest.json
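      # assumption: dbt-loom also supports remote manifest locations (e.g. object
      # storage or dbt Cloud artifacts); check the dbt-loom README for the exact
      # supported `type` values and their config keys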
</code></pre><figcaption><p><span style="white-space: pre-wrap;">dbt_loom.config.yml</span></p></figcaption></figure><p>Then you can write a few models that are using cross-project references.</p><figure class="kg-card kg-code-card"><pre><code class="language-SQL">-- stg_revenue.sql
WITH orders AS (
    SELECT *
    FROM {{ ref('core', 'orders') }} -- this is cross-project reference
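    -- dbt-loom reads core's manifest, injects `orders` as an external node,
    -- and this ref compiles to the table built by the core project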
)

SELECT *
FROM orders
LEFT JOIN {{ ref('margins') }} ON 1 = 1 -- ON 1 = 1 cross-joins every order with the margins model

-- revenue.sql
SELECT
    order_date,
    SUM(amount_excl_vat * margin) AS revenue
FROM {{ ref('stg_revenue') }}
GROUP BY order_date</code></pre><figcaption><p><span style="white-space: pre-wrap;">dbt finance project SQL models</span></p></figcaption></figure><p>Now you can dbt build this project as well, and dbt-loom will extend the dbt model list thanks to the plugin, adding the <code>core.orders</code> model.</p><p>So you can try it at home, I've created a <a href="https://github.com/Bl3f/dbt-loom-example?ref=blef.fr">Github repository</a> with a working example using DuckDB as the database.</p><div class="kg-card kg-button-card kg-align-center"><a href="#/portal/signup/free" class="kg-btn kg-btn-accent">Join for free to not miss any updates</a></div><h1 id="conclusion">Conclusion</h1><p>Multi-project collaboration is probably the best feature dbt Labs has introduced in recent times. This feature has huge potential to structure dbt projects and avoid chaos.</p><p>As a data engineer who loves open-source and community stuff, dbt-loom is a great workaround, but be aware that it's all experimental at the moment; if large workflows rely on this functionality, you should consider using the paid version with dbt Mesh.</p><p>To go further you can <a href="https://coalesce.getdbt.com/agenda/take-chances-make-mistakes-and-get-meshy-unlocking-model-governance-and-multi-project-deployments-with-dbt-meshify?ref=blef.fr">watch</a> a Coalesce 2023 talk about <a href="https://github.com/dbt-labs/dbt-meshify?ref=blef.fr">dbt-meshify</a>, a tool that helps you automate your journey from a monolith to a multi-project dbt setup—<a href="https://attendees.bizzabo.com/433222/agenda/activity/1179792?ref=blef.fr">here is the direct link to the video</a>.</p><p></p><p></p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Airflow Summit 2023 takeaways ]]></title>
                    <description><![CDATA[ Data News #23.41 — Airflow Summit takeaways — Get Airflow vision, understand internals and a few companies giving feedbacks about their Airflow usage. ]]></description>
                    <link><![CDATA[ /data-news-week-23-41/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 65286b44828b9a00014afd4d ]]></guid>
                    <pubDate><![CDATA[ 2023-10-14 ]]></pubDate>
                    <content>
                        <![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/photo-1524281423221-234569bc0438?ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&amp;auto=format&amp;fit=crop&amp;w=1000&amp;q=80" class="kg-image" alt="man standing on top of rock formation" loading="lazy"><figcaption><span style="white-space: pre-wrap;">(</span><a href="https://unsplash.com/photos/pqHRNS8Mojc?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p>Hello, dear Data News reader, I hope you'll enjoy this new edition. It's amazing how quickly time flies, and this summer I passed the 3-year mark since I started my freelance adventure. I'm so happy with what it's brought me. But I've got this internal alarm that goes off every 3 years asking me for new things. It's time for me to search for my future paths.</p><p>Don't worry, the newsletter and the content stuff I do is something I enjoy, so it will probably stay as an invariant in this quest.</p><p>Also, this week I wrote R code for the first time. It's not an experience I'd recommend. I tried using ChatGPT to help me with this task and every answer it gave me was wrong. In 20 attempts, it never gave me a correct snippet. On the other hand, I asked the AI to help me write a TCP proxy in Python and it worked the first time. Probably a training bias.</p><p>Going further, I've looked at StackOverflow trends to see if there is a reason Python is better covered by ChatGPT than R—beyond the obvious one—and Python was <a href="https://insights.stackoverflow.com/trends?tags=java%2Cc%2Cc%2B%2B%2Cpython%2Cc%23%2Cvb.net%2Cjavascript%2Cassembly%2Cphp%2Cperl%2Cruby%2Cswift%2Cr%2Cobjective-c&ref=blef.fr">6 to 7 times</a> more popular than R at the time of training. 
The graph also shows that Python has been losing popularity since 2022, although I don't really know why, and it stays on top. Only C# got a massive increase in the <a href="https://www.tiobe.com/tiobe-index/?ref=blef.fr">TIOBE index</a>.</p><p>This week, the videos from the Airflow Summit 2023 have been released and as always, I'd like to provide you with a list of the talks I found interesting. You can also watch the <a href="https://www.youtube.com/watch?v=pi8V077KjEY&list=PLGudixcDaxY29qXIXhd90htHp_BFk-Bqf&pp=iAQB&ref=blef.fr">YouTube playlist</a> and show support to the other speakers.</p><p></p><h1 id="airflow-summit-2023-%F0%9F%8C%AC%EF%B8%8F">Airflow Summit 2023 🌬️</h1><p>For ease of reading I've sorted the talks I've selected into 3 categories: general stuff, Airflow internals and feedback from companies.</p><h2 id="general-%E2%80%94-get-airflow-ideas">General — Get Airflow ideas</h2><ul><li><strong>The Summit opened with a panel about the <a href="https://www.youtube.com/watch?v=pi8V077KjEY&ref=blef.fr"><strong>past and the future of Airflow</strong></a></strong>. It was also the time for the panelists to give a huge shoutout to all Airflow contributors. I personally join the shoutout because Airflow has been part of my professional journey for the last 5 years and it helped me grow and achieve so much.</li><li><strong>Then Marc Lamberti gave a huge <a href="https://www.youtube.com/watch?v=y9rSCboE6BY&list=PLGudixcDaxY29qXIXhd90htHp_BFk-Bqf&index=4&ref=blef.fr"><strong>update about Airflow</strong></a> but done differently</strong> — It wasn't about slides with a list of new features but rather about how you can write, in 2023, a data pipeline with Airflow. It's a presentation that silences critics about Airflow's rigidity and complexity.</li><li><a href="https://www.youtube.com/watch?v=J5pbH1TUv0U&ref=blef.fr"><strong>Airflow operators need to die</strong></a> — This is a funny topic. 
Airflow operators are often criticised because they don't work, so people just use the Python or Bash operators to orchestrate their own stuff, which leaves us with useless operator code. So Airflow needs a new vision. This talk from Bolke is probably the beginning of an operator rebirth. Bolke proposed new storage and dataframe APIs to remove hardcoded operators and decouple sources from destinations.</li><li><strong>Airflow can also be at the center of the data mesh</strong> discussion, with companies using multiple Airflow instances to give power to many teams. Kiwi.com showcases how they moved from a <a href="https://www.youtube.com/watch?v=Ib8lgj9Xa2U&ref=blef.fr">monolith to several smaller envs</a> while Delivery Hero explained how they run <a href="https://www.youtube.com/watch?v=Or0nlM95b_o&ref=blef.fr">500 Airflow instances</a> with a lot of unique specificities.</li><li><a href="https://www.youtube.com/watch?v=Tagr4IqbqDI&ref=blef.fr"><strong>A microservice approach for DAG authoring using datasets</strong></a> — The idea is to apply SE patterns to pipelines like <a href="https://blogs.mulesoft.com/api-integration/patterns/data-integration-patterns-migration/?ref=blef.fr">migration</a>, <a href="https://blogs.mulesoft.com/api-integration/patterns/data-integration-patterns-broadcast/?ref=blef.fr">broadcast</a> and <a href="https://deviq.com/domain-driven-design/aggregate-pattern?ref=blef.fr">aggregate</a>. In addition you should create micropipelines, which we can define <em>as a small, loosely coupled DAG which operates on one input Dataset and produces one output Dataset.</em> Each micropipeline then implements a unique pattern with a defined input and output.</li><li><a href="https://www.youtube.com/watch?v=CjjZyxnHfdk&ref=blef.fr"><strong>Dynamic task mapping to orchestrate dbt</strong></a> — dbt has changed the data world and is immensely popular, but dbt orchestration is still a <a href="https://www.blef.fr/manage-and-schedule-dbt/">problem</a>. 
Many Airflow users have to integrate dbt within Airflow. This time the Xebia team proposes using dynamic task mapping to do it (link to the <a href="https://github.com/pgoslatara/dynamic_task_mapping_for_dbt?ref=blef.fr">Github</a> repo with multiple solutions).</li><li>The Astro team also showcased how you can <a href="https://www.youtube.com/watch?v=mgA6m3ggKhs&ref=blef.fr">deploy LLMs with Airflow</a> —&nbsp;following the <a href="https://a16z.com/emerging-architectures-for-llm-applications/?ref=blef.fr">a16z infra guide</a>.</li></ul><p></p><h2 id="understand-airflow-internals">Understand Airflow internals</h2><p>3 talks you should watch to learn things you don't know about Airflow internals.</p><ul><li>Airflow is made of 3 main components interacting together: the <em>webserver</em>, the <em>scheduler</em> and the <em>executor</em>; they use a database to communicate. Within the scheduler there is a DAG parser process reading files to understand what needs to be scheduled. <ul><li>This DAG parsing step has flaws.<ul><li>By default you have to wait 5 minutes to have a new DAG displayed in the UI.</li><li>If you have 300 DAGs coming from a single file (for loop) it works way better than if you have 300 DAGs in 300 files.</li></ul></li><li>That's why we should probably <a href="https://www.youtube.com/watch?v=UkV1CAOul2w&ref=blef.fr">move to event-based DAG parsing</a> — In the presentation Bas explains the 4 steps in the DAG parser and what configuration you can change to get better performance. He also demos an event-based DAG parsing that instantly displays DAGs in the UI.</li><li> Then John also explained what he did to <a href="https://www.youtube.com/watch?v=8gcBdknaM8I&ref=blef.fr">improve parsing performance</a> — Especially around Python imports. 
Because parsing a DAG means running the Python DAG code (and its imports), and heavy imports wreck the parsing time.</li><li>➡️ In conclusion you should consider running the DAG processor standalone to remove the impact it could have on the scheduler, and follow the latest community improvements.</li></ul></li><li>Niko also discussed the <a href="https://www.youtube.com/watch?v=VFC0E6Oyj7A&ref=blef.fr">executor decoupling</a> to unlock the development of third-party executors like an ECS executor.</li></ul><p></p><h2 id="companies-feedback">Companies feedback</h2><p>To finish this newsletter, 3 company presentations about their Airflow usage that gave me inspiration.</p><ul><li><a href="https://www.youtube.com/watch?v=rF2Dz33TCsY&ref=blef.fr">Bloomberg, leveraging dynamic DAGs for data ingestion</a> — I'm a huge fan of dynamic DAGs; I think this is the way to go in Airflow because as a data engineer your role is to create a standardisation layer for data work rather than doing the actual data work, especially in a mesh concept. Here the Bloomberg team creates a nice categorisation of data tasks to provide DAGs as config.</li><li><a href="https://www.youtube.com/watch?v=FExVqjvDjvw&ref=blef.fr">Reddit, How we migrated from Airflow 1 to Airflow 2</a> — If there are still people out there on Airflow 1, you should migrate; newer Airflow versions are way simpler and more fun. But to be honest, Reddit's presentation can be generalised to every team that wants to migrate from an old piece of software to a fresher one. Migration recipes apply whatever software you use.</li><li><a href="https://www.youtube.com/watch?v=1b6Uu4M0ExY&ref=blef.fr">Monzo, Evolving our data platform as the bank scales</a> — This presentation is full of awesome ideas. It talks about dbt integration within Airflow (using a custom DAGBuilder), monitoring, alerting and Slack interaction with the data stack.</li></ul><hr><p>See you next week ❤️ — this week's other articles will be blended into next week's Data News!</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 23.40 ]]></title>
                    <description><![CDATA[ Data News #23.40 — OpenAI iPhone?, Python 3.12, chat with BigQuery, save costs and awesome other stuff. ]]></description>
                    <link><![CDATA[ /data-news-week-23-40/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 6523cb67b3f9a70001b459d6 ]]></guid>
                    <pubDate><![CDATA[ 2023-10-10 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/photo-1473830394358-91588751b241?ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&amp;auto=format&amp;fit=crop&amp;w=1000&amp;q=80" class="kg-image" alt="person looking out through window" loading="lazy"><figcaption><span style="white-space: pre-wrap;">(</span><a href="https://unsplash.com/photos/gzhyKEo_cbU?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p>Hey, I'm a bit late once again. I hope this newsletter edition finds you well. This is almost a raw edition: I had quite a big amount of links, and I hope you'll like this selection.</p><p></p><h1 id="gen-ai-%F0%9F%A4%96">Gen AI 🤖</h1><ul><li>OpenAI’s plan to build the <a href="https://www.theverge.com/2023/9/28/23893939/jony-ive-openai-sam-altman-iphone-of-artificial-intelligence-device?ref=blef.fr">"iPhone of artificial intelligence"</a> —&nbsp;Obviously this is one of the main struggles for OpenAI. In order to stay forever in the B2C market they need more than a chat interface: they need hardware, they need to enter users' everyday lives. Still not sure we need a new addictive device.</li><li>❤️ <a href="https://ig.ft.com/generative-ai/?ref=blef.fr">Generative&nbsp;AI&nbsp;exists because of the&nbsp;transformer</a> — A scroll story by the Financial Times explaining what Generative AI is. Good for everyone.</li><li><a href="https://www.cs.princeton.edu/~arvindn/talks/evaluating_llms_minefield/?ref=blef.fr">Evaluating LLMs is a minefield</a> —&nbsp;Slides from Princeton Uni about LLMs evaluation and why it's hard to understand how it evolves.
You can find the video version on the <a href="https://pli.princeton.edu/events/aiprinceton/pli-launch-talks?ref=blef.fr">Princeton</a> website, named "Societal Impact of AI".</li><li><a href="https://www.cnbc.com/2023/10/03/jpmorgan-ceo-jamie-dimon-says-ai-could-bring-a-3-day-workweek.html?ref=blef.fr">JPMorgan CEO says AI could bring a 3½-day workweek</a> —&nbsp;<em>blablabla, AI is awesome, we want AI everywhere and pay people less blablala /s.</em></li><li><a href="https://browse.arxiv.org/pdf/2309.10668.pdf?ref=blef.fr">Language Modeling is compression</a> —&nbsp;Paper from Deepmind. Title looks cool. To be honest this is not the first time I've seen LLMs and compression in the same paper, and at the very least it opens the door to fun experiments.</li><li><a href="https://www.nature.com/articles/s42256-023-00714-5?ref=blef.fr">Decoding speech perception from non-invasive brain recordings</a> — Even crazier. This article describes the state of the art in decoding speech from brain activity. It covers multiple models and shows what we can achieve by just looking at electric or magnetic recordings of the brain.</li></ul><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li><a href="https://www.vantage.sh/blog/databricks-vs-microsoft-fabric-pricing-analysis?ref=blef.fr">Microsoft Fabric: should Databricks be worried?</a> — Vantage did a price analysis between Microsoft Fabric and Databricks. Generally Fabric pricing is simpler because it's fresh and new, but for more complex stuff Databricks still shines.</li><li><a href="https://cube.dev/blog/introducing-python-and-jinja-for-data-modeling?ref=blef.fr">Introducing Python and Jinja in Cube</a> — Cube, an open source semantic layer, has released new writing capabilities in Python with Jinja in the YAML definitions. Something reminiscent of dbt.
You can now write macros to generate YAML.</li><li>Confluent announced <a href="https://www.jesse-anderson.com/2023/10/current-2023-announcements/?ref=blef.fr">Kafka roadmap</a> and <a href="https://siliconangle.com/2023/09/26/confluent-debuts-managed-apache-flink-service-generative-ai-features/?ref=blef.fr">Flink as a Cloud service</a> —&nbsp;this is the result of Confluent's acquisition of Immerok. Confluent is still growing but struggles to become a real competitor to Databricks or Snowflake.</li><li><a href="https://docs.python.org/3/whatsnew/3.12.html?ref=blef.fr">Python 3.12 is out</a> — Every year a new Python minor version is released and this year it brings a few cool features. Mainly you get a new generic <a href="https://docs.python.org/3/whatsnew/3.12.html?ref=blef.fr#whatsnew312-pep695">type parameter</a>, better <a href="https://docs.python.org/3/whatsnew/3.12.html?ref=blef.fr#whatsnew312-pep701">f-strings</a> with multilines in curly brackets and quote reuse, and the <a href="https://docs.python.org/3/whatsnew/3.12.html?ref=blef.fr#whatsnew312-pep684">per-interpreter GIL</a>—which <a href="https://engineering.fb.com/2023/10/05/developer-tools/python-312-meta-new-features/?ref=blef.fr">Meta is proud</a> to say it contributed to.</li><li><a href="https://malloydata.github.io/blog/2023-10-03-malloy-four/?ref=blef.fr#announcing-malloy-4-0">Announcing Malloy 4.0</a> — This is the tool I have to try soon. Malloy is out with a new version and a lot of new features.
As a reminder, Malloy is a new analytical language meant to generate SQL to query databases.</li><li>4 tips to save warehouse money — Paul posted on LinkedIn about <a href="https://www.linkedin.com/feed/update/urn:li:activity:7115268234395676672/?updateEntityUrn=urn%3Ali%3Afs_updateV2%3A%28urn%3Ali%3Aactivity%3A7115268234395676672%2CFEED_DETAIL%2CEMPTY%2CDEFAULT%2Cfalse%29&ref=blef.fr">search index</a>, <a href="https://www.linkedin.com/feed/update/urn:li:activity:7115625411715178497/?ref=blef.fr">avoiding rerunning dbt tests when possible</a> or <a href="https://www.linkedin.com/feed/update/urn:li:activity:7115977007707893761/?ref=blef.fr">just deleting tables</a>. Ian also proposed that you identify and <a href="https://www.linkedin.com/feed/update/urn:li:activity:7115678942153375744/?updateEntityUrn=urn%3Ali%3Afs_updateV2%3A%28urn%3Ali%3Aactivity%3A7115678942153375744%2CFEED_DETAIL%2CEMPTY%2CDEFAULT%2Cfalse%29&ref=blef.fr">remove things you don't need anymore</a>.</li></ul><p></p><h1 id="tech-and-data-engineering-stuff-%E2%9A%99%EF%B8%8F">Tech and data engineering stuff ⚙️</h1><ul><li><a href="https://slack.engineering/executing-cron-scripts-reliably-at-scale/?ref=blef.fr">CRON jobs at Slack scale</a> —&nbsp;Why do you need an orchestrator when you can run CRONs? The Slack engineering team details how they wrapped CRON jobs on top of Kubernetes with a database table to get monitoring.</li><li><a href="https://dropbox.tech/machine-learning/using-ml-to-identify-date-formats-in-file-names?ref=blef.fr">Using ML to identify date formats in file names</a> — Dropbox developed a classifier to identify date formats in file names. This is the backbone of a <em>naming convention</em> feature. It gives ideas. Based on <a href="https://huggingface.co/distilroberta-base?ref=blef.fr">DistilRoberta</a>, this is something to look at to fix the mess of a data lake.</li><li><a href="https://www.youtube.com/watch?v=OCClTPOEe5s&ref=blef.fr">Data modeling is dead!
Long live data modeling!</a> — Joe Reis's keynote at Big Data London about his next book topic: data modeling. Joe covers why data modeling was put aside in recent years and why we need it back today, showcasing a few useful patterns and definitions.</li><li><a href="https://pub.towardsai.net/chat-bigquery-using-english-c9bd4bb1b127?ref=blef.fr">Chat with BigQuery data</a> — this is a recycling of all the chatbot use-cases, once again. It's an example where you can use natural language to access BigQuery data. There is also a walkthrough example on <a href="https://airbyte.com/tutorials/airbyte-and-llamaindex-elt-and-chat-with-your-data-warehouse-without-writing-sql?ref=blef.fr">Airbyte with LLamaindex</a>. </li><li><a href="https://medium.com/apache-airflow/creating-an-airflow-custom-hook-for-reliable-api-calls-975a4710c7cd?ref=blef.fr">Creating an Airflow custom hook for API calls</a> — A guide showing you how you can extend Airflow hooks to have a custom way to call APIs.</li><li><a href="https://dataengineeringcentral.substack.com/p/goodbye-spark-hello-polars-delta??ref=blef.fr">Goodbye Spark. Hello Polars + Delta Lake</a> — Spark is under attack. In recent years Spark has been powering a lot of data use cases, but the modern data stack, and more recently DuckDB, Polars and smaller-scale OLAP technologies, enable a new way to do data processing.</li><li><a href="https://eng.lyft.com/from-big-data-to-better-data-ensuring-data-quality-with-verity-a996b49343f6?ref=blef.fr">Ensuring data quality with Verity</a> — Lyft's definition of data quality and a tour of the in-house product that addresses data quality in the data platform: Verity.
This is a must-read and a good showcase of what you can do.</li><li><a href="https://engineering.mixpanel.com/database-file-format-optimization-per-column-dictionary-2e108df1d706?ref=blef.fr">Database file format optimization: per column dictionary</a> — Mixpanel developed a proprietary columnar database and this article shows what they did to improve compaction and increase performance.</li><li><a href="https://luminousmen.com/post/exploring-the-power-of-graph-databases?ref=blef.fr">Exploring the power of graph databases</a>.</li></ul><p></p><h1 id="data-economy-%F0%9F%92%B0">Data Economy 💰</h1><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/photo-1596733541604-ee7020be9fdb?ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&amp;auto=format&amp;fit=crop&amp;w=1000&amp;q=80" class="kg-image" alt="motorcycle parked on the side of the road" loading="lazy"><figcaption><span style="white-space: pre-wrap;">The new search index (</span><a href="https://unsplash.com/photos/EU6_2jY0_rs?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><ul><li><a href="https://techcrunch.com/2023/10/04/yahoo-spins-out-vespa-its-search-tech-into-an-independent-company/?ref=blef.fr">Yahoo spins out Vespa</a>. <a href="https://vespa.ai/?ref=blef.fr"><strong>Vespa</strong></a> is the tech behind Yahoo search engine, it's a search engine and a vector database. In the current Gen AI times, it looks like a good time to do it.</li><li><a href="https://contentsquare.com/blog/contentsquare-signs-agreement-acquire-heap/?ref=blef.fr">Contentsquare acquires Heap</a>. 
<a href="https://www.heap.io/?ref=blef.fr"><strong>Heap</strong></a> is a product analytics solution to better understand your acquisition funnel performance.</li><li><a href="https://kestra.io/?ref=blef.fr"><strong>Kestra</strong></a> <a href="https://kestra.io/blogs/2023-10-05-announcing-kestra-funding-to-build-the-universal-open-source-orchestrator?ref=blef.fr">raises $3m Seed funding</a>. Kestra is the new kid in the open-source orchestration space, disrupting the Python status quo: it's written in Java and requires you to write pipelines in a declarative way, in YAML. If you want to know more you can watch this <a href="https://youtu.be/sAc-uNvlveY?t=2709&ref=blef.fr">YouTube live</a> I did with Kestra's CTO demonstrating its capabilities.</li></ul><hr><p>See you soon ❤️.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Upgrade your Modern Data Stack ]]></title>
                    <description><![CDATA[ Data News #23.39 — What can you do to upgrade your modern data stack without thinking first about technologies, Fast News and more. ]]></description>
                    <link><![CDATA[ /modern-data-stack-upgrade/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 6511bbb60fad7400010c7f5f ]]></guid>
                    <pubDate><![CDATA[ 2023-09-29 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/41/bXoAlw8gT66vBo1wcFoO_IMG_9181.jpg?ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&amp;auto=format&amp;fit=crop&amp;w=1000&amp;q=80" class="kg-image" alt="person riding on hot air balloon" loading="lazy"><figcaption><span style="white-space: pre-wrap;">Make your data stack take-off (</span><a href="https://unsplash.com/photos/0fjGQmYCRW8?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p>Hello, another edition of Data News. This week, we're going to take a step back and look at the current state of data platforms. What are the current trends, and why are people fighting over the concept of the modern data stack?</p><p>Early September is usually conference season. All over the world, people gather in huge venues to attend conferences. Last week it was <a href="https://bigdataldn.com/?ref=blef.fr">Big Data London</a>, this week it was <a href="https://www.bigdataparis.com/?ref=blef.fr">Big Data &amp; AI Paris</a>. I wasn't able to go. But every time I went to a conference in the past, I came back with ideas to change everything because someone introduced me to some fancy new stuff.</p><p>This feeling is right. But you should temper your excitement. Let's go through the current state of data to understand what you should do next.</p><p></p><h1 id="big-data-is-really-dead">Big Data is really dead</h1><p>Although the term Big Data is no longer very popular, London probably counted over 10,000 visitors and more than 160 vendors (2022 figures). Big Data London has existed since 2016, and when you look at the sponsors it's like a history book. Over the years the Cloudera logo has been replaced by the Snowflake and Databricks ones. The Microsoft logo is still standing.
<em>When everybody is digging for gold, it’s good to be in the pick and shovel business.</em></p><p>The era of Big Data was characterised by Hadoop, HDFS and distributed computing (Spark), on top of the JVM. This era was necessary and opened doors to the future, fostering innovation. <strong>But there was a big problem: it was hard to manage.</strong></p><p>That's why big data technologies got swooshed by the modern data stack when it arrived on the market—except Spark. We jumped from HDFS to Cloud Storage (S3, GCS) for storage and from Hadoop and Spark to Cloud warehouses (Redshift, BigQuery, Snowflake) for processing.</p><p>In fact, we're still doing the same thing we did 10 or 20 years ago. We need to store, process and visualise data; everything else is just marketing. I often say that data engineering is boring, insanely boring. When you are a data engineer you're getting paid to build systems that people can rely on. By nature it should be simple—to maintain, to develop—it should be stable, it should be proven. Something boring.</p><p>Big data technologies are dead—bye <a href="https://zookeeper.apache.org/?ref=blef.fr">Zookeeper</a> 👋—but the data generated by systems is still massive. Is the modern data stack relevant to answer this need for storage and processing?</p><p></p><h1 id="is-the-modern-data-stack-dying">Is the modern data stack dying?</h1><p>The modern data stack has always been a nice phrase bundling a philosophy for building data platforms. Cloud-first. With a handy warehouse at the center and multiple SaaS tools revolving around it to answer useful—sometimes not—use-cases. Following an E(T)LT approach.</p><p>Historically, data pipelines were designed with an ETL approach: storage was expensive and we had to transform the data before using it.
With the cloud, we got the—false—impression that resources were infinite and cheap, so we switched to ELT by pushing everything into a central data storage.</p><p>If we summarise <a href="https://www.getdbt.com/blog/future-of-the-modern-data-stack?ref=blef.fr">the initial modern data stack vision</a>, this is something like:</p><ul><li>move data with Fivetran</li><li>store data in Snowflake</li><li>transform data with dbt</li><li>visualise with Looker</li><li>document with a catalog, prevent with data observability, orchestrate</li></ul><p>So what's left of the original vision of the modern data stack that can be applied in 2023 and beyond? <strong>An easy-to-manage central storage and a querying and transforming layer in SQL</strong>. When you put it like this, it opens doors and does not limit the modern data stack to 4 vendors.</p><p>The central storage can be cloud storage, a warehouse or a real-time system, while the SQL engine can be a data warehouse or a dedicated processing engine. It can go further than that: you can—in fact you should—<a href="https://juhache.substack.com/i/136841647/multi-compute-engine-data-stack?ref=blef.fr">compose storages and engines</a>, as there are too many use cases for any one solution to address. More importantly, the modern 4-vendor data stack <a href="https://win.hyperquery.ai/p/does-the-modern-data-stack-work-at?ref=blef.fr">is too expensive to scale</a>.</p><p>The modern data stack is not about to disappear: it's so simple to use in the first place and it's at the core of too many data stacks and practices today. But it needs to adapt to today's needs, hence its incremental evolution.</p><p></p><h1 id="i-believe-in-incremental-evolution">I believe in incremental evolution</h1><p>What do you need to do? Well, it all depends on whether you're a newcomer who wants to start building your data platform, or whether you already have a stack and are wondering what to do next.
If you're starting your data stack in 2023, simply choose the solution that will be the quickest to implement to discover your business use cases; you'll build something later. A lot of companies started with Postgres + dbt + Metabase, don't be ashamed.</p><p>When it comes to incrementally changing a data platform this is a bit different: you need to find what is going wrong and what could be improved. Like:</p><ul><li><strong>data workflows are always failing, are always late</strong>—Identify why workflows fail; data contracts might help to bring <a href="https://www.youtube.com/watch?v=L9mEGb31snk&t=91s&ref=blef.fr">consensus as code</a> if they fail because of upstream producers; create metrics about failure or latency and aim for a 30-day streak with no issues. Define SLAs, criticality and ownership. For downstream data quality there are also a lot of tools.</li><li><strong>data stack is too expensive</strong>—With the current economic situation a lot of data teams needed to stop spending crazy amounts on compute and to introspect storage to remove useless data archives. DuckDB can help <a href="https://www.linkedin.com/feed/update/urn:li:activity:7110630962144649216/?ref=blef.fr">save tons of money</a>.</li><li><strong>developer experience to add new workflows</strong>—This is something often neglected by data engineers: you need to build the best dev experience for other data people, as not everyone is fluent with the CLI.</li><li><strong>data debt</strong>—You might have too many dashboards or tables, spaghetti workflows. For this you need to do recurrent data cleaning. Find, tag and remove what is useless and what can be factorised. Only healthy routines can prevent this.</li><li><strong>poor data modeling</strong>—This topic might be too large to handle in one bullet. <a href="https://towardsdev.com/data-modeling-in-the-modern-data-stack-d29be964b3a7?ref=blef.fr">Data modeling</a> is the part that really doesn't scale in data stacks.
As you grow, your stock of SQL queries will inflate, and only data modeling will prevent data from becoming unusable, repetitive or false. <a href="https://medium.com/@sivailango.s/principles-of-data-layers-in-data-platform-a336a0ff9e1e?ref=blef.fr">Good data layers</a> are a good start.</li><li><strong>there is no data documentation</strong>—Rare are the people who are happy to document what they are doing. The best thing to do is to define what good documentation is and then enforce the requirements before going to production. Think of the <a href="https://deezer.io/rethinking-your-data-platform-documentation-so-that-people-actually-read-it-84baff70b9a4?ref=blef.fr">documentation for your readers</a>.</li><li><strong>data is not easily accessible for humans or AI</strong>—We build data platforms to be used. You should create usage metrics on top of your platform: about business users' conversion in the downstream tools, about SQL query writers, but also about how AI is using the data. How does the <a href="https://a16z.com/emerging-architectures-for-llm-applications/?ref=blef.fr">AI platform</a> combine with the analytics platform?</li></ul><p>This list is probably not exhaustive, but it's a good start. If you think you're good on all counts, you've probably finished the game, and that means your data team has built something that works. Don't forget the stakeholders though, as it's probably more useful to have a platform that barely works but serves users perfectly than the other way around.</p><p></p><h1 id="conclusion">Conclusion</h1><p>This post is a reflection on the changes in the data ecosystem. Marketing would have you believe that your data infrastructure may be obsolete, but you shouldn't worry about it; if you're still using a crontab to run your jobs that's fine. Just use the right tool for the right job and identify what your data needs are.
Tip: data needs are rarely a technology name.</p><p>I hope you like this different Data News edition. I'm curious to know what you think about it; I wanted to keep it short while giving a few practical links and ideas.</p><p><em>Your data stack won't explode if you don't use dbt.</em></p><div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">💡</div><div class="kg-callout-text">Going deeper: <a href="https://wesmckinney.com/blog/looking-back-15-years/?ref=blef.fr" rel="noreferrer">The road to composable data systems: thoughts on the last 15 years and the future</a>. Wes McKinney—pandas and Arrow co-creator—is one of the best thought leaders in the data space. This article depicts well how composable our platforms will be in the future and why Apache Arrow has to be everywhere.</div></div><p>PS: I wanted to write also about the interoperability of data storage and file formats but that's for another time.</p><hr><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li><a href="https://motherduck.com/pricing/?ref=blef.fr">Motherduck has announced their pricing</a> — The model's simplicity reminds me a lot of BigQuery in its early days. You pay for cold and hot storage: respectively $0.04 per GB per month and $0.02 per GB per hour. But it looks way more expensive than BigQuery.</li><li><a href="https://cloud.google.com/blog/products/data-analytics/announcing-bigquery-omni-cross-cloud-joins?hl=en&ref=blef.fr">Announcing BigQuery Omni cross-cloud joins</a><strong> —</strong> Join datasets located in BigQuery with datasets located in AWS or Azure. This is part of the BigQuery Omni offering, which is 37% more expensive (in EU).</li><li><a href="https://moderndatanetwork.medium.com/3-lessons-to-learn-before-creating-your-own-data-team-1a64a5e22bca?ref=blef.fr">3 lessons to learn before creating your own data team</a> — Christelle wrote 3 lessons learned from a survey that was run in a private French data community.
Mainly it shows that the first hires in a data team have to be picked cautiously. </li><li><a href="https://medium.com/qonto-way/how-to-prioritize-projects-and-scale-your-data-science-team-efficiently-d4694f22eb49?ref=blef.fr">How to prioritise projects and scale your Data Science team efficiently</a> — A nice article about how to understand an OKR and make it your own to lead data science projects.</li><li><a href="https://mistral.ai/news/announcing-mistral-7b/?ref=blef.fr">Mistral 7B, the best 7B model so far and open-source</a> — Mistral AI is the French company that wants to compete with OpenAI, and they released a first 7B model under the Apache license.</li><li><a href="https://dataanalysis.substack.com/p/a-selection-of-sql-tutorials-issue-cf9?ref=blef.fr">A selection of SQL tutorials</a> — a long list.</li></ul><p></p><h1 id="data-economy-%F0%9F%92%B0">Data Economy 💰</h1><ul><li><a href="https://www.rollstack.com/?ref=blef.fr"><strong>Rollstack</strong></a><strong> <a href="https://www.rollstack.com/articles/rollstack-raises-1-8m?ref=blef.fr">raises $1.8m Seed</a></strong>. This is a YC company and they propose a product that automates slide decks with data coming from your data stack, without engineering or manual work. This is an awesome idea my younger self would have loved 8 years ago when I was generating PowerPoints in Python.</li><li><a href="https://www.kolena.io/?ref=blef.fr"><strong>Kolena</strong></a> <a href="https://techcrunch.com/2023/09/26/kolena-a-startup-building-tools-to-test-ai-models-raises-15m/?ref=blef.fr">raises $15m Series A</a>. Kolena proposes an end-to-end framework to test and debug ML models to identify failures and regressions.</li></ul><hr><p>See you next week ❤️.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 23.38 (late) ]]></title>
                    <description><![CDATA[ Data News #23.38 — Usual data news with Microsoft Copilot, DALL·E 3, Postgres 16, the fast news and a lot of money spent. ]]></description>
                    <link><![CDATA[ /data-news-week-23-38/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 650d63170fad7400010c7d8c ]]></guid>
                    <pubDate><![CDATA[ 2023-09-26 ]]></pubDate>
                    <content>
                        <![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/photo-1571008887538-b36bb32f4571?ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&amp;auto=format&amp;fit=crop&amp;w=1000&amp;q=80" class="kg-image" alt="pair of blue-and-white Adidas running shoes" loading="lazy"><figcaption><span style="white-space: pre-wrap;">Early like my run (</span><a href="https://unsplash.com/photos/XiZ7pRvCzro?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure>
<p>Hey. This is a super late Data News, I wanted to send it earlier but I was travelling then enjoying time with friends and family. I'm still struggling a bit to write as fast as I would like, but 🤷‍♂️.</p>
<p>So, sorry for the late edition and enjoy.</p>
<h1 id="gen-ai-%F0%9F%A4%96">Gen AI 🤖</h1>
<ul><li><a href="https://blogs.microsoft.com/blog/2023/09/21/announcing-microsoft-copilot-your-everyday-ai-companion/?ref=blef.fr">Announcing Microsoft Copilot</a> — Having everything under a common brand is great and Copilot is a great name. Microsoft announced that your AI companion called Copilot will be everywhere in the next Windows 11 update. For instance in Paint, Photos and in your web search (Edge and Bing).</li><li><a href="https://www.pcmag.com/news/microsoft-ai-employee-accidentally-leaks-38tb-of-data?ref=blef.fr">At the same time Microsoft leaked 38TB of data</a> — through a Github repository containing a link to an Azure storage account with public access open.</li><li><a href="https://openai.com/dall-e-3?ref=blef.fr">OpenAI announced DALL·E 3</a> — natively built with ChatGPT to create more impressive images from user prompts.</li><li>I recommend following <a href="https://www.linkedin.com/in/olivermolander/?ref=blef.fr">Oliver</a> on LinkedIn if you don't want to miss anything related to Gen AI. He writes the best takeaways multiple times a week.</li></ul>
<p></p>
<h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1>
<ul><li><a href="https://www.postgresql.org/docs/current/release-16.html?ref=blef.fr">Postgres 16 has been released</a> — featuring a few <a href="https://www.enterprisedb.com/blog/highlights-postgresql-16-beta-release?ref=blef.fr">performance improvements</a> in parallel executions (<em>string_agg</em> and <em>array_agg</em>) but also with the <em>SELECT DISTINCT</em> and <em>COPY</em> commands.</li><li><a href="https://ask.astronomer.io/?ref=blef.fr">Astronomer released Ask Astro</a> — An LLM application that is able to understand the Astro docs to answer most Apache Airflow questions. The source code is on <a href="https://github.com/astronomer/ask-astro?ref=blef.fr">Github</a>.</li><li><a href="https://www.prefect.io/blog/implications-of-scaling-airflow?ref=blef.fr">The implications of scaling Airflow</a> — Sarah, who's working at Prefect, wrote a post about Airflow's downsides at scale and how Prefect mitigates them. I'd not say that all the downsides are relevant blockers, but it still outlines one of the biggest Airflow issues: everything is implicit. Airflow is a framework allowing a wide range of code, easily leading to debt.</li><li><a href="https://leo-godin.medium.com/quick-dbt-patterns-d9173700c08a?ref=blef.fr">dbt pattern, test-transform-publish</a> —&nbsp;Often called the staging pattern. The idea is to publish the data only once tests have validated that it is valid. What Leo proposes is an incremental transformation with tests on top. If the tests pass, then a view runs and selects the last update.</li><li><a href="https://teej.ghost.io/a-guide-to-the-snowflake-results-cache/?ref=blef.fr">A guide to the Snowflake results cache</a> — Caching is a critical piece of every data warehouse, either for reusing data between runs or between stages in the same run.
This article details what you have to understand to optimise your Snowflake query writing.</li><li><a href="https://aws.amazon.com/blogs/big-data/use-the-new-sql-commands-merge-and-qualify-to-implement-and-validate-change-data-capture-in-amazon-redshift/?ref=blef.fr">Use the new SQL commands MERGE and QUALIFY in Redshift</a> — Redshift still exists and tries to catch up with the competition. MERGE allows you to deduplicate data by writing what you want to keep when rows match, and QUALIFY filters the results of a previously computed window function.</li><li><a href="https://www.arecadata.com/real-time-analytics-with-dynamic-tables-in-snowflake-redpanda/?ref=blef.fr">Real-time analytics with Snowflake dynamic tables &amp; Redpanda</a> — A good showcase of Snowflake dynamic tables with Wikipedia data.</li></ul>
<p></p>
<h1 id="data-economy-%F0%9F%92%B0">Data Economy 💰</h1>
<ul><li><a href="https://www.cnbc.com/2023/09/21/cisco-acquiring-splunk-for-157-a-share-in-cash.html?ref=blef.fr">Cisco acquired Splunk</a> for $28b in cash. Crazy amount. Splunk has been here for a while, providing an all-in-one platform for tech observability by ingesting logs and events to provide insights on a tech stack.</li><li><a href="https://www.secoda.co/?ref=blef.fr"><strong>Secoda</strong></a><strong> <a href="https://www.secoda.co/blog/secoda-series-a-monitoring?ref=blef.fr">raises a $14m Series A</a></strong>. Secoda is a data catalog tool with lineage and monitoring capabilities. Fresh money will help them add AI capabilities to the product and increase monitoring capabilities.</li><li><a href="https://motherduck.com/?ref=blef.fr"><strong>Motherduck</strong></a><strong> <a href="https://motherduck.com/blog/motherduck-open-for-all-with-series-b/?ref=blef.fr">raises $52.5m Series B</a></strong>. In total they raised $100m and announced that the Motherduck product is open for everyone and no longer behind a waitlist. Mainly Motherduck is the company providing DuckDB as a Cloud product, but they are not developing DuckDB. Their product is quite young but works as expected: with a simple string you can get an analytical cloud database that just works and that can be instantly replaced by a local one if needed.</li><li><a href="https://tabular.io/?ref=blef.fr"><strong>Tabular</strong></a> <a href="https://www.businesswire.com/news/home/20230919876739/en/Tabular-Secures-26M-for-Independent-Data-Platform-based-on-Apache-Iceberg?ref=blef.fr">raised $26m Series B</a>. Tabular is the company providing a cloud platform on top of Apache Iceberg—developed by Iceberg's founders. I'd say that Iceberg (or table formats) is probably one of the technologies that will incrementally change the way we write data pipelines for the better.
Providing <a href="https://tabular.io/blog/the-case-for-independent-storage/?ref=blef.fr">more control</a> over data storage. Yet I think Iceberg is not yet ready to be widely used (<a href="https://github.com/apache/iceberg/issues/6564?ref=blef.fr">Python write support</a> still missing, you need Spark).</li><li><strong>Anthropic</strong> <a href="https://www.anthropic.com/index/anthropic-amazon?ref=blef.fr">could get $4b from Amazon</a>. Amazon did a first $1.3b in a corporate round to bring a lot of money to one of the biggest OpenAI. The ChatGPT alternative, <a href="https://www.anthropic.com/product?ref=blef.fr">Claude</a>, is already out there.</li></ul>
<hr>
<p>See you on Friday ✨.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 23.37 ]]></title>
<description><![CDATA[ Data News #23.37 — A lot of articles this week, Falcon 180B, HuggingFac(ing) the senate, Snowflake and BigQuery tips, Databricks still burning cash and raising, etc. ]]></description>
                    <link><![CDATA[ /data-news-week-23-37/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 64faeddb92b9c00001df3c3c ]]></guid>
                    <pubDate><![CDATA[ 2023-09-15 ]]></pubDate>
                    <content>
                        <![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/photo-1476164933423-150b771b627f?ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&amp;auto=format&amp;fit=crop&amp;w=1000&amp;q=80" class="kg-image" alt="man walking near tall trees" loading="lazy"><figcaption><span style="white-space: pre-wrap;">Facing the News (</span><a href="https://unsplash.com/photos/oDiU9WRz5CI?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure>
<p>Hello Data News readers. I'm still struggling to get back into my usual work rhythm. If you add the fact that last week I came up with fewer articles than I expected, this has led me to another blank page. Anyway, after 2 years of work, I have to accept and let go when necessary. But don't worry, I haven't forgotten about you.</p>
<p>Let's quickly jump to the news, because it's rather busy.</p>
<p></p>
<h1 id="gen-ai-news-%F0%9F%A4%96">(Gen) AI News 🤖</h1>
<ul><li><a href="https://towardsdatascience.com/reinforcement-learning-an-easy-introduction-to-value-iteration-e4cfe0731fd5?ref=blef.fr">Reinforcement Learning: an easy introduction to value iteration</a> — The title says easy, but the article contains maths formulas. RL always feels a bit like magic and this article explains it well through golf concepts.</li><li><a href="https://huggingface.co/blog/falcon-180b?ref=blef.fr">Falcon 180B has been released on HF</a> — It's interesting to note that Falcon has been developed at the Technology Innovation Institute (TII) in Abu Dhabi. It brings diversity to foundation models, which usually come from the US. But given the number of parameters (180B), <a href="https://towardsdatascience.com/falcon-180b-can-it-run-on-your-computer-c3f3fb1611a9?ref=blef.fr">can it run on your computer</a>? Spoiler: according to Benjamin it needs 100GB of RAM to run and good GPUs to fine-tune.</li><li>If you're late to the party and you need fresh views on LLMs, Daniel wrote an introduction <a href="https://dataengineeringcentral.substack.com/p/demystifying-the-large-language-models?ref=blef.fr">demystifying the Large Language Models</a> and Jesse wrote about <a href="https://www.jesse-anderson.com/2023/09/gpt-and-llms-from-a-data-engineering-perspective/?ref=blef.fr">LLMs impact from a Data Engineering perspective</a>.</li><li>At the same time GitHub Research <a href="https://github.blog/2022-09-07-research-quantifying-github-copilots-impact-on-developer-productivity-and-happiness/?ref=blef.fr">quantified GitHub Copilot’s impact on developer productivity and happiness</a> — Developer productivity is a difficult measure to compute. Also productivity ≠ speed, but speed is important. The research also showed that people using GitHub Copilot feel 88% more productive and are more efficient and less frustrated.</li><li>HuggingFace CEO and co-founder <a href="https://twitter.com/ClementDelangue/status/1702095553503412732?ref=blef.fr">opening statement</a> at the AI insight forum — This week US AI giants went to a 6-hour private meeting with 60 US senators to explore AI regulation. Clement Delangue transparently shared his speech on Twitter. Mainly he talks about openness, risk measurement—like misinformation, election manipulation or carbon emissions increase—and finally safeguard implementation.</li><li>Meta developed <a href="https://engineering.fb.com/2023/09/07/data-infrastructure/arcadia-end-to-end-ai-system-performance-simulator/?ref=blef.fr">an end-to-end AI system performance simulator</a> called Arcadia. From what I understand this performance simulator unlocks capabilities in finding the best parameters for training.</li></ul>
<div class="kg-card kg-callout-card kg-callout-card-grey"><div class="kg-callout-emoji">💡</div><div class="kg-callout-text">Additional big tech stuff to check: <a href="https://www.etsy.com/codeascraft/the-so-fine-real-time-ml-paradigm?ref=blef.fr" rel="noreferrer">real-time ML training</a> at Etsy and <a href="https://medium.com/pinterest-engineering/last-mile-data-processing-with-ray-629affbf34ff?ref=blef.fr" rel="noreferrer">last mile data processing with Ray</a> at Pinterest.</div></div>
<p></p>
<h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1>
<figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/photo-1624628564627-89a340e05cdf?ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&amp;auto=format&amp;fit=crop&amp;w=1000&amp;q=80" class="kg-image" alt="white and purple card on white surface" loading="lazy"><figcaption><span style="white-space: pre-wrap;">I can predict a project failure (</span><a href="https://unsplash.com/photos/mnf5Q9nTkhs?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure>
<ul><li><a href="https://www.theregister.com/2023/09/05/birmingham_city_council_oracle/?ref=blef.fr">Birmingham City Council has to pay 5x the initial price</a> of the new Oracle ERP project. From £20 million to around £100 million. Crazy amounts.</li><li>I just discovered this week that in June <a href="https://cloud.google.com/blog/products/data-analytics/join-optimizations-with-bigquery-primary-and-foreign-keys?hl=en&ref=blef.fr">BigQuery introduced primary keys and foreign keys</a>.</li><li>How to reduce warehouse costs? — Hugo proposes <a href="https://medium.com/@hugolu87/5-minute-hacks-to-optimise-data-warehouse-cost-and-speed-snowflake-bigquery-postgres-etc-314e5d6444ac?ref=blef.fr">7 hacks to optimise data warehouse</a> cost. And if you can read French (🇫🇷) there is the super post by a French data collective about <a href="https://moderndatanetwork.medium.com/comment-r%C3%A9duire-ses-co%C3%BBts-google-bigquery-99f34d4fd2f0?ref=blef.fr">comment réduire ses coûts Google BigQuery?</a> (how to reduce your Google BigQuery costs).</li><li><a href="https://medium.com/@alvaroparra/snowflake-cron-format-conflicts-and-alternatives-to-solve-them-8b4cc4d34995?ref=blef.fr">* * * * * schedule Snowflake queries</a> —&nbsp;If you want to live dangerously you can use Snowflake table schedules to compute tables periodically. I don't recommend it, it's a Pandora's box we don't want to open.</li><li><a href="https://www.y42.com/blog/dimensional-modeling/?ref=blef.fr">Dimensional data modeling with dbt</a> — A great 6-step process to create a simple dim-fact model with dbt. It also uses the dbt_utils macro to generate a surrogate key.</li><li><a href="https://medium.com/datamindedbe/head-to-head-comparison-of-dbt-sql-engines-497d71535881?ref=blef.fr">Head-to-head comparison of 3 dbt SQL engines</a> — A comparison between DuckDB, Spark and Trino where DuckDB wins almost every fight. Obviously it's biased by the fact that the comparison is done on a single node, which is exactly what DuckDB is built for.</li><li><a href="https://medium.pimpaudben.fr/scrape-analyze-football-data-with-kestra-duckdb-and-malloy-a0fbde7c2d31?ref=blef.fr">Scrape &amp; analyse football data</a> — Benoit nicely puts in perspective how to use Kestra, Malloy and DuckDB to analyse data.</li><li><a href="https://dagster.io/blog/python-factory-patterns?ref=blef.fr">Factory Patterns in Python</a> — It reminds me of Java design pattern classes at engineering school. A bittersweet feeling. Still, I think the Factory pattern is probably the one I've used the most since the beginning of my career and this post explains it well.</li><li><a href="https://nightingaledvs.com/spaghetti-dashboard-chart-solutions/?ref=blef.fr">When charts look like spaghetti, try these saucy solutions</a> —&nbsp;Great tips to enhance your dashboards.</li><li>❤️ <a href="https://sambail.com/2023/09/01/the-key-to-building-a-high-performing-data-team-is-structured-onboarding/?ref=blef.fr">The key to building a high-performing data team is structured&nbsp;onboarding</a> — The title says it all. The article mentions 2 key pieces: first you need a great onboarding doc, then you need to successfully pass the "bootcamp" phase, which covers the first 2 weeks.</li></ul>
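Since the Factory pattern gets a mention above, here is a minimal Python sketch of the idea as it tends to show up in data pipelines: one function maps a config string to a concrete class, so callers never import the concrete classes. The connector names and classes are invented for illustration, they are not from the Dagster post:

```python
# Minimal factory: map a config string to a concrete connector class.
# All names here are hypothetical, for illustration only.
class PostgresConnector:
    def url(self):
        return "postgresql://host:5432/db"

class BigQueryConnector:
    def url(self):
        return "bigquery://project/dataset"

_CONNECTORS = {"postgres": PostgresConnector, "bigquery": BigQueryConnector}

def connector_factory(kind: str):
    """Return a connector instance for the given kind, hiding the concrete class."""
    try:
        return _CONNECTORS[kind]()
    except KeyError:
        raise ValueError(f"unknown connector: {kind}")
```

The win is that adding a new connector is one class plus one dictionary entry; nothing that calls `connector_factory` has to change.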
<blockquote>Of course, great onboarding isn’t the only thing necessary to build a high performing team, but it’s almost impossible to build one without great onboarding</blockquote>
<p></p>
<h1 id="github-gems-%F0%9F%92%8E">Github gems 💎</h1>
<ul><li><a href="https://github.com/Nike-Inc/brickflow?ref=blef.fr"><strong>nike-inc/brickflow</strong></a> — Nike's engineering team released a Python framework to orchestrate jobs in Databricks Workflows. Mainly <a href="https://engineering.nike.com/brickflow/v0.10.1/highlevel/?ref=blef.fr">it maps Airflow concepts</a> to a declarative interface over Databricks objects like Clusters, Workflows or Notebooks in order to orchestrate them.</li><li><a href="https://github.com/sourcegraph/cody?ref=blef.fr"><strong>sourcegraph/cody</strong></a> — <em>Cody is a free, open-source AI coding assistant that can write and fix code, provide AI-generated autocomplete, and answer your coding questions. </em>Under the hood it uses either Anthropic or OpenAI LLMs and requires a free cody.dev account.</li><li><a href="https://github.com/teej/titan?ref=blef.fr"><strong>teej/titan</strong></a> — <em>Titan is a Python library to manage data warehouse infrastructure</em>. Titan allows you to create Snowflake Databases, Warehouses, Roles and RoleGrants programmatically.</li></ul>
<p></p>
<h1 id="data-economy-%F0%9F%92%B0">Data Economy 💰</h1>
<figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/photo-1553285991-4c74211f5097?ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&amp;auto=format&amp;fit=crop&amp;w=1000&amp;q=80" class="kg-image" alt="rectangular red Supreme container" loading="lazy"><figcaption><span style="white-space: pre-wrap;">Databricks atm (</span><a href="https://unsplash.com/photos/I9qcFjyuJGw?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure>
<ul><li><a href="https://sqream.com/?ref=blef.fr"><strong>SQream</strong></a> <a href="https://techcrunch.com/2023/09/12/sqream-series-c/?ref=blef.fr">raises $45m Series C</a>. SQream is a GPU-based SQL database that can act as a data warehouse, promising peak performance at PB scale thanks to the GPU architecture. It also works well for machine learning use-cases.</li><li><a href="https://www.gable.ai/?ref=blef.fr"><strong>Gable</strong></a> <a href="https://www.linkedin.com/feed/update/urn:li:activity:7107413267072917504/?updateEntityUrn=urn%3Ali%3Afs_updateV2%3A%28urn%3Ali%3Aactivity%3A7107413267072917504%2CFEED_DETAIL%2CEMPTY%2CDEFAULT%2Cfalse%29&ref=blef.fr">raises $7m in seed funding</a>. Chad Sanderson launched his data contracts product / platform with 2 other co-founders. Chad has produced a lot of content around contracts in the last 2 years. It seems Gable is here to fix upstream data quality with contracts. Alerts will be sent in GitHub to notify owners when something breaks enforced rules.</li><li><strong>Databricks</strong> <a href="https://techcrunch.com/2023/09/14/databricks-raises-500m-more-boosting-valuation-to-43b-despite-late-stage-gloom/?ref=blef.fr">raises, another, $500m in Series I</a>. Soon there will be no letters left in the alphabet to associate with Databricks fundraising. Since the beginning they have raised $4b and are today valued at $43b. Nothing to say except that they love to <a href="https://www.theinformation.com/articles/inside-databricks-contrarian-playbook-burn-1-5-billion-to-buy-big-growth?ref=blef.fr">burn cash</a>. Be ready for a downhill ride in 2025 if you have picked Databricks.</li><li><strong>Treefera</strong> <a href="https://www.treefera.com/blog/treefera-pre-seed-funding-round?ref=blef.fr">raises $2.2m in pre-seed</a> to develop a data platform that monitors forests, built for carbon offsetting and reforestation. I really like their "data products" approach and the geo visuals over forest risks.</li><li><a href="https://www.collibra.com/us/en/company/newsroom/press-releases/collibra-acquires-sql-data-notebook-vendor-husprey?ref=blef.fr">Collibra acquires SQL data notebook</a> <a href="https://www.husprey.com/?ref=blef.fr"><strong>Husprey</strong></a>. Husprey is a Notion-like notebook plugged directly into the warehouse, to write stories on top of interesting tables or facts. It will become a nice product in the Collibra data governance ecosystem.</li></ul>
<hr>
<p>See you next week ❤️.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 23.35 ]]></title>
                    <description><![CDATA[ Data News #23.35 — I&#39;m back. Let&#39;s digest what happened in August: dbt tests, Gen AI with Meta new models release, Python into Excel, Airflow new features, Terraform, etc. ]]></description>
                    <link><![CDATA[ /data-news-week-23-35/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 64f1a2e052183200010f83e5 ]]></guid>
                    <pubDate><![CDATA[ 2023-09-01 ]]></pubDate>
                    <content>
                        <![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/photo-1535982330050-f1c2fb79ff78?ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&amp;auto=format&amp;fit=crop&amp;w=1000&amp;q=80" class="kg-image" alt="flat lay photography of blue backpack beside book and silver MacBook" loading="lazy"><figcaption><span>Back to school (</span><a href="https://unsplash.com/photos/02z1I7gv4ao?ref=blef.fr" rel="noreferrer"><span>credits</span></a><span>)</span></figcaption></figure>
<p>Hey, I'm back.</p>
<p>I've taken an unplanned 3-week break since the last Data News, let's be honest, it was necessary! I spent a few hours working on the <a href="https://www.blef.fr/the-fancy-data-stack/">fancy data stack</a> project and articles are in the works, but it was unrealistic to produce quality code and content while enjoying the summer. Like wine, it takes time to get it right. If you want a first glimpse of the Dagster code, you can look at it on <a href="https://github.com/Bl3f/tdf?ref=blef.fr">GitHub</a>; not yet documented, but the commit messages are clean.</p>
<p>On September 1, I'm still getting used to the school rhythm. A new year starts in September: new friends, new classes and new things. Even if, as an adult, things are different now. <strong>Data News is back, but with the same recipe: a weekly newsletter to let you catch up on the previous weeks' articles</strong>. I make the selection myself; I choose things I like while being under others' influence. But I'm not an influencer. I just create content.</p>
<figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/09/Screenshot-2023-09-01-at-15.00.14.png" class="kg-image" alt="" loading="lazy" width="2000" height="1458" srcset="https://www.blef.fr/content/images/size/w600/2023/09/Screenshot-2023-09-01-at-15.00.14.png 600w, https://www.blef.fr/content/images/size/w1000/2023/09/Screenshot-2023-09-01-at-15.00.14.png 1000w, https://www.blef.fr/content/images/size/w1600/2023/09/Screenshot-2023-09-01-at-15.00.14.png 1600w, https://www.blef.fr/content/images/size/w2400/2023/09/Screenshot-2023-09-01-at-15.00.14.png 2400w" sizes="(min-width: 720px) 720px"><figcaption><span>A glimpse into a fancy assets graph.</span></figcaption></figure>
<p></p>
<p>This week features what happened in August; even though it was the summer holidays, news, features and drama hit the data world. Enjoy the news recap.</p>
<p></p>
<h1 id="dbt-tests-%F0%9F%A7%AA">dbt tests 🧪</h1>
<p>dbt Core's proposition has been to bring software engineering practices to SQL development. Obviously testing is invited to the party, but tests are hard and everyone does and understands tests differently. There are unit, integration, functional and end-to-end tests. </p>
<p>This summer a lot of people wrote about testing with dbt.</p>
<div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">💡</div><div class="kg-callout-text">Before you start reading something else I recommend you the excellent video <a href="https://www.youtube.com/watch?v=hxvVhmhWRJA&ref=blef.fr" rel="noreferrer"><i><em class="italic">Testing: Our assertions vs. reality</em></i></a> from last Coalesce on YouTube.</div></div>
<ul><li><a href="https://www.elementary-data.com/post/dbt-tests?ref=blef.fr">dbt tests: How to write fewer and better data tests?</a> — Ari catalogs the kinds of tests you can write with dbt. <strong>Do you want to test data or code changes?</strong> (<em>this is the most important question tbh</em>) Do you want to test schema changes, missing data, volume or value anomalies? He covers everything.</li><li><a href="https://datacoves.com/post/dbt-test-options?ref=blef.fr">An overview of testing options for dbt</a> — Another exhaustive and less opinionated list of the options out there to write tests on data.</li><li><a href="https://towardsdatascience.com/a-simple-yet-effective-approach-to-implementing-unit-tests-for-dbt-models-da2583ea8e79?ref=blef.fr">A simple approach to implementing unit tests for dbt Models</a> — Mahdi proposes a CTE nomenclature to create inputs and outputs in dbt models to unit test them.</li><li><a href="https://github.com/dbt-labs/dbt-core/discussions/8275?ref=blef.fr">dbt Core unit tests are coming</a> — A discussion on Github about unit tests and fixture definitions in YAML to test models. If implemented within dbt Core it would be the most awesome feature, because hacking with seeds and custom macros looks nasty.</li></ul>
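To make the "test data vs. test code" vocabulary concrete, this is roughly what dbt's two most common generic data tests (not_null and unique) assert, rewritten as plain Python over a list of rows. This is a sketch of the semantics only, not dbt's actual implementation, which compiles these tests to SQL:

```python
# What dbt's not_null and unique generic data tests assert, in plain Python.
from collections import Counter

def failing_not_null(rows, column):
    """Rows where the column is NULL; dbt's not_null test fails if any exist."""
    return [r for r in rows if r.get(column) is None]

def failing_unique(rows, column):
    """Non-NULL values appearing more than once; dbt's unique test fails if any exist."""
    counts = Counter(r.get(column) for r in rows if r.get(column) is not None)
    return [value for value, n in counts.items() if n > 1]

# A toy model output with one NULL and one duplicated key.
rows = [{"id": 1}, {"id": 2}, {"id": 2}, {"id": None}]
```

Both are data tests: they run against what landed in the warehouse, not against your SQL logic, which is exactly the gap the unit-test discussion above tries to fill.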
<p></p>
<h1 id="generative-ai-%F0%9F%A4%96">Generative AI 🤖</h1>
<p>I haven't really been keeping up with the news because it moves too fast, but here are a few things that have stood out:</p>
<ul><li><strong>Meta releasing models faster than before</strong> — <a href="https://ai.meta.com/blog/dinov2-facet-computer-vision-fairness-evaluation/?utm_source=twitter&utm_medium=organic_social&utm_campaign=blog&utm_content=video">Expanding DINOv2</a>, a computer vision model (<a href="https://twitter.com/MetaAI/status/1697233910135148562?ref=blef.fr">on X</a>), releasing <a href="https://ai.meta.com/resources/models-and-libraries/seamless-communication/?utm_source=twitter&utm_medium=organic_social&utm_campaign=seamless&utm_content=card">SeamlessM4T</a>, a multilingual multimodal translation model (<a href="https://twitter.com/MetaAI/status/1694020437532151820?ref=blef.fr">on X</a>), and releasing <a href="https://ai.meta.com/blog/code-llama-large-language-model-coding/?ref=blef.fr">Code Llama</a>, an LLM for coding.</li><li><strong>Snowflake <a href="https://www.snowflake.com/blog/meta-code-llama-testing/?ref=blef.fr"><strong>fine-tuning Code Llama</strong></a> for SQL generation</strong> — With this fine-tuning it seems they are close to GPT-4 accuracy in text-to-SQL.</li><li>Llama 2 is about as factually accurate as GPT-4 for summaries and is <a href="https://www.anyscale.com/blog/llama-2-is-about-as-factually-accurate-as-gpt-4-for-summaries-and-is-30x-cheaper?ref=blef.fr">30X cheaper</a>.</li><li>A <a href="https://twitter.com/DFintelligence?ref=blef.fr">French Youtuber</a> released on <a href="https://twitter.com/matteoepik/status/1695345336213295378?ref=blef.fr">Twitch a 24/7 AI deep-faking French presidents</a> (Macron, De Gaulle, Chirac) answering the Twitch chat's questions, but his channel got banned by a Twitch bot after AI-Macron said something illegal while answering a question about the worst French cities. AI fights: this is the future we want.</li></ul>
<p></p>
<h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1>
<figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.anaconda.com/wp-content/uploads/2023/08/Untitled-2.gif" class="kg-image" alt="" loading="lazy"><figcaption><span>A certain idea of hell.</span></figcaption></figure>
<ul><li><a href="https://www.anaconda.com/blog/announcing-python-in-excel-next-level-data-analysis-for-all?ref=blef.fr"><strong>Python into Excel</strong></a> — Microsoft and Anaconda announced Python coming to Excel. I'm bittersweet about it: on one side I don't think Excel is a good platform for software development, on the other side, let's be honest and face the truth, Excel is the only data platform business users want. Still, the big winner here is Microsoft, because the Python code will run on Azure.</li><li><strong>After Excel, Notebooks get a second youth</strong> — Meta explained how they schedule <a href="https://engineering.fb.com/2023/08/29/security/scheduling-jupyter-notebooks-meta/?ref=blef.fr">Jupyter Notebooks in production</a>, Google announced BigQuery Studio with <a href="https://cloud.google.com/blog/products/data-analytics/whats-new-with-data-analytics-and-ai-at-next23?hl=en&ref=blef.fr">embedded Notebooks</a> in the UI and Jupyter released <a href="https://jupyter-ai.readthedocs.io/en/latest/?ref=blef.fr">Jupyter AI</a> (you call it with <code>%ai</code>) to bring Gen AI to the notebook.</li><li><strong>New features in Airflow</strong> — with 2.7 you get a <a href="https://airflow.apache.org/blog/airflow-2.7.0/?ref=blef.fr">Cluster Activity UI</a> and with the new <a href="https://github.com/kaxil/airflowctl?ref=blef.fr">airflowctl</a> CLI you can spin up Airflow instances in a wink.</li><li><strong>Introducing the revamped <a href="https://www.getdbt.com/blog/introducing-new-look-dbt-semantic-layer/?ref=blef.fr"><strong>dbt Semantic Layer</strong></a></strong> — dbt Labs announced the Beta of the Semantic Layer, which will be a paid product in dbt Cloud. I've already written a lot about the semantic layer and more is to come. So let's see where it goes.</li><li><strong>Introducing SOL: <a href="https://motifanalytics.medium.com/introducing-sol-sequence-operations-language-87a0d1d73497?ref=blef.fr"><strong>Sequence Operations Language</strong></a></strong> — A new language dedicated to sequence analyses, which can be useful when working with web traffic data.</li><li><strong>Answering "<a href="https://maxhalford.github.io/blog/kpi-evolution-decomposition/?ref=blef.fr"><strong>Why did the KPI change?</strong></a>" using decomposition</strong> — If you are an analyst who needs to explain every day why a metric increased or decreased, this article is for you. Max explores metric decomposition for sums and ratios. This is brilliant.</li><li><a href="https://blog.datumagic.com/p/apache-hudi-from-zero-to-one-110?ref=blef.fr">Apache Hudi: From Zero To One (1/10)</a>.</li></ul>
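On the KPI decomposition topic, the ratio case can be sketched in a few lines. Note the caveats: this uses one of several valid exact splits into a numerator effect and a denominator effect, the split convention is a choice (not *the* canonical one, and not necessarily the one Max uses), and the example numbers are invented:

```python
# Exact decomposition of a ratio KPI change into numerator and denominator effects.
# One of several valid exact splits; the convention here is a choice, for illustration.
def decompose_ratio_change(n0, d0, n1, d1):
    numerator_effect = (n1 - n0) / d0             # numerator moved, denominator held at old value
    denominator_effect = n1 * (1 / d1 - 1 / d0)   # denominator moved, numerator held at new value
    return numerator_effect, denominator_effect

# Invented example: conversion rate moves from 50/1000 (5%) to 66/1100 (6%).
num_eff, den_eff = decompose_ratio_change(50, 1000, 66, 1100)
# num_eff + den_eff equals the observed change (0.06 - 0.05 = 0.01) exactly,
# which is what makes this kind of split useful for "why did the KPI change?".
```

The two effects always sum to the total change, so an analyst can report "the rate moved +1pt: +1.6pt from more conversions, -0.6pt from more traffic" without any residual term.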
<h3 id="drama">Drama</h3>
<ul><li><strong>Instacart's Snowflake bills </strong>— When public companies publish results, the numbers get looked at. This time Instacart's bills have been scrutinized. The company said it <a href="https://twitter.com/modestproposal1/status/1695177654822191184?ref=blef.fr">spent</a> $13m, $28m and $51m on Snowflake in 2020, 2021 and 2022 respectively, and plans to spend $15m in 2023. <br><br>Some people supposed Instacart found the magic solution to reduce costs, others said it <a href="https://twitter.com/GergelyOrosz/status/1697192807801184561?ref=blef.fr">migrated</a> to Databricks. But the main reason is: prepaid credits. The <a href="https://www.snowflake.com/blog/snowflake-and-instacart-the-facts/?ref=blef.fr">Snowflake press</a> team even wrote a post.<br><br>Still, you can watch the perfectly timed video about <a href="https://www.youtube.com/watch?v=up3bTjrBvTA&ref=blef.fr">How Instacart Optimized Snowflake Costs by 50%</a> or read about <a href="https://engineering.hellofresh.com/data-driven-snowflake-optimisation-at-hellofresh-55a5b56aa9af?ref=blef.fr">Snowflake optimisation at HelloFresh</a>.</li><li><a href="https://thenewstack.io/hashicorp-abandons-open-source-for-business-source-license/?ref=blef.fr"><strong>Hashicorp changed Terraform's license model</strong></a> — Hashicorp decided to move from the Mozilla Public License to the Business Source License (BSL). BSL is source-available and not really open-source. Following the announcement, OpenTF <a href="https://www.theregister.com/2023/08/28/opentf_forks_terraform_code/?ref=blef.fr">forked</a> the repo.</li></ul>
<h3 id="data-platform-stuff">Data platform stuff</h3>
<p>4 articles that give food for thought about the future of the data field.</p>
<ul><li><a href="https://materialize.com/blog/warehouse-abuse/?ref=blef.fr">The uses and abuses of cloud data warehouses</a> — A streaming database saying to a batch database: "you're not suited for operational use-cases, only analytical". The batch database answered one day later.</li><li><a href="https://mattpalmer.io/posts/level-up-medallion-architecture/?ref=blef.fr">Level-up with a Medallion architecture</a> — bronze, silver and gold are the structuring layers of the Medallion architecture. Matt explains it for you.</li><li><a href="https://moderndata101.substack.com/p/the-data-contract-pivot-in-data-engineering-8bb?ref=blef.fr">The data contract pivot in data engineering</a> — It's a fancy name, but it aims to solve upstream data problems with a technical + process solution.</li><li><a href="https://substack.timodechau.com/p/after-the-modern-data-stack-welcome?ref=blef.fr">After the modern data stack: welcome back, data platforms</a>.</li></ul>
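To picture the bronze, silver, gold layering mentioned above, here is a toy Python sketch of the flow (the fields and cleaning rules are invented for illustration): bronze keeps the raw rows as ingested, silver cleans and deduplicates, gold aggregates for the business.

```python
# Toy medallion flow: bronze (raw) -> silver (cleaned, deduplicated) -> gold (aggregated).
from collections import defaultdict

bronze = [  # raw events as ingested, duplicates and bad rows included
    {"order_id": "o1", "amount": "10.0", "country": "FR"},
    {"order_id": "o1", "amount": "10.0", "country": "FR"},  # duplicate
    {"order_id": "o2", "amount": None, "country": "FR"},    # bad row, no amount
    {"order_id": "o3", "amount": "5.5", "country": "DE"},
]

def to_silver(bronze_rows):
    """Drop bad rows, cast types, deduplicate on the business key."""
    seen, silver = set(), []
    for row in bronze_rows:
        if row["amount"] is None or row["order_id"] in seen:
            continue
        seen.add(row["order_id"])
        silver.append({**row, "amount": float(row["amount"])})
    return silver

def to_gold(silver_rows):
    """Business-level aggregate: revenue per country."""
    revenue = defaultdict(float)
    for row in silver_rows:
        revenue[row["country"]] += row["amount"]
    return dict(revenue)

gold = to_gold(to_silver(bronze))
```

Keeping bronze untouched is the point of the architecture: when a silver rule turns out to be wrong, you replay from the raw layer instead of re-ingesting from sources.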
<p></p>
<h1 id="data-economy-%F0%9F%92%B0">Data Economy 💰</h1>
<ul><li><a href="https://techcrunch.com/2023/08/24/hugging-face-raises-235m-from-investors-including-salesforce-and-nvidia/?ref=blef.fr" rel="noreferrer"><strong>Hugging Face</strong> raises $235m in Series D</a>. You can see Hugging Face as the GitHub of machine learning models, but it's much more today: a global platform to distribute AI—in every form possible. Obviously, with the new popularity of Generative AI models, HF is playing a key distribution role.</li><li><a href="https://www.stemma.ai/blog-post/stemma-teradata?ref=blef.fr"><strong>Stemma</strong> has been acquired by Teradata</a>. Stemma is a company founded by ex-Lyft employees who worked on the company's data catalog <a href="https://github.com/amundsen-io/amundsen?ref=blef.fr">Amundsen</a>. Mainly, Stemma is built on top of Amundsen with enterprise features. Consolidation.</li><li><a href="https://rockset.com/?ref=blef.fr"><strong>Rockset</strong></a> <a href="https://rockset.com/press/rockset-raises-44-million-to-power-search-analytics-and-ai-applications/?ref=blef.fr">raises $44m in Series B</a>. Rockset is a real-time search (and analytics) database aiming to replace Elastic. Like Elastic, but in the cloud.</li><li><a href="https://www.prnewswire.com/news-releases/ikigai-labs-announces-25m-in-series-a-funding-to-bring-generative-ai-for-tabular-data-to-all-enterprises-301908366.html?tc=eml_cleartime&ref=blef.fr"><strong>Ikigai Labs</strong> raises $25m in Series A</a>. Ikigai provides a web platform to do data transformations in a visual way on top of tabular data. You can do entity resolution or forecasting, for instance.</li><li><a href="https://dagster.io/blog/introducing-dagster-labs?ref=blef.fr">Elementl becomes <strong>Dagster Labs</strong></a>, to make it clear. I'm announcing blef Labs soon.</li><li>The Information reported that <a href="https://www.theinformation.com/articles/openai-passes-1-billion-revenue-pace-as-big-companies-boost-ai-spending?ref=blef.fr"><strong>Open AI</strong> will pass $1b in annual revenue</a> "over the next 12 months".</li></ul>
<hr>
<p>Feels good to be back, see you next week ❤️. I hope you enjoyed your summer.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ The fancy data stack—batch version ]]></title>
                    <description><![CDATA[ Data News Summer Edition — Design the fancy data stack to explore the Tour de France data. ]]></description>
                    <link><![CDATA[ /the-fancy-data-stack/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 64c38732ebe10c0001212e1b ]]></guid>
                    <pubDate><![CDATA[ 2023-08-04 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/photo-1688325923282-f75db5d6695e?ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&amp;auto=format&amp;fit=crop&amp;w=1000&amp;q=80" class="kg-image" alt="a harbor filled with lots of boats on top of water" loading="lazy"><figcaption><span style="white-space: pre-wrap;">Summer Edition (</span><a href="https://unsplash.com/photos/a-harbor-filled-with-lots-of-boats-on-top-of-water-uORt2vJMTSk?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p>This is the first article of the <strong>Data News Summer Edition: how to build a data platform</strong>. I tried to be as short as possible in this first article; details will come in the following ones.</p><p>The modern data stack has been criticised a lot: a few say it's dead, others say we are in the post-modern era. The modern data stack, as a collection of tools that interact to serve data to consumers, is still relevant. Personally I think that the modern data stack is characterised by having a central data storage in which everything happens.</p><p><strong>Let's design the most complete modern data stack, or rather the fancy data stack.</strong></p><p>In this article we will try to design the fancy data stack for a batch usage. A lot of logos and products will be mentioned. This is not a paid article. However, over the years I've met people working at these companies so I might have a few biases.</p><p>As a disclaimer, this may not quite make sense in a corporate context, but since this is my blog, I'll do what I want. 
Still, the idea of this post is to give you an overview of existing tools and how everything fits together.</p><div class="kg-card kg-callout-card kg-callout-card-yellow"><div class="kg-callout-emoji">💡</div><div class="kg-callout-text">If you just want a few articles to read, just go to the bottom of the email.</div></div><p></p><h1 id="a-few-requirements">A few requirements</h1><ul><li>The source data lies in a Postgres database, in flat CSVs and in Google Sheets.</li><li>I want something cloud agnostic—when possible.</li><li>I want to use open-source tooling.</li><li>Everything I do should be production-ready and public. At the end of the experiment you should be able to access the tools—when possible.</li></ul><p></p><h1 id="source-data">Source data</h1><p>When I was looking for data, I wanted a bit of volume, something geographical and without PII. I personally like the <a href="https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page?ref=blef.fr">NYC Taxi trip data</a> but sadly it has been used many times, which removes a bit of the fun. At the same time the Tour de France was ongoing and I found a "way" to get Strava data for Tour athletes. So I thought it was the perfect data to build a data platform.</p><p>Mainly there are 3 datasets:</p><ul><li><strong>Athletes</strong> — all the data about the athletes like their race ids, teams, their profile but also their body size. It will be a Google Sheets.</li><li><strong>Stages</strong> — le Tour de France is a 3-week race with 21 stages; every stage is a GPS path with a few checkpoints. It will be 21 CSVs.</li><li><strong>Race</strong> — the actual race data, which is a GPS data point every second for each athlete on Strava + other data points sometimes. It represents almost half of the peloton. It will be a table in Postgres. 
Postgres is not the best solution for this, but as I want to mimic an enterprise context, having a Postgres database is kinda mandatory.</li></ul><p>Race data will be partitioned per day, but as the Tour is already done, it will be a bit different from a real-life environment. Still, this is something I keep in mind for future trainings. <strong>Because I'm convinced that to learn data engineering you need to experience real-life pipelines running every day, including the morning firefighting</strong>.</p><p>I'll delve into the data in the next article, but I won't detail how I got the data because, you know.... 🏴‍☠️. Actually, it's just a few Python scripts and a bit of F12, but that's not the point of this article.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/08/00---Data-platform-1-.png" class="kg-image" alt="" loading="lazy" width="1970" height="1028" srcset="https://www.blef.fr/content/images/size/w600/2023/08/00---Data-platform-1-.png 600w, https://www.blef.fr/content/images/size/w1000/2023/08/00---Data-platform-1-.png 1000w, https://www.blef.fr/content/images/size/w1600/2023/08/00---Data-platform-1-.png 1600w, https://www.blef.fr/content/images/2023/08/00---Data-platform-1-.png 1970w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Source data (Postgres, CSV and Sheets)</span></figcaption></figure><p></p><h1 id="the-fancy-data-platform">The fancy data platform</h1><p>In order to have a complete data platform we will need to move the data from source to consumption. 
But what will the consumption look like?</p><p>I want to answer multiple use-cases:</p><ul><li>Create a dashboard to explore stage results</li><li>Provide an LLM-driven bot that answers common questions about the race</li><li>Compare 2 athletes' performance on a specific segment and generate a GIF</li></ul><p>In order to answer this we will need to <strong>ingest data from the multiple sources</strong>, then <strong>transform and model the data in the chosen data storage</strong> and finally <strong>develop consumer apps</strong> to answer the business needs.</p><p>Let's try to throw out a first design of our application—with logos. Obviously this can be subject to change, either because it's too complicated or because I want to change something. Once again this is fiction so I can afford to change stuff. </p><p>Actually, one of my main pieces of advice is that <strong>you should never be strict about tech choices because you can't plan the unexpected</strong>. So do yourself a favour and be willing to throw away something that does not work for you.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/08/00---Data-platform-2-.png" class="kg-image" alt="" loading="lazy" width="2000" height="839" srcset="https://www.blef.fr/content/images/size/w600/2023/08/00---Data-platform-2-.png 600w, https://www.blef.fr/content/images/size/w1000/2023/08/00---Data-platform-2-.png 1000w, https://www.blef.fr/content/images/size/w1600/2023/08/00---Data-platform-2-.png 1600w, https://www.blef.fr/content/images/size/w2400/2023/08/00---Data-platform-2-.png 2400w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">The fancy data stack</span></figcaption></figure><p>Just for the sake of being open, there are a lot of alternatives and my choices could have been different. 
Here is what you can also consider if you're building your own platform.</p><ul><li><strong>Extraction</strong><br>Open-source — <a href="https://dagster.io/?ref=blef.fr">Dagster</a>, <a href="https://airbyte.com/?ref=blef.fr">Airbyte</a>, <a href="https://airflow.apache.org/?ref=blef.fr">Airflow</a>, <a href="https://www.prefect.io/?ref=blef.fr">Prefect</a>, <a href="https://mage.ai/?ref=blef.fr">Mage</a>, <a href="https://kestra.io/?ref=blef.fr">Kestra</a>, <a href="https://dlthub.com/?ref=blef.fr" rel="noreferrer">dltHub</a><br>SaaS ($) — <a href="https://www.stitchdata.com/?ref=blef.fr">Stitch</a>, <a href="https://portable.io/?ref=blef.fr">Portable</a>, <a href="https://www.getorchestra.io/?ref=blef.fr" rel="noreferrer">Orchestra</a> and the cloud versions of the OS tools<br>SaaS ($$) — <a href="https://www.fivetran.com/?ref=blef.fr">Fivetran</a></li><li><strong>Transformation</strong><br>SQL — <a href="https://github.com/dbt-labs/dbt-core?ref=blef.fr">dbt</a>, <a href="https://sqlmesh.readthedocs.io/en/stable/?ref=blef.fr">SQLMesh</a><br>Python — <a href="https://pandas.pydata.org/?ref=blef.fr">pandas</a>, <a href="https://www.pola.rs/?ref=blef.fr">polars</a><br>Distributed — <a href="https://spark.apache.org/?ref=blef.fr">Spark</a>, <a href="https://github.com/pathwaycom/pathway?ref=blef.fr">Pathway</a></li><li><strong>Datalake</strong><br>Open-source — <a href="https://min.io/?ref=blef.fr">MinIO</a>, <a href="https://ceph.io/en/?ref=blef.fr">Ceph</a>, <a href="https://lakefs.io/?ref=blef.fr">LakeFS</a>, <a href="https://github.com/open-io?ref=blef.fr">OpenIO</a><br>SaaS ($) — S3, Google Cloud Storage, Azure Blob Storage<br>Table format — <a href="https://iceberg.apache.org/?ref=blef.fr">Apache Iceberg</a>, <a href="https://hudi.apache.org/?ref=blef.fr">Apache Hudi</a>, <a href="https://delta.io/?ref=blef.fr">Delta</a></li><li><strong>Warehouse</strong><br>Open-source — <a href="https://duckdb.org/?ref=blef.fr">DuckDB</a>, <a 
href="https://clickhouse.com/?ref=blef.fr">ClickHouse</a>, <a href="https://pinot.apache.org/?ref=blef.fr">Apache Pinot</a>, <a href="https://kylin.apache.org/?ref=blef.fr">Apache Kylin</a>, <a href="https://doris.apache.org/?ref=blef.fr">Apache Doris</a><br>SaaS ($) — <a href="https://cloud.google.com/bigquery?ref=blef.fr">BigQuery</a>, <a href="http://snowflake.com/?ref=blef.fr">Snowflake</a></li><li><strong>Semantic Layer</strong><br>Open-source — <a href="https://cube.dev/?ref=blef.fr">Cube</a>, <a href="https://www.malloydata.dev/?ref=blef.fr">Malloy</a>, <a href="https://github.com/alash3al/sqler?ref=blef.fr">sqler</a><br>SaaS ($) — <a href="https://www.getdbt.com/product/semantic-layer/?ref=blef.fr">dbt Cloud</a>, <a href="https://cloud.google.com/looker/docs/what-is-lookml?ref=blef.fr#:~:text=LookML%20stands%20for%20Looker%20Modeling,relationships%20in%20your%20SQL%20database.">LookML</a></li><li><strong>Governance</strong><br>Open-source — <a href="https://datahubproject.io/?ref=blef.fr">Datahub</a>, <a href="https://openlineage.io/?ref=blef.fr">OpenLineage</a>, <a href="https://open-metadata.org/?ref=blef.fr">OpenMetadata</a><br>SaaS ($) — <a href="https://www.castordoc.com/?ref=blef.fr">CastorDoc</a>, <a href="https://atlan.com/?ref=blef.fr">Atlan</a></li><li><strong>Analytics</strong><br>Open-source — <a href="https://superset.apache.org/?ref=blef.fr">Superset</a>, <a href="https://www.metabase.com/?ref=blef.fr">Metabase</a>, <a href="https://www.lightdash.com/?ref=blef.fr">Lightdash</a><br>SaaS ($) — <a href="https://www.tableau.com/?ref=blef.fr">Tableau</a>, <a href="https://cloud.google.com/looker?ref=blef.fr">Looker</a>, <a href="https://powerbi.microsoft.com/fr-fr/?ref=blef.fr">PowerBI</a>, <a href="https://whaly.io/?ref=blef.fr">Whaly</a> and the cloud version of the open-source tools</li><li><strong>Exploration</strong><br>Open-source — <a href="https://streamlit.io/?ref=blef.fr">Streamlit</a>, <a 
href="https://jupyter.org/?ref=blef.fr">Jupyter</a><br>SaaS — <a href="https://hex.tech/?ref=blef.fr">Hex</a>, <a href="https://www.graphext.com/?ref=blef.fr">Graphext</a>, <a href="https://www.husprey.com/?ref=blef.fr">Husprey</a>, <a href="https://count.co/?ref=blef.fr">Count</a> (etc. this list can become infinite)</li></ul><p></p><h1 id="conclusion">Conclusion</h1><p>After this design exercise I have mixed feelings. I think this is a fancy stack because I tried to put everything inside, but at the same time I find it quite boring. Like this is just stuff that works. This is linear: I'll move data from A to B to C in order to use it with D. Actually this is just modern data engineering.</p><p>In the following parts of this series you'll follow my adventures in extraction, transformation and serving for analytics and Gen AI usage.</p><p>I hope you'll enjoy this Data News Summer Edition.</p><p></p><h1 id="faq-and-remarks">FAQ and remarks</h1><ul><li><strong>Why do you use Google Cloud?</strong><br>Because my credit card is already in place and I'll be much faster. My opinion on the matter is this: all clouds are born equal, you just have to find the one you're most comfortable with, or suffer your company's choices.</li><li><strong>DuckDB is not really a data warehouse.</strong><br>I picked DuckDB because it's fancy. 
I think I'm gonna hit some limitations, especially in geo compute, so I might switch to ClickHouse or BigQuery if I run out of time.</li><li><strong>I hate Github actions, but I prefer putting code in public on Github.</strong></li><li><strong>I used the way of visualising data platforms </strong><a href="https://about.gitlab.com/handbook/business-technology/data-team/platform/?ref=blef.fr#our-data-stack"><strong>the Gitlab data team is using</strong></a><strong>.</strong></li><li><strong>What about the performance of the platform?</strong><br>I don't really care about performance, because this is not large data and I don't want to spend hours optimising for performance.</li><li><strong>Do you have a budget?</strong><br>Something reasonable. I think ~100€ / month is ok for this experiment.</li><li><strong>What will you do in the LLM category?</strong><br>I don't know yet. If you have ideas about what I can do, reach out.</li><li><strong>Why Dagster?</strong><br>I've been building things with Airflow for almost 5 years, I love trying new things and in the list of orchestrators that have hyped me the most, Dagster is number one. Software-defined assets are something I wanted to play with.</li></ul><hr><h3 id="small-fast-news-%E2%9A%A1%EF%B8%8F">Small Fast News ⚡️</h3><p>If you don't care about this, here are a few articles you might want to read by the pool.</p><ul><li><a href="https://eczachly.substack.com/p/how-to-data-model-correctly-kimball?ref=blef.fr">How to model: Kimball vs One Big Table</a> — This is one of the main topics of discussion in the data space. 
Should you go for dimensional modeling, for OBT, or even for <a href="https://seattledataguy.substack.com/?ref=blef.fr">query-driven data modeling</a> (coined by Joe Reis—who's writing a book about data modeling)?</li><li><a href="https://engineering.linkedin.com/blog/2023/costwiz--saving-cost-for-linkedin-enterprise-on-azure?ref=blef.fr">Costwiz, Saving cost for LinkedIn enterprise on Azure</a> — LinkedIn developed a complete data platform to save costs on Azure.</li><li><a href="https://confidence.spotify.com/?ref=blef.fr">Confidence — An experimentation platform from Spotify</a> — After years of experience building experiments, Spotify decided to release a product for others to do the same. This is in private beta and the move is interesting.</li><li><a href="https://medium.com/walmartglobaltech/duckdb-vs-the-titans-spark-elasticsearch-mongodb-a-comparative-study-in-performance-and-cost-5366b27d5aaa?ref=blef.fr">DuckDB vs. Spark, ElasticSearch and MongoDB</a> — Even if comparing it to NoSQL databases is not really relevant, the tests show DuckDB in a good light.</li><li><a href="https://engineeringblog.yelp.com/2023/07/overview-of-jupyterhub-ecosystem.html?ref=blef.fr">Overview of JupyterHub ecosystem</a> — Just saving this for me because I do stuff on it.</li><li>Read <a href="https://www.dataengineeringweekly.com/?ref=blef.fr">Data Engineering Weekly</a>.</li></ul><hr><p>See you next week ❤️. 
</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/photo-1602566178436-8cf72756f4cb?ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&amp;auto=format&amp;fit=crop&amp;w=1000&amp;q=80" class="kg-image" alt="red and white floral gift boxes" loading="lazy"><figcaption><span style="white-space: pre-wrap;">Be kind to me, this is my birthday (</span><a href="https://unsplash.com/photos/1HIKnKtXEU0?ref=blef.fr" rel="noreferrer"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — mid-2023 popular articles ]]></title>
                    <description><![CDATA[ Data News #23.30 — popular articles since the beginning of the year. ]]></description>
                    <link><![CDATA[ /data-news-week-23-30/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 64c245200c042f0001f16179 ]]></guid>
                    <pubDate><![CDATA[ 2023-07-28 ]]></pubDate>
                    <content>
                        <![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/photo-1513622470522-26c3c8a854bc?ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&amp;auto=format&amp;fit=crop&amp;w=1000&amp;q=80" class="kg-image" alt="two gray and black boats near dock" loading="lazy"><figcaption><span>🧜‍♂️ (</span><a href="https://unsplash.com/photos/3_ZGrsirryY?ref=blef.fr" rel="noreferrer"><span>credits</span></a><span>)</span></figcaption></figure>
<p>Hey, this is a mid-2023 edition with some of my favourite articles and the popular articles that have been shared this year in the newsletter. There isn't any fancy calculation on how to find the popular articles. Here is how it's done.</p>
<p>Every link sent in each newsletter is tracked in 2 ways:</p>
<ul><li>when you click on a link it first redirects you to my blog, so I know that you've clicked on it</li><li>it adds <em>ref=blef.fr</em> to the url, so the original article knows that the traffic comes from me; it's also a great way to support me by making me discoverable to others</li></ul>
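<p>The two mechanisms above can be sketched in a few lines of Python. Note this is a hypothetical reconstruction: the redirect endpoint (<em>blef.fr/r/?to=</em>) is made up for illustration; only the <em>ref=blef.fr</em> query parameter is real.</p>

```python
# Hypothetical sketch of the newsletter's link tracking.
from urllib.parse import parse_qsl, quote_plus, urlencode, urlsplit, urlunsplit

def add_ref(url: str, ref: str = "blef.fr") -> str:
    """Append ref=<ref> to the query string so the target site
    can attribute the traffic to the newsletter."""
    parts = urlsplit(url)
    query = dict(parse_qsl(parts.query))
    query["ref"] = ref
    return urlunsplit(parts._replace(query=urlencode(query)))

def tracked(url: str, redirect_base: str = "https://www.blef.fr/r/?to=") -> str:
    """Wrap the link in a first-party redirect so the click is
    counted before the reader lands on the article."""
    return redirect_base + quote_plus(add_ref(url))

print(tracked("https://example.com/post"))
```

<p>Click counting then reduces to tallying hits on the redirect endpoint, grouped by the <em>to</em> parameter.</p>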
<p>I've used the click data to sort articles by popularity. Obviously it has a few biases—recent editions get more clicks because I have more subscribers—but the impact is minimal.</p>
<p>A few numbers. Since the beginning of the year I've shared around <strong>500 articles</strong>, which generated at least <strong>22k views</strong> on creators' articles. I say at least because this is a low estimate; extrapolating from experience, I think the real number is twice that.</p>
<p>If you have travel time, I also recommend the first episode of <a href="https://podcasters.spotify.com/pod/show/blef/episodes/Episode-1--Joe-Reis-e23mt2h?ref=blef.fr">Data Minds, my podcast, with Joe Reis</a>.</p>
<h1 id="popular-articles">Popular articles</h1>
<p>I have sorted the articles by bucket. The order does not really make sense; they were all popular.</p>
<h3 id="general">General</h3>
<ul><li>💰 Because we all love money, Mikkel's <a href="https://www.synq.io/blog/europe-data-salary-benchmark-2023?ref=blef.fr">Europe data salary benchmark</a> was the most viewed. In the article he shares salaries extracted from job listings, using dimensions like seniority, location and companies.</li><li>📃 In every data team it is super important to write documentation, and Marie wrote an awesome <a href="https://towardsdatascience.com/data-documentation-101-why-how-for-whom-927311354a92?ref=blef.fr">101 about data documentation</a>. The article gives best practices for establishing complete and reliable data documentation. </li><li>🎰 <a href="https://locallyoptimistic.com/post/reducing-the-lottery-factor-for-data-teams/?ref=blef.fr">Reducing the lottery factor</a>, also named the bus factor, is a risk measurement about knowledge sharing. In data teams a lot of work has to be done in the early days to avoid knowledge being lost later on. The article gives ~10 pieces of advice to lower the risks. Among them I like the changelog, the pair-programming, the pre-recorded videos and the stable credentials.</li><li>🌎 <a href="https://datajourneymanifesto.org/?ref=blef.fr">The data journey manifesto</a> puts principles on the data journey to avoid the mess in production. There are 11 principles and 11 new ideas to create a healthy platform. For instance <em>you should not trust your data providers</em> and <em>what worked last week will not work today</em>.</li></ul>
<h3 id="modern-data-stack">Modern data stack</h3>
<ul><li>🔮 <a href="https://databased.pedramnavid.com/p/the-future-of-data?ref=blef.fr">The future of data</a> by Pedram. 3 takes on the future of data teams. I really like Pedram, he tweets a lot—or should we say posts on X—and gives great advice with humour. Mainly the article says that we finally address ops teams, that the semantic layer is the next big battle and that business logic management is a mess. He also recently joined the Dagster team in DevRel.</li><li>🔥 Matt gives <a href="https://mattpalmer.io/posts/hot-takes?ref=blef.fr">5 hot takes on the modern data stack</a>. I don’t totally agree with everything. This is about Redshift, Airflow, Airbyte, dbt and production.</li><li>🧱 A good summary of the required blocks composing <a href="https://technically.substack.com/p/whats-the-modern-data-stack?ref=blef.fr">the modern data stack</a>.</li></ul>
<h3 id="technical-deep-dive">Technical deep-dive</h3>
<ul><li>🏗️ Simon wrote an excellent 3-part data modeling deep-dive: an <a href="https://airbyte.com/blog/data-modeling-unsung-hero-data-engineering-introduction?ref=blef.fr">introduction to data modeling</a>, the <a href="https://airbyte.com/blog/data-modeling-unsung-hero-data-engineering-approaches-and-techniques?ref=blef.fr">different techniques</a> and the <a href="https://airbyte.com/blog/data-modeling-unsung-hero-data-engineering-architecture-pattern-tools?ref=blef.fr">tools and future</a>.</li><li>📑 Data contracts were very trendy this year. I also think they are quite useful. <a href="https://github.com/paypal/data-contract-template?ref=blef.fr">PayPal released their template for data contracts</a>. This is an exhaustive list of what you can expect in a contract: schema, quality, SLAs, security and custom properties.</li><li>👨‍🏫 Count.co designed 2 amazing boards. You can <a href="https://count.co/canvas/pB7iGb4yyi2?ref=blef.fr">learn SQL</a> or follow a guide to <a href="https://count.co/canvas/vWnN0JCglDd?ref=blef.fr">hire your data team</a>.</li><li>🐍 Finally, a few useful <a href="https://www.startdataengineering.com/post/code-patterns?ref=blef.fr">code patterns in Python</a>.</li></ul>
<hr>
<p>See you next week ❤️ and I wish you great holidays.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 23.29 ]]></title>
                    <description><![CDATA[ Data News #23.29 — Hightouch and Unstructured fundraising, data as a game, dbt and ChatGPT, OpenHouse the new warehouse. ]]></description>
                    <link><![CDATA[ /data-news-week-23-29/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 64ba770ff0d3f20001236ef1 ]]></guid>
                    <pubDate><![CDATA[ 2023-07-22 ]]></pubDate>
                    <content>
                        <![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/photo-1469854523086-cc02fe5d8800?ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&amp;auto=format&amp;fit=crop&amp;w=1000&amp;q=80" class="kg-image" alt="yellow Volkswagen van on road" loading="lazy"><figcaption><span>See you on the road (</span><a href="https://unsplash.com/photos/A5rCN8626Ck?ref=blef.fr" rel="noreferrer"><span>credits</span></a><span>)</span></figcaption></figure>
<p>Hey, I hope this newsletter finds you well. This is a small blogpost to give you a few reads while waiting for your next travel. We can already feel summer: I found fewer articles entering the selection this week.</p>
<p>Also be ready for the <em>Data News: Summer Edition</em>. For the next 5 releases it will be a bit different from usual: less curation and more original articles written in advance to allow me to take a break.</p>
<p>You'll—probably—get:</p>
<ul><li>A 2023 must-read articles list</li><li>How to create a batch data platform—using Tour de France data—from ingestion to visualisation using all the fancy tools the data world can offer (in 2 or 3 parts)</li><li>Docker for data people</li><li>The disappearance of the data engineer</li></ul>
<p></p>
<h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1>
<figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/photo-1577741314755-048d8525d31e?ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&amp;auto=format&amp;fit=crop&amp;w=1000&amp;q=80" class="kg-image" alt="red Sony PS DualShock 4" loading="lazy"><figcaption><span>Give a controller to your stakeholders (</span><a href="https://unsplash.com/photos/YsPnamiHdmI?ref=blef.fr" rel="noreferrer"><span>credits</span></a><span>)</span></figcaption></figure>
<ul><li><a href="https://roundup.getdbt.com/p/for-the-love-of-the-game?ref=blef.fr">For the love of the game</a> — Winnie from dbt Labs wrote a great post about seeing data as a game, analytics being the game design. What if we conceived data as a game for our consumers and not as a linear tool for boring actions? In the article the author also shares <a href="https://www.aranke.org/dbt-jquery/?ref=blef.fr">dbt is jQuery, not Terraform</a>, which awesomely describes how dbt helps you enter flow state for data work.</li><li><a href="https://benn.substack.com/p/how-an-acquisition-fails?ref=blef.fr">How an acquisition fails</a> — It's been a long time since I've shared Benn's articles, but as always I can't recommend him enough. This time it's about tech acquisitions and what can be done to fail—or succeed.</li><li><a href="https://www.advancinganalytics.co.uk/blog/2023/7/17/fabric-end-to-end-implementation?ref=blef.fr">Microsoft Fabric: An end to end implementation</a> — A first—blurred—glimpse of Microsoft Fabric capabilities: Jordan reads data from Sharepoint and Azure Storage, then transforms it using PySpark to visualise stuff in PowerBI. Classically boring stuff.</li><li><a href="https://www.entechlog.com/blog/data/chat-with-data-in-snowflake-using-chatgpt-dbt-and-streamlit/?ref=blef.fr">How to chat with data in Snowflake using ChatGPT, dbt, and Streamlit</a> — Less boring; obviously when you put ChatGPT and dbt in the same sentence it creates buzz instantly. This is an interesting demo of how you can quickly build a chat experience—using OpenAI—on top of your data models.</li><li><a href="https://postgresml.org/blog/llm-based-pipelines-with-postgresml-and-dbt?ref=blef.fr">LLM based pipelines with PostgresML and dbt</a> — Mainly for me this is a discovery of PostgresML, an open-source extension that brings ML functions to the database. As cloud databases like Snowflake and BigQuery brought this years ago, it was mandatory for the Postgres stack. 
The article shows that you can then run transformers or embeddings directly from dbt.</li><li><a href="https://engineering.linkedin.com/blog/2023/taking-charge-of-tables--introducing-openhouse-for-big-data-mana?ref=blef.fr">Taking charge of tables: introducing OpenHouse for big data management</a> — New data product at LinkedIn: OpenHouse. OpenHouse sits on top of the LakeHouse to bring a control plane to manage Iceberg files. It reminds me of something... We used to call it a warehouse back in the days.</li><li><a href="https://twitter.com/ClementDelangue/status/1680942084855943168?ref=blef.fr">Models on HuggingFace</a> — Clement, the CEO of HuggingFace, congratulates the community and himself because a lot of public models are hosted on HuggingFace; it shows how fast and deep things are going.</li><li><a href="https://observablehq.com/@observablehq/plot-gallery?ref=blef.fr">Plot Gallery on Observable</a> — I'm not often a fan, but Mike Bostock is different. He created d3.js while at the New York Times, where he brought something unique to digital data visualisation. More recently he co-founded Observable, which is an awesome tool to do visualisations, and the plot gallery makes me envious—while quite simplistic. </li></ul>
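<p>The "chat with your data" demo mentioned above boils down to one trick: hand the LLM your warehouse schema and let it write the SQL. A minimal sketch of the prompt-building step, with a made-up schema for illustration (a real setup would send <em>prompt</em> to the OpenAI API and execute the returned query against Snowflake):</p>

```python
# Hypothetical text-to-SQL prompt builder; schema and wording are
# illustrative, not taken from the linked article.
SCHEMA = """\
table orders(order_id int, customer_id int, amount float, ordered_at date)
table customers(customer_id int, name text, country text)"""

def build_prompt(question: str, schema: str = SCHEMA) -> str:
    # The LLM only needs the table definitions and the question;
    # the last line constrains the answer to raw SQL.
    return (
        "You are a SQL assistant for Snowflake.\n"
        f"Schema:\n{schema}\n"
        f"Question: {question}\n"
        "Answer with a single SQL query, nothing else."
    )

prompt = build_prompt("What is the total order amount per country?")
print(prompt)
```

<p>The dbt angle in the article is that the schema fed to the model comes from your documented dbt models, which keeps the bot aligned with your actual semantics.</p>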
<p></p>
<h1 id="data-economy-%F0%9F%92%B0">Data Economy 💰</h1>
<ul><li><a href="https://hightouch.com/blog/funding-announcment-customer-360-toolkit?ref=blef.fr"><strong>Hightouch</strong> raises $38m</a> in a Venture round. Hightouch has been primarily known for its reverse ETL solution. With the money the team announced a new suite of tools to activate customers in the warehouse. You can see it as a CDP—customer data platform—in your warehouse. It means you get a unified view of customers across all your tables.</li><li><a href="https://www.polaranalytics.com/?ref=blef.fr"><strong>Polar Analytics</strong></a> <a href="https://techcrunch.com/2023/07/18/polar-analytics-9m-shopify-brands-ecommerce/?ref=blef.fr">raises $9m Series A</a>. Polar Analytics is a vertical SaaS to provide analytics for Shopify vendors. This is less data engineering oriented, but I still find it interesting to see a "reporting" product raising money. Also, vertical products like this can give marketplaces ideas on what great reporting can look like.</li><li><a href="https://unstructured.io/?ref=blef.fr"><strong>Unstructured</strong></a> <a href="https://techcrunch.com/2023/07/19/unstructured-which-offers-tools-to-prep-enterprise-data-for-llms-raises-25m/?guccounter=1&ref=blef.fr">raises $25m Series A</a> to build ETL for LLMs. Unstructured wants to give you the ETL toolkit to use complex company data like HTML, PDF, CSV, PNG, PPTX, as they say on their site. Personally I did not know that CSV was a complex source of data but ok. To be honest at the moment it looks like a fancy text extractor.</li></ul>
<hr>
<p>See you next week ❤️.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 23.28 ]]></title>
                    <description><![CDATA[ Data News #23.28 — Elon Musk new company xAI, AP gives access to text archive to OpenAI, Sidekick, strikes and BigQuery costs. ]]></description>
                    <link><![CDATA[ /data-news-week-23-28/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 64aff9c023904c00010c5525 ]]></guid>
                    <pubDate><![CDATA[ 2023-07-15 ]]></pubDate>
                    <content>
                        <![CDATA[ <figure class="kg-card kg-image-card"><img src="https://images.unsplash.com/photo-1554900773-4dd76725f876?ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&amp;auto=format&amp;fit=crop&amp;w=1000&amp;q=80" class="kg-image" alt="four floors building with stairs" loading="lazy"><figcaption><span>Have fun training models on this (</span><a href="https://unsplash.com/photos/j-0olYcaihg?ref=blef.fr" rel="noreferrer"><span>credits</span></a><span>)</span></figcaption></figure>
<p>Hey, it's Saturday. I hope you're enjoying July, taking a deserved break, reading data engineering articles at the beach or traveling to unknown places. Sometimes there are Fridays when I don't find any glue between the articles for the newsletter; I have an idea of something to compensate, but it takes me the whole Friday to explore it.</p>
<p>And here we are on Saturday. Yesterday I found a way to get sensor data for half of the Tour de France peloton, and I was sure it was a good dataset to explore new tools with. It's honestly a great dataset, but it's a bit hard to download and format all the data for exploration. So it will be for later.</p>
<p>Anyway, here is a quick press roundup of a few news items and articles.</p>
<p></p>
<h1 id="gen-ai-%F0%9F%A4%96">Gen AI 🤖</h1>
<ul><li>Elon Musk announced <a href="https://x.ai/?ref=blef.fr">xAI</a>, his new company, to show that he's better than the rest. He hired alumni from all the AI companies (e.g. Deep Mind, Google, OpenAI, etc.). They held a 2-hour Twitter Space in which they detailed the vision a little. It's mainly about building an AGI capable of understanding the universe. They say we are a few weeks away from their first release. Here is a great <a href="https://twitter.com/EdKrassen/status/1679971231280365568?ref=blef.fr">summary of the space</a>.</li><li><a href="https://www.ap.org/press-releases/2023/ap-open-ai-agree-to-share-select-news-content-and-technology-in-new-collaboration?ref=blef.fr">Associated Press signs with OpenAI to share AP's text archive</a> — Interesting, as it's one of the first deals like this. It reminds me of when the press gave up on their own platforms years ago to write for Google's and Facebook's news platforms. At least this time we will know what OpenAI uses for training.</li><li><a href="https://twitter.com/tobi/status/1679114154756669441?s=46&t=SidQqxd-lfVcXGrROSrmzg&ref=blef.fr">Shopify introduces Sidekick</a> — Once again Gen AI is a Copilot. Shopify introduced a right panel in the UI to help vendors in any way. In the video we see Sidekick generating a chart to answer a sales question.</li><li><a href="https://www.bbc.com/news/entertainment-arts-66196357?ref=blef.fr">Hollywood actors taking a strike action</a> — They don't want AI and computer-generated faces and voices to replace actors.</li><li>Clibrain, a Spanish startup, launches to build LLMs for Spanish. They released <a href="https://huggingface.co/clibrain/lince-zero?ref=blef.fr">LINCE-ZERO</a>. Spanish is the second most spoken language by native speakers and the fourth most spoken by all speakers.</li></ul>
<p></p>
<h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️ </h1>
<ul><li><a href="https://engineering.mixpanel.com/how-we-cut-bigquery-costs-by-80-by-identifying-and-optimizing-costly-query-patterns-1a297b46bd33?ref=blef.fr">How we cut BigQuery costs 80% by hunting down costly queries</a> — The Mixpanel team hugely reduced their BigQuery spending. They use Fivetran, dbt and Census. To get started they first built a cost dashboard using the information_schema.jobs tables. Then they took action, mainly: <a href="https://cloud.google.com/bigquery/docs/best-practices-performance-compute?ref=blef.fr#avoid_select_">avoiding SELECT *</a>, materialising intermediate results, adding partitions and going incremental. Nothing new, but a good reminder.</li><li><a href="https://medium.com/whatnot-engineering/data-contracts-in-the-modern-data-stack-d42cb2442dbd?ref=blef.fr">Data Contracts in the Modern Data Stack</a> — Whatnot is one of the companies that embraced Data Contracts last year. This article details what they shared in their excellent Data Council talk. Mainly their implementation is a Protobuf Schema Registry and interfaces at event production and consumption.</li><li><a href="https://hex.tech/blog/dimensionality-reduction/?ref=blef.fr">Introduction to dimensionality reduction</a> — I gave up on machine learning a few years ago, so I really like every article explaining machine learning concepts visually. This article explains the dimensionality reduction that is often mandatory when datasets grow. There is a part two with live <a href="https://hex.tech/blog/dimensionality-reduction-techniques/?ref=blef.fr">Python examples</a>.</li><li><a href="https://discuss.python.org/t/a-fast-free-threading-python/27903/99?ref=blef.fr">Make Python free-threading</a> — This is how open-source is made: in a community discussion about removing the Python GIL, someone from Meta said they can dedicate 3 <em>CPython internals</em> engineers to work 2+ years on breaking the barriers. 
GIL stands for Global Interpreter Lock, a lock that allows Python to execute bytecode on only one thread at a time. Interesting to see.</li></ul>
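<p>To see why removing the GIL matters, here is a minimal sketch: two threads running a pure-Python, CPU-bound countdown take roughly as long as running the countdowns one after the other, because only one thread holds the interpreter lock at a time (the iteration count is arbitrary).</p>

```python
# Demonstrating the GIL: CPU-bound threads do not run in parallel
# in CPython today, they just take turns holding the lock.
import threading

def countdown(n: int, results: list) -> None:
    while n > 0:          # pure-Python loop, never releases the GIL for long
        n -= 1
    results.append("done")

results: list = []
threads = [threading.Thread(target=countdown, args=(2_000_000, results))
           for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)  # both threads finish, but without true parallelism
```

<p>Timing this against a sequential version shows near-identical wall-clock times, which is exactly the limitation the free-threading proposal wants to remove.</p>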
<p></p>
<figure class="kg-card kg-image-card"><img src="https://images.unsplash.com/photo-1579621970795-87facc2f976d?ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&amp;auto=format&amp;fit=crop&amp;w=1000&amp;q=80" class="kg-image" alt="green plant in clear glass vase" loading="lazy"><figcaption><span>My savings on BigQuery money (</span><a href="https://unsplash.com/photos/ZVprbBmT8QA?ref=blef.fr" rel="noreferrer"><span>credits</span></a><span>)</span></figcaption></figure>
<hr>
<p>See you next week ❤️</p>
<p></p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 23.27 ]]></title>
                    <description><![CDATA[ Data News #23.27 — My new French podcast, New vision for dbt Core semantic layer, langchain explained, carbon footprint of pizza. ]]></description>
                    <link><![CDATA[ /data-news-week-23-27/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 64a6883f4233c000019601d6 ]]></guid>
                    <pubDate><![CDATA[ 2023-07-08 ]]></pubDate>
                    <content>
                        <![CDATA[ <p></p>
<figure class="kg-card kg-image-card"><img src="https://www.blef.fr/content/images/2023/07/photo-1510005294384-c03e247f0542.jpeg" class="kg-image" alt="group of cyclists marching on highway" loading="lazy" width="1000" height="717" srcset="https://www.blef.fr/content/images/size/w600/2023/07/photo-1510005294384-c03e247f0542.jpeg 600w, https://www.blef.fr/content/images/2023/07/photo-1510005294384-c03e247f0542.jpeg 1000w" sizes="(min-width: 720px) 720px"><figcaption><span>Who's leading the data peloton? (</span><a href="https://unsplash.com/photos/IlUqSRJYp8c?ref=blef.fr" rel="noopener"><span>credits</span></a><span>)</span></figcaption></figure>
<p>Hey you, this is the Saturday Data News edition 🥲. Time flies. I'm writing the August series of articles about "creating data platforms" in advance, and I'm looking for ideas about the data I could use for it. Some kind of simulated real-time data would be best, but that requires writing a simulation, which is complicated enough. What would you use?</p>
<p></p>
<h1 id="small-french-aside-%F0%9F%87%AB%F0%9F%87%B7">Small French aside 🇫🇷</h1>
<p><em>(This section was originally a small aside in French.)</em></p>
<p>This week I launched my French-language podcast, called <strong>À l'heure des données</strong>. In this podcast, which will be monthly, I will talk with French-speaking experts who shape the ecosystem. We'll discuss the present, but also the future.</p>
<p>In the first episode I talked with <a href="https://www.linkedin.com/in/pimpaudben?miniProfileUrn=urn%3Ali%3Afs_miniProfile%3AACoAABvNCPEBftr20GrhxU-gwoNTnOWkjKfBSHc&lipi=urn%3Ali%3Apage%3Ad_flagship3_search_srp_all%3B6gSYLYxbSMqZDF4Eeq2xrA%3D%3D&ref=blef.fr">Benoit Pimpaud</a>, who was a data scientist at Olympique de Marseille and later retrained as a data engineer at Deezer. Today he runs product at Kestra, an open-source orchestrator developed in France.</p>
<p>🎧 Listen to us: <a href="https://podcasts.apple.com/fr/podcast/1-quel-est-le-futur-de-lorchestration-benoit-pimpaud-kestra/id1695911147?i=1000619384646&l=en-GB&ref=blef.fr">Apple</a> — <a href="https://open.spotify.com/episode/4ki4LvSBgjNezqdDq3Vc1J?ref=blef.fr">Spotify</a> — <a href="https://deezer.page.link/LHaF3dimKNrhPfCW8?ref=blef.fr">Deezer</a> — <a href="https://music.amazon.co.uk/podcasts/4cff4cc4-9eff-495b-b8e9-aef7f3f9f4a2/episodes/9b839558-ad02-4563-8055-f431b6a40c63/%C3%A0-l'heure-des-donn%C3%A9es-1-%E2%80%94-quel-est-le-futur-de-l'orchestration-%E2%80%94-benoit-pimpaud-kestra?ref=blef.fr">Amazon</a></p>
<p>On a completely different subject, Stéphane Bortzmeyer took part in the CNRS colloquium on <em>Penser et Créer avec les IA génératives</em> and wrote a <a href="https://www.bortzmeyer.org/ia-generatives-colloque.html?ref=blef.fr">report on those two days</a>.</p>
<p>PS: would a French version of my content interest you?</p>
<p></p>
<h1 id="the-new-dbt-semantic-layer">The new dbt Semantic Layer</h1>
<p>Following dbt Labs' acquisition of Transform a few months ago, dbt Core now integrates MetricFlow, the acquired company's semantic layer. This week, Nick Handel, co-founder of Transform, wrote about how the dbt Core specs will adapt.</p>
<p>As a reminder, <strong>a semantic layer is a set of reusable definitions on top of your models. The idea is then to use these semantics to generate SQL queries</strong>. You can read <a href="https://www.blef.fr/metrics-store/">my article on the semantic layer</a>.</p>
<p>In the new <a href="https://www.getdbt.com/blog/new-dbt-semantic-layer-spec-dna/?ref=blef.fr">vision</a> it will be possible to define multiple things:</p>
<ul><li>entities — The nodes of your business model. In a dbt model, you can define primary and foreign entities. A foreign entity defines an edge between models, hence a join in the final query.</li><li>measures — A value aggregation.</li><li>dimensions — A categorical or time field that can be used either in a group by or in a filter.</li><li>metrics — A pre-defined object that combines entities, measures and dimensions.</li></ul>
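<p>The core idea — combining a model, a measure and dimensions into a metric and compiling that into SQL — can be sketched in a few lines. This is a minimal illustration, not the actual MetricFlow implementation, and the revenue_usd definition below is a hypothetical dict loosely following the fact_transaction example:</p>

```python
# Minimal sketch (not MetricFlow itself) of how a semantic layer can
# compile a metric definition into a SQL query.

def compile_metric(metric: dict) -> str:
    """Generate a GROUP BY query from a metric definition."""
    dims = ", ".join(metric["dimensions"])
    return (
        f"SELECT {dims}, {metric['agg']}({metric['measure']}) AS {metric['name']}\n"
        f"FROM {metric['model']}\n"
        f"GROUP BY {dims}"
    )

# Hypothetical metric definition on top of a fact_transaction model.
revenue_usd = {
    "name": "revenue_usd",
    "model": "fact_transaction",
    "measure": "amount_usd",
    "agg": "SUM",
    "dimensions": ["transaction_date", "country"],
}

print(compile_metric(revenue_usd))
```

The point is that consumers query the metric by name; the join logic, aggregation and grouping live in one shared definition instead of being re-written in every downstream query.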
<figure class="kg-card kg-image-card kg-width-wide"><img src="https://www.blef.fr/content/images/2023/07/Semantic-Layer-new-vision.png" class="kg-image" alt="" loading="lazy" width="2000" height="1193" srcset="https://www.blef.fr/content/images/size/w600/2023/07/Semantic-Layer-new-vision.png 600w, https://www.blef.fr/content/images/size/w1000/2023/07/Semantic-Layer-new-vision.png 1000w, https://www.blef.fr/content/images/size/w1600/2023/07/Semantic-Layer-new-vision.png 1600w, https://www.blef.fr/content/images/size/w2400/2023/07/Semantic-Layer-new-vision.png 2400w" sizes="(min-width: 1200px) 1200px"><figcaption><span>Semantics and metrics in dbt Core explained. (credits: the example is reworked from Nick's examples)</span></figcaption></figure>
<p>Just above I gave you a precise example of how the new nomenclature behaves in a simple case with a fact_transaction model. It's important to notice that the semantic layer sits on top of your current dbt model definitions.</p>
<p>To complete the picture, note that the revenue_usd metric can currently be queried either with a <a href="https://docs.getdbt.com/docs/build/sl-getting-started?ref=blef.fr#test-and-query-your-metrics">CLI</a> or via the API that dbt Labs will release through their dbt Cloud offering.</p>
<div class="kg-card kg-button-card kg-align-center"><a href="https://docs.getdbt.com/docs/build/build-metrics-intro?ref=blef.fr" class="kg-btn kg-btn-accent">Read dbt metrics documentation</a></div>
<p>As an extension, I've seen two things this week that I feel make sense here:</p>
<ul><li><a href="https://github.com/Canner/vulcan-sql?ref=blef.fr">VulcanSQL</a> — A data API framework for DuckDB, Snowflake, BigQuery, PostgreSQL. Actually Vulcan let's you define in a blink parametrise SQL that you can expose through an API. It comes then with a catalog, a documentation and a way to connect downstream consumers tools (e.g. CSV exports, Excel, Sheets, etc.)</li><li>A Rill Data <a href="https://ui.rilldata.com/demo/rill-github-analytics/duckdb_commits?ref=blef.fr">dashboard about DuckDB commits</a> — DuckDB commits is just an example. What I want to show here is Rill Data UI, while being relatively simple offers a standardise way to explore a dataset. On the left you get the metrics, on the right the dimensions, everything can be clickable and allows you to drill down. Under the hood it's "BI-as-code", YAML defining this dashboard can be found on <a href="https://github.com/rilldata/rill-examples/tree/main/rill-github-analytics?ref=blef.fr">Github</a>.</li></ul>
<p>These two examples are not really semantic layers in the strict sense, but revolve around the concept.</p>
<div class="kg-card kg-signup-card kg-width-wide " data-lexical-signup-form="" style="background-color: #F0F0F0; display: none;">
            
            <div class="kg-signup-card-content">
                
                <div class="kg-signup-card-text ">
                    <h2 class="kg-signup-card-heading" style="color: #000000;"><span>Sign up for blef.fr</span></h2>
                    <h3 class="kg-signup-card-subheading" style="color: #000000;"><span>I put words on data engineering.</span></h3>
                    
        <form class="kg-signup-card-form" data-members-form="signup">
            
            <div class="kg-signup-card-fields">
                <input class="kg-signup-card-input" id="email" data-members-email="" type="email" required="true" placeholder="Your email">
                <button class="kg-signup-card-button kg-style-accent" style="color: #FFFFFF;" type="submit">
                    <span class="kg-signup-card-button-default">Subscribe</span>
                    <span class="kg-signup-card-button-loading"><svg xmlns="http://www.w3.org/2000/svg" height="24" width="24" viewBox="0 0 24 24">
        <g stroke-linecap="round" stroke-width="2" fill="currentColor" stroke="none" stroke-linejoin="round" class="nc-icon-wrapper">
            <g class="nc-loop-dots-4-24-icon-o">
                <circle cx="4" cy="12" r="3"></circle>
                <circle cx="12" cy="12" r="3"></circle>
                <circle cx="20" cy="12" r="3"></circle>
            </g>
            <style data-cap="butt">
                .nc-loop-dots-4-24-icon-o{--animation-duration:0.8s}
                .nc-loop-dots-4-24-icon-o *{opacity:.4;transform:scale(.75);animation:nc-loop-dots-4-anim var(--animation-duration) infinite}
                .nc-loop-dots-4-24-icon-o :nth-child(1){transform-origin:4px 12px;animation-delay:-.3s;animation-delay:calc(var(--animation-duration)/-2.666)}
                .nc-loop-dots-4-24-icon-o :nth-child(2){transform-origin:12px 12px;animation-delay:-.15s;animation-delay:calc(var(--animation-duration)/-5.333)}
                .nc-loop-dots-4-24-icon-o :nth-child(3){transform-origin:20px 12px}
                @keyframes nc-loop-dots-4-anim{0%,100%{opacity:.4;transform:scale(.75)}50%{opacity:1;transform:scale(1)}}
            </style>
        </g>
    </svg></span>
                </button>
            </div>
            <div class="kg-signup-card-success" style="color: #000000;">
                Email sent! Check your inbox to complete your signup.
            </div>
            <div class="kg-signup-card-error" style="color: #000000;" data-members-error=""></div>
        </form>
        
                    <p class="kg-signup-card-disclaimer" style="color: #000000;"><span>No spam. Unsubscribe anytime.</span></p>
                </div>
            </div>
        </div>
<p></p>
<h1 id="gen-ai-%F0%9F%A4%96">Gen AI 🤖</h1>
<ul><li><a href="https://towardsdatascience.com/deploying-falcon-7b-into-production-6dd28bb79373?ref=blef.fr">Deploying Falcon-7B into production</a> — If you want to launch your own open-source model on Kubernetes, this is a tutorial to do it.</li><li><a href="https://blog.devgenius.io/langchain-explained-and-getting-started-8f1ea40ab95d?ref=blef.fr">Langchain: explained and getting started</a> — Langchain is a toolkit that lets you <strong>chain</strong>—what a surprise—components. Actually it's some kind of pipelines, every component as inputs and outputs and Langchain do the glue. Components includes stuff like prompts, LLMs, agents or memory.</li><li><a href="https://python.langchain.com/docs/modules/data_connection/document_loaders/integrations/cube_semantic?ref=blef.fr">Langchain integrates Cube (the semantic layer)</a> — Wrapping-up with previous category, Langchain can use Cube as a data loader.</li><li><a href="https://www.indexventures.com/perspectives/the-rise-of-vertical-ai/?ref=blef.fr">The rise of Vertical AI</a> — Verticality in business always existed because it brings contextualisation. This articles described what will arrive on the market on top of Foundations and horizontal models that tries to be generic.</li><li><a href="https://www.numbersstation.ai/post/introducing-nsql-open-source-sql-copilot-foundation-models?ref=blef.fr">Introducing NSQL: Open-source SQL Copilot Foundation models</a> — This is a Foundation models that generates SQL, claiming to outperform others.</li><li><a href="https://openai.com/blog/introducing-superalignment?ref=blef.fr">Introducing Superalignment</a> — Some stuff OpenAI wrote about the future (I did not read).</li><li><a href="https://blog.salesforceairesearch.com/codegen25/?ref=blef.fr">CodeGen2.5: Small, but mighty</a> — Salesforce released a new version of the CodeGen model. I hope they did not trained it on their internal code 🫠</li></ul>
<p></p>
<p></p>
<figure class="kg-card kg-image-card"><img src="https://www.blef.fr/content/images/2023/07/photo-1564936281291-294551497d81.jpeg" class="kg-image" alt="cooked food on round white ceramic plate" loading="lazy" width="1000" height="698" srcset="https://www.blef.fr/content/images/size/w600/2023/07/photo-1564936281291-294551497d81.jpeg 600w, https://www.blef.fr/content/images/2023/07/photo-1564936281291-294551497d81.jpeg 1000w" sizes="(min-width: 720px) 720px"><figcaption><span>Now you want to think twice before eating a pizza (</span><a href="https://unsplash.com/photos/cC0_UO1Obg4?ref=blef.fr" rel="noopener"><span>credits</span></a><span>)</span></figcaption></figure>
<h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1>
<ul><li><a href="https://www.brittanybennett.com/post/career-advice-for-aspiring-progressive-data-professionals?ref=blef.fr">Career advice for aspiring progressive data professionals</a> — Brittany has been working in progressive data for years and she's giving advices for people who wants to follow her path.</li><li><a href="https://engineering.linkedin.com/blog/2023/declarative-data-pipelines-with-hoptimator?ref=blef.fr">Declarative data pipelines with Hoptimator</a> — After trying to bring self-service for data pipelines at LinkedIn, they decided to go for declarative data pipelines supporting only a specific data movements. With YAML. We were visionary when we <a href="https://docs.google.com/presentation/d/1HPVwWSZAmOSCNy1uWTx7ecTS-e9l9Ize3s29SQdfqAE/edit?ref=blef.fr#slide=id.g48298f4f5f_0_56">designed and developed</a> this at Kapten 5 years ago.</li><li><a href="https://medium.com/apache-airflow/airflow-scalable-and-cost-effective-architecture-8edb4f8aed65?ref=blef.fr">Airflow: scalable and cost-effective architecture</a> — Hussein, an Airflow committer and PMC member, proposes an ideal architecture for big Airflow projects.</li><li><a href="https://medium.com/blablacar/scaling-data-teams-5-learnings-from-blablacar-9e00949957f3?ref=blef.fr">Scaling data teams: 5 learnings</a> — BlaBlaCar data team is well known in France now and recently embraced a data mesh organisation. Manu, the VP shares 5 learnings you should as a manager be aware of.</li><li><a href="https://maxhalford.github.io/blog/carbon-footprint-pizzas/?ref=blef.fr">Measuring the carbon footprint of pizzas</a> 🍕 — Shit I've eaten a pizza yesterday. Max includes in the study 4 axes: agriculture, transformation, packaging, and transport. With this Margharita obviously is the less emitting one. 
4x less than a Calzone with meat.</li><li><a href="https://towardsdatascience.com/parquet-file-format-everything-you-need-to-know-4eed5c0019e7?ref=blef.fr">Parquet file format explained</a> — and how it compares with <a href="https://medium.com/@rahul.nanavaty/parquet-format-vs-orc-format-vs-avro-format-2af72b887903?ref=blef.fr">Avro &amp; ORC</a>.</li><li><a href="https://bitsondatadev.substack.com/p/iceberg-won-the-table-format-war?ref=blef.fr">Iceberg won the table format war</a> — Don't be click baited by the title, the article has been written by a dev rel at the company who mainly maintains Iceberg.</li><li><a href="https://engineering.razorpay.com/reducing-data-platform-cost-by-2m-d8f82285c4ae?ref=blef.fr">Reducing data platform cost by $2m</a> — How Razorpay optimised (mainly) their S3 storage (deletion, relocation) to save a lot of money.</li><li><a href="https://select.dev/posts/summit-2023?ref=blef.fr">Every major announcement at Snowflake Summit</a> — Another view than the one I shared last week by someone who actually was at the Summit.</li><li>An intro video to <a href="https://www.youtube.com/watch?v=rO3BPqUtWrI&ref=blef.fr">open lineage</a>, which is a important topic to give visibility over your data platform.</li><li><a href="https://ricardoanderegg.com/posts/makefile-python-project-tricks/?ref=blef.fr">Makefile tricks for Python projects</a> — One of the best data magical trick. We repurposed old good Makefile to create simpler CLI on top of our daily tool. This is an article giving tips to make your best Makefiles.</li><li>You can now <a href="https://www.reddit.com/r/dataengineering/comments/14midyu/now_in_snowflake_group_by_all/?ref=blef.fr">GROUP BY ALL in Snowflake</a>.</li></ul>
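<p>On that last item, GROUP BY ALL just means the engine groups by every selected expression that is not an aggregate, so you stop repeating the column list. A rough sketch of that expansion in Python (the real expansion happens inside Snowflake's parser; this only mimics the semantics):</p>

```python
# Rough sketch of GROUP BY ALL semantics: group by every selected
# expression that is not an aggregate.

def expand_group_by_all(select_list: list, aggregates: set) -> list:
    """Return the grouping keys GROUP BY ALL would infer."""
    return [expr for expr in select_list if expr not in aggregates]

keys = expand_group_by_all(["country", "city", "SUM(amount)"], {"SUM(amount)"})
print(f"GROUP BY {', '.join(keys)}")  # GROUP BY country, city
```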
<p></p>
<h1 id="data-economy-%F0%9F%92%B0">Data Economy 💰</h1>
<ul><li><a href="https://finance.yahoo.com/news/digitalocean-acquires-paperspace-expand-ai-120000933.html?ref=blef.fr">DigitalOcean acquires <strong>Paperspace</strong>.</a> Paperspace is an all-in-one SaaS product to develop, train and deploy AI applications. With a custom Notebook UI based on Jupyter you can develop your models while checking at ressources, when the models is reading you can deploy it within containers.</li><li><strong>Redpanda</strong> <a href="https://redpanda.com/press/redpanda-raises-100m-in-series-c-funding?ref=blef.fr">raises $100m in Series C</a>. Redpanda is a great product for developers. The best way to describe it is: this is a Kafka alternative. Built for modern times it removes most of the Kafka complexity by implementing all Kafka APIs.</li></ul>
<hr>
<p>See you next week ❤️</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Snowflake and Databricks summits ]]></title>
                    <description><![CDATA[ Data News #23.26 — Snowflake and Databricks summits wrap-up and a few fundraising. ]]></description>
                    <link><![CDATA[ /data-news-week-23-26/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 649f0426442df8000199cb54 ]]></guid>
                    <pubDate><![CDATA[ 2023-07-03 ]]></pubDate>
                    <content>
                        <![CDATA[ <figure class="kg-card kg-image-card"><img src="https://www.blef.fr/content/images/2023/07/photo-1542692847287-8432313be7a5.jpeg" class="kg-image" alt="mountain peak" loading="lazy" width="1000" height="503" srcset="https://www.blef.fr/content/images/size/w600/2023/07/photo-1542692847287-8432313be7a5.jpeg 600w, https://www.blef.fr/content/images/2023/07/photo-1542692847287-8432313be7a5.jpeg 1000w" sizes="(min-width: 720px) 720px"><figcaption><span>2 summits (</span><a href="https://unsplash.com/photos/IjBgUHrcuWQ?ref=blef.fr" rel="noopener"><span>credits</span></a><span> I cropped the image)</span></figcaption></figure>
<p>Hey, ever since I said I should try to send the newsletter on a fixed schedule, I haven't. Haha. Still, here is the newsletter for last week: a small wrap-up of the Snowflake and Databricks Data + AI summits, which took place last week.</p>
<p>There are so many sessions at both summits that it's impossible to watch everything; moreover, Databricks and Snowflake don't put everything online for free, so I couldn't watch it all. I'll try to recap the major announcements by reading between the lines and through social network posts.</p>
<div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">💡</div><div class="kg-callout-text"><p><span>If you want another view on both the conferences Ananth from Data Engineering Weekly wrote about the </span><a href="https://www.dataengineeringweekly.com/p/the-week-of-data-conference-extravaganza?ref=blef.fr" rel="noopener"><span>conferences extravaganza</span></a><span> and a few trends he wanted to chat about.</span></p></div></div>
<p></p>
<div class="kg-card kg-signup-card kg-width-wide " data-lexical-signup-form="" style="background-color: #F0F0F0; display: none;">
            
            <div class="kg-signup-card-content">
                
                <div class="kg-signup-card-text ">
                    <h2 class="kg-signup-card-heading" style="color: #000000;"><span>Sign up for blef.fr</span></h2>
                    <h3 class="kg-signup-card-subheading" style="color: #000000;"><span>Words on data engineering.</span></h3>
                    
        <form class="kg-signup-card-form" data-members-form="signup">
            <input data-members-label="" type="hidden" value="summits">
            <div class="kg-signup-card-fields">
                <input class="kg-signup-card-input" id="email" data-members-email="" type="email" required="true" placeholder="Your email">
                <button class="kg-signup-card-button kg-style-accent" style="color: #FFFFFF;" type="submit">
                    <span class="kg-signup-card-button-default">Join us</span>
                    <span class="kg-signup-card-button-loading"><svg xmlns="http://www.w3.org/2000/svg" height="24" width="24" viewBox="0 0 24 24">
        <g stroke-linecap="round" stroke-width="2" fill="currentColor" stroke="none" stroke-linejoin="round" class="nc-icon-wrapper">
            <g class="nc-loop-dots-4-24-icon-o">
                <circle cx="4" cy="12" r="3"></circle>
                <circle cx="12" cy="12" r="3"></circle>
                <circle cx="20" cy="12" r="3"></circle>
            </g>
            <style data-cap="butt">
                .nc-loop-dots-4-24-icon-o{--animation-duration:0.8s}
                .nc-loop-dots-4-24-icon-o *{opacity:.4;transform:scale(.75);animation:nc-loop-dots-4-anim var(--animation-duration) infinite}
                .nc-loop-dots-4-24-icon-o :nth-child(1){transform-origin:4px 12px;animation-delay:-.3s;animation-delay:calc(var(--animation-duration)/-2.666)}
                .nc-loop-dots-4-24-icon-o :nth-child(2){transform-origin:12px 12px;animation-delay:-.15s;animation-delay:calc(var(--animation-duration)/-5.333)}
                .nc-loop-dots-4-24-icon-o :nth-child(3){transform-origin:20px 12px}
                @keyframes nc-loop-dots-4-anim{0%,100%{opacity:.4;transform:scale(.75)}50%{opacity:1;transform:scale(1)}}
            </style>
        </g>
    </svg></span>
                </button>
            </div>
            <div class="kg-signup-card-success" style="color: #000000;">
                Email sent! Check your inbox to complete your signup.
            </div>
            <div class="kg-signup-card-error" style="color: #000000;" data-members-error=""></div>
        </form>
        
                    <p class="kg-signup-card-disclaimer" style="color: #000000;"><span>No spam. Unsubscribe anytime.</span></p>
                </div>
            </div>
        </div>
<h1 id="snowflake-summit-%E2%9D%84%EF%B8%8F">Snowflake Summit ❄️</h1>
<p>Snowflake's marketing tagline has always been "the Data Cloud", and with this year's announcements we can feel they have really accelerated towards this vision. Snowflake wants you to send whatever data you have to their cloud, where a lot of different features now let you act on it. They announced:</p>
<ul><li><a href="https://www.youtube.com/watch?v=OTycMK18d2M&ref=blef.fr">Document AI</a> — A new integrated product where you can ask questions in natural language on documents (PDF, etc.). With LLMs they will try to answer questions. Once you are happy with the quality of answer you'll be able to publish the model and use it in SQL queries and write pipelines on top of it to infer on new documents and send emails when needed.</li><li><a href="https://www.snowflake.com/blog/native-app-framework-available-developers-aws/?ref=blef.fr">Snowflake Native App framework</a> — Via the Snowflake marketplace vendors and developers will be able to create apps that you can run on your data. In the UI you pick the tables you want the app to run on. Here <a href="https://app.snowflake.com/marketplace?shareType=application&ref=blef.fr">the native apps marketplace</a>, there are only 25 apps and it only works on AWS at the moment.</li><li>Container Services &amp; <a href="https://techcrunch.com/2023/06/27/snowflake-nvidia-partnership-could-make-it-easier-to-build-generative-ai-applications/?ref=blef.fr">Nvidia partnership</a> — Snowflake is slowly becoming a one-stop shop, with container services you will be able to run your own apps in a Kubernetes cluster managed by Snowflake. For instance tomorrow you'll be able to launch Airflow (via <a href="https://www.astronomer.io/blog/astronomer-and-snowflake-unleash-the-power-of-snowpark-container-services-and-apache-airflow/?ref=blef.fr">Astronomer</a>) within Snowflake. On the same topic Nvidia partnership will bring GPUs to Snowflake offering for users in need of large compute for AI training. Thanks to this data do not move out of Snowflake, or if we say the truth, out of your underlying cloud.</li><li><a href="https://docs.snowflake.com/en/user-guide/dynamic-tables-about?ref=blef.fr">Dynamic Tables</a> — Dynamic tables are streaming tables. 
With Snowflake you can send real time data coming from Kafka, for instance, with dynamic tables you can create a table on top of the real time data that refreshes in real time, using only what's needed to compute the new state. Dynamic tables has been announced last year, but looks finally in preview. In the demo there is also how the SQL UI integrates <a href="https://youtu.be/fZ5mCmVZAQ0?t=277&ref=blef.fr">LLMs generating SQL from a comment</a>.</li></ul>
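<p>The incremental-refresh idea behind dynamic tables — fold only the new rows since the last refresh into the existing state, instead of recomputing the whole table — can be shown with a toy example. This is an illustration of the concept, not Snowflake's implementation:</p>

```python
# Toy illustration (not Snowflake's implementation) of incremental
# refresh: each refresh folds only the delta since the last refresh
# into the aggregate state, instead of recomputing from scratch.

def refresh(state: dict, new_rows: list) -> dict:
    """Fold a micro-batch of (key, amount) rows into the aggregate state."""
    for key, amount in new_rows:
        state[key] = state.get(key, 0.0) + amount
    return state

totals: dict = {}
refresh(totals, [("eu", 10.0), ("us", 5.0)])  # initial load
refresh(totals, [("eu", 2.0)])                # later refresh: delta only
print(totals)  # {'eu': 12.0, 'us': 5.0}
```

The second refresh touches one row, not the full history — that is what makes the "real-time table" cheap to keep up to date.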
<p></p>
<p>PS: s/o to David who also <a href="https://davidsj.substack.com/p/all-change?r=125hnz&utm_medium=ios&utm_campaign=post&ref=blef.fr">covered Snowflake changes</a>.</p>
<p></p>
<h1 id="data-ai-summit-%F0%9F%97%BB">Data + AI Summit 🗻</h1>
<p>The theme of the Databricks summit is <em>Generation AI</em>; a well-chosen title given the current state of data. I watched the 3 keynotes to find announcements, but they looked less structured than Snowflake's. Still, here are a few takeaways:</p>
<ul><li>Microsoft and Databricks are still best friends, even after <a href="https://www.microsoft.com/fr-fr/microsoft-fabric?ref=blef.fr">Fabric</a>. In a quick Skype call, Satya Nadella, Microsoft's CEO, said that discussing responsible AI while developing it is a good thing. We should explore 3 parallel tracks at the same time: misinformation, real-world harms (incl. bias), and AI takeoff.</li><li>The CEO of Databricks was on stage and used words that I like; he said:</li><li><ul><li>data <em>should be democratised to every employee</em></li><li><em>AI should be democratised in every product</em></li></ul></li></ul>
<figure class="kg-card kg-image-card"><img src="https://www.blef.fr/content/images/2023/07/Screenshot-2023-07-03-at-11.20.32.png" class="kg-image" alt="" loading="lazy" width="1946" height="1018" srcset="https://www.blef.fr/content/images/size/w600/2023/07/Screenshot-2023-07-03-at-11.20.32.png 600w, https://www.blef.fr/content/images/size/w1000/2023/07/Screenshot-2023-07-03-at-11.20.32.png 1000w, https://www.blef.fr/content/images/size/w1600/2023/07/Screenshot-2023-07-03-at-11.20.32.png 1600w, https://www.blef.fr/content/images/2023/07/Screenshot-2023-07-03-at-11.20.32.png 1946w" sizes="(min-width: 720px) 720px"><figcaption><span>Databricks vision about LLMs (in Wed. Keynote 2023 Data + AI Summit)</span></figcaption></figure>
<ul><li><a href="https://www.databricks.com/blog/introducing-lakehouseiq-ai-powered-engine-uniquely-understands-your-business?ref=blef.fr">LakehouseIQ</a> — Matei Zaharia presented it on stage. LakehouseIQ is a way to use your Enterprise signals (org charts, lineage, docs, queries, catalog, etc.) to contextualise LLMs used in UI assistants. In the demo LakehouseIQ is asked to "get revenue for Europe" but understand that Europe is not the exact name of the region for this company but EMEA. Here a <a href="https://youtu.be/h4z4vBoxQ6s?t=3151&ref=blef.fr">demo of LakehouseIQ</a>. In the demo we also sees that you can generate SQL from a comment in the UI.<br><br>This is their way to democratise data to every employee.</li><li><a href="https://www.mosaicml.com/blog/mosaicml-databricks-generative-ai-for-all?ref=blef.fr">Databricks acquires MosaicML</a> for $1.3b— It should land in data economy category but you know. I've shared MosaicML <a href="https://www.blef.fr/data-news-week-23-25/">last week</a> because they are the ones behind the first open-source LLMs, the MPT models, on Apache License. This is a great move from Databricks to set themselves in the AI ecosystem for real. As a side note Naveen Rao, Mosaic CEO, said that to train MPT-30B from scratch you need around 12 days and less than $1m.</li><li><a href="https://youtu.be/h4z4vBoxQ6s?t=6560&ref=blef.fr">LakehouseAI</a> — Research shown that 25% of the queries get their costs misestimated by the query optimisers and the error can be 10<sup>6</sup>. Databricks built a new way to do I/O with AI, they promise that you don't have to do any kind of indexes and the engine can "triangulate" where the data is to be faster than before. Mainly you have to see LakehouseAI like an AI DBA that does magical stuff to your engine by learning on all your queries telemetry.</li><li>They also announced a lot of stuff around <a href="https://www.youtube.com/watch?v=yj7XlTB1Jvc&ref=blef.fr">Spark</a>.</li></ul>
<p>As you can see Lakehouse is becoming more than ever a marketing brand around Databricks. In the end what we want is a place to store data and an engine to query data. That's all.</p>
<p></p>
<h1 id="data-economy-%F0%9F%92%B0">Data Economy 💰</h1>
<ul><li><a href="https://www.thoughtspot.com/press-releases/thoughtspot-acquires-mode-analytics-for-200m?ref=blef.fr">ThoughtSpot acquires Mode analytics for $200m</a> — This is consolidation at work. ThoughtSpot is a company who tries to bring AI in the analytics domain. With TS you can define insights and access to it, with Mode they gain a end-user application that people are already using. Also you might know Mode through <a href="https://benn.substack.com/p/to-my-parents?ref=blef.fr">Benn Stancil blog</a>.</li><li><a href="https://www.globenewswire.com/en/news-release/2023/06/29/2696702/0/en/Hopsworks-reports-record-growth-and-raises-6-5M.html?ref=blef.fr">Hopsworks raises $6.5m</a> — Hopsworks is a feature store.</li><li><a href="https://www.forbes.com/sites/alexkonrad/2023/06/29/inflection-ai-raises-1-billion-for-chatbot-pi/?sh=66cd5acd1d7e&ref=blef.fr">Inflection AI raises $1.3b</a> from Bill Gates, Eric Schmidt, Microsoft and Nvidia. They developed a <a href="https://inflection.ai/?ref=blef.fr">personal AI called Pi</a> who's designed to be supportive, smart and here for you at anytime. Let's see where it goes.</li></ul>
<hr>
<p>See you soon ❤️.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 23.25 ]]></title>
                    <description><![CDATA[ Data News #23.25 — Yes I was late. A bit of Gen AI and the usual Fast News + Acryl Data fundraising. ]]></description>
                    <link><![CDATA[ /data-news-week-23-25/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 6495450b3b554a00015d9db7 ]]></guid>
                    <pubDate><![CDATA[ 2023-06-24 ]]></pubDate>
                    <content>
                        <![CDATA[ <p></p>
<figure class="kg-card kg-image-card"><img src="https://images.unsplash.com/photo-1490750967868-88aa4486c946?ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&amp;auto=format&amp;fit=crop&amp;w=1000&amp;q=80" class="kg-image" alt="orange petaled flowers" loading="lazy"><figcaption><span>(</span><a href="https://unsplash.com/photos/koy6FlCCy5s?ref=blef.fr" rel="noopener"><span>credits</span></a><span>)</span></figcaption></figure>
<p>Hey, this is the Data News. It's super hard to change habits, but it is what it is: the newsletter is going out on Saturday. I hope this edition finds you well. Summer is coming ☀️.</p>
<p>Thank you all, because we crossed the 3,000-subscriber mark last week. Let's go for 4,000 before the end of the year 🤗.</p>
<p>This is an almost-raw edition this week.</p>
<p></p>
<h1 id="gen-ai-%F0%9F%A4%96">Gen AI 🤖</h1>
<ul><li><a href="https://huggingface.co/spaces/mosaicml/mpt-30b-chat?ref=blef.fr">MPT-30B-Chat</a> — This is a chat interface hosted on HuggingFace on top of the MPT-30B model. The <a href="https://www.mosaicml.com/blog/mpt-30b?ref=blef.fr">MPT models</a> are interesting because they are under the Apache license, which means true open source, unlike others.</li><li>Continuing on the license topic, you can watch this great video about <a href="https://www.youtube.com/watch?v=rOd9UteupGA&list=PLq-odUc2x7i-q7sHxBbIVFtMOwChWmIKF&index=29&ref=blef.fr">laptop-sized ML for text, with Open Source</a>, where Nick Burch explores what you can do today on a laptop and gives a great introduction to the Gen AI field.</li><li><a href="https://engineering.linkedin.com/blog/2023/new-approaches-for-detecting-ai-generated-profile-photos?ref=blef.fr">New approaches for detecting AI-Generated profile photos</a> — This is the era we're going to live in. We'll be writing models moderating generative models. Am I the only one who thinks this is a waste of energy?</li><li><a href="https://davidgerard.co.uk/blockchain/2023/06/03/crypto-collapse-get-in-loser-were-pivoting-to-ai/?ref=blef.fr">Crypto collapse? Get in loser, we’re pivoting to AI</a> — It's a rant that begins with the fact that many opportunists are getting into AI now that VCs have left crypto. ChatGPT "is a stupendously scaled-up autocomplete", which leads to questions about intelligence in AI. I really like the conclusion: "The <em>real</em> threat of AI is the bozos promoting AI doom who want to use it as an excuse to ignore real-world problems — like the risk of climate change to humanity (...) The VCs’ actual use case for AI is treating workers badly".</li></ul>
<figure class="kg-card kg-image-card"><img src="https://images.unsplash.com/photo-1492562080023-ab3db95bfbce?ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&amp;auto=format&amp;fit=crop&amp;w=1000&amp;q=80" class="kg-image" alt="smiling man standing near green trees" loading="lazy"><figcaption><span>Too perfect to be a real picture (</span><a href="https://unsplash.com/photos/VVEwJJRRHgk?ref=blef.fr" rel="noopener"><span>credits</span></a><span>)</span></figcaption></figure>
<p></p>
<h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1>
<ul><li><a href="https://motherduck.com/blog/announcing-motherduck-duckdb-in-the-cloud/?ref=blef.fr">MotherDuck announcing DuckDB in the cloud</a> — First, context. DuckDB is an in-memory analytics database, so it's single-server. DuckDB was open-sourced by DuckDB Labs. Then comes MotherDuck, a commercial company, with a <a href="https://duckdblabs.com/news/2022/11/15/motherduck-partnership.html?ref=blef.fr">partnership</a> with DuckDB Labs aiming to build a modern serverless cloud analytics platform based on DuckDB. That's for the context.<br><br>So this week MotherDuck finally announced their cloud offering. It's invite-only for the moment —&nbsp;and I did not get my invite yet. In a nutshell the announcement is: you can connect to a remote DuckDB by using <code>md:</code> in the connection string and you can join local and remote data (also seen on <a href="https://twitter.com/criccomini/status/1672024134648475651?ref=blef.fr">Twitter</a>).</li><li>Iceberg in the clouds — Last week BigQuery announced <a href="https://cloud.google.com/bigquery/docs/release-notes?ref=blef.fr">Iceberg support</a> in GA. At the same time James from Snowflake wrote a blog post helping you <a href="https://medium.com/snowflake/apache-iceberg-or-snowflake-table-format-299eb9fb7b0c?ref=blef.fr">choose between the Snowflake or Iceberg</a> table format.
Mainly he says: pick Iceberg if you know what you're doing.</li><li><a href="https://www.youtube.com/watch?v=jCXpFagJsbo&list=PLq-odUc2x7i-q7sHxBbIVFtMOwChWmIKF&index=20&ref=blef.fr">An introductory video about Iceberg</a> — If you want a great Iceberg introduction, go watch Fokko's talk from Berlin Buzzwords.</li><li><a href="https://leo-godin.medium.com/understanding-dbt-runtime-environment-1fd28592bbd?ref=blef.fr">Understanding dbt runtime environment</a> — Leo takes the time to explain what the dbt CLI messages are telling you.</li><li><a href="https://blog.devgenius.io/replacing-apache-hive-elasticsearch-and-postgresql-with-apache-doris-de3840cdc792?ref=blef.fr">Replacing Apache Hive, Elasticsearch and PostgreSQL with Apache Doris</a> — This is technology bingo. You can replace 3 technologies with only one! This post details the choices behind a migration to Apache Doris. Doris is a real-time analytical database.</li><li><a href="https://medium.com/@timwebster85/beyond-data-pipelines-how-data-engineers-drive-data-culture-and-empower-users-953abc5418ac?ref=blef.fr">How data engineers drive data culture and empower users</a> — This article reminds all data engineers that you're part of the team that brings data culture to a company, so you need to play your part.</li><li><a href="https://www.startdataengineering.com/post/valuable-de-guide/?ref=blef.fr">How to become a valuable data engineer</a> — A post that aggregates great resources and advice on becoming a data engineer. I'll also mention that I have a similar one on the blog: <a href="https://www.blef.fr/learn-data-engineering/">how to learn data engineering</a>.</li><li><a href="https://www.carbonfact.com/blog/platform/missing-weight-data?ref=blef.fr">Dealing with missing weight data</a> — Carbonfact tries to measure the environmental footprint of clothing. This is not an easy task and requires working with missing data.
</li><li><a href="https://www.thoughtspot.com/data-trends/data-modeling/conceptual-vs-logical-vs-physical-data-models?ref=blef.fr">Conceptual vs logical vs physical data models</a> — The author presents 3 ways to model data, with different layers of understanding, and concludes that you should model your data across all 3 layers: conceptual, logical and physical.</li></ul>
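<p>The missing-data problem Carbonfact describes can be illustrated with a minimal sketch: fill a garment's missing weight with the median weight of its category. The category names, weights and the <code>impute_weights</code> helper below are hypothetical, purely for illustration, and not Carbonfact's actual model.</p>

```python
from statistics import median

def impute_weights(items):
    """Fill missing garment weights with the median weight of their category.

    `items` is a list of dicts with "category" and "weight" (grams, or None).
    A minimal illustration of median imputation; not Carbonfact's method.
    """
    # Collect the known weights per category.
    known = {}
    for item in items:
        if item["weight"] is not None:
            known.setdefault(item["category"], []).append(item["weight"])
    # Replace each missing weight with its category's median.
    filled = []
    for item in items:
        weight = item["weight"]
        if weight is None:
            weight = median(known[item["category"]])
        filled.append({**item, "weight": weight})
    return filled

items = [
    {"category": "t-shirt", "weight": 150},
    {"category": "t-shirt", "weight": 170},
    {"category": "t-shirt", "weight": None},  # imputed below
]
print(impute_weights(items)[2]["weight"])  # prints 160.0
```

<p>A real pipeline would of course use richer features than the category alone, but the shape of the problem is the same: you need a defensible default for every hole in the data.</p>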
<p></p>
<h1 id="data-economy-%F0%9F%92%B0">Data Economy 💰</h1>
<ul><li><strong>Acryl Data</strong> <a href="https://www.acryldata.io/blog/a-control-plane-for-data-and-a-new-era-for-acryl?ref=blef.fr">raises $21m Series A</a>. Acryl Data is the company behind DataHub, the data catalog that has been open-sourced out of LinkedIn.</li></ul>
<hr>
<p>See you next week ❤️</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 23.24 ]]></title>
                    <description><![CDATA[ Data News #23.24 — AI Act, testing in dbt, data journey manifesto, SO survey, CDC with Clickhouse and fundraising. ]]></description>
                    <link><![CDATA[ /data-news-week-23-24/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 6489af956e92b0000173ea81 ]]></guid>
                    <pubDate><![CDATA[ 2023-06-16 ]]></pubDate>
                    <content>
                        <![CDATA[ <figure class="kg-card kg-image-card"><img src="https://images.unsplash.com/photo-1523349122880-44486ffa7b14?ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&amp;auto=format&amp;fit=crop&amp;w=1000&amp;q=80" alt="close up photography of round green fruit" loading="lazy"><figcaption><span> The newsletter, a metaphor (</span><a href="https://unsplash.com/photos/O70hwncRDC8?ref=blef.fr" rel="noopener"><span>credits</span></a><span>)</span></figcaption></figure>
<p>Hello, after the good weather comes the storm. I'm now under the Berlin rain at 20°C. When I write in these conditions I feel like a tortured author writing a depressing novel, while actually today I'll speak about the AI Act, Python, SQL and data platforms. A casual day at the office, after all.
</p>
<p>Some personal news: next Monday and Tuesday I'll be at Berlin Buzzwords. If you're around, ping me, it would be a pleasure to meet and hang out together.</p>
<p>There are still seats for the June Airflow Paris <a href="https://www.meetup.com/fr-FR/paris-apache-airflow-meetup/events/293888353/?ref=blef.fr">Meetup</a> (in French).</p>
<figure class="kg-card kg-image-card"><a href="https://www.meetup.com/fr-FR/paris-apache-airflow-meetup/events/293888353/?ref=blef.fr"><img src="https://www.blef.fr/content/images/2023/06/Meetup--4-26-.png" alt="" loading="lazy" width="2000" height="433" srcset="https://www.blef.fr/content/images/size/w600/2023/06/Meetup--4-26-.png 600w, https://www.blef.fr/content/images/size/w1000/2023/06/Meetup--4-26-.png 1000w, https://www.blef.fr/content/images/size/w1600/2023/06/Meetup--4-26-.png 1600w, https://www.blef.fr/content/images/size/w2400/2023/06/Meetup--4-26-.png 2400w" sizes="(min-width: 720px) 720px"></a></figure>
<p></p>
<h1 id="ai-%F0%9F%A4%96">AI 🤖</h1>
<ul><li><a href="https://www.nytimes.com/2023/06/14/technology/europe-ai-regulation.html?ref=blef.fr">The AI Act 🇪🇺 has been voted through</a> the European Parliament. Also called GDPR 2.0, the AI Act is meant to regulate the usage of AI in tomorrow's world. It has been widely criticised by <a href="https://techcrunch.com/2023/06/13/google-delays-eu-launch-of-its-ai-chatbot-after-privacy-regulator-raises-concerns/?ref=blef.fr">lobbyists, companies and developers</a>. I'm not informed enough so I'll wait before giving my opinion on it.</li><li><a href="https://ai.facebook.com/blog/yann-lecun-ai-model-i-jepa/?ref=blef.fr">I-JEPA: The first AI model based on Yann LeCun’s vision for more human-like AI</a> — Meta is in a frenzy to release new models. Yann's vision goes toward AI systems learning and reasoning like animals and humans.</li><li><a href="https://medium.com/pinterest-engineering/deep-multi-task-learning-and-real-time-personalization-for-closeup-recommendations-1030edfe445f?ref=blef.fr">Deep multi-task learning and real-time personalisation for closeup recommendations</a> — Pinterest is still doing deep learning.</li><li>Last week I shared nice QR Codes generated with ControlNet; this week someone released a model on HuggingFace to do it, <a href="https://huggingface.co/DionTimmer/controlnet_qrcode?ref=blef.fr">QR Code Conditioned ControlNet</a> (not related to the original Chinese work), and you can even use the <a href="https://huggingface.co/spaces/huggingface-projects/QR-code-AI-art-generator?ref=blef.fr">generator web UI</a>.</li><li><a href="https://arxiv.org/abs/2306.03714?ref=blef.fr">DashQL – Complete analysis workflows with SQL</a> — A crazy paper about a new language that mixes SQL with analyses and graphs.
It looks sexy but my brain can't read a 9-page PDF without overheating.</li><li><a href="https://medium.com/walmartglobaltech/model-and-data-versioning-an-introduction-to-mlflow-and-dvc-260347cd0f6e?ref=blef.fr">Model and Data Versioning: An Introduction to mlflow and DVC</a> — If you want to understand model versioning this is for you.</li></ul>
<p></p>
<h1 id="data-and-analytics-engineering-%F0%9F%A7%91%E2%80%8D%F0%9F%94%A7">Data and Analytics Engineering 🧑‍🔧</h1>
<ul><li><a href="https://medium.com/datamindedbe/testing-frameworks-in-dbt-3fa8933a5807?ref=blef.fr">Testing frameworks in dbt</a> — Robbert developed a small framework to do tests in dbt. Mainly he unit-tests macros (the logic) with his framework and tests data with Soda and dbt contracts.</li><li><a href="https://datajourneymanifesto.org/?ref=blef.fr">The data journey manifesto</a> — <a href="https://datakitchen.io/why-the-data-journey-manifesto/?ref=blef.fr">DataKitchen</a> wrote a manifesto to put principles on the data journey to avoid a mess in production. There are 11 principles and 11 new ideas to create a healthy platform. For instance <em>you should not trust your data providers</em> and <em>what worked last week will not work today</em>.</li><li><a href="https://www.data-drift.io/blog/why-data-consumers-do-not-trust-your-reporting-and-you-might-not-even-know-it?ref=blef.fr">Why data consumers do not trust your reporting</a> — It is a good illustration of the data journey manifesto. <strong>Stakeholders often notice data issues before the data team does</strong>. This destroys any confidence they may have in the numbers. Data warehouses are mutable; this is one of the many root causes proposed by Lucas: the past often changes, whether because of code or data. This is metrics drift.</li><li><a href="https://towardsdatascience.com/data-documentation-101-why-how-for-whom-927311354a92?ref=blef.fr">Data Documentation 101: Why? How? For Whom?</a> — Marie wrote best practices for establishing complete and reliable data documentation. The first piece of advice is about the documentation readers: the data team, business users or other stakeholders.</li><li><a href="https://clickhouse.com/blog/clickhouse-postgresql-change-data-capture-cdc-part-1?utm_source=twitter&utm_medium=social&utm_campaign=blog">Change Data Capture (CDC) with PostgreSQL and ClickHouse</a> — This is a nice vendor post about CDC with Kafka as the movement layer (using Debezium).
The post explains well the architecture you need to make it work.</li><li><a href="https://betterprogramming.pub/a-deep-dive-into-graph-analytics-part-1-with-memgraph-5e3134609d86?ref=blef.fr">A deep dive into graph analytics</a> — Petrica tries out and showcases Memgraph in a long-form post. I'm fond of graph visualisations and analytics—as well as maps.</li><li><a href="https://engineering.atspotify.com/2023/06/experimenting-at-scale-the-spotify-home-way/?ref=blef.fr">Experimenting at Scale, the Spotify Home way</a> — Simple principles to run a good ol' experiment at Spotify scale.</li><li><a href="https://count.co/canvas/pB7iGb4yyi2?ref=blef.fr">The ultimate SQL guide</a> — After the last canvas on data interviews, here's a canvas to learn SQL. From an introduction to databases to SQL writing, it covers simple SELECTs and advanced concepts. This is neat.</li><li><a href="https://parakeet.solutions/the-power-of-pre-commit-and-sql-fluff/?ref=blef.fr">The power of pre-commit and SQLFluff</a> — SQL is a query language used to retrieve information from data storage, and like any other programming language, you need to enforce checks at all times. This is where you should use pre-commit and SQLFluff.</li><li><a href="https://medium.com/airbnb-engineering/metis-building-airbnbs-next-generation-data-management-platform-d2c5219edf19?ref=blef.fr">Metis: building Airbnb’s next generation data management platform</a> — The new manifesto for every data governance company /S.</li></ul>
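<p>To make the CDC idea concrete, here is a tiny sketch of what a downstream consumer does with a change log: replay each event into a copy of the table. The event shape is hypothetical, loosely modeled on Debezium's "c"/"u"/"d" operations; the real pipeline in the post reads these from Kafka into ClickHouse, not into a Python dict.</p>

```python
def apply_change_events(events):
    """Replay Debezium-style change events into an in-memory table copy.

    Each event carries an op ("c" create, "u" update, "d" delete), the row
    key, and the row state after the change. Hypothetical event shape, for
    illustration; a real consumer would read these from a Kafka topic.
    """
    table = {}
    for event in events:
        if event["op"] in ("c", "u"):
            table[event["key"]] = event["after"]   # upsert the new row state
        elif event["op"] == "d":
            table.pop(event["key"], None)          # delete: drop the row
    return table

events = [
    {"op": "c", "key": 1, "after": {"id": 1, "name": "Ada"}},
    {"op": "u", "key": 1, "after": {"id": 1, "name": "Ada L."}},
    {"op": "c", "key": 2, "after": {"id": 2, "name": "Grace"}},
    {"op": "d", "key": 2, "after": None},
]
print(apply_change_events(events))  # {1: {'id': 1, 'name': 'Ada L.'}}
```

<p>The whole difficulty of production CDC is everything around this loop: ordering guarantees, snapshots, and schema changes, which is exactly what the post's architecture addresses.</p>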
<p><em>PS: I just split the Fast News to have a smaller one. Fast News contains lighter news and broad articles.</em></p>
<figure class="kg-card kg-image-card"><img src="https://images.unsplash.com/photo-1675266873434-5ba73c38ce6f?ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&amp;auto=format&amp;fit=crop&amp;w=1000&amp;q=80" alt="a man with glasses is looking at a laptop" loading="lazy"><figcaption><span>When the stakeholder notices issues before you (</span><a href="https://unsplash.com/photos/hHg9MC-G8_Y?ref=blef.fr" rel="noopener"><span>credits</span></a><span>)</span></figcaption></figure>
<p></p>
<h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1>
<ul><li><a href="https://survey.stackoverflow.co/2023/?ref=blef.fr#work-coding-outside-of-work">Stack Overflow developer survey 2023</a> — Every year SO sends a survey to developers and it gives a great overview of technology usage across the space. This year ~90k people answered; they also added a small AI category to measure the impact on dev work.<br><br>What we see related to data engineering is mainly: <strong>Python and SQL are still shining at the top of technology popularity</strong>—around 50% use them. Thanks to the AI hype Python is the second most desired technology behind Javascript, which augurs well for the future. They also share salary figures, and data engineering / science roles are well situated in the ecosystem: the best-paid jobs in Germany after management positions, but paid less in the US.</li><li><a href="https://vadimdemedes.com/posts/generating-income-from-open-source?ref=blef.fr">Generating income from open source</a> — Vadim shares how he makes money from all the different open-source projects he has, what works and what does not. In the post he also shares the journey of Sidekiq's founder, who's making $10m ARR alone.</li><li><a href="https://twitter.com/lloydtabb/status/1669049723549020160?ref=blef.fr">You can put spaces in BigQuery column names</a> — <em>The editors of blef.fr (me) have no comment</em>. In fact, yes, you are all crazy.</li><li><a href="https://lloydtabb.substack.com/p/malloys-near-term-roadmap?ref=blef.fr">Malloy's Near Term Roadmap</a> — I recently shared the <a href="https://www.blef.fr/data-council-austin-takeaways/">Malloy demo</a>, which was awesome.
The article shares the recent features and also says something I will never forget: "<em>Malloy aims to be syntactically the same no matter what database contains the data</em>".</li><li><a href="https://www.astronomer.io/blog/cloud-ide-new-cell-types?ref=blef.fr">The Astro Cloud IDE</a> — Astronomer released a bunch of Airflow operators in their Cloud IDE (which was released in Dec. but I missed it). I get why companies want us to move into their Cloud IDEs, but I hate this trend. Leave me alone in my PyCharm.</li><li>Cube announcements: <a href="https://cube.dev/blog/introducing-data-graph?ref=blef.fr">Data Graph</a> and <a href="https://cube.dev/blog/introducing-orchestration-api?ref=blef.fr">Orchestration API</a> — These are 2 announcements from Cube. I really like following them because they are thought leaders in the semantic layer space. Data Graph creates an entity diagram from the semantic definitions, while the Orchestration API offers you an endpoint to launch pre-aggregation jobs from your scheduler.</li></ul>
<figure class="kg-card kg-image-card"><img src="https://images.unsplash.com/photo-1605882171181-e31b036e4ceb?ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&amp;auto=format&amp;fit=crop&amp;w=1000&amp;q=80" alt="gray concrete building under white clouds during daytime" loading="lazy"><figcaption><span>We don't need spaces (</span><a href="https://unsplash.com/photos/dsQiZoO1Q4Q?ref=blef.fr" rel="noopener"><span>credits</span></a><span>)</span></figcaption></figure>
<p></p>
<h1 id="data-economy-%F0%9F%A4%96">Data Economy 🤖</h1>
<ul><li><a href="https://www.graphext.com/?ref=blef.fr"><strong>Graphext</strong></a> <a href="https://www.graphext.com/post/graphext-raised-4M-seed-round?ref=blef.fr">raises $4.6m in a seed round</a> (their second) to continue developing a data analysis platform built for exploration. The Spanish startup develops a tool where you quickly explore datasets and then build charts or AI models on top of them. Last year they built a <a href="https://public.graphext.com/f3d05874591c2c0d/index.html?section=graph&colorMap=graphext_cluster&areaMap=null&ref=blef.fr">graph</a> with Data News links, where we clearly see the different content categories I share.</li><li><a href="https://www.telm.ai/?ref=blef.fr"><strong>Telmai</strong></a> <a href="https://www.telm.ai/blog/open-architecture-ai-driven-data-observability-startup-telmai-raises-oversubscribed-seed-funding-of-5-5-million?ref=blef.fr">raises a $5.5m seed round</a>. A new data observability platform enters the space; it looks like they propose the same features as the competition: add your data sources, get automated alerts on data drifts.</li><li>At the same time <a href="https://mastheadata.com/?ref=blef.fr"><strong>Masthead</strong></a><strong> <a href="https://finance.yahoo.com/news/masthead-data-raises-1-3m-130000610.html?ref=blef.fr">raises $1.3m</a></strong>, also as a data observability platform, but done differently. Masthead does not run SQL on your data—which generates a cost uplift—but reads logs and metadata to identify anomalies.</li><li><a href="https://techcrunch.com/2023/06/14/informatica-acquires-privitar-once-valued-at-400m-to-expand-its-data-management-stack/?ref=blef.fr">Informatica acquires Privitar</a>. This consolidation will bring new features to Informatica. As a reminder, Informatica was founded in 1993 and is one of the dinosaurs in the ETL space. Privitar will bring "data security" stuff.</li></ul>
<hr>
<p>See you next week ❤️</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 23.23 ]]></title>
                    <description><![CDATA[ Data News #23.23 — dbt, data contracts, modeling, why AI will save the world, generate QR Codes with AI and more. ]]></description>
                    <link><![CDATA[ /data-news-week-23-23/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 6482ddba56340300016db7eb ]]></guid>
                    <pubDate><![CDATA[ 2023-06-09 ]]></pubDate>
                    <content>
                        <![CDATA[ <figure class="kg-card kg-image-card"><img src="https://images.unsplash.com/photo-1544280124-2f0a80ccee73?ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&amp;auto=format&amp;fit=crop&amp;w=1000&amp;q=80" alt="man holding his eyeglasses" loading="lazy"><figcaption><span>Rethinking the newsletter (</span><a href="https://unsplash.com/photos/DpdTfB8lQTc?ref=blef.fr" rel="noopener"><span>credits</span></a><span>)</span></figcaption></figure>
<p>Here's a new edition of the Data News newsletter. Since my <a href="https://www.blef.fr/data-news-week-23-20/">2-year anniversary</a> post, I've been struggling to find the right writing rhythm. I've been sick and I've been stuck on a client project. Writing the newsletter was not an easy exercise, even though I keep telling myself "it's not a question of motivation, it's a question of discipline" like a LinkedIn guy. I do things because I enjoy the process of doing things, not for the results.</p>
<p>That's why I'll try to change a bit the way things are done for the next 3 months. As of today I do the newsletter every Friday: I search and read articles first and then I write. Starting next week I'll do it on Thursday, to schedule the sending at the same hour every Friday, at 2PM.</p>
<p>This way, I'll dedicate my Fridays to writing original articles, exploring ideas and preparing a stock of articles for the summer holidays. I plan to take a 1-month break during August, but at the same time I have the FOMO—fear of missing out. So I need to schedule articles in advance. I can tease you that I'll create content about "Create a data platform in 2023", with live examples.</p>
<p>In September I will do a retro and decide if this is the right way to continue or not.</p>
<hr>
<p>In terms of content, I've recorded a new podcast episode (in French) that will be out next week. The French version will be a bit different from <a href="https://podcasters.spotify.com/pod/show/blef?ref=blef.fr">Minds of data</a>. It'll be more round tables and discussions about the present and the future of our ecosystem.</p>
<p>We also scheduled the next <a href="https://www.meetup.com/fr-FR/paris-apache-airflow-meetup/events/293888353/?ref=blef.fr">Paris Airflow Meetup</a> in Mirakl offices. Pierre, an Airflow committer and PMC member, will present his Airflow journey. Join us!</p>
<p></p>
<h1 id="data-contracts-dbt-and-modeling">Data contracts, dbt and modeling</h1>
<p>Back to the roots: it's been a long time since I shared dedicated stuff about dbt. This week a natural cluster of articles emerged. A few people have already implemented things with the <a href="https://docs.getdbt.com/docs/collaborate/govern/model-contracts?ref=blef.fr">new model governance</a> dbt introduced last month in v1.5.</p>
<p>Julian shared a nice way to use dbt <a href="https://blog.datadrivers.de/how-we-use-dbt-s-model-governance-features-in-large-projects-ca524e366650?ref=blef.fr">model governance when you have 1000+ models</a>. In a nutshell, you can add new characteristics to models that give more context to dbt: models can have a group, access, a contract and versions. In the article Julian draws a great comparison with software development, where managing models is like managing programmatic APIs with public or private visibility. Finally he also proposes 6 logical data layers to sort your models: source, base, cleanse, core, business and marts.</p>
<p>This structure also gives the team more visibility, because you can draw clear boundaries like: <em>data engineers are responsible for the first 3 layers, analytics engineers for the others</em>.</p>
<p>To go deeper into data contract concepts applied to the warehouse and dbt, you can <a href="https://medium.com/@mikldd/activating-ownership-with-data-contracts-in-dbt-4f2de41c4657?ref=blef.fr">activate ownership with dbt data contracts</a>. Mikkel also showcases his tool synq.io, which runs tests and alerts on top of dbt.</p>
<p>In addition there are 2 awesome articles about related topics:</p>
<ul><li><a href="https://tobikodata.com/simplicity-or-efficiency-how-dbt-makes-you-choose.html?ref=blef.fr">Simplicity or efficiency: how dbt makes you choose</a> — This is a side-by-side comparison of dbt and SQLMesh, a growing alternative to dbt. The comparison is done using a project with 50 models, on 3 aspects: making a change, deploying in dev and deploying in prod. In the end the article is obviously biased towards SQLMesh (it's on the company blog), but it reveals real issues with dbt.</li><li><a href="https://carlineng.com/?postid=data-modeling-divide&ref=blef.fr#blog">The data modeling divide</a> — A discussion about different modeling techniques (OBT, star schema, activity schema, etc.) and the divide within the community and tooling companies over a consensus.</li></ul>
<p></p>
<h1 id="gen-ai-%F0%9F%A4%96">Gen AI 🤖</h1>
<ul><li><a href="https://a16z.com/2023/06/06/ai-will-save-the-world/?ref=blef.fr">Why AI will save the world</a> — Marc Andreessen writes about the prevailing panic and 5 risks associated with AI, asserting that AI will probably do the world more good than harm. Still, it has Cold War vibes inside 🙃.</li></ul>
<blockquote><em>The single greatest risk of AI is that China wins global AI dominance and we – the United States and the West – do not.<br><br>I propose a simple strategy for what to do about this – in fact, the same strategy President Ronald Reagan used to win the first Cold War with the Soviet Union.</em></blockquote>
<ul><li><a href="https://towardsdatascience.com/the-golden-age-of-open-source-in-ai-is-coming-to-an-end-7fd35a52b786?ref=blef.fr">The golden age of open source in AI is coming to an end</a> — An article about changes in open-source code licenses creating less permissive models.</li><li><a href="https://www.wsj.com/articles/rush-to-use-generative-ai-pushes-companies-to-get-data-in-order-c34a7e13?st=c5brvz1f3uh1n9w&ref=blef.fr">Rush to use Generative AI pushes companies to get data in order</a> — Garbage in, garbage out. An article from the Wall Street Journal: obviously, if you want to fine-tune generative models you will have to be sure to have correct training datasets.</li><li><a href="https://mp.weixin.qq.com/s/i4WR5ULH1ZZYl8Watf3EPw?ref=blef.fr">Use ControlNet to generate QR Codes</a> — A Chinese engineer used ControlNet to generate visually appealing and hidden QR Codes. The result is quite impressive and works most of the time.</li></ul>
<figure class="kg-card kg-image-card"><img src="https://mmbiz.qpic.cn/mmbiz_png/xSnEeickjxibJqYibicHBeyMEaskfIOA517AKHQBeJgRaLibN43YiapJH09Rw4Tj1F09yibg9gRTswTFWTG4IuADX55KQ/640?wx_fmt=png&amp;wxfrom=5&amp;wx_lazy=1&amp;wx_co=1" alt="Image" loading="lazy"><figcaption><span>A ControlNet generated QR Code, the link sends to a website to personalise QR codes developed by the author</span></figcaption></figure>
<p></p>
<h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1>
<ul><li><a href="https://towardsdatascience.com/which-team-should-own-data-quality-44f1d6996eb8?ref=blef.fr">Which team should own data quality?</a> — Whether it's data engineering, analytics engineering or more specialised functions supervised by a central governance, this is a good question to ask.</li><li><a href="https://www.castordoc.com/blog/the-next-chapter-for-castordoc?ref=blef.fr">The next chapter for CastorDoc</a> — CastorDoc, previously Castor, is a data catalog. They recently did a rebrand and Tristan shared the new associated vision. They unveiled 5 pillars to achieve the new vision, in which AI-powered insights is the second one.</li><li><a href="https://maxhalford.github.io/blog/graph-components-duckdb/?ref=blef.fr">Graph components with DuckDB</a> — Max always amazes me with his experiments. This time he writes a graph algorithm in SQL to identify connections.</li><li><a href="https://eng.lyft.com/gotchas-of-streaming-pipelines-profiling-performance-improvements-301439f46412?ref=blef.fr">Gotchas of streaming pipelines: profiling &amp; performance</a> — Feedback on how the Lyft team increased performance on their streaming pipelines.</li><li><a href="https://www.figma.com/blog/how-figma-scaled-to-multiple-databases/?ref=blef.fr">The growing pains of database architecture</a> — The Figma team shared learnings about scaling Postgres instances.</li><li><a href="https://dagster.io/blog/backfills-in-ml?ref=blef.fr">Backfills in data &amp; machine learning</a> — Backfilling is when you write or overwrite historical data. Backfilling is one of the most complicated tasks in data engineering because it often requires designing way ahead of problems. Dagster wrote a small guide about considerations you might have when doing backfills.</li><li><a href="https://blog.getdaft.io/p/introducing-daft-a-high-performance?ref=blef.fr">Daft: a high-performance distributed dataframe library</a> — Recently Polars took all the attention regarding dataframe manipulation.
But this new library called Daft could also be a game changer. Daft is written in Rust, uses Arrow, can be distributed and can use complex types.</li><li><a href="https://www.ssp.sh/brain/select-insights-bundling-with-microsoft-fabric-and-orchestration/?ref=blef.fr">SELECT Insights</a> — A fresh new newsletter by Simon Späti. He shared a long list of links and cleverly structured the newsletter like a SQL query.</li></ul>
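<p>Max's graph-components post does this in SQL; as a rough Python analogue, connected components can be found with a small union-find over an edge list. This is a generic illustration of the idea, under the assumption of a simple (a, b) edge representation, not a translation of Max's actual queries.</p>

```python
def connected_components(edges):
    """Group nodes into connected components using union-find.

    `edges` is an iterable of (a, b) pairs; returns a list of components
    as sorted lists. A Python analogue of the SQL approach, for illustration.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:            # walk to the root, halving the path
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)        # union the two components

    groups = {}
    for node in parent:
        groups.setdefault(find(node), []).append(node)
    return [sorted(group) for group in groups.values()]

print(connected_components([(1, 2), (2, 3), (4, 5)]))  # [[1, 2, 3], [4, 5]]
```

<p>The fun part of the SQL version is that the warehouse has no mutable `parent` array, so the same fixed point has to be reached with iterative self-joins instead.</p>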
<p></p>
<h1 id="data-economy-%F0%9F%92%B0">Data Economy 💰</h1>
<ul><li><a href="https://cohere.com/?ref=blef.fr"><strong>Cohere</strong></a> <a href="https://txt.cohere.com/announcement/?ref=blef.fr">announces $270M Series C</a>. Cohere is an OpenAI alternative; they propose an API and Python, Go or Node SDKs to add "language" to your traditional app.</li></ul>
<hr>
<p>See you next week ❤️.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 23.22 ]]></title>
                    <description><![CDATA[ Data News #23.22 — Japan views on copyright for AI, a new AI camera, what&#39;s the hype behind DuckDB?. ]]></description>
                    <link><![CDATA[ /data-news-week-23-22/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 6479bb3758799d0001efda72 ]]></guid>
                    <pubDate><![CDATA[ 2023-06-03 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/06/image-1.png" class="kg-image" alt loading="lazy" width="2000" height="1333" srcset="https://www.blef.fr/content/images/size/w600/2023/06/image-1.png 600w, https://www.blef.fr/content/images/size/w1000/2023/06/image-1.png 1000w, https://www.blef.fr/content/images/size/w1600/2023/06/image-1.png 1600w, https://www.blef.fr/content/images/size/w2400/2023/06/image-1.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>Sun is coming in Berlin (<a href="https://unsplash.com/photos/nphovVuT9OE?ref=blef.fr">credits</a>)</figcaption></figure><p>Hey, I've been sick longer than I expected, but I'm finally well. I hope this email finds you all well, as well. I've had to catch up on almost 3 weeks of content. When I step back, the amount of articles shared each week is insane; there are countless articles about things that have already been written. Sometimes I feel like I'm trying to find a needle in a haystack. Or several needles.</p><p>I wanted to write more about Microsoft Fabric and the states of data that were <a href="https://www.blef.fr/data-news-week-23-21/">published last week</a> but I'll do it another time.</p><p></p><h1 id="gen-ai-%F0%9F%A4%96">Gen AI 🤖</h1><p>As always the pace of innovation in this field is incredibly fast, so here are a few news items I found worth it:</p><ul><li><a href="https://technomancers.ai/japan-goes-all-in-copyright-doesnt-apply-to-ai-training/?ref=blef.fr">Japan goes all in: copyright doesn’t apply to AI training</a> — I'm far from being a law expert but it looks like something that will set a precedent. The article says this fits with Japan's new strategy to become a leader in AI technologies: by removing barriers on training data they hope to open doors.
Obviously artists (especially mangakas) were not happy about it.</li><li><a href="https://www.politico.eu/article/open-ai-chatgpt-sam-altman-kicks-off-eu-charm-offensive-artifical-intelligence/?ref=blef.fr">Sam Altman, OpenAI's CEO, did a Europe tour</a> — Sam went to Europe recently (Spain, France, Poland, Germany and the UK) in order to meet country representatives. I guess he did some lobbying around the AI Act, but he was also scouting real estate because OpenAI wants a European office.</li><li><a href="https://www.theverge.com/2023/5/29/23741011/this-is-what-a-144tb-nvidia-gpu-looks-like?ref=blef.fr">New Nvidia 144TB GPU</a> — Nvidia is the clear winner of the AI race. They announced an insane new GPU and Google, Meta and Microsoft are already customers. Surprising.</li><li><a href="https://doordash.engineering/2023/05/31/how-doordash-uses-xcodegen-to-eliminate-project-merge-conflicts/?ref=blef.fr">How DoorDash uses XcodeGen to eliminate project merge conflicts</a> — Ok now I don't want to resolve a Git conflict anymore 😅.</li><li>US researchers developed an LLM-powered Minecraft agent: <a href="https://voyager.minedojo.org/?ref=blef.fr">Voyager</a>. Minecraft is a survival game and the agent has been designed to learn life skills in Minecraft incrementally. In the end it generates code that is used to send the agent into the cubic world.</li><li><a href="https://bjoernkarmann.dk/project/paragraphica?ref=blef.fr">A new kind of camera</a> — An artist developed an AI camera, the Paragraphica, a context-to-image camera. 
The camera is using location data to feed context to a generative algorithm.</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/06/image.png" class="kg-image" alt loading="lazy" width="2000" height="767" srcset="https://www.blef.fr/content/images/size/w600/2023/06/image.png 600w, https://www.blef.fr/content/images/size/w1000/2023/06/image.png 1000w, https://www.blef.fr/content/images/size/w1600/2023/06/image.png 1600w, https://www.blef.fr/content/images/size/w2400/2023/06/image.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>A dynamic prompt — (Paragraphica camera)</figcaption></figure><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li><a href="https://meltano.com/blog/introducing-meltano-cloud-you-build-the-pipelines-we-manage-the-infrastructure/?ref=blef.fr">Meltano announced their Cloud</a> — Meltano is an open-source data integration project that has been started at Gitlab. With a few configuration and a CLI you can write data pipelines using hundreds of connectors (using Singer spec). The pricing is based on the number of runs and not the volume of data. This is a major difference with the competition (Airbyte, Fivetran, Stitch).</li><li><a href="https://rides.jurajmajerik.com/map?ref=blef.fr">A ridesharing app simulation</a> — Juraj developed over the last months a complete simulation of a ridesharing app (like Uber), he shared everything he did in blog posts and the results is kinda amazing. 
I recently spent hours on <a href="https://dinopoloclub.com/games/mini-motorways/?ref=blef.fr">Mini Motorways</a> so this is the kind of side project I like.</li><li><a href="https://moderndataengineering.substack.com/p/breaking-into-data-engineering-as?ref=blef.fr">Breaking into data engineering as a self-taught developer</a> — Some advice from a fellow data engineer who was a data analyst before.</li><li><a href="https://mattpalmer.io/posts/whats-the-hype-duckdb/?ref=blef.fr">What's the hype behind DuckDB?</a> — This is a great post from Matt Palmer about DuckDB. If you want a quick intro to the tool this is the way to start. In the article Matt also showcases how you could use DuckDB to write a transfer pipeline, like moving a Parquet file from a disk to S3.</li><li><a href="https://tech.instacart.com/how-instacart-ads-modularized-data-pipelines-with-lakehouse-architecture-and-spark-e9863e28488d?ref=blef.fr">How Instacart Ads modularized data pipelines with Spark</a> — A great deep dive on a Lakehouse architecture for streaming. The article describes a migration from "thousands of complex SQL lines" to composable Spark SQL.</li><li><a href="https://zendesk.engineering/dbt-at-zendesk-part-i-setting-foundations-for-scalability-34b55e6a6aa1?ref=blef.fr">dbt at Zendesk; setting foundations for scalability</a>.</li></ul><p></p><h1 id="data-economy-%F0%9F%92%B0">Data Economy 💰</h1><ul><li><a href="https://www.databricks.com/blog/welcoming-bit-io-databricks-investing-developer-experience?ref=blef.fr">Databricks acquires bit.io</a> — bit.io was "the fastest way to get a Postgres database". In order to start you just had to send data and your database was already set up. Looking at the press release, Databricks' acquisition is a team acquisition to improve their own developer experience.</li></ul><hr><p>Now I'm going back to Diablo — See you next week ❤️.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 23.21 ]]></title>
                    <description><![CDATA[ Data News #23.21 — Raw news, Gen AI, Microsoft Fabric, states of data, dbt Labs layoffs and fundraising. ]]></description>
                    <link><![CDATA[ /data-news-week-23-21/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 64705a686abd7b00016c0e78 ]]></guid>
                    <pubDate><![CDATA[ 2023-05-29 ]]></pubDate>
                    <content>
                        <![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/05/image-6.png" class="kg-image" alt="" loading="lazy" width="2000" height="1305" srcset="https://www.blef.fr/content/images/size/w600/2023/05/image-6.png 600w, https://www.blef.fr/content/images/size/w1000/2023/05/image-6.png 1000w, https://www.blef.fr/content/images/size/w1600/2023/05/image-6.png 1600w, https://www.blef.fr/content/images/size/w2400/2023/05/image-6.png 2400w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Me (</span><a href="https://unsplash.com/photos/BuNWp1bL0nc?ref=blef.fr"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p>Hey, I've been sick for the last 3 days and it was impossible to write anything. As I still want to send something, here is a raw edition with no comments. See you on Friday.</p><h1 id="gen-ai-%F0%9F%A4%96">Gen AI 🤖</h1><ul><li><a href="https://github.com/artidoro/qlora?ref=blef.fr">QLoRA: Efficient Finetuning of Quantized LLMs</a> — a 65B parameter model on a single 48GB GPU reaching 99.3% of the performance level of ChatGPT on Vicuna.</li><li><a href="https://www.engine.study/blog/modding-age-of-empires-ii-with-a-sprite-diffuser/?ref=blef.fr">Modding Age of Empires II with a Sprite-Diffuser</a>.</li><li><a href="https://www.linkedin.com/feed/update/urn:li:activity:7067532623547432962/?updateEntityUrn=urn%3Ali%3Afs_feedUpdate%3A%28V2%2Curn%3Ali%3Aactivity%3A7067532623547432962%29&ref=blef.fr">Github Copilot Chat announcement</a>.</li></ul><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li><a href="https://www.vantage.sh/blog/clickhouse-local-vs-duckdb?ref=blef.fr">clickhouse-local vs DuckDB</a> — DuckDB is not the only one to work great locally. 
In this test clickhouse works better.</li><li><a href="https://databased.pedramnavid.com/p/the-future-of-data?ref=blef.fr">The Future of Data </a>— Everyone wants a piece of the pie; no one wants to bake.</li><li><a href="https://airbyte.com/blog/data-modeling-unsung-hero-data-engineering-architecture-pattern-tools?ref=blef.fr">Data Modeling, architecture Pattern, tools and the future</a> — part 3 of Simon's guide.</li><li><a href="https://www.microsoft.com/en-us/microsoft-fabric?ref=blef.fr">Microsoft Fabric</a> — everyone was talking about it on LinkedIn. This is the Lakehouse integration for Analytics into Azure. Here are <a href="https://datamonkeysite.com/2023/05/27/first-impression-of-microsoft-fabric/?ref=blef.fr">first impressions</a>, how it <a href="https://powerbi.microsoft.com/en-us/blog/introducing-microsoft-fabric-and-copilot-in-microsoft-power-bi/?ref=blef.fr">includes with Power BI</a> and a <a href="https://www.linkedin.com/pulse/answering-early-questions-fabrics-place-your-stack-luke-fangman%3FtrackingId=hOm50xIqSgiQHyTQfjVrZw%253D%253D/?trackingId=hOm50xIqSgiQHyTQfjVrZw%3D%3D&ref=blef.fr">few remarks</a>. 
Honestly this looks like a disguised Databricks.</li><li>States of data season — <a href="https://state-of-data.com/?ref=blef.fr">Airbyte's state of data</a>, <a href="https://www.databricks.com/sites/default/files/2023-05/databricks-2023-state-of-data-report.pdf?ref=blef.fr">Databricks's</a>, <a href="https://lakefs.io/blog/the-state-of-data-engineering-2023?ref=blef.fr">lakeFS's</a>.</li><li><a href="https://newsletter.engineering.land/p/engineering-levels-a-simple-framework?ref=blef.fr">Engineering Levels: a simple framework for startups</a>.</li><li><a href="https://towardsdatascience.com/writing-design-docs-for-data-pipelines-d49550f95580?ref=blef.fr">Writing design docs for data pipelines</a>.</li><li><a href="https://datamonkeysite.com/2023/05/22/databend-and-the-rise-of-data-warehouse-as-a-code/?ref=blef.fr">Databend and the rise of Data warehouse as a code</a>.</li></ul><p></p><h1 id="data-economy-%F0%9F%92%B0">Data Economy 💰</h1><ul><li><a href="https://www.getdbt.com/blog/dbt-labs-update-a-message-from-ceo-tristan-handy/?ref=blef.fr">dbt Labs reduced 15% of their staff</a>. Tristan announced it on the blog and the company provided transition perks. It was a sad announcement.</li><li><a href="https://www.snowflake.com/blog/snowflake-acquires-neeva-to-accelerate-search-in-the-data-cloud-through-generative-ai/?ref=blef.fr">Snowflake acquired Neeva</a> — A generative AI search company that was in difficulty got acquired by Snowflake.</li><li><a href="https://www.politico.eu/article/eu-hits-meta-with-record-e1-2b-privacy-fine/?ref=blef.fr">EU hits Meta with record €1.2B privacy fine</a> — under GDPR.</li><li><a href="https://dagster.io/blog/elementl-series-b?ref=blef.fr">Elementl (Dagster) Raised $33m</a> — to continue building the data orchestrator.</li></ul><hr><p>See you soon. ❤️</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — 2 years anniversary ]]></title>
                    <description><![CDATA[ A personal letter to share my freelance / content creation journey publicly. To say thank you. ]]></description>
                    <link><![CDATA[ /data-news-week-23-20/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 6465dc827bf870000134c168 ]]></guid>
                    <pubDate><![CDATA[ 2023-05-19 ]]></pubDate>
                    <content>
                        <![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/05/hbd.gif" class="kg-image" alt loading="lazy" width="690" height="388" srcset="https://www.blef.fr/content/images/size/w600/2023/05/hbd.gif 600w, https://www.blef.fr/content/images/2023/05/hbd.gif 690w"><figcaption>TWO YEARS —&nbsp;HAPPY BIRTHDAY</figcaption></figure><p>👋 Here is a special edition for me. Exactly 2 years ago, I sent out my <a href="https://www.blef.fr/data-news-2021-20/">first</a> email newsletter. At the time, only 3 people received it. I already told the story in <a href="https://www.blef.fr/blef-datagen-podcast/">Robin's podcast</a>, here is a written version. In 2021, I was doing Twitch lives twice a week, every Wednesday I was doing a data news round-up. One day, I decided to save the links on a blog created for the occasion, a few days later, 3 people subscribed. This is what made me decide to send emails containing my round-up. By chance.</p><p>So I want to thank Max, Théodore and Emiel, it is largely thanks to you that this newsletter exists. If you had not joined so early, I would never have realized that people would like to read my content. These bookmarks that I was saving mostly for myself.</p><p>Today, 104 editions later, I want to take a look back at my content creation journey, but also at my freelance journey that started one year earlier, in 2020.</p><div class="kg-card kg-callout-card kg-callout-card-red"><div class="kg-callout-emoji">😱</div><div class="kg-callout-text">If you only want to read Data News you can read my selection of talks from the <a href="https://www.blef.fr/data-council-austin-takeaways/">Data Council</a>.</div></div><p></p><h1 id="the-beginning">The beginning</h1><p>Before becoming a freelancer, I was working at Kapten, a French PHV company—an Uber competitor—where I was leading the data engineering team. 
We were a team of 6 people and our goal was to build the data platform for the company. During my time at Kapten, we built a data stack with Airflow, BigQuery and Metabase + Tableau. I was coming from the Hadoop world and BigQuery was a breath of fresh air. The component I'm most proud of is the ELT framework we built on top of Airflow to give total autonomy to analysts and scientists on the data loading and transformation processes.</p><p>In a nutshell it was an ETL-as-configuration on top of Airflow. You were able to define <a href="https://docs.google.com/presentation/d/1HPVwWSZAmOSCNy1uWTx7ecTS-e9l9Ize3s29SQdfqAE/edit?ref=blef.fr#slide=id.g48298f4f5f_0_56">configs</a> in Python to do full or incremental loading from different sources, processing in SQL or Python and exports. The framework and the processes were pretty strict, but it worked and gave analysts full autonomy to build whatever they wanted. All the ownership was given back to others; we were just writing software and maintaining a platform.</p><p>I think it took almost a year to build the entire platform. We had set a goal: no broken Airflow pipelines in a 30-day sliding window. We achieved that. And we hit a plateau. We were doing less data engineering because everything was working well, less firefighting, looking for a new vision. As human beings, we wanted to fill the void, so we explored different things: real-time feature store, data lineage or data contracts—we call it that today, but back in the day it was only schema management. But what was the next step for us?</p><p>I had done what I was hired to do: build a data platform for analytics and analysts. It was time for me to leave, and at the same time the context changed: we got acquired and laid off. That's where my freelance journey started.</p><p></p><h1 id="going-into-freelance">Going into freelance</h1><p>I left when COVID was at its peak and a few people did not understand the move. 
To be honest I didn't even know where I was going but I was confident in my skillset and in my ability to sell my data engineering expertise. In retrospect I was just naive.</p><p>The Kapten experience brought me expertise on Airflow and GCP, a good knowledge about Kubernetes and a lead experience. In addition to my solid engineering and infra skills it creates a good resume.</p><p>By chance 2 of my former bosses heard about my freelancing and proposed me work. It led respectively to a 3-months and a 1-year mission with <a href="https://www.equancy.fr/fr/?ref=blef.fr">Equancy</a> and <a href="https://qonto.com/en?ref=blef.fr">Qonto</a>. Then I did a mission with <a href="https://yousign.com/?ref=blef.fr">Yousign</a> with Faouz that I met a few years earlier thanks to a LinkedIn message. The common point of the 3 missions was to build stuff around Airflow. In a blink my first company fiscal year was already done, with around €180k in revenue.</p><p>While I was at Qonto, we migrated to dbt, which was rapidly being adopted by French startups. This allowed me to add a new tool to my belt. Then it became a new expertise.</p><p>In my second fiscal year (2022), I had the privilege of working with the French tax authority to help them define the vision for the 2027 data platform and with the Ministry of Education to implement Superset and dashboards on that platform. In the blink of an eye, my second year was already over with less revenue (160k€) but in less time.</p><p>Along the way I also helped startups hiring—<a href="https://www.folk.app/?ref=blef.fr">Folk</a>, <a href="https://en.modjo.ai/?ref=blef.fr">Modjo</a>, <a href="https://www.kard.eu/?ref=blef.fr">Kard</a>—and did mentoring—<a href="https://blent.ai/?ref=blef.fr">Blent.ai</a>, <a href="https://libeo.io/en?ref=blef.fr">Libeo</a>, <a href="https://ibanfirst.com/?ref=blef.fr">iBanFirst</a>, <a href="https://nibble.ai/?ref=blef.fr">nibble</a>. 
I even hired 2 awesome interns who helped me on the blog for a few months. As 2023 is still running I'll keep it for another retrospective.</p><p>While my story is exciting, here are a few things to learn from it:</p><ul><li>Former co-workers are part of your network and are probably the ones who vouch for you the most.</li><li>In terms of networking, participate in events, give to the community and you will receive something at some point. Don't be afraid to solicit people on LinkedIn, people respond more often than you'd think.</li><li>Find the main reason why you want to freelance. It can be many things like money, freedom, issues with authority, digital nomadism, etc.</li><li>If it's money I think you gonna miss the freedom part of being freelance. If you want to do a lot of ca$h you will work every day, in a long-term mission for a big company. Which is actually like a permanent position without the perks of it (at least in countries where we have a social system).</li><li>Set your daily rate and (try to) stick to it. I started at €800, then went up to €1000 and now I'm at €1200. Don't forget that you are competing with agencies, often charging high prices.</li><li>One of my strict conditions is to work only part-time. In fact, I work an average of 2.5 days a week. To be successful, you have to be organized and be aware of <a href="https://en.wikipedia.org/wiki/Context_switch?ref=blef.fr">context switching</a>. To be honest, this is very difficult and I am bad at it.</li><li>In my opinion, to freelance in data engineering, you need at least two or three proven experiences in data engineering. Very often, as a freelancer, you are perceived as someone who knows things. To be assertive, you'll need to be confident in your recommendations.</li><li>Identify your strengths and communicate clearly about it. 
Here's how I say it: <em>I'm a data engineer who has built a lot of data platforms for analytics, with expertise in Airflow, dbt, Superset and infrastructure</em>.</li></ul><p></p><h1 id="juggling-with-content-creation">Juggling with content creation</h1><figure class="kg-card kg-image-card"><img src="https://www.blef.fr/content/images/2023/05/Untitled-Project-1-1-.gif" class="kg-image" alt loading="lazy" width="690" height="388" srcset="https://www.blef.fr/content/images/size/w600/2023/05/Untitled-Project-1-1-.gif 600w, https://www.blef.fr/content/images/2023/05/Untitled-Project-1-1-.gif 690w"></figure><p>Doing freelance data engineering is a great thing for me; I've been working with computers since I was young. It's always better when passion meets your work. Alongside this, I also started creating content in January 2021. This was one of my goals when I decided to go part-time, so I could have time for content creation.</p><p>I did not set clear business objectives for my content creation. After all, I went to engineering school, not business school. That's probably why I often lose focus and do multiple things. Here is a small selection of what I tried:</p><ul><li>Twitch — I did 4 months of Twitch at the beginning, but my 2-month holidays with no internet broke my routine. I don't think I'll go back to solo lives.</li><li>I made YouTube videos — I have 7 videos, and each video took me about 20 hours; hard to fit into my daily routine but it will come back.</li><li>Twitter — even if I went from 200 followers to 400 followers on Twitter, I can't find my voice there. This is sad because Twitter is the social network I consume the most.</li><li>LinkedIn — I tried multiple things on LinkedIn but I don't have the discipline to publish one post a day. In the end I went from 2000 followers to 6000+ in 2 years.</li><li>Podcasts — the new thing I've recently started. 
Once again I lose focus, but the podcast format is so satisfying to do.</li></ul><p>And finally, the newsletter, which is my safe place. I've found discipline in writing my own content with my own tone. It takes me about a day of work per week. Basically, I spend 2 hours selecting content, 1 hour reading the content, 2 hours writing, and 1 hour post-processing. In the end, I'm proud of the quality of the newsletter, but one day is a lot and after 3 years, I have to wonder which direction to go in.</p><p><strong>Actually, I don't care, I'll continue like this</strong>. But why do I do content:</p><ul><li>I like to share / transmit to others; when I was a kid I wanted to be a maths teacher.</li><li>It creates visibility for me and as a freelancer I need to be visible.</li><li>It helps me shape my ideas.</li><li>I love the adrenaline rush I get when I do things publicly. Even though there are serious downsides to it, like <a href="https://fr.wikipedia.org/wiki/Syndrome_FOMO?ref=blef.fr">FOMO</a> or addiction, I love it.</li><li>I hope that in the long run it will generate enough money for me to do less consulting. Blog subscriptions bring me 300 € / month. Which is less than 2% of my revenue 🫠.</li></ul><p></p><h1 id="conclusion">Conclusion</h1><p>This is a post that is more personal than what I usually do. This time I did not make promises like I did in the past. Promises I didn't keep because I'm lazy. At least I learn from my mistakes.</p><p>Whether you are a customer, a friend or a subscriber, thank you very much for your support over the past 3 years. Let's continue for another 3 years? ❤️</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data Council 2023 ]]></title>
                    <description><![CDATA[ A selection of 10 talks I really enjoyed among the Data Council forward thinking presentations. ]]></description>
                    <link><![CDATA[ /data-council-austin-takeaways/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 645e7208b647d00001c4e0ad ]]></guid>
                    <pubDate><![CDATA[ 2023-05-18 ]]></pubDate>
                    <content>
                        <![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/05/image-5.png" class="kg-image" alt loading="lazy" width="2000" height="1502" srcset="https://www.blef.fr/content/images/size/w600/2023/05/image-5.png 600w, https://www.blef.fr/content/images/size/w1000/2023/05/image-5.png 1000w, https://www.blef.fr/content/images/size/w1600/2023/05/image-5.png 1600w, https://www.blef.fr/content/images/size/w2400/2023/05/image-5.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>(<a href="https://unsplash.com/photos/p7av1ZhKGBQ?ref=blef.fr">credits</a>)</figcaption></figure><p>Data Council Austin is a yearly conference that features a great panel of speakers giving talks about the future of the data field. As I often do, I've looked over the 70 presentations and here's a medley of what I liked.</p><div class="kg-card kg-button-card kg-align-center"><a href="https://www.youtube.com/watch?v=yNQWjCGHV88&list=PLAesBe-zAQmF-GpvZ3ba5YpVzoVbgzl8M&ref=blef.fr" class="kg-btn kg-btn-accent">Data Council 2023 YouTube playlist</a></div><p></p><h1 id="my-personal-selection">My personal selection</h1><p>If you had only 3 videos to watch, it should be these 3:</p><ul><li><a href="https://www.youtube.com/watch?v=zmmJgwc3oPI&ref=blef.fr">Malloy an experimental language</a> — This is my favourite talk. Lloyd, founder of Looker, puts 30 years of data warehousing into perspective in 30 minutes, especially the fact that we see "data in rectangles." Since joining Google, he's been working on Malloy, a new way to query data. Malloy compiles to SQL and works on data semantics. The presentation gives another look at the semantic layer. During the demo, Lloyd does some data analysis in the browser and it's just mind-blowing 🤯. 
<br><br>At the same time someone at Google also did a <a href="https://www.youtube.com/watch?v=oo1uwJ3qHwE&ref=blef.fr">Calcite</a> presentation.</li><li><a href="https://www.youtube.com/watch?v=qT-Atu9mfvM&ref=blef.fr">Data contracts, Accountable data quality</a> — Data contracts are a trendy concept that covers a lot of things. Chad Sanderson did the best recap of it. DE is often constant firefighting, with a lot of (spaghetti) SQL to maintain. A lot of breaking changes come from upstream producers (form or content).<br><br>At scale everything breaks without data quality; the modern data stack is good because it's self-service and easy to implement, but it lacks everything needed to be mature in the future: ownership, data quality, context. It creates a non-consensual API: we pull data but never agreed on a contract (SLA, schema, etc.).<br><br>The root cause is mainly miscommunication between producers and consumers. Data contracts aim to fix this with API-based agreements between producers and consumers that capture the schema, semantics, distributions and enforcement policies of the data. <br> <br>You can also watch Whatnot's data contracts <a href="https://www.youtube.com/watch?v=h1IU8Q6KD2g&ref=blef.fr">implementation</a>.</li><li><a href="https://www.youtube.com/watch?v=Dbr8jmtfZ7Q&ref=blef.fr">Metric trees</a> — It reminds me of the KPI frameworks people were building when I started working at a consultancy firm. This is a nice way to represent your company's business. Still today 90% of the value a data team delivers is in the analytics. The analytics goal is to model the business correctly. You should answer 4 questions: what happened, why did it happen, what's going to happen, what should we do next.<br><br>Organisations are systems with inputs, outputs and a formula. Formulas have metrics, relationships and weights. In the end you can depict all your KPIs with formulas.<br><br>The data team strategy should mainly be to define and operationalise the company growth model. 
Using a metric tree as a logical representation of a growth model. You have 3 types of outputs: customer value, financial and strategic.</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/05/Screenshot-2023-05-12-at-15.23.06.png" class="kg-image" alt loading="lazy" width="1858" height="796" srcset="https://www.blef.fr/content/images/size/w600/2023/05/Screenshot-2023-05-12-at-15.23.06.png 600w, https://www.blef.fr/content/images/size/w1000/2023/05/Screenshot-2023-05-12-at-15.23.06.png 1000w, https://www.blef.fr/content/images/size/w1600/2023/05/Screenshot-2023-05-12-at-15.23.06.png 1600w, https://www.blef.fr/content/images/2023/05/Screenshot-2023-05-12-at-15.23.06.png 1858w" sizes="(min-width: 720px) 720px"><figcaption>Screenshot of Metric trees presentation.</figcaption></figure><p></p><h1 id="other-stuff-i-liked">Other stuff I liked</h1><ul><li><a href="https://www.youtube.com/watch?v=z6sbY-c6gAQ&ref=blef.fr">Snowflake optimisation guide</a> — This is a pragmatic guide on how you can lower your Snowflake costs. In the current context we have to do more with less. The talk starts with a great introduction to Snowflake architecture. In a nutshell the speakers share tips about warehouse sizing and design, and performance optimisation with pruning, clustering and query design.</li><li><a href="https://www.youtube.com/watch?v=TCoX7FQ1Jdc&ref=blef.fr">LLMs and Semantic layer</a> — This is something I've had in mind for some time. This is a tool presentation but still relevant. On the same topic of self-service, Whatnot shared how they turned <a href="https://www.youtube.com/watch?v=wyW6hQGZxgY&ref=blef.fr">data consumers into data constructors</a>.</li><li><a href="https://www.youtube.com/watch?v=u82r_eqUaiI&ref=blef.fr">Scaling Uber metrics systems</a> (w/ Pinot) — the uMetric migration from ES to Pinot. They created a unified layer where metrics use the same logic for downstream consumers. 
uMetric manages definition, discovery, computation, verification and serving.</li><li><a href="https://www.youtube.com/watch?v=WR7e7dQgk7I&ref=blef.fr">Writing unit tests for data science</a> — A pragmatic guide about unit tests.</li><li><a href="https://www.youtube.com/watch?v=yNQWjCGHV88&ref=blef.fr">Retro on data science by DJ Patil</a> — DJ Patil has been US Chief Data Scientist. He coined the "data scientist" term back in 2008. He does a great retro.</li><li><a href="https://www.youtube.com/watch?v=n2GO1EN5If8&ref=blef.fr">Dashboards as code</a> — Using code to make BI development better; this is DataOps. We have almost everything as code in the whole data chain, only dashboards lack it.</li><li><a href="https://www.youtube.com/watch?v=_mpWp_1kqKY&ref=blef.fr">Growing the data Team and data Culture at GitLab</a> — GitLab's data playbook is well-known. Also the eng–director gap problem: this is when you have a director that manages an individual contributor.</li><li><a href="https://www.youtube.com/watch?v=cGgzHN6MG8E&ref=blef.fr">A deep-dive into the dbt manifest</a> — How to do a dry-run in a cloud data warehouse, load the manifest as dynamic DAGs, enforce policies or build monitoring.</li><li><a href="https://www.youtube.com/watch?v=Kwo4ltNroak&ref=blef.fr">Augmenting the modern data stack</a> — by merging batch and real-time technologies in one database.</li></ul><hr><p>See you soon ❤️.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 23.19 ]]></title>
                    <description><![CDATA[ Data News #23.19 — Minds of data my new podcast, Google I/O takeaways, HuggingFace releases, Salesforce GPT and the Fast News ⚡️. ]]></description>
                    <link><![CDATA[ /data-news-week-23-19/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 645b9f11b647d00001c4dc32 ]]></guid>
                    <pubDate><![CDATA[ 2023-05-12 ]]></pubDate>
                    <content>
                        <![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/05/image-2.png" class="kg-image" alt loading="lazy" width="2000" height="1334" srcset="https://www.blef.fr/content/images/size/w600/2023/05/image-2.png 600w, https://www.blef.fr/content/images/size/w1000/2023/05/image-2.png 1000w, https://www.blef.fr/content/images/size/w1600/2023/05/image-2.png 1600w, https://www.blef.fr/content/images/size/w2400/2023/05/image-2.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>Sorting the news (<a href="https://unsplash.com/photos/1hUY8SpJ8Cw?ref=blef.fr">credits</a>)</figcaption></figure><p>Hey you, new Friday means Data News. This week is pretty stacked in terms of content, especially video / audio content. I hope you will enjoy it as much as I did. </p><p>Let's start with my newly created podcast Minds of Data. In Minds of Data I meet people from the data ecosystem in order to learn more about them. In the first episode I sat down with Joe Reis and we discussed his professional journey before becoming the thought leader he is today; we also chatted about data engineering. You can listen to the episode on <a href="https://open.spotify.com/show/7bkiM0BFwhgXdHvBVaThrB?ref=blef.fr">Spotify</a>, <a href="https://podcasts.apple.com/us/podcast/minds-of-data/id1686939820?ref=blef.fr">Apple Podcast</a> and <a href="https://www.deezer.com/us/show/6045437?ref=blef.fr">Deezer</a>.</p><!--kg-card-begin: html--><iframe src="https://podcasters.spotify.com/pod/show/blef/embed/episodes/Episode-1--Joe-Reis-e23mt2h" height="102px" width="400px" frameborder="0" scrolling="no"></iframe><!--kg-card-end: html--><p><em>PS: this is my first episode ever so feedback is more than welcome.</em></p><p>At the same time, in Paris we organised the May Airflow meetup last Tuesday. We had 3 talks, which you can find on <a href="https://www.youtube.com/@parisairflow?ref=blef.fr">YouTube</a>. 
I really liked Benoit and Samy's <a href="https://www.youtube.com/watch?v=xULkJUEaEsA&ref=blef.fr">presentation about Cloud Composer</a>—Managed Airflow on GCP. They shared good practices on how to manage Composer in the cloud, things like:</p><ul><li>Use the same configuration for staging and prod</li><li>Use a secret manager to manage your Airflow connections</li><li>Use IAM restrictions on the DAGs bucket</li><li>Use operators and define the company policy around them</li><li>Define clear policies to govern your Airflow</li></ul><p>Also, <a href="https://airflow.apache.org/blog/airflow-2.6.0/?ref=blef.fr">Airflow 2.6</a> came out this week with a new parameterizable trigger-DAG UI, a new alert notification framework (callbacks) and a new graph interface in the grid view.</p><p></p><h1 id="gen-ai-%F0%9F%A4%96">Gen AI 🤖</h1><p>The pace of innovation and announcements in the (Gen) AI field isn't slowing down. I can't really cover the whole field because it moves so fast that I can't even keep up. This week the <a href="https://www.youtube.com/watch?v=cNfINi5CNbY&ref=blef.fr">Google I/O Keynote</a> was a major milestone.</p><h3 id="google-io-keynote-takeaways">Google I/O Keynote takeaways</h3><p>What amazed me in the Google Keynote is the fact that Generative AI is treated like a product, like the 2007 iPhone—look at this <a href="https://youtu.be/cNfINi5CNbY?t=3050&ref=blef.fr">ad</a>. When you think about it, AI has always been something hidden, like an API call, a score or a recommendation in a larger UI. 
In Google's Keynote, AI gets a 26-minute segment, and then all its derivations last for 2 hours.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/05/Screenshot-2023-05-12-at-11.55.50.png" class="kg-image" alt loading="lazy" width="1190" height="546" srcset="https://www.blef.fr/content/images/size/w600/2023/05/Screenshot-2023-05-12-at-11.55.50.png 600w, https://www.blef.fr/content/images/size/w1000/2023/05/Screenshot-2023-05-12-at-11.55.50.png 1000w, https://www.blef.fr/content/images/2023/05/Screenshot-2023-05-12-at-11.55.50.png 1190w" sizes="(min-width: 720px) 720px"><figcaption>Bold tagline &amp; Google ego speaking (screenshot from the Keynote)</figcaption></figure><p>To me, Google's annual conference is a sign that the party is over, especially for OpenAI. Actually, OpenAI's deal with Microsoft was probably the best deal they could have gone for. Even if, as humans, we want to send models into the <a href="https://lmsys.org/blog/2023-05-10-leaderboard/?ref=blef.fr">arena</a> to find the most performant one, or indulge in intellectual masturbation comparing parameter counts, in the end the best-integrated models will win. And Google has a head start—as does Microsoft. As they remind us in the Keynote, they have 15 products used by billions of people: they have our e-mails, our photos, our maps and more. AI is just a feature in their products; even if it needs a UI rethink, it is still just a feature.</p><p>So in the end Google, an AI-first company from the beginning, wants to put AI everywhere and wants to offer you an AI collaborator. Here are the major takeaways from the Keynote:</p><ul><li>They released PaLM 2, their latest foundation model. It will exist in 4 sizes: Gecko, Otter, Bison and Unicorn, each requiring different hardware resources to run.</li><li>PaLM 2 will be natively integrated in Google products. 
Gmail will get enhanced smart-reply features, Maps will propose an immersive view of a route and Photos will have a magic editor that will let you edit a picture in a single drag-n-drop.</li><li>Google will create a sidekick called Duet AI that will be available in Workspace—Sheets, Docs and Slides. You'll be able to ask the AI to create content for you, unlocking productivity gains. Duet AI will also work in GCP (in the console and within the web IDE).</li><li>According to the announcement, PaLM 2 will particularly shine when fine-tuned (e.g. for IT security or medicine). You'll be able to do it yourself within your own GCP instance in Vertex AI. They also released Imagen, Codey and Chirp, respectively for image generation, code generation and speech-to-text.</li><li>Bard, the conversational model—the ChatGPT equivalent—is now open to everyone (though not in all countries). Bard works great for code generation, debugging and code explainability.</li><li><strong>Bard might also be the Zero-ETL solution </strong>we were all waiting for. In the demo the speaker asks Bard to find schools in an area, then asks for the result to be saved in a Google Sheet, then asks for a new column in the sheet indicating whether the school is public or private. 
To be honest, what prevents Bard from doing the same in a database in the future?</li><li>Finally, Google teased their next-gen model Gemini—which, to hear them, will obviously be awesome—and announced an evolution of the search interface with Gen AI as a new interactive way to search.</li></ul><p>In the end I really liked the keynote because it sets a new milestone for what we can expect as integrations in the products we use daily.</p><div class="kg-card kg-button-card kg-align-center"><a href="https://www.youtube.com/watch?v=QpBTM0GO6xI&ref=blef.fr" class="kg-btn kg-btn-accent">📺 Watch the 10 mins recap (by Google)</a></div><h3 id="other-stuff">Other stuff</h3><ul><li>Hugging Face released an open model called <a href="https://huggingface.co/bigcode/starcoder?ref=blef.fr">StarCoder</a> that has been trained on GitHub code and is meant to act as a Copilot. Still, the model is not yet ready to be used as an instruction model—the ChatGPT way.</li><li>At the same time HF also introduced an <a href="https://github.com/huggingface/chat-ui?ref=blef.fr">open-source Chat UI</a>.</li><li>After Bill Gates, it's Steve Wozniak—Apple co-founder—who gives his take on the AI breakthroughs in a <a href="https://www.bbc.com/news/technology-65496150?ref=blef.fr">BBC interview</a>; mainly: we can't stop the march of progress, AI will be used to scam people, and we still have to put up guardrails—human guardrails.</li><li>Salesforce does not want to be left behind in the battle; they announced <a href="https://slack.com/blog/news/introducing-slack-gpt?ref=blef.fr">Slack GPT</a>, natively integrated in Slack to summarise or compose messages, but also a way for partners to bring new kinds of Gen AI apps.</li><li>Salesforce also gave Tableau a makeover with <a href="https://www.salesforce.com/news/stories/tableau-einstein-gpt-user-insights/?ref=blef.fr">Tableau GPT</a>, a way to provide <em>AI-powered analytics</em>. In Tableau Pulse you'll have access to auto-generated insights on your data. 
With a "For You" tab, like you were on TikTok.</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/05/image-3.png" class="kg-image" alt loading="lazy" width="2000" height="1333" srcset="https://www.blef.fr/content/images/size/w600/2023/05/image-3.png 600w, https://www.blef.fr/content/images/size/w1000/2023/05/image-3.png 1000w, https://www.blef.fr/content/images/size/w1600/2023/05/image-3.png 1600w, https://www.blef.fr/content/images/size/w2400/2023/05/image-3.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>The StarCoder (<a href="https://unsplash.com/photos/d1Wj9qU5C-o?ref=blef.fr">credits</a>)</figcaption></figure><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li><a href="https://medium.com/@hugolu87/zero-elt-could-be-the-death-of-the-modern-data-stack-cfdd56c9246d?ref=blef.fr">Zero ELT could be the death of the modern data stack</a> — Amazon launched this trend a few months ago. In the current situation we're far from killing any ELT processes, but it might come. For instance Zapier launched <a href="https://zapier.com/tables?ref=blef.fr">Zapier Tables</a>, a kind of data storage within your zaps.</li><li><a href="https://davidsj.substack.com/p/we-need-to-talk-about-excel?sd=pf&ref=blef.fr">We need to talk about Excel</a> — Let's be honest: however hard we try to kill Excel, it comes back just as strong. David shares interesting stories around Excel usage at companies that I can relate to. He finally mentions Count and Equals, two companies that build on top of tabular interfaces to do data.</li><li><a href="https://gist.github.com/sayle-doit/264d28dd990c478beb90b90ac3923681?ref=blef.fr">Determine BigQuery storage costs across an org</a> — A SQL query that I have not tried. 
Please read it twice before running it blindly.</li><li><a href="https://www.confessionsofadataguy.com/polars-laziness-and-sql-context/?ref=blef.fr" rel="bookmark">Polars, laziness and SQL context</a> — Daniel showcases the 2 features which should make you want to migrate to Polars.</li><li><a href="https://medium.com/whatnot-engineering/building-the-seller-analytics-dashboard-ccffd2a0151a?ref=blef.fr">Building the seller analytics dashboard</a> — A great example of what you should consider when building an analytics dashboard into the product, and of how to combine dbt and GraphQL APIs to build a pragmatic metrics store.</li><li><a href="https://www.theseattledataguy.com/oltp-vs-olap-what-is-the-difference/?ref=blef.fr">OLTP vs. OLAP</a> — One of the best explanations of the differences between the two. The main one resides in the data storage—one being row-oriented while the other is column-oriented—but this is not the only difference.</li><li><a href="https://medium.com/data-engineer-things/correctly-loading-incremental-data-at-scale-c656704da86d?ref=blef.fr">Correctly loading incremental data at scale</a> &amp; <a href="https://engineering.razorpay.com/real-time-denormalized-data-streaming-platform-part-3-optimisations-and-monitoring-5f7a58d9d97?ref=blef.fr">real-time denormalized data streaming platform</a>.</li><li><a href="https://towardsdatascience.com/mastering-externaltasksensor-in-apache-airflow-how-to-calculate-execution-delta-425093323758?ref=blef.fr">ExternalTaskSensor in Apache Airflow: how to calculate execution delta</a> — I've seen multiple times that the delta computation was annoying for data engineering teams. 
This article deep-dives into it well.</li><li><a href="https://engineering.linkedin.com/blog/2023/upscaling-profile-datastore-while-reducing-costs?ref=blef.fr">Upscaling LinkedIn's profile datastore while reducing costs</a> — For optimisation geeks.</li></ul><p></p><div class="kg-card kg-callout-card kg-callout-card-purple"><div class="kg-callout-emoji">👋</div><div class="kg-callout-text">The newsletter is much longer than expected—I got lost today in watching fascinating videos—so I'll be sending out a second part over the weekend or early next week with a recap of the best talks from Data Council 2023. If you want to get a head start, my favourite talk was Lloyd's demonstration of <a href="https://www.youtube.com/watch?v=zmmJgwc3oPI&ref=blef.fr">Malloy, an experimental language for data</a>.</div></div><hr><p>See you in a few days with Data Council takeaways ❤️.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ 🎙️ Episode 1 — Joe Reis ]]></title>
                    <description><![CDATA[ Episode 1 of Minds of Data. In this episode we discover who Joe Reis is and why he ended up being the awesome data creator he is now. ]]></description>
                    <link><![CDATA[ /minds-of-data/episode-1-joe-reis/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 64592bfa82a526000104eb5a ]]></guid>
                    <pubDate><![CDATA[ 2023-05-08 ]]></pubDate>
                    <content>
                        <![CDATA[ <!--kg-card-begin: html--><iframe src="https://podcasters.spotify.com/pod/show/blef/embed/episodes/Episode-1--Joe-Reis-e23mt2h" height="102px" width="400px" frameborder="0" scrolling="no"></iframe><!--kg-card-end: html--> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 23.18 ]]></title>
                    <description><![CDATA[ Data News #23.18 — Gen AI news, PayPal data contract, Prime Video stopped using microservices, Gitlab production database deletion explained, and more. ]]></description>
                    <link><![CDATA[ /data-news-week-23-18/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 6453e2eb82a526000104d2b1 ]]></guid>
                    <pubDate><![CDATA[ 2023-05-06 ]]></pubDate>
                    <content>
                        <![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/05/image-1.png" class="kg-image" alt loading="lazy" width="2000" height="1333" srcset="https://www.blef.fr/content/images/size/w600/2023/05/image-1.png 600w, https://www.blef.fr/content/images/size/w1000/2023/05/image-1.png 1000w, https://www.blef.fr/content/images/size/w1600/2023/05/image-1.png 1600w, https://www.blef.fr/content/images/size/w2400/2023/05/image-1.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>It's wedding weekend (as you'll probably read it, congrats) (<a href="https://unsplash.com/photos/ULHxWq8reao?ref=blef.fr">credits</a>)</figcaption></figure><p>Hey you, this is a Saturday edition of the Data News. I hope this email finds you well. This week you'll have less editorial content because I'm late. But you'll still find awesome articles that have been written recently.</p><p>As a reminder, on Tuesday next week I'm organising the Apache Airflow Paris <a href="https://www.meetup.com/fr-FR/paris-apache-airflow-meetup/events/292891570/?ref=blef.fr">meetup</a> that you should consider joining if you're in Paris. Also next week I'll publish my first podcast episode ever, which I've recorded with Joe Reis—the co-author of the famous Fundamentals of Data Engineering. I'm still looking for a name for the podcast; if you have ideas, shoot.</p><p></p><h1 id="gen-ai-%F0%9F%A4%96">Gen AI 🤖</h1><ul><li><a href="https://www.semianalysis.com/p/google-we-have-no-moat-and-neither?ref=blef.fr">Google "We have no moat, and neither does OpenAI"</a> — This is an internal note from a Google employee (which does not reflect Google's views) that mainly says open-source models will win over Google and OpenAI, and that a closed-source policy for models might be a mistake, especially in a world where some models leak (e.g. 
Meta ones).</li><li>If you already have access to OpenAI in Azure you can now use <a href="https://azure.microsoft.com/en-us/blog/introducing-gpt4-in-azure-openai-service/?ref=blef.fr">GPT-4</a>—still in preview only.</li></ul><p>And more <em>traditional AI</em>:</p><ul><li><a href="https://dagshub.com/blog/yolo-nas-by-deci/?ref=blef.fr">YOLO-NAS</a> a new object detection model — you have probably already seen this model, which detects people in videos in real time. This new one seems to be better than the previous one.</li><li><a href="https://www.fast.ai/posts/2023-05-03-mojo-launch.html?ref=blef.fr">Mojo, a new programming language ready for the AI</a> — Mojo is a new programming language that looks like Python but sits at a lower level, which could unlock performance gains and new heights in AI model development.</li><li><a href="https://tech.ebayinc.com/engineering/ebays-blazingly-fast-billion-scale-vector-similarity-engine/?ref=blef.fr">eBay’s blazingly fast billion-scale vector similarity engine</a>.</li></ul><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li><a href="https://github.com/paypal/data-contract-template?ref=blef.fr">Paypal, template for data contract</a> — PayPal is implementing a Data Mesh and they shared in the open all their thinking on data contracts. In the GitHub repo they share a YAML template describing what's in the contract. This is insanely exhaustive. </li><li><a href="https://world.hey.com/dhh/even-amazon-can-t-make-sense-of-serverless-or-microservices-59625580?ref=blef.fr">Even Amazon can't make sense of serverless or microservices</a> — Prime Video's tech team wrote an article that could be summarised as: <em>we migrated from a functions-based approach to a monolith in a VM</em>. The internet found this ironic. 
By doing this they reduced costs by 90%.</li><li><a href="https://medium.com/walmartglobaltech/lakehouse-at-fortune-1-scale-480bcb10391b?ref=blef.fr">Lakehouse at Walmart</a> — Samuel from Walmart describes the research they did and why they picked Hudi over Delta to implement a Lakehouse architecture. As a reminder, the Lakehouse is the merger of the datalake and the data warehouse—mainly a way to add a SQL-friendly processing engine with ACID transactions on top of a datalake.</li><li><a href="https://engineering.grab.com/safer-flink-deployments?ref=blef.fr">Safer deployment of streaming applications</a> — This is how Grab deploys Flink applications.</li><li><a href="https://www.estuary.dev/debezium-alternatives/?ref=blef.fr#the-challenges-with-debezium">Why you should reconsider Debezium: challenges and alternatives</a> — Warning: this article has been written by a CDC solution vendor, but it is still relevant because it shows the reality of managing Debezium.</li><li><a href="https://gtm-gear.com/posts/dataform-cloud-functions/?ref=blef.fr">Dataform: schedule daily updates using Cloud Functions</a> — Dataform is a solution Google bought a few years ago; it is a dbt alternative, but for BigQuery. This article gives a great overview of the product. To be honest it looks a bit hacky.</li><li>📺 <a href="https://www.youtube.com/watch?v=tLdRBsuvVKc&ref=blef.fr">Dev Deletes Entire Production Database, Chaos Ensues</a> — If you want a greatly told story you should watch it; this is a YouTube video explaining how GitLab deleted its production database and how they fixed it. 
It reminds me of my own <a href="https://www.blef.fr/data-deleted-from-production/">horror story</a> of deleting the whole <code>/data</code> folder in HDFS.</li><li><a href="https://www.infoworld.com/article/3695210/how-oracle-is-taking-on-aws-snowflake-with-autonomous-data-warehouse-updates.html?ref=blef.fr">Oracle is taking on Snowflake</a> — I often say that Snowflake will become the new Oracle. It's fun to see that Oracle is still trying to catch up. They came up with a lot of news: they will implement the Delta Sharing protocol, lower storage pricing from $118/TB to $25, partner with AWS and propose a low-code data integration tool.</li><li>Data modeling, again — Simon published the second part of his <a href="https://airbyte.com/blog/data-modeling-unsung-hero-data-engineering-approaches-and-techniques?ref=blef.fr">data modeling guide</a>; this time he covered the different techniques you can use when modeling data: dimensional, vault, anchor and more. You might also want to see <a href="https://www.thoughtspot.com/data-trends/data-modeling/conceptual-data-model-examples?ref=blef.fr">practical examples</a> of data modeling—Sonny wrote a nice article using a hotel business as an example.</li><li><a href="https://marcstone.substack.com/p/crafting-your-data-team?sd=pf&ref=blef.fr">Crafting your data team</a> — Practical tips on how to get started with your data team in a new startup. In the post Marc gives you the qualities you should look for and which hires you should prioritise first.</li><li>🎮 The CS:GO Liquid team announced a <a href="https://twitter.com/TeamLiquidCS/status/1653869015033577474?ref=blef.fr">new data analyst</a>. DeMars previously worked on a <a href="https://twitter.com/DeMarsDeRover/status/1499401407845179399?ref=blef.fr">predictive analytics approach</a> to Valorant, trying to predict who would win a round in different situations. 
It's fun to see our beloved data analyst position reaching other fields.</li><li>Data projects on personal data — Petrica <a href="https://betterprogramming.pub/from-traffic-to-revenue-a-deep-dive-into-my-medium-data-98ac5b405605?ref=blef.fr">dives into her Medium data</a> with DuckDB and Plotly, and Stefen <a href="https://medium.com/@stefentaime_10958/uber-project-analyzing-personal-uber-and-uber-eats-expenses-with-elt-data-pipeline-using-dbt-91ead4aea5df?ref=blef.fr">analysed his Uber spending</a> with dbt and Postgres. As a reminder, doing personal data projects is still the best way to learn about technical stuff.</li></ul><p></p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/05/image.png" class="kg-image" alt loading="lazy" width="2000" height="1334" srcset="https://www.blef.fr/content/images/size/w600/2023/05/image.png 600w, https://www.blef.fr/content/images/size/w1000/2023/05/image.png 1000w, https://www.blef.fr/content/images/size/w1600/2023/05/image.png 1600w, https://www.blef.fr/content/images/size/w2400/2023/05/image.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>French mistral taking on OpenAI (<a href="https://unsplash.com/photos/WtwSsqwYlA0?ref=blef.fr">credits</a>)</figcaption></figure><h1 id="data-economy-%F0%9F%92%B0">Data Economy 💰</h1><ul><li><strong>AuraML</strong>, an Indian-based company, <a href="https://www.indianweb2.com/2023/05/auraml-synthetic-image-data-platform.html?ref=blef.fr">raises $230k in a pre-seed round</a>. AuraML is a 3D synthetic data company; their engine is capable of generating realistic-looking 3D environments you might want to use in other models.</li><li><strong>Mistral AI</strong>, a French Gen AI company, <a href="https://www.bfmtv.com/economie/entreprises/mistral-ai-start-up-francaise-d-intelligence-artificielle-prepare-une-grosse-levee-de-fonds_AD-202305050805.html?ref=blef.fr">will probably raise €100m</a> (link in French) in the following weeks. 
It looks like, at the moment, the company has only hired a few French people who previously worked on LLaMA at Meta or at Alphabet's DeepMind. The goal of the company is to provide the first French—hence European—alternative to OpenAI. Obviously this is heavily political and strategic for Europe, so we will follow it in the coming weeks.</li><li>Anaconda is expanding and <a href="https://www.anaconda.com/press/anaconda-acquires-edublocks-to-empower-k-12-data-literacy-and-expand-educational-offerings?ref=blef.fr">buying EduBlocks.</a> EduBlocks is a Scratch-like platform for writing Python or HTML code. This is a cool thing for continuing code democratisation.</li><li><em>Open-source done differently</em>. Sequoia—a VC—<a href="https://www.sequoiacap.com/article/sequoia-open-source-fellowship/?ref=blef.fr">will support Sebastián Ramírez</a> with an open-source fellowship. Sebastián is the creator of FastAPI, SQLModel and Typer. There isn't much more detail in the press release, but this is awesome to see.</li></ul><hr><p>See you next week ❤️.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 23.17 ]]></title>
                    <description><![CDATA[ Data News #23.17 — what happened to the Semantic Layer, OpenAI demo that feels like 2007 iPhone and the fast news. ]]></description>
                    <link><![CDATA[ /data-news-week-23-17/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 64490643745333003dbdd615 ]]></guid>
                    <pubDate><![CDATA[ 2023-04-28 ]]></pubDate>
                    <content>
                        <![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/04/image-5.png" class="kg-image" alt loading="lazy" width="2000" height="1500" srcset="https://www.blef.fr/content/images/size/w600/2023/04/image-5.png 600w, https://www.blef.fr/content/images/size/w1000/2023/04/image-5.png 1000w, https://www.blef.fr/content/images/size/w1600/2023/04/image-5.png 1600w, https://www.blef.fr/content/images/size/w2400/2023/04/image-5.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>Berlin (<a href="https://unsplash.com/photos/TK5I5L5JGxY?ref=blef.fr">credits</a>)</figcaption></figure><p>Hey you, new edition of the newsletter. This week summer weather arrived in Berlin and it was awesome. I managed to move forward with my client projects this week, which also feels like a relief. So I'm pretty happy: sun and great projects 🙂.</p><p>Regarding the content, if you are in Paris on May 9th, we are organising the <a href="https://www.meetup.com/fr-FR/paris-apache-airflow-meetup/events/292891570/?ref=blef.fr">Paris Airflow Meetup</a> in Algolia's offices; it will be in English, so you have no excuse not to come. Also, I'll be in Paris a lot in May, so if you want to have a 🍜 / 🍺 ping me.</p><p></p><h1 id="what-happened-to-the-semantic-layer-%F0%9F%AB%A0">What happened to the Semantic Layer? 🫠</h1><p>This week dbt Labs disclosed their vision for the semantic layer and especially <a href="https://www.getdbt.com/blog/dbt-semantic-layer-whats-next/?ref=blef.fr">what they want to do with the Transform acquisition</a>. This is mainly a roadmap for the MetricFlow integration within the dbt ecosystem. At the moment we have the dbt Semantic Layer, which corresponds to YAML definitions, and MetricFlow—which was Transform's open-source project—which is able to understand the semantics to generate SQL.</p><p>A lot of changes will happen to MetricFlow incl. 
breaking changes:</p><ul><li>the dbt metrics spec will change—in its current state not a lot of people were actually using it—the dbt_metrics package will be deprecated, and they will probably merge the dbt and MetricFlow syntaxes to define semantics and metrics</li><li><em>"The core MetricFlow package will become a stand-alone library for processing metric queries, generating a query plan, and rendering SQL against a target dialect." (cf. <a href="https://github.com/dbt-labs/metricflow/discussions/478?ref=blef.fr">Github discussion</a>)</em></li><li>The license will change to BSL.</li><li>The serving part of the system, aka the metrics store, will be the paid service of dbt Labs and part of the dbt Cloud offering. It means that you will define metrics and dimensions in YAML and then plug all your tools into dbt Cloud; it seems there isn't any open-source solution to do the serving—at least from dbt Labs' side. And with the license change on MetricFlow, dbt Labs is protecting itself against someone using MetricFlow's generation to propose such a paid service.</li><li>More is described in the <a href="https://github.com/dbt-labs/metricflow/discussions/478?ref=blef.fr">Github discussion</a>.</li></ul><p>To add more spice to this, Carlin <a href="https://carlineng.com/?postid=semantic-layer&ref=blef.fr#blog">wrote about what happened to the Semantic Layer</a>. Carlin works at Google on the Malloy team (Google's semantic layer, to put it quickly—tbh it's probably more) and he gives his views along with a small retrospective on semantic layers.</p><p></p><h1 id="gen-ai-%F0%9F%A4%96">Gen AI 🤖</h1><ul><li><a href="https://doordash.engineering/2023/04/26/doordash-identifies-five-big-areas-for-using-generative-ai/?ref=blef.fr">DoorDash identifies Five big areas for using Generative AI</a> — Doordash is a food delivery platform and they shared how they imagine Generative AI could help them in the future. Either by assisting humans, be it customers (cart building, etc.) 
or employees (SQL writing or document drafting); or by improving actual AI stuff: search, discovery, information extraction. </li><li>When it comes to SQL writing the field is on fire; a lot of companies are trying to raise from the dead the Slack chatbots answering insight questions. I think of <a href="https://shape.xyz/?ref=blef.fr">Shape</a> (YCombinator, out of stealth this week) and <a href="https://www.linkedin.com/company/delphilabs/?ref=blef.fr">Delphi Labs</a> or <a href="https://preset.io/blog/introducing-promptimize/?ref=blef.fr">Promptimize</a>. Promptimize is a toolkit to evaluate and test prompts; for instance you can "unit test" your natural-language-to-SQL prompts with it—it has been open-sourced by Maxime Beauchemin (Airflow and Superset creator).</li><li><a href="https://blog.google/technology/ai/code-with-bard/?ref=blef.fr">Bard now helps you code</a> — Google is finally going the Copilot way and proposes an alternative with Bard. Bard can now help you write code or Google Sheets functions, but it can do more by explaining or debugging code for you.</li><li>📺 <a href="https://www.youtube.com/watch?v=C_78DM8fG6E&ref=blef.fr">The Inside Story of ChatGPT’s astonishing potential</a> — A TED talk from OpenAI's President and co-founder sharing his vision, the potential and the limits of the technology. In the video you can feel Steve Jobs's 2007 <a href="https://www.youtube.com/watch?v=x7qPAY9JqE4&ref=blef.fr">iPhone keynote</a> vibes. The video also greatly showcases ChatGPT plugins. I highly recommend watching it.</li></ul><p>Last but not least, a more "traditional" AI category:</p><ul><li><a href="https://artificialcorner.com/end-to-end-machine-learning-modelling-in-bigquery-google-cloud-a0d9e7eca20b?ref=blef.fr">End-to-end ML modeling in BigQuery</a> — Over the last few years BigQuery has added a lot of ML capabilities to the engine. 
This post showcases a lot of it (it uses an XGBoost model).</li><li><a href="https://eng.lyft.com/building-a-large-scale-unsupervised-model-anomaly-detection-system-part-2-3690f4c37c5b?ref=blef.fr">Building a large scale unsupervised model anomaly detection system (part 2)</a>.<br></li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/04/image-6.png" class="kg-image" alt loading="lazy" width="2000" height="1333" srcset="https://www.blef.fr/content/images/size/w600/2023/04/image-6.png 600w, https://www.blef.fr/content/images/size/w1000/2023/04/image-6.png 1000w, https://www.blef.fr/content/images/size/w1600/2023/04/image-6.png 1600w, https://www.blef.fr/content/images/size/w2400/2023/04/image-6.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>AGI (<a href="https://unsplash.com/photos/w2DsS-ZAP4U?ref=blef.fr">credits</a>)</figcaption></figure><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li><a href="https://medium.com/wttj-tech/from-postgresql-to-snowflake-a-data-migration-story-5fd17f778019?ref=blef.fr">From PostgreSQL to Snowflake: A data migration story</a> — The migration lasted 9 months and included 8 steps. They went on this journey because in 2021 Postgres was already hitting read performance limits, degrading the downstream user experience in the BI tools. As Katia shares in the article, a 9-month migration is a long tunnel where you encounter a lot of roadblocks and frustration, but in the end everyone feels the difference: a 10x performance gain—at least—on dashboard execution time.</li><li><a href="https://medium.com/checkout-com-techblog/building-dbt-ci-cd-at-scale-365358f64b6f?ref=blef.fr">Building dbt CI/CD at scale</a> — Every week there's a new great article about someone else's dbt setup where you discover things. This time Damian shares how he designed checkout.com's CI/CD pipelines—in GitHub. 
In a nutshell, they get the actual production manifest, run a SQL linter, validate model changes (by detecting the altered models and running them) and deploy to Airflow.</li><li><a href="https://mattpalmer.io/posts/making-the-most-of-airflow/?ref=blef.fr">Making the Most of Airflow</a> — I already shared Matt's article last week and this week he continues with an awesome article about Airflow. In the article he gives a great overview of Airflow's main concepts: DAGs and the TaskFlow API (I also wrote something about <a href="https://www.blef.fr/airflow-dynamic-dags/">dynamic DAGs</a> last year), DRY and how not to redevelop stuff, and how to test.</li><li><a href="https://docs.getdbt.com/blog/kimball-dimensional-model?ref=blef.fr">Building a Kimball dimensional model with dbt</a> — Jonathan from Canva wrote a large article about dimensional modeling and how to do it with dbt. This is a 7-part tutorial that shows you how to create fact and dimension tables.</li><li><a href="https://moderndataengineering.substack.com/p/data-engineering-design-principles?ref=blef.fr">Data engineering design principles you should follow</a> — It mainly deals with software engineering principles like SOLID. Idempotence and determinism are missing from the article; if you want to go deeper you can read the most important article on this topic: <a href="https://maximebeauchemin.medium.com/functional-data-engineering-a-modern-paradigm-for-batch-data-processing-2327ec32c42a?ref=blef.fr">functional data engineering</a>.</li><li><a href="https://engineering.razorpay.com/real-time-denormalized-data-streaming-platform-part-1-9f3c730dd9c6?ref=blef.fr">Real-time denormalized data streaming platform part 1</a> and <a href="https://engineering.razorpay.com/real-time-denormalized-data-streaming-platform-part-2-97dfff40fd8d?ref=blef.fr">part 2</a> — Razorpay's data team describes how and why they needed to move their ETL process from daily to near real-time. 
Technologically moving from Airflow batches to Spark running on top of Kafka.</li><li><a href="https://medium.pimpaudben.fr/toward-declarative-data-orchestration-with-kestra-3b17264fbaab?ref=blef.fr">Toward declarative data orchestration with Kestra</a> — A few weeks ago in the Airflow alternatives meetup we organised, we invited Kestra. A YAML-based orchestrator written on the JVM. Recently Benoit joined Kestra as their PO. In this article he shares his vision. It's mainly a question of vocabulary and reach, Kestra believes that with their own declarative YAML syntax they can offer data pipelines to the masses. YAML is enough simple for your analysts (they already do dbt) or business to write their own pipelines.</li><li><a href="https://medium.pimpaudben.fr/toward-declarative-data-orchestration-with-kestra-3b17264fbaab?ref=blef.fr"><a href="https://atlasgo.io/blog/2023/04/21/terraform-v050?ref=blef.fr">Manage database schemas with Terraform in plain SQL</a></a> — Atlas is an open-source schema management tool. The post showcases the atlas provider in Terraform that allows you to write SQL to manage your database in Terraform. I can't wait to see dbt reimplemented in Terraform.</li><li><a href="https://tobikodata.com/automatically-detecting-breaking-changes-in-sql-queries.html?ref=blef.fr">Automatically detecting breaking changes in SQL queries</a> — When you alter a SQL query you can either do a breaking or a non-breaking change. What if with SQLglot you could detect a breaking change before it happens in production?</li></ul><hr><p>See you next week ❤️</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 23.16 ]]></title>
                    <description><![CDATA[ Data News #23.16 — Analytics engineering future, a new Airflow meetup, data engineering at Adyen and Meta, dbterra and more. ]]></description>
                    <link><![CDATA[ /data-news-week-23-16/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 64419674745333003dbdbdc7 ]]></guid>
                    <pubDate><![CDATA[ 2023-04-21 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/04/image-4.png" class="kg-image" alt loading="lazy" width="2000" height="1125" srcset="https://www.blef.fr/content/images/size/w600/2023/04/image-4.png 600w, https://www.blef.fr/content/images/size/w1000/2023/04/image-4.png 1000w, https://www.blef.fr/content/images/size/w1600/2023/04/image-4.png 1600w, https://www.blef.fr/content/images/size/w2400/2023/04/image-4.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>If this picture had been generated with AI it would have been boring (<a href="https://unsplash.com/photos/U6WvLJU0l6o?ref=blef.fr">credits</a>)</figcaption></figure><p>Dear readers, I hope you're doing good. We are close to the second anniversary of the newsletter. Which is crazy. Retrospectively it means that I've written 900 words on average every week for the last 102 weeks. When you look at the <a href="https://www.blef.fr/news-week-2021-18/">first edition</a> we've come a long way—lmao.</p><p>We announced this week the <a href="https://www.meetup.com/fr-FR/paris-apache-airflow-meetup/events/292891570/?ref=blef.fr">May Paris Apache Airflow meetup</a>. It will take place in Algolia's offices on the 9th of May. We will have 3 speakers and for the first time all the presentations will be held in English. So if you're in Paris or in France do not hesitate to register.</p><h1 id="analytics-engineering-future">Analytics engineering future</h1><p>This week Tristan Handy—dbt Labs CEO—wrote a post about the future of analytics engineering: <a href="https://www.getdbt.com/blog/analytics-engineering-next-step-forwards/?ref=blef.fr">The next big step forwards for analytics engineering</a>. As an introduction Tristan restates the original vision of dbt, which became mainstream today. 
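That original vision—SQL managed with software engineering practices—boils down to models like the following minimal sketch (the model, source, and column names here are invented for illustration, not taken from any real project):

```sql
-- models/staging/stg_orders.sql — a minimal dbt model (hypothetical names).
-- dbt resolves source()/ref() calls into fully qualified table names and
-- derives the dependency graph (the DAG) from them.
select
    order_id,
    customer_id,
    ordered_at
from {{ source('shop', 'raw_orders') }}
where ordered_at is not null
```

Because dependencies are declared through these macros rather than hard-coded table names, dbt can build models in the right order, test them, and document the lineage.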
A lot of data teams have embraced dbt, or at least SQL with engineering practices, to transform data in cloud data warehouses.</p><p>The content of the post is more about the future and the vision of the next big thing in analytics engineering: new model capabilities. In dbt Core 1.5 we will be able to define:</p><ul><li><strong>Contracts</strong> — you will be able to define column types and constraints and ask dbt to enforce them. If a model does not respect its contract it will not build. In dbt vocabulary <a href="https://docs.getdbt.com/reference/commands/build?ref=blef.fr">build</a> means run + other things.</li><li><strong>Access</strong> — you will be able to namespace models with groups and visibility. Model visibility will be either private, protected or public. This is a preamble to cross-project dependencies, I guess.</li><li><strong>Versions</strong> — you will be able to define versions for models without breaking the downstream consumers. In order to do it you will have multiple SQL files suffixed with the version—<code>_v&lt;version&gt;</code>. To select a specific version you will have to do <code>{{ ref('model_name', version=1) }}</code>.</li></ul><p>I think these improvements are really important to bring analytics engineering to the next level: these new capabilities bring software engineering practices to data asset management. If we add to this the semantic layer news (through <a href="https://thdpth.substack.com/p/why-dbt-labs-acquired-transform?ref=blef.fr">dbt Labs' acquisition of Transform</a>) we are going in the right direction.</p><p></p><h1 id="gen-ai-%F0%9F%A4%96">Gen AI 🤖</h1><ul><li>If you want to understand LLMs, there is a note written by an expert office of the French government. 
You can read it in <a href="https://www.peren.gouv.fr/rapports/2023-04-06_Eclairage%20sur_CHATGPT_FR.pdf?ref=blef.fr">French</a> or in <a href="https://www.peren.gouv.fr/rapports/2023-04-06_Eclairage_sur_CHATGPT_EN.pdf?ref=blef.fr">English</a>. To be honest this is a high-quality note that you can share with people who want to understand all the AI concepts. It might still be a bit too technical to share with your parents.</li><li><a href="https://www.chatclimate.ai/?ref=blef.fr">ChatClimate</a> — This is a chat trained on the latest IPCC report (the GIEC for the French audience). It showcases well the search capabilities of ChatGPT-based systems because every answer is completed with references to the report's chapters.</li><li><a href="https://blog.replit.com/llm-training?ref=blef.fr">How to train your own Large Language Models</a> — Now that you've tried the previous chat, let's say you want to run your own LLM. The Replit team wrote a great overview of what you have to do.</li><li><a href="https://medium.engineering/building-a-chatgpt-plugin-for-medium-6813b59e4b24?ref=blef.fr">Building a ChatGPT Plugin for Medium</a>.</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/04/Screenshot-2023-04-21-at-14.12.24.png" class="kg-image" alt loading="lazy" width="2000" height="670" srcset="https://www.blef.fr/content/images/size/w600/2023/04/Screenshot-2023-04-21-at-14.12.24.png 600w, https://www.blef.fr/content/images/size/w1000/2023/04/Screenshot-2023-04-21-at-14.12.24.png 1000w, https://www.blef.fr/content/images/size/w1600/2023/04/Screenshot-2023-04-21-at-14.12.24.png 1600w, https://www.blef.fr/content/images/size/w2400/2023/04/Screenshot-2023-04-21-at-14.12.24.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>ChatClimate answer to the most important question.</figcaption></figure><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li><a 
href="https://tech.instacart.com/building-a-flink-self-serve-platform-on-kubernetes-at-scale-c11ef19aef10?ref=blef.fr">Building a Flink self-serve platform on Kubernetes at scale</a> — Instacart's engineering team migrated from Flink on EMR to Flink on Kubernetes. This article gives you an overview of the Kubernetes platform they implemented.</li><li><a href="https://github.com/fal-ai/isolate?ref=blef.fr">fal-ai/isolate</a> — Yet another package manager in Python. fal developed a new lightweight package manager to isolate environments at the function level. The project README is not really explicit yet.</li><li><a href="https://adyen.medium.com/data-engineering-at-adyen-ccded12a6eb?ref=blef.fr">Data Engineering at Adyen</a> — "Data engineers at Adyen are responsible for creating high-quality, scalable, reusable and insightful datasets out of large volumes of raw data". This is a good definition of one of the possible responsibilities of DE. This is a great article and they even included a flowchart to identify which role will suit you the most. It is interesting to read this post jointly with <a href="https://medium.com/@AnalyticsAtMeta/the-future-of-the-data-engineer-part-i-32bd125465be?ref=blef.fr">the future of the data engineer at Meta</a>, which gives another, very business-oriented, perspective.</li><li><a href="https://engineering.instawork.com/announcing-dbterra-feca4fb398a5?ref=blef.fr">Announcing dbterra: easily sync your jobs with dbt Cloud™️</a> — Eric developed a tool called dbterra that mixes dbt and Terraform in order to deploy open-source dbt projects to dbt Cloud with configuration as code.</li><li><a href="https://medium.com/@corymaklin/test-driven-development-for-sql-539ed30164ed?ref=blef.fr">Test Driven Development for SQL</a> — A small article that gives you a vanilla BigQuery framework using CTEs to write unit tests. 
I think it has to be improved but it gives a great boilerplate.</li><li><a href="https://medium.com/@alexroperez4/saving-with-bigquery-dbt-35937b1cf628?ref=blef.fr">Saving 💵 With BigQuery &amp; dbt</a> — A few tips to save money when using dbt and BigQuery. Mainly it says that you should consider switching your models to incremental.</li></ul><p></p><h1 id="data-economy-%F0%9F%92%B0">Data Economy 💰</h1><ul><li><strong><a href="https://www.betterdata.ai/?ref=blef.fr">Betterdata</a></strong> <a href="https://techcrunch.com/2023/04/20/betterdata/?ref=blef.fr">raises $1.65m seed round</a>. A Singaporean company that provides a tool that generates synthetic data. Synthetic data is AI-generated data. In Betterdata's case you can use your own datasets and generate data that keeps all the statistical properties needed to do machine learning. This way you can work on data that is similar to yours but different. It's a technique to work with anonymised data.</li><li><strong><a href="https://www.coredb.io/?ref=blef.fr">CoreDB</a></strong> <a href="https://www.coredb.io/blog/introducing-coredb?ref=blef.fr">raises $6.5m seed round</a>. CoreDB is a managed Postgres service that puts the emphasis on extensions in order to add more capabilities to your database cluster. CoreDB has been funded by the ex-CEO-CTO of Astronomer.</li><li>Sadly, a lot of companies recently announced layoffs. The biggest one being Meta, with a new round of 4k, bringing the total to 21,000 people laid off since last November. Astronomer has also let 100 people go recently; if you rely heavily on Airflow it might be interesting to reach out to them.</li><li>Elon Musk, according to reports, founded a <a href="https://www.theverge.com/2023/4/14/23684005/elon-musk-new-ai-company-x?ref=blef.fr">new AI company</a> called X.AI Corp.</li></ul><hr><p>See you next week ❤️.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 23.15 ]]></title>
                    <description><![CDATA[ Data News #23.15 — Yann le Cun interview, hot takes on the modern data stack, costs saving and metrics layer. ]]></description>
                    <link><![CDATA[ /data-news-week-23-15/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 643851f8a39805003d922cf6 ]]></guid>
                    <pubDate><![CDATA[ 2023-04-14 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/04/image-2.png" class="kg-image" alt loading="lazy" width="2000" height="1333" srcset="https://www.blef.fr/content/images/size/w600/2023/04/image-2.png 600w, https://www.blef.fr/content/images/size/w1000/2023/04/image-2.png 1000w, https://www.blef.fr/content/images/size/w1600/2023/04/image-2.png 1600w, https://www.blef.fr/content/images/size/w2400/2023/04/image-2.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>The only AI I'm eager to see (<a href="https://unsplash.com/photos/HBGYvOKXu8A?ref=blef.fr">credits</a>)</figcaption></figure><p>Hey you, the newsletter might be late today again, but this time it is not my fault. The Ghost editor was down when I wanted to write. Anyway, here is the weekly Data News, written faster than usual.</p><h1 id="ai-news-%F0%9F%A4%96">AI News 🤖</h1><p>Yann le Cun gave a 10-minute interview on a major French radio station. If you want to read the French transcript you can do it <a href="https://www.radiofrance.fr/franceinter/yann-le-cun-la-technologie-cree-de-nouveaux-metiers-en-supprime-d-autres-reconnait-l-un-des-peres-des-ia-5596389?ref=blef.fr">here</a>. Mainly what he says:</p><ul><li>There is no doubt that one day there will be machines at least as intelligent as humans. But ChatGPT isn't: it gives the impression, but it is not.</li><li>AI can amplify human intelligence like machines amplify human strength.</li><li>Technology shifts jobs. For instance, before the industrial revolution the major part of the French population was working in the fields; now it's less than 2%. It means we shouldn't be afraid of technology replacing jobs. He thinks this will also allow more people to be creative.</li><li>Regarding fake news and ethics he draws a comparison with e-mails. 
He thinks that just as we developed spam filters to catch fake mails, we will develop the same to catch fake news.</li><li>For Yann there is nothing revolutionary about ChatGPT, but he admits it's good engineering. It is just a normal evolution of deep learning systems.</li><li>(<em>Last one because it's funny</em>). He bets that in 10-15 years (or more) we will not have smartphones anymore but augmented reality glasses. We will also use voice to interact with machines, so we can interact with them hands in pockets—I can't wait to use Siri and Alexa 2.0.</li></ul><p>As a side project, if you want to practice machine learning this weekend you can replicate Rihab's project: <a href="https://rihab-feki.medium.com/ml-project-using-yolov8-roboflow-dvc-and-mlflow-on-dagshub-3e5c0b026297?ref=blef.fr">detect wildfire smoke with a YOLOv8 model</a>.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/04/image-3.png" class="kg-image" alt loading="lazy" width="2000" height="1333" srcset="https://www.blef.fr/content/images/size/w600/2023/04/image-3.png 600w, https://www.blef.fr/content/images/size/w1000/2023/04/image-3.png 1000w, https://www.blef.fr/content/images/size/w1600/2023/04/image-3.png 1600w, https://www.blef.fr/content/images/size/w2400/2023/04/image-3.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>:/ (<a href="https://unsplash.com/photos/_AwSiaesk40?ref=blef.fr">credits</a>)</figcaption></figure><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li><a href="https://medium.com/@jeremysrgt/airbyte-configuration-as-code-with-octavia-cli-dccd2046b764?ref=blef.fr">A tour of Airbyte’s Octavia CLI</a> — Airbyte, an open-source extract-load platform, released a CLI called Octavia a few months ago that lets you create integration pipelines. 
Jeremy wrote a post that showcases how to do it.</li><li><a href="https://mattpalmer.io/posts/hot-takes/?ref=blef.fr">Hot takes on the Modern Data Stack</a> — Matt gives 5 hot takes about the MDS. I don’t totally agree with everything but this is a good read. He says that Redshift is no longer competing in the warehousing space, which I agree with. He also says that Airflow is obsolete; I disagree. It has become common recently to say bad things about Airflow, but as always the issue is between the chair and the keyboard. He is also hard on Airbyte and dbt.</li><li><a href="https://benn.substack.com/p/the-new-philosophers?ref=blef.fr">The new philosophers</a> — It's been a long time since I've shared Benn's posts. Still my favorites. Saying smart things, week after week. This time he writes about the new marketing approach of the modern data stack ecosystem. Plenty of tools, so let's develop new tools to avoid the other tools. He also adds his views about the ChatGPT disruption: "<em>We'll initially try to insert LLMs into the game we're currently playing [...]. Our data models won’t be augmented by LLMs; they’ll be built for LLMs</em>". Probably no-one knows, yet, what it means.</li></ul><div class="kg-card kg-callout-card kg-callout-card-grey"><div class="kg-callout-emoji">💡</div><div class="kg-callout-text">In a presentation I made this week I wrote "<strong>Gen AI + Semantic Layer = self-service ?"</strong>. I think it sums up very well where we are today. But as Robert says "No tool can fix people, behaviors, process, and the semantic layer, however conceptually elegant or impactful, is no exception" (<a href="https://win.hyperquery.ai/p/the-semantic-layer-and-the-self-service?ref=blef.fr">read here</a>).</div></div><ul><li><a href="https://www.castordoc.com/blog/now-live-castor-ai?ref=blef.fr">Castor announced Castor AI</a> — Going the other way, Castor released a feature that explains a SQL query in natural language. 
This is a good way to help business users understand what's happening in the transformation layer.</li><li><a href="https://medium.com/teads-engineering/how-we-made-our-reporting-engine-17x-faster-652b9e316ca4?ref=blef.fr">How we made our reporting engine 17x faster</a> — Teads' engineering team explains how they significantly sped up their ads report generation. In a nutshell they replaced Spark (EMR) in-memory transformations with BigQuery.</li><li><a href="https://engineering.atspotify.com/2023/04/large-scale-generation-of-ml-podcast-previews-at-spotify-with-google-dataflow/?ref=blef.fr">Large-Scale generation of ML podcast previews at Spotify with Google Dataflow</a> — Generating previews at scale has become a common challenge for vast content platforms. This time Spotify explains how they did it with Apache Beam. As an input they take audio and transcript data and they generate podcast previews that will appear in your feed.</li><li><a href="https://eng.lyft.com/big-savings-on-big-data-9c74b7a35326?ref=blef.fr">Big savings on Big Data</a> — This is the current trend: with the current economic situation we have to do more with less (or at least with what we have). At Lyft they optimised their ML platform to save time and money on workloads. In particular they lowered all the dev costs.</li><li><a href="https://www.lastweekinaws.com/blog/localstack-why-local-development-for-cloud-workloads-makes-sense/?ref=blef.fr">LocalStack: Why local development for cloud workloads makes sense</a> — It ties in with the previous bullet point. This time Corey writes about LocalStack, a tool that emulates AWS APIs locally. Emulation could be the future, mainly because it avoids increasing cloud costs for development.</li><li><a href="https://towardsdatascience.com/using-duckdb-with-polars-e15a865e48a3?ref=blef.fr">Using DuckDB with Polars</a> — A nice showcase of the 2 new kids on the block working together. 
Mainly, what you will do is query Polars dataframes in SQL with DuckDB.</li><li><a href="https://doordash.engineering/2023/04/12/using-metrics-layer-to-standardize-and-scale-experimentation-at-doordash/?ref=blef.fr">Using Metrics Layer to standardize and scale experimentation at DoorDash</a> — A very good exhaustive article about a metrics layer. At DoorDash a lot of teams are doing experimentation and they were in need of a common ground for metric definitions. That’s why they built this system. Mainly they define measures, dimensions and metrics in YAML that will be materialised and made accessible to Curie (their experimentation platform).</li></ul><p></p><h1 id="data-economy-%F0%9F%92%B0">Data Economy 💰</h1><ul><li><strong>Cybersyn</strong> <a href="https://www.cybersyn.com/blog-series-a/?ref=blef.fr">raises $62.9m Series A</a>. Cybersyn is a data-as-a-service platform that provides public datasets for everyone. You can see it as a datasets marketplace of common public data. They are heavily supported by Snowflake so the datasets are accessible in the Snowflake marketplace. For instance you can freely query the <a href="https://app.snowflake.com/marketplace/listing/GZTSZAS2KEE/cybersyn-inc-us-addresses?ref=blef.fr">US Addresses</a> dataset to get all the addresses in a zipcode.</li><li><strong>Rupert</strong> <a href="https://blog.hirupert.com/meet-rupert-delivering-business-outcomes-from-your-analytics/?ref=homepage">raises $8m in funding</a>. Rupert wants to fill the gap between the data analyst and the business users by providing a no-code UI to create data alerts on top of your semantic layer.</li></ul> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 23.14 ]]></title>
                    <description><![CDATA[ Data News #23.14 — Data modeling guide, entity-centric modeling, SQLMesh, GenAI: Italy bans, Samsung leak, Vicuna open-source model, reducing the lottery factor. ]]></description>
                    <link><![CDATA[ /data-news-week-23-14/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 642e6b08758865003d6bd146 ]]></guid>
                    <pubDate><![CDATA[ 2023-04-08 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/04/image.png" class="kg-image" alt loading="lazy" width="800" height="533" srcset="https://www.blef.fr/content/images/size/w600/2023/04/image.png 600w, https://www.blef.fr/content/images/2023/04/image.png 800w" sizes="(min-width: 720px) 720px"><figcaption>Data News entering town (<a href="https://unsplash.com/photos/2v2Mbo6ibrw?ref=blef.fr">credits</a>)</figcaption></figure><p>Hey you, if I wasn't late in my newsletter writing it wouldn't be me. But here is your usual Data News. The main reason behind this delay is that I played with LLMs yesterday. I tried to run open-source models locally on my own laptop. There are still a few bugs and the results are not really at OpenAI level but it is fun to do.</p><p>This Tuesday we hosted the second part of the Airflow alternatives meetup with Prefect and Dagster. You can find the replay on <a href="https://www.youtube.com/watch?v=2f7KJcFbUs0&ref=blef.fr">YouTube</a>.</p><p></p><h1 id="data-modeling">Data modeling</h1><p>Dear readers, I have to confess something. I did not care about data modeling for years. I mean, in the sense everyone understands it today: for 7 professional years I never built a star schema or anything similar. I was in the Hadoop world and all I was doing was denormalisation. Denormalisation everywhere. The only normalisation I did was back at engineering school while learning SQL with <a href="https://en.wikipedia.org/wiki/Database_normalization?ref=blef.fr#Normal_forms">Normal Forms</a>.</p><p>Actually what I cared about was physical storage, data formats, logical partitioning or indexing. </p><p>But actually it's normal: my role was not to translate the business into tables. I still firmly believe that this is not the role of a data engineer. A data engineer should still be a software engineer working with data, empowering others with tooling and apps. 
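The denormalisation I'm talking about can be sketched in a few lines; here is a toy example using Python's built-in sqlite3 (the tables and columns are invented for illustration, not any production schema):

```python
import sqlite3

# Toy example: a normalised pair of tables, then the denormalised
# ("pre-joined") table a Hadoop-era pipeline would typically materialise.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, country TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY,
                         customer_id INTEGER REFERENCES customers(customer_id),
                         amount REAL);
    INSERT INTO customers VALUES (1, 'FR'), (2, 'US');
    INSERT INTO orders VALUES (10, 1, 9.9), (11, 1, 5.0), (12, 2, 42.0);

    -- Denormalisation: flatten the join once at write time, so downstream
    -- reads never join again (at the cost of repeating customer attributes).
    CREATE TABLE orders_denorm AS
    SELECT o.order_id, o.amount, c.customer_id, c.country
    FROM orders o JOIN customers c USING (customer_id);
""")
rows = con.execute(
    "SELECT order_id, country FROM orders_denorm ORDER BY order_id"
).fetchall()
print(rows)  # [(10, 'FR'), (11, 'FR'), (12, 'US')]
```

The trade-off is the classic one: redundant storage and update anomalies in exchange for read paths that need no joins.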
Data modeling should not be a required data engineer skill. Enter the analytics engineer.</p><p>Still, I feel that there is a hole in my skillset because I can't give relevant advice when it comes to modeling a business with 3 fact tables instead of 5. And to be honest there isn't any good <em>modern</em> literature to answer this question. Simon started a multipart <a href="https://airbyte.com/blog/data-modeling-unsung-hero-data-engineering-introduction?ref=blef.fr">guide about data modeling</a>. I hope he will fill the gaps. In the first part he covers the history of modeling and the main concepts.</p><p>At the same time Maxime Beauchemin wrote a post about <a href="https://preset.io/blog/introducing-entity-centric-data-modeling-for-analytics/?ref=blef.fr">Entity-Centric data modeling</a>. In comparison to dimensional modeling it uses entities instead of facts, which is easier to understand conceptually but also easier to use in machine learning.</p><p>When it comes to modeling it's hard not to mention dbt. In recent years dbt simplified and revolutionised the tooling to create data models. dbt, as of today, is the leading framework. But alternatives are coming. This week I discovered <a href="https://github.com/TobikoData/sqlmesh?ref=blef.fr">SQLMesh</a>, an all-in-one data pipeline tool. SQLMesh lets you define models like dbt but spares you the burden of the Jinja ref/sources macros. Under the hood it uses <a href="https://github.com/tobymao/sqlglot?ref=blef.fr">sqlglot</a>, the SQL parser developed by the same developer. It seems there is also a scheduler and a web UI included in the open-source version.</p><p></p><h1 id="gen-ai-%F0%9F%A4%96">Gen AI 🤖</h1><ul><li><a href="https://gizmodo.com/chatgpt-ai-samsung-employees-leak-data-1850307376?ref=blef.fr">It seems that Samsung employees leaked data to ChatGPT</a> — Unsurprisingly OpenAI saves all the prompts we type (🫠) and can eventually improve models incrementally. 
It seems that Samsung employees gave confidential information to ChatGPT. Which means that OpenAI owns Samsung data. But is it really different than what we already have with Gmail or AWS? Or like when <a href="https://jalopnik.com/tesla-employees-share-video-inside-customer-cars-1850307909?ref=blef.fr">Tesla employees were watching customers' in-car footage for years</a>.</li><li><a href="https://www.cnbc.com/2023/04/04/italy-has-banned-chatgpt-heres-what-other-countries-are-doing.html?ref=blef.fr">Italy decided to ban ChatGPT</a> — In order to do it the Italian data protection watchdog ordered OpenAI to temporarily cease processing Italian users' data. France and Germany might follow.</li><li><a href="https://openai.com/blog/our-approach-to-ai-safety?ref=blef.fr">OpenAI: Our approach to AI safety</a> — 4 axes in which OpenAI wants to invest: improve safeguards, protect children, respect privacy and improve factual accuracy.</li><li><a href="https://cims.nyu.edu/~sbowman/eightthings.pdf?ref=blef.fr">Eight things to know about Large Language Models</a> — A PDF that will give me a headache.</li><li>On the practical side I tried to run an LLM locally on my M1 Mac for the first time and it was a fun ride. In a nutshell I wanted to first run <a href="https://github.com/lm-sys/FastChat/?ref=blef.fr">Vicuna</a>, an open-source chatbot that has <a href="https://vicuna.lmsys.org/?ref=blef.fr">great results when compared to GPT3.5</a>. In order to run Vicuna (or other similar open-source models) you need to get the weights of LLaMA, Meta's 65B params foundation model. You can get the model either by completing a Google form and waiting, or via other channels reminding me of the early days of the internet 🧲. Apart from the fact that the inference was super slow—while using dozens of GB of RAM—the results were not as good as ChatGPT but still great. 
If you find it interesting, tell me and I'll write a post about what I launched and how.</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/04/image-1.png" class="kg-image" alt loading="lazy" width="800" height="533" srcset="https://www.blef.fr/content/images/size/w600/2023/04/image-1.png 600w, https://www.blef.fr/content/images/2023/04/image-1.png 800w" sizes="(min-width: 720px) 720px"><figcaption>Rare footage of a foundation model (<a href="https://unsplash.com/photos/73FOXT1DvjI?ref=blef.fr">credits</a>)</figcaption></figure><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li><a href="https://blog.twitter.com/engineering/en_us/topics/open-source/2023/twitter-recommendation-algorithm?ref=blef.fr">Twitter's recommendation algorithm</a> — It started with an Elon tweet. Twitter published their recommendation algorithm on GitHub (<a href="https://github.com/twitter/the-algorithm?ref=blef.fr">here</a> and <a href="https://github.com/twitter/the-algorithm-ml?ref=blef.fr">here</a>) and they wrote a blogpost explaining how the recommendation works. The machine learning is mainly in Python and uses PyTorch. But the algorithm as a whole contains a lot of features, filters and network algorithms.</li><li><a href="https://learn.microsoft.com/en-us/power-platform/release-plan/2023wave1/data-integration/?ref=blef.fr">Microsoft data integration new capabilities</a> — A few months ago I entered the Azure world. Not really without pain. Today, Microsoft announces new low-code capabilities for Power Query in order to do "data preparation" from multiple sources. Disclaimer: I don't use Power Query and I don't plan to ever use it.</li><li><a href="https://erdavis.com/2023/04/05/one-year-as-a-dataviz-journalist/?ref=blef.fr">One year as a dataviz journalist</a> — Saturday is a good day to have a look at great data visualisations. 
Erin celebrates his 1-year anniversary as a viz journalist by highlighting the work he is proud of. I really like the "Farthest distance between World Cup stadiums" or the paths to become CCO.</li><li><a href="https://stkbailey.substack.com/p/life-after-orchestrators?ref=blef.fr">Life after orchestrators</a> — Benjamin thinks that orchestrators are legacy systems and that we should all move to the real-time world where everything is simpler. No need to add triggers and synchronise workflows together. Side note: Ben co-founded Popsink, a real-time ETL company.</li><li><a href="https://ai.facebook.com/blog/segment-anything-foundation-model-image-segmentation/?ref=blef.fr">Meta introduces Segment Anything</a> — A new <a href="https://en.wikipedia.org/wiki/Foundation_models?ref=blef.fr">Foundation model</a> enters the game. Its name is SAM, and SAM wants to identify which image pixels belong to an object. Will traditional computer vision be the next space to become has-been with the new AI innovations?</li><li>❤️ <a href="https://locallyoptimistic.com/post/reducing-the-lottery-factor-for-data-teams/?ref=blef.fr">Reducing the lottery factor, for data teams</a> — If you had to read only one article today you should read this one. The lottery factor, also known as the bus factor, is a risk measurement about knowledge sharing. In data teams a lot of work has to be done in the early days to avoid knowledge being lost later on. The article gives ~10 pieces of advice to apply to lower the risks. Among them I like the changelog, the pair-programming, the pre-recorded video and the stable credentials.</li><li><a href="https://count.co/canvas/vWnN0JCglDd?ref=blef.fr">The ultimate guide to hire your data team</a> — An awesome canvas to conduct data interviews. This guide will help you before and during the interview. 
It includes a great list of example questions that you could ask in interviews.</li></ul><p><em>PS: <a href="https://www.datacouncil.ai/austin?ref=blef.fr">Data Council</a> took place in Austin a few days ago. As soon as the videos are out on YouTube I'll do a wrap-up of the sessions. Data Council is usually a moment of the year when the US data gratin gathers to discuss.</em></p><p></p><h1 id="data-economy-%F0%9F%92%B0">Data Economy 💰</h1><ul><li><strong><a href="https://getdozer.io/?ref=blef.fr">Dozer</a></strong> <a href="https://techcrunch.com/2023/04/03/dozer-exits-stealth-to-help-any-developer-build-real-time-data-apps-in-minutes/?ref=blef.fr">raises $3m seed round</a>. Dozer is a platform to develop real-time data apps, looking like a real-time ETL platform. With Dozer you can connect to multiple sources, do transformations (SQL, Python or JS) and then expose the output in APIs for frontend consumers (React, Vue or Python), all configured in YAML. It also looks like Dozer is not really under a proper open-source license. If you want to go deeper: to me Dozer looks like <a href="https://materialize.com/?ref=blef.fr">Materialize</a> or <a href="https://www.popsink.com/?ref=blef.fr">Popsink</a> but with a different vision, offering an API as a serving layer rather than a database.</li><li><strong><a href="https://www.roboto.ai/?ref=blef.fr">Roboto AI</a></strong> <a href="https://www.roboto.ai/post/roboto-raises-seed-funding-4-8m?ref=blef.fr">raises $4.8m seed round</a>. I hate this as much as I find it interesting. Roboto AI wants to create an AI-powered toolbox for people in robotics. In their demo you can use prompts to search over images or timeseries.</li></ul><hr><p>See you next week ❤️.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 23.13 ]]></title>
                    <description><![CDATA[ Data News #23.13 — Google BigQuery pricing changes, Looker Modeler, Bill Gates AI vision, open-letter to pause AI experiments, and usual Fast News. ]]></description>
                    <link><![CDATA[ /data-news-week-23-13/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 64256d0d3b0c15003d2135dc ]]></guid>
                    <pubDate><![CDATA[ 2023-03-31 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/03/image-10.png" class="kg-image" alt loading="lazy" width="800" height="534" srcset="https://www.blef.fr/content/images/size/w600/2023/03/image-10.png 600w, https://www.blef.fr/content/images/2023/03/image-10.png 800w" sizes="(min-width: 720px) 720px"><figcaption>This newsletter is about money (<a href="https://unsplash.com/photos/2FWZEVs3XDE?ref=blef.fr">credits</a>)</figcaption></figure><p>Dear readers, already 3 months done in 2023. We are slowly approaching the 2-year anniversary of the blog and the newsletter. We are almost 3000 and once again I want to thank you for the trust. To be honest time flies and I’d have preferred to do more for the blog at the start of the year, but my freelancing activities and my laziness took too much of my time.</p><p>By the way, recently I’ve worked with Azure tooling and I changed my mind a bit. I had tried Azure years ago and the only memory I had of it was that it was not working. Like you ask for a VM and you don’t get a VM. But obviously it changed. Except for the fact that they have a pretty bad vocabulary for things, it works, and the UI is surprisingly pleasant to use.</p><p>With this experience my personal preference hierarchy, which is subjective, changed to GCP &gt; Azure &gt; AWS. 
Still, one complaint I have is about the documentation: sometimes docs pages are not of great quality, the presentation is pretty bad and full of usage examples when we only need complete documentation of resources.</p><p>If you did not register yet: next week we’ll host an online meetup about Airflow alternatives, and the Prefect and Dagster teams will do a demo.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/03/Meetup--4-18--1.png" class="kg-image" alt loading="lazy" width="2000" height="433" srcset="https://www.blef.fr/content/images/size/w600/2023/03/Meetup--4-18--1.png 600w, https://www.blef.fr/content/images/size/w1000/2023/03/Meetup--4-18--1.png 1000w, https://www.blef.fr/content/images/size/w1600/2023/03/Meetup--4-18--1.png 1600w, https://www.blef.fr/content/images/size/w2400/2023/03/Meetup--4-18--1.png 2400w" sizes="(min-width: 720px) 720px"><figcaption><a href="https://www.linkedin.com/events/7044560795581526017/comments/?ref=blef.fr">Join the event on LinkedIn</a> on April 4th 7PM CET (UTC+2)</figcaption></figure><p>PS: brace yourself, April Fools' is tomorrow. I've already seen a few "jokes" on LinkedIn.</p><p></p><h1 id="google-data-cloud-ai-summit">Google Data Cloud &amp; AI Summit</h1><p>Two days ago Google announced new things at their Data Cloud &amp; AI Summit. Here is a small recap of what has been announced.</p><h3 id="pricing-changes">Pricing changes</h3><p>First, a new <a href="https://cloud.google.com/blog/products/data-analytics/introducing-new-bigquery-pricing-editions?hl=en&ref=blef.fr">BigQuery pricing model</a>. Big changes—or should I say BigChanges—the flat-rate pricing will no longer be accessible starting on July 5 and will be replaced by a <a href="https://cloud.google.com/bigquery/pricing?ref=blef.fr#capacity_compute_analysis_pricing">capacity pricing</a> similar to what Snowflake is doing. <strong>It will start at $0.04 per slot hour. 
It is hard to compare with the previous flat-rate pricing, but the previous pricing was more around $0.028 per slot hour, so a ~42% increase. </strong>Still, Google says that it will lower your BigQuery costs because they have the smartest autoscaler on earth 🫠 and will run only what's perfectly needed for your queries.</p><p>Let's take an example: previously you could run 100 BigQuery slots at every moment in time for $2,000 a month. Tomorrow, for the same amount ($2,000), you would be able to run roughly 160 slots if you use BigQuery only 10h per day (at $0.04 per slot hour, $2,000 buys 50,000 slot hours a month; spread over 10h × 30 days = 300 hours, that's roughly 160 slots). This change means less computing power on average for the same price, but with higher peaks.</p><p>As good news never comes alone, they will also increase the on-demand pricing by 25% starting July 5. It will cost $6.25 per TB compared to $5 before.</p><p>They also announced a "significant" increase in compression performance, so you should switch your storage pricing from logical (uncompressed) to physical (compressed—the actual bytes stored on disk). Compressed storage is at least twice as expensive as uncompressed, but as they announce a 12:1 (previously 10:1) compression ratio your company wallet will be the winner. The icing on the cake.</p><p>This new pricing is sad to see. Excluding the increase, I believe that for years one of the strongest advantages of BigQuery was its apparent pricing transparency. Now you need to do multiplications and read 5 pages to understand the pricing.</p><p>5 paragraphs about the pricing, it was unexpected.</p><h3 id="looker-modeler">Looker Modeler</h3><p>They also announced <a href="https://cloud.google.com/blog/products/data-analytics/introducing-looker-modeler?hl=en&ref=blef.fr">Looker Modeler</a>, a single source of truth for BI metrics. The wait is finally over: we have Google's take on the semantic layer, an evolution of LookML, which was one of the first semantic layers. 
They created Looker Modeler as a metrics layer that will be accessible by all the applications downstream.</p><p>In a nutshell it will mean—in Looker vocabulary:</p><ul><li>Data Engineers will create sources that will be available in Looker</li><li>Analytics Engineers (or Data Analysts) will create <em>Views</em> from a table with <em>Dimensions</em> and <em>Metrics</em> thanks to LookML</li><li>Then AEs will create <em>Models</em> on top of <em>Explores</em>—Explores are joined Views</li><li>Then you will be able to access the <em>Models</em> through Looker Modeler via a JDBC interface or a REST API.</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/03/Screenshot-2023-03-31-at-17.04.25.png" class="kg-image" alt loading="lazy" width="1930" height="1086" srcset="https://www.blef.fr/content/images/size/w600/2023/03/Screenshot-2023-03-31-at-17.04.25.png 600w, https://www.blef.fr/content/images/size/w1000/2023/03/Screenshot-2023-03-31-at-17.04.25.png 1000w, https://www.blef.fr/content/images/size/w1600/2023/03/Screenshot-2023-03-31-at-17.04.25.png 1600w, https://www.blef.fr/content/images/2023/03/Screenshot-2023-03-31-at-17.04.25.png 1930w" sizes="(min-width: 720px) 720px"><figcaption>Screenshot from "<a href="https://cloudonair.withgoogle.com/events/summit-data-cloud-2023/watch?talk=t2_s5_trustedmetricseverywhere&ref=blef.fr">Trusted metrics everywhere</a>" keynote.</figcaption></figure><p>Thanks to this we will be able to read LookML data from Tableau. Awesome!</p><blockquote>If you teach someone SQL they can help themselves, if you teach them LookML they help everyone.</blockquote><h3 id="gen-app-builder">Gen App Builder</h3><p>As an answer to the OpenAI offering, Google Cloud started to propose cloud offerings around generative AI. 
They announced a <a href="https://cloud.google.com/blog/products/ai-machine-learning/create-generative-apps-in-minutes-with-gen-app-builder/?hl=en&ref=blef.fr">web UI to create conversational AIs</a>. In the demo you upload a FAQ (in CSV) and a "How to guide" (in PDF), you pick either "Chat" or "Search" mode, and it will generate an app, personalised with your data, that you can give to your customers. You can also feed the pre-trained models with BigQuery tables, GCS buckets or website URLs.</p><p></p><h1 id="gen-ai-%F0%9F%A4%96">Gen AI 🤖</h1><p>The Google summit's last section was a perfect transition to the Gen AI category. Every week is richer than the previous one.</p><ul><li>Bill Gates published a note: <a href="https://www.gatesnotes.com/The-Age-of-AI-Has-Begun?WT.mc_id=20230321100000_Artificial-Intelligence_BG-TW_&WT.tsrc=BGTW&ref=blef.fr">The Age of AI has begun</a>. He says that GPT being able to ace a university Bio exam is the <em>most important advance in technology since the graphical user interface</em> in 1980. He thinks that AI will help reduce the world's inequities, like health inequities—I doubt that; like pills and medicine, only rich and educated people will benefit from AI in the end. Then he enlightens us on how AI can change health and education systems, what the risks are and what the next frontiers could be.</li><li>Then everyone started to freak out. An open letter has been written to ask for a pause—at least 6 months—in <a href="https://futureoflife.org/open-letter/pause-giant-ai-experiments/?ref=blef.fr">giant AI experiments</a>; it has been signed by almost 2000 people, including some notorious CEOs, researchers and Elon Musk. This pause should be used to develop AI governance systems with policymakers.</li><li>Gary Marcus debated the<a href="https://garymarcus.substack.com/p/ai-risk-agi-risk?ref=blef.fr"> AI risk ≠ AGI risk</a> question. AGI means artificial general intelligence. 
He mainly thinks that LLMs are an “off-ramp” on the road to AGI.</li><li><a href="https://a16z.com/2023/03/30/b2b-generative-ai-synthai/?ref=blef.fr">For B2B generative AI apps, is less more?</a> — a16z, a huge US VC, predicts that we will enter a second wave of AI called SynthAI. <strong>Currently we generate information based on prompts; in wave 2 we will generate insights based on information</strong>. This wave seems critical for B2B because AI should help decision making, and it needs to be concise for that.</li><li>As a fun use-case, John wrote how we can<a href="https://data.blacksquare.io/rethinking-product-search-in-the-world-of-chatgpt-57b435b5ce3c?ref=blef.fr"> rethink whisky search in the world of ChatGPT</a>.</li></ul><div class="kg-card kg-callout-card kg-callout-card-grey"><div class="kg-callout-emoji">🙃</div><div class="kg-callout-text">Last week I said I wanted to write about the current state of self-service; you will have to wait a bit more because I got lost in today's set of news. But if you want a small takeaway: I don't think we will achieve self-service only with ChatGPT, the issue is not only in the technology but also in the people.<br><br>Still, LLMs and semantic layers are good initiatives to achieve this everlasting dream. But first ask yourself: will my CEO trust a bot answering on Slack or someone from the accounting team delivering an overcrowded Excel?</div></div><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li><a href="https://robertsahlin.substack.com/p/the-data-engineer-is-dead-long-live?ref=blef.fr">The data engineer is dead, long live the data platform engineer</a> — This is a current trend of the market that has been accelerated by the appearance of the analytics engineer. As AEs are, theoretically, in between DEs and DAs, it pushes the actual role borders further, meaning DEs will have to do more infra stuff to support analytics initiatives. This is normal and Robert brings more to the table in his blogpost. 
You can also check this <a href="https://medium.com/@loresowhat/a-map-to-explain-all-data-roles-970ed69ba1?ref=blef.fr">map that explains all data roles</a>.</li><li><a href="https://www.bbc.co.uk/ideas/videos/five-charts-that-changed-the-world/p0fb69c1?playlist=made-in-partnership-with-the-royal-society&ref=blef.fr">Five charts that changed the world</a> — A 5-minute video by the BBC that shows 5 awesome charts that changed the course of the world.</li><li><a href="https://github.com/datafuselabs/databend?ref=blef.fr">Databend, an open-source version of Snowflake</a> — at least this is what they claim. This week I discovered this "open-source data warehouse written in Rust". I'll try it out when I have time. If you have tried it I'd love to get your feedback.</li><li><a href="https://docs.getdbt.com/blog/audit-helper-for-migration?ref=blef.fr">audit_helper in dbt</a> — A blog post that showcases how you can use the dbt audit_helper package to improve your models. As the competition gets harder every week, Datafold also wrote a blogpost comparing <a href="https://www.datafold.com/blog/dbt-audit-helper-vs-data-diff?ref=blef.fr">audit_helper vs. data-diff</a>.</li><li><a href="https://github.com/dbt-checkpoint/dbt-checkpoint?ref=blef.fr">dbt-checkpoint, a list of pre-commit hooks to ensure dbt quality</a> — A list of 40 pre-commit hooks written in Python that you can use to improve the quality of your dbt projects. It includes the useful <code>check-script-has-no-table-name</code> to check that there are no table name leftovers.</li><li><a href="https://engineering.mixpanel.com/strategies-for-effective-data-compaction-a48718021b7b?ref=blef.fr">Strategies for effective data compaction</a> — From Mixpanel: how they developed an event compactor system with PubSub.</li><li>The <a href="https://twitter.com/erlichya/status/1639973591214182400?ref=blef.fr">2 inventors of the Lempel-Ziv algorithm</a> that is used in all ZIP files died recently. 
They wrote their proposal in <strong>1977</strong>. RIP.</li></ul><p></p><hr><p>Sorry for this longer edition, and see you next week ❤️.</p><p>PS: if you follow me on LinkedIn you might see this content recycled there because.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 23.12 ]]></title>
                    <description><![CDATA[ Data News #23.12 — Mage and Kestra takeaways, OpenAI plugins system and impact on job market, Reddit outage post mortem, etc. ]]></description>
                    <link><![CDATA[ /data-news-week-23-12/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 641cd59c0df1c6003d438548 ]]></guid>
                    <pubDate><![CDATA[ 2023-03-24 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/03/image-7.png" class="kg-image" alt loading="lazy" width="2000" height="1125" srcset="https://www.blef.fr/content/images/size/w600/2023/03/image-7.png 600w, https://www.blef.fr/content/images/size/w1000/2023/03/image-7.png 1000w, https://www.blef.fr/content/images/size/w1600/2023/03/image-7.png 1600w, https://www.blef.fr/content/images/size/w2400/2023/03/image-7.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>The Earth can also generate great images (<a href="https://unsplash.com/photos/8A8NJYytdFo?ref=blef.fr">credits</a>)</figcaption></figure><p>Dear readers, I hope this new edition finds you well. It seems that you really liked the recent editions, which is perfect because they were fun to write. I feel that this week all the articles I found relevant for the newsletter are either AI-related or technical. I really don't know how to deal with the news overflow about the Gen AI landscape. Do you like all the GenAI hype? <a href="https://www.blef.fr/survey?vote=2312_yes">👍</a> or <a href="https://www.blef.fr/survey?vote=2312_no">👎</a></p><p></p><h1 id="airflow-alternatives-meetup">Airflow alternatives meetup</h1><p>This Tuesday the first part of the Airflow alternatives meetup, with <strong>Mage and Kestra</strong>, took place. It was an awesome online meetup. I really liked the presentations from Mage and Kestra, and even if I was focused on hosting the event it was great to see 2 other visions about the future of orchestration. 
Which, to be honest, are not really far from Airflow's.</p><div class="kg-card kg-button-card kg-align-center"><a href="https://www.youtube.com/watch?v=sAc-uNvlveY&ref=blef.fr" class="kg-btn kg-btn-accent">📺 Watch the full replay</a></div><p>Here are my takeaways about the event:</p><ul><li>Mage and Kestra have both been developed with Airflow's flaws in mind, especially deployment complexity, reusability and data sharing between tasks.</li><li>The tagline "<em>Modern replacement for Airflow</em>" on Mage's side makes sense. Out of the box Mage provides an all-in-one web editor to write data pipelines with a great UX. In small browser text areas you will be able to write Python, SQL or R code and orchestrate these transformations with drag-n-drop. I personally hate developing in the browser, but the promise looks good. Actually, Mage and the current Airflow version are—almost—the same; the main difference is the UX when developing pipelines. </li><li>Tommy, Mage's CEO, said that for the moment they will focus on building the best open-source data pipeline tool. They have enough funding for the next 2 years. </li><li>Facing reality, even if worker management seems easier in Mage, the deployment is not yet ready to go: either you go with a Terraform script that will launch elastic containers, or you go with Helm, which requires Kubernetes.</li><li>Now Kestra, one of the latest kids on the block. Ludovic, the CTO, who presented Kestra at the event, said that he started the development while on a mission at Leroy Merlin where people were <a href="https://kestra.io/blogs/2022-02-22-leroy-merlin-usage-kestra.html?ref=blef.fr">heavily unhappy about Airflow</a>. Kestra is a YAML-based data pipeline tool mixed with string templating. The YAML approach allows less-technical users to write pipelines.</li><li>Kestra's vision is also very open: everything is accessible through APIs, which leads to a variety of usages within a company. 
Under the hood Kestra is developed in Java, which is totally different from the other alternatives.</li><li>In the future Kestra could easily look like Mage, YAML being the mid-step before a "drag-n-drop"-like UI.</li></ul><p>It was so fun to organise this event and I'd love to do more lives in the future with blef.fr. Still, in 2 weeks, on April 4th, part 2 of the event with Prefect and Dagster will take place; I hope I'll see you there.</p><figure class="kg-card kg-image-card kg-card-hascaption"><a href="https://www.linkedin.com/events/7044560795581526017/comments/?ref=blef.fr"><img src="https://www.blef.fr/content/images/2023/03/Meetup--4-18-.png" class="kg-image" alt loading="lazy" width="2000" height="433" srcset="https://www.blef.fr/content/images/size/w600/2023/03/Meetup--4-18-.png 600w, https://www.blef.fr/content/images/size/w1000/2023/03/Meetup--4-18-.png 1000w, https://www.blef.fr/content/images/size/w1600/2023/03/Meetup--4-18-.png 1600w, https://www.blef.fr/content/images/size/w2400/2023/03/Meetup--4-18-.png 2400w" sizes="(min-width: 720px) 720px"></a><figcaption>You should <a href="https://www.linkedin.com/events/7044560795581526017/comments/?ref=blef.fr">register for part 2</a></figcaption></figure><p></p><h1 id="gen-ai-%F0%9F%A4%96">Gen AI 🤖</h1><p>The newsletter is already too big for today so I'll try to keep it short, especially on Gen AI, which is already spammed everywhere.</p><p>OpenAI is slowly starting to create a gigantic ecosystem and could become the next GAFA-like company. The <a href="https://openai.com/blog/introducing-openai?ref=blef.fr">non-profit research company manifesto</a> is already far away. OpenAI released a <a href="https://arxiv.org/pdf/2303.10130.pdf?ref=blef.fr">study</a> about the impact of Large Language Models—LLMs—on the job market (sorry, I wanted to read the pdf but my brain is already grilled) and announced <a href="https://openai.com/blog/chatgpt-plugins?ref=blef.fr">ChatGPT plugins</a>. 
In a nutshell, OpenAI has created an AI interface that everyone likes and will add on top of it an App Store experience with plugins. It reminds me of something.</p><p>Because OpenAI is not everything, some news from the alternative world. Mozilla announced <a href="https://blog.mozilla.org/en/mozilla/introducing-mozilla-ai-investing-in-trustworthy-ai/?ref=blef.fr">Mozilla.ai</a>, a community-based open-source AI ecosystem; Stanford researchers released <a href="https://crfm.stanford.edu/2023/03/13/alpaca.html?ref=blef.fr">Alpaca</a>, a model that behaves similarly to OpenAI's text-davinci-003 but costs a lot less ($600 to train it); there is also a list of <a href="https://github.com/nichtdax/awesome-totally-open-chatgpt?ref=blef.fr">open alternatives to ChatGPT</a>.</p><p>I'd have loved to speak about tools offering to translate human language to SQL, like <a href="https://dev.to/trinly01/how-to-use-sequelai-to-convert-natural-language-queries-into-sql-queries-3f7o?ref=blef.fr">sequel.ai</a> or <a href="https://www.sqltranslate.app/?ref=producthunt">SQL translator</a>, but it would open the Pandora's box of self-service analytics and this is for next week.</p><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/03/image-8.png" class="kg-image" alt loading="lazy" width="800" height="450" srcset="https://www.blef.fr/content/images/size/w600/2023/03/image-8.png 600w, https://www.blef.fr/content/images/2023/03/image-8.png 800w" sizes="(min-width: 720px) 720px"><figcaption>Pi? 
(<a href="https://unsplash.com/photos/vqRMXgVtGXM?ref=blef.fr">credits</a>)</figcaption></figure><ul><li><a href="https://www.reddit.com/r/RedditEng/comments/11xx5o0/you_broke_reddit_the_piday_outage/?ref=blef.fr">You broke Reddit: The Pi-day outage</a> — a good retrospective post on the Reddit outage on Pi-day—the 14th of March, 3/14 in US date format; unlucky are all the rest of us who don't have a Pi-day. Jayme, a staff software engineer, shares that a Kubernetes version upgrade from 1.23 to 1.24 led to the outage. Kubernetes introduced in 1.24 a terminology change from <em>master</em> to <em>control-plane</em>, which was the trigger of the issue.</li><li><a href="https://arrow.apache.org/blog/2023/03/07/nanoarrow-0.1.0-release/?ref=blef.fr">Apache Arrow releases Arrow nanoarrow</a> — Recently Arrow got a lot of light because of DuckDB and Pandas 2.0, and it's good. Arrow is a multi-language interface for in-memory data structures. Nanoarrow is a C library that acts as a simplified interface for application developers in order to put Arrow everywhere.</li><li><a href="https://www.linkedin.com/pulse/avoiding-data-pipeline-failures-importance-capability-ravindra-kumar%3FtrackingId=nBR%252FEpxUQy%252BgG4xZRFJXgw%253D%253D/?trackingId=nBR%2FEpxUQy%2BgG4xZRFJXgw%3D%3D&ref=blef.fr">Avoiding data pipeline failures: the importance of backfilling capability</a> — As I've already said in the past, backfilling is one task that separates data engineers from great data engineers. Backfilling has to be thought about at every step of a data pipeline's design and development. 
This is a small LinkedIn article, but it's a good reminder.</li><li><a href="https://www.datafold.com/blog/dbt-development-testing-snowflake?ref=blef.fr">Datafold's data-diff now integrates dbt</a> — You can now run a data diff after a dbt run to compare your models to the production state and get a summarised view of the rows impacted.</li><li><a href="https://engineering.linkedin.com/blog/2023/unified-streaming-and-batch-pipelines-at-linkedin--reducing-proc?ref=blef.fr">How LinkedIn reduced processing time with Apache Beam</a> — Beam is a distributed processing framework that proposes a unified execution engine for batch and real-time. The LinkedIn team decided to migrate their lambda architecture and got a 94% uplift in performance.</li><li><a href="https://www.fivetran.com/blog/how-fast-is-duckdb-really?ref=blef.fr">How fast is DuckDB really?</a> — George, Fivetran's CEO, ran a performance test to get metrics on DuckDB's performance. The article's conclusion is that a MacBook M1 (and probably M2) can have better performance than a server. I can relate.</li><li><a href="https://medium.com/alvin-ai/if-data-lineage-is-the-answer-what-is-the-question-bad7f5f44fb5?ref=blef.fr">If data lineage is the answer, what is the question?</a> — A good list of use-cases where data lineage would be useful.</li><li><a href="https://doordash.engineering/2023/03/21/using-cockroachdb-to-reduce-feature-store-costs-by-75/?ref=blef.fr">Using CockroachDB to reduce feature store costs by 75%</a> — More and more articles about cost optimisation; 2023 is the year where data engineers' skills will be used to lower platform costs. 
And this is a good point.</li><li><a href="https://medium.com/@diogo22santos/how-shadow-data-teams-are-creating-massive-data-debt-d432113f4632?ref=blef.fr">How shadow data teams are creating massive data debt</a>.</li><li><a href="https://tech.instacart.com/distributed-machine-learning-at-instacart-4b11d7569423?ref=blef.fr">Distributed Machine Learning at Instacart</a>.</li></ul><p></p><h1 id="data-economy-%F0%9F%92%B0">Data Economy 💰</h1><ul><li><strong>Sifflet</strong> <a href="https://www.siffletdata.com/blog/sifflet-secures-eu12m-in-series-a-financing-to-put-an-end-to-data-entropy?ref=blef.fr">raises €12.8m Series A</a>. Initially a data observability tool, but it turns out they added features like lineage and cataloging, often needed to better contextualise alerts but also to avoid tool multiplicity when working with big corporations.</li><li><strong>Hex</strong> <a href="https://hex.tech/blog/funding-round-march-2023/?ref=blef.fr">raises $28m in a Venture Round</a>[1]. Hex is a notebook-based analytics application. Cells are at the center of the analytics: they produce outputs that can be used later in other cells or in visualisations. The visualisations can be organised in a Notion-like document, but with live data. I recently tried Hex, the UX is neat and I think the tool is worth it for production-ready explorations[2]. Here is an example with the <a href="https://app.hex.tech/79b8d7af-cf83-4c25-b4e5-0e132af2df36/app/85c5d79c-e6c4-4642-8317-4d0ebe9520dd/latest?ref=blef.fr">MAD data</a>—I made no presentation effort.</li><li><strong>DragonflyDB</strong> <a href="https://siliconangle.com/2023/03/21/dragonflydb-reels-21m-speedy-memory-database/?ref=blef.fr">raises $21m Series A</a>. Dragonfly is a replacement for Redis claiming to outperform it in many ways (throughput, snapshotting speed, scaling). 
I don't have a lot to say except that we are heading toward a future with a lot of database choices.</li></ul><hr><ol><li>A venture round is when the series has not been specified.</li><li>I just coined the term; I mean, it's when you do a great exploration and you want to share a professional result with your stakeholders.</li></ol><hr><p>See you next week ❤️.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 23.11 ]]></title>
                    <description><![CDATA[ Data News #23.11 — Airflow alternatives meetup, Gen AI new category, online gradient descent in SQL, data with Rust, dbt exposures to sources, etc. ]]></description>
                    <link><![CDATA[ /data-news-week-23-11/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 64134b63fc17eb003d390a66 ]]></guid>
                    <pubDate><![CDATA[ 2023-03-17 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/03/image-5.png" class="kg-image" alt loading="lazy" width="2000" height="1333" srcset="https://www.blef.fr/content/images/size/w600/2023/03/image-5.png 600w, https://www.blef.fr/content/images/size/w1000/2023/03/image-5.png 1000w, https://www.blef.fr/content/images/size/w1600/2023/03/image-5.png 1600w, https://www.blef.fr/content/images/size/w2400/2023/03/image-5.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>Took a few days with the ☀️ (<a href="https://unsplash.com/photos/Nx0C3cDKRLw?ref=blef.fr">credits</a>)</figcaption></figure><p>Hey you, I hope you had a great week. On my side I'm slowly starting to get on top of the things I had in the queue. But, sadly, I work in <a href="https://es.wikipedia.org/wiki/Last_in,_first_out?ref=blef.fr">LIFO</a> so I feel that I'm never done. For people that are not used to it, it means <strong>last in, first out</strong>: I get easily disturbed by a notification—or even a thought—and do something that I did not plan to do at first. It probably explains why you always get the newsletter late on Fridays—or Saturdays.</p><p>Thank you for the feedback about last week's issue, it seems you liked it. 
I'll try to continue doing deep-dives on articles from time to time.</p><p></p><h1 id="airflow-alternatives-meetup">Airflow alternatives meetup</h1><figure class="kg-card kg-image-card kg-card-hascaption"><a href="https://www.linkedin.com/events/7041351388609576960/about/?ref=blef.fr"><img src="https://www.blef.fr/content/images/2023/03/Meetup--4-11-.png" class="kg-image" alt loading="lazy" width="2000" height="433" srcset="https://www.blef.fr/content/images/size/w600/2023/03/Meetup--4-11-.png 600w, https://www.blef.fr/content/images/size/w1000/2023/03/Meetup--4-11-.png 1000w, https://www.blef.fr/content/images/size/w1600/2023/03/Meetup--4-11-.png 1600w, https://www.blef.fr/content/images/size/w2400/2023/03/Meetup--4-11-.png 2400w" sizes="(min-width: 720px) 720px"></a><figcaption>Click on the image to go to the LinkedIn event.</figcaption></figure><p>Next week, with the Paris Apache Airflow Meetup group, we are organising an online event to discuss Airflow alternatives. At every Airflow meetup we get questions about Airflow's competition, so we decided to give a voice to the alternatives in order to understand how they compare with Airflow and more.</p><p>The first event will take place next week, on March 21st at 7PM CET (UTC+1), and we invited Mage and Kestra. We will host another event soon after with the others. You can either register on <a href="https://www.linkedin.com/events/7041351388609576960/about/?ref=blef.fr">LinkedIn</a> or join the <a href="https://www.meetup.com/fr-FR/paris-apache-airflow-meetup/events/291965622/?ref=blef.fr">meetup event</a>.</p><p>How lucky you are: I will host the event, so you'll hear my awesome French accent. 
It also means that if you have any questions that you want me to ask, you can send them to me beforehand 🫠.</p><p></p><h1 id="gen-ai-%F0%9F%A4%96">Gen AI 🤖</h1><p><em>I will create a specific category for generative AI.</em></p><p>If you live in a cave, or if you only read my newsletter to get news about the data world, you might have missed that <a href="https://openai.com/research/gpt-4?ref=blef.fr">GPT-4 has been announced and released</a> this week. I even had a hard time navigating between data engineering memes and GPT-4 tips on LinkedIn, and my Twitter is divided between GPT-4 threads and protests in France. What a time to be alive. Politicians think we should work longer when we are slowly starting to discover new AI capabilities that will for sure impact workplaces.</p><p>I don't want to take the usual shortcut—but how could I not. Will AI replace jobs? I do think that AI should empower people, but will capitalism think like this when an API call will be able to do the same job as a human? Does capitalism even think? Actually it's probably human decisions about AI that will lead to AI replacing people.</p><p>One field that has been totally impacted by generative AI is Natural Language Processing (NLP). On Reddit someone asked if others were also <a href="https://www.reddit.com/r/MachineLearning/comments/11rizyb/d_anyone_else_witnessing_a_panic_inside_nlp_orgs/?ref=blef.fr">witnessing panic in NLP orgs</a>. 
The general feeling is that GPT made years of NLP research outdated.</p><div class="kg-card kg-callout-card kg-callout-card-grey"><div class="kg-callout-emoji">ℹ️</div><div class="kg-callout-text">BTW, technically GPT-4 will be multimodal: you will be able to use text and images as inputs and the model will give you text outputs.</div></div><p>A few other news items:</p><ul><li>The LinkedIn team also wrote a blog post about <a href="https://engineering.linkedin.com/blog/2023/aI-at-linkedin-it-is-all-about-foundations?ref=blef.fr">AI principles</a> stating that AI is like oxygen for the engineering team—I personally would have said that data was the oxygen, but who cares—and that with great power comes great responsibility. The same week Microsoft (which owns LinkedIn) reportedly <a href="https://www.theverge.com/2023/3/13/23638823/microsoft-ethics-society-team-responsible-ai-layoffs?ref=blef.fr">laid off the AI ethics and society teams</a>. Great timing.</li><li><a href="https://glaze.cs.uchicago.edu/?ref=blef.fr">Glaze, protecting artists from style mimicry</a> — A tool developed by researchers at the University of Chicago that will help digital artists by cloaking their art to avoid mimicry by deep learning training.</li><li>Google and Microsoft will compete to include AI copilots in their office suites — Microsoft announced <a href="https://blogs.microsoft.com/blog/2023/03/16/introducing-microsoft-365-copilot-your-copilot-for-work/?ref=blef.fr">365 Copilot</a> that will work in Word, Excel, PowerPoint and Outlook.
On the other side <a href="https://blog.google/technology/ai/ai-developers-google-cloud-workspace/?ref=blef.fr">Google announced the same for Google Docs and Gmail</a>.</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/03/image-6.png" class="kg-image" alt loading="lazy" width="2000" height="1292" srcset="https://www.blef.fr/content/images/size/w600/2023/03/image-6.png 600w, https://www.blef.fr/content/images/size/w1000/2023/03/image-6.png 1000w, https://www.blef.fr/content/images/size/w1600/2023/03/image-6.png 1600w, https://www.blef.fr/content/images/size/w2400/2023/03/image-6.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>Can we develop a GenAI that generates protest slogans? (<a href="https://unsplash.com/photos/It3dmqBbKRQ?ref=blef.fr">credits</a>)</figcaption></figure><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li><a href="https://engineering.grab.com/migrating-to-abac?ref=blef.fr">Migrating from role to attribute-based access control</a> — RBAC is probably one of the most used paradigms when it comes to authorisation, especially because role-based authorisations are faster to put in place. In the article the Grab team explains how they migrated from role-based to attribute-based authorisation on Kafka.</li><li><a href="https://medium.com/data-science-at-microsoft/speeding-up-reverse-etl-3af04e069fd1?ref=blef.fr">Speeding up “Reverse ETL”</a> — Ziqi works at Microsoft and details in this article what they had to consider to improve their Lakehouse exports to downstream databases. In short they switched SQL Server to columnar storage, disabled indexes and locks when copying, and played with parallelisation and batch size.</li><li><a href="https://maxhalford.github.io/blog/ogd-in-sql/?ref=limit-5">Online gradient descent written in SQL</a> — Max is one of the best when it comes to great experiments. This time he shows that everything can be done in SQL.
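For intuition, here is a hedged Python sketch of what an online gradient descent update does (toy noiseless data and learning rate are made up; the post's point is that the same loop can be expressed with a recursive CTE in SQL):

```python
# Online gradient descent for least squares, one (x, y) sample at a time.
def ogd(stream, lr=0.1):
    w, b = 0.0, 0.0
    for x, y in stream:
        err = (w * x + b) - y   # prediction error on this single sample
        w -= lr * err * x       # gradient step for the slope
        b -= lr * err           # gradient step for the intercept
    return w, b

# Noiseless stream following y = 2x + 1; cycling over it many times
# drives the estimates towards w = 2 and b = 1.
stream = [(i / 100, 2 * (i / 100) + 1) for i in range(100)] * 200
w, b = ogd(stream)
```

Each row of the stream corresponds to one step of the recursion in the SQL version.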
With recursive CTEs he implemented a scikit-learn-like linear model, and the code is not even that big.</li><li><a href="https://datawithrust.com/?ref=blef.fr">Data with Rust</a> — This is a handbook that showcases how to do data engineering with Rust. At the moment only parts 1 and 2 are written but it looks promising.</li><li><a href="https://medium.com/@manish.ramrakhiani/automating-dbt-producer-consumer-pattern-57161ad178d5?ref=blef.fr">Sharing data between dbt projects, dbt exposures to sources</a> — When you have multiple dbt projects it can be a mess to reference a model from another project. This blog shows how you can automate it with a CI and definitions in exposures.</li><li><a href="https://www.sicara.fr/blog-technique/polars-vs-pandas?ref=blef.fr">Polars vs pandas: A new era for Python DataFrames</a> — This comparison is slowly becoming a great debate in the data world. Will Polars overtake pandas in the coming years? Guillaume wrote yet another great comparison.</li><li><a href="https://dagster.io/blog/fake-stars?ref=blef.fr">Tracking the fake GitHub star black market with Dagster, dbt and BigQuery</a> — Things are getting spicy here.
The Dagster team proposed a way to potentially identify GitHub projects buying stars.</li></ul><p></p><p>A few other articles, without comment:</p><ul><li><a href="https://www.onehouse.ai/blog/introducing-multi-modal-index-for-the-lakehouse-in-apache-hudi?ref=blef.fr">Introducing multi-modal index for the Lakehouse in Apache Hudi</a>.</li><li><a href="https://megandibble.medium.com/how-to-be-a-good-data-analyst-without-good-data-e4459f2e8585?ref=blef.fr">How to be a good data analyst without good data</a>.</li><li><a href="https://pedram.substack.com/p/dbt-reimagined?ref=blef.fr">dbt Reimagined</a>.<br></li></ul><p></p><h1 id="data-economy-%F0%9F%92%B0">Data Economy 💰</h1><ul><li>The Austrian data protection authority has decided that <a href="https://noyb.eu/en/austrian-dsb-meta-tracking-tools-illegal?ref=blef.fr">Meta tracking tools are in violation of the GDPR</a>. It will create a precedent.</li><li><strong>Seldon </strong><a href="https://www.seldon.io/announcing-our-series-b?ref=blef.fr">raises $20m Series B</a>. Seldon is an MLOps platform that helps you deploy models in production. At its core Seldon provides a framework that you can configure to serve your models on top of Kubernetes.</li><li><strong>👀 </strong><a href="https://www.adept.ai/?ref=blef.fr"><strong>Adept</strong></a> <a href="https://www.adept.ai/blog/series-b?ref=blef.fr">raises $350m Series B</a>. This is again a testimony to the frenzy around generative AI, and to me the most impressive one. Adept wants to create a general-purpose AI teammate for everyone. At the moment it takes the form of a browser extension in which you can ask for stuff as you navigate on Salesforce, Google Sheets or Craigslist.</li><li><strong>Cast AI</strong> <a href="https://cast.ai/press-release/cast-ai-receives-20m-in-new-funding-led-by-early-stage-vc-creandum/?ref=blef.fr">raises $20m in funding</a>. They propose an AI to cut your Kubernetes costs in half. Bold promise.</li></ul><hr><p>See you next week ❤️.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 23.10 ]]></title>
                    <description><![CDATA[ Data News #23.10 — The MAD landscape explained and the Silicon Valley Bank collapse. ]]></description>
                    <link><![CDATA[ /data-news-week-23-10/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 640a17c985afdd003d68caaf ]]></guid>
                    <pubDate><![CDATA[ 2023-03-11 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/03/image-4.png" class="kg-image" alt loading="lazy" width="2000" height="1333" srcset="https://www.blef.fr/content/images/size/w600/2023/03/image-4.png 600w, https://www.blef.fr/content/images/size/w1000/2023/03/image-4.png 1000w, https://www.blef.fr/content/images/size/w1600/2023/03/image-4.png 1600w, https://www.blef.fr/content/images/size/w2400/2023/03/image-4.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>Sorting all the eggs of the landscape (<a href="https://unsplash.com/photos/auEPahZjT40?ref=blef.fr">credits</a>)</figcaption></figure><p>Dear readers, this week Data News lands on Saturday and will be a little bit different than usual, because I found fewer relevant articles and, as promised last week, I wanted to speak about the MAD Landscape.</p><p>I hope you will enjoy this topic-focused edition where I speak about economics even though I'm a newbie in the field. At the last minute I also added stuff about Silicon Valley Bank, which has been seized by the US FDIC—something that will generate a crisis in the scale-up/startup world.</p><h1 id="the-mad-landscape">The MAD landscape</h1><p>The Machine learning, Artificial intelligence &amp; Data (MAD) Landscape is a company index that was initiated in 2012 by Matt Turck, a Managing Director at First Mark.
First Mark is a NYC VC; their portfolio includes Dataiku, ClickHouse and Astronomer among other tech or B2C companies.</p><figure class="kg-card kg-image-card kg-width-full kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/03/Frame-12.png" class="kg-image" alt loading="lazy" width="2000" height="685" srcset="https://www.blef.fr/content/images/size/w600/2023/03/Frame-12.png 600w, https://www.blef.fr/content/images/size/w1000/2023/03/Frame-12.png 1000w, https://www.blef.fr/content/images/size/w1600/2023/03/Frame-12.png 1600w, https://www.blef.fr/content/images/size/w2400/2023/03/Frame-12.png 2400w"><figcaption>Evolution between 2012 and 2023. We jumped from 142 logos to 1414; the world changed but Pig remains. (credits: <a href="https://mattturck.com/?ref=blef.fr">mattturck.com</a>)</figcaption></figure><p>Year after year the MAD Landscape has become an important tool for indexing the whole data landscape. The choice of categories is also a very clear way to categorise companies and to capture how the data field is changing. Obviously this kind of index is opinionated, and they—Matt and his team—make editorial choices when they decide to include—or not—a company, but still, their selection depicts a reality.</p><p>Today I want to do a second reading of the 4-part article Matt wrote and give my views on it. As Matt said in a <a href="https://www.linkedin.com/events/mattturck-sgonemad-the2023ml-ai7036336040386723840/comments/?ref=blef.fr">LinkedIn live with Joe Reis</a>, the MAD landscape was not published last year (2022) because of time, and the landscape has been totally shaken by 2 major events: the massive layoff wave and the generative AI hype.
As a reminder, in the 2021 edition money was flowing: Databricks did 2 huge rounds with $2.6b raised, and the Snowflake IPO was a success one year after.</p><p>In the MAD landscape there are 3 main parts that I will discuss today:</p><ul><li><strong>Infrastructure and open source infrastructure</strong> — all the data tools everyone wants to use (<em>or not, Talend appears twice in the list </em>🙃); this part depicts well what a data engineer needs to create a data stack.</li><li><strong>Analytics</strong> — this is about the tools we use to query the data lying in the infrastructure.</li><li><strong>Machine learning &amp; AI</strong> — this category has been totally shaken by the generative AI trend; enterprise machine learning in 2023 is not the same as before.</li></ul><p>Before going more into category changes and the macro trends this MAD captures, there are a few interesting facts highlighting some biases this index might have:</p><ul><li>933 companies out of 1414 (65%) are US-based companies</li><li>The continent repartition is 965 (68%) in North America, 182 (12%) in Europe, 74 (5%) in Asia, and 192 companies are open-source, so they don't have a base country</li><li>Median founding year is 2015, which means that half of the companies are less than 7 years old, and 20% are less than 3 years old</li><li>GAFAM have logos everywhere. Amazon is the most represented one with 33 logos, then Google with 30, then Microsoft with 21. Apple and Meta are lower with 2 logos each. It is important to mention that IBM has 12 logos and IBM is the oldest company — 1911.</li></ul><p>Mainly what these facts are saying is that the MAD landscape is dominated by US-based companies, and US-based companies are nowadays deciding how the world should do data, trying to replicate their problems and their vision everywhere. Which is kinda broken. Obviously there are companies and VCs in Europe/Asia, but rare are the ones with the same impact.
Diversity-wise this is a world dominated by the Northern Hemisphere (as always); there is no company in Africa or South America for instance.</p><h4 id="key-insights">Key insights</h4><p>In a nutshell, here are the key insights you need to know if you don't want to read Matt's notes. First regarding data infrastructure:</p><ul><li><strong>The consolidation will come in the next months/years</strong> — every sub-category has between 20 and 30 logos; even if every company thinks it's unique, they often do the same as the others, and the market might not be as large as thought. Also there are a lot of <em>"single-feature companies"</em> which will compete with broader ones and more likely fail because of their offering. Snowflake and Databricks are the adults who will whistle the end of recess.</li><li><strong>Quality and observability are the same</strong> — sorry, but everyone wants to be the <em>"Datadog of data"</em>. Looking at the trend, they all want to do the same.</li><li><strong>The future of data catalogs is unclear</strong> — I really like the definition of catalog Matt gives: "<em>there is a need for an organised inventory of all data assets</em>". Catalogs are still struggling to get adopted even if they seem to be asked for by a part of the industry. There are also too many alternatives.</li><li><strong>With the recession, the modern data stack is attacked</strong> — This is a big shortcut but true. The modern data stack is tightly coupled to ELT, which means load first and think second. When you load first you have more data than you need, which leads to avoidable costs.
The current MDS, with unlimited computing power and storage, might come to an end.</li><li>If you want another perspective with a more exhaustive list of changes you can read <a href="https://annageller.medium.com/2023-state-of-data-infrastructure-key-trends-from-matt-turcks-mad-landscape-7dc24e14a815?ref=blef.fr">Anna's takeaways</a> about the MAD 2023 infra category.</li></ul><div class="kg-card kg-callout-card kg-callout-card-grey"><div class="kg-callout-emoji">ℹ️</div><div class="kg-callout-text">tl;dr — Less money everywhere, optimisation everywhere. The golden age of flooding money is done. We will aim for a simplification of everything because often simple means less cash burn. Less everything.</div></div><div class="kg-card kg-button-card kg-align-center"><a href="https://mattturck.com/mad2023-part-iii/?ref=blef.fr" class="kg-btn kg-btn-accent">Read MAD 2023 — TRENDS IN DATA INFRA</a></div><p>After infrastructure Matt also writes about all the AI impacts:</p><ul><li>The index this year depicts the generative AI hype, with a lot of early-stage startups doing almost everything possible with generative algorithms.</li><li>According to Matt we are now in the 3rd AI hype cycle. This is the largest one because it has reached mainstream coverage. As proof, my father is using ChatGPT (in French "chat" means "cat" and he says CatGPT, which is a bit funny). But yeah, AI became mainstream even if it was already everywhere before, as vertical models.
But now everyone experiences general-purpose intelligence.</li><li>Startups might have difficulties catching up with tech giants on this because they need data, and probably a lot of computing power they might not have.</li><li>There are many backlashes AI companies will have to navigate through: impact on the job market, algorithm bias, disinformation, hallucination—a word for when the AI is just wrong—and lastly, AI is just boring.</li></ul><div class="kg-card kg-button-card kg-align-center"><a href="https://mattturck.com/mad2023-part-iv/?ref=blef.fr" class="kg-btn kg-btn-accent">Read MAD 2023 — TRENDS IN ML/AI</a></div><p>In addition to this, O'Reilly released their <a href="https://www.oreilly.com/radar/technology-trends-for-2023/?ref=blef.fr">technology trends</a> based on the searches on their website. When we focus only on the data field, what we see is:</p><ul><li>Overall Python is the most popular concept and the fastest-growing one — I think this is because Python is the best entry-level language for the IT world.</li><li>When it comes to data, data engineering is the most searched concept and it is growing</li><li>Spark and Hadoop have been searched less than last year</li><li>PowerBI is the 3rd most searched concept and I'm sad about it</li></ul><p></p><h1 id="silicon-valley-bank%E2%80%94wat">Silicon Valley Bank—wat?</h1><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/03/image-3.png" class="kg-image" alt loading="lazy" width="800" height="533" srcset="https://www.blef.fr/content/images/size/w600/2023/03/image-3.png 600w, https://www.blef.fr/content/images/2023/03/image-3.png 800w" sizes="(min-width: 720px) 720px"><figcaption>🤞(<a href="https://unsplash.com/photos/utWyPB8_FU8?ref=blef.fr">credits</a>)</figcaption></figure><p>This is a bit last minute but this is freaking huge.
Let me do a recap for you and explain why it matters.</p><p>Silicon Valley Bank (SVB) is a deposit bank based in California with the biggest market share in the startup world. SVB manages billions of dollars of assets. Mainly the assets come from Silicon Valley startups, founders and employees. In a nutshell, if you are a startup founder and you get millions from a seed round, you put the money in SVB.</p><p>2-3 years ago a lot of money was raised and SVB got around $200 billion in deposits. SVB wanted to put $80 billion of this money to work using Mortgage-Backed Securities (MBS)—just as a reminder, MBS were at the center of the 2008 financial crisis. The MBS guaranteed a 1.5% return, and because interest rates were low due to the pandemic, it was OK. </p><p>In the last months the Fed increased rates, recently crossing the 4.5% mark, which was still OK but started triggering a chain reaction among all the actors. SVB made a first <a href="https://www.prnewswire.com/news-releases/svb-financial-group-announces-proposed-offerings-of-common-stock-and-mandatory-convertible-preferred-stock-301766247.html?ref=blef.fr">mistake</a> that I'm not able to explain.</p><p>Then VCs started panicking (e.g. <a href="https://www.bloomberg.com/news/articles/2023-03-11/thiel-s-founders-fund-withdrew-millions-from-silicon-valley-bank?ref=blef.fr#xj4y7vzkg">Peter Thiel's</a>), advising founders and startups to get their money out of SVB. Which led to a <a href="https://en.wikipedia.org/wiki/Bank_run?ref=blef.fr">bank run</a>.</p><blockquote>A <strong>bank run</strong> or <strong>run on the bank</strong> occurs when many clients withdraw their money from a bank, because they believe the bank may cease to function in the near future</blockquote><p>Then SVB made another <a href="https://twitter.com/mbdailyshow/status/1634225082267516935?s=20&ref=blef.fr">mistake</a>.
One day later the stock was down 60%, and later the same day the bank collapsed.</p><p>What happened here is huge and will have a big impact on every US-based scale-up/startup—it's very well linked to the MAD landscape. Mainly, deposits were only insured up to $250k, which means that a lot of companies will lack cash and probably have difficulties paying salaries and/or vendors soon.</p><p>As a reaction it will, sadly, imply more layoffs in the coming days and weeks. Others are also afraid of a contagion to the whole banking system, as the SVB collapse became the <a href="https://en.wikipedia.org/wiki/List_of_largest_U.S._bank_failures?ref=blef.fr">second-largest bank failure in US history</a>.</p><p></p><h1 id="data-economy-%F0%9F%92%B0">Data Economy 💰</h1><ul><li><a href="https://spacenews.com/lonestar-raises-5-million-for-lunar-data-centers/?ref=blef.fr">Lonestar raises $5m in seed</a> to put data centers on the Moon. Yep, you read that right. Apparently the moon market is projected to generate $105B in revenue over the next decade. While in France we are fighting to retire earlier, people want to send my Twitter history backups to the moon.</li><li><a href="https://www.darkreading.com/risk/employees-feeding-sensitive-business-data-chatgpt-raising-security-fears?ref=blef.fr">Employees are feeding sensitive business data to ChatGPT, raising security fears</a>.</li></ul><hr><p>See you next week with a usual Data News ❤️.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 23.09 ]]></title>
                    <description><![CDATA[ Data News #23.09 — How to get started with dbt, machine learning Saturday, writing as a data eng, SCDs, Snowflake announcements, etc. ]]></description>
                    <link><![CDATA[ /data-news-week-23-09/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 6401b89c0cb29f003dce4779 ]]></guid>
                    <pubDate><![CDATA[ 2023-03-04 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/03/image.png" class="kg-image" alt loading="lazy" width="2000" height="1333" srcset="https://www.blef.fr/content/images/size/w600/2023/03/image.png 600w, https://www.blef.fr/content/images/size/w1000/2023/03/image.png 1000w, https://www.blef.fr/content/images/size/w1600/2023/03/image.png 1600w, https://www.blef.fr/content/images/size/w2400/2023/03/image.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>Formula 1 is back (trying to jinx before it happens) (yes there is no link with the data news) (<a href="https://unsplash.com/photos/7gsDyd2gskA?ref=blef.fr">credits</a>)</figcaption></figure><p>Hello you, I hope this new Data News finds you well. After last week's question about whether you would consider a paying subscription, I got some feedback and it helped me a lot to realise how you see the newsletter and what it means for you. So thank you for that. I'll think about it in the following weeks to figure out where I go for the third year of the newsletter and the blog.</p><p>Stay tuned and let's jump to the content.</p><p>This week I've published a compact article about <a href="https://www.blef.fr/get-started-dbt/">how to get started with dbt</a>. The idea behind this article is to define every dbt concept and object, from the CLI to Jinja templating, models and sources.
The article has been written as something you can add to your own internal dbt onboarding process for every newcomer.</p><div class="kg-card kg-button-card kg-align-center"><a href="https://www.blef.fr/get-started-dbt/" class="kg-btn kg-btn-accent">Read my article — How to get started with dbt</a></div><p></p><h1 id="machine-learning-saturday-%F0%9F%A4%96">Machine Learning Saturday 🤖</h1><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/03/image-1.png" class="kg-image" alt loading="lazy" width="2000" height="1379" srcset="https://www.blef.fr/content/images/size/w600/2023/03/image-1.png 600w, https://www.blef.fr/content/images/size/w1000/2023/03/image-1.png 1000w, https://www.blef.fr/content/images/size/w1600/2023/03/image-1.png 1600w, https://www.blef.fr/content/images/size/w2400/2023/03/image-1.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>Was it a boost ride? (<a href="https://unsplash.com/photos/hd0ZKh8VjhQ?ref=blef.fr">credits</a>)</figcaption></figure><ul><li><a href="https://medium.com/blablacar/how-blablacar-matches-passengers-and-drivers-with-machine-learning-1cf151451f?ref=blef.fr">How BlaBlaCar leverages machine learning to match passengers and drivers</a> — BlaBlaCar is a carpooling company, and in this article they detail what they did—in terms of machine learning—to improve trip listings with a Boost feature that proposes detours to drivers in order to cover more countryside cities. It does not include any generative AI but it greatly shows how machine learning can impact business problems.</li><li><a href="https://engineering.linkedin.com/blog/2023/linkedin-s-responsible-ai-principles-help-meet-the-big-moments-i?ref=blef.fr">Sharing LinkedIn’s Responsible AI Principles</a> — Very short article that lists the 5 principles LinkedIn aims to follow.
In a nutshell: AI should be used as a tool to empower members and augment their success, while prioritising trust, privacy, security and fairness, providing transparency in AI usage, and putting the right governance in place to maintain accountability over AI algorithms.</li><li><a href="https://medium.com/data-monzo/designing-a-regional-experiment-to-measure-incrementality-9326ce6f9248?ref=blef.fr">Designing a regional experiment to measure incrementality</a> — The Monzo team ran a geographical experiment in order to understand how their referral program works.</li></ul><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li><a href="https://medium.com/@luukmes/writing-well-a-data-engineers-advantage-2fd08efaedb0?ref=blef.fr">Writing well: a data engineer’s advantage</a> — This is probably an overlooked part of the data engineer toolkit, but writing is an essential skill. Luuk gives some advice on how to improve your email communications with coworkers, whether to announce a new release or to seek budget for a refactoring project. </li><li><a href="https://towardsdatascience.com/heres-why-your-efforts-extract-value-from-data-are-going-nowhere-8e4ffacbdbc0?ref=blef.fr">Here’s why your efforts to extract value from data are going nowhere</a> — <em>If data science is “making data useful,” then data engineering is “making data usable.” </em>This is a quote from Cassie's article, and I find it awesome. But still, in order to make data work we need to praise the other data coworkers who have to do the documentation and all the governance burden that no one wants to do.</li><li><a href="https://python.plainenglish.io/understanding-slowly-changing-dimensions-scd-in-data-warehousing-20a566ae3fdd?ref=blef.fr">Understanding slowly changing dimensions (SCD) in data warehousing</a> — SCD modeling is an old technique but it is more and more relevant today as we need to keep track of transactional data. The article proposes 6 types of SCDs.
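To make the SCD idea concrete, here is a minimal, hypothetical Python sketch of a type 2 upsert (the record layout and field names are made up; in practice a warehouse MERGE statement or a dbt snapshot does this for you):

```python
from datetime import date

# SCD type 2: a change never overwrites a row; it closes the current version
# (sets valid_to) and appends a new open-ended one, so history is preserved.
def scd2_upsert(history, key, new_value, as_of):
    for row in history:
        if row["key"] == key and row["valid_to"] is None:
            if row["value"] == new_value:
                return history            # no change, nothing to do
            row["valid_to"] = as_of       # close the current version
    history.append({"key": key, "value": new_value,
                    "valid_from": as_of, "valid_to": None})
    return history

h = []
scd2_upsert(h, "cust_1", "Paris", date(2023, 1, 1))
scd2_upsert(h, "cust_1", "Lyon", date(2023, 3, 1))
```

After these two calls the history holds both values, with the Paris row closed on the date the customer moved to Lyon—nothing is lost, which is why type 2 is called lossless.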
I think SCD type 2 is the most common and lossless one, but the others are worth mentioning. As a side note, if you want to understand quickly what SCDs are, the <a href="https://docs.getdbt.com/docs/build/snapshots?ref=blef.fr">dbt snapshots</a> documentation page is the best path to go.</li><li><a href="https://blog.infuseai.io/how-to-run-dbt-with-bigquery-in-github-actions-97ccb1761f4b?ref=blef.fr">How to run dbt with BigQuery in GitHub Actions</a> — When you're starting with dbt you don't need any orchestrator or dbt Cloud; a CI/CD will do it for sure. This article gives you the GitHub Action you need to set it up.</li><li><a href="https://medium.com/snowflake/snowflake-query-acceleration-service-the-warehouse-booster-f24bc41b15b?ref=blef.fr">Snowflake: query acceleration service</a> — Snowflake invented a boost that you activate with a flag at warehouse creation (in Snowflake a warehouse is the compute isolation your queries run in; the bigger the warehouse is, the more compute you use and pay for). When the query acceleration service is activated and Snowflake thinks a query can be accelerated, it will launch more compute than actually specified by your warehouse. Not related, but they also announced <a href="https://www.snowflake.com/blog/snowpipe-streaming-public-preview/?ref=blef.fr">Snowpipe Streaming</a> this week.</li><li><a href="https://netflixtechblog.medium.com/data-ingestion-pipeline-with-operation-management-3c5c638740a8?ref=blef.fr">Data ingestion pipeline with Operation Management</a> — At Netflix they annotate videos, which can lead to thousands of annotations, and they need to manage the annotation lifecycle each time the annotation algorithm runs. This article explains how they did it.</li><li><a href="https://engineering.mixpanel.com/ensuring-data-consistency-across-replicas-cb7d650cb40?ref=blef.fr">Ensuring Data Consistency Across Replicas</a> — Mixpanel details how they ensure that Kafka consumers in different zones are writing the data in the same manner.
This way, when a zone is unavailable they can use the other zone and still have the data without any duplicated or missing messages.</li><li>Pandas 2.0.0 — A new <a href="https://pandas.pydata.org/docs/dev/whatsnew/v2.0.0.html?ref=blef.fr#backwards-incompatible-api-changes">major pandas release</a> is out. In the <a href="https://www.blef.fr/data-news-week-23-02/#polars%E2%80%94pandas-are-freezing">shadow of Polars</a>, which seems to revolutionise DataFrame computation, pandas came with <a href="https://medium.com/@darshilp/pandas-2-0-is-here-427b026ab913?ref=blef.fr">a lot of optimisations and changes</a>.</li><li><a href="https://www.lastweekinaws.com/blog/aws-is-asleep-at-the-lambda-wheel/?ref=blef.fr">AWS Lambdas are still on Python 3.9</a> — Corey rants about AWS Lambdas still using Python 3.9 while all the competition has upgraded to at least Python 3.10.</li><li>A small heads-up: the Apache Airflow team has announced the Airflow Summit for 2023, which will be held in Toronto in September.
They recently opened the <a href="https://sessionize.com/airflow-summit-2023/?ref=blef.fr">call for presentations</a>.</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/03/image-2.png" class="kg-image" alt loading="lazy" width="2000" height="1333" srcset="https://www.blef.fr/content/images/size/w600/2023/03/image-2.png 600w, https://www.blef.fr/content/images/size/w1000/2023/03/image-2.png 1000w, https://www.blef.fr/content/images/size/w1600/2023/03/image-2.png 1600w, https://www.blef.fr/content/images/size/w2400/2023/03/image-2.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>Footage of the new Snowflake query acceleration service—be careful, it burns cash faster than ever (<a href="https://unsplash.com/photos/uj3hvdfQujI?ref=blef.fr">credits</a>)</figcaption></figure><p></p><h1 id="data-economy-%F0%9F%92%B0">Data Economy 💰</h1><ul><li><a href="https://www.qwak.com/?ref=blef.fr"><strong>Qwak</strong></a> <a href="https://techcrunch.com/2023/03/01/qwak-raises-12m-for-its-mlops-platform/?ref=blef.fr">raises $12m Series A</a>. Are ducks the new elephants? Qwak proposes an all-in-one platform to manage all the operations in a machine learning project. In the platform you do the feature engineering, model creation, versioning, deployment and monitoring, with all pipelines automated. I think a lot of platforms like this exist today.</li><li><a href="https://tabular.io/blog/announcing-tabular/?ref=blef.fr">Announcing Tabular</a> — Tabular has been released publicly this week. Tabular is a cloud offering built on Apache Iceberg. It is funny to see their offering because they sell "managed data warehouse storage", which means without the compute: you bring your own compute. Some companies also call this a lakehouse or a data lake, but the word shift is interesting enough to notice.
At least for me.</li><li><a href="https://gradientflow.com/insights-from-new-data-and-ai-pegacorns/?ref=blef.fr">Insights from new data and AI Pegacorns</a> — Ben from GradientFlow gave a few economic insights about the data Pegacorns (companies with more than $100m annual revenue). I don't have much to say except that next year we'll probably see generative AI companies on track to enter the selection.</li></ul><p>I wanted to include a review of the <a href="https://mattturck.com/mad2023/?ref=blef.fr">2023 MAD landscape</a> in this newsletter, but as I was late and it would have become a huge edition, I'll try to write something on it specifically next week.</p><hr><p>See you next week ❤️. </p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ How to get started with dbt ]]></title>
                    <description><![CDATA[ What&#39;s a dbt model, a source and a macro? Learn how to get started with dbt concepts. ]]></description>
                    <link><![CDATA[ /get-started-dbt/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 63ec9ff1ee02be003d51560c ]]></guid>
                    <pubDate><![CDATA[ 2023-03-01 ]]></pubDate>
                    <content>
<![CDATA[ <p>This article is meant to be a resource hub to help you understand dbt basics and get started on your dbt journey.</p><p>When I write dbt, I often mean <a href="https://github.com/dbt-labs/dbt-core?ref=blef.fr">dbt Core</a>. dbt Core is an open-source framework that helps you organise your data warehouse SQL transformations. dbt Core has been developed by dbt Labs, which was previously named <a href="https://www.getdbt.com/blog/welcome-to-fishtown-analytics/?ref=blef.fr">Fishtown Analytics</a>. The company was founded in May 2016. dbt Labs also develops dbt Cloud, a cloud product that hosts and runs dbt Core projects.</p><p>In this resource hub I'll mainly focus on dbt Core—<em>i.e.</em> dbt.</p><p>First let's understand why dbt exists. dbt was born out of the observation that more and more companies were switching from on-premise Hadoop data infrastructure to cloud data warehouses. This switch has been led by the modern data stack vision. In terms of paradigms, before 2012 we were doing <a href="https://en.wikipedia.org/wiki/Extract,_transform,_load?ref=blef.fr">ETL</a> because storage was expensive, so it was a requirement to transform data before storing it—mainly in a data warehouse—in order to have the most optimised data for querying. </p><p>With the public clouds—e.g. AWS, GCP, Azure—the storage price dropped and we became data insatiable: we needed all the company data, in one place, in order to join and compare everything. Enter the ELT. 
In the ELT, the load is done before the transform, without any alteration of the data, leaving the raw data ready to be transformed in the data warehouse.</p><figure class="kg-card kg-image-card kg-width-wide kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/02/image-3.png" class="kg-image" alt loading="lazy" width="1200" height="439" srcset="https://www.blef.fr/content/images/size/w600/2023/02/image-3.png 600w, https://www.blef.fr/content/images/size/w1000/2023/02/image-3.png 1000w, https://www.blef.fr/content/images/2023/02/image-3.png 1200w" sizes="(min-width: 1200px) 1200px"><figcaption>dbt purpose as conceptualised in 2017—which is the same today (<a href="https://www.getdbt.com/blog/what-exactly-is-dbt/?ref=blef.fr">What, exactly is dbt?</a>)</figcaption></figure><p>In simple words, dbt sits on top of your raw data to organise all the SQL queries that define your data assets. And dbt only does the T of the ELT, which makes it really clear in terms of responsibilities.</p><blockquote><em>dbt is a development framework that combines modular SQL with software engineering best practices to make data transformation reliable, fast, and fun.</em></blockquote><p>This was the previous tagline dbt Labs had on their website. It is important to understand that dbt is a framework. Like every framework there are multiple hidden pieces to know before becoming proficient with it. Still, it is very easy to get started.</p><p></p><h1 id="dbt-concepts">dbt concepts</h1><p>There are a few concepts that are super important and we need to define them before going further:</p><ul><li><strong>dbt CLI</strong> — CLI stands for Command Line Interface. When you have <a href="https://docs.getdbt.com/docs/get-started/installation?ref=blef.fr">installed dbt</a> you have the <code>dbt</code> command available in your terminal. 
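To give an idea, a typical session could look like this (a sketch; exact selections and flags depend on your project setup, and <code>my_model</code> is a hypothetical model name):<pre><code># check the warehouse connection and project configuration
dbt debug
# build all the models
dbt run
# run the tests defined in the project
dbt test
# build a single model and nothing else
dbt run --select my_model</code></pre>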
Thanks to this you can run <a href="https://docs.getdbt.com/reference/dbt-commands?ref=blef.fr">a lot of different commands</a>.</li><li><strong>a dbt project</strong> — <a href="https://docs.getdbt.com/docs/build/projects?ref=blef.fr">a dbt project</a> is a folder that contains all the dbt objects needed to work. You can initialise a project with the CLI command: <code>dbt init</code>.</li><li><strong>YAML</strong> — in the modern data era <a href="https://en.wikipedia.org/wiki/YAML?ref=blef.fr">YAML</a> files are everywhere. In dbt you define a lot of configurations in YAML files. In a dbt project you can define YAML files everywhere. You have to imagine that in the end dbt will concatenate all the files to create one big configuration out of them. In dbt we use the <em>.yml</em> extension.</li><li><strong>profiles.yml</strong> — <a href="https://docs.getdbt.com/reference/profiles.yml?ref=blef.fr">This file contains the credentials</a> to connect your dbt project to your data warehouse. By default this file is located in your <code>$HOME/.dbt/</code> folder. I recommend creating your own profiles file and specifying the <code>--profiles-dir</code> <a href="https://docs.getdbt.com/docs/get-started/connection-profiles?ref=blef.fr#advanced-customizing-a-profile-directory">option</a> to the dbt CLI. A connection to a warehouse requires a <a href="https://docs.getdbt.com/docs/supported-data-platforms?ref=blef.fr">dbt adapter</a> to be installed.</li><li><strong>a model</strong> — a model is a select statement that can be materialised as a table or as a view. Models are the most important dbt objects because they are your data assets. All your business logic will live in the model select statements. You should also know that models are defined in <em>.sql</em> files and that the filename is the name of the model by default. 
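For example, a hypothetical file <code>models/stg_orders.sql</code> containing only a select statement defines a model named <code>stg_orders</code>:<pre><code>-- models/stg_orders.sql (hypothetical example)
select
    id as order_id,
    customer_id,
    created_at
from raw_data.orders</code></pre>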
You can also add metadata on models (in YAML).</li><li><strong>a source</strong> — a source refers to a table that has been extracted and loaded—EL—by something outside of dbt. You have to define sources in YAML files.</li><li><strong>Jinja templating</strong> — <a href="https://en.wikipedia.org/wiki/Jinja_(template_engine)?ref=blef.fr">Jinja is a templating engine</a> that has seemingly existed forever in the Python world. A templating engine is a mechanism that takes a template containing "stuff" that will be replaced when the template is rendered by the engine. In the dbt context it means that a SQL query is a template that will be rendered—or compiled—into a SQL query ready to be executed against your data warehouse. By default you can recognise Jinja syntax by the double curly brackets—e.g. <code>{{ something }}</code>.</li><li><strong>a macro</strong> — a macro is a Jinja function that either does something or returns SQL or partial SQL code. Macros can be imported from other dbt packages or defined within a dbt project.</li><li><strong>ref / source macros</strong> — <code>ref</code> and <code>source</code> macros are the most important macros you'll use. When writing a model you'll use these macros to define the relationships between models. Thanks to that, dbt is able to create a dependency tree of all the relations between the models. We call this a DAG. 
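As a sketch, with a hypothetical <code>shop</code> source declared in YAML:<pre><code># models/sources.yml (hypothetical example)
version: 2

sources:
  - name: shop
    tables:
      - name: orders</code></pre>a model can then reference both a source and another model:<pre><code>select o.*
from {{ source('shop', 'orders') }} as o
left join {{ ref('stg_customers') }} as c
    on o.customer_id = c.customer_id</code></pre>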
Obviously <a href="https://docs.getdbt.com/reference/dbt-jinja-functions/source?ref=blef.fr">source</a> defines a relation to a source and <a href="https://docs.getdbt.com/reference/dbt-jinja-functions/ref?ref=blef.fr">ref</a> to another model—it can also be other kinds of dbt resources.</li></ul><p>In a nutshell, the dbt journey starts with source definitions, on top of which you define models that transform these sources into something else you'll need in your downstream usage of the data.</p><div class="kg-card kg-callout-card kg-callout-card-grey"><div class="kg-callout-emoji">ℹ️</div><div class="kg-callout-text">I want to mention that the dbt documentation is one of the best tool documentations out there. So do not hesitate to go there to better understand the concepts we covered. You just have to know that there is the <a href="https://docs.getdbt.com/reference/dbt_project.yml?ref=blef.fr">reference</a> part, which is the detailed documentation of each function or configuration, and the <a href="https://docs.getdbt.com/docs/introduction?ref=blef.fr">documentation</a> part, which is more about concepts and tutorials.</div></div><!--members-only--><h1 id="dbt-entities">dbt entities</h1><p>I don't want to copy-paste the dbt documentation here because I think they did it well. There are multiple dbt entities—or objects; dbt names them resources, but I don't want to clash with resource as in a link. So here are the dbt entities you should be aware of before starting any project; the list below is exhaustive (I hope) and sorted by priority:</p><ul><li><strong>sources / models</strong> — you already know them, this is the key part of your data modelisation.</li><li><strong>tests</strong> — a way to define SQL tests, either at column level or with a query. 
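A column-level example in YAML, using the built-in <code>unique</code> and <code>not_null</code> tests on a hypothetical model:<pre><code># models/schema.yml (hypothetical example)
version: 2

models:
  - name: stg_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null</code></pre>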
The trick is that if the query returns results, the test has failed.</li><li><strong>seeds</strong> — a way to quickly ingest static or reference files defined in CSV.</li><li><strong>incremental models</strong> — a way to define incremental models with an if/else Jinja syntax. Here is the <a href="https://docs.getdbt.com/docs/build/incremental-models?ref=blef.fr">reference</a>. You can choose the strategy you want depending on your adapter (cf. <a href="https://towardsdatascience.com/two-completely-different-types-of-dbt-incremental-models-in-bigquery-db794cbe022c?ref=blef.fr">examples on BigQuery</a>).</li><li><strong>snapshots</strong> — this is how you do slowly changing dimensions, a methodology designed more than 20 years ago to optimise the storage used. The <a href="https://docs.getdbt.com/docs/build/snapshots?ref=blef.fr">dbt snapshot page is the best illustration</a> I know of the SCD.</li><li><strong>macros</strong> — a way to create re-usable functions.</li><li><strong>docs</strong> — in dbt you can add metadata on everything; some of the metadata is already expected by the framework and thanks to it you can generate a small web page with your light catalog inside: you only need to run <code>dbt docs generate</code> and <code>dbt docs serve</code>.</li><li><strong>exposures</strong> — a way to define downstream data usage.</li><li><strong>metrics</strong> — in your modelisation you mainly create dimensions and measures; in dbt you can then define metrics, which are measures grouped by dimensions. The idea is to use metrics downstream to avoid materialising everything. 
You can read my <a href="https://www.blef.fr/metrics-store/">What is a metrics store</a> article to help you understand.</li><li><strong>analyses</strong> — a place to store queries that are either not finished or that you don't want to add to the main modelisation.</li></ul><p>You can read <a href="https://docs.getdbt.com/docs/build/projects?ref=blef.fr">dbt's official definitions</a>.</p><div class="kg-card kg-callout-card kg-callout-card-red"><div class="kg-callout-emoji">⚠️</div><div class="kg-callout-text">I feel it is important to mention again that dbt Core is a framework to organise SQL files and <strong>not a scheduler that can run your transformations on a fixed schedule out of the box</strong>.<br><br>Also, dbt only does a pass-through to your underlying data compute technology; there isn't any kind of processing within dbt. Actually dbt can be seen as an orchestrator with no scheduling capabilities.</div></div><p></p><h1 id="analytics-engineering">Analytics engineering</h1><p>dbt is becoming a popular framework, partly because it is extremely usable. A lot of companies have already picked dbt or aim to. There are multiple technological reasons for this, but technology is rarely the real reason. I think the reasons dbt is becoming the go-to are mainly organisational:</p><ul><li>dbt is a complete tool that you can give to analytics teams; it can become their single playground. Within it they can do almost everything.</li><li>The network effect. Because more and more companies are betting on it, there will be more and more trained people in the market. It's also a strategic choice when it comes to being able to hire people. </li><li>The documentation, as I said earlier, is top-notch.</li></ul><p>dbt Labs also popularised the analytics engineer role. We can quickly summarise the role as sitting in between the data engineer and the data analyst. 
But because companies can have very different definitions of roles, <strong>I'd say that analytics engineering is the practice of creating a data model that accurately represents the business and that is optimised for a variety of downstream consumers</strong>. So the analytics engineers are the ones doing this.</p><p>Given the position and the freshness of this role, people are coming into analytics engineering from data analytics. Usually they don't have a lot of software engineering knowledge and good practices, which is understandable, but the dbt framework is also meant to bring this to the table.</p><p>It is also fair to say that dbt as a tool is very easy to use, and very often the complexity of dbt usage will lie in the SQL writing rather than in the tool itself. There are also a few questions in terms of <a href="https://docs.getdbt.com/guides/best-practices/how-we-structure/1-guide-overview?ref=blef.fr">project structure</a> that need to be answered.</p><div class="kg-card kg-button-card kg-align-center"><a href="https://www.blef.fr/#/portal/signup/free" class="kg-btn kg-btn-accent">Subscribe to the blef.fr newsletter ❤️</a></div><p><em>If you like this article you should subscribe to my weekly newsletter so you don't miss any other articles of this kind.</em></p><h1 id="resources">Resources</h1><p>As I only want to help you get started with the concepts, I now want to redirect you to other articles that I find relevant to go deeper:</p><ul><li><strong>dbt annual conferences </strong>— Every year dbt Labs holds its annual conference, called Coalesce, which features a lot of dbt users and use cases. I've covered the last two with takeaways: <a href="https://www.blef.fr/dbt-coalesce-takeaways/">Coalesce 2021</a> and <a href="https://www.blef.fr/dbt-coalesce-takeaways-2022/">Coalesce 2022</a>. 
In these articles there are a lot of cool presentations you should watch to understand more deeply how dbt works.</li><li><a href="https://docs.google.com/presentation/d/1MKjgNU_2hpq0XalSJAE8FmDATfxfJtu6jZiC8ZrekPc/edit?ref=blef.fr#slide=id.g13de222be64_0_0">Introduction slides about dbt</a> — This is a presentation I often give; you can also watch <a href="https://www.youtube.com/watch?v=Wsl9ExQBgyE&ref=blef.fr">a talk I gave in French</a>, and there is also a <a href="https://www.youtube.com/watch?v=8FZZivIfJVo&ref=blef.fr">great introduction by Seattle Data Guy</a> that I recommend.</li><li>You can do tests in dbt — like: <a href="https://medium.com/hiflylabs/environment-dependent-unit-testing-in-dbt-c081a0a5ff1e?ref=blef.fr">environment-dependent unit testing in dbt</a>, <a href="https://www.datafold.com/blog/7-dbt-testing-best-practices?ref=blef.fr">7 dbt testing best practices</a> or <a href="https://www.synq.io/blog/the-complete-guide-to-building-reliable-data-with-dbt-tests?ref=blef.fr">a guide to building reliable data with dbt tests</a>.</li><li>You should get inspiration from other dbt projects — <a href="https://build.thebeat.co/data-build-tool-dbt-the-beat-story-a5c09471cf66?ref=blef.fr">dbt @Beat</a>, <a href="https://medium.com/vimeo-engineering-blog/dbt-development-at-vimeo-fe1ad9eb212?ref=blef.fr">dbt @Vimeo</a>, <a href="https://medium.com/@imweijian/lessons-learned-after-1-year-with-dbt-a7f0ccf85b12?ref=blef.fr">dbt @ShopBack</a>.</li><li>Optimisation — An issue with dbt is that everything runs in SQL, which means you'll have to optimise a lot of things. 
The dbt Labs team wrote about an <a href="https://docs.getdbt.com/blog/how-we-shaved-90-minutes-off-model?ref=blef.fr">optimisation of a long-running model</a>.</li><li><a href="https://maxhalford.github.io/blog/dbt-ref-rant/?ref=blef.fr">A rant against dbt ref</a> — A great article to make you think about dbt principles.</li><li><a href="https://medium.com/@oravidov/dbt-observability-101-how-to-monitor-dbt-run-and-test-results-f7e5f270d6b6?ref=blef.fr">How to monitor dbt models</a>.</li><li><a href="https://medium.com/snowflake/dbt-constraints-automatic-primary-keys-unique-keys-and-foreign-keys-for-snowflake-d78cbfdec2f9?ref=blef.fr">Generate database constraints with dbt</a>.</li><li>🧑‍🏫 Online courses — I haven't tried any of the courses I'll recommend, but from the background of the mentors I think they are very relevant. There is first a CoRise course, "<a href="https://corise.com/course/data-modeling?ref=blef.fr">Data modeling for the modern data warehouse</a>", that lightly covers dbt and mainly teaches how to do data modeling, and the <a href="https://analyticsengineers.club/?ref=blef.fr">analytics engineers club</a>, which sells a training program to go "from analysts to engineer" in 10 weeks, taught by an ex-dbt Labs employee. You can also contact me if you want something more personalised.</li></ul> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 23.08 ]]></title>
                    <description><![CDATA[ Data News #23.08 — A bit of infrastructure, analytics dev experience, measure everything and usual fast news. ]]></description>
                    <link><![CDATA[ /data-news-week-23-08/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 63f88954e50e38004d9f5381 ]]></guid>
                    <pubDate><![CDATA[ 2023-02-24 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/02/image-7.png" class="kg-image" alt loading="lazy" width="2000" height="1334" srcset="https://www.blef.fr/content/images/size/w600/2023/02/image-7.png 600w, https://www.blef.fr/content/images/size/w1000/2023/02/image-7.png 1000w, https://www.blef.fr/content/images/size/w1600/2023/02/image-7.png 1600w, https://www.blef.fr/content/images/size/w2400/2023/02/image-7.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>Data engineering team moving data manually (<a href="https://unsplash.com/photos/J3W7Kfcj6gM?ref=blef.fr">credits</a>)</figcaption></figure><p>Dear readers, I hope you had a great week. Each time I look back and see the number of Fridays I've spent reading and writing, I'm still surprised. For the last 2 newsletters I've tried to ask you for paid support. From the number of people who actually paid, I can see that I failed either to word it correctly or to propose a newsletter where you see enough value to pay for it.</p><p><strong>In any case, I need your honest feedback: what would make you consider paying for the content I create?</strong></p><p>This is something I struggle with. I really like writing, I really like this newsletter, I really like the blog, but it takes me one day per week. If I want to continue for years I have to find a way to make it sustainable for me, and if I want to go further in this direction I have to find a model that works. I'm <a href="mailto:christophe@blef.fr">open to all honest feedback</a>.</p><p></p><h1 id="a-bit-of-infrastructure">A bit of infrastructure</h1><p>This week I've seen a lot of articles that I can put under the infrastructure category, so here we are. 
The current data landscape is heavily dependent on infrastructure; whether it's cloud, on-premise or somewhere in between, we need to understand where the data lands and where the code runs.</p><p>First Bucky gave his <a href="https://www.kleinerperkins.com/perspectives/infrastructure-in-23/?ref=blef.fr">thoughts about the state of infra in 2023</a>. In a nutshell: JavaScript is the future of everything—we've said it for years, you write once and you run it everywhere, in the browser, on servers; workflow systems are a key piece of every software architecture, as we have a fragmentation of tooling and we want to run tasks one after the other, which means we need something to orchestrate them; finally, the OLAP databases are evolving into something different with many more features.</p><p>In order to improve your data infra you should try to <a href="https://medium.com/geekculture/why-you-should-occasionally-kill-your-data-stack-613143c986ea?ref=blef.fr">occasionally kill your data stack</a>; chaos engineering is something that helps discover issues. Monte Carlo also wrote this week about <a href="https://www.montecarlodata.com/blog-chaos-data-engineering-manifesto/?ref=blef.fr">chaos engineering</a>, with a manifesto.</p><p>When it comes to data storage, the real-time ecosystem has also changed a lot in the last few years and a lot of tooling has come out to ease the burden of managing Kafka clusters; <a href="https://materialize.com/blog/materialize-architecture?ref=blef.fr">Materialize—a real-time platform—detailed their architecture</a>. 
But if you want to keep using the underlying tools, here is an <a href="https://medium.com/@DavidElvis/apache-flink-101-understanding-the-architecture-3a36325035f3?ref=blef.fr">overview of the Flink architecture</a> or a few <a href="https://medium.com/@kestra-io/techniques-you-should-know-as-a-kafka-streams-developer-32442ac39925?ref=blef.fr">techniques you should know as a Kafka streams developer</a>.</p><p>Finally Whatnot shared how they migrated their <a href="https://medium.com/@whatnotengineering/signed-sealed-delivered-its-shipped-ee5befc4bcba?ref=blef.fr">CD processes to ArgoCD</a> and <a href="https://medium.com/pinterest-engineering/pinterest-is-now-on-http-3-608fb5581094?ref=blef.fr">Pinterest now uses HTTP/3</a>, which I didn't even know existed.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/02/image-8.png" class="kg-image" alt loading="lazy" width="800" height="555" srcset="https://www.blef.fr/content/images/size/w600/2023/02/image-8.png 600w, https://www.blef.fr/content/images/2023/02/image-8.png 800w" sizes="(min-width: 720px) 720px"><figcaption>Is it Kafka? (<a href="https://unsplash.com/photos/lRoX0shwjUQ?ref=blef.fr">credits</a>)</figcaption></figure><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li><a href="https://petrjanda.substack.com/p/modern-analytics-developer-experience?ref=blef.fr">The future analytics developer experience</a> — For a few months now we've seen articles complaining about the current analytics development experience. Often they are right. At the moment the best way to develop in your data warehouse is still the query editor of BigQuery / Snowflake / etc.; even if we have tools trying to provide a great experience, as Petr says in the article, we still lack something. 
I hope it will change.</li><li><a href="https://blog.dashlane.com/measuring-b2b-customer-satisfaction-how-dashlane-leverages-a-unique-data-driven-approach/?ref=blef.fr">Measuring B2B customer satisfaction</a> — The Dashlane team shares how they measure customer satisfaction. I really like the KPI framework they put in place and how it translates into charts. </li><li><a href="https://eventuallycoding.com/en/2023/02/measuring-everything?ref=blef.fr">Measuring everything</a> — This post is a proposition and a signal that you should measure absolutely everything to understand what is happening in your product. This goes further than being a <a href="https://www.linkedin.com/pulse/data-dead-why-data-driven-enterprise-doa-wouter-van-aerle%3FtrackingId=mq%252BX%252BHKiTuevcoxh0QpWew%253D%253D/?trackingId=mq%2BX%2BHKiTuevcoxh0QpWew%3D%3D&ref=blef.fr">data-driven enterprise</a>: you have to put in place a framework that puts data measurement into every product choice, resulting in increased maturity.</li><li><a href="https://hubertdulay.substack.com/p/stream-processing-vs-real-time-olap?ref=blef.fr">Stream processing vs real-time OLAP vs streaming database</a> — The data storage + compute field is slowly becoming a mess: a lot of technologies that are so close yet so far apart at the same time. Hubert tries to explain the real-time category.</li><li><a href="https://blog.devgenius.io/data-engineers-and-kubernetes-do-you-really-need-to-know-it-all-4eb81ee48ee7?ref=blef.fr">Data Engineers and Kubernetes</a> — A 101 guide to Kubernetes concepts and why, as a data engineer, you should understand basic Kubernetes entities. 
</li><li><a href="https://www.startdataengineering.com/post/code-patterns/?ref=blef.fr">Coding patterns in Python</a> — Start Data Engineering is one of the best data engineering blogs, and this time he proposes a few patterns you might need to implement in Python when building data pipelines.</li><li><a href="https://www.etsy.com/codeascraft/scaling-etsy-payments-with-vitess-part-1--the-data-model?utm_source=OpenGraph&utm_medium=PageTools&utm_campaign=Share">Etsy payments data model</a> — Articles are often about technologies and rarely about the actual data modeling; this time the Etsy team shared their thinking while re-modeling payments. Sadly this is more about transactional improvements and choices than analytics.</li><li><a href="https://louishourcade.github.io/aws-toucan-website/?ref=blef.fr">Shark attacks visualisation</a> — This is a great example of embedded analytics. Louis deployed a version of ToucanToco—a BI tool—using Redshift to visualise data about shark attacks. Surprisingly, 3 shark attacks in Italy were deadly; I'll be more careful next time I swim in the Mediterranean.</li></ul><p></p><h1 id="data-economy-%F0%9F%92%B0">Data Economy 💰</h1><ul><li><strong>Qbeast</strong> <a href="https://qbeast.io/qbeast-raises-e2-5m-to-make-data-lakes-fast-and-easy-to-use/?ref=blef.fr">raises €2.5m seed</a>. It is interesting to see that data lake platforms can still raise money in 2023. 
Qbeast proposes a different way to organise data to optimise query performance; still, it seems they use Spark.</li><li><a href="https://techcrunch.com/2023/02/21/openai-foundry-will-let-customers-buy-dedicated-capacity-to-run-its-ai-models/?guccounter=1&guce_referrer=aHR0cHM6Ly93d3cuZ29vZ2xlLmNvbS8&guce_referrer_sig=AQAAADK6RWfl9CX34b2E1bokCJapmssUubREhN3BxW8gWZq_zeTUFm3AWdtPmXUfbfGDjHn3xFoGV2_70Fmao7Cw9_ZrURTM8l1ZQpykti_Ex1lVpImicGvRl42CmdQE0-SykO5amA-KrX9L0hXXV3Zkb7y9E5Apps-80ye5sNqG-kxZ&ref=blef.fr">OpenAI new strategy</a> — Someone on Twitter reported that OpenAI privately announced a product called Foundry that would enable customers to run OpenAI models on dedicated capacity with full control over the model configuration and profile.</li></ul><hr><p>See you next week ❤️ — small edition today, blank page issues 🫠</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 23.07 ]]></title>
                    <description><![CDATA[ Data News #23.07 — What&#39;s DataOps, decrease ETL costs with Arrow, the case for being biased, data validation framework... ]]></description>
                    <link><![CDATA[ /data-news-week-23-07/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 63ef59136a405f003d38f316 ]]></guid>
                    <pubDate><![CDATA[ 2023-02-18 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/02/image-4.png" class="kg-image" alt loading="lazy" width="2000" height="1335" srcset="https://www.blef.fr/content/images/size/w600/2023/02/image-4.png 600w, https://www.blef.fr/content/images/size/w1000/2023/02/image-4.png 1000w, https://www.blef.fr/content/images/size/w1600/2023/02/image-4.png 1600w, https://www.blef.fr/content/images/size/w2400/2023/02/image-4.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>When the Data News lands on Saturday (<a href="https://unsplash.com/photos/msC14JchkKU?ref=blef.fr">credits</a>)</figcaption></figure><p>In last week's newsletter I also shared what a metrics store is, which led to a longer edition than usual, and I saw that a few people did not like it this way. It was an experiment; I'll see in the future how I can do it better. Still, <a href="https://www.blef.fr/metrics-store/">what is a metrics store?</a> You can check out the post extracted from the newsletter. </p><p>On the same topic this week Pierre shared <a href="https://medium.com/plum-living/building-a-semantic-layer-in-preset-superset-with-dbt-71ee3238fc20?ref=blef.fr">how to create a semantic layer in Preset</a>—<em>i.e.</em> managed Apache Superset. To do so, he first defines metrics within dbt and then, thanks to CI/CD, pushes the metrics definitions to Preset. This is a great example of a simple way to push down metrics to visualisation tools.</p><p></p><h1 id="is-dataops-really-a-thing">Is DataOps really a thing?</h1><p>Last year DataOps was used in many different ways to describe many different data-related tasks. When you look deeply at it, some companies put just generic data work behind the DataOps word. 
Which is a bit misleading when you read that <a href="https://servian.dev/the-real-definition-of-dataops-9016ccee2f1b?ref=blef.fr"><em>DataOps is "DevOps for data"</em></a>—because, with everything it wraps, DevOps is something different from software engineering. </p><p>I personally do share this perspective. Data engineering is mainly software engineering applied to data, or at least we try. If we see it this way, it is logical to say that DataOps is the movement to smooth the operations side, which technically means the infrastructure side—the IT, as previous generations said; I don't like "IT", it makes me feel old. Data engineering is also an infrastructure-heavy field with a lot of technologies to put together to create something that works. This is why DataOps is important. This is why Infrastructure as Code is mandatory.</p><p>To me it stops here; all the marketing derivations saying we do data products using a DataOps methodology are just marketing. Actually you are just writing code applied to data and using Docker containers to deploy it in the cloud. I think we should stick to software engineering vocabulary.</p><p>It also means that the <a href="https://siliconangle.com/2023/02/10/evolving-role-data-engineer/?ref=blef.fr">data engineer role is constantly evolving</a>. Especially with the appearance of the analytics engineer role. Analytics engineers are taking tasks off data engineers' plates—which is for the better, tbh. Data engineers will have to focus more on software and on infrastructure, shifting the expertise. Analytics engineers will become the data modeling experts. Data engineers will own the infrastructure side and the software related to the data team—which is already too broad a field with different ownerships (DS, MLE, etc.).</p><p>In the end, when I deploy data apps I end up writing Dockerfiles with CI/CD processes and looking for cloud services to host my containers. 
If this is not DevOps, what is it?</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/02/image-5.png" class="kg-image" alt loading="lazy" width="2000" height="1335" srcset="https://www.blef.fr/content/images/size/w600/2023/02/image-5.png 600w, https://www.blef.fr/content/images/size/w1000/2023/02/image-5.png 1000w, https://www.blef.fr/content/images/size/w1600/2023/02/image-5.png 1600w, https://www.blef.fr/content/images/size/w2400/2023/02/image-5.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>I do stuff in prod (<a href="https://unsplash.com/photos/zWOgsj3j0wA?ref=blef.fr">credits</a>)</figcaption></figure><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li><a href="https://www.castordoc.com/blog/three-faces-of-documentation?ref=blef.fr">Unveiling the three faces of documentation</a> — Practical advice about data documentation and how you can leverage it through 3 main axes: assets knowledge, business knowledge and team onboarding.</li><li><a href="https://www.databricks.com/blog/2023/02/14/announcing-a-native-visual-studio-code-experience-for-databricks.html?ref=blef.fr">Databricks announced a VS Code extension</a> — This is small news, but still interesting to see an all-in-one platform like Databricks going in this direction, providing end-users an extension that supports their way of writing code rather than the vendor's. 
</li><li>📺 <a href="https://open.spotify.com/episode/5a9ONFThoYut90H3wxU3zH?ref=blef.fr">Understanding the business as a data analyst</a> — A podcast about the privileged business position data analysts have, but also their responsibility to understand and model the business correctly in order to provide the best value to data users.</li><li><a href="https://medium.com/@rcpassos/how-i-decreased-etl-cost-by-leveraging-the-apache-arrow-ecosystem-37b6d076bd54?ref=blef.fr">Decrease ETL costs with Apache Arrow</a> — I've often written data extractions with pandas using <code>pd.read_sql</code> because it's super handy and you can have something that works quickly, but the memory cost can be high. This article shows how you can do it with Polars, which leverages Arrow to use less memory.</li><li><a href="https://blog.picnic.nl/deploying-data-pipelines-using-the-saga-pattern-ffc1cbe29cee?ref=blef.fr">Deploying data pipelines using the Saga pattern</a> — When you enter the real-time world, the way you think about data pipelines is a bit different, and it can be overwhelming when you come from the batch world. The Saga pattern is a pattern meant to ensure consistency in the system first. Here Picnic showcases the usage of dead letter queues. </li><li><a href="https://benn.substack.com/p/the-case-for-being-biased?ref=blef.fr">The case for being biased</a> — It's been a long time since I last featured Benn's posts; they are still awesomely written. It responds well to "<a href="https://www.winwithdata.io/p/analytics-is-not-about-data-its-about?ref=blef.fr">Analytics is not about data. It's about truth</a>", which I shared last week. Benn thinks about the role of a data team in the business decision-making journey.</li><li><a href="https://dropbox.tech/infrastructure/balancing-quality-and-coverage-with-our-data-validation-framework?ref=blef.fr">Balancing quality and coverage with our data validation framework</a> — The Dropbox tech team developed a data validation framework in SQL.
The validation runs as an Airflow operator every time new data has been ingested. In terms of design, only one query runs—for performance reasons—and if the query returns anything other than zeros, it means something is going wrong. This validation process is also a staging step before sending a table to production.</li><li><a href="https://chengzhizhao.medium.com/i-built-a-game-for-data-visualization-with-streaming-data-fe05ce6018f?ref=blef.fr">I built a game for data visualization with streaming data</a> — Fun project. How to use streaming data to create a real-time JavaScript visualisation as a video game.</li><li>Pedram developed a <a href="https://github.com/PedramNavid/dbtpal?ref=blef.fr">NeoVim extension for dbt users</a>. If you're not familiar with Vim or NeoVim, <a href="https://www.freecodecamp.org/news/vim-language-and-motions-explained/?ref=blef.fr">Simon explained what Vim is</a>, and why it is more than an editor.</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/02/image-6.png" class="kg-image" alt loading="lazy" width="2000" height="1125" srcset="https://www.blef.fr/content/images/size/w600/2023/02/image-6.png 600w, https://www.blef.fr/content/images/size/w1000/2023/02/image-6.png 1000w, https://www.blef.fr/content/images/size/w1600/2023/02/image-6.png 1600w, https://www.blef.fr/content/images/size/w2400/2023/02/image-6.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>There is a village called Vim in Indonesia—originally Vim stands for vi iMproved (<a href="https://unsplash.com/photos/vOTBmRh3-7I?ref=blef.fr">credits</a>)</figcaption></figure><p></p><h1 id="data-economy-%F0%9F%92%B0">Data Economy 💰</h1><ul><li><a href="https://www.synq.io/blog/europe-data-salary-benchmark-2023?ref=blef.fr">Europe data salary benchmark 2023</a> — Mikkel has become one of the best in Europe at picturing the data field accurately, by running benchmarks and studies across the whole market.
This time he is looking at salaries. To me, as a French person, the craziest number is seeing that senior positions—5+ years—in Europe are compensated with six figures.</li><li>Side note: this week I realised that <a href="https://duckdblabs.com/?ref=blef.fr">DuckDB Labs</a> is the team behind DuckDB, not MotherDuck, which partnered with them to bring the duck technology to everyone.</li></ul><hr><p>See you next week.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ What is the metrics store ]]></title>
<description><![CDATA[ What is the metrics store? What are the key differences with the metrics layer or the semantic layer? ]]></description>
                    <link><![CDATA[ /metrics-store/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 63e6a250ee02be003d514a53 ]]></guid>
                    <pubDate><![CDATA[ 2023-02-13 ]]></pubDate>
                    <content>
<![CDATA[ <p>This week <a href="https://www.getdbt.com/blog/dbt-acquisition-transform/?ref=blef.fr">dbt Labs announced the intention to acquire Transform</a>. While you should already be aware of what dbt is, there are still unknowns about what Transform is. Transform is a company founded by ex-Airbnb employees—which is important here—that proposes an open-source metrics framework and a SaaS metrics store. </p><p>At the moment Transform is a small company compared to dbt Labs: only 40 employees according to LinkedIn, and they raised around $25m. That is only 10% of dbt Labs' current workforce. But I think this acquisition matters and will shape our data stacks.</p><p>In the past I've made jokes about the naming confusion the data field was in, especially with the following terms: semantic layer, metrics layer, metrics store, headless BI, feature store. This is what I want to demystify today. I've spent the whole day reading and watching content in this category and I want to help you understand what it means for us. As a side note, it's fair to say that I also wasn't a believer in the actual necessity of this infrastructure piece. After a full day of research I'm more into it, but we have to be careful.</p><p></p><h1 id="first-definitions">First, definitions</h1><p>Before going further I have to write down some definitions. These definitions are mine, and if you think I'm wrong I'd be more than happy to get your feedback. It is also super hard to have a universal definition across all vendors—as can be seen in this <a href="https://www.youtube.com/watch?v=Toqg0Yuz9b4&ref=blef.fr">discussion</a>.</p><ul><li><strong>Measure</strong> — a measure is a value on which we can do all sorts of computations (addition, multiplication, etc.); in a warehouse context we do aggregations on measures (sum, count, avg). A measure is often numerical, but not necessarily.
As an example, the <em>order price</em> is a measure.</li><li><strong>Dimension</strong> — a dimension is something that categorises a measure; it adds context to it. You can use a dimension to filter or group the data. For instance, the <em>order date</em> is a dimension.</li></ul>
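<p>To make these two definitions concrete, here is a minimal sketch using a hypothetical <code>orders</code> table in an in-memory SQLite database—the table and values are purely illustrative, not from any real warehouse:</p>

```python
import sqlite3

# Hypothetical orders table: order_price is the measure, order_date the dimension.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (order_id INTEGER, order_date TEXT, order_price REAL)")
con.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "2023-02-01", 10.0), (2, "2023-02-01", 25.0), (3, "2023-02-02", 40.0)],
)

# The measure is what gets aggregated (SUM); the dimension is what we group by.
rows = con.execute(
    "SELECT order_date, SUM(order_price) AS revenue "
    "FROM orders GROUP BY order_date ORDER BY order_date"
).fetchall()
print(rows)  # [('2023-02-01', 35.0), ('2023-02-02', 40.0)]
```

<p>In a real warehouse you would run the same <code>GROUP BY</code> directly in SQL; the point is only that the measure is the aggregated value while the dimension is what you group or filter by.</p>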
<aside class="gh-post-upgrade-cta">
    <div class="gh-post-upgrade-cta-content" style="background-color: #373f48">
                <h2>This post is for subscribers only</h2>
            <a class="gh-btn" data-portal="signup" href="#/portal/signup" style="color:#373f48">Subscribe now</a>
            <p><small>Already have an account? <a data-portal="signin" href="#/portal/signin">Sign in</a></small></p>
    </div>
</aside>
 ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 23.06 ]]></title>
                    <description><![CDATA[ Data News #23.06 — Understand the metrics store, Bard, migrate from Airflow to Dagster, lower Snowflake costs and data economy news. ]]></description>
                    <link><![CDATA[ /data-news-week-23-06/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 63e5fdb5ee02be003d513b1f ]]></guid>
                    <pubDate><![CDATA[ 2023-02-10 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/02/image-2.png" class="kg-image" alt="" loading="lazy" width="2000" height="1333" srcset="https://www.blef.fr/content/images/size/w600/2023/02/image-2.png 600w, https://www.blef.fr/content/images/size/w1000/2023/02/image-2.png 1000w, https://www.blef.fr/content/images/size/w1600/2023/02/image-2.png 1600w, https://www.blef.fr/content/images/2023/02/image-2.png 2000w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">This is what the metrics store inspires in me (</span><a href="https://unsplash.com/photos/JsdvKIcvAGo?ref=blef.fr"><span style="white-space: pre-wrap;">credits</span></a><span style="white-space: pre-wrap;">)</span></figcaption></figure><p>Dear Data News friend, every week there is a bit of randomness about when this email will actually land in your mailbox—which, btw, breaks all the rules of newsletter writing. Yeah, you know, you have to get your readers used to a fixed schedule, which they can trust and bla, bla, bla. The good news is that at least with me you can trust that I have no schedule, except that you should get the newsletter on Friday or Saturday.</p><p>While I feel privileged to be able to send my thoughts to so many people every week, it takes me a significant amount of time to craft and write the newsletter. I ask you to consider supporting me by becoming a paying subscriber. Especially if you think, like me, that the newsletter is great.</p><div class="kg-card kg-button-card kg-align-center"><a href="https://www.blef.fr/#/portal/signup" class="kg-btn kg-btn-accent">Become a paid subscriber 💰</a></div><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li>News from the generative AI universe — Google announced <a href="https://blog.google/technology/ai/bard-google-ai-search-updates/?ref=blef.fr">Bard</a>, a competitor to ChatGPT, but with better ethics, etc.
At the same time, Microsoft opened the <a href="https://www.theverge.com/2023/2/8/23590873/microsoft-new-bing-chatgpt-ai-hands-on?ref=blef.fr">ChatGPT integration with Bing</a> in beta. Closer to us, in the data space Hex proposed a <a href="https://hex.tech/blog/magic-private-beta/?ref=blef.fr">prompt that can do magic for you</a>.</li><li><a href="https://motherduck.com/blog/big-data-is-dead/?ref=blef.fr">Big Data is Dead</a> — A retrospective on why we no longer need as much computing power as before. Obviously the article is biased because it comes from DuckDB's mother company. As a reminder, DuckDB runs on a single node, fitting all compute in memory. But the article is relevant nonetheless.</li><li><a href="https://dagster.io/blog/dagster-airflow-migration?ref=blef.fr">Migrating from Airflow to Dagster is now a breeze</a> — In the orchestration competition Dagster made a step forward: they developed tooling to ease the migration from one to the other, and one side-effect is that you can orchestrate Dagster DAGs from Airflow. In order to understand the Dagster philosophy you should <a href="https://askvinnie.substack.com/p/now-youre-thinking-with-assets?ref=blef.fr">now think with assets</a>.</li><li><a href="https://medium.com/ovrsea/data-analytics-framework-in-python-from-scientific-approach-to-actionable-implementation-d47737382769?ref=blef.fr">Data Analytics framework in Python: from scientific approach to actionable implementation</a> — A framework to conduct data analysis in Python.</li><li><a href="https://medium.com/the-prefect-blog/should-you-measure-the-value-of-a-data-team-95c447f28d4a?ref=blef.fr">Should you measure the value of a data team?</a> — Considerations about measuring the job a data team is doing and which metrics you should go for.</li><li><a href="https://www.winwithdata.io/p/analytics-is-not-about-data-its-about?ref=blef.fr">Analytics is not about data.
It's about truth.</a> — This one is a hot take, because what's the truth?</li><li><a href="https://engineeringblog.yelp.com/2023/01/rebuilding-a-cassandra-cluster-using-yelps-data-pipeline.html?ref=blef.fr">Rebuilding a Cassandra cluster using Yelp’s Data Pipeline</a> — It's awesome when we can use our data engineering skills not only to do analytics but also to help fellow tech teams with tasks that are hard to do.</li><li><a href="https://stemma.webflow.io/blog-post/how-to-fix-your-etl-to-lower-snowflake-costs?ref=blef.fr">How to fix your ETL to lower Snowflake Costs</a> — Mark shares 3 Snowflake queries that you can run to get table usage in order to identify what costs a lot.</li><li><a href="https://www.dataengineeringpodcast.com/six-year-retrospective-episode-361?ref=blef.fr">Reflecting on the past 6 years of data engineering</a> — This is a podcast episode (which I did not listen to for lack of time).</li><li><a href="https://www.synq.io/blog/the-complete-guide-to-building-reliable-data-with-dbt-tests?ref=blef.fr">The complete guide to building reliable data with dbt tests</a> — 10 practical points to improve your dbt tests.</li></ul><p></p><h1 id="data-economy-%F0%9F%92%B0">Data Economy 💰</h1><ul><li><a href="https://www.acceldata.io/?ref=blef.fr"><strong>Acceldata</strong></a> <a href="https://www.acceldata.io/newsroom/acceldata-raises-fifty-million-in-series-c-funding?ref=blef.fr">raises $50m in Series C</a>. Acceldata looks like an enterprise data observability tool that does everything other data observability tools are doing. Like drawing charts that show that you probably have issues 🫠.</li><li>Recently the Kafka company (Confluent) acquired the Flink company (Immerok); economically it means a lot and <a href="https://hubertdulay.substack.com/p/the-stream-processing-shuffle?ref=blef.fr">reshuffles company strategies</a>.
In addition, RisingWave also shared views on <a href="https://www.risingwave-labs.com/blog/Rethinking_stream_processing_and_streaming_databases/?ref=blef.fr">why you probably need a stream processing system</a>.</li><li><a href="https://thebuilderjr.substack.com/p/why-big-tech-companies-need-so-many?ref=blef.fr">Why big tech companies need so many people</a> — this is a good economic question. For instance, Twitter should be easy to copy. Why do they need thousands of engineers to develop a website that I could re-develop over a weekend?</li><li><a href="https://www.getdbt.com/blog/dbt-acquisition-transform/?ref=blef.fr">dbt Labs intends to acquire Transform</a>. I just put this here for people who do not read the first part of the newsletter 🫠.</li></ul><hr><p>See you next week ❤️</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 23.05 ]]></title>
                    <description><![CDATA[ Data News #23.05 — machine learning at big tech, Airflow in Azure, think in SQL, dbt and snowflake clones, generative Seinfeld. ]]></description>
                    <link><![CDATA[ /data-news-week-23-05/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 63db499fc33306003db24243 ]]></guid>
                    <pubDate><![CDATA[ 2023-02-03 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/02/image.png" class="kg-image" alt loading="lazy" width="900" height="600" srcset="https://www.blef.fr/content/images/size/w600/2023/02/image.png 600w, https://www.blef.fr/content/images/2023/02/image.png 900w" sizes="(min-width: 720px) 720px"><figcaption>Delivering the data news (<a href="https://unsplash.com/photos/hE1MjkZQPpI?ref=blef.fr">credits</a>)</figcaption></figure><p>Hey you, it's already February. Every week, the same analysis for me: I plan too many tasks but I slowly deliver. I guess that's how it is. Still, I love this Friday <em>rendezvous</em> that we have together. I'm still amazed by how I changed my old habits to add writing to my workflow. And it brings me a lot of joy.</p><p>This is also funny because I don't consider newsletter writing to be work. Which is maybe a bit stupid, but when I work on the newsletter I upskill myself, I read, I discover stuff, I meet with people. Still, the newsletter takes 1 day per week to be done, which is significant enough to call it work. I wish everyone finds that little thing that is actually work but makes work feel less like work.</p><p>I'd like to write more about my time organisation and especially about my freelancing activities, but today is a day where I have less time for the newsletter, so this is more an appetizer for later. Let's jump directly to the news.</p><p></p><h1 id="ml-friday-%F0%9F%A4%96">ML Friday 🤖</h1><ul><li><a href="https://netflixtechblog.com/discovering-creative-insights-in-promotional-artwork-295e4d788db5?ref=blef.fr">Netflix, discovering creative insights in promotional artwork</a> — That's probably the reason Netflix is now very conventional in terms of artwork. The article shows how Netflix art creators use past data to create new artworks.
In the end this is a feedback loop, where everything ends up looking similar.</li><li><a href="https://tech.ebayinc.com/engineering/variable-hub-easier-data-integration-for-risk-decisioning/?ref=blef.fr">ebay, Variable Hub a data access layer for risk decisioning</a> — Looks like a feature store but for risk topics. The idea is to create a unified layer that stores all the data needed to take decisions.</li><li><a href="https://eng.lyft.com/powering-millions-of-real-time-decisions-with-lyftlearn-serving-9bb1f73318dc?ref=blef.fr">Lyft, powering millions of real-time decisions with LyftLearn Serving</a> — The architecture of the decentralized system Lyft uses to deploy and serve ML models.</li><li><a href="https://engineering.atspotify.com/2023/02/unleashing-ml-innovation-at-spotify-with-ray/?ref=blef.fr">Spotify, Unleashing ML Innovation at Spotify with Ray</a> — I've never used Ray in the past, but it looks promising as a unified way to describe machine learning pipelines no matter which framework you want to use.</li></ul><p>It is refreshing to see big tech machine learning articles that still look like the machine learning we were doing 2 years ago.</p><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li><a href="https://technically.substack.com/p/whats-the-modern-data-stack?ref=blef.fr">What's the Modern Data Stack?</a> — Another post about what the modern data stack is. The article is a good summary of the required blocks composing a modern data stack. You can also get inspired by <a href="https://medium.com/stuart-engineering/stuarts-data-journey-how-gustavo-and-the-bi-engineering-team-polished-the-t-in-elt-9f1c17402abe?ref=blef.fr">Stuart's modern data stack</a>.</li><li><a href="https://madisonmae.substack.com/p/analytics-engineer-a-glorified-bi?ref=blef.fr">Analytics Engineer- A Glorified BI Engineer?</a> — I feel guilty: I still think that Analytics Engineers are BI Engineers. But BI Engineers for modern data stack times.
In this post Madison tries to compare the two roles. In the end, the answer is: it depends. The Analytics Engineer role is still unclear and varies from company to company. What often stays is that the AE sits between DE and DA, so the role is often defined in complement to other positions. </li><li><a href="https://techcommunity.microsoft.com/t5/azure-data-factory-blog/introducing-managed-airflow-in-azure-data-factory/ba-p/3730151?ref=blef.fr">Microsoft Azure announced managed Airflow</a> — Starting this week you'll be able to launch Apache Airflow within Azure Data Factory. The feature is in public preview. The way they integrated it within Azure looks a bit weird, but at least it exists now.</li><li><a href="https://pedram.substack.com/p/streaming-data-pipelines-with-striim?ref=blef.fr">Change data capture with DuckDB</a> — Pedram had a sneak peek of the future: he tried a CDC setup (with Striim) that writes to GCS, with DuckDB computing metrics downstream.</li><li><a href="https://www.synq.io/blog/data-team-size-at-100-scaleups?ref=blef.fr">Data team as % of workforce</a> — Mikkel is a reference when speaking about data team size. This week he categorised companies by data team size as a % of workforce. For instance he found that Marketplace companies have bigger data teams than B2B ones. It makes sense.</li><li><a href="https://leerob.substack.com/p/databases-serverless-edge?ref=blef.fr">2023 state of databases for Serverless &amp; Edge</a> — I did not know the serverless database field was so innovative right now. All things considered this is a normal evolution: database connections are from an older time and web developers want direct access to databases.
It is interesting to see how serverless Postgres is going.</li><li><a href="https://towardsdatascience.com/think-in-sql-avoid-writing-sql-in-a-top-to-bottom-approach-476a67f53a59?ref=blef.fr">Think in SQL, avoid writing SQL in a top to bottom approach</a> — A nice post about the mismatch between the logical query processing order and the syntactic order of SQL queries. </li><li><a href="https://pub.towardsai.net/parquet-best-practices-the-art-of-filtering-d729357e441d?ref=blef.fr">Parquet best practices: the art of filtering</a> — How to leverage Parquet filtering to save processing time.</li><li><a href="https://blog.montrealanalytics.com/optimizing-dbt-development-with-snowflake-clones-9bce961db64d?ref=blef.fr">Optimizing dbt development with Snowflake clones</a> — dbt development in a large data warehouse can become expensive if you ask every dbt developer to <em>dbt run</em> the whole SQL tree. Montreal Analytics proposes a solution with Snowflake db clones. You can also use the dbt <a href="https://docs.getdbt.com/reference/node-selection/defer?ref=blef.fr">--defer</a> option, which does something similar.</li><li><a href="https://betterprogramming.pub/great-data-platforms-use-conventional-commits-51fc22a7417c?ref=blef.fr">What if we use CHANGELOG in our data projects?</a> — It is important to have a consistent nomenclature when naming commits and changes; sadly the same should apply to dashboards, but that is hard to do.</li><li><a href="https://medium.com/artefact-engineering-and-data-science/how-we-deployed-a-simple-wildlife-monitoring-system-on-google-cloud-78b847cab10c?ref=blef.fr">How we deployed a simple wildlife monitoring system on Google Cloud</a> — Artefact engineered a serverless platform on GCP for wildlife monitoring.</li><li>📺 <a href="https://www.twitch.tv/watchmeforever?ref=blef.fr">Seinfeld-like sitcom generated by AI 24/7 live on Twitch</a> — It is amazing how far we are able to go today in terms of content generation.</li></ul><figure
class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/02/image-1.png" class="kg-image" alt loading="lazy" width="2000" height="1500" srcset="https://www.blef.fr/content/images/size/w600/2023/02/image-1.png 600w, https://www.blef.fr/content/images/size/w1000/2023/02/image-1.png 1000w, https://www.blef.fr/content/images/size/w1600/2023/02/image-1.png 1600w, https://www.blef.fr/content/images/size/w2400/2023/02/image-1.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>Few Snowflake clones (<a href="https://unsplash.com/s/photos/snowflake-clones?ref=blef.fr">credits</a>)</figcaption></figure><p></p><h1 id="data-economy-%F0%9F%92%B0">Data Economy 💰</h1><ul><li><strong><a href="https://www.selectstar.com/?ref=blef.fr">Select Star</a></strong> <a href="https://www.businesswire.com/news/home/20230131005354/en/Select-Star-Raises-15-Million-in-Series-A-Funding-Led-by-Lightspeed-Venture-Partners?ref=blef.fr">raises $15m in Series A</a>. Select Star is another data catalog that automatically connects to your tools and provides the usual data catalog UI based on a search bar with metadata management inside. Nothing new under the sun.</li></ul><p></p><hr><p>See you next week ❤️.</p><p><em>PS: and sorry it was a fast data news today. I have a big presentation to prepare for Monday. I wish you a great weekend.</em></p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 23.04 ]]></title>
                    <description><![CDATA[ Data News #23.04 — GPT safe place here, dbt, Airflow, Dagster, data modeling and contracts, data creative people a lot of news. ]]></description>
                    <link><![CDATA[ /data-news-week-23-04/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 63d2973cdbf070003dba81eb ]]></guid>
                    <pubDate><![CDATA[ 2023-01-27 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/01/image-9.png" class="kg-image" alt loading="lazy" width="900" height="600" srcset="https://www.blef.fr/content/images/size/w600/2023/01/image-9.png 600w, https://www.blef.fr/content/images/2023/01/image-9.png 900w" sizes="(min-width: 720px) 720px"><figcaption>My view from the train window (<a href="https://unsplash.com/photos/OY5zbCCrWN4?ref=blef.fr">credits</a>)</figcaption></figure><p>Dear Data News readers, it's a joy to write this newsletter every week. We are slowly approaching its second birthday. <strong>In order to celebrate this together I'd love to receive your stories about data</strong>—they can be short or long, anonymous or not. This is an open box: just write me with what you have on your mind and I'll bundle an edition with it.</p><p>This is fun because I'm usually not someone who's good at keeping habits. Every week, to be honest, I get hit by Friday. I don't write in advance. Every week you get a taste of my current mood. I often try to sync my travels on Fridays; even if the internet is terrible on the train, it is still a good way to fill the 8+ hours of travel time I'm used to.</p><p>Today I take the following commitment:<strong> I will never use any generative algorithm to write something in the newsletter</strong>. Fun story: one year ago I had an intern working with me on the blog, to whom I had given the task of writing code that could learn from my writing and generate a Data News edition. One year later, different views. In ChatGPT times, my idea is just boring.</p><p>On the other side, at the moment I'm not really organised to check whether the articles I share have been entirely written by humans, but same shit, I'll do as much as I can to avoid sharing empty articles, like I've always done.
It might be a good use-case for <a href="https://gptzero.me/?ref=blef.fr">GPTZero</a>.</p><p>As a data professional, it is probably the height of irony to not want to use AI. But right now the field feels like when cryptocurrencies arrived: awesome raw ideas with sharks circling around, waiting for a new productivity high.</p><p><em>PS: last week I made a—bad—joke about Apache naming and a reader pointed me to an article about the <a href="https://blog.nativesintech.org/apache-appropriation/?ref=blef.fr">ASF and non-Indigenous appropriation</a>.</em></p><p>This is enough about my life, let's jump to the news.</p><p></p><h1 id="back-to-the-roots-a-few-engineering-articles">Back to the roots, a few engineering articles</h1><p>I did not know how to put these articles together, so here are a few loose articles. In my <a href="https://www.blef.fr/manage-and-schedule-dbt/">manage and schedule dbt</a> guide, in a nutshell, I say that dbt projects have 2 lifecycles. The first one is the development experience and the second is the dbt runtime. It means you have to run dbt somewhere:</p><ul><li>Jonathan proposed a <a href="https://github.com/jonathanneo/data-aware-orchestration?ref=blef.fr">creative way to do it in Dagster</a> — every dbt model is a software-defined asset, which means that the whole data chain is reactive and every model is refreshed on a trigger rather than on a cron-based schedule.</li><li>The Astronomer team developed an awesome library meant to translate a dbt DAG to an Airflow DAG: <a href="https://github.com/astronomer/astronomer-cosmos?ref=blef.fr">astronomer-cosmos</a>. You either have a DbtDag object or a DbtTaskGroup that dynamically creates an Airflow DAG from your dbt project. It looks very promising.
Cosmos reads dbt model files and does not use the manifest.</li></ul><p>In terms of data modeling, ThoughtSpot wrote about the <a href="https://www.thoughtspot.com/blog/data-modeling-best-practices-analytics-engineers?ref=blef.fr">best data modeling methods</a> and Chad—the pope of Data Contracts—wrote about <a href="https://dataproducts.substack.com/p/data-contracts-for-the-warehouse?ref=blef.fr">data contracts for the warehouse</a>; mainly it shifts the responsibility to data producers in order to enforce schemas and semantics, but in the data world this is sometimes rather a utopia. Producers are often software teams that, sadly, do not care about data teams.</p><p>Finally Noah shared how he <a href="https://noahlk.medium.com/dbt-how-we-improved-our-data-quality-by-cutting-80-of-our-tests-78fc35621e4e?ref=blef.fr">improved data quality by removing 80% of the tests</a> and Ronald proposed a <a href="https://medium.com/miro-engineering/writing-data-product-pipelines-with-airflow-1ace222f8f5a?ref=blef.fr">framework to create data products in Airflow</a>.</p><p></p><h1 id="data-people-are-creatives-%F0%9F%AA%84">Data people are creatives 🪄</h1><p><em>This is a new category that will appear in the next Data News editions. In this category I'll share stuff that we can do with data. The idea is to inspire others by promoting the end use-case rather than just the technology. I'll be more than happy to share what you do.</em></p><ul><li><a href="https://maxhalford.github.io/blog/airbnb-energy-usage/?ref=blef.fr">Are Airbnb guests less energy efficient than their host?</a> — Max tries to find out whether Airbnb guests' energy consumption is higher than their hosts'.
I'm always amazed by straight-to-the-point analyses like this.</li><li><a href="https://pandascore.co/blog/automated-object-localisation-in-esports-video-streams?ref=blef.fr">Automated object detection in CSGO</a> — PandaScore, a French company that generates data from public—and probably private—e-sports video streams, showcases how they used OCR to extract data from CSGO live streams. I did something similar last year on Teamfight Tactics.</li><li><a href="https://storage.googleapis.com/website-storage-bucket/docs/football-data-pipeline-doc.html?ref=blef.fr">Football data pipeline project</a> — This is more of a technical walk-through for building a Streamlit dashboard on the Premier League. Still, it is interesting.</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/01/image-10.png" class="kg-image" alt loading="lazy" width="2000" height="1333" srcset="https://www.blef.fr/content/images/size/w600/2023/01/image-10.png 600w, https://www.blef.fr/content/images/size/w1000/2023/01/image-10.png 1000w, https://www.blef.fr/content/images/size/w1600/2023/01/image-10.png 1600w, https://www.blef.fr/content/images/size/w2400/2023/01/image-10.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>This is us (<a href="https://unsplash.com/photos/oMpAz-DN-9I?ref=blef.fr">credits</a>)</figcaption></figure><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li><a href="https://airbyte.com/free-connector-program?ref=blef">Airbyte announced a free sync plan</a>. Starting today, the connectors that are in alpha and beta will be free to use in Airbyte Cloud. Only one side of the sync needs to be in alpha/beta for it to be free. Once GA, you'll have 2 weeks before being charged.</li><li>Earlier in January <a href="https://fivetran.com/docs/getting-started/consumption-based-pricing/2023-cbp-faq?ref=blef.fr">Fivetran also announced a free plan</a>.
Starting in February you will be able to sync up to 500k distinct rows for free, plus other perks.</li><li><a href="https://www.sqlalchemy.org/blog/2023/01/26/sqlalchemy-2.0.0-released/?ref=blef.fr">SQLAlchemy 2.0 released</a> — This is a major release with a lot of breaking changes. As I'm far from being an expert in SQLAlchemy, I can't say much more than that it seems to be a shiny new, better ORM.</li><li><a href="https://www.metaplane.dev/blog/announcing-data-test-previews-in-pull-requests?ref=blef.fr">Metaplane announced data test previews in pull requests</a> — This is a way to compare the SQL code in a PR to the live production data, to see directly in GitHub what has changed. It gives ideas.</li><li><a href="https://medium.com/snowflake/snowflake-min-by-and-max-by-aggregate-functions-8c0c7f30058e?ref=blef.fr">Snowflake released min_by and max_by functions</a> — With these new min/max functions you can, in a single select statement, get the first/last status for an id. This is a great shortcut.</li><li><a href="https://towardsdatascience.com/compare-tables-bigquery-1419ff1b3a2c?ref=blef.fr">How to compare two tables for quality in BigQuery</a> — Giorgios proposes a simple query to compare 2 tables in BigQuery. If you are a Snowflake user there is a <a href="https://docs.snowflake.com/en/sql-reference/operators-query.html?ref=blef.fr#minus-except">minus</a> operator that makes it even easier, and if you use dbt you can avoid this boilerplate by using the <a href="https://github.com/dbt-labs/dbt-utils?ref=blef.fr#equality-source">dbt_utils.equality</a> test.</li><li><a href="https://medium.com/@ivanreznikov/how-misused-terminology-is-damaging-the-data-field-28881a96c7f?ref=blef.fr">How misused terminology is damaging the data field</a> — The title is a bit exaggerated, and terminology gatekeeping damages the field even more. 
Actually in the end we all do stuff with data, right?</li><li><a href="https://engineering.zalando.com/posts/2023/01/how-you-can-have-impact-as-an-engineering-manager.html?ref=blef.fr">How you can have impact as an Engineering Manager</a> — Good question and good article. In a nutshell it's about your team and other teams, and how you interact with other people in terms of behaviour, processes and practices.</li></ul><p></p><h1 id="data-economy-%F0%9F%92%B0">Data Economy 💰</h1><ul><li>Microsoft <a href="https://openai.com/blog/openai-and-microsoft-extend-partnership/?ref=blef.fr">finally announced</a> their "multi-billion dollar" investment—probably $10b—in OpenAI. Nothing more to say, you might have guessed my opinion in the introduction.</li><li><a href="https://www.whalesync.com/?ref=blef.fr"><strong>whalesync</strong></a> <a href="https://www.whalesync.com/blog/announcing-our-1-8m-pre-seed-round?ref=blef.fr">raises $1.8m pre-seed</a> to create another connector-based data movement SaaS, with bidirectional connectors. The difference with similar products is the ability to also sync to Postgres; usually tools like this only do it between SaaS apps. They also enable automated web page creation for SEO, which is unrelated to the data movement business.</li><li><strong><a href="https://www.komprise.com/?ref=blef.fr">Komprise</a></strong> <a href="https://www.komprise.com/komprise-raises-37m-to-fuel-growth-and-advance-leadership-in-unstructured-data-management/?ref=blef.fr">raises $37m Series D</a> to build yet another all-in-one data platform that does everything about data.</li></ul><hr><p>See you next week ❤️.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 23.03 ]]></title>
                    <description><![CDATA[ Data News #23.03 — Looking for Airflow speakers, the current state of data, data modeling techniques, Airflow misconceptions, don&#39;t target 100% coverage. ]]></description>
                    <link><![CDATA[ /data-news-week-23-03/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 63ca96360a948f003d391891 ]]></guid>
                    <pubDate><![CDATA[ 2023-01-20 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/01/image-8.png" class="kg-image" alt loading="lazy" width="900" height="599" srcset="https://www.blef.fr/content/images/size/w600/2023/01/image-8.png 600w, https://www.blef.fr/content/images/2023/01/image-8.png 900w" sizes="(min-width: 720px) 720px"><figcaption>Summer is coming (<a href="https://unsplash.com/photos/wtBex4wQw60?ref=blef.fr">credits</a>)</figcaption></figure><p>Hey, new Friday, new Data News edition. I'm so happy to see new people coming every week. Thank you for every recommendation you make about the blog or the Data News. This kindness for my content gives me wings. </p><p>This week I don't want to be late, so let's start the weekly wrap-up. I got less inspired this week, which means a shorter edition.</p><p>As a side note, we are looking for speakers for a late February Airflow Meetup. Topics are still open, so whatever you want to share—it has to be related to Airflow at some point—we'll be happy to welcome you as a speaker.</p><p></p><h1 id="the-current-state-of-data">The current state of data</h1><p>This week Benjamin Rogojan livestreamed an online conference featuring awesome data voices: <a href="https://www.youtube.com/watch?v=j-gruNSEd80&ref=blef.fr">state of data infra</a>. Matt wrote his <a href="https://medium.com/@matt_weingarten/state-of-data-takeaways-e19570957a3e?ref=blef.fr">takeaways</a> on Medium about the conference. In parallel Ben released the <a href="https://seattledataguy.substack.com/p/the-state-of-data-engineering-part?ref=blef.fr">results of a survey about data infras</a> he ran among his followers. The main thing to notice is that the average company is a Finance company using Airflow with BigQuery, and they struggle—like you, probably—to hire people.</p><p>It is also time for my views about the state of data. 
After 2 years of running the newsletter, writing every week about trends and following "influencers" for you, I'm bored. If I'm being honest, I'm French and was probably born bored, but still. When I was a young professional I was so hyped by new technologies; right now it's harder for me. I personally feel that the data ecosystem is in an in-between state. In between the Hadoop era, the modern data stack and the machine learning revolution everyone—but me—waits for. But, funnily enough, in the end we are still copying data from database to database using CSVs, like 40 years ago.</p><p>If we go back to this week's articles:</p><ul><li>Matt Hawkins <a href="https://hotcrossjoin.substack.com/p/the-unhappy-marriage-of-data-stacks?ref=blef.fr">tried to find the origins of the term "modern data stack"</a>.</li><li>Pedram wrote about the <a href="https://www.datafold.com/blog/the-state-of-data-testing?ref=blef.fr">state of data testing</a> — at the end of the article—obviously, because it's on the Datafold blog—they share data-diff; still, the article is relevant around the four facets of data quality: accuracy, completeness, consistency and integrity.</li><li><a href="https://dev.to/apachedoris/a-glimpse-of-the-next-generation-analytical-database-5dob?ref=blef.fr">Apache Doris</a> — to me it sounds like a character from Nemo; actually it's the new real-time warehouse of the Apache Foundation.</li><li>There is an <a href="https://blog.devgenius.io/datahub-an-introduction-a418d442383c?ref=blef.fr">introduction post about DataHub</a> — when you look at what you have to run to launch a data catalog—4 components and 4 different data stores—don't be surprised if no one uses data catalogs. 
And to think that some people say Airflow is complex to launch.</li></ul><p>In a nutshell I just want to solve problems and empower people with what I build, and I don't care if my stack is a post-modern aquarium, I just want it to be blazingly boring.</p><p></p><h1 id="data-modeling-techniques">Data modeling techniques</h1><p>Data modeling is probably the most important skill of every data practitioner today. We don't really care about your role or your tools. This is about optimisation. Optimisation at different levels: it can be <a href="https://select.dev/posts/snowflake-range-join-optimization?ref=blef.fr">performance optimisation</a>, cost optimisation, <a href="https://moderndata101.substack.com/p/optimizing-data-modeling-for-the?ref=blef.fr">business understanding optimisation</a>. Yeah, <em>in fine,</em> optimisation<em>.</em></p><p>There are many techniques out there to do it; I don't want to enumerate them because that's not really the intention. Still, aim for simplicity, keep it simple stupid and think about your consumers.</p><p><em>PS: this feedback about the Medallion architecture—bronze, silver, gold—might be interesting for you.</em></p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/01/image-7.png" class="kg-image" alt loading="lazy" width="900" height="566" srcset="https://www.blef.fr/content/images/size/w600/2023/01/image-7.png 600w, https://www.blef.fr/content/images/2023/01/image-7.png 900w" sizes="(min-width: 720px) 720px"><figcaption>Perfect your modeling techniques (<a href="https://unsplash.com/photos/Xl-ilWBKJNk?ref=blef.fr">credits</a>)</figcaption></figure><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li><a href="https://medium.com/@datajuls/why-i-moved-my-dbt-workloads-to-github-and-saved-over-65-000-759b37486001?ref=blef.fr">Why I moved my dbt workloads to GitHub and saved over $65,000</a> — With the dbt Cloud price increase I already shared, 
companies started to look for innovative ways to run dbt. This time it is an example demonstrating that you can do it in GitHub Actions.</li><li><a href="https://medium.com/@henryweller/10-common-misconceptions-about-airflow-b5f86d9bc1e?ref=blef.fr">10 Common Misconceptions about Airflow</a> — Airflow has grown a lot, and users that lost faith in Airflow a while back will probably never come back. Still, this post tries to rehabilitate Airflow. In short, in recent Airflow versions it's easy to get started, the UI is great—and tbh it always has been—and the scheduler is stable.</li><li><a href="https://www.youtube.com/watch?v=beLo1BGcRpI&ref=blef.fr">Lights on Versatile Data Kit</a> — A YouTube video about a tool developed by VMware that is an alternative to dbt—yeah, sorry, this is the best way to define it.</li><li><a href="https://blog.dahl.dev/posts/data-engineering-interviews-in-stockholm/?ref=blef.fr">Data Engineering job market in Stockholm</a> — Alexander shared on his personal blog his job search in Sweden. Spoiler: out of 43 applications he got 6 offers. This is a short post but it describes his experience well.</li><li><a href="https://pudding.cool/2022/12/yard-sale/?ref=blef.fr">Why the super rich are inevitable</a> — Aside from the fact that we should <a href="https://en.wikipedia.org/wiki/Eat_the_rich?ref=blef.fr">eat the rich</a>, I just want to talk about the way the information is displayed. Alvin—the author—explains economic concepts with a scrollable visualisation and some simulations to help people understand them. I found it very pleasant and it looks like something data teams could do to package data analyses.</li><li><a href="https://medium.com/artefact-engineering-and-data-science/all-you-need-to-know-to-get-started-with-vertex-ai-pipelines-615e126ea00b?ref=blef.fr">All you need to know to get started with Vertex AI Pipelines</a> — Will people continue to do Data Science by themselves in 2023? 
Probably not like before, and with more APIs involved. For that you can follow this overview of Vertex AI—the Google Cloud Platform managed machine learning product.</li><li><a href="https://medium.com/teads-engineering/bigquery-ingestion-time-partitioning-and-partition-copy-with-dbt-cc8a00f373e3?ref=blef.fr">BigQuery Ingestion-Time Partitioning and Partition Copy With dbt</a> — Christophe from Teads wrapped up how they contributed to dbt 1.4 by adding ingestion-time partitioned table support for BigQuery.</li><li><a href="https://dev.to/antoinecoulon/dont-target-100-coverage-387o?ref=blef.fr">Don't target 100% coverage</a> — Yes. This is about JavaScript, but you can still send it to your boss who is asking for 100% coverage for data tests.</li><li><a href="https://emilie.substack.com/p/choose-your-adventure?ref=blef.fr">Choose your adventure</a>: <em>How changing how you spend your free time can genuinely make you feel like you have more of it and take care of your well-being.</em></li></ul><p></p><h1 id="data-economy-%F0%9F%92%B0">Data Economy 💰</h1><ul><li><a href="https://www.cumul.io/?ref=blef.fr"><strong>Cumul.io</strong></a> <a href="https://blog.cumul.io/2023/01/17/cumul-io-raises-e10m-series-a-funding-to-drive-confident-business-decisions-with-embedded-analytics/?ref=blef.fr">raises €10m Series A</a>. Embedded analytics is the capability to introduce Business Intelligence apps within "traditional" software platforms like SaaS applications or public websites. Cumul.io provides a complete SDK to integrate analytics in your app, either by doing it yourself or by letting your customers do it.</li><li>Lay-offs are continuing at big tech. Google and Microsoft announced respectively 6% and ~5% job cuts. According to <a href="https://layoffs.fyi/?ref=blef.fr">layoffs.fyi</a>, in January this year around 40k people got laid off in tech; that represents 25% of last year's total lay-offs—150k. 
<strong>If it happened to you recently, you can reach me, I'll do whatever I can do to help you</strong>.</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/01/image-6.png" class="kg-image" alt loading="lazy" width="2000" height="1125" srcset="https://www.blef.fr/content/images/size/w600/2023/01/image-6.png 600w, https://www.blef.fr/content/images/size/w1000/2023/01/image-6.png 1000w, https://www.blef.fr/content/images/size/w1600/2023/01/image-6.png 1600w, https://www.blef.fr/content/images/size/w2400/2023/01/image-6.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>Almost in time today (<a href="https://unsplash.com/photos/iwW9PaAmC3E?ref=blef.fr">credits</a>)</figcaption></figure><p></p><hr><p>See you next week ❤️.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 23.02 ]]></title>
                    <description><![CDATA[ Data News #23.02 — Switch from pandas to Polars, hiring processes, new age of machine learning, how query engines work and data economy. ]]></description>
                    <link><![CDATA[ /data-news-week-23-02/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 63c11fdb076542003dc0988b ]]></guid>
                    <pubDate><![CDATA[ 2023-01-14 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/01/image-5.png" class="kg-image" alt loading="lazy" width="2000" height="1313" srcset="https://www.blef.fr/content/images/size/w600/2023/01/image-5.png 600w, https://www.blef.fr/content/images/size/w1000/2023/01/image-5.png 1000w, https://www.blef.fr/content/images/size/w1600/2023/01/image-5.png 1600w, https://www.blef.fr/content/images/size/w2400/2023/01/image-5.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>Abandoned Pandas (<a href="https://unsplash.com/photos/e3icLEb-z-M?ref=blef.fr">credits</a>)</figcaption></figure><p>Hey. I'm having busy weeks; I'm sorry the Data News is coming on Saturday again. It is a bit hard to travel by train, work and write at the same time. Plus I'm a fast context switcher, so it piles up. Also, a few of you have sent me messages recently and I haven't answered yet—I see you and I did not forget you. Now that I'm back in Berlin it'll be easier.</p><p>Last week we organised the first Paris Airflow meetup of the year. It was a round table that I moderated with <a href="https://fromanengineersight.substack.com/?ref=blef.fr">Benoit Pimpaud</a>, <a href="https://medium.com/@pin.furcy?ref=blef.fr">Furcy Pin</a> and <a href="https://www.youtube.com/c/MarcLamberti?ref=blef.fr">Marc Lamberti</a>. We talked about the place of Airflow in 2023, the <a href="https://blog.fal.ai/the-unbundling-of-airflow-2/?ref=blef.fr">unbundling</a> of Airflow and the best way to run your Airflow DAGs today.</p><p>The discussion was in French and the recording will be released next week. In the meantime you can still check my article <a href="https://www.adventofdata.com/using-airflow-the-wrong-way/?ref=blef.fr">Using Airflow the wrong way</a> that summarizes a bit the operators vs. containers debate. 
During the meetup we did not talk about Airflow alternatives; currently Mage is the rising tool that everyone tries out <a href="https://chengzhizhao.com/is-apache-airflow-due-for-replacement-the-first-impression-of-mage-ai/?ref=blef.fr">as a replacement for Airflow</a>.</p><p>Enjoy the Data News.</p><p></p><h1 id="polars%E2%80%94pandas-are-freezing">Polars—Pandas are freezing</h1><p>Recently influencers have been betting that Rust will be the de-facto language in data engineering. History repeats itself: we've seen it with Scala, Go or even Julia at some scale. In the end Python and SQL are still here for good. But with Rust the approach is different. The idea is not to replace Python but to replace the underlying bindings that are used by Python libraries.</p><p>And it makes sense: for instance <a href="https://github.com/charliermarsh/ruff?ref=blef.fr">ruff</a>, a Python linter built in Rust, claims to be dramatically faster than the usual tools.</p><p>On the data processing side there is Polars, a DataFrame library that could replace pandas. Let's have a quick look at it. In this overview I'll not talk about performance because I don't have the time to do a proper benchmark—and I've never done one. Just the experience of a beginner who knows pandas very well.</p><p>The installation is pretty straightforward, you can do it with pip. Compared to pandas this is awesome because it seems Polars has no dependencies, so it does not need to build wheels like pandas.</p><pre><code class="language-bash">pip install polars</code></pre><p>Regarding the imports the documentation continues to treat me well. 
It looks like stuff I know from pandas.</p><pre><code class="language-Python">import polars as pl</code></pre><p>Then I can do my first CSV import; in the example I load a French railway open dataset about lost and found objects in stations.</p><pre><code class="language-Python">df = pl.read_csv("lost-objects-stations.csv", sep=";")</code></pre><p>Then you can use the same code as pandas to select the data (head, ["col"], etc.). Now I want to try a group by.</p><pre><code class="language-Python">df.groupby("Station").agg([pl.count()]).sort("count", reverse=True)
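
# A sketch on my side (not in the original snippet): the same query through
# Polars' lazy API, which lets Polars optimise the whole plan before running it.
# df.lazy().groupby("Station").agg([pl.count()]).sort("count", reverse=True).collect()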

# Same code but in pandas
df.groupby("Station")["Date"].count().sort_values(ascending=False)</code></pre><p>And lastly (because if I continue the newsletter is gonna be too long for you to read), I just try to convert a str Series to datetime.</p><figure class="kg-card kg-code-card"><pre><code class="language-Python">df = df.with_columns(
	df["Date"].str.strptime(pl.Datetime, fmt="%Y-%m-%dT%H:%M:%S%Z").alias("Date")
)
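
# Again a sketch of mine, assuming the parse succeeded: the typed column now
# supports datetime expressions, e.g. keeping only one year.
# df.filter(pl.col("Date").dt.year() == 2022)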

# Same code in Pandas
pd_df["Date"] = pd.to_datetime(pd_df["Date"], format="%Y-%m-%dT%H:%M:%S%Z", utc=True)</code></pre><figcaption>We can already see the performance difference here.</figcaption></figure><p>To be honest I tried Polars for 15 minutes and I can already see how I could switch to it if I have the guarantee it is way faster. The APIs are quite similar so I'm far from being lost.</p><p>🫠 If after this small introduction you want a deeper comparison of Polars you can check <a href="https://kevinheavey.github.io/modern-polars/?ref=blef.fr">Modern Polars</a> by Kevin Heavey or a 40-minute <a href="https://www.youtube.com/watch?v=kVy3-gMdViM&ref=blef.fr">YouTube video that explains Polars internals</a>.</p><p></p><h1 id="hiring-processes">Hiring processes</h1><p>The current state of the data market is weird. At the same time we have a lot of lay-offs and a lot of companies that are still looking for data folks—often a critical hire for them—but they struggle. There is a huge gap between jobs, what folks are looking for and what companies are looking for.</p><p>This week <a href="https://medium.com/teads-engineering/our-engineering-hiring-process-at-teads-bc2975141c15?ref=blef.fr">Teads shared their engineering hiring process</a>. The process is not focused entirely on data, but it is still relevant because it can give ideas to hiring companies or juniors looking for advice. They have a short 4-touchpoint interview process, which looks like a good compromise.</p><p>When focusing more on data, Galen wrote about <a href="https://towardsdatascience.com/what-i-look-for-in-every-data-analyst-candidate-7d05c52bb19e?ref=blef.fr">what he looks for in data analyst candidates</a>. One of the most interesting pieces of advice he gives, which I want to stress, is: you should spend time mastering the technologies you've chosen. With the current state of data it is easy to lose focus, so listen to him. Stop chasing the latest data trends and master what you use daily. 
I think that mastery in one domain can easily transfer to another domain.</p><div class="kg-card kg-callout-card kg-callout-card-green"><div class="kg-callout-emoji">❓</div><div class="kg-callout-text"><strong>Would you be interested in data job offers in the newsletter?</strong> <br><br>I would like to propose job offers that I personally validate—following an open checklist. Obviously companies would pay for this service and it would be a means for me to get something in return for the curation/writing work I do every week.</div></div><p></p><h1 id="ai-saturday">AI Saturday</h1><ul><li><a href="https://doordash.engineering/2023/01/10/how-doordash-upgraded-a-heuristic-with-ml-to-save-thousands-of-canceled-orders/?ref=blef.fr">How DoorDash upgraded a heuristic with ML to save thousands of cancelled orders</a> — When running a marketplace this is a common problem to deal with. DoorDash shares the models they used to replace their intuition.</li><li>👀 <a href="https://vietle.substack.com/p/defensible-machine-learning?ref=blef.fr">Building a defensible Machine Learning company in the age of foundation models</a> — This article is very complete; it is probably the best-written article about the current trends in machine learning. 
A whole ecosystem is shifting from building it yourself to consuming foundation models and APIs built by others.</li></ul><p></p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/01/image-3.png" class="kg-image" alt loading="lazy" width="1227" height="1280" srcset="https://www.blef.fr/content/images/size/w600/2023/01/image-3.png 600w, https://www.blef.fr/content/images/size/w1000/2023/01/image-3.png 1000w, https://www.blef.fr/content/images/2023/01/image-3.png 1227w" sizes="(min-width: 720px) 720px"><figcaption>Credits Good Tech Things by @<a href="https://twitter.com/forrestbrazeal/status/1612473738259316736?ref=blef.fr">forrestbrazeal</a></figcaption></figure><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li><a href="https://blog.fal.ai/announcing-dbt-fal-adapter/?ref=blef.fr">Announcing dbt-fal adapter</a> — I shared fal months ago when they launched. I'm still on their Discord, and when dbt finally announced Python models support I was a bit sceptical about fal offering the same thing. But dbt's solution is narrowly scoped—Python code only runs in the warehouse. 
With this release you can really mix Python and SQL code.</li><li><a href="https://medium.com/similarweb-engineering/how-we-cut-our-databricks-costs-by-50-7c60d6b6c069?ref=blef.fr">How we cut our Databricks costs by 50%</a> — We can always find optimizations in our cloud setup to save costs.</li><li><a href="https://www.brittanybennett.com/post/how-to-land-a-job-in-progressive-data?ref=blef.fr">How to land a job in progressive data</a> — If you want to use your skills to Do Good you should look at Brittany's post about progressive data.</li><li><a href="https://www.bundeskartellamt.de/SharedDocs/Meldung/EN/Pressemitteilungen/2023/11_01_2023_Google_Data_Processing_Terms.html?ref=blef.fr">Statement of objections issued against Google’s data processing terms</a> — The German competition authority said that Google should do more to be explicit about how data is processed to help Google's business.</li><li><a href="https://howqueryengineswork.com/?ref=blef.fr">How query engines work</a> — This is a web book that explains how query engines work. I have not read it yet but it looks great.</li><li><a href="https://www.jesse-anderson.com/2023/01/analysis-of-confluent-buying-immerok/?ref=blef.fr">Analysis of Confluent buying Immerok</a> — Jesse Anderson analyses last week's news of Confluent (Kafka) buying Immerok (Flink) and what it implies in the real-time low-level technologies competition between Kafka / Flink / Spark.</li><li><a href="https://medium.com/@maxillis/on-data-contracts-data-products-and-muesli-84fe2d143e2c?ref=blef.fr">On Data Contracts, Data Products and Muesli</a> — Another post on data contracts, a bit too long for me to read. 
Sorry.</li><li><a href="https://clickhouse.com/blog/extracting-converting-querying-local-files-with-sql-clickhouse-local?utm_campaign=SF%20Data%20Weekly&utm_medium=email&utm_source=Revue%20newsletter">Extracting, converting, and querying data in local files using clickhouse-local</a> — It's awesome how far ClickHouse can go. It looks like a wider alternative to DuckDB but also a good trend for other warehouses: providing a local experience that lives outside of the cloud.</li><li><a href="https://www.mydistributed.systems/2023/01/bytegraph-graph-database-for-tiktok.html?ref=blef.fr">ByteGraph: A Graph Database for TikTok</a> — ByteGraph is the open-source graph database developed by the company behind TikTok. This article shows you the key concepts needed to understand it. To be honest I'm quite impressed by the first line stating that it has been designed to support OLAP, OLSP and OLTP workloads.</li></ul><p></p><h1 id="data-economy-%F0%9F%92%B0">Data Economy 💰</h1><ul><li><strong><a href="https://www.metaplane.dev/?ref=blef.fr">Metaplane</a></strong> <a href="https://www.metaplane.dev/blog/the-next-stage-of-metaplane?ref=blef.fr">raises $8.4m seed funding</a>. This is a bold claim: Metaplane wants to be the Datadog for data. Operating in the data observability space, it has the usual set of features: tests, data quality monitoring based on historical data, lineage and alerts.</li><li><a href="https://xetdata.com/?ref=blef.fr"><strong>XetHub</strong></a> <a href="https://xetdata.com/blog/2022/12/13/introducing-xethub/?ref=blef.fr">raises $7.5m seed round</a>. XetHub brings git to data file management. They support up to 1TB repositories with git-like commands (checkout, push, commit, pull, etc.). I think that XetHub is super useful when in data science we need to keep the data alongside the models. 
When you commit a change to a big file, their repo hub summarises data diffs.</li><li>Generative AIs are booming. Following all the stories about a possible <a href="https://pitchbook.com/news/articles/microsoft-openai-largest-vc-deal?ref=blef.fr">Microsoft $10b investment</a> in OpenAI, <a href="https://www.seek.ai/?ref=blef.fr"><strong>Seek AI</strong></a> <a href="https://www.seek.ai/press-01-11-23?ref=blef.fr">raises $7.5m seed round</a>. Seek AI's promise is a prompt where you ask your data anything and the AI responds on top of the raw data directly.</li></ul><hr><p>See you next week, maybe on Friday ❤️.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 23.01 ]]></title>
                    <description><![CDATA[ Data News #23.01 — First edition of the year (late to start the year on the right foot), 2022 throwback, data team role, data science, fast news and lay-offs. ]]></description>
                    <link><![CDATA[ /data-news-week-23-01/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 63b7dabd66b8fc003da070b7 ]]></guid>
                    <pubDate><![CDATA[ 2023-01-07 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/01/image.png" class="kg-image" alt loading="lazy" width="2000" height="1335" srcset="https://www.blef.fr/content/images/size/w600/2023/01/image.png 600w, https://www.blef.fr/content/images/size/w1000/2023/01/image.png 1000w, https://www.blef.fr/content/images/size/w1600/2023/01/image.png 1600w, https://www.blef.fr/content/images/size/w2400/2023/01/image.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>You and me celebrating 2023 (<a href="https://unsplash.com/photos/PAykYb-8Er8?ref=blef.fr">credits</a>)</figcaption></figure><p>Happy new year 🎆. For those who were already subscribed at the start of last year: I tried to set resolutions and objectives for the year and did not manage to follow them. The year was so different from what I expected. Maybe this is an excuse. Anyway, I did not reach my goals. What if we don't care this year?</p><p>Still, what happened was awesome, and here is a small personal / professional throwback:</p><ul><li>I worked for the French public sector as a freelancer: the tax administration and the education ministry. It makes sense for me and this is something I also really care about.</li><li>I bootstrapped a coaching activity with companies and individuals—this is a new exercise but I feel it's close to the management work I can't do as a freelancer.</li><li>I moved to Berlin, spoke at my first ever meetup in English and met awesome people there, but I'd like to meet more.</li><li>We restarted the Paris Airflow Meetup and people liked it. There are still a few seats left for <a href="https://www.meetup.com/fr-FR/paris-apache-airflow-meetup/events/290522528/?ref=blef.fr">next Tuesday's meetup</a>.</li><li>I started to pay myself after a year and a half on unemployment pay. This is maybe my main source of stress. Will I be able next year to find missions that pay me for the whole year? 
My business plan asks for 100k€ in revenue.</li><li>This year I deeply learned Superset—adding the tool to my tools expertise list.</li><li>My written content got around 100k views last year. The blog crossed the 2000-member mark (❤️) and I won the <a href="https://noonies.hackernoon.com/2022/emerging-tech/2022-best-data-science-newsletter?ref=blef.fr">best data science newsletter award</a>. On LinkedIn and Twitter I multiplied my followers by 2. Everywhere I was starting from the bottom and now we're here.</li><li>I talked on <a href="https://www.blef.fr/blef-datagen-podcast/">Robin's podcast</a> about the newsletter and my data engineering journey.</li></ul><p>I'm also sorry to start the year late with my newsletter sending. Over the last 3 days I was teaching DataOps at a French school and I did not manage to find the time to write to you. <strong>And you know what, this is the first time in 7 years of teaching that more than 80% of the class wants to become data engineers.</strong></p><p>As a conclusion to this introduction, I want to thank everyone reading this newsletter and sharing feedback or good words about it. It means so much to me and it fuels me. For sure the Data News will be here for a new year and new stuff is coming.</p><p>Time for the news—I have around 30 links to share today so it might be less opinionated than usual. Happy reading.</p><p></p><h1 id="data-team-role">Data team role</h1><p>I really like all the thoughts around data team roles, missions, vision and strategy. I still think that we have not reached any form of consensus about data teams. In terms of tooling the modern data stack proposed something that works, but the modern data team is still behind. Here are the latest ideas I've seen this week:</p><ul><li><a href="https://petrjanda.substack.com/p/should-software-teams-start-learning?ref=blef.fr">Should software teams start learning from analytics engineers?</a> — Petr reverses the common idea that analytics teams should learn from software. 
Actually, everyone is just part of engineering, which helps all of us get better at data <em>and</em> software.</li><li><a href="https://towardsdatascience.com/data-teams-as-support-teams-2bb1f1ed31b?ref=blef.fr">Data Teams as support teams</a> — Chad from Zendesk thinks that data teams are often misaligned with their customers and that, because of the supportive nature of the relationship, something does not work. He then digs into modeling and analytics value to understand the impact on the relationship—<em>it is fun to read someone from Zendesk who does not want to be on a support team!</em></li><li>❤️ <a href="https://wrongbutuseful.substack.com/p/elbows-of-data?ref=blef.fr">Elbows of data</a> — This is a good follow-up to Chad's post. Katie coins the term <em>elbows of data</em> for "folks who have insisted on being involved in driving the company forward, whether they were invited to or not". When we do data we have the skills and understanding to help our company.
Once again, our main role should be to empower stakeholders.</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/01/image-1.png" class="kg-image" alt loading="lazy" width="900" height="600" srcset="https://www.blef.fr/content/images/size/w600/2023/01/image-1.png 600w, https://www.blef.fr/content/images/2023/01/image-1.png 900w" sizes="(min-width: 720px) 720px"><figcaption>You and your stakeholder, bff (<a href="https://unsplash.com/s/photos/empower?ref=blef.fr">credits</a>)</figcaption></figure><p></p><h1 id="data-science-saturday-%F0%9F%A4%96">Data Science Saturday 🤖</h1><ul><li><a href="https://medium.com/qonto-way/how-to-invest-better-in-acquisition-channels-a-1-million-question-for-data-science-591c82b3e0e4?ref=blef.fr">How to invest better in acquisition channels?</a> — Marianne detailed how data science helped Qonto understand their acquisition channel investments.</li><li><a href="https://counting.substack.com/p/data-science-has-a-tool-obsession?ref=blef.fr">Data science has a tool obsession</a>.</li><li><a href="https://doordash.engineering/2023/01/04/selecting-the-best-image-for-each-merchant-using-exploration-and-machine-learning/?ref=blef.fr">Selecting the best image for each merchant using exploration and ml</a>.</li><li><a href="https://huggingface.co/blog/intro-graphml?ref=blef.fr">Introduction to Graph Machine Learning</a> (related: Grab's <a href="https://engineering.grab.com/graph-service-platform?ref=blef.fr">Graph service platform</a>).</li></ul><p>We are in the middle of a ChatGPT frenzy. Each new day brings a new question about our future. Our future as developers, but also our future as humans. OpenAI is seeking money at a <a href="https://www.wsj.com/articles/chatgpt-creator-openai-is-in-talks-for-tender-offer-that-would-value-it-at-29-billion-11672949279?ref=blef.fr">high valuation</a>.
Still, should we trust OpenAI to be as open as the name says 🫠?<br><br>If you want to better understand what's behind ChatGPT, you can have a look at <a href="https://github.com/karpathy/minGPT?ref=blef.fr">minGPT</a>, a minimal re-implementation in PyTorch.<br><br>At the same time <a href="https://twitter.com/AiBreakfast/status/1610620787052130305?ref=blef.fr">it seems</a> that, following Microsoft's initial investment in OpenAI, Bing will use GPT models to improve their text and image search. Who would have said that Bing might kill Google?</p><p>Final note: <a href="https://techcrunch.com/2022/12/31/how-china-is-building-a-parallel-generative-ai-universe/?ref=blef.fr">How China is building a parallel generative AI universe</a>.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2023/01/image-2.png" class="kg-image" alt loading="lazy" width="2000" height="1333" srcset="https://www.blef.fr/content/images/size/w600/2023/01/image-2.png 600w, https://www.blef.fr/content/images/size/w1000/2023/01/image-2.png 1000w, https://www.blef.fr/content/images/size/w1600/2023/01/image-2.png 1600w, https://www.blef.fr/content/images/size/w2400/2023/01/image-2.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>You and ChatGPT being friends (<a href="https://unsplash.com/photos/0E_vhMVqL9g?ref=blef.fr">credits</a>)</figcaption></figure><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li><a href="https://www.sspaeti.com/blog/why-using-neovim-data-engineer-and-writer-2023/?ref=blef.fr">Why I'm using (Neo)vim as a Data Engineer and Writer in 2023</a> — If you want to take the beginning of 2023 as a sign to move to vim, Simon wrote a great post for you.</li><li><a href="https://newsletter.pragmaticengineer.com/p/circlecis-unnoticed-holiday-security?ref=blef.fr">CircleCI’s unnoticed holiday security breach</a> — CircleCI had a security breach a few days ago.</li><li><a
href="https://blog.malt.engineering/what-if-we-rewrite-everything-e1662e86da41?ref=blef.fr">What if we rewrite everything?</a> — Navigating technical debt and spending our entire careers doing the same stuff over again. What is the right strategy? Probably Keep It Simple, Stupid.</li><li><a href="https://jkebertz.medium.com/why-its-so-hard-to-become-a-staff-engineer-c4b94864a373?ref=blef.fr">Why It’s So Hard to Become a Staff Engineer</a> — Feedback to help people bridge the gap between senior and staff. I think this is relevant to the data world too.</li><li><a href="https://arrow.apache.org/blog/2023/01/05/introducing-arrow-adbc/?ref=blef.fr">Introducing ADBC: Database Access for Apache Arrow</a> — When I see "minimal-overhead alternative to JDBC/ODBC for analytical applications" I'm instantly in. All my professional life I've heard architects say JDBC is bad, so I welcome anything better that lets us stop talking about it. You can also <a href="https://open.spotify.com/episode/0gKbMmA8MPE4oGHDn6HxkZ?si=Trl4Bs2lQmurZGZWxhTNNQ&nd=1&ref=blef.fr">listen to a related podcast</a> about the Arrow vision.</li><li><a href="https://cnr.sh/essays/recap-for-people-who-hate-data-catalogs?ref=blef.fr">Recap: a data catalog for people who hate data catalogs</a> — This one hurts. You may have noticed, if you read me, that I'm not very tender with the current state of data catalogs. This week Chris started a small-footprint data catalog written in Python called <a href="https://github.com/recap-cloud/recap?ref=blef.fr">Recap</a>. I'll have a look at it soon.</li><li><a href="https://medium.com/@nigel.vining_19228/observability-tick-5370982eb804?ref=blef.fr">Observability, Tick</a> — Nigel wrote a small post detailing how a small startup can do observability without spending a lot of money.</li></ul><p></p><h1 id="data-economy-%F0%9F%92%B0">Data Economy 💰</h1><p>The economic situation is obviously not at its best.
Previously, data was not always hit by economic difficulties, but the downturn is now reaching the data world too. That's why the data fundraising section is becoming a data economy wrap-up.</p><ul><li><a href="https://www.astronomer.io/blog/astronomer-update/?ref=blef.fr">Astronomer laid off 20% of their staff</a>—which represents 76 folks—and moved from a co-CEO structure to a single CEO. I appreciate the transparency effort that went into making this note public. I still struggle to see Astronomer's value and strategy, but judging is hard because Astronomer hires a lot of core Airflow contributors and makes important contributions to the data community.</li><li><a href="https://www.bloomberg.com/news/articles/2023-01-05/salesforce-crm-guts-tableau-after-spending-15-7-billion-in-2019-deal?ref=blef.fr">Salesforce is laying off 10% of their staff</a>—roughly 8000 people—including folks at Tableau. They acquired Tableau in 2019 and analysts are saying that Tableau ex-employees are disproportionately impacted by the layoffs.</li></ul><p>In search of consolidation and new levers, companies are also merging:</p><ul><li><a href="https://www.qlik.com/us/company/press-room/press-releases/qlik-intends-to-acquire-talend?ref=blef.fr">Qlik wants to acquire Talend</a>. Qlik and Talend are two old BI giants, founded in 1993 and 2005 respectively. They have obviously been challenged by the cloud vendors and by the modern data stack vision that does not include them.</li><li><a href="https://www.confluent.io/blog/cloud-kafka-meets-cloud-flink-with-confluent-and-immerok/?ref=blef.fr">Confluent signed a deal to acquire Immerok</a>. They are respectively the home companies of Kafka and Flink. This is, to be honest, a natural move, because the two technologies work best together and ksqlDB never took the place in the market it should have.
Sadly, right now they are also challenged by real-time tooling that is way easier to set up.</li></ul><p>Finally, a fundraise:</p><ul><li><a href="https://www.chaosgenius.io/?ref=blef.fr">Chaos Genius</a> is <a href="https://www.chaosgenius.io/blog/chaos-genius-raises-3-3m-seed-round-to-help-companies-cut-data-costs/?ref=blef.fr">raising a $3.3m Seed round</a>. They propose an optimisation platform for Snowflake that helps you save up to 30% of your warehouse costs.</li></ul><hr><p>See you next week ❤️.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ I talked to DataGen podcast ]]></title>
                    <description><![CDATA[ In 2022 I talked in DataGen podcast about the newsletter curation process and why Data Engineering is so cool. ]]></description>
                    <link><![CDATA[ /blef-datagen-podcast/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 63b93b9766b8fc003da072b6 ]]></guid>
                    <pubDate><![CDATA[ 2023-01-04 ]]></pubDate>
                    <content>
                        <![CDATA[ <p>🎙 A few weeks ago I recorded my first podcast with <a href="https://www.linkedin.com/in/ACoAABOXu60BNWu22glLrpJCM_6wVqD4SszpF1Y?ref=blef.fr">Robin</a>. We talked about data engineering and everything that goes into a weekly curation.<br><br>This is the first episode of Robin's podcast in English and you should follow him because more are coming!<br><br>In the podcast we talked about:<br>🔥 My journey before launching the newsletter<br>🔥 Why and how I write<br>🔥 My main challenges as a Data Engineer<br>🔥 My favorite contents<br>🔥 What I like about data<br>🔥 A few tips for Data folks</p><p>You can listen to the podcast on all the platforms:</p><ul><li>Apple Podcasts/Itunes: <a href="http://bit.ly/3X3qlOQ?ref=blef.fr">bit.ly/3X3qlOQ</a></li><li>Spotify: <a href="http://bit.ly/3GnfWXb?ref=blef.fr">bit.ly/3GnfWXb</a></li><li>Google Podcast: <a href="http://bit.ly/3VPSAPR?ref=blef.fr">bit.ly/3VPSAPR</a></li><li>Deezer: <a href="http://bit.ly/3ZjWsM6?ref=blef.fr">bit.ly/3ZjWsM6</a></li></ul> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — must-read 2022 articles ]]></title>
                    <description><![CDATA[ A collection of data articles that you should read to remember 2022. Best data articles of 2022. ]]></description>
                    <link><![CDATA[ /best-data-articles-2022/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 63a044d087c8c8003d9bd503 ]]></guid>
                    <pubDate><![CDATA[ 2022-12-30 ]]></pubDate>
                    <content>
                        <![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/12/image-8.png" class="kg-image" alt loading="lazy" width="2000" height="1334" srcset="https://www.blef.fr/content/images/size/w600/2022/12/image-8.png 600w, https://www.blef.fr/content/images/size/w1000/2022/12/image-8.png 1000w, https://www.blef.fr/content/images/size/w1600/2022/12/image-8.png 1600w, https://www.blef.fr/content/images/2022/12/image-8.png 2074w" sizes="(min-width: 720px) 720px"><figcaption>kitsch moment, from me to you (<a href="https://unsplash.com/photos/2PODhmrvLik?ref=blef.fr">credits</a>)</figcaption></figure><p>Hey you, this is the last article of the year and it's gonna be about the articles and trends that made 2022, according to me. You'll see articles that I've already shared during the year.</p><div class="kg-card kg-callout-card kg-callout-card-grey"><div class="kg-callout-emoji">💡</div><div class="kg-callout-text">You can also read the <a href="https://www.blef.fr/data-news-must-read-articles/">2021 must-read list</a> that I did a year and a half ago, or <a href="https://www.blef.fr/learn-data-engineering/">how to learn data engineering</a>, which contains key articles to understand the field.</div></div><p>Once again, thank you everyone for your support this year, and see you next week for the first Data News of 2023. Sorry for the delay, I had blank page syndrome today. Now let's jump to my selection.</p><p></p><hr><!--kg-card-begin: html--><h2 style="text-align: center;">ANALYTICS ENGINEERING</h2>
<!--kg-card-end: html--><p></p><p>We have to be honest: in 2022 Analytics Engineering shaped the data field and concentrated a lot of the data discussions. Analytics Engineering can be seen as a renaming of BI Engineering; looked at more precisely, it mainly comes out of the specialisation of data roles. Analytics Engineer is a specialized role between the Data Engineer and the Data Analyst. Madison had a look at job postings to see <a href="https://medium.com/geekculture/what-companies-really-want-in-an-analytics-engineer-1ac03ff4494a?ref=blef.fr">what skills companies really want in Analytics Engineers</a>.</p><blockquote>Analytics engineers provide clean data sets to end users, modeling data in a way that empowers end users to answer their own questions. [...], an analytics engineer spends their time transforming, testing, deploying, and documenting data. Analytics engineers apply software engineering best practices like version control and continuous integration to the analytics code base.<sup>1</sup></blockquote><p>Analytics Engineering brought the spotlight back to data modeling. Preset wrote a <a href="https://preset.io/blog/intro-data-modeling/?ref=blef.fr">gentle introduction to data modeling</a>. In a nutshell, data modeling is the set of techniques we use to structure data in data warehouses. Nowadays we have:</p><ul><li><strong>Dimensional modeling</strong> — Introduced in 1996 by Ralph Kimball. We often use the <em>Snowflake Schema</em> or the <em>Star Schema</em> (a special case of the previous one; here Snowflake is not the data warehouse technology but the shape of the table relationships, which draw a snowflake).</li><li><strong>Entity modeling</strong> — Introduced by Bill Inmon. In this methodology you use 3NF (third normal form) to model your business entities and avoid redundancy.
This approach is less flexible than the previous one.</li><li><a href="https://www.fivetran.com/blog/star-schema-vs-obt?ref=blef.fr">OBT—One big table</a> — I don't really know who introduced OBT, except that Fivetran mentioned it in 2020. This is often the easiest approach to start with: everything in one table, denormalised.</li></ul><p>As a final note, a Reddit thread discussing <a href="https://www.reddit.com/r/dataengineering/comments/uhohlv/is_kimballs_dimensional_modelling_dead_in_2022_is/?ref=blef.fr">is Kimball's Dimensional Modelling dead in 2022?</a></p><p>To complete the AE list, here are a few articles I recommend as the best of 2022 analytics engineering:</p><ul><li><a href="https://graflinger.medium.com/factless-fact-table-not-so-absurd-it-may-sound-at-first-9c2aab68089?ref=blef.fr">Factless Fact table — not so absurd it may sound at first</a></li><li><a href="https://www.youtube.com/watch?v=hxvVhmhWRJA&list=PL0QYlrC86xQlj9UDGiEwhXQuSjuSyPJHl&index=20&ref=blef.fr">Testing: Our assertions vs. reality</a> — Probably the best talk of 2022 about testing. This is a YouTube video.</li><li><a href="https://engineering.linkedin.com/blog/2022/super-tables--the-road-to-building-reliable-and-discoverable-dat?ref=blef.fr">Super Tables: The road to building reliable and discoverable data products</a> — LinkedIn's data modeling choices explained, and the introduction of the Super Tables concept.</li><li><a href="https://teej.ghost.io/understanding-the-snowflake-query-optimizer/?ref=blef.fr">Understanding the Snowflake Query Optimizer</a> — To become better at data modeling you'll need to understand how the underlying warehouse engine works. This article is a good way to understand how Snowflake works.</li><li><a href="https://hex.tech/blog/stop-using-so-many-ctes/?ref=blef.fr">Stop using so many CTEs</a> — This is a vendor article that showcases "Chained CTEs" in Hex.
Still relevant, because in today's data world CTEs are everywhere and a lot of data transformations are just SQL queries a few hundred lines long with many CTEs. But CTEs are untestable blocks of code.</li><li><a href="https://blog.picnic.nl/7-antifragile-principles-for-a-successful-data-warehouse-574b655f0bc6?ref=blef.fr">7 Antifragile Principles for a Successful Data Warehouse</a> — Something to look at to create a healthy data warehouse.</li><li>My guide about <a href="https://www.blef.fr/manage-and-schedule-dbt/">managing and scheduling dbt from dev to production</a>.</li></ul><p></p><p></p><hr><!--kg-card-begin: html--><h2 style="text-align: center;">DATA TEAMS</h2>
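A quick aside to make the modeling part of the analytics engineering section above concrete: here is a minimal, self-contained sketch of the same aggregate computed against a tiny star schema and against its one-big-table (OBT) equivalent. All table and column names (`dim_customer`, `fct_orders`, `obt_orders`) are made up for illustration, and SQLite stands in for a real warehouse:

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Star schema: a fact table holding measures and foreign keys,
# plus a dimension table describing customers.
con.executescript("""
CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, country TEXT);
CREATE TABLE fct_orders (order_id INTEGER, customer_id INTEGER, amount REAL);
INSERT INTO dim_customer VALUES (1, 'FR'), (2, 'DE');
INSERT INTO fct_orders VALUES (10, 1, 20.0), (11, 1, 5.0), (12, 2, 7.5);
""")

# Analytical query on the star schema: join the fact to the dimension.
star = con.execute("""
SELECT c.country, SUM(f.amount)
FROM fct_orders f JOIN dim_customer c USING (customer_id)
GROUP BY c.country ORDER BY c.country
""").fetchall()

# OBT: the same data denormalised into a single wide table,
# so the query needs no join at all.
con.executescript("""
CREATE TABLE obt_orders AS
SELECT f.order_id, f.amount, c.country
FROM fct_orders f JOIN dim_customer c USING (customer_id);
""")
obt = con.execute("""
SELECT country, SUM(amount) FROM obt_orders
GROUP BY country ORDER BY country
""").fetchall()

print(star)         # [('DE', 7.5), ('FR', 25.0)]
print(obt == star)  # True
```

The star schema keeps measures and descriptive attributes separate and joins at query time; OBT pre-joins them, trading storage and rebuild cost for simpler queries.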
<!--kg-card-end: html--><p></p><p>3 pieces of content that I feel are relevant and not really trendy. These are more long-term things to keep in mind:</p><ul><li><a href="https://locallyoptimistic.com/post/building-more-effective-data-teams-using-the-jtbd-framework/?ref=blef.fr">Building more effective data teams using the JTBD framework</a> — Data teams are still in between, with no really good practices when it comes to routines or organisation. The Jobs To Be Done framework can be something to look at.</li><li><a href="https://datateams.amplifypartners.com/?ref=blef.fr">Building Modern Data Teams</a> — The most complete resource hub, with around 40 articles on how to build data teams and data strategies, or think about data work and hiring.</li><li><a href="https://a16z.com/2020/10/15/emerging-architectures-for-modern-data-infrastructure/?ref=blef.fr">Emerging Architectures for Modern Data Infrastructure</a> — The updated version of the a16z vision of modern data infrastructure.</li></ul><p></p><hr><!--kg-card-begin: html--><h2 style="text-align: center;"><strong>ENGINEERING</strong></h2>
<!--kg-card-end: html--><p></p><p>In no particular order, a few of the best 2022 data engineering articles:</p><ul><li><a href="https://dagster.io/blog/software-defined-assets?ref=blef.fr">Introducing Software-Defined Assets</a> — The best article for rethinking data pipelines and considering datasets as assets.</li><li><a href="https://dev.to/alvinslee/the-rise-of-the-data-reliability-engineer-pno?ref=blef.fr">The rise of the data reliability engineer</a> — A large part of a Data Engineer's daily job is close to SRE work, while not being an SRE.</li><li><a href="https://mlu-explain.github.io/?ref=blef.fr">The best website to understand machine learning models visually</a>.</li><li><a href="https://joereis.substack.com/?ref=blef.fr">Joe Reis's blog</a>. He started blogging recently after writing the excellent <em>Fundamentals of Data Engineering</em> book. I often surprise myself agreeing with everything he says; if you have to follow someone other than me, I think it should be him.</li><li><a href="https://medium.com/data-monzo/the-many-layers-of-data-lineage-2eb898709ad3?ref=blef.fr">The many layers of data lineage</a> — The best metaphor for understanding what you can do with data lineage.</li><li><a href="https://eugeneyan.com/writing/design-patterns/?ref=blef.fr">Design Patterns in Machine Learning Code and Systems</a> — Because we need design patterns, even if I disliked the design pattern classes I had back in engineering school.</li><li><a href="https://medium.com/inato/3-tips-to-take-back-control-of-your-time-2016dc6308c2?ref=blef.fr">3 tips to take back control of your time</a>.</li></ul><p></p><hr><!--kg-card-begin: html--><h2 style="text-align: center;"><strong>A GLIMPSE INTO THE FUTURE</strong></h2>
<!--kg-card-end: html--><p></p><p>This year people talked about a lot of things; with no research, here is what I can remember:</p><ul><li>Data Mesh — The Mesh has been assimilated and tried by <a href="https://medium.com/blablacar/dos-and-don-ts-of-data-mesh-e093f1662c2d?ref=blef.fr">multiple</a> <a href="https://netflixtechblog.com/data-mesh-a-data-movement-and-processing-platform-netflix-1288bcab2873?ref=blef.fr">organisations</a>. What we've seen is that it requires a minimal size to get started, and we have yet to figure out if the organisational changes are worth it.</li><li>Data contracts — An interface between data producers and data consumers. The interface can take multiple forms; we often summarize it as a schema registry. Very useful in a mesh organisation.</li><li><a href="https://en.wikipedia.org/wiki/Semantic_layer?ref=blef.fr">Semantic Layer</a> / Metric Layer / Headless BI — <strong>"Something"<sup>2</sup> between the data warehouse and the BI tool that will probably shape trends next year</strong>.</li><li><a href="https://blog.fal.ai/the-unbundling-of-airflow-2/?ref=blef.fr">Unbundling of Airflow</a> — This year many Airflow alternatives went public, all with their own vision and great promises. In addition, the one-DAG-to-rule-them-all strategy has been challenged and execution has been offloaded to other systems, leaving Airflow like an empty shell. But in the end <a href="https://en.wikipedia.org/wiki/I%27ll_be_back?ref=blef.fr">he'll be back</a>.</li><li>GPT-3 applications — It has the potential to revolutionize industries through automation and augmenting human intelligence, but has also raised concerns about its potential negative impact on employment (this bullet has been generated by ChatGPT).</li></ul><p></p><p><strong>Now that I've said this, I think 3 technologies will shape data engineering next year:</strong></p><ul><li>Wasm — WebAssembly is a portable compilation target in the browser.
In human words, it means you can run code in your favourite language in a Firefox tab. One example is <a href="https://pyscript.net/?ref=blef.fr">PyScript</a>, which allows us to run Python in HTML. Thanks to Wasm <strong>we can use a decentralised source of power: your stakeholders' laptops</strong>.</li><li>DuckDB — A single-node in-memory OLAP database. We have not yet seen its full potential. <a href="https://www.blef.fr/data-news-week-22-46/#my-two-cents-about-duckdb">What I think about DuckDB</a>.</li><li><a href="https://dagger.io/?ref=blef.fr">Dagger</a> — A programmable CI/CD engine that you can run everywhere.</li></ul><p></p><p></p><hr><ol><li><a href="https://www.getdbt.com/what-is-analytics-engineering/?ref=blef.fr#what-is-an-analytics-engineer">What is an analytics engineer?</a> (Claire Carroll)</li><li>The Semantic Layer is more than just "something". To be honest, for the moment I take it sarcastically, because I'm not sure this is something really important—at least when I look at my own French market.</li></ol> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 22.51 ]]></title>
                    <description><![CDATA[ Data News #22.51 — Advent of Data wrap-up, how to manage and schedule dbt, welcome new members, buy a data book for Christmas, I command you to hire junior data engineers. ]]></description>
                    <link><![CDATA[ /data-news-week-22-51/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 639c6703dfb0d5003db163fe ]]></guid>
                    <pubDate><![CDATA[ 2022-12-23 ]]></pubDate>
                    <content>
                        <![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/12/image-6.png" class="kg-image" alt loading="lazy" width="2000" height="1333" srcset="https://www.blef.fr/content/images/size/w600/2022/12/image-6.png 600w, https://www.blef.fr/content/images/size/w1000/2022/12/image-6.png 1000w, https://www.blef.fr/content/images/size/w1600/2022/12/image-6.png 1600w, https://www.blef.fr/content/images/size/w2400/2022/12/image-6.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>A gift from me to you (<a href="https://unsplash.com/photos/IPx7J1n_xUc?ref=blef.fr">credits</a>)</figcaption></figure><p>Hey you, if you just subscribed to the Data News yesterday, I wish you a warm welcome ❤️‍🔥. The Data News is your Friday weekly data curation in which I select for you the most interesting—according to me—data articles of the last week. I hope you'll enjoy it ✨.</p><p>Christmas is coming, so whether you celebrate it or not, I wish you a great end of the year and a good time with family and/or friends. There will be a last Data News next week: my 10 must-read articles of 2022. In the meantime you can read Prukalpa's <a href="https://metadataweekly.substack.com/p/reading-list-the-top-5-must-read?ref=blef.fr">5 must-read data blogs from 2022</a>.</p><p>The <a href="https://www.adventofdata.com/?ref=blef.fr">Advent of Data</a> is also coming to an end tomorrow. It has been an awesome ride; I'm so happy we put together such an awesome list of content, and I'm so grateful to the 24 creators who accepted the rules and wrote something for this first year. I'll do a wrap-up of the Advent in January to celebrate what we achieved together.</p><p><em>Remember: the Advent of Data was your daily spark of data joy in December.
Every day a new data article was published by a data creator.</em></p><p></p><h1 id="guide%E2%80%94manage-and-schedule-dbt">Guide—manage and schedule dbt</h1><p>Two days ago I published the most <a href="https://www.blef.fr/manage-and-schedule-dbt/">complete guide about dbt management and scheduling</a>; in case you missed it, you have to check it out! Original deep posts exclusive to Data News members are something I'm willing to do more of next year, to bring additional value to this newsletter.</p><p>Next year I plan to talk about:</p><ul><li>Data engineering and analytics engineering career paths</li><li>The state of data integration—related to another 2023 project 📚</li><li>We have too many choices: my framework for making a decision</li><li>Something you want me to write on?</li></ul><p>Let's go back to dbt. In a nutshell, this guide will give you ideas on how to manage dbt repositories and projects, what to think about to provide a top-notch developer experience, and how to host and schedule your dbt code.</p><p>I'm really proud of the development experience part of the guide, because I think this is still an unresolved part of every dbt project; something is still broken. From the first contact, the local installation, the (web?) IDE, the useless copy-pastes, the code reviews, and the tooling to the development environments, there is a lot to say.</p><div class="kg-card kg-button-card kg-align-center"><a href="https://www.blef.fr/manage-and-schedule-dbt/" class="kg-btn kg-btn-accent">👀 Check the dbt guide</a></div><p>As an extension, two great articles were written this week about custom dbt setups.
The Monzo team detailed how they created their own <a href="https://monzo.com/blog/2022/12/15/building-an-extension-framework-for-dbt?ref=blef.fr">framework on top of dbt</a> to keep up with their growth, and Albert from Superside explained how they migrated <a href="https://medium.com/albert-franzi/dbt-core-airflow-7d94edac9cdf?ref=blef.fr">from dbt Cloud to a custom setup with CI/CD, S3, Docker and Airflow</a>.</p><p><em>PS: a small question. I did not email you about the guide; would you have wanted to receive an email for it?</em></p><p></p><h1 id="give-yourself-a-book-christmas-%F0%9F%8E%81">Give yourself a book Christmas 🎁</h1><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/12/image-7.png" class="kg-image" alt loading="lazy" width="2000" height="1330" srcset="https://www.blef.fr/content/images/size/w600/2022/12/image-7.png 600w, https://www.blef.fr/content/images/size/w1000/2022/12/image-7.png 1000w, https://www.blef.fr/content/images/size/w1600/2022/12/image-7.png 1600w, https://www.blef.fr/content/images/size/w2400/2022/12/image-7.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>Close your screen and read good old books (<a href="https://unsplash.com/photos/lUaaKCUANVI?ref=blef.fr">credits</a>)</figcaption></figure><p>If you need gift ideas for yourself, I have a few books to propose.
The selection is a mix of 2 things I love—data engineering and visualisation.</p><p>Here is the selection 📚:</p><ul><li><a href="https://www.oreilly.com/library/view/fundamentals-of-data/9781098108298/?ref=blef.fr">Fundamentals of Data Engineering</a> — It rapidly became a best-seller. Joe and Matt wrote a well-structured book that covers all the data engineering topics; I firmly recommend it to everyone from juniors to seniors.</li><li><a href="https://www.kimballgroup.com/data-warehouse-business-intelligence-resources/books/data-warehouse-dw-toolkit/?ref=blef.fr">The Data Warehouse Toolkit, 3rd Edition</a> — With the rapid rise of the Analytics Engineering role, data modeling came back as the number one priority for a lot of data teams. Dimensional modeling, the core of the Kimball method, has been a reference for years.</li><li><a href="https://www.effectivedatastorytelling.com/?ref=blef.fr">Effective Data Storytelling</a> — Data storytelling has been really trendy in recent years, but in a lot of data teams, because of the dashboard constraint, we often lack creativity, context or storytelling. This book is a must-read if you want to drive action with data.</li></ul><p>Obviously there are <a href="https://twitter.com/NadiehBremer/status/1605225408542162945?ref=blef.fr">more</a> <a href="https://www.thoughtspot.com/blog/top-10-must-read-books-for-data-and-analytics-leaders-in-2022?ref=blef.fr">books</a> released this year that are awesome, but I just mentioned the ones you should absolutely have.</p><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li><a href="https://www.dataengineeringweekly.com/p/functional-data-engineering-a-blueprint?ref=blef.fr">Functional Data Engineering - A Blueprint</a> — Ananth, from Data Engineering Weekly, the best data engineering newsletter, wrote a great follow-up to Maxime's functional data engineering post.
In the post he shows how we can apply entity and event schematisation to the Lakehouse architecture.</li><li><a href="https://seattledataguy.substack.com/p/tips-for-hiring-junior-data-engineers?ref=blef.fr">Tips for hiring junior Data Engineers</a> — MOST. IMPORTANT. POST. OF. 2022. Every data engineer was a junior once; it's important not to <a href="https://towardsdatascience.com/gatekeeping-and-elitism-in-data-science-74cf19cd5744?ref=blef.fr">gatekeep</a> others by forgetting we once knew nothing about data engineering. It is our duty as seniors to hire juniors and to help them. I guarantee you this is the most satisfying feeling. Aside from my recommendation, the article is awesome and speaks the truth. Last point: I think the max ratio is 3 juniors for 1 senior.</li><li>💥 <a href="https://cloud.google.com/bigquery/docs/data-catalog?ref=blef.fr#data_lineage">BigQuery data lineage</a> — It looks like something huge, but I'm not sure tbh. Soon we will have a <em>data lineage</em> tab in the BigQuery UI. To get it you'll have to activate Data Catalog/Dataplex. This is in public preview.</li><li><a href="https://www.youtube.com/watch?v=7qY17c6Eiio&ref=blef.fr">Panel discussion about licenses in open-source</a>, relations with VCs, etc. Between Doug Cutting (Hadoop co-founder), Maxime Beauchemin (Airflow &amp; Superset creator) and David Nalley (Apache Foundation president). This is really geeky licensing talk, but a few of you might find it interesting.</li><li><a href="https://medium.com/@babak4/maybe-snowflake-isnt-for-you-67069a6dbeca?ref=blef.fr">Maybe Snowflake isn’t for you!</a> — Thoughts on the expensive price of Snowflake, which is reminiscent of Oracle.
tl;dr: take back control of your tools to find the holy added value every data person has spoken about.</li><li><a href="https://www.philschmid.de/whisper-inference-endpoints?ref=blef.fr">Managed transcription with OpenAI whisper and Hugging Face inference endpoints</a> — I don't even understand the first chart of the article, but it looks cool.</li><li><a href="https://eliasbenaddouidrissi.dev/posts/data_engineering_project_monzo/?ref=blef.fr">Personal Finances with Airflow, Docker, Great Expectations and Metabase</a> — when you're a nerd and you like to extend the data pleasure on Saturday.</li><li><a href="https://survey.stackoverflow.co/2022/?ref=blef.fr">StackOverflow 2022 developer survey</a> — 15% of respondent developers are in data roles (but they can wear multiple hats), and when it comes to technologies SQL and Python come just after JavaScript and HTML/CSS, which are everywhere. Last number: Spark is the framework that pays the most nowadays—not a good sign for Spark's future.</li><li><a href="https://coraspe-ramses.medium.com/working-with-large-csv-files-in-python-from-scratch-134587aed5f7?ref=blef.fr">Working with large CSV files in Python from Scratch</a> — A good pattern to optimize your pandas computations by leveraging partitioning.</li><li><a href="https://netflixtechblog.medium.com/data-reprocessing-pipeline-in-asset-management-platform-netflix-46fe225c35c9?ref=blef.fr">Data Reprocessing Pipeline in Asset Management Platform @Netflix</a> (<em>I did not read it, but I want to keep track of it—looks interesting</em>).</li></ul><p></p><h1 id="data-fundraising-%F0%9F%92%B0">Data Fundraising 💰</h1><ul><li><strong><a href="https://qualytics.co/?ref=blef.fr">Qualytics</a></strong> <a 
href="https://www.prnewswire.com/news-releases/qualytics-raises-2-5m-to-help-enterprises-improve-data-quality-301708801.html?ref=blef.fr">raised a $2.5m Seed round</a>. A newcomer in the data quality space, which is already quite crowded. Qualytics is a small US-based team—8 employees on LinkedIn—proposing a "data firewall" that protects and compares your data to detect drifts, anomalies and historical discrepancies.</li></ul><hr><p>See you next week for the last edition of 2022 ❤️. Enjoy the holidays.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ How to manage and schedule dbt ]]></title>
                    <description><![CDATA[ The most complete guide about everything you need to know when you manage and schedule dbt. It features an exhaustive list of solutions. ]]></description>
                    <link><![CDATA[ /manage-and-schedule-dbt/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 63a03d8187c8c8003d9bd4db ]]></guid>
                    <pubDate><![CDATA[ 2022-12-19 ]]></pubDate>
                    <content>
<![CDATA[ <p>Last week <em>dbt Labs</em> decided to change the pricing of their Cloud offering. I've already analysed this in <a href="https://www.blef.fr/data-news-week-22-50/">week #22.50 of the Data News</a>. In a nutshell, <a href="https://www.getdbt.com/pricing/?ref=blef.fr">dbt Cloud pricing</a> is seat-based, which means you pay for each dbt developer. Previously for a team it was $50/month/dev and they increased it to $100/month/dev, a 100% increase, with a team limit of 8 devs and only one project. To go beyond this limit you'll need the Enterprise pricing, which is opaque, as all pricing of this kind is.</p><p>But this article is not about the pricing, which can be very subjective depending on the context—what is $1200 for dev tooling when you pay developers more than $150k per year? Yes it's US-centric, but relevant.</p><p>Let's go deeper than this and list the options out there today to schedule dbt in production. We will also cover what it means to manage dbt<sup>1</sup>. This article is written like a guide that aims to be exhaustive by listing all the possible solutions, but if you feel I missed something do not hesitate to <a href="mailto:christophe@blef.fr">ping me</a>.</p><p></p><h2 id="dbt-a-small-reminder">dbt, a small reminder</h2><p>Everyone—incl. me—is speaking about dbt, but what the heck is dbt? In simple words, dbt Core is a framework that helps you organise all your warehouse transformations. The framework's usage grew a lot over the last years. It's important to say that a lot of the usages we have today were not initially designed by <a href="https://www.getdbt.com/blog/welcome-to-fishtown-analytics/?ref=blef.fr">Fishtown Analytics</a>.</p><p>At first dbt transformations were only SQL queries, but in recent versions, with supported warehouses, it has become possible to add Python transformations. dbt's responsibility is to transform the collection of queries into a usable DAG. 
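To make the DAG idea concrete, here is a toy Python sketch (this is not dbt internals: dbt uses real Jinja, and the model names and SQL below are made up) showing how rendering templated queries can at the same time produce runnable SQL and record the dependency edges:

```python
import re

# Hypothetical models: each value is a templated SQL query, dbt-style.
MODELS = {
    "stg_orders": "select * from {{ source('shop', 'orders') }}",
    "orders_daily": "select order_date, count(*) as n from {{ ref('stg_orders') }} group by 1",
}

REF = re.compile(r"\{\{\s*ref\('([^']+)'\)\s*\}\}")
SOURCE = re.compile(r"\{\{\s*source\('([^']+)',\s*'([^']+)'\)\s*\}\}")

def render_all(models):
    rendered, deps = {}, {}
    for name, sql in models.items():
        # Every ref() occurrence becomes a dependency edge in the DAG.
        deps[name] = set(REF.findall(sql))
        sql = REF.sub(lambda m: m.group(1), sql)
        sql = SOURCE.sub(lambda m: m.group(1) + "." + m.group(2), sql)
        rendered[name] = sql
    return rendered, deps

rendered, deps = render_all(MODELS)
print(deps)  # {'stg_orders': set(), 'orders_daily': {'stg_orders'}}
```

From the recorded edges dbt can then run the queries in dependency order against the warehouse.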
The dependencies between the queries are humanly defined—which means prone to error—thanks to 2 handy functions, <em>source</em> and <em>ref</em>. These 2 functions are called macros because they use Jinja, a Python templating engine; in dbt, macros transform Python+SQL code into SQL, so we can say that we have templated queries.</p><p>Everything I just mentioned can be considered static. If we draw a parallel with software development, this is your codebase. Python and SQL together in the dbt framework form your codebase. You can do development on your codebase. To go to production you'll have to manage and schedule dbt.</p><p>To manage dbt you will have to answer multiple questions; mainly, dbt management is how the data team develops on dbt, how the project is validated/deployed, how you get alerted when something goes wrong, and how you monitor.</p><p>In addition to the dbt management you will have to find the place where dbt will be scheduled. Where dbt will run. dbt scheduling is tricky but not really complicated. If you followed what we've just seen, dbt is a SQL query orchestrator. dbt does not run the queries; all the queries are sent to the underlying warehouse, which means that theoretically dbt does not need a lot of computing power—CPU/RAM—because it only sends SQL queries sequentially to your data warehouse, which does the work.</p><p>Obviously every dbt project has been designed differently, but if we simplify the workflow, every dbt project will need at some point to run one or multiple <a href="https://docs.getdbt.com/reference/dbt-commands?ref=blef.fr">dbt CLI commands</a>.</p><p><strong>In this guide we will first see how we can manage dbt, <em>i.e.</em> git structures, how to code, the CI/CD and the deployment, then in the second part how we schedule dbt code, <em>i.e.</em> on which server and with which triggers. 
</strong></p><div class="kg-card kg-callout-card kg-callout-card-grey"><div class="kg-callout-emoji">💡</div><div class="kg-callout-text">This is a big guide, do not hesitate to use the table of contents to jump to the interesting parts.</div></div><p></p><h2 id="how-to-manage-dbt-%F0%9F%A7%91%E2%80%8D%F0%9F%94%A7">How to manage dbt 🧑‍🔧</h2><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/12/image-4.png" class="kg-image" alt loading="lazy" width="900" height="600" srcset="https://www.blef.fr/content/images/size/w600/2022/12/image-4.png 600w, https://www.blef.fr/content/images/2022/12/image-4.png 900w" sizes="(min-width: 720px) 720px"><figcaption>Data team workshop to set up dbt (<a href="https://unsplash.com/photos/SYTO3xs06fU?ref=blef.fr">credits</a>)</figcaption></figure><p>One of dbt's founding principles is to bring software engineering practices to data development work, especially to the SQL development world. To follow up on this we will try to treat the workflow like an engineering project, even if it can sometimes feel over-engineered.</p><p>You have to consider development and deployment when managing dbt project(s):</p><ul><li>Like every engineering project, the management will obviously start with a git repository—depending on your scale it can be multiple repositories, but if you're just starting I recommend going with a single one.</li><li>The next step is the development experience. What we often call DevEx. Sometimes data teams forget it. To understand this point we have to ask ourselves who the dbt developers are and what they need.</li><li>After development often comes deployment. It can be deployment in all environments or, as for a lot of data teams, only in production, because only production exists. 
But before sending your code to production you still want to validate some things, static or not, in the CI/CD pipelines.</li></ul><h3 id="git-repositories-considerations">Git repositories considerations</h3><p>This is the everlasting debate of every software engineering team: <a href="https://en.wikipedia.org/wiki/Monorepo?ref=blef.fr">monorepo</a> or multirepo? It is tightly linked to another dbt-related question, which is mono-project or multi-project. By default and by design dbt is meant to work mono-project, but when you're starting to grow or when you want clear domain borders the single project can quickly reach its limits.</p><p>As I said previously, if you're just starting with dbt and you're a small team <strong>I still recommend going with one repo, one project.</strong> Try first to organise the <em>models</em> folder correctly before trying to structure at a higher level.</p><div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">❓</div><div class="kg-callout-text">By definition, here a dbt project corresponds to the folder that has been generated by the command <em>dbt init</em>, while a repo is a folder that can be larger than this. That's why a repo can contain multiple projects.</div></div><p>The first question you'll probably hit is: how do I put models in different schemas/datasets? This is the first step of project organisation. The solution is to override the <a href="https://docs.getdbt.com/docs/build/custom-schemas?ref=blef.fr">generate_schema_name</a> macro.</p><p>Then if you want to go for multiple projects you'll maybe have to decide how you build the interface between projects; within the dbt toolkit you have 2 solutions:</p><ul><li>Every project can define exposures<sup>2</sup>. Exposures are then a way to define the downstream usage of the project's models. 
With the exposure nomenclature you can regroup multiple models in the <em>depends_on</em> of a <em>type: application</em> entry that is supposed to use them. If we imagine <em>some-company</em>, with 2 projects—domains—Ops and Marketing, we can define in the Ops exposures the models that we want the outside world to be aware of. Then with some kind of automation we can generate sources accordingly in the Marketing project. To go further, the GoDataDriven team did an awesome talk at Coalesce explaining how you can achieve this: <a href="https://www.youtube.com/watch?v=P1erB7GfIUY&ref=blef.fr">dbt &amp; data mesh: the perfect pair (?)</a>.</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/12/carbon-4--1.png" class="kg-image" alt loading="lazy" width="1828" height="818" srcset="https://www.blef.fr/content/images/size/w600/2022/12/carbon-4--1.png 600w, https://www.blef.fr/content/images/size/w1000/2022/12/carbon-4--1.png 1000w, https://www.blef.fr/content/images/size/w1600/2022/12/carbon-4--1.png 1600w, https://www.blef.fr/content/images/2022/12/carbon-4--1.png 1828w" sizes="(min-width: 720px) 720px"><figcaption>This is a way to define exposures for downstream Marketing usage</figcaption></figure><ul><li>The other solution is to go for a dbt packages structure. In this solution every project—domain—can be installed as a dependency in other projects, but I think it will end up in a nightmare of dependency management. In addition you'll have to be smart in the way you run the models in the end, because package installation could duplicate model execution.</li></ul><p></p><p>Once the project/repo structure has been defined there are still open questions; here are a few:</p><ul><li>How do I structure my dbt models folder? 
You can opt for the <a href="https://docs.getdbt.com/guides/best-practices/how-we-structure/1-guide-overview?ref=blef.fr">dbt recommended solution</a> or for <a href="https://www.adventofdata.com/modern-data-modeling-start-with-the-end/?ref=blef.fr">Brice's recommendations</a>. Personally my only advice here is: don't be shy about creating folders to separate concerns.</li><li>One YAML to rule them all — Do you want to create only one big YAML file that describes all the sources and all the models, or do you split it? In my opinion sources have to be described at the schema/database level and model YAML files at the model level. So it means one YAML per SQL file.</li><li>Who is the real owner of the git repo? The data engineering or the analytics team? — It depends, but I'm in favour, if possible, of giving the responsibility and ownership to the analytics team; dbt is their playground. As a data—platform—engineer it's your responsibility to help them, but it's up to them to learn by doing. Under the hood it also means that dbt project(s) have to be independent from other tools (<em>e.g. the dbt repo should not be in the Airflow repo</em>).</li></ul><h3 id="development-experience-with-dbt">Development Experience with dbt</h3><p>First, an important thing to say: I'm a data engineer and I truly think that my main mission in a data team is to empower others through data tools. In the dbt context it means you have to understand how your analytics team is working. I've also noticed over the years that analytics teams are often not able to identify that they are under-equipped or doing something inefficient. It is your role as a data engineer to identify these issues. It is your role to provide a neat developer experience for every dbt user.</p><p>But who are your dbt users?</p><ul><li>They can be data engineers—working on the founding layers of the modeling. 
Probably the sources and the staging tables.</li><li>They can be analytics engineers—doing the same as the data engineers in the previous point and going deeper into the modeling layers: the core, intermediate and mart models.</li><li>They can be data analysts, business analysts, web analysts—people using the final mart models, and sometimes also building them. They mainly want to be able to understand where or how a column is computed, or to make small changes. They also need a place to store their <a href="https://docs.getdbt.com/docs/build/analyses?ref=blef.fr">analyses</a>, or all the modeling they were doing before in their BI tool, which is often their main playground.</li><li>Management roles (head of data, VP tech, etc.)—they want to be sure dbt is the right tool, but they also want a higher-level view of the modeling; the dbt docs are sometimes a good first entry point for them.</li><li>Stakeholders—I'm not sure they are dbt users; dbt is too technical, and you don't want them to see the whole complexity that exists in it.</li></ul><p>Now that we have listed a few of the dbt users, let's focus on the development experience, especially for the analytics team—analytics engineers and data analysts. It is super important to provide a smooth experience for these users because they will spend a lot of working hours in the models; the neater the workflow is, the happier people will be.</p><p>What are the levers you can act on to provide this great experience:</p>
<aside class="gh-post-upgrade-cta">
    <div class="gh-post-upgrade-cta-content" style="background-color: #373f48">
                <h2>This post is for subscribers only</h2>
            <a class="gh-btn" data-portal="signup" href="#/portal/signup" style="color:#373f48">Subscribe now</a>
            <p><small>Already have an account? <a data-portal="signin" href="#/portal/signin">Sign in</a></small></p>
    </div>
</aside>
 ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 22.50 ]]></title>
                    <description><![CDATA[ Data News #22.50 — dbt Cloud pricing x2, Facebook ads trials in California and usual fast news, Dataiku fundraising and the Advent of data. ]]></description>
                    <link><![CDATA[ /data-news-week-22-50/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 639c3338dfb0d5003db15a81 ]]></guid>
                    <pubDate><![CDATA[ 2022-12-16 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/12/image-3.png" class="kg-image" alt loading="lazy" width="900" height="595" srcset="https://www.blef.fr/content/images/size/w600/2022/12/image-3.png 600w, https://www.blef.fr/content/images/2022/12/image-3.png 900w" sizes="(min-width: 720px) 720px"><figcaption>Prepping me to deliver Christmas' Data News (<a href="https://unsplash.com/photos/EC92VYoYwC4?ref=blef.fr">credits</a>)</figcaption></figure><p>Hey you, the end of the year is coming soon. I really liked this year with you. It was super fun to write my opinion on data topics every Friday of the year. I don't know yet if next year I'll be able to pull out stuff without repeating myself, I hate repeating myself, but for sure I'll try and I'll continue.</p><p>We still have 2 Fridays left until the end of the year; I'll try, like last year, to do special editions, but no promise.</p><p>As a small reminder, the <a href="https://www.adventofdata.com/?ref=blef.fr">Advent of Data</a> 🎄 is still running and this week we got awesome articles again! So go check them out. For instance Marie and Bryan wrote great pieces to help you get started with data: <a href="https://www.adventofdata.com/is-tourism-back-to-its-pre-covid/?ref=blef.fr">Is tourism back to its pre-COVID-crisis level?</a> and <a href="https://www.adventofdata.com/get-started-with-data/?ref=blef.fr">How to get started with data and help your local community</a>.</p><p></p><h1 id="dbt-cloud-pricing-update-%F0%9F%8E%81">dbt Cloud pricing update 🎁</h1><p>dbt Labs announced yesterday a nice Christmas present for all dbt Cloud customers: <a href="https://www.getdbt.com/blog/dbt-cloud-package-update/?ref=blef.fr">a new pricing model</a>. But you know, this is the kind of Christmas present your uncle offers you that you don't like. Something you want to return directly because it does not suit you.</p><p>Let's have a look at it. 
Below are listed the major changes:</p><ul><li>Team plan x2. From <em>$50/month/per dev</em> to <em>$100/month/per dev</em>, but limited to 8 devs</li><li>Team plan is now limited to only one project</li><li>Team plan will include the Semantic Layer no one is asking for</li><li>The free tier is now announced as US-based only</li></ul><p>Small teams will see their dbt Cloud budget increase by 100%. For instance a small team of 2 analytics engineers will now pay $2400/year just to have a server running their SQL queries and a web IDE that is yet to be perfected.<br><br>Obviously, dbt Labs has all the data points regarding activity and feature usage to take this decision, but it feels weird, as dbt Cloud was a simple and costless way for small users to enter the dbt world. </p><p>In terms of strategy it also means that dbt Labs wants to push companies towards the Enterprise plan with hidden pricing—don't forget <em>transparency always wins</em> is one of dbt Labs' core values.</p><p>Usual readers of the Data News might notice that I don't go softly on dbt Labs when it comes to their Cloud product, but this is the reality: if I caricature a bit, right now dbt Cloud is only a web IDE with the capability to run your models. It should be a commodity; for the moment the real value of dbt exists only in Core and in the community. In the open-source part.</p><p>As a comparison, I have paid for PyCharm for years; it costs me €99/year and I can <em>almost</em> do everything that is included in the dbt web IDE, plus I have all my comfortable developer setup. 
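The back-of-the-envelope maths behind that $2400 figure, using the Team plan numbers quoted above:

```python
old_rate, new_rate = 50, 100      # $/month per developer on the Team plan
devs, months = 2, 12              # the small team from the example

old_budget = old_rate * devs * months   # previous yearly budget
new_budget = new_rate * devs * months   # new yearly budget
increase_pct = 100 * (new_budget - old_budget) / old_budget
print(new_budget, increase_pct)  # 2400 100.0
```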
The pricing difference is not worth it.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/12/Frame-11.png" class="kg-image" alt loading="lazy" width="1985" height="1858" srcset="https://www.blef.fr/content/images/size/w600/2022/12/Frame-11.png 600w, https://www.blef.fr/content/images/size/w1000/2022/12/Frame-11.png 1000w, https://www.blef.fr/content/images/size/w1600/2022/12/Frame-11.png 1600w, https://www.blef.fr/content/images/2022/12/Frame-11.png 1985w" sizes="(min-width: 720px) 720px"><figcaption>dbt Cloud's new pricing compared to previous one</figcaption></figure><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li>Meta—Facebook—has been sued in the Northern District of California, following Cambridge Analytica scandal leftovers, by a Californian law firm. You'll probably say: "<em>there is nothing new under the sun</em>". OK. Then <a href="https://storage.courtlistener.com/recap/gov.uscourts.cand.327471/gov.uscourts.cand.327471.1085.29.pdf?ref=blef.fr">court files went public</a> and <strong>listed the tables storing user identifiers for ads: 11051 Hive tables and 1190 Python pipelines</strong>. Nothing new under the sun.</li><li>Yep, 11051 Hive tables in the previous bullet point: you didn't misread it. They need 11051 tables to run their ads system.</li><li><a href="https://www.timeplus.com/post/query-your-data-in-kafka-using-sql?ref=blef.fr">Query your data in Kafka using SQL</a> — This is a post that compares Flink, ksqlDB, Trino, Materialize, RisingWave and Timeplus (the authors) for querying Kafka. 
Even if it's vendor oriented, this is a good starting point to get an overview of what you can expect from these tools.</li><li><a href="https://ownyourdata.ai/wp/traditional-vs-modern-analytics-data-processing-part-2/?ref=blef.fr">Traditional vs modern analytics data processing (part 2)</a> — Petrica compares two ways to write data models, with schema auto-discovery on and off.</li><li><a href="https://www.youtube.com/playlist?list=PLgyvStszwUHjko19Z3PxkBxApbxgVjWp8&ref=blef.fr">Airbyte move(data) conf videos</a> — A YouTube playlist with 38 videos, which I did not watch for lack of time, from the online data engineering conference Airbyte organised a few weeks ago. You can read <a href="https://medium.com/@matt_weingarten/move-data-takeaways-866b3d36ddc2?ref=blef.fr">Matt's takeaways</a>.</li><li><a href="https://seattledataguy.substack.com/p/a-zero-etl-future?ref=blef.fr">A Zero ETL Future</a> — Benjamin explores the promise of Zero ETL following announcements from AWS and Snowflake.</li><li><a href="https://engineering.hometogo.com/how-hometogo-has-connected-superset-dashboards-to-dbt-exposures-to-improve-data-discoverability-3d0add162e4a?ref=blef.fr">How HomeToGo has connected Superset Dashboards to dbt Exposures</a> — A small article but great ideas.</li><li><a href="https://dataengineeringcentral.substack.com/p/why-is-everyone-trying-to-kill-airflow?sd=pf&ref=blef.fr">Why is everyone trying to kill Airflow?</a> — Imagine a Cluedo where Airflow is Dr. Black. 
Who did it, when and with which weapon?</li><li><a href="https://medium.com/@zanasimsek/migration-of-postgres-from-9-6-to-10-via-pglogical-for-a-debezium-included-tech-stack-61114cb3f783?ref=blef.fr">Migration of Postgres from 9.6 to 10 via PgLogical for a Debezium-included tech stack</a>.</li><li><a href="https://medium.com/@TianchenW/unit-test-sql-using-dbt-1b8aa214365e?ref=blef.fr">Unit Test SQL using dbt</a> — A small setup using seeds and tests to create a unit-testing framework.</li></ul><p></p><h1 id="data-fundraising-%F0%9F%92%B0">Data Fundraising 💰</h1><ul><li><a href="https://tech.eu/2022/12/13/french-founded-dataiku-raises-200-million/?ref=blef.fr"><strong>Dataiku</strong> raised, once again, a $200m Series F</a>. This new round brings the total amount of money raised to $846m, but with the global economic slowdown they did it at a lower valuation—$3.7b. Dataiku has been one of the first companies to take the AI path with an all-in-one product. But it seems that over the years, as they focused on big corporations, they struggled to sell their graphical drag-n-drop UI to smaller businesses.</li></ul><p><em>As a side note, it is crazy to compare dbt Labs' valuation with Dataiku's. Almost the same, but even if I don't like Dataiku, the depth of the two products is by far not comparable.</em></p><hr><p>See you next week ❤️</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 22.49 ]]></title>
                    <description><![CDATA[ Data News #22.49 — ChatGPT, Paris Airflow Meetup takeaways, GoCardless data contracts implementation, schema drift, Pathway and Husprey fundraise. ]]></description>
                    <link><![CDATA[ /data-news-week-22-49/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 6392edbac5b576003d799be6 ]]></guid>
                    <pubDate><![CDATA[ 2022-12-09 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/12/image-2.png" class="kg-image" alt loading="lazy" width="900" height="600" srcset="https://www.blef.fr/content/images/size/w600/2022/12/image-2.png 600w, https://www.blef.fr/content/images/2022/12/image-2.png 900w" sizes="(min-width: 720px) 720px"><figcaption>This is what we call a Chat in French (<a href="https://unsplash.com/photos/9UUoGaaHtNE?ref=blef.fr">credits</a>)</figcaption></figure><p>Hello there, this is Christophe, live from the human world. Last week has been totally driven by the <a href="https://openai.com/blog/chatgpt/?ref=blef.fr">ChatGPT</a> frenzy; the social networks I follow are spammed with conversation screenshots and hype. On my side I don't know what the future holds for us, but for sure MaaS—Models as a Service—does not look bright to me. OpenAI executed it perfectly: they dedicated a gigantic amount of computing power to offer a neat pay-as-you-query experience, like BigQuery. And I bet it will transform our industry as much as BigQuery did. But do we want big companies holding decision power in their own pre-trained models, leaving real data science only to the big ones?</p><p>I don't want to be alarmist, this is not the tone I have here in the Data News, but do we want a future where the support chat of our home train service or our mobile carrier is, under the hood, run by a Musk company? OK, it's a caricature, but imagine. I can't wait to see Excel sheets comparing the average cost per word written by a human and by a machine.</p><p>🎄 Let's switch topics. It's time for the <a href="https://www.adventofdata.com/?ref=blef.fr">Advent of Data</a> heads-up. Since last week's edition we have had 6 new articles published in the calendar. Go taste your daily chocolates. 
In a nutshell you can now <a href="https://www.adventofdata.com/python-pip-package-for-data-team/?ref=blef.fr">develop an internal pip package for your data team</a>, <a href="https://www.adventofdata.com/clean-up-your-data-swap-but-make-it-a-team-sport/?ref=blef.fr">handle governance</a>, <a href="https://www.adventofdata.com/the-go-to-guide-for-how-to-work-with-data-people/?ref=blef.fr">explain to stakeholders what you're doing</a>, <a href="https://www.adventofdata.com/embedded-machine-learning/?ref=blef.fr">send AI models to small devices</a> while understanding <a href="https://www.adventofdata.com/rust-for-data-engineering/?ref=blef.fr">Rust for data engineering</a> and <a href="https://www.adventofdata.com/geospatial-metrics/?ref=blef.fr">3 key geospatial metrics</a>.</p><p></p><h1 id="paris-airflow-meetup-%F0%9F%A7%91%E2%80%8D%F0%9F%94%A7">Paris Airflow Meetup 🧑‍🔧</h1><p>On Tuesday I organised the 4th Paris Apache Airflow Meetup. The first one since 2019, and it was awesome: I met a lot of people, and the talks and the venue were great. The goal now is to do one meetup per month in 2023. For this I'll look for speakers and hosts, so if you live in France and you want to share something with the French community, reach out to me; I have a lot of ideas.</p><p>After a small introduction the evening started with a presentation by <a href="https://www.linkedin.com/in/clementdelpech?miniProfileUrn=urn%3Ali%3Afs_miniProfile%3AACoAAA3sBvYBCpfgCubxNNe0TGvi1rUEEm8ii6A&lipi=urn%3Ali%3Apage%3Ad_flagship3_search_srp_all%3BQ2aa6pRFQaW8ky7agjLg5A%3D%3D&ref=blef.fr">Clément</a> and Steff from the leboncoin data engineering team. They shared the good practices they implemented to scale their Airflow development. As a figure, at leboncoin 7 teams are using Airflow to operate more than 1000 DAGs. 
Here is a short takeaway in English of their presentation:</p><ul><li>Stop using custom Operators or Hooks if there is a community one available—this point is particularly relevant if you feel your custom stuff creates tech debt</li><li>Be careful with <a href="https://airflow.apache.org/docs/apache-airflow/stable/howto/variable.html?ref=blef.fr">Airflow's variables</a>: each <em>Variable.get</em> does a database call and leads to bad performance. The replacement solution is to use Jinja templating combined with something more traditional in app development: a constants file.</li><li>Use <em>priority_weight</em>; for this they created an enum with 5 different humanly understandable priorities.</li><li>And lastly: give ownership context to DAGs, develop custom macros for repeated tasks like <em>generate_s3_url,</em> <strong>use the pendulum date library to avoid the pain of managing dates</strong>, use cluster policies and, finally, write tests. And if you don't know how to write tests, have a look at how Airflow is written and copy how they do it.</li></ul><p>Then the Qonto data engineering team, with <a href="https://www.linkedin.com/in/charles-cazals/?ref=blef.fr">Charles</a> &amp; <a href="https://www.linkedin.com/in/charles-andre/?ref=blef.fr">Charles</a>, shared how they integrated dbt within Airflow. After a small introduction of the classic modern data stack combo—Snowflake-dbt-Tableau-Airflow—Charles presented what dbt is and what the alternatives are to integrate dbt within Airflow. </p><p>In a nutshell you have 3 options to do it:</p><ul><li>You use the <code>DbtCloudRunJobOperator</code>, but it requires dbt Cloud</li><li>You use a <code>BashOperator</code> that runs the <code>dbt run</code> command</li><li>You use multiple <code>BashOperator</code>s running <code>dbt run --select model</code> commands</li></ul><p>Qonto decided to go for the last option.  Then the other Charles detailed what it means and how they monitor what is happening. 
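To give an idea of that last option, here is a minimal stdlib-only sketch (model names and the dependency map are hypothetical; in a real Airflow DAG each command below would be wrapped in a <code>BashOperator</code>, and the dependencies would typically be derived from dbt's manifest.json rather than hardcoded):

```python
from graphlib import TopologicalSorter

# model -> set of upstream models it depends on (hypothetical example)
DEPS = {
    "stg_orders": set(),
    "stg_customers": set(),
    "orders_daily": {"stg_orders", "stg_customers"},
}

# One `dbt run --select <model>` per task, in a valid dependency order.
order = list(TopologicalSorter(DEPS).static_order())
commands = ["dbt run --select " + model for model in order]
print(commands[-1])  # dbt run --select orders_daily
```

One task per model is what gives the flexibility (and the task explosion) discussed next.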
Obviously there are a few pros/cons to this approach:</p><ul><li><strong>cons</strong>: the Airflow UI does not like having too many tasks (especially the graph view); in their setup with a KubernetesExecutor it means a lot of cold starts, because a model run means a new pod with a dbt CLI bootstrap; and you have a lot of dependencies to manage</li><li><strong>pros</strong>: you are very flexible because you can run one model at a time if you want; incident management is simplified because dbt's flaws on this topic are filled by Airflow standards; and monitoring can be done</li></ul><p>In the end they showcased the Metabase dashboard helping them understand every dbt run. It is very complete, mixing data from Airflow—with a clever trick: they use XCom to save metadata in the database to be able to use it in Metabase—and the dbt artifacts.</p><p></p><!--kg-card-begin: html--><p style="text-align:center;"><a href="https://drive.google.com/drive/u/0/folders/1obCKu97ifdt4SvErBZg0it5rv35ZKnIh?ref=blef.fr" style="cursor: pointer; background-color: #E4E6E1; padding: 10px 20px; border-radius: 5px;">👀 See the slides</a></p><!--kg-card-end: html--><p></p><p><em>PS: shout-out to the people I met there who read the newsletter; your kind words are important and give me a lot of motivation. 
See you soon ❤️.</em></p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/12/1670369787586-6-.jpg" class="kg-image" alt loading="lazy" width="2000" height="1333" srcset="https://www.blef.fr/content/images/size/w600/2022/12/1670369787586-6-.jpg 600w, https://www.blef.fr/content/images/size/w1000/2022/12/1670369787586-6-.jpg 1000w, https://www.blef.fr/content/images/size/w1600/2022/12/1670369787586-6-.jpg 1600w, https://www.blef.fr/content/images/2022/12/1670369787586-6-.jpg 2048w" sizes="(min-width: 720px) 720px"><figcaption>Studious atmosphere to listen Charles^2 (credits Alaeddine)</figcaption></figure><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li><a href="https://medium.com/memphis-dev/how-to-avoid-schema-drift-a36bd06ed622?ref=blef.fr">How to avoid “schema drift”</a> — This article put word on the schema drift concept, which is the same as configuration drift (e.g. in Terraform) but for data. It happens for instance when you have 2 producers of the same event but they are not using the same type for a column. Although the article is a bit vendor oriented it is still relevant and will ring a bell to a lot of engineers.</li><li><a href="https://www.montecarlodata.com/blog-data-contracts/?ref=blef.fr">7 lessons from GoCardless’ implementation of data contracts</a> — Before ChatGPT hype the whole LinkedIn was speaking of data contracts. Here are takeaways from GoCardless. 
To be honest I should take more space than a bullet point to detail what they are doing; the key learnings are worth reading.</li><li><a href="https://towardsdatascience.com/what-i-learned-in-my-first-6-months-as-a-director-of-data-science-d9b7b98a48f7?ref=blef.fr">What I learned in my first 6 months as a director of data science</a> — tl;dr be ready to rumble for the hiring competition.</li><li><a href="https://shopifyengineering.myshopify.com/blogs/engineering/server-sent-events-data-streaming?ref=blef.fr">Using server sent events to simplify real-time streaming at scale</a> — An interesting discussion about concepts around real-time communication for apps.</li><li><a href="https://engineering.grab.com/zero-trust-with-kafka?ref=blef.fr">Zero trust with Kafka</a> — Sorry, I've read too many articles these days and my brain can't process this one, but I like the diagrams.</li><li><a href="https://ergestx.substack.com/p/learning-advanced-sql?ref=blef.fr">How to get REALLY good at advanced SQL</a> — It may be interesting for a few of us; this article touches on how to level up your SQL expertise.</li><li><a href="https://medium.com/helpshift-engineering/generating-chatbot-performance-insights-using-spark-sql-at-helpshift-6cf15e905604?ref=blef.fr">Generating Chatbot performance insights using Spark SQL at Helpshift</a>.</li></ul><p></p><h1 id="data-fundraising-%F0%9F%92%B0%F0%9F%87%AB%F0%9F%87%B7">Data Fundraising 💰🇫🇷</h1><ul><li><a href="https://pathway.com/?ref=blef.fr"><strong>Pathway</strong></a> <a href="https://sifted.eu/articles/female-led-deeptech-pathway-ai/?ref=blef.fr">raises $4.5m pre-seed round</a>. This is an insane amount of money for a pre-seed. Pathway is a French startup in open beta providing real-time processing. You pip install their package and then you're able to transform your tables in Python. 
Transformations are operations like select, index, filter, join or map.</li><li><a href="https://www.husprey.com/blog/seed?ref=blef.fr"><strong>Husprey</strong> raises $3m seed round</a>. Husprey provides an alternative to the dashboard world for data analyses with advanced SQL notebooks. They already have a large number of connectors and even integrate with dbt. Husprey is also a French-founded company.</li></ul><hr><p>See you next week ❤️.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 22.48 ]]></title>
                    <description><![CDATA[ Data News #22.48 — Very fast news, Advent of Data debuts, Snowflake sends emails, deprecates dashboards. ]]></description>
                    <link><![CDATA[ /data-news-week-22-48/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 638a303db41d25003d56f045 ]]></guid>
                    <pubDate><![CDATA[ 2022-12-03 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/12/image.png" class="kg-image" alt loading="lazy" width="2000" height="1499" srcset="https://www.blef.fr/content/images/size/w600/2022/12/image.png 600w, https://www.blef.fr/content/images/size/w1000/2022/12/image.png 1000w, https://www.blef.fr/content/images/size/w1600/2022/12/image.png 1600w, https://www.blef.fr/content/images/size/w2400/2022/12/image.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>Train(s) (<a href="https://unsplash.com/photos/rbBEs6Hljyg?ref=blef.fr">credits</a>)</figcaption></figure><p>Hey you, this is an unusual Saturday. I'm terribly late with this newsletter. This week I had a huge amount of work to deal with and <strong>we've launched the <a href="https://www.adventofdata.com/?ref=blef.fr">Advent of Data</a>, your daily spark of data in December</strong>. Thanks to everyone who accepted to participate; we already published the first 3 articles and I can't wait to read everything else the writers are working on.</p><p>In a nutshell the first 3 articles are:</p><ul><li><strong>MLOps isn’t DevOps for ML</strong> — Abi strongly answers thenewstack.io's claim that the machine learning field should find DevOps practitioners to fill the lack of people in ML operations. </li><li><strong>Using Airflow the wrong way</strong> — An experimental article I wrote where I explore Airflow as a framework rather than an all-in-one scheduler/orchestrator tool. 
What if we decide to schedule Airflow DAGs in Gitlab?</li><li><strong>Modern Data Modeling: Start with the End?</strong> — Brice wrote about dbt project structure and the foundations of a working modelling approach.</li></ul><p></p><!--kg-card-begin: html--><p style="text-align:center;"><a href="https://www.adventofdata.com/?ref=blef.fr" style="cursor: pointer; background-color: #E4E6E1; padding: 10px 20px; border-radius: 5px;">🎄 Go to the Advent of Data 🎄</a></p><!--kg-card-end: html--><p></p><p>On a side note, we are 200 members away from 2,000 and it'd be an awesome gift to reach this number before next year. So if you like the newsletter maybe recommend it to your co-workers 🙃.</p><p></p><h1 id="fast-news-very-fast-this-time-%E2%9A%A1">Fast News (very fast this time) ⚡</h1><p>Because I want to deliver the news as soon as possible after my initial delay and a few IRL adventures—I'm currently stuck between Germany and France—this edition will only be a collection of bullet points with opinions. </p><ul><li>Joe Reis launched his Substack — Joe is the co-author of the great <em>The Fundamentals of Data Engineering</em> and his blog already has 2 articles I deeply recommend: <a href="https://joereis.substack.com/p/no-extra-credit-for-complexity?ref=blef.fr">No extra credit for complexity</a> &amp; <a href="https://joereis.substack.com/p/groundhog-days?ref=blef.fr">Groundhog Days</a>. The first article can be summed up as: aim for simplicity in every system you build; the second one tries to answer how data can be believable and add value.</li><li><a href="https://hoffa.medium.com/hey-snowflake-send-me-an-email-243741a0fe3?ref=blef.fr">Hey Snowflake, send me an email</a> — Christmas is sometimes the time of the year when magic happens. And once again it happened. Felipe showcases how you can send an email from Snowflake. But, sadly, let's be honest, I hate the way it has to be done. 
Stored procedures, meh, feels like Oracle.</li><li><a href="https://doordash.engineering/2022/11/29/how-doordash-secures-data-transfer-between-cloud-and-on-premise-data-centers/?ref=blef.fr">How DoorDash secures data transfer</a> — This is network stuff, but still interesting for a large part of data engineers. DoorDash needed to send traffic from AWS to the on-premise datacenters of their payment providers.</li><li><a href="https://sarahsnewsletter.substack.com/p/the-thrill-of-deprecating-dashboards?ref=blef.fr">The thrill of deprecating dashboards</a> — Last week, while summarizing my data dream team presentation, I said that every data team should clean their BI tool every 6 months. This week Sarah shares a few tips on how to do it, from dumping BI data to the warehouse to a stats report before cleaning.</li><li><a href="https://towardsdatascience.com/how-data-and-finance-teams-can-be-friends-and-stop-being-frenemies-7ecc357f51ef?ref=blef.fr">How data and finance teams can be friends</a> — As weird as it can be, the data team is sometimes seen as an annoying stakeholder by business teams, often because we are stuck between the search for engineering stability and keeping a fast delivery pace to follow growth. This post tries to show how the data team should work with the finance team to avoid this spat.</li><li><a href="https://annageller.medium.com/how-to-manage-data-teams-build-a-reliable-platform-ensure-data-quality-bd56ab81f0bf?ref=blef.fr">How to manage data teams, build a reliable platform &amp; ensure data quality</a> — 20 bullet points split into 4 categories. Anna shares a good checklist to build data team foundations.</li><li><a href="https://preset.io/blog/intro-data-modeling/?ref=blef.fr">Introduction to Data Modeling</a> — Data modeling is the trendy skill everyone wants to learn today; dbt and the Analytics Engineering trend put modeling back in the front seat. 
This article by the Preset team is a good introduction.</li><li><a href="https://www.technologyreview.com/2022/11/25/1063707/ai-minecraft-video-unlock-next-big-thing-openai-imitation-learning/?ref=blef.fr">A bot that watched 70,000 hours of Minecraft could unlock AI’s next big thing</a> — Yes, generative models are fun, but have you looked at imitation or reinforcement learning? It's crazy how impressive these kinds of models are, and I really like the fact that video games are the medium for this research. <br><em>PS: <a href="https://upcommons.upc.edu/bitstream/handle/2117/367221/164814.pdf?sequence=1&isAllowed=y&ref=blef.fr">the same stuff is happening</a> in Rocket League and it is also impressive.</em></li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/12/image-1.png" class="kg-image" alt loading="lazy" width="900" height="540" srcset="https://www.blef.fr/content/images/size/w600/2022/12/image-1.png 600w, https://www.blef.fr/content/images/2022/12/image-1.png 900w" sizes="(min-width: 720px) 720px"><figcaption>Generative trains (<a href="https://unsplash.com/photos/ZJKE4XVlKIA?ref=blef.fr">credits</a>)</figcaption></figure><p></p><h4 id="big-tech-watch">Big tech watch</h4><ul><li><a href="https://trino.io/blog/2022/11/28/trino-summit-2022-apple-recap.html?ref=blef.fr">Trino at Apple</a> — Getting inspiration from big tech companies has always been a great way to discover patterns. 
This time Apple engineers shared at Trino—good old Presto for lost people—Summit how they use it.</li><li><a href="https://engineering.linkedin.com/blog/2022/topicgc_how-linkedin-cleans-up-unused-metadata-for-its-kafka-clu?ref=blef.fr">How LinkedIn cleans up unused metadata for its Kafka clusters</a> — Kafka garbage collection, if you like to understand Kafka internals this post is for you.</li><li><a href="https://engineering.fb.com/2022/11/30/data-infrastructure/static-analysis-sql-queries/?ref=blef.fr">Enabling static analysis of SQL queries at Meta</a> — reducing the feedback loop on your data models edition is probably the biggest challenge data teams are facing today. I'm not afraid to say it. This is not data contracts or Rust, I think that the most annoying thing for a data team is the time lost on data models development. This is why having a great static SQL analysis is a good starting point in reducing the amount of manual steps. Obviously Meta is Meta and they redeveloped everything from the ground up.</li></ul><p></p><hr><p>Thank you all and see you next week ❤️.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 22.47 ]]></title>
                    <description><![CDATA[ Data News #22.47 — Advent of data 2022, how to build the data dream team, Postgres to DynamoDB, graphs and scaled data mesh. ]]></description>
                    <link><![CDATA[ /data-news-week-22-47/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 6380a8b411f4c8003d569b6e ]]></guid>
                    <pubDate><![CDATA[ 2022-11-25 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/11/image-10.png" class="kg-image" alt loading="lazy" width="900" height="612" srcset="https://www.blef.fr/content/images/size/w600/2022/11/image-10.png 600w, https://www.blef.fr/content/images/2022/11/image-10.png 900w" sizes="(min-width: 720px) 720px"><figcaption>Capturing the news (<a href="https://unsplash.com/photos/LZ4EQjr-aHE?ref=blef.fr">credits</a>)</figcaption></figure><p>Hello you, I hope this data news finds you well. Time flies to be honest.</p><p>I've launched in a rush an <strong>Advent of Data</strong>. The goal is simple: in December, 24 data people will produce 24 data gems. Every day a new piece of content will be released on a dedicated website. If you wanna join the initiative please reply, we are still looking to fill a few slots. I know it's late notice, but this is a good occasion to contribute to the data community.</p><p></p><h1 id="how-to-build-the-data-dream-team">How to build the data dream team</h1><p>This Monday I did my first ever presentation at an international meetup, in English. The experience was great and I enjoyed it, and I hope people in the audience liked it too. A video of the whole presentation should be out soon, but while waiting here is a small glimpse of the talk. </p><p>In this presentation I tried to share ideas on <a href="https://docs.google.com/presentation/d/1hTqtvGOoVyJ7whYpQ2jRLFLJliJHwuC473xo0iI0Ons/edit?ref=blef.fr#slide=id.gfc4a593a50_0_30">how you can create a data dream team</a>. It is meant to be a collection of ideas and concepts you have to think about rather than a go-to solution. I'd also say that you should avoid blindly following general advice, because implementation always depends. 
It always depends on so many things: the product, the resources you have, the company vision, the localisation, etc.</p><p>So yeah, right now the data market is pretty hot. A lot of companies are heavily looking for senior data engineers and analytics people—whether DA or AE—<a href="https://layoffs.fyi/?ref=blef.fr">while layoffs are as high as during the COVID period</a>. In my opinion, in order to create the data dream team you should understand your team creation funnel. Something like:</p><ul><li><strong>Attract</strong> — you need to make people apply, or at least reply to talent acquisition managers</li><li><strong>Welcome</strong> — you never get a second chance to make a first impression, so pay attention to the first week</li><li><strong>Onboard</strong> — after the welcoming part you need to pay attention to the first 6 months</li><li><strong>Keep</strong> — this is as important as the previous steps in the funnel; you have to work to keep people satisfied</li></ul><p>At the meetup I especially detailed what you can do to keep people. 
You have to build the data dream team <strong>everyone wants to join and no one wants to leave</strong>.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/11/Screenshot-2022-11-25-at-17.04.30.png" class="kg-image" alt loading="lazy" width="2000" height="1106" srcset="https://www.blef.fr/content/images/size/w600/2022/11/Screenshot-2022-11-25-at-17.04.30.png 600w, https://www.blef.fr/content/images/size/w1000/2022/11/Screenshot-2022-11-25-at-17.04.30.png 1000w, https://www.blef.fr/content/images/size/w1600/2022/11/Screenshot-2022-11-25-at-17.04.30.png 1600w, https://www.blef.fr/content/images/size/w2400/2022/11/Screenshot-2022-11-25-at-17.04.30.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>A few ideas on what you can do to create a great data team working environment</figcaption></figure><p>To be honest, it's impossible to have everything done instantly; this is more of a long-term game. I also think that there are 3 major levers that are very important in the happiness of a team.</p><ul><li>You need to find the correct role ratio. I mean, how many data engineers the team should have compared to scientists and analysts. In the past Jesse Anderson always advocated for 2-3 DE per DA/DS in a simple team and 4-5 in a more complex setup. I still think this is only a dream. As of today I believe a good ratio would be <strong>DE / (DA + DS) &gt; 1</strong>. Managing this ratio is really about managing frustration. Put simply: the fewer engineers, the more frustrated data people will be.</li><li>Define the vision, the strategy and the roadmap of the data team. Every data team goes through an identity crisis at some point. A lot of data teams started out doing Shadow IT, saying yes to every data-related project. But at some point it has to stop. The data team's mission should be clear and understood by everyone.</li><li>Last but not least, aim for no tech debt. Obviously this is easier said than done. 
But this is something that should be tackled early in a team, because it is another topic that leads to frustration. And frustration leads to resignation.</li></ul><p>Finally I have a slide that I really like, with strong opinions that are just meant to make people think. Here it is below:</p><ul><li>Automate everything (IaC)</li><li>Data engineers don’t write ETL</li><li>Standards, a straight pipe is easier to fix than a curved one</li><li>Data analysts know data better than everyone</li><li>Do Python, don't do Java</li><li>Real time is useless</li><li>Describe every warehouse field</li><li>SREs and software engineers are your best friends</li><li>Who has 0 pipeline issues in the last 30 days?</li><li>Who can’t answer this question in less than 5s?</li><li>Ask your DE to talk to stakeholders</li><li>GDPR—no one does it, right?</li></ul><p><em>This is my presentation in a nutshell. I'm curious to hear what you think about it. In the last slide of my presentation you have <a href="https://docs.google.com/presentation/d/1hTqtvGOoVyJ7whYpQ2jRLFLJliJHwuC473xo0iI0Ons/edit?ref=blef.fr#slide=id.gfc4a593a50_0_36">links to 10 articles</a> that will help you for sure.</em></p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/11/image-11.png" class="kg-image" alt loading="lazy" width="900" height="600" srcset="https://www.blef.fr/content/images/size/w600/2022/11/image-11.png 600w, https://www.blef.fr/content/images/2022/11/image-11.png 900w" sizes="(min-width: 720px) 720px"><figcaption>WOW MY DATA TEAM IS AWESOME 🤣🤣🤣🤣 (<a href="https://unsplash.com/photos/p74ndnYWRY4?ref=blef.fr">credits</a>)</figcaption></figure><p></p><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li><a href="https://www.strategy-business.com/article/In-a-data-led-world-intuition-still-matters?ref=blef.fr">In a data-led world, intuition still matters</a> — The title says it all. 
I mainly think this is a reminder that data-driven decisions are good but, as Alfred Sauvy once said, "<em>Numbers are fragile beings who, by dint of being tortured, end up confessing everything we want them to say</em>" (thanks to Pierre 😉).</li><li><a href="https://www.businessinsider.com/google-ai-write-fix-code-developer-assistance-pitchfork-generative-2022-11?ref=blef.fr">Google has a secret new project that is teaching artificial intelligence to write and fix code</a> — We are still waiting for self-driving cars to really replace drivers, so lmao.</li><li><a href="https://tech.instacart.com/from-postgres-to-amazon-dynamodb-4791220b2d5d?ref=blef.fr">From Postgres to Amazon DynamoDB</a> — Another migration story, this time by Instacart, who benchmarked DynamoDB to replace Postgres in their push notification system. In the article they detail the data model adaptations they made.</li><li><a href="https://blog.devgenius.io/versioning-in-analytics-platforms-7d9968d3e146?ref=blef.fr">Versioning in analytics platforms</a> — Petrica has a great sense when it comes to depicting analytical work. This time she shows at which steps of the analytical work you can add versioning. She also showcases Nessie, a data catalog that works with incremental changes like git.</li><li><a href="https://medium.com/@ugociraci/scaled-data-mesh-250e2fe5c36f?ref=blef.fr">Scaled data mesh</a> — The author tries to highlight the limitations every organisation will face with a mesh strategy. This is mainly a governance problem, but judging from companies already trying to implement it, I'd say this point is already identified.</li><li><a href="https://dantelore.com/posts/simplest-data-pipeline/?ref=blef.fr">World's Simplest Data Pipeline?</a> — "<em>Data Engineering is very simple.  It’s the business of moving data from one place to another.</em>" This is something I could have said. This article is so simple, but so true. A few bullet points to check. 
Every data folk should read it before writing any pipeline.</li><li><a href="https://jacobjustcoding.medium.com/retry-pattern-f56df433038c?ref=blef.fr">Retry pattern</a> — When writing a pipeline you also have to think about error resolution. Resolution can be automated with a few retry patterns. This is short and conceptual, but makes good points.</li><li><a href="https://engineering.grab.com/graph-for-fraud-detection?ref=blef.fr">Graph for fraud detection</a> — The Grab team explained how they used graphs to do fraud detection, which is, by the way, one of the best ways to handle fraud detection.</li></ul><div class="kg-card kg-callout-card kg-callout-card-grey"><div class="kg-callout-emoji">📗</div><div class="kg-callout-text"><strong>White paper</strong> — <a href="https://dl.acm.org/doi/pdf/10.1145/3542929.3563483?ref=blef.fr">Elastic cloud services: scaling Snowflake’s control plane</a>. Have fun reading this. It deeply details how the control plane works.</div></div><p></p><h1 id="data-fundraising-%F0%9F%92%B0">Data Fundraising 💰</h1><ul><li><a href="https://www.oneschema.co/blog/oneschema-announces-6m-fundraise?ref=blef.fr"><strong>OneSchema</strong> raises $6.3m in Seed</a>. OneSchema believes that even though we have awesome replacements, CSV files will stay forever in the tech world. So they developed a suite of tools to help engineers ingest CSVs. With their SDK you can add a drag-and-drop panel that auto-detects your CSV in the browser and lets the user fix issues, while you validate the data before inserting it in the database.</li></ul><hr><p>See you next week with a new edition and the Advent of Data 🎄.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 22.46 ]]></title>
                    <description><![CDATA[ Data News #22.46 — Paris Airflow meetup, DuckDB, data teams need to break out of their bubble, select * exclude and the fast news. ]]></description>
                    <link><![CDATA[ /data-news-week-22-46/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 637751269129f0003db9d7a6 ]]></guid>
                    <pubDate><![CDATA[ 2022-11-18 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/11/image-8.png" class="kg-image" alt loading="lazy" width="900" height="600" srcset="https://www.blef.fr/content/images/size/w600/2022/11/image-8.png 600w, https://www.blef.fr/content/images/2022/11/image-8.png 900w" sizes="(min-width: 720px) 720px"><figcaption>Scratching the surface (<a href="https://unsplash.com/photos/qMUCSiEkHIo?ref=blef.fr">credits</a>)</figcaption></figure><p>Hey you, a new Friday means data news. This week feels a bit like the old data news, with a variety of articles on different cool topics while I navigate through the actual data trends.</p><p>Next Monday I'll present "How to build a data dream team" at the Y42 meetup. I'll share a written form of my talk in next week's edition. But this week, as an appetizer, there are 2 articles I really liked about data team composition.</p><p>Last but not least, if you are in Paris on the 6th of December you can <a href="https://www.meetup.com/fr-FR/paris-apache-airflow-meetup/events/289492007/?ref=blef.fr">join us for the reboot of the Apache Airflow meetups</a>—I'm the organizer. Talks will be given in French. The agenda:</p><ul><li>leboncoin will share best practices around Airflow</li><li>Qonto will show how you can greatly integrate dbt within Airflow</li><li>I'll also introduce the meetup with the latest Airflow features</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/11/image-7.png" class="kg-image" alt loading="lazy" width="676" height="380" srcset="https://www.blef.fr/content/images/size/w600/2022/11/image-7.png 600w, https://www.blef.fr/content/images/2022/11/image-7.png 676w"><figcaption>I organised the Paris Apache Airflow Meetup — 6th of Dec. 
— <a href="https://www.meetup.com/fr-FR/paris-apache-airflow-meetup/events/289492007/?ref=blef.fr">JOIN US</a></figcaption></figure><p></p><h1 id="my-two-cents-about-duckdb">My two cents about DuckDB</h1><p>Ok, right now the LinkedIn and Twitter data world is going one-way down the Rust and DuckDB street. While I don't have any opinion on Rust—except that it looks like another eternal programming-language debate I'm bored of—I have one on DuckDB. </p><p>Here is a small description of DuckDB I wrote 2 newsletters ago:</p><blockquote>If you missed it DuckDB is a single-node in-memory OLAP database. In other words it means that DuckDB runs on a single server, loads the data using columnar format in the memory (RAM) and applies transformation on it. Natively DuckDB integrates with SQL and Python, which also means you can query your data with Python or SQL.</blockquote><p>First, let's decrypt the marketing. DuckDB's mother company, called MotherDuck, says stuff like "Big Data is dead" or "Your laptop is faster than your data warehouse", which theoretically opens the door back to single-instance processing for your data. This is brilliantly good, tbh. I buy it. Plus they add this fun tone with ducks, which creates sympathy for the product. </p><p><strong>But is it really something?</strong></p><p>I think it is, but I might have already been influenced by the marketing. When I think about DuckDB's simplicity, it's exhilarating.</p><p>You do <code>pip install duckdb</code> then <code>import duckdb</code> and you are good to go. You don't need to run a server. A database is available to you; you can read files (CSV or Parquet) and execute SQL or DataFrame operations on them seamlessly. </p><p>I can imagine a list of use cases that will help improve the data engineering workflow, but at the same time I don't believe DuckDB can become the main processing engine of a data platform. 
I mean, by its single-node nature the technology will for sure serve decentralised teams with a central lake brilliantly, but I see more edge use cases like: running data processing in the CI/CD to quickly validate stuff, providing a great local dev experience to every data developer, or powering small data analytics products.</p><p>I don't think it can replace the current data warehouse vision or technologies, and in my opinion it shouldn't be sold as or compared with them. It's more of a cool sidekick to the actual modern data stack. Still, with the huge amount of money invested and the current course of things where everyone wants to try the hype, I'm afraid it'll turn out differently.</p><p><a href="https://olivermolander.medium.com/duckdb-whats-the-hype-about-5d46aaa73196?ref=blef.fr">Oliver also shared deeper views on the hype</a>.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/11/image-9.png" class="kg-image" alt loading="lazy" width="900" height="600" srcset="https://www.blef.fr/content/images/size/w600/2022/11/image-9.png 600w, https://www.blef.fr/content/images/2022/11/image-9.png 900w" sizes="(min-width: 720px) 720px"><figcaption>Ducks on the horizon (<a href="https://unsplash.com/photos/t0QQAOKVqYU?ref=blef.fr">credits</a>)</figcaption></figure><p></p><h1 id="data-teams-need-to-break-out-of-their-bubble">Data teams need to break out of their bubble</h1><p>Mary MacCarthy published a great post. It's a wake-up call for data teams. In the current economic situation, all the intellectual discussions about the vision of the field are fun, but this is not really what data teams are built for. Data teams exist in most companies to empower other teams. I'd also bet that the semantic layer, DuckDB, Rust or other trendy stuff is not what will empower your stakeholders. 
</p><p>Right now, according to Mary, the best move you can make to empower your stakeholders is to <a href="https://hightouch.com/blog/data-teams-need-to-break-out?ref=blef.fr">break out of your bubble</a> and really work in pairs with them. In the article she takes the example of the relationship between the marketing team and the data team, which often looks like <a href="https://en.wikipedia.org/wiki/Shadow_IT?ref=blef.fr">shadow IT</a>. Martech solutions are often another all-in-one data platform. </p><!--kg-card-begin: html--><p style="text-align:center;"><a href="https://hightouch.com/blog/data-teams-need-to-break-out?ref=blef.fr" style="display: inline-block; cursor: pointer; background-color: #E4E6E1; padding: 10px 20px; margin: 30px auto; border-radius: 5px; text-decoration: none;">Read the article</a></p><!--kg-card-end: html--><p>On the same topic, Mikkel Dengsøe came back with a great article about <a href="https://mikkeldengsoe.substack.com/p/purple-people-outside-data?ref=blef.fr">data people outside of the data team</a>. He brings a few tips and pitfalls to make this setup work.</p><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li><a href="https://www.notion.so/product/ai?wr=dcdb13eeb837ea4e&ref=blef.fr">Notion announced Notion AI</a> — Notion will introduce an AI assist block that will be able to generate text in your Notion pages. Right now it's in alpha with a waitlist. Under the hood it uses OpenAI; in the FAQ Notion promises that your data will be protected and not used by OpenAI.</li><li><a href="https://heyashy.medium.com/supercharge-your-python-code-with-dataclasses-6965ddd7fb98?ref=blef.fr">Dataclasses: Supercharge your Python code</a> — If you don't use Python's dataclasses you should look at this article, which gives you usage examples. I personally use dataclasses a lot when it comes to creating configuration for my data pipelines. 
It allows me to type my configuration and to replace bracket notation with objects, which is more comfortable when developing.</li><li><a href="https://medium.com/snowflake/snowflake-select-exclude-rename-3e9c9a4073ed?ref=blef.fr">Snowflake <code>SELECT * EXCLUDE/RENAME</code></a> — It was one of the features I missed the most when I switched from BigQuery to Snowflake. Here it is. You'll now be able to supercharge your Snowflake select * by either excluding unwanted columns or renaming some on the fly. It saves precious SQL lines when you have a lot of columns.</li><li><a href="https://medium.com/mlearning-ai/visualization-tips-for-data-story-telling-1e99cccbb8c7?ref=blef.fr">Visualization tips for data story-telling</a> — How to pick colors, how to display text and at what size, and how to emphasize one data point among others. This article is a good heads-up.</li><li><a href="https://github.com/StarRocks/StarRocks?ref=blef.fr">StarRocks, a next-gen sub-second MPP database</a> — I discovered a new open-source real-time OLAP database. Nothing more to say except that I put it in the newsletter as a save-for-later.</li><li><a href="https://www.coinbase.com/blog/revamping-the-apache-airflow-based-workflow-orchestration-platform-at?ref=blef.fr">Revamping the Apache Airflow-based workflow orchestration platform at Coinbase</a> — What to do when you have around 1000 pipelines and more than 1500 PRs every month on your project.</li><li><a href="https://towardsdatascience.com/spark-data-pipelines-in-the-cloud-118f38ea90b7?ref=blef.fr">Building Spark data pipelines in the cloud, what you need to get started</a> — Spark has not yet disappeared even if I don't share that much content about it in the newsletter. 
This is a complete guide about Spark worth mentioning.</li><li><a href="https://towardsdatascience.com/your-data-catalog-shouldnt-be-just-one-more-ui-e6bffb793cf1?ref=blef.fr">Your data catalog shouldn’t be just one more UI</a> — In today's data ecosystem all data catalogs have been developed following the same concepts coming from SV big tech startups. In this article the author makes it explicit that a data catalog should be more than a search bar over the entities. Rather, a data catalog should firstly be a central metadata repository with open APIs allowing every data team to activate real use cases. <br><em>See also: <a href="https://nonodename.com/post/ddlsemantics/?ref=blef.fr">More on semantics &amp; databases</a>. What if we could add more semantics directly in the database schema comments?</em></li><li>(I did not have the time to read these 2 articles) <a href="https://medium.com/@chris.jackson_46175/simplifying-3nf-c0ad1090a2fc?ref=blef.fr">Simplifying 3NF</a> &amp; <a href="https://dishanka.medium.com/data-skew-101-e5a7bda36f76?ref=blef.fr">Data Skew : 101</a>.</li></ul><p></p><div class="kg-card kg-callout-card kg-callout-card-grey"><div class="kg-callout-emoji">⚖️</div><div class="kg-callout-text"><a href="https://blog.developer.atlassian.com/learn-how-to-prepare-for-new-european-data-privacy-requirements/?ref=blef.fr">Learn how to prepare for new European data privacy requirements</a> — A rare article about data privacy requirements. Atlassian shares legal matters that might resonate with your legal team if you do international data transfers.</div></div><p></p><h1 id="data-fundraising-%F0%9F%92%B0">Data Fundraising 💰</h1><ul><li><a href="https://www.quix.io/blog/quix-raises-12-9-million-series-a-funding/?ref=blef.fr"><strong>Quix</strong> raises $12.9m Series A</a>. Quix is a serverless real-time platform that allows developers to focus on developing real-time apps rather than spending time on the underlying infra.
Their SDK works with Python and C#.</li><li><a href="https://motherduck.com/blog/announcing-series-seed-and-a/?ref=blef.fr"><strong>MotherDuck</strong> raises $47.5m Seed and Series A</a>. Just a side note about the fundraising of DuckDB's mother company. I've already shared most of my thoughts about this in this newsletter's edito. The company seems to be on the track of the giants, fuelled with a16z money. <a href="https://data-folks.masto.host/@sspaeti/109351920363518396?ref=blef.fr">As others are betting</a>, we have a few months of trendy ducks ahead of us.</li></ul><hr><p>See you next week ❤️.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 22.45 ]]></title>
                    <description><![CDATA[ Data News #22.45 — I&#39;ve joined Mastodon, Equals and EdgeDB fundraising, how Riot do ML, schema change management, etc. ]]></description>
                    <link><![CDATA[ /data-news-week-22-45/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 636e367d29df9d004da10989 ]]></guid>
                    <pubDate><![CDATA[ 2022-11-11 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/11/image-4.png" class="kg-image" alt loading="lazy" width="900" height="596" srcset="https://www.blef.fr/content/images/size/w600/2022/11/image-4.png 600w, https://www.blef.fr/content/images/2022/11/image-4.png 900w" sizes="(min-width: 720px) 720px"><figcaption>Mastodon and Hadoop are on a boat... (<a href="https://unsplash.com/photos/j4ocWYAP_cs?ref=blef.fr">credits</a>)</figcaption></figure><p>Hey you, the 11th of November used to be a day off for me. Since I started my freelancing activities I don't really follow the usual calendar, working whenever I need/want. I mainly work 3 to 4 days a week. Which is awesome, but it has a major drawback: I've never taken a break longer than 1 week. Which, yeah, kinda sucks. Let's change this next year.</p><p>On a social note, today I've joined the data-folks Mastodon server, you can <a href="https://data-folks.masto.host/web/@blef?ref=blef.fr">follow me there</a>. I'll add this new community as a source for my curation and I'm gonna try to be active there.</p><p>Also, on the 21st of November I'm gonna speak at a meetup for the first time in English and in Berlin. So if you wanna listen to my terrible French accent, <a href="https://www.meetup.com/fr-FR/berlin-data-analytics-meetup-group/events/289313238/?ref=blef.fr">join us</a>. I'll speak about "How to build the data dream team".</p><p>Let's jump onto the news.</p><p></p><h1 id="ingredients-of-a-data-warehouse">Ingredients of a Data Warehouse</h1><p>Going back to basics. Kovid wrote an article that explains <a href="https://servian.dev/ingredients-of-a-data-warehouse-cd68b48f5306?ref=blef.fr">the ingredients of a data warehouse</a>. And he does it well. A data warehouse is a piece of technology that rests on 3 ideas: data modeling, data storage and the processing engine.</p><p>In the post Kovid details every idea.
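As a toy illustration of the storage idea (my own sketch, not from Kovid's article), here is the same table laid out row-wise and column-wise; an analytical aggregate only needs to touch one column in the columnar layout:

```python
# Toy illustration: the same table stored row-wise vs column-wise.
rows = [
    {"id": 1, "country": "FR", "revenue": 10.0},
    {"id": 2, "country": "DE", "revenue": 20.0},
    {"id": 3, "country": "FR", "revenue": 5.0},
]

# Column-based layout: each column is stored contiguously.
columns = {
    "id": [1, 2, 3],
    "country": ["FR", "DE", "FR"],
    "revenue": [10.0, 20.0, 5.0],
}

# An analytical aggregate only touches one column in the columnar layout...
total_columnar = sum(columns["revenue"])
# ...while the row layout forces the engine to walk every full record.
total_rows = sum(r["revenue"] for r in rows)
```

Both sums are equal, but the columnar scan reads a third of the data, which is why analytical engines favour column storage.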
In this cloud world where everything is serverless, good data modeling is still a key factor in the performance—which often means cost—of a data platform. Modeling is often led by dimensional modeling but you can also do 3NF or data vault. When it comes to storage it's mainly a row-based vs. column-based discussion, which in the end will impact how the engine processes data.</p><p></p><h1 id="schema-changes-management">Schema changes management</h1><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/11/image-6.png" class="kg-image" alt loading="lazy" width="900" height="902" srcset="https://www.blef.fr/content/images/size/w600/2022/11/image-6.png 600w, https://www.blef.fr/content/images/2022/11/image-6.png 900w" sizes="(min-width: 720px) 720px"><figcaption>A story of an int becoming a str (<a href="https://unsplash.com/photos/hoS3dzgpHzw?ref=blef.fr">credits</a>)</figcaption></figure><p>I bet that the most common data horror stories are about schema changes. It could be because the product team changed an integer to a varchar in a source Postgres table or because an analyst removed the tax field in the income table. Every time it means morning headaches with Slack messages, Airflow screaming at you with red circles and downstream pipelines to re-run.</p><p>Fast forward to today, more and more teams are trying to fix this. Here are a few articles that will give you some ideas about what to do—tbh, there isn't a one-stop solution to fix it:</p><ul><li><a href="https://blog.devgenius.io/programmatic-schema-management-1b5efd180e68?ref=blef.fr">Programmatic schema management</a> — Manage all your schemas with some kind of code. Petrica showcases Alembic at the end of the article, which works, but I think it adds too much overhead in the data warehousing world.
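To make the CI/CD diff idea concrete, here is a minimal sketch (entirely my own illustration, not taken from any of these articles) of a static schema diff that flags breaking changes:

```python
# Minimal sketch of a static schema diff for a CI check (illustrative only):
# a schema is a mapping of column name -> type, and a change is "breaking"
# when a column disappears or changes type.
def diff_schemas(old: dict[str, str], new: dict[str, str]) -> list[str]:
    breaking = []
    for column, col_type in old.items():
        if column not in new:
            breaking.append(f"removed column: {column}")
        elif new[column] != col_type:
            breaking.append(f"type change on {column}: {col_type} -> {new[column]}")
    return breaking

issues = diff_schemas(
    {"id": "int", "tax": "float"},
    {"id": "varchar"},  # id became a varchar, tax was dropped
)
```

A CI job would fail the pipeline whenever `issues` is non-empty, turning the morning headache into a blocked pull request.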
</li><li><a href="https://blog.infuseai.io/how-to-be-more-confident-making-data-model-changes-76a2f65feffa?ref=blef.fr">How to be more confident making data model changes</a> — This article is a hidden ad by the author, but still. It greatly depicts what you can do at the CI/CD level with a static diff that compares old schemas with new ones.</li><li><a href="https://engineering.fb.com/2022/11/09/developer-tools/tulip-schematizing-metas-data-platform/?ref=blef.fr">Tulip: Schematizing Meta’s data platform</a> — Shows a tool called Tulip that handles the schematization of messages while also handling schema evolution. </li></ul><p></p><h1 id="machine-learning-at-riot-games">Machine learning at Riot Games</h1><p>If you play video games like me you'll like this video. If not, you'll still like it I think. This is a morning coffee from the MLOps Community with Ian Schweer who works at Riot Games. <a href="https://www.youtube.com/watch?v=JjMc8TguPvQ&t=664s&ref=blef.fr">Ian describes how Riot Games uses data</a> and what machine learning means there.</p><p>Even if I recommend you to watch the video, here are a few points I wrote down that were interesting to me:</p><ul><li>A good part of the discussion was around the fact that DEs and MLEs should just copy what SREs have been doing for years. In the end, why should data management be different from config management—ok, except for the scale?</li><li>Riot has also embraced the concept of the feature store, but at the scale of the enterprise there isn't yet a standard way to do it. In their case it also means they embed the ML models in the game binaries.</li><li>This is probably the concept I liked the most from the video: <em>the end-game dataset</em>. It means that every game can be captured as a dataset, with a known schema, in immutable storage accessible to everyone.
I like this idea and it can be replicated in a lot of different businesses.</li></ul><p></p><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li><a href="https://www.madrona.com/dbt-labs-founder-tristan-handy-on-the-modern-data-stack-partnerships-and-creating-community/?ref=blef.fr">dbt Labs Founder Tristan Handy on the modern data stack, partnerships</a> — This is a cool (long) interview of dbt co-founder Tristan about his vision of the product. If you have time, listen to or read it. My main takeaway is around the fact that dbt (core at least) is community-led. The community created dbt as a framework. A framework to organise your data assets and your knowledge. As of today, dbt is the most advanced framework to do this. The rest is just implementation details.</li><li><a href="https://olivermolander.medium.com/is-it-time-to-rebrand-or-rethink-the-modern-data-stack-5d76366e3c95?ref=blef.fr">Is it time to rebrand (or rethink) the Modern Data Stack?</a> — It complements the previous interview well. 10 years after the "Redshift revolution", it's probably time to put words on today's stacks?</li><li><a href="https://towardsdatascience.com/2003-2023-a-brief-history-of-big-data-25712351a6bc?ref=blef.fr">2003–2023: A Brief History of Big Data</a> — If in parallel you need a great description of the last 20 years, Furcy wrote the whole data platform history from the Google File System in 2003 to the 2022 lakehouse swarm.</li><li><a href="https://medium.com/@cautaerts/data-engineering-is-not-software-engineering-af81eb8d3949?ref=blef.fr">Data engineering is not software engineering</a> — Even if the title is a bit clickbait, the article holds some truths. The author states that data pipelines are not applications and that pipelines are single-person tasks that have to be 100% completed, otherwise they're worthless.
IMHO, this is partially true and it'll mostly depend on how mature the team is in their data asset design.</li><li><a href="https://select.dev/posts/introduction-to-snowflake-micro-partitions?ref=blef.fr">Introduction to Snowflake's Micro-Partitions</a> — I think explanations of database internals are my favourite tech articles. It probably comes from the fact that I like to understand how the stuff I'm using works.</li><li><a href="https://medium.com/gooddata-developers/gooddata-and-dbt-metrics-aa8edd3da4e3?ref=blef.fr">GoodData and dbt Metrics</a> — Headless BI or the Semantic Layer will be the next big vocabulary discussion in the data ecosystem. BI tools will want to sell headless BI while transformation platforms will sell metrics or semantic layers; the idea is to capture the data warehouse exposition via proprietary code.</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/11/image-5.png" class="kg-image" alt loading="lazy" width="900" height="600" srcset="https://www.blef.fr/content/images/size/w600/2022/11/image-5.png 600w, https://www.blef.fr/content/images/2022/11/image-5.png 900w" sizes="(min-width: 720px) 720px"><figcaption>Delivering the fast news (<a href="https://unsplash.com/photos/XoAUPASBOdc?ref=blef.fr">credits</a>)</figcaption></figure><p></p><h1 id="data-fundraising-%F0%9F%92%B0">Data Fundraising 💰</h1><ul><li><strong><a href="https://equals.app/?ref=blef.fr">Equals</a></strong> <a href="https://wraptext.equals.app/equals-series-a/?ref=blef.fr">raises $16m Series A</a>. 4 months after a Seed round they once again got money to develop their Excel alternative. The SaaS app connects to your warehouse and displays your data in a tabular format after a query (graphically built or SQL).
It looks like a Google Sheets on steroids for data.</li><li><a href="https://www.edgedb.com/?ref=blef.fr"><strong>EdgeDB</strong></a> <a href="https://www.edgedb.com/blog/edgedb-series-a?ref=blef.fr">raises $15m Series A</a>. Slowly, year after year, graph databases' time is coming. Enterprises rely on a multitude of apps, each with a different view of their clients. Graph databases are a key piece of technology that provides a unified view over relationships. EdgeDB is a hybrid open-source graph database developed on top of Postgres.</li></ul><p><em>PS: Regarding database trends, Cloud Database Report wrote a great article about <a href="https://clouddb.substack.com/p/7-database-trends-driven-by-aws-google?ref=blef.fr">7 current database market trends</a>. More serverless, graph, vector, Postgres used everywhere, etc.</em></p><hr><p>See you next week.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 22.44 ]]></title>
                    <description><![CDATA[ Data News #22.44 — Datalake with DuckDB and Dagster, databases time, ML Saturday, autocomplete for analysts and good self-service. ]]></description>
                    <link><![CDATA[ /data-news-week-22-44/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 6364e8e35888fe003da5532c ]]></guid>
                    <pubDate><![CDATA[ 2022-11-05 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/11/image-1.png" class="kg-image" alt loading="lazy" width="2000" height="1593" srcset="https://www.blef.fr/content/images/size/w600/2022/11/image-1.png 600w, https://www.blef.fr/content/images/size/w1000/2022/11/image-1.png 1000w, https://www.blef.fr/content/images/size/w1600/2022/11/image-1.png 1600w, https://www.blef.fr/content/images/size/w2400/2022/11/image-1.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>Saturday be like (<a href="https://unsplash.com/photos/tnp3F6Nw6XI?ref=blef.fr">credits</a>)</figcaption></figure><p>Hello data news readers. I hope you had a great week. This is the Saturday data news; yesterday I had blank page syndrome. I hope you don't mind.</p><p>Before jumping into the news, I have 2 things to say. First, I've been listed as <a href="https://www.noonies.tech/2022/emerging-tech/2022-best-data-science-newsletter?ref=blef.fr">Best data science newsletter</a> by Hackernoon. If you like this newsletter I'd love to get your vote. Then, I'll organise an Airflow meetup in Paris on the 6th of December and I'm still looking for speakers. Probably for 5-min lightning talks—fr or en.</p><p>Have fun.</p><p></p><h1 id="build-a-data-lake-from-scratch-with-duckdb-and-dagster">Build a data lake from scratch with DuckDB and Dagster</h1><p>I've recently shared a lot of articles around DuckDB in the newsletter. If you missed it, DuckDB is a single-node in-memory OLAP database. In other words, DuckDB runs on a single server, loads the data in a columnar format into memory (RAM) and applies transformations to it. Natively DuckDB integrates with SQL and Python, which means you can query your data with either.</p><p>This database technology got a lot of traction because of its simplicity to install and use.
Which also means that influencers and bloggers can easily experiment to show you how wonderful it is. This article is no exception.</p><p>Dagster, on the other hand, is another orchestration tool, one that has been designed for the cloud and for data orchestration. They popularized the <em>software-defined assets</em> concept, which is a way to define data assets as code. This way the orchestrator knows the data dependencies and can do reactive scheduling rather than CRON-based scheduling.</p><p>So, Pete and Sandy from the Dagster team showcase how you can <a href="https://dagster.io/blog/duckdb-data-lake?ref=blef.fr">create an S3 data lake with DuckDB as the query engine</a> on top of it. I really like the article because it shows in a small amount of code how you can:</p><ul><li>ingest data from Wikipedia with Pandas</li><li>write a compact end-to-end pipeline test, with a simple test written before the code</li><li>define Dagster data assets</li><li>use DuckDB to read and write S3 assets</li></ul><p>Obviously what they did is purely experimental but it gives ideas on how every company could create a lake with a smaller footprint and a smaller price.
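As a toy sketch of what "the orchestrator knows data dependencies" buys you (my own illustration using the Python standard library, not Dagster's actual API), you can declare assets with their upstreams and derive a valid run order instead of hand-wiring CRONs:

```python
from graphlib import TopologicalSorter

# Hypothetical asset names; each asset maps to the set of assets it depends on.
asset_deps = {
    "raw_wikipedia": set(),
    "staging_pages": {"raw_wikipedia"},
    "daily_report": {"staging_pages"},
}

# Deriving the execution order from declared dependencies: upstream assets
# always run before the assets that consume them.
run_order = list(TopologicalSorter(asset_deps).static_order())
```

With the dependency graph explicit, refreshing `raw_wikipedia` can reactively trigger only its downstream assets rather than a whole CRON-scheduled run.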
I mean, BigQuery and Snowflake are also launching on-demand processing, but here with DuckDB you really know what's running and it's fairly simple, so you can measure all the costs.</p><p><em>PS: as I've never used DuckDB or Dagster—I plan to soon—all my comments are based on my theoretical understanding of the technologies and all the reading I've done about them.</em></p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/11/image-2.png" class="kg-image" alt loading="lazy" width="900" height="600" srcset="https://www.blef.fr/content/images/size/w600/2022/11/image-2.png 600w, https://www.blef.fr/content/images/2022/11/image-2.png 900w" sizes="(min-width: 720px) 720px"><figcaption>DucksDB (<a href="https://unsplash.com/photos/oJ_Reviogx8?ref=blef.fr">credits</a>)</figcaption></figure><p></p><h1 id="databases-time">Databases time</h1><p>It looks like a special edition about databases but it is not. Dremio wrote an article to explain <a href="https://www.dremio.com/subsurface/the-life-of-a-read-query-for-apache-iceberg-tables/?ref=blef.fr">how a read query works with Iceberg tables</a>. In a nutshell, a read query first uses the catalog to find the right metadata files. These point to the correct manifest files in order to get the correct data. In even simpler words, it uses metadata to narrow the data search: the less data you read, the faster the query will be. </p><p>Now let's go to the more exotic database side.
The Redis team wrote a guide on <a href="https://redis.com/blog/legacy-database-migration/?ref=blef.fr">things to consider when doing a database migration</a> and Mohammad wrote a <a href="https://www.mydistributed.systems/2022/10/dynamodb-ten-years-later.html?ref=blef.fr">retro on DynamoDB 10 years after the general release</a>.</p><p></p><h1 id="playing-dataviz-tennis-for-collaboration-and-fun">Playing dataviz tennis for collaboration and fun</h1><p>This idea is so fun and I'd love to try it in a data team. For content purposes, <a href="https://nightingaledvs.com/playing-dataviz-tennis-for-collaboration-and-fun/?ref=blef.fr">Georgios and Lee played dataviz tennis</a>. Every dataviz tennis match lasted 8 rounds, with 45 minutes per round, and the person who served picked the dataset. It means player 1 chooses a dataset, works on a viz for 45 minutes and then shoots the viz over to player 2, who works on it for 45 minutes, and so on. All of this in R with ggplot2.</p><p>I think this is a fun way to collaborate and we should try it in data teams for some projects. It is an alternative way to do pair programming and it can be done with data pipelines as well.</p><p></p><h1 id="ml-saturday-%F0%9F%A4%96">ML Saturday 🤖</h1><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/11/Group-2.png" class="kg-image" alt loading="lazy" width="1018" height="116" srcset="https://www.blef.fr/content/images/size/w600/2022/11/Group-2.png 600w, https://www.blef.fr/content/images/size/w1000/2022/11/Group-2.png 1000w, https://www.blef.fr/content/images/2022/11/Group-2.png 1018w" sizes="(min-width: 720px) 720px"><figcaption>How would you rate your job satisfaction in your current role?
(<a href="https://www.anaconda.com/state-of-data-science-report-2022?ref=blef.fr">credits</a>)</figcaption></figure><p>In bulk, here are a few cool articles:</p><ul><li><a href="https://medium.com/@ankurkaul_6335/anatomy-of-a-data-science-team-4547f6ed55bb?ref=blef.fr">Anatomy of a Data Science Team</a> — 9 roles that appear in a data science team. This is a bit of a caricature but it depicts well the forces at stake when creating a data science team. As a side note, I've also read the Anaconda survey about the state of data science; in the survey we can see that data engineers are slightly more dissatisfied with their jobs than data scientists (cf. picture above).</li><li><a href="https://deezer.io/how-deezer-built-the-first-emotional-ai-a2ad1ffc7294?ref=blef.fr">How Deezer built an emotional AI</a> — Deezer, a French music streaming app, shows how they adapted their music recommendation engine—Flow—to learn from their users' mood while identifying the mood in songs.</li><li><a href="https://engineering.fb.com/2022/10/31/ml-applications/instagram-notification-management-machine-learning/?ref=blef.fr">Improving Instagram notification management with machine learning and causal inference</a> — All this science just to send noisy notifications to make you addicted /s.</li><li><a href="https://towardsdatascience.com/fooled-by-statistical-significance-7fed1bc2caf9?ref=blef.fr">Fooled by statistical significance</a> &amp; <a href="https://medium.com/@helloheld/how-der-spiegel-uses-machine-learning-to-identify-its-most-valuable-potential-subscribers-8af4007d3a66?ref=blef.fr">How Der Spiegel uses Machine Learning</a>.</li></ul><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li><a href="https://www.deepchannel.com/posts/bringing-autocomplete-to-analytics-engineers?ref=blef.fr">Bringing autocomplete to Analytics Engineers</a> — Analysts are probably the least equipped team when it comes to productivity.
Navigating through thousands of lines of SQL or developing SQL in a web browser is not much fun. Today Deep Channel proposes a workspace to solve this hurdle by adding autocomplete and faster error detection—it integrates mainly with dbt. To be honest, the issue is larger than this and will not be solved by 1 tool, but still the idea is great.<br><em>PS: on this matter you can still try my </em><a href="https://github.com/Bl3f/dbt-helper?ref=blef.fr"><em>dbt-helper extension</em></a><em> to extend your BigQuery console.</em></li><li><a href="https://medium.com/yousign-engineering-product/snowflake-rbac-implementation-with-permifrost-3d30652825ad?ref=blef.fr">Snowflake RBAC Implementation with Permifrost</a> — Managing Snowflake rights can be a huge pain in the ass. You can do it with custom code or Terraform. Here the Yousign team details how they did it in YAML with Permifrost. They manage databases, warehouses, users and roles with configuration. The article also shows what the standard is when it comes to Snowflake rights at enterprise level.</li><li><a href="https://medium.com/geekculture/how-to-make-data-documentation-sexy-c0ef0d696f78?ref=blef.fr">How to make data documentation sexy</a> — Madison has written a lot of content when it comes to documenting data knowledge. This time she proposes rules to apply when writing documentation.</li><li><a href="https://www.dataduel.co/what-good-data-self-serve-looks-like/?ref=blef.fr">What good data self-serve looks like</a> — For years every data team has wanted to provide self-service to stakeholders in order to reach heaven. In heaven stakeholders write SQL, are autonomous, and the data team concentrates on delivering value and analyses justifying high salary costs. But this is only in heaven; most self-service is badly achieved, with data teams being support teams trying to create trust in the data.
In the article, Nate tries to define good self-service and the key levers to act on.</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/11/image-3.png" class="kg-image" alt loading="lazy" width="900" height="600" srcset="https://www.blef.fr/content/images/size/w600/2022/11/image-3.png 600w, https://www.blef.fr/content/images/2022/11/image-3.png 900w" sizes="(min-width: 720px) 720px"><figcaption>Good self-service (<a href="https://unsplash.com/photos/a-8LxEvb4kU?ref=blef.fr">credits</a>)</figcaption></figure><p></p><h1 id="data-fundraising-%F0%9F%92%B0">Data Fundraising 💰</h1><ul><li><strong><a href="https://dataloop.ai/blog/dataloop-raises-33-million-to-help-companies-build-data-engines-for-ai/?ref=blef.fr">Dataloop</a></strong><a href="https://dataloop.ai/blog/dataloop-raises-33-million-to-help-companies-build-data-engines-for-ai/?ref=blef.fr"> raises $33m Series B</a>. Dataloop is mainly a data labelling platform that focuses on quality. They propose an end-to-end platform to do everything about AI.</li><li><a href="https://www.alation.com/press-releases/alation-raises-series-e-funding/?ref=blef.fr"><strong>Alation</strong> raises $123m Series E</a>. The company, founded in 2012, raised another round to push forward their data catalog solution to enterprises. I don't have a lot to say except that they have too many products for my brain to understand what they really sell.</li><li><a href="https://techcrunch.com/2022/11/01/mlops-platform-galileo-lands-18m-to-launch-a-free-service/?ref=blef.fr"><strong>Galileo</strong> gets $18m Series A</a>. Galileo's package integrates with your Python machine learning stack to add debugging and tracking to your work.</li></ul><hr><p>See you next week ❤️ — PS<em>: below should appear a survey about how you like the newsletter, please tell me what you think</em>. </p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Coalesce 2022 ]]></title>
                    <description><![CDATA[ dbt Coalesce 2022 main takeaways from my perspective. A bit of Python, semantic layer and a lot of analytics engineering and data teams impact. ]]></description>
                    <link><![CDATA[ /dbt-coalesce-takeaways-2022/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 635ba33fd5024c004dd022fc ]]></guid>
                    <pubDate><![CDATA[ 2022-10-29 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/10/image-10.png" class="kg-image" alt loading="lazy" width="900" height="600" srcset="https://www.blef.fr/content/images/size/w600/2022/10/image-10.png 600w, https://www.blef.fr/content/images/2022/10/image-10.png 900w" sizes="(min-width: 720px) 720px"><figcaption>Me right now (<a href="https://unsplash.com/photos/dOnEFhQ7ojs?ref=blef.fr">credits</a>)</figcaption></figure><p>Hey dear members. I have to confess I'm lazy. Every week I want to create content, to work on a new article or video. The more ideas I have, the more I procrastinate. Every week Friday appears and I'm still here, late with the newsletter. For years I was convinced I could change this, but let's face the truth: I'm 30 now, it will never change. </p><p>Still, while procrastinating this week I decided to watch all the replays—around 120—from the dbt annual conference. This newsletter will give you <strong>my</strong> Coalesce 2022 takeaways.</p><div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">🔭</div><div class="kg-callout-text">Don't forget that this selection of talks represents my reading of the conference. It also represents my views and my understanding.
You can disagree with what I said, and if you feel I'm deeply wrong on something, once again hit reply or comment.<br><br>I have also added a ❤️ to my 3 favourite talks.</div></div><h1 id="coalesce-2022">Coalesce 2022</h1><p>The conference agenda was divided, according to me, into 5 categories similar to last year's:</p><ul><li>dbt future — which direction the data field is going with dbt at the center</li><li>Analytics engineering</li><li>HR — Grow your data career and fix your data team</li><li>Diversity — talks about how we can be more open in the data field</li><li>Partners — dbt's booming, everyone wants to be in</li></ul><!--kg-card-begin: html--><p style="text-align:center;"><a href="https://www.youtube.com/watch?v=W5hApW2GVqY&list=PL0QYlrC86xQlj9UDGiEwhXQuSjuSyPJHl&index=1&ref=blef.fr" style="display: inline-block; cursor: pointer; background-color: #E4E6E1; padding: 10px 20px; margin: 30px auto; border-radius: 5px; text-decoration: none;">📺 Watch the dbt YouTube playlist</a></p><!--kg-card-end: html--><h2 id="dbt-future">dbt future</h2><p>Obviously Coalesce has been the theatre for dbt Labs to announce new stuff. Nothing revolutionary or surprising, because it was already discussed or announced before the conference. During the 5 days, dbt Labs talks focused on 3 main topics: Python, the Semantic Layer and Community. In the modern data stack the warehouse is king, at the center, and dbt sits on top of it. In this privileged position dbt usage is growing. </p><p>Being at the center of a community of users and partners means a lot. You foster a variety of usages while your growth attracts a lot of partners in search of integrations with you. This is what dbt Labs has to juggle with.
As a personal opinion, I think too many tools were just demoing their product without any added value; still, this is not a big issue as I can skip them.</p><p>Technically speaking, being at the center of the data stack also leads to the next step for dbt: <a href="https://youtu.be/sEeJJ7qD9wA?list=PL0QYlrC86xQlj9UDGiEwhXQuSjuSyPJHl&t=1886&ref=blef.fr">the Semantic Layer</a>. This layer has been designed to be the all-in-one interface for all the tools in need of data. dbt Labs will open-source a new project called the dbt Server—not yet released—that will put an HTTP API on top of dbt Core to do dbt operations. In addition, dbt Cloud will offer a proprietary Metadata API and a Cloud proxy. The Cloud proxy will be able to translate YAML metrics definitions to SQL. As I already said, it feels a bit like their best attempt to generate revenue.</p><p>If I'm being sarcastic and defensive, I don't see it as a good sign that dbt wants to be my new data connector on top of my warehouse, adding a layer of complexity to my infrastructure.</p><p>Lastly the Python support, while being fairly simple, impressed me. In the form of a Jeremy vs. Cody duel, the dbt team demoed <a href="https://www.youtube.com/watch?v=rVprdyxcGUo&list=PL0QYlrC86xQlj9UDGiEwhXQuSjuSyPJHl&index=18&ref=blef.fr">what you can and can't do with Python models</a>. Pitting Python models against SQL models, we've seen usage of pandas describe and pivot, fuzzy matching and sklearn.</p><p>On a side note, the dbt team also presented their <a href="https://youtu.be/Qg9JA-SUdmo?list=PL0QYlrC86xQlj9UDGiEwhXQuSjuSyPJHl&t=2333&ref=blef.fr">focus for 2023 and 2024</a>, outlined by their user research. As Tristan said, dbt wants to be the <a href="https://youtu.be/sEeJJ7qD9wA?list=PL0QYlrC86xQlj9UDGiEwhXQuSjuSyPJHl&t=242&ref=blef.fr">open standard to create and disseminate knowledge</a>.
So 2023 will mean: better lineage support for data science, standardization around metrics and the semantic layer, and enriched dbt DAG capabilities to add more context to it—whatever that means, <a href="https://roundup.getdbt.com/p/bundled-or-unbundled-data-stack?ref=blef.fr">re-bundling</a> is coming.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/10/image-12.png" class="kg-image" alt loading="lazy" width="2000" height="1333" srcset="https://www.blef.fr/content/images/size/w600/2022/10/image-12.png 600w, https://www.blef.fr/content/images/size/w1000/2022/10/image-12.png 1000w, https://www.blef.fr/content/images/size/w1600/2022/10/image-12.png 1600w, https://www.blef.fr/content/images/size/w2400/2022/10/image-12.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>analytics engineers facing the future (<a href="https://unsplash.com/photos/XRcEsQKTWGk?ref=blef.fr">credits</a>)</figcaption></figure><p></p><h2 id="analytics-engineering">Analytics engineering</h2><p>2022 is probably the year analytics engineering got popularized. While the true frontiers of the role are still unclear, everyone knows that "dbt developers" are analytics engineers. But it goes deeper than this. It implies a mix of business understanding with technical expertise over SQL engines and data modeling.</p><p>At Coalesce we've seen that analytics engineering has a wide range of applications. But in the end <a href="https://www.youtube.com/watch?v=sDueM6pQNZI&list=PL0QYlrC86xQlj9UDGiEwhXQuSjuSyPJHl&index=46&ref=blef.fr">you don't build models, you construct knowledge</a>, and this knowledge is essential to <a href="https://www.youtube.com/watch?v=itcp28mup3c&list=PL0QYlrC86xQlj9UDGiEwhXQuSjuSyPJHl&index=44&ref=blef.fr">find the common ground</a> between the company verticals.
Even if AE is still new, it relies on <a href="https://www.youtube.com/watch?v=-yQa_DxEqaQ&list=PL0QYlrC86xQlj9UDGiEwhXQuSjuSyPJHl&index=63&ref=blef.fr">old principles</a> like <a href="https://www.youtube.com/watch?v=CPTao9jxLyg&list=PL0QYlrC86xQlj9UDGiEwhXQuSjuSyPJHl&index=13&ref=blef.fr">Kimball modeling, but is it still relevant?</a> <em>Spoiler</em>: yes, even if it's no longer applied like before for performance reasons, Kimball brings understandability.</p><p>Under the analytics engineering umbrella I really liked 3 presentations that I would recommend to anyone in analytics; while approaching technical concepts in an accessible way, they bring good food for thought to improve any dbt project:</p><ul><li><a href="https://www.youtube.com/watch?v=eAeIFFY5818&list=PL0QYlrC86xQlj9UDGiEwhXQuSjuSyPJHl&index=80&ref=blef.fr">Outgrowing a single `dbt run`</a> — at scale, schedule-based orchestration can fail: a CRON that runs dbt will lead to issues, so you need a smarter orchestration pattern. This is where reactive/proactive scheduling enters the room. In the Airflow world it means you have to use sensors to trigger runs. Prratek also recommends running staging models each time a source is refreshed, then running the marts once all the staging models have run. I think this is a good pattern.</li><li>❤️ <a href="https://www.youtube.com/watch?v=hxvVhmhWRJA&list=PL0QYlrC86xQlj9UDGiEwhXQuSjuSyPJHl&index=19&ref=blef.fr">Testing: Our assertions vs. reality</a> — Probably the best talk of the conference to me. Mariah shows how dbt's native design falls short when it comes to testing: dbt tests mix code quality and data quality, which are 2 different pieces of the testing framework.
She also greatly illustrates the difference between assumptions and assertions when it comes to data.</li><li><a href="https://www.youtube.com/watch?v=L97ao-GmBLA&list=PL0QYlrC86xQlj9UDGiEwhXQuSjuSyPJHl&index=37&ref=blef.fr">Efficient is the new sexy - A minimalist approach to growth</a> — Matthieu proposes a framework to handle team growth while tackling engineering problems. He also tackles issues like modularity (linked to mesh concepts) and testing, from another angle than the previous talk. </li></ul><p>Lastly, data contract concepts were on fire in the data community. This time Jake and Emily provide us with a practical example using <a href="https://www.youtube.com/watch?v=s6iy0hqjcLk&list=PL0QYlrC86xQlj9UDGiEwhXQuSjuSyPJHl&index=96&ref=blef.fr">jsonschema to define an interface between product and data teams</a>.</p><p></p><h2 id="grow-as-an-individual-and-fix-your-data-team">Grow as an individual and fix your data team</h2><p>A lot of talks this year tried to answer a simple question: how can a data team have an impact? This is obviously related to the fact that data teams around the world cost a lot while leaders are still struggling to find the Return on Investment (ROI).</p><p>In this introspective search for what a data team is, the picture seems to be the same for everyone. Cultural challenges are the main blockers for massive data adoption. 5 talks tried to propose something to help adoption:</p><ul><li><a href="https://www.youtube.com/watch?v=Fc6yy8nPdA4&list=PL0QYlrC86xQlj9UDGiEwhXQuSjuSyPJHl&index=26&ref=blef.fr">Know your worth: Unpacking business value delivered by data teams</a> — A framework to build knowledge to exploit data for stakeholders</li><li><a href="https://www.youtube.com/watch?v=i6mb_fkkfB8&list=PL0QYlrC86xQlj9UDGiEwhXQuSjuSyPJHl&index=4&ref=blef.fr">Data teams v. The recession</a> — <em>How to win the ROI battle</em>.
You have to act on at least 3 levers: core business reporting, avoiding people-pleasing and driving decisions that affect revenue. Chetan illustrates with Airbnb and Webflow examples.</li><li><a href="https://www.youtube.com/watch?v=VMlrT4wXTgg&list=PL0QYlrC86xQlj9UDGiEwhXQuSjuSyPJHl&index=24&ref=blef.fr">How to build data accessibility for everyone</a> — use the JTBD framework to know your data users and achieve self-service.</li><li><a href="https://www.youtube.com/watch?v=q1sIRhrFoeY&list=PL0QYlrC86xQlj9UDGiEwhXQuSjuSyPJHl&index=58&ref=blef.fr">Money, Python, and the Holy Grail: Designing Operational Data Models</a> — We need to simplify data models: a simple model of the business means that you've understood what's going on. Data teams should not be a consulting team that answers every question. <strong>The data team creates a simple, understandable view for everyone</strong>.</li><li>❤️ <a href="https://www.youtube.com/watch?v=mH4TK7XucPw&list=PL0QYlrC86xQlj9UDGiEwhXQuSjuSyPJHl&index=23&ref=blef.fr">Operations vs. product: The data definition showdown</a> — Every operational team is different and data should be the glue between stakeholders, even if it's hard. Words have different meanings across teams. Data alignment is a people and language problem, not a technical one.</li></ul><p>Being in an analytics team can be difficult because you're in the middle of everything without the power to make decisions. That's why data teams have to be empathetic. Empathy means "the action of understanding" (cf. 
<a href="https://www.youtube.com/watch?v=_2uLZFTR0DY&list=PL0QYlrC86xQlj9UDGiEwhXQuSjuSyPJHl&index=28&ref=blef.fr">Empathy-building in data work</a> and <a href="https://www.youtube.com/watch?v=hIt_FtgxzDY&list=PL0QYlrC86xQlj9UDGiEwhXQuSjuSyPJHl&index=68&ref=blef.fr">How insensitive: Increasing analytics capacity through empathy</a>).</p><h3 id="purple-people">Purple people</h3><p>The dbt blog mentioned the <a href="https://www.getdbt.com/blog/we-the-purple-people/?ref=blef.fr">purple people</a> concept last year. Purple people are those generalists who act as the glue between the business and the data stack. But being a generalist is often a solo job. You navigate between specialist worlds and you help these expert communities communicate with each other. This is what Stephen greatly depicted in ❤️ <a href="https://www.youtube.com/watch?v=wB0ulHmvU7E&list=PL0QYlrC86xQlj9UDGiEwhXQuSjuSyPJHl&index=62&ref=blef.fr">Excel at nothing: How to be an effective generalist</a>.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/10/image-11.png" class="kg-image" alt loading="lazy" width="2000" height="1333" srcset="https://www.blef.fr/content/images/size/w600/2022/10/image-11.png 600w, https://www.blef.fr/content/images/size/w1000/2022/10/image-11.png 1000w, https://www.blef.fr/content/images/size/w1600/2022/10/image-11.png 1600w, https://www.blef.fr/content/images/size/w2400/2022/10/image-11.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>(<a href="https://unsplash.com/photos/0aMMMUjiiEQ?ref=blef.fr">credits</a>)</figcaption></figure><p></p><h2 id="%F0%9F%AB%B6-cool-stuff">🫶 Cool stuff</h2><p>There were also open formats. This creativity shows how great the data community is. 
Tiankai sang a <a href="https://www.youtube.com/watch?v=Eu_Yb3BDPNo&list=PL0QYlrC86xQlj9UDGiEwhXQuSjuSyPJHl&index=51&ref=blef.fr">data jam 🎵</a>, <a href="https://www.youtube.com/watch?v=4U1LM2qYoZ4&list=PL0QYlrC86xQlj9UDGiEwhXQuSjuSyPJHl&index=77&ref=blef.fr">competitors battled to answer business questions</a> as fast as possible and <a href="https://www.youtube.com/watch?v=gzr4CbeVY5s&list=PL0QYlrC86xQlj9UDGiEwhXQuSjuSyPJHl&index=41&ref=blef.fr">Joe developed a Unity SQL game</a>.</p><p>Final shout-out to Mehdio, who did <a href="https://www.youtube.com/watch?v=UIcAfFag9E4&list=PLtIcsFR-XFuIx4lgOpdRVillrsHqtFKPY&ref=blef.fr">video interviews and highlights</a> of the conference as he was there in person.</p><p>The last thing I discovered is the <a href="https://github.com/dbt-labs/dbt-project-evaluator?ref=blef.fr">dbt-project-evaluator</a> package, which seems amazing for creating CI/CD rules to detect, for instance, direct joins to sources or gaps in documentation coverage.</p><p></p><!--kg-card-begin: html--><p style="text-align:center;"><a data-portal="signup" style="cursor: pointer; background-color: #E4E6E1; padding: 10px 20px; border-radius: 5px;">📬 Subscribe to my weekly newsletter 📬</a></p><p style="text-align: center; font-size:.7em; font-style: italic;">(to get data curation each week in your inbox, saving you 5 hours of tech watch)</p><!--kg-card-end: html--><p></p><hr><div class="kg-card kg-callout-card kg-callout-card-grey"><div class="kg-callout-emoji">🗒️</div><div class="kg-callout-text">You can also read my <a href="https://www.blef.fr/coalesce-2022-raw-notes/">raw notes about Coalesce</a>. 
This is for members only, as the format is quite rough I think.</div></div><hr><p>PS: I already did this last year for <a href="https://www.blef.fr/dbt-coalesce-takeaways/">Coalesce 2021</a> if you wanna check it out.</p><p>PS2: sorry for the length of this edition and for the delay; I hope the reading was enjoyable, I'm not really proud of my writing here.</p><p>See you next week.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ dbt Coalesce 2022 — my raw notes ]]></title>
                    <description><![CDATA[ This is my raw notes about the Coalesce conference. Sorry about the format, I publish it as it might be interesting for you. ]]></description>
                    <link><![CDATA[ /coalesce-2022-raw-notes/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 63525fa756210e004d4893ba ]]></guid>
                    <pubDate><![CDATA[ 2022-10-29 ]]></pubDate>
                    <content>
                        <![CDATA[ 
<aside class="gh-post-upgrade-cta">
    <div class="gh-post-upgrade-cta-content" style="background-color: #373f48">
                <h2>This post is for subscribers only</h2>
            <a class="gh-btn" data-portal="signup" href="#/portal/signup" style="color:#373f48">Subscribe now</a>
            <p><small>Already have an account? <a data-portal="signin" href="#/portal/signin">Sign in</a></small></p>
    </div>
</aside>
 ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 22.42 ]]></title>
                    <description><![CDATA[ Data News #22.42 — Women in data part 2, dbt Coalesce first glimpse, data contracts, fundraising and fast news. ]]></description>
                    <link><![CDATA[ /data-news-week-22-42/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 635254e356210e004d489320 ]]></guid>
                    <pubDate><![CDATA[ 2022-10-21 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/10/image-7.png" class="kg-image" alt loading="lazy" width="900" height="563" srcset="https://www.blef.fr/content/images/size/w600/2022/10/image-7.png 600w, https://www.blef.fr/content/images/2022/10/image-7.png 900w" sizes="(min-width: 720px) 720px"><figcaption>Navigating through the numbers (<a href="https://unsplash.com/photos/kI0qIeN4hLw?ref=blef.fr">credits</a>)</figcaption></figure><p>Hello, a lot of content was published this week with Coalesce, on top of the many articles I had kept from last week, so I had to navigate through all of that to produce this edition. I'm not that proud of the format but it's OK. </p><p>As a side note, I'm gonna do the <a href="https://lab.blef.fr/map-challenge-2022/?ref=blef.fr">30-day map challenge</a> in November. So if you're doing it, or if you want to, say hi.</p><h1 id="women-in-data-%E2%80%94-part-2-%F0%9F%91%A9%E2%80%8D%F0%9F%92%BB">Women in Data — part 2 👩‍💻</h1><p>Second part of the summary of the Women in Data meetup we organized 2 weeks ago. In the second round table the discussions were about parity in the data ecosystem.</p><p><strong>What can we collectively do to achieve parity in data ecosystems? 💪</strong></p><p>Several answers and ideas were proposed by the speakers. Let's dive in topic by topic.</p><ul><li><strong>Culture</strong>. The enterprise culture plays a big part in parity topics. Every manager should be trained and encouraged to address equality topics. Also, every incorrect behaviour should be mentioned and addressed—still, there was some debate on whether it should be addressed with humour or firmness. Gabrielle also described an internal collective she presides over to help women find their place. 
Along with their mission they identified 5 important points for these collectives to work: define a clear vision, find a sponsor, understand issues through interviews, plan actions that integrate with what already exists in the company, then develop content to infuse the culture.</li><li>Also on the culture topic—yes, I'm moving to another bullet because the first one is too big—there are also initiatives at Deezer to help women, by providing material or days off during periods. Last but not least, everything related to the words we use. We should use inclusive writing—in French this is more prevalent than in English. For instance <a href="https://heyguys.cc/?ref=blef.fr">"hey guys" should be banned</a>.</li><li><strong>Hiring</strong>. Everyone says it's hard to find women in the data field. This is probably true. But if you don't force yourself to actively search for diversity, it'll never change. So one solution is to set a ratio when searching for people: for instance you can ask your hiring agency to propose at least one woman per 3 candidates, and if they don't you won't look at the profiles, no matter what. Then you have to care about the whole hiring funnel.</li><li>Other hiring issues were discussed: the salary gap depending on gender, and the fact that studies have shown that women tend to apply less if they don't tick all the requirements.</li><li><strong>What else to change</strong>. All these differences can be fixed at a local level in the company, but this is something that needs deeper change in society. At the meetup speakers shared with us initiatives to promote tech/data jobs at school, for instance. The idea is to show role models to inspire younger generations. The tech industry is not a men's world.</li></ul><p>That's all for this Women in Data meetup. I hope I've transcribed the discussion with the right words and intention. I might have misinterpreted some chats and if so, I'm sorry. 
</p><p>My last point on this topic: let's not forget we're talking about diversity, so this is not only about men and women; there is more to being diverse and inclusive.</p><h1 id="dbt-coalesce-2022">dbt Coalesce 2022</h1><p>dbt Coalesce took place this week; it's the annual 4-day conference organised by dbt Labs. While all the data influencers were there to meet and chat about the next trends of the analytics industry, a few announcements were made.</p><p>dbt Labs took the time to announce the <a href="https://www.getdbt.com/blog/frontiers-of-the-dbt-semantic-layer/?ref=blef.fr">Semantic Layer</a>—what others call the metrics layer, or the feature store in the data science space. We'll see a lot of buzz around this unique layer to access metrics in 2023. dbt Labs will push this architecture forward, in search of revenue growth. They will add it as a product in their cloud offering—with a SQL proxy and a Metadata API.</p><p>If you want to see how the semantic layer can be used, <a href="https://hex.tech/blog/dbt-semantic-layer-integration/?ref=blef.fr">Hex demoed it</a>. You can also see this semantic rise from the BI perspective with <a href="https://www.lightdash.com/blogpost/bi-for-the-semantic-layer?ref=blef.fr">Semantic BI</a>. 
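</p><p>To make the semantic layer idea concrete: it compiles a declarative metric definition into SQL on the fly, so every tool queries the same definition. A toy sketch of that translation (the metric format here is invented for illustration; it is not dbt's actual spec):</p>

```python
def metric_to_sql(metric: dict) -> str:
    """Compile a declarative metric definition into a SQL query.
    The `metric` schema is a made-up, simplified stand-in."""
    expr = f"{metric['aggregation']}({metric['column']})"
    dims = ", ".join(metric.get("dimensions", []))
    select = f"{dims}, {expr}" if dims else expr
    sql = f"select {select} from {metric['table']}"
    if dims:
        sql += f" group by {dims}"
    return sql

revenue = {"table": "fct_orders", "column": "amount",
           "aggregation": "sum", "dimensions": ["order_date"]}
metric_to_sql(revenue)
# select order_date, sum(amount) from fct_orders group by order_date
```

<p>A proxy sitting between the BI tool and the warehouse does essentially this, with far more care about joins, time grains and SQL dialects.</p><p>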
In this new world everyone wants to see the issues from their own perspective, which is annoying for users but fun as an outsider 🙃.</p><p>I'll dedicate a full post to my highlights of the conference early next week, after watching all the replays.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/10/image-9.png" class="kg-image" alt loading="lazy" width="900" height="547" srcset="https://www.blef.fr/content/images/size/w600/2022/10/image-9.png 600w, https://www.blef.fr/content/images/2022/10/image-9.png 900w" sizes="(min-width: 720px) 720px"><figcaption>The metrics layer (<a href="https://unsplash.com/photos/ngZ4V-myG5s?ref=blef.fr">credits</a>)</figcaption></figure><p></p><h1 id="data-contracts-%F0%9F%91%BB">Data contracts 👻</h1><p>Even if I try not to fall for the hype in order to give a higher-level view on trends, when I see data contracts everywhere I still have to share it. In a nutshell, data contracts are contractualized interfaces between data producers and consumers. The most common pattern seems to be an API—http, file, event, table, etc.—between software engineers and the data team, with a way to enforce the contract. <strong>We have called this a schema for ages</strong>.</p><p>I've been convinced for a long time that data contracts are not a data problem but an IT problem. If the whole tech team is not aligned on the way data changes should be managed, you'll fix only a small part of the problem. Petr wrote a great piece about the way we <a href="https://petrjanda.substack.com/p/the-art-of-drawing-lines?ref=blef.fr">draw lines</a>. What belongs where?</p><blockquote>Data contracts aligned around business areas (domains) rather than technology layers. Contracts are technology-agnostic and can live anywhere inside the Data Platform.</blockquote><p>Andrew and Daniel each wrote up their own way of seeing data contract implementation. 
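</p><p>Whatever the implementation, the mechanics of a contract are simple: the producer publishes a machine-checkable schema, and records are validated against it before ingestion. A pure-Python sketch of the idea (real implementations typically lean on jsonschema or similar; this simplified stand-in only checks field presence and types):</p>

```python
# A hypothetical contract for an `orders` event, agreed between the
# producing service and the data team.
CONTRACT = {"order_id": int, "amount": float, "currency": str}

def violations(record: dict, contract: dict = CONTRACT) -> list:
    """Return the list of contract violations; empty means the record passes."""
    errors = []
    for field, expected in contract.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            errors.append(f"{field}: expected {expected.__name__}, "
                          f"got {type(record[field]).__name__}")
    return errors

violations({"order_id": 1, "amount": 9.99, "currency": "EUR"})  # []
```

<p>The hard part is not the check itself, it's the organizational agreement on who owns the schema and what happens when it breaks.</p><p>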
Andrew at <a href="https://medium.com/gocardless-tech/implementing-data-contracts-at-gocardless-3b5c49074d13?ref=blef.fr">GoCardless</a> and <a href="https://medium.com/@danthelion/implementing-data-contracts-82800b9186b?ref=blef.fr">Daniel by himself</a>.</p><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li><a href="https://www.sumup.com/careers/apache-kafka-blog/?ref=blef.fr">Apache Kafka SSL Security</a> — A simple explanation of how SSL handshakes work and why you should add SSL to your Kafka cluster.</li><li><a href="https://www.music-tomorrow.com/blog/towards-recommender-system-optimization-how-can-artists-influence-recommendation-algorithms?ref=blef.fr">How Can Artists Influence Recommendation Algorithms?</a> — Second part of the MusicTomorrow series about their tool that helps music artists become more viral on music platforms.</li><li><a href="https://medium.com/starschema-blog/ingestions-in-dbt-how-to-load-data-from-rest-apis-with-snowpark-a2c27c4b5315?ref=blef.fr">Load Github API data with Python model in dbt</a> — A new way to see data ingestion. In this article the author gets GitHub data with a dbt Python model running in Snowpark, demoing an extract-load orchestrated directly in your dbt project. This is a good example, though I'm not sure it should be reproduced at scale.</li><li><a href="https://medium.com/geekculture/druid-an-introduction-441c4af03107?ref=blef.fr">Is Druid still a thing?</a> — Druid is a distributed OLAP database that can be used for real-time workloads. In the past the main issue with Druid was the lack of SQL, but that has <a href="https://druid.apache.org/docs/latest/querying/sql.html?ref=blef.fr">changed</a>. This post is an introduction to the Druid architecture.</li><li><a href="https://medium.com/airbnb-engineering/mussel-airbnbs-key-value-store-for-derived-data-406b9fa1b296?ref=blef.fr">Airbnb’s key-value store for derived data</a> — Giants can't stop inventing new databases to solve problems at their scale. 
This time Airbnb created Mussel, a combination of other OSS components, to get a scalable key-value store.</li><li><a href="https://medium.com/data-engineer-things/data-engineering-excellency-at-netflix-7c12af609159?ref=blef.fr">Data Engineering Excellency at Netflix</a> — How Netflix empowers its data engineering team to reach excellence. They even compare data engineers to X-Men: they all have different superpowers to work on different villains. For instance to work on <a href="https://netflixtechblog.com/orchestrating-data-ml-workflows-at-scale-with-netflix-maestro-aaa2b41b800c?ref=blef.fr">Maestro, the data/ml orchestrator</a>.</li><li><a href="https://www.sicara.fr/blog-technique/end-to-end-data-pipeline-tests-on-databricks?ref=blef.fr">End-to-end data pipeline tests on Databricks</a> — I like all the testing topics, even in Spark (😬). Sicara detailed here how they did it for data quality and unit tests.</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/10/image-8.png" class="kg-image" alt loading="lazy" width="2000" height="1335" srcset="https://www.blef.fr/content/images/size/w600/2022/10/image-8.png 600w, https://www.blef.fr/content/images/size/w1000/2022/10/image-8.png 1000w, https://www.blef.fr/content/images/size/w1600/2022/10/image-8.png 1600w, https://www.blef.fr/content/images/size/w2400/2022/10/image-8.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>Data engineers are superheros (<a href="https://unsplash.com/photos/qJDkJRTedNw?ref=blef.fr">credits</a>)</figcaption></figure><p></p><h1 id="data-fundraising-%F0%9F%92%B0">Data Fundraising 💰</h1><p>This week a few data satellite companies raised money. 
When I say satellite I mean companies that are not really part of the data field, but that put data at the centre of their product.</p><ul><li><strong><a href="https://www.risingwave-labs.com/?ref=blef.fr">RisingWave</a></strong> <a href="https://www.risingwave-labs.com/blog/risingWave-labs-raises-36M-in-series-a-funding/?ref=blef.fr">raised a $36m Series A</a>. A cloud-native streaming database that uses SQL. You can either deploy your own Docker instance or use their new cloud offering. It works with materialized views that are refreshed in real-time on top of tables connected to real-time sources like Kafka, Redpanda or CDC.</li><li><strong>Tellius</strong> <a href="https://www.tellius.com/tellius-announces-series-b-funding/?ref=blef.fr">raised $16m in Series B</a>. Tellius offers an augmented analytics platform: a one-stop platform with insights discovery that does anomaly detection on your metrics.</li><li><strong>Keebo</strong> <a href="https://www.prnewswire.com/news-releases/keebo-raises-15-million-launches-automated-warehouse-optimization-to-reduce-cloud-data-warehousing-costs-301654278.html?ref=blef.fr">got $15m in a Series A</a>. Keebo provides a way to lower your warehouse costs by rewriting your SQL queries on the fly. With their solution, rather than connecting to Snowflake you connect to Keebo and let them do the magical optimisation. Even if I like the promise, I don't think it's a good idea to rely on a third party for optimisations. You're better off teaching people to write performant queries, with CI/CD checks for instance.</li><li>The "security" space got some traction this week with 3 companies raising money. <strong>Anonos</strong> <a href="https://www.anonos.com/anonos-secures-50-million-in-ip-backed-financing?ref=blef.fr">raised $50m in debt</a> and provides a compliant pseudonymization engine. 
<strong>OutThink</strong> <a href="https://www.securityweek.com/outthink-raises-10-million-human-risk-management-platform?ref=blef.fr">raised a $10m seed</a> to automatically tackle data breaches by highlighting company risks. <strong>Velotix</strong> raised a $10m seed to automate data access across the complete platform.</li></ul><hr><p>See you next week ❤️.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 22.41 ]]></title>
                    <description><![CDATA[ Data News #22.41 — Women in Data, develop your leadership, how Google fails, few fast news and fundraising. ]]></description>
                    <link><![CDATA[ /data-news-week-22-41/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 634413a1906c01003d99ae71 ]]></guid>
                    <pubDate><![CDATA[ 2022-10-14 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/10/image-3.png" class="kg-image" alt loading="lazy" width="2000" height="1331" srcset="https://www.blef.fr/content/images/size/w600/2022/10/image-3.png 600w, https://www.blef.fr/content/images/size/w1000/2022/10/image-3.png 1000w, https://www.blef.fr/content/images/size/w1600/2022/10/image-3.png 1600w, https://www.blef.fr/content/images/size/w2400/2022/10/image-3.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>Me in a few years after sending my weekly newsletter (<a href="https://unsplash.com/photos/d83kZqsNS7k?ref=blef.fr">credits</a>)</figcaption></figure><p>Dear members, it's already fall here, and the weather is getting chilly, but good news: your favourite newsletter is back! So put on your best sweater, make yourself a hot drink, take a seat and enjoy this week's reading 🍂☕️</p><h1 id="women-in-data-%E2%80%94-part-1-%F0%9F%91%A9%E2%80%8D%F0%9F%92%BB">Women in Data — part 1 👩‍💻</h1><p>This Tuesday I co-organized with <a href="https://twitter.com/DeezerDevs?ref=blef.fr">Deezer Devs</a>, <a href="https://www.datageneration.co/?ref=blef.fr">DataGen</a> and Modern Data Network a <em>Women in Data</em> meetup. We invited 8 inspiring women working in data to discuss their experiences during two 1-hour round tables. In the end the discussions were universal, not narrowly limited to the data field. I liked it a lot.</p><p>While I'm working on translating the whole discussion into English, because I feel it should be shared with everyone, here is a small summary of what was said during the evening. </p><p>They started first with leadership. How can women develop their leadership? During this round table they tackled 4 main topics:</p><ul><li>Behaviour. How should women behave when in a leadership position? Society often depicts women's leadership qualities as empathetic, adaptable, sensitive, but this is a stereotype. 
In contrast, men's management is seen as top-down and authoritarian. In the end everyone should find a personal leadership style; there is no one caricatural behaviour. Society has a lot to gain from encouraging diverse leadership styles—being empathetic is a great way to collect better feedback—and ultimately it is in the best interest of the company in a highly competitive job market.</li></ul><blockquote>I think we almost all heard it (as women): anyway, you are too nice, you are too emotional to manage a team or to take on responsibilities</blockquote><ul><li><a href="https://en.wikipedia.org/wiki/Impostor_syndrome?ref=blef.fr">Impostor syndrome</a>. A lot of women suffer from it, especially in a leadership position where you don't see many people like you. A lot was said about this syndrome. Whether it's questioning yourself, as a woman, when applying for or accepting a job with responsibilities when a man would never question himself; or seeing only geeks and gatekeepers in the tech industry and not recognizing yourself, not feeling in the right place; or being the only woman sharing content in front of a crowd at a meetup/conference and getting discredited afterwards.</li></ul><blockquote>Board meetings always started with "Hello gentlemen", finishing with "Questions gentlemen?". I felt that I did not have my place. I had a position with responsibilities but no-one asked for my opinion. I was invisible [...]. How do we come to a point where we are given a place somewhere but we are still made to feel that we are not legitimate to occupy it?</blockquote><ul><li>Public speaking. Regarding public appearances, tips were given as it's more practical. You can try a few tricks, like creating a character when on stage or finding allies in the audience—allies with whom you'll make eye contact to help you cope with the stage fright. People nodding are also a good help. 
It's important that every one of us helps people in difficulty when witnessing these situations.</li></ul><blockquote>OK, you don't trust yourself (...) for this presentation, but what matters is that you give the impression that you do. When you go on stage draw a line. You walk, and the moment you cross that line you are the character you want to be—it's like an acting class.</blockquote><ul><li>Relationships with managers, peers and team—who often are men in tech. A lot of the time women find they need to build tactics into their daily life to avoid awkward or possibly dangerous situations. Often it's making a joke to someone who is undermining their professional abilities simply because they are a woman, someone making a sexist joke, someone flirting.</li></ul><!--kg-card-begin: html--><p style="text-align:center;"><a href="https://www.youtube.com/watch?v=irl0s0o7EpY&ref=blef.fr" style="display: inline-block; cursor: pointer; background-color: #E4E6E1; padding: 10px 20px; margin: 30px auto; border-radius: 5px; text-decoration: none;">📺 Watch the meetup (🇫🇷)</a></p><!--kg-card-end: html--><p>Part 2 coming up next week. What can we collectively do to achieve parity in data ecosystems? 
💪</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/10/image-5.png" class="kg-image" alt loading="lazy" width="2000" height="1240" srcset="https://www.blef.fr/content/images/size/w600/2022/10/image-5.png 600w, https://www.blef.fr/content/images/size/w1000/2022/10/image-5.png 1000w, https://www.blef.fr/content/images/size/w1600/2022/10/image-5.png 1600w, https://www.blef.fr/content/images/size/w2400/2022/10/image-5.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>Robin, Elisa, Gabrielle, Fatimata, Virginie, Marion et Christelle</figcaption></figure><p>Thanks to all the amazing speakers: <a href="https://www.linkedin.com/in/ACoAAAU9cOsBcL9UhO_GOkzFwKfuYXbHkdBwzsM?ref=blef.fr">Marion B.</a>, <a href="https://www.linkedin.com/in/ACoAABLa5Y4BEEGZnZz6Q1Zp-EvMaFAT8jmVqMc?ref=blef.fr">Gabrielle Béranger</a>, <a href="https://www.linkedin.com/in/ACoAAADEYYQBHyMZO6CH7xj0pBiKYiNW6YW0bWQ?ref=blef.fr">Virginie Cornu</a>, <a href="https://www.linkedin.com/in/ACoAAAxQk44BufLnreLImNIs_eAtBkc2al7td94?ref=blef.fr">Nathalie Gémin</a>, <a href="https://www.linkedin.com/in/ACoAAA6M9_sB5VHa1dEq3xE6YHeXoGG_tnjAz2s?ref=blef.fr">Elisa GILLES</a>, <a href="https://www.linkedin.com/in/ACoAAA6XpIcByHDHU-6dLmEfTowGO0gy-h8wynA?ref=blef.fr">Christelle Marfaing</a>, <a href="https://www.linkedin.com/in/ACoAABsWaF4BrXhEGgmYx4hklhY0TQoROCEaBKI?ref=blef.fr">Arielle Marouani</a> and <a href="https://www.linkedin.com/in/ACoAABno5loBklMVlmVH-D8GN9SlvWeh8V3h5pk?ref=blef.fr">Fatimata Sall</a>. Moderated by <a href="https://www.linkedin.com/in/ACoAABOXu60BNWu22glLrpJCM_6wVqD4SszpF1Y?ref=blef.fr">Robin Conquet</a>.</p><div class="kg-card kg-callout-card kg-callout-card-yellow"><div class="kg-callout-emoji">📢</div><div class="kg-callout-text">I'd love to hear your experience on this topic. 
I also want to open the blog to guest writing on this topic, so if you're interested just hit reply; everything is welcome.</div></div><p></p><h1 id="how-google-fails-%E2%98%81%EF%B8%8F">How Google fails ☁️</h1><p>I'm finally adopting clickbait titles like other influencers. This week Google Next 22 took place. The main news for the data world was about Looker. Or should I say a non-news: <strong>Google decided to rename Data Studio to Looker Studio</strong>. The <a href="https://www.youtube.com/watch?v=Bc_hcLVyFJI&ref=blef.fr">YouTube replay</a> has been the most watched video from Next after the keynote.</p><p>This first news is simply a renaming. Besides this, they decided to create a paid version, Looker Studio Pro, that will include enterprise features with team workspaces and SLA stuff.</p><p>To be honest I'm still lost after this announcement; the Google BI catalog will now include:</p><ul><li>Looker (now Google Cloud core product)</li><li>Looker Studio</li><li>Looker Studio Pro</li><li>LookML</li><li>Dataform?</li><li>BI Engine</li></ul><p>Between the lines, Google also announced that the initial Looker product will be broken up and integrated within GCP. But to me this is not as clear as it should be. Looker Studio will also access the LookML layer. </p><p>Since I started this newsletter I've watched all the Google news around data, and although I've been a huge BigQuery fan from the first hour, I've always struggled to understand the strategy and the vision within the Google ecosystem. In the past GCP was the best solution to me because it was blazing simple. One solution for one problem. 
This vision seems very different today: while BigQuery remains the storage, there are way too many ways to move and transform data.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/10/Screenshot-2022-10-14-at-17.17.06.png" class="kg-image" alt loading="lazy" width="2000" height="1125" srcset="https://www.blef.fr/content/images/size/w600/2022/10/Screenshot-2022-10-14-at-17.17.06.png 600w, https://www.blef.fr/content/images/size/w1000/2022/10/Screenshot-2022-10-14-at-17.17.06.png 1000w, https://www.blef.fr/content/images/size/w1600/2022/10/Screenshot-2022-10-14-at-17.17.06.png 1600w, https://www.blef.fr/content/images/size/w2400/2022/10/Screenshot-2022-10-14-at-17.17.06.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>The BI Cloud</figcaption></figure><p>When you compare with how Snowflake positions itself in the market, GCP has become a complete suite of tools, but a complex one. What are your thoughts on this?</p><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><p>As I already wrote too many words I'll keep a few links for later, but here are 3 cool write-ups.</p><ul><li><a href="https://duckdb.org/2022/10/12/modern-data-stack-in-a-box.html?ref=blef.fr">Modern Data Stack in a Box with DuckDB</a> — DuckDB got a lot of traction recently, unlocking a new range of performance on a single node. This article shows conceptually how you can deploy a <em>Modern Data Stack</em> using DuckDB as storage.</li><li><a href="https://towardsdatascience.com/when-change-data-capture-wins-271875e3df1a?ref=blef.fr">When Change Data Capture Wins</a> — Sarah explains how to get started with Change Data Capture and how it can improve your integration SLAs.</li><li><a href="https://luciacerchie.hashnode.dev/introduction-to-key-apache-kafka-concepts?ref=blef.fr">Introduction to Key Apache KafkaⓇ Concepts</a> — Kafka has become over the last years an important piece of every data stack. 
This article details the main concepts you need to know about Kafka. Combine this with a CDC pattern and you've got your first real-time platform. </li></ul><h1 id="data-fundraising-%F0%9F%92%B0">Data fundraising 💰</h1><p><em>After a discussion with a reader I've decided to put the fundraising category at the end now.</em></p><ul><li><strong>Alvin</strong> <a href="https://venturebeat.com/data-infrastructure/alvin-nabs-6m-to-help-enterprises-map-data-flows-address-quality-issues/?ref=blef.fr">raised $6m in seed</a>. Alvin is a lineage-first data platform: it connects to all your sources, BI tools and operational tools to create the lineage from the logs. Then you can visualise or query the lineage data to adapt your own platform.</li><li><strong><a href="https://www.climatiq.io/?ref=blef.fr">Climatiq</a></strong> <a href="https://www.eu-startups.com/2022/10/carbon-intelligence-platform-climatiq-raises-e6-million-to-fuel-net-zero-future/?ref=blef.fr">raised €6m in seed funding</a>. What if we could put a sensor everywhere on our servers to measure the climate impact of what we do? Climatiq gives you this in real time.</li><li><strong>Homa</strong> <a href="https://www.homagames.com/blog/homa-raises-100-million-to-supercharge-game-developers-fortunes?ref=blef.fr">raised $100m Series B</a>. Homa is a tracking analytics SDK for game creators. I find it interesting to put it here because game analytics is something we often forget, as it doesn't rely on accepted cookies, but it is still very present.</li></ul><hr><p>See you next week, enjoy your weekend ❤️.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 22.40 ]]></title>
                    <description><![CDATA[ Data News #22.40 — Fundraises of Lightdash and Flink Cloud offering, ClickHouse Cloud launch, data engineering migrations projects and more. ]]></description>
                    <link><![CDATA[ /data-news-week-22-40/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 633fdddf317a46003de9965d ]]></guid>
                    <pubDate><![CDATA[ 2022-10-07 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/10/image.png" class="kg-image" alt loading="lazy" width="900" height="623" srcset="https://www.blef.fr/content/images/size/w600/2022/10/image.png 600w, https://www.blef.fr/content/images/2022/10/image.png 900w" sizes="(min-width: 720px) 720px"><figcaption>It's already sunset (<a href="https://unsplash.com/photos/MfNHW8vlbLs?ref=blef.fr">credits</a>)</figcaption></figure><p>Dear members. Once again a late Friday edition. I was travelling this week and I slept too much. But no more excuses, below is your Data News edition.</p><h1 id="data-fundraising-%F0%9F%92%B0">Data fundraising 💰</h1><ul><li><strong>Lightdash</strong> is finally launching their commercial product. <a href="https://www.lightdash.com/blogpost/lightdash-raises-seed-round?ref=blef.fr">They raised $8.4m in funding</a> (pre-seed + seed). Lightdash is a dbt-based BI tool. It leverages metrics and dimensions defined within dbt to provide an explore UI where you can create visualisations to answer questions, and later add these to dashboards. In my opinion Lightdash is conceptually very similar to Metabase.</li><li><strong>Immerok</strong> <a href="https://www.immerok.io/blog/immerok-cloud-early-access?ref=blef.fr">raised $17m seed round</a> to launch a serverless service for Apache Flink. The promise is to make real time mainstream by providing a no-operations platform while keeping all Flink APIs.</li><li><a href="https://clickhouse.com/cloud?ref=blef.fr"><strong>ClickHouse</strong> Cloud</a> launched, one year after their $250m Series B. ClickHouse is a real-time OLAP database developed within Yandex. The database's promise is to reunite the warehouse-first approach with real-time performance. The Cloud (only AWS for now) will charge you for storage, compute, writes and reads if you "pay as you go".</li></ul><p>What a crazy period we live in. 
Every open-source technology launches a cloud-based offering of their tool, expecting the money to finance development. Is it really sustainable?</p><p></p><h1 id="a-bit-of-data-engineering">A bit of data engineering</h1><p>I do not share a lot about what I do as a data engineer outside of this newsletter. Even if this is probably worth a dedicated post, I think today I'll do a category about the data engineer's life. At the moment I'm working on two projects that are migrations. For the first project I'm migrating from a 12-year-old custom-made analytical application to a new one built within Apache Superset. </p><p>I also feel that a lot of the projects I've worked on as a data engineer were migrations. Sometimes small migrations like changing a data pipeline, sometimes larger ones like migrating a warehouse technology or an orchestration tool.</p><p>Migrations fuel data engineering work today and Ben depicts it greatly in his new post <a href="https://medium.com/coriers/realities-of-being-a-data-engineer-migrations-3dd76c9c5357?ref=blef.fr"><em>Realities of being a data engineer — Migrations</em></a>. As Ben said we have different kinds of migrations: operating systems, hardware, cloud, analytics or data. Every migration obviously brings risk and that's why we do preparatory work to mitigate it. 
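A common way to do that preparatory work is a double run: execute the old and the new pipeline side by side and diff their outputs before cutting over. A minimal sketch of the comparison step, assuming you can fetch each system's results as rows (the sample rows and key names below are hypothetical, and real client calls would replace the hard-coded lists):

```python
# Double-run check: compare old vs new pipeline outputs before cutover.
# The two hard-coded row lists stand in for "query the old system" and
# "query the new system"; swap in real client calls.

def diff_outputs(old_rows, new_rows, key):
    """Return (missing_in_new, unexpected_in_new, mismatched) row keys."""
    old = {r[key]: r for r in old_rows}
    new = {r[key]: r for r in new_rows}
    missing = sorted(set(old) - set(new))
    unexpected = sorted(set(new) - set(old))
    mismatched = sorted(k for k in set(old) & set(new) if old[k] != new[k])
    return missing, unexpected, mismatched

old_rows = [{"id": 1, "total": 10}, {"id": 2, "total": 20}]
new_rows = [{"id": 1, "total": 10}, {"id": 2, "total": 21}, {"id": 3, "total": 5}]
missing, unexpected, mismatched = diff_outputs(old_rows, new_rows, "id")
print(missing, unexpected, mismatched)  # [] [3] [2]
```

In a real migration you would first compare cheap aggregates (row counts, checksums) per table, and only diff full rows on small samples.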
But even with good experience we can't plan for the unexpected and <a href="https://www.blef.fr/data-engineering-deadlines/">deadlines will slide</a>.</p><p>Later in the post Ben proposes a 5-step framework every migration should follow:</p><ul><li>Initiate — Justify the migration and get buy-in from stakeholders</li><li>Design and discovery — Do the product work and design what you expect, take time to explore the unknown</li><li>Execute implementation — Develop what you have to develop and automate the boring stuff (a lot of migrations contain boring stuff, so automate it)</li><li>Testing and validation — Check everything and do a double run with your old system and the new one</li><li>Roll out and the long tail — Decide when to stop the old system and use the opportunity to change the processes with the new system</li></ul><!--kg-card-begin: html--><p style="text-align:center;"><a href="https://medium.com/coriers/realities-of-being-a-data-engineer-migrations-3dd76c9c5357?ref=blef.fr" style="display: inline-block; cursor: pointer; background-color: #E4E6E1; padding: 10px 20px; margin: 30px auto; border-radius: 5px; text-decoration: none;">👉 Read Ben's article</a></p><!--kg-card-end: html--><p>After all the migrations I've done and read about, one piece of advice I can give you is to invest in developing custom tools to track and help the migration. For instance if you have to migrate 200 SQL queries from Postgres to BigQuery, develop a dashboard that shows the progress of the migration and provide automated scripts to do the dumb parts. Migration work is boring, gamify it.</p><p>To illustrate this post Ronnie from Airbnb described <a href="https://medium.com/airbnb-engineering/upgrading-data-warehouse-infrastructure-at-airbnb-a4e18f09b6d5?ref=blef.fr">how they upgraded their data warehouse infrastructure</a>. 
They migrated from Hive to Spark 3 + Iceberg.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/10/image-1.png" class="kg-image" alt loading="lazy" width="2000" height="1500" srcset="https://www.blef.fr/content/images/size/w600/2022/10/image-1.png 600w, https://www.blef.fr/content/images/size/w1000/2022/10/image-1.png 1000w, https://www.blef.fr/content/images/size/w1600/2022/10/image-1.png 1600w, https://www.blef.fr/content/images/size/w2400/2022/10/image-1.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>Data migration (<a href="https://unsplash.com/photos/1pZJqQlgpsY?ref=blef.fr">credits</a>)</figcaption></figure><p></p><h1 id="ml-friday-%F0%9F%A4%96">ML Friday 🤖</h1><ul><li><a href="https://doordash.engineering/2022/10/05/homepage-recommendation-with-exploitation-and-exploration/?ref=blef.fr">Homepage recommendation with exploitation and exploration</a> — How DoorDash created a personalized homepage with their custom ranking algorithms.</li><li>Also this week Etsy wrote about their <a href="https://www.etsy.com/codeascraft/deep-learning-for-search-ranking-at-etsy?ref=blef.fr">search ranking personalisation</a> with Deep Learning.</li><li>Finally, Walmart shared more details about their <a href="https://medium.com/walmartglobaltech/element-walmarts-machine-learning-platform-b8a1f7870784?ref=blef.fr">machine learning platform</a>. In a nutshell this is a big platform with a lot of fancy technologies involved. It sits on top of Kubernetes and, be ready, mentions BigQuery, Spark, Cassandra, Trino, Hive and GCS at least as data storage platforms.</li><li>📅 The <a href="https://www.featurestoresummit.com/fss-2022/agenda-2022?ref=blef.fr">feature store summit</a> will take place next week on Oct. 
11th.</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/10/image-2.png" class="kg-image" alt loading="lazy" width="2000" height="1335" srcset="https://www.blef.fr/content/images/size/w600/2022/10/image-2.png 600w, https://www.blef.fr/content/images/size/w1000/2022/10/image-2.png 1000w, https://www.blef.fr/content/images/size/w1600/2022/10/image-2.png 1600w, https://www.blef.fr/content/images/size/w2400/2022/10/image-2.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>A personalized homepage (<a href="https://unsplash.com/photos/K1NObYzL86k?ref=blef.fr">credits</a>)</figcaption></figure><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li><a href="https://www.technologyreview.com/2022/10/01/1060539/eu-tech-policy-harmful-ai-liability/?ref=blef.fr">The EU wants to put companies on the hook for harmful AI</a> — "<em>A new bill will allow consumers to sue companies for damages—if they can prove that a company’s AI harmed them."</em> Once again the EU regulates, probably for the best, while companies are trying AI everywhere. If it ripples out to others like the GDPR did, it could be good.</li><li><a href="https://www.insee.fr/en/statistiques/6530562?sommaire=6530621&ref=blef.fr">Recruitment Difficulties, an analysis of 2019 French companies data</a> — This is a study from the French statistical studies bureau. The study outlines a high mismatch between labour supply and demand.</li><li><a href="https://medium.com/insiderengineering/apache-iceberg-reduced-our-amazon-s3-cost-by-90-997cde5ce931?ref=blef.fr">Use Iceberg to reduce storage cost</a> — Deniz describes how migrating from ORC + Snappy to Iceberg with Parquet + Zstandard drastically reduced the S3 GetObject costs (by ~90%). 
As a side effect it also reduced the Spark compute cost by 20%.</li><li>❤️ <a href="https://dagster.io/blog/skip-kafka-use-postgres-message-queue?ref=blef.fr">Postgres: a better message queue than Kafka?</a> — Dagster recently launched their cloud offering. They decided to use Postgres as the foundation for their logging system. This post explains why. I really like the post because it's about technology choices and problem framing.</li><li><a href="https://github.com/matanolabs/matano?ref=blef.fr">matanolabs/matano</a> — <em>The open-source security lake platform for AWS</em>. Matano provides you with a way to query and alert on logs collected from all your sources. Matano stores everything as Iceberg files in S3 and you can write Python rules to get real-time alerts on top of it.</li><li><a href="https://techwithadrian.medium.com/dbt-repository-to-split-or-not-to-split-909d366d0998?ref=blef.fr">dbt repository — to split or not to split?</a> ; this is a hard question for every dbt developer. Should I go for a monorepo like dbt recommends or should I go for a modular approach? Adrian covers the 2 ways in the post. I personally think everyone should start with a monorepo. Once the data team moves to a mesh organisation the modular approach with packages should be considered.</li><li><a href="https://dshersh.medium.com/too-many-mlops-tools-c590430ba81b?ref=blef.fr">Another tool won’t fix your MLOps problems</a> — Whether it's MLOps or DataOps we have too many tools and yet more marketing than practitioners in the space. We need to reach the plateau, like DevOps did, to avoid collecting tools like Panini cards.</li><li><a href="https://medium.com/@Not4j/what-we-are-missing-in-data-ci-cd-pipelines-c3d7f02e0894?ref=blef.fr">What are we missing in data CI/CD pipelines?</a> — Thoughts around a CI/CD incremental approach for dbt.</li></ul><hr><p>See you next week ❤️.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 22.39 ]]></title>
                    <description><![CDATA[ Data News #22.39 — Unravel, Coalesce and Wasabi fundraise, are tables data products?, time travel, data masking and more. ]]></description>
                    <link><![CDATA[ /data-news-week-22-39/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 6336b25d63d0ad004d657b4c ]]></guid>
                    <pubDate><![CDATA[ 2022-09-30 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/09/image-12.png" class="kg-image" alt loading="lazy" width="900" height="603" srcset="https://www.blef.fr/content/images/size/w600/2022/09/image-12.png 600w, https://www.blef.fr/content/images/2022/09/image-12.png 900w" sizes="(min-width: 720px) 720px"><figcaption>Welcome to the 80 new members of this week (<a href="https://unsplash.com/photos/67L18R4tW_w?ref=blef.fr">credits</a>)</figcaption></figure><p>Tomorrow we'll enter the last quarter of the year. It's crazy how time flies. At the end of the year my freelancing activity will become my most significant professional experience. But at the same time I feel like I just started yesterday.</p><p>I'm so happy to see how the newsletter is doing these days. I really like to get feedback from you, so do not hesitate to reach out if you have something to say, it helps me a lot. I plan to write more original content—that will be only for members (free and paid). But I struggle to find the time to do it. <strong>I need to rethink my time management and prioritisation. I'm super bad at it. How do you do it?</strong></p><p></p><h1 id="data-fundraising-%F0%9F%92%B0">Data Fundraising 💰</h1><p>As opposed to the last 2 weeks, fundraises are back this week. Money is coming back. But first, bad news: <a href="https://newsletter.pragmaticengineer.com/p/the-scoop-layoffs-at-docusign?ref=blef.fr">Docusign is laying off 9% of its staff</a>.</p><ul><li><strong>Unravel Data</strong> <a href="https://www.unraveldata.com/welcome-third-point-ventures-series-d-funding/?ref=blef.fr">raised $50m Series D</a>. They tick a lot of buzzwords in their tag line: <em>DataOps Observability for the Modern data stack</em>. 
It feels like they do a lot of stuff to help data teams better understand their platform: monitoring cloud costs, recommending performance tuning for apps and pipelines, helping discover issues faster. In the end they do observability like the others. As a side note, they still mention Oozie in their demos. Modern data stack, they said.</li><li><strong>Coalesce</strong> <a href="https://coalesce.io/?ref=blef.fr">raised $26m Series A</a>. Coalesce is a boring drag-and-drop web UI to create data transformations for your Snowflake warehouse. Maybe they need money to pay for the trademark lawsuit with dbt Labs regarding the <em>Coalesce</em> term. They are fighting in court in the <a href="https://www.law.com/thelegalintelligencer/2022/08/22/data-engineering-companies-dispute-trademark-of-coalesce/?slreturn=20220830053439&ref=blef.fr">US</a> and the <a href="https://www.trademarkelite.com/uk/trademark/trademark-detail/UK00003741355/COALESCE?ref=blef.fr">UK</a> (cf. <a href="https://twitter.com/pdrmnvd/status/1566417768152322048?ref=blef.fr">Twitter</a>) 🙄.</li><li><strong>Wasabi</strong> <a href="https://wasabi.com/press-releases/wasabi-technologies-closes-250-million-in-new-funding-to-usher-in-the-future-of-cloud-storage/?ref=blef.fr">raised $250m Series D</a> to fuel their cloud storage alternative. Claiming 80% price cuts compared to AWS while being faster, it looks like a solid contender.</li></ul><p></p><h1 id="are-tables-data-products">Are tables data products?</h1><p>The data mesh initiative brings at its root domain ownership to data teams. In simple words, the major change is obviously organisational. It puts technical teams closer to their business. In this case you may have to look at Conway's law to define your <a href="https://carlosgrande.me/team-topologies/?ref=blef.fr">team topologies</a>.</p><p>In order to get your teams ready for the big change you'll need to identify the data products every team will deliver. Data products are entities on which you apply product principles. 
Data products, among other things, have to be interoperable, discoverable, shareable, bounded and owned.</p><p>And it applies very well to tables. Tables are highly interoperable, discoverable and shareable—ok, it depends on your storage/engine, but still it's more than decent. Also with some processes you can easily make the tables bounded and owned. So yes, we can say that tables can be considered a sufficient data product. BUT, not every table in the warehouse should be considered as such. LinkedIn decided to name these data products the <a href="https://engineering.linkedin.com/blog/2022/super-tables--the-road-to-building-reliable-and-discoverable-dat?ref=blef.fr">Super Tables</a>.</p><p>At LinkedIn, Super Tables are units of work like the <em>jobs</em> or the <em>ads_event</em> table. For instance their <em>jobs</em> table consolidates more than 57 sources into 158 columns. Which obviously means a lot: 57 sources into one table is probably more than the average data team uses in a whole warehouse. Every ST should enforce SLAs to reach 99%+ availability. It then creates datasets everyone in the company can trust and use in downstream data flows.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/09/image-11.png" class="kg-image" alt loading="lazy" width="1207" height="509" srcset="https://www.blef.fr/content/images/size/w600/2022/09/image-11.png 600w, https://www.blef.fr/content/images/size/w1000/2022/09/image-11.png 1000w, https://www.blef.fr/content/images/2022/09/image-11.png 1207w" sizes="(min-width: 720px) 720px"><figcaption>LinkedIn's move from Source-of-Truth tables to Super Tables (image from the source article)</figcaption></figure><p>Creating a Super Table is not an easy task. You'll need to clearly identify why people need the data to create this common asset that delivers value to the stakeholders. 
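Those product properties (owned, discoverable, with an SLA) can be made concrete as a small table-level contract that CI can check. A minimal sketch — all field names and thresholds here are illustrative, not LinkedIn's actual implementation:

```python
from dataclasses import dataclass, field

@dataclass
class TableProduct:
    """Minimal descriptor for a table treated as a data product."""
    name: str
    owner: str               # accountable team, makes the table owned
    description: str         # makes the table discoverable
    sla_availability: float  # e.g. 0.99 for 99%+ availability
    sources: list = field(default_factory=list)

    def violations(self):
        """Return the product principles this table fails to meet."""
        problems = []
        if not self.owner:
            problems.append("no owner")
        if not self.description:
            problems.append("no description")
        if self.sla_availability < 0.99:
            problems.append("SLA below 99%")
        return problems

jobs = TableProduct("jobs", owner="jobs-team",
                    description="Consolidated jobs data",
                    sla_availability=0.995, sources=["src_a", "src_b"])
print(jobs.violations())  # []
```

A CI job could load such descriptors from YAML and fail the build when `violations()` is non-empty, which is one way to enforce the "bounded and owned" part with a process.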
With domain data teams it's easier to do because teams are closer to their sources and dedicated per business, so they should know better what's needed. </p><p>But still, once you have all the requirements you'll need to apply <a href="https://ergestx.substack.com/p/why-data-modeling-is-a-super-skill?ref=blef.fr">data modeling super skills</a>.</p><blockquote>As a data modeler you can help leadership bring in millions of dollars in revenue by adjusting a few lines of code.</blockquote><p>As a final note on this, everyone is speaking about Kimball but no one reads him—I confess, myself included. Justin wrote a post about the <a href="https://medium.com/@jjghavami/data-engineer-must-kimballs-4-step-dimensional-design-process-a9cc7bdf8d4?ref=blef.fr">4-step dimensional design process</a> every data modeler should follow to create well-architected tables.</p><p></p><h1 id="ml-friday-%F0%9F%A4%96">ML Friday 🤖</h1><ul><li><a href="https://medium.com/artefact-engineering-and-data-science/forecasting-something-that-never-happened-how-we-estimated-past-promotions-profitability-5f55cfa1d477?ref=blef.fr">Forecasting something that never happened</a> — This is a good problem to have in machine learning and something I've seen multiple times. 
Luca describes how you can estimate the uplift that will be generated by a promotion when you've never run it.</li><li><a href="https://doordash.engineering/2022/09/27/five-common-data-quality-gotchas-in-machine-learning-and-how-to-detect-them-quickly/?ref=blef.fr">5 common data quality gotchas in Machine Learning</a> — Doordash developed a DataQualityReport Python package that will help you identify missing values, invalid values and sampling errors while finding distribution anomalies.</li></ul><p></p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/09/image-13.png" class="kg-image" alt loading="lazy" width="900" height="601" srcset="https://www.blef.fr/content/images/size/w600/2022/09/image-13.png 600w, https://www.blef.fr/content/images/2022/09/image-13.png 900w" sizes="(min-width: 720px) 720px"><figcaption>Going back in time within my warehouse (<a href="https://unsplash.com/photos/h0dngiRxMeA?ref=blef.fr">credits</a>)</figcaption></figure><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li><a href="https://ai.googleblog.com/2022/09/tensorstore-for-high-performance.html?ref=blef.fr">Google released TensorStore</a>, a new way to store and manipulate arrays. I don't fully get the hidden power of this kind of innovation, but I feel it becomes something when applied to petabytes of brain images.</li><li>How to use <a href="https://medium.com/google-cloud/quickly-restore-bigquery-dataset-with-time-travel-and-cloud-workflows-a66b868f4684?ref=blef.fr">time travel on BigQuery tables</a> — This is enabled by default and you can restore your table state at any point in time in the last 7 days.</li><li>Use <a href="https://right-triangle.com/2022/09/data-masking-snowflake-and-data-vault/?ref=blef.fr">Snowflake data masking</a> — Every warehouse should have a privacy layer. In Snowflake you can do it with masking policies. 
<a href="https://docs.snowflake.com/en/sql-reference/sql/create-masking-policy.html?ref=blef.fr">Masking policies</a> are functions that will mask the data if queried without privileges. Philosophically this can be applied to every database engine—for instance <a href="https://www.postgresql.org/about/news/postgresql-anonymizer-10-privacy-by-design-for-postgres-2452/?ref=blef.fr">Postgres</a>.</li><li><a href="https://eng.lyft.com/evolution-of-streaming-pipelines-in-lyfts-marketplace-74295eaf1eba?ref=blef.fr">Evolution of streaming pipelines in Lyft’s marketplace</a> — The Lyft engineering team has been a thought leader when it comes to feature engineering. In this post they detail the different phases they went through year after year.</li><li><a href="https://blog.twitter.com/engineering/en_us/topics/infrastructure/2022/data-quality-automation-at-twitter?ref=blef.fr">Data quality automation at Twitter</a> — A small article that details how Twitter developed their Data Quality Platform (DQP) on top of Great Expectations. In a nutshell they define rules in YAML files that are compiled into Airflow DAGs that run periodically to check if everything runs fine. In the end they show reports in Looker.</li><li><a href="https://www.lastweekinaws.com/blog/the-baffling-maze-of-kubernetes/?ref=blef.fr">The baffling maze of Kubernetes</a> — Kubernetes is the wild west. In the article Corey mentions that there isn't any consensus in the community as of now on how to develop iteratively on a Kube cluster. More than 25 products claim to do it. 
On my side, atm I'm deploying a bare-metal kube cluster and to be honest every day I'm facing new issues; it reminds me of the good old Hadoop days.</li><li><a href="https://dataanalysis.substack.com/p/saas-metrics-reporting-a-peek-behind?ref=blef.fr">SaaS metrics reporting</a> — What are the metrics you should follow when doing analytical work for a SaaS product.</li><li><a href="https://medium.com/event-driven-utopia/comparing-stateful-stream-processing-and-streaming-databases-c8c670f3f4bb?ref=blef.fr">Comparing stateful stream processing and streaming databases</a>.</li></ul> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 22.38 ]]></title>
                    <description><![CDATA[ Data News #22.38 — Hidden gems in dbt artifacts, understand the Snowflake query optimizer, Python untar vulnerability, fast news and ML Friday. ]]></description>
                    <link><![CDATA[ /data-news-week-22-38/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 632dc40213284d004d523307 ]]></guid>
                    <pubDate><![CDATA[ 2022-09-23 ]]></pubDate>
                    <content>
                        <![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/09/image-9.png" class="kg-image" alt loading="lazy" width="2000" height="1331" srcset="https://www.blef.fr/content/images/size/w600/2022/09/image-9.png 600w, https://www.blef.fr/content/images/size/w1000/2022/09/image-9.png 1000w, https://www.blef.fr/content/images/size/w1600/2022/09/image-9.png 1600w, https://www.blef.fr/content/images/size/w2400/2022/09/image-9.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>🇫🇷 (<a href="https://unsplash.com/photos/CVqdh5rlytc?ref=blef.fr">credits</a>)</figcaption></figure><p>Bonjour vous ! Like sometimes I'm late. Today, I write the first words of the newsletter at 5PM. Which is 8h later than usual. Pardon me. In term of content it has been a huge week for me, I've prepared a meetup presentation that I enjoyed giving this Wed. It feels good to present stuff in public.</p><p>So yeah, let's talk a bit of this presentation.</p><p></p><h1 id="find-the-hidden-gem-in-dbt-artifacts">Find the hidden gem in dbt artifacts</h1><p>On Wednesday I made a 30 minutes presentation looking for hidden gems in dbt artifacts. The talk was a bit experimental, the idea is to show that this is possible for everyone to add context to you data infrastructure by leveraging generated artifacts. It means you can use the 4 JSON files generated to create tooling around your dbt project. </p><blockquote><strong>Shoemakers children are the worst shod.</strong></blockquote><p>Why not using the data generated by dbt artifacts to create useful data models to self-improve our data platforms?</p><p>While leveraging the 4 JSON files (manifest, run_results, sources, catalog) we could:</p><ol><li>Sources monitoring like in dbt Cloud</li><li>Extends your dbt docs HTML</li><li>Send data in your BI tool. We already have Metabase or Preset integrations.</li><li>Enforce and visualise your data governance policy. 
Refuse every merge request if a model owner is not defined for instance.</li><li>dbt observability, monitoring and alerting, have fun with analytics on your analytics.</li><li>Create a dbt model time travel viewer. Create an automated changelog process than display your data model evolutions.</li><li><a href="https://github.com/Bl3f/dbt-helper?ref=blef.fr">dbt-helper</a> — Your SQL companion</li><li>dbt-doctor — It’s time to detect issues. Idea: a CLI tool to detect any dbt FROM leftovers to fail in CI if yes.</li></ol><p>I also shared that every data engineer should consider the artifacts like a way to understand their customers. If you manage to get the artifacts from every envs (local, ci, staging, prod) you have the data to understand how everyone is using the tool. Especially useful if you have junior analysts lost within the tool, it'll detect silent local issues.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/09/Screenshot-2022-09-23-at-18.59.13.png" class="kg-image" alt loading="lazy" width="2000" height="968" srcset="https://www.blef.fr/content/images/size/w600/2022/09/Screenshot-2022-09-23-at-18.59.13.png 600w, https://www.blef.fr/content/images/size/w1000/2022/09/Screenshot-2022-09-23-at-18.59.13.png 1000w, https://www.blef.fr/content/images/size/w1600/2022/09/Screenshot-2022-09-23-at-18.59.13.png 1600w, https://www.blef.fr/content/images/size/w2400/2022/09/Screenshot-2022-09-23-at-18.59.13.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>Send artifacts from every env to understand how everyone uses dbt.</figcaption></figure><p><em>🔗 <a href="https://docs.google.com/presentation/d/1ThZ6UnH5xVmSdVhMcc4zjC7X8ucAMzTu54Rk7iFbXrM/edit?usp=sharing&ref=blef.fr">Here the slides of my presentation</a>.</em></p><h3 id="closing-on-dbt">Closing on dbt</h3><p>To finish this edito about dbt here 3 other articles I found interesting. 
While we live our best lives creating dbt projects, the complexity of these projects will only rise in the future. By facilitating the way we create data models we encourage data model creation. So what does it mean when you have more than 700 models written by more than 43 humans? <a href="https://roundup.getdbt.com/p/complexity-the-new-analytics-frontier?ref=blef.fr">Anna from dbt Labs wrote an introspective post about it</a>. </p><p>Adrian also raised the <a href="https://techwithadrian.medium.com/manage-complexity-in-dbt-projects-ca1cb4e87a3?ref=blef.fr">complexity topic</a> on Medium. He states that with the modern data stack and the all-SQL paradigm we write complex code that risks becoming unmanageable.</p><p>Finally if you want a course on data modeling, Miles from GitLab will run a CoRise on <a href="https://corise.com/course/data-modeling?ref=blef.fr">Data Modeling for the Modern Warehouse</a>. It seems a good resource to get started with the Kimball methodology.</p><p></p><h1 id="understanding-the-snowflake-query-optimizer">Understanding the Snowflake query optimizer</h1><p>❤️ If you had to read only one article this week it would be this one. I think Teej is doing an awesome job demystifying Snowflake internals. And he struck once again. It's time to <a href="https://teej.ghost.io/understanding-the-snowflake-query-optimizer/?ref=blef.fr">understand how the Snowflake query optimizer works</a>. Even if you don't use Snowflake I recommend this article to you.</p><blockquote>The job of a <strong>query optimizer</strong> is to reduce the cost of queries without changing what they do. Optimizers cleverly manipulate the underlying data pipelines of a query to eliminate work, pare down expensive operations, and optimally re-arrange tasks.</blockquote><p>In a nutshell the query optimizer tries to transform the badly written 500-line query into optimized instructions for the database. 
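As a toy illustration of "cheaper but logically identical", here is one classic rewrite, predicate pushdown, applied to a miniature plan tree. This is a sketch of the general idea only, not Snowflake's actual internals:

```python
# Toy logical plan: Filter(pred) over Project(cols) over Scan(table).
# Pushing the filter below the projection lets the engine discard rows
# earlier, without changing the query's result (the predicate only uses
# columns the projection already exposes).

class Scan:
    def __init__(self, table): self.table = table

class Project:
    def __init__(self, cols, child): self.cols, self.child = cols, child

class Filter:
    def __init__(self, pred, child): self.pred, self.child = pred, child

def push_down_filters(plan):
    """Rewrite Filter(Project(x)) into Project(Filter(x)), recursively."""
    if isinstance(plan, Filter) and isinstance(plan.child, Project):
        proj = plan.child
        return Project(proj.cols,
                       push_down_filters(Filter(plan.pred, proj.child)))
    if isinstance(plan, (Filter, Project)):
        plan.child = push_down_filters(plan.child)
    return plan

plan = Filter("country = 'FR'",
              Project(["user_id", "country"], Scan("events")))
optimized = push_down_filters(plan)
# After the rewrite the projection sits on top and the filter is
# applied directly on the scan.
print(type(optimized).__name__, type(optimized.child).__name__)  # Project Filter
```

A real optimizer does this over a much richer algebra (joins, aggregates, views) and combines it with column pruning, but the mechanics are the same kind of tree rewrite.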
To run the query the database will need to load data in memory, and the query optimizer will try to find the minimal set of data the engine needs to scan in order to answer as fast as it can.</p><p>Once the database knows exactly what to read, the optimizer will rewrite the query into a more optimized but logically identical form. It will replace the views or functions with their underlying physical objects, drop the unused columns (called column pruning) and push down the predicates. Predicate pushdown is the step where the optimizer tries to move all the data filtering (WHEREs) as early as possible in the query.</p><p>Then it performs join optimization, but for this I'll let you read Teej's excellent post.</p><!--kg-card-begin: html--><p style="text-align:center;"><a href="https://teej.ghost.io/understanding-the-snowflake-query-optimizer/?ref=blef.fr" style="display: inline-block; cursor: pointer; background-color: #E4E6E1; padding: 10px 20px; margin: 30px auto; border-radius: 5px; text-decoration: none;">👉 Read the full article</a></p><!--kg-card-end: html--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/09/image-10.png" class="kg-image" alt loading="lazy" width="900" height="600" srcset="https://www.blef.fr/content/images/size/w600/2022/09/image-10.png 600w, https://www.blef.fr/content/images/2022/09/image-10.png 900w" sizes="(min-width: 720px) 720px"><figcaption>Inside Snowflake partitions system (<a href="https://unsplash.com/photos/qGgUsvwEY70?ref=blef.fr">credits</a>)</figcaption></figure><p></p><h1 id="ml-friday-%F0%9F%A4%96">ML Friday 🤖</h1><ul><li><strong>Netflix</strong> — <a href="https://netflixtechblog.medium.com/machine-learning-for-fraud-detection-in-streaming-services-b0b4ef3be3f6?ref=blef.fr">Machine learning for fraud detection in streaming services</a>. 
</li><li><strong>Snowflake &amp; Prophet (Meta)</strong> — <a href="https://hoffa.medium.com/facebook-prophet-forecasts-running-in-snowflake-with-snowpark-14fc870b56ae?ref=blef.fr">Run forecasts directly within the warehouse with Snowpark</a>.</li><li><strong>River, Redpanda and Materialize</strong> — Max developed a small Streamlit <a href="https://github.com/MaxHalford/taxi-demo-rp-mz-rv-rd-st?ref=blef.fr">application</a> <a href="https://www.linkedin.com/feed/update/urn:li:activity:6977917696159977473/?updateEntityUrn=urn%3Ali%3Afs_feedUpdate%3A%28V2%2Curn%3Ali%3Aactivity%3A6977917696159977473%29&ref=blef.fr">predicting taxi trip durations in real time</a>.</li><li><a href="https://mlu-explain.github.io/linear-regression/?ref=blef.fr"><strong>Linear Regression explained</strong></a> — Once again mlu-explain created the best resource to explain how linear regression works. As you scroll, you understand how the model works.</li><li><strong>OpenAI</strong> — <a href="https://openai.com/blog/whisper/?ref=blef.fr">Whisper, a new model released by OpenAI for automatic speech recognition</a>.</li></ul><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li><a href="https://airflow.apache.org/blog/airflow-2.4.0/?ref=blef.fr">Airflow 2.4 is out, with data-aware scheduling</a> — This new release features a new way to approach Airflow scheduling. You define datasets and relations between them. Airflow handles the logic to run the DAGs related to each dataset when needed. 
This behaviour was introduced by <a href="https://dagster.io/blog/software-defined-assets?ref=blef.fr">Dagster</a> months ago.</li><li>⚠️ <a href="https://securityaffairs.co/wordpress/136081/hacking/python-bug-cve-2007-4559.html?ref=blef.fr">350,000 Python projects subject to a 15-year-old vulnerability</a> — <a href="https://nvd.nist.gov/vuln/detail/CVE-2007-4559?ref=blef.fr">CVE-2007-4559</a> was discovered in August 2007 and allows an attacker to overwrite files when an archive containing <code>..</code> relative names is untarred (I've been told that this attack also exists for zip). </li><li><a href="https://towardsdatascience.com/bigquery-functions-for-data-cleaning-4b96181fbc3?ref=blef.fr">BigQuery SQL functions for data cleaning</a> — 4 useful functions: normalization, pattern matching, safe division and date formatting.</li><li>📺 <a href="https://www.youtube.com/watch?v=obgY1DBojbY&ref=blef.fr">Saving the planet one query at a time</a> — Part of the data ecosystem lives in a dream: the dream of infinite resources hidden at Google or AWS. But this is as wrong as the infinite oil principle our economy is based on. The time will come to reconsider running a fancy clustered Spark job and to replace it with local DuckDB compute. To go further, the French org <em>The Shift Project</em> wrote a <a href="https://theshiftproject.org/former-les-ingenieurs-a-la-transition/?ref=blef.fr">manifesto to help universities shape the next generation of engineers</a>.</li><li>On the DuckDB topic there is a — not self-explanatory — demo on how to <a href="https://djouallah.github.io/tcph_web/?ref=blef.fr">combine Malloy and DuckDB to do analytics in the web browser</a>.</li><li><a href="https://medium.com/coriers/the-evolution-of-data-companies-167ff4b65e1d?ref=blef.fr">The evolution of data companies</a> — Ben analyzes the extract-load connectors vision of Portable, Airbyte and Estuary. 
These are 3 companies whose founders come from Liveramp, and Ben tries to see which Liveramp problems helped them imagine the data products they run today.</li><li><a href="https://medium.com/@westlakealexa/gamification-of-data-knowledge-9c3c66213952?ref=blef.fr">Gamification of data knowledge</a> — How to create the best data documentation by adding gamification to the process.</li><li><a href="https://medium.com/checkout-com-techblog/testing-monitoring-the-data-platform-at-scale-e22d9cf433e8?ref=blef.fr">Testing &amp; monitoring the data platform at scale</a> — With Airflow and MonteCarlo inside.</li></ul><hr><p>See you next week 👻</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 22.37 ]]></title>
                    <description><![CDATA[ Data News #22.37 — Data roles: lead, analytics engineer, data engineer, the metrics layers, McDonald&#39;s event-driven and fast news. ]]></description>
                    <link><![CDATA[ /data-news-week-22-37/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 6321cd396c0338003d2b2a04 ]]></guid>
                    <pubDate><![CDATA[ 2022-09-16 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/09/image-6.png" class="kg-image" alt loading="lazy" width="2000" height="1333" srcset="https://www.blef.fr/content/images/size/w600/2022/09/image-6.png 600w, https://www.blef.fr/content/images/size/w1000/2022/09/image-6.png 1000w, https://www.blef.fr/content/images/size/w1600/2022/09/image-6.png 1600w, https://www.blef.fr/content/images/size/w2400/2022/09/image-6.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>My weeks are like (<a href="https://unsplash.com/photos/5UNYknY0MTA?ref=blef.fr">credits</a>)</figcaption></figure><p>Halo Data News readers. The weeks are pretty intense for me and every Friday comes in the blink of an eye. I write the introduction before the content of the newsletter, so I don't know how it'll turn out today. But I hope you'll enjoy it.</p><p><strong>For a future deep-dive, I'm looking for data engineering career paths. If you have one or something similar in your company I'd love to have a look at it — everything will be anonymous by default ofc.</strong></p><p>No fundraising this week. I did not find any news to put light on.</p><p></p><h1 id="data-roles">Data roles</h1><p>Every tech lead faces this identity issue one day or another. It is the same for every data lead. How should you divide your time between management, contribution and stakeholders? Mikkel describes well the <a href="https://towardsdatascience.com/the-difficult-life-of-the-data-lead-a31186ef0d27?ref=blef.fr">difficult life of the data lead</a>. 
I was previously in a lead role and the main advice I can give to people in the same situation is: <strong>grieve it and stop the contribution work except for code reviews</strong>.</p><p>In the same vein, 2 other posts I liked this week:</p><ul><li><a href="https://medium.com/gousto-engineering-techbrunch/what-is-the-difference-between-an-analytics-engineer-and-a-data-engineer-9a5ca0c7b7b1?ref=blef.fr">What is the difference between an Analytics Engineer and a Data Engineer?</a></li><li><a href="https://medium.com/pipeline-a-data-engineering-resource/5-lessons-that-helped-me-not-quit-my-data-job-in-week-1-8064363643ea?ref=blef.fr">Lessons that helped me not quit my data job in Week 1</a>; the best tip inside is: <em>Bother your seniors, that’s what they’re for.</em></li></ul><p></p><h1 id="the-metrics-layer">The metrics layer</h1><p>Pedram produced a <a href="https://pedram.substack.com/p/what-is-the-metrics-layer?ref=blef.fr">deep-dive on the metrics layer</a>. He explains what's behind it and which current solutions propose a metrics layer: Looker, dbt Metrics and Lightdash.</p><p>In the current state of the technology <strong>the metrics layer is nothing more than a declarative way (a file) to describe the metrics, dimensions, filters and segments in your warehouse</strong> tables. In Looker you write it in LookML, in dbt and Lightdash you use the dbt YAML, in Cube you use JavaScript.</p><p>The final vision of the metrics layer is to create an interoperable way to define metrics and dimensions that every BI tool will understand natively, avoiding the hours spent recreating this knowledge in each tool. 
But we are far from there.</p><p></p><h1 id="mcdonald%E2%80%99s-event-driven-architecture">McDonald’s event-driven architecture</h1><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/09/image-7.png" class="kg-image" alt loading="lazy" width="2000" height="1125" srcset="https://www.blef.fr/content/images/size/w600/2022/09/image-7.png 600w, https://www.blef.fr/content/images/size/w1000/2022/09/image-7.png 1000w, https://www.blef.fr/content/images/size/w1600/2022/09/image-7.png 1600w, https://www.blef.fr/content/images/size/w2400/2022/09/image-7.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>Event flows at McDonald's (<a href="https://unsplash.com/photos/0-SARiQX6NE?ref=blef.fr">credits</a>)</figcaption></figure><p>A two-post series details what's behind McDonald's event architecture. First, they <a href="https://medium.com/mcdonalds-technical-blog/behind-the-scenes-mcdonalds-event-driven-architecture-51a6542c0d86?ref=blef.fr">define what it means to develop such an architecture</a>: something that needs to be scalable, available, performant, secure, reliable, consistent and simple. They went with mainstream choices: Kafka (but managed by AWS), the Schema Registry, DynamoDB to store the events and API Gateway to create an API endpoint to receive events. Nothing fancy, but it looks solid.</p><p>In the second post they give the <a href="https://medium.com/mcdonalds-technical-blog/mcdonalds-event-driven-architecture-the-data-journey-and-how-it-works-4591d108821f?ref=blef.fr">global picture and how everything orchestrates together</a>, defining the typical data flow. 
We can summarize it as: define the event schema, produce the event, validate, publish, and if something goes wrong use a <a href="https://en.wikipedia.org/wiki/Dead_letter_queue?ref=blef.fr">dead letter topic</a> or write directly to DynamoDB.</p><p></p><h1 id="ml-friday-%F0%9F%A4%96">ML Friday 🤖</h1><ul><li><a href="https://madewithml.com/courses/mlops/data-stack/?ref=blef.fr">Data Stack for Machine Learning</a> — This is an MLOps course that contains a data stack chapter. It covers data storage, extract, load and transform. The whole course seems great.</li><li><a href="https://www.linkedin.com/pulse/how-ai-eat-perfume-industry-nikolaj-groeneweg/?ref=blef.fr">How AI will eat the perfume industry</a> — "<em>Google AI identifies scents more reliably than humans</em>". </li><li>📺 <a href="https://www.youtube.com/watch?v=Y9NUo_3cUIw&ref=blef.fr">Learned data augmentation for bias correction</a> — I really like the fact it's a PhD defence talk given at a technical university in Denmark by Pola Schwöbel.</li><li><a href="https://netflixtechblog.medium.com/new-series-creating-media-with-machine-learning-5067ac110bcd?ref=blef.fr">Creating media with Machine Learning</a> at Netflix — This is a new blog series where the Netflix tech team explains how they use machine learning to produce creative media content.</li></ul><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/09/image-8.png" class="kg-image" alt loading="lazy" width="2000" height="1333" srcset="https://www.blef.fr/content/images/size/w600/2022/09/image-8.png 600w, https://www.blef.fr/content/images/size/w1000/2022/09/image-8.png 1000w, https://www.blef.fr/content/images/size/w1600/2022/09/image-8.png 1600w, https://www.blef.fr/content/images/size/w2400/2022/09/image-8.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>The true Uber alternative (<a 
href="https://unsplash.com/photos/cOywAM6fsPo?ref=blef.fr">credits</a>)</figcaption></figure><ul><li><a href="https://www.nytimes.com/2022/09/15/technology/uber-hacking-breach.html?ref=blef.fr">Uber has been — apparently — hacked</a> last night. The attacker claims to be an 18-year-old. He got VPN access using social engineering on an IT person. He then scanned the intranet and found a Powershell script on the shared network. The script contained the username/password of Uber's access management platform. That's how he got in. This is a small reminder that "nothing is really secure".</li><li>Generative AI news — now that we have overly complicated generative AIs, people developed products to <a href="https://phraser.tech/?ref=blef.fr">generate prompts that will work with each AI</a>. There is also an <a href="https://www.tattoosai.com/?ref=blef.fr">AI to find your next tattoo</a>.</li><li><a href="https://cloud.google.com/blog/products/data-analytics/introducing-seamless-database-replication-to-bigquery?hl=en&ref=blef.fr">Introducing Datastream for BigQuery</a> — Google developed an integrated solution to do Change Data Capture on GCP. It can use MySQL, Oracle and Postgres as sources and GCS and BigQuery as destinations for the moment. This is a good solution to go real-time with minimal footprint.</li><li><a href="https://www.getbluesky.io/?ref=blef.fr">Bluesky, monitor your Snowflake cost and get alerted</a> — As I recently shared, we may see a lot of tools similar to this one in the future, as warehouses take a prominent place in current data stacks. It watches all SQL queries to identify queries with an unbalanced performance/cost ratio.</li><li><a href="https://engineering.monday.com/how-to-replace-your-database-while-running-full-speed/?ref=blef.fr">How to replace your database while running full speed</a> — Every data engineer has to face a migration one day or another. Lior from monday explains how they performed a migration from an analytical database to Snowflake with no downtime. 
It consisted of 4 steps: create all the schemas, migrate the writes, validate, migrate the reads.</li><li>Airbyte released a <a href="https://glossary.airbyte.com/?ref=blef.fr">data glossary</a> with a graph network to see relationships between articles.</li><li>Iceberg articles — A <a href="https://www.dremio.com/subsurface/apache-iceberg-101-your-guide-to-learning-apache-iceberg-concepts-and-practices/?ref=blef.fr">list of useful articles</a> when you want to understand what Iceberg is, and a post explaining <a href="https://www.dremio.com/subsurface/how-z-ordering-in-apache-iceberg-helps-improve-performance/?ref=blef.fr">Z-Ordering</a> with Iceberg. Regarding Z-order, it's a way to cluster data to optimise co-location when accessing data. But it obviously comes at a cost.</li><li><a href="https://engineering.linkedin.com/blog/2022/real-time-analytics-on-network-flow-data-with-apache-pinot?ref=blef.fr">Real-time analytics on network flow data with Apache Pinot</a> — How LinkedIn uses Kafka and Pinot to do real-time analytics on TBs of network data.</li><li><a href="https://towardsdatascience.com/its-time-to-set-sla-slo-sli-for-your-data-team-only-3-steps-ed3c93009aa5?ref=blef.fr">It’s time to set SLA, SLO, SLI for your data team</a> — It's time to apply SRE metrics to data teams.</li><li><a href="https://airtable.com/shrQMzHOF4hWfdTBG/tblA6Jm3vnbGCyLeC/viwTJgSb0J6YDUofu?ref=blef.fr">Connectors catalog</a> — Pierre created an Airtable detailing every connector out there. If you want to copy data from a specific source, have a look at it to find which tool you can use.</li></ul><hr><p>See you next week and please stop writing about data contracts.</p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Airflow dynamic DAGs ]]></title>
                    <description><![CDATA[ Learn how to create Apache Airflow dynamic DAGs (with and without TaskFlow API). ]]></description>
                    <link><![CDATA[ /airflow-dynamic-dags/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 6317162ca8b153003dbf2200 ]]></guid>
                    <pubDate><![CDATA[ 2022-09-13 ]]></pubDate>
                    <content>
<![CDATA[ <p>Airflow is a wonderful tool I have been using for the last 4 years. Sometimes I like it. Sometimes I don't. This post is dedicated to Airflow dynamic DAGs. I want to show you how to do it properly. Here we can see Airflow as a Python framework, so writing dynamic DAGs is just writing more Python code.</p><h1 id="why-should-i-use-dynamic-dags">Why should I use dynamic DAGs?</h1><p>Airflow dynamic DAGs are useful when, for instance, you have multiple tables and want a DAG per ingested table. To avoid creating multiple Python files and copy-pasting, you can factorize your code and create a dynamic structure.</p><p>Let's illustrate with an example. Imagine you have to copy your production Postgres database. To do it you create a list of the tables you want to fetch every morning. The <code>factory</code> will take this table list as an input and will dynamically produce a list of DAGs.</p><p>If, for instance, you want to do different things depending on the table type — e.g. incremental/full — you can go deeper by creating a configuration file per table and then looping over all the configuration files to create a DAG per table.</p><p>When you're doing an extract and load process I recommend creating a DAG per table rather than a DAG per schema or database. This way each DAG has a smaller scope and backfilling a table is easier. 
The main disadvantage of this solution is that you have to use more sensors in downstream dependencies.</p><p>In summary you can use dynamic DAGs for:</p><ul><li>Ingesting multiple tables from a database → a DAG per table</li><li>Running a list of SQL queries per domain → a DAG per domain</li><li>Scraping a list of websites → a DAG per website</li><li>Every time you are copy-pasting DAG code</li></ul><p></p><h1 id="dynamic-dags-with-taskflow-api">Dynamic DAGs with TaskFlow API</h1><p>We will use the latest Airflow version — 2.3.4 — here, but it'll work with every version that has the TaskFlow API. Let's say we have 3 sources and we want to create a DAG per source to do stuff on each source. These sources are <code>user</code>, <code>product</code> and <code>order</code>. For each source we want to apply a prepare and a load function.</p><figure class="kg-card kg-code-card"><pre><code class="language-python">import pendulum

from airflow.decorators import dag, task


@task
def prepare(source):
    print(f"Prepare {source}")
    pass


@task
def load(source):
    print(f"Load {source}")
    pass


def create_dag(source):
    @dag(
        schedule_interval="0 1 * * *",
        start_date=pendulum.datetime(2022, 1, 1, tz="UTC"),
        catchup=False,
        dag_id=f"prepare_and_load_{source}"
    )
    def template():
        """
        ### Prepare and load data
        This is the DAG that loads all the raw data
        """
        prepare_task = prepare(source)
        load_task = load(source)

        prepare_task &gt;&gt; load_task

    return template()


for source in ["user", "product", "order"]:
    globals()[source] = create_dag(source)
</code></pre><figcaption>dags/prepare_and_load.py</figcaption></figure><p>The important part of this code is the last line. It creates a global variable that contains the DAG object, which the Airflow DagBag will parse and register on every scheduler loop. Note that each generated DAG needs a unique <code>dag_id</code> and a unique global variable name, otherwise the last one wins.</p><pre><code class="language-python">globals()[source] = create_dag(source)</code></pre><p>If you want to go further you can also create a configuration per source. I recommend writing Python configurations rather than JSON. The main reason is that Python configuration can be linted and statically checked, and you can comment Python dicts.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/09/Screenshot-2022-09-06-at-16.50.47.png" class="kg-image" alt loading="lazy" width="2000" height="599" srcset="https://www.blef.fr/content/images/size/w600/2022/09/Screenshot-2022-09-06-at-16.50.47.png 600w, https://www.blef.fr/content/images/size/w1000/2022/09/Screenshot-2022-09-06-at-16.50.47.png 1000w, https://www.blef.fr/content/images/size/w1600/2022/09/Screenshot-2022-09-06-at-16.50.47.png 1600w, https://www.blef.fr/content/images/size/w2400/2022/09/Screenshot-2022-09-06-at-16.50.47.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>Airflow UI with dynamic DAGs</figcaption></figure><h1 id="dynamic-dags-with-configurations">Dynamic DAGs with configurations</h1><p>So you have a configuration folder called <code>config</code> in which you have the 3 source configurations.</p><figure class="kg-card kg-code-card"><pre><code class="language-python">config = {
    "name": "user",
    "type": "A",
}
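
# Every other source follows the same shape, one file per source.
# For example, a hypothetical config/product.py would contain:
#
# config = {
#     "name": "product",
#     "type": "B",
# }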
</code></pre><figcaption>config/user.py (as an example)</figcaption></figure><figure class="kg-card kg-code-card"><pre><code class="language-python">import os
from dataclasses import dataclass

import pendulum
from importlib.machinery import SourceFileLoader

from airflow.decorators import dag, task

CONFIG_FOLDER = "dags/config"


@dataclass
class Config:
    name: str
    type: str


@task
def prepare(source):
    print(f"Prepare {source}")
    pass


@task
def load(source):
    print(f"Load {source}")
    pass


def create_dag(source):
    @dag(
        schedule_interval="0 1 * * *",
        start_date=pendulum.datetime(2022, 1, 1, tz="UTC"),
        catchup=False,
        dag_id=f"prepare_and_load_{source.name}"
    )
    def template():
        """
        ### Load monthly data to the warehouse
        This is the DAG that loads all the raw data to the warehouse
        """
        prepare_task = prepare(source)
        load_task = load(source)

        prepare_task &gt;&gt; load_task

    return template()


for file in os.listdir(CONFIG_FOLDER):
    if file.endswith(".py"):
        filename = os.path.join(CONFIG_FOLDER, file)
        module = SourceFileLoader("module", filename).load_module()
        config = Config(**module.config)
        globals()[config.name] = create_dag(config)
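
# Note: SourceFileLoader(...).load_module() works but has been deprecated
# since Python 3.4; a sketch of the supported importlib equivalent:
#
#   import importlib.util
#   spec = importlib.util.spec_from_file_location("module", filename)
#   module = importlib.util.module_from_spec(spec)
#   spec.loader.exec_module(module)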

</code></pre><figcaption>dags/prepare_and_load_advanced.py</figcaption></figure><p>I decided to use a dataclass to parse every configuration and a module loader to load each file. This way every configuration will be statically checked, and if an error slips into a configuration the Python code will be invalid. You can then catch it in your CI/CD process for instance.</p><h1 id="dynamic-dags-without-taskflow">Dynamic DAGs without TaskFlow</h1><p>You can also do it without the TaskFlow API: you just need a <code>create_dag</code> function that returns a DAG and you're set. Below is a small example.</p><figure class="kg-card kg-code-card"><pre><code class="language-python">import os
from dataclasses import dataclass

import pendulum
from importlib.machinery import SourceFileLoader

from airflow import DAG
from airflow.operators.python import PythonOperator

CONFIG_FOLDER = "dags/config"


@dataclass
class Config:
    name: str
    type: str


def create_dag(source):
    dag = DAG(
        dag_id=f"prepare_and_load_{source.name}",
        start_date=pendulum.datetime(2022, 1, 1, tz="UTC"),
        catchup=False,
        schedule_interval="0 1 * * *",
    )

    prepare_task = PythonOperator(
        task_id="prepare",
        # no-arg callable: Airflow only passes arguments we ask for
        python_callable=lambda: print(f"Prepare {source.name}"),
        dag=dag
    )

    load_task = PythonOperator(
        task_id="load",
        python_callable=lambda: print(f"Load {source.name}"),
        dag=dag
    )

    prepare_task &gt;&gt; load_task

    return dag


for file in os.listdir(CONFIG_FOLDER):
    if file.endswith(".py"):
        filename = os.path.join(CONFIG_FOLDER, file)
        module = SourceFileLoader("module", filename).load_module()
        config = Config(**module.config)
        globals()[config.name] = create_dag(config)
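
# A quick way to sanity-check any of these factory files locally (assuming
# a working Airflow installation) is to let the DagBag parse the folder and
# inspect what it discovered:
#
#   from airflow.models import DagBag
#   bag = DagBag("dags/")
#   print(bag.dag_ids)        # one prepare_and_load_<name> per config file
#   print(bag.import_errors)  # should be empty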

</code></pre><figcaption>dags/main_without_taskflow.py</figcaption></figure><h1 id="conclusion">Conclusion</h1><p>Creating dynamic DAGs in Airflow is super easy. You can create DAG factories for all the repetitive tasks you may have, and thanks to this you'll be able to unit test your ETL code. </p> ]]>
                    </content>
                </item>
                <item>
                    <title><![CDATA[ Data News — Week 22.36 ]]></title>
                    <description><![CDATA[ Data News #22.36 — Arize and Hebbia 💰, Firebolt lay-offs?, data mesh/contracts, dashboard explosion, a big ML Friday and news. ]]></description>
                    <link><![CDATA[ /data-news-week-22-36/ ]]></link>
                    <guid isPermaLink="false"><![CDATA[ 6315ae3c3d7133003daced47 ]]></guid>
                    <pubDate><![CDATA[ 2022-09-09 ]]></pubDate>
                    <content>
<![CDATA[ <figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/09/image-3.png" class="kg-image" alt loading="lazy" width="2000" height="1125" srcset="https://www.blef.fr/content/images/size/w600/2022/09/image-3.png 600w, https://www.blef.fr/content/images/size/w1000/2022/09/image-3.png 1000w, https://www.blef.fr/content/images/size/w1600/2022/09/image-3.png 1600w, https://www.blef.fr/content/images/size/w2400/2022/09/image-3.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>👑 (<a href="https://unsplash.com/photos/3E3AVpvlpao?ref=blef.fr">credits</a>)</figcaption></figure><p>Hey, weeks are passing so fast. Every week I think I have time until Friday and suddenly it's already Friday.</p><p>On the 21st I'll give a talk in French at a meetup: <a href="https://www.meetup.com/fr-FR/IA-Engineering/events/288302156/?ref=blef.fr">dbt and the modern data stack</a>. I'll talk about the dbt artifacts and my extension <a href="https://github.com/Bl3f/dbt-helper?ref=blef.fr">dbt-helper</a>. I'd love to see you there 🤗.</p><p>Enjoy this week's edition.</p><h1 id="data-fundraising-%F0%9F%92%B0">Data fundraising 💰</h1><ul><li><strong><a href="https://arize.com/?ref=blef.fr">Arize</a></strong>, a machine learning observability platform, <a href="https://arize.com/blog/arize-ais-next-era-of-growth/?ref=blef.fr">raised a $38m Series B</a>. Used by big names. They integrate with the standard Python machine learning stack, with a free tier. If you need drift detection, model monitoring or explainability it's worth a look.</li><li><strong><a href="https://www.hebbia.ai/?ref=blef.fr">Hebbia</a></strong>, a document search engine, <a href="https://techcrunch.com/2022/09/07/hebbia-raises-30m-to-launch-an-ai-powered-document-search-tool/?ref=blef.fr">raised a $30m Series A</a>. Their website doesn't detail much about what they do and how. You can ingest PDFs, Office docs, etc. 
and then ask natural language questions to get answers.</li><li>😥 Firebolt is apparently doing a layoff, firing <a href="https://www.calcalistech.com/ctechnews/article/symoqnigj?ref=blef.fr">dozens of employees</a>. We don't have more information, but if it turns out to be true it'll be sad. It also shows that the data warehouse competition is harder than ever before and that their high valuation — $1.4b in January — is a tricky spot to deliver on.</li><li>On the same sad note, <a href="https://newsletter.pragmaticengineer.com/p/the-scoop-24?ref=blef.fr">Snap will shut down the Zenly app</a>, letting go the whole Paris team. Almost 3 years ago I was in the same situation with my former employer; I wish all the best to the Zenly team. As everyone is saying, Zenly was one of the best French tech teams, so if you are looking for talented people try to reach out to them.</li></ul><p></p><h1 id="dos-and-donts-of-data-mesh">Do's and don'ts of data mesh</h1><p>BlaBlaCar is one of the most advanced French companies when it comes to data. The travel company decided to implement a mesh organisation at the beginning of the year, rearranging 5 teams into 5 domains. Teams are cross-functional — like feature teams — across 5 domains: demand, supply (x2), marketing and infrastructure. </p><p>In the post Kineret details a few <a href="https://medium.com/blablacar/dos-and-don-ts-of-data-mesh-e093f1662c2d?ref=blef.fr">do's and don'ts</a> to consider when deciding to move to a mesh structure. As always for a migration, communication is one of the most important topics. With big changes, transparency should come first.</p><p>Continuing on the organisation aspect of a mesh: if you want your domain-oriented teams to succeed you'll need to create a way for teams to communicate with each other. Data contracts are a piece of the puzzle. 
As data contracts picked up again recently, mehdio explained <a href="https://towardsdatascience.com/data-contracts-from-zero-to-hero-343717ac4d5e?ref=blef.fr">how you can implement data contracts</a> and why they matter. </p><p>Small heads-up here: you can implement data contracts without an event bus, and even with an event bus you might still need to implement "contracts" that go deeper than just the messaging system, because you'll still have exceptions and a lot of stuff happening outside of the bus.</p><p></p><h1 id="what-if-every-dashboard-self-destructed">What if every dashboard self destructed</h1><p>The title says it all. This is a fun title but it means a lot. In data we have too many things. Many dashboards. Many tables. Many KPIs. <a href="https://counting.substack.com/p/what-if-every-dashboard-self-destructed?ref=blef.fr">What if we automatically destroyed dashboards</a>? What if we did it based on view counts? We could also remove and clean the whole data chain behind a dashboard. In real life I'm not a tidy person, but when it comes to data warehouses or BI tools I feel tidiness is way more important than in my bedroom.</p><p>When people try to predict the future of BI they often say that notebooks are the dashboard replacement. I don't think it'll be the case, but it's a move forward. In the future of the future, people say that <a href="https://blog.count.co/bye-bye-notebooks-hello-canvas/?ref=blef.fr">canvases are the notebook replacement</a>. I feel this is a good idea: it combines the creativity of the dashboard with the linear execution of the notebook to create a good story.</p><p>Small advice I heard this week in the excellent <a href="https://www.datageneration.co/?ref=blef.fr">DataGen podcast</a> — <a href="https://www.deezer.com/fr/episode/429389087?ref=blef.fr">Deezer episode</a>, in French. 
If you use a dashboard, a notebook, a canvas or whatever: when you release an analysis, record an additional video to put sound on it. It will for sure help people onboard faster onto your work.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/09/image-4.png" class="kg-image" alt loading="lazy" width="2000" height="1329" srcset="https://www.blef.fr/content/images/size/w600/2022/09/image-4.png 600w, https://www.blef.fr/content/images/size/w1000/2022/09/image-4.png 1000w, https://www.blef.fr/content/images/size/w1600/2022/09/image-4.png 1600w, https://www.blef.fr/content/images/size/w2400/2022/09/image-4.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>Tableau after you read the previous article (<a href="https://unsplash.com/photos/hLUTRzcVkqg?ref=blef.fr">credits</a>)</figcaption></figure><p></p><h1 id="ml-friday">ML Friday</h1><ul><li><a href="https://tech.instacart.com/lessons-learned-the-journey-to-real-time-machine-learning-at-instacart-942f3a656af3?ref=blef.fr">The journey to real-time machine learning at Instacart</a> — Whatever we say, today's data stacks are still mainly batch. The main reason is often that data is used for analytics, where batch is enough. That's also why machine learning often starts in batch. But if you want to go to production you'll need to be more reactive. Instacart details their journey from batch to real-time with a feature store at the center.</li><li><a href="https://medium.com/walmartglobaltech/unsung-saga-of-mlops-1b494f587638?ref=blef.fr">Unsung saga of MLOps</a> — Jaya from Walmart writes about the operational concepts around machine learning in production. 
Training, modeling and canary deployment are all covered in the post.</li><li><a href="https://doordash.engineering/2022/09/08/evolving-doordashs-substitution-recommendations-algorithm/?ref=blef.fr">Evolving DoorDash’s substitution recommendations algorithm</a> — How can a retailer recommend products when some are not available? This is a great machine learning exercise for aspiring data scientists.</li><li><a href="https://slack.engineering/recommend-api/?ref=blef.fr">Recommendations APIs at Slack</a> — This is a bit of an insider post that shows where Slack uses ML and also the API infrastructure behind it. Mainly batch, orchestrated by Airflow. <strong>Next time Slackbot suggests you leave a channel, you'll know what's behind it</strong>.</li><li><a href="https://www.music-tomorrow.com/blog/towards-recommender-system-optimization-data-tool-for-algorithmic-optimization-on-streaming-platforms?ref=blef.fr">Recommender System Optimization</a> — Music Tomorrow is a platform that gives knowledge to music professionals. 
They reverse-engineered the Spotify recommendation engine to help the music industry create more recommendable content ➰.</li><li><a href="https://building.nubank.com.br/data-science-interview-pratical-tips/?ref=blef.fr">Acing the data science interview: 8 practical tips with examples</a></li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.blef.fr/content/images/2022/09/image-5.png" class="kg-image" alt loading="lazy" width="2000" height="1331" srcset="https://www.blef.fr/content/images/size/w600/2022/09/image-5.png 600w, https://www.blef.fr/content/images/size/w1000/2022/09/image-5.png 1000w, https://www.blef.fr/content/images/size/w1600/2022/09/image-5.png 1600w, https://www.blef.fr/content/images/size/w2400/2022/09/image-5.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>The newsletter feels like a bullet point collection these days (<a href="https://unsplash.com/photos/RLw-UC03Gwc?ref=blef.fr">credits</a>)</figcaption></figure><p></p><h1 id="fast-news-%E2%9A%A1%EF%B8%8F">Fast News ⚡️</h1><ul><li>✨ <a href="https://twitter.com/teej_m/status/1567622047739805696?ref=blef.fr">Funnel analysis, a presentation from the Snowflake Summit</a> — If you work in data you have written a funnel analysis at least once. Teej made a great presentation on how you can do it in Snowflake. It compares 3 methods: joins, windows and regexp, and it's clever.</li><li><a href="https://github.com/axa-group/Parsr?ref=blef.fr">Parsr</a> — An open-source document data extraction toolchain. With Parsr you can clean, parse and extract data from images, PDF, docx and eml files.</li><li><a href="https://blog.devgenius.io/metrics-of-a-data-platform-560aee4239d6?ref=blef.fr">Metrics of a data platform</a> — A long list of metrics you can track when running a data platform. If you are just starting, don't try to implement them all at once, do it incrementally. I really like the survey metrics like <em>Ease of getting data</em> and the <em>P90/Time to accommodate</em>. 
They represent well the areas where a data engineering team should perform.</li><li><a href="https://www.plural.sh/blog/pros-and-cons-of-kubernetes/?ref=blef.fr">The pros and cons of Kubernetes</a> — I hate working on Dockerfiles and YAML files. It's an infinite loop.</li><li><a href="https://blog.coinbase.com/building-a-python-ecosystem-for-efficient-and-reliable-development-d986c97a94a0?ref=blef.fr">Building a Python ecosystem for efficient and reliable development</a> — How Coinbase used Pants to develop a complete build system.</li><li><a href="https://pipebird.com/blog/why-every-saas-company-will-offer-native-data-pipelines?ref=blef.fr">Why every SaaS company will offer native data pipelines</a> — This is about a trend. I think data connectors are one of the hardest data businesses to run. So many competitors, so many open-source ways to do it, and as it's pipelines, random issues will pop up every day. Reversing the logic and saying "you have my data, so it's up to you to push it to me" can fix this, but tools will need help to do it.</li><li><a href="https://michaelberk.medium.com/how-to-automate-your-data-infrastructure-with-code-751b96355665?ref=blef.fr">Terraform 101</a> — A well-written post about what Terraform is, for beginners.</li><li><a href="https://fly.io/blog/sqlite-virtual-machine/?ref=blef.fr">How the SQLite virtual machine works</a> — I already spent too much time on this edition, and I have not read this article yet, but I want to.</li><li><a href="https://wrongbutuseful.substack.com/p/deciding-if-a-data-leadership-role?ref=blef.fr">Deciding if a data leadership role is something you actually want to do</a></li></ul><hr><p>See you next week.</p> ]]>
                    </content>
                </item>

    </channel>
</rss>
