Data News — Week 18

Data News #18 — PyScript → the end of Javascript?, MLOps explained, Pick a data catalog, how to be data analyst, etc.

7 May 2022 — 5 min read

Hello folks, I hope this newsletter finds you well. This week I had too many articles to share, so I tried to organize them differently. I hope you'll like it.

Small survey: Would you be interested in getting the sources I use for the curation? If yes or no, please reply to this email.

PyScript, doing Python in the browser 🐍

PyScript was announced at PyCon US this week: a library that lets anyone write Python in the browser. Yep, you read that right. It means you can write Python directly within a <py-script> tag. Under the hood it uses Pyodide, a port of CPython to WebAssembly, and it provides a REPL directly in the browser.

Will Python become a JavaScript replacement for all data people with frontend needs? No. No. And no.

Is it a good idea? To be honest, I don't know. Technically I like it a lot: it shows how far we can go today. It isn't that surprising, though, since we have been able to run Doom in the browser for years. So yeah, running Python should be business as usual... yeah? Actually, no.

When you take a deeper look, it is not as trivial as it seems. Python was not designed to be asynchronous, but doing things in the browser requires it. It will also require browsers to make significant changes and to align on APIs. Right now, in Firefox, I get a lot of out-of-memory errors with the examples.
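To make the asynchronous point concrete, here is a minimal sketch. It uses the standard asyncio event loop as a stand-in for the browser's single main thread, which is an assumption for illustration and not how PyScript itself is wired. All function names are hypothetical. A "heartbeat" task plays the role of the page staying responsive, and we measure how long it stalls next to blocking versus cooperative work.

```python
import asyncio
import time


async def heartbeat(ticks, n=5, interval=0.01):
    """Stand-in for UI work that should keep running (hypothetical helper)."""
    for _ in range(n):
        ticks.append(time.monotonic())
        await asyncio.sleep(interval)


async def blocking_work(duration=0.1):
    # Yield once so the other tasks get a chance to start first.
    await asyncio.sleep(0)
    # time.sleep never hands control back to the event loop:
    # nothing else can run until it returns.
    time.sleep(duration)


async def cooperative_work(duration=0.1):
    # asyncio.sleep suspends this task and lets the loop run other tasks.
    await asyncio.sleep(duration)


async def max_gap(work):
    """Run the heartbeat next to `work` and return the longest stall between ticks."""
    ticks = []
    await asyncio.gather(heartbeat(ticks), work())
    return max(later - earlier for earlier, later in zip(ticks, ticks[1:]))


def measure():
    """Return (stall with blocking work, stall with cooperative work) in seconds."""
    return asyncio.run(max_gap(blocking_work)), asyncio.run(max_gap(cooperative_work))


if __name__ == "__main__":
    blocking, cooperative = measure()
    print(f"longest stall with blocking work:    {blocking:.3f}s")
    print(f"longest stall with cooperative work: {cooperative:.3f}s")
```

The blocking version freezes the heartbeat for the whole duration of the work, which in a browser would mean a frozen page. Existing synchronous Python code behaves like `blocking_work` unless it is rewritten around `await`.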

Finally, if we think about the use cases behind PyScript, it is hard to imagine something that goes further than "templating". I understand that we have a real REPL in the browser, but when we look at the examples, each time we want a bit of interactivity we fall back to JavaScript (with the Panel library) or to a JS binding (with D3, for instance). This is not so far from what we already do with Jinja. If I'm being totally honest, you can also do animation with only Python, but it would require recoding everything we already have in JavaScript, with many other concerns (e.g. performance, available HTML APIs).

My last word: I think it will be a cool alternative for building web UIs faster in Python, but with a lot of limitations.

xkcd Python meme with PyScript in the Browser (source code)

MLOps in 10 Minutes

If you are interested in learning more about machine learning operations (a.k.a. MLOps), this starter is for you. The DataTalksClub community, led by Alexey, started a free MLOps Zoomcamp on GitHub, and Alexey wrote a 10-minute post about MLOps. It explains well the main processes to consider when putting models in production: design, train, operate.

Related to this, Databricks, on an announcement spree, publicly released its feature store.

Choosing a Data Catalog

You are used to Sarah's posts in the newsletter. This week she decided to write about data catalogs and what to consider when choosing one. She puts catalogs in three buckets: all-in-one, no-code with integrations, and customizable with code. Even though I agree with this split, I would say that whatever buckets we create, we will always end up comparing old companies, startups and open-source products.

Still, one question to answer before choosing a data catalog: find the root reason behind the need for a catalog and you will have a lead on the solution you need. But remember that a data catalog is not a miracle that will solve all your data problems; it is only a technology layer. Without a community, practices and leadership, this technology layer will probably fail. I recommend truly dedicating people to animating and moderating it.

How to add value as a Data Analyst

Now it's time for Analytics Engineers. Cassie, Chief Decision Scientist at Google, wrote an excellent article about the journey to become a "real" data analyst. She debunks 3 misconceptions we could have about the analytics discipline. To help you get started in the field, you can also read the dbt analytics engineering glossary. It is fairly simple, and it can help with standardization in data teams.

On a more technical note, you can now use a package developed by the community that helps you interact with the dbt Cloud API from the CLI. You can also get inspiration from Whatnot's data platform choices regarding dbt structure: they use marts with a "centrally-managed domain-owned" approach.

You can also have a look at the dbt Snowflake package to put physical constraints on your warehouse (Snowflake or Postgres supported).

Kimball x OBT

I may have been ignorant for the last 4 years, but I just discovered the term OBT. OBT means "One Big Table", and it was actually heavily promoted by the Google team for data modeling in BigQuery, as opposed to traditional Kimball dimensional modeling. The matter was heavily debated on Reddit, and Fivetran wrote a performance comparison overview 2 years ago.
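For readers who, like me, had not met the term, here is a minimal sketch of the difference using the standard sqlite3 module. The table and column names are made up for illustration, and SQLite is only a stand-in for a warehouse like BigQuery. The same question is answered once against a Kimball-style star schema, with a join, and once against a pre-joined One Big Table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Kimball style: one fact table plus dimension tables (hypothetical schema).
    CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, country TEXT);
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE fact_sales (customer_id INTEGER, product_id INTEGER, amount REAL);

    INSERT INTO dim_customer VALUES (1, 'FR'), (2, 'DE');
    INSERT INTO dim_product VALUES (10, 'books'), (20, 'games');
    INSERT INTO fact_sales VALUES (1, 10, 12.0), (1, 20, 30.0), (2, 10, 8.0);

    -- OBT style: the joins are done once, up front, into a single wide table.
    CREATE TABLE obt_sales AS
    SELECT f.amount AS amount, c.country AS country, p.category AS category
    FROM fact_sales AS f
    JOIN dim_customer AS c ON c.customer_id = f.customer_id
    JOIN dim_product AS p ON p.product_id = f.product_id;
    """
)

# Revenue by country on the star schema: the join happens at query time.
star_query = """
    SELECT c.country, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_customer AS c ON c.customer_id = f.customer_id
    GROUP BY c.country
    ORDER BY c.country
"""

# The same question on the One Big Table: no join needed.
obt_query = """
    SELECT country, SUM(amount)
    FROM obt_sales
    GROUP BY country
    ORDER BY country
"""

star_result = conn.execute(star_query).fetchall()
obt_result = conn.execute(obt_query).fetchall()
print(star_result)
print(obt_result)
```

Both queries return the same rows. The trade-off is that the OBT duplicates dimension values on every row and has to be rebuilt when a dimension changes, while the star schema pays for its joins at query time.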

Workflow orchestration vs. data orchestration

Anna from the Prefect team tried to explain the differences between workflow and data orchestration. The "data orchestration" term has been promoted by Dagster¹. So it feels a bit like a war of words, but the article is still interesting because it explains the concepts in simple words. Also, confessionsofadataguy did a light Prefect review. I disagree a bit with him, but my own review has been in draft for months.

Fast News ⚡️

Snowflake announced a partnership with Dell that seems really huge. Snowflake will be able to run workloads on Dell's on-premise object storage. I know that probably no one here is using Dell's on-premise storage, but it still means they are starting to open the on-premise door.
Douwe Maan, the CEO of Meltano, explained their DataOps OS vision: seeing the Modern Data Stack as an operating system, with many OSS applications putting your data in motion, is real.
Following last week's conferences category, Denis from Deezer wrote "what I loved, what I learned, what I loathed" about Devoxx. It is super nice, with YouTube links (mainly in French).
Performance improvements on Delta Lake v1.2.
Trino released Project Tardigrade in order to provide out-of-the-box fault tolerance. Queries can restart from checkpoints after a failure, and resource sharing between concurrent queries has been improved.
Control your Airflow DAGs from an external database: a good experiment that shows how you can create dynamic DAGs from an external source. I do not recommend this in production.
Learn how to UNDROP a Snowflake table: this tip saved my team a few months ago.
Discover databases with this introduction to NoSQL DBs or with 34 ten-minute videos about databases in a Twitter thread format.

¹ You can be the leader of your category if you invent the category².
² This is the first time I've added footnotes, so if you read this 👋.

Data News

Christophe Blefari

Staff Data Engineer. I like 🚲, 🪴 and 🎮. I can do everything with data, just ask.
