
Data News — Week 18

Data News #18 — PyScript → the end of Javascript?, MLOps explained, Pick a data catalog, how to be data analyst, etc.

Christophe Blefari
5 min read
Patchwork (credits)

Hello folks, I hope the newsletter finds you well. This week I had too many articles to share, so I tried to sort them differently. I hope you'll like it.

Small survey: would you be interested in getting the sources I use for the curation? Whether the answer is yes or no, please reply to this email.

PyScript, doing Python in the browser 🐍

PyScript was announced at PyCon US this week: a library that lets anyone write Python in the browser. Yep, you read that right. It means you can write Python directly within a <py-script> tag. Under the hood it uses Pyodide, a port of CPython to WebAssembly, which also provides a REPL directly in the browser.
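To make the <py-script> idea concrete, here is a minimal sketch, written in plain Python, that builds such a page as a string. The two asset URLs are an assumption (the alpha-release links published around the announcement, which may have moved since), and build_page and SNIPPET are hypothetical names used only for illustration.

```python
from textwrap import dedent

# Assumption: alpha-release asset URLs from the time of the announcement.
# They may have changed since; check the PyScript docs before relying on them.
PYSCRIPT_CSS = "https://pyscript.net/alpha/pyscript.css"
PYSCRIPT_JS = "https://pyscript.net/alpha/pyscript.js"


def build_page(python_source: str) -> str:
    """Return an HTML page that embeds python_source in a <py-script> tag."""
    # The embedded Python must start at column 0, so strip common indentation.
    code = dedent(python_source).strip()
    return (
        "<!DOCTYPE html>\n"
        "<html>\n"
        "  <head>\n"
        f'    <link rel="stylesheet" href="{PYSCRIPT_CSS}" />\n'
        f'    <script defer src="{PYSCRIPT_JS}"></script>\n'
        "  </head>\n"
        "  <body>\n"
        "<py-script>\n"
        f"{code}\n"
        "</py-script>\n"
        "  </body>\n"
        "</html>\n"
    )


# Hypothetical snippet: ordinary Python, nothing PyScript-specific in it.
SNIPPET = """
    total = sum(n * n for n in range(10))
    print(f"Sum of squares: {total}")
"""

page = build_page(SNIPPET)
print(page)
```

Saving the printed output as an .html file and opening it in a browser is all it should take to try it, assuming the asset URLs still resolve.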

Will Python become a Javascript replacement for all data people with frontend needs? No. No. And no.

Is it a good idea? To be honest I don't know. Technically I like it a lot, it shows how far we can go today. It isn't that surprising, though: we have been able to run Doom in the browser for years. So yeah, running Python should be business as usual... yeah? Actually, no.

When you take a deeper look, it is not as trivial as it seems. Python was not designed to be asynchronous, but code running in the browser has to be. It will also require browsers to make significant changes and to align on APIs. Right now in Firefox I get a lot of out-of-memory errors with the examples.
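To illustrate the asynchronous point: page code in a browser shares one thread, so anything that blocks freezes the page, and code has to hand control back regularly. Here is a rough sketch of that cooperative style with the standard asyncio module. It runs in regular CPython and is not PyScript-specific; ticker and main are hypothetical names.

```python
import asyncio


async def ticker(name: str, delay: float, count: int, log: list[str]) -> None:
    """Append `count` entries to log, yielding to the event loop in between."""
    for i in range(count):
        # `await` hands control back to the loop so other tasks can run.
        # A blocking time.sleep() here would stall every other task instead.
        await asyncio.sleep(delay)
        log.append(f"{name}-{i}")


async def main() -> list[str]:
    log: list[str] = []
    # Both tickers share one thread and interleave cooperatively.
    await asyncio.gather(
        ticker("fast", 0.02, 4, log),
        ticker("slow", 0.05, 2, log),
    )
    return log


events = asyncio.run(main())
print(events)
```

The entries of the two tickers end up interleaved in the log even though nothing runs in parallel, which is the behaviour a browser event loop expects from the code it hosts.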

Finally, if we think about the use cases behind PyScript, it is hard to imagine something that goes further than "templating". I understand that we have a real REPL in the browser, but when we look at the examples, each time we want a bit of interactivity we fall back to Javascript (with the panel library) or to a JS binding (with D3 for instance). Which is not so far from what we already do in Jinja. If I'm being totally honest, you can also do animation with only Python, but it would require recoding everything we already have in Javascript, with many other concerns (e.g. performance, available HTML APIs).

My last word: I think it'll be a cool alternative for building web UIs faster in Python, but with a lot of limitations.

xkcd Python meme with PyScript in the Browser (source code)

MLOps in 10 Minutes

If you are interested in learning more about machine learning operations (a.k.a. MLOps), this starter is for you. The DataTalksClub community led by Alexey started a free MLOps Zoomcamp on GitHub, and Alexey wrote a 10-minute post about MLOps. It explains well the main processes to consider when putting models in production: design, train, operate.

Related to this, Databricks, in the middle of an announcement spree, publicly released their feature store.

Choosing a Data Catalog

You are used to Sarah's posts in the newsletter. This week she wrote about data catalogs and what to consider when choosing one. She puts catalogs in three buckets: all-in-one, no-code with integrations, and customizable with code. Even though I agree with this split, I would say that every bucket we try to create will always end up comparing old companies, startups and open-source products.

Still, one thing to do before choosing a data catalog: find the root reason behind the need for a catalog, and you will have a lead on the solution you need. But remember that a data catalog is not a miracle that will solve all your data problems; it is only a technology layer. Without a community, practices and leadership around it, this technology layer will probably fail. I recommend truly dedicating people to animating and moderating it.

Hard to choose a data catalog (credits)

How to add value as a Data Analyst

Now it's time for Analytics Engineers. Cassie, Chief Decision Scientist at Google, wrote an excellent article about the journey to become a "real" data analyst. She debunks 3 misconceptions we could have about the analytics discipline. To help you get started in the field, you can also read the dbt analytics engineering glossary: it is fairly simple, and it can help with standardization in data teams.

On a more technical note, you can now use the community-developed package that helps you interact with the dbt Cloud API from the CLI. You can also get inspiration from Whatnot's data platform choices regarding dbt structure: they use marts with a "centrally-managed domain-owned" approach.

You can also have a look at the dbt Snowflake package to put physical constraints on your warehouse (Snowflake or Postgres supported).

Kimball x OBT

I may have been ignorant for the last 4 years, but I just discovered the term OBT. OBT means "One Big Table", and it was actually heavily promoted by the Google team for modeling in BigQuery, as opposed to traditional Kimball dimensional modeling. The matter was heavily debated on Reddit, and Fivetran wrote a performance comparison overview 2 years ago.
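If, like me, you had not met the term before, here is a small sketch of the difference, using Python's built-in sqlite3 as a stand-in for a real warehouse. Every table name, column name and value is made up for illustration. It says nothing about performance; it only shows the two shapes answering the same question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Kimball style: a fact table pointing at dimension tables.
    CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, country TEXT);
    CREATE TABLE dim_product  (product_id  INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE fact_orders  (order_id INTEGER PRIMARY KEY,
                               customer_id INTEGER,
                               product_id INTEGER,
                               amount REAL);

    INSERT INTO dim_customer VALUES (1, 'FR'), (2, 'US');
    INSERT INTO dim_product  VALUES (10, 'books'), (20, 'games');
    INSERT INTO fact_orders  VALUES (100, 1, 10, 12.0),
                                    (101, 1, 20, 30.0),
                                    (102, 2, 20, 25.0);

    -- OBT style: the same data joined once, attributes repeated on every row.
    CREATE TABLE obt_orders AS
    SELECT f.order_id, c.country, p.category, f.amount
    FROM fact_orders f
    JOIN dim_customer c USING (customer_id)
    JOIN dim_product  p USING (product_id);
""")

# Revenue per country on the star schema: a join at query time.
STAR_QUERY = """
    SELECT c.country, SUM(f.amount)
    FROM fact_orders f
    JOIN dim_customer c USING (customer_id)
    GROUP BY c.country
    ORDER BY c.country
"""

# Same question on the one big table: no join needed.
OBT_QUERY = """
    SELECT country, SUM(amount)
    FROM obt_orders
    GROUP BY country
    ORDER BY country
"""

star_result = conn.execute(STAR_QUERY).fetchall()
obt_result = conn.execute(OBT_QUERY).fetchall()
print(star_result)
print(obt_result)
```

Both queries return the same totals. The trade-off being debated is whether paying for the join at query time (Kimball) beats paying for repeated attributes in storage and refreshes (OBT).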

Workflow orchestration vs. data orchestration

Anna from the Prefect team tried to explain the differences between workflow and data orchestration. The "data orchestration" term has been promoted by Dagster¹. So it feels a bit like a war of words, but the article is still interesting because it explains the concepts in simple words. Also, confessionsofadataguy did a light Prefect review. I disagree a bit with him, but my own review has been in draft for months.

Fast News ⚡️


¹ You can be the leader of your category if you invent the category².
² This is the first time I have added footnotes, so if you read this 👋.


Christophe Blefari

Staff Data Engineer. I like 🚲, 🪴 and 🎮. I can do everything with data, just ask.
