Data News — Week 22.33

Data News #22.33 — Omni new BI tool, data job market, the best data team, Snowflake stuff, data ingestion best practices, RL at Netflix and fast news.

Christophe Blefari
A nightmare for people on holiday (credits)

Hello dear members. For those of you on holiday, I hope you are enjoying it as much as you can. For the others, I feel you, and here is the usual Data News to keep you up to date on this Friday afternoon.

Let's start a conversation this week. What is your biggest problem right now?

Mine is Superset virtual datasets. I'm working on a Superset project, and the textarea for writing the SQL queries that build the data model is so small that it makes me unproductive. Apart from this, I'm surprised by how much Superset can deliver.

Data fundraising 💰

  • Omni raised $17.5m Series A, following a seed funding in April. The field of "next generation" BI tools is starting to get crowded. Omni is trying to take a fresh look at it, with founders coming from Looker (VPs Product) and Stitch/Talend (CTO). With such a line-up it looks promising. Omni wants to combine a neat BI platform with auto-generated data models built out of one-off queries. I feel Omni is going to compete with Lightdash, the open-source BI tool built on dbt.

Data job market status

It's well known that the data job market is under a lot of pressure. All companies have open engineering and analytics positions. Last week at the Data Analytics Careers Summit, Dustin revealed data about the data analytics market. This is super interesting. We can see that demand for SQL grew by 27pts since 2020, while Tableau, PowerBI and Excel are each mentioned in a third of job postings.

On the same topic, someone analysed all the jobs posted in the dbt Slack community (~3k) and produced some statistics about them. We can clearly see that analytics engineering has picked up while data engineering demand stayed the same.

The best data team

There are many ways to create great data teams, and this is the kind of article I'm really into. What if we could create a partnership between data and the other teams, enabling data to be more than a support role? This is the Data Business Partnership. The post provides great guidelines for trying this implementation.

We often say that it is a bad idea to look for data unicorns. But what if we looked for data heroes instead? Mikkel wrote another great post about data teams. So how can you find, activate and retain data heroes? You know, those people who add that little something extra to your team.

Find your data heroes (credits)

It's summer, so let's speak about snowflakes

What a boring category title but as I got 3 articles about Snowflake I wanted to group them here.

Firstly, you can try machine learning with Snowflake using the recently released Snowpark. Then it's time to master the query profiler: thanks to Teej you'll be able to get started with reading query graphs. And now that you have had fun playing with ML and queries, you should have a look at your Snowflake bills.

7 best practices for data ingestion

Saikat wrote a short wrap-up of the basic best practices everyone should follow when writing data ingestion pipelines. Guess which one is my favourite.

On the same topic, Matt wrote about data backfilling. To me, backfilling is one topic that really shows the difference between a data engineer and a great data engineer, probably because backfilling requires experience and patience. It's easy to run a pipeline, but when your pipeline has to recompute or reingest data from the last 4 years, the stress it puts on the system is heavy.

I also have to disagree with Matt's post on how to handle backfilling. In the past I made the mistake of creating dedicated backfilling pipelines, and I now think this is a bad idea. If pipelines are idempotent and deterministic you don't need a separate branch for backfills; at least, this is the Airflow way to do it.
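
To make the idea concrete, here is a minimal sketch in plain Python (not Airflow code). Everything in it is hypothetical and only for illustration: the in-memory `warehouse`, the `extract`, `run_pipeline` and `backfill` names, and the fake data. The assumption is that each run depends only on its logical date and overwrites that date's partition, so a backfill is nothing more than the regular pipeline replayed over a date range.

```python
from datetime import date, timedelta

# Hypothetical in-memory "warehouse": one partition per logical date.
warehouse: dict[date, list[dict]] = {}


def extract(day: date) -> list[dict]:
    """Deterministic stand-in for a source query filtered on the logical date.

    It never looks at the current time, so the same day always
    produces the same rows.
    """
    return [{"day": day.isoformat(), "value": day.toordinal() % 7}]


def run_pipeline(day: date) -> None:
    """Regular daily run: overwrite the whole partition for `day`.

    Overwriting (instead of appending) is what makes the run idempotent:
    running it twice leaves the same state as running it once.
    """
    warehouse[day] = extract(day)


def backfill(start: date, end: date) -> None:
    """A backfill is just the regular pipeline replayed over past dates."""
    day = start
    while day <= end:
        run_pipeline(day)
        day += timedelta(days=1)
```

Calling `backfill(date(2022, 8, 1), date(2022, 8, 7))` twice leaves `warehouse` in exactly the same state as calling it once, which is why no dedicated backfilling pipeline is needed in this setup. In an orchestrator, the role of `backfill` would be played by re-running past runs of the same pipeline.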

Cool findings 🔎

  • I discovered the StackExchange data explorer (via Vlad Mihalcea) — Yeah, I might be late to the party. This is a place where you can write SQL to query SE data and get insights.
  • command-not-found.com, a website where you can search every OS package and the website gives you the installation command for each distribution.
  • dbt-jsonschema — A way to validate your dbt YAML within VSCode.

ML Friday — Reinforcement Learning at Netflix

Reinforcement learning is a bit mystical to me. Every year when I give classes I try to give examples, and this one from Netflix is quite interesting. They picked RL models to find optimal recommendations under the constraint of our most limited resource: our time.

More ML: 4 essential steps for building a simulator.

Don't let the copilot take the wheel (credits)

Fast News ⚡️


See you next week ❤️


Christophe Blefari

Staff Data Engineer. I like 🚲, 🪴 and 🎮. I can do everything with data, just ask.
