Data News — Week 45

Data News #45 — Headless BI raising, Data Analyst role is hard, ship ml models in days, databases explained.

Christophe Blefari
4 min read
Just a regular Friday for data people on duty — cleaning the pipes (credits)

Hey there, people working this Friday, I'm here to bring you some sun while others are off. This edition will surely give you food for thought.

Data fundraising 💰

  • The term "headless BI" is now out: Supergrain raised $6.8m in seed funding to transform how BI software works. It plugs on top of your warehouse with a three-layer app: YAML metrics definitions → web catalog → queryable API to get your metrics.
  • Datafold, a data reliability-monitoring-quality-observability platform, raised $20m in Series A. The platform already includes Data Diff, their product for regression testing on ETL, along with column-level lineage, scheduled SQL alerts and finally a data catalog (something that is not a data quality tool).
  • Collibra, another dinosaur that ends in "a" (founded in 2008), raised $250m in Series G. Collibra is a data platform that aims to help companies find, understand and access data. They call it a data intelligence cloud.
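
To make the three-layer idea from the headless BI item concrete, here is a minimal Python sketch. It is my own illustration, not Supergrain's product or API: the `METRICS` dict stands in for parsed YAML definitions, `catalog()` for the web catalog, `query_metric()` for the queryable API, and an in-memory SQLite database stands in for the warehouse. All names and the `orders` table are hypothetical.

```python
import sqlite3

# Hypothetical metric definitions. In a headless BI tool these would
# typically be written in YAML; a plain dict stands in for the parsed file.
METRICS = {
    "revenue": {
        "description": "Sum of order amounts",
        "table": "orders",
        "expression": "SUM(amount)",
        "dimensions": ["country"],
    },
    "order_count": {
        "description": "Number of orders",
        "table": "orders",
        "expression": "COUNT(*)",
        "dimensions": ["country"],
    },
}


def catalog():
    """Catalog layer: list the available metrics with their descriptions."""
    return {name: metric["description"] for name, metric in METRICS.items()}


def compile_metric(name, dimension=None):
    """Turn a metric definition into a SQL query, optionally grouped by one dimension."""
    metric = METRICS[name]
    if dimension is None:
        return f'SELECT {metric["expression"]} AS {name} FROM {metric["table"]}'
    if dimension not in metric["dimensions"]:
        raise ValueError(f"{dimension!r} is not a declared dimension of {name!r}")
    return (
        f'SELECT {dimension}, {metric["expression"]} AS {name} '
        f'FROM {metric["table"]} GROUP BY {dimension} ORDER BY {dimension}'
    )


def query_metric(conn, name, dimension=None):
    """API layer: run the compiled query against the warehouse connection."""
    return conn.execute(compile_metric(name, dimension)).fetchall()


def demo_warehouse():
    """Build a tiny in-memory SQLite database standing in for the warehouse."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (country TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [("FR", 10.0), ("FR", 5.0), ("DE", 7.5)],
    )
    return conn
```

Calling `query_metric(demo_warehouse(), "revenue", "country")` returns one row per country. The point of the design is that the metric is defined once and every consumer goes through the same definition.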

Why the Data Analyst role has never been harder

Even if we added "modern" before our data platform tools, we are probably still at the beginning and we are still artisans when it comes to debugging and monitoring. In his article Petr illustrates very well why maintaining a data model is really hard.

I totally agree with his wishlist and join forces with him to ask for the same. I hope that companies that are inspired by Datadog to build data monitoring will try to address these issues.

OLAP Cube, round 2

After the first explanation we saw in edition #37, here is round 2 of the OLAP cube explained. It started as a written discussion between Claire and Cedric. Claire argued that an OLAP cube is nothing more than a table with some conventions; Cedric, for his part, looked back at 30 years of history to find the different definitions that have been used.
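
One way to read the "table with some conventions" view is sketched below in Python. This is my own illustration, not code from either article: the cube is a plain mapping from dimension values to an aggregated measure, and the only convention is an `ALL` marker for a dimension that has been rolled up. The `sales` rows are made-up data.

```python
from collections import defaultdict
from itertools import combinations

ALL = "ALL"  # convention: marks a dimension that has been rolled up


def build_cube(rows, dimensions, measure):
    """Sum `measure` for every combination of kept / rolled-up dimensions."""
    cube = defaultdict(float)
    for size in range(len(dimensions) + 1):
        for kept in combinations(dimensions, size):
            for row in rows:
                # Keep the value of each retained dimension, replace the others by ALL.
                key = tuple(row[d] if d in kept else ALL for d in dimensions)
                cube[key] += row[measure]
    return dict(cube)


# Made-up fact rows for illustration only.
sales = [
    {"country": "FR", "year": 2020, "amount": 10.0},
    {"country": "FR", "year": 2021, "amount": 5.0},
    {"country": "DE", "year": 2021, "amount": 7.5},
]

cube = build_cube(sales, ["country", "year"], "amount")
```

Here `cube[("FR", ALL)]` is the total for France across years and `cube[(ALL, ALL)]` is the grand total. Slicing the cube is then just a lookup in that table.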

In the end, OLAP history is a cool story to tell, but is it more than that at a time when the metrics layer is coming back stronger? Is it still a mandatory skill for analysts and engineers? I don't think so, but it is still somewhat important.

OLAP cube but triangle (credits)

How to ship a machine learning model in days not months?

We all know the blabla about data science jobs being 80% data cleaning vs. 20% delivering insights, and about models that mostly never (87%) make it into production. It's often due to a lack of skills, missing processes or silos. We then created the MLOps concept, but there are still issues because the gap with software engineers is not filled.

On the other side, the Doctolib team shows with flair how they are able to ship models in days and not in months. This is a combination of reusable components, to avoid reinventing the wheel each time, and collaboration with developers (software engineers). Having tried it in the past, I'd say that both points are equally important, and you should also include product collaboration.

Feature store to unlock data superpowers

This is related to the previous topic: to ship models faster, you need data storage that is ready for it. If you want to copy what bigger companies like Uber did, a feature store may be a solution. João Santiago explained how Billie developed a feature store on AWS with Redis, Snowflake and Lambda functions. I like this pragmatic approach that gives wings to Snowflake.
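
For readers new to the concept, here is a minimal Python sketch of the general offline/online feature store pattern. It is not Billie's implementation: a list of rows stands in for the warehouse table, a dict stands in for the key-value store, and `materialize` plays the role a scheduled function could play. The key format and field names are assumptions of mine.

```python
def materialize(offline_rows, online_store):
    """Copy the most recent value of each (entity, feature) pair into the online store."""
    latest = {}
    for row in offline_rows:
        key = (row["entity_id"], row["feature"])
        # Keep only the row with the highest computed_at for each key.
        if key not in latest or row["computed_at"] > latest[key]["computed_at"]:
            latest[key] = row
    for (entity_id, feature), row in latest.items():
        online_store[f"{entity_id}:{feature}"] = row["value"]


def get_features(online_store, entity_id, features):
    """Low-latency lookup at prediction time; missing features come back as None."""
    return {f: online_store.get(f"{entity_id}:{f}") for f in features}


# Made-up feature history, as it could sit in a warehouse table.
offline_rows = [
    {"entity_id": "cust_1", "feature": "orders_30d", "value": 3, "computed_at": "2021-11-01"},
    {"entity_id": "cust_1", "feature": "orders_30d", "value": 5, "computed_at": "2021-11-08"},
    {"entity_id": "cust_2", "feature": "orders_30d", "value": 1, "computed_at": "2021-11-08"},
]

online_store = {}  # stands in for a key-value store
materialize(offline_rows, online_store)
```

The split matters because training reads the full history from the offline side, while serving only needs the latest value per entity, fetched by key.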

On the other hand, if you are still building a data lake in 2021 (that's nothing to be ashamed of), here is Meltwater's journey explained: from database to data lake.

Should data engineers fear low-code tools?

Zach Wilson tries to explain why, as data engineers, we shouldn't really fear low-code tools. Even if each day there are more and more tools that aim to simplify everything, data engineering is not only about tools. This article only scratches the surface.

And remember, even low-code tools need skilled people who understand what's going on.



Database replication explained + Postgres unknown features

All data stacks today have used or still use a database replica to read the product data. Here is Part 1 of database replication explained, focusing on single-leader replication. Understanding replication can help you better understand your data stack.
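
As a rough mental model of single-leader replication, here is a toy Python sketch. It is my own simplification, not code from the linked post: the `Leader` and `Follower` classes are hypothetical, everything runs in one process, and replication is triggered by hand to make the lag visible.

```python
class Follower:
    """Read-only replica that replays the leader's log in order."""

    def __init__(self):
        self.data = {}
        self.applied = 0  # number of log entries already applied

    def apply(self, log):
        # Replay only the entries this follower has not seen yet.
        for key, value in log[self.applied:]:
            self.data[key] = value
        self.applied = len(log)

    def read(self, key):
        return self.data.get(key)


class Leader:
    """The only node that accepts writes; it records them in an ordered log."""

    def __init__(self, followers):
        self.data = {}
        self.log = []
        self.followers = followers

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))

    def replicate(self):
        # Asynchronous in real systems; called explicitly here.
        for follower in self.followers:
            follower.apply(self.log)


replica = Follower()
leader = Leader([replica])
leader.write("user:1", "Ada")
stale_read = replica.read("user:1")  # the follower has not caught up yet
leader.replicate()
fresh_read = replica.read("user:1")  # now the follower sees the write
```

The gap between `stale_read` and `fresh_read` is the replication lag you experience when your data stack reads from a replica instead of the primary database.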

To go further, I suggest this awesome post exploring lesser-known PostgreSQL features. Features you already have but may not know about! For instance, I bet you didn't know that dollar quoting is fun.

Database replication — leader and followers (credits)

Git branching: Best practices for BI and Data Teams

The Beat team evaluates Git branching practices, mainly Gitflow, and gives feedback on them. I think this is a good article for entry-level data practitioners in need of Git concepts applied to data.

Releases 👻

Fast News ⚡

  • Prefect Cloud is doubling the free tier: it's an orchestrator aiming to answer Airflow's flaws. The free tier is now at 20k successful runs per month. That's huge but a bit misleading, because you'll still have to pay for your workers (where the code runs); Prefect is only a task orchestrator in that case.
  • cube.js: your API layer between warehouse and frontend app. I just discovered cube.js (11k stars on GitHub) and I can't wait to try it, because it bridges the gap between serverless warehouses and JavaScript frontends.
  • How Uber migrated financial data from DynamoDB to Docstore
  • HEY R USERS I HAVE SOMETHING FOR YOU: I found this article describing how you can query Parquet files in SQL from R, and because I liked it, here you are.

Christophe Blefari

Data Engineering Coach who enjoys all kinds of data platforms.