
Data News — Week 51

Data News #51 — Happy Holidays, Airbyte raised, ROI of data work, structure of data, data versioning.

Christophe Blefari
I want free open-source data tools (credits)

What a year! It's almost the last week of the year, and I imagine a lot of people are on holiday. So today's Data News will be a short one, and next week will be a retrospective post about what we achieved in 2021.

Data fundraising 💰

  • Airbyte raised $150m in Series B. It's an open-source extract-load platform with a Cloud version, aiming to compete with leaders like Fivetran or Stitch. The money will mainly be used to increase hiring, but also to launch other products like real-time data ingestion or reverse-ETL.
  • brytlyt raised another $5m in an extension of its Series A to launch its data analytics and visualisation platform. They leverage PostgreSQL with GPUs in order to create analytics platforms that "scale".

How to think about the ROI of data work

Once again the Monzo data team offers us an awesome data article. This time it's about measuring the ROI of data work. This is probably a question every data team has: how can we prove to the C-level that data investments are profitable?

In the article, Mikkel shows a new way to talk about ROI, and he also brings nice visuals to explain all the concepts. To be honest, this is a must-read.

How should organizations structure their data

Every once in a while we get data modeling articles, and Kimball concepts come back to the denormalised world that Hive, BigQuery and Snowflake brought years ago. Michael compares the Kimball, Inmon and Data Vault structures to help you get started.

Personally I'm a rather pragmatic person, so the simpler structure is, to me, often the better one.

Kimball, maybe the last train (credits)

Improving Data Quality with Data Contracts

Sometimes we expect (or wait for) a magic product to solve all our Data Quality issues. But, spoiler, it may not solve everything. You will probably need to define schemas (Data Contracts) on your data and enforce them. The team at GoCardless added a schema validation layer to their CDC architecture to improve data quality. If you are into this, go check it out.
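To make the idea concrete, here is a minimal sketch of what enforcing a contract on a record could look like in plain Python. This is not the GoCardless implementation; the contract, the field names and the `validate` helper are all hypothetical and only illustrate the principle of checking data against a declared schema before letting it through.

```python
from typing import Any, Dict, List

# Hypothetical contract: field name -> expected Python type.
PAYMENT_CONTRACT: Dict[str, type] = {
    "payment_id": str,
    "amount_cents": int,
    "currency": str,
}


def validate(record: Dict[str, Any], contract: Dict[str, type]) -> List[str]:
    """Return a list of contract violations (empty if the record conforms)."""
    errors: List[str] = []
    # Every field declared in the contract must be present with the right type.
    for field, expected in contract.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            actual = type(record[field]).__name__
            errors.append(f"{field}: expected {expected.__name__}, got {actual}")
    # Fields the contract does not know about are reported too.
    for field in sorted(record.keys() - contract.keys()):
        errors.append(f"unexpected field: {field}")
    return errors


good = {"payment_id": "p1", "amount_cents": 100, "currency": "EUR"}
bad = {"payment_id": "p1", "amount_cents": "100"}

print(validate(good, PAYMENT_CONTRACT))
print(validate(bad, PAYMENT_CONTRACT))
```

In a real pipeline you would probably rely on a schema format and a dedicated validation library rather than hand-written type checks, and decide what happens to rejected records (dead-letter queue, alert, etc.).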

Deploying Airflow 2 on EKS using Terraform, Helm and ArgoCD

This is a huge two-part tutorial. Vitor explains how you can deploy Airflow 2 on AWS using ArgoCD, Helm and Terraform (part 1 & part 2). Obviously this is one way to deploy Airflow, but not the only one. When we look at the numbers, more and more companies are now deploying Airflow on top of Kubernetes.

In the tutorial you will find Terraform files and also how to configure Argo to make Airflow work. If you are new to these technologies, it'll give you an overview.

The guide to data versioning

If you want to understand how data versioning works, the LakeFS team wrote an article detailing the 3 most common versioning practices.


Thank you all for the support over this year. This week I have been hit by Covid, so this edition is shorter than usual, but I still wish you Happy Holidays. See you next week for the last one of the year.

Stay safe.


Christophe Blefari

I do Data Engineering in Python.
