Data News — Week 7

Data News #7: Apache Arrow fundraising, unbundling of Airflow, data people as co-founders, SQL as a library language?

Christophe Blefari

Hello dear datanewsmates, it's Saturday and as usual I'm a bit late. Let's 👋 our metamates friends. This week we will unbundle things.

Data fundraising 💰

  • Voltron Data raised $100m in seed and Series A funding to build a company around Apache Arrow. Arrow, co-created by pandas creator Wes McKinney, is an awesome in-memory columnar format. Its strength lies in its agnostic design: it is cross-language and works on both CPU and GPU. It can be paired with on-disk columnar formats such as Parquet or ORC.
    As of today, a lot of big tech companies use Arrow under the hood for connectors or processing. Once Arrow is widely adopted, performance could reach new standards.
  • Promethium raised $26m in Series A to build their all-in-one data platform. Once your warehouse is connected, you can explore your data, then write pipelines and queries to build visualizations on top of it.

Follow-up on EU to US data transfers

I discovered this week that the European Commission monitors data flows to the cloud. The tool mixes actual sector data with forecasts until 2030. On Twitter, someone used it to compute the difference between inflow and outflow. The main takeaway is that Germany and Ireland have a positive balance. In Germany, Frankfurt is probably one of the biggest EU cities in terms of data centers and is attractive for the financial market. Dublin, in Ireland, hosts the European headquarters of many US companies.

On the tracking side, Google sees the light and follows Apple's lead: they announced plans to adopt new privacy settings that limit tracking across apps.

The unbundling of Airflow (and others)

Two weeks ago I shared the fal.ai initiative, which I liked, about developing tools around dbt to improve the Modern Data Stack experience. This week they wrote a nice article that depicts the evolution of the data ecosystem as an unbundling of Airflow. In summary: Airflow was good but covers too many use cases, so let's break it into small pieces.

To be honest, I mainly agree with the article, but it saddens me. I'm still convinced that for most data teams a central platform is better than a fragmented one. Today, with the proliferation of SaaS tools, it feels more like this: Airflow was doing great, but let's break up this free, perfect tool to milk money out of startups with an accumulation of small pricing plans.

If we still consider data engineering as software engineering for data, I don't think we are going in the right direction. Even the article's conclusion sounds weird:

A diverse set of tools is unbundling Airflow and this diversity is causing substantial fragmentation in modern data stack. Like everyone else, I also predict some consolidation of these tools in the coming years. I believe dbt Cloud is the best positioned place for this consolidation to happen.

Does that mean we are going to write "The unbundling of dbt" in the coming years? I really like dbt, but I don't want my dbt project (or even worse, my dbt Cloud project) to be aware of all my data complexity outside of the transformation layer.

On the same topic, surfing the wave, Joseph wrote the unbundling of the BI dashboard. It will always amaze me how Tableau is absent from every Modern Data Stack post but still present everywhere in the real world.


How can data people become technical co-founders?

David is researching how data people can become technical co-founders. As more and more data people become polyglot (fluent in multiple languages) and purple (navigating between business and data), we should see more data co-founders in the future.

My answer is: it depends. Doing data today means too many different things. It can range from Spark development with good practices to mixing SaaS products to create an MDS. I had a colleague who was frustrated because data engineering is sometimes more Lego than engineering (cf. the last part as well). So yeah, I think folks from the data field could become co-founders, but the step is high.

What Substack Analytics Engineers must be thinking

Last week the team at Substack, a blogging platform, discovered a "bug" in their view counts: they were double-counting email opens. This is funny because one year ago they also discovered a bug in the subscriber count.

Sarah tried to analyze the situation: how it happened and why it's hard to define metrics, especially in the web analytics space. I really liked the conclusion:

Analytics teams should always strive to create dashboards that are either standalone or include links to provide the relevant context. Context is also important to get alignment on expectations. Curiosity saves the day when it prevents business users from misinterpreting results or analytics teams from misreporting them.
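
To make the double-counting problem concrete, here is a minimal Python sketch. It is not Substack's actual pipeline (the source does not describe it); the event shape, names and values are all hypothetical. It only shows how the same events give two different "opens" numbers depending on whether repeats are counted.

```python
from collections import Counter

# Hypothetical open events as (email_id, subscriber_id) pairs.
# A subscriber opening the same email twice produces two events.
open_events = [
    ("post-1", "alice"),
    ("post-1", "alice"),  # second open of the same email
    ("post-1", "bob"),
    ("post-2", "alice"),
]


def total_opens(events):
    """Count every open event per email, repeats included."""
    return Counter(email_id for email_id, _ in events)


def unique_opens(events):
    """Count each (email, subscriber) pair only once per email."""
    return Counter(email_id for email_id, _ in set(events))


raw = total_opens(open_events)
deduplicated = unique_opens(open_events)

for email_id in sorted(raw):
    print(email_id, "total:", raw[email_id], "unique:", deduplicated[email_id])
```

Both numbers are legitimate metrics. The trouble starts when a dashboard shows one of them under a label that readers interpret as the other, which is where the context mentioned above comes in.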

Can SQL be a library language?

George Fraser, Fivetran CEO, shared his thoughts on treating SQL as a library language. He explores how database engines have added packages over the years to extend SQL capabilities, and how the dbt package hub brings modularity to light.

Migrating to BigQuery

This week two teams detailed how they migrated data to BigQuery. The first post is about a Back Market migration from Delta files read by Snowflake to Delta files in BigQuery. To do so they mainly used Google Data Transfer triggered by Airflow. I do not agree with everything in the article, but it gives a good overview of what's possible.

In the second post, Remya described another Snowflake to BigQuery migration handled by Airflow, but it could have been done way better (and they still use Airflow 1 😡).


Fast News ⚡️
