
Data News — Week 23

Data news #23 — Snowflake summit, Data Engineering Manifesto, please praise Data Engineers, MLOps security.

Christophe Blefari
3 min read
A worker fixing a pipeline the hard way. (credits)

This is a new week, with new news (say it loud 🤘). This week was marked by the Snowflake Summit and its many announcements, plus, once again, a lot of articles about the data engineering position, its skills and even a Manifesto. There is also a very detailed article about security concepts for MLOps.

✍️ This week I also co-wrote an article with the Modern Data Network that you can read here. It draws inspiration from 100 stacks to help you build your own.

Data fundraising 💰

Big funding rounds in and around the AI space this week.

  • Prefect.io Series B, Prefect raised $32M "to solidify [their] position as the new standard in dataflow automation". Prefect is an alternative to Airflow and the other tools in the orchestrator and scheduler space. They provide an open-source platform as well as a paid (task-run based) cloud version of it.
  • Eightfold.ai announced $220M in funding to improve its AI-driven recruiting platform. Is the AI label driving this fundraising?
  • Sources report that DataRobot seeks to raise $500M in a funding round. DataRobot is an AI platform for enterprises where you can manage the whole lifecycle of ML models, from creation to monitoring.

The Data Engineering Manifesto 📗

Two employees from the Belgian data agency Dataminded wrote a manifesto for the data engineering position. The sketch above was drawn from a talk about the topic. My two favorite principles are numbers 2 and 6!

The Data Engineering Manifesto (dataminded.be + connectingdots.xyz)

Are real-time pipelines worth it for analytics?

This is a question worth asking. Writing real-time pipelines is far more complex than writing batch pipelines, so are they really needed for analytics? Anna Geller tries to answer this tricky question.

❄️ Snowflake Summit announcements

At the Snowflake Summit they announced a suite of new features to unlock new use cases for Snowflake users. As a TL;DR, here is a glimpse of what they announced:

  • Java UDFs, now in private preview: you will be able to write Java and Scala (and later Python) user-defined functions to process the data in your warehouse. They call this experience Snowpark (see the sketch below this list)
  • Support of Unstructured Data (cf. this meme)
  • Schema detection for Parquet, Avro and ORC
  • You will be able to query Snowflake through REST API (bye JDBC clients 👋)
  • New governance capabilities; among them I expect a lot from the Classification feature that will automatically detect PII
  • Performance optimization (30% better compression) and an 8x duration improvement for "some" workloads
  • A Snowflake partner space

Some of the announcements just fill the gap with BigQuery and some are true innovations.
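To give an idea of what these Java UDFs could look like, here is a minimal sketch based on the syntax Snowflake has shown: you define the function in SQL and ship the Java handler inline. The clean_email function, the EmailCleaner class and the contacts table are made up for the example, and the exact syntax may still change while the feature is in private preview.

    -- Hypothetical Java UDF: normalize email addresses inside the warehouse.
    CREATE OR REPLACE FUNCTION clean_email(address VARCHAR)
    RETURNS VARCHAR
    LANGUAGE JAVA
    HANDLER = 'EmailCleaner.clean'
    AS
    $$
    class EmailCleaner {
        public static String clean(String address) {
            // Trim and lowercase so downstream joins match consistently.
            if (address == null) {
                return null;
            }
            return address.trim().toLowerCase();
        }
    }
    $$;

    -- Then call it like any other SQL function.
    SELECT clean_email(raw_email) FROM contacts;

The interesting part is that the compute happens inside Snowflake, next to the data, instead of in an external service.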

Rise of Customer Data Ecosystems

As a follow-up to the Snowflake partner space, a well-written article about marketing Customer Data Platforms and how they could be replaced by cloud data warehouses.

Meta-datalake or Metadata lake?

As we all know, data volume is exploding in every company, which means metadata volume is exploding too. How can we structure our metadata to power data discovery and lineage tools? We might be crazy enough to think about a metadata lake to store everything related to metadata.

Data engineering position

This week we also got a lot of articles about data engineering jobs. After the Manifesto, which describes our job well, we got an op-ed asking for Data Engineers to be better considered; on the other hand, people are asking to praise Data Engineers and to stop hiring data analysts.

These 3 articles show one thing: the building blocks of data platforms are hard to build and sometimes not rewarding. For everyone discovering this space, I would advise reading What I wish I knew before going into DE.

As a side note, this newsletter category is written from a data engineer's point of view. In real life, not everything is black or white; there are shades of gray too.

7 Layers of MLOps Security 🔒

Denys Linkov wrote a long article detailing 7 layers to secure your MLOps pipelines or applications. This is the first article here dealing with this security topic. Worth checking out the concepts to keep them in mind for your next ML pipelines.

Automated document processing at Alan

To finish the newsletter, an AI article about how Alan is doing OCR on health insurance documents to speed up reimbursements.


Christophe Blefari

Data Engineering Coach who enjoys all kinds of data platforms.