Skip to content

How to learn data engineering

How to learn data engineering in 2022? This article will help you understand everything related to data engineering.

Christophe Blefari
Christophe Blefari
6 min read — ·
Learn data engineering, all the references (credits)

This is a special edition of the Data News, it should have been the number 22.23. But right now I'm in holidays finishing a hiking week in Corsica 🥾. So I wrote this special edition about: how to learn data engineering in 2022.

The aim of this post is to create a repository of important links and concepts we should care about when we do data engineering. Obviously I'm full of bias, so if you feel I missed something do not hesitate to ping me with stuff to add. The idea is to create a living reference about Data Engineering.

📬 Subscribe to the excellent weekly newsletter 📬

A bit of context

It's important to take a step back and to understand from where the data engineering is coming from. Data engineering inherits from years of data practices in US big companies. Hadoop initially led the way with Big Data and distributed computing on-premise to finally land on Modern Data Stack — in the cloud — with a data warehouse at the center.

In order to understand today's data engineering I think that this is important to at least know Hadoop concepts and context and computer science basics.

Who are the data engineers?

Every company out there has his own definition for the data engineer role. In my opinion we can easily say a data engineer is a software engineer working with data. The idea behind is to solve data problem by building software. Obviously as data is different than "traditional product" — in term of users for instance — a data engineer uses other tools.

In order to define the data engineer profile here some resources defining data roles and borders.

What is data engineering

As I said it before data engineering is still a young discipline with many different definitions. Still, we can have a common ground when mixing software engineering, DevOps principles, Cloud — or on-prem — systems understanding and data literacy.

If you are new to data engineering you should start by reading the holy trinity from Maxime Beauchemin. He wrote some years ago 3 articles defining data engineering field.

There is a global consensus stating that you need to master a programming language (Python or Java based) and SQL in order to be self-sufficient.

Some concepts

When doing data engineering you can touch a lot of different concepts. Firstly, read the Data Engineering Manifesto, this is not something official in any kind but it greatly depicts all the concepts data engineers daily face.

Then here a list of global resources that can help you navigate through the field:

If we go a bit deeper, I think that every data engineer should have basis in:

  • data modeling — this is related to the way the data is stored is a data warehouse and the field has been cracked years ago by Kimball dimensional modeling and also Inmon model. But it recently got challenged because of "infinite" cloud power with OBT (one big table or flat) model. In order to complete your understanding of data modeling you should learn what's an OLAP cube. The cherry on the cake here is the Slowly Changing Dimensions — SCDs — concept.
  • formats — This is a huge part of data engineering. Picking the right format for your data storage. Wrong format often means bad querying performance and user-experience. In a nutshell you have: text based formats (CSV, JSON and raw stuff), columnar file formats (Parquet, ORC), memory format (Arrow), transport protocols and format (Protobuf, Thrift, gRPC, Avro), table formats (Hudi, Iceberg, Delta), database and vendor formats (Postgres, Snowflake, BigQuery, etc.). Here a small benchmark between some popular formats.
  • batch — Batch processing is at the core of data engineering. One of the major task is to move data from a source storage to a destination storage. In batch. On a regular schedule. Sometime with transformation. This is close to what we also call ETL or ELT. The main difference between both is the fact that your computation resides in your warehouse with SQL rather than outside with a programming language loading data in memory. In this category I recommend also to have a look at data ingestion (Airbyte, Fivetran, etc.), workflows (Airflow, Prefect, Dagster, etc.) and transformation (Spark, dbt, Pandas, etc.) tools.
  • stream — Stream processing can be seen as the evolution of the batch. This is not. It addresses different use-cases. This is often linked to real-time. Main technologies around stream are bus messages like Kafka and processing framework like Flink or Spark on top of the bus. Recently all-in-one cloud services appeared to simplify the real-time work. Understand Change Data Capture — CDC.
  • infrastructure — When you do data engineering this is important to master data infrastructure concepts. You'll be seen as the most technical person of a data team and you'll need to help regarding "low-level" stuff you team. You'll be also asked to put in place a data infrastructure. It means a data warehouse, a data lake or other concepts starting with data. My advice on this point is to learn from others. Read technical blogs, watch conferences and read 📘 Designing Data-Intensive Applications (even if it could be overkill).
  • new concepts — in today's data engineering a lot of new concepts enter the field every year like quality, lineage, metadata management, governance, privacy, sharing, etc.

Is it really modern? (credits)

The modern (and the future) data stack

Coming from Hadoop — also called the old data stack — people are now building modern data stacks. This is a new way to describe data platforms with a warehouse at the core where all the company data and KPIs sit. Below some key articles defining this new paradigm.

And now some articles I like that will help you get inspiration.


Once again if you feel I forgot something important do not hesitate to tell me. I'll add more and more stuff to this article in the future.

If you enjoyed this article please consider subscribing to my weekly newsletter about data where I demystify all these concepts. I help you save 5 hours per week.

datanews

Christophe Blefari

I do Data Engineering in Python.

Comments