This is a special edition of the Data News; it should have been number 22.23. But right now I'm on holiday, finishing a hiking week in Corsica 🥾. So I wrote this special edition about how to learn data engineering in 2022.
The aim of this post is to create a repository of important links and concepts we should care about when we do data engineering. Obviously I'm biased, so if you feel I missed something, do not hesitate to ping me with stuff to add. The idea is to create a living reference about data engineering.
A bit of context
It's important to take a step back and understand where data engineering comes from. Data engineering inherits from years of data practices at big US companies. Hadoop initially led the way with Big Data and on-premise distributed computing, before the field finally landed on the Modern Data Stack, in the cloud, with a data warehouse at the center.
To understand today's data engineering, I think it is important to know at least the Hadoop concepts and context, plus computer science basics.
- What is Hadoop? A quick overview of what everyone used for years (and some of us still do). It's important to understand the distributed computing concepts: MapReduce, Hadoop distributions, data locality, HDFS.
- Data & Data Engineering — the past, present, and future: a good overview of data engineering history.
- This one is a gitbook with a lot of content, but I specifically recommend reading the introduction to data engineering.
- To become a great data engineer you'll also need to understand computer science. How do computers work? It also helps to understand how the web works: frontend and backend, deployment, etc. This is oversimplified, but I did not find a simple resource on this topic, so if you have something, I'm interested.
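To make the MapReduce idea mentioned above a bit more concrete, here is a minimal single-process sketch in plain Python. It is not Hadoop's API: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) and the sample input are made up for illustration, and a real cluster would run the map and reduce steps in parallel on the machines that hold the data.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Emit a (word, 1) pair for every word of one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Group values by key, like the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse all the values of one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # Each document stands for a split that a separate worker would process.
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input: two "splits" of a larger dataset.
splits = ["data engineering is software engineering", "data is everywhere"]
counts = word_count(splits)
print(counts)
```

The point of the model is that `map_phase` and `reduce_phase` only see one split or one key at a time, which is what lets a framework distribute them across many machines.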
Who are the data engineers?
Every company out there has its own definition of the data engineer role. In my opinion we can easily say that a data engineer is a software engineer working with data. The idea is to solve data problems by building software. Obviously, as data is different from a "traditional product" (in terms of users, for instance), a data engineer uses other tools.
To define the data engineer profile, here are some resources describing data roles and their boundaries.
- Data Organization: why are there so many roles? And why it is important to understand them. This is one of the best syntheses about data roles. Furcy defines programming as the core skill for data engineers.
- To complete the picture, here are some missions and skills that are expected of data engineers. Warning: the article is from an online bootcamp, but it summarizes everything pretty well. You can also have a look at the gov.uk data engineer job card, which details the expectations for every seniority level.
- We don't need data scientists, we need data engineers: for years companies hired data scientists because the field was booming, then realized they needed data engineers to team up with the scientists. This post shows the data job market in numbers.
What is data engineering?
As I said before, data engineering is still a young discipline with many different definitions. Still, we can find common ground by mixing software engineering, DevOps principles, an understanding of cloud (or on-prem) systems, and data literacy.
If you are new to data engineering you should start by reading the holy trinity from Maxime Beauchemin. Some years ago he wrote 3 articles defining the data engineering field.
- The Rise of the Data Engineer
- The Downfall of the Data Engineer
- Functional Data Engineering — a modern paradigm for batch data processing
There is a broad consensus that you need to master a programming language (Python or a Java-based one) and SQL in order to be self-sufficient.
When doing data engineering you can touch a lot of different concepts. First, read the Data Engineering Manifesto. It is not official in any way, but it depicts very well all the concepts data engineers face daily.
Then here is a list of general resources that can help you navigate the field:
- The Data Engineer Roadmap: an image with advice and technology names to watch.
- Reddit r/dataengineering wiki: a place where some data engineering definitions are written down.
- This book, 📘 Data Pipelines Pocket Reference, defines everything related to data pipelines and how to treat data movement from source to target.
If we go a bit deeper, I think that every data engineer should have a grounding in:
- data modeling: this is about the way data is stored in a data warehouse. The field was cracked years ago by Kimball's dimensional modeling and also by the Inmon model. But it was recently challenged, because of "infinite" cloud power, by the OBT (one big table, or flat) model. To complete your understanding of data modeling you should learn what an OLAP cube is. The cherry on the cake here is the Slowly Changing Dimensions (SCDs) concept.
- formats: this is a huge part of data engineering, as you have to pick the right format for your data storage. The wrong format often means bad query performance and a bad user experience. In a nutshell you have: text-based formats (CSV, JSON and raw stuff), columnar file formats (Parquet, ORC), in-memory formats (Arrow), transport protocols and formats (Protobuf, Thrift, gRPC, Avro), table formats (Hudi, Iceberg, Delta), database and vendor formats (Postgres, Snowflake, BigQuery, etc.). Here is a small benchmark between some popular formats.
- batch: batch processing is at the core of data engineering. One of the major tasks is to move data from a source storage to a destination storage. In batch. On a regular schedule. Sometimes with transformation. This is close to what we also call ETL or ELT. The main difference between the two is that with ELT your computation runs inside your warehouse with SQL, whereas with ETL it runs outside, with a programming language loading data in memory. In this category I also recommend having a look at data ingestion (Airbyte, Fivetran, etc.), workflow (Airflow, Prefect, Dagster, etc.) and transformation (Spark, dbt, Pandas, etc.) tools.
- stream: stream processing can be seen as the evolution of batch. It is not. It addresses different use cases, and it is often linked to real-time. The main technologies around streaming are message buses like Kafka and processing frameworks like Flink or Spark on top of the bus. Recently, all-in-one cloud services have appeared to simplify real-time work. You should also understand Change Data Capture (CDC).
- infrastructure: when you do data engineering it is important to master data infrastructure concepts. You'll be seen as the most technical person in a data team and you'll need to help your team with "low-level" stuff. You'll also be asked to put a data infrastructure in place. That means a data warehouse, a data lake or other concepts starting with data. My advice on this point is to learn from others. Read technical blogs, watch conferences and read 📘 Designing Data-Intensive Applications (even if it could be overkill).
- new concepts: in today's data engineering a lot of new concepts enter the field every year, like quality, lineage, metadata management, governance, privacy, sharing, etc.
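To illustrate the ETL versus ELT difference from the batch item above, here is a small sketch, assuming an in-memory SQLite database as a stand-in for a real warehouse. The table names (`raw_orders`, `revenue_by_country`), the sample rows and the `etl` / `elt` functions are all hypothetical; a real pipeline would use an ingestion tool and a warehouse like the ones listed above.

```python
import sqlite3

# Hypothetical source rows: (order_id, country, amount_cents).
raw_orders = [
    ("o1", "fr", 1250),
    ("o2", "fr", 800),
    ("o3", "us", 4300),
]


def etl(rows):
    """ETL: transform in Python memory, then load only the result."""
    totals = {}
    for _order_id, country, amount_cents in rows:
        totals[country] = totals.get(country, 0) + amount_cents
    transformed = [(country, cents / 100) for country, cents in totals.items()]

    warehouse = sqlite3.connect(":memory:")
    warehouse.execute(
        "CREATE TABLE revenue_by_country (country TEXT, revenue REAL)"
    )
    warehouse.executemany(
        "INSERT INTO revenue_by_country VALUES (?, ?)", transformed
    )
    return warehouse


def elt(rows):
    """ELT: load the raw rows first, then transform with SQL in the warehouse."""
    warehouse = sqlite3.connect(":memory:")
    warehouse.execute(
        "CREATE TABLE raw_orders "
        "(order_id TEXT, country TEXT, amount_cents INTEGER)"
    )
    warehouse.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", rows)
    warehouse.execute(
        """
        CREATE TABLE revenue_by_country AS
        SELECT country, SUM(amount_cents) / 100.0 AS revenue
        FROM raw_orders
        GROUP BY country
        """
    )
    return warehouse


query = "SELECT country, revenue FROM revenue_by_country ORDER BY country"
etl_result = etl(raw_orders).execute(query).fetchall()
elt_result = elt(raw_orders).execute(query).fetchall()
print(etl_result, elt_result)
```

Both paths end with the same `revenue_by_country` table. The difference is where the aggregation runs, and that with ELT the raw rows stay available in the warehouse for other transformations later.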
The modern (and the future) data stack
Coming from Hadoop (also called the old data stack), people are now building modern data stacks. This is a new way to describe data platforms with a warehouse at the core, where all the company data and KPIs sit. Below are some key articles defining this new paradigm.
- 👍 The Modern Data Stack: Past, Present, and Future
- Emerging Architectures for Modern Data Infrastructure
- The New Generation Data Lake
- Bootstrap a Modern Data Stack in 5 minutes with Terraform
- Modern Data Stack as a Service
- Storm in the stratosphere: how the cloud will be reshuffled
- ❤️ A path towards a data platform that aligns data, value, and people
And now some articles I like that will help you get inspiration.
- Gitlab Data Team Handbook: one of the best data resources. This is public documentation on how the Gitlab data team does things.
- Airbnb is great at sharing what they are doing in terms of data. For instance with these 2 articles: How Airbnb achieved metric consistency at scale & How Airbnb built “Wall” to prevent data bugs
- Data Engineering patterns are important — Dagster tried to introduce Software-Defined Assets and Prefect spoke about Positive and Negative engineering.
- Scaling data analytics with software engineering best practices
- Jesse Anderson: Creating a Data Engineering Culture and his book 📘 Data Engineering Teams
- What is MLOps? Some people wrote a white paper, Machine Learning Operations (MLOps): Overview, Definition, and Architecture, in which they describe roles and missions.
Once again, if you feel I forgot something important, do not hesitate to tell me. I'll add more and more stuff to this article in the future.
If you enjoyed this article, please consider subscribing to my weekly newsletter about data, where I demystify all these concepts. I help you save 5 hours of curation per week.