
Data Engineering Alphabet #1

Understand data engineering concepts through an alphabet.

Fiona Gérent
  • Airflow is an open-source tool to author, schedule, and monitor workflows
  • BigQuery is a cloud data warehouse that helps turn data into valuable business insights
  • Column-oriented database is a database that organizes data by field, keeping all of the data associated with a field next to each other in memory
  • DAG, an acronym for Directed Acyclic Graph, is a graph whose edges have a direction and never form a cycle. It is used to represent many different types of flows
  • ETL / ELT are two processing methods that collect data from a remote source. ETL extracts the data, transforms it, and then loads it into the database, whereas ELT extracts the data, loads it into the database, and transforms it only when the data team needs to use it
  • Flink is a distributed processing engine. It is used for processing data streams at large scale and delivering real-time analytical insights
  • Git is an open-source distributed version control system designed to handle projects of any size
  • Hadoop is an open-source distributed processing framework that manages data processing and storage
  • Iceberg is a high-performance format for huge analytic tables. Multiple engines can safely work with the same tables at the same time
  • JSON, for JavaScript Object Notation, is a text data format
  • Kafka is a distributed system for continuous data streaming that lets you publish, store, and subscribe to streams of records
  • Lake (data) is a type of repository that stores data in its original format, following the ELT process. The transformation is done only when a data analyst needs to use the data
  • Machine Learning is a branch of artificial intelligence that focuses on the use of data and algorithms to imitate the way humans learn
  • NoSQL, for Not only SQL, is a non-tabular database that stores data differently than relational tables. The main types are document, key-value, wide-column, and graph
  • OLAP / OLTP are two types of data processing systems. OnLine Transactional Processing (OLTP) systems typically rely on row-based storage, where data is stored row by row. Their queries are usually very specific and involve a small selection of records. Conversely, OnLine Analytical Processing (OLAP) systems typically rely on column-based storage or a cube-like structure. This storage is used to respond quickly to analytical queries
  • Pipeline is a set of tools and processes used to automate the movement and transformation of data from a source system to a target repository
  • Query is a request for data or information from a database table or combination of tables
  • Raw data is data collected from a source and kept in its initial state. It is not cleaned or organized
  • Snowflake is a cloud-agnostic data warehouse
  • Tableau is a data visualization tool used for data analysis and business intelligence
  • Unstructured data is data that is not stored in a structured database format. Its structure is not predefined by a data model
  • (data) Versioning is the storage of different versions of data that were created or changed at specific points in time
  • Warehouse (data) is another type of repository, which stores data that has already been transformed through the ETL process
  • XML is a markup language for structuring data, used for the management and exchange of information on the Internet. Unlike HTML, it lets you define your own tags
  • YAML, originally an acronym for Yet Another Markup Language and now for YAML Ain't Markup Language, is a human-readable data serialization language that is often used for writing configuration files
  • ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services
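
To make a few of these entries more concrete, here is a minimal sketch that ties together DAG, ETL, and Pipeline. It is only an illustration: the task names, the sample records, and the in-memory list standing in for a warehouse are all assumptions, and a real orchestrator such as Airflow works differently under the hood. It uses Python's standard `graphlib` module (Python 3.9+) to order the tasks.

```python
from graphlib import TopologicalSorter

# Hypothetical raw records, as they might arrive from a source system.
RAW_ROWS = [" alice ,30", "bob,25 "]

def extract():
    """Collect the raw data unchanged."""
    return list(RAW_ROWS)

def transform(rows):
    """Clean each raw record into a (name, age) tuple."""
    cleaned = []
    for row in rows:
        name, age = row.split(",")
        cleaned.append((name.strip(), int(age.strip())))
    return cleaned

def load(records, warehouse):
    """Append the cleaned records to the in-memory 'warehouse'."""
    warehouse.extend(records)

# The DAG: each task maps to the set of tasks it depends on.
DAG = {"extract": set(), "transform": {"extract"}, "load": {"transform"}}

def run_pipeline():
    """Run the tasks in an order that respects the DAG's dependencies."""
    warehouse = []
    results = {}
    order = list(TopologicalSorter(DAG).static_order())
    for task in order:
        if task == "extract":
            results[task] = extract()
        elif task == "transform":
            results[task] = transform(results["extract"])
        elif task == "load":
            load(results["transform"], warehouse)
    return order, warehouse

if __name__ == "__main__":
    order, warehouse = run_pipeline()
    print(order)
    print(warehouse)
```

Because the data is transformed before it is loaded, this sketch follows the ETL order. An ELT variant would load the raw rows first and run the transformation later, only when someone needs the cleaned data.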