Data Engineering Alphabet #1

Understand data engineering concepts through an alphabet.

29 Jun 2022 — 2 min read —

Airflow is an open-source tool to author, schedule, and monitor workflows
BigQuery is Google Cloud's data warehouse, which helps turn data into valuable business insights
Column-oriented database is a database that organizes data by field (column), keeping all of the values of a field next to each other in storage
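DAG, short for Directed Acyclic Graph, is a graph whose edges have a direction and that contains no cycles. It is used to represent many different types of flows, such as the dependencies between tasks in a workflow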
ETL / ELT are two processing methods that collect data from remote sources. The first (Extract, Transform, Load) collects the data, transforms it, and then loads it into the database. The second (Extract, Load, Transform) collects the data, loads it into the database, and transforms it only when the data team needs to use it
Flink is a distributed processing engine. It is used to process data streams at large scale and deliver real-time analytical insights
Git is an open-source distributed version control system designed to handle projects of any size, from small to very large
Hadoop is an open source distributed processing framework that manages data processing and storage
Iceberg is a high-performance table format for huge analytic tables. Several processing engines can safely work with the same tables at the same time
JSON, short for JavaScript Object Notation, is a text-based data format
Kafka is a distributed event streaming system that lets you publish, store, and subscribe to streams of records
Lake (data) is a type of repository that stores data in its original format, loaded through ELT processes. The transformation is done only when the data analyst needs to use the data
Machine Learning is a part of artificial intelligence which focuses on the use of data and algorithms to imitate the way that humans learn
NoSQL, short for Not only SQL, is a non-tabular type of database that stores data differently than relational tables. The main types are document, key-value, wide-column, and graph
OLAP / OLTP are two types of data processing systems. OnLine Transactional Processing (OLTP) systems usually rely on row-based storage, where data is stored row by row. Their queries are usually very specific and involve a small selection of records. Conversely, OnLine Analytical Processing (OLAP) systems often rely on column-based storage, or on data organized in a sort of cube. This storage is used to respond quickly to analytical queries
Pipeline is a set of tools and processes used to automate the movement and transformation of data from a source system to a target repository
Query is a request for data or information from a database table or combination of tables
Raw data is data collected from a source and still in its initial state. It has not been cleaned or organized
Snowflake is a cloud-agnostic data warehouse
Tableau is a data visualization tool used for data analysis and business intelligence
Unstructured data is data that is not stored in a structured database format. Its structure is not predefined by a data model
(data) Versioning is the storage of different versions of data that were created or changed at specific points in time
Warehouse (data) is another type of repository, which stores data that has already been transformed through ETL processes
XML, short for Extensible Markup Language, is a data structuring language used to manage and exchange information on the Internet. Unlike HTML, it lets you define your own tags
YAML, originally short for Yet Another Markup Language and now officially YAML Ain't Markup Language, is a human-readable data serialization language that is often used for writing configuration files
ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services
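
To make the DAG entry more concrete, here is a minimal sketch in Python. The task names and the `execution_order` helper are hypothetical and only serve the illustration. It relies on the standard library `graphlib` module (Python 3.9+), not on how Airflow or any other scheduler works internally.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}


def execution_order(graph):
    """Return one valid run order for the tasks.

    graphlib raises CycleError if the graph contains a cycle,
    which is exactly what "acyclic" rules out.
    """
    return list(TopologicalSorter(graph).static_order())


order = execution_order(dependencies)
print(order)
```

Because every task only depends on the previous one, there is a single valid order here. With branching dependencies, several orders can be valid and the sorter returns one of them.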
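
The difference between row-based and column-based storage (see the Column-oriented database and OLAP / OLTP entries) can also be sketched with plain Python structures. The order records and the `to_columns` helper below are made up for the example, and the sketch assumes every record has the same fields. Real databases add compression, indexes, and on-disk layouts that are not shown here.

```python
# Hypothetical order records, stored row by row (the layout OLTP systems favor).
rows = [
    {"id": 1, "customer": "a", "amount": 10.0},
    {"id": 2, "customer": "b", "amount": 25.5},
    {"id": 3, "customer": "a", "amount": 4.5},
]


def to_columns(records):
    """Pivot a list of row dicts into one list of values per field."""
    columns = {}
    for record in records:
        for field, value in record.items():
            columns.setdefault(field, []).append(value)
    return columns


columns = to_columns(rows)

# Transactional access: read one complete record.
second_order = rows[1]

# Analytical access: aggregate a single field without reading the others.
total_amount = sum(columns["amount"])

print(second_order)
print(total_amount)
```

The row layout makes it cheap to read or update one whole record, while the column layout keeps all the values of a field together, which is what an aggregate over that field needs.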

data engineering

Fiona Gérent
