Extract: The focus of this step is to retrieve data from various sources. The sources can be diverse, including OLTP business databases, program applications, files, system logs, etc. These encompass structured data, semi-structured data, and unstructured data.
Transform: The focus of this step is on changing the raw data from its original structure to a format that meets the requirements of the target system (the system where you plan to store data for analysis). For instance, in a data warehouse with strict layers, each layer has fixed data specifications and models. Fields' data types and value formats are strictly defined. However, raw data usually does not conform to these specifications, containing issues like duplicate data and dirty data. In such cases, transformation operations are performed to align the data with the format of the target system based on your requirements.
Load: The focus of this step is to import data into the target system, which could be an OLAP database system, a data warehouse, or a data lake.
ETL vs ELT?
ETL and ELT, in my view, are two modes based on different data processing architectures.
- ETL Mode
ETL mode involves three steps of extraction, transformation, and loading before data enters the target system.
- ELT Mode
ELT mode is the current mainstream solution for big data processing. It involves the target system (which could be a database, a data warehouse, or a data lake) receiving all raw data, combining extraction and loading operations. In ELT, you can find all raw data layers in the target system, while other transformation operations are performed within the database, data warehouse, or data lake. Multiple transformations can be applied to raw data to meet different requirements.
Difference
Extract, Load, Transform (ELT) improves upon Extract, Transform, Load (ETL) in several aspects.
All of this has evolved with the advent of data warehousing, based on the computational power, rich data processing modes, and security provided by cloud data warehouses.
ELT offers better compatibility, faster transformation speed, lower costs, and improved security.
Better Data Compatibility | ETL is most suitable for structured data, displayed in tables with rows and columns. It transforms a set of structured data into another structured format before loading.
In contrast, ELT can handle all types of data, including unstructured data that cannot be stored in tabular format, such as images or documents. ELT loads various data formats into the target data warehouse, where it can be further transformed into the required format. |
Faster Speed | ELT is faster than ETL. ETL requires an additional step before loading data into the target, making it difficult to scale and slowing down the system as data size increases.
On the other hand, ELT can load data directly into the target system while parallelly transforming it. It uses the processing power and parallelism provided by cloud data warehouses for real-time or near-real-time data transformation. |
Lower Costs | ETL processes require analysis upfront. Analysts need to plan the reports they want to generate, define data structures, and formats. The setup time increases, leading to increased costs. Additional server infrastructure for transformation also adds to the expenses.
ELT systems are fewer than ETL, as all transformations occur within the target data warehouse. Fewer systems mean less maintenance, simplifying the data stack and lowering setup costs. |
Improved Security | When dealing with personal data, compliance with data privacy regulations is crucial. Companies must protect Personally Identifiable Information (PII) from unauthorized access. In ETL, developers must build custom solutions, such as PII masking, to monitor and protect data.
ELT solutions offer many security features directly within the data warehouse, such as fine-grained access control and multi-factor authentication. This allows more time to be invested in analysis, reducing the time required to meet data regulatory requirements. |