How ETL Pipelines Work
A step-by-step walkthrough of Extract, Transform, Load with a practical example.
What is ETL?
ETL stands for Extract, Transform, Load — a process for moving data from source systems into a destination, such as a data warehouse, while cleaning and reshaping it along the way.
Think of ETL as a factory assembly line for data: raw materials (source data) come in, get processed into a usable shape, and the finished product (clean, structured data) is delivered to its destination.
The Three Stages
Extract pulls data from one or more sources — databases, APIs, files, or event logs. Transform cleans, filters, joins, and reshapes that data into the format your destination expects. Load writes the transformed data into its final destination, ready for analysis.
A Simple Example
Imagine we extract raw order data from an application database, then transform it to calculate revenue per customer before loading it into a warehouse table.
-- Transform step: aggregate raw orders into customer revenue
SELECT
customer_id,
COUNT(order_id) AS total_orders,
SUM(order_amount) AS total_revenue
FROM raw_orders
WHERE order_status = 'completed'
GROUP BY customer_id;
A Python script might orchestrate this by extracting rows, applying transformation logic, and loading the result into the destination table:
import pandas as pd
# Extract
raw = pd.read_sql('SELECT * FROM raw_orders', source_conn)
# Transform
clean = raw[raw['order_status'] == 'completed']
summary = clean.groupby('customer_id').agg(
total_orders=('order_id', 'count'),
total_revenue=('order_amount', 'sum')
).reset_index()
# Load
summary.to_sql('customer_revenue', destination_conn, if_exists='replace', index=False)
Common Pitfalls
New ETL pipelines often fail in predictable ways: forgetting to handle duplicate records, not planning for schema changes in source systems, and skipping validation before loading. Building idempotent jobs — ones that can safely re-run without creating duplicates — solves many of these issues early.
Related Tutorials
Keep building your data engineering foundations.
SQL Joins Explained
INNER, LEFT, RIGHT and FULL joins explained with diagrams and sample queries.
Read guide →Building a Simple Data Pipeline
Combine extraction, transformation, and loading into one working example.
Read guide →Batch vs Streaming Data
Understand the two fundamental ways data can be processed.
Read guide →