Home Data Engineering ETL & ELT Data Pipelines Tutorials Blog Databases Data Warehousing Big Data Cloud Data SQL Guides Python Guides Tools Glossary Resources About Contact
Beginner

How ETL Pipelines Work

A step-by-step walkthrough of Extract, Transform, Load with a practical example.

What is ETL?

ETL stands for Extract, Transform, Load — a process for moving data from source systems into a destination, such as a data warehouse, while cleaning and reshaping it along the way.

Think of ETL as a factory assembly line for data: raw materials (source data) come in, get processed into a usable shape, and the finished product (clean, structured data) is delivered to its destination.

The Three Stages

Extract pulls data from one or more sources — databases, APIs, files, or event logs. Transform cleans, filters, joins, and reshapes that data into the format your destination expects. Load writes the transformed data into its final destination, ready for analysis.

A Simple Example

Imagine we extract raw order data from an application database, then transform it to calculate revenue per customer before loading it into a warehouse table.

SQL
-- Transform step: aggregate raw orders into customer revenue
SELECT
    customer_id,
    COUNT(order_id)      AS total_orders,
    SUM(order_amount)    AS total_revenue
FROM raw_orders
WHERE order_status = 'completed'
GROUP BY customer_id;

A Python script might orchestrate this by extracting rows, applying transformation logic, and loading the result into the destination table:

PYTHON
import pandas as pd

# Extract
raw = pd.read_sql('SELECT * FROM raw_orders', source_conn)

# Transform
clean = raw[raw['order_status'] == 'completed']
summary = clean.groupby('customer_id').agg(
    total_orders=('order_id', 'count'),
    total_revenue=('order_amount', 'sum')
).reset_index()

# Load
summary.to_sql('customer_revenue', destination_conn, if_exists='replace', index=False)

Common Pitfalls

New ETL pipelines often fail in predictable ways: forgetting to handle duplicate records, not planning for schema changes in source systems, and skipping validation before loading. Building idempotent jobs — ones that can safely re-run without creating duplicates — solves many of these issues early.