Intermediate
Building a Simple Data Pipeline
Combine extraction, transformation, and loading into one working example.
Overview
A simple data pipeline connects the three ETL stages (extract, transform, load) into one repeatable script. This example reads raw order data from a CSV file, removes incomplete and duplicate rows, and writes a per-customer summary to an output CSV file.
Step 1: Extract
PYTHON
import pandas as pd


def extract(path):
    # Read the raw CSV file into a DataFrame.
    return pd.read_csv(path)
Step 2: Transform
PYTHON
def transform(df):
    # Drop rows with no customer, then keep one row per order.
    df = df.dropna(subset=['customer_id'])
    df = df.drop_duplicates(subset=['order_id'])
    # Aggregate order count and revenue per customer.
    summary = df.groupby('customer_id').agg(
        total_orders=('order_id', 'count'),
        total_revenue=('order_amount', 'sum')
    ).reset_index()
    return summary
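To see what the transform step does to concrete rows, here is a minimal sketch that runs it on a small in-memory DataFrame. The sample values are made up for illustration; only the column names (`order_id`, `customer_id`, `order_amount`) come from the tutorial.

```python
import pandas as pd


def transform(df):
    # Same logic as the tutorial's transform step.
    df = df.dropna(subset=['customer_id'])
    df = df.drop_duplicates(subset=['order_id'])
    summary = df.groupby('customer_id').agg(
        total_orders=('order_id', 'count'),
        total_revenue=('order_amount', 'sum')
    ).reset_index()
    return summary


# Hypothetical sample data: order 2 appears twice, and order 4 has no customer.
raw = pd.DataFrame({
    'order_id': [1, 2, 2, 3, 4],
    'customer_id': ['A', 'A', 'A', 'B', None],
    'order_amount': [10.0, 20.0, 20.0, 5.0, 99.0],
})

summary = transform(raw)
print(summary)
# Customer A ends up with 2 orders totaling 30.0; customer B with 1 order totaling 5.0.
```

The order with a missing customer is dropped before aggregation, and the repeated order is counted once, so neither inflates the revenue totals.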
Step 3: Load
PYTHON
def load(df, destination_path):
    # Write the summary without the DataFrame index column.
    df.to_csv(destination_path, index=False)
Putting It Together
PYTHON
def run_pipeline():
    raw = extract('orders_raw.csv')
    summary = transform(raw)
    load(summary, 'customer_revenue.csv')
    print('Pipeline complete:', len(summary), 'customers processed')


if __name__ == '__main__':
    run_pipeline()
In production, this same pattern is usually scheduled with an orchestrator like Airflow and reads from and writes to real databases instead of local files.
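As a rough sketch of the database variant, the example below swaps the CSV files for tables, using an in-memory SQLite database as a stand-in for a real one. The function names `extract_from_db` and `load_to_db`, the table names, and the sample rows are assumptions made for illustration; a production setup would typically connect to a server database and be triggered by the orchestrator rather than run by hand.

```python
import sqlite3

import pandas as pd


def extract_from_db(conn, query):
    # Hypothetical helper: read the query result into a DataFrame.
    return pd.read_sql_query(query, conn)


def transform(df):
    # Same cleaning and aggregation as the tutorial's transform step.
    df = df.dropna(subset=['customer_id'])
    df = df.drop_duplicates(subset=['order_id'])
    return df.groupby('customer_id').agg(
        total_orders=('order_id', 'count'),
        total_revenue=('order_amount', 'sum')
    ).reset_index()


def load_to_db(df, conn, table_name):
    # Hypothetical helper: replace the table so reruns do not append duplicates.
    df.to_sql(table_name, conn, if_exists='replace', index=False)


# In-memory SQLite database with made-up sample rows.
conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE orders_raw '
    '(order_id INTEGER, customer_id TEXT, order_amount REAL)'
)
conn.executemany(
    'INSERT INTO orders_raw VALUES (?, ?, ?)',
    [(1, 'A', 10.0), (2, 'A', 20.0), (3, 'B', 5.0)],
)
conn.commit()

raw = extract_from_db(conn, 'SELECT * FROM orders_raw')
load_to_db(transform(raw), conn, 'customer_revenue')

rows = conn.execute(
    'SELECT customer_id, total_orders, total_revenue '
    'FROM customer_revenue ORDER BY customer_id'
).fetchall()
print(rows)
```

Only the extract and load functions change here; the transform step is untouched, which is the main benefit of keeping the three stages separate.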