Intermediate

Building a Simple Data Pipeline

Combine extraction, transformation, and loading into one working example.

Overview

A simple data pipeline connects the three ETL stages into one repeatable script. This example reads raw order data from a CSV file, cleans it, and writes a per-customer summary to an output CSV file.

Step 1: Extract

PYTHON
import pandas as pd

def extract(path):
    # Read the raw orders CSV into a DataFrame.
    return pd.read_csv(path)

Step 2: Transform

PYTHON
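def transform(df):
    # Drop rows that have no customer to attribute the order to.
    df = df.dropna(subset=['customer_id'])
    # Keep one row per order so repeated rows are not counted twice.
    df = df.drop_duplicates(subset=['order_id'])
    # Aggregate to one row per customer: order count and total revenue.
    summary = df.groupby('customer_id').agg(
        total_orders=('order_id', 'count'),
        total_revenue=('order_amount', 'sum')
    ).reset_index()
    return summary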
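
To see what the transform step does, it can help to run it on a few rows. The sketch below repeats the same logic so it runs on its own. The sample rows are invented for illustration: one has no customer_id and one repeats an order_id.

```python
import pandas as pd

def transform(df):
    # Same logic as the transform step, repeated so this block runs alone.
    df = df.dropna(subset=['customer_id'])
    df = df.drop_duplicates(subset=['order_id'])
    return df.groupby('customer_id').agg(
        total_orders=('order_id', 'count'),
        total_revenue=('order_amount', 'sum'),
    ).reset_index()

# Hypothetical sample data: order 2 appears twice, order 3 has no customer.
raw = pd.DataFrame({
    'order_id': [1, 2, 2, 3, 4],
    'customer_id': ['A', 'A', 'A', None, 'B'],
    'order_amount': [10.0, 5.0, 5.0, 99.0, 7.5],
})

summary = transform(raw)
print(summary)
```

With this input, customer A should end up with 2 orders totalling 15.0 and customer B with 1 order totalling 7.5, because the row without a customer and the repeated order are removed before aggregation.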

Step 3: Load

PYTHON
def load(df, destination_path):
    # Write the summary to CSV, leaving out the DataFrame index column.
    df.to_csv(destination_path, index=False)

Putting It Together

PYTHON
def run_pipeline():
    raw = extract('orders_raw.csv')
    summary = transform(raw)
    load(summary, 'customer_revenue.csv')
    print('Pipeline complete:', len(summary), 'customers processed')

if __name__ == '__main__':
    run_pipeline()
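
The run_pipeline function above hard-codes its file names. One possible variation, sketched here as an assumption rather than as part of the original example, is to pass the paths in as parameters so the whole pipeline can be tried against a temporary directory. The source_path and destination_path parameters and the sample CSV content are hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

def extract(path):
    return pd.read_csv(path)

def transform(df):
    df = df.dropna(subset=['customer_id'])
    df = df.drop_duplicates(subset=['order_id'])
    return df.groupby('customer_id').agg(
        total_orders=('order_id', 'count'),
        total_revenue=('order_amount', 'sum'),
    ).reset_index()

def load(df, destination_path):
    df.to_csv(destination_path, index=False)

def run_pipeline(source_path, destination_path):
    # Paths are parameters here (an assumption) instead of fixed file names.
    raw = extract(source_path)
    summary = transform(raw)
    load(summary, destination_path)
    return len(summary)

# Run end to end against throwaway files with made-up sample rows.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / 'orders_raw.csv'
    dst = Path(tmp) / 'customer_revenue.csv'
    src.write_text(
        'order_id,customer_id,order_amount\n'
        '1,A,10.0\n'
        '2,A,5.0\n'
        '2,A,5.0\n'
        '3,,99.0\n'
        '4,B,7.5\n'
    )
    n_customers = run_pipeline(src, dst)
    result = pd.read_csv(dst)

print('Customers processed:', n_customers)
```

Returning the customer count instead of printing it inside the function makes the result easier to check from a test or a scheduler task.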

In production, the same extract-transform-load pattern is typically scheduled by an orchestrator such as Airflow, and it reads from and writes to databases rather than local files.
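
As a rough illustration of the database variant, the sketch below swaps the CSV files for an in-memory SQLite database using Python's built-in sqlite3 module with pandas' read_sql_query and to_sql. The function names extract_from_db and load_to_db, the table names, and the sample rows are all hypothetical. A production setup would more likely point at a server database and would not seed its own input.

```python
import sqlite3

import pandas as pd

def extract_from_db(conn, table='orders_raw'):
    # The table name is a trusted, hypothetical placeholder, not user input.
    return pd.read_sql_query(f'SELECT * FROM {table}', conn)

def transform(df):
    df = df.dropna(subset=['customer_id'])
    df = df.drop_duplicates(subset=['order_id'])
    return df.groupby('customer_id').agg(
        total_orders=('order_id', 'count'),
        total_revenue=('order_amount', 'sum'),
    ).reset_index()

def load_to_db(df, conn, table='customer_revenue'):
    # if_exists='replace' drops and recreates the table on every run.
    df.to_sql(table, conn, if_exists='replace', index=False)

conn = sqlite3.connect(':memory:')

# Seed the source table with made-up rows so the sketch is self-contained.
pd.DataFrame({
    'order_id': [1, 2, 2, 3, 4],
    'customer_id': ['A', 'A', 'A', None, 'B'],
    'order_amount': [10.0, 5.0, 5.0, 99.0, 7.5],
}).to_sql('orders_raw', conn, index=False)

load_to_db(transform(extract_from_db(conn)), conn)

rows = conn.execute(
    'SELECT customer_id, total_orders, total_revenue '
    'FROM customer_revenue ORDER BY customer_id'
).fetchall()
print(rows)
conn.close()
```

Only the extract and load functions change here. The transform step is untouched, which is the main benefit of keeping the three stages separate.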