Python for Data Cleaning
Use pandas to handle missing values, duplicates, and inconsistent formats.
Why Clean Data First
Raw data is rarely analysis-ready. Missing values, duplicate rows, and inconsistent formatting are common, and cleaning them early keeps those errors from propagating into downstream reports.
Handling Missing Values
Pandas makes it straightforward to find and handle missing values, either by removing them or filling them with a sensible default.
import pandas as pd
df = pd.read_csv('orders.csv')
# Find missing values
print(df.isnull().sum())
# Drop rows missing a required field
df = df.dropna(subset=['customer_id'])
# Fill missing numeric values with 0
df['discount'] = df['discount'].fillna(0)
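To try these calls without orders.csv, the following sketch runs them on a small in-memory frame. The sample rows and the summarize_missing helper are hypothetical, made up for illustration, and the column names simply mirror the ones used above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data standing in for orders.csv
df = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "customer_id": ["C1", None, "C3", "C4"],
    "discount": [5.0, 2.0, np.nan, np.nan],
})

def summarize_missing(frame):
    """Return the count and share of missing values per column (hypothetical helper)."""
    counts = frame.isnull().sum()
    return pd.DataFrame({"missing": counts, "share": counts / len(frame)})

report = summarize_missing(df)

# Drop rows missing a required field, then fill the optional numeric field
cleaned = df.dropna(subset=["customer_id"]).copy()
cleaned["discount"] = cleaned["discount"].fillna(0)

print(report)
print(cleaned)
```

Filling with 0 assumes that a missing discount means no discount was applied. If that assumption does not hold for your data, a different default (or leaving the value missing) may be more appropriate.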
Removing Duplicates
Duplicate rows can silently inflate totals in reports. Always check for and remove them before aggregating data.
# Identify repeated rows (the first occurrence of each order_id is not flagged)
duplicates = df[df.duplicated(subset=['order_id'])]
# Drop duplicates, keeping the first occurrence
df = df.drop_duplicates(subset=['order_id'], keep='first')
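The inflation effect is easy to demonstrate. This sketch uses made-up rows and a hypothetical amount column (not part of the example above) to compare a total before and after dropping repeated order_id values.

```python
import pandas as pd

# Hypothetical sample data; order 102 appears twice
df = pd.DataFrame({
    "order_id": [101, 102, 102, 103],
    "amount": [20.0, 35.0, 35.0, 10.0],
})

total_before = df["amount"].sum()

# keep=False flags every row in a duplicated group, not just the repeats,
# which helps when inspecting the full set of conflicting rows
all_copies = df[df.duplicated(subset=["order_id"], keep=False)]

deduped = df.drop_duplicates(subset=["order_id"], keep="first")
total_after = deduped["amount"].sum()

print(f"rows flagged: {len(all_copies)}")
print(f"total before: {total_before}, after: {total_after}")
```

Keeping the first occurrence is only one possible policy. When the copies of a row disagree, it may be worth reviewing all_copies before deciding which one to keep.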
Standardizing Formats
Inconsistent text casing, date formats, or whitespace can break joins and groupings. Standardize these early in your pipeline.
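# Trim whitespace and normalize casing so equal values compare equal
df['country'] = df['country'].str.strip().str.upper()
# Parse dates; values that cannot be parsed become NaT instead of raising
df['order_date'] = pd.to_datetime(df['order_date'], errors='coerce')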
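As a rough illustration of why this matters, the sketch below counts distinct country values before and after standardizing, then checks how many dates failed to parse. The sample values are made up for the example.

```python
import pandas as pd

# Hypothetical sample data with inconsistent casing, whitespace, and one bad date
df = pd.DataFrame({
    "country": [" us", "US ", "us", "de"],
    "order_date": ["2024-01-05", "2024-01-06", "not a date", "2024-02-10"],
})

groups_before = df["country"].nunique()

df["country"] = df["country"].str.strip().str.upper()
groups_after = df["country"].nunique()

# errors='coerce' replaces unparseable values with NaT instead of raising
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
unparsed = df["order_date"].isna().sum()

print(f"distinct countries: {groups_before} -> {groups_after}")
print(f"dates that failed to parse: {unparsed}")
```

Because errors='coerce' hides parsing failures as NaT, counting them afterwards is a simple way to notice when a source starts sending dates in an unexpected format.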
Related Tutorials
Keep building your data engineering foundations.
How ETL Pipelines Work
A step-by-step walkthrough of Extract, Transform, Load with a practical example.
Data Quality Checks
Practical rules and checks that keep pipelines trustworthy.
Building a Simple Data Pipeline
Combine extraction, transformation, and loading into one working example.