Intermediate

Python for Data Cleaning

Use pandas to handle missing values, duplicates, and inconsistent formats.

Why Clean Data First

Raw data is rarely analysis-ready. Missing values, duplicate rows, and inconsistent formatting are common, and each one can distort counts, totals, and joins. Cleaning them early in the pipeline keeps those errors out of downstream reports.

Handling Missing Values

Pandas makes it straightforward to find and handle missing values, either by removing them or filling them with a sensible default.

PYTHON
import pandas as pd

df = pd.read_csv('orders.csv')

# Find missing values
print(df.isnull().sum())

# Drop rows missing a required field
df = df.dropna(subset=['customer_id'])

# Fill missing numeric values with 0
# (this assumes a missing discount means no discount was applied)
df['discount'] = df['discount'].fillna(0)
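
Before filling, it can help to measure how much of each column is missing and to record which values were filled, so that a default of 0 is not later mistaken for a real observation. The sketch below is only an illustration: the in-memory `orders` table and the `discount_was_missing` column are hypothetical stand-ins, not part of the original orders.csv example.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data standing in for orders.csv
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "customer_id": ["C1", None, "C3", "C4"],
    "discount": [0.1, 0.2, np.nan, np.nan],
})

# Fraction of missing values per column (0.0 means complete)
missing_share = orders.isna().mean()
print(missing_share)

# Remember which discounts were missing before they are overwritten
orders["discount_was_missing"] = orders["discount"].isna()

# Drop rows missing the required field, then fill the optional one
cleaned = orders.dropna(subset=["customer_id"]).copy()
cleaned["discount"] = cleaned["discount"].fillna(0)
print(cleaned)
```

Keeping the flag column is a design choice: it costs one boolean per row and lets later analysis separate "no discount" from "discount unknown".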

Removing Duplicates

Duplicate rows can silently inflate totals in reports. Always check for and remove them before aggregating data.

PYTHON
# Identify duplicate rows (by default this marks the second and later
# occurrences of each order_id, not the first)
duplicates = df[df.duplicated(subset=['order_id'])]

# Drop duplicates, keeping the first occurrence
df = df.drop_duplicates(subset=['order_id'], keep='first')
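
To make the effect on totals concrete, here is a minimal sketch using a small hypothetical `orders` table (the values are invented for illustration). It also uses `keep=False`, which marks every row in a duplicated group, so all copies can be inspected before any are dropped.

```python
import pandas as pd

# Hypothetical sample data: order 102 appears twice
orders = pd.DataFrame({
    "order_id": [101, 102, 102, 103],
    "amount": [50.0, 20.0, 20.0, 30.0],
})

# Total with the duplicate still present
total_before = orders["amount"].sum()

# keep=False flags every row whose order_id occurs more than once
all_copies = orders[orders.duplicated(subset=["order_id"], keep=False)]
print(all_copies)

# Keep the first occurrence of each order_id and recompute the total
deduped = orders.drop_duplicates(subset=["order_id"], keep="first")
total_after = deduped["amount"].sum()

print(total_before, total_after)  # 120.0 100.0
```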

Standardizing Formats

Inconsistent text casing, date formats, or whitespace can break joins and groupings. Standardize these early in your pipeline.

PYTHON
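# Trim surrounding whitespace and normalize casing so ' us' and 'US' match
df['country'] = df['country'].str.strip().str.upper()

# Parse dates; values that cannot be parsed become NaT instead of raising
df['order_date'] = pd.to_datetime(df['order_date'], errors='coerce')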
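
Because `errors='coerce'` turns unparseable dates into `NaT` without raising, it is worth counting them after conversion. The following sketch, again on hypothetical sample data, shows how standardizing text collapses spurious groups and how to check how many dates failed to parse.

```python
import pandas as pd

# Hypothetical sample data with inconsistent casing, whitespace, and one bad date
orders = pd.DataFrame({
    "country": [" us", "US ", "Us", "de"],
    "order_date": ["2024-01-05", "2024-01-06", "not a date", "2024-02-01"],
})

# Four distinct raw strings, although only two countries are present
groups_before = orders["country"].nunique()

orders["country"] = orders["country"].str.strip().str.upper()
groups_after = orders["country"].nunique()

# Unparseable values become NaT rather than raising an error
orders["order_date"] = pd.to_datetime(orders["order_date"], errors="coerce")
unparsed = int(orders["order_date"].isna().sum())

print(groups_before, groups_after, unparsed)  # 4 2 1
```

If the input had no missing dates to begin with, a nonzero `unparsed` count points directly at rows whose original strings need a closer look.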