Data Cleaning and Preprocessing
Summary
This chapter covers the critical skills of preparing raw data for analysis. Students will learn to identify and handle missing values, detect and remove duplicates, identify outliers, and validate data quality. The chapter also covers data transformation techniques including filtering, type conversion, and feature scaling. By the end of this chapter, students will be able to clean messy real-world datasets and prepare them for visualization and modeling.
Concepts Covered
This chapter covers the following 20 concepts from the learning graph:
- Missing Values
- NaN
- Null Detection
- Dropna Method
- Fillna Method
- Imputation
- Data Type Conversion
- Duplicate Detection
- Duplicate Removal
- Outliers
- Outlier Detection
- Data Validation
- String Cleaning
- Column Renaming
- Data Filtering
- Boolean Indexing
- Query Method
- Data Transformation
- Feature Scaling
- Normalization
Prerequisites
This chapter builds on the data loading and exploration concepts from Chapter 3.
The Art of Cleaning Up Messes
In the last chapter, you learned to load data and explore it like a detective. You probably felt pretty good about yourself—data was flowing, head() was working, and everything seemed under control. Then reality hit: the data was messy.
Welcome to the real world of data science.
Real-world data is never clean. It's full of gaps where values should be, duplicates that snuck in somehow, numbers that shouldn't exist (like someone being -5 years old), and formatting disasters that would make an English teacher weep. This isn't a bug in the data pipeline—it's just how data works in the wild.
Here's the thing: messy data will destroy your analysis. Train a machine learning model on dirty data? It learns the wrong patterns. Calculate averages with missing values? Your numbers lie. Build a report with duplicates? Everything is inflated. Garbage in, garbage out—that's the first law of data science.
But here's the good news: cleaning data is a superpower in itself. Most people don't know how to do it well. Master this chapter, and you'll be the hero who transforms chaotic datasets into pristine, analysis-ready gold. Let's get scrubbing!
Diagram: Data Cleaning Pipeline Overview
Data Cleaning Pipeline Overview
Type: workflow
Bloom Taxonomy: Understand (L2)
Learning Objective: Help students visualize the complete data cleaning workflow from raw data to analysis-ready data
Purpose: Show the sequential steps in a typical data cleaning process
Visual style: Horizontal flowchart with icons for each stage
Steps (left to right):
- RAW DATA (Icon: Messy document with question marks; Color: Red; Hover text: "Data as received - full of problems")
- MISSING VALUES (Icon: Grid with empty cells highlighted; Color: Orange; Hover text: "Identify and handle NaN, None, empty strings")
- DUPLICATES (Icon: Two identical rows with X on one; Color: Yellow; Hover text: "Find and remove duplicate records")
- OUTLIERS (Icon: Box plot with point far outside; Color: Yellow-green; Hover text: "Detect and decide how to handle extreme values")
- DATA TYPES (Icon: Type conversion symbol (A→1); Color: Green; Hover text: "Convert columns to appropriate types")
- VALIDATION (Icon: Checkmark in shield; Color: Blue; Hover text: "Verify data meets business rules")
- TRANSFORMATION (Icon: Gear with arrows; Color: Purple; Hover text: "Scale, normalize, and prepare for analysis")
- CLEAN DATA (Icon: Sparkly document with checkmark; Color: Gold; Hover text: "Analysis-ready dataset!")

Annotations below pipeline:
- "Each step catches different problems"
- "Order matters: missing values before duplicates"
- "Always document your cleaning decisions"

Error feedback loops:
- Dashed arrows from steps 2-6 back to "Log Issues" box
- "Log Issues" connects to "Data Quality Report"
Implementation: SVG with CSS hover effects
Missing Values: The Silent Killers
Missing values are the most common data quality problem you'll encounter. They're sneaky—sometimes they look like empty cells, sometimes they're the word "NULL," and sometimes they're special numeric codes like -999.
Understanding NaN and Null
In pandas, missing values are represented as NaN (Not a Number). This is a special value from the NumPy library that indicates "something should be here, but isn't."
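For example, here is a tiny illustration (the Series and its values are made up for demonstration):

```python
import pandas as pd
import numpy as np

# Both None and np.nan end up as NaN inside a pandas Series
ages = pd.Series([25, None, 31, np.nan, 42])
print(ages)
# 0    25.0
# 1     NaN
# 2    31.0
# 3     NaN
# 4    42.0
# dtype: float64

print(ages.isnull())   # True marks the missing positions
```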
Notice how None and np.nan both represent missing data, but they're handled slightly differently. Pandas is smart enough to recognize both as missing values.
| Missing Value Type | Where You'll See It | Pandas Representation |
|---|---|---|
| Empty cell | CSV files, Excel | NaN |
| None | Python code, JSON | NaN (converted) |
| NULL | Databases | NaN (converted) |
| Empty string ("") | Text files | Not automatically NaN! |
| Special codes (-999, 9999) | Legacy systems | Not automatically NaN! |
Hidden Missing Values
Be careful! Empty strings and special codes like -999 are NOT automatically treated as missing. You need to identify and convert them manually. Always inspect your data after loading.
Null Detection: Finding the Gaps
Null detection is the process of finding where your missing values hide. Pandas provides several methods for this:
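A sketch of the most common detection calls, assuming you already have a DataFrame named df loaded:

```python
# True/False mask the same shape as the DataFrame
print(df.isnull())

# Missing values per column (the single most useful line in this chapter)
print(df.isnull().sum())

# Does each column contain ANY missing values?
print(df.isnull().any())

# Grand total of missing cells
print(df.isnull().sum().sum())

# notnull() is the mirror image: True where a value IS present
print(df.notnull().sum())
```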
The isnull() method is your first line of defense. Run it immediately after loading any dataset—it tells you exactly where the problems are.
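It also helps to look at missing values as a percentage of each column, for example (again assuming a DataFrame df):

```python
# Percentage of missing values per column, worst offenders first
missing_pct = df.isnull().mean() * 100
print(missing_pct.sort_values(ascending=False))
```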
Diagram: Missing Value Detection MicroSim
Missing Value Detective MicroSim
Type: microsim
Bloom Taxonomy: Apply (L3)
Learning Objective: Help students practice identifying and counting missing values in different scenarios
Canvas layout (750x500px):
- Top (750x200): DataFrame display with highlighted missing values
- Bottom left (350x300): Detection code panel
- Bottom right (400x300): Results and quiz area

Visual elements:
- DataFrame grid showing 8 rows × 5 columns
- Missing values highlighted in red when detected
- Color legend showing: Present (green), NaN (red), None (orange), Empty string (yellow)
- Running count of missing values by column

Sample data scenarios (rotate through):
1. Simple NaN values only
2. Mix of NaN and None
3. Hidden missing values (empty strings, -999)
4. Missing values with pattern (all in one column)
5. Sparse data (>50% missing)

Interactive controls:
- Button: "Show isnull()" - highlights all missing
- Button: "Count by Column" - shows bar chart of missing counts
- Button: "Find Hidden Missing" - identifies non-standard missing values
- Dropdown: "Select Scenario" - changes dataset
- Quiz mode: "How many missing in column X?" with input field

Behavior:
- Clicking detection buttons animates the detection process
- Correct quiz answers earn points and unlock harder scenarios
- Hints available after wrong answers
- Progress tracker shows scenarios completed
Visual style: Data detective theme with magnifying glass cursor
Implementation: p5.js with animated highlighting
Handling Missing Values: Your Three Options
Once you've found missing values, you have three main strategies:
- Drop them - Remove rows or columns with missing data
- Fill them - Replace missing values with something reasonable
- Leave them - Some algorithms can handle missing values directly
The dropna Method: Clean Sweep
The dropna method removes rows or columns containing missing values:
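Here is roughly what that looks like in practice (df and the column names in subset are placeholders):

```python
# Drop any row with at least one missing value
df_clean = df.dropna()

# Drop columns instead of rows
df_fewer_cols = df.dropna(axis=1)

# Drop a row only if EVERY value in it is missing
df_no_empty_rows = df.dropna(how="all")

# Drop rows only when specific key columns are missing
df_key_complete = df.dropna(subset=["age", "email"])

# Keep rows that have at least 3 non-missing values
df_mostly_complete = df.dropna(thresh=3)
```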
When to use dropna():
- Missing values are rare (less than 5% of data)
- Rows with missing data are truly unusable
- You have plenty of data to spare
- Missing values are random (not systematic)
When NOT to use dropna():
- It would remove too much data
- Missing values follow a pattern (systematic missingness)
- The columns with missing data are important
- You suspect the missing data contains signal
The fillna Method: Filling the Gaps
The fillna method replaces missing values with specified values:
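For instance, each fill strategy from the table below might look like this (the column names are illustrative):

```python
# Fill every missing value with a constant
df_zero_filled = df.fillna(0)

# Mean fill for roughly symmetric numeric columns
df["age"] = df["age"].fillna(df["age"].mean())

# Median fill is safer when the column is skewed
df["income"] = df["income"].fillna(df["income"].median())

# Mode (most common value) for categorical columns
df["city"] = df["city"].fillna(df["city"].mode()[0])

# Forward fill carries the last known value forward (time series)
df["temperature"] = df["temperature"].ffill()

# A labelled placeholder for categorical gaps
df["status"] = df["status"].fillna("Unknown")
```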
| Fill Strategy | Best For | Example |
|---|---|---|
| Mean | Normally distributed numeric data | Age, height |
| Median | Skewed numeric data, outliers present | Income, prices |
| Mode | Categorical data | City, category |
| Forward fill | Time series data | Stock prices, temperatures |
| Constant (0) | When zero has meaning | Transaction amounts |
| Constant ("Unknown") | Categorical placeholders | Status fields |
Imputation: The Smart Approach
Imputation is the fancy term for filling missing values based on patterns in the data. It's more sophisticated than simple mean/median filling.
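One simple version of this idea is group-based filling: instead of one global median, use the median of rows that are similar. The sketch below assumes a df with segment, age, and income columns; the second half also assumes scikit-learn is installed:

```python
# Fill missing ages with the median age of each customer segment
df["age"] = df["age"].fillna(
    df.groupby("segment")["age"].transform("median")
)

# scikit-learn's SimpleImputer wraps the same logic for many columns
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="median")
df[["age", "income"]] = imputer.fit_transform(df[["age", "income"]])
```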
Advanced imputation strategies include:
- K-Nearest Neighbors (KNN): Fill based on similar rows
- Regression imputation: Predict missing values from other columns
- Multiple imputation: Create multiple filled versions and combine
The Golden Rule of Missing Data
Always ask: WHY is the data missing? Random missing values can often be filled. But if data is missing for a reason (like patients who missed follow-up because they got worse), filling it can introduce bias. When in doubt, document your decision and consider sensitivity analysis.
Duplicate Detection and Removal
Duplicates are copies of data that shouldn't exist. They inflate your counts, skew your statistics, and make your analyses unreliable. They sneak in through:
- Data entry mistakes
- Multiple data imports
- System glitches
- Merging datasets incorrectly
Duplicate Detection: Finding the Clones
Duplicate detection identifies rows that appear more than once:
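A sketch of the core calls (df and the email column are placeholders):

```python
# True for every row that repeats an earlier row exactly
print(df.duplicated())

# How many duplicate rows in total?
print(df.duplicated().sum())

# Judge "duplicate" by certain columns only
print(df.duplicated(subset=["email"]).sum())

# keep=False marks ALL copies, including the first occurrence
print(df.duplicated(keep=False).sum())
```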
The duplicated() method marks duplicates starting from the second occurrence. Use keep=False to see ALL copies, including the first one.
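That makes it easy to pull every copy out for inspection, for example:

```python
# View all copies of duplicated rows side by side
dupes = df[df.duplicated(subset=["email"], keep=False)]
print(dupes.sort_values("email"))
```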
Duplicate Removal: Eliminating the Clones
Duplicate removal keeps only unique rows:
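Something like this (again, the column names are illustrative):

```python
# Keep the first occurrence of each duplicate (the default)
df_unique = df.drop_duplicates()

# Keep the last occurrence instead (often the most recent record)
df_latest = df.drop_duplicates(keep="last")

# Deduplicate on key columns only
df_one_per_email = df.drop_duplicates(subset=["email"])

# How much did we remove?
print(len(df) - len(df_unique), "duplicate rows dropped")
```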
When removing duplicates, ask yourself:
- Should you keep the first or last occurrence?
- Are you sure the duplicates are truly identical?
- Should you merge duplicate information instead of just dropping?
Diagram: Duplicate Handling Decision Tree
Duplicate Handling Decision Tree
Type: diagram
Bloom Taxonomy: Analyze (L4)
Learning Objective: Help students decide the appropriate strategy for handling different types of duplicates
Purpose: Guide decision-making process for duplicate handling
Visual style: Decision tree flowchart
Start: "Duplicates Detected"
Decision 1: "Are rows EXACTLY identical?"
- Yes → "Safe to drop_duplicates()"
- No → Decision 2

Decision 2: "Which columns make rows 'the same'?"
- Identify key columns (ID, email, etc.)
- → Decision 3

Decision 3: "Do non-key columns differ?"
- No → "drop_duplicates(subset=[key_cols])"
- Yes → Decision 4

Decision 4: "Which version is correct?"
- First occurrence → "keep='first'"
- Last occurrence → "keep='last'"
- Need to merge → "Use groupby().agg()"
- Can't determine → "Flag for manual review"

Special case branch: "Is duplication intentional?"
- Yes (e.g., same product bought twice) → "Don't remove!"
- No → Continue with removal

Color coding:
- Green: Safe actions
- Yellow: Need investigation
- Red: Be careful
- Blue: Decision points

Annotations:
- "Always examine duplicates before removing"
- "Document which strategy you used and why"
Implementation: SVG decision tree with interactive highlights
Outliers: The Extreme Values
Outliers are data points that are unusually far from other observations. They might be:
- Legitimate extremes: A billionaire in income data—rare but real
- Data errors: Someone's age recorded as 999
- Measurement errors: A sensor glitch recording impossible values
Outliers matter because they can dramatically affect your statistics. One outlier can shift your mean, expand your standard deviation, and confuse your machine learning models.
Outlier Detection: Finding the Extremes
Outlier detection identifies values that don't fit the normal pattern. Common methods include:
1. Visual inspection (always start here!):
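A quick sketch using matplotlib (assuming it's installed and df has an income column):

```python
import matplotlib.pyplot as plt

# Box plots make extreme values jump out immediately
df["income"].plot(kind="box")
plt.show()

# Histograms show where the bulk of the data sits
df["income"].plot(kind="hist", bins=30)
plt.show()
```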
2. Statistical methods:
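The two workhorses are the z-score and the IQR rule; here is a sketch of both (the income column is a placeholder):

```python
# Z-score: how many standard deviations from the mean?
z = (df["income"] - df["income"].mean()) / df["income"].std()
z_outliers = df[z.abs() > 3]

# IQR rule: anything beyond 1.5 * IQR from the quartiles
q1 = df["income"].quantile(0.25)
q3 = df["income"].quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
iqr_outliers = df[(df["income"] < lower) | (df["income"] > upper)]

print(len(z_outliers), "z-score outliers;", len(iqr_outliers), "IQR outliers")
```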
| Method | How It Works | Best For |
|---|---|---|
| Z-score | Distance from mean in standard deviations | Normal distributions |
| IQR | Distance from quartiles | Any distribution, robust to extremes |
| Domain rules | Based on what's possible | When you know valid ranges |
3. Domain knowledge rules:
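For example (the column names and limits are illustrative):

```python
# Values that are simply impossible in this domain
impossible_ages = df[(df["age"] < 0) | (df["age"] > 120)]
impossible_scores = df[(df["test_score"] < 0) | (df["test_score"] > 100)]
print(len(impossible_ages), "impossible ages,", len(impossible_scores), "impossible scores")
```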
Diagram: Outlier Detection Methods MicroSim
Outlier Detection Playground
Type: microsim
Bloom Taxonomy: Apply (L3)
Learning Objective: Let students experiment with different outlier detection methods and see how they identify different points
Canvas layout (800x550px):
- Left (550x550): Scatter plot / histogram visualization
- Right (250x550): Controls and detected outliers list

Visual elements:
- Data points displayed as circles
- Normal points in blue
- Detected outliers highlighted in red with labels
- Detection threshold lines/zones shown
- Summary statistics displayed

Sample datasets (toggle between):
1. Normal distribution with obvious outliers
2. Skewed distribution (income-like)
3. Multi-modal distribution
4. Real-world messy data with errors

Interactive controls:
- Radio buttons: Detection method (Z-score, IQR, Custom range)
- Slider: Z-score threshold (1.5 to 4.0)
- Slider: IQR multiplier (1.0 to 3.0)
- Number inputs: Custom min/max values
- Dropdown: Dataset selector
- Toggle: Show/hide threshold lines

Display panels:
- Count of outliers detected
- Outlier values and indices
- Percentage of data flagged
- Before/after mean comparison

Behavior:
- Adjusting thresholds immediately updates highlighting
- Hovering over points shows their values
- Clicking outliers adds them to "investigate" list
- Show how different methods catch different outliers

Educational annotations:
- "Z-score catches 3 outliers, IQR catches 5 - which is right?"
- "Lower threshold = more outliers flagged"
- "Some 'outliers' might be valid data!"
Visual style: Clean statistical visualization with gridlines
Implementation: p5.js with dynamic data visualization
Handling Outliers: What To Do
Once you've identified outliers, you have options:
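A sketch of the usual strategies, reusing the lower and upper IQR fences computed in the detection example above:

```python
import numpy as np

# Option 1: remove the rows entirely
df_trimmed = df[(df["income"] >= lower) & (df["income"] <= upper)]

# Option 2: cap (winsorize) extreme values at the fences
df["income_capped"] = df["income"].clip(lower=lower, upper=upper)

# Option 3: keep the values but flag them for separate analysis
df["income_is_outlier"] = (df["income"] < lower) | (df["income"] > upper)

# Option 4: transform the column so extremes carry less weight
df["income_log"] = np.log1p(df["income"])
```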
Don't Just Delete Outliers!
Before removing outliers, ask: Are they errors or legitimate extremes? Removing real data points because they're "inconvenient" is bad science. Document every outlier decision.
Data Validation: Enforcing the Rules
Data validation checks that data meets expected criteria. It's the quality control checkpoint that catches problems before they contaminate your analysis.
Common validation checks:
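Here is what a handful of typical checks might look like (every column name and rule here is an assumption about your dataset):

```python
# Range check: values must fall inside known limits
assert df["age"].between(0, 120).all(), "age out of range"

# Allowed-value check for a categorical column
valid_statuses = {"active", "inactive", "pending"}
assert df["status"].isin(valid_statuses).all(), "unexpected status value"

# Uniqueness check on an identifier column
assert df["customer_id"].is_unique, "duplicate customer_id found"

# Completeness check: required fields must not be missing
assert df["email"].notnull().all(), "missing email addresses"

# Cross-field check: logical relationship between two columns
assert (df["end_date"] >= df["start_date"]).all(), "end_date before start_date"
```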
Building a validation function:
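One way to package those checks so they report everything at once instead of stopping at the first failure (a sketch; adapt the rules to your own data):

```python
def validate_dataframe(df):
    """Return a list of human-readable data quality problems."""
    issues = []

    # Missing values, column by column
    missing = df.isnull().sum()
    for col, count in missing[missing > 0].items():
        issues.append(f"{col}: {count} missing values")

    # Exact duplicate rows
    dup_count = df.duplicated().sum()
    if dup_count > 0:
        issues.append(f"{dup_count} duplicate rows")

    # Example business rule: ages must be between 0 and 120
    if "age" in df.columns:
        bad_ages = (~df["age"].dropna().between(0, 120)).sum()
        if bad_ages > 0:
            issues.append(f"{bad_ages} ages outside the 0-120 range")

    return issues


problems = validate_dataframe(df)
if problems:
    print("Data quality issues found:")
    for issue in problems:
        print(" -", issue)
else:
    print("All checks passed!")
```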
Data Transformation Techniques
Now that your data is clean, it's time to transform it into the shape your analysis needs. Data transformation includes filtering, selecting, renaming, type conversion, and more.
Data Filtering: Selecting Subsets
Data filtering extracts rows that meet specific criteria. You learned the basics in Chapter 3—now let's master advanced techniques.
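A few patterns you'll use constantly (the column names are placeholders):

```python
# Single condition
adults = df[df["age"] >= 18]

# Multiple conditions: & is AND, | is OR, and each condition needs parentheses
active_adults = df[(df["age"] >= 18) & (df["status"] == "active")]

# Membership test with isin()
big_cities = df[df["city"].isin(["New York", "Los Angeles", "Chicago"])]

# Negate any condition with ~
everywhere_else = df[~df["city"].isin(["New York", "Los Angeles", "Chicago"])]

# Text-based filtering (na=False treats missing values as non-matches)
gmail_users = df[df["email"].str.contains("@gmail.com", na=False)]
```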
Boolean Indexing: The Power Tool
Boolean indexing is the technique behind filtering—using True/False arrays to select data:
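The trick is to build each mask as its own named variable so you can inspect it before using it, for example:

```python
# Build the mask separately so you can look at it
is_adult = df["age"] >= 18
print(is_adult.head())     # a Series of True/False values
print(is_adult.sum())      # how many rows match

# Combine masks step by step
is_active = df["status"] == "active"
has_email = df["email"].notnull()

mask = is_adult & is_active & has_email
result = df[mask]
print(f"{len(result)} of {len(df)} rows match all three conditions")
```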
Understanding boolean indexing makes complex filtering intuitive. You can build masks step-by-step, test them separately, and combine them logically.
The Query Method: SQL-Like Filtering
The query method offers a cleaner syntax for complex filters:
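The same filters as above, rewritten with query (a sketch):

```python
adults = df.query("age >= 18")

active_adults = df.query("age >= 18 and status == 'active'")

# Reference Python variables with @
min_age = 21
of_age = df.query("age >= @min_age")

# Column names containing spaces need backticks inside the query string
# high_scores = df.query("`test score` > 90")
```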
| Approach | Syntax | Best For |
|---|---|---|
| Boolean indexing | df[df["col"] > val] | Simple conditions, programmatic use |
| Query method | df.query("col > val") | Complex conditions, readability |
String Cleaning: Taming Text Data
String cleaning standardizes text data that comes in many messy forms:
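For instance, a first pass over messy city and email columns might look like this (column names assumed):

```python
df["city"] = df["city"].str.strip()     # remove leading/trailing spaces
df["city"] = df["city"].str.title()     # "new york" -> "New York"
df["email"] = df["email"].str.lower()   # compare emails case-insensitively
```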
Common string cleaning operations:
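A grab bag of the .str methods you'll reach for most often (the columns and patterns are illustrative):

```python
# Keep only digits in phone numbers
df["phone"] = df["phone"].str.replace(r"\D", "", regex=True)

# Collapse runs of whitespace into a single space
df["name"] = df["name"].str.replace(r"\s+", " ", regex=True)

# Substring test (returns a boolean mask)
has_gmail = df["email"].str.contains("gmail", na=False)

# Split and take a piece
df["domain"] = df["email"].str.split("@").str[1]

# Regex capture: pull a 5-digit ZIP code out of an address
df["zip_code"] = df["address"].str.extract(r"(\d{5})", expand=False)
```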
Column Renaming: Clear Names Matter
Column renaming makes your data self-documenting:
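Two common approaches (the original column names here are invented):

```python
# Rename specific columns with an old-name -> new-name mapping
df = df.rename(columns={"Cust Name": "customer_name", "AGE": "age"})

# Or clean every column name at once
df.columns = (
    df.columns
      .str.strip()
      .str.lower()
      .str.replace(" ", "_")
)
```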
Good column names are:
- Lowercase with underscores (snake_case)
- Descriptive but not too long
- Free of spaces and special characters
- Consistent across your project
Data Type Conversion: Getting Types Right
Data type conversion ensures columns have appropriate types for analysis:
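The conversions you'll perform most often look something like this (column names assumed):

```python
# Strings like "42" become numbers; anything unparseable becomes NaN
df["age"] = pd.to_numeric(df["age"], errors="coerce")

# Date strings become real datetime objects
df["signup_date"] = pd.to_datetime(df["signup_date"])

# Repeated labels become a memory-efficient categorical column
df["status"] = df["status"].astype("category")

# Whole-number IDs stored as floats (because of NaN) -> nullable Int64
df["customer_id"] = df["customer_id"].astype("Int64")

# Always confirm the result
print(df.dtypes)
```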
| Original Type | Convert To | Why |
|---|---|---|
| String numbers | float64 or int64 | Enable math operations |
| Dates as strings | datetime64 | Enable date arithmetic |
| Numeric codes | category | Save memory, show meaning |
| Float IDs | Int64 | Preserve as integers with NaN support |
Diagram: Data Type Conversion Guide Infographic
Data Type Conversion Reference
Type: infographic
Bloom Taxonomy: Remember (L1)
Learning Objective: Quick reference for common data type conversions and when to use them
Purpose: Visual guide for choosing the right data type conversion method
Layout: Two-column reference card style
Section 1: "Converting TO Numeric"
- pd.to_numeric(col) - Basic conversion
- pd.to_numeric(col, errors='coerce') - Invalid → NaN
- col.astype(float) - When you're sure it's clean
- Visual: String "42" → Integer 42
Section 2: "Converting TO Datetime"
- pd.to_datetime(col) - Smart parsing
- pd.to_datetime(col, format='%Y-%m-%d') - Specific format
- Visual: String "2024-03-15" → Datetime object
Section 3: "Converting TO Categorical"
- col.astype('category') - Basic categorical
- pd.Categorical(col, categories=[...], ordered=True) - Ordered
- Visual: Repeating strings → Category codes
Section 4: "Converting TO String"
- col.astype(str) - Simple conversion
- col.map('{:.2f}'.format) - With formatting
- Visual: Number 3.14159 → String "3.14"
Common pitfalls callout box:
- "Integer columns with NaN need Int64 (capital I)"
- "datetime parsing can be slow on large datasets"
- "Category type saves memory but changes behavior"

Color coding:
- Blue: Numeric conversions
- Green: Date conversions
- Purple: Categorical conversions
- Orange: String conversions

Interactive elements:
- Hover over conversion methods to see code examples
- Click to copy code snippet
Implementation: HTML/CSS with JavaScript tooltips
Feature Scaling and Normalization
When you're preparing data for machine learning, feature scaling becomes critical. Different features might have vastly different ranges—age might be 18-80 while income might be 20,000-500,000. Without scaling, the larger numbers dominate the analysis.
Feature Scaling: Bringing Features to Same Scale
Feature scaling transforms features to comparable ranges:
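Min-Max scaling is the simplest example; here it is by hand and via scikit-learn (assuming scikit-learn is installed and df has age and income columns):

```python
# Min-Max scaling by hand: every value mapped into [0, 1]
col = df["income"]
df["income_minmax"] = (col - col.min()) / (col.max() - col.min())

# The same idea with scikit-learn, applied to several columns at once
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
df[["age_scaled", "income_scaled"]] = scaler.fit_transform(df[["age", "income"]])
```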
Normalization: Statistical Standardization
Normalization is used loosely in practice: most often it means z-score standardization (rescaling to mean 0 and standard deviation 1), though it can also refer to scaling to unit norm.
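A sketch of z-score and robust scaling (the column name is assumed; the last line requires scikit-learn):

```python
# Z-score standardization: mean 0, standard deviation 1
col = df["income"]
df["income_z"] = (col - col.mean()) / col.std()

# Robust scaling: based on median and IQR, so outliers barely move it
q1, q3 = col.quantile(0.25), col.quantile(0.75)
df["income_robust"] = (col - col.median()) / (q3 - q1)

# scikit-learn equivalent of the z-score version
from sklearn.preprocessing import StandardScaler
df[["income_std"]] = StandardScaler().fit_transform(df[["income"]])
```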
| Scaling Method | Formula | When to Use |
|---|---|---|
| Min-Max | \((x - min) / (max - min)\) | Neural networks, bounded algorithms |
| Z-score/Standard | \((x - \mu) / \sigma\) | Most ML algorithms, normally distributed data |
| Robust | \((x - median) / IQR\) | Data with outliers |
| Log transform | \(\log(x + 1)\) | Right-skewed data (income, counts) |
When to Scale
Scale features when using distance-based algorithms (KNN, SVM, K-means) or gradient descent (neural networks). Tree-based algorithms (Random Forest, XGBoost) usually don't need scaling.
Diagram: Feature Scaling Comparison MicroSim
Feature Scaling Visualizer
Type: microsim
Bloom Taxonomy: Understand (L2)
Learning Objective: Help students visualize how different scaling methods transform data distributions
Canvas layout (800x500px):
- Top (800x200): Original data distribution histogram
- Bottom left (400x300): Scaled distribution histogram
- Bottom right (400x300): Controls and comparison stats

Visual elements:
- Original data histogram with descriptive statistics
- Scaled data histogram (updates with scaling method)
- Before/after comparison statistics table
- Visual axis showing value ranges

Sample datasets:
- Normal distribution (symmetric)
- Right-skewed (income-like)
- With outliers
- Bimodal distribution

Interactive controls:
- Radio buttons: Scaling method
  - None (original)
  - Min-Max [0,1]
  - Standard (Z-score)
  - Robust (median/IQR)
  - Log transform
- Dropdown: Dataset selector
- Checkbox: Show outliers highlighted
- Checkbox: Show before/after overlay

Comparison statistics displayed:
- Min, Max, Range
- Mean, Median
- Std Dev, IQR
- Visual indicator of how outliers are affected

Behavior:
- Switching scaling method animates the transformation
- Hover over bars to see exact values
- Toggle overlay to see original vs scaled superimposed
- Outliers maintain highlighting through transformation

Educational annotations:
- "Notice: Min-Max squishes outliers to 0 or 1"
- "Standard scaling keeps outliers as extreme z-scores"
- "Robust scaling ignores outliers!"
- "Log transform pulls in right tail"
Visual style: Statistical visualization with clean grid
Implementation: p5.js with real-time distribution updates
Complete Data Cleaning Workflow
Let's put it all together with a complete data cleaning workflow:
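Here is one way the whole pipeline might look as a single reusable function. Treat it as a sketch: the hidden missing-value codes, the median-fill choice, and the decision to cap outliers are all assumptions you'd revisit for your own dataset.

```python
import pandas as pd
import numpy as np


def clean_dataset(path):
    """Walk a raw CSV through the cleaning pipeline from this chapter."""
    # Step 1: Load and take stock
    df = pd.read_csv(path)
    print(f"Loaded {df.shape[0]} rows and {df.shape[1]} columns")

    # Step 2: Standardize column names (snake_case, no spaces)
    df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")

    # Step 3: Turn hidden missing values into real NaN
    df = df.replace({"": np.nan, "NULL": np.nan, -999: np.nan})

    # Step 4: Report missing values, then fill numeric gaps with the median
    missing = df.isnull().sum()
    print("Missing values per column:")
    print(missing[missing > 0])
    for col in df.select_dtypes(include="number").columns:
        df[col] = df[col].fillna(df[col].median())

    # Step 5: Remove exact duplicate rows
    before = len(df)
    df = df.drop_duplicates()
    print(f"Removed {before - len(df)} duplicate rows")

    # Step 6: Cap extreme numeric values at the IQR fences
    for col in df.select_dtypes(include="number").columns:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        df[col] = df[col].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

    # Step 7: Final validation report
    print(f"Remaining missing values: {df.isnull().sum().sum()}")
    print(f"Remaining duplicate rows: {df.duplicated().sum()}")
    return df


# Example usage (the file name is hypothetical)
# clean_df = clean_dataset("raw_customers.csv")
```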
Achievement Unlocked: Data Janitor
You now have the skills to transform any messy dataset into clean, analysis-ready data. This isn't glamorous work, but it's where real data scientists spend 60-80% of their time. You're now equipped for the real world!
Common Patterns and Best Practices
The Cleaning Checklist
Before any analysis, run through this checklist:
- [ ] Load data and check `shape`
- [ ] View `head()` and `tail()` for anomalies
- [ ] Check `dtypes` for correct types
- [ ] Run `isnull().sum()` for missing values
- [ ] Run `duplicated().sum()` for duplicates
- [ ] Check `describe()` for impossible values
- [ ] Validate against business rules
- [ ] Document all cleaning decisions
Anti-Patterns to Avoid
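The sketch below shows the traps side by side with a safer alternative; the column and file names are invented for illustration.

```python
# ANTI-PATTERN: silently dropping every row with any missing value
# df = df.dropna()                          # can throw away most of your data

# Better: measure the damage first, then decide
print(df.isnull().mean() * 100)             # percent missing per column

# ANTI-PATTERN: deleting outliers because they are inconvenient
# df = df[df["income"] < 100_000]           # those might be real customers!

# Better: investigate, then flag or cap them
df["income_is_outlier"] = df["income"] > df["income"].quantile(0.99)

# ANTI-PATTERN: cleaning the raw file in place with no record of changes
# Better: keep the raw data untouched, clean a copy, and log each decision
raw = pd.read_csv("raw_data.csv")
df = raw.copy()
```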
Chapter 4 Checkpoint: Test Your Understanding
Question: You receive a dataset with the following issues:

- 5% of rows have missing ages
- 2% of rows have duplicate emails
- Some scores are recorded as -1 (meaning "not applicable")
- Income column is stored as strings like "$50,000"
Write a cleaning plan for this data.
Click to reveal answer:
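One reasonable plan, assuming the columns are named age, email, score, and income, with pandas as pd and numpy as np already imported:

```python
# 1. Income strings like "$50,000" -> numbers (do this before any math)
df["income"] = (
    df["income"]
      .str.replace("$", "", regex=False)
      .str.replace(",", "", regex=False)
      .astype(float)
)

# 2. Scores of -1 mean "not applicable": treat as missing, never as zero
df["score"] = df["score"].replace(-1, np.nan)

# 3. Missing ages (5%): numeric and modest, so median fill is reasonable
df["age"] = df["age"].fillna(df["age"].median())

# 4. Duplicate emails (2%): inspect first, then keep one row per email
print(df[df.duplicated(subset=["email"], keep=False)].sort_values("email"))
df = df.drop_duplicates(subset=["email"], keep="last")

# 5. Document every one of these decisions alongside the analysis
```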
Key Takeaways
- Missing values (NaN) must be found with `isnull()` and handled with `dropna()`, `fillna()`, or imputation strategies.
- Duplicates inflate your data—detect with `duplicated()` and remove with `drop_duplicates()`.
- Outliers can be errors or legitimate extremes—detect with IQR or z-scores, then decide whether to remove, cap, or keep.
- Data validation enforces business rules—build validation functions to catch problems early.
- Boolean indexing and the query method let you filter data with complex conditions.
- String cleaning standardizes text—use `.str` methods for cleaning operations.
- Data type conversion ensures columns have appropriate types for analysis.
- Feature scaling (Min-Max, Standard, Robust) brings features to comparable ranges for machine learning.
- Document everything—cleaning decisions affect your entire analysis, so keep a record.
- The cleaning workflow (missing → duplicates → outliers → types → validation → transformation) should become second nature.
You've mastered the art of data cleaning—arguably the most valuable practical skill in data science. In the next chapter, you'll learn to visualize your clean data, turning numbers into insights that everyone can understand. The glamorous part is coming!