YData Profiling Tutorial: Comprehensive Data Analysis
Overview
YData Profiling (formerly pandas-profiling) is a powerful Python library that generates comprehensive reports for exploratory data analysis. This tutorial demonstrates its capabilities using a synthetic e-commerce customer dataset designed to showcase various data quality issues and patterns that YData Profiling excels at detecting.
Installation
First, install YData Profiling:
1 |
|
For Jupyter notebooks, you might also want:
1 |
|
Dataset Overview
Our tutorial dataset contains 503 rows of e-commerce customer data with 16 columns featuring:
- Missing values in multiple columns (ages, income, phone numbers, etc.)
- Data quality issues (age outliers, inconsistent phone formatting)
- Various data types (numeric, categorical, datetime, boolean, text)
- High cardinality categorical variables (cities)
- Correlations between variables (age and product preferences)
- Duplicate rows for detection testing
- Skewed distributions (income follows log-normal distribution)
Basic Usage
1. Generate a Simple Report
1 2 3 4 5 6 7 8 9 10 11 12 13 14 |
|
2. Customized Report with Advanced Configuration
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 |
|
Key Features Demonstrated
1. Data Types and Overview
YData Profiling automatically detects and categorizes:
- Numerical variables: age, annual_income, total_spent, satisfaction_score
- Categorical variables: gender, education_level, favorite_category
- DateTime variables: registration_date
- Boolean variables: premium_member
- Text variables: last_review
- High cardinality categorical: city (60+ unique values)
2. Missing Data Analysis
The report provides multiple visualizations for missing data:
- Bar chart: Shows missing count per column
- Matrix plot: Visualizes missing data patterns
- Heatmap: Shows correlations in missingness
- Dendrogram: Clusters variables by missing patterns
Our dataset includes strategic missing values: - 25 missing ages (5%) - 35 missing incomes (7%) - 40 missing credit scores (8%) - 25 missing phone numbers (5%)
3. Data Quality Issues Detection
YData Profiling automatically identifies:
Outliers: - Age outliers (150 and 5 years old) are flagged as extreme values - Income distribution shows high-value outliers
Inconsistent Formatting: - Phone numbers in multiple formats: (xxx) xxx-xxxx, xxx-xxx-xxxx, xxx.xxx.xxxx, xxxxxxxxxx - Gender entries with inconsistent capitalization: Male, M, m, male
Data Type Issues: - Suggests better data types for mixed-format columns
4. Distribution Analysis
For each numerical variable, the report shows:
- Descriptive statistics: mean, median, std, quartiles
- Distribution plots: histograms with optional normal distribution overlay
- Skewness and kurtosis: measures of distribution shape
Our dataset demonstrates: - Normal distribution: age (with outliers) - Log-normal distribution: annual_income (right-skewed) - Poisson-like distribution: total_purchases - Gamma distribution: total_spent
5. Correlation Analysis
Multiple correlation methods reveal relationships:
- Pearson: Linear relationships between continuous variables
- Spearman: Monotonic relationships (rank-based)
- Phi-K: Correlation for categorical variables
- Cramér's V: Association between categorical variables
Expected correlations in our dataset: - Age and favorite product category - Income and total spent - Education level and income
6. Categorical Variable Analysis
For categorical variables, the report provides:
- Frequency tables: Count and percentage for each category
- Bar charts: Visual representation of category distributions
- Cardinality warnings: Flags for high-cardinality variables like 'city'
7. Duplicate Detection
The report identifies: - 3 duplicate rows intentionally added to the dataset - Exact matches across all columns - Percentage of duplicates: Impact on dataset size
8. Text Analysis
For text columns like 'last_review': - Length distribution: Character count statistics - Sample values: Examples of text entries - Completeness: Percentage of non-null text entries
Advanced Configuration Options
Minimal Configuration for Large Datasets
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 |
|
Sensitive Data Configuration
1 2 3 4 5 6 7 8 |
|
Interpreting the Report
1. Executive Summary
The report begins with high-level insights: - Dataset dimensions (503 rows × 16 columns) - Missing cells percentage - Duplicate rows count - Data types distribution
2. Variable Analysis
Each variable gets detailed analysis: - Distinct count: Unique values - Missing count: Null values - Memory usage: Storage requirements - Type-specific metrics: Based on variable type
3. Warnings and Alerts
YData Profiling automatically flags: - High cardinality: Variables with many unique values - High correlation: Potentially redundant variables - Skewed distributions: Variables needing transformation - Constant values: Variables with no variation - Missing values: Variables with significant missingness
Best Practices
1. Performance Optimization
For large datasets:
- Use minimal=True
for quick overview
- Disable expensive correlation calculations
- Limit sample sizes
- Use lazy=False
for immediate computation
2. Customization Tips
- Add variable descriptions for better documentation
- Configure correlation methods based on data types
- Customize missing data visualizations
- Set appropriate duplicate detection keys
3. Integration Workflow
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 |
|
Common Use Cases
1. Initial Data Exploration
Perfect for understanding new datasets before analysis: - Data quality assessment - Variable relationship discovery - Missing data patterns - Distribution characteristics
2. Data Quality Monitoring
Regular profiling for ongoing data pipelines: - Detect data drift - Monitor missing value trends - Track distribution changes - Identify new data quality issues
3. Documentation Generation
Automated documentation for datasets: - Share with stakeholders - Document data characteristics - Support reproducible research - Create data dictionaries
Conclusion
YData Profiling provides comprehensive automated exploratory data analysis that would take hours to perform manually. This tutorial dataset demonstrates its ability to:
- Automatically detect various data types and quality issues
- Generate publication-ready visualizations
- Identify patterns and relationships
- Provide actionable insights for data cleaning
- Create comprehensive documentation
The tool is invaluable for data scientists, analysts, and anyone working with datasets who needs quick, thorough data understanding.
Next Steps
After reviewing the YData Profiling report:
- Address data quality issues: Clean outliers, standardize formats
- Handle missing values: Decide on imputation or removal strategies
- Feature engineering: Use correlation insights for feature selection
- Distribution analysis: Consider transformations for skewed variables
- Duplicate handling: Remove or investigate duplicate records
YData Profiling provides the foundation for informed data preprocessing and analysis decisions.