All examples for Automated Dataset Profiler

Demo Mode - Retail Purchase Dataset

Demo mode generates synthetic retail/e-commerce data with realistic quality issues (negative ages, missing values), so the full profiling capabilities can be shown without any external file.
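
The skill's actual generator is not shown on this page. As a rough sketch of what demo mode might do, the snippet below builds 1,000 synthetic orders with the eight columns that appear in the report, then injects the same kinds of quality issues (3 negative ages, 15 missing ages, 8 missing amounts). The function name `make_demo_retail_rows`, the distributions, the minimum age of 18, and the category names "Home" and "Books" are all assumptions made for illustration; only the column names and issue counts come from the report.

```python
import random


def make_demo_retail_rows(n_rows=1000, seed=0):
    """Hypothetical generator: synthetic retail orders with injected quality issues."""
    rng = random.Random(seed)
    # "Home" and "Books" are assumed names; the report only lists the top 3 of 5.
    categories = ["Electronics", "Clothing", "Sports", "Home", "Books"]
    rows = []
    for i in range(n_rows):
        rows.append({
            "order_id": 1001 + i,
            # Clamp at 18 (assumed) so only the injected errors are negative.
            "customer_age": max(18, round(rng.gauss(38, 11))),
            # A log-normal draw gives a right-skewed amount distribution.
            "total_amount": round(rng.lognormvariate(4.5, 0.8), 2),
            "quantity": rng.randint(1, 10),
            "category": rng.choice(categories),
            "satisfaction": rng.choices([1, 2, 3, 4, 5], weights=[5, 10, 20, 33, 32])[0],
            "delivery_days": max(1, round(rng.gauss(5, 2))),
            "returned": rng.random() < 0.08,
        })

    # Pick 26 distinct rows so the three kinds of issue never overlap.
    bad = rng.sample(range(n_rows), 26)
    for i in bad[:3]:
        rows[i]["customer_age"] = -rng.randint(1, 5)  # impossible negative age
    for i in bad[3:18]:
        rows[i]["customer_age"] = None  # missing age
    for i in bad[18:]:
        rows[i]["total_amount"] = None  # missing amount
    return rows
```

Fixing the seed keeps the demo reproducible, which makes it easier to compare profiler output across runs.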

Input

What you provide to the skill

Profile a sample retail dataset with customer purchases - show me what the profiler can do

Output

What the skill generates for you

======================================================================
DATASET PROFILE REPORT: Demo Retail Purchase Dataset

DATASET OVERVIEW


Rows: 1,000 | Columns: 8
Missing: 23 cells (0.29%)
Memory usage: 110.5 KB
Data types: 6 numeric, 1 categorical, 1 boolean
Dataset: Synthetic retail purchase orders

VARIABLE PROFILES


customer_age (Numeric)

Mean: 37.9 | Median: 38.0 | Std: 11.4
Range: -5 to 84
Missing: 15 (1.5%)
⚠️ DATA QUALITY: 3 negative ages (likely errors)
Action: Fix impossible values before analysis

total_amount (Numeric)

Mean: $130.45 | Median: $94.67 | Std: $119.31
Range: $8.57 - $1158.04
Distribution: Right-skewed (skew=2.91)
Outliers (IQR method): 70 orders (>$320 or <-$104) = 7.1% of non-missing values
Missing: 8 (0.8%)

quantity (Numeric)

Mean: 4.0 | Median: 4.0 | Range: 1-10
Distribution: Most orders have 3-4 items

category (Categorical)

Unique values: 5
Top 3 categories:

  • Electronics: 229 (22.9%)
  • Clothing: 211 (21.1%)
  • Sports: 191 (19.1%)

Distribution: Relatively balanced

satisfaction (Numeric - Ordinal Scale)

Mean: 3.77 | Median: 4.0 | Range: 1-5
Distribution: Left-skewed (most customers satisfied)
Rating breakdown:

  • 1 star: 52 (5.2%)
  • 2 stars: 101 (10.1%)
  • 3 stars: 196 (19.6%)
  • 4 stars: 332 (33.2%)
  • 5 stars: 319 (31.9%)

delivery_days (Numeric)

Mean: 5.0 days | Median: 5.0 | Range: 1-13

returned (Boolean)

Return rate: 79 orders (7.9%)
Non-returned: 921 orders (92.1%)

CORRELATION ANALYSIS


MISSING DATA ANALYSIS


Variables with missing data:

  • customer_age: 15 (1.5%) - Low impact
  • total_amount: 8 (0.8%) - Low impact

Pattern assessment: Random (MCAR) - missing values appear independent
Imputation recommendation:

  • customer_age: Median imputation or remove rows (1.5% missing)
  • total_amount: Median or drop rows (0.8% missing)

DATA QUALITY ALERTS


High Priority:

  1. Fix 3 impossible customer ages (negative values)
    → Set to NaN or impute with median age

Medium Priority:

  1. Review 70 high-value orders (>$320)
    → Verify these are legitimate large purchases, not data errors
  2. Consider imputation strategy for 23 missing values

Low Priority:

  1. Return rate (7.9%) is within normal e-commerce range

RECOMMENDATIONS


Data Cleaning:

  1. Fix customer_age impossible values:
    → Replace negative ages with NaN
    → Impute with median (38) or remove 18 affected rows (3 negative + 15 already missing, 1.8%)
  2. Handle missing total_amount values:
    → Median imputation or drop 8 rows (0.8%)
  3. Consider capping extreme outliers if modeling

Analysis Strategy:

  1. Segment analysis by category:
    → Compare satisfaction and return rates across product types
    → Identify high-value categories
  2. Investigate delivery impact:
    → Correlation between delivery_days and satisfaction
    → Set delivery time targets to optimize satisfaction
  3. Customer segmentation by age:
    → Analyze purchase patterns across age groups
    → Target marketing based on age preferences
  4. Return analysis:
    → Identify categories with highest return rates
    → Correlate returns with satisfaction scores

SUMMARY STATISTICS TABLE


       order_id  customer_age  ...  satisfaction  delivery_days
count   1000.00        985.00  ...       1000.00        1000.00
mean    1500.50         37.74  ...          3.76           5.00
std      288.82         11.61  ...          1.15           2.05
min     1001.00         -5.00  ...          1.00           1.00
25%     1250.75         30.00  ...          3.00           4.00
50%     1500.50         38.00  ...          4.00           5.00
75%     1750.25         45.00  ...          5.00           6.00
max     2000.00         84.00  ...          5.00          13.00

[8 rows x 6 columns]

======================================================================
Quality Score: 96/100 | Dataset Status: Excellent
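
The cleaning steps the report recommends (treat negative ages as missing, impute with the median, flag outliers by the IQR rule) can be sketched in plain Python. This is an illustration under stated assumptions, not the profiler's own code: the helper names `clean_ages` and `iqr_fences` are hypothetical, missing values are represented as `None` rather than NaN, and the conventional multiplier of 1.5 is assumed for the IQR fences.

```python
import statistics


def clean_ages(ages):
    """Treat negative ages as missing, then fill every gap with the median."""
    as_missing = [a if a is not None and a >= 0 else None for a in ages]
    median_age = statistics.median(a for a in as_missing if a is not None)
    return [median_age if a is None else a for a in as_missing]


def iqr_fences(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR, ignoring None."""
    present = [v for v in values if v is not None]
    # method="inclusive" interpolates linearly between the observed data points.
    q1, _, q3 = statistics.quantiles(present, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Usage: the negative age and the missing one both become the median (31).
print(clean_ages([25, -5, None, 40, 31]))  # [25, 31, 31, 40, 31]

# Values outside the fences would be flagged for review, not dropped automatically.
lower, upper = iqr_fences([1, 2, 3, 4, 5, 6, 7, 8, 100, None])
print(lower, upper)  # -3.0 13.0
```

Computing the median after the negative values are set aside matters: otherwise the impossible ages would pull the imputed value down slightly.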