All examples for Automated Dataset Profiler

Focused Analysis - Outliers and Correlations

Demonstrates requesting a profile with a specific analytical focus. The skill generates demo data, then reports outliers detected with both the IQR and Z-score methods, along with a full correlation matrix and an interpretation of the relationships.
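The two outlier methods named here can be sketched in plain Python. This is a minimal illustration, not the skill's actual implementation: the function names, the 1.5 multiplier default, the "inclusive" quantile method, and the use of the sample standard deviation are all assumptions. The 1.5 x IQR convention does match the report's customer_age bounds: with Q1 = 30 and Q3 = 45, the fences come out at 7.5 and 67.5.

```python
import statistics

def iqr_bounds(q1, q3, k=1.5):
    """Return the (lower, upper) fences q1 - k*IQR and q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def iqr_outliers(values, k=1.5):
    """Values that fall outside the IQR fences."""
    # Assumption: quartiles via the stdlib "inclusive" method; the skill's
    # own quantile method is not documented in the report.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    lower, upper = iqr_bounds(q1, q3, k)
    return [v for v in values if v < lower or v > upper]

def zscore_outliers(values, threshold=3.0):
    """Values whose |z| exceeds the threshold."""
    # Assumption: sample standard deviation (n - 1 in the denominator).
    mean = statistics.fmean(values)
    std = statistics.stdev(values)
    return [v for v in values if abs(v - mean) / std > threshold]

# Quartiles reported for customer_age in the sample output.
print(iqr_bounds(30.0, 45.0))  # (7.5, 67.5)
```

The two methods can disagree: the IQR fences depend only on the quartiles, while z-scores are pulled by the extreme values themselves, which is one reason a profile may report both counts side by side.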

Input

What you provide to the skill

Profile a sample dataset - I’m particularly interested in finding outliers and understanding correlations between variables

Output

What the skill generates for you

================================================================================
DATASET PROFILE REPORT: Demo E-Commerce Dataset

DATASET OVERVIEW


Rows: 1,000 | Columns: 8 | Missing: 23 cells (0.29%)
Data types: 6 numeric, 1 categorical, 1 boolean
Memory usage: 0.11 MB
Generated: Synthetic e-commerce order data

VARIABLE PROFILES


order_id (Numeric)

Mean: 1500.50 | Median: 1500.50 | Std: 288.82
Range: 1001.00 to 2000.00
Q1: 1250.75 | Q3: 1750.25

customer_age (Numeric)

Mean: 37.74 | Median: 38.00 | Std: 11.61
Range: -5.00 to 84.00
Q1: 30.00 | Q3: 45.00
Missing: 15 (1.5%)
OUTLIERS (IQR method): 10 values (1.0%)
Lower bound: 7.50 | Upper bound: 67.50
OUTLIERS (Z-score |z|>3): 5 extreme values
⚠️ DATA QUALITY ALERT: 3 negative ages (impossible values)
Action: Fix impossible values before analysis

total_amount (Numeric)

Mean: 130.45 | Median: 94.67 | Std: 119.31
Range: 8.57 to 1158.04
Q1: 55.33 | Q3: 161.27
Missing: 8 (0.8%)
OUTLIERS (IQR method): 70 values (7.1%)
Lower bound: -103.58 | Upper bound: 320.18
OUTLIERS (Z-score |z|>3): 20 extreme values
Distribution: right-skewed (skewness: 2.91)

quantity (Numeric)

Mean: 3.96 | Median: 4.00 | Std: 1.66
Range: 1.00 to 10.00
Q1: 3.00 | Q3: 5.00
OUTLIERS (IQR method): 6 values (0.6%)
Lower bound: 0.00 | Upper bound: 8.00
OUTLIERS (Z-score |z|>3): 6 extreme values

satisfaction (Numeric)

Mean: 3.77 | Median: 4.00 | Std: 1.15
Range: 1.00 to 5.00
Q1: 3.00 | Q3: 5.00

delivery_days (Numeric)

Mean: 5.00 | Median: 5.00 | Std: 2.05
Range: 1.00 to 13.00
Q1: 4.00 | Q3: 6.00
OUTLIERS (IQR method): 28 values (2.8%)
Lower bound: 1.00 | Upper bound: 9.00
OUTLIERS (Z-score |z|>3): 4 extreme values

category (Categorical)

Unique values: 5
Top categories:
- Electronics: 229 (22.9%)
- Clothing: 211 (21.1%)
- Sports: 191 (19.1%)
- Books: 190 (19.0%)
- Home: 179 (17.9%)

returned (Boolean)

False: 921 (92.1%)
True: 79 (7.9%)

CORRELATION ANALYSIS


No strong correlations found (all |r| < 0.3)

Full Correlation Matrix:

                 order_id  customer_age  total_amount  quantity  satisfaction  delivery_days
order_id             1.00          0.02         -0.03      0.01         -0.04          -0.00
customer_age         0.02          1.00         -0.01     -0.03          0.03           0.03
total_amount        -0.03         -0.01          1.00      0.00         -0.05          -0.03
quantity             0.01         -0.03          0.00      1.00          0.00          -0.04
satisfaction        -0.04          0.03         -0.05      0.00          1.00           0.02
delivery_days       -0.00          0.03         -0.03     -0.04          0.02           1.00

MISSING DATA ANALYSIS


Variable       Missing Count   Missing %
customer_age   15              1.5%
total_amount   8               0.8%

Pattern Assessment:
• Low overall missingness (<5%) - appears random
• Recommendation: Median/mode imputation or listwise deletion

DATA QUALITY ALERTS


HIGH PRIORITY:

  1. Fix 3 negative customer ages (impossible values)

MEDIUM PRIORITY:

  1. Review 20 extreme outliers in total_amount (|z| > 3)
  2. Review 6 extreme outliers in quantity (|z| > 3)
  3. Review 69 unusually high-value orders

RECOMMENDATIONS


DATA CLEANING:

  1. Fix customer_age impossible values (set to NaN or median imputation)
  2. Impute missing values using median for numeric, mode for categorical

ANALYSIS STRATEGY:

  1. Segment analysis by category to understand purchasing patterns
  2. Investigate correlation between quantity and total_amount
  3. Analyze relationship between delivery_days and satisfaction
  4. Study return patterns by category and customer demographics
  5. Focus on outlier orders (both high and low) to identify patterns

Quality Score: 87/100
Processing Time: <1 second
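
As a follow-up to the report, its cleaning recommendations (treat impossible ages as missing, then impute the median) and its correlation screen (flag pairs with |r| of at least 0.3) might be sketched as below. This is a hedged illustration under stated assumptions: the helper names and the toy rows are hypothetical and are not taken from the demo dataset, and only numeric columns are handled.

```python
import math
import statistics

def clean_ages(ages):
    """Replace impossible (negative) ages with None so they count as missing."""
    return [a if a is not None and a >= 0 else None for a in ages]

def impute_median(values):
    """Fill None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def strong_pairs(columns, threshold=0.3):
    """Return (name_a, name_b, r) for every column pair with |r| >= threshold."""
    names = list(columns)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson_r(columns[a], columns[b])
            if abs(r) >= threshold:
                pairs.append((a, b, round(r, 2)))
    return pairs

# Hypothetical toy rows for illustration only.
columns = {
    "customer_age": impute_median(clean_ages([34, -5, 41, None, 29, 52])),
    "quantity": [1, 2, 3, 4, 5, 6],
    "total_amount": [20.0, 41.0, 59.0, 80.0, 102.0, 119.0],
}
print(strong_pairs(columns))
```

Cleaning before correlating matters here because a negative age would otherwise enter both the median and the correlation sums. Median imputation keeps the row count intact but slightly shrinks the variance of the imputed column, so correlations computed afterwards should be read with that in mind.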