All examples for Categorical Variable Profiler

Rare Category Detection and Data Quality Check

This example focuses on identifying rare categories (below a 1% threshold) and assessing overall data quality. It demonstrates the skill's ability to flag small categories, detect data quality issues, and provide exploratory insights with appropriate statistical caveats for small sample sizes. It also shows how chi-square assumption violations are handled and reported transparently.

Input

What you provide to the skill

Check category distributions in /tmp/test_customer_data.csv and flag any rare categories

Output

What the skill generates for you


Analysis Complete! 🎉

I’ve successfully executed the categorical-profiler skill on your test customer data. Here’s what was delivered:

📊 Key Findings Summary

Rare Categories: No categories fell below the 1% threshold, but “Dissatisfied” is the smallest at 13.3% (n=4).
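
The rare-category check reported above can be sketched in a few lines. This is a minimal illustration and not the skill's actual implementation: the function name `find_rare_categories` is hypothetical, and only the "Dissatisfied" count (n=4 of 30) comes from the report, while the split of the remaining 26 rows is made up for the example.

```python
from collections import Counter

def find_rare_categories(values, threshold=0.01):
    """Return {category: (count, share)} for categories whose share is below threshold."""
    counts = Counter(values)
    total = sum(counts.values())
    return {cat: (n, n / total) for cat, n in counts.items() if n / total < threshold}

# Illustrative column: "Dissatisfied" n=4 of 30 matches the report;
# the counts for the other three levels are hypothetical.
satisfaction = (
    ["Dissatisfied"] * 4
    + ["Neutral"] * 6
    + ["Satisfied"] * 14
    + ["Very Satisfied"] * 6
)

rare = find_rare_categories(satisfaction, threshold=0.01)

# Even when nothing is rare, it is useful to report the smallest category.
smallest, n_smallest = min(Counter(satisfaction).items(), key=lambda kv: kv[1])
smallest_pct = round(100 * n_smallest / len(satisfaction), 1)

print("rare categories:", rare)
print("smallest:", smallest, n_smallest, smallest_pct)
```

With n=30, a single observation already accounts for about 3.3% of the rows, so no non-empty category can fall below a 1% threshold in a sample this small.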

Critical Business Insights:

  1. West Region Crisis 🚨

    • 57% dissatisfaction rate (vs 0% in South)
    • 100% of West customers buy only Clothing
    • Pattern consistent with regional product quality issues (exploratory at this sample size)
  2. Clothing Category Problems ⚠️

    • ALL 4 dissatisfied customers bought Clothing
    • Statistically significant association (p=0.011, large effect size)
    • Points to a possible product quality or fulfillment issue
  3. South Region Excellence

    • 100% positive satisfaction (75% satisfied, 25% very satisfied)
    • Should be studied as a model for other regions

📈 Statistical Associations Found

Three significant relationships (after multiple testing correction):

  • Region × Product_Category (p<0.001, V=0.780 - very strong)
  • Region × Satisfaction (p=0.033, V=0.474 - moderate-large)
  • Satisfaction × Product_Category (p=0.011, V=0.566 - large)
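
For readers unfamiliar with the V values listed above, the sketch below computes Cramér's V from a contingency table using only the standard library. It is an illustration under stated assumptions, not the skill's code: the function names are hypothetical, and the 2×2 table is a collapsed view (West vs. all other regions, Dissatisfied vs. not) inferred from the reported figures by assuming that 57% means 4 of 7 West customers. Because the table is collapsed, its V will not match the V=0.474 reported for the full Region × Satisfaction table. The code also assumes no row or column is entirely zero.

```python
import math

def expected_counts(table):
    """Expected cell counts under independence: row_total * col_total / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def cramers_v(table):
    """Return (chi-square statistic, Cramér's V) for a table of observed counts."""
    n = sum(sum(row) for row in table)
    expected = expected_counts(table)
    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    k = min(len(table), len(table[0])) - 1  # min(rows, cols) - 1
    return chi2, math.sqrt(chi2 / (n * k))

# Rows: West, other regions. Columns: Dissatisfied, not dissatisfied.
# Collapsed, inferred table (an assumption, see the note above).
table = [[4, 3], [0, 23]]
chi2, v = cramers_v(table)
print(f"chi2={chi2:.2f}, V={v:.3f}")
```

V ranges from 0 (no association) to 1 (perfect association), which is why it is reported alongside the p-values as an effect size.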

📁 Generated Files

Distribution Charts (4 files):

  • Region_distribution.png
  • Age_Group_distribution.png
  • Satisfaction_distribution.png
  • Product_Category_distribution.png

Cross-tabulation Heatmaps (6 files):

  • Region_vs_Age_Group_heatmap.png
  • Region_vs_Satisfaction_heatmap.png
  • Region_vs_Product_Category_heatmap.png
  • Age_Group_vs_Satisfaction_heatmap.png
  • Age_Group_vs_Product_Category_heatmap.png
  • Satisfaction_vs_Product_Category_heatmap.png

⚠️ Important Caveat

The sample size (n=30) is too small for reliable chi-square testing: expected cell frequencies were below 5 in most cells, which violates the test's assumptions. The patterns detected are exploratory and should be validated with a larger sample (a minimum of n=100 is recommended).
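
The expected-frequency check behind this caveat can be sketched as follows. This is a hedged illustration, not the skill's implementation: `check_chi_square_assumptions` is a hypothetical helper, the thresholds follow the commonly cited rule of thumb (no more than 20% of expected counts below 5, and none below 1), and the 2×2 table is the same collapsed West vs. other regions view inferred from the reported figures, not the full table the skill analyzed.

```python
def expected_counts(table):
    """Expected cell counts under independence: row_total * col_total / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def check_chi_square_assumptions(table, min_expected=5.0, max_share_small=0.20):
    """Summarize how many expected counts fall below min_expected."""
    cells = [e for row in expected_counts(table) for e in row]
    n_small = sum(e < min_expected for e in cells)
    share_small = n_small / len(cells)
    return {
        "min_expected": min(cells),
        "share_below_min": share_small,
        # Rule of thumb: no expected count below 1, at most 20% below 5.
        "ok": min(cells) >= 1.0 and share_small <= max_share_small,
    }

# Rows: West, other regions. Columns: Dissatisfied, not dissatisfied.
# Collapsed, inferred table (an assumption, see the note above).
report = check_chi_square_assumptions([[4, 3], [0, 23]])
print(report)
```

When the check fails on a 2×2 table, Fisher's exact test is a common alternative; for larger tables, merging sparse categories or collecting more data are the usual options.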