All examples for Correlation Matrix Explorer

Housing Data Multicollinearity Check

Pre-regression multicollinearity analysis for a 200-home dataset. The check flags a problematic correlation between beds and baths (r = 0.821), recommends excluding baths, and identifies sqft as the strongest price predictor (r = 0.889) for feature selection.

Input

What you provide to the skill

Generate housing dataset (200 homes: price, sqft, beds, baths, lot_size, age, garage). Check multicollinearity for regression modeling

Output

What the skill generates for you

======================================================================
MULTICOLLINEARITY CHECK FOR REGRESSION MODELING

Dataset Overview:

  • Observations: 200
  • Variables: 7
  • Missing values: 0

CORRELATIONS WITH TARGET (price):

Strong (|r| > 0.6):

  • sqft: r = 0.889 ***

Moderate (0.3 < |r| < 0.6):

  • beds: r = 0.476 ***
  • baths: r = 0.413 ***
  • lot_size: r = 0.374 ***

Weak (|r| < 0.3):

  • age: r = -0.116
  • garage: r = -0.097
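
The grouping above can be reproduced with a small helper. This is a minimal sketch, not the skill's actual code: the names `pearson_r` and `strength` are hypothetical, and the handling of values exactly at 0.3 or 0.6 (placed in the lower bucket) is an assumption, since the report leaves those boundaries undefined.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def strength(r):
    """Bucket |r| with the report's thresholds (boundary handling is assumed)."""
    magnitude = abs(r)
    if magnitude > 0.6:
        return "Strong"
    if magnitude > 0.3:
        return "Moderate"
    return "Weak"

# Correlations with price, copied from the report above.
reported = {"sqft": 0.889, "beds": 0.476, "baths": 0.413,
            "lot_size": 0.374, "age": -0.116, "garage": -0.097}
labels = {name: strength(r) for name, r in reported.items()}
```

Applied to the reported values, `labels` places sqft in the strong group, beds, baths, and lot_size in the moderate group, and age and garage in the weak group.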

MULTICOLLINEARITY ISSUES:

🚨 CRITICAL (|r| > 0.8):

  • beds ↔ baths: r = 0.821
    → DO NOT include both. Keep beds (stronger price predictor)

✓ No moderate multicollinearity (0.7 < |r| < 0.8)
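
One way the flagging rule above could be expressed is sketched below. It assumes the report's thresholds (critical above 0.8, moderate between 0.7 and 0.8) and assumes the tie-break "keep the variable with the stronger correlation to price"; the function name `flag_pairs` and the dictionary layout are hypothetical, and only a subset of the reported pairs is shown.

```python
CRITICAL = 0.8  # |r| above this is flagged as critical
MODERATE = 0.7  # |r| between MODERATE and CRITICAL is flagged as moderate

# Correlations with price, copied from the report.
target_r = {"sqft": 0.889, "beds": 0.476, "baths": 0.413,
            "lot_size": 0.374, "age": -0.116, "garage": -0.097}

# A subset of the predictor-pair correlations from the report.
pair_r = {("beds", "baths"): 0.821,
          ("sqft", "beds"): 0.268,
          ("sqft", "baths"): 0.189}

def flag_pairs(pair_r, target_r):
    """Return flagged pairs with a keep/drop suggestion for each."""
    flagged = []
    for (a, b), r in pair_r.items():
        if abs(r) > CRITICAL:
            level = "critical"
        elif abs(r) > MODERATE:
            level = "moderate"
        else:
            continue  # below both thresholds: not a concern
        # Keep whichever variable correlates more strongly with the target.
        keep, drop = (a, b) if abs(target_r[a]) >= abs(target_r[b]) else (b, a)
        flagged.append({"pair": (a, b), "r": r, "level": level,
                        "keep": keep, "drop": drop})
    return flagged

flagged = flag_pairs(pair_r, target_r)
```

With these inputs the only flagged pair is beds and baths, and the suggestion is to keep beds and drop baths, which matches the report.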


TOP 10 PREDICTOR PAIRS (excluding price):

Variable_1  Variable_2  Correlation  P_Value  Strength
beds        baths             0.821   0.0000  Strong
sqft        beds              0.268   0.0001  Weak
sqft        baths             0.189   0.0074  Weak
baths       lot_size          0.174   0.0139  Weak
beds        lot_size          0.173   0.0140  Weak
sqft        garage           -0.163   0.0212  Weak
sqft        lot_size          0.157   0.0262  Weak
baths       garage           -0.120   0.0919  Weak
beds        garage           -0.107   0.1312  Weak
baths       age              -0.056   0.4317  Weak


RECOMMENDATIONS FOR REGRESSION MODELING:

Variables with multicollinearity concerns:

  • Exclude: baths (redundant with beds, r = 0.821)

Recommended model specification:
Include: sqft, beds, lot_size, age, garage
Exclude: baths

Next steps:

  1. Fit regression with recommended variables
  2. Calculate VIF (Variance Inflation Factor) for all predictors
  3. If VIF > 10 for any variable, consider removing it
  4. Check residual plots for linearity and homoscedasticity
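
The VIF step can be illustrated from a correlation matrix alone, because the VIF of predictor i equals the i-th diagonal entry of the inverse of the predictors' correlation matrix. The sketch below is a standard-library illustration under that identity, not the skill's own implementation; `invert` and `vif_from_correlation` are hypothetical helper names, and the example uses only the beds/baths pair because the report does not list every pairwise correlation.

```python
def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    # Augment each row with the matching row of the identity matrix.
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular matrix (perfect collinearity)")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def vif_from_correlation(names, corr):
    """VIF per predictor: the diagonal of the inverse correlation matrix."""
    inverse = invert(corr)
    return {name: inverse[i][i] for i, name in enumerate(names)}

# Two-predictor case using the reported beds/baths correlation.
vifs = vif_from_correlation(["beds", "baths"],
                            [[1.0, 0.821],
                             [0.821, 1.0]])
```

For two predictors this reduces to 1 / (1 - r²), which is about 3.07 for r = 0.821. Considering this pair alone, the VIF sits below the threshold of 10 named in step 3, so the pairwise |r| > 0.8 rule used in the report is the stricter of the two screens here.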

======================================================================
KEY INSIGHTS:

  1. sqft is the strongest predictor (r = 0.889)
  2. Only 1 variable shows strong correlation with price
  3. 1 predictor pair shows high multicollinearity (|r| > 0.8)
  4. Removing 1 redundant variable is recommended
  5. Sample size (n=200) is good - correlations |r| > 0.2 are reliable
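
The sample-size remark in insight 5 can be checked with the usual t test for a correlation coefficient, t = r·sqrt(n - 2) / sqrt(1 - r²). The following is a rough sketch: the function names are hypothetical, and the critical value 1.972 is an assumed approximation of the two-sided 5% t quantile for 198 degrees of freedom.

```python
from math import sqrt

def t_statistic(r, n):
    """t statistic for testing zero correlation, with n - 2 degrees of freedom."""
    return r * sqrt(n - 2) / sqrt(1 - r * r)

def critical_r(t_crit, n):
    """Smallest |r| whose t statistic reaches t_crit at sample size n."""
    df = n - 2
    return t_crit / sqrt(t_crit ** 2 + df)

T_CRIT_05 = 1.972  # assumed approximate critical value, two-sided alpha = 0.05, df = 198

t_at_02 = t_statistic(0.2, 200)     # t statistic for |r| = 0.2 with n = 200
r_min = critical_r(T_CRIT_05, 200)  # approximate significance cutoff for |r|
```

Under these assumptions the cutoff for n = 200 is roughly |r| = 0.14, which agrees with the pair table: r = 0.157 is reported with p = 0.0262, while r = -0.120 is reported with p = 0.0919. The |r| > 0.2 rule in the report is therefore a conservative margin above that cutoff.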

======================================================================
Files generated:

  • correlation_heatmap.png
  • correlation_results.csv
  • housing_data.csv

⚠️ Correlation ≠ Causation. These relationships are associative,
not causal. Further analysis needed to establish causal effects.