All examples for Churn Risk Scoring Model Builder

Small Dataset Feasibility Analysis

Guidance on building a churn model with only 28 users (below the 30-user minimum), including statistical limitations, confidence thresholds, and alternative approaches.

Input

What you provide to the skill

I have 28 users with login_count, feature_adoption, and a churned column. Can I build a model with this smaller dataset?

Output

What the skill generates for you

Small Dataset Feasibility Analysis

Your Situation

  • Sample size: 28 users
  • Features: login_count, feature_adoption, churned
  • Below minimum: The skill recommends a minimum of 30 users

Can You Build a Model? YES, with caveats

Option 1: Override Minimum Sample Size (Recommended)

You can bypass the 30-user validation by using the --min-sample flag:

cd /workspace/webserver/lib/smartvox/skill_hub/skills/churn-risk-scoring-model-builder
uv run scripts/calculate_churn_risk.py your_data.csv --format markdown --min-sample 20 > analysis.md

This will allow the script to process your 28-user dataset.

Critical Requirements You MUST Meet

Even with 28 users, you need:

  1. Minimum churn representation: At least 5% churned users

    • With 28 users, 5% rounds up to at least 2 churned users (2/28 ≈ 7%)
    • Ideally 3-4 churned users (10-14%) for meaningful patterns
  2. Binary churned column: Must contain only 0 (retained) or 1 (churned)

  3. Numeric behavioral columns: login_count and feature_adoption must be numeric values
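These three requirements can be checked with pandas before running the script. The helper below is an illustrative sketch (the function name and the tiny demo frame are mine, not part of the skill); it assumes the column names above:

```python
import pandas as pd

def validate_churn_data(df: pd.DataFrame) -> float:
    """Check the skill's data requirements; return the churn rate."""
    # churned must be strictly binary: 0 (retained) or 1 (churned)
    if not set(df["churned"].unique()) <= {0, 1}:
        raise ValueError("churned must contain only 0/1")
    # behavioral columns must hold numeric values
    for col in ("login_count", "feature_adoption"):
        if not pd.api.types.is_numeric_dtype(df[col]):
            raise ValueError(f"{col} must be numeric")
    # at least ~5% churned; with 28 users that means 2+ churned users
    rate = df["churned"].mean()
    if rate < 0.05:
        raise ValueError("need at least 5% churned users")
    return rate

# tiny synthetic frame to show the call; on real data use pd.read_csv("your_data.csv")
demo = pd.DataFrame({
    "login_count": [12, 3, 8, 1],
    "feature_adoption": [0.8, 0.1, 0.6, 0.05],
    "churned": [0, 1, 0, 1],
})
print(f"churn rate: {validate_churn_data(demo):.0%}")
```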

Statistical Limitations with 28 Users

Reduced confidence:

  • Correlation coefficients will have wider confidence intervals
  • P-values may not reach statistical significance (p < 0.05)
  • Risk of overfitting to noise rather than true patterns

Minimum threshold per correlation:

  • With 28 users, you need |r| > 0.37 for statistical significance (p < 0.05)
  • Weaker correlations (|r| < 0.37) should be treated as exploratory
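The 0.37 cutoff is not arbitrary: for a Pearson correlation over n samples, significance at p < 0.05 (two-tailed) requires |r| > t_crit / sqrt(t_crit² + n − 2). A quick check (SciPy is used here for illustration; it is not stated as a dependency of the skill):

```python
from math import sqrt
from scipy.stats import t

n = 28
dof = n - 2                     # degrees of freedom for Pearson's r
t_crit = t.ppf(0.975, dof)      # two-tailed critical t at alpha = 0.05
r_crit = t_crit / sqrt(t_crit**2 + dof)
print(f"with n = {n}, |r| must exceed {r_crit:.2f} for p < 0.05")
```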

What this means:

  • If the model shows login_count has r = -0.65, that’s likely meaningful
  • If it shows feature_adoption has r = -0.25, that’s too weak to trust with 28 users

Expected Model Quality

Best case scenario (strong signals):

  • 2 behavioral features with clear patterns
  • Correlations > 0.6 in absolute value
  • Model can identify highest-risk users
  • Use for: Prioritizing who to reach out to first

Realistic scenario (moderate signals):

  • Correlations in 0.4-0.6 range
  • Model provides rough risk ordering
  • Use for: Directional insights, not precise predictions

Poor scenario (weak signals):

  • Correlations < 0.4
  • Model has limited predictive power
  • Warning issued: “Weak correlations detected. Model may have limited predictive power.”

Recommended Approach with 28 Users

1. Run the analysis:

uv run scripts/calculate_churn_risk.py your_data.csv \
--format markdown \
--min-sample 20 \
--verbose > small_dataset_analysis.md

2. Check correlation strength:

  • Look at the “Risk Signal Analysis” table
  • Focus on signals with absolute correlation > 0.5
  • Ignore signals with p-value > 0.05
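To reproduce that filter outside the script, scipy.stats.pearsonr returns both r and the p-value for each feature. The data below is made up for illustration; the 0.5 cutoff is the one recommended above:

```python
import pandas as pd
from scipy.stats import pearsonr

# made-up data for illustration; on real data, load your CSV instead
df = pd.DataFrame({
    "login_count":      [2, 15, 1, 20, 3, 18, 2, 25, 4, 22],
    "feature_adoption": [0.1, 0.7, 0.05, 0.9, 0.2, 0.8, 0.1, 0.95, 0.3, 0.85],
    "churned":          [1, 0, 1, 0, 1, 0, 1, 0, 0, 0],
})

results = {}
for col in ("login_count", "feature_adoption"):
    r, p = pearsonr(df[col], df["churned"])
    results[col] = (r, p)
    verdict = "usable" if abs(r) > 0.5 and p < 0.05 else "exploratory only"
    print(f"{col}: r = {r:+.2f}, p = {p:.3f} -> {verdict}")
```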

3. Use conservatively:

  • Treat results as directional insights, not precise predictions
  • Focus on users scoring 80+ (clear red flags)
  • Don’t over-interpret middle scores (40-60 range)

4. Validate manually:

  • Review top 5 highest-risk users
  • Do their behavioral patterns make intuitive sense?
  • Are churned users actually scoring higher than retained?
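These checks can be partially automated once you have scored output. The risk_score column and the merged layout below are hypothetical; substitute whatever columns the script actually emits:

```python
import pandas as pd

# hypothetical scored output: the script's risk scores merged with actual outcomes
scored = pd.DataFrame({
    "user_id":    [1, 2, 3, 4, 5, 6],
    "risk_score": [85, 20, 72, 15, 90, 35],
    "churned":    [1, 0, 1, 0, 1, 0],
})

# top 5 highest-risk users for manual review
print(scored.nlargest(5, "risk_score"))

# sanity check: churned users should score higher on average than retained ones
means = scored.groupby("churned")["risk_score"].mean()
print(f"retained avg: {means[0]:.0f}, churned avg: {means[1]:.0f}")
```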

5. Plan to iterate:

  • As you gather more users (30, 50, 100+), rerun the model
  • Compare how risk scores change with more data
  • Update intervention thresholds based on real outcomes

Alternative: Descriptive Analysis

If you find weak correlations, don’t force a predictive model. Instead:

Manual comparison:

# What you can do with pandas directly
import pandas as pd

df = pd.read_csv("your_data.csv")
churned_users = df[df['churned'] == 1]
retained_users = df[df['churned'] == 0]
print("Churned users average login_count:", churned_users['login_count'].mean())
print("Retained users average login_count:", retained_users['login_count'].mean())

Look for patterns:

  • Do churned users have consistently lower login_count?
  • Is there a threshold (e.g., <5 logins) where most churned users fall?
  • What feature_adoption level distinguishes the groups?
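The threshold question can be answered directly in pandas: for a candidate cutoff, compare the share of churned vs retained users who fall below it (the data here is made up for illustration):

```python
import pandas as pd

# made-up example; substitute your real 28-user frame
df = pd.DataFrame({
    "login_count": [2, 15, 1, 20, 3, 18, 4, 25],
    "churned":     [1, 0, 1, 0, 1, 0, 1, 0],
})

cutoff = 5
churned_below = (df.loc[df["churned"] == 1, "login_count"] < cutoff).mean()
retained_below = (df.loc[df["churned"] == 0, "login_count"] < cutoff).mean()
print(f"below {cutoff} logins: {churned_below:.0%} of churned vs {retained_below:.0%} of retained")
```

If the two shares separate cleanly, the cutoff itself may be a more honest deliverable than a fitted model at this sample size.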

When to Wait for More Data

DON’T run the model yet if:

  • Fewer than 2 churned users (can’t identify patterns)
  • All users have very similar behavior (no variance)
  • You have clear data quality issues (missing values, inconsistent definitions)

Wait until:

  • You have 50+ users for reliable correlations
  • At least 5 churned users for stable patterns
  • Consistent behavioral tracking (same definitions across all users)

Bottom Line

Yes, you can build a model with 28 users, but:

  • Use --min-sample 20 to override validation
  • Interpret results conservatively (directional, not definitive)
  • Focus on strong signals only (|r| > 0.5)
  • Validate insights manually before acting
  • Plan to rerun with more data soon

Expected outcome:
You’ll get risk scores for all 28 users, but treat this as version 0.1 of your model: useful for initial prioritization, not production-grade predictions.