All examples for Statistical Test Selection Guide

Small Pilot Study Proportion Comparison

Demonstrates Fisher's exact test for comparing treatment effectiveness with small samples, including power analysis guidance and honest interpretation of limitations.

Input

What you provide to the skill

Small pilot: 8 people tried treatment A (5 improved), 7 tried treatment B (4 improved). Binary outcome (improved/not). Which is more effective?

Output

What the skill generates for you

Recommended Test: Fisher’s Exact Test

Why This Test?

Based on your description:

  • Binary outcome (improved/not improved)
  • Two independent groups (Treatment A vs. Treatment B)
  • Small sample sizes (n=8 and n=7)
  • Comparing proportions between groups
  • Fisher’s exact test is preferred over chi-square for small samples (a quick way to verify this is shown below)
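
If you want to check the small-sample concern yourself, the usual rule of thumb is that the chi-square approximation becomes unreliable once any expected cell count falls below 5. A minimal check with scipy (chi2_contingency returns the expected counts as its fourth value):

from scipy.stats import chi2_contingency
import numpy as np

table = np.array([[5, 3],   # Treatment A: improved, not improved
                  [4, 3]])  # Treatment B: improved, not improved
_, _, _, expected = chi2_contingency(table)
print(expected)
# [[4.8 3.2]
#  [4.2 2.8]]  -- every expected count is below 5,
# so Fisher's exact test is the safer choice here.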

Your Data Summary

Treatment A: 5/8 improved (62.5%)
Treatment B: 4/7 improved (57.1%)
Absolute difference: 5.4 percentage points

Assumptions

  1. Independence: Each person’s outcome is independent of others ✓
  2. Binary outcome: Improved vs. not improved ✓
  3. Small sample: Fisher’s exact is designed for this scenario ✓

Python Code

from scipy.stats import fisher_exact
import numpy as np

# Data: [improved, not_improved]
treatment_a = [5, 3]  # 5 improved, 3 did not
treatment_b = [4, 3]  # 4 improved, 3 did not

# Create the 2x2 contingency table
table = np.array([treatment_a, treatment_b])

# Run Fisher's exact test (two-sided)
odds_ratio, p_value = fisher_exact(table, alternative='two-sided')

print("=== RESULTS ===")
print(f"Treatment A: {treatment_a[0]}/{sum(treatment_a)} improved ({treatment_a[0]/sum(treatment_a):.1%})")
print(f"Treatment B: {treatment_b[0]}/{sum(treatment_b)} improved ({treatment_b[0]/sum(treatment_b):.1%})")
print(f"\nOdds Ratio: {odds_ratio:.3f}")
print(f"P-value (two-sided): {p_value:.4f}")

if p_value < 0.05:
    print("\n✓ Statistically significant difference (p < 0.05)")
else:
    print(f"\n✗ No statistically significant difference (p = {p_value:.4f})")
    print("  With such small samples, we cannot detect a reliable difference.")

# Effect size: Absolute Risk Difference and Relative Risk
p_a = treatment_a[0] / sum(treatment_a)
p_b = treatment_b[0] / sum(treatment_b)
arr = p_a - p_b
relative_risk = p_a / p_b
print(f"\nAbsolute Risk Difference: {arr:.1%}")
print(f"Relative Risk: {relative_risk:.3f}")
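
For this table the headline numbers can be verified by hand: OR = (5×3)/(3×4) = 1.25, relative risk = 0.625/0.571 ≈ 1.09, and the two-sided p-value is exactly 1.0 because the observed table is the most probable arrangement given the margins. Running the code should reproduce these values.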

Interpretation

Statistical Significance:
With a two-sided p-value of 1.0 (the observed table is in fact the most probable one given the margins), there is no statistically significant difference between the two treatments.

Why?
Your sample is extremely small (n=15 total). Fisher’s exact test correctly accounts for this:

  • Treatment A: 62.5% improved
  • Treatment B: 57.1% improved
  • Only a 5.4 percentage point difference

With only 15 people, this tiny difference could easily occur by chance.
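
To see where that p-value comes from: Fisher's test enumerates every 2×2 table with the same margins (15 people, 9 improvers overall, 8 in group A) and sums the probabilities of all tables no more likely than the observed one. A minimal sketch of that enumeration with scipy's hypergeometric distribution (the tolerance factor mirrors scipy's handling of floating-point ties):

from scipy.stats import hypergeom
import numpy as np

# Fixed margins: 15 people total, 9 improved overall, 8 in Treatment A
rv = hypergeom(M=15, n=9, N=8)
ks = np.arange(2, 9)           # feasible counts of improvers in Treatment A
probs = rv.pmf(ks)
p_observed = rv.pmf(5)         # probability of the table we actually saw

# Two-sided p-value: total probability of tables no more likely than ours
p_two_sided = probs[probs <= p_observed * (1 + 1e-7)].sum()
print(f"P(observed table) = {p_observed:.3f}")   # ~0.392, the modal table
print(f"Two-sided p-value = {p_two_sided:.3f}")  # 1.000

Because the observed table is the single most likely arrangement under the null, every table counts toward the two-sided p-value, which is why it equals 1.0.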

Effect Size

# Odds Ratio interpretation:
# OR = 1.25 (Treatment A has ~25% higher odds of improvement),
# but this estimate is not statistically reliable given the small sample

# Confidence interval for the difference in proportions (optional)
from statsmodels.stats.proportion import confint_proportions_2indep

ci_low, ci_high = confint_proportions_2indep(
    count1=5, nobs1=8,
    count2=4, nobs2=7,
    method='wald',  # simple, but anti-conservative at these sample sizes;
                    # 'agresti-caffo' or 'newcomb' behaves better for small n
)
print(f"95% CI for difference: [{ci_low:.3f}, {ci_high:.3f}]")

Business/Clinical Interpretation

Conclusion: Based on this pilot data, you cannot conclude that either treatment is more effective.

Why this matters:

  • The observed difference (62.5% vs 57.1%) is small, and with a sample this size it cannot be distinguished from random variation
  • With only 15 people total, you’d need a much larger difference to detect it reliably

Recommendations:

  1. Increase sample size: Roughly 50 participants per group are needed to detect a 20-point difference, and 100-200+ per group for smaller effects (see the power analysis below)
  2. Pool data: If these treatments are similar, consider combining them versus a control
  3. Power analysis: Run a power calculation to determine the sample size needed to detect a meaningful difference (e.g., 20% improvement difference)

Power Analysis for Future Study

from statsmodels.stats.power import zt_ind_solve_power
from statsmodels.stats.proportion import proportion_effectsize

# To detect a 20 percentage point difference (e.g., 50% vs 70%)
# with 80% power and alpha = 0.05:
effect_size = proportion_effectsize(0.70, 0.50)  # Cohen's h ≈ 0.41

n_needed = zt_ind_solve_power(
    effect_size=effect_size,
    alpha=0.05,
    power=0.8,
    alternative='two-sided',
)
print(f"Participants needed per group: {n_needed:.0f}")
# Expect: ~49 per group for this effect size; smaller effects need far more
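
For reference, Cohen's h is defined as h = 2·arcsin(√p₁) − 2·arcsin(√p₂). For 70% vs 50% it comes out to roughly 0.41, which is why about 50 per group suffices here; a small effect such as 60% vs 50% (h ≈ 0.20) pushes the requirement to roughly 200 per group.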

Reporting Template

For Academic Use (APA format):
“Fisher’s exact test revealed no significant difference in improvement rates between Treatment A (62.5%, 5/8) and Treatment B (57.1%, 4/7), p = 1.00, OR = 1.25, 95% CI [X.XX, X.XX]. The small sample size (n=15) limited statistical power.”

For Business/Clinical Use:
“In this small pilot (n=15), Treatment A showed a slightly higher improvement rate (62.5%) compared to Treatment B (57.1%), but this 5-percentage-point difference was not statistically significant (p = 1.0). Conclusion: The data do not support choosing one treatment over the other. A larger study (roughly 50-200 participants per group, depending on the smallest difference worth detecting) is needed.”

Common Pitfalls to Avoid

  1. Don’t use chi-square with small samples – Fisher’s exact is the correct choice
  2. Don’t over-interpret small differences – 5% difference with n=15 proves nothing
  3. Don’t ignore power – Your study had low power to detect anything but huge effects
  4. Don’t assume “no significant difference” = “treatments are equal” – It means you don’t have enough data (see the interval sketch below)
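
One way to see pitfall 4 concretely is to look at the confidence interval for the difference rather than the p-value alone. A minimal sketch using the same statsmodels helper as above ('agresti-caffo' is chosen here because plain Wald intervals misbehave at these sample sizes):

from statsmodels.stats.proportion import confint_proportions_2indep

# 95% CI for (proportion improved on A) - (proportion improved on B)
low, upp = confint_proportions_2indep(
    count1=5, nobs1=8,
    count2=4, nobs2=7,
    method='agresti-caffo',
)
print(f"95% CI for the difference: [{low:+.1%}, {upp:+.1%}]")
# The interval spans a large negative to a large positive difference:
# the data are compatible with A being much better than B, much worse,
# or identical -- absence of evidence, not evidence of absence.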

Next Steps

  • Run a power analysis to determine required sample size for a full study
  • Consider what size of effect would be clinically meaningful (e.g., 15% improvement difference?)
  • Design a properly powered RCT if resources allow