All examples for Statistical Test Selection Guide

Comparing Two Independent Groups

Demonstrates independent samples t-test selection for comparing continuous scores between two groups, with assumption checks, effect size calculation, and APA reporting template.

Input

What you provide to the skill

I have test scores from 20 students in a traditional classroom and 20 in an online classroom. Both groups have continuous scores 0-100. Are the groups significantly different?

Output

What the skill generates for you

Recommended Test: Independent Samples t-Test

Why This Test?

Based on your description:

  • ✓ ONE continuous dependent variable (test scores 0-100)
  • ✓ ONE categorical independent variable (classroom type) with TWO levels (traditional vs. online)
  • ✓ Independent groups (different students in each classroom)
  • ✓ Workable sample size for a t-test (n=20 per group, total N=40), though power is limited for small and medium effects

Assumptions to Check

  1. Normality: Test scores approximately normal within each group

    • Check: Shapiro-Wilk test, Q-Q plots
    • Robustness: t-test is robust to moderate violations with equal group sizes
    • If violated: Mann-Whitney U test
  2. Homogeneity of Variance: Similar variance across groups

    • Check: Levene’s test
    • If violated: Welch’s t-test (unequal variances version)
  3. Independence: Each score independent of others (satisfied by design)
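The Q-Q plot check listed under the normality assumption can be sketched with `scipy.stats.probplot`, which returns the plot coordinates and the correlation between the ordered scores and the fitted line. This is a minimal sketch, not part of the skill's output: the helper name `qq_summary` and the randomly generated `demo_scores` are hypothetical and stand in for your own data.

```python
import numpy as np
from scipy.stats import probplot

def qq_summary(scores):
    """Return Q-Q coordinates and the correlation of the points with the fitted line."""
    (theoretical, ordered), (slope, intercept, r) = probplot(scores, dist="norm")
    return theoretical, ordered, r

# Hypothetical data for illustration only; replace with one group's scores
rng = np.random.default_rng(0)
demo_scores = rng.normal(loc=75, scale=10, size=20)

theoretical, ordered, r = qq_summary(demo_scores)
print(f"Q-Q correlation with the normal line: r = {r:.3f}")
```

Values of r close to 1 mean the points fall near a straight line, which is what approximately normal scores look like. To draw the plot itself, `probplot` also accepts `plot=plt` with `matplotlib.pyplot` imported as `plt`.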

Python Code

from scipy.stats import ttest_ind, levene, shapiro, mannwhitneyu
import numpy as np

# Your data: replace each <your_data> placeholder with a list of scores
traditional = np.array(<your_data>)  # 20 test scores from traditional classroom
online = np.array(<your_data>)  # 20 test scores from online classroom

# Descriptive statistics (ddof=1 gives the sample standard deviation)
print(f"Traditional: M = {np.mean(traditional):.2f}, SD = {np.std(traditional, ddof=1):.2f}")
print(f"Online: M = {np.mean(online):.2f}, SD = {np.std(online, ddof=1):.2f}")
print(f"Difference: {np.mean(traditional) - np.mean(online):.2f} points")

# Check normality for each group
stat_trad, p_trad = shapiro(traditional)
stat_online, p_online = shapiro(online)
print("\nNormality checks:")
print(f"Traditional: p = {p_trad:.4f}")
print(f"Online: p = {p_online:.4f}")

# Check equal variance
stat_levene, p_levene = levene(traditional, online)
print(f"\nLevene's test for equal variances: p = {p_levene:.4f}")

# Decision logic for which test to use
if p_trad > 0.05 and p_online > 0.05:
    print("\n✓ Normality assumption satisfied")
    if p_levene > 0.05:
        print("✓ Equal variances assumption satisfied")
        print("\nUsing: Standard independent samples t-test")
        t_stat, p_value = ttest_ind(traditional, online)
    else:
        print("✗ Equal variances violated")
        print("\nUsing: Welch's t-test (unequal variances)")
        t_stat, p_value = ttest_ind(traditional, online, equal_var=False)
    print(f"t-statistic: {t_stat:.3f}")
else:
    print("\n✗ Normality assumption violated")
    print("\nUsing: Mann-Whitney U test (non-parametric alternative)")
    u_stat, p_value = mannwhitneyu(traditional, online, alternative='two-sided')
    print(f"U-statistic: {u_stat:.3f}")

# Report the p-value from whichever test was selected
print(f"P-value: {p_value:.4f}")
if p_value < 0.05:
    print("\n✓ SIGNIFICANT DIFFERENCE between classroom types")
else:
    print("\n✗ NO SIGNIFICANT DIFFERENCE between classroom types")

Effect Size: Cohen’s d

# Cohen's d for independent samples (pooled standard deviation)
pooled_std = np.sqrt(((len(traditional)-1)*np.var(traditional, ddof=1) +
                      (len(online)-1)*np.var(online, ddof=1)) /
                     (len(traditional) + len(online) - 2))
cohens_d = (np.mean(traditional) - np.mean(online)) / pooled_std
print(f"\nCohen's d = {cohens_d:.3f}")

# Interpretation
if abs(cohens_d) < 0.2:
    print("Effect size: Negligible")
elif abs(cohens_d) < 0.5:
    print("Effect size: Small")
elif abs(cohens_d) < 0.8:
    print("Effect size: Medium")
else:
    print("Effect size: Large")

Cohen’s d benchmarks:

  • Small: 0.2
  • Medium: 0.5
  • Large: 0.8

APA Reporting Template

If significant:

An independent samples t-test revealed that traditional classroom students
(M = XX.X, SD = X.X) scored significantly [higher/lower] than online classroom
students (M = XX.X, SD = X.X), t(38) = X.XX, p = .XXX, d = X.XX.

If not significant:

An independent samples t-test found no significant difference in test scores
between traditional classroom students (M = XX.X, SD = X.X) and online classroom
students (M = XX.X, SD = X.X), t(38) = X.XX, p = .XXX, d = X.XX.
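The statistics portion of these templates can be filled in programmatically. The sketch below is one possible way to do it, not part of the skill's output: `format_p` and `apa_t_report` are hypothetical helper names, and the two groups are randomly generated placeholders. It also illustrates that t(38) applies only to the standard (pooled) test with 20 students per group; Welch's test reports adjusted, usually fractional, degrees of freedom.

```python
import numpy as np
from scipy.stats import ttest_ind

def format_p(p):
    """APA style: three decimals, no leading zero, floor at p < .001."""
    if p < 0.001:
        return "p < .001"
    return "p = " + f"{p:.3f}".lstrip("0")

def apa_t_report(a, b, equal_var=True):
    """Build the 't(df) = ..., p = ..., d = ...' portion of an APA sentence."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    n1, n2 = len(a), len(b)
    v1, v2 = a.var(ddof=1), b.var(ddof=1)
    t_stat, p_value = ttest_ind(a, b, equal_var=equal_var)
    if equal_var:
        df_text = str(n1 + n2 - 2)
    else:
        # Welch-Satterthwaite approximation to the degrees of freedom
        df = (v1 / n1 + v2 / n2) ** 2 / (
            (v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1)
        )
        df_text = f"{df:.2f}"
    # Cohen's d with the pooled standard deviation
    pooled_sd = np.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    d = (a.mean() - b.mean()) / pooled_sd
    return f"t({df_text}) = {t_stat:.2f}, {format_p(p_value)}, d = {d:.2f}"

# Hypothetical scores for illustration only
rng = np.random.default_rng(42)
traditional_demo = rng.normal(78, 10, 20)
online_demo = rng.normal(72, 12, 20)

pooled_report = apa_t_report(traditional_demo, online_demo)
welch_report = apa_t_report(traditional_demo, online_demo, equal_var=False)
print(pooled_report)
print(welch_report)
```

The means and standard deviations in the sentence still come from the descriptive statistics; only the test statistic, p-value, and effect size are formatted here.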

Common Pitfalls to Avoid

  1. Don’t treat a non-significant result as proof of no difference: With n=20 per group, power is limited for anything smaller than a large effect. A non-significant result doesn’t prove the groups are identical; it may reflect insufficient power to detect a small or medium effect.

  2. Report effect size alongside p-value: The magnitude of difference (Cohen’s d) is as important as whether it’s significant.

  3. Check assumptions: Don’t skip normality and equal variance checks. If violated, use the appropriate alternative.

  4. Two-tailed vs. one-tailed: Use two-tailed unless you had a specific directional hypothesis BEFORE collecting data.

  5. Interpretation context: A statistically significant 2-point difference may not be educationally meaningful, while a non-significant 8-point difference might warrant further investigation with larger samples.
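For the two-tailed versus one-tailed choice in pitfall 4, `ttest_ind` takes an `alternative` argument (available in SciPy 1.6 and later). The sketch below uses hypothetical, randomly generated groups purely to show how the two p-values relate; it is an illustration, not a recommendation to switch tests after seeing the data.

```python
import numpy as np
from scipy.stats import ttest_ind

# Hypothetical scores for illustration only
rng = np.random.default_rng(1)
group_a = rng.normal(78, 10, 20)
group_b = rng.normal(72, 10, 20)

# Default: two-tailed test of "the means differ"
two_sided = ttest_ind(group_a, group_b, alternative="two-sided")

# One-tailed test of "group_a's mean is greater", valid only if
# this direction was hypothesized before collecting data
one_sided = ttest_ind(group_a, group_b, alternative="greater")

print(f"Two-tailed p = {two_sided.pvalue:.4f}")
print(f"One-tailed p = {one_sided.pvalue:.4f}")
```

When the observed difference lies in the hypothesized direction, the one-tailed p-value is half the two-tailed one, which is why choosing the direction after looking at the data inflates the false-positive rate.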

Power Considerations

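With n=20 per group and a two-tailed test at α=0.05, you have approximately:

  • 69% power to detect a large effect (d=0.8)
  • 34% power to detect a medium effect (d=0.5)
  • 9% power to detect a small effect (d=0.2)

Reaching 80% power would take about 26 students per group for a large effect and about 64 per group for a medium effect. If you find a non-significant result, consider whether your sample size was adequate for the expected effect size.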
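Power figures for this design can be recomputed from the noncentral t distribution. The following is a minimal sketch assuming equal group sizes, equal variances, and a two-tailed test; `ttest_ind_power` and `n_for_power` are hypothetical helper names, not functions from SciPy.

```python
import numpy as np
from scipy.stats import nct, t

def ttest_ind_power(d, n_per_group, alpha=0.05):
    """Two-tailed power of an independent samples t-test with equal group sizes."""
    df = 2 * n_per_group - 2
    ncp = d * np.sqrt(n_per_group / 2)  # noncentrality parameter
    t_crit = t.ppf(1 - alpha / 2, df)   # two-tailed critical value
    # Probability that |t| exceeds the critical value under the alternative
    return nct.sf(t_crit, df, ncp) + nct.cdf(-t_crit, df, ncp)

def n_for_power(d, target=0.80, alpha=0.05):
    """Smallest per-group n whose power reaches the target."""
    n = 2
    while ttest_ind_power(d, n, alpha) < target:
        n += 1
    return n

for d in (0.8, 0.5, 0.2):
    print(f"d = {d}: power with n=20 per group = {ttest_ind_power(d, 20):.2f}")

print(f"n per group for 80% power at d = 0.8: {n_for_power(0.8)}")
print(f"n per group for 80% power at d = 0.5: {n_for_power(0.5)}")
```

Running the same function with the effect size you expect before collecting data gives a quick check on whether the planned sample is large enough.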