A/B Test: Large Sample Statistical Trap
Demonstrates the classic 'large sample trap' where statistical significance (p=0.008) masks a very small effect size (Cohen's h=0.018). Provides business impact scenarios at different traffic levels and a decision framework for implementation.
Input
What you provide to the skill
A/B test: 25,000 users per variant. Control 8.2% conversion, treatment 8.7% conversion. p=0.008. Worth implementing?
Output
What the skill generates for you
======================================================================
A/B Test Effect Size Analysis
Cohen’s h: 0.0180
Absolute lift: 0.0050 (0.50 percentage points)
Relative lift: 0.0610 (6.1%)
95% CI for difference: [0.0001, 0.0099] (i.e., [0.01pp, 0.99pp])
Number Needed to Treat (NNT): 200
(Need 200 users in treatment to get 1 extra conversion)
z-statistic: 2.010
p-value: 0.008
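The headline figures above can be reproduced in a few lines of Python. This is a minimal sketch, assuming a pooled-SE two-proportion z-test and an unpooled (Wald) 95% confidence interval for the difference; those assumptions reproduce the numbers in the report.

```python
from math import asin, sqrt

# Stated inputs from the A/B test above
n = 25_000                # users per variant
p_c, p_t = 0.082, 0.087   # control / treatment conversion rates

diff = p_t - p_c                                # absolute lift (0.0050)
rel = diff / p_c                                # relative lift (6.1%)
h = 2 * asin(sqrt(p_t)) - 2 * asin(sqrt(p_c))  # Cohen's h (0.0180)
nnt = 1 / diff                                  # number needed to treat (200)

# z-statistic with pooled SE; Wald 95% CI with unpooled SE
p_pool = (p_c + p_t) / 2
z = diff / sqrt(2 * p_pool * (1 - p_pool) / n)
se = sqrt(p_c * (1 - p_c) / n + p_t * (1 - p_t) / n)
ci = (diff - 1.96 * se, diff + 1.96 * se)

print(f"h={h:.4f} lift={diff:.4f} ({rel:.1%}) NNT={nnt:.0f} z={z:.2f}")
print(f"95% CI: [{ci[0]:.4f}, {ci[1]:.4f}]")
```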
======================================================================
INTERPRETATION
Statistical Significance: YES (p = 0.008)
- With 50,000 total users, the test has high power to detect small effects
- The result is statistically reliable
Effect Size: SMALL
- Cohen’s h = 0.0180 (very small by conventional standards)
- Absolute lift: 0.5 percentage points
- Relative lift: 6.1% (improvement over the 8.2% baseline)
Practical Significance: DEPENDS ON SCALE
======================================================================
BUSINESS IMPACT SCENARIOS
Monthly impact at different scales:
(assuming 0.5pp lift = 6.1% relative improvement)
Low traffic: 10,000 monthly visitors
→ 50 extra conversions/month
@ $25 AOV: $1,250/month ($15,000/year)
@ $50 AOV: $2,500/month ($30,000/year)
@ $100 AOV: $5,000/month ($60,000/year)
Medium traffic: 100,000 monthly visitors
→ 500 extra conversions/month
@ $25 AOV: $12,500/month ($150,000/year)
@ $50 AOV: $25,000/month ($300,000/year)
@ $100 AOV: $50,000/month ($600,000/year)
High traffic: 1,000,000 monthly visitors
→ 5,000 extra conversions/month
@ $25 AOV: $125,000/month ($1,500,000/year)
@ $50 AOV: $250,000/month ($3,000,000/year)
@ $100 AOV: $500,000/month ($6,000,000/year)
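The scenario table follows from a single multiplication: extra conversions = monthly visitors × 0.005, then revenue = extra conversions × average order value. A small sketch of that arithmetic (traffic levels and AOVs are the illustrative values above, not recommendations):

```python
# Monthly revenue impact of a 0.5pp absolute lift at different scales
lift_pp = 0.005  # 0.5 percentage points

for visitors in (10_000, 100_000, 1_000_000):
    extra = visitors * lift_pp  # additional conversions per month
    for aov in (25, 50, 100):
        monthly = extra * aov
        print(f"{visitors:>9,} visitors @ ${aov:>3} AOV: "
              f"${monthly:>9,.0f}/month (${monthly * 12:>11,.0f}/year)")
```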
======================================================================
DECISION FRAMEWORK
✓ IMPLEMENT IF:
- High traffic volume (>100K visitors/month)
  - At 1M visitors: 5,000 extra conversions/month
  - At $50 AOV: $250,000/month revenue impact
- Low implementation cost
  - Simple CSS/copy changes
  - No major engineering effort
  - Low maintenance burden
- No negative impacts
  - No increase in returns/refunds
  - No degradation of user experience
  - No harm to brand perception
✗ RECONSIDER IF:
- Low traffic volume (<10K visitors/month)
  - 50 conversions/month may not justify the effort
- High implementation cost
  - Requires major platform changes
  - Ongoing maintenance complexity
  - Technical debt concerns
- High opportunity cost
  - Dev time could be spent on bigger wins
  - Other tests showing larger effects
  - Strategic priorities elsewhere
? INVESTIGATE FURTHER:
- Segment analysis
  - Does the effect vary by user segment?
  - Mobile vs. desktop differences?
  - New vs. returning customers?
- Long-term effects
  - Does the lift sustain over time?
  - Any learning effects?
  - Impact on LTV, not just initial conversion?
- Qualitative data
  - What does user feedback say?
  - Any unexpected behaviors?
  - Does it align with brand strategy?
======================================================================
STATISTICAL VS. PRACTICAL SIGNIFICANCE
This is a textbook example of the “large sample trap”:
• Statistical significance: p = 0.008 (highly significant)
• Effect size: Cohen’s h = 0.0180 (very small)
With 25,000 users per group, you have ~80% power to detect effects as
small as h ≈ 0.025. This means you can reliably detect effects that are
too small to matter in practice.
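The minimum detectable effect quoted above can be sketched with the standard normal-approximation formula for a two-proportion test, MDE(h) = (z_{1-α/2} + z_{power}) · √(2/n); the critical values below are hard-coded for α = 0.05 (two-sided) and 80% power:

```python
from math import sqrt

# Minimum detectable effect (Cohen's h) for a two-proportion test
z_alpha = 1.959964   # two-sided alpha = 0.05
z_power = 0.841621   # 80% power
n = 25_000           # users per group

h_mde = (z_alpha + z_power) * sqrt(2 / n)
print(f"MDE (Cohen's h): {h_mde:.4f}")
```

With the observed h = 0.018 sitting below this threshold, the test was powered finely enough to flag effects smaller than most teams would act on.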
The p-value tells you the effect is REAL (not due to chance).
The effect size tells you the effect is SMALL (0.5pp improvement).
Whether “small but real” is worth implementing depends entirely on:
• Your traffic volume
• Your implementation costs
• Your opportunity costs
======================================================================
APA-STYLE REPORTING
The treatment group showed a statistically significant but small increase
in conversion rate (8.7%) compared to control (8.2%), z = 2.01,
p = 0.008, h = 0.018, 95% CI [0.01%, 0.99%].
The absolute lift of 0.5 percentage points represents a
6.1% relative improvement over baseline. At scale, this
translates to approximately 500 additional conversions
per 100,000 visitors.
======================================================================
VERDICT
SHORT ANSWER: Probably YES, but it depends on your scale.
The effect is statistically reliable (p = 0.008) but small in magnitude
(6.1% relative lift). The decision to implement should be
based on:
- TRAFFIC: If you have >100K monthly visitors, the cumulative impact
  likely justifies implementation.
- COST: If implementation is low-effort (simple changes), the ROI is
  favorable even at modest traffic levels.
- PRIORITY: If you have other experiments showing larger effects or
  higher-priority initiatives, consider those first.
RECOMMENDATION:
• High traffic (>500K/month) + low cost → IMPLEMENT
• Medium traffic (100K-500K/month) + low cost → IMPLEMENT
• Low traffic (<100K/month) + low cost → MAYBE (run cost-benefit)
• Any traffic + high cost → INVESTIGATE FURTHER (segment analysis)
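The recommendation matrix above reduces to two inputs, traffic and implementation cost. A toy encoding of it (the thresholds and labels are the ones stated in this verdict, not universal rules):

```python
def recommend(monthly_visitors: int, low_cost: bool) -> str:
    """Map (traffic, cost) to the verdict's recommendation."""
    if not low_cost:
        # Any traffic level + high cost: dig into segments first
        return "INVESTIGATE FURTHER (segment analysis)"
    if monthly_visitors >= 100_000:
        # Medium or high traffic + low cost
        return "IMPLEMENT"
    # Low traffic + low cost
    return "MAYBE (run cost-benefit)"

print(recommend(1_000_000, low_cost=True))   # IMPLEMENT
print(recommend(50_000, low_cost=True))      # MAYBE (run cost-benefit)
print(recommend(200_000, low_cost=False))    # INVESTIGATE FURTHER (segment analysis)
```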
The key insight: This is not a transformative change, but a marginal
improvement. At sufficient scale, marginal improvements compound into
meaningful business impact.
======================================================================
About This Skill
Calculate and interpret effect sizes (Cohen's d, eta-squared, odds ratios, correlations) with context-specific guidance. Distinguish statistical significance from practical importance and convert metrics for meta-analysis.
More Examples
Education Study: Flipped Classroom Intervention
Analyzes a teaching methods comparison where the p-value narrowly misses significance (p=0.052) but the effect size (d=0.40) is educationally meaningful. Demonstrates interpreting non-significant results with substantial effect sizes and comparing to domain benchmarks.
Meta-Analysis Effect Size Conversions
Converts effect sizes from three studies using different metrics (correlation r, odds ratio, F-statistic) to Cohen's d for meta-analysis. Includes conversion formulas, assumptions, and ready-to-use methods text.