Calculate the sample size needed for statistically significant A/B tests. Estimate test duration based on your traffic and conversion rates.
Smaller effects require larger samples. Consider whether a 10% lift would meaningfully impact your business.
Sample size is the number of visitors or users needed in each variation of an A/B test to detect a statistically significant difference between versions. Running a test with insufficient sample size leads to unreliable results—you might conclude there's no difference when one exists, or declare a winner based on random noise.
Proper sample size calculation before launching an experiment helps you avoid wasting time on tests that can't produce meaningful insights and prevents you from ending tests prematurely.
Sample size for A/B tests is calculated with statistical power analysis. The formula for a two-proportion z-test is:
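One common textbook form is shown below. It assumes a two-sided test, equal traffic per variant and unpooled variances; calculators differ in the exact variant they implement. Here n is the sample size per variant, p_1 the baseline conversion rate, p_2 the conversion rate implied by the MDE, and z_q the q-quantile of the standard normal distribution.

```latex
n = \frac{\left(z_{1-\alpha/2} + z_{1-\beta}\right)^{2}
          \left[\, p_1 (1 - p_1) + p_2 (1 - p_2) \,\right]}
         {(p_2 - p_1)^{2}},
\qquad p_2 = p_1 \,(1 + \mathrm{MDE})
```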
Where:
Baseline conversion rate: Your current conversion rate. If 3% of visitors currently convert, that's your baseline.
Minimum detectable effect (MDE): The smallest improvement you want to be able to detect, expressed here as a relative lift over the baseline. A 20% MDE on a 3% baseline means detecting a change from 3.0% to 3.6%.
Significance level (α): The probability of a false positive, that is, declaring a winner when no real difference exists. A 95% confidence level (α = 0.05) is standard.
Statistical power (1-β): The probability of detecting a real effect when one exists. 80% power is standard; 90% is more conservative.
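The four inputs above can be combined in a few lines of Python. This is a minimal sketch, assuming the unpooled two-sided form of the two-proportion formula and equal traffic per variant. The function name `sample_size_per_variant` and its parameter names are illustrative, not taken from any particular calculator. Tools that use a pooled variance or different defaults will return somewhat different numbers.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline, relative_mde, alpha=0.05, power=0.80):
    """Approximate visitors needed in EACH variant (two-sided z-test, unpooled)."""
    p1 = baseline                          # current conversion rate
    p2 = baseline * (1 + relative_mde)     # rate we want to be able to detect
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # ~1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)            # ~0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2
    return math.ceil(n)


# 3% baseline, 20% relative MDE (3.0% -> 3.6%), default alpha and power
print(sample_size_per_variant(0.03, 0.20))  # about 13,900 per variant
```

Raising `power` to 0.90 or shrinking `relative_mde` both increase the result, which matches the intuition in the definitions above.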
If your test doesn't have enough statistical power, you might fail to detect a variant that genuinely performs better. This is a Type II error—concluding "no difference" when there actually is one.
With 80% power, you have a 20% chance of missing a real effect. With 50% power (common in rushed tests), you're essentially flipping a coin.
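To see how quickly power erodes when a test is cut short, you can turn the calculation around and ask what power a given sample size buys. The sketch below makes the same assumptions as a standard two-sided, unpooled two-proportion z-test and ignores the negligible chance of a significant result in the wrong direction. The function name `achieved_power` is illustrative.

```python
from statistics import NormalDist


def achieved_power(n_per_variant, baseline, relative_mde, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    # Standard error of the difference in conversion rates
    std_err = ((p1 * (1 - p1) + p2 * (1 - p2)) / n_per_variant) ** 0.5
    # Probability that the observed z-score clears the critical value
    return NormalDist().cdf(abs(p2 - p1) / std_err - z_alpha)


# 3% baseline, 20% relative lift
print(achieved_power(13911, 0.03, 0.20))  # close to 0.80
print(achieved_power(6955, 0.03, 0.20))   # close to 0.50
```

Under these assumptions, stopping at roughly half the planned sample is enough to drop an 80% power test to about 50%.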
Early test results are highly variable. A variant might look 50% better or 50% worse purely by chance. Only with adequate sample size does the true conversion rate emerge from the noise.
If you stop a test early when you see "significance," you're likely overestimating the effect size. Winners in small samples tend to regress toward the baseline when you have more data.
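Sample size increases dramatically as MDE decreases. The required sample grows roughly with the inverse square of the MDE, so halving the MDE about quadruples the sample you need: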
| Baseline | MDE | Sample per variant | Total (A/B) |
|---|---|---|---|
| 3% | 50% lift | ~2,700 | ~5,400 |
| 3% | 25% lift | ~10,400 | ~20,800 |
| 3% | 10% lift | ~64,000 | ~128,000 |
| 3% | 5% lift | ~255,000 | ~510,000 |
This is why you should think carefully about what effect size actually matters for your business. Testing for a 5% improvement on a minor page might require months of traffic—and even if successful, the impact might be negligible.
Instead of testing subtle variations, prioritize bold hypotheses likely to produce meaningful effects. A completely redesigned checkout flow is more likely to show a 25% lift than moving a button.
A page with 10,000 daily visitors will reach the required sample size far sooner than one with 500. Prioritize testing where you have sufficient traffic.
While waiting for a test to reach significance, you might be leaving money on the table with the inferior variant. Balance statistical rigor against business impact.
Even if you hit sample size mid-week, different days have different traffic patterns. Running for complete weeks reduces day-of-week effects.
Checking results frequently and stopping the first time you see significance inflates your false positive rate. If you check daily at 95% significance, your actual false positive rate over two weeks can be roughly 20-25%, several times the nominal 5%.
Solution: Pre-register your sample size and don't stop early.
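The effect of peeking is easy to check with a small simulation. The sketch below is a simplified model, not a replay of real traffic: it assumes an A/A test (no true difference), equal-sized daily batches, and a z-statistic that behaves like a running sum of standard normal draws. The function name `false_positive_rate` and its parameters are illustrative.

```python
import random
from statistics import NormalDist


def false_positive_rate(looks, alpha=0.05, simulations=20000, seed=0):
    """Share of A/A tests declared 'significant' at any of `looks` checks."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    false_positives = 0
    for _ in range(simulations):
        running_sum = 0.0
        for k in range(1, looks + 1):
            # Each day adds one unit of noise; there is no real effect
            running_sum += rng.gauss(0.0, 1.0)
            z = running_sum / k ** 0.5   # z-statistic after k days
            if abs(z) > z_crit:          # stop at the first "significant" look
                false_positives += 1
                break
    return false_positives / simulations


print(false_positive_rate(looks=1))    # a single planned analysis
print(false_positive_rate(looks=14))   # daily checks for two weeks
```

With one planned analysis the rate stays near the nominal 5%. With fourteen daily checks it comes out several times higher.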
Each additional variant requires more traffic. With 5 variants instead of 2, you need at least 2.5x more traffic to maintain the same power per variant, and more still if you correct for multiple comparisons.
Solution: Limit variants to 2-3 unless you have extremely high traffic.
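A rough way to budget traffic for multi-variant tests is sketched below. It assumes every variant is compared against a single control, uses the unpooled two-proportion formula, and optionally applies a Bonferroni correction (dividing α by the number of comparisons). Bonferroni is one simple, conservative choice among several. The function name `total_traffic` is illustrative.

```python
import math
from statistics import NormalDist


def total_traffic(num_variants, baseline, relative_mde,
                  alpha=0.05, power=0.80, bonferroni=True):
    """Total visitors across all variants, control included."""
    comparisons = num_variants - 1            # each variant vs. the control
    adj_alpha = alpha / comparisons if bonferroni else alpha
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - adj_alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)
    return n * num_variants


two = total_traffic(2, 0.03, 0.20)
print(total_traffic(5, 0.03, 0.20, bonferroni=False) / two)  # 2.5
print(total_traffic(5, 0.03, 0.20) / two)                    # above 2.5
```

Without any correction the ratio is exactly 2.5. With the correction each variant also needs a larger sample, so the total grows faster than the variant count.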
Statistical significance doesn't mean business significance. A 0.1% conversion increase might be "significant" with enough traffic, but implementing and maintaining that change might not be worth it.
Solution: Define upfront what effect size would be worth implementing.
Primary metrics should directly relate to business outcomes. Testing click-through rate when you care about revenue can lead you astray—a variant might increase clicks but decrease purchases.
Solution: Choose metrics that align with actual business goals.
Modern approaches like sequential testing allow you to check results as data accumulates without inflating false positives. These methods adjust significance thresholds dynamically.
Bayesian methods provide probability distributions for each variant's conversion rate, letting you make decisions based on probability of being best rather than binary significance.
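As a minimal illustration of the Bayesian view, the sketch below estimates the probability that variant B has a higher conversion rate than A. It assumes a uniform Beta(1, 1) prior for each variant and uses Monte Carlo draws from the resulting Beta posteriors. The function name `prob_b_beats_a` and the example counts are hypothetical.

```python
import random


def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=20000, seed=0):
    """Monte Carlo estimate of P(rate_B > rate_A) under Beta(1, 1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior for a conversion rate: Beta(1 + conversions, 1 + non-conversions)
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        if rate_b > rate_a:
            wins += 1
    return wins / draws


# Hypothetical data: 300/10,000 conversions for A, 400/10,000 for B
print(prob_b_beats_a(300, 10_000, 400, 10_000))
```

A decision rule might then be "ship B if this probability exceeds a threshold you set in advance". The threshold itself is a business choice, not something the method dictates.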
Bandits automatically shift traffic toward winning variants during the test. This reduces opportunity cost but makes it harder to declare definitive winners.
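One widely used bandit strategy is Thompson sampling, sketched below as a simulation. The true conversion rates are hypothetical inputs used only to generate fake visitors; a real system would observe conversions instead. The function name `thompson_sampling` is illustrative, and Beta(1, 1) priors are an assumption.

```python
import random


def thompson_sampling(true_rates, visitors, seed=0):
    """Simulate a Beta-Bernoulli Thompson sampling bandit.

    Returns the number of visitors sent to each variant.
    """
    rng = random.Random(seed)
    successes = [0] * len(true_rates)
    failures = [0] * len(true_rates)
    for _ in range(visitors):
        # Draw a plausible conversion rate for each variant from its posterior
        samples = [rng.betavariate(1 + s, 1 + f)
                   for s, f in zip(successes, failures)]
        arm = samples.index(max(samples))      # show the variant that looks best
        if rng.random() < true_rates[arm]:     # simulated visitor converts or not
            successes[arm] += 1
        else:
            failures[arm] += 1
    return [s + f for s, f in zip(successes, failures)]


# Hypothetical true rates: variant 0 converts at 3%, variant 1 at 6%
print(thompson_sampling([0.03, 0.06], visitors=20_000))
```

Because traffic drifts toward the stronger variant, the weaker one ends up with a smaller sample, which is why a clean, definitive comparison is harder to extract afterwards.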
Once you have the required sample size, estimate test duration by dividing the total sample needed by the number of daily visitors entering the test:
A test requiring 50,000 total visitors with 2,000 daily visitors will take approximately 25 days.
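The duration arithmetic can be sketched as a small helper. Rounding up to whole weeks is an optional assumption added here to cover complete weekly traffic cycles. The function name `estimate_duration_days` is illustrative.

```python
import math


def estimate_duration_days(total_sample, daily_visitors, round_to_full_weeks=True):
    """Days needed to collect `total_sample` visitors across all variants."""
    days = math.ceil(total_sample / daily_visitors)
    if round_to_full_weeks:
        # Extend to the next complete week to balance day-of-week effects
        days = math.ceil(days / 7) * 7
    return days


print(estimate_duration_days(50_000, 2_000, round_to_full_weeks=False))  # 25
print(estimate_duration_days(50_000, 2_000))                             # 28
```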
Add buffer time for:
In some situations, practical considerations override statistical rigor:
But for decisions with significant business impact, proper statistical methodology protects against costly mistakes based on noise rather than signal.