What is A/B test sample size?

Sample size is the number of visitors or users needed in each variation of an A/B test to detect a statistically significant difference between versions. Running a test with insufficient sample size leads to unreliable results—you might conclude there's no difference when one exists, or declare a winner based on random noise.

Proper sample size calculation before launching an experiment helps you avoid wasting time on tests that can't produce meaningful insights and prevents you from ending tests prematurely.

How sample size is calculated

Sample size for A/B tests uses statistical power analysis. The formula for a two-proportion z-test is:

n = \frac{(Z_{\alpha/2} + Z_\beta)^2 \times \left(p_1(1-p_1) + p_2(1-p_2)\right)}{(p_2 - p_1)^2}

Where:

n = sample size per variation
Z_{α/2} = z-score for the two-sided significance level (1.96 for α = 0.05, i.e. 95% confidence)
Z_β = z-score for statistical power (0.84 for 80% power)
p₁ = baseline conversion rate
p₂ = expected conversion rate with the effect
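
As a minimal Python sketch of this calculation, the function below uses the standard unpooled form n = (Z_{α/2} + Z_β)² × [p₁(1−p₁) + p₂(1−p₂)] / (p₂ − p₁)² with only the standard library. The function name and its defaults (two-sided α = 0.05, 80% power) are illustrative assumptions, not part of any particular calculator.

```python
import math
from statistics import NormalDist


def sample_size_per_variation(p1, p2, alpha=0.05, power=0.80):
    """Visitors needed in each variation for a two-sided two-proportion z-test."""
    if not (0 < p1 < 1 and 0 < p2 < 1) or p1 == p2:
        raise ValueError("p1 and p2 must be distinct rates strictly between 0 and 1")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for 80% power
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance_sum / (p2 - p1) ** 2
    return math.ceil(n)


# Hypothetical example: 3% baseline, 3.6% expected rate
n = sample_size_per_variation(0.03, 0.036)
print(f"{n:,} visitors per variation, {2 * n:,} in total")
```

For a 3% baseline and a 3.6% target this comes out at about 13,900 visitors per variation. Calculators that use a pooled variance or a continuity correction can report somewhat different figures.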

Key inputs explained

Baseline conversion rate: Your current conversion rate. If 3% of visitors currently convert, that's your baseline.

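Minimum detectable effect (MDE): The smallest improvement you want to be able to detect, expressed here as a relative lift over the baseline. A 20% MDE on a 3% baseline means detecting a change from 3.0% to 3.6%, an absolute difference of 0.6 percentage points.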

Significance level (α): The probability of a false positive, that is, declaring a winner when no real difference exists. α = 0.05 (95% confidence) is standard.

Statistical power (1-β): The probability of detecting a real effect when one exists. 80% power is standard; 90% is more conservative.

Why sample size matters

Underpowered tests miss real winners

If your test doesn't have enough statistical power, you might fail to detect a variant that genuinely performs better. This is a Type II error—concluding "no difference" when there actually is one.

With 80% power, you have a 20% chance of missing a real effect of the size you planned for. With 50% power (common in rushed tests), detecting that effect is essentially a coin flip.
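
To see how quickly power drops when a test is cut short, the sketch below estimates the power of a two-sided two-proportion z-test for a given number of visitors per variation. It uses the normal approximation and ignores the negligible chance of a significant result in the wrong direction. The function name and the example rates (3.0% versus 3.6%) are illustrative assumptions.

```python
import math
from statistics import NormalDist


def achieved_power(p1, p2, n_per_variation, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test with n visitors per arm."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    standard_error = math.sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n_per_variation)
    # Probability that the observed difference clears the significance threshold
    # in the direction of the true effect.
    return NormalDist().cdf(abs(p2 - p1) / standard_error - z_alpha)


for n in (1_000, 5_000, 14_000):
    print(f"n = {n:>6,} per variation -> power = {achieved_power(0.03, 0.036, n):.0%}")
```

Under these assumptions the three sample sizes give roughly 11%, 39% and 80% power, so a test stopped at 1,000 visitors per variation would miss this effect about nine times out of ten.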

Small samples amplify noise

Early test results are highly variable. A variant might look 50% better or 50% worse purely by chance. Only with adequate sample size does the true conversion rate emerge from the noise.

The winner's curse

If you stop a test early when you see "significance," you're likely overestimating the effect size. Winners in small samples tend to regress toward the baseline when you have more data.
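
The effect can be illustrated with a small simulation. The sketch below runs many underpowered tests in which B truly is 10% better than A, keeps only the runs where B looks significantly better, and compares the average observed lift in those runs with the true lift. All rates, sample sizes and the seed are hypothetical values chosen for illustration.

```python
import math
import random


def simulate_winners_curse(p_a=0.10, p_b=0.11, n=500, trials=2000, seed=1):
    """Return (true lift, mean observed lift among 'significant' wins for B)."""
    rng = random.Random(seed)
    winner_lifts = []
    for _ in range(trials):
        conv_a = sum(rng.random() < p_a for _ in range(n))
        conv_b = sum(rng.random() < p_b for _ in range(n))
        rate_a, rate_b = conv_a / n, conv_b / n
        pooled = (conv_a + conv_b) / (2 * n)
        se = math.sqrt(2 * pooled * (1 - pooled) / n)
        if se == 0 or rate_a == 0:
            continue  # no usable estimate in this run
        z = (rate_b - rate_a) / se
        if z > 1.96:  # B "wins" at the usual two-sided 95% threshold
            winner_lifts.append((rate_b - rate_a) / rate_a)
    true_lift = (p_b - p_a) / p_a
    mean_lift = sum(winner_lifts) / len(winner_lifts) if winner_lifts else float("nan")
    return true_lift, mean_lift


true_lift, observed_lift = simulate_winners_curse()
print(f"true lift: {true_lift:.0%}, average lift among declared winners: {observed_lift:.0%}")
```

With only 500 visitors per variation, a run can reach significance only when the observed gap is several times larger than the true one, so the declared winners overstate the lift by construction.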

Relationship between MDE and sample size

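Sample size increases dramatically as MDE decreases. Because the difference between the two rates is squared in the denominator, halving the MDE roughly quadruples the required sample: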

Baseline	MDE	Sample per variant	Total (A/B)
3%	50% lift	~2,700	~5,400
3%	25% lift	~10,400	~20,800
3%	10% lift	~64,000	~128,000
3%	5% lift	~255,000	~510,000

This is why you should think carefully about what effect size actually matters for your business. Testing for a 5% improvement on a minor page might require months of traffic—and even if successful, the impact might be negligible.
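
A short Python sketch makes the scaling easy to explore for your own baseline. It takes a relative MDE, converts it to an expected rate, and applies the unpooled two-proportion formula with two-sided α = 0.05 and 80% power. The function name is hypothetical. Exact figures depend on the formula a calculator uses, so they need not match the table row for row; the roughly fourfold jump each time the MDE is halved is the point.

```python
import math
from statistics import NormalDist


def sample_size_for_lift(baseline, relative_mde, alpha=0.05, power=0.80):
    """Per-variation sample size to detect a relative lift over the baseline."""
    p1, p2 = baseline, baseline * (1 + relative_mde)
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return math.ceil(z ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p2 - p1) ** 2)


previous = None
for mde in (0.50, 0.25, 0.10, 0.05):
    n = sample_size_for_lift(0.03, mde)
    growth = f" ({n / previous:.1f}x the previous row)" if previous else ""
    print(f"{mde:.0%} lift: {n:,} per variant{growth}")
    previous = n
```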

Practical considerations

Focus on impactful changes

Instead of testing subtle variations, prioritize bold hypotheses likely to produce meaningful effects. A completely redesigned checkout flow is more likely to show a 25% lift than moving a button.

Test high-traffic pages

A page with 10,000 daily visitors reaches a given sample size 20 times faster than one with 500. Prioritize testing where you have sufficient traffic.

Consider opportunity cost

While a test runs to its planned sample size, you might be leaving money on the table by sending traffic to the inferior variant. Balance statistical rigor against business impact.

Run tests for full weeks

Even if you hit sample size mid-week, different days have different traffic patterns. Running for complete weeks reduces day-of-week effects.

Common mistakes

Peeking and early stopping

Checking results frequently and stopping when you see significance inflates your false positive rate. If you check daily at 95% significance, your actual false positive rate over two weeks can reach roughly 20-25%, several times the nominal 5%.

Solution: Pre-register your sample size and don't stop early.
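
You can check the inflation yourself with an A/A simulation, in which both variants share the same true conversion rate and every "significant" result is therefore a false positive. The sketch below compares stopping at the first significant daily peek with testing once at the end. The conversion rate, daily traffic, trial count and seed are hypothetical values picked to keep the run short.

```python
import math
import random


def peeking_false_positive_rates(p=0.10, daily_n=100, days=14, trials=1000, seed=7):
    """Simulate A/A tests; return (rate when stopping at the first significant
    daily peek, rate when testing once at the end)."""
    rng = random.Random(seed)
    stopped_early = significant_at_end = 0
    for _ in range(trials):
        conv_a = conv_b = n = 0
        ever_significant = False
        for _ in range(days):
            conv_a += sum(rng.random() < p for _ in range(daily_n))
            conv_b += sum(rng.random() < p for _ in range(daily_n))
            n += daily_n  # cumulative visitors per variation
            pooled = (conv_a + conv_b) / (2 * n)
            se = math.sqrt(2 * pooled * (1 - pooled) / n)
            significant = se > 0 and abs(conv_a - conv_b) / n / se > 1.96
            ever_significant = ever_significant or significant
        stopped_early += ever_significant
        significant_at_end += significant  # result of the final-day check only
    return stopped_early / trials, significant_at_end / trials


peek_rate, final_rate = peeking_false_positive_rates()
print(f"false positives with daily peeking: {peek_rate:.1%}, single final test: {final_rate:.1%}")
```

The single final test should stay close to the nominal 5%, while the peeking rate typically comes out several times higher.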

Testing too many variants

Each additional variant requires more traffic. With equal allocation, 5 variants instead of 2 need 2.5x as much total traffic to maintain the same power per comparison, and more still if you correct for multiple comparisons.

Solution: Limit variants to 2-3 unless you have extremely high traffic.
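
The traffic cost of extra variants can be sketched as follows. The function assumes equal allocation, one control compared against each treatment, and the unpooled two-proportion formula. The optional Bonferroni correction is an assumption added here for illustration; the function name and example rates are also hypothetical.

```python
import math
from statistics import NormalDist


def total_traffic(p1, p2, variants, alpha=0.05, power=0.80, bonferroni=False):
    """Total visitors for a test with `variants` equally sized arms (control included)."""
    comparisons = variants - 1
    # Assumption: Bonferroni splits alpha across the treatment-vs-control comparisons.
    alpha_each = alpha / comparisons if bonferroni else alpha
    z = NormalDist().inv_cdf(1 - alpha_each / 2) + NormalDist().inv_cdf(power)
    n = math.ceil(z ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p2 - p1) ** 2)
    return variants * n


two_arm = total_traffic(0.03, 0.036, variants=2)
for k in (2, 3, 5):
    plain = total_traffic(0.03, 0.036, variants=k) / two_arm
    corrected = total_traffic(0.03, 0.036, variants=k, bonferroni=True) / two_arm
    print(f"{k} variants: {plain:.2f}x uncorrected, {corrected:.2f}x with Bonferroni")
```

Without any correction the multiplier is simply the ratio of arm counts. With the correction, each comparison uses a stricter threshold, so the per-variant sample grows as well.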

Ignoring practical significance

Statistical significance doesn't mean business significance. A 0.1% conversion increase might be "significant" with enough traffic, but implementing and maintaining that change might not be worth it.

Solution: Define upfront what effect size would be worth implementing.

Testing the wrong metric

Primary metrics should directly relate to business outcomes. Testing click-through rate when you care about revenue can lead you astray—a variant might increase clicks but decrease purchases.

Solution: Choose metrics that align with actual business goals.

Sequential testing and alternatives

Sequential testing

Modern approaches like sequential testing allow you to check results as data accumulates without inflating false positives. These methods adjust significance thresholds dynamically.
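
One simple member of this family applies the same, stricter z-score cutoff at every planned look (a Pocock-style boundary; this specific choice is an illustrative assumption, not something a given testing tool necessarily uses). The sketch below estimates that cutoff by simulation: it generates the z-statistics a no-difference test would show at equally spaced looks and finds the threshold that only 5% of such tests ever cross.

```python
import math
import random


def constant_sequential_threshold(looks=5, alpha=0.05, trials=20_000, seed=3):
    """Estimate the |z| cutoff that keeps the overall false positive rate at
    alpha when the same cutoff is applied at each of `looks` equally spaced checks."""
    rng = random.Random(seed)
    max_abs_z = []
    for _ in range(trials):
        running_sum, worst = 0.0, 0.0
        for k in range(1, looks + 1):
            running_sum += rng.gauss(0.0, 1.0)  # evidence added between looks
            worst = max(worst, abs(running_sum) / math.sqrt(k))  # z at look k
        max_abs_z.append(worst)
    max_abs_z.sort()
    # The (1 - alpha) quantile of the largest |z| seen across all looks
    return max_abs_z[math.ceil((1 - alpha) * trials) - 1]


threshold = constant_sequential_threshold()
print(f"per-look |z| cutoff for 5 looks: {threshold:.2f} (a single look uses 1.96)")
```

For five looks the estimate should land near 2.4. The price of being allowed to stop early is a higher bar at each check, and therefore a somewhat larger maximum sample size.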

Bayesian A/B testing

Bayesian methods provide probability distributions for each variant's conversion rate, letting you make decisions based on probability of being best rather than binary significance.
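
A common way to compute the "probability of being best" for conversion rates is a Beta-Binomial model. The sketch below assumes independent uniform Beta(1, 1) priors and estimates P(B > A) by sampling from each posterior. The prior, the function name and the example counts are assumptions for illustration.

```python
import random


def probability_b_beats_a(conv_a, n_a, conv_b, n_b, draws=20_000, seed=11):
    """Monte Carlo estimate of P(rate_B > rate_A) under independent Beta(1, 1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior for each arm: Beta(1 + conversions, 1 + non-conversions)
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / draws


# Hypothetical counts: 300 of 10,000 visitors converted on A, 360 of 10,000 on B
prob = probability_b_beats_a(300, 10_000, 360, 10_000)
print(f"P(B beats A) = {prob:.3f}")
```

A decision rule such as "ship B once this probability exceeds 95%" still needs a planned sample size or stopping rule; checking the probability continuously brings back the peeking problem in a different form.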

Multi-armed bandits

Bandits automatically shift traffic toward winning variants during the test. This reduces opportunity cost but makes it harder to declare definitive winners.
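
Thompson sampling is one widely used bandit strategy, chosen here as an illustrative example. For each visitor it draws a plausible conversion rate for every variant from that variant's Beta posterior and shows the variant with the highest draw. The true rates, visitor count and seed below are hypothetical simulation inputs.

```python
import random


def thompson_sampling(true_rates=(0.05, 0.10), visitors=10_000, seed=5):
    """Simulate Thompson sampling; return how many visitors each variant received."""
    rng = random.Random(seed)
    successes = [0] * len(true_rates)
    failures = [0] * len(true_rates)
    for _ in range(visitors):
        # Draw a plausible rate for each variant from its Beta posterior
        # and show the visitor the variant with the highest draw.
        draws = [rng.betavariate(1 + s, 1 + f) for s, f in zip(successes, failures)]
        arm = draws.index(max(draws))
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return [s + f for s, f in zip(successes, failures)]


allocation = thompson_sampling()
print(f"visitors per variant: {allocation}")
```

Because the weaker variant ends up with far fewer visitors, its conversion rate is estimated much less precisely, which is why a bandit run is a poor basis for a definitive effect-size estimate.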

Calculating test duration

Once you have sample size, estimate test duration:

\text{Days} = \frac{\text{Total Sample Needed}}{\text{Daily Traffic}}

A test requiring 50,000 total visitors with 2,000 daily visitors will take approximately 25 days.

Add buffer time for:

Traffic variability (weekends, holidays)
Running for complete weeks
Unexpected issues requiring investigation
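
The duration estimate and the full-week rounding can be combined in a few lines. This sketch assumes `daily_traffic` counts every visitor who enters the test across all variations; the function name is hypothetical.

```python
import math


def estimate_duration(total_sample, daily_traffic):
    """Return (raw days, days rounded up to complete weeks)."""
    raw_days = math.ceil(total_sample / daily_traffic)
    full_week_days = math.ceil(raw_days / 7) * 7
    return raw_days, full_week_days


raw, padded = estimate_duration(50_000, 2_000)
print(f"{raw} days of traffic, {padded} days when run for complete weeks")
# -> 25 days of traffic, 28 days when run for complete weeks
```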

When to skip sample size calculation

In some situations, practical considerations override statistical rigor:

Rapid iteration: Early-stage products might prioritize speed over precision
Obvious differences: A variant converting at 5x the baseline doesn't need statistical proof
Low-stakes decisions: Testing button colors on an internal tool
Qualitative goals: Some experiments aim to learn, not optimize

But for decisions with significant business impact, proper statistical methodology protects against costly mistakes based on noise rather than signal.
