A/B Test Calculator

Bayesian + frequentist significance, annual revenue impact, 6-dimension test quality report card. Supports A/B, A/B/C, and A/B/C/D tests. Free, no signup.

Last reviewed: March 2026



What Is an A/B Test?

An A/B test (also called a split test) is a controlled experiment in which two versions of a web page, email, button, headline, or any digital element are shown to separate, randomly assigned groups of users simultaneously. By measuring which version drives more conversions, you get statistically grounded evidence to guide product and marketing decisions — instead of deferring to the HiPPO (the Highest Paid Person's Opinion).

A typical A/B test has a Control (A) — your current version — and a Variant (B) — the challenger version with a specific change. Traffic is split evenly, usually 50/50, and you measure the primary metric (often conversion rate) for both groups. When the difference is large enough relative to random chance, you declare a winner.

How to Calculate Statistical Significance

Statistical significance is calculated via a two-proportion Z-test. Given two conversion rates p₁ (control) and p₂ (variant), and sample sizes n₁ and n₂:

pPool = (p₁·n₁ + p₂·n₂) / (n₁ + n₂)
SE = √(pPool · (1-pPool) · (1/n₁ + 1/n₂))
z = (p₂ - p₁) / SE
p-value = 2 · (1 - Φ(|z|))

A p-value below 0.05 means that, if there were truly no difference between the variants, data at least this extreme would occur less than 5% of the time — the conventional threshold for 95% confidence. This calculator also reports the Bayesian probability that the variant beats the control, estimated via Monte Carlo simulation, which many practitioners find more intuitive.
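The four formula steps translate directly into code. A minimal standard-library sketch; the visitor and conversion counts below are made-up example numbers, not output from this calculator:

```python
from math import erf, sqrt

def two_proportion_z_test(conversions_a, visitors_a, conversions_b, visitors_b):
    """Two-sided two-proportion Z-test, mirroring the formulas above."""
    p1 = conversions_a / visitors_a
    p2 = conversions_b / visitors_b
    p_pool = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / visitors_a + 1 / visitors_b))
    z = (p2 - p1) / se
    normal_cdf = lambda x: 0.5 * (1 + erf(x / sqrt(2)))  # Φ, the standard normal CDF
    p_value = 2 * (1 - normal_cdf(abs(z)))
    return z, p_value

# Hypothetical test: 3.00% vs 3.44% CVR on 10,000 visitors per variant
z, p = two_proportion_z_test(300, 10_000, 344, 10_000)
```

With these counts, z ≈ 1.76 and p ≈ 0.078 — a promising trend that has not yet cleared the 0.05 bar.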

Bayesian vs Frequentist A/B Testing: Which Should You Use?

Both approaches have advantages. Here's a practical comparison:

Frequentist
Best for: Regulatory / compliance contexts
Uses p-values and z-scores. Answers: "Is this difference unlikely to be random?" More conservative, harder to abuse. Required in clinical trials and FDA-regulated contexts.
Bayesian
Best for: Product teams, marketing, e-commerce
Directly answers: "What is the probability variant B is better?" More intuitive for business decisions, and more robust (though not immune) to continuous monitoring than fixed-horizon p-values. Used by Optimizely, VWO.
Both (recommended)
Best for: When you want the full picture
Use frequentist p-value as a sanity check and Bayesian probability for decision-making. If both agree, you can ship with confidence.
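The Bayesian probability can be estimated with a short Monte Carlo simulation over Beta posteriors. A sketch assuming a uniform Beta(1, 1) prior; the counts are illustrative:

```python
import random

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=100_000, seed=42):
    """Estimate P(variant B's true CVR > control's) by sampling both
    posteriors: Beta(conversions + 1, non-conversions + 1)."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        cvr_a = rng.betavariate(conv_a + 1, n_a - conv_a + 1)
        cvr_b = rng.betavariate(conv_b + 1, n_b - conv_b + 1)
        wins += cvr_b > cvr_a
    return wins / draws

# Hypothetical counts: 300/10,000 (control) vs 344/10,000 (variant)
probability = prob_b_beats_a(300, 10_000, 344, 10_000)
```

For these counts the estimate lands around 0.96: the variant is very probably better, even while the two-sided frequentist p-value sits near 0.08 — a concrete case where the two lenses tell usefully different stories.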

What Is a Minimum Detectable Effect (MDE)?

The Minimum Detectable Effect (MDE) is the smallest relative improvement in your primary metric that your test is statistically powered to detect. It's the threshold below which any real effect would be invisible within your sample size.

MDE is directly tied to sample size: a smaller MDE requires a larger sample. Most product teams set MDE at 10–20% relative improvement. Setting it at 1–2% would require millions of visitors per variant, which is rarely feasible outside very high-traffic properties.

Practical tip: Start by asking “What lift would make this change worth shipping?” If a change only matters at 20%+ uplift, set your MDE to 20% and run a smaller, faster test.
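You can also invert the relationship: given the traffic you actually have, what MDE is realistic? A sketch using the standard sample-size formula and a simple bisection; the 3% baseline and 20,000-visitor budget are arbitrary examples:

```python
from math import sqrt

Z_ALPHA, Z_BETA = 1.96, 0.8416  # 95% confidence (two-sided), 80% power

def required_n(p1, rel_mde):
    """Visitors per variant needed to detect a relative lift of rel_mde."""
    p2 = p1 * (1 + rel_mde)
    p_bar = (p1 + p2) / 2
    numerator = (Z_ALPHA * sqrt(2 * p_bar * (1 - p_bar))
                 + Z_BETA * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return numerator / (p2 - p1) ** 2

def detectable_mde(p1, n_per_variant):
    """Smallest relative lift detectable with the given sample (bisection)."""
    lo, hi = 0.001, 5.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if required_n(p1, mid) > n_per_variant:
            lo = mid  # sample too small for this MDE; must aim higher
        else:
            hi = mid
    return hi

# With a 3% baseline CVR and 20,000 visitors per variant available:
mde = detectable_mde(0.03, 20_000)
```

This lands around a 16–17% relative lift — anything smaller would likely go undetected at that traffic level, which is exactly the "worth shipping?" conversation the tip above recommends.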

How Long Should You Run an A/B Test?

Run your test until you collect the required sample size — calculated from your baseline CVR, MDE, and confidence target. Do not stop when you see significance (that's “peeking,” which inflates false positives by 2–5×).

Additional rules of thumb:

  • Minimum 7 days, even if you hit sample size faster. Day-of-week effects are real — Monday visitors convert differently than weekend visitors.
  • Maximum 4–6 weeks. Beyond this, seasonal drift, competitor activity, and novelty effects contaminate results.
  • Avoid running during anomalies: product launches, PR spikes, major sale events — these bias your traffic composition.
  • Use the Duration calculator in the Reverse Calculator tab to get exact days based on your daily visitor count and MDE.

A/B Test Sample Size Calculator: How Many Visitors Do You Need?

Sample size is calculated using the Fleiss formula, which accounts for your baseline conversion rate, desired MDE, significance level (α), and statistical power (1-β):

n = (z_α/2 · √(2p̄(1-p̄)) + z_β · √(p₁(1-p₁) + p₂(1-p₂)))² / (p₂-p₁)²

Example required sample sizes per variant (95% confidence, 80% power):

Baseline CVR | 5% MDE  | 10% MDE | 20% MDE
1%           | 637,000 | 163,000 | 42,700
2%           | 315,000 | 80,700  | 21,100
3%           | 208,000 | 53,200  | 13,900
5%           | 122,000 | 31,200  | 8,200
10%          | 57,800  | 14,800  | 3,800

Per variant. Double for total visitors in a 50/50 split A/B test.
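A sketch of the formula in code, using the usual z-values (1.96 for 95% two-sided confidence, 0.8416 for 80% power):

```python
from math import sqrt

def fleiss_n(p1, rel_mde, z_alpha=1.96, z_beta=0.8416):
    """Required visitors per variant (Fleiss formula, no continuity correction)."""
    p2 = p1 * (1 + rel_mde)  # variant CVR implied by the relative MDE
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return round(numerator / (p2 - p1) ** 2)

# Detecting a 10% relative lift on a 3% baseline CVR:
n = fleiss_n(0.03, 0.10)
```

This gives roughly 53,200 visitors per variant, and the inverse-square relationship is visible directly: halving the MDE roughly quadruples the requirement.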

Common A/B Testing Mistakes

Peeking (stopping early) — High severity
Checking results before the planned sample size is reached inflates the false positive rate from 5% to 22%+ (at 95% confidence). Use this tool's peeking warning.
Too many variants at once — Medium severity
Each variant requires its own independent sample size. Running A/B/C/D simultaneously multiplies required traffic. Use multi-arm tests only when traffic supports it.
Testing too many things at once — High severity
Changing the headline, CTA, image, and color simultaneously makes it impossible to know what caused the lift. Test one change at a time (or use a factorial design).
Ignoring business metrics — Medium severity
A statistically significant 3% uplift that translates to $400 in annual impact may not justify engineering and design time. Always check revenue impact alongside p-values.
Not checking for sample ratio mismatch — High severity
If your 50/50 split shows 52/48 or worse, your randomization is broken. Validate the actual split before analyzing results.
Running tests during anomalies — Medium severity
Product launches, PR spikes, major sale events, or holiday periods bias traffic composition and invalidate results.
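The sample-ratio-mismatch check in particular is cheap to automate. A sketch using a normal approximation to the binomial test (an exact binomial test, e.g. SciPy's, would also work); the 0.001 alarm threshold is a common convention, not a law:

```python
from math import erf, sqrt

def srm_p_value(visitors_a, visitors_b, expected_share=0.5):
    """P-value for the observed traffic split vs the planned ratio
    (two-sided normal approximation to a binomial test)."""
    n = visitors_a + visitors_b
    se = sqrt(n * expected_share * (1 - expected_share))
    z = (visitors_a - n * expected_share) / se
    normal_cdf = lambda x: 0.5 * (1 + erf(x / sqrt(2)))
    return 2 * (1 - normal_cdf(abs(z)))

# A 52/48 split on 10,000 visitors is already deeply suspicious:
p = srm_p_value(5_200, 4_800)  # far below the 0.001 alarm threshold
```

If the p-value falls below 0.001, stop analyzing conversions and debug the randomization first — SRM invalidates everything downstream.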

A/B Testing Benchmarks by Industry

Typical conversion rates and expected test outcomes by industry:

Industry | Avg CVR | Typical Uplift | Min Sample
E-commerce | 2–4% | 5–20% | ~5K/variant
SaaS (trial signup) | 3–7% | 10–30% | ~4K/variant
B2B Lead Gen | 1–3% | 15–40% | ~8K/variant
Media / Content | 3–6% (click) | 5–15% | ~6K/variant
Mobile App | 10–25% (engagement) | 5–15% | ~2K/variant
Fintech / Lending | 1–3% | 10–25% | ~8K/variant

Benchmarks based on industry analyses. Your actual numbers depend on traffic quality, offer, audience, and change type.

Frequently Asked Questions

What is A/B testing?

A/B testing is a controlled experiment where two versions of a digital element are shown to random user groups to determine which drives more conversions. Statistical analysis determines if the difference is real or due to chance.

What is the difference between Bayesian and frequentist A/B testing?

Frequentist uses p-values: "Given no effect, how likely is this data?" Bayesian directly estimates: "What is the probability variant B beats control?" Bayesian is more intuitive for business decisions. This calculator supports both.

What is statistical significance in A/B testing?

Statistical significance means the observed difference is unlikely to be random. At 95% confidence (p < 0.05), a test with no true difference would produce a result this extreme less than 5% of the time. Most A/B testing best practices recommend 95% as the minimum threshold.

How many visitors do I need for an A/B test?

Use the Reverse Calculator tab: enter your baseline CVR and desired MDE. For example, detecting a 10% lift from a 3% baseline CVR requires ~53,200 visitors per variant at 95% confidence, 80% power.

What is a minimum detectable effect (MDE)?

The MDE is the smallest relative improvement your test is statistically powered to detect. A 10% MDE at a 3% CVR means you're testing for a change from 3.0% → 3.3% CVR. Smaller MDE = larger required sample size.

What is peeking in A/B testing?

Peeking means stopping a test early when you see significance before collecting the required sample size. This inflates false positives dramatically — checking 5 times during a test can raise your false positive rate from 5% to 22%.
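You can see the inflation yourself with a quick A/A simulation: no true difference exists, yet repeated interim looks trip the 95% threshold far more than 5% of the time. A sketch where all parameters are illustrative; the small trial count keeps it fast but noisy:

```python
import random
from math import sqrt

def peeking_false_positive_rate(peeks=5, visitors_per_peek=1_000, cvr=0.03,
                                trials=400, z_crit=1.96, seed=1):
    """Simulate A/A tests checked after every batch of traffic; count how
    often ANY interim look shows |z| > z_crit despite zero true effect."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(trials):
        conv_a = conv_b = n_a = n_b = 0
        tripped = False
        for _ in range(peeks):
            n_a += visitors_per_peek
            n_b += visitors_per_peek
            conv_a += sum(rng.random() < cvr for _ in range(visitors_per_peek))
            conv_b += sum(rng.random() < cvr for _ in range(visitors_per_peek))
            p_pool = (conv_a + conv_b) / (n_a + n_b)
            se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
            if se > 0 and abs(conv_a / n_a - conv_b / n_b) / se > z_crit:
                tripped = True
        false_positives += tripped
    return false_positives / trials

rate = peeking_false_positive_rate()
```

With five looks the simulated rate typically lands well above the nominal 5%; checking more often pushes it higher still.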

Can I test more than 2 variants?

Yes. This calculator supports A/B, A/B/C (3-way), and A/B/C/D (4-way) tests. Each variant is compared against the Control (A). Note that every arm needs the full per-variant sample size, so a 4-way test needs roughly double the total traffic of a simple A/B test (four groups instead of two) and takes correspondingly longer at the same traffic level.

How is revenue impact calculated?

Revenue impact = (variant CVR - control CVR) / control CVR × monthly revenue × 12. If you provide an Average Order Value, it uses visitor-based calculation: monthly visitors × CVR difference × AOV × 12.
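Both paths of that calculation are a few lines of arithmetic. A sketch; the revenue and CVR numbers are illustrative:

```python
def annual_revenue_impact(control_cvr, variant_cvr, monthly_revenue=None,
                          monthly_visitors=None, avg_order_value=None):
    """Annualized revenue impact of shipping the variant. Uses the
    visitor-based path when an average order value is supplied."""
    if avg_order_value is not None and monthly_visitors is not None:
        # monthly visitors x CVR difference x AOV x 12
        return monthly_visitors * (variant_cvr - control_cvr) * avg_order_value * 12
    # relative lift x monthly revenue x 12
    relative_lift = (variant_cvr - control_cvr) / control_cvr
    return relative_lift * monthly_revenue * 12

# Revenue-based path: 3.00% -> 3.44% CVR on $45,000/month
impact = annual_revenue_impact(0.03, 0.0344, monthly_revenue=45_000)
```

Here the relative lift is about +14.7%, so the projected impact is roughly $79,200 a year — the kind of number that decides whether a statistically significant result is also commercially significant.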

Last updated: March 2026. Statistical formulas based on standard two-proportion Z-test and Beta distribution Bayesian analysis. Sample size via Fleiss formula.