A/B Test Calculator

Bayesian + frequentist significance, annual revenue impact, 6-dimension test quality report card. Supports A/B, A/B/C, and A/B/C/D tests. Free, no signup.

Last reviewed: March 2026



What Is an A/B Test?

An A/B test (also called a split test) is a controlled experiment in which two versions of a web page, email, button, headline, or any digital element are shown to separate, randomly assigned groups of users simultaneously. By measuring which version drives more conversions, you get statistically grounded evidence to guide product and marketing decisions — instead of deferring to the HiPPO (the Highest Paid Person's Opinion).

A typical A/B test has a Control (A) — your current version — and a Variant (B) — the challenger version with a specific change. Traffic is split evenly, usually 50/50, and you measure the primary metric (often conversion rate) for both groups. When the difference is large enough relative to random chance, you declare a winner.

How to Calculate Statistical Significance

Statistical significance is calculated via a two-proportion Z-test. Given two conversion rates p₁ (control) and p₂ (variant), and sample sizes n₁ and n₂:

pPool = (p₁·n₁ + p₂·n₂) / (n₁ + n₂)
SE = √(pPool · (1-pPool) · (1/n₁ + 1/n₂))
z = (p₂ - p₁) / SE
p-value = 2 · (1 - Φ(|z|))

A p-value below 0.05 means that, if there were truly no difference between the variants, data at least this extreme would occur less than 5% of the time — the conventional threshold for 95% confidence. This calculator also reports the Bayesian probability that the variant beats the control, estimated via Monte Carlo simulation, which many practitioners find more intuitive.
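The four formula steps translate directly into code. A minimal standard-library sketch; the visitor and conversion counts below are made-up example numbers, not output from this calculator:

```python
from math import erf, sqrt

def two_proportion_z_test(conversions_a, visitors_a, conversions_b, visitors_b):
    """Two-sided two-proportion Z-test, mirroring the formulas above."""
    p1 = conversions_a / visitors_a
    p2 = conversions_b / visitors_b
    p_pool = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / visitors_a + 1 / visitors_b))
    z = (p2 - p1) / se
    normal_cdf = lambda x: 0.5 * (1 + erf(x / sqrt(2)))  # Φ, the standard normal CDF
    p_value = 2 * (1 - normal_cdf(abs(z)))
    return z, p_value

# Hypothetical test: 3.00% vs 3.44% CVR on 10,000 visitors per variant
z, p = two_proportion_z_test(300, 10_000, 344, 10_000)
```

With these counts, z ≈ 1.76 and p ≈ 0.078 — a promising trend that has not yet cleared the 0.05 bar.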

Bayesian vs Frequentist A/B Testing: Which Should You Use?

Both approaches have advantages. Here's a practical comparison:

Frequentist
Best for: Regulatory / compliance contexts
Uses p-values and z-scores. Answers: "Is this difference unlikely to be random?" More conservative, harder to abuse. Required in clinical trials and FDA-regulated contexts.
Bayesian
Best for: Product teams, marketing, e-commerce
Directly answers: "What is the probability variant B is better?" More intuitive for business decisions, and more robust (though not immune) to continuous monitoring than fixed-horizon p-values. Used by Optimizely, VWO.
Both (recommended)
Best for: When you want the full picture
Use frequentist p-value as a sanity check and Bayesian probability for decision-making. If both agree, you can ship with confidence.
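The Bayesian probability can be estimated with a short Monte Carlo simulation over Beta posteriors. A sketch assuming a uniform Beta(1, 1) prior; the counts are illustrative:

```python
import random

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=100_000, seed=42):
    """Estimate P(variant B's true CVR > control's) by sampling both
    posteriors: Beta(conversions + 1, non-conversions + 1)."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        cvr_a = rng.betavariate(conv_a + 1, n_a - conv_a + 1)
        cvr_b = rng.betavariate(conv_b + 1, n_b - conv_b + 1)
        wins += cvr_b > cvr_a
    return wins / draws

# Hypothetical counts: 300/10,000 (control) vs 344/10,000 (variant)
probability = prob_b_beats_a(300, 10_000, 344, 10_000)
```

For these counts the estimate lands around 0.96: the variant is very probably better, even while the two-sided frequentist p-value sits near 0.08 — a concrete case where the two lenses tell usefully different stories.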

What Is a Minimum Detectable Effect (MDE)?

The Minimum Detectable Effect (MDE) is the smallest relative improvement in your primary metric that your test is statistically powered to detect. It's the threshold below which any real effect would be invisible within your sample size.

MDE is directly tied to sample size: a smaller MDE requires a larger sample. Most product teams set MDE at 10–20% relative improvement. Setting it at 1–2% would require millions of visitors per variant, which is rarely feasible outside very high-traffic properties.

Practical tip: Start by asking “What lift would make this change worth shipping?” If a change only matters at 20%+ uplift, set your MDE to 20% and run a smaller, faster test.
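You can also invert the relationship: given the traffic you actually have, what MDE is realistic? A sketch using the standard sample-size formula and a simple bisection; the 3% baseline and 20,000-visitor budget are arbitrary examples:

```python
from math import sqrt

Z_ALPHA, Z_BETA = 1.96, 0.8416  # 95% confidence (two-sided), 80% power

def required_n(p1, rel_mde):
    """Visitors per variant needed to detect a relative lift of rel_mde."""
    p2 = p1 * (1 + rel_mde)
    p_bar = (p1 + p2) / 2
    numerator = (Z_ALPHA * sqrt(2 * p_bar * (1 - p_bar))
                 + Z_BETA * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return numerator / (p2 - p1) ** 2

def detectable_mde(p1, n_per_variant):
    """Smallest relative lift detectable with the given sample (bisection)."""
    lo, hi = 0.001, 5.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if required_n(p1, mid) > n_per_variant:
            lo = mid  # sample too small for this MDE; must aim higher
        else:
            hi = mid
    return hi

# With a 3% baseline CVR and 20,000 visitors per variant available:
mde = detectable_mde(0.03, 20_000)
```

This lands around a 16–17% relative lift — anything smaller would likely go undetected at that traffic level, which is exactly the "worth shipping?" conversation the tip above recommends.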

How Long Should You Run an A/B Test?

Run your test until you collect the required sample size — calculated from your baseline CVR, MDE, and confidence target. Do not stop when you see significance (that's “peeking,” which inflates false positives by 2–5×).

Additional rules of thumb:

  • Minimum 7 days, even if you hit sample size faster. Day-of-week effects are real — Monday visitors convert differently than weekend visitors.
  • Maximum 4–6 weeks. Beyond this, seasonal drift, competitor activity, and novelty effects contaminate results.
  • Avoid running during anomalies: product launches, PR spikes, major sale events — these bias your traffic composition.
  • Use the Duration calculator in the Reverse Calculator tab to get exact days based on your daily visitor count and MDE.

A/B Test Sample Size Calculator: How Many Visitors Do You Need?

Sample size is calculated using the Fleiss formula, which accounts for your baseline conversion rate, desired MDE, significance level (α), and statistical power (1-β):

n = (z_α/2 · √(2p̄(1-p̄)) + z_β · √(p₁(1-p₁) + p₂(1-p₂)))² / (p₂-p₁)²

Example required sample sizes per variant (95% confidence, 80% power):

Baseline CVR | 5% MDE  | 10% MDE | 20% MDE
1%           | 637,000 | 163,000 | 42,700
2%           | 315,000 | 80,700  | 21,100
3%           | 208,000 | 53,200  | 13,900
5%           | 122,000 | 31,200  | 8,200
10%          | 57,800  | 14,800  | 3,800

Per variant. Double for total visitors in a 50/50 split A/B test.
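A sketch of the formula in code, using the usual z-values (1.96 for 95% two-sided confidence, 0.8416 for 80% power):

```python
from math import sqrt

def fleiss_n(p1, rel_mde, z_alpha=1.96, z_beta=0.8416):
    """Required visitors per variant (Fleiss formula, no continuity correction)."""
    p2 = p1 * (1 + rel_mde)  # variant CVR implied by the relative MDE
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return round(numerator / (p2 - p1) ** 2)

# Detecting a 10% relative lift on a 3% baseline CVR:
n = fleiss_n(0.03, 0.10)
```

This gives roughly 53,200 visitors per variant, and the inverse-square relationship is visible directly: halving the MDE roughly quadruples the requirement.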

Common A/B Testing Mistakes

Peeking (stopping early) — High severity
Checking results before the planned sample size is reached inflates the false positive rate from 5% to 22%+ (at 95% confidence). Use this tool's peeking warning.
Too many variants at once — Medium severity
Each variant requires its own independent sample size. Running A/B/C/D simultaneously multiplies required traffic. Use multi-arm tests only when traffic supports it.
Testing too many things at once — High severity
Changing the headline, CTA, image, and color simultaneously makes it impossible to know what caused the lift. Test one change at a time (or use a factorial design).
Ignoring business metrics — Medium severity
A statistically significant 3% uplift that translates to $400 in annual impact may not justify engineering and design time. Always check revenue impact alongside p-values.
Not checking for sample ratio mismatch — High severity
If your 50/50 split shows 52/48 or worse, your randomization is broken. Validate the actual split before analyzing results.
Running tests during anomalies — Medium severity
Product launches, PR spikes, major sale events, or holiday periods bias traffic composition and invalidate results.
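The sample-ratio-mismatch check in particular is cheap to automate. A sketch using a normal approximation to the binomial test (an exact binomial test, e.g. SciPy's, would also work); the 0.001 alarm threshold is a common convention, not a law:

```python
from math import erf, sqrt

def srm_p_value(visitors_a, visitors_b, expected_share=0.5):
    """P-value for the observed traffic split vs the planned ratio
    (two-sided normal approximation to a binomial test)."""
    n = visitors_a + visitors_b
    se = sqrt(n * expected_share * (1 - expected_share))
    z = (visitors_a - n * expected_share) / se
    normal_cdf = lambda x: 0.5 * (1 + erf(x / sqrt(2)))
    return 2 * (1 - normal_cdf(abs(z)))

# A 52/48 split on 10,000 visitors is already deeply suspicious:
p = srm_p_value(5_200, 4_800)  # far below the 0.001 alarm threshold
```

If the p-value falls below 0.001, stop analyzing conversions and debug the randomization first — SRM invalidates everything downstream.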

A/B Testing Benchmarks by Industry

Typical conversion rates and expected test outcomes by industry:

Industry | Avg CVR | Typical Uplift | Min Sample
E-commerce | 2–4% | 5–20% | ~5K/variant
SaaS (trial signup) | 3–7% | 10–30% | ~4K/variant
B2B Lead Gen | 1–3% | 15–40% | ~8K/variant
Media / Content | 3–6% (click) | 5–15% | ~6K/variant
Mobile App | 10–25% (engagement) | 5–15% | ~2K/variant
Fintech / Lending | 1–3% | 10–25% | ~8K/variant

Benchmarks based on industry analyses. Your actual numbers depend on traffic quality, offer, audience, and change type.

Frequently Asked Questions

What is A/B testing?

A/B testing is a controlled experiment where two versions of a digital element are shown to random user groups to determine which drives more conversions. Statistical analysis determines if the difference is real or due to chance.

What is the difference between Bayesian and frequentist A/B testing?

Frequentist uses p-values: "Given no effect, how likely is this data?" Bayesian directly estimates: "What is the probability variant B beats control?" Bayesian is more intuitive for business decisions. This calculator supports both.

What is statistical significance in A/B testing?

Statistical significance means the observed difference is unlikely to be random. At 95% confidence (p < 0.05), a test with no true difference would produce a result this extreme less than 5% of the time. Most A/B testing best practices recommend 95% as the minimum threshold.

How many visitors do I need for an A/B test?

Use the Reverse Calculator tab: enter your baseline CVR and desired MDE. For example, detecting a 10% lift from a 3% baseline CVR requires ~53,200 visitors per variant at 95% confidence, 80% power.

What is a minimum detectable effect (MDE)?

The MDE is the smallest relative improvement your test is statistically powered to detect. A 10% MDE at a 3% CVR means you're testing for a change from 3.0% → 3.3% CVR. Smaller MDE = larger required sample size.

What is peeking in A/B testing?

Peeking means stopping a test early when you see significance before collecting the required sample size. This inflates false positives dramatically — checking 5 times during a test can raise your false positive rate from 5% to 22%.
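You can see the inflation yourself with a quick A/A simulation: no true difference exists, yet repeated interim looks trip the 95% threshold far more than 5% of the time. A sketch where all parameters are illustrative; the small trial count keeps it fast but noisy:

```python
import random
from math import sqrt

def peeking_false_positive_rate(peeks=5, visitors_per_peek=1_000, cvr=0.03,
                                trials=400, z_crit=1.96, seed=1):
    """Simulate A/A tests checked after every batch of traffic; count how
    often ANY interim look shows |z| > z_crit despite zero true effect."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(trials):
        conv_a = conv_b = n_a = n_b = 0
        tripped = False
        for _ in range(peeks):
            n_a += visitors_per_peek
            n_b += visitors_per_peek
            conv_a += sum(rng.random() < cvr for _ in range(visitors_per_peek))
            conv_b += sum(rng.random() < cvr for _ in range(visitors_per_peek))
            p_pool = (conv_a + conv_b) / (n_a + n_b)
            se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
            if se > 0 and abs(conv_a / n_a - conv_b / n_b) / se > z_crit:
                tripped = True
        false_positives += tripped
    return false_positives / trials

rate = peeking_false_positive_rate()
```

With five looks the simulated rate typically lands well above the nominal 5%; checking more often pushes it higher still.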

Can I test more than 2 variants?

Yes. This calculator supports A/B, A/B/C (3-way), and A/B/C/D (4-way) tests. Each variant is compared against the Control (A). Note that every arm needs the full per-variant sample size, so a 4-way test needs roughly double the total traffic of a simple A/B test (four groups instead of two) and takes correspondingly longer at the same traffic level.

How is revenue impact calculated?

Revenue impact = (variant CVR - control CVR) / control CVR × monthly revenue × 12. If you provide an Average Order Value, it uses visitor-based calculation: monthly visitors × CVR difference × AOV × 12.
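Both paths of that calculation are a few lines of arithmetic. A sketch; the revenue and CVR numbers are illustrative:

```python
def annual_revenue_impact(control_cvr, variant_cvr, monthly_revenue=None,
                          monthly_visitors=None, avg_order_value=None):
    """Annualized revenue impact of shipping the variant. Uses the
    visitor-based path when an average order value is supplied."""
    if avg_order_value is not None and monthly_visitors is not None:
        # monthly visitors x CVR difference x AOV x 12
        return monthly_visitors * (variant_cvr - control_cvr) * avg_order_value * 12
    # relative lift x monthly revenue x 12
    relative_lift = (variant_cvr - control_cvr) / control_cvr
    return relative_lift * monthly_revenue * 12

# Revenue-based path: 3.00% -> 3.44% CVR on $45,000/month
impact = annual_revenue_impact(0.03, 0.0344, monthly_revenue=45_000)
```

Here the relative lift is about +14.7%, so the projected impact is roughly $79,200 a year — the kind of number that decides whether a statistically significant result is also commercially significant.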

Last updated: March 2026. Statistical formulas based on standard two-proportion Z-test and Beta distribution Bayesian analysis. Sample size via Fleiss formula.