Pricing A/B Test Calculator

Statistical significance calculator with revenue impact — know when to call your test.

Test Validity Checklist
  • Sufficient sample size: under-powered tests miss real effects.
  • Traffic split balanced: uneven splits can bias results.
  • Test ran for 14+ days: full business cycles capture weekly traffic patterns.
  • Single variable changed (one thing different between A and B): multiple changes make it impossible to know what drove the result.
  • Hypothesis pre-registered before seeing data: post-hoc hypothesis formation inflates the false discovery rate.
  • No novelty effect (watch for a performance decline after the initial spike): new UI changes often get a temporary lift from curiosity that fades.

Last reviewed: March 2026


How to Use the Pricing A/B Test Calculator

  1. Enter visitor and conversion counts. Input the number of unique visitors and conversions (signups, trials, or purchases) for your Control (A) and your Variant (B). Optionally rename each variant to the actual price you tested — e.g., "$49/mo" and "$59/mo" — to unlock pricing psychology tips.
  2. Choose your confidence threshold. Select 90%, 95%, or 99% confidence. For pricing tests with meaningful revenue at stake, 95% is the industry default. Use 99% for enterprise deals where a false positive is costly.
  3. Toggle Bayesian mode for an intuitive reading. Bayesian mode shows you "X% probability Variant B truly converts better" — a direct answer that's easier to present to stakeholders than a p-value. Frequentist mode shows the classical p-value and z-score.
  4. Add ARPU and daily traffic for revenue impact. Open "Business Context" and enter your Monthly ARPU and daily pricing page visitors. The Revenue Impact card unlocks instantly, showing the monthly and annual dollar uplift you'd gain by deploying the winning variant.
  5. Share when significance is reached. Click PNG to export a 1200×630 shareable result card, or "Copy for Slack" to paste a formatted summary into your standup thread. A URL encoding all inputs is automatically generated for sharing exact scenarios.

What is Statistical Significance?

Statistical significance tells you whether the observed difference in conversion rates between two pricing variants is real or due to random chance. At 95% confidence — the industry standard for most SaaS pricing tests — there is only a 5% probability of observing a difference this large if no true effect exists. That 5% threshold is called the alpha level, and it determines how much uncertainty you're willing to accept before calling a winner.

Statistical significance is not the same as practical significance. A pricing page test might reach 95% confidence with only a 0.2% absolute lift in conversion rate — statistically significant, but worth less than $500/year in revenue. Always pair your significance result with the effect size (uplift %) and the revenue impact to determine whether the test result is worth acting on.

This calculator uses a two-proportion z-test for frequentist significance: it pools the observed conversion rates to estimate a null-hypothesis proportion, computes a standard error, and converts the z-score to a two-tailed p-value via the normal CDF. The p-value is then compared against your chosen alpha level (1 − confidence).
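A minimal sketch of that calculation in Python (standard library only); the function and the example counts are illustrative, not the calculator's actual source:

```python
from math import sqrt, erf

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-tailed two-proportion z-test on conversion counts."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pool the rates under the null hypothesis of no true difference
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Convert to a two-tailed p-value via the normal CDF
    normal_cdf = lambda x: 0.5 * (1 + erf(x / sqrt(2)))
    p_value = 2 * (1 - normal_cdf(abs(z)))
    return z, p_value

# Example: 4,000 visitors per variant, 120 vs. 152 conversions
z, p = two_proportion_z_test(120, 4000, 152, 4000)
print(f"z = {z:.2f}, p = {p:.4f}")  # significant at 95% confidence if p < alpha = 0.05
```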

Bayesian vs. Frequentist A/B Testing: Which Should You Use?

Frequentist testing — the classical approach taught in statistics courses — gives you a p-value. The p-value answers: "If there were no true difference, how often would I see data at least this extreme?" A p-value below 0.05 at 95% confidence means you can reject the null hypothesis. The limitation is that p-values are notoriously hard to communicate to non-statisticians, and they say nothing directly about the probability that Variant B is better.

Bayesian testing gives you a direct probability: "89% chance Variant B has a higher true conversion rate than the control." This is what most founders and PMs actually want to know. Bayesian results are more intuitive, naturally incorporate uncertainty about the true rates (via prior distributions), and avoid the "peeking" problem that inflates false positive rates in sequential frequentist testing.

This tool uses a Beta-distribution Bayesian model. Each variant's conversion rate is modeled as a Beta posterior updated by observed conversions and non-conversions. P(B > A) is estimated by Monte Carlo sampling: drawing 1,500 pairs of random conversion rates from the two posteriors and counting how often B's draw exceeds A's. For pricing tests where a false positive has real revenue cost, Bayesian mode is the recommended default.
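A sketch of that estimate, assuming a uniform Beta(1, 1) prior on each rate (the text above doesn't specify the tool's prior); the example counts are illustrative:

```python
import numpy as np

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=1500, seed=0):
    """Estimate P(B > A) by Monte Carlo sampling from the two Beta posteriors.

    Assumes a uniform Beta(1, 1) prior on each conversion rate.
    """
    rng = np.random.default_rng(seed)
    # Posterior: Beta(prior_alpha + conversions, prior_beta + non-conversions)
    samples_a = rng.beta(1 + conv_a, 1 + n_a - conv_a, size=draws)
    samples_b = rng.beta(1 + conv_b, 1 + n_b - conv_b, size=draws)
    return float(np.mean(samples_b > samples_a))

print(prob_b_beats_a(120, 4000, 152, 4000))  # roughly 0.97: strong, but not certain
```

With only 1,500 draws the estimate wobbles slightly between runs, hence the fixed seed in the sketch.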

How Long Should You Run a Pricing A/B Test?

Pricing A/B test duration is determined by two independent constraints — both must be satisfied before you call a winner:

  • Statistical constraint: You need enough visitors to achieve at least 80% statistical power for your minimum detectable effect. At a 10% relative lift on a 3% base conversion rate with 95% confidence, that's roughly 53,000 visitors per variant. Use the Power Analysis panel to compute this for your specific rates.
  • Temporal constraint: Run for at least 14 days — two full business cycles — to capture weekly traffic patterns. B2B SaaS pricing pages typically see 40–60% more traffic on weekdays, which means a one-week test will over-represent weekday behavior and produce biased results.
  • No peeking: Each time you check your test results before the pre-committed end date, you inflate your false positive rate. Johari & Koehler (2017) found that continuous monitoring at 95% confidence produces an actual false positive rate of up to 26%. Pre-commit to an end date and don't call the test early.

Enter your daily pricing page traffic in the Business Context section to unlock the Test Duration Timeline, which shows a calendar date for expected significance at current traffic rates.
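For reference, a sketch of the sample-size and duration arithmetic using the standard normal-approximation formula; the calculator's own formula may differ slightly in its variance term, and the daily-traffic figure here is hypothetical:

```python
from math import ceil

def visitors_per_variant(base_rate, relative_lift):
    """Visitors needed per variant at 95% confidence (two-sided) and 80% power."""
    z_alpha, z_power = 1.96, 0.8416            # z for alpha/2 = 0.025 and for 80% power
    p1 = base_rate
    p2 = base_rate * (1 + relative_lift)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)

n = visitors_per_variant(0.03, 0.10)           # ~53,000 per variant at a 3% base rate
daily_traffic = 1200                           # hypothetical pricing-page visitors per day
days = ceil(2 * n / daily_traffic)             # assumes an even 50/50 split
print(f"{n:,} visitors per variant, roughly {days} days at {daily_traffic}/day")
```

Because n scales with 1/(p2 − p1)², halving the minimum detectable effect roughly quadruples the required traffic.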

Pricing Page Conversion Rate Benchmarks by Segment

Conversion rate benchmarks vary significantly across SaaS segments. Use these as context for your control variant — not as targets. If your control already sits above the 75th percentile, you have less headroom for improvement and should focus testing on revenue-per-visitor (ARPU uplift through price increases) rather than conversion rate.

Segment                  Median   75th Pct   90th Pct   Source
PLG / Self-serve SaaS    2.1%     3.8%       5.5%       Lenny's Newsletter 2024
SMB SaaS                 1.8%     3.2%       4.7%       OpenView 2024
Enterprise SaaS          0.9%     1.7%       2.6%       Gartner B2B
eCommerce                1.5%     2.9%       4.1%       Baymard 2024
Consumer App             3.2%     5.1%       7.8%       AppsFlyer
Usage-based SaaS         2.4%     4.0%       6.1%       Reforge

The Benchmark Comparison bar inside the tool places your control conversion rate on this scale automatically when you select your segment in Business Context.

Common Pricing A/B Test Mistakes (and How to Avoid Them)

1. Stopping the test too early (peeking)

The most common and expensive mistake. If you check results every day and stop when you first see p < 0.05, your actual false positive rate is closer to 26% at 95% nominal confidence. Pre-commit to a sample size before the test starts and don't check until you reach it.
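To see why, here is an illustrative A/A simulation: both variants share the same true conversion rate, so any test that ever shows p < 0.05 during daily peeking is a false positive. The traffic and duration figures are arbitrary:

```python
import numpy as np
from math import sqrt, erf

def two_sided_p(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-test p-value (the same test described earlier)."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    if p_pool in (0.0, 1.0):
        return 1.0                                    # no variation yet, nothing to test
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

rng = np.random.default_rng(1)
true_rate, daily, days, runs = 0.03, 200, 60, 2000    # arbitrary A/A setup
false_positives = 0
for _ in range(runs):
    a = b = n = 0
    for _ in range(days):
        n += daily
        a += rng.binomial(daily, true_rate)
        b += rng.binomial(daily, true_rate)           # same true rate: any "win" is spurious
        if two_sided_p(a, n, b, n) < 0.05:            # peek daily, stop at first p < 0.05
            false_positives += 1
            break
print(f"False positive rate with daily peeking: {false_positives / runs:.0%}")
```

The printed rate lands several times above the nominal 5%; checking once, at the pre-committed sample size, brings it back to 5%.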

2. Running an under-powered test

Running a test with only 400 visitors per variant when your power analysis calls for 2,600 gives you only about a 20% chance of detecting the effect you sized the test for. You'll declare "no significant difference" when one actually exists. Calculate the required sample size before starting, not after.

3. Ignoring Sample Ratio Mismatch

If your test is supposed to split 50/50 but actually shows 62/38, your randomization is broken — possibly due to a tracking bug, bot filtering, or a client-side redirect. An SRM invalidates all results regardless of p-value. This tool detects SRM automatically and warns you.

4. Testing multiple variables simultaneously

Changing both the price and the page layout simultaneously makes it impossible to isolate what drove the result. Run one variable at a time. If you need to test multiple combinations, use a proper factorial (multivariate) design in a tool built for it rather than stretching a simple A/B split.

5. Ignoring the novelty effect

New UI changes often get a temporary lift from user curiosity that fades over 7–14 days. If your test ran for only 5 days and showed a large uplift for the new variant, wait for the novelty effect to dissipate before calling the winner.

Frequently Asked Questions

What is statistical significance in an A/B test?

Statistical significance means the observed difference is unlikely to be due to random chance. At 95% confidence, there is only a 5% probability of a false positive — seeing a difference this large when no true effect exists. It does not mean the effect is large or practically meaningful, only that it is unlikely to be noise.

What's the difference between Bayesian and Frequentist A/B testing?

Frequentist testing gives a p-value — the probability of observing this data if no effect exists. Bayesian testing gives a direct probability that Variant B is better. Bayesian results are more intuitive ("89% chance B wins" vs "p = 0.043") and work better for sequential monitoring without inflating false positive rates.

How long should I run a pricing A/B test?

Two constraints apply simultaneously: statistical (enough visitors for 80%+ power) and temporal (at least 14 days / 2 business cycles). Never call a test early based on live peeking — each check inflates your false positive rate. Use the Test Duration Estimator in this tool for a calendar date based on your traffic.

What is Sample Ratio Mismatch (SRM)?

SRM occurs when your traffic split deviates significantly from the intended ratio (usually 50/50). Detected via chi-square test (chi² > 3.841), an SRM indicates a tracking bug, bot filtering asymmetry, or broken randomization. An SRM invalidates test results regardless of p-value — fix the instrumentation before trusting the data.
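A sketch of that check; the 62/38 split from the example earlier trips it immediately:

```python
def srm_check(visitors_a, visitors_b, expected_split=0.5):
    """Chi-square goodness-of-fit test for Sample Ratio Mismatch (1 degree of freedom)."""
    total = visitors_a + visitors_b
    expected_a = total * expected_split
    expected_b = total * (1 - expected_split)
    chi2 = ((visitors_a - expected_a) ** 2 / expected_a
            + (visitors_b - expected_b) ** 2 / expected_b)
    return chi2, chi2 > 3.841          # 3.841 = 95th percentile of chi-square with 1 df

chi2, srm = srm_check(6200, 3800)      # intended 50/50, observed 62/38
print(f"chi2 = {chi2:.1f}, SRM detected: {srm}")   # chi2 = 576.0 -> broken randomization
```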

What is statistical power and why does it matter?

Statistical power is the probability that your test detects a real effect if one exists. At 80% power (industry standard), 20% of true effects are missed. Detecting a 5% relative lift requires ~4× more visitors than detecting a 10% lift. The Power Analysis panel in this tool shows your current power and required sample size.
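As a rough sketch of how achieved power can be approximated for your current counts (normal approximation; the panel's exact figures may differ):

```python
from math import sqrt, erf

def achieved_power(rate_a, rate_b, n_a, n_b, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test at the observed rates."""
    normal_cdf = lambda x: 0.5 * (1 + erf(x / sqrt(2)))
    z_alpha = 1.96 if alpha == 0.05 else 2.576 if alpha == 0.01 else 1.645
    se = sqrt(rate_a * (1 - rate_a) / n_a + rate_b * (1 - rate_b) / n_b)
    return normal_cdf(abs(rate_b - rate_a) / se - z_alpha)

# Example: 3.0% vs. 3.3% observed with 8,000 visitors per variant
print(f"{achieved_power(0.030, 0.033, 8000, 8000):.0%}")   # ~19%, far below the 80% target
```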

How do I calculate the revenue impact of a pricing test?

Monthly revenue uplift = (Variant B rate − Control rate) × daily visitors × 30 × ARPU. Annual uplift = monthly × 12. Enter your ARPU and daily pricing page traffic in the Business Context section to unlock this calculation automatically. The urgency line shows the cost of each additional week of delay.
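A worked example of that arithmetic; the rates, traffic, and ARPU are hypothetical:

```python
def revenue_uplift(rate_a, rate_b, daily_visitors, monthly_arpu):
    """Monthly and annual uplift from the formula above (extra conversions x ARPU)."""
    monthly = (rate_b - rate_a) * daily_visitors * 30 * monthly_arpu
    return monthly, monthly * 12

# Hypothetical: 3.0% -> 3.8% conversion, 1,000 visitors/day, $79 monthly ARPU
monthly, annual = revenue_uplift(0.030, 0.038, 1000, 79)
print(f"${monthly:,.0f}/month, ${annual:,.0f}/year")   # $18,960/month, $227,520/year
```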

What conversion rate should a SaaS pricing page have?

Benchmarks vary by segment. PLG / self-serve SaaS median is 2.1% (top decile: 5.5%). SMB SaaS median is 1.8%. Enterprise SaaS median is 0.9%. eCommerce median is 1.5%. Consumer apps median is 3.2%. Select your segment in Business Context to see where your control variant stands on the benchmark bar.

Related SaaS Tools

LTV:CAC Ratio Calculator
Measure the efficiency of your acquisition spend. Visualize customer lifetime value versus acquisition cost across cohorts.
MRR Growth Projector
Project monthly recurring revenue growth with adjustable churn, expansion, and new business inputs. Scenario comparison built in.
K-Factor Virality Calculator
Calculate your product's viral coefficient. See flywheel simulations, benchmark comparisons, and the path to K ≥ 1.
