When it comes to A/B tests, everyone wants to get trustworthy results without spending a heap of money on traffic. Alas, that’s not always possible with classic A/B testing, which at times requires enormous sample sizes.
Is there a better way? Sure, there is!
Sequential A/B testing can be a robust alternative. Such experiments not only optimize the traffic volumes you need but also reduce the likelihood of mistakes. If you feel like exploring how it works in our platform, read about SplitMetrics sequential A/B testing principles.
Now, let’s take a closer look at the theory behind this method and learn how it differs from the classic A/B testing flow.
Classic A/B Testing Flow: Core Principles and Challenges
Before proceeding to sequential A/B testing, let’s take a moment to brush up on our understanding of a classic A/B test. Its principles are quite straightforward:
- You define a baseline conversion and Minimum Detectable Effect (MDE):
- Baseline conversion — the current conversion rate of the page you want to test.
- Minimum Detectable Effect (MDE) — the minimum expected conversion lift.
- You calculate the sample size you need for a meaningful experiment with a specialized calculator, using the above-mentioned variables along with statistical power and significance level:
- Statistical power — the percentage of the time the minimum effect size will be detected, assuming it exists. Normally, 80% is considered optimal statistical power.
- Significance level — the percentage of the time a difference (MDE) will be detected, assuming it doesn’t actually exist. Put simply, it’s how often a test “proves” your hypothesis when in reality that hypothesis is false. As a rule, a 5% significance level is used in mobile A/B testing.
- You start driving traffic to your variations. The test can be finished only when each variation has been visited by the necessary number of users (for example, 1,030 visitors).
- You evaluate the results of your A/B test. If the difference in performance between the variations reached or exceeded the MDE, you can consider the hypothesis of your experiment proven right. Otherwise, you have to start the test from scratch.
There are a few common rules when it comes to classic A/B testing:
The higher your baseline conversion rate, the less traffic you need to find a statistically significant difference.
For example:
- Baseline conversion=3%, MDE=5%, sample size (per variation) = 204,493
- Baseline conversion=5%, MDE=5%, sample size (per variation) = 120,146
The higher the expected conversion lift, the less traffic you need.
This once again proves the importance of a strong hypothesis that can potentially cause a greater difference in the variations’ performance. Judge for yourself: the sample size discrepancy can be truly dramatic, as the figures below and the sketch that follows them show:
- Baseline conversion=3%, MDE=5%, sample size (per variation)=204,493
- Baseline conversion=3%, MDE=10%, sample size (per variation)=51,486
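For reference, here is a minimal sketch of how such sample sizes can be computed. It uses the standard normal-approximation formula for comparing two proportions rather than the exact method of any particular calculator, so its outputs land close to, but not exactly on, the figures above; the function name and defaults are illustrative.

```python
from math import sqrt
from scipy.stats import norm

def sample_size_per_variation(baseline, relative_mde, alpha=0.05, power=0.80):
    """Approximate per-variation sample size for detecting a relative lift
    with a two-proportion test (normal approximation)."""
    p1 = baseline                             # baseline conversion rate
    p2 = baseline * (1 + relative_mde)        # expected treatment conversion rate
    delta = p2 - p1
    z_alpha = norm.ppf(1 - alpha / 2)         # 5% significance level (two-sided)
    z_power = norm.ppf(power)                 # 80% statistical power
    pooled = (p1 + p2) / 2
    n = (z_alpha * sqrt(2 * pooled * (1 - pooled))
         + z_power * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2 / delta ** 2
    return int(n) + 1

print(sample_size_per_variation(0.03, 0.05))  # ~208,000, close to the 204,493 above
print(sample_size_per_variation(0.05, 0.05))  # ~122,000, close to the 120,146 above
print(sample_size_per_variation(0.03, 0.10))  # ~53,000,  close to the 51,486 above
```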
The considerable amount of required traffic isn’t the only issue with classic A/B testing. A lot of managers violate its core principles: they draw conclusions before reaching the necessary sample size for all variations under test and finish the experiment early. At times they go even further and stop and relaunch one of the alternatives multiple times within a single A/B test.
There is another common scenario: unsatisfied that their hypothesis wasn’t proven, managers keep feeding failed experiments with traffic in the hope of a change. Needless to say, it’s a reckless waste of money that has nothing to do with getting quality, statistically significant results.
In the end, such improvisation increases the actual chance of mistake (the significance level). From a perfectly acceptable 5%, this parameter may climb to, say, 12%, waving goodbye to any hope of trustworthiness.
It’s vital to understand that only a responsible approach and observance of all the rules produce meaningful results. Any violation of the classic A/B testing workflow compromises your experiment.
The good news is that classic A/B testing can be modified. The updated approach is known as sequential A/B testing.
Defining Parameters for Simple Sequential A/B Testing
The best part about sequential A/B testing is that it gives users a chance to finish experiments earlier without increasing the possibility of false results. Sounds great, doesn’t it? Now let’s figure out how this type of testing works and what makes speedy results possible.
As mentioned above, classic A/B testing allows checking the results only at the very end, once the sample size for both variations has been reached. Sequential A/B testing, in its turn, allows checks at every step while ensuring that the error level won’t exceed 5%.
The workflow of tests with sequential sampling starts the same way as classic A/B testing: with defining the sample size using a specialized calculator.
Calculations in this calculator are made for a one-sided test. It is called one-sided because it checks whether the critical area of a distribution is either greater than or less than a certain value, but not both. Let’s analyze an example to see when we need a one-sided test and when a two-sided one; a small sketch follows the list below.
Imagine we believe that our control variation A performs worse than variation B. Therefore, A > B is our null hypothesis (the one we’ll try to reject via A/B testing):
- If we want to prove that A≠B, we will have to check the null hypothesis A=B using a two-sided test;
- If we want to prove that the performance of variation B isn’t worse compared to variation A (A≤B), our null hypothesis will be A>B and we’ll check it with a one-sided test.
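To make the distinction concrete, here is a minimal sketch (with made-up conversion counts) showing how the same z-statistic yields either a one-sided or a two-sided p-value depending on which null hypothesis we test; the helper name and the numbers are hypothetical.

```python
from math import sqrt
from scipy.stats import norm

def z_test(conversions_a, visitors_a, conversions_b, visitors_b):
    """Two-proportion z-test comparing control A against treatment B."""
    p_a = conversions_a / visitors_a
    p_b = conversions_b / visitors_b
    pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    se = sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    z = (p_b - p_a) / se
    p_one_sided = norm.sf(z)             # H0: A > B  (is B better than A?)
    p_two_sided = 2 * norm.sf(abs(z))    # H0: A = B  (are A and B different at all?)
    return z, p_one_sided, p_two_sided

# Hypothetical counts: 500 vs. 560 conversions out of 10,000 visitors each.
z, p_one, p_two = z_test(500, 10_000, 560, 10_000)
print(p_one, p_two)  # ~0.029 (one-sided) vs. ~0.058 (two-sided)
```

With these made-up numbers the one-sided test rejects its null hypothesis at the 5% level while the two-sided one does not, which is exactly why the choice between them matters.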
To be able to reject the null hypothesis with a controlled error rate, we have to define the parameters we need for a sequential A/B test:
Minimum Detectable Effect (MDE) — the minimum improvement over the baseline conversion that you want to detect in an A/B test. It’s better to opt for a relative MDE, as it allows you to skip defining the baseline conversion rate. This parameter is critical for your experiment, as it determines its precision. Let’s say your relative MDE is 10% and your A/B test showed the following results:
- The conversion rate of your control variation A was 50%
- The conversion rate of your treatment variation B was 55%
Such results fail to demonstrate a meaningful difference between the variations: with a relative MDE of 10%, variation B would have to beat 50% × (1 + 10%) = 55%, and it only reached that threshold (see the sketch after these parameter definitions).
Significance level (α) — the probability of rejecting the null hypothesis when in reality the null hypothesis is right. Simply put, this is the chance of our A/B test producing a false result. Going back to the example above, this parameter defines the percentage of time the test proves that A≤B while in reality A>B. 5% is the optimal significance level for A/B testing.
Statistical power (1 − β) — the probability of rejecting the null hypothesis when it is indeed wrong and our initial presumption was correct. 80% is the optimal power for A/B testing.
Even though the significance level is more important than statistical power when it comes to A/B testing, a meaningful experiment is impossible without proper statistical power.
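To make the relative-MDE example above concrete, here is a tiny sketch of the threshold a treatment variation has to clear, assuming the detectable lift is measured relative to the control’s conversion rate (variable names are purely illustrative):

```python
baseline_cr = 0.50       # conversion rate of control variation A
treatment_cr = 0.55      # conversion rate of treatment variation B
relative_mde = 0.10      # minimum relative lift we set out to detect

# The treatment has to exceed baseline * (1 + MDE) to count as a detected effect.
required_cr = baseline_cr * (1 + relative_mde)    # 0.55
print(required_cr, treatment_cr > required_cr)    # 0.55 False -> no meaningful difference
```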
Sequential A/B Testing Workflow
Now let’s examine each step of the sequential A/B testing procedure:
- Start your experiment by choosing a sample size; let’s call it N.
- Randomly assign incoming users to the treatment and control variations, with 50% probability each.
- Track the number of conversions (successes) for both variations. Let’s refer to the number of conversions from the treatment variation as T, and from the control as C.
- Finish the test as soon as T − C reaches 2√N and declare the treatment variation the winner of your A/B test.
- Finish the test when T + C reaches N. In that case, declare that the experiment had no winner.
Detailed calculations which support this workflow can be found in this article by Evan Miller.
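Here is a minimal simulation sketch of this stopping rule. The sample size N, the conversion rates, and the function name are illustrative; the point is only to show the two stop conditions described above in action.

```python
import math
import random

def run_sequential_test(n, cr_control, cr_treatment, seed=0):
    """Simulate the simple sequential stopping rule:
    stop with a winner when T - C reaches 2 * sqrt(N),
    stop with no winner when T + C reaches N."""
    rng = random.Random(seed)
    winner_boundary = 2 * math.sqrt(n)
    t = c = 0  # conversion counts for treatment and control
    while True:
        # Each incoming visitor goes to treatment or control with 50% probability.
        if rng.random() < 0.5:
            t += rng.random() < cr_treatment   # did this treatment visitor convert?
        else:
            c += rng.random() < cr_control     # did this control visitor convert?
        if t - c >= winner_boundary:
            return "treatment wins", t, c
        if t + c >= n:
            return "no winner", t, c

# Illustrative run with hypothetical conversion rates.
print(run_sequential_test(10_000, cr_control=0.020, cr_treatment=0.022))
```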
Comparing Classic and Sequential A/B Tests
Imagine that your baseline conversion is 2%. We calculate the sample size for a classic A/B test based on this parameter: at least 309,928 visitors per variation.
We launch our experiment and wait for each variation to reach that number of visitors; only then can we get down to analyzing the results.
Assume that each variation had 310,000 sessions after 7 days of our A/B test:
- Control variation A got 6,200 conversions;
- Treatment variation B got 6,420 conversions.
Let’s use this data to evaluate our results with the help of a Chi-Squared Test calculator:
The calculations show that variation B won and the test can be finished. The total number of conversions was 12,620 (6,200 + 6,420), and the difference between the variations was 220 conversions.
Considering that for a successful experiment the chance of mistake (the p-value, i.e. the probability of observing such a difference when the null hypothesis is actually true) shouldn’t exceed 5%, a minimum difference of 218 conversions was required, so our 220 conversions were enough for the statistical significance of our A/B test.
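For reference, here is a minimal sketch of the same check done with SciPy’s chi-squared test on a 2×2 table of conversions vs. non-conversions; the exact p-value depends on whether a continuity correction is applied, but either way it should land just under 0.05 for these numbers.

```python
from scipy.stats import chi2_contingency

# Conversions vs. non-conversions for each variation (numbers from the example above).
table = [
    [6_200, 310_000 - 6_200],   # control variation A
    [6_420, 310_000 - 6_420],   # treatment variation B
]
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.3f}, p-value = {p_value:.4f}")   # p-value just below 0.05 -> significant
```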
Now let’s run a sequential A/B test using the same baseline conversion and MDE. The Sequential Sampling Calculator helps us define the stop conditions, i.e. when our test can be finished.
Assume that after 5 days of this test we got the following results:
- Control variation A got 5,460 conversions;
- Treatment variation B got 5,681 conversions.
The sum of conversions is 11,141 (5,460 + 5,681), and the difference between the variations (221 conversions) exceeds the stop boundary of 207. Thus, the test can be finished.
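As a quick sanity check, here is a tiny sketch comparing the example’s numbers against that stop boundary; the 207 boundary is simply taken from the calculator output described above.

```python
# Checking the sequential stop condition against the example numbers above.
control_conversions = 5_460
treatment_conversions = 5_681
winner_boundary = 207   # boundary reported by the Sequential Sampling Calculator

difference = treatment_conversions - control_conversions   # 221
total = treatment_conversions + control_conversions        # 11,141
print(difference >= winner_boundary)   # True -> stop the test, treatment B wins
```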
If we compare these two experiments, we’ll notice that sequential A/B testing reduced the number of required conversions from 12,620 to 11,141 (about 12%) and lowered the minimal required conversion difference from 218 to 207.
Furthermore, the calculations by Evan Miller demonstrate that:
- The smaller the MDE, the more traffic sequential sampling saves; at times it’s possible to save 40% or even more;
- There are cases when sequential sampling requires more conversions than classic A/B testing;
- To make sure the sequential sampling method suits you, do the following calculation:
- If the result exceeds 36%, the classic approach to A/B testing will help you finish your experiment earlier.
- If the result is less than 36%, it makes sense to opt for sequential A/B testing, as it will help you get trustworthy results faster using less traffic.
Considering the fairly low conversion rates in such popular app categories as Games or Photo & Video, sequential sampling is a great opportunity to speed up getting results in mobile A/B testing.
If you feel like trying this kind of experiment, read about how sequential A/B testing is implemented in SplitMetrics!