Bayesian A/B Testing by SplitMetrics

The SplitMetrics team is happy to introduce its newly released Bayesian A/B testing.

The Bayesian approach is useful when marketers have beliefs and knowledge to use as a prior assumption (in our case, an informed prior in the default settings) that helps the algorithm combine prior knowledge with observed data to estimate the probability of a certain outcome. By incorporating this knowledge into the experiment, users can make faster decisions at a lower experiment cost than with the Frequentist or Sequential methods.

The benefit of the Bayesian approach in SplitMetrics is indirect control over risk: users can tune the expected-loss stopping rule (the threshold of caring).

The stopping rule stops the test once a clear winner or underperformer emerges, saving test budget.


A Bayesian test starts from a weak prior assumption about the expected conversion (the prior distribution). The prior helps dampen noise at the beginning of the test, when the sample size is small. As the test progresses and more visitors participate, the impact of the prior fades.
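To make the prior-to-posterior mechanics concrete, here is a minimal sketch of a Beta-Binomial update in Python; the prior parameters and visitor counts are illustrative assumptions, not SplitMetrics' default settings.

```python
# Beta-Binomial sketch: a weak prior is updated with observed conversions.
# All numbers below are illustrative, not SplitMetrics defaults.
from scipy.stats import beta

PRIOR_ALPHA, PRIOR_BETA = 2, 8  # weak prior centred on roughly 20% conversion

def posterior(conversions, visitors):
    """Posterior Beta distribution after observing the test data."""
    return beta(PRIOR_ALPHA + conversions, PRIOR_BETA + visitors - conversions)

early = posterior(conversions=10, visitors=40)        # small sample: prior still matters
late = posterior(conversions=2_500, visitors=10_000)  # large sample: prior influence fades

print(f"early-stage posterior mean: {early.mean():.3f}")  # pulled towards the 20% prior
print(f"late-stage posterior mean:  {late.mean():.3f}")   # ~0.25, dominated by the data
```

With 40 visitors the posterior mean sits between the prior and the observed rate; with 10,000 visitors it is essentially the observed rate, which is exactly the fading effect described above.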

Leverage the Bayesian approach, an industry gold standard for iterative A/B testing and growth hacking, and run your tests with less traffic.

The Bayesian flow

  1. Test setup: utilizing assumptions based on prior test history  
  2. Warm-up stage (250 unique visitors per variation) + Knowledge update 
  3. Activation of stopping rule algorithm based on real data + Knowledge update 
  4. The winner is declared once its expected loss falls below the threshold of caring (see the sketch below).
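Below is a minimal sketch of the expected-loss stopping rule from steps 3–4, assuming Beta posteriors for two variations and an illustrative threshold of caring; it is not the exact algorithm SplitMetrics runs.

```python
# Expected-loss stopping rule sketch (illustrative, not the production algorithm).
import numpy as np

rng = np.random.default_rng(42)

def expected_losses(conv_a, n_a, conv_b, n_b, samples=100_000):
    """Expected conversion-rate loss of declaring A or B the winner."""
    # Flat Beta(1, 1) priors here, purely for illustration.
    a = rng.beta(1 + conv_a, 1 + n_a - conv_a, samples)
    b = rng.beta(1 + conv_b, 1 + n_b - conv_b, samples)
    loss_if_pick_a = np.maximum(b - a, 0).mean()  # what we give up if A is chosen
    loss_if_pick_b = np.maximum(a - b, 0).mean()  # what we give up if B is chosen
    return loss_if_pick_a, loss_if_pick_b

THRESHOLD_OF_CARING = 0.001  # hypothetical tolerance: 0.1 percentage points

loss_a, loss_b = expected_losses(conv_a=190, n_a=1_000, conv_b=230, n_b=1_000)
if loss_b < THRESHOLD_OF_CARING:
    print(f"Declare B the winner (expected loss {loss_b:.4f})")
elif loss_a < THRESHOLD_OF_CARING:
    print(f"Declare A the winner (expected loss {loss_a:.4f})")
else:
    print("Keep collecting data")
```

The test stops as soon as the expected loss of picking the current leader drops below the threshold of caring, which is how the budget savings described above are realised.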

Three simultaneous criteria for Bayesian A/B testing

  • A constant flow of cheap-to-produce, hard-to-test ideas
  • The goal is to increase the conversion rate
  • The implementation scale is defined in scope and time

If these criteria are not met, the growth-hacking model does not apply, and an alternative approach may serve you better.

When you should choose an alternative

  • A single idea to test and a long release cycle: growth managers don't expect new features to challenge and replace the tested one in the foreseeable future, so they wouldn't want to stick with a valueless feature indefinitely.

In this setting there is a chance of getting stuck with a slight negative change (a deterioration), which can accumulate into a significant effect over time.

  • Indefinite implementation scale: the test result becomes shared knowledge that may be applied at an uncontrollable scale across the company. When checking hypotheses to change a product, marketers and product managers know exactly how the changes will be implemented. In big companies and enterprises, however, the result may also turn into permanent knowledge applied without further verification (for instance, "a green background always improves conversion rate"). With the Bayesian method, you can get stuck with a deterioration or scale an improvement unpredictably.

In these contexts, the possible flaws of the Bayesian approach are side effects of treating the control and test groups equally and of controlling for cost/value rather than for the false-positive/false-negative error rate.

Or when a conservative strategy is rationally justified

  • Low-trust environment: if you are not running the tests yourself, your team's decision-makers are incentivized to maximize the number of releases, so they tend to take excessive risks.

If you want a strict rule limiting the team's discretion, try the Sequential approach with a standard p-value threshold.

  • Risk aversion: rolling out a variant with close-to-zero value is unacceptable because your release cost is high.

In the Bayesian approach, even if there is only a slight difference between variations, the one that looks at least a little better at the moment will be recommended. If that is a concern, lean towards a conservative strategy with the Sequential approach, which requires statistically significant data for each variation before drawing conclusions.

The main value of Bayesian A/B testing

Online cost optimization

Early stopping by the threshold of caring (toc), an expected-loss tolerance: the test stops as soon as a variation proves to be an overperformer, an underperformer, or approximately equal to the control, letting you spend less budget on the test.

The Frequentist test comes from an academic setting, where a false positive is a major failure by default: by reporting one, you mislead the entire scientific community. Thus you are conservative by default: you prefer sticking to the current state and sacrificing potential improvements for the low chance of being even marginally wrong.

  • In most commercial applications, decision-makers are concerned not with the false-positive rate per se but with continuous product improvement. It is okay to use less traffic and make minor mistakes as long as you overcompensate for them with major improvements: the Bayesian test controls the magnitude of the potential error rather than the fraction of false-positive results.
  • Imagine running a series of ten experiments. Your current conversion is 20%. Three of the tested features increase the target conversion by 0.01%, another three decrease it by 0.01%, one is a killer feature with a +5% effect, and one is a bug that would cause a 10% conversion decrease. In a typical growth-hacking setting you don't care much about the minor improvements and losses; you need to make sure you have dropped the bug and released the killer feature. You can save a lot of traffic if you reformulate your task from academia-inherited false-positive intolerance to industrial big-error intolerance (a back-of-the-envelope sketch follows below). The Bayesian test has an interface for this, which brings us to the next point.
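Before moving on, here is a back-of-the-envelope sketch of that ten-experiment scenario; the effect sizes are the hypothetical ones from the example, and the threshold of caring is an arbitrary illustrative value.

```python
# Hypothetical effect sizes from the example above, in percentage points.
# (The remaining two of the ten experiments are assumed to have no effect.)
effects = [+0.01] * 3 + [-0.01] * 3 + [+5.0, -10.0]
THRESHOLD_OF_CARING = 0.1  # illustrative tolerance, in percentage points

negligible = [e for e in effects if abs(e) < THRESHOLD_OF_CARING]
decisive = [e for e in effects if abs(e) >= THRESHOLD_OF_CARING]

print(f"Below the threshold of caring: {negligible}")  # six tiny tweaks
print(f"Must not be missed: {decisive}")               # the killer feature and the bug
print(f"Net impact of the negligible ones: {sum(negligible):+.2f} pp")  # they cancel out
```

Only the +5% feature and the -10% bug exceed the threshold of caring; the six tiny effects net out to roughly zero, so spending traffic to resolve them precisely adds little value.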

Interpretability

Bayesian A/B testing in the SplitMetrics platform showcases a winning probability that makes experiments easy to interpret (see the sketch after the list below).

  • Bayesian delivers comprehensible business insights. 
  • The output looks like this: “B is 87% likely to outperform A. You may either stop the test and gamble on this or spend more traffic to be more sure.”
  • In the Frequentist test paradigm, you would probably get just "p-value > 0.05, the result is insignificant".
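For completeness, here is a minimal sketch of how a winning probability of this kind can be estimated from two Beta posteriors with Monte Carlo sampling; the visitor counts are illustrative, and this is not necessarily the exact computation used by the platform.

```python
# Monte Carlo estimate of P(B outperforms A) from Beta posteriors (illustrative numbers).
import numpy as np

rng = np.random.default_rng(0)

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, samples=200_000):
    a = rng.beta(1 + conv_a, 1 + n_a - conv_a, samples)  # flat Beta(1, 1) priors
    b = rng.beta(1 + conv_b, 1 + n_b - conv_b, samples)
    return (b > a).mean()

p = prob_b_beats_a(conv_a=200, n_a=1_000, conv_b=220, n_b=1_000)
print(f"B is {p:.0%} likely to outperform A")  # prints roughly "B is 86% likely to outperform A"
```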

Unleash the full potential of app growth hacking. 

Boost conversion and installs with SplitMetrics A/B testing
Request Demo