Everyone agrees A/B tests are important. However, it’s difficult to design them correctly and the results can easily be misinterpreted. This leads to a lack of confidence in, among other things, the A/B testing system itself. Because we are test-oriented, we try to test the A/B testing system. But, for reasons that will be made clear, these tests are at least as hard to perform and to analyze as the specific A/B tests that we run. People get positive results where they expect none (like here), and lose trust in the test system, causing testing to grind to a halt.
In this post we’ll go over how (not) to test your test system. We’ve found that having total confidence in the test system itself is crucial for building a correct A/B testing methodology, and is a required first step on the road to properly interpreting test results.
How Not to Run an A/A Test
True story: a promising experiment, one to which many resources were devoted, returns no significant winner (perhaps due to a design that did not take statistical power into account). Worse yet, it may even claim that the original is significantly better than the new variant. Enterprising product managers and/or developers then turn to the "obvious" culprit and decide to test the A/B test system itself, so they run an A/A test and wait for the results.
Over the past few years, there's been a lot of talk about the perils of classical A/B testing. Evan Miller's seminal "How Not to Run an A/B Test" is probably the most popular, and if you haven't read it, you should probably read at least the first part. All the points raised there and in similar posts are still valid: classical A/B testing is prone to bad design, result misinterpretation, multiple sampling ("peeking"), multiple metrics, reporting significance without effect size (or effect size without confidence intervals), and more. Furthermore, all of these problems also exist when trying to run an A/A test.
From this list of potential issues, we will focus mainly on multiple metrics and multiple sampling, which are particularly important in this case. As a brief reminder, both of these issues occur when you take a test that has a 95% chance of returning the right answer and, in effect, conduct it multiple times. This can happen either by having multiple metrics or by checking the experiment ("peeking") multiple times before stopping it. Basically, your chance of a false positive is 5% in each test you make, so the odds of getting at least one false positive across n tests are 1-(1-0.05)^n. For example, if you have 5 metrics, each tested at 95% significance, your chances of seeing a false positive are 1-0.774=22.6%, meaning almost 1 in 4 such experiments will show at least one spurious significant result. The same applies if you look at your A/B test 5 times before stopping it. If you do both, your ratio of false positives goes up to over 40%.
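Under the simplifying assumption that the comparisons are independent, this inflation is easy to compute. A minimal Python sketch (the 14-metric case corresponds to the default report described next):

```python
def familywise_rate(alpha: float, n: int) -> float:
    """Chance of at least one false positive across n independent
    comparisons, each run at significance level alpha."""
    return 1 - (1 - alpha) ** n

print(round(familywise_rate(0.05, 5), 3))   # 5 metrics  -> 0.226
print(round(familywise_rate(0.05, 14), 3))  # 14 metrics -> 0.512
```

Note that repeated peeks at the same test are correlated rather than independent, so this formula overstates the inflation from peeking; it is exact only for independent comparisons.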
In our system at the time, a typical A/B test had one main metric and 13 additional default metrics displayed in the analysis report. This is a contentious design choice: as discussed above, 14 separate metrics greatly increase the chance of a false positive. In addition, many teams commonly run an A/B test until the metrics become significant, a classic "peeking" error that inflates false positives in a similar fashion.
When analyzing A/A tests, these issues occurred in the same fashion, but the effect was much more devastating. Whereas getting a false positive in an A/B test—where a difference is expected—may lead to a wrong product decision, getting a false positive in an A/A test leads to loss of faith in either the A/B testing system itself or the entire A/B testing methodology. Since running the test until a significant effect is reached is guaranteed to eventually return a significant result, the problem is self-evident. In severe cases, this can disrupt the work of entire departments in the organization.
How to Run an A/A Test
While we did not see entire departments giving up on testing, we did notice that many teams lacked confidence in our system, so we set out to investigate more systematically. As noted above, we had 14 default metrics for each test. While having so many metrics is usually problematic, in this case it enabled us to run 10 parallel tests and collect a large sample of 140 metrics. We performed the relevant power calculations in advance, set our sample size, and ran the tests until the required size was reached.
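For readers who want to reproduce this kind of planning step, here is a sketch of the standard normal-approximation sample-size formula for comparing two proportions. This is illustrative only, not our exact production calculation, and the baseline rate and minimum detectable effect below are made-up numbers:

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(p_base: float, mde: float,
                          alpha: float = 0.05, power: float = 0.8) -> int:
    """Rough per-group sample size for a two-sided two-proportion z-test,
    using the usual normal approximation."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the test
    z_beta = z.inv_cdf(power)           # quantile for the desired power
    p_alt = p_base + mde
    variance = p_base * (1 - p_base) + p_alt * (1 - p_alt)
    return ceil((z_alpha + z_beta) ** 2 * variance / mde ** 2)

# e.g. detecting a 1-point lift on a hypothetical 10% conversion rate
print(sample_size_per_group(0.10, 0.01))  # roughly 15k users per group
```

The point of doing this in advance is that the sample size is fixed before the test starts, which is what rules out the "run until significant" failure mode.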
The results were unsurprising for a system that we basically believed to be working properly. Only 1 in 10 tests showed a significant difference in the main metric. The chance of getting 1 or more positive results in 10 binomial trials with a .05 probability of success is 1-(1-0.05)^10, giving p=0.4, meaning there is no reason to reject the null hypothesis (that is, the system is working fine).
Of the entire 140-metric set, 4 (<3%) showed significant differences. Again, with a .05 false positive probability per metric, the chance of seeing 4 or more significant metrics out of 140 is p=0.92, and even 8 significant metrics would still give p=0.4.
Again, this provided us no reason to reject our hypothesis that the system was working well.
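These are plain binomial tail probabilities and are easy to reproduce in any language (our original analysis used R); here is a Python equivalent that recovers the numbers quoted above:

```python
from math import comb

def prob_at_least(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p): the chance of seeing k or more
    false positives in n independent tests, each at significance level p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

print(round(prob_at_least(1, 10, 0.05), 2))   # 1+ of 10 main metrics -> 0.4
print(round(prob_at_least(4, 140, 0.05), 2))  # 4+ of 140 metrics     -> 0.92
print(round(prob_at_least(8, 140, 0.05), 2))  # 8+ of 140 metrics     -> 0.4
```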
How Not to Need an A/A Test
In this case, we were able to run a large sample of tests on our system, report the findings to the various teams in the company, and restore trust in our infrastructure. However, we wanted to prevent future recurrences of faulty A/A tests as well. To get rid of A/A tests, we first needed to understand what they were testing. In fact, since an A/A test literally shows the same variant to both groups, the only thing it is testing is the random assignment of the A/B test system itself. An A/A test will not find cases where one variant has a bug in data collection or an actual product bug (like an extra menu tab moving the upgrade button outside the screen—another true story). If you want something that will help you with these types of issues, read this fine piece from Twitter, “Detecting and Avoiding Bucket Imbalance in A/B Tests”.
Once we understood what we were actually testing (random assignment by the A/B test system), we started replacing A/A tests with monitoring scripts. Essentially, these are continuous A/A tests that monitor each segment of Wix where tests can be run, verifying both the distribution between A and A' (which is also A) and the agreement between the total number of users reported by the A/B test system and the actual number of visitors to the segment.
As a first measure, our shiny new monitors do not perform any advanced statistical tests. They merely raise an alert if the imbalance crosses some predefined threshold. Yet even at this simplistic level, these scripts have been instrumental in finding major issues. Most were caused by improper attempts to share experiments between segments, which made the test system report more users than were actually supposed to have received the test. In one case, however (and pretty soon after we started running these monitors), they alerted us to a deeper issue with our assignment algorithm. This prompted a closer investigation of the system code, during which we discovered and fixed several bugs. More importantly, it compelled us to write proper unit tests for the A/B testing system itself, which also proved non-trivial and will be the subject of the next post.
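The threshold check itself fits in a few lines. The following is a hypothetical Python sketch of such a monitor; the function name, parameters, and the 1%/2% thresholds are illustrative placeholders, not our production values:

```python
def assignment_alerts(count_a: int, count_a_prime: int, total_visitors: int,
                      ratio_threshold: float = 0.01,
                      count_threshold: float = 0.02) -> list:
    """Simple alerts for a continuously running A/A monitor:
    - the split between A and A' (both serving variant A) should be ~50/50;
    - users counted by the test system should match actual segment visitors."""
    alerts = []
    assigned = count_a + count_a_prime
    if assigned and abs(count_a / assigned - 0.5) > ratio_threshold:
        alerts.append("imbalanced A/A' split")
    if total_visitors and abs(assigned - total_visitors) / total_visitors > count_threshold:
        alerts.append("assigned users disagree with visitor count")
    return alerts

print(assignment_alerts(5000, 5050, 10080))  # small drift -> no alerts
print(assignment_alerts(5600, 4400, 10000))  # skewed split -> alert
```

A fixed threshold like this trades statistical rigor for simplicity: it will miss small, persistent biases and may fire on noise at low traffic, but as noted above it was enough to surface real assignment bugs.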
Posted by Ory Henn