Thursday, September 15, 2011

A/B testing. Is Khan doing it wrong?

A/B testing is where you try an old way of doing things and a new way, each with a sample of the users, and see which one works better. It is frequently used for ads, where you test two wordings, say "Buy new coke!" and "Buy improved coke!", to see which one gets more clicks.

If you ever write a book or need to name a shop, you should spend a few quid buying Google ads for the two names you are trying to pick between, say "How to start a fight" and "How to win an argument", and see which one gets clicked on more. That one should be the name.

This is one of those embarrassing posts that is probably wrong. But I think the A/B testing used by the Khan Academy makes a fairly fundamental mistake. The Khan Academy has lessons in various subjects. So presumably they want to use A/B testing to see whether kids taught "1+1=2" or "1 + 1 = 2" learn more quickly, and so on.

A/B testing is a useful way to see if little tweaks result in a better user experience, or in Khan's case better learning. It does not substitute for good design vision but can help make some relatively small tweaks. Improving the Khan Academy and kids' education is really important, so if there is a bug in their A/B testing they might be making the wrong choices about how to improve their teaching.

For this kind of testing you need to pick the number of test cases in advance. "How not to run an A/B test" explains why, and what happens if you look before the test is finished. This is an odd feature of frequentist statistics:

'However, the significance calculation makes a critical assumption that you have probably violated without even realizing it: that the sample size was fixed in advance. If instead of deciding ahead of time, “this experiment will collect exactly 1,000 observations,” you say, “we’ll run it until we see a significant difference,” all the reported significance levels become meaningless. This result is completely counterintuitive and all the A/B testing packages out there ignore it'
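To make "fixed in advance" concrete, here is a rough sketch of how you might pick that number before starting. It uses the standard normal-approximation formula for comparing two proportions. The function name, the 10% and 12% conversion rates, and the 5% significance / 80% power defaults are all my own assumptions for illustration; none of them come from the article quoted above.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(p_a, p_b, alpha=0.05, power=0.8):
    """Approximate users needed in EACH arm to detect p_a vs p_b.

    Two-sided test at level alpha, normal approximation.
    (Hypothetical helper for illustration only.)
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_power = NormalDist().inv_cdf(power)          # about 0.84 for 80% power
    variance = p_a * (1 - p_a) + p_b * (1 - p_b)
    return math.ceil((z_alpha + z_power) ** 2 * variance / (p_a - p_b) ** 2)


if __name__ == "__main__":
    # Assumed example: the old version converts at 10% and we hope for 12%.
    print(sample_size_per_arm(0.10, 0.12))
```

With those assumed rates it comes out at a bit under 4,000 users per arm. The smaller the difference you want to detect, the more users you need, which is why you have to decide on the size of the effect you care about before you start.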

The A/B testing used by Khan seems not to do this, as the gae bingo documentation says:

"Controlling and ending your experiments

Typically, ending an experiment will go something like this:

You'll notice a clear experiment winner and click "End experiment, picking this" on the dashboard. All users will now see your chosen alternative."

This seems to be saying either that you should notice what is statistically significant, which you won't always, or that something can be declared statistically significant before all the samples are tested. Think of it this way: if every test has a 5% chance of being wrong, think of every time you look at the A/B test as adding 5% to the chance of being wrong. It is not quite that bad, but it gives you a feeling for the problem.
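You can check how bad it is with a small simulation. If the looks were independent, the chance of at least one false alarm in k looks would be 1 - 0.95^k. They are not independent, because each look reuses the earlier data, so the real figure is lower but still well above 5%. The sketch below runs an A/A test, where both arms have the same true conversion rate so any "winner" is a false positive, and compares stopping at the first significant peek with looking once at the end. The 10% rate, 1,000 users per arm and a peek every 100 users are numbers I made up, and the test is a plain two-proportion z-test, not whatever any particular framework uses.

```python
import math
import random


def z_score(conv_a, n_a, conv_b, n_b):
    """Two-proportion z statistic using the pooled conversion rate."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0
    return (conv_b / n_b - conv_a / n_a) / se


def false_positive_rates(trials=1000, n_per_arm=1000, peek_every=100,
                         rate=0.1, z_crit=1.96, seed=1):
    """Return (rate with peeking, rate with one look at the end)."""
    rng = random.Random(seed)
    peeking_hits = 0
    fixed_hits = 0
    for _ in range(trials):
        conv_a = conv_b = 0
        stopped_early = False
        for i in range(1, n_per_arm + 1):
            # Both arms share the same true rate: there is no real difference.
            conv_a += rng.random() < rate
            conv_b += rng.random() < rate
            if i % peek_every == 0 and not stopped_early:
                # A peek: would we have declared a winner here?
                if abs(z_score(conv_a, i, conv_b, i)) > z_crit:
                    stopped_early = True
        peeking_hits += stopped_early
        fixed_hits += abs(z_score(conv_a, n_per_arm, conv_b, n_per_arm)) > z_crit
    return peeking_hits / trials, fixed_hits / trials


if __name__ == "__main__":
    peeking, fixed = false_positive_rates()
    print("false positives with peeking:", peeking)
    print("false positives with one look:", fixed)
```

The single look at the end should land close to the advertised 5%, while the version that stops at the first "significant" peek declares a winner several times as often, even though there is nothing to find.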

Now there are ways you can tell that something is statistically significant really early in a test. "Bayesian Statistics and the Efficiency and Ethics of Clinical Trials" deals with these. In medical trials you want to know as early as possible if a new treatment is better or worse than an old one. Giving someone the wrong ad won't kill anyone, but the wrong cancer drug might. This paper goes through how you would figure this out using Bayesian methods. These methods are also described in chapter 37 of MacKay's 'Information Theory, Inference, and Learning Algorithms'.
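To give a flavour of the Bayesian way of asking the question, here is a minimal sketch. It is not the method from the paper or from MacKay's chapter, just the simplest version I know: put a uniform Beta(1, 1) prior on each conversion rate, update it with the clicks seen so far, and estimate the probability that B's rate is higher than A's by drawing samples from the two posteriors. The function name and the example counts are invented for illustration.

```python
import random


def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=20000, seed=1):
    """Monte Carlo estimate of P(rate_B > rate_A) under Beta(1, 1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior for each arm is Beta(1 + conversions, 1 + non-conversions).
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / draws


if __name__ == "__main__":
    # Assumed counts: 100 of 1000 users converted on A, 150 of 1000 on B.
    print(prob_b_beats_a(100, 1000, 150, 1000))
```

The output is a direct statement about "how likely is it that B is better, given what we have seen so far", which is a different quantity from a p-value. How you then turn that probability into a stopping decision is a separate design choice that this sketch does not address.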

But looking at the code GAE bingo uses for A/B testing, they do not seem to be using these methods. So it looks to me like they are making the mistake of letting you stop a test whenever you want, which in frequentist statistics can be an error.

Also, I think Vanity, another Rails A/B testing framework, makes the same assumption:
"This experiment will conclude once it has 1000 participants for each alternative, or a leading alternative with probability of 95% or higher:"

The system used by the BBC is based on time and not numbers, according to this article: "Example use
For 5 in 100 people to get a two-option test running for 24 hours the function is initialised like this:". That is not nearly as bad, but it assumes that by the end of the time period you have had enough users to make a good test.

There is a proper explanation of what can happen if you stop a trial early in "How not to run an A/B test". But from my reading, many of the A/B testing frameworks out there seem to be making this error. Please correct me in the comments.

Addition: Ben from gae bingo got back to me in a comment on their blog.
"You're right that this is an issue, and that's a great blog post. However, this is significantly mitigated by a) letting your experiments run long enough to get a large-ish sample size for your population and b) simply not checking your dashboard constantly and making snap decisions.

We could build stuff into the system to mandate that, but at the moment I believe we'll be able to get solid value out of the existing framework (just like most A/B systems)." This seems fair enough. Khan are going to have such a high volume of users that they will be able to get a large sample size quickly.

Allen Downey has a great review of the problem, with simulations, here.