A/A testing (jvns.ca)
104 points by luu on Feb 7, 2015 | 35 comments


You might be interested in the statistical technique called "bootstrapping": http://en.wikipedia.org/wiki/Bootstrapping_(statistics)

The "A/A" method described is not a terribly robust way to estimate variance, but the basic idea of using subsamples to estimate variance is what bootstrapping does more systematically.


I agree. Bootstrapping is somehow not covered in most "first course in statistics" offerings, yet it is very valuable in practice for data people, especially at startups: you don't have that much data, and most of your data doesn't naturally follow anything like the normal distribution, so it can be misleading to use normal theory to estimate variance. The bootstrap helps you combat this.


Yes, bootstrapping should be introduced far sooner. Most of the variables we are interested in, say revenue per session, are in no way normally distributed, thus violating the assumptions of classical two-sample t-tests. Bootstrapping and Monte Carlo methods provide a better solution than parametric tests.


Most classical tests don't require the variable to be normally distributed, they require the test statistic to be. I.e., you don't need revenue/session to be normal, you need sum(revenue_per_session) to be normal. As long as you don't have any long tails and your variables are IID, that will happen: https://en.wikipedia.org/wiki/Central_limit_theorem
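
You can see this directly with a quick simulation: the raw observations below are badly skewed, but the per-bucket means come out close to normal. A rough sketch with made-up numbers:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)

    # Heavily skewed per-session revenue (illustration only).
    population = lambda n: rng.exponential(scale=3.0, size=n)

    # Distribution of the *mean* over buckets of 1,000 sessions each.
    bucket_means = np.array([population(1000).mean() for _ in range(2000)])

    # Normality test on the raw data vs. on the bucket means.
    print("raw data:     normaltest p =", stats.normaltest(population(2000)).pvalue)
    print("bucket means: normaltest p =", stats.normaltest(bucket_means).pvalue)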

More interestingly, things like revenue/visitor have a known probability distribution. It's not normal, but it is known. You can use a LOT fewer samples if you use a parametric test (either Bayesian or SPRT) based on the correct distribution.

If you use bootstrapping instead, you'll a) give up all your finite-sample guarantees and b) wind up using a LOT more samples than you need.


But how useful is comparing sum(revenue_per_session) when you want to test the significance of one batch against the other? Aren't you then just comparing 2 values and seeing which is greater?

If you compare the 2 batches of revenue/session distributions using a monte-carlo simulation you can calculate the probability that one is significantly different than the other. This generalizes beyond the 2 sample t-test because those underlying distributions are non-normal.

Please let me know if I'm thinking of this correctly (or not)


Ok, to test one relative to the other, you might test W=sum(revenue_per_session_A - revenue_per_session_B). Interpret the subtraction as a vector op. (Adjust a bit if you want to do a Welch test.) Assuming the CLT holds this statistic is normally distributed. Assuming the null hypothesis holds, it has mean 0.

Thus, you can do all your normal Stats 101 tests on it.
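
Concretely, that ends up being an ordinary two-sample test; a sketch of the Welch variant mentioned above, on invented data:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)

    # Invented per-session revenue for the two arms.
    rev_a = rng.lognormal(0.0, 1.0, size=10_000)
    rev_b = rng.lognormal(0.02, 1.0, size=10_000)  # B slightly better

    # Welch's t-test: does not assume equal variances, and relies on the CLT
    # making the difference of means approximately normal.
    t, p = stats.ttest_ind(rev_a, rev_b, equal_var=False)
    print(f"t = {t:.3f}, p = {p:.4f}")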

> If you compare the 2 batches of revenue/session distributions using a monte-carlo simulation you can calculate the probability that one is significantly different than the other.

A frequentist test (which includes most bootstrap methods) can never tell you this. Frequentist statistics doesn't even acknowledge this as a legitimate question to ask.

Now I agree, if you can use the exact distribution of revenues directly in the test, you can get answers even before you have enough samples for the CLT to apply. But if you use a nonparametric method like bootstrap, you'll need to use up a lot of samples unnecessarily.


>More interestingly, things like revenue/visitor have a known probability distribution.

This actually depends. Based on my experience at a very early stage startup, it was definitely not the case for some datasets (I even tried fiddling with various well-known distributions' parameters).

If I recall correctly, I believe that the bootstrap had some asymptotic guarantee on the rate of convergence (although my memory is hazy on this)?

EDIT: never mind, it is asymptotic, hence not finite-sample necessarily.


I had the same thought. The bootstrap is a really simple and easy-to-implement technique that we've had for decades.

There is even some recent work on a "Big Data" (distributed) version of the bootstrap from Michael Jordan's group [1]. It's also pretty easy to implement and can be really useful in practice.

[1] http://onlinelibrary.wiley.com/doi/10.1111/rssb.12050/full
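
That method is the "Bag of Little Bootstraps". A rough single-machine sketch of the idea, with arbitrary parameters (the point of the paper is that each subset can live on a different machine):

    import numpy as np

    rng = np.random.default_rng(3)
    data = rng.lognormal(0.0, 1.5, size=100_000)  # made-up data
    n = len(data)

    def blb_se_of_mean(data, n_subsets=10, subset_size=2_000, n_resamples=100):
        """Bag of Little Bootstraps estimate of the standard error of the mean."""
        ses = []
        for _ in range(n_subsets):
            subset = rng.choice(data, size=subset_size, replace=False)
            means = []
            for _ in range(n_resamples):
                # A resample of full size n, represented by multinomial counts
                # over the small subset (so we never materialize n points).
                counts = rng.multinomial(n, np.full(subset_size, 1.0 / subset_size))
                means.append(np.dot(counts, subset) / n)
            ses.append(np.std(means, ddof=1))
        return float(np.mean(ses))

    print("BLB SE of the mean:", blb_se_of_mean(data))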


I use bootstrapping all the time. Such an easy way to estimate the variance of your mean, median, variance, etc. The most important issue I've come across is making sure that your experiments are really as random as you can get. Otherwise you can end up with biases due to systematically picking outliers.


I really love how this very clearly visualises the need for statistical significance. To me, a novice, the A/A/B chart is wildly more illustrative of the point than an A/B chart based on the same sample data with some significance number next to it. I understand from some of the comments here that there are all kinds of ways in which this A/A/B thing is subpar. But if the chance of someone misinterpreting a chart decreases more than the chance of the chart itself being misleading, then isn't it a big win?

I'm really nerd sniped here. Is there any branch of statistics that focuses on human understanding? For example, there's all kinds of blogs and stories out there about how doctors routinely make wrong choices because they don't understand statistics well enough. Is there any serious body of knowledge that explores ways of getting these doctors to make these mistakes less frequently, without having to send them to sites with titles like "An Intuitive Explanation of Eliezer Yudkowsky’s Intuitive Explanation of Bayes’ Theorem"?


I have no idea. But I will chime in with one factoid I've heard a few times. If you say things like "the false positive rate is 9%", people's intuitions lead them astray, but if you say things like "9/100 healthy people are labeled as sick by this test", then intuitions work much better.


The problem is that, more likely than not, doing only one resample (in bootstrap terminology) won't lead to any clear verdict on statistical significance. That is, in practice this "A/A/B" testing quite likely leaves the second control group looking pretty much like the first, so you can't clearly argue whether the control deviates far from the "Great Idea" or not.

What do you do then? Well, the natural answer is to try the resample again. Do this n times, get an average for the variability, and that is precisely the bootstrap.
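
In code, repeating the A/A split many times to get a feel for the control's variability might look something like this (fabricated data):

    import numpy as np

    rng = np.random.default_rng(4)
    control = rng.lognormal(0.0, 1.0, size=20_000)  # fabricated control data

    def aa_differences(control, n_splits=2000, rng=rng):
        """Repeatedly split the control in half and record the difference in means."""
        diffs = []
        for _ in range(n_splits):
            shuffled = rng.permutation(control)
            half = len(shuffled) // 2
            diffs.append(shuffled[:half].mean() - shuffled[half:].mean())
        return np.array(diffs)

    null_diffs = aa_differences(control)
    # Any real A-vs-B difference can now be judged against this "A/A" spread.
    print("95% of A/A differences fall within +/-",
          np.quantile(np.abs(null_diffs), 0.95))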


Business and making money is always a trade-off between gathering more information and making quicker decisions cheaply. The nice thing about business/making money is that detail doesn't matter; a small information gain on your decisions can pay huge dividends. I mean, the data is usually dirty and disgusting in the first place.


Another cool way to cover your bases is to run monte carlo simulations.

At my previous employer, we open sourced a command line utility that we used to validate our statistical models if anyone's interested: https://github.com/monetate/monte-carlo-simulator
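
Not related to the linked tool, but one example of what such a simulation can check is a test's false positive rate: simulate lots of A/A experiments with no true difference and confirm that roughly 5% come out "significant". A sketch:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(5)

    def false_positive_rate(n_sims=2000, n_per_arm=5000, alpha=0.05):
        """Simulate A/A experiments (no true difference) and count false positives."""
        hits = 0
        for _ in range(n_sims):
            a = rng.exponential(scale=2.0, size=n_per_arm)
            b = rng.exponential(scale=2.0, size=n_per_arm)
            _, p = stats.ttest_ind(a, b, equal_var=False)
            hits += p < alpha
        return hits / n_sims

    print("empirical false positive rate:", false_positive_rate())  # should be ~0.05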


A/A tests (also known as Null tests) are useful to validate that users are assigned to the control and experiment groups without bias.

Offline resampling methods, like bootstrapping, are better if you're looking to robustly estimate the variance of the experiment.
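
One concrete check along those lines is the sample ratio: if users are supposed to be split 50/50, the observed counts should be consistent with that. A small sketch with invented counts:

    from scipy import stats

    # Invented assignment counts for a supposed 50/50 split.
    observed = [50_812, 49_188]
    expected = [sum(observed) / 2] * 2

    chi2, p = stats.chisquare(observed, f_exp=expected)
    print(f"chi2 = {chi2:.1f}, p = {p:.2e}")
    # A tiny p-value here means the bucketing itself is likely biased,
    # regardless of what the metrics say.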


Yes, a Null test really should be the first test you run if you're looking to get into A/B testing.

You should also run an ongoing A/A test across your site or app to have confidence that your bucketing, data pipeline, stats tests, and effect on metrics are working as expected over time.


Bootstrapping is a simple-to-explain and extremely powerful statistical method that is essentially the equivalent of an A/A/A/A/A/A/.../B/B/B/B/B/B/... test in OP's terminology.

What is especially powerful about bootstrapping is that it doesn't make simplifying assumptions about the underlying distribution, unlike other methods for obtaining confidence intervals.
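
As an illustration, a percentile bootstrap confidence interval for the difference in mean revenue between two groups might look like this (invented data):

    import numpy as np

    rng = np.random.default_rng(6)
    rev_a = rng.lognormal(0.00, 1.2, size=8000)   # invented group A
    rev_b = rng.lognormal(0.05, 1.2, size=8000)   # invented group B

    diffs = []
    for _ in range(5000):
        resampled_a = rng.choice(rev_a, size=len(rev_a), replace=True)
        resampled_b = rng.choice(rev_b, size=len(rev_b), replace=True)
        diffs.append(resampled_b.mean() - resampled_a.mean())

    lo, hi = np.quantile(diffs, [0.025, 0.975])
    print(f"95% bootstrap CI for the B-A difference in means: [{lo:.3f}, {hi:.3f}]")
    # If the interval excludes 0, the difference is unlikely to be pure noise.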


Your example shows that simple statistical tests aren't always that simple, especially when talking about small effects.

What you really want are confidence intervals, which show what would count as a significant change. You can calculate those from your A data and from your B data. If they overlap, you probably aren't quite there yet.

Comparing A/A vs. B or A/A/...A/A vs. B/B/...B/B is a poor man's approach to visualize the distribution of the mean values.

Things get further complicated when you run a lot of tests. If you do hundreds of A/B tests and a handful show a weakly significant result, that may actually be a statistical fluke. The likelihood that a wrong but seemingly significant result turns up when doing hundreds of tests can actually be pretty high. You should rerun those tests with fresh data and check for consistency, which in itself is some kind of A/A/B/B test.
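
The arithmetic behind that is sobering: with 100 independent tests at alpha = 0.05 and no real effects anywhere, the chance of at least one "significant" result is about 99.4%. A tiny sketch, including the crudest fix (Bonferroni):

    # With no real effects, the chance of at least one false positive grows fast.
    alpha, n_tests = 0.05, 100
    print("P(at least one false positive):", 1 - (1 - alpha) ** n_tests)  # ~0.994

    # Bonferroni: require p < alpha / n_tests for each individual test.
    print("per-test threshold after Bonferroni:", alpha / n_tests)  # 0.0005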


A/A testing is a waste of precious testing time https://plus.google.com/105925791633746539648/posts/EhFuZ6Fh...


Although the link you refer to here does say A/A testing is a waste of time, the OP article says the headline is a bit of a misnomer - they're actually describing A/A/B testing.


> and tells you how long you’ll need to run your experiment for to see statistical significance.

I always thought that statistical significance isn't something you should try to achieve, but merely an indicator of how good the experiment was. Isn't it odd to try to "achieve significance over time"?

Shouldn't it be: "Your experiment requires 5,000 visitors and after that we'll check if the result was significant enough to not be merely due to random chance"?

Could someone with more statistical understanding elaborate this a bit?


> Shouldn't it be: "Your experiment requires 5,000 visitors and after that we'll check if the result was significant enough to not be merely due to random chance"?

That's basically what is happening with the tool, I think. It asks how many users per day you get in order to approximate the sample size for x days, then it asks how much power you want. Power is the likelihood of detecting a difference if there is one. It also asks what confidence level you want. All of those together give an approximate answer for the amount of time, assuming the number of users per day is roughly constant.


There are 4 inputs needed to estimate sample size for a test: power, confidence level, expected difference, and variance. You need all 4 before you run any test. You use the A/A test to estimate variance. Power is the probability of detecting an x% difference when one really exists; typically you see .8 or .9. The confidence level (really the significance level) is the probability of detecting a difference when one really does not exist, typically .05. The 4th item is the expected difference of the test: if you want to detect a 1% difference, you will need a larger sample than if you want to detect a 5% difference.

You have to know all 4 before you do a test. A test is designed specifically to detect a certain difference. You cannot launch a test without knowing that as part of your hypothesis.
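
Those four inputs plug straight into the usual sample-size approximation for comparing two means; a sketch with placeholder numbers:

    from scipy import stats

    def n_per_group(sigma, min_detectable_diff, alpha=0.05, power=0.8):
        """Approximate sample size per arm for a two-sample test of means."""
        z_alpha = stats.norm.ppf(1 - alpha / 2)   # two-sided significance
        z_beta = stats.norm.ppf(power)            # desired power
        return 2 * ((z_alpha + z_beta) * sigma / min_detectable_diff) ** 2

    # Placeholder numbers: sd of revenue/session = 4.0 (e.g. from an A/A test),
    # and we want to detect an absolute difference of 0.1.
    print(round(n_per_group(sigma=4.0, min_detectable_diff=0.1)))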


Yep, that's a more precise version of what I was saying w.r.t. estimating sample size. I think the tool makes some assumption about variance, but the other 3 are things you supply. Note that I wasn't saying anything about the A/A test article, just the sample size estimator that's linked to.


Please see the papers here: http://www.exp-platform.com/Pages/default.aspx

These are from the team that built Amazon's Weblab - the foundation of large-scale web experimentation.

To be working in this field and not be familiar with this work, e.g. the concept of A/A testing, is like deciding to build jet engines without having heard of the idea of a bypass ratio.


That's an excellent link! Wasn't aware of it, despite heavy involvement in analytics.


A box plot is another - I think better - way to represent the variance in a control group A. See http://www.mathworks.com/matlabcentral/fileexchange/screensh...
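
For instance, splitting the control into a handful of random subgroups and box-plotting the per-subgroup metric gives a quick picture of the noise floor. A matplotlib sketch with made-up data:

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(7)
    control = rng.lognormal(0.0, 1.0, size=12_000)  # made-up control metric

    # Split the control into 6 random subgroups and box-plot their distributions.
    subgroups = np.array_split(rng.permutation(control), 6)
    plt.boxplot(subgroups, showfliers=False)
    plt.ylabel("revenue per session")
    plt.xlabel("random subgroup of the control")
    plt.show()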


Definitely an interesting stats hack. I love the point the other comments are making: learn more about simulation and bootstrapping. It'll still require a little probability, but all of the results will make a ton of sense.


But, but, but... Optimizely?


How would this be different from simply increasing the size of the control group? Or from subdividing the control into N groups of sufficient size in order to visualize the variation more effectively?


If you just have one big control, it still doesn't tell you about the variance inside the control. You can subdivide the control into N groups, but I think you quickly increase the noise-to-signal ratio.

A control group split into two is a good compromise, and intuitive to reason about, as the author points out.


But statistically, they're exactly equivalent. I guess I don't get the advantage.


Huh.

A quick and dirty way to avoid having to do much of any stats. Interesting.


It's like a reverse of Simpson's paradox


Seriously? How is this any more accurate than a pure hunch or eyeballing?



