If the stats are based on samples of real-world data, rather than on rigorously designed experimental comparisons, I think there's another reason to be wary of p-values. [Note: in what follows I probably use some terminology wrong, because I'm not a statistician, but I do think the point is important and I don't see much written about it.] In the real world, data is not a bunch of independent events, but events (or data points) that are interconnected in ways that are difficult to quantify.

I once heard a presentation by an expert in A/B testing at a leading tech company who became wary of his results and brought his concerns about non-independence of the events being tested to the corporate statisticians. (He was concerned that, even though the A/B testing procedure supposedly randomized the test, interconnectedness within website traffic was not being accounted for.) By his account, the statisticians agreed it was a problem but recommended he assume that variations due to non-independence would more or less balance each other out. He wasn't satisfied with this, so he took some more measurements and worked out the binomial expansions in full rather than relying on approximations. When he did this more detailed work, he found that in at least some web-based A/B tests where the conventional statistical formulas showed a p-value of .05, the real figure was more like .30. (I don't think you could strictly call his measurement a p-value, since he wasn't using the formulas normally used to compute p-values, but his point was more or less that the formulas said .05 where a more rigorous look said .30.)
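To make this concrete, here's a rough simulation of the kind of thing I suspect he was seeing. The clustering mechanism and every number below are my own guesses for illustration, not anything the presenter described: visitors arrive in clusters (say, a sharer plus the friends who follow their link), conversions within a cluster are correlated, the shared link carries the sharer's variant so the whole cluster lands in one arm, and the analysis runs a naive two-proportion z-test that treats every visit as independent.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(42)

    def one_null_experiment(n_clusters=100, cluster_size=50, base_rate=0.10, sd=0.07):
        # Each cluster shares a common effect, so its conversions are correlated...
        rates = np.clip(rng.normal(base_rate, sd, n_clusters), 0.0, 1.0)
        # ...and the shared link carries the sharer's variant, so the whole
        # cluster lands in one arm instead of being randomized visitor by visitor.
        arm = rng.integers(0, 2, n_clusters)
        conv = rng.binomial(cluster_size, rates)
        n = np.array([cluster_size * (arm == g).sum() for g in (0, 1)])
        x = np.array([conv[arm == g].sum() for g in (0, 1)])
        # Naive pooled two-proportion z-test that assumes independent visits.
        p_pool = x.sum() / n.sum()
        se = np.sqrt(p_pool * (1 - p_pool) * (1 / n[0] + 1 / n[1]))
        z = (x[0] / n[0] - x[1] / n[1]) / se
        return 2 * stats.norm.sf(abs(z))

    # There is no treatment effect at all, so a valid test should reject ~5% of the time.
    pvals = np.array([one_null_experiment() for _ in range(2000)])
    print((pvals < 0.05).mean())  # comes out well above 0.05 with these settings

With settings like these the naive test rejects several times more often than its nominal 5%, which is at least qualitatively the gap the presenter described.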
As I'm not a statistician by trade I don't keep up on the literature very well, but interconnectedness of data does seem to me to be a very important issue. I'm wondering if anyone can point me to some helpful reading to understand this side of the issue better. In particular, is there any approach to AB testing that can reliably address the issue of data interconnectedness in the kind of situation described above?
Could you expand on what you mean by "interconnectedness within website traffic"?
If one person visiting the site has no influence on other people visiting it, then measurements of their behavior will be independent. If Facebook tests a different interface on half of its users and the changed behavior of those users indirectly affects the behavior of the control group, then your measurements would have some level of dependence. I can imagine this happening, but it's not clear to me that it would be a common scenario. The same would happen if you measured the behavior of the same person more than once, but in that case there are many procedures for working with paired or autocorrelated data.
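A toy illustration of one such procedure (hypothetical numbers, just to show the mechanics): if you measure the same users before and after a change, the two sets of measurements are dependent, and a paired test accounts for that where an unpaired test would not.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    baseline = rng.normal(10, 3, 200)                # per-user baseline engagement
    before = baseline + rng.normal(0, 1, 200)
    after = baseline + 0.3 + rng.normal(0, 1, 200)   # small true shift

    print(stats.ttest_ind(before, after))  # treats the samples as independent: noisy
    print(stats.ttest_rel(before, after))  # paired test: the per-user effect cancels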
The presenter I mentioned did not go into details about what interconnectedness he found, but I think it's quite obvious that people visiting a site do influence other visitors, which is at least part of the underlying issue. On the simplest level, most websites have share buttons precisely to make it as easy as possible for visitors to influence other traffic. To give other examples: a trending tweet can massively shift patterns of usage of a website or Facebook page (or many tweets with small reach can cause many small shifts), and an RSS feed might influence patterns of tweeting or posting elsewhere. There are myriad other interconnections within web traffic. We do a great deal of work to drive traffic, and that work is premised on visitors to websites being largely interconnected. These are the factors that give me pause when I think about statistical measures premised on the assumption that we're measuring independent events.
Hmmm, again, there are certainly ways that interactions between visitors can cause statistical dependence, but not in the specific case you mention. Let's take an A/B test on a referral funnel. If a user invites all of his friends, and his friends then visit the site, they will be randomized over A and B just like the original user, and so any effect that is not due to changes in the referral experience will simply not matter because it will contribute equally to both groups.
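Using the same sort of toy model as your sketch above, but randomizing each visitor individually, you can see the cancellation (this is my own construction, same caveats): clusters of correlated visitors, but every cluster contributes to both arms, so its effect drops out of the comparison and the naive test keeps roughly its nominal level.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(7)

    def one_null_experiment(n_clusters=100, cluster_size=50, base_rate=0.10, sd=0.07):
        rates = np.clip(rng.normal(base_rate, sd, n_clusters), 0.0, 1.0)
        visitor_rate = np.repeat(rates, cluster_size)   # correlated clusters
        arm = rng.integers(0, 2, visitor_rate.size)     # per-visitor randomization
        conv = rng.random(visitor_rate.size) < visitor_rate
        n = np.array([(arm == g).sum() for g in (0, 1)])
        x = np.array([conv[arm == g].sum() for g in (0, 1)])
        p_pool = x.sum() / n.sum()
        se = np.sqrt(p_pool * (1 - p_pool) * (1 / n[0] + 1 / n[1]))
        z = (x[0] / n[0] - x[1] / n[1]) / se
        return 2 * stats.norm.sf(abs(z))

    pvals = np.array([one_null_experiment() for _ in range(2000)])
    print((pvals < 0.05).mean())  # close to the nominal 0.05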
Without better examples it's very hard to judge whether this is a real problem.
I understand if you think this is a non-issue, though I don't agree. The speaker I referenced above asked the statisticians at his company about this, and they said it was a non-issue because things balanced out. He thought that was an idealization, claimed to have tested it by building in some real-world data, and reported that interconnected data of this kind drastically affected confidence levels. He didn't get into the details of how he measured interconnectedness, however.
The example you give seems to me to oversimplify the complex interconnections between data points, as if the traffic on a real website came from one set of referrals. In reality it's much more complex: referrers induce other referrers, and a variety of campaigns, postings, etc. influence each other over time, overlaid in a fairly complex pattern. In other words, a bunch of interrelated data, very little of which is actually independent of the other items.
I'm not really asking for an explanation of this in the comment thread here; what I'd like to know is whether there are any studies or other publications that deal with how to evaluate tests run on interconnected data of this kind.
There are absolutely ways to deal with what you call interconnected data, as I mentioned earlier: paired tests, corrections for autocorrelation, nonparametric and bootstrap methods for non-normal data and so on. But barring any examples of what you mean with interconnectedness in this context, it's hard to recommend any studies or publications because there is no One Method Of Interconnectedness Correction.
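As one concrete example of the bootstrap flavor (toy code, my own construction): a cluster bootstrap resamples whole clusters rather than individual visits, so whatever dependence exists inside a cluster is preserved in every resample.

    import numpy as np

    rng = np.random.default_rng(3)

    def cluster_bootstrap_ci(conv_a, n_a, conv_b, n_b, n_boot=5000):
        # conv_* / n_*: arrays of per-cluster conversion and visit counts per arm.
        diffs = np.empty(n_boot)
        for i in range(n_boot):
            ia = rng.integers(0, len(conv_a), len(conv_a))  # resample clusters,
            ib = rng.integers(0, len(conv_b), len(conv_b))  # not single visits
            diffs[i] = (conv_a[ia].sum() / n_a[ia].sum()
                        - conv_b[ib].sum() / n_b[ib].sum())
        return np.percentile(diffs, [2.5, 97.5])  # 95% CI for the rate difference

But which of these tools applies depends entirely on what the dependence actually is, which is why examples matter.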
Also, statistics deals with many idealizations but the idea that randomization allows you to cleanly measure the effect of an intervention in the face of what would otherwise be confounding is simply not one of them. Sorry to disappoint, but with all you're telling us it simply sounds like the speaker was clueless.
Well, if he was clueless then two very large and successful tech companies had a clueless guy running their AB testing and showing great results in each context.
I'm certainly not looking for "One Method for Interconnectedness Correction" (especially not, as you put it, with each word capitalized). I'm looking for studies or papers that might have addressed anything like the effect of interconnectedness of web data on AB testing. I think you're saying, you don't know of any, and also that you personally don't think it's a real issue.