Ok, interesting. I was about to dismiss that approach, but instead took some time to think about it, and it might just work.

The problems I see:

> 1. Query lots(100 000s) of Youtube videos and store the comments associated with the video.

One would have to do this as early as possible, before the comments change too much, since the thesis is that they have already changed. Though I would be surprised if no studies had used comparable data; maybe something like that is already available?

> 2. Repeat the operation after 6 months when the Google+ integration goes in full effect and there is enough G+ comments.

Is there another step of the integration still to come? If not, one wouldn't have to wait that long.

> 3. Label an initial set of comments (100 000s) as spam, non-spam, hateful, sexist, neutral, etc using Mechanical Turk.

That is the main problem. I'm not convinced that the new set of comments is easily detectable as offending, since the trolls now seem to make more use of context. "First" and Rickrolling are things of the past. Besides, even given the low prices there, rating 100k comments would cost a lot…
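To put a rough shape on the cost worry, here is a back-of-the-envelope sketch. Every number in it is a made-up assumption (I don't know the actual Mechanical Turk prices), and `labeling_cost_cents` is just a hypothetical helper name; the point is only that the total scales with comments × ratings per comment × price per rating.

```python
def labeling_cost_cents(n_comments: int, ratings_per_comment: int, price_cents: int) -> int:
    """Total crowd-labeling cost in cents.

    Each comment is rated several times so that disagreeing raters
    can be resolved by majority vote.
    """
    return n_comments * ratings_per_comment * price_cents


# Illustrative assumptions only, NOT real Mechanical Turk prices:
# 100,000 comments, 3 independent ratings each, 2 cents per rating.
total_cents = labeling_cost_cents(100_000, 3, 2)
print(f"${total_cents / 100:,.2f}")
```

Plug in whatever the real per-rating price turns out to be; redundancy per comment multiplies the bill just as much as the price does.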

But still. Even something like "they changed a lot and are hard to compare" would be an interesting result.

The algorithm is of course the next question: is something like that easily doable given the nature of the comments?

Hm. Is that something you are seriously considering doing? It could be an interesting experiment, and it would surely make an interesting HN-worthy article. If you are in academia, it might even be worthy of a publication (maybe something like "study of the effect of de-anonymization on commenters on an internet platform"), or at least a few credit points. Is there a working API to get those comments?