In this context it's probably a better measure of randomness than true randomness.
What 'appears random' is a better measurement, because it matches what people are trying to do.
People are not good at recognizing randomness; they confuse it with homogeneity. True randomness generates more human-recognizable patterns than people think. If you ask people to generate a random string of 1's and 0's, they avoid long runs of 0's or 1's too much.
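To put a rough number on the "long runs" point, here is a minimal Python sketch. The helper names `longest_run` and `average_longest_run` are made up for illustration, and the simulation uses Python's pseudorandom generator as a stand-in for a fair coin, so treat it as a sanity check rather than a proof.

```python
import random
from itertools import groupby

def longest_run(bits):
    """Length of the longest run of identical symbols in a string."""
    return max((len(list(group)) for _, group in groupby(bits)), default=0)

def average_longest_run(n_bits, trials=1000, seed=0):
    """Average longest run over many pseudorandom 0/1 strings of length n_bits."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    total = 0
    for _ in range(trials):
        bits = "".join(rng.choice("01") for _ in range(n_bits))
        total += longest_run(bits)
    return total / trials
```

Calling `average_longest_run(100)` should land somewhere around log2(100), i.e. runs of six or seven identical symbols are typical for a fair coin at that length. A hand-written "random" string whose longest run is three or four would stand out as too even.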
I don't quite follow what you're attempting to say here. I mean to say that the study is flawed because "It's a study on your ability to generate random list of results" carries the inference that those who generate the most random results and those who appear to do so by human estimation are the same people. That inference fails because humans are actually bad at randomness, exactly as you yourself say here.
So what "appears random" is not at all a good measure of what is "actually random", to put it as simply as possible.
The goal is the ability to generate strings that match the "approximate sense of complexity" (ASC) of the subjects. To do so requires the ability to avoid any routine and inhibit prepotent responses.
The research goal was to measure cognitive ability, and randomness is just the measuring stick. The actual mathematical complexity is correlated, but there is human bias. The bias itself is irrelevant if it's constant. What is relevant is how closely subjects can generate strings that appear random and complex to humans (randomness with bias).
In other words
measure = statistical randomness + bias
Because the bias is almost universal (see the modulating factors in the article), it doesn't interfere with the thing they are trying to measure.
Why? Do you suspect that subjects were deliberately making their choices less random because they anticipated the poor performance of another person in assessing randomness? It seems like a good assumption that they knew they were trying to fool a sophisticated randomness test that they couldn't second-guess.
Considering that a random generator could generate 11111111111111111111111111 5 times in a row, it's a bit hard to call someone's test results a failure of randomness unless you claim you want a Gaussian distribution or some other type of "character of randomness", i.e. running diehard tests on it and whatnot. A fixed number of trials is deceptive unless you can see the algorithm behind the numbers, and for a human being you can't. So unless you have a giant sample size, and even when you do, there is a bit of a mischaracterization depending on which tests you use to determine how random the data is.
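As one concrete illustration of how much the verdict depends on the chosen test, here is a sketch of the frequency (monobit) test, one of the simplest checks in the NIST SP 800-22 suite. The function name `monobit_p_value` is my own, and this is not necessarily anything the study used.

```python
import math

def monobit_p_value(bits):
    """Two-sided p-value of the frequency (monobit) test for a 0/1 string."""
    n = len(bits)
    if n == 0:
        raise ValueError("need at least one bit")
    # Map 1 -> +1 and 0 -> -1, then sum; a fair source keeps this near zero.
    s = sum(1 if b == "1" else -1 for b in bits)
    # Under the fair-coin hypothesis, s / sqrt(n) is approximately standard normal.
    return math.erfc(abs(s) / math.sqrt(2 * n))

# 26 ones in a row: the test flags it, even though a fair generator
# can emit exactly this string.
all_ones = monobit_p_value("1" * 26)

# Strict alternation: perfectly balanced, so this particular test
# sees nothing wrong with it.
alternating = monobit_p_value("01" * 13)
```

Here `all_ones` comes out tiny while `alternating` is exactly 1.0. The test rejects a string a true random source could produce and passes one that is obviously patterned, so a single test on a short sample says little by itself.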
It's not supposed to be a measure of "true linguistic proficiency" but a measure of the capacity to generate linguistic-proficiency-like results.
So in each case what's the gap between actual linguistic proficiency / randomness, and the appearance thereof? And of what value is measuring these human perceptions rather than the actual facts in each instance? For example, you could take all the results, put them into a scatter output and see if there is actually a pattern in the pseudorandom data, or formally analyse the grammar and spelling in question and verify that it is technically correct rather than just "English-sounding" (https://youtu.be/gU4w12oDjn8?t=2m).
Can we draw conclusions about that Italian gentleman's ability to make a song that sounds like English pop music "better" than an English pop music song that is actually technically grammatically correct, and use it to infer that he's got better English skills than the writer of the technically correct song?
And if not, why are we trying to make statements about the ability of some randomness source that are not based on any actual measure of true randomness?
>So in each case what's the gap between actual linguistic proficiency / randomness, and the appearance thereof?
That there are no hard constraints/expectations, unlike when measuring the quality of e.g. a software random number generator implementation.
They don't expect to find true randomness in the results, just to measure how much randomness (entropy if you will) those various age groups are capable of producing.
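For what "how much entropy" could mean in practice, here is a minimal sketch of an empirical Shannon entropy estimate over overlapping blocks. The name `block_entropy` and the block length `k` are illustrative choices of mine, and the study may well have used a different complexity measure.

```python
import math
from collections import Counter

def block_entropy(seq, k=1):
    """Empirical Shannon entropy, in bits, of overlapping length-k blocks."""
    blocks = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not blocks:
        raise ValueError("sequence is shorter than the block length")
    counts = Counter(blocks)
    n = len(blocks)
    # Sum of p * log2(1 / p) over the observed block frequencies.
    return sum(c / n * math.log2(n / c) for c in counts.values())
```

With `k=1`, the string "01010101" scores a full 1 bit per symbol because 0 and 1 are equally frequent. With `k=2` it scores just under 1 bit out of a possible 2, since only the blocks "01" and "10" ever occur. Looking at longer blocks is one way to catch the regularities that single-symbol counts miss.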