I saw this piece as the start of an experiment, and the use of a "council of AIs", as they put it, to average out the variability sounds like a decent path to standardization to me (prompt injection wouldn't be impossible, but getting something past all the steps sounds like a pretty tough challenge).
They mention getting 100% agreement between the LLMs on some questions and lower rates on others, so if an exam were composed only of questions where there is near-100% convergence, we'd be pretty close to a stable state.
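To make that concrete, here's a minimal sketch of what I have in mind, assuming each grader is a stand-in for one model of the council. The grader functions, threshold and scores below are all invented for illustration, not taken from the article:

    from statistics import pstdev

    def council_grade(answer, graders, agreement_threshold=0.5):
        """Grade one answer with a council of independent graders.

        Each grader scores the answer without seeing the others' scores.
        The question is only treated as reliable when the spread of
        scores stays under `agreement_threshold`.
        """
        scores = [grader(answer) for grader in graders]  # independent passes
        spread = pstdev(scores)                          # how much they disagree
        converged = spread <= agreement_threshold
        return sum(scores) / len(scores), converged

    # Toy graders standing in for different models.
    graders = [
        lambda a: 10.0 if "mitochondria" in a.lower() else 3.0,
        lambda a: 9.0 if "powerhouse" in a.lower() else 4.0,
        lambda a: 9.5 if len(a) > 20 else 2.0,
    ]

    grade, ok = council_grade("The mitochondria is the powerhouse of the cell.", graders)
    print(f"mean grade {grade:.1f}, converged: {ok}")

An exam built this way would keep only the questions where `converged` comes back true across a pilot run, which is roughly what "only questions with near-100% convergence" would mean in practice.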
I agree it would be reassuring to have a human somewhere in the loop, or perhaps to allow the students to appeal the evaluation (at cost?) if there is evidence of a disconnect between the exam and the other criteria. But depending on how the questions and format are tweaked, we could IMHO end up with something reliable for very basic assessments.
PS:
> Also if this is about avoiding in person exams, what prevents students from just letting their AI talk to test AI.
Nothing, indeed. The arms race didn't start here, and it will keep going, IMO.
So the whole thing is a complete waste of time then as an evaluation exercise.
> council of AIs
This only works if the errors and idiosyncrasies of different models are independent, which isn’t likely to be the case.
> 100% agreement
When different models independently graded tests, 0% of grades matched exactly and the average disagreement was huge.
They only reached convergence on some questions when they allowed the AIs to deliberate. This is essentially just context poisoning.
One model incorrectly grading a question makes the other models more likely to grade that question incorrectly too.
And if you don't let the models see each other's assessments, all it takes is one student phrasing an answer slightly differently, causing disagreement among the models, to vastly alter the overall scores by getting a question tossed out.
This is not even close to something you want to use to make consequential decisions.
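To put a toy number on the anchoring effect: in the sketch below, "deliberation" means each grader blends its own score with the average of the grades already in the shared context. The `anchoring` weight and all the scores are invented for illustration; this is not how the actual deliberation step works, just the shape of the problem.

    def independent(grades):
        # Every grader scores on its own; the average dilutes a single error.
        return sum(grades) / len(grades)

    def deliberation(grades, anchoring=0.6):
        # Each grader after the first blends its own score with the average
        # of the grades already sitting in the shared context, so an early
        # mistake pulls everyone after it.
        seen = [grades[0]]
        for own in grades[1:]:
            context_avg = sum(seen) / len(seen)
            seen.append((1 - anchoring) * own + anchoring * context_avg)
        return sum(seen) / len(seen)

    # Three graders would each have given an 8, but the first one
    # misreads the answer and gives a 2.
    one_bad_first = [2.0, 8.0, 8.0]

    print("independent :", independent(one_bad_first))   # 6.0  -- the error is diluted
    print("deliberation:", deliberation(one_bad_first))  # ~3.8 -- dragged toward the bad grade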
Imagine that LLMs reproduce the biases of their training sets, and that human data sets are biased toward rating nonstandard speakers (rural accents/dialects/AAVE) as less intelligent. Do you imagine those students' grades won't be slightly biased when the entire "council" is trained on the same stereotypes?
Appeals aren't a solution either, because students won't appeal (or possibly even notice) a small bias given the variability of all the other factors involved, nor can such a bias be properly adjudicated in a dispute.
I might be giving them too much credit, but given the tone of the post, they're not trying to apply this to some super-precise, extremely competitive check.
If the goal is to assess whether a student properly understood the work they submitted, or more generally whether they assimilated most concepts of a course, the evaluation can set the bar low enough for, let's say, 90% of the students to easily pass. That would leave enough margin of error to account for small biases or misunderstandings.
I was comparing this to mark-sheet tests, as they're subject to similar issues, like students not properly understanding the wording (the questions and answers usually have to be worded in pretty twisted ways to work properly) or straight-up checking the wrong lines or boxes.
To me, this method, and other highly scalable methods, shouldn't be used for precise evaluations, and the teachers proposing it also seem aware of these limitations.
A technological solution to a human problem is an appeal we have fallen for too many times these last few decades.
Humans are incredibly good at solving problems, but while one person is solving "how do we prevent students from cheating?", a student is thinking "how do I bypass this limitation preventing me from cheating?". And when these problems are digital and scalable, it only takes one student to solve that problem for every other student to have access to the solution.