One of the things they point out is that the SoTA on e.g. LibriSpeech is only good at LibriSpeech, and doesn't generalise as well.
> Because Whisper was trained on a large and diverse dataset and was not fine-tuned to any specific one, it does not beat models that specialize in LibriSpeech performance, a famously competitive benchmark in speech recognition. However, when we measure Whisper’s zero-shot performance across many diverse datasets we find it is much more robust and makes 50% fewer errors than those models.
My own experience agrees: the generally available "SOTA" models are not especially robust, and can be _extremely_ bad (>50% absolute error rate) at some tasks. I'll post some preliminary numbers in a sibling comment and look into running my full set of tests on Whisper.
It looks like Whisper is probably leaving a lot of accuracy on the table, but initially it does seem to be a lot more robust than general "SOTA" models.
For a quick comparison, Silero's accuracy charts are kind of nice because they post results for a large variety of datasets. Scroll down to the EN V6 xlarge EE model (not the xlarge CE) [1]
> Because Whisper was trained on a large and diverse dataset and was not fine-tuned to any specific one, it does not beat models that specialize in LibriSpeech performance, a famously competitive benchmark in speech recognition. However, when we measure Whisper’s zero-shot performance across many diverse datasets we find it is much more robust and makes 50% fewer errors than those models.