The response-timing chart in the blog post shows that, on top of perfect precision/recall, Sparrow-1 also has the fastest true-positive response times.
The turn-taking models were evaluated in a controlled environment with no additional cascaded steps (LLM, TTS, Phx). This matters for an apples-to-apples comparison: the variability of the rest of the pipeline doesn't influence the measurements.
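For anyone who wants to reproduce that kind of isolated measurement, here is a rough sketch of how precision, recall, and true-positive response time could be computed from labeled pauses. The `Pause` record, its field names, and the sample values are all hypothetical and purely illustrative; this is not the blog post's evaluation harness or its data.

```python
from dataclasses import dataclass
from statistics import median
from typing import Optional


@dataclass
class Pause:
    """One candidate turn boundary (hypothetical record, for illustration only)."""
    user_finished: bool          # ground truth: the speaker really ended their turn
    model_responded: bool        # the turn-taking model decided to respond
    delay_ms: Optional[float]    # time from end of speech to the model's decision


def score(pauses):
    """Return (precision, recall, median true-positive delay in ms)."""
    tp = [p for p in pauses if p.user_finished and p.model_responded]
    fp = [p for p in pauses if not p.user_finished and p.model_responded]
    fn = [p for p in pauses if p.user_finished and not p.model_responded]

    # Guard against empty denominators so the sketch never divides by zero.
    precision = len(tp) / (len(tp) + len(fp)) if (tp or fp) else 0.0
    recall = len(tp) / (len(tp) + len(fn)) if (tp or fn) else 0.0

    # Response time only makes sense for correct responses (true positives).
    tp_delays = [p.delay_ms for p in tp if p.delay_ms is not None]
    tp_delay = median(tp_delays) if tp_delays else None
    return precision, recall, tp_delay


# Made-up sample: two correct responses, one interruption, one missed turn end.
sample = [
    Pause(True, True, 200.0),
    Pause(True, True, 400.0),
    Pause(True, False, None),
    Pause(False, True, 150.0),
    Pause(False, False, None),
]
precision, recall, tp_delay = score(sample)
```

Only the turn-taking decision is timed here, so LLM, TTS, and rendering latency never enter the numbers.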
The video conversation examples show Sparrow-1 within the full pipeline. These responses aren’t as fast as Sparrow-1 on its own because the LLM, TTS, facial rendering, and network transport also take time. Without Sparrow-1 they would be slower. Sparrow-1 is what lets the responses be as fast as they are, and with a faster CVI pipeline configuration I've seen responses as fast as 430ms in my testing.
If you watch the demo video you can see how they would get this: the model is not aggressive enough. While it doesn't cut you off, which is nice, it also always waits an uncanny amount of time to chime in.
Common ...