Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

Eep.

So, on my M1 mac, did `uvx pocket-tts serve`. Plugged in

> It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair, we had everything before us, we had nothing before us, we were all going direct to Heaven, we were all going direct the other way—in short, the period was so far like the present period, that some of its noisiest authorities insisted on its being received, for good or for evil, in the superlative degree of comparison only

(Beginning of Tale of Two Cities)

but the problem is Javert skips over parts of sentences! Eg, it starts:

> "It was the best of times, it was the worst of times, it was the age of wisdom, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the spring of hope, it was the winter of despair, we had everything before us, ..."

Notice how it skips over "it was the age of foolishness,", "it was the winter of despair,"

Which... Doesn't exactly inspire faith in a TTS system.

(Marius seems better; posted https://github.com/kyutai-labs/pocket-tts/issues/38)





All the models I tried have similar problems. When trying to batch a whole audiobook, the only way is to run it, then run a model to transcribe and check you get the same text.

Václav from Kyutai here. Thanks for the bug report! A workaround for now is to chunk the text into smaller parts where the model is more reliable. We already do some chunking in the Python package. There is also a more fancy way to do this chunking in a way that ensures that the stitched-together parts continue well (teacher-forcing), but we haven't implemented that yet.

Is this just sort of expected for these models? Should users of this expect only truncation or can hallucinated bits happen too?

I also find Javert in particular seems to put in huge gaps and spaces... side effect of the voice?


Yeah Javert mangled up those sentences for me as well, it skipped whole parts and then also moved words around

- "its noisiest superlative insisted on its being received"

Win10 RTX 5070 Ti


Using your first text block 'Eponine' skips "we had nothing before us" and doesn't speak the final "that some of its noisiest"

I wonder what's going wrong in there


interesting; it skipped "we had everything before us," in my test. Yeah, not a good sign.



Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: