Kokoro is better for TTS by far. For voice cloning, Pocket TTS is walled so I can...

echelon · 2026-01-16T00:30:45 1768523445

What are the advantages of PocketTTS over Kokoro?

It seems like Kokoro is the smaller model, also runs on CPU in real time, and is more open and fine-tunable. It has more scripts, extensions, etc., whereas this is new and doesn't have any fine-tuning code yet.

I couldn't tell an audio quality difference.

hexaga · 2026-01-16T05:38:36 1768541916

Kokoro is fine-tunable? Speaking as someone who went down the rabbit hole... it's really not. There's no training code available (as of the last time I checked), so you need to reverse engineer everything. Beyond that, the model is not good at doing voices outside the existing voicepacks: simply put, it isn't a foundation model trained on internet-scale data. It is made from a relatively small set of focused, synthetic voice data. So, a very narrow distribution to work with. Going out of distribution (OOD) immediately tanks perceptual quality.

There's a bunch of inference stuff though, which is cool I guess. And it really is quite a nice little model in its niche. But let's not pretend there aren't huge tradeoffs in the design: synthetic data, phonemization, lack of training code, sharp boundary effects, etc.

jamilton · 2026-01-16T01:19:24 1768526364

Being able to voice clone with PocketTTS seems major; it doesn't look like there's any support for that with Kokoro.

echelon · 2026-01-16T01:54:45 1768528485

Zero-shot voice clones have never been very good. Fine-tuned models hit natural speaker similarity and prosody in a way zero-shot models can't emulate.

If it were a big model trained on a diverse set of speakers and could remember how to replicate them all, then zero-shot would potentially be a bigger deal. But this is a tiny model.

I'll try out the zero-shot functionality of Pocket TTS and report back.

Barbing · 2026-01-16T17:53:05 1768585985

Would be curious to hear!

jhatemyjob · 2026-01-16T03:19:08 1768533548

Less licensing headache, it seems. Kokoro says it's Apache licensed. But it has eSpeak-NG as a dependency, which is GPL, which calls into question whether or not Kokoro is actually GPL. PocketTTS doesn't have eSpeak-NG as a dependency, so you don't need to worry about all that BS.

Btw, I would love to hear from someone (who knows what they're talking about) to clear this up for me. Dealing with potential GPL contamination is a nightmare.

miki123211 · 2026-01-16T05:22:48 1768540968

Kokoro only uses eSpeak for text-to-phoneme (AKA grapheme-to-phoneme, G2P) conversion.

If you could find another compatible converter, you could probably replace eSpeak with it. The data could be a bit OOD, so you may need to fiddle with it, but it should work.
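To make the "swap the converter" idea concrete, here is a minimal sketch of a pluggable G2P step. Everything in it is hypothetical: the `G2P` type, `make_lexicon_g2p`, and `synthesize` are illustrative names, not Kokoro's actual API, and the toy dictionary lookup is only a stand-in for a real converter.

```python
from typing import Callable, Dict

# Hypothetical type: any function mapping text to a phoneme string.
G2P = Callable[[str], str]


def make_lexicon_g2p(lexicon: Dict[str, str], unknown: str = "?") -> G2P:
    """Build a toy dictionary-lookup G2P (a stand-in for a real converter)."""

    def g2p(text: str) -> str:
        # Lowercase, split on whitespace, and look each word up in the lexicon.
        words = text.lower().split()
        return " ".join(lexicon.get(word, unknown) for word in words)

    return g2p


def synthesize(text: str, g2p: G2P) -> str:
    """Placeholder front end: only the G2P step is shown here.

    A real pipeline would pass the phonemes on to the acoustic model.
    """
    return g2p(text)


# Any callable with the same signature can be dropped in instead.
toy_g2p = make_lexicon_g2p({"hello": "həloʊ", "world": "wɜːld"})
phonemes = synthesize("Hello world", toy_g2p)
```

The point is only that the model sees phoneme strings, so whatever produces them is replaceable as long as the phoneme inventory matches what the model was trained on.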

Because the GPL is outdated and doesn't really consider modern gen AI, what you could also do is generate a bunch of text-to-phoneme pairs with eSpeak and train your own transformer on them. This would free you from the GPL license completely, and the task is easy enough that even a very small model should be able to do it.
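A rough sketch of the data-generation half of that idea, assuming the `espeak-ng` command-line tool is installed and on `PATH`. The helper names (`espeak_command`, `espeak_phonemize`, `write_pairs`) and the TSV layout are my own assumptions, and this only collects the pairs; training the model on them is not shown.

```python
import csv
import subprocess
from typing import Callable, Iterable, List


def espeak_command(text: str, voice: str = "en-us") -> List[str]:
    """Build the espeak-ng invocation for one piece of text."""
    # -q: produce no audio, --ipa: print IPA phonemes to stdout, -v: voice.
    return ["espeak-ng", "-q", "--ipa", "-v", voice, text]


def espeak_phonemize(text: str, voice: str = "en-us") -> str:
    """Run espeak-ng and return its phoneme output (requires espeak-ng)."""
    result = subprocess.run(
        espeak_command(text, voice),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def write_pairs(
    texts: Iterable[str],
    path: str,
    phonemize: Callable[[str], str] = espeak_phonemize,
) -> int:
    """Write tab-separated (text, phonemes) rows and return the row count."""
    count = 0
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t")
        for text in texts:
            writer.writerow([text, phonemize(text)])
            count += 1
    return count
```

Usage would be something like `write_pairs(sentences, "g2p_pairs.tsv")`, after which the TSV becomes the training set for a small sequence-to-sequence model. The `phonemize` parameter is there so the writer can be exercised without eSpeak installed.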

jcelerier · 2026-01-16T14:06:16 1768572376

If it depends on eSpeak NG code, the complete product is 100% GPL. That said, if you are able to change the code to remove the eSpeak dependency, then the rest would revert to non-GPL (the same applies if it's a build-time option that you can disable, like FFmpeg with --enable-gpl).

seunosewa · 2026-01-16T01:35:39 1768527339

Chatterbox-turbo is really good too. It has a version that uses Apple's GPU.