I echo this. For a TTS system to be in any way useful outside the tiny population of the world that speaks exclusively English, it must be multilingual and dynamically switch between languages pretty much per word.

Cool tech demo though!

That's a pretty crazy requirement for something to be "useful", especially something that runs so efficiently on CPU. Many content creators from non-English-speaking countries can benefit from this type of release by translating transcripts of their content to English and then running it through a model like this to dub their videos in a language that can reach many more people.

You mean YouTubers? And have to (manually) synchronise the text to their video, especially when YouTube apparently offers voice-to-voice translation out of the box, to my and many others' annoyance?

YouTube's voice-to-voice is absolutely horrible, though. Giving YouTubers the ability to clone their own voice would make it much, much more appealing.

Uh, no? This is not at all an absurd requirement? Screen readers literally do this all the time, with voices built the classic way speech synthesizers were made, no AI required. eSpeak is an example, or MS OneCore. The NVDA screen reader has an option for automatic language switching, as does pretty much every other modern screen reader in existence. And absolutely none of these use AI models to do that switching, either.
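To make that concrete: the switching logic itself needs no AI. Here's a minimal sketch in Python, assuming espeak-ng is installed and on PATH and the langdetect package is available; the per-word detection is a crude illustration, since real screen readers rely on explicit language tags rather than guessing:

    # Minimal sketch: per-run language switching on top of classic TTS.
    # Assumptions: espeak-ng on PATH, langdetect installed (pip install
    # langdetect). Per-word detection is unreliable in practice; screen
    # readers use explicit language tags instead.
    import subprocess
    from langdetect import detect

    def speak_multilingual(text):
        runs = []  # list of (lang, [words]) pairs
        for word in text.split():
            try:
                lang = detect(word)  # ISO 639-1 code, e.g. "en", "de"
            except Exception:
                lang = "en"  # digits/punctuation can fail detection
            if runs and runs[-1][0] == lang:
                runs[-1][1].append(word)
            else:
                runs.append((lang, [word]))
        for lang, words in runs:
            # espeak-ng's -v flag selects the voice/language per run.
            subprocess.run(["espeak-ng", "-v", lang, " ".join(words)],
                           check=True)

    speak_multilingual("ich war gestern in London, it was lovely")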

They didn’t say it was a crazy requirement. They said it was crazy to consider it useless without meeting that requirement.

That doesn't really change what I said, though. It isn't crazy to call it useless without some form of automatic language switching either, given that old-school synthesis has been able to do it for like 20 years or so.

How does state of the art matter when talking about usefulness? Is old-school synthesis useless?

No? But is it unreasonable to expect "state of the art" TTS to be able to do at least what old-school synthesis is capable of doing? Being "state of the art" means being the highest level of development or achievement in a particular field, device, procedure, or technique at a specific point in time. I don't think it's therefore unreasonable to expect supposed "state of the art" text-to-speech synthesis to do far better at everything old-school TTS could do and then some.

> Being "state of the art" means being the highest level of development or achievement in a particular field, device, procedure, or technique at a specific point in time. I don't think it's therefore unreasonable to expect supposed "state of the art" text-to-speech synthesis to do far better at everything old-school TTS could do and then some.

Non sequitur. Unless the 'art' in question is the 'art of adding features', this phrase usually describes the quality of one very specific development; such systems are often not even feature-complete products.


This is a great illustration that nothing you ever do will be good enough; people will whine regardless.

Excuse me for pointing out that this is yet another LLM tech demo presented for our attention.

But it wouldn't be only for those who "speak exclusively English"; rather, for anyone who speaks English. Not only that, it's also common to have the system language set to English even if one's own language is different.

There are about 1.5B English speakers on the planet.


Let's indeed limit the use case to the system language, say that of a mobile phone.

You pull up a map and start navigation. All the street names are in the local language, and no, transliterating the local names to the English alphabet does not make them understandable when spoken by TTS. Not to mention localised foreign names, which are then completely mangled by transliterating them to English.

You pull up a browser and open a news article in your local language to read during your commute. You now have to reach for a translation model first before passing the text to the English-only TTS software.

You're driving and one of your friends Signals you. Your phone UI is in English, so you get a notification (interrupting your Spotify) saying 'Signal message', followed by 5 minutes of gibberish.

But let's say you have a TTS model that supports your local language natively. Well, since '1.5B English speakers' apparently exist on the planet, many texts in other languages include English or Latin names and words. Now you have the opposite issue: your TTS software needs to switch to English to pronounce these correctly...

And mind you, these are just very simple use cases for TTS. If you delve into the use cases of people with limited sight, who experience the entire Internet and all mobile and desktop applications (often poorly localised) via TTS, you see how monolingual TTS is mostly useless and would be swapped for a robotic old-school TTS in a flash...

> Not only that, it's also common to have the system language set to English

Ask a German whether their system language is English. Ask a French person. I can go on.


> Ask a German whether their system language is English. Ask a French person. I can go on.

I'm German but my system language is English

Because translations often suck, are incomplete, or are inconsistent.


If you don't speak the local language, you can't decode spoken local-language names anyway. Your speech subsystems can't lock onto and sync with an audio track in a language you don't speak, let alone transliterate or pronounce it.

Multilingual doesn't mean language-agnostic. We humans are always monolingual, just multi-language hot-swappable if trained. It's more like you can make; make install Docker, after which you can attach to and detach from alternate environments in the terminal to do things or carry notes in and out.

People sometimes picture multilingualism as owning a single joined-together super-language in the brain. That usually doesn't happen. Attempting this, especially at a young age, can leave a person in a "semi-lingual" or "double-limited" state where they are not fully fluent in any particular language.

And so, criticizing someone for not devoting significant resources to making an omnilingual TTS doesn't make much sense.


> If you don't speak the local language, you can't decode spoken local-language names anyway

This is plainly not true.

> Multilingual doesn't mean language-agnostic. We humans are always monolingual, just multi-language hot-swappable if trained

This and the analogy make no sense to me. Mind you, I am trilingual.

I also did not imply that the model itself needs to be multilingual. I implied that the software that uses the model to generate speech must be multilingual and support language change detection and switching mid-sentence.
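To illustrate what that orchestration layer could look like: here's a hypothetical sketch that segments text wherever the Unicode script changes and routes each run to a per-language voice. script_of is crude bucketing (real systems use ICU script properties or explicit language tags), and synthesize is a made-up placeholder, not any real model's API:

    # Hypothetical mid-sentence switching: split wherever the Unicode
    # script changes, route each run to a per-language voice.
    import unicodedata

    def script_of(ch):
        # Crude bucketing by Unicode character name prefix.
        name = unicodedata.name(ch, "")
        if name.startswith(("CJK", "HIRAGANA", "KATAKANA")):
            return "ja"
        if name.startswith("CYRILLIC"):
            return "ru"
        return "en"

    def segments(text):
        # Yield (lang, run) pairs, splitting on script changes;
        # whitespace inherits the language of the current run.
        lang, run = None, []
        for ch in text:
            ch_lang = lang if ch.isspace() else script_of(ch)
            if lang is None or ch_lang == lang:
                lang = ch_lang
                run.append(ch)
            else:
                yield lang, "".join(run)
                lang, run = ch_lang, [ch]
        if run:
            yield lang, "".join(run)

    for lang, run in segments("I was in 東京 yesterday"):
        print(lang, repr(run))  # stand-in for synthesize(run, lang)

On the example sentence this yields ('en', 'I was in '), ('ja', '東京 '), ('en', 'yesterday'): exactly the mid-sentence hand-off being discussed.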


I'm Martian, so everything you create had better support my language on day 1.

> it must be multilingual and dynamically switch between languages pretty much per word

Since this isn't abundantly obviously satire, interjecting: humans, including professional "simultaneous" interpreters, can't do this. This is not how languages work.


You can speak one language, switch to another language for one word, and continue speaking in the previous language.

But that's my point. You'll stop, switch, speak, stop, switch, resume. You're not going to say "I was in 東京 yesterday" as a single continuous sentence. It'll have to be broken up into three separate segments spoken back to back, even for humans.

>"I was in 東京 yesterday"

I think it's the wrong example, because this is actually very common if you're a Chinese speaker.

Actually, people tend to say the names of cities in their own country in their native language.

> I went to Nantes [0], to eat some kouign-amann [1].

As a French person, I'll say both [0] and [1] the French way on the fly within the sentence, while the other words are in English. The switch happens without any pause whatsoever (because there is really only one way to pronounce those names in my mind; no thinking required).

Note that in speech recognition it is fairly common to have models that understand language switches within a sentence, as with Parakeet.


Okay, it's getting clear that I'm in the wrong here with my insistence that languages don't mix and foreign words can't be inserted mid-sentence. Yet that is my experience, as well as the behaviour of people sharing my language, incidentally including the GP, who suggested that I can always do the switching dance: people can if they want to, but normally don't. It's considered showing off, if the inserted word is understood at all.

Perhaps I have to admit that my particular primary language is officially the human equivalent of an esoteric language; the myth that it's a complex language is increasingly becoming obsolete (for good!), but maybe it still qualifies as an esoteric one that is not insignificantly more incompatible with others.


I think this is totally wrong. When both parties speak multiple languages, this happens all the time. You see it more with English being the lender than the borrower, due to the reach that the language has. Listen to an Indian or Filipino speak for a while: it's interspersed with English words ALL the time. It happens less in English, as there is no universal knowledge of one specific other language, but it does happen sometimes when searching for a certain, je ne sais quoi.

Not really; most multilinguals switch between languages so seamlessly that you wouldn't even notice it! It has even given birth to new "languages"; take Hinglish, for example!

English has more users than all but a few products.
