Quite cool... But if this is generated by the OS X text-to-speech engine, then I am afraid it goes beyond the license that comes with OS X. I remember reading through that license, and it clearly stated that the OS X TTS was only for local use on your Mac.
So I am extremely curious to know the license behind this tts-api? Can the OP provide such info or provide some of the tech behind it?
In case anyone else is curious, the section you're thinking of is
"F. Voices. Subject to the terms and conditions of this License, you may use the system voices included in the Apple Software (“System Voices”) (i) while running the Apple Software and (ii) to create your own original content and projects for your personal, non-commercial use. No other use of the System Voices is permitted by this License, including but not limited to the use, reproduction, display, performance, recording, publishing or redistribution of any of the System Voices in a profit, non-profit, public sharing or commercial context."
For those who want to run their own copy of this, here's how to do it:
1. Find a Mac-based server (a co-located Mac Mini will be fine)
2. Run `say -o output.wav "$TEXT"` to generate the voice
3. Compress the WAVE file with `lame` or the system built-in `afconvert` to get the MP3 file.
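
The steps above could be scripted roughly like this. It is only a sketch: the function names and default file names are made up for illustration, and it assumes `say` and `lame` are both on the PATH. Depending on the OS X version, `say` may also need an explicit `--data-format` option to write a WAVE file; that is my note, not something from the steps above.

```python
import shutil
import subprocess
from typing import List


def say_argv(text: str, wav_path: str) -> List[str]:
    """Build the command for step 2: say -o output.wav "$TEXT"."""
    # The text is passed as a single argv element, so no shell quoting is needed.
    return ["say", "-o", wav_path, text]


def lame_argv(wav_path: str, mp3_path: str) -> List[str]:
    """Build the command for step 3: lame <input.wav> <output.mp3>."""
    return ["lame", wav_path, mp3_path]


def synthesize(text: str,
               wav_path: str = "output.wav",
               mp3_path: str = "output.mp3") -> bool:
    """Run both steps in order; return False if a required tool is missing."""
    if shutil.which("say") is None or shutil.which("lame") is None:
        return False
    # check=True raises CalledProcessError if either command fails.
    subprocess.run(say_argv(text, wav_path), check=True)
    subprocess.run(lame_argv(wav_path, mp3_path), check=True)
    return True
```

Calling `synthesize("Hello from the Mac Mini")` on a Mac with `lame` installed should leave `output.mp3` in the current directory; anywhere else it just returns `False`.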
The `say` command supports multiple languages and dialects, but you'll have to install the necessary voice engines in OS X 10.8. The man page for `say` can be found here: http://pastebin.com/nWbvJAAX
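
To check which voices are actually installed on a given machine, `say -v '?'` prints one line per voice. Here is a rough sketch for turning that listing into (name, locale) pairs. The line layout in the regex (name, then locale, then a `#` followed by a sample sentence) is an assumption based on typical output, and the function names are made up for illustration.

```python
import re
import shutil
import subprocess
from typing import List, Tuple

# Assumed layout of each line: "<voice name>   <locale>   # <sample sentence>"
VOICE_LINE = re.compile(
    r"^(?P<name>.+?)\s+(?P<locale>[A-Za-z]{2,3}[_-][A-Za-z0-9]+)\s+#"
)


def parse_voices(listing: str) -> List[Tuple[str, str]]:
    """Extract (voice name, locale) pairs; lines that don't match are skipped."""
    voices = []
    for line in listing.splitlines():
        match = VOICE_LINE.match(line)
        if match:
            voices.append((match.group("name"), match.group("locale")))
    return voices


def installed_voices() -> List[Tuple[str, str]]:
    """Ask `say` for its voice list; returns [] when `say` is not available."""
    if shutil.which("say") is None:
        return []
    result = subprocess.run(
        ["say", "-v", "?"], capture_output=True, text=True, check=True
    )
    return parse_voices(result.stdout)
```

A name from that list can then be passed back with `say -v <name>` to pick the voice.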
The complete list of voices/languages supported so far:
The open-source, cross-platform equivalent of `say` is a piece of software called "SVOX Pico". There is also a Python-based wrapper for it called picospeaker. Relevant AUR link for ArchLinux users:
This is a nice one; however, I'm still confounded by the lack of progress since Bell Labs made an online text-to-speech converter many years ago. In particular, the notion that the interpretation of each sentence is idempotent is just wrong. Want to see what I mean? A human would not speak like the following; there should be differences in intonation, "emotion" (sounding bored, angry, excited, etc., varying with the number of times "dogs" has been said), speed, and delay. In addition, you have to breathe at some point, and even the best audiobooks have some level of breath noise.
This is a bit off topic, but a related question: I have been looking for a "bad" text-to-speech library that produces Stephen Hawking-style audio, similar to what's found in old 1970s/80s electronics. Examples:
Pretty impressive. I've given it a go with a few of the more technical terms that I come across at work and that other TTS engines have difficulty handling, and it dictated them flawlessly. Very interested to see where this goes!
Neat, once again, emscripten proves useful. I do find it important though to point out the lack of a good open text-to-speech engine.
Here is a speech as rendered by tts-api.com (http://goo.gl/PoZc4). Now, for speak.js [1], paste in the first few paragraphs from here [2] and compare the quality between the two.
There really is a gap to fill for a good open-source alternative here. But I suspect the main barrier is that there is a large amount of data needed to generate good voices. Still, a worthy target.
[1]: I tried to make a URL for this too, but despite the URL looking as if it could take arguments it refused to work, at least for me under Firefox and Chrome.
You might be able to do non-English pronunciations by trying phonetic spellings, which can be tricky. The best I could get for "felicidades" was this: fell isseedadesh.
Speech to text is a far more computationally difficult problem. Google has an unofficial one -- you can curl FLAC voice files to them, but even their transcription is not terrific. (They use it for automatic captions on YouTube -- use that to judge...)