The speed of improvement of TTS models reminds me of the early days of Stable Diffusion. Can't wait until I can generate audiobooks without infinite pain. If I were an investor I'd short Audible.
Isn’t it more like an art gallery of prints of paintings? The primary art is the text of the book (like the painting in the gallery), TTS (and printing a copy) are just methods of making the art available.
I think it can be argued that audiobooks add to the art through the reader's tone and inflection.
To me, what you're saying is the same as saying the art of a movie is in the script, and the video is just the method of making it available. And I don't think that's a valid take.
No, that's an incorrect analogy. The script of a movie is an intermediate step in the production process of a movie. It's generally not meant to be seen by any audience. The script, for example, doesn't contain any cinematography, soundtrack, or performances by actors. Meanwhile, a written work is a complete expressive work ready for consumption. It doesn't contain a voice, but that's because the intention is for the reader to interpret the voice into it. A voice actor can do that, but that's just an interpretation of the work. It's not one-to-one, but it's not unlike someone sitting next to you in the theater and telling you what they think a scene means.
So yes, I mostly agree with GP. An audiobook is a different rendering of the same subject. The content is in the text, regardless of whether it's delivered in written or oral form.
It's not perfect, but I already have a setup for doing this on my phone. Add SherpaTTS and Librera Reader to your phone (both are available for free on F-Droid).
Set up SherpaTTS as the voice model for your phone (I like the en_GB-jenny_dioco-medium voice option, but there are several to choose from). Add an ebook to Librera Reader and open it. There's an icon with a little person wearing headphones, which lets you send the text continuously to your phone's TTS, using just local processing on the phone. I don't have the latest phone, but mine is able to process the text faster than the audio is read, so the audio doesn't stop and start.
The voice isn't totally human-sounding, but it's a lot better than the Microsoft Sam days, and once you get used to it the roboticness fades into the background and you can just listen to the story. You may get better results with Kokoro (I couldn't get it running on my phone) or similar TTS engines and a more powerful phone.
One thing I like about this setup is that if you want to swap back and forth between audio and text, you can. The reader scrolls automatically as it generates the audio, and you can pause it, read in silence for a while yourself, and later set it going again from a new point.
I feel like TTS is one of the areas that has evolved the least. Small TTS models have been around for like 5+ years and they've only gotten incrementally better. Giants like ElevenLabs make good-sounding TTS, but it's not quite human yet and the improvements get smaller with each iteration.