I trained a base model on the LJ Speech (LJS) dataset of Linda Johnson's recordings for several days.
I then fine-tuned (transfer learned) that base model separately for each of these speakers. Some speakers have as little as 40 minutes of data; others have up to five hours. The resulting quality isn't strictly a function of the amount of training data, though more typically helps. It's also important to have high-fidelity, error-free text transcriptions.
The transfer learning runs take between six and thirty-six hours.
I'm using 8xV100 instances to train glow-tts and 2x1080Ti to train melgan. I'm continuously training melgan in the background and simply adding more training data as I go. The same melgan model works for all speakers.
Have you had any success with using speaker embeddings to generate voices with fewer samples of speech? I did some cursory experiments, but I didn't get much further than matching the target speaker's pitch.
My reasoning for this approach: IMO, if the model learns a "universal human voice", it shouldn't need too much additional information to get a target voice.
I did! I tried creating a multi-speaker embedding model for a practical reason: saving on memory costs. I'm going to have to add more layers, because it didn't fit individual speakers very well. I wish I'd saved audio results to share. I might be able to publish my findings if I can dig up the model files.
I think you're right in that if we can get such a model to work, training new embeddings won't require much data.
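To make that idea concrete, here's a toy sketch of what "training only a new embedding" means. It is not glow-tts code: the frozen linear map `SHARED_WEIGHTS`, the function names, and the made-up "frames" are all assumptions for illustration. The shared model stays fixed, and gradient descent updates only the small speaker vector.

```python
# Toy illustration only: a frozen linear map stands in for the shared TTS model.
# SHARED_WEIGHTS, decode and fit_speaker_embedding are hypothetical names.
SHARED_WEIGHTS = [
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
    [1.0, -1.0],
]


def decode(weights, embedding):
    """Frozen shared model: maps a speaker embedding to output features."""
    return [sum(w * e for w, e in zip(row, embedding)) for row in weights]


def fit_speaker_embedding(weights, frames, steps=200, lr=0.1):
    """Learn only the speaker embedding by gradient descent; weights stay fixed."""
    dim = len(weights[0])
    embedding = [0.0] * dim
    for _ in range(steps):
        grad = [0.0] * dim
        for frame in frames:
            output = decode(weights, embedding)
            error = [o - t for o, t in zip(output, frame)]
            # Gradient of 0.5 * ||W e - t||^2 with respect to e is W^T (W e - t).
            for j in range(dim):
                grad[j] += sum(weights[i][j] * error[i] for i in range(len(weights)))
        # Average the gradient over the available frames, then take one step.
        embedding = [e - lr * g / len(frames) for e, g in zip(embedding, grad)]
    return embedding


# Two noisy "frames" from a hypothetical new speaker whose true embedding is [0.5, -1.2].
clean = decode(SHARED_WEIGHTS, [0.5, -1.2])
noise = [0.1, -0.1, 0.05, 0.0]
frames = [
    [c + n for c, n in zip(clean, noise)],
    [c - n for c, n in zip(clean, noise)],
]
learned = fit_speaker_embedding(SHARED_WEIGHTS, frames)
print(learned)
```

In a real multi-speaker model the shared part is a deep network rather than a 4x2 matrix, so how little data is enough would have to be measured, not assumed from this toy.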