VALL-E X

VALL-E X can synthesize high-quality personalized speech with only a 3-second enrolled recording of an unseen speaker as an acoustic prompt, even in another language for a monolingual speaker.
This implementation supports zero-shot monolingual and cross-lingual text-to-speech for three languages (English, Chinese, Japanese).

See this demo page for more details.

Upload a speech recording of 3 to 10 seconds as the audio prompt and type in the text you'd like to synthesize.
The model will synthesize the given text in the same voice as your audio prompt.
The model also tends to preserve the emotion and acoustic environment of the prompt.
For faster inference, use "Make prompt" to save the encoded audio prompt as a .npz file, then reuse that file with "Infer from prompt".
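
Before uploading, it can help to confirm that a recording falls inside the 3 to 10 second range mentioned above. The following is a minimal sketch using only the Python standard library; it is not part of VALL-E X, and the helper names (wav_duration_seconds, is_valid_prompt_length, make_silent_wav) are hypothetical. It assumes the prompt is an uncompressed WAV file, and it treats the 3 to 10 second range as a hard limit, which the demo itself may not do.

```python
import io
import wave

# Range taken from the instructions above; treating it as a hard limit is an assumption.
MIN_SECONDS = 3.0
MAX_SECONDS = 10.0


def wav_duration_seconds(source):
    """Return the duration in seconds of a WAV file (path or binary file-like object)."""
    with wave.open(source, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_valid_prompt_length(source):
    """Check whether the recording length is within the suggested prompt range."""
    return MIN_SECONDS <= wav_duration_seconds(source) <= MAX_SECONDS


def make_silent_wav(seconds, sample_rate=16000):
    """Build an in-memory mono 16-bit WAV of silence (stand-in for a real recording)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * int(seconds * sample_rate))
    buffer.seek(0)  # rewind so the buffer can be read back
    return buffer


if __name__ == "__main__":
    for seconds in (1.0, 5.0, 12.0):
        ok = is_valid_prompt_length(make_silent_wav(seconds))
        print(f"{seconds:>4} s prompt within range: {ok}")
```

With a real recording, pass its file path to is_valid_prompt_length instead of the synthetic buffer.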
