VALL-E X
VALL-E X can synthesize high-quality personalized speech from only a 3-second enrolled recording of an unseen speaker as an acoustic prompt, even in another language for a monolingual speaker. This implementation supports zero-shot monolingual and cross-lingual text-to-speech for three languages (English, Chinese, Japanese).
See this demo page for more details.
Upload a speech recording of 3 to 10 seconds as the audio prompt and type in the text you'd like to synthesize.
The model will synthesize the given text in the same voice as your audio prompt.
The model also tends to preserve the emotion and acoustic environment of the prompt recording.
For faster inference, use "Make prompt" to get a .npz file as the encoded audio prompt, then use it with "Infer from prompt".
| Text | Language | Accent | Uploaded audio prompt | Transcript |
|---|---|---|---|---|
Upload a speech recording of 3 to 10 seconds as the audio prompt.
You get a .npz file as the encoded audio prompt. Use it with "Infer from prompt".
| Prompt name | Uploaded audio prompt | Transcript |
|---|---|---|
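As a rough illustration of what an encoded prompt file could look like, the sketch below stores a few arrays in a .npz archive with NumPy and reads them back. The key names (`audio_tokens`, `text_tokens`, `lang_code`), the array shapes, and the helper functions are assumptions made for this example; they are not confirmed by this page, and the real file layout may differ.

```python
import os
import tempfile

import numpy as np


def save_prompt(path, audio_tokens, text_tokens, lang_code):
    """Write a hypothetical encoded prompt to a .npz archive."""
    # Key names are assumptions for illustration only.
    np.savez(path, audio_tokens=audio_tokens, text_tokens=text_tokens,
             lang_code=lang_code)


def load_prompt(path):
    """Read every array stored in a .npz archive into a plain dict."""
    with np.load(path) as data:
        return {key: data[key] for key in data.files}


# Dummy data standing in for real codec and text tokens (shapes are assumed).
rng = np.random.default_rng(0)
audio_tokens = rng.integers(0, 1024, size=(1, 225, 8), dtype=np.int64)
text_tokens = rng.integers(0, 512, size=(1, 40), dtype=np.int64)

prompt_path = os.path.join(tempfile.mkdtemp(), "my_voice.npz")
save_prompt(prompt_path, audio_tokens, text_tokens, lang_code=np.array(0))
prompt = load_prompt(prompt_path)
```

Because the prompt is encoded once and stored, later requests can skip re-encoding the audio, which is consistent with "Infer from prompt" being the faster path.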
Faster than "Infer from audio".
You need to run "Make prompt" first, then upload the encoded prompt (a .npz file).
| Text | Language | Accent | Voice preset |
|---|---|---|---|
Very long text is split into sentences, and each sentence is synthesized separately.
To synthesize long text, make a prompt first or use a preset prompt.
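The sentence-by-sentence approach described above can be sketched as follows. This is a minimal illustration, not the demo's actual code: `split_sentences`, `synthesize_long`, and the `synthesize_sentence` callback are hypothetical names, and the splitting rule (break after English, Chinese, or Japanese sentence-ending punctuation) is an assumption. Reusing one fixed prompt for every sentence is presumably what keeps the voice consistent across chunks.

```python
import re

# A run of non-terminator characters followed by any sentence-ending
# punctuation (ASCII and full-width forms). This is an assumed rule.
_SENTENCE = re.compile(r"[^.!?。！？]+[.!?。！？]*")


def split_sentences(text):
    """Split text into sentences, keeping the closing punctuation."""
    return [s.strip() for s in _SENTENCE.findall(text) if s.strip()]


def synthesize_long(text, synthesize_sentence):
    """Call a per-sentence synthesizer on each chunk, in order.

    `synthesize_sentence` is a hypothetical callback that would wrap the
    model together with a single fixed voice prompt.
    """
    return [synthesize_sentence(sentence) for sentence in split_sentences(text)]


# Stub synthesizer for demonstration: it just echoes the sentence in upper case.
chunks = synthesize_long("Hello world. How are you? 你好。再见！", str.upper)
```

Note that this simple rule also splits on periods inside numbers or abbreviations (for example "3.5"), so a production splitter would need more care.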