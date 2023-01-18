A "neural codec language model", VALL-E uses discrete codes derived from an off-the-shelf neural audio codec model to synthesize high-quality personalised speech with only a 3-second recording of an unseen speaker.

The AI was trained on 60,000 hours of English speech from more than 7,000 unique speakers. All of this data comes from Libri-Light, the Meta-owned audio library that collects spoken English audio.

It can also imitate the speaker's emotional tone and acoustic environment.

"Experiment results show that VALL-E significantly outperforms the state-of-the-art zero-shot TTS system in terms of speech naturalness and speaker similarity," say Microsoft researchers in their paper.

For the best results, the three-second voice input needs to resemble voices already present in the training data. This is why VALL-E's training data should become more diverse in the future. The training data will be scaled up to improve the model's performance in prosody, speaking style, and speaker similarity, Microsoft says.

How can we benefit from VALL-E?

For now VALL-E can only convert text into speech in the chosen voice. It can’t create new content.

Its creators are hopeful that VALL-E can provide various benefits in terms of speech editing and audio content creation.

Stephen Hawking, who used a text-to-speech generator to continue his work while living with motor neuron disease (ALS), showed the world one of the greatest benefits this technology can offer.

VALL-E could be used for simultaneous translation, or to recreate the voices of loved ones who have passed away.

Creating audiobooks would be a lot easier and faster with VALL-E. One could create a voice for any written piece or text message in a short time.

For all these uses and more, we need to wait for Microsoft to open VALL-E to public use. Microsoft has not yet said when the new AI will be publicly available.

VALL-E might bring risks