Microsoft claims new AI can simulate voices based on three-second audio

10 Jan 2023

VALL-E is pitched by Microsoft as an AI model that can preserve both a speaker’s emotion and acoustic environment.

Researchers at Microsoft have built a text-to-speech AI model called VALL-E that they claim can simulate anyone’s voice using only three seconds of audio.

Known as a neural codec language model, VALL-E was trained on 60,000 hours of English speech, a dataset the researchers claim is “hundreds of times” larger than those used by existing systems.

VALL-E “can be used to synthesise high-quality personalised speech with only a three-second enrolled recording of an unseen speaker as an acoustic prompt,” the researchers wrote on GitHub.

This means that the model can be used to mimic a person’s voice and say things the person never said. The researchers even claim that VALL-E can do this while preserving both the speaker’s emotion and acoustic environment.

Findings of the research were published in a paper late last week and demos of the AI’s capabilities are showcased on GitHub along with an explanation of the technology behind it.

While VALL-E could be used for many high-quality text-to-speech applications, the technology also raises concerns about its potential for misuse.

“The experiments in this work were carried out under the assumption that the user of the model is the target speaker and has been approved by the speaker,” the researchers cautioned.

“However, when the model is generalised to unseen speakers, relevant components should be accompanied by speech editing models, including the protocol to ensure that the speaker agrees to execute the modification and the system to detect the edited speech.”

OpenAI investment

Separately, Microsoft reportedly intends to invest up to $10bn in OpenAI, the company behind advanced software such as AI image generator DALL-E and ChatGPT, an experimental chatbot designed to answer questions in a conversational way.

A Bloomberg report suggests that the two companies have been in talks over the deal for months. Reports also suggest that Microsoft intends to integrate OpenAI tech into its applications such as Word, PowerPoint and Outlook.

Just last week, the Wall Street Journal reported that OpenAI is in talks with VC firms Thrive Capital and Founders Fund to raise capital at a valuation of almost $30bn.

Vish Gain is a journalist with Silicon Republic