Baidu AI learns to talk by itself in hours, with no human involvement

9 Mar 2017

Image: REDPIXEL.PL/Shutterstock

Baidu has been developing its own AI system for four years, and its latest achievement shows it is capable of learning vast amounts of information with no human involvement.

Baidu – commonly referred to as the Chinese Google – has been following its Silicon Valley counterparts in developing AI with an emphasis on deep learning and, seemingly, beating ancient parlour games.

After four years of development within its research labs, the Chinese company has revealed that its AI can now handle speech synthesis with unprecedented accuracy.

According to MIT Technology Review, the breakthrough came from reworking its deep learning system to reduce the time its speech synthesis needs to switch gears when putting emphasis on words.

In many existing text-to-speech systems, the voice is drawn from a vast database of words recorded from a single source, which are then combined into a phrase or sentence.

A change of emphasis or a new speaker, however, would require an entirely new database, which would be both costly and time-consuming.
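The database approach described above can be pictured with a toy sketch. Everything in it is an illustrative assumption: the `VOICE_DATABASE` contents are made-up numbers standing in for recorded clips, and `synthesize` is a hypothetical helper, not part of any real text-to-speech product.

```python
# Toy illustration of database-driven (concatenative) text-to-speech.
# Each entry is a short list of made-up sample values; a real system
# would store audio clips recorded from a single source.
VOICE_DATABASE = {
    "hello": [0.1, 0.3, 0.2],
    "world": [0.4, 0.1],
}


def synthesize(sentence, database):
    """Join the stored clip for each word into one list of samples."""
    samples = []
    for word in sentence.lower().split():
        if word not in database:
            # A missing word (or a new speaker) means new recordings.
            raise KeyError(f"no recording for {word!r}")
        samples.extend(database[word])
    return samples


audio = synthesize("Hello world", VOICE_DATABASE)
```

The sketch also shows the limitation: any word, emphasis or voice that is not already in the database cannot be produced without recording more material.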

Baidu’s Deep Voice AI gets around this by creating speech in real time, having learned how to talk by itself, with no human involvement, in the space of a few hours.

The Deep Voice AI builds on WaveNet, a speech synthesis system developed by Google’s DeepMind, which was similar in design but required some human involvement to tidy up its speech patterns.

Strain on computing power

To achieve this latest advancement with Deep Voice, Baidu researchers said in a scientific journal that they broke down text into phonemes, or units of sound.

By working with these smaller sound units, the system can recreate words more accurately and sound more natural.
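To make the idea of breaking text into units of sound (phonemes) concrete, here is a minimal sketch. The `PRONUNCIATIONS` table is a tiny hand-written stand-in using ARPAbet-style symbols, and `to_phonemes` is a hypothetical helper; Baidu's system learns this step rather than looking it up in a fixed table.

```python
# Tiny hand-written pronunciation table (ARPAbet-style symbols).
# This is an assumption for illustration, not Baidu's data.
PRONUNCIATIONS = {
    "deep": ["D", "IY", "P"],
    "voice": ["V", "OY", "S"],
}


def to_phonemes(text, table):
    """Replace each word with its sequence of sound units."""
    phonemes = []
    for word in text.lower().split():
        phonemes.extend(table[word])
    return phonemes


units = to_phonemes("Deep Voice", PRONUNCIATIONS)
```

Two words become six sound units here, which hints at why working at this finer level means more pieces to generate.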

The only problem is that doing so dramatically increases the processing power required: with only 20 microseconds to generate a sample word, each of these phonemes must be generated within 1.5 microseconds.
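A rough back-of-the-envelope check shows how tight that budget is, taking the two quoted figures at face value. The function names are hypothetical, and the 50 kHz sample rate is an assumption chosen only because it works out to 20 microseconds per audio sample; the article does not state a sample rate.

```python
import math


def steps_within_budget(budget_us, step_us):
    """Number of whole steps of step_us that fit inside budget_us."""
    return math.floor(budget_us / step_us)


def time_per_sample_us(sample_rate_hz):
    """Microseconds available per audio sample at a given sample rate."""
    return 1_000_000 / sample_rate_hz


# Figures quoted in the article: a 20 microsecond window, 1.5 per step.
steps = steps_within_budget(20, 1.5)

# Assumed sample rate, for illustration only.
window_us = time_per_sample_us(50_000)
```

On these numbers, only about 13 sequential steps fit in each window, which leaves very little slack for any single computation.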

“Our system is trainable without any human involvement, dramatically simplifying the process of creating text-to-speech systems,” said the research team.

To show how human-like the phrases sound, Baidu has begun uploading audio samples to Amazon’s crowdsourcing site Mechanical Turk to see if the wider public can tell the difference.