Remember the universal translator (UT) device from Star Trek? Microsoft has just revealed that it is further along than most in developing near-instant translation, unveiling a technology that turns spoken English into Chinese in two steps and delivers the result in the speaker’s own voice.
Microsoft’s chief research officer Rick Rashid gave a demonstration in recent weeks at Microsoft Research Asia’s 21st Century Computing event on natural user interfaces, focusing particularly on human speech.
He demonstrated not only that significant advances have been made in speech translation accuracy, but also that he could speak words in English and have them translated into Chinese in his own voice.
Rashid said that until recently, even the best speech-recognition systems still had a word error rate of 20pc to 25pc on arbitrary speech. He added that several products on the market today, including Xbox Kinect, use speech input to provide simple answers or navigate a user interface, and that Windows and Office products have included speech recognition since the late Nineties.
Referring to the 1970s breakthrough by researchers at Carnegie Mellon in using a technique called hidden Markov modelling to build statistical speech models, Rashid said that in the last 10 years faster computers and the ability to process dramatically more data have had a major impact.
Video of Rick Rashid’s presentation:
Deep neural methods
Rashid said that just over two years ago researchers at Microsoft Research and the University of Toronto made a major breakthrough using a technique called Deep Neural Networks, which is patterned after human brain behaviour.
Researchers, he said, were able to train better, more discriminative speech recognisers than previous methods allowed.
“We have been able to reduce the word error rate for speech by over 30pc compared to previous methods. This means that rather than having one word in four or five incorrect, now the error rate is one word in seven or eight. While still far from perfect, this is the most dramatic change in accuracy since the introduction of hidden Markov modelling in 1979, and as we add more data to the training we believe that we will get even better results.
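To make the figures in that quote concrete, here is a minimal sketch of how word error rate is conventionally calculated: the number of word substitutions, deletions and insertions needed to turn the recogniser’s output into the reference transcript, divided by the number of reference words. The function names and sample sentences are illustrative assumptions, not anything taken from Microsoft’s system.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def apply_relative_reduction(error_rate: float, reduction: float) -> float:
    """Error rate left after a relative reduction, e.g. 0.30 for 30pc."""
    return error_rate * (1.0 - reduction)


if __name__ == "__main__":
    # Illustrative sentences: one wrong word out of seven.
    wer = word_error_rate("the cat sat on the mat today",
                          "the cat sat on a mat today")
    print(f"sample word error rate: {wer:.3f}")
    for old_rate in (0.20, 0.25):
        new_rate = apply_relative_reduction(old_rate, 0.30)
        print(f"{old_rate:.0%} -> {new_rate:.1%} (one word in {1 / new_rate:.1f})")
```

A 30pc relative reduction takes the earlier 20pc to 25pc range down to 14pc to 17.5pc. The “one word in seven or eight” in the quote corresponds to roughly 12.5pc to 14.3pc, which fits a reduction of about 30pc or somewhat more applied to the lower end of the earlier range.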
“Machine translation of text is similarly difficult. Just like speech, the research community has been working on translation for the last 60 years, and as with speech, the introduction of statistical techniques and big data have also revolutionised machine translation over the last few years. Today millions of people each day use products like Bing Translator to translate web pages from one language to another.
“In my presentation, I showed how we take the text that represents my speech and run it through translation – in this case, turning my English into Chinese in two steps. The first takes my words and finds the Chinese equivalents, and while non-trivial, this is the easy part. The second reorders the words to be appropriate for Chinese, an important step for correct translation between languages.
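The two steps described in that quote can be illustrated with a deliberately tiny sketch. This is not Microsoft’s method, which is statistical and trained on large amounts of data; the five-word lexicon and the single hand-written reordering rule below are hypothetical stand-ins chosen only to show why a word-for-word lookup is not enough on its own.

```python
# Hypothetical toy lexicon; a real system learns such mappings from data.
LEXICON = {"i": "我", "eat": "吃", "apples": "苹果", "at": "在", "home": "家"}


def lookup_words(english: str) -> list[str]:
    """Step 1: replace each English word with a Chinese equivalent.

    Raises KeyError for any word missing from the toy lexicon.
    """
    return [LEXICON[word] for word in english.lower().split()]


def reorder_for_chinese(words: list[str]) -> list[str]:
    """Step 2 (toy rule): move a trailing place phrase starting with 在
    so that it sits directly after the first word (the subject)."""
    if "在" not in words:
        return words
    start = words.index("在")
    if start <= 1:
        return words  # the place phrase already follows the subject
    place_phrase = words[start:]
    return words[:1] + place_phrase + words[1:start]


def translate(english: str) -> list[str]:
    """Run both steps in sequence."""
    return reorder_for_chinese(lookup_words(english))


if __name__ == "__main__":
    sentence = "I eat apples at home"
    print(" ".join(lookup_words(sentence)))  # English word order
    print(" ".join(translate(sentence)))     # place phrase moved before the verb
```

In this example the lookup alone leaves the place phrase at the end of the sentence, as in English, while the reordering step moves it in front of the verb, where Chinese normally puts it.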
“Of course, there are still likely to be errors in both the English text and the translation into Chinese, and the results can sometimes be humorous. Still, the technology has developed to be quite useful.
“Most significantly, we have attained an important goal by enabling an English speaker like me to present in Chinese in his or her own voice, which is what I demonstrated in China. It required a text-to-speech system that Microsoft researchers built using a few hours’ speech of a native Chinese speaker and properties of my own voice taken from about one hour of pre-recorded (English) data, in this case recordings of previous speeches I’d made.
“Though it was a limited test, the effect was dramatic, and the audience came alive in response. When I spoke in English, the system automatically combined all the underlying technologies to deliver a robust speech-to-speech experience – my voice speaking Chinese,” Rashid said.
According to Star Trek lore, the UT technology wasn’t developed until 2151. Well, it’s 2012 and Microsoft will most likely make this technology universally available much sooner.
“In other words, we may not have to wait until the 22nd century for a usable equivalent of Star Trek’s universal translator, and we can also hope that as barriers to understanding language are removed, barriers to understanding each other might also be removed. The cheers from the crowd of 2,000 mostly Chinese students, and the commentary that’s grown on China’s social media forums ever since, suggests a growing community of budding computer scientists who feel the same way,” Rashid said.