Bringing lip-reading tech to the world of voice activation

23 Nov 2018

Image: © ra2 studio/Stock.adobe.com

TechWatch editor Emily McDaid hears from the team behind Liopa, which has created an app that can read lips.

Voice applications are all the rage right now, with Alexa, Siri, Cortana and Google Assistant seeing an upsurge of new users.

“Meanwhile, voice activation is becoming more popular in cars,” said Liopa’s co-founder, Liam McQuillan. For the automotive industry alone, the voice recognition market is projected to be worth $3.9bn by 2025.

For decades now, computer scientists have championed voice as being the holy grail of the human-computer interface. But it’s taken a long time to come to fruition, due to the intricacies of languages and the fact that no two people speak exactly alike.

The introduction of machine-learning techniques has driven a huge leap forward in the technology. Greater processing power from GPUs, more training data and sophisticated research at Google DeepMind have all had an impact.

McQuillan and his team have used machine learning to create a unique automated lip-reading application called Liopa.


Liopa founders, from left: Richard McConnell, Liam McQuillan, Dr Darryl Stewart and Dr Fabian Campbell-West. Image: TechWatch

Today, speech recognition is based on analysis of the speaker’s audio signal. “These audio systems can be very accurate but the problem is that, when there’s background noise, the accuracy and usability degrades rapidly,” McQuillan said.

Instead, Liopa’s technology analyses video of the speaker’s lip movements. Using an AI-based core, the software deciphers what the person is saying. Because it does not rely on sound, the technology is unaffected by background noise and, when combined with audio speech recognition, can improve the accuracy of the overall system.

It works on any device with a standard camera, particularly where the camera can be trained directly on the speaker’s face. It helps in noisy real-world environments, for example when using a voice activation system in a car, or a virtual assistant in a restaurant or outdoors.

“We’re commercialising research that’s been done for the past 10 years at QUB [Queen’s University Belfast] on lip-reading technology,” said McQuillan.

Liopa is the product of two academic researchers, Dr Darryl Stewart and Dr Fabian Campbell-West, joining forces with two proven commercial entrepreneurs, McQuillan and his colleague Richard McConnell.

McQuillan tells me the scope of Liopa is still being determined. “We want to develop a product that supports a large vocabulary of the 130,000 words in the English language, along with other languages, and will perform in real time,” McQuillan said. “That’s a little bit down the line, but there are plenty of lucrative use cases that we’re addressing initially that require less vocab support.”

He said that in-car voice activation degrades over time because, as a car ages, it emits more engine noise and lets in more road noise. “It’s been shown that the accuracy of in-car voice activation degrades badly, with passengers, whether the radio is on and the age of vehicle.”

McQuillan said: “In cars, we’d combine our solution into AVSR – audiovisual speech recognition.”
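The article does not say how Liopa merges the two signals. One common way to build audiovisual speech recognition is late fusion: an audio recogniser and a lip-reading recogniser each score the candidate words, and the scores are combined with a weight that shifts towards the visual stream as the audio gets noisier. The sketch below is illustrative only; it assumes each recogniser outputs a probability per candidate word, and the function names `fuse` and `weight_from_snr`, as well as the 0 to 20 dB signal-to-noise mapping, are hypothetical choices rather than Liopa’s method.

```python
import math
from typing import Dict


def weight_from_snr(snr_db: float, low_db: float = 0.0, high_db: float = 20.0) -> float:
    """Hypothetical mapping from audio signal-to-noise ratio (dB) to an audio weight.

    Linear between low_db and high_db, clamped so neither stream is ever ignored
    completely (audio weight stays within 0.1..0.9).
    """
    t = (snr_db - low_db) / (high_db - low_db)
    t = min(max(t, 0.0), 1.0)
    return 0.1 + 0.8 * t


def fuse(audio_probs: Dict[str, float],
         visual_probs: Dict[str, float],
         audio_weight: float) -> Dict[str, float]:
    """Log-linear late fusion of two recognisers' per-word probabilities."""
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must be in [0, 1]")
    visual_weight = 1.0 - audio_weight
    eps = 1e-12  # avoids log(0) for words one recogniser did not propose

    fused_log = {}
    for word in audio_probs.keys() | visual_probs.keys():
        a = audio_probs.get(word, 0.0)
        v = visual_probs.get(word, 0.0)
        fused_log[word] = audio_weight * math.log(a + eps) + visual_weight * math.log(v + eps)

    # Softmax over the fused log scores so the result sums to 1 again.
    top = max(fused_log.values())
    exps = {w: math.exp(s - top) for w, s in fused_log.items()}
    total = sum(exps.values())
    return {w: e / total for w, e in exps.items()}


if __name__ == "__main__":
    audio = {"on": 0.2, "off": 0.8}   # audio recogniser leans towards "off"
    visual = {"on": 0.6, "off": 0.4}  # lip reader leans towards "on"
    for snr in (20.0, 0.0):
        fused = fuse(audio, visual, weight_from_snr(snr))
        print(snr, max(fused, key=fused.get))
```

In this toy example the audio favours “off” and the lips favour “on”: with a clean 20 dB signal the audio stream decides and “off” wins, while at 0 dB the visual stream dominates and “on” wins. Real AVSR systems often fuse earlier, inside a single neural network, but late fusion shows the basic idea of leaning on lip movements when the microphone is unreliable.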


Another important use case involves checking for someone’s identification. McQuillan explained: “There is something called liveness checking for digital identification. It’s when a facial recognition system needs to ascertain that there’s a live person presenting and not a high-res static image. If someone had a picture of you, they could fool the system into IDing you.”

Liopa says its technology can make these checks close to 100pc accurate. “When the person presents to the screen, we pop up a random sequence of digits and ask that person to mime or speak those digits into the camera. If they’ve said the right digits, you can be pretty sure it’s them,” said McQuillan.
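The challenge-response check McQuillan describes can be sketched in a few lines. This is a minimal illustration, not Liopa’s implementation: it assumes a lip-reading model (not shown) has already turned the video into a string of recognised digits, and the names `Challenge`, `issue_challenge` and `is_live`, along with the six-digit length and 10-second expiry, are hypothetical choices.

```python
import secrets
import time
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Challenge:
    digits: str        # the random digits shown on screen
    issued_at: float   # monotonic timestamp when they were shown


def issue_challenge(length: int = 6) -> Challenge:
    # secrets draws from a cryptographically secure source, so an attacker
    # cannot predict the digits and pre-record a matching clip.
    digits = "".join(secrets.choice("0123456789") for _ in range(length))
    return Challenge(digits, time.monotonic())


def is_live(challenge: Challenge,
            recognised_digits: str,
            now: Optional[float] = None,
            max_age_s: float = 10.0) -> bool:
    """Return True if the digits read from the person's lips match the challenge in time.

    `recognised_digits` is assumed to come from a separate lip-reading model.
    """
    now = time.monotonic() if now is None else now
    if now - challenge.issued_at > max_age_s:
        return False  # too slow: the response may have been replayed or forged
    # Constant-time comparison of the encoded strings.
    return secrets.compare_digest(challenge.digits.encode(), recognised_digits.encode())


if __name__ == "__main__":
    ch = issue_challenge()
    print("Please say:", ch.digits)
    print(is_live(ch, ch.digits))  # a correct, prompt response passes
```

Because the digits change on every attempt, a static photo or a previously recorded video cannot produce the right lip movements, which is the property liveness checking relies on.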

Liopa could also help in healthcare, for patients who have trouble speaking due to vocal cord or throat injuries.

In the security industry, an important application involves CCTV footage. McQuillan said: “For a lot of CCTV usage, it’s illegal to capture the audio of what someone is saying but, even if it’s not illegal, the microphone is too far away to hear their voice. Using HD cameras and our system, you could analyse video footage and ascertain what someone is saying, which could provide key insights into what is actually happening.”

Liopa was incorporated in 2015 and its first commercial trials will kick off in the next few weeks. McQuillan cannot name names, but one triallist is a company “several orders of magnitude bigger than us”.

Will the sales model be based on throughput? “Yes, we’re going for a usage-based licensing model, charged per transaction,” he said.

Liopa has already enjoyed a substantial seed funding round. McQuillan said: “We’ll be raising a more substantial round by Q2 of next year.”

By Emily McDaid, editor, TechWatch

A version of this article originally appeared on TechWatch