Google has released a demo of technology that could be used to identify people communicating through sign language as active speakers on video conferencing platforms.
Amit Moryossef, an intern at Google Research, recently published a blogpost outlining some of the work that the company has been doing to make video conferencing technology more accessible for people who communicate through sign language.
Moryossef noted that most video conferencing applications focus on those who speak aloud during calls, which can make it difficult for people using sign language to “get the floor” to communicate easily and effectively.
“Enabling real-time sign language detection in video conferencing is challenging, since applications need to perform classification using the high-volume video feed as the input, which makes the task computationally heavy,” he wrote. Due to these challenges, research on sign language detection for video conferencing technology has been limited.
Earlier this year, Google presented a research paper on real-time sign language detection using human pose estimation at the Sign Language Recognition, Translation and Production 2020 workshop. It also demoed its research at the European Conference on Computer Vision, which took place online this year as a result of the Covid-19 pandemic.
The company presented a real-time sign language detection model and demonstrated how it can be used to provide video conferencing systems with a mechanism to identify the person signing as the active speaker.
How it works
“To enable a real-time working solution for a variety of video conferencing applications, we needed to design a lightweight model that would be simple to ‘plug and play’,” Moryossef said.
“Previous attempts to integrate models for video conferencing applications on the client side demonstrated the importance of a lightweight model that consumes fewer CPU cycles in order to minimise the effect on call quality.”
To reduce input dimensionality, the company has developed technology that isolates from the call only the information the model needs to classify each frame.
The company is using a pose estimation model, called PoseNet, to reduce input from an entire HD image to a small set of landmarks on the user’s body, including their eyes, nose, shoulders and hands.
With these landmarks, Moryossef said that Google is calculating the frame-to-frame optical flow, to quantify user motion for use by the model without retaining user-specific information. Through the company’s research so far, it has managed to predict signing in between 83.4pc and 91.5pc of cases, depending on the architecture it used.
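The approach described above, reducing each frame to a handful of pose landmarks and then measuring their frame-to-frame movement, can be illustrated with a short sketch. This is not Google's implementation: the landmark count, the normalisation by frame height and the threshold stand-in for the classifier are all illustrative assumptions.

```python
import numpy as np

# Hypothetical per-frame landmarks: (x, y) pixel positions for a small set
# of body points (eyes, nose, shoulders, hands), as a pose estimation model
# such as PoseNet might emit for each video frame.
NUM_LANDMARKS = 7
FRAME_HEIGHT = 720

def landmark_flow(prev: np.ndarray, curr: np.ndarray) -> np.ndarray:
    """Frame-to-frame displacement of each landmark - a crude 'optical flow'
    computed over landmarks rather than over raw pixels."""
    return curr - prev

def motion_magnitude(prev: np.ndarray, curr: np.ndarray) -> float:
    """Average landmark displacement, normalised by frame height so the
    feature is resolution-independent and retains no user-specific detail."""
    flow = landmark_flow(prev, curr)
    return float(np.linalg.norm(flow, axis=1).mean() / FRAME_HEIGHT)

# Toy example: compare a frame pair with visible motion (signing) against a
# near-still pair (not signing).
rng = np.random.default_rng(0)
prev = rng.uniform(0, FRAME_HEIGHT, size=(NUM_LANDMARKS, 2))
signing = prev + rng.normal(0, 15, size=(NUM_LANDMARKS, 2))   # large motion
still = prev + rng.normal(0, 0.5, size=(NUM_LANDMARKS, 2))    # barely moves

signing_score = motion_magnitude(prev, signing)
still_score = motion_magnitude(prev, still)

# A real system would feed a window of such motion features into a small
# trained classifier; a hypothetical tuned threshold stands in for it here.
SIGNING_THRESHOLD = 0.01
print("signing frame detected:", signing_score > SIGNING_THRESHOLD)
print("still frame detected:", still_score > SIGNING_THRESHOLD)
```

Because the classifier only ever sees these aggregated motion values rather than the video itself, the feature is both cheap to compute and free of identifying imagery, which is what keeps the model lightweight enough to run client-side.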
Google has uploaded an experimental demo of the technology, which anyone can currently try out. It has also made the training code and models, as well as the web demo's source code, available on GitHub.
“We believe video conferencing applications should be accessible to everyone and hope this work is a meaningful step in this direction,” Moryossef said. “We have demonstrated how our model could be leveraged to empower signers to use video conferencing more conveniently.”