A young man sits in front of a webcam, responding to queries about his personal interests and academic and work background. This may look like a remote job interview, but the questions are actually being posed by a bot. The interview proceeds smoothly until at one point the man mumbles a short answer and averts his gaze. The bot immediately recognizes that he is behaving less attentively, and classifies him as “disengaged.”
Meet Dr. Zhou Yu’s Multimodal HALEF, a real-time intelligent interactive system whose primary mission is to hold humans’ attention.
A Carnegie Mellon University (CMU) PhD grad, Dr. Yu joined the University of California, Davis (UC Davis) as an assistant professor last summer. The 29-year-old scientist was recently named to the Forbes 2018 “30 Under 30” list in Science for her groundbreaking work in multimodal dialogues, an interdisciplinary field that enables robots to watch, listen, speak, and physically interact with humans. It’s hoped that Dr. Yu’s research will enable intelligent assistants such as Siri and Alexa to understand the nuances of non-verbal communication and use them to their advantage.
Most dialogue systems are either text- or speech-based. However, natural human communication is not limited to words or voices — it is supplemented by gestures and facial expressions. We frown or raise our eyebrows to express disagreement, and wave our hands to say goodbye. Dr. Yu believes robots should be able to recognize such actions and give appropriate responses.
In the human-machine interactions of the future, a customer service bot, for example, could evaluate customers’ moods or emotional states and recommend appropriate products like a real salesperson. A healthcare consultant bot, meanwhile, might judge whether patients are potentially suffering from mental illnesses.
“A lot of my work is to target different users, and to design unique interaction plans for each individual automatically through data-driven methods,” says Dr. Yu.
Dr. Yu was born in the beautiful canal city of Suzhou. Her keen interest in computer science began when her grandfather bought her a computer when she was very young. She started coding in elementary school and took part in coding competitions from then on. In 2007 she was admitted to Chu Kochen Honors College, an elite undergraduate school of Zhejiang University, one of China’s top academic institutions. Dr. Yu pursued a double major in Computer Science and English with a focus on linguistics. She also took classes in computer vision and machine translation.
Equipped with her unique interdisciplinary background, in 2011 Dr. Yu enrolled at CMU’s Language Technologies Institute, where she studied under Dr. Alan W Black and Dr. Alexander I. Rudnicky, top-tier experts in spoken language dialogue systems.
At CMU Dr. Yu first worked on general aspects of vision and speech dialogues, before narrowing her focus to a simple challenge: How to keep humans engaged in a conversation. Recalls Dr. Rudnicky: “Zhou wanted to work on engagement in non-task oriented dialogues (often called chatbots), which wasn’t really a research area at that time. Now the area is much bigger and others are working there. But she was one of the early people who started publishing in that field.”
“The reason [we] focus on this problem is that we believe engagement is key to determining whether humans will continue interacting with robots,” says Dr. Yu. Her technique combines dialogue systems with different information streams, such as visual information about the face and body and non-speech voice characteristics, to evaluate a human’s level of engagement in a conversation, and then uses that estimate to initiate changes in strategy. The key is choosing the sort of strategies humans use in their own conversations.
The technique can also be employed in task-oriented dialogues, which are widely used in booking services, customer service, and tutoring. For example, in natural human communication, interjections like “Excuse me” or “I beg your pardon” are commonly employed to bring a disengaged person back into the conversation.
One of Dr. Yu’s experiments involves a direction-giving robot that says “Excuse me” if humans look away or talk to other humans instead of the robot. If the human ignores this, the robot will deploy a humanlike “silent treatment” until it regains its subject’s attention.
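The escalation described above (carry on, interject, then fall silent) can be sketched as a small state machine. This is only an illustration under assumptions: the class `EngagementPolicy`, the method `next_action`, and the action strings are hypothetical names, not taken from Dr. Yu’s system, and the engagement signal is assumed to arrive as a simple true/false value.

```python
from dataclasses import dataclass


@dataclass
class EngagementPolicy:
    """Hypothetical escalation policy for a direction-giving robot."""

    prompted: bool = False  # True once "Excuse me" has been said

    def next_action(self, user_engaged: bool) -> str:
        if user_engaged:
            # Attention regained (or never lost): reset and carry on.
            self.prompted = False
            return "continue"
        if not self.prompted:
            # First sign of disengagement: interject politely.
            self.prompted = True
            return "say: Excuse me"
        # The interjection was ignored: stay silent until attention returns.
        return "stay silent"


# Assumed sequence of per-turn engagement estimates, for illustration only.
policy = EngagementPolicy()
observations = [True, False, False, False, True]
actions = [policy.next_action(engaged) for engaged in observations]
print(actions)
```

In a real system the true/false input would come from an engagement estimator working on vision and voice features rather than from a hand-written list.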
User engagement may be abstract, but for research purposes it needs to be quantified. Dr. Yu regularly consults cognitive scientists with expertise in the study of nonverbal behavior to make annotations. Subjects also review their test videos and indicate when they were not paying much attention. These annotations and feedback are used to train the dialogue systems with supervised learning, a machine learning method that depends on labeled data.
Multimodal dialogue remains an early-stage research area, and only five or six research groups have made significant breakthroughs. “The field has high stakes, but many challenges remain,” says Dr. Yu.
One of the biggest challenges is the integration of different modalities. In a speech-based dialogue, researchers only deal with an audio signal sampled at a single rate, measured in hertz. But a multimodal dialogue also involves visual data, such as video, which is sampled at a totally different rate. Integrating such disparate information in real time is much more difficult.
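One common way to bring streams with different sampling rates onto a shared timeline is to pair each sample of the slower stream with the sample of the faster stream that is closest in time. The sketch below shows that idea only; it is not a description of how Multimodal HALEF does it. The rates (100 audio feature frames per second, 30 video frames per second) and the helper names `timestamps` and `nearest_index` are assumptions chosen for illustration.

```python
from bisect import bisect_left


def timestamps(rate_hz: float, duration_s: float) -> list[float]:
    """Sample times for a stream running at rate_hz for duration_s seconds."""
    return [i / rate_hz for i in range(int(rate_hz * duration_s))]


def nearest_index(times: list[float], t: float) -> int:
    """Index of the entry in the sorted list `times` closest to time t."""
    pos = bisect_left(times, t)
    if pos == 0:
        return 0
    if pos == len(times):
        return len(times) - 1
    before, after = times[pos - 1], times[pos]
    return pos if after - t < t - before else pos - 1


# Assumed rates, for illustration only.
audio_times = timestamps(rate_hz=100.0, duration_s=1.0)  # audio feature frames
video_times = timestamps(rate_hz=30.0, duration_s=1.0)   # video frames

# Pair every video frame with the audio frame closest to it in time.
aligned = [(v, nearest_index(audio_times, t)) for v, t in enumerate(video_times)]
print(aligned[:4])
```

Nearest-timestamp pairing is the simplest option; a real-time system would also have to cope with latency and dropped frames, which this sketch ignores.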
The challenge of integrating modalities leads to another problem: data collection. Unlike deep learning models in image recognition or machine translation, which can be trained on large static datasets, a multimodal dialogue system has to be trained through dynamic interaction with people. Such data is both difficult and expensive to acquire. Dr. Yu has therefore turned to reinforcement learning, a machine learning method in which models learn to take the actions that maximize their reward. Reinforcement learning can generate simulated data of human conversations.
Another challenge that remains in speech recognition is speed. Humans want machines to respond as quickly as possible, but speech recognition is time-consuming. Finding a way to reduce this lag inspired Dr. Yu’s research into incremental speech recognition, a method that decodes the voice while the user is still speaking.
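The incremental pattern, emitting a partial transcript after every chunk of audio instead of waiting for the end of the utterance, can be sketched as a generator. This is a toy illustration, not a speech recognizer: `incremental_decode` and `toy_decoder` are hypothetical names, and the decoder is a placeholder that assumes each chunk already contains one word as text.

```python
from typing import Callable, Iterable, Iterator


def incremental_decode(
    chunks: Iterable[bytes], decode_chunk: Callable[[bytes], str]
) -> Iterator[str]:
    """Yield a growing partial transcript after every incoming audio chunk."""
    words: list[str] = []
    for chunk in chunks:
        word = decode_chunk(chunk)
        if word:
            words.append(word)
        # Downstream components can start reacting to this partial result now.
        yield " ".join(words)


def toy_decoder(chunk: bytes) -> str:
    # Placeholder: pretends each chunk already holds one word as text.
    return chunk.decode("utf-8")


# Assumed input stream, for illustration only.
stream = [b"where", b"is", b"the", b"library"]
partials = list(incremental_decode(stream, toy_decoder))
print(partials)
```

Because a partial hypothesis is available after each chunk, a dialogue manager could begin preparing its response before the user has finished speaking.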
“Of course the multimodal dialogue is a really hard problem, so it’s not going to be fully solved in a short time,” says Yu’s mentor Dr. Black.
When she’s not busy at the lab, Dr. Yu enjoys walks in the park or camping with friends. On rainy days she reads paperbacks or watches YouTube animal videos. “Watching small animals is the best way to relieve my stress,” she says.
Last December Dr. Yu joined the Language, Multimodal and Interaction Lab at UC Davis as an assistant professor, migrating from the Steel City to the Golden State. Living and working in the world’s greatest AI hub better positions Dr. Yu for financial support. Since 2016 Amazon has supported her research with US$100,000 in annual funding. She is also backed by Intel and Tencent.
“It’s good to see PhD graduates have a strong vision for their area, and have a well-defined way to realize their research. She’s in a strong position to do a lot more great work, and has the maturity to lead her own team,” says Dr. Black about his former student.
This is just the beginning for Dr. Yu, who believes her research can make a positive impact on society. Some have already benefited, like the young man who gained job interview skills from HALEF. She estimates her work could be widely integrated in intelligent bots within five to 10 years. But for this dedicated young researcher, the process is just as important as the product: “Multimodal dialogue is and will always be my passion and my lifelong research area.”
Journalist: Tony Peng | Editor: Michael Sarazen