Since Siri’s 2011 debut, smart voice assistants have become one of AI’s hottest applications. Powered by speech recognition and natural language processing (NLP), Siri and agents like Cortana, Alexa and Google Assistant facilitate human-machine interaction through natural dialogue. They can be used to manage schedules, retrieve information, produce human-like greetings and jokes, and even provide a sort of chatty companionship.
Mobvoi (出门问问), a Beijing-based AI company founded in 2012, is taking the intelligent personal assistant beyond phones. Its “Ticwatch,” released in 2015, quickly became the second-best-selling smartwatch in China, behind only the Apple Watch.
To further advance its products, Mobvoi hired Dr. Mei-Yuh Hwang as its Vice President of Engineering in 2016. A respected speech recognition researcher and former Principal NLP Scientist Manager at Microsoft, Dr. Hwang is a pioneer in voice recognition and machine translation. She was the Lead Scientist on Bing translation and Chinese Cortana. Her move to Mobvoi ranks as one of last year’s biggest Chinese AI talent acquisitions and opens a world of opportunities for both Dr. Hwang and her new company.
Dr. Hwang received her Bachelor of Science from National Taiwan University in 1986, and the following year began her Ph.D. studies at Carnegie Mellon University. At CMU she met Kai-Fu Lee, who introduced her to Lawrence Rabiner’s paper on a statistical method of modeling speech using Hidden Markov models.
“I was fascinated to see a way to combine both of my passions — programming and math — to solve one interesting and useful problem!” says Dr. Hwang.
With strong support from Turing Award winner Professor Raj Reddy, Dr. Hwang followed Lee’s advice to pursue academic research in speech recognition, and in 1994 joined her co-advisor Xuedong Huang on Microsoft’s speech recognition research team.
In Dr. Hwang’s first 10 years with Microsoft she focused on four projects in addition to research: the Whisper dictation system, Microsoft Speech API, Office XP English/Mandarin/Japanese dictation, and Speech Server.
University of Washington
By 2004, Dr. Hwang was ready for a change. As a mother of two young children, she found that balancing motherhood and her demanding work schedule brought increasing levels of stress into her life. To better care for her disabled son, Dr. Hwang took a break from Microsoft, accepting a part-time research position at the University of Washington’s Signal, Speech and Language Interpretation (SSLI) Lab, to develop speech recognition systems using the SRI Decipher engine.
Dr. Hwang’s project was funded by DARPA (Defense Advanced Research Projects Agency), which supports development of emerging technologies for use by the American military. The project goal was to produce a system able to monitor multilingual newscasts, text documents, and other forms of communication, and make the information searchable in English. “Two source languages were chosen: Mandarin, because China was getting stronger; and Arabic, because the US wanted to hunt down Bin Laden,” says Dr. Hwang.
The project, known as GALE, involved three main components: speech recognition in Mandarin or Arabic, translation to English, and information retrieval. Dr. Hwang directed research on the first step, in Mandarin, with two graduate students. (Who could have predicted that one of them would become the CTO of Mobvoi nine years later?)
The core algorithms of speech recognition are language independent. However, fine-tuning each system with language-dependent phonology and linguistics is always helpful. “Lumping all the phonetic dictionaries you can find on the web into your system is not necessarily the optimal solution,” says Dr. Hwang. “(Semi)automatic filtering of bad entries or bad speech data can make a difference in these intensive evaluations. Text normalization, word segmentation, voice activity detection, tone modelling, filler-word modelling, etc., though not eye-catching in a speech recognition system, cannot be ignored if you want your system to stand out.”
Dr. Hwang’s efforts in Mandarin speech recognition garnered much respect and achieved great success.
A New Challenge in Translation
After wrapping up her UW project, Dr. Hwang returned to Microsoft Research (MSR) in 2008, this time in the area of machine translation. “Because of the GALE project, I became curious to get hands-on with machine translation. Speech recognition and machine translation resemble each other in many ways. They are both based on Bayes’ rule, rely on big data and statistical methods, and share a language model. Hence I was excited to enter the new field,” says Dr. Hwang.
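The parallel Dr. Hwang describes can be written out in the standard noisy-channel form. The sketch below is the textbook formulation rather than anything quoted from the interview, and the symbols are assumptions chosen for illustration: X is the acoustic signal, W a candidate word sequence, F a foreign-language sentence, and E a candidate English sentence.

```latex
% Speech recognition: choose the word sequence W that best explains the audio X.
\hat{W} = \arg\max_{W} P(W \mid X) = \arg\max_{W} P(X \mid W)\, P(W)

% Statistical machine translation: choose the English sentence E that best explains the source F.
\hat{E} = \arg\max_{E} P(E \mid F) = \arg\max_{E} P(F \mid E)\, P(E)
```

In both cases Bayes’ rule splits the problem into a channel model, P(X | W) or P(F | E), and a language model, P(W) or P(E), over the output text. That language model is the component the two fields share.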
Today, machine translation based on powerful sequence-to-sequence neural networks with attention models can achieve great results without an explicit language model.
Dr. Hwang’s first mission with the translation team was challenging. At the time, Microsoft did not have its own translation system. Instead, it licensed a service from top machine translation company SYSTRAN, which mostly used linguistic parsing-based translation and only provided about six languages. MSR wanted its own translation system, based on big data statistical methods, and Dr. Hwang helped make it happen.
In just six months Dr. Hwang and her team delivered Bing Translator, which has become Google Translate’s main competitor. Bing initially included SYSTRAN’s languages (Chinese, English, Italian, German, French and Spanish), but her team’s success with MapReduce algorithms and big-data training platforms helped expand that to more than 40 languages. A translation hub was also built to provide APIs to developers building their own domain-dependent and language-dependent translation systems.
The Bing translation platform was put to a real stress test in 2010 when Haiti suffered a devastating earthquake, with thousands of people injured or killed. The Red Cross sent doctors to Haiti, but many locals could speak only Haitian Creole, a French-based creole language. The language barrier delayed the medical teams’ efforts.
By interpolating CMU’s Creole-English parallel data with Bing’s French-English translation system, Dr. Hwang’s team was able to train, transliterate, and deploy a bi-directional Creole-English translation system in only four days. “The doctors could now communicate with local people by typing and translation,” says Dr. Hwang. “A number of journalists rushed thank-you notes to Microsoft, praising how much the Bing translator had helped the humanitarian effort.”
Returning to Asia for Chinese Cortana
While Microsoft was developing its voice assistant Cortana, Dr. Hwang volunteered to build the Chinese-language version.
With her speech recognition team in Beijing and a brand-new language understanding team in Suzhou, Dr. Hwang led the main science work behind Cortana, pushing China-centric features and designing much of the implementation. The first Chinese Cortana was released in late summer of 2014 on Windows Phone 8.1. “Cortana” was translated into “Xiao Na” (“Little Na” or 小娜, from the last syllable of Cortana). Although Windows Phone was going nowhere (and is dead today), Little Na was a huge success and raised awareness of AI maturity.
Another New Chapter with Mobvoi
In 2016, Dr. Hwang’s passion for new AI challenges led her to Seattle and Mobvoi — whose CTO Xin Lei had been her student and worked side-by-side with her on GALE at the University of Washington. “I worked with him for two years, happily and passionately,” says Dr. Hwang. “He is not only smart and has a great personality, he is also a workaholic, just like his partner (Mobvoi Founder & CEO) Zhifei Li, whom I became familiar with due to my work in machine translation.”
“I know financially I’ll have to take a big cut for a few years, until the big bang happens, but this offers me a flexible schedule to care for my son, and working with the right people is priceless!”
Mobvoi is going full speed ahead on intelligent IoT devices. Last year, the company raised US$75 million in Series C funding from Google on the success of Ticwatch. Mobvoi followed up with the vehicle-based Ticmirror, which interacts with the driver via voice to help with navigation, playing music, etc. Mobvoi brought Ticmirror from concept to market in less than nine months, an amazing feat that impressed Volkswagen Group China. In April of this year Volkswagen invested US$180 million in Mobvoi and the two established a joint venture focused on voice assistants for cars.
Dr. Hwang’s Seattle Mobvoi Lab will tackle smart voice assistants’ bottlenecks: noise-robust speech recognition; customisable hot-word triggering with both high precision and high recall; a semantic understanding platform; and multi-turn dialog management.
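To make the hot-word goal concrete: precision is the share of device wake-ups that were real hot-word utterances, and recall is the share of real utterances that woke the device. The short Python sketch below computes both from raw counts. The function name and the sample numbers are hypothetical, chosen only for illustration, and do not come from Mobvoi.

```python
def precision_recall(true_triggers, false_triggers, missed_triggers):
    """Return (precision, recall) for a hot-word detector from raw counts.

    true_triggers:   wake-ups where the hot word really was spoken
    false_triggers:  wake-ups where it was not (false alarms)
    missed_triggers: hot-word utterances the device ignored
    """
    fired = true_triggers + false_triggers      # every time the device woke up
    spoken = true_triggers + missed_triggers    # every time the hot word was said
    precision = true_triggers / fired if fired else 0.0
    recall = true_triggers / spoken if spoken else 0.0
    return precision, recall


# Hypothetical counts: 90 correct wake-ups, 10 false alarms, 30 misses.
precision, recall = precision_recall(90, 10, 30)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

With these made-up counts the detector scores 0.90 precision and 0.75 recall: it rarely wakes by mistake but ignores a quarter of genuine requests. Tuning a trigger threshold usually trades one measure against the other, which is why achieving both at once is listed as a bottleneck.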
Today’s AI-enabled voice assistants are not yet as articulate or intimate as “Her” in the movie, but the race is on to get there. And with the global voice assistant market expected to reach US$13 billion by 2024, no company wants to be left behind.
As a guru of AI technology, Dr. Hwang has experimented with voice-powered products for almost 30 years, evolving from basic dictation to speech translation and smart voice assistants. As her career shifts into a new phase, Dr. Hwang’s creative drive shows no sign of slowing down.
Journalist: Tony Peng | Editor: Michael Sarazen