Q&A With Microsoft Chief Speech Scientist Xuedong Huang

A performance boost of less than one percent may not seem like much to most people, but for Microsoft Global Technical Fellow and Chief Speech Scientist Xuedong Huang, it’s cause for celebration. In an exclusive interview, Huang told Synced why — and shared his thoughts on new voice technologies, changing priorities, and the road to product development.

Synced: In October 2016, Microsoft achieved a word error rate (WER) of 5.9 percent on the Switchboard conversational speech recognition task, matching the level of professional human stenographers. (Switchboard is a corpus of recorded telephone conversations that the speech research community has used for more than 20 years to benchmark speech recognition systems.) Have there been any significant speech recognition breakthroughs since?

Huang: Our WER dropped from 5.9 to 5.1 percent in September 2017. That may seem insignificant, but in the “last mile” of development it becomes extremely difficult to achieve each 0.1 percentage point decrease. We must ensure that the system is free from any bugs. Moreover, in terms of relative error rate reduction, going from 5.9 to 5.1 is an improvement of more than 10 percent, and we are happy to see it.
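The arithmetic behind these figures can be sketched in a few lines of Python. This is a minimal illustration, not the scoring tool used for the Switchboard evaluation: the function names `word_error_rate` and `relative_reduction` and the example sentences are assumptions made for this sketch. It uses the standard definition of WER as word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(old_rate, new_rate):
    """Fraction of the old error rate that was eliminated."""
    return (old_rate - new_rate) / old_rate


# Hypothetical sentences: one substitution and one deletion in six words.
print(f"{word_error_rate('the cat sat on the mat', 'the cat sit on mat'):.1%}")  # 33.3%

# The figures quoted in the interview: 5.9 percent down to 5.1 percent.
print(f"{relative_reduction(5.9, 5.1):.1%}")  # 13.6%
```

The second result shows why a 0.8 percentage point drop is described as an improvement of more than 10 percent: it removes roughly 13.6 percent of the errors that remained.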

We also learned a lot working on Switchboard, and the resulting technology is now deployed in products like Cortana, Cognitive Services, and PowerPoint Presentation Translator.

Synced: How did you improve from 5.9 to 5.1 percent? Did you change your model or adjust the parameters?

Huang: We ran more than a thousand experiments, evaluated hundreds of different models, and tried almost all permutations and combinations. We worked strenuously!

Model progress came from the following. First, for the acoustic models, we used Bi-LSTM and ResNet models at the same time, independent of and parallel to each other. We also combined a CNN and a Bi-LSTM into one model, which extracts the underlying features through three convolutional layers and then uses a six-layer Bi-LSTM to learn sequence dependencies between the features.

Second, the language model was further refined from the word level to the character level, utilizing both global information of the entire dialogue and the local information of the session.

The third contribution involved using different signals for different model combinations. The idea is similar to the way random forests improve on individual decision trees; here the signals are the most basic subphonetic units, senones. Introducing different signals makes the system more robust.
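The ensemble idea can be illustrated with a small sketch. The interview does not say how the models were actually combined, so everything here is an assumption: the function name `combine_frame_posteriors`, the simple weighted average, and the made-up posterior values for three models over four senones on a single audio frame.

```python
def combine_frame_posteriors(posteriors, weights=None):
    """Weighted average of several models' senone posteriors for one frame.

    `posteriors` is a list of probability distributions, one per model,
    all over the same senone inventory. This is an illustrative scheme,
    not necessarily the one used in the system described above.
    """
    n_models = len(posteriors)
    if weights is None:
        weights = [1.0 / n_models] * n_models  # equal trust in every model
    n_senones = len(posteriors[0])
    combined = [
        sum(w * p[s] for w, p in zip(weights, posteriors))
        for s in range(n_senones)
    ]
    total = sum(combined)
    return [c / total for c in combined]  # renormalize to sum to 1


# Hypothetical posteriors from three models for the same frame.
blstm = [0.50, 0.30, 0.15, 0.05]
resnet = [0.20, 0.60, 0.15, 0.05]
cnn_blstm = [0.25, 0.55, 0.10, 0.10]

combined = combine_frame_posteriors([blstm, resnet, cnn_blstm])
best = max(range(len(combined)), key=combined.__getitem__)
print(best)  # 1
```

In this toy case the first model alone would pick senone 0, but the combined distribution favors senone 1, which is the sense in which pooling several models can be more robust than relying on any single one.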

Synced: If a 5.9 percent error rate reaches the level of human professional stenographers, how do we define going beyond that to 5.1 percent? Have we solved speech recognition?

Huang: Speech recognition as a whole is still far from being solved; what we have is a solution for Switchboard. Let me provide an example. IBM has a four-person professional transcription team in Australia. Working together, they compared, discussed, and listened repeatedly to the voice data and achieved 5.1 percent. Although our system also reached this “superhuman” level, it only did so on the Switchboard task.

In reality, speech recognition is complicated by accents, noise, far-field challenges, and speech rate variations. Our “superhuman” achievement is only a tiny milestone. We added four additional neural networks and have a more powerful language model than last year, but the effect isn’t immediate, so there is still some distance to go before actual application.

Synced: What areas is your research focusing on?

Huang: Language understanding, especially grasping implied meanings. For perception, computers can reach human level in the next few years, but for cognition, humans can still build a more thorough understanding of the speaker’s meaning through context, gestures, eye movements, etc. This is still a huge gap for computers to overcome.

On the other hand, the current system is also very complicated. We have 14 neural networks in the speech part running in parallel. The system still needs to be simplified.

Synced: You participated in CMU’s voice research development before joining Microsoft. How would you compare the research teams of universities and companies when developing voice systems?

Huang: It’s essentially the same, in that many researchers at Microsoft Research are from CMU, which is very forward-thinking and hands-on, and produces end products relatively quickly. So Microsoft Research and CMU’s computer science department are very similar, not to mention that Microsoft deans Kai-Fu Lee, Harry Shum, and Hsiao-Wuen Hon were all taught by CMU Professor Raj Reddy.

There are also some differences. There is more image-based research in the schools, while companies focus more on audio and speech. Also, many industry research tasks can be applied for practical purposes, and so industry investment is relatively high. It is difficult for the academic community to compete in this regard.

In addition, although deep neural networks boosted machine perception, engineers need to consider problems other than speech recognition when introducing products. For example, what kind of data should be used to train the model, how can it be made more efficient, and how can it be better integrated with other systems? These are all things we need to consider if we want to introduce a simple, easy-to-use, and efficient product, but they are not necessarily considered when doing research in a university.

The resources required to solve engineering problems are also much greater than in a pure research environment. For example, theoretical research staff and resources account for less than 10 percent of our team.

Synced: Can you tell us a bit about the Microsoft subtitle product Presentation Translator, which was developed using speech recognition technology?

Huang: I think the most important thing about making a good product is to understand usage scenarios. I studied English at the University of Edinburgh. When I arrived in Scotland, I felt stupid because I could not understand Scottish English! At that time I imagined: if only every professor’s lecture had closed captions like the BBC, it would be great. So today, if a University of Edinburgh professor downloads the Presentation Translator, no Chinese student in Scotland will suffer the pain I did!

In March 2018 Huang’s team introduced a neural machine translation system that equals the performance of human experts in Chinese-to-English translation.


Localization: Meiling Wu | Editor: Meghan Han, Michael Sarazen
