MIT CSAIL & Amazon AI Framework Maps Raw Audio Directly to Semantics

New Spoken Language Understanding (SLU) research from MIT CSAIL and Amazon AI introduces semi-supervised frameworks that skip the intermediate text step, taking speech as input and achieving performance competitive with systems that leverage oracle text.

SLU is now widely used as a front-end technology in voice assistants and social bots. A typical SLU pipeline starts with an Automatic Speech Recognition (ASR) stage that maps audio to text and ends with a Natural Language Understanding (NLU) stage that maps the text to semantic slots.

The neural networks that power SLU frameworks, however, require large amounts of labelled training data, which is expensive and often difficult or impossible to collect. The proposed end-to-end (E2E) SLU framework instead learns to map audio directly to semantics through semi-supervised learning.

Semi-supervised learning here means pretraining model components on large amounts of unlabelled data, then fine-tuning them with target semantic labels. Although this approach has been implemented in previous studies, those efforts have had significant limitations:

  • Models need a separate ASR or feedback from ASR
  • Models are not designed to predict slot values
  • The generalization ability of self-supervised language models is not properly utilized

The researchers also identified several challenges with current SOTA models:

  • Model training is difficult under limited labels
  • Model noise-robustness is not trained or evaluated in real-life environments
  • Model evaluation lacks end-to-end intent classification

The proposed semi-supervised E2E learning framework uses ASR and a BERT language model pretrained on audio-text pairs to perform joint intent classification (IC) and slot labelling (SL) directly from speech under limited labels. The framework was also trained with explicit noise augmentation to make it robust to environmental noise.
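To make the idea of noise augmentation concrete, here is a minimal Python sketch of one common approach: mixing a noise recording into clean speech at a chosen signal-to-noise ratio (SNR). It is an illustration only, not the researchers' implementation. The function names `mix_at_snr` and `measured_snr_db`, the synthetic waveforms, and the 10 dB target are all assumptions made for this example.

```python
import math
import random


def mix_at_snr(speech, noise, snr_db):
    """Return speech plus noise, with the noise scaled to the requested SNR in dB."""
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    speech_power = sum(s * s for s in speech) / len(speech)
    noise_power = sum(n * n for n in noise) / len(noise)
    if speech_power == 0 or noise_power == 0:
        raise ValueError("signals must have non-zero power")
    # SNR_dB = 10 * log10(P_speech / P_scaled_noise); solve for the noise gain.
    target_noise_power = speech_power / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / noise_power)
    return [s + gain * n for s, n in zip(speech, noise)]


def measured_snr_db(clean, noisy):
    """SNR in dB of `noisy`, measured against the known clean signal."""
    signal_power = sum(c * c for c in clean)
    residual_power = sum((y - c) ** 2 for c, y in zip(clean, noisy))
    return 10 * math.log10(signal_power / residual_power)


# Hypothetical stand-ins for real audio: one second of a 440 Hz tone
# sampled at 16 kHz, and Gaussian noise in place of an environmental recording.
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(16000)]

noisy = mix_at_snr(speech, noise, snr_db=10.0)
print(round(measured_snr_db(speech, noisy), 2))
```

During training, a mix like this would typically be applied to each utterance with a randomly chosen noise clip and SNR, so the model sees the same labelled speech under varied acoustic conditions.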

In experiments on two public SLU corpora using the new E2E evaluation metric, the proposed framework achieved a slots edit F1 score comparable to that of systems using oracle text. The researchers say this is the first time an SLU model with speech as input has been shown to perform on par with NLU models, and suggest that future research could extend the framework’s capabilities to multilingual SLU.
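For readers unfamiliar with slot-level scoring, the sketch below computes a simplified micro-averaged F1 over (slot type, slot value) pairs in Python. It is not the paper's slots edit F1, whose exact definition is given in the paper. It only illustrates the general idea that, when slots are read directly from speech, a misrecognized value counts as an error even if the slot type is right. The function name `slot_f1` and the example utterances are hypothetical.

```python
def slot_f1(predicted, reference):
    """Micro-averaged F1 over (slot_type, slot_value) pairs across utterances."""
    true_pos = pred_total = ref_total = 0
    for pred, ref in zip(predicted, reference):
        pred_set, ref_set = set(pred), set(ref)
        # A pair counts as correct only if both type and value match exactly.
        true_pos += len(pred_set & ref_set)
        pred_total += len(pred_set)
        ref_total += len(ref_set)
    if true_pos == 0:
        return 0.0
    precision = true_pos / pred_total
    recall = true_pos / ref_total
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: the second city value was misheard from the audio.
reference = [
    [("city", "boston"), ("date", "monday")],
    [("artist", "adele")],
]
predicted = [
    [("city", "austin"), ("date", "monday")],
    [("artist", "adele")],
]

print(round(slot_f1(predicted, reference), 3))
```

Here two of the three predicted pairs match the reference, so precision and recall are both 2/3.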

The paper Towards Semi-Supervised Semantics Understanding from Speech was presented at the Self-Supervised Learning for Speech and Audio Processing Workshop at NeurIPS 2020 and is available on arXiv.


Analyst: Reina Qi Wan | Editor: Michael Sarazen; Yuan Yuan

