Semi-supervised learning has achieved impressive results on automatic speech recognition (ASR) tasks in recent years. One of the most widely adopted methods is pseudo-labelling, in which a teacher network trained on labelled audio generates a large amount of pseudo-labelled data that is then used to train a student network.
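The teacher-student loop described above can be sketched in a few lines (a minimal numpy illustration; the function name and the optional confidence threshold are our own assumptions, not details from the paper):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def generate_pseudo_labels(teacher_logits, threshold=0.0):
    """Label each frame with the teacher's argmax class. The confidence
    threshold is a common filtering practice, not necessarily used here."""
    probs = softmax(teacher_logits)
    labels = probs.argmax(axis=-1)
    keep = probs.max(axis=-1) >= threshold
    return labels, keep

# Teacher outputs for three unlabelled frames over four classes.
logits = np.array([[0.1, 2.0, 0.3, 0.0],
                   [1.5, 0.2, 0.1, 0.4],
                   [0.0, 0.1, 0.2, 3.0]])
labels, keep = generate_pseudo_labels(logits, threshold=0.5)
# labels -> [1 0 3]; all three frames pass the 0.5 confidence filter
```

The kept (audio, pseudo-label) pairs then serve as training targets for the student network.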
Despite their strong promise, pseudo-label methods have one main drawback: student performance suffers if the initial labelled data is not large or accurate enough to train a reliable teacher model.
To tackle this issue, a research team from Facebook AI has proposed a Contrastive Semi-supervised Learning (CSL) approach that combines pseudo-labelling with a contrastive loss to improve the stability of learned speech representations.
Contrastive loss (e.g. InfoNCE loss) is a self-supervised representation learning approach that has recently achieved stunning results in computer vision and speech applications. The method uses two groups of samples, positive and negative, selected relative to an anchor sample within a pretext task. The negative samples are selected randomly from a mini-batch or a memory bank, while the positive samples are augmented versions of the anchor, nearby frames, or samples from the same speaker. The positive and negative samples together determine the learned representation.
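A minimal numpy sketch of the InfoNCE loss for a single anchor may make this concrete (illustrative only; real systems batch this computation and learn the representations end to end):

```python
import numpy as np

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE for one anchor: push the positive's cosine similarity
    above every negative's, via a softmax cross-entropy over similarities."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    logits = np.array([cos(anchor, positive)] +
                      [cos(anchor, n) for n in negatives]) / temperature
    logits -= logits.max()  # numerical stability
    return -np.log(np.exp(logits[0]) / np.exp(logits).sum())

# Degenerate check: if the positive and negatives are indistinguishable
# from the anchor, the loss is log(1 + num_negatives) = log(4) here.
v = np.array([1.0, 0.0])
loss = info_nce(v, v, [v, v, v])  # -> log(4) ≈ 1.386
```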
The proposed CSL addresses the weaknesses of both approaches. It bypasses the hand-crafted selection of positive and negative samples by utilizing a supervised teacher. Moreover, because CSL uses relative distances between label classes as its learning signal, it tends to be more robust to noise in teacher-generated targets than standard pseudo-labelling methods.
The researchers explain that CSL pretraining comprises two components: an encoder that maps input audio into latent representations, and a projection network that maps the encoder representations into a new space suitable for applying the contrastive loss. A hybrid-NN supervised teacher generates pseudo-labels to guide the selection of positive and negative samples for the contrastive loss.
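The pseudo-label-guided sampling can be sketched as a supervised contrastive loss over the projection-network outputs, where frames sharing a teacher pseudo-label act as positives for one another (a simplified illustration under our own naming, not the paper's code):

```python
import numpy as np

def csl_loss(projections, pseudo_labels, temperature=0.1):
    """Supervised contrastive sketch: frames with the same teacher
    pseudo-label are positives for each other; all remaining frames
    in the batch serve as negatives."""
    z = projections / np.linalg.norm(projections, axis=1, keepdims=True)
    sim = (z @ z.T) / temperature
    np.fill_diagonal(sim, -np.inf)       # never contrast a frame with itself
    exp_sim = np.exp(sim - sim.max(axis=1, keepdims=True))
    labels = np.asarray(pseudo_labels)
    losses = []
    for i in range(len(labels)):
        pos = (labels == labels[i]) & (np.arange(len(labels)) != i)
        if not pos.any():
            continue                     # anchor with no positive in the batch
        log_prob = np.log(exp_sim[i][pos]) - np.log(exp_sim[i].sum())
        losses.append(-log_prob.mean())
    return float(np.mean(losses))

# Projections that cluster by pseudo-label give a low loss.
z = np.array([[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.01, 1.0]])
loss = csl_loss(z, [0, 0, 1, 1])
```

Note how the loss only asks that same-label frames sit closer than different-label frames, rather than pinning each frame to a hard class target, which is the "softer constraint" the authors describe.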
After pretraining, the researchers fine-tune the model with a frame-level cross-entropy loss, and other loss functions such as connectionist temporal classification (CTC) loss can also be employed. The work builds upon previous research in ASR pretraining that applies a contrastive loss in a supervised setup.
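For reference, the CTC objective mentioned above scores a label sequence by summing over all valid frame-level alignments. Below is a toy numpy version of the standard forward (alpha) recursion, working in probability space for readability (production systems use an optimized log-space implementation such as torch.nn.CTCLoss):

```python
import numpy as np

def ctc_loss(probs, labels, blank=0):
    """Negative log-likelihood of `labels` under the CTC alignment model.
    probs: (T, C) per-frame class probabilities; labels: target ids, no blanks."""
    # Extended label sequence with blanks interleaved: -, l1, -, l2, ..., -
    ext = [blank]
    for l in labels:
        ext += [l, blank]
    S, T = len(ext), len(probs)
    alpha = np.zeros((T, S))
    alpha[0, 0] = probs[0, blank]
    alpha[0, 1] = probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1, s]
            if s > 0:
                a += alpha[t - 1, s - 1]
            # Skipping a blank is allowed only between distinct labels.
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += alpha[t - 1, s - 2]
            alpha[t, s] = a * probs[t, ext[s]]
    # A valid path must end on the last label or the trailing blank.
    return -np.log(alpha[T - 1, S - 1] + alpha[T - 1, S - 2])

# Two frames, uniform probabilities over {blank, 'a'}, target "a": the three
# valid alignments (aa, a-, -a) each have probability 0.25.
probs = np.full((2, 2), 0.5)
loss = ctc_loss(probs, [1])  # -> -log(0.75) ≈ 0.2877
```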
The researchers summarize their paper’s main contributions as follows:
- Utilizing teacher pseudo-labels for selecting positive and negative samples, CSL is more stable than self-supervised pretraining methods, which are sensitive to the diversity and the criterion for choosing positive and negative samples. Moreover, CSL enables reliable sampling of positive examples within and across utterances in the mini-batch.
- CSL applies a softer constraint on learned representations through the contrastive loss. Such formulation improves robustness to noisy teacher pseudo-labels.
- Applying the contrastive loss over normalized representations emphasizes hard positive and negative examples, which enables the pretrained model to generalize better under out-of-domain conditions.
The researchers conducted several experiments involving the transcription of social media videos to test the performance of CSL. They used two data sources: de-identified public five-minute videos in British English and Italian from Facebook; and recordings of crowd-sourced workers responding to artificial prompts on mobile devices. They chose a hybrid-NN ASR system as the pseudo-labelling baseline and applied supervised frame-level cross-entropy (CE) fine-tuning to all pretrained models.
Compared to the supervised baseline, the best pseudo-labelling model (CE-PL) decreased the word error rate (WER) by about 36 percent for British English and 46 percent for Italian; CSL then achieved further relative improvements of 8 percent (British English) and 7 percent (Italian) over the best CE-PL performance.
CSL’s British English WER reduction jumped to 19 percent under the ultra low-resource condition of one hour of labelled data for teacher supervision; and when generalizing to out-of-domain conditions, the approach achieved a 17 percent WER reduction compared to the best CE-PL pretrained model.
The paper Contrastive Semi-Supervised Learning for ASR is on arXiv.
Author: Hecate He | Editor: Michael Sarazen