The rapid development and impressive transfer learning capabilities of large-scale pretrained language models have ignited a research trend toward unified multilingual models that handle all speech and text understanding and generation tasks. This week, a Google Research team took a significant step in that direction.
In the new paper Mu²SLAM: Multitask, Multilingual Speech and Language Models, the Google researchers present Mu²SLAM, a multilingual sequence-to-sequence pretraining method for speech and text models that covers arbitrary tasks in over 100 languages and achieves state-of-the-art translation performance on the CoVoST benchmark.


Mu²SLAM is based on an encoder-decoder backbone model and is jointly pretrained on four data types: speech-only, text-only, speech-to-text, and text-to-text. The researchers scale coverage to more than 100 spoken and written languages, unify all training data into a sequence-to-sequence format, and apply similar optimization objectives to both the encoder and decoder.
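To make the data-unification step concrete, here is a minimal sketch of how the four pretraining data types might be mapped onto a single sequence-to-sequence format. This is not the paper's code; the field names and helper function are hypothetical.

```python
# Minimal sketch (not the authors' implementation): unifying the four
# pretraining data types into one (source, target) seq2seq format.
# All example field names below are hypothetical.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Seq2SeqExample:
    source: List[int]             # quantized speech frames or text token ids
    target: Optional[List[int]]   # None for unlabeled data (self-supervised objectives)
    source_modality: str          # "speech" or "text"
    target_modality: Optional[str]

def to_seq2seq(example: dict) -> Seq2SeqExample:
    """Map a raw example from any of the four data types onto the unified format."""
    if "speech" in example and "transcript" in example:        # speech-to-text (ASR/AST pairs)
        return Seq2SeqExample(example["speech"], example["transcript"], "speech", "text")
    if "source_text" in example and "target_text" in example:  # text-to-text (e.g. MT pairs)
        return Seq2SeqExample(example["source_text"], example["target_text"], "text", "text")
    if "speech" in example:                                    # speech-only (unlabeled)
        return Seq2SeqExample(example["speech"], None, "speech", None)
    return Seq2SeqExample(example["text"], None, "text", None) # text-only (unlabeled)
```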
For speech inputs, following the SLAM (Bapna et al., 2021) and mSLAM (Bapna et al., 2022) speech and language models, the team transforms acoustic feature sequences into latent speech representations via a convolutional neural network (CNN) block. Text inputs pass through a token embedding layer that maps them to a sequence of embeddings, and both the speech and text representations are fed into a shared multi-modal encoder-decoder model.
The team pairs a deep encoder of 24 Conformer layers (Gulati et al., 2020), similar to mSLAM's encoder, with a shallow decoder of six Transformer layers (Vaswani et al., 2017) to boost inference speed while maintaining model accuracy.
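The PyTorch sketch below illustrates this modality handling and the deep-encoder/shallow-decoder asymmetry. It is an assumption-laden illustration rather than the released model: a plain Transformer encoder stands in for the Conformer layers, and all dimensions are illustrative, not the paper's hyperparameters.

```python
# Minimal sketch (not the released Mu²SLAM model) of the shared
# multi-modal encoder-decoder: CNN speech front end, token embedding
# for text, a deep (24-layer) encoder, and a shallow (6-layer) decoder.
import torch
import torch.nn as nn

class MultiModalSeq2Seq(nn.Module):
    def __init__(self, vocab_size=32000, d_model=512, n_heads=8):
        super().__init__()
        # Speech front end: CNN block mapping acoustic features (80-dim
        # log-mel frames assumed here) to latent speech representations.
        self.speech_frontend = nn.Sequential(
            nn.Conv1d(80, d_model, kernel_size=3, stride=2, padding=1), nn.GELU(),
            nn.Conv1d(d_model, d_model, kernel_size=3, stride=2, padding=1), nn.GELU(),
        )
        # Text front end: token embedding layer.
        self.text_embed = nn.Embedding(vocab_size, d_model)
        # Deep shared encoder; a plain TransformerEncoder stands in
        # for the 24 Conformer layers used in the paper.
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=24)
        # Shallow six-layer Transformer decoder for faster inference.
        dec_layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=6)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, source, target_ids, modality="speech"):
        if modality == "speech":  # source: (batch, frames, 80) acoustic features
            x = self.speech_frontend(source.transpose(1, 2)).transpose(1, 2)
        else:                     # source: (batch, seq) token ids
            x = self.text_embed(source)
        memory = self.encoder(x)
        hidden = self.decoder(self.text_embed(target_ids), memory)
        return self.lm_head(hidden)
```

The asymmetry pays off because the decoder runs once per generated token at inference time while the encoder runs only once per input, so concentrating depth in the encoder preserves capacity at little decoding cost.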



In their empirical study, the team evaluated Mu²SLAM on multilingual speech translation, multilingual speech recognition, and multilingual text understanding tasks, using the CoVoST (Wang et al., 2021b) automatic speech translation, VoxPopuli (Wang et al., 2021a) automatic speech recognition, and XTREME (Hu et al., 2020) multilingual text understanding benchmarks. In the experiments, Mu²SLAM set new state-of-the-art results on CoVoST among models trained on public datasets, performed comparably to mSLAM on VoxPopuli, and outperformed mSLAM by six percent on XTREME.
Overall, the study shows that Mu²SLAM models can match the performance of uni-modal text models and significantly outperform speech-only models on speech and text understanding and generation tasks. Because the Mu²SLAM pretraining also introduces a text-to-speech loss, the team hopes to explore speech generation in future work.
The paper Mu²SLAM: Multitask, Multilingual Speech and Language Models is on arXiv.
Author: Hecate He | Editor: Michael Sarazen
