Deep neural networks have recently shown promise for processing speech in a manner similar to the human brain, generating activations that resemble the brain's responses to the same inputs. Developing such algorithms remains difficult, however, as they typically require massive training data, supervised labels, textual rather than more realistic raw sensory inputs, and prohibitively large memory.
In the new paper Toward a Realistic Model of Speech Processing in the Brain with Self-supervised Learning, a research team from Meta AI, PSL University, Université Paris Cité, Université Paris-Saclay, University of Toronto and INSERM shows that self-supervised architectures such as Wav2Vec 2.0 (Baevski et al., 2020), which stack convolutional and transformer layers to predict a quantization of the latent representations of speech waveforms, can learn brain-like representations from as little as 600 hours of unlabelled speech. The model also learns sound-generic as well as speech- and language-specific representations similar to those of the prefrontal and temporal cortices.
Framework for Self-Supervised Learning of Speech Representations
The team summarizes their study’s main contributions as:
- Self-supervised learning leads Wav2Vec 2.0 to learn latent representations of the speech waveform similar to those of the human brain.
- The functional hierarchy of its transformer layers aligns with the cortical hierarchy of speech in the brain and reveals the whole-brain organization of speech processing with unprecedented clarity.
- The auditory-, speech-, and language-specific representations learned by the model converge to those of the human brain.
- Behavioural comparisons to 386 supplementary participants’ results on a speech sound discrimination task confirm this common language specialization.
The Wav2Vec 2.0 architecture comprises three modules: 1) a feature encoder that transforms raw mono speech waveform inputs into latent representations, 2) a quantization module that discretizes these latent representations into a finite dictionary of discrete speech representations, and 3) a "context network" that uses the previously generated outputs to produce contextualized embeddings. The team trained several variants of Wav2Vec 2.0 on different datasets with both self-supervised and supervised learning objectives and extracted the activations of each layer from both the feature encoder and the context network.
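The three-module pipeline can be illustrated with a toy numpy sketch. This is not the actual Wav2Vec 2.0 implementation: the weights are random, the "context network" is a simple causal running mean rather than a transformer stack, and the quantizer uses nearest-neighbour lookup instead of the paper's Gumbel-softmax product quantization. It only shows how raw audio flows through encoder, quantizer, and context modules:

```python
import numpy as np

rng = np.random.default_rng(0)

def feature_encoder(waveform, kernel=10, stride=5, dim=8):
    """Toy stand-in for the convolutional feature encoder:
    a strided 1-D convolution mapping raw samples to latent frames."""
    filters = rng.standard_normal((dim, kernel))  # random, untrained weights
    n_frames = (len(waveform) - kernel) // stride + 1
    frames = np.stack([waveform[i * stride: i * stride + kernel]
                       for i in range(n_frames)])
    return frames @ filters.T  # shape: (n_frames, dim)

def quantize(latents, codebook):
    """Toy quantization module: map each latent frame to its nearest
    codebook entry (the real model uses Gumbel-softmax quantization)."""
    d = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return codebook[d.argmin(1)]

def context_network(latents):
    """Toy stand-in for the transformer context network: a causal
    running mean, so each frame is conditioned on its past context."""
    return np.cumsum(latents, 0) / np.arange(1, len(latents) + 1)[:, None]

waveform = rng.standard_normal(16000)          # 1 s of fake 16 kHz audio
z = feature_encoder(waveform)                  # latent representations
q = quantize(z, rng.standard_normal((32, 8)))  # discretized targets
c = context_network(z)                         # contextualized embeddings
print(z.shape, q.shape, c.shape)
```

During self-supervised training, the real model masks spans of the latent frames and trains the context network to identify the correct quantized target for each masked frame; it is these per-layer activations that the team later compares to brain recordings.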
In their empirical studies, the team compared the Wav2Vec 2.0 learned representations to those in the brains of 412 human volunteers (351 English speakers, 28 French speakers and 33 Mandarin speakers) recorded with functional magnetic resonance imaging (fMRI) while they passively listened to approximately one hour of audio books in their native language.
The experimental results show that:
- Wav2Vec 2.0 activations can predict brain activity in nearly all cortical areas.
- Self-supervised learning leads to slightly better brain predictions than supervised learning.
- The hierarchy of Wav2Vec 2.0's layers maps onto the cortical hierarchy of speech.
- 600 hours of self-supervised learning suffice for Wav2Vec 2.0 to learn brain-like language-specific representations.
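The model-to-brain comparison behind these results is typically computed as a "brain score": a regularized linear map is fit from model activations to each voxel's fMRI response, and prediction quality on held-out data is measured with Pearson correlation. A minimal numpy sketch with synthetic stand-in data (the array sizes and the noise level are illustrative assumptions, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in data: per-timepoint model activations (X) and
# fMRI voxel responses (Y). In the real study these are aligned to
# the same audio-book stimulus heard by the participants.
n_train, n_test, n_feat, n_vox = 200, 50, 16, 4
W_true = rng.standard_normal((n_feat, n_vox))
X = rng.standard_normal((n_train + n_test, n_feat))
Y = X @ W_true + 0.5 * rng.standard_normal((n_train + n_test, n_vox))

Xtr, Xte = X[:n_train], X[n_train:]
Ytr, Yte = Y[:n_train], Y[n_train:]

# Ridge regression: fit a linear map from activations to voxels.
lam = 1.0
W = np.linalg.solve(Xtr.T @ Xtr + lam * np.eye(n_feat), Xtr.T @ Ytr)
pred = Xte @ W

def pearson(a, b):
    """Column-wise Pearson correlation: one score per voxel."""
    a = a - a.mean(0)
    b = b - b.mean(0)
    return (a * b).sum(0) / np.sqrt((a ** 2).sum(0) * (b ** 2).sum(0))

# "Brain score": correlation between predicted and actual held-out
# voxel responses; computed per layer, this traces the functional
# hierarchy across the cortex.
scores = pearson(pred, Yte)
print(scores.round(2))
```

Repeating this per model layer and per cortical region is what lets the team show that earlier layers best predict auditory areas while deeper transformer layers best predict higher-level regions.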
Overall, this work demonstrates that self-supervised learning on a limited amount of speech data can yield representations similar to those underlying human speech perception, a step toward a realistic model of speech processing in the brain.
The paper Toward a Realistic Model of Speech Processing in the Brain with Self-supervised Learning is on arXiv.
Author: Hecate He | Editor: Michael Sarazen