State space models (SSMs), originally designed for modelling dynamic systems, have achieved outstanding sequence-to-sequence performance in fields ranging from time series analysis to audio generation. SSMs, however, struggle on language modelling tasks, where they cannot match the performance of transformer architectures.
In the new paper Hungry Hungry Hippos: Towards Language Modeling with State Space Models, a research team from Stanford University and the State University of New York at Buffalo explores the expressivity gap between SSMs and the attention mechanisms of transformer language models. To improve the training efficiency of SSMs on modern hardware, the team also proposes FlashConv, a novel state-passing algorithm that yields a 2x speedup on the Long Range Arena benchmark and enables 1.6x faster text generation than standard transformer architectures.

The researchers set out to understand and narrow the gap between attention and SSMs in language modelling, in terms of both modelling capability and hardware efficiency. They identify their main contributions as follows:
- We use synthetic language modelling tasks to show that there is an expressivity gap between SSMs and attention.
- We design a new SSM layer that nearly matches attention in language modelling.
- We propose better hardware-aware algorithms for SSMs that allow them to take advantage of modern accelerators and run faster than attention.

To identify expressivity gaps between SSMs and attention, the team uses synthetic language modelling tasks focused on text manipulation, such as recalling tokens from earlier time steps and comparing tokens from different points in a sequence. As an alternative to attention, they propose Hungry Hungry Hippos (H3), an SSM-based layer designed to solve such language modelling tasks.
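To make the flavour of these tasks concrete, below is a minimal Python sketch of an associative-recall-style example. The exact task format, vocabularies, and the function name are illustrative assumptions, not the paper's specification.

```python
import random

def associative_recall_example(keys_vocab, values_vocab, num_pairs):
    """Generate one toy associative-recall example (illustrative format):
    the sequence lists key-value pairs, ends with one of the keys as a
    query, and the target is the value that followed that key earlier."""
    keys = random.sample(keys_vocab, num_pairs)            # distinct keys
    values = [random.choice(values_vocab) for _ in keys]
    query = random.choice(keys)
    target = values[keys.index(query)]
    sequence = [tok for pair in zip(keys, values) for tok in pair] + [query]
    return sequence, target

# Example output: (['d', 3, 'a', 7, 'f', 3, 'a'], 7) -- the model must recall 7.
print(associative_recall_example(list("abcdef"), list(range(10)), 3))
```

Solving such a task requires the model to remember an earlier token and compare it with a later one, which is exactly the capability the authors found standard SSMs lacked.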

The proposed H3 layer stacks two discrete SSMs, a shift SSM and a diagonal SSM, and uses multiplicative interactions between input projections and the corresponding SSM outputs to model comparisons between tokens at different points in a sequence.
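As a rough illustration of this structure, here is a minimal single-head PyTorch sketch, not the authors' implementation: the `shift_ssm` and `diag_ssm` arguments stand in for the layer's shift and diagonal SSMs, and all names and shapes are assumptions made for the example.

```python
import torch
import torch.nn as nn

class H3Sketch(nn.Module):
    """Minimal single-head sketch of an H3-style layer: two SSMs with
    multiplicative interactions between projections of the input."""
    def __init__(self, d_model, shift_ssm, diag_ssm):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)
        self.shift_ssm = shift_ssm   # causal map (B, T, D) -> (B, T, D)
        self.diag_ssm = diag_ssm     # causal map (B, T, D) -> (B, T, D)

    def forward(self, u):            # u: (batch, seq_len, d_model)
        q, k, v = self.q_proj(u), self.k_proj(u), self.v_proj(u)
        k = self.shift_ssm(k)        # first SSM recalls recent tokens
        kv = k * v                   # multiplicative interaction between projections
        kv = self.diag_ssm(kv)       # second SSM accumulates k*v over time
        return self.out_proj(q * kv) # gate the accumulated memory with q
```

Intuitively, multiplying the two projections lets the layer log comparisons between tokens, and the second multiplication lets a later position retrieve that stored information, mirroring what the synthetic recall tasks demand.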

H3 matches the performance of attention on the synthetic language tasks and nearly closes the gap with transformers on language modelling. Moreover, a simple hybrid H3-attention model surpasses transformers on the OpenWebText benchmark by 1.0 perplexity (PPL).
The team also identifies ways to improve the efficiency of SSMs on modern hardware via FlashConv, a hierarchical algorithm for computing SSMs inspired by IO-aware attention. FlashConv exploits the recurrent properties of SSMs to split the input into chunks that fit into GPU SRAM, where the FFT-based (fast Fourier transform) convolution can be computed efficiently, with state passed from one chunk to the next. FlashConv can thus scale SSMs to any sequence length with near-linear compute complexity.
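The state-passing idea can be illustrated outside the GPU-kernel setting. The NumPy sketch below is a toy under stated assumptions, not the fused FlashConv kernel: it runs a diagonal SSM over a long input chunk by chunk, convolving each chunk with the SSM kernel via FFT and carrying a recurrent state across chunk boundaries so the result matches processing the whole sequence at once.

```python
import numpy as np

def chunked_ssm_conv(u, a, b, c, chunk_len):
    """Toy state-passing convolution for a diagonal SSM
    x_t = a * x_{t-1} + b * u_t,  y_t = c . x_t  (illustrative names).
    The input is processed in chunks that could each fit in fast memory."""
    T, N, L = len(u), len(a), chunk_len
    powers = a[None, :] ** np.arange(L)[:, None]        # powers[k] = a**k, shape (L, N)
    K = (powers * b[None, :]) @ c                       # SSM kernel K[k] = c . (a**k * b)
    y, state = np.zeros(T), np.zeros(N)                 # state = x before the current chunk
    for start in range(0, T, L):
        uc = u[start:start + L]
        l = len(uc)
        # FFT-based causal convolution of this chunk with the kernel
        n = 2 * L
        yc = np.fft.irfft(np.fft.rfft(uc, n) * np.fft.rfft(K, n), n)[:l]
        # add the contribution of the state carried in from previous chunks
        yc += (powers[:l] * a[None, :]) @ (c * state)   # c . (a**(t+1) * state)
        y[start:start + l] = yc
        # update the state to pass to the next chunk
        decay = a ** np.arange(l - 1, -1, -1)[:, None]  # a**(l-1-s), shape (l, N)
        state = (a ** l) * state + (decay * uc[:, None] * b[None, :]).sum(axis=0)
    return y

# tiny check against the plain recurrence
rng = np.random.default_rng(0)
u = rng.standard_normal(37)
a, b, c = rng.uniform(0.5, 0.95, 4), rng.standard_normal(4), rng.standard_normal(4)
x, y_ref = np.zeros(4), []
for ut in u:
    x = a * x + b * ut
    y_ref.append(c @ x)
print(np.allclose(chunked_ssm_conv(u, a, b, c, chunk_len=8), y_ref))  # True
```

Because each chunk only needs its own FFT plus a small carried state, the work grows roughly linearly with sequence length, which is the property FlashConv exploits on GPUs.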


In the team’s evaluations, FlashConv set a new speed record on the Long Range Arena benchmark (Tay et al., 2020) using a Structured State Space sequence model (S4, Gu et al., 2022), beating transformers by 5.8x and previous S4 implementations by 2x. The team also used FlashConv to train hybrid H3-attention language models with up to 1.3B parameters, which achieved competitive results on most SuperGLUE benchmark tasks in zero- and few-shot settings.
The paper Hungry Hungry Hippos: Towards Language Modeling with State Space Models is on arXiv.
Author: Hecate He | Editor: Michael Sarazen

We know you don’t want to miss any news or research breakthroughs. Subscribe to our popular newsletter Synced Global AI Weekly to get weekly AI updates.