From text and proteins to audio, images, and state sequences, decoder-only generative models have proven their ability to generate new sequences across various modalities. However, integrating multiple generative foundation models, especially those trained on different modalities, into a cohesive and superior system presents significant challenges.
In a new paper Zipper: A Multi-Tower Decoder Architecture for Fusing Modalities, a Google DeepMind research team introduces Zipper, a multi-tower decoder architecture that flexibly composes a multimodal generative model from independently pre-trained unimodal decoders, whose components can then be reused and repurposed in new modality combinations.
Like vocabulary expansion techniques, Zipper can perform generative tasks across all modalities. Unlike them, however, it is more flexible and composable: the unimodal backbones can be pretrained independently of multimodal alignment fine-tuning, and unimodal performance can be preserved by simply freezing the corresponding backbone.
The Zipper architecture features two autoregressive decoder towers (or backbones) that are “zipped” together using gated cross-attention layers. Each backbone is trained separately on a single modality with next-token prediction. Cross-attention is then inserted at regular intervals between the decoder backbones, so that at these interleaved layers the representations of one modality can attend to those of the other.
Projection layers between the modalities during cross-attention facilitate the transformation of representations from one modality to another, especially when one or both backbones are frozen. Additionally, a non-linear input projection layer is added directly after the input embeddings of each backbone to better adjust the unimodal representations of inputs for multimodal tasks.
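The zipping mechanism described above can be illustrated with a minimal PyTorch sketch. This is an assumption-laden simplification, not the paper's implementation: class names, dimensions, and head counts are illustrative, standard encoder layers stand in for the causal decoder layers of the real backbones, and the gate is assumed to be a tanh gate initialized to zero (so a frozen backbone's behavior is untouched at the start of fine-tuning).

```python
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    """Gated cross-attention with a cross-modal projection layer.
    The gate starts at zero, so initially the block is an identity
    and the (possibly frozen) backbone's outputs are unchanged."""
    def __init__(self, dim: int, src_dim: int, num_heads: int = 4):
        super().__init__()
        # Projection layer: map the other tower's hidden size to ours.
        self.proj = nn.Linear(src_dim, dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.gate = nn.Parameter(torch.zeros(1))  # tanh gate, starts closed

    def forward(self, x: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        ctx = self.proj(context)                      # other modality -> our space
        attn_out, _ = self.attn(self.norm(x), ctx, ctx)
        return x + torch.tanh(self.gate) * attn_out   # gated residual update

class ZipperBlock(nn.Module):
    """One 'zipped' layer: each tower runs its own decoder layer, then
    cross-attends into the other tower's representations. (Encoder
    layers are used here as a stand-in for causal decoder layers.)"""
    def __init__(self, dim_a: int, dim_b: int):
        super().__init__()
        self.layer_a = nn.TransformerEncoderLayer(dim_a, nhead=4, batch_first=True)
        self.layer_b = nn.TransformerEncoderLayer(dim_b, nhead=4, batch_first=True)
        self.xattn_a = GatedCrossAttention(dim_a, src_dim=dim_b)
        self.xattn_b = GatedCrossAttention(dim_b, src_dim=dim_a)

    def forward(self, a: torch.Tensor, b: torch.Tensor):
        a, b = self.layer_a(a), self.layer_b(b)
        return self.xattn_a(a, b), self.xattn_b(b, a)
```

Freezing one backbone then amounts to setting `requires_grad_(False)` on that tower's decoder layers while leaving the cross-attention and projection parameters trainable.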
Empirical results on speech and text modalities demonstrate that, with frozen modality backbones, Zipper performs competitively against vocabulary expansion baselines on text-based generative tasks such as automatic speech recognition (ASR). Furthermore, with an unfrozen speech backbone, Zipper reduces word error rate (WER) by 12 absolute points (a 40% relative error reduction) compared to vocabulary expansion baselines on speech-generative text-to-speech (TTS) tasks.
For future research, the team plans to extend the model beyond two unimodal decoders to combine a larger number of modalities. They also intend to scale Zipper to larger model sizes and a greater diversity of data.
The paper Zipper: A Multi-Tower Decoder Architecture for Fusing Modalities is on arXiv.
Author: Hecate He | Editor: Chain Zhang

