From text and proteins to audio, images, and state sequences, decoder-only generative models have proven their ability to generate new sequences across various modalities. However, integrating multiple generative foundation models, especially those trained on different modalities, into a cohesive and superior system presents significant challenges.
In a new paper, Zipper: A Multi-Tower Decoder Architecture for Fusing Modalities, a Google DeepMind research team introduces Zipper, a multi-tower decoder architecture that flexibly composes a multimodal generative model from independently pre-trained unimodal decoders, which can then be reused and repurposed in new multimodal combinations.

Like vocabulary expansion techniques, Zipper can perform generative tasks across all of its modalities. Unlike vocabulary expansion, however, Zipper is more flexible and composable: the unimodal backbones can be pretrained independently of the multimodal alignment fine-tuning, and unimodal performance can be preserved by freezing the corresponding backbone.

The Zipper architecture features two autoregressive decoder towers (or backbones) that are “zipped” together using gated cross-attention layers. Each backbone is trained separately on a single modality using next-token prediction. Cross-attention layers are then interleaved between the decoder backbones at regular intervals, enabling the representations of one modality to be cross-attended into the other at those layers.
Projection layers applied during cross-attention map representations from one modality’s embedding space into the other’s, which matters especially when one or both backbones are frozen. Additionally, a non-linear input projection layer is added directly after the input embeddings of each backbone to better adapt the unimodal input representations for multimodal tasks.
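To make the description above concrete, here is a minimal PyTorch sketch of the two-tower idea: two independent decoder stacks, gated cross-attention interleaved every few layers, a projection that maps one tower’s hidden size to the other’s, and a non-linear input projection after the embeddings. All class names, layer counts, and dimensions are illustrative assumptions, not the paper’s actual implementation.

```python
import torch
import torch.nn as nn


class GatedCrossAttention(nn.Module):
    """Cross-attention from one tower into the other, scaled by a learnable gate.

    Hypothetical sketch: the gate starts at zero (tanh(0) = 0), so at
    initialization each tower behaves exactly like its pretrained backbone.
    """

    def __init__(self, dim: int, other_dim: int, num_heads: int = 4):
        super().__init__()
        # Projection layer: map the other modality's hidden size to ours.
        self.proj = nn.Linear(other_dim, dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor, other: torch.Tensor) -> torch.Tensor:
        kv = self.proj(other)                      # (B, T_other, dim)
        attn_out, _ = self.attn(x, kv, kv)         # query = this tower
        return x + torch.tanh(self.gate) * attn_out


class ZipperToy(nn.Module):
    """Two decoder towers 'zipped' by gated cross-attention (toy version)."""

    def __init__(self, dim_a=32, dim_b=48, depth=4, cross_every=2):
        super().__init__()
        self.layers_a = nn.ModuleList(
            nn.TransformerEncoderLayer(dim_a, nhead=4, batch_first=True)
            for _ in range(depth))
        self.layers_b = nn.ModuleList(
            nn.TransformerEncoderLayer(dim_b, nhead=4, batch_first=True)
            for _ in range(depth))
        # Interleave cross-attention every `cross_every` layers.
        self.cross_a, self.cross_b = nn.ModuleDict(), nn.ModuleDict()
        for i in range(depth):
            if (i + 1) % cross_every == 0:
                self.cross_a[str(i)] = GatedCrossAttention(dim_a, dim_b)
                self.cross_b[str(i)] = GatedCrossAttention(dim_b, dim_a)
        # Non-linear input projections directly after the embeddings.
        self.in_proj_a = nn.Sequential(nn.Linear(dim_a, dim_a), nn.GELU())
        self.in_proj_b = nn.Sequential(nn.Linear(dim_b, dim_b), nn.GELU())

    def forward(self, ha: torch.Tensor, hb: torch.Tensor):
        ha, hb = self.in_proj_a(ha), self.in_proj_b(hb)
        for i, (la, lb) in enumerate(zip(self.layers_a, self.layers_b)):
            ha, hb = la(ha), lb(hb)
            if str(i) in self.cross_a:
                # Cross-attend each tower into the other's pre-update state.
                ha_new = self.cross_a[str(i)](ha, hb)
                hb = self.cross_b[str(i)](hb, ha)
                ha = ha_new
        return ha, hb


model = ZipperToy()
speech = torch.randn(2, 5, 32)   # e.g. a frozen speech tower's embeddings
text = torch.randn(2, 7, 48)     # e.g. a frozen text tower's embeddings
out_a, out_b = model(speech, text)
```

Freezing a backbone here would simply mean setting `requires_grad_(False)` on `layers_a` or `layers_b` while the gates, projections, and cross-attention remain trainable, which mirrors how the paper preserves unimodal performance.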


Empirical results on the speech and text modalities demonstrate that, with frozen modality backbones, Zipper performs competitively against vocabulary expansion baselines on text-based generative tasks such as ASR. Furthermore, with unfrozen modality backbones, it achieves an absolute word error rate (WER) reduction of 12 points (a 40% relative error reduction) over vocabulary expansion baselines on the speech-generative TTS task.
For future research, the team plans to extend the model beyond two unimodal decoders to combine a larger number of modalities. They also intend to scale Zipper to larger model sizes and a greater diversity of data.
The paper Zipper: A Multi-Tower Decoder Architecture for Fusing Modalities is on arXiv.
Author: Hecate He | Editor: Chain Zhang

