Recent developments in large language models (LLMs) have made them increasingly versatile and task-agnostic. Given the impressive ability of LLMs to capture rich conceptual knowledge in their lexical embeddings, an intriguing question arises: are frozen LLMs capable of solving multimodal tasks?
This question, however, remains under-explored, and prior attempts have seen little success. To bridge this gap, in the new paper SPAE: Semantic Pyramid AutoEncoder for Multimodal Generation with Frozen LLMs, a research team from Google Research and Carnegie Mellon University introduces the Semantic Pyramid AutoEncoder (SPAE), the first successful method for enabling frozen LLMs to solve cross-modal tasks, outperforming state-of-the-art image understanding models by over 25%.
The team summarizes their main contributions as follows:
- This is the first successful method, to the best of our knowledge, that uses a frozen language model, trained solely on language tokens, to directly generate image content through in-context learning.
- We introduce a new SPAE tokenizer that produces interpretable representations of semantic concepts and fine-grained details in the form of multilingual linguistic tokens with adjustable lengths.
- We propose a new progressive prompting method that facilitates in-context generation of long cross-modal sequences.
- We evaluate our method on visual understanding and generation tasks; notably, our approach outperforms the best published few-shot image classification accuracy by an absolute 25% under the same in-context setting.
This work aims to equip frozen LLMs to model multiple modalities, including images, video and audio, as model-comprehensible language sequences. The proposed SPAE generates a lexical sequence that not only contains rich semantic information but also retains fine details for signal reconstruction.
SPAE produces a multi-scale representation arranged in a pyramid structure: the upper layers contain semantically central concepts, while the lower layers capture the fine-grained details needed for image reconstruction. Under this setting, SPAE can dynamically adjust the token length to adapt to different tasks. As a result, SPAE effectively translates a given image into a language that a frozen LLM can understand and process, so the resulting model can perform conditional image understanding and generation tasks without training on relevant image-text pairs.
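To make the pyramid idea concrete, here is a minimal, purely illustrative Python sketch of how a coarse-to-fine token pyramid supports adjustable sequence lengths. The layer sizes, token names, and the `tokens_for_budget` helper are assumptions for illustration, not the paper's actual implementation.

```python
# Hypothetical sketch of a pyramid token layout (illustrative only;
# layer sizes and tokens are assumptions, not SPAE's real design).

def pyramid_layer_sizes(num_layers):
    """Assume layer k holds 4**k tokens: 1, 4, 16, ... (coarse to fine)."""
    return [4 ** k for k in range(num_layers)]

def tokens_for_budget(pyramid, budget):
    """Concatenate layers top-down until the token budget is reached.

    Upper (semantic) layers are always included first, so a short
    sequence keeps the central concepts and drops only fine detail.
    """
    out = []
    for layer in pyramid:
        if len(out) + len(layer) > budget:
            break
        out.extend(layer)
    return out

# Toy pyramid: one concept token, then progressively finer detail tokens.
pyramid = [["dog"],
           ["ear", "fur", "nose", "tail"],
           ["tok%d" % i for i in range(16)]]
print(tokens_for_budget(pyramid, 5))  # concept layer plus first detail layer
```

A small budget (e.g. for a classification prompt) yields only the concept tokens, while a large budget (e.g. for reconstruction) includes all detail layers, mirroring how SPAE adapts token length per task.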
In their empirical study, the researchers evaluated SPAE on visual understanding and generation tasks. SPAE tokens achieve higher semantic CLIP scores than the VQGAN baseline, and SPAE consistently outperforms the few-shot image classification baseline, surpassing LQAE by 25% under the same setting.
Overall, this work demonstrates the potential of frozen LLMs for multimodal understanding and generation tasks without explicit training on these modalities.
The paper SPAE: Semantic Pyramid AutoEncoder for Multimodal Generation with Frozen LLMs is on arXiv.
Author: Hecate He | Editor: Chain Zhang