Large language models (LLMs) have emerged as powerful tools for a wide range of natural language processing (NLP) tasks. The push toward humanlike artificial general intelligence (AGI), however, will require equipping such models with additional capabilities, and multimodal perception is an essential next step.
In the new paper Language Is Not All You Need: Aligning Perception with Language Models, a Microsoft research team presents KOSMOS-1, a multimodal large language model (MLLM) that is able to perceive general modalities, learn in context, and follow instructions. KOSMOS-1 achieves impressive performance on language, perception-language, and vision tasks.
The researchers propose that LLMs with multimodal perception will be better equipped to acquire commonsense knowledge beyond the information they glean from text alone, and that this perception enrichment will facilitate LLM applications in new domains such as robotics and document intelligence. Multimodal perception also has the benefit of unifying multiple APIs to form a single general graphical user interface.
KOSMOS-1 follows the MetaLM training process, where a transformer-based LLM acts as a general-purpose interface and is augmented with various perception modules. Consistent with the MetaLM philosophy, the team treats language models as a universal task layer, enabling KOSMOS-1 to unify various task predictions as texts and capably handle natural-language instructions and action sequences.
Given the preceding context, KOSMOS-1 learns to generate text in an autoregressive manner. All non-text input modalities are embedded and then fed into its backbone transformer-based causal language model, with the transformer decoder serving as a general-purpose interface for all modalities. By interacting with natural language and the other modalities, KOSMOS-1 naturally inherits the capabilities of in-context learning and instruction following, and can thus handle both language and perception-intensive tasks.
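The idea of embedding every modality into one sequence for a causal decoder can be illustrated with a minimal sketch. This is not the KOSMOS-1 implementation: the functions `embed_text`, `embed_image`, `build_sequence`, and `causal_mask`, the embedding width, and the toy encoders are all hypothetical stand-ins chosen for illustration, and a real model would use learned embeddings and a trained image encoder.

```python
import random

EMBED_DIM = 4  # hypothetical embedding width, chosen only for this sketch


def embed_text(token, dim=EMBED_DIM):
    """Toy text embedding: a deterministic pseudo-random vector per token
    (a stand-in for a learned lookup table)."""
    rng = random.Random(token)
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def embed_image(pixels, dim=EMBED_DIM):
    """Toy image encoder: mean-pools pixel values into one fixed-size vector
    (a stand-in for a trained vision encoder)."""
    mean = sum(pixels) / len(pixels)
    return [mean] * dim


def build_sequence(segments):
    """Flatten interleaved (modality, payload) segments into a single
    sequence of embeddings, preserving their original order."""
    sequence = []
    for modality, payload in segments:
        if modality == "text":
            sequence.extend(embed_text(tok) for tok in payload.split())
        elif modality == "image":
            sequence.append(embed_image(payload))
        else:
            raise ValueError(f"unsupported modality: {modality}")
    return sequence


def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j,
    i.e. only to itself and earlier positions."""
    return [[j <= i for j in range(n)] for i in range(n)]


# Interleaved input: text, then an image, then more text.
segments = [
    ("text", "a photo of"),
    ("image", [0.2, 0.4, 0.6]),
    ("text", "a cat"),
]
sequence = build_sequence(segments)
mask = causal_mask(len(sequence))
```

Because the image occupies an ordinary position in the sequence, the causal mask lets every later text position attend to it, which is what allows a single decoder to serve as the interface for all modalities in this sketch.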
In their empirical study, the team trained KOSMOS-1 on web-scale multimodal corpora and conducted experiments on a wide range of language and multimodal tasks, as well as on the Raven IQ test. KOSMOS-1 achieved impressive performance on all tasks, demonstrating its strong multimodal perception and nonverbal reasoning abilities.
In KOSMOS-1, the researchers introduce an MLLM with promising new capabilities and opportunities. In the future, they plan to equip KOSMOS-1 with speech capabilities and scale up its model size.
The paper Language Is Not All You Need: Aligning Perception with Language Models is on arXiv.
Author: Hecate He | Editor: Michael Sarazen

