Large language models (LLMs) have emerged as powerful tools for a wide range of natural language processing (NLP) tasks. The push toward humanlike artificial general intelligence (AGI), however, will require equipping such models with additional capabilities, and multimodal perception is an essential next step.
In the new paper Language Is Not All You Need: Aligning Perception with Language Models, a Microsoft research team presents KOSMOS-1, a multimodal large language model (MLLM) that is able to perceive general modalities, learn in context, and follow instructions. KOSMOS-1 achieves impressive performance on language, perception-language, and vision tasks.
The researchers propose that LLMs with multimodal perception will be better equipped to acquire commonsense knowledge beyond the information they glean from text alone, and that this perceptual enrichment will facilitate LLM applications in new domains such as robotics and document intelligence. Multimodal perception also helps unify various APIs, since graphical user interfaces are a natural and general way to interact with models.
KOSMOS-1 follows the MetaLM training process, in which a transformer-based LLM acts as a general-purpose interface augmented with various perception modules. Consistent with the MetaLM philosophy, the team treats the language model as a universal task layer, enabling KOSMOS-1 to unify diverse task predictions as text and to handle natural-language instructions and action sequences.
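To make the universal-task-layer idea concrete, here is a minimal sketch of how different perception-language tasks can all be cast as text continuation over sequences that embed images. The `<image>...</image>` markup and the prompt wordings below are illustrative assumptions rather than the paper's exact input format.

```python
# Hedged illustration: every task is serialized as a text sequence with an embedded
# image, so the language model answers all of them via next-token generation.
def build_prompt(task: str) -> str:
    image = "<image> [image embedding] </image>"  # stands in for projected image features
    if task == "captioning":
        return f"{image} Description:"                              # model continues with a caption
    if task == "vqa":
        return f"{image} Question: what is on the table? Answer:"   # answer generated as text
    if task == "classification":
        return f"{image} This photo shows a"                        # class name generated as text
    raise ValueError(f"unknown task: {task}")

for task in ("captioning", "vqa", "classification"):
    print(build_prompt(task))
```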
Given the previous context, KOSMOS-1 learns to generate text in an autoregressive manner. All non-text input modalities are embedded and then fed into its backbone transformer-based causal language model, with the transformer decoder serving as a general-purpose interface to all modalities. Because it interacts with natural language and the other modalities through the same interface, KOSMOS-1 naturally inherits the capabilities of in-context learning and instruction following, and can thus handle both language and perception-intensive tasks.
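The sketch below shows, under assumed module names and dimensions, how embedded image features can be projected into the token embedding space and consumed by a causal transformer decoder. It illustrates the general mechanism rather than the actual KOSMOS-1 implementation.

```python
# A minimal sketch, not the released KOSMOS-1 code: module names, dimensions, and the
# simple "prepend image embeddings" layout are assumptions made for illustration.
import torch
import torch.nn as nn


class MultimodalCausalLM(nn.Module):
    def __init__(self, vocab_size=32000, d_model=768, n_heads=12, n_layers=4, image_feat_dim=1024):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        # Projects features from a perception module (e.g. a vision encoder)
        # into the same embedding space as text tokens.
        self.image_proj = nn.Linear(image_feat_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)  # used as a causal decoder via masking
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, token_ids, image_feats):
        # token_ids: (batch, text_len); image_feats: (batch, img_len, image_feat_dim)
        text_emb = self.token_emb(token_ids)
        img_emb = self.image_proj(image_feats)
        # Image embeddings are simply prepended here; in general they are interleaved
        # with text at the positions where images occur in the corpus.
        seq = torch.cat([img_emb, text_emb], dim=1)
        # Causal mask: each position may only attend to earlier positions.
        n = seq.size(1)
        causal_mask = torch.triu(torch.full((n, n), float("-inf")), diagonal=1)
        hidden = self.backbone(seq, mask=causal_mask)
        return self.lm_head(hidden)  # next-token logits for autoregressive generation


model = MultimodalCausalLM()
logits = model(torch.randint(0, 32000, (1, 16)), torch.randn(1, 4, 1024))
print(logits.shape)  # torch.Size([1, 20, 32000])
```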
In their empirical study, the team trained KOSMOS-1 on web-scale multimodal corpora and conducted experiments on a wide range of language and multimodal tasks and the Raven IQ test. KOSMOS-1 achieved impressive performance on all tasks, demonstrating its strong multimodal perception and nonverbal reasoning abilities.
With KOSMOS-1, the researchers introduce an MLLM that demonstrates promising new capabilities and opportunities. In the future, they plan to equip KOSMOS-1 with speech capabilities and to scale up its model size.
The paper Language Is Not All You Need: Aligning Perception with Language Models is on arXiv.
Author: Hecate He | Editor: Michael Sarazen