Autoregressive vision-language models, such as Flamingo, Kosmos-1 and multimodal GPT-4, show great potential for completing a wide variety of vision-language tasks and exhibit strong generalization capability. These powerful models, however, are closed-source, which limits academic research on autoregressive vision-language models.
In a new paper, OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models, a research team from the University of Washington, Stanford University, the Allen Institute for AI, LAION, the University of California Santa Barbara, Hebrew University, Columbia University, Google DeepMind and the Juelich Supercomputing Center releases OpenFlamingo, an open-source replication of DeepMind's Flamingo models for training autoregressive vision-language models.
OpenFlamingo is a multimodal language model for addressing a wide range of vision-language tasks. The team chose to replicate DeepMind’s Flamingo due to its strong in-context learning abilities.
Specifically, given an interleaved sequence of image-text pairs, OpenFlamingo predicts the next text token conditioned on all previous text and the last preceding image. The team attaches dense cross-attention modules to the layers of a frozen, autoregressive language model, which enables the text tokens to attend to their corresponding images. They also extract patch features generated by a frozen vision encoder and use a trainable Perceiver resampler to embed the images as a fixed number of visual tokens.
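The mechanics described above can be illustrated with a minimal numpy sketch. This is not the OpenFlamingo implementation; it is a toy single-head version under assumed dimensions, showing (1) a Perceiver-style resampler compressing patch features into a fixed set of visual tokens via cross-attention, and (2) a gated cross-attention step whose tanh gate is initialized at zero, so the frozen language model's behavior is unchanged at the start of training.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values, Wq, Wk, Wv):
    """Single-head scaled dot-product cross-attention (toy version)."""
    Q = queries @ Wq           # (n_q, d)
    K = keys_values @ Wk       # (n_kv, d)
    V = keys_values @ Wv       # (n_kv, d)
    d = Q.shape[-1]
    attn = softmax(Q @ K.T / np.sqrt(d))
    return attn @ V            # (n_q, d)

rng = np.random.default_rng(0)
d = 16  # hypothetical hidden size for illustration

# Patch features from the frozen vision encoder (e.g. a 7x7 grid = 49 patches).
patches = rng.standard_normal((49, d))

# Perceiver resampler: a small set of trainable latents queries the patches,
# compressing them into a fixed number of visual tokens.
latents = rng.standard_normal((8, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
img_tokens = cross_attention(latents, patches, Wq, Wk, Wv)  # (8, d)

# Gated cross-attention: text hidden states attend to the visual tokens.
# The trainable gate starts at tanh(0) = 0, so at initialization the
# frozen language model's hidden states pass through unmodified.
text_hidden = rng.standard_normal((5, d))
gate = np.tanh(0.0)
out = text_hidden + gate * cross_attention(text_hidden, img_tokens, Wq, Wk, Wv)
assert np.allclose(out, text_hidden)  # zero gate preserves LM behavior
```

As the gate parameter is learned during training, the model gradually mixes in visual information without destabilizing the pretrained language model, which is the key design choice this architecture inherits from Flamingo.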
Unlike Flamingo, which is trained on the closed-source ALIGN and M3W datasets, the team trains on open data: 1) LAION-2B, an open-source web-scraped dataset of 2B image-text pairs; and 2) Multimodal C4, an open-source dataset of 101M interleaved samples. They also use ChatGPT-generated synthetic data to scale up the training dataset.
Finally, the team evaluated OpenFlamingo on seven vision-language datasets spanning tasks such as captioning, visual question answering and rank classification. The results show that OpenFlamingo models on average achieve between 80 and 89% of the corresponding Flamingo performance, demonstrating the approach's effectiveness.
Overall, the work provides open access to large autoregressive vision-language models, and the team hopes it will encourage further research in this area.
Author: Hecate He | Editor: Chain Zhang