Task-specific models dominate today’s AI landscape, but a truly intelligent AI would be able to solve diverse real-world problems with minimal human involvement by learning joint, fundamental representations that support a broad range of downstream tasks. A step toward this goal is the creation of “foundation models” such as Google’s BERT and OpenAI’s GPT — an approach that in recent years has achieved outstanding performance and generalization abilities in natural language processing.
While large-scale vision pretraining methods such as CLIP, ALIGN and Wu Dao can learn directly from web-scale image-text pairs and deliver strong transfer learning and zero-shot performance, their application is largely restricted to image-to-text mapping tasks.
In the paper Florence: A New Foundation Model for Computer Vision, a Microsoft research team proposes Florence, a novel foundation model for computer vision (CV) that significantly outperforms previous large-scale pretraining approaches and achieves new SOTA results across a wide range of visual and visual-linguistic benchmarks.
The Microsoft team begins their exploration with the question: “What is the foundation model for computer vision?” They identify three orthogonal axes in the related problem space: 1) Space: from coarse (e.g. scene-level classification) to fine-grained (e.g. object detection), 2) Time: from static (e.g. images) to dynamic (e.g. videos), and 3) Modality: from RGB only to multiple senses (e.g. captioning and depth). They adapt the term “foundation model” for CV, defining it here as a large-scale pretrained model and its adapters for solving all vision tasks in this Space-Time-Modality space, with transferability mechanisms such as zero-/few-shot learning and full fine-tuning.
The researchers then present an emerging paradigm for building a vision foundation model they call Florence, in tribute to the birthplace of the Renaissance. Florence is trained on noisy web-scale data end-to-end with a unifying objective and achieves strong generalization ability and SOTA performance across a wide range of vision benchmarks.
The Florence ecosystem includes data curation, model pretraining, task adaptations and training infrastructure. Leveraging the huge amount of image data on the Internet, the researchers curate a new noisy free-form web-crawled dataset with 900 million image-text pairs. For model pretraining, they use a two-tower architecture comprising an image encoder and a language encoder. Tackling the vital challenge of task adaptation, they extend the learned feature representations along space, time, and modality, enabling Florence’s effective real-world adaptation via few-shot and zero-shot transfer learning. For training infrastructure, they apply techniques such as ZeRO, activation checkpointing, mixed-precision training and gradient cache to significantly reduce memory consumption.
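To make the two-tower pretraining setup concrete, here is a minimal sketch of a CLIP-style symmetric contrastive objective over a batch of matched image-text pairs. This is an illustration of the general two-tower approach, not Florence’s exact unified objective, and the encoder outputs are stood in for by random tensors:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb: torch.Tensor,
                     text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss: the i-th image and i-th text form a positive pair;
    all other pairings in the batch serve as negatives."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature  # (B, B) similarity matrix
    targets = torch.arange(logits.size(0))           # diagonal entries are positives
    loss_i2t = F.cross_entropy(logits, targets)      # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image direction
    return (loss_i2t + loss_t2i) / 2

# Toy usage: random embeddings standing in for the two encoder towers.
batch, dim = 8, 512
loss = contrastive_loss(torch.randn(batch, dim), torch.randn(batch, dim))
print(loss.item())
```

The two towers can be trained end-to-end on noisy web pairs with this kind of loss, since it needs only weak image-text alignment rather than clean labels.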
In the team’s empirical evaluations, Florence achieved new state-of-the-art results on a majority of 44 representative benchmarks, reaching zero-shot classification top-1 accuracy of 83.74 percent and top-5 accuracy of 97.18 percent on ImageNet-1K, 62.4 percent mAP on COCO fine-tuning, 80.36 percent on VQA, and 87.8 percent on Kinetics-600.
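The zero-shot classification numbers above come from using the pretrained towers without any task-specific training: each class name is embedded with the text encoder, and the predicted class is the one whose text embedding is most similar to the image embedding. A minimal sketch, assuming precomputed (hypothetical) embeddings from both towers:

```python
import torch
import torch.nn.functional as F

def zero_shot_predict(image_emb: torch.Tensor,
                      class_text_embs: torch.Tensor,
                      k: int = 5) -> torch.Tensor:
    """Return the top-k class indices by cosine similarity.

    `image_emb` is (D,); `class_text_embs` is (C, D), one row per class-name
    prompt embedded by the text tower. Both are illustrative stand-ins here.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    class_text_embs = F.normalize(class_text_embs, dim=-1)
    sims = class_text_embs @ image_emb  # (C,) cosine similarity per class
    return sims.topk(k).indices

# Toy usage: 10 classes, 512-dim embeddings; class 3 is made to match exactly.
torch.manual_seed(0)
image_emb = torch.randn(512)
class_embs = torch.randn(10, 512)
class_embs[3] = image_emb.clone()  # perfect match, so class 3 should rank first
top5 = zero_shot_predict(image_emb, class_embs, k=5)
print(top5.tolist())
```

Top-1 and top-5 accuracy are then just the fraction of images whose true class appears first, or anywhere, in this ranked list.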
The study shows that Florence successfully extends to different tasks along space, time, and modality with impressive transferability, and achieves SOTA results on vision benchmarks. The team believes Florence can stimulate the creation of vision foundation models capable of powering millions of real-world vision tasks, and hopes their study will motivate further research along this path.
The paper Florence: A New Foundation Model for Computer Vision is on arXiv.
Author: Hecate He | Editor: Michael Sarazen