Deep neural networks have revolutionized the field of computer vision, achieving unprecedented performance across a wide range of tasks. However, producing high-dimensional structured outputs for vision tasks such as image segmentation, monocular depth estimation, and object detection still requires handcrafted network architectures and training procedures tailored to each specific task. These are time-consuming processes that can also demand expert knowledge of the task at hand.
A Google Brain research team challenges this “fragmented” vision modelling paradigm in their new paper UViM: A Unified Modeling Approach for Vision with Learned Guiding Codes, proposing UViM (Unified Vision Model), a novel approach that leverages language modelling and discrete representation learning to enable the modelling of diverse computer vision tasks without any task-specific modifications.
In the field of natural language processing (NLP), autoregressive sequence models parameterized by transformer architectures have emerged as a prominent unified model that enjoys advantages such as theoretical soundness, expressiveness, and robustness. This motivated the Google researchers to design a similar general solution for computer vision.
The proposed UViM is a unified computer vision model that combines a standard feedforward base model with an autoregressive language model. It can handle vision tasks with extremely high-dimensional, structured outputs at much lower computational cost.
The UViM optimization procedure comprises two training stages: learning with a guiding code and learning to model the guiding code. In the first stage, a restricted oracle model produces a short discrete sequence (guiding code) to help the base model solve complex vision tasks and reduce the cost of high-dimensional structured prediction. In the second stage, the team trains a language model to output a guiding code by learning to “mimic” the oracle using only the image input. The resulting UViM is thus equipped to model highly structured outputs for diverse vision tasks.
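The two-stage procedure above can be sketched in miniature. The snippet below is a toy illustration, not the authors' implementation: `oracle_encode`, `base_model`, `train_lm`, and `uvim_infer` are hypothetical stand-ins (the oracle here is a simple quantizer, and the "language model" merely memorizes image-to-code pairs rather than being a trained autoregressive transformer). It only shows the control flow: the oracle produces a short discrete guiding code from the label in stage one, and a second model learns to supply that code from the image alone, so the oracle can be discarded at inference time.

```python
import numpy as np

rng = np.random.default_rng(0)

CODE_LEN, VOCAB = 4, 16  # a short discrete guiding code

def oracle_encode(label):
    """Stage 1 oracle (toy): compress the dense label into a short
    discrete code by quantizing coarse block means into VOCAB bins."""
    blocks = label.reshape(CODE_LEN, -1).mean(axis=1)
    return np.clip((blocks * VOCAB).astype(int), 0, VOCAB - 1)

def base_model(image, code):
    """Stage 1 base model (toy): predict the structured output given
    the image plus the guiding code. Here: broadcast the dequantized
    code over the output grid."""
    dec = (code + 0.5) / VOCAB
    return np.repeat(dec, image.size // CODE_LEN)

def train_lm(images, codes):
    """Stage 2 (toy): learn to 'mimic' the oracle from images alone.
    A real UViM trains an autoregressive language model; this stand-in
    just memorizes image -> code pairs."""
    return {img.tobytes(): code for img, code in zip(images, codes)}

def uvim_infer(lm, image):
    """At test time the language model supplies the guiding code;
    the oracle (which needed the label) is no longer used."""
    code = lm[image.tobytes()]
    return base_model(image, code)

# Toy data: the 'image' is noise, the 'label' is a dense structured output.
image = rng.random(16)
label = rng.random(16)

code = oracle_encode(label)            # stage 1: oracle emits guiding code
pred_with_oracle = base_model(image, code)

lm = train_lm([image], [code])         # stage 2: mimic the oracle
pred = uvim_infer(lm, image)           # inference without the label
assert np.allclose(pred, pred_with_oracle)
```

The key design point the sketch preserves is that the guiding code is short and discrete, so the stage-two model faces a tractable sequence-prediction problem instead of direct high-dimensional structured prediction.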
In their empirical study, the team applied UViM to three diverse vision tasks: panoptic segmentation (general scene understanding), image colorization (conditional generative modelling), and monocular depth prediction (3D scene understanding).
In the evaluations, the proposed UViM achieved results competitive with the state-of-the-art on all three tasks, confirming its ability to handle diverse vision tasks in a unified manner.
The team regards UViM as a “brave new prototype” for a general-purpose unified computer vision model and hopes their paper will motivate future research on the generation of better guiding codes and the design of more efficient training procedures.
The paper UViM: A Unified Modeling Approach for Vision with Learned Guiding Codes is on arXiv.
Author: Hecate He | Editor: Michael Sarazen