ChatGPT’s humanlike conversational competency has taken AI to new levels — might the generative large language model’s next evolutionary leap equip it with similar skills in the visual medium?
In the new paper Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models, a Microsoft Research Asia team presents Visual ChatGPT, a system that incorporates various visual foundation models (VFMs) to enable ChatGPT to understand, generate and edit visual information and tackle complex visual tasks.
The team summarizes their main contributions as follows:
- We propose Visual ChatGPT, which opens the door to combining ChatGPT and Visual Foundation Models and enables ChatGPT to handle complex visual tasks.
- We design a Prompt Manager, which integrates 22 different VFMs and defines the internal correlations among them to enable better interaction and combination.
- We conduct extensive zero-shot experiments and present abundant cases to verify the understanding and generation abilities of Visual ChatGPT.
Visual ChatGPT is built on ChatGPT and incorporates a variety of specialized VFMs (22 in total, curated from Hugging Face Transformers, MaskFormer, and ControlNet) to help it understand visual information and provide corresponding answers to user inputs. To create an efficient connection between ChatGPT and the VFMs, the team crafts a series of prompts to “inject” visual information into ChatGPT. A novel Prompt Manager specifies the capability of each VFM and its input-output formats; converts different visual information to a language format; and deals with the histories, priorities, and conflicts of different VFMs. Visual ChatGPT uses the feedback from this VFM ensemble to iteratively build its visual understanding and generation capabilities.
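The Prompt Manager's role can be sketched as a registry that describes each VFM in natural language and renders those descriptions into a system prompt the LLM can reason over. This is a minimal illustrative sketch, not the paper's actual implementation; the tool names, capability strings, and I/O formats below are assumptions for demonstration.

```python
from dataclasses import dataclass

@dataclass
class VFMTool:
    """A natural-language description of one visual foundation model."""
    name: str        # e.g. "Depth Estimation" (hypothetical label)
    capability: str  # what the tool does, injected into the prompt
    inputs: str      # expected input format, e.g. "image_path"
    outputs: str     # produced output format, e.g. "image_path"

class PromptManager:
    """Collects VFM descriptions and renders them into a system prompt
    so the LLM can decide which tool to call and in what order."""
    def __init__(self):
        self.tools: list[VFMTool] = []

    def register(self, tool: VFMTool) -> None:
        self.tools.append(tool)

    def system_prompt(self) -> str:
        lines = ["You can use the following visual tools:"]
        for t in self.tools:
            lines.append(f"- {t.name}: {t.capability} "
                         f"(input: {t.inputs}, output: {t.outputs})")
        return "\n".join(lines)

pm = PromptManager()
pm.register(VFMTool("Depth Estimation",
                    "predicts a depth map for an image",
                    "image_path", "image_path"))
pm.register(VFMTool("Depth-to-Image",
                    "generates an image conditioned on a depth map and a text prompt",
                    "image_path, text", "image_path"))
print(pm.system_prompt())
```

In the real system the manager would also track conversation history and resolve conflicts between tools with overlapping capabilities, which this sketch omits.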
The paper provides the following example to summarize Visual ChatGPT’s overall procedure. Given an image of a yellow flower and the prompt “Please generate a red flower conditioned on the predicted depth of this image and then make it like a cartoon, step by step,” Visual ChatGPT first uses ChatGPT’s capability to understand the given question and applies its depth estimation VFM to detect the image’s depth information. It then calls on a depth-to-image VFM to generate corresponding images that carry the depth information. Finally, it leverages a style transfer VFM (based on the Stable Diffusion model) to generate the final outputs in a cartoon style.
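The three-step chain in this example can be sketched as a simple sequence of tool calls. The function names and the string-based stand-ins for images below are placeholders for real VFM invocations, not the paper's actual APIs:

```python
def depth_estimation(image: str) -> str:
    # Stand-in for a depth-estimation VFM that predicts a depth map.
    return f"depth({image})"

def depth_to_image(depth_map: str, prompt: str) -> str:
    # Stand-in for a depth-conditioned image-generation VFM.
    return f"gen[{prompt} | {depth_map}]"

def style_transfer(image: str, style: str) -> str:
    # Stand-in for a Stable Diffusion-based style-transfer VFM.
    return f"styled({image}, {style})"

def run_request(image: str) -> str:
    # Step 1: detect the input image's depth information.
    depth = depth_estimation(image)
    # Step 2: generate a red flower conditioned on the depth map.
    red_flower = depth_to_image(depth, "a red flower")
    # Step 3: render the result in a cartoon style.
    return style_transfer(red_flower, "cartoon")

print(run_request("yellow_flower.png"))
# → styled(gen[a red flower | depth(yellow_flower.png)], cartoon)
```

The key design point is that ChatGPT, guided by the Prompt Manager's tool descriptions, decides this call order itself rather than following a hard-coded pipeline.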
In their empirical study, the team used zero-shot experiments and user cases to test Visual ChatGPT’s visual information understanding and generation ability. In the evaluations, Visual ChatGPT aptly demonstrated its ability to solve complex visual questions, advancing the exciting possibility of combining LLMs with VFMs to develop visual processing capabilities.
Author: Hecate He | Editor: Michael Sarazen