Microsoft’s Visual ChatGPT Enables Image Understanding and Generation

ChatGPT’s humanlike conversational competency has taken AI to new levels. Might the generative large language model’s next evolutionary leap equip it with similar skills in the visual medium?

In the new paper Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models, a Microsoft Research Asia team presents Visual ChatGPT, a system that incorporates various visual foundation models (VFMs) to enable ChatGPT to understand, generate and edit visual information and tackle complex visual tasks.

The team summarizes their main contributions as follows:

  1. We propose Visual ChatGPT, which opens the door to combining ChatGPT and Visual Foundation Models and enables ChatGPT to handle complex visual tasks.
  2. We design a Prompt Manager, in which we involve 22 different VFMs and define the internal correlation among them for better interaction and combination.
  3. Massive zero-shot experiments are conducted and abundant cases are shown to verify the understanding and generation ability of Visual ChatGPT.

Visual ChatGPT is built on ChatGPT and incorporates a variety of specialized VFMs (22 in total, curated from Hugging Face Transformers, MaskFormer, and ControlNet) to help it understand visual information and provide corresponding answers to user inputs. To create an efficient connection between ChatGPT and the VFMs, the team crafts a series of prompts to “inject” visual information into ChatGPT. A novel Prompt Manager specifies the capability of each VFM and its input-output formats; converts different visual information to a language format; and manages the histories, priorities, and conflicts of different VFMs. Visual ChatGPT uses the feedback from this VFM ensemble to iteratively build its visual understanding and generation capabilities.
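To make the Prompt Manager idea more concrete, the following is a minimal sketch of how tool capabilities and input-output formats might be rendered as text for a language model. It is not the paper's implementation: the names ToolSpec, build_system_prompt and describe_image, the tool names, and the prompt wording are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolSpec:
    """Hypothetical description of one visual foundation model (VFM)."""
    name: str
    usage: str
    inputs: str
    outputs: str


def build_system_prompt(tools):
    """Render each tool's capability and I/O format as plain text."""
    lines = ["You can call the following visual tools:"]
    for tool in tools:
        lines.append(
            f"- {tool.name}: {tool.usage} "
            f"Input: {tool.inputs}. Output: {tool.outputs}."
        )
    return "\n".join(lines)


def describe_image(path, caption):
    """Turn visual information into language: a file name plus a caption."""
    return f"Image {path} shows: {caption}"


# Illustrative tool list (assumed names, not the paper's identifiers).
TOOLS = [
    ToolSpec("depth_estimation", "Predict a depth map for an image.",
             "image path", "image path"),
    ToolSpec("depth_to_image", "Generate an image from a depth map and a text.",
             "image path, text", "image path"),
]

prompt = build_system_prompt(TOOLS)
observation = describe_image("flower.png", "a yellow flower")
```

In this sketch the language model never receives pixels; it only sees tool descriptions and text observations, which is one plausible reading of converting visual information to a language format.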

The paper provides the following example to summarize Visual ChatGPT’s overall procedure. Given an image of a yellow flower and the prompt “Please generate a red flower conditioned on the predicted depth of this image and then make it like a cartoon, step by step,” Visual ChatGPT first uses ChatGPT’s capability to understand the given question and applies its depth estimation VFM to detect the image’s depth information. It then calls on a depth-to-image VFM to generate corresponding images that carry the depth information. Finally, it leverages a style transfer VFM (based on the Stable Diffusion model) to generate the final outputs in a cartoon style.
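The three-step procedure above can be sketched as a simple chain of tool calls. This is an illustrative toy under stated assumptions, not the system's code: the three functions are stand-ins that only derive new file names instead of running real models, and the plan is hard-coded here, whereas in Visual ChatGPT the language model decides which VFM to call at each step.

```python
def depth_estimation(image_path):
    """Stand-in for a depth estimation VFM; returns the depth map's path."""
    return image_path.replace(".png", "_depth.png")


def depth_to_image(depth_path, text):
    """Stand-in for a depth-to-image VFM conditioned on a text description."""
    return depth_path.replace(".png", "_generated.png")


def style_transfer(image_path, style):
    """Stand-in for a style transfer VFM."""
    return image_path.replace(".png", f"_{style}.png")


# Hypothetical tool registry and a fixed plan mirroring the flower example.
TOOLS = {
    "depth_estimation": depth_estimation,
    "depth_to_image": depth_to_image,
    "style_transfer": style_transfer,
}

PLAN = [
    ("depth_estimation", {}),
    ("depth_to_image", {"text": "a red flower"}),
    ("style_transfer", {"style": "cartoon"}),
]


def run_plan(image_path, plan):
    """Feed each tool's output into the next tool and record every step."""
    current = image_path
    history = []
    for name, kwargs in plan:
        current = TOOLS[name](current, **kwargs)
        history.append((name, current))
    return current, history


final_image, history = run_plan("flower.png", PLAN)
```

Keeping the intermediate results in a history list reflects the idea that each step's output becomes context for the next decision.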

In their empirical study, the team ran zero-shot experiments and presented a range of example cases to test Visual ChatGPT’s ability to understand and generate visual information. In these evaluations, Visual ChatGPT handled complex, multi-step visual requests, supporting the case for combining large language models (LLMs) with VFMs to develop visual processing capabilities.

The Visual ChatGPT system is available on the project’s GitHub. The paper Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models is on arXiv.


Author: Hecate He | Editor: Michael Sarazen

