ChatGPT’s humanlike conversational competency has taken AI to new levels — might the generative large language model’s next evolutionary leap equip it with similar skills in the visual medium?
In the new paper Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models, a Microsoft Research Asia team presents Visual ChatGPT, a system that incorporates various visual foundation models (VFMs) to enable ChatGPT to understand, generate and edit visual information and tackle complex visual tasks.
The team summarizes their main contributions as follows:
- We propose Visual ChatGPT, which opens the door to combining ChatGPT and Visual Foundation Models and enables ChatGPT to handle complex visual tasks.
- We design a Prompt Manager, which integrates 22 different VFMs and defines the internal correlations among them to enable better interaction and combination.
- We conduct extensive zero-shot experiments and present abundant cases to verify the understanding and generation abilities of Visual ChatGPT.
Visual ChatGPT is built on ChatGPT and incorporates a variety of specialized VFMs (22 in total, curated from Hugging Face Transformers, MaskFormer, and ControlNet) to help it understand visual information and provide corresponding answers to user inputs. To create an efficient connection between ChatGPT and the VFMs, the team crafts a series of prompts to “inject” visual information into ChatGPT. A novel Prompt Manager specifies the capability of each VFM and its input-output formats; converts different visual information to a language format; and deals with the histories, priorities, and conflicts of different VFMs. Visual ChatGPT uses the feedback from this VFM ensemble to iteratively build its visual understanding and generation capabilities.
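The Prompt Manager's role can be sketched as a registry that describes each VFM in natural language and renders those descriptions into a system prompt the LLM can reason over. This is a minimal illustrative sketch, not the paper's actual implementation; the tool names, capability strings, and I/O formats below are assumptions for demonstration.

```python
from dataclasses import dataclass

@dataclass
class VFMTool:
    """A natural-language description of one visual foundation model."""
    name: str        # e.g. "Depth Estimation" (hypothetical label)
    capability: str  # what the tool does, injected into the prompt
    inputs: str      # expected input format, e.g. "image_path"
    outputs: str     # produced output format, e.g. "image_path"

class PromptManager:
    """Collects VFM descriptions and renders them into a system prompt
    so the LLM can decide which tool to call and in what order."""
    def __init__(self):
        self.tools: list[VFMTool] = []

    def register(self, tool: VFMTool) -> None:
        self.tools.append(tool)

    def system_prompt(self) -> str:
        lines = ["You can use the following visual tools:"]
        for t in self.tools:
            lines.append(f"- {t.name}: {t.capability} "
                         f"(input: {t.inputs}, output: {t.outputs})")
        return "\n".join(lines)

pm = PromptManager()
pm.register(VFMTool("Depth Estimation",
                    "predicts a depth map for an image",
                    "image_path", "image_path"))
pm.register(VFMTool("Depth-to-Image",
                    "generates an image conditioned on a depth map and a text prompt",
                    "image_path, text", "image_path"))
print(pm.system_prompt())
```

In the real system the manager would also track conversation history and resolve conflicts between tools with overlapping capabilities, which this sketch omits.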
The paper provides the following example to summarize Visual ChatGPT’s overall procedure. Given an image of a yellow flower and the prompt “Please generate a red flower conditioned on the predicted depth of this image and then make it like a cartoon, step by step,” Visual ChatGPT first uses ChatGPT’s capability to understand the given question and applies its depth estimation VFM to detect the image’s depth information. It then calls on a depth-to-image VFM to generate corresponding images that carry the depth information. Finally, it leverages a style transfer VFM (based on the Stable Diffusion model) to generate the final outputs in a cartoon style.
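The three-step chain in this example can be sketched as a simple sequence of tool calls. The function names and the string-based stand-ins for images below are placeholders for real VFM invocations, not the paper's actual APIs:

```python
def depth_estimation(image: str) -> str:
    # Stand-in for a depth-estimation VFM that predicts a depth map.
    return f"depth({image})"

def depth_to_image(depth_map: str, prompt: str) -> str:
    # Stand-in for a depth-conditioned image-generation VFM.
    return f"gen[{prompt} | {depth_map}]"

def style_transfer(image: str, style: str) -> str:
    # Stand-in for a Stable Diffusion-based style-transfer VFM.
    return f"styled({image}, {style})"

def run_request(image: str) -> str:
    # Step 1: detect the input image's depth information.
    depth = depth_estimation(image)
    # Step 2: generate a red flower conditioned on the depth map.
    red_flower = depth_to_image(depth, "a red flower")
    # Step 3: render the result in a cartoon style.
    return style_transfer(red_flower, "cartoon")

print(run_request("yellow_flower.png"))
# → styled(gen[a red flower | depth(yellow_flower.png)], cartoon)
```

The key design point is that ChatGPT, guided by the Prompt Manager's tool descriptions, decides this call order itself rather than following a hard-coded pipeline.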
In their empirical study, the team used zero-shot experiments and user cases to test Visual ChatGPT’s visual information understanding and generation ability. In the evaluations, Visual ChatGPT aptly demonstrated its ability to solve complex visual questions, advancing the exciting possibility of combining LLMs with VFMs to develop visual processing capabilities.
Author: Hecate He | Editor: Michael Sarazen