Computer vision models have made remarkable progress in recent years on fundamental tasks such as object recognition and depth estimation, but they still struggle with visual queries that require both visual processing and reasoning. While end-to-end models remain the typical approach in this research area, their limited interpretability and generalization leave them ill-suited to such complex visual queries.
In the new paper ViperGPT: Visual Inference via Python Execution for Reasoning, a Columbia University research team presents ViperGPT, a framework for solving complex visual queries by integrating code-generation models into vision via a Python interpreter. The proposed approach requires no additional training and achieves state-of-the-art results.
The team summarizes their main contributions as follows:
- We propose a simple framework for solving complex visual queries by integrating code-generation models into vision with an API and the Python interpreter.
- We achieve state-of-the-art zero-shot results across tasks in visual grounding, image question answering, and video question answering, showing that this interpretability aids performance rather than hindering it.
- To promote research in this direction, we develop a Python library enabling rapid development of programs for visual tasks, which will be open-sourced upon publication.
Given a visual input and a textual query describing its contents, ViperGPT first synthesizes an appropriate program with a program generator, then uses its execution engine to execute the program and produce a corresponding result for the input.
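This generate-then-execute design can be sketched as follows. Everything below is illustrative rather than the authors' implementation: `generate_program` is stubbed with a canned string (in ViperGPT it is a code LLM prompted with an API specification and the query), and `ImagePatch.find` is a toy stand-in for a real detection primitive.

```python
# Hypothetical sketch of ViperGPT's two-stage pipeline: a program generator
# writes a short Python program for the query, and an execution engine runs
# that program on the visual input.

def generate_program(query: str) -> str:
    # Stub: a real system would prompt a code LLM with API documentation
    # plus the query, and return the sampled code as a string.
    return (
        "def execute_command(image):\n"
        "    patches = image.find('muffin')\n"
        "    return len(patches)\n"
    )

class ImagePatch:
    """Toy stand-in for a vision API: find() returns matching detections."""
    def __init__(self, detections):
        self._detections = detections

    def find(self, name):
        return [d for d in self._detections if d == name]

def execute(program: str, image) -> object:
    # The execution engine: run the generated source, then call the
    # function it defines on the input.
    namespace = {}
    exec(program, namespace)
    return namespace["execute_command"](image)

image = ImagePatch(["muffin", "muffin", "kid"])
result = execute(generate_program("How many muffins are there?"), image)
print(result)  # 2
```

Keeping generation and execution separate is what makes the system interpretable: the intermediate program can be read and audited before it is run.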
The team uses large language models (LLMs) to instantiate the program generator and integrate their system’s vision and language modules. LLMs take the input as a tokenized code sequence and autoregressively predict subsequent tokens, effectively eliminating the need for task-specific training for program generation.
At execution time, ViperGPT’s generated programs flexibly support both image and video as inputs and output results corresponding to the query provided to the LLM. The team also employs a Python interpreter to enable logical operations for program execution and expand compatibility with various existing Python tools.
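A minimal sketch of why a full Python interpreter helps: the generated program below (a hypothetical example, not taken from the paper) mixes a vision primitive with ordinary control flow and arithmetic, and the same `execute_command` interface handles a video supplied as a list of frames.

```python
# Hypothetical generated program for a video query such as
# "Is a dog visible for most of the video?" The `exists` primitive and
# Frame class are toy stand-ins, not the authors' API.

generated_program = """
def execute_command(video):
    # Count frames containing a 'dog', then answer with a comparison --
    # logic supplied by the Python interpreter, not by a vision model.
    hits = sum(1 for frame in video if frame.exists('dog'))
    return 'yes' if hits > len(video) / 2 else 'no'
"""

class Frame:
    """Toy frame whose exists() checks a preset label set."""
    def __init__(self, labels):
        self._labels = set(labels)

    def exists(self, name):
        return name in self._labels

namespace = {}
exec(generated_program, namespace)
video = [Frame(["dog"]), Frame(["dog"]), Frame(["cat"])]
answer = namespace["execute_command"](video)
print(answer)  # yes
```

Because execution happens in an ordinary Python runtime, generated programs can also call out to any existing Python tooling alongside the vision primitives.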
In their empirical study, the team applied ViperGPT to tasks that included visual grounding, compositional image question answering, external knowledge-dependent image question answering, and video causal and temporal reasoning. ViperGPT achieved state-of-the-art performance on these complex visual tasks without any additional training, validating that its programmatic composition of specialized functions effectively connects vision and language in a single system that is interpretable, logical, flexible, and adaptable.
The paper ViperGPT: Visual Inference via Python Execution for Reasoning is on arXiv.
Author: Hecate He | Editor: Michael Sarazen