Computer vision models have made remarkable progress in recent years on fundamental tasks like object recognition and depth estimation, but they still struggle with queries that require both visual processing and reasoning. End-to-end models remain the typical approach in this research area, yet their limited interpretability and generalization leave them ill-equipped for complex visual queries.
In the new paper ViperGPT: Visual Inference via Python Execution for Reasoning, a Columbia University research team presents ViperGPT, a framework for solving complex visual queries by integrating code-generation models into vision via a Python interpreter. The proposed approach requires no additional training and achieves state-of-the-art results.

The team summarizes their main contributions as follows:
- We propose a simple framework for solving complex visual queries by integrating code-generation models into vision with an API and the Python interpreter.
- We achieve state-of-the-art zero-shot results across tasks in visual grounding, image question answering, and video question-answering, showing this interpretability aids performance rather than hindering it.
- To promote research in this direction, we develop a Python library enabling rapid development for program synthesis for visual tasks, which will be open-sourced upon publication.

Given a visual input and a textual query describing its contents, ViperGPT first synthesizes an appropriate program with a program generator, then uses its execution engine to execute the program and produce a corresponding result for the input.
The team uses large language models (LLMs) to instantiate the program generator and integrate their system’s vision and language modules. LLMs take the input as a tokenized code sequence and autoregressively predict subsequent tokens, effectively eliminating the need for task-specific training for program generation.
At execution time, ViperGPT’s generated programs flexibly support both image and video as inputs and output results corresponding to the query provided to the LLM. The team also employs a Python interpreter to enable logical operations for program execution and expand compatibility with various existing Python tools.
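The generate-then-execute flow described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: `ToyImage`, `toy_program_generator`, `run_query` and the `execute_command` entry point are hypothetical names, the "image" is just a list of labels, and the generator is a hard-coded stub standing in for an LLM. It is not the paper's library or API.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ToyImage:
    """Hypothetical stand-in for an image: a plain list of object labels."""
    objects: List[str]

    def find(self, name: str) -> List[str]:
        # A real system would call an object detector here (assumption).
        return [obj for obj in self.objects if obj == name]


def toy_program_generator(query: str) -> str:
    """Stub for the LLM program generator (hypothetical).

    A real generator would prompt a code model with an API description
    and the query. This stub only handles counting queries whose last
    word names the object to count.
    """
    target = query.split()[-1].rstrip("?")
    return (
        "def execute_command(image):\n"
        f"    return len(image.find({target!r}))\n"
    )


def run_query(
    image: ToyImage,
    query: str,
    generator: Callable[[str], str] = toy_program_generator,
):
    """Synthesize a program for the query, then execute it on the input."""
    source = generator(query)
    namespace: dict = {}
    # Generated code should only ever be executed in a sandboxed environment.
    exec(source, namespace)
    return namespace["execute_command"](image)


if __name__ == "__main__":
    image = ToyImage(objects=["muffin", "kid", "muffin"])
    print(run_query(image, "How many objects match muffin?"))
```

Because the intermediate result is ordinary Python source, it can be printed and inspected before it runs, which is the kind of interpretability the article attributes to the approach.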

In their empirical study, the team applied ViperGPT to visual grounding, compositional image question answering, external knowledge-dependent image question answering, and video causal and temporal reasoning. ViperGPT achieved state-of-the-art performance on these complex visual tasks without any additional training. The team takes this as validation that programmatically composing specialized functions effectively connects vision and language in a single system that is interpretable, logical, flexible and adaptable.
The paper ViperGPT: Visual Inference via Python Execution for Reasoning is on arXiv.
Author: Hecate He | Editor: Michael Sarazen

