Decoding Code Execution: How DeepMind’s NExT Empowers AI Reasoning

In recent years, there has been a surge in the development of large language models (LLMs) tailored for code-related tasks. These LLMs have shown remarkable proficiency in aiding developers with tasks such as writing, editing, explaining, and reviewing code. However, they often stumble when faced with more intricate software engineering challenges that demand a deeper understanding of a program’s runtime behavior.

Addressing this gap, in a new paper NExT: Teaching Large Language Models to Reason about Code Execution, a Google DeepMind research team proposes Naturalized Execution Tuning (NExT), a method that aims to equip LLMs with the ability to scrutinize program execution traces and deduce runtime behaviors through chain-of-thought (CoT) rationales.

The primary objective of this endeavor is to enhance LLMs’ capability to comprehend program execution when tackling coding tasks. NExT achieves this by teaching LLMs to dissect program execution traces and articulate insights about runtime behavior using natural language (NL).

In essence, for a given coding task, the core concept involves training a model to produce intermediate NL rationales akin to chain-of-thought reasoning. Crucially, the model is supplied with a trace of the program’s execution, enabling more accurate and semantically grounded rationales. Teaching LLMs to reason about program execution in NL not only enhances interpretability but also broadens the spectrum of predicted solutions.
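
For instance, a trace-conditioned repair prompt might be assembled along the following lines. This is a minimal sketch under our own assumptions, not the prompt format used in the paper; the build_repair_prompt helper and the sum_even toy example are hypothetical.

def build_repair_prompt(task: str, buggy_program: str, trace: str) -> str:
    # Combine the task description, the faulty program, and its execution
    # trace into a single prompt that asks for a natural-language rationale
    # followed by a corrected program (chain-of-thought style).
    return (
        f"Task:\n{task}\n\n"
        f"Buggy program:\n{buggy_program}\n\n"
        f"Execution trace of a failing test:\n{trace}\n\n"
        "Explain step by step, in natural language, why the program fails on "
        "this trace, then write the corrected program."
    )

# Toy usage: a parity bug in a sum-of-evens function.
task = "Return the sum of the even numbers in a list."
buggy = "def sum_even(xs):\n    return sum(x for x in xs if x % 2)"
trace = "sum_even([1, 2, 3, 4]) -> 4   (expected 6)"
print(build_repair_prompt(task, buggy, trace))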

To illustrate, when presented with a coding task instruction and a flawed program alongside its execution traces, an LLM employs chain-of-thought reasoning to generate a natural language rationale, leveraging the execution information. Program traces encapsulate valuable debugging insights such as line-by-line variable states and exceptions, aiding LLMs in identifying and rectifying bugs by analyzing expected versus actual execution outcomes. NExT facilitates LLMs’ comprehension of execution traces by representing them as concise inline code comments, seamlessly integrated with the original program structure.
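
As a rough illustration of this trace-as-comments idea, the Python sketch below runs a function under sys.settrace and appends the local variable values observed at each line as inline comments. It is a simplified assumption of how such annotations might be produced, not the representation used in the paper; the trace_as_comments helper and the buggy sum_even function are hypothetical.

import inspect
import sys

def trace_as_comments(func, *args):
    # Run func(*args) and return (result, annotated source), where each source
    # line carries the local variable values observed when that line was
    # reached, rendered as a short inline comment.
    states = {}  # line number -> snapshot of locals at that line

    def tracer(frame, event, arg):
        if event == "line" and frame.f_code is func.__code__:
            states[frame.f_lineno] = dict(frame.f_locals)
        return tracer

    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(None)

    source, first_lineno = inspect.getsourcelines(func)
    annotated = []
    for offset, line in enumerate(source):
        snapshot = states.get(first_lineno + offset)
        comment = ""
        if snapshot:
            comment = "  # " + ", ".join(f"{k}={v!r}" for k, v in snapshot.items())
        annotated.append(line.rstrip("\n") + comment)
    return result, "\n".join(annotated)

def sum_even(xs):
    total = 0
    for x in xs:
        if x % 2:        # bug: should be x % 2 == 0
            total += x
    return total

result, annotated = trace_as_comments(sum_even, [1, 2, 3, 4])
print(annotated)   # each body line now ends with e.g. "# xs=[1, 2, 3, 4], total=1, x=3"
print(result)      # 4 instead of the expected 6

Reading the annotated program, a model (or a human) can compare expected and actual variable values line by line, which is the kind of grounding the trace representation is meant to provide.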

The efficacy of NExT was evaluated with the PaLM 2-L model on two Python program repair tasks. Results show significant gains in PaLM 2’s ability to reason about program execution in natural language, with a 26.1% improvement on Mbpp-R and a 14.3% improvement on HumanEvalFix-Plus. Furthermore, compared to a strong self-training program repair baseline that does not predict NL rationales, NExT achieves comparable accuracy while substantially improving sample diversity.

In summary, this study underscores that training PaLM 2-L with NExT yields high-quality natural language rationales and bolsters success rates in program repair tasks. Looking ahead, the team envisions extending NExT to a broader array of program understanding tasks while enhancing trace representation to encompass a wider range of programming languages.

The paper NExT: Teaching Large Language Models to Reason about Code Execution is on arXiv.


Author: Hecate He | Editor: Chain Zhang


We know you don’t want to miss any news or research breakthroughs. Subscribe to our popular newsletter Synced Global AI Weekly to get weekly AI updates.
