Powerful pretrained large language models (LLMs) have achieved great success in natural language processing and have recently shown an ability to automate code generation from informal natural language prompts. However, the ambiguity of natural language can cause LLMs to struggle to produce code that correctly reflects user intent.
In the new paper Interactive Code Generation via Test-Driven User-Intent Formalization, a team from Microsoft Research, the University of Pennsylvania, and the University of California, San Diego proposes TiCoder (Test-driven Interactive Coder), a workflow for test-driven user-intent formalization (TDUIF) that leverages user feedback to generate code from natural language inputs, producing unit tests consistent with user intent for 90.40 percent of benchmark examples.
The difficulty of defining a precise intent from natural language inputs makes it challenging to even assess the correctness of LLM-generated code. It is also difficult to understand and evaluate code suggestions without running or debugging them. These factors can lead users to either accept buggy code or reject correct code that is too difficult to understand. The team's TDUIF-based approach is designed to address these issues by leveraging user feedback to generate code consistent with the intent expressed in natural language inputs.
The proposed framework first refines and formalizes user intent through generated tests, then generates corresponding code based on these tests. The researchers summarize their high-level workflow as follows:
- The human user prompts the agent to complete a function body, given the preceding file contents, a natural language description, and the function header/signature specifying the method name, parameters and return values.
- The agent repeatedly queries the user (until a stopping criterion is reached) asking if a set of behaviours (or a test) is consistent with the user intent.
- The user responds either YES, NO, or DONTKNOW to each of the queries from the agent.
- Once the interaction terminates, the agent outputs (a) a set of tests that the user has approved, and (b) a ranked list of code suggestions that are consistent with the user responses.
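The interaction loop above can be sketched in Python. This is a hypothetical illustration of the workflow, not the paper's implementation: the candidate tests and code suggestions stand in for LLM generations, and `ask_user` models the YES/NO/DONTKNOW responses.

```python
def run_test(code: str, test: str) -> bool:
    """Return True if the candidate `code` passes the candidate `test`
    (both given as Python source strings)."""
    env = {}
    try:
        exec(code, env)   # define the candidate function
        exec(test, env)   # run the assertion-style test against it
        return True
    except Exception:
        return False

def ticoder_loop(candidate_codes, candidate_tests, ask_user, max_queries=5):
    """Hypothetical sketch of the TDUIF interaction loop: query the user
    about candidate tests and prune code suggestions accordingly."""
    approved_tests = []
    codes = list(candidate_codes)
    for test in candidate_tests[:max_queries]:
        answer = ask_user(test)  # "YES", "NO", or "DONTKNOW"
        if answer == "YES":
            approved_tests.append(test)
            codes = [c for c in codes if run_test(c, test)]
        elif answer == "NO":
            codes = [c for c in codes if not run_test(c, test)]
        # On DONTKNOW, keep all remaining candidates unchanged.
        if len(codes) <= 1:  # stopping criterion: candidates disambiguated
            break
    return approved_tests, codes
```

For example, if one candidate implements absolute value correctly and another is buggy, a single YES answer to the test `assert f(-3) == 3` prunes the buggy candidate, mirroring how each approved test narrows the ranked suggestion list.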
In their empirical study, the team evaluated their TiCoder TDUIF implementation on the MBPP (Mostly Basic Python Problems) academic code generation benchmark dataset. Using the OpenAI Codex LLM on MBPP, TiCoder improved the pass@1 code generation accuracy metric from 48.39 percent to 70.49 percent using only a single user query. TiCoder also demonstrated its ability to generate a non-trivial functional unit test consistent with the user intent within an average of 1.69 user queries for 90.40 percent of the MBPP samples.
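For context on the reported metric: pass@1 is the fraction of problems whose top-ranked generation passes the reference tests. The widely used unbiased pass@k estimator (introduced with Codex, not specific to this paper) can be sketched as:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one of k
    samples, drawn without replacement from n generations of which c are
    correct, passes the reference tests."""
    if n - c < k:
        return 1.0  # fewer incorrect samples than k, so some draw must pass
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With n = 1 sample per problem, pass@1 reduces to simply whether that single suggestion is correct, which is the setting the 48.39 and 70.49 percent figures describe.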
Overall, this work validates the effectiveness of the proposed workflow. The team believes their framework can serve as a scalable solution for code generation and is also flexible enough to adapt to richer forms of formal specifications such as procedure summaries.
The paper Interactive Code Generation via Test-Driven User-Intent Formalization is on arXiv.
Author: Hecate He | Editor: Michael Sarazen