The ability of large language models (LLMs) to generate computer code from natural language (NL) prompts has revolutionized the programming domain. Most contemporary models, however, can only generate code for libraries and function calls they saw during training, and struggle with the new libraries and functions that are constantly being introduced. A human programmer facing such a challenge would typically look up user manuals and other relevant documents to become familiar with the new library or function. Could LLMs be taught to do the same?
In the new paper DocPrompting: Generating Code by Retrieving the Docs, a research team from Carnegie Mellon University and Inspired Cognition presents DocPrompting, a novel NL-to-code generation approach. Given an NL intent that calls for functions or libraries the model has not seen, DocPrompting retrieves the corresponding code documentation and uses it to help the model generate the code.
DocPrompting is inspired by programmers’ use of manuals and documentation when encountering unseen/unused functions or libraries. The approach first learns to retrieve relevant documents from an external documentation pool, then learns to generate code using prompts based on the information it gleaned from the documents.
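The retrieve-then-generate flow described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the token-overlap scoring is a simple stand-in for a learned retriever, the three-entry `DOC_POOL` and the helper names (`tokenize`, `score`, `retrieve`, `build_prompt`) are hypothetical, and the code generator itself is left out, so the sketch stops at the prompt that would be handed to it.

```python
import re
from collections import Counter

# Hypothetical toy documentation pool: doc id -> documentation text.
DOC_POOL = {
    "tar -x": "tar -x extract files from an archive",
    "tar -c": "tar -c create a new archive",
    "grep -r": "grep -r search files recursively for a pattern",
}

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9_]+", text.lower())

def score(query, doc):
    """Count overlapping tokens (a stand-in for a learned retriever)."""
    q = Counter(tokenize(query))
    d = Counter(tokenize(doc))
    return sum(min(q[t], d[t]) for t in q)

def retrieve(intent, pool, k=2):
    """Return the ids of the k documents that best match the intent."""
    ranked = sorted(pool.items(),
                    key=lambda item: score(intent, item[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]

def build_prompt(intent, pool, k=2):
    """Concatenate the retrieved docs with the intent to form the prompt."""
    docs = [pool[doc_id] for doc_id in retrieve(intent, pool, k)]
    return "\n".join(docs) + "\n# Intent: " + intent + "\n"

if __name__ == "__main__":
    print(build_prompt("extract files from the archive backup.tar", DOC_POOL))
```

In a real system the prompt would be passed to a code generation model, and both the retriever and the generator would be trained rather than hand-written.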
The documentation pool can be regularly updated with new content, enabling DocPrompting to generate code that uses unseen and unused functions and libraries without any costly retraining of model components. DocPrompting is also a general method: it can be applied to any programming language, is not bound to a particular underlying neural model, and can be instantiated with any base retriever and generator.
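To make these two properties concrete, here is a small hedged sketch in Python. Everything in it is an illustrative assumption rather than part of the paper's code: `make_pipeline`, `keyword_retriever` and `echo_generator` are hypothetical names, the retriever is a trivial name match, and the "generator" simply returns its prompt in place of a real model. The point is only that the retriever and generator are interchangeable components, and that adding an entry to the pool changes the output without touching either of them.

```python
from typing import Callable, Dict, List

# Any callable with these shapes can be plugged in (assumed interfaces).
Retriever = Callable[[str, Dict[str, str]], List[str]]
Generator = Callable[[str], str]

def make_pipeline(retriever: Retriever, generator: Generator):
    """Compose a retriever and a generator into one NL-to-code function."""
    def run(intent: str, pool: Dict[str, str]) -> str:
        docs = retriever(intent, pool)
        prompt = "\n".join(docs + [intent])
        return generator(prompt)
    return run

def keyword_retriever(intent: str, pool: Dict[str, str]) -> List[str]:
    """Toy retriever: return docs whose function name appears in the intent."""
    words = set(intent.lower().split())
    return [text for name, text in pool.items() if name.lower() in words]

def echo_generator(prompt: str) -> str:
    """Placeholder for a code generation model: returns the prompt as-is."""
    return prompt

run = make_pipeline(keyword_retriever, echo_generator)

# Hypothetical pool containing one documented function.
pool = {"foo": "foo(x): doubles x"}
before = run("call bar on 3", pool)   # no documentation for bar yet

# A new function is released: only the pool is updated, nothing is retrained.
pool["bar"] = "bar(x): triples x"
after = run("call bar on 3", pool)    # bar's documentation now reaches the prompt
```

Swapping `keyword_retriever` or `echo_generator` for a different callable requires no change to `make_pipeline`, which mirrors the claim that the method is independent of the specific retriever and generator.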
In their empirical study, the team evaluated DocPrompting on two NL-to-code tasks and benchmarks: shell scripting and Python programming. In the shell scripting task, DocPrompting consistently improved on the base model, while in Python programming, CodeT5+DocPrompting performed exceptionally well on unseen functions and achieved a 1.65 BLEU score improvement over the state-of-the-art result.
This work opens a promising new direction for the evolution of code generation. The team says that, to the best of their knowledge, DocPrompting is the first approach to explicitly and effectively leverage documentation for NL-to-code tasks.
The code is available on the project’s GitHub. The paper DocPrompting: Generating Code by Retrieving the Docs is on arXiv.
Author: Hecate He | Editor: Michael Sarazen