One of the most exciting capabilities of contemporary large language models (LLMs) is their impressive performance on code understanding and generation tasks, expanding access to the previously arcane domain of computer programming. But existing code-focused LLMs have two main drawbacks: they often adopt an encoder-only or decoder-only architecture, which restricts their strongest performance to specific types of tasks; and they typically rely on a limited set of pretraining objectives, which degrades performance on downstream tasks that are less relevant to those objectives.
In the new paper CodeT5+: Open Code Large Language Models for Code Understanding and Generation, a Salesforce AI Research team presents CodeT5+, a novel family of encoder-decoder code foundation LLMs that can be flexibly adapted to a wide range of code understanding and generation tasks and achieves strong performance on various code-related benchmarks.
As the team’s goal is to build a flexible code LLM suited to many different downstream tasks, they provide CodeT5+ with a diverse mixture of pretraining objectives on both unimodal and bimodal data.
In the first stage, CodeT5+ is pretrained on large-scale code unimodal data from open-source platforms such as GitHub. This pretraining uses a mixture of objectives — span denoising, decoder-only causal LM, and seq2seq causal LM tasks — to teach the model how to recover code contexts in code spans, partial programs, and complete programs.
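The span denoising objective can be illustrated with a minimal sketch: contiguous spans of code tokens are replaced with sentinel tokens, and the model's target is to reconstruct the masked spans. This is an illustrative, T5-style toy implementation, not the paper's actual preprocessing pipeline; the function name and parameters are assumptions.

```python
import random

def span_denoise(tokens, num_spans=2, span_len=2, seed=0):
    """Mask `num_spans` non-overlapping spans with sentinel tokens
    (T5-style span corruption, simplified for illustration).

    Returns (corrupted_input, target): the model is trained to emit
    `target` given `corrupted_input`, i.e. to recover the masked code.
    """
    rng = random.Random(seed)
    # Sample span starts on a `span_len` grid so spans never overlap.
    blocks = rng.sample(range(len(tokens) // span_len), num_spans)
    starts = sorted(b * span_len for b in blocks)

    corrupted, target, i, sid = [], [], 0, 0
    while i < len(tokens):
        if sid < len(starts) and i == starts[sid]:
            sentinel = f"<extra_id_{sid}>"
            corrupted.append(sentinel)          # span replaced in the input
            target.append(sentinel)             # sentinel delimits the span
            target.extend(tokens[i:i + span_len])  # span appears in the target
            i += span_len
            sid += 1
        else:
            corrupted.append(tokens[i])
            i += 1
    return corrupted, target

tokens = "def add ( a , b ) : return a + b".split()
corrupted, target = span_denoise(tokens)
```

The decoder-only and seq2seq causal LM objectives mentioned above differ only in how the sequence is split: the model predicts the remainder of a program given a prefix, either from scratch or conditioned on an encoded context.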
The second pretraining stage employs text-code bimodal data — text-code pairs that contain a code function and its corresponding semantic description — at the function level. Here, CodeT5+ is pretrained on cross-modal contrastive learning, matching, and causal LM tasks to improve its cross-modal understanding and generation abilities.
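The contrastive objective can be sketched as a symmetric InfoNCE loss over a batch of text and code embeddings: a function and its description (same batch index) form a positive pair, while all other in-batch combinations serve as negatives. The following is a minimal stdlib-only sketch under these assumptions, not the authors' implementation; the temperature value and function name are illustrative.

```python
import math

def info_nce(text_embs, code_embs, temperature=0.07):
    """Symmetric contrastive loss over paired text/code embeddings.

    Pulls each code embedding toward the embedding of its own
    description and pushes it away from the other texts in the batch
    (and vice versa).
    """
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    def normalize(u):
        n = math.sqrt(dot(u, u))
        return [a / n for a in u]

    t = [normalize(v) for v in text_embs]
    c = [normalize(v) for v in code_embs]
    n, loss = len(t), 0.0
    for i in range(n):
        # text -> code direction: row i of the similarity matrix
        logits = [dot(t[i], c[j]) / temperature for j in range(n)]
        log_z = math.log(sum(math.exp(x) for x in logits))
        loss += -(logits[i] - log_z)
        # code -> text direction: column i, treated symmetrically
        logits = [dot(c[i], t[j]) / temperature for j in range(n)]
        log_z = math.log(sum(math.exp(x) for x in logits))
        loss += -(logits[i] - log_z)
    return loss / (2 * n)
```

With perfectly aligned pairs the loss approaches zero; swapping the code embeddings between examples drives it up, which is exactly the signal that teaches the encoder to align the two modalities.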
This two-stage pretraining process enables CodeT5+ to perform flexibly across different model types that handle different tasks, such as seq2seq generation tasks, decoder-only tasks, and understanding-based tasks.
In their empirical study, the team compared CodeT5+ with state-of-the-art code LLMs such as LaMDA, GPT, and StarCoder on code understanding and generation tasks under zero-shot, finetuning, and instruction-tuning settings, across 20 benchmark datasets. In the experiments, CodeT5+ recorded SOTA results on zero-shot HumanEval code generation tasks and outperformed OpenAI’s powerful code-cushman-001 model.
This work demonstrates the proposed CodeT5+ open code LLMs’ ability to flexibly operate in encoder-only, decoder-only, and encoder-decoder modes, supporting a wide range of downstream code tasks and even reaching SOTA performance on many of them. The team believes CodeT5+ can be deployed as a unified retrieval-augmented generation system, and they are open-sourcing all CodeT5+ models to support future research in this area.
Author: Hecate He | Editor: Michael Sarazen