The rapid rise of generative AI — particularly the large language models (LLMs) that now dominate the natural language processing (NLP) domain — has put AI in the public spotlight like never before. As LLMs’ real-world applications expand in scope, so too do concerns about the lack of transparency and openness in how such models are developed. A new study addresses this issue for code LLMs, models trained on many programming languages and designed for efficient code generation.
The BigCode community is a global scientific collaboration dedicated to the open and responsible development of code LLMs. It is co-stewarded by Hugging Face and ServiceNow and has more than 600 members from diverse research institutions. In the new paper StarCoder: May the Source Be With You!, the BigCode community releases StarCoder and StarCoderBase, two 15.5B-parameter open-access LLMs trained on 80+ programming languages. StarCoderBase outperforms all existing open code LLMs that support multiple programming languages, while StarCoder surpasses all models fine-tuned on Python, reaching a state-of-the-art 40 percent pass@1 on the HumanEval problem-solving benchmark.
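For readers unfamiliar with the pass@1 figure above: pass@k measures the probability that at least one of k generated samples passes a problem's unit tests. A minimal sketch of the standard unbiased estimator from the HumanEval paper (Chen et al., 2021), which the numbers reported here follow:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total samples generated for a problem
    c: number of those samples that pass the unit tests
    k: sampling budget being evaluated
    """
    if n - c < k:
        # Every size-k subset must contain at least one passing sample.
        return 1.0
    # 1 - P(all k drawn samples fail)
    return 1.0 - comb(n - c, k) / comb(n, k)

# With k = 1 this reduces to the fraction of passing samples,
# e.g. 8 passes out of 20 samples gives pass@1 = 0.4:
print(pass_at_k(20, 8, 1))
```

The benchmark-wide score is this quantity averaged over all problems; StarCoder's reported 40 percent pass@1 means that, on average, a single sample solves the task two times in five.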
The team summarizes their main contributions as follows:
- We release StarCoderBase and StarCoder, open-access Code LLMs trained on 80+ programming languages that support a novel combination of capabilities and architectural features unavailable in other open Code LLMs.
- We perform the most comprehensive evaluation of Code LLMs to date using a diverse set of benchmarks.
- We take important steps towards a safe open model release.
StarCoderBase was trained on 1 trillion tokens sourced from a curated dataset of more than 80 programming languages, GitHub issues, Git commits, and Jupyter notebooks. It is based on the same architecture as BigCode’s previously released SantaCoder (Ben Allal et al., 2023), a decoder-only transformer with infilling capabilities (FIM, Bavarian et al., 2022), multi-query attention for fast large-batch inference (MQA, Shazeer, 2019), and learned absolute positional embeddings. FlashAttention (Dao et al., 2022) is incorporated to speed up attention computation, reduce costs, and enable the model to scale to a context length of 8k tokens. The StarCoder model is a fine-tuned version of StarCoderBase trained on an additional 35B Python tokens.
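The infilling (FIM) capability mentioned above works by rearranging a document into prefix, suffix, and middle segments separated by sentinel tokens, so a left-to-right decoder can condition on code both before and after the cursor. A minimal sketch of prompt assembly under that scheme; the sentinel strings below follow the convention used by SantaCoder-style models, but treat them as assumptions and check the released tokenizer for the exact tokens:

```python
# Assumed FIM sentinel tokens (verify against the model's tokenizer).
FIM_PREFIX = "<fim_prefix>"
FIM_SUFFIX = "<fim_suffix>"
FIM_MIDDLE = "<fim_middle>"

def make_fim_prompt(prefix: str, suffix: str) -> str:
    """Assemble a prefix-suffix-middle prompt: the model generates the
    'middle' that connects the given prefix and suffix."""
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"

prompt = make_fim_prompt(
    prefix="def add(a, b):\n    ",
    suffix="\n    return result\n",
)
# The model's continuation after <fim_middle> is the infilled code,
# e.g. something like "result = a + b".
```

Because the rearrangement happens purely at the data level, the same decoder-only architecture serves both ordinary left-to-right completion and editor-style fill-in-the-middle use.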
In their empirical study, the team compared StarCoder with benchmark code LLMs such as CodeGen, CodeGeeX, and OpenAI’s code-cushman-001 model. They summarize their results as follows:
- StarCoder outperforms every open LLM for code that supports multiple programming languages.
- StarCoder matches or outperforms the OpenAI code-cushman-001 model.
- When fine-tuned on Python, StarCoder substantially outperforms existing LLMs that are also fine-tuned on Python.
- Leveraging its 8K token context, StarCoder can be prompted to behave as a virtual technical assistant without instruction-tuning or RLHF.
StarCoder has been released under an Open Responsible AI Model license, and all code repositories for building the model are open-sourced on the project’s GitHub. The team hopes their work will increase the access, reproducibility, and transparency of Code LLMs in the research and developer communities and ensure that StarCoder models remain “a force for good.”
The paper StarCoder: May the Source Be With You! is on arXiv.
Author: Hecate He | Editor: Michael Sarazen