With ChatGPT racking up more than a million users in less than a week, large language models (LLMs) have captured the public imagination much as image generation models did last year. Countless social media posts have showcased the OpenAI model’s conversational abilities, with tech-oriented users focusing on its impressive code generation results. While effective Code LLMs promise to significantly simplify programming tasks, progress in this area has been hindered by a lack of transparency regarding the licensing of their training data.
In the new paper The Stack: 3 TB of Permissively Licensed Source Code, a team from ServiceNow Research and Hugging Face advances open and responsible research on code LLMs by releasing The Stack, a 3.1 TB dataset of permissively licensed source code in 30 programming languages. The researchers train 350M-parameter decoder-only transformers on various Python subsets to demonstrate the effectiveness and robustness of The Stack on text2code benchmarks.
The team summarizes their main contributions as follows:
- We present The Stack, a large dataset with 3.1 TB of permissively licensed source code in 30 programming languages. We release this dataset along with a near-deduplicated version at https://hf.co/BigCode.
- We train 350M decoder-only transformers on several Python subsets of the data and find that removing near-duplicates significantly boosts performance in all experiments. We show it is possible to reproduce the text2code performance of Codex (Chen et al., 2021) and CodeGen (Nijkamp et al., 2022) using only permissively licensed data. We outperform these models by a large margin if we train on the all-license version of the dataset.
- We acknowledge that some developers do not wish their code to be used for pre-training LLMs and, therefore, start experimenting with giving developers the possibility to have their data removed from the dataset. We present the details of this opt-out process in a data governance plan in Section 3.2. We also provide further instructions for removal requests at https://www.bigcode-project.org/docs/about/the-stack/.
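In spirit, the opt-out process described above amounts to a filtering pass over the dataset: files whose repositories have requested removal are dropped. The sketch below is purely illustrative; the record layout, the `OPT_OUT` set, and the repository names are assumptions for demonstration, not BigCode's actual pipeline.

```python
# Illustrative opt-out filter: drop every file whose repository appears
# in a set of removal requests. The record layout and OPT_OUT set are
# hypothetical, not BigCode's real implementation.

OPT_OUT = {"alice/secret-project"}  # repos whose owners requested removal

def apply_opt_out(records, opted_out=OPT_OUT):
    """Yield only records whose repository has not opted out."""
    for record in records:
        if record["repository"] not in opted_out:
            yield record

files = [
    {"repository": "alice/secret-project", "path": "main.py"},
    {"repository": "bob/open-tool", "path": "cli.py"},
]
kept = list(apply_opt_out(files))
# Only the file from bob/open-tool survives the filter.
```

In a real release the filter would run before each dataset version is published, so honoring a removal request only requires regenerating the distributed artifact.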
The team built their dataset from 137.36M GitHub repositories in the GH Archive, extracting available license information for each repository and running go-license-detector on repositories for which no license information was available. MIT and Apache 2.0 were the most frequently detected licenses (9.6% and 2.7% of the total repositories, respectively). They then applied near-deduplication techniques to remove files judged near-duplicates of other files, producing the final Stack dataset of source code with permissive licenses (defined as imposing “minimal restrictions on how the software can be copied, modified, and redistributed”).
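The near-deduplication step can be pictured as comparing files by the overlap of their token n-grams. A minimal sketch follows, using exact Jaccard similarity over 5-gram shingles; at The Stack's scale the pipeline reportedly relies on MinHash-style approximations rather than pairwise comparison, and the 0.85 threshold here is an assumed illustrative value.

```python
# Minimal near-duplicate check via Jaccard similarity over token
# 5-gram shingles. Exact pairwise comparison is shown for clarity;
# large-scale pipelines approximate this with MinHash. The 0.85
# threshold is an assumption for illustration.

def shingles(text, n=5):
    """Return the set of n-gram token tuples in a document."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(max(len(tokens) - n + 1, 1))}

def jaccard(a, b):
    """Jaccard similarity of two documents' shingle sets."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def is_near_duplicate(a, b, threshold=0.85):
    return jaccard(a, b) >= threshold

doc = "def add ( a , b ) : return a + b"
assert is_near_duplicate(doc, doc)                       # identical files
assert not is_near_duplicate(doc, "print ( 'hello' )")   # unrelated files
```

Removing near-duplicates matters because popular snippets are copied across many repositories; leaving them in effectively over-weights those files during training.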
The team compared The Stack with the popular code datasets CodeParrot, AlphaCode, CodeGen, and PolyCoder, noting that while The Stack and CodeParrot both provide source code for 30 programming languages, the others cover 12 at most. The Stack dataset is also larger than CodeParrot in each of the 30 programming languages and 3x larger in total.
To evaluate The Stack’s quality, the team trained a 350M-parameter decoder-only transformer on Python subsets. The results show that near-deduplicating the data significantly boosts performance and that it is possible to reproduce Codex and CodeGen text2code performance using only permissively licensed data. The Stack also showed promising results on the HumanEval benchmark.
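HumanEval results such as these are conventionally reported as pass@k: the probability that at least one of k sampled completions passes a problem's unit tests. Chen et al. (2021) give an unbiased estimator computed from n samples of which c pass, which can be sketched as:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator from Chen et al. (2021):
    1 - C(n - c, k) / C(n, k), i.e. the probability that at least
    one of k draws (without replacement) from n samples is correct."""
    if n - c < k:
        return 1.0  # fewer failures than draws: a pass is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 200 samples with 30 passing gives pass@1 = 30 / 200 = 0.15
assert abs(pass_at_k(200, 30, 1) - 0.15) < 1e-12
```

Using the estimator over many samples, rather than naively drawing k completions once, reduces the variance of the reported benchmark number.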
The team plans to further improve the Stack dataset in the future and hopes it will become a helpful resource for open and responsible research on Code LLMs.
Author: Hecate He | Editor: Michael Sarazen