
ServiceNow Research & Hugging Face Release The Stack: 3 TB of Permissively Licensed Source Code for LLMs

In the new paper The Stack: 3 TB of Permissively Licensed Source Code, a team from ServiceNow Research and Hugging Face advances open and responsible research on code LLMs by releasing The Stack, a 3.1 TB dataset of permissively licensed source code in 30 programming languages.

With ChatGPT racking up more than a million users in less than a week, large language models (LLMs) have captured the public imagination much as image generation models did last year. Countless social media posts have showcased the OpenAI model’s conversational abilities, with tech-oriented users focusing on its impressive code generation results. While effective code LLMs promise to significantly simplify programming tasks, progress in this area has been hindered by a lack of transparency regarding the licensing terms of their training datasets.

The researchers train 350M-parameter decoder-only transformers on various Python subsets of The Stack to demonstrate the dataset’s effectiveness and robustness on text2code benchmarks.

The team summarizes their main contributions as follows:

  1. We present The Stack, a large dataset with 3.1 TB of permissively licensed source code in 30 programming languages. We release this dataset along with a near-deduplicated version at https://hf.co/BigCode.
  2. We train 350M decoder-only transformers on several Python subsets of the data and find that removing near-duplicates significantly boosts performance in all experiments. We show it is possible to reproduce the text2code performance of Codex (Chen et al., 2021) and CodeGen (Nijkamp et al., 2022) using only permissively licensed data. We outperform these models by a large margin if we train on the all-license version of the dataset.
  3. We acknowledge that some developers do not wish their code to be used for pre-training LLMs and, therefore, start experimenting with giving developers the possibility to have their data removed from the dataset. We present the details of this opt-out process in a data governance plan in Section 3.2. We also provide further instructions for removal requests at https://www.bigcode-project.org/docs/about/the-stack/.
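The opt-out process described above amounts to filtering the released data against a list of removal requests. A minimal sketch of such a filter (the record schema and field names below are hypothetical, not taken from the paper):

```python
def apply_opt_outs(files: list[dict], opted_out_repos: set[str]) -> list[dict]:
    """Drop every file whose source repository requested removal.

    Each record is assumed to carry a 'repo' field identifying its
    origin repository (hypothetical schema for illustration only).
    """
    return [f for f in files if f["repo"] not in opted_out_repos]
```

In practice the removal list would be rebuilt before each dataset release, so honored requests disappear from subsequent versions.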

The team built their dataset from 137.36M GitHub repositories in the GH Archive, extracting available license information for each repository and running go-license-detector to detect licenses where that information was missing. MIT and Apache 2.0 were the most frequently detected licenses (9.6% and 2.7% of the total repositories, respectively). They then applied near-deduplication to remove files that were near-duplicates of other files, producing the final Stack dataset of source code with permissive licenses (defined as imposing “minimal restrictions on how the software can be copied, modified, and redistributed”).
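Near-deduplication of this kind is commonly done by comparing token-shingle similarity between files and dropping any file too close to one already kept. A minimal sketch assuming a Jaccard-similarity criterion (the tokenization, shingle size, and threshold below are illustrative; the paper’s exact pipeline may differ and would use approximate hashing at this scale):

```python
def shingles(code: str, n: int = 5) -> set:
    """Lowercased whitespace tokens grouped into overlapping n-grams."""
    toks = code.lower().split()
    return {" ".join(toks[i:i + n]) for i in range(max(len(toks) - n + 1, 1))}

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: |intersection| / |union|."""
    return len(a & b) / len(a | b) if a | b else 1.0

def near_dedup(files: list[str], threshold: float = 0.85) -> list[str]:
    """Keep a file only if no previously kept file is near-identical."""
    kept, kept_shingles = [], []
    for f in files:
        s = shingles(f)
        if all(jaccard(s, k) < threshold for k in kept_shingles):
            kept.append(f)
            kept_shingles.append(s)
    return kept
```

The exact pairwise comparison shown here is quadratic in the number of files; at terabyte scale, techniques such as MinHash with locality-sensitive hashing approximate the same Jaccard test far more cheaply.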

The team compared The Stack with the popular code datasets CodeParrot, AlphaCode, CodeGen, and PolyCoder, noting that while The Stack and CodeParrot both provide source code for 30 programming languages, the others cover 12 at most. The Stack dataset is also larger than CodeParrot in each of the 30 programming languages and 3x larger in total.

To evaluate The Stack’s quality, the team trained 350M-parameter decoder-only transformers on its Python subsets. The results show that near-deduplicating the data significantly boosts performance and that it is possible to reproduce Codex and CodeGen text2code performance using only permissively licensed data. The Stack also showed promising results on the HumanEval benchmark.
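Text2code benchmarks such as HumanEval are typically scored with the unbiased pass@k estimator introduced in the Codex paper (Chen et al., 2021): given n sampled completions per problem of which c pass the unit tests, pass@k = 1 − C(n−c, k)/C(n, k). A small sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n -- total completions sampled for a problem
    c -- completions that pass the unit tests
    k -- number of completions the user is allowed to try
    """
    if n - c < k:
        # Every size-k sample must contain at least one passing completion.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

The estimate is averaged over all benchmark problems; sampling n > k completions and applying this formula gives lower variance than literally drawing k samples per problem.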

The team plans to further improve The Stack in the future and hopes it will become a helpful resource for open and responsible research on code LLMs.

The Stack dataset is available on the Hugging Face website. The paper The Stack: 3 TB of Permissively Licensed Source Code is on arXiv.


Author: Hecate He | Editor: Michael Sarazen


