
IBM’s Granite Code: Powering Enterprise Software Development with AI Precision

In recent years, there has been remarkable advancement in Large Language Models (LLMs) capable of generating and manipulating code. A variety of models exhibiting impressive coding capabilities have emerged. Nevertheless, significant gaps persist within the realm of LLMs tailored for code, particularly concerning enterprise software development.

In a new paper Granite Code Models: A Family of Open Foundation Models for Code Intelligence, an IBM research team introduces the Granite Code model family. Specifically optimized for enterprise software development workflows, these models excel across a spectrum of coding tasks, rendering them versatile and well-suited for diverse coding challenges.

The Granite Code family comprises decoder-only models built for code-generative tasks, offered in two primary variants, each at four sizes (3B, 8B, 20B, and 34B):

The base models are trained with a two-phase strategy. In phase 1, the model trains on 3 to 4 trillion tokens spanning 116 programming languages, building a nuanced grasp of language syntax and structure. In phase 2, it continues training on 500 billion tokens drawn from carefully curated datasets covering both code and natural language.
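The two-phase recipe can be pictured as a data-mixing schedule: code-only in phase 1, a code/natural-language blend in phase 2. The sketch below is purely illustrative; the 80/20 split and source names are assumptions, not figures from the paper.

```python
import random

def sample_training_source(phase, rng, code_weight=0.8):
    """Pick the data source for the next training example.

    Phase 1 draws exclusively from code; phase 2 mixes code with
    natural-language data. The 80/20 split here is an illustrative
    assumption, not a figure from the Granite Code paper.
    """
    if phase == 1:
        return "code"  # phase 1: code-only pretraining (3-4T tokens)
    # phase 2: curated mix of code and natural language (500B tokens)
    return "code" if rng.random() < code_weight else "natural_language"

rng = random.Random(0)
phase2 = [sample_training_source(2, rng) for _ in range(10_000)]
print(phase2.count("code") / len(phase2))  # close to the 0.8 mixing weight
```

In a real pretraining pipeline this weighting would typically be applied at the dataset-sampling level rather than per example, but the principle is the same: the phase determines which corpora are eligible and in what proportion.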

The instruct models are derived from these base models through additional fine-tuning. The fine-tuning data combines a refined version of CommitPack with natural language instruction-following datasets (such as OASST and HelpSteer) and open-source mathematical datasets (such as MathInstruct and MetaMathQA). Synthetically generated code datasets play a pivotal role in strengthening instruction-following and reasoning abilities.
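To make the fine-tuning stage concrete, here is a rough illustration of what a single instruction-following training record from such a mix might look like. The schema and field names are hypothetical, chosen for illustration; they are not the actual format used for CommitPack or the Granite instruct models.

```python
import json

def make_instruct_record(source, instruction, response):
    """Build one instruction-following training record.

    The source/instruction/response schema is a hypothetical
    illustration of instruction-tuning data, not the format
    actually used to train the Granite Code instruct models.
    """
    return {
        "source": source,                 # e.g. a commit dataset or math corpus
        "instruction": instruction.strip(),
        "response": response.strip(),
    }

record = make_instruct_record(
    "synthetic-code",
    "Write a Python function that reverses a string.",
    "def reverse(s):\n    return s[::-1]",
)
print(json.dumps(record, indent=2))
```

Records like this, drawn from several corpora (commit data, instruction-following dialogue, math problems, synthetic code tasks), are what the fine-tuning stage consumes to teach the base model to follow instructions.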

In their empirical investigation, the team conducts extensive evaluations of their code LLMs across a comprehensive array of benchmarks. Results showcase the Granite Code models’ robust performance across all model sizes and benchmarks, often surpassing other open-source code models, even those twice their size.

In summary, the key strengths of the Granite Code models are consistent performance across model sizes and benchmarks, versatility across a wide range of coding tasks, and a design optimized for enterprise software development workflows.

Looking ahead, the team is committed to continually enhancing these models’ performance. Future plans include the release of long-context variants, as well as specialized models tailored for Python and Java.

The code is available on the project’s GitHub. The paper Granite Code Models: A Family of Open Foundation Models for Code Intelligence is on arXiv.


Author: Hecate He | Editor: Chain Zhang


