Meta AI has entered the AI race dominated by large language models (LLMs) such as OpenAI’s ChatGPT, Microsoft’s GPT-powered Bing, and Google’s Bard. Meta CEO Mark Zuckerberg announced the news in a Facebook post: “Today we’re releasing a new state-of-the-art AI large language model called LLaMA designed to help researchers advance their work… Meta is committed to this open model of research and we’ll make our new model available to the AI research community.”
The LLaMA foundation language models range in size from 7B to 65B parameters and were trained on trillions of tokens from publicly available datasets. The LLaMA-13B model outperforms GPT-3 on most benchmarks despite being more than 10x smaller, enabling it to run on a single GPU. A Meta AI research team takes a deep dive into LLaMA’s technical details in the new paper LLaMA: Open and Efficient Foundation Language Models.
Meta AI set out to train a series of LLMs that would optimize performance at different inference budgets. Their resulting Large Language Model Meta AI (LLaMA) collection comprises models that are smaller than existing LLMs, but are trained on more tokens. This boosts performance and makes the models easier to retrain and fine-tune for specific real-world use cases.
The LLaMA models are built on a transformer architecture (Vaswani et al., 2017) with various improvements adopted from other models. Like GPT-3, they pre-normalize the input of each transformer sub-layer to improve training stability, using the RMSNorm normalizing function (Zhang and Sennrich, 2019); and they replace the ReLU non-linearity with the SwiGLU activation function from PaLM to improve model performance. They also use rotary positional embeddings (RoPE, Su et al., 2021), as in GPTNeo, instead of absolute positional embeddings to more effectively leverage positional information.
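To make these two building blocks concrete, here is a minimal, dependency-free Python sketch of RMSNorm and a SwiGLU feed-forward gate. It is an illustration of the formulas only (vectors as plain lists, weight matrices as lists of columns), not Meta’s implementation; the function and parameter names are our own.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    # RMSNorm (Zhang & Sennrich, 2019): scale x by 1/RMS(x),
    # then by a learned per-dimension gain; no mean-centering.
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for w, v in zip(weight, x)]

def silu(v):
    # SiLU (swish) non-linearity: v * sigmoid(v)
    return v / (1.0 + math.exp(-v))

def swiglu(x, w_gate, w_up):
    # SwiGLU gating (Shazeer, 2020): SiLU(x @ W_gate) * (x @ W_up).
    # w_gate / w_up are given as lists of columns for readability.
    gate = [silu(sum(xi * wij for xi, wij in zip(x, col))) for col in w_gate]
    up = [sum(xi * wij for xi, wij in zip(x, col)) for col in w_up]
    return [g * u for g, u in zip(gate, up)]
```

In a real transformer block, `rms_norm` is applied to the sub-layer *input* (pre-normalization), and the SwiGLU output is projected back down by a third matrix, which is omitted here for brevity.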
To improve training speed, the LLaMA models employ an efficient implementation of the causal multi-head attention operator, which reduces memory cost and computational complexity. Activation checkpointing, which saves expensive activations so they need not be recomputed during the backward pass, further boosts training efficiency.
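The causal-masking semantics that this operator computes can be sketched in a few lines of plain Python: position i may attend only to positions j ≤ i. This toy single-head version shows only the mathematics; the memory-efficient implementations the paper refers to (e.g. in the xformers library) compute the same result without ever materializing the masked portion of the attention matrix.

```python
import math

def causal_attention(q, k, v):
    # Single-head scaled dot-product attention with a causal mask.
    # q, k, v: lists of n vectors (one per sequence position).
    n, d = len(q), len(q[0])
    out = []
    for i in range(n):
        # Scores over allowed positions only (j <= i); masked
        # positions are simply never computed.
        scores = [sum(qa * kb for qa, kb in zip(q[i], k[j])) / math.sqrt(d)
                  for j in range(i + 1)]
        # Numerically stable softmax over the visible positions.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        # Weighted sum of the visible value vectors.
        out.append([sum(w * v[j][t] for j, w in enumerate(weights))
                    for t in range(len(v[0]))])
    return out
```

Because position 0 can attend only to itself, its output is exactly its own value vector, which is a handy sanity check for any causal-attention implementation.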
LLaMA models are available in 7B, 13B, 33B, and 65B parameter sizes. LLaMA-65B and LLaMA-33B were trained on 1.4 trillion tokens, while LLaMA-7B and LLaMA-13B were trained on one trillion tokens (by comparison, GPT-3’s training corpus comprised some 499 billion tokens). The training data comprises publicly available text from sources such as English CommonCrawl, and also includes Wikipedia data in 20 widely spoken languages that use Latin and Cyrillic alphabets.
In their empirical study, the researchers compared LLaMA with baseline foundation models such as GPT-3, Gopher, Chinchilla, and PaLM on free-form text generation and multiple-choice QA tasks under zero-shot and few-shot settings. In the evaluations, LLaMA-13B surpassed GPT-3’s performance while being more than 10x smaller, and LLaMA-65B achieved results comparable to the state-of-the-art models Chinchilla-70B and PaLM-540B. Notably, LLaMA achieved this performance while trained only on publicly available data — unlike the baselines, which also leverage proprietary datasets for training.
Meta AI has open-sourced the LLaMA models and code for the machine learning research community in the hope that they can accelerate the development and robustness of LLMs and mitigate known issues such as toxicity and bias.
Author: Hecate He | Editor: Michael Sarazen