AI Machine Learning & Data Science Research

Llama 3: Meta AI’s Multilingual and Multimodal Marvel

In a new paper The Llama 3 Herd of Models, a Meta AI research team presents Llama 3, a new set of foundation models for language that delivers performance competitive with state-of-the-art language models such as GPT-4 across a wide range of tasks.

Foundation models, also known as general-purpose AI systems, are a rising trend in AI research. These models excel in diverse tasks such as text synthesis, image manipulation, and audio generation. Notable examples include OpenAI’s GPT-3 and GPT-4, which power the conversational agent ChatGPT.

Llama 3 is a herd of three multilingual language models with 8B, 70B, and 405B parameters that support coding, reasoning, and tool use, and perform on par with leading models such as GPT-4. The team also develops multimodal extensions to the models, enabling image recognition, video recognition, and speech understanding capabilities.

The development of the proposed Llama 3 language models comprises two main stages:

  1. Language model pre-training.
    1. A large, multilingual text corpus is converted to discrete tokens.
    2. The model is pre-trained on this data for next-token prediction, learning language structure and acquiring extensive knowledge.
    3. The flagship 405B-parameter model is pre-trained on 15.6T tokens with an 8K-token context window, which is later expanded to 128K tokens.
  2. Language model post-training.
    1. The model is aligned with human feedback through supervised fine-tuning and Direct Preference Optimization (a minimal sketch of the training objectives follows this list).
    2. New capabilities, such as tool-use, are integrated, and improvements in coding and reasoning are observed.
    3. Safety measures are incorporated at this stage.
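
To make the two stages more concrete, below is a minimal PyTorch sketch of the two training objectives named above: standard next-token prediction for pre-training and the Direct Preference Optimization (DPO) loss for post-training. The tensor shapes, the beta value, and the function names are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def next_token_loss(logits, input_ids):
    """Pre-training objective: predict token t+1 from tokens up to t.

    logits:    (batch, seq_len, vocab_size) model outputs
    input_ids: (batch, seq_len) token ids of the training text
    """
    # Shift so that position t predicts token t+1.
    shifted_logits = logits[:, :-1, :]
    targets = input_ids[:, 1:]
    return F.cross_entropy(
        shifted_logits.reshape(-1, shifted_logits.size(-1)),
        targets.reshape(-1),
    )

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Direct Preference Optimization on (chosen, rejected) response pairs.

    Each argument is the summed log-probability of a full response under
    either the policy being trained or a frozen reference model.
    """
    policy_margin = policy_chosen_logp - policy_rejected_logp
    ref_margin = ref_chosen_logp - ref_rejected_logp
    # Push the policy to prefer the chosen response more strongly
    # than the reference model does, scaled by beta.
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()
```

One appeal of DPO here is that preference data is used directly, without training a separate reward model as in classic RLHF pipelines.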

The resulting models can answer questions in multiple languages, write high-quality code, solve complex reasoning problems, and use tools in a zero-shot manner.

The researchers also detail the multimodal extensions:

  1. Multimodal Encoder Pre-training:
    1. Separate encoders for images and speech are trained.
    2. The image encoder learns the relationship between visual content and natural language descriptions.
    3. The speech encoder uses a self-supervised approach to understand speech signals.
  2. Vision Adapter Training:
    1. An adapter integrates the pre-trained image encoder into the language model using cross-attention layers (see the sketch after this list).
    2. The adapter is trained on text-image pairs, aligning image and language representations. A video adapter is trained on paired video-text data to aggregate information across frames.
  3. Speech Adapter Training:
    1. The speech encoder is integrated into the model with an adapter converting speech encodings to token representations.
    2. Parameters of the adapter and encoder are updated in a supervised fine-tuning stage for high-quality speech understanding.
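
As a rough illustration of the adapter idea, the sketch below inserts a cross-attention layer that lets language-model hidden states attend to features from a pre-trained image encoder. It is a simplified sketch with assumed dimensions and PyTorch's built-in MultiheadAttention; the actual adapter architecture, layer placement, and training recipe are described in the paper.

```python
import torch
import torch.nn as nn

class VisionCrossAttentionAdapter(nn.Module):
    """Lets language-model hidden states attend to image-encoder features.

    Dimensions and layer choices here are illustrative, not the paper's.
    """
    def __init__(self, lm_dim=4096, vision_dim=1280, num_heads=32):
        super().__init__()
        # Project image features into the language model's hidden space.
        self.vision_proj = nn.Linear(vision_dim, lm_dim)
        self.cross_attn = nn.MultiheadAttention(lm_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(lm_dim)

    def forward(self, lm_hidden, image_features):
        # lm_hidden:      (batch, text_len, lm_dim) hidden states from an LM layer
        # image_features: (batch, num_patches, vision_dim) from the image encoder
        img = self.vision_proj(image_features)
        attended, _ = self.cross_attn(query=lm_hidden, key=img, value=img)
        # Residual connection keeps the original language pathway intact.
        return self.norm(lm_hidden + attended)

# Example usage with toy shapes:
adapter = VisionCrossAttentionAdapter()
lm_hidden = torch.randn(1, 16, 4096)
image_feats = torch.randn(1, 256, 1280)
out = adapter(lm_hidden, image_feats)   # (1, 16, 4096)
```

The residual connection around the cross-attention means the text pathway passes through unchanged when no useful image signal is present, which is one reason adapter-style integration can add vision capabilities without retraining the language model from scratch.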

The researchers also present experimental results. Llama 3 demonstrates competitive performance on image, video, and speech recognition tasks, showcasing the potential of compositional approaches in multimodal AI systems.

The paper The Llama 3 Herd of Models is on arXiv.


Author: Hecate He | Editor: Chain Zhang
