In recent years, large language models (LLMs) have demonstrated a strong ability to learn vast amounts of ‘global’ knowledge from their training data, as well as to quickly adapt to new information provided in their context or prompt. Despite these impressive ‘in-context’ learning capabilities, the internal mechanisms behind them remain under-explored, which raises concerns about LLMs’ reliability in real-world applications.
In the new paper Birth of a Transformer: A Memory Viewpoint, a Meta AI research team introduces a novel synthetic setup for studying the structure and evolution of transformer language models, aiming to shed light on how LLMs balance global versus in-context learning.

The team summarizes their main contributions as follows:
- We introduce a new synthetic setup to study global vs in-context learning: sequences follow bigram language models, where some bigrams change across sequences and others do not.
- We view the transformer’s weight matrices as associative memories that learn to store specific pairs of embeddings, and use this to derive a simplified but more interpretable model for our task.
- We empirically study the training dynamics with careful probing: global bigrams are learned first, then the induction head is formed by learning appropriate memories in a top-down fashion.
- We give theoretical insights on training dynamics, showing how a few top-down gradient steps on the population loss can recover the desired associative memories by finding signal in noisy inputs.
The team first constructs a synthetic dataset to explore how transformers develop global knowledge and in-context learning capability. Sequences in this dataset are generated by bigram language models in which some bigrams are sequence-specific while the rest are shared across all sequences. A transformer must therefore rely on in-context learning to predict the sequence-specific bigrams, whereas the generic bigrams can be predicted from global statistics of the current token.
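
As an illustration, here is a minimal sketch of this kind of data-generating process in Python. The vocabulary size, the number of "trigger" tokens with sequence-specific bigrams, and the sampling choices are all hypothetical and not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

V = 64            # vocabulary size (hypothetical)
K = 5             # number of trigger tokens with sequence-specific bigrams (hypothetical)
T = 256           # sequence length (hypothetical)

# Global bigram model: one fixed transition matrix shared across all sequences.
global_bigram = rng.dirichlet(np.ones(V), size=V)   # row i = p(next token | current token i)

# Tokens whose outgoing bigram is resampled for every new sequence.
trigger_tokens = rng.choice(V, size=K, replace=False)

def sample_sequence():
    """Sample one sequence: most bigrams follow the global model, but each
    trigger token maps to a fresh, sequence-specific next token."""
    seq_specific_next = {int(q): int(rng.integers(V)) for q in trigger_tokens}
    seq = [int(rng.integers(V))]
    for _ in range(T - 1):
        cur = seq[-1]
        if cur in seq_specific_next:
            seq.append(seq_specific_next[cur])                     # only inferable in-context
        else:
            seq.append(int(rng.choice(V, p=global_bigram[cur])))   # predictable from global statistics
    return seq

batch = [sample_sequence() for _ in range(8)]
```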

To gain a fine-grained understanding of how the in-context mechanism emerges during training, the researchers further simplify the two-layer architecture by freezing some of its layers at their random initialization. This simplification allows them to model individual weight matrices as associative memories that store pairs of embeddings, yielding a precise understanding of the learning dynamics.
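The associative-memory view can be illustrated with a small numerical sketch: a matrix built as a sum of outer products of nearly orthonormal embedding pairs returns the stored output when queried with the corresponding input. The dimensions and random embeddings below are illustrative assumptions, not the paper's construction:

```python
import numpy as np

rng = np.random.default_rng(0)
d, N = 128, 10                                  # embedding dimension and number of stored pairs (hypothetical)

# Random high-dimensional embeddings are nearly orthonormal.
U = rng.standard_normal((N, d)) / np.sqrt(d)    # input embeddings u_i (rows)
Vout = rng.standard_normal((N, d)) / np.sqrt(d) # output embeddings v_i (rows)

# Associative memory: a weight matrix storing the pairs (u_i -> v_i) as outer products.
W = sum(np.outer(Vout[i], U[i]) for i in range(N))

# Retrieval: W @ u_j is approximately v_j, up to cross-terms that shrink as d grows.
j = 3
retrieved = W @ U[j]
scores = Vout @ retrieved        # compare the retrieved vector against all stored outputs
print(scores.argmax() == j)      # True with high probability for large d
```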

In their empirical study, the researchers train the model with mini-batch SGD with momentum. They observe that the global bigram statistics are learned faster than the induction head, and that changes to the data distribution significantly affect how quickly in-context learning develops.
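
The probing can be thought of as a "memory recall" measurement: given a weight matrix, check how often querying it with an input embedding retrieves the desired output embedding. The sketch below is a hypothetical version of such a probe and does not reproduce the paper's exact probing code:

```python
import numpy as np

def memory_recall(W, U_in, V_out, targets):
    """Hypothetical probe: fraction of inputs i for which the retrieved vector
    W @ u_i scores highest against the desired output embedding v_{targets[i]}.
    Rows of U_in are input embeddings u_i; rows of V_out are output embeddings v_j."""
    scores = U_in @ W.T @ V_out.T                # scores[i, j] = <W u_i, v_j>
    return float((scores.argmax(axis=1) == np.asarray(targets)).mean())

# Example: a matrix that stores exactly the pairs (u_i -> v_i) gets recall close to 1.
rng = np.random.default_rng(0)
d, n = 128, 10
U_in = rng.standard_normal((n, d)) / np.sqrt(d)
V_out = rng.standard_normal((n, d)) / np.sqrt(d)
W = V_out.T @ U_in                               # sum of outer products v_i u_i^T
print(memory_recall(W, U_in, V_out, targets=np.arange(n)))
```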
They also provide theoretical insights into the training dynamics, showing that with enough data the associative memories can filter out noise in the inputs, and that even when the attention patterns are close to uniform, a few gradient steps can recover the desired associative memories.
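
As a rough sketch of the flavor of this argument (the notation below is illustrative and does not follow the paper's exact statement), a weight matrix estimated from finitely many samples can be decomposed into the desired associative memory plus a sampling-noise term that averages out as the number of samples N grows:

```latex
\[
  \hat{W}_N \;=\; \frac{1}{N}\sum_{n=1}^{N} v_{y_n}\, u_{x_n}^{\top}
  \;=\; \underbrace{\mathbb{E}\big[\, v_{y}\, u_{x}^{\top} \big]}_{\text{desired associations}}
  \;+\; \underbrace{\varepsilon_N}_{\text{sampling noise, } \|\varepsilon_N\| \to 0}
\]
```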
Overall, this work provides valuable insights into the structure and evolution of transformer models. The team says their next step is to explore how transformers learn in richer settings, for example with learned embeddings, factorized key-query matrices, and non-linear feed-forward layers.
The paper Birth of a Transformer: A Memory Viewpoint is on arXiv.
Author: Hecate He | Editor: Chain Zhang
