One of the intriguing puzzles in neural network generalization is a phenomenon known as “grokking”: a neural network achieves perfect training accuracy yet generalizes poorly, and then, with further training, transitions to perfect generalization. This result challenges our conventional understanding, under which a network whose training loss has converged to a low value should not undergo significant further change.
In a recently published paper titled “Explaining Grokking through Circuit Efficiency,” a research team from DeepMind successfully unravels the mystery behind grokking through their circuit efficiency theory. This breakthrough sheds light on why the generalizing solution takes longer to learn compared to memorization. Additionally, the team introduces two novel concepts: “ungrokking” and “semi-grokking,” which contribute to a deeper understanding of neural network generalization.
The team summarizes their main contributions as follows:
- Demonstration of the Sufficiency of Three Ingredients for Grokking: The researchers show that grokking can be produced in a constructed simulation built from three essential ingredients.
- Prediction of Novel Behaviors: By analyzing the dynamics at a critical dataset size, as suggested by their theory, the team predicts two previously unreported behaviors: semi-grokking and ungrokking.
- Experimental Validation: To substantiate their predictions, the team conducts meticulous experiments, effectively demonstrating both semi-grokking and ungrokking in practice.
The team’s explanation of grokking revolves around three key factors: the generalizing circuit, efficiency, and the rate of learning. They begin by highlighting the existence of two categories of circuits that yield favorable training outcomes:
- Memorizing Circuit Family: This family of circuits achieves high training performance but exhibits poor test performance.
- Generalizing Circuit Family: In contrast, this family of circuits attains good test performance. Remarkably, the generalizing family is also more “efficient” than the memorizing one: it achieves an equivalent cross-entropy loss on the training set with a smaller parameter norm. The catch is that the generalizing family is slower to learn, so the memorizing family dominates during the early phases of training.
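The interplay between these two circuit families can be illustrated with a deliberately simplified toy model (our own sketch, not the paper’s actual setup): two weight matrices that produce identical logits on the training data, and therefore identical cross-entropy loss, but differ in parameter norm because the “memorizing” one carries extra weight in a direction the training inputs never exercise. Once weight decay is added to the objective, the lower-norm (“generalizing”) solution wins. All names here (`W_gen`, `W_mem`, the tiny dataset) are hypothetical.

```python
import numpy as np

# Two training inputs that only exercise the first feature;
# the second feature is always zero on the training set.
X = np.array([[1.0, 0.0],
              [-1.0, 0.0]])
y = np.array([0, 1])  # target classes

# "Generalizing" weights: only the feature the task needs.
W_gen = np.array([[3.0, -3.0],
                  [0.0,  0.0]])
# "Memorizing" weights: same behavior on training data, plus
# extra weight in the unused direction, inflating the norm.
W_mem = np.array([[3.0, -3.0],
                  [5.0, -5.0]])

def cross_entropy(W):
    logits = X @ W
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return -np.log(probs[np.arange(len(y)), y]).mean()

def total_loss(W, lam=0.01):
    # Cross-entropy plus an L2 weight-decay penalty.
    return cross_entropy(W) + lam * (W ** 2).sum()

# Identical training loss, but weight decay prefers the efficient circuit.
print(cross_entropy(W_gen), cross_entropy(W_mem))  # equal
print(total_loss(W_gen) < total_loss(W_mem))       # True
```

Under gradient descent with weight decay, this pressure toward lower parameter norm is what eventually shifts the network from the quickly learned memorizing solution to the more efficient generalizing one, even after training loss has plateaued.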
Building upon this theory and considering behaviors around the critical dataset size, the team makes two groundbreaking predictions, hitherto unreported in previous research:
- Ungrokking: This phenomenon describes a network transitioning from high test performance to low test performance as it undergoes further training on a smaller dataset.
- Semi-Grokking: Here, a network trained on a dataset close to the critical size exhibits delayed generalization, achieving partial rather than perfect test accuracy.
Finally, through rigorous experimentation, the team successfully validates their theory. This achievement not only solves the enigma of grokking but also underscores the potential for a broader comprehension of deep learning by examining it through the lens of circuit efficiency.
The paper Explaining Grokking through Circuit Efficiency is available on arXiv.
Author: Hecate He | Editor: Chain Zhang