The success of modern, overparameterized neural models has sparked significant interest in exploring the intricate relationship between memorization and generalization. These models demonstrate an impressive ability to generalize effectively, even though they possess the capacity to memorize data, such as perfectly fitting entirely random labels.
To gain insights into the phenomenon of generalization in these powerful models, Feldman conducted an empirical investigation. He calculated the memorization profile of a ResNet on various image classification benchmarks, revealing that, in certain scenarios, memorization plays a crucial role in achieving generalization. While this represents an exciting initial step in understanding how real-world models memorize information, it leaves unanswered questions about how memorization dynamics change with varying model sizes.
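The article does not spell out how a memorization profile is computed. As a rough sketch, assuming the commonly used Feldman-style definition (an example's memorization score is the model's accuracy on that example when it is in the training set, minus its accuracy when it is held out), the score can be estimated from many models trained on random subsets of the data. The function name `estimate_memorization` and the array names `in_train` and `correct` are hypothetical, chosen for illustration only.

```python
import numpy as np

def estimate_memorization(in_train, correct):
    """Estimate per-example memorization scores from subsampled training runs.

    in_train[m, i] is True if example i was in the training subset of model m.
    correct[m, i] is True if model m predicts the label of example i correctly.
    Both arrays are assumed inputs produced by a separate training pipeline.
    """
    in_train = np.asarray(in_train, dtype=bool)
    correct = np.asarray(correct, dtype=float)

    n_in = in_train.sum(axis=0)       # models that trained on each example
    n_out = (~in_train).sum(axis=0)   # models that held each example out
    if (n_in == 0).any() or (n_out == 0).any():
        raise ValueError("each example needs at least one 'in' and one 'out' model")

    acc_in = (correct * in_train).sum(axis=0) / n_in
    acc_out = (correct * ~in_train).sum(axis=0) / n_out
    # A score near 1 means the example is only fit when it is trained on;
    # a score near 0 means it is predicted equally well when held out.
    return acc_in - acc_out
```

Under this sketch, an example with a score near 1 is one the model gets right only by having seen it, which is the sense of "memorized" used throughout the rest of the article.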
In the new paper "What do larger image classifiers memorise?", a Google Research team presents a comprehensive empirical analysis of whether larger neural models memorize more. Their findings indicate that as model complexity increases, the distribution of memorization across examples becomes increasingly bi-modal, a result that underscores the importance of considering multiple model sizes in future research.
The research team begins with a quantitative examination of how memorization varies with model complexity, specifically the depth and width of a ResNet used for image classification. They plot how the depth of a ResNet influences the memorization score on two prominent datasets, CIFAR-100 and ImageNet. Surprisingly, the memorization score rises up to a depth of 20 and then, contrary to expectations, begins to decrease.
Consequently, the researchers conclude that increasing model complexity results in a more bi-modal distribution of memorization across different examples. Simultaneously, they identify a limitation in the existing computationally tractable methods for quantifying memorization and example difficulty, as these methods fail to capture this essential trend.
To further delve into the bi-modal memorization pattern, the research team presents instances that exhibit varying memorization score trajectories across different model sizes. They identify four prominent types of trajectories, including those in which memorization increases with model complexity. Notably, they find that particularly ambiguous and mislabeled examples follow this kind of trajectory.
The researchers wrap up their study with a quantitative analysis demonstrating that distillation, a process used to transfer knowledge from a large teacher model to a smaller student model, tends to inhibit memorization. This inhibition is particularly evident for samples that the student model memorizes when it is trained on one-hot labels without distillation. Intriguingly, they observe that distillation primarily suppresses memorization of examples whose memorization increases as model size grows. This observation leads to the conclusion that distillation enhances generalization by constraining the memorization of challenging examples.
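To make the contrast between one-hot training and distillation concrete, here is a minimal sketch of the two training objectives. It assumes the standard soft-label formulation of distillation, in which the student is trained to match the teacher's temperature-softened output probabilities; the article does not state which loss the paper uses, and the function names and the `temperature` parameter are illustrative assumptions.

```python
import numpy as np

def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over the last axis."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def cross_entropy(target_probs, student_logits, temperature=1.0):
    """Mean cross-entropy between target distributions and student predictions."""
    log_p = log_softmax(student_logits, temperature)
    return -(target_probs * log_p).sum(axis=-1).mean()

def one_hot_loss(labels, student_logits):
    """Non-distilled objective: targets put all mass on the given label."""
    student_logits = np.asarray(student_logits, dtype=float)
    targets = np.eye(student_logits.shape[-1])[np.asarray(labels)]
    return cross_entropy(targets, student_logits)

def distillation_loss(teacher_logits, student_logits, temperature=1.0):
    """Distilled objective (assumed form): targets are the teacher's soft probabilities."""
    teacher_probs = np.exp(log_softmax(teacher_logits, temperature))
    return cross_entropy(teacher_probs, student_logits, temperature)
```

With one-hot targets, the student is pushed to fit every training label, including a mislabeled one. With teacher targets, an example the teacher is unsure about yields a softer target, which is one plausible reading of why distillation would curb memorization of ambiguous or mislabeled examples.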
In summary, this study offers practical insights and lays the foundation for future research avenues. It underscores the importance of caution when using certain statistics as proxies for memorization. Furthermore, it emphasizes the need to identify reliable memorization score proxies that can be efficiently computed and highlights the importance of considering multiple model sizes when characterizing examples in neural network research.
The paper "What do larger image classifiers memorise?" is on arXiv.
Author: Hecate He | Editor: Chain Zhang