The practice of fine-tuning pretrained large neural language models (LMs) for specific downstream tasks has enabled countless performance breakthroughs in natural language processing (NLP) in recent years. This paradigm has also prompted machine learning researchers to ask to what extent such models genuinely generalize rather than simply memorize their training data.
Most existing measures of memorization strongly reflect how frequently a piece of text occurs in the training set, so frequently repeated “common” text dominates the signal and obscures the memorization of rare, potentially sensitive information such as phone numbers and usernames. This has led to an open question: how can such “common” memorization be filtered out so that the rarer, more consequential cases can be identified?
In the new paper Counterfactual Memorization in Neural Language Models, a research team from Google Research, the University of Pennsylvania and Cornell University proposes a principled perspective for filtering out common memorization in neural LMs. Inspired by psychological studies on human memory, they introduce “counterfactual memorization,” which measures the expected change in a model’s prediction when a particular document is omitted during training, and which distinguishes “rare” (episodic) memorization of individual documents from less useful “common” (semantic) memorization.

The team summarizes their contributions as:
- We define counterfactual memorization in neural LMs, which gives us a principled perspective to distinguish “rare” (episodic) memorization from “common” (semantic) memorization in neural LMs.
- We estimate counterfactual memorization on several standard text datasets, and confirm that rare memorized examples exist in all of them. We study common patterns across memorized text in all datasets and the memorization profiles of individual internet domains.
- We extend the definition of counterfactual memorization to counterfactual influence, and study the impact of memorized examples on the test-time prediction of the validation set examples and generated examples.
Generation-time memorization in LMs is evidenced when sufficient overlap is found between a model’s generated texts and its training data. If the training data is not available, heuristic methods that compare language model perplexities can be used to predict whether generated text contains memorized content. Such generation-time instances of memorization, however, tend to be strongly correlated with the number of similar or near-duplicate examples in the training set; the proposed counterfactual memorization measure discounts this frequently repeated text automatically, without the need for deduplication heuristics.
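As a point of reference, the kind of overlap check alluded to above can be implemented with simple n-gram matching. The sketch below is our own toy illustration, not the procedure used in the paper; the function name, word-level tokenization and span length n are arbitrary choices.

```python
def shares_long_ngram(generated: str, training_docs: list[str], n: int = 8) -> bool:
    """Return True if the generated text shares any n-word span with a training document.

    A toy illustration of overlap-based memorization detection; real systems use
    tokenized, deduplicated corpora and efficient indexes rather than substring search.
    """
    tokens = generated.split()
    ngrams = {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    return any(gram in doc for gram in ngrams for doc in training_docs)
```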
The team defines counterfactual memorization by adapting a mathematical formulation from Feldman and Zhang (2020), which measures the influence of individual training examples, to the context of neural language modelling. Informally, a training example x is counterfactually memorized when the model predicts x accurately only if it was trained on x, i.e., when removing x from the training data substantially degrades the model’s predictions on x. This gives a numerical handle on episodic memory: “common” knowledge that recurs throughout a large-scale training set gives rise to semantic memory, while “rare” information found only in individual documents gives rise to episodic memory.
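In notation paraphrasing that formulation (our symbols, not necessarily the paper’s): with S a random subset of the training data, A(S) the model trained on S, and M(f, x) a per-example performance measure such as per-token accuracy, counterfactual memorization is the gap in expected performance on x between models that saw x and models that did not:

```latex
\mathrm{mem}(x) \;=\;
\underbrace{\mathbb{E}_{S:\, x \in S}\big[ M(A(S), x) \big]}_{\text{models trained with } x}
\;-\;
\underbrace{\mathbb{E}_{S':\, x \notin S'}\big[ M(A(S'), x) \big]}_{\text{models trained without } x}
```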
Counterfactual memorization can thus systematically ignore “common” memorizations such as common phrases (e.g. “How are you?”) and instead focus on capturing memorizations of rare information present in specific training examples. The researchers extend this idea to counterfactual influence to analyze how memorized training examples might affect model predictions.
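Counterfactual influence follows the same pattern, except that the models are evaluated on a different example x′ (for instance a validation or generated example) rather than on x itself. In the same paraphrased notation:

```latex
\mathrm{infl}(x \Rightarrow x') \;=\;
\mathbb{E}_{S:\, x \in S}\big[ M(A(S), x') \big]
\;-\;
\mathbb{E}_{S':\, x \notin S'}\big[ M(A(S'), x') \big]
```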
In their empirical study, the team conducted experiments on three text corpora commonly used in language modelling: RealNews (Zellers et al., 2019), C4 (Raffel et al., 2020) and Wiki-40B:en (Guo et al., 2020). They tested whether the proposed approach could identify memorized examples in these datasets, and further estimated counterfactual influence to study how memorized training examples affect model predictions at test time.
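Because the expectations in the definitions above cannot be computed exactly, the paper estimates them by training a pool of models on random subsets of the data and, for each example, comparing the models that happened to include it with those that did not. A minimal sketch of that estimation step is given below, assuming caller-supplied train_model and accuracy functions; these names, like the default pool size and subset fraction, are illustrative rather than the paper’s exact configuration.

```python
import random
from statistics import mean

def estimate_counterfactual_memorization(dataset, train_model, accuracy,
                                         num_models=40, subset_frac=0.5):
    """Estimate mem(x) for each example by training models on random data subsets.

    train_model(examples) -> model and accuracy(model, example) -> float are
    assumed to be provided by the caller; both names are illustrative.
    """
    # Train a pool of models, remembering which example indices each model saw.
    pool = []
    for _ in range(num_models):
        chosen = set(random.sample(range(len(dataset)),
                                   int(subset_frac * len(dataset))))
        model = train_model([dataset[i] for i in sorted(chosen)])
        pool.append((chosen, model))

    # mem(x) = mean accuracy of "in" models on x minus mean accuracy of "out" models.
    scores = {}
    for i, example in enumerate(dataset):
        in_acc = [accuracy(model, example) for chosen, model in pool if i in chosen]
        out_acc = [accuracy(model, example) for chosen, model in pool if i not in chosen]
        if in_acc and out_acc:
            scores[i] = mean(in_acc) - mean(out_acc)
    return scores
```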




The experiments demonstrated that the proposed approach can effectively estimate and detect memorization in all three datasets, ignoring “common” memorization such as frequent phrases while capturing memorization of rare information present in specific training examples. The study also showed that different sources can have substantially different memorization profiles, and that model predictions can differ drastically depending on the presence or absence of a single training example with high memorization.
The paper Counterfactual Memorization in Neural Language Models is on arXiv.
Author: Hecate He | Editor: Michael Sarazen
