If you often encounter new and unfamiliar words online, you are not alone: driven in large part by the global explosion of texting and social media, language has in recent years been changing faster than ever. This can cause static language models to fall behind the times.
In AI and natural language processing (NLP), the dominant paradigm is to train and evaluate language models on data from the same, overlapping time periods. The inherently dynamic nature of language, however, exposes problems with this approach. Despite their strong performance on benchmark datasets, even state-of-the-art Transformer models struggle to predict utterances that emerged after those covered in their training data. In a bid to solve this temporal generalization problem, a team of DeepMind researchers proposes it is time to develop adaptive language models that remain up-to-date in our ever-changing world.
The researchers identify their work’s main contributions as:
- Empirically highlighting the limitations of current language models with respect to temporal generalization.
- Demonstrating the need to rethink the static language modelling evaluation paradigm, which trains and evaluates models on data from the same, overlapping time periods.
- Providing a benchmark to systematically measure progress and encourage more research on temporal generalization and adaptive language modelling.
- Highlighting that succeeding in this setup requires approaches that go beyond scaling models' parameters or training data, thus paving the way for better and more efficient continual learning approaches.
The DeepMind team first examined the performance degradation of state-of-the-art Transformer language models tasked with generalizing to future data based on the past, using publicly available arXiv abstracts and the WMT News Crawl corpus for the scientific and news domains, respectively. They also compiled a larger news corpus, CustomNews, comprising English-language news sources crawled from the web and covering the period 1969-2020. To assess temporal generalization, they designed a time-stratification evaluation protocol that splits the corpora into training and evaluation sets according to each document's timestamp.
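The core of such a protocol can be sketched in a few lines of Python. This is a minimal illustration of the splitting idea, not the authors' exact pipeline; the document fields and dates below are invented:

```python
from datetime import date

def time_stratified_split(documents, cutoff):
    """Partition documents by timestamp: everything dated strictly before
    the cutoff forms the training set, everything on or after it forms
    the (future) evaluation set."""
    train = [doc for doc in documents if doc["date"] < cutoff]
    evaluation = [doc for doc in documents if doc["date"] >= cutoff]
    return train, evaluation

# Hypothetical corpus with per-document timestamps (illustrative only).
corpus = [
    {"text": "old news article", "date": date(2017, 5, 1)},
    {"text": "recent abstract", "date": date(2019, 11, 20)},
    {"text": "future test document", "date": date(2020, 3, 15)},
]

train, test = time_stratified_split(corpus, cutoff=date(2019, 1, 1))
# train holds the one pre-cutoff document; test holds the two later ones.
```

The key design choice is that the evaluation set lies entirely in the model's "future", so the benchmark directly measures how well a model trained on the past handles language that emerged afterwards.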
Initial experiment results showed that the time-stratified models' perplexity degrades, i.e. increases, over time. Perplexity is an intrinsic metric that directly relates to the optimized loss, so lower perplexity means better performance. Although increasing model size lowered overall perplexities, scaling alone failed to improve the temporal generalization ability of stale models. To mitigate this temporal degradation and keep the models up to date, the researchers turned to dynamic evaluation. Proposed by Mikolov et al. in 2010, dynamic evaluation is a form of online learning that continually updates a pretrained model's parameters on the observed test data to incorporate knowledge and information from new documents.
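As a quick refresher on the metric (not code from the paper), perplexity is the exponential of the average negative log-likelihood per token:

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp of the mean negative log-likelihood per token.

    token_log_probs: natural-log probabilities the model assigned to each
    token in the evaluation text.
    """
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)

# A model that assigns probability 0.25 to every token is, on average,
# "as uncertain as" a uniform choice over 4 options.
uniform = [math.log(0.25)] * 10
print(perplexity(uniform))  # ≈ 4.0
```

This is why an upward drift in perplexity on future text is a direct signal that the model's predictive distribution no longer matches current language.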
The dynamic evaluation process improved overall perplexity in the standard (non-temporal) language modelling setup and reduced the speed at which models become outdated. The team notes, however, that dynamic evaluation alone does not completely solve the temporal degradation problem, and proposes further research on more sophisticated continual and lifelong learning approaches in future work.
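To make the mechanism concrete, here is a toy sketch of dynamic evaluation using an add-one-smoothed unigram model in place of a Transformer; the vocabulary and "news" tokens are invented for illustration, and a real implementation would take gradient steps on a neural model instead of updating counts:

```python
import math
from collections import Counter

class UnigramLM:
    """Tiny add-one-smoothed unigram LM that can be updated online."""

    def __init__(self, vocab):
        self.vocab = set(vocab)
        self.counts = Counter()
        self.total = 0

    def train(self, tokens):
        self.counts.update(tokens)
        self.total += len(tokens)

    def log_prob(self, token):
        # Laplace smoothing over the fixed vocabulary.
        return math.log((self.counts[token] + 1) / (self.total + len(self.vocab)))

    def perplexity(self, tokens, dynamic=False):
        nll = 0.0
        for tok in tokens:
            nll -= self.log_prob(tok)
            if dynamic:            # dynamic evaluation: learn from each
                self.train([tok])  # test token right after scoring it
        return math.exp(nll / len(tokens))

vocab = ["covid", "vaccine", "election", "economy"]
old_news = ["election", "economy", "economy", "election"]
new_news = ["covid", "vaccine", "covid", "vaccine"] * 5  # future topic shift

static_lm = UnigramLM(vocab); static_lm.train(old_news)
dynamic_lm = UnigramLM(vocab); dynamic_lm.train(old_news)

ppl_static = static_lm.perplexity(new_news)
ppl_dynamic = dynamic_lm.perplexity(new_news, dynamic=True)
# The dynamically updated model adapts to the shifted topic distribution,
# so its perplexity on the future text ends up lower than the static model's.
```

The same trade-off the paper reports shows up even here: the online-updated model catches up on new vocabulary usage, but only after paying a cost on the first occurrences of each new term.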
The paper's first author, Angeliki Lazaridou, will host an online seminar on Feb 11, 2021. The paper Pitfalls of Static Language Modelling is on arXiv.
Author: Hecate He | Editor: Michael Sarazen