A phrase like “It’s cold today” suggests a very different temperature depending on whether it is uttered in Nairobi or Montreal, while words like “troll” and “tweet” referred to totally different things just a generation ago. Although contemporary large-scale pretrained language models are very effective at learning linguistic representations, they are not as well equipped to capture the temporal, geographical, social and other contextual aspects tied to a speaker or author.
In the new paper LMSOC: An Approach for Socially Sensitive Pretraining, a Twitter Cortex research team proposes LMSOC, a simple but effective approach for learning both linguistically contextualized and socially sensitive representations in large-scale language models.
An implicit assumption in most pretrained language models (PLMs) is that language is independent of extra-linguistic contexts such as speaker/author identity and social settings. Despite the impressive achievements of PLMs, this remains a critical weakness, as there is strong evidence from sociolinguistics that social context shapes how language is used and interpreted. Word embeddings are the most commonly used way to learn word representations that reflect such contexts, but this approach suffers from two fundamental limitations: 1) Word embeddings are not linguistically contextualized; 2) Word embedding learning is transductive: models can only generate embeddings for words observed during training, and they generally assume a finite word vocabulary and a finite set of social contexts, all of which must also be seen during training.
To overcome these limitations, the proposed LMSOC aims to learn token representations that are both linguistically contextualized and socially sensitive. It also aims to enable language models to inductively generate representations for language grounded in social contexts they have not observed during pretraining.
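The difference between transductive and inductive representations can be illustrated with a small sketch. Everything here is an assumption for illustration: the `lookup_table` values, the function names, and the use of latitude/longitude as the social context are hypothetical, and the fixed geometric mapping merely stands in for a learned encoder. It is not the encoder used in LMSOC.

```python
import math

# Transductive: one stored vector per context seen in training (hypothetical values).
lookup_table = {
    "Nairobi": [0.9, 0.1],
    "Montreal": [0.1, 0.9],
}

def transductive_embed(context_name):
    """Return the stored vector; raises KeyError for any context absent from training."""
    return lookup_table[context_name]

def inductive_embed(latitude_deg, longitude_deg):
    """Compute a vector from continuous features, so unseen locations still get one.

    Maps a point on the globe to unit-sphere coordinates, so nearby places land
    close together. This fixed function is a stand-in for a learned encoder.
    """
    lat = math.radians(latitude_deg)
    lon = math.radians(longitude_deg)
    return [
        math.cos(lat) * math.cos(lon),
        math.cos(lat) * math.sin(lon),
        math.sin(lat),
    ]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

In this toy setup, `transductive_embed("Toronto")` fails because Toronto was never stored, whereas `inductive_embed(43.65, -79.38)` returns a vector that sits much closer to Montreal's than to Nairobi's.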
LMSOC has two components: SCE, a social context encoder; and SSP, a standard BERT encoder altered to condition on the SCE output. SCE is used to map a social context to a d-dimensional embedding where similar social contexts are closer in the vector space, and SSP conditions token representations on the social as well as the linguistic context. This design enables LMSOC to learn language representations that are both linguistically and socially contextualized.
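One plausible way to condition token representations on a social context vector is to prepend that vector to the token sequence as an extra position and let attention mix it in. The sketch below shows only this idea, under loud assumptions: the article does not describe how the BERT encoder is altered, and the function names, the single attention head and the identity projections are all hypothetical simplifications.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(sequence):
    """Single-head attention with identity projections (a simplification).

    Each output vector is a softmax-weighted average of all input vectors,
    with weights given by dot products against the query position.
    """
    outputs = []
    for query in sequence:
        scores = [sum(q * k for q, k in zip(query, key)) for key in sequence]
        weights = softmax(scores)
        dim = len(query)
        outputs.append(
            [sum(w * vec[i] for w, vec in zip(weights, sequence)) for i in range(dim)]
        )
    return outputs

def encode_with_context(token_embeddings, context_embedding):
    """Prepend the social context vector as an extra position, then attend.

    Assumption: prepending is one possible conditioning mechanism, chosen
    here for illustration only.
    """
    contextualized = self_attention([context_embedding] + token_embeddings)
    return contextualized[1:]  # drop the context slot, keep the token outputs
```

Calling `encode_with_context` twice on the same token embeddings with two different context vectors yields different token representations, which is the behaviour the design is meant to achieve.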
The team evaluated LMSOC against the baselines BERT and LMCTRL (Keskar et al., 2019) on a variety of language modelling tasks in three settings: seen (grounded in social contexts seen during training), unseen (grounded in social contexts unseen during training), and overall (combining both). BERT, unsurprisingly, performed poorly in all settings, while LMCTRL achieved good scores in the seen setting but struggled in the unseen setting. The proposed LMSOC significantly outperformed both baseline models across all settings, with especially impressive performance in the unseen setting.
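The three evaluation settings amount to bucketing test examples by whether their social context appeared in training. A minimal sketch of that bookkeeping follows, assuming each example has already been scored as correct or incorrect. The function name is hypothetical and accuracy is a stand-in metric, since the article does not name the one used in the paper.

```python
def accuracy_by_setting(examples, training_contexts):
    """Report accuracy for the seen, unseen and overall settings.

    examples: iterable of (context, is_correct) pairs.
    training_contexts: set of contexts observed during training.
    Returns None for a setting that has no examples.
    """
    buckets = {"seen": [], "unseen": [], "overall": []}
    for context, is_correct in examples:
        setting = "seen" if context in training_contexts else "unseen"
        buckets[setting].append(is_correct)
        buckets["overall"].append(is_correct)
    return {
        name: (sum(flags) / len(flags) if flags else None)
        for name, flags in buckets.items()
    }
```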
The study shows that LMSOC can leverage correlations between social contexts, enabling language models to generalize better to social contexts not seen during training. The researchers believe their method sets the stage for future research on incorporating new types of social contexts to advance the intelligence of NLP systems.
Author: Hecate He | Editor: Michael Sarazen