Large language models (LMs) have become much larger and more powerful in recent years, achieving remarkable results across natural language processing (NLP) tasks such as text generation, translation, question answering and more. But the malicious use of these increasingly powerful models also poses critical societal threats, particularly through potential biases and the generation of "toxic" content such as insults, threats and hate speech.
In the paper Challenges in Detoxifying Language Models, a DeepMind research team critically discusses toxicity evaluation and mitigation methods for contemporary transformer-based English LMs and provides insights toward safer model use and deployment.
The team summarizes their study’s contributions as:
- We critically discuss LM toxicity evaluation and conduct evaluation studies for several mitigation methods, relying both on automatic toxicity scores and on human judgement.
- We show that combinations of simple methods are very effective in optimizing (automatic) toxicity metrics, but prone to overfilter texts related to marginalized groups.
- We find increased disagreement of high automatic toxicity scores with human annotators once strong toxicity reduction measures are applied, limiting their usefulness as a metric for further mitigation of toxicity.
- We show that a reduction in (automatic) toxicity scores comes at a cost. We identify both a trade-off with LM evaluation loss, and further show that this disproportionately affects texts about and by marginalized groups: both topic-related and dialect-related LM biases increase.
The researchers consider an utterance or text to be toxic if it is rude, disrespectful or unreasonable; characterized in the widely adopted Perspective API definition as "language that is likely to make someone leave a discussion." As such, toxicity judgements can be subjective, and so the researchers consider both automatic approaches (data-based, controllable generation, and direct filtering-based) and human evaluations in an effort to reduce biases with regard to an LM output's possible toxicity.
The team first applies a training set filtering approach, training LMs on different versions of the C4 (Raffel et al., 2020) corpus, filtered for toxicity according to Perspective API scores. Next, they filter LM outputs directly at the decoding phase. Finally, they evaluate the strongest decoding-based method: Plug-and-Play Language Models (PPLMs).
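The first two mitigation strategies can be illustrated with a minimal sketch. The `toxicity_score` function below is a toy keyword heuristic standing in for a real Perspective API call, and the threshold and helper names are illustrative assumptions, not the paper's actual implementation:

```python
def toxicity_score(text: str) -> float:
    """Stand-in for a Perspective API call: returns a toxicity
    probability in [0, 1]. Here, a toy keyword heuristic."""
    toxic_words = {"insult", "threat", "hate"}
    words = text.lower().split()
    hits = sum(w in toxic_words for w in words)
    return min(1.0, hits / max(len(words), 1) * 5)

def filter_corpus(docs, threshold=0.5):
    """Train-time mitigation: drop documents whose automatic
    toxicity score exceeds the threshold before LM training."""
    return [d for d in docs if toxicity_score(d) < threshold]

def filtered_decode(candidates, threshold=0.5):
    """Decode-time mitigation: reject sampled continuations that
    score above the threshold, keeping the first acceptable one."""
    for c in candidates:
        if toxicity_score(c) < threshold:
            return c
    return None  # every candidate was rejected

docs = ["a friendly greeting", "an insult and a threat"]
print(filter_corpus(docs))  # only the non-toxic document survives
```

In practice the classifier, thresholds and sampling scheme all matter; the study's point is precisely that optimizing against such automatic scores has side effects.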
The test results for the three toxicity mitigation approaches demonstrate that, compared to the baseline GPT-2, slightly reduced toxicity rates can be observed in a standard model trained on C4. Filtering the C4 training set based on classifier-based toxicity leads to further reductions in LM toxicity scores, while decoder filtering and PPLMs are both highly effective at reducing automatic toxicity evaluation metrics. Combining PPLMs with these other methods results in the most significant overall automatic toxicity metric reductions.
The team then measures toxicity and LM generation quality using human evaluations, with the results showing that toxicity reduction methods do indeed result in improvements in toxicity ratings according to human judgement; and that most of the human ratings align with the Perspective API scores for the standard LM samples. However, in the higher toxicity score range, the human and Perspective API scores differ substantially after LM detoxification.
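Agreement between human ratings and automatic scores can be quantified with a simple correlation over paired scores. The sketch below uses hypothetical numbers and plain Pearson correlation, which is only an illustration of the idea, not the paper's evaluation protocol:

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation between paired toxicity scores."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical paired scores per sample: automatic (Perspective-style)
# vs. mean human rating, both in [0, 1].
auto_scores  = [0.1, 0.2, 0.7, 0.9]
human_scores = [0.0, 0.3, 0.4, 0.5]
print(round(pearson(auto_scores, human_scores), 3))
```

A drop in such agreement specifically among high-scoring samples after detoxification is the kind of signal the study reports.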
The study also identifies a transfer of toxicity classifier biases onto LMs, highlighting the following findings:
- Toxicity is subjective and context-dependent — what is considered toxic may differ across cultures, social groups, and personal experiences.
- Very low automatic toxicity metrics of state-of-the-art LMs after application of the evaluated mitigation techniques suggest that further improvement with respect to these metrics is limited.
- Detoxification increases LM loss, and introduces and amplifies social biases in topic and dialect coverage, potentially leading to decreased LM performance for marginalized groups.
Overall, the DeepMind study aims to reduce the potential harm caused by LMs through an improved understanding of how these models can be detoxified. The resulting insights can also prove useful in characterizing performance and other trade-offs that may occur via different LM detoxification methods.
The paper Challenges in Detoxifying Language Models is on arXiv.
Author: Hecate He | Editor: Michael Sarazen, Chain Zhang