Rather than relying exclusively on labelled training data, machine learning researchers are increasingly turning to natural language inputs to provide instructions, supervision and inductive biases when training language models. Could natural language also be used to simplify and improve the correction of systematic errors (bugs) in such models?
While current methods for fixing bugs in language models typically rely on brittle patches or large amounts of finetuning data, a research team from Stanford University and Microsoft Research has proposed a novel approach that uses declarative natural language statements as corrective feedback for buggy neural models. The method, detailed in the new paper Fixing Model Bugs with Natural Language Patches, significantly increases model accuracy without high data or compute costs.
Given a language model that contains bugs (behaviours inconsistent with users’ preferences or the ground truth), the proposed approach is designed to fix these bugs by employing a library of if/then natural language patches that can either override the model (“if a review gives 2 stars, the sentiment is negative”) or provide it with missing information (“if something is described as the bomb, then it is good”).
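The two kinds of patches described above could be represented in a small structured library before being rendered into the natural language sentences the model actually reads. The `Patch` class and the "override"/"feature" tags below are illustrative assumptions, not the paper's data format:

```python
from dataclasses import dataclass

@dataclass
class Patch:
    condition: str
    consequence: str
    kind: str  # "override" (fixes the label directly) or "feature" (supplies missing info)

# A tiny patch library using the two examples from the text.
LIBRARY = [
    Patch("a review gives 2 stars", "the sentiment is negative", "override"),
    Patch("something is described as the bomb", "it is good", "feature"),
]

def render(p: Patch) -> str:
    """Turn a structured patch back into the if/then sentence the model consumes."""
    return f"if {p.condition}, then {p.consequence}"
```

Keeping patches structured this way would make it easy to add, remove, or audit corrections without retraining, which is the practical appeal of the approach.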
The team’s neural language patching model comprises two heads: a gating head that softly predicts whether a patch should be applied, and an interpreter head that predicts outputs conditioned on the information the patch provides.
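The interplay between the two heads can be sketched as a soft mixture: the gate decides how much weight the patch-conditioned prediction gets. This is a minimal illustration in plain Python; the function names and the exact blending formula are assumptions, not the paper's implementation:

```python
import math

def softmax(logits):
    """Convert raw class logits into a probability distribution."""
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def patched_prediction(base_logits, patch_logits, gate_logit):
    """Blend the base model's prediction with the patch-conditioned one.

    gate_logit: score from the gating head for one (input, patch) pair.
    base_logits / patch_logits: class logits without and with the patch text.
    """
    g = 1.0 / (1.0 + math.exp(-gate_logit))  # soft probability the patch applies
    base = softmax(base_logits)
    patched = softmax(patch_logits)
    # Rely on the interpreter head only to the extent the gate fires.
    return [g * p + (1.0 - g) * b for p, b in zip(patched, base)]
```

Because the gate is soft, an irrelevant patch (gate near 0) leaves the base model's behaviour essentially untouched, while a clearly applicable patch (gate near 1) overrides it.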
The training method for patchable models has two stages. In the task finetuning stage, the model is trained on a labelled dataset in the usual way. In the patch finetuning stage, a small set of patch templates is used to instantiate patches together with synthetic labelled examples, so the model learns how to interpret and apply patches.
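Patch finetuning data of this kind could be generated by pairing each instantiated patch with a synthetic example it applies to. The templates, vocabulary, and example sentence below are hypothetical placeholders, not the paper's actual templates:

```python
# Hypothetical templates for a sentiment task; {adj} and {label} are the
# slots a template instantiates.
PATCH_TEMPLATE = "if food is described as {adj}, then food is {label}"
EXAMPLE_TEMPLATE = "The {adj} tacos arrived quickly."

# Illustrative adjective-to-label vocabulary for filling the slots.
ADJECTIVES = {"bland": "negative", "zesty": "positive"}

def instantiate():
    """Yield (patch, synthetic example, label) triples for patch finetuning."""
    for adj, label in ADJECTIVES.items():
        patch = PATCH_TEMPLATE.format(adj=adj, label=label)
        example = EXAMPLE_TEMPLATE.format(adj=adj)
        yield patch, example, label
```

A handful of such templates can cheaply produce the supervision needed to teach the gating and interpreter heads, without requiring labels for the actual bugs users will later report.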
In their empirical study, the team applied their approach to Google’s T5-large language model and compared its performance against baselines on binary sentiment analysis and relation extraction under two conditions: the original model with only task finetuning (ORIG), and the model obtained after patch finetuning (ORIG+PF).
The experimental results show that natural language patches are efficient: 1–7 patches achieve performance comparable to or better than finetuning on as many as 100 labelled examples. Patches are also shown to be less susceptible to the simple shortcuts other approaches often learn, which fail to address a problem at the right level of abstraction.
The team believes their method could be extended to realize a back-and-forth dialogue between developers and models, with potential applications in modelling pragmatics, interpreting multiple patches simultaneously, and automating patch finetuning.
Author: Hecate He | Editor: Michael Sarazen