How can we most effectively modify the way a machine learning (ML) model makes its predictions? That’s the question posed by MIT researchers in the new paper Editing a Classifier by Rewriting Its Prediction Rules.
ML models are designed to discover prediction rules automatically from raw data, but not all of these derived rules are reliable, and some may need to be modified. The common approach is to intervene at the data level, but gathering additional data to correct prediction rules can be challenging, and even then models may still learn unintended and potentially problematic rules.
To address this issue, the MIT researchers introduce a general toolkit for directly modifying classifiers’ prediction rules. The novel approach requires no additional data collection and enables users to change a classifier’s behaviour on occurrences of specific concepts that lie beyond the examples used in the editing process.
The MIT paper builds on a recent study on rewriting deep generative models by Bau et al., which enables replacing occurrences of a selected object in generated images with another object without changing the model’s behaviour in other contexts. The proposed toolkit extends this approach to classifiers: users modify the weights of a layer so that the latent representations (keys) of a specific concept map to the representations (values) of another concept — a direct method for modifying a model’s behaviour by rewriting its prediction rules in a targeted manner.
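The key→value remapping idea can be sketched in simplified form. This is not the authors’ exact rank-constrained update; it treats the edited layer as a plain linear map `W` and solves a ridge-regularised least-squares problem that forces the new concept’s keys `K_new` to produce the old concept’s values `V_target` while keeping `W` close to the original weights `W0`. The function name and regulariser are illustrative assumptions, not from the paper.

```python
import numpy as np

def edit_layer(W0, K_new, V_target, lam=1e-6):
    """Return edited weights W such that W @ K_new ~= V_target while
    staying close to the original weights W0 (hypothetical sketch).

    W0:       (d_out, d_in) original layer weights
    K_new:    (d_in, n) keys for the new concept (e.g. wooden-wheel features)
    V_target: (d_out, n) desired values (e.g. standard-wheel features)
    lam:      regularisation strength keeping W near W0
    """
    d_in = W0.shape[1]
    # Minimise ||W K - V||^2 + lam * ||W - W0||^2, which has the
    # closed-form solution W = (V K^T + lam W0)(K K^T + lam I)^{-1}.
    A = K_new @ K_new.T + lam * np.eye(d_in)   # (d_in, d_in), symmetric
    B = V_target @ K_new.T + lam * W0          # (d_out, d_in)
    return np.linalg.solve(A, B.T).T           # solve instead of explicit inverse
```

With a small `lam`, the edited layer reproduces the target values on the given keys almost exactly while leaving its behaviour on unrelated directions largely intact, which mirrors the intent of the rewrite: change the rule for one concept without disturbing the rest of the model.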
ML models will identify and exploit context-specific correlations in their data, for example taking the presence of a “wheel” to predict the presence of a “car.” The researchers note that such prediction rules can be unreliable in the case of novel environments such as snowy roads or when encountering confusing or adversarial examples such as cars with wooden wheels, and so should be modified before model deployment.
To demonstrate their method, the team uses the example task of making a classifier robust to “vehicles with wooden wheels”: the classifier should perceive a “wooden wheel” in a transformed (adversarial) image just as it perceives the standard wheel in the original image. They start with a single image from the “car” class that contains the concept “wheel” and mark the wheel locations with a binary mask. Given the transformed version of the image, in which the car has wooden wheels, the keys for the wooden wheels are mapped to the values corresponding to their standard (non-wooden) counterparts, so a single exemplar suffices to rewrite the rule.
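The role of the binary mask can be illustrated with a small sketch: it selects which spatial positions of the layer’s activation map contribute feature vectors, so keys are gathered from the transformed image and values from the original image at the same positions. The helper name and layout convention (channels-first) are assumptions for illustration, not the paper’s API.

```python
import numpy as np

def masked_features(feature_map, mask):
    """Gather the feature vectors at masked spatial positions.

    feature_map: (C, H, W) activations at the layer being edited
    mask:        (H, W) boolean array marking the concept (e.g. the wheels)
    returns:     (C, n) matrix, one column per masked position
    """
    C = feature_map.shape[0]
    return feature_map.reshape(C, -1)[:, mask.ravel()]

# Keys would come from the transformed image (wooden wheels) and values
# from the original image (standard wheels) at the same masked positions:
# K = masked_features(features_transformed, mask)
# V = masked_features(features_original, mask)
```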
To evaluate their approach, the team performed experiments on two real-world scenarios: 1) Adapting classifiers to handle novel weather conditions; and 2) Making models robust to typographic attacks.
In the first scenario, the team reported the model’s error rate on a new test set (vehicles in snow) before and after performing the rewrite. Editing the model’s prediction rules significantly reduced the error rate, and the edit changed how the model processes the targeted concept beyond the specific examples used in the editing process. In the second scenario, the proposed editing method fixed all the errors caused by the typographic attacks.
Overall, the study demonstrates the effectiveness of the proposed prediction rule editing method, which the researchers hope can inspire additional interpretability and robustness studies and open up new avenues for interacting with and correcting ML models before and during deployment.
Author: Hecate He | Editor: Michael Sarazen