Adversarial examples, inputs crafted to trick image classifiers, have been attracting a lot of attention in machine learning recently. Ian Goodfellow, the respected research scientist who pioneered generative adversarial networks (GANs), has characterized their impact as follows: “By adding an imperceptibly small vector whose elements are equal to the sign of the elements of the gradient of the cost function with respect to the input, we can change GoogLeNet’s classification of the image.”
Although the changes introduced by adversarial examples are hard for humans to perceive, in Goodfellow’s well-known example a small perturbation added to an image of a panda causes the model to misclassify it as a gibbon with very high confidence. As machine learning is deployed at scale, adversarial examples can cause erroneous predictions even in state-of-the-art classifiers.
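The perturbation Goodfellow describes (a small step in the direction of the sign of the loss gradient with respect to the input) can be sketched on a toy model. The example below is only an illustration under stated assumptions: it uses a hypothetical 100-dimensional logistic-regression classifier with hand-picked weights rather than GoogLeNet, and the names `fgsm`, `predict_proba`, and `eps` are chosen here for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(w, b, x):
    """Probability of class 1 under a toy logistic-regression model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def sign(v):
    """Return -1, 0, or 1 depending on the sign of v."""
    return (v > 0) - (v < 0)

def fgsm(w, b, x, y, eps):
    """Add eps * sign(gradient of the loss w.r.t. the input) to x.

    For logistic regression with cross-entropy loss, the gradient of
    the loss with respect to the input x is (p - y) * w.
    """
    p = predict_proba(w, b, x)
    grad = [(p - y) * wi for wi in w]
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]

# Hypothetical toy setup (assumption, not from the paper).
dim = 100
w = [1.0 if i % 2 == 0 else -1.0 for i in range(dim)]
b = 0.0
x = [0.05 * wi for wi in w]   # clean input, true label 1
y = 1
eps = 0.1                     # maximum change per coordinate

x_adv = fgsm(w, b, x, y, eps)
p_clean = predict_proba(w, b, x)
p_adv = predict_proba(w, b, x_adv)
print(f"clean P(class 1) = {p_clean:.4f}")
print(f"adversarial P(class 1) = {p_adv:.4f}")
```

No coordinate moves by more than `eps`, yet the model's confidence in the correct class collapses, because many small coordinated changes add up across the 100 input dimensions.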
In the new paper Adversarial Examples Are Not Bugs, They Are Features, a group of MIT researchers propose that the effectiveness of adversarial examples can be attributed to non-robust features, patterns in the data that are highly predictive yet brittle and incomprehensible to humans: “Adversarial vulnerability is a direct result of our models’ sensitivity to well-generalizing features in the data.”
The researchers suggest adopting a new perspective on adversarial examples. Classifiers are typically trained only to maximize accuracy, so they will inevitably use any available signal to make a prediction, even signals that are incomprehensible to humans. To a classifier, the presence of “a tail” or “ears” is no more natural than any other equally predictive pattern. Following this reasoning, the researchers argue that models learn to depend on “non-robust” features, and that adversarial perturbations can then exploit these brittle correlations.
The research team leveraged standard image classification datasets to show it is possible to disentangle robust from non-robust features.
The MIT team removed non-robust features from a dataset to demonstrate that adversarial vulnerability is not “necessarily tied to the standard training framework, but is rather a property of the dataset.” To filter out non-robust features, they restricted the features available in the new dataset to those used by a robust model, so that standard training on the filtered data can itself yield a more robust classifier.
The researchers also constructed a “non-robust” version of the training set for standard classification. Its inputs are almost identical to the originals, but to a human all of them appear incorrectly labeled. Training on this seemingly mislabeled dataset nonetheless yielded good accuracy on the original test set, despite the lack of any predictive human-visible information. Because the inputs are connected to their new labels only through small adversarial perturbations, the result indicates that the model relied solely on non-robust features.
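The construction of such a relabeled training set can be sketched in a few lines. This is a minimal illustration under loud assumptions, not the authors' procedure: it uses a hypothetical two-class linear classifier and a single targeted sign-gradient step, whereas the paper works with image classifiers and standard datasets. The names `targeted_step` and `build_relabeled_dataset` are invented for this example.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x):
    """Predicted class (0 or 1) of a toy linear classifier."""
    return int(sum(wi * xi for wi, xi in zip(w, x)) + b > 0)

def targeted_step(w, b, x, target, eps):
    """Move each coordinate of x by at most eps toward the target class."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    # Gradient of the cross-entropy loss for the target label w.r.t. x.
    grad = [(p - target) * wi for wi in w]
    # Step against the gradient to lower the loss on the target label.
    return [xi - eps * ((g > 0) - (g < 0)) for xi, g in zip(x, grad)]

def build_relabeled_dataset(w, b, data, eps):
    """Perturb every input toward the other class and assign that class as its label."""
    relabeled = []
    for x, y in data:
        target = 1 - y  # the label a human would consider wrong
        relabeled.append((targeted_step(w, b, x, target, eps), target))
    return relabeled

# Hypothetical toy data (assumption): class 1 points lean along w,
# class 0 points lean against it, plus a little noise.
random.seed(0)
dim = 100
w = [1.0 if i % 2 == 0 else -1.0 for i in range(dim)]
b = 0.0
eps = 0.1
data = []
for _ in range(20):
    y = random.randint(0, 1)
    direction = 1.0 if y == 1 else -1.0
    x = [direction * 0.05 * wi + random.uniform(-0.01, 0.01) for wi in w]
    data.append((x, y))

relabeled = build_relabeled_dataset(w, b, data, eps)
flipped = sum(predict(w, b, xa) == t for xa, t in relabeled)
print(f"{flipped} of {len(relabeled)} perturbed inputs are classified as their new label")
```

Each perturbed input stays within `eps` of its original in every coordinate, yet the reference classifier assigns it the new label. In the paper's experiment, the notable step comes afterwards: a fresh model trained on such a dataset still performs well on the original, correctly labeled test set, which this sketch does not attempt to reproduce.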
The study provides a new view of adversarial examples as a “fundamentally human phenomenon.” The authors conclude that “we should not be surprised that classifiers exploit highly predictive features that happen to be non-robust under a human-selected notion of similarity, given such features exist in real-world datasets” and that as long as models continue to rely on such non-robust features, explanations cannot be both human-meaningful and faithful to the models.
The paper Adversarial Examples Are Not Bugs, They Are Features is on arXiv.
Journalist: Fangyu Cai | Editor: Michael Sarazen