Content provided by Yao Qin, the first author of the paper Deflecting Adversarial Attacks.
There has been an ongoing cycle where stronger defenses against adversarial attacks are subsequently broken by a more advanced defense-aware attack. We present a new approach towards ending this cycle where we “deflect” adversarial attacks by causing the attacker to produce an input that semantically resembles the attack’s target class. To this end, we first propose a stronger defense based on Capsule Networks that combines three detection mechanisms to achieve state-of-the-art detection performance on both standard and defense-aware attacks. We then show that undetected attacks against our defense often perceptually resemble the adversarial target class by performing a human study where participants are asked to label images produced by the attack. These attack images can no longer be called “adversarial” because our network classifies them the same way as humans do.
- We introduce the notion of deflecting adversarial attacks, which presents a step towards ending the battle between attacks and defenses.
- We propose a new cycle-consistency loss which trains a CapsNet to encourage the winning-capsule reconstruction to closely match the class-conditional distribution and show that this can help detect and deflect adversarial attacks.
- We introduce two attack-agnostic detection methods based on the discrepancy between the winning-capsule reconstruction of the clean and adversarial inputs, and design a defense-aware attack to specifically attack our detection mechanisms.
- We introduce a new approach that presents a step towards ending the battle between defenses and attacks by deflecting adversarial attacks.
- We propose a new cycle-consistency loss to encourage the winning-capsule reconstruction of the CapsNet to closely match the class-conditional distribution. With three detection mechanisms, we are able to detect standard adversarial attacks with a low False Positive Rate on SVHN and CIFAR-10.
- To specifically attack our detection mechanisms, we propose a defense-aware attack and find that our model achieves drastically lower undetected rates for defense-aware attacks compared to state-of-the-art methods.
- A large percentage of the undetected attacks are deflected by our model to resemble the adversarial target class, and so are no longer adversarial. This is verified by a human study showing that 70% of the undetected black-box adversarial attacks are classified unanimously by humans as the target class on SVHN.
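The points above mention detection based on the winning-capsule reconstruction, but this summary does not spell out how the detectors work. The following is only a minimal sketch of one plausible form: a single threshold on the distance between an input and its reconstruction, set so that few clean inputs are flagged. It assumes that inputs and reconstructions are already available as flat lists of pixel values. The function names (`reconstruction_error`, `fit_threshold`, `is_flagged`), the 5% false positive rate, and the toy data are all hypothetical and are not taken from the paper.

```python
import math


def reconstruction_error(x, recon):
    """Euclidean (L2) distance between a flattened input and its reconstruction."""
    if len(x) != len(recon):
        raise ValueError("input and reconstruction must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, recon)))


def fit_threshold(clean_errors, false_positive_rate=0.05):
    """Choose a threshold that at most `false_positive_rate` of clean errors exceed."""
    if not clean_errors:
        raise ValueError("need at least one clean error")
    ordered = sorted(clean_errors)
    # Position of the (1 - FPR) quantile, clamped to a valid index.
    k = math.ceil((1.0 - false_positive_rate) * len(ordered)) - 1
    k = max(0, min(k, len(ordered) - 1))
    return ordered[k]


def is_flagged(x, recon, threshold):
    """Flag an input whose reconstruction error exceeds the threshold."""
    return reconstruction_error(x, recon) > threshold


# Toy numbers standing in for real images and winning-capsule reconstructions.
clean_errors = [0.1 * i for i in range(1, 101)]  # hypothetical validation errors
threshold = fit_threshold(clean_errors, false_positive_rate=0.05)

# An input that is reconstructed closely, and one that is reconstructed poorly.
well_reconstructed = is_flagged([0.2, 0.4, 0.6], [0.2, 0.5, 0.6], threshold)
poorly_reconstructed = is_flagged([0.0] * 4, [10.0] * 4, threshold)
```

In this sketch the threshold is fitted on clean data only, which is one way a detector can stay attack-agnostic: it needs no examples from any particular attack. A real system would take the reconstructions from the trained CapsNet rather than from hand-written lists.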
The paper Deflecting Adversarial Attacks is on arXiv.