Provably robust classification of adversarial examples with detection

Fatemeh Sheikholeslami; Ali Lotfi; J Zico Kolter

ICLR 2021

Abstract

Adversarial attacks against deep networks can be defended against either by building robust classifiers or by creating classifiers that can detect the presence of adversarial perturbations. Although it may seem intuitively easier to detect attacks than to build a robust classifier, this has not been borne out in practice, even empirically: most detection methods have subsequently been broken by adaptive attacks, which makes verifiable performance necessary for detection mechanisms. In this paper, we propose a new method for jointly training a provably robust classifier and detector. Specifically, we show that by introducing an additional "abstain/detection" class into a classifier, we can modify existing certified defense mechanisms so that the classifier either robustly classifies or detects adversarial attacks. We extend the common interval bound propagation (IBP) method for certified robustness under $\ell_\infty$ perturbations to account for our new robust objective, and show that the method outperforms traditional IBP used in isolation, especially for large perturbation sizes. Tests on the MNIST and CIFAR-10 datasets show promising results: for example, on CIFAR-10 the provable robust error is less than $63.63\%$ and $67.92\%$, with natural error of $55.6\%$ and $66.37\%$, for $\epsilon=8/255$ and $16/255$, respectively.
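To make the "robustly classify or detect" criterion concrete, the following is a minimal sketch, not the authors' implementation. It assumes a toy model that is a single linear layer, propagates an $\ell_\infty$ box through it with interval arithmetic, and treats an input as certified when, for every wrong class, either the true class or the abstain class is guaranteed to score higher over the whole box. The function names, the toy weights, and the choice of taking the maximum of the two margin lower bounds are illustrative assumptions.

```python
def linear_bounds(W, b, lo, hi):
    """Interval bounds on z = W x + b for any x with lo <= x <= hi (elementwise)."""
    out_lo, out_hi = [], []
    for row, bias in zip(W, b):
        l = u = bias
        for w, xl, xu in zip(row, lo, hi):
            if w >= 0:
                l += w * xl
                u += w * xu
            else:
                l += w * xu
                u += w * xl
        out_lo.append(l)
        out_hi.append(u)
    return out_lo, out_hi


def margin_lower_bound(W, b, lo, hi, i, j):
    """Lower bound on z_i - z_j over the box, using the difference of rows i and j."""
    diff_row = [wi - wj for wi, wj in zip(W[i], W[j])]
    lower, _ = linear_bounds([diff_row], [b[i] - b[j]], lo, hi)
    return lower[0]


def certified(W, b, lo, hi, true_class, abstain_class=None):
    """True if no point in the box can be assigned to a wrong (non-abstain) class.

    With abstain_class=None this reduces to the usual check that the true class
    beats every other class over the whole box.
    """
    for j in range(len(W)):
        if j == true_class or j == abstain_class:
            continue
        best = margin_lower_bound(W, b, lo, hi, true_class, j)
        if abstain_class is not None:
            # Sufficient condition: one of the two allowed outputs always beats class j.
            best = max(best, margin_lower_bound(W, b, lo, hi, abstain_class, j))
        if best <= 0:
            return False
    return True


# Hypothetical toy model: outputs are class 0, class 1, and an abstain class (index 2).
W = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
b = [0.0, 0.0, 0.1]
x, eps = [1.0, 0.0], 0.6
lo = [v - eps for v in x]
hi = [v + eps for v in x]

plain = certified(W, b, lo, hi, true_class=0)
with_abstain = certified(W, b, lo, hi, true_class=0, abstain_class=2)
print(plain, with_abstain)  # False True
```

In this toy case the perturbation is large enough that class 1 can overtake class 0 somewhere in the box, so the plain check fails, but the abstain score always exceeds the class 1 score, so the input is still certified as "classified correctly or detected".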
