Mitigating Backdoor Poisoning Attacks through the Lens of Spurious Correlation

Xuanli He; Qiongkai Xu; Jun Wang; Benjamin Rubinstein; Trevor Cohn

2023 EMNLP EMNLP 2023

Mitigating Backdoor Poisoning Attacks through the Lens of Spurious Correlation

Abstract

AbstractModern NLP models are often trained over large untrusted datasets, raising the potential for a malicious adversary to compromise model behaviour. For instance, backdoors can be implanted through crafting training instances with a specific textual trigger and a target label. This paper posits that backdoor poisoning attacks exhibit a spurious correlation between simple text features and classification labels, and accordingly, proposes methods for mitigating spurious correlation as means of defence. Our empirical study reveals that the malicious triggers are highly correlated to their target labels; therefore such correlations are extremely distinguishable compared to those scores of benign features, and can be used to filter out potentially problematic instances. Compared with several existing defences, our defence method significantly reduces attack success rates across backdoor attacks, and in the case of insertion-based attacks, our method provides a near-perfect defence.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Deep Learning and Machine Learning and Security & Privacy

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Xuanli He , Qiongkai Xu , Jun Wang , Benjamin Rubinstein , Trevor Cohn

Topics

Artificial Intelligence > Core AI > AI Safety Machine Learning > Application Areas > Privacy Security & Privacy > Privacy Artificial Intelligence > Core AI > Adversarial Learning Deep Learning > Learning Types > Adversarial Learning

Keywords

backdoor attack poisoning attack spurious correlation trigger detection nlp model trigger word nlp security

Download PDF

Related papers

Exploring Linguistic Probes for Morphological Generalization 2023

NameGuess: Column Name Expansion for Tabular Data 2023

Vision-Enhanced Semantic Entity Recognition in Document Images via Visually-Asymmetric Consistency Learning 2023

Improving Conversational Recommendation Systems via Bias Analysis and Language-Model-Enhanced Data Augmentation 2023

On the Calibration of Large Language Models and Alignment 2023