Towards Semantics- and Domain-Aware Adversarial Attacks

Jianping Zhang, Yung-Chieh Huang, Weibin Wu, Michael R. Lyu

Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence
Main Track. Pages 536-544. https://doi.org/10.24963/ijcai.2023/60

Language models are known to be vulnerable to textual adversarial attacks, which add human-imperceptible perturbations to the input to mislead DNNs. It is thus imperative to devise effective attack algorithms that identify the deficiencies of DNNs before real-world deployment. However, existing word-level attacks have two major shortcomings: (1) they may change the semantics of the original sentence, and (2) the generated adversarial samples can appear unnatural to humans due to the introduction of out-of-domain substitute words. In this paper, to address these drawbacks, we propose a semantics- and domain-aware word-level attack method. Specifically, we greedily replace the important words in a sentence with substitutes suggested by a language model. The language model is trained to be semantics- and domain-aware via contrastive learning and in-domain pre-training. Furthermore, to balance the quality of adversarial examples against the attack success rate, we propose an iterative updating framework that optimizes the contrastive learning loss and the in-domain pre-training loss in circular order. Comprehensive experimental comparisons confirm the superiority of our approach. Notably, compared with state-of-the-art benchmarks, our strategy achieves over 3% improvement in attack success rate and 9.8% improvement in the quality of adversarial examples.
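
The abstract outlines a greedy word-level substitution procedure: rank words by their importance to the victim model's prediction, then replace them with candidates proposed by a language model until the prediction flips. The sketch below illustrates that generic loop in Python. It is an assumption-laden illustration using off-the-shelf Hugging Face pipelines, a vanilla masked language model as the substitute suggester, and a deletion-based importance heuristic; it is not the paper's semantics- and domain-aware model, nor its contrastive learning or in-domain pre-training.

# Minimal sketch of a greedy word-level substitution attack.
# Assumptions: a sentiment classifier as the victim and bert-base-uncased
# as the substitute generator; both are illustrative choices.
from transformers import pipeline

victim = pipeline("sentiment-analysis")                     # model under attack
masker = pipeline("fill-mask", model="bert-base-uncased")   # substitute suggester
MASK = masker.tokenizer.mask_token

def confidence(text, label):
    """Victim's confidence in the original label for the given text."""
    out = victim(text)[0]
    return out["score"] if out["label"] == label else 1.0 - out["score"]

def attack(sentence, top_k=5):
    words = sentence.split()
    orig_label = victim(sentence)[0]["label"]
    base = confidence(sentence, orig_label)

    # Rank words by how much deleting each one lowers the victim's confidence.
    def importance(i):
        reduced = " ".join(words[:i] + words[i + 1:])
        return base - confidence(reduced, orig_label)

    order = sorted(range(len(words)), key=importance, reverse=True)

    # Greedily replace important words with masked-LM suggestions.
    for i in order:
        masked = " ".join(words[:i] + [MASK] + words[i + 1:])
        for cand in masker(masked, top_k=top_k):
            trial = words[:i] + [cand["token_str"]] + words[i + 1:]
            trial_text = " ".join(trial)
            if victim(trial_text)[0]["label"] != orig_label:
                return trial_text                 # prediction flipped: success
            if confidence(trial_text, orig_label) < confidence(" ".join(words), orig_label):
                words[i] = cand["token_str"]      # keep the first confidence-lowering substitute
                break
    return None                                   # attack failed within the budget

print(attack("The movie was surprisingly good and heartfelt."))

In the paper's method, the substitute generator would instead be the semantics- and domain-aware language model described above, so the suggested words both preserve sentence meaning and stay in-domain; the greedy replacement loop itself is the common word-level attack skeleton.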
Keywords:
AI Ethics, Trust, Fairness: ETF: Safety and robustness
Natural Language Processing: NLP: Interpretability and analysis of models for NLP