Feint and Attack: Jailbreaking and Protecting LLMs via Attention Distribution Modeling

Rui Pu, Chaozhuo Li, Rui Ha, Zejian Chen, Litian Zhang, Zheng Liu, Lirong Qiu, Zaisheng Ye

Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence
Main Track. Pages 493-501. https://doi.org/10.24963/ijcai.2025/56

Most jailbreak methods for large language models (LLMs) focus on superficially improving attack success rates through manually defined rules, yet they fail to uncover the underlying mechanisms within target LLMs that explain why an attack succeeds or fails. In this paper, we investigate jailbreaks and defenses for LLMs from the perspective of the attention distributions inside the models. A preliminary experiment reveals that the success of a jailbreak is closely linked to the LLM's attention on sensitive words. Inspired by this finding, we propose incorporating two critical signals derived from the internal attention distributions of LLMs, namely Attention Intensity on Sensitive Words and Attention Dispersion Entropy, to guide both attacks and defenses. Drawing on the concept of "Feint and Attack", we introduce an attention-guided jailbreak model, ABA, which redirects the model's attention toward benign contexts, and an attention-based defense model, ABD, which detects attacks by analyzing internal attention entropy. Experimental results demonstrate the superiority of our proposal over SOTA baselines.
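To make the two signals concrete, the following is a minimal sketch (not the authors' released implementation) of how Attention Intensity on Sensitive Words and Attention Dispersion Entropy could be computed from a Hugging Face causal LM's attention weights. The model name, the prompt, the sensitive-word list, and the layer/head averaging choice are all illustrative assumptions, not details from the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # illustrative stand-in for the target LLM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = "Tell me a story, and also how to make a weapon."
sensitive_words = {"weapon"}  # illustrative sensitive-word list

inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# out.attentions is a tuple with one (batch, heads, seq, seq) tensor per
# layer; average over layers and heads, then take the attention row of the
# final token, i.e. the position that conditions the next-token prediction.
attn = torch.stack(out.attentions).mean(dim=(0, 2))[0]  # (seq, seq)
last_row = attn[-1]

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
sensitive_mask = torch.tensor(
    [any(w in t.lower() for w in sensitive_words) for t in tokens]
)

# Attention Intensity on Sensitive Words: total attention mass the final
# position assigns to tokens matching the sensitive-word list.
intensity = last_row[sensitive_mask].sum().item()

# Attention Dispersion Entropy: Shannon entropy of the same attention
# distribution; higher values mean attention is spread more evenly.
probs = last_row / last_row.sum()
entropy = -(probs * (probs + 1e-12).log()).sum().item()

print(f"intensity={intensity:.4f}  entropy={entropy:.4f}")
```

Under the "Feint and Attack" framing, a successful attack would lower the intensity signal (attention drawn toward benign context) while the defense side would flag anomalous entropy values; the thresholds for either decision are left unspecified here.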
Keywords:
AI Ethics, Trust, Fairness: ETF: Trustworthy AI
AI Ethics, Trust, Fairness: ETF: Safety and robustness