Model Stealing Defense against Exploiting Information Leak through the Interpretation of Deep Neural Nets

Jeonghyun Lee, Sungmin Han, Sangkyun Lee

Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence
Main Track. Pages 710-716. https://doi.org/10.24963/ijcai.2022/100

Model stealing techniques allow adversaries to create attack models that mimic the functionality of black-box machine learning models by querying only class labels or probability outputs. Recently, interpretable AI has been attracting increasing attention as a way to enhance our understanding of AI models, provide additional information for diagnosis, or satisfy legal requirements. However, it has recently been reported that providing such additional information can make AI models more vulnerable to model stealing attacks. In this paper, we propose DeepDefense, the first defense mechanism that protects an AI model against model stealing attackers exploiting both class probabilities and interpretations. DeepDefense uses a misdirection model to hide the critical information of the original model from model stealing attacks, with minimal degradation of both the class probabilities and the interpretations in the prediction output. DeepDefense is highly applicable to any model stealing scenario since it makes minimal assumptions about the model stealing adversary. In our experiments, DeepDefense shows significantly higher defense performance than existing state-of-the-art defenses on various datasets and interpreters.
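The paper's implementation is not reproduced here, but the following minimal PyTorch sketch illustrates the general idea of a misdirection-style defense as described in the abstract: the served class label comes from the original model, while the exposed probabilities and a saliency-style interpretation are computed on a separate misdirection model, so the original model's critical information is not revealed to a querying client. The class name, method name, and gradient-based interpreter below are illustrative assumptions, not the authors' actual DeepDefense algorithm.

```python
# Illustrative sketch of a misdirection-based defense for a prediction API
# that returns class probabilities plus an interpretation (saliency map).
# Not the paper's DeepDefense implementation; names are hypothetical.
import torch
import torch.nn.functional as F


class MisdirectionDefense:
    def __init__(self, original_model, misdirection_model):
        self.original = original_model.eval()
        self.misdirection = misdirection_model.eval()

    def serve_query(self, x):
        # The predicted label is taken from the original model,
        # so the utility of the service is preserved.
        with torch.no_grad():
            orig_probs = F.softmax(self.original(x), dim=1)
        top_class = orig_probs.argmax(dim=1)

        # Probabilities and interpretation exposed to the client are computed
        # on the misdirection model, hiding the original model's internals.
        x = x.clone().requires_grad_(True)
        mis_probs = F.softmax(self.misdirection(x), dim=1)

        # Simple gradient (saliency) interpretation of the served probabilities
        # for the predicted class; stands in for an arbitrary interpreter.
        mis_probs.gather(1, top_class.unsqueeze(1)).sum().backward()
        saliency = x.grad.detach().abs()

        return top_class, mis_probs.detach(), saliency
```

In this sketch, an attacker querying the API only ever observes probabilities and gradients of the misdirection model, which is the kind of information hiding the abstract describes at a high level.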
Keywords:
AI Ethics, Trust, Fairness: Trustworthy AI
AI Ethics, Trust, Fairness: Safety & Robustness
Machine Learning: Explainable/Interpretable Machine Learning
Computer Vision: Interpretability and Transparency