Detecting Hallucination in Large Language Models Through Deep Internal Representation Analysis

Luan Zhang, Dandan Song, Zhijing Wu, Yuhang Tian, Changzhi Zhou, Jing Xu, Ziyi Yang, Shuhao Zhang

Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence
Main Track. Pages 8357-8365. https://doi.org/10.24963/ijcai.2025/929

Large language models (LLMs) have shown exceptional performance across various domains. However, LLMs are prone to hallucinating facts and generating non-factual responses, which undermines their reliability in real-world applications. Existing hallucination detection methods suffer from dependence on external resources, substantial time overhead, difficulty overcoming LLMs' intrinsic limitations, and insufficient modeling. In this paper, we propose MHAD, a novel internal-representation-based hallucination detection method. MHAD uses linear probing to select neurons and layers within LLMs; the selected neurons and layers are shown to exhibit significant awareness of hallucinations at the initial and final generation steps. By concatenating the outputs of the selected neurons in the selected layers at these two steps, a hallucination awareness vector is formed, enabling precise hallucination detection via an MLP. Additionally, we introduce SOQHD, a novel benchmark for evaluating hallucination detection in open-domain question answering (ODQA). Extensive experiments show that MHAD outperforms existing hallucination detection methods across multiple LLMs, demonstrating superior effectiveness.
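The detection pipeline described in the abstract can be illustrated with a minimal sketch. The code below is not the authors' implementation: the layer indices (SELECTED_LAYERS), neuron indices (SELECTED_NEURONS), hidden size, and the small two-layer MLP are hypothetical placeholders standing in for the probing-based selection and classifier described above; it only shows how selected activations at the initial and final generation steps could be concatenated into an awareness vector and scored.

```python
import torch
import torch.nn as nn

# Hypothetical selections; in MHAD these would come from linear probing,
# which is not reproduced here.
SELECTED_LAYERS = [14, 22]
SELECTED_NEURONS = {14: [5, 99, 301], 22: [12, 640, 2047]}


def awareness_vector(hidden_states_first, hidden_states_last):
    """Concatenate selected neuron activations from selected layers at the
    initial and final generation steps into one feature vector.

    Each argument is a tuple of per-layer tensors of shape (hidden_size,)
    for a single generated sequence (e.g., from a model run with
    output_hidden_states=True).
    """
    parts = []
    for step_states in (hidden_states_first, hidden_states_last):
        for layer in SELECTED_LAYERS:
            idx = torch.tensor(SELECTED_NEURONS[layer])
            parts.append(step_states[layer][idx])
    return torch.cat(parts)


class HallucinationMLP(nn.Module):
    """Small MLP mapping an awareness vector to a hallucination probability."""

    def __init__(self, in_dim, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, x):
        return torch.sigmoid(self.net(x))


if __name__ == "__main__":
    # Toy usage with random hidden states (33 layers, hidden size 4096).
    fake_states = tuple(torch.randn(4096) for _ in range(33))
    vec = awareness_vector(fake_states, fake_states)
    clf = HallucinationMLP(in_dim=vec.numel())
    print("P(hallucination) =", clf(vec).item())
```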
Keywords:
Natural Language Processing: NLP: Language models
Natural Language Processing: NLP: Language generation
Natural Language Processing: NLP: Question answering
Natural Language Processing: NLP: Resources and evaluation