Cyclic Vision-Language Manipulator: Towards Reliable and Fine-Grained Image Interpretation for Automated Report Generation

Cyclic Vision-Language Manipulator: Towards Reliable and Fine-Grained Image Interpretation for Automated Report Generation

Yingying Fang, Zihao Jin, Shaojie Guo, Jinda Liu, Zhiling Yue, Yijian Gao, Junzhi Ning, Zhi Li, Simon Walsh, Guang Yang

Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence
Main Track. Pages 357-366. https://doi.org/10.24963/ijcai.2025/41

Despite significant advancements in automated report generation, the opaqueness of text interpretability continues to cast doubt on the reliability of the content produced. This paper introduces a novel approach to identify specific image features in X-ray images that influence the outputs of report generation models. Specifically, we propose Cyclic Vision-Language Manipulator (CVLM), a module to generate a manipulated X-ray from an original X-ray and its report from a designated report generator. The essence of CVLM is that cycling manipulated X-rays to the report generator produces altered reports aligned with the alterations pre-injected into the reports for X-ray generation, achieving the term ``cyclic manipulation''. This process allows direct comparison between original and manipulated X-rays, clarifying the critical image features driving changes in reports and enabling model users to assess the reliability of the generated texts. Empirical evaluations demonstrate that CVLM can identify more precise and reliable features compared to existing explanation methods, significantly enhancing the transparency and applicability of AI-generated reports.
Keywords:
AI Ethics, Trust, Fairness: ETF: Explainability and interpretability
AI Ethics, Trust, Fairness: ETF: Trustworthy AI
Computer Vision: CV: Vision, language and reasoning