MsRAG: Knowledge Augumented Image Captioning with Object-level Multi-source RAG

Yuming Qiao, Yuechen Wang, Dan Meng, Haonan Lu, Zhenyu Yang, Xudong Zhang

Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence
Main Track. Pages 6093-6101. https://doi.org/10.24963/ijcai.2025/678

Language-Visual Large Models (LVLMs) have made significant strides in visual understanding. However, these models often struggle with knowledge-based visual tasks because of constraints on the scope and timeliness of their pre-training data. Existing Retrieval-Augmented Generation (RAG) methods can effectively address the problem, but they rely primarily on user queries, which limits their applicability in scenarios without explicit language input. To overcome these challenges, we introduce MsRAG, a knowledge-augmented captioning framework designed to retrieve and utilize external real-world knowledge, particularly in the absence of user queries, and to perform dense captioning for subjects. MsRAG comprises three key components: (1) Parallel Visual Search Module. It retrieves fine-grained object-level knowledge using both online visual search engines and offline domain-knowledge databases, enhancing the robustness and richness of the retrieved information. (2) Prompt Templates Pool. It dynamically assigns appropriate prompts based on the retrieved information, optimizing LVLMs' ability to leverage relevant data under complex RAG conditions. (3) Visual-RAG Alignment Module. It employs a novel visual prompting method to bridge the modality gap between textual RAG content and the corresponding visual objects, enabling precise alignment of visual elements with their text-format RAG content. To validate the effectiveness of MsRAG, we conducted a series of qualitative and quantitative experiments. The evaluation results demonstrate the superiority of MsRAG over other methods.
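
The three components described above can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: `DetectedObject`, `parallel_search`, `PROMPT_POOL`, and `build_prompt` are hypothetical names, the two-entry prompt pool is a stand-in for whatever templates the paper uses, and tagging each snippet with an object index is only one simple way to tie text to a marked region, which may differ from the paper's visual prompting method.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class DetectedObject:
    index: int  # numeric mark assumed to be drawn on the image for this object
    label: str  # coarse label from an object detector, e.g. "car"


def parallel_search(objects, sources):
    """Query every source for every object concurrently; merge snippets per object.

    `sources` is a list of callables (hypothetical stand-ins for an online
    visual search engine and an offline knowledge database), each mapping a
    DetectedObject to a list of text snippets.
    """
    with ThreadPoolExecutor() as pool:
        futures = [(o.index, pool.submit(src, o)) for o in objects for src in sources]
    # Leaving the `with` block waits for all submitted searches to finish.
    results = {o.index: [] for o in objects}
    for idx, fut in futures:
        results[idx].extend(fut.result())
    return results


# Hypothetical two-entry prompt pool: one template per retrieval outcome.
PROMPT_POOL = {
    "with_knowledge": (
        "Describe each marked object in the image. "
        "Retrieved knowledge, keyed by object mark:\n{knowledge}"
    ),
    "no_knowledge": "Describe each marked object in the image using only what is visible.",
}


def build_prompt(objects, retrieved):
    """Pick a template based on what was retrieved and tag text with object marks."""
    lines = [
        f"[{o.index}] {o.label}: " + "; ".join(retrieved[o.index])
        for o in objects
        if retrieved[o.index]  # skip objects with no retrieved knowledge
    ]
    if not lines:
        return PROMPT_POOL["no_knowledge"]
    return PROMPT_POOL["with_knowledge"].format(knowledge="\n".join(lines))
```

In this sketch the `[index]` tag in the prompt is meant to match the mark drawn on the corresponding image region, so the model can associate each retrieved snippet with one object even though no user query is present.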
Keywords:
Machine Learning: ML: Applications
Machine Learning: ML: Multi-modal learning
Natural Language Processing: NLP: Applications