A Multimodal AI Dialogue System for Unified Document, Visual, and Audio Interaction

Yujun Feng, Jingyi Huang, Yang Zhang

Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence, Demo Track. Pages 11044-11047. https://doi.org/10.24963/ijcai.2025/1259

This paper presents a multimodal intelligent dialogue system that integrates document analysis, visual media processing, and audio interaction within a unified web interface. The system combines secure user identity verification with persistent conversation management, supporting textual document analysis, dynamic context integration, and cross-media interaction via video, image, and real-time speech processing. Our approach introduces three key innovations: (1) context-aware document analysis through text extraction, (2) a multimodal input pipeline supporting images, videos, and audio, and (3) persistent chat history management for maintaining conversational continuity. The system enables seamless transitions between audio and text, supporting natural interaction by processing audio input and converting text responses into speech. The platform also provides an intuitive interface for document upload, camera capture, and audio recording, while preserving conversation context across sessions. This implementation demonstrates the practical integration of multimodal input in an interactive artificial intelligence (AI) system and its potential for enhanced user engagement.
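
The abstract does not reproduce implementation details, so the following is a minimal Python sketch of how the three stated innovations could fit together: every modality is normalized to text before entering a single persisted chat history. All names here (DialogueSession, ingest, and the placeholder backends extract_text, transcribe, describe_media) are hypothetical and not taken from the paper's system; document text extraction, speech recognition, and media captioning are stubbed out where a real deployment would call OCR, ASR, and vision models.

```python
import json
import time
from dataclasses import dataclass, field
from pathlib import Path

# Hypothetical message record; field names are illustrative, not from the paper.
@dataclass
class Message:
    role: str        # "user" or "assistant"
    modality: str    # "text", "document", "image", "video", or "audio"
    content: str     # extracted or transcribed text
    timestamp: float = field(default_factory=time.time)

class DialogueSession:
    """Persistent chat history (innovation 3): each session is serialized to a
    JSON file so conversational context survives across restarts."""

    def __init__(self, session_id: str, store_dir: str = "sessions"):
        self.path = Path(store_dir) / f"{session_id}.json"
        self.path.parent.mkdir(parents=True, exist_ok=True)
        self.history: list[Message] = []
        if self.path.exists():
            self.history = [Message(**m) for m in json.loads(self.path.read_text())]

    def append(self, msg: Message) -> None:
        self.history.append(msg)
        self.path.write_text(json.dumps([vars(m) for m in self.history], indent=2))

def ingest(session: DialogueSession, modality: str, payload: "bytes | str") -> None:
    """Multimodal input pipeline (innovation 2): route each input by modality
    and normalize it to text before it joins the shared conversation context."""
    if modality == "document":
        text = extract_text(payload)   # innovation 1: text extraction
    elif modality == "audio":
        text = transcribe(payload)     # speech-to-text (placeholder)
    elif modality in ("image", "video"):
        text = describe_media(payload) # vision-model caption (placeholder)
    else:
        text = str(payload)
    session.append(Message(role="user", modality=modality, content=text))

# Placeholder backends; a real system would call OCR/ASR/vision models here.
def extract_text(payload): return payload.decode() if isinstance(payload, bytes) else payload
def transcribe(payload): return "<transcript of audio clip>"
def describe_media(payload): return "<caption of image or video frame>"

if __name__ == "__main__":
    session = DialogueSession("demo-user")
    ingest(session, "document", b"Quarterly report: revenue up 12%.")
    ingest(session, "text", "Summarize the uploaded report.")
    print(f"{len(session.history)} messages persisted to {session.path}")
```

The design choice the sketch illustrates is that funneling every modality into one text-based, file-backed history is what lets conversation context carry across both sessions and input types; the text-to-speech direction would be the symmetric step applied to assistant responses.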
Keywords:
Natural Language Processing: NLP: Dialogue and interactive systems
Humans and AI: HAI: Human-computer interaction
Humans and AI: HAI: Intelligent user interfaces