DUQ: Dual Uncertainty Quantification for Text-Video Retrieval
Xin Liu, Shibai Yin, Jun Wang, Jiaxin Zhu, Xingyang Wang, Yee-Hong Yang
Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence
Main Track. Pages 5779-5787.
https://doi.org/10.24963/ijcai.2025/643
Text-video retrieval establishes accurate similarity relationships between text and video through feature enhancement and granularity alignment. However, relying solely on similarity to associate intra-pair features and distinguish inter-pair features is insufficient, e.g., when querying a multi-scene video with sparse text or selecting the most relevant video from many similar candidates. In this paper, we propose a novel Dual Uncertainty Quantification (DUQ) model that separately handles uncertainties in intra-pair interaction and inter-pair exclusion. Specifically, to enhance intra-pair interaction, we propose an intra-pair similarity uncertainty module that provides similarity-based trustworthy predictions and explicitly models this uncertainty. To increase inter-pair exclusion, we propose an inter-pair distance uncertainty module that constructs a distance-based diversity probability embedding, thereby widening the gap between similar features. The two modules work synergistically, jointly improving the similarity computation between features. We evaluate our model on six benchmark datasets: MSRVTT (51.2%), DiDeMo, MSVD, LSMDC, Charades, and VATEX, achieving state-of-the-art retrieval performance.
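The exact formulations of the two modules are given in the paper. As a rough illustration only, the following is a minimal PyTorch sketch of the general idea; all layer names, loss forms, and weighting choices are our own assumptions rather than the authors' DUQ implementation. The intra-pair term learns a per-pair log-variance that down-weights unreliable similarities, and the inter-pair term samples distance-based probabilistic embeddings so that near-duplicate candidates are pushed apart.

```python
# Hypothetical sketch of dual uncertainty quantification -- NOT the
# authors' DUQ code. Module names and losses are illustrative only.
import torch
import torch.nn.functional as F

torch.manual_seed(0)

B, D = 8, 512  # batch of matched text-video pairs, embedding dim

# Stand-ins for encoder outputs (e.g., CLIP-style text/video features).
text = F.normalize(torch.randn(B, D), dim=-1)
video = F.normalize(torch.randn(B, D), dim=-1)

# --- Intra-pair similarity uncertainty (assumed heteroscedastic form) ---
# Predict a per-pair log-variance from the joint feature and use it to
# down-weight unreliable pair similarities.
var_head = torch.nn.Linear(2 * D, 1)
log_var = var_head(torch.cat([text, video], dim=-1)).squeeze(-1)  # (B,)
sim = (text * video).sum(-1)  # cosine similarity per matched pair, (B,)
# Confident pairs (low predicted variance) contribute more to alignment.
intra_loss = (torch.exp(-log_var) * (1 - sim) + log_var).mean()

# --- Inter-pair distance uncertainty (assumed probabilistic embedding) ---
# Treat each video embedding as a diagonal Gaussian and draw a
# reparameterized sample; distances between samples and all text queries
# drive a contrastive exclusion term that widens inter-pair gaps.
sigma = F.softplus(torch.nn.Linear(D, D)(video))   # per-dim scale, (B, D)
sampled = video + sigma * torch.randn_like(video)  # reparameterized sample
dist = torch.cdist(F.normalize(sampled, dim=-1), F.normalize(text, dim=-1))
logits = -dist                 # smaller distance -> larger logit, (B, B)
labels = torch.arange(B)       # matched pair sits on the diagonal
inter_loss = F.cross_entropy(logits, labels)

total_loss = intra_loss + inter_loss
print(f"intra={intra_loss.item():.3f}  inter={inter_loss.item():.3f}")
```

In this sketch the two terms are simply summed; how DUQ actually couples intra-pair and inter-pair uncertainty, and which distributions it uses, is specified in the paper itself.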
Keywords:
Machine Learning: ML: Multi-modal learning
Computer Vision: CV: Image and video retrieval
Computer Vision: CV: Representation learning
