SCVBench: A Benchmark with Multi-turn Dialogues for Story-Centric Video Understanding
Sisi You, Bowen Yuan, Bing-Kun Bao
Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence
Main Track. Pages 2287-2295.
https://doi.org/10.24963/ijcai.2025/255
Video understanding seeks to enable machines to interpret visual content at three levels: action, event, and story. Existing models are limited in high-level, long-term story understanding because of (1) oversimplified treatment of temporal information and (2) the training bias introduced by action- and event-centric datasets. To address this, we introduce SCVBench, a novel benchmark for story-centric video understanding. SCVBench evaluates large vision-language models (LVLMs) through an event-ordering task decomposed into a series of sub-questions that lead to a final question, quantitatively measuring how well models explore the historical dialogue. We collected 1,253 final questions and 6,027 sub-question pairs from 925 videos, constructing continuous multi-turn dialogues. Experimental results show that while the closed-source GPT-4o outperforms other models, most open-source LVLMs struggle with story-centric video understanding. Additionally, our StoryCoT model significantly surpasses open-source LVLMs on SCVBench. SCVBench aims to advance research by comprehensively analyzing LVLMs' temporal reasoning and comprehension capabilities. Code is available at https://github.com/yuanrr/SCVBench.
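
To make the evaluation protocol concrete, the sketch below shows how a single multi-turn dialogue item might be scored: sub-questions are posed turn by turn while the dialogue history accumulates, and the final event-ordering question is then asked against that history. The field names, the model_answer_fn callable, and the exact-match check are illustrative assumptions for this sketch, not the released SCVBench schema or metric; see the repository for the actual format.

    # Minimal sketch of multi-turn evaluation on one benchmark item.
    # All names here are hypothetical placeholders, not the SCVBench API.
    from dataclasses import dataclass


    @dataclass
    class DialogueItem:
        video_id: str
        sub_questions: list[tuple[str, str]]  # (sub-question, gold answer) pairs
        final_question: str                   # final event-ordering question
        final_answer: str                     # gold answer to the final question


    def evaluate_item(model_answer_fn, item: DialogueItem) -> bool:
        """Ask each sub-question in turn, accumulating dialogue history,
        then check the model's answer to the final question."""
        history: list[tuple[str, str]] = []
        for question, _gold in item.sub_questions:
            # The model sees the video plus all prior turns; exploring this
            # history is what the benchmark is designed to measure.
            reply = model_answer_fn(item.video_id, history, question)
            history.append((question, reply))
        final = model_answer_fn(item.video_id, history, item.final_question)
        # Exact match is a simplification; the paper's metric may differ.
        return final.strip().lower() == item.final_answer.strip().lower()

In this sketch, accuracy over all 1,253 final questions would be the sum of evaluate_item results divided by the number of items; per-turn sub-question accuracy could be computed analogously inside the loop.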
Keywords:
Computer Vision: CV: Vision, language and reasoning
Computer Vision: CV: Multimodal learning
Computer Vision: CV: Video analysis and understanding
