DToMA: Training-free Dynamic Token MAnipulation for Long Video Understanding

Bowen Yuan, Sisi You, Bing-Kun Bao

Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence
Main Track. Pages 2314-2322. https://doi.org/10.24963/ijcai.2025/258

Video Large Language Models (VideoLLMs) often require thousands of visual tokens to process long videos, leading to substantial computational costs that are further exacerbated by visual token inefficiency. Existing token reduction and alternative video representation methods improve efficiency but often compromise comprehension. In this work, we analyze the reasoning processes of VideoLLMs on the multiple-choice VideoQA task, identifying three reasoning stages (shallow, intermediate, and deep) that closely mimic human cognitive processing. Our analysis reveals specific inefficiencies at each stage: in shallow layers, VideoLLMs attempt to memorize all video details without prioritizing relevant content; in intermediate layers, models fail to re-examine uncertain content dynamically; and in deep layers, they continue processing video even when sufficiently confident. To address these inefficiencies, we propose DToMA, a training-free Dynamic Token MAnipulation method inspired by human adjustment mechanisms, with three components: 1) text-guided keyframe-aware reorganization to prioritize keyframes and reduce redundancy, 2) uncertainty-based visual injection to revisit content dynamically, and 3) early-exit pruning to stop processing visual tokens once the model is sufficiently confident. Experiments on 6 long video understanding benchmarks show that DToMA enhances both efficiency and comprehension, outperforming state-of-the-art methods and generalizing well across 3 VideoLLM architectures and model sizes. Code is available at https://github.com/yuanrr/DToMA.
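To make the early-exit pruning idea concrete, the following is a minimal, hypothetical sketch rather than the authors' implementation: it assumes access to intermediate answer logits over the multiple-choice options and a boolean mask marking which sequence positions are visual tokens, and it drops those visual tokens for the remaining layers once the answer distribution is sufficiently peaked. The names `early_exit_prune`, `visual_mask`, and `threshold` are illustrative placeholders.

```python
import torch


def early_exit_prune(hidden_states, visual_mask, answer_logits, threshold=0.9):
    """Illustrative early-exit pruning for a VideoLLM decoder.

    hidden_states: (batch, seq, dim) hidden states at the current layer.
    visual_mask:   (seq,) bool tensor marking visual-token positions.
    answer_logits: (batch, num_choices) intermediate answer scores.

    If the model is already confident about the answer, visual tokens are
    removed so that deeper layers only process the remaining text tokens.
    """
    probs = torch.softmax(answer_logits, dim=-1)
    confident = probs.max(dim=-1).values > threshold  # (batch,)
    if confident.all():
        keep = ~visual_mask                     # keep only non-visual tokens
        return hidden_states[:, keep, :], keep
    return hidden_states, torch.ones_like(visual_mask)  # keep everything


if __name__ == "__main__":
    hidden = torch.randn(1, 8, 16)              # 8 tokens: 4 visual, 4 text
    vis = torch.tensor([True, True, True, True, False, False, False, False])
    logits = torch.tensor([[4.0, 0.1, 0.1, 0.1]])   # peaked on the first option
    pruned, kept = early_exit_prune(hidden, vis, logits)
    print(pruned.shape)                         # torch.Size([1, 4, 16])
```

In an actual VideoLLM, such a check would be applied per layer in the deep stage, so that confident examples skip further attention over visual tokens while uncertain ones retain them.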
Keywords:
Computer Vision: CV: Video analysis and understanding