Exploring Efficient and Effective Sequence Learning for Visual Object Tracking
Dongdong Li, Zhinan Gao, Yangliu Kuai, Rui Chen
Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence
Main Track. Pages 1368-1376.
https://doi.org/10.24963/ijcai.2025/153
Sequence-learning-based tracking frameworks are popular in the tracking community. In practice, their auto-regressive sequence generation leads to inferior performance and high latency compared with the latest advanced trackers. To mitigate this issue, we propose an efficient and effective sequence-to-sequence tracking framework named FastSeqTrack. FastSeqTrack differs from previous sequence-learning-based trackers in its token initialization and sequence generation. Four tracking tokens are appended to the patch embeddings and generated in the encoder as initial guesses for the bounding-box sequence, which improves tracking accuracy compared with randomly initialized tokens. The tracking tokens are then fed into the decoder in parallel in a single pass, which greatly boosts inference speed compared with auto-regressive generation. Inspired by the early-exit mechanism, we insert an internal classifier after each decoder layer and terminate forward inference early when the softmax confidence is sufficiently high. On easy frames, early exits avoid network overthinking and unnecessary computation. Extensive experiments on multiple benchmarks demonstrate that FastSeqTrack runs at over 100 fps while achieving performance superior to state-of-the-art trackers. Code and models are available at https://github.com/vision4drones/FastSeqTrack.
Keywords:
Computer Vision: CV: Motion and tracking
Computer Vision: CV: Video analysis and understanding
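The sketch below illustrates the one-pass decoding with early exit described in the abstract: four tracking tokens pass through all decoder layers in a single forward call, and an internal classifier after each layer can stop inference once every token is confident. All module names, dimensions, bin counts, and the confidence threshold here are illustrative assumptions, not the authors' implementation; the official code is at https://github.com/vision4drones/FastSeqTrack.

```python
# Minimal, illustrative sketch of parallel token decoding with early exit.
# Dimensions, threshold, and layer choices are assumptions for demonstration only.
import torch
import torch.nn as nn


class EarlyExitDecoder(nn.Module):
    """Decodes four tracking tokens in a single forward pass; an internal
    classifier after each layer may terminate inference early on easy frames."""

    def __init__(self, dim=256, num_layers=6, num_bins=1000, exit_threshold=0.9):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
            for _ in range(num_layers)
        )
        # One internal classifier per layer maps each token to a distribution over coordinate bins.
        self.classifiers = nn.ModuleList(nn.Linear(dim, num_bins) for _ in range(num_layers))
        self.exit_threshold = exit_threshold

    @torch.no_grad()
    def forward(self, track_tokens, memory):
        # track_tokens: (B, 4, dim) -- four box tokens produced by the encoder.
        # memory:       (B, N, dim) -- encoder patch embeddings.
        x = track_tokens
        for layer, classifier in zip(self.layers, self.classifiers):
            x = layer(x, memory)
            logits = classifier(x)                      # (B, 4, num_bins)
            conf = logits.softmax(-1).max(-1).values    # per-token softmax confidence
            if conf.min() >= self.exit_threshold:       # all four tokens are confident
                break                                   # early exit: skip the remaining layers
        return logits.argmax(-1)                        # (B, 4) predicted coordinate bins


# Example usage: an easy frame may exit after the first layer, saving computation.
decoder = EarlyExitDecoder()
tokens = torch.randn(1, 4, 256)
memory = torch.randn(1, 196, 256)
box_bins = decoder(tokens, memory)
```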
