Learning Target-aware Representation for Visual Tracking via Informative Interactions

Mingzhe Guo, Zhipeng Zhang, Heng Fan, Liping Jing, Yilin Lyu, Bing Li, Weiming Hu

Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence
Main Track. Pages 927-934. https://doi.org/10.24963/ijcai.2022/130

We introduce a novel backbone architecture that improves the target-perception ability of feature representations for tracking. We observe that de facto frameworks perform feature matching for target localization using only the backbone outputs, so there is no direct feedback from the matching module to the backbone network, especially its shallow layers. Concretely, only the matching module can directly access the target information, while the representation learning of the candidate frame is blind to the reference target. As a result, target-irrelevant interference accumulated in shallow stages may degrade the feature quality of deeper layers. In this paper, we approach the problem by conducting multiple branch-wise interactions inside the Siamese-like backbone networks (InBN). The core of InBN is a general interaction modeler (GIM) that injects the target information into different stages of the backbone network, leading to better target perception in the candidate feature representation at negligible computation cost. The proposed GIM module and InBN mechanism are general and applicable to different backbone types, including CNN and Transformer, as evidenced by improvements on multiple benchmarks. In particular, the CNN version improves the baseline by 3.2/6.9 absolute SUC points on LaSOT/TNL2K. The Transformer version obtains SUC of 65.7/52.0 on LaSOT/TNL2K, which is on par with recent SOTAs.
Keywords:
Computer Vision: Motion and Tracking
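
The branch-wise interaction idea described in the abstract can be illustrated with a minimal NumPy sketch. The abstract does not specify the internal form of GIM, so this sketch assumes a residual cross-attention from candidate tokens to target tokens as a stand-in. The toy linear stages, the token shapes, and all function names (`interaction`, `stage`, `backbone_with_interactions`) are hypothetical and are not taken from the paper.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def interaction(candidate, target):
    """Hypothetical stand-in for GIM (an assumption, not the paper's design):
    each candidate token attends to the target tokens, and the attended
    target information is added back to the candidate as a residual."""
    d = candidate.shape[-1]
    attn = softmax(candidate @ target.T / np.sqrt(d), axis=-1)  # (Nc, Nt)
    return candidate + attn @ target


def stage(x, w):
    """Toy backbone stage: a linear map followed by ReLU."""
    return np.maximum(x @ w, 0.0)


def backbone_with_interactions(target, candidate, weights):
    """Siamese-like backbone: both branches share the stage weights, and
    target information is injected into the candidate branch after every
    stage rather than only at the final matching step."""
    for w in weights:
        target = stage(target, w)
        candidate = stage(candidate, w)
        candidate = interaction(candidate, target)
    return target, candidate


rng = np.random.default_rng(0)
dim = 8
weights = [rng.standard_normal((dim, dim)) / np.sqrt(dim) for _ in range(3)]
target_tokens = rng.standard_normal((4, dim))      # reference target features
candidate_tokens = rng.standard_normal((16, dim))  # candidate frame features

t_out, c_out = backbone_with_interactions(target_tokens, candidate_tokens, weights)
print(t_out.shape, c_out.shape)
```

In this sketch the candidate features at every stage depend on the reference target, which is the property the abstract attributes to InBN; a conventional Siamese backbone would correspond to dropping the `interaction` call inside the loop.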