ViT-P3DE∗: Vision Transformer Based Multi-Camera Instance Association with Pseudo 3D Position Embeddings

Minseok Seo, Hyuk-Jae Lee, Xuan Truong Nguyen

Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence
Main Track. Pages 1340-1350. https://doi.org/10.24963/ijcai.2023/149

Multi-camera instance association, which identifies identical objects among multiple objects in multi-view images, is challenging due to several harsh constraints. To tackle this problem, most studies have employed CNNs as feature extractors, but these often fail under such harsh constraints. Inspired by the Vision Transformer (ViT), we first develop a pure ViT-based framework for robust feature extraction through self-attention and residual connections. We then propose two novel methods to achieve robust feature learning. First, we introduce learnable pseudo 3D position embeddings (P3DEs) that represent the 3D location of an object in the world coordinate system, which is independent of the harsh constraints. To generate P3DEs, we encode the camera ID and the object's 2D position in the image using embedding tables. We then build a framework that trains P3DEs to represent an object's 3D position in a weakly supervised manner. Second, we utilize joint patch generation (JPG). During patch generation, JPG treats an object and its surroundings as a single input patch, reinforcing the relationship information between the two features. Ultimately, experimental results demonstrate that both ViT-P3DE and ViT-P3DE with JPG achieve state-of-the-art performance and significantly outperform existing works, especially under extremely harsh constraints.
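As an illustration of the embedding-table construction described in the abstract, below is a minimal PyTorch sketch of how a P3DE could be formed from a camera-ID table and a discretized 2D-position table. All names and parameters here (PseudoP3DE, num_cameras, grid_h, grid_w, embed_dim) are hypothetical assumptions for exposition, not details taken from the paper.

```python
import torch
import torch.nn as nn

class PseudoP3DE(nn.Module):
    """Sketch of a learnable pseudo 3D position embedding: a camera-ID
    embedding table plus a 2D-position embedding table, summed into one
    embedding per object (hypothetical names, not the authors' code)."""

    def __init__(self, num_cameras: int, grid_h: int, grid_w: int, embed_dim: int):
        super().__init__()
        # One learnable vector per camera ID.
        self.camera_embed = nn.Embedding(num_cameras, embed_dim)
        # One learnable vector per discretized 2D image position.
        self.position_embed = nn.Embedding(grid_h * grid_w, embed_dim)
        self.grid_w = grid_w

    def forward(self, camera_id: torch.Tensor, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        # camera_id, x, y: integer tensors of shape (batch,), with (x, y)
        # the object's discretized 2D position in the image.
        pos_index = y * self.grid_w + x
        # The summed embedding is what the framework would then train, in a
        # weakly supervised manner, to behave like a 3D world position.
        return self.camera_embed(camera_id) + self.position_embed(pos_index)

# Usage example with illustrative sizes:
p3de = PseudoP3DE(num_cameras=4, grid_h=16, grid_w=16, embed_dim=256)
emb = p3de(torch.tensor([0]), torch.tensor([5]), torch.tensor([9]))  # shape (1, 256)
```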
Keywords:
Computer Vision: CV: Applications
Computer Vision: CV: Recognition (object detection, categorization)