Local Representation is Not Enough: Soft Point-Wise Transformer for Descriptor and Detector of Local Features

Zihao Wang, Xueyi Li, Zhen Li

Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence
Main Track. Pages 1150-1156. https://doi.org/10.24963/ijcai.2021/159

Significant progress has been made on descriptors and detectors of local features, but several challenging limitations have not been well addressed, such as insufficient localization accuracy and non-discriminative description, especially in regions with repetitive or blank textures. Coarse feature representation and a limited receptive field are considered the main causes of these limitations. To address these issues, we propose a novel Soft Point-Wise Transformer for Descriptor and Detector, which simultaneously mines long-range intrinsic and cross-scale dependencies of local features. Furthermore, our model leverages distinct transformers based on soft point-wise attention, substantially decreasing memory and computation complexity, especially for high-resolution feature maps. In addition, a multi-level decoder is constructed to guarantee high detection accuracy and discriminative description. Extensive experiments demonstrate that our model outperforms existing state-of-the-art methods on image matching and visual localization benchmarks.
Keywords:
Computer Vision: 2D and 3D Computer Vision
Computer Vision: Recognition: Detection, Categorization, Indexing, Matching, Retrieval, Semantic Interpretation