Zero-Shot Object Detection via Learning an Embedding from Semantic Space to Visual Space

Licheng Zhang, Xianzhi Wang, Lina Yao, Lin Wu, Feng Zheng

Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence
Main track. Pages 906-912. https://doi.org/10.24963/ijcai.2020/126

Zero-shot object detection (ZSD) has received considerable attention from the computer vision community in recent years. It aims to simultaneously locate and categorize previously unseen objects at inference time. A crucial problem in ZSD is how to accurately predict the label of each object proposal, i.e., how to categorize object proposals belonging to unseen categories. Previous ZSD models generally relied on learning an embedding from visual space to semantic space, or on learning a joint embedding between semantic descriptions and visual representations. However, features in the learned semantic space or the joint embedding space tend to suffer from the hubness problem: feature vectors are likely to be embedded into an area of incorrect labels, which lowers detection precision. In this paper, we instead propose to learn a deep embedding from the semantic space to the visual space, which alleviates the hubness problem because the distribution in visual space has smaller variance than that in the semantic or joint embedding space. After learning the deep embedding model, we perform $k$ nearest neighbor search in the visual space of unseen categories to determine the category of each semantic description. Extensive experiments on two public datasets show that our approach significantly outperforms existing methods.
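To make the inference step concrete, the following is a minimal sketch (not the authors' released code) of how a learned semantic-to-visual embedding can be used to categorize proposals: semantic vectors of unseen classes are projected into visual feature space by the embedding network, and each region proposal's visual feature is labeled by nearest-neighbor search among the projected class prototypes. The network shape, feature dimensions, and toy data below are all illustrative assumptions.

# Minimal sketch of semantic-to-visual embedding inference (assumed shapes and toy data).
import torch
import torch.nn as nn

SEM_DIM, VIS_DIM = 300, 2048  # e.g. word-vector and CNN feature dims (assumed)

# f: semantic space -> visual space. In the paper this is trained on seen
# classes; here the weights are random stand-ins for illustration only.
f = nn.Sequential(nn.Linear(SEM_DIM, 1024), nn.ReLU(), nn.Linear(1024, VIS_DIM))

unseen_semantics = torch.randn(5, SEM_DIM)    # semantic vectors of 5 unseen classes (toy)
proposal_features = torch.randn(10, VIS_DIM)  # visual features of 10 proposals (toy)

with torch.no_grad():
    prototypes = f(unseen_semantics)          # unseen-class prototypes in visual space

# Nearest-neighbor search in visual space: each proposal takes the label of
# its closest projected class prototype (k = 1 here for simplicity).
distances = torch.cdist(proposal_features, prototypes)  # (10, 5) pairwise distances
labels = distances.argmin(dim=1)
print(labels)

Because the search happens in visual space, where (as argued above) the feature distribution has smaller variance, the nearest prototype is less likely to be a hub that attracts proposals from many incorrect classes.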
Keywords:
Computer Vision: Language and Vision
Computer Vision: Recognition: Detection, Categorization, Indexing, Matching, Retrieval, Semantic Interpretation
Machine Learning: Deep Learning