Weakly Supervised Few-shot Object Segmentation using Co-Attention with Visual and Semantic Embeddings

Mennatullah Siam, Naren Doraiswamy, Boris N. Oreshkin, Hengshuai Yao, Martin Jagersand

Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence
Main track. Pages 860-867. https://doi.org/10.24963/ijcai.2020/120

Significant progress has recently been made in few-shot object segmentation. Learning has been shown to succeed in few-shot segmentation settings using pixel-level, scribble, and bounding box supervision. This paper takes another approach, requiring only image-level labels for few-shot object segmentation. We propose a novel multi-modal interaction module for few-shot object segmentation that utilizes a co-attention mechanism over both visual and word embeddings. Using image-level labels, our model achieves a 4.8% improvement over the previously proposed image-level few-shot object segmentation method, and it also outperforms state-of-the-art methods that use weak bounding box supervision on PASCAL-5^i. Our results show that few-shot segmentation benefits from utilizing word embeddings, and that stacked joint visual-semantic processing enables few-shot segmentation with weak image-level labels. We further propose a novel setup, Temporal Object Segmentation for Few-shot Learning (TOSFL), for videos. TOSFL can be applied to a variety of public video datasets such as YouTube-VOS, as demonstrated in both instance-level and category-level TOSFL experiments.
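To make the co-attention idea concrete, the following is a minimal PyTorch sketch of one visual-semantic co-attention step: a class word embedding is projected into the visual feature space, fused with the support features, and an affinity matrix then lets the query and support branches attend over each other's spatial locations. This is an illustrative sketch only; the class name CoAttention, the feature dimension 256, the 300-d word2vec-style embedding, and the fusion by addition are assumptions, not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CoAttention(nn.Module):
    # Illustrative visual-semantic co-attention module (hypothetical,
    # not the paper's code). The word embedding is projected into the
    # visual feature space and added to the support features; a learned
    # bilinear affinity then relates every query location to every
    # support location.
    def __init__(self, feat_dim=256, embed_dim=300):
        super().__init__()
        self.embed_proj = nn.Linear(embed_dim, feat_dim)  # assumes 300-d word2vec-style input
        self.affinity = nn.Linear(feat_dim, feat_dim, bias=False)

    def forward(self, query_feat, support_feat, word_embed):
        # query_feat, support_feat: (B, C, H, W); word_embed: (B, E)
        B, C, H, W = query_feat.shape
        sem = self.embed_proj(word_embed).view(B, C, 1, 1)
        support_feat = support_feat + sem              # fuse semantics into the support branch
        q = query_feat.view(B, C, H * W)               # (B, C, N)
        s = support_feat.view(B, C, H * W)             # (B, C, N)
        # Affinity between all query/support spatial location pairs: (B, N_q, N_s).
        aff = torch.bmm(self.affinity(q.transpose(1, 2)), s)
        # Each branch attends over the other's locations.
        att_for_query = torch.bmm(s, F.softmax(aff, dim=2).transpose(1, 2)).view(B, C, H, W)
        att_for_support = torch.bmm(q, F.softmax(aff, dim=1)).view(B, C, H, W)
        return att_for_query, att_for_support

As a usage example, with 256-channel feature maps and a 300-d embedding the module returns two attended maps of the same spatial shape, which could be concatenated with the original features and fed to a decoder; the abstract's "stacked" processing would correspond to repeating such a module several times.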
Keywords:
Computer Vision: Language and Vision
Computer Vision: Recognition: Detection, Categorization, Indexing, Matching, Retrieval, Semantic Interpretation
Machine Learning: Deep Learning