Hierarchical Prompt Learning for Compositional Zero-Shot Recognition

Henan Wang; Muli Yang; Kun Wei; Cheng Deng

doi:10.24963/ijcai.2023/163

Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence

Main Track. Pages 1470-1478. https://doi.org/10.24963/ijcai.2023/163

Compositional Zero-Shot Learning (CZSL) aims to imitate the powerful generalization ability of human beings to recognize novel compositions of known primitive concepts that correspond to a state and an object, e.g., purple apple. To fully capture the intra- and inter-class correlations between compositional concepts, we propose to learn them in a hierarchical manner. Specifically, we set up three hierarchical embedding spaces that respectively model the states, the objects, and their compositions, which serve as three “experts” that can be combined in inference for more accurate predictions. We achieve this based on the recent success of large-scale pretrained vision-language models, e.g., CLIP, which provide strong initial knowledge of image-text relationships. To better adapt this knowledge to CZSL, we propose to learn three hierarchical prompts by explicitly fixing the unrelated word tokens in the three embedding spaces. Despite its simplicity, our proposed method consistently yields superior performance over current state-of-the-art approaches on three widely used CZSL benchmarks.
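
To make the idea of combining three "experts" at inference concrete, the following is a minimal sketch, not the paper's actual formulation. The abstract does not specify how the state, object, and composition predictions are fused, so the additive fusion rule, the `weight` parameter, and the function names `softmax`, `combine_experts`, and `predict` are all assumptions made for illustration. The input scores stand in for image-text similarities that a model such as CLIP would produce in each embedding space.

```python
import math


def softmax(scores):
    """Turn raw similarity scores into probabilities (numerically stable)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def combine_experts(state_scores, object_scores, pair_scores, pairs, weight=1.0):
    """Fuse three experts into one score per candidate composition.

    pairs       -- list of (state_index, object_index) candidate compositions
    pair_scores -- composition-expert scores, aligned with `pairs`
    weight      -- hypothetical fusion weight (an assumption, not from the paper)
    """
    p_state = softmax(state_scores)
    p_object = softmax(object_scores)
    p_pair = softmax(pair_scores)
    combined = []
    for k, (s, o) in enumerate(pairs):
        # Assumed rule: composition probability plus the weighted product
        # of the state and object probabilities for that composition.
        combined.append(p_pair[k] + weight * p_state[s] * p_object[o])
    return combined


def predict(state_scores, object_scores, pair_scores, pairs, weight=1.0):
    """Return the (state_index, object_index) pair with the highest fused score."""
    combined = combine_experts(state_scores, object_scores, pair_scores, pairs, weight)
    best = max(range(len(pairs)), key=lambda k: combined[k])
    return pairs[best]


# Toy example: states = [purple, red], objects = [apple, car].
pairs = [(0, 0), (1, 0), (1, 1)]  # purple apple, red apple, red car
best_pair = predict(
    state_scores=[2.0, 0.0],      # state expert favors "purple"
    object_scores=[3.0, 0.0],     # object expert favors "apple"
    pair_scores=[0.0, 0.0, 0.0],  # composition expert is undecided
    pairs=pairs,
)
```

In this toy case the composition expert is uninformative, so the state and object experts break the tie in favor of the pair (purple, apple). Other fusion rules, such as summing log-probabilities, would fit the same description equally well.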

Keywords:

Computer Vision: CV: Recognition (object detection, categorization)

Computer Vision: CV: Transfer, low-shot, semi- and un- supervised learning