Rethinking Label-Wise Cross-Modal Retrieval from A Semantic Sharing Perspective

Yang Yang, Chubing Zhang, Yi-Chu Xu, Dianhai Yu, De-Chuan Zhan, Jian Yang

Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence
Main Track. Pages 3300-3306. https://doi.org/10.24963/ijcai.2021/454

The main challenge of cross-modal retrieval is to learn consistent embeddings for heterogeneous modalities. To address this problem, traditional label-wise cross-modal approaches usually constrain inter-modal and intra-modal embedding consistency by relying on label ground-truths. However, experiments reveal that different modal networks have varying generalization capacities, so end-to-end joint training with a consistency loss often yields sub-optimal uni-modal models, which in turn hampers the learning of consistent embeddings. Therefore, in this paper, we argue that what is really needed for supervised cross-modal retrieval is a good shared classification model. In other words, we learn consistent embeddings by ensuring the classification performance of each modality on the shared model, without any consistency loss. Specifically, we propose a technique called Semantic Sharing, which trains the two modalities interactively through a shared self-attention based classification model. We evaluate the proposed approach on three representative datasets. The results validate that Semantic Sharing consistently boosts performance under the NDCG metric.
Keywords:
Machine Learning: Multi-instance; Multi-label; Multi-view learning
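
To make the idea concrete, below is a minimal PyTorch sketch of the training scheme described in the abstract: two modality-specific encoders feed one shared self-attention classification head, and each modality is supervised only by its classification loss on that shared head, with no inter-modal or intra-modal consistency term. All module names, feature dimensions, and the label count here are illustrative assumptions, not the authors' actual architecture.

import torch
import torch.nn as nn

class SharedSelfAttentionClassifier(nn.Module):
    """Shared head: both modalities pass through the same self-attention
    block and the same label classifier (hypothetical architecture)."""
    def __init__(self, dim, num_labels, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.classifier = nn.Linear(dim, num_labels)

    def forward(self, tokens):
        # tokens: (batch, seq_len, dim) modality-specific embeddings
        attended, _ = self.attn(tokens, tokens, tokens)
        pooled = self.norm(attended).mean(dim=1)  # mean-pool attended tokens
        return self.classifier(pooled)            # multi-label logits

# Modality-specific encoders (stand-ins for, e.g., a CNN image backbone
# and a text encoder; input/output sizes are assumptions).
image_encoder = nn.Sequential(nn.Linear(2048, 512), nn.ReLU())
text_encoder = nn.Sequential(nn.Linear(300, 512), nn.ReLU())
shared_head = SharedSelfAttentionClassifier(dim=512, num_labels=24)

params = (list(image_encoder.parameters())
          + list(text_encoder.parameters())
          + list(shared_head.parameters()))
optimizer = torch.optim.Adam(params, lr=1e-4)
bce = nn.BCEWithLogitsLoss()  # multi-label classification loss

def train_step(img_feats, txt_feats, labels):
    """One interactive step: each modality is trained only to classify
    well through the *shared* head -- no consistency loss between the
    two embeddings, unlike traditional label-wise methods."""
    img_tokens = image_encoder(img_feats).unsqueeze(1)  # (B, 1, 512)
    txt_tokens = text_encoder(txt_feats).unsqueeze(1)
    loss = bce(shared_head(img_tokens), labels) \
         + bce(shared_head(txt_tokens), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy batch: 8 samples with pre-extracted 2048-d image and 300-d text
# features and 24 candidate labels (all sizes chosen for illustration).
imgs, txts = torch.randn(8, 2048), torch.randn(8, 300)
labels = torch.randint(0, 2, (8, 24)).float()
print(train_step(imgs, txts, labels))

At retrieval time, under this sketch, the shared head's pooled features (or its label logits) would serve as the common embedding space for matching across modalities; the premise is that good classification of both modalities through one shared model induces the consistency that a dedicated consistency loss would otherwise enforce.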