Cross-modal Representation Learning and Relation Reasoning for Bidirectional Adaptive Manipulation

Lei Li, Kai Fan, Chun Yuan

Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence
Main Track. Pages 3222-3228. https://doi.org/10.24963/ijcai.2022/447

Since single-modal controllable manipulation typically requires supervisory information from other modalities or the cooperation of complex software and expert users, this paper addresses the problem of cross-modal adaptive manipulation (CAM). This novel task performs cross-modal semantic alignment under mutual supervision and carries out a bidirectional exchange of attributes, relations, or objects in parallel, benefiting both modalities while significantly reducing manual effort. We introduce a robust solution for CAM comprising two essential modules: Heterogeneous Representation Learning (HRL) and Cross-modal Relation Reasoning (CRR). The former performs representation learning for cross-modal semantic alignment over heterogeneous graph nodes; the latter identifies and exchanges the focused attributes, relations, or objects in both modalities. Our method produces pleasing cross-modal outputs on the CUB and Visual Genome datasets.
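
To make the HRL idea concrete, below is a minimal PyTorch sketch of cross-modal semantic alignment between heterogeneous graph-node embeddings. The abstract does not specify the architecture, so everything here is an illustrative assumption: the projection heads, the feature dimensions, and the symmetric InfoNCE-style contrastive objective are hypothetical stand-ins, not the authors' actual HRL module.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalAligner(nn.Module):
    """Projects visual- and textual-graph node features into a shared
    embedding space and scores their agreement (hypothetical design)."""
    def __init__(self, vis_dim=2048, txt_dim=768, shared_dim=256):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, shared_dim)   # visual graph nodes
        self.txt_proj = nn.Linear(txt_dim, shared_dim)   # textual graph nodes
        self.temperature = 0.07                          # assumed softmax scale

    def forward(self, vis_nodes, txt_nodes):
        # L2-normalize so the dot product is a cosine similarity.
        v = F.normalize(self.vis_proj(vis_nodes), dim=-1)
        t = F.normalize(self.txt_proj(txt_nodes), dim=-1)
        return v @ t.T / self.temperature                # (N_vis, N_txt) logits

def alignment_loss(logits):
    """Symmetric InfoNCE: the i-th visual node should match the i-th
    textual node (paired mutual supervision assumed)."""
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

# Toy usage: 8 paired nodes from each modality.
vis = torch.randn(8, 2048)
txt = torch.randn(8, 768)
model = CrossModalAligner()
loss = alignment_loss(model(vis, txt))
loss.backward()

Under this assumed objective, aligned node pairs are pulled together in the shared space while mismatched pairs are pushed apart, which is one common way to realize mutual supervision between modalities.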
Keywords:
Machine Learning: Multi-modal learning
Computer Vision: Vision and language
Machine Learning: Relational Learning
Machine Learning: Representation learning
Machine Learning: Sequence and Graph Learning