Cross-modal Representation Learning and Relation Reasoning for Bidirectional Adaptive Manipulation

Lei Li, Kai Fan, Chun Yuan

Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence
Main Track. Pages 3222-3228. https://doi.org/10.24963/ijcai.2022/447

Since single-modal controllable manipulation typically requires supervisory information from other modalities or the cooperation of complex software and expert users, this paper addresses the problem of cross-modal adaptive manipulation (CAM). This novel task performs cross-modal semantic alignment under mutual supervision and carries out a bidirectional exchange of attributes, relations, or objects in parallel, benefiting both modalities while significantly reducing manual effort. We introduce a robust solution for CAM comprising two essential modules: Heterogeneous Representation Learning (HRL) and Cross-modal Relation Reasoning (CRR). The former performs representation learning for cross-modal semantic alignment over heterogeneous graph nodes; the latter identifies and exchanges the focused attributes, relations, or objects in both modalities. Our method produces pleasing cross-modal outputs on the CUB and Visual Genome datasets.
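
To make the HRL idea concrete, below is a minimal PyTorch sketch of cross-modal semantic alignment between heterogeneous graph-node embeddings. The abstract does not specify the architecture, so everything here is an illustrative assumption: the projection heads, the feature dimensions, and the symmetric InfoNCE-style contrastive objective are hypothetical stand-ins, not the authors' actual HRL module.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalAligner(nn.Module):
    """Projects visual- and textual-graph node features into a shared
    embedding space and scores their agreement (hypothetical design)."""
    def __init__(self, vis_dim=2048, txt_dim=768, shared_dim=256):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, shared_dim)   # visual graph nodes
        self.txt_proj = nn.Linear(txt_dim, shared_dim)   # textual graph nodes
        self.temperature = 0.07                          # assumed softmax scale

    def forward(self, vis_nodes, txt_nodes):
        # L2-normalize so the dot product is a cosine similarity.
        v = F.normalize(self.vis_proj(vis_nodes), dim=-1)
        t = F.normalize(self.txt_proj(txt_nodes), dim=-1)
        return v @ t.T / self.temperature                # (N_vis, N_txt) logits

def alignment_loss(logits):
    """Symmetric InfoNCE: the i-th visual node should match the i-th
    textual node (paired mutual supervision assumed)."""
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

# Toy usage: 8 paired nodes from each modality.
vis = torch.randn(8, 2048)
txt = torch.randn(8, 768)
model = CrossModalAligner()
loss = alignment_loss(model(vis, txt))
loss.backward()

Under this assumed objective, aligned node pairs are pulled together in the shared space while mismatched pairs are pushed apart, which is one common way to realize mutual supervision between modalities.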
Keywords:
Machine Learning: Multi-modal learning
Computer Vision: Vision and language
Machine Learning: Relational Learning
Machine Learning: Representation learning
Machine Learning: Sequence and Graph Learning