MMT: Multi-way Multi-modal Transformer for Multimodal Learning

Jiajia Tang, Kang Li, Ming Hou, Xuanyu Jin, Wanzeng Kong, Yu Ding, Qibin Zhao

Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence
Main Track. Pages 3458-3465. https://doi.org/10.24963/ijcai.2022/480

At the heart of multimodal learning research lies the challenge of effectively exploiting fusion representations among multiple modalities. However, existing two-way cross-modality unidirectional attention can only exploit intermodal interactions from a single source modality to a single target modality. This fails to unleash the full expressive power of multimodal fusion, as it restricts the number of modalities and fixes the direction of interaction. In this work, the multiway multimodal transformer (MMT) is proposed to simultaneously explore multiway multimodal intercorrelations for each modality via a single block rather than multiple stacked cross-modality blocks. The core idea of MMT is multiway multimodal attention, in which multiple modalities are leveraged to compute a multiway attention tensor. This naturally allows us to exploit comprehensive many-to-many multimodal interaction paths. Specifically, the multiway tensor is composed of multiple interconnected modality-aware core tensors that capture the intramodal interactions. Additionally, the tensor contraction operation is utilized to investigate intermodal dependencies between distinct core tensors. Essentially, our tensor-based multiway structure allows MMT to be easily extended to an arbitrary number of modalities. Taking MMT as the basic building block, a hierarchical network is further established to recursively transmit low-level multiway multimodal interactions to high-level ones. The experiments demonstrate that MMT can achieve state-of-the-art or comparable performance.
Keywords:
Machine Learning: Multi-modal learning
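
The following is a minimal PyTorch sketch of the many-to-many attention idea described in the abstract. It assumes a simplified reading in which each modality keeps its own query/key/value projections (standing in for the modality-aware cores) and every modality's queries are contracted against every other modality's keys within a single block. The class name, the averaging fusion, and all parameter choices are hypothetical illustrations, not the authors' implementation.

```python
# Hypothetical sketch of many-to-many multimodal attention in one block.
# Each modality has its own Q/K/V projections; queries of modality m are
# contracted against keys of every modality, so interactions flow in all
# directions without stacking pairwise cross-modality blocks.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiwayMultimodalAttention(nn.Module):
    def __init__(self, dim: int, num_modalities: int):
        super().__init__()
        self.dim = dim
        # Per-modality projections (simplified stand-in for modality-aware cores).
        self.q = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_modalities)])
        self.k = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_modalities)])
        self.v = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_modalities)])

    def forward(self, feats):
        # feats: list of [batch, seq_len_m, dim] tensors, one per modality.
        queries = [q(x) for q, x in zip(self.q, feats)]
        keys = [k(x) for k, x in zip(self.k, feats)]
        values = [v(x) for v, x in zip(self.v, feats)]

        outputs = []
        for q_m in queries:
            ctx = []
            for k_n, v_n in zip(keys, values):
                # Tensor contraction of modality m's queries with modality n's keys.
                attn = torch.einsum("bqd,bkd->bqk", q_m, k_n) / self.dim ** 0.5
                attn = F.softmax(attn, dim=-1)
                ctx.append(torch.einsum("bqk,bkd->bqd", attn, v_n))
            # Average the per-modality contexts (one simple fusion choice).
            outputs.append(torch.stack(ctx, dim=0).mean(dim=0))
        return outputs


if __name__ == "__main__":
    # Toy usage: three modalities with different sequence lengths.
    block = MultiwayMultimodalAttention(dim=32, num_modalities=3)
    x = [torch.randn(2, length, 32) for length in (10, 20, 5)]
    y = block(x)
    print([t.shape for t in y])  # three [2, seq_len_m, 32] tensors
```

Because the block takes a list of modality features of arbitrary length, extending it to more modalities only requires enlarging the projection lists, which mirrors the abstract's claim that the multiway structure scales to an arbitrary number of modalities; stacking several such blocks would give one possible reading of the hierarchical network.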