Interactive Multimodal Learning via Flat Gradient Modification

Qing-Yuan Jiang, Zhouyang Chi, Yang Yang

Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence
Main Track. Pages 5489-5497. https://doi.org/10.24963/ijcai.2025/611

Due to the notorious modality imbalance phenomenon, multimodal learning (MML) struggles to achieve satisfactory performance. Recently, multimodal learning with alternating unimodal adaptation (MLA) has proven effective in mitigating interference between modalities by capturing cross-modal interaction through orthogonal projection, thereby relieving the modality imbalance phenomenon to some extent. However, a projection strategy that is orthogonal to the original space can lead to poor plasticity as alternating learning proceeds, which degrades model performance. To address this issue, we propose a novel multimodal learning method called interactive MML via flat gradient modification (IGM), which employs a flat gradient modification strategy to enhance interactive MML. Specifically, we first employ a flat projection-based gradient modification strategy that is independent of the original space, aiming to avoid the poor-plasticity issue. We then introduce a sharpness-aware minimization (SAM)-based optimization strategy to fully exploit the flatness of the learning objective and further enhance interaction during learning. In this way, the plasticity problem is avoided and overall performance is improved. Extensive experiments on widely used datasets demonstrate that IGM outperforms various state-of-the-art (SOTA) baselines. The source code is available at https://anonymous.4open.science/r/method-CC45.
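The abstract names two ingredients: modifying one modality's gradient so it does not interfere with the other modality, and a sharpness-aware minimization (SAM)-style two-step update that exploits flat regions of the loss. The sketch below is a minimal illustration of how these two pieces can be combined in a single update step; it is not the authors' IGM implementation, and the projection rule, the `model`/`loss_fn` interface, and the hyperparameter values are assumptions made for illustration only.

```python
import torch


def sam_step_with_gradient_modification(model, loss_fn, batch, ref_grads, rho=0.05, lr=1e-3):
    """One SAM-style update whose gradient is first projected to remove the
    component along reference gradients `ref_grads` (e.g., stored from the
    other modality during alternating unimodal adaptation)."""
    params = [p for p in model.parameters() if p.requires_grad]

    # First forward/backward: gradient at the current weights.
    loss = loss_fn(model, batch)
    grads = torch.autograd.grad(loss, params)

    # Gradient modification: subtract the projection onto each reference direction.
    modified = []
    for g, r in zip(grads, ref_grads):
        if r is not None and r.norm() > 0:
            g = g - (torch.sum(g * r) / (r.norm() ** 2 + 1e-12)) * r
        modified.append(g)

    # SAM ascent step: perturb the weights along the modified gradient direction.
    grad_norm = torch.sqrt(sum(g.pow(2).sum() for g in modified) + 1e-12)
    eps = [rho * g / grad_norm for g in modified]
    with torch.no_grad():
        for p, e in zip(params, eps):
            p.add_(e)

    # Second forward/backward at the perturbed point, then descend from the
    # original weights using the sharpness-aware gradient.
    loss_perturbed = loss_fn(model, batch)
    grads_perturbed = torch.autograd.grad(loss_perturbed, params)
    with torch.no_grad():
        for p, e, g in zip(params, eps, grads_perturbed):
            p.sub_(e)             # restore original weights
            p.add_(g, alpha=-lr)  # plain SGD descent with the perturbed-point gradient
    return loss.detach()
```

In an alternating scheme, `ref_grads` would be refreshed after each modality's turn; whether the projection is also applied to the gradient at the perturbed point is a design choice left out of this sketch for brevity.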
Keywords:
Machine Learning: ML: Multi-modal learning
Computer Vision: CV: Machine learning for vision
Machine Learning: ML: Applications
Machine Learning: ML: Representation learning