Few-Shot Incremental Multi-modal Learning via Touch Guidance and Imaginary Vision Synthesis
Lina Wei, Yuhang Ma, Zhongsheng Lin, Fangfang Wang, Canghong Jin, Hanbin Zhao, Dapeng Chen
Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence
Main Track. Pages 2045-2053.
https://doi.org/10.24963/ijcai.2025/228
Multimodal perception, which integrates vision and touch, is increasingly demonstrating its significance in domains such as embodied intelligence and human-computer interaction. However, in open-world scenarios, multimodal data streams face significant challenges during few-shot class-incremental learning (FSCIL), including catastrophic forgetting and overfitting, which severely degrade model performance. In this work, we propose a novel approach named Few-Shot Incremental Multi-modal Learning via Touch Guidance and Imaginary Vision Synthesis (TIFS). Our method leverages imaginary vision synthesis to enhance semantic understanding and integrates touch-vision fusion to alleviate modal imbalance. Specifically, we introduce a framework that employs touch-guided vision information for cross-modal contrastive learning to address the challenges of few-shot learning. Additionally, we incorporate multiple learning mechanisms, including regularization, memory, and attention, to mitigate catastrophic forgetting across multiple incremental steps. Experimental results on the Touch and Go and VisGel datasets demonstrate that the TIFS framework exhibits robust continual learning capability and strong generalization in touch-vision few-shot incremental learning tasks. Our code is available at https://github.com/Vision-Multimodal-Lab-HZCU/TIFS.
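To make the touch-guided cross-modal contrastive learning mentioned in the abstract concrete, the sketch below shows a symmetric InfoNCE-style loss between paired touch and vision embeddings. This is a minimal illustration under assumptions, not the authors' implementation: the function name, the assumed paired encoder outputs, and the temperature value are all hypothetical.

```python
# Minimal sketch (assumption, not the TIFS implementation): a symmetric
# InfoNCE-style cross-modal contrastive loss between paired touch and
# vision embeddings produced by hypothetical modality encoders.
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(touch_emb, vision_emb, temperature=0.07):
    """touch_emb, vision_emb: (batch, dim) embeddings of paired touch/vision samples."""
    touch = F.normalize(touch_emb, dim=-1)
    vision = F.normalize(vision_emb, dim=-1)
    # Pairwise similarity matrix; diagonal entries are the matching pairs.
    logits = touch @ vision.t() / temperature
    targets = torch.arange(touch.size(0), device=touch.device)
    # Symmetric objective: touch-to-vision and vision-to-touch retrieval.
    loss_t2v = F.cross_entropy(logits, targets)
    loss_v2t = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_t2v + loss_v2t)
```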
Keywords:
Computer Vision: CV: Video analysis and understanding
Computer Vision: CV: Multimodal learning
Computer Vision: CV: Scene analysis and understanding
