MSVIT: Improving Spiking Vision Transformer Using Multi-scale Attention Fusion

Wei Hua, Chenlin Zhou, Jibin Wu, Yansong Chua, Yangyang Shu

Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence
Main Track. Pages 5399-5407. https://doi.org/10.24963/ijcai.2025/601

The combination of Spiking Neural Networks (SNNs) with Vision Transformer architectures has attracted significant attention due to their potential for energy-efficient, high-performance computing. However, a substantial performance gap remains between SNN-based and ANN-based transformer architectures. While existing methods have successfully integrated spiking self-attention mechanisms into SNNs, the resulting architectures remain limited in their ability to extract features at different image scales. In this paper, we address this issue and propose MSVIT, a novel spike-driven Transformer architecture that is the first to use multi-scale spiking attention (MSSA) to enrich the capability of spiking attention blocks. We validate our approach on multiple mainstream datasets. The experimental results show that MSVIT outperforms existing SNN-based models, establishing a state-of-the-art result among SNN-transformer architectures. The code is available at https://github.com/Nanhu-AI-Lab/MSViT.
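
The abstract only names the MSSA block; as a rough illustration, the following PyTorch sketch shows one way a multi-scale spiking attention block could be structured: depthwise convolutions at several kernel sizes extract multi-scale features, which are fused, binarized into spikes, and fed into a softmax-free Q/K/V attention. The class name, kernel sizes, surrogate-gradient spike function, and fusion-by-summation are illustrative assumptions, not the authors' implementation; refer to the linked repository for the actual MSSA design.

# A minimal, hypothetical sketch of a multi-scale spiking attention block.
# The class name, branch kernel sizes, surrogate-gradient spike function, and
# fusion-by-summation are illustrative assumptions, not the authors' MSSA design.
import torch
import torch.nn as nn


class SpikeFn(torch.autograd.Function):
    """Heaviside spike in the forward pass, rectangular surrogate gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return (x > 0).float()

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        return grad_out * (x.abs() < 0.5).float()


class MultiScaleSpikingAttention(nn.Module):
    """Toy MSSA block: depthwise convolutions at several kernel sizes extract
    multi-scale features, which are binarized into spikes and fed into a
    softmax-free Q/K/V attention."""
    def __init__(self, dim, num_heads=8, kernel_sizes=(1, 3, 5)):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        # One depthwise convolution branch per spatial scale.
        self.branches = nn.ModuleList(
            nn.Conv2d(dim, dim, k, padding=k // 2, groups=dim) for k in kernel_sizes
        )
        self.q = nn.Conv2d(dim, dim, 1)
        self.k = nn.Conv2d(dim, dim, 1)
        self.v = nn.Conv2d(dim, dim, 1)
        self.proj = nn.Conv2d(dim, dim, 1)

    def forward(self, x):  # x: (B, C, H, W)
        b, c, h, w = x.shape
        # Fuse the multi-scale branches by summation, then binarize to spikes.
        x = SpikeFn.apply(sum(branch(x) for branch in self.branches))

        def to_heads(t):  # (B, C, H, W) -> (B, heads, N, C/heads)
            return t.flatten(2).view(b, self.num_heads, c // self.num_heads, -1).transpose(-2, -1)

        q = to_heads(SpikeFn.apply(self.q(x)))
        k = to_heads(SpikeFn.apply(self.k(x)))
        v = to_heads(SpikeFn.apply(self.v(x)))
        # Softmax-free attention: binary Q/K/V keep the matrix products addition-dominated.
        attn = (q @ k.transpose(-2, -1)) * self.scale           # (B, heads, N, N)
        out = (attn @ v).transpose(-2, -1).reshape(b, c, h, w)  # back to (B, C, H, W)
        return self.proj(out)


if __name__ == "__main__":
    block = MultiScaleSpikingAttention(dim=64)
    y = block(torch.randn(2, 64, 14, 14))
    print(y.shape)  # torch.Size([2, 64, 14, 14])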
Keywords:
Machine Learning: ML: Attention models
Computer Vision: CV: Representation learning
Machine Learning: ML: Multi-modal learning
Computer Vision: CV: Efficiency and Optimization