Singularformer: Learning to Decompose Self-Attention to Linearize the Complexity of Transformer

Yifan Wu, Shichao Kan, Min Zeng, Min Li

Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence
Main Track. Pages 4433-4441. https://doi.org/10.24963/ijcai.2023/493

Transformers achieve excellent performance across a variety of domains because the self-attention mechanism captures long-distance dependencies. However, self-attention is computationally costly due to its quadratic complexity and high memory consumption. In this paper, we propose a novel Transformer variant, Singularformer, which uses neural networks to learn the singular value decomposition of the attention matrix, yielding a global self-attention mechanism with linear complexity and low memory cost. Specifically, we decompose the attention matrix into the product of three matrix factors, following singular value decomposition, and design neural networks to learn these factors; the associative law of matrix multiplication then linearizes the computation of self-attention. This procedure lets us compute self-attention as two dimension-reduction processes, one along each of the two token dimensions, followed by multi-head self-attention on the token features produced by the first reduction. Experimental results on 8 real-world datasets demonstrate that Singularformer performs favorably against other Transformer variants with lower time and space complexity. Our source code is publicly available at https://github.com/CSUBioGroup/Singularformer.
Keywords:
Machine Learning: ML: Attention models
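
To make the idea in the abstract concrete, below is a minimal, hedged sketch of a linear-complexity attention layer: the n x n attention matrix is never materialized; instead, small learned networks produce factors that play the roles of the SVD factors U, Sigma, and V, and the associativity of matrix multiplication keeps the cost linear in the sequence length n. The module names, the rank r, and the particular projections are illustrative assumptions for exposition, not the authors' exact Singularformer architecture (see the repository linked above for that).

```python
# Hedged sketch of linear attention via a learned low-rank (SVD-style)
# decomposition of the attention matrix. Assumption: factor shapes and
# projections below are illustrative, not the paper's exact design.
import torch
import torch.nn as nn


class LearnedLowRankAttention(nn.Module):
    def __init__(self, dim: int, rank: int = 64):
        super().__init__()
        self.rank = rank
        # Networks producing the left and right factors from token features.
        self.to_left = nn.Linear(dim, rank)    # plays the role of U  (n x r)
        self.to_right = nn.Linear(dim, rank)   # plays the role of V  (n x r)
        # Learnable "singular values" mixing the r-dimensional reduced space.
        self.sigma = nn.Parameter(torch.ones(rank))
        self.to_value = nn.Linear(dim, dim)
        self.out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n, dim)
        u = torch.softmax(self.to_left(x), dim=-1)       # (b, n, r)
        v = torch.softmax(self.to_right(x), dim=1)       # (b, n, r), normalized over tokens
        val = self.to_value(x)                           # (b, n, dim)
        # Associativity: reduce over the n tokens first (O(n * r * dim)),
        # so the implicit n x n attention matrix U diag(sigma) V^T never appears.
        context = torch.einsum("bnr,bnd->brd", v, val)   # (b, r, dim)
        context = context * self.sigma.unsqueeze(-1)     # scale by learned singular values
        out = torch.einsum("bnr,brd->bnd", u, context)   # expand back to n tokens
        return self.out(out)


if __name__ == "__main__":
    layer = LearnedLowRankAttention(dim=128, rank=32)
    tokens = torch.randn(2, 1000, 128)   # batch of 2, sequence length 1000
    print(layer(tokens).shape)           # torch.Size([2, 1000, 128])
```

Because the token-wise reduction is applied before the expansion back to n tokens, both time and memory scale as O(n * r * dim) rather than O(n^2), which is the essence of the linearization described in the abstract.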