Multi-Scale Group Transformer for Long Sequence Modeling in Speech Separation

Yucheng Zhao, Chong Luo, Zheng-Jun Zha, Wenjun Zeng

Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence
Main track. Pages 3251-3257. https://doi.org/10.24963/ijcai.2020/450

In this paper, we introduce the Transformer into time-domain methods for single-channel speech separation. The Transformer has the potential to boost speech separation performance because of its strong sequence modeling capability. However, its computational complexity, which grows quadratically with the sequence length, has made it largely inapplicable to speech applications. To tackle this issue, we propose a novel variant of the Transformer, named the multi-scale group Transformer (MSGT). The key ideas are group self-attention, which significantly reduces the complexity, and multi-scale fusion, which retains the Transformer's ability to capture long-term dependencies. We implement two versions of MSGT with different complexities and apply them to a well-known time-domain speech separation method called Conv-TasNet. By simply replacing the original temporal convolutional network (TCN) with MSGT, our approach, called MSGT-TasNet, achieves a large gain over Conv-TasNet on both the WSJ0-2mix and WHAM! benchmarks. Without bells and whistles, the performance of MSGT-TasNet is already on par with state-of-the-art (SOTA) methods.
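For intuition, here is a minimal PyTorch sketch of the two ideas named in the abstract: self-attention restricted to fixed-size groups, which reduces the attention cost from O(T^2) to O(T*G) for group size G, and fusion across several group sizes to recover long-range context. The module names, the group sizes, and the sum-based fusion are illustrative assumptions for this sketch, not the paper's exact MSGT architecture.

```python
import torch
import torch.nn as nn

class GroupSelfAttention(nn.Module):
    """Self-attention restricted to non-overlapping groups along the time axis.

    Full self-attention over T frames costs O(T^2); attending only within
    groups of size G costs O(T * G).
    """
    def __init__(self, dim, num_heads, group_size):
        super().__init__()
        self.group_size = group_size
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):
        # x: (batch, T, dim); assumes T is divisible by group_size
        # (in practice the sequence would be padded to a multiple of it).
        b, t, d = x.shape
        g = self.group_size
        x = x.reshape(b * (t // g), g, d)   # fold each group into the batch dim
        out, _ = self.attn(x, x, x)         # full attention, but only inside a group
        return out.reshape(b, t, d)

class MultiScaleGroupAttention(nn.Module):
    """Group attention at several scales, fused by summation (an assumption)."""
    def __init__(self, dim, num_heads, group_sizes=(25, 100, 400)):
        super().__init__()
        self.branches = nn.ModuleList(
            GroupSelfAttention(dim, num_heads, g) for g in group_sizes
        )

    def forward(self, x):
        return sum(branch(x) for branch in self.branches)

# Usage: 400 frames, 64 channels; 400 is divisible by every group size above.
x = torch.randn(2, 400, 64)
y = MultiScaleGroupAttention(dim=64, num_heads=4)(x)
assert y.shape == x.shape
```

In a TasNet-style pipeline, a stack of such blocks would stand in where Conv-TasNet uses its TCN, operating on the encoder's frame sequence; the small-group branches capture local structure cheaply while the large-group branches supply longer-range dependencies.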
Keywords:
Machine Learning: Deep Learning: Sequence Modeling
Natural Language Processing: Speech