Joint Time-Frequency and Time Domain Learning for Speech Enhancement

Chuanxin Tang; Chong Luo; Zhiyuan Zhao; Wenxuan Xie; Wenjun Zeng

doi:10.24963/ijcai.2020/528

Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence

Main track. Pages 3816-3822. https://doi.org/10.24963/ijcai.2020/528

For single-channel speech enhancement, time-domain and time-frequency-domain (T-F-domain) methods each have their own pros and cons. In this paper, we present a cross-domain framework named TFT-Net, which takes a time-frequency spectrogram as input and produces a time-domain waveform as output. Such a framework takes advantage of the knowledge we have about spectrograms and avoids some of the drawbacks that T-F-domain methods suffer from. In TFT-Net, we design an innovative dual-path attention block (DAB) to fully exploit correlations along the time and frequency axes. We further discover that a sample-independent DAB (SDAB) achieves a good tradeoff between enhanced speech quality and complexity. Ablation studies show that both the cross-domain design and the SDAB block bring large performance gains. When logarithmic MSE is used as the training criterion, TFT-Net achieves the highest SDR and SSNR among state-of-the-art methods on two major speech enhancement benchmarks.
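To make the training criterion and the SDR metric concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not give the exact form of the logarithmic MSE, so `log_mse` simply takes the base-10 logarithm of the waveform MSE, and `sdr_db` is a simplified energy-ratio SDR that may differ from the SDR computed on the benchmarks. The function names and the `eps` stabilizer are hypothetical.

```python
import math


def log_mse(estimate, target, eps=1e-8):
    """Base-10 log of the mean squared error between two waveforms.

    Assumed form of a "logarithmic MSE" loss; the paper's exact
    definition may differ.
    """
    if len(estimate) != len(target):
        raise ValueError("waveforms must have the same length")
    mse = sum((e - t) ** 2 for e, t in zip(estimate, target)) / len(target)
    return math.log10(mse + eps)  # eps avoids log(0) for a perfect estimate


def sdr_db(estimate, target, eps=1e-8):
    """Simplified signal-to-distortion ratio in dB.

    Computes 10 * log10(||s||^2 / ||s - s_hat||^2), where s is the clean
    target and s_hat is the enhanced estimate.
    """
    if len(estimate) != len(target):
        raise ValueError("waveforms must have the same length")
    signal_energy = sum(t * t for t in target)
    distortion_energy = sum((t - e) ** 2 for e, t in zip(estimate, target))
    return 10.0 * math.log10((signal_energy + eps) / (distortion_energy + eps))


# Toy waveforms: the estimate is the clean signal scaled by 0.9.
clean = [1.0, -1.0, 1.0, -1.0]
enhanced = [0.9, -0.9, 0.9, -0.9]
print(log_mse(enhanced, clean))
print(sdr_db(enhanced, clean))
```

For a fixed clean signal, this simplified SDR equals a constant minus ten times the log-MSE (ignoring `eps`), so lowering the loss on an utterance raises its SDR by the same amount in dB. That relationship holds for the definitions in this sketch and is offered only as intuition for why a logarithmic loss pairs naturally with SDR-style metrics.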

Keywords:

Natural Language Processing: Speech

Machine Learning: Deep Learning