A Multi-view Fusion Approach for Enhancing Speech Signals via Short-time Fractional Fourier Transform
Zikun Jin, Yuhua Qian, Xinyan Liang, Haijun Geng
Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence
Main Track. Pages 5508-5516.
https://doi.org/10.24963/ijcai.2025/613
Deep learning-based speech enhancement (SE) methods typically reconstruct speech from the time or frequency domain. However, these domains alone cannot provide enough information to accurately capture the dynamics of non-stationary signals. To enrich the available information, this work proposes a multi-view fusion SE method (MFSE). Specifically, MFSE extends the representation space of speech to the dynamic domain (also called the fractional domain) that lies between the time and frequency domains, using the short-time fractional Fourier transform (STFrFT). The inputs are then constructed as two views, a primary short-time Fourier transform (STFT) spectrum and an auxiliary STFrFT spectrum, and the optimal STFrFT spectrum is adaptively identified from the continuous fractional domain by leveraging the average spectral centroid. The framework extracts latent features through several purpose-built convolutional modules and captures correlations between different speech frequencies through multi-granularity attention.
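As a rough illustration of the fractional-domain view described above, the sketch below builds an order-a discrete fractional Fourier transform from the eigenvectors of the Dickinson-Steiglitz matrix that commutes with the DFT, applies it frame by frame to obtain an STFrFT-like spectrum, and scores candidate orders by their average spectral centroid. The function names, frame parameters, and the centroid-based selection rule are illustrative assumptions; the paper's exact STFrFT kernel and order-selection criterion are not reproduced here.

import numpy as np

def dfrft_matrix(N, a):
    # Order-`a` discrete fractional Fourier transform matrix built from the
    # eigenvectors of the Dickinson-Steiglitz matrix S, which commutes with
    # the DFT; a = 1 approximates the ordinary DFT. This is one standard
    # construction, used only as a stand-in for the paper's STFrFT kernel.
    n = np.arange(N)
    S = np.zeros((N, N))
    S[n, n] = 2.0 * np.cos(2.0 * np.pi * n / N)
    S[n, (n + 1) % N] = 1.0
    S[n, (n - 1) % N] = 1.0
    _, V = np.linalg.eigh(S)            # eigenvectors ~ discrete Hermite-Gaussians
    V = V[:, ::-1]                      # descending eigenvalues ~ Hermite order 0..N-1
    D = np.exp(-1j * np.pi / 2.0 * a * np.arange(N))  # fractional DFT eigenvalues
    return (V * D) @ V.T

def stfrft(x, a, frame_len=256, hop=128):
    # Short-time fractional Fourier transform: Hann-windowed frames, each
    # transformed by the order-`a` DFrFT matrix (frame sizes are assumptions).
    win = np.hanning(frame_len)
    F_a = dfrft_matrix(frame_len, a)
    frames = [x[i:i + frame_len] * win
              for i in range(0, len(x) - frame_len + 1, hop)]
    return np.stack([F_a @ f for f in frames], axis=1)   # (bins, frames)

def avg_spectral_centroid(spec):
    # Magnitude-weighted mean bin index, averaged over frames.
    mag = np.abs(spec)
    bins = np.arange(mag.shape[0])[:, None]
    return float(((bins * mag).sum(0) / (mag.sum(0) + 1e-12)).mean())

# Score a coarse grid of fractional orders; picking the order with the lowest
# average centroid is only an illustrative rule, not the paper's criterion.
x = np.random.randn(16000)                    # stand-in for a noisy utterance
orders = np.linspace(0.5, 1.5, 11)
best_order = min(orders, key=lambda a: avg_spectral_centroid(stfrft(x, a)))

In this sketch, order a = 1 reduces approximately to the ordinary STFT view, so sweeping a around 1 corresponds to exploring auxiliary fractional views of the same signal.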
Experimental results show that the proposed method significantly improves performance on several metrics compared with existing single-channel SE methods based on the time and frequency domains. Furthermore, a generalizability evaluation shows that the multi-view method outperforms the single-view method across a wide range of SNR conditions.
Keywords:
Machine Learning: ML: Applications
Machine Learning: ML: Multi-view learning
