SepALM: Audio Language Models Are Error Correctors for Robust Speech Separation

SepALM: Audio Language Models Are Error Correctors for Robust Speech Separation

Zhaoxi Mu, Xinyu Yang, Gang Wang

Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence
Main Track. Pages 8204-8212. https://doi.org/10.24963/ijcai.2025/912

While contemporary speech separation technologies adeptly process lengthy mixed audio waveforms, they are frequently challenged by the intricacies of real-world environments, including noisy and reverberant settings, which can result in artifacts or distortions in the separated speech. To overcome these limitations, we introduce SepALM, a pioneering approach that employs audio language models (ALMs) to rectify and re-synthesize speech within the text domain following preliminary separation. SepALM comprises four core components: a separator, a corrector, a synthesizer, and an aligner. By integrating an ALM-based end-to-end error correction mechanism, we mitigate the risk of error accumulation and circumvent the optimization hurdles typically encountered in conventional methods that amalgamate automatic speech recognition (ASR) with large language models (LLMs). Additionally, we have developed Chain-of-Thought (CoT) prompting and knowledge distillation techniques to facilitate the reasoning and training processes of the ALM. Our experiments substantiate that SepALM not only elevates the precision of speech separation but also markedly bolsters adaptability in novel acoustic environments.
Keywords:
Natural Language Processing: NLP: Speech
Natural Language Processing: NLP: Applications
Natural Language Processing: NLP: Language models