EditSinger: Zero-Shot Text-Based Singing Voice Editing System with Diverse Prosody Modeling

Lichao Zhang; Zhou Zhao; Yi Ren; Liqun Deng

doi:10.24963/ijcai.2022/625

Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence

Main Track. Pages 4503-4509. https://doi.org/10.24963/ijcai.2022/625

Zero-shot text-based singing voice editing modifies a singing voice according to edited lyrics, without any additional data from the target singer. However, because the two tasks have different demands, applying existing speech editing methods to the singing voice editing task raises two main challenges: the lack of systematic prosody modeling for insertion and deletion, and the trade-off between naturalness of pronunciation and preservation of prosody in replacement. In this paper we propose EditSinger, a novel singing voice editing model with specially designed diverse prosody modules that address these challenges. Specifically, 1) a general masked variance adaptor is introduced for comprehensive prosody modeling of the inserted lyrics and of the transition at the deletion boundary; and 2) a fusion pitch predictor is designed for replacement. By disentangling the reference pitch and fusing the predicted pronunciation, the edited pitch can be reconstructed, which can ensure a natural pronunciation while preserving the prosody of the original audio. To the best of our knowledge, EditSinger is the first zero-shot text-based singing voice editing system. Experiments on the OpenSinger dataset show that EditSinger can synthesize high-quality edited singing voices with natural prosody for each editing operation.
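The three editing operations named in the abstract (insertion, deletion, replacement) can be illustrated at the lyric-token level. The sketch below is not the EditSinger model; it only shows, under our own assumptions, how an edit could be turned into an edited token sequence plus a boolean mask marking the positions whose prosody would have to be predicted. The names `EditResult` and `apply_edit`, and the choice to mask the two tokens adjacent to a deletion boundary, are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class EditResult:
    tokens: List[str]  # lyric tokens after the edit
    mask: List[bool]   # True where prosody would need to be predicted


def apply_edit(
    tokens: Sequence[str],
    op: str,
    start: int,
    end: Optional[int] = None,
    new_tokens: Optional[Sequence[str]] = None,
) -> EditResult:
    """Apply one lyric edit and mark the positions affected by it.

    Illustrative assumption: inserted or replaced tokens are masked;
    for deletion, the tokens on both sides of the cut are masked so a
    model could smooth the transition at the boundary.
    """
    tokens = list(tokens)
    new = list(new_tokens or [])
    if op == "insert":
        end = start          # nothing from the original is removed
    elif op == "delete":
        new = []             # nothing is added
    elif op != "replace":
        raise ValueError(f"unknown operation: {op!r}")
    if end is None or not (0 <= start <= end <= len(tokens)):
        raise ValueError("invalid edit span")

    edited = tokens[:start] + new + tokens[end:]
    mask = [False] * len(edited)
    # Mask every newly written token (insert / replace).
    for i in range(start, start + len(new)):
        mask[i] = True
    # For deletion, mask the neighbours of the cut if they exist.
    if op == "delete":
        for i in (start - 1, start):
            if 0 <= i < len(edited):
                mask[i] = True
    return EditResult(edited, mask)


lyrics = ["I", "love", "you", "so", "much"]
inserted = apply_edit(lyrics, "insert", 2, new_tokens=["really"])
deleted = apply_edit(lyrics, "delete", 3, 4)
replaced = apply_edit(lyrics, "replace", 1, 2, new_tokens=["miss"])
```

In a full system, a mask of this kind would presumably select the frames or tokens for which pitch and duration are predicted, while unmasked positions keep the prosody of the original recording.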

Keywords:

Natural Language Processing: Speech

Natural Language Processing: Applications