Dilated Convolution with Dilated GRU for Music Source Separation

Jen-Yu Liu, Yi-Hsuan Yang

Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence

Main track. Pages 4718-4724. https://doi.org/10.24963/ijcai.2019/655

Stacked dilated convolutions, as used in WaveNet, have been shown to be effective for generating high-quality audio. Because they replace pooling or striding with dilation in the convolution layers, they preserve high-resolution information while still reaching distant locations. Producing high-resolution predictions is also crucial in music source separation, whose goal is to separate different sound sources while maintaining the quality of the separated sounds. In this paper, we therefore use stacked dilated convolutions as the backbone for music source separation. Although stacked dilated convolutions reach a wider context than standard convolutions do, their effective receptive fields are still fixed and might not be wide enough for complex music audio signals. To reach information at even more remote locations, we propose to combine a dilated convolution with a modified GRU, called a Dilated GRU, to form a block. For a fixed k, a Dilated GRU receives its recurrent input from k steps earlier instead of from the previous step. This modification lets a GRU unit reach a distant location in fewer recurrent steps. It also runs faster because it can be partially executed in parallel. We show that the proposed model, built from a stack of such blocks, performs as well as or better than the state of the art for separating both vocals and accompaniment.
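
As a rough illustration of the two mechanisms described above, the following pure-Python sketch computes the receptive field of a stack of stride-1, 1-D dilated convolutions and implements a recurrence in which step t reads the hidden state from step t - k. This is not the authors' implementation. All function names are hypothetical. The gate equations follow one common GRU formulation, and hidden states before step 0 are assumed to be zero.

```python
import math
import random


def receptive_field(kernel_size, dilations):
    """Receptive field (in time steps) of stacked stride-1 1-D dilated convs:
    each layer widens it by (kernel_size - 1) * dilation."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


def _matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def _sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def init_params(input_size, hidden_size, seed=0):
    """Hypothetical random weights (W, U, b) for the z, r and n gates."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

    return {g: (mat(hidden_size, input_size), mat(hidden_size, hidden_size),
                [rng.uniform(-0.1, 0.1) for _ in range(hidden_size)])
            for g in ("z", "r", "n")}


def gru_cell(params, x, h_prev):
    """One GRU update (assumed variant: candidate uses U_n applied to r * h_prev)."""
    wz, uz, bz = params["z"]
    wr, ur, br = params["r"]
    wn, un, bn = params["n"]
    z = [_sigmoid(a + b + c) for a, b, c in zip(_matvec(wz, x), _matvec(uz, h_prev), bz)]
    r = [_sigmoid(a + b + c) for a, b, c in zip(_matvec(wr, x), _matvec(ur, h_prev), br)]
    rh = [ri * hi for ri, hi in zip(r, h_prev)]
    n = [math.tanh(a + b + c) for a, b, c in zip(_matvec(wn, x), _matvec(un, rh), bn)]
    return [(1 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, h_prev)]


def dilated_gru(params, xs, k, hidden_size):
    """Step t reads the hidden state of step t - k (zeros if t < k), not t - 1."""
    hs = []
    for t, x in enumerate(xs):
        h_prev = hs[t - k] if t >= k else [0.0] * hidden_size
        hs.append(gru_cell(params, x, h_prev))
    return hs


def dilated_gru_parallel(params, xs, k, hidden_size):
    """Same result: the k interleaved chains xs[j::k] never interact, so each
    can run through an ordinary GRU independently (done in a loop here, but
    they could be batched) and the outputs are interleaved back in time order."""
    hs = [None] * len(xs)
    for j in range(k):
        hs[j::k] = dilated_gru(params, xs[j::k], 1, hidden_size)
    return hs


if __name__ == "__main__":
    rng = random.Random(1)
    xs = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(10)]
    params = init_params(input_size=3, hidden_size=4)
    a = dilated_gru(params, xs, k=2, hidden_size=4)
    b = dilated_gru_parallel(params, xs, k=2, hidden_size=4)
    print(a == b)  # True
    print(receptive_field(3, [1, 2, 4, 8]))  # 31
```

Step t depends only on step t - k, so the sequence splits into k chains that never interact. This independence is what allows the partial parallelism mentioned in the abstract. Reaching a location n steps back also takes about n/k recurrent steps instead of n.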

Keywords:

Multidisciplinary Topics and Applications: Art and Music

Machine Learning: Deep Learning

Machine Learning Applications: Applications of Supervised Learning