Fast Parallel Training of Neural Language Models

Tong Xiao, Jingbo Zhu, Tongran Liu, Chunliang Zhang

Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence
Main track. Pages 4193-4199. https://doi.org/10.24963/ijcai.2017/586

Training neural language models (NLMs) is very time consuming, and parallelization is needed to speed the system up. However, standard training methods scale poorly across multiple devices (e.g., GPUs) because of the high cost of transmitting data for gradient sharing during back-propagation. In this paper we present a sampling-based approach that reduces data transmission for better scaling of NLMs. As a "bonus", the resulting model also improves training speed on a single device. Our approach yields significant speed improvements on a recurrent neural network-based language model. On four NVIDIA GTX1080 GPUs, it achieves a speedup of more than 2.1 times over the standard asynchronous stochastic gradient descent baseline, with no increase in perplexity, and is 4.2 times faster than the naive single-GPU counterpart.
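To make the idea concrete, the sketch below is a minimal NumPy illustration of sampling-based gradient reduction in the output layer; it is not the authors' implementation, and all names (sampled_output_grads, n_samples, the uniform negative sampler) are illustrative assumptions. The point it shows is that when the softmax loss is computed over the target word plus a small sampled subset of the vocabulary, only a |S| x d slice of the output-weight gradient is non-zero, so only that slice needs to be exchanged between devices instead of the full |V| x d matrix.

```python
import numpy as np

def sampled_output_grads(h, W, target, n_samples, rng):
    """
    h:         (d,)   hidden state for one position
    W:         (V, d) output (softmax) weight matrix
    target:    int    index of the correct next word
    n_samples: int    number of sampled negative words
    Returns the sampled row indices and the gradient w.r.t. those rows only.
    """
    V, d = W.shape
    # Sample negative word ids (uniform here; a unigram distribution is more typical).
    negatives = rng.choice(V, size=n_samples, replace=False)
    negatives = negatives[negatives != target]        # drop the target if it was sampled
    rows = np.concatenate(([target], negatives))      # rows touched by this update
    logits = W[rows] @ h                              # logits over the sampled subset only
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # Cross-entropy gradient over the subset: dL/dlogits = probs - onehot(target)
    dlogits = probs.copy()
    dlogits[0] -= 1.0                                 # the target sits at position 0
    dW_rows = np.outer(dlogits, h)                    # (1 + |negatives|, d) instead of (V, d)
    return rows, dW_rows

rng = np.random.default_rng(0)
d, V = 64, 10000
h = rng.standard_normal(d)
W = 0.01 * rng.standard_normal((V, d))
rows, dW_rows = sampled_output_grads(h, W, target=42, n_samples=127, rng=rng)
print(dW_rows.shape)   # e.g. (128, 64): only these rows need to be transmitted and updated
```

Under this kind of scheme, each worker only communicates the gradient rows indexed by its sampled subset, which is what shrinks the data-transmission cost that otherwise dominates multi-GPU training of the softmax layer.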
Keywords:
Natural Language Processing: Natural Language Processing
Natural Language Processing: NLP Applications and Tools