A Convergence Analysis of Distributed SGD with Communication-Efficient Gradient Sparsification

Shaohuai Shi, Kaiyong Zhao, Qiang Wang, Zhenheng Tang, Xiaowen Chu

Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence
Main track. Pages 3411-3417. https://doi.org/10.24963/ijcai.2019/473

Gradient sparsification is a promising technique for significantly reducing the communication overhead of decentralized synchronous stochastic gradient descent (S-SGD) algorithms. Yet many existing gradient sparsification schemes (e.g., Top-k sparsification) have a communication complexity of O(kP), where k is the number of gradients selected by each worker and P is the number of workers. Recently, the gTop-k sparsification scheme was proposed to reduce the communication complexity from O(kP) to O(k log P), which significantly improves system scalability. However, it remains unclear whether gTop-k sparsification converges in theory. In this paper, we first prove the convergence of the gTop-k scheme for non-convex objective functions under certain analytic assumptions. We then derive the convergence rate of gTop-k S-SGD, which is of the same order as that of vanilla mini-batch SGD. Finally, we conduct extensive experiments on different machine learning models and data sets to verify the soundness of the assumptions and theoretical results, and we discuss the impact of the compression ratio on convergence performance.
Keywords:
Machine Learning: Deep Learning
Computer Vision: Big Data and Large Scale Methods
Machine Learning Applications: Big data; Scalability
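
To make the O(kP) versus O(k log P) contrast in the abstract concrete, the following is a minimal single-process sketch, not the authors' implementation. It assumes that gTop-k can be modeled as a pairwise tree reduction in which each merge sums two k-sparse gradients and keeps the k largest-magnitude entries of the sum, that P is a power of two, and that residual accumulation of dropped gradients is omitted. The function names `top_k`, `merge_top_k`, and `gtop_k` are hypothetical, and the dense scratch buffer in the merge is used only for brevity.

```python
import numpy as np

def top_k(grad, k):
    """Return (indices, values) of the k largest-magnitude entries of grad."""
    idx = np.argpartition(np.abs(grad), -k)[-k:]
    return idx, grad[idx]

def merge_top_k(a, b, k, dim):
    """Sum two sparse gradients and keep the k largest-magnitude entries.

    A dense scratch buffer is an illustrative shortcut; a real system
    would merge the sparse (index, value) pairs directly.
    """
    dense = np.zeros(dim)
    np.add.at(dense, a[0], a[1])
    np.add.at(dense, b[0], b[1])
    return top_k(dense, k)

def gtop_k(local_grads, k):
    """Simulate a tree reduction over P workers (assumes P is a power of two).

    Each round halves the number of active sparse gradients, so there are
    log2(P) rounds, and each merge exchanges only k (index, value) pairs.
    Returns the dense result and the number of rounds.
    """
    dim = local_grads[0].size
    sparse = [top_k(g, k) for g in local_grads]
    rounds = 0
    while len(sparse) > 1:
        sparse = [merge_top_k(sparse[i], sparse[i + 1], k, dim)
                  for i in range(0, len(sparse), 2)]
        rounds += 1
    idx, vals = sparse[0]
    out = np.zeros(dim)
    out[idx] = vals
    return out, rounds

# Example: 8 simulated workers, 100-dimensional gradients, k = 5.
rng = np.random.default_rng(0)
grads = [rng.standard_normal(100) for _ in range(8)]
result, rounds = gtop_k(grads, k=5)
```

Under these assumptions, each worker handles k entries in each of the log2(P) rounds, whereas gathering every worker's local Top-k selection would deliver on the order of kP entries to each worker.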