Reducing Underflow in Mixed Precision Training by Gradient Scaling

Ruizhe Zhao, Brian Vogel, Tanvir Ahmed, Wayne Luk

Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence
Main track. Pages 2922-2928. https://doi.org/10.24963/ijcai.2020/404

By leveraging the half-precision floating-point format (FP16), which is well supported by recent GPUs, mixed precision training (MPT) enables us to train larger models under the same or even a smaller budget. However, due to the limited representation range of FP16, gradients often experience severe underflow, which hinders backpropagation and degrades model accuracy. MPT adopts loss scaling, which scales up the loss value just before backpropagation starts, to mitigate underflow by enlarging the magnitude of the gradients. Unfortunately, scaling once is insufficient: gradients from distinct layers can have different data distributions and require non-uniform scaling. Heuristics and hyperparameter tuning are therefore needed to minimize the side effects of loss scaling. We propose gradient scaling, a novel method that analytically calculates the appropriate scale for each gradient on the fly. It addresses underflow effectively without introducing numerical problems such as overflow and without tedious hyperparameter tuning. Experiments on a variety of networks and tasks show that gradient scaling can improve accuracy and reduce overall training effort compared with state-of-the-art MPT.
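
For context on the baseline that gradient scaling is compared against, the sketch below illustrates conventional single-factor loss scaling using PyTorch's torch.cuda.amp API. This is only an illustration of standard MPT as described in the abstract, not of the per-gradient scaling method proposed in the paper; the model, optimizer, and hyperparameters are placeholders chosen for the example.

# Minimal sketch of conventional loss scaling in mixed precision training,
# using PyTorch's torch.cuda.amp utilities. Illustrative baseline only.
import torch
from torch import nn

model = nn.Linear(512, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler()  # maintains the loss scale dynamically

def train_step(inputs, targets):
    optimizer.zero_grad()
    # Forward pass runs selected ops in FP16.
    with torch.cuda.amp.autocast():
        loss = nn.functional.cross_entropy(model(inputs), targets)
    # A single scale factor is applied to the loss, so every gradient
    # produced by backpropagation is enlarged by the same amount.
    scaler.scale(loss).backward()
    # Gradients are unscaled before the optimizer step; if any gradient
    # overflowed, the step is skipped and the scale is reduced.
    scaler.step(optimizer)
    scaler.update()
    return loss.item()

The key point of the sketch is that one scale factor is applied at the loss, so all layers' gradients are enlarged uniformly; this uniform treatment is precisely the limitation that the paper's per-gradient, analytically computed scaling is designed to address.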
Keywords:
Machine Learning: Deep Learning
Machine Learning: Deep Learning: Convolutional networks
Computer Vision: Recognition: Detection, Categorization, Indexing, Matching, Retrieval, Semantic Interpretation