Taming the Noisy Gradient: Train Deep Neural Networks with Small Batch Sizes

Yikai Zhang, Hui Qu, Chao Chen, Dimitris Metaxas

Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence
Main track. Pages 4348-4354. https://doi.org/10.24963/ijcai.2019/604

Deep learning architectures are usually proposed with millions of parameters, resulting in memory issues when training deep neural networks with stochastic gradient descent type methods using large batch sizes. However, training with small batch sizes tends to produce low-quality solutions due to the large variance of stochastic gradients. In this paper, we tackle this problem by proposing a new framework for training deep neural networks with small batches and noisy gradients. During optimization, our method iteratively applies a proximal-type regularizer to make the loss function strongly convex. Such a regularizer stabilizes the gradient, leading to better training performance. We prove that our algorithm achieves a convergence rate comparable to that of vanilla SGD even with a small batch size. Our framework is simple to implement and can potentially be combined with many existing optimization algorithms. Empirical results show that our method outperforms SGD and Adam when the batch size is small. Our implementation is available at https://github.com/huiqu18/TRAlgorithm.
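To make the idea concrete, below is a minimal sketch of a proximal-point-style training loop in PyTorch: each outer iteration anchors the current weights and minimizes the mini-batch loss plus a quadratic proximal term (lambda/2)||w - w_anchor||^2, which makes each subproblem strongly convex and damps gradient noise. The hyperparameter names (prox_lambda, inner_steps), the toy model/data, and the exact update schedule are illustrative assumptions, not the paper's or the repository's actual implementation.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy data and model (stand-ins for a real dataset/architecture).
X, y = torch.randn(256, 20), torch.randint(0, 2, (256,))
model = nn.Sequential(nn.Linear(20, 32), nn.ReLU(), nn.Linear(32, 2))
criterion = nn.CrossEntropyLoss()

prox_lambda = 0.1   # strength of the proximal term (assumed hyperparameter)
inner_steps = 5     # SGD steps per proximal subproblem (assumed)
lr, batch_size = 0.05, 8

optimizer = torch.optim.SGD(model.parameters(), lr=lr)

for outer in range(20):
    # Anchor at the current weights; the proximal term pulls iterates back
    # toward this point, making each regularized subproblem strongly convex.
    anchor = [p.detach().clone() for p in model.parameters()]

    for _ in range(inner_steps):
        idx = torch.randint(0, X.size(0), (batch_size,))
        loss = criterion(model(X[idx]), y[idx])

        # Proximal regularizer: (lambda/2) * ||w - w_anchor||^2
        prox = sum(((p - a) ** 2).sum() for p, a in zip(model.parameters(), anchor))
        total = loss + 0.5 * prox_lambda * prox

        optimizer.zero_grad()
        total.backward()
        optimizer.step()
```

In this sketch the anchor is refreshed every few inner steps; the paper's convergence analysis and its combination with other optimizers may differ in how the anchor and regularization strength are chosen.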
Keywords:
Machine Learning: Classification
Machine Learning: Deep Learning