AdaR: An Adaptive Gradient Method with Cyclical Restarting of Moment Estimations
Yangchuan Wang, Lianhong Ding, Peng Shi
Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence
Main Track. Pages 6478-6486.
https://doi.org/10.24963/ijcai.2025/721
Adaptive gradient methods, primarily based on Adam, are prevalent in training neural networks, adjusting step sizes via exponentially decaying averages of gradients and squared gradients. Adam assigns small weights to distant gradients, termed long-tail gradients in this paper. However, these gradients persistently influence update behavior, potentially degrading generalization performance. To address this issue, we incorporate a restart mechanism into moment estimations, proposing AdaR (ADAptive gradient methods via Restarting moment estimations). Specifically, AdaR divides a training epoch into fixed-iteration intervals, alternates between two sets of moment estimations for parameter updates, and discards prior moment estimations at the beginning of each interval. Within each interval, one set drives parameter updates and is discarded once the interval ends, while the other is reset at the midpoint and accumulates fresh moment estimates that drive updates in the next interval. The restart mechanism cyclically discards distant gradients, initiates fresh moment estimations for parameter updates, and stabilizes training. By prioritizing recent gradients, the method improves the accuracy of the moment estimates and enhances step-size adjustment. Empirically, AdaR outperforms state-of-the-art optimization algorithms on image classification and language modeling tasks, demonstrating superior generalization and faster convergence.
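The following is a minimal sketch of the cyclical restart mechanism as it is described in the abstract, assuming standard Adam-style moment accumulation and bias correction. The class name AdaRSketch, the interval hyperparameter, and the two-buffer bookkeeping are illustrative assumptions made for this sketch, not the authors' reference implementation.

```python
import numpy as np

class AdaRSketch:
    """Adam-style optimizer with cyclical restarting of moment estimations
    (a sketch based on the AdaR abstract; details are assumptions)."""

    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8, interval=200):
        self.params = params                      # list of np.ndarray parameters (updated in place)
        self.lr, self.eps = lr, eps
        self.beta1, self.beta2 = betas
        self.interval = interval                  # iterations per restart interval (assumed hyperparameter)
        self.t = 0                                # global iteration counter
        # Two sets of first/second moment estimates plus their local step counts.
        self.m = [[np.zeros_like(p) for p in params] for _ in range(2)]
        self.v = [[np.zeros_like(p) for p in params] for _ in range(2)]
        self.steps = [0, 0]
        self.active = 0                           # index of the set currently used for updates

    def _reset(self, idx):
        # Discard a set of moment estimations.
        for m, v in zip(self.m[idx], self.v[idx]):
            m.fill(0.0)
            v.fill(0.0)
        self.steps[idx] = 0

    def step(self, grads):
        pos = self.t % self.interval
        if self.t > 0 and pos == 0:
            # New interval: switch to the set that has been warming up since the
            # previous midpoint, and discard the set that just finished updating.
            old = self.active
            self.active = 1 - self.active
            self._reset(old)
        if pos == self.interval // 2:
            # Midpoint: restart the standby set so it estimates moments from
            # recent gradients only, ready for the next interval.
            self._reset(1 - self.active)

        for idx in (self.active, 1 - self.active):
            # The standby set accumulates only after its midpoint restart.
            if idx != self.active and pos < self.interval // 2:
                continue
            self.steps[idx] += 1
            for m, v, g in zip(self.m[idx], self.v[idx], grads):
                m += (1 - self.beta1) * (g - m)        # m = beta1*m + (1-beta1)*g
                v += (1 - self.beta2) * (g * g - v)    # v = beta2*v + (1-beta2)*g^2

        # Adam-style bias-corrected update using only the active set.
        k = self.steps[self.active]
        for p, m, v in zip(self.params, self.m[self.active], self.v[self.active]):
            m_hat = m / (1 - self.beta1 ** k)
            v_hat = v / (1 - self.beta2 ** k)
            p -= self.lr * m_hat / (np.sqrt(v_hat) + self.eps)

        self.t += 1
```

Under these assumptions, the caller would pass the current gradients to step() once per iteration (e.g. opt = AdaRSketch([w]); opt.step([grad_w])); the two buffers ensure that the set used for updates in any interval has been accumulating for only half an interval of recent gradients when it takes over.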
Keywords:
Machine Learning: ML: Optimization
Machine Learning: ML: Online learning
