Optimal Distributed Training With Co-Adaptive Data Parallelism in Heterogeneous Environments

Lifang Chen, Zhichao Chen, Liqi Yan, Yanyu Cheng, Fangli Guan, Pan Li

Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence

The computational power required to train deep learning models has skyrocketed over the past decade as models scale with big data, and has become an expensive and scarce resource. Distributed training, which can leverage computational power spread across many devices, is therefore vital for efficient large-scale model training. However, most existing distributed training frameworks, such as DDP and DeepSpeed, are primarily designed for co-located clusters under homogeneous computing and communication conditions, and hence cannot account for geo-distributed clusters with both computing and communication heterogeneity. To address this challenge, we develop a new data-parallelism-based distributed training framework called Co-Adaptive Data Parallelism (C-ADP). First, we consider a data owner and parameter server that distributes data to, and coordinates collaborative learning across, all computing devices, employing local training and delayed parameter synchronization to reduce communication costs. Second, we formulate a data parallel scheduling optimization problem that minimizes training time by optimizing the data distribution. Third, we devise an efficient algorithm to solve this scheduling problem, and formally prove that the obtained solution is asymptotically optimal. Experiments on the ImageNet100 dataset demonstrate that C-ADP achieves fast convergence in heterogeneous distributed training environments. Compared to Distributed Data Parallel (DDP) and DeepSpeed, C-ADP achieves 21.6 times and 26.3 times higher FLOPS, respectively, and reduces training time by about 72% and 47%, respectively.
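To make the core idea concrete, the sketch below illustrates heterogeneity-aware data distribution in its simplest form: each worker receives a data shard proportional to its effective throughput, so that all workers finish a round of local training at roughly the same time before a delayed synchronization. This is a minimal illustrative example, not the paper's C-ADP scheduling algorithm; the cost model, function name, and all numbers are assumptions.

```python
# Illustrative sketch only: NOT the paper's C-ADP algorithm. It shows the
# general idea of heterogeneity-aware data distribution -- sizing each
# worker's shard by an assumed effective rate (compute speed discounted by
# communication cost amortized over delayed synchronizations).

def shard_sizes(total_samples, compute_rates, sync_latency_s, rounds_between_sync):
    """Split total_samples across devices in proportion to an assumed
    effective throughput per device (hypothetical cost model)."""
    # Assumed model: effective rate = compute rate / (1 + per-sync latency
    # amortized over the local rounds between parameter synchronizations).
    rates = [
        r / (1.0 + lat / rounds_between_sync)
        for r, lat in zip(compute_rates, sync_latency_s)
    ]
    total_rate = sum(rates)
    sizes = [int(total_samples * r / total_rate) for r in rates]
    # Assign any rounding remainder to the fastest device.
    sizes[max(range(len(rates)), key=rates.__getitem__)] += total_samples - sum(sizes)
    return sizes

# Example: three heterogeneous workers -- a fast GPU, a slower GPU, and a
# remote node whose synchronization latency is high (all values assumed).
print(shard_sizes(
    total_samples=10_000,
    compute_rates=[100.0, 60.0, 100.0],  # relative compute speeds
    sync_latency_s=[0.1, 0.1, 5.0],      # per-sync communication cost
    rounds_between_sync=10,              # delayed synchronization interval
))
```

Under this toy model, the remote node receives a smaller shard despite matching the fast GPU in raw compute, because its synchronization cost lowers its effective rate; increasing the interval between synchronizations narrows that gap, which is the intuition behind combining delayed synchronization with optimized data distribution.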
Keywords:
Agent-based and Multi-agent Systems: MAS: Coordination and cooperation
Multidisciplinary Topics and Applications: MTA: Databases
Natural Language Processing: NLP: Language models