DeepWeave: Accelerating Job Completion Time with Deep Reinforcement Learning-based Coflow Scheduling

Penghao Sun; Zehua Guo; Junchao Wang; Junfei Li; Julong Lan; Yuxiang Hu

doi:10.24963/ijcai.2020/458

Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence

Main track. Pages 3314-3320. https://doi.org/10.24963/ijcai.2020/458

The concept of a coflow was proposed to improve the processing efficiency of jobs in distributed computing. A coflow is a collection of semantically correlated flows in a multi-stage computation task. A job consists of multiple coflows and can usually be formulated as a Directed Acyclic Graph (DAG). Proper coflow scheduling can significantly reduce job completion time in distributed computing. However, this scheduling problem has been proven NP-hard. Unlike existing schemes that solve it with hand-crafted heuristic algorithms, this paper proposes DeepWeave, a Deep Reinforcement Learning (DRL) framework that generates coflow scheduling policies. To improve inter-coflow scheduling within the job DAG, DeepWeave employs a Graph Neural Network (GNN) to process the DAG information. DeepWeave trains the neural networks of the DRL agent on historical workload traces and encodes the scheduling policy in those networks, which then make coflow scheduling decisions without expert knowledge or a pre-assumed model. The proposed scheme is evaluated in a simulator using real-life traces. Simulation results show that DeepWeave completes jobs at least 1.7× faster than state-of-the-art solutions.
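As an illustration of the job model described above, the following minimal Python sketch represents a job as a DAG of coflows and computes its completion time. It is not DeepWeave's scheduler. It assumes no bandwidth contention, so each coflow starts as soon as its predecessors finish and lasts as long as its largest flow. The names `Coflow`, `job_completion_time`, `flow_sizes`, and `bandwidth`, as well as the example sizes, are hypothetical.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter


@dataclass
class Coflow:
    """A group of correlated flows; sizes are in arbitrary units (assumed)."""
    name: str
    flow_sizes: list

    def duration(self, bandwidth: float) -> float:
        # A coflow completes only when its slowest flow completes.
        # Assumption: every flow gets the full bandwidth of its own link.
        return max(self.flow_sizes) / bandwidth


def job_completion_time(coflows: dict, deps: dict, bandwidth: float = 1.0) -> float:
    """Completion time of a job DAG under the no-contention assumption.

    deps maps each coflow name to the set of coflows it depends on.
    """
    finish = {}
    # static_order() yields every coflow after all of its predecessors.
    for name in TopologicalSorter(deps).static_order():
        start = max((finish[p] for p in deps.get(name, ())), default=0.0)
        finish[name] = start + coflows[name].duration(bandwidth)
    return max(finish.values())


# Hypothetical diamond-shaped job: a -> {b, c} -> d
coflows = {
    "a": Coflow("a", [4.0, 2.0]),
    "b": Coflow("b", [3.0]),
    "c": Coflow("c", [1.0, 5.0]),
    "d": Coflow("d", [2.0]),
}
deps = {"a": set(), "b": {"a"}, "c": {"a"}, "d": {"b", "c"}}

print(job_completion_time(coflows, deps))  # 11.0
```

Under these assumptions the result is the critical path a, c, d (4 + 5 + 2). When coflows share links, the order in which they are served changes this value, and choosing that order is the scheduling decision the abstract refers to.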

Keywords:

Machine Learning Applications: Applications of Reinforcement Learning

Machine Learning Applications: Networks