Efficient Inter-Operator Scheduling for Concurrent Recommendation Model Inference on GPU
Shuxi Guo, Zikang Xu, Jiahao Liu, Jinyi Zhang, Qi Qi, Haifeng Sun, Jun Huang, Jianxin Liao, Jingyu Wang
Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence
Main Track. Pages 2856-2864.
https://doi.org/10.24963/ijcai.2025/318
Deep learning-based recommendation systems are increasingly important in industry. To meet strict SLA requirements, serving frameworks must handle concurrent queries efficiently. However, current serving systems fall short for two reasons: (1) inefficient operator (op) scheduling, caused by the query-wise op launching mechanism, and (2) heavy contention, caused by the mutable nature of recommendation model inference. This paper presents RecOS, a system designed to optimize concurrent recommendation model inference on GPUs. RecOS efficiently schedules ops from different queries by monitoring GPU workloads and assigning each op to the most suitable stream, reducing contention and improving inference efficiency by exploiting inter-op parallelism and op characteristics. To maintain correctness across multiple CUDA streams, RecOS introduces a unified asynchronous tensor management mechanism. Evaluations show that RecOS improves online serving performance, reducing latency by up to 68%.
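The core scheduling idea, assigning each op to the currently most suitable stream based on observed workload, can be sketched as a greedy least-loaded assignment. This is a minimal illustrative simulation, not RecOS's actual implementation: the `Op` and `Stream` classes, the per-op `cost` estimate, and the greedy policy are all assumptions introduced here for clarity.

```python
# Hypothetical sketch of workload-aware op-to-stream assignment.
# Assumption: each op carries an estimated GPU cost; RecOS's real
# monitoring and stream-selection logic is more sophisticated.
from dataclasses import dataclass, field

@dataclass
class Op:
    name: str
    cost: float  # estimated GPU time (ms), an illustrative stand-in

@dataclass
class Stream:
    sid: int
    load: float = 0.0                       # outstanding estimated work
    ops: list = field(default_factory=list)

def assign_ops(ops, num_streams):
    """Greedily place each op on the least-loaded stream, approximating
    'monitor GPU workloads and assign ops to the most suitable stream'."""
    streams = [Stream(sid=i) for i in range(num_streams)]
    for op in ops:
        target = min(streams, key=lambda s: s.load)  # least-loaded stream
        target.ops.append(op)
        target.load += op.cost
    return streams

# Ops arriving from two concurrent queries, interleaved rather than
# launched query-by-query:
ops = [Op("mlp_q1", 5.0), Op("emb_q1", 3.0), Op("mlp_q2", 5.0), Op("emb_q2", 3.0)]
streams = assign_ops(ops, num_streams=2)
for s in streams:
    print(s.sid, [op.name for op in s.ops], s.load)
```

Because ops from different queries are interleaved across streams instead of being launched query-wise onto one stream, total estimated work ends up balanced (here, 8.0 ms on each stream).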
Keywords:
Data Mining: DM: Recommender systems
Data Mining: DM: Information retrieval
Planning and Scheduling: PS: Scheduling
