Efficient Inter-Operator Scheduling for Concurrent Recommendation Model Inference on GPU
Shuxi Guo, Zikang Xu, Jiahao Liu, Jinyi Zhang, Qi Qi, Haifeng Sun, Jun Huang, Jianxin Liao, Jingyu Wang
Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence
Main Track. Pages 2856-2864.
https://doi.org/10.24963/ijcai.2025/318
Deep learning-based recommendation systems are increasingly important in industry. To meet strict SLA requirements, serving frameworks must handle concurrent queries efficiently. However, current serving systems fall short for two reasons: (1) inefficient operator (op) scheduling, caused by the query-wise op launching mechanism, and (2) heavy contention, caused by the mutable nature of recommendation model inference. This paper presents RecOS, a system designed to optimize concurrent recommendation model inference on GPUs. RecOS efficiently schedules ops from different queries by monitoring GPU workloads and assigning each op to the most suitable stream, reducing contention and improving inference efficiency by exploiting inter-op parallelism and op characteristics. To maintain correctness across multiple CUDA streams, RecOS introduces a unified asynchronous tensor management mechanism. Evaluations show that RecOS improves online serving performance, reducing latency by up to 68%.
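The core scheduling idea, assigning each op to the currently most suitable stream based on observed workload, can be sketched as a greedy least-loaded assignment. This is a minimal illustrative simulation, not RecOS's actual implementation: the `Op` and `Stream` classes, the per-op `cost` estimate, and the greedy policy are all assumptions introduced here for clarity.

```python
# Hypothetical sketch of workload-aware op-to-stream assignment.
# Assumption: each op carries an estimated GPU cost; RecOS's real
# monitoring and stream-selection logic is more sophisticated.
from dataclasses import dataclass, field

@dataclass
class Op:
    name: str
    cost: float  # estimated GPU time (ms), an illustrative stand-in

@dataclass
class Stream:
    sid: int
    load: float = 0.0                       # outstanding estimated work
    ops: list = field(default_factory=list)

def assign_ops(ops, num_streams):
    """Greedily place each op on the least-loaded stream, approximating
    'monitor GPU workloads and assign ops to the most suitable stream'."""
    streams = [Stream(sid=i) for i in range(num_streams)]
    for op in ops:
        target = min(streams, key=lambda s: s.load)  # least-loaded stream
        target.ops.append(op)
        target.load += op.cost
    return streams

# Ops arriving from two concurrent queries, interleaved rather than
# launched query-by-query:
ops = [Op("mlp_q1", 5.0), Op("emb_q1", 3.0), Op("mlp_q2", 5.0), Op("emb_q2", 3.0)]
streams = assign_ops(ops, num_streams=2)
for s in streams:
    print(s.sid, [op.name for op in s.ops], s.load)
```

Because ops from different queries are interleaved across streams instead of being launched query-wise onto one stream, total estimated work ends up balanced (here, 8.0 ms on each stream).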
Keywords:
Data Mining: DM: Recommender systems
Data Mining: DM: Information retrieval
Planning and Scheduling: PS: Scheduling
