Continuous-Time Reward Machines

Amin Falah, Shibashis Guha, Ashutosh Trivedi

Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence
Main Track. Pages 5056-5064. https://doi.org/10.24963/ijcai.2025/563

Reinforcement Learning (RL) is a sampling-based method for sequential decision-making, in which a learning agent iteratively converges toward an optimal policy by leveraging feedback from the environment in the form of scalar reward signals. While timing information is often abstracted away in discrete-time domains, time-critical learning applications, such as queuing systems, population processes, and manufacturing systems, are naturally modeled as Continuous-Time Markov Decision Processes (CTMDPs). Since the seminal work of Bradtke and Duff, model-free RL for CTMDPs has become well understood. However, in many practical applications, practitioners possess high-quality information about system rates derived from traditional queuing theory, which learning agents could exploit to accelerate convergence. Despite this, classical RL algorithms for CTMDPs typically re-learn these parameters through sampling. In this work, we propose continuous-time reward machines (CTRMs), a novel framework that embeds reward functions and real-time state-action dynamics into a unified structure. CTRMs enable RL agents to effectively navigate dense-time environments while leveraging reward shaping and counterfactual experiences for accelerated learning. Our empirical results demonstrate CTRMs' ability to improve learning efficiency in time-critical environments.
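The abstract does not spell out the formal construction of a CTRM, so the following is only a minimal illustrative sketch, assuming a CTRM resembles a classical reward machine whose transitions fire on labeled events and whose rewards may depend on the continuous dwell time spent between events. Every name, rate, and reward function below is a hypothetical stand-in, not the authors' definition.

```python
# Hypothetical sketch of a continuous-time reward machine (CTRM).
# Assumption (not from the paper): a Mealy-style machine over event labels
# whose rewards are functions of the continuous dwell time in the current
# machine state, driven by exponential sojourn times as in a CTMDP.

import math
import random
from dataclasses import dataclass, field


@dataclass
class CTRM:
    """Reward machine over event labels with time-dependent rewards."""
    initial: str
    # (state, event) -> (next_state, reward_fn taking dwell time -> float)
    transitions: dict = field(default_factory=dict)

    def __post_init__(self):
        self.state = self.initial
        self.entered_at = 0.0  # global time when the current state was entered

    def step(self, event, now):
        """Consume an event at global time `now`; return the emitted reward."""
        key = (self.state, event)
        if key not in self.transitions:
            return 0.0  # event is irrelevant in this machine state
        next_state, reward_fn = self.transitions[key]
        dwell = now - self.entered_at  # continuous time spent in this state
        self.state, self.entered_at = next_state, now
        return reward_fn(dwell)


# Example task: reward for reaching the goal after picking up a key, with the
# reward discounted in the *continuous* time elapsed (rate 0.1 is arbitrary).
machine = CTRM(
    initial="u0",
    transitions={
        ("u0", "got_key"): ("u1", lambda t: 0.0),
        ("u1", "at_goal"): ("u2", lambda t: math.exp(-0.1 * t)),
    },
)

# Simulate (event, rate) pairs as a CTMDP rollout might produce them,
# with sojourn times drawn from exponential distributions of known rate.
now, total = 0.0, 0.0
for event, rate in [("got_key", 2.0), ("noise", 1.0), ("at_goal", 0.5)]:
    now += random.expovariate(rate)  # exponential sojourn time
    total += machine.step(event, now)
print(f"machine state: {machine.state}, accumulated reward: {total:.3f}")
```

The dwell-time argument is what distinguishes this sketch from a discrete-time reward machine: rewards can discount or gate on real elapsed time, mirroring the exponential sojourn times of a CTMDP and giving the agent a structure against which reward shaping and counterfactual updates can be defined.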
Keywords:
Machine Learning: ML: Reinforcement learning
Planning and Scheduling: PS: Learning in planning and scheduling
Planning and Scheduling: PS: Markov decision processes
AI Ethics, Trust, Fairness: ETF: Explainability and interpretability