2025 IJCAI IJCAI 2025

No Regret Reinforcement Learning Algorithms for Online Scheduling with Multi-Stage Tasks

Abstract

We study online task scheduling problems where tasks arrive sequentially and are processed by the platform or server. The service processes for tasks are multi-stage and are modeled as episodic Markov Decision Processes (MDPs). While processing a task, the system acquires rewards by consuming resources. The goal of the platform is to maximize the reward-to-cost ratio over a sequence of K tasks. Online scheduling with multi-stage tasks faces two major challenges: intra-dependence among the different stages within a task and inter-dependence among different tasks. These challenges are further exacerbated by the unknown rewards, costs, and task arrival distribution. To address these challenges, we propose the Robbins-Monro-based Value Iteration for Ratio Maximization (RM^2VI) algorithm. Specifically,RM^2VI addresses ``intra-dependence'' through optimistic value iteration and handles ``inter-dependence'' using the Robbins-Monro method. The algorithm has a greedy structure and achieves a sub-linear regret of O(K^(3/4)), establishing the no-regret property (per-task). We test RM^2VI in two synthetic experiments of sale promotion in E-commerce and machine learning job training in cloud computing. The results show RM^2VI achieves the best reward-to-cost ratio compared with the baselines.

🌉 Interdisciplinary Bridge — Machine Learning and Reinforcement Learning
🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy