AAAI 2026

Efficient Contextual Bandit Learning via Reward-Space Sampling and Online Optimization (Student Abstract)

Abstract

The contextual multi-armed bandit problem underlies applications in recommendations, e-commerce, finance, and healthcare, where balancing exploration and exploitation is critical. While algorithms such as Upper Confidence Bound (UCB) and Thompson Sampling (TS) achieve strong theoretical guarantees, they often incur heavy computational cost from high-dimensional parameter estimation. We propose a new approach that combines reward sampling with online stochastic optimization. At each round, the algorithm samples hypothetical rewards for all actions and selects the action with the largest draw; the observed reward then updates the model via stochastic optimization. This design is both simple and efficient, preserving exploration while avoiding the pitfalls of greedy behavior on near-duplicate arms. Across synthetic and real-world datasets, our method attains near-optimal reward more quickly and with substantially lower computation than TS and UCB, demonstrating that sampling directly in reward space can improve both statistical efficiency and scalability.
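The per-round loop described in the abstract (sample a hypothetical reward for every action, play the largest draw, update the model from the observed reward) can be sketched as follows. This is a minimal illustration under assumptions the abstract does not state: a linear reward model per arm, Gaussian noise of fixed scale added in reward space, and plain SGD on squared error. The names `RewardSamplingBandit`, `noise_scale`, `lr`, and `simulate` are hypothetical, and the code should not be read as the authors' implementation.

```python
import numpy as np


class RewardSamplingBandit:
    """Illustrative sketch only: linear per-arm model, Gaussian reward noise, SGD."""

    def __init__(self, n_arms, dim, noise_scale=0.5, lr=0.05, seed=0):
        self.rng = np.random.default_rng(seed)
        self.weights = np.zeros((n_arms, dim))  # one weight vector per arm
        self.noise_scale = noise_scale          # assumed fixed sampling width
        self.lr = lr                            # assumed constant SGD step size

    def select(self, context):
        # Predicted mean reward for each arm under the current model.
        means = self.weights @ context
        # Sample one hypothetical reward per arm directly in reward space.
        draws = means + self.noise_scale * self.rng.standard_normal(means.shape)
        # Play the arm with the largest draw.
        return int(np.argmax(draws))

    def update(self, context, arm, reward):
        # One SGD step on 0.5 * (prediction - reward)^2 for the played arm only.
        error = self.weights[arm] @ context - reward
        self.weights[arm] -= self.lr * error * context


def simulate(n_rounds=2000, n_arms=3, dim=4, seed=0):
    """Run the sketch on a synthetic linear environment (hypothetical setup)."""
    rng = np.random.default_rng(seed)
    true_weights = rng.normal(size=(n_arms, dim))
    agent = RewardSamplingBandit(n_arms, dim, seed=seed + 1)
    total = 0.0
    for _ in range(n_rounds):
        context = rng.normal(size=dim)
        arm = agent.select(context)
        reward = true_weights[arm] @ context + 0.1 * rng.normal()
        agent.update(context, arm, reward)
        total += reward
    return agent, total / n_rounds


if __name__ == "__main__":
    agent, avg_reward = simulate()
    print(f"average reward per round: {avg_reward:.3f}")
```

Each round costs one matrix-vector product and one vector update, with no posterior covariance or matrix inverse to maintain, which is the kind of saving relative to TS and UCB that the abstract points to. How the sampling width should be chosen or adapted is not specified in the abstract; the fixed `noise_scale` here is a placeholder.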
