Efficient Contextual Bandit Learning via Reward-Space Sampling and Online Optimization (Student Abstract)

Egor Suraveikin; Dastan Omirzak; Roman Sultimov; Yury Maximov

AAAI 2026

Abstract

The contextual multi-armed bandit problem underlies applications in recommendations, e-commerce, finance, and healthcare, where balancing exploration and exploitation is critical. While algorithms such as Upper Confidence Bound (UCB) and Thompson Sampling (TS) achieve strong theoretical guarantees, they often incur heavy computational cost from high-dimensional parameter estimation. We propose a new approach that combines reward sampling with online stochastic optimization. At each round, the algorithm samples hypothetical rewards for all actions and selects the action with the largest draw; the observed reward then updates the model via stochastic optimization. This design is both simple and efficient, preserving exploration while avoiding the pitfalls of greedy behavior on near-duplicate arms. Across synthetic and real-world datasets, our method attains near-optimal reward more quickly and with substantially lower computation than TS and UCB, demonstrating that sampling directly in reward space can improve both statistical efficiency and scalability.
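The round structure described in the abstract (sample a hypothetical reward per action, play the largest draw, update the model from the observed reward) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the reward model, the sampling distribution, or the optimizer, so the per-arm linear model, the fixed-scale Gaussian draws (`noise_scale`), the plain SGD step on squared error (`lr`), and the names `RewardSamplingBandit` and `simulate` are all assumptions made for illustration.

```python
import numpy as np


class RewardSamplingBandit:
    """Illustrative sketch only: per-arm linear reward model (assumed),
    Gaussian sampling in reward space (assumed), SGD updates (assumed)."""

    def __init__(self, n_arms, dim, noise_scale=0.5, lr=0.05, seed=0):
        self.weights = np.zeros((n_arms, dim))  # one weight vector per arm
        self.noise_scale = noise_scale          # hypothetical exploration scale
        self.lr = lr                            # hypothetical SGD step size
        self.rng = np.random.default_rng(seed)

    def select(self, context):
        # Predicted mean reward of every arm for this context.
        means = self.weights @ context
        # Sample one hypothetical reward per arm directly in reward space.
        draws = means + self.noise_scale * self.rng.standard_normal(means.shape)
        # Play the arm with the largest sampled reward.
        return int(np.argmax(draws))

    def update(self, arm, context, reward):
        # One SGD step on 0.5 * (prediction - reward)^2 for the played arm only.
        error = self.weights[arm] @ context - reward
        self.weights[arm] -= self.lr * error * context


def simulate(n_rounds=2000, n_arms=3, dim=4, seed=0):
    """Run the sketch on a synthetic linear environment (also an assumption)."""
    rng = np.random.default_rng(seed)
    true_weights = rng.normal(size=(n_arms, dim))
    agent = RewardSamplingBandit(n_arms, dim, seed=seed + 1)
    total = 0.0
    for _ in range(n_rounds):
        context = rng.normal(size=dim) / np.sqrt(dim)
        arm = agent.select(context)
        reward = true_weights[arm] @ context + 0.1 * rng.normal()
        agent.update(arm, context, reward)
        total += reward
    return agent, true_weights, total / n_rounds
```

In this sketch each round costs one matrix-vector product for selection and one vector update for learning, with no matrix inversion or posterior sampling over parameters. Whether the paper's method has the same per-round cost depends on details the abstract does not give.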

Topics

Machine Learning > Learning Types > Active Learning
Mathematics & Optimization > Optimization > Online Algorithms

Keywords

stochastic optimization, exploration-exploitation tradeoff, online optimization, contextual bandit, reward sampling
