Adversarially robust stochastic multi-armed bandits

Julian Ulf Zimmert

Publication: PhD thesis · Research

Abstract

Multi-armed bandits are a fundamental problem in sequential decision making, with applications in optimal experimental design, online advertising, recommender systems, and more. In the multi-armed bandit problem, an agent repeatedly chooses an action from a finite set of options.

After each action, the agent receives and observes a reward. For example, the action might be to prescribe a certain drug, and the reward is the outcome of the patient's treatment. The goal of the agent is to maximise its cumulative reward. It is commonly assumed that the rewards are drawn i.i.d. from distributions that depend only on the agent's action. This assumption makes it possible to derive algorithms that provably learn fast. However, even minor violations of it can prevent these algorithms from learning anything at all.
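To make this interaction protocol concrete, the following is a minimal Python simulation of the stochastic setting just described. The epsilon-greedy learner, the Bernoulli rewards, and all constants are illustrative assumptions for this sketch, not taken from the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)
K, T = 5, 10_000                      # number of arms, number of rounds
means = rng.uniform(size=K)           # hypothetical i.i.d. Bernoulli reward means

counts = np.zeros(K)                  # pulls per arm
sums = np.zeros(K)                    # cumulative reward per arm
total = 0.0

for t in range(T):
    # A simple epsilon-greedy learner stands in for the agent.
    if rng.random() < 0.1 or counts.min() == 0:
        a = int(rng.integers(K))              # explore a random arm
    else:
        a = int(np.argmax(sums / counts))     # exploit the best empirical mean
    r = float(rng.random() < means[a])        # i.i.d. Bernoulli reward
    counts[a] += 1
    sums[a] += r
    total += r

print(f"cumulative reward {total:.0f} vs. best possible ~{T * means.max():.0f}")
```

Under the i.i.d. assumption, strategies of this kind concentrate on the best arm; if an adversary chooses the rewards instead, the empirical means stop being informative, which is exactly the failure mode described above.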

Another line of research aims for maximal robustness: these adversarial algorithms come with performance guarantees even in the absence of any stochastic model. The drawback is that learning is significantly slower.

Naturally, the question arises whether a combination of both properties is achievable: are there robust algorithms that also adapt automatically to easier, i.e. stochastic, environments? This question remained open for seven years; its resolution is the major contribution of this thesis.

We derive an algorithm that is maximally robust and simultaneously achieves optimal rates in the stochastic setting. Furthermore, it is optimal in several intermediate regimes, and we extend these results to problems beyond multi-armed bandits.
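The abstract does not name the algorithm. As one known method achieving this best-of-both-worlds guarantee, here is a sketch of Tsallis-INF (Zimmert and Seldin, 2019): online mirror descent over the probability simplex with 1/2-Tsallis-entropy regularization and importance-weighted loss estimates. The Newton solver, learning-rate schedule, and Bernoulli environment below are an illustrative reconstruction under those assumptions, not code from the thesis.

```python
import numpy as np

def tsallis_inf_weights(L_hat, eta, iters=30):
    """Newton's method for w_i = 4 / (eta * (L_hat_i - x))**2 with sum(w) = 1.

    This is the mirror-descent step for the 1/2-Tsallis-entropy regularizer;
    x is the Lagrange multiplier enforcing the simplex constraint.
    """
    x = L_hat.min() - 2.0 / eta        # start where the leading weight equals 1
    for _ in range(iters):
        w = 4.0 / (eta * (L_hat - x)) ** 2
        x -= (w.sum() - 1.0) / (eta * (w ** 1.5).sum())
    w = 4.0 / (eta * (L_hat - x)) ** 2
    return w / w.sum()                 # renormalize away residual Newton error

rng = np.random.default_rng(0)
K, T = 5, 10_000
loss_means = rng.uniform(size=K)       # hypothetical Bernoulli losses (1 - reward)
L_hat = np.zeros(K)                    # importance-weighted cumulative loss estimates
counts = np.zeros(K)

for t in range(1, T + 1):
    eta = 1.0 / np.sqrt(t)             # anytime learning-rate schedule
    w = tsallis_inf_weights(L_hat, eta)
    a = int(rng.choice(K, p=w))        # sample an arm from the computed distribution
    loss = float(rng.random() < loss_means[a])
    L_hat[a] += loss / w[a]            # unbiased importance-weighted estimate
    counts[a] += 1

print("best arm:", int(loss_means.argmin()), "most played:", int(counts.argmax()))
```

In the analysis of Zimmert and Seldin, this single algorithm attains the minimax adversarial regret rate of order sqrt(KT) and, without any modification or test for stochasticity, simultaneously achieves logarithmic regret in the stochastic regime, matching the guarantee stated in the abstract.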
