Non-Stationary Delayed Bandits with Intermediate Observations
Claire Vernade - DeepMind (UK)
We consider the problem of learning with delayed bandit feedback, that is, by trial and error, in changing environments. This problem is ubiquitous in online recommender systems that aim to show content ultimately evaluated through long-term metrics, such as a purchase or watch time. Mitigating the effects of delays in stationary environments is well understood, but the problem becomes much more challenging when the environment changes. In fact, if the timescale of the change is comparable to the delay, it is impossible to learn about the environment, since by the time observations arrive they are already obsolete. However, these issues can be addressed if relevant intermediate signals are available without delay, such that, given those signals, the long-term behavior of the system is stationary. To model this situation, we introduce the problem of stochastic, non-stationary, delayed bandits with intermediate observations. We develop a computationally efficient algorithm based on UCRL and prove sublinear regret guarantees for its performance.
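The setting described above can be illustrated with a toy simulation. The sketch below is an assumption-laden simplification, not the UCRL-based algorithm from the talk: each arm produces an immediate binary intermediate signal whose distribution drifts over time, while the final reward depends only on that signal through a fixed (stationary) conversion rate and arrives after a delay. The learner tracks the stationary conversion rates from all matured delayed rewards, tracks the drifting arm-to-signal probabilities from a recent window of immediate observations, and combines the two with a UCB-style index. All constants (drift pattern, window size, delay) are illustrative.

```python
import math
import random

def run(T=3000, K=3, D=100, window=200, seed=1):
    """Toy non-stationary delayed bandit with intermediate observations.

    Hypothetical model, for illustration only: signal s in {0, 1} is
    observed immediately; reward r ~ Bernoulli(conv[s]) arrives D steps
    later. P(s=1 | arm) drifts sinusoidally over time.
    """
    rng = random.Random(seed)
    conv = [0.2, 0.8]                  # stationary P(reward = 1 | signal s)
    recent = [[] for _ in range(K)]    # sliding window of signals per arm
    conv_wins = [0, 0]                 # delayed-reward successes per signal
    conv_n = [0, 0]                    # delayed-reward counts per signal
    pending = []                       # (due_time, signal, reward) queue
    pulls = [0] * K
    total_reward = 0

    for t in range(T):
        # Collect rewards whose delay has elapsed.
        while pending and pending[0][0] <= t:
            _, s, r = pending.pop(0)
            conv_wins[s] += r
            conv_n[s] += 1
            total_reward += r

        # Conversion is stationary, so all matured history is usable.
        c_hat = [conv_wins[s] / conv_n[s] if conv_n[s] else 0.5
                 for s in (0, 1)]

        # UCB-style index: windowed signal estimate times conversion
        # estimate, plus an exploration bonus; unpulled arms first.
        def index(a):
            if not recent[a]:
                return float("inf")
            p1 = sum(recent[a]) / len(recent[a])
            bonus = math.sqrt(2 * math.log(t + 1) / len(recent[a]))
            return p1 * c_hat[1] + (1 - p1) * c_hat[0] + bonus

        a = max(range(K), key=index)
        pulls[a] += 1

        # Environment: drifting arm-to-signal probability (illustrative).
        p_sig = 0.5 + 0.45 * math.sin(2 * math.pi * t / 1500.0 + a)
        s = 1 if rng.random() < p_sig else 0
        recent[a].append(s)
        if len(recent[a]) > window:
            recent[a].pop(0)

        # Reward generated now, revealed after delay D.
        r = 1 if rng.random() < conv[s] else 0
        pending.append((t + D, s, r))

    return pulls, total_reward
```

Because the signal is observed without delay, the learner can react to the drift on the signal level while pooling the (slowly arriving) delayed rewards to estimate the stationary conversion rates.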
Claire is a Research Scientist at DeepMind in London, UK. She received her PhD from Telecom ParisTech in October 2017, under the guidance of Prof. Olivier Cappé. From January to October 2018, she worked part-time as an Applied Scientist at Amazon in Berlin, while doing a postdoc with Alexandra Carpentier at the University of Magdeburg in Germany. Her research focuses on sequential decision making, mostly bandit problems, though her interests also extend to reinforcement learning and learning theory. While keeping concrete problems in mind, often inspired by interactions with product teams, she focuses on theoretical approaches, aiming for provably optimal algorithms. She recently received an Outstanding Paper Award at ICLR for joint work on a game-theoretic approach to PCA.
2021-10-18 at 3:00 pm