Speaker: Teodor Marinov

Affiliation: Johns Hopkins University

Title: Beyond Value-Function Gaps: Improved Instance-Dependent Regret Bounds for Episodic Reinforcement Learning

Abstract:

Reinforcement Learning (RL) is a general scenario where agents interact with the environment to achieve some goal. The environment and an agent’s interactions are typically modeled as a Markov decision process (MDP), which can represent a rich variety of tasks. But, for which MDPs can an agent or an RL algorithm succeed? This requires a theoretical analysis of the complexity of an MDP. In this talk I will present information-theoretic lower bounds for a large class of MDPs. The lower bounds are based on studying a certain semi-infinite LP. Further, I will show that existing algorithms enjoy tighter gap-dependent regret bounds (similar to the stochastic multi-armed bandit problem), however, they are unable to achieve the information-theoretic lower bounds even in deterministic transition MDPs, unless there is a unique optimal policy.