# Reinforcement Learning notes

Outline of David Silver’s RL course parts from Andrew Ng and Arulkuman et al. 17)

# Intro

Markov Decision Process is a tuple $(S, A, \{P_{sa}\}, \gamma, R)$ where $S$ is the set of states, $A$ is a set of actions, $\{P_{sa}\}$ are transition probabilities, $\gamma$ is a discount factor and $R : S \times A \rightarrow \mathbb{R}$ is the reward function.

Starting in state $s_0$, we choose action $a_0 \in A$ and arrive at state $s_1 \sim P_{s_0a_0}$. The reward is given by $R(s_0, a_0) + \gamma R(s_1, a_1) + \gamma^2 R(s_2, a_2) = R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2)$. The goal is to maximize the expected reward, $E[R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2)...]$, which encourages immediate reward if possible.

A policy is any function $\pi : S \rightarrow A$ for the action that should be taken to derive $a_i = \pi(s_i)$. This lets us define a value function with respect to a policy, $\pi$, $V^\pi(s) = E[R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2)... | s_0 = s, \pi]$, which is the expected discounted reward for a given policy.

For a fixed policy $\pi$, the value function satisfies the Bellman equation $V^\pi(s) = R(s) + \gamma\sum_{s' \in S}P_{s\pi(s)}(s')V^\pi(s').$, which is basicaly a sum over all possible states that we could end up according to the transition probabilities. Here, $R(s)$ is the immediate reward. This equation can be used efficiently to solve for $V^\pi$ in a finite state MDP because we can write one for each state and solve $|S|$ linear equations with $|S|$ variables (the $\{V^\pi(s') : s' \in S\}$). There is also an optimal policy $V^*(s) = max_{\pi} V^{\pi}(s)$ for each state, which can get decomposed as a Bellman equation $V^*(s) = R(s) + \max_a \gamma \sum_{s' \in S}P_{sa}(s')V^*(s')$.

This lets us define an optimal policy $\pi^*(s) = \arg\max_a \sum_{s' \in S} P_{sa}(s')V^*(s')$, and we have $V^*(s) = V^{\pi^*}(s) \geq V^\pi(s)$.

## Solving finite-state MDPs

### Value Iteration

For each state, initialize $V(s) = 0$. Until convergence, for each state, update $V(s) = R(s) + \max_a \gamma \sum_{s' \in S} P_sa(s')V(s')$, which is repeatedly updating using the Bellman equations.

### Policy Iteration

Initialize $\pi$ randomly and let $V = V^\pi$. Then for each state $s$, update $\pi(s) = \arg\max_a \sum_{s'} P_sa(s')V(s')$. Note that the firs step requires solving a system of equations. Thus, policy iteration takes longer in practice but both are used.

fun problem: show that these both converge to $V^*$ and $\pi^*$ respectively.

## Learning a model for the MDP

In reality, we may not know the transition probabilities $P_{sa}$ and rewards, $R(s)$ (but we must estimate from data). One estimation trick is to use the MLE estimate of $P_{sa}(s') = \frac{\text{\#times} (a, s) \rightarrow s'}{\text{\#times} (a,s)}$ with some smoothing, e.g. $\frac{1}{|S|}$. Similarly, we can use MLE reward for state $s$.

This can be extended to policy/value iteration by taking an existing policy, executing for a number of trials, and then reestimating reward/transition probabilities. Specifically, after initializing $\pi$ randomly, execute $\pi$, update estimates, apply value iteration to get new estimates, update $\pi$.

## Continuous state MDP or Value function approximation

In some cases, the states are continuous (e.g. physics) and so one method for finding policies is to approximate $V^*$ directly by using a model. A model takes any continuous state $s_t$ and action $a_t$ and outputs $s_{t+1}$ according to $P_{s_ta_t}$. This model could also be learned from data (e.g. by estimates, or learning a linear model, etc). The model could also be either deterministic or stochastic.

## Fitted value iteration

Recall we’re trying to solve $V(s) := R(s) + \gamma \max_a E_{s' \sim P_{sa}}[V(s')]$, but in the continuous case this will be an integral. This method approximate the value function over a finite sample of states with $V(s) = \theta^\top\phi(s)$ where $\phi$ is some feature mapping. Specifically, we randomly sample $m$ states $s^{(1)}, s^{(2)}, ..., s^{(m)} \in S$ and $\theta := 0$. Next, for each state $i$ and action $a \in A$, sample $s_1', ..., s_k' \sim P_{s^(i)a}$ using a model of MDP. Set $q(s) = \frac{1}{k}\sum_{j=1}^kR(s^{(i)}) + \gamma V(s_j')$, so $q$ is an estimate of $V(s^{(i)})$.

With these estimates for each $a$, set $y^{(i)} = \max_a q(a)$, which is the approximation of the value of the best action. Finally, set $\theta := \arg\min_\theta \frac{1}{2}\sum_{i=1}^m(\theta^{\top}\phi(s^{(i)})-y^{(i)})^2$, which is solving a linear regression so that we can generalize to all states.

Note that this method requires $k|A|$ samples per iteration from the model, which could be expensive too.

# Revisiting intro: model-free control and prediction

## A state-action value quality function

We can define a function $Q^{\pi}(s, a)$ which is similar to $V$ except the initial action $a$ is provided, $Q^{\pi}(s,a) = E[R | s, a, \pi]$. Learning this Q-table is no longer a prediction task (what is the value?) but it is now a control task (which action is the best?). Using this function, the Bellman equation is simplified to $Q^\pi(s_t, a_t) = E_{s_{t+1}}[R(s_{t+1}) + \gamma Q^\pi(s_{t+1}, \pi(s_{t+1}))$, which corresponds to the idea of “future reward” given action $a_t$. Note that maximiz3ing over $a_t$ gives the term in the Bellman Equation from earlier, and so it can be improved by solving and iterating.

## Temporal Difference (TD)

The idea is to update Q values by taking a step in the MDP and adjusting the Q (or value) function appropriately. SARSA and Q-Learning are both TD algorithms. For the value function, they have the form of $V(s) \leftarrow V(s) + \alpha (Y - V(s))$, where $\alpha$ is a learning rate and $Y$ is a TD target. The difference $\delta = Y - V(s)$ is the TD error.

### On-policy, SARSA

This has the following update rule after interacting with the environment (using $\pi$).

$Q^\pi(s_t, a_t) \leftarrow Q^\pi(s_t, a_t) + \alpha [r_t + \gamma Q^\pi(s_{t+1}, a_{t+1}) - Q^\pi(s_t, a_t)]$ where $\alpha$ is a learning rate and the second term can be interpreted as the error between state-actions in $t$ and $t+1$. Here, $a_{t+1} = \pi(s_{t+1})$

### Off-policy, Q-Learning

Now, the update rule is $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha [r_t + \gamma \max_{a_{t+1}} (Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t))]$. Note that earlier, we knew what $a_{t+1}$ would because we had a policy. Now, there is no policy.

## Monte Carlo (MC)

Over many episodes, update $V(s)$ using empirical means: $V(s) = S(s)/N(s)$ where $S(s)$ is the total return at $s$ and $N$ is a counter of visits to that state. The inner updates can be done once per episode on the first-visit or at every-visit. Note that the total return/reward is only known by the end of the episode.

## $\lambda$: between sampling and TD learning

The TD algorithms described earlier were actually $TD(\lambda)$, $\lambda = 0$ learning algorithms. $\lambda$ lets us interpolate between MC and TD approaches. TD is generally more desirable because it an learn before knowing the final reward and even without knowing the final outcome. This is because the TD target in TD approaches is $r + \gamma V(s_{t+1})$, which is based on a fixed (existing) policy and so it is a biased estimate of $V^\pi(s)$, while the MC approach is an unbiased estimate. Another difference is that TD assumes a Markov environment, while MC is usually more effective in nonmarkov environments.

TD and MC can be unified by looking at the TD target, $Y$, and defining an $n$-step return: \begin{align} n = 1 &\Rightarrow G_t^{(1)} = R_{t+1} + \gamma V(S_{t+1}) \\ n = 2 &\Rightarrow G_t^{(2)} = R_{t+1} + \gamma R_{t+2} + \gamma^2 V(S_{t+2}) \\ \vdots &\\ n = \infty& \Rightarrow G_t^{(\infty)} = R_{t+1} + \gamma R_{t+2} + \ldots + \gamma^{T-1}R_T \end{align}

These $n$ step returns can be combined in a geometric fashion with a parameter $\lambda$ to get $G_{t}^\lambda = (1-\lambda)\sum_{n=1}^\infty \lambda^{n-1}G_t^{(n)}$. Note that in $TD(0)$, the reward is immediate (we only use $n=1$) and we have regular TD learning. When $\lambda = 1$, this sum of errors telescopes to the MC error. However, this is still not a perfect correspondence unless the updates occur at the end of the episode.

# Function approximation

When there are large (or continuous) state spaces, we want to instead approximate $V$ or $Q$ with $\hat{V}(s, w) \approx V^{\pi}(s)$ or $\hat{Q}(s, a, w) \approx Q^{\pi}(s, a)$. This way, we can generalize from seen states to unseen states. $w$ is just a parameter (sometimes $\theta$ which we can update using MC or TD (or something else).

## Incremental methods

The function can be approximated using incremental updates with gradient descent. If $J(w)$ is differentiable, then we can define the gradient of $J$, $\nabla_wJ(w)$ and adjust $w$: $\delta w = -\frac{1}{2} \nabla_wJ(w)$. When we define $J(w) = E_{\pi}[(V^{\pi}(s) - \hat{V}(s, w))^2]$, we can use GD/SGD to update: $\delta w = \alpha(V^{\pi}(s) - \hat{V}(s, w))\nabla_w\hat{V}(s, w)$.

There are other ways to approximate: linear wrt features (then SGD is globally optimal and update rule is simple since $\nabla_w\hat{V}(s, w)$ are just the linear coefficients).

We can show that this still works even if the reward isn’t until the end (e.g. instead of $V^{\pi}(s)$ we could replace it with $G_t$ or whatever the TD target was.

While these examples used the value function, the same method can be used to approximate the Q function

## Batch methods

Gradient descent is not sample efficient (high sample complexity?). Batch methods can find the best fitting function per batch, which is more appealing (e.g. incremental least squares approximation vs linear algebra batch solution).

### Least squares 1

Suppose given the experience of $\langle state, value \rangle$ pairs $\mathcal{D} = \{(s_1, V_1^{\pi}), ..., (s_T, V_T^\pi)\}$, then we can from $\mathcal{D}$ and update for that sample using the SGD update rule earlier, and this converges to least squares solution for errors between $V^\pi$ and $\hat{V}$.

### DQN

Experience replay is a trick that assumes an $\epsilon$-greedy policy and stores $(s_t, a_t, r_{t+1}, s_{t+1})$ tuples in its memory $\mathcal{D}$. Random mini-batches can be sampled; the targets can be calculated, and the MSE between Q-network and Q-learning targets can be minimized: $L_i(w_i) = E_{(s, a, r, s') \sim \mathcal{D}}\left[\left(r + \gamma \max_{a'} Q(s' ,a ; w_{i}^{\text{old}}) - Q(s, a; w_{i}\right)^2\right].$ This loss, $L_i(w_i)$ can be optimized with SGD.

### Least squares 2

We can compute least squares directly if the value function is linear: $\hat{V}(s,w) = x(s)^\top w$ for some coefficients $x(s)$. At the solution, expected gradient is 0, so we can solve for $w$ to get $w = \left(\sum_{t=1}^T x(s_t)x(s_t)^\top\right)^{-1}\sum_{t=1}^{T}x(s_t)V_t^\pi$, which is familiar the least squares solution. We don’t know true $V_t^\pi$, but we can use various versions of the TD target (e.g. $G_t^{\lambda}$) and use the same solving process.

### Least squares for control

In control, we want to improve the policy, so we must learn off-policy. The change we need to make is similar to that from Q-learning. The update is $\delta = R_{t+1} + \gamma \hat{Q}(S_{t+1}, \pi'(S_{t+1}), w) - \hat{Q}(S_t, A_t, w)$ for a new policy $\pi'$, and the least squares can be solved the same way. This essentially “relearns” the experience $D$ under the new policy and the new policy is kept. The $Q$ functions are learned when the new policy is “close enough” to the old policy.

In the previous section, we assumed a fixed policy (e.g. $\epsilon$-greedy). Another approach is to directly approximate the policy, $\pi_{\theta}(s, a) = P(a \mid s, \theta)$ (note that it used to be $P(a \mid s)$). This may be desirable if we have high dimensional action spaces (e.g. text), and we can learn stochastic policies (so it can win RPS), and it has better convergence properties. The cons are that the convergence tends to be local and evaluating policies is inefficient with high variance.

The goal is to optimize $\pi_{\theta}(s, a)$ by finding the best $\theta$, but what is the objective? We could use $J_1(\theta) = V^{\pi_\theta}(s_1) = E_{\pi_0}[V_1]$ which is the expected value from the first state, or use the average value, $J_{avV}(\theta) = \sum_s d^{\pi_\theta}(s)V^{\pi_\theta}(s)$ or average reward per step $J_{avR}(\theta) = \sum_{s} d^{\pi_\theta}(s)\sum_a \pi_{\theta}(s, a)R_s^a$ where $d$ is the stationary distribution for the markov chain for $\pi_\theta$.

With an objective defined, we can maximize $J$, e.g. using gradient ascent. We can define $\delta \theta =\alpha\nabla_\theta J(\theta)$ where $\nabla_\theta J(\theta)$ is the policy gradient and $\alpha$ is a step size. This gradient can be computed with a finite difference method for any policy, but if the policy is differentiable then we can compute it directly.

## Policy Gradient Theorem

For any differentiable (softmax;log-linear) policy $\pi_\theta$ and for any of the objective functions $J_1$, $J_{avR}$ or $\frac{1}{1-\gamma}J_avV$, the policy gradient is $\nabla_\theta J(\theta) = E_{\pi_\theta}[\nabla_{\theta} \log \pi_\theta(s,a) Q^{\pi_\theta}(s, a)]$.

## No value function (MC policy gradient = REINFORCE)

REINFORCE uses the policy gradient theorem to perform updates. Initialize $\theta$ arbitrarily. Then for each episode following $\pi_\theta$, for each timestep $t = 1, ..., T - 1$, update $\theta \rightarrow \theta + \alpha \nabla_{\theta}\log\pi_{\theta}(s_t, a_t)v_t$.

## Actor-Critic (both approximate value and policy)

REINFORCE has high variance, so we can use a critic to estimate the action-value function, $Q_w(s,a) \approx Q^{\pi_\theta}(s, a)$. Actor-critic algorithms follow an approximate policy gradient as determined by $Q_w$: $\delta \theta = \alpha\nabla_\theta \log \pi_\theta(s, a)Q_w(s,a)$.

Thus, we can now update both $\theta$ and $w$ with each step. However, approximating $Q$ introduces bias that we want to avoid. The Compatible Function Approximation Theorem gives conditions for when the policy gradient is still exact.

Another approach is to reduce variance by subtracting a baseline from the update equation, e.g. $B(s) = V^{\pi_\theta}(s)$. Then our update is $E_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s, a) (Q^{\pi_\theta}(s, a) - B(s))] = E_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s, a) (Q^{\pi_\theta}(s, a) - V^{\pi_\theta}(s))]$

Really, we can estimate the advantage function, $A = Q-V$ directly if we wanted to.

### Other policy gradients

There are some fancier things we can do to estimate the policy gradient. See Advantage Actor-Critic, TD Actor-Critic, TD($\lambda$) Actor-Critic, Natural Actor-Critic. They will not be covered here.

# Model-based RL

In the previous section, we assumed we had an MDP with transitions $P_{sa}$. Now, we will try to learn the model from experience, and then we can use planning to come up with a good policy; ideally this would all be done in one architecture.

This is good if we can learn the model by supervised learning methods, and lets us reason about uncertainty (if we had a perfect model, we could just search). On the other hand, we need to both learn a model and construct functions, so there are more possible sources of error.

## Basic idea of model

A model is a representation of an MDP $\langle S, A, P ,R\rangle$ parameterized by $\eta$. So $S_{t+1} \sim P_\eta(S_{t+1} \mid S_t, A_t)$ and $R_{t+1} \sim R_\eta(R_{t+1} \mid S_t, A_t)$.

The goal is to estimate $\mathcal{M}_\eta$, which can be formulated as a supervised learning problem. $s, a \rightarrow r$ is a regression problem, while $s, a \rightarrow s'$ is a density estimation problem. We can pick parameters $\eta$ that minimize empirical loss.

## Planning

Given a model, we now want to plan (and find the best policy). This can be done using value/policy iteration or by a tree search. Another approach is to use sample-based planning, which uses the model to generate samples and then apply model-free RL to the samples. In practice this tends to be more efficient.

Note that when the model is wrong, planning will be even more wrong, so it is important that the model is accurate.

## Integration (Dyna)

We can integrate both planning and model based RL into one framework. We can consider two types of experience: real, from the environment, and simulated, sampled from our model. This is the Dyna-Q algorithm, which first initializes $Q, M$ and then:

1. $S \rightarrow$ current state
2. $A \rightarrow$ $\epsilon$-greedy(S, Q)
3. (real) Execute A, observe reward R and state S’.
4. $Q(s, a) \rightarrow Q(s, a) + \alpha[R + \gamma \max_a (Q(s', a) - Q(s, a))]$, which is the standard update rule.
5. Model(s, a) $\rightarrow R, S'$, which updates the model.
6. (simulated) Repeat $n$ times: S, A are random previously observed state and actions. Use the model to generate $R, S'$ and perform update rule.

This interleaves $n$ simulated steps with one real step.

Search algorithms either lookahead with existing model or sample model-free. Model-free approaches include MCTS and TD search depending on whether MC control or SARSA control is applied to the sub-MDP. Like earlier, TD approaches have less variance (more bias) and is usually more efficient.

# Exploration vs Exploitation

Exploitation: make best use of current info. Exploration: explore other options, which may be better than current.

## Bandits

A multi-armed bandit is a tuple $\langle A, R \rangle$ where $A$ is a set of actions (arms) and $R^a(r) = P(r \mid a)$ is a probability distribution over rewards. At each time step $t$, the agent selects an action $a_t \in A$ and the environment generates reward $r_t \sim R^{a_t}$. The goal is to maximize cumulative reward $\sum_{\tau=1}^t r_\tau$.

• Action Value is the mean reward of action $a$, i.e. $Q(s) = E[r \mid a]$.

• Optimal Value $V^*$ is $V^* = Q(a^*) = \max_{a \in A} Q(a)$ is value of the best arm.

• Regret is the opportunity loss for each step, $l_t = E[V^* - Q(a_t)]$

• Total regret is the total opportunity loss, $L_t = E\left[\sum_{\tau=1}^t V^* - Q(a_\tau)\right]$.

Maximizing total cumulative reward is equivalent to minimizing total regret. We can define a count $N_t(a)$ to be the expected number of selections of $a$ and define a gap $\delta_a = V^* - Q(a)$. Then alternatively, $L_t = \sum_{a \in A}E[N_t(a)]\delta_a$. In other words, we want low counts for large gaps.

## Greedy algorithms

If we explore forever, we will have linear total regret. If we never explore, we will also have linear total regret. It can be shown that greedy (and $\epsilon$-greedy) algorithms have linear regret with time – does there exist a strategy with sublinear total regret?

### Decaying $\epsilon$-greedy algorithm.

Idea: pick a decay schedule $\epsilon_1, \epsilon_2, ...$ and for $c > 0$, $d =$ smallest nonzero gap, $\epsilon_t = \min\{1, \frac{c|A|}{d^2t}\}$. We can show that this has logarithmic asymptotic total regret but this requires knowing gap sizes before hand…

### Lower bound [Lai and Robbins]

Previous work has shown $L \geq \log t \sum_{a \mid \delta_a \geq 0}\frac{\delta_a}{KL(R^a || R^{a^*}}$.

### Upper confidence bounds.

Idea: For each action, estimate an upper confidence bound (UCB) $\hat{U}_t(a)$ such that $Q(a) \leq \hat{Q}_t(a) + \hat{U}_t(a)$ with high probability, and select action maximizing UCB:

$a_t = \arg\max_{a\ in A} \hat{Q}_t(a) + \hat{U}_t(a).$

We can actually bound this using Hoeffding’s inequality, so for a given $p$, we can solve for $U_t(a) = \sqrt{\frac{-\log p}{2N_t(a)}}$.

By reducing $p$ over time, we can show that $U_t(a) = \sqrt{\frac{2\log t}{N_t(a)}}$, and so the UCB algorithm leads to asymptotic total regret, $L_t \leq 8\log t \sum_{a | \delta_a \geq 0} \delta_a$.

### Bayesian and other bandits

There is more to be said about bandits; they will not be covered here. Additional techniques include probability matching (Thompson sampling) and information state search.

## Contextual bandits

These consist of tuples $\langle A, S, R \rangle$ where $S = P(s)$ is an unknown distribution over contexts and so $R_s^a(r) = P(r \mid s, a)$ is an unknown probability distribution over rewards. At each step, the environment generates a state $s_t \sim S$ and then the agent selects and action $a_t \in A$; the reward is $r \sim R_{s_t}^{a_t}$. The goal is to maximize cumulative reward.

Why is this interesting? There’s an analogy to the action-value function: $Q(s,a) = E[r \mid s, a]$, and so we can use UCB here too. Bandit problems are cool because methods used to do well on bandit problems can be applied to the MDP setting.

# Self-play RL

For the $i$th player, denote its policy $\pi^i$ (and all other players have policy $\pi^{-i}$. Then the best policy is $\pi^i = \pi^i_*(\pi^{-i})$ and the Nash equilibrium is a joint policy such that for all $i$, $\pi^i = \pi^i_*(\pi^{-i})$.

We can treat this as an MDP with other players as part of the environment, and so the Nash equilibrium is a fixed point. We will only focus on two-player zero-sum games.

## Game tree

Classical methods such as minimax or alpha-beta define a single value function for both players and prunes the tree. This is already powerful, though a good value function is hard to define…

## Using TD to learn value function.

We can apply value-based RL algorithm by updating the value function towards the return in TD learning. This learned value function can then be used for better minimax search.

Two variations, TD Root and TD Lead, perform updates elsewhere in the tree.

## Imperfect information games

In imperfect information games, each player has a separate tree. Solutions here include iterative forward search (e.g. counterfactual regret minimization), self-play RL, and smooth UCT.

### Smooth UCT

This approach applies MCTS to the information-state-game tree by learning against and responding to the opponent average behavior. The average policy can be extracted $\pi_{avg}(a \mid s) = \frac{N(s,a)}{N(s)}$ and the action can be picked between $UCT(S)$ and $\pi_{avg}(\cdot \mid S)$ based on a probability parameter $\eta$.