Reinforcement Learning notes

Outline of David Silver’s RL course parts from Andrew Ng and Arulkuman et al. 17)


Markov Decision Process is a tuple (S,A,{Psa},γ,R)(S, A, \{P_{sa}\}, \gamma, R) where SS is the set of states, AA is a set of actions, {Psa}\{P_{sa}\} are transition probabilities, γ\gamma is a discount factor and R:S×AR : S \times A \rightarrow \mathbb{R} is the reward function.

Starting in state s0s_0, we choose action a0Aa_0 \in A and arrive at state s1Ps0a0s_1 \sim P_{s_0a_0}. The reward is given by R(s0,a0)+γR(s1,a1)+γ2R(s2,a2)=R(s0)+γR(s1)+γ2R(s2)R(s_0, a_0) + \gamma R(s_1, a_1) + \gamma^2 R(s_2, a_2) = R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2). The goal is to maximize the expected reward, E[R(s0)+γR(s1)+γ2R(s2)...]E[R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2)...], which encourages immediate reward if possible.

A policy is any function π:SA\pi : S \rightarrow A for the action that should be taken to derive ai=π(si)a_i = \pi(s_i). This lets us define a value function with respect to a policy, π\pi, Vπ(s)=E[R(s0)+γR(s1)+γ2R(s2)...|s0=s,π]V^\pi(s) = E[R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2)... | s_0 = s, \pi], which is the expected discounted reward for a given policy.

For a fixed policy π\pi, the value function satisfies the Bellman equation Vπ(s)=R(s)+γsSPsπ(s)(s)Vπ(s).V^\pi(s) = R(s) + \gamma\sum_{s' \in S}P_{s\pi(s)}(s')V^\pi(s')., which is basicaly a sum over all possible states that we could end up according to the transition probabilities. Here, R(s)R(s) is the immediate reward. This equation can be used efficiently to solve for VπV^\pi in a finite state MDP because we can write one for each state and solve |S||S| linear equations with |S||S| variables (the {Vπ(s):sS}\{V^\pi(s') : s' \in S\}). There is also an optimal policy V*(s)=maxπVπ(s)V^*(s) = max_{\pi} V^{\pi}(s) for each state, which can get decomposed as a Bellman equation V*(s)=R(s)+maxaγsSPsa(s)V*(s)V^*(s) = R(s) + \max_a \gamma \sum_{s' \in S}P_{sa}(s')V^*(s').

This lets us define an optimal policy π*(s)=argmaxasSPsa(s)V*(s)\pi^*(s) = \arg\max_a \sum_{s' \in S} P_{sa}(s')V^*(s'), and we have V*(s)=Vπ*(s)Vπ(s)V^*(s) = V^{\pi^*}(s) \geq V^\pi(s).

Solving finite-state MDPs

Value Iteration

For each state, initialize V(s)=0V(s) = 0. Until convergence, for each state, update V(s)=R(s)+maxaγsSPsa(s)V(s)V(s) = R(s) + \max_a \gamma \sum_{s' \in S} P_sa(s')V(s'), which is repeatedly updating using the Bellman equations.

Policy Iteration

Initialize π\pi randomly and let V=VπV = V^\pi. Then for each state ss, update π(s)=argmaxasPsa(s)V(s)\pi(s) = \arg\max_a \sum_{s'} P_sa(s')V(s'). Note that the firs step requires solving a system of equations. Thus, policy iteration takes longer in practice but both are used.

fun problem: show that these both converge to V*V^* and π*\pi^* respectively.

Learning a model for the MDP

In reality, we may not know the transition probabilities PsaP_{sa} and rewards, R(s)R(s) (but we must estimate from data). One estimation trick is to use the MLE estimate of Psa(s)=#times(a,s)s#times(a,s)P_{sa}(s') = \frac{\text{\#times} (a, s) \rightarrow s'}{\text{\#times} (a,s)} with some smoothing, e.g. 1|S|\frac{1}{|S|}. Similarly, we can use MLE reward for state ss.

This can be extended to policy/value iteration by taking an existing policy, executing for a number of trials, and then reestimating reward/transition probabilities. Specifically, after initializing π\pi randomly, execute π\pi, update estimates, apply value iteration to get new estimates, update π\pi.

Continuous state MDP or Value function approximation

In some cases, the states are continuous (e.g. physics) and so one method for finding policies is to approximate V*V^* directly by using a model. A model takes any continuous state sts_t and action ata_t and outputs st+1s_{t+1} according to PstatP_{s_ta_t}. This model could also be learned from data (e.g. by estimates, or learning a linear model, etc). The model could also be either deterministic or stochastic.

Fitted value iteration

Recall we’re trying to solve V(s):=R(s)+γmaxaEsPsa[V(s)]V(s) := R(s) + \gamma \max_a E_{s' \sim P_{sa}}[V(s')], but in the continuous case this will be an integral. This method approximate the value function over a finite sample of states with V(s)=θϕ(s)V(s) = \theta^\top\phi(s) where ϕ\phi is some feature mapping. Specifically, we randomly sample mm states s(1),s(2),...,s(m)Ss^{(1)}, s^{(2)}, ..., s^{(m)} \in S and θ:=0\theta := 0. Next, for each state ii and action aAa \in A, sample s1,...,skPs(i)as_1', ..., s_k' \sim P_{s^(i)a} using a model of MDP. Set q(s)=1kj=1kR(s(i))+γV(sj)q(s) = \frac{1}{k}\sum_{j=1}^kR(s^{(i)}) + \gamma V(s_j'), so qq is an estimate of V(s(i))V(s^{(i)}).

With these estimates for each aa, set y(i)=maxaq(a)y^{(i)} = \max_a q(a), which is the approximation of the value of the best action. Finally, set θ:=argminθ12i=1m(θϕ(s(i))y(i))2\theta := \arg\min_\theta \frac{1}{2}\sum_{i=1}^m(\theta^{\top}\phi(s^{(i)})-y^{(i)})^2, which is solving a linear regression so that we can generalize to all states.

Note that this method requires k|A|k|A| samples per iteration from the model, which could be expensive too.

Revisiting intro: model-free control and prediction

A state-action value quality function

We can define a function Qπ(s,a)Q^{\pi}(s, a) which is similar to VV except the initial action aa is provided, Qπ(s,a)=E[R|s,a,π]Q^{\pi}(s,a) = E[R | s, a, \pi]. Learning this Q-table is no longer a prediction task (what is the value?) but it is now a control task (which action is the best?). Using this function, the Bellman equation is simplified to Qπ(st,at)=Est+1[R(st+1)+γQπ(st+1,π(st+1))Q^\pi(s_t, a_t) = E_{s_{t+1}}[R(s_{t+1}) + \gamma Q^\pi(s_{t+1}, \pi(s_{t+1})), which corresponds to the idea of “future reward” given action ata_t. Note that maximiz3ing over ata_t gives the term in the Bellman Equation from earlier, and so it can be improved by solving and iterating.

Temporal Difference (TD)

The idea is to update Q values by taking a step in the MDP and adjusting the Q (or value) function appropriately. SARSA and Q-Learning are both TD algorithms. For the value function, they have the form of V(s)V(s)+α(YV(s))V(s) \leftarrow V(s) + \alpha (Y - V(s)), where α\alpha is a learning rate and YY is a TD target. The difference δ=YV(s)\delta = Y - V(s) is the TD error.

On-policy, SARSA

This has the following update rule after interacting with the environment (using π\pi).

Qπ(st,at)Qπ(st,at)+α[rt+γQπ(st+1,at+1)Qπ(st,at)] Q^\pi(s_t, a_t) \leftarrow Q^\pi(s_t, a_t) + \alpha [r_t + \gamma Q^\pi(s_{t+1}, a_{t+1}) - Q^\pi(s_t, a_t)] where α\alpha is a learning rate and the second term can be interpreted as the error between state-actions in tt and t+1t+1. Here, at+1=π(st+1)a_{t+1} = \pi(s_{t+1})

Off-policy, Q-Learning

Now, the update rule is Q(st,at)Q(st,at)+α[rt+γmaxat+1(Q(st+1,at+1)Q(st,at))] Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha [r_t + \gamma \max_{a_{t+1}} (Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t))]. Note that earlier, we knew what at+1a_{t+1} would because we had a policy. Now, there is no policy.

Monte Carlo (MC)

Over many episodes, update V(s)V(s) using empirical means: V(s)=S(s)/N(s)V(s) = S(s)/N(s) where S(s)S(s) is the total return at ss and NN is a counter of visits to that state. The inner updates can be done once per episode on the first-visit or at every-visit. Note that the total return/reward is only known by the end of the episode.

λ\lambda: between sampling and TD learning

The TD algorithms described earlier were actually TD(λ)TD(\lambda), λ=0\lambda = 0 learning algorithms. λ\lambda lets us interpolate between MC and TD approaches. TD is generally more desirable because it an learn before knowing the final reward and even without knowing the final outcome. This is because the TD target in TD approaches is r+γV(st+1)r + \gamma V(s_{t+1}), which is based on a fixed (existing) policy and so it is a biased estimate of Vπ(s)V^\pi(s), while the MC approach is an unbiased estimate. Another difference is that TD assumes a Markov environment, while MC is usually more effective in nonmarkov environments.

TD and MC can be unified by looking at the TD target, YY, and defining an nn-step return: n=1Gt(1)=Rt+1+γV(St+1)n=2Gt(2)=Rt+1+γRt+2+γ2V(St+2)n=Gt()=Rt+1+γRt+2++γT1RT\begin{align} n = 1 &\Rightarrow G_t^{(1)} = R_{t+1} + \gamma V(S_{t+1}) \\ n = 2 &\Rightarrow G_t^{(2)} = R_{t+1} + \gamma R_{t+2} + \gamma^2 V(S_{t+2}) \\ \vdots &\\ n = \infty& \Rightarrow G_t^{(\infty)} = R_{t+1} + \gamma R_{t+2} + \ldots + \gamma^{T-1}R_T \end{align}

These nn step returns can be combined in a geometric fashion with a parameter λ\lambda to get Gtλ=(1λ)n=1λn1Gt(n)G_{t}^\lambda = (1-\lambda)\sum_{n=1}^\infty \lambda^{n-1}G_t^{(n)}. Note that in TD(0)TD(0), the reward is immediate (we only use n=1n=1) and we have regular TD learning. When λ=1\lambda = 1, this sum of errors telescopes to the MC error. However, this is still not a perfect correspondence unless the updates occur at the end of the episode.

Function approximation

When there are large (or continuous) state spaces, we want to instead approximate VV or QQ with V̂(s,w)Vπ(s)\hat{V}(s, w) \approx V^{\pi}(s) or Q̂(s,a,w)Qπ(s,a)\hat{Q}(s, a, w) \approx Q^{\pi}(s, a). This way, we can generalize from seen states to unseen states. ww is just a parameter (sometimes θ\theta which we can update using MC or TD (or something else).

Incremental methods

The function can be approximated using incremental updates with gradient descent. If J(w)J(w) is differentiable, then we can define the gradient of JJ, wJ(w)\nabla_wJ(w) and adjust ww: δw=12wJ(w)\delta w = -\frac{1}{2} \nabla_wJ(w). When we define J(w)=Eπ[(Vπ(s)V̂(s,w))2]J(w) = E_{\pi}[(V^{\pi}(s) - \hat{V}(s, w))^2], we can use GD/SGD to update: δw=α(Vπ(s)V̂(s,w))wV̂(s,w)\delta w = \alpha(V^{\pi}(s) - \hat{V}(s, w))\nabla_w\hat{V}(s, w).

There are other ways to approximate: linear wrt features (then SGD is globally optimal and update rule is simple since wV̂(s,w)\nabla_w\hat{V}(s, w) are just the linear coefficients).

We can show that this still works even if the reward isn’t until the end (e.g. instead of Vπ(s)V^{\pi}(s) we could replace it with GtG_t or whatever the TD target was.

While these examples used the value function, the same method can be used to approximate the Q function

Batch methods

Gradient descent is not sample efficient (high sample complexity?). Batch methods can find the best fitting function per batch, which is more appealing (e.g. incremental least squares approximation vs linear algebra batch solution).

Least squares 1

Suppose given the experience of state,value\langle state, value \rangle pairs 𝒟={(s1,V1π),...,(sT,VTπ)}\mathcal{D} = \{(s_1, V_1^{\pi}), ..., (s_T, V_T^\pi)\}, then we can from 𝒟\mathcal{D} and update for that sample using the SGD update rule earlier, and this converges to least squares solution for errors between VπV^\pi and V̂\hat{V}.


Experience replay is a trick that assumes an ϵ\epsilon-greedy policy and stores (st,at,rt+1,st+1)(s_t, a_t, r_{t+1}, s_{t+1}) tuples in its memory 𝒟\mathcal{D}. Random mini-batches can be sampled; the targets can be calculated, and the MSE between Q-network and Q-learning targets can be minimized: Li(wi)=E(s,a,r,s)𝒟[(r+γmaxaQ(s,a;wiold)Q(s,a;wi)2].L_i(w_i) = E_{(s, a, r, s') \sim \mathcal{D}}\left[\left(r + \gamma \max_{a'} Q(s' ,a ; w_{i}^{\text{old}}) - Q(s, a; w_{i}\right)^2\right]. This loss, Li(wi)L_i(w_i) can be optimized with SGD.

Least squares 2

We can compute least squares directly if the value function is linear: V̂(s,w)=x(s)w\hat{V}(s,w) = x(s)^\top w for some coefficients x(s)x(s). At the solution, expected gradient is 0, so we can solve for ww to get w=(t=1Tx(st)x(st))1t=1Tx(st)Vtπw = \left(\sum_{t=1}^T x(s_t)x(s_t)^\top\right)^{-1}\sum_{t=1}^{T}x(s_t)V_t^\pi, which is familiar the least squares solution. We don’t know true VtπV_t^\pi, but we can use various versions of the TD target (e.g. GtλG_t^{\lambda}) and use the same solving process.

Least squares for control

In control, we want to improve the policy, so we must learn off-policy. The change we need to make is similar to that from Q-learning. The update is δ=Rt+1+γQ̂(St+1,π(St+1),w)Q̂(St,At,w)\delta = R_{t+1} + \gamma \hat{Q}(S_{t+1}, \pi'(S_{t+1}), w) - \hat{Q}(S_t, A_t, w) for a new policy π\pi', and the least squares can be solved the same way. This essentially “relearns” the experience DD under the new policy and the new policy is kept. The QQ functions are learned when the new policy is “close enough” to the old policy.

Policy Gradients

In the previous section, we assumed a fixed policy (e.g. ϵ\epsilon-greedy). Another approach is to directly approximate the policy, πθ(s,a)=P(as,θ)\pi_{\theta}(s, a) = P(a \mid s, \theta) (note that it used to be P(as)P(a \mid s)). This may be desirable if we have high dimensional action spaces (e.g. text), and we can learn stochastic policies (so it can win RPS), and it has better convergence properties. The cons are that the convergence tends to be local and evaluating policies is inefficient with high variance.

The goal is to optimize πθ(s,a)\pi_{\theta}(s, a) by finding the best θ\theta, but what is the objective? We could use J1(θ)=Vπθ(s1)=Eπ0[V1]J_1(\theta) = V^{\pi_\theta}(s_1) = E_{\pi_0}[V_1] which is the expected value from the first state, or use the average value, JavV(θ)=sdπθ(s)Vπθ(s)J_{avV}(\theta) = \sum_s d^{\pi_\theta}(s)V^{\pi_\theta}(s) or average reward per step JavR(θ)=sdπθ(s)aπθ(s,a)RsaJ_{avR}(\theta) = \sum_{s} d^{\pi_\theta}(s)\sum_a \pi_{\theta}(s, a)R_s^a where dd is the stationary distribution for the markov chain for πθ\pi_\theta.

With an objective defined, we can maximize JJ, e.g. using gradient ascent. We can define δθ=αθJ(θ)\delta \theta =\alpha\nabla_\theta J(\theta) where θJ(θ)\nabla_\theta J(\theta) is the policy gradient and α\alpha is a step size. This gradient can be computed with a finite difference method for any policy, but if the policy is differentiable then we can compute it directly.

Policy Gradient Theorem

For any differentiable (softmax;log-linear) policy πθ\pi_\theta and for any of the objective functions J1J_1, JavRJ_{avR} or 11γJavV\frac{1}{1-\gamma}J_avV, the policy gradient is θJ(θ)=Eπθ[θlogπθ(s,a)Qπθ(s,a)]\nabla_\theta J(\theta) = E_{\pi_\theta}[\nabla_{\theta} \log \pi_\theta(s,a) Q^{\pi_\theta}(s, a)].

No value function (MC policy gradient = REINFORCE)

REINFORCE uses the policy gradient theorem to perform updates. Initialize θ\theta arbitrarily. Then for each episode following πθ\pi_\theta, for each timestep t=1,...,T1t = 1, ..., T - 1, update θθ+αθlogπθ(st,at)vt\theta \rightarrow \theta + \alpha \nabla_{\theta}\log\pi_{\theta}(s_t, a_t)v_t.

Actor-Critic (both approximate value and policy)

REINFORCE has high variance, so we can use a critic to estimate the action-value function, Qw(s,a)Qπθ(s,a)Q_w(s,a) \approx Q^{\pi_\theta}(s, a). Actor-critic algorithms follow an approximate policy gradient as determined by QwQ_w: δθ=αθlogπθ(s,a)Qw(s,a)\delta \theta = \alpha\nabla_\theta \log \pi_\theta(s, a)Q_w(s,a).

Thus, we can now update both θ\theta and ww with each step. However, approximating QQ introduces bias that we want to avoid. The Compatible Function Approximation Theorem gives conditions for when the policy gradient is still exact.

Another approach is to reduce variance by subtracting a baseline from the update equation, e.g. B(s)=Vπθ(s)B(s) = V^{\pi_\theta}(s). Then our update is Eπθ[θlogπθ(s,a)(Qπθ(s,a)B(s))]=Eπθ[θlogπθ(s,a)(Qπθ(s,a)Vπθ(s))]E_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s, a) (Q^{\pi_\theta}(s, a) - B(s))] = E_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s, a) (Q^{\pi_\theta}(s, a) - V^{\pi_\theta}(s))]

Really, we can estimate the advantage function, A=QVA = Q-V directly if we wanted to.

Other policy gradients

There are some fancier things we can do to estimate the policy gradient. See Advantage Actor-Critic, TD Actor-Critic, TD(λ\lambda) Actor-Critic, Natural Actor-Critic. They will not be covered here.

Model-based RL

In the previous section, we assumed we had an MDP with transitions PsaP_{sa}. Now, we will try to learn the model from experience, and then we can use planning to come up with a good policy; ideally this would all be done in one architecture.

This is good if we can learn the model by supervised learning methods, and lets us reason about uncertainty (if we had a perfect model, we could just search). On the other hand, we need to both learn a model and construct functions, so there are more possible sources of error.

Basic idea of model

A model is a representation of an MDP S,A,P,R\langle S, A, P ,R\rangle parameterized by η\eta. So St+1Pη(St+1St,At)S_{t+1} \sim P_\eta(S_{t+1} \mid S_t, A_t) and Rt+1Rη(Rt+1St,At)R_{t+1} \sim R_\eta(R_{t+1} \mid S_t, A_t).

The goal is to estimate η\mathcal{M}_\eta, which can be formulated as a supervised learning problem. s,ars, a \rightarrow r is a regression problem, while s,ass, a \rightarrow s' is a density estimation problem. We can pick parameters η\eta that minimize empirical loss.


Given a model, we now want to plan (and find the best policy). This can be done using value/policy iteration or by a tree search. Another approach is to use sample-based planning, which uses the model to generate samples and then apply model-free RL to the samples. In practice this tends to be more efficient.

Note that when the model is wrong, planning will be even more wrong, so it is important that the model is accurate.

Integration (Dyna)

We can integrate both planning and model based RL into one framework. We can consider two types of experience: real, from the environment, and simulated, sampled from our model. This is the Dyna-Q algorithm, which first initializes Q,MQ, M and then:

  1. SS \rightarrow current state
  2. AA \rightarrow ϵ\epsilon-greedy(S, Q)
  3. (real) Execute A, observe reward R and state S’.
  4. Q(s,a)Q(s,a)+α[R+γmaxa(Q(s,a)Q(s,a))]Q(s, a) \rightarrow Q(s, a) + \alpha[R + \gamma \max_a (Q(s', a) - Q(s, a))], which is the standard update rule.
  5. Model(s, a) R,S\rightarrow R, S', which updates the model.
  6. (simulated) Repeat nn times: S, A are random previously observed state and actions. Use the model to generate R,SR, S' and perform update rule.

This interleaves nn simulated steps with one real step.

Search algorithms either lookahead with existing model or sample model-free. Model-free approaches include MCTS and TD search depending on whether MC control or SARSA control is applied to the sub-MDP. Like earlier, TD approaches have less variance (more bias) and is usually more efficient.

Exploration vs Exploitation

Exploitation: make best use of current info. Exploration: explore other options, which may be better than current.


A multi-armed bandit is a tuple A,R\langle A, R \rangle where AA is a set of actions (arms) and Ra(r)=P(ra)R^a(r) = P(r \mid a) is a probability distribution over rewards. At each time step tt, the agent selects an action atAa_t \in A and the environment generates reward rtRatr_t \sim R^{a_t}. The goal is to maximize cumulative reward τ=1trτ\sum_{\tau=1}^t r_\tau.

Maximizing total cumulative reward is equivalent to minimizing total regret. We can define a count Nt(a)N_t(a) to be the expected number of selections of aa and define a gap δa=V*Q(a)\delta_a = V^* - Q(a). Then alternatively, Lt=aAE[Nt(a)]δaL_t = \sum_{a \in A}E[N_t(a)]\delta_a. In other words, we want low counts for large gaps.

Greedy algorithms

If we explore forever, we will have linear total regret. If we never explore, we will also have linear total regret. It can be shown that greedy (and ϵ\epsilon-greedy) algorithms have linear regret with time – does there exist a strategy with sublinear total regret?

Decaying ϵ\epsilon-greedy algorithm.

Idea: pick a decay schedule ϵ1,ϵ2,...\epsilon_1, \epsilon_2, ... and for c>0c > 0, $d = $ smallest nonzero gap, ϵt=min{1,c|A|d2t}\epsilon_t = \min\{1, \frac{c|A|}{d^2t}\}. We can show that this has logarithmic asymptotic total regret but this requires knowing gap sizes before hand…

Lower bound [Lai and Robbins]

Previous work has shown Llogtaδa0δaKL(Ra||Ra*L \geq \log t \sum_{a \mid \delta_a \geq 0}\frac{\delta_a}{KL(R^a || R^{a^*}}.

Upper confidence bounds.

Idea: For each action, estimate an upper confidence bound (UCB) Ût(a)\hat{U}_t(a) such that Q(a)Q̂t(a)+Ût(a)Q(a) \leq \hat{Q}_t(a) + \hat{U}_t(a) with high probability, and select action maximizing UCB:

at=argmaxainAQ̂t(a)+Ût(a).a_t = \arg\max_{a\ in A} \hat{Q}_t(a) + \hat{U}_t(a).

We can actually bound this using Hoeffding’s inequality, so for a given pp, we can solve for Ut(a)=logp2Nt(a)U_t(a) = \sqrt{\frac{-\log p}{2N_t(a)}}.

By reducing pp over time, we can show that Ut(a)=2logtNt(a)U_t(a) = \sqrt{\frac{2\log t}{N_t(a)}}, and so the UCB algorithm leads to asymptotic total regret, Lt8logta|δa0δaL_t \leq 8\log t \sum_{a | \delta_a \geq 0} \delta_a.

Bayesian and other bandits

There is more to be said about bandits; they will not be covered here. Additional techniques include probability matching (Thompson sampling) and information state search.

Contextual bandits

These consist of tuples A,S,R\langle A, S, R \rangle where S=P(s)S = P(s) is an unknown distribution over contexts and so Rsa(r)=P(rs,a)R_s^a(r) = P(r \mid s, a) is an unknown probability distribution over rewards. At each step, the environment generates a state stSs_t \sim S and then the agent selects and action atAa_t \in A; the reward is rRstatr \sim R_{s_t}^{a_t}. The goal is to maximize cumulative reward.

Why is this interesting? There’s an analogy to the action-value function: Q(s,a)=E[rs,a]Q(s,a) = E[r \mid s, a], and so we can use UCB here too. Bandit problems are cool because methods used to do well on bandit problems can be applied to the MDP setting.

Self-play RL

For the iith player, denote its policy πi\pi^i (and all other players have policy πi\pi^{-i}. Then the best policy is πi=π*i(πi)\pi^i = \pi^i_*(\pi^{-i}) and the Nash equilibrium is a joint policy such that for all ii, πi=π*i(πi)\pi^i = \pi^i_*(\pi^{-i}).

We can treat this as an MDP with other players as part of the environment, and so the Nash equilibrium is a fixed point. We will only focus on two-player zero-sum games.

Game tree

Classical methods such as minimax or alpha-beta define a single value function for both players and prunes the tree. This is already powerful, though a good value function is hard to define…

Using TD to learn value function.

We can apply value-based RL algorithm by updating the value function towards the return in TD learning. This learned value function can then be used for better minimax search.

Two variations, TD Root and TD Lead, perform updates elsewhere in the tree.

Imperfect information games

In imperfect information games, each player has a separate tree. Solutions here include iterative forward search (e.g. counterfactual regret minimization), self-play RL, and smooth UCT.

Smooth UCT

This approach applies MCTS to the information-state-game tree by learning against and responding to the opponent average behavior. The average policy can be extracted πavg(as)=N(s,a)N(s)\pi_{avg}(a \mid s) = \frac{N(s,a)}{N(s)} and the action can be picked between UCT(S)UCT(S) and πavg(S)\pi_{avg}(\cdot \mid S) based on a probability parameter η\eta.