Reinforcement Learning notes
Outline of David Silver's RL course, with parts from Andrew Ng's lecture notes and Arulkumaran et al. (2017).
A Markov Decision Process is a tuple $(S, A, \{P_{sa}\}, \gamma, R)$, where $S$ is the set of states, $A$ is a set of actions, $P_{sa}$ are the transition probabilities for taking action $a$ in state $s$, $\gamma \in [0, 1)$ is a discount factor, and $R : S \to \mathbb{R}$ is the reward function.
Starting in state $s_0$, we choose action $a_0$ and arrive at state $s_1 \sim P_{s_0 a_0}$. The total discounted reward is given by $R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + \cdots$. The goal is to maximize the expected reward, $\mathbb{E}[R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + \cdots]$; discounting encourages collecting reward sooner if possible.
A policy is any function $\pi : S \to A$ giving the action $a = \pi(s)$ that should be taken in state $s$. This lets us define a value function with respect to a policy, $V^\pi(s) = \mathbb{E}[R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + \cdots \mid s_0 = s, \pi]$, which is the expected discounted reward for a given policy.
For a fixed policy $\pi$, the value function satisfies the Bellman equation $V^\pi(s) = R(s) + \gamma \sum_{s'} P_{s\pi(s)}(s') V^\pi(s')$, which is basically a sum over all possible next states $s'$ weighted by the transition probabilities. Here, $R(s)$ is the immediate reward. This equation can be used to solve for $V^\pi$ efficiently in a finite-state MDP because we can write one equation per state and solve $|S|$ linear equations in $|S|$ variables (the $V^\pi(s)$). There is also an optimal value function $V^*(s) = \max_\pi V^\pi(s)$, which decomposes as a Bellman equation $V^*(s) = R(s) + \max_a \gamma \sum_{s'} P_{sa}(s') V^*(s')$.
This lets us define an optimal policy $\pi^*(s) = \arg\max_a \sum_{s'} P_{sa}(s') V^*(s')$, and we have $V^*(s) = V^{\pi^*}(s) \geq V^\pi(s)$ for every policy $\pi$.
Solving finite-state MDPs
Value iteration: for each state, initialize $V(s) := 0$. Until convergence, for each state, update $V(s) := R(s) + \max_a \gamma \sum_{s'} P_{sa}(s') V(s')$, which is repeatedly updating $V$ using the Bellman equations.
Policy iteration: initialize $\pi$ randomly and let $V := V^\pi$. Then for each state $s$, update $\pi(s) := \arg\max_a \sum_{s'} P_{sa}(s') V(s')$. Note that the first step requires solving a system of linear equations for $V^\pi$. Thus, each iteration of policy iteration takes longer in practice, but both methods are used.
fun problem: show that these both converge to $V^*$ and $\pi^*$ respectively.
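As a concrete sketch of value iteration (the 2-state MDP below is made up for illustration):

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Value iteration for a finite MDP.

    P: array of shape (|S|, |A|, |S|) with P[s, a, s'] = P_sa(s').
    R: array of shape (|S|,) with the per-state rewards.
    Repeats V(s) := R(s) + gamma * max_a sum_s' P_sa(s') V(s').
    """
    V = np.zeros(P.shape[0])
    while True:
        V_new = R + gamma * np.max(P @ V, axis=1)  # P @ V has shape (|S|, |A|)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

# A made-up 2-state, 2-action MDP:
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.5, 0.5]]])
R = np.array([0.0, 1.0])
V_star = value_iteration(P, R)
pi_star = np.argmax(P @ V_star, axis=1)  # greedy policy wrt V*
```

The last line extracts the greedy policy, matching the $\arg\max$ expression for $\pi^*$ above.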
Learning a model for the MDP
In reality, we may not know the transition probabilities and rewards, so we must estimate them from data. One estimation trick is to use the MLE estimate of $P_{sa}(s')$ (the fraction of times that taking action $a$ in state $s$ led to $s'$) with some smoothing, e.g. falling back to the uniform distribution $1/|S|$ for state-action pairs never observed. Similarly, we can use the empirical mean of observed rewards as the estimate of $R(s)$ for state $s$.
This can be combined with policy/value iteration by taking an existing policy, executing it for a number of trials, and then re-estimating the reward/transition probabilities. Specifically, after initializing $\pi$ randomly: execute $\pi$, update the estimates of $P_{sa}$ and $R$, apply value iteration with those estimates to get a new estimate of $V$, and update $\pi$ to be greedy with respect to $V$.
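A minimal sketch of the MLE estimation step with uniform smoothing for unvisited state-action pairs (the function name and trial format are assumptions):

```python
from collections import defaultdict

def estimate_model(trials, n_states, n_actions):
    """MLE transition estimates with uniform smoothing.

    trials: list of (s, a, s') transitions gathered by executing a policy.
    Returns P_hat[(s, a)] as a list of probabilities over next states;
    unseen (s, a) pairs fall back to the uniform distribution 1/|S|.
    """
    counts = defaultdict(lambda: [0] * n_states)
    for s, a, s_next in trials:
        counts[(s, a)][s_next] += 1
    P_hat = {}
    for s in range(n_states):
        for a in range(n_actions):
            c = counts[(s, a)]
            total = sum(c)
            if total == 0:  # never tried this pair: smooth to uniform
                P_hat[(s, a)] = [1.0 / n_states] * n_states
            else:
                P_hat[(s, a)] = [x / total for x in c]
    return P_hat
```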
Continuous-state MDPs and value function approximation
In some cases, the states are continuous (e.g. physics), and so one method for finding policies is to approximate $V^*$ directly by using a model. A model takes any continuous state $s_t$ and action $a_t$ and outputs $s_{t+1}$ according to $P_{s_t a_t}$. This model could be learned from data (e.g. by fitting a linear model of the dynamics). The model could also be either deterministic or stochastic.
Fitted value iteration
Recall we're trying to solve $V(s) = R(s) + \gamma \max_a \sum_{s'} P_{sa}(s') V(s')$, but in the continuous case the sum becomes an integral over states. This method approximates the value function over a finite sample of states with $V(s) = \theta^T \phi(s)$, where $\phi$ is some feature mapping. Specifically, we randomly sample $n$ states $s^{(1)}, \dots, s^{(n)}$ and initialize $\theta := 0$. Next, for each state $s^{(i)}$ and action $a$, sample $s'_1, \dots, s'_k \sim P_{s^{(i)}a}$ using a model of the MDP. Set $q(a) = \frac{1}{k} \sum_{j=1}^k R(s^{(i)}) + \gamma V(s'_j)$, so $q(a)$ is an estimate of $R(s^{(i)}) + \gamma \mathbb{E}_{s' \sim P_{s^{(i)}a}}[V(s')]$.
With these estimates for each $a$, set $y^{(i)} = \max_a q(a)$, which is the approximation of the value of the best action. Finally, set $\theta := \arg\min_\theta \frac{1}{2} \sum_{i=1}^n \left(\theta^T \phi(s^{(i)}) - y^{(i)}\right)^2$, which is solving a linear regression so that we can generalize to all states.
Note that this method requires $k$ samples from the model per state-action pair per iteration, which could be expensive too.
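The loop above can be sketched as follows; `model`, `reward`, and `phi` are assumed user-supplied callables, not part of any particular library:

```python
import numpy as np

def fitted_value_iteration(sample_states, actions, model, reward, phi,
                           gamma=0.9, k=10, n_iters=50, rng=None):
    """Fitted value iteration sketch with V(s) = theta^T phi(s).

    sample_states: iterable of sampled continuous states.
    model(s, a, rng): samples a next state s' ~ P_sa.
    reward(s): reward for state s.  phi(s): feature vector.
    """
    rng = rng or np.random.default_rng(0)
    Phi = np.array([phi(s) for s in sample_states])      # (n, m) design matrix
    theta = np.zeros(Phi.shape[1])
    for _ in range(n_iters):
        y = []
        for s in sample_states:
            # q(a) estimates R(s) + gamma * E_s'[V(s')] from k model samples
            q = [np.mean([reward(s) + gamma * phi(model(s, a, rng)) @ theta
                          for _ in range(k)]) for a in actions]
            y.append(max(q))                             # value of best action
        # linear regression: theta := argmin sum_i (theta^T phi(s_i) - y_i)^2
        theta, *_ = np.linalg.lstsq(Phi, np.array(y), rcond=None)
    return theta
```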
Revisiting intro: model-free control and prediction
The state-action value (quality) function
We can define a function $Q^\pi(s, a)$ which is similar to $V^\pi$ except the initial action is provided: $Q^\pi(s, a) = \mathbb{E}[R(s_0, a_0) + \gamma R(s_1, a_1) + \cdots \mid s_0 = s, a_0 = a, \pi]$. Learning this Q-table is no longer a prediction task (what is the value?) but is now a control task (which action is the best?). Using this function, the Bellman equation simplifies to $Q^\pi(s, a) = R(s, a) + \gamma \sum_{s'} P_{sa}(s') Q^\pi(s', \pi(s'))$, which corresponds to the idea of "future reward" given action $a$. Note that maximizing over $a$ gives the $\max$ term in the Bellman equation from earlier, and so the policy can be improved by solving $\pi(s) = \arg\max_a Q(s, a)$ and iterating.
Temporal Difference (TD)
The idea is to update Q values by taking a step in the MDP and adjusting the Q (or value) function appropriately. SARSA and Q-Learning are both TD algorithms. For the value function, the updates have the form $V(S_t) := V(S_t) + \alpha (G - V(S_t))$, where $\alpha$ is a learning rate and $G$ is a TD target. The difference $G - V(S_t)$ is the TD error.
SARSA has the following update rule after interacting with the environment (using the tuple $(s, a, r, s', a')$): $Q(s, a) := Q(s, a) + \alpha (r + \gamma Q(s', a') - Q(s, a))$,
where $\alpha$ is a learning rate and the term in parentheses can be interpreted as the error between the state-action values at $t$ and $t+1$. Here, $a'$ is the action actually taken in $s'$.
Q-learning's update rule is $Q(s, a) := Q(s, a) + \alpha (r + \gamma \max_{a'} Q(s', a') - Q(s, a))$. Note that earlier, we knew what $a'$ would be because we followed a policy. Now there is no policy in the target: we maximize over $a'$ instead.
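The two update rules side by side, as a sketch over a simple list-of-lists Q-table (the table and numbers below are made up for illustration):

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """SARSA (on-policy): the target uses the action a' actually taken next."""
    Q[s][a] += alpha * (r + gamma * Q[s_next][a_next] - Q[s][a])

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Q-learning (off-policy): the target maximizes over next actions."""
    Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])

Q = [[0.0, 0.0], [1.0, 2.0]]                          # Q[state][action]
q_learning_update(Q, s=0, a=0, r=1.0, s_next=1)       # target = 1 + 0.99 * 2
sarsa_update(Q, s=0, a=1, r=1.0, s_next=1, a_next=0)  # target = 1 + 0.99 * 1
```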
Monte Carlo (MC)
Over many episodes, update $V$ using empirical means: $V(S_t) := V(S_t) + \frac{1}{N(S_t)} (G_t - V(S_t))$, where $G_t$ is the total return from $S_t$ and $N(S_t)$ is a counter of visits to that state. The inner updates can be done once per episode on the first visit to each state, or at every visit. Note that the total return/reward is only known by the end of the episode.
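A first-visit MC prediction sketch; the episode format (a list of `(state, reward)` pairs) is an assumption:

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=1.0):
    """First-visit Monte Carlo prediction.

    episodes: list of episodes, each a list of (state, reward) pairs.
    Updates V(s) toward the empirical mean return with counts N(s):
    V(s) := V(s) + (1 / N(s)) * (G_t - V(s)).
    """
    V = defaultdict(float)
    N = defaultdict(int)
    for episode in episodes:
        # compute returns G_t backwards from the end of the episode
        G = 0.0
        returns = []
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns.append((state, G))
        returns.reverse()
        seen = set()
        for state, G in returns:
            if state in seen:  # first-visit: skip repeat visits
                continue
            seen.add(state)
            N[state] += 1
            V[state] += (G - V[state]) / N[state]
    return dict(V)
```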
TD($\lambda$): between MC sampling and TD learning
The TD algorithms described earlier were actually TD(0) learning algorithms. TD($\lambda$) lets us interpolate between MC and TD approaches. TD is generally more desirable because it can learn before knowing the final reward, and even without ever knowing the final outcome. This is because the TD target is $R_{t+1} + \gamma V(S_{t+1})$, which bootstraps from the existing value estimate and so is a biased estimate of $v_\pi(S_t)$, while the MC return is an unbiased estimate. Another difference is that TD assumes a Markov environment, while MC is usually more effective in non-Markov environments.
TD and MC can be unified by looking at the TD target, $R_{t+1} + \gamma V(S_{t+1})$, and defining an $n$-step return: $G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n})$.
These $n$-step returns can be combined in a geometric fashion with a parameter $\lambda$ to get $G_t^\lambda = (1 - \lambda) \sum_{n=1}^\infty \lambda^{n-1} G_t^{(n)}$. Note that for $\lambda = 0$, the reward is immediate (we only use $G_t^{(1)}$) and we have regular TD learning. When $\lambda = 1$, the sum of TD errors telescopes to the MC error. However, this is still not a perfect correspondence unless the updates occur at the end of the episode.
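A sketch computing truncated $n$-step and $\lambda$-returns for a finished episode. The indexing convention is an assumption: `rewards[t]` holds $R_{t+1}$ and `values[t]` holds $V(S_t)$, with a terminal value at the end.

```python
def n_step_return(rewards, values, t, n, gamma=0.9):
    """G_t^(n) = R_{t+1} + ... + gamma^{n-1} R_{t+n} + gamma^n V(S_{t+n});
    truncates to the full return near the end of the episode."""
    G, k = 0.0, 0
    while k < n and t + k < len(rewards):
        G += gamma ** k * rewards[t + k]
        k += 1
    if t + k < len(values):
        G += gamma ** k * values[t + k]  # bootstrap from V(S_{t+n})
    return G

def lambda_return(rewards, values, t, lam=0.8, gamma=0.9):
    """G_t^lambda = (1 - lam) * sum_n lam^(n-1) G_t^(n), with the
    remaining geometric weight put on the full (MC-like) return."""
    T = len(rewards) - t
    G = sum((1 - lam) * lam ** (n - 1) * n_step_return(rewards, values, t, n, gamma)
            for n in range(1, T))
    G += lam ** (T - 1) * n_step_return(rewards, values, t, T, gamma)
    return G
```

Setting `lam=0` recovers the one-step TD target; `lam=1` recovers the MC return (when the terminal value is zero).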
When there are large (or continuous) state spaces, we want to instead approximate $v_\pi(s)$ or $q_\pi(s, a)$ with $\hat v(s, \mathbf{w})$ or $\hat q(s, a, \mathbf{w})$. This way, we can generalize from seen states to unseen states. $\mathbf{w}$ is just a parameter vector (sometimes written $\theta$) which we can update using MC or TD (or something else).
The function can be approximated using incremental updates with gradient descent. If $J(\mathbf{w})$ is differentiable, then we can define its gradient $\nabla_{\mathbf{w}} J(\mathbf{w})$ and adjust $\mathbf{w}$: $\Delta \mathbf{w} = -\frac{1}{2} \alpha \nabla_{\mathbf{w}} J(\mathbf{w})$. When we define $J(\mathbf{w}) = \mathbb{E}_\pi[(v_\pi(S) - \hat v(S, \mathbf{w}))^2]$, we can use GD/SGD to update: $\Delta \mathbf{w} = \alpha (v_\pi(S) - \hat v(S, \mathbf{w})) \nabla_{\mathbf{w}} \hat v(S, \mathbf{w})$.
There are other ways to approximate: e.g. linear with respect to features, $\hat v(S, \mathbf{w}) = \phi(S)^T \mathbf{w}$ (then SGD converges to the global optimum and the update rule is simple, since $\nabla_{\mathbf{w}} \hat v = \phi(S)$ and the weights are just the linear coefficients).
We can show that this still works even if the true value isn't known until the end: instead of $v_\pi(S)$ we can substitute the MC return $G_t$, the TD target $R_{t+1} + \gamma \hat v(S_{t+1}, \mathbf{w})$, or whatever the TD target was.
While these examples used the value function, the same method can be used to approximate the Q function.
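A semi-gradient TD(0) update with linear features, substituting the TD target for the unknown $v_\pi(S)$ (a sketch; the same shape works for Q with state-action features):

```python
import numpy as np

def linear_td0_update(w, phi_s, r, phi_s_next, alpha=0.05, gamma=0.9):
    """One TD(0) step for a linear approximator v_hat(s) = phi(s)^T w.

    The TD target R + gamma * v_hat(S', w) replaces the true v_pi(S);
    the gradient of v_hat wrt w is just the feature vector phi(s).
    """
    td_target = r + gamma * phi_s_next @ w
    td_error = td_target - phi_s @ w
    return w + alpha * td_error * phi_s
```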
Gradient descent is not sample efficient (each experience is used once, giving high sample complexity). Batch methods instead find the best-fitting function for a whole batch of experience, which is more appealing (e.g. incremental least squares approximation vs. a linear-algebra batch solution).
Least squares 1
Suppose we are given experience $D$ consisting of pairs $\langle s, v^\pi(s) \rangle$. We can sample a pair from $D$ and apply the SGD update rule from earlier for that sample; this converges to the least squares solution for the errors between $\hat v(s, \mathbf{w})$ and $v^\pi(s)$.
Experience replay is a trick that assumes an $\epsilon$-greedy policy and stores tuples $(s, a, r, s')$ in a memory $D$. Random mini-batches can be sampled, the targets $r + \gamma \max_{a'} Q(s', a'; \mathbf{w}^-)$ can be calculated (with $\mathbf{w}^-$ the parameters of a fixed target network), and the MSE between the Q-network and the Q-learning targets can be minimized: $L(\mathbf{w}) = \mathbb{E}\left[\left(r + \gamma \max_{a'} Q(s', a'; \mathbf{w}^-) - Q(s, a; \mathbf{w})\right)^2\right]$. This loss can be optimized with SGD.
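A minimal replay memory sketch; the class name and interface are assumptions, not a specific library's API:

```python
import random
from collections import deque

class ReplayBuffer:
    """Store (s, a, r, s') tuples and sample random mini-batches,
    which decorrelates consecutive updates."""

    def __init__(self, capacity=10000, seed=0):
        self.memory = deque(maxlen=capacity)  # oldest tuples fall off the end
        self.rng = random.Random(seed)

    def push(self, s, a, r, s_next):
        self.memory.append((s, a, r, s_next))

    def sample(self, batch_size):
        return self.rng.sample(self.memory, batch_size)
```

Each real step pushes one transition, then a sampled batch is used to minimize the MSE between the Q-network and the targets above.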
Least squares 2
We can compute the least squares solution directly if the value function is linear: $\hat v(s, \mathbf{w}) = \phi(s)^T \mathbf{w}$ for some coefficients $\mathbf{w}$. At the solution, the expected gradient is 0, so we can solve for $\mathbf{w}$ to get $\mathbf{w} = \left(\sum_t \phi(s_t) \phi(s_t)^T\right)^{-1} \sum_t \phi(s_t) v^\pi(s_t)$, which is the familiar least squares solution. We don't know the true $v^\pi$, but we can use various versions of the TD target (e.g. $R_{t+1} + \gamma \hat v(S_{t+1}, \mathbf{w})$) and use the same solving process.
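Substituting the TD target into the closed-form solution gives least-squares TD (LSTD); a sketch with a small ridge term added for invertibility:

```python
import numpy as np

def lstd(transitions, phi, gamma=0.9, reg=1e-6):
    """LSTD sketch for a linear approximator v_hat(s) = phi(s)^T w.

    transitions: list of (s, r, s') tuples gathered under a fixed policy.
    With the TD target substituted for v_pi, the solution becomes
    w = (sum phi(s)(phi(s) - gamma*phi(s'))^T)^{-1} (sum phi(s) r).
    """
    d = len(phi(transitions[0][0]))
    A = reg * np.eye(d)  # small ridge term keeps A invertible
    b = np.zeros(d)
    for s, r, s_next in transitions:
        x, x_next = phi(s), phi(s_next)
        A += np.outer(x, x - gamma * x_next)
        b += r * x
    return np.linalg.solve(A, b)
```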
Least squares for control
In control, we want to improve the policy, so we must learn off-policy. The change we need is similar to that from Q-learning: the update uses $\hat q$ under a new policy $\pi'$, and the least squares problem can be solved the same way. This essentially "re-evaluates" the stored experience under the new policy, after which the new policy is kept. The Q functions are relearned until the new policy is "close enough" to the old policy.
Policy gradient methods
In the previous section, we assumed a fixed policy (e.g. $\epsilon$-greedy). Another approach is to directly parameterize the policy, $\pi_\theta(s, a) = P[a \mid s, \theta]$ (note that previously the policy was implicit, derived from a value or Q function). This may be desirable if we have high-dimensional action spaces (e.g. text), we can learn stochastic policies (so it can win rock-paper-scissors), and it has better convergence properties. The cons are that convergence tends to be to a local optimum, and evaluating a policy is inefficient and high variance.
The goal is to optimize $\pi_\theta$ by finding the best $\theta$, but what is the objective? We could use $J_1(\theta) = V^{\pi_\theta}(s_1)$, the expected value from the first state, or the average value $J_{avgV}(\theta) = \sum_s d^{\pi_\theta}(s) V^{\pi_\theta}(s)$, or the average reward per step $J_{avgR}(\theta) = \sum_s d^{\pi_\theta}(s) \sum_a \pi_\theta(s, a) R(s, a)$, where $d^{\pi_\theta}$ is the stationary distribution of the Markov chain for $\pi_\theta$.
With an objective defined, we can maximize $J(\theta)$, e.g. using gradient ascent: $\Delta \theta = \alpha \nabla_\theta J(\theta)$, where $\nabla_\theta J(\theta)$ is the policy gradient and $\alpha$ is a step size. This gradient can be computed with a finite difference method for any policy, but if the policy is differentiable then we can compute it directly.
Policy Gradient Theorem
For any differentiable policy (e.g. softmax or Gaussian) and for any of the objective functions $J_1$, $J_{avgV}$, or $J_{avgR}$, the policy gradient is $\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s, a) \, Q^{\pi_\theta}(s, a)]$.
No value function (MC policy gradient = REINFORCE)
REINFORCE uses the policy gradient theorem to perform updates, with the sampled return $G_t$ as an unbiased estimate of $Q^{\pi_\theta}(s_t, a_t)$. Initialize $\theta$ arbitrarily. Then for each episode generated by following $\pi_\theta$, for each timestep $t$, update $\theta := \theta + \alpha \nabla_\theta \log \pi_\theta(s_t, a_t) \, G_t$.
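A REINFORCE sketch with a softmax policy over linear scores; the episode format (a list of `(features, action, reward)` tuples) and the feature encoding are assumptions:

```python
import numpy as np

def softmax_policy(theta, s_feats):
    """pi_theta(a|s) via a softmax over the linear scores theta[a] . phi(s)."""
    scores = theta @ s_feats
    scores = scores - scores.max()  # numerical stability
    p = np.exp(scores)
    return p / p.sum()

def reinforce_update(theta, episode, alpha=0.01, gamma=1.0):
    """theta += alpha * grad log pi(a_t|s_t) * G_t for each step of the episode.

    For a softmax policy, grad_theta log pi(a|s) puts +phi(s) on row a
    and -pi(a'|s) * phi(s) on every row a'.
    """
    G = 0.0
    returns = []
    for _, _, r in reversed(episode):  # returns computed backwards
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    for (s_feats, a, _), G_t in zip(episode, returns):
        p = softmax_policy(theta, s_feats)
        grad = -np.outer(p, s_feats)   # -pi(a'|s) phi(s) for all rows
        grad[a] += s_feats             # +phi(s) on the taken action's row
        theta += alpha * G_t * grad
    return theta
```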
Actor-Critic (both approximate value and policy)
REINFORCE has high variance, so we can use a critic with parameters $\mathbf{w}$ to estimate the action-value function, $Q_{\mathbf{w}}(s, a) \approx Q^{\pi_\theta}(s, a)$. Actor-critic algorithms follow an approximate policy gradient as determined by the critic: $\nabla_\theta J(\theta) \approx \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s, a) \, Q_{\mathbf{w}}(s, a)]$.
Thus, we can now update both $\theta$ and $\mathbf{w}$ with each step. However, approximating $Q^{\pi_\theta}$ with $Q_{\mathbf{w}}$ introduces bias that we want to avoid. The Compatible Function Approximation Theorem gives conditions under which the policy gradient remains exact.
Another approach is to reduce variance by subtracting a baseline from the update equation, e.g. $B(s) = V^{\pi_\theta}(s)$, giving the advantage function $A^{\pi_\theta}(s, a) = Q^{\pi_\theta}(s, a) - V^{\pi_\theta}(s)$. Then our update is $\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s, a) \, A^{\pi_\theta}(s, a)]$.
Really, we can estimate the advantage function $A^{\pi_\theta}(s, a)$ directly if we wanted to.
Other policy gradients
There are some fancier things we can do to estimate the policy gradient. See Advantage Actor-Critic, TD Actor-Critic, TD($\lambda$) Actor-Critic, and Natural Actor-Critic. They will not be covered here.
Model-based RL
In the previous section, we assumed we had an MDP with transitions $P_{sa}$. Now, we will try to learn the model from experience, and then we can use planning to come up with a good policy; ideally this would all be done in one architecture.
This is good because we can learn the model with supervised learning methods, and it lets us reason about model uncertainty (if we had a perfect model, we could just search). On the other hand, we need to both learn a model and then construct value functions, so there are more possible sources of error.
Basic idea of model
A model $M$ is a representation of an MDP $(S, A, P, R)$ parameterized by $\eta$. So $S_{t+1} \sim P_\eta(S_{t+1} \mid S_t, A_t)$ and $R_{t+1} \sim R_\eta(R_{t+1} \mid S_t, A_t)$.
The goal is to estimate $M_\eta$ from experience $\{S_1, A_1, R_2, \dots, S_T\}$, which can be formulated as a supervised learning problem. Learning $s, a \to r$ is a regression problem, while learning $s, a \to s'$ is a density estimation problem. We can pick the parameters $\eta$ that minimize the empirical loss.
Given a model, we now want to plan (and find the best policy). This can be done using value/policy iteration or by a tree search. Another approach is to use sample-based planning, which uses the model to generate samples and then apply model-free RL to the samples. In practice this tends to be more efficient.
Note that when the model is wrong, planning will be even more wrong, so it is important that the model is accurate.
We can integrate both planning and model-based RL into one framework. We can consider two types of experience: real, from the environment, and simulated, sampled from our model. This is the Dyna-Q algorithm, which first initializes $Q(s, a)$ and $Model(s, a)$ and then repeats:
- $S \leftarrow$ current state
- $A \leftarrow \epsilon$-greedy$(S, Q)$
- (real) Execute $A$; observe reward $R$ and state $S'$.
- $Q(S, A) := Q(S, A) + \alpha (R + \gamma \max_a Q(S', a) - Q(S, A))$, which is the standard Q-learning update rule.
- $Model(S, A) \leftarrow (R, S')$, which updates the model.
- (simulated) Repeat $n$ times: let $S, A$ be a random previously observed state and action, use the model to generate $R, S'$, and perform the update rule.
This interleaves $n$ simulated steps with each real step.
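The loop above can be sketched as follows; the `env_step(s, a) -> (r, s_next)` interface and a deterministic table model are assumptions:

```python
import random

def dyna_q(env_step, n_states, n_actions, n_steps=200, n_planning=5,
           alpha=0.1, gamma=0.95, eps=0.1, seed=0):
    """Dyna-Q sketch: one real step, then n_planning simulated steps."""
    rng = random.Random(seed)
    Q = [[0.0] * n_actions for _ in range(n_states)]
    model = {}                               # (s, a) -> (r, s')
    s = 0
    for _ in range(n_steps):
        # eps-greedy action selection
        if rng.random() < eps:
            a = rng.randrange(n_actions)
        else:
            a = max(range(n_actions), key=lambda x: Q[s][x])
        r, s_next = env_step(s, a)           # (real) execute A, observe R, S'
        Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
        model[(s, a)] = (r, s_next)          # update the model
        for _ in range(n_planning):          # (simulated) planning updates
            ps, pa = rng.choice(list(model))
            pr, ps_next = model[(ps, pa)]
            Q[ps][pa] += alpha * (pr + gamma * max(Q[ps_next]) - Q[ps][pa])
        s = s_next
    return Q
```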
Search algorithms either look ahead using an existing model or sample in a model-free fashion. Model-free approaches include MCTS and TD search, depending on whether MC control or SARSA control is applied to the sub-MDP rooted at the current state. As earlier, TD approaches have less variance (but more bias) and are usually more efficient.
Exploration vs Exploitation
Exploitation: make best use of current info. Exploration: explore other options, which may be better than current.
A multi-armed bandit is a tuple $\langle A, R \rangle$ where $A$ is a set of actions (arms) and $R^a(r) = P[r \mid a]$ is an unknown probability distribution over rewards. At each time step $t$, the agent selects an action $a_t \in A$ and the environment generates a reward $r_t \sim R^{a_t}$. The goal is to maximize the cumulative reward $\sum_{\tau=1}^t r_\tau$.
Action value is the mean reward of action $a$, i.e. $Q(a) = \mathbb{E}[r \mid a]$.
Optimal value $V^* = Q(a^*) = \max_{a \in A} Q(a)$ is the value of the best arm.
Regret is the opportunity loss for each step, $l_t = \mathbb{E}[V^* - Q(a_t)]$.
Total regret is the total opportunity loss, $L_t = \mathbb{E}\left[\sum_{\tau=1}^t (V^* - Q(a_\tau))\right]$.
Maximizing total cumulative reward is equivalent to minimizing total regret. We can define a count $N_t(a)$ to be the expected number of selections of $a$, and define a gap $\Delta_a = V^* - Q(a)$. Then, alternatively, $L_t = \sum_{a \in A} \mathbb{E}[N_t(a)] \Delta_a$. In other words, we want low counts for arms with large gaps.
If we explore forever, we will have linear total regret. If we never explore, we will also have linear total regret. It can be shown that greedy (and -greedy) algorithms have linear regret with time – does there exist a strategy with sublinear total regret?
Decaying $\epsilon_t$-greedy algorithm.
Idea: pick a decay schedule $\epsilon_1, \epsilon_2, \dots$; for a constant $c > 0$ and $d =$ smallest nonzero gap, set $\epsilon_t = \min\left\{1, \frac{c|A|}{d^2 t}\right\}$. We can show that this has logarithmic asymptotic total regret, but it requires knowing the gap sizes beforehand...
Lower bound [Lai and Robbins]
Previous work has shown that the asymptotic total regret is at least logarithmic: $\lim_{t \to \infty} L_t \geq \log t \sum_{a \mid \Delta_a > 0} \frac{\Delta_a}{KL(R^a \,\|\, R^{a^*})}$.
Upper confidence bounds.
Idea: For each action, estimate an upper confidence bound (UCB) $\hat U_t(a)$ such that $Q(a) \leq \hat Q_t(a) + \hat U_t(a)$ with high probability, and select the action maximizing the UCB: $a_t = \arg\max_{a \in A} \left[\hat Q_t(a) + \hat U_t(a)\right]$.
We can actually bound this using Hoeffding's inequality: $P[Q(a) > \hat Q_t(a) + U_t(a)] \leq e^{-2 N_t(a) U_t(a)^2}$, so for a given confidence level $p$, we can solve for $U_t(a) = \sqrt{\frac{-\log p}{2 N_t(a)}}$.
By reducing $p$ over time (e.g. $p = t^{-4}$), we get $U_t(a) = \sqrt{\frac{2 \log t}{N_t(a)}}$, and so the UCB1 algorithm, $a_t = \arg\max_{a \in A} \left[\hat Q_t(a) + \sqrt{\frac{2 \log t}{N_t(a)}}\right]$, achieves logarithmic asymptotic total regret, $\lim_{t \to \infty} L_t \leq 8 \log t \sum_{a \mid \Delta_a > 0} \Delta_a$.
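A UCB1 action-selection sketch; keeping running pull counts and reward sums as the inputs is an assumed bookkeeping choice:

```python
import math

def ucb1(pulls, rewards, t):
    """Pick argmax_a Q_hat(a) + sqrt(2 log t / N(a)).

    pulls[a] = N_t(a), rewards[a] = summed reward received from arm a.
    Arms never pulled get priority (their bonus is effectively infinite).
    """
    best_a, best_val = None, -math.inf
    for a in range(len(pulls)):
        if pulls[a] == 0:
            return a  # try every arm once first
        val = rewards[a] / pulls[a] + math.sqrt(2 * math.log(t) / pulls[a])
        if val > best_val:
            best_a, best_val = a, val
    return best_a
```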
Bayesian and other bandits
There is more to be said about bandits; they will not be covered here. Additional techniques include probability matching (Thompson sampling) and information state search.
Contextual bandits
These consist of tuples $\langle A, S, R \rangle$, where $S = P[s]$ is an unknown distribution over contexts (states) and $R^a_s(r) = P[r \mid s, a]$ is an unknown probability distribution over rewards. At each step, the environment generates a state $s_t \sim S$, and then the agent selects an action $a_t \in A$; the reward is $r_t \sim R^{a_t}_{s_t}$. The goal is to maximize cumulative reward.
Why is this interesting? There's an analogy to the action-value function: $Q(s, a) = \mathbb{E}[r \mid s, a]$, and so we can use UCB here too. Bandit problems are cool because methods that do well on bandit problems can be applied to the MDP setting.
Games
For the $i$th player, denote its policy $\pi^i$ (and all other players' joint policy $\pi^{-i}$). Then the best response is $\pi^i_* = \arg\max_{\pi^i} V^i(\pi^i, \pi^{-i})$, and a Nash equilibrium is a joint policy $\pi$ such that for all $i$, $\pi^i$ is a best response to $\pi^{-i}$.
We can treat this as an MDP with other players as part of the environment, and so the Nash equilibrium is a fixed point. We will only focus on two-player zero-sum games.
Classical methods such as minimax with alpha-beta pruning define a single value function for both players and prune the game tree. This is already powerful, though a good value function is hard to define by hand...
Using TD to learn a value function.
We can apply a value-based RL algorithm by updating the value function towards the return, as in TD learning. This learned value function can then be used for a better minimax search.
Two variations, TD-Root and TD-Leaf, perform these updates elsewhere in the search tree.
Imperfect information games
In imperfect information games, each player has a separate tree. Solutions here include iterative forward search (e.g. counterfactual regret minimization), self-play RL, and smooth UCT.
Smooth UCT applies MCTS to the information-state game tree while learning against, and responding to, the opponents' average behavior. The average policy $\pi_{avg}$ can be extracted, and each action can be picked from either $\pi_{avg}$ or the UCT policy based on a probability parameter $\eta$.