RL — Reinforcement Learning

Ganesh Walavalkar
Apr 25, 2020

RL can be thought of as a set of APIs. The approach of taking a model (transitions and rewards) and converting it into a policy is called planning, and the code that does it is called a planner. Similarly, transitions can be converted into a policy by ‘learner’ code, and that is the reinforcement learning process. There are sub-components as well — a modeler, which takes transitions and converts them into a model, and a simulator, which takes a model and converts it into transitions. Note that these ‘APIs’ can be chained into different kinds of combinations, as in the sketch below.
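A minimal sketch of this API view; the function names and signatures below are placeholders chosen for illustration, not from the lecture or any particular library:

```python
# Placeholder signatures illustrating the "API" view of RL.

def planner(T, R):
    """Model (transitions T, rewards R) -> policy, e.g. by value iteration."""
    ...

def learner(transitions):
    """Stream of (s, a, r, s') samples -> policy, e.g. by Q-learning."""
    ...

def modeler(transitions):
    """(s, a, r, s') samples -> estimated model (T, R)."""
    ...

def simulator(T, R):
    """Model (T, R) -> generated (s, a, r, s') samples."""
    ...

# Chaining the pieces gives different combinations, for example:
#   model-based RL:     policy = planner(*modeler(transitions))
#   RL-based planning:  policy = learner(simulator(T, R))
```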

Three approaches to RL

Policy search approaches — A policy maps states to actions, and policy search algorithms work directly on finding that policy. Learning this mapping directly is difficult because of the temporal credit assignment problem: the feedback about whether an action was good only arrives much later.

Value function based approaches — Here states are mapped to values: the utility, or expected long-term reward, of being in that state.

Model based approaches — Here the current state and action are mapped to the next state by a transition function/model, and to a reward by a reward function/model. Learners that build these models from experience and then plan over them are model-based reinforcement learners.

If we have T and R (the model based approach), we can get the utility U by value iteration (solving the Bellman equation). From U, taking the argmax over actions gives the policy, which is the relatively easy step.
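As a concrete sketch of this pipeline (value iteration, then the argmax step), assuming a small tabular MDP where T[s][a] is a list of (probability, next state) pairs and R[s] is the reward for state s — these data structures are my own illustration, not from the lecture:

```python
def value_iteration(T, R, gamma=0.9, tol=1e-6):
    """Solve the Bellman equation: U(s) = R(s) + gamma * max_a sum_s' T(s,a,s') U(s')."""
    U = {s: 0.0 for s in R}
    while True:
        U_new = {s: R[s] + gamma * max(sum(p * U[s2] for p, s2 in T[s][a])
                                       for a in T[s])
                 for s in R}
        if max(abs(U_new[s] - U[s]) for s in R) < tol:
            return U_new
        U = U_new

def greedy_policy(U, T):
    """pi(s) = argmax over a of sum over s' of T(s,a,s') U(s') -- the easy step."""
    return {s: max(T[s], key=lambda a: sum(p * U[s2] for p, s2 in T[s][a]))
            for s in T}
```

This assumes every state has at least one action; terminal states would need special handling.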

Q Learning

Recap from MDP discussion:
1. Bellman equation — U(s) = R(s) + γ max (over a) ∑ (over s’) T(s, a, s’) U(s’)
2. Optimal (long term) policy — π*(s) = argmax (over a) ∑ (over s’) T(s, a, s’) U(s’)

Combining these two gets us Q as follows:
Q(s, a) = R(s) + γ ∑ (over s’) T(s, a, s’) max (over a’) Q (s’, a’)
This is defined as the value of arriving in s, leaving via action a, and then proceeding optimally thereafter.

Expressing Utility and Policy in terms of Q:
U(s) = max (over a) Q(s, a)
π(s) = argmax (over a) Q(s, a)
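In code, recovering U and π from a tabular Q is just a max and an argmax over actions; storing Q as a dict keyed by (s, a) is an assumption made for illustration:

```python
def utility_from_q(Q, states, actions):
    # U(s) = max over a of Q(s, a)
    return {s: max(Q[(s, a)] for a in actions) for s in states}

def policy_from_q(Q, states, actions):
    # pi(s) = argmax over a of Q(s, a)
    return {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}
```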

Estimating Q from Transitions

Consider the Q learning equation we discussed previously:
Q(s, a) = R(s) + γ ∑ (over s’) T(s, a, s’) max (over a’) Q (s’, a’)

If we had the reward function and the transition function we could compute Q directly, but we don’t have these functions while we are learning. What we do have are transitions, each defined as a tuple (s, a, r, s’): s — some state in the MDP, a — the action chosen, r — the reward we receive, and s’ — the next state we move to.

Instead, from each transition (s, a, r, s’) we update our estimate:
Q̂(s, a) ← (1 − α) Q̂(s, a) + α [r + γ max (over a’) Q̂(s’, a’)]
Here r + γ max (over a’) Q̂(s’, a’) is a sampled estimate of the utility of the state we left, and max (over a’) Q̂(s’, a’) estimates the utility of the next state (discounted by γ).
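A sketch of that update for a tabular Q̂; the defaultdict storage and the 1/(visit count) learning rate are illustrative choices, not prescribed by the lecture:

```python
from collections import defaultdict

Q = defaultdict(float)      # Q-hat: unseen (s, a) pairs start at 0
visits = defaultdict(int)   # visit counts, used to decay alpha per (s, a)

def q_update(s, a, r, s_next, actions, gamma=0.9):
    """One Q-learning update from a single transition (s, a, r, s')."""
    visits[(s, a)] += 1
    alpha = 1.0 / visits[(s, a)]                        # decaying learning rate
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])           # nudge Q-hat toward the target
```

Feeding every observed transition through this update, while acting (mostly) greedily on Q̂, is essentially the whole algorithm.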

Q Learning Convergence

If you start with Q̂ pretty much anywhere and, for each transition (s, a, r, s’), update it according to the rule in the previous section — moving Q̂ slowly toward the sampled target with learning rate α — then Q̂ → Q, that is, Q̂ converges to the true Q. The conditions behind this guarantee are: 1) every (s, a) pair must be visited infinitely often, 2) the learning rates must decay appropriately over time, and 3) the next state must be drawn from the actual transition probabilities.

How do we choose Q̂?
Suppose we initialize Q̂(s, a₀) to some high value and the other actions a ≠ a₀ to lower values. Acting greedily on Q̂ we keep picking a₀, and unless the algorithm eventually tries the other actions, Q̂ can get stuck in a local optimum (the greedy trap). Random restarts would get us out, but they take a long time. The advised approach is akin to simulated annealing: occasionally take a random action, with probability ε.

How do we decay our learning rate α_t?
It should go to zero, but not too quickly: the standard conditions are ∑ α_t = ∞ and ∑ α_t² < ∞, which a schedule such as α_t = 1/t satisfies.

How do we choose our actions?
One way is to always choose the same action, but that will always yield the same results. Another way is to choose randomly, but then we are not using what we have learned. The good idea is to use our estimate Q̂ to choose our actions.

ε-Greedy Exploration

If ε is decayed — that is, we start with a high amount of randomness and decrease it over time — then Q̂ → Q and π̂ → π*: the learned Q converges to the optimal Q, and the learned policy converges to the optimal policy.
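A minimal ε-greedy action selector with a decaying ε; the decay schedule here is an arbitrary illustrative choice:

```python
import random

def epsilon_greedy(Q, s, actions, t, eps0=1.0, decay=0.001):
    """Pick an action: random with probability eps(t), greedy on Q-hat otherwise."""
    eps = eps0 / (1.0 + decay * t)                  # epsilon shrinks as t grows
    if random.random() < eps:
        return random.choice(actions)               # explore
    return max(actions, key=lambda a: Q[(s, a)])    # exploit
```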

Exploration is about learning, while exploitation is about using what we have learned. ε-greedy is one way of balancing the two; there are many others worth exploring. This trade-off is the fundamental trade-off in reinforcement learning.

Conclusion

We first looked at the approaches to reinforcement learning. We then looked in detail at how to solve an MDP when we do not have T and R, only interactions; this was illustrated with Q-learning and its convergence. Finally we looked at ε-greedy exploration and the exploration-exploitation trade-off.

References — Heavily borrowed from my notes on CS7641, so thanks to Prof Charles Isbell and Prof Michael Littman. Errors are all mine.
