In the first post about policy evaluation, our current subject, I talked a bit about concepts such as rewards, goals and value functions, and gave an example algorithm for computing the latter. Of course, that’s not the only option available (and, to be honest, it’s not the one that performs best either).

In this post, I’ll introduce the idea of temporal difference methods. Such techniques use a concept called bootstrapping when learning state-value or action-value functions. In short, we will use current (and possibly wrong) value estimates to update other value estimates. I swear it works.

Let’s dive right in!

Ok, give me context

Again, we’re estimating \(v_\pi(s)\), the state-value function of a policy \(\pi\) which dictates our behavior in a given environment. We want our agent to optimize the discounted return \(G_t = \sum_{i=0}^T \gamma^i R_{t+i+1}\), where \(\gamma\) determines how much we care about rewards in the future. We saw a method based on a Monte Carlo strategy to determine this value function, but it has some flaws:

High variance

The updates generated by a Monte Carlo algorithm are not biased in the statistical sense of the word: we are calculating the expected return when starting from a state by doing exactly that. We start from a state, see how big our return is when following policy \(\pi\), and shift our estimate towards that value.

We are approximating an expected value by averaging samples of it. That makes total sense, right? The thing is, the world is not that simple: taking many steps before finishing an episode and only then updating our estimates introduces a high amount of variance into those updates.

Think about it: the environment is an MDP, which means that we have a distribution over the next possible states for each state-action pair. The longer we move around in the MDP, the stronger the effects of this randomness inherent to the environment become. Hence, our updates might have wildly different values, a.k.a. high variance.

Episodic constraint

If you look back at the code for the Monte Carlo algorithm, you’ll see that an episode is first run to completion, and only then do we use the obtained rewards to update our value estimates. But what about environments where we don’t have the notion of an episode? What if we’re constantly interacting and should be learning on the fly? Clearly, pure MC is not applicable there.

Online updates not possible

Another point highly related to the previous one is that we can’t change our estimates while interacting with the environment. Only after stopping can we see what happened and evaluate our policy accordingly.

A valid analogy is an oblivious human who’s trying to get up from the couch, walk to the kitchen and grab some orange juice. Suppose their internal policy guides them through a path that makes them bump their pinky toe on a table (we all hate that, right?). With Monte Carlo, we’d still have to wait for the human to actually reach the bottle of orange juice, thus ending the episode, to learn that our policy isn’t that great. We’ll see that there’s a way to learn that such behavior is bad right when we decide to take the step that makes us hit the table.

Some more concepts

Targets

Something that was only hinted at in the last post is the concept of a target for our policy evaluation algorithms. From now on, it will make more sense to talk about targets explicitly when explaining algorithms.

The target is the value towards which we update our estimate. For example, looking at the MC update expression, we have (considering state \(s\) is the one visited at time \(t\)):

\[ v_\pi(s) \gets v_\pi(s) + \alpha \, (g_t - v_\pi(s)) \] where \(g_t\) is the “sampled” value of the discounted return from time \(t\) onwards.

Here, \(g_t\) is our target, because it’s what we’re shifting our current value estimate to. As we’ll see, there are many different targets we can use.
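As a tiny code sketch of that update (the function and variable names here are just illustrative assumptions, not anything from a specific library): we take the rewards sampled from time \(t\) until the end of the episode, compute \(g_t\), and nudge the estimate towards it.

def discounted_return(rewards, gamma):
  # g_t: discount and sum the rewards observed from time t until the episode ended
  g = 0.0
  for reward in reversed(rewards):
    g = reward + gamma * g
  return g

def mc_update(values, state, rewards, alpha, gamma):
  # Shift the estimate for `state` towards the sampled return g_t
  g_t = discounted_return(rewards, gamma)
  values[state] += alpha * (g_t - values[state])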

Truncated return

One idea (one that will lead us to Temporal Difference methods) is to truncate the full return at some point. The reasoning behind this is that it removes the need to wait for a full episode to end before using its value to update our estimates.

For example, nothing stops us from truncating our return after seeing, say, 10 steps of interaction with the environment. In fact, why bother wasting 10 steps? Why not truncate it after one step?

\[ G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = R_{t+1} + \gamma \,\epsilon \]

where we use \(\epsilon\) to denote the fact that we won’t wait to see what happens in the future and just plug in some value there.
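To make that idea concrete, here’s a small sketch of a truncated return in code (the names are mine, purely for illustration): we keep the rewards we actually observed and plug in a guess for everything after the cut-off.

def truncated_return(rewards, gamma, tail_guess):
  # Discounted return built from the rewards we actually observed,
  # with `tail_guess` standing in for everything after the cut-off.
  g = tail_guess  # the "epsilon" term: a stand-in for the unseen future
  for reward in reversed(rewards):
    g = reward + gamma * g
  return g

# One-step truncation: only the first reward is real, the rest is a guess.
# truncated_return([r], gamma, tail_guess) == r + gamma * tail_guess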

Looking a little closer at the first part of that truncated-return equation and at the definition of the state-value function, we can get a good idea of the value to use for \(\epsilon\). Bear with me.

Let’s start by taking expectations on both sides:

\[\mathbb{E}_\pi [\, G_t \mid S_t = s \,] = \mathbb{E}_\pi [\, R_{t+1} + \gamma\,G_{t+1} \mid S_t = s \,]\\ v_\pi(s) = \mathbb{E}_\pi [\, R_{t+1} \mid S_t = s \,] + \gamma \, \mathbb{E}_{a \sim \pi(a|s),\; s'\sim\mathcal{P}_a^{s,\cdot}} \big[\, v_\pi(s') \,\big]\]

“OK Alister, you just made it more complicated.” I swear it’s worth it. This formula shows (again, maybe) that if we have the state values for all states succeeding a given state \(s\), we can compute the value of \(s\) itself, or at least estimate it precisely.

The thing is… we don’t have that. This is the point where many people will use the concept of an “oracle” to explain what comes next. I don’t like it, so I’ll try another route.

What if we had something, an approximation, that we could use in place of that expectation? If that approximation got better and better over time, we’d expect our estimate for \(v_\pi(s)\) to get better too. That’s precisely what we will do.
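If you want a sanity check that the relation above holds, here’s a tiny numerical experiment on a completely made-up two-state MDP (the numbers and the matrix form below are just for illustration): we solve the Bellman equation as a linear system and confirm that each state’s value equals the expected immediate reward plus \(\gamma\) times the expected value of the successor state.

import numpy as np

# A made-up 2-state MDP with a fixed policy already baked in:
# P[s, s'] = probability of landing in s' when leaving s under pi,
# R[s]     = expected immediate reward when leaving s under pi.
gamma = 0.9
P = np.array([[0.8, 0.2],
              [0.3, 0.7]])
R = np.array([1.0, -1.0])

# Solve the Bellman equation v = R + gamma * P v exactly.
v = np.linalg.solve(np.eye(2) - gamma * P, R)

# Check: for every state, v(s) equals E[reward] + gamma * E[v(next state)].
assert np.allclose(v, R + gamma * P @ v)
print(v)  # the exact state values for this toy MDP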

Temporal difference methods

Ok, we’re finally there! Temporal Difference! We’ll start by talking about the algorithm that follows the idea described so far, which is called TD(0).

TD(0)

Remember: we’re trying to find something to replace the expectation in our equation for the state-value function. Keep in mind also that we don’t need a good approximation at first; we just need it to get better over time. This is crucial, because here’s what we will do:

Say our current approximation for the state-value function is \(\hat{v_\pi}(s)\). Whenever we interact with the environment, starting from state \(s\), taking action \(a\), receiving reward \(r\) and ending up in state \(s'\), we will update our estimate in the following way:

\[ \hat{v_\pi}(s) \gets \hat{v_\pi}(s) + \alpha \, (r + \gamma \, \hat{v_\pi}(s') - \hat{v_\pi}(s)) \]

You might be staring at the screen incredulously now. Or thinking “Wait, can you even do this?”. Truth is: yes, you can. And in TD(0), you will.

The intuition for why this way of estimating makes sense comes from the fact that we introduce information from the real process into our estimates at every step. We see a real reward \(r\), and update our estimates according to it. When we use one estimate to update another, we assume that this real information propagates to all estimates if we do this enough times. The process of using a current estimate to update another is what we call bootstrapping. The function sort of “feeds off itself”. Below, there’s code for such a procedure.

Algorithm 1: TD(0) - estimating state-value function through Temporal Difference.

# Tabular TD(0): one value estimate per state, updated after every step.
state_values = [0.0 for _ in range(n_states)]
state = env.reset()  # start the environment and get the initial state

for t in range(n_steps):
  # Act according to the policy we are evaluating
  action = policy(state)
  new_state, reward, done = env.step(action)
  # TD(0) target: the observed reward plus the discounted estimate of the next state
  # (terminal states are worth 0, so we don't bootstrap past the end of an episode)
  target = reward if done else reward + gamma * state_values[new_state]
  # Update the current estimate for the state we just left
  state_values[state] += alpha * (target - state_values[state])
  # If we reached the end of an episode, reset the environment and continue
  if done:
    state = env.reset()
  else:
    state = new_state
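The snippet above assumes that `env`, `policy`, `n_states`, `alpha`, `gamma` and `n_steps` already exist. If you want to actually run it, here’s one possible way to fill in those blanks with a tiny, made-up random-walk environment (everything below is an assumption for illustration, not part of any real library); define these first and then run the loop above.

import random

# A made-up 5-state random walk: states 0..4, start in the middle.
# States 0 and 4 are terminal; landing on state 4 gives reward +1, else 0.
n_states = 5
alpha, gamma, n_steps = 0.1, 1.0, 10000  # no discounting for this tiny example

class RandomWalkEnv:
  def reset(self):
    self.state = 2
    return self.state

  def step(self, action):
    # The action is ignored here: the walk just moves left or right at random.
    self.state += random.choice([-1, 1])
    done = self.state in (0, n_states - 1)
    reward = 1.0 if self.state == n_states - 1 else 0.0
    return self.state, reward, done

def policy(state):
  # A dummy policy; the environment above ignores the action anyway.
  return 0

env = RandomWalkEnv()

With those pieces in place, the learned values for states 1, 2 and 3 should settle around roughly 0.25, 0.5 and 0.75, which are the true probabilities of this little walk ending on the right side.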

In other words, our target is now \(r + \gamma \, \hat{v_\pi}(s')\). The term \(r + \gamma \, \hat{v_\pi}(s') - \hat{v_\pi}(s)\) is called the TD(0) error. As we travel around states, all our estimates will start to incorporate the real information introduced by the rewards we get. When we bootstrap from a state we already visited, we’re propagating this information to another state-value estimate. Isn’t that just beautiful?

Of course, not everything is perfect: we have now introduced bias into our estimates, because the target itself relies on a current (possibly wrong) estimate instead of being an unbiased sample of the return. We have largely reduced the variance, though: each update now depends on a single step of interaction with the environment, so the compounded randomness of long episodes is gone.

What this was all about:

This post introduced the idea of TD, or Temporal Difference, methods (thank you, Barto). These avoid the need to wait until the end of an episode to update our estimate for the value of a given state (or state-action pair). Also, TD updates have lower variance, at the cost of introducing bias. Again, if we “cool down” the value of \(\alpha\) over time, Temporal Difference converges to the true value function, and typically does so even faster than Monte Carlo estimates.
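For the curious, here’s one possible way to implement that cool-down (this particular schedule is my own example, not something prescribed above): shrink the step size with the number of times each state has been visited.

# One possible step-size schedule: alpha shrinks as 1/n on the n-th visit to a state.
# Schedules like this satisfy the usual stochastic-approximation conditions
# (the step sizes sum to infinity, their squares do not).
visit_counts = [0 for _ in range(n_states)]

def step_size(state):
  visit_counts[state] += 1
  return 1.0 / visit_counts[state]

# In the TD(0) loop, the fixed `alpha` would then be replaced by step_size(state).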

You might be wondering about at least two things now. First, why do we only look one step ahead when updating our estimates? And second, what does that zero in TD(0) even mean?

The two questions are related: you might think there are benefits to looking further ahead in time while still not waiting for the episode to end before updating our value estimates. You’d be correct to think so. TD(0) is part of a broader class called TD(\(\lambda\)) methods; here we just used \(\lambda = 0\). In the next post, we’ll see that these other methods are a smart way of combining information from all time steps, without waiting for the episode to end to update our estimates. Pretty magic, right?

To give you just a bit more info, Monte Carlo methods are also under the TD(\(\lambda\)) umbrella. They’re TD(1)! This should give you a hint as to what the role of \(\lambda\) is.

So, this was TD(0). The next post is on something called eligibility traces and TD(\(\lambda\)). See you next time!

References

  • Andrew Barto’s paper
  • Again, David Silver’s class