It’s been a while since I posted here… in fact, so long I’m not sure I even know my way around this thing anymore.

Regardless, I’ve been going through the second edition of Richard Sutton and Andrew Barto’s book “Reinforcement Learning: An Introduction” while trying my best to solve every exercise. One that took me a particularly long time was Exercise 6.1, on page 121. It sits in Chapter 6: Temporal-Difference Learning (by far the most exciting chapter in the book so far).

While trying to come up with ideas to solve the exercise, I searched around the web but found nothing that could give me a clue. So here is my attempt at sharing my solution and the rationale behind it.

First of all, notation!

This chapter talks about TD-Learning, so a few notions need to be established:

\[G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^{2} R_{t+3} + \cdots + \gamma^{T-t-1} R_T \\ = R_{t+1} + \gamma G_{t+1}\]

is the return (weighted sum of rewards) observed after time \(t\). The notation here implicitly assumes that we’re dealing with episodic environments.
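
To see the recursion in action, here's a minimal Python sketch (the function name `returns` and the variable names are my own, not the book's) that computes every \(G_t\) of an episode by walking backwards with \(G_t = R_{t+1} + \gamma G_{t+1}\) and \(G_T = 0\):

```python
def returns(rewards, gamma):
    """Compute G_0, ..., G_{T-1} from an episode's rewards R_1, ..., R_T.

    rewards[t] holds R_{t+1}, i.e. the reward received after leaving S_t.
    Uses the recursion G_t = R_{t+1} + gamma * G_{t+1}, with G_T = 0.
    """
    G = 0.0
    out = [0.0] * len(rewards)
    for t in reversed(range(len(rewards))):
        G = rewards[t] + gamma * G
        out[t] = G
    return out

print(returns([1.0, 0.0, 2.0], gamma=0.9))  # approximately [2.62, 1.8, 2.0]
```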

\(V_t(S_t)\) is the current approximation to a given state’s value. In other words, \(V_t\) is our current approximation to the value function.

And last but not least:

\[\delta_t \doteq R_{t+1} + \gamma V_t(S_{t+1}) - V_t(S_t)\]

is what’s called the TD error at time \(t\). In the book’s definition, \(V\) shows up without the index \(t\). That’s because up until page 121, we’re asked to work as if all value-function updates are performed after the end of the episode. Since Exercise 6.1 is exactly about what happens to the difference between the return \(G_t\) and \(V_t(S_t)\) when the estimates change throughout the episode, my definition already introduces the index.

You can see that this definition is correct (or at least that it makes sense): at the time we calculate \(\delta_t\), we have only just performed action \(A_t\) in state \(S_t\), received the reward \(R_{t+1}\), and observed the next state, \(S_{t+1}\). That means we haven’t updated our value estimates yet, which is why the error involves \(V_t\), not \(V_{t-1}\) or \(V_{t+1}\).
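
As a quick sketch of that timing (assuming the estimates live in a plain Python dict `V`, which is my choice of data structure rather than anything from the book), \(\delta_t\) is computed from the transition before anything in \(V\) changes:

```python
def td_error(V, s, r, s_next, gamma):
    """delta_t = R_{t+1} + gamma * V_t(S_{t+1}) - V_t(S_t).

    V still holds the estimates V_t: nothing has been updated yet, which is
    why both value terms read from the same (current) table. The terminal
    state is assumed to be stored in V with value 0.
    """
    return r + gamma * V[s_next] - V[s]
```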

What’s in the book:

In the book, we see that the difference between the return \(G_t\) and \(V(S_t)\) can be neatly written as a sum of TD errors, like this:

\[G_t - V(S_t) = R_{t+1} + \gamma G_{t+1} - V(S_t) + (\gamma V(S_{t+1}) - \gamma V(S_{t+1})) \\ = R_{t+1} + \gamma V(S_{t+1}) - V(S_t) + \gamma G_{t+1} - \gamma V(S_{t+1}) \\ = \delta_t + \gamma (G_{t+1} - V(S_{t+1})) \\ = \delta_t + \gamma \delta_{t+1} + \gamma^2 (G_{t+2} - V(S_{t+2})) \\ = \delta_t + \gamma \delta_{t+1} + \gamma^2 \delta_{t+2} + \cdots + \gamma^{T-t-1} \delta_{T-1} + \gamma^{T - t}(G_T - V(S_T))\]

Now we need to pause so we can understand exactly what \(G_T\) and \(V(S_T)\) are equal to. Let’s start with the return.

Remember the definition of \(G_t\) in terms of all the rewards that come after it:

\[G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots + \gamma^{T-t-1} R_T\\ = \sum_{k=t}^{T-1} \gamma^{k-t} R_{k+1}\]

We can apply the same formula for \(G_T\):

\[G_T = \sum_{k=T}^{T-1} \gamma^{k-T} R_{k+1}\]

But \(T > T - 1\), so this is an empty sum, which means \(G_T = 0\).

As for \(V(S_T)\), remember that the value of a state is the expected return that can be obtained – under a given policy – after visiting that state. If the state is terminal (which is the case for \(S_T\)), then the episode ends when we land on it. This means that we can’t obtain any reward after visiting that state, and consequently that its value is zero!

So, continuing the book’s proof:

\[= \delta_t + \gamma \delta_{t+1} + \gamma^2 \delta_{t+2} + \cdots + \gamma^{T-t-1} \delta_{T-1} + \gamma^{T-t}(0 - 0)\\ = \sum_{k=t}^{T-1} \gamma^{k-t} \delta_k\]
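
It’s easy to check this identity numerically before moving on. Here’s a small sketch (the random three-state trajectory, the dict-based \(V\), and all names are my own illustration, not anything from the book) that verifies \(G_t - V(S_t) = \sum_{k=t}^{T-1} \gamma^{k-t} \delta_k\) when the value function stays fixed during the episode:

```python
import random

gamma, T = 0.9, 6
states = [random.choice("abc") for _ in range(T)] + ["terminal"]  # S_0, ..., S_T
rewards = [random.random() for _ in range(T)]                     # R_1, ..., R_T
V = {s: random.random() for s in "abc"}
V["terminal"] = 0.0  # the terminal state's value is zero

# TD errors under a fixed V: delta_t = R_{t+1} + gamma * V(S_{t+1}) - V(S_t)
deltas = [rewards[t] + gamma * V[states[t + 1]] - V[states[t]] for t in range(T)]

for t in range(T):
    G_t = sum(gamma ** (k - t) * rewards[k] for k in range(t, T))
    assert abs((G_t - V[states[t]]) -
               sum(gamma ** (k - t) * deltas[k] for k in range(t, T))) < 1e-12
print("G_t - V(S_t) matches the sum of discounted TD errors for every t")
```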

Now we’re ready to account for a value function estimate that changes within the episode.

When estimates change within the episode

When we perform a TD update within the episode, it looks like this:

\[V_{t+1}(S_t) \doteq V_t(S_t) + \alpha [R_{t+1} + \gamma V_t(S_{t+1}) - V_t(S_t)]\\ = V_t(S_t) + \alpha \delta_t \text{ , and}\\ V_{t+1}(s') \doteq V_t(s') \quad \forall s' \neq S_t\]
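
In code, and still assuming a dict of estimates (my own hypothetical setup, not the book's), this within-episode update touches exactly one entry of the table:

```python
def td0_update(V, s, r, s_next, alpha, gamma):
    """One within-episode TD(0) update.

    Computes delta_t using the current estimates V_t, then sets
    V_{t+1}(S_t) = V_t(S_t) + alpha * delta_t. Every other entry is left
    untouched, which is the V_{t+1}(s') = V_t(s') part of the definition.
    """
    delta = r + gamma * V[s_next] - V[s]  # delta_t, computed before the update
    V[s] += alpha * delta                 # only the estimate for S_t changes
    return delta
```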

Let’s look at how to express the difference between the return at time \(t\) and the value estimate of state \(S_t\), which is exactly what Exercise 6.1 asks us to do:

\[G_t - V_t(S_t) = R_{t+1} + \gamma G_{t+1} - V_t(S_t) + (\gamma V_{t+1}(S_{t+1}) - \gamma V_{t+1}(S_{t+1}))\\ = R_{t+1} - V_t(S_t) + \gamma V_{t+1}(S_{t+1}) + \gamma (G_{t+1} - V_{t+1}(S_{t+1}))\]

We almost have everything that we need to make \(\delta_t\) appear. But instead of \(R_{t+1} + \gamma V_t(S_{t+1}) - V_t(S_t)\), we have the same expression with \(V_{t+1}(S_{t+1})\) in place of \(V_t(S_{t+1})\). So we’ll artificially introduce a \(V_t(S_{t+1})\) term into the expression by adding zero in a clever way.

\[= R_{t+1} - V_t(S_t) + \gamma V_{t+1}(S_{t+1}) + \gamma (G_{t+1} - V_{t+1}(S_{t+1})) + \gamma V_t(S_{t+1}) - \gamma V_t(S_{t+1})\\ = R_{t+1} + \gamma V_t(S_{t+1}) - V_t(S_t) + \gamma(V_{t+1}(S_{t+1}) - V_t(S_{t+1})) + \gamma (G_{t+1} - V_{t+1}(S_{t+1}))\\ = \delta_t + \gamma (V_{t+1}(S_{t+1}) - V_t(S_{t+1})) + \gamma(G_{t+1} - V_{t+1}(S_{t+1}))\\\]

We will now look closely at \(V_{t+1}(S_{t+1}) - V_t(S_{t+1})\), since it’s the only difference between this last step and the corresponding step in the case where estimates don’t change within the episode.

The only change between \(t\) and \(t+1\) is made to the estimate for \(S_t\), because that’s the only state we have new information about. In other words, from the definition of the update to \(V\), after observing \(R_{t+1}\) and \(S_{t+1}\) we only update the estimate for state \(S_t\), leaving all the other estimates untouched. This means that \(V_t(S_{t+1})\) and \(V_{t+1}(S_{t+1})\) are the same thing, except when \(S_{t+1} = S_t\)! When \(S_{t+1} = S_t\), we have:

\[V_{t+1}(S_{t+1}) - V_t(S_{t+1}) = V_{t+1}(S_t) - V_t(S_t) = (V_t(S_t) + \alpha \delta_t) - V_t(S_t) = \alpha \delta_t\]

We can express both cases – where \(S_{t+1}\) and \(S_t\) are/are not equal – compactly as:

\[V_{t+1}(S_{t+1}) - V_t(S_{t+1}) = \mathbf{I}( S_{t+1} = S_t ) (\alpha \delta_t)\]

where \(\mathbf{I}\) is the indicator function, which returns \(1\) when the predicate inside it is true, and \(0\) otherwise. We can then introduce this in our proof:

\[= \delta_t + \gamma (\mathbf{I}( S_{t+1} = S_t ) (\alpha \delta_t)) + \gamma(G_{t+1} - V_{t+1}(S_{t+1}))\\ = (1 + \alpha\gamma \mathbf{I}( S_{t+1} = S_t )) \delta_t + \gamma (G_{t+1} - V_{t+1}(S_{t+1}))\\ = (1 + \alpha\gamma \mathbf{I}( S_{t+1} = S_t )) \delta_t + \gamma (1 + \alpha\gamma \mathbf{I}(S_{t+2} = S_{t+1}))\delta_{t+1} + \gamma^2 (G_{t+2} - V_{t+2}(S_{t+2}))\\ = (1 + \alpha \gamma\mathbf{I}( S_{t+1} = S_t )) \delta_t + \gamma (1+ \alpha \gamma\mathbf{I}( S_{t+2} = S_{t+1}))\delta_{t+1} + \cdots + \gamma^{T-t-1}(1 + \alpha \gamma \mathbf{I}( S_T = S_{T-1} ))\delta_{T-1} + \gamma^{T-t}(G_T - V_T(S_T))\\ = \sum_{k=t}^{T-1} \gamma^{k-t} (1 + \alpha \gamma \mathbf{I}( S_{k+1} = S_k))\delta_k\]

(again, \(G_T\) and \(V_t(S_T), 0 \leq t \leq T\) are equal to zero for the same reasons as before).

Which means that, even though our state-value estimates change within the episode, the error between them and the corresponding returns can still be neatly written as a sum of TD errors, where each \(\delta_k\) picks up a factor of \(1 + \alpha\gamma\) every time \(S_{k+1}\) is equal to \(S_k\).
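
As in the fixed-\(V\) case, it’s reassuring to check the final formula numerically. The sketch below (the two-state random trajectory, the dict-based table, and all names are my own illustration) performs within-episode TD(0) updates, records each \(\delta_t\) together with the pre-update \(V_t(S_t)\), and asserts that \(G_t - V_t(S_t) = \sum_{k=t}^{T-1} \gamma^{k-t} (1 + \alpha \gamma \mathbf{I}(S_{k+1} = S_k)) \delta_k\):

```python
import random

alpha, gamma, T = 0.5, 0.9, 8
# Only two non-terminal states, so S_{t+1} == S_t actually happens sometimes.
states = [random.choice("ab") for _ in range(T)] + ["terminal"]  # S_0, ..., S_T
rewards = [random.random() for _ in range(T)]                    # R_1, ..., R_T
V = {"a": random.random(), "b": random.random(), "terminal": 0.0}

deltas, v_before = [], []
for t in range(T):
    v_before.append(V[states[t]])                                 # V_t(S_t)
    delta = rewards[t] + gamma * V[states[t + 1]] - V[states[t]]  # uses V_t
    V[states[t]] += alpha * delta                                 # V_t -> V_{t+1}
    deltas.append(delta)

for t in range(T):
    G_t = sum(gamma ** (k - t) * rewards[k] for k in range(t, T))
    rhs = sum(gamma ** (k - t) * (1 + alpha * gamma * (states[k + 1] == states[k]))
              * deltas[k] for k in range(t, T))
    assert abs((G_t - v_before[t]) - rhs) < 1e-10
print("the modified identity holds at every step of the episode")
```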

Thanks for bearing with me through what I think is a very nice and rewarding proof!

A sidenote

When I first attempted this problem, I made an erroneous derivation that led to a different result and did not account for the case where \(S_{t+1} = S_t\). Working that case into my old proof would have made it too convoluted, so here’s another way to get to the same result:

\[G_t - V_t(S_t) = R_{t+1} + \gamma G_{t+1} - V_t(S_t) + (\gamma V_t(S_{t+1}) - \gamma V_t(S_{t+1}))\\ = R_{t+1} + \gamma V_t(S_{t+1}) - V_t(S_t) + \gamma (G_{t+1} - V_t(S_{t+1}))\\ = \delta_t + \gamma [G_{t+1} - (V_{t+1}(S_{t+1}) - \alpha \delta_t \mathbf{I}(S_{t+1} = S_t))]\\ = \delta_t + \gamma [G_{t+1} - V_{t+1}(S_{t+1}) + \alpha \delta_t \mathbf{I}(S_{t+1} = S_t)]\\ = \delta_t + \gamma \alpha \delta_t \mathbf{I}(S_{t+1} = S_t) + \gamma (G_{t+1} - V_{t+1}(S_{t+1}))\\ = (1 + \alpha\gamma\mathbf{I}(S_{t+1} = S_t))\delta_t + \gamma (G_{t+1} - V_{t+1}(S_{t+1})) = \cdots\]

where the rest of the proof stays exactly the same.

References

  • Richard S. Sutton and Andrew G. Barto, “Reinforcement Learning: An Introduction”, 2nd edition, MIT Press, 2018, p. 121.