Yuanjun Feng 528668 Homework 1
1. MDPs
1.1
First of all, we have the value iteration formula:
$$V_{k+1}(s) = r(s) + \gamma \cdot \max_{a\in\{\text{Run},\,\text{Fight}\}} \sum_{s'} P(s'\mid s,a)\, V_k(s').$$
where:
$r(s)$: O = +2, D = −1, C = −4
$\gamma = 0.9$
Initialization: $V_1(O)=2$, $V_1(D)=-1$, $V_1(C)=-4$
Then, computing $V_2$ from $V_1$, we evaluate each state:

$V_2(O)$:
Run:
$$2 + 0.9\,[0.6\cdot 2 + 0.4\cdot(-1)] = 2 + 0.9\cdot 0.8 = 2.72$$
Fight:
$$2 + 0.9\,[0.75\cdot 2 + 0.25\cdot(-4)] = 2 + 0.9\cdot 0.5 = 2.45$$
Taking the max of the two values: $V_2(O) = 2.72$.

$V_2(D)$:
Run:
$$-1 + 0.9\,[0.4\cdot 2 + 0.3\cdot(-1) + 0.3\cdot(-4)] = -1 + 0.9\cdot(-0.7) = -1.63$$
Fight:
$$-1 + 0.9\,[0.3\cdot 2 + 0.25\cdot(-1) + 0.45\cdot(-4)] = -1 + 0.9\cdot(-1.45) = -2.305$$
Taking the max of the two values: $V_2(D) = -1.63$.

$V_2(C)$:
Run:
$$-4 + 0.9\,[0.6\cdot(-1) + 0.4\cdot(-4)] = -4 + 0.9\cdot(-2.2) = -5.98$$
Fight:
$$-4 + 0.9\,[0.2\cdot 2 + 0.8\cdot(-4)] = -4 + 0.9\cdot(-2.8) = -6.52$$
Taking the max of the two values: $V_2(C) = -5.98$.
For $V_3$ from $V_2$:

$V_3(O)$:
Run:
$$2 + 0.9\,[0.6\cdot 2.72 + 0.4\cdot(-1.63)] = 2 + 0.9\cdot 0.98 = 2.882$$
Fight:
$$2 + 0.9\,[0.75\cdot 2.72 + 0.25\cdot(-5.98)] = 2 + 0.9\cdot 0.545 = 2.4905$$
Taking the max of the two values: $V_3(O) = 2.882$.

$V_3(D)$:
Run:
$$-1 + 0.9\,[0.4\cdot 2.72 + 0.3\cdot(-1.63) + 0.3\cdot(-5.98)] = -1 + 0.9\cdot(-1.195) = -2.0755$$
Fight:
$$-1 + 0.9\,[0.3\cdot 2.72 + 0.25\cdot(-1.63) + 0.45\cdot(-5.98)] = -1 + 0.9\cdot(-2.2825) = -3.05425$$
Taking the max of the two values: $V_3(D) = -2.0755$.

$V_3(C)$:
Run:
$$-4 + 0.9\,[0.6\cdot(-1.63) + 0.4\cdot(-5.98)] = -4 + 0.9\cdot(-3.37) = -7.033$$
Fight:
$$-4 + 0.9\,[0.2\cdot 2.72 + 0.8\cdot(-5.98)] = -4 + 0.9\cdot(-4.24) = -7.816$$
Taking the max of the two values: $V_3(C) = -7.033$.
Finally, we can fill in the table:
$$\begin{array}{c|ccc}
k & V_k(O) & V_k(D) & V_k(C) \\ \hline
1 & 2 & -1 & -4 \\
2 & 2.72 & -1.63 & -5.98 \\
3 & 2.882 & -2.0755 & -7.033
\end{array}$$
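As a sanity check, the table can be reproduced numerically. The sketch below assumes the transition probabilities read off from the backups above (e.g. $P(O\mid O,\text{Run})=0.6$) and applies the value-iteration update with NumPy:

```python
import numpy as np

# States indexed O=0, D=1, C=2; model read off from the backups above.
r = np.array([2.0, -1.0, -4.0])   # r(O), r(D), r(C)
gamma = 0.9
# P[a, s, s']: transition probabilities for a=0 (Run) and a=1 (Fight).
P = np.array([
    [[0.60, 0.40, 0.00],          # O, Run
     [0.40, 0.30, 0.30],          # D, Run
     [0.00, 0.60, 0.40]],         # C, Run
    [[0.75, 0.00, 0.25],          # O, Fight
     [0.30, 0.25, 0.45],          # D, Fight
     [0.20, 0.00, 0.80]],         # C, Fight
])

def vi_step(V):
    """One update: V'(s) = r(s) + gamma * max_a sum_s' P(s'|s,a) V(s')."""
    return r + gamma * np.max(P @ V, axis=0)

V1 = r.copy()                     # initialization V_1 = r
V2 = vi_step(V1)                  # approx. [2.72, -1.63, -5.98]
V3 = vi_step(V2)                  # approx. [2.882, -2.0755, -7.033]
```

The `P @ V` product gives the expected next-state value for every (action, state) pair at once, so the whole backup is one line.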
1.2
For both $k = 2$ and $k = 3$, the greedy policy with respect to $V_k$ selects Run in states $O$, $D$, and $C$: in every state the Run backup exceeds the Fight backup, as the calculations above show.
However, these choices are not guaranteed to be globally optimal. Finitely many iterations yield a policy that is greedy only with respect to the current estimate $V_k$, and the greedy choices may still change as the values converge.
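The greedy choice can be extracted directly from the backups; a minimal sketch, assuming the same hypothetical transition model as in 1.1:

```python
import numpy as np

# Same MDP as in 1.1 (model read off from the backups there).
r = np.array([2.0, -1.0, -4.0]); gamma = 0.9
P = np.array([
    [[0.60, 0.40, 0.00], [0.40, 0.30, 0.30], [0.00, 0.60, 0.40]],   # Run
    [[0.75, 0.00, 0.25], [0.30, 0.25, 0.45], [0.20, 0.00, 0.80]],   # Fight
])
actions = ["Run", "Fight"]

def greedy(V):
    """Greedy policy w.r.t. V: argmax_a of the one-step backup, per state."""
    return [actions[a] for a in np.argmax(P @ V, axis=0)]

V1 = r.copy()
V2 = r + gamma * np.max(P @ V1, axis=0)
V3 = r + gamma * np.max(P @ V2, axis=0)
print(greedy(V2))   # greedy w.r.t. V_2
print(greedy(V3))   # greedy w.r.t. V_3
```

Both calls return Run for all three states, matching the claim above.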
2. Value Iteration Theorem
We start from the Bellman operator under policy $\pi$:
$$(B_\pi V)(s)=\mathbb{E}_{a\sim\pi(\cdot\mid s)}\Big[R(s,a)+\gamma\sum_{s'}p(s'\mid s,a)\,V(s')\Big].$$
Expanding both evaluations, cancelling the reward terms, and taking the sup norm over states:
$$\begin{aligned}
\lVert B_\pi V - B_\pi V'\rVert_\infty
&= \max_s \Big|\, \mathbb{E}_{a\sim\pi}\Big[ R(s,a) + \gamma\sum_{s'}p(s'\mid s,a) V(s') \Big] - \mathbb{E}_{a\sim\pi}\Big[ R(s,a) + \gamma\sum_{s'}p(s'\mid s,a) V'(s') \Big] \Big| \\
&= \max_s \Big|\, \gamma\,\mathbb{E}_{a\sim\pi}\sum_{s'}p(s'\mid s,a)\,\big(V(s')-V'(s')\big) \Big|
\end{aligned}$$
Then we pull out $\gamma$ and apply the triangle inequality $|\mathbb{E}[X]|\le \mathbb{E}[|X|]$:
$$\begin{aligned}
\lVert B_\pi V - B_\pi V'\rVert_\infty
&\le \gamma\,\max_s \mathbb{E}_{a\sim\pi}\sum_{s'}p(s'\mid s,a)\,\big|V(s')-V'(s')\big| \\
&\le \gamma\,\max_s \mathbb{E}_{a\sim\pi}\sum_{s'}p(s'\mid s,a)\,\lVert V-V'\rVert_\infty
\end{aligned}$$
Using $\sum_{s'}p(s'\mid s,a)=1$:
$$\gamma\,\max_s \mathbb{E}_{a\sim\pi}\sum_{s'}p(s'\mid s,a)\,\lVert V-V'\rVert_\infty = \gamma\,\lVert V-V'\rVert_\infty$$
So we have:
$$\lVert B_\pi V - B_\pi V'\rVert_\infty \le \gamma\,\lVert V-V'\rVert_\infty$$
From (a) we have
$$\|B_\pi V - B_\pi V'\|_\infty \le \gamma \|V - V'\|_\infty ,$$
so $B_\pi$ is a $\gamma$-contraction on the complete metric space $(\mathbb{R}^{|\mathcal S|}, \|\cdot\|_\infty)$.
By the Banach fixed-point theorem, there exists a unique fixed point $\bar V$ such that $\bar V = B_\pi \bar V$, and for any initial value $V_0$ the iteration $V_{t+1}=B_\pi V_t$ converges to $\bar V$ at a geometric rate.
On the other hand, we have the Bellman equation for policy $\pi$:
$$V^\pi = B_\pi V^\pi .$$
Since the fixed point of $B_\pi$ is unique, we must have $\bar V = V^\pi$. Therefore, the unique fixed point of $B_\pi$ is $V^\pi$.
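Both claims (the $\gamma$-contraction and the fixed point being $V^\pi$) can be spot-checked numerically. The sketch below uses a hypothetical random MDP and random stochastic policy, chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 5, 3, 0.9
R = rng.normal(size=(nS, nA))                                      # rewards R(s,a)
P = rng.random((nS, nA, nS)); P /= P.sum(axis=2, keepdims=True)    # p(s'|s,a)
pi = rng.random((nS, nA)); pi /= pi.sum(axis=1, keepdims=True)     # pi(a|s)

def B_pi(V):
    """(B_pi V)(s) = E_{a~pi}[R(s,a) + gamma * sum_s' p(s'|s,a) V(s')]."""
    return (pi * (R + gamma * P @ V)).sum(axis=1)

# gamma-contraction in the sup norm:
V, W = rng.normal(size=nS), rng.normal(size=nS)
lhs = np.max(np.abs(B_pi(V) - B_pi(W)))
rhs = gamma * np.max(np.abs(V - W))

# The unique fixed point is V^pi, computed here by solving (I - gamma*P_pi) V = r_pi:
P_pi = np.einsum('sa,sat->st', pi, P)          # state-to-state kernel under pi
r_pi = (pi * R).sum(axis=1)
V_pi = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)
X = np.zeros(nS)
for _ in range(500):                           # geometric convergence of the iterates
    X = B_pi(X)
gap = np.max(np.abs(X - V_pi))
```

Here `lhs <= gamma * ||V - W||` holds for any pair of vectors, and the iterates of $B_\pi$ land on the linear-solve value of $V^\pi$.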
The greedy policy is not guaranteed to be the optimal policy.
For the greedy policy $\pi = \pi_V$ with respect to $V$, the value $V^\pi$ satisfies the Bellman equation
$$V^\pi = B_\pi V^\pi .$$
However, $V$ itself satisfies this equation only when $V$ is the fixed point of $B_\pi$, i.e. $V = B_\pi V$. In general $V \ne V^\pi$, so the greedy policy with respect to an arbitrary $V$ is usually not optimal.
When $\pi$ is the greedy policy with respect to $V_n$, we have $BV_n = B_\pi V_n$, hence $V_{n+1} = B_\pi V_n$ and:
$$\lVert V^\pi - V_{n+1} \rVert = \lVert B_\pi V^\pi - B_\pi V_n \rVert \le \gamma \lVert V^\pi - V_n \rVert$$
Then, by the triangle inequality:
$$\gamma \lVert V^\pi - V_n \rVert \le \gamma \lVert V^\pi - V_{n+1} \rVert + \gamma \lVert V_{n+1} - V_n \rVert$$
So we have:
$$\begin{aligned}
(1-\gamma) \lVert V^\pi - V_{n+1} \rVert &\le \gamma \lVert V_{n+1} - V_n \rVert \\
\lVert V^\pi - V_{n+1} \rVert &\le \frac{\gamma}{1-\gamma} \lVert V_{n+1} - V_n \rVert
\end{aligned}$$
Finally, applying the stopping condition $\lVert V_{n+1} - V_n \rVert < \frac{\epsilon(1-\gamma)}{2\gamma}$:
$$\lVert V^\pi - V_{n+1} \rVert < \frac{\epsilon}{2},$$
which in particular gives $\lVert V^\pi - V_{n+1} \rVert \le \frac{\epsilon}{2}$.
To prove that $\lVert B^kV - B^k V' \rVert \le \gamma^k \lVert V - V' \rVert$, we use induction.
Induction hypothesis: assume $\lVert B^{k}V - B^{k} V' \rVert \le \gamma^{k} \lVert V - V' \rVert$.
We also know that the Bellman optimality operator $B$ is a $\gamma$-contraction:
$$\lVert BV - BV' \rVert \le \gamma \lVert V - V' \rVert$$
Combining the two, for $k+1$:
$$\begin{aligned}
\lVert B^{k+1}V - B^{k+1} V' \rVert &= \lVert B(B^kV) - B(B^k V') \rVert \\
&\le \gamma \lVert B^kV - B^k V' \rVert \\
&\le \gamma \cdot \gamma^k \lVert V - V' \rVert = \gamma^{k+1} \lVert V - V' \rVert
\end{aligned}$$
The base case $k=1$ holds since $\lVert B^1V - B^1 V' \rVert \le \gamma \lVert V - V' \rVert$. By induction, $\lVert B^kV - B^k V' \rVert \le \gamma^k \lVert V - V' \rVert$ for all $k$.
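The $\gamma^k$ rate is easy to check numerically; a sketch on a hypothetical random MDP, with $B$ implemented as the optimality backup:

```python
import numpy as np

rng = np.random.default_rng(1)
nS, nA, gamma = 4, 2, 0.8
R = rng.normal(size=(nS, nA))
P = rng.random((nS, nA, nS)); P /= P.sum(axis=2, keepdims=True)

def B(V):
    """Bellman optimality operator: (BV)(s) = max_a [R(s,a) + gamma * sum_s' p(s'|s,a) V(s')]."""
    return np.max(R + gamma * P @ V, axis=1)

V, W = rng.normal(size=nS), rng.normal(size=nS)
d0 = np.max(np.abs(V - W))          # initial sup-norm distance
checks = []
for k in range(1, 11):
    V, W = B(V), B(W)
    # after k applications the distance is at most gamma^k times the original
    checks.append(np.max(np.abs(V - W)) <= gamma**k * d0 + 1e-12)
ok = all(checks)
```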
Method 1 - Triangle Inequality
We can use the same method as in (e):
$$\begin{aligned}
\lVert V^* - V_{n+1} \rVert &= \lVert BV^* - BV_{n} \rVert \le \gamma \lVert V^* - V_{n} \rVert \\
&\le \gamma \lVert V^* - V_{n+1} \rVert + \gamma \lVert V_{n+1} - V_n \rVert
\end{aligned}$$
$$\begin{aligned}
(1-\gamma) \lVert V^* - V_{n+1} \rVert &\le \gamma \lVert V_{n+1} - V_n \rVert \\
\lVert V^* - V_{n+1} \rVert &\le \frac{\gamma}{1-\gamma} \lVert V_{n+1} - V_n \rVert < \frac{\gamma}{1-\gamma} \cdot \frac{\epsilon(1-\gamma)}{2\gamma} = \frac{\epsilon}{2}
\end{aligned}$$
So $\lVert V^* - V_{n+1} \rVert \le \frac{\epsilon}{2}$ holds.
Method 2 - Repeatedly Splitting
Following the hint, we repeatedly split the term $\lVert V^* - V_{n} \rVert$ into smaller parts:
$$\lVert V^* - V_{n} \rVert \le \sum_{i=0}^{\infty} \lVert V_{n+i+1} - V_{n+i} \rVert$$
For each term, we have:
$$\begin{aligned}
\lVert V_{n+i+1} - V_{n+i} \rVert &= \lVert B V_{n+i} - B V_{n+i-1} \rVert \\
&\le \gamma \lVert V_{n+i} - V_{n+i-1} \rVert \le \cdots \le \gamma^i \lVert V_{n+1} - V_n \rVert
\end{aligned}$$
So we have:
$$\lVert V^* - V_{n} \rVert \le \sum_{i=0}^{\infty} \gamma^i \lVert V_{n+1} - V_n \rVert = \frac{1}{1-\gamma} \lVert V_{n+1} - V_n \rVert$$
Finally, for $\lVert V^* - V_{n+1} \rVert$:
$$\begin{aligned}
\lVert V^* - V_{n+1} \rVert &= \lVert BV^* - BV_{n} \rVert \le \gamma \lVert V^* - V_{n} \rVert \le \frac{\gamma}{1-\gamma} \lVert V_{n+1} - V_n \rVert
\end{aligned}$$
Just as in the previous method, introducing the stopping condition yields:
$$\lVert V^* - V_{n+1} \rVert \le \frac{\epsilon}{2}$$
Combining the two bounds with the triangle inequality:
$$\begin{aligned}
\lVert V^\pi - V^* \rVert &\le \lVert V^\pi - V_{n+1} \rVert + \lVert V_{n+1} - V^* \rVert \\
&= \lVert V^\pi - V_{n+1} \rVert + \lVert V^* - V_{n+1} \rVert \le \frac{\epsilon}{2} + \frac{\epsilon}{2} = \epsilon
\end{aligned}$$
So we have $\lVert V^\pi - V^* \rVert \le \epsilon$.
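The whole guarantee can be exercised end to end: run value iteration with the stopping rule, act greedily on the final iterate, and compare $V^\pi$ against $V^*$. A sketch under assumed random dynamics with rewards in $[0,1]$:

```python
import numpy as np

rng = np.random.default_rng(2)
nS, nA, gamma, eps = 6, 3, 0.9, 1e-3
R = rng.random((nS, nA))                                          # rewards in [0, 1]
P = rng.random((nS, nA, nS)); P /= P.sum(axis=2, keepdims=True)

def B(V):
    return np.max(R + gamma * P @ V, axis=1)

# Value iteration with stopping rule ||V_{n+1} - V_n|| < eps*(1-gamma)/(2*gamma).
V = np.zeros(nS)
while True:
    Vn = B(V)
    done = np.max(np.abs(Vn - V)) < eps * (1 - gamma) / (2 * gamma)
    V = Vn
    if done:
        break

# Greedy policy w.r.t. the final iterate, evaluated exactly via a linear solve.
pi = np.argmax(R + gamma * P @ V, axis=1)
idx = np.arange(nS)
V_pi = np.linalg.solve(np.eye(nS) - gamma * P[idx, pi], R[idx, pi])

# Near-exact V*: iterate B far past the stopping point.
V_star = np.zeros(nS)
for _ in range(2000):
    V_star = B(V_star)
gap = np.max(np.abs(V_pi - V_star))   # the theorem guarantees gap <= eps
```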
3. Simulation Lemma and Model-based Learning
For the finite-horizon MDP, we have the value functions
$$\begin{aligned}
V_h^\pi(s) &= \mathbb{E}_{\pi,\,P}\Big[\sum_{t=h}^{H-1} r(s_t, a_t) \,\Big|\, s_h = s\Big],\\
\tilde V_h^\pi(s) &= \mathbb{E}_{\pi,\,\tilde P}\Big[\sum_{t=h}^{H-1} r(s_t, a_t) \,\Big|\, s_h = s\Big],
\end{aligned}$$
where the trajectories are generated under the true dynamics $P$ and the model dynamics $\tilde P$, respectively.
So the Bellman forms are:
$$\begin{aligned}
V_h^\pi(s) &= \mathbb{E}_{a\sim\pi(\cdot\mid s)}\Big[r(s,a) + \mathbb{E}_{s'\sim P_h(\cdot\mid s,a)}V_{h+1}^\pi(s')\Big],\\
\tilde V_h^\pi(s) &= \mathbb{E}_{a\sim\pi(\cdot\mid s)}\Big[r(s,a) + \mathbb{E}_{s'\sim \tilde P_h(\cdot\mid s,a)}\tilde V_{h+1}^\pi(s')\Big]
\end{aligned}$$
So we can define the difference as:
$$\begin{aligned}
\Delta_h(s) &= V_h^\pi(s) - \tilde V_h^\pi(s) \\
&= \mathbb{E}_{a\sim\pi(\cdot\mid s)}\Big[\mathbb{E}_{s'\sim P_h(\cdot\mid s,a)}V_{h+1}^\pi(s') - \mathbb{E}_{s'\sim \tilde P_h(\cdot\mid s,a)}\tilde V_{h+1}^\pi(s')\Big] \\
&= \mathbb{E}_{a\sim\pi(\cdot\mid s)}\Big[\mathbb{E}_{s'\sim P_h(\cdot\mid s,a)}\big(V_{h+1}^\pi(s') - \tilde V_{h+1}^\pi(s')\big)\Big] + \mathbb{E}_{a\sim\pi(\cdot\mid s)}\Big[\mathbb{E}_{s'\sim P_h(\cdot\mid s,a)}\tilde V_{h+1}^\pi(s') - \mathbb{E}_{s'\sim \tilde P_h(\cdot\mid s,a)}\tilde V_{h+1}^\pi(s')\Big],
\end{aligned}$$
where the second equality adds and subtracts $\mathbb{E}_{s'\sim P_h(\cdot\mid s,a)}\tilde V_{h+1}^\pi(s')$.
Then we unroll it, starting from $h=0$:
$$\begin{aligned}
\Delta_0(s_0) &= V_0^\pi(s_0) - \tilde V_0^\pi(s_0) \\
&= \mathbb{E}_{a_0\sim\pi(\cdot\mid s_0)}\Big[\mathbb{E}_{s_1\sim P_0(\cdot\mid s_0,a_0)}\tilde V_1^\pi(s_1) - \mathbb{E}_{s_1\sim \tilde P_0(\cdot\mid s_0,a_0)}\tilde V_1^\pi(s_1)\Big] + \mathbb{E}_{a_0 \sim \pi}\,\mathbb{E}_{s_1\sim P_0}\, \Delta_1(s_1),
\end{aligned}$$
and repeat the same decomposition for $\Delta_1, \Delta_2, \dots$, with $\Delta_H = 0$.
Unrolling to the end:
$$V^\pi-\tilde V^\pi = \sum_{h=0}^{H-1} \mathbb{E}_{(s_h,a_h)\sim P_h^\pi}\Big[ \mathbb{E}_{s'\sim P_h(\cdot\mid s_h,a_h)}\tilde V_{h+1}^\pi(s') - \mathbb{E}_{s'\sim \tilde P_h(\cdot\mid s_h,a_h)}\tilde V_{h+1}^\pi(s') \Big],$$
where $P_h^\pi$ is the distribution of $(s_h,a_h)$ under $\pi$ and the true dynamics $P$.
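The identity can be verified exactly on a small random finite-horizon MDP (hypothetical $P$, $\tilde P$, $\pi$; values by backward induction, occupancies by a forward pass under the true model):

```python
import numpy as np

rng = np.random.default_rng(3)
nS, nA, H = 4, 2, 5
r = rng.random((nS, nA))

def rand_model():
    P = rng.random((H, nS, nA, nS))
    return P / P.sum(axis=3, keepdims=True)

P, Pt = rand_model(), rand_model()                      # true P_h and model \tilde P_h
pi = rng.random((H, nS, nA)); pi /= pi.sum(axis=2, keepdims=True)

def values(Pm):
    """Backward induction: V_h(s) under pi with transitions Pm."""
    V = np.zeros((H + 1, nS))
    for h in range(H - 1, -1, -1):
        V[h] = (pi[h] * (r + Pm[h] @ V[h + 1])).sum(axis=1)
    return V

V, Vt = values(P), values(Pt)

# Occupancy of (s_h, a_h) under pi in the TRUE model, started from s0 = 0.
s0 = 0
ds = np.zeros(nS); ds[s0] = 1.0
rhs = 0.0
for h in range(H):
    d_sa = ds[:, None] * pi[h]                          # occupancy of (s_h, a_h)
    rhs += (d_sa * ((P[h] - Pt[h]) @ Vt[h + 1])).sum()  # model-error term at step h
    ds = np.einsum('sa,sat->t', d_sa, P[h])             # advance under the true model

lhs = V[0, s0] - Vt[0, s0]
err = abs(lhs - rhs)                                    # identity holds exactly
```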
First of all, we can decompose the difference as:
$$V^{\pi^*}-V^{\tilde\pi^*} = (V^{\pi^*}-\tilde V^{\pi^*}) + (\tilde V^{\pi^*}-\tilde V^{\tilde\pi^*}) + (\tilde V^{\tilde\pi^*}-V^{\tilde\pi^*})$$
We set:
$$A = V^{\pi^*}-\tilde V^{\pi^*}, \quad B = \tilde V^{\pi^*}-\tilde V^{\tilde\pi^*}, \quad C = \tilde V^{\tilde\pi^*}-V^{\tilde\pi^*}$$
Since $\tilde \pi^*$ is the optimal policy in $\tilde M$, we have $\tilde V^{\pi^*} \le \tilde V^{\tilde\pi^*}$, i.e. $B \le 0$. So:
$$V^{\pi^*}-V^{\tilde\pi^*} \le A + C$$
By part (a), applied to $\pi^*$:
$$A = \sum_{h=0}^{H-1} \mathbb{E}_{(s,a)\sim P_h^{\pi^*}} \Big[ \mathbb{E}_{s'\sim P_h(\cdot\mid s,a)}\tilde V_{h+1}^{\pi^*}(s') - \mathbb{E}_{s'\sim \tilde P_h(\cdot\mid s,a)}\tilde V_{h+1}^{\pi^*}(s') \Big].$$
Same for $C$:
$$C = \sum_{h=0}^{H-1} \mathbb{E}_{(s,a)\sim P_h^{\tilde\pi^*}} \Big[ \mathbb{E}_{s'\sim \tilde P_h(\cdot\mid s,a)}\tilde V_{h+1}^{\tilde\pi^*}(s') - \mathbb{E}_{s'\sim P_h(\cdot\mid s,a)}\tilde V_{h+1}^{\tilde\pi^*}(s') \Big].$$
Since $\lvert \mathbb{E}X \rvert \le \mathbb{E}\lvert X \rvert$, we have:
$$A \le |A| \le \sum_{h}\mathbb{E}_{(s,a)\sim P_h^{\pi^*}} \Big|\,\mathbb{E}_{P_h}\tilde V_{h+1}^{\pi^*}-\mathbb{E}_{\tilde P_h}\tilde V_{h+1}^{\pi^*}\Big|.$$
Expanding the difference of the two expectations as a sum over $s'$:
$$\begin{aligned}
\Big|\,\mathbb{E}_{P_h}\tilde V_{h+1}^{\pi^*}-\mathbb{E}_{\tilde P_h}\tilde V_{h+1}^{\pi^*}\Big|
&= \Big| \sum_{s'}\big(P_h(s'\mid s,a)-\tilde P_h(s'\mid s,a)\big)\tilde V_{h+1}^{\pi^*}(s') \Big| \\
&\le \sum_{s'}\big|P_h(s'\mid s,a)-\tilde P_h(s'\mid s,a)\big| \cdot \big|\tilde V_{h+1}^{\pi^*}(s')\big| \\
&\le \Big(\max_{s'}\big|\tilde V_{h+1}^{\pi^*}(s')\big|\Big) \sum_{s'}\big|P_h(s'\mid s,a)-\tilde P_h(s'\mid s,a)\big| \\
&= \lVert \tilde V_{h+1}^{\pi^*} \rVert_\infty \,\big\lVert P_h(\cdot\mid s,a)-\tilde P_h(\cdot\mid s,a)\big\rVert_1
\end{aligned}$$
Since the per-step reward satisfies $r(s,a)\in[0,1]$ and at most $H-(h+1)$ steps remain from time $h+1$:
$$\lVert \tilde V_{h+1}^{\pi^*} \rVert_\infty \le H-(h+1) \le H$$
So, back to $A$:
$$A \le H \sum_{h=0}^{H-1} \mathbb{E}_{(s,a)\sim P_h^{\pi^*}} \big\lVert P_h(\cdot\mid s,a)-\tilde P_h(\cdot\mid s,a)\big\rVert_1$$
Same for $C$:
$$C \le H \sum_{h=0}^{H-1} \mathbb{E}_{(s,a)\sim P_h^{\tilde\pi^*}} \big\lVert P_h(\cdot\mid s,a)-\tilde P_h(\cdot\mid s,a)\big\rVert_1$$
Finally, we have:
$$\begin{aligned}
V^{\pi^*}-V^{\tilde\pi^*} &\le A+C \\
&\le H\sum_{h=0}^{H-1}\Big[ \mathbb{E}_{(s,a)\sim P_h^{\pi^*}}\|\tilde P_h(\cdot\mid s,a)-P_h(\cdot\mid s,a)\|_1 + \mathbb{E}_{(s,a)\sim P_h^{\tilde\pi^*}}\|\tilde P_h(\cdot\mid s,a)-P_h(\cdot\mid s,a)\|_1 \Big].
\end{aligned}$$
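This final bound can also be spot-checked numerically. The sketch below assumes hypothetical random models; $\pi^*$ and $\tilde\pi^*$ come from backward induction, and occupancies are taken under the true model:

```python
import numpy as np

rng = np.random.default_rng(4)
nS, nA, H = 4, 2, 5
r = rng.random((nS, nA))                                # rewards in [0, 1]
P = rng.random((H, nS, nA, nS)); P /= P.sum(axis=3, keepdims=True)
Pt = P + 0.1 * rng.random((H, nS, nA, nS)); Pt /= Pt.sum(axis=3, keepdims=True)
idx = np.arange(nS)

def opt_policy(Pm):
    """Backward induction: deterministic optimal policy pi[h, s] in model Pm."""
    V = np.zeros((H + 1, nS)); pi = np.zeros((H, nS), dtype=int)
    for h in range(H - 1, -1, -1):
        Q = r + Pm[h] @ V[h + 1]
        pi[h], V[h] = np.argmax(Q, axis=1), np.max(Q, axis=1)
    return pi

def evaluate(pi):
    """Value of pi in the TRUE model, plus state occupancies d_h from s0 = 0."""
    V = np.zeros((H + 1, nS))
    for h in range(H - 1, -1, -1):
        V[h] = r[idx, pi[h]] + P[h][idx, pi[h]] @ V[h + 1]
    d = np.zeros((H, nS)); ds = np.zeros(nS); ds[0] = 1.0
    for h in range(H):
        d[h] = ds
        ds = ds @ P[h][idx, pi[h]]
    return V, d

pi_s, pit_s = opt_policy(P), opt_policy(Pt)             # pi* and tilde-pi*
V_s, d_s = evaluate(pi_s)
V_t, d_t = evaluate(pit_s)
l1 = np.abs(P - Pt).sum(axis=3)                         # ||P_h(.|s,a) - tilde P_h(.|s,a)||_1
bound = H * sum((d_s[h] * l1[h, idx, pi_s[h]]).sum()
                + (d_t[h] * l1[h, idx, pit_s[h]]).sum() for h in range(H))
gap = V_s[0, 0] - V_t[0, 0]                             # suboptimality of tilde-pi* in M
```

Here `gap` is nonnegative (since $\pi^*$ is optimal in $M$) and sits below `bound`, as the lemma requires.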