
Homework 1

Yuanjun Feng (528668)

1. MDPs

1.1

First of all, we have the value iteration formula:

$$V_{k+1}(s) = r(s) + \gamma \cdot \max_{a\in\{\text{Run},\text{Fight}\}} \sum_{s'} P(s'\mid s,a)\, V_k(s').$$

where:

  • $r(s)$: O = +2, D = -1, C = -4
  • $\gamma = 0.9$
  • Initialization: $J^1(O)=2$, $J^1(D)=-1$, $J^1(C)=-4$

Then, computing $V_2$ from $V_1$, we evaluate each state.

$V_2(O)$:

  • Run: $2 + 0.9\,[0.6\cdot 2 + 0.4\cdot(-1)] = 2 + 0.9\cdot 0.8 = 2.72$
  • Fight: $2 + 0.9\,[0.75\cdot 2 + 0.25\cdot(-4)] = 2 + 0.9\cdot 0.5 = 2.45$

We take the max of the two values, so we have:

$V_2(O) = 2.72$

$V_2(D)$:

  • Run: $-1 + 0.9\,[0.4\cdot 2 + 0.3\cdot(-1) + 0.3\cdot(-4)] = -1 + 0.9\cdot(-0.7) = -1.63$
  • Fight: $-1 + 0.9\,[0.3\cdot 2 + 0.25\cdot(-1) + 0.45\cdot(-4)] = -1 + 0.9\cdot(-1.45) = -2.305$

We take the max of the two values, so we have:

$V_2(D) = -1.63$

$V_2(C)$:

  • Run: $-4 + 0.9\,[0.6\cdot(-1) + 0.4\cdot(-4)] = -4 + 0.9\cdot(-2.2) = -5.98$
  • Fight: $-4 + 0.9\,[0.2\cdot 2 + 0.8\cdot(-4)] = -4 + 0.9\cdot(-2.8) = -6.52$

We take the max of the two values, so we have:

$V_2(C) = -5.98$

For $V_3$ from $V_2$:

$V_3(O)$:

  • Run: $2 + 0.9\,[0.6\cdot 2.72 + 0.4\cdot(-1.63)] = 2 + 0.9\cdot 0.98 = 2.882$
  • Fight: $2 + 0.9\,[0.75\cdot 2.72 + 0.25\cdot(-5.98)] = 2 + 0.9\cdot 0.545 = 2.4905$

We take the max of the two values, so we have:

$V_3(O) = 2.882$

$V_3(D)$:

  • Run: $-1 + 0.9\,[0.4\cdot 2.72 + 0.3\cdot(-1.63) + 0.3\cdot(-5.98)] = -1 + 0.9\cdot(-1.195) = -2.0755$
  • Fight: $-1 + 0.9\,[0.3\cdot 2.72 + 0.25\cdot(-1.63) + 0.45\cdot(-5.98)] = -1 + 0.9\cdot(-2.2825) = -3.05425$

We take the max of the two values, so we have:

$V_3(D) = -2.0755$

$V_3(C)$:

  • Run: $-4 + 0.9\,[0.6\cdot(-1.63) + 0.4\cdot(-5.98)] = -4 + 0.9\cdot(-3.370) = -7.033$
  • Fight: $-4 + 0.9\,[0.2\cdot 2.72 + 0.8\cdot(-5.98)] = -4 + 0.9\cdot(-4.24) = -7.816$

We take the max of the two values, so we have:

$V_3(C) = -7.033$

Finally, we can fill in the table:

| $k$ | $J^k(O)$ | $J^k(D)$ | $J^k(C)$ |
|-----|----------|----------|----------|
| 1   | 2        | -1       | -4       |
| 2   | 2.72     | -1.63    | -5.98    |
| 3   | 2.882    | -2.0755  | -7.033   |
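As a sanity check, the iteration can be reproduced in code. The sketch below (a minimal Python version, with the transition probabilities transcribed from the calculations above) runs two Bellman backups starting from $J^1$:

```python
# Value iteration for the 3-state MDP above (states O, D, C; actions Run, Fight).
# Transition probabilities are transcribed from the worked calculations.
gamma = 0.9
r = {"O": 2.0, "D": -1.0, "C": -4.0}
P = {  # P[s][a] = {s': prob}
    "O": {"Run":   {"O": 0.6,  "D": 0.4},
          "Fight": {"O": 0.75, "C": 0.25}},
    "D": {"Run":   {"O": 0.4,  "D": 0.3,  "C": 0.3},
          "Fight": {"O": 0.3,  "D": 0.25, "C": 0.45}},
    "C": {"Run":   {"D": 0.6,  "C": 0.4},
          "Fight": {"O": 0.2,  "C": 0.8}},
}

def bellman_update(V):
    """One sweep of V_{k+1}(s) = r(s) + gamma * max_a sum_{s'} P(s'|s,a) V(s')."""
    return {s: r[s] + gamma * max(sum(p * V[t] for t, p in P[s][a].items())
                                  for a in P[s])
            for s in P}

V1 = dict(r)                 # J^1 = immediate rewards
V2 = bellman_update(V1)      # matches the J^2 row (up to float rounding)
V3 = bellman_update(V2)      # matches the J^3 row (up to float rounding)
```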

1.2

For both $k = 2$ and $k = 3$, I would follow the greedy policy in states $O$, $D$, and $C$.
After calculation, the greedy policy with respect to $V_k$ selects Run in all three states for both $k = 2$ and $k = 3$, since the Run backup is the larger of the two in every case above. However, these choices are not guaranteed to be optimal: a policy that is greedy with respect to $V_k$ after finitely many iterations may differ from the policy that is greedy with respect to the converged $V^*$, and only the latter is guaranteed optimal.
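The greedy choices can be read off mechanically. A short sketch (reusing the transition model from 1.1; the reward term $r(s)$ is action-independent here, so it does not affect the argmax):

```python
gamma = 0.9
P = {  # P[s][a] = {s': prob}, as in part 1.1
    "O": {"Run":   {"O": 0.6,  "D": 0.4},
          "Fight": {"O": 0.75, "C": 0.25}},
    "D": {"Run":   {"O": 0.4,  "D": 0.3,  "C": 0.3},
          "Fight": {"O": 0.3,  "D": 0.25, "C": 0.45}},
    "C": {"Run":   {"D": 0.6,  "C": 0.4},
          "Fight": {"O": 0.2,  "C": 0.8}},
}

def greedy(V):
    # argmax_a sum_{s'} P(s'|s,a) V(s') in every state
    return {s: max(P[s], key=lambda a: sum(p * V[t] for t, p in P[s][a].items()))
            for s in P}

pi_k2 = greedy({"O": 2.72,  "D": -1.63,   "C": -5.98})   # greedy w.r.t. J^2
pi_k3 = greedy({"O": 2.882, "D": -2.0755, "C": -7.033})  # greedy w.r.t. J^3
# both select "Run" in every state
```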

2. Value Iteration Theorem

a.

Recall the Bellman operator under policy $\pi$:

$$(B_\pi V)(s)=\mathbb{E}_{a\sim\pi(\cdot\mid s)}\Big[R(s,a)+\gamma\sum_{s'}p(s'\mid s,a)\,V(s')\Big].$$

For any state $s$, the rewards cancel between the two evaluations:

$$\begin{aligned} \big|(B_\pi V)(s) - (B_\pi V')(s)\big| &= \Big|\mathbb{E}_{a\sim\pi}\Big[R(s,a)+\gamma\sum_{s'}p(s'\mid s,a)V(s')\Big] - \mathbb{E}_{a\sim\pi}\Big[R(s,a)+\gamma\sum_{s'}p(s'\mid s,a)V'(s')\Big]\Big| \\ &= \Big|\gamma\,\mathbb{E}_{a\sim\pi}\sum_{s'}p(s'\mid s,a)\big(V(s')-V'(s')\big)\Big|. \end{aligned}$$

Then we pull out $\gamma$, apply $|\mathbb{E}[X]|\le \mathbb{E}[|X|]$, and bound each $|V(s')-V'(s')|$ by the sup norm:

$$\begin{aligned} \big|(B_\pi V)(s) - (B_\pi V')(s)\big| &\le \gamma\,\mathbb{E}_{a\sim\pi}\sum_{s'}p(s'\mid s,a)\,\big|V(s')-V'(s')\big| \\ &\le \gamma\,\mathbb{E}_{a\sim\pi}\sum_{s'}p(s'\mid s,a)\,\lVert V-V'\rVert_\infty. \end{aligned}$$

Using $\sum_{s'}p(s'\mid s,a)=1$, the right-hand side equals $\gamma\,\lVert V-V'\rVert_\infty$. Since this holds for every $s$, taking the maximum over $s$ gives:

$$\lVert B_\pi V - B_\pi V'\rVert_\infty \le \gamma\,\lVert V-V'\rVert_\infty.$$

b.

From (a) we have

$$\|B_\pi V - B_\pi V'\|_\infty \le \gamma \|V - V'\|_\infty.$$

So $B_\pi$ is a $\gamma$-contraction on the complete metric space $(\mathbb{R}^{|\mathcal{S}|}, \|\cdot\|_\infty)$.

By the Banach fixed-point theorem, there exists a unique fixed point $\bar V$ such that $\bar V = B_\pi \bar V$.
And for any initial value $V_0$, the iteration $V_{t+1}=B_\pi V_t$ converges to $\bar V$ at a geometric rate.

On the other hand, we have the Bellman equation for policy $\pi$:

$$V^\pi = B_\pi V^\pi.$$

Since the fixed point of $B_\pi$ is unique, we must have $\bar V = V^\pi$.

Therefore, the unique fixed point of $B_\pi$ is $V^\pi$.
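The contraction property can also be illustrated numerically. A sketch using the part-1 MDP with the fixed policy "always Run" (an assumption made only for this illustration): applying $B_\pi$ shrinks the sup-norm distance between two arbitrary value tables by at least a factor $\gamma$ per step.

```python
# B_pi is a gamma-contraction: check numerically on the part-1 MDP
# with the fixed policy "always Run".
gamma = 0.9
r = {"O": 2.0, "D": -1.0, "C": -4.0}
P_run = {"O": {"O": 0.6, "D": 0.4},            # P(.|s, Run)
         "D": {"O": 0.4, "D": 0.3, "C": 0.3},
         "C": {"D": 0.6, "C": 0.4}}

def B_pi(V):
    return {s: r[s] + gamma * sum(p * V[t] for t, p in P_run[s].items())
            for s in P_run}

def dist(U, V):
    return max(abs(U[s] - V[s]) for s in U)   # sup-norm distance

V = {"O": 5.0, "D": -3.0, "C": 0.0}           # two arbitrary value tables
W = {"O": 0.0, "D": 0.0, "C": 0.0}
for _ in range(5):
    assert dist(B_pi(V), B_pi(W)) <= gamma * dist(V, W) + 1e-12
    V, W = B_pi(V), B_pi(W)
```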

c.

The greedy policy is not guaranteed to be the optimal policy.

Let $\pi = \pi_V$ be the greedy policy with respect to $V$. By (b), its true value satisfies the Bellman equation:

$$V^{\pi_V} = B_{\pi_V} V^{\pi_V}.$$

However, $V$ itself satisfies this equation only when $V$ is the fixed point of $B_{\pi_V}$, that is, $V = B_{\pi_V} V$; for an arbitrary $V$, the function we were greedy with respect to is generally not the value of the resulting policy. Only when $V$ is the fixed point of the optimality operator, $V = BV = V^*$, is the greedy policy guaranteed to be optimal.

So in general, the greedy policy is not the optimal policy.

d.

When $\pi$ is the greedy policy with respect to $V_n$, the action chosen by $\pi$ in each state attains the maximum inside the Bellman optimality operator, so the two operators agree at $V_n$:

$$BV_n = B_\pi V_n$$
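This identity is easy to check numerically on the part-1 MDP (a sketch; $V_n$ below is an arbitrary value table chosen for illustration):

```python
# When pi is greedy w.r.t. V_n, B V_n and B_pi V_n coincide.
gamma = 0.9
r = {"O": 2.0, "D": -1.0, "C": -4.0}
P = {
    "O": {"Run":   {"O": 0.6,  "D": 0.4},
          "Fight": {"O": 0.75, "C": 0.25}},
    "D": {"Run":   {"O": 0.4,  "D": 0.3,  "C": 0.3},
          "Fight": {"O": 0.3,  "D": 0.25, "C": 0.45}},
    "C": {"Run":   {"D": 0.6,  "C": 0.4},
          "Fight": {"O": 0.2,  "C": 0.8}},
}

def B(V):       # Bellman optimality operator
    return {s: r[s] + gamma * max(sum(p * V[t] for t, p in P[s][a].items())
                                  for a in P[s]) for s in P}

def greedy(V):  # greedy (deterministic) policy w.r.t. V
    return {s: max(P[s], key=lambda a: sum(p * V[t] for t, p in P[s][a].items()))
            for s in P}

def B_pi(V, pi):  # Bellman operator for the fixed policy pi
    return {s: r[s] + gamma * sum(p * V[t] for t, p in P[s][pi[s]].items())
            for s in P}

V_n = {"O": 1.0, "D": -2.0, "C": -3.0}   # arbitrary value table
pi = greedy(V_n)
BV, BpiV = B(V_n), B_pi(V_n, pi)
assert all(abs(BV[s] - BpiV[s]) < 1e-12 for s in BV)
```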

e.

Since $V^\pi = B_\pi V^\pi$ by (b), and $V_{n+1} = BV_n = B_\pi V_n$ by (d):

$$\lVert V^\pi - V_{n+1} \rVert = \lVert B_\pi V^\pi - B_\pi V_n \rVert \le \gamma \lVert V^\pi - V_n \rVert.$$

Then we can introduce the triangle inequality:

$$\gamma \lVert V^\pi - V_n \rVert \le \gamma \lVert V^\pi - V_{n+1} \rVert + \gamma \lVert V_{n+1} - V_n \rVert.$$

So we have:

$$\begin{aligned} (1-\gamma) \lVert V^\pi - V_{n+1} \rVert &\le \gamma \lVert V_{n+1} - V_n \rVert \\ \lVert V^\pi - V_{n+1} \rVert &\le \frac{\gamma}{1-\gamma} \lVert V_{n+1} - V_n \rVert. \end{aligned}$$

Finally, we introduce the stopping condition $\lVert V_{n+1} - V_n \rVert < \frac{\epsilon(1-\gamma)}{2\gamma}$:

$$\lVert V^\pi - V_{n+1} \rVert < \frac{\epsilon}{2},$$

which in particular proves that:

$$\lVert V^\pi - V_{n+1} \rVert \le \frac{\epsilon}{2}.$$

f.

To prove that $\lVert B^kV - B^k V' \rVert \le \gamma^k \lVert V - V' \rVert$, we use induction on $k$.

Base case: $B$ is the Bellman optimality operator and a $\gamma$-contraction,

$$\lVert BV - BV' \rVert \le \gamma \lVert V - V' \rVert,$$

so the claim holds for $k = 1$.

Induction step: assume $\lVert B^{k}V - B^{k} V' \rVert \le \gamma^{k} \lVert V - V' \rVert$. Combining the two, we have for $k+1$:

$$\begin{aligned} \lVert B^{k+1}V - B^{k+1} V' \rVert &= \lVert B(B^kV) - B(B^k V') \rVert \\ &\le \gamma \lVert B^kV - B^k V' \rVert \\ &\le \gamma \cdot \gamma^k \lVert V - V' \rVert \\ &= \gamma^{k+1} \lVert V - V' \rVert. \end{aligned}$$

So we have proved that $\lVert B^kV - B^k V' \rVert \le \gamma^k \lVert V - V' \rVert$ for all $k \ge 1$.
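The $\gamma^k$ rate can be observed directly. A sketch on the part-1 MDP, comparing the iterates of $B$ from two starting value tables:

```python
# ||B^k V - B^k V'|| <= gamma^k ||V - V'||: numerical check on the part-1 MDP.
gamma = 0.9
r = {"O": 2.0, "D": -1.0, "C": -4.0}
P = {
    "O": {"Run":   {"O": 0.6,  "D": 0.4},
          "Fight": {"O": 0.75, "C": 0.25}},
    "D": {"Run":   {"O": 0.4,  "D": 0.3,  "C": 0.3},
          "Fight": {"O": 0.3,  "D": 0.25, "C": 0.45}},
    "C": {"Run":   {"D": 0.6,  "C": 0.4},
          "Fight": {"O": 0.2,  "C": 0.8}},
}

def B(V):  # Bellman optimality operator
    return {s: r[s] + gamma * max(sum(p * V[t] for t, p in P[s][a].items())
                                  for a in P[s]) for s in P}

def dist(U, V):
    return max(abs(U[s] - V[s]) for s in U)

V = {"O": 3.0, "D": 0.0, "C": -8.0}   # two arbitrary starting tables
W = {"O": 0.0, "D": 0.0, "C": 0.0}
d0 = dist(V, W)                        # initial sup-norm distance (= 8.0)
for k in range(1, 6):
    V, W = B(V), B(W)
    assert dist(V, W) <= gamma**k * d0 + 1e-12
```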

g.

Method 1 - Triangle Inequality

We can reuse the argument from (e), now using $V^* = BV^*$ and $V_{n+1} = BV_n$:

$$\begin{aligned} \lVert V^* - V_{n+1} \rVert &= \lVert BV^* - BV_{n} \rVert \\ &\le \gamma \lVert V^* - V_{n} \rVert \\ &\le \gamma \lVert V^* - V_{n+1} \rVert + \gamma \lVert V_{n+1} - V_n \rVert. \end{aligned}$$

$$\begin{aligned} (1-\gamma) \lVert V^* - V_{n+1} \rVert &\le \gamma \lVert V_{n+1} - V_n \rVert \\ \lVert V^* - V_{n+1} \rVert &\le \frac{\gamma}{1-\gamma} \lVert V_{n+1} - V_n \rVert < \frac{\gamma}{1-\gamma} \cdot \frac{\epsilon(1-\gamma)}{2\gamma} = \frac{\epsilon}{2}. \end{aligned}$$

So $\lVert V^* - V_{n+1} \rVert \le \frac{\epsilon}{2}$ holds.

Method 2 - Repeatedly Splitting

From the hint, we can repeatedly split the term $\lVert V^* - V_{n} \rVert$ into a telescoping sum:

$$\lVert V^* - V_{n} \rVert \le \sum_{i=0}^{\infty} \lVert V_{n+i+1} - V_{n+i} \rVert.$$

For each term, contracting $i$ times:

$$\begin{aligned} \lVert V_{n+i+1} - V_{n+i} \rVert &= \lVert B V_{n+i} - B V_{n+i-1} \rVert \\ &\le \gamma \lVert V_{n+i} - V_{n+i-1} \rVert \\ &\le \cdots \\ &\le \gamma^i \lVert V_{n+1} - V_n \rVert. \end{aligned}$$

So we have:

$$\lVert V^* - V_{n} \rVert \le \sum_{i=0}^{\infty} \gamma^i \lVert V_{n+1} - V_n \rVert = \frac{1}{1-\gamma} \lVert V_{n+1} - V_n \rVert.$$

Finally, for $\lVert V^* - V_{n+1} \rVert$, we have:

$$\begin{aligned} \lVert V^* - V_{n+1} \rVert &= \lVert BV^* - BV_{n} \rVert \\ &\le \gamma \lVert V^* - V_{n} \rVert \\ &\le \frac{\gamma}{1-\gamma} \lVert V_{n+1} - V_n \rVert. \end{aligned}$$

Then, just as in the previous method, we can introduce the stopping condition and obtain:

$$\lVert V^* - V_{n+1} \rVert \le \frac{\epsilon}{2}.$$
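The a-posteriori bound $\lVert V^* - V_{n+1} \rVert \le \frac{\gamma}{1-\gamma}\lVert V_{n+1} - V_n \rVert$ can be sanity-checked on the part-1 MDP (a sketch; $V^*$ is approximated by iterating far past convergence):

```python
# Check ||V* - V_{n+1}||_inf <= gamma/(1-gamma) * ||V_{n+1} - V_n||_inf
# at every step of value iteration on the part-1 MDP.
gamma = 0.9
r = {"O": 2.0, "D": -1.0, "C": -4.0}
P = {
    "O": {"Run":   {"O": 0.6,  "D": 0.4},
          "Fight": {"O": 0.75, "C": 0.25}},
    "D": {"Run":   {"O": 0.4,  "D": 0.3,  "C": 0.3},
          "Fight": {"O": 0.3,  "D": 0.25, "C": 0.45}},
    "C": {"Run":   {"D": 0.6,  "C": 0.4},
          "Fight": {"O": 0.2,  "C": 0.8}},
}

def B(V):
    return {s: r[s] + gamma * max(sum(p * V[t] for t, p in P[s][a].items())
                                  for a in P[s]) for s in P}

def dist(U, V):
    return max(abs(U[s] - V[s]) for s in U)

Vstar = {s: 0.0 for s in P}     # approximate V* (gamma^2000 is negligible)
for _ in range(2000):
    Vstar = B(Vstar)

V = {s: 0.0 for s in P}
for _ in range(50):
    V_next = B(V)
    assert dist(Vstar, V_next) <= gamma / (1 - gamma) * dist(V_next, V) + 1e-9
    V = V_next
```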

h.

$$\begin{aligned} \lVert V^\pi - V^* \rVert &\le \lVert V^\pi - V_{n+1} \rVert + \lVert V_{n+1} - V^* \rVert \\ &= \lVert V^\pi - V_{n+1} \rVert + \lVert V^* - V_{n+1} \rVert \\ &\le \frac{\epsilon}{2} + \frac{\epsilon}{2} = \epsilon. \end{aligned}$$

So we have $\lVert V^\pi - V^* \rVert \le \epsilon$.

3. Simulation Lemma and Model-based Learning

a.

For a finite-horizon MDP, we have the value functions, with expectations over trajectories generated by $\pi$ under the true dynamics $P$ and the approximate dynamics $\tilde P$, respectively:

$$\begin{aligned} V_h^\pi(s) &= \mathbb{E}_{\pi,P}\Big[\sum_{t=h}^{H-1} r(s_t, a_t) \,\Big|\, s_h = s\Big],\\ \tilde V_h^\pi(s) &= \mathbb{E}_{\pi,\tilde P}\Big[\sum_{t=h}^{H-1} r(s_t, a_t) \,\Big|\, s_h = s\Big]. \end{aligned}$$

So the Bellman forms are:

$$\begin{aligned} V_h^\pi(s) &= \mathbb{E}_{a\sim\pi(\cdot\mid s)}\Big[r(s,a) + \mathbb{E}_{s'\sim P_h(\cdot\mid s,a)}V_{h+1}^\pi(s')\Big],\\ \tilde V_h^\pi(s) &= \mathbb{E}_{a\sim\pi(\cdot\mid s)}\Big[r(s,a) + \mathbb{E}_{s'\sim \tilde P_h(\cdot\mid s,a)}\tilde V_{h+1}^\pi(s')\Big]. \end{aligned}$$

So we can define the difference, adding and subtracting $\mathbb{E}_{s'\sim P_h(\cdot\mid s,a)}\tilde V_{h+1}^\pi(s')$:

$$\begin{aligned} \Delta_h(s) &= V_h^\pi(s) - \tilde V_h^\pi(s) \\ &= \mathbb{E}_{a\sim\pi(\cdot\mid s)}\Big[\mathbb{E}_{s'\sim P_h(\cdot\mid s,a)}V_{h+1}^\pi(s') - \mathbb{E}_{s'\sim \tilde P_h(\cdot\mid s,a)}\tilde V_{h+1}^\pi(s')\Big] \\ &= \mathbb{E}_{a\sim\pi(\cdot\mid s)}\Big[\mathbb{E}_{s'\sim P_h(\cdot\mid s,a)}\big(V_{h+1}^\pi(s') - \tilde V_{h+1}^\pi(s')\big)\Big] + \mathbb{E}_{a\sim\pi(\cdot\mid s)}\Big[\mathbb{E}_{s'\sim P_h(\cdot\mid s,a)}\tilde V_{h+1}^\pi(s') - \mathbb{E}_{s'\sim \tilde P_h(\cdot\mid s,a)}\tilde V_{h+1}^\pi(s')\Big]. \end{aligned}$$

Then we can unroll it, starting from $h=0$:

$$\Delta_0(s_0) = \mathbb{E}_{a_0\sim\pi(\cdot\mid s_0)}\Big[\mathbb{E}_{s_1\sim P_0(\cdot\mid s_0,a_0)}\tilde V_1^\pi(s_1) - \mathbb{E}_{s_1\sim \tilde P_0(\cdot\mid s_0,a_0)}\tilde V_1^\pi(s_1)\Big] + \mathbb{E}_{a_0\sim\pi}\,\mathbb{E}_{s_1\sim P_0}\,\Delta_1(s_1).$$

Then, unrolling to the end, with $(s_h,a_h)$ distributed according to $P_h^\pi$, the distribution induced by $\pi$ under the true dynamics:

$$V^\pi-\tilde V^\pi =\sum_{h=0}^{H-1} \mathbb{E}_{(s_h,a_h)\sim P_h^\pi}\Big[ \mathbb{E}_{s'\sim P_h(\cdot\mid s_h,a_h)}\tilde V_{h+1}^\pi(s') -\mathbb{E}_{s'\sim \tilde P_h(\cdot\mid s_h,a_h)}\tilde V_{h+1}^\pi(s') \Big].$$
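The identity can be verified numerically. A sketch on a tiny 2-state, 2-action, horizon-3 MDP with a uniform policy; all rewards and transition numbers below are illustrative assumptions, not taken from the problem statement:

```python
# Simulation lemma, numerical check: V^pi - V~^pi equals the occupancy-weighted
# sum of (E_P - E_P~) applied to V~_{h+1}.
H = 3
S, A = [0, 1], [0, 1]
r = {(0, 0): 1.0, (0, 1): 0.5, (1, 0): 0.0, (1, 1): 0.8}
pi = {s: {0: 0.5, 1: 0.5} for s in S}           # uniform policy pi(a|s)
P  = {(0, 0): [0.7, 0.3], (0, 1): [0.2, 0.8],   # true dynamics P
      (1, 0): [0.5, 0.5], (1, 1): [0.9, 0.1]}
Pt = {(0, 0): [0.6, 0.4], (0, 1): [0.3, 0.7],   # model dynamics P~
      (1, 0): [0.4, 0.6], (1, 1): [0.8, 0.2]}

def values(model):
    # backward induction: V_h^pi(s) under the given dynamics, with V_H = 0
    V = {H: {s: 0.0 for s in S}}
    for h in range(H - 1, -1, -1):
        V[h] = {s: sum(pi[s][a] * (r[s, a] + sum(model[s, a][t] * V[h + 1][t]
                                                 for t in S))
                       for a in A) for s in S}
    return V

V, Vt = values(P), values(Pt)

# occupancy of (s_h, a_h) under the TRUE dynamics P, starting from s_0 = 0
d, occ = {0: 1.0, 1: 0.0}, []
for h in range(H):
    occ.append({(s, a): d[s] * pi[s][a] for s in S for a in A})
    d = {t: sum(occ[h][s, a] * P[s, a][t] for s in S for a in A) for t in S}

lhs = V[0][0] - Vt[0][0]
rhs = sum(occ[h][s, a] * sum((P[s, a][t] - Pt[s, a][t]) * Vt[h + 1][t]
                             for t in S)
          for h in range(H) for s in S for a in A)
assert abs(lhs - rhs) < 1e-12   # the two sides agree exactly
```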

b.

First of all, we can decompose the difference as:

$$V^{\pi^*}-V^{\tilde\pi^*} =(V^{\pi^*}-\tilde V^{\pi^*})+(\tilde V^{\pi^*}-\tilde V^{\tilde\pi^*}) + (\tilde V^{\tilde\pi^*}-V^{\tilde\pi^*}).$$

We set:

$$A = V^{\pi^*}-\tilde V^{\pi^*}, \quad B = \tilde V^{\pi^*}-\tilde V^{\tilde\pi^*}, \quad C = \tilde V^{\tilde\pi^*}-V^{\tilde\pi^*}.$$

Since $\tilde \pi^*$ is the optimal policy in $\tilde M$, we have $\tilde V^{\pi^*} \le \tilde V^{\tilde\pi^*}$, so $B \le 0$.

So we have:

$$V^{\pi^*}-V^{\tilde\pi^*} \le A + C.$$

By part (a), applied to the policy $\pi^*$:

$$A = \sum_{h=0}^{H-1} \mathbb{E}_{(s,a)\sim P_h^{\pi^*}} \Big[ \mathbb{E}_{s'\sim P_h(\cdot\mid s,a)}\tilde V_{h+1}^{\pi^*}(s') - \mathbb{E}_{s'\sim \tilde P_h(\cdot\mid s,a)}\tilde V_{h+1}^{\pi^*}(s') \Big].$$

Similarly for $C$, applied to $\tilde\pi^*$ with the sign flipped:

$$C = \sum_{h=0}^{H-1} \mathbb{E}_{(s,a)\sim P_h^{\tilde\pi^*}} \Big[ \mathbb{E}_{s'\sim \tilde P_h(\cdot\mid s,a)}\tilde V_{h+1}^{\tilde\pi^*}(s') - \mathbb{E}_{s'\sim P_h(\cdot\mid s,a)}\tilde V_{h+1}^{\tilde\pi^*}(s') \Big].$$

As $\lvert \mathbb{E}X \rvert \le \mathbb{E}\lvert X \rvert$, we have:

$$A \le |A| \le \sum_{h}\mathbb{E}_{(s,a)\sim P_h^{\pi^*}} \Big|\,\mathbb{E}_{P_h}\tilde V_{h+1}^{\pi^*}-\mathbb{E}_{\tilde P_h}\tilde V_{h+1}^{\pi^*}\,\Big|.$$

Expanding the difference of the two expectations as a sum over $s'$:

$$\begin{aligned} \Big|\,\mathbb{E}_{P_h}\tilde V_{h+1}^{\pi^*}-\mathbb{E}_{\tilde P_h}\tilde V_{h+1}^{\pi^*}\,\Big| &= \Big| \sum_{s'}\big(P_h(s'\mid s,a)-\tilde P_h(s'\mid s,a)\big)\tilde V_{h+1}^{\pi^*}(s') \Big| \\ &\le \sum_{s'}\big|P_h(s'\mid s,a)-\tilde P_h(s'\mid s,a)\big| \cdot \big| \tilde V_{h+1}^{\pi^*}(s') \big| \\ &\le \Big(\max_{s'}\big| \tilde V_{h+1}^{\pi^*}(s') \big|\Big) \sum_{s'}\big|P_h(s'\mid s,a)-\tilde P_h(s'\mid s,a)\big| \\ &= \lVert \tilde V_{h+1}^{\pi^*} \rVert_\infty \,\big\lVert P_h(\cdot\mid s,a)-\tilde P_h(\cdot\mid s,a)\big\rVert_1. \end{aligned}$$

As we have assumed that the per-step reward $r(s,a)\in[0,1]$, and starting from time $h+1$ there are at most $H-(h+1)$ steps remaining, we have:

$$\lVert \tilde V_{h+1}^{\pi^*} \rVert_\infty \le H-(h+1) \le H.$$

So, back to $A$, we have:

$$A \le H \sum_{h=0}^{H-1} \mathbb{E}_{(s,a)\sim P_h^{\pi^*}} \big\lVert P_h(\cdot\mid s,a)-\tilde P_h(\cdot\mid s,a)\big\rVert_1.$$

Same for $C$:

$$C \le H \sum_{h=0}^{H-1} \mathbb{E}_{(s,a)\sim P_h^{\tilde\pi^*}} \big\lVert P_h(\cdot\mid s,a)-\tilde P_h(\cdot\mid s,a)\big\rVert_1.$$

Finally, we have:

$$\begin{aligned} V^{\pi^*}-V^{\tilde\pi^*} &\le A+C \\ &\le H\sum_{h=0}^{H-1}\Big[ \mathbb{E}_{(s,a)\sim P_h^{\pi^*}}\|\tilde P_h(\cdot\mid s,a)-P_h(\cdot\mid s,a)\|_1 + \mathbb{E}_{(s,a)\sim P_h^{\tilde\pi^*}}\|\tilde P_h(\cdot\mid s,a)-P_h(\cdot\mid s,a)\|_1 \Big]. \end{aligned}$$
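The final bound can also be sanity-checked numerically. A sketch on the same small horizon-3 MDP used before (all numbers are illustrative assumptions, with $r(s,a)\in[0,1]$); note the gap on the left may well be zero when the two models induce the same optimal policy:

```python
# Check V^{pi*} - V^{pi~*} <= H * sum_h [E_{P^pi*}||P-P~||_1 + E_{P^pi~*}||P-P~||_1].
H = 3
S, A = [0, 1], [0, 1]
r = {(0, 0): 1.0, (0, 1): 0.5, (1, 0): 0.0, (1, 1): 0.8}
P  = {(0, 0): [0.7, 0.3], (0, 1): [0.2, 0.8],   # true dynamics
      (1, 0): [0.5, 0.5], (1, 1): [0.9, 0.1]}
Pt = {(0, 0): [0.6, 0.4], (0, 1): [0.3, 0.7],   # approximate dynamics
      (1, 0): [0.4, 0.6], (1, 1): [0.8, 0.2]}

def solve(model):
    # backward induction: a deterministic optimal policy {(h, s): a}
    V, pi = {s: 0.0 for s in S}, {}
    for h in range(H - 1, -1, -1):
        Q = {(s, a): r[s, a] + sum(model[s, a][t] * V[t] for t in S)
             for s in S for a in A}
        for s in S:
            pi[h, s] = max(A, key=lambda a: Q[s, a])
        V = {s: Q[s, pi[h, s]] for s in S}
    return pi

def evaluate(pi, model):
    # value of policy pi under `model`, starting from state 0
    V = {s: 0.0 for s in S}
    for h in range(H - 1, -1, -1):
        V = {s: r[s, pi[h, s]] + sum(model[s, pi[h, s]][t] * V[t] for t in S)
             for s in S}
    return V[0]

def tv_sum(pi):
    # sum over h of E_{(s,a) ~ P_h^pi} ||P_h - P~_h||_1, occupancies under P
    d, total = {0: 1.0, 1: 0.0}, 0.0
    for h in range(H):
        total += sum(d[s] * sum(abs(P[s, pi[h, s]][t] - Pt[s, pi[h, s]][t])
                                for t in S) for s in S)
        d = {t: sum(d[s] * P[s, pi[h, s]][t] for s in S) for t in S}
    return total

pi_star, pi_tilde = solve(P), solve(Pt)              # optimal in M and M~
lhs = evaluate(pi_star, P) - evaluate(pi_tilde, P)   # performance gap in M
rhs = H * (tv_sum(pi_star) + tv_sum(pi_tilde))
assert 0 <= lhs <= rhs + 1e-12
```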