Yuanjun Feng 528668 Homework 1
1. MDPs
1.1
First of all, we have the value iteration formula:
$$V_{k+1}(s) = r(s) + \gamma \cdot \max_{a\in\{\text{Run},\,\text{Fight}\}} \sum_{s'} P(s'\mid s,a)\, V_k(s').$$
where
$r(s)$: O = +2, D = −1, C = −4
$\gamma = 0.9$
Initialization: $V_1(O) = 2$, $V_1(D) = -1$, $V_1(C) = -4$
Then, computing $V_2$ from $V_1$, we evaluate each state:

$V_2(O)$:
Run: $2 + 0.9\,[0.6\cdot 2 + 0.4\cdot(-1)] = 2 + 0.9\cdot 0.8 = 2.72$
Fight: $2 + 0.9\,[0.75\cdot 2 + 0.25\cdot(-4)] = 2 + 0.9\cdot 0.5 = 2.45$
We take the max of the two values, so we have $V_2(O) = 2.72$.

$V_2(D)$:
Run: $-1 + 0.9\,[0.4\cdot 2 + 0.3\cdot(-1) + 0.3\cdot(-4)] = -1 + 0.9\cdot(-0.7) = -1.63$
Fight: $-1 + 0.9\,[0.3\cdot 2 + 0.25\cdot(-1) + 0.45\cdot(-4)] = -1 + 0.9\cdot(-1.45) = -2.305$
We take the max of the two values, so we have $V_2(D) = -1.63$.

$V_2(C)$:
Run: $-4 + 0.9\,[0.6\cdot(-1) + 0.4\cdot(-4)] = -4 + 0.9\cdot(-2.2) = -5.98$
Fight: $-4 + 0.9\,[0.2\cdot 2 + 0.8\cdot(-4)] = -4 + 0.9\cdot(-2.8) = -6.52$
We take the max of the two values, so we have $V_2(C) = -5.98$.

For $V_3$ from $V_2$:

$V_3(O)$:
Run: $2 + 0.9\,[0.6\cdot 2.72 + 0.4\cdot(-1.63)] = 2 + 0.9\cdot 0.98 = 2.882$
Fight: $2 + 0.9\,[0.75\cdot 2.72 + 0.25\cdot(-5.98)] = 2 + 0.9\cdot 0.545 = 2.4905$
We take the max of the two values, so we have $V_3(O) = 2.882$.

$V_3(D)$:
Run: $-1 + 0.9\,[0.4\cdot 2.72 + 0.3\cdot(-1.63) + 0.3\cdot(-5.98)] = -1 + 0.9\cdot(-1.195) = -2.0755$
Fight: $-1 + 0.9\,[0.3\cdot 2.72 + 0.25\cdot(-1.63) + 0.45\cdot(-5.98)] = -1 + 0.9\cdot(-2.2825) = -3.05425$
We take the max of the two values, so we have $V_3(D) = -2.0755$.

$V_3(C)$:
Run: $-4 + 0.9\,[0.6\cdot(-1.63) + 0.4\cdot(-5.98)] = -4 + 0.9\cdot(-3.37) = -7.033$
Fight: $-4 + 0.9\,[0.2\cdot 2.72 + 0.8\cdot(-5.98)] = -4 + 0.9\cdot(-4.24) = -7.816$
We take the max of the two values, so we have $V_3(C) = -7.033$.
Finally, we can fill in the table:
| $k$ | $V_k(O)$ | $V_k(D)$ | $V_k(C)$ |
|---|---|---|---|
| 1 | 2 | -1 | -4 |
| 2 | 2.72 | -1.63 | -5.98 |
| 3 | 2.882 | -2.0755 | -7.033 |
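The table can be reproduced programmatically. A minimal sketch, assuming the transition probabilities used in the hand calculations above (the dictionary layout and function names are my own choices):

```python
GAMMA = 0.9
R = {"O": 2.0, "D": -1.0, "C": -4.0}
P = {  # P[s][a] = {s': probability}, as used in the calculations above
    "O": {"Run": {"O": 0.6, "D": 0.4},           "Fight": {"O": 0.75, "C": 0.25}},
    "D": {"Run": {"O": 0.4, "D": 0.3, "C": 0.3}, "Fight": {"O": 0.3, "D": 0.25, "C": 0.45}},
    "C": {"Run": {"D": 0.6, "C": 0.4},           "Fight": {"O": 0.2, "C": 0.8}},
}

def backup(V, s, a):
    """One Bellman backup: r(s) + gamma * sum_{s'} P(s'|s,a) V(s')."""
    return R[s] + GAMMA * sum(p * V[sp] for sp, p in P[s][a].items())

def vi_step(V):
    """One sweep of value iteration: max over actions of the backup."""
    return {s: max(backup(V, s, a) for a in ("Run", "Fight")) for s in V}

V1 = {"O": 2.0, "D": -1.0, "C": -4.0}
V2 = vi_step(V1)
V3 = vi_step(V2)
print(V2)
print(V3)
```

Running this reproduces the rows for $k = 2$ and $k = 3$ up to floating-point rounding.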
1.2
For both $k = 2$ and $k = 3$, the greedy policy with respect to $V_k$ selects Run in states $O$, $D$, and $C$: in every calculation above, the Run backup exceeds the Fight backup.
However, these choices are not guaranteed to be the globally optimal policy. After finitely many iterations, the greedy policy is optimal only with respect to the current estimate $V_k$, and it may still change once value iteration converges.
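The greedy choices can be verified numerically. A small sketch, assuming the same rewards and transition model as the calculations in 1.1 (dictionary layout and names are mine); the greedy action per state is an argmax over one-step backups:

```python
GAMMA = 0.9
R = {"O": 2.0, "D": -1.0, "C": -4.0}
P = {  # P[s][a] = {s': probability}, as in the hand calculations
    "O": {"Run": {"O": 0.6, "D": 0.4},           "Fight": {"O": 0.75, "C": 0.25}},
    "D": {"Run": {"O": 0.4, "D": 0.3, "C": 0.3}, "Fight": {"O": 0.3, "D": 0.25, "C": 0.45}},
    "C": {"Run": {"D": 0.6, "C": 0.4},           "Fight": {"O": 0.2, "C": 0.8}},
}

def backup(V, s, a):
    """One Bellman backup: r(s) + gamma * sum_{s'} P(s'|s,a) V(s')."""
    return R[s] + GAMMA * sum(p * V[sp] for sp, p in P[s][a].items())

def greedy(V):
    """Greedy policy w.r.t. V: argmax over actions of the one-step backup."""
    return {s: max(("Run", "Fight"), key=lambda a: backup(V, s, a)) for s in V}

V2 = {"O": 2.72, "D": -1.63, "C": -5.98}
V3 = {"O": 2.882, "D": -2.0755, "C": -7.033}
print(greedy(V2))  # Run in every state
print(greedy(V3))  # Run in every state
```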
2. Value Iteration Theorem
We have the Bellman operator under policy $\pi$:
$$(B_\pi V)(s)=\mathbb{E}_{a\sim\pi(\cdot\mid s)}\Big[R(s,a)+\gamma\sum_{s'}p(s'\mid s,a)\,V(s')\Big].$$
Expanding and cancelling the reward terms, pointwise for each state $s$:
$$\begin{aligned}
\big|(B_\pi V)(s) - (B_\pi V')(s)\big|
&= \Big|\mathbb{E}_{a\sim\pi}\Big[ R(s,a) + \gamma\sum_{s'}p(s'\mid s,a)\,V(s') \Big]
- \mathbb{E}_{a\sim\pi}\Big[ R(s,a) + \gamma\sum_{s'}p(s'\mid s,a)\,V'(s') \Big]\Big| \\
&= \Big|\gamma\,\mathbb{E}_{a\sim\pi}\sum_{s'}p(s'\mid s,a)\,\big(V(s')-V'(s')\big)\Big|.
\end{aligned}$$
Then we can pull out $\gamma$ and apply the triangle inequality $|\mathbb{E}[X]|\le \mathbb{E}[|X|]$, together with the bound $|V(s')-V'(s')| \le \lVert V-V'\rVert_\infty$:
$$\begin{aligned}
\big|(B_\pi V)(s) - (B_\pi V')(s)\big|
&\le \gamma\,\mathbb{E}_{a\sim\pi}\sum_{s'}p(s'\mid s,a)\,\big|V(s')-V'(s')\big| \\
&\le \gamma\,\mathbb{E}_{a\sim\pi}\sum_{s'}p(s'\mid s,a)\,\lVert V-V'\rVert_\infty.
\end{aligned}$$
Using $\sum_{s'}p(s'\mid s,a)=1$:
$$\gamma\,\mathbb{E}_{a\sim\pi}\sum_{s'}p(s'\mid s,a)\,\lVert V-V'\rVert_\infty = \gamma\,\lVert V-V'\rVert_\infty.$$
Since this bound holds for every state $s$, taking the maximum over $s$ gives:
$$\lVert B_\pi V - B_\pi V'\rVert_\infty \le \gamma\,\lVert V-V'\rVert_\infty.$$
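The inequality just derived can be checked numerically. A sketch on a randomly generated MDP (sizes, seed, and array layout are arbitrary choices of mine, not from the problem):

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 5, 3, 0.9
P = rng.random((nS, nA, nS)); P /= P.sum(axis=2, keepdims=True)  # P(s'|s,a)
Rsa = rng.random((nS, nA))                                       # R(s,a)
pi = rng.random((nS, nA)); pi /= pi.sum(axis=1, keepdims=True)   # pi(a|s)

def B_pi(V):
    """(B_pi V)(s) = E_{a~pi}[ R(s,a) + gamma * sum_{s'} P(s'|s,a) V(s') ]."""
    Q = Rsa + gamma * (P @ V)       # Q[s, a]
    return (pi * Q).sum(axis=1)     # average over actions under pi

for _ in range(100):
    V, Vp = rng.normal(size=nS), rng.normal(size=nS)
    lhs = np.max(np.abs(B_pi(V) - B_pi(Vp)))   # ||B_pi V - B_pi V'||_inf
    rhs = gamma * np.max(np.abs(V - Vp))       # gamma * ||V - V'||_inf
    assert lhs <= rhs + 1e-12
print("contraction bound held on 100 random pairs")
```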
From (a) we have
$$\|B_\pi V - B_\pi V'\|_\infty \le \gamma \|V - V'\|_\infty,$$
so $B_\pi$ is a $\gamma$-contraction on the complete metric space $(\mathbb{R}^{|\mathcal{S}|}, \|\cdot\|_\infty)$.
By the Banach fixed-point theorem, there exists a unique fixed point $\bar V$ such that $\bar V = B_\pi \bar V$, and for any initial value $V_0$ the iteration $V_{t+1} = B_\pi V_t$ converges to $\bar V$ at a geometric rate.
On the other hand, we have the Bellman equation for policy $\pi$:
$$V^\pi = B_\pi V^\pi,$$
so $V^\pi$ is itself a fixed point of $B_\pi$. Since the fixed point of $B_\pi$ is unique, we must have $\bar V = V^\pi$. Therefore, the unique fixed point of $B_\pi$ is $V^\pi$.
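This fixed-point characterization suggests a numerical check: for a finite MDP, $V^\pi$ can be computed directly from the linear system $(I - \gamma P_\pi)V = r_\pi$, and iterating $B_\pi$ from an arbitrary start should converge to the same vector. A sketch on a random MDP (sizes, seed, and layout are my own choices):

```python
import numpy as np

rng = np.random.default_rng(1)
nS, nA, gamma = 4, 2, 0.9
P = rng.random((nS, nA, nS)); P /= P.sum(axis=2, keepdims=True)  # P(s'|s,a)
Rsa = rng.random((nS, nA))                                       # R(s,a)
pi = rng.random((nS, nA)); pi /= pi.sum(axis=1, keepdims=True)   # pi(a|s)

# Policy-averaged dynamics and rewards.
P_pi = np.einsum("sa,sap->sp", pi, P)  # P_pi[s,s'] = sum_a pi(a|s) P(s'|s,a)
r_pi = (pi * Rsa).sum(axis=1)

# Direct solution of the fixed-point equation V = r_pi + gamma * P_pi V.
V_pi = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)

# Fixed-point iteration V_{t+1} = B_pi V_t from an arbitrary start.
V = np.zeros(nS)
for _ in range(500):
    V = r_pi + gamma * P_pi @ V
print(np.max(np.abs(V - V_pi)))  # essentially zero: geometric convergence
```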
The greedy policy is not guaranteed to be the optimal policy.
For the greedy policy $\pi = \pi_V$ with respect to a value function $V$, the value $V^\pi$ satisfies the Bellman equation
$$V^\pi = B_\pi V^\pi.$$
However, $V$ itself satisfies this equation only when $V$ is the fixed point of $B_\pi$, i.e., $V = B_\pi V$. In general $V \ne V^\pi$, so the greedy policy with respect to an arbitrary $V$ is not necessarily optimal.
When $\pi$ is the greedy policy with respect to $V_n$, we have $BV_n = B_\pi V_n$, and hence $V_{n+1} = BV_n = B_\pi V_n$. Then:
$$\lVert V^\pi - V_{n+1} \rVert = \lVert B_\pi V^\pi - B_\pi V_n \rVert \le \gamma \lVert V^\pi - V_n \rVert.$$
Then we can introduce the triangle inequality, multiplied through by $\gamma$:
$$\gamma \lVert V^\pi - V_n \rVert \le \gamma \lVert V^\pi - V_{n+1} \rVert + \gamma \lVert V_{n+1} - V_n \rVert.$$
So we have:
$$\begin{aligned}
(1-\gamma) \lVert V^\pi - V_{n+1} \rVert &\le \gamma \lVert V_{n+1} - V_n \rVert \\
\lVert V^\pi - V_{n+1} \rVert &\le \frac{\gamma}{1-\gamma} \lVert V_{n+1} - V_n \rVert.
\end{aligned}$$
Finally, introducing the stopping condition $\lVert V_{n+1} - V_n \rVert < \frac{\epsilon(1-\gamma)}{2\gamma}$:
$$\lVert V^\pi - V_{n+1} \rVert \le \frac{\gamma}{1-\gamma}\,\lVert V_{n+1} - V_n \rVert < \frac{\gamma}{1-\gamma}\cdot\frac{\epsilon(1-\gamma)}{2\gamma} = \frac{\epsilon}{2}.$$
So when the stopping condition is met, the value of the greedy policy is within $\epsilon/2$ of the last iterate.
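As a check, the stopping rule can be run on the MDP from part 1 (state order [O, D, C], actions [Run, Fight]; the array layout is mine): after stopping, the exact value of the greedy policy should be within $\epsilon/2$ of the last iterate.

```python
import numpy as np

gamma, eps = 0.9, 1e-3
r = np.array([2.0, -1.0, -4.0])   # r(s) for states [O, D, C]
P = np.array([                    # P[s, a, s'], actions [Run, Fight]
    [[0.6, 0.4, 0.0], [0.75, 0.0, 0.25]],
    [[0.4, 0.3, 0.3], [0.3, 0.25, 0.45]],
    [[0.0, 0.6, 0.4], [0.2, 0.0, 0.8]],
])

V = np.zeros(3)
while True:
    V_next = r + gamma * (P @ V).max(axis=1)   # optimal Bellman backup B V_n
    if np.max(np.abs(V_next - V)) < eps * (1 - gamma) / (2 * gamma):
        break
    V = V_next

# Greedy policy w.r.t. V_n, and its exact value from (I - gamma P_pi) V = r.
pi = (P @ V).argmax(axis=1)            # r(s) is action-independent, so it
P_pi = P[np.arange(3), pi]             # does not affect the argmax
V_pi = np.linalg.solve(np.eye(3) - gamma * P_pi, r)
print(np.max(np.abs(V_pi - V_next)))   # below eps/2 = 5e-4, as derived
```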