Yuanjun Feng 528668 Homework 1
1. MDPs
1.1
First of all, we have the value iteration formula:
$$V_{k+1}(s) = r(s) + \gamma \cdot \max_{a\in\{\text{Run},\,\text{Fight}\}} \sum_{s'} P(s'\mid s,a)\, V_k(s').$$
where
$r(s)$: O = +2, D = −1, C = −4
$\gamma = 0.9$
Initialization: $V_1(O) = 2$, $V_1(D) = -1$, $V_1(C) = -4$
Then, computing $V_2$ from $V_1$, we evaluate each state:

$V_2(O)$:
Run: $2 + 0.9\,[0.6\cdot 2 + 0.4\cdot(-1)] = 2 + 0.9\cdot 0.8 = 2.72$
Fight: $2 + 0.9\,[0.75\cdot 2 + 0.25\cdot(-4)] = 2 + 0.9\cdot 0.5 = 2.45$
We take the max of the two values, so we have $V_2(O) = 2.72$.

$V_2(D)$:
Run: $-1 + 0.9\,[0.4\cdot 2 + 0.3\cdot(-1) + 0.3\cdot(-4)] = -1 + 0.9\cdot(-0.7) = -1.63$
Fight: $-1 + 0.9\,[0.3\cdot 2 + 0.25\cdot(-1) + 0.45\cdot(-4)] = -1 + 0.9\cdot(-1.45) = -2.305$
We take the max of the two values, so we have $V_2(D) = -1.63$.

$V_2(C)$:
Run: $-4 + 0.9\,[0.6\cdot(-1) + 0.4\cdot(-4)] = -4 + 0.9\cdot(-2.2) = -5.98$
Fight: $-4 + 0.9\,[0.2\cdot 2 + 0.8\cdot(-4)] = -4 + 0.9\cdot(-2.8) = -6.52$
We take the max of the two values, so we have $V_2(C) = -5.98$.

For $V_3$ from $V_2$:

$V_3(O)$:
Run: $2 + 0.9\,[0.6\cdot 2.72 + 0.4\cdot(-1.63)] = 2 + 0.9\cdot 0.98 = 2.882$
Fight: $2 + 0.9\,[0.75\cdot 2.72 + 0.25\cdot(-5.98)] = 2 + 0.9\cdot 0.545 = 2.4905$
We take the max of the two values, so we have $V_3(O) = 2.882$.

$V_3(D)$:
Run: $-1 + 0.9\,[0.4\cdot 2.72 + 0.3\cdot(-1.63) + 0.3\cdot(-5.98)] = -1 + 0.9\cdot(-1.195) = -2.0755$
Fight: $-1 + 0.9\,[0.3\cdot 2.72 + 0.25\cdot(-1.63) + 0.45\cdot(-5.98)] = -1 + 0.9\cdot(-2.2825) = -3.05425$
We take the max of the two values, so we have $V_3(D) = -2.0755$.

$V_3(C)$:
Run: $-4 + 0.9\,[0.6\cdot(-1.63) + 0.4\cdot(-5.98)] = -4 + 0.9\cdot(-3.37) = -7.033$
Fight: $-4 + 0.9\,[0.2\cdot 2.72 + 0.8\cdot(-5.98)] = -4 + 0.9\cdot(-4.24) = -7.816$
We take the max of the two values, so we have $V_3(C) = -7.033$.
Finally, we can fill in the table:
| $k$ | $V_k(O)$ | $V_k(D)$ | $V_k(C)$ |
|---|---|---|---|
| 1 | 2 | -1 | -4 |
| 2 | 2.72 | -1.63 | -5.98 |
| 3 | 2.882 | -2.0755 | -7.033 |
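The table can be reproduced programmatically. A minimal sketch, assuming the transition probabilities used in the hand calculations above (the dictionary layout and function names are my own choices):

```python
GAMMA = 0.9
R = {"O": 2.0, "D": -1.0, "C": -4.0}
P = {  # P[s][a] = {s': probability}, as used in the calculations above
    "O": {"Run": {"O": 0.6, "D": 0.4},           "Fight": {"O": 0.75, "C": 0.25}},
    "D": {"Run": {"O": 0.4, "D": 0.3, "C": 0.3}, "Fight": {"O": 0.3, "D": 0.25, "C": 0.45}},
    "C": {"Run": {"D": 0.6, "C": 0.4},           "Fight": {"O": 0.2, "C": 0.8}},
}

def backup(V, s, a):
    """One Bellman backup: r(s) + gamma * sum_{s'} P(s'|s,a) V(s')."""
    return R[s] + GAMMA * sum(p * V[sp] for sp, p in P[s][a].items())

def vi_step(V):
    """One sweep of value iteration: max over actions of the backup."""
    return {s: max(backup(V, s, a) for a in ("Run", "Fight")) for s in V}

V1 = {"O": 2.0, "D": -1.0, "C": -4.0}
V2 = vi_step(V1)
V3 = vi_step(V2)
print(V2)
print(V3)
```

Running this reproduces the rows for $k = 2$ and $k = 3$ up to floating-point rounding.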
1.2
For both $k = 2$ and $k = 3$, the greedy policy with respect to $V_k$ selects Run in states $O$, $D$, and $C$: in every calculation above, the Run backup exceeds the Fight backup.
However, these choices are not guaranteed to be the globally optimal policy. After finitely many iterations, the greedy policy is optimal only with respect to the current estimate $V_k$, and it may still change once value iteration converges.
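The greedy choices can be verified numerically. A small sketch, assuming the same rewards and transition model as the calculations in 1.1 (dictionary layout and names are mine); the greedy action per state is an argmax over one-step backups:

```python
GAMMA = 0.9
R = {"O": 2.0, "D": -1.0, "C": -4.0}
P = {  # P[s][a] = {s': probability}, as in the hand calculations
    "O": {"Run": {"O": 0.6, "D": 0.4},           "Fight": {"O": 0.75, "C": 0.25}},
    "D": {"Run": {"O": 0.4, "D": 0.3, "C": 0.3}, "Fight": {"O": 0.3, "D": 0.25, "C": 0.45}},
    "C": {"Run": {"D": 0.6, "C": 0.4},           "Fight": {"O": 0.2, "C": 0.8}},
}

def backup(V, s, a):
    """One Bellman backup: r(s) + gamma * sum_{s'} P(s'|s,a) V(s')."""
    return R[s] + GAMMA * sum(p * V[sp] for sp, p in P[s][a].items())

def greedy(V):
    """Greedy policy w.r.t. V: argmax over actions of the one-step backup."""
    return {s: max(("Run", "Fight"), key=lambda a: backup(V, s, a)) for s in V}

V2 = {"O": 2.72, "D": -1.63, "C": -5.98}
V3 = {"O": 2.882, "D": -2.0755, "C": -7.033}
print(greedy(V2))  # Run in every state
print(greedy(V3))  # Run in every state
```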
2. Value Iteration Theorem
We have the Bellman operator under policy $\pi$:
$$(B_\pi V)(s)=\mathbb{E}_{a\sim\pi(\cdot\mid s)}\Big[R(s,a)+\gamma\sum_{s'}p(s'\mid s,a)\,V(s')\Big].$$
Expanding and cancelling the reward terms, pointwise for each state $s$:
$$\begin{aligned}
\big|(B_\pi V)(s) - (B_\pi V')(s)\big|
&= \Big|\mathbb{E}_{a\sim\pi}\Big[ R(s,a) + \gamma\sum_{s'}p(s'\mid s,a)\,V(s') \Big]
- \mathbb{E}_{a\sim\pi}\Big[ R(s,a) + \gamma\sum_{s'}p(s'\mid s,a)\,V'(s') \Big]\Big| \\
&= \Big|\gamma\,\mathbb{E}_{a\sim\pi}\sum_{s'}p(s'\mid s,a)\,\big(V(s')-V'(s')\big)\Big|.
\end{aligned}$$
Then we can pull out $\gamma$ and apply the triangle inequality $|\mathbb{E}[X]|\le \mathbb{E}[|X|]$, together with the bound $|V(s')-V'(s')| \le \lVert V-V'\rVert_\infty$:
$$\begin{aligned}
\big|(B_\pi V)(s) - (B_\pi V')(s)\big|
&\le \gamma\,\mathbb{E}_{a\sim\pi}\sum_{s'}p(s'\mid s,a)\,\big|V(s')-V'(s')\big| \\
&\le \gamma\,\mathbb{E}_{a\sim\pi}\sum_{s'}p(s'\mid s,a)\,\lVert V-V'\rVert_\infty.
\end{aligned}$$
Using $\sum_{s'}p(s'\mid s,a)=1$:
$$\gamma\,\mathbb{E}_{a\sim\pi}\sum_{s'}p(s'\mid s,a)\,\lVert V-V'\rVert_\infty = \gamma\,\lVert V-V'\rVert_\infty.$$
Since this bound holds for every state $s$, taking the maximum over $s$ gives:
$$\lVert B_\pi V - B_\pi V'\rVert_\infty \le \gamma\,\lVert V-V'\rVert_\infty.$$
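The inequality just derived can be checked numerically. A sketch on a randomly generated MDP (sizes, seed, and array layout are arbitrary choices of mine, not from the problem):

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 5, 3, 0.9
P = rng.random((nS, nA, nS)); P /= P.sum(axis=2, keepdims=True)  # P(s'|s,a)
Rsa = rng.random((nS, nA))                                       # R(s,a)
pi = rng.random((nS, nA)); pi /= pi.sum(axis=1, keepdims=True)   # pi(a|s)

def B_pi(V):
    """(B_pi V)(s) = E_{a~pi}[ R(s,a) + gamma * sum_{s'} P(s'|s,a) V(s') ]."""
    Q = Rsa + gamma * (P @ V)       # Q[s, a]
    return (pi * Q).sum(axis=1)     # average over actions under pi

for _ in range(100):
    V, Vp = rng.normal(size=nS), rng.normal(size=nS)
    lhs = np.max(np.abs(B_pi(V) - B_pi(Vp)))   # ||B_pi V - B_pi V'||_inf
    rhs = gamma * np.max(np.abs(V - Vp))       # gamma * ||V - V'||_inf
    assert lhs <= rhs + 1e-12
print("contraction bound held on 100 random pairs")
```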
From (a) we have
$$\|B_\pi V - B_\pi V'\|_\infty \le \gamma \|V - V'\|_\infty,$$
so $B_\pi$ is a $\gamma$-contraction on the complete metric space $(\mathbb{R}^{|\mathcal{S}|}, \|\cdot\|_\infty)$.
By the Banach fixed-point theorem, there exists a unique fixed point $\bar V$ such that $\bar V = B_\pi \bar V$, and for any initial value $V_0$ the iteration $V_{t+1} = B_\pi V_t$ converges to $\bar V$ at a geometric rate.
On the other hand, we have the Bellman equation for policy $\pi$:
$$V^\pi = B_\pi V^\pi,$$
so $V^\pi$ is itself a fixed point of $B_\pi$. Since the fixed point of $B_\pi$ is unique, we must have $\bar V = V^\pi$. Therefore, the unique fixed point of $B_\pi$ is $V^\pi$.
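This fixed-point characterization suggests a numerical check: for a finite MDP, $V^\pi$ can be computed directly from the linear system $(I - \gamma P_\pi)V = r_\pi$, and iterating $B_\pi$ from an arbitrary start should converge to the same vector. A sketch on a random MDP (sizes, seed, and layout are my own choices):

```python
import numpy as np

rng = np.random.default_rng(1)
nS, nA, gamma = 4, 2, 0.9
P = rng.random((nS, nA, nS)); P /= P.sum(axis=2, keepdims=True)  # P(s'|s,a)
Rsa = rng.random((nS, nA))                                       # R(s,a)
pi = rng.random((nS, nA)); pi /= pi.sum(axis=1, keepdims=True)   # pi(a|s)

# Policy-averaged dynamics and rewards.
P_pi = np.einsum("sa,sap->sp", pi, P)  # P_pi[s,s'] = sum_a pi(a|s) P(s'|s,a)
r_pi = (pi * Rsa).sum(axis=1)

# Direct solution of the fixed-point equation V = r_pi + gamma * P_pi V.
V_pi = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)

# Fixed-point iteration V_{t+1} = B_pi V_t from an arbitrary start.
V = np.zeros(nS)
for _ in range(500):
    V = r_pi + gamma * P_pi @ V
print(np.max(np.abs(V - V_pi)))  # essentially zero: geometric convergence
```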
The greedy policy is not guaranteed to be the optimal policy.
For the greedy policy $\pi = \pi_V$ with respect to a value function $V$, the value $V^\pi$ satisfies the Bellman equation
$$V^\pi = B_\pi V^\pi.$$
However, $V$ itself satisfies this equation only when $V$ is the fixed point of $B_\pi$, i.e., $V = B_\pi V$. In general $V \ne V^\pi$, so the greedy policy with respect to an arbitrary $V$ is not necessarily optimal.
When $\pi$ is the greedy policy with respect to $V_n$, we have $BV_n = B_\pi V_n$, and hence $V_{n+1} = BV_n = B_\pi V_n$. Then:
$$\lVert V^\pi - V_{n+1} \rVert = \lVert B_\pi V^\pi - B_\pi V_n \rVert \le \gamma \lVert V^\pi - V_n \rVert.$$
Then we can introduce the triangle inequality, multiplied through by $\gamma$:
$$\gamma \lVert V^\pi - V_n \rVert \le \gamma \lVert V^\pi - V_{n+1} \rVert + \gamma \lVert V_{n+1} - V_n \rVert.$$
So we have:
$$\begin{aligned}
(1-\gamma) \lVert V^\pi - V_{n+1} \rVert &\le \gamma \lVert V_{n+1} - V_n \rVert \\
\lVert V^\pi - V_{n+1} \rVert &\le \frac{\gamma}{1-\gamma} \lVert V_{n+1} - V_n \rVert.
\end{aligned}$$
Finally, introducing the stopping condition $\lVert V_{n+1} - V_n \rVert < \frac{\epsilon(1-\gamma)}{2\gamma}$:
$$\lVert V^\pi - V_{n+1} \rVert \le \frac{\gamma}{1-\gamma}\,\lVert V_{n+1} - V_n \rVert < \frac{\gamma}{1-\gamma}\cdot\frac{\epsilon(1-\gamma)}{2\gamma} = \frac{\epsilon}{2}.$$
So when the stopping condition is met, the value of the greedy policy is within $\epsilon/2$ of the last iterate.
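As a check, the stopping rule can be run on the MDP from part 1 (state order [O, D, C], actions [Run, Fight]; the array layout is mine): after stopping, the exact value of the greedy policy should be within $\epsilon/2$ of the last iterate.

```python
import numpy as np

gamma, eps = 0.9, 1e-3
r = np.array([2.0, -1.0, -4.0])   # r(s) for states [O, D, C]
P = np.array([                    # P[s, a, s'], actions [Run, Fight]
    [[0.6, 0.4, 0.0], [0.75, 0.0, 0.25]],
    [[0.4, 0.3, 0.3], [0.3, 0.25, 0.45]],
    [[0.0, 0.6, 0.4], [0.2, 0.0, 0.8]],
])

V = np.zeros(3)
while True:
    V_next = r + gamma * (P @ V).max(axis=1)   # optimal Bellman backup B V_n
    if np.max(np.abs(V_next - V)) < eps * (1 - gamma) / (2 * gamma):
        break
    V = V_next

# Greedy policy w.r.t. V_n, and its exact value from (I - gamma P_pi) V = r.
pi = (P @ V).argmax(axis=1)            # r(s) is action-independent, so it
P_pi = P[np.arange(3), pi]             # does not affect the argmax
V_pi = np.linalg.solve(np.eye(3) - gamma * P_pi, r)
print(np.max(np.abs(V_pi - V_next)))   # below eps/2 = 5e-4, as derived
```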