
Homework 2

| Name | ID |
| --- | --- |
| Yuanjun Feng | 528668 |
| Miro Ye | 527238 |

Short Answer Questions

Question 1

We have Update 1:

Q(s,a) := Q(s,a) + \alpha\bigl(r + \gamma\,\max_{a' \in A} Q(s',a') - Q(s,a)\bigr)

and Update 2:

w := w + \alpha\bigl(r + \gamma\,\max_{a' \in A} Q_w(s',a') - Q_w(s,a)\bigr)\,\nabla_w Q_w(s,a)

We have

Q_w(s,a) = w^\top \phi(s,a), \quad w \in \mathbb{R}^{|S||A|}, \; \phi: S \times A \to \mathbb{R}^{|S||A|}

so

\begin{aligned} \nabla_w Q_w(s,a) &= \nabla_w\, w^\top \phi(s,a) \\ &= \nabla_w \sum_i w_i\, \phi_i(s,a) \\ &= \phi(s,a) \end{aligned}

Let \delta = r + \gamma\,\max_{a' \in A} Q_w(s',a') - Q_w(s,a). Then:

\begin{aligned} w &:= w + \alpha\,\delta\,\phi(s,a) \\ w_{(i,j)} &:= w_{(i,j)} + \alpha\,\delta\,\mathbf{1}[i=s,\, j=a] \end{aligned}

because \phi(s,a)_{(i,j)} = \mathbf{1}[i=s,\; j=a]: the feature vector is one-hot, so only the component indexed by (s,a) equals 1. We also have

Q_w(s,a) = w^\top \phi(s,a) = w_{(s,a)}

Finally we get

\begin{aligned} Q_w(s,a) = w_{(s,a)} &:= w_{(s,a)} + \alpha\,\delta \\ &= Q_w(s,a) + \alpha\bigl(r + \gamma\,\max_{a' \in A} Q_w(s',a') - Q_w(s,a)\bigr), \end{aligned}

which is exactly Update 1 with Q(s,a) = Q_w(s,a). Hence tabular Q-learning is the special case of Update 2 with one-hot features.
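The equivalence can be checked numerically. The sketch below (not part of the assignment code; the state/action sizes and the sampled transition are illustrative) applies Update 2 to one-hot features and Update 1 to a tabular copy of the same values, and verifies they produce identical results:

```python
import numpy as np

n_states, n_actions, alpha, gamma = 3, 2, 0.5, 0.9

def phi(s, a):
    """One-hot feature vector with a 1 at index (s, a)."""
    f = np.zeros(n_states * n_actions)
    f[s * n_actions + a] = 1.0
    return f

rng = np.random.default_rng(0)
w = rng.normal(size=n_states * n_actions)   # linear weights
Q = w.reshape(n_states, n_actions).copy()   # tabular copy of the same values

s, a, r, s_next = 0, 1, 1.0, 2              # one sampled transition

# Update 2: gradient step on w (the gradient of w^T phi is phi)
delta = r + gamma * max(w @ phi(s_next, b) for b in range(n_actions)) - w @ phi(s, a)
w = w + alpha * delta * phi(s, a)

# Update 1: tabular Q-learning on the copy
Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])

print(np.allclose(w.reshape(n_states, n_actions), Q))  # True: identical updates
```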

Question 2

  • The Deadly Triad is the combination of function approximation, bootstrapping, and off-policy learning. When all three are present, learning can diverge, with value estimates becoming unbounded. Bootstrapping means the targets include the current estimates; with function approximation, an update can inadvertently change the value estimates of other states; and if learning is off-policy, those bootstrapped values may not be corrected often enough by actual visits, yielding harmful feedback dynamics and possible divergence.

  • Instability/divergence risk from the triad: deep Q-networks + one-step bootstrapping + off-policy updates can blow up action values, yet DQN famously combines all three, which motivates careful design. Overestimation bias in the max-backup target (as in classic Q-learning) further increases instability. Empirically, among the four bootstrap targets compared, basic Q-learning had the highest fraction of soft-divergences, while target Q and double Q were the most stable.

  • Bootstrapping on a separate target network reduces instability: the bootstrap target cannot be changed immediately by the very update it defines, so divergence occurs less often. Empirically, target Q and double Q are the most stable of the four bootstrap targets precisely because they use target networks; this effect is also highlighted in the Discussion.
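The target-network mechanism from the last bullet can be sketched in a few lines. This is a toy illustration, not the paper's or the assignment's code: the "network" is a plain weight vector and the update rule is a dummy shrinkage step, but the periodic-copy logic is the real idea:

```python
import numpy as np

rng = np.random.default_rng(1)
online_w = rng.normal(size=4)   # online network parameters (stand-in)
target_w = online_w.copy()      # frozen copy used for bootstrap targets
sync_every = 3                  # matches the report's update frequency

for step in range(1, 10):
    # Gradient step on the online weights (dummy update for illustration).
    online_w = 0.9 * online_w
    # target_w stays fixed between syncs, stabilizing the bootstrap target.
    if step % sync_every == 0:
        target_w = online_w.copy()

print(np.allclose(target_w, online_w))  # True: the last sync happened at step 9
```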

Question 3

Maximization bias arises in vanilla Q-learning because the same noisy estimator both selects the greedy action and evaluates it, so the target r + \gamma\,\max_{a' \in A} Q(s',a') is systematically overestimated. Double Q-learning reduces this bias by decoupling selection from evaluation: one estimator (the online network) picks the action and another (the target network) scores it,

Y = R + \gamma\, Q\bigl(s',\, \arg\max_{a'} Q(s', a'; w);\, w^{-}\bigr).

Because the selection noise and the evaluation noise come from different parameters, a spuriously high estimate cannot both win the max and inflate its own target, yielding much lower overestimation and more stable learning.
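The decoupling can be shown concretely. In this sketch the Q-value arrays are made up for illustration; the online estimates pick the argmax action and the target estimates score it, instead of taking the max over one set of estimates:

```python
import numpy as np

q_online_next = np.array([1.0, 2.5, 2.4])   # online net's Q(s', .) (illustrative)
q_target_next = np.array([0.9, 1.8, 2.2])   # target net's Q(s', .)
r, gamma = 1.0, 0.99

# Vanilla target: one estimator both selects and evaluates (max over its values).
y_vanilla = r + gamma * q_target_next.max()

# Double DQN target: online net selects the action, target net evaluates it.
a_star = int(np.argmax(q_online_next))        # selection
y_double = r + gamma * q_target_next[a_star]  # evaluation with other parameters

print(a_star, y_double <= y_vanilla)
```

Here the action favored by the online net is not the one with the largest target-net value, so the double target comes out lower than the max-based target, which is exactly the overestimation reduction described above.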

Deep Q-Network (DQN) Experiment Report


1. Experiment Description

This experiment systematically compares the performance of 7 different DQN variants:

  1. Basic DQN - Baseline implementation
  2. Double DQN - Reduces Q-value overestimation
  3. Dueling DQN - Separates state value and advantage functions
  4. N-Step (3-step) - Uses multi-step returns
  5. PER - Prioritized Experience Replay
  6. N-Step + PER - Combines N-Step and PER
  7. All-In-One - Combines all four improvements (Double + Dueling + N-Step + PER)

2. Experiment Setup

2.1 Environment Configuration

  • Environment: OpenAI Gymnasium CartPole-v1
  • State Space: 4-dimensional continuous state (position, velocity, angle, angular velocity)
  • Action Space: 2 discrete actions (left, right)
  • Reward: +1 for each time step
  • Termination Condition: the pole angle exceeds 12 degrees from vertical, or the cart position leaves the range of ±2.4
  • Max Score: 500 points (the maximum episode length of CartPole-v1)

2.2 Training Parameters

Training Steps: 50,000
Batch Size: 128
Learning Rate: 0.002
Discount Factor (γ): 0.99
Target Network Update Frequency: every 3 steps
Epsilon Decay: from 1.0 to 0.05 (12,500 steps)
Hidden Layer Size: 64
Activation Function: ELU
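The epsilon schedule implied by the parameters above (linear decay from 1.0 to 0.05 over the first 12,500 steps, then held at 0.05) can be sketched as follows; the function name is ours, not taken from the assignment code:

```python
def epsilon(step, start=1.0, end=0.05, decay_steps=12_500):
    """Linearly anneal epsilon from `start` to `end` over `decay_steps` steps."""
    frac = min(step / decay_steps, 1.0)  # clamp so epsilon stays at `end` afterwards
    return start + frac * (end - start)

print(epsilon(0), epsilon(12_500), epsilon(50_000))
```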

3. Experiment Results

3.1 Convergence Speed Comparison

| Rank | Experiment Name | Steps to 500 | Relative to Baseline |
| --- | --- | --- | --- |
| 1 | N-Step + PER | 4,000 | 4.5x acceleration |
| 2 | All-In-One | 6,000 | 3.0x acceleration |
| 3 | N-Step (3-step) | 8,000 | 2.25x acceleration |
| 4 | Double DQN | 14,000 | 1.29x acceleration |
| 5 | Basic DQN | 18,000 | Baseline |
| 6 | Dueling DQN | 18,000 | No improvement |
| 7 | PER | 18,000 | No improvement |

We find that:

  • N-Step + PER combination is the best (4.5x acceleration)
  • N-Step is the most effective single improvement (2.25x acceleration)
  • Double DQN has a noticeable but small improvement (1.29x acceleration)
  • Dueling DQN and PER alone have no obvious advantage in simple tasks

All methods eventually reached the max score of 500, showing that each can solve the CartPole task after sufficient training.

3.2 Training Curve Analysis

  • The training return climbs steadily and converges to 500.
  • The TD loss shows high initial variance that diminishes as the policy stabilizes; N-Step converges faster.
  • The average Q-values increase early and then stabilize. Double DQN's Q-value plateau is slightly lower and more stable, consistent with its reduced overestimation.

Figures for each variant are included below.

DQN: Baseline convergence within ~18k steps.

Double DQN: Reduced Q overestimation, slightly faster stabilization.

Dueling DQN: Similar on CartPole; stronger on tasks with many similar-valued actions.

N-Step (3): Faster convergence via multi-step targets.

PER: Mixed effect on this simple task; helps more when dynamics are diverse.

N-Step + PER: Best overall speed on this task.

All-In-One: Strong, close to the best combo.


4. Method Analysis

  • N-Step reduces bias in the target estimate, at the expense of slightly higher variance, by accumulating observed rewards over several steps before bootstrapping; this dramatically speeds up learning.
  • PER corrects sampling bias using importance sampling weights and gives priority to informative transitions with substantial TD errors; advantages vary depending on the task.
  • Double DQN reduces overestimation by separating action selection and evaluation; the effect is slight but steady.
  • Dueling DQN decomposes Q into value and advantage; the benefit is minimal on CartPole.
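The 3-step target from the N-Step bullet above can be sketched directly; the reward sequence and bootstrap value below are illustrative, not taken from the experiment logs:

```python
# 3-step return: accumulate discounted rewards from three consecutive
# transitions, then bootstrap from the estimate at the n-th next state.
gamma, n = 0.99, 3
rewards = [1.0, 1.0, 1.0]   # CartPole gives +1 per step
bootstrap_q = 10.0          # stand-in for max_a' Q(s_{t+n}, a')

g = sum(gamma**k * r for k, r in enumerate(rewards)) + gamma**n * bootstrap_q
print(g)
```

Because three real rewards enter the target and the bootstrap term is discounted by gamma^3, the target depends less on the (initially inaccurate) Q estimate, which is why N-Step speeds up early learning.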

5. Experiment Conclusion

  • On CartPole, the combination of N-Step and PER converges the fastest; the benefit of using N-Step alone is also significant.
  • Double DQN is slightly more stable and has lower overestimation.
  • Dueling DQN has minimal benefits on this environment.
  • All methods eventually reach Max Score 500, indicating successful policy learning.