
Homework 2

| Name | ID |
| --- | --- |
| Yuanjun Feng | 528668 |
| Miro Ye | 527238 |

Short Answer Questions

Question 1

We have Update 1:

Q(s,a) := Q(s,a) + \alpha\bigl(r + \gamma\,\max_{a' \in A} Q(s',a') - Q(s,a)\bigr)

and Update 2:

w := w + \alpha\bigl(r + \gamma\,\max_{a' \in A} Q_w(s',a') - Q_w(s,a)\bigr)\,\nabla_w Q_w(s,a)

We have

Q_w(s,a) = w^\top \phi(s,a), \quad w \in \mathbb{R}^{|S||A|}, \; \phi: S \times A \to \mathbb{R}^{|S||A|}

so

\begin{aligned} \nabla_w Q_w(s,a) &= \nabla_w\, w^\top \phi(s,a) \\ &= \nabla_w \sum_i w_i\, \phi_i(s,a) \\ &= \phi(s,a) \end{aligned}

Let \delta = r + \gamma\,\max_{a' \in A} Q_w(s',a') - Q_w(s,a). Then:

\begin{aligned} w &:= w + \alpha\,\delta\,\phi(s,a) \\ w_{(i,j)} &:= w_{(i,j)} + \alpha\,\delta\,\mathbf{1}[i=s,\, j=a] \end{aligned}

because \phi(s,a)_{(i,j)} = \mathbf{1}[i=s,\; j=a]: the feature vector is one-hot, so only the component indexed by (s,a) equals 1. We also have

Q_w(s,a) = w^\top \phi(s,a) = w_{(s,a)}

Finally we get

\begin{aligned} Q_w(s,a) = w_{(s,a)} &:= w_{(s,a)} + \alpha\,\delta \\ &= Q_w(s,a) + \alpha\bigl(r + \gamma\,\max_{a' \in A} Q_w(s',a') - Q_w(s,a)\bigr), \end{aligned}

which is exactly Update 1 with Q(s,a) = Q_w(s,a). Hence tabular Q-learning is the special case of Update 2 with one-hot features.
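The equivalence can be checked numerically. The sketch below (not part of the assignment code; the state/action sizes and the sampled transition are illustrative) applies Update 2 to one-hot features and Update 1 to a tabular copy of the same values, and verifies they produce identical results:

```python
import numpy as np

n_states, n_actions, alpha, gamma = 3, 2, 0.5, 0.9

def phi(s, a):
    """One-hot feature vector with a 1 at index (s, a)."""
    f = np.zeros(n_states * n_actions)
    f[s * n_actions + a] = 1.0
    return f

rng = np.random.default_rng(0)
w = rng.normal(size=n_states * n_actions)   # linear weights
Q = w.reshape(n_states, n_actions).copy()   # tabular copy of the same values

s, a, r, s_next = 0, 1, 1.0, 2              # one sampled transition

# Update 2: gradient step on w (the gradient of w^T phi is phi)
delta = r + gamma * max(w @ phi(s_next, b) for b in range(n_actions)) - w @ phi(s, a)
w = w + alpha * delta * phi(s, a)

# Update 1: tabular Q-learning on the copy
Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])

print(np.allclose(w.reshape(n_states, n_actions), Q))  # True: identical updates
```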

Question 2

  • The Deadly Triad is the combination of function approximation, bootstrapping, and off-policy learning. When all three are present, learning can diverge, with value estimates becoming unbounded. Bootstrapping means the targets include the current estimates; with function approximation, an update can inadvertently change the value estimates of other states; and if learning is off-policy, those bootstrapped values may not be corrected often enough by actual visits, yielding harmful feedback dynamics and possible divergence.

  • Instability/divergence risk from the triad: deep Q-networks + one-step bootstrapping + off-policy updates can blow up action values, yet DQN famously combines all three, which motivates careful design. Overestimation bias in the max-backup target (as in classic Q-learning) further increases instability. Empirically, among the four bootstrap targets compared, basic Q-learning had the highest fraction of soft-divergences, while target Q and double Q were the most stable.

  • Bootstrapping on a separate target network reduces instability: the bootstrap target cannot be changed immediately by the very update it defines, so divergence occurs less often. Empirically, target Q and double Q are the most stable of the four bootstrap targets precisely because they use target networks; this effect is also highlighted in the Discussion.
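The target-network mechanism from the last bullet can be sketched in a few lines. This is a toy illustration, not the paper's or the assignment's code: the "network" is a plain weight vector and the update rule is a dummy shrinkage step, but the periodic-copy logic is the real idea:

```python
import numpy as np

rng = np.random.default_rng(1)
online_w = rng.normal(size=4)   # online network parameters (stand-in)
target_w = online_w.copy()      # frozen copy used for bootstrap targets
sync_every = 3                  # matches the report's update frequency

for step in range(1, 10):
    # Gradient step on the online weights (dummy update for illustration).
    online_w = 0.9 * online_w
    # target_w stays fixed between syncs, stabilizing the bootstrap target.
    if step % sync_every == 0:
        target_w = online_w.copy()

print(np.allclose(target_w, online_w))  # True: the last sync happened at step 9
```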

Question 3

Maximization bias arises in vanilla Q-learning because the same noisy estimator both selects the greedy action and evaluates it, so the target r + \gamma\,\max_{a' \in A} Q(s',a') is systematically overestimated. Double Q-learning reduces this bias by decoupling selection from evaluation: one estimator (the online network) picks the action and another (the target network) scores it,

Y = R + \gamma\, Q\bigl(s',\, \arg\max_{a'} Q(s', a'; w);\, w^{-}\bigr).

Because the selection noise and the evaluation noise come from different parameters, a spuriously high estimate cannot both win the max and inflate its own target, yielding much lower overestimation and more stable learning.
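The decoupling can be shown concretely. In this sketch the Q-value arrays are made up for illustration; the online estimates pick the argmax action and the target estimates score it, instead of taking the max over one set of estimates:

```python
import numpy as np

q_online_next = np.array([1.0, 2.5, 2.4])   # online net's Q(s', .) (illustrative)
q_target_next = np.array([0.9, 1.8, 2.2])   # target net's Q(s', .)
r, gamma = 1.0, 0.99

# Vanilla target: one estimator both selects and evaluates (max over its values).
y_vanilla = r + gamma * q_target_next.max()

# Double DQN target: online net selects the action, target net evaluates it.
a_star = int(np.argmax(q_online_next))        # selection
y_double = r + gamma * q_target_next[a_star]  # evaluation with other parameters

print(a_star, y_double <= y_vanilla)
```

Here the action favored by the online net is not the one with the largest target-net value, so the double target comes out lower than the max-based target, which is exactly the overestimation reduction described above.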

Deep Q-Network (DQN) Experiment Report


1. Experiment Description

This experiment systematically compares the performance of 7 different DQN variants:

  1. Basic DQN - Baseline implementation
  2. Double DQN - Reduces Q-value overestimation
  3. Dueling DQN - Separates state value and advantage functions
  4. N-Step (3-step) - Uses multi-step returns
  5. PER - Prioritized Experience Replay
  6. N-Step + PER - Combines N-Step and PER
  7. All-In-One - Combines all four improvements (Double + Dueling + N-Step + PER)

2. Experiment Setup

2.1 Environment Configuration

  • Environment: OpenAI Gymnasium CartPole-v1
  • State Space: 4-dimensional continuous state (position, velocity, angle, angular velocity)
  • Action Space: 2 discrete actions (left, right)
  • Reward: +1 for each time step
  • Termination Condition: the pole angle exceeds 12 degrees from vertical, or the cart position leaves the range of ±2.4
  • Max Score: 500 points (the maximum episode length of CartPole-v1)

2.2 Training Parameters

Training Steps: 50,000
Batch Size: 128
Learning Rate: 0.002
Discount Factor (γ): 0.99
Target Network Update Frequency: every 3 steps
Epsilon Decay: from 1.0 to 0.05 (12,500 steps)
Hidden Layer Size: 64
Activation Function: ELU
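The epsilon schedule implied by the parameters above (linear decay from 1.0 to 0.05 over the first 12,500 steps, then held at 0.05) can be sketched as follows; the function name is ours, not taken from the assignment code:

```python
def epsilon(step, start=1.0, end=0.05, decay_steps=12_500):
    """Linearly anneal epsilon from `start` to `end` over `decay_steps` steps."""
    frac = min(step / decay_steps, 1.0)  # clamp so epsilon stays at `end` afterwards
    return start + frac * (end - start)

print(epsilon(0), epsilon(12_500), epsilon(50_000))
```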

3. Experiment Results

3.1 Convergence Speed Comparison

| Rank | Experiment Name | Steps to 500 | Relative to Baseline |
| --- | --- | --- | --- |
| 1 | N-Step + PER | 4,000 | 4.5x acceleration |
| 2 | All-In-One | 6,000 | 3.0x acceleration |
| 3 | N-Step (3-step) | 8,000 | 2.25x acceleration |
| 4 | Double DQN | 14,000 | 1.29x acceleration |
| 5 | Basic DQN | 18,000 | Baseline |
| 6 | Dueling DQN | 18,000 | No improvement |
| 7 | PER | 18,000 | No improvement |

We find that:

  • N-Step + PER combination is the best (4.5x acceleration)
  • N-Step is the most effective single improvement (2.25x acceleration)
  • Double DQN has a noticeable but small improvement (1.29x acceleration)
  • Dueling DQN and PER alone have no obvious advantage in simple tasks

All methods eventually reached the max score of 500, showing that each can solve the CartPole task after sufficient training.

3.2 Training Curve Analysis

  • The training return climbs steadily and converges to 500.
  • The TD loss shows high initial variance that diminishes as the policy stabilizes; N-Step converges faster.
  • The average Q-values increase early and then stabilize. Double DQN's Q-value plateau is slightly lower and more stable, consistent with its reduced overestimation.

Figures for each variant are included below.

DQN: Baseline convergence within ~18k steps.

Double DQN: Reduced Q overestimation, slightly faster stabilization.

Dueling DQN: Similar on CartPole; stronger on tasks with many similar-valued actions.

N-Step (3): Faster convergence via multi-step targets.

PER: Mixed effect on this simple task; helps more when dynamics are diverse.

N-Step + PER: Best overall speed on this task.

All-In-One: Strong, close to the best combo.


4. Method Analysis

  • N-Step reduces bias in the target estimate, at the expense of slightly higher variance, by accumulating observed rewards over several steps before bootstrapping; this dramatically speeds up learning.
  • PER corrects sampling bias using importance sampling weights and gives priority to informative transitions with substantial TD errors; advantages vary depending on the task.
  • Double DQN reduces overestimation by separating action selection and evaluation; the effect is slight but steady.
  • Dueling DQN decomposes Q into value and advantage; the benefit is minimal on CartPole.
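The 3-step target from the N-Step bullet above can be sketched directly; the reward sequence and bootstrap value below are illustrative, not taken from the experiment logs:

```python
# 3-step return: accumulate discounted rewards from three consecutive
# transitions, then bootstrap from the estimate at the n-th next state.
gamma, n = 0.99, 3
rewards = [1.0, 1.0, 1.0]   # CartPole gives +1 per step
bootstrap_q = 10.0          # stand-in for max_a' Q(s_{t+n}, a')

g = sum(gamma**k * r for k, r in enumerate(rewards)) + gamma**n * bootstrap_q
print(g)
```

Because three real rewards enter the target and the bootstrap term is discounted by gamma^3, the target depends less on the (initially inaccurate) Q estimate, which is why N-Step speeds up early learning.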

5. Experiment Conclusion

  • On CartPole, the combination of N-Step and PER converges the fastest; the benefit of using N-Step alone is also significant.
  • Double DQN is slightly more stable and has lower overestimation.
  • Dueling DQN has minimal benefits on this environment.
  • All methods eventually reach Max Score 500, indicating successful policy learning.