HW1
Problem 1.1.4
Use optimality conditions to show that for all , we have
Solution
First, we define the function
So we have
To find the minimum of , we set the first derivative to zero:
As we know that , we have .
Checking the second derivative, we have:
Since , we have , which verifies that is the minimum of .
So we have:
Thus, for all , we have:
Problem 1.2.1
Consider the problem of minimizing the function of two variables .
(a) Steepest Descent Method
Apply one iteration of the steepest descent method with (1, -2) as the starting point and with the stepsize chosen by the Armijo rule with .
Solution
First, we have the gradient of :
So we have the direction of steepest descent:
Then we can find the stepsize by the Armijo rule:
That is:
After calculation, we have:
So we first try , and we have:
The right side is , which clearly does not satisfy the Armijo rule.
Since , we try next, and we have:
The right side is , which also does not satisfy it.
So we try next, and we have:
The right side is , which still does not satisfy it.
So we try next, and we have:
The right side is , which is close but still does not satisfy it.
So we try next, and we have:
The right side is , which satisfies the Armijo rule.
So the next point is , obtained with stepsize .
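The backtracking procedure carried out above can be sketched generically. In the code below, the quadratic `f` and starting point are illustrative stand-ins (they are NOT the cost function of this exercise, whose formula did not survive in the text), and `s`, `sigma`, `beta` are the usual Armijo constants (initial stepsize, sufficient-decrease parameter, reduction factor):

```python
# Generic Armijo (backtracking) line search along the steepest-descent
# direction d = -grad f(x). The test function below is illustrative only,
# not the cost function of this exercise.

def armijo_step(f, grad, x, s=1.0, sigma=0.1, beta=0.5, max_trials=50):
    """Return (alpha, m): the first alpha = s * beta^m satisfying
    f(x + alpha*d) <= f(x) + sigma * alpha * <grad f(x), d>."""
    g = grad(x)
    d = [-gi for gi in g]                          # steepest-descent direction
    slope = sum(gi * di for gi, di in zip(g, d))   # <grad f(x), d>, negative
    fx = f(x)
    alpha = s
    for m in range(max_trials):
        x_new = [xi + alpha * di for xi, di in zip(x, d)]
        if f(x_new) <= fx + sigma * alpha * slope:  # Armijo condition holds
            return alpha, m
        alpha *= beta                               # reduce the stepsize
    return alpha, max_trials

# Illustrative quadratic: f(x, y) = x^2 + 4*y^2, started at (1, -2).
f = lambda x: x[0]**2 + 4 * x[1]**2
grad = lambda x: [2 * x[0], 8 * x[1]]
alpha, trials = armijo_step(f, grad, [1.0, -2.0])   # alpha = 0.125 here
```

With a smaller `beta` the loop terminates after fewer trials but with a cruder stepsize, which is exactly the tradeoff examined in part (b).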
(b) Repeat (a) using
Solution
Since , after the first trial we choose , and we have:
The right side is , which satisfies the Armijo rule.
So the next point is , obtained with stepsize .
How does the cost of the new iterate compare to that obtained in (a)?:
The cost of the new iterate in (a) is , while the cost of the new iterate in (b) is , which is larger.
Comment on the tradeoffs involved in the choice of $\beta$:
A faster stepsize reduction (smaller $\beta$) requires fewer trial function evaluations per iteration, but the accepted stepsize is a cruder approximation of the best one, which can reduce the progress made per iteration.
A slower reduction (larger $\beta$) locates a better stepsize and tends to make more progress per iteration, at the price of more function evaluations.
(c) Newton's Method
Apply one iteration of Newton's method with the same starting point and stepsize rule as in (a). How does the cost of the new iterate compare to that obtained in (a)? How about the amount of work involved in finding the new iterate?
Solution
We already know the gradient of :
So next we need to calculate the Hessian matrix of :
And the Hessian matrix of is:
So we have the direction of Newton's method:
So the next point is .
How does the cost of the new iterate compare to that obtained in (a)?:
The cost of the new iterate here is , which is larger than the cost of the new iterate in (a) .
How about the amount of work involved in finding the new iterate?:
Newton's method requires computing the Hessian matrix and solving a linear system at each iteration.
In contrast, steepest descent only needs to calculate the gradient, so its per-iteration workload is much lower.
Therefore each iteration of Newton's method involves considerably more work than an iteration of steepest descent, although it may require fewer iterations overall.
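The "solve a linear system" step above can be made concrete for the two-variable case. The cost used below, f(x, y) = x² + y⁴, is only an illustrative stand-in (not necessarily the cost function of this exercise); the sketch solves the 2×2 Newton system H d = -g by Cramer's rule and takes a unit stepsize:

```python
# One Newton iteration for a function of two variables:
# solve H(p) d = -grad f(p), then move to p + d (unit stepsize).
# The cost f(x, y) = x^2 + y^4 below is illustrative only.

def newton_step_2d(grad, hess, p):
    """Solve the 2x2 system H d = -g by Cramer's rule; return p + d."""
    g = grad(p)
    (a, b), (c, d) = hess(p)                 # H = [[a, b], [c, d]]
    det = a * d - b * c                      # assumed nonzero here
    dx = ((-g[0]) * d - b * (-g[1])) / det   # first Newton-direction component
    dy = (a * (-g[1]) - (-g[0]) * c) / det   # second component
    return [p[0] + dx, p[1] + dy]

grad = lambda p: [2 * p[0], 4 * p[1]**3]                 # grad of x^2 + y^4
hess = lambda p: [[2.0, 0.0], [0.0, 12 * p[1]**2]]       # Hessian
p_new = newton_step_2d(grad, hess, [1.0, -2.0])          # -> [0.0, -4/3]
```

Note that even this tiny example needs the four Hessian entries and a linear solve, while a steepest-descent step would only need the two gradient entries.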
Problem III
1.b.1
Solution
L-BFGS-B, which stands for Limited-memory Broyden–Fletcher–Goldfarb–Shanno with Bound constraints,
is a quasi-Newton optimization method.
- The basic L-BFGS method: it improves on gradient descent by memorizing the most recent $m$ steps (pairs of iterate and gradient differences) and using them to build a low-rank approximation of the inverse Hessian matrix.
- The L-BFGS-B method: it adds bound constraints on the variables, projecting the iterate at each step so that the solution stays within the bounds.
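The "memorize the last m steps" idea is usually implemented as the two-loop recursion, which multiplies a vector by the approximate inverse Hessian built from the stored pairs. Below is a minimal pure-Python sketch of that recursion; variable names are illustrative, and the bound-projection step that the "-B" variant adds is omitted:

```python
# Sketch of the L-BFGS two-loop recursion: computes r ~= H_k^{-1} * q
# from the m most recent pairs s_i = x_{i+1} - x_i (steps) and
# y_i = g_{i+1} - g_i (gradient differences). Plain lists for clarity.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def two_loop(q, s_list, y_list):
    """Approximate inverse-Hessian-vector product from stored pairs."""
    q = list(q)
    rhos = [1.0 / dot(y, s) for s, y in zip(s_list, y_list)]
    alphas = []
    # First loop: newest pair to oldest.
    for s, y, rho in reversed(list(zip(s_list, y_list, rhos))):
        a = rho * dot(s, q)
        alphas.append(a)
        q = [qi - a * yi for qi, yi in zip(q, y)]
    # Initial scaling H_0 = gamma * I using the newest pair.
    s_last, y_last = s_list[-1], y_list[-1]
    gamma = dot(s_last, y_last) / dot(y_last, y_last)
    r = [gamma * qi for qi in q]
    # Second loop: oldest pair to newest.
    for (s, y, rho), a in zip(zip(s_list, y_list, rhos), reversed(alphas)):
        b = rho * dot(y, r)
        r = [ri + (a - b) * si for ri, si in zip(r, s)]
    return r
```

A quick sanity property: with a single stored pair (s, y), applying the recursion to q = y returns exactly s, i.e. the approximation satisfies the secant condition H⁻¹y = s.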
1.b.2
Solution
| Problem | Solution Time | Objective Value | Solution Quality |
|---|---|---|---|
| dqrtic.mod | 15 sec (65 iterations) | 1.998169937e-14 | Converged |
| eigenbls.mod | 11 sec (1001 iterations) | 3.529070739e-08 | Max Iteration |
| freuroth.mod | 16 sec (21 iterations) | 608159.189 | Failed |
2.a
Solution
param n := 10;
param A := 10;
param pi := 4 * atan(1);
var x{i in 1..n} default 1.0;
minimize Rastrigin:
    A*n + sum{i in 1..n} (x[i]^2 - A*cos(2*pi*x[i]));
solve;
display x;
display Rastrigin;
2.b
Solution
| n | Solution Time | Objective Value | Solution Quality |
|---|---|---|---|
| 10 | 5 sec (4 iterations) | 9.949590571 | Converged |
| 20 | 12 sec (4 iterations) | 19.89918114 | Converged |
| 50 | 8 sec (4 iterations) | 49.74795285 | Converged |
| 100 | 7 sec (4 iterations) | 99.49590571 | Converged |
| 1000 | 8 sec (4 iterations) | 994.9590571 | Converged |
| 10000 | 2 sec (2 iterations) | 949.590571 | Failed |
It seems that for low dimensions (n ≤ 100), L-BFGS-B converges quickly but gets stuck in a local minimum instead of the global minimum.
For high dimensions (such as n = 1000 and n = 10000), it either still gets stuck in local minima or is cut short by memory/time constraints.
So to find the global optimum of such a multimodal function, relying on a single gradient-based method alone may not be enough.
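The converged objective values above (roughly 0.995 per coordinate, e.g. 9.9496 for n = 10) are consistent with every coordinate settling into the local minimum of the 1-D Rastrigin term nearest the starting value x = 1, rather than the global minimum at 0. A small sanity check, using the standard Rastrigin definition with A = 10 (the bracket [0.5, 1.5] is simply an interval around the start that contains one local minimizer):

```python
import math

# Standard Rastrigin function with A = 10, as in the model above.
def rastrigin(x, A=10.0):
    return A * len(x) + sum(xi * xi - A * math.cos(2 * math.pi * xi)
                            for xi in x)

# Crude 1-D ternary search on [0.5, 1.5] for the local minimizer of the
# single-coordinate term nearest the starting point x = 1.
lo, hi = 0.5, 1.5
for _ in range(200):
    m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
    if rastrigin([m1]) < rastrigin([m2]):
        hi = m2
    else:
        lo = m1
x_star = (lo + hi) / 2           # local minimizer near 0.995
per_coord = rastrigin([x_star])  # local value per coordinate, ~0.99496
# With all n coordinates at this local minimizer, the objective is
# ~0.99496 * n, matching the "Converged" rows of the table.
```

This supports the conclusion above: the solver is terminating successfully, but at the nearest local minimum, so a global strategy would be needed to reach the true optimum at the origin.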