HW1

Problem 1.1.4

Use optimality conditions to show that for all $x > 0$, we have

$$\frac{1}{x} + x \geq 2$$

Solution

First, we define the function $f(x) = \frac{1}{x} + x$.

Its derivative is $f'(x) = -\frac{1}{x^2} + 1$.

To find the minimum of $f(x)$, we set $f'(x) = 0$:

$$1 = \frac{1}{x^2}$$

Since $x > 0$, this gives $x = 1$.

Checking the second derivative:

$$f''(x) = \frac{2}{x^3}$$

Since $x > 0$, we have $f''(x) > 0$, which confirms that $f(1)$ is the minimum of $f(x)$.

So we have:

$$f(x) \geq f(1) = 2$$

Thus, for all $x > 0$:

$$\frac{1}{x} + x \geq 2$$
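As a quick numerical sanity check of this argument (a sketch, not part of the required proof):

```python
# Verify that f(x) = 1/x + x has its minimum value 2 at x = 1 for x > 0.
def f(x):
    return 1.0 / x + x

def fprime(x):
    return -1.0 / x**2 + 1.0

# Stationarity holds at x = 1 and the minimum value is 2 ...
assert abs(fprime(1.0)) < 1e-12
assert abs(f(1.0) - 2.0) < 1e-12

# ... and f stays at or above 2 on a grid of positive points.
grid = [k / 100 for k in range(1, 1000)]
assert all(f(x) >= 2.0 for x in grid)
print(min(f(x) for x in grid))  # → 2.0, attained at x = 1
```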

Problem 1.2.1

Consider the problem of minimizing the function of two variables $f(x, y) = 3x^2 + y^4$.

(a) Steepest Descent Method

Apply one iteration of the steepest descent method with $(1, -2)$ as the starting point and with the stepsize chosen by the Armijo rule with $s = 1$, $\sigma = 0.1$, $\beta = 0.5$.

Solution

First, we compute the gradient of $f(x, y)$:

$$\nabla f(x, y) = \begin{bmatrix} 6x \\ 4y^3 \end{bmatrix}$$

So the direction of steepest descent at $(1, -2)$ is:

$$d = -\nabla f(1, -2) = \begin{bmatrix} -6 \\ 32 \end{bmatrix}$$

Then we find the stepsize by the Armijo rule:

$$f(x + \alpha d) \leq f(x) + \sigma \alpha \nabla f(x)^T d$$

That is:

$$f(1 - 6\alpha, -2 + 32\alpha) \leq f(1, -2) + 0.1 \alpha \begin{bmatrix} 6 \\ -32 \end{bmatrix}^T \begin{bmatrix} -6 \\ 32 \end{bmatrix}$$

Since $f(1, -2) = 19$ and $\nabla f(1, -2)^T d = -36 - 1024 = -1060$, this becomes:

$$f(1 - 6\alpha, -2 + 32\alpha) \leq 19 - 106\alpha$$

We first try $\alpha = 1$:

$$f(-5, 30) = 3 \cdot 25 + 30^4 = 810075$$

The right side is $19 - 106 = -87$, so the Armijo condition clearly fails.

Since $\beta = 0.5$, we next try $\alpha = 0.5$:

$$f(-2, 14) = 3 \cdot 4 + 14^4 = 38428$$

The right side is $19 - 53 = -34$, which again fails.

Next, $\alpha = 0.25$:

$$f(-0.5, 6) = 3 \cdot 0.25 + 6^4 = 1296.75$$

The right side is $19 - 26.5 = -7.5$, which still fails.

Next, $\alpha = 0.125$:

$$f(0.25, 2) = 3 \cdot 0.0625 + 2^4 = 16.1875$$

The right side is $19 - 13.25 = 5.75$, which is close but still fails.

Next, $\alpha = 0.0625$:

$$f(0.625, 0) = 3 \cdot 0.390625 = 1.171875$$

The right side is $19 - 6.625 = 12.375$, so the Armijo condition is satisfied.

Therefore the next iterate is $(0.625, 0)$ with stepsize $\alpha = 0.0625$.
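The backtracking above can be sketched in Python (a minimal reimplementation of the Armijo loop, not a reference solution):

```python
import numpy as np

# One steepest-descent step with Armijo backtracking for f(x, y) = 3x^2 + y^4,
# with s = 1, sigma = 0.1, beta = 0.5 and starting point (1, -2) as in part (a).
def f(v):
    x, y = v
    return 3 * x**2 + y**4

def grad(v):
    x, y = v
    return np.array([6 * x, 4 * y**3])

def armijo_step(x0, s=1.0, sigma=0.1, beta=0.5):
    g = grad(x0)
    d = -g                  # steepest-descent direction
    alpha = s
    # Shrink alpha by beta until the sufficient-decrease condition holds.
    while f(x0 + alpha * d) > f(x0) + sigma * alpha * (g @ d):
        alpha *= beta
    return alpha, x0 + alpha * d

alpha, x1 = armijo_step(np.array([1.0, -2.0]))
print(alpha, x1, f(x1))  # alpha = 0.0625, next point (0.625, 0.0), cost 1.171875
```

Passing `beta=0.1` to the same function reproduces part (b).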

(b) Repeat (a) using $\beta = 0.1$

Solution

With $\beta = 0.1$, after $\alpha = 1$ fails as before, we next try $\alpha = 0.1$:

$$f(0.4, 1.2) = 3 \cdot 0.16 + 1.2^4 = 2.5536$$

The right side is $19 - 10.6 = 8.4$, so the Armijo condition is satisfied.

Therefore the next iterate is $(0.4, 1.2)$ with stepsize $\alpha = 0.1$.

How does the cost of the new iterate compare to that obtained in (a)?:
The cost of the new iterate in (a) is $1.171875$, while the cost of the new iterate in (b) is $2.5536$, which is larger.

Comment on the tradeoffs involved in the choice of $\beta$:
A larger $\beta$ shrinks the stepsize more gradually, so each iteration may require more function evaluations during backtracking, but the accepted stepsize stays close to the largest one satisfying the Armijo condition, typically yielding more progress per iteration (as in (a)).
A smaller $\beta$ shrinks the stepsize aggressively, so the backtracking terminates after fewer trials, but the accepted stepsize is a cruder approximation of the best one, and the iteration may make less progress (as in (b)).

(c) Newton's Method

Apply one iteration of Newton's method with the same starting point and stepsize rule as in (a). How does the cost of the new iterate compare to that obtained in (a)? How about the amount of work involved in finding the new iterate?

Solution

We already know the gradient at $(1, -2)$:

$$\nabla f(1, -2) = \begin{bmatrix} 6 \\ -32 \end{bmatrix}$$

Next we compute the Hessian matrix of $f(x, y)$:

$$H = \begin{bmatrix} 6 & 0 \\ 0 & 12y^2 \end{bmatrix}$$

At $(1, -2)$ the Hessian is:

$$H(1, -2) = \begin{bmatrix} 6 & 0 \\ 0 & 48 \end{bmatrix}$$

So the Newton direction is:

$$d = -H(1, -2)^{-1} \nabla f(1, -2) = -\begin{bmatrix} 1/6 & 0 \\ 0 & 1/48 \end{bmatrix} \begin{bmatrix} 6 \\ -32 \end{bmatrix} = \begin{bmatrix} -1 \\ 2/3 \end{bmatrix}$$

With $\alpha = 1$, the trial point is $(1, -2) + (-1, 2/3) = (0, -4/3)$. Since $f(0, -4/3) = (4/3)^4 \approx 3.16$ and the Armijo right side is $19 + 0.1 \cdot \nabla f(1, -2)^T d = 19 - 82/30 \approx 16.27$, the rule accepts $\alpha = 1$, so the next point is $(0, -4/3)$.
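The Newton step can be reproduced numerically (a short sketch; the helper names `grad` and `hess` are just the derivatives computed above):

```python
import numpy as np

# Newton direction for f(x, y) = 3x^2 + y^4 at the starting point (1, -2).
def grad(v):
    x, y = v
    return np.array([6 * x, 4 * y**3])

def hess(v):
    x, y = v
    return np.array([[6.0, 0.0], [0.0, 12 * y**2]])

x0 = np.array([1.0, -2.0])
d = np.linalg.solve(hess(x0), -grad(x0))   # Newton direction: -H^{-1} grad f
x1 = x0 + d                                # alpha = 1 passes the Armijo test here
print(d, x1)  # d = (-1, 2/3), x1 = (0, -4/3)
```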

How does the cost of the new iterate compare to that obtained in (a)?:
The cost of the new iterate here is $f(0, -4/3) = (4/3)^4 \approx 3.16$, which is larger than the cost $1.171875$ obtained in (a).

How about the amount of work involved in finding the new iterate?:
Newton's method requires computing the Hessian matrix and solving a linear system, whereas steepest descent only needs the gradient.
Therefore each iteration of Newton's method involves considerably more work than an iteration of steepest descent.

Problem Ⅲ

1.b.1

Solution

L-BFGS-B, which stands for Limited-memory Broyden–Fletcher–Goldfarb–Shanno with Bound constraints, is a quasi-Newton optimization method.

  • First, the basic method L-BFGS: it speeds up gradient-based optimization by storing the last $m$ gradient/step pairs and using them to build a low-cost approximation of the (inverse) Hessian matrix.
  • Then the L-BFGS-B method: it adds bound constraints on the variable $x$ and projects the iterates at each step so that the solution stays within the constraints.
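A minimal usage sketch (assuming SciPy's `scipy.optimize.minimize` is available), applying L-BFGS-B to the objective from Problem 1.2.1; the box bounds here are illustrative, not from the assignment:

```python
import numpy as np
from scipy.optimize import minimize

# Minimize f(x, y) = 3x^2 + y^4 subject to simple bounds, starting from (1, -2).
def f(v):
    return 3 * v[0]**2 + v[1]**4

def grad(v):
    return np.array([6 * v[0], 4 * v[1]**3])

res = minimize(f, x0=[1.0, -2.0], jac=grad, method="L-BFGS-B",
               bounds=[(-10, 10), (-10, 10)])
print(res.x, res.fun)  # converges toward the unconstrained minimizer (0, 0)
```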

1.b.2

Solution

| Problem | Solution Time | Objective Value | Solution Quality |
|---|---|---|---|
| dqrtic.mod | 15 sec (65 iterations) | 1.998169937e-14 | Converged |
| eigenbls.mod | 11 sec (1001 iterations) | 3.529070739e-08 | Max Iteration |
| freuroth.mod | 16 sec (21 iterations) | 608159.189 | Failed |

2.a

Solution

param n := 10;
param A := 10;
param pi := 4*atan(1);  # AMPL has no built-in Pi constant

var x{i in 1..n} default 1.0;

minimize Rastrigin:
A*n + sum{i in 1..n} (x[i]^2 - A*cos(2*pi*x[i]));

solve;

display x;
display Rastrigin;
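A Python counterpart of the AMPL model above can cross-check its behavior (a sketch assuming SciPy; the all-ones start mirrors the model's `default 1.0`):

```python
import numpy as np
from scipy.optimize import minimize

# Rastrigin function with n = 10, A = 10, minimized by L-BFGS-B from all-ones.
n, A = 10, 10

def rastrigin(x):
    return A * n + np.sum(x**2 - A * np.cos(2 * np.pi * x))

res = minimize(rastrigin, x0=np.ones(n), method="L-BFGS-B")
print(res.fun)  # gets stuck at a local minimum, well above the global minimum 0
```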

2.b

Solution

| n | Solution Time | Objective Value | Solution Quality |
|---|---|---|---|
| 10 | 5 sec (4 iterations) | 9.949590571 | Converged |
| 20 | 12 sec (4 iterations) | 19.89918114 | Converged |
| 50 | 8 sec (4 iterations) | 49.74795285 | Converged |
| 100 | 7 sec (4 iterations) | 99.49590571 | Converged |
| 1000 | 8 sec (4 iterations) | 994.9590571 | Converged |
| 10000 | 2 sec (2 iterations) | 949.590571 | Failed |

It seems that for low dimensions (like $n \leq 100$), L-BFGS-B converges quickly but easily gets stuck in a local minimum instead of the global minimum: the converged objective values above are all far from the global minimum of 0.
For high dimensions (such as $n = 1000$ or $n = 10000$), it either still gets stuck in local minima or is interrupted by memory/time constraints.
So to reach the global optimum, relying on a single run of a gradient-based method may not be enough.
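One common remedy is multistart: rerun the local method from random starting points and keep the best local minimum found. A minimal sketch (assuming SciPy; the 30 restarts and the sampling range $(-1, 1)$ are arbitrary illustrative choices):

```python
import numpy as np
from scipy.optimize import minimize

# Multistart L-BFGS-B on the Rastrigin function with n = 10, A = 10.
n, A = 10, 10

def rastrigin(x):
    return A * n + np.sum(x**2 - A * np.cos(2 * np.pi * x))

rng = np.random.default_rng(0)
values = []
for _ in range(30):
    x0 = rng.uniform(-1.0, 1.0, n)          # random restart point
    values.append(minimize(rastrigin, x0, method="L-BFGS-B").fun)

best = min(values)
print(best)  # usually much better than a single run, though still not guaranteed 0
```

Multistart trades extra function evaluations for a better chance of escaping local minima; heavier global strategies (e.g. simulated annealing or basin hopping) push the same tradeoff further.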