Computational and Variational Methods for Inverse Problems
CSE 397/GEO 391/ME 397/ORI 397
Assignment 2: Optimization methods and automatic differentiation
In this assignment we will solve the catenary problem to explore the convergence behavior of several
optimization methods we studied in class. We will also introduce a tool for automatic differentiation.
The catenary problem is a classical problem in variational calculus that seeks to find the shape of a
hanging chain that is fixed at both ends:
[Figure: a chain hanging between two fixed endpoints]
We provide a Jupyter notebook called catenary.ipynb on Canvas that will walk you through problems
2, 3, and 4 using Python/NumPy/SciPy/autograd. We recommend that you complete problems 2, 3, and
4 by filling out this notebook. You may solve this assignment using any language/framework of your
choosing, but using another language may make the assignment more difficult.1
Problem 1: Derivation of the optimization problem
Let u : [0, 1] → R denote the height (positive upward) of an idealized chain as a function of horizontal
position, x. Here the left end of the chain is fixed at x = 0 and height u(0) = 0, and the right end of
the chain is fixed at x = 1 and height u(1) = 0.
1You will have to track down optimization solvers and automatic differentiation tools in that language.
To determine the shape of the hanging chain, we find the u(x) that minimizes the potential energy
E(u) of the chain, subject to the constraint that the chain is fixed at both ends and has a given length
L0. This leads to the following constrained optimization problem:
    min_u  E(u)
    such that  L(u) = L0
    and  u(0) = u(1) = 0,
where L(u) is the length of the chain for a given shape u(x).
The solution to this optimization problem may be approximated by solving the following unconstrained
optimization problem,2 in which the length constraint is enforced approximately using a quadratic
penalty method:

    min_u  E(u) + α (L(u) − L0)^2        (1)
    such that  u(0) = u(1) = 0.
Here α > 0 is a penalty parameter. As α → ∞, the solution found using the penalty method approaches
the solution to the original constrained optimization problem. However, the penalty term makes the
unconstrained optimization problem (1) increasingly ill-conditioned, and thus difficult to solve, as α →
∞, so we typically use a value of α that is large enough, but not too large.
a) Show that the length of the chain, L(u), is given by

       L(u) = ∫_0^1 sqrt(1 + (u')^2) dx,

   where u' := du/dx.
b) Show that the potential energy of the chain, E(u), is given by

       E(u) = ∫_0^1 ρ g u sqrt(1 + (u')^2) dx,

   where ρ is the linear density (mass per unit length) of the chain, and g is the gravitational
   acceleration constant.

For convenience, let ρg = 1 for all subsequent problems, so that E(u) = ∫_0^1 u sqrt(1 + (u')^2) dx.
Problem 2: Discretization
We discretize the optimization problem (1) by approximating u(x) as a continuous piecewise linear
function on [0, 1], which is defined by its values at N + 2 equally spaced nodes at locations
0, h, 2h, 3h, . . . , Nh, 1.
The spacing between nodes is given by h = 1/(N+1). Let u ∈ R^(N+2) be the vector of nodal values at the
nodes. Note that here we are including the values at the endpoints in the vector u.
2The endpoint constraints u(0) = u(1) = 0 will be directly built into our representation of u(x), so it’s ok to call this
problem unconstrained.
a) Since u is continuous piecewise linear, the derivative u' is piecewise constant. This is illustrated
   in the image below. Write a function that takes u ∈ R^(N+2) as input, and returns the vector of
   derivative values, u' ∈ R^(N+1), where the entries of u' are the values of u' in the intervals (cells)
   between consecutive nodes.
b) Write a function that takes u ∈ R^(N+2) as input, and returns the arc length, L(u) ∈ R, of the
   corresponding continuous piecewise linear function u as output. Hint: the integrand is a piecewise
   constant function.
c) Write a function that takes u ∈ R^(N+2) as input, and returns the energy, E(u) ∈ R, as output.
   Hint: the integrand is a discontinuous piecewise linear function.
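The three functions above can be sketched in a few lines of NumPy. The function names and signatures below are our own choices (the notebook may use different ones); on each cell the sqrt term is constant and u is linear, so the midpoint rule integrates the energy exactly.

```python
import numpy as np

def derivative(u, h):
    # Cell-wise derivative of the piecewise linear function: u' is
    # constant on each of the N+1 cells between consecutive nodes.
    return np.diff(u) / h

def arc_length(u, h):
    # The integrand sqrt(1 + (u')^2) is piecewise constant, so the
    # integral is an exact sum of cell contributions.
    up = derivative(u, h)
    return np.sum(h * np.sqrt(1.0 + up**2))

def energy(u, h):
    # With rho*g = 1 the integrand u * sqrt(1 + (u')^2) is linear on
    # each cell, so the cell-midpoint value of u integrates it exactly.
    up = derivative(u, h)
    u_mid = 0.5 * (u[:-1] + u[1:])
    return np.sum(h * u_mid * np.sqrt(1.0 + up**2))
```

For example, with nodes 0, 0.5, 1 and u = (0, 0.5, 0), the chain is two segments of length 0.5*sqrt(2) each, so arc_length returns sqrt(2).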
Problem 3: Objective, gradient, and Hessian
In this problem we implement the objective function, gradient, and Hessian for the unconstrained
optimization problem (1), using the automatic differentiation library autograd3 to compute derivatives.
We will use finite difference checks to verify that the derivatives are computed correctly.
Let J : R^N → R,

    J(u_int) := E(u) + α (L(u) − L0)^2,

denote the discretized objective function for the unconstrained optimization problem (1), where u_int ∈
R^N is the vector consisting of the interior nodal values of u (recall u ∈ R^(N+2) includes the values for
the endpoints in the first and last entries). In the Jupyter notebook, we provide a function that takes
u_int ∈ R^N as input, and returns J(u_int) as output. If you are using another programming language,
you will need to write this function yourself using the functions you already wrote for E(u) and L(u).
a) Use automatic differentiation to construct a function that takes u_int ∈ R^N as input, and
   computes the gradient,

       g(u_int) := ∂J/∂u_int (u_int) ∈ R^N,

   as output.
3Automatic differentiation (AD) is a technique to differentiate the output of code with respect to its input; it makes
use of knowledge of how to differentiate elementary functions, and chains together derivatives of each line of code
(i.e., derivatives of output with respect to input) to form the derivative of the overall code.
Autograd is a popular library for AD of Python codes.
b) Check that the gradient is computed correctly by using the gradient to compute the directional
   derivative of J in a random direction h_int ∈ R^N, and comparing this to the finite difference
   approximation,

       g(u_int)^T h_int ≈ (1/s) (J(u_int + s h_int) − J(u_int)),

   where s > 0 is a small step size. Make a log-log plot of the discrepancy between the directional
   derivative and the finite difference approximation as a function of step size s ranging from
   s = 10^-13 to s = 10^0.
c) Use automatic differentiation to construct a function that takes u_int ∈ R^N and p_int ∈ R^N as
   input, and computes the Hessian matrix-vector product

       H(u_int) p_int ∈ R^N

   as output. Here H(u_int) = d^2 J / d u_int^2 (u_int) ∈ R^(N×N) is the Hessian at u_int. Compute the Hessian
   matrix-vector product matrix-free. Do not form the Hessian matrix and then multiply it with p_int.
   Hint: the Hessian is the derivative of the gradient, so

       H(u_int) p_int = d/du_int ( dJ/du_int (u_int) ) p_int = d/du_int ( g(u_int)^T p_int ).
In other words, H(u_int) p_int is the gradient of the scalar-valued function

    q(u_int) := g(u_int)^T p_int.
d) The Hessian-vector product may be approximated by finite differencing the gradient in the direction
   of the vector:

       H(u_int) p_int = d/du_int q(u_int) ≈ (1/s) (g(u_int + s p_int) − g(u_int)).

   Construct a log-log plot of the discrepancy between the action of the Hessian on a random vector
   p_int and the corresponding finite difference, as a function of step size s ranging from s = 10^-13
   to s = 10^0.
Problem 4: Comparison of optimization methods
In this problem, we solve the unconstrained optimization problem (1) with BFGS, the method of steepest
descent, and the Newton-conjugate-gradient method. We estimate the convergence rate for all methods.
In the Jupyter notebook, we provide code to do these tasks for BFGS, as an example. You will repeat
these tasks for steepest descent and Newton-CG.
For this problem, use N = 32, L0 = 3.0, and α = 10^2. A reasonable initial guess is the parabola

    u^(0)(x) = −C x(1 − x)

with constant

    C = 2 (L0 − 1).

We would like to choose the constant C to make L(u^(0)) equal to L0. But there's no need to exactly
evaluate the integral in L(u), since this is just an initial guess for the optimizer. Instead, we approximate
L(u^(0)) using the sum of the lengths of the left, right, and bottom edges of the box that bounds the
parabola. Since the parabola has depth C/4, this gives L(u^(0)) ≈ 2(C/4) + 1; setting this equal to L0
gives the expression for C above.
a) Implement the method of steepest descent and use it to find the minimizer of J(u_int). You may
   use any step length choice scheme that you want, but we recommend using the Wolfe linesearch
   conditions, as implemented in the scipy.optimize.line_search function. Terminate the iteration after
   the norm of the gradient has decreased by a factor of 10^6 from the initial norm of the gradient
   (i.e., ||g(u^(k)_int)|| ≤ 10^-6 ||g(u^(0)_int)||). Save the values of J(u^(k)_int) for all iterations k so that we can
   study the convergence of the method.
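The iteration in a) can be sketched as below. To keep the sketch self-contained we use a simple backtracking (Armijo) line search instead of the recommended scipy.optimize.line_search; the function names are our own.

```python
import numpy as np

def steepest_descent(J, g, u0, tol_factor=1e-6, max_iter=10000):
    # Steepest descent with a backtracking (Armijo) line search.
    # Terminates when the gradient norm has dropped by tol_factor
    # relative to the initial gradient norm.
    u = u0.copy()
    g0_norm = np.linalg.norm(g(u))
    J_history = [J(u)]                 # save J at every iterate
    for k in range(max_iter):
        grad = g(u)
        if np.linalg.norm(grad) <= tol_factor * g0_norm:
            break
        d = -grad                      # steepest descent direction
        s, Ju = 1.0, J(u)
        # Backtrack until the Armijo sufficient decrease condition holds.
        while J(u + s * d) > Ju - 1e-4 * s * (grad @ grad):
            s *= 0.5
        u = u + s * d
        J_history.append(J(u))
    return u, J_history
```

For the catenary problem, pass in the notebook's objective and your autograd gradient; J_history is what parts b) and d) analyze.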
b) Asymptotically, the error should decrease as

       J(u^(k+1)_int) − J(u*_int) ≤ c (J(u^(k)_int) − J(u*_int))^q

   for some constant c and convergence rate q, where u*_int is the solution to the optimization problem
   (1). Estimate the convergence rate, q, for steepest descent by plotting J(u^(k+1)_int) − J(u*_int) vs.
   J(u^(k)_int) − J(u*_int) on a log-log plot and finding the slope. You may use the final value of J as
   a proxy for the solution J(u*_int). Is the observed convergence rate consistent with the theoretical
   convergence rate for steepest descent?
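Extracting the slope can be done with a least-squares fit in log space. The sketch below (with a helper name of our own choosing) demos the idea on a synthetic quadratically convergent error sequence, where the fitted slope should come out as q = 2.

```python
import numpy as np

def estimate_rate(errors):
    # Fit the slope of log e_{k+1} vs. log e_k: since
    # e_{k+1} ~ c e_k^q, the slope of this line is the rate q.
    e = np.asarray(errors, dtype=float)
    e = e[e > 0]                       # drop zero error at the proxy J*
    log_e = np.log(e)
    q, _ = np.polyfit(log_e[:-1], log_e[1:], 1)
    return q

# Synthetic quadratically convergent errors: e_{k+1} = e_k^2.
errors = [0.5]
for _ in range(5):
    errors.append(errors[-1] ** 2)
```

For the assignment, build `errors` as J(u^(k)_int) minus the final value of J from your saved iteration history.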
c) Minimize J using the Newton-CG method. You may use the Newton-CG implementation in
   scipy.optimize.minimize, or any other existing implementation. (For your own sanity, please don't
   write Newton-CG from scratch!) Use a stopping tolerance of 10^-6. Save the values of J(u^(k)_int)
   for all Newton (outer) iterations k so that we can study the convergence of the method.
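The scipy call can be wired up as in the sketch below, again on a toy quadratic so it is self-contained. For the catenary problem, pass your autograd gradient as jac and your matrix-free Hessian-vector product from Problem 3c as hessp; the callback records J at each outer Newton iteration.

```python
import numpy as np
from scipy.optimize import minimize

# Toy strongly convex quadratic: J(u) = 0.5 u^T A u - b^T u.
rng = np.random.default_rng(3)
N = 32
A = rng.standard_normal((N, N)); A = A.T @ A + N * np.eye(N)   # SPD
b = rng.standard_normal(N)

J = lambda u: 0.5 * u @ A @ u - b @ u
g = lambda u: A @ u - b
hessp = lambda u, p: A @ p            # matrix-free Hessian action

J_history = []                        # J at each outer Newton iteration
res = minimize(J, np.zeros(N), method='Newton-CG', jac=g, hessp=hessp,
               callback=lambda uk: J_history.append(J(uk)),
               options={'xtol': 1e-6})
```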
d) Estimate the convergence rate, q, for Newton-CG by plotting J(u^(k+1)_int) − J(u*_int) vs.
   J(u^(k)_int) − J(u*_int) on a log-log plot and finding the slope. You may use the final value of J as a proxy
   for the solution J(u*_int). What does your observed convergence rate tell you about the choice
   of the forcing term ηk in scipy's implementation of Newton-CG? Comment on the efficiency of
   Newton-CG relative to steepest descent and BFGS.
Problem 5: Mesh independence (Extra credit)
Mesh independence (sometimes called dimension independence) is a property often enjoyed by Newton
methods for infinite dimensional problems, but not by other methods. This concept refers to the
relatively constant number of iterations taken by Newton’s method as the mesh is refined.
a) Confirm that steepest descent does not exhibit mesh independence by solving the optimization
   problem (1) for a sequence of meshes with N + 1 = 8, 16, 32, 64, 128.
b) Use Newton-CG to solve (1) for a sequence of meshes with N + 1 = 8, 16, 32, 64, 128, 256. Can you
   give an explanation of what might be preventing Newton-CG from delivering mesh independence?
   Hint: plot some of the early iterates u^(k)_int and see if you can spot abnormalities. Can you suggest
   a fix so that we can observe mesh independence? [We have not discussed reasons for lack of mesh
   independence in class, so this will take some creative thinking.]