CS 131 Computer Vision: Foundations and Applications
Problem Set 2
1 K-Means Clustering
Suppose we want to organize $m$ points $x^{(1)}, \dots, x^{(m)} \in \mathbb{R}^n$ into $k$ clusters. This is equivalent to assigning a label $\ell_1, \dots, \ell_m$ to each point, where $\ell_i \in \{1, \dots, k\}$. As discussed in class, the K-Means clustering algorithm solves this problem by minimizing the quantity
$$\sum_{i=1}^{m} d\left(x^{(i)}, \mu_{\ell_i}\right), \qquad (1)$$
where $\mu_c$ is the centroid of all points $x^{(i)}$ with label $\ell_i = c$ and $d(x, y)$ is the distance between points $x$ and $y$. Usually $d$ is the Euclidean distance, given by
$$d(x, y) = \left( \sum_{i=1}^{n} (x_i - y_i)^2 \right)^{1/2}.$$
In this problem we will explore the possibility of using other distance functions.
(a) One reason for choosing different distance functions is to control the shape of the clusters that are discovered by the K-Means algorithm. K-Means usually tends to find clusters that are roughly spherical in shape. A sphere with center $c \in \mathbb{R}^n$ and radius $r > 0$ is the set of all points in $\mathbb{R}^n$ whose distance to $c$ is exactly $r$. For example, when $n = 2$, spheres generated using the usual Euclidean distance function are just circles.
Figure 1: Using Euclidean distance in $\mathbb{R}^2$, spheres are just circles. (The figure shows a circle centered at the origin, with both axes running from $-1.0$ to $1.0$.)
For each of the following distance functions on $\mathbb{R}^n$, sketch and describe the spheres that the distance function defines when $n = 2$:
(i) The Manhattan distance, also called the $L_1$ distance:
$$d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$$
(ii) The Chebyshev distance, also called the $L_\infty$ distance:
$$d(x, y) = \max_{i=1,\dots,n} |x_i - y_i|$$
(iii) The Cosine distance:
$$d(x, y) = 1 - \frac{x \cdot y}{|x|\,|y|},$$
where $x \cdot y$ is the dot product given by $x \cdot y = \sum_{i=1}^{n} x_i y_i$ and $|x|$ is the norm of $x$, given by $|x| = \left( \sum_{i=1}^{n} x_i^2 \right)^{1/2}$.
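For concreteness, here is a minimal MATLAB sketch of these three distance functions, assuming x and y are real vectors of the same size (the function names are illustrative and are not part of any provided code):

    function d = ManhattanDistance(x, y)
        % L1 distance: sum of absolute coordinate differences
        d = sum(abs(x - y));
    end

    function d = ChebyshevDistance(x, y)
        % L-infinity distance: largest absolute coordinate difference
        d = max(abs(x - y));
    end

    function d = CosineDistance(x, y)
        % Cosine distance: 1 minus the cosine of the angle between x and y
        d = 1 - dot(x, y) / (norm(x) * norm(y));
    end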
(b) Another reason for choosing different distance functions in the K-Means algorithm is to control the invariances of the clusters. However, before worrying about invariances of clusters we must think about invariances of distance functions. A distance function $d$ on $\mathbb{R}^n$ is said to be invariant under a transformation $\phi : \mathbb{R}^n \to \mathbb{R}^n$ if
$$d(x, y) = d(\phi(x), \phi(y))$$
for all $x, y \in \mathbb{R}^n$.
(i) Consider the case where $\phi$ is a scaling transformation given by $\phi(x) = sx$ for some fixed $s \in \mathbb{R}$ with $s > 0$. Among the Euclidean distance, the Manhattan distance, and the Cosine distance, which are invariant under scaling? For each distance function either prove that it is scale invariant or provide a counterexample to show that it is not.
(ii) Consider the case where $\phi$ is a translation given by $\phi(x) = x + t$ for some fixed $t \in \mathbb{R}^n$. Which of the following are invariant to translation: Euclidean distance, Manhattan distance, Cosine distance? For each distance function either prove that it is invariant under translation or provide a counterexample to show that it is not.
(iii) Let $\phi$ be a rotation given by $\phi(x) = Rx$, where $R$ is a rotation matrix. Which of the following are invariant to rotation: Euclidean distance, Manhattan distance, Cosine distance? For each distance function either prove that it is invariant under rotation or provide a counterexample to show that it is not.
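A quick way to build intuition for parts (i)-(iii) is to evaluate the invariance definition from part (b) numerically before attempting a proof. A minimal MATLAB sketch, assuming column vectors; the test points, scale factor, translation, and rotation below are arbitrary choices:

    x = [1; 2];  y = [3; -1];          % arbitrary test points in R^2
    s = 2;                             % scaling factor
    t = [5; 5];                        % translation vector
    theta = pi/4;
    R = [cos(theta) -sin(theta); sin(theta) cos(theta)];   % rotation matrix

    d = @(x, y) norm(x - y);           % Euclidean distance; substitute other distances here
    fprintf('scaling:     %g vs %g\n', d(x, y), d(s*x, s*y));
    fprintf('translation: %g vs %g\n', d(x, y), d(x + t, y + t));
    fprintf('rotation:    %g vs %g\n', d(x, y), d(R*x, R*y));

Keep in mind that a numerical check can at most suggest a counterexample; establishing invariance itself still requires a proof.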
(c) The K-Means algorithm with distance function $d$ is said to be invariant to a transformation $\phi : \mathbb{R}^n \to \mathbb{R}^n$ if for all sets of points $x^{(1)}, \dots, x^{(m)} \in \mathbb{R}^n$, the cluster assignments that minimize Equation 1 also minimize Equation 1 when each $x^{(i)}$ is replaced by $\phi(x^{(i)})$.
In order to use the K-Means algorithm, we need to know how to compute distances between points and we need to know how to compute the centers of clusters. The distance function $d$ tells us how to compute distances; to compute cluster centers we use a cluster center function $\mu$. For any set of points $x^{(1)}, \dots, x^{(m)} \in \mathbb{R}^n$, the center $\mu_c$ of these points is given by $\mu_c = \mu(x^{(1)}, \dots, x^{(m)})$.
A cluster center function $\mu$ is invariant to a transformation $\phi : \mathbb{R}^n \to \mathbb{R}^n$ if
$$\phi\left(\mu\left(x^{(1)}, \dots, x^{(m)}\right)\right) = \mu\left(\phi\left(x^{(1)}\right), \dots, \phi\left(x^{(m)}\right)\right)$$
for all sets of points $x^{(1)}, \dots, x^{(m)} \in \mathbb{R}^n$.
(i) Consider the cluster center function
$$\mu\left(x^{(1)}, \dots, x^{(m)}\right) = \frac{1}{m} \sum_{i=1}^{m} x^{(i)} \qquad (2)$$
Is K-Means with Euclidean distance and the above cluster center function invariant under scaling
transformations? If yes, provide a proof; if not, provide a counterexample.
(ii) Consider the cluster center function $\mu$ given by
$$\mu\left(x^{(1)}, \dots, x^{(m)}\right)_i = \operatorname{median}\left(x^{(1)}_i, \dots, x^{(m)}_i\right) \qquad (3)$$
In other words, the $i$th coordinate of the centroid is the median of the $i$th coordinates of the points $x^{(1)}, \dots, x^{(m)}$. Is K-Means with Manhattan distance and the above cluster center function invariant
under scaling transformations? If yes, provide a proof; if not, provide a counterexample.
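For reference, both cluster center functions have one-line MATLAB counterparts, assuming the points x^(1), ..., x^(m) are stored as the rows of an m-by-n matrix X (an assumed layout, not one imposed by the problem):

    mu_mean   = mean(X, 1);      % Equation 2: coordinate-wise mean of the rows of X
    mu_median = median(X, 1);    % Equation 3: coordinate-wise median of the rows of X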
(d) Suppose that we want to compute a segmentation for an image I by using K-Means to cluster the color
values of its pixels. In other words, for each pixel we will create a vector (r, g, b) of its color values and
use K-Means to cluster these color vectors. Each cluster will correspond to a segment in the image; all
pixels whose color vectors end up in the same cluster will be assigned to the same segment.
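As a rough sketch of this setup, assuming MATLAB's kmeans from the Statistics and Machine Learning Toolbox (with its default squared Euclidean distance), an H-by-W-by-3 RGB image I, and a hypothetical segment count k:

    pixels = double(reshape(I, [], 3));                    % one (r, g, b) row per pixel
    labels = kmeans(pixels, k);                            % cluster the color vectors
    segments = reshape(labels, size(I, 1), size(I, 2));    % segment label for each pixel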
(i) We create a high-contrast image $I'$ by doubling the color values of each pixel in $I$. Which of the following distance function(s) can we use to ensure that the segmentations for $I$ and $I'$ are the same: Euclidean distance, Manhattan distance, Chebyshev distance, Cosine distance? For each of these distance functions give a short explanation as to why it does or does not produce the same segmentations for images $I$ and $I'$. For the Manhattan distance, assume that we use the cluster center function from Equation 3; for all other distance functions, assume that we use the cluster center function from Equation 2.
(ii) We create a tinted image $I'$ by increasing the red value of each pixel in $I$ by a fixed amount $r$. Which of the following distance function(s) can we use to ensure that the segmentations for $I$ and $I'$ are the same: Euclidean distance, Manhattan distance, Chebyshev distance, Cosine distance? For each of these distance functions give a short explanation as to why it does or does not produce the same segmentations for images $I$ and $I'$. For the Manhattan distance, assume that we use the cluster center function from Equation 3; for all other distance functions, assume that we use the cluster center function from Equation 2.
2 Frame Interpolation using Optical Flow
Suppose we are given a pair of frames I0 and I1 from a video sequence which occur at times t = 0 and t = 1,
respectively. We wish to generate additional interpolated frames It for 0 < t < 1. This is a common problem
in video processing; for example, if we wish to display a video with 24 frames per second on a television with
a refresh rate of 60 Hz, then interpolating the frames of the input video can result in smoother playback.
The code and data for this problem are located in the FrameInterpolationSkeleton directory. Start by
looking at the file FrameInterpolation.m; this file sets up the frame interpolation task for you by loading
a pair of images and computing the optical flow between them.
NOTE: In the equations below we use Cartesian coordinates (x, y) but the MATLAB code for this problem
uses matrix coordinates (row, column). Keep this in mind when implementing your solution.
(a) The simplest way of generating interpolated frames is to use linear interpolation, also known as cross-fading. In this model, the interpolated images $I_t$ are given by
$$I_t(x, y) = (1 - t)\,I_0(x, y) + t\,I_1(x, y)$$
Implement this method in the file ComputeCrossFadeFrame.m. After implementing this method, you can
view the interpolated video sequence by uncommenting the indicated lines in FrameInterpolation.m.
Hand in your completed version of the file ComputeCrossFadeFrame.m and a printout of the tiled image
frames which are written to the file crossfade.png.
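A minimal sketch of the cross-fade computation, assuming I0 and I1 are equally sized images of class double (the function name and signature are illustrative and may differ from the provided skeleton):

    function It = CrossFadeSketch(I0, I1, t)
        % Blend the two frames according to I_t = (1 - t) I_0 + t I_1
        It = (1 - t) * I0 + t * I1;
    end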
(b) More sophisticated techniques for frame interpolation take the appearance of the input images into
account. One of the most natural ways to incorporate this information is to use optical flow fields. In
this problem the horizontal and vertical components of the velocity of the point It(x, y) will be denoted
ut(x, y) and vt(x, y) respectively. As discussed in class, we can compute the optical flow between the
images I0 and I1 to obtain the velocities (u0(x, y), v0(x, y)) of every point of I0.
After computing the optical flow, a simple strategy for computing interpolated frames is to carry the
pixels of I0 forward along their velocity vectors; this algorithm is known as forward warping. More
concretely, the interpolated images are given by
$$I_t(x + t\,u_0(x, y),\; y + t\,v_0(x, y)) = I_0(x, y)$$
Implement this method in the file ComputeForwardWarpingFrame.m. After implementing this method
you can view the interpolated video by uncommenting the indicated lines in FrameInterpolation.m.
Hand in your completed version of the file ComputeForwardWarpingFrame.m and a printout of the tiled
image frames which are written to the file forwardwarped.png.
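A minimal sketch of the forward-warping idea under the equation above, assuming I0 is an H-by-W-by-C double image and u0, v0 are H-by-W flow components indexed as (row, column); destination coordinates are rounded to the nearest pixel and clamped to the image bounds, which the provided skeleton may handle differently:

    [H, W, C] = size(I0);
    It = zeros(H, W, C);
    for y = 1:H
        for x = 1:W
            % push the pixel at (x, y) along t times its flow vector
            xp = min(max(round(x + t * u0(y, x)), 1), W);
            yp = min(max(round(y + t * v0(y, x)), 1), H);
            It(yp, xp, :) = I0(y, x, :);
        end
    end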
(c) Forward warping gives much better results than cross-fading, but it can still lead to significant artifacts
in the interpolated images. Give at least one explanation for the types of artifacts that we see when
using forward warping to interpolate images.
(d) A more robust strategy is to use backward warping. If we knew the velocities $(u_t(x, y), v_t(x, y))$ for all of the points in $I_t$, then we could compute the pixel values of $I_t$ by setting
$$I_t(x, y) = I_0(x - t\,u_t(x, y),\; y - t\,v_t(x, y))$$
Unfortunately we do not know the velocities at time t. However, we can estimate the velocities at time
t by forward-warping the known velocities u0 and v0 at time 0. We can increase the robustness of this
estimate by using a few heuristics.
For each point $(x, y)$ of $I_0$, we compute the quantities $x' = x + t\,u_0(x, y)$ and $y' = y + t\,v_0(x, y)$. Then for all pairs $(x'', y'')$ with $x'' \in \{\mathrm{floor}(x'), \mathrm{ceil}(x')\}$ and $y'' \in \{\mathrm{floor}(y'), \mathrm{ceil}(y')\}$ we set
$$u_t(x'', y'') = u_0(x, y)$$
$$v_t(x'', y'') = v_0(x, y)$$
This can help account for small inaccuracies in the computed optical flow. If this would cause multiple points $(x, y)$ of $(u_0, v_0)$ to be assigned to the point $(x'', y'')$ of $(u_t, v_t)$, then pick the one that minimizes the color difference
$$|I_0(x, y) - I_1(x + u_0(x, y),\; y + v_0(x, y))|$$
The quantity $I_1(x + u_0(x, y), y + v_0(x, y))$ is the appearance of the pixel to which $(x, y)$ would be sent if we forward warped all the way to time 1. The idea behind this heuristic is that if two or more pixels at time 0 collide when forward warped to time $t$, then we keep only the pixel whose appearance changes the least when forward warped all the way to time 1; hopefully this is also the pixel that changes the least when forward warped to time $t$.
Additionally, if any points of the interpolated flow (ut, vt) remain unassigned after this procedure,
we use linear interpolation to fill in the gaps. The function FillInHoles can be used for this step.
Implement forward warping of the flow field as described above in the function WarpFlow of the file
ComputeFlowWarpFrame.m.
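A minimal sketch of the flow-warping heuristic described above, assuming u0, v0 and the assignment costs are H-by-W arrays indexed as (row, column); this only illustrates the procedure and is not the exact WarpFlow interface:

    [H, W] = size(u0);
    ut = nan(H, W);  vt = nan(H, W);
    cost = inf(H, W);                    % color difference of the current assignment
    for y = 1:H
        for x = 1:W
            xp = x + t * u0(y, x);       % forward-warped position at time t
            yp = y + t * v0(y, x);
            % color difference if (x, y) were warped all the way to time 1
            % (rounded and clamped here so that I1 can be sampled directly)
            x1 = min(max(round(x + u0(y, x)), 1), W);
            y1 = min(max(round(y + v0(y, x)), 1), H);
            c = sum(abs(I0(y, x, :) - I1(y1, x1, :)), 3);
            for xq = [floor(xp), ceil(xp)]
                for yq = [floor(yp), ceil(yp)]
                    inBounds = xq >= 1 && xq <= W && yq >= 1 && yq <= H;
                    if inBounds && c < cost(yq, xq)
                        ut(yq, xq) = u0(y, x);
                        vt(yq, xq) = v0(y, x);
                        cost(yq, xq) = c;
                    end
                end
            end
        end
    end
    % Remaining NaN entries of (ut, vt) would then be filled in,
    % e.g. with the provided FillInHoles function.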
(e) Once we have estimated the velocities (ut, vt) we can use backward warping to compute the interpolated
image It as described above. Implement backward warping in the function ComputeFlowWarpFrame of
the file ComputeFlowWarpFrame.m. After implementing this method you can view the interpolated video
by uncommenting the indicated lines in FrameInterpolation.m. Hand in the completed version of the
file ComputeFlowWarpFrame.m and a printout of the tiled image frames which are written to the file
flowwarped.png.
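A minimal sketch of the backward-warping step, assuming the estimated flow components ut, vt and the frame I0 are (row, column)-indexed arrays and using bilinear sampling via interp2 (the skeleton's exact interface may differ):

    [H, W, C] = size(I0);
    [X, Y] = meshgrid(1:W, 1:H);         % pixel coordinate grids
    Xs = X - t * ut;                     % source locations in I0
    Ys = Y - t * vt;
    It = zeros(H, W, C);
    for c = 1:C
        % bilinear sampling; out-of-bounds locations default to 0
        It(:, :, c) = interp2(X, Y, I0(:, :, c), Xs, Ys, 'linear', 0);
    end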
3 Motion Segmentation
The problem of grouping pixels of an image into regions that each move in a coherent fashion is called motion
segmentation.
Figure 2: Problem 3a: A point on a moving plane is projected onto the image plane. (Figure labels: image plane; moving plane in space; orthographic camera projection; the point X with its world space velocity; its projection x with its image space velocity.)
(a) Consider a plane in 3D space given by
$$Z = c_x X + c_y Y + c_0.$$
Suppose that this plane is undergoing rigid body motion and that it is being viewed through an orthographic camera as in Figure 2. Show that the velocity $v = (v_x, v_y)$ of the point $x = (x, y)$ in image space corresponding to the point $X = (X, Y, Z)$ on the plane in world space can be written in the form
$$v_x = a_{x,0} + a_{x,x}\,x + a_{x,y}\,y$$
$$v_y = a_{y,0} + a_{y,x}\,x + a_{y,y}\,y,$$
where the coefficients $a_{x,0}$, $a_{x,x}$, $a_{x,y}$, $a_{y,0}$, $a_{y,x}$, and $a_{y,y}$ do not depend on $x$ or $y$. In other words, show that the velocity of all image space points is an affine function of their positions; this is called an affine motion model.
HINT: Rigid body motion can be decomposed into a rotation about each axis and a translation. If the plane is rotating with angular velocities $\omega = (\omega_X, \omega_Y, \omega_Z)$ about each axis and is translating with velocity $T = (T_X, T_Y, T_Z)$ along each axis, then the velocity of the point $X = (X, Y, Z)$ is given by
$$V = \Omega X + T,$$
where
$$\Omega = \begin{pmatrix} 0 & -\omega_Z & \omega_Y \\ \omega_Z & 0 & -\omega_X \\ -\omega_Y & \omega_X & 0 \end{pmatrix}.$$
An orthographic camera is a simplified camera model whose camera matrix is simply
$$\begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}.$$
In other words, the orthographic camera converts points from world space to image space by simply ignoring the z-coordinate.
(b) If we make the simplifying assumptions that the world is composed of planes undergoing rigid motion
and that the camera performs orthographic projection, then by (a) the problem of motion segmentation
is reduced to the problem of determining a set of affine motion models for an image and assigning each
pixel of the image to one of the motion models.
We can use an optical flow algorithm to estimate the velocity of each point in an image. If the world
really obeyed our simplified model and if the optical flow algorithm were perfect, then we could trivially
deduce a set of affine motion models using only the optical flow. Since neither of these conditions is likely to hold, we must perform some processing on the optical flow in order to estimate the affine motion
models.
Given a small group of nearby pixels, it is likely that they belong to the same motion model. Therefore,
to generate a set of candidate motion models, we can divide the image into many small regions and use
linear regression to fit an affine motion model in each region.
For this problem you may assume that the least squares solution to the problem $Ax = b$ is given by
$$x = (A^T A)^{-1} A^T b.$$
(i) Explain how we can use linear regression to fit an affine motion model to a set of points given their
positions and velocities.
(ii) If the entire image is rescaled by a factor $s > 0$, then how do the estimated affine motion models in each region change? You may assume that scaling the image by a factor of $s$ also scales the optical flow of the image by a factor of $s$.
(iii) If the entire image is rotated, then how do the estimated affine models in each region change? You
may assume that rotating the image also rotates the optical flow.
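As a concrete illustration of the least squares formula above, here is a minimal MATLAB sketch of fitting one affine motion model to a region, assuming the pixel positions are in vectors x and y and the corresponding flow components are in vx and vy (illustrative variable names); the backslash operator is used in place of the explicit inverse:

    A  = [ones(numel(x), 1), x(:), y(:)];   % one row [1 x_i y_i] per pixel
    ax = (A' * A) \ (A' * vx(:));           % coefficients [a_{x,0}; a_{x,x}; a_{x,y}]
    ay = (A' * A) \ (A' * vy(:));           % coefficients [a_{y,0}; a_{y,x}; a_{y,y}]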
(c) Since we generate one candidate motion model for each small image region, it is likely that the number of
candidate motion models is much larger than the number of “true” motion models present in the scene.
For example, large objects which could be described by a single motion model may have been split up
across many small image patches. In a few sentences, briefly explain how you might use the large set of
candidate motion models to generate a smaller set of motion models that more accurately describe the
motion present in the scene.
HINT: Think of a motion model as a point in $\mathbb{R}^6$.
4 K-Means Local Minima
Consider the points:
$$x_1 = (0, 16), \quad x_2 = (0, 9), \quad x_3 = (-4, 0), \quad x_4 = (4, 0).$$
Suppose we wish to separate these points into two clusters and we initialize the K-Means algorithm by
randomly assigning points to clusters. This random initialization assigns x1 to cluster 1 and all other points
to cluster 2.
(a) With these points and these initial assignments of points to clusters, what will be the final assignment
of points to clusters computed by K-Means? Explicitly write out the steps that the K-Means algorithm
will take to find the final cluster assignments.
(b) In this case, does K-Means find the assignment of points to clusters that minimizes Equation 1?
5 Affine and Projective Cameras
Recall from lecture that we can transform a point in world space X = (X, Y, Z) to image coordinates
x = (x, y) by using a 3 × 4 camera matrix. In class we showed how a camera matrix can be factored into a
3 × 3 matrix of intrinsic parameters and a 3 × 4 matrix of extrinsic parameters. In this problem, however,
we will consider the case of a single 3 × 4 camera matrix A which combines both the intrinsic and extrinsic
parameters of the camera:
$$\begin{pmatrix} A_{11} & A_{12} & A_{13} & A_{14} \\ A_{21} & A_{22} & A_{23} & A_{24} \\ A_{31} & A_{32} & A_{33} & A_{34} \end{pmatrix} \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix} = \begin{pmatrix} x' \\ y' \\ t' \end{pmatrix}, \qquad x = x'/t', \quad y = y'/t'$$
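A minimal MATLAB sketch of applying such a camera matrix to a world point, assuming A is the 3-by-4 matrix above (illustrative, not part of any provided code):

    p = A * [X; Y; Z; 1];   % homogeneous image coordinates [x'; y'; t']
    x = p(1) / p(3);        % divide by t' to recover the image coordinates
    y = p(2) / p(3);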
(a) How many degrees of freedom does the above camera matrix have for expressing transformations from
(X, Y, Z) to (x, y)? Explain. How does this compare with the number of degrees of freedom when the
camera matrix is factored into a 3 × 3 matrix of intrinsic parameters and a 3 × 4 matrix of extrinsic
parameters?
(b) In the special case of an affine camera, the camera matrix has the following form:
$$A = \begin{pmatrix} A_{11} & A_{12} & A_{13} & A_{14} \\ A_{21} & A_{22} & A_{23} & A_{24} \\ 0 & 0 & 0 & A_{34} \end{pmatrix}$$
An important property of an affine camera is that it preserves parallel lines. We usually represent a line $L$ in $\mathbb{R}^n$ by choosing a point $p \in \mathbb{R}^n$ on the line and picking some vector $v \in \mathbb{R}^n$ parallel to the line. Every point $\ell$ on the line can then be written in the form $\ell = p + tv$ for some $t \in \mathbb{R}$. For brevity, we often write this as a set of points: $L = \{p + tv : t \in \mathbb{R}\}$.
Suppose that we have a pair of lines in $\mathbb{R}^n$ given by $L_1 = \{p_1 + t v_1 : t \in \mathbb{R}\}$ and $L_2 = \{p_2 + t v_2 : t \in \mathbb{R}\}$. These lines are parallel if the vectors $v_1$ and $v_2$ are scalar multiples of each other. Now consider a pair of parallel lines in 3D space given by $L_1 = \{p_1 + tv : t \in \mathbb{R}\}$ and $L_2 = \{p_2 + tv : t \in \mathbb{R}\}$. If the lines $L_1$ and $L_2$ are transformed from world space to image space using an affine camera matrix, prove that the resulting lines in image space remain parallel.