16-720-B, Computer Vision
Object Tracking in Videos
Instructions
1. Integrity and collaboration: Students are encouraged to work in groups but each student must
submit their own work. If you work as a group, include the names of your collaborators in your writeup. Code should NOT be shared or copied. Please DO NOT use external code unless permitted.
Plagiarism is prohibited and may lead to failure of this course.
2. Questions: Please keep an eye on the Piazza FAQ page for this homework.
3. Write-up and Code: Please stick to the coding conventions provided. Do not import external
packages unless specified.
4. Submission: Your submission for this assignment should be a zip file, <andrew-id>.zip, composed of your write-up, your Python implementations (including helper functions), and your implementations and results for the extra credit (optional). Please make sure to remove the data/ folder and any other files that are not required, and ensure that your submission is of a reasonable size. You may want to use video compression if your videos are huge.
Your final upload should have the files arranged in this layout:
• <AndrewId>.zip
– <AndrewId>/
∗ <AndrewId>.pdf
∗ python/
· LucasKanade.py
· LucasKanadeAffine.py
· InverseCompositionAffine.py
· test_lk.py
· test_lk_affine.py
· test_ic_affine.py
· file_utils.py
∗ ec/
· LucasKanade_Robust.py
· LucasKanade_Pyramid.py
1 Overview
An important aspect of human vision is the ability to track objects. Tracking is integral to our everyday
lives. In this assignment, you will be implementing algorithms that will track an object in a video.
• You will first implement the Lucas-Kanade tracker [1].
• Then you will implement a more efficient tracker, the Matthews-Baker tracker [1].
For extra credit, we will also look at ways to make tracking more robust, by incorporating illumination
invariance and implementing the algorithm in a pyramid fashion.
2 Preliminaries
In this section, we will go through the basics of the Lucas-Kanade tracker and the Matthews-Baker tracker.
Table 1 contains a summary of the variables used in the upcoming sections which explain the algorithms.
This will come in handy for matching the dimensions when describing the algorithms.
Template
A template describes the object of interest (e.g. a car or a football) which we wish to track in a video.
Traditionally, the tracking algorithm is initialized with a template, which is represented by a bounding
box around the object to be tracked in the first frame of the video. For each of the subsequent frames in
the video, the tracker will update its estimate of the object in the image. The tracker achieves this by
updating its affine warp.
Warps
What is a warp? An image transformation, or warp, W is a function that acts on pixel coordinates x = [u v]^T and maps pixel values from one place to another in an image, x' = [u' v']^T. Simply put, W maps a pixel with coordinates x = [u v]^T to x' = [u' v']^T. Translation, rotation, and scaling are all examples of warps. We denote the parameters of the warp function W by p; refer to eq 1.

x' = W(x; p)    (1)
Affine Warp
An affine warp is a particular kind of warp that can include any combination of translation, scaling, and rotation. An affine warp can be represented by 6 parameters p = [p1 p2 p3 p4 p5 p6]^T. One of the most convenient things about an affine warp is that it is linear; its action on a point with coordinates x = [u v]^T can be described as a matrix operation by a 3 × 3 matrix W(p) acting on homogeneous coordinates; refer to eq 2 and eq 3:

[u' v' 1]^T = W(p) [u v 1]^T    (2)

W(p) = | 1 + p1   p3       p5 |
       | p2       1 + p4   p6 |
       | 0        0        1  |    (3)
Note: For convenience, when we want to refer to the warp as a function, we will use W(x; p) and when
we want to refer to the matrix for an affine warp, we will use W(p). We will use affine warp and affine
transformation interchangeably.
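To make the matrix form concrete, here is a minimal sketch (the helper names warp_matrix and warp_point are illustrative, not part of the provided starter code) that builds W(p) as in eq 3 and applies it to a pixel coordinate in homogeneous form:

```python
import numpy as np

def warp_matrix(p):
    """Build the 3x3 affine warp matrix W(p) of eq 3 from the 6-vector p."""
    p1, p2, p3, p4, p5, p6 = p
    return np.array([[1 + p1, p3,     p5],
                     [p2,     1 + p4, p6],
                     [0.0,    0.0,    1.0]])

def warp_point(p, u, v):
    """Apply W(p) to a pixel coordinate (u, v) via homogeneous coordinates."""
    u_w, v_w, _ = warp_matrix(p) @ np.array([u, v, 1.0])
    return u_w, v_w
```

With p = 0 this is the identity warp, and p = [0, 0, 0, 0, p5, p6] is a pure translation by (p5, p6), which is a useful sanity check when debugging.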
Symbol | Vector/Matrix Size | Description
u | 1 × 1 | Image horizontal coordinate
v | 1 × 1 | Image vertical coordinate
x | 2 × 1 or 1 × 1 | Pixel coordinates (u, v), or unrolled
I | m × 1 | Image unrolled into a vector (m pixels)
T | m × 1 | Template unrolled into a vector (m pixels)
W(p) | 3 × 3 | Affine warp matrix
p | 6 × 1 | Parameters of the affine warp
∂I/∂u | m × 1 | Partial derivative of image wrt u
∂I/∂v | m × 1 | Partial derivative of image wrt v
∂T/∂u | m × 1 | Partial derivative of template wrt u
∂T/∂v | m × 1 | Partial derivative of template wrt v
∇I | m × 2 | Image gradient ∇I(x) = [∂I(x)/∂u  ∂I(x)/∂v]
∇T | m × 2 | Template gradient ∇T(x) = [∂T(x)/∂u  ∂T(x)/∂v]
∂W/∂p | 2 × 6 | Jacobian of the affine warp wrt its parameters
J | m × 6 | Jacobian of the error function L wrt p
H | 6 × 6 | Pseudo-Hessian of L wrt p

Table 1: Summary of Variables
Lucas-Kanade Tracker
A Lucas-Kanade tracker uses a warp function W(x; p) to align an image I to a template T. The optimal parameters p* of the warp W(x; p) are found by minimizing the loss L, the pixel-wise sum of squared differences between the warped image I(W(x; p)) and the template T; refer to eq 4 and 5.
Note: We denote pixel locations by x, so I(x) is the pixel value (e.g. RGB) at location x in image I. For simplicity, I and T are treated as column vectors instead of matrices (like linearized matrices!). W(x; p) is the point obtained by warping x, and I(W(x; p)) is the pixel value (e.g. RGB) after warping.

L = Σ_x [T(x) − I(W(x; p))]^2    (4)

p* = argmin_p L    (5)
In general, this is a difficult non-linear optimization of L in p even for simple warps like affine warps, but the Lucas-Kanade tracker circumvents this with a key idea: forward additive alignment. Assuming we already have a very close estimate p0 of the correct warp p*, a small linear change ∆p should be enough to get the best alignment, i.e. p* = p0 + ∆p. This is the forward additive form of the warp. The loss L can then be written as:

L = Σ_x [T(x) − I(W(x; p0 + ∆p))]^2    (6)

∆p* = argmin_∆p L    (7)

p* = p0 + ∆p*    (8)
Expanding this to first order with a Taylor series:

L ≈ Σ_x [T(x) − I(W(x; p0)) − ∇I(x) ∂W/∂p0 ∆p]^2    (9)

Here the Jacobian of I(x), ∇I(x) = [∂I(x)/∂u  ∂I(x)/∂v], is the vector containing the horizontal and vertical gradients at pixel location x. Rearranging the Taylor expansion, it can be rewritten as a typical least squares approximation ∆p* = argmin_∆p ||A∆p − b||^2 (note we pull out a negative sign, which is harmless thanks to the squaring):

∆p* = argmin_∆p Σ_x [∇I ∂W/∂p0 ∆p − {T(x) − I(W(x; p0))}]^2    (10)

This can be solved with ∆p* = (A^T A)^-1 A^T b, where A^T A is the Hessian H:

(A^T A) = H = Σ_x [∇I ∂W/∂p0]^T [∇I ∂W/∂p0]    (11)

A = ∇I ∂W/∂p0    (12)

b = T(x) − I(W(x; p0))    (13)
Once ∆p* is computed, the best estimate of the warp p* can be obtained using eq 8. However, in practice, our initial estimate p0 can be well off the optimal p*, violating the forward additive alignment assumption. As a fix, we perform the optimization of L in an iterative fashion using an error threshold ε, as described in Algorithm 1 (algorithm 2 from [1]).
Algorithm 1 The Lucas-Kanade Algorithm
(0) p ← p0
(1) Iterate until ||∆p|| ≤ ε:
(2) Warp I with W(x; p) to compute I(W(x; p))
(3) Compute the error image E(x) = T(x) − I(W(x; p))
(4) Warp the gradient ∇I with W(x; p)
(5) Evaluate the Jacobian ∂W/∂p at (x; p)
(6) Compute the steepest descent image ∇I ∂W/∂p
(7) Compute the Hessian matrix H = Σ_x [∇I ∂W/∂p]^T [∇I ∂W/∂p]
(8) Compute ∆p = H^-1 Σ_x [∇I ∂W/∂p]^T E(x)
(9) Update the parameters p ← p + ∆p
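As a concrete illustration of Algorithm 1 in its simplest setting, here is a sketch of a translation-only tracker (for a pure translation W(x; p) = x + [dx, dy], the Jacobian ∂W/∂p is the 2 × 2 identity, so the steepest descent image is just the warped image gradient). The function and variable names are illustrative, not the required assignment signature, and grayscale float images are assumed:

```python
import numpy as np
from scipy.interpolate import RectBivariateSpline

def lucas_kanade_translation(It, It1, rect, threshold=1e-2, max_iters=100):
    """Forward additive LK with p = [dx, dy] (a sketch of Algorithm 1)."""
    x1, y1, x2, y2 = rect
    h, w = It.shape
    spline_T = RectBivariateSpline(np.arange(h), np.arange(w), It)
    spline_I = RectBivariateSpline(np.arange(h), np.arange(w), It1)

    # fixed template sample grid inside the rectangle
    us, vs = np.meshgrid(np.arange(x1, x2 + 1), np.arange(y1, y2 + 1))
    T = spline_T.ev(vs, us).ravel()

    p = np.zeros(2)  # p = [dx, dy]
    for _ in range(max_iters):
        uw, vw = us + p[0], vs + p[1]
        I_warp = spline_I.ev(vw, uw).ravel()
        b = T - I_warp                                         # error image (eq 13)
        # A = warped image gradient [dI/du, dI/dv] (eq 12); for translation,
        # dW/dp is the identity, so no extra Jacobian factor is needed
        A = np.stack([spline_I.ev(vw, uw, dy=1).ravel(),
                      spline_I.ev(vw, uw, dx=1).ravel()], axis=1)
        dp = np.linalg.lstsq(A, b, rcond=None)[0]
        p += dp
        if np.linalg.norm(dp) <= threshold:
            break
    return p
```

Note the axis convention: the spline's first axis is the row (v) coordinate, so `ev(vw, uw, dy=1)` differentiates along columns (u).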
Matthews-Baker Tracker
While the Lucas-Kanade tracker works very well, it is computationally expensive: the Hessian and Jacobian are recomputed for each frame of the video. The Matthews-Baker tracker is similar to the Lucas-Kanade tracker but requires less computation, as the Hessian and Jacobian are only computed once for the whole video. The Matthews-Baker tracker replaces the forward additive alignment idea of the Lucas-Kanade tracker with inverse compositional alignment. It assumes that the warp W(x; p) is an invertible function. Since the affine warps used by the Lucas-Kanade tracker are indeed invertible (why? think!), we can use the Matthews-Baker tracker instead of a Lucas-Kanade tracker.
The key to the efficiency is switching the roles of the image I and the template T in the algorithm. In contrast to the Lucas-Kanade tracker (which warps the image I to the template T), the Matthews-Baker tracker warps the template T to the image I. This difference leads to the computational gains: unlike the image I, which changes with each frame of the video, the template T remains fixed throughout. If we can write the Hessian and Jacobian involved in the optimization of L in terms of the template T instead of the image I, then we only need to compute them once, at the start of tracking.
Here is the inverse compositional form of the Matthews-Baker tracker, given without the proof of equivalence to the forward additive form of the Lucas-Kanade tracker; please refer to [1] for the proof. Given an initial estimate p0, we want to find the ∆p* that minimizes L as follows:
L = Σ_x [T(W(x; ∆p)) − I(W(x; p0))]^2    (14)

∆p* = argmin_∆p L    (15)
This completes our role switch of the template T and the image I. Please note that ∆p are the parameters of the warp W applied to T, and p0 are the parameters of the warp W applied to I.
Another key difference between the Matthews-Baker tracker and the Lucas-Kanade tracker is how the two warps W(x; p0) and W(x; ∆p) are combined. In the Lucas-Kanade tracker we combined the two warps by simply adding the parameter update ∆p to p0, producing a new warp W(x; p0 + ∆p); specifically, p* = p0 + ∆p*. In the Matthews-Baker tracker, we instead combine the warps through composition, as follows:

W(x; p*) = W(x; p0) ◦ W(x; ∆p*)^-1    (16)
If we are using affine warps, the warps can be implemented as matrix multiplications, and we can simplify eq 16 as follows:

W(x; p*) = W(W(x; ∆p*)^-1; p0)
W(x; p*) = W(p0) W(∆p*)^-1 x
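In code, this composition is just 3 × 3 matrix algebra. A sketch (the helper names affine_W and compose_inverse are illustrative):

```python
import numpy as np

def affine_W(p):
    """The 3x3 affine warp matrix W(p) of Equation 3."""
    p1, p2, p3, p4, p5, p6 = p
    return np.array([[1 + p1, p3,     p5],
                     [p2,     1 + p4, p6],
                     [0.0,    0.0,    1.0]])

def compose_inverse(p0, dp):
    """The matrix form of Equation 16: W(p*) = W(p0) W(dp)^(-1)."""
    return affine_W(p0) @ np.linalg.inv(affine_W(dp))
```

Two quick sanity checks: composing with ∆p = 0 leaves W(p0) unchanged, and composing W(p) with its own inverse yields the identity.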
Let us simplify L as before using a first order linear approximation of T(W(x; ∆p)). Please note that for tracking a template T, the summation over x is performed only over the pixels lying inside T.

L ≈ Σ_x [T(x) + ∇T(x) ∂W/∂p0 ∆p − I(W(x; p0))]^2    (17)

where ∇T(x) = [∂T(x)/∂u  ∂T(x)/∂v].
We now proceed to find a closed form solution for ∆p* by setting the derivative ∂L/∂∆p to zero:

∂L/∂∆p = 2 Σ_x [∇T ∂W/∂p0]^T [T(x) + ∇T(x) ∂W/∂p0 ∆p − I(W(x; p0))]    (18)

Setting this to zero, switching from summation to vector notation, and solving for ∆p, we get

∆p = H^-1 J^T [I(W(x; p0)) − T]    (19)

where J is the Jacobian of T(W(x; ∆p)), J = ∇T ∂W/∂p0, H is the approximated Hessian H = J^T J, and I(W(x; p0)) is the warped image. Note that for a given template, the Jacobian J and the Hessian H are independent of p0. This means they only need to be computed once, and can then be reused during the entire tracking sequence.
Once ∆p* has been solved for, it needs to be inverted and composed with p0 to get the new warp parameters for the next iteration:

W(x; p0) ← W(x; p0) ◦ W(x; ∆p*)^-1    (20)

The next iteration solves Equation 19 starting from the new value of p0. Possible termination criteria include the magnitude of ∆p* falling below some threshold, or running for some fixed number of iterations.
Algorithm 2 The Matthews-Baker Algorithm
(0) p ← p0
(1) Pre-compute:
(2) Evaluate the gradient ∇T of the template T(x)
(3) Evaluate the Jacobian ∂W/∂p at (x; 0)
(4) Compute the steepest descent images ∇T ∂W/∂p
(5) Compute the Hessian matrix using J = ∇T ∂W/∂p, H = J^T J
(6) Iterate until ||∆p|| ≤ ε:
(7) Warp I with W(x; p) to compute I(W(x; p))
(8) Compute the error image E(x) = I(W(x; p)) − T(x)
(9) Compute ∆p = H^-1 Σ_x [∇T ∂W/∂p]^T E(x) = H^-1 J^T E(x)
(10) Update the warp W(x; p) ← W(x; p) ◦ W(x; ∆p)^-1 (the compositional update of eq 20, not an additive one)
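To see why the pre-computation pays off, here is a sketch of one runtime iteration given precomputed J and H (the function name is illustrative; a real implementation would also re-warp I and update the warp compositionally each iteration):

```python
import numpy as np

def mb_iteration(J, H, I_warp, T):
    """One runtime iteration of Algorithm 2 (a sketch).
    J (m x 6) and H = J^T J (6 x 6) are precomputed once per template, so each
    frame only costs the error image plus one small 6x6 linear solve."""
    E = I_warp - T                      # error image, step (8)
    return np.linalg.solve(H, J.T @ E)  # dp = H^-1 J^T E, step (9)
```

As a check, if the error image is exactly J @ dp for some dp (i.e. the linearization is exact), the iteration recovers that dp in one solve.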
3 Theory Questions (30 points)
Each question should only take a couple of lines. If you are lost in many lines of complicated algebra you
are doing something wrong.
Q1.1: Calculating the Jacobian (10 points)
Assuming the affine warp model defined in Equation 3, derive the expression for the Jacobian matrix J in terms of the warp parameters p = [p1 p2 p3 p4 p5 p6]^T.
Q1.2: Computational complexity of Lucas-Kanade (10 points)
• Find the computational complexity (big-O notation) of each runtime iteration (computing J and H^-1) of the Lucas-Kanade method. Express your answer in terms of n, m and p, where n is the number of pixels in the template T, m is the number of pixels in an input image I, and p is the number of parameters used to describe the warp W.
Q1.3: Computational complexity of Matthews-Baker (10 points)
• Find the computational complexity (big-O notation) of the initialization step (precomputing J and H^-1) and of each runtime iteration (Equation 19) of the Matthews-Baker method. Express your answers in terms of n, m and p, where n is the number of pixels in the template T, m is the number of pixels in an input image I, and p is the number of parameters used to describe the warp W.
• How does this compare to the runtime of the regular Lucas-Kanade method?
4 Implementing the Trackers (80 points)
In this section, you will implement the Lucas-Kanade tracker and the Matthews-Baker tracker.
Lucas-Kanade has two versions:
• with the warp W being translation only
• with the warp W being a full affine transformation
Matthews-Baker has one version:
• with the warp W being a full affine transformation
You will run all three algorithms (versions) on the provided videos, which can be downloaded here: https://www.dropbox.com/sh/l2ip26mkgf5p3e6/AACN2STT5Sk9r6bPeEXIYKZCa?dl=0 (the files are quite big and cannot be uploaded to Piazza). To that end, we have provided the script files test_lk.py, test_lk_affine.py and test_ic_affine.py to run the three algorithms; they handle reading in images, marking the template region, making tracker function calls, displaying output on the screen, and saving results to disk. As a result, you only need to implement the actual algorithms. The function prototypes provided are guidelines. Please make sure that your code runs with the original scripts and generates the outputs we are looking for (a frame sequence with the bounding box of the tracked target drawn on each frame) so that we can replicate your results.
Please include results of the three algorithms on all the videos we have provided in your write-up. Specifically, include results on five frames at a reasonable interval (e.g., for landing.mp4, race.mp4, and ballet.mp4, which have only about 50 frames in total, you can use an interval of 10 frames; for car1.mp4, which has 260 frames, you can use an interval of 50 frames). Also, please describe the differences between the three algorithms in terms of performance (e.g., accuracy and speed) and explain why they arise.
Q2.1: Lucas-Kanade Forward Additive Alignment with Translation (20 points)
Write the function with the following function signature:
dx, dy = LucasKanade(It, It1, rect)
that computes the optimal local motion, represented by translation only (motion in the x and y directions), from frame It to frame It+1, minimizing Equation 4. Here It is the image frame It, It1 is the image frame It+1, and rect is the 4 × 1 vector that represents a rectangle on the image frame It. The four components of the rectangle are [x1, y1, x2, y2], where (x1, y1) is the top-left corner and (x2, y2) is the bottom-right corner of the bounding box. To deal with fractional movement of the template, you will need to interpolate the image using the function RectBivariateSpline from scipy.interpolate. The RectBivariateSpline.ev function can also be used to compute the gradient of the image at a point location. You will need to iterate the estimation in Equation 10 until the change in the warp parameters (dx, dy) is below a threshold or the number of iterations becomes too large.
To solve the least squares problem in Equation 10, you can use the function lstsq from np.linalg.
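As a quick sanity check of the interpolation API, here is a sketch on a toy image whose intensity is linear in (u, v), so the interpolated values and gradients are known exactly (the variable names are our own):

```python
import numpy as np
from scipy.interpolate import RectBivariateSpline

# toy image: img[v, u] = 3*v + 2*u, so value and gradients are known exactly
rows, cols = np.arange(10.0), np.arange(12.0)
img = 3.0 * rows[:, None] + 2.0 * cols[None, :]
spline = RectBivariateSpline(rows, cols, img)

val = spline.ev(1.5, 2.25)           # intensity at the sub-pixel point (v=1.5, u=2.25)
dI_dv = spline.ev(1.5, 2.25, dx=1)   # derivative along the first (row, v) axis
dI_du = spline.ev(1.5, 2.25, dy=1)   # derivative along the second (column, u) axis
```

Note that `dx`/`dy` in `ev` are derivative orders along the spline's first and second axes (rows then columns), which is easy to mix up with image (u, v) conventions.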
Writeup: Please include indented code of the function LucasKanade(). The verbatim command in
latex may be helpful.
Q2.2: Lucas-Kanade Forward Additive Alignment with Affine Transformation (20 points)
Write the function with the following function signature:
M = LucasKanadeAffine(It, It1, rect)
Unlike Q2.1, here we compute the optimal local motion M represented by an affine transformation with 6 parameters. In other words, the output M should be a 2 × 3 affine transformation matrix.
Writeup: Please include indented code of the function LucasKanadeAffine().
Q2.3: Matthews-Baker Inverse Compositional Alignment with Affine Transformation (20 points)
M = InverseCompositionAffine(It, It1, rect)
As in Q2.2, the function should output a 2 × 3 affine matrix that aligns the current frame with the template, but you need to implement the inverse compositional alignment following the equations in the preliminaries or the slides from class.
Writeup: Please include indented code of the function InverseCompositionAffine().
Q2.4: Testing Your Algorithms (20 points)
Test your three algorithms on the video sequences (car1.npy), (car2.npy), (landing.npy), (race.npy) and (ballet.npy), using our provided scripts. How do your algorithms perform on each video? What are the differences between the three algorithms in terms of performance, and why do those differences arise? At what point does each algorithm break down, and why does this happen? Remember that the algorithms you are implementing are very basic tracking algorithms, so they might not work in some situations. As long as they work reasonably (not necessarily perfectly), you will receive full credit.
Writeup: Submit your results of the three algorithms on all five videos (as mentioned above, include results on frames at a reasonable interval), analyze your results, and answer the above questions.
5 Extra Credit (20 points)
Q3.1x: Adding Illumination Robustness (10 points)
The LK tracker as it is formulated now breaks down when there is a change in illumination because
the sum of squared distances error it tries to minimize is sensitive to illumination changes. There are
a couple of things you could try to do to fix this. The first is to scale the brightness of pixels in each
frame so that the average brightness of pixels in the tracked region stays the same as the average
brightness of pixels in the template. The second is to use a more robust M-estimator instead of least
squares (so, e.g. a Huber or a Tukey M-estimator) that does not let outliers adversely affect the
cost function evaluation. Note that doing this will turn our least squares problem into a weighted least squares problem, i.e. each residual term gets a corresponding weight Λii, and your minimization function will look like

L = Σ_x Λii [T(W(x; ∆p)) − I(W(x; p))]^2    (21)

leading your Jacobian computation to be a weighted one, instead of what you have seen earlier (Eq. 11):

A^T Λ A ∆p = A^T Λ b    (22)

⇒ ∆p = (A^T Λ A)^-1 A^T Λ b    (23)
Here Λ is a diagonal matrix of weights, one per residual term, computed as per the choice of robust M-estimator used, A is the Jacobian matrix over the template pixels considered in the cost function, and b is the corresponding vector of residuals. Implement these two methods and test your new tracker on the provided video sequences again.
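One possible instantiation of the M-estimator route (a sketch: the Huber threshold delta and the helper names are our own choices, not mandated by the assignment):

```python
import numpy as np

def huber_weights(residuals, delta=1.0):
    """Huber M-estimator weights: 1 for small residuals, delta/|r| beyond
    delta, which down-weights outliers in the weighted least squares solve."""
    r = np.abs(residuals)
    w = np.ones_like(r)
    mask = r > delta
    w[mask] = delta / r[mask]
    return w

def weighted_lstsq(A, b, w):
    """Solve (A^T Λ A) dp = A^T Λ b (Equations 22-23) with Λ = diag(w)."""
    Aw = A * w[:, None]                      # multiply each row of A by its weight
    return np.linalg.solve(A.T @ Aw, Aw.T @ b)
```

With all weights equal to 1 this reduces to the ordinary least squares step, which is a convenient regression test for your robust tracker.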
Implement these modifications for the Lucas-Kanade tracker that you implemented for Q2.1. Use
the same function names with ‘Robust’ appended.
Writeup: Please include the relevant code and submit the results of your 'Robust' algorithm on any one of the videos, compare with the results from the Lucas-Kanade tracker in Q2.1, and explain any differences.
Q3.2x: LK Tracking on an Image Pyramid (10 points)
If the target being tracked moves a lot between frames, the LK tracker can break down. One way to
mitigate this problem is to run the LK tracker on an image pyramid instead of a single image. The pyramid tracker starts by performing tracking at a higher level (smaller image) to get a coarse alignment estimate, propagates this estimate down to the next lower level, and repeats until a fine aligning warp has been found at the lowest level of the pyramid (the original-sized image). In addition to being more robust, the pyramid version of the tracker is much faster, because its coarse-to-fine approach needs fewer gradient descent iterations on the full-scale image.
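A minimal sketch of the pyramid construction and the coarse-to-fine loop (the 2 × 2 box filter and the per-level tracker call are illustrative placeholders; a real implementation would typically blur before downsampling, e.g. with a Gaussian, as in [3]):

```python
import numpy as np

def build_pyramid(img, levels=3):
    """Build an image pyramid by repeated 2x2 box-filter downsampling (a sketch)."""
    pyramid = [img]
    for _ in range(levels - 1):
        prev = pyramid[-1]
        h, w = (prev.shape[0] // 2) * 2, (prev.shape[1] // 2) * 2
        # average each 2x2 block to halve the resolution
        down = 0.25 * (prev[0:h:2, 0:w:2] + prev[1:h:2, 0:w:2]
                       + prev[0:h:2, 1:w:2] + prev[1:h:2, 1:w:2])
        pyramid.append(down)
    return pyramid  # pyramid[0] is full resolution, pyramid[-1] the coarsest

# Coarse-to-fine loop (illustrative pseudocode around a hypothetical
# per-level tracker, not a function from the starter code):
# p = np.zeros(2)
# for level in range(levels - 1, -1, -1):
#     p = 2.0 * p   # a translation doubles when moving down one pyramid level
#     p = LucasKanade_at_level(pyr_It[level], pyr_It1[level], rect_at(level), p)
```

The key design point is the `2.0 * p` rescaling: a motion of one pixel at a coarse level corresponds to two pixels one level finer, so the coarse estimate seeds the finer search close to the optimum.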
Implement these modifications for the Lucas-Kanade tracker that you implemented for Q2.1. Use
the same function names with ‘Pyramid’ appended.
Writeup: Please include the relevant code and submit the results of your 'Pyramid' algorithm on any one of the videos, compare with the results from the Lucas-Kanade tracker in Q2.1, and explain any differences.
6 Submission Summary
• Q1.1 Derive the expression for the Jacobian Matrix
• Q1.2 What is the computational complexity of Lucas-Kanade method?
• Q1.3 What is the computational complexity of Matthews-Baker method?
• Q2.1 Write the Lucas-Kanade tracker with translation
• Q2.2 Write the Lucas-Kanade tracker with affine transformation
• Q2.3 Write the Matthews-Baker tracker with affine transformation
• Q2.4 Test three algorithms on the provided sequences and analyze the results.
• Q3.1x Add illumination robustness
• Q3.2x LK Tracking on an image pyramid.
References
[1] Simon Baker, et al. Lucas-Kanade 20 Years On: A Unifying Framework: Part 1, CMU-RI-TR-02-16,
Robotics Institute, Carnegie Mellon University, 2002
[2] Simon Baker, et al. Lucas-Kanade 20 Years On: A Unifying Framework: Part 2, CMU-RI-TR-03-35,
Robotics Institute, Carnegie Mellon University, 2003
[3] Bouguet, Jean-Yves. Pyramidal Implementation of the Lucas Kanade Feature Tracker: Description of
the algorithm, Intel Corporation, 2001