CSCE 633 Homework I


Note: For written questions, you can either turn in a scanned copy of your handwritten answers
or a PDF file of your answers. For programming questions, you need to submit your code. Please
put your code and written answers in a zip file and submit it on Canvas.
Problem 1: Cosine and Dot Product Similarity (60 points)
In this homework assignment, you are required to compare the retrieval performance of two different
similarity measures, i.e., dot product and cosine similarity. The document collection has already
been preprocessed, with one file for each document. The collection of cleaned up documents and
queries can be downloaded from Canvas (Assignment/Homework I/hw1 data.zip). Upon unzipping
the file, you can see two folders. One folder named docs contains all documents, with one file for
each document. Similarly, the folder named queries contains one file for each query.
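As a rough illustration of reading the two folders described above, here is a minimal Python sketch. The function name `load_collection` is hypothetical, and the whitespace tokenization is an assumption based on the statement that the files are already preprocessed; adjust it if the files use a different layout.

```python
from pathlib import Path


def load_collection(folder):
    """Map each file name in `folder` to its list of whitespace-separated tokens."""
    collection = {}
    for path in sorted(Path(folder).iterdir()):
        if path.is_file():
            # Assumption: files are cleaned plain text, so splitting on
            # whitespace is enough to recover the terms.
            text = path.read_text(encoding="utf-8", errors="ignore")
            collection[path.name] = text.split()
    return collection
```

With the archive unzipped into the working directory, `load_collection("docs")` and `load_collection("queries")` would return the tokenized documents and queries, keyed by file name.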
First, extract the vocabulary from the document collection and create a vector representation for
each document and query. Let $n$ be the number of unique words extracted from the document
collection. Let $d = (d_1, \ldots, d_n)^\top \in \mathbb{R}^n$ denote the vector representation of a
document, where $d_i$ is the term frequency of the $i$th term in the vocabulary. Similarly, denote
a query by $q = (q_1, \ldots, q_n)^\top \in \mathbb{R}^n$. Two similarity measures will be computed
and compared. For dot product similarity, the document-query similarity is computed as
$$S_{\mathrm{dot}}(d, q) = d^\top q = \sum_{i=1}^{n} d_i q_i = d_1 q_1 + d_2 q_2 + \cdots + d_n q_n$$
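As a rough illustration of the vector construction and the dot product score above, here is a minimal Python sketch. The function names and the toy tokens are hypothetical and not part of the assignment data. Ignoring query terms that are missing from the document vocabulary is an assumption, made because the vocabulary is built from the documents only.

```python
from collections import Counter


def build_vocabulary(doc_tokens):
    """Return a sorted list of the unique terms found across all documents."""
    return sorted({term for tokens in doc_tokens for term in tokens})


def tf_vector(tokens, vocabulary):
    """Term-frequency vector aligned with `vocabulary`; unseen terms are ignored."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


def dot_similarity(d, q):
    """S_dot(d, q) = sum_i d_i * q_i."""
    return sum(di * qi for di, qi in zip(d, q))


# Toy data, made up for illustration only.
docs = [["apple", "banana", "apple"], ["banana", "cherry"]]
vocab = build_vocabulary(docs)              # ['apple', 'banana', 'cherry']
d0 = tf_vector(docs[0], vocab)              # [2, 1, 0]
q = tf_vector(["apple", "durian"], vocab)   # [1, 0, 0]; 'durian' is out of vocabulary
print(dot_similarity(d0, q))                # 2
```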
For cosine similarity, the document-query similarity can be computed by
$$S_{\cos}(d, q) = \frac{d^\top q}{\|d\|_2 \|q\|_2} = \frac{\sum_{i=1}^{n} d_i q_i}{\sqrt{\sum_{i=1}^{n} d_i^2}\,\sqrt{\sum_{i=1}^{n} q_i^2}}$$
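The cosine score above could be computed along the following lines. This is only a sketch: the function name is hypothetical, the vectors are made-up toy values, and returning 0.0 for a zero vector is an assumption, since the formula itself is undefined in that case.

```python
import math


def cosine_similarity(d, q):
    """S_cos(d, q) = (d . q) / (||d||_2 * ||q||_2)."""
    dot = sum(di * qi for di, qi in zip(d, q))
    norm_d = math.sqrt(sum(di * di for di in d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    if norm_d == 0 or norm_q == 0:
        # Assumption: treat the undefined zero-vector case as similarity 0.0.
        return 0.0
    return dot / (norm_d * norm_q)


short_doc = [1, 1, 0]
long_doc = [10, 10, 0]   # same direction as short_doc, ten times longer
query = [1, 0, 0]
print(cosine_similarity(short_doc, query))
print(cosine_similarity(long_doc, query))
```

Both calls give about 0.707, whereas the dot product of the same pairs is 1 and 10. Dividing by the norms removes the effect of document length, which is one difference to look for when comparing the two rankings.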
For each query, compute the similarities between the query and all documents using both similarity
measures, and return the 10 documents with the largest scores (break ties randomly when documents
have identical scores). Then compare the documents returned by the two similarity measures and
discuss your observations. In particular, you need to submit the following:
1. For each of the five queries and for each similarity measure, report the list of 10 most similar
documents (i.e. documents with the largest similarity scores).
2. By looking at the content of the original documents, decide the relevance of the returned
documents to the query, and compare the performance of the two similarity measures.
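The ranking step described above (the 10 highest-scoring documents, with ties broken at random) might be sketched as follows. The function name `top_k`, the `seed` parameter, and the toy scores are hypothetical. The shuffle-then-stable-sort approach is one possible way to randomize ties, not a required method.

```python
import random


def top_k(scores, k=10, seed=None):
    """Return the k document ids with the largest scores, ties broken at random."""
    rng = random.Random(seed)
    ids = list(scores)
    rng.shuffle(ids)  # put the ids in random order first
    # Python's sort is stable, so equal scores keep that random relative order.
    ids.sort(key=lambda doc_id: scores[doc_id], reverse=True)
    return ids[:k]


# Toy scores, made up for illustration; "a" and "c" are tied.
toy_scores = {"a": 3, "b": 1, "c": 3, "d": 2}
print(top_k(toy_scores, k=3, seed=0))
```

Passing a fixed `seed` makes the reported lists reproducible, while leaving it as `None` gives a different tie order on each run.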
Problem 2: Singular Value Decomposition (40 points)
Let $X \in \mathbb{R}^{n \times d}$ ($n \ge d$) denote a matrix with the singular value
decomposition given by $X = U \Sigma V^\top$, where $U \in \mathbb{R}^{n \times d}$ and
$V \in \mathbb{R}^{d \times d}$ are orthonormal matrices satisfying $U^\top U = I_d$ and
$V^\top V = I_d$, and $\Sigma = \mathrm{diag}(\sigma_1, \cdots, \sigma_d)$ is a diagonal matrix with
$\sigma_i \ge 0$, $i = 1, \ldots, d$. You are asked to compute
$$(\lambda I_d + X^\top X)^{-1} X^\top$$
using $U$, $\Sigma$, and $V$, where $I_d$ is an identity matrix of size $d \times d$.
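A derived expression can be sanity-checked numerically. The sketch below, which assumes NumPy is available, computes the target quantity directly and obtains the thin SVD factors, so that a hand-derived formula can be compared against the direct result. The function name `direct_solution`, the random toy matrix, and the choice of a positive value for $\lambda$ are assumptions made for illustration.

```python
import numpy as np


def direct_solution(X, lam):
    """Compute (lam * I_d + X^T X)^{-1} X^T directly, without using the SVD."""
    d = X.shape[1]
    # Solving the linear system avoids forming an explicit inverse.
    return np.linalg.solve(lam * np.eye(d) + X.T @ X, X.T)


rng = np.random.default_rng(0)
X = rng.standard_normal((6, 3))   # toy matrix with n = 6, d = 3
lam = 0.5                         # assumed positive so the matrix is invertible

# Thin SVD: U is n x d, s holds sigma_1..sigma_d, Vt is V^T (d x d).
U, s, Vt = np.linalg.svd(X, full_matrices=False)
assert np.allclose(U @ np.diag(s) @ Vt, X)
assert np.allclose(U.T @ U, np.eye(3)) and np.allclose(Vt @ Vt.T, np.eye(3))

reference = direct_solution(X, lam)   # shape (d, n)
# Compare your own expression in U, s, and Vt against `reference` with np.allclose.
```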