1. (60 pts) Recall that the string alignment problem takes as input two strings x and y,
composed of symbols xi; yj 2 Σ, for a fixed symbol set Σ, and returns a minimal-cost
set of edit operations for transforming the string x into string y.
Let x contain nx symbols, let y contain ny symbols, and let the set of edit operations be
those defined in the lecture notes (substitution, insertion, deletion, and transposition).
Let the cost of indel be 1, the cost of swap be 10 (plus the cost of the two sub ops),
and the cost of sub be 10, except when xi = yj, which is a \no-op" and has cost 0.
In this problem, we will implement and apply three functions.
(i) alignStrings(x,y) takes as input two ASCII strings x and y, and runs a dynamic
programming algorithm to return the cost matrix S, which contains the optimal costs
for all the subproblems for aligning these two strings.
alignStrings(x,y) : // x,y are ASCII strings
S = table of length nx by ny // for memoizing the subproblem costs
initialize S // fill in the basecases
for i = 1 to nx
for j = 1 to ny
S[i,j] = cost(i,j) // optimal cost for x[0..i] and y[0..j]
}}
return S
(ii) extractAlignment(S) takes as input an optimal cost matrix S and returns a
vector a that represents an optimal sequence of edit operations to convert x into y.
This optimal sequence is recovered by finding a path on the implicit DAG of decisions
made by alignStrings to obtain the value S[nx; ny], starting from S[0; 0].
extractAlignment(S) : // S is an optimal cost matrix from alignStrings
initialize a // empty vector of edit operations
[i,j] = [nx,ny] // initialize the search for a path to S[0,0]
while i 0 or j 0
a[i] = determineOptimalOp(S,i,j) // what was the optimal choice here?
[i,j] = updateIndices(S,i,j,a) // move to next position
}
return a
When storing the sequence of edit operations in a, use a special symbol to denote
no-ops.
(iii) commonSubstrings(x,L,a) which takes as input the ASCII string x, an integer
1 ≤ L ≤ nx, and an optimal sequence a of edits to x, which would transform x into
y. This function returns each of the substrings of length at least L in x that aligns
exactly, via a run of no-ops, to a substring in y.
(a) From scratch, implement the functions alignStrings, extractAlignment, and
commonSubstrings. You may not use any library functions that make their implementation trivial. Within your implementation of extractAlignment, ties must
be broken uniformly at random.
Submit (i) a paragraph for each function that explains how you implemented it
(describe how it works and how it uses its data structures), and (ii) your code
implementation, with code comments.
Hint: test your code by reproducing the APE / STEP and the EXPONENTIAL /
POLYNOMIAL examples in the lecture notes (to do this exactly, you’ll need to use
unit costs instead of the ones given above).
(b) Using asymptotic analysis, determine the running time of the call
commonSubstrings(x, L, extractAlignment( alignStrings(x,y) ) )
Justify your answer.
(c) Describe an algorithm for counting the number of optimal alignments, given an
optimal cost matrix S. Prove that your algorithm is correct, and give is asymptotic
running time.
Hint: Convert this problem into a form that allows us to apply an algorithm we’ve
already seen.
(d) String alignment algorithms can be used to detect changes between different versions of the same document (as in version control systems) or to detect verbatim
copying between different documents (as in plagiarism detection systems).
The two data string files for PS6 (see class Moodle) contain actual documents
recently released by two independent organizations. Use your functions from (1a)
to align the text of these two documents. Present the results of your analysis,
including a reporting of all the substrings in x of length L = 10 or more that could
have been taken from y, and briefly comment on whether these documents could
be reasonably considered original works, under CU’s academic honesty policy.
(e) (10 pts extra credit) Find a different document, from the same organization that
produced x, which contains substantial copied text from some other document y,
produced by a different organization. Present your results. (This is real detective
work!)
(10 pts extra credit) Infinite Monkey Theorem: a monkey hitting keys at random
on a typewriter keyboard for an infinite amount of time will almost surely type a
given text, such as the complete works of William Shakespeare. Let’s find out!
The data MuchAdo txt file for PS6 (see class Moodle) contains an ASCII version
of Shakespeare’s play Much Ado About Nothing, Act 1 Scene 2, which will serve as
an input string y. Write a function that takes as input the data MuchAdo freqs
file for PS6 (see class Moodle), which gives the frequencies of the ASCII characters in the scene, and an integer n, and outputs a file x that contains n characters
drawn randomly but with the given frequencies. Then, using your string alignment functions, determine what value of n is required to produce an overlap of 7
characters. Present your results with a brief discussion about what you learned.
2. (25 pts) Vankin’s Mile is a solitaire game played by young wizards on an n × n square
grid. The wizard starts by placing a token on any square of the grid. Then on each
turn, the wizard moves the token either one square to the right or one square down.
The game ends when the wizard moves the token off the edge of the board. Each
square of the grid has a numerical value, which could be positive, negative, or zero.
The wizard starts with a score of zero; whenever the token lands on a square, the
wizard adds its value to his score. The object of the game is to score as many points
as possible, without resorting to magic.
For example, given the grid, the wizard can score 8 − 6 + 7 − 3 + 4 = 10 points by
placing the initial token on the 8 in the second row, and then moving down, down,
right, down, down. (This is not the best possible score for these values.)
a) Give an algorithm (including pseudocode) to compute the maximum possible score
for a game of Vankin’s Mile, given the n × n array of values as input.
(b) Prove that your algorithm is correct.
(c) Analyze your algorithm’s time and space requirements.
3. (15 pts) A simple graph (V; E) is bipartite if and only if the vertices V can be partitioned into two subsets L and R, such that for every edge (i; j) 2 E, if i 2 L then
j 2 R or if i 2 R then j 2 L.
(a) Prove that every tree is a bipartite graph.
(b) Adapt an algorithm described in class so that it will determine whether a given
undirected graph is bipartite. Give and justify its running time.
4. (15 pts) Prof. Dumbledore needs your help to compute the in- and out-degrees of all
vertices in a directed multigraph G. However, he is not sure how to represent the
graph so that the calculation is most efficient. For each of the three possible representations, express your answers in asymptotic notation (the only notation Dumbledore
understands), in terms of V and E, and justify your claim.
(a) An edge list representation. Assume vertices have arbitrary labels
(b) An adjacency list representation. Assume the vector’s length is known.
(c) An adjacency matrix representation. Assume the size of the matrix is known