project 2 A permutation 𝑝 of 𝑛 symbols

Your shopping cart is empty.

Programming project 2
Submission: ipython notebook with your code and result displayed.
Part 1
A permutation 𝑝 of 𝑛 symbols is an ordered list of 𝑛 symbols (no repetitions). We call 𝑛 the
length of the permutation.
In the following, we use the symbols 𝑎, 𝑏, 𝑐, ….. For example, 𝑝 = 𝑐𝑒𝑓𝑑𝑎𝑏 is a permutation of
length 6 on the symbols 𝑎, 𝑏, 𝑐, 𝑑,𝑒, 𝑓.
We denote by 𝑝[𝑖] the symbol of 𝑝 at position 𝑖, 𝑖 = 1, ⋯ , 𝑛, and by 𝑝𝑜𝑠𝑝[𝑥] the position of
symbol 𝑥 in 𝑝. If 𝑝 = 𝑐𝑑𝑎𝑏 is a permutation on 4 symbols we have 𝑝[3] = 𝑎 and 𝑝𝑜𝑠𝑝
[𝑎] = 3.
Given two permutations 𝑝 and 𝑞 of the same length n we want to compute their distance. Many
definitions of distance between two permutations exist in the literature. In this project, we consider
the following common definitions: Hamming distance, Kendall’s distance, Spearman distance.
You have to write a Python program that takes two permutations from the standard input (use
function input(), example code from example1.ipynb). The input permutations have length less
than 26 and are on symbols of the English alphabet. It then:
1. checks that each input permutation is valid, i.e. there are no repeated symbols. If not the
program stops.
2. checks that the permutations are on the same set of symbols. If not the program stops.
3. Computes and prints the above 3 distances between the input permutations: Hamming
distance, Kendall’s distance, Spearman distance.
Below are the definitions of the distances:
Hamming distance: Given two permutations 𝑝 and 𝑞 of length 𝑛, it counts the number of
positions at which 𝑝 and 𝑞 have a mismatch.
Ex. 𝑝 = 𝑐𝑒𝑓𝑑𝑎𝑏 and 𝑞 = 𝑐𝑓𝑒𝑏𝑎𝑑
The Hamming distance of 𝑝 and 𝑞 is 4 since the two permutations differ in 4 positions; i.e.
2,3,4,5.
What is the maximum value of the hamming distance between any two permutations as a function
of n? No written answer is necessary, just think about it.
Kendall’s distance: The Kendall’s tau distance KD counts the number of inversions occurring
between two permutations, i.e. the number of pairs of symbols that appear in opposite order.
Formally,
𝐾𝐷 = 𝑡ℎ𝑒 𝑛𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑝𝑎𝑖𝑟𝑠 𝑜𝑓 𝑠𝑦𝑚𝑏𝑜𝑙𝑠 (𝑥, 𝑦)such that (( 𝑝𝑜𝑠𝑝
[𝑥] <
𝑝𝑜𝑠𝑝
[𝑦] 𝑎𝑛𝑑 𝑝𝑜𝑠𝑞
[𝑥] 𝑝𝑜𝑠𝑞
[𝑦]) 𝑜𝑟 ( 𝑝𝑜𝑠𝑝
[𝑥] 𝑝𝑜𝑠𝑝
[𝑦] 𝑎𝑛𝑑 𝑝𝑜𝑠𝑞
[𝑥] < 𝑝𝑜𝑠𝑞
[𝑦])).
Ex. 𝑝 = 𝑐𝑒𝑓𝑑𝑎𝑏 and 𝑞 = 𝑐𝑓𝑒𝑏𝑎𝑑
The distance is 4 since there are 4 inversions, precisely (e,f), (a,b), (b,d) (a,d)
What is the maximum value of the Kendall distance between any two permutations as a function
of n? No written answer is necessary, just think about it.
Implementation hint: (However, you can use any other way except, of course, functions in
statistical packages)
Represent permutation p by a precedence matrix P, that is a 𝑛 × 𝑛 matrix defined as follows:
𝑃[𝑥, 𝑦] = 1 if symbol x precedes y, and 𝑃[𝑥, 𝑦] = 0 otherwise.
For example, the matrix P associated to 𝑝 = 𝑐𝑒𝑓𝑑𝑎𝑏 is
You can compute the Kendall distance between p and q as follows. Construct the two matrices P
and Q as above and …. (you continue).
Spearman distance: The Spearman distance or position-based distance is the sum of the
absolute differences of the positions of the symbols, i.e
𝑆𝐹 = ∑ | 𝑝𝑜𝑠𝑝
[𝑥] − 𝑝𝑜𝑠𝑞[𝑥]
𝑥
|.
Ex. 𝑝 = 𝑐𝑒𝑓𝑑𝑎𝑏 and 𝑞 = 𝑐𝑓𝑒𝑏𝑎𝑑
The symbol a is at the same position 5 in both permutations and so is c. Thus, they contribute 0
to the above sum. The symbol b is at position 6 in p and 4 in q thus the absolute value of the
difference in position is 2, and so on. In total, the Spearman distance of p and q is 6.
What is the maximum value of the Spearman distance between any two permutations as a
function of n? No written answer is necessary, just think about it.
Problem 2.
Given a set of t DNA strings of the same length n. The frequency of a base at position j is the
number of times the base appears in position j in the set of DNA strings divided by the number t
of DNA strings.
Ex.
a b c d e f
a 0 1 0 0 0 0
b 0 0 0 0 0 0
c 1 1 0 1 1 1
d 1 1 0 0 0 0
e 1 1 0 1 0 1
f 1 1 0 1 0 0
AAAGGTTTAA
ATTTTCCCCC
GGGGCCAAG
ATTTTCAAAT
ATTAACTCCT
Here t=5 and n=10. The frequency of A at positions 0 to n-1= 9 are 4/5, 1/5, 1/5, 1/5, 1/5, 0, 1/5,
2/5, 3/5,1/5, respectively.
The frequency of T at position 1 is 0, and so on.
Let base A correspond to the integer 0, T to 1, G to 2, and C to 3. The frequency matrix is a 4 ×
𝑛 matrix M such that 𝑀[𝑖,𝑗] (i=0,⋯,3, j=0,⋯, n -1) is the frequency of base i in position j in the
set of DNA strings. Note that the frequency is between 0 and 1.
You have to write a program that takes as input the set of strings from the file “DNAsequences.txt”, computes and print the frequency matrix M. The program generates the
following output (keep the fraction):
Frequencies:
A: 4/5, 1/5, 0, 1/5, 1/5, 0, 1/5, 2/5, 3/5,1/5
T: 0, …
G: 1/5, …
C: 0, …

Shopping cart

US$0

project 2 A permutation 𝑝 of 𝑛 symbols

More products