$30
ECE 8540
Lab 7 - Viterbi Algorithm
1 Introduction
This lab report concerns the problem of using a Hidden Markov model (HMM) to develop
a code that executes the Viterbi algorithm. The HMM is composed of two different DNA
states, H and L. The H state represents high GC content and the L state represents low
GC content. From each state there is a different probability for each of the possible DNA
nucleotide (A, C, G, T) to occur. The model also provides details on the probability of
state transitions. The Viterbi algorithm is a method that allows for the most probable
path of a sequence to be computed. In this example a DNA sequence was given as the
input and the most probable path of states that produced this sequence was computed. The
HMM model in combination with the Viterbi algorithm allows us to efficiently determine
the most probable state path that produced a specific sequence of DNA. This method can
be applied to a variety of different problems if a state model is available. This method is a
large improvement when compared to previous methods for determining the most probable
path. Previous methods were very computationally expensive and required large amounts
of time to compute and produce a final output. The Viterbi algorithm reduces the number
of computations and produces accurate probability predictions.
2 Methods
In order to determine the path of maximum probability the probabilities in the model needed
to be defined. The prior probabilities were {0.5, 0.5}. The state transition probabilities for
state H were {0.5, 0.5} and for state L were {0.4, 0.6}. Within each state the possible DNA
nucleotide {A, C, G, T} could be observed with a corresponding probability of {0.2, 0.3, 0.3,
0.2} for state H and {0.3, 0.2, 0.2, 0.3} for state L.
After defining all the probabilities in the model the Viterbi calculations could be completed. It is important to note that the calculations were done using the sums of log2 of
probabilities. This method takes the log2 of each of the probabilities to adjust them from
decimal values. Probabilities are then solved for by taking the sum instead of the product. By performing this calculation the probabilities do not become extremely small with
large test sequences. This method will be further observed by the outputted probabilities.
Since log2 of all the probabilities was taken the final values will reflect numbers in the range
of -25.66 to -2.74, where the larger numbers represent higher probabilities. The algorithm
was run by iterating through each DNA nucleotide in the input sequence and finding the
highest probability for the nucleotide to be emitted by each state. The following steps were
completed to implement the algorithm:
1. Calculate the probability that the first nucleotide is emitted by state H and state L.
1
Store the results in the probability table. These probabilities can be calculated using
the following equations:
pH(1) = −1 + pNucleotideH (1)
pL(1) = −1 + pNucleotideL (2)
The -1 in the above equations represent the prior probabilities for each state, which is
the same for both. The variable pNucleotideH represents the probability that the given
nucleotide {A, C, G, T} is emitted in the H state. The variable pNucleotideL represents
the probability that the given nucleotide {A, C, G, T} is emitted in the L state. These
equations are only used in the first iteration of the algorithm.
2. For the remaining nucleotide in the sequence the following equations are used to calculate the maximum probabilities that the given nucleotide is emitted by each state.
pH(i) = pNucleotideH + max(pH(i − 1) + pHH, pL(i − 1) + pLH) (3)
pL(i) = pNucleotideL + max(pH(i − 1) + pHL, pL(i − 1) + pLL) (4)
In equations 3 and 4 pHH represents the probability that state remains in state H and
pHL represents the probability of a state transition from H to L. Similarly, pLL represents the probability that state remains in state L and pLH represents the probability
of a state transition from L to H. Equations 3 and 4 compute all the possible ways that
the nucleotide at position i could occur in each state, the maximum for each state is
what is saved and represented in the final table.
3. Increment i by one until the probabilities for each nucleotide in the test have been
calculated using equations 1-4.
4. Back track through both pH and pL comparing the probabilities at each index. Select
the maximum probability from the two states and record the corresponding state letter
(H or L) at the specific index. After the backtracking is complete a most probable state
sequence is produced. In the case where the probability from each state is equivalent
at a given index the L state is selected for the most probable state sequence.
2.1 MATLAB Implementation
Once it was determined how the Viterbi algorithm functioned the calculations were implemented in software. The 2020a version of MATLAB on a Windows operating system was
used to implement the code. All the initial probabilities were defined in the code and a for
loop was used to iterate over each one of the input sequences separately. The loop was ran
for each nucleotide in the DNA sequence. At each iteration the nucleotide in the sequence
was determined using if statements so the proper probability could be used in the equation.
If it is the first letter than equations 1 and 2 are used to determine the two maximum state
probabilities, otherwise equations 3 and 4 were used. At the conclusion of this loop the max
probabilities for both states at each index were calculated. A for loop was then used to run
through all the probabilities in reverse order. The probabilities for each state were compared
and the maximum was declared as the most probable state at that index. At the end of this
loop the most probable path was determined and the algorithm was complete.
2
Table 1: Table of max probabilities for DNA strand ’GGCACTGAA’.
DNA Nucleotide H State Probability L State Probability
G -2.74 -3.32
G -5.47 -6.06
C -8.21 -8.80
A -11.53 -10.95
C -14.01 -14.01
T -17.33 -16.48
G -19.54 -19.54
A -22.86 -22.01
A -25.66 -24.49
3 Results
For this lab the results can be observed best in a tables 1, 2, and 3. Tables 1 and 2 show the
max probabilities at each data index or nucleotide in the tested sequence. As shown in both
tables the probabilities are all represented by negative numbers. The numbers closer to zero
represent the state with the higher probability at the specific index location. These tables
were then traced backwards to produce the most probable paths shown in table 3. The path
was discovered by selecting the state with higher probability for each DNA nucleotide in
the sequence. Viewing tables 1, and 2 there are multiple instances where the probability is
this same for both states. In this situation the L state was always chosen. Depending on
the problem the algorithm can be adjusted to behave differently in the situation where state
probabilities are the same. When running the Viterbi algorithm on these simple tests the
run time was very quick pointing to the computational efficiency of the algorithm. Although
the run time was quick in this scenario it can also be credited to the small size of the tests.
The results from table 1 and row one of table 3 were verified using an external source and
were confirmed to be completely accurate. The probability of the path corresponding to
test one was 2−24.49 = 4.24E-8 when converted back from log2. The probability of the path
corresponding to test 2 two was 2−25.01 = 2.96E-8 when converted back from log2.
4 Conclusion
In this lab a Hidden Markov model (HMM) was used to develop a code that successfully
executes the Viterbi algorithm. The algorithm was tested on two examples where the inputs
were sequences of DNA nucleotide and the outputs represented the most probable state
sequence that produced the input. The results from the example were verified for accuracy
using an external reference. The algorithm functioned by iteratively calculating the max
probability that the current nucleotide in the sequence was emitted from each state, H and L.
Tables 1 and 2 represent the outputs of the maximum probability the nucleotide was emitted
from each of the states. These tables could then be backtracked through to produce the most
probable state path, displayed in 3. When backtracking the max probability from the two
3
Table 2: Table of max probabilities for DNA strand ’TCAGCGGCT’.
DNA Nucleotide H State Probability L State Probability
T -3.32 -2.74
C -5.80 -5.80
A -9.12 -8.27
G -11.33 -11.33
C -14.07 -14.39
G -16.80 -17.39
G -19.54 -20.12
C -22.28 -22.86
T -25.60 -25.01
Table 3: Summary of tested DNA sequences and corresponding state sequence.
Test # Initial DNA Sequence Most Probable State Sequence
1 GGCACTGAA HHHLLLLLL
2 TCAGCGGCT LLLLHHHHL
states was compared and the maximum was where the nucleotide was most likely emitted
from. This algorithm provides a computationally efficient method to decoding the DNA
sequence and finding the most probable state path. This method improves upon previously
used brute force computations that were time consuming and inefficient in decoding the most
probable path of a sequence. The Viterbi algorithm reduces the computational expense and
provides accurate results in decoding problems.
4