Lab 1: Introduction to Genomics
Name: Your Name Here (Your netid here)
Due March 11, 2021 11:59 PM
Lab 1 contains an introductory exploration of genomic data.
Important Instructions -
You are not allowed to use any built-in libraries for processing DNA sequencing data files.
Please implement all the graded functions in the main.py file. Do not change function names in main.py.
Please read the description of every graded function very carefully. Each description clearly states what is expected of that function.
After some graded functions, there is a cell which you can run and see if the expected output matches the output you are getting.
The expected output provided is just a way for you to assess the correctness of your code. The code will be tested on several other cases as well.
Preamble (Don't change this)
import random
import matplotlib.pyplot as plt
import seaborn as sns
Exploring an Illumina E. coli dataset
First, let's look at the data in the file ecoli.fastq. It contains reads generated using an Illumina sequencing machine from the E. coli genome.
#reading Illumina fastq data
reads=""
with open("ecoli.fastq") as file:
    reads=file.read()
FASTQ is a standard file format for genomic data. See the Wikipedia article on the FASTQ format. Let's look at the first 1000 characters:
print(reads[:1000])
@HISEQ03:379:C2WP8ACXX:7:1101:4288:2189 1:N:0:ACTTGA
TATTCAATTCAGACTACAGAGTGGGCGATTTTAATCTATGGACTGGTGATGATCTTCTTTTTATACATGTATGTTTGCTTCGCGTCGGCGGTTTATATCCCGGAGCTTTGGCCAACGCATTTACGCCTGCGCGGTTCGGGTTTCGTTAAT
+
CCCFFFFFHHHHHJJJJJJGIEFHJJJHIJJJJJJJJJJJJGHGJJFCEEGGIIHIIJJJJJIIIIIJJIJJJHHHFHHHFFFDDDDDDDD>>BCDEECDDDDBDDDDDCCDCDDDDDBB@DCDDDDDDDDDDDBDBBBB2<<>??CBDD
@HISEQ03:379:C2WP8ACXX:7:1101:4288:2189 2:N:0:ACTTGA
CACCGTGATCGACCCATAATGTGTTAATAATGCCGCAACGCCATAGGGCGTGAAGACTGCGACGATCCGTCCGACGGCATTAACGAAACCCGAACCGCGCAGGCGTAAATGCGTTGGCCAAAGCTCCGGGATATAAACCGCCGACGCGAA
+
CCCFFFFFHHHHHJJJJJJJJHHHIJJJJJIIJJJJIJJJJJIJIJJJJHFDFFFFEEEEDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDBDBDDDDBDDDDBBDD@DDDBBDDDDDDCDCCDDDDDB>CCDDED@BDDD9<<BB79
@HISEQ03:379:C2WP8ACXX:7:1101:4591:2228 1:N:0:ACTTGA
AATTAAAAGACACCCAGCAGTTACAAAAGTGCGCTGATCGTCTTGCCCAGAGTGCGCAGGATTTTCGTCTGCAACTCGGTGAGCCAGGTTATCGCGGTAACCTGCGTGAGCTGTTAGCTAATCCGCAAATTCAGCGGGCATTTTTACTGC
+
@@@=BDDD???ACCF?HIBD<FAHIDDHE@E@G8:66?9DFHD7F8=F3===).75;@EH=?3;);>A=@?(>AC:><?,
reads[202]
'T'
eol=0
start=0
end=0
for i in range(len(reads[:1000])):
    if reads[i] == '\n':
        eol+=1
        if eol % 4 == 1:
            start=i+1
        if eol % 4 == 2:
            end=i+1
        if end>start:
            print(start,end)
53 204
53 204
53 204
410 561
410 561
410 561
767 918
767 918
Every block of four lines corresponds to one read:
Line 1 (starting with @) is a read ID
Line 2 is the DNA sequence
Line 3 usually only has a + sign
Line 4 has the same length as the DNA sequence. Each character encodes the quality (or the reliability) of the corresponding symbol in the DNA sequence.
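To make the quality line more concrete, here is a minimal sketch of how quality characters can be turned into numbers. It assumes the common Phred+33 convention (score = ASCII code minus 33); the lab text does not state which offset this file uses, so treat that as an assumption. The helper names `decode_quality` and `error_probability` are hypothetical and are not part of main.py.

```python
def decode_quality(quality_line, offset=33):
    """Convert a FASTQ quality string into a list of integer Phred scores.

    Assumes the Phred+33 convention: score = ord(character) - 33.
    """
    return [ord(ch) - offset for ch in quality_line]


def error_probability(phred_score):
    """A Phred score Q corresponds to an error probability of 10 ** (-Q / 10)."""
    return 10 ** (-phred_score / 10)


# First six quality characters of a toy read.
scores = decode_quality("CCCFFJ")
print(scores)
print(error_probability(scores[0]))
```

Under this convention, a higher score means the machine is more confident in the corresponding base.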
%run main.py
The following line creates an object from the class in main.py. Do not change the class name and function headers!
module = Lab1()
Graded function 1: parse_reads_illumina(reads) (10 marks)
Purpose - To parse the input read file and get a list of DNA reads.
Input - a string (reads) which contains the entire reads file. You should begin by first obtaining individual lines of the file. Each DNA read corresponds to the second line of each block of four lines.
Output - a list of DNA reads
Example Output Format - ['ACGTGGGTAAACC', 'ACGTGGGAACC', 'GTGGGTAAACC']
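The following is a rough sketch of the idea on a toy string, not the reference solution. The standalone function name `parse_fastq_sequences` is hypothetical; the graded version must live in the class in main.py. It picks lines by position rather than by looking for '@', because a quality line can also begin with '@' (the third read printed above is an example).

```python
def parse_fastq_sequences(fastq_text):
    """Return the DNA sequence (second line) of every four-line FASTQ block."""
    lines = fastq_text.splitlines()
    # Sequence lines sit at indices 1, 5, 9, ... of the file.
    return lines[1::4]


# Toy input with two reads; the second quality line starts with '@'.
toy_fastq = "@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\n@@II\n"
print(parse_fastq_sequences(toy_fastq))
```

This sketch assumes every read occupies exactly four lines, as in the file shown above.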
dna_reads_illumina=module.parse_reads_illumina(reads)
print(len(dna_reads_illumina))
print(dna_reads_illumina[0]=="TATTCAATTCAGACTACAGAGTGGGCGATTTTAATCTATGGACTGGTGATGATCTTCTTTTTATACATGTATGTTTGCTTCGCGTCGGCGGTTTATATCCCGGAGCTTTGGCCAACGCATTTACGCCTGCGCGGTTCGGGTTTCGTTAAT")
print(dna_reads_illumina[1]=="CACCGTGATCGACCCATAATGTGTTAATAATGCCGCAACGCCATAGGGCGTGAAGACTGCGACGATCCGTCCGACGGCATTAACGAAACCCGAACCGCGCAGGCGTAAATGCGTTGGCCAAAGCTCCGGGATATAAACCGCCGACGCGAA")
644022
True
True
dna_reads_illumina[0]
'TATTCAATTCAGACTACAGAGTGGGCGATTTTAATCTATGGACTGGTGATGATCTTCTTTTTATACATGTATGTTTGCTTCGCGTCGGCGGTTTATATCCCGGAGCTTTGGCCAACGCATTTACGCCTGCGCGGTTCGGGTTTCGTTAAT'
Expected Output -
644022
True
True
Graded Function 2: unique_lengths(dna_reads) (10 marks)
Purpose - To return a set of all read lengths among all the DNA reads
Input - list of DNA reads
Output - set which contains different read lengths
Example Output Format - {123,156,167}
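One possible way to think about this, shown on toy data: a set comprehension over the lengths keeps each distinct length once. The function name `unique_read_lengths` is hypothetical and only illustrates the idea.

```python
def unique_read_lengths(dna_reads):
    """Return the set of distinct lengths found among the reads."""
    return {len(read) for read in dna_reads}


toy_reads = ["ACGTGGGTAAACC", "ACGTGGGAACC", "GTGGGTAAACC"]
print(unique_read_lengths(toy_reads))
```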
counts_illumina=module.unique_lengths(dna_reads_illumina)
print(counts_illumina)
{150}
Next we will look into the content of the actual reads. Are A, C, G, and T the only characters in the reads?
Graded Function 3 : check_impurity(dna_reads) (10 marks)
Purpose - To check if reads have any characters apart from the 4 defined above.
Input - list of DNA reads.
Output - you should output a list of all reads which contain any non-{A,C,G,T} characters and a set containing all the additional characters encountered.
Example Output Format -
List of reads with impurities - ['ACGTGGGBAAACC', 'ACDDGGGAACC', 'GTGGGTAABDC']
Set with additional characters - {'B','D'}
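A minimal sketch of the check on toy reads follows. The name `find_impure_reads` is hypothetical, and the sketch assumes upper-case reads, so lower-case bases would be reported as additional characters.

```python
VALID_BASES = set("ACGT")


def find_impure_reads(dna_reads):
    """Return (reads containing non-ACGT characters, set of those characters)."""
    impure_reads = []
    extra_chars = set()
    for read in dna_reads:
        # Characters of this read that are not one of the four bases.
        unexpected = set(read) - VALID_BASES
        if unexpected:
            impure_reads.append(read)
            extra_chars |= unexpected
    return impure_reads, extra_chars


toy_reads = ["ACGTGGGBAAACC", "ACGT", "GTGGGTAABDC"]
print(find_impure_reads(toy_reads))
```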
impure_reads_illumina,impure_chars_illumina=module.check_impurity(dna_reads_illumina)
print(len(impure_reads_illumina))
print(impure_chars_illumina)
1368
{'N'}
print(impure_reads_illumina[2])
CATNAAACTATGCAACATATCGCGCATTGGCCCGTTGGCGGGATTAGCGTCGGTATCAATTGGCCCTGGCTGGACGACGTTAATGGTGATCCCACGCGGTCCAAAATCACGGGCCAGCCCGCGCGCCATGCCTTGCAGGGCAGATTTGAT
The symbol N is used to represent undetermined bases (i.e., bases where the sequencing machine failed to obtain a proper reading).
Graded Function 4 : get_read_counts(dna_reads) (10 marks)
Purpose - To compute the number of times each read occurs in the entire collection of reads.
Input - list of DNA reads
Output - you should output a dictionary where the read is the key and the number of times it appears is the value.
Example Output Format - {'ACGTGGGTAAACC' : 15, 'ACGTGGGAACC' : 10, 'GTGGGTAAACC' : 5}
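As a hedged illustration on toy data, the counting can be done with a plain dictionary and a single pass over the reads. The function name `count_reads` is hypothetical.

```python
def count_reads(dna_reads):
    """Map each distinct read to the number of times it appears."""
    counts = {}
    for read in dna_reads:
        # dict.get returns 0 the first time a read is seen.
        counts[read] = counts.get(read, 0) + 1
    return counts


toy_reads = ["ACGT", "GGCA", "ACGT", "ACGT"]
print(count_reads(toy_reads))
```

Testing `read in counts` (or using `dict.get` as above) is a constant-time lookup on average, which matters when there are hundreds of thousands of reads.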
reads_counts_illumina=module.get_read_counts(dna_reads_illumina)
print(sorted(list(reads_counts_illumina.values()),reverse=True)[:5])
print(len(reads_counts_illumina.keys()))
[11, 7, 7, 6, 6]
616342
# reads_counts_illumina
a={'b':1}
a.keys()
if 'a' not in a.keys():
    a['a']=1
if 'b' in a.keys():
    a['b']+=1
a
{'b': 2, 'a': 1}
Plotting read frequencies
We will now use the count dictionary from above to plot a histogram: the x-axis is the number of times a read occurs, and the y-axis (log scale) is the number of distinct reads with that count.
def plot_frequency_histogram(read_counts):
    plt.yscale('log', nonpositive='clip')
    plt.hist([read_counts[key] for key in read_counts],bins=[i for i in range(13)])
    plt.xlabel("count values")
    plt.ylabel("no. of reads")
    plt.show()
plot_frequency_histogram(reads_counts_illumina)
Notice that most reads appear only once, and it is rare for the same read to appear many times. This is expected, since the reads are drawn roughly uniformly at random from the whole genome.
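The intuition can be checked with a toy simulation. This is only a sketch under strong assumptions: every read is identified by its start position, start positions are uniform, and sequencing errors, strands and repeats are ignored. The genome length and read count below are arbitrary small numbers, not the values for this dataset, and the function name is hypothetical.

```python
import random


def simulate_start_position_counts(genome_length, num_reads, seed=0):
    """Draw uniform start positions and count how often each one is hit."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(num_reads):
        start = rng.randrange(genome_length)
        counts[start] = counts.get(start, 0) + 1
    return counts


counts = simulate_start_position_counts(genome_length=100_000, num_reads=10_000)
singletons = sum(1 for c in counts.values() if c == 1)
# Fraction of distinct start positions that were sampled exactly once.
print(singletons / len(counts))
```

When the number of reads is small relative to the genome length, almost all sampled positions are hit exactly once, which matches the shape of the histogram.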
Exploring a PacBio E. coli dataset
Next, we will look into a read dataset obtained using a Pacific Biosciences (PacBio) machine, from the same E. coli genome.
#reading PacBio data
reads_pac=""
with open("ecoli_pac-bio.fasta") as file:
    reads_pac=file.read()
As in the case of the Illumina dataset, let's look at the beginning of the file:
a=reads_pac[:2000].split('\n')
for i in range(len(a)):
    print(a[i])
>m140930_121059_sherri_c100688052550000001823139503241542_s1_p0/24/0_7424 RQ=0.846
aaaaaaaaaaaaaaaaacaaaaaaaaaaaaaaaaaaaaagggggggggggggaaaggaggggaaaagaaaaaaaaaaaaa
aaaaaaaaaaaaaattgggggcccccccccaaaaaggaaaaattctctttttcaaacaaaaaacggtgttttttttctgg
gtggtttgggggcgaaaataaatcgcttcctttgtcttttggggccccactcctttcttcgatcagcgttttgccagcaa
aacgcaattttttttttttctttcgttttttagaagggtaaagaaacagctttcttttctttaaataggttttggccccg
tttttttcctgtttccggttccacttcaatatattttcgccattgtttccatctgcttccgaaacgccagttttcacgta
ccccggtatcgcaagcgtggcggaggaaacagccatgtttgaggcgctggtttgcaggcggcatacggcgggaacagcca
gcggatatctttaataaagcgcagaaatcgtaacaatgcgatcggcttcgtccagtaccacgcacctgaatggcacgcga
ggtttaatgtggttctgcttggcggtagtcattaagagccccgtggtggccaatcaagaaaatgtcacgccgcttcccag
cactttcagctgttttgtcgtagcccatcaccaccgtaagccaagacccagcttcaggccaagtagccttccgccagcgg
ttctgcgtcggcatggattctgcacggcaaagttcacgcgtcggtttgccataattaaggacgcgcctggattcaccttg
cgatcggcaatcgcaggaatgagagagcagataatgaaagcgttgacgtaagaaagccatcgttttcccggtaccggttt
ttgcgcctgcccggctacgtcagcgacctcgccagcgtcagcggacagggcgcaagtgccgtgaatgggccgtacagtta
tgaaaccctttttttctaaggggcttctacaacccttggatgcagggcgaagtcgggaaaacttctgttctgtttaaaat
gtgttttgctcatagtgtggtagatctcagcttactattggctttaacgaaagccgtattccggtgaaaataacagtcac
gcttttagttgttaatgttacaccaacaacgaaaccaacacgccaggcttaattcctgtggagttatatatgagcgtaaa
attattcacctgactgacgacagttttttgacacggatgtactccaagggacggggcgatcctcgtcggattttctgggc
agaggtggtgcggtccgttgcaaaatgaatcgccccgattctggatgaaatcgctgaacgaatagtcagggcaaactgac
cgttgcaaaactgaacaatcggatcaaaaccctggcactgcgcccgaaaatatggcatcgtgggtatcccgactctgctg
ctgttttcaaacggtgaagtggcggcaactcaaagtgggtgcactgtctaaaggttcagttgaaagagtttcctcgaacg
ctaacactggcgtaagggaatttcatgttcgggtgccccgtcgctaaaaactggacgccccggcgtgagtcatgctaact
tagtggtttgactttcgtattaaacataccttattaagtttgaaatctttgtaatttccaaagcaggcttcccgtttttt
cttaaatgcgaaagtgaacagatttcgctgggtcgtcactcaatccgtcttgtcgtttcagttcttgcgtagctctcctg
gtgacccaaggcagcggaacagaccatggagtcgatgaccgtaaaaacaggcatggtatgatcctgccatatataccatt
cacaacattaagttcgagatttaccccaagtttaagaagctcacacgtgcacta
reads_pac[:8000]
'>m140930_121059_sherri_c100688052550000001823139503241542_s1_p0/24/0_7424 RQ=0.846\naaaaaaaaaaaaaaaaacaaaaaaaaaaaaaaaaaaaaagggggggggggggaaaggaggggaaaagaaaaaaaaaaaaa\naaaaaaaaaaaaaattgggggcccccccccaaaaaggaaaaattctctttttcaaacaaaaaacggtgttttttttctgg\ngtggtttgggggcgaaaataaatcgcttcctttgtcttttggggccccactcctttcttcgatcagcgttttgccagcaa\naacgcaattttttttttttctttcgttttttagaagggtaaagaaacagctttcttttctttaaataggttttggccccg\ntttttttcctgtttccggttccacttcaatatattttcgccattgtttccatctgcttccgaaacgccagttttcacgta\nccccggtatcgcaagcgtggcggaggaaacagccatgtttgaggcgctggtttgcaggcggcatacggcgggaacagcca\ngcggatatctttaataaagcgcagaaatcgtaacaatgcgatcggcttcgtccagtaccacgcacctgaatggcacgcga\nggtttaatgtggttctgcttggcggtagtcattaagagccccgtggtggccaatcaagaaaatgtcacgccgcttcccag\ncactttcagctgttttgtcgtagcccatcaccaccgtaagccaagacccagcttcaggccaagtagccttccgccagcgg\nttctgcgtcggcatggattctgcacggcaaagttcacgcgtcggtttgccataattaaggacgcgcctggattcaccttg\ncgatcggcaatcgcaggaatgagagagcagataatgaaagcgttgacgtaagaaagccatcgttttcccggtaccggttt\nttgcgcctgcccggctacgtcagcgacctcgccagcgtcagcggacagggcgcaagtgccgtgaatgggccgtacagtta\ntgaaaccctttttttctaaggggcttctacaacccttggatgcagggcgaagtcgggaaaacttctgttctgtttaaaat\ngtgttttgctcatagtgtggtagatctcagcttactattggctttaacgaaagccgtattccggtgaaaataacagtcac\ngcttttagttgttaatgttacaccaacaacgaaaccaacacgccaggcttaattcctgtggagttatatatgagcgtaaa\nattattcacctgactgacgacagttttttgacacggatgtactccaagggacggggcgatcctcgtcggattttctgggc\nagaggtggtgcggtccgttgcaaaatgaatcgccccgattctggatgaaatcgctgaacgaatagtcagggcaaactgac\ncgttgcaaaactgaacaatcggatcaaaaccctggcactgcgcccgaaaatatggcatcgtgggtatcccgactctgctg\nctgttttcaaacggtgaagtggcggcaactcaaagtgggtgcactgtctaaaggttcagttgaaagagtttcctcgaacg\nctaacactggcgtaagggaatttcatgttcgggtgccccgtcgctaaaaactggacgccccggcgtgagtcatgctaact\ntagtggtttgactttcgtattaaacataccttattaagtttgaaatctttgtaatttccaaagcaggcttcccgtttttt\ncttaaatgcgaaagtgaacagatttcgctgggtcgtcactcaatccgtcttgtcgtttcagttcttgcgtagctctcctg\ngtgacccaaggcagcggaacagaccatggagtcgatgaccgtaaaaacaggcatggtatgatcctgccatatataccatt\ncacaacattaagttcgagatttaccccaa
gtttaagaagctcacacgtgcactatgaagtcttacgcgaattaaagaata\ncggccggtttctgagctgattcactctcggcgaaaatatggggctgaaaacctggctcgtatgcgtagcgggacattatt\ntttgccattcctgaagcagcagcgcaaagagtggcgaagatatctttggtgatggcgtactggagatattgcaggatgga\ntttggtttcctccgttcgcagacagctcctactgcccggtccctgatgacatctacagtttcccctagccaaatccgagc\nccgtttcaaatcttccgcactgtggataccatcctctggtaagatattcgctccgccgaaagaagggtgaacgcttattt\ngcgctgcttgaaagttttaacgaagttacttcgaacaacctgaaaaacgccgcaacaaatcctctttgagaacttacgcc\ngctgggcacgcaaacctctcgtctggcgtatggaacgtggtaacggtttcttacaatctggaagattgtaactggctgcg\ncgtactggatctggcaatcacctatcggtcgtggtcaggcgtggtcctgattgtggcacccggccgaaagcgcggtaaaa\nccatgctgctgccgaacaattgccttcagagcattgcttacaaccacccggattgtgtgctgatggttctgctgatgcgg\nacggaacgtcgcggaagaagtaaccgagatgcagcgtctggtaaaaggtgaagttgttgctttatttgacgaacctgcat\nctcgccccacgttcaggggttgcggaaatggtgaatcgagaaggcccaagaacgcccctggtttgagcacaagaaaagac\ngttatgcattcttgctcgctccatcagctcgttctggcgcagcgctttacaacaccgttggttccgggcggcgtcaggta\naagtgtttgacgcggttggtgtggatgccacgtcccctgcatccgtcgcgaaaaacgcttctttggtgggcgggcgcgta\nacgtggaaagagggggcggcacagcctgacgcattatcgcgacggcgcttctcggataccggttctaaaatggaacgaag\nttatctacgaagagtttaaaggtacaggcaacatggaactgcacctctctcgtaagatcgctgaaaaacgcgtcttccgc\nggctatcgactacaaccgttctgggtaccccgtaaagaagagctgtcacgactcaggaagcaactgcagtttaaactgtg\ngattcctgcgcaaaatcaatttcaagacccgatggcgaaatcgatgcaatggaatttcctcattaataacctggcaatga\ncgcaagaccaatgacgatttctttcaaatgatgaaacgctcataaatgttgtcttatgccaaaaaacgccaaccgtgttt\nacgtggcgttttgcttttatatctgtaaatcttaaatgccgcgctggcgatgttaggaaaattcagctggaattttgctg\ngcatggtttatggcaatttgcatatcacatggtttaatttttggcaacaggactggtgggtttttggaagcggacttttc\nccttctgaatcaaggtcttcgtgttatacttctgctaataactttttcttctgagagcatgcattggtgatttactggac\nagtgacgtcactgatctcatcagtatttttttattcacgtggcgacactgttttctgtttttggcccgtaaggtggcaaa\naaaaaaagttcccccggggtttttagtggataaaccaaacttcccgcacacgtcaccagggattgatacctctcgttggg\ngggatttccggtttacgcagggattttgcctttcagcgttcggaatttgtcgattactataatttccgcaatgcatcttc\ntctcatctcggcttgtgccggttgtggcttgttttcaattggcgcgctggatgaaccggtt
ttgaatatcagcgtaaaaa\ntccgtgcccccatacaggccgcttgttggcattgttatgattggtgttcggcaaggctttaatctcagtagcctgggtta\ntactctttggctcctgggagatggtgctcgaccgtttggtttacttcctgacgctatttgccgggtctgggcggccatta\natggcgttcaacatggttgatggcatttgatggctttgcatggggggttgtcactgcgtctgcgtttgcagcaatcgtat\ngatttttggtggttcgacggggcaaaccagcctcgcaatctggtgcctgttgcgatgatcgcccgccattcctgccatac\natccacaatgcctttaaacgcttggtaatcctggtgcgccgctacaaagtcctttatgggtgatgcgggcaagtacgctg\nattggttttaccgtttatctggatcctgctcgaaaccgacccaggcaaaacgctatcccatcagcccccggttttaccgc\ntttgtggataatccgcgcattcttttttcgcttaatggatatggtggcgattttcatgtaccgtcgctggttaaaaggca\ntgagccccattctctcctgaccgtcagcatatcaccatttgatcatgcggttgccggggtttcttcccgtcaggcgtttt\ngtgctgattacccttgccgcagcactgctcgcttccattgggcgtggctggcagaatattctgcattttgtcccggagtg\nggtgccatgctggtgcttctttttgctagccaattcttctctatgggaattaattggcattaagcgtgccctggaaagtt\ngctcgcttttattaaagcgcgtaaaacgcagaatgcgtaagaaatcgtggtggagccgcaatttaccaaaataaatgagg\ngatgtgatgagacaacgcaatgctgggaaaccgggccgaagacgctgaaaatgaactgggatattcgtgggttggtttcg\ntaccttggtgggctgggaagctatggatttattggcaatggggctggcgtttgcgtaatcgcgctggcgtatacgttttt\ntgctcgtcaggagtggaggctgcgacggcgattaccgatcgttcccaacggtggaatatgcttggggggatattacttcc\ngccaagcatgcaatttttgcgtaacctggatgtccgttcaacatggctttctgccgaccaaccatcgtcatggacgaagc\nccctaaaagagtttgtttgtatgcaactggcctcgtgggataccgcagagagttctggctgggcaaaccgactattaaca\naacagcgggatggtgggcaaacccgcaaaaggccgatgcggccgttggctggatgaaaatgattaaacacatccagttat\nccccggagactttaccgcgcgcggtcaatgaacagcgtgaagctttattgccgaaaccgcgcctgacgctaataacctgt\nttacgtcagtaatgtttgctttttgccagccagcgtgcaagccagccatctgaatgatgagctgaagggcggcagggcgg\ncggcgtaaccatccagaatgaaagctcaggtgaagcgtcaggaagaggtggcgaaaagccattctacgaccggccggatg\naacgagcattgagcattgccgctgaaaaattgctgaagcagcataatatttcgcgcagtgggcgacagaatgtaccttgc\ncgaggaattttacctgaattcagaaatgttcctgcttgggcgtgcaatgcttcaggctcgactggaaaaatttacagggc\ncgtcgtccggcccctttggaaaactcgactatgatcagaatcgggcccatgttaaacacctctgaatgttggtccaaaac\ncctggatccgcggttttttcaggacctatcgctatttgcgtacgccggaagaagccggtaaaacgcgatagccggcacgt\ntcgttgccctt
cctgatgattatggtggggttgtcgggggctgatcgggggcttggtgtcgcattaacataccgccgttg\nctcccgaaatagcaacactgctgcggttggaggcgcaaaggcgccgtccgcttattcgacagaggaatcgatgtgaaagt\nacttggactgtcaatttggttacgcggcccggaagcaatcaagatggcgccccgttggtgctgcgttggcaaaagatcct\nttttttgaggctaaagtttgcgtcactgcgcagcatcgggaagatgctcgatcaggtgctgaaactctttcgcattgtaa\nacctgactacgatctcaacattaatgcagccaggacatggggcctgacatgagctaactgtcggattctggaaggggctc\naaaccctattttgccgagttcaaacagacgtcgggtgctggttcacggcgatacgaccgacgacggctggcaaccagcgc\ntgggcggcgttttatcagcgtaattcctgttggtacacgtttgaggctgggtctggcgcacgggcgattctctattcgcc\ngtggccggaattttgaggctaaaccgtaacattggaccgggcatctggcgatgtatcacttttcctcttccaacgcgaaa\nctcccggcaaaaacttgctgcgtgaaaaacgtttgcggatagccgaatcttccatttaccgggtaatacagtccattgat\ngcacctgttatgggtgcgtgaccaaggttgatgagcagcgacaagctgcgttcagaactggcggcaaattaacggcgcgt\nttatccgaccccgatcaaaaagatgatcttggtgaccggtcacaggcgtgagagggttttcggtcgtggctttgaagaaa\ntctggccacgcgctggcagacatcgcacccacgcaaccaggacatccagattggtctatccggtgcatctcaacccgaac\ngttccagagaaccggtcaatcgcattctggggcatgtgaaaaatggtcattctgatcgatccgcaggagtatttaaaaac\ncgttttgtctggctggatgaaccacgcctggctgattttgaccggactcaggcggcattcaggaagaagcgccttcgctg\nggacaacctgtgctgggtgatgcgcgataccactggagcgtccggaaaagcggtgagacggcgggtacgggttgcgtctg\ngtaggcacggataagcagcgaattggtcgaggaaggtgacgcgggtctttaaaaaggacgaaacgaatatcaagctatga\ngccgcgccccataaccccgtattggtgatggtgcaggcatgctctcggcattccctggaagcgttaaaaaataatcggat\natcactatgagttttggccggacgcatttctgttatcggactggggtaattatcgggctgccaacggcagcagcgtttgg\ncctcacggcaaaaacaggttcattgggtggtcgatatatcaaaccaacatgtcggttgatcaccaatcaatgcgtcggcg\ngaaaatccatatcgtcgaactgatttgggcgagtgtagtaaaacaactgccgtagaaggcggttttttagcgagcgaggc\nacgacgccagttgacagcgggatgcgccgccgcgcccgccgccgccgccgcgcgcggccgcccg\n>m140930_121059_sherri_c100688052550000001823139503241542_s1_p0/31/3358_19870 
RQ=0.816\ntatgcgaggatcaggttgctggaccgttatcccatattttggaacgctttatgcgcgcgcttccaaggacggtgacggga\natctgcaggccaatgcgccgctcgacgaagttctccatcttgtggacgggcgcttgctattacgccctgcgggcggcaat\nctggaacagtggataagagcgcgtcacaacagtggcgacttacaggcctgttaaattccccggaaccttttgagggaagt\ntgcagggcaacttggccggcgtcggggttccaaccggccaggcggggcgattctcatcgctttctgctgg'
a='abcdefg\nqwer'
a.split('\n')
['abcdefg', 'qwer']
Unlike the Illumina file, which was in the FASTQ format, this one is in the FASTA format. The FASTA format is simpler. Each line starting with a '>' contains the read ID. After that, many lines may follow, with the content of a single read.
Graded Function 5: parse_reads_pac(reads) (10 marks)
Purpose - To parse the input read file and obtain a list of DNA reads.
Input - a string which contains the entire PacBio reads file. You should begin by first getting individual lines of the file. Consecutive reads are separated by a line which begins with '>'. Note that a read can span several lines in this case, as opposed to the single-line reads in the Illumina file.
Output - a list of DNA reads
Example Output Format - ['ACGTGGGTAAACC', 'ACGTGGGAACC', 'GTGGGTAAACC']
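A rough sketch of the multi-line case on a toy string is shown below; it is not the reference solution, and the standalone name `parse_fasta_sequences` is hypothetical. The idea is to collect sequence lines until the next header line and then join them into one read.

```python
def parse_fasta_sequences(fasta_text):
    """Return one string per FASTA record, joining its sequence lines."""
    reads = []
    current = []  # sequence lines of the record being read
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            # A header closes the previous record, if it had any sequence.
            if current:
                reads.append("".join(current))
                current = []
        elif line:
            current.append(line)
    # The last record has no header after it, so flush it here.
    if current:
        reads.append("".join(current))
    return reads


toy_fasta = ">read1 RQ=0.8\nacgt\nttga\n>read2 RQ=0.9\nggcc\n"
print(parse_fasta_sequences(toy_fasta))
```

Joining the collected lines once per record avoids repeated string concatenation, which can be slow for reads that are tens of thousands of bases long.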
dna_reads_pac=module.parse_reads_pac(reads_pac)
print(len(dna_reads_pac))
for i in range(10,15):
    print(len(dna_reads_pac[i]))
1004
19944
21731
21133
13502
8134
reads_pac[0:20].upper()
'>M140930_121059_SHER'
dna_reads_pac[-1]
'agagagagataataatttttttaaaaatttggggttttttttttttttttttttttttttgggcgcgggacccggcggggttaaaaggcggccaaaaccattgtttggggggcttggaattcctggccccggaggatcccccaccccgccaggtagcagtttcaaaggtttttcagccaacatcaaaaaagggtggaggataccgacataagcggggttccgtgcctcccaatcccattttttcgtttttccgccaccggaccggagaaacacgaggccaactcccgcaactatcgccctttgcaaaagcctgttttctgccagctgccaaaattgttctctttgctgcgccaaattaccgagaaccgacgctgaccaccgagacgggcttttttgttcgagatcgacgcctgggtttgctctggcgcgctccttgggtttggcccgtgcgggtaatcgccacagccggagcctccactccacgggtgtgcgtttcgccagtgcgtgacgcgtctggctgagccccgcatcggaaacgtgcccgctgctgatggtcagtaccgcataacggcgtggaatgcgttcttgccaattccgccagaccacgtaaaccatgtgtttcactgtcgcttaaccgtcctgtacgctgacggtgctggcatcgtttagacaacccgcgccagtttaaaaggatgaacccagttttgccgcccagtaatttgccggcgacgggtggtttggcagcagcaccaggccgtctgcgccgtgctgggcgaaatagtggcagccatgacaccggcgtaatcttccgatcatcgatgtccgcgttttgccgtttaatttttccaagacatgattagcgccgagcttgggattgcctgtgcgcacgtcgcatcaaattgaggacaaagggtttgatttgattagctaaagcgcctgcgcccgattcatacagttccggcagagacagagaaggggtatcgctgaatacccagacttgagaaaacgtgttcatagcacccctgtaaattaaatgactttgcgaagatttttcagcaacatgccggcgatctggtttcttcgcgtcgcccttcaatcacgatgccgcttgacggttcgcgctgtttcggcgcggcaacctgtttgttctgacaggtgccctctgcgtaaaaccaatatccgccgccgaccatacctggacgggcttttttcgccgcggccgagaaatggctttcatcgaaggaatttgtggggattgatatcagtggaaacagcaacaaccgcaggcagcggaatgcttaaggtttcggtttcatcttcccagttcggcgctcaacggtgagggtatctgccgtcagggagatacaattttgcttgacgccgttaactgccggaaatattgaggatttcgcaacaccagcagaaccaacctgcctgggcataaagggtcggaagaaccatcgccacagagatcagatcaaagcctgcttttctgggcggctgcagccagtgccgctccgccgtttgttgcggcagtgcctgcctcgaactggtcatcaatcaccacatcagttcatccgggccgcgcgatagcacatctttacggccgctttggcgtttgtgtcaggcttaccgcccacacttaaggctgtcacctgccgcctactgctgcctgttgcttagctgggcaagccgcttcaatagcgttgagatcgtattgtcttattttggcatcggcctttgctgaaagtctaatgaaccatacagcatttattgaccgcaatatcctgttcattcaaggcacgcacttatagcaagtaaataatcttcattgcaatctccagaacatcatgaaggaaacgtcgtttcaatatcggtggtgttaattacagatatcacgcccttttaaattctgaaaaataaaactttaatcaccaataatttgaaaatgtcaccgcaaagaataaatccaaactttcaatacttgttatcagtttcc
tcacacaatatgaataacaacggcgtagcaaaaagaaatttcaaaatattgttttttaattggatcaccatcattgaaaagtagatgttatttttttactcacaacagaagcataacaaactgttattaataaattaacgaaaaaacgccctctatcacccgaaaacaaaaaatgtataccatcacagaattaacagcttattgaatacccatatatgagttagccattaaaccgtccacgagttaataataatttatattaaatgttaacaaaaataaaacaaacgggaaacgcaaaaaataaatattcgttttccacaatggaattaaaactcaatgaaagaatgaaaaagagaaaaacgggaatagaaacgcgaaggttttctattcccgccgttaattaatcgtccggcataacttttgttggcttaccagttcagaagatctggagcagcggaatgtcgttattaatgctggtattcagttacgtccaccaatgtatggggatgggcatattgacaatggtatatggtgggtgatcttttcggttggttcttggctggtgtttggcccgtaatgccaaaaagcgtttaggtaacgaaacgcgcccagaatttagcaccgccagttggatctttatgatgttccgcctcctgtacgtctttgctaggccgtacttgttctgggggatcggatgagatcttactactgacatctctcacccccgccgtttgggcttagaacgaaactcgacaggggcgaaagaagttttggggctggcttacacttgttccacctggggacctcatgccgtgggccaacttacagcttccttcagtcgccttcgcttacttcttctttggttccagcaaaaatggaagtgatttcgccccagcttcgaacactggtcgctggtaggtgaaaaaacacgcccaaagggtttgttcggcactatcgtgcgacaacttctatctcggtcgccttgaaatcttcgcgaatgggtaccagtctgggccttgccacgcgccgctggtggaaccgaacgtgtatgcaatggttgtttgggcattccgctaccctgcaactggacgctaaatcatcattaacccgtgctggattatcctcaacgccaattttgcgtcgcttgcgtctgcaaaaaggggacgtatcgccaagtgacgtggcgtagttaccctgagcttcctgatgctgggttgggtgtttccattgtcagcggtgacagcttcatcatgaactactgacttcacgcgattcggtggggatgtttgctgatgtaatctgccgcgcacaatgttgttctataccgaatcgccatcgctaaaggcggcttcgccgagggctggaccgtgttctaactgggcatggtgggtgattttattgctatcgcagatgagtatcttccctcgcccgcatctcccccgttggtccgtactgtgcgtgaactgtgcttcgcatggtcatgggggctgacagcggtcaacctggattcctgagggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggagggggggggggggggggggggggggggggggggggggtggactgtatcgtagtaaacactctgctgttgatagaatcaacatcatcaaccattccaaatctgatcgaacaagtaagcggtggtgggcgcgcgccatcattgaaaacctgggccgctctgccctcagcaccgccaccatgtgggcttcttcatcctctctttattgccaccgttacgcatctggttaacgcctgccctcttaataaccctggcggatgtcccacttgccgtcgaagtacgcgattggctgaagaaccactctgctggtgcgtatcggttggtcaattctggttggcattatgcggtaattgttttctgctggcgctcggcggcctgaaaaaccgattccaaaccgccaattatcgccgcgaggatgc
ccgctgttctttacgtcaacattatggtggaccgcctctccgctttattaaagacgcgaaaacagaactggaaaagaattaaaattaacccccaaaatatcagaggtttggaaaaagatggatttttaatttaaatgaatgagcaggaatctgttttttgtcgccgtatccgcgaaactgatggccagcgaaaactgggaggcctattttgccgagtgcgaccgtggacaagcgtctacgcgcggaacgtttttgtcaagcactgggcggatatgggtatcgacagtctgctgatccctgaaagagccggtggtctggaacgcgggggttttgttactctcgcgccgctgctggatggagctgggacgtcttgggggcaccaaccttatgtgctgttaccagttgccgggcgggcttcaacacccttcctgcgcgaaggcacacaaaggcagatcgaccaaaaattatggctttcccgcggcaccggtaagcagatgtggaactacagcgattaccgaacccgggcggcgggctccgacgtgggtagcctgaaaacgacttaataccgtagaaatggtaagatttatattaatggtagtaagtgttttaattaccgcaagcacgcagcctaccaccccgtacaatcgtgggatggcgcgcgaaacggggcttctccggacaaaacctgtcctacacccgaatggtttgttgatatgagcaaaaccgggcatcaaaagtgaccaaaccttgaaaagctcggtctgcgtaatggaatagcttgctgtgaaatcacctttgacgtcaagtggaactggacggaaagacatgttggtcggaaaggtaacggctttaaccgcgtcaaagaagagttcgaaaccatgaacgtttcctgtagccctccacgccaaactacggtacgcggatgtgcgcctttggaaagaaagcggcgcgctacgccaatcagccgcgtgcagtttttggcgaggctattggtcgtttatccacgttgattcaggaaaaaatcgccccacatggcgatcaaataaactgccattgaaaaaatgctgttttattgaagcagcgtggaaagcagacaacggcaccaatcacctctggcgatgcagcgatgtgcaaatacttctgacgcccaatgcggcattttgaagttgtggatagcgtcaatgccaggtgctgggcggtgtcgggattgccggggcaaccagccgcatgcagccgcttctggcgtgatctgcgtgtagaccggcgtccctccggggggatctgacgaaatgcagattcctgaacgctggtcgtgcggcagtgctgaagcaataccggcctaagtccgtgcgctgcctgatgcgacgctgatgcgtcttatcaggcccaacgggagcaaacgaaatgtaaggcggataagcgttttacagcgcatccggtcagtcccatgcgctacataaatttttttccaggaagcaaaataatggatcatgctacccatgccgaaaaattcgggcgttggccggattgcgcgttgtcttctccggtaatcgaaatcgccggaccgtttgccccggcaaaaatgttcgcagaatggggcgcgaagtttatctggatcgagaacgtcgcctggccggacaccattcgcgttcaaccgaaactacccgccaactccccgccgcaatttgcaacgcagctgtccgttaaatatttcaaagatgaaaggcccgcgaagcgtttctgaaattaatgaaaccaaccgatatcttcacatcgaagccagtaaaacggtccggcctttgccgtcgtggcatttaccgatgaagtactgtggcaaagcacaacccgaaactgtgtaatcgctcacactgtccggtttggtcaagttaacggcaccggaggagtacaccaaaatcttccggcctataacactatcgcccaaggcctttaagtggttacctgattcagaacggtgatgttgacagtcaatgcctgccttc
cccgtatagccgccgattcttttcttggcctgaccgcaccacggcgggcgctgcgcagcactgcatagtgcgtgaaccggttaaaggcgaaagttcggacatcgcccatgtatgaagtgggatgctgcgtatgggccagtacttcatgatggattacttcaacggcggcgaatgtgcccgcgcatgagccaaaggtaaagaatccctactacggccgggattgcggtctgtataaatgtgccgacggctacattcgtgatggaacctggtggcattacccaaattgaagagtgctttaagatattggcctgcgcacatctgcttggcaacgccagaaatccggaaggcactcagcttatccaccgtaaatcgaaatgcccttacggccacgcacttggttgaagagaaaactcgatgcctggctggcgacacataccatcgcgaagtaaaagacgcttttgcctgaactgaatatcgcctgggccggccaaaggtgctgaccgtaccgggaactggaaagcaatccaacagtatgtggctcgccgaatcaatcactcagatggtcaaacgatggatggtcgccactctgccagagggccgaacatcaatgcgaaattccaaaaaataaccccggacaaaatctttggcggcggaatgccctcacatggcatggacacggctgcccattttgaaaaatatcggctacagcaaaacgacattcaaggagttggtcagcaaaggtcctggcgcaaagttgaggcactaaacgcgttggccgtcagtcggcttgccctgactagcgcccgccataacaagcgaacaaaaatggatagaggtgcaatggatatcaaattggcggacaaaacatcttaacgtcaaatgtgggacgatcttgcggacgctttacggtcataaaacggcgctgatttttgtgaatccaatgcgcggagtcgtttaaccgtatagtttatatcttgagttaaatcaggaagattaaaaccgcacggcaaaccatgttctatacgcatggggatcgcaaaaggcgacaaggttgcacctacatctcgaacaactgcccagaattctatccttttgctggtttcgggtggcaaaaattggcgcccgattaatggtgccgattaaccgccggcctgttggtgcgaggaaagcgcgtgatcctttgcaaaatagccaggcggtgcctgccttggtgaccagtgcgcaattctatccatgtatcaacaagattcagcaggaagatgcccactcaattgcggcactttgcctgacatgatggtgcacttccgctgatgaatggcgtgagttcgttacatcaaacctgaaaaatccaacaaccctgccacctttgtgctatgcatccgccgctaatcgaactggacgatacggcggaaattctctttcacctccgggcaccacctcccgaccgaaaggtttggtgattaacccattacaacctgcgcttcgctggatattaactccgcctggcagtgtgcatctgcgtgacgatgacgtctacgctgacggtaattgcctgagtttcatatcgaattgccaaagtgtactgcggcgatggcggccgtttctgccggggccacctttgtgctggtcgagaaactacacgcgccccggcggccttcctgggacaggtacagaagtaccgcagcgcccaccgttaaccgaatgtattccgatgatgatccgtaacgttgaatggtaacagccgcctttcagcgaacgatcagcaacacgcctgcgggaagtgatgttttatctcaacttggtcggagcaggaaaaaagctgcgtttttgtgaacggcttcggcgttcgcgttgctgacgtcttatggagacggaaaacgcattgtggcattaacgcgatcgtccctggcgataaacgacgctggccgtcgattggtccgggtggggttttgctaaaacgagcggaagatcccgcgacgatcacaatcgcccgctcccgg
cttggtgagatcggtgaaatcttgcattaaagcataacctggaaaaccaaatcttcaagagtacttctcaccccacaagccactgcgaaagtgctggagcgatggcctggctgtgcataccggcgataacgcggataccgcgacgaagaggactttttttatttcgtcgacgccgctgcaatatgaatttaaacgtggcggcgagaatgtttctcctgccgtggagctggaaaatattaatcgccgccgcaacaagcagtccgaaaattccaggacatcgtggttgtgggtattaaaagattcgattcgcgatgaaagccaccaaagcaaatttgtgggtgctgaatgaaaaggtgaacattgagcgaagaggaattttttccgctctgcgaacaaaaatggcgaaatttaaagtgccctcttatctggagatcagaaagaatctgcaccgtaattgctcggggaaaataaattagaaagaatctgaaaataaatgaaacagcagggaactaccctgccactaataatcacacttgtctaaaaaacaatatgccatttttttgcagggatgttgtcgtcccctgaaaaaggcaaaaatggagaaaaggaatgagtgaatcaattacatctgacccgcaatgggatcaattctggaaattacccacttgatcgtccaaaagcgaaatgctattgatgcaaaaacccagctttggaaatgggcgaagtatttcctaaattttccgtgacgatccgcaaattacgctgtcgccaattattaccggtggccggagagaagttccttttccggcgggctgggaatttaaaagcggcaagagaaaggcgaaggcaccggcaatgctgacttttggtcccgggtgttttgcggattaaccgaaattttcatctcggacaaaccggtttatgcgcagctgtgaacggctatgctttggcggcggcttttggaactggcgctggcggcagatttattgtttggtggccgataacgcccagcttcgccctgccggaaagccaaactgggcaatcgttcctgacagcggcggtgtgctgcggtctgcccgtaagaaatccctggccgcctgccatcgtcaatgaaatggtgatgaccggcagaacgaatgggcgcaagaagaggcgcctggcggttgggggatagtcaaacgcgtggttagcccaggcggaactgatggataaacgcccggcgaactggcctcagccaatgctggttaacaagcgccacgctggcgatttgcggcgcttgaaagaagatctaaccgccaccaccagcgaaatgccggtagaaagaagcgtaatcgctaatattcgcgagcggcgtgttgaaacaaaactatccatgcggttctgcattcgaagatccatgtgaagggccgctggcgtttgccgagaaaggcgcgattcgggtgtggaaaggaacgttaacaaacccgtgagcttattacgccttgaggggtttaaataattccggtggttcacgcctgcggacggcgtttttgtcccatcccagtgccgtctgattggcgatgtgattgtgggagccggtgtctacatcggacccactcgcctcaacttgcgtggtgctacgggcggttgaatcgtgcaaagcgggagccaatatcaggatggctgcattatgcaatggatactggcgacactgacactatcgttggggcaaacggccatatcgggccacggagcgatccccgcatggttgttttgattggtcggatgccattgttcgggatgaacaagcgtgattatggatggcgcggtcattggcgaagagagcattgttgccgcccatgagcttgtcaaagcgggccttatcgcggcgagaaacgccagttgttgatgggtacgccagccccgcgctgtacgcaattgttagtgacgacgagtttacacctggaaacggttgaataccaaaagagtatcaggatctgtcgggcgctgtcatgtatcgt
tacaatgaaacgcagccactgaaagggcaaaatggaggaaaaatacgcccccgtttttgcaaggggaacgacggatgtgacgcctaaacggtgatcttgccggatgcgacctggttgctggataagatgctacaacatcgcatctcggcaagcaggacgcgccggtatcaacgagataaaattaaacgaacgcatacctctttgacaacatcatctgccacttcttgtcgccgttaagttccgtgagcggaacgcaggaatacgcggcgtctttcttaaccaacacaccgacctaatttttcggccgactgttattgcgcaggcgcgcgtaattttgctcatcgattaatcggaccactttaaccagtcgctgacactggcatcccgcccttccagcttattaagggaatcatcttaacttccgcagcttatttctctgtgacttccgacagaatataagtcagggtattaaccgcctttgctgtgttcaatatcgaaatgaatagagatctctccttgctatcaccctccaccgatttttcagccatcatccattcggcgaattgaacaagtagagtggtttttcaacatatccttcacaacaatacgaactccagtttcagcaacaaattaaaagtcctggctaatgactcgttttgaaagaattaacaaggcatattaaaattgtgggatatatcggattagtaatgcaacttccaccgtcgataagtgtgagctaaatcctggcttatagtcatcaaaaatatcggattccccaaaagatttacttaatgggcaaaaaaaaaaccggtccctcatcaagaaccgtattagtattcgatcaaaattaagcggatgaaaaatatctgccatgacccacgtattaatttatctgtgcgtgcattttcctgcaccgaaatttaacttttttcagtcgcatgacggcttcagcgtccaatcgcggtggcaaagcgccgttcagggtggtgtcgtaaatgcaactttatattgcagccgcagctgcgacgaatcacgcgggaaagtcttcaatcgcacgacggcctgaggtggtgttgatgatgtaggtatttcgccattcttgatacggtcctgaatgtggcggacggccttcatgcacgcttgttaccaagacgcgggttgaatacctgcttcgcccagcacaatcgccgtgcgtggggtcgcatccagctcgaacccctgttcagcaagttttgccgccaggtcccaccacgcgttctttatcgccttcgcgcacggaaagcagcgcacgaccgtgtttcttcatggtggagttggctgccgcccagccctgccgctttggcaaacgcttcagcgaggtggcggcccacgcccaatgaacttccccgcggtagagcgcatttctggcccaacagcgggtcaacgcccggaatttattgaaacggcagcacccttctttcaccgagtagtacggcgggataaactgtctttggttacgccctgcttcaagcagcgatttgccagcacatcacgcgcgccgcccacttttgccaagcgggtacgcccggtggcttttggagacgaacggaacggtacgcgcagcacgcggggttaacttcaatcaggtagacttcgttgtttttcaccgcaaactgacgttcatcaaggcgcgccgcacctgcaattcgaaggccagtttttcctgcacctgctggcgcatcacatgcctgaatttcctgacttaaggtgtaggctggcagagaaacatgcgagtcaccggagtgcaacgcccgccccttgctcaatatgctcgcatgatgcagccaatcagcaccatttcggccgtcgcagatgggcaatctcacgtcccacttctacgcgtccatcgaggaagtggtccagcacactggcgcatcggttagaacagcgctgacccgcagtcctggaagttagccgacgcaggtcatcgtcatagacgatttttccatcgccccgaccag
ccggaaacgtaaactgacccgtaccacaagcggtagccaatctctttcgccttctctaccgccatttcaatagtcgtaatcggtggcgttcgccgttgttttcagtttcagaacgctcaaccgcaatgctggaagcgttcacggtccttctgcacggtcgatagcatccgggaatggtgccgataaccggtacgccagcagctttccagccgcgcgcgccagtttcagcggggtctgacccgccgaaaaaaaaaataaatactggacgataacgcctttcggcttctcgatagcacgatttccagcacatcttccagattacggctcgagtagaggcggtcggaagtgtcgtagtcggtggagacggtttccggttaacagttaactcataatggtttcgtaaccgttcttcgcggcagcgccagcgaggcgtgtacgcacagtagtcgaattcgatcctgaccgatacggttcgggccgccgcaagaccatgattttttcacggtcggtagaaacggattcgcttcgcactctttcttcataagtggaggttacatgtaagcggtgtcgtggcgaaactctgccgcacaggtatcgccgcgcttataaacgcgggggtgcaggtcatactggtcacgcagcttacggatttccgcttcgcgacgccccgccagttttgccaagcgctgcatcggcaaagcctttggcgtttcagctggcgagcgaagtcagcgttcaggcagtgatgcccactccgccactttctcttccagaccgccaccagctccttcatcctgtacccaggaaccagcggtccaatgtttggtcaggttggaagacgccgtccaagacaggcccgcaacggaacgcatcgggcggatgtacagatacgatctgccgccctgcgtccttttcagtttcgcgacggattttggttaacgcttccgggtccatcccaggctcactttcgggtcgaatccatcccgcaaccgacttcaggccgcgcaagcgctttttgcgcagggatcctgctgcgtgcgatcaatcgccatcactttcgccaaccgatttcatctgagtggtcagacggtcgttagcaacggcgaacctttttcgagttgaagcgaggaattttagtaaacacacataaaagtcgatggacggctcgaagtggaggccggagtacgtcgcgcacagtgatgtcgttccatcagttcgtccgagggtgtaacccaccgccagtattcgccgccactttaagcaatcgggaaaccggtccgctttcgaccagcgccgaagaagcgggaccaccgtgggtttcatttcgataacaatcaacgagaccgttttttcttcgggtttaccgcaaaaccgaacgttggaacaccaccgtttttcaacgatctcccgattcaacgcagcaccgccaatcgaggcgttacgcatgattgatattcttttgtcggtcagcgttgggctggagcgacagtgatggagttcacccggtgtggattgcccatcgcatcgaagttttcgaaatagagcagacgatgatgcatgattgtcgtttttatcacgcaacgcacttccatctccgtaactcttctccccagccgatcagcgactcatcaaatcagcaactcttttggtcggaagagagatccagaaccgcgggccgcaaatttcttcaaactcttcaacggttataagcgataccgccgccgctaccgccgccatggtaaagatgggcgaataatggcacgggaagcccacgtccaagcggcaaccgccagcgcttcttccatcgtgtgtagcgataccggaacgcgcggtttccagaccaattttcttcatcgctacgatcgaaacgacggcggtctttcctgcctttatcaatcgcaaatcggcagtggcagccaatatggtgacacgaactcttccaacacggtcgcatcacggcactgaccgttcccagctccacgcgcaagttcagcgccgtc
tgaccgcccatcgttggcagcaccgcgtccggcgcactctttttcaataaatccttgcgtacaacttcccagtgaatcggctcgatgtaggttgcatcagccatttcccgggtcgggtcatgaatggtcgccgggttggagttcacgcagaaatgacgcgtaaccctcttcaacgcagggcctttacaacgcttgcgcgccagagtaagtcaactcaacacgcctgaccgataacaatcggagcccgcacccagaatcaggcatacttttatatctgtacgttttggcatggctctttactcctgattacttagcggtttttacggtaactgctcaatttaaactcgaataaagtggtcgaaacaacggcgggccggcgtcgtgtggaccagggctgcttatcagggtgcccctggaagctgaatgccggttaaatcggtgcgaatgaatgcccctgtaacgtaaaaccgtcgaacagggatttaaatgcgtgaaacacgcaggtttgcaggtaaatgttgcttcgtccaccgcaaaaaacctggttctgggcggtgatcattacccacgttttctccacatctttaaccggaatggttgccgccgtggtgaaccaaaatttcattttgaaccaagtcttcgcaccgctcgccagcgccagcaggctgaatatgaccgagacagatgccgaataccggaatatcggtttcgaggaagattatctggatggcggtaatggcgtaatcgagcgggccccgggtcgccaggaccttggagaggaaagaatgccgtctggattcattttcagcacatcttccgcagaagttttgcgcgcggaaacgatggtcaggcgacagcctcttttatccaccaagcaatcgccaggatgttgccttggcaccaaaaaatcataagccacgacgtggaacggcagcatgcgtctttctttttcgctttctggcaggccaccggtcaacgtccaagctcccttgtgtccaggctataggccttctgcggtggtccacttcttttgcagaatccatgccattcagacctgggaacggcggggccatttttctaacgccagcgccgcatcgggttatcgggcccgcgataatgcaagccattcgtgcgcctttctcgcgcagtaaacgcggtcagcttacggtatcgattatcggcaatcgccacgatgtatgggcgtttcaggtaagaaagagaggtcttcggtattacggaagttgctggcaatcagccggcaggtcgcgaatcaccagacccttgtcatgtacctggagaagattcttccatcggcgtcattggtgccgacatatttgccaaatatggggatagtaagagtaaaacgattgacgagaaataggaaggatcagtgaggatttcccttgaataaccgtcatgaagtaatttgaaaaacgacttccccaaccgcccgaacctgttgcccctatggccccgacgtgaactgggttccgtcccttccagaaaacaatagcgctgacttaatcaaaacaccctccagagaaatattccacccactttatttgcatattaaattcaccaatgatgaacgtaaatcaatgcaatcttgcttaccgacgaatttcctggcaaacggcggcatctggagataatgggcgctaaaagtccaacttaaatggcgggattttttatattattttgtggctcatgcaaatgttttacagcattaagcgacaaaaaagatcatctaacctgtttctggaaagctttgcgcagccaagaagattgcatgaaaataaaacgagcagaaaaaaagtggaccaaatgtcaaattcattaaaacaaaaacagaagaccaaaacatgtaatttttggcagctttatagaacaattaaaaaacataacaatgcaaaaataaaagggccaaaaattacaaggcccttttataatcaataaataaatgtgttttatttttttgcaaccaaataacaaatattttt
gtggttacagaattattgagatcaagttaacatctcgcatattcaaaaaagacgcgctttccttaccacactcaccacaaaggccgatcttaccgcgccgttagcaaatgtcatacggctgaccgccttatggttgaatctcccagagctcgccaatatcgcaaacatcgcggtaatgttcaccaacgattcacctgcacgcacggtggcaaaaccaatgtgccaggcaaacacgttcacggtgttggccttcaggactgtagaccgcgcaatctttcagatcttttaaatcaagggcgtgggcgaatccgccctctcacattgccatgcggtggcctgacggccatcaactttagtctagatgtgcttcaataattttcgaattatcgtgtagtcacacccatcactttggctgctattcctccaagcagcttaagcatgacgttaacgcccaacgctaaaaaattggcagcaaaagacaatcgcaaatatcggcagcggcgtcacgaattgcttgtttacccgcttcgtcaaaccccgtagtgccgatcaccatcccttttgctgctggcgacaaaaaagcgaagatggttcagccgtacttccggagggtaaaatcgataaaacacagtcaaaaatccatcttttaaccgcatcgaggctgccttttgcaacggtaacgccctgtttcccggcctccggccagctcaccgcgtcgctgcccagtaaaagaagatccttccacgccctccagccgcagcgtcccaactgcacgcccttctaatgccaagcgccgcccctgaataactggcggcgcttatacgccctccccggcttccccccgcgatggcaacgcggaatggtttgcatcatgcatagctattcctctttgttgtaatttgcgatcagaacccgtttttcagagtaccagcgcatagctacggacgccagccaaaacgctgaaattagaaattaatagataaacagggagggatatcagtgaaaggcaatgagtcgatgaatgactgcatgccaataacatagtgacaggttacgacgccagaagccagcacctcaccacccactgctggagcctttcacatccagatccaatgccacctgtacattggctggcttgcccagcaaaccgtcgatatcaaaccaccgtcgtgccctgaggtaaattcgccctgagttttcccactgccacaaaacagggttgagagtgacaggtccgggcgcaccagccaggcgattgcgcagagatcgtgcattcgcaagccgctttgcatactgcccgctacggtagtggctaaacagggcgtgaagcattttcccggtagcggtttaactgcggcagtggtagagaagatagtcaggagttaaattgccctgattggtgacattccaaaaccgcacatgacgatttacaataccactgcggaagacacaagcagcagcttctggatcggcagcaatattaaactcggcgtttggcgtacatgttgcccgcgtccgccagaaccacaccatgatcaccagacggcgaatatacggcttgccattccgggcatgtgaaagtaacgcgcaatattggttaacgggccgatggcccaccaggggcggtaacaggctctggtgcaacgcatcagggcacccgaaaatccgccagaaacgccggtatccctcgagaccgagcggctttcggtttgtgctcaacaaaagtccgtagccagccattccccgattcgccgtgcacagatgccgcatatcacgcggtgcgcgtaccagtggcaccaagccggccccgtttgggccgaagcggaaactccgcattccagaaatgcagcagttgcagggcattgcgggtagtttcctcaaccgagaccatctacccgcgacggtggtcatcagttgcaggttcgagttcgggtgcaaaaaatcgcggcggaatggcgacgcagcatcgtcaatgccggggtcccacgtatcgaggaaga
ataggtaaacgcatgttttctccataaaaaatgccggtaacaagaccggcattttcgcataacttaggctgctaatgacttaatcgacttcacgaaatatcgacacgcagctctttcggcaccttcgaaataaaatgttttcttcagcggccttccagcgaatggcttcacgcaccgcccagcttgctgccaaacgtggccaccacatttctgcaaccagaataaccggagccgaatgcgcccgatgcagtcacgccgacgcatttaaccctctttccacaacccactcttcctggatgtcttttcgcacgtccaatcaaaacgcgcacgcggtgagttgcgcaatgacgtaagcggatgccgc'
reads_pac[-20:-1]
'tgacgtaagcggatgccgc'
Expected Output -
1004
19944
21731
21133
13502
8134
Notice that, unlike the Illumina dataset, the PacBio data has reads of very different lengths, and some of the reads are very long.
Plotting the distribution of read lengths
#getting distribution of length of reads
lengths_pac=[]
for read in dna_reads_pac:
    lengths_pac.append(len(read))
plt.hist(lengths_pac)
plt.xlabel("length of read")
plt.ylabel("no. of reads")
plt.show()
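The histogram shows the overall shape of the distribution; a few summary statistics can make the spread concrete. The following is a minimal sketch, assuming a hypothetical helper named summarize_lengths (it is not one of the graded functions in main.py) and a small made-up list standing in for lengths_pac.

```python
import statistics


def summarize_lengths(lengths):
    """Return min, max, mean and median of a list of read lengths.

    Hypothetical helper for illustration only; not part of main.py.
    """
    if not lengths:
        raise ValueError("lengths must be non-empty")
    return {
        "min": min(lengths),
        "max": max(lengths),
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
    }


# Toy values for illustration only; in the notebook, pass lengths_pac instead.
toy_lengths = [100, 250, 400, 12000]
summary = summarize_lengths(toy_lengths)
print(summary)
```

In the toy list a single long read pulls the mean far above the median, which is the kind of skew a long-read dataset may show.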
Checking for impurity symbols
We will now check whether any PacBio reads contain symbols other than {A,C,G,T}.
impure_reads_pac,impure_chars_pac=module.check_impurity(dna_reads_pac)
print(len(impure_reads_pac))
impure_chars_pac
0
set()
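To illustrate the kind of check performed here, the following is a rough sketch under stated assumptions, not the graded implementation. The name check_impurity_sketch is hypothetical, the return shape (a list of impure reads and a set of impure characters) is inferred from how the results are used above, and treating lowercase bases as valid is an assumption; follow the description in main.py for the actual requirements.

```python
def check_impurity_sketch(reads, alphabet="ACGT"):
    """Return (impure_reads, impure_chars) for a list of read strings.

    Hypothetical illustration, not the graded function. Assumes the
    comparison is case-insensitive, so 'acgt' counts as a pure read.
    """
    allowed = set(alphabet.upper())
    impure_reads = []
    impure_chars = set()
    for read in reads:
        # Characters in this read that fall outside the allowed alphabet
        extra = set(read.upper()) - allowed
        if extra:
            impure_reads.append(read)
            impure_chars |= extra
    return impure_reads, impure_chars


# Toy reads for illustration only.
toy_reads = ["ACGT", "ACNNT", "acgt", "AC-GT"]
bad_reads, bad_chars = check_impurity_sketch(toy_reads)
print(len(bad_reads))     # 2
print(sorted(bad_chars))  # ['-', 'N']
```

Using set difference keeps the per-read work proportional to the read length and collects every offending character in one pass.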