Assignment #5: Arrays
How to hand in your work
Submit the file SearchDNA.java
Learning outcomes
When you have completed this assignment, you should understand:
How to pass parameters and return values using static methods.
How to use arrays.
How to indent and document a Java program.
An important task in bioinformatics is the identification of DNA and RNA sequences. In this assignment we will be looking at nucleic acid sequences. These sequences contain up to four different bases denoted by letters: A for adenine, C for cytosine, G for guanine, and T for thymine. Sequence strings are compared in order to determine whether nucleic acid sequences match each other, or are related through mutations. Real sequence data as used by biochemists and in bioinformatics research consist of very long strings of A, C, G and T. Determining relatedness can require the use of very complex algorithms, beyond the scope of this assignment.
The sequences in this assignment will all contain between 2 and 4 of the possible bases (A, C, G, and T). Your task is to search through a collection of sequence data and count how many times a specific sequence occurs. For example, if the collection contains the sequences {ACTG, GATC, ACT, GTC, AC, GATC, GA} and we search for the specific sequence GATC, we would report that it was found 2 times.
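The counting task described above can be sketched as follows. This is a minimal sketch, not the required solution: the method name findFrequency comes from the recommended steps, but the parameter order and the int return type are assumptions, and the specification document defines the real signature. The class name FrequencySketch is hypothetical.

```java
// Sketch only: the parameter order and return type are assumptions;
// the specification document defines the real signature.
class FrequencySketch {

    // Counts how many elements of sequences are exactly equal to target.
    static int findFrequency(String[] sequences, String target) {
        int count = 0;
        for (int i = 0; i < sequences.length; i++) {
            // Compare Strings with equals(); == would compare references.
            if (sequences[i].equals(target)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String[] data = {"ACTG", "GATC", "ACT", "GTC", "AC", "GATC", "GA"};
        System.out.println(findFrequency(data, "GATC")); // prints 2
    }
}
```

The sample data is the collection from the example above, so the printed count should match the 2 reported there.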
One of the difficulties in this assignment will be dealing with mutated sequences. A mutation can occur due to insertions of additional bases within a sequence. For the purpose of this assignment, a mutated sequence is one in which the same base occurs two or more times in a row (so in the sequence GAAATC the A has mutated, and in the sequence CCGGAT both the C and G have mutated). Another task in this assignment is to detect how many of the sequences in the collection are mutated. The final task will be to search through the collection of sequence data for a specific sequence, treating original and mutated sequences the same. For example, if the collection contains {TGC, AC, TTGC, TACG, TGGCC, AGTC} and we search for the specific sequence TGC, we would report that it was found 3 times (because TTGC and TGGCC are mutated forms of TGC).
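One possible strategy for the mutation tasks is to compare each character with the one before it, and to collapse each run of repeated bases to a single base before comparing sequences. The sketch below assumes that strategy. Every name in it (MutationSketch, hasMutation, collapseRepeats, countMutated, findFrequencyWithMutations) is hypothetical, and collapsing the search sequence as well is an assumption; the required method names and behaviour are in the specification document.

```java
// Sketch only: all names here are hypothetical, not taken from the specification.
class MutationSketch {

    // True if any base is immediately followed by the same base.
    static boolean hasMutation(String seq) {
        for (int i = 1; i < seq.length(); i++) {
            if (seq.charAt(i) == seq.charAt(i - 1)) {
                return true;
            }
        }
        return false;
    }

    // Returns seq with every run of a repeated base reduced to one base,
    // e.g. "TTGC" becomes "TGC".
    static String collapseRepeats(String seq) {
        StringBuilder result = new StringBuilder();
        for (int i = 0; i < seq.length(); i++) {
            // Keep a character only if it differs from the previous one.
            if (i == 0 || seq.charAt(i) != seq.charAt(i - 1)) {
                result.append(seq.charAt(i));
            }
        }
        return result.toString();
    }

    // Counts how many sequences in the array are mutated.
    static int countMutated(String[] sequences) {
        int count = 0;
        for (int i = 0; i < sequences.length; i++) {
            if (hasMutation(sequences[i])) {
                count++;
            }
        }
        return count;
    }

    // Counts matches for target, treating mutated and original forms alike.
    // Assumption: the target is collapsed too, so "TTGC" and "TGC" are equivalent.
    static int findFrequencyWithMutations(String[] sequences, String target) {
        String base = collapseRepeats(target);
        int count = 0;
        for (int i = 0; i < sequences.length; i++) {
            if (collapseRepeats(sequences[i]).equals(base)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String[] data = {"TGC", "AC", "TTGC", "TACG", "TGGCC", "AGTC"};
        System.out.println(countMutated(data));                      // prints 2
        System.out.println(findFrequencyWithMutations(data, "TGC")); // prints 3
    }
}
```

With the collection from the example above, TTGC and TGGCC are the two mutated sequences, and both collapse to TGC, which gives the count of 3.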
Recommended steps to follow in order
1. Similar to Assignment 4, the specification document outlines all of the required methods for this assignment. Appendix A illustrates some examples of how to call and test the methods in your SearchDNA.java program.
2. The printArray method is a nice place to start. Focus on passing in an array of Strings as a parameter and using a loop to visit each element in the array. Remember that array indexes start at 0 (similar to Strings)!
3. After finishing printArray, we recommend working on the findLongest or the findFrequency methods, as they are quite similar. Within both methods, a loop should be created to visit each element in the array. In the findLongest method, you must keep track of which String in the array contains the most characters, whereas in the findFrequency method, you must keep track of how many times a specific String is found in the array. Make sure you finish and test them both before moving to the next step.
4. The methods involving mutations are a little more difficult. In this assignment, a mutation occurs when the same character appears two or more times in a row in a String. Think about how you might be able to detect a mutation in a String. Once you come up with a strategy, test it with a number of Strings to see if it works!
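The "keep track of the best so far" loop described in step 3 might look like the sketch below for findLongest. The method name comes from step 3, but everything else is an assumption: that it returns the String itself (rather than its index or length), that ties go to the earliest element, and that an empty array yields null. The class name LongestSketch is hypothetical. Check the specification document for the required behaviour.

```java
// Sketch only: return type, tie-breaking and empty-array behaviour are assumptions.
class LongestSketch {

    // Returns the String with the most characters; the first one wins a tie.
    static String findLongest(String[] sequences) {
        if (sequences.length == 0) {
            return null; // assumption: nothing to report for an empty array
        }
        String longest = sequences[0];
        for (int i = 1; i < sequences.length; i++) {
            // Strictly greater, so an earlier String of equal length is kept.
            if (sequences[i].length() > longest.length()) {
                longest = sequences[i];
            }
        }
        return longest;
    }

    public static void main(String[] args) {
        String[] data = {"AC", "GATC", "ACT"};
        System.out.println(findLongest(data)); // prints GATC
    }
}
```

The findFrequency loop has the same shape; only the value being tracked changes from "longest String so far" to "number of matches so far".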
Marking
Your mark will be based on the following criteria:
Your code must compile and run. Some examples of how to test your methods, along with expected output, are outlined in Appendix A.
Your code must conform to all the requirements mentioned in the specification document.
The main method must show all of your testing code. Each of the required methods should be tested.
Your code must follow the guidelines outlined in Style_Guidelines.pdf, found through the Lectures & Stuff link in the Lab Resources folder on connex. You may notice that the specification document provides some very nice comments you are welcome to borrow.
Appendix A – Testing your code
As you work through a solution, it is recommended that you save, compile and test your code after every line or two of code that you write. This can be something as easy as printing out the value of a variable, or calling a method to print out the value returned. It is important to do this to confirm a component of your code works correctly, so you can be confident using that component throughout your code later.
Testing the printArray method:
One way to do this is to create an array of Strings in the main method and then call the printArray method from main. An example is shown below:
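A minimal sketch of such a test, assuming printArray takes a String[] parameter, returns nothing, and prints each element on its own line. The class name PrintArraySketch and the sample data are hypothetical; in your own work both methods would live in SearchDNA.java.

```java
// Sketch only: the printArray signature is an assumption based on the steps above.
class PrintArraySketch {

    // Prints each element of the array on its own line.
    static void printArray(String[] words) {
        // Array indexes run from 0 to words.length - 1.
        for (int i = 0; i < words.length; i++) {
            System.out.println(words[i]);
        }
    }

    public static void main(String[] args) {
        // Create a small array of Strings and pass it to the method under test.
        String[] sequences = {"ACTG", "GATC", "ACT"};
        printArray(sequences);
    }
}
```

Running main should print ACTG, GATC and ACT, one per line.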
What to look for: Are all of the words printed? Does the method accept an array of Strings as a parameter? Is each word on a new line?