All assignments are to be submitted by email to Prof. Salzberg at salzberg@tigr.org. Assignments are due by midnight on
the due date.

Assignments 3 (due Oct. 20) and 4 (due Nov. 3)

For Assignments 3 and 4 you will write your own sequence assembler, using any strategy you like.  The assembler should
assemble the sequences contained in this file, which holds 316 DNA sequences ranging in length from 500 to 550
base pairs. Each sequence occupies exactly two lines. The first line is a unique ID, and the second line is the
sequence itself.
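
A minimal sketch of reading this two-line format is given here in Python. The function name parse_reads is hypothetical, and the sketch assumes that the ID line holds the bare ID (as in the example IDs such as A1) and that blank lines can be ignored; adjust it if the actual file differs.

```python
from typing import Dict, Iterable


def parse_reads(lines: Iterable[str]) -> Dict[str, str]:
    """Parse reads stored as an ID line followed by a sequence line.

    Assumption: blank lines carry no information and are skipped.
    """
    stripped = [ln.strip() for ln in lines if ln.strip()]
    if len(stripped) % 2 != 0:
        raise ValueError("expected an even number of non-blank lines")

    reads: Dict[str, str] = {}
    for i in range(0, len(stripped), 2):
        read_id, seq = stripped[i], stripped[i + 1]
        if read_id in reads:
            raise ValueError(f"duplicate read ID: {read_id}")
        reads[read_id] = seq
    return reads
```

With a real input file this could be called as parse_reads(open(path)), where path is whatever name you saved the reads under.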

Note: to make the problem easier, all the sequences in the "Shotgun Reads" file are from the same strand. You don't
need to reverse complement any of them.  To make it even easier, the sequences contain no errors; i.e., all sequences
are 100% accurate.  The shotgun sequences represent 8x coverage of a contiguous piece of DNA that should be
longer than 20,000 bases if you assemble it correctly. There is no guarantee that the sequence will assemble
without gaps.
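
As a quick sanity check on the expected length, the numbers above can be combined in a few lines of Python. This is only back-of-the-envelope arithmetic (total bases divided by coverage); the function name is hypothetical and the result is an estimate, not the true length of the target.

```python
def estimated_target_length(num_reads: int, read_len: int, coverage: float) -> float:
    """Estimate target length as (total bases sequenced) / coverage."""
    return num_reads * read_len / coverage


# 316 reads of 500 to 550 bp at 8x coverage
low = estimated_target_length(316, 500, 8)
high = estimated_target_length(316, 550, 8)
print(low, high)  # 19750.0 21725.0
```

The range brackets 20,000 bases, which is consistent with the figure quoted above.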

For Assignment 3, due Oct. 20, you will write the 'overlapper' module that determines which sequences overlap.
Two sequences should be considered overlapping if they overlap one another by at least 40bp.  In addition to your
source code, you should turn in an output file showing all the overlaps you've detected in the file of sequences
provided above.  The file should be sorted by the ID number of each of the 316 sequences, and for each sequence
the file should contain EXACTLY one line.  That line should contain the ID of the sequence followed by the IDs
of all sequences that it overlaps.  The list of overlapping sequences should also be sorted by ID number.
For example:
A1 A24 A165 A220 A298
A2 A95 A109 A302
... etc.  I will be comparing your files to the correct answer using 'diff' so the format should match exactly.  Put
exactly one space between successive ID numbers.
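
One possible shape for the overlapper and its output is sketched below in Python. It is not the reference solution. It assumes that "overlap" means an exact suffix of one read matching a prefix of another (reads are error-free and from one strand), that a read fully contained inside another is not reported, and that IDs are a letter prefix followed by a number, so that A2 sorts before A10. All function names are hypothetical.

```python
import re
from typing import Dict, List, Set


def suffix_prefix_overlap(a: str, b: str, min_len: int = 40) -> int:
    """Length of the longest suffix of a equal to a prefix of b, or 0 if below min_len."""
    start = 0
    while True:
        # Look for the first min_len characters of b somewhere in a.
        start = a.find(b[:min_len], start)
        if start == -1:
            return 0
        # The candidate is a true overlap only if the rest of a matches b's prefix.
        if b.startswith(a[start:]):
            return len(a) - start
        start += 1


def id_number(read_id: str) -> int:
    """Numeric part of an ID such as 'A165' (assumed ID layout)."""
    match = re.search(r"\d+", read_id)
    if match is None:
        raise ValueError(f"no number in ID: {read_id}")
    return int(match.group())


def overlap_table(reads: Dict[str, str], min_len: int = 40) -> List[str]:
    """One line per read: its ID, then the IDs it overlaps, single-space separated."""
    ids = sorted(reads, key=id_number)
    partners: Dict[str, Set[str]] = {rid: set() for rid in ids}
    for a in ids:
        for b in ids:
            if a != b and suffix_prefix_overlap(reads[a], reads[b], min_len) > 0:
                # Overlap is reported symmetrically for both reads.
                partners[a].add(b)
                partners[b].add(a)
    return [" ".join([rid] + sorted(partners[rid], key=id_number)) for rid in ids]
```

Comparing all pairs is quadratic in the number of reads, which is acceptable for 316 reads; a larger test set might call for indexing the reads by their first 40 bases instead.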

For Assignment 4, due Nov. 3, turn in two files: (1) your assembly program, and (2) the assembly. The assembly output must be a FASTA file in PRECISELY the following format: line 1 should read ">ShotgunAssembly" (without the quotes), and the complete assembled sequence should follow, in all lower case, with exactly 80 characters on every line (except the final line, which may contain fewer than 80 characters). If the assembly is in multiple contiguous pieces, then your output should be a multi-FASTA file with the longest piece first, then the second longest, etc. In a multi-FASTA file each sequence starts with a comment line (a ">" followed by any text you like), followed by the sequence itself at 80 characters per line.
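
The formatting rules above can be sketched as follows in Python. The function name format_assembly and the header text used for the second and later pieces are hypothetical (the assignment allows any text after the ">" in a multi-FASTA file); the lower-casing, the 80-character line width, and the longest-first ordering come from the description above.

```python
from typing import List


def format_assembly(contigs: List[str], width: int = 80) -> str:
    """Render contigs as FASTA text: longest first, lower case, fixed line width."""
    ordered = sorted(contigs, key=len, reverse=True)
    lines: List[str] = []
    for i, contig in enumerate(ordered):
        # First header is the required one; later header names are an assumption.
        header = ">ShotgunAssembly" if i == 0 else f">ShotgunAssembly_piece{i + 1}"
        lines.append(header)
        seq = contig.lower()
        # Slice the sequence into chunks of `width`; the last chunk may be shorter.
        lines.extend(seq[j:j + width] for j in range(0, len(seq), width))
    return "\n".join(lines) + "\n"
```

Because the result will be compared with 'diff', it is worth checking details such as the trailing newline and the absence of trailing spaces before submitting.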

I'll be using Unix 'diff' to compare your answer to the correct answer. I also plan to run your program on another test set, so it must compile and run under Unix, and the input should be a file in exactly the format demonstrated by the input file you're using above. 


Last modified at Wednesday, October 06 1999 08:57