Homework 6 (last assignment!)

This assignment is due in 2.5 weeks, on Dec. 3 (Friday). Note the extension of 2 days past the usual due date, to give you extra time because of the Thanksgiving holiday.

Gene finding

This assignment involves developing a small piece of a gene finder: the piece that identifies exon-intron boundaries. The boundary where an exon ends and an intron begins is called the DONOR site, and the boundary where an intron ends and an exon begins is called the ACCEPTOR site. There are 4 files you need for this assignment. They contain different parts of 100 genes from human and other vertebrate genomes. I have conveniently split the genes into four pieces for you:
the upstream region, prior to the start codon
the exons
the introns
the downstream region, after the stop codon.
Note that the upstream and downstream regions may contain the noncoding portions of exons, but for this assignment you do not need to consider those.

The files are sorted in order, and all genes are identified by unique IDs. The exons and introns are numbered 1 to N in order within each gene. For example, the very last gene in the files looks like this:

Exons:
GGRIHBGEN:1 atgcagccccggggcctcctcctcctcctggcactgctgctgctggcggccgctgccgaggctgccaaagccaagaaag
GGRIHBGEN:2 agaagatgaagaaggagggttccgagtgccaggactggcactgggggccctgcatccccaacagcaaggactgcggcctgggctaccgcgagggcagctgcggcgatgagagcaggaagctcaagtgcaagatcccctgcaactggaagaagaagtttggag
GGRIHBGEN:3 ctgactgcaagtacaagtttgagaggtggggaggtggcagcgccaagacgggtgtgaaaacacgctcaggcatcctgaagaaagcgctgtacaatgctgaatgtgaggaggtggtctatgtgagcaagccctgcaccgccaagatgaaggccaaggccaaag
GGRIHBGEN:4 caaagaagggcaaggggaaggactag

Introns:
GGRIHBGEN:1 gtgagcgggagtccgggtgggcacgggggggtctgggtgtgccgccccatctcccagccgctgcctcccttgcag
GGRIHBGEN:2 gtgcggaggtggctgtggggggaccggggctgcgttggggccacacgttcctaatgcctctcccttcctgcag
GGRIHBGEN:3 gtgagtcctcgcctcccccaggcagcactcaggaccaacgctgggcagggagctgctggtggccgcaataacgctaacgcatctcccatggtcattgggtgcag

This gene has four exons and three introns. Note that each line contains one exon or intron, with no carriage returns. Your assignment is to compute two probability matrices, one for donor sites and one for acceptor sites. Each matrix has four rows, one per nucleotide (A, C, G, T), and one column per position. The matrix for donor sites should contain the probabilities at the last 3 positions within the exon and the first 12 positions in the intron. Thus this is a 4x15 matrix. The matrix for acceptor sites should contain the probabilities for the last 15 positions in the intron and the first 3 positions in the subsequent exon. This is a 4x18 matrix.
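The counting step described above can be sketched roughly as follows. This is only one possible approach, not a reference solution. All function names here are hypothetical. It assumes that exon i is followed by intron i and intron i by exon i+1 (as in the example gene), that sites whose exon or intron is too short to fill the window are skipped, and that any character other than A, C, G, T is ignored; none of these choices is specified by the assignment.

```python
def parse_line(line):
    """Split a line like 'GGRIHBGEN:4 caaag...' into (gene_id, number, sequence)."""
    label, sequence = line.split()
    gene_id, number = label.rsplit(":", 1)
    return gene_id, int(number), sequence.upper()


def donor_window(exon, intron):
    """Last 3 bases of the exon followed by the first 12 bases of the intron."""
    return exon[-3:] + intron[:12]


def acceptor_window(intron, next_exon):
    """Last 15 bases of the intron followed by the first 3 bases of the next exon."""
    return intron[-15:] + next_exon[:3]


def column_counts(windows, width):
    """Count each base at each position over all windows of the expected width."""
    counts = {base: [0] * width for base in "ACGT"}
    for window in windows:
        if len(window) != width:
            continue  # assumption: skip sites that cannot fill the whole window
        for position, base in enumerate(window):
            if base in counts:  # assumption: ignore non-ACGT characters
                counts[base][position] += 1
    return counts


def to_probabilities(counts):
    """Convert per-column counts into per-column probabilities."""
    width = len(counts["A"])
    probs = {base: [0.0] * width for base in "ACGT"}
    for position in range(width):
        total = sum(counts[base][position] for base in "ACGT")
        for base in "ACGT":
            if total:
                probs[base][position] = counts[base][position] / total
    return probs


# Usage with the end of exon 1 and the start of intron 1 from the example gene.
window = donor_window("GCCAAGAAAG", "GTGAGCGGGAGTCCGG")
donor_counts = column_counts([window], 15)
donor_probs = to_probabilities(donor_counts)
```

Keeping integer counts separate from probabilities is a design choice made here so that the final formatting can be done without floating-point surprises.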

The format of the DONOR matrix should look EXACTLY like this, except of course that the probabilities I include here are not the correct ones (again, I want to use diff to compare it to the correct answer):

     -3   -2   -1    0    1    2    3    4    5    6    7    8    9   10   11
A  0.20 0.20 0.20 0.20 0.20 0.20 0.20 0.20 0.20 0.20 0.20 0.20 0.20 0.20 0.20
C  0.30 0.30 0.30 0.30 0.30 0.30 0.30 0.30 0.30 0.30 0.30 0.30 0.30 0.30 0.30
G  0.14 0.14 0.14 0.14 0.14 0.14 0.14 0.14 0.14 0.14 0.14 0.14 0.14 0.14 0.14
T  0.36 0.36 0.36 0.36 0.36 0.36 0.36 0.36 0.36 0.36 0.36 0.36 0.36 0.36 0.36

(If you display this in fixed-width font, you'll see that the column headers line up.) Note that the column headers refer to the position with respect to the donor site: negative numbers are inside the exon, and 0-11 correspond to the first 12 positions in the intron. Also, please truncate all probabilities to 2 places after the decimal point, as shown above.
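Because the output is compared with diff, truncation and spacing matter. The following is a minimal sketch of one way to print a matrix in the layout shown above. It assumes you have integer counts per base and column; the names `truncated_prob` and `format_matrix` are hypothetical. Whether trailing whitespace or the final newline match the reference file is not something the assignment states, so check your output against the sample by eye.

```python
def truncated_prob(count, total):
    """Format count/total truncated (not rounded) to two decimal places."""
    if total == 0:
        return "0.00"  # assumption: an empty column prints as zero
    hundredths = (count * 100) // total  # integer floor, no float error
    return f"{hundredths // 100}.{hundredths % 100:02d}"


def format_matrix(counts, first_label):
    """Render counts ({base: [count per column]}) as header plus A, C, G, T rows."""
    width = len(counts["A"])
    # Two leading spaces, then each column label right-aligned in 5 characters.
    header = "  " + "".join(f"{first_label + i:5d}" for i in range(width))
    lines = [header]
    for base in "ACGT":
        cells = []
        for i in range(width):
            total = sum(counts[b][i] for b in "ACGT")
            cells.append(truncated_prob(counts[base][i], total))
        lines.append(f"{base}  " + " ".join(cells))
    return "\n".join(lines)


# Usage: counts that reproduce the placeholder probabilities shown above.
sample = {"A": [20] * 15, "C": [30] * 15, "G": [14] * 15, "T": [36] * 15}
text = format_matrix(sample, -3)
```

Here `first_label` would be -3 for the donor matrix and -15 for the acceptor matrix. Working from integer counts avoids cases where a floating-point value such as 0.29 is stored as slightly less than 0.29 and truncates to 0.28.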

The ACCEPTOR matrix will look similar, except of course that it is 4x18 and the columns (positions) should be numbered -15, -14, -13, -12, ..., 0, 1, 2.

Some of you will no doubt realize that the upstream and downstream data is unnecessary for this assignment. That's correct, but I include it just for completeness.

Turning in your work

Turn in a tarfile containing your program(s) plus one file with the Donor matrix and one file with the acceptor matrix. Please name the matrix files "donor.mat" and "acceptor.mat". Make sure you document your code.
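If you prefer to build the tar file from a script rather than the command line, a minimal sketch using Python's standard `tarfile` module follows. The function name, the archive name and the program file name in the usage comment are hypothetical; only "donor.mat" and "acceptor.mat" come from the instructions above.

```python
import os
import tarfile


def make_submission(archive_name, paths):
    """Bundle the given files into an uncompressed tar archive."""
    with tarfile.open(archive_name, "w") as archive:
        for path in paths:
            # Store each file under its base name, without directory prefixes.
            archive.add(path, arcname=os.path.basename(path))
    return archive_name


# Example call (file names other than the .mat files are placeholders):
# make_submission("hw6.tar", ["splice_sites.py", "donor.mat", "acceptor.mat"])
```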