The files are sorted in order, and all genes are identified by unique IDs. The exons and introns are numbered 1 to N in order within each gene. For example, the very last gene in the files looks like this:
Exons:
GGRIHBGEN:1 atgcagccccggggcctcctcctcctcctggcactgctgctgctggcggccgctgccgaggctgccaaagccaagaaag
GGRIHBGEN:2 agaagatgaagaaggagggttccgagtgccaggactggcactgggggccctgcatccccaacagcaaggactgcggcctgggctaccgcgagggcagctgcggcgatgagagcaggaagctcaagtgcaagatcccctgcaactggaagaagaagtttggag
GGRIHBGEN:3 ctgactgcaagtacaagtttgagaggtggggaggtggcagcgccaagacgggtgtgaaaacacgctcaggcatcctgaagaaagcgctgtacaatgctgaatgtgaggaggtggtctatgtgagcaagccctgcaccgccaagatgaaggccaaggccaaag
GGRIHBGEN:4 caaagaagggcaaggggaaggactag
Introns:
GGRIHBGEN:1 gtgagcgggagtccgggtgggcacgggggggtctgggtgtgccgccccatctcccagccgctgcctcccttgcag
GGRIHBGEN:2 gtgcggaggtggctgtggggggaccggggctgcgttggggccacacgttcctaatgcctctcccttcctgcag
GGRIHBGEN:3 gtgagtcctcgcctcccccaggcagcactcaggaccaacgctgggcagggagctgctggtggccgcaataacgctaacgcatctcccatggtcattgggtgcag
This gene has four exons and three introns. Note that each line contains one exon or intron, with no carriage returns.
Your assignment is to compute two probability matrices, one for donor sites and one for acceptor sites. Each matrix has one row per nucleotide (A, C, G, T) and one column per position. The matrix for donor sites should contain the probabilities at the last 3 positions within the exon and the first 12 positions in the intron. Thus this is a 4x15 matrix. The matrix for acceptor sites should contain the probabilities for the last 15 positions in the intron and the first 3 positions in the subsequent exon. This is a 4x18 matrix.
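To make the window definitions concrete, here is a minimal Python sketch of how the donor and acceptor windows could be extracted and counted. It is not the reference solution. It assumes the exons and introns of one gene have already been read into two lists in order, so that intron i sits between exon i and exon i+1. The function names (donor_window, acceptor_window, gene_windows, count_matrix) are hypothetical, and skipping sequences that are too short to fill a window is an assumption that the assignment text does not address.

```python
from collections import Counter

BASES = "ACGT"

def donor_window(exon, intron):
    """Last 3 exon bases followed by the first 12 intron bases (15 total)."""
    if len(exon) < 3 or len(intron) < 12:
        return None  # assumption: skip sites that cannot fill the window
    return (exon[-3:] + intron[:12]).upper()

def acceptor_window(intron, next_exon):
    """Last 15 intron bases followed by the first 3 bases of the next exon (18 total)."""
    if len(intron) < 15 or len(next_exon) < 3:
        return None  # assumption: skip sites that cannot fill the window
    return (intron[-15:] + next_exon[:3]).upper()

def gene_windows(exons, introns):
    """Collect donor and acceptor windows for one gene.

    Intron i lies between exons[i] and exons[i + 1].
    """
    donors, acceptors = [], []
    for i, intron in enumerate(introns):
        d = donor_window(exons[i], intron)
        a = acceptor_window(intron, exons[i + 1])
        if d is not None:
            donors.append(d)
        if a is not None:
            acceptors.append(a)
    return donors, acceptors

def count_matrix(windows, width):
    """Count how often each base occurs at each window position."""
    counts = [Counter() for _ in range(width)]
    for w in windows:
        for pos, base in enumerate(w):
            if base in BASES:  # ignore any non-ACGT symbol
                counts[pos][base] += 1
    return counts
```

Dividing each count by the column total then gives the probability for that base at that position. Windows from all genes would be pooled before counting.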
The format of the DONOR matrix should look EXACTLY like this, except of course that
the probabilities I include here are not the correct ones (again, I want to use diff
to compare it to the correct answer):
-3 -2 -1 0 1 2 3 4 5 6 7 8 9 10 11
A 0.20 0.20 0.20 0.20 0.20 0.20 0.20 0.20 0.20 0.20 0.20 0.20 0.20 0.20 0.20
C 0.30 0.30 0.30 0.30 0.30 0.30 0.30 0.30 0.30 0.30 0.30 0.30 0.30 0.30 0.30
G 0.14 0.14 0.14 0.14 0.14 0.14 0.14 0.14 0.14 0.14 0.14 0.14 0.14 0.14 0.14
T 0.36 0.36 0.36 0.36 0.36 0.36 0.36 0.36 0.36 0.36 0.36 0.36 0.36 0.36 0.36
(If you display this in fixed-width font, you'll see that the column headers
line up.)
Note that the column headers refer to the position with respect to the donor
site - negative numbers are inside the exon, and 0-11 correspond to the first
12 positions in the intron. Also, please truncate all probabilities to 2 positions
after the decimal, as shown above. The ACCEPTOR matrix will look similar, except of course that it is 4x18 and the columns (positions) should be numbered -15, -14, -13, -12, ..., 0, 1, 2.
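Because the output is compared with diff, truncation (not rounding) matters: 2/3 must print as 0.66, not 0.67. The sketch below is one possible way to do this in Python, not the reference solution. It truncates using integer arithmetic on the raw counts, which avoids floating-point surprises. The names truncated_prob and format_matrix are hypothetical, and the input is assumed to be a list with one base-to-count mapping per column. The single-space column separator is also an assumption; check the exact spacing against the sample matrix above.

```python
def truncated_prob(count, total):
    """Return count/total truncated (not rounded) to 2 decimals, as a string."""
    hundredths = (count * 100) // total  # integer floor, so 2/3 -> 66
    return f"{hundredths // 100}.{hundredths % 100:02d}"

def format_matrix(counts, first_pos):
    """Format per-column base counts as a header line plus A, C, G, T rows.

    counts    -- list of dicts mapping base -> count, one dict per column
    first_pos -- label of the first column (-3 for donor, -15 for acceptor)
    """
    header = " ".join(str(first_pos + i) for i in range(len(counts)))
    lines = [header]
    for base in "ACGT":
        cells = []
        for col in counts:
            total = sum(col.get(b, 0) for b in "ACGT")
            if total == 0:
                cells.append("0.00")  # assumption: empty column prints as 0.00
            else:
                cells.append(truncated_prob(col.get(base, 0), total))
        lines.append(base + " " + " ".join(cells))
    return "\n".join(lines)
```

For the donor matrix this would be called as format_matrix(counts, -3), and for the acceptor matrix as format_matrix(counts, -15).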
Some of you will no doubt realize that the upstream and downstream data is unnecessary for this assignment. That's correct, but I include it just for completeness.