Homework 5
This assignment is due in two weeks, on Nov. 17. For this assignment
we will compute a measure of "codon bias", which is the preference within
a genome for certain codons over other, synonymous codons. We will
then use codon bias to score genes. The idea is that genes that use
"favored" codons are probably highly expressed; i.e., they are produced
in greater quantities than other genes.
Part 1
Compute log-odds scores (in bits) for each codon xyz coding for amino
acid aa, where this figure gives you the mapping of codons to amino acids
(i.e., the genetic code).
The odds of interest here are the relative frequency at which codon xyz
appears in coding sequence versus the frequency at which the triplet xyz
is expected to appear by chance in any sequence. To compute these odds,
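summations over all synonymous codons abc encoding amino acid aa
must be computed, as in:

$$\mathrm{odds}(xyz) = \frac{Q(xyz) \,/\, \sum_{abc \in aa} Q(abc)}{P(xyz) \,/\, \sum_{abc \in aa} P(abc)}$$

where Q and P represent frequencies (or probabilities) of codons in the
data: Q(xyz) is the frequency at which codon xyz appears in coding
sequence, and P(xyz) is the frequency at which the triplet xyz is expected
to appear by chance. To get log-odds scores in bits, simply take the log
base 2 of the quantity above. The data to use for your computation of
log-odds scores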
is the genome of Borrelia burgdorferi (gzipped FASTA
file), the bacterium that causes Lyme disease, along with a subset
of its gene list. These are the 601 genes that
have homology to genes from other organisms; in most cases their function
is known. The format of the coordinate list is:
ID end5 end3
where "end5" is the 5' end of the gene, where the start codon is, and
"end3" is the 3' end, where the stop codon is. If end5 > end3, the gene
lies on the complementary strand, so you must reverse-complement
the sequence to get the gene. Using these genes, you can compute
the probabilities of all the codons as well as the probabilities of the
individual bases, which is what you need to compute the formula above.
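The coordinate handling described above can be sketched in Python as follows. This is only a sketch under stated assumptions: the names reverse_complement, extract_gene, and split_codons are hypothetical, the genome is taken to be a single string already read from the FASTA file, coordinates are taken to be 1-based and inclusive, and end5 > end3 is taken to mark a gene on the complementary strand.

```python
# Hypothetical helpers; none of these names come from the assignment.
COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")


def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


def extract_gene(genome, end5, end3):
    """Return the gene in 5'-to-3' order.

    Assumes 1-based, inclusive coordinates and that end5 > end3
    marks a gene on the complementary strand.
    """
    if end5 < end3:
        return genome[end5 - 1:end3]
    return reverse_complement(genome[end3 - 1:end5])


def split_codons(gene):
    """Split a gene into consecutive triplets, dropping any trailing partial codon."""
    return [gene[i:i + 3] for i in range(0, len(gene) - 2, 3)]
```

For example, extract_gene("ggttatttcatcc", 11, 3) gives "atgaaataa", which split_codons breaks into atg, aaa, taa.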
Your list of log-odds scores should be a single file with 61 lines, one
for each codon (stop codons are of course not included). Sort the list
alphabetically; column 1 should be the codon and column 2 should be its
score. To illustrate, the file should start out like this:
aaa 0.0042
aac 0.0011
aag 0.012
aat 0.0040
aca 0.019
And so on through the codon ttt.
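One possible reading of Part 1 is sketched below in Python. It rests on several assumptions that are not stated in the assignment: the genetic code is the standard one (61 sense codons, stops taa, tag, tga), the chance frequency of a triplet is the product of its three base frequencies, base frequencies are taken from the gene sequences themselves, and scores are printed with four decimal places. The function names are hypothetical.

```python
from collections import Counter
from itertools import product
from math import log2

# Standard genetic code in t, c, a, g order (assumed to match the figure).
BASES = "tcag"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def count_codons_and_bases(genes):
    """Count sense codons and estimate base frequencies from the gene sequences."""
    codon_counts, base_counts = Counter(), Counter()
    for gene in genes:
        gene = gene.lower()
        base_counts.update(b for b in gene if b in "acgt")
        for i in range(0, len(gene) - 2, 3):
            codon = gene[i:i + 3]
            aa = CODON_TABLE.get(codon)
            if aa is not None and aa != "*":  # skip stops and ambiguous triplets
                codon_counts[codon] += 1
    total = sum(base_counts.values())
    base_freqs = {b: base_counts[b] / total for b in "acgt"}
    return codon_counts, base_freqs


def log_odds_scores(codon_counts, base_freqs):
    """Return {codon: score in bits} for the 61 sense codons.

    Assumes every amino acid is observed at least once and every base
    frequency is nonzero; an unobserved codon scores -inf.
    """
    def chance(codon):
        # Assumption: triplet probability = product of base probabilities.
        return base_freqs[codon[0]] * base_freqs[codon[1]] * base_freqs[codon[2]]

    scores = {}
    for codon, aa in CODON_TABLE.items():
        if aa == "*":
            continue
        synonyms = [c for c, a in CODON_TABLE.items() if a == aa]
        q = codon_counts.get(codon, 0) / sum(codon_counts.get(c, 0) for c in synonyms)
        p = chance(codon) / sum(chance(c) for c in synonyms)
        scores[codon] = log2(q / p) if q > 0 else float("-inf")
    return scores


def format_scores(scores):
    """One 'codon score' line per codon, sorted alphabetically."""
    return [f"{codon} {scores[codon]:.4f}" for codon in sorted(scores)]
```

Note that under this reading an amino acid with a single codon (atg, tgg) always scores 0 bits, because both ratios equal 1.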
Part 2
Using the log-odds scoring system for codons computed in Part 1, compute
a normalized score for each of the 601 genes. To get a normalized score,
sum the bit scores for all codons in a gene, and divide by the number of
codons. This gives an "average bits per codon" measure for the gene. The
output of this should be the 601 scores in a second file, where each line
contains the gene ID, followed by a space, followed by the log-odds score.
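The averaging step can be sketched in Python as below. The names gene_score and format_gene_line are hypothetical, and the sketch assumes that codons without a score (the stop codon, or triplets with ambiguous bases) are left out of both the sum and the codon count, and that scores are written with four decimal places.

```python
def gene_score(gene_seq, codon_scores):
    """Average bits per codon for one gene.

    codon_scores maps lowercase codons to log-odds scores in bits.
    Assumption: codons with no score (e.g. the stop codon) are skipped
    and do not count toward the number of codons.
    """
    seq = gene_seq.lower()
    codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
    scored = [codon_scores[c] for c in codons if c in codon_scores]
    if not scored:
        raise ValueError("gene contains no scorable codons")
    return sum(scored) / len(scored)


def format_gene_line(gene_id, score):
    """Gene ID, a space, then the normalized score."""
    return f"{gene_id} {score:.4f}"
```

For instance, with scores atg = 0.0, aaa = 0.5, and aag = -1.0, the gene atgaaaaagtaa averages (0.0 + 0.5 - 1.0) / 3 bits per codon.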
Turning in your work
Turn in a tarfile containing your program(s) plus the two output files
above. Make sure you document your code.