Homework 5

This assignment is due in two weeks, on Nov. 17.  For this assignment we will compute a measure of "codon bias", which is the preference within a genome for certain codons over other, synonymous codons.  We will then use codon bias to score genes.  The idea is that genes that use "favored" codons are probably highly expressed; i.e., they are produced in greater quantities than other genes.

Part 1

Compute log-odds scores (in bits) for each codon xyz coding for amino acid aa, where this figure gives you the mapping of codons to amino acids (i.e., the genetic code). The odds of interest here are the relative frequency at which codon xyz appears in coding sequence versus the frequency at which the triplet xyz is expected to appear by chance in any sequence. To compute these odds, summations over all synonymous codons abc encoding amino acid aa must be computed, as in:
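(The formula itself does not appear in this text. One form consistent with the description, assuming the chance frequency of a triplet is the product of its three single-base frequencies, would be the following. Treat it as a reconstruction from the surrounding definitions, not as the official statement.)

\[
\mathrm{odds}(xyz) \;=\;
\frac{Q(xyz) \,\Big/ \sum_{abc \,\mapsto\, aa} Q(abc)}
     {P(x)\,P(y)\,P(z) \,\Big/ \sum_{abc \,\mapsto\, aa} P(a)\,P(b)\,P(c)}
\]

Here the sums run over every codon abc that encodes the same amino acid aa as xyz (including xyz itself), Q(abc) is taken to be the observed codon frequency in the genes, and P(a) is taken to be the frequency of the single base a.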

where Q and P represent frequencies (or probabilities) of codons in the data. To get log-odds scores in bits, simply take the log base 2 of the quantity above. The data to use for your computation of log-odds scores is the genome of Borrelia burgdorferi (gzipped FASTA file), the bacterium that causes Lyme disease, along with a subset of its gene list. These are the 601 genes that have homology to genes from other organisms; in most cases their function is known. The format of the coordinate list is:
ID end5 end3
where "end5" is the 5' end of the gene, where the start codon is, and "end3" is the 3' end, where the stop codon is. If end5 > end3, the gene lies on the complementary strand, so you must reverse-complement the sequence to get the gene. Using these genes, you can compute the probabilities of all the codons as well as the probabilities of the individual bases, which are the quantities the formula above requires. Your list of log-odds scores should be a single file with 61 lines, one for each codon (the three stop codons are excluded). Sort the list alphabetically; column 1 should be the codon and column 2 should be its score. To illustrate, the file should start out like this:
aaa 0.0042
aac 0.0011
aag 0.012
aat 0.0040
aca 0.019
And so on through the codon ttt.
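
The steps in Part 1 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not a reference solution: it assumes the genome has already been read into one lowercase string, that coordinates are 1-based and inclusive, and that the chance model for a triplet is the product of its single-base frequencies. All function names (extract_gene, sense_codons, count_codons_and_bases, log_odds, codon_table_lines) are hypothetical.

```python
from collections import Counter
from math import log2

# Standard genetic code with bases ordered t, c, a, g; '*' marks a stop codon.
BASES = "tcag"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}
COMPLEMENT = str.maketrans("acgt", "tgca")


def extract_gene(genome, end5, end3):
    """Return the coding strand of a gene (assumes 1-based, inclusive coordinates)."""
    if end5 < end3:
        return genome[end5 - 1:end3]
    # end5 > end3: the gene lies on the complementary strand.
    return genome[end3 - 1:end5].translate(COMPLEMENT)[::-1]


def sense_codons(gene):
    """Split a gene into in-frame triplets, keeping only the 61 sense codons."""
    triplets = (gene[i:i + 3] for i in range(0, len(gene) - 2, 3))
    return [t for t in triplets if CODON_TABLE.get(t, "*") != "*"]


def count_codons_and_bases(genes):
    """Tally codon counts and single-base frequencies over all genes."""
    codon_counts, base_counts = Counter(), Counter()
    for gene in genes:
        codon_counts.update(sense_codons(gene))
        base_counts.update(b for b in gene if b in BASES)
    total = sum(base_counts.values())
    base_freq = {b: base_counts[b] / total for b in BASES}
    return codon_counts, base_freq


def log_odds(codon, codon_counts, base_freq):
    """Log-odds score in bits for one codon relative to its synonymous codons."""
    aa = CODON_TABLE[codon]
    synonyms = [c for c, a in CODON_TABLE.items() if a == aa]

    def chance(c):
        # Assumed null model: product of the three single-base frequencies.
        return base_freq[c[0]] * base_freq[c[1]] * base_freq[c[2]]

    observed = codon_counts[codon] / sum(codon_counts[c] for c in synonyms)
    expected = chance(codon) / sum(chance(c) for c in synonyms)
    return log2(observed / expected)  # raises if the codon was never observed


def codon_table_lines(codon_counts, base_freq):
    """One 'codon score' line per sense codon, sorted alphabetically."""
    sense = sorted(c for c, a in CODON_TABLE.items() if a != "*")
    return [f"{c} {log_odds(c, codon_counts, base_freq):.4f}" for c in sense]
```

Because each score is a ratio within one amino acid's codon family, raw counts and normalized frequencies give the same result for the observed term. A codon that never occurs in the gene set would make the logarithm undefined; how to handle that case is left open here.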

Part 2

Using the log-odds scoring system for codons computed in Part 1, compute a normalized score for each of the 601 genes. To get a normalized score, sum the bit scores for all codons in a gene and divide by the number of codons. This gives an "average bits per codon" measure for the gene. Write the 601 scores to a second file, where each line contains the gene ID, followed by a space, followed by the gene's normalized score.
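
The normalization described above might look like this in Python. It is a sketch only: the function names are hypothetical, `scores` is assumed to be a dict mapping each of the 61 sense codons to its bit score, and triplets without a score (such as the stop codon) are assumed to be left out of both the sum and the codon count.

```python
def average_bits_per_codon(gene, scores):
    """Mean log-odds score (bits) over the scored codons of one gene."""
    triplets = [gene[i:i + 3] for i in range(0, len(gene) - 2, 3)]
    # Assumption: triplets with no score (the stop codon, ambiguous bases)
    # are excluded from both the sum and the codon count.
    bits = [scores[t] for t in triplets if t in scores]
    return sum(bits) / len(bits)


def gene_score_lines(genes, scores):
    """One 'ID score' line per gene; `genes` maps gene ID to coding sequence."""
    return [f"{gene_id} {average_bits_per_codon(seq, scores):.4f}"
            for gene_id, seq in genes.items()]
```

Whether the stop codon should count toward the number of codons is not stated in the assignment; with genes of several hundred codons the difference is small either way.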

Turning in your work

Turn in a tarfile containing your program(s) plus the two output files above. Make sure you document your code.