April 19, 2001 - Mihaela Pertea, Johns Hopkins Computer Science Department

The gene-finding research community has focused considerable effort on the analysis of human and bacterial genomes, which has left some small eukaryotes without a system that addresses their needs. We focused on this category of organisms and designed several algorithms to improve the accuracy of gene detection for them. We considered three alternatives for gene searching. The first identifies a coding region by searching for the signals that surround it. This technique is used by GeneSplicer, a program that predicts putative locations of splice sites; it combines decision trees with Markov chain models to detect signal patterns. The second identifies a protein-coding region by analyzing the nucleotide distribution within that region. Complex gene finders such as GlimmerM combine both of these approaches to discover genes. The basis of GlimmerM is a dynamic programming algorithm that considers all combinations of possible exons for inclusion in a gene model and chooses the best of these combinations. The choice of the best gene model is based on a combination of the strength of the splice sites and the score of the exons produced by an interpolated Markov model (IMM). The third alternative carefully combines the predictions of existing gene finders to produce a significantly improved gene detection system.
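
The dynamic programming idea described above can be illustrated with a minimal sketch. This is not GlimmerM's actual implementation: the Exon record, its two score fields, the site_weight parameter, and the function name best_gene_model are all hypothetical, and the sketch ignores reading-frame consistency and other biological constraints that a real gene finder would have to enforce. It only shows how a chain of non-overlapping candidate exons can be selected so that the combined splice-site and coding scores are maximal.

```python
from bisect import bisect_right
from typing import List, NamedTuple, Tuple


class Exon(NamedTuple):
    """Hypothetical candidate exon; the fields are assumptions for illustration."""
    start: int           # first base of the exon (inclusive)
    end: int             # last base of the exon (inclusive)
    coding_score: float  # assumed content score, e.g. from a Markov model
    site_score: float    # assumed combined strength of the flanking splice sites


def best_gene_model(exons: List[Exon],
                    site_weight: float = 1.0) -> Tuple[float, List[Exon]]:
    """Return the highest-scoring chain of non-overlapping exons.

    Simplified weighted-interval dynamic program, not GlimmerM itself.
    """
    ordered = sorted(exons, key=lambda e: e.end)
    ends = [e.end for e in ordered]
    n = len(ordered)
    best = [0.0] * (n + 1)   # best[i]: best score using the first i exons
    take = [False] * (n + 1)  # take[i]: whether exon i is part of best[i]
    prev = [0] * (n + 1)      # prev[i]: prefix length compatible with exon i

    for i, exon in enumerate(ordered, start=1):
        # Count earlier exons that end strictly before this one starts.
        p = bisect_right(ends, exon.start - 1, 0, i - 1)
        with_exon = best[p] + exon.coding_score + site_weight * exon.site_score
        if with_exon > best[i - 1]:
            best[i], take[i], prev[i] = with_exon, True, p
        else:
            best[i] = best[i - 1]

    # Trace back through the table to recover the chosen exons.
    chain: List[Exon] = []
    i = n
    while i > 0:
        if take[i]:
            chain.append(ordered[i - 1])
            i = prev[i]
        else:
            i -= 1
    chain.reverse()
    return best[n], chain


if __name__ == "__main__":
    candidates = [
        Exon(10, 50, 3.0, 1.0),
        Exon(40, 90, 2.0, 0.5),
        Exon(60, 120, 4.0, 1.0),
        Exon(130, 200, -1.0, 0.2),
    ]
    score, model = best_gene_model(candidates)
    print(score, [(e.start, e.end) for e in model])
```

In this toy input, the exon at 40-90 overlaps both of its neighbors and the exon at 130-200 has a negative total score, so neither is included in the selected model.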