GLIMMER 2.1 is the current release; see below for performance statistics.
About Glimmer
Glimmer is a system for finding genes in microbial DNA, especially the genomes of bacteria and archaea. Glimmer (Gene Locator and Interpolated Markov Modeler) uses interpolated Markov models (IMMs) to identify the coding regions and distinguish them from noncoding DNA. The IMM approach, described in our Nucleic Acids Research paper on Glimmer 1.0 and in our subsequent paper on Glimmer 2.0, uses a combination of Markov models from 1st through 8th order, weighting each model according to its predictive power. Glimmer 1.0 and 2.0 use 3-periodic nonhomogeneous Markov models in their IMMs.
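The idea of combining Markov models of several orders can be illustrated with a small Python sketch. This is not Glimmer's code and does not reproduce its weighting scheme: the class name ToyIMM, the shrinkage parameter, and the count-based weight n / (n + shrinkage) are assumptions chosen for illustration. The sketch also adds an order-0 fallback and ignores the 3-periodic structure mentioned above.

```python
import math
from collections import defaultdict

BASES = "ACGT"


class ToyIMM:
    """Toy interpolated Markov model over DNA (illustrative, not Glimmer's method)."""

    def __init__(self, max_order=8, shrinkage=20.0):
        self.max_order = max_order
        self.shrinkage = shrinkage  # hypothetical parameter controlling the weights
        # counts[context][base] = number of times base followed context
        self.counts = defaultdict(lambda: defaultdict(int))

    def train(self, sequences):
        for seq in sequences:
            for i, base in enumerate(seq):
                # Record the base under every context length available here.
                for k in range(min(i, self.max_order) + 1):
                    self.counts[seq[i - k:i]][base] += 1

    def prob(self, context, base):
        # Order 0: add-one smoothed base frequency, used as the fallback.
        zero = self.counts[""]
        p = (zero[base] + 1) / (sum(zero.values()) + len(BASES))
        context = context[max(0, len(context) - self.max_order):]
        # Blend in each higher order, trusting it more the more often
        # its context was seen in training.
        for k in range(1, len(context) + 1):
            seen = self.counts.get(context[-k:])
            if not seen:
                break
            n = sum(seen.values())
            weight = n / (n + self.shrinkage)
            p = weight * (seen.get(base, 0) / n) + (1 - weight) * p
        return p

    def log_score(self, seq):
        # Log-probability of the whole sequence under the model.
        return sum(math.log(self.prob(seq[:i], b)) for i, b in enumerate(seq))
```

A model trained on known coding sequences would be expected to assign a higher log_score to coding-like regions than to unrelated sequence; a real gene finder would compare such scores across reading frames and against a noncoding model.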
Glimmer is the primary microbial gene finder at TIGR, and has been used to annotate the complete genomes of B. burgdorferi (Fraser et al., Nature, Dec. 1997), T. pallidum (Fraser et al., Science, July 1998), T. maritima, D. radiodurans, M. tuberculosis, and non-TIGR projects including C. trachomatis, C. pneumoniae, and others. Its analyses of some of these genomes and others are available at the TIGR microbial database site.
GlimmerHMM, a special version of Glimmer, was designed specifically for eukaryotes. It incorporates splice site models adapted from the GeneSplicer program and uses interpolated Markov models for evaluating the coding regions.
The Glimmer system consists of two main programs. The first of these is the training program, build-imm. This program takes an input set of sequences, then builds and outputs the IMM for them. These sequences can be complete genes or just partial ORFs. For a new genome, this training data can consist of those genes with strong database hits as well as very long open reading frames that are statistically almost certain to be genes.
The second program is glimmer, which uses this IMM to identify putative genes in an entire genome. Glimmer automatically resolves conflicts between most overlapping genes by choosing one of them. It also identifies genes that are suspected to truly overlap, and flags these for closer inspection by the user. These "suspect" gene candidates have been a very small percentage of the total for all the genomes analyzed thus far.
New! A Perl program is now available, free to all, that will use Glimmer's predictions as input to the BLAST and FASTA programs. You can use this program to search any locally-installed protein database. The program is described by this README file, and the code itself is available by clicking here. No license is required for this program.
Glimmer 2.0's Accuracy
The table above shows Glimmer 2.0's accuracy for ten complete bacterial and archaeal genomes. Organisms are listed in the order in which the sequencing projects were completed. Accuracy figures reflect Glimmer's default settings, which include a minimum gene length of 90bp. The majority of the genes missed were very short, either below the minimum or very close to it. The default settings produce additional gene predictions ranging from 7-20% of the total, many of which are likely to be false positives, but some of which may be genuine. The additional prediction rate drops quickly if the minimum gene length is set to be greater than 90bp. Note that some recent publications have referred to these additional genes as the "false positive" rate of Glimmer, but this is wrong. We currently cannot accurately state how many of the additional gene predictions will turn out to be correct.
A better measure of accuracy is to consider only "confirmed genes," which we define as genes that have a significant database match to a gene in another organism. The table below shows these statistics on 10 genomes for both Glimmer 1.0 and 2.0. (Note that the number confirmed for B. subtilis is very small because this reflects only those genes that were characterized experimentally prior to the completion of the B. subtilis
All of the above results were obtained by a very simple training procedure: Glimmer was trained by first extracting all non-overlapping open reading frames over 500bp (using the long-orfs program that comes with the system). The trained model was then used to find genes in the complete genome. Note that you can improve performance (!) by re-training using these long ORFs plus all genes with BLAST hits. In addition, you can generate considerably larger training sets for some genomes by using a value larger than 500bp in the long-orfs program. For example, in M. tuberculosis a value close to 800bp produces the largest training set.
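The first step of this training procedure, collecting long non-overlapping open reading frames, can be sketched in Python as follows. This is only an illustration of the idea and not the long-orfs program itself: the function names, the choice of ATG as the only start codon, and the longest-first rule for discarding overlapping candidates are all assumptions made for the example. The 500bp default mirrors the threshold mentioned above.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def orfs_on_strand(seq, min_len):
    """Yield (start, end) half-open coordinates of ATG..stop ORFs
    that are at least min_len bases long, in all three frames."""
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOPS:
                if i + 3 - start >= min_len:
                    yield start, i + 3
                start = None


def long_orfs(genome, min_len=500):
    """Return non-overlapping long ORFs as (start, end, strand) tuples,
    with coordinates given on the forward strand."""
    genome = genome.upper()
    n = len(genome)
    candidates = [(s, e, "+") for s, e in orfs_on_strand(genome, min_len)]
    # Map reverse-strand hits back onto forward-strand coordinates.
    candidates += [
        (n - e, n - s, "-")
        for s, e in orfs_on_strand(reverse_complement(genome), min_len)
    ]
    # Assumed policy: consider the longest ORFs first and keep a
    # candidate only if it overlaps nothing already kept.
    candidates.sort(key=lambda orf: orf[1] - orf[0], reverse=True)
    kept = []
    for s, e, strand in candidates:
        if all(e <= ks or s >= ke for ks, ke, _ in kept):
            kept.append((s, e, strand))
    return sorted(kept)
```

Raising min_len trades training-set size against confidence that each ORF is a real gene, which is why the best threshold can differ between genomes, as noted above for M. tuberculosis.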
A sample of Glimmer 1.0 output for H. pylori is contained here (360K). The Glimmer 1.0 output format is explained in this readme file. Note that the output file was generated using a minimum gene length of 180 bases, so shorter genes are missing. Glimmer 2.0 has similar output, with a few additional annotations included in its gene list.
Speed
For genomes under 2 megabases (e.g., H. pylori and H. influenzae), Glimmer 2.0 requires under one minute for training (the build-icm program) on a Linux PC powered by a Pentium 400 processor. It then takes only 20 seconds (the glimmer2 program) to find all the genes.
Note to Glimmer users: it is always preferable to train Glimmer on a sample of genes from the same genome that you are finding genes in. This is easy to do with any bacterial genome, using the long-orfs program to extract long open reading frames that can be used to bootstrap the system. (This is explained in the readme files that come with Glimmer.) If you wish to search for genes in a short fragment of DNA, Glimmer needs to be trained on a longer sequence. The best strategy is to train on a closely similar genome.
This software is OSI Certified Open Source Software.
% tar -xzf glimmer21.tar.gz
A directory will be created which contains the executable, training data sets, and other supporting files.
References
For a description of Glimmer 1.0 and 2.0 see our papers:
A.L. Delcher, D. Harmon, S. Kasif, O. White, and S.L. Salzberg. Improved microbial gene identification with GLIMMER (306K, PDF format) Nucleic Acids Research 27:23 (1999), 4636-4641.
S. Salzberg, A. Delcher, S. Kasif, and O. White. Microbial gene identification using interpolated Markov models (73K, PDF format) Nucleic Acids Research 26:2 (1998), 544-548. Reproduced with permission from NAR Online at http://www.oup.co.uk/nar.
Acknowledgements
The development of Glimmer was supported by the National Science Foundation under grants IRI-9530462 and IIS-9902923, and by the National Institutes of Health under grants K01-HG00022-01 and R01-LM06845-01. The Glimmer code was written by Art Delcher.
Visit the GlimmerM web page