More than 20 years after the Human Genome Project was completed, scientists are still working to interpret the more than three billion base pairs that make up human DNA. The genome sequence is a long string of chemical letters, a kind of book collection with no commas, periods, paragraphs, or other marks to show where a sentence begins and ends.
The challenge for scientists has been to identify the “sentences” within the genome that make up the thousands of genes controlling everything from eye color to heart abnormalities. To enable clinicians to diagnose rare diseases or cancer, or to develop gene-modifying treatments, scientists must be confident about a gene’s sequence and function, including the slight variations in that gene from one person to the next.
To document gene sequences, scientists have been creating so-called genome annotation databases with the structural and functional knowledge of human genes. However, computational biologist Steven Salzberg says several databases of such genome annotations, including one he created with biomedical engineer Mihaela Pertea, agree on fewer than half of human genes.
“And that’s just protein-coding genes,” says Salzberg, the Bloomberg Distinguished Professor of Computational Biology and Genomics at the Johns Hopkins University. “The available databases are not even close to agreeing on non-protein coding genes, for which we know the function of about 5% of those genes.”
Salzberg, Pertea, and computer science PhD student Hyun Joo “Hayden” Ji authored a review in Nature Reviews Genetics that challenges the scientific community to work toward greater consensus on the actual sequences of genes and their functions, variants, and protein products.
“Scientists need to get this right—it’s especially important for clinicians who are using annotation information to help diagnose and treat disease,” the authors note.
Sorting out gene structure and function is not easy. Human genes vary between individuals, and a single gene can give rise to slightly different versions of the protein it encodes.
A gene’s DNA sequence serves as a template for RNA molecules, which in turn direct the production of proteins. The RNA sequence contains exons, the regions that get translated into proteins, and introns, the regions that are not used for translation.
When RNA is spliced to remove introns, exons can be joined in different combinations to make different versions, or isoforms, of proteins. It’s evolution’s way of diversifying the protein portfolio in the human body. Some isoforms don’t end up as functional proteins. Others are more likely to trigger disease, and some provide a biological advantage. Current databases have catalogued anywhere from 100,000 to 220,000 protein isoforms, depending on which isoforms scientists think are functional.
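To see why isoform counts grow so quickly, the splicing idea can be sketched in a few lines of Python. This is a toy model only: the exon sequences are made up, and the helper `spliced_isoforms` is a hypothetical name. It assumes the simplest case, in which the first and last exons are always kept and any internal exon may be skipped, which leaves out much of what real splicing does.

```python
from itertools import combinations


def spliced_isoforms(exons):
    """Return every transcript made by keeping the first and last exon
    plus any in-order subset of the internal exons (simple exon skipping)."""
    first, *internal, last = exons
    isoforms = []
    for k in range(len(internal) + 1):
        # combinations() preserves the original exon order
        for kept in combinations(internal, k):
            isoforms.append("".join((first, *kept, last)))
    return isoforms


# Made-up exon sequences, for illustration only.
exons = ["AUGGCU", "GAAUUC", "CCGGAU", "UUUUAA"]
isoforms = spliced_isoforms(exons)

for transcript in isoforms:
    print(transcript)
```

Under this assumption a gene with n internal exons yields 2^n candidate transcripts, so even this small example with two internal exons produces four. Deciding which candidates are real and functional is the part the databases disagree on.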
“We’re far from consensus on these questions,” says Salzberg.
In the review article, the Johns Hopkins authors provide details on annotation databases and tools, and their advantages and imperfections.
“We have better tools today to develop gene annotation, but these tools haven’t completely solved it for us,” says Salzberg.
The authors say that newer annotation methods improve on older ones. These include methods that sequence RNA and trace it back to the gene sequence it came from, potentially providing more direct evidence of transcription boundaries.
Looking at the protein end product, Salzberg says that an AI-assisted program called AlphaFold2 has also been helpful for gene annotation. The program predicts how a protein is likely to fold and scores its confidence in that prediction. Predicted folds with higher scores are more likely to represent functional protein variants that can be traced back to the gene’s sequence. However, about a third of proteins don’t fold, such as proteins in cell membranes.
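The scoring idea can be sketched as a simple ranking step. This is not the authors' pipeline: the scores below are invented, the cutoff of 70 is an arbitrary assumption, and `rank_isoforms` is a hypothetical helper. It also assumes each predicted structure comes with a per-residue confidence value between 0 and 100, whereas the article says only that the program scores the certainty of its prediction.

```python
from statistics import mean


def rank_isoforms(confidence_by_isoform, threshold=70.0):
    """Average the per-residue confidence of each isoform and return the
    isoforms at or above the threshold, highest average first."""
    averages = {
        name: mean(scores) for name, scores in confidence_by_isoform.items()
    }
    kept = [(name, avg) for name, avg in averages.items() if avg >= threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Invented per-residue confidence values, for illustration only.
scores = {
    "isoform_a": [92.0, 88.0, 90.0],
    "isoform_b": [45.0, 52.0, 38.0],
    "isoform_c": [75.0, 71.0, 79.0],
}
ranked = rank_isoforms(scores)

for name, avg in ranked:
    print(name, avg)
```

A low score does not by itself prove that an isoform is nonfunctional, so a filter like this can only help prioritize candidates.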
“Gene annotations vary a lot because scientists use different tools,” says Salzberg. “Most genes in annotation databases are described only on a molecular level, and more should include the human conditions and diseases associated with them, as well as their variants.”
He says the path to better gene annotation is more research on high-throughput, automated annotation technology.
“It’s important not only for answers to biological questions, but for improving biomedical research that affects people,” he says.
This article originally appeared in Johns Hopkins Medicine Fundamentals.