Publication details

Adam Siepel, Mark Diekhans, Brona Brejová, Laura Langton, Michael Stevens, Charles L.G. Comstock, Colleen Davis, Brent Ewing, Shelly Oommen, Christopher Lau, Hung-Chun Yu, Jianfeng Li, Bruce A. Roe, Phil Green, Daniela S. Gerhard, Gary Temple, David Haussler, and Michael R. Brent. Targeted discovery of novel human exons by comparative genomics. Genome Research, 17(12):1763-73. 2007.

Abstract

A complete and accurate set of human protein-coding gene annotations
is perhaps the single most important resource for genomic research
after the human-genome sequence itself, yet the major gene catalogs
remain incomplete and imperfect. Here we describe a genome-wide
effort, carried out as part of the Mammalian Gene Collection (MGC)
project, to identify human genes not yet in the gene catalogs. Our
approach was to produce gene predictions by algorithms that rely on
comparative sequence data but do not require direct cDNA evidence,
then to test predicted novel genes by RT-PCR. We have identified 734
novel gene fragments (NGFs) containing 2188 exons with, at most, weak
prior cDNA support. These NGFs correspond to an estimated 563 distinct
genes, of which >160 are completely absent from the major gene
catalogs, while hundreds of others represent significant extensions of
known genes. The NGFs appear to be predominantly protein-coding genes
rather than noncoding RNAs, unlike novel transcribed sequences
identified by technologies such as tiling arrays and CAGE. They tend
to be expressed at low levels and in a tissue-specific manner, and
they are enriched for roles in motor activity, cell adhesion,
connective tissue, and central nervous system development. Our results
demonstrate that many important genes and gene fragments have been
missed by traditional approaches to gene discovery but can be
identified by their evolutionary signatures using comparative sequence
data. However, they suggest that hundreds, not thousands, of
protein-coding genes are completely missing from the current gene
catalogs.