Adam Siepel, Mark Diekhans, Brona Brejova, Laura Langton, Michael Stevens, Charles L. G. Comstock, Colleen Davis, Brent Ewing, Shelly Oommen, Christopher Lau, Hung-Chun Yu, Jianfeng Li, Bruce A. Roe, Phil Green, Daniela S. Gerhard, Gary Temple, David Haussler, Michael R. Brent. Targeted discovery of novel human exons by comparative genomics. Genome Research, 17(12):1763-1773. 2007.

Download preprint: not available

Download from publisher: http://dx.doi.org/10.1101/gr.7128207

Related web page: http://compgen.bscb.cornell.edu/projects/mgc/

Abstract:

A complete and accurate set of human protein-coding gene annotations is
perhaps the single most important resource for genomic research after the
human-genome sequence itself, yet the major gene catalogs remain
incomplete and imperfect. Here we describe a genome-wide effort, carried
out as part of the Mammalian Gene Collection (MGC) project, to identify
human genes not yet in the gene catalogs. Our approach was to produce gene
predictions by algorithms that rely on comparative sequence data but do
not require direct cDNA evidence, then to test predicted novel genes by
RT-PCR. We have identified 734 novel gene fragments (NGFs) containing 2188
exons with, at most, weak prior cDNA support. These NGFs correspond to an
estimated 563 distinct genes, of which >160 are completely absent from the
major gene catalogs, while hundreds of others represent significant
extensions of known genes. The NGFs appear to be predominantly
protein-coding genes rather than noncoding RNAs, unlike novel transcribed
sequences identified by technologies such as tiling arrays and CAGE. They
tend to be expressed at low levels and in a tissue-specific manner, and
they are enriched for roles in motor activity, cell adhesion, connective
tissue, and central nervous system development. Our results demonstrate
that many important genes and gene fragments have been missed by
traditional approaches to gene discovery but can be identified by their
evolutionary signatures using comparative sequence data. However, they
suggest that hundreds, not thousands, of protein-coding genes are completely
missing from the current gene catalogs.