2-AIN-505, 2-AIN-251: Seminár z bioinformatiky (1) a (3)
Zima 2014
Abstrakt

Jared T. Simpson, Richard Durbin. Efficient construction of an assembly string graph using the FM-index. Bioinformatics, 26(12):i367-373. 2010.

Download preprint: not available

Download from publisher: not available PubMed

Related web page: not available

Bibliography entry: BibTeX

Abstract:

MOTIVATION: Sequence assembly is a difficult problem whose importance has grown
again recently as the cost of sequencing has dramatically dropped. Most new
sequence assembly software has started by building a de Bruijn graph, avoiding
the overlap-based methods used previously because of the computational cost and
complexity of these with very large numbers of short reads. Here, we show how to 
use suffix array-based methods that have formed the basis of recent very fast
sequence mapping algorithms to find overlaps and generate assembly string graphs 
asymptotically faster than previously described algorithms. RESULTS: Standard
overlap assembly methods have time complexity O(N(2)), where N is the sum of the 
lengths of the reads. We use the Ferragina-Manzini index (FM-index) derived from 
the Burrows-Wheeler transform to find overlaps of length at least tau among a set
of reads. As well as an approach that finds all overlaps then implements
transitive reduction to produce a string graph, we show how to output directly
only the irreducible overlaps, significantly shrinking memory requirements and
reducing compute time to O(N), independent of depth. Overlap-based assembly
methods naturally handle mixed length read sets, including capillary reads or
long reads promised by the third generation sequencing technologies. The
algorithms we present here pave the way for overlap-based assembly approaches to 
be developed that scale to whole vertebrate genome de novo assembly.