2-AIN-505, 2-AIN-251: Seminar in Bioinformatics (1) and (3)
Winter 2017

Sergey Koren, Brian P. Walenz, Konstantin Berlin, Jason R. Miller, Nicholas H. Bergman, Adam M. Phillippy. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Research, 27(5):722-736. 2017.


Long-read single-molecule sequencing has revolutionized de novo genome assembly
and enabled the automated reconstruction of reference-quality genomes. However,
given the relatively high error rates of such technologies, efficient and
accurate assembly of large repeats and closely related haplotypes remains
challenging. We address these issues with Canu, a successor of Celera Assembler
that is specifically designed for noisy single-molecule sequences. Canu
introduces support for nanopore sequencing, halves depth-of-coverage
requirements, and improves assembly continuity while simultaneously reducing
runtime by an order of magnitude on large genomes versus Celera Assembler 8.2.
These advances result from new overlapping and assembly algorithms, including an 
adaptive overlapping strategy based on tf-idf weighted MinHash and a sparse
assembly graph construction that avoids collapsing diverged repeats and
haplotypes. We demonstrate that Canu can reliably assemble complete microbial
genomes and near-complete eukaryotic chromosomes using either Pacific Biosciences
(PacBio) or Oxford Nanopore technologies and achieves a contig NG50 of >21 Mbp on
both human and Drosophila melanogaster PacBio data sets. For assembly structures 
that cannot be linearly represented, Canu provides graph-based assembly outputs
in graphical fragment assembly (GFA) format for analysis or integration with
complementary phasing and scaffolding techniques. The combination of such highly 
resolved assembly graphs with long-range scaffolding information promises the
complete and automated assembly of complex genomes.
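
The idea behind a tf-idf weighted MinHash can be illustrated with a small sketch. This is not Canu's implementation; it is a simplified illustration under stated assumptions: each k-mer receives an integer weight that grows with its count inside the read (tf) and shrinks with the fraction of reads containing it (idf), and a k-mer with weight w is hashed w times per sketch slot, so rare k-mers are more likely to become the minimum than repetitive ones. All function names, the `max_weight` parameter, and the weight discretization are hypothetical choices made for this example.

```python
import hashlib
import math
from collections import Counter


def kmers(seq, k):
    """Return all overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def idf_weights(reads, k, max_weight=3):
    """Assign each k-mer an integer weight in 1..max_weight (assumed scheme).

    k-mers present in every read get weight 1; k-mers present in a single
    read get max_weight.
    """
    n = len(reads)
    doc_freq = Counter()
    for read in reads:
        doc_freq.update(set(kmers(read, k)))
    max_idf = math.log(n) if n > 1 else 0.0
    weights = {}
    for kmer, df in doc_freq.items():
        scaled = math.log(n / df) / max_idf if max_idf > 0 else 0.0
        weights[kmer] = 1 + round((max_weight - 1) * scaled)
    return weights


def _hash(kmer, slot, copy):
    """Deterministic 64-bit hash of one (slot, copy, k-mer) triple."""
    data = f"{slot}:{copy}:{kmer}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def weighted_sketch(read, k, weights, num_hashes=64):
    """Weighted MinHash sketch: one minimum hash value per slot."""
    counts = Counter(kmers(read, k))  # term frequency within this read
    if not counts:
        raise ValueError("read is shorter than k")
    sketch = []
    for slot in range(num_hashes):
        best = min(
            _hash(kmer, slot, copy)
            for kmer, tf in counts.items()
            # a k-mer with combined weight w contributes w hash values
            for copy in range(tf * weights.get(kmer, 1))
        )
        sketch.append(best)
    return sketch


def similarity(sketch_a, sketch_b):
    """Fraction of slots whose minima agree (estimates weighted Jaccard)."""
    matches = sum(a == b for a, b in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)


reads = ["ACGTACGTAA", "ACGTACGTCC", "TTTTGGGGCC"]
w = idf_weights(reads, k=4)
sketches = [weighted_sketch(r, 4, w) for r in reads]
print(similarity(sketches[0], sketches[1]), similarity(sketches[0], sketches[2]))
```

In this toy input the first two reads share most of their 4-mers, while the third shares none with the first, so the first similarity should be clearly larger than the second. A production overlapper would use far larger k, avoid rehashing every k-mer for every slot, and verify candidate pairs by alignment.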