2-AIN-505, 2-AIN-251: Seminar in Bioinformatics (1) and (3)
Winter 2017
Abstract

Aleksey V. Zimin, Daniela Puiu, Ming-Cheng Luo, Tingting Zhu, Sergey Koren, Guillaume Marcais, James A. Yorke, Jan Dvorak, Steven L. Salzberg. Hybrid assembly of the large and highly repetitive genome of Aegilops tauschii, a progenitor of bread wheat, with the MaSuRCA mega-reads algorithm. Genome Research, 27(5):787-792. 2017.

Abstract:

Long sequencing reads generated by single-molecule sequencing technology offer
the possibility of dramatically improving the contiguity of genome assemblies.
The biggest challenge today is that long reads have relatively high error rates, 
currently around 15%. The high error rates make it difficult to use this data
alone, particularly with highly repetitive plant genomes. Errors in the raw data 
can lead to insertion or deletion errors (indels) in the consensus genome
sequence, which in turn create significant problems for downstream analysis; for 
example, a single indel may shift the reading frame and incorrectly truncate a
protein sequence. Here, we describe an algorithm that solves the high error rate 
problem by combining long, high-error reads with shorter but much more accurate
Illumina sequencing reads, whose error rates average <1%. Our hybrid assembly
algorithm combines these two types of reads to construct mega-reads, which are
both long and accurate, and then assembles the mega-reads using the CABOG
assembler, which was designed for long reads. We apply this technique to a large 
data set of Illumina and PacBio sequences from the species Aegilops tauschii, a
large and extremely repetitive plant genome that has resisted previous attempts
at assembly. We show that the resulting assembled contigs are far larger than in 
any previous assembly, with an N50 contig size of 486,807 nucleotides. We compare
the contigs to independently produced optical maps to evaluate their large-scale 
accuracy, and to a set of high-quality bacterial artificial chromosome
(BAC)-based assemblies to evaluate base-level accuracy.
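
The abstract reports contiguity as an N50 contig size. As a reminder of what that statistic measures, here is a minimal sketch of the usual definition: the largest length L such that contigs of length at least L together cover at least half of the total assembled bases. The function name `n50` and the toy contig lengths are illustrative assumptions and are not taken from the paper or from MaSuRCA.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are visited from longest to shortest; the N50 is the length
    of the contig at which the running sum first reaches half of the
    total assembled length.
    """
    if not lengths:
        raise ValueError("n50() needs at least one contig length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Compare 2 * running with total to stay in integer arithmetic.
        if 2 * running >= total:
            return length


# Toy example (hypothetical lengths): the total is 30, and the two
# longest contigs (10 + 6 = 16) already cover at least half of it.
toy_contigs = [2, 3, 4, 5, 6, 10]
print(n50(toy_contigs))  # prints 6
```

Because N50 depends only on the length distribution, it says nothing about correctness, which is why the authors separately check the contigs against optical maps and BAC-based assemblies.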