Vladimír Boža, Broňa Brejová, Tomáš Vinař. GAML: genome assembly by maximum likelihood. Algorithms for Molecular Biology, 10:18. 2015. Early version at WABI 2014.

Download preprint: not available

Download from publisher: http://www.almob.org/content/10/1/18/abstract

Related web page: http://compbio.fmph.uniba.sk/gaml



BACKGROUND: Resolution of repeats and scaffolding of shorter contigs
are critical parts of genome assembly. Modern assemblers usually
perform such steps by heuristics, often tailored to a particular
technology for producing paired or long reads. 

RESULTS: We propose a new framework that allows systematic combination
of diverse sequencing datasets into a single assembly. We achieve this
by searching for an assembly with the maximum likelihood in a
probabilistic model capturing error rate, insert lengths, and other
characteristics of the sequencing technology used to produce each
dataset. We have implemented a prototype genome assembler GAML that
can use any combination of insert sizes with Illumina or 454 reads, as
well as PacBio reads. Our experiments show that we can assemble short
genomes with N50 sizes and error rates comparable to ALLPATHS-LG or
Cerulean. While ALLPATHS-LG and Cerulean each require a specific
combination of datasets, GAML works on any combination.
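
To make the idea of scoring an assembly by its likelihood concrete, here is a minimal sketch under strong simplifying assumptions. It is not the model implemented in GAML: it handles only unpaired reads on the forward strand, allows only substitution errors, ignores insert lengths, and assumes every start position is equally likely. The function names and the representation of a dataset as a (reads, error rate) pair are hypothetical, chosen only for this illustration. It shows how datasets with different error rates can contribute to a single score by summing their log-likelihoods.

```python
import math


def read_log_prob(read, assembly, error_rate):
    """Log-probability of one read under a toy substitution-only model.

    Assumption: the read starts at a uniformly random position of the
    assembly (forward strand only), and each base is miscalled
    independently with probability error_rate, uniformly to one of the
    three other bases.
    """
    n, m = len(assembly), len(read)
    if m == 0 or m > n:
        return float("-inf")
    total = 0.0
    for start in range(n - m + 1):
        window = assembly[start:start + m]
        mismatches = sum(1 for a, b in zip(read, window) if a != b)
        total += (error_rate / 3) ** mismatches * (1 - error_rate) ** (m - mismatches)
    if total == 0.0:
        return float("-inf")
    # Average over all equally likely start positions.
    return math.log(total / (n - m + 1))


def assembly_log_likelihood(assembly, datasets):
    """Sum log-probabilities of all reads over all datasets.

    datasets is a list of (reads, error_rate) pairs, so each dataset
    is scored with its own error rate.
    """
    return sum(
        read_log_prob(read, assembly, error_rate)
        for reads, error_rate in datasets
        for read in reads
    )


# Two toy datasets with different error rates.
datasets = [(["ACGT", "GTAC"], 0.01), (["ACGTAC"], 0.10)]
candidates = ["ACGTAC", "ACGTTT"]
best = max(candidates, key=lambda a: assembly_log_likelihood(a, datasets))
print(best)
```

In this sketch the search is a plain maximum over a fixed list of candidate strings; a real assembler has to explore a far larger space of candidate assemblies.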

CONCLUSIONS: We have introduced a new probabilistic approach to genome
assembly and demonstrated that this approach can lead to superior
results when used to combine a diverse set of datasets from different
sequencing technologies. Data and software are available at