Michal Hozza, Tomáš Vinař, Broňa Brejová. How Big is That Genome? Estimating Genome Size and Coverage from k-mer Abundance Spectra. In Costas S. Iliopoulos, Simon J. Puglisi, Emine Yilmaz, eds., String Processing and Information Retrieval (SPIRE), volume 9309 of Lecture Notes in Computer Science, pp. 199-209, London, UK, September 2015. Springer.

Download preprint: not available

Download from publisher: http://dx.doi.org/10.1007/978-3-319-23826-5_20

Related web page: http://compbio.fmph.uniba.sk/covest/


Abstract:

Many practical algorithms for sequence alignment, genome assembly and other
tasks represent a sequence as a set of k-mers. Here, we address the
problems of estimating genome size and sequencing coverage from
sequencing reads, without the need for sequence assembly. Our
estimates are based on a histogram of k-mer abundance in the input set
of sequencing reads and on probabilistic modeling of the distribution
of k-mer abundance based on parameters related to the coverage,
error rate and repeat structure of the genome. Our method provides
reliable estimates even at coverage as low as 0.5 or at error rates as
high as 10%.
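
To make the idea of a k-mer abundance histogram concrete, the following is a minimal sketch of a naive, peak-based estimate of genome size and coverage. It is not the probabilistic model described in the abstract: the function names (`kmer_histogram`, `naive_estimates`), the `min_abundance` cutoff used to discard likely erroneous k-mers, and the assumption of fixed-length reads are all illustrative choices made here, not taken from the paper.

```python
from collections import Counter


def kmer_histogram(reads, k):
    """Map each abundance value to the number of distinct k-mers seen that often."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return Counter(counts.values())


def naive_estimates(hist, read_length, k, min_abundance=2):
    """Peak-based estimates of genome size and base coverage.

    Assumption (hypothetical, for illustration): k-mers with abundance
    below min_abundance are treated as sequencing errors and ignored.
    """
    solid = {a: n for a, n in hist.items() if a >= min_abundance}
    if not solid:
        raise ValueError("no k-mers at or above the abundance cutoff")
    # Take the most common abundance as the k-mer coverage
    # (ties are broken in favour of the larger abundance).
    kmer_cov = max(solid, key=lambda a: (solid[a], a))
    # Total k-mer occurrences divided by k-mer coverage approximates
    # the number of k-mer positions in the genome.
    total_kmers = sum(a * n for a, n in solid.items())
    genome_size = total_kmers / kmer_cov
    # A read of length L contributes L - k + 1 k-mers, so base coverage
    # is the k-mer coverage scaled by L / (L - k + 1).
    base_cov = kmer_cov * read_length / (read_length - k + 1)
    return genome_size, base_cov


# Toy input: four identical error-free reads covering a 10-base sequence.
toy_reads = ["ACGTTGCAAG"] * 4
toy_hist = kmer_histogram(toy_reads, k=3)
toy_size, toy_cov = naive_estimates(toy_hist, read_length=10, k=3)
```

A peak-based estimate of this kind needs a clearly visible peak in the histogram, so it can be expected to degrade when coverage is very low or the error rate is high. That is the regime the abstract reports handling by fitting a probabilistic model of k-mer abundance instead.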