Mário Lipovský, Tomáš Vinař, Broňa Brejová. Approximate Abundance Histograms and Their Use for Genome Size Estimation. In J. Hlaváčová, ed., Information Technologies - Applications and Theory (ITAT), number 1885 in CEUR-WS, pp. 27-34, Martinské hole, Slovakia, 2017.

Download preprint: not available

Download from publisher: http://ceur-ws.org/Vol-1885/27.pdf

Related web page: not available

Abstract: DNA sequencing data is typically a large collection
of short strings called reads. We can summarize
such data by computing a histogram of the number of occurrences
of substrings of a fixed length. Such histograms
can be used, for example, to estimate the size of a genome.
In this paper, we study a recent tool, Kmerlight, which
computes approximate histograms. We discover an approximation
bias, and we propose a new, unbiased version
of Kmerlight. We also model the distribution of approximation
errors and support our theoretical model by
experimental data. Finally, we use another tool, CovEst,
to compute genome size estimates from approximate histograms.
Our results show that although CovEst was designed
to work with exact histograms, it can also be used with
their approximate versions, which can be computed using
much less memory.
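
To make the notion of an abundance histogram concrete, here is a minimal Python sketch of the exact (not approximate) computation on a toy set of reads. It is not the algorithm used by Kmerlight or CovEst. The function name `kmer_abundance_histogram` and the example reads are illustrative assumptions, and the sketch ignores reverse complements and sequencing errors.

```python
from collections import Counter


def kmer_abundance_histogram(reads, k):
    """Map each abundance j to the number of distinct k-mers seen exactly j times.

    Hypothetical helper for illustration only: it counts every k-mer exactly,
    so its memory use grows with the number of distinct k-mers.
    """
    kmer_counts = Counter()
    for read in reads:
        # Slide a window of length k over the read; reads shorter than k yield nothing.
        for i in range(len(read) - k + 1):
            kmer_counts[read[i:i + k]] += 1
    # Count how many distinct k-mers share each occurrence count.
    return Counter(kmer_counts.values())


# Toy input (made up for this example, not real sequencing data).
reads = ["ACGTA", "CGTAC", "GGG"]
histogram = kmer_abundance_histogram(reads, k=3)

for abundance in sorted(histogram):
    print(abundance, histogram[abundance])
```

In this toy input, three 3-mers (ACG, TAC, GGG) occur once and two (CGT, GTA) occur twice. Because the exact computation stores a count for every distinct substring, approximate histograms are attractive for large data sets.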