Mário Lipovský, Tomáš Vinař, Broňa Brejová.
Approximate Abundance Histograms and Their Use for Genome Size Estimation.
In J. Hlaváčová, ed.,
Information Technologies - Applications and Theory (ITAT),
number 1885 in CEUR-WS,
pp. 27-34, Martinské hole, Slovakia, 2017.
Abstract: DNA sequencing data is typically a large collection of short strings called reads. We can summarize such data by computing a histogram of the number of occurrences of substrings of a fixed length. Such histograms can be used, for example, to estimate the size of a genome. In this paper, we study a recent tool, Kmerlight, which computes approximate histograms. We discover an approximation bias and propose a new, unbiased version of Kmerlight. We also model the distribution of approximation errors and support our theoretical model with experimental data. Finally, we use another tool, CovEst, to compute genome size estimates from approximate histograms. Our results show that although CovEst was designed to work with exact histograms, it can also be used with approximate ones, which can be computed using much less memory.
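To make the notion of an abundance histogram concrete, the following minimal Python sketch computes an exact histogram for a toy set of reads. It is an illustration only, not the method used by Kmerlight (which approximates the histogram in much less memory) or CovEst. The function name `abundance_histogram` and the toy reads are hypothetical, and the sketch ignores reverse complements, which practical k-mer counters usually merge with the forward strand.

```python
from collections import Counter


def abundance_histogram(reads, k):
    """Map each abundance value to the number of distinct k-mers seen that many times.

    Hypothetical helper for illustration; counts exactly and keeps every
    distinct k-mer in memory, unlike an approximate sketch-based tool.
    """
    kmer_counts = Counter()
    for read in reads:
        # Slide a window of length k over the read.
        for i in range(len(read) - k + 1):
            kmer_counts[read[i:i + k]] += 1
    # Histogram entry h[j] = number of distinct k-mers occurring exactly j times.
    return dict(Counter(kmer_counts.values()))


# Toy input (assumed data, not from the paper).
reads = ["ACGTAC", "GTACGG"]
hist = abundance_histogram(reads, 3)
print(hist)
```

For these two reads and k = 3, the k-mers ACG, GTA and TAC each occur twice, while CGT and CGG occur once, so the histogram has two entries: three k-mers with abundance 2 and two with abundance 1. The exact approach stores every distinct k-mer, which is the memory cost that approximate histograms avoid.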