Askar Gafurov, Tomáš Vinař, Broňa Brejová. Probabilistic Models of k-mer Frequencies. In L. De Mol, A. Weiermann, F. Manea, D. Fernández-Duque, eds., Connecting with Computability (CiE 2021), volume 12813 of Lecture Notes in Computer Science, pp. 227-236, Computability in Europe, 2021. Computational pangenomics special session.


In this article, we review existing probabilistic models of the 
abundance of fixed-length strings (k-mers) in DNA sequencing data. These 
models capture the dependence of abundance on various phenomena, such as the 
size and repeat content of the genome, heterozygosity levels, and the 
sequencing error rate. This in turn makes it possible to estimate these 
properties from k-mer abundance histograms observed in real data. We also briefly discuss the 
issue of comparing k-mer abundance between related sequencing samples and 
meaningfully summarizing the results.
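
To make the notion of a k-mer abundance histogram concrete, here is a minimal Python sketch. It is not one of the probabilistic models reviewed in the article. It only counts k-mers, builds the histogram, and applies a naive rule of thumb (total k-mer occurrences divided by the abundance at the main peak) to estimate genome size. The function names and the `min_abundance` cutoff are hypothetical choices made for this illustration, and the sketch ignores repeats, heterozygosity, and reverse complements.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every length-k substring (k-mer) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def abundance_histogram(counts):
    """Map abundance a -> number of distinct k-mers seen exactly a times."""
    return Counter(counts.values())


def estimate_genome_size(hist, min_abundance=2):
    """Naive estimate: total k-mer occurrences / abundance at the main peak.

    Assumption for this sketch: k-mers seen fewer than min_abundance
    times are treated as likely sequencing errors and ignored.
    """
    solid = {a: n for a, n in hist.items() if a >= min_abundance}
    if not solid:
        raise ValueError("no k-mers at or above min_abundance")
    # Peak = abundance shared by the most distinct k-mers
    # (ties are broken in favour of the smaller abundance).
    peak = max(solid, key=lambda a: (solid[a], -a))
    total = sum(a * n for a, n in solid.items())
    return total / peak


if __name__ == "__main__":
    # Toy data: four error-free copies of a 10-base sequence.
    reads = ["ACGTTGCAAG"] * 4
    hist = abundance_histogram(kmer_counts(reads, 3))
    print(dict(hist))
    print(estimate_genome_size(hist))
```

On real data the histogram is a mixture of an error peak, a main coverage peak, and further peaks from repeats and heterozygous sites, which is why model-based estimation is preferable to this single-peak heuristic.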