Askar Gafurov, Tomáš Vinař, Broňa Brejová. Probabilistic Models of k-mer Frequencies. In L. De Mol, A. Weiermann, F. Manea, D. Fernández-Duque, eds., Connecting with Computability (CiE 2021), volume 12813 of Lecture Notes in Computer Science, pp. 227-236, Computability in Europe, 2021. Computational pangenomics special session.
Download from publisher: https://link.springer.com/chapter/10.1007/978-3-030-80049-9_21
In this article, we review existing probabilistic models of the abundance of fixed-length strings (k-mers) in DNA sequencing data. These models capture the dependence of the abundance on various phenomena, such as the size and repeat content of the genome, heterozygosity levels, and the sequencing error rate. This in turn allows these properties to be estimated from k-mer abundance histograms observed in real data. We also briefly discuss the issue of comparing k-mer abundance between related sequencing samples and meaningfully summarizing the results.
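To make the notion of a k-mer abundance histogram concrete, the following is a minimal Python sketch and not one of the models reviewed in the article. It counts k-mers in a set of reads, builds the histogram, and applies a common back-of-the-envelope genome size estimate: the total number of k-mer occurrences divided by the abundance at the histogram peak. The function names and the `min_abundance` cutoff for discarding likely sequencing errors are assumptions made for illustration; the sketch also ignores reverse complements, repeats, and heterozygosity.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every length-k substring (k-mer) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def abundance_histogram(counts):
    """Map abundance a to the number of distinct k-mers seen exactly a times."""
    return Counter(counts.values())


def naive_genome_size(hist, min_abundance=2):
    """Rough estimate: total k-mer occurrences / abundance at the histogram peak.

    k-mers with abundance below min_abundance are treated as likely
    sequencing errors and ignored (an illustrative assumption).
    """
    solid = {a: n for a, n in hist.items() if a >= min_abundance}
    if not solid:
        raise ValueError("no k-mers at or above the abundance cutoff")
    peak = max(solid, key=lambda a: solid[a])      # most common abundance
    total = sum(a * n for a, n in solid.items())   # total k-mer occurrences
    return total / peak


if __name__ == "__main__":
    reads = ["ACGTAC", "ACGTAC", "ACGTAC"]  # toy data, 3x coverage
    hist = abundance_histogram(kmer_counts(reads, k=3))
    print(sorted(hist.items()))
    print(naive_genome_size(hist))
```

On the toy input every one of the four distinct 3-mers appears three times, so the histogram has a single peak at abundance 3 and the estimate equals the number of distinct k-mers. Real data would need a probabilistic model of the kind the article reviews to separate errors, repeats, and heterozygous regions.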