2-AIN-506 and 2-AIN-252: Seminar in Bioinformatics (2) and (4)
Summer 2020

Sophie Rohling, Alexander Linne, Jendrik Schellhorn, Morteza Hosseini, Thomas Dencker, Burkhard Morgenstern. The number of k-mer matches between two DNA sequences as a function of k and applications to estimate phylogenetic distances. PLoS One, 15(2):e0228070. 2020.

Download preprint: not available

Download from publisher: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0228070

Related web page: not available

Bibliography entry: BibTeX


We study the number N_k of length-k word matches between pairs of evolutionarily
related DNA sequences, as a function of k. We show that the Jukes-Cantor distance
between two genome sequences, i.e. the number of substitutions per site that
occurred since they evolved from their last common ancestor, can be estimated from
the slope of a function F that depends on N_k and that is affine-linear within a
certain range of k. Integers k_min and k_max can be calculated depending on the
length of the input sequences, such that the slope of F in the relevant range can
be estimated from the values F(k_min) and F(k_max). This approach can be
generalized to so-called Spaced-word Matches (SpaM), where mismatches are allowed
at positions specified by a user-defined binary pattern. Based on these
theoretical results, we implemented a prototype software program for
alignment-free sequence comparison called Slope-SpaM. Test runs on real and
simulated sequence data show that Slope-SpaM can accurately estimate phylogenetic
distances for distances up to around 0.5 substitutions per position. The
statistical stability of our results is improved if spaced words are used instead
of contiguous words. Unlike previous alignment-free methods that are based on the
number of (spaced) word matches, Slope-SpaM produces accurate results, even if
sequences share only local homologies.
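
The slope idea from the abstract can be illustrated with a minimal Python sketch. It rests on a simplified model that the abstract does not spell out: for two sequences of length L related by substitutions only, with per-site match probability p, the expected number of k-mer matches is roughly L * p^k from homologous positions plus about L^2 * (1/4)^k from chance matches. Subtracting the chance term and taking the logarithm gives a function that is approximately linear in k with slope ln(p), and the Jukes-Cantor formula converts p into a distance. The function names, the exact form of the background correction, and the hand-picked word lengths are assumptions made for illustration; this is not the Slope-SpaM implementation, and the paper's calculation of k_min and k_max from the sequence lengths is not reproduced here.

```python
import math
import random
from collections import Counter


def count_kmer_matches(a, b, k):
    """Count pairs (i, j) such that a[i:i+k] == b[j:j+k]."""
    counts_a = Counter(a[i:i + k] for i in range(len(a) - k + 1))
    counts_b = Counter(b[j:j + k] for j in range(len(b) - k + 1))
    return sum(n * counts_b[word] for word, n in counts_a.items() if word in counts_b)


def f_value(a, b, k, q=0.25):
    """Log of the match count after removing the expected chance matches.

    q is the assumed probability that two unrelated bases agree
    (1/4 for uniform base frequencies). Raises ValueError if the
    corrected count is not positive, e.g. for unrelated sequences.
    """
    n_k = count_kmer_matches(a, b, k)
    background = (len(a) - k + 1) * (len(b) - k + 1) * q ** k
    return math.log(n_k - background)


def slope_distance(a, b, k_lo, k_hi):
    """Estimate the Jukes-Cantor distance from the slope between two k values.

    k_lo and k_hi are chosen by the caller (hypothetical stand-ins for
    the k_min and k_max of the paper).
    """
    slope = (f_value(a, b, k_hi) - f_value(a, b, k_lo)) / (k_hi - k_lo)
    p = math.exp(slope)  # estimated per-site match probability
    # Jukes-Cantor correction; undefined if the estimated p is 1/4 or lower.
    return -0.75 * math.log(1.0 - 4.0 / 3.0 * (1.0 - p))


def mutate(seq, rate, rng):
    """Replace each base by a different base with probability `rate`."""
    out = []
    for base in seq:
        if rng.random() < rate:
            out.append(rng.choice([x for x in "ACGT" if x != base]))
        else:
            out.append(base)
    return "".join(out)


if __name__ == "__main__":
    rng = random.Random(0)
    ancestor = "".join(rng.choice("ACGT") for _ in range(50000))
    descendant = mutate(ancestor, 0.1, rng)
    # A mismatch rate of 0.1 corresponds to a Jukes-Cantor distance of about 0.107.
    print(slope_distance(ancestor, descendant, 11, 15))
```

Only two word lengths are evaluated, mirroring the abstract's point that the slope can be obtained from two values of F. The word lengths 11 and 15 were picked so that chance matches are already rare for sequences of this length while enough homologous matches remain; other sequence lengths would need other values.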