Robert C. Edgar. Syncmers are more sensitive than minimizers for selecting conserved k-mers in biological sequences. Technical Report doi:10.1101/2020.09.29.319095, bioRxiv, 2020.
Download from publisher: https://doi.org/10.1101/2020.09.29.319095
Abstract:
Minimizers are widely used to select subsets of fixed-length substrings (k-mers) from biological sequences in applications ranging from read mapping to taxonomy prediction and indexing of large datasets. Syncmers are an alternative method for selecting a subset of k-mers. Unlike a minimizer, a syncmer is identified by its k-mer sequence alone and is therefore synchronized in the following sense: if a given k-mer is selected from one sequence, it will also be selected from any other sequence. Bounded syncmers are defined by a small and fast function of the k-mer sequence which exploits correlations between overlapping k-mers to guarantee that at least one syncmer must appear in a window of predetermined length, and therefore comprise a universal hitting set which does not require a precomputed lookup table. Bounded syncmers are shown to be unambiguously superior to minimizers because they achieve both lower density and better conservation in mutated sequences.
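
Illustrative sketch (not taken from the abstract): the abstract does not spell out the selection function, so the Python below assumes one rule with this kind of window guarantee, often described as a closed syncmer: a k-mer is selected when its smallest substring of length s occurs at its first or last position. The parameter `s`, the lexicographic ordering of substrings, and the function names `is_closed_syncmer` and `select_syncmers` are assumptions made for this example.

```python
from typing import List


def is_closed_syncmer(kmer: str, s: int) -> bool:
    """Return True if the smallest s-mer of `kmer` sits at its first or last position.

    Assumption: s-mers are compared lexicographically; a hash order could be
    substituted without changing the structure of the rule.
    """
    smers = [kmer[i:i + s] for i in range(len(kmer) - s + 1)]
    smallest = min(smers)
    return smers[0] == smallest or smers[-1] == smallest


def select_syncmers(seq: str, k: int, s: int) -> List[int]:
    """Return start positions of the k-mers in `seq` that pass the rule above."""
    return [
        i
        for i in range(len(seq) - k + 1)
        if is_closed_syncmer(seq[i:i + k], s)
    ]


if __name__ == "__main__":
    k, s = 7, 3
    seq = "ACGTTGCATGCCGATAGCTAGGCTAACGGTTCA"
    for p in select_syncmers(seq, k, s):
        print(p, seq[p:p + k])
```

Because the decision depends only on the k-mer itself, the same k-mer is selected wherever it occurs, which is the synchronization property the abstract describes. Under this assumed rule, every run of k - s consecutive k-mers contains at least one selected k-mer, so no precomputed lookup table is needed.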