2-AIN-506 a 2-AIN-252: Seminár z bioinformatiky (2) a (4)
Leto 2017

Richard Durbin. Efficient haplotype matching and storage using the positional Burrows-Wheelertransform (PBWT). Bioinformatics, 30(9):1266-1272. 2014.

Download preprint: not available

Download from publisher: https://academic.oup.com/bioinformatics/article-lookup/doi/10.1093/bioinformatics/btu014 PubMed

Related web page: not available

Bibliography entry: BibTeX


MOTIVATION: Over the last few years, methods based on suffix arrays using the
Burrows-Wheeler Transform have been widely used for DNA sequence read matching
and assembly. These provide very fast search algorithms, linear in the search
pattern size, on a highly compressible representation of the dataset being
searched. Meanwhile, algorithmic development for genotype data has concentrated
on statistical methods for phasing and imputation, based on probabilistic
matching to hidden Markov model representations of the reference data, which
while powerful are much less computationally efficient. Here a theory of
haplotype matching using suffix array ideas is developed, which should scale too 
much larger datasets than those currently handled by genotype algorithms.
RESULTS: Given M sequences with N bi-allelic variable sites, an O(NM) algorithm
to derive a representation of the data based on positional prefix arrays is
given, which is termed the positional Burrows-Wheeler transform (PBWT). On large 
datasets this compresses with run-length encoding by more than a factor of a
hundred smaller than using gzip on the raw data. Using this representation a
method is given to find all maximal haplotype matches within the set in O(NM)
time rather than O(NM(2)) as expected from naive pairwise comparison, and also a 
fast algorithm, empirically independent of M given sufficient memory for indexes,
to find maximal matches between a new sequence and the set. The discussion
includes some proposals about how these approaches could be used for imputation
and phasing.