2-AIN-505, 2-AIN-251: Seminar in Bioinformatics (1), (3)
Winter 2023

Chirag Jain. Coverage-preserving sparsification of overlap graphs for long-read assembly. Bioinformatics, 39(3). 2023.

Download from publisher: https://doi.org/10.1093/bioinformatics/btad124


MOTIVATION: Read-overlap-based graph data structures play a central role in 
computing de novo genome assembly. Most long-read assemblers use Myers's string 
graph model to sparsify overlap graphs. Graph sparsification improves assembly 
contiguity by removing spurious and redundant connections. However, a graph model 
must be coverage-preserving, i.e. it must ensure that there exist walks in the 
graph that spell all chromosomes, given sufficient sequencing coverage. This 
property becomes even more important for diploid genomes, polyploid genomes, and 
metagenomes where there is a risk of losing haplotype-specific information. 
RESULTS: We develop a novel theoretical framework under which the 
coverage-preserving properties of a graph model can be analyzed. We first prove 
that de Bruijn graph and overlap graph models are guaranteed to be 
coverage-preserving. We next show that the standard string graph model lacks this 
guarantee. The latter result is consistent with prior work suggesting that 
removal of contained reads, i.e. the reads that are substrings of other reads, 
can lead to coverage gaps during string graph construction. Our experiments 
using simulated long reads from the HG002 human diploid genome show that, on 
average, ignoring contained reads from nanopore datasets introduces 50 coverage 
gaps. To remedy this, we propose practical heuristics that are well supported by 
our theoretical results and help decide which contained reads should be retained 
to avoid coverage gaps. Our method retains a small fraction of contained reads 
(1-2%) and closes the majority of the coverage gaps. AVAILABILITY AND 
IMPLEMENTATION: Source code is available through GitHub 
(https://github.com/at-cg/ContainX) and Zenodo with doi: 10.5281/zenodo.7687543.
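
To make the contained-read problem concrete, here is a minimal toy sketch in Python. It is not the paper's framework and not the ContainX heuristic. All names (Read, contained_reads, coverage_gaps) and the two 30-base haplotypes are hypothetical. It also assumes a simplified notion of a coverage gap: a haplotype position that is no longer covered by any retained read sampled from that haplotype. The paper defines the property in terms of walks in the graph instead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Read:
    name: str
    hap: str    # haplotype the read was sampled from (known only in simulation)
    start: int  # half-open interval [start, end) on that haplotype
    end: int
    seq: str


def make_read(name, hap, haps, start, end):
    return Read(name, hap, start, end, haps[hap][start:end])


def contained_reads(reads):
    """Names of reads that are substrings of another read.

    Ties between identical reads are broken by input order, so one copy survives.
    """
    contained = set()
    for i, r in enumerate(reads):
        for j, other in enumerate(reads):
            if i == j or r.seq not in other.seq:
                continue
            if len(other.seq) > len(r.seq) or j < i:
                contained.add(r.name)
                break
    return contained


def coverage_gaps(reads, haps):
    """Per haplotype, positions not covered by any read sampled from it."""
    gaps = {}
    for hap, seq in haps.items():
        covered = [False] * len(seq)
        for r in reads:
            if r.hap == hap:
                for pos in range(r.start, r.end):
                    covered[pos] = True
        gaps[hap] = [pos for pos, ok in enumerate(covered) if not ok]
    return gaps


# Two toy haplotypes that differ only at positions 5 and 24.
hap_a = "ACGTTGCAAGGCTTACCGATGGATCCTAGA"
hap_b = hap_a[:5] + "T" + hap_a[6:24] + "G" + hap_a[25:]
haps = {"A": hap_a, "B": hap_b}

reads = [
    make_read("A1", "A", haps, 0, 30),   # long read spanning both variants
    make_read("B1", "B", haps, 0, 14),   # covers the variant at position 5
    make_read("B2", "B", haps, 10, 20),  # lies between the variants
    make_read("B3", "B", haps, 16, 30),  # covers the variant at position 24
]

dropped = contained_reads(reads)
kept = [r for r in reads if r.name not in dropped]
gaps_before = coverage_gaps(reads, haps)
gaps_after = coverage_gaps(kept, haps)

print("contained:", sorted(dropped))
print("gaps before:", gaps_before)
print("gaps after: ", gaps_after)
```

In this toy input, B2 lies entirely between the two heterozygous sites, so its sequence is identical on both haplotypes and it is a substring of the longer read A1. Discarding it leaves positions 14 and 15 of haplotype B without any haplotype-B read, and B1 and B3 no longer overlap. This is the kind of haplotype-specific loss the abstract warns about.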