2-AIN-506, 2-AIN-252: Seminar in Bioinformatics (2), (4)
Summer 2024

Can Firtina, Nika Mansouri Ghiasi, Joel Lindegger, Gagandeep Singh, Meryem Banu Cavlak, Haiyu Mao, Onur Mutlu. RawHash: enabling fast and accurate real-time analysis of raw nanopore signals for large genomes. Bioinformatics, 39(Suppl 1):i297-i307. 2023.

Download from publisher: https://doi.org/10.1093/bioinformatics/btad272


Nanopore sequencers generate electrical raw signals in real time while sequencing
long genomic strands. These raw signals can be analyzed as they are generated, 
providing an opportunity for real-time genome analysis. An important feature of 
nanopore sequencing, Read Until, can eject strands from sequencers without fully 
sequencing them, which provides opportunities to computationally reduce the 
sequencing time and cost. However, existing works utilizing Read Until either (i) 
require powerful computational resources that may not be available for portable 
sequencers or (ii) lack scalability for large genomes, rendering them inaccurate 
or ineffective. We propose RawHash, the first mechanism that can accurately and 
efficiently perform real-time analysis of nanopore raw signals for large genomes 
using a hash-based similarity search. To enable this, RawHash ensures the signals 
corresponding to the same DNA content lead to the same hash value, regardless of 
the slight variations in these signals. RawHash achieves an accurate hash-based 
similarity search via an effective quantization of the raw signals such that 
signals corresponding to the same DNA content have the same quantized value and, 
subsequently, the same hash value. We evaluate RawHash on three applications: (i) 
read mapping, (ii) relative abundance estimation, and (iii) contamination 
analysis. Our evaluations show that RawHash is the only tool that can provide 
high accuracy and high throughput for analyzing large genomes in real time. When
compared to the state-of-the-art techniques, UNCALLED and Sigmap, RawHash 
provides (i) 25.8x and 3.4x better average throughput and (ii) significantly 
better accuracy for large genomes, respectively. Source code is available at
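
The quantize-then-hash idea described in the abstract can be illustrated with a minimal sketch. It is not RawHash's implementation: the equal-width bucketing, the value range, the 4-bit width, the window length of 3, and the function names quantize and hash_windows are all assumptions chosen for illustration. The input is assumed to be a list of already normalized signal values, one per detected event.

```python
from typing import List


def quantize(events: List[float], lo: float = -3.0, hi: float = 3.0,
             bits: int = 4) -> List[int]:
    """Map each normalized value to one of 2**bits equal-width buckets.

    The range [lo, hi] and the bit width are illustrative assumptions,
    not parameters taken from the paper.
    """
    levels = 1 << bits
    width = (hi - lo) / levels
    quantized = []
    for value in events:
        clipped = min(max(value, lo), hi)        # clamp outliers into range
        bucket = int((clipped - lo) / width)     # index of the bucket
        quantized.append(min(bucket, levels - 1))  # value == hi goes to last bucket
    return quantized


def hash_windows(quantized: List[int], k: int = 3, bits: int = 4) -> List[int]:
    """Pack every k consecutive quantized values into one integer key."""
    keys = []
    for i in range(len(quantized) - k + 1):
        key = 0
        for q in quantized[i:i + k]:
            key = (key << bits) | q              # append the next bucket index
        keys.append(key)
    return keys


# Two hypothetical noisy readings of the same underlying content.
signal_a = [-1.20, 0.35, 0.95, -0.40, 0.90]
signal_b = [-1.17, 0.31, 0.98, -0.43, 0.93]

keys_a = hash_windows(quantize(signal_a))
keys_b = hash_windows(quantize(signal_b))
print(keys_a == keys_b)
```

Because both readings fall into the same buckets, they produce identical keys and could be looked up in the same hash table entries. Values that sit close to a bucket boundary can still land in different buckets, which is why the choice of quantization matters for accuracy.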