Bioinformatics Seminar

Wed 27 Apr. 2011, 17:20
I-9

Title: Hubisz et al. Error and error mitigation in low-coverage genome assemblies
Speaker: Maťo Kravec

The recent release of twenty-two new genome sequences has dramatically
increased the data available for mammalian comparative genomics, but
twenty of these new sequences are currently limited to approximately 2x
coverage. Here we examine the extent of sequencing error in these 2x
assemblies, and its potential impact on downstream analyses. By comparing
2x assemblies with high-quality sequences from the ENCODE regions, we
estimate the rate of sequencing error to be 1-4 errors per kilobase. While
this error rate is fairly modest, sequencing error can still have
surprising effects. For example, an apparent lineage-specific insertion in
a coding region is more likely to reflect sequencing error than a true
biological event, and the length distribution of coding indels is strongly
distorted by error. We find that most errors are contributed by a small
fraction of bases with low quality scores, in particular, by the ends of
reads in regions of single-read coverage in the assembly. We explore
several approaches for automatic sequencing error mitigation (SEM), making
use of the localized nature of sequencing error, the fact that it is well
predicted by quality scores, and information about errors that comes from
comparisons across species. Our automatic methods for error mitigation
cannot replace the need for additional sequencing, but they do allow
substantial fractions of errors to be masked or eliminated at the cost of
modest amounts of over-correction, and they can reduce the impact of error
in downstream phylogenomic analyses. Our error-mitigated alignments are
available for download.
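
The abstract notes that most errors come from a small fraction of bases
with low quality scores, and that errors can be masked at the cost of some
over-correction. The sketch below illustrates only that general idea and is
not the authors' SEM method. It assumes Phred-scaled quality scores
(Q = -10 * log10(P_error)); the function names and the threshold of 20 are
hypothetical choices made for illustration.

```python
def phred_to_error_prob(q):
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10)


def mask_low_quality(seq, quals, min_q=20):
    """Replace every base whose quality is below min_q with 'N'.

    min_q=20 is an illustrative threshold, not a value from the paper.
    """
    if len(seq) != len(quals):
        raise ValueError("sequence and quality lists must have equal length")
    return "".join(
        base if q >= min_q else "N" for base, q in zip(seq, quals)
    )


if __name__ == "__main__":
    # Toy read with two low-quality positions (made-up values).
    read = "ACGTAC"
    quals = [40, 35, 10, 30, 5, 40]
    print(mask_low_quality(read, quals))
```

Raising min_q masks more true errors but also discards more correct bases,
which is the over-correction trade-off the abstract mentions.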