Difference between revisions of "Lbioinf1"

Revision as of 14:25, 18 March 2020

The next three lectures are targeted at students in the Bioinformatics program; the goal is to gain experience with several common bioinformatics tools. Students will learn more about the algorithms and models behind these tools in the Methods in bioinformatics course.

Overview of DNA sequencing and assembly

DNA sequencing is a technology for reading the order of nucleotides along a DNA strand

The result is represented as a string of A,C,G,T
Only fragments of DNA of limited length can be read; these are called sequencing reads
Different technologies produce reads of different characteristics
Examples:
- Illumina sequencers produce short reads (typical length 100-200bp), but in great quantities and with a very low error rate (<0.1%)
- Illumina reads usually come in pairs sequenced from both ends of a DNA fragment of an approximately known length
- Oxford Nanopore sequencers produce longer reads (thousands of basepairs or more), but with higher error rates (10-15%)

The goal of genome sequencing is to read all chromosomes of an organism

Sequencing machines produce many reads coming from different parts of the genome
Using software tools called sequence assemblers, these reads are glued together based on overlaps
Ideally we would get the true chromosomes, but often we get only shorter fragments called contigs
The results of assembly can contain errors
We prefer longer contigs with fewer errors
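
The idea of gluing reads together based on overlaps can be illustrated with a minimal Python sketch. It assumes error-free reads and exact matching, and the function names `longest_overlap` and `merge_reads` are hypothetical; real assemblers tolerate sequencing errors and use far more efficient data structures.

```python
from typing import Optional


def longest_overlap(left: str, right: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `left` equal to a prefix of `right`.

    Returns 0 if no overlap of at least `min_len` bases exists.
    `min_len` must be at least 1.
    """
    max_len = min(len(left), len(right))
    for length in range(max_len, min_len - 1, -1):
        if left[-length:] == right[:length]:
            return length
    return 0


def merge_reads(left: str, right: str, min_len: int = 3) -> Optional[str]:
    """Glue two reads into one longer sequence if they overlap."""
    overlap = longest_overlap(left, right, min_len)
    if overlap == 0:
        return None
    # Keep `left` whole and append only the part of `right` past the overlap.
    return left + right[overlap:]


if __name__ == "__main__":
    # Two toy reads, made up for illustration, sharing the bases GATTC.
    print(merge_reads("ACCTGGATTC", "GATTCAGGTA"))
```

Requiring a minimum overlap length reduces the chance that two reads are joined only because they share a few bases by coincidence.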

Sequence alignments and dotplots

Sequence alignment is the task of finding similarities between DNA (or protein) sequences
Here is an example: a short similarity between the region at positions 344,447..344,517 of one sequence and positions 3,261..3,327 of another sequence

Query: 344447 tctccgacggtgatggcgttgtgcgtcctctatttcttttatttctttttgttttatttc 344506
              |||||||| |||||||||||||||||| ||||||| |||||||||||| ||   ||||||
Sbjct: 3261   tctccgacagtgatggcgttgtgcgtc-tctatttattttatttctttgtg---tatttc 3316

Query: 344507 tctgactaccg 344517
              |||||||||||
Sbjct: 3317   tctgactaccg 3327

Alignments can be stored in many formats and visualised as dotplots
In a dotplot, the x-axis corresponds to positions in one sequence and the y-axis to positions in the other sequence
Diagonal lines show alignments between the sequences (direction of the diagonal shows which DNA strand was aligned)

Dotplot of human and Drosophila mitochondrial genomes
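
To make the dotplot idea concrete, here is a small Python sketch that marks a dot wherever the two sequences share an exact substring of length k. This is a simplification for illustration only: the function names and the choice of exact k-mer matching are assumptions, and real dotplots are typically drawn from alignments produced by dedicated tools.

```python
from collections import defaultdict


def reverse_complement(seq: str) -> str:
    """Return the opposite DNA strand, read in its own 5' to 3' direction."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq))


def kmer_matches(x: str, y: str, k: int):
    """Return (i, j) pairs such that x[i:i+k] == y[j:j+k]."""
    positions = defaultdict(list)
    for j in range(len(y) - k + 1):
        positions[y[j:j + k]].append(j)
    dots = []
    for i in range(len(x) - k + 1):
        for j in positions.get(x[i:i + k], []):
            dots.append((i, j))
    return dots


def render_dotplot(x: str, y: str, k: int):
    """Text rendering: columns are positions in x, rows are positions in y."""
    dots = set(kmer_matches(x, y, k))
    rows = []
    for j in range(len(y) - k + 1):
        rows.append("".join("*" if (i, j) in dots else "."
                            for i in range(len(x) - k + 1)))
    return rows


if __name__ == "__main__":
    # Toy sequences made up for illustration.
    for row in render_dotplot("ACGTACGT", "TTACGTAA", 4):
        print(row)
```

Runs of consecutive dots along a diagonal correspond to longer shared regions. Matches to the opposite strand can be found by calling `kmer_matches(x, reverse_complement(y), k)`; the second coordinate then refers to the reverse-complemented sequence.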

File formats

FASTA

FASTA is a format for storing DNA, RNA and protein sequences
We have already seen FASTA files in Perl exercises
Each sequence is given on two or more lines of the file. The first line starts with ">" followed by an identifier of the sequence and optionally some further description separated by whitespace
The sequence itself starts on the second line; long sequences may be split across multiple lines

>SRR022868.1845_1
AAATTTAGGAAAAGATGATTTAGCAACATTTAGCCTTAATGAAAGACCAGATTCTGTTGCCATGTTTGAATGCCTTAAACCAGTAGCAGAATCAGTATAAA
>SRR022868.1846_1
TAGCGTTGTAAAATAAATTTCTAGAATGGAAGTGATGATATTGAAATACACTCAGATCCTGAATGAAAGATTTATTAAAGTTAAGACGAGAGTCTCATTAT
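
The structure described above can be read with a few lines of Python. This is a minimal sketch, not a full-featured parser; the function name `parse_fasta` is hypothetical and the records in the demonstration are made up.

```python
def parse_fasta(lines):
    """Yield (identifier, description, sequence) for each FASTA record.

    `lines` can be an open file or any iterable of strings.
    """
    identifier, description, chunks = None, "", []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if identifier is not None:
                yield identifier, description, "".join(chunks)
            header = line[1:].split(None, 1)
            identifier = header[0] if header else ""
            description = header[1] if len(header) > 1 else ""
            chunks = []
        elif line:
            # Sequence lines of one record are concatenated.
            chunks.append(line)
    if identifier is not None:
        yield identifier, description, "".join(chunks)


if __name__ == "__main__":
    toy = [">seq1 toy example\n", "ACGT\n", "TTGA\n", ">seq2\n", "GGCC\n"]
    for name, desc, seq in parse_fasta(toy):
        print(name, desc, len(seq))
```

Writing the parser as a generator means that a large file does not have to be held in memory all at once.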

FASTQ

FASTQ is a format for storing sequencing reads, containing DNA sequences but also quality information about each nucleotide
More in the lecture on Perl
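
As a brief illustration, the following Python sketch reads FASTQ records and decodes the quality string. It assumes the common layout of exactly four lines per record (header starting with "@", sequence, a "+" separator line, quality string) and the Phred+33 quality encoding; files using other conventions would need different handling. The function names are hypothetical.

```python
from itertools import islice


def parse_fastq(lines):
    """Yield (name, sequence, qualities) assuming four lines per record."""
    iterator = iter(lines)
    while True:
        record = [line.rstrip("\r\n") for line in islice(iterator, 4)]
        if len(record) < 4:
            return  # end of input (an incomplete trailing record is ignored)
        header, sequence, _separator, quality = record
        # Phred+33 (assumed): quality value = ASCII code of the character - 33
        yield header[1:], sequence, [ord(char) - 33 for char in quality]


def error_probability(quality: int) -> float:
    """Convert a Phred quality value Q to an error probability 10^(-Q/10)."""
    return 10 ** (-quality / 10)


if __name__ == "__main__":
    toy = ["@read1\n", "ACGT\n", "+\n", "II5!\n"]  # made-up record
    for name, seq, quals in parse_fastq(toy):
        print(name, seq, quals)
```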

SAM/BAM

SAM and BAM are formats for storing alignments of sequencing reads (or other sequences) to a genome
For each read, the file contains the read itself and its quality, but also the chromosome/contig name and position where the read likely comes from, and additional information, e.g. about mapping quality (confidence in the correct location)
SAM files are text-based, thus easier to check manually; BAM files are binary and compressed, thus smaller and faster to read
We can easily convert between SAM and BAM using samtools
Full documentation of the format
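
Because SAM is a tab-separated text format, individual alignment lines are easy to inspect in a script. The Python sketch below extracts a subset of the mandatory columns; the class and function names are hypothetical, the example line is made up, and for real work a dedicated library or samtools is usually a better choice.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    read_name: str
    flag: int
    reference: str        # chromosome/contig name, "*" if unmapped
    position: int         # 1-based leftmost position of the alignment
    mapping_quality: int
    cigar: str
    sequence: str
    quality: str


def parse_sam_line(line: str) -> SamRecord:
    """Parse one alignment line; SAM has 11 mandatory tab-separated columns."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 11:
        raise ValueError("expected at least 11 columns")
    return SamRecord(fields[0], int(fields[1]), fields[2], int(fields[3]),
                     int(fields[4]), fields[5], fields[9], fields[10])


def parse_sam(lines):
    """Yield records, skipping header lines (they start with '@')."""
    for line in lines:
        if line.startswith("@") or not line.strip():
            continue
        yield parse_sam_line(line)


def is_unmapped(record: SamRecord) -> bool:
    return bool(record.flag & 4)    # flag bit 0x4: read is unmapped


def is_reverse_strand(record: SamRecord) -> bool:
    return bool(record.flag & 16)   # flag bit 0x10: aligned to reverse strand


if __name__ == "__main__":
    toy = "read1\t16\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\n"
    print(parse_sam_line(toy))
```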

PAF format

PAF is another format for storing alignments; it is used by the minimap2 tool
Full documentation of the format
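
PAF is also tab-separated, with 12 mandatory columns describing one alignment per line. The Python sketch below reads these columns and computes a rough identity estimate as the number of matching bases divided by the alignment block length. The names are hypothetical and the example line is made up.

```python
from typing import NamedTuple


class PafRecord(NamedTuple):
    query_name: str
    query_length: int
    query_start: int      # 0-based start in the query
    query_end: int
    strand: str           # "+" or "-"
    target_name: str
    target_length: int
    target_start: int     # 0-based start in the target
    target_end: int
    matches: int          # number of matching bases
    block_length: int     # total alignment length including gaps
    mapping_quality: int


def parse_paf_line(line: str) -> PafRecord:
    """Parse the 12 mandatory columns; optional tags after them are ignored."""
    f = line.rstrip("\r\n").split("\t")
    if len(f) < 12:
        raise ValueError("expected at least 12 columns")
    return PafRecord(f[0], int(f[1]), int(f[2]), int(f[3]), f[4],
                     f[5], int(f[6]), int(f[7]), int(f[8]),
                     int(f[9]), int(f[10]), int(f[11]))


def identity(record: PafRecord) -> float:
    """Fraction of the alignment block made up of matching bases."""
    return record.matches / record.block_length


if __name__ == "__main__":
    toy = "read1\t1000\t10\t910\t+\tcontig1\t50000\t2000\t2900\t810\t900\t60\n"
    record = parse_paf_line(toy)
    print(record.target_name, record.strand, identity(record))
```

Each record gives the start and end coordinates in both sequences together with the strand, which is the information needed to draw one diagonal line of a dotplot.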

Gzip

Gzip is a general-purpose tool for file compression
It is often used in bioinformatics on large FASTQ or FASTA files
Running the command gzip filename.ext creates the compressed file filename.ext.gz (the original file is deleted).
The reverse process is done by gunzip filename.ext.gz
However, we can access the file without uncompressing it. The command zcat filename.ext.gz prints the content of a gzipped file and keeps the gzipped file as is. We can use a pipe to do further processing on the output.
To manually page through the content of a gzipped file use zless filename.ext.gz
Some bioinformatics tools can work directly with gzipped files
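
Scripts can also read gzipped files directly, much like zcat does on the command line. The Python sketch below uses the standard gzip module; the helper names and the file name toy.fa.gz are hypothetical and serve only as an illustration.

```python
import gzip
import os
import tempfile


def write_gzipped_text(path: str, text: str) -> None:
    """Write `text` to a gzip-compressed file (used here to create toy data)."""
    with gzip.open(path, "wt") as handle:
        handle.write(text)


def count_fasta_records(path: str) -> int:
    """Count header lines in a gzipped FASTA file without uncompressing it on disk."""
    # Mode "rt" decompresses on the fly and yields text lines.
    with gzip.open(path, "rt") as handle:
        return sum(1 for line in handle if line.startswith(">"))


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "toy.fa.gz")
    write_gzipped_text(path, ">seq1\nACGT\n>seq2\nGG\n")
    print(count_fasta_records(path))
```

The file is decompressed in a streaming fashion while it is read, so the gzipped file on disk stays unchanged and no uncompressed copy is created.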

