In this exercise, we will attempt to assemble 500 Kbp section of the E. coli genome. We will use both short reads from Illumina and long reads from Oxford Nanopore MinION and we will compare the resulting assemblies.
We will use two files coming from Illumina MiSeq platform (source ftp://ussd-ftp.illumina.com/Data/SequencingRuns/MG1655/).
(Compared to the original data set, this data was filtered to only include reads from the 500 Kbp region of our interest.)
To preserve the original files, you can create a directory "work", link the data files in that directory. Links are pointers to the original files and they do not take additional space (compared to copies). You can use the following commands:
mkdir work cd work ln -s ../data/miseq_R1.fastq.gz ln -s ../data/miseq_R2.fastq.gz
Both files are compressed FASTQ files. To view the contents of the file, you can use the command zless, e.g.:
Note the structure of the file, which is split into 4-line blocks, each block representing a single read; lines in the block represent 1. read metadata, 2. read sequence, 3. separator + sign, and 4. read quality string.
The two files are the result of pair-end sequencing, 'R1' contains 5'-end reads and 'R2' contains 3'-end reads. Question:
Running spades.py command will show you the command line parameters. You can use the following parameters:
spades.py -t 2 -m 2 --pe1-1 miseq_R1.fastq.gz --pe1-2 miseq_R2.fastq.gz -o short-assembly > spades.log
The command should finish in several minutes. The results will be located in the subdirectory 'short-assembly' and the spades.log file will contain a report. The final assembly is stored in contigs.fasta file.
For future work, put the final assembly into your working directory and name it spades.fasta.
If you did not install quast tool already, install it now.
QUAST is a tool for collecting basic statistic about the assembly.
Examine the resulting files. Image included in the PDF file illustrates the concept of Nx statistics particularly well.
Now we will compare our assembly to the reference sequence and try to determine what are the reasons for its fragmentation.
Besides the files we have worked with so far, we will also use data/ref.fasta, which is the corresponding fragment of the reference assembly of E. coli genome.
The plan is as follow:
If you did not install BWA and SAMTOOLS already, install them now.
For mapping and alignments, we will use software BWA. The software first needs to create a Burrows-Wheeler-transform based index of the reference file and then it provides several methods for rapid alignment of short and long sequences to this reference file using the index.
bwa index ref.fasta
The resulting index is stored in several files with names ref.fasta.* The software will later automatically detect and use these files when referring to the ref.fasta file.
bwa mem ref.fasta miseq_R1.fastq.gz miseq_R2.fastq.gz > ref-miseq.sam
# convert result from sam to bam file format and sort it by coordinate samtools view -S -b ref-miseq.sam | samtools sort - ref-miseq # create index for the bam file samtools index ref-miseq.bam
You can similarly align the SPADES assembly to the reference. For longer sequences, BWA provides method BWA-SW method: instead of bwa mem, use bwa bwasw. Since you have already created the index, you can skip this step. Name the final BAM file ref-spades.bam.
If you did not install IGV already, install it now.
We will use IGV browser to view the results.
We will try to assemble the same region of E.coli genome from long reads from Oxford Nanopore MinION sequencer. Data come from a public data set produced by Nick Loman and it was filtered to only contain reads overlapping our region:
This step is memory and time intensive; your mileage may vary depending on the parameters of your computer. If you want to proceed with this step anyways and did not install CANU assembler yet, you should do it now. Otherwise, you may skip this step and use the replacement result file data/canu.fasta.
canu -d long-reads -p canu -nanopore-raw nanopore.fasta.gz genomeSize=0.5M useGrid=false maxThreads=2 maxMemory=8
Copy the resulting file to file canu.fasta.
One of the problems of CANU assembly is its high error rate (since it is essentially a concatenation of segments from the original high-error reads. One way of solving this problem is to align high-quality short reads (Illumina) to the assembly and then use them to correct the sequencing errors.
java -Xmx20G -jar /opt/broad/pilon-1.21.jar --genome canu.fasta --outdir canu-pilon --changes --tracks --frags canu-miseq.bam > canu-miseq-pilon.log