Exercise 1: Assembling genomes from short and long reads

In this exercise, we will attempt to assemble 500 Kbp section of the E. coli genome. We will use both short reads from Illumina and long reads from Oxford Nanopore MinION and we will compare the resulting assemblies.

Task A: Assembling short reads

We will use two files coming from Illumina MiSeq platform (source ftp://ussd-ftp.illumina.com/Data/SequencingRuns/MG1655/).

(Compared to the original data set, this data was filtered to only include reads from the 500 Kbp region of our interest.)

To preserve the original files, you can create a directory "work", link the data files in that directory. Links are pointers to the original files and they do not take additional space (compared to copies). You can use the following commands:

	    mkdir work
	    cd work
	    ln -s ../data/miseq_R1.fastq.gz
	    ln -s ../data/miseq_R2.fastq.gz

Step 1: Examine the files

Both files are compressed FASTQ files. To view the contents of the file, you can use the command zless, e.g.:

zless miseq_R1.fastq.gz

Note the structure of the file, which is split into 4-line blocks, each block representing a single read; lines in the block represent 1. read metadata, 2. read sequence, 3. separator + sign, and 4. read quality string.

The two files are the result of pair-end sequencing, 'R1' contains 5'-end reads and 'R2' contains 3'-end reads. Question:

Step 2: Running SPADES assembler

If you did not install SPADES assembler yet, install it now.

Running spades.py command will show you the command line parameters. You can use the following parameters:

	  spades.py -t 2 -m 2 --pe1-1 miseq_R1.fastq.gz --pe1-2 miseq_R2.fastq.gz -o short-assembly > spades.log

The command should finish in several minutes. The results will be located in the subdirectory 'short-assembly' and the spades.log file will contain a report. The final assembly is stored in contigs.fasta file.


For future work, put the final assembly into your working directory and name it spades.fasta.

Step 3: Basic assembly evaluation

If you did not install quast tool already, install it now.

QUAST is a tool for collecting basic statistic about the assembly.

	    quast.py spades.fasta

Examine the resulting files. Image included in the PDF file illustrates the concept of Nx statistics particularly well.

Task B: Detailed assembly evaluation

Now we will compare our assembly to the reference sequence and try to determine what are the reasons for its fragmentation.

Besides the files we have worked with so far, we will also use data/ref.fasta, which is the corresponding fragment of the reference assembly of E. coli genome.

The plan is as follow:

Step 1: Mapping the Illumina reads

If you did not install BWA and SAMTOOLS already, install them now.

For mapping and alignments, we will use software BWA. The software first needs to create a Burrows-Wheeler-transform based index of the reference file and then it provides several methods for rapid alignment of short and long sequences to this reference file using the index.

  1. Create index of ref.fasta:
          bwa index ref.fasta 

    The resulting index is stored in several files with names ref.fasta.* The software will later automatically detect and use these files when referring to the ref.fasta file.

  2. Align short reads to the reference. BWA provides method BWA-MEM for mapping short reads.
          bwa mem ref.fasta miseq_R1.fastq.gz miseq_R2.fastq.gz > ref-miseq.sam



  3. As you can see, SAM file is somewhat large and difficult to view without using additional tools. To prepare it for viewing, we will use SAMTOOLS to compress the file, sort the alignment along the reference, and create an index to the file in order for a viewer to efficiently access the relevant portions of the file.
    		    # convert result from sam to bam file format and sort it by coordinate
    		    samtools view -S -b ref-miseq.sam | samtools sort - ref-miseq
    		    # create index for the bam file
    		    samtools index ref-miseq.bam

Step 2: Alignment of SPADES assembly to the reference

You can similarly align the SPADES assembly to the reference. For longer sequences, BWA provides method BWA-SW method: instead of bwa mem, use bwa bwasw. Since you have already created the index, you can skip this step. Name the final BAM file ref-spades.bam.

Step 3: Viewing the alignment results

If you did not install IGV already, install it now.

We will use IGV browser to view the results.

  1. Start IGV browser by using igv command
  2. First, you need to select the correct reference file by using menu Genomes -> Load Genome from File (use ref.fasta)
  3. Then you can load both alignments (BAM files) as tracks through the menu File -> Load from File
  4. Zoom in to view the alignments


Task C: Long-read assembly

We will try to assemble the same region of E.coli genome from long reads from Oxford Nanopore MinION sequencer. Data come from a public data set produced by Nick Loman and it was filtered to only contain reads overlapping our region:

Step 1: Assemble reads using CANU assembler

This step is memory and time intensive; your mileage may vary depending on the parameters of your computer. If you want to proceed with this step anyways and did not install CANU assembler yet, you should do it now. Otherwise, you may skip this step and use the replacement result file data/canu.fasta.

		canu -d long-reads -p canu -nanopore-raw nanopore.fasta.gz genomeSize=0.5M useGrid=false maxThreads=2 maxMemory=8

Copy the resulting file to file canu.fasta.


Step 2: Compare the assembly with the reference

Optional questions: BWA with default setting is clearly not the ideal tool for working with reads and sequences with high error rate. To mitigate this problem, you can explore the BWA parameters suggested for long reads, alternative aligners such as LAST, or other programs for comparison of sequence assemblies such as nucmer / mummerplot tool from MUMMER package.

Step 3: "Polish" the assembly

If you did not yet install PILON software, you can install it now.

One of the problems of CANU assembly is its high error rate (since it is essentially a concatenation of segments from the original high-error reads. One way of solving this problem is to align high-quality short reads (Illumina) to the assembly and then use them to correct the sequencing errors.