Exercise 1: Assembling genomes from short and long reads

In this exercise, we will attempt to assemble 500 Kbp section of the E. coli genome. We will use both short reads from Illumina and long reads from Oxford Nanopore MinION and we will compare the resulting assemblies.

Task A: Assembling short reads

We will use two files coming from Illumina MiSeq platform (source ftp://ussd-ftp.illumina.com/Data/SequencingRuns/MG1655/).

(Compared to the original data set, this data was filtered to only include reads from the 500 Kbp region of our interest.)

To preserve the original files, you can create a directory "work", link the data files in that directory. Links are pointers to the original files and they do not take additional space (compared to copies). You can use the following commands:

	    mkdir work
	    cd work
	    ln -s ../data/miseq_R1.fastq.gz
	    ln -s ../data/miseq_R2.fastq.gz

Step 1: Examine the files

Both files are compressed FASTQ files. To view the contents of the file, you can use the command zless, e.g.:

zless miseq_R1.fastq.gz

Note the structure of the file, which is split into 4-line blocks, each block representing a single read; lines in the block represent 1. read metadata, 2. read sequence, 3. separator + sign, and 4. read quality string.

The two files are the result of pair-end sequencing, 'R1' contains 5'-end reads and 'R2' contains 3'-end reads. Question:

How many reads are in the 'R1' and 'R2' files? Are these numbers the same? Why? (you can achieve this with wc)

Step 2: Running SPADES assembler

If you did not install SPADES assembler yet, install it now.

Running spades.py command will show you the command line parameters. You can use the following parameters:

	  spades.py -t 2 -m 2 --pe1-1 miseq_R1.fastq.gz --pe1-2 miseq_R2.fastq.gz -o short-assembly > spades.log

The command should finish in several minutes. The results will be located in the subdirectory 'short-assembly' and the spades.log file will contain a report. The final assembly is stored in contigs.fasta file.

Questions:

Explain the parameters used.
How many contigs has the final assembly?
SPADES has attempted the assemblies with de Bruijn graphs with several values of k. What are the differences between these assemblies? Can you determine which was the value of k that it used for the final version of the assembly?
What is the total length of the assembly?
Can you tell which parts of the assembly correspond to unresolved repeats? (note: look at the information in fasta headers)
Did SPADES use the whole data set or only a subsample of the reads? (note: What would be the average coverage if all the reads were used?)

For future work, put the final assembly into your working directory and name it spades.fasta.

Step 3: Basic assembly evaluation

If you did not install quast tool already, install it now.

QUAST is a tool for collecting basic statistic about the assembly.

	    quast.py spades.fasta

Examine the resulting files. Image included in the PDF file illustrates the concept of Nx statistics particularly well.

Task B: Detailed assembly evaluation

Now we will compare our assembly to the reference sequence and try to determine what are the reasons for its fragmentation.

Besides the files we have worked with so far, we will also use data/ref.fasta, which is the corresponding fragment of the reference assembly of E. coli genome.

The plan is as follow:

Map the Illumina reads to the reference.
Align the SPADES assembly to the reference.
Display everything in a simple browser

Step 1: Mapping the Illumina reads

If you did not install BWA and SAMTOOLS already, install them now.

For mapping and alignments, we will use software BWA. The software first needs to create a Burrows-Wheeler-transform based index of the reference file and then it provides several methods for rapid alignment of short and long sequences to this reference file using the index.

Create index of ref.fasta:
```
      bwa index ref.fasta 
```
The resulting index is stored in several files with names ref.fasta.* The software will later automatically detect and use these files when referring to the ref.fasta file.
Align short reads to the reference. BWA provides method BWA-MEM for mapping short reads.
```
      bwa mem ref.fasta miseq_R1.fastq.gz miseq_R2.fastq.gz > ref-miseq.sam
```
Questions:
- The BWA-MEM tries to map reads in pairs. Can you identify from the progress report on screen what is the distance between the pairs of reads?
- Examine the resulting SAM file. Can you identify which reads were successfully mapped to the reference? Optional: find the description of SAM file format and try to figure out actual alignment for some of the reads.
As you can see, SAM file is somewhat large and difficult to view without using additional tools. To prepare it for viewing, we will use SAMTOOLS to compress the file, sort the alignment along the reference, and create an index to the file in order for a viewer to efficiently access the relevant portions of the file.
```
		    # convert result from sam to bam file format and sort it by coordinate
		    samtools view -S -b ref-miseq.sam | samtools sort - ref-miseq
		    # create index for the bam file
		    samtools index ref-miseq.bam
		  
```

Step 2: Alignment of SPADES assembly to the reference

You can similarly align the SPADES assembly to the reference. For longer sequences, BWA provides method BWA-SW method: instead of bwa mem, use bwa bwasw. Since you have already created the index, you can skip this step. Name the final BAM file ref-spades.bam.

Step 3: Viewing the alignment results

If you did not install IGV already, install it now.

We will use IGV browser to view the results.

Start IGV browser by using igv command
First, you need to select the correct reference file by using menu Genomes -> Load Genome from File (use ref.fasta)
Then you can load both alignments (BAM files) as tracks through the menu File -> Load from File
Zoom in to view the alignments

Questions:

Are there errors in reads? Are there errors in the assembly?
Is the whole reference covered by your assembly?
If there are gaps in the assembly, try to figure out why.
If there are overlapping contigs, try to figure our why they were not merged in the assembly.

Task C: Long-read assembly

We will try to assemble the same region of E.coli genome from long reads from Oxford Nanopore MinION sequencer. Data come from a public data set produced by Nick Loman and it was filtered to only contain reads overlapping our region:

data/nanopore.fasta.gz

Step 1: Assemble reads using CANU assembler

This step is memory and time intensive; your mileage may vary depending on the parameters of your computer. If you want to proceed with this step anyways and did not install CANU assembler yet, you should do it now. Otherwise, you may skip this step and use the replacement result file data/canu.fasta.

		canu -d long-reads -p canu -nanopore-raw nanopore.fasta.gz genomeSize=0.5M useGrid=false maxThreads=2 maxMemory=8

Copy the resulting file to file canu.fasta.

Question:

How many contigs has the assembly?

Step 2: Compare the assembly with the reference

Align both reads and the canu assembly to the reference using BWA-SW tool.
Did each read / the canu contig align as a single alignment? Why do you think this is the case?
Use IGV to display the result and attempt to compare two assemblies in the terms of continuity, coverage of the reference, and similarity to the reference.

Optional questions: BWA with default setting is clearly not the ideal tool for working with reads and sequences with high error rate. To mitigate this problem, you can explore the BWA parameters suggested for long reads, alternative aligners such as LAST, or other programs for comparison of sequence assemblies such as nucmer / mummerplot tool from MUMMER package.

Step 3: "Polish" the assembly

If you did not yet install PILON software, you can install it now.

One of the problems of CANU assembly is its high error rate (since it is essentially a concatenation of segments from the original high-error reads. One way of solving this problem is to align high-quality short reads (Illumina) to the assembly and then use them to correct the sequencing errors.

First, use BWA-MEM to align Illumina data to CANU assembly. Save the resulting file under the name canu-miseq.bam

Run PILON as follows:

  java -Xmx20G -jar /opt/broad/pilon-1.21.jar --genome canu.fasta --outdir canu-pilon  --changes --tracks --frags canu-miseq.bam > canu-miseq-pilon.log

Align the resulting assembly to the reference genome and use IGV to compare with the previous results.
Optional: you may repeat bwa and pilon several times, since after correction more short reads may aligned to the assembly.