1-DAV-202 Data Management 2023/24
Previously 2-INF-185 Data Source Integration

Materials · Introduction · Rules · Contact
· Grades from marked homeworks are on the server in file /grades/userid.txt
· Dates of project submission and oral exams:
Early: submit project May 24 9:00am, oral exams May 27 1:00pm (limit 5 students).
Otherwise submit project June 11, 9:00am, oral exams June 18 and 21 (estimated 9:00am-1:00pm, schedule will be published before exam).
Sign up for one the exam days in AIS before June 11.
Remedial exams will take place in the last week of the exam period. Beware, there will not be much time to prepare a better project. Projects should be submitted as homeworks to /submit/project.
· Cloud homework is due on May 20 9:00am.


Difference between revisions of "HWbioinf1"

From MAD
Jump to navigation Jump to search
 
(16 intermediate revisions by 2 users not shown)
Line 1: Line 1:
 
<!-- NOTEX -->
 
<!-- NOTEX -->
 
See also the [[Lbioinf1|lecture]]
 
See also the [[Lbioinf1|lecture]]
 +
<!-- /NOTEX -->
  
 
Submit the protocol and the required files to <tt>/submit/bioinf1</tt>
 
Submit the protocol and the required files to <tt>/submit/bioinf1</tt>
<!-- /NOTEX -->
+
 
 +
===Technical notes===
 +
* Task D and task E ask you to look at data visualizations.
 +
* If you are unable to open graphical applications from our server, you can download appropriate files and view them on your computer (in task D these are simply pdf files, in task E you can install IGV software on your computer or use the [https://igv.org/app/ online version]).
  
 
===Task A: examine input files===
 
===Task A: examine input files===
Line 12: Line 16:
 
cp -iv /tasks/bioinf1/* .
 
cp -iv /tasks/bioinf1/* .
 
</syntaxhighlight>
 
</syntaxhighlight>
* <tt>ref.fasta</tt> is a piece of genome from ''Escherichia coli''  
+
* File <tt>ref.fasta</tt> contains a piece of genome from ''Escherichia coli''.
* <tt>miseq_R1.fastq.gz</tt> and <tt>miseq_R2.fastq.gz</tt> are sequencing reads from Illumina MiSeq sequencer. First reads in pairs are in the R1 file, second reads in the R2 file. These reads come from the region in <tt>ref.fasta</tt>
+
* Files <tt>miseq_R1.fastq.gz</tt> and <tt>miseq_R2.fastq.gz</tt> contain sequencing reads from Illumina MiSeq sequencer. First reads in pairs are in the R1 file, second reads in the R2 file. These reads come from the region in <tt>ref.fasta</tt>.
* <tt>nanopore.fasta</tt> are nanopore sequencing reads in FASTA format (without qualities). These reads are also from the region in <tt>ref.fasta</tt>
+
* File <tt>nanopore.fasta</tt> contains nanopore sequencing reads in FASTA format (without qualities). These reads are also from the region in <tt>ref.fasta</tt>.
  
 
Try to find the answers to the following questions using command-line tools.  
 
Try to find the answers to the following questions using command-line tools.  
<!-- NOTEX -->
 
 
In your '''protocol''', note down the '''commands''' as well as the '''answers'''.
 
In your '''protocol''', note down the '''commands''' as well as the '''answers'''.
<!-- /NOTEX -->
 
  
 
(a) How many reads are in the MiSeq files? Is the number of reads the same in both files?  
 
(a) How many reads are in the MiSeq files? Is the number of reads the same in both files?  
* Try command <tt>zcat miseq_R1.fastq.gz | wc -l </tt>
+
* Try command <tt>zcat miseq_R1.fastq.gz | wc -l</tt>
 
* Can you figure out the answer from the result of this command?
 
* Can you figure out the answer from the result of this command?
  
Line 28: Line 30:
 
* Look at the file using <tt>zless</tt> - do all reads appear to be of an equal length?
 
* Look at the file using <tt>zless</tt> - do all reads appear to be of an equal length?
 
* Extend the following command with <tt>tail</tt> and <tt>wc -c</tt> to get the length of the first read: <tt>zcat miseq_R1.fastq.gz | head -n 2</tt>
 
* Extend the following command with <tt>tail</tt> and <tt>wc -c</tt> to get the length of the first read: <tt>zcat miseq_R1.fastq.gz | head -n 2</tt>
* Do not forget to consider the end of the line character
+
* Do not forget to consider the end of the line character.
* Repeat for both MiSeq files
+
* Repeat for both MiSeq files.
  
 
(c) How many reads are in the nanopore file (beware - different format)
 
(c) How many reads are in the nanopore file (beware - different format)
Line 35: Line 37:
 
(d) What is the average length of the reads in the nanopore file?
 
(d) What is the average length of the reads in the nanopore file?
 
* Try command: <tt>samtools faidx nanopore.fasta</tt>
 
* Try command: <tt>samtools faidx nanopore.fasta</tt>
* This creates <tt>nanopore.fasta.fai</tt> file, where the second column contains sequence length of each read
+
* This creates <tt>nanopore.fasta.fai</tt> file, where the second column contains sequence length of each read.
 
* Compute the average of this column by a one-liner: <tt>perl -lane '$s+=$F[1]; $n++; END { print $s/$n }' nanopore.fasta.fai</tt>
 
* Compute the average of this column by a one-liner: <tt>perl -lane '$s+=$F[1]; $n++; END { print $s/$n }' nanopore.fasta.fai</tt>
  
Line 41: Line 43:
  
 
===Task B: assemble the sequence from the reads===
 
===Task B: assemble the sequence from the reads===
* We will pretend that the correct answer (<tt>ref.fasta</tt>) is not known and we will try to assemble it from the reads
+
* We will pretend that the correct answer (<tt>ref.fasta</tt>) is not known and we will try to assemble it from the reads.
* We will assemble Illumina reads by program [http://cab.spbu.ru/software/spades/ SPAdes] and nanopore reads by [https://github.com/lh3/miniasm miniasm]
+
* We will assemble Illumina reads by program [http://cab.spbu.ru/software/spades/ SPAdes] and nanopore reads by [https://github.com/lh3/miniasm miniasm].
* Assembly takes several minutes, we will run it in the background using <tt>screen</tt> command
+
* Assembly takes several minutes; we will run it in the background using <tt>screen</tt> command.
  
 
SPAdes
 
SPAdes
 
* Run <tt>screen -S spades</tt>
 
* Run <tt>screen -S spades</tt>
* Press Enter to get command-line, then run the follwing command:<br>
+
* Press Enter to get the command-line, then run the following command:<br>
 
:: <tt>spades.py -t 1 -m 1 --pe1-1 miseq_R1.fastq.gz --pe1-2 miseq_R2.fastq.gz -o spades > spades.log</tt>
 
:: <tt>spades.py -t 1 -m 1 --pe1-1 miseq_R1.fastq.gz --pe1-2 miseq_R2.fastq.gz -o spades > spades.log</tt>
 
* Press <tt>Ctrl-a</tt> followed by <tt>d</tt>
 
* Press <tt>Ctrl-a</tt> followed by <tt>d</tt>
* This will take you out of <tt>screen</tt> command
+
* This will take you out of <tt>screen</tt> command ([[screen|more details in a separate section]]).
* Run <tt>top</tt> command to check that your command is running
+
* Run <tt>top</tt> command to check that your command is running.
  
 
Miniasm
 
Miniasm
Line 67: Line 69:
 
racon -t 1 -u nanopore.fasta miniasm.paf.gz miniasm.fasta > miniasm2.fasta
 
racon -t 1 -u nanopore.fasta miniasm.paf.gz miniasm.fasta > miniasm2.fasta
 
</syntaxhighlight>
 
</syntaxhighlight>
* Run <tt>screen -S miniasm</tt>
+
* Run <tt>screen -S miniasm</tt>.
 
* In screen, run <tt>source ./miniasm.sh</tt>
 
* In screen, run <tt>source ./miniasm.sh</tt>
* Press <tt>Ctrl-a d</tt> to exit <tt>screen</tt>
+
* Press <tt>Ctrl-a d</tt> to exit <tt>screen</tt>.
 
 
  
 
To check if your commands have finished:
 
To check if your commands have finished:
Line 77: Line 78:
  
 
Examine the outputs.  
 
Examine the outputs.  
<!-- NOTEX -->
 
 
Write '''commands''' and '''answers''' to your '''protocol'''.
 
Write '''commands''' and '''answers''' to your '''protocol'''.
<!-- /NOTEX -->
+
* Move the output of SPAdes to a new filename: <tt>mv -i spades/contigs.fasta spades.fasta</tt>
* Copy output of SPAdes under a new filename: <tt>cp -ip spades/contigs.fasta spades.fasta</tt>
 
 
* Output of miniasm should be in <tt>miniasm2.fasta</tt>
 
* Output of miniasm should be in <tt>miniasm2.fasta</tt>
  
 
(a) How many contigs are in each of these two files?
 
(a) How many contigs are in each of these two files?
  
(b) What can you find out from the names of contigs in <tt>spades.fasta</tt>? What is the length of the shortest and longest contigs? String <tt>cov</tt> in the names is abbreviation of read coverage - the average number of reads covering a position on the contig. Do the reads have similar coverage, or are there big differences?  
+
(b) What can you find out from the names of contigs in <tt>spades.fasta</tt>? What is the length of the shortest and longest contigs? String <tt>cov</tt> in the names is abbreviation of read coverage - the average number of reads covering a position in the contig. Do the reads have similar coverage, or are there big differences?  
 
* Use command <tt>grep '>' spades.fasta</tt>
 
* Use command <tt>grep '>' spades.fasta</tt>
  
(c) What are the lengths of contigs in <tt>miniasm2.fa</tt> file? (you can use <tt>LN:i:</tt> in the name of contigs)
+
(c) What are the lengths of contigs in <tt>miniasm2.fa</tt> file? (You can use <tt>LN:i:</tt> in the name of contigs.)
  
<!-- NOTEX -->
 
 
'''Submit''' files <tt>miniasm2.fasta</tt> and <tt>spades.fasta</tt>
 
'''Submit''' files <tt>miniasm2.fasta</tt> and <tt>spades.fasta</tt>
<!-- /NOTEX -->
 
  
 
===Task C: compare assemblies using Quast command===
 
===Task C: compare assemblies using Quast command===
Line 101: Line 98:
 
</syntaxhighlight>
 
</syntaxhighlight>
  
<!-- NOTEX -->
 
 
'''Submit''' file <tt>stats/report.txt</tt>.
 
'''Submit''' file <tt>stats/report.txt</tt>.
<!-- /NOTEX -->
 
  
 
Look at the results in <tt>stats/report.txt</tt> and '''answer''' the following questions.
 
Look at the results in <tt>stats/report.txt</tt> and '''answer''' the following questions.
  
(a) How many contigs quast reported in the two assemblies? Does it agree with your counts in part B?
+
(a) How many contigs has quast reported in the two assemblies? Does it agree with your counts in Task B?
  
 
(b) What is the number of mismatches per 100kb in the two assemblies? Which one is better? Why do you think it is so? (look at the properties of used sequencing technologies in the [[Lbioinf1#Overview_of_DNA_sequencing_and_assembly|lecture]])
 
(b) What is the number of mismatches per 100kb in the two assemblies? Which one is better? Why do you think it is so? (look at the properties of used sequencing technologies in the [[Lbioinf1#Overview_of_DNA_sequencing_and_assembly|lecture]])
  
(c) What portion of the reference sequence is covered by the two assemblies (reported as genome fraction)? Which assembly is better in this aspect?
+
(c) What portion of the reference sequence is covered by the two assemblies (reported as <tt>genome fraction</tt>)? Which assembly is better in this aspect?
  
 
(d) What is the length of the longest alignment between contigs and the reference in the two assemblies? Which assembly is better in this aspect?
 
(d) What is the length of the longest alignment between contigs and the reference in the two assemblies? Which assembly is better in this aspect?
Line 118: Line 113:
  
 
We will now visualize alignments between each assembly and the reference genome using dotplots.
 
We will now visualize alignments between each assembly and the reference genome using dotplots.
<!-- NOTEX -->
 
 
As in other tasks, write '''commands''' and '''answers''' to your '''protocol'''.
 
As in other tasks, write '''commands''' and '''answers''' to your '''protocol'''.
<!-- /NOTEX -->
 
  
 
(a) Create a dotplot comparing miniasm assembly to the reference sequence
 
(a) Create a dotplot comparing miniasm assembly to the reference sequence
Line 127: Line 120:
 
minimap2 -x asm10 -t 1 ref.fasta miniasm2.fasta > ref-miniasm2.paf
 
minimap2 -x asm10 -t 1 ref.fasta miniasm2.fasta > ref-miniasm2.paf
 
# creating dotplot
 
# creating dotplot
/usr/local/share/miniasm/miniasm/minidot -f 12 ref-miniasm2.paf | ps2pdf -dEPSCrop - ref-miniasm2.pdf
+
/usr/local/share/miniasm/miniasm/minidot -f 12 ref-miniasm2.paf | \
# displaying dotplot - if this does not work, copy the pdf file to your commputer and view there
+
  ps2pdf -dEPSCrop - ref-miniasm2.pdf
 +
# displaying dotplot
 +
# if evince does not work, copy the pdf file to your commputer and view there
 
evince ref-miniasm2.pdf &
 
evince ref-miniasm2.pdf &
 
</syntaxhighlight>
 
</syntaxhighlight>
* x-axis is reference, y-axis assembly
+
* x-axis is reference, y-axis assembly.
 
* Which part of the reference is missing in the assembly?
 
* Which part of the reference is missing in the assembly?
 
* Do you see any other big differences between the assembly and the reference?
 
* Do you see any other big differences between the assembly and the reference?
  
(b) Use analogous commands to create a dotplot for spades assembly, call it <tt>ref-spades.pdf</tt>  
+
(b) Use analogous commands to create a dotplot for spades assembly, call it <tt>ref-spades.pdf</tt>.
 
* What are vertical gray lines in the dotplot?
 
* What are vertical gray lines in the dotplot?
 
* Is any contig aligning to multiple places in the reference? To how many places?
 
* Is any contig aligning to multiple places in the reference? To how many places?
  
(c) Use analogous commands to create a dotplot of reference to itself, call it <tt>ref-ref.pdf</tt>
+
(c) Use analogous commands to create a dotplot of reference to itself, call it <tt>ref-ref.pdf</tt>.
* However, in the minimap2 command add option <tt>-p 0</tt> to include also weaker self-alignments
+
* However, in the minimap2 command add option <tt>-p 0</tt> to include also weaker self-alignments.
 
* Do you see any self-alignments, showing repeated sequences in the reference? Does this agree with the dotplot in part (b)?
 
* Do you see any self-alignments, showing repeated sequences in the reference? Does this agree with the dotplot in part (b)?
  
<!-- NOTEX -->
 
 
'''Submit'''  all three pdf files <tt>ref-miniasm2.pdf</tt>, <tt>ref-spades.pdf</tt>, <tt>ref-ref.pdf</tt>
 
'''Submit'''  all three pdf files <tt>ref-miniasm2.pdf</tt>, <tt>ref-spades.pdf</tt>, <tt>ref-ref.pdf</tt>
<!-- /NOTEX -->
 
  
 
===Task E: Align reads and assemblies to reference, visualize in IGV===
 
===Task E: Align reads and assemblies to reference, visualize in IGV===
  
Finally, we will align all source reads as well as assemblies to the reference genome, then visualize the alignments in [https://software.broadinstitute.org/software/igv/ IGV tool].
+
Finally, we will align all source reads as well as assemblies to the reference genome, then visualize the alignments in [https://software.broadinstitute.org/software/igv/ IGV tool] or its [https://igv.org/app/ online version].
  
 
<!-- NOTEX -->
 
<!-- NOTEX -->
 +
A short video introducing IGV: [https://youtu.be/46HhBqGGPU0]
 +
<!-- /NOTEX -->
 +
 
* Write '''commands''' and '''answers''' to your '''protocol'''
 
* Write '''commands''' and '''answers''' to your '''protocol'''
 
* '''Submit''' all four BAM files <tt>ref-miseq.bam</tt>, <tt>ref-nanopore.bam</tt>, <tt>ref-spades.bam</tt>, <tt>ref-miniasm2.bam</tt>
 
* '''Submit''' all four BAM files <tt>ref-miseq.bam</tt>, <tt>ref-nanopore.bam</tt>, <tt>ref-spades.bam</tt>, <tt>ref-miniasm2.bam</tt>
<!-- /NOTEX -->
 
  
 
(a) Align illumina reads (MiSeq files) to the reference sequence
 
(a) Align illumina reads (MiSeq files) to the reference sequence
 
<syntaxhighlight lang="bash">
 
<syntaxhighlight lang="bash">
 
# align illumina reads to reference
 
# align illumina reads to reference
# minimap produces SAM file, samtools view converts to BAM, samtools sort orders by coordinate
+
# minimap produces SAM file, samtools view converts to BAM,  
minimap2 -a -x sr --secondary=no -t 1 ref.fasta  miseq_R1.fastq.gz miseq_R2.fastq.gz | samtools view -S -b - |  samtools sort - ref-miseq
+
# samtools sort orders reads by coordinate
 +
minimap2 -a -x sr --secondary=no -t 1 ref.fasta  miseq_R1.fastq.gz miseq_R2.fastq.gz | \
 +
  samtools view -S -b -o - |  samtools sort - -o ref-miseq.bam
 
# index BAM file for faster access
 
# index BAM file for faster access
 
samtools index ref-miseq.bam
 
samtools index ref-miseq.bam
Line 172: Line 169:
  
 
(e) Run the IGV viewer.  
 
(e) Run the IGV viewer.  
<!-- NOTEX -->
+
'''Beware: It needs a lot of memory, do not keep open on the server unnecessarily'''.
'''Beware: It needs a lot of memory, do not keep open unnecessarily'''
 
<!-- /NOTEX -->
 
 
* <tt>igv -g ref.fasta &</tt>
 
* <tt>igv -g ref.fasta &</tt>
* Using <tt>Menu->File->Load from File</tt>, open all four BAM files
+
* Using <tt>Menu->File->Load from File</tt>, open all four BAM files.
* Look at region <tt>ecoli-frag:224,000-244,000</tt>
+
* Look at region <tt>ecoli-frag:224,000-244,000</tt>.
 
* How many spades contigs do you see aligning in this region?
 
* How many spades contigs do you see aligning in this region?
* Look at region <tt>ecoli-frag:227,300-227,600</tt>
+
* Look at region <tt>ecoli-frag:227,300-227,600</tt>.
 
* Comment on what you see. How frequent are errors in the individual assemblies and read sets?
 
* Comment on what you see. How frequent are errors in the individual assemblies and read sets?
<!-- NOTEX -->
+
* If you are unable to run igv from home, you can install it on your computer [https://software.broadinstitute.org/software/igv/] and download <tt>ref.fasta</tt> and all <tt>.bam</tt> and <tt>.bam.bai</tt> files.
* If you are unable to run igv from home, you can install it on your computer [https://software.broadinstitute.org/software/igv/] and download <tt>ref.fasta</tt> and all <tt>.bam</tt> and <tt>.bam.bai</tt> files
 
<!-- /NOTEX -->
 

Latest revision as of 07:55, 21 March 2024

See also the lecture

Submit the protocol and the required files to /submit/bioinf1

Technical notes

  • Task D and task E ask you to look at data visualizations.
  • If you are unable to open graphical applications from our server, you can download appropriate files and view them on your computer (in task D these are simply pdf files, in task E you can install IGV software on your computer or use the online version).

Task A: examine input files

Copy files from /tasks/bioinf1/ as follows:

mkdir bioinf1
cd bioinf1
cp -iv /tasks/bioinf1/* .
  • File ref.fasta contains a piece of genome from Escherichia coli.
  • Files miseq_R1.fastq.gz and miseq_R2.fastq.gz contain sequencing reads from Illumina MiSeq sequencer. First reads in pairs are in the R1 file, second reads in the R2 file. These reads come from the region in ref.fasta.
  • File nanopore.fasta contains nanopore sequencing reads in FASTA format (without qualities). These reads are also from the region in ref.fasta.

Try to find the answers to the following questions using command-line tools. In your protocol, note down the commands as well as the answers.

(a) How many reads are in the MiSeq files? Is the number of reads the same in both files?

  • Try command zcat miseq_R1.fastq.gz | wc -l
  • Can you figure out the answer from the result of this command?

(b) How long are individual reads in the MiSeq files?

  • Look at the file using zless - do all reads appear to be of an equal length?
  • Extend the following command with tail and wc -c to get the length of the first read: zcat miseq_R1.fastq.gz | head -n 2
  • Do not forget to consider the end of the line character.
  • Repeat for both MiSeq files.

(c) How many reads are in the nanopore file (beware - different format)

(d) What is the average length of the reads in the nanopore file?

  • Try command: samtools faidx nanopore.fasta
  • This creates nanopore.fasta.fai file, where the second column contains sequence length of each read.
  • Compute the average of this column by a one-liner: perl -lane '$s+=$F[1]; $n++; END { print $s/$n }' nanopore.fasta.fai

(e) How long is the sequence in the ref.fasta file?

Task B: assemble the sequence from the reads

  • We will pretend that the correct answer (ref.fasta) is not known and we will try to assemble it from the reads.
  • We will assemble Illumina reads by program SPAdes and nanopore reads by miniasm.
  • Assembly takes several minutes; we will run it in the background using screen command.

SPAdes

  • Run screen -S spades
  • Press Enter to get the command-line, then run the following command:
spades.py -t 1 -m 1 --pe1-1 miseq_R1.fastq.gz --pe1-2 miseq_R2.fastq.gz -o spades > spades.log

Miniasm

  • Create file miniasm.sh containing the following commands:
# Find alignments between pairs of reads
minimap2 -x ava-ont -t 1 nanopore.fasta nanopore.fasta | gzip -1 > nanopore.paf.gz 
# Use overlaps to compute the assembled genome
miniasm -f nanopore.fasta nanopore.paf.gz > miniasm.gfa 2> miniasm.log
# Convert genome to fasta format
perl -lane 'print ">$F[1]\n$F[2]" if $F[0] eq "S"' miniasm.gfa > miniasm.fasta
# Align reads to the assembled genome
minimap2 -x map-ont --secondary=no -t 1 miniasm.fasta nanopore.fasta | gzip -1 > miniasm.paf.gz
# Polish the genome by finding consensus of aligned reads at each position
racon -t 1 -u nanopore.fasta miniasm.paf.gz miniasm.fasta > miniasm2.fasta
  • Run screen -S miniasm.
  • In screen, run source ./miniasm.sh
  • Press Ctrl-a d to exit screen.

To check if your commands have finished:

  • Re-enter the screen environment using screen -r spades or screen -r miniasm
  • If the command finished, terminate screen by pressing Ctrl-d or typing exit

Examine the outputs. Write commands and answers to your protocol.

  • Move the output of SPAdes to a new filename: mv -i spades/contigs.fasta spades.fasta
  • Output of miniasm should be in miniasm2.fasta

(a) How many contigs are in each of these two files?

(b) What can you find out from the names of contigs in spades.fasta? What is the length of the shortest and longest contigs? String cov in the names is abbreviation of read coverage - the average number of reads covering a position in the contig. Do the reads have similar coverage, or are there big differences?

  • Use command grep '>' spades.fasta

(c) What are the lengths of contigs in miniasm2.fa file? (You can use LN:i: in the name of contigs.)

Submit files miniasm2.fasta and spades.fasta

Task C: compare assemblies using Quast command

We have found basic characteristics of the two assemblies in task B. Now we will use program Quast to compare both assemblies to the correct answer in ref.fa

quast.py -R ref.fasta miniasm2.fasta spades.fasta -o stats

Submit file stats/report.txt.

Look at the results in stats/report.txt and answer the following questions.

(a) How many contigs has quast reported in the two assemblies? Does it agree with your counts in Task B?

(b) What is the number of mismatches per 100kb in the two assemblies? Which one is better? Why do you think it is so? (look at the properties of used sequencing technologies in the lecture)

(c) What portion of the reference sequence is covered by the two assemblies (reported as genome fraction)? Which assembly is better in this aspect?

(d) What is the length of the longest alignment between contigs and the reference in the two assemblies? Which assembly is better in this aspect?

Task D: create dotplots of assemblies

We will now visualize alignments between each assembly and the reference genome using dotplots. As in other tasks, write commands and answers to your protocol.

(a) Create a dotplot comparing miniasm assembly to the reference sequence

# alignments
minimap2 -x asm10 -t 1 ref.fasta miniasm2.fasta > ref-miniasm2.paf
# creating dotplot
/usr/local/share/miniasm/miniasm/minidot -f 12 ref-miniasm2.paf | \
  ps2pdf -dEPSCrop - ref-miniasm2.pdf
# displaying dotplot
# if evince does not work, copy the pdf file to your commputer and view there
evince ref-miniasm2.pdf &
  • x-axis is reference, y-axis assembly.
  • Which part of the reference is missing in the assembly?
  • Do you see any other big differences between the assembly and the reference?

(b) Use analogous commands to create a dotplot for spades assembly, call it ref-spades.pdf.

  • What are vertical gray lines in the dotplot?
  • Is any contig aligning to multiple places in the reference? To how many places?

(c) Use analogous commands to create a dotplot of reference to itself, call it ref-ref.pdf.

  • However, in the minimap2 command add option -p 0 to include also weaker self-alignments.
  • Do you see any self-alignments, showing repeated sequences in the reference? Does this agree with the dotplot in part (b)?

Submit all three pdf files ref-miniasm2.pdf, ref-spades.pdf, ref-ref.pdf

Task E: Align reads and assemblies to reference, visualize in IGV

Finally, we will align all source reads as well as assemblies to the reference genome, then visualize the alignments in IGV tool or its online version.

A short video introducing IGV: [1]

  • Write commands and answers to your protocol
  • Submit all four BAM files ref-miseq.bam, ref-nanopore.bam, ref-spades.bam, ref-miniasm2.bam

(a) Align illumina reads (MiSeq files) to the reference sequence

# align illumina reads to reference
# minimap produces SAM file, samtools view converts to BAM, 
# samtools sort orders reads by coordinate
minimap2 -a -x sr --secondary=no -t 1 ref.fasta  miseq_R1.fastq.gz miseq_R2.fastq.gz | \
  samtools view -S -b -o - |  samtools sort - -o ref-miseq.bam
# index BAM file for faster access
samtools index ref-miseq.bam

(b) Similarly align nanopore reads, but instead of -x sr use -x map-ont, call the result ref-nanopore.bam, ref-nanopore.bam.bai

(c) Similarly align spades.fasta, but instead of -x sr use -x asm10, call the result ref-spades.bam

(d) Similarly align miniasm2.fasta, but instead of -x sr use -x asm10, call the result ref-miniasm2.bam

(e) Run the IGV viewer. Beware: It needs a lot of memory, do not keep open on the server unnecessarily.

  • igv -g ref.fasta &
  • Using Menu->File->Load from File, open all four BAM files.
  • Look at region ecoli-frag:224,000-244,000.
  • How many spades contigs do you see aligning in this region?
  • Look at region ecoli-frag:227,300-227,600.
  • Comment on what you see. How frequent are errors in the individual assemblies and read sets?
  • If you are unable to run igv from home, you can install it on your computer [2] and download ref.fasta and all .bam and .bam.bai files.