1-DAV-202 Data Management 2023/24
Previously 2-INF-185 Data Source Integration

Materials · Introduction · Rules · Contact
· Grades from marked homeworks are on the server in file /grades/userid.txt
· Dates of project submission and oral exams:
Early: submit project May 24 9:00am, oral exams May 27 1:00pm (limit 5 students).
Otherwise submit project June 11, 9:00am, oral exams June 18 and 21 (estimated 9:00am-1:00pm, schedule will be published before exam).
Sign up for one the exam days in AIS before June 11.
Remedial exams will take place in the last week of the exam period. Beware, there will not be much time to prepare a better project. Projects should be submitted as homeworks to /submit/project.
· Cloud homework is due on May 20 9:00am.


Difference between revisions of "HWperl"

From MAD
Jump to navigation Jump to search
Line 55: Line 55:
 
===Task B (FASTQ to FASTA)===
 
===Task B (FASTQ to FASTA)===
  
 +
* In the rest of the assignment, we will write several scripts for working with [[Lperl#The_second_input_file_for_today:_DNA_sequencing_reads_.28fastq.29|FASTQ files]] introduced in the lecture. Similar scripts are often used by bioinformaticians working with DNA sequencing data.
 
* Write a script which reformats FASTQ file to FASTA format, call it <tt>fastq2fasta.pl</tt>
 
* Write a script which reformats FASTQ file to FASTA format, call it <tt>fastq2fasta.pl</tt>
** [[Lperl#The_second_input_file_for_today:_DNA_sequencing_reads_.28fastq.29|FASTQ file]] should be on standard input, FASTA file written to standard output
+
** FASTQ file should be on standard input, FASTA file written to standard output
 
* [https://en.wikipedia.org/wiki/FASTA_format FASTA format] is a typical format for storing DNA and protein sequences.  
 
* [https://en.wikipedia.org/wiki/FASTA_format FASTA format] is a typical format for storing DNA and protein sequences.  
 
** Each sequence consists of several lines of the file. The first line starts with ">" followed by identifier of the sequence and optionally some further description separated by whitespace
 
** Each sequence consists of several lines of the file. The first line starts with ">" followed by identifier of the sequence and optionally some further description separated by whitespace
Line 79: Line 80:
 
>SRR022868.1846_1
 
>SRR022868.1846_1
 
TAGCGTTGTAAAATAAATTTCTAGAATGGAAGTGATGATATTGAAATACAC...
 
TAGCGTTGTAAAATAAATTTCTAGAATGGAAGTGATGATATTGAAATACAC...
 +
</pre>
 +
* You can start by modifying our script <tt>fastq-lengths.pl</tt>, which prints the length of each read in the FASTQ input file.
 +
<pre>
 +
# copy our script to your folder under the new name
 +
cp -i /tasks/perl/fastq-lengths.pl fastq2fasta.pl
 +
# open in editor in bacground
 +
kate fastq2fasta.pl &
 
</pre>
 
</pre>
 
* Run your script on the small read file <tt>./fastq2fasta.pl < reads-small.fastq > reads-small.fasta</tt>
 
* Run your script on the small read file <tt>./fastq2fasta.pl < reads-small.fastq > reads-small.fasta</tt>

Revision as of 21:58, 10 February 2023

Materials: the lecture, Connecting from Windows

Files and setup

We recommend creating a directory (folder) for this set of tasks:

mkdir perl  # make directory
cd perl     # change to the new directory

We have 4 input files for this task set. We recommend creating soft links to your working directory as follows:

ln -s /tasks/perl/series-small.tsv .   # small version of the series file
ln -s /tasks/perl/series.tsv .         # full version of the series file
ln -s /tasks/perl/reads-small.fastq .  # smaller version of the read file
ln -s /tasks/perl/reads.fastq .        # bigger version of the read file

We recommend writing your protocol starting from an outline provided in /tasks/perl/protocol.txt. Make your own copy of the protocol and open it in an editor, e.g. kate:

cp -ip /tasks/perl/protocol.txt .  # copy protocol
kate protocol.txt &                # open editor, run in the background

Submitting

  • Directory /submit/perl/your_username will be created for you
  • Copy required files to this directory, including the protocol named protocol.txt
  • You can modify these files freely until deadline, but after the deadline of the homework, you will lose access rights to this directory

Task A (series)

Consider the program for counting series in the lecture 1, save it to file series-stat.pl

  • Open editor running in the background: kate series-stat.pl &
  • Copy and paste text to the editor, save it
  • Make the script executable: chmod a+x series-stat.pl

Extend the script to compute the average rating of each series (averaging over all episodes in the series)

  • Each row of the input table contains rating in column 5.
  • Output a table with three columns: name of series, the number of episodes, the average rating.
  • Use printf to print these three items right-justified in columns of a sufficient width; print the average rating to 1 decimal place.
  • If you run your script on the small file, the output should look something like this (exact column widths may differ):
./series-stat.pl < series-small.tsv
        Black Mirror        3        8.2
     Game of Thrones        3        8.8
  • Run your script also on the large file: ./series-stat.pl < series.tsv
    • Include the output in your protocol
  • Submit only your script, series-stat.pl

Task B (FASTQ to FASTA)

  • In the rest of the assignment, we will write several scripts for working with FASTQ files introduced in the lecture. Similar scripts are often used by bioinformaticians working with DNA sequencing data.
  • Write a script which reformats FASTQ file to FASTA format, call it fastq2fasta.pl
    • FASTQ file should be on standard input, FASTA file written to standard output
  • FASTA format is a typical format for storing DNA and protein sequences.
    • Each sequence consists of several lines of the file. The first line starts with ">" followed by identifier of the sequence and optionally some further description separated by whitespace
    • The sequence itself is on the second line, long sequences can be split into multiple lines
  • In our case, the name of the sequence will be the ID of the read with @ replaced by > and / replaced by underscore (_)
  • For example, the first two reads of the file reads.fastq are as follows (only the first 50 columns shown)
@SRR022868.1845/1
AAATTTAGGAAAAGATGATTTAGCAACATTTAGCCTTAATGAAAGACCAG...
+
IICIIIIIIIIIID%IIII8>I8III1II,II)I+III*II<II,E;-HI...
@SRR022868.1846/1
TAGCGTTGTAAAATAAATTTCTAGAATGGAAGTGATGATATTGAAATACA...
+
4CIIIIIIII52I)IIIII0I16IIIII2IIII;IIAII&I6AI+*+&G5...
  • These should be reformatted as follows (again only first 50 columns shown, but you include entire reads):
>SRR022868.1845_1
AAATTTAGGAAAAGATGATTTAGCAACATTTAGCCTTAATGAAAGACCAGA...
>SRR022868.1846_1
TAGCGTTGTAAAATAAATTTCTAGAATGGAAGTGATGATATTGAAATACAC...
  • You can start by modifying our script fastq-lengths.pl, which prints the length of each read in the FASTQ input file.
# copy our script to your folder under the new name
cp -i /tasks/perl/fastq-lengths.pl fastq2fasta.pl
# open in editor in bacground
kate fastq2fasta.pl & 
  • Run your script on the small read file ./fastq2fasta.pl < reads-small.fastq > reads-small.fasta
  • Submit files fastq2fasta.pl and reads-small.fasta

Task C (FASTQ quality)

Write a script fastq-quality.pl which for each position in a read computes the average quality

  • Standard input has fastq file with multiple reads, possibly of different lengths
  • As quality we will use ASCII values of characters in the quality string with value 33 subtracted
    • For example character 'A' (ASCII 65) means quality 32.
    • ASCII value can be computed by function ord
  • Positions in reads will be numbered from 0
  • Since reads can differ in length, some positions are used in more reads, some in fewer
  • For each position from 0 up to the highest position used in some read, print three numbers separated by tabs "\t": the position index, the number of times this position was used in reads, the average quality at that position with 1 decimal place (you can again use printf)
  • The last two lines when you run ./fastq-quality.pl < reads-small.fastq should be
99      86      5.5
100     86      8.6

Run the following command, which runs your script on the larger file and selects every 10th position.

./fastq-quality.pl < reads.fastq | perl -lane 'print if $F[0]%10==0'
  • What trends (if any) do you see in quality values with increasing position?
  • Submit only fastq-quality.pl
  • In your protocol, include the output of the command and the answer to the question above.

Task D (FASTQ trim)

Write script fastq-trim.pl that trims low quality bases from the end of each read and filters out short reads

  • This script should read a fastq file from standard input and write trimmed fastq file to standard output
  • It should also accept two command-line arguments: character Q and integer L
    • We have not covered processing command line arguments, but you can use the code snippet below
  • Q is the minimum acceptable quality (characters from quality string with ASCII value >= ASCII value of Q are ok)
  • L is the minimum acceptable length of a read
  • First find the last base in a read which has quality at least Q (if any). All bases after this base will be removed from both the sequence and quality string
  • If the resulting read has fewer than L bases, it is omitted from the output

You can check your program by the following tests:

  • If you run the following two commands, you should get file tmp identical with input and thus output of the diff command should be empty
./fastq-trim.pl '!' 101 < reads-small.fastq > tmp  # trim at quality ASCII >=33 and length >=101
diff reads-small.fastq tmp                         # output should be empty (no differences)
  • If you run the following two commands, you should see differences in 4 reads, 2 bases trimmed from each
./fastq-trim.pl '"' 1 < reads-small.fastq > tmp   # trim at quality ASCII >=34 and length >=1
diff reads-small.fastq tmp                        # output should be differences in 4 reads
  • If you run the following commands, you should get empty output (no reads meet the criteria):
./fastq-trim.pl d 1 < reads-small.fastq           # quality ASCII >=100, length >= 1
./fastq-trim.pl '!' 102 < reads-small.fastq       # quality ASCII >=33 and length >=102

Further runs and submitting

  • ./fastq-trim.pl '(' 95 < reads-small.fastq > reads-small-filtered.fastq # quality ASCII >= 40
  • Submit files fastq-trim.pl and reads-small-filtered.fastq
  • If you have done task C, run quality statistics on the trimmed version of the bigger file using command below. Comment on the differences between statistics on the whole file in part C and D. Are they as you expected?
# "2" means quality ASCII >= 50
./fastq-trim.pl 2 50 < reads.fastq | ./fastq-quality.pl | perl -lane 'print if $F[0]%10==0'
  • In your protocol, include the result of the command and your discussion of its results.

Note: in this task set, you have created tools which can be combined, e.g. you can first trim FASTQ and then convert it to FASTA (no need to submit these files).

Parsing command-line arguments in this task (they will be stored in variables $Q and $L):

#!/usr/bin/perl -w
use strict;

my $USAGE = "
Usage:
$0 Q L < input.fastq > output.fastq

Trim from the end of each read bases with ASCII quality value less
than the given threshold Q. If the length of the read after trimming
is less than L, the read will be omitted from output.

L is a non-negative integer, Q is a character
";

# check that we have exactly 2 command-line arguments
die $USAGE unless @ARGV==2;
# copy command-line arguments to variables Q and L
my ($Q, $L) = @ARGV;
# check that $Q is one character and $L looks like a non-negative integer
die $USAGE unless length($Q)==1 && $L=~/^[0-9]+$/;