1-DAV-202 Data Management 2023/24
Previously 2-INF-185 Data Source Integration

· Grades from marked homeworks are on the server in file /grades/userid.txt
· Dates of project submission and oral exams:
Early: submit project May 24 9:00am, oral exams May 27 1:00pm (limit 5 students).
Otherwise submit project June 11, 9:00am, oral exams June 18 and 21 (estimated 9:00am-1:00pm, schedule will be published before exam).
Sign up for one of the exam days in AIS before June 11.
Remedial exams will take place in the last week of the exam period. Beware, there will not be much time to prepare a better project. Projects should be submitted as homeworks to /submit/project.
· Cloud homework is due on May 20 9:00am.


OPTbash


This is an optional set of exercises for the lecture on command-line tools (Lbash). They use a bioinformatics input file with an annotation of the yeast genome.

Task A (yeast genome)

The input file:

  • /tasks/optional/bash/saccharomyces_cerevisiae.gff contains annotation of the yeast genome
    • Downloaded from http://yeastgenome.org/ on 2016-03-09, in particular from [1].
    • It was further processed to omit DNA sequences from the end of file.
    • The size of the file is 5.6M.
  • For easier work, link the file to your directory by ln -s /tasks/optional/bash/saccharomyces_cerevisiae.gff yeast.gff
  • The file is in GFF3 format
  • Lines starting with # are comments; every other line contains tab-separated data about one interval of some chromosome in the yeast genome
  • Meaning of the first 5 columns:
    • column 0 chromosome name
    • column 1 source (can be ignored)
    • column 2 type of interval
    • column 3 start of interval (1-based coordinates)
    • column 4 end of interval (1-based coordinates)
  • You can assume that these first 5 columns do not contain whitespace
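
To make the column layout concrete, here is a minimal sketch that skips comments and keeps only the columns described above. It runs on a tiny made-up sample instead of the real file; the file names sample.gff and columns.txt and all records in them are invented for illustration. Note that cut numbers fields from 1, while this page numbers columns from 0.

```shell
# Build a tiny made-up GFF-like sample (not real yeast data):
# one comment line and three tab-separated records.
printf '##gff-version 3\n' > sample.gff
printf 'chrI\tSGD\tchromosome\t1\t1000\t.\t.\t.\tID=chrI\n' >> sample.gff
printf 'chrI\tSGD\tgene\t100\t400\t.\t+\t.\tID=g1\n' >> sample.gff
printf 'chrI\tSGD\tCDS\t100\t400\t.\t+\t0\tID=c1\n' >> sample.gff

# Skip comment lines, then keep columns 0, 2, 3 and 4 (page numbering).
# cut counts fields from 1, so these are fields 1, 3, 4 and 5.
grep -v '^#' sample.gff | cut -f 1,3-5 > columns.txt
cat columns.txt
```

On the real data the same pipeline could be pointed at yeast.gff and piped into head to preview the first few records.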

Task:

  • Print for each type of interval (column 2), how many times it occurs in the file.
  • Sort from the most common to the least common interval types.
  • Hint: commands sort and uniq will be useful. Do not forget to skip comments, for example using grep -v '^#'
  • The result should be a file types.txt formatted as follows:
   7058 CDS
   6600 mRNA
...
...
      1 telomerase_RNA_gene
      1 mating_type_region
      1 intein_encoding_region
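
The counting idiom from the hint can be sketched as follows. This is one possible approach, shown on a small made-up sample rather than the real file; sample_a.gff, types_sample.txt and all records are invented for illustration. The exact width of the leading padding printed by uniq -c may differ between systems.

```shell
# Made-up sample: 3 CDS records, 2 gene records, 1 chromosome record.
printf '##gff-version 3\n' > sample_a.gff
printf 'chrI\tSGD\tchromosome\t1\t1000\t.\t.\t.\tID=chrI\n' >> sample_a.gff
printf 'chrI\tSGD\tgene\t100\t400\t.\t+\t.\tID=g1\n' >> sample_a.gff
printf 'chrI\tSGD\tCDS\t100\t400\t.\t+\t0\tID=c1\n' >> sample_a.gff
printf 'chrI\tSGD\tgene\t500\t900\t.\t-\t.\tID=g2\n' >> sample_a.gff
printf 'chrI\tSGD\tCDS\t500\t700\t.\t-\t0\tID=c2\n' >> sample_a.gff
printf 'chrI\tSGD\tCDS\t750\t900\t.\t-\t0\tID=c3\n' >> sample_a.gff

# 1. drop comments, 2. keep the type column (column 2 here, field 3 for cut),
# 3. sort so that equal types are adjacent, 4. count each run with uniq -c,
# 5. sort numerically in reverse so the most common type comes first.
grep -v '^#' sample_a.gff | cut -f 3 | sort | uniq -c | sort -rn > types_sample.txt
cat types_sample.txt
```

The first sort is needed because uniq only merges adjacent identical lines.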


Task B (chromosomes)

  • Continue processing file from task A.
  • For each chromosome, the file contains one line whose column 2 is the string chromosome and whose interval spans the whole chromosome.
  • To file chromosomes.txt, print a tab-separated list of chromosome names and sizes in the same order as in the input.
  • The last line of chromosomes.txt should list the total size of all chromosomes combined.
  • Hints:
    • The total size can be computed by a perl one-liner.
    • Example from the lecture: compute the sum of interval sizes if each line of the file contains the start and end of one interval: perl -lane '$sum += $F[1]-$F[0]; END { print $sum; }'
    • Grepping for the word chromosome does not check whether the word actually occurs in column 2 (the type column); it also matches lines that mention it elsewhere.
    • Tab character is written in Perl as "\t".
  • Your output should start and end as follows:
chrI    230218
chrII   813184
...
...
chrXVI  948066
chrmt   85779
total   12157105