1-DAV-202 Data Management 2023/24
Previously 2-INF-185 Data Source Integration
Difference between revisions of "OPTbash"
Latest revision as of 21:58, 25 January 2024
This is an optional set of exercises for the lecture on command-line tools (Lbash). They use a bioinformatics input file with an annotation of the yeast genome.
Task A (yeast genome)
The input file:
- /tasks/optional/bash/saccharomyces_cerevisiae.gff contains annotation of the yeast genome
- Downloaded from http://yeastgenome.org/ on 2016-03-09, in particular from [1].
- It was further processed to omit the DNA sequences from the end of the file.
- The size of the file is 5.6M.
- For easier work, link the file to your directory by ln -s /tasks/optional/bash/saccharomyces_cerevisiae.gff yeast.gff
- The file is in the GFF3 format.
- Lines starting with # are comments; every other line contains tab-separated data about one interval of some chromosome in the yeast genome.
- Meaning of the first 5 columns (numbered from 0):
- column 0 chromosome name
- column 1 source (can be ignored)
- column 2 type of interval
- column 3 start of interval (1-based coordinates)
- column 4 end of interval (1-based coordinates)
- You can assume that these first 5 columns do not contain whitespace
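As a small illustration of the format described above, the following sketch builds a tiny GFF-like file and prints the chromosome, type, start and end of each interval. The file name toy.gff and its contents are made up for illustration and are not taken from the real annotation. Note that the columns above are numbered from 0, while cut numbers fields from 1.

```shell
# Create a tiny made-up GFF-like file (hypothetical data, tab-separated).
printf '# comment line\n' > toy.gff
printf 'chrI\tSGD\tchromosome\t1\t1000\t.\t.\t.\tID=chrI\n' >> toy.gff
printf 'chrI\tSGD\tgene\t10\t200\t.\t+\t.\tID=g1\n' >> toy.gff
printf 'chrI\tSGD\tCDS\t10\t200\t.\t+\t0\tParent=g1\n' >> toy.gff

# Skip comments, then keep columns 0, 2, 3 and 4
# (these are fields 1, 3, 4 and 5 in cut's 1-based numbering).
grep -v '^#' toy.gff | cut -f 1,3-5
```

The same pattern of filtering comments first and selecting columns second should carry over to yeast.gff.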
Task:
- Print for each type of interval (column 2), how many times it occurs in the file.
- Sort from the most common to the least common interval types.
- Hint: commands sort and uniq will be useful. Do not forget to skip comments, for example using grep -v '^#'
- The result should be a file types.txt formatted as follows:
7058 CDS
6600 mRNA
...
...
1 telomerase_RNA_gene
1 mating_type_region
1 intein_encoding_region
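The counting idiom from the hint can be sketched on a toy input. The file names toy_types.txt and toy_counts.txt and the values in them are made up for illustration; the input stands in for the interval types from column 2, and extracting that column from the real file is not shown here.

```shell
# Made-up interval types, one per line, plus one comment line (hypothetical data).
printf 'gene\nCDS\nCDS\n# a comment\nmRNA\nCDS\ngene\n' > toy_types.txt

# Skip comments, sort so that equal lines become adjacent,
# count each group with uniq -c, then sort by the count,
# numerically and in descending order.
grep -v '^#' toy_types.txt | sort | uniq -c | sort -k1,1nr > toy_counts.txt
cat toy_counts.txt
```

The first sort is needed because uniq only merges adjacent identical lines.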
Task B (chromosomes)
- Continue processing file from task A.
- For each chromosome, the file contains a line that has the string chromosome in column 2; the interval on this line covers the whole chromosome.
- To file chromosomes.txt, print a tab-separated list of chromosome names and sizes, in the same order as in the input.
- The last line of chromosomes.txt should list the total size of all chromosomes combined.
- Hints:
- The total size can be computed by a perl one-liner.
- Example from the lecture: compute the sum of interval sizes if each line of the file contains the start and end of one interval: perl -lane '$sum += $F[1]-$F[0]; END { print $sum; }'
- Grepping for the word chromosome does not check whether this word is indeed in column 2.
- Tab character is written in Perl as "\t".
- Your output should start and end as follows:
chrI	230218
chrII	813184
...
...
chrXVI	948066
chrmt	85779
total	12157105