1-DAV-202 Data Management 2023/24
Previously 2-INF-185 Data Source Integration
OPTbash
Jump to navigation
Jump to search
Task A (yeast genome)
The input file:
- /tasks/optional/bash/saccharomyces_cerevisiae.gff contains annotation of the yeast genome
- Downloaded from http://yeastgenome.org/ on 2016-03-09, in particular from [1].
- It was further processed to omit DNA sequences from the end of file.
- The size of the file is 5.6M.
- For easier work, link the file to your directory by ln -s /tasks/optional/bash/saccharomyces_cerevisiae.gff yeast.gff
- The file is in GFF3 format
- The lines starting with # are comments, other lines contain tab-separated data about one interval of some chromosome in the yeast genome
- Meaning of the first 5 columns:
- column 0 chromosome name
- column 1 source (can be ignored)
- column 2 type of interval
- column 3 start of interval (1-based coordinates)
- column 4 end of interval (1-based coordinates)
- You can assume that these first 5 columns do not contain whitespace
Task:
- Print for each type of interval (column 2), how many times it occurs in the file.
- Sort from the most common to the least common interval types.
- Hint: commands sort and uniq will be useful. Do not forget to skip comments, for example using grep -v '^#'
- The result should be a file types.txt formatted as follows:
7058 CDS 6600 mRNA ... ... 1 telomerase_RNA_gene 1 mating_type_region 1 intein_encoding_region