Genomika: Information on tracks
Revision as of 09:23, 1 March 2018 by Brona (→Basic information on creating tracks)
Information on the Genomika course
This page contains information on the tracks that you will create in the browser (both groups). We will add further information on some tracks in the coming days.
Contents
- 1 Comments to the task list
- 2 Basic information on creating tracks
- 3 (A) Genome (fast)
- 4 (B) Protein coding genes and other items from the annotation (fast, needs A)
- 5 (C) RepeatMasker (slow, needs A)
- 6 (D) tRNAscan-SE (medium, needs A)
- 7 (E) Augustus (slow, needs A)
- 8 (F) Self-alignment (medium/slow, needs A)
- 9 (G) Chains between genomes (medium, needs A from both groups)
- 10 (H) Protein-based chains between genomes (medium, needs A,B from both groups)
- 11 (I) Genomes for comparative genomics (fast, only one group)
- 12 (J) Multiple whole-genome alignment (slow, needs A from both groups, I)
- 13 (K) Conservation by phyloP (medium, needs A,I,J)
- 14 (L) Conserved elements by phastCons (medium, needs A,I,J)
- 15 (M) Protein domain and other protein annotation from Uniprot (medium, needs A,B)
- 16 (N) Expression data from RNA-seq (medium/slow, needs A)
- 17 (O) Differences between strains (slow, needs A)
Comments to the task list
- Task (A) is a prerequisite of all other tasks; the rest are mostly independent of each other.
- Tasks are marked as fast (no significant computation required), medium (estimated computation up to 1 hour), slow (longer computation, possibly several hours).
- These times are only estimates; reality may vary. Consider reporting approximate actual running times in your documentation.
- Fast tasks can be done entirely on the genomika server.
- Students with accounts on the compbio research cluster may run medium and slow tasks there.
- If you get stuck on one task, you can try to do at least initial stages of another one. Coordinate within group!
- Document your work. Your documentation should stand on its own, independent of this page and of the documentation created last year: copy and adapt relevant passages, and cite your sources.
Basic information on creating tracks
- https://github.com/fmfi-genomika/genomika-2017/wiki/Basics-of-creating-tracks
- https://github.com/fmfi-genomika/genomika-2017/wiki/How-to-add-track-to-DB-and-display-it-on-page
- Important: compared to last year, the path /kentsrc/kent/src/hg/makeDb/trackDb/ has moved to /kentsrc/trackDb/
(A) Genome (fast)
- Download the genome in FASTA format and add it to the browser
- Data malGlo [1], malSym [2]
- https://github.com/fmfi-genomika/genomika-2017/wiki/Setting-up-a-Yarrowia-lipolytica
- An important step not described there is renaming chromosomes/contigs to something reasonable
- Genome versions are numbered; we will start with malGlo1 and malSym1
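The renaming step could be sketched as follows. This is only a minimal sketch, assuming that sequences are simply renamed to sequential names such as chr1, chr2, ...; the function name `rename_fasta_headers`, the `chr` prefix, and the returned mapping are illustrative choices, not a scheme prescribed by the course.

```python
from typing import Dict, Iterable, List, Tuple


def rename_fasta_headers(
    lines: Iterable[str], prefix: str = "chr"
) -> Tuple[List[str], Dict[str, str]]:
    """Replace every FASTA header by prefix + running number.

    Returns the rewritten lines (without newlines) and a mapping
    old ID -> new name. The naming scheme is an assumption.
    """
    out: List[str] = []
    mapping: Dict[str, str] = {}
    count = 0
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            count += 1
            # The original ID is the first whitespace-delimited token after '>'.
            tokens = line[1:].split()
            old_id = tokens[0] if tokens else ""
            new_id = f"{prefix}{count}"
            mapping[old_id] = new_id
            out.append(f">{new_id}")
        else:
            # Sequence lines are copied unchanged.
            out.append(line)
    return out, mapping
```

Keeping the old-to-new mapping is useful because the same renaming has to be applied to the annotation in task (B).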
(B) Protein coding genes and other items from the annotation (fast, needs A)
- Download the genome annotation in GFF format, convert it to genePred format, and split it into two tracks: genes and other items
- Last year this was done by 2 groups using 2 different databases:
- https://github.com/fmfi-genomika/genomika-2017/wiki/Protein-coding-genes-from-NCBI-RefSeq
- https://github.com/fmfi-genomika/genomika-2017/wiki/Protein-coding-genes-from-EnsembleFungi
- Coordinate with the renaming of chromosomes in task (A)
- Use appropriate IDs for naming genes
- Last year's tracks are not ideal; try to improve them:
- mRNA items in the other-items track are redundant and should be omitted
- items covering an entire chromosome (type=region) should also be omitted
- protein-coding genes can be displayed with codon highlighting; use the following settings in trackDb.ra:
    baseColorUseCds given
    baseColorDefault genomicCodons
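The two omissions suggested for the other-items track could be sketched as a small filter over GFF lines. This is a minimal sketch, assuming tab-separated GFF with the feature type in the third column; the function name `filter_other_items` and the constant `OMITTED_TYPES` are hypothetical, and deciding which remaining features belong to the gene track is not covered here.

```python
from typing import Iterable, List

# Feature types left out of the "other items" track, as suggested above.
OMITTED_TYPES = frozenset({"mRNA", "region"})


def filter_other_items(
    gff_lines: Iterable[str], omitted=OMITTED_TYPES
) -> List[str]:
    """Return GFF feature lines whose type (column 3) is not omitted."""
    kept: List[str] = []
    for line in gff_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        fields = line.split("\t")
        if len(fields) != 9:
            continue  # not a well-formed GFF feature line
        if fields[2] in omitted:
            continue  # redundant mRNA or whole-chromosome region item
        kept.append(line)
    return kept
```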
(C) RepeatMasker (slow, needs A)
- Run RepeatMasker program to find repeated sequences (use fungi as species)
- https://github.com/fmfi-genomika/genomika-2017/wiki/RepeatMasker
- RepeatMasker and the repeat library are installed on the compbio and genomika servers; alternatively, request a copy for your own computer (registration may take several days)
(D) tRNAscan-SE (medium, needs A)
- Run software for finding tRNA genes (for comparison with annotation)
- Download software from http://lowelab.ucsc.edu/tRNAscan-SE/ (already installed on compbio servers as tRNAscan-SE command)
- Convert the output with the script rna/tRNAscan-SEtoBED.py on GitHub
- trackDb.ra record:
    track tRNAs
    shortLabel tRNA Genes
    longLabel Transfer RNA Genes Identified with tRNAscan-SE
    group genes
    visibility hide
    color 0,20,150
    type bed 12
    nextItemButton on
    priority 10
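To illustrate what such a conversion involves, here is a minimal sketch of turning tRNAscan-SE tabular output into BED12 lines. It is not the rna/tRNAscan-SEtoBED.py script mentioned above, whose contents are not shown on this page. The sketch assumes the default tabular output (sequence name, tRNA number, begin, end, type, anticodon, intron begin, intron end, score), assumes that begin greater than end marks the minus strand, and uses a hypothetical item name and score rounding.

```python
from typing import Iterable, List


def trnascan_to_bed12(lines: Iterable[str]) -> List[str]:
    """Convert tRNAscan-SE tabular rows to tab-separated BED12 lines."""
    bed: List[str] = []
    for line in lines:
        fields = line.split()
        # Skip blank lines and header lines (third column is not a number).
        if len(fields) < 9 or not fields[2].isdigit():
            continue
        seq, num = fields[0], fields[1]
        begin, end = int(fields[2]), int(fields[3])
        aa, anticodon = fields[4], fields[5]
        intron_b, intron_e = int(fields[6]), int(fields[7])
        # BED scores are integers in 0..1000; the rounding is an assumption.
        score = max(0, min(1000, int(round(float(fields[8])))))

        strand = "+" if begin <= end else "-"
        start = min(begin, end) - 1  # BED starts are 0-based
        stop = max(begin, end)       # BED ends are exclusive

        if intron_b == 0 and intron_e == 0:
            sizes, starts = [stop - start], [0]
        else:
            i_start = min(intron_b, intron_e) - 1
            i_end = max(intron_b, intron_e)
            # Two exons: before and after the intron, in genomic order.
            sizes = [i_start - start, stop - i_end]
            starts = [0, i_end - start]

        name = f"{seq}.trna{num}-{aa}{anticodon}"  # illustrative naming
        bed.append("\t".join([
            seq, str(start), str(stop), name, str(score), strand,
            str(start), str(stop), "0", str(len(sizes)),
            ",".join(map(str, sizes)), ",".join(map(str, starts)),
        ]))
    return bed
```

The twelve columns correspond to the type bed 12 setting in the trackDb.ra record; a tRNA with an intron becomes an item with two blocks.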
(E) Augustus (slow, needs A)
- Run gene finder Augustus, create track with predicted genes (for comparison with annotation)
- Download and install software from http://bioinf.uni-greifswald.de/augustus/
- Already installed on compbio servers
- Example of command line: augustus --uniqueGeneId=true --species=ustilago_maydis genome.fa > augustus.gtf
- ustilago_maydis is a related fungal species used for training parameters
- The result needs to be converted from GTF to genePred format with gtfToGenePred (on the genomika server), using the option -genePredExt
- If you name your track augustus, the genome browser will recognize it automatically, so there is no need to modify trackDb.ra
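The two steps above (prediction, then conversion) could be chained as in the following sketch. It assumes that both augustus and gtfToGenePred are available on the PATH of the machine where it runs, which may not hold on a single server according to the notes above; the output file name augustus.gp and the function names are hypothetical.

```python
import subprocess
from typing import List


def augustus_command(genome_fasta: str, species: str = "ustilago_maydis") -> List[str]:
    """Build the Augustus command line given in the example above."""
    return ["augustus", "--uniqueGeneId=true", f"--species={species}", genome_fasta]


def gtf_to_genepred_command(gtf_path: str, genepred_path: str) -> List[str]:
    """Build the gtfToGenePred call with the -genePredExt option."""
    return ["gtfToGenePred", "-genePredExt", gtf_path, genepred_path]


def run_augustus_track(
    genome_fasta: str,
    gtf_path: str = "augustus.gtf",
    genepred_path: str = "augustus.gp",  # hypothetical output name
) -> None:
    """Run Augustus, saving its standard output, then convert to genePred."""
    with open(gtf_path, "w") as gtf:
        # Augustus writes predictions to standard output, hence the redirect.
        subprocess.run(augustus_command(genome_fasta), stdout=gtf, check=True)
    subprocess.run(gtf_to_genepred_command(gtf_path, genepred_path), check=True)
```

Because Augustus is marked as a slow task, it may be more practical to run the first command on the compbio cluster and only the conversion on the genomika server.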
(F) Self-alignment (medium/slow, needs A)
- TODO: more info
- https://github.com/fmfi-genomika/genomika-2017/wiki/Self-alignments---self-chain---segmental-duplications
(G) Chains between genomes (medium, needs A from both groups)
- TODO: more info
(H) Protein-based chains between genomes (medium, needs A,B from both groups)
(I) Genomes for comparative genomics (fast, only one group)
- Download genomes of additional Malassezia species (other than malGlo and malSym)
- Use list here [3]
- Rename chromosomes in the same way as in (A), and name the FASTA files systematically (malRes1.fa etc.)
- Store the files in a directory on the genomika server
(J) Multiple whole-genome alignment (slow, needs A from both groups, I)
- TODO: more info
- https://github.com/fmfi-genomika/genomika-2017/wiki/Alignments
(K) Conservation by phyloP (medium, needs A,I,J)
(L) Conserved elements by phastCons (medium, needs A,I,J)
- TODO: more info
- https://github.com/fmfi-genomika/genomika-2017/wiki/Conservation
(M) Protein domain and other protein annotation from Uniprot (medium, needs A,B)
- TODO: more info
- https://github.com/fmfi-genomika/genomika-2017/wiki/Uniprot-data