1-DAV-202 Data Management 2023/24
Previously 2-INF-185 Data Source Integration
Genomika: Information on tracks
Revision as of 16:38, 1 March 2018 by Brona
Information on the Genomika course
This page contains information on the tracks you will be creating in the browser (both groups). We will add more information for some of the tracks in the coming days.
Contents
- 1 Comments to the task list
- 2 Basic information on creating tracks
- 3 (A) Genome (fast)
- 4 (B) Protein coding genes and other items from the annotation (fast, needs A)
- 5 (C) RepeatMasker (slow, needs A)
- 6 (D) tRNAscan-SE (medium, needs A)
- 7 (E) Augustus (slow, needs A)
- 8 (F) Self-alignment (medium/slow needs A)
- 9 (G) Chains between genomes (medium, needs A from both groups)
- 10 (H) Protein-based chains between genomes (medium, needs A,B from both groups)
- 11 (I) Genomes for comparative genomics (fast, only one group)
- 12 (J) Multiple whole-genome alignment (slow, needs A from both groups, I)
- 13 (K) Conservation by phyloP (medium, needs A,I,J)
- 14 (L) Conserved elements by phastCons (medium, needs A,I,J)
- 15 (M) Protein domain and other protein annotation from Uniprot (medium, needs A,B)
- 16 (N) Expression data from RNA-seq (medium/slow, needs A)
- 17 (O) Differences between strains (slow, needs A)
Comments to the task list
- Task (A) is a prerequisite for all other tasks; the rest are mostly independent of each other.
- Tasks are marked as fast (no significant computation required), medium (estimated computation up to 1 hour), or slow (longer computation, possibly several hours).
- These times are only estimates and reality may vary. Consider recording the approximate actual running times in your documentation.
- Fast tasks can be done entirely on the genomika server.
- Students who have accounts on the compbio research cluster may run medium and slow tasks there.
- If you get stuck on one task, try to do at least the initial stages of another one. Coordinate within your group!
- Document your work. The documentation should be independent of this page and of the documentation created last year: copy and modify relevant passages, and cite sources.
Basic information on creating tracks
- https://github.com/fmfi-genomika/genomika-2017/wiki/Basics-of-creating-tracks
- https://github.com/fmfi-genomika/genomika-2017/wiki/How-to-add-track-to-DB-and-display-it-on-page
- Important: compared to last year, path /kentsrc/kent/src/hg/makeDb/trackDb/ was moved to /kentsrc/trackDb/
(A) Genome (fast)
- Download genome in fasta format, add to browser
- Data malGlo [1], malSym [2]
- https://github.com/fmfi-genomika/genomika-2017/wiki/Setting-up-a-Yarrowia-lipolytica
- An important step not described there is renaming chromosomes/contigs to something reasonable
- Genome versions are numbered; we will start with malGlo1 and malSym1
- Example of the hgcentral entries (values shown are from a different genome, adjust them for yours):
hgsql hgcentral -e '
  insert into dbDb values (...);
  insert into defaultDb values ("M. ingens","magIngA4");
  insert into genomeClade values ("M. ingens","other",10);'
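The renaming step mentioned above could be sketched roughly as follows. This is only an illustration: the function name `rename_fasta_headers`, the `chr` prefix, and the example accessions are hypothetical, and each group should pick whatever naming scheme it considers reasonable. The returned mapping is kept so that the same names can later be applied to the annotation in task (B).

```python
def rename_fasta_headers(lines, prefix="chr"):
    """Rename FASTA records to prefix1, prefix2, ... in order of appearance.

    Returns (renamed_lines, mapping), where mapping maps each old
    sequence ID to its new name. The naming scheme is an assumption.
    """
    renamed = []
    mapping = {}
    count = 0
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            count += 1
            words = line[1:].split()
            old_name = words[0] if words else ""  # first word after '>' is the ID
            new_name = f"{prefix}{count}"
            mapping[old_name] = new_name
            renamed.append(">" + new_name)
        else:
            renamed.append(line)  # sequence lines are copied unchanged
    return renamed, mapping


if __name__ == "__main__":
    # Made-up records, used only to show the behaviour.
    example = [
        ">XX_000001.1 hypothetical chromosome 1",
        "ACGTACGT",
        ">XX_000002.1 hypothetical chromosome 2",
        "GGCCTTAA",
    ]
    renamed, mapping = rename_fasta_headers(example)
    print("\n".join(renamed))
    print(mapping)
```

For a real genome the same function can be fed an open file handle instead of a list, and the mapping written to a two-column file for reuse.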
(B) Protein coding genes and other items from the annotation (fast, needs A)
- Download the genome annotation in GFF format, convert it to genePred format, and split it into two tracks: genes and other items
- Last year this was done by 2 groups based on 2 different databases:
- https://github.com/fmfi-genomika/genomika-2017/wiki/Protein-coding-genes-from-NCBI-RefSeq
- https://github.com/fmfi-genomika/genomika-2017/wiki/Protein-coding-genes-from-EnsembleFungi
- Coordinate with the renaming of chromosomes in step (A)
- Use appropriate IDs for naming genes
- Last year's tracks were not ideal, try to improve them:
- mRNA items in the other-items track are redundant and should be omitted
- items covering an entire chromosome (type=region) should also be omitted
- protein coding genes could be displayed with codon highlighting, using the following settings in trackDb.ra:
baseColorUseCds given
baseColorDefault genomicCodons
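As a rough sketch of the clean-up suggested above for the other-items track, the following drops GFF records whose feature type (third column) is mRNA or region before conversion. The function name `filter_gff` and the example records are hypothetical; the exact set of types worth dropping depends on the annotation you download.

```python
def filter_gff(lines, drop_types=("mRNA", "region")):
    """Return GFF lines whose feature type (column 3) is not in drop_types.

    Comment and directive lines (starting with '#') are kept, blank
    lines are skipped.
    """
    kept = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            kept.append(line)
            continue
        fields = line.split("\t")
        if len(fields) >= 3 and fields[2] in drop_types:
            continue  # redundant or chromosome-wide item
        kept.append(line)
    return kept


if __name__ == "__main__":
    # Made-up records, used only to show the behaviour.
    example = [
        "##gff-version 3",
        "chr1\tsrc\tregion\t1\t50000\t.\t+\t.\tID=chr1",
        "chr1\tsrc\tmRNA\t100\t900\t.\t+\t.\tID=rna1",
        "chr1\tsrc\ttRNA\t2000\t2072\t.\t-\t.\tID=trna1",
    ]
    for line in filter_gff(example):
        print(line)
```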
(C) RepeatMasker (slow, needs A)
- Run RepeatMasker program to find repeated sequences (use fungi as species)
- https://github.com/fmfi-genomika/genomika-2017/wiki/RepeatMasker
- RepeatMasker and the repeat library are installed on the compbio and genomika servers; alternatively, request a copy for your own computer, but note that registration may take several days
(D) tRNAscan-SE (medium, needs A)
- Run software for finding tRNA genes (for comparison with annotation)
- Download software from http://lowelab.ucsc.edu/tRNAscan-SE/ (already installed on compbio servers as tRNAscan-SE command)
- Convert output by script rna/tRNAscan-SEtoBED.py on github
- trackDb.ra record:
track tRNAs
shortLabel tRNA Genes
longLabel Transfer RNA Genes Identified with tRNAscan-SE
group genes
visibility hide
color 0,20,150
type bed 12
nextItemButton on
priority 10
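To illustrate what the conversion to a bed 12 track involves, here is a minimal sketch. It is not the rna/tRNAscan-SEtoBED.py script from GitHub. It assumes the default tabular tRNAscan-SE output (sequence name, tRNA number, begin, end, type, anticodon, intron begin, intron end, score), with 1-based coordinates and begin greater than end for genes on the reverse strand. The item naming and the way the score is clamped to the BED range 0-1000 are choices made for this example.

```python
def trnascan_to_bed12(lines):
    """Convert tRNAscan-SE tabular output rows into BED12 lines (a sketch)."""
    bed_lines = []
    for line in lines:
        fields = line.split()
        # Header lines have no integer in the third column, so they are skipped.
        if len(fields) < 9 or not fields[2].isdigit():
            continue
        seq, number = fields[0], fields[1]
        begin, end = int(fields[2]), int(fields[3])
        trna_type, anticodon = fields[4], fields[5]
        intron_begin, intron_end = int(fields[6]), int(fields[7])
        raw_score = float(fields[8])

        strand = "+" if begin <= end else "-"
        low, high = min(begin, end), max(begin, end)
        start = low - 1  # BED starts are 0-based, ends are exclusive
        if intron_begin == 0 and intron_end == 0:
            sizes, starts = [high - start], [0]
        else:
            # An intron splits the gene into two blocks.
            intron_low, intron_high = sorted((intron_begin, intron_end))
            sizes = [intron_low - low, high - intron_high]
            starts = [0, intron_high - start]
        score = max(0, min(1000, round(raw_score)))  # assumed score mapping
        name = f"{seq}.tRNA{number}-{trna_type}{anticodon}"  # assumed naming
        row = [seq, start, high, name, score, strand, start, high, 0,
               len(sizes),
               ",".join(map(str, sizes)) + ",",
               ",".join(map(str, starts)) + ","]
        bed_lines.append("\t".join(map(str, row)))
    return bed_lines


if __name__ == "__main__":
    # Made-up rows, used only to show the behaviour.
    example = [
        "Name tRNA # Begin End Type Codon Begin End Score",
        "chr1 1 1000 1072 Ala AGC 0 0 72.3",
        "chr2 3 5200 5100 Tyr GTA 5162 5149 65.1",
    ]
    print("\n".join(trnascan_to_bed12(example)))
```

Check the result against the actual script and against a few tRNAs in the browser, since off-by-one errors in block coordinates are easy to make.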
(E) Augustus (slow, needs A)
- Run gene finder Augustus, create track with predicted genes (for comparison with annotation)
- Download and install software from http://bioinf.uni-greifswald.de/augustus/
- Already installed on compbio servers
- Example of command line: augustus --uniqueGeneId=true --species=ustilago_maydis genome.fa > augustus.gtf
- ustilago_maydis is a related fungal species used for training parameters
- The result needs to be converted from gtf to genepred, by gtfToGenePred (at genomika server) with option -genePredExt
- If you name your track augustus, genome browser will recognize it automatically, no need to modify trackDb.ra
(F) Self-alignment (medium/slow needs A)
- The goal here is to find regions in the genome that have multiple approximate copies
- First we need to create local alignments of the genome to itself, then convert them to an appropriate format
- Last year this was done with blastz: https://github.com/fmfi-genomika/genomika-2017/wiki/Self-alignments---self-chain---segmental-duplications
- This year I suggest using the program last (Ubuntu package last-align, website http://last.cbrc.jp/)
- Here is an example of how I have done it in the past; it would be great to rewrite the one-liners as a nicer script:
lastdb genome.fa genome.fa
lastal genome.fa genome.fa -E 1e-20 > self.maf  # slow part
maf-convert psl self.maf > tmpC.psl
# filter out trivial self-alignments as well as alignments shorter than 100bp in one of the two sequences or with identity less than 0.9
perl -lane 'die unless @F==21; $s=($F[9] eq $F[13] && $F[10]==$F[12] && $F[11]==0); $s = $s || $F[12]-$F[11]<100 || $F[16]-$F[15]<100 || $F[0]<0.9*($F[0]+$F[1]+$F[2]+$F[3]+$F[5]+$F[7]); print unless $s' tmpC.psl > tmpC100_90.psl
pslToChain tmpC100_90.psl tmpC100_90.chain  # kent tools binary, available on genomika
# fix bad coordinates on reverse strand
perl -lane 'if ($F[0] eq "chain" && $F[9] eq "-") { ($F[10],$F[11]) = ($F[8]-$F[11], $F[8]-$F[10]); print join(" ", @F) } else { print }' tmpC100_90.chain > self100_90.chain
# another chain for alignments with at least 70% identity and length at least 300bp
perl -lane 'die unless @F==21; $s=($F[9] eq $F[13] && $F[10]==$F[12] && $F[11]==0); $s = $s || $F[12]-$F[11]<300 || $F[16]-$F[15]<300 || $F[0]<0.7*($F[0]+$F[1]+$F[2]+$F[3]+$F[5]+$F[7]); print unless $s' tmpC.psl > tmpC300_70.psl
/projects2/dipMag/magCap-2017/assembly/magCapA/seq-tracks/pslToChain tmpC300_70.psl tmpC300_70.chain  # kent tools binary copied from genome-dev
# fix bad coordinates on reverse strand
perl -lane 'if ($F[0] eq "chain" && $F[9] eq "-") { ($F[10],$F[11]) = ($F[8]-$F[11], $F[8]-$F[10]); print join(" ", @F) } else { print }' tmpC300_70.chain > self300_70.chain
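Since the text above invites a nicer script, here is one possible starting point: a Python sketch of the two Perl one-liners (the PSL filter and the reverse-strand fix for chain headers). The function names are hypothetical and the logic is meant to mirror the Perl expressions column by column. It has not been run on real LAST output, so compare its results with the one-liners before relying on it.

```python
def keep_psl_line(line, min_len, min_identity):
    """Return True if a PSL alignment passes the same tests as the Perl filter."""
    f = line.split()
    if len(f) != 21:
        raise ValueError("expected 21 PSL columns, got %d" % len(f))
    matches, mismatches, rep_matches, n_count = (int(x) for x in f[0:4])
    q_base_insert, t_base_insert = int(f[5]), int(f[7])
    q_name, q_size, q_start, q_end = f[9], int(f[10]), int(f[11]), int(f[12])
    t_name, t_start, t_end = f[13], int(f[15]), int(f[16])

    # A whole sequence aligned to a sequence of the same name is trivial.
    trivial = q_name == t_name and q_size == q_end and q_start == 0
    too_short = (q_end - q_start < min_len) or (t_end - t_start < min_len)
    columns = (matches + mismatches + rep_matches + n_count
               + q_base_insert + t_base_insert)
    low_identity = matches < min_identity * columns
    return not (trivial or too_short or low_identity)


def fix_reverse_chain_header(line):
    """Recompute query start/end on chain header lines with query strand '-'."""
    f = line.split()
    if f and f[0] == "chain" and f[9] == "-":
        q_size, q_start, q_end = int(f[8]), int(f[10]), int(f[11])
        f[10], f[11] = str(q_size - q_end), str(q_size - q_start)
        return " ".join(f)
    return line  # all other lines are passed through unchanged


if __name__ == "__main__":
    # Made-up records, used only to show the behaviour.
    psl = "950 50 0 0 0 0 0 0 + chr1 10000 100 1100 chr2 20000 500 1500 1 1000, 100, 500,"
    print(keep_psl_line(psl, min_len=100, min_identity=0.9))
    chain = "chain 1000 chr2 20000 + 500 1500 chr1 10000 - 100 1100 1"
    print(fix_reverse_chain_header(chain))
```

With these two functions, the 100bp/90% and 300bp/70% variants differ only in the arguments passed to `keep_psl_line`, which removes the duplicated one-liners.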
Parts of trackDb.ra (replace magCapA5 with your genome name):
track selfChain100_90
shortLabel Self aln >90%id
longLabel Self alignments with length >100bp, identity >90%
group varRep
type chain magCapA5

track selfChain300_70
shortLabel Self aln >70%id
longLabel Self alignments with length >300bp, identity >70%
group varRep
type chain magCapA5
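Because the two records differ only in the thresholds and the genome name, they could also be generated rather than edited by hand. The helper below is a hypothetical convenience that reproduces the fields shown above for a given genome name; malGlo1 is used as the example because it is the name planned in task (A).

```python
def self_chain_stanza(genome, min_len, min_identity_pct):
    """Build a trackDb.ra record for one self-chain track (a sketch)."""
    return "\n".join([
        f"track selfChain{min_len}_{min_identity_pct}",
        f"shortLabel Self aln >{min_identity_pct}%id",
        f"longLabel Self alignments with length >{min_len}bp, "
        f"identity >{min_identity_pct}%",
        "group varRep",
        f"type chain {genome}",
    ])


if __name__ == "__main__":
    # Print both records, separated by a blank line.
    print(self_chain_stanza("malGlo1", 100, 90))
    print()
    print(self_chain_stanza("malGlo1", 300, 70))
```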
(G) Chains between genomes (medium, needs A from both groups)
- TODO: more info
(H) Protein-based chains between genomes (medium, needs A,B from both groups)
(I) Genomes for comparative genomics (fast, only one group)
- Download genomes of additional Malassezia species (other than malGlo and malSym)
- Use list here [3]
- Rename chromosomes similarly as in A, name fasta files in a systematic way (malRes1.fa etc.)
- Store files in a directory at genomika server
(J) Multiple whole-genome alignment (slow, needs A from both groups, I)
- TODO: more info
- https://github.com/fmfi-genomika/genomika-2017/wiki/Alignments
(K) Conservation by phyloP (medium, needs A,I,J)
(L) Conserved elements by phastCons (medium, needs A,I,J)
- TODO: more info
- https://github.com/fmfi-genomika/genomika-2017/wiki/Conservation
(M) Protein domain and other protein annotation from Uniprot (medium, needs A,B)
- TODO: more info
- https://github.com/fmfi-genomika/genomika-2017/wiki/Uniprot-data