Difference between revisions of "Genomika: Information on tracks"

Line 76:

  • Last year this was done with blastz: https://github.com/fmfi-genomika/genomika-2017/wiki/Self-alignments---self-chain---segmental-duplications
  • This year I suggest using the program last (Ubuntu package last-align, website http://last.cbrc.jp/)
  • Here is an example of how I have done it in the past; it would be great to rewrite the one-liners as a nicer script:

lastdb genome.fa genome.fa

Revision as of 11:36, 1 March 2018

Information for the Genomika course

This page contains information on the tracks that you will create in the browser (both groups). We will add further information on some tracks in the coming days.

Comments to the task list

  • Task (A) is a prerequisite of all other tasks; the rest are mostly independent of each other.
  • Tasks are marked as fast (no significant computation required), medium (estimated computation up to 1 hour), or slow (longer computation, possibly several hours).
    • These times are only estimates; reality may vary. Consider reporting the approximate actual running times in your documentation.
    • Fast tasks can be done entirely on the genomika server.
    • Students with accounts on the compbio research cluster may run medium and slow tasks there.
  • If you get stuck on one task, you can try to do at least the initial stages of another one. Coordinate within your group!
  • Document your work. The documentation should be independent of this page and of the documentation created last year: copy and modify relevant passages and cite your sources.

Basic information on creating tracks

(A) Genome (fast)

(B) Protein coding genes and other items from the annotation (fast, needs A)

baseColorUseCds given
baseColorDefault genomicCodons

(C) RepeatMasker (slow, needs A)

(D) tRNAscan-SE (medium, needs A)

  • Run software for finding tRNA genes (for comparison with annotation)
  • Download software from http://lowelab.ucsc.edu/tRNAscan-SE/ (already installed on compbio servers as tRNAscan-SE command)
  • Convert the output with the script rna/tRNAscan-SEtoBED.py on GitHub
  • trackDb.ra record:
track tRNAs
shortLabel tRNA Genes
longLabel Transfer RNA Genes Identified with tRNAscan-SE
group genes
visibility hide
color 0,20,150
type bed 12
nextItemButton on
priority 10
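
To illustrate what the conversion step involves, here is a minimal Python sketch that turns one row of tabular tRNAscan-SE output into a BED12 row matching the "type bed 12" setting above. It is not the rna/tRNAscan-SEtoBED.py script from GitHub, which may work differently. The column order (sequence name, tRNA number, begin, end, type, anticodon, intron begin, intron end, score), the item naming scheme and the rounding of the score are assumptions made for this example.

```python
def trnascan_to_bed12(line):
    """Convert one tab-separated tRNAscan-SE row to a BED12 row.

    Returns None for header or separator lines. The column layout is an
    assumption about the default tabular output.
    """
    cols = [c.strip() for c in line.rstrip("\n").split("\t")]
    if len(cols) < 9 or not cols[2].isdigit():
        return None  # header, separator or malformed line
    seq, number = cols[0], cols[1]
    begin, end = int(cols[2]), int(cols[3])
    trna_type, anticodon = cols[4], cols[5]
    intron_a, intron_b = int(cols[6]), int(cols[7])
    score = float(cols[8])

    # Begin > End marks a tRNA on the reverse strand.
    strand = "+" if begin <= end else "-"
    # BED uses 0-based, half-open coordinates.
    start, stop = min(begin, end) - 1, max(begin, end)

    if intron_a == 0 and intron_b == 0:
        sizes, starts = [stop - start], [0]
    else:
        # An intron splits the tRNA into two blocks.
        lo, hi = min(intron_a, intron_b), max(intron_a, intron_b)
        sizes = [lo - 1 - start, stop - hi]
        starts = [0, hi - start]

    name = f"{seq}.tRNA{number}-{trna_type}{anticodon}"  # hypothetical naming
    bed_score = max(0, min(1000, round(score)))  # BED scores are 0..1000
    fields = [seq, start, stop, name, bed_score, strand, start, stop, 0,
              len(sizes),
              ",".join(map(str, sizes)) + ",",
              ",".join(map(str, starts)) + ","]
    return "\t".join(map(str, fields))
```

Applied to every line of the tRNAscan-SE output (skipping the None results), this yields a file that can be loaded as a bed 12 track.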

(E) Augustus (slow, needs A)

  • Run gene finder Augustus, create track with predicted genes (for comparison with annotation)
  • Download and install software from http://bioinf.uni-greifswald.de/augustus/
    • Already installed on compbio servers
  • Example of command line: augustus --uniqueGeneId=true --species=ustilago_maydis genome.fa > augustus.gtf
  • ustilago_maydis is a related fungal species used for training parameters
  • The result needs to be converted from GTF to genePred format with gtfToGenePred (available on the genomika server), using the option -genePredExt
  • If you name your track augustus, the genome browser will recognize it automatically, so there is no need to modify trackDb.ra
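
The two steps above (prediction, then conversion) could be wrapped in a small Python helper, sketched below. The Augustus options are the ones from the example command line; the output file name augustus.gp and the function names are assumptions for this example, and both programs must be on your PATH.

```python
import subprocess


def augustus_commands(genome="genome.fa", species="ustilago_maydis",
                      gtf="augustus.gtf", genepred="augustus.gp"):
    """Build the command lines for gene prediction and format conversion."""
    predict = ["augustus", "--uniqueGeneId=true", f"--species={species}", genome]
    convert = ["gtfToGenePred", "-genePredExt", gtf, genepred]
    return predict, convert


def run_augustus(genome="genome.fa", species="ustilago_maydis",
                 gtf="augustus.gtf", genepred="augustus.gp"):
    """Run Augustus, save its output as GTF, then convert it to genePred."""
    predict, convert = augustus_commands(genome, species, gtf, genepred)
    with open(gtf, "w") as out:
        # Augustus writes predictions to standard output.
        subprocess.run(predict, stdout=out, check=True)
    subprocess.run(convert, check=True)
```

Calling run_augustus() with the defaults reproduces the example command line followed by the gtfToGenePred conversion; check=True stops the script if either program fails.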

(F) Self-alignment (medium/slow, needs A)

lastdb genome.fa genome.fa 
binqsub -N R-last -l hostname=cpu07 'lastal genome.fa genome.fa -E 1e-20 > self.maf'
maf-convert psl self.maf > tmpC.psl

# filter out trivial self-alignments as well as alignments shorter than 100bp in one of the two sequences or with identity less than 0.9
perl -lane 'die unless @F==21; $s=($F[9] eq $F[13] && $F[10]==$F[12] && $F[11]==0); $s = $s || $F[12]-$F[11]<100 || $F[16]-$F[15]<100 || $F[0]<0.9*($F[0]+$F[1]+$F[2]+$F[3]+$F[5]+$F[7]); print unless $s' tmpC.psl > tmpC100_90.psl
pslToChain tmpC100_90.psl tmpC100_90.chain # kent tools binary, available on genomika
# fix bad coordinates on reverse strand 
perl -lane 'if ($F[0] eq "chain" && $F[9] eq "-") { ($F[10],$F[11]) = ($F[8]-$F[11], $F[8]-$F[10]); print join(" ", @F) } else { print }' tmpC100_90.chain > self100_90.chain

# another chain for alignments with at least 70% identity and length at least 300bp
perl -lane 'die unless @F==21; $s=($F[9] eq $F[13] && $F[10]==$F[12] && $F[11]==0); $s = $s || $F[12]-$F[11]<300 || $F[16]-$F[15]<300 || $F[0]<0.7*($F[0]+$F[1]+$F[2]+$F[3]+$F[5]+$F[7]); print unless $s' tmpC.psl > tmpC300_70.psl
/projects2/dipMag/magCap-2017/assembly/magCapA/seq-tracks/pslToChain tmpC300_70.psl tmpC300_70.chain # kent tools binary copied from genome-dev
# fix bad coordinates on reverse strand 
perl -lane 'if ($F[0] eq "chain" && $F[9] eq "-") { ($F[10],$F[11]) = ($F[8]-$F[11], $F[8]-$F[10]); print join(" ", @F) } else { print }' tmpC300_70.chain > self300_70.chain
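
The two Perl one-liners above (PSL filtering and the reverse-strand coordinate fix) could be written as a more readable script. The following is a Python sketch intended to mirror their logic; the function names are made up for this example and the code has not been run on real last output.

```python
def keep_psl_row(fields, min_len, min_identity):
    """Return True if a 21-column PSL row passes the filters."""
    if len(fields) != 21:
        raise ValueError(f"expected 21 PSL columns, got {len(fields)}")
    matches, mismatches, rep_matches, n_count = (int(x) for x in fields[0:4])
    q_base_insert, t_base_insert = int(fields[5]), int(fields[7])
    q_name, q_size = fields[9], int(fields[10])
    q_start, q_end = int(fields[11]), int(fields[12])
    t_name = fields[13]
    t_start, t_end = int(fields[15]), int(fields[16])

    # Trivial self-alignment: a whole sequence aligned to itself.
    trivial = q_name == t_name and q_size == q_end and q_start == 0
    # Too short in either of the two sequences.
    too_short = (q_end - q_start < min_len) or (t_end - t_start < min_len)
    # Identity = matches divided by all aligned and inserted bases.
    total = (matches + mismatches + rep_matches + n_count
             + q_base_insert + t_base_insert)
    low_identity = matches < min_identity * total
    return not (trivial or too_short or low_identity)


def filter_psl(lines, min_len, min_identity):
    """Yield the PSL lines that pass keep_psl_row."""
    for line in lines:
        if keep_psl_row(line.split(), min_len, min_identity):
            yield line.rstrip("\n")


def fix_chain_line(line):
    """Recompute query start/end in chain headers on the reverse strand."""
    fields = line.split()
    if len(fields) > 11 and fields[0] == "chain" and fields[9] == "-":
        q_size, q_start, q_end = int(fields[8]), int(fields[10]), int(fields[11])
        fields[10], fields[11] = str(q_size - q_end), str(q_size - q_start)
        return " ".join(fields)
    return line.rstrip("\n")
```

For the first chain, filter_psl(open("tmpC.psl"), 100, 0.9) corresponds to the first filter and filter_psl(open("tmpC.psl"), 300, 0.7) to the second; fix_chain_line is applied to every line of the file produced by pslToChain.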

Parts of trackDb.ra (replace magCapA5 with your genome name):

track selfChain100_90
shortLabel Self aln >90%id
longLabel Self alignments with length >100bp, identity >90%
group varRep
type chain magCapA5

track selfChain300_70
shortLabel Self aln >70%id
longLabel Self alignments with length >300bp, identity >70%
group varRep
type chain magCapA5
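
Since both records differ only in the thresholds and the genome name, they could be generated from a template. Below is a small Python sketch; the function name and the genome name malGlo1 in the usage note are hypothetical.

```python
def self_chain_stanza(genome, min_len, min_identity_pct):
    """Return a trackDb.ra record for one self-alignment chain track."""
    return "\n".join([
        f"track selfChain{min_len}_{min_identity_pct}",
        f"shortLabel Self aln >{min_identity_pct}%id",
        f"longLabel Self alignments with length >{min_len}bp, "
        f"identity >{min_identity_pct}%",
        "group varRep",
        f"type chain {genome}",
    ])
```

For example, print(self_chain_stanza("malGlo1", 100, 90)) followed by print(self_chain_stanza("malGlo1", 300, 70)) produces the two records above with the genome name replaced.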

(G) Chains between genomes (medium, needs A from both groups)

  • TODO: more info

(H) Protein-based chains between genomes (medium, needs A,B from both groups)

(I) Genomes for comparative genomics (fast, only one group)

  • Download genomes of additional Malassezia species (other than malGlo and malSym)
  • Use list here [3]
  • Rename chromosomes in the same way as in A, and name the fasta files systematically (malRes1.fa etc.)
  • Store the files in a directory on the genomika server

(J) Multiple whole-genome alignment (slow, needs A from both groups, I)

(K) Conservation by phyloP (medium, needs A,I,J)

(L) Conserved elements by phastCons (medium, needs A,I,J)

(M) Protein domain and other protein annotation from Uniprot (medium, needs A,B)

(N) Expression data from RNA-seq (medium/slow, needs A)

(O) Differences between strains (slow, needs A)