1-DAV-202 Data Management 2023/24
Previously 2-INF-185 Data Source Integration

Materials · Introduction · Rules · Contact
· Grades from marked homeworks are on the server in file /grades/userid.txt
· Dates of project submission and oral exams:
Early: submit project May 24 9:00am, oral exams May 27 1:00pm (limit 5 students).
Otherwise submit project June 11, 9:00am, oral exams June 18 and 21 (estimated 9:00am-1:00pm, schedule will be published before exam).
Sign up for one the exam days in AIS before June 11.
Remedial exams will take place in the last week of the exam period. Beware, there will not be much time to prepare a better project. Projects should be submitted as homeworks to /submit/project.
· Cloud homework is due on May 20 9:00am.


Difference between revisions of "Genomika: Informácie ku trackom"

From MAD
Jump to navigation Jump to search
Line 168: Line 168:
 
===(K) Conservation by phyloP (medium, needs A,I,J)===
 
===(K) Conservation by phyloP (medium, needs A,I,J)===
 
* Based on multiple alignment from part J, find which positions are conserved in evolution (the result is a numerical level of conservation per position in a wiggle format)
 
* Based on multiple alignment from part J, find which positions are conserved in evolution (the result is a numerical level of conservation per position in a wiggle format)
 +
* See track Align. Cons. (L) in sacCer3 browser
 
* Use the same tree as in I
 
* Use the same tree as in I
 
* https://github.com/fmfi-genomika/genomika-2017/wiki/PhyloP-tracks
 
* https://github.com/fmfi-genomika/genomika-2017/wiki/PhyloP-tracks

Revision as of 12:17, 5 March 2018

Informácie k predmetu Genomika

Na tejto stránke sú informácie k trackom ktoré budete vytvárať na browseri (obe skupiny). K niektorým trackom pridáme ďalšie informácie v nasledujúcich dňoch.

Comments to the task list

  • Task (A) is a prerequisite of all other tasks, the rest are mostly independent of each other.
  • Tasks are marked as fast (no significant computation required), medium (estimated computation up to 1 hour), slow (longer computation, possibly several hours).
    • These times are only estimates, reality may vary. Perhaps provide actual running times (approximate) in your documentation.
    • Fast tasks can be done entirely on genomika server.
    • Students having accounts on compbio research cluster may run medium and slow tasks there.
  • If you get stuck on one task, you can try to do at least initial stages of another one. Coordinate within group!
  • Document your work. Documentation should be independent of this page and of the documentation created last year - copy and modify relevant passages, cite sources.

Basic information on creating tracks

(A) Genome (fast)

hgsql hgcentral -e '
insert into dbDb values (...);

insert into defaultDb values (...);

insert into genomeClade values (...);
'

(B) Protein coding genes and other items from the annotation (fast, needs A)

baseColorUseCds given
baseColorDefault genomicCodons

(C) RepeatMasker (slow, needs A)

(D) tRNAscan-SE (medium, needs A)

  • Run software for finding tRNA genes (for comparison with annotation)
  • Download software from http://lowelab.ucsc.edu/tRNAscan-SE/ (already installed on compbio servers as tRNAscan-SE command)
  • Convert output by script rna/tRNAscan-SEtoBED.py on github
  • trackDb.ra record:
track tRNAs
shortLabel tRNA Genes
longLabel Transfer RNA Genes Identified with tRNAscan-SE
group genes
visibility hide
color 0,20,150
type bed 12
nextItemButton on
priority 10

(E) Augustus (slow, needs A)

  • Run gene finder Augustus, create track with predicted genes (for comparison with annotation)
  • Download and install software from http://bioinf.uni-greifswald.de/augustus/
    • Already installed on compbio servers
  • Example of command line: augustus --uniqueGeneId=true --species=ustilago_maydis genome.fa > augustus.gtf
  • ustilago_maydis is a related fungal species used for training parameters
  • The result needs to be converted from gtf to genepred, by gtfToGenePred (at genomika server) with option -genePredExt
  • If you name your track augustus, genome browser will recognize it automatically, no need to modify trackDb.ra

(F) Self-alignment (medium/slow needs A)

lastdb genome.fa genome.fa 
lastal genome.fa genome.fa -E 1e-20 > self.maf #slow part
maf-convert psl self.maf > tmpC.psl

# filter out trivial self-alignments as well as alignments shorter than 100bp in one of the two sequences or with identity less than 0.9
perl -lane 'die unless @F==21; $s=($F[9] eq $F[13] && $F[10]==$F[12] && $F[11]==0); $s = $s || $F[12]-$F[11]<100 || $F[16]-$F[15]<100 || $F[0]<0.9*($F[0]+$F[1]+$F[2]+$F[3]+$F[5]+$F[7]); print unless $s' tmpC.psl > tmpC100_90.psl
pslToChain tmpC100_90.psl tmpC100_90.chain # kent tools binary, available on genomika
# fix bad coordinates on reverse strand 
perl -lane 'if ($F[0] eq "chain" && $F[9] eq "-") { ($F[10],$F[11]) = ($F[8]-$F[11], $F[8]-$F[10]); print join(" ", @F) } else { print }' tmpC100_90.chain > self100_90.chain

# another chain for alignments with at least 70% identity and length at least 300bp
perl -lane 'die unless @F==21; $s=($F[9] eq $F[13] && $F[10]==$F[12] && $F[11]==0); $s = $s || $F[12]-$F[11]<300 || $F[16]-$F[15]<300 || $F[0]<0.7*($F[0]+$F[1]+$F[2]+$F[3]+$F[5]+$F[7]); print unless $s' tmpC.psl > tmpC300_70.psl
/projects2/dipMag/magCap-2017/assembly/magCapA/seq-tracks/pslToChain tmpC300_70.psl tmpC300_70.chain # kent tools binary copied from genome-dev
# fix bad coordinates on reverse strand 
perl -lane 'if ($F[0] eq "chain" && $F[9] eq "-") { ($F[10],$F[11]) = ($F[8]-$F[11], $F[8]-$F[10]); print join(" ", @F) } else { print }' tmpC300_70.chain > self300_70.chain

Parts of trackDb.ra (replace magCap5 with your genome name):

track selfChain100_90
shortLabel Self aln >90%id
longLabel Self alignments with length >100bp, identity >90%
group varRep
type chain magCapA5

track selfChain300_70
shortLabel Self aln >70%id
longLabel Self alignments with length >300bp, identity >70%
group varRep
type chain magCapA5

(G) Chains between genomes (medium, needs A from both groups)

  • The goal is to create chains from malGlo to malSym and vice versa
    • Each group creates chains from its browser to the other browser
  • This is done similarly as self-similarity chains, but alignments are done between two different genomes and filtering is done differently
lastdb genome.fa genome.fa 
lastal genome.fa genome2.fa -E 1e-20 > firstSecond.maf'
maf-convert psl firstSecond.maf > tmp.psl

# keep only alignments of length at least 100 in both sequences
perl -lane 'die unless @F==21; $s = $F[12]-$F[11]<100 || $F[16]-$F[15]<100; print unless $s' tmp.psl > tmp100.psl
pslToChain tmp100.psl tmp100.chain # kent tools binary on genomika
# fix bad coordinates on reverse strand 
perl -lane 'if ($F[0] eq "chain" && $F[9] eq "-") { ($F[10],$F[11]) = ($F[8]-$F[11], $F[8]-$F[10]); print join(" ", @F) } else { print }' tmp100.chain > firstSecond.chain
  • trackDb.ra record similar, but include target species in line with type chain

(H) Protein-based chains between genomes (medium, needs A,B from both groups)

  • In more distant species, DNA-based chains from part G are not sufficiently sensitive, but it is easier to find similarity between proteins
  • In this type of track you extract protein sequences based on genome sequence and gene annotation, then you compare protein sets from the two species and map protein alignments back to the genome
  • Commands from the last year create a psl file and load it. Then the alignments cannot be used to move between genomes. It would be better to convert psl to chain as in parts F and G.
  • https://github.com/fmfi-genomika/genomika-2017/wiki/Chains-from-protein-alignments

(I) Genomes for comparative genomics (fast, only one group)

  • Download genomes of additional Malassezia species (other than malGlo and malSym)
  • Use list here [3] - 13 species excluding malGlo and magSym
    • Download one representative assembly per species (some species have multipe strains /assemblies)
  • Rename chromosomes similarly as in A, name fasta files in a systematic way (malRes1.fa etc.)
    • Malassezia sp. JP-2017 can be renamed Malassezia vespertilionis according to a publication [4]
  • Store files in a directory at genomika server

(J) Multiple whole-genome alignment (slow, needs A from both groups, I)

  • The goal of this track is to create a whole-genome multiple alignment of several genomes
  • Use genomes from part I as well as malGlo and malSym genomes from the browser
  • Beware that malSym1 and malGlo1 should be correctly named, both the genome as a whole and their chromosomes as in the browser
  • The task requires some preprocessing - renaming things etc (fast), alignment computation (slow, we recommend running on compbio servers) and postprocessing (fast/medium)
    • Preprocessing and possibly also part of running alignment can be reused between groups - collaborate
  • The notes from the last year consist of three parts: general intron, Brona's notes (Example of use of tba in a different project), and student notes (Example of use of tba in a our project, display alignments).
    • Probably follow student notes.
    • The notes are not finished (end with "track does not work"), but the track was finished, see track "S. Align (L)" in sacCer3 browser. See final version of sacCer3 trackDb.ra on genomika server.
  • To run alignment, you need phylogenetic tree of these species. Use Fig 3 from paper Lorch et al 2018 [5], write in parenthesis notation
  • https://github.com/fmfi-genomika/genomika-2017/wiki/Alignments

(K) Conservation by phyloP (medium, needs A,I,J)

(L) Conserved elements by phastCons (medium, needs A,I,J)

(M) Protein domain and other protein annotation from Uniprot (medium, needs A,B)

(N) Expression data from RNA-seq (medium/slow, needs A)

(O) Differences between strains (slow, needs A)