Difference between revisions of "Genomika: Informácie ku trackom"

Revision as of 17:14, 1 March 2018

Information for the Genomics course

This page contains information about the tracks that you will create in the browser (both groups). We will add further information on some tracks in the coming days.

1 Comments to the task list
2 Basic information on creating tracks
3 (A) Genome (fast)
4 (B) Protein coding genes and other items from the annotation (fast, needs A)
5 (C) RepeatMasker (slow, needs A)
6 (D) tRNAscan-SE (medium, needs A)
7 (E) Augustus (slow, needs A)
8 (F) Self-alignment (medium/slow needs A)
9 (G) Chains between genomes (medium, needs A from both groups)
10 (H) Protein-based chains between genomes (medium, needs A,B from both groups)
11 (I) Genomes for comparative genomics (fast, only one group)
12 (J) Multiple whole-genome alignment (slow, needs A from both groups, I)
13 (K) Conservation by phyloP (medium, needs A,I,J)
14 (L) Conserved elements by phastCons (medium, needs A,I,J)
15 (M) Protein domain and other protein annotation from Uniprot (medium, needs A,B)
16 (N) Expression data from RNA-seq (medium/slow, needs A)
17 (O) Differences between strains (slow, needs A)

Comments to the task list

Task (A) is a prerequisite for all other tasks; the rest are mostly independent of each other.
Tasks are marked as fast (no significant computation required), medium (estimated computation up to 1 hour), or slow (longer computation, possibly several hours).
- These times are only estimates and the actual times may differ. Consider reporting approximate actual running times in your documentation.
- Fast tasks can be done entirely on the genomika server.
- Students with accounts on the compbio research cluster may run medium and slow tasks there.
If you get stuck on one task, try to do at least the initial stages of another one. Coordinate within your group!
Document your work. The documentation should be independent of this page and of the documentation created last year: copy and modify relevant passages, and cite your sources.

Basic information on creating tracks

https://github.com/fmfi-genomika/genomika-2017/wiki/Basics-of-creating-tracks
https://github.com/fmfi-genomika/genomika-2017/wiki/How-to-add-track-to-DB-and-display-it-on-page
Important: compared to last year, path /kentsrc/kent/src/hg/makeDb/trackDb/ was moved to /kentsrc/trackDb/

(A) Genome (fast)

Download the genome in FASTA format and add it to the browser
Data: malGlo [1], malSym [2]
https://github.com/fmfi-genomika/genomika-2017/wiki/Setting-up-a-Yarrowia-lipolytica
An important step not described there is renaming the chromosomes/contigs to something reasonable
Genome versions are numbered; we will start with malGlo1 and malSym1

# example values from a different genome; substitute your own species name and database
hgsql hgcentral -e '
insert into dbDb values (...);

insert into defaultDb values ("M. ingens","magIngA4");

insert into genomeClade values ("M. ingens","other",10);
'
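
The renaming step mentioned above could be sketched as follows. This is only an illustration, not the procedure used last year: the function name rename_fasta_headers and the chr1, chr2, ... naming scheme are assumptions, so pick whatever names are reasonable for your genome. The returned mapping can be reused when renaming sequences in the annotation for task (B).

```python
def rename_fasta_headers(lines, prefix="chr"):
    """Replace FASTA headers by prefix1, prefix2, ... in order of appearance.

    Returns the renamed lines and a dict mapping old sequence IDs to new names.
    The naming scheme is a hypothetical choice, not prescribed by the course.
    """
    mapping = {}
    renamed = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            words = line[1:].split()
            old_name = words[0] if words else ""  # first word of header is the ID
            new_name = f"{prefix}{len(mapping) + 1}"
            mapping[old_name] = new_name
            renamed.append(">" + new_name)
        else:
            renamed.append(line)  # sequence lines are copied unchanged
    return renamed, mapping
```

Writing the mapping to a small two-column file next to the renamed genome makes the renaming easy to document and to repeat.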

(B) Protein coding genes and other items from the annotation (fast, needs A)

Download the genome annotation in GFF format, convert it to genePred format, and split it into two tracks: genes and other items
Last year this was done by two groups, based on two different databases:
https://github.com/fmfi-genomika/genomika-2017/wiki/Protein-coding-genes-from-NCBI-RefSeq
https://github.com/fmfi-genomika/genomika-2017/wiki/Protein-coding-genes-from-EnsembleFungi
Coordinate with the renaming of chromosomes in step (A)
Use appropriate IDs for naming genes
Last year's tracks were not ideal, try to improve them:
- mRNA items in the other-items track are redundant and should be omitted
- items covering an entire chromosome (type=region) should also be omitted
- protein coding genes could be displayed with codon highlighting, using the following settings in trackDb.ra:

baseColorUseCds given
baseColorDefault genomicCodons
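
The two omissions suggested above for the other-items track could be implemented roughly as below. This is a minimal sketch that assumes tab-separated GFF3 with the feature type in the third column; the function name filter_other_items is made up for illustration. It drops only the two types named above, so which remaining types belong in the genes track is still up to you.

```python
def filter_other_items(gff_lines, skip_types=frozenset({"mRNA", "region"})):
    """Return GFF feature lines whose type (column 3) is not in skip_types.

    Comment lines and blank lines are dropped as well.
    """
    kept = []
    for line in gff_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # header, directive or empty line
        fields = line.split("\t")
        if len(fields) != 9:
            raise ValueError(f"expected 9 tab-separated columns: {line!r}")
        if fields[2] in skip_types:
            continue  # redundant mRNA item or whole-chromosome region
        kept.append(line)
    return kept
```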

(C) RepeatMasker (slow, needs A)

Run the RepeatMasker program to find repeated sequences (use fungi as the species)
https://github.com/fmfi-genomika/genomika-2017/wiki/RepeatMasker
RepeatMasker and its repeat library are installed on the compbio and genomika servers. Alternatively, request a copy for your own computer; registration may take several days

(D) tRNAscan-SE (medium, needs A)

Run software for finding tRNA genes (for comparison with the annotation)
Download the software from http://lowelab.ucsc.edu/tRNAscan-SE/ (already installed on the compbio servers as the tRNAscan-SE command)
Convert the output with the script rna/tRNAscan-SEtoBED.py from GitHub
trackDb.ra record:

track tRNAs
shortLabel tRNA Genes
longLabel Transfer RNA Genes Identified with tRNAscan-SE
group genes
visibility hide
color 0,20,150
type bed 12
nextItemButton on
priority 10

(E) Augustus (slow, needs A)

Run the gene finder Augustus and create a track with predicted genes (for comparison with the annotation)
Download and install the software from http://bioinf.uni-greifswald.de/augustus/
- Already installed on the compbio servers
Example command line: augustus --uniqueGeneId=true --species=ustilago_maydis genome.fa > augustus.gtf
ustilago_maydis is a related fungal species that was used for training the parameters
The result needs to be converted from GTF to genePred with gtfToGenePred (on the genomika server), using the option -genePredExt
If you name your track augustus, the genome browser will recognize it automatically, with no need to modify trackDb.ra

(F) Self-alignment (medium/slow needs A)

The goal here is to find regions of the genome that have multiple approximate copies
First we need to create local alignments of the genome to itself, then convert them to an appropriate format
Last year this was done with blastz: https://github.com/fmfi-genomika/genomika-2017/wiki/Self-alignments---self-chain---segmental-duplications
This year I suggest using the program last (Ubuntu package last-align, website http://last.cbrc.jp/)
Here is an example of how I have done it in the past; it would be great to rewrite the one-liners as a nicer script
- ideally also patch the bug in pslToChain and submit the patch to the UCSC genome GitHub repository

lastdb genome.fa genome.fa 
lastal genome.fa genome.fa -E 1e-20 > self.maf #slow part
maf-convert psl self.maf > tmpC.psl

# filter out trivial self-alignments as well as alignments shorter than 100bp in one of the two sequences or with identity less than 0.9
perl -lane 'die unless @F==21; $s=($F[9] eq $F[13] && $F[10]==$F[12] && $F[11]==0); $s = $s || $F[12]-$F[11]<100 || $F[16]-$F[15]<100 || $F[0]<0.9*($F[0]+$F[1]+$F[2]+$F[3]+$F[5]+$F[7]); print unless $s' tmpC.psl > tmpC100_90.psl
pslToChain tmpC100_90.psl tmpC100_90.chain # kent tools binary, available on genomika
# fix bad coordinates on reverse strand 
perl -lane 'if ($F[0] eq "chain" && $F[9] eq "-") { ($F[10],$F[11]) = ($F[8]-$F[11], $F[8]-$F[10]); print join(" ", @F) } else { print }' tmpC100_90.chain > self100_90.chain

# another chain for alignments with at least 70% identity and length at least 300bp
perl -lane 'die unless @F==21; $s=($F[9] eq $F[13] && $F[10]==$F[12] && $F[11]==0); $s = $s || $F[12]-$F[11]<300 || $F[16]-$F[15]<300 || $F[0]<0.7*($F[0]+$F[1]+$F[2]+$F[3]+$F[5]+$F[7]); print unless $s' tmpC.psl > tmpC300_70.psl
pslToChain tmpC300_70.psl tmpC300_70.chain # kent tools binary, available on genomika
# fix bad coordinates on reverse strand 
perl -lane 'if ($F[0] eq "chain" && $F[9] eq "-") { ($F[10],$F[11]) = ($F[8]-$F[11], $F[8]-$F[10]); print join(" ", @F) } else { print }' tmpC300_70.chain > self300_70.chain
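
As a starting point for the nicer script, the PSL filtering one-liner above could look like this in Python. It is a sketch that mirrors the perl logic under the standard 21-column PSL layout; the function name keep_self_alignment and its parameter names are hypothetical.

```python
def keep_self_alignment(fields, min_len=100, min_identity=0.9):
    """Decide whether one PSL record (list of 21 strings) passes the filter.

    A record is rejected if it is the trivial alignment of a whole sequence
    to itself, if it spans fewer than min_len bases in the query or target,
    or if matches make up less than min_identity of the aligned columns.
    """
    if len(fields) != 21:
        raise ValueError("PSL record must have 21 fields")
    matches, mismatches, rep_matches, n_count = (int(x) for x in fields[0:4])
    q_base_insert, t_base_insert = int(fields[5]), int(fields[7])
    q_name, q_size = fields[9], int(fields[10])
    q_start, q_end = int(fields[11]), int(fields[12])
    t_name, t_start, t_end = fields[13], int(fields[15]), int(fields[16])

    trivial = q_name == t_name and q_start == 0 and q_end == q_size
    too_short = q_end - q_start < min_len or t_end - t_start < min_len
    columns = (matches + mismatches + rep_matches + n_count
               + q_base_insert + t_base_insert)
    low_identity = matches < min_identity * columns
    return not (trivial or too_short or low_identity)
```

The second filter corresponds to calling it with min_len=300 and min_identity=0.7 on fields obtained by line.split() for each line of tmpC.psl.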

Parts of trackDb.ra (replace magCapA5 with your genome name):

track selfChain100_90
shortLabel Self aln >90%id
longLabel Self alignments with length >100bp, identity >90%
group varRep
type chain magCapA5

track selfChain300_70
shortLabel Self aln >70%id
longLabel Self alignments with length >300bp, identity >70%
group varRep
type chain magCapA5

(G) Chains between genomes (medium, needs A from both groups)

The goal is to create chains from malGlo to malSym and vice versa
- Each group creates chains from its browser to the other browser
This is done similarly to the self-alignment chains, but the alignments are made between two different genomes and the filtering is different

lastdb genome.fa genome.fa
lastal genome.fa genome2.fa -E 1e-20 > firstSecond.maf
maf-convert psl firstSecond.maf > tmp.psl

# keep only alignments of length at least 100 in both sequences
perl -lane 'die unless @F==21; $s = $F[12]-$F[11]<100 || $F[16]-$F[15]<100; print unless $s' tmp.psl > tmp100.psl
pslToChain tmp100.psl tmp100.chain # kent tools binary on genomika
# fix bad coordinates on reverse strand 
perl -lane 'if ($F[0] eq "chain" && $F[9] eq "-") { ($F[10],$F[11]) = ($F[8]-$F[11], $F[8]-$F[10]); print join(" ", @F) } else { print }' tmp100.chain > firstSecond.chain
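
The reverse-strand fix from the one-liner above can also be written in Python. The sketch below reproduces what the perl code does, assuming the standard chain header layout (chain score tName tSize tStrand tStart tEnd qName qSize qStrand qStart qEnd id); the function name fix_reverse_strand is hypothetical.

```python
def fix_reverse_strand(chain_lines):
    """Recompute qStart/qEnd in chain headers whose query strand is '-'.

    The new coordinates are (qSize - qEnd, qSize - qStart); all other lines
    are returned unchanged apart from stripping the trailing newline.
    """
    fixed = []
    for line in chain_lines:
        fields = line.split()
        if len(fields) >= 12 and fields[0] == "chain" and fields[9] == "-":
            q_size = int(fields[8])
            q_start, q_end = int(fields[10]), int(fields[11])
            fields[10] = str(q_size - q_end)
            fields[11] = str(q_size - q_start)
            fixed.append(" ".join(fields))
        else:
            fixed.append(line.rstrip("\n"))
    return fixed
```

Like the one-liner, it should be applied exactly once to the output of pslToChain, because running it a second time would flip the coordinates back.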

The trackDb.ra record is similar to the self-alignment one, but the line with type chain should name the target species

(H) Protein-based chains between genomes (medium, needs A,B from both groups)

In more distant species, DNA-based chains from part G are not sufficiently sensitive, but it is easier to find similarity between proteins
In this type of track you extract protein sequences based on genome sequence and gene annotation, then you compare protein sets from the two species and map protein alignments back to the genome
https://github.com/fmfi-genomika/genomika-2017/wiki/Chains-from-protein-alignments

(I) Genomes for comparative genomics (fast, only one group)

Download the genomes of additional Malassezia species (other than malGlo and malSym)
Use the list here [3]
Rename chromosomes as in (A) and name the FASTA files in a systematic way (malRes1.fa etc.)
Store the files in a directory on the genomika server
