1-DAV-202 Data Management 2023/24
Previously 2-INF-185 Data Source Integration
Genomika: Informácie ku trackom
Revision as of 16:29, 5 March 2018 by Brona (talk | contribs) (→(N) Expression data from RNA-seq (medium/slow, needs A))
Informácie k predmetu Genomika
Na tejto stránke sú informácie k trackom ktoré budete vytvárať na browseri (obe skupiny). K niektorým trackom pridáme ďalšie informácie v nasledujúcich dňoch.
Contents
- 1 Comments to the task list
- 2 Basic information on creating tracks
- 3 (A) Genome (fast)
- 4 (B) Protein coding genes and other items from the annotation (fast, needs A)
- 5 (C) RepeatMasker (slow, needs A)
- 6 (D) tRNAscan-SE (medium, needs A)
- 7 (E) Augustus (slow, needs A)
- 8 (F) Self-alignment (medium/slow needs A)
- 9 (G) Chains between genomes (medium, needs A from both groups)
- 10 (H) Protein-based chains between genomes (medium, needs A,B from both groups)
- 11 (I) Genomes for comparative genomics (fast, only one group)
- 12 (J) Multiple whole-genome alignment (slow, needs A from both groups, I)
- 13 (K) Conservation by phyloP (medium, needs A,I,J)
- 14 (L) Conserved elements by phastCons (medium, needs A,I,J)
- 15 (M) Protein domain and other protein annotation from Uniprot (fast/medium, needs A,B)
- 16 (N) Expression data from RNA-seq (medium/slow, needs A)
- 17 (O) Differences between strains (slow, needs A)
Comments to the task list
- Task (A) is a prerequisite of all other tasks, the rest are mostly independent of each other.
- Tasks are marked as fast (no significant computation required), medium (estimated computation up to 1 hour), slow (longer computation, possibly several hours).
- These times are only estimates, reality may vary. Perhaps provide actual running times (approximate) in your documentation.
- Fast tasks can be done entirely on genomika server.
- Students having accounts on compbio research cluster may run medium and slow tasks there.
- If you get stuck on one task, you can try to do at least initial stages of another one. Coordinate within group!
- Document your work. Documentation should be independent of this page and of the documentation created last year - copy and modify relevant passages, cite sources.
Basic information on creating tracks
- https://github.com/fmfi-genomika/genomika-2017/wiki/Basics-of-creating-tracks
- https://github.com/fmfi-genomika/genomika-2017/wiki/How-to-add-track-to-DB-and-display-it-on-page
- Important: compared to last year, path /kentsrc/kent/src/hg/makeDb/trackDb/ was moved to /kentsrc/trackDb/
(A) Genome (fast)
- Download genome in fasta format, add to browser
- Data malGlo [1], malSym [2]
- https://github.com/fmfi-genomika/genomika-2017/wiki/Setting-up-a-Yarrowia-lipolytica
- Important step not described is to rename chromosomes/contigs to something reasonable
- Genome versions are numbered, we will start with malGlo1 and malSym1
- Missing part in last-year documentation - adding species to hgcentral database
- look at existing records, e.g for yarLip1 to guess appropriate values
- taxonomy ID can be found at https://www.ncbi.nlm.nih.gov/taxonomy
hgsql hgcentral -e ' insert into dbDb values (...); insert into defaultDb values (...); insert into genomeClade values (...); '
(B) Protein coding genes and other items from the annotation (fast, needs A)
- Download genome annotation in GFF format, process to genepred format, split into two tracks: genes and other items
- Last year done by 2 groups based on 2 different databases:
- https://github.com/fmfi-genomika/genomika-2017/wiki/Protein-coding-genes-from-NCBI-RefSeq
- https://github.com/fmfi-genomika/genomika-2017/wiki/Protein-coding-genes-from-EnsembleFungi
- Coordinate with renaming chromosomes in step (1)
- Use appropriate IDs for naming genes
- Last-year tracks not ideal, try to improve
- mRNA items in other item track are redundant, should be omitted
- also items covering entire chromosome (type=region) should be omitted
- protein coding genes could be displayed with codon highlighting - use the following settings in trackDb.ra:
baseColorUseCds given baseColorDefault genomicCodons
(C) RepeatMasker (slow, needs A)
- Run RepeatMasker program to find repeated sequences (use fungi as species)
- https://github.com/fmfi-genomika/genomika-2017/wiki/RepeatMasker
- RepeatMasker and repeat library installed at compbio and genomika servers, or request a copy for your computer, registration may take several days
(D) tRNAscan-SE (medium, needs A)
- Run software for finding tRNA genes (for comparison with annotation)
- Download software from http://lowelab.ucsc.edu/tRNAscan-SE/ (already installed on compbio servers as tRNAscan-SE command)
- Convert output by script rna/tRNAscan-SEtoBED.py on github
- trackDb.ra record:
track tRNAs shortLabel tRNA Genes longLabel Transfer RNA Genes Identified with tRNAscan-SE group genes visibility hide color 0,20,150 type bed 12 nextItemButton on priority 10
(E) Augustus (slow, needs A)
- Run gene finder Augustus, create track with predicted genes (for comparison with annotation)
- Download and install software from http://bioinf.uni-greifswald.de/augustus/
- Already installed on compbio servers
- Example of command line: augustus --uniqueGeneId=true --species=ustilago_maydis genome.fa > augustus.gtf
- ustilago_maydis is a related fungal species used for training parameters
- The result needs to be converted from gtf to genepred, by gtfToGenePred (at genomika server) with option -genePredExt
- If you name your track augustus, genome browser will recognize it automatically, no need to modify trackDb.ra
(F) Self-alignment (medium/slow needs A)
- The goal here is to find regions in the genome which have multiple approximate copies
- First we need to create local alignments of the genome to itself, then convert them to appropriate format
- Last year done by blastz: https://github.com/fmfi-genomika/genomika-2017/wiki/Self-alignments---self-chain---segmental-duplications
- This yeast I suggest using program last (ubuntu package last-align, website http://last.cbrc.jp/)
- Here is an example how I have done it in the past, it would be great to rewrite one-liners to some nicer script
- ideally also patch the bug in pslToChain and submit the patch to the UCSC genome github
lastdb genome.fa genome.fa lastal genome.fa genome.fa -E 1e-20 > self.maf #slow part maf-convert psl self.maf > tmpC.psl # filter out trivial self-alignments as well as alignments shorter than 100bp in one of the two sequences or with identity less than 0.9 perl -lane 'die unless @F==21; $s=($F[9] eq $F[13] && $F[10]==$F[12] && $F[11]==0); $s = $s || $F[12]-$F[11]<100 || $F[16]-$F[15]<100 || $F[0]<0.9*($F[0]+$F[1]+$F[2]+$F[3]+$F[5]+$F[7]); print unless $s' tmpC.psl > tmpC100_90.psl pslToChain tmpC100_90.psl tmpC100_90.chain # kent tools binary, available on genomika # fix bad coordinates on reverse strand perl -lane 'if ($F[0] eq "chain" && $F[9] eq "-") { ($F[10],$F[11]) = ($F[8]-$F[11], $F[8]-$F[10]); print join(" ", @F) } else { print }' tmpC100_90.chain > self100_90.chain # another chain for alignments with at least 70% identity and length at least 300bp perl -lane 'die unless @F==21; $s=($F[9] eq $F[13] && $F[10]==$F[12] && $F[11]==0); $s = $s || $F[12]-$F[11]<300 || $F[16]-$F[15]<300 || $F[0]<0.7*($F[0]+$F[1]+$F[2]+$F[3]+$F[5]+$F[7]); print unless $s' tmpC.psl > tmpC300_70.psl /projects2/dipMag/magCap-2017/assembly/magCapA/seq-tracks/pslToChain tmpC300_70.psl tmpC300_70.chain # kent tools binary copied from genome-dev # fix bad coordinates on reverse strand perl -lane 'if ($F[0] eq "chain" && $F[9] eq "-") { ($F[10],$F[11]) = ($F[8]-$F[11], $F[8]-$F[10]); print join(" ", @F) } else { print }' tmpC300_70.chain > self300_70.chain
Parts of trackDb.ra (replace magCap5 with your genome name):
track selfChain100_90 shortLabel Self aln >90%id longLabel Self alignments with length >100bp, identity >90% group varRep type chain magCapA5 track selfChain300_70 shortLabel Self aln >70%id longLabel Self alignments with length >300bp, identity >70% group varRep type chain magCapA5
(G) Chains between genomes (medium, needs A from both groups)
- The goal is to create chains from malGlo to malSym and vice versa
- Each group creates chains from its browser to the other browser
- This is done similarly as self-similarity chains, but alignments are done between two different genomes and filtering is done differently
lastdb genome.fa genome.fa lastal genome.fa genome2.fa -E 1e-20 > firstSecond.maf' maf-convert psl firstSecond.maf > tmp.psl # keep only alignments of length at least 100 in both sequences perl -lane 'die unless @F==21; $s = $F[12]-$F[11]<100 || $F[16]-$F[15]<100; print unless $s' tmp.psl > tmp100.psl pslToChain tmp100.psl tmp100.chain # kent tools binary on genomika # fix bad coordinates on reverse strand perl -lane 'if ($F[0] eq "chain" && $F[9] eq "-") { ($F[10],$F[11]) = ($F[8]-$F[11], $F[8]-$F[10]); print join(" ", @F) } else { print }' tmp100.chain > firstSecond.chain
- trackDb.ra record similar, but include target species in line with type chain
(H) Protein-based chains between genomes (medium, needs A,B from both groups)
- In more distant species, DNA-based chains from part G are not sufficiently sensitive, but it is easier to find similarity between proteins
- In this type of track you extract protein sequences based on genome sequence and gene annotation, then you compare protein sets from the two species and map protein alignments back to the genome
- Commands from the last year create a psl file and load it. Then the alignments cannot be used to move between genomes. It would be better to convert psl to chain as in parts F and G.
- https://github.com/fmfi-genomika/genomika-2017/wiki/Chains-from-protein-alignments
(I) Genomes for comparative genomics (fast, only one group)
- Download genomes of additional Malassezia species (other than malGlo and malSym)
- Use list here [3] - 13 species excluding malGlo and magSym
- Download one representative assembly per species (some species have multipe strains /assemblies)
- Rename chromosomes similarly as in A, name fasta files in a systematic way (malRes1.fa etc.)
- Malassezia sp. JP-2017 can be renamed Malassezia vespertilionis according to a publication [4]
- Store files in a directory at genomika server
(J) Multiple whole-genome alignment (slow, needs A from both groups, I)
- The goal of this track is to create a whole-genome multiple alignment of several genomes
- Use genomes from part I as well as malGlo and malSym genomes from the browser
- Beware that malSym1 and malGlo1 should be correctly named, both the genome as a whole and their chromosomes as in the browser
- The task requires some preprocessing - renaming things etc (fast), alignment computation (slow, we recommend running on compbio servers) and postprocessing (fast/medium)
- Preprocessing and possibly also part of running alignment can be reused between groups - collaborate
- The notes from the last year consist of three parts: general intron, Brona's notes (Example of use of tba in a different project), and student notes (Example of use of tba in a our project, display alignments).
- Probably follow student notes.
- The notes are not finished (end with "track does not work"), but the track was finished, see track "S. Align (L)" in sacCer3 browser. See final version of sacCer3 trackDb.ra on genomika server.
- To run alignment, you need phylogenetic tree of these species. Use Fig 3 from paper Lorch et al 2018 [5], write in parenthesis notation
- https://github.com/fmfi-genomika/genomika-2017/wiki/Alignments
(K) Conservation by phyloP (medium, needs A,I,J)
- Based on multiple alignment from part J, find which positions are conserved in evolution (the result is a numerical level of conservation per position in a wiggle format)
- See tracks Align. Cons. (L) and Multiz. Cons. (L) in sacCer3 browser (here we want only one track)
- Use the same tree as in I
- https://github.com/fmfi-genomika/genomika-2017/wiki/PhyloP-tracks
(L) Conserved elements by phastCons (medium, needs A,I,J)
- Similar as track K, but uses a different program from the phast package. Phastcons is based on and HMM, finds contiguous conserved regions. The result is a list of conserved regions (bed format) as well as posterior probability of conserved region at each position (wig format)
- On sacCer3, wig format are e.g. tracks Cons. new (L), Cons. old (L); bed format track is PhastCons Most, but that was taken from the original UCSC database so no commands for it are available, but hopefully it should be easy to create and load.
- https://github.com/fmfi-genomika/genomika-2017/wiki/Conservation
(M) Protein domain and other protein annotation from Uniprot (fast/medium, needs A,B)
- The uniprot database (http://www.uniprot.org/) contains information about proteins. The goal is to download information about malGlo and malSym proteins, parse out info about particular regions and map these to the corresponding regions of the genome.
- See sacCer3 tracks Pfam (L), uniProtAnnot (L), uniProtStruct (L)
- Download protein info in XML format malGlo [6], malSym [7]
- Last year's protocol links uniprot proteins to genes from browser annotation via sequence similarity search (blat). Possibly this could be done also by cross-linking information from the databases, but blat is fine.
- https://github.com/fmfi-genomika/genomika-2017/wiki/Uniprot-data
- Last year, Pfam track was created by runing Interproscan tool locally [8]. However, this is time-consuming and uniprot contains pre-computed info about Pfam domains. Therefore it would be better to modify scripts so that they parse Pfam out of uniprot XML files together with other info.
(N) Expression data from RNA-seq (medium/slow, needs A)
- The goal is to display the results of measurement of expression (amount of mRNA) by RNA-seq
- Workflow:
- The original data are reads in fastq format. Some preprocessing can be done (quality trimming etc)
- Reads are aligned to the genome to produce sam/bam file. This is SLOW. The file is then sorted and indexed.
- Bam files can be used in the browser, but they are big. We will report only the number of reads at each position in a wig (wiggle) format.
- Wig files can be loaded to the database but perhaps more efficiently converted to binary bigwig files. The database then contains only reference to bigwig file.
- Data:
- malGlo needs to align reads to the genome.
- Currently recommended aligner is STAR https://github.com/alexdobin/STAR
- It seems that STAR can directly create wig files, read the manual for recommended settings (e.g. the section on small genomes)
- To convert wig to bigwig, use wigToBigWig on genomika
- To load bigwig file, see commands below
- malSym already has bam files for several experiments
- These need to be converted to wig / bigwig
- First use bedtools suite to create bedgraph (see commands below), then convert to bigwig using bedGraphToBigWig (installed on genomika)
- To load bigwig file, see commands below
- Multiple experiments are better combined to a single composite track with individual subtracks
- Subtracks are loaded to db normally, composite tract is noted only in trackdb file, see below
- Last year notes: https://github.com/fmfi-genomika/genomika-2017/wiki/Expression-tracks
- However steps there are mostly not recommended this year
- Useful commands (modify for your situation):
# bam to bedgraph faSize -detailed genome.fa > genome.sizes bedtools genomecov -ibam reads.bam -g genome.size -bga -split > reads.bedgraph # to create track, place bigwig file to appropriate place in /gbdb # then create table with reference to this file: hgsql malXyz1 -e "CREATE TABLE table_name (fileName varchar(255) not null);" hgsql malXyz1 -e "insert into table_name values ('/gbdb/malXyz1/filename.bw');" # in trackDb.ra include something like this: (change 500 to appropriate value at which read depth is clipped) track table_name shortLabel RNA-seq coverage longLabel RNA-seq coverage visibility dense group rna type bigWig 0 500 # composite track from multiple experiments: track track_name compositeTrack on type bigWig 0 200 shortLabel RNA-seq coverage longLabel RNA-seq coverage group rna visibility dense track subtrack_name shortLabel subtrack_label longLabel subtrack_label parent track_name type bigWig 0 250 visibility full maxHeightPixels 80:16:8