2-INF-185 Integrácia dátových zdrojov 2017/18
L09
Program for today: basics of R (applied to biology examples)
- very short intro as a lecture
- tutorial as homework: read a bit of text, try some commands, extend/modify them as requested
In this course we cover several languages popular for scripting in bioinformatics: Perl, Python, R
- their capabilities overlap, many extensions emulate strengths of one in another
- choose a language based on your preference, level of knowledge, existing code for the task, rest of the team
- quickly learn a new language if needed
- languages can also be combined, e.g. preprocess data in Perl or Python, then run statistical analyses in R, and automate the entire pipeline with bash or make
Introduction
- R is an open-source system for statistical computing and data visualization
- Programming language, command-line interface
- Many built-in functions, additional libraries
- For example http://bioconductor.org/ for bioinformatics
- We will concentrate on useful commands rather than language features
Working in R
- Run the command R and type commands in its command-line interface
- it supports command history (up and down arrow keys, Ctrl-R) and completion of command names with the Tab key
> 1+2
[1] 3
- Write a script to a file and run it from the command line: R --vanilla --slave < file.R
- Use RStudio, a graphical IDE [1]
- Windows with editor of R scripts, console, variables, plots
- Ctrl-Enter in editor executes current command in console
x=c(1:10)
plot(x,x*x)
- ?plot displays help for the plot command
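A few more commands of the same kind can be tried in the console. The following is only a minimal sketch using base R functions; the variable names x and y are arbitrary.

```r
# Create a vector of integers 1..10 (same values as c(1:10))
x <- 1:10
# Arithmetic is applied element by element
y <- x * x
# Basic summaries of a vector
length(x)
sum(y)
mean(x)
# Select elements by position or by a condition
y[1:3]
x[x > 7]
```

In the interactive console each expression prints its value; in RStudio the same lines can be executed one by one with Ctrl-Enter.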
Suggested workflow
- work interactively in RStudio or on the command line, try various options
- select useful commands, store in a script
- run script automatically on new data/new versions, potentially as a part of a bigger pipeline
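The last step of this workflow could look roughly like the sketch below: a script that takes the input file name from the command line, so that it can be rerun on new data. The script name summarize.R, the function summarize_table and the tab-separated input format are assumptions made for illustration, not part of the course materials.

```r
# Hypothetical script summarize.R: read a tab-separated table, print column means.
# Assumed usage: Rscript summarize.R input.tsv
summarize_table <- function(path) {
  # header = TRUE: the first line holds column names
  # row.names = 1: the first column holds row names
  data <- read.table(path, header = TRUE, sep = "\t", row.names = 1)
  colMeans(data)
}

# Arguments given after the script name on the command line
args <- commandArgs(trailingOnly = TRUE)
if (length(args) >= 1) {
  print(summarize_table(args[1]))
}
```

Keeping the computation in a function and the argument handling at the end makes it possible to try the function interactively first and only then call the script from a larger pipeline.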
Additional information
- Official tutorial
- Seefeld, Linder: Statistics Using R with Biological Examples (pdf book)
- Patrick Burns: The R Inferno (intricacies of the language)
- Other books
Gene expression data
- Gene expression: DNA->mRNA->protein
- Level of gene expression: Extract mRNA from a cell, measure amounts of mRNA
- Technologies: microarray, RNA-seq
Gene expression data table
- Rows: genes
- Columns: experiments (e.g. different conditions or different individuals)
- Each value is expression of a gene, i.e. relative amount of mRNA for this gene in the sample
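Such a table maps naturally to an R matrix with named rows and columns. The sketch below uses made-up values and hypothetical gene and experiment names, purely to illustrate the layout; it is not taken from any real dataset.

```r
# Toy expression table with made-up values
expr <- matrix(c(0.0,  1.2,  2.5,
                 0.1, -0.3, -1.8),
               nrow = 2, byrow = TRUE,
               dimnames = list(c("geneA", "geneB"),         # rows: genes
                               c("exp1", "exp2", "exp3")))  # columns: experiments
dim(expr)              # number of genes, number of experiments
expr["geneA", ]        # expression of one gene across all experiments
expr[, "exp2"]         # expression of all genes in one experiment
expr["geneB", "exp3"]  # a single value
```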
We will use microarray data for yeast:
- Strassburg, Katrin, et al. "Dynamic transcriptional and metabolic responses in yeast adapting to temperature stress." Omics: a journal of integrative biology 14.3 (2010): 249-259. [2]
- Downloaded from GEO database [3]
- Data already preprocessed: normalization, log2, etc
- We have selected only the cold conditions and only genes with an absolute change of at least 1
- Data: 2738 genes, 8 experiments in a time series; yeast was moved from the normal temperature of 28 degrees C to cold conditions at 10 degrees C, and samples were taken after 0 min, 15 min, 30 min, 1 h, 2 h, 4 h, 8 h and 24 h in the cold
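A filter of this kind could be sketched as follows. The notes do not say exactly how the absolute change was computed, so the definition used here (largest absolute difference from the 0 min sample) is an assumption, and the three genes and their values are made up for illustration.

```r
# Time points of the experiment, used as column names
times <- c("0min", "15min", "30min", "1h", "2h", "4h", "8h", "24h")
# Made-up log2 expression values for three hypothetical genes
expr <- rbind(
  geneA = c(0,  0.2,  0.5,  0.9,  1.4,  1.6,  1.5,  1.2),
  geneB = c(0,  0.1, -0.1,  0.2,  0.1,  0.0, -0.2,  0.1),
  geneC = c(0, -0.4, -0.9, -1.3, -1.1, -0.8, -0.5, -0.2)
)
colnames(expr) <- times
# Assumed definition: largest absolute difference from the 0min sample
# (subtracting a vector from a matrix works row by row here, because
#  the vector has one entry per row)
change <- apply(abs(expr - expr[, "0min"]), 1, max)
# Keep genes with change at least 1; on log2 data this means at least two-fold
selected <- expr[change >= 1, , drop = FALSE]
rownames(selected)
```

With these toy values geneB is dropped, because it never moves more than 0.2 away from its starting level.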