1-DAV-202 Data Management 2023/24
Previously 2-INF-185 Data Source Integration
Lr1
Program for this lecture: basics of R (applied to biology examples)
- very short intro as a lecture
- exercises have the form of a tutorial: read a bit of text, try some commands, extend/modify them as requested in individual tasks
In this course we cover several languages popular for scripting and data processing: Perl, Python, R.
- Their capabilities overlap, many extensions emulate strengths of one in another.
- Choose a language based on your preference, level of knowledge, existing code for the task, the rest of the team.
- Quickly learn a new language if needed.
- You can also combine languages, e.g. preprocess data in Perl or Python, run statistical analyses in R, and automate the entire pipeline with bash or make.
Introduction
- R is an open-source system for statistical computing and data visualization
- Programming language, command-line interface
- Many built-in functions, additional libraries
- For example Bioconductor for bioinformatics
- We will concentrate on useful commands rather than language features
Working in R
Option 1: Run command R, type commands in a command-line interface
- It supports command history (up and down arrows, Ctrl-R) and completion of command names with the Tab key
Option 2: Write a script to a file, run it from the command line as follows:
R --vanilla --slave < file.R
Option 3: Use rstudio command to open a graphical IDE
- Sub-windows with editor of R scripts, console, variables, plots
- Ctrl-Enter in editor executes the current command in console
- You can also install RStudio on your home computer and work there
In R, you can create plots. In the command-line interface they open in a separate window, in RStudio they open in one of the sub-windows.
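# vector of integers 1..10 (the c() wrapper is not needed here)
x = 1:10
# scatter plot of x against x squared
plot(x, x*x)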
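When a script runs non-interactively, there is no window in which to show the plot. One common approach is to send the plot to a file using the standard pdf() device. The following is a minimal sketch; the file name and plot labels are arbitrary choices for this example.

```r
# Location of the output file; the name is an assumption made for this example.
plot_file = file.path(tempdir(), "squares.pdf")

x = 1:10
# Open a PDF graphics device (width and height are in inches)
pdf(plot_file, width=6, height=4)
plot(x, x*x, xlab="x", ylab="x*x", main="Squares of 1..10")
# Close the device so that the file is completely written
invisible(dev.off())
```

Without an explicit device, R in non-interactive mode typically writes plots to a file named Rplots.pdf in the current directory.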
Suggested workflow
- work interactively in RStudio or on the command line, try various options
- select useful commands, store in a script
- run script automatically on new data/new versions, potentially as a part of a bigger pipeline
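The last two steps of this workflow could look roughly like the script below. It is only a sketch: the computation is a placeholder, and the output file location is an assumption (a temporary file is used to keep the example self-contained). Such a script can be saved as file.R and run with the command shown for Option 2.

```r
# Output location; in a real pipeline this would be a fixed, meaningful path.
out_file = tempfile(fileext=".tsv")

# Commands selected from an interactive session
x = 1:10
squares = data.frame(x=x, y=x*x)

# Print a short summary to standard output
print(summary(squares$y))

# Store the result as a tab-separated file for later steps of the pipeline
write.table(squares, file=out_file, sep="\t", quote=FALSE, row.names=FALSE)
cat("Table written to", out_file, "\n")
```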
Additional information
- Official tutorial
- Seefeld, Linder: Statistics Using R with Biological Examples (pdf book)
- Patrick Burns: The R Inferno (intricacies of the language)
- Other books
- Built-in help: ?plot displays the help page for the plot function
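Several ways of using the built-in help are sketched below. All of them are standard R commands; the topics (plot, histogram, mean) are chosen only as examples.

```r
?plot              # help page for the plot function
h = help("plot")   # the same request written as an ordinary function call
h                  # printing the result displays the help page
??histogram        # search installed help pages for a word
example(mean)      # run the examples from the help page of mean
```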
Gene expression data
- Gene expression: DNA -> mRNA -> protein
- Measuring the level of gene expression: extract mRNA from cells, then measure the amount of mRNA for each gene
- Technologies: microarray, RNA-seq
Gene expression data is stored as a table:
- Rows: genes
- Columns: experiments (e.g. different conditions or different individuals)
- Each value is the expression of a gene, i.e. the relative amount of mRNA for this gene in the sample
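This layout can be illustrated in R with a tiny table. The sketch below uses made-up numbers, not real measurements, and hypothetical gene and experiment names.

```r
# Toy expression table with invented values:
# 3 hypothetical genes (rows) in 4 hypothetical experiments (columns).
toy = matrix(c(120, 130,  60,  55,
                15,  14,  80,  90,
               300, 310, 290, 305),
             nrow=3, byrow=TRUE,
             dimnames=list(c("geneA", "geneB", "geneC"),
                           c("ctrl1", "ctrl2", "treat1", "treat2")))

dim(toy)                # number of genes and number of experiments
toy["geneB", ]          # one gene across all experiments
toy[, "treat1"]         # all genes in one experiment
toy["geneA", "ctrl2"]   # a single expression value
```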
We will use microarray data for yeast:
- Abbott, Derek A., et al. "Generic and specific transcriptional responses to different weak organic acids in anaerobic chemostat cultures of Saccharomyces cerevisiae." FEMS yeast research 7.6 (2007): 819-833.
- Downloaded from the GEO database
- The data are already preprocessed (normalization etc.); we will only transform them to a logarithmic scale
- Data: 6398 genes, 15 experiments: 5 conditions, 3 replicate experiments for each condition
- The first 3 experiments are controls, that is, yeast grown in the usual medium
- In each of the remaining experiments, a weak solution of an acid was added to the growth medium to observe how this influences the yeast
- We have 3 replicates for each of 4 different acids
- Columns 1,2,3 are control, columns 4,5,6 acetic acid, 7,8,9 benzoic acid, 10,11,12 propionic acid, and 13,14,15 sorbic acid
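The column layout described above can be written down in R so that conditions are selected by name rather than by number. This is only a sketch; the group labels are chosen for this example and need not match the column names in the data file.

```r
# Column indices of each condition, following the layout described above.
# The labels are hypothetical names used only in this example.
groups = list(control=1:3, acetic=4:6, benzoic=7:9,
              propionic=10:12, sorbic=13:15)

# Equivalent representation: one condition label per column
condition = factor(rep(names(groups), each=3), levels=names(groups))

table(condition)                # number of replicates per condition
which(condition == "benzoic")   # columns belonging to one condition
```

With a table a of 15 columns, a[, groups$control] would then select the three control experiments.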
Read the microarray data, transform it to log scale, then work with table a:
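# read the table; the first line contains column names, the first column contains gene names
input = read.table("/tasks/r1/acids.tsv", header=TRUE, row.names=1)
# natural logarithm of every value in the table
a = log(input)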
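The same steps can be tried without the course data file. The sketch below reads a tiny made-up table from a text string; all gene names, column names and values are invented, and the variables are prefixed with toy_ to avoid overwriting input and a.

```r
# A made-up table in the same layout: header line, gene names in the first column
toy_text = "gene\tcontrol1\tcontrol2\tacid1\tacid2
g1\t100\t110\t400\t390
g2\t50\t45\t52\t48
g3\t8\t10\t2\t3"

toy_input = read.table(text=toy_text, header=TRUE, row.names=1)
toy_a = log(toy_input)   # natural logarithm, applied to each value

dim(toy_a)               # 3 genes, 4 experiments
head(toy_a)              # first rows of the log-scale table

# On the log scale, a difference corresponds to a ratio of the original values:
# here log(400) - log(100) = log(4)
toy_a["g1", "acid1"] - toy_a["g1", "control1"]
```

One reason to work on the log scale is that a two-fold increase and a two-fold decrease become differences of the same size with opposite signs.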