1-DAV-202 Data Management 2024/25



Difference between revisions of "Lr1"

Line 51: Line 51:
 
 ==Gene expression data==
-* Gene expression: DNA -> mRNA -> protein
-* Level of gene expression: Extract mRNA from cells, measure amounts of mRNA
-* Technologies: microarray, RNA-seq
+* A DNA molecule contains regions called genes, which are "recipes" for making proteins
+* Gene expression is the process of creating a protein according to the "recipe"
+* It works in two stages: first a gene is copied (transcribed) from DNA to RNA, then translated from RNA to protein
+* Different proteins are created in different quantities and their amount depends on the needs of a cell
+* There are several technologies (microarray, RNA-seq) for measuring the amount of RNA for individual genes; this gives us some measure of how active each gene is under the given circumstances
 
 Gene expression data
Line 60: Line 62:
 * Each value is the expression of a gene, i.e. the relative amount of mRNA for this gene in the sample
 
-We will use microarray data for yeast:
+We will use a data set for yeast:
 * Abbott, Derek A., et al. "[https://academic.oup.com/femsyr/article/7/6/819/533265 Generic and specific transcriptional responses to different weak organic acids in anaerobic chemostat cultures of Saccharomyces cerevisiae.]" FEMS Yeast Research 7.6 (2007): 819-833.
 * Downloaded from the [https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE5926 GEO database]
-* Data already preprocessed: normalization, etc, we will apply logarithmic scale
-* Data: 6398 genes, 15 experiments: 5 conditions, 3 replicate experiments for each condition
+* Data was preprocessed: normalized, converted to logarithmic scale
+* Only the 1220 genes with the biggest changes in expression are included in our data set
+* 15 experiments were done: 5 conditions, 3 replicate experiments for each condition
 ** The first 3 experiments are control, that is, yeast grown in a usual medium
 ** In each of the remaining experiments a weak solution of an acid was added to the growing medium to observe how this influences the yeast
 ** We have 3 replicates for each of 4 different acids
-** Columns 1,2,3 are control, columns 4,5,6 acetic acid, 7,8,9 benzoate acid, 10,11,12 propionate acid, and 13,14,15 sorbate acid
-
-Read the microarray data, transform it to log scale, then work with table ''a'':
-<syntaxhighlight lang="r">
-input=read.table("/tasks/r1/acids.tsv", header=TRUE, row.names=1)
-a = log(input)
-</syntaxhighlight>
+Part of the file (only the first 6 experiments and first 3 genes are shown); the strings 2mic_D_protein, AAC3 and AAD15 are gene identifiers
+<pre>
+,control1,control2,control3,acetate1,acetate2,acetate3,...
+2mic_D_protein,1.33613934199651,1.13348900359964,1.2726678684356,1.42903234691804,1.38194263856183,1.05754712802093,
+AAC3,0.558482767397578,0.608410781454015,0.6663002997292,0.231622581964063,0.588812791760133,0.171617377505217,
+AAD15,-0.927871996497105,-1.04072379902481,-1.01885986692013,-2.62459941525552,-1.76155026099627,-1.61661288118871,
+</pre>

Revision as of 16:39, 15 April 2020

HWr1

Program for this lecture: basics of R (applied to biology examples)

  • very short intro as a lecture
  • exercises have the form of a tutorial: read a bit of text, try some commands, extend/modify them as requested in individual tasks

In this course we cover several languages popular for scripting and data processing: Perl, Python, R.

  • Their capabilities overlap; many extensions emulate the strengths of one language in another.
  • Choose a language based on your preference, your level of knowledge, existing code for the task, and what the rest of the team uses.
  • Quickly learn a new language if needed.
  • You can also combine languages, e.g. preprocess data in Perl or Python, run statistical analyses in R, and automate the entire pipeline with bash or make.

Introduction

  • R is an open-source system for statistical computing and data visualization
  • Programming language, command-line interface
  • Many built-in functions, additional libraries
  • We will concentrate on useful commands rather than language features

Working in R

Option 1: Run the command R and type commands in its command-line interface

  • It supports command history (up and down arrows, Ctrl-R) and completion of command names with the Tab key

Option 2: Write a script to a file and run it from the command line as follows:
R --vanilla --slave < file.R
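
As a rough illustration, file.R could contain a few commands like the ones below; the numbers are made up for this sketch. In the command above, --vanilla starts R without restoring or saving a workspace and without user startup files, and --slave suppresses the startup banner and the echo of commands, so only the output of cat appears.

```r
# file.R: a minimal script meant to be run non-interactively.
# The values below are made up for illustration.
x = c(3, 1, 4, 1, 5, 9, 2, 6)

# cat prints plain text without the [1] prefix that print adds
cat("n =", length(x), "\n")
cat("mean =", mean(x), "\n")
cat("median =", median(x), "\n")
```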

Option 3: Use the rstudio command to open a graphical IDE

  • Sub-windows with an editor for R scripts, a console, variables, and plots
  • Ctrl-Enter in the editor executes the current command in the console
  • You can also install RStudio on your home computer and work there

In R, you can create plots. In the command-line interface they open in a separate window; in RStudio they open in one of the sub-windows.

x = 1:10        # the integers 1, 2, ..., 10
plot(x, x*x)    # plot each x against its square
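
When a script runs non-interactively there is no window to draw in, so one common approach is to open a file device explicitly. The sketch below uses the built-in pdf device; the file name squares.pdf is a placeholder chosen for this illustration.

```r
x = 1:10

# Open a PDF file as the plotting device (width and height in inches).
# "squares.pdf" is a hypothetical output name.
pdf("squares.pdf", width = 6, height = 4)

plot(x, x * x, xlab = "x", ylab = "x squared", main = "Squares of 1..10")

# Close the device so that the file is written completely.
dev.off()
```

The same plot call works unchanged in the interactive setting; only the pdf and dev.off lines are specific to writing a file.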

Suggested workflow

  • work interactively in RStudio or on the command line, trying various options
  • select the useful commands and store them in a script
  • run the script automatically on new data or new versions, potentially as part of a bigger pipeline
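
One possible way to follow this workflow is to wrap the commands that proved useful in a function, so that the same steps can be rerun on a new input file. The function name column_summary and the demo table are made up for this sketch; it assumes a comma-separated file with row names in the first column.

```r
# Commands found useful interactively, wrapped in a function for reuse.
# column_summary is a hypothetical helper, not part of the course materials.
column_summary = function(path) {
  a = read.csv(path, row.names = 1)
  # one row per column of the input: its mean and standard deviation
  data.frame(mean = colMeans(a), sd = apply(a, 2, sd))
}

# Demo on a small made-up table written to a temporary file.
demo = data.frame(exp1 = c(1, 2, 3), exp2 = c(2, 4, 6),
                  row.names = c("geneA", "geneB", "geneC"))
path = tempfile(fileext = ".csv")
write.csv(demo, path)

result = column_summary(path)
print(result)
```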

Additional information

Gene expression data

  • A DNA molecule contains regions called genes, which are "recipes" for making proteins
  • Gene expression is the process of creating a protein according to the "recipe"
  • It works in two stages: first a gene is copied (transcribed) from DNA to RNA, then translated from RNA to protein
  • Different proteins are created in different quantities and their amount depends on the needs of a cell
  • There are several technologies (microarray, RNA-seq) for measuring the amount of RNA for individual genes; this gives us some measure of how active each gene is under the given circumstances

Gene expression data

  • Rows: genes
  • Columns: experiments (e.g. different conditions or different individuals)
  • Each value is the expression of a gene, i.e. the relative amount of mRNA for this gene in the sample
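
The layout described above can be sketched as a small matrix with named rows and columns. All gene names, experiment names and values below are made up for illustration.

```r
# Made-up expression values: 3 genes (rows) x 4 experiments (columns).
expr = matrix(c( 1.2,  1.1,  0.3,  0.4,
                 0.5,  0.6,  0.5,  0.7,
                -1.0, -0.9, -2.1, -2.0),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("geneA", "geneB", "geneC"),
                              c("control1", "control2",
                                "treated1", "treated2")))

dim(expr)                   # number of genes, number of experiments
expr["geneA", ]             # one gene across all experiments
expr[, "treated1"]          # all genes in one experiment
expr["geneC", "control2"]   # a single expression value
```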

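We will use a data set for yeast:

  • Abbott, Derek A., et al. "Generic and specific transcriptional responses to different weak organic acids in anaerobic chemostat cultures of Saccharomyces cerevisiae." FEMS Yeast Research 7.6 (2007): 819-833. https://academic.oup.com/femsyr/article/7/6/819/533265
  • Downloaded from the GEO database (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE5926)
  • Data was preprocessed: normalized, converted to logarithmic scale
  • Only the 1220 genes with the biggest changes in expression are included in our data set
  • 15 experiments were done: 5 conditions, 3 replicate experiments for each condition
      • The first 3 experiments are control, that is, yeast grown in a usual medium
      • In each of the remaining experiments a weak solution of an acid was added to the growing medium to observe how this influences the yeast
      • We have 3 replicates for each of 4 different acids

Part of the file (only the first 6 experiments and first 3 genes are shown); the strings 2mic_D_protein, AAC3 and AAD15 are gene identifiers: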

,control1,control2,control3,acetate1,acetate2,acetate3,...
2mic_D_protein,1.33613934199651,1.13348900359964,1.2726678684356,1.42903234691804,1.38194263856183,1.05754712802093,
AAC3,0.558482767397578,0.608410781454015,0.6663002997292,0.231622581964063,0.588812791760133,0.171617377505217,
AAD15,-0.927871996497105,-1.04072379902481,-1.01885986692013,-2.62459941525552,-1.76155026099627,-1.61661288118871,
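
A minimal sketch of loading such a table in R and comparing two conditions. The name and location of the actual file are not given here, so the excerpt above is embedded as a string (without the elided columns); with the real file, the same read.csv call would take its path instead. The sketch assumes the values are already on a logarithmic scale, as the data description states.

```r
# The 3 genes and 6 experiments shown in the excerpt; the real file
# has 15 experiment columns.
txt = ",control1,control2,control3,acetate1,acetate2,acetate3
2mic_D_protein,1.33613934199651,1.13348900359964,1.2726678684356,1.42903234691804,1.38194263856183,1.05754712802093
AAC3,0.558482767397578,0.608410781454015,0.6663002997292,0.231622581964063,0.588812791760133,0.171617377505217
AAD15,-0.927871996497105,-1.04072379902481,-1.01885986692013,-2.62459941525552,-1.76155026099627,-1.61661288118871"

# Gene identifiers in the first column become row names.
a = read.csv(text = txt, row.names = 1)

dim(a)        # genes (rows) and experiments (columns)
rownames(a)   # gene identifiers
colnames(a)   # experiment names

# Mean value per gene over the three replicates of each condition.
control_mean = rowMeans(a[, 1:3])
acetate_mean = rowMeans(a[, 4:6])

# On a logarithmic scale, a difference of means corresponds to a log
# ratio: negative values suggest lower expression with acetate.
acetate_mean - control_mean
```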