1-DAV-202 Data Management 2023/24
Previously 2-INF-185 Data Source Integration

· Grades from marked homeworks are on the server in file /grades/userid.txt
· Dates of project submission and oral exams:
Early: submit project May 24 9:00am, oral exams May 27 1:00pm (limit 5 students).
Otherwise submit project June 11, 9:00am, oral exams June 18 and 21 (estimated 9:00am-1:00pm, schedule will be published before exam).
Sign up for one of the exam days in AIS before June 11.
Remedial exams will take place in the last week of the exam period. Beware, there will not be much time to prepare a better project. Projects should be submitted as homeworks to /submit/project.
· Cloud homework is due on May 20 9:00am.


Difference between revisions of "Lr1"

Line 51: Line 51:
  ==Gene expression data==
- * Gene expression: DNA -> mRNA -> protein
- * Level of gene expression: Extract mRNA from cells, measure amounts of mRNA
- * Technologies: microarray, RNA-seq
+ * DNA molecule contains regions called genes, which are "recipes" for making proteins
+ * Gene expression is the process of creating a protein according to the "recipe"
+ * It works in two stages: first a gene is copied (transcribed) from DNA to RNA, then translated from RNA to protein
+ * Different proteins are created in different quantities and their amount depends on the needs of a cell
+ * There are several technologies (microarray, RNA-seq) for measuring the amount of RNA for individual genes; this gives us a measure of how active each gene is under given circumstances
  
  Gene expression data
Line 60: Line 62:
  * Each value is the expression of a gene, i.e. the relative amount of mRNA for this gene in the sample
  
- We will use microarray data for yeast:
+ We will use a data set for yeast:
  * Abbott, Derek A., et al. "[https://academic.oup.com/femsyr/article/7/6/819/533265 Generic and specific transcriptional responses to different weak organic acids in anaerobic chemostat cultures of Saccharomyces cerevisiae.]" FEMS yeast research 7.6 (2007): 819-833.
  * Downloaded from the [https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE5926 GEO database]
- * Data already preprocessed: normalization, etc, we will apply logarithmic scale
- * Data: 6398 genes, 15 experiments: 5 conditions, 3 replicate experiments for each condition
+ * Data was preprocessed: normalized, converted to logarithmic scale
+ * Only the 1220 genes with the biggest changes in expression are included in our dataset
+ * 15 experiments were done: 5 conditions, 3 replicate experiments for each condition
  ** The first 3 experiments are control, that is, yeast grown in a usual medium
  ** In each of the remaining experiments a weak solution of an acid was added to the growing medium to observe how this influences the yeast
  ** We have 3 replicates from 4 different acids
- ** Columns 1,2,3 are control, columns 4,5,6 acetic acid, 7,8,9 benzoic acid, 10,11,12 propionic acid, and 13,14,15 sorbic acid
- 
- Read the microarray data, transform it to log scale, then work with table ''a'':
- <syntaxhighlight lang="r">
- input=read.table("/tasks/r1/acids.tsv", header=TRUE, row.names=1)
- a = log(input)
- </syntaxhighlight>
+ Part of the file (only the first 6 experiments and first 3 genes are shown); the strings 2mic_D_protein, AAC3 and AAD15 are gene identifiers
+ 
+ <pre>
+ ,control1,control2,control3,acetate1,acetate2,acetate3,...
+ 2mic_D_protein,1.33613934199651,1.13348900359964,1.2726678684356,1.42903234691804,1.38194263856183,1.05754712802093,
+ AAC3,0.558482767397578,0.608410781454015,0.6663002997292,0.231622581964063,0.588812791760133,0.171617377505217,
+ AAD15,-0.927871996497105,-1.04072379902481,-1.01885986692013,-2.62459941525552,-1.76155026099627,-1.61661288118871,
+ </pre>

Revision as of 16:39, 15 April 2020

HWr1

Program for this lecture: basics of R (applied to biology examples)

  • very short intro as a lecture
  • exercises have the form of a tutorial: read a bit of text, try some commands, extend/modify them as requested in individual tasks

In this course we cover several languages popular for scripting and data processing: Perl, Python, R.

  • Their capabilities overlap; many extensions emulate the strengths of one language in another.
  • Choose a language based on your preference, level of knowledge, existing code for the task, and the rest of the team.
  • Quickly learn a new language if needed.
  • You can also combine them, e.g. preprocess data in Perl or Python, then run statistical analyses in R, and automate the entire pipeline with bash or make.

Introduction

  • R is an open-source system for statistical computing and data visualization
  • Programming language, command-line interface
  • Many built-in functions, additional libraries
  • We will concentrate on useful commands rather than language features

Working in R

Option 1: Run the command R and type commands in its command-line interface

  • It supports command history (up and down arrows, Ctrl-R) and completion of command names with the Tab key

Option 2: Write a script to a file, run it from the command-line as follows:
R --vanilla --slave < file.R

Option 3: Run the rstudio command to open a graphical IDE (RStudio)

  • Sub-windows with editor of R scripts, console, variables, plots
  • Ctrl-Enter in editor executes the current command in console
  • You can also install RStudio on your home computer and work there

In R, you can create plots. In the command-line interface they open in a separate window; in RStudio they open in one of the sub-windows.

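# integers 1 to 10 (1:10 is already a vector, so c() is not needed)
x = 1:10
# scatter plot of x against its square
plot(x, x * x)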
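
When a script runs non-interactively (Option 2 above), no plot window is shown; by default R then writes plots to a file named Rplots.pdf. A minimal sketch of sending the same plot to a file of your choice is given below. The file name squares.pdf and the axis labels are only illustrative assumptions.

```r
# Hypothetical output file; tempdir() keeps the example self-contained
out_file = file.path(tempdir(), "squares.pdf")

x = 1:10
pdf(out_file)   # open a PDF graphics device writing to out_file
plot(x, x * x, xlab = "x", ylab = "x squared", main = "Squares")
dev.off()       # close the device so that the file is written completely
```

Other base R devices such as png() are used the same way, but they may not be available on every installation.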

Suggested workflow

  • work interactively in RStudio or on the command line, try various options
  • select useful commands, store in a script
  • run script automatically on new data/new versions, potentially as a part of a bigger pipeline
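
The last step of this workflow could look roughly like the sketch below: a script that takes the input and output file names from the command line, so that it can be rerun on new data. The function name summarize_table, the choice of column means as the summary, and the file names in the usage note are hypothetical.

```r
# Read a table with row names in the first column and write the mean
# of each column to a tab-separated file (hypothetical example task)
summarize_table = function(in_file, out_file) {
  data = read.table(in_file, header = TRUE, row.names = 1)
  means = colMeans(data)   # one mean per column (experiment)
  write.table(data.frame(column = names(means), mean = means),
              out_file, sep = "\t", quote = FALSE, row.names = FALSE)
  invisible(means)
}

# Arguments given after the script name on the command line
args = commandArgs(trailingOnly = TRUE)
if (length(args) == 2) {
  summarize_table(args[1], args[2])
}
```

One way to run it is Rscript script.R input.tsv output.tsv; a call like this can then be placed in a bash script or a Makefile.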

Additional information

Gene expression data

  • DNA molecule contains regions called genes, which are "recipes" for making proteins
  • Gene expression is the process of creating a protein according to the "recipe"
  • It works in two stages: first a gene is copied (transcribed) from DNA to RNA, then translated from RNA to protein
  • Different proteins are created in different quantities and their amount depends on the needs of a cell
  • There are several technologies (microarray, RNA-seq) for measuring the amount of RNA for individual genes; this gives us a measure of how active each gene is under given circumstances

Gene expression data

  • Rows: genes
  • Columns: experiments (e.g. different conditions or different individuals)
  • Each value is the expression of a gene, i.e. the relative amount of mRNA for this gene in the sample

We will use a data set for yeast:

  • Abbott, Derek A., et al. "Generic and specific transcriptional responses to different weak organic acids in anaerobic chemostat cultures of Saccharomyces cerevisiae." FEMS yeast research 7.6 (2007): 819-833. (https://academic.oup.com/femsyr/article/7/6/819/533265)
  • Downloaded from the GEO database (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE5926)
  • Data was preprocessed: normalized, converted to logarithmic scale
  • Only the 1220 genes with the biggest changes in expression are included in our dataset
  • 15 experiments were done: 5 conditions, 3 replicate experiments for each condition
      • The first 3 experiments are control, that is, yeast grown in a usual medium
      • In each of the remaining experiments a weak solution of an acid was added to the growing medium to observe how this influences the yeast
      • We have 3 replicates from 4 different acids

Part of the file (only the first 6 experiments and first 3 genes are shown); the strings 2mic_D_protein, AAC3 and AAD15 are gene identifiers:

,control1,control2,control3,acetate1,acetate2,acetate3,...
2mic_D_protein,1.33613934199651,1.13348900359964,1.2726678684356,1.42903234691804,1.38194263856183,1.05754712802093,
AAC3,0.558482767397578,0.608410781454015,0.6663002997292,0.231622581964063,0.588812791760133,0.171617377505217,
AAD15,-0.927871996497105,-1.04072379902481,-1.01885986692013,-2.62459941525552,-1.76155026099627,-1.61661288118871,
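
To make the table layout concrete, the sketch below loads only the excerpt shown above (not the full data file) and compares the acetate replicates with the controls. The trailing commas and the "..." are dropped so that all lines have the same number of fields; the variable names are illustrative assumptions.

```r
# The three genes and six experiments from the excerpt above
lines = c(
  ",control1,control2,control3,acetate1,acetate2,acetate3",
  "2mic_D_protein,1.33613934199651,1.13348900359964,1.2726678684356,1.42903234691804,1.38194263856183,1.05754712802093",
  "AAC3,0.558482767397578,0.608410781454015,0.6663002997292,0.231622581964063,0.588812791760133,0.171617377505217",
  "AAD15,-0.927871996497105,-1.04072379902481,-1.01885986692013,-2.62459941525552,-1.76155026099627,-1.61661288118871"
)
# The first column holds gene identifiers, which become row names
expr = read.csv(text = paste(lines, collapse = "\n"), row.names = 1)

dim(expr)            # number of genes (rows) and experiments (columns)
expr["AAC3", ]       # all experiments for one gene
expr[, "control1"]   # all genes in one experiment

# Mean over the three replicates of each condition, separately for each gene
control_mean = rowMeans(expr[, c("control1", "control2", "control3")])
acetate_mean = rowMeans(expr[, c("acetate1", "acetate2", "acetate3")])
# A negative value means lower expression with acetate than in the control
change = acetate_mean - control_mean
```

Since the values are described as being on a logarithmic scale, a difference of means corresponds to a log ratio of expression levels rather than to an absolute change.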