1-DAV-202 Data Management 2023/24
Previously 2-INF-185 Data Source Integration

· Grades from marked homeworks are on the server in file /grades/userid.txt
· Dates of project submission and oral exams:
Early: submit project May 24 9:00am, oral exams May 27 1:00pm (limit 5 students).
Otherwise submit project June 11, 9:00am, oral exams June 18 and 21 (estimated 9:00am-1:00pm, schedule will be published before exam).
Sign up for one of the exam days in AIS before June 11.
Remedial exams will take place in the last week of the exam period. Beware, there will not be much time to prepare a better project. Projects should be submitted as homeworks to /submit/project.
· Cloud homework is due on May 20 9:00am.


Difference between revisions of "Lr1"

From MAD
 
(9 intermediate revisions by 2 users not shown)
Line 1: Line 1:
 
<!-- NOTEX -->
 
<!-- NOTEX -->
[[HWr1]]
+
[[HWr1]] {{Dot}} [https://youtu.be/qHdtopqSiXA Video introduction from an older edition of the course]
 
<!-- /NOTEX -->
 
<!-- /NOTEX -->
  
Program for this lecture: basics of R (applied to biology examples)
+
Program for this lecture: basics of R
* very short intro as a lecture
+
* A very short introduction will be given as a lecture.
* exercises have the form of a tutorial: read a bit of text, try some commands, extend/modify them as requested in individual tasks
+
* Exercises have the form of a tutorial: read a bit of text, try some commands, extend/modify them as requested in individual tasks
  
 
In this course we cover several languages popular for scripting and data processing: Perl, Python, R.
 
In this course we cover several languages popular for scripting and data processing: Perl, Python, R.
Line 11: Line 11:
 
* Choose a language based on your preference, level of knowledge, existing code for the task, the rest of the team.
 
* Choose a language based on your preference, level of knowledge, existing code for the task, the rest of the team.
 
* Quickly learn a new language if needed.
 
* Quickly learn a new language if needed.
* Also possibly combine, e.g. preprocess data in Perl or Python, then run statistical analyses in R, automate entire pipeline with <tt>bash</tt> or <tt>make</tt>.
+
* Also possibly combine, e.g. preprocess data in Perl or Python, then run statistical analyses in R, automate the entire pipeline with <tt>bash</tt> or <tt>make</tt>.
  
 
==Introduction==
 
==Introduction==
* [http://www.r-project.org/ R] is an open-source system for statistical computing and data visualization
+
* [http://www.r-project.org/ R] is an open-source system for statistical computing and data visualization.
 
* Programming language, command-line interface
 
* Programming language, command-line interface
 
* Many built-in functions, additional libraries
 
* Many built-in functions, additional libraries
 
** For example [http://bioconductor.org/ Bioconductor] for bioinformatics
 
** For example [http://bioconductor.org/ Bioconductor] for bioinformatics
* We will concentrate on useful commands rather than language features
+
* We will concentrate on useful commands rather than language features.
  
 
==Working in R==
 
==Working in R==
  
Option 1: Run command R, type commands in a command-line interface
+
'''Option 1:''' Run command R, type commands in a command-line interface
 
* It supports history of commands (arrows, up and down, Ctrl-R) and completing command names with the tab key  
 
* It supports history of commands (arrows, up and down, Ctrl-R) and completing command names with the tab key  
  
Option 2: Write a script to a file, run it from the command-line as follows:<br><tt>R --vanilla --slave < file.R</tt>
+
'''Option 2:''' Write a script to a file, run it from the command-line as follows:<br><tt>R --vanilla --slave < file.R</tt>
  
Option 3: Use <tt>rstudio</tt> command to open a [https://www.rstudio.com/products/RStudio/ graphical IDE]
+
'''Option 3:''' Use <tt>rstudio</tt> command to open a [https://www.rstudio.com/products/RStudio/ graphical IDE]
* Sub-windows with editor of R scripts, console, variables, plots
+
* Sub-windows with editor of R scripts, console, variables, plots.
* Ctrl-Enter in editor executes the current command in console
+
* Ctrl-Enter in editor executes the current command in console.
* You can also install RStudio on your home computer and work there
+
* You can also install RStudio on your home computer and work there.
  
In R, you can create plots. In command-line interface these open as a separate window, in Rstudio they open in one of the sub-windows.
+
'''Option 4:''' If you like Jupyter notebooks, you can use them also to run R code, see for example an explanation of how to enable it in Google Colab [https://towardsdatascience.com/how-to-use-r-in-google-colab-b6e02d736497].
 +
 
 +
In R, you can easily create '''plots'''. In command-line interface these open as a separate window, in Rstudio they open in one of the sub-windows.
 
<syntaxhighlight lang="r">
 
<syntaxhighlight lang="r">
x=c(1:10)
+
x = c(1:10)
plot(x,x*x)
+
plot(x, x * x)
 
</syntaxhighlight>
 
</syntaxhighlight>
  
Suggested workflow
+
'''Suggested workflow'''
* work interactively in Rstudio or on command line, try various options  
+
* Work interactively in Rstudio, notebook or on command line, try various options.
* select useful commands, store in a script
+
* Select useful commands, store in a script.
* run script automatically on new data/new versions, potentially as a part of a bigger pipeline
+
* Run the script automatically on new data/new versions, potentially as a part of a bigger pipeline.
  
 
==Additional information==
 
==Additional information==
Line 51: Line 53:
  
 
==Gene expression data==
 
==Gene expression data==
* DNA molecule contains regions called genes, which "recipes" for making proteins
+
* DNA molecules contain regions called genes, which "recipes" for making proteins.
* Gene expression is the process of creating a protein according to the "recipe"
+
* Gene expression is the process of creating a protein according to the "recipe".
* It works in two stages: first a gene is copied (transcribed) from DNA to RNA, then translated from RNA to protein
+
* It works in two stages: first a gene is copied (transcribed) from DNA to RNA, then translated from RNA to protein.
* Different proteins are created in different quantities and their amount depends on the needs of a cell
+
* Different proteins are created in different quantities and their amount depends on the needs of a cell.
* There are several technologies (microarray, RNA-seq) for measuring the amount of RNA for individual genes, this gives us some measure how active each gene is under given circumstances
+
* There are several technologies (microarray, RNA-seq) for measuring the amount of RNA for individual genes, this gives us some measure how active each gene is under given circumstances.
  
Gene expression data
+
Gene expression data is typically a table with numeric values.
* Rows: genes
+
* Rows represent genes.
* Columns: experiments (e.g. different conditions or different individuals)
+
* Columns represent experiments (e.g. different conditions or different individuals).
* Each value is the expression of a gene, i.e. the relative amount of mRNA for this gene in the sample
+
* Each value is the expression of a gene, i.e. the relative amount of mRNA for one gene in one experiment.
  
 
We will use a data set for yeast:
 
We will use a data set for yeast:
 
* Abbott, Derek A., et al. "[https://academic.oup.com/femsyr/article/7/6/819/533265 Generic and specific transcriptional responses to different weak organic acids in anaerobic chemostat cultures of Saccharomyces cerevisiae.]" FEMS yeast research 7.6 (2007): 819-833.
 
* Abbott, Derek A., et al. "[https://academic.oup.com/femsyr/article/7/6/819/533265 Generic and specific transcriptional responses to different weak organic acids in anaerobic chemostat cultures of Saccharomyces cerevisiae.]" FEMS yeast research 7.6 (2007): 819-833.
 
* Downloaded from the [https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE5926 GEO database]
 
* Downloaded from the [https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE5926 GEO database]
* Data was preprocessed: normalized, converted to logarithmic scale
+
* Data was preprocessed: normalized, converted to logarithmic scale.
* Only 1220 genes with biggest changes in expression are included in our dataset
+
* Only 1220 genes with the biggest changes in expression are included in our dataset.
* 15 experiments were done: 5 conditions, 3 replicate experiments for each condition
+
* The dataset contains gene expression measurements under 5 conditions:
** The first 3 experiments are control, that is, yeast grown in a usual medium
+
** Control: yeast grown in a normal environment.
** In each of the remaining experiments a weak solution of an acid was added to the growing medium to observe how this influences the yeast
+
** 4 different acids added so that cells grow 50% slower (acetic, propionic, sorbic, benzoic).
** We have 3 replicates from 4 different acids
+
* From each condition (reference and each acid) we have 3 replicates, together 15 experiments.
Part of the file (only first 6 experiments and first 3 genes shown), strings 2mic_D_protein, AAC3, AAD15 are identifiers of genes
+
* The goal is to observe how the acids influence the yeast and the activity of its genes.
 +
Part of the file (only the first 4 experiments and first 3 genes shown), strings <tt>2mic_D_protein, AAC3, AAD15</tt> are identifiers of genes.
 
<pre>
 
<pre>
 
,control1,control2,control3,acetate1,acetate2,acetate3,...
 
,control1,control2,control3,acetate1,acetate2,acetate3,...
2mic_D_protein,1.33613934199651,1.13348900359964,1.2726678684356,1.42903234691804,1.38194263856183,1.05754712802093,
+
2mic_D_protein,1.33613934199651,1.13348900359964,1.2726678684356,1.42903234691804,...
AAC3,0.558482767397578,0.608410781454015,0.6663002997292,0.231622581964063,0.588812791760133,0.171617377505217,
+
AAC3,0.558482767397578,0.608410781454015,0.6663002997292,0.231622581964063,...
AAD15,-0.927871996497105,-1.04072379902481,-1.01885986692013,-2.62459941525552,-1.76155026099627,-1.61661288118871,
+
AAD15,-0.927871996497105,-1.04072379902481,-1.01885986692013,-2.62459941525552,...
 
</pre>
 
</pre>

Latest revision as of 14:51, 3 April 2023

HWr1 · Video introduction from an older edition of the course

Program for this lecture: basics of R

  • A very short introduction will be given as a lecture.
  • Exercises have the form of a tutorial: read a bit of text, try some commands, and extend or modify them as requested in individual tasks.

In this course we cover several languages popular for scripting and data processing: Perl, Python, R.

  • Their capabilities overlap; many extensions emulate the strengths of one language in another.
  • Choose a language based on your preference, level of knowledge, existing code for the task, the rest of the team.
  • Quickly learn a new language if needed.
  • You can also combine them, e.g. preprocess data in Perl or Python, then run statistical analyses in R, and automate the entire pipeline with bash or make.

Introduction

  • R is an open-source system for statistical computing and data visualization.
  • Programming language, command-line interface
  • Many built-in functions, additional libraries
  • We will concentrate on useful commands rather than language features.

Working in R

Option 1: Run the command R and type commands in its command-line interface.

  • It supports a history of commands (up and down arrows, Ctrl-R) and completion of command names with the Tab key.

Option 2: Write a script to a file and run it from the command line as follows:
R --vanilla --slave < file.R

Option 3: Use the rstudio command to open a graphical IDE.

  • It has sub-windows with an editor of R scripts, a console, variables and plots.
  • Ctrl-Enter in the editor executes the current command in the console.
  • You can also install RStudio on your home computer and work there.

Option 4: If you like Jupyter notebooks, you can also use them to run R code; see for example an explanation of how to enable it in Google Colab [1].

In R, you can easily create plots. In the command-line interface they open in a separate window; in RStudio they open in one of the sub-windows.

x = c(1:10)
plot(x, x * x)
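
When a script runs without a display, it can be convenient to write the plot to a file instead. The following is a minimal sketch using the standard pdf() graphics device; the output file name squares.pdf and the axis labels are illustrative choices, not part of the course materials.

```r
# Draw the same plot into a PDF file instead of a window.
x = c(1:10)
out_file = file.path(tempdir(), "squares.pdf")  # hypothetical output location
pdf(out_file)                                   # open a PDF graphics device
plot(x, x * x, xlab = "x", ylab = "x squared")
invisible(dev.off())                            # close the device so the file is complete
```

Replace out_file with any path you prefer; the plot commands between pdf() and dev.off() stay the same as in interactive work.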

Suggested workflow

  • Work interactively in RStudio, in a notebook, or on the command line, and try various options.
  • Select useful commands and store them in a script.
  • Run the script automatically on new data/new versions, potentially as a part of a bigger pipeline.
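
The last step of this workflow might look like the sketch below: a script that wraps the selected commands in a function and takes the input file from the command line. The script name summarize.R, the function column_means and the choice of computing column means are all hypothetical; the sketch assumes a comma-separated table whose first column holds row names.

```r
# summarize.R (hypothetical name): compute the mean of each column of a table.

# Read a CSV file whose first column contains row names
# and return the mean of every remaining (numeric) column.
column_means = function(path) {
  tab = read.csv(path, row.names = 1)
  colMeans(tab)
}

# Arguments given after --args on the command line, if any.
args = commandArgs(trailingOnly = TRUE)
if (length(args) >= 1 && file.exists(args[1])) {
  print(column_means(args[1]))
}
```

With the command form used above, such a script could be run on a new file as R --vanilla --slave --args newdata.csv < summarize.R, where newdata.csv stands for your own input.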

Additional information

Gene expression data

  • DNA molecules contain regions called genes, which are "recipes" for making proteins.
  • Gene expression is the process of creating a protein according to the "recipe".
  • It works in two stages: first a gene is copied (transcribed) from DNA to RNA, then translated from RNA to protein.
  • Different proteins are created in different quantities and their amount depends on the needs of a cell.
  • There are several technologies (microarray, RNA-seq) for measuring the amount of RNA for individual genes; this gives us some measure of how active each gene is under given circumstances.

Gene expression data is typically a table with numeric values.

  • Rows represent genes.
  • Columns represent experiments (e.g. different conditions or different individuals).
  • Each value is the expression of a gene, i.e. the relative amount of mRNA for one gene in one experiment.

We will use a data set for yeast:

  • Abbott, Derek A., et al. "Generic and specific transcriptional responses to different weak organic acids in anaerobic chemostat cultures of Saccharomyces cerevisiae." FEMS yeast research 7.6 (2007): 819-833.
  • The data was downloaded from the GEO database.
  • Data was preprocessed: normalized, converted to logarithmic scale.
  • Only 1220 genes with the biggest changes in expression are included in our dataset.
  • The dataset contains gene expression measurements under 5 conditions:
    • Control: yeast grown in a normal environment.
    • 4 different acids added so that cells grow 50% slower (acetic, propionic, sorbic, benzoic).
  • From each condition (control and each acid) we have 3 replicates, 15 experiments in total.
  • The goal is to observe how the acids influence the yeast and the activity of its genes.
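
The experimental design described above can be written down in R as a factor with one label per experiment. This is only a sketch: it assumes that the 3 replicates of each condition are adjacent columns and that the conditions appear in the order listed here, which the file excerpt confirms only for the control and acetate columns.

```r
# One label per experiment: 5 conditions, 3 replicates each.
conditions = c("control", "acetic", "propionic", "sorbic", "benzoic")
design = factor(rep(conditions, each = 3), levels = conditions)

# Number of experiments per condition (3 for each of the 5 levels).
table(design)

# Column indices belonging to each condition, under the ordering assumption above.
columns_by_condition = split(seq_along(design), design)
columns_by_condition$acetic   # 4 5 6
```

Such a list of column indices can later be used to select all replicates of one condition from the expression table.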

Part of the file (only the first 4 experiments and the first 3 genes are shown; the header lists the first 6 column names). The strings 2mic_D_protein, AAC3 and AAD15 are identifiers of genes.

,control1,control2,control3,acetate1,acetate2,acetate3,...
2mic_D_protein,1.33613934199651,1.13348900359964,1.2726678684356,1.42903234691804,...
AAC3,0.558482767397578,0.608410781454015,0.6663002997292,0.231622581964063,...
AAD15,-0.927871996497105,-1.04072379902481,-1.01885986692013,-2.62459941525552,...
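
As a small illustration of this format, the sketch below loads only the values visible in the excerpt (3 genes, 4 experiments) from a string and compares the first acetate replicate with the mean of the controls. The variable names are illustrative; with the real file you would pass its name to read.csv instead of the text argument.

```r
# The part of the table shown above, restricted to the 4 visible columns.
excerpt = ",control1,control2,control3,acetate1
2mic_D_protein,1.33613934199651,1.13348900359964,1.2726678684356,1.42903234691804
AAC3,0.558482767397578,0.608410781454015,0.6663002997292,0.231622581964063
AAD15,-0.927871996497105,-1.04072379902481,-1.01885986692013,-2.62459941525552"

# The first column holds gene identifiers, so use it as row names.
a = as.matrix(read.csv(text = excerpt, row.names = 1))
dim(a)                    # 3 genes, 4 experiments

# Expression of one gene in one experiment.
a["AAD15", "acetate1"]

# Values are on a logarithmic scale, so a difference corresponds to a log ratio.
control_mean = rowMeans(a[, c("control1", "control2", "control3")])
change = a[, "acetate1"] - control_mean
change
```

A negative value in change means that the gene was less expressed in the first acetate replicate than in the average control.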