1-DAV-202 Data Management 2024/25

Materials · Introduction · Rules · Contact
· Please fill in the following survey


Difference between revisions of "Lr1"

From MAD
Jump to navigation Jump to search
 
(5 intermediate revisions by the same user not shown)
Line 1: Line 1:
 
<!-- NOTEX -->
 
<!-- NOTEX -->
[[HWr1]] {{Dot}} [https://youtu.be/qHdtopqSiXA Video introduction]
+
[[HWr1]] {{Dot}} [https://youtu.be/qHdtopqSiXA Video introduction from an older edition of the course]
 
<!-- /NOTEX -->
 
<!-- /NOTEX -->
  
 
Program for this lecture: basics of R
 
Program for this lecture: basics of R
* very short intro as a lecture
+
* A very short introduction will be given as a lecture.
* exercises have the form of a tutorial: read a bit of text, try some commands, extend/modify them as requested in individual tasks
+
* Exercises have the form of a tutorial: read a bit of text, try some commands, extend/modify them as requested in individual tasks
  
 
In this course we cover several languages popular for scripting and data processing: Perl, Python, R.
 
In this course we cover several languages popular for scripting and data processing: Perl, Python, R.
Line 11: Line 11:
 
* Choose a language based on your preference, level of knowledge, existing code for the task, the rest of the team.
 
* Choose a language based on your preference, level of knowledge, existing code for the task, the rest of the team.
 
* Quickly learn a new language if needed.
 
* Quickly learn a new language if needed.
* Also possibly combine, e.g. preprocess data in Perl or Python, then run statistical analyses in R, automate entire pipeline with <tt>bash</tt> or <tt>make</tt>.
+
* Also possibly combine, e.g. preprocess data in Perl or Python, then run statistical analyses in R, automate the entire pipeline with <tt>bash</tt> or <tt>make</tt>.
  
 
==Introduction==
 
==Introduction==
* [http://www.r-project.org/ R] is an open-source system for statistical computing and data visualization
+
* [http://www.r-project.org/ R] is an open-source system for statistical computing and data visualization.
 
* Programming language, command-line interface
 
* Programming language, command-line interface
 
* Many built-in functions, additional libraries
 
* Many built-in functions, additional libraries
 
** For example [http://bioconductor.org/ Bioconductor] for bioinformatics
 
** For example [http://bioconductor.org/ Bioconductor] for bioinformatics
* We will concentrate on useful commands rather than language features
+
* We will concentrate on useful commands rather than language features.
  
 
==Working in R==
 
==Working in R==
  
Option 1: Run command R, type commands in a command-line interface
+
'''Option 1:''' Run command R, type commands in a command-line interface
 
* It supports history of commands (arrows, up and down, Ctrl-R) and completing command names with the tab key  
 
* It supports history of commands (arrows, up and down, Ctrl-R) and completing command names with the tab key  
  
Option 2: Write a script to a file, run it from the command-line as follows:<br><tt>R --vanilla --slave < file.R</tt>
+
'''Option 2:''' Write a script to a file, run it from the command-line as follows:<br><tt>R --vanilla --slave < file.R</tt>
  
Option 3: Use <tt>rstudio</tt> command to open a [https://www.rstudio.com/products/RStudio/ graphical IDE]
+
'''Option 3:''' Use <tt>rstudio</tt> command to open a [https://www.rstudio.com/products/RStudio/ graphical IDE]
* Sub-windows with editor of R scripts, console, variables, plots
+
* Sub-windows with editor of R scripts, console, variables, plots.
* Ctrl-Enter in editor executes the current command in console
+
* Ctrl-Enter in editor executes the current command in console.
* You can also install RStudio on your home computer and work there
+
* You can also install RStudio on your home computer and work there.
  
In R, you can create plots. In command-line interface these open as a separate window, in Rstudio they open in one of the sub-windows.
+
'''Option 4:''' If you like Jupyter notebooks, you can use them also to run R code, see for example an explanation of how to enable it in Google Colab [https://towardsdatascience.com/how-to-use-r-in-google-colab-b6e02d736497].
 +
 
 +
In R, you can easily create '''plots'''. In command-line interface these open as a separate window, in Rstudio they open in one of the sub-windows.
 
<syntaxhighlight lang="r">
 
<syntaxhighlight lang="r">
x=c(1:10)
+
x = c(1:10)
plot(x,x*x)
+
plot(x, x * x)
 
</syntaxhighlight>
 
</syntaxhighlight>
  
Suggested workflow
+
'''Suggested workflow'''
* work interactively in Rstudio or on command line, try various options  
+
* Work interactively in Rstudio, notebook or on command line, try various options.
* select useful commands, store in a script
+
* Select useful commands, store in a script.
* run script automatically on new data/new versions, potentially as a part of a bigger pipeline
+
* Run the script automatically on new data/new versions, potentially as a part of a bigger pipeline.
  
 
==Additional information==
 
==Additional information==
Line 51: Line 53:
  
 
==Gene expression data==
 
==Gene expression data==
* DNA molecule contains regions called genes, which "recipes" for making proteins
+
* DNA molecules contain regions called genes, which "recipes" for making proteins.
* Gene expression is the process of creating a protein according to the "recipe"
+
* Gene expression is the process of creating a protein according to the "recipe".
* It works in two stages: first a gene is copied (transcribed) from DNA to RNA, then translated from RNA to protein
+
* It works in two stages: first a gene is copied (transcribed) from DNA to RNA, then translated from RNA to protein.
* Different proteins are created in different quantities and their amount depends on the needs of a cell
+
* Different proteins are created in different quantities and their amount depends on the needs of a cell.
* There are several technologies (microarray, RNA-seq) for measuring the amount of RNA for individual genes, this gives us some measure how active each gene is under given circumstances
+
* There are several technologies (microarray, RNA-seq) for measuring the amount of RNA for individual genes, this gives us some measure how active each gene is under given circumstances.
  
Gene expression data
+
Gene expression data is typically a table with numeric values.
* Rows: genes
+
* Rows represent genes.
* Columns: experiments (e.g. different conditions or different individuals)
+
* Columns represent experiments (e.g. different conditions or different individuals).
* Each value is the expression of a gene, i.e. the relative amount of mRNA for this gene in the sample
+
* Each value is the expression of a gene, i.e. the relative amount of mRNA for one gene in one experiment.
  
 
We will use a data set for yeast:
 
We will use a data set for yeast:
 
* Abbott, Derek A., et al. "[https://academic.oup.com/femsyr/article/7/6/819/533265 Generic and specific transcriptional responses to different weak organic acids in anaerobic chemostat cultures of Saccharomyces cerevisiae.]" FEMS yeast research 7.6 (2007): 819-833.
 
* Abbott, Derek A., et al. "[https://academic.oup.com/femsyr/article/7/6/819/533265 Generic and specific transcriptional responses to different weak organic acids in anaerobic chemostat cultures of Saccharomyces cerevisiae.]" FEMS yeast research 7.6 (2007): 819-833.
 
* Downloaded from the [https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE5926 GEO database]
 
* Downloaded from the [https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE5926 GEO database]
* Data was preprocessed: normalized, converted to logarithmic scale
+
* Data was preprocessed: normalized, converted to logarithmic scale.
* Only 1220 genes with biggest changes in expression are included in our dataset
+
* Only 1220 genes with the biggest changes in expression are included in our dataset.
* Gene expression measurements under 5 conditions:
+
* The dataset contains gene expression measurements under 5 conditions:
** Control: yeast grown in a normal environment
+
** Control: yeast grown in a normal environment.
** 4 different acids added so that cells grow 50% slower (acetic, propionic, sorbic, benzoic)
+
** 4 different acids added so that cells grow 50% slower (acetic, propionic, sorbic, benzoic).
* From each condition (reference and each acid) we have 3 replicates, together 15 experiments
+
* From each condition (reference and each acid) we have 3 replicates, together 15 experiments.
* The goal is to observe how the acids influence the yeast and the activity of its genes
+
* The goal is to observe how the acids influence the yeast and the activity of its genes.
Part of the file (only first 4 experiments and first 3 genes shown), strings 2mic_D_protein, AAC3, AAD15 are identifiers of genes
+
Part of the file (only the first 4 experiments and first 3 genes shown), strings <tt>2mic_D_protein, AAC3, AAD15</tt> are identifiers of genes.
 
<pre>
 
<pre>
 
,control1,control2,control3,acetate1,acetate2,acetate3,...
 
,control1,control2,control3,acetate1,acetate2,acetate3,...

Latest revision as of 14:51, 3 April 2023

HWr1 · Video introduction from an older edition of the course

Program for this lecture: basics of R

  • A very short introduction will be given as a lecture.
  • Exercises have the form of a tutorial: read a bit of text, try some commands, extend/modify them as requested in individual tasks

In this course we cover several languages popular for scripting and data processing: Perl, Python, R.

  • Their capabilities overlap, many extensions emulate strengths of one in another.
  • Choose a language based on your preference, level of knowledge, existing code for the task, the rest of the team.
  • Quickly learn a new language if needed.
  • Also possibly combine, e.g. preprocess data in Perl or Python, then run statistical analyses in R, automate the entire pipeline with bash or make.

Introduction

  • R is an open-source system for statistical computing and data visualization.
  • Programming language, command-line interface
  • Many built-in functions, additional libraries
  • We will concentrate on useful commands rather than language features.

Working in R

Option 1: Run command R, type commands in a command-line interface

  • It supports history of commands (arrows, up and down, Ctrl-R) and completing command names with the tab key

Option 2: Write a script to a file, run it from the command-line as follows:
R --vanilla --slave < file.R

Option 3: Use rstudio command to open a graphical IDE

  • Sub-windows with editor of R scripts, console, variables, plots.
  • Ctrl-Enter in editor executes the current command in console.
  • You can also install RStudio on your home computer and work there.

Option 4: If you like Jupyter notebooks, you can use them also to run R code, see for example an explanation of how to enable it in Google Colab [1].

In R, you can easily create plots. In command-line interface these open as a separate window, in Rstudio they open in one of the sub-windows.

x = c(1:10)
plot(x, x * x)

Suggested workflow

  • Work interactively in Rstudio, notebook or on command line, try various options.
  • Select useful commands, store in a script.
  • Run the script automatically on new data/new versions, potentially as a part of a bigger pipeline.

Additional information

Gene expression data

  • DNA molecules contain regions called genes, which "recipes" for making proteins.
  • Gene expression is the process of creating a protein according to the "recipe".
  • It works in two stages: first a gene is copied (transcribed) from DNA to RNA, then translated from RNA to protein.
  • Different proteins are created in different quantities and their amount depends on the needs of a cell.
  • There are several technologies (microarray, RNA-seq) for measuring the amount of RNA for individual genes, this gives us some measure how active each gene is under given circumstances.

Gene expression data is typically a table with numeric values.

  • Rows represent genes.
  • Columns represent experiments (e.g. different conditions or different individuals).
  • Each value is the expression of a gene, i.e. the relative amount of mRNA for one gene in one experiment.

We will use a data set for yeast:

  • Abbott, Derek A., et al. "Generic and specific transcriptional responses to different weak organic acids in anaerobic chemostat cultures of Saccharomyces cerevisiae." FEMS yeast research 7.6 (2007): 819-833.
  • Downloaded from the GEO database
  • Data was preprocessed: normalized, converted to logarithmic scale.
  • Only 1220 genes with the biggest changes in expression are included in our dataset.
  • The dataset contains gene expression measurements under 5 conditions:
    • Control: yeast grown in a normal environment.
    • 4 different acids added so that cells grow 50% slower (acetic, propionic, sorbic, benzoic).
  • From each condition (reference and each acid) we have 3 replicates, together 15 experiments.
  • The goal is to observe how the acids influence the yeast and the activity of its genes.

Part of the file (only the first 4 experiments and first 3 genes shown), strings 2mic_D_protein, AAC3, AAD15 are identifiers of genes.

,control1,control2,control3,acetate1,acetate2,acetate3,...
2mic_D_protein,1.33613934199651,1.13348900359964,1.2726678684356,1.42903234691804,...
AAC3,0.558482767397578,0.608410781454015,0.6663002997292,0.231622581964063,...
AAD15,-0.927871996497105,-1.04072379902481,-1.01885986692013,-2.62459941525552,...