1-DAV-202 Data Management 2023/24
Previously 2-INF-185 Data Source Integration
Latest revision as of 14:51, 3 April 2023
HWr1 · Video introduction from an older edition of the course
Program for this lecture: basics of R
- A very short introduction will be given as a lecture.
- Exercises have the form of a tutorial: read a bit of text, try some commands, then extend or modify them as requested in individual tasks.
In this course we cover several languages popular for scripting and data processing: Perl, Python, R.
- Their capabilities overlap, many extensions emulate strengths of one in another.
- Choose a language based on your preference, level of knowledge, existing code for the task, the rest of the team.
- Quickly learn a new language if needed.
- Also possibly combine, e.g. preprocess data in Perl or Python, then run statistical analyses in R, automate the entire pipeline with bash or make.
Introduction
- R is an open-source system for statistical computing and data visualization.
- Programming language, command-line interface
- Many built-in functions, additional libraries
- For example Bioconductor for bioinformatics
- We will concentrate on useful commands rather than language features.
Working in R
Option 1: Run the command R and type commands in its command-line interface.
- It supports command history (up and down arrows, Ctrl-R) and completion of command names with the tab key.
Option 2: Write a script to a file, run it from the command-line as follows:
R --vanilla --slave < file.R
Option 3: Use the rstudio command to open a graphical IDE.
- Sub-windows with an editor of R scripts, a console, variables, and plots.
- Ctrl-Enter in the editor executes the current command in the console.
- You can also install RStudio on your home computer and work there.
Option 4: If you like Jupyter notebooks, you can also use them to run R code; see for example this explanation of how to enable it in Google Colab: https://towardsdatascience.com/how-to-use-r-in-google-colab-b6e02d736497
In R, you can easily create plots. In the command-line interface they open in a separate window; in RStudio they open in one of the sub-windows.
x = c(1:10)
plot(x, x * x)
Suggested workflow
- Work interactively in RStudio, a notebook, or on the command line, and try various options.
- Select useful commands, store in a script.
- Run the script automatically on new data/new versions, potentially as a part of a bigger pipeline.
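The last two steps of this workflow can be sketched as a small script. This is only an illustration under stated assumptions: the file name analysis.R, the function name plot_first_two_columns, and the input and output file names are all hypothetical, and the plot is written to a PDF file because no window is available when a script runs unattended.

```r
# analysis.R (hypothetical file name): a minimal sketch of a script
# that can be run non-interactively on new data.

# Plot the second column of a CSV file against the first one and save
# the result as a PDF. The function name and arguments are made up
# for this illustration.
plot_first_two_columns <- function(input_csv, output_pdf) {
  data <- read.csv(input_csv)
  pdf(output_pdf)   # send graphics to a file instead of a window
  plot(data[[1]], data[[2]],
       xlab = names(data)[1], ylab = names(data)[2])
  dev.off()         # close the file so that the plot is written
  invisible(output_pdf)
}

# File names are taken from the command line, for example:
#   Rscript analysis.R input.csv plot.pdf
args <- commandArgs(trailingOnly = TRUE)
if (length(args) == 2) {
  plot_first_two_columns(args[1], args[2])
}
```

Keeping the plotting code in a function makes it easy to call the same commands from the interactive console while they are still being tuned.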
Additional information
- Official tutorial
- Seefeld, Linder: Statistics Using R with Biological Examples (pdf book)
- Patrick Burns: The R Inferno (intricacies of the language)
- Other books
- Built-in help: ?plot displays help for the plot command
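The built-in help can also be reached through ordinary functions, which may be convenient when you remember only part of a command name. A minimal sketch using base R only; the search pattern "plot" is an arbitrary choice for illustration:

```r
# help("plot") is the function form of ?plot; at the interactive
# prompt, printing the result displays the help page.
plot_help <- help("plot")

# apropos() lists objects on the search path whose names match
# a pattern (a regular expression).
plot_like <- apropos("plot")

# args() shows the arguments of a function without opening
# the full help page.
args(read.csv)
```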
Gene expression data
- DNA molecules contain regions called genes, which are "recipes" for making proteins.
- Gene expression is the process of creating a protein according to the "recipe".
- It works in two stages: first a gene is copied (transcribed) from DNA to RNA, then translated from RNA to protein.
- Different proteins are created in different quantities and their amount depends on the needs of a cell.
- There are several technologies (microarray, RNA-seq) for measuring the amount of RNA for individual genes; this gives us a measure of how active each gene is under given circumstances.
Gene expression data is typically a table with numeric values.
- Rows represent genes.
- Columns represent experiments (e.g. different conditions or different individuals).
- Each value is the expression of a gene, i.e. the relative amount of mRNA for one gene in one experiment.
We will use a data set for yeast:
- Abbott, Derek A., et al. "Generic and specific transcriptional responses to different weak organic acids in anaerobic chemostat cultures of Saccharomyces cerevisiae." FEMS yeast research 7.6 (2007): 819-833.
- Downloaded from the GEO database (accession GSE5926).
- Data was preprocessed: normalized, converted to logarithmic scale.
- Only 1220 genes with the biggest changes in expression are included in our dataset.
- The dataset contains gene expression measurements under 5 conditions:
- Control: yeast grown in a normal environment.
- 4 different acids added so that cells grow 50% slower (acetic, propionic, sorbic, benzoic).
- From each condition (control and each acid) we have 3 replicates, together 15 experiments.
- The goal is to observe how the acids influence the yeast and the activity of its genes.
Part of the file (only the first 4 experiments and first 3 genes are shown); the strings 2mic_D_protein, AAC3, AAD15 are identifiers of genes.
,control1,control2,control3,acetate1,acetate2,acetate3,...
2mic_D_protein,1.33613934199651,1.13348900359964,1.2726678684356,1.42903234691804,...
AAC3,0.558482767397578,0.608410781454015,0.6663002997292,0.231622581964063,...
AAD15,-0.927871996497105,-1.04072379902481,-1.01885986692013,-2.62459941525552,...
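To make the table layout concrete, the excerpt can be loaded into R as follows. This is a sketch, not the course's prescribed method: it uses only the four values per gene shown in the excerpt (the elided columns are left out), and the variable names expr, control_mean, and change are chosen for illustration.

```r
# The excerpt from the file, restricted to the four columns whose
# values are shown; the remaining columns are elided in the text.
excerpt <- ",control1,control2,control3,acetate1
2mic_D_protein,1.33613934199651,1.13348900359964,1.2726678684356,1.42903234691804
AAC3,0.558482767397578,0.608410781454015,0.6663002997292,0.231622581964063
AAD15,-0.927871996497105,-1.04072379902481,-1.01885986692013,-2.62459941525552"

# The first column holds gene identifiers, so use it as row names.
expr <- read.csv(text = excerpt, row.names = 1)

dim(expr)                  # number of genes (rows) and experiments (columns)
expr["AAC3", "acetate1"]   # expression of one gene in one experiment

# Average the three control replicates for each gene.
control_mean <- rowMeans(expr[, c("control1", "control2", "control3")])

# The values are on a logarithmic scale, so a difference corresponds
# to a log ratio of expression (acetate1 versus the control average).
change <- expr$acetate1 - control_mean
change
```

With the full file, the same commands would apply after replacing the text argument with the file name.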