1-DAV-202 Data Management 2023/24
Previously 2-INF-185 Data Source Integration

· Grades from marked homeworks are on the server in file /grades/userid.txt
· Dates of project submission and oral exams:
Early: submit project May 24 9:00am, oral exams May 27 1:00pm (limit 5 students).
Otherwise submit project June 11, 9:00am, oral exams June 18 and 21 (estimated 9:00am-1:00pm, schedule will be published before exam).
Sign up for one of the exam days in AIS before June 11.
Remedial exams will take place in the last week of the exam period. Beware, there will not be much time to prepare a better project. Projects should be submitted as homeworks to /submit/project.
· Cloud homework is due on May 20 9:00am.


Difference between revisions of "Lr1"

Line 4:
  Program for this lecture: basics of R
- * very short intro as a lecture
+ * A very short introduction will be given as a lecture.
- * exercises have the form of a tutorial: read a bit of text, try some commands, extend/modify them as requested in individual tasks
+ * Exercises have the form of a tutorial: read a bit of text, try some commands, extend/modify them as requested in individual tasks

  In this course we cover several languages popular for scripting and data processing: Perl, Python, R.
Line 53:
  ==Gene expression data==
- * DNA molecule contains regions called genes, which "recipes" for making proteins
+ * DNA molecules contain regions called genes, which "recipes" for making proteins
  * Gene expression is the process of creating a protein according to the "recipe"
  * It works in two stages: first a gene is copied (transcribed) from DNA to RNA, then translated from RNA to protein
Line 68:
  * Downloaded from the [https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE5926 GEO database]
  * Data was preprocessed: normalized, converted to logarithmic scale
- * Only 1220 genes with biggest changes in expression are included in our dataset
+ * Only 1220 genes with the biggest changes in expression are included in our dataset
  * Gene expression measurements under 5 conditions:
  ** Control: yeast grown in a normal environment
Line 74:
  * From each condition (reference and each acid) we have 3 replicates, together 15 experiments
  * The goal is to observe how the acids influence the yeast and the activity of its genes
- Part of the file (only first 4 experiments and first 3 genes shown), strings 2mic_D_protein, AAC3, AAD15 are identifiers of genes
+ Part of the file (only first 4 experiments and first 3 genes shown), strings <tt>2mic_D_protein, AAC3, AAD15</tt> are identifiers of genes
  <pre>
  ,control1,control2,control3,acetate1,acetate2,acetate3,...

Revision as of 13:36, 8 April 2021

HWr1 · Video introduction

Program for this lecture: basics of R

  • A very short introduction will be given as a lecture.
  • Exercises have the form of a tutorial: read a bit of text, try some commands, extend/modify them as requested in individual tasks

In this course we cover several languages popular for scripting and data processing: Perl, Python, R.

  • Their capabilities overlap; many extensions emulate the strengths of one language in another.
  • Choose a language based on your preference, level of knowledge, existing code for the task, and what the rest of the team uses.
  • Quickly learn a new language if needed.
  • You can also combine them, e.g. preprocess data in Perl or Python, then run statistical analyses in R, and automate the entire pipeline with bash or make.

Introduction

  • R is an open-source system for statistical computing and data visualization
  • Programming language, command-line interface
  • Many built-in functions, additional libraries
  • We will concentrate on useful commands rather than language features

Working in R

Option 1: Run command R, type commands in a command-line interface

  • It supports command history (up and down arrows, Ctrl-R) and completion of command names with the Tab key

Option 2: Write a script to a file and run it from the command line as follows:
R --vanilla --slave < file.R

Option 3: Use the rstudio command to open a graphical IDE

  • Sub-windows with an editor for R scripts, a console, variables, and plots
  • Ctrl-Enter in the editor executes the current command in the console
  • You can also install RStudio on your home computer and work there

Option 4: If you like Jupyter notebooks, you can also use them to run R code; see for example an explanation of how to enable it in Google Colab [1].

In R, you can easily create plots. In the command-line interface they open in a separate window; in RStudio they open in one of the sub-windows.

x = 1:10          # the integers 1, 2, ..., 10
plot(x, x * x)    # scatter plot of x squared against x
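
When the same plot is produced by a script rather than interactively, it is usually more convenient to write it to a file. The following is a minimal sketch using the standard pdf() graphics device; the output file name squares.pdf and the axis labels are chosen here only for illustration.

```r
# Values to plot: the integers 1..10 and their squares
x <- 1:10

# Hypothetical output location; a temporary directory keeps the example self-contained
out_file <- file.path(tempdir(), "squares.pdf")

# Open a PDF graphics device, draw the plot, then close the device
pdf(out_file, width = 6, height = 4)
plot(x, x * x, xlab = "x", ylab = "x * x", main = "Squares")
dev.off()   # the file is complete only after the device is closed
```

Closing the device with dev.off() is what finishes writing the file, so it should not be omitted in scripts.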

Suggested workflow

  • Work interactively in RStudio, in a notebook, or on the command line; try various options.
  • Select the useful commands and store them in a script.
  • Run the script automatically on new data/new versions, potentially as a part of a bigger pipeline.
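
The last two steps of this workflow could look roughly like the sketch below. The script name analyze.R, the function summarize_table and the choice of summary statistics are all hypothetical; the sketch only assumes that the input is a comma-separated table with row names in the first column.

```r
# Hypothetical script analyze.R: per-row mean and standard deviation of a table

summarize_table <- function(path) {
  # First column holds the row names, remaining columns are numeric
  a <- read.csv(path, row.names = 1)
  data.frame(
    gene = rownames(a),
    mean = rowMeans(a),
    sd = apply(a, 1, sd)
  )
}

# When exactly one existing file is given on the command line, process it
args <- commandArgs(trailingOnly = TRUE)
if (length(args) == 1 && file.exists(args[1])) {
  print(summarize_table(args[1]))
}
```

Such a script can then be run on each new version of the data, for example as Rscript analyze.R data.csv, where data.csv stands for your own input file.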

Additional information

Gene expression data

  • DNA molecules contain regions called genes, which are "recipes" for making proteins
  • Gene expression is the process of creating a protein according to the "recipe"
  • It works in two stages: first a gene is copied (transcribed) from DNA to RNA, then translated from RNA to protein
  • Different proteins are created in different quantities and their amount depends on the needs of a cell
  • There are several technologies (microarray, RNA-seq) for measuring the amount of RNA for individual genes; this gives us a measure of how active each gene is under the given circumstances

Gene expression data

  • Rows: genes
  • Columns: experiments (e.g. different conditions or different individuals)
  • Each value is the expression of a gene, i.e. the relative amount of mRNA for this gene in the sample

We will use a data set for yeast:

Part of the file (only the first 4 experiments and the first 3 genes are shown); the strings 2mic_D_protein, AAC3 and AAD15 are gene identifiers:

,control1,control2,control3,acetate1,acetate2,acetate3,...
2mic_D_protein,1.33613934199651,1.13348900359964,1.2726678684356,1.42903234691804,...
AAC3,0.558482767397578,0.608410781454015,0.6663002997292,0.231622581964063,...
AAD15,-0.927871996497105,-1.04072379902481,-1.01885986692013,-2.62459941525552,...
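
A table in this format can be loaded into R as a data frame with genes as row names. The sketch below is only an illustration: it embeds the three genes and the first four values of each row from the excerpt above as text, whereas the real file has more rows and columns and would be read by passing its file name to read.csv instead.

```r
# Excerpt of the data: 3 genes, first 4 experiments (values copied from the file excerpt)
csv_text <- ",control1,control2,control3,acetate1
2mic_D_protein,1.33613934199651,1.13348900359964,1.2726678684356,1.42903234691804
AAC3,0.558482767397578,0.608410781454015,0.6663002997292,0.231622581964063
AAD15,-0.927871996497105,-1.04072379902481,-1.01885986692013,-2.62459941525552"

# row.names = 1 uses the first column (gene identifiers) as row names
a <- read.csv(text = csv_text, row.names = 1)

dim(a)                  # number of genes (rows) and experiments (columns)
rownames(a)             # gene identifiers
a["AAC3", "acetate1"]   # expression of one gene in one experiment

# Mean expression of each gene over the three control replicates
control_mean <- rowMeans(a[, c("control1", "control2", "control3")])
control_mean
```

Keeping the gene identifiers as row names rather than as an ordinary column leaves the data frame purely numeric, so functions such as rowMeans can be applied to it directly.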