1-DAV-202 Data Management 2023/24
Previously 2-INF-185 Data Source Integration

Materials · Introduction · Rules · Contact
· Grades from marked homeworks are on the server in file /grades/userid.txt
· Dates of project submission and oral exams:
Early: submit project May 24 9:00am, oral exams May 27 1:00pm (limit 5 students).
Otherwise submit project June 11, 9:00am, oral exams June 18 and 21 (estimated 9:00am-1:00pm, schedule will be published before exam).
Sign up for one the exam days in AIS before June 11.
Remedial exams will take place in the last week of the exam period. Beware, there will not be much time to prepare a better project. Projects should be submitted as homeworks to /submit/project.
· Cloud homework is due on May 20 9:00am.


Difference between revisions of "Lr1"

From MAD
Jump to navigation Jump to search
Line 3: Line 3:
 
<!-- /NOTEX -->
 
<!-- /NOTEX -->
  
Program for this lecture: basics of R (applied to biology examples)
+
Program for this lecture: basics of R
 
* very short intro as a lecture
 
* very short intro as a lecture
 
* exercises have the form of a tutorial: read a bit of text, try some commands, extend/modify them as requested in individual tasks
 
* exercises have the form of a tutorial: read a bit of text, try some commands, extend/modify them as requested in individual tasks

Revision as of 20:31, 15 April 2020

HWr1

Program for this lecture: basics of R

  • very short intro as a lecture
  • exercises have the form of a tutorial: read a bit of text, try some commands, extend/modify them as requested in individual tasks

In this course we cover several languages popular for scripting and data processing: Perl, Python, R.

  • Their capabilities overlap, many extensions emulate strengths of one in another.
  • Choose a language based on your preference, level of knowledge, existing code for the task, the rest of the team.
  • Quickly learn a new language if needed.
  • Also possibly combine, e.g. preprocess data in Perl or Python, then run statistical analyses in R, automate entire pipeline with bash or make.

Introduction

  • R is an open-source system for statistical computing and data visualization
  • Programming language, command-line interface
  • Many built-in functions, additional libraries
  • We will concentrate on useful commands rather than language features

Working in R

Option 1: Run command R, type commands in a command-line interface

  • It supports history of commands (arrows, up and down, Ctrl-R) and completing command names with the tab key

Option 2: Write a script to a file, run it from the command-line as follows:
R --vanilla --slave < file.R

Option 3: Use rstudio command to open a graphical IDE

  • Sub-windows with editor of R scripts, console, variables, plots
  • Ctrl-Enter in editor executes the current command in console
  • You can also install RStudio on your home computer and work there

In R, you can create plots. In command-line interface these open as a separate window, in Rstudio they open in one of the sub-windows.

x=c(1:10)
plot(x,x*x)

Suggested workflow

  • work interactively in Rstudio or on command line, try various options
  • select useful commands, store in a script
  • run script automatically on new data/new versions, potentially as a part of a bigger pipeline

Additional information

Gene expression data

  • DNA molecule contains regions called genes, which "recipes" for making proteins
  • Gene expression is the process of creating a protein according to the "recipe"
  • It works in two stages: first a gene is copied (transcribed) from DNA to RNA, then translated from RNA to protein
  • Different proteins are created in different quantities and their amount depends on the needs of a cell
  • There are several technologies (microarray, RNA-seq) for measuring the amount of RNA for individual genes, this gives us some measure how active each gene is under given circumstances

Gene expression data

  • Rows: genes
  • Columns: experiments (e.g. different conditions or different individuals)
  • Each value is the expression of a gene, i.e. the relative amount of mRNA for this gene in the sample

We will use a data set for yeast:

Part of the file (only first 4 experiments and first 3 genes shown), strings 2mic_D_protein, AAC3, AAD15 are identifiers of genes

,control1,control2,control3,acetate1,acetate2,acetate3,...
2mic_D_protein,1.33613934199651,1.13348900359964,1.2726678684356,1.42903234691804,...
AAC3,0.558482767397578,0.608410781454015,0.6663002997292,0.231622581964063,...
AAD15,-0.927871996497105,-1.04072379902481,-1.01885986692013,-2.62459941525552,...