1-DAV-202 Data Management 2023/24
Previously 2-INF-185 Data Source Integration

· Grades from marked homeworks are on the server in file /grades/userid.txt
· Dates of project submission and oral exams:
Early: submit project May 24 9:00am, oral exams May 27 1:00pm (limit 5 students).
Otherwise submit project June 11, 9:00am, oral exams June 18 and 21 (estimated 9:00am-1:00pm, schedule will be published before exam).
Sign up for one of the exam days in AIS before June 11.
Remedial exams will take place in the last week of the exam period. Beware, there will not be much time to prepare a better project. Projects should be submitted as homeworks to /submit/project.
· Cloud homework is due on May 20 9:00am.


Lr1

Revision as of 14:51, 3 April 2023 by Brona

HWr1 · Video introduction from an older edition of the course

Program for this lecture: basics of R

  • A very short introduction will be given as a lecture.
  • The exercises take the form of a tutorial: read a bit of text, try some commands, then extend or modify them as requested in the individual tasks.

In this course we cover several languages popular for scripting and data processing: Perl, Python, R.

  • Their capabilities overlap, and many extensions emulate the strengths of one language in another.
  • Choose a language based on your preference, level of knowledge, existing code for the task, the rest of the team.
  • Quickly learn a new language if needed.
  • You can also combine languages, e.g. preprocess data in Perl or Python, then run statistical analyses in R, and automate the entire pipeline with bash or make.

Introduction

  • R is an open-source system for statistical computing and data visualization.
  • Programming language, command-line interface
  • Many built-in functions, additional libraries
  • We will concentrate on useful commands rather than language features.

Working in R

Option 1: Run command R, type commands in a command-line interface

  • It supports command history (up and down arrows, Ctrl-R) and completion of command names with the Tab key.

Option 2: Write a script to a file, run it from the command-line as follows:
R --vanilla --slave < file.R
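
As a minimal sketch of what such a script might contain (the contents and variable names below are illustrative assumptions, not part of the course materials), file.R could simply compute and print a few values:

```r
# file.R: a hypothetical minimal script (contents are illustrative only)

# Vector of integers 1 to 10
x <- 1:10

# Element-wise product gives the squares of all elements
squares <- x * x

# print() shows the whole vector, cat() writes plain text to standard output
print(squares)
sum_of_squares <- sum(squares)
cat("Sum of squares:", sum_of_squares, "\n")
```

Since the script writes to standard output, its results can be redirected to a file or piped into another program.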

Option 3: Use the rstudio command to open a graphical IDE

  • Sub-windows with editor of R scripts, console, variables, plots.
  • Ctrl-Enter in editor executes the current command in console.
  • You can also install RStudio on your home computer and work there.

Option 4: If you like Jupyter notebooks, you can also use them to run R code; see for example an explanation of how to enable it in Google Colab [1].

In R, you can easily create plots. In the command-line interface they open in a separate window; in RStudio they open in one of the sub-windows.

# Vector of integers 1 to 10
x = c(1:10)
# Scatter plot of x against its squares
plot(x, x * x)
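
When R runs non-interactively (as in Option 2), no plot window is available, and by default plots end up in a file named Rplots.pdf. One possible way to control the output is to open a file device explicitly, as in the sketch below. The output file name and the axis labels are assumptions chosen for illustration.

```r
# Hypothetical output location: a PDF file in R's temporary directory
out_file <- file.path(tempdir(), "squares.pdf")

x <- 1:10

# Open a PDF graphics device; width and height are in inches
pdf(out_file, width = 6, height = 4)

# Everything plotted now goes to the file instead of a window
plot(x, x * x, xlab = "x", ylab = "x squared", main = "Squares of 1 to 10")

# Close the device so that the file is completely written
dev.off()
```

Other file devices such as png() follow the same open, plot, dev.off() pattern, but may depend on how R was built on a given machine.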

Suggested workflow

  • Work interactively in RStudio, a notebook, or on the command line, and try various options.
  • Select useful commands, store in a script.
  • Run the script automatically on new data/new versions, potentially as a part of a bigger pipeline.
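
The last step of this workflow could look roughly like the sketch below: commands found useful during interactive work are wrapped in a function, and the script takes file names from the command line. The script name summarize.R, the function summarize_table, and the tiny demo table are all hypothetical; the sketch also assumes the script is started with Rscript, which is distributed with R.

```r
# summarize.R: hypothetical script name
# Intended use (assumption): Rscript summarize.R input.csv output.csv

# Read a CSV table whose first column holds row names,
# compute the mean of each row and write the result to another CSV file.
summarize_table <- function(in_path, out_path) {
  data <- read.csv(in_path, row.names = 1)
  stats <- data.frame(name = rownames(data), mean = rowMeans(data))
  write.csv(stats, out_path, row.names = FALSE)
  invisible(stats)
}

# Arguments given after the script name on the command line
args <- commandArgs(trailingOnly = TRUE)

if (length(args) == 2) {
  summarize_table(args[1], args[2])
} else {
  # No arguments given: demonstrate on a tiny made-up table
  demo_in <- file.path(tempdir(), "demo_in.csv")
  demo_out <- file.path(tempdir(), "demo_out.csv")
  writeLines(c(",a,b", "row1,1,3", "row2,2,6"), demo_in)
  demo_stats <- summarize_table(demo_in, demo_out)
  print(demo_stats)
}
```

Keeping the processing in a function and the argument handling at the bottom makes it easy to call the same script from a Makefile or a bash loop over many input files.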

Additional information

Gene expression data

  • DNA molecules contain regions called genes, which are "recipes" for making proteins.
  • Gene expression is the process of creating a protein according to the "recipe".
  • It works in two stages: first a gene is copied (transcribed) from DNA to RNA, then translated from RNA to protein.
  • Different proteins are created in different quantities and their amount depends on the needs of a cell.
  • There are several technologies (microarray, RNA-seq) for measuring the amount of RNA for individual genes. This gives us a measure of how active each gene is under given circumstances.

Gene expression data is typically a table with numeric values.

  • Rows represent genes.
  • Columns represent experiments (e.g. different conditions or different individuals).
  • Each value is the expression of a gene, i.e. the relative amount of mRNA for one gene in one experiment.

We will use a data set for yeast:

  • Abbott, Derek A., et al. "Generic and specific transcriptional responses to different weak organic acids in anaerobic chemostat cultures of Saccharomyces cerevisiae." FEMS yeast research 7.6 (2007): 819-833.
  • Downloaded from the GEO database
  • Data was preprocessed: normalized, converted to logarithmic scale.
  • Only 1220 genes with the biggest changes in expression are included in our dataset.
  • The dataset contains gene expression measurements under 5 conditions:
    • Control: yeast grown in a normal environment.
    • 4 different acids added so that cells grow 50% slower (acetic, propionic, sorbic, benzoic).
  • For each condition (control and each acid) we have 3 replicates, 15 experiments in total.
  • The goal is to observe how the acids influence the yeast and the activity of its genes.
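
Because the values are on a logarithmic scale, a difference between two conditions corresponds to a ratio of the original measurements. The sketch below illustrates this on made-up numbers for a single hypothetical gene. The values are not taken from the dataset, and base 2 is an assumption, since the base of the logarithm is not stated here.

```r
# Made-up log-scale expression values for one hypothetical gene,
# 3 replicates per condition (NOT taken from the real dataset)
control <- c(1.0, 1.2, 0.8)
acid <- c(2.0, 2.1, 1.9)

# Average the replicates within each condition
control_mean <- mean(control)
acid_mean <- mean(acid)

# Difference of log values = logarithm of the ratio of original values
log_diff <- acid_mean - control_mean

# Assuming base-2 logarithms, convert the difference back to a ratio
fold_change <- 2^log_diff

cat("Log difference:", log_diff, "\n")
cat("Fold change (assuming log base 2):", fold_change, "\n")
```

A positive difference means the gene is more active with the acid than in the control, a negative one that it is less active.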

Part of the file (only the first 4 experiments and first 3 genes shown); the strings 2mic_D_protein, AAC3, AAD15 are identifiers of genes.

,control1,control2,control3,acetate1,acetate2,acetate3,...
2mic_D_protein,1.33613934199651,1.13348900359964,1.2726678684356,1.42903234691804,...
AAC3,0.558482767397578,0.608410781454015,0.6663002997292,0.231622581964063,...
AAD15,-0.927871996497105,-1.04072379902481,-1.01885986692013,-2.62459941525552,...
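
A table in this format can be loaded with read.csv, using the first column as row names. The sketch below works only on the excerpt shown above, with the header trimmed to the four columns whose values are visible; for the real file one would pass its path instead of the text argument. The variable names are illustrative.

```r
# The excerpt from above, header trimmed to the 4 columns with visible values
excerpt <- ",control1,control2,control3,acetate1
2mic_D_protein,1.33613934199651,1.13348900359964,1.2726678684356,1.42903234691804
AAC3,0.558482767397578,0.608410781454015,0.6663002997292,0.231622581964063
AAD15,-0.927871996497105,-1.04072379902481,-1.01885986692013,-2.62459941525552"

# row.names = 1 turns the first column (gene identifiers) into row names
expr <- read.csv(text = excerpt, row.names = 1)

# Number of rows (genes) and columns (experiments)
print(dim(expr))

# All values for one gene, selected by its identifier
print(expr["AAC3", ])

# Mean of the three control replicates for each gene
control_mean <- rowMeans(expr[, c("control1", "control2", "control3")])

# Difference between the first acetate replicate and the control mean
acetate1_diff <- expr$acetate1 - control_mean
print(acetate1_diff)
```

Selecting rows and columns by name rather than by position keeps the code readable and robust if the order of experiments in the file changes.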