1-DAV-202 Data Management 2023/24
Previously 2-INF-185 Data Source Integration
Difference between revisions of "Lr1"
Revision as of 16:39, 15 April 2020
Program for this lecture: basics of R (applied to biology examples)
- very short intro as a lecture
- exercises have the form of a tutorial: read a bit of text, try some commands, extend/modify them as requested in individual tasks
In this course we cover several languages popular for scripting and data processing: Perl, Python, R.
- Their capabilities overlap, many extensions emulate strengths of one in another.
- Choose a language based on your preference, level of knowledge, existing code for the task, the rest of the team.
- Quickly learn a new language if needed.
- Languages can also be combined, e.g. preprocess data in Perl or Python, then run statistical analyses in R, and automate the entire pipeline with bash or make.
Introduction
- R is an open-source system for statistical computing and data visualization
- Programming language, command-line interface
- Many built-in functions, additional libraries
- For example Bioconductor for bioinformatics
- We will concentrate on useful commands rather than language features
Working in R
Option 1: Run command R, type commands in a command-line interface
- It supports command history (up and down arrows, Ctrl-R) and completion of command names with the tab key
Option 2: Write a script to a file, run it from the command-line as follows:
R --vanilla --slave < file.R
Option 3: Use rstudio command to open a graphical IDE
- Sub-windows with editor of R scripts, console, variables, plots
- Ctrl-Enter in editor executes the current command in console
- You can also install RStudio on your home computer and work there
In R, you can create plots. In the command-line interface they open in a separate window; in RStudio they open in one of the sub-windows.
x=c(1:10)
plot(x,x*x)
Suggested workflow
- work interactively in RStudio or on the command line, try various options
- select useful commands, store in a script
- run script automatically on new data/new versions, potentially as a part of a bigger pipeline
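The last two steps of this workflow could look roughly like the sketch below. It is only an illustration under assumptions: the script name plot_squares.R and the optional output-file argument are hypothetical, and the plot is the toy example from above written to a PDF file so that the script can run without a graphical window.

```r
# Hypothetical script, e.g. saved as plot_squares.R (the name is illustrative).
# Possible non-interactive run:  Rscript plot_squares.R out.pdf

# Arguments given after the script name; empty if none were given
args <- commandArgs(trailingOnly = TRUE)

# Assumption: the first argument is the output file; otherwise use a temporary file
out_file <- if (length(args) >= 1) args[1] else tempfile(fileext = ".pdf")

x <- 1:10
pdf(out_file)            # send plots to a PDF file instead of a window
plot(x, x * x)
invisible(dev.off())     # close the device so that the file is written

cat("Plot written to", out_file, "\n")
```

Writing plots to a file rather than to a window is what makes the script usable inside a larger pipeline, where no display may be available.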
Additional information
- Official tutorial
- Seefeld, Linder: Statistics Using R with Biological Examples (pdf book)
- Patrick Burns: The R Inferno (intricacies of the language)
- Other books
- Built-in help: ?plot displays help for the plot command
Gene expression data
- A DNA molecule contains regions called genes, which are "recipes" for making proteins
- Gene expression is the process of creating a protein according to the "recipe"
- It works in two stages: first a gene is copied (transcribed) from DNA to RNA, then translated from RNA to protein
- Different proteins are created in different quantities and their amount depends on the needs of a cell
- There are several technologies (microarray, RNA-seq) for measuring the amount of RNA for individual genes; this gives us a measure of how active each gene is under given circumstances
Gene expression data
- Rows: genes
- Columns: experiments (e.g. different conditions or different individuals)
- Each value is the expression of a gene, i.e. the relative amount of mRNA for this gene in the sample
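In R, such a table maps naturally onto a numeric matrix with named rows and columns. The following is a minimal sketch with made-up values; the gene and experiment names (geneA, exp1, ...) are hypothetical and are not taken from the yeast dataset.

```r
# Toy expression matrix: 2 genes (rows) x 3 experiments (columns).
# All values and names are made up for illustration.
expr <- matrix(c( 1.2,  0.9, 1.1,
                 -0.5, -0.7, 0.3),
               nrow = 2, byrow = TRUE,
               dimnames = list(c("geneA", "geneB"),
                               c("exp1", "exp2", "exp3")))

dim(expr)                      # number of genes, number of experiments
expr["geneB", ]                # one gene in all experiments
expr[, "exp2"]                 # all genes in one experiment
gene_means <- rowMeans(expr)   # mean expression of each gene
```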
We will use a data set for yeast:
- Abbott, Derek A., et al. "Generic and specific transcriptional responses to different weak organic acids in anaerobic chemostat cultures of Saccharomyces cerevisiae." FEMS yeast research 7.6 (2007): 819-833.
- Downloaded from the GEO database
- Data was preprocessed: normalized, converted to logarithmic scale
- Only the 1220 genes with the biggest changes in expression are included in our dataset
- 15 experiments were done: 5 conditions, 3 replicate experiments for each condition
  - The first 3 experiments are controls, that is, yeast grown in a usual medium
  - In each of the remaining experiments a weak solution of an acid was added to the growing medium to observe how this influences the yeast
  - We have 3 replicates for each of the 4 different acids
Part of the file (only first 6 experiments and first 3 genes shown), strings 2mic_D_protein, AAC3, AAD15 are identifiers of genes
,control1,control2,control3,acetate1,acetate2,acetate3,...
2mic_D_protein,1.33613934199651,1.13348900359964,1.2726678684356,1.42903234691804,1.38194263856183,1.05754712802093,
AAC3,0.558482767397578,0.608410781454015,0.6663002997292,0.231622581964063,0.588812791760133,0.171617377505217,
AAD15,-0.927871996497105,-1.04072379902481,-1.01885986692013,-2.62459941525552,-1.76155026099627,-1.61661288118871,
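As a rough sketch of how such a file might be loaded and summarized, the code below reads the excerpt shown above from a text string. This is an assumption made for self-containment: the name of the full file is not given here, and the elided columns (the "..." and trailing commas) are dropped, so only the 6 shown experiments are used. With a real file, the text argument would be replaced by a file name.

```r
# The three genes and six experiments from the excerpt; the elided columns are omitted
csv_lines <- c(
  ",control1,control2,control3,acetate1,acetate2,acetate3",
  "2mic_D_protein,1.33613934199651,1.13348900359964,1.2726678684356,1.42903234691804,1.38194263856183,1.05754712802093",
  "AAC3,0.558482767397578,0.608410781454015,0.6663002997292,0.231622581964063,0.588812791760133,0.171617377505217",
  "AAD15,-0.927871996497105,-1.04072379902481,-1.01885986692013,-2.62459941525552,-1.76155026099627,-1.61661288118871"
)

# The first column holds gene identifiers, so use it as row names
expr <- as.matrix(read.csv(text = paste(csv_lines, collapse = "\n"),
                           row.names = 1))

# Average the replicates of each condition, selecting columns by name prefix
control_mean <- rowMeans(expr[, grep("^control", colnames(expr))])
acetate_mean <- rowMeans(expr[, grep("^acetate", colnames(expr))])

# Values are on a logarithmic scale, so a difference of means
# corresponds to the logarithm of a ratio (acetate vs. control)
log_change <- acetate_mean - control_mean
print(round(log_change, 2))
```

Selecting columns by name prefix instead of by position keeps the code valid if the order of experiments in the file changes.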