1-DAV-202 Data Management 2023/24
Previously 2-INF-185 Data Source Integration

· Grades from marked homeworks are on the server in file /grades/userid.txt
· Dates of project submission and oral exams:
Early: submit project May 24 9:00am, oral exams May 27 1:00pm (limit 5 students).
Otherwise submit project June 11, 9:00am, oral exams June 18 and 21 (estimated 9:00am-1:00pm, schedule will be published before exam).
Sign up for one of the exam days in AIS before June 11.
Remedial exams will take place in the last week of the exam period. Beware, there will not be much time to prepare a better project. Projects should be submitted as homeworks to /submit/project.
· Cloud homework is due on May 20 9:00am.


Difference between revisions of "Introduction"

From MAD
Line 1: Line 1:
==Target audience==
+
==Target audience==
This course is intended for second-year students of the bachelor study program Bioinformatics and for students of the bachelor and master study programs Computer Science, especially those who plan to take the state exam specialization Bioinformatics and Machine Learning in their master studies. We also welcome students of other specializations and study programs, provided they have the required (informal) prerequisites.
+
This course is offered at the Faculty of Mathematics, Physics and Informatics, Comenius University in Bratislava for the students of the second year of the bachelor Bioinformatics study program and the students of the bachelor and master Computer Science study programs. It is a prerequisite of the master-level state exams in Bioinformatics and Machine Learning. However, the course is open to students from other study programs if they satisfy the following informal prerequisites.
  
We assume that students in this course can already program in some programming language and are not afraid to learn new languages as needed. We also assume basic knowledge of working in Linux, including running commands on the command line (you should know at least the basic commands for working with files and directories, such as cd, mkdir, cp, mv, rm, chmod, etc.). Although most of the technologies covered in this course can be used to process data from many areas, we will often illustrate them with examples from bioinformatics. We will try to explain the necessary terms, but it would be good if you were familiar with basic concepts of molecular biology, such as DNA, RNA, protein, gene, genome, evolution, phylogenetic tree, etc. We recommend that students of the Bioinformatics and Machine Learning specialization first take Methods in Bioinformatics and only then this course.
+
We assume that the students are proficient in programming in at least one programming language and are not afraid to learn new languages. We also assume basic knowledge of working on the Linux command line (at least basic commands for working with files and folders, such as cd, mkdir, cp, mv, rm, chmod). Although most technologies covered in this course can be used for processing data from many application areas, we will illustrate some of them with examples from bioinformatics. We will explain necessary terminology from biology as needed.
  
If you want to learn the basics of using the command line, try for example this tutorial: http://korflab.ucdavis.edu/bootcamp.html
+
The basics of command-line tools can be learned, for example, from [http://korflab.ucdavis.edu/bootcamp.html a tutorial by Ian Korf].
  
==Course objectives==
+
==Course objectives==
  
During your studies you will learn many interesting algorithms, models, and methods that can be used to process data in bioinformatics or other areas. However, if you want to apply these methods to real data during your studies or later at work, you will find that considerable effort usually goes into obtaining the data, preprocessing it into a suitable form, testing and comparing different methods or their settings, and producing the final results as clear tables and graphs. These activities often need to be repeated many times for different inputs, different settings, and so on. In bioinformatics in particular, it is possible to find a job where your main task will be processing data with existing tools, possibly supplemented by small programs of your own. In this course we will show some programming languages, procedures, and technologies suitable for these activities.
+
Computer science courses cover many interesting algorithms, models and methods that can be used for data analysis. However, when you want to use these methods for real data, you will typically need to spend considerable effort to obtain the data, pre-process it into a suitable form, test and compare different methods or settings, and arrange the final results in informative tables and graphs. Often, these activities need to be repeated for different inputs, different settings, and so on. For example in bioinformatics, it is possible to find a job where your main task will be data processing using existing tools, possibly supplemented by small custom scripts. This course will cover some programming languages and technologies suitable for these activities.
  
==Basic principles==
+
This course is particularly recommended for students whose bachelor or master thesis involves substantial empirical experiments (e.g. experimental evaluation of your methods and comparison with other methods on real or simulated data).
  
We recommend the following article with good advice on computational experiments:
+
==Basic guidelines for working with data==
 +
 
 +
As you know, in programming it is recommended to adhere to certain practices, such as good coding style, modular design, thorough testing etc. Such practices add a little extra work, but are much more efficient in the long run. Similar good practices exist for data analysis. As an introduction we recommend the following article by the well-known bioinformatician William Stafford Noble (his advice applies outside of bioinformatics as well):
 
* Noble WS. [http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1000424 A quick guide to organizing computational biology projects.] PLoS Comput Biol. 2009 Jul 31;5(7):e1000424.
 
* Noble WS. [http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1000424 A quick guide to organizing computational biology projects.] PLoS Comput Biol. 2009 Jul 31;5(7):e1000424.
  
Some important principles:
+
Several important recommendations:
* Quote from the article Noble 2009: "Everything you do, you will probably have to do over again."
+
* Noble 2009: '''"Everything you do, you will probably have to do over again."'''
* Document all steps of the experiment well (what you did, why you did it, what results you got)
+
** After doing an entire analysis, you often find out that there was a problem with the input data or one of the early steps and therefore everything needs to be redone
** Even you yourself will not remember these details in a few months
+
** Therefore it is better to use techniques that allow you to keep all details of your workflow and to repeat them if needed
* Try to maintain a logical structure of directories and files
+
** Try to avoid manually changing files, because this makes rerunning analyses harder and more error-prone
** However, if you have many experiments, it may be sufficient to label them by date rather than keep inventing new long names
+
 
* Try to avoid manual edits of intermediate results, which make it impossible to simply repeat the experiment
+
* '''Document all steps of your analysis'''
* Try to detect errors in the data
+
** Note what you have done, why you have done it, and what the result was
** Scripts should terminate with an error message when something does not go as it should
+
** Some of these things may seem obvious to you at present, but you may forget them in a few weeks or months, and you may need them to write up your thesis or to repeat the analysis
** In your scripts, check as much as possible that the input data match your expectations (correct format, reasonable range of values, etc.)
+
** Good documentation is also indispensable for collaborative projects
** If your script calls another program, check its exit code
+
 
** Also check intermediate results of the computation as often as possible (by manual inspection, computing various statistics, etc.) to reveal possible errors in the data or in your code
+
 
 +
* '''Keep a logical structure of your files and folders'''
 +
** Their names should be indicative of the contents (create a sensible naming scheme)
 +
** However, if you have too many versions of the experiment, it may be easier to name them by date rather than create new long names (your notes should then detail the meaning of each dated version)
 +
 
 +
* '''Try to detect problems in the data'''
 +
** Big files often hide problems in the format, unexpected values, etc. These may confuse your programs and make the results meaningless
 +
** In your scripts, check that the input data conform to your expectations (format, values in reasonable ranges, etc.)
 +
** In unexpected circumstances, scripts should terminate with an error message and a non-zero exit code
 +
** If your script executes another program, check its exit code
 +
** Also check intermediate results as often as possible (by manual inspection, computing various statistics, etc.) to detect errors in the data and your code

Revision as of 22:43, 19 February 2020

Target audience

This course is offered at the Faculty of Mathematics, Physics and Informatics, Comenius University in Bratislava for the students of the second year of the bachelor Bioinformatics study program and the students of the bachelor and master Computer Science study programs. It is a prerequisite of the master-level state exams in Bioinformatics and Machine Learning. However, the course is open to students from other study programs if they satisfy the following informal prerequisites.

We assume that the students are proficient in programming in at least one programming language and are not afraid to learn new languages. We also assume basic knowledge of working on the Linux command line (at least basic commands for working with files and folders, such as cd, mkdir, cp, mv, rm, chmod). Although most technologies covered in this course can be used for processing data from many application areas, we will illustrate some of them with examples from bioinformatics. We will explain necessary terminology from biology as needed.

The basics of command-line tools can be learned, for example, from a tutorial by Ian Korf (http://korflab.ucdavis.edu/bootcamp.html).

Course objectives

Computer science courses cover many interesting algorithms, models and methods that can be used for data analysis. However, when you want to use these methods for real data, you will typically need to spend considerable effort to obtain the data, pre-process it into a suitable form, test and compare different methods or settings, and arrange the final results in informative tables and graphs. Often, these activities need to be repeated for different inputs, different settings, and so on. For example in bioinformatics, it is possible to find a job where your main task will be data processing using existing tools, possibly supplemented by small custom scripts. This course will cover some programming languages and technologies suitable for these activities.

This course is particularly recommended for students whose bachelor or master thesis involves substantial empirical experiments (e.g. experimental evaluation of your methods and comparison with other methods on real or simulated data).

Basic guidelines for working with data

As you know, in programming it is recommended to adhere to certain practices, such as good coding style, modular design, thorough testing etc. Such practices add a little extra work, but are much more efficient in the long run. Similar good practices exist for data analysis. As an introduction we recommend the following article by the well-known bioinformatician William Stafford Noble (his advice applies outside of bioinformatics as well):

  • Noble WS. A quick guide to organizing computational biology projects. PLoS Comput Biol. 2009 Jul 31;5(7):e1000424. http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1000424

Several important recommendations:

  • Noble 2009: "Everything you do, you will probably have to do over again."
    • After doing an entire analysis, you often find out that there was a problem with the input data or one of the early steps and therefore everything needs to be redone
    • Therefore it is better to use techniques that allow you to keep all details of your workflow and to repeat them if needed
    • Try to avoid manually changing files, because this makes rerunning analyses harder and more error-prone
  • Document all steps of your analysis
    • Note what you have done, why you have done it, and what the result was
    • Some of these things may seem obvious to you at present, but you may forget them in a few weeks or months, and you may need them to write up your thesis or to repeat the analysis
    • Good documentation is also indispensable for collaborative projects


  • Keep a logical structure of your files and folders
    • Their names should be indicative of the contents (create a sensible naming scheme)
    • However, if you have too many versions of the experiment, it may be easier to name them by date rather than create new long names (your notes should then detail the meaning of each dated version)
  • Try to detect problems in the data
    • Big files often hide problems in the format, unexpected values, etc. These may confuse your programs and make the results meaningless
    • In your scripts, check that the input data conform to your expectations (format, values in reasonable ranges, etc.)
    • In unexpected circumstances, scripts should terminate with an error message and a non-zero exit code
    • If your script executes another program, check its exit code
    • Also check intermediate results as often as possible (by manual inspection, computing various statistics, etc.) to detect errors in the data and your code
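
The recommendations on detecting problems in the data can be illustrated with a short Python sketch. It is only one possible way of doing this, not a prescribed method: the function names parse_values, run_checked and main, the one-number-per-line input format, and the allowed range [0, 1] are all assumptions made up for this example.

```python
import subprocess
import sys


def parse_values(lines, low=0.0, high=1.0):
    """Parse one number per line (hypothetical format); raise ValueError on bad input."""
    values = []
    for line_no, line in enumerate(lines, start=1):
        text = line.strip()
        if not text:
            continue  # blank lines are tolerated in this sketch
        try:
            value = float(text)
        except ValueError:
            raise ValueError(f"line {line_no}: not a number: {text!r}") from None
        # Range check; NaN fails this comparison and is rejected as well.
        if not low <= value <= high:
            raise ValueError(f"line {line_no}: value {value} outside [{low}, {high}]")
        values.append(value)
    if not values:
        raise ValueError("input contains no values")
    return values


def run_checked(command):
    """Run another program and raise RuntimeError if its exit code is non-zero."""
    result = subprocess.run(command)
    if result.returncode != 0:
        raise RuntimeError(f"{command[0]} failed with exit code {result.returncode}")


def main(argv):
    """Return the exit code the script should finish with (0 means success)."""
    if len(argv) != 2:
        print("usage: check_input.py FILE", file=sys.stderr)
        return 2
    try:
        with open(argv[1]) as handle:
            values = parse_values(handle)
    except (OSError, ValueError) as error:
        # Unexpected input: report the problem and signal failure to the caller.
        print(f"error: {error}", file=sys.stderr)
        return 1
    # A simple statistic as a sanity check of the intermediate result.
    print(f"{len(values)} values, mean {sum(values) / len(values):.3f}")
    return 0
```

In a real script, ending with sys.exit(main(sys.argv)) would pass the return value to the shell as the exit code, so that a calling script or pipeline can notice the failure. The helper run_checked shows the other direction: when your script starts another program, it inspects that program's exit code instead of silently continuing.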