1-DAV-202 Data Management 2023/24
Previously 2-INF-185 Data Source Integration

· Grades from marked homeworks are on the server in file /grades/userid.txt
· Dates of project submission and oral exams:
Early: submit project May 24 9:00am, oral exams May 27 1:00pm (limit 5 students).
Otherwise submit project June 11, 9:00am, oral exams June 18 and 21 (estimated 9:00am-1:00pm, schedule will be published before exam).
Sign up for one of the exam days in AIS before June 11.
Remedial exams will take place in the last week of the exam period. Beware, there will not be much time to prepare a better project. Projects should be submitted as homeworks to /submit/project.
· Cloud homework is due on May 20 9:00am.


Difference between revisions of "Introduction"

Latest revision as of 11:10, 22 February 2024

Target audience

This course is offered at the Faculty of Mathematics, Physics and Informatics, Comenius University in Bratislava, for second-year students of the bachelor's study programs Data Science and Bioinformatics and for students of the bachelor's and master's study programs in Computer Science. It is a prerequisite for the master-level state exam in Bioinformatics and Machine Learning.

However, the course is open to students from other study programs if they satisfy the following informal prerequisites.

  • Students should be proficient in programming in at least one programming language and not afraid to learn new languages.
  • Students should have basic knowledge of working on the Linux command line (at least basic commands for working with files and folders, such as cd, mkdir, cp, mv, rm, chmod). If you do not have these skills, please study our tutorial before the second lecture. The first week contains detailed instructions to get you started.

Although most technologies covered in this course can be used for processing data from many application areas, we will illustrate some of them with examples from bioinformatics. We will explain the necessary biological terminology as needed.

Course objectives

Quick summary

  • Learn different languages and technologies for data processing tasks:
    • obtaining data,
    • preprocessing it into a suitable form,
    • connecting existing tools into pipelines,
    • performing statistical tests and data visualization.
  • More details on statistical methods, visualization and machine learning are covered in other courses.
  • These tasks are fundamental in data science and bioinformatics, but also useful in many areas of computer science where experimental evaluation and comparison of methods are needed.
  • Rather than learning one set of tools in detail, we give an overview of many different ones.
  • The main reason is to improve your flexibility so that you can quickly learn a new language or library in the future.

More details

Computer science courses cover many interesting algorithms, models and methods that can be used for data analysis. However, when you want to apply these methods to real data, you will typically need considerable effort to obtain the data, pre-process it into a suitable form, test and compare different methods or settings, and arrange the final results in informative tables and graphs. Often, these activities need to be repeated for different inputs, different settings, and so on. Many jobs in data science and bioinformatics involve data processing using existing tools and small custom scripts. This course will cover some programming languages and technologies suitable for such activities.

This course is also recommended for students whose bachelor's or master's theses involve substantial empirical experiments (e.g. experimental evaluation of your methods and comparison with other methods on real or simulated data).

We do not aim to teach you one specific language or technology in detail. Rather, we give you an overview of many different options, often covering a different language each week. One of the goals is to increase your flexibility so that you can quickly adapt when you need to use something new.

Basic guidelines for working with data

As you know, in programming it is recommended to adhere to certain practices, such as good coding style, modular design, thorough testing etc. Such practices add a little extra work, but are much more efficient in the long run. Similar good practices exist for data analysis. As an introduction we recommend the following article by the well-known bioinformatician William Stafford Noble; his advice applies outside of bioinformatics as well.

  • Noble WS. A quick guide to organizing computational biology projects. PLoS Comput Biol. 2009 Jul 31;5(7):e1000424. http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1000424

Several important recommendations:

  • Noble 2009: "Everything you do, you will probably have to do over again."
    • After doing an entire analysis, you often find out that there was a problem with the input data or one of the early steps, and therefore everything needs to be redone.
    • Therefore it is better to use techniques that allow you to keep all details of your workflow and to repeat them if needed.
    • Try to avoid manually changing files, because this makes rerunning analyses harder and more error-prone.
  • Document all steps of your analysis
    • Note what you have done, why you have done it, and what the result was.
    • Some of these things may seem obvious to you at present, but you may forget them in a few weeks or months, and you may need them to write up your thesis or to repeat the analysis.
    • Good documentation is also indispensable for collaborative projects.
  • Keep a logical structure of your files and folders
    • Their names should be indicative of the contents (create a sensible naming scheme).
    • However, if you have too many versions of the experiment, it may be easier to name them by date rather than create new long names (your notes should then detail the meaning of each dated version).
  • Try to detect problems in the data
    • Big files often hide some problems in the format, unexpected values etc. These may confuse your programs and make the results meaningless.
    • In your scripts, check that the input data conform to your expectations (format, values in reasonable ranges etc.).
    • In unexpected circumstances, scripts should terminate with an error message and a non-zero exit code.
    • If your script executes another program, check its exit code.
    • Also check intermediate results as often as possible (by manual inspection, computing various statistics etc.) to detect errors in the data and in your code.
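
The checks recommended in the last group of points can be sketched in Python as follows. This is only a minimal illustration under stated assumptions: the input format (two tab-separated columns, a name and a numeric value), the allowed range 0 to 100, and the function names parse_measurements, run_checked and main are hypothetical and are not taken from the course materials.

```python
import subprocess
import sys


def parse_measurements(lines, low=0.0, high=100.0):
    """Parse 'name<TAB>value' lines (hypothetical format).

    Raises ValueError with the line number as soon as a line has the wrong
    number of columns, a non-numeric value, or a value outside [low, high].
    """
    records = []
    for number, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 2:
            raise ValueError(f"line {number}: expected 2 columns, got {len(fields)}")
        name, text = fields
        try:
            value = float(text)
        except ValueError:
            raise ValueError(f"line {number}: value {text!r} is not a number") from None
        # NaN fails this comparison as well, so it is rejected too.
        if not low <= value <= high:
            raise ValueError(f"line {number}: value {value} outside [{low}, {high}]")
        records.append((name, value))
    return records


def run_checked(command):
    """Run an external program; raise RuntimeError if its exit code is non-zero."""
    result = subprocess.run(command, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(
            f"{command[0]} failed with exit code {result.returncode}: "
            f"{result.stderr.strip()}"
        )
    return result.stdout


def main(path, command):
    """Validate the input file, run one external step, return an exit code."""
    try:
        with open(path) as handle:
            records = parse_measurements(handle)
        if not records:
            raise ValueError("input file contains no records")
        run_checked(command)
    except (OSError, ValueError, RuntimeError) as error:
        # Error message goes to stderr; the caller turns 1 into the exit code.
        print(f"error: {error}", file=sys.stderr)
        return 1
    # Simple statistics on the parsed data help to spot suspicious inputs.
    values = [value for _, value in records]
    print(f"{len(records)} records, min={min(values)}, max={max(values)}")
    return 0
```

A script built on this sketch would end with a call such as sys.exit(main(sys.argv[1], ["sort", sys.argv[1]])), so that any failed check produces a non-zero exit code that other scripts in the pipeline can test in turn.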