1-DAV-202 Data Management 2023/24
Previously 2-INF-185 Data Source Integration

Materials · Introduction · Rules · Contact
· Grades from marked homeworks are on the server in file /grades/userid.txt
· Dates of project submission and oral exams:
Early: submit project May 24 9:00am, oral exams May 27 1:00pm (limit 5 students).
Otherwise submit project June 11, 9:00am, oral exams June 18 and 21 (estimated 9:00am-1:00pm, schedule will be published before exam).
Sign up for one the exam days in AIS before June 11.
Remedial exams will take place in the last week of the exam period. Beware, there will not be much time to prepare a better project. Projects should be submitted as homeworks to /submit/project.
· Cloud homework is due on May 20 9:00am.


Difference between revisions of "Skriptá"

From MAD
Jump to navigation Jump to search
(Created page with "<!-- NOTEX --> Website for 2018/19 * #Kontakt * #Úvod * #Pravidlá {| |- | 2019-02-21 || (BB) Introduction to Perl Lecture 1, [[#HWperl1|Homework 1]...")
 
 
(15 intermediate revisions by 2 users not shown)
Line 1: Line 1:
 
<!-- NOTEX -->
 
<!-- NOTEX -->
Website for 2018/19
+
Website for 2019/20
  
 
* [[#Kontakt]]
 
* [[#Kontakt]]
 
* [[#Úvod]]
 
* [[#Úvod]]
 
* [[#Pravidlá]]
 
* [[#Pravidlá]]
 +
* [[#Projekt]]
  
 
{|
 
{|
 
|-
 
|-
| 2019-02-21 || (BB) Introduction to Perl [[#Lperl1|Lecture 1]], [[#HWperl1|Homework 1]]
+
| 2019-02-20 || (TV) Introduction to Perl (basics, input processing) [[#Lperl|Lecture]], [[#HWperl|Homework]]
 
|-
 
|-
| 2019-02-28 || (BB) Command-line tools, Perl one-liners [[#L02|Lecture 2]], [[#HW02|Homework 2]]
+
| 2019-02-27 || (TV) Command-line tools, Perl one-liners [[#Lbash|Lecture]], [[#HWbash|Homework]]
 
|-
 
|-
| 2019-03-07 || (BB) Job scheduling and make [[#L03|Lecture 3]], [[#HW03|Homework 3]]
+
| 2019-03-05 || (BB) Job scheduling and make [[#Lmake|Lecture]], [[#HWmake|Homework]]
 
|-
 
|-
| 2019-03-14 || (BB) Python and SQL for beginners [[#L04|Lecture 4]], [[#HW04|Homework 4]]
+
| 2019-03-12 || (BB) Python and SQL for beginners [[#Lpython|Lecture]], [[#HWpython|Homework]]
 
|-
 
|-
| 2019-03-21 || (VB) Python, web crawling, HTML parsing, sqlite3 [[#L05inf|Lecture 5 inf]], [[#HW05inf|Homework 5 inf]]
+
| 2019-03-19 || (VB) Python, web crawling, HTML parsing, sqlite3 [[#Lweb|Lecture INF]], [[#HWweb|Homework INF]]  
 
|-
 
|-
| || (BB) Bioinformatics 1 (genome assembly) [[#L05bin|Lecture 5 bin]], [[#HW05bin|Homework 5 bin]]
+
| || (BB) Bioinformatics 1 (genome assembly) [[#Lbioinf1|Lecture BIN]], [[#HWbioinf1|Homework BIN]]
 
|-
 
|-
| 2019-03-28 || (VB) Text data processing, flask [[#L06inf|Lecture 6 inf]], [[#HW06inf|Homework 6 inf]]  
+
| 2019-03-26 || (VB) Text data processing, flask [[#Lflask|Lecture INF]], [[#HWflask|Homework INF]]
 
|-
 
|-
| || (BB) Bioinformatics 2 (gene finding, RNA-seq) [[#L06bin|Lecture 6 bin]], [[#HW06bin|Homework 6 bin]]
+
| || (BB) Bioinformatics 2 (gene finding, RNA-seq) [[#Lbioinf2|Lecture BIN]], [[#HWbioinf2|Homework BIN]]  
 
|-
 
|-
| 2019-04-04 || (VB) Data visualization in JavaScript [[#L07inf|Lecture 7 inf]], [[#HW07inf|Homework inf]]
+
| 2019-04-02 || (VB) Data visualization in JavaScript [[#Ljavascript|Lecture INF]], [[#HWjavascript|Homework INF]]
 
|-
 
|-
| || (BB) Bioinformatics 3 (polymorphisms) [[#L07bin|Lecture 7 bin]], [[#HW07bin|Homework 7 bin]]
+
| || (BB) Bioinformatics 3 (polymorphisms) [[#Lbioinf3|Lecture BIN]], [[#HWbioinf3|Homework BIN]]
 
|-
 
|-
| 2019-04-11 || (BB) R, part 1 [[#L08|Lecture 8]], [[#HW08|Homework 8]]
+
| 2019-04-09 || Easter
 
|-
 
|-
| 2019-04-18 || Easter (project proposals due Wednesday April 17)
+
| 2019-04-16 || (BB) R, part 1  [[#Lr1|Lecture]], [[#HWr1|Homework]]
 
|-
 
|-
| 2019-04-25 || (BB) no lecture
+
| 2019-04-23 || (BB) R, part 2 [[#Lr2|Lecture]], [[#HWr2|Homework]]
 
|-
 
|-
| 2019-05-02 || (BB) R, part 2 [[#L09|Lecture 9]], [[#HW09|Homework 9]]
+
| 2019-04-30 || (VB) Cloud computing [[#Lcloud|Lecture]], [[#HWcloud|Homework]]
 
|-
 
|-
| 2019-05-09 || (VB) Cloud computing [[#L10|Lecture 10]], [[#HW10|Homework 10]]
+
| 2019-05-07 || Reserve, work on projects
 
|-
 
|-
| 2019-05-16 || no lecture
+
| 2019-05-14 || Reserve, work on projects
 
|}
 
|}
 
=Kontakt=
 
=Kontakt=
Line 44: Line 45:
  
 
* [http://compbio.fmph.uniba.sk/~bbrejova/ doc. Mgr. Broňa Brejová, PhD.]  miestnosť M-163 <!-- , [[Image:e-bb.png]] -->
 
* [http://compbio.fmph.uniba.sk/~bbrejova/ doc. Mgr. Broňa Brejová, PhD.]  miestnosť M-163 <!-- , [[Image:e-bb.png]] -->
* [http://compbio.fmph.uniba.sk/~tvinar/ Mgr. Tomáš Vinař, PhD.], miestnosť M-163 <!-- , [[Image:e-tv.png]] -->
+
* [http://compbio.fmph.uniba.sk/~tvinar/ doc. Mgr. Tomáš Vinař, PhD.], miestnosť M-163 <!-- , [[Image:e-tv.png]] -->
 
* [http://dai.fmph.uniba.sk/w/Vladimir_Boza/sk Mgr. Vladimír Boža, PhD.], miestnosť M-25 <!-- , [[Image:e-vb.png]] -->
 
* [http://dai.fmph.uniba.sk/w/Vladimir_Boza/sk Mgr. Vladimír Boža, PhD.], miestnosť M-25 <!-- , [[Image:e-vb.png]] -->
 
<!-- * [http://dai.fmph.uniba.sk/~siska/ RNDr. Jozef Šiška, PhD.], miestnosť I-7 -->
 
<!-- * [http://dai.fmph.uniba.sk/~siska/ RNDr. Jozef Šiška, PhD.], miestnosť I-7 -->
Line 51: Line 52:
 
'''Rozvrh'''
 
'''Rozvrh'''
 
*  Štvrtok 15:40-18:00 M-217
 
*  Štvrtok 15:40-18:00 M-217
 +
 
<!-- /NOTEX -->
 
<!-- /NOTEX -->
  
 
=Introduction=
 
=Introduction=
 
==Target audience==
 
==Target audience==
This course is offered at the Faculty of Matematics, Physics and Informatics, Comenius University in Bratislava for the students of the second year of the bachelor Bionformatics study program and the students of the bachelor and master Computer Science study programs. It is a prerequisite of the master-level state exams in Bioinformatics and Machine Learning. However, the course is open to students from other study programs if they satisfy the following informal prerequisites.  
+
This course is offered at the Faculty of Mathematics, Physics and Informatics, Comenius University in Bratislava for the students of the bachelor Data Science, Computer Science and Bioinformatics study programs and the students of the master Computer Science study program. It is a prerequisite of the master-level state exams in Bioinformatics and Machine Learning. However, the course is open to students from other study programs if they satisfy the following informal prerequisites.
 +
 
 +
We assume that the students are proficient in programming in at least one programming language and are not afraid to learn new languages. We also assume basic knowledge of work on the Linux command-line (at least basic commands for working with files and folders, such as cd, mkdir, cp, mv, rm, chmod). The basic use of command-line tools can be learned for example by using [http://korflab.ucdavis.edu/bootcamp.html a tutorial by Ian Korf].  
  
We assume that the students are proficient in programming in at least one programming language and are not afraid to learn new languages. We also assume basic knowledge of work on the Linux command-line (at least basic commands for working with files and folders, such as cd, mkdir, cp, mv, rm, chmod). Although most technologies covered in this course can be used for processing data from many apication areas, we will illustrate some of them on examples from bioinformatics. We will explain necessary terminology from biology as needed.
+
Although most technologies covered in this course can be used for processing data from many application areas, we will illustrate some of them on examples from bioinformatics. We will explain necessary terminology from biology as needed.
  
Basics of command-line tools can be learned for example by using [http://korflab.ucdavis.edu/bootcamp.html a tutorial by Ian Korf (http://korflab.ucdavis.edu/bootcamp.html)].
 
  
 
==Course objectives==
 
==Course objectives==
  
Computer science courses cover many interesting algorithms, models and methods that can used for data analysis. However, when you want to use these methods for real data, you will typically need to make considerable efforts to obtain the data, pre-process it into a suitable form, test and compare different methods or settings, and arrange the final results in informative tables and graphs. Often, these activities need to be repeated for different inputs, different settings, and so on. For example in bioinformatics, it is possible to find a job where your main task will be data processing using existing tools, possibly supplemented by small custom scripts. This course will cover some programming languages and technologies suitable for these activities.
+
Computer science courses cover many interesting algorithms, models and methods that can used for data analysis. However, when you want to use these methods for real data, you will typically need to make considerable efforts to obtain the data, pre-process it into a suitable form, test and compare different methods or settings, and arrange the final results in informative tables and graphs. Often, these activities need to be repeated for different inputs, different settings, and so on. For example the main task for many bioinformaticians is data processing using existing tools, possibly supplemented by small custom scripts. This course will cover some programming languages and technologies suitable for these activities.
  
This course is particularly recommended for students whose bachelor or master thesis involves substantial empirical experiments (e.g. experimental evaluation of your methods and comparison with other methods on real or simulated data).  
+
This course is particularly recommended for students whose bachelor or master theses involve substantial empirical experiments (e.g. experimental evaluation of your methods and comparison with other methods on real or simulated data).  
  
 
==Basic guidelines for working with data ==
 
==Basic guidelines for working with data ==
  
As you know, in programming it is recommended to adhere to certain practices, such as good coding style, modular desgn, thorough testing etc. Such practices add a little extra work, but are much more efficient in the long run. Similar good practices exist for data analysis. As an introduction we recommend the following article by a well-known bionformatician William Stafford Noble (his advice applies outside of bionformatics as well):
+
As you know, in programming it is recommended to adhere to certain practices, such as good coding style, modular design, thorough testing etc. Such practices add a little extra work, but are much more efficient in the long run. Similar good practices exist for data analysis. As an introduction we recommend the following article by a well-known bioinformatician William Stafford Noble (his advice applies outside of bioinformatics as well):
 
* Noble WS. [http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1000424 A quick guide to organizing computational biology projects.] PLoS Comput Biol. 2009 Jul 31;5(7):e1000424.
 
* Noble WS. [http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1000424 A quick guide to organizing computational biology projects.] PLoS Comput Biol. 2009 Jul 31;5(7):e1000424.
  
Line 75: Line 78:
 
*  Noble 2009: '''"Everything you do, you will probably have to do over again."'''
 
*  Noble 2009: '''"Everything you do, you will probably have to do over again."'''
 
** After doing an entire analysis, you often find out that there was a problem with the input data or one of the early steps and therefore everything needs to be redone
 
** After doing an entire analysis, you often find out that there was a problem with the input data or one of the early steps and therefore everything needs to be redone
** Therefore it is better to use techniques that allow you to keep all detailes of your workflow and to repeat them if needed
+
** Therefore it is better to use techniques that allow you to keep all details of your workflow and to repeat them if needed
** Try to avoid manually changing files, because this makes reruning analyses harder and more error-prone
+
** Try to avoid manually changing files, because this makes rerunning analyses harder and more error-prone
  
 
* '''Document all steps of your analysis'''
 
* '''Document all steps of your analysis'''
 
** Note what have you done, why have you done it, what was the result
 
** Note what have you done, why have you done it, what was the result
 
** Some of these things may seem obvious to you at present, but you may forgot them in a few weeks or months and you may need them to write up your thesis or to repeat the analysis
 
** Some of these things may seem obvious to you at present, but you may forgot them in a few weeks or months and you may need them to write up your thesis or to repeat the analysis
** Good documentation is also indispensable for collaorative projects
+
** Good documentation is also indispensable for collaborative projects
  
  
Line 89: Line 92:
  
 
* '''Try to detect problems in the data'''
 
* '''Try to detect problems in the data'''
** Often big files may hide some problems in the format, unexpected values etc.  These may confuse your programs and make the results meaningless
+
** Big files may hide some problems in the format, unexpected values etc.  These may confuse your programs and make the results meaningless
 
** In your scripts, check that the input data conform to your expectations (format, values in reasonable ranges etc)
 
** In your scripts, check that the input data conform to your expectations (format, values in reasonable ranges etc)
 
** In unexpected circumstances, scripts should terminate with an error message and a non-zero exit code
 
** In unexpected circumstances, scripts should terminate with an error message and a non-zero exit code
 
** If your script executes another program, check its exit code
 
** If your script executes another program, check its exit code
** Also check intermedate results as often as possible (by manual inspection, computing various statistics etc) to detect errors in the data and your code
+
** Also check intermediate results as often as possible (by manual inspection, computing various statistics etc) to detect errors in the data and your code
 
 
<!-- TEX
 
==Software requirements and the accompanying data==
 
* In this course, the students are given access to a Linux server with all necessary tools installed
 
* All the tools are freely available and most of them can be easily installed e.g. as Ubuntu packages
 
* The server also contains data needed for the practice tasks, but this data can be also obtained from an accompanying website.
 
* In the text below, replace /tasks/ with the path to your copy of the accompanying data
 
/TEX -->
 
  
 
<!-- NOTEX -->
 
<!-- NOTEX -->
Line 130: Line 125:
 
===Protokoly===
 
===Protokoly===
 
* Väčšinou bude požadovanou súčasťou úlohy textový dokument nazvaný protokol.
 
* Väčšinou bude požadovanou súčasťou úlohy textový dokument nazvaný protokol.
* Protokol môže byť vo formáte .txt alebo .pdf a jeho meno má byť '''protocol.pdf''' alebo '''protocol.txt''' (nakopírujte ho do odovzdaného adresára)
+
* Protokol píšte v txt formáte a súbor nazvite '''protocol.txt''' (nakopírujte ho do odovzdaného adresára)
 
* Protokol môže byť po slovensky alebo po anglicky.
 
* Protokol môže byť po slovensky alebo po anglicky.
* V prípade použitia txt formátu a diakritiky ju kódujte v UTF8, ale pre jednoduchosť môžete protokoly písať aj bez diakritiky. Ak je protocol v pdf formáte, mali by sa v ňom dať selektovať texty.  
+
* Ak píšete s diakritikou, použite kódovanie UTF8, ale pre jednoduchosť môžete protokoly písať aj bez diakritiky.  
 
* Vo väčšine úloh dostanete kostru protokolu, dodržujte ju.
 
* Vo väčšine úloh dostanete kostru protokolu, dodržujte ju.
  
 
'''Hlavička protokolu, vyhodnotenie'''
 
'''Hlavička protokolu, vyhodnotenie'''
* Na vrchu protokolu uveďte meno, číslo domácej úluhy a vaše vyhodnotenie toho, ako sa vám úlohu podarilo vyriešiť. Vyhodnotenie je prehľadný zoznam všetkých príkladov zo zadania, ktoré ste aspoň začali riešiť a kódov označujúcich ich stupeň dokončenia:
+
* Na vrchu protokolu uveďte názov domácej úluhy a vaše vyhodnotenie toho, ako sa vám úlohu podarilo vyriešiť. Vyhodnotenie je prehľadný zoznam všetkých príkladov zo zadania, ktoré ste aspoň začali riešiť a kódov označujúcich ich stupeň dokončenia:
 
** kód HOTOVO uveďte, ak si myslíte, že tento príklad máte úplne a správne vyriešený
 
** kód HOTOVO uveďte, ak si myslíte, že tento príklad máte úplne a správne vyriešený
 
** kód ČASŤ uveďte, ak ste nevyriešili príklad celý a do poznámky za kód stručne uveďte, čo máte hotové a čo nie, prípadne ktorými časťami si nie ste istí.
 
** kód ČASŤ uveďte, ak ste nevyriešili príklad celý a do poznámky za kód stručne uveďte, čo máte hotové a čo nie, prípadne ktorými časťami si nie ste istí.
Line 147: Line 142:
 
* Ak nie je v zadaní určené inak, protokol by mal obsahovať nasledovné údaje:
 
* Ak nie je v zadaní určené inak, protokol by mal obsahovať nasledovné údaje:
 
** '''Zoznam odovzdaných súborov:''' o každom súbore uveďte jeho význam a či ste ho vyrobili ručne, získali z externých zdrojov alebo vypočítali nejakým programom. Ak máte väčšie množstvo súborov so systematickým pomenovaním, stačí vysvetliť schému názvov všeobecne. Súbory, ktorých mená sú špecifikované v zadaní, nemusíte v zozname uvádzať.
 
** '''Zoznam odovzdaných súborov:''' o každom súbore uveďte jeho význam a či ste ho vyrobili ručne, získali z externých zdrojov alebo vypočítali nejakým programom. Ak máte väčšie množstvo súborov so systematickým pomenovaním, stačí vysvetliť schému názvov všeobecne. Súbory, ktorých mená sú špecifikované v zadaní, nemusíte v zozname uvádzať.
** '''Postupnosť všetkých spustených príkazov,''' prípadne iných krokov, ktorými ste dospeli k získaným výsledkom. Tu uvádzajte príkazy na spracovanie dát a spúšťanie vašich či iných programov. Netreba uvádzať príkazy súvisiace so samotným programovaním (spúšťanie editora, nastavenie práv na spustenie a pod.), s kopírovaním úlohy na server a pod. Uveďte aj stručné '''komentáre''', čo bolo účelom určitého príkazu alebo skupiny príkazov.
+
** '''Postupnosť všetkých spustených príkazov,''' prípadne iných krokov, ktorými ste dospeli k získaným výsledkom. Tu uvádzajte príkazy na spracovanie dát a spúšťanie vašich či iných programov. Netreba uvádzať príkazy súvisiace so samotným programovaním (spúšťanie editora, nastavenie práv na spustenie a pod.), s kopírovaním úlohy na server a pod. Pri zložitejších príkazoch uveďte aj stručné '''komentáre''', čo bolo účelom určitého príkazu alebo skupiny príkazov.
 
** '''Zoznam zdrojov:''' webstránky a pod., ktoré ste pri riešení úlohy použili. Nemusíte uvádzať webstránku predmetu a zdroje odporučené priamo v zadaní.
 
** '''Zoznam zdrojov:''' webstránky a pod., ktoré ste pri riešení úlohy použili. Nemusíte uvádzať webstránku predmetu a zdroje odporučené priamo v zadaní.
 
Celkovo by protokol mal umožniť čitateľovi zorientovať sa vo vašich súboroch a tiež v prípade záujmu vykonať rovnaké výpočty, akými ste dospeli vy k výsledku. Nemusíte písať slohy, stačia zrozumiteľné a prehľadné heslovité poznámky.
 
Celkovo by protokol mal umožniť čitateľovi zorientovať sa vo vašich súboroch a tiež v prípade záujmu vykonať rovnaké výpočty, akými ste dospeli vy k výsledku. Nemusíte písať slohy, stačia zrozumiteľné a prehľadné heslovité poznámky.
Line 153: Line 148:
 
==Projekty==
 
==Projekty==
  
Cieľom projektu je vyskúšať si naučené zručnosti na konkrétnom projekte spracovania dát. Vašou úlohou je zohnať si dáta, tieto dáta analyzovať niektorými technikami z prednášok, prípadne aj inými technológiami a získané výsledky zobraziť v prehľadných grafoch a tabuľkách. Ideálne je, ak sa vám podarí prísť k zaujímavým alebo užitočným záverom, ale hodnotiť budeme hlavne voľbu vhodného postupu a jeho technickú náročnosť. Rozsah samotného programovania alebo analýzy dát by mal zodpovedať zhruba trom domácim úlohám, ale celkovo bude projekt náročnejší, lebo na rozdiel od úloh nemáte postup a dáta vopred určené, ale musíte si ich vymyslieť sami a nie vždy sa prvý nápad ukáže ako správny. V projekte môžete využiť aj existujúce nástroje a knižnice, ale pokiaľ možno používajte nástroje spúšťané na príkazovom riadku.  
+
Cieľom projektu je vyskúšať si naučené zručnosti na konkrétnom projekte spracovania dát. Vašou úlohou je zohnať si dáta, tieto dáta analyzovať niektorými technikami z prednášok, prípadne aj inými technológiami a získané výsledky zobraziť v prehľadných grafoch a tabuľkách.
 +
 
 +
* Zhruba v dvoch tretinách semestra budete odovzdávať krátky návrh projektu
 +
* Cez skúškové obdobie bude určený termín odovzdania projektu (vrátane písomnej správy)
 +
* Projekty môžete robiť aj vo dvojici, vtedy však vyžadujeme rozsiahlejší projekt a každý člen by mal byť primárne zodpovedný za určitú časť projektu
 +
* Po odovzdaní projektov sa bude konať ešte diskusia o projekte s vyučujúcimi, ktorá môže ovplyvniť vaše body z projektu.
 +
 
 +
Viac informácií o projektoch na [[#Projekt|zvláštnej stránke]]
 +
 
 +
==Opisovanie==
 +
 
 +
* Máte povolené sa so spolužiakmi a ďalšími osobami rozprávať o domácich úlohách resp. projektoch a stratégiách na ich riešenie. Kód, získané výsledky aj text, ktorý odovzdáte, musí však byť vaša samostatná práca. Je zakázané ukazovať svoj kód alebo texty spolužiakom.
 +
 
 +
* Pri riešení domácej úlohy a projektu očakávame, že budete využívať internetové zdroje, najmä rôzne manuály a diskusné fóra k preberaným technológiám. Nesnažte sa však nájsť hotové riešenia zadaných úloh. Všetky použité zdroje uveďte v domácich úlohách a projektoch.
 +
 
 +
* Ak nájdeme prípady opisovania alebo nepovolených pomôcok, všetci zúčastnení študenti získajú za príslušnú domácu úlohu, projekt a pod. nula bodov (t.j. aj tí, ktorí dali spolužiakom odpísať) a prípad ďalej podstúpime na riešenie disciplinárnej komisii fakulty.
  
Zhruba v dvoch tretinách semestra budete odovzdávať '''návrh projektu''' (formát txt alebo pdf, rozsah 0.5-1 strana). V tomto návrhu uveďte, aké dáta budete spracovávať, ako ich zoženiete, čo je cieľom analýzy a aké technológie plánujete použiť. Ciele a technológie môžete počas práce na projekte mierne pozmeniť podľa okolností, mali by ste však mať počiatočnú predstavu. K návrhu vám dáme spätnú väzbu, pričom v niektorých prípadoch môže byť potrebné tému mierne alebo úplne zmeniť. Za načas odovzdaný vhodný návrh projektu získate 5% z celkovej známky. Návrh odporúčame pred odovzdaním konzultovať s vyučujúcimi.
+
==Zverejňovanie==
  
Cez skúškové obdobie bude určený termín '''odovzdania projektu'''.  Podobne ako pri domácich úlohách odovzdávajte adresár s požadovanými súbormi:
+
Zadania a materiály k predmetu sú voľne prístupné na tejto stránke. Prosím vás ale, aby ste nezverejňovali ani inak nešírili vaše riešenia domácich úloh, ak nie je v zadaní povedané inak. Vaše projekty môžete zverejniť, pokiaľ to nie je v rozpore s vašou dohodou so zadávateľom projektu a poskytovateľom dát.
 +
=Projekt=
 +
Cieľom projektu je vyskúšať si naučené zručnosti na konkrétnom projekte spracovania dát. Vašou úlohou je zohnať si dáta, tieto dáta analyzovať niektorými technikami z prednášok, prípadne aj inými technológiami a získané výsledky zobraziť v prehľadných grafoch a tabuľkách. Ideálne je, ak sa vám podarí prísť k zaujímavým alebo užitočným záverom, ale hodnotiť budeme hlavne voľbu vhodného postupu a jeho technickú náročnosť. Rozsah samotného programovania alebo analýzy dát by mal zodpovedať zhruba trom domácim úlohám, ale celkovo bude projekt náročnejší, lebo na rozdiel od úloh nemáte postup a dáta vopred určené, ale musíte si ich vymyslieť sami a nie vždy sa prvý nápad ukáže ako správny.
 +
 
 +
V projekte môžete využiť aj existujúce nástroje a knižnice, ale dôraz by mal byť na nástrojoch spúšťaných na príkazovom riadku a využití technológií preberaných na predmete. Pri prototypovaní vášho nástroja a vytváraní vizualizácií do záverečnej správy sa vám môže dobre pracovať v interaktívnych prostrediach, ako napríklad Jupyter notebook, ale v odovzdanej verzii projektu by sa mala dať väčšia časť kódu spustiť zo samostatných skriptov spustiteľných na príkazovom riadku, potenciálne s výnimkou samotnej vizualizácie, ktorá môže zostať ako notebook alebo interaktívna webstránka (flask).
 +
 
 +
==Návrh projektu==
 +
Zhruba v dvoch tretinách semestra budete odovzdávať návrh projektu v rozsahu asi pol strany. V tomto návrhu uveďte, aké dáta budete spracovávať, ako ich zoženiete, čo je cieľom analýzy a aké technológie plánujete použiť. Ciele a technológie môžete počas práce na projekte mierne pozmeniť podľa okolností, mali by ste však mať počiatočnú predstavu. K návrhu vám dáme spätnú väzbu, pričom v niektorých prípadoch môže byť potrebné tému mierne alebo úplne zmeniť. Za načas odovzdaný vhodný návrh projektu získate 5% z celkovej známky. Návrh odporúčame pred odovzdaním konzultovať s vyučujúcimi.
 +
 
 +
'''Odovzdávanie:''' súbor vo formáte txt alebo pdf nakopírujte do </tt>/submit/navrh/username</tt> na serveri.
 +
 
 +
==Odovzdanie projektov==
 +
Cez skúškové obdobie bude určený termín odovzdania projektu.  Podobne ako pri domácich úlohách odovzdávajte adresár s požadovanými súbormi:
 
* Vaše '''programy a súbory s dátami''' (veľmi veľké dátové súbory vynechajte)  
 
* Vaše '''programy a súbory s dátami''' (veľmi veľké dátové súbory vynechajte)  
 
* '''Protokol''' podobne ako pri domácich úlohách  
 
* '''Protokol''' podobne ako pri domácich úlohách  
 
** formát txt alebo pdf, stručné heslovité poznámky
 
** formát txt alebo pdf, stručné heslovité poznámky
 
** obsahuje zoznam súborov, podrobný postup pri analýze dát (spustené príkazy), ako aj použité zdroje (dáta, programy, dokumentácia a iná literatúra atď)
 
** obsahuje zoznam súborov, podrobný postup pri analýze dát (spustené príkazy), ako aj použité zdroje (dáta, programy, dokumentácia a iná literatúra atď)
* '''Správu k projektu''' vo formáte pdf. Na rozdiel od menej formálneho protokolu by správu mal tvoriť súvislý text v odbornom štýle, podobne ako napr. záverečné práce. Môžete písať po slovensky alebo po anglicky, ale pokiaľ možno gramaticky správne. Správa by mala mať tieto časti:
+
* '''Správu k projektu''' vo formáte pdf. Na rozdiel od menej formálneho protokolu by správu mal tvoriť súvislý text v odbornom štýle, podobne ako napr. záverečné práce. Môžete písať po slovensky alebo po anglicky, ale pokiaľ možno gramaticky správne. Správa by mala obsahovať:
 
** úvod, v ktorom vysvetlíte ciele projektu, prípadne potrebné poznatky zo skúmanej oblasti a aké dáta ste mali k dispozícii
 
** úvod, v ktorom vysvetlíte ciele projektu, prípadne potrebné poznatky zo skúmanej oblasti a aké dáta ste mali k dispozícii
 
** stručný popis metód, v ktorom neuvádzajte detailne jednotlivé kroky, skôr prehľad použitého prístupu a jeho zdôvodnenie
 
** stručný popis metód, v ktorom neuvádzajte detailne jednotlivé kroky, skôr prehľad použitého prístupu a jeho zdôvodnenie
Line 170: Line 192:
 
Projekty môžete robiť aj vo '''dvojici''', vtedy však vyžadujeme rozsiahlejší projekt a každý člen by mal byť primárne zodpovedný za určitú časť projektu, čo uveďte aj v správe. Dvojice odovzdávajú jednu správu, ale po odovzdaní projektu majú stretnutie s vyučujúcimi individuálne.
 
Projekty môžete robiť aj vo '''dvojici''', vtedy však vyžadujeme rozsiahlejší projekt a každý člen by mal byť primárne zodpovedný za určitú časť projektu, čo uveďte aj v správe. Dvojice odovzdávajú jednu správu, ale po odovzdaní projektu majú stretnutie s vyučujúcimi individuálne.
  
Ako nájsť tému projektu:
+
==Typické časti projektu==
 +
Väčšina projektov obsahuje nasledujúce kroky, ktoré by sa mali premietnuť aj v správe
 +
* '''Získanie dát.''' Toto môže byť ľahké, ak vám dáta niekto priamo dá alebo ich stiahnete ako jeden súbor z internetu, alebo náročnejšie, napríklad ak ich parsujete z veľkého množstva súborov alebo webstránok. Nezabudnite na (aspoň námatkovú) kontrolu, či sa vám podarilo dáta stiahnuť správne. V správe by malo byť jasne uvedené, kde a ako ste dáta získali.
 +
* '''Predspracovanie dát do vhodného tvaru.''' Táto etapa zahŕňa parsovanie vstupných formátov, vyberanie užitočných dát, ich kontrola, odfiltrovanie nevhodných alebo neúplných položiek a podobne. Dáta si uložte do súboru alebo databázy vo vhodnom tvare, v ktorom sa vám s nimi bude dobre ďalej pracovať. Nezabudnite na kontrolu, či dáta vyzerajú byť v poriadku a spočítajte základné štatistiky, napríklad celkový počet záznamov, rozsahy rozličných atribútov a podobne, ktoré môžu vám aj čitateľovi správy ilustrovať, aký je charakter dát.
 +
* '''Ďalšie analýzy na dátach a vizualizácia výsledkov.''' V tejto fáze sa pokúste v dátach nájsť niečo zaujímavé alebo užitočné pre zadávateľa projektu. Výsledkom môžu byť statické grafy a tabuľky, alebo aj interaktívna webstránka (flask). Aj v prípade interaktívnej webstránky však aspoň niektoré výsledky uveďte aj v správe.
 +
Ak sa váš projekt od týchto krokov výrazne odlišuje, poraďte sa s vyučujúcimi.
 +
 
 +
==Vhodné témy projektov==
 
* Môžete spracovať nejaké dáta, ktoré potrebujete do bakalárskej alebo diplomovej práce, prípadne aj dáta, ktoré potrebujte na iný predmet (v tom prípade uveďte v správe, o aký predmet ide a takisto upovedomte aj druhého vyučujúceho, že ste použili spracovanie dát ako projekt pre tento predmet). Obzvlášť pre BIN študentov môže byť tento predmet vhodnou príležitosťou nájsť si tému bakalárskej práce a začať na nej pracovať.
 
* Môžete spracovať nejaké dáta, ktoré potrebujete do bakalárskej alebo diplomovej práce, prípadne aj dáta, ktoré potrebujte na iný predmet (v tom prípade uveďte v správe, o aký predmet ide a takisto upovedomte aj druhého vyučujúceho, že ste použili spracovanie dát ako projekt pre tento predmet). Obzvlášť pre BIN študentov môže byť tento predmet vhodnou príležitosťou nájsť si tému bakalárskej práce a začať na nej pracovať.
 
* Môžete skúsiť zopakovať analýzu spravenú v nejakom vedeckom článku a overiť, že dostanete tie isté výsledky. Vhodné je tiež skúsiť analýzu aj mierne obmeniť (spustiť na iné dáta, zmeniť nejaké nastavenia, zostaviť aj iný typ grafu a pod.)
 
* Môžete skúsiť zopakovať analýzu spravenú v nejakom vedeckom článku a overiť, že dostanete tie isté výsledky. Vhodné je tiež skúsiť analýzu aj mierne obmeniť (spustiť na iné dáta, zmeniť nejaké nastavenia, zostaviť aj iný typ grafu a pod.)
 
* Môžete skúsiť nájsť niekoho, kto má dáta, ktoré by potreboval spracovať, ale nevie ako na to (môže ísť o biológov, vedcov z iných oblastí, ale aj neziskové organizácie a pod.) V prípade, že takýmto spôsobom kontaktujete tretie osoby, bolo by vhodné pracovať na projekte obzvlášť zodpovedne, aby ste nerobili zlé meno našej fakulte.
 
* Môžete skúsiť nájsť niekoho, kto má dáta, ktoré by potreboval spracovať, ale nevie ako na to (môže ísť o biológov, vedcov z iných oblastí, ale aj neziskové organizácie a pod.) V prípade, že takýmto spôsobom kontaktujete tretie osoby, bolo by vhodné pracovať na projekte obzvlášť zodpovedne, aby ste nerobili zlé meno našej fakulte.
 
* V projekte môžete porovnávať niekoľko programov na tú istú úlohu z hľadiska ich rýchlosti či presnosti výsledkov. Obsahom projektu bude príprava dát, na ktorých budete programy bežať, samotné spúšťanie (vhodne zoskriptované) ako aj vyhodnotenie výsledkov.
 
* V projekte môžete porovnávať niekoľko programov na tú istú úlohu z hľadiska ich rýchlosti či presnosti výsledkov. Obsahom projektu bude príprava dát, na ktorých budete programy bežať, samotné spúšťanie (vhodne zoskriptované) ako aj vyhodnotenie výsledkov.
* A samozrejme môžete niekde na internete vyhrabať zaujímavé dáta a snažiť sa z nich niečo vydolovať.
+
* A samozrejme môžete niekde na internete vyhrabať zaujímavé dáta a snažiť sa z nich niečo vydolovať. Študenti si často vyberajú témy súvisiace s ich koníčkami a aktivitami, napríklad športy, počítačové hry, programátorské súťaže a iné.
 +
<!-- /NOTEX -->
  
==Opisovanie==
+
=Lperl=
 +
This lecture is a brief introduction to the Perl scripting language. More information can be found below (section [[#Sources of Perl-related information]]). We recommend revisiting necessary parts of this lecture while working on the exercises.
  
* Máte povolené sa so spolužiakmi a ďalšími osobami rozprávať o domácich úlohách resp. projektoch a stratégiách na ich riešenie. Kód, získané výsledky aj text, ktorý odovzdáte, musí však byť vaša samostatná práca. Je zakázané ukazovať svoj kód alebo texty spolužiakom.  
+
==Why Perl==
 +
* From [https://en.wikipedia.org/wiki/Perl Wikipedia:] It has been nicknamed "the Swiss Army chainsaw of scripting languages" because of its flexibility and power, and possibly also because of its "ugliness".
  
* Pri riešení domácej úlohy a projektu očakávame, že budete využívať internetové zdroje, najmä rôzne manuály a diskusné fóra k preberaným technológiám. Nesnažte sa však nájsť hotové riešenia zadaných úloh. Všetky použité zdroje uveďte v domácich úlohách a projektoch.  
+
Official slogans:
 +
* There's more than one way to do it.
 +
* Easy things should be easy and hard things should be possible.
  
* Ak nájdeme prípady opisovania alebo nepovolených pomôcok, všetci zúčastnení študenti získajú za príslušnú domácu úlohu, projekt a pod. nula bodov (t.j. aj tí, ktorí dali spolužiakom odpísať) a prípad ďalej podstúpime na riešenie disciplinárnej komisii fakulty.
+
Advantages
 
 
==Zverejňovanie==
 
 
 
Zadania a materiály k predmetu sú voľne prístupné na tejto stránke. Prosím vás ale, aby ste nezverejňovali ani inak nešírili vaše riešenia domácich úloh, ak nie je v zadaní povedané inak. Vaše projekty môžete zverejniť, pokiaľ to nie je v rozpore s vašou dohodou so zadávateľom projektu a poskytovateľom dát.
 
<!-- /NOTEX -->
 
 
 
=Lperl1=
 
This lecture is an introduction to the Perl scripting language.
 
<!-- NOTEX -->
 
We will quickly go through some language features, please read the rest of text as you work on [[#HWperl1]].
 
<!-- /NOTEX -->
 
 
 
==Why Perl==
 
* From [https://en.wikipedia.org/wiki/Perl Wikipedia:] It has been nicknamed "the Swiss Army chainsaw of scripting languages" because of its flexibility and power, and possibly also because of its "ugliness".
 
 
 
Official slogans:
 
* There's more than one way to do it
 
* Easy things should be easy and hard things should be possible
 
 
 
Advantages
 
 
* Good capabilities for processing text files, regular expressions, running external programs etc.
 
* Good capabilities for processing text files, regular expressions, running external programs etc.
 
* Closer to common programming languages than shell scripts
 
* Closer to common programming languages than shell scripts
Line 218: Line 232:
 
==Hello world==
 
==Hello world==
 
It is possible to run the code directly from a command line (more later):
 
It is possible to run the code directly from a command line (more later):
<pre>
+
<syntaxhighlight lang="bash">
 
perl -e'print "Hello world\n"'
 
perl -e'print "Hello world\n"'
</pre>
+
</syntaxhighlight>
  
 
This is equivalent to the following code stored in a file:
 
This is equivalent to the following code stored in a file:
<pre>
+
<syntaxhighlight lang="Perl">
 
#! /usr/bin/perl -w
 
#! /usr/bin/perl -w
 
use strict;
 
use strict;
 
print "Hello world!\n";
 
print "Hello world!\n";
</pre>
+
</syntaxhighlight>
  
 
* The first line is a path to the interpreter
 
* The first line is a path to the interpreter
* Swith <tt>-w</tt> switches warnings on, e.g. if we manipulate with an undefined value (equivalent to <tt>use warnings;</tt>)
+
* Switch <tt>-w</tt> switches warnings on, e.g. if we manipulate with an undefined value (equivalent to <tt>use warnings;</tt>)
 
* The second line <tt>use strict</tt> will switch on a more strict syntax checks, e.g. all variables must be defined
 
* The second line <tt>use strict</tt> will switch on a more strict syntax checks, e.g. all variables must be defined
 
* Use of <tt>-w</tt> and <tt>use strict</tt> is strongly recommended
 
* Use of <tt>-w</tt> and <tt>use strict</tt> is strongly recommended
Line 240: Line 254:
 
* It is also possible to run as <tt>perl hello.pl</tt> (e.g. if we don't have the path to the interpreter in the file or the executable bit is not set)
 
* It is also possible to run as <tt>perl hello.pl</tt> (e.g. if we don't have the path to the interpreter in the file or the executable bit is not set)
  
==The first input file for today: sequence repeats==
+
==The first input file for today: TV series==
* In genomes some sequences occur in many copies (often not exactly equal, only similar)
+
* [https://www.imdb.com/ IMDb] is an online database of movies and TV series with user ratings.
* We have downloaded a table containing such sequence repeats on chromosome 2L of the fruitfly ''Drosophila melanogaster''
+
* We have downloaded a preprocessed dataset of selected TV series ratings from [https://github.com/nazareno/imdb-series/ GitHub].
* It was done as follows: on webpage [http://genome.ucsc.edu/ http://genome.ucsc.edu/] we select drosophila genome, then in main menu select  Tools, Table browser, select group: variation and repeats, track: ReapatMasker, region: position chr2L, output format: all fields from the selected table a output file: repeats.txt
+
* From this dataset, we have selected only several series with a high number of voting users.
* Each line of the file contains data about one repeat in the selected chromosome. The first line contains column names. Columns are tab-separated.
+
 
* Here are the first two lines, each line split into three lines for better readability
+
* Each line of the file contains data about one episode of one series. Columns are tab-separated and contain the name of the series, the name of the episode, the global index of the episode within the series, the number of the season, the index of the episode with the season, rating of the episode and the number of voting users.
 +
* Here is a smaller version of this file with only six lines:
 
<pre>
 
<pre>
#bin    swScore milliDiv        milliDel        milliIns
+
Black Mirror The National Anthem 1 1 1 7.8 35156
  genoName        genoStart      genoEnd genoLeft        strand
+
Black Mirror Fifteen Million Merits 2 1 2 8.2 35317
  repName repClass        repFamily      repStart        repEnd  repLeft id
+
Black Mirror The Entire History of You 3 1 3 8.6 35266
585    778    167    7      20
+
Game of Thrones Winter Is Coming 1 1 1 9 27890
  chr2L  1       154    -23513558  +
+
Game of Thrones The Kingsroad 2 1 2 8.8 21414
  HETRP_DM        Satellite      Satellite      1519    1669    -203    1
+
Game of Thrones Lord Snow 3 1 3 8.7 20232
 
</pre>
 
</pre>
* The file can be found at our server under filename <tt>/tasks/perl1/repeats.txt</tt> (17185 lines)
+
* The smaller and the larger version of this file can be found at our server under filenames <tt>/tasks/perl/series-small.tsv</tt> and <tt>/tasks/perl/series.tsv</tt>
* A small randomly selected subset of the table rows is in file <tt>/tasks/perl1/repeats-small.txt</tt> (159 lines)
 
  
 
==A sample Perl program==
 
==A sample Perl program==
For each type of repeat (column 11 of the file when counting from 0) we want to compute the number of repeats of this type
+
For each series (column 0 of the file) we want to compute the number of episodes.
<pre>
+
<syntaxhighlight lang="Perl">
#!/usr/bin/perl -w
+
#! /usr/bin/perl -w
 
use strict;
 
use strict;
  
#associative array (hash), with repeat type as key
+
#associative array (hash), with series name as key
 
my %count;   
 
my %count;   
  
 
while(my $line = <STDIN>) {  # read every line on input
 
while(my $line = <STDIN>) {  # read every line on input
 
     chomp $line;    # delete end of line, if any
 
     chomp $line;    # delete end of line, if any
 
    if($line =~ /^#/) {  # skip commented lines
 
      next;      # similar to "continue" in C, move to next iteration
 
    }
 
  
 
     # split the input line to columns on every tab, store them in an array
 
     # split the input line to columns on every tab, store them in an array
 
     my @columns = split "\t", $line;   
 
     my @columns = split "\t", $line;   
  
     # check input - should have at least 17 columns
+
     # check input - should have 7 columns
     die "Bad input '$line'" unless @columns >= 17;
+
     die "Bad input '$line'" unless @columns == 7;
  
     my $type = $columns[11];
+
     my $series = $columns[0];
  
 
     # increase counter for this type
 
     # increase counter for this type
     $count{$type}++;
+
     $count{$series}++;
 
}
 
}
  
 
# write out results, types sorted alphabetically
 
# write out results, types sorted alphabetically
foreach my $type (sort keys %count) {
+
foreach my $series (sort keys %count) {
     print $type, " ", $count{$type}, "\n";
+
     print $series, " ", $count{$series}, "\n";
 
}
 
}
</pre>
+
</syntaxhighlight>
 +
 
 +
This program does the same thing as the following one-liner (more on one-liners in the next lecture)
 +
<syntaxhighlight lang="bash">
 +
perl -F'"\t"' -lane 'die unless @F==7; $count{$F[0]}++;
 +
  END { foreach (sort keys %count) { print "$_ $count{$_}" }}' filename
 +
</syntaxhighlight>
  
This program does the same thing as the following one-liner (more on one-liners in two weeks)
+
When we run it for the small six-line input, we get the following output:
 
<pre>
 
<pre>
perl -F'"\t"' -lane 'next if /^#/; die unless @F>=17; $count{$F[11]}++;
+
Black Mirror 3
  END { foreach (sort keys %count) { print "$_ $count{$_}" }}' filename
+
Game of Thrones 3
 
</pre>
 
</pre>
 
  
 
==The second input file for today: DNA sequencing reads (fastq)==
 
==The second input file for today: DNA sequencing reads (fastq)==
Line 311: Line 326:
 
* Technically, a single read and its quality can be split into multiple lines, but this is rarely done, and we will assume that each read takes 4 lines as described above
 
* Technically, a single read and its quality can be split into multiple lines, but this is rarely done, and we will assume that each read takes 4 lines as described above
  
The first 4 reads from file <tt>/tasks/perl1/reads-small.fastq</tt> (trimmed to 50 bases for better readability)
+
The first 4 reads from file <tt>/tasks/perl/reads-small.fastq</tt> (trimmed to 50 bases for better readability)
 
<pre>
 
<pre>
 
@SRR022868.1845/1
 
@SRR022868.1845/1
Line 329: Line 344:
 
* Scalar variables can hold undefined value (<tt>undef</tt>), string, number, reference etc.
 
* Scalar variables can hold undefined value (<tt>undef</tt>), string, number, reference etc.
 
* Perl converts automatically between strings and numbers
 
* Perl converts automatically between strings and numbers
<pre>
+
<syntaxhighlight lang="bash">
 
perl -e'print((1 . "2")+1, "\n")'
 
perl -e'print((1 . "2")+1, "\n")'
13
+
# 13
 
perl -e'print(("a" . "2")+1, "\n")'
 
perl -e'print(("a" . "2")+1, "\n")'
1
+
# 1
 
perl -we'print(("a" . "2")+1, "\n")'
 
perl -we'print(("a" . "2")+1, "\n")'
Argument "a2" isn't numeric in addition (+) at -e line 1.
+
# Argument "a2" isn't numeric in addition (+) at -e line 1.
1
+
# 1
</pre>
+
</syntaxhighlight>
 
* If we switch on strict parsing, each variable needs to be defined by <tt>my</tt>
 
* If we switch on strict parsing, each variable needs to be defined by <tt>my</tt>
 
** Several variables can be created and initialized as follows: <tt>my ($a,$b) = (0,1);</tt>
 
** Several variables can be created and initialized as follows: <tt>my ($a,$b) = (0,1);</tt>
Line 352: Line 367:
 
* If using non-existent indexes, they will be created, initialized to <tt>undef</tt> (<tt>++, +=</tt> treat <tt>undef</tt> as 0)
 
* If using non-existent indexes, they will be created, initialized to <tt>undef</tt> (<tt>++, +=</tt> treat <tt>undef</tt> as 0)
 
* Stack/vector using functions <tt>push</tt> and <tt>pop</tt>: <tt>push @a, (1,2,3); $x = pop @a;</tt>
 
* Stack/vector using functions <tt>push</tt> and <tt>pop</tt>: <tt>push @a, (1,2,3); $x = pop @a;</tt>
* Analogicaly <tt>shift</tt> and <tt>unshift</tt> on the left end of the array (slower)
+
* Analogically <tt>shift</tt> and <tt>unshift</tt> on the left end of the array (slower)
 
* Sorting
 
* Sorting
 
** <tt>@a = sort @a;</tt> (sorts alphabetically)
 
** <tt>@a = sort @a;</tt> (sorts alphabetically)
Line 360: Line 375:
 
* Swap values of two variables: <tt>($x,$y) = ($y,$x);</tt>
 
* Swap values of two variables: <tt>($x,$y) = ($y,$x);</tt>
 
* Command <tt>foreach</tt> iterates through values of an array (values can be changed during iteration):
 
* Command <tt>foreach</tt> iterates through values of an array (values can be changed during iteration):
<pre>
+
<syntaxhighlight lang="Perl">
 
my @a = (1,2,3);
 
my @a = (1,2,3);
 
foreach my $val (@a) {  # iterate through all values
 
foreach my $val (@a) {  # iterate through all values
Line 368: Line 383:
 
print join(" ", @a), "\n";  
 
print join(" ", @a), "\n";  
 
# prints 2 3 4
 
# prints 2 3 4
</pre>
+
</syntaxhighlight>
  
 
===Hash tables (associative array, dictionaries, maps)===
 
===Hash tables (associative array, dictionaries, maps)===
Line 375: Line 390:
 
* Access element with key <tt>"X"</tt>: <tt>$b{"X"}</tt>
 
* Access element with key <tt>"X"</tt>: <tt>$b{"X"}</tt>
 
* Write out all elements of associative array <tt>%b</tt>
 
* Write out all elements of associative array <tt>%b</tt>
<pre>
+
<syntaxhighlight lang="Perl">
 
foreach my $key (keys %b) {
 
foreach my $key (keys %b) {
 
     print $key, " ", $b{$key}, "\n";
 
     print $key, " ", $b{$key}, "\n";
 
}
 
}
</pre>
+
</syntaxhighlight>
 
* Initialization with a constant: <tt>%b = ("key1" => "value1", "key2" => "value2");</tt>
 
* Initialization with a constant: <tt>%b = ("key1" => "value1", "key2" => "value2");</tt>
 
* Test for existence of a key: <tt>if(exists $a{"X"}) {...}</tt>
 
* Test for existence of a key: <tt>if(exists $a{"X"}) {...}</tt>
Line 387: Line 402:
 
* Pointer to an anonymous array: <tt>[1,2,3]</tt>, pointer to an anonymous hash: <tt>{"key1" => "value1"}</tt>
 
* Pointer to an anonymous array: <tt>[1,2,3]</tt>, pointer to an anonymous hash: <tt>{"key1" => "value1"}</tt>
 
* Hash of lists is stored as hash of pointers to lists:
 
* Hash of lists is stored as hash of pointers to lists:
<pre>
+
<syntaxhighlight lang="Perl">
 
my %a = ("fruits" => ["apple","banana","orange"],
 
my %a = ("fruits" => ["apple","banana","orange"],
 
         "vegetables" => ["tomato","carrot"]);
 
         "vegetables" => ["tomato","carrot"]);
Line 394: Line 409:
 
my $aref = \%a;
 
my $aref = \%a;
 
$x = $aref->{"fruits"}[1];
 
$x = $aref->{"fruits"}[1];
</pre>
+
</syntaxhighlight>
* Module <tt>Data::Dumper</tt> has function <tt>Dumper</tt>, which recursively prints complex data structures (good for debuging)
+
* Module <tt>Data::Dumper</tt> has function <tt>Dumper</tt>, which recursively prints complex data structures (good for debugging)
  
 
==Strings==
 
==Strings==
Line 409: Line 424:
  
 
==Regular expressions==
 
==Regular expressions==
* Regular expressions are powerful tool for working with strings, now featued in many languages
+
* Regular expressions are powerful tool for working with strings, now featured in many languages
 
* Here only a few examples, more details can be found in [http://perldoc.perl.org/perlretut.html the official tutorial]
 
* Here only a few examples, more details can be found in [http://perldoc.perl.org/perlretut.html the official tutorial]
<pre>
+
<syntaxhighlight lang="Perl">
 
$line =~ s/\s+$//;      # remove whitespace at the end of the line
 
$line =~ s/\s+$//;      # remove whitespace at the end of the line
 
$line =~ s/[0-9]+/X/g;  # replace each sequence of numbers with character X
 
$line =~ s/[0-9]+/X/g;  # replace each sequence of numbers with character X
Line 419: Line 434:
 
# and store it in variable $name  
 
# and store it in variable $name  
 
# (\S means non-whitespace),
 
# (\S means non-whitespace),
# the string matching part of expression in (..) is stroed in $1
+
# the string matching part of expression in (..) is stored in $1
 
if($line =~ /^\>(\S+)/) { $name = $1; }
 
if($line =~ /^\>(\S+)/) { $name = $1; }
</pre>
+
</syntaxhighlight>
  
 
==Conditionals, loops==
 
==Conditionals, loops==
<pre>
+
<syntaxhighlight lang="Perl">
 
if(expression) {  # () and {} cannot be omitted
 
if(expression) {  # () and {} cannot be omitted
 
   commands
 
   commands
Line 451: Line 466:
 
   last if $x >= 100;
 
   last if $x >= 100;
 
}
 
}
</pre>
+
</syntaxhighlight>
  
Undefined value, number 0 and strings <tt>""</tt> and <tt>"0"</tt> evaluate as false, but we recommmend always explicitly using logical values in conditional expressions, e.g. <tt>if(defined $x)</tt>, <tt>if($x eq "")</tt>, <tt>if($x==0)</tt> etc.
+
Undefined value, number 0 and strings <tt>""</tt> and <tt>"0"</tt> evaluate as false, but we recommend always explicitly using logical values in conditional expressions, e.g. <tt>if(defined $x)</tt>, <tt>if($x eq "")</tt>, <tt>if($x==0)</tt> etc.
  
 
==Input, output==
 
==Input, output==
* Reading one line from standard input:
+
<syntaxhighlight lang="Perl">
<pre>$line = <STDIN></pre>
+
# Reading one line from standard input
* If no more input data available, returns <tt>undef</tt>
+
$line = <STDIN>
 +
# If no more input data available, returns undef
 +
 
 +
 
 +
# The special idiom below reads all the lines from input until the end of input is reached:
 +
while (my $line = <STDIN>) {
 +
  # commands processing $line ...
 +
}
 +
</syntaxhighlight>
 
* See also [http://perldoc.perl.org/perlop.html#I%2fO-Operators|manual on Perl I/O operators]
 
* See also [http://perldoc.perl.org/perlop.html#I%2fO-Operators|manual on Perl I/O operators]
* The special idiom below reads all the lines from input until the end of input is reached:
 
<pre>
 
while (my $line = <STDIN>) { ... } 
 
</pre>
 
* <tt>chomp $line</tt> removes <tt>"\n"</tt>, if any from the end of the string
 
 
* Output to stdout through <tt>[http://perldoc.perl.org/functions/print.html print]</tt> or <tt>[http://perldoc.perl.org/functions/printf.html printf]</tt> commands
 
* Output to stdout through <tt>[http://perldoc.perl.org/functions/print.html print]</tt> or <tt>[http://perldoc.perl.org/functions/printf.html printf]</tt> commands
  
 
==Sources of Perl-related information==
 
==Sources of Perl-related information==
* Lecture Perl 2 in [[#Integrácia_dátových_zdrojov_2017/18#L02|archive of this course]] (files, command-line arguments, running external programs, defining functions and modules, Bioperl)
+
* Man pages (included in ubuntu package <tt>perl-doc</tt>), also available online at [http://perldoc.perl.org/ http://perldoc.perl.org/]
* In package perl-doc man pages:
+
** <tt>man perlintro</tt> introduction to Perl
** '''man perlintro''' introduction to Perl
+
** <tt>man perlfunc</tt> list of standard functions in Perl
** '''man perlfunc''' list of standard functions in Perl
+
** <tt>perldoc -f split</tt> describes function split, similarly other functions
** '''perldoc -f split''' describes function split, similarly other functions
+
** <tt>perldoc -q sort</tt> shows answers to commonly asked questions (FAQ)
** '''perldoc -q sort''' shows answers to commonly asked questions (FAQ)
+
** <tt>man perlretut</tt> and <tt>man perlre</tt> regular expressions
** '''man perlretut''' and '''man perlre''' regular expressions
+
** <tt>man perl</tt> list of other manual pages about Perl
** '''man perl''' list of other manual pages about Perl
 
* The same content on the web http://perldoc.perl.org/
 
 
* Various web tutorials e.g. [http://www.perl.com/pub/a/2000/10/begperl1.html this one]
 
* Various web tutorials e.g. [http://www.perl.com/pub/a/2000/10/begperl1.html this one]
 
* Books
 
* Books
** Simon Cozens: Beginning Perl [http://www.perl.org/books/beginning-perl/] freely downloadable
+
** [http://www.perl.org/books/beginning-perl/ Simon Cozens: Beginning Perl] freely downloadable
** Larry Wall et al: Programming Perl [http://oreilly.com/catalog/9780596000271/] classics, Camel book
+
** [http://oreilly.com/catalog/9780596000271/ Larry Wall et al: Programming Perl] classics, Camel book
* [http://www.bioperl.org/wiki/Main_Page '''Bioperl'''] big library for bioinformatics
 
* Perl for Windows: http://strawberryperl.com/
 
  
==HWperl1==
+
==Further optional topics==
<!-- NOTEX -->
+
For illustration, we briefly cover other topics frequently used in Perl scripts (these are not needed to solve the exercises).
See [[#Lperl1|Lecture 1]]
 
<!-- /NOTEX -->
 
  
===Files and setup===
+
===Opening files===
We recommend creating a directory (folder) for this set of tasks:
+
<syntaxhighlight lang="Perl">
<pre>
+
my $in;
mkdir perl1  # make directory
+
open $in, "<", "path/file.txt" or die; # open file for reading
cd perl1    # change to the new directory
+
while(my $line = <$in>) {
</pre>
+
  # process line
We have 4 input files for this task set. We recommend creating soft links to your working directory as follows:
+
}
<pre>
+
close $in;
ln -s /tasks/perl1/repeats-small.txt . # small version of the repeat file
 
ln -s /tasks/perl1/repeats.txt .        # full version of the repeat file
 
ln -s /tasks/perl1/reads-small.fastq .  # smaller version of the read file
 
ln -s /tasks/perl1/reads.fastq .        # bigger version of the read file
 
</pre>
 
  
<!-- NOTEX -->
+
my $out;
We recommend writing your protocol starting from an outline provided in  <tt>/tasks/perl1/protocol.txt</tt>. Make your own copy of the protopcol and open it in an editor, e.g. kate:
+
open $out, ">", "path/file2.txt" or die; # open file for writing
<pre>
+
print $out "Hello world\n";
cp -ip /tasks/perl1/protocol.txt . # copy protocol
+
close $out;
kate protocol.txt &                # open editor, run in the backgrund
+
# if we want to append to a file use the following instead:
</pre>
+
# open $out, ">>", "cesta/subor2.txt" or die;
  
===Submitting===
+
# standard files
* Directory /submit/perl1/your_username will be created for you
+
print STDERR "Hello world\n";
* Copy required files to this directory, including the protocol named protocol.txt or protocol.pdf
+
my $line = <STDIN>;
* You can modify these files freely until deadline, but after the deadline of the homework, you will lose access rights to this directory
+
# files as arguments of a function
<!-- /NOTEX -->
+
read_my_file($in);
 +
read_my_file(\*STDIN);
 +
</syntaxhighlight>
  
===Task A===
+
===Working with files and directories===
 +
Module <tt>File::Temp</tt> allows to create temporary working directories or files with automatically generated names. These are automatically deleted when the program finishes.
 +
<syntaxhighlight lang="Perl">
 +
use File::Temp qw/tempdir/;
 +
my $dir = tempdir("atoms_XXXXXXX", TMPDIR => 1, CLEANUP => 1 );
 +
print STDERR "Creating temporary directory $dir\n";
 +
open $out,">$dir/myfile.txt" or die;
 +
</syntaxhighlight>
 +
 
 +
Copying files
 +
<syntaxhighlight lang="Perl">
 +
use File::Copy;
 +
copy("file1","file2") or die "Copy failed: $!";
 +
copy("Copy.pm",\*STDOUT);
 +
move("/dev1/fileA","/dev2/fileB");
 +
</syntaxhighlight>
 +
 
 +
Other functions for working with file system, e.g. <tt>chdir, mkdir, unlink, chmod,</tt> ...
 +
 
 +
Function <tt>glob</tt> finds files with wildcard characters similarly as on command line (see also <tt>opendir, readdir</tt>, and <tt>File::Find module</tt>)
 +
<syntaxhighlight lang="Perl">
 +
ls *.pl
 +
perl -le'foreach my $f (glob("*.pl")) { print $f; }'
 +
</syntaxhighlight>
 +
 
 +
Additional functions for working with file names, paths, etc. in modules <tt>File::Spec</tt> and <tt>File::Basename</tt>.
 +
 
 +
Testing for an existence of a file (more in [http://perldoc.perl.org/functions/-X.html perldoc -f -X])
 +
<syntaxhighlight lang="Perl">
 +
if(-r "file.txt") { ... }  # is file.txt readable?
 +
if(-d "dir") {.... }      # is dir a directory?
 +
</syntaxhighlight>
 +
 
 +
===Running external programs===
 +
Using the <tt>system</tt> command
 +
* It returns -1 if it cannot run command, otherwise returns the return code of the program
 +
<syntaxhighlight lang="Perl">
 +
my $ret = system("command arguments");
 +
</syntaxhighlight>
 +
 
 +
Using the backtick operator with capturing standard output to a variable
 +
* This does not tests the return code
 +
<syntaxhighlight lang="Perl">
 +
my $allfiles = `ls`;
 +
</syntaxhighlight>
 +
 
 +
Using pipes (special form of open sends output to a different command,
 +
or reads output of a different command as a file)
 +
<syntaxhighlight lang="Perl">
 +
open $in, "ls |";
 +
while(my $line = <$in>) { ... }
 +
</syntaxhighlight>
 +
 
 +
<syntaxhighlight lang="Perl">
 +
open $out, "| wc";
 +
print $out "1234\n";
 +
close $out;
 +
# output of wc:
 +
#      1      1      5
 +
</syntaxhighlight>
 +
 
 +
===Command-line arguments===
 +
<syntaxhighlight lang="Perl">
 +
# module for processing options in a standardized way
 +
use Getopt::Std;
 +
# string with usage manual
 +
my $USAGE = "$0 [options] length filename
 +
 
 +
Options:
 +
-l          switch on lucky mode
 +
-o filename  write output to filename
 +
";
 +
 
 +
# all arguments to the command are stored in @ARGV array
 +
# parse options and remove them from @ARGV
 +
my %options;
 +
getopts("lo:", \%options);
 +
# now there should be exactly two arguments in @ARGV
 +
die $USAGE unless @ARGV==2;
 +
# process options
 +
my ($length, $filenamefile) = @ARGV;
 +
# values of options are in the %options array
 +
if(exists $options{'l'}) { print "Lucky mode\n"; }
 +
</syntaxhighlight>
 +
For long option names, see module Getopt::Long
 +
 
 +
===Defining functions===
 +
 
 +
<syntaxhighlight lang="Perl">
 +
sub function_name {
 +
  # arguments are stored in @_ array
 +
  my ($firstarg, $secondarg) = @_;
 +
  # do something
 +
  return ($result, $second_result);
 +
}
 +
</syntaxhighlight>
 +
* Arrays and hashes are usually passed as references: <tt>function_name(\@array, \%hash);</tt>
 +
* It is advantageous to pass very long string as references to prevent needless copying: <tt>function_name(\$sequence);</tt>
 +
* References need to be dereferenced, e.g. <tt>substr($$sequence)</tt> or <tt>$array->[0]</tt>
 +
 
 +
===Bioperl===
 +
A large library useful for bioinformatics. This snippet translates DNA sequence to a protein using the standard genetic code:
 +
<syntaxhighlight lang="Perl">
 +
use Bio::Tools::CodonTable;
 +
sub translate
 +
{
 +
    my ($seq, $code) = @_;
 +
    my $CodonTable = Bio::Tools::CodonTable->new( -id => $code);
 +
    my $result = $CodonTable->translate($seq);
  
* Consider the program for counting repeat types in the [[#Lperl1#A_sample_Perl_program|lecture 1]], save it to file <tt>repeat-stat.pl</tt>
+
     return $result;
** Open editor running in the background: <tt>kate repeat-stat.pl</tt>
+
}
** Copy and paste text to the editor, save it
+
</syntaxhighlight>
** Make the script executable: <tt>chmod a+x repeat-stat.pl<//tt>
+
==HWperl==
* Extend the script to compute the average length of each type of repeat
 
** Each row of the input table contains the start and end coordinates of the repeat in columns 7 and 6. The length is simply the difference of these two values.
 
* Output a table with three columns: type of repeat, the number of occurrences, the average length of the repeat.
 
** Use [http://perldoc.perl.org/functions/printf.html printf] to print these three items right-justified in columns of sufficient width, print the average length to 1 decimal place.
 
* If you run your script on the small file, the output should look something like this (exact column widths may differ):
 
<pre>
 
./repeat-stat.pl < repeats-small.txt
 
                DNA        5    377.4
 
                LINE        4    410.2
 
                LTR        13    355.4
 
      Low_complexity        22      47.2
 
                  RC        8     236.2
 
      Simple_repeat      106      39.0
 
</pre>
 
* Run your script also on the large file: <tt>./repeat-stat.pl < repeats.txt</tt>
 
 
<!-- NOTEX -->
 
<!-- NOTEX -->
** Include the output in your '''protocol'''
+
See [[#Lperl|the lecture]]
<!-- /NOTEX -->
 
* Find out on [https://en.wikipedia.org/wiki/Retrotransposon Wikipedia], what acronyms LINE and LTR stand for. Do their names correspond to their lengths?
 
<!-- NOTEX -->
 
** (Write a short answer in the '''protocol'''.)
 
* '''Submit''' only your script, <tt>repeat-stat.pl</tt>
 
 
<!-- /NOTEX -->
 
<!-- /NOTEX -->
  
===Task B===
+
===Files and setup===
 +
We recommend creating a directory (folder) for this set of tasks:
 +
<syntaxhighlight lang="bash">
 +
mkdir perl  # make directory
 +
cd perl    # change to the new directory
 +
</syntaxhighlight>
 +
We have 4 input files for this task set. We recommend creating soft links to your working directory as follows:
 +
<syntaxhighlight lang="bash">
 +
ln -s /tasks/perl/series-small.tsv .  # small version of the series file
 +
ln -s /tasks/perl/series.tsv .        # full version of the series file
 +
ln -s /tasks/perl/reads-small.fastq .  # smaller version of the read file
 +
ln -s /tasks/perl/reads.fastq .        # bigger version of the read file
 +
</syntaxhighlight>
  
* Write a script which reformats FASTQ file to FASTA format, call it <tt>fastq2fasta.pl</tt>
+
<!-- NOTEX -->
** [[#Lperl1#The_second_input_file_for_today:_DNA_sequencing_reads_.28fastq.29|fastq file]] should be on standard input, fasta file written to standard output
+
We recommend writing your protocol starting from an outline provided in  <tt>/tasks/perl/protocol.txt</tt>. Make your own copy of the protocol and open it in an editor, e.g. kate:
 +
<syntaxhighlight lang="bash">
 +
cp -ip /tasks/perl/protocol.txt .  # copy protocol
 +
kate protocol.txt &                # open editor, run in the background
 +
</syntaxhighlight>
 +
 
 +
===Submitting===
 +
* Directory <tt>/submit/perl/your_username</tt> will be created for you
 +
* Copy required files to this directory, including the protocol named <tt>protocol.txt</tt>
 +
* You can modify these files freely until deadline, but after the deadline of the homework, you will lose access rights to this directory
 +
<!-- /NOTEX -->
 +
 
 +
===Task A (series)===
 +
 
 +
Consider the program for counting series in the [[#Lperl#A_sample_Perl_program|lecture 1]], save it to file <tt>series-stat.pl</tt>
 +
* Open editor running in the background: <tt>kate series-stat.pl &</tt>
 +
* Copy and paste text to the editor, save it
 +
* Make the script executable: <tt>chmod a+x series-stat.pl</tt>
 +
 
 +
Extend the script to compute the average rating of each series (averaging over all episodes in the series)
 +
* Each row of the input table contains rating in column 5.
 +
* Output a table with three columns: name of series, the number of episodes, the average rating.
 +
* Use <tt>[http://perldoc.perl.org/functions/printf.html printf]</tt> to print these three items right-justified in columns of sufficient width, print the average rating to 1 decimal place.
 +
* If you run your script on the small file, the output should look something like this (exact column widths may differ):
 +
<pre>
 +
./series-stat.pl < series-small.tsv
 +
        Black Mirror        3        8.2
 +
    Game of Thrones        3        8.8
 +
</pre>
 +
* Run your script also on the large file: <tt>./series-stat.pl < series.tsv</tt>
 +
<!-- NOTEX -->
 +
** Include the output in your '''protocol'''
 +
* '''Submit''' only your script, <tt>series-stat.pl</tt>
 +
<!-- /NOTEX -->
 +
 
 +
===Task B (FASTQ to FASTA)===
 +
 
 +
* Write a script which reformats FASTQ file to FASTA format, call it <tt>fastq2fasta.pl</tt>
 +
** [[#Lperl#The_second_input_file_for_today:_DNA_sequencing_reads_.28fastq.29|FASTQ file]] should be on standard input, FASTA file written to standard output
 
* [https://en.wikipedia.org/wiki/FASTA_format FASTA format] is a typical format for storing DNA and protein sequences.  
 
* [https://en.wikipedia.org/wiki/FASTA_format FASTA format] is a typical format for storing DNA and protein sequences.  
 
** Each sequence consists of several lines of the file. The first line starts with ">" followed by identifier of the sequence and optionally some further description separated by whitespace
 
** Each sequence consists of several lines of the file. The first line starts with ">" followed by identifier of the sequence and optionally some further description separated by whitespace
** The sequence itself is on the second line, long sequences are split into multiple lines
+
** The sequence itself is on the second line, long sequences can be split into multiple lines
 
* In our case, the name of the sequence will be the ID of the read with @ replaced by > and / replaced by underscore (<tt>_</tt>)
 
* In our case, the name of the sequence will be the ID of the read with @ replaced by > and / replaced by underscore (<tt>_</tt>)
** you can try to use [http://perldoc.perl.org/perlop.html#Quote-Like-Operators tr or s operators] (see also [[#Lperl1#Regular_expressions|lecture]])
+
** you can try to use [http://perldoc.perl.org/perlop.html#Quote-Like-Operators <tt>tr</tt> or <tt>s</tt> operators] (see also [[#Lperl#Regular_expressions|lecture]])
* For example, the first two reads of the file <tt>reads.fastq</tt> are:
+
* For example, the first two reads of the file <tt>reads.fastq</tt> are as follows (only the first 50 columns shown)
 
<pre>
 
<pre>
 
@SRR022868.1845/1
 
@SRR022868.1845/1
AAATTTAGGAAAAGATGATTTAGCAACATTTAGCCTTAATGAAAGACCAGATTCTGTTGCCATGTTTGAATGCCTTAAACCAGTAGCAGAATCAGTATAAA
+
AAATTTAGGAAAAGATGATTTAGCAACATTTAGCCTTAATGAAAGACCAG...
 
+
 
+
IICIIIIIIIIIID%IIII8>I8III1II,II)I+III*II<II,E;-HI>+I0IB99I%%2GI*=?5*&1>'$0;%'+%%+;#'$&'%%$-+*$--*+(%
+
IICIIIIIIIIIID%IIII8>I8III1II,II)I+III*II<II,E;-HI...
 
@SRR022868.1846/1
 
@SRR022868.1846/1
TAGCGTTGTAAAATAAATTTCTAGAATGGAAGTGATGATATTGAAATACACTCAGATCCTGAATGAAAGATTTATTAAAGTTAAGACGAGAGTCTCATTAT
+
TAGCGTTGTAAAATAAATTTCTAGAATGGAAGTGATGATATTGAAATACA...
 
+
 
+
4CIIIIIIII52I)IIIII0I16IIIII2IIII;IIAII&I6AI+*+&G5&G.@8/6&%&,03:*.$479.91(9--$,*&/3"$#&*'+#&##&$(&+&+
+
4CIIIIIIII52I)IIIII0I16IIIII2IIII;IIAII&I6AI+*+&G5...
 
</pre>
 
</pre>
* These should be reformatted as follows:
+
* These should be reformatted as follows (again only first 50 columns shown, but you include entire reads):
 
<pre>
 
<pre>
 
>SRR022868.1845_1
 
>SRR022868.1845_1
AAATTTAGGAAAAGATGATTTAGCAACATTTAGCCTTAATGAAAGACCAGATTCTGTTGCCATGTTTGAATGCCTTAAACCAGTAGCAGAATCAGTATAAA
+
AAATTTAGGAAAAGATGATTTAGCAACATTTAGCCTTAATGAAAGACCAGA...
 
>SRR022868.1846_1
 
>SRR022868.1846_1
TAGCGTTGTAAAATAAATTTCTAGAATGGAAGTGATGATATTGAAATACACTCAGATCCTGAATGAAAGATTTATTAAAGTTAAGACGAGAGTCTCATTAT
+
TAGCGTTGTAAAATAAATTTCTAGAATGGAAGTGATGATATTGAAATACAC...
 
</pre>
 
</pre>
 
* Run your script on the small read file <tt>./fastq2fasta.pl < reads-small.fastq > reads-small.fasta</tt>
 
* Run your script on the small read file <tt>./fastq2fasta.pl < reads-small.fastq > reads-small.fasta</tt>
Line 578: Line 725:
 
<!-- /NOTEX -->
 
<!-- /NOTEX -->
  
===Task C===
+
===Task C (FASTQ quality)===
  
 
Write a script <tt>fastq-quality.pl</tt> which for each position in a read computes the average quality
 
Write a script <tt>fastq-quality.pl</tt> which for each position in a read computes the average quality
 
* Standard input has fastq file with multiple reads, possibly of different lengths
 
* Standard input has fastq file with multiple reads, possibly of different lengths
 
* As quality we will use ASCII values of characters in the quality string with value 33 subtracted, so the quality is -10 log p
 
* As quality we will use ASCII values of characters in the quality string with value 33 subtracted, so the quality is -10 log p
** ASCII value can be computed by function [http://perldoc.perl.org/functions/ord.html ord]
+
** ASCII value can be computed by function [http://perldoc.perl.org/functions/ord.html <tt>ord</tt>]
 
* Positions in reads will be numbered from 0
 
* Positions in reads will be numbered from 0
 
* Since reads can differ in length, some positions are used in more reads, some in fewer
 
* Since reads can differ in length, some positions are used in more reads, some in fewer
Line 593: Line 740:
 
</pre>
 
</pre>
 
Run the following command, which runs your script on the larger file and selects every 10th position.  
 
Run the following command, which runs your script on the larger file and selects every 10th position.  
<pre>
+
<syntaxhighlight lang="bash">
 
./fastq-quality.pl < reads.fastq | perl -lane 'print if $F[0]%10==0'
 
./fastq-quality.pl < reads.fastq | perl -lane 'print if $F[0]%10==0'
</pre>
+
</syntaxhighlight>
 
* What trends (if any) do you see in quality values with increasing position?
 
* What trends (if any) do you see in quality values with increasing position?
 
<!-- NOTEX -->
 
<!-- NOTEX -->
Line 602: Line 749:
 
<!-- /NOTEX -->
 
<!-- /NOTEX -->
  
===Task D===
+
===Task D (FASTQ trim)===
  
 
Write script <tt>fastq-trim.pl</tt> that trims low quality bases from the end of each read and filters out short reads
 
Write script <tt>fastq-trim.pl</tt> that trims low quality bases from the end of each read and filters out short reads
Line 615: Line 762:
 
You can check your program by the following tests:
 
You can check your program by the following tests:
 
* If you run the following two commands, you should get file <tt>tmp</tt> identical with input and thus output of the <tt>diff</tt> command should be empty
 
* If you run the following two commands, you should get file <tt>tmp</tt> identical with input and thus output of the <tt>diff</tt> command should be empty
<pre>
+
<syntaxhighlight lang="bash">
 
./fastq-trim.pl '!' 101 < reads-small.fastq > tmp  # trim at quality ASCII >=33 and length >=101
 
./fastq-trim.pl '!' 101 < reads-small.fastq > tmp  # trim at quality ASCII >=33 and length >=101
 
diff reads-small.fastq tmp                        # output should be empty (no differences)
 
diff reads-small.fastq tmp                        # output should be empty (no differences)
</pre>
+
</syntaxhighlight>
  
 
* If you run the following two commands, you should see differences in 4 reads, 2 bases trimmed from each
 
* If you run the following two commands, you should see differences in 4 reads, 2 bases trimmed from each
<pre>
+
<syntaxhighlight lang="bash">
 
./fastq-trim.pl '"' 1 < reads-small.fastq > tmp  # trim at quality ASCII >=34 and length >=1
 
./fastq-trim.pl '"' 1 < reads-small.fastq > tmp  # trim at quality ASCII >=34 and length >=1
 
diff reads-small.fastq tmp                        # output should be differences in 4 reads
 
diff reads-small.fastq tmp                        # output should be differences in 4 reads
</pre>
+
</syntaxhighlight>
  
 
* If you run the following commands, you should get empty output (no reads meet the criteria):
 
* If you run the following commands, you should get empty output (no reads meet the criteria):
<pre>
+
<syntaxhighlight lang="bash">
 
./fastq-trim.pl d 1 < reads-small.fastq          # quality ASCII >=100, length >= 1
 
./fastq-trim.pl d 1 < reads-small.fastq          # quality ASCII >=100, length >= 1
 
./fastq-trim.pl '!' 102 < reads-small.fastq      # quality ASCII >=33 and length >=102
 
./fastq-trim.pl '!' 102 < reads-small.fastq      # quality ASCII >=33 and length >=102
</pre>
+
</syntaxhighlight>
  
 
<!-- NOTEX -->
 
<!-- NOTEX -->
Line 638: Line 785:
 
<!-- /NOTEX -->
 
<!-- /NOTEX -->
 
* If you have done task C, run quality statistics on the trimmed version of the bigger file using command below. Comment on the differences between statistics on the whole file in part C and D. Are they as you expected?
 
* If you have done task C, run quality statistics on the trimmed version of the bigger file using command below. Comment on the differences between statistics on the whole file in part C and D. Are they as you expected?
<pre>
+
<syntaxhighlight lang="bash">
 
# "2" means quality ASCII >= 50
 
# "2" means quality ASCII >= 50
 
./fastq-trim.pl 2 50 < reads.fastq | ./fastq-quality.pl | perl -lane 'print if $F[0]%10==0'
 
./fastq-trim.pl 2 50 < reads.fastq | ./fastq-quality.pl | perl -lane 'print if $F[0]%10==0'
</pre>
+
</syntaxhighlight>
 
<!-- NOTEX -->
 
<!-- NOTEX -->
 
* In your '''protocol''', include the result of the command and your discussion of its results.
 
* In your '''protocol''', include the result of the command and your discussion of its results.
 
<!-- /NOTEX -->
 
<!-- /NOTEX -->
  
'''Note''': in this task set, you have created tools which can be combined, e.g. you can first trim fastq and then convert it to fasta (no need to submit these files)
+
'''Note''': in this task set, you have created tools which can be combined, e.g. you can first trim FASTQ and then convert it to FASTA
 +
<!-- NOTEX -->
 +
(no need to submit these files)
 +
<!-- /NOTEX -->
  
'''Parsing command-line arguments''' in this task (they will be stored in variables $Q and $L):
+
'''Parsing command-line arguments''' in this task (they will be stored in variables <tt>$Q</tt> and <tt>$L</tt>):
<pre>
+
<syntaxhighlight lang="Perl">
 
#!/usr/bin/perl -w
 
#!/usr/bin/perl -w
 
use strict;
 
use strict;
Line 670: Line 820:
 
# check that $Q is one character and $L looks like a non-negative integer
 
# check that $Q is one character and $L looks like a non-negative integer
 
die $USAGE unless length($Q)==1 && $L=~/^[0-9]+$/;
 
die $USAGE unless length($Q)==1 && $L=~/^[0-9]+$/;
</pre>
+
</syntaxhighlight>
 +
=Lbash=
 +
<!-- NOTEX -->
 +
[[#HWbash]]
 +
<!-- /NOTEX -->
  
=L02=
+
This lecture introduces command-line tools and Perl one-liners.
[[#HW02]]
+
* We will do simple transformations of text files using command-line tools without writing any scripts or longer programs.
  
Today: using command-line tools and Perl one-liners.
+
When working on the exercises, record all the commands used
* We will do simple transformations of text files using command-line tools without writing any scripts or longer programs.
+
* We strongly recommend making a log of commands for data processing also outside of this course
* You will record the commands used in your protocol
 
** We strongly recommend making a log of commands for data processing also outside of this course
 
 
* If you have a log of executed commands, you can easily execute them again by copy and paste
 
* If you have a log of executed commands, you can easily execute them again by copy and paste
* For this reason any comments are best preceded by <tt>#</tt>
+
* For this reason any comments are best preceded in the log by <tt>#</tt>
 
* If you use some sequence of commands often, you can turn it into a script
 
* If you use some sequence of commands often, you can turn it into a script
  
==Efficient use of command line==
+
==Efficient use of the Bash command line==
  
 
Some tips for bash shell:
 
Some tips for bash shell:
 
* use ''tab'' key to complete command names, path names etc  
 
* use ''tab'' key to complete command names, path names etc  
** tab completion can be customized [https://www.debian-administration.org/article/316/An_introduction_to_bash_completion_part_1]
+
** tab completion [https://www.debian-administration.org/article/316/An_introduction_to_bash_completion_part_1 can be customized]
* use ''up'' and ''down'' keys to walk through history of recently executed commands, then edit and resubmit chosen command
+
* use ''up'' and ''down'' keys to walk through the history of recently executed commands, then edit and execute the chosen command
 
* press ''ctrl-r'' to search in the history of executed commands
 
* press ''ctrl-r'' to search in the history of executed commands
 
* at the end of session, history stored in <tt>~/.bash_history</tt>
 
* at the end of session, history stored in <tt>~/.bash_history</tt>
 
* command <tt>history -a</tt> appends history to this file right now
 
* command <tt>history -a</tt> appends history to this file right now
** you can then look into the file and copy appropriate commands to your protocol
+
** you can then look into the file and copy appropriate commands to your log
 
* various other history tricks, e.g. special variables [http://samrowe.com/wordpress/advancing-in-the-bash-shell/]
 
* various other history tricks, e.g. special variables [http://samrowe.com/wordpress/advancing-in-the-bash-shell/]
* <tt>cd -</tt> goes to previously visited directory, also see <tt>pushd</tt> and <tt>popd</tt>
+
* <tt>cd -</tt> goes to previously visited directory (also see <tt>pushd</tt> and <tt>popd</tt>)
* <tt>ls -lt | head</tt> shows 10 most recent files, useful for seeing what you have done last
+
* <tt>ls -lt | head</tt> shows 10 most recent files, useful for seeing what you have done last in a directory
  
 
Instead of bash, you can use more advanced command-line environments, e.g. [http://ipython.org/notebook.html iPhyton notebook]
 
Instead of bash, you can use more advanced command-line environments, e.g. [http://ipython.org/notebook.html iPhyton notebook]
Line 701: Line 853:
 
==Redirecting and pipes==
 
==Redirecting and pipes==
  
<pre>
+
<syntaxhighlight lang="bash">
 
# redirect standard output to file
 
# redirect standard output to file
 
command > file
 
command > file
Line 714: Line 866:
 
command < file
 
command < file
  
# do not forget to quote > in other uses, e.g. when searching for string ">" in a file sequences.fasta
+
# do not forget to quote > in other uses,  
 +
# e.g. when searching for string ">" in a file sequences.fasta
 
grep '>' sequences.fasta
 
grep '>' sequences.fasta
 
# (without quotes rewrites sequences.fasta)
 
# (without quotes rewrites sequences.fasta)
# other special characters, such as ;, &, |, # etc should be quoted in '' as well
+
# other special characters, such as ;, &, |, # etc
 +
# should be quoted in '' as well
  
 
# send stdout of command1 to stdin of command2
 
# send stdout of command1 to stdin of command2
Line 733: Line 887:
 
line 3'
 
line 3'
  
# in some commands, file argument can be taken from stdin if denoted as - or stdin or /dev/stdin
+
# in some commands, file argument can be taken from stdin
 +
# if denoted as - or stdin or /dev/stdin
 
# the following compares uncompressed version of file1 with file2
 
# the following compares uncompressed version of file1 with file2
 
zcat file1.gz | diff - file2
 
zcat file1.gz | diff - file2
</pre>
+
</syntaxhighlight>
  
 
Make piped commands fail properly:
 
Make piped commands fail properly:
<pre>
+
<syntaxhighlight lang="bash">
 
set -o pipefail
 
set -o pipefail
</pre>
+
</syntaxhighlight>
 
If set, the return value of a pipeline is the value of the last (rightmost) command to exit with a non-zero status, or zero if all commands in the pipeline exit successfully. This option is disabled by default, pipe then returns exit status of the rightmost command.
 
If set, the return value of a pipeline is the value of the last (rightmost) command to exit with a non-zero status, or zero if all commands in the pipeline exit successfully. This option is disabled by default, pipe then returns exit status of the rightmost command.
  
 
==Text file manipulation==
 
==Text file manipulation==
 
===Commands echo and cat (creating and printing files)===
 
===Commands echo and cat (creating and printing files)===
<pre>
+
<syntaxhighlight lang="bash">
 
# print text Hello and end of line to stdout
 
# print text Hello and end of line to stdout
 
echo "Hello"  
 
echo "Hello"  
Line 753: Line 908:
 
# concatenate several files to stdout
 
# concatenate several files to stdout
 
cat file1 file2
 
cat file1 file2
</pre>
+
</syntaxhighlight>
  
 
===Commands head and tail (looking at start and end of files)===
 
===Commands head and tail (looking at start and end of files)===
<pre>
+
<syntaxhighlight lang="bash">
 
# print 10 first lines of file (or stdin)
 
# print 10 first lines of file (or stdin)
 
head file
 
head file
Line 768: Line 923:
 
# print lines 81..100
 
# print lines 81..100
 
head -n 100 file | tail -n 20  
 
head -n 100 file | tail -n 20  
</pre>
+
</syntaxhighlight>
* Docs: [http://www.gnu.org/software/coreutils/manual/html_node/head-invocation.html head], [http://www.gnu.org/software/coreutils/manual/html_node/tail-invocation.html tail]
+
Documentation: [http://www.gnu.org/software/coreutils/manual/html_node/head-invocation.html head], [http://www.gnu.org/software/coreutils/manual/html_node/tail-invocation.html tail]
  
===Commands wc, ls -lh, od (exploring file stats and details)===
+
===Commands wc, ls -lh, od (exploring file statistics and details)===
<pre>
+
<syntaxhighlight lang="bash">
# prints three numbers: number of lines (-l), number of words (-w), number of bytes (-c)
+
# prints three numbers:
 +
# the number of lines (-l), number of words (-w), number of bytes (-c)
 
wc file
 
wc file
  
# prints size of file in human-readable units (K,M,G,T)
+
# prints the size of file in human-readable units (K,M,G,T)
 
ls -lh file
 
ls -lh file
  
Line 785: Line 941:
 
# 0000000  h  e  l  l  o  sp  w  o  r  l  d  !  nl
 
# 0000000  h  e  l  l  o  sp  w  o  r  l  d  !  nl
 
# 0000015
 
# 0000015
</pre>
+
</syntaxhighlight>
* Docs: [http://www.gnu.org/software/coreutils/manual/html_node/wc-invocation.html wc], [http://www.gnu.org/software/coreutils/manual/html_node/ls-invocation.html ls], [http://www.gnu.org/software/coreutils/manual/html_node/od-invocation.html od]
+
Documentation: [http://www.gnu.org/software/coreutils/manual/html_node/wc-invocation.html wc], [http://www.gnu.org/software/coreutils/manual/html_node/ls-invocation.html ls], [http://www.gnu.org/software/coreutils/manual/html_node/od-invocation.html od]
  
 
===Command grep (getting lines matching a regular expression)===
 
===Command grep (getting lines matching a regular expression)===
<pre>
+
<syntaxhighlight lang="bash">
 +
# get all lines containing string chromosome
 +
grep chromosome file
 
# -i ignores case (upper case and lowercase letters are the same)
 
# -i ignores case (upper case and lowercase letters are the same)
 
grep -i chromosome file
 
grep -i chromosome file
Line 803: Line 961:
 
# -f patterns in a file  
 
# -f patterns in a file  
 
#    (good for selecting e.g. only lines matching one of "good" ids)
 
#    (good for selecting e.g. only lines matching one of "good" ids)
</pre>
+
</syntaxhighlight>
* docs: [http://www.gnu.org/software/grep/manual/grep.html grep]
+
Documentation: [http://www.gnu.org/software/grep/manual/grep.html grep]
  
 
===Commands sort, uniq===
 
===Commands sort, uniq===
<pre>
+
<syntaxhighlight lang="bash">
 +
# sort lines of a file alphabetically
 +
sort file
 +
 
 
# some useful options of sort:
 
# some useful options of sort:
 
# -g numeric sort
 
# -g numeric sort
Line 815: Line 976:
 
# -t fields separator
 
# -t fields separator
  
# sorting first by column 2 numerically (-k2,2g), in case of ties use column 1 (-k1,1)
+
# sorting first by column 2 numerically (-k2,2g),
 +
# in case of ties use column 1 (-k1,1)
 
sort -k2,2g -k1,1 file  
 
sort -k2,2g -k1,1 file  
  
 
# uniq outputs one line from each group of consecutive identical lines
 
# uniq outputs one line from each group of consecutive identical lines
 
# uniq -c adds the size of each group as the first column
 
# uniq -c adds the size of each group as the first column
# the following finds all unique lines and sorts them by frequency from the most frequent
+
# the following finds all unique lines
 +
# and sorts them by frequency from the most frequent
 
sort file | uniq -c | sort -gr
 
sort file | uniq -c | sort -gr
</pre>
+
</syntaxhighlight>
* docs: [http://www.gnu.org/software/coreutils/manual/html_node/sort-invocation.html sort], [http://www.gnu.org/software/coreutils/manual/html_node/uniq-invocation.html uniq]
+
Documentation: [http://www.gnu.org/software/coreutils/manual/html_node/sort-invocation.html sort], [http://www.gnu.org/software/coreutils/manual/html_node/uniq-invocation.html uniq]
  
 
===Commands diff, comm (comparing files)===
 
===Commands diff, comm (comparing files)===
  
[http://www.gnu.org/software/coreutils/manual/html_node/diff-invocation.html diff] compares two files, useful for manual checking of differences
+
Command <tt>[http://www.gnu.org/software/coreutils/manual/html_node/diff-invocation.html diff]</tt> compares two files. It is good for manual checking of differences. Useful options:
* useful options  
+
* <tt>-b</tt> (ignore whitespace differences)  
** -b (ignore whitespace differences)  
+
* <tt>-r</tt> for comparing whole directories
** -r for comparing whole directories
+
* <tt>-q</tt> for fast checking for identity
** -q for fast checking for identity
+
* <tt>-y</tt> show differences side-by-side
** -y show differences side-by-side
+
 
 +
Command <tt>[http://www.gnu.org/software/coreutils/manual/html_node/comm-invocation.html comm]</tt> compares two sorted files. It is good for finding set intersections and differences. It writes three columns:
 +
* lines occurring only in the first file
 +
* lines occurring only in the second file
 +
* lines occurring in both files
 +
Some columns can be suppressed with options <tt>-1, -2, -3</tt>
  
[http://www.gnu.org/software/coreutils/manual/html_node/comm-invocation.html comm] compares two sorted files
 
* writes 3 columns:
 
** 1: lines occurring only in the first file
 
** 2: lines occurring only in the second file
 
** 3: lines occurring in both files
 
* some columns can be suppressed with -1, -2, -3
 
* good for finding set intersections and differences
 
  
 
===Commands cut, paste, join (working with columns)===
 
===Commands cut, paste, join (working with columns)===
* [http://www.gnu.org/software/coreutils/manual/html_node/cut-invocation.html cut] selects only some columns from file (perl/awk more flexible)
+
* Command <tt>[http://www.gnu.org/software/coreutils/manual/html_node/cut-invocation.html cut]</tt> selects only some columns from file (perl/awk more flexible)
* [http://www.gnu.org/software/coreutils/manual/html_node/paste-invocation.html paste] puts 2 or more files side by side, separated by tabs or other character
+
* Command <tt>[http://www.gnu.org/software/coreutils/manual/html_node/paste-invocation.html paste]</tt> puts two or more files side by side, separated by tabs or other characters
* [http://www.gnu.org/software/coreutils/manual/html_node/join-invocation.html join] is a powerful tool for making joins and left-joins as in databases on specified columns in two files
+
* Command <tt>[http://www.gnu.org/software/coreutils/manual/html_node/join-invocation.html join]</tt> is a powerful tool for making joins and left-joins as in databases on specified columns in two files
  
 
===Commands split, csplit (splitting files to parts)===
 
===Commands split, csplit (splitting files to parts)===
* [http://www.gnu.org/software/coreutils/manual/html_node/split-invocation.html split] splits into fixed-size pieces (size in lines, bytes etc.)
+
* Command <tt>[http://www.gnu.org/software/coreutils/manual/html_node/split-invocation.html split]</tt> splits into fixed-size pieces (size in lines, bytes etc.)
* [http://www.gnu.org/software/coreutils/manual/html_node/csplit-invocation.html csplit] splits at occurrence of a pattern (e.g. fasta file into individual sequences)
+
* Command <tt>[http://www.gnu.org/software/coreutils/manual/html_node/csplit-invocation.html csplit]</tt> splits at occurrence of a pattern. For example, splitting a FASTA file into individual sequences:
<pre>
+
<syntaxhighlight lang="bash">
 
csplit sequences.fa '/^>/' '{*}'
 
csplit sequences.fa '/^>/' '{*}'
</pre>
+
</syntaxhighlight>
  
 
==Programs sed and awk==
 
==Programs sed and awk==
Both programs process text files line by line, allow to do various transformations
+
Both <tt>sed</tt> and <tt>awk</tt> process text files line by line, allowing to do various transformations
* awk newer, more advanced
+
* <tt>awk</tt> newer, more advanced
 
* several examples below
 
* several examples below
* More info on wikipedia: [https://en.wikipedia.org/wiki/AWK awk], [https://en.wikipedia.org/wiki/Sed sed]
+
* More info on [https://en.wikipedia.org/wiki/AWK awk], [https://en.wikipedia.org/wiki/Sed sed] on Wikipedia
<pre>
+
<syntaxhighlight lang="bash">
 
# replace text "Chr1" by "Chromosome 1"
 
# replace text "Chr1" by "Chromosome 1"
 
sed 's/Chr1/Chromosome 1/'
 
sed 's/Chr1/Chromosome 1/'
# prints first two lines, then quits (like head -n 2)
+
# prints the first two lines, then quits (like head -n 2)
 
sed 2q   
 
sed 2q   
  
# print first and second column from a file
+
# print the first and second column from a file
 
awk '{print $1, $2}'  
 
awk '{print $1, $2}'  
  
# print the line if difference in first and second column > 10
+
# print the line if the difference between the first and second column > 10
 
awk '{ if ($2-$1>10) print }'   
 
awk '{ if ($2-$1>10) print }'   
  
Line 874: Line 1,035:
 
awk '/pattern/ { print }'  
 
awk '/pattern/ { print }'  
  
# count lines
+
# count the lines (like wc -l)
 
awk 'END { print NR }'
 
awk 'END { print NR }'
</pre>
+
</syntaxhighlight>
  
 
==Perl one-liners==
 
==Perl one-liners==
 
Instead of sed and awk, we will cover Perl one-liners
 
Instead of sed and awk, we will cover Perl one-liners
* more examples [http://www.math.harvard.edu/computing/perl/oneliners.txt], [https://blogs.oracle.com/ksplice/entry/the_top_10_tricks_of]
+
* more examples on various websites ([http://www.math.harvard.edu/computing/perl/oneliners.txt example 1], [https://blogs.oracle.com/ksplice/entry/the_top_10_tricks_of example 2])
* documentation for Perl switches [http://perldoc.perl.org/perlrun.html]
+
* documentation for [http://perldoc.perl.org/perlrun.html Perl switches]
<pre>
+
<syntaxhighlight lang="bash">
 
# -e executes commands
 
# -e executes commands
 
perl -e'print 2+3,"\n"'
 
perl -e'print 2+3,"\n"'
 
perl -e'$x = 2+3; print $x, "\n"';
 
perl -e'$x = 2+3; print $x, "\n"';
  
# -n wraps commands in a loop reading lines from stdin or files listed as arguments
+
# -n wraps commands in a loop reading lines from stdin
 +
# or files listed as arguments
 
# the following is roughly the same as cat:
 
# the following is roughly the same as cat:
 
perl -ne'print'
 
perl -ne'print'
Line 908: Line 1,070:
  
 
# -a splits line into words separated by whitespace and stores them in array @F
 
# -a splits line into words separated by whitespace and stores them in array @F
# the next example prints difference in numbers stored in the second and first column
+
# the next example prints difference in the numbers stored
 +
# in the second and first column
 
# (e.g. interval size if each line coordinates of one interval)
 
# (e.g. interval size if each line coordinates of one interval)
 
perl -lane'print $F[1]-$F[0]'
 
perl -lane'print $F[1]-$F[0]'
Line 920: Line 1,083:
 
perl -lane'$sum += $F[1]-$F[0]; END { print $sum; }'
 
perl -lane'$sum += $F[1]-$F[0]; END { print $sum; }'
 
# similarly BEGIN { command } before we start
 
# similarly BEGIN { command } before we start
</pre>
+
</syntaxhighlight>
  
Other interesting possibilites:
+
Other interesting possibilities:
<pre>
+
<syntaxhighlight lang="bash">
 
# -i replaces each file with a new transformed version (DANGEROUS!)
 
# -i replaces each file with a new transformed version (DANGEROUS!)
# the next example removes empty lines from all .txt files in the current directory
+
# the next example removes empty lines from all .txt files
 +
# in the current directory
 
perl -lne 'print if length($_)>0' -i *.txt
 
perl -lne 'print if length($_)>0' -i *.txt
 
# the following example replaces sequence of whitespace by exactly one space  
 
# the following example replaces sequence of whitespace by exactly one space  
Line 931: Line 1,095:
 
perl -lane 'print join(" ", @F)' -i *.txt
 
perl -lane 'print join(" ", @F)' -i *.txt
  
 
+
# variable $. contains the line number. $ARGV the name of file or - for stdin
# variable $. contains line number. $ARGV name of file or - for stdin
 
 
# the following prints filename and line number in front of every line
 
# the following prints filename and line number in front of every line
 
perl -ne'printf "%s.%d: %s", $ARGV, $., $_' file1 file2
 
perl -ne'printf "%s.%d: %s", $ARGV, $., $_' file1 file2
Line 942: Line 1,105:
 
ls *.txt | perl -lne '$s=$_; $s=~s/\.txt/.tsv/; print("mv -i $_ $s")'
 
ls *.txt | perl -lne '$s=$_; $s=~s/\.txt/.tsv/; print("mv -i $_ $s")'
 
ls *.txt | perl -lne '$s=$_; $s=~s/\.txt/.tsv/; system("mv -i $_ $s")'
 
ls *.txt | perl -lne '$s=$_; $s=~s/\.txt/.tsv/; system("mv -i $_ $s")'
</pre>
+
</syntaxhighlight>
=HW02=
+
==HWbash==
[[#Lperl1|Lecture 1 (Perl 1)]], [[#L02|Lecture 2 (command-line)]]
+
<!-- NOTEX -->
 +
[[#Lperl|Lecture on Perl]], [[#Lbash|Lecture on command-line tools]]
 +
<!-- /NOTEX -->
  
* In this homework, use command-line tools or one-liners in Perl, awk or sed. Do not write any scripts or programs.  
+
* In this set of tasks, use command-line tools or one-liners in Perl, awk or sed. Do not write any scripts or programs.  
 
* Each task can be split into several stages and intermediate files written to disk, but you can also use pipelines to reduce the number of temporary files.  
 
* Each task can be split into several stages and intermediate files written to disk, but you can also use pipelines to reduce the number of temporary files.  
 
* Your commands should work also for other input files with the same format (do not try to generalize them too much, but also do not use very specific properties of a particular input, such as the number of lines etc.)
 
* Your commands should work also for other input files with the same format (do not try to generalize them too much, but also do not use very specific properties of a particular input, such as the number of lines etc.)
 +
<!-- NOTEX -->
 
* Include all relevant used commands in your protocol and add a short description of your approach.
 
* Include all relevant used commands in your protocol and add a short description of your approach.
 
* Submit the protocol and required output files.
 
* Submit the protocol and required output files.
* Outline of the protocol is in <tt>/tasks/hw02/protocol.txt</tt>, submit to directory <tt>/submit/hw02/yourname</tt>
+
* Outline of the protocol is in <tt>/tasks/bash/protocol.txt</tt>, submit to directory <tt>/submit/bash/yourname</tt>
 
+
<!-- /NOTEX -->
<!--
 
==Bonus==
 
* If you are bored, you can try to write solution of Task B using as small number of characters as possible
 
* In the protocol, include both normal readable form and the condensed form
 
* Winner with the shortest set of commands gets some bonus points
 
-->
 
  
==Task A==
+
===Task A (passwords)===
* <tt>/tasks/hw02/names.txt</tt> contains data about several people, one per line.  
+
* The file <tt>/tasks/bash/names.txt</tt> contains data about several people, one per line.  
 
* Each line consists of given name(s), surname and email separated by spaces.  
 
* Each line consists of given name(s), surname and email separated by spaces.  
 
* Each person can have multiple given names (at least 1), but exactly one surname and one email. Email is always of the form <tt>username@uniba.sk</tt>.
 
* Each person can have multiple given names (at least 1), but exactly one surname and one email. Email is always of the form <tt>username@uniba.sk</tt>.
Line 967: Line 1,127:
 
** The output file has columns separated by commas ','
 
** The output file has columns separated by commas ','
 
** The first column contains username extracted from email address, the second column surname, the third column all given names and the fourth column the randomly generated password
 
** The first column contains username extracted from email address, the second column surname, the third column all given names and the fourth column the randomly generated password
 +
<!-- NOTEX -->
 
* '''Submit''' file <tt>passwords.csv</tt> with the result of your commands.  
 
* '''Submit''' file <tt>passwords.csv</tt> with the result of your commands.  
 +
<!-- /NOTEX -->
  
 
Example line from input:
 
Example line from input:
 
<pre>
 
<pre>
Pavol Országh Hviezdoslav hviezdoslav32@uniba.sk
+
Pavol Orszagh Hviezdoslav hviezdoslav32@uniba.sk
 
</pre>
 
</pre>
  
 
Example line from output (password will differ):
 
Example line from output (password will differ):
 
<pre>
 
<pre>
hviezdoslav32,Hviezdoslav,Pavol Országh,3T3Pu3un
+
hviezdoslav32,Hviezdoslav,Pavol Orszagh,3T3Pu3un
 
</pre>
 
</pre>
  
Line 982: Line 1,144:
 
* Passwords can be generated using <tt>pwgen</tt> (e.g. <tt>pwgen -N 10 -1</tt> prints 10 passwords, one per line)  
 
* Passwords can be generated using <tt>pwgen</tt> (e.g. <tt>pwgen -N 10 -1</tt> prints 10 passwords, one per line)  
 
* We also recommend using <tt>perl</tt>, <tt>wc</tt>, <tt>paste</tt> (check option <tt>-d</tt> in <tt>paste</tt>)
 
* We also recommend using <tt>perl</tt>, <tt>wc</tt>, <tt>paste</tt> (check option <tt>-d</tt> in <tt>paste</tt>)
* In Perl, function [http://perldoc.perl.org/functions/pop.html pop] may be useful for manipulating @F and function [http://perldoc.perl.org/functions/join.html join] for connecting strings with a separator.
+
* In Perl, function <tt>[http://perldoc.perl.org/functions/pop.html pop]</tt> may be useful for manipulating <tt>@F</tt> and function <tt>[http://perldoc.perl.org/functions/join.html join]</tt> for connecting strings with a separator.
  
==Task B==
+
===Task B (yeast genome)===
  
'''File:'''
+
'''The input file:'''
* <tt>/tasks/hw02/saccharomyces_cerevisiae.gff</tt> contains annotation of the yeast genome  
+
* <tt>/tasks/bash/saccharomyces_cerevisiae.gff</tt> contains annotation of the yeast genome  
 
** Downloaded from http://yeastgenome.org/ on 2016-03-09, in particular from [http://downloads.yeastgenome.org/curation/chromosomal_feature/saccharomyces_cerevisiae.gff].  
 
** Downloaded from http://yeastgenome.org/ on 2016-03-09, in particular from [http://downloads.yeastgenome.org/curation/chromosomal_feature/saccharomyces_cerevisiae.gff].  
 
** It was further processed to omit DNA sequences from the end of file.  
 
** It was further processed to omit DNA sequences from the end of file.  
 
** The size of the file is 5.6M.  
 
** The size of the file is 5.6M.  
* For easier work, link the file to your directory by <tt>ln -s /tasks/hw02/saccharomyces_cerevisiae.gff yeast.gff</tt>
+
* For easier work, link the file to your directory by <tt>ln -s /tasks/bash/saccharomyces_cerevisiae.gff yeast.gff</tt>
* The file is in GFF3 format [http://www.sequenceontology.org/gff3.shtml]
+
* The file is in [http://www.sequenceontology.org/gff3.shtml GFF3 format]
* Lines starting with <tt>#</tt> are comments, other lines contain tab-separated data about one interval of some chromosome in the yeast genome
+
* The lines starting with <tt>#</tt> are comments, other lines contain tab-separated data about one interval of some chromosome in the yeast genome
 
* Meaning of the first 5 columns:
 
* Meaning of the first 5 columns:
 
** column 0 chromosome name
 
** column 0 chromosome name
Line 1,005: Line 1,167:
 
* Print for each type of interval (column 2), how many times it occurs in the file.  
 
* Print for each type of interval (column 2), how many times it occurs in the file.  
 
* Sort from the most common to the least common interval types.
 
* Sort from the most common to the least common interval types.
* Hint: commands <tt>sort</tt> and <tt>uniq</tt> will be useful. Do not forget to skip comments, for example using <tt>grep -v '^#'</tt>  
+
* Hint: commands <tt>sort</tt> and <tt>uniq</tt> will be useful. Do not forget to skip comments, for example using <tt>grep -v '^#'</tt>
* '''Submit''' file <tt>types.txt</tt> with the output formatted as follows:
+
* The result should be a file <tt>types.txt</tt> formatted as follows:
 
<pre>
 
<pre>
 
   7058 CDS
 
   7058 CDS
Line 1,017: Line 1,179:
  
 
</pre>
 
</pre>
 +
<!-- NOTEX -->
 +
'''Submit''' the file <tt>types.txt</tt>
 +
<!-- /NOTEX -->
  
==Task C==
+
===Task C (chromosomes)===
 
* Continue processing file from task B.  
 
* Continue processing file from task B.  
 
* For each chromosome, the file contains a line which has in column 2 string <tt>chromosome</tt>, and the interval is the whole chromosome.
 
* For each chromosome, the file contains a line which has in column 2 string <tt>chromosome</tt>, and the interval is the whole chromosome.
 
* To file <tt>chrosomes.txt</tt>, print a tab-separated list of chromosome names and sizes in the same order as in the input
 
* To file <tt>chrosomes.txt</tt>, print a tab-separated list of chromosome names and sizes in the same order as in the input
 
* The last line of <tt>chromosomes.txt</tt> should list the total size of all chromosomes combined.
 
* The last line of <tt>chromosomes.txt</tt> should list the total size of all chromosomes combined.
 +
<!-- NOTEX -->
 
* '''Submit''' file <tt>chromosomes.txt</tt>
 
* '''Submit''' file <tt>chromosomes.txt</tt>
 +
<!-- /NOTEX -->
 
* Hints:  
 
* Hints:  
 
** The total size can be computed by a perl one-liner.
 
** The total size can be computed by a perl one-liner.
Line 1,040: Line 1,207:
 
</pre>
 
</pre>
  
==Task D==
+
===Task D (blast)===
 
'''Overall goal:'''
 
'''Overall goal:'''
* Proteins from several well-studied yeast species were downloaded from database http://www.uniprot.org/ on 2016-03-09
+
* Proteins from several well-studied yeast species were downloaded from database http://www.uniprot.org/ on 2016-03-09. The file contains sequence of the protein as well as a short description of its biological function.
* We have also downloaded proteins from yeast Yarrowia lipolytica. We will pretend that nothing is known about these proteins (as if they were produced by gene finding program in a newly sequenced genome).
+
* We have also downloaded proteins from the yeast ''Yarrowia lipolytica''. We will pretend that nothing is known about the function of these proteins (as if they were produced by gene finding program in a newly sequenced genome).
* For each Y.lip. proteins we have found similar proteins from other yeasts
+
* For each ''Y.lipolytica'' protein, we have found similar proteins from other yeasts
* Now we want to find for each protein in Y.lip. its closest match among all known proteins.
+
* Now we want to extract for each protein in ''Y.lipolytica'' its closest match among all known proteins and see what is its function. This will give a clue about the potential function of the ''Y.lipolytica'' protein.
  
 
'''Files:'''
 
'''Files:'''
* <tt>/tasks/hw02/known.fa</tt> is a fasta file with known proteins from several species
+
* <tt>/tasks/bash/known.fa</tt> is a FASTA file containing sequences of known proteins from several species
* <tt>/tasks/hw02/yarLip.fa</tt> is a fasta file with proteins from Y.lip.
+
* <tt>/tasks/bash/yarLip.fa</tt> is a FASTA file with proteins from ''Y.lipolytica''
* <tt>/tasks/hw02/known.blast</tt> is the result of finding similar proteins in <tt>yarLip.fa</tt> versus <tt>known.fa</tt> by these commands:
+
* <tt>/tasks/bash/known.blast</tt> is the result of finding similar proteins in <tt>yarLip.fa</tt> versus <tt>known.fa</tt> by these commands (already done by us):
<pre>
+
<syntaxhighlight lang="bash">
 
formatdb -i known.fa
 
formatdb -i known.fa
 
blastall -p blastp -d known.fa -i yarLip.fa -m 9 -e 1e-5 > known.blast
 
blastall -p blastp -d known.fa -i yarLip.fa -m 9 -e 1e-5 > known.blast
</pre>
+
</syntaxhighlight>
 
* you can link these files to your directory as follows:
 
* you can link these files to your directory as follows:
<pre>
+
<syntaxhighlight lang="bash">
ln -s /tasks/hw02/known.fa .
+
ln -s /tasks/bash/known.fa .
ln -s /tasks/hw02/yarLip.fa .
+
ln -s /tasks/bash/yarLip.fa .
ln -s /tasks/hw02/known.blast .
+
ln -s /tasks/bash/known.blast .
</pre>
+
</syntaxhighlight>
  
 
'''Step 1:'''
 
'''Step 1:'''
 
* Get the first (strongest) match for each query from <tt>known.blast</tt>.
 
* Get the first (strongest) match for each query from <tt>known.blast</tt>.
 
* This can be done by printing the lines that are not comments but follow a comment line starting with #.  
 
* This can be done by printing the lines that are not comments but follow a comment line starting with #.  
* In a perl one-liner, you can create a state variable which will remember if the previous line was a comment and based on that you decide of you print the current line.  
+
* In a Perl one-liner, you can create a state variable which will remember if the previous line was a comment and based on that you decide if you print the current line.  
* Instead of using perl, you can play with grep. Option -A 1 prints the matching lines as well as one line ofter each match
+
* Instead of using Perl, you can play with grep. Option <tt>-A 1</tt> prints the matching lines as well as one line after each match
 
* Print only the first two columns separated by tab (name of query, name of target), sort the file by the second column.
 
* Print only the first two columns separated by tab (name of query, name of target), sort the file by the second column.
* '''Submit''' file best.tsv with the result
+
* Store the result in file <tt>best.tsv</tt>. The file should start as follows:
* File should start as follows:
 
 
<pre>
 
<pre>
 
Q6CBS2  sp|B5BP46|YP52_SCHPO
 
Q6CBS2  sp|B5BP46|YP52_SCHPO
Line 1,076: Line 1,242:
 
Q6CH56  sp|B5BP48|YP54_SCHPO
 
Q6CH56  sp|B5BP48|YP54_SCHPO
 
</pre>
 
</pre>
 +
<!-- NOTEX -->
 +
* '''Submit''' file <tt>best.tsv</tt> with the result
 +
<!-- /NOTEX -->
  
 
'''Step 2:'''
 
'''Step 2:'''
* '''Submit''' file <tt>known.tsv</tt> which contains sequence names extracted from known.fa with leading <tt>></tt> removed
+
* Create file <tt>known.tsv</tt> which contains sequence names extracted from <tt>known.fa</tt> with leading <tt>></tt> removed
 
* This file should be sorted alphabetically.
 
* This file should be sorted alphabetically.
* File should start as follows:
+
* The file should start as follows (lines are trimmed below):
 
<pre>
 
<pre>
sp|A0A023PXA5|YA19A_YEAST Putative uncharacterized protein YAL019W-A OS=Saccharomyces cerevisiae (strain ATCC 204508 / S288c) GN=YAL019W-A PE=5 SV=1
+
sp|A0A023PXA5|YA19A_YEAST Putative uncharacterized protein YAL019W-A OS=Saccharomyces...
sp|A0A023PXB0|YA019_YEAST Putative uncharacterized protein YAR019W-A OS=Saccharomyces cerevisiae (strain ATCC 204508 / S288c) GN=YAR019W-A PE=5 SV=1
+
sp|A0A023PXB0|YA019_YEAST Putative uncharacterized protein YAR019W-A OS=Saccharomyces...
 
</pre>
 
</pre>
 +
<!-- NOTEX -->
 +
* '''Submit''' file <tt>known.tsv</tt>
 +
<!-- /NOTEX -->
  
 
'''Step 3:'''
 
'''Step 3:'''
 
* Use command [http://www.gnu.org/software/coreutils/manual/html_node/join-invocation.html join] to join the files <tt>best.tsv</tt> and <tt>known.tsv</tt> so that each line of <tt>best.tsv</tt> is extended with the text describing the corresponding target in <tt>known.tsv</tt>
 
* Use command [http://www.gnu.org/software/coreutils/manual/html_node/join-invocation.html join] to join the files <tt>best.tsv</tt> and <tt>known.tsv</tt> so that each line of <tt>best.tsv</tt> is extended with the text describing the corresponding target in <tt>known.tsv</tt>
 
* Use option <tt>-1 2</tt> to use the second column of <tt>best.tsv</tt> as a key for joining
 
* Use option <tt>-1 2</tt> to use the second column of <tt>best.tsv</tt> as a key for joining
* The output of join may look as follows:
+
* The output of <tt>join</tt> may look as follows:
 
<pre>
 
<pre>
sp|B5BP46|YP52_SCHPO Q6CBS2 Putative glutathione S-transferase C1183.02 OS=Schizosaccharomyces pombe (strain 972 / ATCC 24843) GN=SPBC460.02c PE=3 SV=1
+
sp|B5BP46|YP52_SCHPO Q6CBS2 Putative glutathione S-transferase C1183.02 OS=Schizosaccharomyces...
sp|B5BP48|YP54_SCHPO Q6C8R4 Putative alpha-ketoglutarate-dependent sulfonate dioxygenase OS=Schizosaccharomyces pombe (strain 972 / ATCC 24843) GN=SPBC460.04c PE=3 SV=1
+
sp|B5BP48|YP54_SCHPO Q6C8R4 Putative alpha-ketoglutarate-dependent sulfonate dioxygenase OS=...
 
</pre>
 
</pre>
* Further reformat the output so that query name goes first (e.g. <tt>Q6CBS2</tt>), followed by target name (e.g. <tt>sp|B5BP46|YP52_SCHPO</tt>), followed by the rest of the text, but remove all text after <tt>OS=</tt>
+
* Further reformat the output so that the query name goes first (e.g. <tt>Q6CBS2</tt>), followed by target name (e.g. <tt>sp|B5BP46|YP52_SCHPO</tt>), followed by the rest of the text, but remove all text after <tt>OS=</tt>
* Sort by query name
+
* Sort by query name, store as <tt>best.txt</tt>  
* '''Submit''' file <tt>best.txt</tt> with the result
 
 
* The output should start as follows:
 
* The output should start as follows:
 
<pre>
 
<pre>
Line 1,103: Line 1,274:
 
B5FVB1  sp|O13877|RPAB5_SCHPO  DNA-directed RNA polymerases I, II, and III subunit RPABC5
 
B5FVB1  sp|O13877|RPAB5_SCHPO  DNA-directed RNA polymerases I, II, and III subunit RPABC5
 
</pre>
 
</pre>
 +
<!-- NOTEX -->
 +
* '''Submit''' file  <tt>best.txt</tt>
 +
<!-- /NOTEX -->
  
 
'''Note:'''
 
'''Note:'''
* Not all Y.lip. are necessarily included in your final output (some proteins do not have blast match).
+
* Not all ''Y.lipolytica'' proteins are necessarily included in your final output (some proteins do not have blast match).
** You can think how to find the list of such proteins, but this is not part of the assignment.
+
** You can think how to find the list of such proteins, but this is not part of the task.
 
* Files <tt>best.txt</tt> and <tt>best.tsv</tt> should have the same number of lines.
 
* Files <tt>best.txt</tt> and <tt>best.tsv</tt> should have the same number of lines.
=L03=
+
=Lmake=
 
==Job Scheduling==
 
==Job Scheduling==
  
Line 1,116: Line 1,290:
 
** To run the program immediately, then switch the whole console to the background: [https://www.gnu.org/software/screen/manual/screen.html screen], [https://tmux.github.io/ tmux]
 
** To run the program immediately, then switch the whole console to the background: [https://www.gnu.org/software/screen/manual/screen.html screen], [https://tmux.github.io/ tmux]
 
** To run the command when the computer becomes idle: [http://pubs.opengroup.org/onlinepubs/9699919799/utilities/batch.html batch]
 
** To run the command when the computer becomes idle: [http://pubs.opengroup.org/onlinepubs/9699919799/utilities/batch.html batch]
* Now we will concentrate on '''[https://en.wikipedia.org/wiki/Oracle_Grid_Engine Sun Grid Engine]''', a complex software for managing many jobs from many users on a cluster from multiple computers
+
* Now we will concentrate on '''[https://en.wikipedia.org/wiki/Oracle_Grid_Engine Sun Grid Engine]''', a complex software for managing many jobs from many users on a cluster consisting of multiple computers
 
* Basic workflow:
 
* Basic workflow:
 
** Submit a job (command) to a queue
 
** Submit a job (command) to a queue
Line 1,125: Line 1,299:
 
* Complex possibilities for assigning priorities and deadlines to jobs, managing multiple queues etc.
 
* Complex possibilities for assigning priorities and deadlines to jobs, managing multiple queues etc.
 
* Ideally all computers in the cluster share the same environment and filesystem
 
* Ideally all computers in the cluster share the same environment and filesystem
 +
<!-- NOTEX -->
 
* We have a simple training cluster for this exercise:
 
* We have a simple training cluster for this exercise:
 
** You submit jobs to queue on vyuka
 
** You submit jobs to queue on vyuka
** They will run on computer cpu02
+
** They will run on computers runner01 and runner02
** This cluster is only temporarily available until next Thursday
+
** This cluster is only temporarily available until the next Thursday
 +
<!-- /NOTEX -->
 +
 
  
 
===Submitting a job (qsub)===
 
===Submitting a job (qsub)===
* <tt>qsub -b y -cwd 'command < input > output 2> error'</tt>
+
 
** quoting around command allows us to include special characters, such as <, > etc. and not to apply it to qsub command itself
+
Basic command: <tt>qsub -b y -cwd 'command < input > output 2> error'</tt>
** <tt>-b y</tt> treats command as binary, usually preferable for both binary programs and scripts
+
* quoting around command allows us to include special characters, such as <tt><, ></tt> etc. and not to apply it to <tt>qsub</tt> command itself
** <tt>-cwd</tt> executes command in the current directory
+
* <tt>-b y</tt> treats command as binary, usually preferable for both binary programs and scripts
** <tt>-N</tt> name allows to set name of the job
+
* <tt>-cwd</tt> executes command in the current directory
** <tt>-l resource=value</tt> requests some non-default resources
+
* <tt>-N</tt> name allows to set name of the job
** for example, we can use <tt>-l threads=2</tt> to request 2 threads for parallel programs
+
* <tt>-l resource=value</tt> requests some non-default resources
** Grid engine will not check if you do not use more CPUs or memory than requested, be considerate (and perhaps occasionally watch your jobs by running top at the computer where they execute)
+
* for example, we can use <tt>-l threads=2</tt> to request 2 threads for parallel programs
 +
* Grid engine will not check if you do not use more CPUs or memory than requested, be considerate (and perhaps occasionally watch your jobs by running top at the computer where they execute)
 
* qsub will create files for stdout and stderr, e.g. s2.o27 and s2.e27 for the job with name s2 and jobid 27
 
* qsub will create files for stdout and stderr, e.g. s2.o27 and s2.e27 for the job with name s2 and jobid 27
  
 
===Monitoring and deleting jobs (qstat, qdel)===
 
===Monitoring and deleting jobs (qstat, qdel)===
* <tt>qstat</tt> displays jobs of the current user
+
Command <tt>qstat</tt> displays jobs of the current user
 +
* job 28 is running of  server runner02 (status <t>r</tt>), job 29 is waiting in queue (status <tt>qw</tt>)
 
<pre>
 
<pre>
job-ID  prior  name      user        state submit/start at    queue                         slots ja-task-ID
+
job-ID  prior  name      user        state submit/start at    queue      
-----------------------------------------------------------------------------------------------------------------
+
---------------------------------------------------------------------------------
     28 0.50000 s3        bbrejova    r    03/15/2016 22:12:18 main.q@cpu02.compbio.fmph.unib    1
+
     28 0.50000 s3        bbrejova    r    03/15/2016 22:12:18 main.q@runner02
     29 0.00000 s3        bbrejova    qw    03/15/2016 22:14:08                                   1
+
     29 0.00000 s3        bbrejova    qw    03/15/2016 22:14:08            
 
</pre>
 
</pre>
  
* <tt>qstat -u '*'</tt> displays jobs of all users
+
* Command <tt>qstat -u '*'</tt> displays jobs of all users
** finished jobs disappear from the list
+
* Finished jobs disappear from the list
* <tt>qstat -F threads</tt> shows how many threads available
+
* Command <tt>qstat -F threads</tt> shows how many threads available
 
<pre>
 
<pre>
 
queuename                      qtype resv/used/tot. load_avg arch          states
 
queuename                      qtype resv/used/tot. load_avg arch          states
 
---------------------------------------------------------------------------------
 
---------------------------------------------------------------------------------
main.q@cpu02.compbio.fmph.unib BIP  0/2/8         0.03     lx26-amd64
+
main.q@runner01                BIP  0/1/2          0.00     lx26-amd64  
        hc:threads=0
+
hc:threads=1
    28 0.75000 s3        bbrejova    r    03/15/2016 22:12:18     1
+
    238 0.25000 sleeper.pl bbrejova    r    03/05/2020 13:12:28     1      
    29 0.25000 s3        bbrejova    r    03/15/2016 22:14:18     1
+
---------------------------------------------------------------------------------
 +
main.q@runner02                BIP  0/1/2          0.00    lx26-amd64   
 +
    237 0.75000 sleeper.pl bbrejova    r    03/05/2020 13:12:13     1      
 
</pre>
 
</pre>
  
* Command qdel allows you to delete a job (waiting or running)
+
* Command <tt>qdel</tt> deletes a job (waiting or running)
  
 
===Interactive work on the cluster (qrsh), screen===
 
===Interactive work on the cluster (qrsh), screen===
* <tt>qrsh</tt> creates a job which is a normal interactive shell running on the cluster
+
Command <tt>qrsh</tt> creates a job which is a normal interactive shell running on the cluster
* in this shell you can manually run commands
+
* In this shell you can manually run commands
* when you close the shell, the job finishes
+
* When you close the shell, the job finishes
* therefore it is a good idea to run qrsh within screen
+
* Therefore it is a good idea to run <tt>qrsh</tt> within <tt>screen</tt>
** run screen command, this creates a new shell  
+
** Run <tt>screen</tt> command, this creates a new shell  
** within this shell, run qrsh, then whatever commands
+
** Within this shell, run <tt>qrsh</tt>, then whatever commands
** by pressing Ctrl-a d you "detach" the screen, so that both shells (local and qrsh) continue running but you can close your local window
+
** By pressing <tt>Ctrl-a d</tt> you "detach" the screen, so that both shells (local and <tt>qrsh</tt>) continue running but you can close your local window
** later by running <tt>screen -r</tt> you get back to your shells
+
** Later by running <tt>screen -r</tt> you get back to your shells
  
 
===Running many small jobs===
 
===Running many small jobs===
For example, consider tens of thousands of genes, run some computation for each gene
+
For example, we many need to run some computation for each human gene (there are roughly 20,000 such genes). Here are some possibilities:
* Have a script which iterates through all and runs them sequentially  
+
* Run a script which iterates through all jobs and runs them sequentially  
 
** Problems: Does not use parallelism, needs more programming to restart after some interruption
 
** Problems: Does not use parallelism, needs more programming to restart after some interruption
 
* Submit processing of each gene as a separate job to cluster (submitting done by a script/one-liner)  
 
* Submit processing of each gene as a separate job to cluster (submitting done by a script/one-liner)  
 
** Jobs can run in parallel on many different computers
 
** Jobs can run in parallel on many different computers
 
** Problem: Queue gets very long, hard to monitor progress, hard to resubmit only unfinished jobs after some failure.
 
** Problem: Queue gets very long, hard to monitor progress, hard to resubmit only unfinished jobs after some failure.
* Array jobs in qsub (option -t): runs jobs numbered 1,2,3...; number of the job is in an environment variable, used by the script to decide which gene to process
+
* Array jobs in qsub (option <tt>-t</tt>): runs jobs numbered 1,2,3...; number of the current job is in an environment variable, used by the script to decide which gene to process
 
** Queue contains only running sub-jobs plus one line for the remaining part of the array job.  
 
** Queue contains only running sub-jobs plus one line for the remaining part of the array job.  
 
** After failure, you can resubmit only unfinished portion of the interval (e.g. start from job 173).
 
** After failure, you can resubmit only unfinished portion of the interval (e.g. start from job 173).
 
* Next: using make in which you specify how to process each gene and submit a single make command to the queue
 
* Next: using make in which you specify how to process each gene and submit a single make command to the queue
** Make can execute multiple tasks in parallel using several threads on the same computer (qsub array jobs can run tasks on multiple computers)
+
** Make can execute multiple tasks in parallel using several threads on the same computer (<tt>qsub</tt> array jobs can run tasks on multiple computers)
** It will automatically skip tasks which are already finished
+
** It will automatically skip tasks which are already finished, so restart is easy
  
 
==Make==
 
==Make==
* [https://en.wikipedia.org/wiki/Make_(software) Make] is a system for automatically building programs (running compiler, linker etc)
+
[https://en.wikipedia.org/wiki/Make_(software) Make] is a system for automatically building programs (running compiler, linker etc)
** In particular, we will use [https://www.gnu.org/software/make/manual/ GNU make]
+
* In particular, we will use [https://www.gnu.org/software/make/manual/ GNU make]
* Rules for compilation are written in a Makefile  
+
* Rules for compilation are written in a <tt>Makefile</tt>
 
* Rather complex syntax with many features, we will only cover basics
 
* Rather complex syntax with many features, we will only cover basics
  
 
===Rules===
 
===Rules===
* The main part of a Makefile are rules specifying how to generate target files from some source files (prerequisites).  
+
* The main part of a <tt>Makefile</tt> are rules specifying how to generate target files from some source files (prerequisites).  
* For example the following rule generates target.txt by concatenating source1.txt a source2.txt:
+
* For example the following rule generates file <tt>target.txt</tt> by concatenating files <tt>source1.txt</tt> and <tt>source2.txt</tt>:
<pre>
+
<syntaxhighlight lang="make">
 
target.txt : source1.txt source2.txt
 
target.txt : source1.txt source2.txt
 
       cat source1.txt source2.txt > target.txt
 
       cat source1.txt source2.txt > target.txt
</pre>
+
</syntaxhighlight>
 
* The first line describes target and prerequisites, starts in the first column
 
* The first line describes target and prerequisites, starts in the first column
 
* The following lines list commands to execute to create the target
 
* The following lines list commands to execute to create the target
 
* Each line with a command starts with a '''tab''' character
 
* Each line with a command starts with a '''tab''' character
  
* If we have a directory with this rule in Makefile and files source1.txt and source2.txt, running <tt>make target.txt</tt> will run the cat command
+
* If we have a directory with this rule in file called <tt>Makefile</tt> and files <tt>source1.txt</tt> and <tt>source2.txt</tt>, running <tt>make target.txt</tt> will run the <tt>cat</tt> command
 
* However, if <tt>target.txt</tt> already exists, the command will be run only if one of the prerequisites has more recent modification time than the target
 
* However, if <tt>target.txt</tt> already exists, the command will be run only if one of the prerequisites has more recent modification time than the target
 
* This allows to restart interrupted computations or rerun necessary parts after modification of some input files
 
* This allows to restart interrupted computations or rerun necessary parts after modification of some input files
* Makefile automatically chains the rules as necessary:  
+
* <tt>make</tt> automatically chains the rules as necessary:  
** if we run <tt>make target.txt</tt> and some prerequisite does not exist, Makefile checks if it can be created by some other rule and runs that rule first
+
** if we run <tt>make target.txt</tt> and some prerequisite does not exist, <tt>make</tt> checks if it can be created by some other rule and runs that rule first
** In general it first finds all necessary steps and runs them in topological order so that each rules has its prerequisites ready
+
** In general it first finds all necessary steps and runs them in appropriate order so that each rules has its prerequisites ready
** Option <tt>make -n target</tt> will show what commands would be executed to build target (dry run) - good idea before running something potentially dangerous
+
** Option <tt>make -n target</tt> will show which commands would be executed to build target (dry run) - good idea before running something potentially dangerous
  
 
===Pattern rules===
 
===Pattern rules===
  
* We can specify a general rule for files with a systematic naming scheme. For example, to create a .pdf file from a .tex file, we use pdflatex command:
+
We can specify a general rule for files with a systematic naming scheme. For example, to create a <tt>.pdf</tt> file from a <tt>.tex</tt> file, we use the <tt>pdflatex</tt> command:
<pre>
+
<syntaxhighlight lang="make">
 
%.pdf : %.tex
 
%.pdf : %.tex
 
       pdflatex $^
 
       pdflatex $^
</pre>
+
</syntaxhighlight>
* In the first line, % denotes some variable part of the filename, which has to agree in the target and all prerequisites
+
* In the first line, <tt>%</tt> denotes some variable part of the filename, which has to agree in the target and all prerequisites
 
* In commands, we can use several variables:
 
* In commands, we can use several variables:
** $^ contains name for the prerequisite (source)
+
** Variable <tt>$^</tt> contains the names of the prerequisites (source)
** $@ contains the name of the target
+
** Variable <tt>$@</tt> contains the name of the target
** $* contains the string matched by %
+
** Variable <tt>$*</tt> contains the string matched by <tt>%</tt>
  
 
===Other useful tricks in Makefiles===
 
===Other useful tricks in Makefiles===
  
 
====Variables====
 
====Variables====
* Store some reusable values in variables, then use them several times in the Makefile:
+
Store some reusable values in variables, then use them several times in the <tt>Makefile</tt>:
<pre>
+
<syntaxhighlight lang="make">
 
MYPATH := /projects/trees/bin
 
MYPATH := /projects/trees/bin
  
 
target : source
 
target : source
 
       $(MYPATH)/script < $^ > $@
 
       $(MYPATH)/script < $^ > $@
</pre>
+
</syntaxhighlight>
  
 
====Wildcards, creating a list of targets from files in the directory====
 
====Wildcards, creating a list of targets from files in the directory====
  
The following Makefile automatically creates .png version of each .eps file simply by running make:
+
The following <tt>Makefile</tt> automatically creates <tt>.png</tt> version of each <tt>.eps</tt> file simply by running <tt>make</tt>:
<pre>
+
<syntaxhighlight lang="make">
 
EPS := $(wildcard *.eps)
 
EPS := $(wildcard *.eps)
 
EPSPNG := $(patsubst %.eps,%.png,$(EPS))
 
EPSPNG := $(patsubst %.eps,%.png,$(EPS))
Line 1,251: Line 1,432:
 
%.png : %.eps
 
%.png : %.eps
 
         convert -density 250 $^ $@
 
         convert -density 250 $^ $@
</pre>
+
</syntaxhighlight>
* variable EPS contains names of all files matching *.eps
+
* variable <tt>EPS</tt> contains names of all files matching <tt>*.eps</tt>
* variable EPSPNG contains desirable names of png files
+
* variable <tt>EPSPNG</tt> contains desirable names of <tt>.png</tt> files
** it is created by taking filenames in EPS and changing .eps to .png
+
** it is created by taking filenames in <tt>EPS</tt> and changing <tt>.eps</tt> to <tt>.png</tt>
 
* <tt>all</tt> is a "phony target" which is not really created
 
* <tt>all</tt> is a "phony target" which is not really created
** its rule has no commands but all png files are prerequisites, so are done first
+
** its rule has no commands but all <tt>.png</tt> files are prerequisites, so are done first
** the first target in Makefile (in this case <tt>all</tt>) is default when no other target is specified on command-line
+
** the first target in a <tt>Makefile</tt> (in this case <tt>all</tt>) is default when no other target is specified on the command-line
* <tt>clean</tt> is also a phony target for deleting generated png files
+
* <tt>clean</tt> is also a phony target for deleting generated <tt>.png</tt> files
  
 
====Useful special built-in target names====
 
====Useful special built-in target names====
Include these lines in your Makefile if desired
+
Include these lines in your <tt>Makefile</tt> if desired
<pre>
+
<syntaxhighlight lang="make">
 
.SECONDARY:
 
.SECONDARY:
 
# prevents deletion of intermediate targets in chained rules
 
# prevents deletion of intermediate targets in chained rules
Line 1,268: Line 1,449:
 
.DELETE_ON_ERROR:
 
.DELETE_ON_ERROR:
 
# delete targets if a rule fails
 
# delete targets if a rule fails
</pre>
+
</syntaxhighlight>
  
 
===Parallel make===
 
===Parallel make===
* running make with option <tt>-j 4</tt> will run up to 4 commands in parallel if their dependencies are already finished
+
Running make with option <tt>-j 4</tt> will run up to 4 commands in parallel if their dependencies are already finished. This allows easy parallelization on a single computer.
* easy parallelization on a single computer
 
  
 
==Alternatives to Makefiles==
 
==Alternatives to Makefiles==
* Bioinformatics often uses "pipelines" - sequences of commands run one after another, e.g. by a script of Makefile
+
* Bioinformaticians often uses "pipelines" - sequences of commands run one after another, e.g. by a script or <tt>make</tt>
* There are many tools developed for automating computational pipelines, see e.g. this review: [https://academic.oup.com/bib/article/doi/10.1093/bib/bbw020/2562749/A-review-of-bioinformatic-pipeline-frameworks Jeremy Leipzig; A review of bioinformatic pipeline frameworks. Brief Bioinform 2016 bbw020.]
+
* There are many tools developed for automating computational pipelines in bioinformatics, see e.g. this review: [https://academic.oup.com/bib/article/doi/10.1093/bib/bbw020/2562749/A-review-of-bioinformatic-pipeline-frameworks Jeremy Leipzig; A review of bioinformatic pipeline frameworks. Brief Bioinform 2016.]
* For example [https://bitbucket.org/snakemake/snakemake/wiki/Home Snakemake]
+
* For example [https://snakemake.readthedocs.io/en/stable/ Snakemake]
** Workflows can contain shell commands or Python code
+
** Snake workflows can contain shell commands or Python code
** Big advantage compared to Make: pattern rules may contain multiple variable portions (in make only one % per filename)
+
** Big advantage compared to <tt>make</tt>: pattern rules may contain multiple variable portions (in <tt>make</tt> only one <tt>%</tt> per filename)
** For example, you have several fasta files and several HMMs representing protein families and you wans to run each HMM on each fasta file:
+
** For example, assume we have several FASTA files and several profiles (HMMs) representing protein families and we want to run each profile on each FASTA file:
 
<pre>
 
<pre>
 
rule HMMER:
 
rule HMMER:
     input: "{filename}.fasta", "{hmm}.hmm"
+
     input: "{filename}.fasta", "{profile}.hmm"
     output: "{filename}_{hmm}.hmmer"
+
     output: "{filename}_{profile}.hmmer"
 
     shell: "hmmsearch --domE 1e-5 --noali --domtblout {output} {input[1]} {input[0]}"
 
     shell: "hmmsearch --domE 1e-5 --noali --domtblout {output} {input[1]} {input[0]}"
 
</pre>
 
</pre>
=HW03=
+
==HWmake==
See also [[#L03|Lecture 3]]
+
<!-- NOTEX -->
 +
See also the [[#Lmake|lecture]]
 +
<!-- /NOTEX -->
 +
 
 +
===Motivation: Building phylogenetic trees===
 +
The task for today will be to build a phylogenetic tree of 9 mammalian species using protein sequences
 +
* A '''phylogenetic tree''' is a tree showing evolutionary history of these species. Leaves are the present-day species, internal nodes are their common ancestors.
 +
* The '''input''' contains sequences of all proteins from each species (we will use only a smaller subset)
 +
* The process is typically split into several stages shown below
 +
 
 +
====Identify ortholog groups====
 +
''Orthologs'' are proteins from different species that "correspond" to each other. Orthologs are found based on sequence similarity and we can use a tool called [http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download blast] to identify sequence similarities between pairs of proteins. The result of ortholog group identification will be a set of groups, each group having one sequence from each of the 9 species.
  
==Motivation: Building Phylogenetic Trees==
+
====Align proteins on each group====
The task for today will be to build a [https://en.wikipedia.org/wiki/Phylogenetic_tree phylogenetic tree] of 9 mammalian species using protein sequences
+
For each ortholog group, we need to align proteins in the group to identify corresponding parts of the proteins. This is done by a tool called <tt>muscle</tt>
* A phylogenetic tree is a tree showing evolutionary history of these species. Leaves are target present-day species, internal nodes are their common ancestors.
 
* Input contains sequences of selected proteins from each species
 
* Step 1: Identify ''ortholog groups''. Orthologs are proteins from different species that "correspond" to each other. This is done based on sequence similarity and we can use a tool called [http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download blast] to identify sequence similarities between individual proteins. The result of ortholog group identification will be a set of groups, each group having one sequence from each of the 9 species
 
* Step 2: For each ortholog group, we need to align proteins and build a phylogenetic tree for this protein using existing methods. We can do this using tools muscle (for alignment) and phyml (for phylogenetic tree inference).
 
 
   
 
   
Unaligned sequences (start of protein O60568):
+
Unaligned sequences (start of protein [https://www.uniprot.org/uniprot/O60568 O60568]):
 
<pre>
 
<pre>
 
>human
 
>human
Line 1,332: Line 1,519:
 
</pre>
 
</pre>
  
Phylogenetic tree in newick format:
+
====Build phylogenetic tree for each grup====
 +
 
 +
For each alignment, we build a phylogenetic tree for this group. We will use a program called phyml.
 +
 
 +
Example of a phylogenetic tree in newick format:
 
<pre>
 
<pre>
  ((opossum:0.09636245,rabbit:0.85794020):0.05219782,(rat:0.07263127,elephant:0.03306863):0.01043531,(dog:0.01700528,(pig:0.02891345,(guineapig:0.14451043,(human:0.01169266,baboon:0.00827402):0.02619598):0.00816185):0.00631423):0.00800806);
+
  ((opossum:0.09636245,rabbit:0.85794020):0.05219782,
 +
(rat:0.07263127,elephant:0.03306863):0.01043531,
 +
(dog:0.01700528,(pig:0.02891345,
 +
(guineapig:0.14451043,
 +
(human:0.01169266,baboon:0.00827402):0.02619598
 +
):0.00816185):0.00631423):0.00800806);
 
</pre>
 
</pre>
  
 +
[[Image:Tree.png|center|thumb|200px|Tree for gene O60568 (note: this particular tree does not agree well with real evolutionary history)]]
  
<!-- [[Image:L02 human 15749.png|center|thumb|200px|Tree for gene human_15749 (branch lengths ignored)]] -->
+
====Build a consensus tree====
  
* Step 3: The result of the previous step will be several trees, one for every group. Ideally, all trees would be identical, showing the real evolutionary history of the 9 species. But it is not easy to infer the real tree from sequence data, so trees from different groups might differ. Therefore, in the last step, we will build a consensus tree. This can be done by using an interactive tool called phylip.
+
The result of the previous step will be several trees, one for every group. Ideally, all trees would be identical, showing the real evolutionary history of the 9 species. But it is not easy to infer the real tree from sequence data, so the trees from different groups might differ. Therefore, in the last step, we will build a consensus tree. This can be done by using a tool called Phylip. The output is a single consensus tree.
* Output is a single consensus tree.
 
  
==Files and submitting==
+
<!-- NOTEX -->
  
Our goal for today is to build a pipeline that automates the whole task using make and execute it remotely using qsub. Most of the work is already done, only small modifications are necessary.
+
===Files and submitting===
 +
<!-- /NOTEX -->
 +
<!-- TEX
 +
===Files===
 +
/TEX -->
  
* Submit by copying requested files to <tt>/submit/hw03/username/</tt>
+
Our goal is to build a pipeline that automates the whole task using make and execute it remotely using <tt>qsub</tt>. Most of the work is already done, only small modifications are necessary.
* Do not forget to submit protocol, outline of the protocol is in <tt>/tasks/hw03/protocol.txt</tt>
 
  
Start by copying /tasks/hw03 to your user directory
+
<!-- NOTEX -->
::<tt>cp -ipr /tasks/hw03 ~</tt>
+
* Submit by copying requested files to <tt>/submit/make/username/</tt>
 +
* Do not forget to submit protocol, outline of the protocol is in <tt>/tasks/make/protocol.txt</tt>
 +
<!-- /NOTEX -->
 +
 
 +
Start by copying directory <tt>/tasks/make</tt> to your user directory
 +
<syntaxhighlight lang="bash">
 +
cp -ipr /tasks/make .
 +
cd make
 +
</syntaxhighlight>
  
It contains 3 subdirectories:
+
The directory contains three subdirectories:
* large: larger sample of proteins for task A
+
* <tt>large</tt>: a larger sample of proteins for task A
* tiny: very small set of proteins for task B
+
* <tt>tiny</tt>: a very small set of proteins for task B
* small: slightly larger set of proteins for task C
+
* <tt>small</tt>: a slightly larger set of proteins for task C
  
==Task A==
+
===Task A (long job)===
  
* In this task, you will run a long alignment job (>2 hours)
+
* In this task, you will run a long alignment job (more than two hours)
 
* Use directory <tt>large</tt> with files:
 
* Use directory <tt>large</tt> with files:
** ref.fa: selected human proteins
+
** <tt>ref.fa</tt>: selected human proteins
** other.fa: selected proteins from 8 other mammalian species
+
** <tt>other.fa</tt>: selected proteins from 8 other mammalian species
** Makefile: run blast on ref.fa vs other.fa (also formats database other.fa before that)
+
** <tt>Makefile</tt>: runs blast on <tt>ref.fa</tt> vs <tt>other.fa</tt> (also formats database <tt>other.fa</tt> before that)
* run make -n to see what commands will be done (you should see makeblastdb and blastp + echo for timing), copy the output to the '''protocol'''
+
* run <tt>make -n</tt> to see what commands will be done (you should see <tt>makeblastdb</tt>, <tt>blastp</tt>, and <tt>echo</tt> for timing)  
* run qsub with appropriate options to run make (at least -cwd and -b y)
+
<!-- NOTEX -->
 +
** copy the output to the '''protocol'''
 +
<!-- /NOTEX -->
 +
* run <tt>qsub</tt> with appropriate options to run <tt>make</tt> (at least <tt>-cwd -b y</tt>)
 +
<!-- NOTEX -->
 
* then run <tt>qstat > queue.txt </tt>
 
* then run <tt>qstat > queue.txt </tt>
 
** '''Submit''' file <tt>queue.txt</tt> showing your job waiting or running  
 
** '''Submit''' file <tt>queue.txt</tt> showing your job waiting or running  
* When your job finishes, '''submit''' also the following two files:
+
<!-- /NOTEX -->
** the last 100 lines from the output file ref.blast under the name ref-end.blast (use tool tail -n 100)
+
<!-- TEX
** standard output from the qsub job, which is stored in a file named e.g. make.oX where X is the number of your job. The output shows the time when your job started and finished (this information was written by commands echo in the Makefile)
+
* Use <tt>qsub</tt> to check the status of your job
 +
/TEX -->
 +
* When your job finishes, check the following files:
 +
** the output file <tt>ref.blast</tt>
 +
** standard output from the <tt>qsub</tt> job, which is stored in a file named e.g. <tt>make.oX</tt> where <tt>X</tt> is the number of your job. The output shows the time when your job started and finished (this information was written by commands <tt>echo</tt> in the <tt>Makefile</tt>)
 +
<!-- NOTEX -->
 +
* '''Submit''' the last 100 lines from <tt>ref.blast</tt> under the name <tt>ref-end.blast</tt> (use tool <tt>tail -n 100</tt>) and the file <tt>make.oX</tt> mentioned above
 +
<!-- /NOTEX -->
  
==Task B==
+
===Task B (finishing Makefile)===
  
* In this task, you will finish a Makefile for splitting blast results into ortholog groups and building phylogenetic trees for each group
+
* In this task, you will finish a <tt>Makefile</tt> for splitting blast results into ortholog groups and building phylogenetic trees for each group
** This Makefile works with much smaller files and so you can run it many times on vyuka, without qsub
+
** This <tt>Makefile</tt> works with much smaller files and so you can run it quickly many times without <tt>qsub</tt>
* Work in directory tiny
+
* Work in directory <tt>tiny</tt>
** ref.fa: 2 human proteins
+
** <tt>ref.fa</tt>: 2 human proteins
** other.fa: a selected subset of proteins from 8 other mammalian species
+
** <tt>other.fa</tt>: a selected subset of proteins from 8 other mammalian species
** Makefile: a longer makefile
+
** <tt>Makefile</tt>: a longer makefile
** brm.pl: a Perl script for finding ortholog groups and sorting them to directories
+
** <tt>brm.pl</tt>: a Perl script for finding ortholog groups and sorting them to directories
  
The Makefile runs the analysis in four stages. Stages 1,2 and 4 are done, you have to finish stage 3
+
The <tt>Makefile</tt> runs the analysis in four stages. Stages 1,2 and 4 are done, you have to finish stage 3
* If you run make without argument, it will attempt to run all 4 stages, but stage 3 will not run, because it is missing
+
* If you run <tt>make</tt> without argument, it will attempt to run all 4 stages, but stage 3 will not run, because it is missing
 
* Stage 1: run as <tt>make ref.brm</tt>
 
* Stage 1: run as <tt>make ref.brm</tt>
** It runs blast as in task A, then splits proteins into ortholog groups and creates one directory for each group with file prot.fa containing protein sequences
+
** It runs <tt>blast</tt> as in task A, then splits proteins into ortholog groups and creates one directory for each group with file <tt>prot.fa</tt> containing protein sequences
 
* Stage 2: run as <tt>make alignments</tt>
 
* Stage 2: run as <tt>make alignments</tt>
** In each directory with a single gene, it will create an alignment prot.phy and link it under names lg.phy and wag.phy
+
** In each directory with an ortholog group, it will create an alignment <tt>prot.phy</tt> and link it under names <tt>lg.phy</tt> and <tt>wag.phy</tt>
 
* Stage 3: run as <tt>make trees</tt> (needs to be written by you)
 
* Stage 3: run as <tt>make trees</tt> (needs to be written by you)
** In each directory with a single gene, it should create lg.phy_phyml_tree and wag.phy_phyml_tree
+
** In each directory with an ortholog group, it should create files <tt>lg.phy_phyml_tree</tt> and <tt>wag.phy_phyml_tree</tt> containing the results of the <tt>phyml</tt> program run with two different evolutionary models WAG and LG, where LG is the default
** These corresponds to results of phyml commands run with two different evolutionary models WAG and LG, where LG is the default
+
** Run <tt>phyml</tt> by commands of the form:<br><tt>phyml -i INPUT --datatype aa --bootstrap 0 --no_memory_check >LOG</tt><br><tt>phyml -i INPUT --model WAG --datatype aa --bootstrap 0 --no_memory_check >LOG</tt>
** Run phyml by commands of the forms:
+
** Change <tt>INPUT</tt> and <tt>LOG</tt> in the commands to the appropriate filenames using <tt>make</tt> variables <tt>$@, $^, $*</tt> etc. The input should come from lg.phy or wag.phy in the directory of a gene and log should be the same as tree name with extension <tt>.log</tt> added (e.g. <tt>lg.phy_phyml_tree.log</tt>)
*** <tt>phyml -i INPUT --datatype aa --bootstrap 0 --no_memory_check >LOG</tt>
+
** Also add variables <tt>LG_TREES</tt> and <tt>WAG_TREES</tt> listing filenames of all desirable trees and uncomment phony target <tt>trees</tt> which uses these variables
*** <tt>phyml -i INPUT --model WAG --datatype aa --bootstrap 0 --no_memory_check >LOG</tt>
 
** Change INPUT and LOG in the commands to appropriate filenames using make variables $@, $^, $* etc. Input should come from lg.phy or wag.phy in the directory of a gene and log should be the same as tree name with extension .log added (e.g. lg.phy_phyml_tree.log)
 
** Also add variables LG_TREES and WAG_TREES listing filenames of all desirable trees and uncomment phony target <tt>trees</tt> which uses these variables
 
 
* Stage 4:  run as <tt>make consensus</tt>
 
* Stage 4:  run as <tt>make consensus</tt>
** Output trees from stage 3 are concatenated for each model separately to files lg/intree, wag/intree and then phylip is run to produce consensus trees lg.tree and wag.tree
+
** Output trees from stage 3 are concatenated for each model separately to files <tt>lg/intree</tt>, <tt>wag/intree</tt> and then <tt>phylip</tt> is run to produce consensus trees <tt>lg.tree</tt> and <tt>wag.tree</tt>
** This stage also needs variables LG_TREES and WAG_TREES to be defined by you.
+
** This stage also needs variables <tt>LG_TREES</tt> and <tt>WAG_TREES</tt> to be defined by you.
  
* Run your Makefile
+
* Run your <tt>Makefile</tt> and check that the files <tt>lg.tree</tt> and <tt>wag.tree</tt> are produced
* '''Submit''' the whole directory tiny, including Makefile and all gene directories with tree files.
+
<!-- NOTEX -->
 +
* '''Submit''' the whole directory <tt>tiny</tt>, including <tt>Makefile</tt> and all gene directories with tree files.
 +
<!-- /NOTEX -->
  
==Task C==
 
* Copy your Makefile from part B to directory small, which contains 9 human proteins and run make on this slightly larger set
 
** Again, run it on vyuka server without qsub, but it will take some time, particularly if the server is busy
 
* Look at the two trees from task C (wag.tree, lg.tree) using the figtree program on vyuka (you can also [https://github.com/rambaut/figtree/releases install it] on your computer)
 
* In figtree, change the position of the root in the tree to make opossum the outgroup (species branching as the first away from the others).
 
*** This is done in figtree by clicking on opossum and thus selecting it, then pressing Reroot button.
 
** Also switch on displaying branch labels. These labels show for each branch of the tree, how many of the input trees support this branch.
 
***  Use the left panel with options.
 
** Export the trees in pdf format as wag.tree.pdf and lg.tree.pdf and include in your submission
 
** Compare the two trees and write '''your observations to the protocol'''
 
*** Note that the two children of each internal node are equivalent, so their placement higher or lower in the figure does not matter.
 
*** Do the two trees differ? What is the highest and lowest support for a branch in each tree?
 
*** Also compare your trees with the accepted "correct tree" found here http://genome-euro.ucsc.edu/images/phylo/hg38_100way.png (note that this tree contains many more species, but all ours are included)
 
* '''Submit''' the entire small directory (including the two pdf files)
 
 
==Further possibilities==
 
  
 +
===Task C (running make)===
 +
* Copy your <tt>Makefile</tt> from part B to directory <tt>small</tt>, which contains 9 human proteins and run <tt>make</tt> on this slightly larger set
 +
** Again, run it without <tt>qsub</tt>, but it will take some time, particularly if the server is busy
 +
* Look at the two resulting trees (<tt>wag.tree</tt>, <tt>lg.tree</tt>) using the <tt>figtree</tt> program
 +
<!-- NOTEX -->
 +
** it is available on vyuka, but you can also [https://github.com/rambaut/figtree/releases install it] on your computer if needed
 +
<!-- /NOTEX -->
 +
* In <tt>figtree</tt>, change the position of the root in the tree to make the opossum the outgroup (species branching as the first away from the others). This is done by clicking on opossum and thus selecting it, then pressing the Reroot button.
 +
* Also switch on displaying branch labels. These labels show for each branch of the tree, how many of the input trees support this branch. To do this, use the left panel with options.
 +
* Export the trees in pdf format as <tt>wag.tree.pdf</tt> and <tt>lg.tree.pdf</tt>
 +
* Compare the two trees
 +
** Note that the two children of each internal node are equivalent, so their placement higher or lower in the figure does not matter.
 +
** Do the two trees differ? What is the highest and lowest support for a branch in each tree?
 +
** Also compare your trees with the accepted "correct tree" found here http://genome-euro.ucsc.edu/images/phylo/hg38_100way.png (note that this tree contains many more species, but all ours are included)
 +
<!-- NOTEX -->
 +
** Write '''your observations to the protocol'''
 +
* '''Submit''' the entire <tt>small</tt> directory (including the two pdf files)
 +
<!-- /NOTEX -->
 +
 +
===Further possibilities===
 +
 +
<!-- NOTEX -->
 
Here are some possibilities for further experiments, in case you are interested (do not submit these):
 
Here are some possibilities for further experiments, in case you are interested (do not submit these):
* You could copy your extended Makefile to directory large and create trees for all ortholog groups in the big set
+
<!-- /NOTEX -->
 +
<!-- TEX
 +
Here are some possibilities for further experiments, in case you are interested:
 +
/TEX -->
 +
* You could copy your extended <tt>Makefile</tt> to directory <tt>large</tt> and create trees for all ortholog groups in the big set
 +
<!-- NOTEX -->
 
** This would take a long time, so submit it through qsub and only some time after the lecture is over to allow classmates to work on task A
 
** This would take a long time, so submit it through qsub and only some time after the lecture is over to allow classmates to work on task A
** After ref.brm si done, programs for individual genes can be run in parallel, so you can try running make -j 2 and request 2 threads from qsub
+
<!-- /NOTEX -->
* Phyml also supports other models, for example JTT  (see [http://www.atgc-montpellier.fr/download/papers/phyml_manual_2012.pdf manual]), you could try to play with those.
+
** After <tt>ref.brm</tt> si done, programs for individual genes can be run in parallel, so you can try running <tt>make -j 2</tt> and request 2 threads from <tt>qsub</tt>
* Command touch FILENAME will change modification time of the given file to current file
+
* Phyml also supports other models, for example JTT  (see [http://www.atgc-montpellier.fr/download/papers/phyml_manual_2012.pdf manual]); you could try to play with those.
** What happens when you run touch on some of the intermediate files in the analysis in task B? Does Makefile always run properly?
+
* Command <tt>touch FILENAME</tt> will change the modification time of the given file to the current time
=L04=
+
** What happens when you run <tt>touch</tt> on some of the intermediate files in the analysis in task B? Does <tt>Makefile</tt> always run properly?
[[#HW04]]
+
=Lpython=
 +
<!-- NOTEX -->
 +
[[#HWpython]]
 +
<!-- /NOTEX -->
 +
 
 +
This lecture introduces the basics of the Python programming language. We will also cover basics of working with databases using the SQL language and SQLite3 lightweight database system.
  
* Program for today: basics of Python and SQL
+
<!-- NOTEX -->
** Two version of homework: four easier tasks for beginners, or two more complicated ones for advanced Python/SQL programmers
+
The next three lectures
* The next three lectures
+
* Computer science students will use Python, SQLite3 and several advanced Python libraries for complex data processing
** Computer science students will use Python and SQLite3 and several advanced Python libraries for complex data processing
+
* Bioinformatics students will use several bioinformatics command-line tools
** Bioinformatics students will use several bioinformatics command-line tools
+
<!-- /NOTEX -->
  
 
==Overview, documentation==
 
==Overview, documentation==
Python: good sources for beginners:
 
* A very concise cheat sheet: [http://www.cogsci.rpi.edu/~destem/igd/python_cheat_sheet.pdf]
 
* A more detailed tutorial: [https://docs.python.org/3/tutorial/]
 
  
SQL:
+
'''Python'''
 +
* Popular programming language
 +
* Advantages: powerful language features, extensive libraries
 +
* Disadvantages: interpreted language, can be slow
 +
* [https://perso.limsi.fr/pointal/_media/python:cours:mementopython3-english.pdf A very concise cheat sheet]
 +
* [https://docs.python.org/3/tutorial/ A more detailed tutorial]
 +
 
 +
'''SQL'''
 
* Language for working with relational databases, more in a dedicated course
 
* Language for working with relational databases, more in a dedicated course
 
* We will cover basics of SQL and work with a simple DB system SQLite3  
 
* We will cover basics of SQL and work with a simple DB system SQLite3  
* SQLite3 documentation: [https://www.sqlite.org/docs.html]
+
* Typical database systems are complex, use server-client architecture. SQLite3 is a simple "database" stored in one file. You can think of SQLite3 not as a replacement for Oracle but as a replacement for <tt>fopen()</tt>.
* SQL tutorial: [https://www.w3schools.com/sql/default.asp]
+
* [https://www.sqlite.org/docs.html SQLite3 documentation]
* SQLite3 in Python [https://docs.python.org/3/library/sqlite3.html]
+
* [https://www.w3schools.com/sql/default.asp SQL tutorial]
 +
* [https://docs.python.org/3/library/sqlite3.html SQLite3 in Python documentation]
  
Program for today:
+
Outline of this lecture:
 
* We introduce a simple data set
 
* We introduce a simple data set
* We look at several python scripts for processing this data set
+
* We look at several Python scripts for processing this data set
* HW: You create another such script
+
* Solve task A, where you create another such script
* We introduce basics of working directly with SQLite3
+
* We introduce basics of working with SQLite3 and writing SQL queries
* HW: You write your own queries
+
* Solve tasks B1 and B2, where you write your own SQL queries
 
* We look at how to combine Python and SQLite
 
* We look at how to combine Python and SQLite
* HW: You write a program combining the two
+
* Solve task C, where you write a program combining the two
 +
* Students familiar with both Python and SQL may skip tasks A, B1, B2 and and do tasks C and D
  
==Dataset for this week==
+
==Dataset for this lecture==
 
* [https://www.imdb.com/ IMDb] is an online database of movies and TV series with user ratings
 
* [https://www.imdb.com/ IMDb] is an online database of movies and TV series with user ratings
 
* We have downloaded a preprocessed dataset of selected TV series ratings from [https://github.com/nazareno/imdb-series/ GitHub]
 
* We have downloaded a preprocessed dataset of selected TV series ratings from [https://github.com/nazareno/imdb-series/ GitHub]
* From this dataset, we have selected only 10 series with the highest average number of voting users
+
* From this dataset, we have selected 10 series with high average number of voting users
* Data are 2 files in csv format: list of series, list of episodes
+
* Data are two files in csv format: list of series, list of episodes
 +
* csv stands for comma-separated values
  
File series.cvs contains one row per series  
+
File <tt>series.cvs</tt> contains one row per series  
 
* Columns: (0) series id, (1) series title, (2) TV channel:
 
* Columns: (0) series id, (1) series title, (2) TV channel:
 
<pre>
 
<pre>
Line 1,472: Line 1,712:
 
</pre>
 
</pre>
  
File episodes.csv contains one row per episode:
+
File <tt>episodes.csv</tt> contains one row per episode:
 
* Columns: (0) series id, (1) episode title, (2) episode order within the whole series, (3) season number, (4) episode number within season, (5) user rating, (6) the number of votes
 
* Columns: (0) series id, (1) episode title, (2) episode order within the whole series, (3) season number, (4) episode number within season, (5) user rating, (6) the number of votes
* Here is a sample of 4 episodes from Game of Thrones
+
* Here is a sample of 4 episodes from the Game of Thrones series
 
* If the episode title contains a comma, the whole title is in quotation marks
 
* If the episode title contains a comma, the whole title is in quotation marks
 
<pre>
 
<pre>
Line 1,483: Line 1,723:
 
</pre>
 
</pre>
  
==Several python scripts==
+
Note that a different version of this data was used already in the [[#Lperl#The_first_input_file_for_today:_TV_series|lecture on Perl]].
 +
 
 +
==Several Python scripts==
 +
 
 +
We will illustrate basic features of Python on several scripts working with these files.
  
 
===prog1.py===  
 
===prog1.py===  
Print the second column (series title) from series.csv
+
The first script prints the second column (series title) from <tt>series.csv</tt>
<pre>
+
<syntaxhighlight lang="Python">
 
#! /usr/bin/python3
 
#! /usr/bin/python3
  
Line 1,498: Line 1,742:
 
         # print the second column
 
         # print the second column
 
         print(columns[1])
 
         print(columns[1])
</pre>
+
</syntaxhighlight>
 +
 
 +
* Python uses indentation to delimit blocks. In this example, the <tt>with</tt> command starts a block and within this block the <tt>for</tt> command starts another block containing commands <tt>columns=...</tt> and <tt>print</tt>. The body of each block is indented several spaces relative to the enclosing block.
 +
* Variables are not declared, but directly used. This program uses variables <tt>csvfile, line, columns</tt>.
 +
* The <tt>open</tt> command opens a file (here for reading, but other options are [https://docs.python.org/3/library/functions.html#open available]).
 +
* The [https://www.geeksforgeeks.org/with-statement-in-python/ <tt>with</tt> command] opens the file, stores the file handle in  <tt>csvfile</tt> variable,  executes all commands within its block and finally closes the file.
 +
* The for-loop iterates over all lines in the file, assigning each in variable <tt>line</tt> and executing the body of the block.
 +
* Method <tt>split</tt> of the built-in string type <tt>str</tt> splits the line at every comma and returns a list of strings, one for every column of the table (see also other [https://docs.python.org/3/library/stdtypes.html#string-methods string methods])
 +
* The final line prints the second column and the end of line character.
  
 
===prog2.py===  
 
===prog2.py===  
Print the list of series of each TV channel
+
The following script prints the list of series of each TV channel
* For illustration we also separately count the series for each channel, but the count could be obtained as the length of the list
+
* For illustration we also separately count the number of the series for each channel, but the count could be also obtained as the length of the list
* For simplicity we use library data structure defaultdict instead of plain python dictionary
+
<syntaxhighlight lang="Python">
<pre>
 
 
#! /usr/bin/python3
 
#! /usr/bin/python3
 
from collections import defaultdict
 
from collections import defaultdict
Line 1,533: Line 1,784:
 
# print counts
 
# print counts
 
print("Counts:")
 
print("Counts:")
for channel in channel_counts:
+
for (channel, count) in channel_counts.items():
 
     print("The number of series for channel \"%s\" is %d"  
 
     print("The number of series for channel \"%s\" is %d"  
     % (channel, channel_counts[channel]))
+
     % (channel, count))
 
 
  
 
# print series lists
 
# print series lists
Line 1,542: Line 1,792:
 
for channel in channel_lists:
 
for channel in channel_lists:
 
     list = ", ".join(channel_lists[channel])  
 
     list = ", ".join(channel_lists[channel])  
     print("series for channel \"%s\": %s" % (channel,list))
+
     print("Series for channel \"%s\": %s" % (channel,list))
</pre>
+
</syntaxhighlight>
 +
* In this script, we use two dictionaries (maps, associative arrays), both having channel names as keys. Dictionary <tt>channel_counts</tt> stores the number of series, <tt>channel_lists</tt> stores the list of series names.
 +
* For simplicity we use a library data structure called <tt>defaultdict</tt> instead of a plain python dictionary. The reason is easier initialization: keys do not need  to be explicitly inserted to the dictionary, but are initialized with a default value at the first access.
 +
* Reading of the input file is similar to the previous script
 +
* Afterwards we iterate through the keys of both dictionaries and print both the keys and the values
 +
* We format the output string using the <tt>%</tt> operator to replace <tt>%s</tt> and <tt>%d</tt> with values <tt>channel</tt> and <tt>count</tt>.
 +
* Notice that when we print counts, we iterate through pairs <tt>(channel, count)</tt> returned by <tt>channel_counts.items()</tt>, while when we print series, we iterate through keys of the dictionary
  
 
===prog3.py===  
 
===prog3.py===  
Find the episode with the highest number of votes among all episodes
+
This script finds the episode with the highest number of votes among all episodes
* We use a library for csv parsing to deal with quotation marks.
+
* We use a library for csv parsing to deal with quotation marks around episode names with commas, such as <tt>"Dark Wings, Dark Words"</tt>
<pre>
+
* This is done by first opening a file and then passing it as an argument to <tt>csv.reader</tt>, which returns a reader object used to iterate over rows.
 +
<syntaxhighlight lang="Python">
 
#! /usr/bin/python3
 
#! /usr/bin/python3
 
import csv
 
import csv
  
#keep maximum number of votes and its episode
+
# keep maximum number of votes and its episode
 
max_votes = 0
 
max_votes = 0
 
max_votes_episode = None
 
max_votes_episode = None
Line 1,569: Line 1,826:
 
# print result
 
# print result
 
print("Maximum votes %d in episode \"%s\"" % (max_votes, max_votes_episode))
 
print("Maximum votes %d in episode \"%s\"" % (max_votes, max_votes_episode))
</pre>
+
</syntaxhighlight>
  
 
===prog4.py===  
 
===prog4.py===  
Example of function definition, reading the whole file into a 2d array
+
The following script shows an example of function definition
<pre>
+
* The function reads a whole csv file into a 2d array
 +
* The rest of the program calls this function twice for each of the two files
 +
* This could be followed by some further processing of these 2d arrays
 +
<syntaxhighlight lang="Python">
 
#! /usr/bin/python3
 
#! /usr/bin/python3
 
import csv
 
import csv
Line 1,593: Line 1,853:
 
print("the number of episodes is %d" % len(episodes))
 
print("the number of episodes is %d" % len(episodes))
 
# further processing of series and episodes...
 
# further processing of series and episodes...
</pre>
+
</syntaxhighlight>
 
+
 
'''Now do [[#HW04]], task A'''
+
<!-- NOTEX -->
 +
'''Now do [[#HWpython]], task A'''
 +
<!-- /NOTEX -->
 +
<!-- TEX
 +
Now do task A in the exercises section.
 +
/TEX -->
  
 
==SQL and SQLite==
 
==SQL and SQLite==
  
 
===Creating a database===
 
===Creating a database===
SQLite3 database is a file with your data stored in some special format. To load our csv file to a SQLite database, run command:
+
SQLite3 database is a file with your data stored in a special format. To load our csv file to a SQLite database, run command:
<pre>
+
<syntaxhighlight lang="bash">
 
sqlite3 series.db < create_db.sql
 
sqlite3 series.db < create_db.sql
</pre>
+
</syntaxhighlight>
  
Contents of create_db.pl:
+
Contents of <tt>create_db.sql</tt> file:
<pre>
+
<syntaxhighlight lang="sql">
 
CREATE TABLE series (
 
CREATE TABLE series (
 
   id INT,
 
   id INT,
Line 1,625: Line 1,890:
 
.mode csv
 
.mode csv
 
.import episodes.csv episodes
 
.import episodes.csv episodes
</pre>
+
</syntaxhighlight>
 +
* The two <tt>CREATE TABLE</tt> commands create two tables named <tt>series</tt> and <tt>episodes</tt>
 +
* For each column (attribute) of the table we list its name and type.
 +
* Commands starting with a dot are special SQLite3 commands, not part of SQL itself. Command <tt>.import</tt> reads the txt file and stores it in a table.
 +
 
 +
Other useful SQLite3 commands;
 +
* <tt>.schema tableName</tt> (lists columns of a given table)
 +
* <tt>.mode column</tt> and <tt>.headers on</tt> (turn on human-friendly formatting (not good for further processing)
  
 
===SQL queries===
 
===SQL queries===
* Run <tt>sqlite3 series.db</tt>
+
* Run <tt>sqlite3 series.db</tt> to get an SQLite command-line where you can interact with your database
* Then type on SQLite3 command line the following queries
+
* Then type the queries below which illustrate the basic features of SQL
<pre>
+
* In these queries, we use uppercase for SQL keywords and lowercase for our names of tables and columns (SQL keywords are not case sensitive)
 +
<syntaxhighlight lang="sql">
 
/*  switch on human-friendly formatting */
 
/*  switch on human-friendly formatting */
 
.mode column
 
.mode column
Line 1,650: Line 1,923:
 
/* print all episodes with at least 50k votes, order by votes */
 
/* print all episodes with at least 50k votes, order by votes */
 
SELECT title, votes FROM episodes
 
SELECT title, votes FROM episodes
   WHERE votes>50000 ORDER BY votes desc;
+
   WHERE votes>50000 ORDER BY votes DESC;
  
 
/* join series and episodes tables, print 10 episodes
 
/* join series and episodes tables, print 10 episodes
Line 1,657: Line 1,930:
 
   FROM episodes AS e, series AS s
 
   FROM episodes AS e, series AS s
 
   WHERE e.seriesId=s.id
 
   WHERE e.seriesId=s.id
   ORDER BY votes desc limit 10;
+
   ORDER BY votes DESC
 +
  LIMIT 10;
  
 
/* compute the number of series per channel, as prog2.py */
 
/* compute the number of series per channel, as prog2.py */
SELECT channel, COUNT() as series_count
+
SELECT channel, COUNT() AS series_count
 
   FROM series GROUP BY channel;
 
   FROM series GROUP BY channel;
  
Line 1,666: Line 1,940:
 
SELECT seriesId, season, COUNT() AS episode_count, AVG(rating) AS rating
 
SELECT seriesId, season, COUNT() AS episode_count, AVG(rating) AS rating
 
   FROM episodes GROUP BY seriesId, season;
 
   FROM episodes GROUP BY seriesId, season;
</pre>
+
</syntaxhighlight>
 +
 
 +
Parts of a typical SQL query:
 +
* SELECT followed by column names, or functions MAX, COUNT etc. These columns or expressions are printed for each row of the table, unless filtered out (see below).  Individual columns of the output can be given aliases by keyword AS
 +
* FROM followed by a list of tables. Tables also can get aliases (<tt>FROM episodes AS e</tt>)
 +
* WHERE followed by a condition used for filtering the rows
 +
* ORDER BY followed by an expression used for sorting the rows
 +
* LIMIT followed by the maximum number of rows to print
  
'''Now do [[#HW04]], tasks B1, B2'''
+
More complicated concepts:
 +
* GROUP By allows to group rows based on common value of some columns and compute statistics per group (count, maximum, sum etc)
 +
* If you list two tables in FROM, you will conceptually create all pairs of rows, one from one table, one from the other. These are then typically filtered in the FROM clause to only those that have a matching ID (for example <tt>WHERE e.seriesId=s.id</tt> in one of the queries above)
 +
 
 +
<!-- NOTEX -->
 +
'''Now do [[#HWpython]], tasks B1 and B2.'''
 +
<!-- /NOTEX -->
 +
<!-- TEX
 +
Now do tasks B1 and B2 in the exercises section.
 +
/TEX -->
  
 
==Accessing a database from Python==
 
==Accessing a database from Python==
 +
 +
We will use sqlite3 library for Python to access data from the database and to process them further in the Python program.
  
 
===read_db.py===
 
===read_db.py===
* Script illustrates running a SELECT query and getting results
+
* The following script illustrates running a SELECT query and getting the results
<pre>
+
<syntaxhighlight lang="Python">
 
#! /usr/bin/python3
 
#! /usr/bin/python3
 
import sqlite3
 
import sqlite3
Line 1,695: Line 1,987:
 
# close db connection
 
# close db connection
 
connection.close()
 
connection.close()
</pre>
+
</syntaxhighlight>
  
 
===write_db.py===
 
===write_db.py===
Script illustrates creating a new database containing a multiplication table
+
This script illustrates creating a new database containing a multiplication table
<pre>
+
<syntaxhighlight lang="Python">
 
#! /usr/bin/python3
 
#! /usr/bin/python3
 
import sqlite3
 
import sqlite3
Line 1,723: Line 2,015:
 
# close db connection
 
# close db connection
 
connection.close()
 
connection.close()
</pre>
+
</syntaxhighlight>
  
 
We can check the result by running command
 
We can check the result by running command
<pre>
+
<syntaxhighlight lang="bash">
 
sqlite3 multiplication.db "SELECT * FROM mult_table;"
 
sqlite3 multiplication.db "SELECT * FROM mult_table;"
</pre>
+
</syntaxhighlight>
  
'''Now do [[#HW04]], task C'''
+
<!-- NOTEX -->
=HW04=
+
'''Now do [[#HWpython]], task C.'''
[[#L04|Lecture 04]]
+
<!-- /NOTEX -->
 +
<!-- TEX
 +
Now do task C in the exercises section.
 +
/TEX -->
  
==Introduction==
+
==HWpython==
 +
<!-- NOTEX -->
 +
See also the [[#Lpython|lecture]]
 +
<!-- /NOTEX -->
 +
 
 +
===Introduction===
 
Choose one of the options:
 
Choose one of the options:
 
* Tasks A, B1, B2, C (recommended for beginners)
 
* Tasks A, B1, B2, C (recommended for beginners)
 
* Tasks C, D (recommended for experienced Python/SQL programmers)
 
* Tasks C, D (recommended for experienced Python/SQL programmers)
  
==Preparation==
+
===Preparation===
 
Copy files:
 
Copy files:
<pre>
+
<syntaxhighlight lang="bash">
mkdir hw04
+
mkdir python
cd hw04
+
cd python
cp -iv /tasks/hw04/* .
+
cp -iv /tasks/python/* .
</pre>
+
</syntaxhighlight>
  
 
The directory contains the following files:
 
The directory contains the following files:
* *.py: python scripts from the lecture, included only for convenience
+
* <tt>*.py</tt>: python scripts from the lecture, included for convenience
* series.csv, episodes.csv: data file used in the homework (and the lecture)
+
* <tt>series.csv</tt>, <tt>episodes.csv</tt>: data files introduced in the lecture
* create_db.sql: sql commands to create the database needed in tasks B1, B2, C, D
+
* <tt>create_db.sql</tt>: SQL commands to create the database needed in tasks B1, B2, C, D
* protocol.txt: fill in and submit the protocol. Only "Vyhodnotenie" and "Pouzite zdroje" are needed this time
+
<!-- NOTEX -->
 +
* <tt>protocol.txt</tt>: fill in and submit the protocol.  
 +
<!-- /NOTEX -->
 +
 
 +
<!-- NOTEX -->
 +
Submit by copying requested files to <tt>/submit/python/username/</tt>
 +
<!-- /NOTEX -->
  
==Task A==
+
===Task A (Python)===
* Write a script which reads both csv files and outputs for each TV channel the total number of episodes in their series combined
+
Write a script <tt>taskA.py</tt> which reads both csv files and outputs for each TV channel the total number of episodes in their series combined. Run your script as follows:
* '''Submit''' file taskA.py with your script
+
<syntaxhighlight lang="bash">
* Run your script as follows and '''submit''' the file taskA.txt:
 
<pre>
 
 
./taskA.py > taskA.txt
 
./taskA.py > taskA.txt
</pre>
+
</syntaxhighlight>
* One of the lines of your output should be:
+
One of the lines of your output should be:
 
<pre>
 
<pre>
 
The number of episodes for channel "HBO" is 76
 
The number of episodes for channel "HBO" is 76
 
</pre>
 
</pre>
 +
<!-- NOTEX -->
 +
'''Submit''' file <tt>taskA.py</tt> with your script and the output file <tt>taskA.txt</tt>:
 +
<!-- /NOTEX -->
 +
 
Hints:
 
Hints:
* A good place to start is prog4.py with reading both csv files and prog2.py with a dictionary of counters
+
* A good place to start is <tt>prog4.py</tt> with reading both csv files and <tt>prog2.py</tt> with a dictionary of counters
 
* It might be useful to build a dictionary linking the series id to the channel name for that series
 
* It might be useful to build a dictionary linking the series id to the channel name for that series
  
==Task B1==
+
===Task B1 (SQL)===
* To prepare the database for tasks B1, B2 and C, run the command:
+
To prepare the database for tasks B1, B2, C and D, run the command:
<pre>
+
<syntaxhighlight lang="bash">
 
sqlite3 series.db < create_db.sql
 
sqlite3 series.db < create_db.sql
</pre>
+
</syntaxhighlight>
  
 
To verify that your database was created correctly, you can run the following commands:
 
To verify that your database was created correctly, you can run the following commands:
<pre>
+
<syntaxhighlight lang="bash">
 
sqlite3 series.db ".tables"
 
sqlite3 series.db ".tables"
 
# output should be  episodes  series   
 
# output should be  episodes  series   
Line 1,781: Line 2,089:
 
sqlite3 series.db "select count() from episodes; select count() from series;"
 
sqlite3 series.db "select count() from episodes; select count() from series;"
 
# output should be 348 and 10
 
# output should be 348 and 10
</pre>
+
</syntaxhighlight>
  
* The [[#L04#SQL_queries|last query in the lecture]] counts the number of episodes and average rating per each season of each series:
+
The [[#Lpython#SQL_queries|last query in the lecture]] counts the number of episodes and average rating per each season of each series:
<pre>
+
<syntaxhighlight lang="sql">
 
SELECT seriesId, season, COUNT() AS episode_count, AVG(rating) AS rating
 
SELECT seriesId, season, COUNT() AS episode_count, AVG(rating) AS rating
 
   FROM episodes GROUP BY seriesId, season;
 
   FROM episodes GROUP BY seriesId, season;
</pre>
+
</syntaxhighlight>
* Use join with the series table to replace the numeric series id with the series title and add the channel name
+
Use join with the <tt>series</tt> table to replace the numeric series id with the series title and add the channel name. Write your SQL query to file <tt>taskB1.sql</tt>. The first two lines of the file should be
* Write your SQL query to file <tt>taskB1.sql</tt> and '''submit''' this file
+
<syntaxhighlight lang="sql">
** The first two lines of the sql file should be
 
<pre>
 
 
.mode column
 
.mode column
 
.headers on
 
.headers on
</pre>
+
</syntaxhighlight>
* Run your query as follows:
+
Run your query as follows:
<pre>
+
<syntaxhighlight lang="bash">
 
sqlite3 series.db < taskB1.sql > taskB1.txt
 
sqlite3 series.db < taskB1.sql > taskB1.txt
</pre>
+
</syntaxhighlight>
* '''Submit''' also the resulting file <tt>taskB1.txt</tt>
+
For example, both seasons of True Detective by HBO have 8 episodes and average ratings 9.3 and 8.25
* For example, both seasons of True Detective by HBO have 8 episodes and average ratings 9.3 and 8.25
 
 
<pre>
 
<pre>
 
True Detective  HBO        1          8              9.3       
 
True Detective  HBO        1          8              9.3       
 
True Detective  HBO        2          8              8.25       
 
True Detective  HBO        2          8              8.25       
 
</pre>
 
</pre>
 +
<!-- NOTEX -->
 +
'''Submit'''  <tt>taskB1.sql</tt> and <tt>taskB1.txt</tt>
 +
<!-- /NOTEX -->
  
==Task B2==
+
===Task B2 (SQL)===
* For each channel compute the total count and average rating of all their episodes.
+
For each channel compute the total count and average rating of all their episodes. Write your SQL query to file <tt>taskB2.sql</tt>. As before, the first two lines of the file should be
* Write your SQL query to file taskB2.sql and '''submit''' this file
+
<syntaxhighlight lang="sql">
** The first two lines of the sql file should be
 
<pre>
 
 
.mode column
 
.mode column
 
.headers on
 
.headers on
</pre>
+
</syntaxhighlight>
* Run your query as follows:
+
Run your query as follows:
<pre>
+
<syntaxhighlight lang="bash">
 
sqlite3 series.db < taskB2.sql > taskB2.txt
 
sqlite3 series.db < taskB2.sql > taskB2.txt
</pre>
+
</syntaxhighlight>
* '''Submit''' also the resulting file taskB2.txt
+
For example, all 76 episodes for the two HBO series have average rating as follows:
* For example, all 76 episodes for the two HBO series have average rating as follows:
 
 
<pre>
 
<pre>
 
HBO        76          8.98947368421053
 
HBO        76          8.98947368421053
 
</pre>
 
</pre>
 +
<!-- NOTEX -->
 +
'''Submit'''  <tt>taskB2.sql</tt> and <tt>taskB2.txt</tt>
 +
<!-- /NOTEX -->
  
==Task C==
+
===Task C (Python+SQL)===
* If you have not done so already, create an SQLite database, as explained at the beginning of [[#Task_B1|task B1]].
+
If you have not done so already, create an SQLite database, as explained at the beginning of [[#Task_B1 (SQL)|task B1]].
* Write a python script that runs the last query from the lecture (shown below) and stores its results in a separate table called seasons in the series.db database
+
 
<pre>
+
Write a Python script that runs the last query from the lecture (shown below) and stores its results in a separate table called <tt>seasons</tt> in the <tt>series.db</tt> database file
 +
<syntaxhighlight lang="sql">
 
/* print the number of episodes and average rating per season and series */
 
/* print the number of episodes and average rating per season and series */
 
SELECT seriesId, season, COUNT() AS episode_count, AVG(rating) AS rating
 
SELECT seriesId, season, COUNT() AS episode_count, AVG(rating) AS rating
 
   FROM episodes GROUP BY seriesId, season;
 
   FROM episodes GROUP BY seriesId, season;
</pre>
+
</syntaxhighlight>
* SQL can store results from a query directly in a table, but in this task you should instead read each row of the SELECT query in python and to store it by running INSERT command from python
+
* SQL can store the results from a query directly in a table, but in this task you should instead read each row of the <tt>SELECT</tt> query in Python and to store it by running <tt>INSERT</tt> command from Python
* Also do not forget to create the new table in the database with appropriate column names and types. You can execute CREATE TABLE command from python
+
* Also do not forget to create the new table in the database with appropriate column names and types. You can execute <tt>CREATE TABLE</tt> command from Python
* The cursor from the SELECT query is needed while you iterate over the results. Therefore create two cursors - one for reading the database and one for writing.
+
* The cursor from the <tt>SELECT</tt> query is needed while you iterate over the results. Therefore create two cursors - one for reading the database and one for writing.
 
* If you change your database during debugging, you can start over by running the command for creating the database above
 
* If you change your database during debugging, you can start over by running the command for creating the database above
* Store and '''submit''' the script in <tt>taskC.py</tt>. Also '''submit''' the modified database <tt>series.db</tt>
+
* Store the script as <tt>taskC.py</tt>.  
 +
 
 +
To check that your table was created, you can run command
 +
<syntaxhighlight lang="bash">
 +
sqlite3 series.db "SELECT * FROM seasons;"
 +
</syntaxhighlight>
 +
This will print many lines, including this one: <tt>"5|1|8|9.3"</tt> which is for season 1 of series 5 (True Detective).
  
* To check that your table was created, you can run command
+
<!-- NOTEX -->
:: <tt>sqlite3 series.db "SELECT * FROM seasons;"</tt>
+
'''Submit''' your script <tt>taskC.py</tt> and the modified database <tt>series.db</tt>.
* This will print many lines, including this one: "5|1|8|9.3" which is for season 1 of series 5 (True Detective)
+
<!-- /NOTEX -->
  
==Task D==
+
===Task D (SQL, optionally Python)===
* For each pair of consecutive seasons within each series, compute how much has the average rating increased or decreased
+
For each pair of consecutive seasons within each series, compute how much has the average rating increased or decreased.
* For example in the Sherlock series, season 1 had rating 8.825 and season 2 rating 9.26666666666667, and thus the difference in ratings is -0.44166666666667
+
* For example in the Sherlock series, season 1 had rating 8.825 and season 2 rating 9.26666666666667, and thus the difference in ratings is 0.44166666666667
 
* Print a table containing series name, season number x, average rating in season x and average rating in season x+1
 
* Print a table containing series name, season number x, average rating in season x and average rating in season x+1
* The table should be ordered by the difference between the last two columns, i.e. from seasons with the highest increase to seasons to the highest drop.
+
* The table should be ordered by the difference between the last two columns, i.e. from seasons with the highest increase to seasons with the highest drop.
* One option is to run a query in SQL in which you join table seasons from task C with itself and select rows that belong to the same series and successive seasons
+
* One option is to run a query in SQL in which you join the <tt>seasons</tt> table from task C with itself and select rows that belong to the same series and successive seasons
* You can also read the rows of the seasons table in Python, combine information from rows for successive seasons of the same series and create the final report by sorting
+
* You can also read the rows of the <tt>seasons</tt> table in Python, combine information from rows for successive seasons of the same series and create the final report by sorting
* '''Submit''' your code as <tt>taskD.py</tt> or <tt>taskD.sql</tt>
+
<!-- NOTEX -->
* '''Submit''' the resulting table as <tt>taskD.txt</tt>
+
* '''Submit''' your code as <tt>taskD.py</tt> or <tt>taskD.sql</tt> and the resulting table as <tt>taskD.txt</tt>
 +
<!-- /NOTEX -->
  
 
The output should start like this (the formatting may differ):
 
The output should start like this (the formatting may differ):
Line 1,860: Line 2,176:
 
</pre>
 
</pre>
  
When using sql without python, include the following two lines in <tt>taskD.sql</tt>
+
When using SQL without Python, include the following two lines in <tt>taskD.sql</tt>
<pre>
+
<syntaxhighlight lang="sql">
 
.mode column
 
.mode column
 
.headers on
 
.headers on
</pre>
+
</syntaxhighlight>
 
and run your query as <tt>sqlite3 series.db < taskD.sql > taskD.txt</tt>
 
and run your query as <tt>sqlite3 series.db < taskD.sql > taskD.txt</tt>
=L05inf=
+
=Lweb=
[[#HW05inf]]
+
<!-- NOTEX -->
 
+
[[#HWweb]]
In this lecture we dive into SQLite3 and Python.
+
<!-- /NOTEX -->
 
 
== SQLite3 ==
 
  
SQLite3 is a simple "database" stored in one file. Think of SQLite not as a replacement for Oracle but as a replacement for fopen().
+
Sometimes you may be interested in processing data which is available in the form of a website consisting of multiple webpages (for example an e-shop with one page per item or a discussion forum with pages of individual users and individual discussion topics).  
Documentation: https://www.sqlite.org/docs.html
 
  
You can access sqlite database either from command line:
+
In this lecture, we will extract information from such a website using Python and existing Python libraries. We will store the results in an SQLite database. These results will be analyzed further in the following lectures.
<pre>
 
usamec@Darth-Labacus-2:~$ sqlite3 db.sqlite3
 
SQLite version 3.8.2 2013-12-06 14:53:30
 
Enter ".help" for instructions
 
Enter SQL statements terminated with a ";"
 
sqlite> CREATE TABLE test(id integer primary key, name text);
 
sqlite> .schema test
 
CREATE TABLE test(id integer primary key, name text);
 
sqlite> .exit
 
</pre>
 
 
 
Or from python interface: https://docs.python.org/2/library/sqlite3.html.
 
 
 
== Python ==
 
 
 
Python is a perfect language for almost anything. Here is a cheatsheet: http://www.cogsci.rpi.edu/~destem/igd/python_cheat_sheet.pdf
 
  
 
== Scraping webpages ==
 
== Scraping webpages ==
  
The simplest tool for scraping webpages is urllib2: https://docs.python.org/2/library/urllib2.html
+
In Python, the simplest tool for downloading webpages is <tt>[https://docs.python.org/2/library/urllib2.html urllib2]</tt> library. Example usage:
Example usage:
+
<syntaxhighlight lang="Python">
<pre>
 
 
import urllib2
 
import urllib2
 
f = urllib2.urlopen('http://www.python.org/')
 
f = urllib2.urlopen('http://www.python.org/')
 
print f.read()
 
print f.read()
</pre>
+
</syntaxhighlight>
  
Or use requests package:
+
You can also use <tt>[https://requests.readthedocs.io/en/master/ requests]</tt> package (this is recommended):
<pre>
+
<syntaxhighlight lang="Python">
 
import requests
 
import requests
 
r = requests.get("http://en.wikipedia.org")
 
r = requests.get("http://en.wikipedia.org")
 
print(r.text[:10])
 
print(r.text[:10])
</pre>
+
</syntaxhighlight>
  
 
== Parsing webpages ==
 
== Parsing webpages ==
  
We use beautifulsoup4 for parsing html (http://www.crummy.com/software/BeautifulSoup/bs4/doc/).
+
When you download one page from a website, it is in HTML format and you need to extract useful information from it. We will use <tt>beautifulsoup4</tt> library for parsing HTML.
I recommend following examples at the beginning of the documentation and example about CSS selectors:
+
* In your code, we recommend following the examples at the beginning of the [http://www.crummy.com/software/BeautifulSoup/bs4/doc/ documentation] and the example of [http://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors CSS selectors]. Also you can check out general [https://www.w3schools.com/cssref/css_selectors.asp syntax] of CSS selectors.
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors
+
* Information you need to extract is located within the structure of the HTML document
 +
* To find out, how is the document structured, use <tt>Inspect element</tt> feature in Chrome or Firefox (right click on the text of interest within the website). For example this text on the course webpage is located within <tt>LI</tt> element, which is within <tt>UL</tt> element, which is in 4 nested <tt>DIV</tt> elements, one <tt>BODY</tt> element and one <tt>HTML</tt> element. Some of these elements also have a class (starting with a dot) or an ID (starting with <tt>#</tt>).
 +
* Based on this information, create a CSS selector.
  
 
== Parsing dates ==
 
== Parsing dates ==
  
You have two options. Either use datetime.strptime or use [https://dateutil.readthedocs.org/en/latest/parser.html dateutil] package.
+
To parse dates (written as a text), you have two options:
 +
* <tt>[https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior datetime.strptime]</tt>
 +
* <tt>[https://dateutil.readthedocs.org/en/latest/parser.html dateutil]</tt> package.
  
== Other usefull tips ==
+
== Other useful tips ==
 +
* Don't forget to commit changes to your SQLite3 database (call <tt>db.commit()</tt>).
 +
* SQL command <tt>CREATE TABLE IF NOT EXISTS</tt> can be useful at the start of your script.
 +
* Use <tt>screen</tt> command for long running scripts.
 +
* All packages are installed on our server. If you use your own laptop, you need to install them using <tt>pip</tt> (preferably in an <tt>virtualenv</tt>).
 +
==HWweb==
 +
<!-- NOTEX -->
 +
See [[#Lweb|the lecture]]
 +
<!-- /NOTEX -->
  
* Don't forget to commit to your sqlite3 database (db.commit()).
+
<!-- NOTEX -->
* CREATE TABLE IF NOT EXISTS can be usefull at the start of your script.
+
Submit by copying requested files to <tt>/submit/web/username/</tt>
* Inspect element (right click on element) in Chrome can be very helpful.
+
<!-- /NOTEX -->
* Use screen command for long running scripts.
 
* All packages are installed on vyuka server. If you are planning using your own laptop, you need to install them using pip (preferably using virtualenv).
 
=HW05inf=
 
[[#L05inf|Lecture 5 inf]]
 
  
* Submit by copying requested files to /submit/hw05inf/username/
+
'''General goal:''' Scrape comments from user discussions at the <tt>sme.sk</tt> website. Store comments  from several (hundreds) users from the last month in an SQLite3 database.
  
General goal: Scrape comments from several (hundreds) sme.sk users from last month and store them in SQLite3 database.
+
===Task A===
 
 
==Task A==
 
  
 
Create SQLite3 "database" with appropriate schema for storing comments from SME.sk discussions.
 
Create SQLite3 "database" with appropriate schema for storing comments from SME.sk discussions.
You will probably need tables for users and comments. You don't need to store which comments replies to which one.
+
You will probably need tables for users and comments. You don't need to store which comment replies to which one but store the date and time when the comment was made.
  
 +
<!-- NOTEX -->
 
Submit two files:
 
Submit two files:
* db.sqlite3 - the database
+
* <tt>db.sqlite3</tt> - the database
* schema.txt - brief description of your schema and rationale behind it
+
* <tt>schema.txt</tt> - a brief description of your schema and rationale behind it
 +
<!-- /NOTEX -->
  
==Task B==
+
 
 +
===Task B===
  
 
Build a crawler, which crawls comments in sme.sk discussions.
 
Build a crawler, which crawls comments in sme.sk discussions.
 
You have two options:
 
You have two options:
* For fewer points: Script which gets url of the user (http://ekonomika.sme.sk/diskusie/user_profile.php?id_user=157432) and crawls his comments from last month.
+
* For fewer points: Script which gets URL of a user (e.g. http://ekonomika.sme.sk/diskusie/user_profile.php?id_user=157432) and crawls comments of this user from the last month.
* For more points: Scripts which gets one starting url (either user profile or some discussion, your choice) and automatically discovers users and crawls their comments.
+
* For more points: Scripts which gets one starting URL (either user profile or some discussion, your choice) and automatically discovers users and crawls their comments.
 +
 
 +
This crawler should store the comments in SQLite3 database built in the previous task.
  
This crawler should store comments in SQLite3 database built in previous task.
+
<!-- NOTEX -->
Submit following:
+
Submit the following:
* db.sqlite3 - the database
+
* <tt>db.sqlite3</tt> - the database
 
* every python script used for crawling
 
* every python script used for crawling
* README (how to start your crawler)
+
* <tt>README</tt> (how to start your crawler)
=L05bin=
+
<!-- /NOTEX -->
[[#HW05bin]]
 
  
The goal of the next three lectures is to get experience with several common bioinformatics tools
+
=Lflask=
* You will learn more about the algorithms and models behind these tools in [http://compbio.fmph.uniba.sk/vyuka/mbi/index.php/%C3%9Avod Methods in bioinformatics] course
+
<!-- NOTEX -->
 +
[[#HWflask]]
 +
<!-- /NOTEX -->
  
==Overview of DNA sequencing and assembly==
+
In this lecture, we will use Python to process user comments obtained in the previous lecture.
* '''DNA sequencing''' is a technology of reading the order of nucleotides along a DNA strand
+
* We will display information about individual users as a dynamic website written in Flask framework
* The result is represented as a string of A,C,G,T
+
* We will use simple text processing utilities from ScikitLearn library to extract word use statistics from the comments
* Only fragments of DNA of limited length can be read, these are called '''sequencing reads'''
 
* Different technologies produce reads of different characteristics
 
* Examples:
 
** '''Illumina sequencers''' produce short reads (typical length 100-200bp), but in great quantities and very low error rate (<0.1%)
 
*** The reads usually come in '''pairs''' sequenced from both ends of a DNA fragment of an approximately known length
 
** '''Oxford nanopore sequencers''' produce longer reads (thousands bp or more), but the error rates are higher (10-15%)
 
  
 +
==Flask==
  
* The goal of '''genome sequencing''' is to read all chromosomes of an organism
+
[http://flask.pocoo.org/docs/1.0/quickstart/ Flask] is a simple web server for Python. Using Flask you can write a simple dynamic website in Python.
* Sequencing machines produce many reads coming from different parts of the genome
 
* Using software tools called '''sequence assemblers''', these reads are glued together based on overlaps
 
* Ideally we would get the true chromosomes, but often we get only shorter fragments called '''contigs'''
 
* The results of assembly can contain errors
 
* We prefer longer contigs with fewer errors
 
  
==Sequence alignments and dotplots==
 
* '''Sequence alignment''' is the task of finding similarities between DNA (or protein) sequences
 
* Here is an example - short similarity between region 344447..344517 of one sequence and 3261..3327 of another
 
<pre>
 
Query: 344447 tctccgacggtgatggcgttgtgcgtcctctatttcttttatttctttttgttttatttc 344506
 
              |||||||| |||||||||||||||||| ||||||| |||||||||||| ||  ||||||
 
Sbjct: 3261  tctccgacagtgatggcgttgtgcgtc-tctatttattttatttctttgtg---tatttc 3316
 
  
Query: 344507 tctgactaccg 344517
+
===Running Flask===
              |||||||||||
 
Sbjct: 3317  tctgactaccg 3327
 
</pre>
 
  
* Alignments can be stored in many formats and visualised as dotplots
+
You can find a sample Flask application at <tt>/tasks/flask/simple_flask.</tt> Run it using these commands:
* In a '''dotplot''', x-axis correspond to positions in one sequence and y-axis to another sequence
+
<syntaxhighlight lang="bash">
* Diagonal lines show alignments between the sequences (direction of the diagonal shows which DNA strand was aligned)
+
cd <your directory>
 +
export FLASK_APP=main.py
 +
export FLASK_ENV=development # this is optional, but recommended for debugging
  
[[Image:Dotplot-mt-human-dros.png|center|thumb|250px|Dotplot of human and ''Drosophila'' mitochondrial genomes]]
+
# before running the following, change the port number
 +
# so that no two users use the same number
 +
flask run --port=PORT
 +
</syntaxhighlight>
  
==File formats==
+
PORT is a random number greater than 1024. This number should be different from other people running flask on the same machine (if you run into the problem where flask writes out lot of error messages complaining about permissions, select a different port number). Flask starts a webserver on port PORT and serves the pages created in your Flask application. Keep it running while you need to access these pages.
  
===Fasta===
+
To view these pages, open a web browser on the same computer where the Flask is running, e.g. <tt>chromium-browser  http://localhost:PORT/</tt> (use the port number you have selected to run Flask). If you are running flask on a server, you probably want to run the web browser on your local machine. In such case, you need to use ssh tunnel to channel the traffic through ssh connection:
* For storing DNA, RNA and protein sequences
+
<!-- NOTEX -->
* We were already working with fasta on [[#HWperl1]]
+
* On your local machine, open another console window and create an ssh tunnel as follows: <tt>ssh -L PORT:localhost:PORT vyuka.compbio.fmph.uniba.sk</tt> (replace PORT with the port number you have selected to run Flask)
* Each sequence consists of several lines of the file. The first line starts with ">" followed by identifier of the sequence and optionally some further description separated by whitespace
+
<!-- /NOTEX -->
* The sequence itself is on the second line, long sequences are split into multiple lines
+
<!-- TEX
<pre>
+
* On your local machine, open another console window and create an ssh tunnel as follows: <tt>ssh -L PORT:localhost:PORT server.name.com</tt> (replace PORT with the port number you have selected to run Flask)
>SRR022868.1845_1
+
/TEX -->
AAATTTAGGAAAAGATGATTTAGCAACATTTAGCCTTAATGAAAGACCAGATTCTGTTGCCATGTTTGAATGCCTTAAACCAGTAGCAGAATCAGTATAAA
+
* For Windows machines, follow a [https://blog.devolutions.net/2017/04/how-to-configure-an-ssh-tunnel-on-putty tutorial] how to create an ssh tunnel
>SRR022868.1846_1
+
* Keep this ssh connection open while you need to access your Flask web pages; it makes port PORT available on your local machine
TAGCGTTGTAAAATAAATTTCTAGAATGGAAGTGATGATATTGAAATACACTCAGATCCTGAATGAAAGATTTATTAAAGTTAAGACGAGAGTCTCATTAT
+
* In your browser, you can now access your Flask webpages, using e.g. <tt>chromium-browser  http://localhost:PORT/</tt>
</pre>
 
  
===Fastq===
+
===Structure of a Flask application===
* Special format for storing sequencing reads, containing DNA sequences but also quality information about each nucleotide * More in [[#Lperl1#The_second_input_file_for_today:_DNA_sequencing_reads_.28fastq.29|Lecture 01]]
 
  
===Sam/bam===
+
* The provided Flask application resides in the <tt>main.py</tt> script.
* Format for storing alignments of sequencing reads (or other sequences) to a genome [https://samtools.github.io/hts-specs/SAMv1.pdf]
+
* Some functions in this script are annotated with decorators starting with <tt>@app</tt>.
* For each read, the file contains the read itself, its quality, but also the chromosome/contig name and position where this read is likely coming from, and an additional information e.g. about mapping quality (confidence in the correct location)
+
* Decorator <tt>@app.before_request</tt> marks a function which will be executed before processing a particular request from a web browser. In this case we open a database connection and store it in a special variable <tt>g</tt> which can be used to store variables for a particular request.
* Sam files are text-based, thus easier to check manually; bam files are binary and compressed, thus smaller and faster to read
+
* Decorator <tt>@app.route('/')</tt> marks a function which will serve the main  page of the application with URL <tt>http://localhost:4247/</tt>. Similarly decorator <tt>@app.route('/wat/<random_id>/')</tt> marks a function which will serve URLs of the form <tt>http://localhost:4247/wat/100</tt> where the particular string which the user uses in the URL (here <tt>100</tt>) will be stored in <tt>random_id</tt> variable accessible within the function.
* We can easily convert between them using [https://github.com/samtools/samtools samtools]
+
* Functions serving a request return a string containing the requested webpage (typically a HTML document). For example, function <tt>wat</tt> returns a simple string without any HTML markup.
 +
* To more easily construct a full HTML document, you can use [http://jinja.pocoo.org/docs/dev/templates/ jinja2] templating language, as is done in the <tt>home</tt> function. The template itself is in file <tt>templates/main.html</tt>.
  
===Paf format===
 
* Another format for storing alignments [https://github.com/lh3/miniasm/blob/master/PAF.md]
 
  
===Gzip===
+
==Processing text==
* A general-purpose format for file compression [https://en.wikipedia.org/wiki/Gzip]
 
* Often used in bioinformatics on large fastq or fasta files
 
* Running command <tt>gzip filename.ext</tt> will create compressed file <tt>filename.ext.gz</tt> (original file will be deleted).
 
* The reverse process by <tt>gunzip filename.ext.gz</tt>
 
* Print the content of a gzipped file <tt>zcat filename.ext.gz</tt> (this will keep the gzipped file as is)
 
* Page through the content of a gzipped file <tt>zless filename.ext.gz</tt>
 
=HW05bin=
 
[[#L05bin]]
 
  
Submit the protocol and the required files to /submit/hw05bin
+
The main tool we will use for processing text is [http://scikit--learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html <tt>CountVectorizer</tt>] class from the Scikit-learn library.
 +
It transforms a text into a bag of words representation. In this representation we get the list of words and the count for each word. Example:
  
==Task A: examine input files==
+
<syntaxhighlight lang="Python">
* copy files from /tasks/hw05bin/
+
from sklearn.feature_extraction.text import CountVectorizer
<pre>
 
mkdir hw05
 
cd hw05
 
cp -iv /tasks/hw05bin/* .
 
</pre>
 
* ref.fasta is a piece of genome from E.coli
 
* miseq_R1.fastq.gz and miseq_R2.fastq.gz  are sequencing reads from Illumina MiSeq sequencer. First reads in pairs are in R1 file, second reads in R2 file. These reads come from the region in ref.fasta
 
* nanopore.fasta are nanopore sequencing reads in a fasta format (without qualities). These reads are also from the region in ref.fasta
 
  
Try to find the answers to the following questions using command-line tools. In your '''protocol''', note down the '''commands''' as well as the '''answers''':
+
vec = CountVectorizer(strip_accents='unicode')
* (a) How many reads are in the miseq files? Is the number of reads the same in both files?
 
** Try command <tt>zcat miseq_R1.fastq.gz | wc -l </tt>
 
** Can you figure out the answer from the result of this command?
 
* (b) How long are individual reads in the miseq files?
 
** Look at the file using zless - do all reads appear to be of an equal length?
 
** Extend the following command with tail and wc -c to get the length of the first read: <tt>zcat miseq_R1.fastq.gz | head -n 2</tt>
 
** Repeat for both miseq files
 
* (c) How many reads are in the nanopore file (beware - different format)
 
* (d) What is the average length of the reads in the nanopore file?
 
** Try command: <tt>samtools faidx nanopore.fasta</tt>
 
** This creates <tt>nanopore.fasta.fai</tt> file, where the second column contains sequence length of each read
 
** Compute the average of this column by a one-liner: <tt>perl -lane '$s+=$F[1]; $n++; END { print $s/$n }' nanopore.fasta.fai</tt>
 
* (e) How long is the sequence in the <tt>ref.fasta</tt> file?
 
  
==Task B: assemble the sequence from the reads==
+
texts = [
* We will pretend that the correct answer (ref.fasta) is not known and we will try to assemble it from the reads
+
"Ema ma mamu.",
* We will assemble Illumina reads by program SPAdes and nanopore reads by miniasm
+
"Zirafa sa vo vani kupe a hneva sa."
* Assembly takes several minutes, we will run it in the background using screen command
+
]
  
SPAdes
+
t = vec.fit_transform(texts).todense()
* Run <tt>screen -S spades</tt>
 
* Press Enter to get command-line, then run the follwing command:<br>
 
:: <tt>spades.py -t 1 -m 1 --pe1-1 miseq_R1.fastq.gz --pe1-2 miseq_R2.fastq.gz -o spades > spades.log</tt>
 
* Press Ctrl-a followed by d
 
* This will take you out of screen command
 
* Run top command to check that your command is running
 
  
Minimap
+
print(t)
* Create file <tt>minimap.sh</tt> containing the following commands:
+
# prints:
<pre>
+
# [[1 0 0 1 1 0 0 0 0]
# Find alignments between pairs of reads
+
# [0 1 1 0 0 2 1 1 1]]
minimap2 -x ava-ont -t 1 nanopore.fasta nanopore.fasta | gzip -1 > nanopore.paf.gz
 
# Use overlaps to compute assembled genome
 
miniasm -f nanopore.fasta nanopore.paf.gz > miniasm.gfa 2> miniasm.log
 
# Convert genome to fasta format
 
perl -lane 'print ">$F[1]\n$F[2]" if $F[0] eq "S"' miniasm.gfa > miniasm.fasta
 
# Align reads to the assembled genome
 
minimap2 -x map-ont --secondary=no -t 1 miniasm.fasta nanopore.fasta | gzip -1 > miniasm.paf.gz
 
# Polish the genome by finding consensus of aligned reads at each position
 
racon -t 1 -u nanopore.fasta miniasm.paf.gz miniasm.fasta > miniasm2.fasta
 
</pre>
 
* Run <tt>screen -S minimap</tt>
 
* In screen, run <tt>source ./minimap.sh</tt>
 
* Press Ctrl-a d to exit screen
 
  
 +
print(vec.vocabulary_)
 +
# prints:
 +
# {'vani': 6, 'ema': 0, 'kupe': 2, 'mamu': 4,
 +
# 'hneva': 1, 'sa': 5, 'ma': 3, 'vo': 7, 'zirafa': 8}
 +
</syntaxhighlight>
  
To check if your commands have finished:
+
==NumPy arrays==
* Re-enter the screen environment using <tt>screen -r spades</tt> or <tt>screen -r miniasm</tt>
 
* If the command finished, terminate screen by pressing Ctrl-d or typing exit
 
 
 
Examine the outputs, write '''commands''' and '''answers''' to your '''protocol''':
 
* Copy output of spades under a new filename: <tt>cp -ip spades/contigs.fasta spades.fasta</tt>
 
* Output of miniasm should be in <tt>miniasm2.fasta</tt>
 
* (a) How many contigs are in each of these two files?
 
* (b) What can you find out from the names of contigs in <tt>spades.fasta</tt>? What is the length of the shortest and longest contigs? Cov in the names is abbreviation of read coverage - the average number of reads covering a position on the contig. Do the reads have similar coverage, or are there big differences?
 
** Use command <tt>grep '>' spades.fasta</tt>
 
* (c) What are the lengths of contigs in <tt>miniasm2.fa</tt> file? (you can use <tt>LN:i:</tt> in the name of contigs)
 
'''Submit files''' <tt>miniasm2.fasta</tt> and <tt>spades.fasta</tt>
 
  
==Task C: compare assemblies using Quast command==
+
Array <tt>t</tt> in the example above is a NumPy array provided by the [https://numpy.org/ NumPy library]. This library has also lots of nice tricks. First let us create two matrices:
* We have found basic characteristics of the two assemblies in task B
 
* Now we will use program Quast to compare both assemblies to the correct answer in <tt>ref.fa</tt>
 
 
<pre>
 
<pre>
quast.py -R ref.fasta miniasm2.fasta spades.fasta -o stats
+
>>> import numpy as np
 +
>>> a = np.array([[1,2,3],[4,5,6]])
 +
>>> b = np.array([[7,8],[9,10],[11,12]])
 +
>>> a
 +
array([[1, 2, 3],
 +
      [4, 5, 6]])
 +
>>> b
 +
array([[ 7,  8],
 +
      [ 9, 10],
 +
      [11, 12]])
 
</pre>
 
</pre>
* Submit file <tt>stats/report.txt</tt>
 
 
Look at the results in <tt>stats/report.txt</tt> and '''answer''' the following questions in your '''protocol''':
 
* (a) How many contigs quast reported in the two assemblies? Does it agree with your counts in part B?
 
* (b) What is the number of mismatches per 100kb in the two assemblies? Which one is better? Why do you think it is so? (look at the properties of used sequencing technologies in the [[#L05bin#Overview_of_DNA_sequencing_and_assembly|lecture]])
 
* (c) What portion of the reference sequence is covered by the two assemblies (genome fraction)? Which assembly is better in this aspect?
 
* (d) What is the length of the longest alignment between contigs and the reference in the two assemblies? Which assembly is better in this aspect?
 
  
==Task D: create dotplots of assemblies==
+
We can sum these matrices or multiply them by some number:
* We will now visualize alignments between each assembly and the reference genome using dotplots
 
** As in other tasks, write '''commands''' and '''answers''' to your '''protocol'''
 
 
 
* (a) Create dotplot comparing miniasm assembly to the reference sequence
 
 
<pre>
 
<pre>
# alignments
+
>>> 3 * a
minimap2 -x asm10 -t 1 ref.fasta miniasm2.fasta > ref-miniasm2.paf
+
array([[ 3,  6,  9],
# creating dotplot
+
      [12, 15, 18]])
/usr/local/share/miniasm/miniasm/minidot -f 12 ref-miniasm2.paf | ps2pdf -dEPSCrop - ref-miniasm2.pdf
+
>>> a + 3 * a
# displaying dotplot - if this does not work, copy the pdf file to your commputer and view there
+
array([[ 4,  8, 12],
evince ref-miniasm2.pdf &
+
      [16, 20, 24]])
 +
</pre>
 +
 
 +
We can calculate sum of elements in each matrix, or sum by some axis:
 +
<pre>
 +
>>> np.sum(a)
 +
21
 +
>>> np.sum(a, axis=1)
 +
array([ 6, 15])
 +
>>> np.sum(a, axis=0)
 +
array([5, 7, 9])
 
</pre>
 
</pre>
** x-axis is reference, y-axis assembly
 
** Which part of the reference is missing in the assembly?
 
** Do you see any other big differences between the assembly and the reference?
 
  
* (b) Use analogous commands to create dotplot for spades assembly, call it <tt>ref-spades.pdf</tt>  
+
There are many other useful functions, check [https://docs.scipy.org/doc/numpy-dev/user/quickstart.html the documentation].
** What are vertical gray lines in the dotplot?
+
==HWflask==
** Is any contig aligning to multiple places in the reference? To how many places?
+
<!-- NOTEX -->
 +
See [[#Lflask|the lecture]]
 +
<!-- /NOTEX -->
  
* (c) Use analogous commands to create dotplot of reference to itself, call it <tt>ref-ref.pdf</tt>
+
'''General goal:'''
** However, in the minimap2 command add option <tt>-p 0</tt> to include also weaker self-alignments
+
Build a simple website, which lists all crawled users and for each users has a page with simple statistics regarding the posts of this user.
** Do you see any self-alignments, showing repeated sequences in the reference? Does this agree with dotplot in part (b)?
 
  
* '''Submit'''  all three pdf files ref-miniasm2.pdf, ref-spades.pdf, ref-ref.pdf
 
  
==Task E: Align reads and assemblies to reference, visualize in igv==
+
<!-- NOTEX -->
 +
Submit your source code (web server and preprocessing scripts) and database files. Copy these files to <tt>/submit/flask/username/</tt>
 +
<!-- /NOTEX -->
  
* Finally, we will align all source reads as well as assemblies to the reference genome, then visualize alignment in igv tool
+
<!-- NOTEX -->
** Write '''commands''' and '''answers''' to your '''protocol'''
+
This lesson requires crawled data from previous lesson. If you don't have one, you can find it at <tt>/tasks/flask/db.sqlite3</tt>
** '''Submit''' all four bam files <tt>ref-miseq.bam</tt>, <tt>ref-nanopore.bam</tt>, <tt>ref-spades.bam</tt>, <tt>ref-miniasm2.bam</tt>
+
<!-- /NOTEX -->
  
* (a) Align illumina reads (miseq files) to reference sequence
+
===Task A===
<pre>
 
# align illumina reads to reference
 
# minimap produces sam file, samtools view converts to bam, samtools sort orders by coordinate
 
minimap2 -a -x sr --secondary=no -t 1 ref.fasta  miseq_R1.fastq.gz miseq_R2.fastq.gz | samtools view -S -b - |  samtools sort - ref-miseq
 
# index bam file for faster access
 
samtools index ref-miseq.bam
 
</pre>
 
* (b) Similarly align nanopore reads. but instead of -x sr use -x map-ont, call the result ref-nanopore.bam, ref-nanopore.bam.bai
 
* (c) Similarly align spades.fasta, but instead of -x sr use -x asm10, call the result ref-spades.bam
 
* (d) Similarly align miniasm2.fasta, but instead of -x sr use -x asm10, call the result ref-miniasm2.bam
 
* (e) Run igv viewer. '''Beware: It needs a lot of memory, do not keep open unnecessarily'''
 
** <tt>igv -g ref.fasta &</tt>
 
** Using Menu->File->Load from File open all bam four files
 
** Look at region ecoli-frag:224,000-244,000
 
*** How many spades contigs do you see aligning in this region?
 
** Look at region ecoli-frag:227,300-227,600
 
*** Try to comment what you see, how frequent are errors in individual assemblies and read sets?
 
* If you are unable to run igv from home, you can install it on your computer [https://software.broadinstitute.org/software/igv/] and download ref.fasta and all bam and .bam.bai files
 
=L06inf=
 
  
 +
Create a simple Flask web application which:
 +
* Has a homepage with a list of all users (with links to their pages).
 +
* Has a page for each user with basic information: the nickname, the number of posts and the last 10 posts of this user.
  
 +
===Task B===
  
[[#HW06inf]]
+
Make a separate script which computes and stores in the database the floowing information for each user:
 +
* the list of 10 most frequently used words
 +
* the list of top 10 words typical for this user (words which this user uses much more often than other users). Come up with some simple heuristics for measuring this.
 +
Show this information on the page of each user.
  
In this lecture we will use Flask and simple text processing utilities from ScikitLearn.
+
Hint: To get the most frequently used words for each user, you can use  
 +
[http://docs.scipy.org/doc/numpy/reference/generated/numpy.argsort.html#numpy.argsort argsort from NumPy.]
  
==Flask==
+
===Task C===
  
Flask is simple web server for python (http://flask.pocoo.org/docs/1.0/quickstart/)
+
Preprocess and store the list of top three similar users for each user (try to come up with some simple definition of similarity based on the text in the posts). Again show this information on the user page.
You can find sample flask application at /tasks/hw06/simple_flask.
+
 
 +
'''Bonus:''' Try to use some simple topic modeling (e.g. PCA as in TruncatedSVD from scikit-learn) and use it for finding similar users.
 +
 
 +
=Ljavascript=
 +
<!-- NOTEX -->
 +
[[#HWjavascript]]
 +
<!-- /NOTEX -->
  
You can run it using these commands:
 
<pre>
 
cd <your directory>
 
export FLASK_APP=main.py
 
export FLASK_ENV=development (this is optional, but recommended for debugging)
 
flask run --host=0.0.0.0 --port=4247 (on vyuka server this starts python2.7, if you want python3 use flask3 run, but that is only on vyuka, on your own computer use virtualenv)
 
</pre>
 
  
Before running change the port number.
+
In this lecture we will extend the website from the previous lecture with interactive visualizations written in JavaScript. We will not cover details of the JavaScript programming language, only use visualization contained in the Google Charts library.
You can then access your app at vyuka.compbio.fmph.uniba.sk:4247 (change port number).
 
  
There may be problem with access to strange port numbers due to firewalling rules. There are at least two ways to circumvent this:
+
Your goal is to take examples from [https://developers.google.com/chart/interactive/docs/ the documentation] and tweak them for your purposes.
* Use X forwarding and run web browser directly from vyuka
 
local_machine> ssh vyuka.compbio.fmph.uniba.sk -XC
 
vyuka> chromium-browser
 
* Create SOCKS proxy to vyuka.compbio.fmph.uniba.sk and set SOCKS proxy at that port on your local machine. Then all web traffic goes through vyuka.compbio.fmph.uniba.sk via ssh tunnel. To create SOCKS proxy server on local machine port 8000 to vyuka.compbio.fmph.uniba.sk:
 
local_machine> ssh vyuka.compbio.fmph.uniba.sk -D 8000
 
(keep ssh session open while working)
 
  
Flask uses jinja2 (http://jinja.pocoo.org/docs/dev/templates/) templating language for showing html (you can use strings in python but it is painful).
+
'''Tips:'''
 +
* Each graph contains also HTML+JS code example. That is a good startpoint.
 +
* You can write your data into JavaScript data structures (`var data` from examples) in a Flask template. You might need a jinja for loop (https://jinja.palletsprojects.com/en/2.11.x/templates/#for). Or you can produce string in Python, which you will put into a HTML. It is a (very) bad practice, but sufficient for this lecture. (A better way is to load data in JSON format through API).
 +
* Consult the [[#Lflask|previous lecture]] on running and accessing Flask applications.
 +
==HWjavascript==
 +
<!-- NOTEX -->
 +
See [[#Ljavascript|the lecture]]
 +
<!-- /NOTEX -->
  
==Processing text==
+
'''General goal:''' Extend the user pages from the previous lecture with simple visualizations.
  
Main tool for processing text is CountVectorizer class from ScikitLearn
+
<!-- NOTEX -->
(http://scikit--learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html).
+
Submit your source code to <tt>/submit/javascript/username/</tt>
It transforms text into bag of words (for each word we get counts). Example:
+
<!-- /NOTEX -->
  
<pre>
+
===Task A===
from sklearn.feature_extraction.text import CountVectorizer
 
  
vec = CountVectorizer(strip_accents='unicode')
+
Display a calendar, which shows during which days was the user active. Use [https://developers.google.com/chart/interactive/docs/gallery/calendar calendar from Google Charts].
  
texts = [
+
===Task B===
"Ema ma mamu.",
 
"Zirafa sa vo vani kupe a hneva sa."
 
]
 
  
t = vec.fit_transform(texts).todense()
+
Show a histogram of comment lengths. Use [https://developers.google.com/chart/interactive/docs/gallery/histogram histogram from Google Charts].
  
print(t)
+
===Task C===
  
print(vec.vocabulary_)
+
Either: Show a word tree for a user using [https://developers.google.com/chart/interactive/docs/gallery/wordtree word tree from Google Charts]. Try to normalize the text before building the tree (convert to lowercase, remove accents). <tt>CountVectorizer</tt> has <tt>build_analyzer</tt> method, which returns a function, which does this for you.
</pre>
 
  
==Useful things==
+
Or: Pick some other appropriate visualization from [https://developers.google.com/chart/interactive/docs/gallery the gallery], feed it with data a show it.
 +
Also add some description to it.
  
We are working with numpy arrays here (that's array t in example above)
+
=Lbioinf1=
Numpy arrays has also lots of nice tricks.
+
<!-- NOTEX -->
First lets create two matrices:
+
[[#HWbioinf1]]
<pre>
+
<!-- /NOTEX -->
>>> import numpy as np
+
 
>>> a = np.array([[1,2,3],[4,5,6]])
+
The next three lectures at targeted at the students in the Bioinformatics program and the goal is to get experience with several common bioinformatics tools. Students will learn more about the algorithms and models behind these tools in the [http://compbio.fmph.uniba.sk/vyuka/mbi/index.php/%C3%9Avod Methods in bioinformatics] course.
>>> b = np.array([[7,8],[9,10],[11,12]])
+
 
>>> a
+
==Overview of DNA sequencing and assembly==
array([[1, 2, 3],
+
'''DNA sequencing''' is a technology of reading the order of nucleotides along a DNA strand
      [4, 5, 6]])
+
* The result is represented as a string of A,C,G,T
>>> b
+
* Only fragments of DNA of limited length can be read, these are called '''sequencing reads'''
array([[ 7, 8],
+
* Different technologies produce reads of different characteristics
      [ 9, 10],
+
* Examples:
      [11, 12]])
+
** '''Illumina sequencers''' produce short reads (typical length 100-200bp), but in great quantities and very low error rate (<0.1%)  
</pre>
+
** Illumina reads usually come in '''pairs''' sequenced from both ends of a DNA fragment of an approximately known length
 +
** '''Oxford nanopore sequencers''' produce longer reads (thousands of base pairs or more), but the error rates are higher (10-15%)  
  
We can sum this matrices or multiply them by some number:
 
<pre>
 
>>> 3 * a
 
array([[ 3,  6,  9],
 
      [12, 15, 18]])
 
>>> a + 3 * a
 
array([[ 4,  8, 12],
 
      [16, 20, 24]])
 
</pre>
 
  
We can calculate sum of elements in each matrix, or sum by some axis:
+
The goal of '''genome sequencing''' is to read all chromosomes of an organism
<pre>
+
* Sequencing machines produce many reads coming from different parts of the genome
>>> np.sum(a)
+
* Using software tools called '''sequence assemblers''', these reads are glued together based on overlaps
21
+
* Ideally we would get the true chromosomes, but often we get only shorter fragments called '''contigs'''
>>> np.sum(a, axis=1)
+
* The results of assembly can contain errors
array([ 6, 15])
+
* We prefer longer contigs with fewer errors
>>> np.sum(a, axis=0)
 
array([5, 7, 9])
 
</pre>
 
  
There is a lot other useful functions check https://docs.scipy.org/doc/numpy-dev/user/quickstart.html.
+
==Sequence alignments and dotplots==
 +
<!-- NOTEX -->
 +
A short video for this section: [https://youtu.be/qANrSl5w4t8]
 +
<!-- /NOTEX -->
 +
* '''Sequence alignment''' is the task of finding similarities between DNA (or protein) sequences
 +
* Here is an example - short similarity between region at positions 344,447..344,517 of one sequence and positions 3,261..3,327 of another sequence
 +
<pre>
 +
Query: 344447 tctccgacggtgatggcgttgtgcgtcctctatttcttttatttctttttgttttatttc 344506
 +
              |||||||| |||||||||||||||||| ||||||| |||||||||||| ||  ||||||
 +
Sbjct: 3261  tctccgacagtgatggcgttgtgcgtc-tctatttattttatttctttgtg---tatttc 3316
  
This can help you get top words for each user:
+
Query: 344507 tctgactaccg 344517
http://docs.scipy.org/doc/numpy/reference/generated/numpy.argsort.html#numpy.argsort
+
              |||||||||||
=HW06inf=
+
Sbjct: 3317  tctgactaccg 3327
 +
</pre>
  
 +
* Alignments can be stored in many formats and visualized as dotplots
 +
* In a '''dotplot''', the x-axis correspond to positions in one sequence and the y-axis in another sequence
 +
* Diagonal lines show alignments between the sequences (direction of the diagonal shows which DNA strand was aligned)
  
[[#L06inf|Lecture 6 inf]]
+
[[Image:Dotplot-mt-human-dros.png|center|thumb|250px|Dotplot of human and ''Drosophila'' mitochondrial genomes]]
  
* Submit by copying requested files to /submit/hw06inf/username/
+
==File formats==
  
General goal:
+
===FASTA===
Build a simple website, which lists all crawled users and for each users has a page with simple statistics for given user.
+
* FASTA is a format for storing DNA, RNA and protein sequences
 +
* We have already seen FASTA files in [[#HWperl|Perl exercises]]
 +
* Each sequence is given on several lines of the file. The first line starts with ">" followed by an identifier of the sequence and optionally some further description separated by whitespace
 +
* The sequence itself is on the second line; long sequences are split into multiple lines
 +
<pre>
 +
>SRR022868.1845_1
 +
AAATTTAGGAAAAGATGATTTAGCAACATTTAGCCTTAATGAAAGACCAGATTCTGTTGCCATGTTTGAA...
 +
>SRR022868.1846_1
 +
TAGCGTTGTAAAATAAATTTCTAGAATGGAAGTGATGATATTGAAATACACTCAGATCCTGAATGAAAGA...
 +
</pre>
  
This lesson requires crawled data from previous lesson, if you don't have one, you can find it at (and thank Baska):  
+
===FASTQ===
/tasks/hw06inf/db.sqlite3
+
* FASTQ is a format for storing sequencing reads, containing DNA sequences but also quality information about each nucleotide
 +
* More in the [[#Lperl#The_second_input_file_for_today:_DNA_sequencing_reads_.28fastq.29|lecture on Perl]]
  
Submit source code (web server and preprocessing scripts) and database files.
+
===SAM/BAM===
 +
* SAM and BAM are formats for storing alignments of sequencing reads (or other sequences) to a genome
 +
* For each read, the file contains the read itself, its quality, but also the chromosome/contig name and position where this read is likely coming from, and an additional information e.g. about mapping quality (confidence in the correct location)
 +
* SAM files are text-based, thus easier to check manually; BAM files are binary and compressed, thus smaller and faster to read
 +
* We can easily convert between SAM and BAM using [https://github.com/samtools/samtools samtools]
 +
* [https://samtools.github.io/hts-specs/SAMv1.pdf Full documentation of the format]
  
==Task A==
+
===PAF format===
 +
* PAF is another format for storing alignments used in the minimap2 tool
 +
* [https://github.com/lh3/miniasm/blob/master/PAF.md Full documentation of the format]
  
Create a simple flask web application which:
+
===Gzip===
* Has a homepage where is a list of all users (with links to their pages).
+
* Gzip is a general-purpose tool for file compression
* Has a page for each user, which has simple information about user: His nickname, number of posts and hist last 10 posts.
+
* It is often used in bioinformatics on large FASTQ or FASTA files
 
+
* Running command <tt>gzip filename.ext</tt> will create compressed file <tt>filename.ext.gz</tt> (original file will be deleted).  
==Task B==
+
* The reverse process is done by <tt>gunzip filename.ext.gz</tt> (this deletes the gziped file and creates the uncompressed version)
 +
* However, we can access the file without uncompressing it. Command <tt>zcat filename.ext.gz</tt> prints the content of a gzipped file and keeps the gzipped file as is. We can use pipes <tt>|</tt> to do further processing on the file.
 +
* To manually page through the content of a gzipped file use <tt>zless filename.ext.gz</tt>
 +
* Some bioinformatics tools can work directly with gzipped files.
 +
==HWbioinf1==
 +
<!-- NOTEX -->
 +
See also the [[#Lbioinf1|lecture]]
  
For each user preprocess and store list of his top 10 words and list of top 10 words typical for him (which he uses much more often than other users, come up with some simple heuristics).
+
Submit the protocol and the required files to <tt>/submit/bioinf1</tt>
Show this information on his page.
+
<!-- /NOTEX -->
  
==Task C==
+
<!-- NOTEX -->
 +
===Technical notes===
 +
* Task D and task E ask you too look at data visualizations
 +
* If you are unable to open graphical applications from our server, you can download appropriate files and view them on your computer (in task D these are simply pdf files, in task E you would have to install IGV software on your computer)
 +
<!-- /NOTEX -->
  
Preprocess and store list of top three similar users for each user (try to come up with some simple definition of similarity based on text in posts). Again show this information on user page.
+
===Task A: examine input files===
 +
Copy files from <tt>/tasks/bioinf1/</tt> as follows:
 +
<syntaxhighlight lang="bash">
 +
mkdir bioinf1
 +
cd bioinf1
 +
cp -iv /tasks/bioinf1/* .
 +
</syntaxhighlight>
 +
* <tt>ref.fasta</tt> is a piece of genome from ''Escherichia coli''
 +
* <tt>miseq_R1.fastq.gz</tt> and <tt>miseq_R2.fastq.gz</tt> are sequencing reads from Illumina MiSeq sequencer. First reads in pairs are in the R1 file, second reads in the R2 file. These reads come from the region in <tt>ref.fasta</tt>
 +
* <tt>nanopore.fasta</tt> are nanopore sequencing reads in FASTA format (without qualities). These reads are also from the region in <tt>ref.fasta</tt>
  
Bonus:
+
Try to find the answers to the following questions using command-line tools.  
Try to use some simple topic modeling (e.g. PCA as in TruncatedSVD from scikit-learn) and use it for finding similar users.
+
<!-- NOTEX -->
=L06bin=
+
In your '''protocol''', note down the '''commands''' as well as the '''answers'''.
* [[#HW06bin]]
+
<!-- /NOTEX -->
==Eukaryotic gene structure==
 
* Recall the Central dogma of molecular biology: the flow of genetic information from DNA to RNA to protein (gene expression)
 
* In eukaryotes, mRNA often undergoes splicing, where introns are removed and exons are joined together
 
* The very start and end of mRNA remain untranslated (UTR = untranslated region)
 
* The coding part of the gene starts with a start codon, contains a sequence of additional codons and ends with a stop codon. Codons can be interrupted by  introns.
 
  
[[Image:Dogma.png|center|thumb|450px|Gene expression in eukaryotes]]
+
(a) How many reads are in the MiSeq files? Is the number of reads the same in both files?
 +
* Try command <tt>zcat miseq_R1.fastq.gz | wc -l </tt>
 +
* Can you figure out the answer from the result of this command?
  
==Computational gene finding==
+
(b) How long are individual reads in the MiSeq files?
* Input: DNA sequence (an assembled genome or a part of it)
+
* Look at the file using <tt>zless</tt> - do all reads appear to be of an equal length?
* Output: positions of protein coding genes and their exons
+
* Extend the following command with <tt>tail</tt> and <tt>wc -c</tt> to get the length of the first read: <tt>zcat miseq_R1.fastq.gz | head -n 2</tt>
* If we know the exact position of coding regions of a gene, we can use genetic code to predict the protein sequence encoded by it
+
* Do not forget to consider the end of the line character
* Gene finders use statistical features observed from known genes, such as typical sequence motifs near the start codons, stop codons and splice sites, typical codon frequences, typical exon and intron lengths etc.
+
* Repeat for both MiSeq files
* These statistical parameters need to be adjusted for each genome.
 
* We will use a gene finder called [http://bioinf.uni-greifswald.de/augustus/ Augustus]
 
  
==Gene expression==
+
(c) How many reads are in the nanopore file (beware - different format)
* Not all genes undergo transcription and translation all the time and at the same level
 
* The processes of transcription and translation are regulated according to cell needs
 
* The term "gene expression" has two meanings
 
** the process of transcription and translation (synthesis of a gene product)
 
** the amount of mRNA or protein produced from a single gene (genes with high or low expression)
 
  
* RNA-seq technology can sequence mRNA extracted from a sample of cells
+
(d) What is the average length of the reads in the nanopore file?
* We can aligned sequenced reads back to the genome
+
* Try command: <tt>samtools faidx nanopore.fasta</tt>
* The number of reads coming from a gene depends on its expression level (and on its length)
+
* This creates <tt>nanopore.fasta.fai</tt> file, where the second column contains sequence length of each read
=HW06bin=
+
* Compute the average of this column by a one-liner: <tt>perl -lane '$s+=$F[1]; $n++; END { print $s/$n }' nanopore.fasta.fai</tt>
[[#L06bin]]
 
  
==Input files, submitting==
+
(e) How long is the sequence in the <tt>ref.fasta</tt> file?
Copy files from /tasks/hw06bin/
 
<pre>
 
mkdir hw06
 
cd hw06
 
cp -iv /tasks/hw06bin/* .
 
</pre>
 
  
Files:
+
===Task B: assemble the sequence from the reads===
* ref.fasta is a 38kb piece of genome of the fungus [https://www.ncbi.nlm.nih.gov/genome?term=aspergillus%20fumigatus Aspergillus nidulans]
+
* We will pretend that the correct answer (<tt>ref.fasta</tt>) is not known and we will try to assemble it from the reads
* rnaseq.fastq are RNA-seq reads from Illumina sequencer extracted from the [https://www.ncbi.nlm.nih.gov/sra/?term=SRR4048918 Short read archive]
+
* We will assemble Illumina reads by program [http://cab.spbu.ru/software/spades/ SPAdes] and nanopore reads by [https://github.com/lh3/miniasm miniasm]
* annot.gff is the reference gene annotation from the database (we will consider his as correct gene positions)
+
* Assembly takes several minutes, we will run it in the background using <tt>screen</tt> command
  
Submit the protocol and the required files to /submit/hw06bin
+
SPAdes
 +
* Run <tt>screen -S spades</tt>
 +
* Press Enter to get command-line, then run the following command:<br>
 +
:: <tt>spades.py -t 1 -m 1 --pe1-1 miseq_R1.fastq.gz --pe1-2 miseq_R2.fastq.gz -o spades > spades.log</tt>
 +
* Press <tt>Ctrl-a</tt> followed by <tt>d</tt>
 +
* This will take you out of <tt>screen</tt> command
 +
* Run <tt>top</tt> command to check that your command is running
  
==Task A: Gene finding==
+
Miniasm
 
+
* Create file <tt>miniasm.sh</tt> containing the following commands:
Run the Augustus gene finder with two versions of parameters:
+
<syntaxhighlight lang="bash">
* one trained specifically for A. nidulas genes
+
# Find alignments between pairs of reads
* one trained for the human genome, where genes have different statistical properties (for example, they are longer and have more introns)
+
minimap2 -x ava-ont -t 1 nanopore.fasta nanopore.fasta | gzip -1 > nanopore.paf.gz
<pre>
+
# Use overlaps to compute the assembled genome
augustus --species=anidulans ref.fasta > augustus-anidulans.gtf
+
miniasm -f nanopore.fasta nanopore.paf.gz > miniasm.gfa 2> miniasm.log
augustus --species=human ref.fasta > augustus-human.gtf
+
# Convert genome to fasta format
</pre>
+
perl -lane 'print ">$F[1]\n$F[2]" if $F[0] eq "S"' miniasm.gfa > miniasm.fasta
 +
# Align reads to the assembled genome
 +
minimap2 -x map-ont --secondary=no -t 1 miniasm.fasta nanopore.fasta | gzip -1 > miniasm.paf.gz
 +
# Polish the genome by finding consensus of aligned reads at each position
 +
racon -t 1 -u nanopore.fasta miniasm.paf.gz miniasm.fasta > miniasm2.fasta
 +
</syntaxhighlight>
 +
* Run <tt>screen -S miniasm</tt>
 +
* In screen, run <tt>source ./miniasm.sh</tt>
 +
* Press <tt>Ctrl-a d</tt> to exit <tt>screen</tt>
  
* The results of gene finding are in the [http://mblab.wustl.edu/GTF22.html GTF format]. Rows starting with # are comments, each of the remaining rows describes some interval of the sequence. If the second column is CDS, it is a coding part of an exon.
 
* The reference annotation annot.gff is in a similar format called [http://gmod.org/wiki/GFF3 GFF3].
 
  
Examine the files and try to find the answers to the following questions using command-line tools
+
To check if your commands have finished:
* (a) How many exons are in each of the two gtf files? (Beware: simply using grep with pattern CDS may yield lines containing this string in a different column. You can use e.g. techniques from [[#L02]] and [[#HW02]]).
+
* Re-enter the screen environment using <tt>screen -r spades</tt> or <tt>screen -r miniasm</tt>
* (b) How many genes are in each of the two gtf files? (The files contain rows with word gene in the second column, one for each gene)
+
* If the command finished, terminate <tt>screen</tt> by pressing <tt>Ctrl-d</tt> or typing <tt>exit</tt>
* (c) How many exons and genes are in the annot.gff file?
 
  
Write the anwsers and commands to the '''protocol'''. '''Submit''' files augustus-anidulans.gtf and augustus-human.gtf.
+
Examine the outputs.
 +
<!-- NOTEX -->
 +
Write '''commands''' and '''answers''' to your '''protocol'''.
 +
<!-- /NOTEX -->
 +
* Copy output of SPAdes under a new filename: <tt>cp -ip spades/contigs.fasta spades.fasta</tt>
 +
* Output of miniasm should be in <tt>miniasm2.fasta</tt>
  
==Task B: Aligning RNA-seq reads==
+
(a) How many contigs are in each of these two files?
  
* Align RNA-seq reads to the genome
+
(b) What can you find out from the names of contigs in <tt>spades.fasta</tt>? What is the length of the shortest and longest contigs? String <tt>cov</tt> in the names is abbreviation of read coverage - the average number of reads covering a position on the contig. Do the reads have similar coverage, or are there big differences?
* We will use a specialized tool tophat, which can recognize introns
+
* Use command <tt>grep '>' spades.fasta</tt>
* The we will sort and index bam the file, similarly as in [[#HW05bin]]
 
  
<pre>
+
(c) What are the lengths of contigs in <tt>miniasm2.fa</tt> file? (you can use <tt>LN:i:</tt> in the name of contigs)
bowtie2-build ref.fasta ref.fasta
 
tophat2 -i 10 -I 10000 --max-multihits 1 --output-dir rnaseq ref.fasta rnaseq.fastq
 
samtools sort rnaseq/accepted_hits.bam rnaseq
 
samtools index rnaseq.bam
 
</pre>
 
  
In addition to the bam file, TopHat produced several other files in the rnaseq folder. Examine them to find out answers to the following questions (you can do it manually by looking at the the files, e.g. by less command):
+
<!-- NOTEX -->
* (a) How many reads were in the the fastq file? How many of them were successfully mapped?
+
'''Submit''' files <tt>miniasm2.fasta</tt> and <tt>spades.fasta</tt>
* (b) How many introns ("junctions") were predicted? How many of them are supported by more than one read? (The 5th column of the corresponding file is the number of reads supporting a junction.)
+
<!-- /NOTEX -->
  
Write anwsers to the '''protocol'''. '''Submit''' file rnaseq.bam.
+
===Task C: compare assemblies using Quast command===
  
==Task C: Visualizing in igv==
+
We have found basic characteristics of the two assemblies in task B. Now we will use program [http://quast.sourceforge.net/quast Quast] to compare both assemblies to the correct answer in <tt>ref.fa</tt>
 +
<syntaxhighlight lang="bash">
 +
quast.py -R ref.fasta miniasm2.fasta spades.fasta -o stats
 +
</syntaxhighlight>
  
As before, run igv as follows:
+
<!-- NOTEX -->
<pre>
+
'''Submit''' file <tt>stats/report.txt</tt>.
igv -g ref.fasta &
+
<!-- /NOTEX -->
</pre>
 
  
* Open additional files using menu File -> Load from File
+
Look at the results in <tt>stats/report.txt</tt> and '''answer''' the following questions.
** annot.gff, augustus-anidulans.gtf, augustus-human.gtf, rnaseq.bam
 
* Exons are shown as thicker boxes, introns are thinner.
 
* For each of the following questions, select part of the sequence illustrating the answer and export figure using File->Save image
 
* You can check these images using command eog
 
  
Questions:
+
(a) How many contigs has quast reported in the two assemblies? Does it agree with your counts in part B?
* (a) Create image illustrating differences between Augustus with human parameters and the reference annotation, save as a.png. Briefly describe the differences in words.
 
* (b) Find some differences between Augustus with A.nidulans parameters and the reference annotation. Store an illustrative figure as b.png. Which parameters have yielded a more accurate prediction?
 
* (c) Zoom in to one of the genes with high expression level and try to find spliced read alignments supporting the annotated intron boundaries. Store the image as c.png.
 
  
'''Submit''' files a.png, b.png, c.png. Write answers to your '''protocol'''.
+
(b) What is the number of mismatches per 100kb in the two assemblies? Which one is better? Why do you think it is so? (look at the properties of used sequencing technologies in the [[#Lbioinf1#Overview_of_DNA_sequencing_and_assembly|lecture]])
=L07inf=
 
[[#HW07inf]]
 
  
In this lesson we make simple javascript visualizations.
+
(c) What portion of the reference sequence is covered by the two assemblies (reported as <tt>genome fraction</tt>)? Which assembly is better in this aspect?
  
Your goal is to take examples from here https://developers.google.com/chart/interactive/docs/
+
(d) What is the length of the longest alignment between contigs and the reference in the two assemblies? Which assembly is better in this aspect?
and tweak them for your purposes.
 
  
Tips:
+
===Task D: create dotplots of assemblies===
* You can output your data into javascript data structures in Flask template. It is a bad practice, but sufficient for this lesson. (Better way is to load JSON through API).
 
* Remember that you have to bypass the firewall.
 
=HW07inf=
 
[[#L07inf]]
 
  
* Submit by copying requested files to /submit/hw07inf/username/
+
We will now visualize alignments between each assembly and the reference genome using dotplots.
 +
<!-- NOTEX -->
 +
As in other tasks, write '''commands''' and '''answers''' to your '''protocol'''.
 +
<!-- /NOTEX -->
  
General goal:
+
(a) Create a dotplot comparing miniasm assembly to the reference sequence
Extend user pages from previous project with simple visualizations.
+
<syntaxhighlight lang="bash">
 +
# alignments
 +
minimap2 -x asm10 -t 1 ref.fasta miniasm2.fasta > ref-miniasm2.paf
 +
# creating dotplot
 +
/usr/local/share/miniasm/miniasm/minidot -f 12 ref-miniasm2.paf | \
 +
  ps2pdf -dEPSCrop - ref-miniasm2.pdf
 +
# displaying dotplot
 +
# if evince does not work, copy the pdf file to your commputer and view there
 +
evince ref-miniasm2.pdf &
 +
</syntaxhighlight>
 +
* x-axis is reference, y-axis assembly
 +
* Which part of the reference is missing in the assembly?
 +
* Do you see any other big differences between the assembly and the reference?
  
==Task A==
+
(b) Use analogous commands to create a dotplot for spades assembly, call it <tt>ref-spades.pdf</tt>
 +
* What are vertical gray lines in the dotplot?
 +
* Is any contig aligning to multiple places in the reference? To how many places?
  
Show a calendar, which shows during which days was user active (like this https://developers.google.com/chart/interactive/docs/gallery/calendar#overview).
+
(c) Use analogous commands to create a dotplot of reference to itself, call it <tt>ref-ref.pdf</tt>
 +
* However, in the minimap2 command add option <tt>-p 0</tt> to include also weaker self-alignments
 +
* Do you see any self-alignments, showing repeated sequences in the reference? Does this agree with the dotplot in part (b)?
  
==Task B==
+
<!-- NOTEX -->
 +
'''Submit'''  all three pdf files <tt>ref-miniasm2.pdf</tt>, <tt>ref-spades.pdf</tt>, <tt>ref-ref.pdf</tt>
 +
<!-- /NOTEX -->
  
Show a histogram of comments length (like this https://developers.google.com/chart/interactive/docs/gallery/histogram#example).
+
===Task E: Align reads and assemblies to reference, visualize in IGV===
  
==Task C==
+
Finally, we will align all source reads as well as assemblies to the reference genome, then visualize the alignments in [https://software.broadinstitute.org/software/igv/ IGV tool].
  
Try showing a word tree for a user (https://developers.google.com/chart/interactive/docs/gallery/wordtree#overview). Try to normalize the text (lowercase, remove accents). CountVectorizer has method build_analyzer, which returns a function, which does this for you.
+
<!-- NOTEX -->
=L07bin=
+
A short video introducing IGV: [https://youtu.be/46HhBqGGPU0]
* [[#HW07bin]]
 
  
==Polymorphisms==
+
* Write '''commands''' and '''answers''' to your '''protocol'''
* Individuals within species differ slightly in their genomes
+
* '''Submit''' all four BAM files <tt>ref-miseq.bam</tt>, <tt>ref-nanopore.bam</tt>, <tt>ref-spades.bam</tt>, <tt>ref-miniasm2.bam</tt>
* Polymorphisms are genome variants which are relatively frequent in a population (e.g. at least 1%)
+
<!-- /NOTEX -->
* [https://ghr.nlm.nih.gov/primer/genomicresearch/snp SNP]: single-nucleotide polymorphism (a polymorphism which is a single substitution)
 
* Recall that most human cells are diploid, with one set of chromosomes inherited from the mother and the other from the father
 
* At a particular location, a single human can thus have two different alleles (heterozygosity) or two copies of the same allele (homozygosity)
 
  
==Finding polymorphisms / genome variants==
+
(a) Align illumina reads (MiSeq files) to the reference sequence
* We compare sequencing reads coming from an individual to a reference genome of the species
+
<syntaxhighlight lang="bash">
* First we align them, as in [[#HW05bin]]
+
# align illumina reads to reference
* Then we look for positions where a substantial fraction of reads does not agree with the reference (SNP-calling)
+
# minimap produces SAM file, samtools view converts to BAM,  
 +
# samtools sort orders by coordinate
 +
minimap2 -a -x sr --secondary=no -t 1 ref.fasta  miseq_R1.fastq.gz miseq_R2.fastq.gz | \
 +
  samtools view -S -b - |  samtools sort - ref-miseq
 +
# index BAM file for faster access
 +
samtools index ref-miseq.bam
 +
</syntaxhighlight>
  
==Programs and file formats==
+
(b) Similarly align nanopore reads, but instead of <tt>-x sr</tt> use <tt>-x map-ont</tt>, call the result <tt>ref-nanopore.bam</tt>, <tt>ref-nanopore.bam.bai</tt>
* For mapping, we will use [https://github.com/lh3/bwa bwa mem] (you can also try minimap2, as in [[#HW05bin]])
 
* For SNP calling, we will use [https://github.com/ekg/freebayes freebayes]
 
* For reads and read alignments, we will use fastq and bam files, as in [[#L05bin|previous lectures]]
 
* For storing found variants, we will use [http://www.internationalgenome.org/wiki/Analysis/vcf4.0/ VCF files]
 
* For storing genome intervals, we will use [https://genome.ucsc.edu/FAQ/FAQformat.html#format1 BED files]
 
  
==Human variants==
+
(c) Similarly align <tt>spades.fasta</tt>, but instead of <tt>-x sr</tt> use <tt>-x asm10</tt>, call the result <tt>ref-spades.bam</tt>
* For many human SNPs we already know something about their influence on phenotype and their prevalence in different parts of the world
 
* There are various databases, e.g. [https://www.ncbi.nlm.nih.gov/SNP/ dbSNP], [https://www.omim.org/ OMIM], or user-editable [https://www.snpedia.com/index.php/SNPedia SNPedia]
 
  
==UCSC genome browser==
+
(d) Similarly align <tt>miniasm2.fasta</tt>, but instead of <tt>-x sr</tt> use <tt>-x asm10</tt>, call the result <tt>ref-miniasm2.bam</tt>
* On-line tool similar to IGV
 
* http://genome-euro.ucsc.edu/
 
* Nice interface for browsing genomes, lot of data for some genomes (particularly human), but not all sequenced genomes represented
 
  
====Basics====
+
(e) Run the IGV viewer.
* on the front page, choose Genomes in the top blue menu bar
+
<!-- NOTEX -->
* select a genome and its version, optionally enter position or keyword, press submit
+
'''Beware: It needs a lot of memory, do not keep open on the server unnecessarily'''
* on the browser screen top image shows chromosome map, selected region in red
+
<!-- /NOTEX -->
* below a view of selected region and various track with information about this region
+
* <tt>igv -g ref.fasta &</tt>
* for example some of the top tracks display genes (boxes are exons, lines are introns)
+
* Using <tt>Menu->File->Load from File</tt>, open all four BAM files
* tracks can be switched on and off and configured in the bottom part of the page
+
* Look at region <tt>ecoli-frag:224,000-244,000</tt>
** different display levels, full contains all information but takes a lot of vertical space
+
* How many spades contigs do you see aligning in this region?
* navigation at the top (move, zoom, etc.)
+
* Look at region <tt>ecoli-frag:227,300-227,600</tt>
* various actions in the menu
+
* Comment on what you see. How frequent are errors in the individual assemblies and read sets?
* clicking at the browser figure allows you to get more information about a gene or other displayed item
+
<!-- NOTEX -->
* this week, we will need tracks GENCODE and dbSNP - check e.g. [http://genome-euro.ucsc.edu/cgi-bin/hgTracks?db=hg38&position=chr11%3A66546841-66563329 gene ACTN3] and within it SNP rs1815739 in exon15
+
* If you are unable to run igv from home, you can install it on your computer [https://software.broadinstitute.org/software/igv/] and download <tt>ref.fasta</tt> and all <tt>.bam</tt> and <tt>.bam.bai</tt> files
 +
<!-- /NOTEX -->
  
====Blat====
+
=Lbioinf2=
* UCSC genome browser uses a fast but less sensitive BLAT (good for the same or very closely related species)
+
<!-- NOTEX -->
* Choose Tools->Blat in the top blue menu bar, enter DNA sequence below, search in the human genome
+
[[#HWbioinf2]]
** What is the identity level for the top found match? What is its span in the genome? (Notice that other matches are much shorter)
+
<!-- /NOTEX -->
** Using Details link in the left column you can see the alignment itself, Browser link takes you to the browser at the matching region
 
<pre>
 
AACCATGGGTATATACGACTCACTATAGGGGGATATCAGCTGGGATGGCAAATAATGATTTTATTTTGAC
 
TGATAGTGACCTGTTCGTTGCAACAAATTGATAAGCAATGCTTTCTTATAATGCCAACTTTGTACAAGAA
 
AGTTGGGCAGGTGTGTTTTTTGTCCTTCAGGTAGCCGAAGAGCATCTCCAGGCCCCCCTCCACCAGCTCC
 
GGCAGAGGCTTGGATAAAGGGTTGTGGGAAATGTGGAGCCCTTTGTCCATGGGATTCCAGGCGATCCTCA
 
CCAGTCTACACAGCAGGTGGAGTTCGCTCGGGAGGGTCTGGATGTCATTGTTGTTGAGGTTCAGCAGCTC
 
CAGGCTGGTGACCAGGCAAAGCGACCTCGGGAAGGAGTGGATGTTGTTGCCCTCTGCGATGAAGATCTGC
 
AGGCTGGCCAGGTGCTGGATGCTCTCAGCGATGTTTTCCAGGCGATTCGAGCCCACGTGCAAGAAAATCA
 
GTTCCTTCAGGGAGAACACACACATGGGGATGTGCGCGAAGAAGTTGTTGCTGAGGTTTAGCTTCCTCAG
 
TCTAGAGAGGTCGGCGAAGCATGCAGGGAGCTGGGACAGGCAGTTGTGCGACAAGCTCAGGACCTCCAGC
 
TTTCGGCACAAGCTCAGCTCGGCCGGCACCTCTGTCAGGCAGTTCATGTTGACAAACAGGACCTTGAGGC
 
ACTGTAGGAGGCTCACTTCTCTGGGCAGGCTCTTCAGGCGGTTCCCGCACAAGTTCAGGACCACGATCCG
 
GGTCAGTTTCCCCACCTCGGGGAGGGAGAACCCCGGAGCTGGTTGTGAGACAAATTGAGTTTCTGGACCC
 
CCGAAAAGCCCCCACAAAAAGCCG
 
</pre>
 
=HW07bin=
 
[[#L07bin]]
 
  
==Input files, submitting==
+
==Eukaryotic gene structure==
Copy files from /tasks/hw07bin/
+
* Recall the Central dogma of molecular biology: the flow of genetic information from DNA to RNA to protein (gene expression)
<pre>
+
* In eukaryotes, mRNA often undergoes splicing, where introns are removed and exons are joined together
mkdir hw07
+
* The very start and end of mRNA remain untranslated (UTR = untranslated region)
cd hw07
+
* The coding part of the gene starts with a start codon, contains a sequence of additional codons and ends with a stop codon. Codons can be interrupted by introns.
cp -iv /tasks/hw07bin/* .
 
</pre>
 
  
Files:
+
[[Image:Dogma.png|center|thumb|450px|Gene expression in eukaryotes]]
* humanChr7Region.fasta is a 7kb piece of the human chromosome 7
 
* motherChr7Region.fastq is a sample of reads from an anonymous donor known as NA12878, these reads come from region in humanChr7Region.fasta
 
* fatherChr12.vcf and motherChr12.vcf are single-nucleotide variants in chr12 obtained by sequencing two individuals NA12877, NA12878 (these come from a larger [https://www.coriell.org/0/Sections/Collections/NIGMS/CEPHFamiliesDetail.aspx?PgId=441&fam=1463& family])
 
  
Submit the protocol and the required files to /submit/hw07bin
+
==Computational gene finding==
 +
* Input: DNA sequence (an assembled genome or a part of it)
 +
* Output: positions of protein coding genes and their exons
 +
* If we know the exact position of coding regions of a gene, we can use the genetic code table to predict the protein sequence encoded by it.
 +
* Gene finders use statistical features observed from known genes, such as typical sequence motifs near the start codons, stop codons and splice sites, typical codon frequencies, typical exon and intron lengths etc.
 +
* These statistical parameters need to be adjusted for each genome.
 +
* We will use a gene finder called [http://bioinf.uni-greifswald.de/augustus/ Augustus].
  
==Task A: read mapping and SNP calling==
+
==Gene expression==
 +
* Not all genes undergo transcription and translation all the time and at the same level.
 +
* The processes of transcription and translation are regulated according to cell needs.
 +
* The term "gene expression" has two meanings:
 +
** the process of transcription and translation (synthesis of a gene product),
 +
** the amount of mRNA or protein produced from a single gene (genes with high or low expression).
  
Align reads to reference:
+
RNA-seq technology can sequence mRNA extracted from a sample of cells.
<pre>
+
* We can align sequenced reads back to the genome.
bwa index humanChr7Region.fasta
+
* The number of reads coming from a gene depends on its expression level (and on its length).
bwa mem humanChr7Region.fasta  motherChr7Region.fastq | samtools view -S -b - |  samtools sort - motherChr7Region
+
==HWbioinf2==
samtools index motherChr7Region.bam
+
<!-- NOTEX -->
</pre>
+
See also the [[#Lbioinf2|lecture]]
  
Call SNPs:
+
Submit the protocol and the required files to <tt>/submit/bioinf2</tt>
<pre>
+
<!-- /NOTEX -->
freebayes -f humanChr7Region.fasta --min-alternate-count 10 motherChr7Region.bam >motherChr7Region.vcf
 
</pre>
 
  
Run igv, use humanChr7Region.fasta as genome, open motherChr7Region.bam and motherChr7Region.vcf. Looking at the aligned reads and the vcf file, '''answer''' the following questions in protocol:
+
===Input files===
* (a) How many variants were found in the vcf file?
+
Copy files from /tasks/bioinf2/
* (b) How many variants are heterozygous and how many are homozygous?
+
<syntaxhighlight lang="bash">
* (c) Are all variants single-nucleotide variants or do you also see some insertions/deletions (indels)?
+
mkdir bioinf2
Also export overall view of the whole region from igv to file motherChr7Region.png.
+
cd bioinf2
 +
cp -iv /tasks/bioinf2/* .
 +
</syntaxhighlight>
  
'''Submit''' the following files:
+
Files:
* motherChr7Region.png, motherChr7Region.bam, motherChr7Region.vcf
+
* <tt>ref.fasta</tt> is a 38kb piece of the genome of the fungus ''[https://www.ncbi.nlm.nih.gov/genome?term=aspergillus%20fumigatus Aspergillus nidulans]''
 +
* <tt>rnaseq.fastq</tt> are RNA-seq reads from Illumina sequencer extracted from the [https://www.ncbi.nlm.nih.gov/sra/?term=SRR4048918 Short read archive]
 +
* <tt>annot.gff</tt> is the reference gene annotation from the database (we will consider this as correct gene positions)
  
==Task B: UCSC browser==
+
===Task A: Gene finding===
* (a) Where is sequence from regionChr7.fasta located in the browser?
 
** Go to http://genome-euro.ucsc.edu/, From the blue menu, select Tools->Blat
 
** Check that blat uses Human, hg38 assembly
 
** Open regionChr7.fasta in a graphical editor (e.g. gedit), sleect all, paste to BLAT window, run BLAT
 
** In the table of results, the best result should have identiy close to 100% and span close to 7kb
 
** For this best result, click on link named Browser
 
** Report which chromosome and which region you get
 
  
* (b) Which gene is located in this region?
+
Run the Augustus gene finder with two versions of parameters:
** Once you are int he browser, press Default tracks button
+
* one trained specifically for ''A. nidulans'' genes
** Track named GENCODE contains known genes, shown as rectangles (exanos) connected by lines (introns). Short gene names are next to them.
+
* one trained for the human genome, where genes have different statistical properties (for example, they are longer and have more introns)
** Report the name of the gene in the region
+
<syntaxhighlight lang="bash">
 +
augustus --species=anidulans ref.fasta > augustus-anidulans.gtf
 +
augustus --species=human ref.fasta > augustus-human.gtf
 +
</syntaxhighlight>
  
* (c) In which tissue is this gene most highly expressed? What is the function of this gene?
+
The results of gene finding are in the [http://mblab.wustl.edu/GTF22.html GTF format]. Rows starting with <tt>#</tt> are comments, each of the remaining rows describes some interval of the sequence. If the second column is <tt>CDS</tt>, it is a coding part of an exon. The reference annotation <tt>annot.gff</tt> is in a similar format called [http://gmod.org/wiki/GFF3 GFF3].
** When you click on the gene (possibly twice), you get an information page which starts with a summary of the known function of this gene. Copy the first sentence to your protocol.  
 
** Further down on the gene information page you see RNA-Seq Expression Data (colorful boxplots). Find out which tissues have the highest signal.
 
* (d) Which SNPs are located in this gene? Which trait do they inflence?
 
** You can see SNPs in the Common SNPs(151) track, but their IDs appear only after switching this track to pack mode. You can click on each SNPs to see more information and to copy their ID to your protocol.
 
** Information page of the gene (part c) also describes function of various alleles of this gene (see e.g. part POLYMORPHISM).
 
** You can also find information about individual SNPs by looking for them by their ID in [https://www.snpedia.com/index.php/SNPedia SNPedia] (not required in this task)
 
  
 +
Examine the files and try to find the answers to the following questions using command-line tools
  
<!--
+
(a) How many exons are in each of the two GTF files? (Beware: simply using <tt>grep</tt> with pattern <tt>CDS</tt> may yield lines containing this string in a different column. You can use e.g. techniques from the [[#Lbash|lecture]] and [[#HWbash|exercises]] on command-line tools).
* Default tracks
 
* Which gene (GENCODE)
 
* Which tissues (Gene express track)
 
* Switch on SNPedia track to "Pack", click on one of the SNPs within the gene, read the text - which trait do these SNPs influence?
 
-->
 
  
==Task C: Examining larger vcf files==
+
(b) How many genes are in each of the two GTF files? (The files contain rows with word <tt>gene</tt> in the second column, one for each gene)
In this task, we will look at motherChr12.vcf and fatherChr12.vcf files and compute various statistics. You can use command-line tools, such as grep, wc, sort, uniq and Perl one-liners (as in [[#L02]]), you write small scripts in Perl or Python (as in [[#Lperl1]] and [[#L04]]).
 
* Write all used commands to your protocol
 
* If you write any scripts, submit them as well.
 
  
Questions:
+
(c) How many exons and genes are in the <tt>annot.gff</tt> file?
* (a) How many SNPs are in each file?
 
** This can be found easily by wc, only make sure to exclude lines with comments
 
* (b) How many heterozygous SNPs are in each file?
 
** The last column contains 1|1 for homozygous and either 0|1 or 1|0 for heterozygous SNPs
 
** Character | has special meaning on command line and in grep patterns, make sure to place it in ' ' and possibly escape it with \
 
* (c) How many SNP positions are shared between the two files?
 
** The second column of each file lists the position. We want to compute the size of intersection of the set of positions in motherChr12.vcf and fatherChr12.vcf files
 
** You can e.g. create temporary files containing only positions from the two files and sort them alphabetically. Then you can find the intersection using [http://www.gnu.org/software/coreutils/manual/html_node/comm-invocation.html comm] command with options -1 -2. Alternatively, you can store positions as keys in a hash table (dictionary) in a Perl or Python script.
 
* (d) List the 5 most frequent pairs of reference/alternate allele in motherChr12.vcf and their frequencies. Do they correspond to transitions or transversions?
 
** Fourth column contains the reference value, fifth column the alternate value. For example, the first SNP in motherChr12.vcf has a pair C,A.
 
** For each possible pair of nucleotides, find how many times it occurs in the motherChr12.vcf
 
** For example, pair C,A occurs 6894 times
 
** Then sort the pairs by their frequencies and report 5 most frequent pairs
 
** Mutations can be classified as transitions and transversions. Transitions change purine to purine or pyrimidine to pyrimidine, transversion change a purine to pyrimidine or vice versa. For example, pair C,A is a transversion changing pyrimidine C to purine A. Which of these most frequent pairs correspond to transitions and which to transversions?
 
** To count pairs without writing scripts, you can create a temporary file containing only columns 4 and 5 (without comments), and then use commands sort and uniq to count each pair.
 
* (e) Which parts of the chromosome have the highest and lowest number of SNPs in motherChr12.vcf?
 
** First create a list of windows of size 100kb covering the whole chromosome 12 using these two commands:
 
*** <tt>perl -le 'print "chr12\t133275309"' > humanChr12.size</tt>
 
*** <tt>bedtools makewindows -g humanChr12.size -w 100000 -i srcwinnum > humanChr12-windows.bed</tt>
 
** Then count SNPs in each window using this command:
 
*** <tt>bedtools coverage -a  humanChr12-windows.bed -b motherChr12.vcf > motherChr12-windows.tab</tt>
 
** Find out which column of the resulting file contains the number of SNPs per window, e.g. by reading the documentation obtained by command <tt>bedtools coverage -h</tt>
 
** Sort according to the column with SNP number, look at the first and last line of the sorted file
 
** For checking: the second highest count is 387 in window with coordinates 20,800,000-20,900,000
 
=L08=
 
[[#HW08]]
 
  
Program for today: basics of R (applied to biology examples)
+
<!-- NOTEX -->
* very short intro as a lecture
+
Write the anwsers and commands to the '''protocol'''. '''Submit''' files <tt>augustus-anidulans.gtf</tt> and <tt>augustus-human.gtf</tt>.
* tutorial as HW: read a bit of text, try some commands, extend/modify them as requested
+
<!-- /NOTEX -->
  
In this course we cover several languages popular for scripting in bioinformatics: Perl, Python, R
+
===Task B: Aligning RNA-seq reads===
* their capabilities overlap, many extensions emulate strengths of one in another
 
* choose a language based on your preference, level of knowledge, existing code for the task, rest of the team
 
* quickly learn a new language if needed
 
* also possibly combine, e.g. preprocess data in Perl or Python, then run statistical analyses in R, automate entire pipeline with bash or make
 
  
==Introduction==
+
* Align RNA-seq reads to the genome
* [http://www.r-project.org/ R] is an open-source system for statistical computing and data visualization
+
* We will use a specialized tool <tt>tophat</tt>, which can recognize introns
* Programming language, command-line interface
+
* Then we will sort and index the BAM file, similarly as in the [[#HWbioinf1|previous lecture]]
* Many built-in functions, additional libraries
 
** For example http://bioconductor.org/ for bioinformatics
 
* We will concentrate on useful commands rather than language features
 
  
==Working in R==
+
<syntaxhighlight lang="bash">
* Run command R, type commands in command-line interface
+
bowtie2-build ref.fasta ref.fasta
** supports history of commands (arrows, up and down, Ctrl-R) and completing command names with tab key
+
tophat2 -i 10 -I 10000 --max-multihits 1 --output-dir rnaseq ref.fasta rnaseq.fastq
<pre>
+
samtools sort rnaseq/accepted_hits.bam rnaseq
> 1+2
+
samtools index rnaseq.bam
[1] 3
+
</syntaxhighlight>
</pre>
+
 
* Write a script to file, run it from command-line: <tt>R --vanilla --slave < file.R</tt>
+
In addition to the BAM file, TopHat produced several other files in the <tt>rnaseq</tt> folder. Examine them to find out answers to the following questions (you can do it manually by looking at the the files, e.g. by <tt>less</tt> command):
* Use <tt>rstudio</tt> to open a graphics IDE [https://www.rstudio.com/products/RStudio/]
 
** Windows with editor of R scripts, console, variables, plots
 
** Ctrl-Enter in editor executes current command in console
 
<pre>
 
x=c(1:10)
 
plot(x,x*x)
 
</pre>
 
* <tt>? plot</tt> displays help for plot command
 
  
Suggested workflow
+
(a) How many reads were in the FASTQ file? How many of them were successfully mapped?
* work interactively in Rstudio or on command line, try various options
 
* select useful commands, store in a script
 
* run script automatically on new data/new versions, potentially as a part of a bigger pipeline
 
  
==Additional information==
+
(b) How many introns ("junctions") were predicted? How many of them are supported by more than one read? (The 5th column of the corresponding file is the number of reads supporting a junction.)
* [http://cran.r-project.org/doc/manuals/R-intro.html Official tutorial]
 
* [http://cran.r-project.org/doc/contrib/Seefeld_StatsRBio.pdf Seefeld, Linder: Statistics Using R with Biological Examples (pdf book)]
 
* [http://www.burns-stat.com/pages/Tutor/R_inferno.pdf Patrick Burns: The R Inferno] (intricacies of the language)  
 
* [https://www.r-project.org/doc/bib/R-books.html Other books]
 
  
==Gene expression data==
+
<!-- NOTEX -->
* Gene expression: DNA->mRNA->protein
+
Write answers to the '''protocol'''. '''Submit''' the file <tt>rnaseq.bam</tt>.
* Level of gene expression: Extract mRNA from a cell, measure amounts of mRNA
+
<!-- /NOTEX -->
* Technologies: microarray, RNA-seq
 
Gene expression data
 
* Rows: genes
 
* Columns: experiments (e.g. different conditions or different individuals)
 
* Each value is expression of a gene, i.e. relative amount of mRNA for this gene in the sample
 
  
We will use microarray data for yeast:
+
===Task C: Visualizing in IGV===
* Strassburg, Katrin, et al. "Dynamic transcriptional and metabolic responses in yeast adapting to temperature stress." Omics: a journal of integrative biology 14.3 (2010): 249-259. [http://online.liebertpub.com/doi/full/10.1089/omi.2009.0107]
 
* Downloaded from GEO database [http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE15352]
 
* Data already preprocessed: normalization, log2, etc
 
* We have selected only cold conditions, genes with absolute change at least 1
 
* Data: 2738 genes, 8 experiments in a time series, yeast moved from normal temperature 28 degrees C to cold conditions 10 degrees C, samples taken after 0min, 15min, 30min, 1h, 2h, 4h, 8h, 24h in cold
 
=HW08=
 
[[#L08]]
 
  
==Submitting==
+
As before, run IGV as follows:
In this homework, try to read text, execute given commands, potentially trying some small modifications.
+
<syntaxhighlight lang="bash">
* Then do tasks A-D, submit required files (3x .png)
+
igv -g ref.fasta &
* In your protocol, enter commands used in tasks A-D, with explanatory comments in more complicated situations
+
</syntaxhighlight>
* In task B also enter required output to protocol
 
* Protocol template in /tasks/hw08/protocol.txt
 
  
==First steps==
+
Open additional files using menu <tt>File -> Load from File</tt>: <tt>annot.gff, augustus-anidulans.gtf, augustus-human.gtf, rnaseq.bam</tt>
* Type a command, R writes the answer, e.g.:
+
* Exons are shown as thicker boxes, introns are thinner.
<pre>
+
* For each of the following questions, select a part of the sequence illustrating the answer and export a figure using <tt>File->Save image</tt>
> 1+2
+
* You can check these images using command <tt>eog</tt>
[1] 3
 
</pre>
 
* We can store values in variables and use them later on
 
<pre>
 
> # The size of the sequenced portion of cow's genome, in millions of base pairs
 
> Cow_genome_size <- 2290
 
> Cow_genome_size
 
[1] 2290
 
> Cow_chromosome_pairs <- 30
 
> Cow_avg_chrom <- Cow_genome_size / Cow_chromosome_pairs
 
> Cow_avg_chrom
 
[1] 76.33333
 
</pre>
 
Surprises:
 
* dots are used as parts of id's, e.g. read.table is name of a single function (not method for object read)
 
* assignment via <- or =
 
** careful: a<-3 is an assignment, a < -3 is a comparison
 
* vectors etc are indexed from 1, not from 0
 
  
==Vectors, basic plots==
+
Questions:
* Vector is a sequence of values of the same type (all are numbers or all are strings or all are booleans)
 
<pre>
 
# Vector can be created from a list of numbers by function named c
 
a <- c(1,2,4)
 
a
 
# prints [1] 1 2 4
 
  
# c also concatenates vectors
+
(a) Create an image illustrating differences between Augustus with human parameters and the reference annotation, save as <tt>a.png</tt>. Briefly describe the differences in words.
c(a,a)
 
# prints [1] 1 2 4 1 2 4
 
  
# Vector of two strings
+
(b) Find some differences between Augustus with ''A. nidulans'' parameters and the reference annotation. Store an illustrative figure as <tt>b.png</tt>. Which parameters have yielded a more accurate prediction?
b <- c("hello", "world")
 
  
# Create a vector of numbers 1..10
+
(c) Zoom in to one of the genes with a high expression level and try to find spliced read alignments supporting the annotated intron boundaries. Store the image as <tt>c.png</tt>.
x <- 1:10
 
x
 
# prints [1]  1  2  3  4  5  6  7  8  9 10
 
</pre>
 
  
===Vector arithmetics===
+
<!-- NOTEX -->
* Operations applied to each member of the vector
+
'''Submit''' files <tt>a.png, b.png, c.png</tt>. Write answers to your '''protocol'''.
<pre>
+
<!-- /NOTEX -->
x <- 1:10
 
# Square each number in vector x
 
x*x
 
# prints [1]  1  4  9  16  25  36  49  64  81 100
 
  
# New vector y: logarithm of a number in x squared
+
=Lbioinf3=
y <- log(x*x)
+
<!-- NOTEX -->
y
+
[[#HWbioinf3]]
# prints [1] 0.000000 1.386294 2.197225 2.772589 3.218876 3.583519 3.891820 4.158883
+
<!-- /NOTEX -->
# [9] 4.394449 4.605170
 
  
# Draw graph of function log(x*x) for x=1..10
+
==Polymorphisms==
plot(x,y)
+
* Individuals within species differ slightly in their genomes
# The same graph but use lines instead of dots
+
* Polymorphisms are genome variants which are relatively frequent in a population (e.g. at least 1%)
plot(x,y,type="l")
+
* [https://ghr.nlm.nih.gov/primer/genomicresearch/snp SNP]: single-nucleotide polymorphism (a polymorphism which is a substitution of a single nucleotide)
 +
* Recall that most human cells are diploid, with one set of chromosomes inherited from the mother and the other from the father
 +
* At a particular location, a single human can thus have two different alleles (heterozygosity) or two copies of the same allele (homozygosity)
  
# Addressing elements of a vector: positions start at 1
+
==Finding polymorphisms / genome variants==
# Second element of the vector
+
* We compare sequencing reads coming from an individual to a reference genome of the species
y[2]
+
* First we align them, as in [[#HWbioinf1|the exercises on genome assembly]]
# prints [1] 1.386294
+
* Then we look for positions where a substantial fraction of reads does not agree with the reference (this process is called variant calling)
  
# Which elements of the vector satisfy certain condition? (vector of logical values)
+
==Programs and file formats==
y>3
+
* For mapping, we will use <tt>[https://github.com/lh3/bwa BWA-MEM]</tt> (you can also try Minimap2, as in [[#HWbioinf1|the exercises on genome assembly]])
# prints [1] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
+
* For variant calling, we will use [https://github.com/ekg/freebayes Freebayes]
 +
* For reads and read alignments, we will use FASTQ and BAM files, as in the [[#Lbioinf1|previous lectures]]
 +
* For storing found variants, we will use [http://www.internationalgenome.org/wiki/Analysis/vcf4.0/ VCF files]
 +
* For storing genome intervals, we will use [https://genome.ucsc.edu/FAQ/FAQformat.html#format1 BED files]
  
# write only those elements from y that satisfy the condition
+
==Human variants==
y[y>3]
+
* For many human SNPs we already know something about their influence on phenotype and their prevalence in different parts of the world
# prints [1] 3.218876 3.583519 3.891820 4.158883 4.394449 4.605170
+
* There are various databases, e.g. [https://www.ncbi.nlm.nih.gov/SNP/ dbSNP], [https://www.omim.org/ OMIM], or user-editable [https://www.snpedia.com/index.php/SNPedia SNPedia]
  
# we can also write values of x such that values of y satisfy the condition...
+
==UCSC genome browser==
x[y>3]
+
<!-- NOTEX -->
# prints [1] 5  6  7  8  9 10
+
A short video for this section: [https://youtu.be/RwEBS62Avaw]
</pre>
+
<!-- /NOTEX -->
 +
* On-line tool similar to IGV
 +
* http://genome-euro.ucsc.edu/
 +
* Nice interface for browsing genomes, lot of data for some genomes (particularly human), but not all sequenced genomes represented
  
* Alternative plotting facilities: [http://ggplot2.org/ ggplot2 library], [https://cran.r-project.org/web/packages/lattice/index.html lattice library]
+
====Basics====
 +
* On the front page, choose Genomes in the top blue menu bar
 +
* Select a genome and its version, optionally enter a position or a keyword, press submit
 +
* On the browser screen, the top image shows chromosome map, the selected region is in red
 +
* Below there is a view of the selected region and various tracks with information about this region
 +
* For example some of the top tracks display genes (boxes are exons, lines are introns)
 +
* Tracks can be switched on and off and configured in the bottom part of the page (browser supports different display levels, full contains all information but takes a lot of vertical space)
 +
* Buttons for navigation are at the top (move, zoom, etc.)
 +
* Clicking at the browser figure allows you to get more information about a gene or other displayed item
 +
* In this lecture, we will need tracks GENCODE and dbSNP - check e.g. [http://genome-euro.ucsc.edu/cgi-bin/hgTracks?db=hg38&position=chr11%3A66546841-66563329 gene ACTN3] and within it SNP <tt>rs1815739</tt> in exon 15
  
===Task A===
+
====Blat====
* Create a plot of the '''binary logarithm''' with dots in the graph more densely spaced (from 0.1 to 10 with step 0.1)
+
* For sequence alignments, UCSC genome browser offers a fast but less sensitive BLAT (good for the same or very closely related species)
* Store it in file <tt>log.png</tt> and '''submit''' this file
+
* Choose <tt>Tools->Blat</tt> in the top blue menu bar, enter DNA sequence below, search in the human genome
* Hints:
+
** What is the identity level for the top found match? What is its span in the genome? (Notice that other matches are much shorter)
** Create x and y by vector arithmetics
+
** Using Details link in the left column you can see the alignment itself, Browser link takes you to the browser at the matching region
** To compute binary logarithm check help <tt>? log</tt>
 
** Before running plot, use command <tt>png("log.png")</tt> to store the result, afterwards call <tt>dev.off()</tt> to close the file (in rstudio you can also export plots manually)
 
 
 
==Data frames and simple statistics==
 
* Data frame: a table similar to spreadsheet, each column is a vector, all are of the same length
 
* We will use a table with the following columns:
 
** The size of a genome, in millions of nucleotides
 
** Number of chromosome pairs
 
** GC content
 
** Taxonomic group mammal or fish
 
* Stored in CSV format, columns separated by tabs.
 
* Data: Han et al Genome Biology 2008 [http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2441465/]
 
 
<pre>
 
<pre>
Species    Size    Chrom  GC      Group
+
AACCATGGGTATATACGACTCACTATAGGGGGATATCAGCTGGGATGGCAAATAATGATTTTATTTTGAC
Human      2850    23      40.9    mammal
+
TGATAGTGACCTGTTCGTTGCAACAAATTGATAAGCAATGCTTTCTTATAATGCCAACTTTGTACAAGAA
Chimpanzee 2750    24      40.7    mammal
+
AGTTGGGCAGGTGTGTTTTTTGTCCTTCAGGTAGCCGAAGAGCATCTCCAGGCCCCCCTCCACCAGCTCC
Macaque    2650    21      40.7    mammal
+
GGCAGAGGCTTGGATAAAGGGTTGTGGGAAATGTGGAGCCCTTTGTCCATGGGATTCCAGGCGATCCTCA
Mouse      2480    20      41.7    mammal
+
CCAGTCTACACAGCAGGTGGAGTTCGCTCGGGAGGGTCTGGATGTCATTGTTGTTGAGGTTCAGCAGCTC
...
+
CAGGCTGGTGACCAGGCAAAGCGACCTCGGGAAGGAGTGGATGTTGTTGCCCTCTGCGATGAAGATCTGC
Tetraodon  187    21      45.9    fish
+
AGGCTGGCCAGGTGCTGGATGCTCTCAGCGATGTTTTCCAGGCGATTCGAGCCCACGTGCAAGAAAATCA
...
+
GTTCCTTCAGGGAGAACACACACATGGGGATGTGCGCGAAGAAGTTGTTGCTGAGGTTTAGCTTCCTCAG
 +
TCTAGAGAGGTCGGCGAAGCATGCAGGGAGCTGGGACAGGCAGTTGTGCGACAAGCTCAGGACCTCCAGC
 +
TTTCGGCACAAGCTCAGCTCGGCCGGCACCTCTGTCAGGCAGTTCATGTTGACAAACAGGACCTTGAGGC
 +
ACTGTAGGAGGCTCACTTCTCTGGGCAGGCTCTTCAGGCGGTTCCCGCACAAGTTCAGGACCACGATCCG
 +
GGTCAGTTTCCCCACCTCGGGGAGGGAGAACCCCGGAGCTGGTTGTGAGACAAATTGAGTTTCTGGACCC
 +
CCGAAAAGCCCCCACAAAAAGCCG
 
</pre>
 
</pre>
 +
==HWbioinf3==
 +
<!-- NOTEX -->
 +
See also the [[#Lbioinf3|lecture]]
  
<pre>
+
Submit the protocol and the required files to <tt>/submit/bioinf3</tt>
# reading a frame from file
+
<!-- /NOTEX -->
a<-read.table("/tasks/hw08/genomes.csv", header = TRUE, sep = "\t");
 
# column with name size
 
a$Size
 
  
# Average chromosome length: divide size by the number of chromosomes
+
===Input files===
a$Size/a$Chrom
+
Copy files from /tasks/bioinf3/
 +
<syntaxhighlight lang="bash">
 +
mkdir bioinf3
 +
cd bioinf3
 +
cp -iv /tasks/bioinf3/* .
 +
</syntaxhighlight>
  
# Add average chromosome length as a new column to frame a
+
Files:
a<-cbind(a,AvgChrom=a$Size/a$Chrom)
+
* <tt>humanChr7Region.fasta</tt> is a 7kb piece of the human chromosome 7
 +
* <tt>motherChr7Region.fastq</tt> is a sample of reads from an anonymous donor known as NA12878; these reads come from region in <tt>humanChr7Region.fasta</tt>
 +
* <tt>fatherChr12.vcf</tt> and <tt>motherChr12.vcf</tt> are single-nucleotide variants on the chromosome 12 obtained by sequencing two individuals NA12877, NA12878 (these come from a larger [https://www.coriell.org/0/Sections/Collections/NIGMS/CEPHFamiliesDetail.aspx?PgId=441&fam=1463& family])
 +
<!-- TODO: link above displays badly in tex -->
  
# Scatter plot of average chromosome length vs GC content
+
===Task A: read mapping and variant calling===
plot(a$AvgChrom, a$GC)
 
  
# Compactly display structure of a
+
Align reads to the reference:
# (good for checking that import worked etc)
+
<syntaxhighlight lang="bash">
str(a)
+
bwa index humanChr7Region.fasta
 +
bwa mem humanChr7Region.fasta  motherChr7Region.fastq | \
 +
  samtools view -S -b - |  samtools sort - motherChr7Region
 +
samtools index motherChr7Region.bam
 +
</syntaxhighlight>
  
# display mean, median, etc. of each column
+
Call variants:
summary(a);
+
<syntaxhighlight lang="bash">
 +
freebayes -f humanChr7Region.fasta --min-alternate-count 10 \
 +
  motherChr7Region.bam >motherChr7Region.vcf
 +
</syntaxhighlight>
  
# average genome size
+
Run IGV, use <tt>humanChr7Region.fasta</tt> as genome, open <tt>motherChr7Region.bam</tt> and <tt>motherChr7Region.vcf</tt>. Looking at the aligned reads and the VCF file, '''answer''' the following questions:
mean(a$Size)
 
# average genome size for mammals
 
mean(a$Size[a$Group=="mammal"])
 
# Standard deviation
 
sd(a$Size)
 
  
# Histogram of genome sizes
+
(a) How many variants were found in the VCF file?
hist(a$Size)
 
</pre>
 
  
===Task B===
+
(b) How many variants are heterozygous and how many are homozygous?
* Divide frame <tt>a</tt> to two frames, one for mammals, one for fish. Hint:
 
** Try command <tt>a[c(1,2,3),]</tt>. What is it doing?
 
** Try command <tt>a$Group=="mammal"</tt>.
 
** Combine these two commands to get rows for all mammals and store the frame in a new variable, then repeat for fish
 
** Use a general approach which does not depend on the exact number and ordering of rows in the table.
 
  
* Run the command <tt>summary</tt> separately for mammals and for fish. Which of their characteristics are different?
+
(c) Are all variants single-nucleotide variants or do you also see some insertions/deletions (indels)?
** '''Write''' output and your conclusion to the protocol
 
  
===Task C===
+
Also export the overall view of the whole region from IGV to file <tt>motherChr7Region.png</tt>.
* Draw a graph comparing genome size vs GC content; use different colors for points representing mammals and fish
 
** '''Submit''' the plot in file <tt>genomes.png</tt>
 
** To draw the graph, you can use one of the options below, or find yet another way
 
** Option 1: first draw mammals with one color, then add fish in another color
 
*** Color of points can be changed by: <tt>plot(1:10,1:10, col="red")</tt>
 
*** After plot command you can add more points to the same graph by command <tt>points</tt>, which can be used similarly as <tt>plot</tt>
 
*** Warning: command <tt>points</tt> does not change the ranges of x and y axes. You have to set these manually so that points from both groups are visible. You can do this using options <tt>xlim</tt> and <tt>ylim</tt>, e.g. <tt>plot(x,y, col="red", xlim=c(1,100), ylim=c(1,100))</tt>
 
** Option 2: plot both mammals and fish in one plot command, and give it a vector of colors, one for each point
 
*** <tt>plot(1:10,1:10,col=c(rep("red",5),rep("blue",5)))</tt> will plot the first 5 points red and the last 5 points blue
 
  
* Bonus task: add a legend to the plot, showing which color is mammal and which is fish
+
<!-- NOTEX -->
 +
'''Submit''' the following files: <tt>motherChr7Region.png, motherChr7Region.bam, motherChr7Region.vcf</tt>
 +
<!-- /NOTEX -->
  
==Expression data and clustering==
+
===Task B: UCSC browser===
  
Data here is bigger, better to use plain R rather than rstudio (limited server CPU/memory)
+
(a) Where is sequence from <tt>regionChr7.fasta</tt> located in the browser?
 +
* Go to http://genome-euro.ucsc.edu/, from the blue menu, select <tt>Tools->Blat</tt>
 +
* Check that Blat uses Human, hg38 assembly
 +
* Open <tt>regionChr7.fasta</tt> in a graphical editor (e.g. <tt>kate</tt>), select all, paste to the BLAT window, run BLAT
 +
* In the table of results, the best result should have identity close to 100% and span close to 7kb
 +
* For this best result, click on the link named Browser
 +
* Report which chromosome and which region you get
  
<pre>
+
(b) Which gene is located in this region?
# Read gene expression data table
+
* Once you are in the browser, press Default tracks button
a <- read.table("/tasks/hw08/microarray.csv", header = TRUE, sep = "\t", row.names=1)
+
* Track named GENCODE contains known genes, shown as rectangles (exons) connected by lines (introns). Short gene names are next to them.
# Visual check of the first row
+
* Report the name of the gene in the region
a[1,]
 
# plot starting point vs. situation after 1 hour
 
plot(a$cold_0min,a$cold_1h)
 
# to better see density in dense clouds of points, use this plot
 
smoothScatter(a$cold_15min, a$cold_1h)
 
# outliers away from diagonal in the plot above are most strongly differentially expressed genes
 
# these are more easy to see in MA plot:
 
# x-axis: average expression in the two conditions
 
# y-axis: difference between values (they are log-scale, so difference 1 means 2-fold)
 
smoothScatter((a$cold_15min+a$cold_1h)/2, a$cold_15min-a$cold_1h)
 
</pre>
 
  
Clustering is a wide group of methods that split data points into groups with similar properties
+
(c) In which tissue is this gene most highly expressed? What is the function of this gene?
* We will group together genes that have a similar reaction to cold, i.e. their rows in gene expression data matrix have similar values
+
* When you click on the gene (possibly twice), you get an information page which starts with a summary of the known function of this gene. Copy the first sentence to your protocol.  
We will consider two simple clustering methods
+
* Further down on the gene information page you see RNA-Seq Expression Data (colorful boxplots). Find out which tissues have the highest signal.
* K means clustering splits points (genes) into ''k'' clusters, where ''k'' is a parameter given by the user. It finds a center of each cluster and tries to minimize the sum of distances from individual points to the center of their cluster. Note that this algorithm is randomized so you will get different clusters each time.
 
* Hierarchical clustering puts all data points (genes) to a hierarchy so that smallest subtrees of the hierarchy are the most closely related groups of points and these are connected to bigger and more loosely related groups.
 
  
[[Image:HW08-heatmap.png|thumb|200px|right|Example of a heatmap]]
+
(d) Which SNPs are located in this gene? Which trait do they influence?
<pre>
+
* You can see SNPs in the Common SNPs(151) track, but their IDs appear only after switching this track to pack mode. You can click on each SNPs to see more information and to copy their ID to your protocol.
# Heatmap: creates hierarchical clustering of rows
+
* Information page of the gene (part c) also describes function of various alleles of this gene (see e.g. part POLYMORPHISM).
# then shows every value in the table using color ranging from red (lowest) to white (highest)
+
* You can also find information about individual SNPs by looking for them by their ID in [https://www.snpedia.com/index.php/SNPedia SNPedia] (not required in this task)
# Computation may take some time
 
heatmap(as.matrix(a),Colv=NA)
 
# Previous heatmap normalized each row, the next one uses data as they are:
 
heatmap(as.matrix(a),Colv=NA,scale="none")
 
</pre>
 
  
<pre>
+
===Task C: Examining larger vcf files===
# k means clustering to 7 clusters
+
In this task, we will look at <tt>motherChr12.vcf</tt> and <tt>fatherChr12.vcf</tt> files and compute various statistics. You can use command-line tools, such as <tt>grep, wc, sort, uniq</tt> and Perl one-liners (as in [[#Lbash]]), or you can write small scripts in Perl or Python (as in [[#Lperl]] and [[#Lpython]]).
k = 7
+
<!-- NOTEX -->
cl <- kmeans(a,k)
+
* Write all used commands to your protocol
# each gene has assigned a cluster (number between 1 and k)
+
* If you write any scripts, submit them as well
cl$cluster
+
<!-- /NOTEX -->
# draw only cluster number 3 out of k
 
heatmap(as.matrix(a[cl$cluster==3,]),Rowv=NA, Colv=NA)
 
  
# reorder genes in the table according to cluster
+
Questions:
heatmap(as.matrix(a[order(cl$cluster),]),Rowv=NA, Colv=NA)
 
  
# compare overall column means with column means in cluster 3
+
(a) How many SNPs are in each file?
# function apply uses mean on every column (or row if 2 changed to 1)
+
* This can be found easily by <tt>wc</tt>, only make sure to exclude lines with comments
apply(a,2,mean)
 
# now means within cluster
 
apply(a[cl$cluster==3,],2,mean)
 
  
# clusters have centers which are also computed as means
+
(b) How many heterozygous SNPs are in each file?
# so this is the same as previous command
+
* The last column contains <tt>1|1</tt> for homozygous and either <tt>0|1</tt> or <tt>1|0</tt> for heterozygous SNPs
cl$centers[3,]
+
* Character <tt>|</tt> has special meaning on the command line and in <tt>grep</tt> patterns; make sure to place it in apostrophes <tt>' '</tt> and possibly escape it with backslash <tt>\</tt>
</pre>
 
  
===Task D===
+
(c) How many SNP positions are shared between the two files?
[[Image:HW08-clusters.png|thumb|200px|right|Example of a required plot]]
+
* The second column of each file lists the position. We want to compute the size of intersection of the set of positions in <tt>motherChr12.vcf</tt> and <tt>fatherChr12.vcf</tt> files
* Draw a plot in which x-axis is time and y-axis is the expression level and the center of each cluster is shown as a line
+
* You can e.g. create temporary files containing only positions from the two files and sort them alphabetically. Then you can find the intersection using [http://www.gnu.org/software/coreutils/manual/html_node/comm-invocation.html comm] command with options <tt>-1 -2</tt>. Alternatively, you can store positions as keys in a hash table (dictionary) in a Perl or Python script.
** use command <tt>matplot(x,y,type="l")</tt> which gets two matrices x and y and plots columns of x vs columns of y
 
** <tt>matplot(,y,type="l")</tt> will use numbers 1,2,3... as columns of the missing matrix x
 
** create y from <tt>cl$centers</tt> by applying function <tt>t</tt> (transpose)
 
** to create an appropriate matrix x, create a vector of times for individual experiments in minutes or hours (do it manually, no need to parse column names automatically)
 
** using functions <tt>rep</tt> and <tt>matrix</tt> you can create a matrix x in which this vector is used as every column
 
** then run <tt>matplot(x,y,type="l")</tt>
 
** since time points are not evenly spaced, it would be better to use logscale: <tt>matplot(x,y,type="l",log="x")</tt>
 
** to avoid log(0), change the first timepoint from 0min to 1min
 
* Submit file '''clusters.png''' with your final plot
 
=L09=
 
[[#HW09]]
 
  
Topic of this lecture are statistical tests in R.
+
(d) List the 5 most frequent pairs of reference/alternate allele in <tt>motherChr12.vcf</tt> and their frequencies. Do they correspond to transitions or transversions?
* Beginners in statistics: listen to lecture, then do tasks A, B, C
+
* The fourth column contains the reference value, fifth column the alternate value. For example, the first SNP in <tt>motherChr12.vcf</tt> has a pair <tt>C,A</tt>.
* If you know basics of statistical tests, do tasks B, C, D
+
* For each possible pair of nucleotides, find how many times it occurs in the <tt>motherChr12.vcf</tt>
* More information on this topic in [https://sluzby.fmph.uniba.sk/infolist/sk/1-EFM-340_13.html 1-EFM-340 Počítačová štatistika]
+
* For example, pair <tt>C,A</tt> occurs 6894 times
 +
* Then sort the pairs by their frequencies and report 5 most frequent pairs
 +
* Mutations can be classified as transitions and transversions. Transitions change purine to purine or pyrimidine to pyrimidine, transversions change a purine to pyrimidine or vice versa. For example, pair C,A is a transversion changing pyrimidine C to purine A. Which of these most frequent pairs correspond to transitions and which to transversions?
 +
* To count pairs without writing scripts, you can create a temporary file containing only columns 4 and 5 (without comments), and then use commands <tt>sort</tt> and <tt>uniq</tt> to count each pair.  
  
==Introduction to statistical tests: sign test==
+
(e) Which parts of the chromosome have the highest and lowest density of SNPs in </tt>motherChr12.vcf</tt>?
* [https://en.wikipedia.org/wiki/Sign_test]
+
* First create a list of windows of size 100kb covering the whole chromosome 12 using these two commands:
* Two friends ''A'' and ''B'' have played their favourite game ''n''=10 times, ''A'' has won 6 times and ''B'' has won 4 times.
 
* ''A'' claims that he is a better player, ''B'' claims that such a result could easily happen by chance if they were equally good players.
 
* Hypothesis of player ''B'' is called ''null hypothesis'' that the pattern we see (''A'' won more often than ''B'') is simply a result of chance
 
* Null hypothesis reformulated: we toss coin ''n'' times and compute value ''X'': the number of times we see head. The tosses are independent and each toss has equal probability of being head or tail
 
* Similar situation: comparing programs A and B on several inputs, counting how many times is program A better than B.
 
 
<pre>
 
<pre>
# simulation in R: generate 10 psedorandom bits
+
perl -le 'print "chr12\t133275309"' > humanChr12.size
# (1=player A won)
+
bedtools makewindows -g humanChr12.size -w 100000 -i srcwinnum > humanChr12-windows.bed
sample(c(0,1), 10, replace = TRUE)
+
</pre>
# result e.g. 0 0 0 0 1 0 1 1 0 0
+
* Then count SNPs in each window using this command:
 +
<pre>
 +
bedtools coverage -a  humanChr12-windows.bed -b motherChr12.vcf > motherChr12-windows.tab
 +
</pre>
 +
* Find out which column of the resulting file contains the number of SNPs per window, e.g. by reading the documentation obtained by command <tt>bedtools coverage -h</tt>
 +
* Sort according to the column with the SNP number, look at the first and last line of the sorted file
 +
* For checking: the second highest count is 387 in window with coordinates <tt>20,800,000-20,900,000</tt>
 +
 
 +
=Lr1=
 +
<!-- NOTEX -->
 +
[[#HWr1]] {{Dot}} [https://youtu.be/qHdtopqSiXA Video introduction]
 +
<!-- /NOTEX -->
 +
 
 +
Program for this lecture: basics of R
 +
* A very short introduction will be given as a lecture.
 +
* Exercises have the form of a tutorial: read a bit of text, try some commands, extend/modify them as requested in individual tasks
 +
 
 +
In this course we cover several languages popular for scripting and data processing: Perl, Python, R.
 +
* Their capabilities overlap, many extensions emulate strengths of one in another.
 +
* Choose a language based on your preference, level of knowledge, existing code for the task, the rest of the team.
 +
* Quickly learn a new language if needed.
 +
* Also possibly combine, e.g. preprocess data in Perl or Python, then run statistical analyses in R, automate entire pipeline with <tt>bash</tt> or <tt>make</tt>.
 +
 
 +
==Introduction==
 +
* [http://www.r-project.org/ R] is an open-source system for statistical computing and data visualization
 +
* Programming language, command-line interface
 +
* Many built-in functions, additional libraries
 +
** For example [http://bioconductor.org/ Bioconductor] for bioinformatics
 +
* We will concentrate on useful commands rather than language features
  
# directly compute random variable X, i.e. sum of bits
+
==Working in R==
sum(sample(c(0,1), 10, replace = TRUE))
 
# result e.g. 5
 
  
# we define a function which will m times repeat
+
Option 1: Run command R, type commands in a command-line interface
# the coin tossing experiment with n tosses
+
* It supports history of commands (arrows, up and down, Ctrl-R) and completing command names with the tab key
# and returns a vector with m values of random variable X
 
experiment <- function(m, n) {
 
  x = rep(0, m)    # create vector with m zeroes
 
  for(i in 1:m) {  # for loop through m experiments
 
    x[i] = sum(sample(c(0,1), n, replace = TRUE))
 
  }
 
  return(x)        # return array of values   
 
}
 
# call the function for m=20 experiments, each with n tosses
 
experiment(20,10)
 
# result e.g.  4 5 3 6 2 3 5 5 3 4 5 5 6 6 6 5 6 6 6 4
 
# draw histograms for 20 experiments and 1000 experiments
 
png("hist10.png")  # open png file
 
par(mfrow=c(2,1))  # matrix of plots with 2 rows and 1 column
 
hist(experiment(20,10))
 
hist(experiment(1000,10))
 
dev.off() # finish writing to file
 
</pre>
 
* It is easy to realize that we get [https://en.wikipedia.org/wiki/Binomial_distribution binomial distribution] (binomické rozdelenie)
 
* <math>Pr(X=k) = {n \choose k} 2^{-n}</math>
 
* ''P-value'' of the test is the probability that simply by chance we would get ''k'' the same or more extreme than in our data.
 
* In other words, what is the probability that in 10 tosses we see head 6 times or more (one sided test)
 
* <math>\sum_{j=k}^n {n \choose k} 2^{-n}</math>
 
* If the p-value is very small, say smaller than 0.01, we reject the null hypothesis and assume that player ''A'' is in fact better than ''B''
 
  
<pre>
+
Option 2: Write a script to a file, run it from the command-line as follows:<br><tt>R --vanilla --slave < file.R</tt>
# computing the probability that we get exactly 6 heads in 10 tosses
 
dbinom(6, 10, 0.5) # result 0.2050781
 
# we get the same as our formula above:
 
7*8*9*10/(2*3*4*(2^10)) # result 0.2050781
 
  
# entire probability distribution: probabilities 0..10 heads in 10 tosses
+
Option 3: Use <tt>rstudio</tt> command to open a [https://www.rstudio.com/products/RStudio/ graphical IDE]
dbinom(0:10, 10, 0.5)
+
* Sub-windows with editor of R scripts, console, variables, plots
# [1] 0.0009765625 0.0097656250 0.0439453125 0.1171875000 0.2050781250
+
* Ctrl-Enter in editor executes the current command in console
# [6] 0.2460937500 0.2050781250 0.1171875000 0.0439453125 0.0097656250
+
* You can also install RStudio on your home computer and work there
# [11] 0.0009765625
 
  
#we can also plot the distribution
+
In R, you can create plots. In command-line interface these open as a separate window, in Rstudio they open in one of the sub-windows.
plot(0:10, dbinom(0:10, 10, 0.5))
+
<syntaxhighlight lang="r">
barplot(dbinom(0:10,10,0.5))
+
x=c(1:10)
 +
plot(x,x*x)
 +
</syntaxhighlight>
  
#our p-value is sum for 6,7,8,9,10
+
Suggested workflow
sum(dbinom(6:10,10,0.5))
+
* work interactively in Rstudio or on command line, try various options
# result: 0.3769531
+
* select useful commands, store in a script
# so results this "extreme" are not rare by chance,
+
* run script automatically on new data/new versions, potentially as a part of a bigger pipeline
# they happen in about 38% of cases
 
  
# R can compute the sum for us using pbinom
+
==Additional information==
# this considers all values greater than 5
+
* [http://cran.r-project.org/doc/manuals/R-intro.html Official tutorial]
pbinom(5, 10, 0.5, lower.tail=FALSE)
+
* [http://cran.r-project.org/doc/contrib/Seefeld_StatsRBio.pdf Seefeld, Linder: Statistics Using R with Biological Examples (pdf book)]
# result again 0.3769531
+
* [http://www.burns-stat.com/pages/Tutor/R_inferno.pdf Patrick Burns: The R Inferno] (intricacies of the language)  
 +
* [https://www.r-project.org/doc/bib/R-books.html Other books]
 +
* Built-in help: <tt>? plot</tt> displays help for <tt>plot</tt> command
  
# if probability is too small, use log of it
+
==Gene expression data==
pbinom(9999, 10000, 0.5, lower.tail=FALSE, log.p = TRUE)
+
* DNA molecules contain regions called genes, which are "recipes" for making proteins
# [1] -6931.472
+
* Gene expression is the process of creating a protein according to the "recipe"
# the probability of getting 10000x head is exp(-6931.472) = 2^{-100000}
+
* It works in two stages: first a gene is copied (transcribed) from DNA to RNA, then translated from RNA to protein
 +
* Different proteins are created in different quantities and their amount depends on the needs of a cell
 +
* There are several technologies (microarray, RNA-seq) for measuring the amount of RNA for individual genes, this gives us some measure how active each gene is under given circumstances
  
# generating numbers from binomial distribution
+
Gene expression data
# - similarly to our function experiment
+
* Rows: genes
rbinom(20, 10, 0.5)
+
* Columns: experiments (e.g. different conditions or different individuals)
# [1] 4 4 8 2 6 6 3 5 5 5 5 6 6 2 7 6 4 6 6 5
+
* Each value is the expression of a gene, i.e. the relative amount of mRNA for this gene in the sample
  
# running the test
+
We will use a data set for yeast:
binom.test(6, 10, p = 0.5, alternative="greater")
+
* Abbott, Derek A., et al. "[https://academic.oup.com/femsyr/article/7/6/819/533265 Generic and specific transcriptional responses to different weak organic acids in anaerobic chemostat cultures of Saccharomyces cerevisiae.]" FEMS yeast research 7.6 (2007): 819-833.
#
+
* Downloaded from the [https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE5926 GEO database]
#        Exact binomial test
+
* Data was preprocessed: normalized, converted to logarithmic scale
#
+
* Only 1220 genes with the biggest changes in expression are included in our dataset
# data: 6 and 10
+
* Gene expression measurements under 5 conditions:
# number of successes = 6, number of trials = 10, p-value = 0.377
+
** Control: yeast grown in a normal environment
# alternative hypothesis: true probability of success is greater than 0.5
+
** 4 different acids added so that cells grow 50% slower (acetic, propionic, sorbic, benzoic)
# 95 percent confidence interval:
+
* From each condition (reference and each acid) we have 3 replicates, together 15 experiments
# 0.3035372 1.0000000
+
* The goal is to observe how the acids influence the yeast and the activity of its genes
# sample estimates:
+
Part of the file (only first 4 experiments and first 3 genes shown), strings <tt>2mic_D_protein, AAC3, AAD15</tt> are identifiers of genes
# probability of success
+
<pre>
#                  0.6
+
,control1,control2,control3,acetate1,acetate2,acetate3,...
 +
2mic_D_protein,1.33613934199651,1.13348900359964,1.2726678684356,1.42903234691804,...
 +
AAC3,0.558482767397578,0.608410781454015,0.6663002997292,0.231622581964063,...
 +
AAD15,-0.927871996497105,-1.04072379902481,-1.01885986692013,-2.62459941525552,...
 +
</pre>
 +
==HWr1==
 +
<!-- NOTEX -->
 +
See also the [[#Lr1|lecture]]
 +
<!-- /NOTEX -->
 +
 
 +
In this homework, try to read the text, execute given commands, potentially trying some small modifications. Within the tutorial, you will find tasks A-E to complete in this exercise.
 +
<!-- NOTEX -->
 +
* Submit the required files (4x .png)
 +
* In your protocol, enter the commands used in all tasks, with explanatory comments in more complicated situations
 +
* In tasks B and D also enter the required output to the protocol
 +
* Protocol template in <tt>/tasks/r1/protocol.txt</tt>
 +
<!-- /NOTEX -->
  
# to only get p-value run
+
===The first steps===
binom.test(6, 10, p = 0.5, alternative="greater")$p.value
+
Type a command, R writes the answer, e.g.:
# result 0.3769531
+
<pre>
 +
> 1+2
 +
[1] 3
 
</pre>
 
</pre>
  
==Comparing two sets of values: Welch's t-test==
+
We can store values in variables and use them later:
* Let us now consider two sets of values drawn from two [https://en.wikipedia.org/wiki/Normal_distribution normal distributions] with unknown means and variances
 
* The null hypothesis of the [https://en.wikipedia.org/wiki/Welch%27s_t-test Welch's t-test] is that the two distributions have equal means
 
* The test computes test statistics (in R for vectors x1, x2):
 
** <tt>(mean(x1)-mean(x2))/sqrt(var(x1)/length(x1)+var(x2)/length(x2))</tt>
 
* This test statistics is approximately distributed according to [https://en.wikipedia.org/wiki/Student%27s_t-distribution Student's t-distribution] with the degree of freedom obtained by
 
 
<pre>
 
<pre>
n1=length(x1)
+
> # population of Slovakia in millions, 2019
n2=length(x2)
+
> population = 5.457
(var(x1)/n1+var(x2)/n2)**2/(var(x1)**2/((n1-1)*n1*n1)+var(x2)**2/((n2-1)*n2*n2))
+
> population
 +
[1] 5.457
 +
> # area of Slovakia in thousands of km2
 +
> area = 49.035
 +
> density = population / area
 +
> density
 +
[1] 0.1112879
 
</pre>
 
</pre>
* Luckily R will compute the test for us simply by calling t.test
 
<pre>
 
x1 = rnorm(6, 2, 1)
 
# 2.70110750  3.45304366 -0.02696629  2.86020145  2.37496993  2.27073550
 
  
x2 = rnorm(4, 3, 0.5)
+
Surprises in the R language:
# 3.258643 3.731206 2.868478 2.239788
+
* dots are used as parts of id's, e.g. <tt>read.table</tt> is name of a single function (not a method for the object <tt>read</tt>)
> t.test(x1,x2)
+
* assignment via <tt><-</tt> or <tt>=</tt>
# t = -1.2898, df = 7.774, p-value = 0.2341
+
* vectors etc are indexed from 1, not from 0
# alternative hypothesis: true difference in means is not equal to 0
 
# means 2.272182  3.024529
 
  
x2 = rnorm(4, 5, 0.5)
+
===Vectors, basic plots===
# 4.882395 4.423485 4.646700 4.515626
+
A vector is a sequence of values of the same type (all are numbers or all are strings or all are booleans)
t.test(x1,x2)
+
<syntaxhighlight lang="r">
# t = -4.684, df = 5.405, p-value = 0.004435
+
# Vector can be created from a list of numbers by function named c
# means 2.272182  4.617051
+
a = c(1,2,4)
 +
a
 +
# prints [1] 1 2 4
  
# to get only p-value, run
+
# c also concatenates vectors
t.test(x1,x2)$p.value
+
c(a,a)
</pre>
+
# prints [1] 1 2 4 1 2 4
  
We will apply Welch's t-test to microarray data
+
# Vector of two strings
* Data from GEO database [http://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS2925], publication [http://femsyr.oxfordjournals.org/content/7/6/819.abstract]
+
b = c("hello", "world")
* Abbott et al 2007: Generic and specific transcriptional responses to different weak organic acids in anaerobic chemostat cultures of Saccharomyces cerevisiae
 
* gene expression measurements under 5 conditions:
 
** reference: yeast grown in normal environment
 
** 4 different acids added so that cells grow 50% slower (acetic, propionic, sorbic, benzoic)
 
* from each condition (reference and each acid) we have 3 replicates
 
* together our table has 15 columns (3 replicates from 5 conditions)
 
* 6398 rows (genes)
 
* We will test statistical difference between the reference condition and one of the acids (3 numbers vs other 3 numbers)
 
* See Task B in [[#HW09]]
 
  
==Multiple testing correction==
+
# Create a vector of numbers 1..10
 +
x = 1:10
 +
x
 +
# prints [1]  1  2  3  4  5  6  7  8  9 10
 +
</syntaxhighlight>
  
* When we run t-tests on the reference vs. acetic acid on all 6398 genes, we get 118 genes with p-value<=0.01
+
====Vector arithmetic====
* Purely by chance this would happen in 1% of cases (from definition of p-value)
+
Many operations can be easily applied to each member of a vector
* So purely by chance we would expect to get about 64 genes with p-value<=0.01
+
<syntaxhighlight lang="r">
* So perhaps roughly half of our detected genes (maybe less, maybe more) are false positives
+
x = 1:10
* Sometimes false positives may even overwhelm results
+
# Square each number in vector x
* Multiple testing correction tries to limit the number of false positives among results of multiple statistical tests
+
x*x
* Many different methods
+
# prints [11   4  9  16  25  36  49  64  81 100
* The simplest one is [https://en.wikipedia.org/wiki/Bonferroni_correction Bonferroni correction], where the threshold on p-value is divided by the number of tested genes, so instead of 0.01 we use 0.01/6398 = 1.56e-6
 
* This way the expected overall number of false positives in the whole set is 0.01 and so the probability of getting even a single false positive is also at most 0.01 (by Markov inequality)
 
* We could instead multiply all p-values by the number of tests and apply the original threshold 0.01 - such artificially modified p-values are called corrected
 
* After Bonferroni correction we get only 1 significant gene
 
<pre>
 
# the results of p-tests are in vector pa of length 6398
 
# manually multiply p-values by length(pa), count those that have value <=0.01
 
sum(pa * length(pa) < 0.01)
 
# in R you can use p.adjust form multiple testing correction
 
pa.adjusted = p.adjust(pa, method ="bonferroni")
 
# this is equivalent to multiplying by the length and using 1 if the result > 1
 
pa.adjusted = pmin(pa*length(pa),rep(1,length(pa)))
 
  
# there are less conservative multiple testing correction methods, e.g. Holm's method
+
# New vector y: logarithm of a number in x squared
# but in this case we get almost the same results
+
y = log(x*x)
pa.adjusted2 = p.adjust(pa, method ="holm")
+
y
</pre>
+
# prints [1] 0.000000 1.386294 2.197225 2.772589 3.218876 3.583519 3.891820 4.158883
* Other frequently used correction is [https://en.wikipedia.org/wiki/False_discovery_rate false discovery rate (FDR)], which is less strict and controls the overall proportion of false positives among results
+
# [9] 4.394449 4.605170
=HW09=
 
[[#L09]]
 
* Do either tasks A,B,C (beginners) or B,C,D (more advanced). If you really want, you can do all four for bonus credit.
 
* In your protocol write used R commands with brief comments on your approach.
 
* Submit required plots with filenames as specified.
 
* For each task also include results as required and a short discussion commenting the results/plots you have obtained. Is the value of interest increasing or decreasing with some parameter? Are the results as expected or surprising?
 
* Outline of protocol is in /tasks/hw09/protocol.txt
 
  
==Task A: sign test==
+
# Draw the graph of function log(x*x) for x=1..10
 +
plot(x,y)
 +
# The same graph but use lines instead of dots
 +
plot(x,y,type="l")
  
* Consider a situation in which players played ''n'' games, out of which a fraction of ''q'' were won by ''A'' (example in lecture corresponds to ''q=0.6'' and ''n=10'')
+
# Addressing elements of a vector: positions start at 1
* Compute a table of p-values for ''n=10,20,...,90,100'' and for ''q=0.6, 0.7, 0.8, 0.9''
+
# Second element of the vector
* Plot the table using matplot (''n'' is x-axis, one line for each value of ''q'')
+
y[2]
* '''Submit''' the plot in <tt>sign.png</tt>
+
# prints [1] 1.386294
* '''Discuss''' the values you have seen in the plot / table
 
  
Outline of the code:
+
# Which elements of the vector satisfy certain condition?
<pre>
+
# (vector of logical values)
# create vector rows with values 10,20,...,100
+
y>3
rows=(1:10)*10
+
# prints [1] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
# create vector columns with required values of q
 
columns=c(0.6, 0.7, 0.8, 0.9)
 
# create empty matrix of pvalues
 
pvalues = matrix(0,length(rows),length(columns))
 
# TODO: fill in matrix pvalues using binom.test
 
  
# set names of rows and columns
+
# write only those elements from y that satisfy the condition
rownames(pvalues)=rows
+
y[y>3]
colnames(pvalues)=columns
+
# prints [1] 3.218876 3.583519 3.891820 4.158883 4.394449 4.605170
# careful: pvalues[10,] is now 10th row, i.e. value for n=100,  
+
 
#          pvalues["10",] is the first row, i.e. value for n=10
+
# we can also write values of x such that values of y satisfy the condition...
 +
x[y>3]
 +
# prints [1]  5  6  7  8  9 10
 +
</syntaxhighlight>
 +
 
 +
Alternative plotting facilities: [http://ggplot2.org/ ggplot2 library], [https://cran.r-project.org/web/packages/lattice/index.html lattice library]
 +
 
 +
====Task A====
 +
Create a plot of the '''binary logarithm''' with dots in the graph more densely spaced (from 0.1 to 10 with step 0.1)
 +
<!-- NOTEX -->
 +
* Store it in file <tt>log.png</tt> and '''submit''' this file
 +
<!-- /NOTEX -->
 +
 
 +
Hints:
 +
* Create <tt>x</tt> and <tt>y</tt> by vector arithmetic
 +
* To compute binary logarithm check help <tt>? log</tt>
 +
* Before running plot, use command <tt>png("log.png")</tt> to store the result, afterwards call <tt>dev.off()</tt> to close the file (in Rstudio you can also export plots manually)
  
# check that for n=10 and q=0.6 you get p-value 0.3769531
+
===Data frames and simple statistics===
pvalues["10","0.6"]
+
Data frame: a table similar to a spreadsheet. Each column is a vector, all are of the same length.
  
# create x-axis matrix (as in HW08, part D)
+
We will use a table with the following columns:
x=matrix(rep(rows,length(columns)),nrow=length(rows))
+
* Country name
# matplot command
+
* Region (continent)
png("sign.png")
+
* Area in thousands of km2
matplot(x,pvalues,type="l",col=c(1:length(columns)),lty=1)
+
* Population in millions in 2019
legend("topright",legend=columns,col=c(1:length(columns)),lty=1)
+
(source of data UN)
dev.off()
 
</pre>
 
  
==Task B: Welch's t-test on microarray data==
+
The table is stored in the csv format (columns separated by commas).
  
* Read table with microarray data, transform it to log scale, then work with table ''a'':
 
 
<pre>
 
<pre>
input=read.table("/tasks/hw09/acids.tsv", header=TRUE, row.names=1)
+
Afghanistan,Asia,652.864,38.0418
a = log(input)
+
Albania,Europe,28.748,2.8809
 +
Algeria,Africa,2381.741,43.0531
 +
American Samoa,Oceania,0.199,0.0553
 +
Andorra,Europe,0.468,0.0771
 +
Angola,Africa,1246.7,31.8253
 
</pre>
 
</pre>
* Columns 1,2,3 are reference, columns 4,5,6 acetic acid, 7,8,9 benzoate, 10,11,12 propionate, and 13,14,15 sorbate
 
* Write a function <tt>my.test</tt> which will take as arguments table ''a'' and 2 lists of columns (e.g. 1:3 and 4:6) and will run for each row of the table Welch's t-test of the first set of columns vs the second set. It will return the resulting vector of p-values
 
* For example by calling <tt>pa <- my.test(a, 1:3, 4:6)</tt> we will compute p-values for differences between reference and acetic acid (computation may take some time)
 
* The first 5 values of pa should be
 
<pre>
 
> pa[1:5]
 
[1] 0.94898907 0.07179619 0.24797684 0.48204100 0.23177496
 
</pre>
 
* Run the test for all four acids
 
* '''Report''' how many genes were significant with p-value <= 0.01 for each acid
 
** See [[#HW08#Vector_arithmetics|Vector arithmetics in HW08]]
 
** You can count TRUE items in a vector of booleans by sum, e.g. <tt>sum(TRUE,FALSE,TRUE)</tt> is 2
 
* '''Report''' how many genes are significant for both acetic and benzoate acids? (logical and is written as <tt>&</tt>)
 
  
==Task C: multiple testing correction==
+
<syntaxhighlight lang="r">
 +
# reading a data frame from a file
 +
a = read.csv("/tasks/r1/countries.csv",header = TRUE)
  
Run the following snippet of code, which works on the vector of p-values <tt>pa</tt> obtained for acetate in task B
+
# display mean, median, etc. of each column
<pre>
+
summary(a);
# adjusts vectors of p-vales from tasks B for using Bonferroni correction
+
# Compactly display structure of a
pa.adjusted = p.adjust(pa, method ="bonferroni")
+
# (good for checking that import worked etc)
 +
str(a)
 +
 
 +
# print the column with the name "Area"
 +
a$Area
 +
 
 +
# population density: divide the population by the area
 +
a$Population / a$Area
 +
 
 +
# Add density as a new column to frame a
 +
a = cbind(a, Density = a$Population / a$Area)
 +
 
 +
# Scatter plot of area vs population
 +
plot(a$Area, a$Population)
 +
 
 +
# we will see smaller values better in log-scale (both axes)
 +
plot(a$Area, a$Population, log='xy')
 +
 
 +
# use linear scale, but zoom in on smaller countries:
 +
plot(a$Area, a$Population, xlim=c(0,1500), ylim=c(0,150))
 +
 
 +
# average country population 33.00224 million
 +
mean(a$Population)
 +
# median country population 5.3805 million
 +
median(a$Population)
 +
 
 +
# median country population in Europe
 +
median(a$Population[a$Region=="Europe"])
 +
# Standard deviation
 +
sd(a$Population)
 +
 
 +
# Histogram of country populations in Europe
 +
hist(a$Population[a$Region=="Europe"])
 +
</syntaxhighlight>
 +
 
 +
===Task B===
 +
Create frame <tt>europe</tt> which contains data for European countries selected from frame <tt>a</tt>. Also create a similar frame for African countries. Hint:
 +
* To select the first three rows of a frame: <tt>a[c(1,2,3),]</tt>.
 +
* Here we want to select rows based on values not position (see computation of median country size in Europe above)
 +
 
 +
Run the command <tt>summary</tt> separately for each new frame. Comment on how their characteristics differ.
 +
<!-- NOTEX -->
 +
'''Write''' output and your conclusion to the protocol.
 +
<!-- /NOTEX -->
 +
 
 +
===Task C===
 +
Draw a graph comparing the area vs population in Europe and Africa; use different colors for points representing European and African countries. Apply log scale on both axes.
 +
<!-- NOTEX -->
 +
* '''Submit''' the plot in file <tt>countries.png</tt>
 +
<!-- /NOTEX -->
 +
To draw the graph, you can use one of the options below, or find yet another way.
 +
 
 +
Option 1: first draw Europe with one color, then add Africa in another color
 +
* Color of points can be changed by as follows: <tt>plot(1:10,1:10, col="red")</tt>
 +
* After the <tt>plot</tt> command, you can add more points to the same graph by command <tt>points</tt>, which can be used similarly as <tt>plot</tt>
 +
* Warning: command <tt>points</tt> does not change the ranges of x and y axes. You have to set these manually so that points from both groups are visible. You can do this using options <tt>xlim</tt> and <tt>ylim</tt>, e.g. <tt>plot(x,y, col="red", xlim=c(0.1,100), ylim=c(0.1,100))</tt>
 +
 
 +
Option 2: plot both Europe and Africa in one <tt>plot</tt> command, and give it a vector of colors, one for each point. Command <tt>plot(1:10,1:10,col=c(rep("red",5),rep("blue",5)))</tt> will plot the first 5 points red and the last 5 points blue
 +
 
 +
'''Bonus task:''' add a legend to the plot, showing which color is Europe and which is Africa.
 +
 
 +
===Expression data===
 +
 
 +
The dataset was described in the lecture.
 +
 
 +
<syntaxhighlight lang="r">
 +
# Read gene expression data table
 +
a = read.csv("/tasks/r1/microarray.csv", row.names=1)
 +
# Visual check of the first row
 +
a[1,]
 +
# Plot control replicate 1 vs. acetate acid replicate 1
 +
plot(a$control1, a$acetate1)
 +
# Plot control replicate 1 vs. control replicate 2
 +
plot(a$control1, a$control2)
 +
# To show density in dense clouds of points, use this plot
 +
smoothScatter(a$control1, a$acetate1)
 +
</syntaxhighlight>
 +
 
 +
===Task D===
 +
 
 +
In the plots above we compare two experiments, say A=control1 and B=acetate1. Outliers away from the diagonal in the plot are the genes whose expression changes. However distance from the diagonal is hard to judge visually, instead we will create MA plot:
 +
* As above, each gene is one dot in the plot (use <tt>plot</tt> rather than <tt>smoothScatter</tt>).
 +
* The x-axis is the average between values for conditions A and B. The points on the right have overall higher expression than points on the left.
 +
* The y-axis is the difference between condition A and B. The values in frame <tt>a</tt> are in log-scale base 2, so the difference of 1 means 2-fold change in expression.
 +
* The points far from the line y=0 have the highest change in expression. Use R functions <tt>min</tt>, <tt>max</tt>, <tt>which.min</tt> and <tt>which.max</tt> to find the largest positive and negative difference from line y=0 and which genes they correspond to. Functions <tt>min</tt> and <tt>max</tt> give you the minimum and maximum of a given vector. Functions <tt>which.min</tt> and <tt>which.max</tt> return the index where this extreme value is located. You can use this index to get the appropriate row of the dataframe <tt>a</tt>, including the gene name.
 +
<!-- NOTEX -->
 +
* '''Submit''' file <tt>ma.png</tt> with your plot. Include the genes with the extreme changes in your protocol.
 +
<!-- /NOTEX -->
 +
 
 +
===Clustering applied to expression data===
 +
Clustering is a wide group of methods that split data points into groups with similar properties. We will group together genes that have a similar reaction to acids, i.e. their rows in gene expression data matrix have similar values. We will consider two simple clustering methods
 +
* '''K means''' clustering splits points (genes) into ''k'' clusters, where ''k'' is a parameter given by the user. It finds a center of each cluster and tries to minimize the sum of distances from individual points to the center of their cluster. Note that this algorithm is randomized so you will get different clusters each time.
 +
 
 +
[[Image:HW08-heatmap.png|thumb|400px|right|Examples of heatmaps]]
 +
* '''Hierarchical clustering''' puts all data points (genes) to a hierarchy so that smallest subtrees of the hierarchy are the most closely related groups of points and these are connected to bigger and more loosely related groups.
 +
 
 +
 
 +
<syntaxhighlight lang="r">
 +
# Create a new version of frame a in which row is scaled so that
 +
# it has mean 0 and standard deviation 1
 +
# Function scale does such transformation on columns instead of rows,
 +
# so we transpose the frame using function t, then transpose it back
 +
b = t(scale(t(a)))
 +
# Matrix b shows relative movements of each gene,
 +
# disregarding its overall high or low expression
 +
 
 +
# Command heatmap creates hierarchical clustering of rows,
 +
# then shows values using color ranging from red (lowest) to white (highest)
 +
heatmap(as.matrix(a), Colv=NA, scale="none")
 +
heatmap(as.matrix(b), Colv=NA, scale="none")
 +
# compare the two matrices - which phenomena influenced clusters in each of them?
 +
</syntaxhighlight>
 +
 
 +
<syntaxhighlight lang="r">
 +
# k means clustering to 5 clusters
 +
k = 5
 +
cl = kmeans(b, k)
 +
# Each gene is assigned a cluster (number between 1 and k)
 +
# the command below displays the first 10 elements, i.e. clusters of first 10 genes
 +
head(cl$cluster)
 +
# Draw heatmap of cluster number 3 out of k, no further clustering applied
 +
# Do you see any common pattern to genes in the cluster?
 +
heatmap(as.matrix(b[cl$cluster==3,]), Rowv=NA, Colv=NA, scale="none")
 +
 
 +
# Reorder genes in the whole table according to their cluster cluster number
 +
# Can you spot our k clusters?
 +
heatmap(as.matrix(b[order(cl$cluster),]), Rowv=NA, Colv=NA, scale="none")
 +
 
 +
# Compare overall column means with column means in cluster 3
 +
# Function apply runs mean on every column (or row if 2 changed to 1)
 +
apply(b, 2, mean)
 +
# Now means within cluster 3
 +
apply(b[cl$cluster==3,],2,mean)
 +
 
 +
# Clusters have centers which are also computed as means
 +
# so this is the same as the previous command
 +
cl$centers[3,]
 +
</syntaxhighlight>
 +
 
 +
===Task E===
 +
[[Image:HW08-clusters.png|thumb|200px|right|Example of a required plot (but for k=3, not k=5)]]
 +
Draw a plot in which the x-axis corresponds to experiments, the y-axis is the expression level and the center of each cluster is shown as a line (use k-means clustering on the scaled frame <tt>b</tt>, computed as shown above)
 +
* Use command <tt>matplot(x, y, type="l", lwd=2)</tt> which gets two matrices <tt>x</tt> and <tt>y</tt> of the same size and plots each column of matrices  <tt>x</tt> and <tt>y</tt> as one line (setting <tt>lwd=2</tt> makes lines thicker)
 +
* In this case we omit matrix <tt>x</tt>, the command will use numbers 1,2,3... as columns of the missing matrix
 +
* Create <tt>y</tt> from <tt>cl$centers</tt> by applying function <tt>t</tt> (transpose)
 +
<!-- NOTEX -->
 +
* '''Submit''' file <tt>clusters.png</tt> with your final plot
 +
<!-- /NOTEX -->
 +
=Lr2=
 +
<!-- NOTEX -->
 +
[[#HWr2]]
 +
<!-- /NOTEX -->
 +
 
 +
The topic of this lecture are statistical tests in R.
 +
* Beginners in statistics: listen to lecture, then do tasks A, B, C
 +
* If you know basics of statistical tests, do tasks B, C, D
 +
* More information on this topic in the [https://sluzby.fmph.uniba.sk/infolist/sk/1-EFM-340_13.html 1-EFM-340 Computer Statistics] course
 +
 
 +
 
 +
==Introduction to statistical tests: sign test==
 +
* Two friends ''A'' and ''B'' have played their favorite game ''n''=10 times, ''A'' has won 6 times and ''B'' has won 4 times.
 +
* ''A'' claims that he is a better player, ''B'' claims that such a result could easily happen by chance if they were equally good players.
 +
* Hypothesis of player ''B'' is called the ''null hypothesis'' that the pattern we see (''A'' won more often than ''B'') is simply a result of chance
 +
* The null hypothesis reformulated: we toss coin ''n'' times and compute value ''X'': the number of times we see head. The tosses are independent and each toss has equal probability of being head or tail
 +
* Similar situation: comparing programs ''A'' and ''B'' on several inputs, counting how many times is program ''A'' better than ''B''.
 +
<syntaxhighlight lang="r">
 +
# simulation in R: generate 10 psedorandom bits
 +
# (1=player A won)
 +
sample(c(0,1), 10, replace = TRUE)
 +
# result e.g. 0 0 0 0 1 0 1 1 0 0
 +
 
 +
# directly compute random variable X, i.e. the sum of bits
 +
sum(sample(c(0,1), 10, replace = TRUE))
 +
# result e.g. 5
 +
 
 +
# we define a function which will m times repeat
 +
# the coin tossing experiment with n tosses
 +
# and returns a vector with m values of random variable X
 +
experiment <- function(m, n) {
 +
  x = rep(0, m)    # create vector with m zeroes
 +
  for(i in 1:m) {  # for loop through m experiments
 +
    x[i] = sum(sample(c(0,1), n, replace = TRUE))
 +
  }
 +
  return(x)        # return array of values   
 +
}
 +
# call the function for m=20 experiments, each with n tosses
 +
experiment(20,10)
 +
# result e.g.  4 5 3 6 2 3 5 5 3 4 5 5 6 6 6 5 6 6 6 4
 +
# draw histograms for 20 experiments and 1000 experiments
 +
png("hist10.png")  # open png file
 +
par(mfrow=c(2,1))  # matrix of plots with 2 rows and 1 column
 +
hist(experiment(20,10))
 +
hist(experiment(1000,10))
 +
dev.off() # finish writing to file
 +
</syntaxhighlight>
 +
* It is easy to realize that we get [https://en.wikipedia.org/wiki/Binomial_distribution binomial distribution] (binomické rozdelenie)
 +
* The probability of getting ''k'' ones out of ''n'' coin tosses is  <math>\Pr(X=k) = {n \choose k} 2^{-n}</math>
 +
* The ''p-value'' of the test is the probability that simply by chance we would get ''k'' the same or more extreme than in our data.
 +
* In other words, what is the probability that in ''n=10'' tosses we see head 6 times or more (one sided test)
 +
* P-value for ''k'' ones out of ''n'' coin tosses <math>\sum_{j=k}^n {n \choose k} 2^{-n}</math>
 +
* If the p-value is very small, say smaller than 0.01, we reject the null hypothesis and assume that player ''A'' is in fact better than ''B''
 +
 
 +
<syntaxhighlight lang="r">
 +
# computing the probability that we get exactly 6 heads in 10 tosses
 +
dbinom(6, 10, 0.5) # result 0.2050781
 +
# we get the same as our formula above:
 +
7*8*9*10/(2*3*4*(2^10)) # result 0.2050781
 +
 
 +
# entire probability distribution: probabilities 0..10 heads in 10 tosses
 +
dbinom(0:10, 10, 0.5)
 +
# [1] 0.0009765625 0.0097656250 0.0439453125 0.1171875000 0.2050781250
 +
# [6] 0.2460937500 0.2050781250 0.1171875000 0.0439453125 0.0097656250
 +
# [11] 0.0009765625
 +
 
 +
# we can also plot the distribution
 +
plot(0:10, dbinom(0:10, 10, 0.5))
 +
barplot(dbinom(0:10, 10, 0.5))
 +
 
 +
# our p-value is the sum for k=6,7,8,9,10
 +
sum(dbinom(6:10, 10, 0.5))
 +
# result: 0.3769531
 +
# so results this "extreme" are not rare by chance,
 +
# they happen in about 38% of cases
 +
 
 +
# R can compute the sum for us using pbinom
 +
# this considers all values greater than 5
 +
pbinom(5, 10, 0.5, lower.tail=FALSE)
 +
# result again 0.3769531
 +
 
 +
# if probability is too small, use log of it
 +
pbinom(9999, 10000, 0.5, lower.tail=FALSE, log.p = TRUE)
 +
# [1] -6931.472
 +
# the probability of getting 10000x head is exp(-6931.472) = 2^{-100000}
 +
 
 +
# generating numbers from binomial distribution
 +
# - similarly to our function experiment
 +
rbinom(20, 10, 0.5)
 +
# [1] 4 4 8 2 6 6 3 5 5 5 5 6 6 2 7 6 4 6 6 5
 +
 
 +
# running the test
 +
binom.test(6, 10, p = 0.5, alternative="greater")
 +
#
 +
#        Exact binomial test
 +
#
 +
# data:  6 and 10
 +
# number of successes = 6, number of trials = 10, p-value = 0.377
 +
# alternative hypothesis: true probability of success is greater than 0.5
 +
# 95 percent confidence interval:
 +
# 0.3035372 1.0000000
 +
# sample estimates:
 +
# probability of success
 +
#                  0.6
 +
 
 +
# to only get p-value, run
 +
binom.test(6, 10, p = 0.5, alternative="greater")$p.value
 +
# result 0.3769531
 +
</syntaxhighlight>
 +
 
 +
==Comparing two sets of values: Welch's t-test==
 +
* Let us now consider two sets of values drawn from two [https://en.wikipedia.org/wiki/Normal_distribution normal distributions] with unknown means and variances
 +
* The null hypothesis of the [https://en.wikipedia.org/wiki/Welch%27s_t-test Welch's t-test] is that the two distributions have equal means
 +
* The test computes test statistics (in R for vectors x1, x2):
 +
** <tt>(mean(x1)-mean(x2))/sqrt(var(x1)/length(x1)+var(x2)/length(x2))</tt>
 +
* If the null hypothesis holds, i.e. x1 and x2 were sampled from distributions with equal means, this test statistics is approximately distributed according to the [https://mathworld.wolfram.com/Studentst-Distribution.html Student's t-distribution] with the degree of freedom obtained by
 +
<syntaxhighlight lang="r">
 +
n1=length(x1)
 +
n2=length(x2)
 +
(var(x1)/n1+var(x2)/n2)**2/(var(x1)**2/((n1-1)*n1*n1)+var(x2)**2/((n2-1)*n2*n2))
 +
</syntaxhighlight>
 +
* Luckily R will compute the test for us simply by calling <tt>t.test</tt>
 +
<syntaxhighlight lang="r">
 +
# generate x1: 6 values from normal distribution with mean 2 and standard deviation 1
 +
x1 = rnorm(6, 2, 1)
 +
# for example 2.70110750  3.45304366 -0.02696629  2.86020145  2.37496993  2.27073550
 +
 
 +
# generate x2: 4 values from normal distribution with mean 3 and standard deviation 0.5
 +
x2 = rnorm(4, 3, 0.5)
 +
# for example 3.258643 3.731206 2.868478 2.239788
 +
t.test(x1, x2)
 +
# t = -1.2898, df = 7.774, p-value = 0.2341
 +
# alternative hypothesis: true difference in means is not equal to 0
 +
# means 2.272182  3.024529
 +
# this time the test was not significant
 +
 
 +
# regenerate x2 from a distribution with a much more different mean
 +
x2 = rnorm(4, 5, 0.5)
 +
# 4.882395 4.423485 4.646700 4.515626
 +
t.test(x1, x2)
 +
# t = -4.684, df = 5.405, p-value = 0.004435
 +
# means 2.272182  4.617051
 +
# this time much more significant p-value
 +
 
 +
# to get only p-value, run
 +
t.test(x1,x2)$p.value
 +
</syntaxhighlight>
 +
 
 +
We will apply Welch's t-test to microarray data
 +
* Data from the same paper as in [[#Lr1#Gene_expression_data|the previous lecture]], i.e. Abbott et al 2007 [http://femsyr.oxfordjournals.org/content/7/6/819.abstract Generic and specific transcriptional responses to different weak organic acids in anaerobic chemostat cultures of ''Saccharomyces cerevisiae'']
 +
* Recall: Gene expression measurements under 5 conditions:
 +
** Control: yeast grown in a normal environment
 +
** 4 different acids added so that cells grow 50% slower (acetic, propionic, sorbic, benzoic)
 +
* From each condition (control and each acid) we have 3 replicates
 +
* Together our table has 15 columns (3 replicates from 5 conditions) and 6398 rows (genes). Last time we have used only a subset of rows
 +
* We will test statistical difference between the control condition and one of the acids (3 numbers vs other 3 numbers)
 +
* See Task B in [[#HWr2|the exercises]]
 +
 
 +
==Multiple testing correction==
 +
 
 +
* When we run t-tests on the control vs. benzoate on all 6398 genes, we get 435 genes with p-value at most 0.01/
 +
* Purely by chance this would happen in 1% of cases (from the definition of the p-value).
 +
* So purely by chance we would expect to get about 64 genes with p-value at most 0.01.
 +
* So roughly 15% of our detected genes (maybe less, maybe more) are false positives which happened purely by chance.
 +
* Sometimes false positives may even overwhelm the results.
 +
* Multiple testing correction tries to limit the number of false positives among the results of multiple statistical tests, there are many different methods
 +
* The simplest one is [https://www.stat.berkeley.edu/~mgoldman/Section0402.pdf Bonferroni correction], where the threshold on the p-value is divided by the number of tested genes, so instead of 0.01 we use threshold 0.01/6398 =  1.56e-6
 +
* This way the expected overall number of false positives in the whole set is 0.01 and so the probability of getting even a single false positive is also at most 0.01 (by Markov inequality)
 +
* We could instead multiply all p-values by the number of tests and apply the original threshold 0.01 - such artificially modified p-values are called corrected
 +
* After Bonferroni correction we get only one significant gene
 +
<syntaxhighlight lang="r">
 +
# the results of t-tests are in vector pb of length 6398
 +
# manually multiply p-values by length(pb), count those that have value <= 0.01
 +
sum(pb * length(pb) <= 0.01)
 +
# in R you can use p.adjust for multiple testing correction
 +
pb.adjusted = p.adjust(pa, method ="bonferroni")
 +
# this is equivalent to multiplying by the length and using 1 if the result > 1
 +
pb.adjusted = pmin(pa*length(pa),rep(1,length(pa)))
 +
 
 +
# there are less conservative multiple testing correction methods,
 +
# e.g. Holm's method, but in this case we get almost the same results
 +
pa.adjusted2 = p.adjust(pa, method ="holm")
 +
</syntaxhighlight>
 +
Another frequently used correction is false discovery rate (FDR), which is less strict and controls the overall proportion of false positives among results.
 +
==HWr2==
 +
<!-- NOTEX -->
 +
See also the [[#Lr2|current]] and the [[#Lr1|the previous]] lecture.
 +
 
 +
* Do either tasks A,B,C (beginners) or B,C,D (more advanced). You can also do all four for bonus credit.
 +
* In your protocol write used R commands with brief comments on your approach.
 +
* Submit required plots with filenames as specified.
 +
* For each task also include results as required and a short discussion commenting the results/plots you have obtained. Is the value of interest increasing or decreasing with some parameter? Are the results as expected or surprising?
 +
* Outline of protocol is in <tt>/tasks/r2/protocol.txt</tt>
 +
<!-- /NOTEX -->
 +
 
 +
===Task A: sign test===
 +
 
 +
* Consider a situation in which players played ''n'' games, out of which a fraction of ''q'' were won by ''A'' (the example in the lecture corresponds to ''q=0.6'' and ''n=10'')
 +
* Compute a table of p-values for ''n=10,20,...,90,100'' and for ''q=0.6, 0.7, 0.8, 0.9''
 +
* Plot the table using <tt>matplot</tt> (''n'' is x-axis, one line for each value of ''q'')
 +
<!-- NOTEX -->
 +
* '''Submit''' the plot in <tt>sign.png</tt>
 +
* '''Discuss''' the values you have seen in the plot / table
 +
<!-- /NOTEX -->
 +
 
 +
Outline of the code:
 +
<syntaxhighlight lang="r">
 +
# create vector rows with values 10,20,...,100
 +
rows=(1:10)*10
 +
# create vector columns with required values of q
 +
columns=c(0.6, 0.7, 0.8, 0.9)
 +
# create empty matrix of pvalues
 +
pvalues = matrix(0,length(rows),length(columns))
 +
# TODO: fill in matrix pvalues using binom.test
 +
 
 +
# set names of rows and columns
 +
rownames(pvalues)=rows
 +
colnames(pvalues)=columns
 +
# careful: pvalues[10,] is now 10th row, i.e. value for n=100,
 +
#          pvalues["10",] is the first row, i.e. value for n=10
 +
 
 +
# check that for n=10 and q=0.6 you get p-value 0.3769531
 +
pvalues["10","0.6"]
 +
 
 +
# create x-axis matrix (as in previous exercises, part D)
 +
x=matrix(rep(rows,length(columns)),nrow=length(rows))
 +
# matplot command
 +
png("sign.png")
 +
matplot(x,pvalues,type="l",col=c(1:length(columns)),lty=1)
 +
legend("topright",legend=columns,col=c(1:length(columns)),lty=1)
 +
dev.off()
 +
</syntaxhighlight>
 +
 
 +
===Task B: Welch's t-test on microarray data===
 +
 
 +
Read the microarray data, and preprocess them (last time we worked with preprocessed data). We first transform it to log scale  and then shift and scale values in each column so that median is 0 and sum of squares of values is 1. This makes values more comparable between experiments; in practice more elaborate normalization is often performed. In the rest, work with table ''a'' containing preprocessed data.
 +
<syntaxhighlight lang="r">
 +
# read the input file
 +
input = read.table("/tasks/r2/acids.tsv", header=TRUE, row.names=1)
 +
# take logarithm of all the values in the table
 +
input = log2(input)
 +
# compute median of each column
 +
med = apply(input, 2, median)
 +
# shift and scale values
 +
a = scale(input, center=med)
 +
</syntaxhighlight>
 +
Columns 1,2,3 are control, columns 4,5,6 acetic acid, 7,8,9 benzoate, 10,11,12 propionate, and 13,14,15 sorbate
 +
 
 +
Write a function <tt>my.test</tt> which will take as arguments table ''a'' and 2 lists of columns (e.g. 1:3 and 4:6) and will run for each row of the table Welch's t-test of the first set of columns versus the second set. It will return the resulting vector of p-values, one for each gene.
 +
* For example by calling <tt>pb <- my.test(a, 1:3, 7:9)</tt> we will compute p-values for differences between control and benzoate (computation may take some time)
 +
* The first 5 values of <tt>pb</tt> should be
 +
<pre>
 +
> pb[1:5]
 +
[1] 0.02358974 0.05503082 0.15354833 0.68060345 0.04637482
 +
</pre>
 +
* Run the test for all four acids
 +
* '''Report''' how many genes were significant with p-value at most 0.01 for each acid
 +
** See [[#HWr1#Vector_arithmetic|Vector arithmetic in HWr1]]
 +
** You can count <tt>TRUE</tt> items in a vector of booleans by <tt>sum</tt>, e.g. <tt>sum(TRUE,FALSE,TRUE)</tt> is 2
 +
* '''Report''' how many genes are significant for both acetic and benzoate acids simultaneously (logical and is written as <tt>&</tt>).
 +
 
 +
===Task C: multiple testing correction===
 +
 
 +
Run the following snippet of code, which works on the vector of p-values <tt>pb</tt> obtained for benzoate in task B
 +
<syntaxhighlight lang="r">
 +
# adjusts vectors of p-vales from tasks B for using Bonferroni correction
 +
pb.adjusted = p.adjust(pb, method ="bonferroni")
 
# add this adjusted vector to frame a
 
# add this adjusted vector to frame a
a <-  cbind(a, pa.adjusted)
+
a <-  cbind(a, pb.adjusted)
# create permutation ordered by pa.adjusted
+
# create permutation ordered by pb.adjusted
oa = order(pa.adjusted)
+
ob = order(pb.adjusted)
# select from table five rows with the lowest pa.adjusted (using vector oa)
+
# select from table five rows with the lowest pb.adjusted (using vector ob)
# and display columns containing reference, acetate and adjusted p-value
+
# and display columns containing control, acetate and adjusted p-value
a[oa[1:5],c(1:6,16)]
+
a[ob[1:5],c(1:3,7:9,16,17)]
</pre>
+
</syntaxhighlight>
 
+
 
You should get output like this:
+
You should get an output like this:
<pre>
+
<pre>
            ref1    ref2    ref3 acetate1  acetate2 acetate3 pa.adjusted
+
      control1 control2  control3  benzoate1  benzoate2 benzoate3
SUL1    7.581312 7.394985 7.412040 2.1633230 2.05412373 1.9169226 0.004793318
+
PTC4 0.5391444 0.5793445 0.5597744  0.2543546  0.2539317  0.2202997
YMR244W 2.985682 2.975530 3.054001 0.3364722 0.33647224 0.1823216 0.188582576
+
GDH3 0.2480624 0.2373752 0.1911501 -0.3697303 -0.2982495 -0.3616723
DIP5    6.943991 7.147795 7.296955 0.6931472 0.09531018 0.5306283 0.253995075
+
AGA2 0.6735964 0.7860222 0.7222314  1.4807101  1.4885581  1.3976753
YLR460C 5.620401 5.801212 5.502482 3.2425924 3.48431229 3.3843903 0.307639012
+
CWP2 1.4723713 1.4582596 1.3802390  2.3759288  2.2504247  2.2710695
HXT4    2.821379 3.049273 2.772589 7.7893717 8.24446541 8.3041980 0.573813502
+
LSP1 0.7668296 0.8336119 0.7643181  1.3295121  1.2744859  1.2986457
</pre>
+
              pb pb.adjusted
 
+
PTC4 4.054985e-05  0.2594379
Do the same procedure for benzoate p-values and '''report''' the result (in your table, report both p-values and expression levels for bezoate, not acetate). '''Comment''' the results for both acids.
+
GDH3 5.967727e-05  0.3818152
 
+
AGA2 8.244790e-05  0.5275016
==Task D: volcano plot, test on data generated from null hypothesis==
+
CWP2 1.041416e-04  0.6662979
Draw a [https://en.wikipedia.org/wiki/Volcano_plot_(statistics) volcano plot] for the acetate data
+
LSP1 1.095217e-04  0.7007201
* x-axis of this plot is the difference in the mean of reference and mean of acetate.  
+
</pre>
** You can compute row means of a matrix by rowMeans.
+
 
* y-axis is -log10 of the p-value (use original p-values before multiple testing correction)
+
Do the same procedure for acetate p-values and '''report''' the result (in your table, report both p-values and expression levels for acetate, not bezoate). '''Comment''' on the results for both acids.
* You can quickly see genes which have low p-values (high on y-axis) and also big difference in mean expression between the two conditions (far from 0 on x-axis). You can also see if acetate increases or decreases expression of these genes.
+
 
 
+
===Task D: volcano plot, test on data generated from null hypothesis===
Now create a simulated dataset sharing some features of the real data but observing the null hypothesis that the mean of reference and acetate are the same for each gene
+
 
* Compute vector ''m'' of means for columns 1:6 from matrix ''a''
+
Draw a [https://en.wikipedia.org/wiki/Volcano_plot_(statistics) volcano plot] for the acetate data
* Compute vectors ''sr'' and ''sa'' of standard deviations for reference columns and for acetate columns respectively
+
* The x-axis of this plot is the difference between the mean of control and the mean of acetate. You can compute row means of a matrix by <tt>rowMeans</tt>.
** You can compute standard deviation for each row of a matrix by <tt>apply(some.matrix, 1, sd)</tt>
+
* The y-axis is -log10 of the p-value (use the p-values before multiple testing correction)
* For each i in 1:6398, create three samples from normal distribution with mean ''m[i]'' and standard deviation ''sr[i]'', and three samples with mean m[i] and deviation sa[i]
+
* You can quickly see the genes that have low p-values (high on y-axis) and also big difference in the mean expression between the two conditions (far from 0 on x-axis). You can also see if acetate increases or decreases the expression of these genes.
** Use function <tt>rnorm</tt>
+
 
* On the resulting matrix apply Welch's t-test and draw the volcano plot.  
+
Now create a simulated dataset sharing some features of the real data but observing the null hypothesis that the mean of control and acetate are the same for each gene
* How many random genes have p-value <=0.01? Is it roughly what we would expect under the null hypothesis?
+
* Compute vector ''m'' of means for columns 1:6 from matrix ''a''
 
+
* Compute vectors ''sc'' and ''sa'' of standard deviations for control columns and for acetate columns respectively. You can compute standard deviation for each row of a matrix by <tt>apply(some.matrix, 1, sd)</tt>
Draw histogram of p-values from the real data (reference vs acetate) and from random data. '''Describe''' how they differ. Is it what you would expect?
+
* For each i in 1:6398, create three samples from the normal distribution with mean <tt>m[i]</tt> and standard deviation <tt>sc[i]</tt> and three samples with mean <tt>m[i]</tt> and deviation <tt>sa[i]</tt> (use the <tt>rnorm</tt> function)
* use function <tt>hist</tt>
+
* On the resulting matrix apply Welch's t-test and draw the volcano plot.  
 
+
* How many random genes have p-value at most 0.01? Is it roughly what we would expect under the null hypothesis?
'''Submit''' plots <tt>volcano-real.png</tt>, <tt>volcano-random.png</tt>, <tt>hist-real.png</tt>, <tt>hist-random.png</tt>
+
 
(real for real expression data and random for generated data)
+
Draw a histogram of p-values from the real data (control vs acetate) and from the random data (use function <tt>hist</tt>). '''Describe''' how they differ. Is it what you would expect?
=L10=
+
 
Today we will work with AWS.
+
<!-- NOTEX -->  
Please use credentials which were sent to you via email and follows steps in here (there is a cursor in each screen):
+
'''Submit''' plots <tt>volcano-real.png</tt>, <tt>volcano-random.png</tt>, <tt>hist-real.png</tt>, <tt>hist-random.png</tt>
https://docs.google.com/presentation/d/1GBDErp5xhrV2zLF5kKdwnOAjtmDEFN0pw3RFval419s/edit#slide=id.p
+
(real for real expression data and random for generated data)
 +
<!-- /NOTEX -->
 +
=Lcloud=
 +
Today we will work with [https://aws.amazon.com/ Amazon Web Services] (AWS), which is a cloud computing platform. It allows highly parallel computation on large datasets. We will use an educational account which gives you certain amount of resources for free.
 +
 
 +
 
 +
==Credentials==
 +
* First you need to create <tt>.aws/credentials</tt> file in your home folder with valid AWS credentials.
 +
<!-- NOTEX -->
 +
* Also run <tt>`aws configure`</tt>. Press enter for access key ID and secret access key and put in <tt>`us-east-1`</tt> for region. Press enter for output format.
 +
* Please use the credentials which were sent to you via email and follows steps in here (there is a cursor in each screen):
 +
https://docs.google.com/presentation/d/1GBDErp5xhrV2zLF5kKdwnOAjtmDEFN0pw3RFval419s/edit#slide=id.p
 +
* Sometimes these credentials expire. In that case repeat the same steps to get new ones.
 +
<!-- /NOTEX -->
 +
<!-- TEX
 +
* Instructions for doing so are given during the lecture.
 +
/TEX -->
 +
 
 +
==AWS command line==
 +
* We will access AWS using <tt>aws</tt> command installed on our server.
 +
* You can also install it on your own machine using  <tt>pip install awscli</tt>
 +
 
 +
==Input files and data storage==
 +
 
 +
Today we will use [https://aws.amazon.com/s3/ Amazon S3] cloud storage to store input files. Run the following two commands to check if you can see the "bucket" (data storage) associated with this lecture:
 +
 
 +
<syntaxhighlight lang="bash">
 +
# the following command should give you a big list of files
 +
aws s3 ls s3://idzbucket2
 +
 
 +
# this command downloads one file from the bucket
 +
aws s3 cp s3://idzbucket2/splitaa splitaa
 +
 
 +
# the following command prints the file in your console
 +
# (no need to do this).
 +
aws s3 cp s3://idzbucket2/splitaa -
 +
</syntaxhighlight>
 +
 
 +
You should also create your own bucket (storage area). Pick your own name, must be globally unique:
 +
<syntaxhighlight lang="bash">
 +
aws s3 mb s3://mysuperawesomebucket
 +
</syntaxhighlight>
 +
 
 +
==MapReduce==
 +
 
 +
We will be using MapReduce in this session. It is kind of outdated concept, but simple enough for us and runs out of box on AWS.
 +
If you ever want to use BigData in practice, try something more modern like [https://beam.apache.org/ Apache Beam]. And avoid PySpark if you can.
 +
 
 +
For tutorial on MapReduce check out [https://pythonhosted.org/mrjob/guides/concepts.html#mapreduce-and-apache-hadoop PythonHosted.org] or [https://www.tutorialspoint.com/map_reduce/map_reduce_introduction.htm TutorialsPoint.com].
 +
 
 +
==Template==
 +
 
 +
You are given basic template with comments in <tt>/tasks/cloud/example_job.py</tt>
 +
 
 +
You can run it locally as follows:
 +
<syntaxhighlight lang="bash">
 +
python3 example_job.py <input file> -o <output_dir>
 +
</syntaxhighlight>
 +
 
 +
You can run it in the cloud on the whole dataset as follows:
 +
<syntaxhighlight lang="bash">
 +
python3 example_job.py -r emr --region us-east-1 s3://idzbucket2 \
 +
  --num-core-instances 4 -o s3://<your bucket>/<some directory>
 +
</syntaxhighlight>
 +
 
 +
For testing we recommend using a smaller sample as follows:
 +
<syntaxhighlight lang="bash">
 +
python3 example_job.py -r emr --region us-east-1 s3://idzbucket2/splita* \
 +
  --num-core-instances 4 -o  s3://<your bucket>/<some directory>
 +
</syntaxhighlight>
 +
 
 +
==Other useful commands==
 +
 
 +
You can download output as follows:
 +
<syntaxhighlight lang="bash">
 +
# list of files
 +
aws s3 ls s3://<your bucket>/<some directory>/
 +
# download
 +
aws s3 cp s3://<your bucket>/<some directory>/ . --recursive
 +
</syntaxhighlight>
 +
 
 +
If you want to watch progress:
 +
* Click on AWS Console button workbench (vocareum).
 +
* Set region (top right) to N. Virginia (us-east-1).
 +
* Click on services, then EMR.
 +
* Click on the job, which is running, then Steps, view logs, syslog.
 +
==HWcloud==
 +
<!-- NOTEX -->
 +
See also the [[#Lcloud|lecture]]
 +
 
 +
For both tasks, submit your source code and the result, when run on whole dataset (<tt>s3://idzbucket2</tt>).
 +
The code is expected to use the MRJob framework presented in the lecture. Submit directory is <tt>/submit/cloud/</tt>
 +
<!-- /NOTEX -->
 +
 
 +
===Task A===
 +
 
 +
Count the number of occurrences of each 4-mer in the provided data.
 +
 
 +
===Task B===
  
The goal is to have '''.aws/credentials'' file in your homefolder.
+
Count the number of pairs of reads which overlap in exactly 30 bases (end of one read overlaps beginning of the second read). You can ignore reverse complement.
Sometimes these credentials expire, follow same steps to refresh them (copy them again).
 
  
There is aws command line tools installed on vyuka machine (on your machine use '''pip install awscli''').
 
 
Now you need to do two things:
 
 
Check if you can download a file from our S3 datastore (think of S3 as some remote file storage).
 
 
'''aws s3 ls s3://idzbucket2''' - should give you a big list of files
 
'''aws s3 cp s3://idzbucket2/splitaa splitaa''' - should download a file to your machine
 
'''aws s3 cp s3://idzbucket2/splitaa -''' - will print file in your console (no need to do this).
 
 
Check if you can create a bucket (your storage area, pick your own name, must be globally unique):
 
'''aws s3 mb s3://mysuperawesomebucket'''
 
 
We will be using MapReduce in this session. (It is kind of outdated concept, but simple enough for us and runs out of box on AWS.
 
If you ever want to use BigData in practice, try something more modern like Apache Beam. And avoid PySpark if you can.)
 
 
For tutorial on mapreduce check out [https://pythonhosted.org/mrjob/guides/concepts.html#mapreduce-and-apache-hadoop] or [https://www.tutorialspoint.com/map_reduce/map_reduce_introduction.htm].
 
 
You are given basic template with comments at '''tasks/example_job.py''' and you can run it locally as '''python3 example_job.py <input file> -o <output_dir>''' or in cloud as '''python3 example_job.py -r emr s3://idzbucket2 --num-core-instances 4 -o s3://<your bucket>/<some directory>''' (but for testing we recommend smaller sample '''python3 example_job.py -r emr s3://idzbucket2/splita* --num-core-instances 4 -o  s3://<your bucket>/<some directory>''').
 
 
You can download output using '''aws s3 ls s3://<your bucket>/<some directory>/''' and  '''aws s3 cp s3://<your bucket>/<some directory>/ . --recursive'''
 
 
If you want to watch progress:
 
Click on AWS Console button workbench (vocareum).
 
Set region (top right) to Oregon.
 
Click on services, then EMR.
 
Click on the job, which is running, then Steps, view logs, syslog.
 
=HW10=
 
Task 1:
 
 
Count number of occurences of each 4-mer in provided data.
 
 
Task 2:
 
 
Count number of pairs of reads which overlaps in exactly 30 bases (end of one read overlaps beginning of second read).
 
You can ignore reverse complement.
 
 
Hints:  
 
Hints:  
 
 
* Try counting pairs for each 30-mer first.
 
* Try counting pairs for each 30-mer first.
 
* You can yield something structured from Mapper (e.g. tuple).
 
* You can yield something structured from Mapper (e.g. tuple).
* There is a two step mapreduce, which can help you with final summation: https://pythonhosted.org/mrjob/guides/writing-mrjobs.html
+
* There is a two-step MapReduce, which can help you with the final summation: https://pythonhosted.org/mrjob/guides/writing-mrjobs.html
 
 
 
 
For both tasks, submit source code and the result, when run on whole dataset (s3://idzbucket2).
 
Code is expected to use MRJob framework presented in lecture.
 

Latest revision as of 16:30, 17 February 2021

Website for 2019/20

2019-02-20 (TV) Introduction to Perl (basics, input processing) Lecture, Homework
2019-02-27 (TV) Command-line tools, Perl one-liners Lecture, Homework
2019-03-05 (BB) Job scheduling and make Lecture, Homework
2019-03-12 (BB) Python and SQL for beginners Lecture, Homework
2019-03-19 (VB) Python, web crawling, HTML parsing, sqlite3 Lecture INF, Homework INF
(BB) Bioinformatics 1 (genome assembly) Lecture BIN, Homework BIN
2019-03-26 (VB) Text data processing, flask Lecture INF, Homework INF
(BB) Bioinformatics 2 (gene finding, RNA-seq) Lecture BIN, Homework BIN
2019-04-02 (VB) Data visualization in JavaScript Lecture INF, Homework INF
(BB) Bioinformatics 3 (polymorphisms) Lecture BIN, Homework BIN
2019-04-09 Easter
2019-04-16 (BB) R, part 1 Lecture, Homework
2019-04-23 (BB) R, part 2 Lecture, Homework
2019-04-30 (VB) Cloud computing Lecture, Homework
2019-05-07 Reserve, work on projects
2019-05-14 Reserve, work on projects

Contents

Kontakt

Vyučujúci

Rozvrh

  • Štvrtok 15:40-18:00 M-217


Introduction

Target audience

This course is offered at the Faculty of Mathematics, Physics and Informatics, Comenius University in Bratislava for the students of the bachelor Data Science, Computer Science and Bioinformatics study programs and the students of the master Computer Science study program. It is a prerequisite of the master-level state exams in Bioinformatics and Machine Learning. However, the course is open to students from other study programs if they satisfy the following informal prerequisites.

We assume that the students are proficient in programming in at least one programming language and are not afraid to learn new languages. We also assume basic knowledge of work on the Linux command-line (at least basic commands for working with files and folders, such as cd, mkdir, cp, mv, rm, chmod). The basic use of command-line tools can be learned for example by using a tutorial by Ian Korf.

Although most technologies covered in this course can be used for processing data from many application areas, we will illustrate some of them on examples from bioinformatics. We will explain necessary terminology from biology as needed.


Course objectives

Computer science courses cover many interesting algorithms, models and methods that can used for data analysis. However, when you want to use these methods for real data, you will typically need to make considerable efforts to obtain the data, pre-process it into a suitable form, test and compare different methods or settings, and arrange the final results in informative tables and graphs. Often, these activities need to be repeated for different inputs, different settings, and so on. For example the main task for many bioinformaticians is data processing using existing tools, possibly supplemented by small custom scripts. This course will cover some programming languages and technologies suitable for these activities.

This course is particularly recommended for students whose bachelor or master theses involve substantial empirical experiments (e.g. experimental evaluation of your methods and comparison with other methods on real or simulated data).

Basic guidelines for working with data

As you know, in programming it is recommended to adhere to certain practices, such as good coding style, modular design, thorough testing etc. Such practices add a little extra work, but are much more efficient in the long run. Similar good practices exist for data analysis. As an introduction we recommend the following article by a well-known bioinformatician William Stafford Noble (his advice applies outside of bioinformatics as well):

Several important recommendations:

  • Noble 2009: "Everything you do, you will probably have to do over again."
    • After doing an entire analysis, you often find out that there was a problem with the input data or one of the early steps and therefore everything needs to be redone
    • Therefore it is better to use techniques that allow you to keep all details of your workflow and to repeat them if needed
    • Try to avoid manually changing files, because this makes rerunning analyses harder and more error-prone
  • Document all steps of your analysis
    • Note what have you done, why have you done it, what was the result
    • Some of these things may seem obvious to you at present, but you may forgot them in a few weeks or months and you may need them to write up your thesis or to repeat the analysis
    • Good documentation is also indispensable for collaborative projects


  • Keep a logical structure of your files and folders
    • Their names should be indicative of the contents (create a sensible naming scheme)
    • However, if you have too many versions of the experiment, it may be easier to name them by date rather than create new long names (your notes should then detail the meaning of each dated version)
  • Try to detect problems in the data
    • Big files may hide some problems in the format, unexpected values etc. These may confuse your programs and make the results meaningless
    • In your scripts, check that the input data conform to your expectations (format, values in reasonable ranges etc)
    • In unexpected circumstances, scripts should terminate with an error message and a non-zero exit code
    • If your script executes another program, check its exit code
    • Also check intermediate results as often as possible (by manual inspection, computing various statistics etc) to detect errors in the data and your code

Pravidlá

Známkovanie

  • Domáce úlohy: 55%
  • Návrh projektu: 5%
  • Projekt: 40%

Stupnica:

  • A: 90 a viac, B:80...89, C: 70...79, D: 60...69, E: 50...59, FX: menej ako 50%

Formát predmetu

  • Každý týždeň 3 vyučovacie hodiny, z toho cca prvá je prednáška a ďalšie dve cvičenia. Na cvičeniach samostatne riešite príklady, ktoré doma dokončíte ako domácu úlohu.
  • Niektoré týždne bude zvlášť úloha pre študentov bakalárskeho programu Bioinformatika a zvlášť pre ostatných. Ak by ste chceli riešiť iné zadanie, než je pre vás určené, musíte získať vopred súhlas vyučujúcich.
  • Cez skúškové obdobie budete odovzdávať projekt. Po odovzdaní projektov sa bude konať ešte diskusia o projekte s vyučujúcimi, ktorá môže ovplyvniť vaše body z projektu.
  • Budete mať konto na Linuxovom serveri určenom pre tento predmet. Toto konto používajte len na účely tohto predmetu a snažte sa server príliš svojou aktivitou nepreťažiť, aby slúžil všetkým študentom. Akékoľvek pokusy úmyselne narušiť chod servera budú považované za vážne porušenie pravidiel predmetu.

Domáce úlohy

  • Termín DÚ týkajúcej sa aktuálnej prednášky je vždy do 9:00 v deň nasledujúcej prednášky (t.j. väčšinou o necelý týždeň od zadania).
  • Domácu úlohu odporúčame začať robiť na cvičení, kde vám môžeme prípadne poradiť. Ak máte otázky neskôr, pýtajte sa vyučujúcich emailom.
  • Domácu úlohu môžete robiť na ľubovoľnom počítači, pokiaľ možno pod Linuxom. Odovzdaný kód alebo príkazy by však mali byť spustiteľné na serveri pre tento predmet, nepoužívajte teda špeciálny softvér alebo nastavenia vášho počítača.
  • Domáca úloha sa odovzdáva nakopírovaním požadovaných súborov do požadovaného adresára na serveri. Konkrétne požiadavky budú spresnené v zadaní.
  • Ak sú mená súborov špecifikované v zadaní, dodržujte ich. Ak ich vymýšľate sami, nazvite ich rozumne. V prípade potreby si spravte aj podadresáre, napr. na jednotlivé príklady.
  • Dbajte na prehľadnosť odovzdaného zdrojového kódu (odsadzovanie, rozumné názvy premenných, podľa potreby komentáre)

Protokoly

  • Väčšinou bude požadovanou súčasťou úlohy textový dokument nazvaný protokol.
  • Protokol píšte v txt formáte a súbor nazvite protocol.txt (nakopírujte ho do odovzdaného adresára)
  • Protokol môže byť po slovensky alebo po anglicky.
  • Ak píšete s diakritikou, použite kódovanie UTF8, ale pre jednoduchosť môžete protokoly písať aj bez diakritiky.
  • Vo väčšine úloh dostanete kostru protokolu, dodržujte ju.

Hlavička protokolu, vyhodnotenie

  • Na vrchu protokolu uveďte názov domácej úluhy a vaše vyhodnotenie toho, ako sa vám úlohu podarilo vyriešiť. Vyhodnotenie je prehľadný zoznam všetkých príkladov zo zadania, ktoré ste aspoň začali riešiť a kódov označujúcich ich stupeň dokončenia:
    • kód HOTOVO uveďte, ak si myslíte, že tento príklad máte úplne a správne vyriešený
    • kód ČASŤ uveďte, ak ste nevyriešili príklad celý a do poznámky za kód stručne uveďte, čo máte hotové a čo nie, prípadne ktorými časťami si nie ste istí.
    • kód MOŽNO uveďte, ak príklad máte celý, ale nie ste si istí, či správne. Opäť v poznámke uveďte, čím si nie ste istí.
    • kód NIČ uveďte, ak ste príklad ani nezačali riešiť
  • Vaše vyhodnotenie je pre nás pomôckou pri bodovaní. Príklady označené HOTOVO budeme kontrolovať námatkovo, k príkladom označeným MOŽNO sa vám pokúsime dať nejakú spätnú väzbu, takisto aj k príkladom označeným ČASŤ, kde v poznámke vyjadríte, že ste mali nejaké problémy.
  • Pri vyhodnotení sa pokúste čo najlepšie posúdiť správnosť vašich riešení, pričom kvalita vášho seba-hodnotenia môže vplývať na celkový počet bodov.

Obsah protokolu

  • Ak nie je v zadaní určené inak, protokol by mal obsahovať nasledovné údaje:
    • Zoznam odovzdaných súborov: o každom súbore uveďte jeho význam a či ste ho vyrobili ručne, získali z externých zdrojov alebo vypočítali nejakým programom. Ak máte väčšie množstvo súborov so systematickým pomenovaním, stačí vysvetliť schému názvov všeobecne. Súbory, ktorých mená sú špecifikované v zadaní, nemusíte v zozname uvádzať.
    • Postupnosť všetkých spustených príkazov, prípadne iných krokov, ktorými ste dospeli k získaným výsledkom. Tu uvádzajte príkazy na spracovanie dát a spúšťanie vašich či iných programov. Netreba uvádzať príkazy súvisiace so samotným programovaním (spúšťanie editora, nastavenie práv na spustenie a pod.), s kopírovaním úlohy na server a pod. Pri zložitejších príkazoch uveďte aj stručné komentáre, čo bolo účelom určitého príkazu alebo skupiny príkazov.
    • Zoznam zdrojov: webstránky a pod., ktoré ste pri riešení úlohy použili. Nemusíte uvádzať webstránku predmetu a zdroje odporučené priamo v zadaní.

Celkovo by protokol mal umožniť čitateľovi zorientovať sa vo vašich súboroch a tiež v prípade záujmu vykonať rovnaké výpočty, akými ste dospeli vy k výsledku. Nemusíte písať slohy, stačia zrozumiteľné a prehľadné heslovité poznámky.

Projekty

Cieľom projektu je vyskúšať si naučené zručnosti na konkrétnom projekte spracovania dát. Vašou úlohou je zohnať si dáta, tieto dáta analyzovať niektorými technikami z prednášok, prípadne aj inými technológiami a získané výsledky zobraziť v prehľadných grafoch a tabuľkách.

  • Zhruba v dvoch tretinách semestra budete odovzdávať krátky návrh projektu
  • Cez skúškové obdobie bude určený termín odovzdania projektu (vrátane písomnej správy)
  • Projekty môžete robiť aj vo dvojici, vtedy však vyžadujeme rozsiahlejší projekt a každý člen by mal byť primárne zodpovedný za určitú časť projektu
  • Po odovzdaní projektov sa bude konať ešte diskusia o projekte s vyučujúcimi, ktorá môže ovplyvniť vaše body z projektu.

Viac informácií o projektoch na zvláštnej stránke

Opisovanie

  • Máte povolené sa so spolužiakmi a ďalšími osobami rozprávať o domácich úlohách resp. projektoch a stratégiách na ich riešenie. Kód, získané výsledky aj text, ktorý odovzdáte, musí však byť vaša samostatná práca. Je zakázané ukazovať svoj kód alebo texty spolužiakom.
  • Pri riešení domácej úlohy a projektu očakávame, že budete využívať internetové zdroje, najmä rôzne manuály a diskusné fóra k preberaným technológiám. Nesnažte sa však nájsť hotové riešenia zadaných úloh. Všetky použité zdroje uveďte v domácich úlohách a projektoch.
  • Ak nájdeme prípady opisovania alebo nepovolených pomôcok, všetci zúčastnení študenti získajú za príslušnú domácu úlohu, projekt a pod. nula bodov (t.j. aj tí, ktorí dali spolužiakom odpísať) a prípad ďalej podstúpime na riešenie disciplinárnej komisii fakulty.

Zverejňovanie

Zadania a materiály k predmetu sú voľne prístupné na tejto stránke. Prosím vás ale, aby ste nezverejňovali ani inak nešírili vaše riešenia domácich úloh, ak nie je v zadaní povedané inak. Vaše projekty môžete zverejniť, pokiaľ to nie je v rozpore s vašou dohodou so zadávateľom projektu a poskytovateľom dát.

Projekt

Cieľom projektu je vyskúšať si naučené zručnosti na konkrétnom projekte spracovania dát. Vašou úlohou je zohnať si dáta, tieto dáta analyzovať niektorými technikami z prednášok, prípadne aj inými technológiami a získané výsledky zobraziť v prehľadných grafoch a tabuľkách. Ideálne je, ak sa vám podarí prísť k zaujímavým alebo užitočným záverom, ale hodnotiť budeme hlavne voľbu vhodného postupu a jeho technickú náročnosť. Rozsah samotného programovania alebo analýzy dát by mal zodpovedať zhruba trom domácim úlohám, ale celkovo bude projekt náročnejší, lebo na rozdiel od úloh nemáte postup a dáta vopred určené, ale musíte si ich vymyslieť sami a nie vždy sa prvý nápad ukáže ako správny.

V projekte môžete využiť aj existujúce nástroje a knižnice, ale dôraz by mal byť na nástrojoch spúšťaných na príkazovom riadku a využití technológií preberaných na predmete. Pri prototypovaní vášho nástroja a vytváraní vizualizácií do záverečnej správy sa vám môže dobre pracovať v interaktívnych prostrediach, ako napríklad Jupyter notebook, ale v odovzdanej verzii projektu by sa mala dať väčšia časť kódu spustiť zo samostatných skriptov spustiteľných na príkazovom riadku, potenciálne s výnimkou samotnej vizualizácie, ktorá môže zostať ako notebook alebo interaktívna webstránka (flask).

Návrh projektu

Zhruba v dvoch tretinách semestra budete odovzdávať návrh projektu v rozsahu asi pol strany. V tomto návrhu uveďte, aké dáta budete spracovávať, ako ich zoženiete, čo je cieľom analýzy a aké technológie plánujete použiť. Ciele a technológie môžete počas práce na projekte mierne pozmeniť podľa okolností, mali by ste však mať počiatočnú predstavu. K návrhu vám dáme spätnú väzbu, pričom v niektorých prípadoch môže byť potrebné tému mierne alebo úplne zmeniť. Za načas odovzdaný vhodný návrh projektu získate 5% z celkovej známky. Návrh odporúčame pred odovzdaním konzultovať s vyučujúcimi.

Odovzdávanie: súbor vo formáte txt alebo pdf nakopírujte do /submit/navrh/username na serveri.

Odovzdanie projektov

Cez skúškové obdobie bude určený termín odovzdania projektu. Podobne ako pri domácich úlohách odovzdávajte adresár s požadovanými súbormi:

  • Vaše programy a súbory s dátami (veľmi veľké dátové súbory vynechajte)
  • Protokol podobne ako pri domácich úlohách
    • formát txt alebo pdf, stručné heslovité poznámky
    • obsahuje zoznam súborov, podrobný postup pri analýze dát (spustené príkazy), ako aj použité zdroje (dáta, programy, dokumentácia a iná literatúra atď)
  • Správu k projektu vo formáte pdf. Na rozdiel od menej formálneho protokolu by správu mal tvoriť súvislý text v odbornom štýle, podobne ako napr. záverečné práce. Môžete písať po slovensky alebo po anglicky, ale pokiaľ možno gramaticky správne. Správa by mala obsahovať:
    • úvod, v ktorom vysvetlíte ciele projektu, prípadne potrebné poznatky zo skúmanej oblasti a aké dáta ste mali k dispozícii
    • stručný popis metód, v ktorom neuvádzajte detailne jednotlivé kroky, skôr prehľad použitého prístupu a jeho zdôvodnenie
    • výsledky analýzy (tabuľky, grafy a pod.) a popis týchto výsledkov, prípadne aké závery sa z nich dajú spraviť (nezabudnite vysvetliť, čo znamenajú údaje v tabuľkách, osi grafov a pod.). Okrem finálnych výsledkov analýzy uveďte aj čiastkové výsledky, ktorými ste sa snažili overovať, že pôvodné dáta a jednotlivé časti vášho postupu sa správajú rozumne.
    • diskusiu, v ktorej uvediete, ktoré časti projektu boli náročné a na aké problémy ste narazili, kde sa vám naopak podarilo nájsť spôsob, ako problém vyriešiť jednoducho, ktoré časti projektu by ste spätne odporúčali robiť iným než vašim postupom, čo ste sa na projekte naučili a podobne

Projekty môžete robiť aj vo dvojici, vtedy však vyžadujeme rozsiahlejší projekt a každý člen by mal byť primárne zodpovedný za určitú časť projektu, čo uveďte aj v správe. Dvojice odovzdávajú jednu správu, ale po odovzdaní projektu majú stretnutie s vyučujúcimi individuálne.

Typické časti projektu

Väčšina projektov obsahuje nasledujúce kroky, ktoré by sa mali premietnuť aj v správe

  • Získanie dát. Toto môže byť ľahké, ak vám dáta niekto priamo dá alebo ich stiahnete ako jeden súbor z internetu, alebo náročnejšie, napríklad ak ich parsujete z veľkého množstva súborov alebo webstránok. Nezabudnite na (aspoň námatkovú) kontrolu, či sa vám podarilo dáta stiahnuť správne. V správe by malo byť jasne uvedené, kde a ako ste dáta získali.
  • Predspracovanie dát do vhodného tvaru. Táto etapa zahŕňa parsovanie vstupných formátov, vyberanie užitočných dát, ich kontrola, odfiltrovanie nevhodných alebo neúplných položiek a podobne. Dáta si uložte do súboru alebo databázy vo vhodnom tvare, v ktorom sa vám s nimi bude dobre ďalej pracovať. Nezabudnite na kontrolu, či dáta vyzerajú byť v poriadku a spočítajte základné štatistiky, napríklad celkový počet záznamov, rozsahy rozličných atribútov a podobne, ktoré môžu vám aj čitateľovi správy ilustrovať, aký je charakter dát.
  • Ďalšie analýzy na dátach a vizualizácia výsledkov. V tejto fáze sa pokúste v dátach nájsť niečo zaujímavé alebo užitočné pre zadávateľa projektu. Výsledkom môžu byť statické grafy a tabuľky, alebo aj interaktívna webstránka (flask). Aj v prípade interaktívnej webstránky však aspoň niektoré výsledky uveďte aj v správe.

Ak sa váš projekt od týchto krokov výrazne odlišuje, poraďte sa s vyučujúcimi.

Vhodné témy projektov

  • Môžete spracovať nejaké dáta, ktoré potrebujete do bakalárskej alebo diplomovej práce, prípadne aj dáta, ktoré potrebujte na iný predmet (v tom prípade uveďte v správe, o aký predmet ide a takisto upovedomte aj druhého vyučujúceho, že ste použili spracovanie dát ako projekt pre tento predmet). Obzvlášť pre BIN študentov môže byť tento predmet vhodnou príležitosťou nájsť si tému bakalárskej práce a začať na nej pracovať.
  • Môžete skúsiť zopakovať analýzu spravenú v nejakom vedeckom článku a overiť, že dostanete tie isté výsledky. Vhodné je tiež skúsiť analýzu aj mierne obmeniť (spustiť na iné dáta, zmeniť nejaké nastavenia, zostaviť aj iný typ grafu a pod.)
  • Môžete skúsiť nájsť niekoho, kto má dáta, ktoré by potreboval spracovať, ale nevie ako na to (môže ísť o biológov, vedcov z iných oblastí, ale aj neziskové organizácie a pod.) V prípade, že takýmto spôsobom kontaktujete tretie osoby, bolo by vhodné pracovať na projekte obzvlášť zodpovedne, aby ste nerobili zlé meno našej fakulte.
  • V projekte môžete porovnávať niekoľko programov na tú istú úlohu z hľadiska ich rýchlosti či presnosti výsledkov. Obsahom projektu bude príprava dát, na ktorých budete programy bežať, samotné spúšťanie (vhodne zoskriptované) ako aj vyhodnotenie výsledkov.
  • A samozrejme môžete niekde na internete vyhrabať zaujímavé dáta a snažiť sa z nich niečo vydolovať. Študenti si často vyberajú témy súvisiace s ich koníčkami a aktivitami, napríklad športy, počítačové hry, programátorské súťaže a iné.

Lperl

This lecture is a brief introduction to the Perl scripting language. More information can be found below (section #Sources of Perl-related information). We recommend revisiting necessary parts of this lecture while working on the exercises.

Why Perl

  • From Wikipedia: It has been nicknamed "the Swiss Army chainsaw of scripting languages" because of its flexibility and power, and possibly also because of its "ugliness".

Official slogans:

  • There's more than one way to do it.
  • Easy things should be easy and hard things should be possible.

Advantages

  • Good capabilities for processing text files, regular expressions, running external programs etc.
  • Closer to common programming languages than shell scripts
  • Perl one-liners on the command line can replace many other tools such as sed and awk
  • Many existing libraries

Disadvantages

  • Quirky syntax
  • It is easy to write very unreadable programs (Perl is sometimes joking called write-only language)
  • Quite slow and uses a lot of memory. If possible, do no read entire input to memory, process line by line

We will use Perl 5, Perl 6 is quite a different language

Hello world

It is possible to run the code directly from a command line (more later):

perl -e'print "Hello world\n"'

This is equivalent to the following code stored in a file:

#! /usr/bin/perl -w
use strict;
print "Hello world!\n";
  • The first line is a path to the interpreter
  • Switch -w switches warnings on, e.g. if we manipulate with an undefined value (equivalent to use warnings;)
  • The second line use strict will switch on a more strict syntax checks, e.g. all variables must be defined
  • Use of -w and use strict is strongly recommended

Running the script

  • Store the program in a file hello.pl
  • Make it executable (chmod a+x hello.pl)
  • Run it with command ./hello.pl
  • It is also possible to run as perl hello.pl (e.g. if we don't have the path to the interpreter in the file or the executable bit is not set)

The first input file for today: TV series

  • IMDb is an online database of movies and TV series with user ratings.
  • We have downloaded a preprocessed dataset of selected TV series ratings from GitHub.
  • From this dataset, we have selected only several series with a high number of voting users.
  • Each line of the file contains data about one episode of one series. Columns are tab-separated and contain the name of the series, the name of the episode, the global index of the episode within the series, the number of the season, the index of the episode with the season, rating of the episode and the number of voting users.
  • Here is a smaller version of this file with only six lines:
Black Mirror	The National Anthem	1	1	1	7.8	35156
Black Mirror	Fifteen Million Merits	2	1	2	8.2	35317
Black Mirror	The Entire History of You	3	1	3	8.6	35266
Game of Thrones	Winter Is Coming	1	1	1	9	27890
Game of Thrones	The Kingsroad	2	1	2	8.8	21414
Game of Thrones	Lord Snow	3	1	3	8.7	20232
  • The smaller and the larger version of this file can be found at our server under filenames /tasks/perl/series-small.tsv and /tasks/perl/series.tsv

A sample Perl program

For each series (column 0 of the file) we want to compute the number of episodes.

#! /usr/bin/perl -w
use strict;

#associative array (hash), with series name as key
my %count;  

while(my $line = <STDIN>) {  # read every line on input
    chomp $line;    # delete end of line, if any

    # split the input line to columns on every tab, store them in an array
    my @columns = split "\t", $line;  

    # check input - should have 7 columns
    die "Bad input '$line'" unless @columns == 7;

    my $series = $columns[0];

    # increase counter for this type
    $count{$series}++;
}

# write out results, types sorted alphabetically
foreach my $series (sort keys %count) {
    print $series, " ", $count{$series}, "\n";
}

This program does the same thing as the following one-liner (more on one-liners in the next lecture)

perl -F'"\t"' -lane 'die unless @F==7; $count{$F[0]}++;
  END { foreach (sort keys %count) { print "$_ $count{$_}" }}' filename

When we run it for the small six-line input, we get the following output:

Black Mirror 3
Game of Thrones 3

The second input file for today: DNA sequencing reads (fastq)

  • DNA sequencing machines can read only short pieces of DNA called reads
  • Reads are usually stored in FASTQ format
  • Files can be very large (gigabytes or more), but we will use only a small sample from bacteria Staphylococcus aureus (data from the GAGE website)
  • Each read is stored in 4 lines:
    • line 1: ID of the read and other description, line starts with @
    • line 2: DNA sequence, A,C,G,T are bases (nucleotides) of DNA, N means unknown base
    • line 3: +
    • line 4: quality string, which is the string of the same length as DNA in line 2. Each character represents quality of one base in DNA. If p is the probability that this base is wrong, the quality string will contain character with ASCII value 33+(-10 log p), where log is the decimal logarithm. Higher ASCII means base of higher quality. Character ! (ASCII 33) means probability 1 of error, character $ (ASCII 36) means 50% error, character + (ASCII 43) is 10% error, character 5 (ASCII 53) is 1% error.
  • Our file has all reads of equal length (this is not always the case)
  • Technically, a single read and its quality can be split into multiple lines, but this is rarely done, and we will assume that each read takes 4 lines as described above

The first 4 reads from file /tasks/perl/reads-small.fastq (trimmed to 50 bases for better readability)

@SRR022868.1845/1
AAATTTAGGAAAAGATGATTTAGCAACATTTAGCCTTAATGAAAGACCAG
+
IICIIIIIIIIIID%IIII8>I8III1II,II)I+III*II<II,E;-HI
@SRR022868.1846/1
TAGCGTTGTAAAATAAATTTCTAGAATGGAAGTGATGATATTGAAATACA
+
4CIIIIIIII52I)IIIII0I16IIIII2IIII;IIAII&I6AI+*+&G5

Variables, types

Scalar variables

  • The names of scalar variables start with $
  • Scalar variables can hold undefined value (undef), string, number, reference etc.
  • Perl converts automatically between strings and numbers
perl -e'print((1 . "2")+1, "\n")'
# 13
perl -e'print(("a" . "2")+1, "\n")'
# 1
perl -we'print(("a" . "2")+1, "\n")'
# Argument "a2" isn't numeric in addition (+) at -e line 1.
# 1
  • If we switch on strict parsing, each variable needs to be defined by my
    • Several variables can be created and initialized as follows: my ($a,$b) = (0,1);
  • Usual set of C-style operators, power is **, string concatenation .
  • Numbers compared by <, <=, ==, != etc., strings by lt, le, eq, ne, gt, ge
  • Comparison operator $a cmp $b for strings, $a <=> $b for numbers: returns -1 if $a<$b, 0 if they are equal, +1 if $a>$b

Arrays

  • Names start with @, e.g. @a
  • Access to element 0 in array @a: $a[0]
    • Starts with $, because the expression as a whole is a scalar value
  • Length of array scalar(@a). In scalar context, @a is the same thing.
    • e.g. for(my $i=0; $i<@a; $i++) { ... } iterates over all elements
  • If using non-existent indexes, they will be created, initialized to undef (++, += treat undef as 0)
  • Stack/vector using functions push and pop: push @a, (1,2,3); $x = pop @a;
  • Analogically shift and unshift on the left end of the array (slower)
  • Sorting
    • @a = sort @a; (sorts alphabetically)
    • @a = sort {$a <=> $b} @a; (sorts numerically)
    • { } can contain an arbitrary comparison function, $a and $b are the two compared elements
  • Array concatenation @c = (@a,@b);
  • Swap values of two variables: ($x,$y) = ($y,$x);
  • Command foreach iterates through values of an array (values can be changed during iteration):
my @a = (1,2,3);
foreach my $val (@a) {  # iterate through all values
    $val++;             # increase each value in array by 1
}
# concatenate values to a string separated by spaces
print join(" ", @a), "\n"; 
# prints 2 3 4

Hash tables (associative array, dictionaries, maps)

  • Names start with %, e.g. %b
  • Keys are strings, values are scalars
  • Access element with key "X": $b{"X"}
  • Write out all elements of associative array %b
foreach my $key (keys %b) {
    print $key, " ", $b{$key}, "\n";
}
  • Initialization with a constant: %b = ("key1" => "value1", "key2" => "value2");
  • Test for existence of a key: if(exists $a{"X"}) {...}

Multidimensional arrays, fun with pointers

  • Pointer to a variable (scalar, array, dictionary): \$a, \@a, \%a
  • Pointer to an anonymous array: [1,2,3], pointer to an anonymous hash: {"key1" => "value1"}
  • Hash of lists is stored as hash of pointers to lists:
my %a = ("fruits" => ["apple","banana","orange"],
         "vegetables" => ["tomato","carrot"]);
$x = $a{"fruits"}[1];
push @{$a{"fruits"}}, "kiwi";
my $aref = \%a;
$x = $aref->{"fruits"}[1];
  • Module Data::Dumper has function Dumper, which recursively prints complex data structures (good for debugging)

Strings

  • Substring: substr($string, $start, $length)
    • Used also to access individual characters (use length 1)
    • If we omit $length, extracts suffix until the end of the string, negative $start counts from the end of the string,...
    • We can also replace a substring by something else: substr($str, 0, 1) = "aaa" (replaces the first character by "aaa")
  • Length of a string: length($str)
  • Splitting a string to parts: split reg_expression, $string, $max_number_of_parts
    • If " " is used instead of regular expression, splits at any whitespace
  • Connecting parts to a string join($separator, @strings)
  • Other useful functions: chomp (removes the end of line), index (finds a substring), lc, uc (conversion to lower-case/upper-case), reverse (mirror image), sprintf (C-style formatting)

Regular expressions

  • Regular expressions are powerful tool for working with strings, now featured in many languages
  • Here only a few examples, more details can be found in the official tutorial
$line =~ s/\s+$//;      # remove whitespace at the end of the line
$line =~ s/[0-9]+/X/g;  # replace each sequence of numbers with character X

# if the line starts with >,
# store the word following > (until the first whitespace)
# and store it in variable $name 
# (\S means non-whitespace),
# the string matching part of expression in (..) is stored in $1
if($line =~ /^\>(\S+)/) { $name = $1; }

Conditionals, loops

if(expression) {  # () and {} cannot be omitted
   commands
} elsif(expression) {
   commands
} else {
   commands
}

command if expression;   # here () not necessary
command unless expression;
# good for checking inputs etc
die "negative value of x: $x" unless $x >= 0;

for(my $i=0; $i<100; $i++) {
   print $i, "\n";
}

foreach my $i (0..99) {
   print $i, "\n";
}

my $x = 1;
while(1) {
   $x *= 2;
   last if $x >= 100;
}

Undefined value, number 0 and strings "" and "0" evaluate as false, but we recommend always explicitly using logical values in conditional expressions, e.g. if(defined $x), if($x eq ""), if($x==0) etc.

Input, output

# Reading one line from standard input
$line = <STDIN>
# If no more input data available, returns undef


# The special idiom below reads all the lines from input until the end of input is reached:
while (my $line = <STDIN>) {
   # commands processing $line ...
}

Sources of Perl-related information

  • Man pages (included in ubuntu package perl-doc), also available online at http://perldoc.perl.org/
    • man perlintro introduction to Perl
    • man perlfunc list of standard functions in Perl
    • perldoc -f split describes function split, similarly other functions
    • perldoc -q sort shows answers to commonly asked questions (FAQ)
    • man perlretut and man perlre regular expressions
    • man perl list of other manual pages about Perl
  • Various web tutorials e.g. this one
  • Books

Further optional topics

For illustration, we briefly cover other topics frequently used in Perl scripts (these are not needed to solve the exercises).

Opening files

my $in;
open $in, "<", "path/file.txt" or die;  # open file for reading
while(my $line = <$in>) {
  # process line
}
close $in;

my $out;
open $out, ">", "path/file2.txt" or die; # open file for writing
print $out "Hello world\n";
close $out;
# if we want to append to a file use the following instead:
# open $out, ">>", "cesta/subor2.txt" or die;

# standard files
print STDERR "Hello world\n";
my $line = <STDIN>;
# files as arguments of a function
read_my_file($in);
read_my_file(\*STDIN);

Working with files and directories

Module File::Temp allows to create temporary working directories or files with automatically generated names. These are automatically deleted when the program finishes.

use File::Temp qw/tempdir/;
my $dir = tempdir("atoms_XXXXXXX", TMPDIR => 1, CLEANUP => 1 ); 
print STDERR "Creating temporary directory $dir\n";
open $out,">$dir/myfile.txt" or die;

Copying files

use File::Copy;
copy("file1","file2") or die "Copy failed: $!";
copy("Copy.pm",\*STDOUT);
move("/dev1/fileA","/dev2/fileB");

Other functions for working with file system, e.g. chdir, mkdir, unlink, chmod, ...

Function glob finds files with wildcard characters similarly as on command line (see also opendir, readdir, and File::Find module)

ls *.pl
perl -le'foreach my $f (glob("*.pl")) { print $f; }'

Additional functions for working with file names, paths, etc. in modules File::Spec and File::Basename.

Testing for an existence of a file (more in perldoc -f -X)

if(-r "file.txt") { ... }  # is file.txt readable?
if(-d "dir") {.... }       # is dir a directory?

Running external programs

Using the system command

  • It returns -1 if it cannot run command, otherwise returns the return code of the program
my $ret = system("command arguments");

Using the backtick operator with capturing standard output to a variable

  • This does not tests the return code
my $allfiles = `ls`;

Using pipes (special form of open sends output to a different command, or reads output of a different command as a file)

open $in, "ls |";
while(my $line = <$in>) { ... }
open $out, "| wc"; 
print $out "1234\n"; 
close $out;
# output of wc:
#      1       1       5

Command-line arguments

# module for processing options in a standardized way
use Getopt::Std;
# string with usage manual
my $USAGE = "$0 [options] length filename

Options:
-l           switch on lucky mode
-o filename  write output to filename
";

# all arguments to the command are stored in @ARGV array
# parse options and remove them from @ARGV
my %options;
getopts("lo:", \%options);
# now there should be exactly two arguments in @ARGV
die $USAGE unless @ARGV==2;
# process options
my ($length, $filenamefile) = @ARGV;
# values of options are in the %options array
if(exists $options{'l'}) { print "Lucky mode\n"; }

For long option names, see module Getopt::Long

Defining functions

sub function_name {
  # arguments are stored in @_ array
  my ($firstarg, $secondarg) = @_;
  # do something
  return ($result, $second_result);
}
  • Arrays and hashes are usually passed as references: function_name(\@array, \%hash);
  • It is advantageous to pass very long string as references to prevent needless copying: function_name(\$sequence);
  • References need to be dereferenced, e.g. substr($$sequence) or $array->[0]

Bioperl

A large library useful for bioinformatics. This snippet translates DNA sequence to a protein using the standard genetic code:

use Bio::Tools::CodonTable;
sub translate
{
    my ($seq, $code) = @_;
    my $CodonTable = Bio::Tools::CodonTable->new( -id => $code);
    my $result = $CodonTable->translate($seq);

    return $result;
}

HWperl

See the lecture

Files and setup

We recommend creating a directory (folder) for this set of tasks:

mkdir perl  # make directory
cd perl     # change to the new directory

We have 4 input files for this task set. We recommend creating soft links to your working directory as follows:

ln -s /tasks/perl/series-small.tsv .   # small version of the series file
ln -s /tasks/perl/series.tsv .         # full version of the series file
ln -s /tasks/perl/reads-small.fastq .  # smaller version of the read file
ln -s /tasks/perl/reads.fastq .        # bigger version of the read file

We recommend writing your protocol starting from an outline provided in /tasks/perl/protocol.txt. Make your own copy of the protocol and open it in an editor, e.g. kate:

cp -ip /tasks/perl/protocol.txt .  # copy protocol
kate protocol.txt &                # open editor, run in the background

Submitting

  • Directory /submit/perl/your_username will be created for you
  • Copy required files to this directory, including the protocol named protocol.txt
  • You can modify these files freely until deadline, but after the deadline of the homework, you will lose access rights to this directory

Task A (series)

Consider the program for counting series in the lecture 1, save it to file series-stat.pl

  • Open editor running in the background: kate series-stat.pl &
  • Copy and paste text to the editor, save it
  • Make the script executable: chmod a+x series-stat.pl

Extend the script to compute the average rating of each series (averaging over all episodes in the series)

  • Each row of the input table contains rating in column 5.
  • Output a table with three columns: name of series, the number of episodes, the average rating.
  • Use printf to print these three items right-justified in columns of sufficient width, print the average rating to 1 decimal place.
  • If you run your script on the small file, the output should look something like this (exact column widths may differ):
./series-stat.pl < series-small.tsv
        Black Mirror        3        8.2
     Game of Thrones        3        8.8
  • Run your script also on the large file: ./series-stat.pl < series.tsv
    • Include the output in your protocol
  • Submit only your script, series-stat.pl

Task B (FASTQ to FASTA)

  • Write a script which reformats FASTQ file to FASTA format, call it fastq2fasta.pl
    • FASTQ file should be on standard input, FASTA file written to standard output
  • FASTA format is a typical format for storing DNA and protein sequences.
    • Each sequence consists of several lines of the file. The first line starts with ">" followed by identifier of the sequence and optionally some further description separated by whitespace
    • The sequence itself is on the second line, long sequences can be split into multiple lines
  • In our case, the name of the sequence will be the ID of the read with @ replaced by > and / replaced by underscore (_)
  • For example, the first two reads of the file reads.fastq are as follows (only the first 50 columns shown)
@SRR022868.1845/1
AAATTTAGGAAAAGATGATTTAGCAACATTTAGCCTTAATGAAAGACCAG...
+
IICIIIIIIIIIID%IIII8>I8III1II,II)I+III*II<II,E;-HI...
@SRR022868.1846/1
TAGCGTTGTAAAATAAATTTCTAGAATGGAAGTGATGATATTGAAATACA...
+
4CIIIIIIII52I)IIIII0I16IIIII2IIII;IIAII&I6AI+*+&G5...
  • These should be reformatted as follows (again only first 50 columns shown, but you include entire reads):
>SRR022868.1845_1
AAATTTAGGAAAAGATGATTTAGCAACATTTAGCCTTAATGAAAGACCAGA...
>SRR022868.1846_1
TAGCGTTGTAAAATAAATTTCTAGAATGGAAGTGATGATATTGAAATACAC...
  • Run your script on the small read file ./fastq2fasta.pl < reads-small.fastq > reads-small.fasta
  • Submit files fastq2fasta.pl and reads-small.fasta

Task C (FASTQ quality)

Write a script fastq-quality.pl which for each position in a read computes the average quality

  • Standard input has fastq file with multiple reads, possibly of different lengths
  • As quality we will use ASCII values of characters in the quality string with value 33 subtracted, so the quality is -10 log p
    • ASCII value can be computed by function ord
  • Positions in reads will be numbered from 0
  • Since reads can differ in length, some positions are used in more reads, some in fewer
  • For each position from 0 up to the highest position used in some read, print three numbers separated by tabs "\t": the position index, the number of times this position was used in reads, the average quality at that position with 1 decimal place (you can again use printf)
  • The last two lines when you run ./fastq-quality.pl < reads-small.fastq should be
99      86      5.5
100     86      8.6

Run the following command, which runs your script on the larger file and selects every 10th position.

./fastq-quality.pl < reads.fastq | perl -lane 'print if $F[0]%10==0'
  • What trends (if any) do you see in quality values with increasing position?
  • Submit only fastq-quality.pl
  • In your protocol, include the output of the command and the answer to the question above.

Task D (FASTQ trim)

Write script fastq-trim.pl that trims low quality bases from the end of each read and filters out short reads

  • This script should read a fastq file from standard input and write trimmed fastq file to standard output
  • It should also accept two command-line arguments: character Q and integer L
    • We have not covered processing command line arguments, but you can use the code snippet below
  • Q is the minimum acceptable quality (characters from quality string with ASCII value >= ASCII value of Q are ok)
  • L is the minimum acceptable length of a read
  • First find the last base in a read which has quality at least Q (if any). All bases after this base will be removed from both the sequence and quality string
  • If the resulting read has fewer than L bases, it is omitted from the output

You can check your program by the following tests:

  • If you run the following two commands, you should get file tmp identical with input and thus output of the diff command should be empty
./fastq-trim.pl '!' 101 < reads-small.fastq > tmp  # trim at quality ASCII >=33 and length >=101
diff reads-small.fastq tmp                         # output should be empty (no differences)
  • If you run the following two commands, you should see differences in 4 reads, 2 bases trimmed from each
./fastq-trim.pl '"' 1 < reads-small.fastq > tmp   # trim at quality ASCII >=34 and length >=1
diff reads-small.fastq tmp                        # output should be differences in 4 reads
  • If you run the following commands, you should get empty output (no reads meet the criteria):
./fastq-trim.pl d 1 < reads-small.fastq           # quality ASCII >=100, length >= 1
./fastq-trim.pl '!' 102 < reads-small.fastq       # quality ASCII >=33 and length >=102

Further runs and submitting

  • ./fastq-trim.pl '(' 95 < reads-small.fastq > reads-small-filtered.fastq # quality ASCII >= 40
  • Submit files fastq-trim.pl and reads-small-filtered.fastq
  • If you have done task C, run quality statistics on the trimmed version of the bigger file using command below. Comment on the differences between statistics on the whole file in part C and D. Are they as you expected?
# "2" means quality ASCII >= 50
./fastq-trim.pl 2 50 < reads.fastq | ./fastq-quality.pl | perl -lane 'print if $F[0]%10==0'
  • In your protocol, include the result of the command and your discussion of its results.

Note: in this task set, you have created tools which can be combined, e.g. you can first trim FASTQ and then convert it to FASTA (no need to submit these files)

Parsing command-line arguments in this task (they will be stored in variables $Q and $L):

#!/usr/bin/perl -w
use strict;

my $USAGE = "
Usage:
$0 Q L < input.fastq > output.fastq

Trim from the end of each read bases with ASCII quality value less
than the given threshold Q. If the length of the read after trimming
is less than L, the read will be omitted from output.

L is a non-negative integer, Q is a character
";

# check that we have exactly 2 command-line arguments
die $USAGE unless @ARGV==2;
# copy command-line arguments to variables Q and L
my ($Q, $L) = @ARGV;
# check that $Q is one character and $L looks like a non-negative integer
die $USAGE unless length($Q)==1 && $L=~/^[0-9]+$/;

Lbash

#HWbash

This lecture introduces command-line tools and Perl one-liners.

  • We will do simple transformations of text files using command-line tools without writing any scripts or longer programs.

When working on the exercises, record all the commands used

  • We strongly recommend making a log of commands for data processing also outside of this course
  • If you have a log of executed commands, you can easily execute them again by copy and paste
  • For this reason any comments are best preceded in the log by #
  • If you use some sequence of commands often, you can turn it into a script

Efficient use of the Bash command line

Some tips for bash shell:

  • use tab key to complete command names, path names etc
  • use up and down keys to walk through the history of recently executed commands, then edit and execute the chosen command
  • press ctrl-r to search in the history of executed commands
  • at the end of session, history stored in ~/.bash_history
  • command history -a appends history to this file right now
    • you can then look into the file and copy appropriate commands to your log
  • various other history tricks, e.g. special variables [1]
  • cd - goes to previously visited directory (also see pushd and popd)
  • ls -lt | head shows 10 most recent files, useful for seeing what you have done last in a directory

Instead of bash, you can use more advanced command-line environments, e.g. iPhyton notebook

Redirecting and pipes

# redirect standard output to file
command > file

# append to file
command >> file

# redirect standard error
command 2>file

# redirect file to standard input
command < file

# do not forget to quote > in other uses, 
# e.g. when searching for string ">" in a file sequences.fasta
grep '>' sequences.fasta
# (without quotes rewrites sequences.fasta)
# other special characters, such as ;, &, |, # etc
# should be quoted in '' as well

# send stdout of command1 to stdin of command2
command1 | command2

# backtick operator executes command, 
# removes trailing \n from stdout, substitutes to command line
# the following commands do the same thing:
head -n 2 file
head -n `echo 2` file

# redirect a string in ' ' to stdin of command head
head -n 2 <<< 'line 1
line 2
line 3'

# in some commands, file argument can be taken from stdin
# if denoted as - or stdin or /dev/stdin
# the following compares uncompressed version of file1 with file2
zcat file1.gz | diff - file2

Make piped commands fail properly:

set -o pipefail

If set, the return value of a pipeline is the value of the last (rightmost) command to exit with a non-zero status, or zero if all commands in the pipeline exit successfully. This option is disabled by default, pipe then returns exit status of the rightmost command.

Text file manipulation

Commands echo and cat (creating and printing files)

# print text Hello and end of line to stdout
echo "Hello" 
# interpret backslash combinations \n, \t etc:
echo -e "first line\nsecond\tline"
# concatenate several files to stdout
cat file1 file2

Commands head and tail (looking at start and end of files)

# print 10 first lines of file (or stdin)
head file
some_command | head 
# print the first 2 lines
head -n 2 file
# print the last 5 lines
tail -n 5 file
# print starting from line 100 (line numbering starts at 1)
tail -n +100 file
# print lines 81..100
head -n 100 file | tail -n 20

Documentation: head, tail

Commands wc, ls -lh, od (exploring file statistics and details)

# prints three numbers:
# the number of lines (-l), number of words (-w), number of bytes (-c)
wc file

# prints the size of file in human-readable units (K,M,G,T)
ls -lh file

# od -a prints file or stdout with named characters 
#   allows checking whitespace and special characters
echo "hello world!" | od -a
# prints:
# 0000000   h   e   l   l   o  sp   w   o   r   l   d   !  nl
# 0000015

Documentation: wc, ls, od

Command grep (getting lines matching a regular expression)

# get all lines containing string chromosome
grep chromosome file
# -i ignores case (upper case and lowercase letters are the same)
grep -i chromosome file
# -c counts the number of matching lines in each file
grep -c '^[12][0-9]' file1 file2

# other options (there is more, see the manual):
# -v print/count not matching lines (inVert)
# -n show also line numbers
# -B 2 -A 1 print 2 lines before each match and 1 line after match
# -E extended regular expressions (allows e.g. |)
# -F no regular expressions, set of fixed strings
# -f patterns in a file 
#    (good for selecting e.g. only lines matching one of "good" ids)

Documentation: grep

Commands sort, uniq

# sort lines of a file alphabetically
sort file

# some useful options of sort:
# -g numeric sort
# -k which column(s) to use as key
# -r reverse (from largest values)
# -s stable
# -t fields separator

# sorting first by column 2 numerically (-k2,2g),
# in case of ties use column 1 (-k1,1)
sort -k2,2g -k1,1 file 

# uniq outputs one line from each group of consecutive identical lines
# uniq -c adds the size of each group as the first column
# the following finds all unique lines
# and sorts them by frequency from the most frequent
sort file | uniq -c | sort -gr

Documentation: sort, uniq

Commands diff, comm (comparing files)

Command diff compares two files. It is good for manual checking of differences. Useful options:

  • -b (ignore whitespace differences)
  • -r for comparing whole directories
  • -q for fast checking for identity
  • -y show differences side-by-side

Command comm compares two sorted files. It is good for finding set intersections and differences. It writes three columns:

  • lines occurring only in the first file
  • lines occurring only in the second file
  • lines occurring in both files

Some columns can be suppressed with options -1, -2, -3


Commands cut, paste, join (working with columns)

  • Command cut selects only some columns from file (perl/awk more flexible)
  • Command paste puts two or more files side by side, separated by tabs or other characters
  • Command join is a powerful tool for making joins and left-joins as in databases on specified columns in two files

Commands split, csplit (splitting files to parts)

  • Command split splits into fixed-size pieces (size in lines, bytes etc.)
  • Command csplit splits at occurrence of a pattern. For example, splitting a FASTA file into individual sequences:
csplit sequences.fa '/^>/' '{*}'

Programs sed and awk

Both sed and awk process text files line by line, allowing to do various transformations

  • awk newer, more advanced
  • several examples below
  • More info on awk, sed on Wikipedia
# replace text "Chr1" by "Chromosome 1"
sed 's/Chr1/Chromosome 1/'
# prints the first two lines, then quits (like head -n 2)
sed 2q  

# print the first and second column from a file
awk '{print $1, $2}' 

# print the line if the difference between the first and second column > 10
awk '{ if ($2-$1>10) print }'  

# print lines matching pattern
awk '/pattern/ { print }' 

# count the lines (like wc -l)
awk 'END { print NR }'

Perl one-liners

Instead of sed and awk, we will cover Perl one-liners

# -e executes commands
perl -e'print 2+3,"\n"'
perl -e'$x = 2+3; print $x, "\n"';

# -n wraps commands in a loop reading lines from stdin
# or files listed as arguments
# the following is roughly the same as cat:
perl -ne'print'
# how to use:
perl -ne'print' < input > output
perl -ne'print' input1 input2 > output
# lines are stored in a special variable $_
# this variable is default argument of many functions, 
# including print, so print is the same as print $_

# simple grep-like commands:
perl -ne 'print if /pattern/'
# simple regular expression modifications
perl -ne 's/Chr(\d+)/Chromosome $1/; print'
# // and s/// are applied by default to $_

# -l removes end of line from each input line and adds "\n" after each print
# the following adds * at the end of each line
perl -lne'print $_, "*"' 

# -a splits line into words separated by whitespace and stores them in array @F
# the next example prints difference in the numbers stored
# in the second and first column
# (e.g. interval size if each line coordinates of one interval)
perl -lane'print $F[1]-$F[0]'

# -F allows to set separator used for splitting (regular expression)
# the next example splits at tabs
perl -F '"\t"' -lane'print $F[1]-$F[0]'

# END { commands } is run at the very end, after we finish reading input
# the following example computes the sum of interval lengths
perl -lane'$sum += $F[1]-$F[0]; END { print $sum; }'
# similarly BEGIN { command } before we start

Other interesting possibilities:

# -i replaces each file with a new transformed version (DANGEROUS!)
# the next example removes empty lines from all .txt files
# in the current directory
perl -lne 'print if length($_)>0' -i *.txt
# the following example replaces sequence of whitespace by exactly one space 
# and removes leading and trailing spaces from lines in all .txt files
perl -lane 'print join(" ", @F)' -i *.txt

# variable $. contains the line number. $ARGV the name of file or - for stdin
# the following prints filename and line number in front of every line
perl -ne'printf "%s.%d: %s", $ARGV, $., $_' file1 file2

# moving files *.txt to have extension .tsv:
#   first print commands 
#   then execute by hand or replace print with system
#   mv -i asks if something is to be rewritten
ls *.txt | perl -lne '$s=$_; $s=~s/\.txt/.tsv/; print("mv -i $_ $s")'
ls *.txt | perl -lne '$s=$_; $s=~s/\.txt/.tsv/; system("mv -i $_ $s")'

HWbash

Lecture on Perl, Lecture on command-line tools

  • In this set of tasks, use command-line tools or one-liners in Perl, awk or sed. Do not write any scripts or programs.
  • Each task can be split into several stages and intermediate files written to disk, but you can also use pipelines to reduce the number of temporary files.
  • Your commands should work also for other input files with the same format (do not try to generalize them too much, but also do not use very specific properties of a particular input, such as the number of lines etc.)
  • Include all relevant used commands in your protocol and add a short description of your approach.
  • Submit the protocol and required output files.
  • Outline of the protocol is in /tasks/bash/protocol.txt, submit to directory /submit/bash/yourname

Task A (passwords)

  • The file /tasks/bash/names.txt contains data about several people, one per line.
  • Each line consists of given name(s), surname and email separated by spaces.
  • Each person can have multiple given names (at least 1), but exactly one surname and one email. Email is always of the form username@uniba.sk.
  • The task is to generate file passwords.csv which contains a randomly generated password for each of these users
    • The output file has columns separated by commas ','
    • The first column contains username extracted from email address, the second column surname, the third column all given names and the fourth column the randomly generated password
  • Submit file passwords.csv with the result of your commands.

Example line from input:

Pavol Orszagh Hviezdoslav hviezdoslav32@uniba.sk

Example line from output (password will differ):

hviezdoslav32,Hviezdoslav,Pavol Orszagh,3T3Pu3un

Hints:

  • Passwords can be generated using pwgen (e.g. pwgen -N 10 -1 prints 10 passwords, one per line)
  • We also recommend using perl, wc, paste (check option -d in paste)
  • In Perl, function pop may be useful for manipulating @F and function join for connecting strings with a separator.

Task B (yeast genome)

The input file:

  • /tasks/bash/saccharomyces_cerevisiae.gff contains annotation of the yeast genome
    • Downloaded from http://yeastgenome.org/ on 2016-03-09, in particular from [2].
    • It was further processed to omit DNA sequences from the end of file.
    • The size of the file is 5.6M.
  • For easier work, link the file to your directory by ln -s /tasks/bash/saccharomyces_cerevisiae.gff yeast.gff
  • The file is in GFF3 format
  • The lines starting with # are comments, other lines contain tab-separated data about one interval of some chromosome in the yeast genome
  • Meaning of the first 5 columns:
    • column 0 chromosome name
    • column 1 source (can be ignored)
    • column 2 type of interval
    • column 3 start of interval (1-based coordinates)
    • column 4 end of interval (1-based coordinates)
  • You can assume that these first 5 columns do not contain whitespace

Task:

  • Print for each type of interval (column 2), how many times it occurs in the file.
  • Sort from the most common to the least common interval types.
  • Hint: commands sort and uniq will be useful. Do not forget to skip comments, for example using grep -v '^#'
  • The result should be a file types.txt formatted as follows:
   7058 CDS
   6600 mRNA
...
...
      1 telomerase_RNA_gene
      1 mating_type_region
      1 intein_encoding_region

Submit the file types.txt

Task C (chromosomes)

  • Continue processing file from task B.
  • For each chromosome, the file contains a line which has in column 2 string chromosome, and the interval is the whole chromosome.
  • To file chrosomes.txt, print a tab-separated list of chromosome names and sizes in the same order as in the input
  • The last line of chromosomes.txt should list the total size of all chromosomes combined.
  • Submit file chromosomes.txt
  • Hints:
    • The total size can be computed by a perl one-liner.
    • Example from the lecture: compute the sum of interval sizes if each line of the file contains start and end of one interval: perl -lane'$sum += $F[1]-$F[0]; END { print $sum; }'
    • Grepping for word chromosome does not check if this word is indeed in the second column
    • Tab character is written in Perl as "\t".
  • Your output should start and end as follows:
chrI    230218
chrII   813184
...
...
chrXVI  948066
chrmt   85779
total   12157105

Task D (blast)

Overall goal:

  • Proteins from several well-studied yeast species were downloaded from database http://www.uniprot.org/ on 2016-03-09. The file contains sequence of the protein as well as a short description of its biological function.
  • We have also downloaded proteins from the yeast Yarrowia lipolytica. We will pretend that nothing is known about the function of these proteins (as if they were produced by gene finding program in a newly sequenced genome).
  • For each Y.lipolytica protein, we have found similar proteins from other yeasts
  • Now we want to extract for each protein in Y.lipolytica its closest match among all known proteins and see what is its function. This will give a clue about the potential function of the Y.lipolytica protein.

Files:

  • /tasks/bash/known.fa is a FASTA file containing sequences of known proteins from several species
  • /tasks/bash/yarLip.fa is a FASTA file with proteins from Y.lipolytica
  • /tasks/bash/known.blast is the result of finding similar proteins in yarLip.fa versus known.fa by these commands (already done by us):
formatdb -i known.fa
blastall -p blastp -d known.fa -i yarLip.fa -m 9 -e 1e-5 > known.blast
  • you can link these files to your directory as follows:
ln -s /tasks/bash/known.fa .
ln -s /tasks/bash/yarLip.fa .
ln -s /tasks/bash/known.blast .

Step 1:

  • Get the first (strongest) match for each query from known.blast.
  • This can be done by printing the lines that are not comments but follow a comment line starting with #.
  • In a Perl one-liner, you can create a state variable which will remember if the previous line was a comment and based on that you decide if you print the current line.
  • Instead of using Perl, you can play with grep. Option -A 1 prints the matching lines as well as one line after each match
  • Print only the first two columns separated by tab (name of query, name of target), sort the file by the second column.
  • Store the result in file best.tsv. The file should start as follows:
Q6CBS2  sp|B5BP46|YP52_SCHPO
Q6C8R4  sp|B5BP48|YP54_SCHPO
Q6CG80  sp|B5BP48|YP54_SCHPO
Q6CH56  sp|B5BP48|YP54_SCHPO
  • Submit file best.tsv with the result

Step 2:

  • Create file known.tsv which contains sequence names extracted from known.fa with leading > removed
  • This file should be sorted alphabetically.
  • The file should start as follows (lines are trimmed below):
sp|A0A023PXA5|YA19A_YEAST Putative uncharacterized protein YAL019W-A OS=Saccharomyces...
sp|A0A023PXB0|YA019_YEAST Putative uncharacterized protein YAR019W-A OS=Saccharomyces...
  • Submit file known.tsv

Step 3:

  • Use command join to join the files best.tsv and known.tsv so that each line of best.tsv is extended with the text describing the corresponding target in known.tsv
  • Use option -1 2 to use the second column of best.tsv as a key for joining
  • The output of join may look as follows:
sp|B5BP46|YP52_SCHPO Q6CBS2 Putative glutathione S-transferase C1183.02 OS=Schizosaccharomyces...
sp|B5BP48|YP54_SCHPO Q6C8R4 Putative alpha-ketoglutarate-dependent sulfonate dioxygenase OS=...
  • Further reformat the output so that the query name goes first (e.g. Q6CBS2), followed by target name (e.g. sp|B5BP46|YP52_SCHPO), followed by the rest of the text, but remove all text after OS=
  • Sort by query name, store as best.txt
  • The output should start as follows:
B5FVA8  tr|Q5A7D5|Q5A7D5_CANAL  Lysophospholipase
B5FVB0  sp|O74810|UBC1_SCHPO    Ubiquitin-conjugating enzyme E2 1
B5FVB1  sp|O13877|RPAB5_SCHPO   DNA-directed RNA polymerases I, II, and III subunit RPABC5
  • Submit file best.txt

Note:

  • Not all Y.lipolytica proteins are necessarily included in your final output (some proteins do not have blast match).
    • You can think how to find the list of such proteins, but this is not part of the task.
  • Files best.txt and best.tsv should have the same number of lines.

Lmake

Job Scheduling

  • Some computing jobs take a lot of time: hours, days, weeks,...
  • We do not want to keep a command-line window open the whole time; therefore we run such jobs in the background
  • Simple commands to do it in Linux:
    • To run the program immediately, then switch the whole console to the background: screen, tmux
    • To run the command when the computer becomes idle: batch
  • Now we will concentrate on Sun Grid Engine, a complex software for managing many jobs from many users on a cluster consisting of multiple computers
  • Basic workflow:
    • Submit a job (command) to a queue
    • The job waits in the queue until resources (memory, CPUs, etc.) become available on some computer
    • The job runs on the computer
    • Output of the job is stored in files
    • User can monitor the status of the job (waiting, running)
  • Complex possibilities for assigning priorities and deadlines to jobs, managing multiple queues etc.
  • Ideally all computers in the cluster share the same environment and filesystem
  • We have a simple training cluster for this exercise:
    • You submit jobs to queue on vyuka
    • They will run on computers runner01 and runner02
    • This cluster is only temporarily available until the next Thursday


Submitting a job (qsub)

Basic command: qsub -b y -cwd 'command < input > output 2> error'

  • quoting around command allows us to include special characters, such as <, > etc. and not to apply it to qsub command itself
  • -b y treats command as binary, usually preferable for both binary programs and scripts
  • -cwd executes command in the current directory
  • -N name allows to set name of the job
  • -l resource=value requests some non-default resources
  • for example, we can use -l threads=2 to request 2 threads for parallel programs
  • Grid engine will not check if you do not use more CPUs or memory than requested, be considerate (and perhaps occasionally watch your jobs by running top at the computer where they execute)
  • qsub will create files for stdout and stderr, e.g. s2.o27 and s2.e27 for the job with name s2 and jobid 27

Monitoring and deleting jobs (qstat, qdel)

Command qstat displays jobs of the current user

  • job 28 is running of server runner02 (status <t>r), job 29 is waiting in queue (status qw)
job-ID  prior   name       user         state submit/start at     queue       
---------------------------------------------------------------------------------
     28 0.50000 s3         bbrejova     r     03/15/2016 22:12:18 main.q@runner02
     29 0.00000 s3         bbrejova     qw    03/15/2016 22:14:08             
  • Command qstat -u '*' displays jobs of all users
  • Finished jobs disappear from the list
  • Command qstat -F threads shows how many threads available
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
main.q@runner01                BIP   0/1/2          0.00     lx26-amd64    
	hc:threads=1
    238 0.25000 sleeper.pl bbrejova     r     03/05/2020 13:12:28     1        
---------------------------------------------------------------------------------
main.q@runner02                BIP   0/1/2          0.00     lx26-amd64    
    237 0.75000 sleeper.pl bbrejova     r     03/05/2020 13:12:13     1        
  • Command qdel deletes a job (waiting or running)

Interactive work on the cluster (qrsh), screen

Command qrsh creates a job which is a normal interactive shell running on the cluster

  • In this shell you can manually run commands
  • When you close the shell, the job finishes
  • Therefore it is a good idea to run qrsh within screen
    • Run screen command, this creates a new shell
    • Within this shell, run qrsh, then whatever commands
    • By pressing Ctrl-a d you "detach" the screen, so that both shells (local and qrsh) continue running but you can close your local window
    • Later by running screen -r you get back to your shells

Running many small jobs

For example, we many need to run some computation for each human gene (there are roughly 20,000 such genes). Here are some possibilities:

  • Run a script which iterates through all jobs and runs them sequentially
    • Problems: Does not use parallelism, needs more programming to restart after some interruption
  • Submit processing of each gene as a separate job to cluster (submitting done by a script/one-liner)
    • Jobs can run in parallel on many different computers
    • Problem: Queue gets very long, hard to monitor progress, hard to resubmit only unfinished jobs after some failure.
  • Array jobs in qsub (option -t): runs jobs numbered 1,2,3...; number of the current job is in an environment variable, used by the script to decide which gene to process
    • Queue contains only running sub-jobs plus one line for the remaining part of the array job.
    • After failure, you can resubmit only unfinished portion of the interval (e.g. start from job 173).
  • Next: using make in which you specify how to process each gene and submit a single make command to the queue
    • Make can execute multiple tasks in parallel using several threads on the same computer (qsub array jobs can run tasks on multiple computers)
    • It will automatically skip tasks which are already finished, so restart is easy

Make

Make is a system for automatically building programs (running compiler, linker etc)

  • In particular, we will use GNU make
  • Rules for compilation are written in a Makefile
  • Rather complex syntax with many features, we will only cover basics

Rules

  • The main part of a Makefile are rules specifying how to generate target files from some source files (prerequisites).
  • For example the following rule generates file target.txt by concatenating files source1.txt and source2.txt:
target.txt : source1.txt source2.txt
      cat source1.txt source2.txt > target.txt
  • The first line describes target and prerequisites, starts in the first column
  • The following lines list commands to execute to create the target
  • Each line with a command starts with a tab character
  • If we have a directory with this rule in file called Makefile and files source1.txt and source2.txt, running make target.txt will run the cat command
  • However, if target.txt already exists, the command will be run only if one of the prerequisites has more recent modification time than the target
  • This allows to restart interrupted computations or rerun necessary parts after modification of some input files
  • make automatically chains the rules as necessary:
    • if we run make target.txt and some prerequisite does not exist, make checks if it can be created by some other rule and runs that rule first
    • In general it first finds all necessary steps and runs them in appropriate order so that each rules has its prerequisites ready
    • Option make -n target will show which commands would be executed to build target (dry run) - good idea before running something potentially dangerous

Pattern rules

We can specify a general rule for files with a systematic naming scheme. For example, to create a .pdf file from a .tex file, we use the pdflatex command:

%.pdf : %.tex
      pdflatex $^
  • In the first line, % denotes some variable part of the filename, which has to agree in the target and all prerequisites
  • In commands, we can use several variables:
    • Variable $^ contains the names of the prerequisites (source)
    • Variable $@ contains the name of the target
    • Variable $* contains the string matched by %

Other useful tricks in Makefiles

Variables

Store some reusable values in variables, then use them several times in the Makefile:

MYPATH := /projects/trees/bin

target : source
       $(MYPATH)/script < $^ > $@

Wildcards, creating a list of targets from files in the directory

The following Makefile automatically creates .png version of each .eps file simply by running make:

EPS := $(wildcard *.eps)
EPSPNG := $(patsubst %.eps,%.png,$(EPS))

all:  $(EPSPNG)

clean:
        rm $(EPSPNG)

%.png : %.eps
        convert -density 250 $^ $@
  • variable EPS contains names of all files matching *.eps
  • variable EPSPNG contains desirable names of .png files
    • it is created by taking filenames in EPS and changing .eps to .png
  • all is a "phony target" which is not really created
    • its rule has no commands but all .png files are prerequisites, so are done first
    • the first target in a Makefile (in this case all) is default when no other target is specified on the command-line
  • clean is also a phony target for deleting generated .png files

Useful special built-in target names

Include these lines in your Makefile if desired

.SECONDARY:
# prevents deletion of intermediate targets in chained rules

.DELETE_ON_ERROR:
# delete targets if a rule fails

Parallel make

Running make with option -j 4 will run up to 4 commands in parallel if their dependencies are already finished. This allows easy parallelization on a single computer.

Alternatives to Makefiles

  • Bioinformaticians often uses "pipelines" - sequences of commands run one after another, e.g. by a script or make
  • There are many tools developed for automating computational pipelines in bioinformatics, see e.g. this review: Jeremy Leipzig; A review of bioinformatic pipeline frameworks. Brief Bioinform 2016.
  • For example Snakemake
    • Snake workflows can contain shell commands or Python code
    • Big advantage compared to make: pattern rules may contain multiple variable portions (in make only one % per filename)
    • For example, assume we have several FASTA files and several profiles (HMMs) representing protein families and we want to run each profile on each FASTA file:
rule HMMER:
     input: "{filename}.fasta", "{profile}.hmm"
     output: "{filename}_{profile}.hmmer"
     shell: "hmmsearch --domE 1e-5 --noali --domtblout {output} {input[1]} {input[0]}"

HWmake

See also the lecture

Motivation: Building phylogenetic trees

The task for today will be to build a phylogenetic tree of 9 mammalian species using protein sequences

  • A phylogenetic tree is a tree showing evolutionary history of these species. Leaves are the present-day species, internal nodes are their common ancestors.
  • The input contains sequences of all proteins from each species (we will use only a smaller subset)
  • The process is typically split into several stages shown below

Identify ortholog groups

Orthologs are proteins from different species that "correspond" to each other. Orthologs are found based on sequence similarity and we can use a tool called blast to identify sequence similarities between pairs of proteins. The result of ortholog group identification will be a set of groups, each group having one sequence from each of the 9 species.

Align proteins on each group

For each ortholog group, we need to align proteins in the group to identify corresponding parts of the proteins. This is done by a tool called muscle

Unaligned sequences (start of protein O60568):

>human
MTSSGPGPRFLLLLPLLLPPAASASDRPRGRDPVNPEKLLVITVA...
>baboon
MTSSRPGLRLLLLLLLLPPAASASDRPRGRDPVNPEKLLVMTVA...
>dog
MASSGPGLRLLLGLLLLLPPPPATSASDRPRGGDPVNPEKLLVITVA...
>elephant
MASWGPGARLLLLLLLLLLPPPPATSASDRSRGSDRVNPERLLVITVA...
>guineapig
MAFGAWLLLLPLLLLPPPPGACASDQPRGSNPVNPEKLLVITVA...
>opossum
SDKLLVITAA...
>pig
AMASGPGLRLLLLPLLVLSPPPAASASDRPRGSDPVNPDKLLVITVA...
>rabbit
MGCDSRKPLLLLPLLPLALVLQPWSARGRASAEEPSSISPDKLLVITVA...
>rat
MAASVPEPRLLLLLLLLLPPLPPVTSASDRPRGANPVNPDKLLVITVA...

Aligned sequences:

rabbit       MGCDSRKPLL LLPLLPLALV LQPW-SARGR ASAEEPSSIS PDKLLVITVA ...
guineapig    MAFGA----W LLLLPLLLLP PPPGACASDQ PRGSNP--VN PEKLLVITVA ...
opossum      ---------- ---------- ---------- ---------- SDKLLVITAA ...
rat          MAASVPEPRL LLLLLLLLPP LPPVTSASDR PRGANP--VN PDKLLVITVA ...
elephant     MASWGPGARL LLLLLLLLLP PPPATSASDR SRGSDR--VN PERLLVITVA ...
human        MTSSGPGPRF LLLLPLLL-- -PPAASASDR PRGRDP--VN PEKLLVITVA ...
baboon       MTSSRPGLRL LLLLLLL--- -PPAASASDR PRGRDP--VN PEKLLVMTVA ...
dog          MASSGPGLRL LLGLLLLL-P PPPATSASDR PRGGDP--VN PEKLLVITVA ...
pig          AMASGPGLR- LLLLPLLVLS PPPAASASDR PRGSDP--VN PDKLLVITVA ...

Build phylogenetic tree for each grup

For each alignment, we build a phylogenetic tree for this group. We will use a program called phyml.

Example of a phylogenetic tree in newick format:

 ((opossum:0.09636245,rabbit:0.85794020):0.05219782,
(rat:0.07263127,elephant:0.03306863):0.01043531,
(dog:0.01700528,(pig:0.02891345,
(guineapig:0.14451043,
(human:0.01169266,baboon:0.00827402):0.02619598
):0.00816185):0.00631423):0.00800806);
Tree for gene O60568 (note: this particular tree does not agree well with real evolutionary history)

Build a consensus tree

The result of the previous step will be several trees, one for every group. Ideally, all trees would be identical, showing the real evolutionary history of the 9 species. But it is not easy to infer the real tree from sequence data, so the trees from different groups might differ. Therefore, in the last step, we will build a consensus tree. This can be done by using a tool called Phylip. The output is a single consensus tree.


Files and submitting

Our goal is to build a pipeline that automates the whole task using make and execute it remotely using qsub. Most of the work is already done, only small modifications are necessary.

  • Submit by copying requested files to /submit/make/username/
  • Do not forget to submit protocol, outline of the protocol is in /tasks/make/protocol.txt

Start by copying directory /tasks/make to your user directory

cp -ipr /tasks/make .
cd make

The directory contains three subdirectories:

  • large: a larger sample of proteins for task A
  • tiny: a very small set of proteins for task B
  • small: a slightly larger set of proteins for task C

Task A (long job)

  • In this task, you will run a long alignment job (more than two hours)
  • Use directory large with files:
    • ref.fa: selected human proteins
    • other.fa: selected proteins from 8 other mammalian species
    • Makefile: runs blast on ref.fa vs other.fa (also formats database other.fa before that)
  • run make -n to see what commands will be done (you should see makeblastdb, blastp, and echo for timing)
    • copy the output to the protocol
  • run qsub with appropriate options to run make (at least -cwd -b y)
  • then run qstat > queue.txt
    • Submit file queue.txt showing your job waiting or running
  • When your job finishes, check the following files:
    • the output file ref.blast
    • standard output from the qsub job, which is stored in a file named e.g. make.oX where X is the number of your job. The output shows the time when your job started and finished (this information was written by commands echo in the Makefile)
  • Submit the last 100 lines from ref.blast under the name ref-end.blast (use tool tail -n 100) and the file make.oX mentioned above

Task B (finishing Makefile)

  • In this task, you will finish a Makefile for splitting blast results into ortholog groups and building phylogenetic trees for each group
    • This Makefile works with much smaller files and so you can run it quickly many times without qsub
  • Work in directory tiny
    • ref.fa: 2 human proteins
    • other.fa: a selected subset of proteins from 8 other mammalian species
    • Makefile: a longer makefile
    • brm.pl: a Perl script for finding ortholog groups and sorting them to directories

The Makefile runs the analysis in four stages. Stages 1,2 and 4 are done, you have to finish stage 3

  • If you run make without argument, it will attempt to run all 4 stages, but stage 3 will not run, because it is missing
  • Stage 1: run as make ref.brm
    • It runs blast as in task A, then splits proteins into ortholog groups and creates one directory for each group with file prot.fa containing protein sequences
  • Stage 2: run as make alignments
    • In each directory with an ortholog group, it will create an alignment prot.phy and link it under names lg.phy and wag.phy
  • Stage 3: run as make trees (needs to be written by you)
    • In each directory with an ortholog group, it should create files lg.phy_phyml_tree and wag.phy_phyml_tree containing the results of the phyml program run with two different evolutionary models WAG and LG, where LG is the default
    • Run phyml by commands of the form:
      phyml -i INPUT --datatype aa --bootstrap 0 --no_memory_check >LOG
      phyml -i INPUT --model WAG --datatype aa --bootstrap 0 --no_memory_check >LOG
    • Change INPUT and LOG in the commands to the appropriate filenames using make variables $@, $^, $* etc. The input should come from lg.phy or wag.phy in the directory of a gene and log should be the same as tree name with extension .log added (e.g. lg.phy_phyml_tree.log)
    • Also add variables LG_TREES and WAG_TREES listing filenames of all desirable trees and uncomment phony target trees which uses these variables
  • Stage 4: run as make consensus
    • Output trees from stage 3 are concatenated for each model separately to files lg/intree, wag/intree and then phylip is run to produce consensus trees lg.tree and wag.tree
    • This stage also needs variables LG_TREES and WAG_TREES to be defined by you.
  • Run your Makefile and check that the files lg.tree and wag.tree are produced
  • Submit the whole directory tiny, including Makefile and all gene directories with tree files.


Task C (running make)

  • Copy your Makefile from part B to directory small, which contains 9 human proteins and run make on this slightly larger set
    • Again, run it without qsub, but it will take some time, particularly if the server is busy
  • Look at the two resulting trees (wag.tree, lg.tree) using the figtree program
    • it is available on vyuka, but you can also install it on your computer if needed
  • In figtree, change the position of the root in the tree to make the opossum the outgroup (species branching as the first away from the others). This is done by clicking on opossum and thus selecting it, then pressing the Reroot button.
  • Also switch on displaying branch labels. These labels show for each branch of the tree, how many of the input trees support this branch. To do this, use the left panel with options.
  • Export the trees in pdf format as wag.tree.pdf and lg.tree.pdf
  • Compare the two trees
    • Note that the two children of each internal node are equivalent, so their placement higher or lower in the figure does not matter.
    • Do the two trees differ? What is the highest and lowest support for a branch in each tree?
    • Also compare your trees with the accepted "correct tree" found here http://genome-euro.ucsc.edu/images/phylo/hg38_100way.png (note that this tree contains many more species, but all ours are included)
    • Write your observations to the protocol
  • Submit the entire small directory (including the two pdf files)

Further possibilities

Here are some possibilities for further experiments, in case you are interested (do not submit these):

  • You could copy your extended Makefile to directory large and create trees for all ortholog groups in the big set
    • This would take a long time, so submit it through qsub and only some time after the lecture is over to allow classmates to work on task A
    • After ref.brm si done, programs for individual genes can be run in parallel, so you can try running make -j 2 and request 2 threads from qsub
  • Phyml also supports other models, for example JTT (see manual); you could try to play with those.
  • Command touch FILENAME will change the modification time of the given file to the current time
    • What happens when you run touch on some of the intermediate files in the analysis in task B? Does Makefile always run properly?

Lpython

#HWpython

This lecture introduces the basics of the Python programming language. We will also cover basics of working with databases using the SQL language and SQLite3 lightweight database system.

The next three lectures

  • Computer science students will use Python, SQLite3 and several advanced Python libraries for complex data processing
  • Bioinformatics students will use several bioinformatics command-line tools

Overview, documentation

Python

SQL

  • Language for working with relational databases, more in a dedicated course
  • We will cover basics of SQL and work with a simple DB system SQLite3
  • Typical database systems are complex, use server-client architecture. SQLite3 is a simple "database" stored in one file. You can think of SQLite3 not as a replacement for Oracle but as a replacement for fopen().
  • SQLite3 documentation
  • SQL tutorial
  • SQLite3 in Python documentation

Outline of this lecture:

  • We introduce a simple data set
  • We look at several Python scripts for processing this data set
  • Solve task A, where you create another such script
  • We introduce basics of working with SQLite3 and writing SQL queries
  • Solve tasks B1 and B2, where you write your own SQL queries
  • We look at how to combine Python and SQLite
  • Solve task C, where you write a program combining the two
  • Students familiar with both Python and SQL may skip tasks A, B1, B2 and and do tasks C and D

Dataset for this lecture

  • IMDb is an online database of movies and TV series with user ratings
  • We have downloaded a preprocessed dataset of selected TV series ratings from GitHub
  • From this dataset, we have selected 10 series with high average number of voting users
  • Data are two files in csv format: list of series, list of episodes
  • csv stands for comma-separated values

File series.cvs contains one row per series

  • Columns: (0) series id, (1) series title, (2) TV channel:
3,Breaking Bad,AMC
2,Sherlock,BBC
1,Game of Thrones,HBO 

File episodes.csv contains one row per episode:

  • Columns: (0) series id, (1) episode title, (2) episode order within the whole series, (3) season number, (4) episode number within season, (5) user rating, (6) the number of votes
  • Here is a sample of 4 episodes from the Game of Thrones series
  • If the episode title contains a comma, the whole title is in quotation marks
1,"Dark Wings, Dark Words",22,3,2,8.6,12714
1,No One,58,6,8,8.3,20709
1,Battle of the Bastards,59,6,9,9.9,138353
1,The Winds of Winter,60,6,10,9.9,93680

Note that a different version of this data was used already in the lecture on Perl.

Several Python scripts

We will illustrate basic features of Python on several scripts working with these files.

prog1.py

The first script prints the second column (series title) from series.csv

#! /usr/bin/python3

# open a file for reading
with open('series.csv') as csvfile:
    # iterate over lines of the input file
    for line in csvfile:
        # split a line into columns at commas
        columns = line.split(",")
        # print the second column
        print(columns[1])
  • Python uses indentation to delimit blocks. In this example, the with command starts a block and within this block the for command starts another block containing commands columns=... and print. The body of each block is indented several spaces relative to the enclosing block.
  • Variables are not declared, but directly used. This program uses variables csvfile, line, columns.
  • The open command opens a file (here for reading, but other options are available).
  • The with command opens the file, stores the file handle in csvfile variable, executes all commands within its block and finally closes the file.
  • The for-loop iterates over all lines in the file, assigning each in variable line and executing the body of the block.
  • Method split of the built-in string type str splits the line at every comma and returns a list of strings, one for every column of the table (see also other string methods)
  • The final line prints the second column and the end of line character.

prog2.py

The following script prints the list of series of each TV channel

  • For illustration we also separately count the number of the series for each channel, but the count could be also obtained as the length of the list
#! /usr/bin/python3
from collections import defaultdict

# Create a dictionary in which default value
# for non-existent key is 0 (type int)
# For each channel we will count the series
channel_counts = defaultdict(int)

# Create a dictionary for keeping a list of series per channel
# default value empty list
channel_lists = defaultdict(list)

# open a file and iterate over lines
with open('series.csv') as csvfile:
    for line in csvfile:
        # strip whitespace (e.g. end of line) from end of line
        line = line.rstrip()
        # split line into columns, find channel and series names
        columns = line.split(",")
        channel = columns[2]
        series = columns[1]
        # increase counter for channel
        channel_counts[channel] += 1
        # add series to list for the channel
        channel_lists[channel].append(series)

# print counts
print("Counts:")
for (channel, count) in channel_counts.items():
    print("The number of series for channel \"%s\" is %d" 
    % (channel, count))

# print series lists
print("\nLists:")
for channel in channel_lists:
    list = ", ".join(channel_lists[channel]) 
    print("Series for channel \"%s\": %s" % (channel,list))
  • In this script, we use two dictionaries (maps, associative arrays), both having channel names as keys. Dictionary channel_counts stores the number of series, channel_lists stores the list of series names.
  • For simplicity we use a library data structure called defaultdict instead of a plain python dictionary. The reason is easier initialization: keys do not need to be explicitly inserted to the dictionary, but are initialized with a default value at the first access.
  • Reading of the input file is similar to the previous script
  • Afterwards we iterate through the keys of both dictionaries and print both the keys and the values
  • We format the output string using the % operator to replace %s and %d with values channel and count.
  • Notice that when we print counts, we iterate through pairs (channel, count) returned by channel_counts.items(), while when we print series, we iterate through keys of the dictionary

prog3.py

This script finds the episode with the highest number of votes among all episodes

  • We use a library for csv parsing to deal with quotation marks around episode names with commas, such as "Dark Wings, Dark Words"
  • This is done by first opening a file and then passing it as an argument to csv.reader, which returns a reader object used to iterate over rows.
#! /usr/bin/python3
import csv

# keep maximum number of votes and its episode
max_votes = 0
max_votes_episode = None

# open a file
with open('episodes.csv') as csvfile:
    # create a reader for parsing csv files
    reader = csv.reader(csvfile, delimiter=',', quotechar='"')
    # iterate over rows already split into columns
    for row in reader:
        votes = int(row[6])
        if votes > max_votes:
            max_votes = votes
            max_votes_episode = row[1]
        
# print result
print("Maximum votes %d in episode \"%s\"" % (max_votes, max_votes_episode))

prog4.py

The following script shows an example of function definition

  • The function reads a whole csv file into a 2d array
  • The rest of the program calls this function twice for each of the two files
  • This could be followed by some further processing of these 2d arrays
#! /usr/bin/python3
import csv

def read_csv_to_list(filename):
    # create empty list
    rows = []
    # open a file
    with open(filename) as csvfile:
        # create a reader for parsing csv files
        reader = csv.reader(csvfile, delimiter=',', quotechar='"')
        # iterate over rows already split into columns
        for row in reader:
            rows.append(row)
    return rows

series = read_csv_to_list('series.csv')
episodes = read_csv_to_list('episodes.csv')
print("the number of episodes is %d" % len(episodes))
# further processing of series and episodes...

Now do #HWpython, task A

SQL and SQLite

Creating a database

SQLite3 database is a file with your data stored in a special format. To load our csv file to a SQLite database, run command:

sqlite3 series.db < create_db.sql

Contents of create_db.sql file:

CREATE TABLE series (
  id INT,
  title TEXT,
  channel TEXT
);
.mode csv
.import series.csv series
CREATE TABLE episodes (
  seriesId INT,
  title TEXT,
  orderInSeries INT,
  season INT,
  orderInSeason INT,
  rating REAL,
  votes INT
);
.mode csv
.import episodes.csv episodes
  • The two CREATE TABLE commands create two tables named series and episodes
  • For each column (attribute) of the table we list its name and type.
  • Commands starting with a dot are special SQLite3 commands, not part of SQL itself. Command .import reads the txt file and stores it in a table.

Other useful SQLite3 commands;

  • .schema tableName (lists columns of a given table)
  • .mode column and .headers on (turn on human-friendly formatting (not good for further processing)

SQL queries

  • Run sqlite3 series.db to get an SQLite command-line where you can interact with your database
  • Then type the queries below which illustrate the basic features of SQL
  • In these queries, we use uppercase for SQL keywords and lowercase for our names of tables and columns (SQL keywords are not case sensitive)
/*  switch on human-friendly formatting */
.mode column
.headers on

/* print title of each series (as prog1.py) */
SELECT title FROM series;

/* sort titles alphabetically */
SELECT title FROM series ORDER BY title;

/* find the highest vote number among episodes */
SELECT MAX(votes) FROM episodes;

/* find the episode with the highest number of votes, as prog3.py */
SELECT title, votes FROM episodes
  ORDER BY votes DESC LIMIT 1;

/* print all episodes with at least 50k votes, order by votes */
SELECT title, votes FROM episodes
  WHERE votes>50000 ORDER BY votes DESC;

/* join series and episodes tables, print 10 episodes
 * with the highest number of votes */
SELECT s.title, e.title, votes
  FROM episodes AS e, series AS s
  WHERE e.seriesId=s.id
  ORDER BY votes DESC 
  LIMIT 10;

/* compute the number of series per channel, as prog2.py */
SELECT channel, COUNT() AS series_count
  FROM series GROUP BY channel;

/* print the number of episodes and average rating per season and series */
SELECT seriesId, season, COUNT() AS episode_count, AVG(rating) AS rating
  FROM episodes GROUP BY seriesId, season;

Parts of a typical SQL query:

  • SELECT followed by column names, or functions MAX, COUNT etc. These columns or expressions are printed for each row of the table, unless filtered out (see below). Individual columns of the output can be given aliases by keyword AS
  • FROM followed by a list of tables. Tables also can get aliases (FROM episodes AS e)
  • WHERE followed by a condition used for filtering the rows
  • ORDER BY followed by an expression used for sorting the rows
  • LIMIT followed by the maximum number of rows to print

More complicated concepts:

  • GROUP By allows to group rows based on common value of some columns and compute statistics per group (count, maximum, sum etc)
  • If you list two tables in FROM, you will conceptually create all pairs of rows, one from one table, one from the other. These are then typically filtered in the FROM clause to only those that have a matching ID (for example WHERE e.seriesId=s.id in one of the queries above)

Now do #HWpython, tasks B1 and B2.

Accessing a database from Python

We will use sqlite3 library for Python to access data from the database and to process them further in the Python program.

read_db.py

  • The following script illustrates running a SELECT query and getting the results
#! /usr/bin/python3
import sqlite3

# connect to a database 
connection = sqlite3.connect('series.db')
# create a "cursor" for working with the database
cursor = connection.cursor()

# run a select query
# supply parameters of the query using placeholders ?
threshold = 40000
cursor.execute("""SELECT title, votes FROM episodes
  WHERE votes>? ORDER BY votes desc""", (threshold,))

# retrieve results of the query
for row in cursor:
    print("Episode \"%s\" votes %s" % (row[0],row[1]))
    
# close db connection
connection.close()

write_db.py

This script illustrates creating a new database containing a multiplication table

#! /usr/bin/python3
import sqlite3

# connect to a database 
connection = sqlite3.connect('multiplication.db')
# create a "cursor" for working with the database
cursor = connection.cursor()

cursor.execute("""
CREATE TABLE mult_table (
a INT, b INT, mult INT)
""")

for a in range(1,11):
    for b in range(1,11):
        cursor.execute("INSERT INTO mult_table (a,b,mult) VALUES (?,?,?)",
                       (a,b,a*b))

# important: save the changes
connection.commit()
    
# close db connection
connection.close()

We can check the result by running command

sqlite3 multiplication.db "SELECT * FROM mult_table;"

Now do #HWpython, task C.

HWpython

See also the lecture

Introduction

Choose one of the options:

  • Tasks A, B1, B2, C (recommended for beginners)
  • Tasks C, D (recommended for experienced Python/SQL programmers)

Preparation

Copy files:

mkdir python
cd python
cp -iv /tasks/python/* .

The directory contains the following files:

  • *.py: python scripts from the lecture, included for convenience
  • series.csv, episodes.csv: data files introduced in the lecture
  • create_db.sql: SQL commands to create the database needed in tasks B1, B2, C, D
  • protocol.txt: fill in and submit the protocol.

Submit by copying requested files to /submit/python/username/

Task A (Python)

Write a script taskA.py which reads both csv files and outputs for each TV channel the total number of episodes in their series combined. Run your script as follows:

./taskA.py > taskA.txt

One of the lines of your output should be:

The number of episodes for channel "HBO" is 76

Submit file taskA.py with your script and the output file taskA.txt:

Hints:

  • A good place to start is prog4.py with reading both csv files and prog2.py with a dictionary of counters
  • It might be useful to build a dictionary linking the series id to the channel name for that series

Task B1 (SQL)

To prepare the database for tasks B1, B2, C and D, run the command:

sqlite3 series.db < create_db.sql

To verify that your database was created correctly, you can run the following commands:

sqlite3 series.db ".tables"
# output should be  episodes  series  

sqlite3 series.db "select count() from episodes; select count() from series;"
# output should be 348 and 10

The last query in the lecture counts the number of episodes and average rating per each season of each series:

SELECT seriesId, season, COUNT() AS episode_count, AVG(rating) AS rating
  FROM episodes GROUP BY seriesId, season;

Use join with the series table to replace the numeric series id with the series title and add the channel name. Write your SQL query to file taskB1.sql. The first two lines of the file should be

.mode column
.headers on

Run your query as follows:

sqlite3 series.db < taskB1.sql > taskB1.txt

For example, both seasons of True Detective by HBO have 8 episodes and average ratings 9.3 and 8.25

True Detective   HBO         1           8              9.3       
True Detective   HBO         2           8              8.25      

Submit taskB1.sql and taskB1.txt

Task B2 (SQL)

For each channel compute the total count and average rating of all their episodes. Write your SQL query to file taskB2.sql. As before, the first two lines of the file should be

.mode column
.headers on

Run your query as follows:

sqlite3 series.db < taskB2.sql > taskB2.txt

For example, all 76 episodes for the two HBO series have average rating as follows:

HBO         76          8.98947368421053

Submit taskB2.sql and taskB2.txt

Task C (Python+SQL)

If you have not done so already, create an SQLite database, as explained at the beginning of task B1.

Write a Python script that runs the last query from the lecture (shown below) and stores its results in a separate table called seasons in the series.db database file

/* print the number of episodes and average rating per season and series */
SELECT seriesId, season, COUNT() AS episode_count, AVG(rating) AS rating
  FROM episodes GROUP BY seriesId, season;
  • SQL can store the results from a query directly in a table, but in this task you should instead read each row of the SELECT query in Python and to store it by running INSERT command from Python
  • Also do not forget to create the new table in the database with appropriate column names and types. You can execute CREATE TABLE command from Python
  • The cursor from the SELECT query is needed while you iterate over the results. Therefore create two cursors - one for reading the database and one for writing.
  • If you change your database during debugging, you can start over by running the command for creating the database above
  • Store the script as taskC.py.

To check that your table was created, you can run command

sqlite3 series.db "SELECT * FROM seasons;"

This will print many lines, including this one: "5|1|8|9.3" which is for season 1 of series 5 (True Detective).

Submit your script taskC.py and the modified database series.db.

Task D (SQL, optionally Python)

For each pair of consecutive seasons within each series, compute how much has the average rating increased or decreased.

  • For example in the Sherlock series, season 1 had rating 8.825 and season 2 rating 9.26666666666667, and thus the difference in ratings is 0.44166666666667
  • Print a table containing series name, season number x, average rating in season x and average rating in season x+1
  • The table should be ordered by the difference between the last two columns, i.e. from seasons with the highest increase to seasons with the highest drop.
  • One option is to run a query in SQL in which you join the seasons table from task C with itself and select rows that belong to the same series and successive seasons
  • You can also read the rows of the seasons table in Python, combine information from rows for successive seasons of the same series and create the final report by sorting
  • Submit your code as taskD.py or taskD.sql and the resulting table as taskD.txt

The output should start like this (the formatting may differ):

series      season x    rating for x  rating for x+1  
----------  ----------  ------------  ----------------
Sherlock    1           8.825         9.26666666666667
Breaking B  4           9.0           9.375           

When using SQL without Python, include the following two lines in taskD.sql

.mode column
.headers on

and run your query as sqlite3 series.db < taskD.sql > taskD.txt

Lweb

#HWweb

Sometimes you may be interested in processing data which is available in the form of a website consisting of multiple webpages (for example an e-shop with one page per item or a discussion forum with pages of individual users and individual discussion topics).

In this lecture, we will extract information from such a website using Python and existing Python libraries. We will store the results in an SQLite database. These results will be analyzed further in the following lectures.

Scraping webpages

In Python, the simplest tool for downloading webpages is urllib2 library. Example usage:

import urllib2
f = urllib2.urlopen('http://www.python.org/')
print f.read()

You can also use requests package (this is recommended):

import requests
r = requests.get("http://en.wikipedia.org")
print(r.text[:10])

Parsing webpages

When you download one page from a website, it is in HTML format and you need to extract useful information from it. We will use beautifulsoup4 library for parsing HTML.

  • In your code, we recommend following the examples at the beginning of the documentation and the example of CSS selectors. Also you can check out general syntax of CSS selectors.
  • Information you need to extract is located within the structure of the HTML document
  • To find out, how is the document structured, use Inspect element feature in Chrome or Firefox (right click on the text of interest within the website). For example this text on the course webpage is located within LI element, which is within UL element, which is in 4 nested DIV elements, one BODY element and one HTML element. Some of these elements also have a class (starting with a dot) or an ID (starting with #).
  • Based on this information, create a CSS selector.

Parsing dates

To parse dates (written as a text), you have two options:

Other useful tips

  • Don't forget to commit changes to your SQLite3 database (call db.commit()).
  • SQL command CREATE TABLE IF NOT EXISTS can be useful at the start of your script.
  • Use screen command for long running scripts.
  • All packages are installed on our server. If you use your own laptop, you need to install them using pip (preferably in an virtualenv).

HWweb

See the lecture

Submit by copying requested files to /submit/web/username/

General goal: Scrape comments from user discussions at the sme.sk website. Store comments from several (hundreds) users from the last month in an SQLite3 database.

Task A

Create SQLite3 "database" with appropriate schema for storing comments from SME.sk discussions. You will probably need tables for users and comments. You don't need to store which comment replies to which one but store the date and time when the comment was made.

Submit two files:

  • db.sqlite3 - the database
  • schema.txt - a brief description of your schema and rationale behind it


Task B

Build a crawler, which crawls comments in sme.sk discussions. You have two options:

  • For fewer points: Script which gets URL of a user (e.g. http://ekonomika.sme.sk/diskusie/user_profile.php?id_user=157432) and crawls comments of this user from the last month.
  • For more points: Scripts which gets one starting URL (either user profile or some discussion, your choice) and automatically discovers users and crawls their comments.

This crawler should store the comments in SQLite3 database built in the previous task.

Submit the following:

  • db.sqlite3 - the database
  • every python script used for crawling
  • README (how to start your crawler)

Lflask

#HWflask

In this lecture, we will use Python to process user comments obtained in the previous lecture.

  • We will display information about individual users as a dynamic website written in Flask framework
  • We will use simple text processing utilities from ScikitLearn library to extract word use statistics from the comments

Flask

Flask is a simple web server for Python. Using Flask you can write a simple dynamic website in Python.


Running Flask

You can find a sample Flask application at /tasks/flask/simple_flask. Run it using these commands:

cd <your directory>
export FLASK_APP=main.py
export FLASK_ENV=development # this is optional, but recommended for debugging

# before running the following, change the port number
# so that no two users use the same number
flask run --port=PORT

PORT is a random number greater than 1024. This number should be different from other people running flask on the same machine (if you run into the problem where flask writes out lot of error messages complaining about permissions, select a different port number). Flask starts a webserver on port PORT and serves the pages created in your Flask application. Keep it running while you need to access these pages.

To view these pages, open a web browser on the same computer where the Flask is running, e.g. chromium-browser http://localhost:PORT/ (use the port number you have selected to run Flask). If you are running flask on a server, you probably want to run the web browser on your local machine. In such case, you need to use ssh tunnel to channel the traffic through ssh connection:

  • On your local machine, open another console window and create an ssh tunnel as follows: ssh -L PORT:localhost:PORT vyuka.compbio.fmph.uniba.sk (replace PORT with the port number you have selected to run Flask)
  • For Windows machines, follow a tutorial how to create an ssh tunnel
  • Keep this ssh connection open while you need to access your Flask web pages; it makes port PORT available on your local machine
  • In your browser, you can now access your Flask webpages, using e.g. chromium-browser http://localhost:PORT/

Structure of a Flask application

  • The provided Flask application resides in the main.py script.
  • Some functions in this script are annotated with decorators starting with @app.
  • Decorator @app.before_request marks a function which will be executed before processing a particular request from a web browser. In this case we open a database connection and store it in a special variable g which can be used to store variables for a particular request.
  • Decorator @app.route('/') marks a function which will serve the main page of the application with URL http://localhost:4247/. Similarly decorator @app.route('/wat/<random_id>/') marks a function which will serve URLs of the form http://localhost:4247/wat/100 where the particular string which the user uses in the URL (here 100) will be stored in random_id variable accessible within the function.
  • Functions serving a request return a string containing the requested webpage (typically a HTML document). For example, function wat returns a simple string without any HTML markup.
  • To more easily construct a full HTML document, you can use jinja2 templating language, as is done in the home function. The template itself is in file templates/main.html.


Processing text

The main tool we will use for processing text is CountVectorizer class from the Scikit-learn library. It transforms a text into a bag of words representation. In this representation we get the list of words and the count for each word. Example:

from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(strip_accents='unicode')

texts = [
 "Ema ma mamu.",
 "Zirafa sa vo vani kupe a hneva sa."
]

t = vec.fit_transform(texts).todense()

print(t)
# prints:
# [[1 0 0 1 1 0 0 0 0]
#  [0 1 1 0 0 2 1 1 1]]

print(vec.vocabulary_)
# prints:
# {'vani': 6, 'ema': 0, 'kupe': 2, 'mamu': 4, 
# 'hneva': 1, 'sa': 5, 'ma': 3, 'vo': 7, 'zirafa': 8}

NumPy arrays

Array t in the example above is a NumPy array provided by the NumPy library. This library has also lots of nice tricks. First let us create two matrices:

>>> import numpy as np
>>> a = np.array([[1,2,3],[4,5,6]])
>>> b = np.array([[7,8],[9,10],[11,12]])
>>> a
array([[1, 2, 3],
       [4, 5, 6]])
>>> b
array([[ 7,  8],
       [ 9, 10],
       [11, 12]])

We can sum these matrices or multiply them by some number:

>>> 3 * a
array([[ 3,  6,  9],
       [12, 15, 18]])
>>> a + 3 * a
array([[ 4,  8, 12],
       [16, 20, 24]])

We can calculate sum of elements in each matrix, or sum by some axis:

>>> np.sum(a)
21
>>> np.sum(a, axis=1)
array([ 6, 15])
>>> np.sum(a, axis=0)
array([5, 7, 9])

There are many other useful functions, check the documentation.

HWflask

See the lecture

General goal: Build a simple website, which lists all crawled users and for each users has a page with simple statistics regarding the posts of this user.


Submit your source code (web server and preprocessing scripts) and database files. Copy these files to /submit/flask/username/

This lesson requires crawled data from previous lesson. If you don't have one, you can find it at /tasks/flask/db.sqlite3

Task A

Create a simple Flask web application which:

  • Has a homepage with a list of all users (with links to their pages).
  • Has a page for each user with basic information: the nickname, the number of posts and the last 10 posts of this user.

Task B

Make a separate script which computes and stores in the database the floowing information for each user:

  • the list of 10 most frequently used words
  • the list of top 10 words typical for this user (words which this user uses much more often than other users). Come up with some simple heuristics for measuring this.

Show this information on the page of each user.

Hint: To get the most frequently used words for each user, you can use argsort from NumPy.

Task C

Preprocess and store the list of top three similar users for each user (try to come up with some simple definition of similarity based on the text in the posts). Again show this information on the user page.

Bonus: Try to use some simple topic modeling (e.g. PCA as in TruncatedSVD from scikit-learn) and use it for finding similar users.

Ljavascript

#HWjavascript


In this lecture we will extend the website from the previous lecture with interactive visualizations written in JavaScript. We will not cover details of the JavaScript programming language, only use visualization contained in the Google Charts library.

Your goal is to take examples from the documentation and tweak them for your purposes.

Tips:

  • Each graph contains also HTML+JS code example. That is a good startpoint.
  • You can write your data into JavaScript data structures (`var data` from examples) in a Flask template. You might need a jinja for loop (https://jinja.palletsprojects.com/en/2.11.x/templates/#for). Or you can produce string in Python, which you will put into a HTML. It is a (very) bad practice, but sufficient for this lecture. (A better way is to load data in JSON format through API).
  • Consult the previous lecture on running and accessing Flask applications.

HWjavascript

See the lecture

General goal: Extend the user pages from the previous lecture with simple visualizations.

Submit your source code to /submit/javascript/username/

Task A

Display a calendar, which shows during which days was the user active. Use calendar from Google Charts.

Task B

Show a histogram of comment lengths. Use histogram from Google Charts.

Task C

Either: Show a word tree for a user using word tree from Google Charts. Try to normalize the text before building the tree (convert to lowercase, remove accents). CountVectorizer has build_analyzer method, which returns a function, which does this for you.

Or: Pick some other appropriate visualization from the gallery, feed it with data a show it. Also add some description to it.

Lbioinf1

#HWbioinf1

The next three lectures at targeted at the students in the Bioinformatics program and the goal is to get experience with several common bioinformatics tools. Students will learn more about the algorithms and models behind these tools in the Methods in bioinformatics course.

Overview of DNA sequencing and assembly

DNA sequencing is a technology of reading the order of nucleotides along a DNA strand

  • The result is represented as a string of A,C,G,T
  • Only fragments of DNA of limited length can be read, these are called sequencing reads
  • Different technologies produce reads of different characteristics
  • Examples:
    • Illumina sequencers produce short reads (typical length 100-200bp), but in great quantities and very low error rate (<0.1%)
    • Illumina reads usually come in pairs sequenced from both ends of a DNA fragment of an approximately known length
    • Oxford nanopore sequencers produce longer reads (thousands of base pairs or more), but the error rates are higher (10-15%)


The goal of genome sequencing is to read all chromosomes of an organism

  • Sequencing machines produce many reads coming from different parts of the genome
  • Using software tools called sequence assemblers, these reads are glued together based on overlaps
  • Ideally we would get the true chromosomes, but often we get only shorter fragments called contigs
  • The results of assembly can contain errors
  • We prefer longer contigs with fewer errors

Sequence alignments and dotplots

A short video for this section: [3]

  • Sequence alignment is the task of finding similarities between DNA (or protein) sequences
  • Here is an example - short similarity between region at positions 344,447..344,517 of one sequence and positions 3,261..3,327 of another sequence
Query: 344447 tctccgacggtgatggcgttgtgcgtcctctatttcttttatttctttttgttttatttc 344506
              |||||||| |||||||||||||||||| ||||||| |||||||||||| ||   ||||||
Sbjct: 3261   tctccgacagtgatggcgttgtgcgtc-tctatttattttatttctttgtg---tatttc 3316

Query: 344507 tctgactaccg 344517
              |||||||||||
Sbjct: 3317   tctgactaccg 3327
  • Alignments can be stored in many formats and visualized as dotplots
  • In a dotplot, the x-axis correspond to positions in one sequence and the y-axis in another sequence
  • Diagonal lines show alignments between the sequences (direction of the diagonal shows which DNA strand was aligned)
Dotplot of human and Drosophila mitochondrial genomes

File formats

FASTA

  • FASTA is a format for storing DNA, RNA and protein sequences
  • We have already seen FASTA files in Perl exercises
  • Each sequence is given on several lines of the file. The first line starts with ">" followed by an identifier of the sequence and optionally some further description separated by whitespace
  • The sequence itself is on the second line; long sequences are split into multiple lines
>SRR022868.1845_1
AAATTTAGGAAAAGATGATTTAGCAACATTTAGCCTTAATGAAAGACCAGATTCTGTTGCCATGTTTGAA...
>SRR022868.1846_1
TAGCGTTGTAAAATAAATTTCTAGAATGGAAGTGATGATATTGAAATACACTCAGATCCTGAATGAAAGA...

FASTQ

  • FASTQ is a format for storing sequencing reads, containing DNA sequences but also quality information about each nucleotide
  • More in the lecture on Perl

SAM/BAM

  • SAM and BAM are formats for storing alignments of sequencing reads (or other sequences) to a genome
  • For each read, the file contains the read itself, its quality, but also the chromosome/contig name and position where this read is likely coming from, and an additional information e.g. about mapping quality (confidence in the correct location)
  • SAM files are text-based, thus easier to check manually; BAM files are binary and compressed, thus smaller and faster to read
  • We can easily convert between SAM and BAM using samtools
  • Full documentation of the format

PAF format

Gzip

  • Gzip is a general-purpose tool for file compression
  • It is often used in bioinformatics on large FASTQ or FASTA files
  • Running command gzip filename.ext will create compressed file filename.ext.gz (original file will be deleted).
  • The reverse process is done by gunzip filename.ext.gz (this deletes the gziped file and creates the uncompressed version)
  • However, we can access the file without uncompressing it. Command zcat filename.ext.gz prints the content of a gzipped file and keeps the gzipped file as is. We can use pipes | to do further processing on the file.
  • To manually page through the content of a gzipped file use zless filename.ext.gz
  • Some bioinformatics tools can work directly with gzipped files.

HWbioinf1

See also the lecture

Submit the protocol and the required files to /submit/bioinf1

Technical notes

  • Task D and task E ask you too look at data visualizations
  • If you are unable to open graphical applications from our server, you can download appropriate files and view them on your computer (in task D these are simply pdf files, in task E you would have to install IGV software on your computer)

Task A: examine input files

Copy files from /tasks/bioinf1/ as follows:

mkdir bioinf1
cd bioinf1
cp -iv /tasks/bioinf1/* .
  • ref.fasta is a piece of genome from Escherichia coli
  • miseq_R1.fastq.gz and miseq_R2.fastq.gz are sequencing reads from Illumina MiSeq sequencer. First reads in pairs are in the R1 file, second reads in the R2 file. These reads come from the region in ref.fasta
  • nanopore.fasta are nanopore sequencing reads in FASTA format (without qualities). These reads are also from the region in ref.fasta

Try to find the answers to the following questions using command-line tools. In your protocol, note down the commands as well as the answers.

(a) How many reads are in the MiSeq files? Is the number of reads the same in both files?

  • Try command zcat miseq_R1.fastq.gz | wc -l
  • Can you figure out the answer from the result of this command?

(b) How long are individual reads in the MiSeq files?

  • Look at the file using zless - do all reads appear to be of an equal length?
  • Extend the following command with tail and wc -c to get the length of the first read: zcat miseq_R1.fastq.gz | head -n 2
  • Do not forget to consider the end of the line character
  • Repeat for both MiSeq files

(c) How many reads are in the nanopore file (beware - different format)

(d) What is the average length of the reads in the nanopore file?

  • Try command: samtools faidx nanopore.fasta
  • This creates nanopore.fasta.fai file, where the second column contains sequence length of each read
  • Compute the average of this column by a one-liner: perl -lane '$s+=$F[1]; $n++; END { print $s/$n }' nanopore.fasta.fai

(e) How long is the sequence in the ref.fasta file?

Task B: assemble the sequence from the reads

  • We will pretend that the correct answer (ref.fasta) is not known and we will try to assemble it from the reads
  • We will assemble Illumina reads by program SPAdes and nanopore reads by miniasm
  • Assembly takes several minutes, we will run it in the background using screen command

SPAdes

  • Run screen -S spades
  • Press Enter to get command-line, then run the following command:
spades.py -t 1 -m 1 --pe1-1 miseq_R1.fastq.gz --pe1-2 miseq_R2.fastq.gz -o spades > spades.log
  • Press Ctrl-a followed by d
  • This will take you out of screen command
  • Run top command to check that your command is running

Miniasm

  • Create file miniasm.sh containing the following commands:
# Find alignments between pairs of reads
minimap2 -x ava-ont -t 1 nanopore.fasta nanopore.fasta | gzip -1 > nanopore.paf.gz 
# Use overlaps to compute the assembled genome
miniasm -f nanopore.fasta nanopore.paf.gz > miniasm.gfa 2> miniasm.log
# Convert genome to fasta format
perl -lane 'print ">$F[1]\n$F[2]" if $F[0] eq "S"' miniasm.gfa > miniasm.fasta
# Align reads to the assembled genome
minimap2 -x map-ont --secondary=no -t 1 miniasm.fasta nanopore.fasta | gzip -1 > miniasm.paf.gz
# Polish the genome by finding consensus of aligned reads at each position
racon -t 1 -u nanopore.fasta miniasm.paf.gz miniasm.fasta > miniasm2.fasta
  • Run screen -S miniasm
  • In screen, run source ./miniasm.sh
  • Press Ctrl-a d to exit screen


To check if your commands have finished:

  • Re-enter the screen environment using screen -r spades or screen -r miniasm
  • If the command finished, terminate screen by pressing Ctrl-d or typing exit

Examine the outputs. Write commands and answers to your protocol.

  • Copy output of SPAdes under a new filename: cp -ip spades/contigs.fasta spades.fasta
  • Output of miniasm should be in miniasm2.fasta

(a) How many contigs are in each of these two files?

(b) What can you find out from the names of contigs in spades.fasta? What is the length of the shortest and longest contigs? String cov in the names is abbreviation of read coverage - the average number of reads covering a position on the contig. Do the reads have similar coverage, or are there big differences?

  • Use command grep '>' spades.fasta

(c) What are the lengths of contigs in miniasm2.fa file? (you can use LN:i: in the name of contigs)

Submit files miniasm2.fasta and spades.fasta

Task C: compare assemblies using Quast command

We have found basic characteristics of the two assemblies in task B. Now we will use program Quast to compare both assemblies to the correct answer in ref.fa

quast.py -R ref.fasta miniasm2.fasta spades.fasta -o stats

Submit file stats/report.txt.

Look at the results in stats/report.txt and answer the following questions.

(a) How many contigs has quast reported in the two assemblies? Does it agree with your counts in part B?

(b) What is the number of mismatches per 100kb in the two assemblies? Which one is better? Why do you think it is so? (look at the properties of used sequencing technologies in the lecture)

(c) What portion of the reference sequence is covered by the two assemblies (reported as genome fraction)? Which assembly is better in this aspect?

(d) What is the length of the longest alignment between contigs and the reference in the two assemblies? Which assembly is better in this aspect?

Task D: create dotplots of assemblies

We will now visualize alignments between each assembly and the reference genome using dotplots. As in other tasks, write commands and answers to your protocol.

(a) Create a dotplot comparing miniasm assembly to the reference sequence

# alignments
minimap2 -x asm10 -t 1 ref.fasta miniasm2.fasta > ref-miniasm2.paf
# creating dotplot
/usr/local/share/miniasm/miniasm/minidot -f 12 ref-miniasm2.paf | \
  ps2pdf -dEPSCrop - ref-miniasm2.pdf
# displaying dotplot
# if evince does not work, copy the pdf file to your commputer and view there
evince ref-miniasm2.pdf &
  • x-axis is reference, y-axis assembly
  • Which part of the reference is missing in the assembly?
  • Do you see any other big differences between the assembly and the reference?

(b) Use analogous commands to create a dotplot for spades assembly, call it ref-spades.pdf

  • What are vertical gray lines in the dotplot?
  • Is any contig aligning to multiple places in the reference? To how many places?

(c) Use analogous commands to create a dotplot of reference to itself, call it ref-ref.pdf

  • However, in the minimap2 command add option -p 0 to include also weaker self-alignments
  • Do you see any self-alignments, showing repeated sequences in the reference? Does this agree with the dotplot in part (b)?

Submit all three pdf files ref-miniasm2.pdf, ref-spades.pdf, ref-ref.pdf

Task E: Align reads and assemblies to reference, visualize in IGV

Finally, we will align all source reads as well as assemblies to the reference genome, then visualize the alignments in IGV tool.

A short video introducing IGV: [4]

  • Write commands and answers to your protocol
  • Submit all four BAM files ref-miseq.bam, ref-nanopore.bam, ref-spades.bam, ref-miniasm2.bam

(a) Align illumina reads (MiSeq files) to the reference sequence

# align illumina reads to reference
# minimap produces SAM file, samtools view converts to BAM, 
# samtools sort orders by coordinate
minimap2 -a -x sr --secondary=no -t 1 ref.fasta  miseq_R1.fastq.gz miseq_R2.fastq.gz | \
  samtools view -S -b - |  samtools sort - ref-miseq
# index BAM file for faster access
samtools index ref-miseq.bam

(b) Similarly align nanopore reads, but instead of -x sr use -x map-ont, call the result ref-nanopore.bam, ref-nanopore.bam.bai

(c) Similarly align spades.fasta, but instead of -x sr use -x asm10, call the result ref-spades.bam

(d) Similarly align miniasm2.fasta, but instead of -x sr use -x asm10, call the result ref-miniasm2.bam

(e) Run the IGV viewer. Beware: It needs a lot of memory, do not keep open on the server unnecessarily

  • igv -g ref.fasta &
  • Using Menu->File->Load from File, open all four BAM files
  • Look at region ecoli-frag:224,000-244,000
  • How many spades contigs do you see aligning in this region?
  • Look at region ecoli-frag:227,300-227,600
  • Comment on what you see. How frequent are errors in the individual assemblies and read sets?
  • If you are unable to run igv from home, you can install it on your computer [5] and download ref.fasta and all .bam and .bam.bai files

Lbioinf2

#HWbioinf2

Eukaryotic gene structure

  • Recall the Central dogma of molecular biology: the flow of genetic information from DNA to RNA to protein (gene expression)
  • In eukaryotes, mRNA often undergoes splicing, where introns are removed and exons are joined together
  • The very start and end of mRNA remain untranslated (UTR = untranslated region)
  • The coding part of the gene starts with a start codon, contains a sequence of additional codons and ends with a stop codon. Codons can be interrupted by introns.
Gene expression in eukaryotes

Computational gene finding

  • Input: DNA sequence (an assembled genome or a part of it)
  • Output: positions of protein coding genes and their exons
  • If we know the exact position of coding regions of a gene, we can use the genetic code table to predict the protein sequence encoded by it.
  • Gene finders use statistical features observed from known genes, such as typical sequence motifs near the start codons, stop codons and splice sites, typical codon frequencies, typical exon and intron lengths etc.
  • These statistical parameters need to be adjusted for each genome.
  • We will use a gene finder called Augustus.

Gene expression

  • Not all genes undergo transcription and translation all the time and at the same level.
  • The processes of transcription and translation are regulated according to cell needs.
  • The term "gene expression" has two meanings:
    • the process of transcription and translation (synthesis of a gene product),
    • the amount of mRNA or protein produced from a single gene (genes with high or low expression).

RNA-seq technology can sequence mRNA extracted from a sample of cells.

  • We can align sequenced reads back to the genome.
  • The number of reads coming from a gene depends on its expression level (and on its length).

HWbioinf2

See also the lecture

Submit the protocol and the required files to /submit/bioinf2

Input files

Copy files from /tasks/bioinf2/

mkdir bioinf2
cd bioinf2
cp -iv /tasks/bioinf2/* .

Files:

  • ref.fasta is a 38kb piece of the genome of the fungus Aspergillus nidulans
  • rnaseq.fastq are RNA-seq reads from Illumina sequencer extracted from the Short read archive
  • annot.gff is the reference gene annotation from the database (we will consider this as correct gene positions)

Task A: Gene finding

Run the Augustus gene finder with two versions of parameters:

  • one trained specifically for A. nidulans genes
  • one trained for the human genome, where genes have different statistical properties (for example, they are longer and have more introns)
augustus --species=anidulans ref.fasta > augustus-anidulans.gtf
augustus --species=human ref.fasta > augustus-human.gtf

The results of gene finding are in the GTF format. Rows starting with # are comments, each of the remaining rows describes some interval of the sequence. If the second column is CDS, it is a coding part of an exon. The reference annotation annot.gff is in a similar format called GFF3.

Examine the files and try to find the answers to the following questions using command-line tools

(a) How many exons are in each of the two GTF files? (Beware: simply using grep with pattern CDS may yield lines containing this string in a different column. You can use e.g. techniques from the lecture and exercises on command-line tools).

(b) How many genes are in each of the two GTF files? (The files contain rows with word gene in the second column, one for each gene)

(c) How many exons and genes are in the annot.gff file?

Write the anwsers and commands to the protocol. Submit files augustus-anidulans.gtf and augustus-human.gtf.

Task B: Aligning RNA-seq reads

  • Align RNA-seq reads to the genome
  • We will use a specialized tool tophat, which can recognize introns
  • Then we will sort and index the BAM file, similarly as in the previous lecture
bowtie2-build ref.fasta ref.fasta
tophat2 -i 10 -I 10000 --max-multihits 1 --output-dir rnaseq ref.fasta rnaseq.fastq
samtools sort rnaseq/accepted_hits.bam rnaseq
samtools index rnaseq.bam

In addition to the BAM file, TopHat produced several other files in the rnaseq folder. Examine them to find out answers to the following questions (you can do it manually by looking at the the files, e.g. by less command):

(a) How many reads were in the FASTQ file? How many of them were successfully mapped?

(b) How many introns ("junctions") were predicted? How many of them are supported by more than one read? (The 5th column of the corresponding file is the number of reads supporting a junction.)

Write answers to the protocol. Submit the file rnaseq.bam.

Task C: Visualizing in IGV

As before, run IGV as follows:

igv -g ref.fasta &

Open additional files using menu File -> Load from File: annot.gff, augustus-anidulans.gtf, augustus-human.gtf, rnaseq.bam

  • Exons are shown as thicker boxes, introns are thinner.
  • For each of the following questions, select a part of the sequence illustrating the answer and export a figure using File->Save image
  • You can check these images using command eog

Questions:

(a) Create an image illustrating differences between Augustus with human parameters and the reference annotation, save as a.png. Briefly describe the differences in words.

(b) Find some differences between Augustus with A. nidulans parameters and the reference annotation. Store an illustrative figure as b.png. Which parameters have yielded a more accurate prediction?

(c) Zoom in to one of the genes with a high expression level and try to find spliced read alignments supporting the annotated intron boundaries. Store the image as c.png.

Submit files a.png, b.png, c.png. Write answers to your protocol.

Lbioinf3

#HWbioinf3

Polymorphisms

  • Individuals within species differ slightly in their genomes
  • Polymorphisms are genome variants which are relatively frequent in a population (e.g. at least 1%)
  • SNP: single-nucleotide polymorphism (a polymorphism which is a substitution of a single nucleotide)
  • Recall that most human cells are diploid, with one set of chromosomes inherited from the mother and the other from the father
  • At a particular location, a single human can thus have two different alleles (heterozygosity) or two copies of the same allele (homozygosity)

Finding polymorphisms / genome variants

  • We compare sequencing reads coming from an individual to a reference genome of the species
  • First we align them, as in the exercises on genome assembly
  • Then we look for positions where a substantial fraction of reads does not agree with the reference (this process is called variant calling)

Programs and file formats

Human variants

  • For many human SNPs we already know something about their influence on phenotype and their prevalence in different parts of the world
  • There are various databases, e.g. dbSNP, OMIM, or user-editable SNPedia

UCSC genome browser

A short video for this section: [6]

  • On-line tool similar to IGV
  • http://genome-euro.ucsc.edu/
  • Nice interface for browsing genomes, lot of data for some genomes (particularly human), but not all sequenced genomes represented

Basics

  • On the front page, choose Genomes in the top blue menu bar
  • Select a genome and its version, optionally enter a position or a keyword, press submit
  • On the browser screen, the top image shows chromosome map, the selected region is in red
  • Below there is a view of the selected region and various tracks with information about this region
  • For example some of the top tracks display genes (boxes are exons, lines are introns)
  • Tracks can be switched on and off and configured in the bottom part of the page (browser supports different display levels, full contains all information but takes a lot of vertical space)
  • Buttons for navigation are at the top (move, zoom, etc.)
  • Clicking at the browser figure allows you to get more information about a gene or other displayed item
  • In this lecture, we will need tracks GENCODE and dbSNP - check e.g. gene ACTN3 and within it SNP rs1815739 in exon 15

Blat

  • For sequence alignments, UCSC genome browser offers a fast but less sensitive BLAT (good for the same or very closely related species)
  • Choose Tools->Blat in the top blue menu bar, enter DNA sequence below, search in the human genome
    • What is the identity level for the top found match? What is its span in the genome? (Notice that other matches are much shorter)
    • Using Details link in the left column you can see the alignment itself, Browser link takes you to the browser at the matching region
AACCATGGGTATATACGACTCACTATAGGGGGATATCAGCTGGGATGGCAAATAATGATTTTATTTTGAC
TGATAGTGACCTGTTCGTTGCAACAAATTGATAAGCAATGCTTTCTTATAATGCCAACTTTGTACAAGAA
AGTTGGGCAGGTGTGTTTTTTGTCCTTCAGGTAGCCGAAGAGCATCTCCAGGCCCCCCTCCACCAGCTCC
GGCAGAGGCTTGGATAAAGGGTTGTGGGAAATGTGGAGCCCTTTGTCCATGGGATTCCAGGCGATCCTCA
CCAGTCTACACAGCAGGTGGAGTTCGCTCGGGAGGGTCTGGATGTCATTGTTGTTGAGGTTCAGCAGCTC
CAGGCTGGTGACCAGGCAAAGCGACCTCGGGAAGGAGTGGATGTTGTTGCCCTCTGCGATGAAGATCTGC
AGGCTGGCCAGGTGCTGGATGCTCTCAGCGATGTTTTCCAGGCGATTCGAGCCCACGTGCAAGAAAATCA
GTTCCTTCAGGGAGAACACACACATGGGGATGTGCGCGAAGAAGTTGTTGCTGAGGTTTAGCTTCCTCAG
TCTAGAGAGGTCGGCGAAGCATGCAGGGAGCTGGGACAGGCAGTTGTGCGACAAGCTCAGGACCTCCAGC
TTTCGGCACAAGCTCAGCTCGGCCGGCACCTCTGTCAGGCAGTTCATGTTGACAAACAGGACCTTGAGGC
ACTGTAGGAGGCTCACTTCTCTGGGCAGGCTCTTCAGGCGGTTCCCGCACAAGTTCAGGACCACGATCCG
GGTCAGTTTCCCCACCTCGGGGAGGGAGAACCCCGGAGCTGGTTGTGAGACAAATTGAGTTTCTGGACCC
CCGAAAAGCCCCCACAAAAAGCCG

HWbioinf3

See also the lecture

Submit the protocol and the required files to /submit/bioinf3

Input files

Copy files from /tasks/bioinf3/

mkdir bioinf3
cd bioinf3
cp -iv /tasks/bioinf3/* .

Files:

  • humanChr7Region.fasta is a 7kb piece of the human chromosome 7
  • motherChr7Region.fastq is a sample of reads from an anonymous donor known as NA12878; these reads come from region in humanChr7Region.fasta
  • fatherChr12.vcf and motherChr12.vcf are single-nucleotide variants on the chromosome 12 obtained by sequencing two individuals NA12877, NA12878 (these come from a larger family)

Task A: read mapping and variant calling

Align reads to the reference:

bwa index humanChr7Region.fasta
bwa mem humanChr7Region.fasta  motherChr7Region.fastq | \
  samtools view -S -b - |  samtools sort - motherChr7Region
samtools index motherChr7Region.bam

Call variants:

freebayes -f humanChr7Region.fasta --min-alternate-count 10 \
  motherChr7Region.bam >motherChr7Region.vcf

Run IGV, use humanChr7Region.fasta as genome, open motherChr7Region.bam and motherChr7Region.vcf. Looking at the aligned reads and the VCF file, answer the following questions:

(a) How many variants were found in the VCF file?

(b) How many variants are heterozygous and how many are homozygous?

(c) Are all variants single-nucleotide variants or do you also see some insertions/deletions (indels)?

Also export the overall view of the whole region from IGV to file motherChr7Region.png.

Submit the following files: motherChr7Region.png, motherChr7Region.bam, motherChr7Region.vcf

Task B: UCSC browser

(a) Where is sequence from regionChr7.fasta located in the browser?

  • Go to http://genome-euro.ucsc.edu/, from the blue menu, select Tools->Blat
  • Check that Blat uses Human, hg38 assembly
  • Open regionChr7.fasta in a graphical editor (e.g. kate), select all, paste to the BLAT window, run BLAT
  • In the table of results, the best result should have identity close to 100% and span close to 7kb
  • For this best result, click on the link named Browser
  • Report which chromosome and which region you get

(b) Which gene is located in this region?

  • Once you are in the browser, press Default tracks button
  • Track named GENCODE contains known genes, shown as rectangles (exons) connected by lines (introns). Short gene names are next to them.
  • Report the name of the gene in the region

(c) In which tissue is this gene most highly expressed? What is the function of this gene?

  • When you click on the gene (possibly twice), you get an information page which starts with a summary of the known function of this gene. Copy the first sentence to your protocol.
  • Further down on the gene information page you see RNA-Seq Expression Data (colorful boxplots). Find out which tissues have the highest signal.

(d) Which SNPs are located in this gene? Which trait do they influence?

  • You can see SNPs in the Common SNPs(151) track, but their IDs appear only after switching this track to pack mode. You can click on each SNPs to see more information and to copy their ID to your protocol.
  • Information page of the gene (part c) also describes function of various alleles of this gene (see e.g. part POLYMORPHISM).
  • You can also find information about individual SNPs by looking for them by their ID in SNPedia (not required in this task)

Task C: Examining larger vcf files

In this task, we will look at motherChr12.vcf and fatherChr12.vcf files and compute various statistics. You can use command-line tools, such as grep, wc, sort, uniq and Perl one-liners (as in #Lbash), or you can write small scripts in Perl or Python (as in #Lperl and #Lpython).

  • Write all used commands to your protocol
  • If you write any scripts, submit them as well

Questions:

(a) How many SNPs are in each file?

  • This can be found easily by wc, only make sure to exclude lines with comments

(b) How many heterozygous SNPs are in each file?

  • The last column contains 1|1 for homozygous and either 0|1 or 1|0 for heterozygous SNPs
  • Character | has special meaning on the command line and in grep patterns; make sure to place it in apostrophes ' ' and possibly escape it with backslash \

(c) How many SNP positions are shared between the two files?

  • The second column of each file lists the position. We want to compute the size of intersection of the set of positions in motherChr12.vcf and fatherChr12.vcf files
  • You can e.g. create temporary files containing only positions from the two files and sort them alphabetically. Then you can find the intersection using comm command with options -1 -2. Alternatively, you can store positions as keys in a hash table (dictionary) in a Perl or Python script.

(d) List the 5 most frequent pairs of reference/alternate allele in motherChr12.vcf and their frequencies. Do they correspond to transitions or transversions?

  • The fourth column contains the reference value, fifth column the alternate value. For example, the first SNP in motherChr12.vcf has a pair C,A.
  • For each possible pair of nucleotides, find how many times it occurs in the motherChr12.vcf
  • For example, pair C,A occurs 6894 times
  • Then sort the pairs by their frequencies and report 5 most frequent pairs
  • Mutations can be classified as transitions and transversions. Transitions change purine to purine or pyrimidine to pyrimidine, transversions change a purine to pyrimidine or vice versa. For example, pair C,A is a transversion changing pyrimidine C to purine A. Which of these most frequent pairs correspond to transitions and which to transversions?
  • To count pairs without writing scripts, you can create a temporary file containing only columns 4 and 5 (without comments), and then use commands sort and uniq to count each pair.

(e) Which parts of the chromosome have the highest and lowest density of SNPs in motherChr12.vcf?

  • First create a list of windows of size 100kb covering the whole chromosome 12 using these two commands:
perl -le 'print "chr12\t133275309"' > humanChr12.size
bedtools makewindows -g humanChr12.size -w 100000 -i srcwinnum > humanChr12-windows.bed
  • Then count SNPs in each window using this command:
bedtools coverage -a  humanChr12-windows.bed -b motherChr12.vcf > motherChr12-windows.tab
  • Find out which column of the resulting file contains the number of SNPs per window, e.g. by reading the documentation obtained by command bedtools coverage -h
  • Sort according to the column with the SNP number, look at the first and last line of the sorted file
  • For checking: the second highest count is 387 in window with coordinates 20,800,000-20,900,000

Lr1

#HWr1 · Video introduction

Program for this lecture: basics of R

  • A very short introduction will be given as a lecture.
  • Exercises have the form of a tutorial: read a bit of text, try some commands, extend/modify them as requested in individual tasks

In this course we cover several languages popular for scripting and data processing: Perl, Python, R.

  • Their capabilities overlap, many extensions emulate strengths of one in another.
  • Choose a language based on your preference, level of knowledge, existing code for the task, the rest of the team.
  • Quickly learn a new language if needed.
  • Also possibly combine, e.g. preprocess data in Perl or Python, then run statistical analyses in R, automate entire pipeline with bash or make.

Introduction

  • R is an open-source system for statistical computing and data visualization
  • Programming language, command-line interface
  • Many built-in functions, additional libraries
  • We will concentrate on useful commands rather than language features

Working in R

Option 1: Run command R, type commands in a command-line interface

  • It supports history of commands (arrows, up and down, Ctrl-R) and completing command names with the tab key

Option 2: Write a script to a file, run it from the command-line as follows:
R --vanilla --slave < file.R

Option 3: Use rstudio command to open a graphical IDE

  • Sub-windows with editor of R scripts, console, variables, plots
  • Ctrl-Enter in editor executes the current command in console
  • You can also install RStudio on your home computer and work there

In R, you can create plots. In command-line interface these open as a separate window, in Rstudio they open in one of the sub-windows.

x=c(1:10)
plot(x,x*x)

Suggested workflow

  • work interactively in Rstudio or on command line, try various options
  • select useful commands, store in a script
  • run script automatically on new data/new versions, potentially as a part of a bigger pipeline

Additional information

Gene expression data

  • DNA molecules contain regions called genes, which are "recipes" for making proteins
  • Gene expression is the process of creating a protein according to the "recipe"
  • It works in two stages: first a gene is copied (transcribed) from DNA to RNA, then translated from RNA to protein
  • Different proteins are created in different quantities and their amount depends on the needs of a cell
  • There are several technologies (microarray, RNA-seq) for measuring the amount of RNA for individual genes, this gives us some measure how active each gene is under given circumstances

Gene expression data

  • Rows: genes
  • Columns: experiments (e.g. different conditions or different individuals)
  • Each value is the expression of a gene, i.e. the relative amount of mRNA for this gene in the sample

We will use a data set for yeast:

Part of the file (only first 4 experiments and first 3 genes shown), strings 2mic_D_protein, AAC3, AAD15 are identifiers of genes

,control1,control2,control3,acetate1,acetate2,acetate3,...
2mic_D_protein,1.33613934199651,1.13348900359964,1.2726678684356,1.42903234691804,...
AAC3,0.558482767397578,0.608410781454015,0.6663002997292,0.231622581964063,...
AAD15,-0.927871996497105,-1.04072379902481,-1.01885986692013,-2.62459941525552,...

HWr1

See also the lecture

In this homework, try to read the text, execute given commands, potentially trying some small modifications. Within the tutorial, you will find tasks A-E to complete in this exercise.

  • Submit the required files (4x .png)
  • In your protocol, enter the commands used in all tasks, with explanatory comments in more complicated situations
  • In tasks B and D also enter the required output to the protocol
  • Protocol template in /tasks/r1/protocol.txt

The first steps

Type a command, R writes the answer, e.g.:

> 1+2
[1] 3

We can store values in variables and use them later:

> # population of Slovakia in millions, 2019
> population = 5.457
> population
[1] 5.457
> # area of Slovakia in thousands of km2
> area = 49.035
> density = population / area
> density
[1] 0.1112879

Surprises in the R language:

  • dots are used as parts of id's, e.g. read.table is name of a single function (not a method for the object read)
  • assignment via <- or =
  • vectors etc are indexed from 1, not from 0

Vectors, basic plots

A vector is a sequence of values of the same type (all are numbers or all are strings or all are booleans)

# Vector can be created from a list of numbers by function named c
a = c(1,2,4)
a
# prints [1] 1 2 4

# c also concatenates vectors
c(a,a)
# prints [1] 1 2 4 1 2 4

# Vector of two strings 
b = c("hello", "world")

# Create a vector of numbers 1..10
x = 1:10
x
# prints [1]  1  2  3  4  5  6  7  8  9 10

Vector arithmetic

Many operations can be easily applied to each member of a vector

x = 1:10
# Square each number in vector x
x*x
# prints [1]   1   4   9  16  25  36  49  64  81 100

# New vector y: logarithm of a number in x squared
y = log(x*x)
y
# prints [1] 0.000000 1.386294 2.197225 2.772589 3.218876 3.583519 3.891820 4.158883
# [9] 4.394449 4.605170

# Draw the graph of function log(x*x) for x=1..10
plot(x,y)
# The same graph but use lines instead of dots
plot(x,y,type="l")

# Addressing elements of a vector: positions start at 1
# Second element of the vector 
y[2]
# prints [1] 1.386294

# Which elements of the vector satisfy certain condition? 
# (vector of logical values)
y>3
# prints [1] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE

# write only those elements from y that satisfy the condition
y[y>3]
# prints [1] 3.218876 3.583519 3.891820 4.158883 4.394449 4.605170

# we can also write values of x such that values of y satisfy the condition...
x[y>3]
# prints [1]  5  6  7  8  9 10

Alternative plotting facilities: ggplot2 library, lattice library

Task A

Create a plot of the binary logarithm with dots in the graph more densely spaced (from 0.1 to 10 with step 0.1)

  • Store it in file log.png and submit this file

Hints:

  • Create x and y by vector arithmetic
  • To compute binary logarithm check help ? log
  • Before running plot, use command png("log.png") to store the result, afterwards call dev.off() to close the file (in Rstudio you can also export plots manually)

Data frames and simple statistics

Data frame: a table similar to a spreadsheet. Each column is a vector, all are of the same length.

We will use a table with the following columns:

  • Country name
  • Region (continent)
  • Area in thousands of km2
  • Population in millions in 2019

(source of data UN)

The table is stored in the csv format (columns separated by commas).

Afghanistan,Asia,652.864,38.0418
Albania,Europe,28.748,2.8809
Algeria,Africa,2381.741,43.0531
American Samoa,Oceania,0.199,0.0553
Andorra,Europe,0.468,0.0771
Angola,Africa,1246.7,31.8253
# reading a data frame from a file
a = read.csv("/tasks/r1/countries.csv",header = TRUE)

# display mean, median, etc. of each column
summary(a);
# Compactly display structure of a 
# (good for checking that import worked etc)
str(a)

# print the column with the name "Area"
a$Area

# population density: divide the population by the area
a$Population / a$Area

# Add density as a new column to frame a
a = cbind(a, Density = a$Population / a$Area)

# Scatter plot of area vs population
plot(a$Area, a$Population)

# we will see smaller values better in log-scale (both axes)
plot(a$Area, a$Population, log='xy')

# use linear scale, but zoom in on smaller countries:
plot(a$Area, a$Population, xlim=c(0,1500), ylim=c(0,150))

# average country population 33.00224 million
mean(a$Population)
# median country population 5.3805 million
median(a$Population)

# median country population in Europe
median(a$Population[a$Region=="Europe"])
# Standard deviation
sd(a$Population)

# Histogram of country populations in Europe
hist(a$Population[a$Region=="Europe"])

Task B

Create frame europe which contains data for European countries selected from frame a. Also create a similar frame for African countries. Hint:

  • To select the first three rows of a frame: a[c(1,2,3),].
  • Here we want to select rows based on values not position (see computation of median country size in Europe above)

Run the command summary separately for each new frame. Comment on how their characteristics differ. Write output and your conclusion to the protocol.

Task C

Draw a graph comparing the area vs population in Europe and Africa; use different colors for points representing European and African countries. Apply log scale on both axes.

  • Submit the plot in file countries.png

To draw the graph, you can use one of the options below, or find yet another way.

Option 1: first draw Europe with one color, then add Africa in another color

  • Color of points can be changed by as follows: plot(1:10,1:10, col="red")
  • After the plot command, you can add more points to the same graph by command points, which can be used similarly as plot
  • Warning: command points does not change the ranges of x and y axes. You have to set these manually so that points from both groups are visible. You can do this using options xlim and ylim, e.g. plot(x,y, col="red", xlim=c(0.1,100), ylim=c(0.1,100))

Option 2: plot both Europe and Africa in one plot command, and give it a vector of colors, one for each point. Command plot(1:10,1:10,col=c(rep("red",5),rep("blue",5))) will plot the first 5 points red and the last 5 points blue

Bonus task: add a legend to the plot, showing which color is Europe and which is Africa.

Expression data

The dataset was described in the lecture.

# Read gene expression data table
a = read.csv("/tasks/r1/microarray.csv", row.names=1)
# Visual check of the first row
a[1,]
# Plot control replicate 1 vs. acetate acid replicate 1
plot(a$control1, a$acetate1)
# Plot control replicate 1 vs. control replicate 2
plot(a$control1, a$control2)
# To show density in dense clouds of points, use this plot
smoothScatter(a$control1, a$acetate1)

Task D

In the plots above we compare two experiments, say A=control1 and B=acetate1. Outliers away from the diagonal in the plot are the genes whose expression changes. However distance from the diagonal is hard to judge visually, instead we will create MA plot:

  • As above, each gene is one dot in the plot (use plot rather than smoothScatter).
  • The x-axis is the average between values for conditions A and B. The points on the right have overall higher expression than points on the left.
  • The y-axis is the difference between condition A and B. The values in frame a are in log-scale base 2, so the difference of 1 means 2-fold change in expression.
  • The points far from the line y=0 have the highest change in expression. Use R functions min, max, which.min and which.max to find the largest positive and negative difference from line y=0 and which genes they correspond to. Functions min and max give you the minimum and maximum of a given vector. Functions which.min and which.max return the index where this extreme value is located. You can use this index to get the appropriate row of the dataframe a, including the gene name.
  • Submit file ma.png with your plot. Include the genes with the extreme changes in your protocol.

Clustering applied to expression data

Clustering is a wide group of methods that split data points into groups with similar properties. We will group together genes that have a similar reaction to acids, i.e. their rows in gene expression data matrix have similar values. We will consider two simple clustering methods

  • K means clustering splits points (genes) into k clusters, where k is a parameter given by the user. It finds a center of each cluster and tries to minimize the sum of distances from individual points to the center of their cluster. Note that this algorithm is randomized so you will get different clusters each time.
Examples of heatmaps
  • Hierarchical clustering puts all data points (genes) to a hierarchy so that smallest subtrees of the hierarchy are the most closely related groups of points and these are connected to bigger and more loosely related groups.


# Create a new version of frame a in which row is scaled so that 
# it has mean 0 and standard deviation 1
# Function scale does such transformation on columns instead of rows, 
# so we transpose the frame using function t, then transpose it back
b = t(scale(t(a)))
# Matrix b shows relative movements of each gene, 
# disregarding its overall high or low expression

# Command heatmap creates hierarchical clustering of rows, 
# then shows values using color ranging from red (lowest) to white (highest)
heatmap(as.matrix(a), Colv=NA, scale="none")
heatmap(as.matrix(b), Colv=NA, scale="none")
# compare the two matrices - which phenomena influenced clusters in each of them?
# k means clustering to 5 clusters
k = 5
cl = kmeans(b, k)
# Each gene is assigned a cluster (number between 1 and k)
# the command below displays the first 10 elements, i.e. clusters of first 10 genes
head(cl$cluster)
# Draw heatmap of cluster number 3 out of k, no further clustering applied
# Do you see any common pattern to genes in the cluster?
heatmap(as.matrix(b[cl$cluster==3,]), Rowv=NA, Colv=NA, scale="none")

# Reorder genes in the whole table according to their cluster cluster number
# Can you spot our k clusters?
heatmap(as.matrix(b[order(cl$cluster),]), Rowv=NA, Colv=NA, scale="none")

# Compare overall column means with column means in cluster 3
# Function apply runs mean on every column (or row if 2 changed to 1)
apply(b, 2, mean)
# Now means within cluster 3
apply(b[cl$cluster==3,],2,mean)

# Clusters have centers which are also computed as means
# so this is the same as the previous command
cl$centers[3,]

Task E

Example of a required plot (but for k=3, not k=5)

Draw a plot in which the x-axis corresponds to experiments, the y-axis is the expression level and the center of each cluster is shown as a line (use k-means clustering on the scaled frame b, computed as shown above)

  • Use command matplot(x, y, type="l", lwd=2) which gets two matrices x and y of the same size and plots each column of matrices x and y as one line (setting lwd=2 makes lines thicker)
  • In this case we omit matrix x, the command will use numbers 1,2,3... as columns of the missing matrix
  • Create y from cl$centers by applying function t (transpose)
  • Submit file clusters.png with your final plot

Lr2

#HWr2

The topic of this lecture are statistical tests in R.

  • Beginners in statistics: listen to lecture, then do tasks A, B, C
  • If you know basics of statistical tests, do tasks B, C, D
  • More information on this topic in the 1-EFM-340 Computer Statistics course


Introduction to statistical tests: sign test

  • Two friends A and B have played their favorite game n=10 times, A has won 6 times and B has won 4 times.
  • A claims that he is a better player, B claims that such a result could easily happen by chance if they were equally good players.
  • Hypothesis of player B is called the null hypothesis that the pattern we see (A won more often than B) is simply a result of chance
  • The null hypothesis reformulated: we toss coin n times and compute value X: the number of times we see head. The tosses are independent and each toss has equal probability of being head or tail
  • Similar situation: comparing programs A and B on several inputs, counting how many times is program A better than B.
# simulation in R: generate 10 psedorandom bits
# (1=player A won)
sample(c(0,1), 10, replace = TRUE)
# result e.g. 0 0 0 0 1 0 1 1 0 0

# directly compute random variable X, i.e. the sum of bits
sum(sample(c(0,1), 10, replace = TRUE))
# result e.g. 5

# we define a function which will m times repeat 
# the coin tossing experiment with n tosses 
# and returns a vector with m values of random variable X
experiment <- function(m, n) {
  x = rep(0, m)     # create vector with m zeroes
  for(i in 1:m) {   # for loop through m experiments
    x[i] = sum(sample(c(0,1), n, replace = TRUE)) 
  }
  return(x)         # return array of values     
}
# call the function for m=20 experiments, each with n tosses
experiment(20,10)
# result e.g.  4 5 3 6 2 3 5 5 3 4 5 5 6 6 6 5 6 6 6 4
# draw histograms for 20 experiments and 1000 experiments
png("hist10.png")  # open png file
par(mfrow=c(2,1))  # matrix of plots with 2 rows and 1 column
hist(experiment(20,10))
hist(experiment(1000,10))
dev.off() # finish writing to file
  • It is easy to realize that we get binomial distribution (binomické rozdelenie)
  • The probability of getting k ones out of n coin tosses is
  • The p-value of the test is the probability that simply by chance we would get k the same or more extreme than in our data.
  • In other words, what is the probability that in n=10 tosses we see head 6 times or more (one sided test)
  • P-value for k ones out of n coin tosses
  • If the p-value is very small, say smaller than 0.01, we reject the null hypothesis and assume that player A is in fact better than B
# computing the probability that we get exactly 6 heads in 10 tosses
dbinom(6, 10, 0.5) # result 0.2050781
# we get the same as our formula above:
7*8*9*10/(2*3*4*(2^10)) # result 0.2050781

# entire probability distribution: probabilities 0..10 heads in 10 tosses
dbinom(0:10, 10, 0.5)
# [1] 0.0009765625 0.0097656250 0.0439453125 0.1171875000 0.2050781250
# [6] 0.2460937500 0.2050781250 0.1171875000 0.0439453125 0.0097656250
# [11] 0.0009765625

# we can also plot the distribution
plot(0:10, dbinom(0:10, 10, 0.5))
barplot(dbinom(0:10, 10, 0.5))

# our p-value is the sum for k=6,7,8,9,10
sum(dbinom(6:10, 10, 0.5))
# result: 0.3769531
# so results this "extreme" are not rare by chance,
# they happen in about 38% of cases

# R can compute the sum for us using pbinom 
# this considers all values greater than 5
pbinom(5, 10, 0.5, lower.tail=FALSE)
# result again 0.3769531

# if probability is too small, use log of it
pbinom(9999, 10000, 0.5, lower.tail=FALSE, log.p = TRUE)
# [1] -6931.472
# the probability of getting 10000x head is exp(-6931.472) = 2^{-100000}

# generating numbers from binomial distribution 
# - similarly to our function experiment
rbinom(20, 10, 0.5)
# [1] 4 4 8 2 6 6 3 5 5 5 5 6 6 2 7 6 4 6 6 5

# running the test
binom.test(6, 10, p = 0.5, alternative="greater")
#
#        Exact binomial test
#
# data:  6 and 10
# number of successes = 6, number of trials = 10, p-value = 0.377
# alternative hypothesis: true probability of success is greater than 0.5
# 95 percent confidence interval:
# 0.3035372 1.0000000
# sample estimates:
# probability of success
#                   0.6

# to only get p-value, run
binom.test(6, 10, p = 0.5, alternative="greater")$p.value
# result 0.3769531

Comparing two sets of values: Welch's t-test

  • Let us now consider two sets of values drawn from two normal distributions with unknown means and variances
  • The null hypothesis of the Welch's t-test is that the two distributions have equal means
  • The test computes test statistics (in R for vectors x1, x2):
    • (mean(x1)-mean(x2))/sqrt(var(x1)/length(x1)+var(x2)/length(x2))
  • If the null hypothesis holds, i.e. x1 and x2 were sampled from distributions with equal means, this test statistics is approximately distributed according to the Student's t-distribution with the degree of freedom obtained by
n1=length(x1)
n2=length(x2)
(var(x1)/n1+var(x2)/n2)**2/(var(x1)**2/((n1-1)*n1*n1)+var(x2)**2/((n2-1)*n2*n2))
  • Luckily R will compute the test for us simply by calling t.test
# generate x1: 6 values from normal distribution with mean 2 and standard deviation 1
x1 = rnorm(6, 2, 1)
# for example 2.70110750  3.45304366 -0.02696629  2.86020145  2.37496993  2.27073550

# generate x2: 4 values from normal distribution with mean 3 and standard deviation 0.5
x2 = rnorm(4, 3, 0.5)
# for example 3.258643 3.731206 2.868478 2.239788
t.test(x1, x2)
# t = -1.2898, df = 7.774, p-value = 0.2341
# alternative hypothesis: true difference in means is not equal to 0
# means 2.272182  3.024529
# this time the test was not significant

# regenerate x2 from a distribution with a much more different mean
x2 = rnorm(4, 5, 0.5)
# 4.882395 4.423485 4.646700 4.515626
t.test(x1, x2)
# t = -4.684, df = 5.405, p-value = 0.004435
# means 2.272182  4.617051
# this time much more significant p-value

# to get only p-value, run 
t.test(x1,x2)$p.value

We will apply Welch's t-test to microarray data

Multiple testing correction

  • When we run t-tests on the control vs. benzoate on all 6398 genes, we get 435 genes with p-value at most 0.01/
  • Purely by chance this would happen in 1% of cases (from the definition of the p-value).
  • So purely by chance we would expect to get about 64 genes with p-value at most 0.01.
  • So roughly 15% of our detected genes (maybe less, maybe more) are false positives which happened purely by chance.
  • Sometimes false positives may even overwhelm the results.
  • Multiple testing correction tries to limit the number of false positives among the results of multiple statistical tests, there are many different methods
  • The simplest one is Bonferroni correction, where the threshold on the p-value is divided by the number of tested genes, so instead of 0.01 we use threshold 0.01/6398 = 1.56e-6
  • This way the expected overall number of false positives in the whole set is 0.01 and so the probability of getting even a single false positive is also at most 0.01 (by Markov inequality)
  • We could instead multiply all p-values by the number of tests and apply the original threshold 0.01 - such artificially modified p-values are called corrected
  • After Bonferroni correction we get only one significant gene
# the results of t-tests are in vector pb of length 6398
# manually multiply p-values by length(pb), count those that have value <= 0.01
sum(pb * length(pb) <= 0.01)
# in R you can use p.adjust for multiple testing correction
pb.adjusted = p.adjust(pa, method ="bonferroni")
# this is equivalent to multiplying by the length and using 1 if the result > 1
pb.adjusted = pmin(pa*length(pa),rep(1,length(pa)))

# there are less conservative multiple testing correction methods, 
# e.g. Holm's method, but in this case we get almost the same results
pa.adjusted2 = p.adjust(pa, method ="holm")

Another frequently used correction is false discovery rate (FDR), which is less strict and controls the overall proportion of false positives among results.

HWr2

See also the current and the the previous lecture.

  • Do either tasks A,B,C (beginners) or B,C,D (more advanced). You can also do all four for bonus credit.
  • In your protocol write used R commands with brief comments on your approach.
  • Submit required plots with filenames as specified.
  • For each task also include results as required and a short discussion commenting the results/plots you have obtained. Is the value of interest increasing or decreasing with some parameter? Are the results as expected or surprising?
  • Outline of protocol is in /tasks/r2/protocol.txt

Task A: sign test

  • Consider a situation in which players played n games, out of which a fraction of q were won by A (the example in the lecture corresponds to q=0.6 and n=10)
  • Compute a table of p-values for n=10,20,...,90,100 and for q=0.6, 0.7, 0.8, 0.9
  • Plot the table using matplot (n is x-axis, one line for each value of q)
  • Submit the plot in sign.png
  • Discuss the values you have seen in the plot / table

Outline of the code:

# create vector rows with values 10,20,...,100
rows=(1:10)*10
# create vector columns with required values of q
columns=c(0.6, 0.7, 0.8, 0.9)
# create empty matrix of pvalues 
pvalues = matrix(0,length(rows),length(columns))
# TODO: fill in matrix pvalues using binom.test

# set names of rows and columns
rownames(pvalues)=rows
colnames(pvalues)=columns
# careful: pvalues[10,] is now 10th row, i.e. value for n=100, 
#          pvalues["10",] is the first row, i.e. value for n=10

# check that for n=10 and q=0.6 you get p-value 0.3769531
pvalues["10","0.6"]

# create x-axis matrix (as in previous exercises, part D)
x=matrix(rep(rows,length(columns)),nrow=length(rows))
# matplot command
png("sign.png")
matplot(x,pvalues,type="l",col=c(1:length(columns)),lty=1)
legend("topright",legend=columns,col=c(1:length(columns)),lty=1)
dev.off()

Task B: Welch's t-test on microarray data

Read the microarray data, and preprocess them (last time we worked with preprocessed data). We first transform it to log scale and then shift and scale values in each column so that median is 0 and sum of squares of values is 1. This makes values more comparable between experiments; in practice more elaborate normalization is often performed. In the rest, work with table a containing preprocessed data.

# read the input file
input = read.table("/tasks/r2/acids.tsv", header=TRUE, row.names=1)
# take logarithm of all the values in the table
input = log2(input)
# compute median of each column
med = apply(input, 2, median)
# shift and scale values
a = scale(input, center=med)

Columns 1,2,3 are control, columns 4,5,6 acetic acid, 7,8,9 benzoate, 10,11,12 propionate, and 13,14,15 sorbate

Write a function my.test which will take as arguments table a and 2 lists of columns (e.g. 1:3 and 4:6) and will run for each row of the table Welch's t-test of the first set of columns versus the second set. It will return the resulting vector of p-values, one for each gene.

  • For example by calling pb <- my.test(a, 1:3, 7:9) we will compute p-values for differences between control and benzoate (computation may take some time)
  • The first 5 values of pb should be
> pb[1:5]
[1] 0.02358974 0.05503082 0.15354833 0.68060345 0.04637482
  • Run the test for all four acids
  • Report how many genes were significant with p-value at most 0.01 for each acid
  • Report how many genes are significant for both acetic and benzoate acids simultaneously (logical and is written as &).

Task C: multiple testing correction

Run the following snippet of code, which works on the vector of p-values pb obtained for benzoate in task B

# adjusts vectors of p-vales from tasks B for using Bonferroni correction
pb.adjusted = p.adjust(pb, method ="bonferroni")
# add this adjusted vector to frame a
a <-  cbind(a, pb.adjusted)
# create permutation ordered by pb.adjusted
ob = order(pb.adjusted)
# select from table five rows with the lowest pb.adjusted (using vector ob)
# and display columns containing control, acetate and adjusted p-value
a[ob[1:5],c(1:3,7:9,16,17)]

You should get an output like this:

      control1  control2  control3  benzoate1  benzoate2  benzoate3
PTC4 0.5391444 0.5793445 0.5597744  0.2543546  0.2539317  0.2202997
GDH3 0.2480624 0.2373752 0.1911501 -0.3697303 -0.2982495 -0.3616723
AGA2 0.6735964 0.7860222 0.7222314  1.4807101  1.4885581  1.3976753
CWP2 1.4723713 1.4582596 1.3802390  2.3759288  2.2504247  2.2710695
LSP1 0.7668296 0.8336119 0.7643181  1.3295121  1.2744859  1.2986457
               pb pb.adjusted
PTC4 4.054985e-05   0.2594379
GDH3 5.967727e-05   0.3818152
AGA2 8.244790e-05   0.5275016
CWP2 1.041416e-04   0.6662979
LSP1 1.095217e-04   0.7007201

Do the same procedure for acetate p-values and report the result (in your table, report both p-values and expression levels for acetate, not bezoate). Comment on the results for both acids.

Task D: volcano plot, test on data generated from null hypothesis

Draw a volcano plot for the acetate data

  • The x-axis of this plot is the difference between the mean of control and the mean of acetate. You can compute row means of a matrix by rowMeans.
  • The y-axis is -log10 of the p-value (use the p-values before multiple testing correction)
  • You can quickly see the genes that have low p-values (high on y-axis) and also big difference in the mean expression between the two conditions (far from 0 on x-axis). You can also see if acetate increases or decreases the expression of these genes.

Now create a simulated dataset sharing some features of the real data but observing the null hypothesis that the mean of control and acetate are the same for each gene

  • Compute vector m of means for columns 1:6 from matrix a
  • Compute vectors sc and sa of standard deviations for control columns and for acetate columns respectively. You can compute standard deviation for each row of a matrix by apply(some.matrix, 1, sd)
  • For each i in 1:6398, create three samples from the normal distribution with mean m[i] and standard deviation sc[i] and three samples with mean m[i] and deviation sa[i] (use the rnorm function)
  • On the resulting matrix apply Welch's t-test and draw the volcano plot.
  • How many random genes have p-value at most 0.01? Is it roughly what we would expect under the null hypothesis?

Draw a histogram of p-values from the real data (control vs acetate) and from the random data (use function hist). Describe how they differ. Is it what you would expect?

Submit plots volcano-real.png, volcano-random.png, hist-real.png, hist-random.png (real for real expression data and random for generated data)

Lcloud

Today we will work with Amazon Web Services (AWS), which is a cloud computing platform. It allows highly parallel computation on large datasets. We will use an educational account which gives you certain amount of resources for free.


Credentials

  • First you need to create .aws/credentials file in your home folder with valid AWS credentials.
  • Also run `aws configure`. Press enter for access key ID and secret access key and put in `us-east-1` for region. Press enter for output format.
  • Please use the credentials which were sent to you via email and follows steps in here (there is a cursor in each screen):

https://docs.google.com/presentation/d/1GBDErp5xhrV2zLF5kKdwnOAjtmDEFN0pw3RFval419s/edit#slide=id.p

  • Sometimes these credentials expire. In that case repeat the same steps to get new ones.

AWS command line

  • We will access AWS using aws command installed on our server.
  • You can also install it on your own machine using pip install awscli

Input files and data storage

Today we will use Amazon S3 cloud storage to store input files. Run the following two commands to check if you can see the "bucket" (data storage) associated with this lecture:

# the following command should give you a big list of files
aws s3 ls s3://idzbucket2

# this command downloads one file from the bucket
aws s3 cp s3://idzbucket2/splitaa splitaa

# the following command prints the file in your console 
# (no need to do this).
aws s3 cp s3://idzbucket2/splitaa -

You should also create your own bucket (storage area). Pick your own name, must be globally unique:

aws s3 mb s3://mysuperawesomebucket

MapReduce

We will be using MapReduce in this session. It is kind of outdated concept, but simple enough for us and runs out of box on AWS. If you ever want to use BigData in practice, try something more modern like Apache Beam. And avoid PySpark if you can.

For tutorial on MapReduce check out PythonHosted.org or TutorialsPoint.com.

Template

You are given basic template with comments in /tasks/cloud/example_job.py

You can run it locally as follows:

python3 example_job.py <input file> -o <output_dir>

You can run it in the cloud on the whole dataset as follows:

python3 example_job.py -r emr --region us-east-1 s3://idzbucket2 \
  --num-core-instances 4 -o s3://<your bucket>/<some directory>

For testing we recommend using a smaller sample as follows:

python3 example_job.py -r emr --region us-east-1 s3://idzbucket2/splita* \
  --num-core-instances 4 -o  s3://<your bucket>/<some directory>

Other useful commands

You can download output as follows:

# list of files
aws s3 ls s3://<your bucket>/<some directory>/
# download
aws s3 cp s3://<your bucket>/<some directory>/ . --recursive

If you want to watch progress:

  • Click on AWS Console button workbench (vocareum).
  • Set region (top right) to N. Virginia (us-east-1).
  • Click on services, then EMR.
  • Click on the job, which is running, then Steps, view logs, syslog.

HWcloud

See also the lecture

For both tasks, submit your source code and the result, when run on whole dataset (s3://idzbucket2). The code is expected to use the MRJob framework presented in the lecture. Submit directory is /submit/cloud/

Task A

Count the number of occurrences of each 4-mer in the provided data.

Task B

Count the number of pairs of reads which overlap in exactly 30 bases (end of one read overlaps beginning of the second read). You can ignore reverse complement.

Hints: