1-DAV-202 Data Management 2023/24
Previously 2-INF-185 Data Source Integration

· Grades from marked homeworks are on the server in file /grades/userid.txt
· Dates of project submission and oral exams:
Early: submit project May 24 9:00am, oral exams May 27 1:00pm (limit 5 students).
Otherwise submit project June 11, 9:00am, oral exams June 18 and 21 (estimated 9:00am-1:00pm, schedule will be published before exam).
Sign up for one of the exam days in AIS before June 11.
Remedial exams will take place in the last week of the exam period. Beware, there will not be much time to prepare a better project. Projects should be submitted as homeworks to /submit/project.
· Cloud homework is due on May 20 9:00am.


Project

Latest revision as of 12:40, 14 March 2024

The goal of the project is to apply and extend the skills acquired during the course while working on a data analysis project of your choice. Your task is to obtain data, analyze it and display the obtained results in clear graphs and tables. It is ideal if you obtain interesting or useful conclusions, but we will mainly evaluate your choice of suitable methods and technical difficulty of the project. The scope of programming or data analysis should correspond to roughly three homework assignments, but overall the project will be more demanding, because unlike assignments, you are not provided with data and a sequence of tasks, but you have to come up with them yourself, and the first ideas do not always work.

The emphasis should be on tools run on the command line and the use of technologies covered during the course, but if needed, you can supplement them with other methods. While prototyping your tool and creating visualizations for your final report, you may prefer to work in interactive environments such as Jupyter Notebook, but in the submitted version of the project, most of the code should be in scripts runnable from the command line. A possible exception is the final visualization, which can remain a notebook or an interactive website (Flask).
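
As an illustration of what "scripts runnable from the command line" can look like, here is a minimal sketch of one pipeline step written in Python. Everything specific in it is an assumption made for the example, not a course requirement: the task (keeping CSV rows above a threshold), the function names, the `score` column and the option names are all hypothetical.

```python
import argparse
import csv
import sys


def filter_rows(rows, column, min_value):
    """Yield dict rows whose numeric `column` is at least `min_value`.

    Rows where the column is missing or not a number are skipped.
    """
    for row in rows:
        try:
            value = float(row[column])
        except (KeyError, TypeError, ValueError):
            continue
        if value >= min_value:
            yield row


def build_parser():
    """Describe the command-line interface of this (hypothetical) step."""
    parser = argparse.ArgumentParser(
        description="Keep CSV rows whose numeric column reaches a threshold.")
    parser.add_argument("input", help="input CSV file with a header row")
    parser.add_argument("output", help="output CSV file")
    parser.add_argument("--column", default="score",
                        help="name of the numeric column to test")
    parser.add_argument("--min-value", type=float, default=0.0,
                        help="smallest value that is kept")
    return parser


def main(argv):
    args = build_parser().parse_args(argv)
    with open(args.input, newline="") as src, \
            open(args.output, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames or [],
                                extrasaction="ignore")
        writer.writeheader()
        kept = 0
        for row in filter_rows(reader, args.column, args.min_value):
            writer.writerow(row)
            kept += 1
    # Progress information goes to stderr so that stdout stays clean.
    print(f"kept {kept} rows", file=sys.stderr)
    return 0


if __name__ == "__main__":
    if len(sys.argv) > 1:
        sys.exit(main(sys.argv[1:]))
    build_parser().print_usage()
```

Saved under a hypothetical name such as `filter_rows.py`, the step could be run as `python3 filter_rows.py raw.csv filtered.csv --min-value 0.5`, and that exact command can then be copied into the protocol. Keeping the logic in a plain function (`filter_rows`) separate from argument parsing also lets a notebook import and reuse it for the final visualization.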

Project proposal

Roughly two thirds of the way through the semester, you will submit a project proposal about half a page long. In this proposal, state what data you will process, how you will collect it, what the purpose of the analysis is and what technologies you plan to use. You can slightly change the goals and technologies as you work on the project, but you should have an initial idea. We will give you feedback on the proposal, and in some cases it may be necessary to change the topic slightly or completely. For a suitable project proposal submitted on time, you will receive 5% of the total grade. We recommend consulting the instructors before submitting the proposal.

How to submit the proposal: copy a file in txt or pdf format to /submit/proposal/username on the course server.

Submitting projects

As with homework, submit the required files to the specified folder. Submit the following:

  • Your programs and data files (do not submit data files larger than 50 MB)
  • Protocol in txt format with brief notes, as for homework. It should contain:
    • a list of submitted files with brief descriptions
    • detailed steps done in data analysis (commands used)
    • an explanation of how your scripts can be run by somebody else (if relevant)
    • a list of sources (data, programs, documentation and other literature etc.)
  • Project report in pdf format. Unlike the less formal protocol, the report should be a coherent text written in a technical style. You can write in English or Slovak, but avoid grammar mistakes. The report should contain:
    • an introduction explaining goals of the project, necessary background from the area you work in, and what data you have used
    • a brief description of your methods, which gives an overview of the approach and its justification rather than the details of individual steps
    • results of your analysis (tables, graphs etc.), description of these results and discussion about conclusions that can be made from your results
      • do not forget to explain the meaning of individual tables and graphs (axes, colors etc.)
      • include the final results as well as the initial exploration of your data and the analyses you used to verify the correctness of the data and of your analysis method
      • you can interleave results and methods or keep them separate
    • a conclusion in which you discuss which parts of the project were difficult and what problems you encountered, which parts were easy by contrast, which parts you would in retrospect recommend doing differently, what you learned during the project etc.

The project can be done by pairs of students, but it should then be larger. Each student should be primarily responsible for some parts of the project, and the division of labor should be stated in the project report. Pairs submit only one report and protocol, but the oral exam is separate for each student.

Typical scheme of a project

Most projects consist of the following stages, which should also be represented in the report:

  • Acquiring data. This can be easy if someone gives you the data or if you download it from the internet as a single file. It can also be more difficult if you need to parse the data from many data files or webpages. Do not forget to check whether the data was downloaded correctly (at least by manually checking several records). The report (in combination with the protocol) should clearly indicate where and how you obtained the data.
  • Preprocessing data to a suitable form. This stage involves parsing input formats, selecting useful data, checking it for correctness, filtering out incomplete records etc. Store your data set in a file or database suitable for further processing. Do not forget to check whether the data seems to be correct. In the report, include basic characteristics of the data, such as the overall number of records, the attributes of each record and the value ranges of these attributes. This will help the reader understand the character of your data set.
  • Further analyses and data visualization. At this stage, try to arrive at interesting or useful conclusions. The results can be static figures and tables or an interactive webpage in Flask. However, even for an interactive webpage, include selected results in the report as static images.
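
The basic characteristics mentioned in the preprocessing stage (number of records, attributes, value ranges, incomplete records) can be computed with a few lines of code. The following is only a sketch under assumptions: it expects CSV data with a header row, treats empty fields as missing, and uses a made-up three-row sample (`player`, `goals`, `team`) purely for demonstration.

```python
import csv
import io


def summarize(rows):
    """Return (record_count, per-column summary) for an iterable of dict rows.

    For each column the summary stores the number of missing (empty) values
    and, as long as every non-missing value parses as a number, the minimum
    and maximum seen.
    """
    count = 0
    summary = {}
    for row in rows:
        count += 1
        for name, raw in row.items():
            if name is None:
                # csv.DictReader stores surplus fields under the key None.
                continue
            col = summary.setdefault(
                name, {"missing": 0, "numeric": True, "min": None, "max": None})
            if raw is None or raw.strip() == "":
                col["missing"] += 1
                continue
            try:
                value = float(raw)
            except ValueError:
                col["numeric"] = False
                continue
            col["min"] = value if col["min"] is None else min(col["min"], value)
            col["max"] = value if col["max"] is None else max(col["max"], value)
    return count, summary


# Made-up sample data; in a project this would be your preprocessed file.
sample = io.StringIO("player,goals,team\nA,3,X\nB,,Y\nC,7,X\n")
count, summary = summarize(csv.DictReader(sample))
print(f"{count} records, attributes: {', '.join(summary)}")
for name, col in summary.items():
    if col["numeric"] and col["min"] is not None:
        value_range = f"range {col['min']:g}..{col['max']:g}"
    else:
        value_range = "no numeric range"
    print(f"{name}: {col['missing']} missing, {value_range}")
```

A table built from such a summary is usually enough to satisfy the "basic characteristics" part of the report, and unexpected ranges or many missing values are an early sign that parsing or downloading went wrong.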

If your project significantly differs from this scheme, consult the instructors.

Project topics

  • You can process data useful for your bachelor's or master's thesis, or data needed for another course (in that case, state in the report which course it is and notify the instructors of the other course that you have used results from this course). For DAV and BIN students, this project can be a good opportunity to select a bachelor's thesis topic and start working on it.
  • You can try to find someone who needs to process some data set but does not have the necessary skills (this could be scientists from different fields, non-profit organizations or even companies). If you contact third parties in this way, it is especially important that you try to produce the best project you can, in order to maintain the good reputation of your study program.
  • In your project, you can compare speed or accuracy of several programs for the same task. The project will consist of preparing the input data for the programs, support for automated running of the programs and evaluation of results.
  • You can also try to replicate an analysis published in a scientific paper or a blog. You can check if you obtain the same results as the original authors and try variations of the original analysis, such as trying different parameters, changing settings, adding new visualizations etc.
  • You can also find interesting data on the internet and analyze it. Students often choose topics related to their hobbies and activities, such as sports, computer games, programming contests, music, cooking etc. Many successful projects involve scraping such data from websites.
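
For the topic of comparing the speed of several programs, the "support for automated running" can be a small timing harness. The sketch below is one possible shape, not a prescribed method: the function names are hypothetical, it measures wall-clock time only, and the two compared commands are placeholder Python one-liners standing in for the real programs.

```python
import statistics
import subprocess
import sys
import time


def time_command(command, repeats=3):
    """Run `command` (a list of arguments) `repeats` times.

    Returns the wall-clock duration of each run in seconds. A failing
    command raises subprocess.CalledProcessError because of check=True.
    """
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.run(command, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        timings.append(time.perf_counter() - start)
    return timings


def compare(commands, repeats=3):
    """Return {label: median seconds} for a dict mapping labels to commands."""
    return {label: statistics.median(time_command(cmd, repeats))
            for label, cmd in commands.items()}


if __name__ == "__main__":
    # Placeholder commands; replace them with the programs being compared.
    commands = {
        "sorted": [sys.executable, "-c", "sorted(range(10**5, 0, -1))"],
        "sum": [sys.executable, "-c", "sum(range(10**5))"],
    }
    for label, seconds in compare(commands).items():
        print(f"{label}\t{seconds:.3f} s")
```

Repeating each run and reporting the median reduces the influence of a single slow run. A comparison of accuracy would need a separate evaluation step on the programs' outputs, which this sketch does not cover.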

Not recommended:

  • We do not recommend choosing a dataset from Kaggle or a similar site.
  • Many of these datasets are dubious, submitted by anonymous users without a good explanation of where the data came from. If you do use a dataset from Kaggle, make sure you research its background thoroughly so that you can convince us that it is trustworthy.
  • Another problem is that these datasets are usually already preprocessed. Thus most of the work relevant to this course is already done, and it is hard to find suitable tasks for the project.
  • Finally, many of these datasets already have many analyses done and published, so why add more to an already large pile?

Use of AI code generation

  • Some editors provide options for automated code generation based on context or comments.
  • On the project, you are allowed to use such tools (but not on the homework).
  • If you use such features, make sure you closely examine any generated code and correct any mistakes. You should understand how the code works and test it thoroughly. On the oral exam, we will check whether you can explain how your code works and modify it.
  • Acknowledge any such tools used in the resources section of your protocol.