1-DAV-202 Data Management 2024/25

· Grades from marked homeworks are on the server in file /grades/userid.txt
· Please submit project proposals by Monday, April 14.


Difference between revisions of "HWcloud"

Revision as of 17:01, 17 April 2023

See also the lecture

For both Task B and Task C, submit your source code and the result of running it on the whole dataset (gs://fmph-mad-2023-public/). The code is expected to use the Apache Beam framework and Dataflow, as presented in the lecture. The submit directory is /submit/cloud/

Task A for bonus points

Document (with screenshots) your login journey up to the step `gcloud projects list`.

Task B

Count the number of occurrences of each 4-mer in the provided data.
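
For checking pipeline output on small data, a plain-Python reference can be handy. The sketch below is not a Beam pipeline and makes assumptions that the task text does not state: each read is taken to be a plain string of bases, and the names `kmers`, `count_kmers` and `toy_reads` are hypothetical. In a Beam pipeline, a generator like `kmers` could plausibly be passed to `beam.FlatMap`, with the counts then combined per key.

```python
from collections import Counter


def kmers(read, k=4):
    """Yield every length-k substring (k-mer) of a read, left to right."""
    # Reads shorter than k produce an empty range and so yield nothing.
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def count_kmers(reads, k=4):
    """Count occurrences of each k-mer over all reads (local reference only)."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


# Toy input, made up for illustration; not taken from the course dataset.
toy_reads = ["ACGTACGT", "CGTA"]
counts = count_kmers(toy_reads)
for kmer, n in sorted(counts.items()):
    print(kmer, n)
```

Overlapping occurrences are counted separately here (the window moves one base at a time), which is the usual reading of "occurrences of each 4-mer".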

Task C

Count the number of pairs of reads that overlap by exactly 30 bases (the end of one read overlaps the beginning of the second read). You can ignore reverse complements.

Hints:

  • Try counting pairs for each 30-mer first.
  • You can yield something structured from a Map/ParDo operation (e.g. a tuple).
  • You can have another Map/ParDo after CombinePerKey.
  • Run code locally on small data to quickly iterate and test :)
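
One way to read the first hint, sketched in plain Python for local testing rather than as a Beam pipeline: for each 30-mer, count how many reads start with it and how many end with it, then multiply the two counts and sum over all 30-mers. The names `tag_ends` and `count_overlapping_pairs` are hypothetical, and the sketch rests on assumptions the task does not confirm: reads are plain strings, pairs are ordered, a read whose prefix equals its own suffix is counted as a pair with itself, and pairs that also admit a longer overlap are not excluded.

```python
from collections import defaultdict


def tag_ends(read, k=30):
    """Yield (k-mer, (starts, ends)) tuples for the read's prefix and suffix."""
    if len(read) >= k:
        yield read[:k], (1, 0)   # this read starts with the k-mer
        yield read[-k:], (0, 1)  # this read ends with the k-mer


def count_overlapping_pairs(reads, k=30):
    """Sum, over all k-mers, (#reads starting with it) * (#reads ending with it)."""
    totals = defaultdict(lambda: [0, 0])
    for read in reads:
        for kmer, (starts, ends) in tag_ends(read, k):
            totals[kmer][0] += starts
            totals[kmer][1] += ends
    return sum(starts * ends for starts, ends in totals.values())


# Toy reads with k=3 so the overlaps are easy to see by eye.
toy_reads = ["AAACCC", "CCCGGG", "GGGAAA"]
print(count_overlapping_pairs(toy_reads, k=3))
```

The structured `(starts, ends)` tuple mirrors the second hint, and the final multiplication corresponds to a step that would come after the per-key combine in the third hint.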