1-DAV-202 Data Management 2024/25

Grades from marked homeworks are on the server in file /grades/userid.txt


HWcloud

Revision as of 09:08, 17 April 2023

See also the lecture

Important: This homework counts only for bonus points.

For Tasks B and C, submit your source code and the result of running it on the whole dataset (gs://mad-2022-public/). The code is expected to use the Apache Beam framework and Dataflow, as presented in the lecture. The submit directory is /submit/cloud/

Task A

Document (with screenshots) your login journey until step `gcloud projects list`.

Task B

Count the number of occurrences of each 4-mer in the provided data.
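
A minimal sketch of the counting logic, not a reference solution. It assumes the input has one read per line, which may not match the real dataset. The names `kmers`, `count_kmers_locally` and `attach_kmer_count`, and the `input_path` / `output_path` parameters, are hypothetical. Beam is imported only inside the pipeline helper, so the plain-Python part can be tried without it.

```python
from collections import Counter


def kmers(read, k=4):
    """Yield every length-k substring (k-mer) of one read."""
    read = read.strip()
    # A read shorter than k yields nothing because the range is empty.
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def count_kmers_locally(reads, k=4):
    """Plain-Python reference, handy for checking results on small inputs."""
    return Counter(kmer for read in reads for kmer in kmers(read, k))


def attach_kmer_count(pipeline, input_path, output_path, k=4):
    """Attach the counting steps to an existing Beam pipeline.

    Assumption: one read per line. Both paths are placeholders.
    """
    import apache_beam as beam  # lazy import: helpers above work without Beam

    return (
        pipeline
        | "ReadReads" >> beam.io.ReadFromText(input_path)
        | "ExtractKmers" >> beam.FlatMap(kmers, k)
        | "CountPerKmer" >> beam.combiners.Count.PerElement()
        | "FormatLine" >> beam.Map(lambda kv: f"{kv[0]}\t{kv[1]}")
        | "WriteCounts" >> beam.io.WriteToText(output_path)
    )
```

Keeping `kmers` as an ordinary generator means the same function can be tested directly and then passed to `beam.FlatMap` unchanged.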

Task C

Count the number of pairs of reads which overlap in exactly 30 bases (end of one read overlaps beginning of the second read). You can ignore reverse complement.
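
To make the quantity concrete, here is a brute-force sketch for tiny test inputs only. It reflects one possible reading of the task, and these are assumptions: pairs are ordered (a, b), a read is never paired with itself, and "overlap in exactly 30 bases" means the last 30 bases of a equal the first 30 bases of b. The function name is hypothetical, and the quadratic loop is far too slow for the whole dataset.

```python
def count_overlapping_pairs_bruteforce(reads, k=30):
    """Count ordered pairs (a, b) of distinct reads with a[-k:] == b[:k]."""
    # Reads shorter than k cannot take part in a k-base overlap.
    usable = [r.strip() for r in reads]
    usable = [r for r in usable if len(r) >= k]
    total = 0
    for i, a in enumerate(usable):
        for j, b in enumerate(usable):
            # i != j excludes pairing a read with itself (assumption).
            if i != j and a[-k:] == b[:k]:
                total += 1
    return total
```

With a small `k` this can serve as a check for a faster implementation on hand-made reads.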

Hints:

  • Try counting pairs for each 30-mer first.
  • You can yield something structured from a Map/ParDo operation (e.g. a tuple).
  • You can have another Map/ParDo after CombinePerKey.
  • Run code locally on small data to quickly iterate and test :)
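
The hints above could be combined roughly as follows. This is a sketch under assumptions, not the expected solution: one read per line, ordered pairs, and a read is not paired with itself. All function names and `input_path` are hypothetical. Each read emits tuples keyed by its first and last 30-mer, the tuples are summed per key, and a further step turns the per-key counts into a number of pairs.

```python
from collections import defaultdict


def boundary_kmers(read, k=30):
    """Yield (k-mer, (suffix_count, prefix_count, self_count)) for one read."""
    read = read.strip()
    if len(read) < k:
        return
    prefix, suffix = read[:k], read[-k:]
    if prefix == suffix:
        # Third field marks a read that would otherwise pair with itself.
        yield prefix, (1, 1, 1)
    else:
        yield prefix, (0, 1, 0)  # this read starts with the k-mer
        yield suffix, (1, 0, 0)  # this read ends with the k-mer


def add_counts(values):
    """Element-wise sum of count tuples; usable with CombinePerKey."""
    suffixes = prefixes = selfs = 0
    for s, p, own in values:
        suffixes, prefixes, selfs = suffixes + s, prefixes + p, selfs + own
    return suffixes, prefixes, selfs


def pairs_for_kmer(item):
    """Every suffix occurrence pairs with every prefix occurrence, minus self-pairs."""
    _kmer, (suffixes, prefixes, selfs) = item
    return suffixes * prefixes - selfs


def count_pairs_locally(reads, k=30):
    """Plain-Python stand-in for the pipeline, for quick tests on small data."""
    grouped = defaultdict(list)
    for read in reads:
        for kmer, counts in boundary_kmers(read, k):
            grouped[kmer].append(counts)
    return sum(pairs_for_kmer((kmer, add_counts(v))) for kmer, v in grouped.items())


def attach_overlap_count(pipeline, input_path, k=30):
    """Same steps as Beam transforms; input_path is a placeholder."""
    import apache_beam as beam  # lazy import: helpers above work without Beam

    return (
        pipeline
        | "ReadReads" >> beam.io.ReadFromText(input_path)
        | "BoundaryKmers" >> beam.FlatMap(boundary_kmers, k)
        | "SumPerKmer" >> beam.CombinePerKey(add_counts)
        | "PairsPerKmer" >> beam.Map(pairs_for_kmer)
        | "TotalPairs" >> beam.CombineGlobally(sum)
    )
```

Calling `count_pairs_locally` with a small `k` on a handful of hand-written reads is a quick way to check the logic before running the pipeline on the full data.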