1-DAV-202 Data Management 2024/25

Grades from marked homeworks are on the server in file /grades/userid.txt


Difference between revisions of "HWcloud"

From MAD
Line 2:
  See also the [[Lcloud|lecture]]

- For both tasks, submit your source code and the result, when run on whole dataset (<tt>s3://idzbucket2</tt>).
+ For both tasks, submit your source code and the result, when run on whole dataset (<tt>gs://mad-2022-public/</tt>).
- The code is expected to use the MRJob framework presented in the lecture. Submit directory is <tt>/submit/cloud/</tt>
+ The code is expected to use the Apache Beam framework and Dataflow presented in the lecture. Submit directory is <tt>/submit/cloud/</tt>
  <!-- /NOTEX -->

Line 16:
  Hints:
  * Try counting pairs for each 30-mer first.
- * You can yield something structured from Mapper (e.g. tuple).
+ * You can yield something structured from Map/ParDo operation (e.g. tuple).
- * There is a two-step MapReduce, which can help you with the final summation: https://mrjob.readthedocs.io/en/latest/guides/writing-mrjobs.html
+ * You can have another Map/ParDo after CombinePerKey.
+ * Run code locally to quickly iterate and test :)

Revision as of 21:09, 2 May 2022

See also the lecture

For both tasks, submit your source code and the result of running it on the whole dataset (gs://mad-2022-public/). The code is expected to use the Apache Beam framework and Dataflow, as presented in the lecture. The submit directory is /submit/cloud/

Task A

Count the number of occurrences of each 4-mer in the provided data.
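
To make the task concrete, here is a minimal sketch of the counting logic in plain Python on an in-memory list of reads. It is not a Beam pipeline and not the expected solution; the names `kmers` and `count_kmers` are hypothetical. In Beam, the generator would roughly correspond to a FlatMap step and the counting to a per-key combine.

```python
from collections import Counter


def kmers(read, k=4):
    """Yield every substring of length k from the read, left to right."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def count_kmers(reads, k=4):
    """Count occurrences of each k-mer over all reads (hypothetical helper)."""
    return Counter(kmer for read in reads for kmer in kmers(read, k))


if __name__ == "__main__":
    # Toy input, not taken from the course dataset.
    for kmer, count in sorted(count_kmers(["ACGTACGT"]).items()):
        print(kmer, count)
```

Reads shorter than k simply contribute nothing, because the range in `kmers` is empty for them.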

Task B

Count the number of pairs of reads which overlap in exactly 30 bases (the end of one read overlaps the beginning of the other read). You can ignore reverse complements.

Hints:

  • Try counting pairs for each 30-mer first.
  • You can yield something structured (e.g. a tuple) from a Map/ParDo operation.
  • You can have another Map/ParDo after CombinePerKey.
  • Run code locally to quickly iterate and test :)
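
The hints above can be sketched in plain Python as a map step, a per-key combine, and a final summation. This is only one possible reading of the task, run on a toy in-memory list rather than with Beam. The function names are hypothetical, and the sketch assumes that a pair is ordered (the suffix of the first read equals the prefix of the second) and that a read whose own prefix equals its own suffix also counts as paired with itself.

```python
from collections import defaultdict


def map_read(read, k=30):
    """Map step: emit (k-mer, (prefix_count, suffix_count)) for both read ends."""
    if len(read) < k:
        return  # too short to overlap in k bases
    yield read[:k], (1, 0)   # the read starts with this k-mer
    yield read[-k:], (0, 1)  # the read ends with this k-mer


def combine_per_key(pairs):
    """Combine step: sum the structured (tuple) values for each key."""
    totals = defaultdict(lambda: (0, 0))
    for key, (prefixes, suffixes) in pairs:
        total_p, total_s = totals[key]
        totals[key] = (total_p + prefixes, total_s + suffixes)
    return dict(totals)


def count_overlapping_pairs(reads, k=30):
    """Second map after the combine: pairs per k-mer, then a global sum."""
    mapped = (kv for read in reads for kv in map_read(read, k))
    per_kmer = combine_per_key(mapped)
    return sum(prefixes * suffixes for prefixes, suffixes in per_kmer.values())


if __name__ == "__main__":
    # Toy reads with k=3 so the overlaps are easy to check by hand.
    print(count_overlapping_pairs(["AAACCC", "CCCGGG", "CCCTTT", "GGGAAA"], k=3))
```

Counting per k-mer first avoids comparing every read with every other read: each k-mer contributes the number of reads ending with it times the number of reads starting with it.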