1-DAV-202 Data Management 2023/24
Previously 2-INF-185 Data Source Integration
Difference between revisions of "HWcloud"
Revision as of 20:09, 2 May 2022
See also the lecture
For both tasks, submit your source code and the result obtained when it is run on the whole dataset (gs://mad-2022-public/). The code is expected to use the Apache Beam framework and Dataflow, as presented in the lecture. The submit directory is /submit/cloud/
Task A
Count the number of occurrences of each 4-mer in the provided data.
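One possible shape for such a pipeline is sketched below. It assumes that each input line holds exactly one read, which the assignment does not state; the names kmers, local_kmer_counts and run, as well as the input and output arguments, are hypothetical and only serve the illustration.

```python
from collections import Counter


def kmers(read, k=4):
    """Yield every length-k substring of a read (assumed: one read per line)."""
    read = read.strip()
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def local_kmer_counts(reads, k=4):
    """Plain-Python reference, handy for checking the pipeline on a tiny sample."""
    return Counter(kmer for read in reads for kmer in kmers(read, k))


def run(input_pattern, output_prefix, pipeline_args=None):
    """Build and run the Beam pipeline; both paths are supplied by the caller."""
    # Imported here so the helpers above can be tested without Beam installed.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(pipeline_args)
    with beam.Pipeline(options=options) as p:
        (
            p
            | "ReadReads" >> beam.io.ReadFromText(input_pattern)
            | "ExtractKmers" >> beam.FlatMap(kmers)
            | "CountKmers" >> beam.combiners.Count.PerElement()
            | "Format" >> beam.MapTuple(lambda kmer, n: f"{kmer}\t{n}")
            | "Write" >> beam.io.WriteToText(output_prefix)
        )
```

When no pipeline arguments are passed, Beam falls back to its local runner, so the same function can be tried on a small file before it is submitted to Dataflow.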
Task B
Count the number of pairs of reads that overlap by exactly 30 bases (the end of one read overlaps the beginning of the other read). You can ignore reverse complements.
Hints:
- Try counting pairs for each 30-mer first.
- You can yield something structured from a Map/ParDo operation (e.g. a tuple).
- You can have another Map/ParDo after CombinePerKey.
- Run code locally to quickly iterate and test :)
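The hints can be combined roughly as follows; this is a sketch under stated assumptions, not a reference solution. It assumes one read per line, counts ordered pairs, and does not treat specially a read whose own end matches its own beginning or pairs that overlap by more than 30 bases. All function names and the input and output arguments are hypothetical.

```python
from collections import defaultdict

K = 30  # overlap length from the task statement


def ends(read, k=K):
    """Emit (k-mer, (starts_here, ends_here)) for both ends of a read."""
    read = read.strip()
    if len(read) >= k:
        yield read[:k], (1, 0)   # some read begins with this k-mer
        yield read[-k:], (0, 1)  # some read ends with this k-mer


def add_pairs(counts):
    """Component-wise sum of (starts, ends) tuples; usable with CombinePerKey."""
    starts = finishes = 0
    for s, f in counts:
        starts += s
        finishes += f
    return starts, finishes


def pairs_for_kmer(item):
    """Number of (ending read, starting read) pairs sharing one k-mer."""
    _, (starts, finishes) = item
    return starts * finishes


def local_overlap_pairs(reads, k=K):
    """Plain-Python reference for quick local checks on a tiny sample."""
    grouped = defaultdict(list)
    for read in reads:
        for kmer, flags in ends(read, k):
            grouped[kmer].append(flags)
    return sum(pairs_for_kmer((kmer, add_pairs(v))) for kmer, v in grouped.items())


def run(input_pattern, output_prefix, pipeline_args=None):
    """Build and run the Beam pipeline; both paths are supplied by the caller."""
    # Imported here so the helpers above can be tested without Beam installed.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    with beam.Pipeline(options=PipelineOptions(pipeline_args)) as p:
        (
            p
            | "ReadReads" >> beam.io.ReadFromText(input_pattern)
            | "EmitEnds" >> beam.FlatMap(ends)
            | "SumPerKmer" >> beam.CombinePerKey(add_pairs)
            | "PairsPerKmer" >> beam.Map(pairs_for_kmer)  # Map after CombinePerKey
            | "Total" >> beam.CombineGlobally(sum)
            | "Write" >> beam.io.WriteToText(output_prefix)
        )
```

The local helper takes k as a parameter, so the logic can be checked on short toy reads, for example local_overlap_pairs(["AAC", "ACG", "ACT"], k=2), before running the full pipeline.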