1-DAV-202 Data Management 2023/24
Previously 2-INF-185 Data Source Integration
Difference between revisions of "HWcloud"
Revision as of 08:08, 17 April 2023
See also the lecture
Important: This homework counts only for bonus points.
For Tasks B and C, submit your source code and the result obtained when it is run on the whole dataset (gs://mad-2022-public/). The code is expected to use the Apache Beam framework and Dataflow presented in the lecture. The submit directory is /submit/cloud/
Task A
Document (with screenshots) your login journey until step `gcloud projects list`.
Task B
Count the number of occurrences of each 4-mer in the provided data.
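To make the expected output concrete, here is a minimal local sketch in plain Python (not Apache Beam) of what counting 4-mers means. The helper names `kmers` and `count_kmers` and the sample reads are hypothetical and are not part of the assignment or the dataset. The sketch assumes each read is a plain string of bases and that overlapping windows are counted.

```python
from collections import Counter

def kmers(read, k=4):
    """Yield every length-k substring (k-mer) of a read, left to right."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]

def count_kmers(reads, k=4):
    """Count k-mer occurrences across all reads (overlapping windows included)."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts

# Made-up reads, for illustration only.
sample_reads = ["ACGTACGT", "CGTA"]
kmer_counts = count_kmers(sample_reads)
```

In a Beam pipeline the same two steps would roughly correspond to a `FlatMap` that emits one `(kmer, 1)` pair per window, followed by `CombinePerKey(sum)`; a small local version like this can serve as a reference when checking pipeline output on a tiny input.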
Task C
Count the number of pairs of reads which overlap in exactly 30 bases (end of one read overlaps beginning of the second read). You can ignore reverse complement.
Hints:
- Try counting pairs for each 30-mer first.
- You can yield something structured from a Map/ParDo operation (e.g. a tuple).
- You can have another Map/ParDo after CombinePerKey.
- Run code locally on small data to quickly iterate and test :)
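
The hints above can be sketched locally in plain Python (not Apache Beam) as follows. This is one possible reading of the task, under stated assumptions: pairs are ordered (the end of the first read matches the beginning of the second), a read whose own prefix equals its own suffix is counted as a pair with itself, and pairs that also overlap at other lengths are not filtered out. The names `overlap_keys` and `count_overlapping_pairs` and the toy reads are hypothetical; the toy run uses k=3 instead of 30 only to keep the data short.

```python
from collections import defaultdict

def overlap_keys(read, k=30):
    """Emit (k-mer, role) tuples for the read's first and last k bases."""
    if len(read) >= k:
        yield read[:k], "prefix"
        yield read[-k:], "suffix"

def count_overlapping_pairs(reads, k=30):
    """Sum, over every k-mer, (#reads ending with it) * (#reads starting with it)."""
    tallies = defaultdict(lambda: [0, 0])  # kmer -> [suffix count, prefix count]
    for read in reads:
        for kmer, role in overlap_keys(read, k):
            tallies[kmer][0 if role == "suffix" else 1] += 1
    # Assumption: ordered pairs, self-pairs included.
    return sum(suffixes * prefixes for suffixes, prefixes in tallies.values())

# Made-up reads, for illustration only.
toy_reads = ["AAACCC", "CCCGGG", "CCCTTT", "GGGAAA"]
toy_pairs = count_overlapping_pairs(toy_reads, k=3)
```

The structure mirrors the hints: the tuple-yielding generator plays the role of a Map/ParDo step, the per-k-mer tally corresponds to a `CombinePerKey`, and the final product-and-sum is the extra step after combining. Whether self-pairs should be excluded is a question of interpretation that the assignment text does not settle.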