1-DAV-202 Data Management 2023/24
Previously 2-INF-185 Data Source Integration
HWcloud
See also the lecture
For both tasks, submit your source code and the result obtained by running it on the whole dataset (s3://idzbucket2). The code is expected to use the MRJob framework presented in the lecture. The submit directory is /submit/cloud/
Task A
Count the number of occurrences of each 4-mer in the provided data.
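The counting logic can be sketched independently of the framework. The sketch below is only an illustration under assumptions: it assumes each input line holds exactly one read (the real format of the data in the bucket may differ), and `run_local` is a hypothetical helper that imitates the shuffle phase in memory so the logic can be tried on a few lines. The `mapper` and `reducer` generators have the same shape as the methods of an MRJob class, so they could be moved into one with little change.

```python
from collections import defaultdict

K = 4  # k-mer length required by Task A


def mapper(_, line):
    """Emit (k-mer, 1) for every window of length K in one read.

    Assumption: one read per input line; header or quality lines,
    if the real data has them, would need to be filtered out first.
    """
    read = line.strip()
    for i in range(len(read) - K + 1):
        yield read[i:i + K], 1


def reducer(kmer, counts):
    """Sum the partial counts collected for one k-mer."""
    yield kmer, sum(counts)


def run_local(lines):
    """Hypothetical in-memory stand-in for the shuffle phase.

    Groups mapper output by key and feeds each group to the reducer.
    """
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(None, line):
            groups[key].append(value)
    result = {}
    for key in sorted(groups):
        for out_key, out_value in reducer(key, groups[key]):
            result[out_key] = out_value
    return result


if __name__ == "__main__":
    print(run_local(["ACGTACGT", "CGTA"]))
```

Reads shorter than four bases produce no output, because the range in `mapper` is then empty.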
Task B
Count the number of pairs of reads that overlap by exactly 30 bases (the end of one read overlaps the beginning of the other read). You can ignore reverse complements.
Hints:
- Try counting pairs for each 30-mer first.
- You can yield something structured from the mapper (e.g. a tuple).
- A two-step MapReduce job can help you with the final summation; see https://mrjob.readthedocs.io/en/latest/guides/writing-mrjobs.html
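One possible reading of these hints, written as a framework-independent sketch rather than a reference solution: the first step keys every read by its first and last 30 bases and tags each value with a tuple, the first reducer multiplies the number of suffixes by the number of prefixes for each 30-mer, and a second reducer adds up the per-30-mer counts. The names `mapper_ends`, `reducer_pairs`, `reducer_total` and `run_local` are hypothetical, and the code again assumes one read per line. In MRJob the two reducers would belong to two separate steps, for example listed with `MRStep` in the job's `steps()` method.

```python
from collections import defaultdict

OVERLAP = 30  # overlap length required by Task B


def mapper_ends(_, line):
    """Emit the first and last OVERLAP bases of a read, tagged by role.

    Assumption: one read per input line. Shorter reads are skipped.
    """
    read = line.strip()
    if len(read) < OVERLAP:
        return
    yield read[:OVERLAP], ("prefix", 1)
    yield read[-OVERLAP:], ("suffix", 1)


def reducer_pairs(kmer, tagged_counts):
    """Count pairs for one 30-mer: (#reads ending) * (#reads starting)."""
    prefixes = suffixes = 0
    for tag, n in tagged_counts:
        if tag == "prefix":
            prefixes += n
        else:
            suffixes += n
    # A single shared key sends all partial counts to one reducer call.
    yield None, prefixes * suffixes


def reducer_total(_, partial_counts):
    """Second step: add up the per-30-mer pair counts."""
    yield "pairs", sum(partial_counts)


def run_local(lines):
    """Hypothetical in-memory stand-in for the two shuffle phases."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper_ends(None, line):
            groups[key].append(value)
    partials = []
    for key, values in groups.items():
        for _, value in reducer_pairs(key, values):
            partials.append(value)
    for _, total in reducer_total(None, partials):
        return total


if __name__ == "__main__":
    a = "A" * 10 + "C" * 30
    b = "C" * 30 + "G" * 10
    print(run_local([a, b]))
```

The sketch makes two simplifications that a submitted solution may need to revisit. A read whose own first 30 bases equal its own last 30 bases is counted as a pair with itself, and pairs that also overlap by more than 30 bases are not treated specially. Note also that MRJob serializes values as JSON by default, so a tuple yielded by the mapper arrives in the reducer as a list; the unpacking in `reducer_pairs` works for both.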