1-DAV-202 Data Management 2023/24
Previously 2-INF-185 Data Source Integration
Difference between revisions of "HWcloud"
Revision as of 14:14, 30 April 2020
See also the lecture
For both tasks, submit your source code and the result obtained when it is run on the whole dataset (s3://idzbucket2). The code is expected to use the MRJob framework presented in the lecture. The submit directory is /submit/cloud/.
Task A
Count the number of occurrences of each 4-mer in the provided data.
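As a rough illustration of the counting pattern behind Task A, the sketch below writes the mapper and reducer as plain Python generators and runs them through a small local stand-in for the shuffle phase. It assumes each input line holds exactly one read as plain text, which the task statement does not confirm. The names `mapper`, `reducer`, and `run_local` are hypothetical; in an actual MRJob job the first two would become methods of an `MRJob` subclass.

```python
from collections import defaultdict

K = 4  # k-mer length required by Task A


def mapper(_, line):
    """Emit (k-mer, 1) for every window of length K in one read.

    Assumption: one read per input line, no header or quality lines.
    """
    read = line.strip()
    for i in range(len(read) - K + 1):
        yield read[i:i + K], 1


def reducer(kmer, counts):
    """Sum the partial counts collected for one k-mer."""
    yield kmer, sum(counts)


def run_local(lines):
    """Local stand-in for the shuffle phase, for testing only."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(None, line):
            groups[key].append(value)
    result = {}
    for key, values in groups.items():
        for out_key, out_value in reducer(key, values):
            result[out_key] = out_value
    return result
```

Calling `run_local(["ACGTACGT"])` groups the five windows of that read by 4-mer. Reads shorter than four bases contribute nothing.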
Task B
Count the number of pairs of reads that overlap by exactly 30 bases (the end of one read overlaps the beginning of the other). You can ignore reverse complements.
Hints:
- Try counting pairs for each 30-mer first.
- You can yield a structured value from the mapper (e.g. a tuple).
- MRJob supports two-step MapReduce jobs, which can help you with the final summation: https://pythonhosted.org/mrjob/guides/writing-mrjobs.html
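The hints above can be combined into a two-step sketch, shown here in plain Python so it can be tried locally without a cluster. It rests on several assumptions that the task statement leaves open: one read per input line, pairs are ordered (read a followed by read b), and a read is not paired with itself. All function names are hypothetical. In MRJob the two steps would presumably be wired together through `steps()` and `MRStep`, as described in the linked guide.

```python
from collections import defaultdict

OVERLAP = 30  # overlap length required by Task B


def mapper_ends(_, line):
    """Step 1 map: emit the first and last 30-mer of a read as tuples.

    Value layout: (is_prefix, is_suffix, prefix_equals_suffix).
    Assumption: one read per input line.
    """
    read = line.strip()
    if len(read) < OVERLAP:
        return
    prefix, suffix = read[:OVERLAP], read[-OVERLAP:]
    yield prefix, (1, 0, 0)
    yield suffix, (0, 1, 1 if prefix == suffix else 0)


def reducer_pairs(kmer, values):
    """Step 1 reduce: count ordered pairs that meet at this 30-mer."""
    prefixes = suffixes = self_pairs = 0
    for is_prefix, is_suffix, same in values:
        prefixes += is_prefix
        suffixes += is_suffix
        self_pairs += same
    # Each suffix can be followed by each prefix; drop read-with-itself pairs.
    yield None, prefixes * suffixes - self_pairs


def reducer_total(_, counts):
    """Step 2 reduce: sum the per-30-mer pair counts."""
    yield "pairs", sum(counts)


def shuffle(pairs):
    """Local stand-in for the shuffle phase, for testing only."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def count_overlapping_pairs(lines):
    step1 = shuffle(kv for line in lines for kv in mapper_ends(None, line))
    partial = [kv for k, vs in step1.items() for kv in reducer_pairs(k, vs)]
    total = 0
    for key, values in shuffle(partial).items():
        for _, value in reducer_total(key, values):
            total = value
    return total
```

Emitting everything under the single key `None` in the first reducer sends all partial counts to one reducer in the second step, which is what makes the final summation possible. Whether self-pairs and duplicate reads should be counted differently is a question for the assignment author.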