1-DAV-202 Data Management 2023/24
Previously 2-INF-185 Data Source Integration

· Grades from marked homeworks are on the server in file /grades/userid.txt
· Dates of project submission and oral exams:
Early: submit project May 24 9:00am, oral exams May 27 1:00pm (limit 5 students).
Otherwise submit project June 11, 9:00am, oral exams June 18 and 21 (estimated 9:00am-1:00pm, schedule will be published before exam).
Sign up for one of the exam days in AIS before June 11.
Remedial exams will take place in the last week of the exam period. Beware, there will not be much time to prepare a better project. Projects should be submitted as homeworks to /submit/project.
· Cloud homework is due on May 20 9:00am.


Difference between revisions of "HWcloud"

* Try counting pairs for each 30-mer first.
* You can yield something structured from Mapper (e.g. tuple).
* There is a two-step MapReduce, which can help you with the final summation: https://mrjob.readthedocs.io/en/latest/guides/writing-mrjobs.html

Revision as of 10:42, 22 April 2021

See also the lecture

For both tasks, submit your source code and the result obtained by running it on the whole dataset (s3://idzbucket2). The code is expected to use the MRJob framework presented in the lecture. The submit directory is /submit/cloud/

Task A

Count the number of occurrences of each 4-mer in the provided data.
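
The k-mer counting idea can be sketched in plain Python as a map step and a reduce step. This is only an illustration under assumptions not stated in the task: each input line is taken to hold one read as a bare sequence, and `run_local` is a hypothetical helper that imitates the grouping a MapReduce framework performs between the two steps. In an MRJob job, the two generators would roughly correspond to the `mapper(self, _, line)` and `reducer(self, key, values)` methods of a class derived from `MRJob`.

```python
from collections import defaultdict

K = 4  # k-mer length from Task A


def mapper(line):
    """Emit (k-mer, 1) for every window of length K in one read."""
    read = line.strip()  # assumption: one read per line, sequence only
    for i in range(len(read) - K + 1):
        yield read[i:i + K], 1


def reducer(kmer, counts):
    """Sum all partial counts that were emitted for one k-mer."""
    yield kmer, sum(counts)


def run_local(lines):
    """Hypothetical local stand-in for the framework's shuffle phase."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            grouped[key].append(value)
    result = {}
    for key in sorted(grouped):
        for out_key, out_value in reducer(key, grouped[key]):
            result[out_key] = out_value
    return result


# Two toy reads; the 4-mer CGTA occurs in both of them.
counts = run_local(["ACGTAC", "CGTA"])
print(counts)
```

Reads shorter than four bases produce no output, because the range in `mapper` is empty for them.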

Task B

Count the number of pairs of reads that overlap in exactly 30 bases (the end of one read overlaps the beginning of the other read). You can ignore reverse complements.
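
One possible way to organize this count, sketched in plain Python rather than as a finished MRJob solution: for each overlap string, count how many reads start with it and how many end with it, multiply the two numbers, and sum the products in a second step. The sketch rests on several assumptions: each input line holds one read as a bare sequence, identical reads count as distinct reads, and a read whose own prefix equals its own suffix is counted as a pair with itself. The function names and the in-memory grouping are hypothetical; in MRJob the two stages would correspond to two steps of one job.

```python
from collections import defaultdict


def mapper(line, overlap):
    """Emit the read's prefix and suffix of the given length, tagged by role."""
    read = line.strip()  # assumption: one read per line, sequence only
    if len(read) >= overlap:
        yield read[:overlap], ("prefix", 1)
        yield read[-overlap:], ("suffix", 1)


def pair_reducer(kmer, tagged_counts):
    """For one overlap string, pairs = (#reads ending with it) * (#reads starting with it)."""
    prefixes = 0
    suffixes = 0
    for tag, n in tagged_counts:
        if tag == "prefix":
            prefixes += n
        else:
            suffixes += n
    # All partial results share the key None so the second step sees them together.
    yield None, prefixes * suffixes


def sum_reducer(_, partial_counts):
    """Second step: add up the per-string pair counts."""
    yield "pairs", sum(partial_counts)


def count_overlapping_pairs(lines, overlap=30):
    """Hypothetical local driver imitating two chained map/reduce steps."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in mapper(line, overlap):
            grouped[key].append(value)
    partial = []
    for key, values in grouped.items():
        for _, product in pair_reducer(key, values):
            partial.append(product)
    for _, total in sum_reducer(None, partial):
        return total


# Toy data with an overlap of 3 instead of 30 to keep the reads short.
reads = ["AAACCC", "CCCGGG", "CCCTTT", "GGGAAA"]
total_pairs = count_overlapping_pairs(reads, overlap=3)
print(total_pairs)
```

In the toy data the ordered pairs are AAACCC followed by CCCGGG, AAACCC followed by CCCTTT, CCCGGG followed by GGGAAA, and GGGAAA followed by AAACCC.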

Hints: