1-DAV-202 Data Management 2023/24
Previously 2-INF-185 Data Source Integration
HWcloud
Latest revision as of 17:10, 9 May 2024
See also the lecture
Deadline: 20th May 2024 9:00
For Tasks B and C, submit your source code and the result of running it on the whole dataset (gs://fmph-mad-2024-public/). The code is expected to use the Apache Beam framework and Dataflow presented in the lecture. The submit directory is /submit/cloud/
Task A for bonus points
Document (with screenshots) your login journey up to the `gsutil mb` step.
Task B
Count the number of occurrences of each 4-mer in the provided data.
The provided data are in FASTQ format. They contain reads from a genomic sequencing run. By a read we mean a fragment of some DNA sequence. In FASTQ format, each read occupies 4 lines. The first line starts with @ and contains the read name. The second line contains the actual read (this is the important part for you). The third line contains + and the read name again. The fourth line contains a quality score for each base (you should ignore this).
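The record layout described above can be sketched in plain Python. This is only an illustration, not the expected Beam solution: the function name `read_sequences` and the sample records are made up for this example. It selects sequences by position (every second line of a 4-line record) rather than by looking for a leading @, because a quality line may itself start with @.

```python
def read_sequences(lines):
    """Yield the sequence (second line) of every 4-line FASTQ record.

    Assumes the input is well formed, i.e. each record is exactly 4 lines.
    """
    for index, line in enumerate(lines):
        if index % 4 == 1:  # 0: @name, 1: sequence, 2: +name, 3: qualities
            yield line.strip()


# Hypothetical two-record sample, used only for local testing.
sample = [
    "@read1", "ACGGCTA", "+read1", "@IIIIII",
    "@read2", "TTGA", "+read2", "IIII",
]
sequences = list(read_sequences(sample))
```

Note that the second line of the first record's quality string starts with @ and is still skipped correctly.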
By a k-mer we mean any substring of k consecutive characters in a read. For example, in the read "ACGGCTA" the 4-mers are: "ACGG", "CGGC", "GGCT", "GCTA".
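A minimal sketch of the k-mer definition, in plain Python so it can be checked locally. The helper name `kmers` is hypothetical; in a Beam pipeline a function like this could be passed to a FlatMap step, but how you structure the pipeline is up to you.

```python
from collections import Counter


def kmers(read, k):
    """Return all substrings of k consecutive characters of read, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


# The example from the text above.
example = kmers("ACGGCTA", 4)

# Counting occurrences over several reads (toy data, not the real dataset).
counts = Counter(kmer for read in ["ACGGCTA", "ACGG"] for kmer in kmers(read, 4))
```

A read shorter than k yields no k-mers, since the range is empty.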
Task C
Count the number of pairs of reads that overlap in exactly 30 bases (the end of one read overlaps the beginning of the second read). For bioinformaticians: You can ignore the reverse complement.
One more clarification: If you have two reads (and say we are counting 4 base overlaps): AAAAxxxxxCCCC and CCCCxxxxxAAAA, this counts as two overlaps.
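The clarification can be checked with a brute-force reference in plain Python. This is a sketch for tiny test inputs only (it compares every ordered pair, so it is far too slow for the whole dataset), and the function name is made up. It assumes that a read is not paired with itself; the assignment text does not say whether such self-overlaps count.

```python
from itertools import permutations


def count_overlaps_bruteforce(reads, k):
    """Count ordered pairs (a, b) of distinct reads where the last k bases
    of a equal the first k bases of b."""
    return sum(
        1
        for a, b in permutations(reads, 2)
        if len(a) >= k and len(b) >= k and a[-k:] == b[:k]
    )


# The two reads from the clarification, with 4-base overlaps.
result = count_overlaps_bruteforce(["AAAAxxxxxCCCC", "CCCCxxxxxAAAA"], 4)
```

Both orderings match here (CCCC joins the first read to the second, AAAA joins the second to the first), which gives the two overlaps mentioned above.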
Hints:
- Try counting pairs for each 30-mer first.
- You can yield something structured from a Map/ParDo operation (e.g. a tuple).
- You can have another Map/ParDo after CombinePerKey.
- Run code locally on small data to quickly iterate and test :)
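One way to read the first three hints, sketched in plain Python rather than Beam so it runs anywhere: a map step emits a structured value per read end, a per-key combine adds the values up, and a final step turns each key's totals into a pair count. All names here are hypothetical, the dictionary stands in for CombinePerKey, and this is not necessarily the intended solution. Unlike a pairwise comparison of distinct reads, this sketch also counts a read overlapping itself when its own first and last k bases are equal; whether that should count is an assumption you need to settle.

```python
from collections import defaultdict


def emit_ends(read, k):
    """Map step: yield (k-mer, (suffix_count, prefix_count)) for one read."""
    if len(read) >= k:
        yield read[-k:], (1, 0)  # this k-mer ends the read
        yield read[:k], (0, 1)   # this k-mer starts the read


def combine_per_key(pairs):
    """Stand-in for a per-key combine: sum the tuples component-wise."""
    totals = defaultdict(lambda: (0, 0))
    for key, (suffixes, prefixes) in pairs:
        total_suffixes, total_prefixes = totals[key]
        totals[key] = (total_suffixes + suffixes, total_prefixes + prefixes)
    return dict(totals)


def count_overlaps(reads, k):
    """Each k-mer contributes (#reads ending with it) * (#reads starting with it)."""
    pairs = [pair for read in reads for pair in emit_ends(read, k)]
    totals = combine_per_key(pairs)
    return sum(suffixes * prefixes for suffixes, prefixes in totals.values())


total = count_overlaps(["AAAAxxxxxCCCC", "CCCCxxxxxAAAA"], 4)
```

Running something like this on a handful of hand-written reads is a quick way to check a pipeline's output before submitting a Dataflow job.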