1-DAV-202 Data Management 2024/25

Grades from marked homeworks are on the server in the file /grades/userid.txt


Difference between revisions of "HWcloud"

 
(6 intermediate revisions by 2 users not shown)
Line 2:
 
See also the [[Lcloud|lecture]]
 
  
- Important: This homework counts only for bonus points.
+ Deadline: 20th May 2024 9:00

- For both tasks, submit your source code and the result, when run on whole dataset (<tt>gs://fmph-mad-2023-public/</tt>).
+ For both tasks, submit your source code and the result, when run on whole dataset (<tt>gs://fmph-mad-2024-public/</tt>).
 
The code is expected to use the Apache beam framework and Dataflow presented in the lecture. Submit directory is <tt>/submit/cloud/</tt>
 
 
<!-- /NOTEX -->
 
  
- ===Task A===
+ ===Task A for bonus points===

- Document (with screenshots) your login journey until step `gcloud projects list`.
+ Document (with screenshots) your login journey until step `gsutil mb`.
  
 
===Task B===
 
  
 
Count the number of occurrences of each 4-mer in the provided data.
 
+
+ Provided data are in Fastq format. They contain reads from some genomic sequencing.
+ By a read we mean part of some DNA. In Fastq format each read is on 4 lines.
+ The first line starts with @ and contains the read name.
+ The second line contains the actual read (this is the important part for you).
+ The third line contains + and the read name again.
+ The fourth line contains a quality score for each base (you should ignore this).
+
+ By k-mer we mean any consecutive substring in a read.
+ For example, in a read "ACGGCTA" the 4-mers are: "ACGG", "CGGC", "GGCT", "GCTA".
+
  
 
===Task C===
 
  
- Count the number of pairs of reads which overlap in exactly 30 bases (end of one read overlaps beginning of the second read). You can ignore reverse complement.
+ Count the number of pairs of reads that overlap in exactly 30 bases (the end of one read overlaps the beginning of the second read). For bioinformaticians: You can ignore the reverse complement.
+
+ One more clarification:
+ If you have two reads (and say we are counting 4 base overlaps): AAAAxxxxxCCCC and CCCCxxxxxAAAA, this counts as two overlaps.
  
 
Hints:  
 

Latest revision as of 18:10, 9 May 2024

See also the lecture

Deadline: 20th May 2024 9:00

For both tasks, submit your source code and the result of running it on the whole dataset (gs://fmph-mad-2024-public/). The code is expected to use the Apache Beam framework and Dataflow, as presented in the lecture. The submit directory is /submit/cloud/

Task A for bonus points

Document (with screenshots) your login journey up to and including the `gsutil mb` step.

Task B

Count the number of occurrences of each 4-mer in the provided data.

The provided data are in FASTQ format and contain reads from a genomic sequencing run. By a read we mean a fragment of a DNA sequence. In FASTQ format each read occupies 4 lines. The first line starts with @ and contains the read name. The second line contains the actual read (this is the important part for you). The third line starts with + and contains the read name again. The fourth line contains a quality score for each base (you should ignore this).
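To make the 4-line layout concrete, here is a minimal sketch in plain Python that pulls out only the read line of each record. The function name `read_sequences` and the sample records are made up for illustration and are not part of the assignment. The sketch assumes the lines arrive in file order, which holds when reading a local file but not necessarily when a Beam text source hands you lines as independent elements.

```python
from typing import Iterable, Iterator


def read_sequences(lines: Iterable[str]) -> Iterator[str]:
    """Yield the sequence line (2nd of every 4 lines) of each FASTQ record."""
    for index, line in enumerate(lines):
        # Lines 0, 4, 8, ... are names; 1, 5, 9, ... are the reads themselves.
        if index % 4 == 1:
            yield line.strip()


# Hypothetical two-record sample, not taken from the real dataset.
sample = [
    "@read1",
    "ACGGCTA",
    "+read1",
    "IIIIIII",
    "@read2",
    "TTGACCA",
    "+read2",
    "IIIIIII",
]

print(list(read_sequences(sample)))  # ['ACGGCTA', 'TTGACCA']
```

In a distributed pipeline you would likely need a different way to recognise read lines, since the position of a line within the file is not available per element.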

By a k-mer we mean any substring of k consecutive bases in a read. For example, in the read "ACGGCTA" the 4-mers are: "ACGG", "CGGC", "GGCT", "GCTA".
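The definition can be checked locally with a few lines of plain Python. This is only a sketch for small test inputs, not the Beam solution the task asks for; the helper names `kmers` and `count_kmers` are chosen here for illustration.

```python
from collections import Counter
from typing import Iterable, List


def kmers(read: str, k: int = 4) -> List[str]:
    """Return every substring of k consecutive bases, left to right."""
    # A read shorter than k gives an empty range and therefore no k-mers.
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def count_kmers(reads: Iterable[str], k: int = 4) -> Counter:
    """Count occurrences of each k-mer over all reads."""
    counts: Counter = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


print(kmers("ACGGCTA"))  # ['ACGG', 'CGGC', 'GGCT', 'GCTA']
```

A read of length n contributes n - k + 1 k-mers, so repeated k-mers within one read are counted each time they occur.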


Task C

Count the number of pairs of reads that overlap in exactly 30 bases (the end of one read overlaps the beginning of the second read). For bioinformaticians: You can ignore the reverse complement.

One more clarification: if you have two reads AAAAxxxxxCCCC and CCCCxxxxxAAAA (and, say, we are counting 4-base overlaps), this counts as two overlaps.
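One possible reading of this rule, sketched in plain Python for small local tests: count ordered pairs (a, b) of distinct reads where the last k bases of a equal the first k bases of b. The function name `count_overlaps` is hypothetical. Two points are assumptions rather than something the task states: a read is not paired with itself, and "exactly k bases" is taken to mean only that the k-base suffix matches the k-base prefix.

```python
from collections import Counter
from typing import Sequence


def count_overlaps(reads: Sequence[str], k: int = 30) -> int:
    """Count ordered pairs (a, b), a != b, where a's k-suffix equals b's k-prefix."""
    usable = [r for r in reads if len(r) >= k]  # shorter reads cannot overlap
    suffixes = Counter(r[-k:] for r in usable)
    prefixes = Counter(r[:k] for r in usable)
    # For each k-mer: (reads ending with it) * (reads starting with it).
    all_pairs = sum(n * prefixes[kmer] for kmer, n in suffixes.items())
    # Assumption: drop the pairs formed by a read with itself.
    self_pairs = sum(1 for r in usable if r[-k:] == r[:k])
    return all_pairs - self_pairs


print(count_overlaps(["AAAAxxxxxCCCC", "CCCCxxxxxAAAA"], k=4))  # 2
```

Counting per k-mer avoids comparing every read with every other read, which would not scale to the whole dataset.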

Hints:

  • Try counting pairs for each 30-mer first.
  • You can yield something structured from a Map/ParDo operation (e.g. a tuple).
  • You can have another Map/ParDo after CombinePerKey.
  • Run code locally on small data to quickly iterate and test :)
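The hints above can be combined into a small Apache Beam sketch that runs locally on an in-memory list of reads. It is one possible shape under stated assumptions, not the reference solution: the names `tag_ends`, `sum_counts`, `pairs_for_kmer` and `run_local` are made up here, the per-k-mer product still includes a read paired with itself, and reading the FASTQ files and summing the per-k-mer results are left out.

```python
from typing import Iterable, Iterator, Tuple

# (number of reads ending with the k-mer, number of reads starting with it)
Counts = Tuple[int, int]


def tag_ends(read: str, k: int = 30) -> Iterator[Tuple[str, Counts]]:
    """Yield a structured (k-mer, counts) tuple for each end of a read."""
    if len(read) >= k:
        yield read[-k:], (1, 0)  # this read ends with the k-mer
        yield read[:k], (0, 1)   # this read starts with the k-mer


def sum_counts(values: Iterable[Counts]) -> Counts:
    """Element-wise sum; also valid when applied to partial sums."""
    ends = starts = 0
    for e, s in values:
        ends += e
        starts += s
    return ends, starts


def pairs_for_kmer(kmer: str, counts: Counts) -> Tuple[str, int]:
    """Second mapping step, applied after the per-key combine."""
    ends, starts = counts
    return kmer, ends * starts


def run_local(reads, k=30):
    # Imported here so the helpers above can be tested without Beam installed.
    import apache_beam as beam

    with beam.Pipeline() as pipeline:  # the default runner executes locally
        (
            pipeline
            | "Reads" >> beam.Create(reads)
            | "TagEnds" >> beam.FlatMap(tag_ends, k)
            | "SumPerKmer" >> beam.CombinePerKey(sum_counts)
            | "PairsPerKmer" >> beam.MapTuple(pairs_for_kmer)
            | "Print" >> beam.Map(print)
        )
```

With Beam installed, calling `run_local(["AAAAxxxxxCCCC", "CCCCxxxxxAAAA"], k=4)` should print one (k-mer, pair count) tuple per k-mer, in no particular order.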