Difference between revisions of "HWcloud"

Revision as of 13:44, 9 May 2024

Task A for bonus points

Document (with screenshots) your login journey up to the `gcloud projects list` step.

Task B

Count the number of occurrences of each 4-mer in the provided data.

The provided data are in FASTQ format and contain reads from a genomic sequencing experiment. A read is a fragment of some DNA sequence. In FASTQ format each read occupies 4 lines. The first line starts with @ and contains the read name. The second line contains the actual read (this is the important part for you). The third line contains + and the read name again. The fourth line contains a quality score for each base (you should ignore this).
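The 4-line layout described above can be read with a few lines of plain Python. This is only a minimal sketch: the function name `read_fastq` and the example records are made up for illustration, and it assumes every record occupies exactly 4 lines (no wrapped sequences).

```python
from typing import Iterable, Iterator, Tuple


def read_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (read_name, sequence) for each 4-line FASTQ record.

    Hypothetical helper: assumes exactly 4 lines per record.
    """
    it = iter(lines)
    while True:
        header = next(it, None)
        if header is None:
            return  # end of input
        header = header.strip()
        if not header:
            continue  # tolerate blank lines between records
        if not header.startswith("@"):
            raise ValueError(f"expected a line starting with '@', got {header!r}")
        sequence = next(it, "").strip()  # line 2: the actual read
        next(it, None)  # line 3: '+' and the read name, ignored
        next(it, None)  # line 4: quality scores, ignored
        yield header[1:], sequence


# Made-up example input with two records.
example = [
    "@read1\n", "ACGGCTA\n", "+read1\n", "IIIIIII\n",
    "@read2\n", "TTGACGG\n", "+read2\n", "IIIIIII\n",
]
records = list(read_fastq(example))
print(records)
```

With a real file, the same function could be called as `read_fastq(open(path))`, where `path` is wherever you stored the data.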

By a k-mer we mean any substring of k consecutive bases in a read. For example, in a read "ACGGCTA" the 4-mers are: "ACGG", "CGGC", "GGCT", "GCTA".
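The k-mer definition can be checked on small data before moving to the cloud. The sketch below is plain Python, not a pipeline; the names `kmers` and `count_kmers` are hypothetical, and the second read in the example is made up.

```python
from collections import Counter
from typing import Iterable, List


def kmers(read: str, k: int) -> List[str]:
    """Return all substrings of length k, in order of position."""
    # A read shorter than k yields an empty list.
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def count_kmers(reads: Iterable[str], k: int) -> Counter:
    """Count occurrences of every k-mer across all reads."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


four_mers = kmers("ACGGCTA", 4)
print(four_mers)  # ['ACGG', 'CGGC', 'GGCT', 'GCTA']

counts = count_kmers(["ACGGCTA", "ACGG"], 4)
print(counts["ACGG"])  # 2
```

In a Map/CombinePerKey style pipeline, the same idea would presumably be split into a step that emits each k-mer and a step that sums per key.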

Task C

Count the number of pairs of reads that overlap in exactly 30 bases (the end of one read overlaps the beginning of the second read). For bioinformaticians: You can ignore the reverse complement.

Hints:

Try counting pairs for each 30-mer first.
You can yield something structured from a Map/ParDo operation (e.g. a tuple).
You can have another Map/ParDo after CombinePerKey.
Run code locally on small data to quickly iterate and test :)
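One way to read the first hint is sketched below in plain Python for local testing on small data. It rests on several assumptions that the task text does not confirm: a pair is an ordered pair of two different reads (a, b) where the last k bases of a equal the first k bases of b, identical sequences in different records count as different reads, and the sketch does not check whether the two reads would also overlap in more than k bases. The name `count_overlap_pairs` is hypothetical.

```python
from collections import defaultdict
from typing import Iterable


def count_overlap_pairs(reads: Iterable[str], k: int = 30) -> int:
    """Count ordered pairs (a, b), a != b, where a's last k bases equal b's first k."""
    # For each k-mer: [number of reads ending with it, number of reads starting with it]
    ends = defaultdict(lambda: [0, 0])
    self_pairs = 0
    for read in reads:
        if len(read) < k:
            continue  # too short to overlap in k bases
        prefix, suffix = read[:k], read[-k:]
        ends[suffix][0] += 1
        ends[prefix][1] += 1
        if prefix == suffix:
            self_pairs += 1  # this read would otherwise be paired with itself
    # Every (ending read, starting read) combination per k-mer is one pair.
    total = sum(n_suffix * n_prefix for n_suffix, n_prefix in ends.values())
    return total - self_pairs


# Toy example with k=3 instead of 30 so the reads stay short.
toy_reads = ["AAACCC", "CCCGGG", "CCCTTT", "GGGAAA"]
pairs = count_overlap_pairs(toy_reads, k=3)
print(pairs)  # 4
```

The per-k-mer counters correspond to the structured values mentioned in the hints: each read contributes one "suffix" record and one "prefix" record, these are combined per key, and a final step multiplies the two counts.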

@@ Line 13: / Line 13: @@
 Count the number of occurrences of each 4-mer in the provided data.
+Provided data are in Fastq format. They contain reads from some genomic sequencing.
+By a read we mean part of some DNA. In Fastq format each read is on 4 lines.
+The first line starts with @ and contains the read name.
+The second line contains the actual read (this is the important part for you).
+The third line contains + and the read name again.
+The fourth line contains a quality score for each base (you should ignore this).
+By k-mer we mean any consecutive substring in a read.
+For example, in a read "ACGGCTA" the 4-mers are: "ACGG", "CGGC", "GGCT", "GCTA".
 ===Task C===
-Count the number of pairs of reads which overlap in exactly 30 bases (end of one read overlaps beginning of the second read). You can ignore reverse complement.
+Count the number of pairs of reads that overlap in exactly 30 bases (the end of one read overlaps the beginning of the second read). For bioinformaticians: You can ignore the reverse complement.
 Hints: