1-DAV-202 Data Management 2024/25

· Grades for the marked homeworks are on the server in the file /grades/userid.txt
· Dates of project submission and oral exams:
Early: submit project Wednesday May 28 9:00am, oral exams Friday May 30 9:00am (limit 8 students).
Regular: submit project Monday June 16, 9:00am, oral exams Thursday June 19 and 20 (estimated 9:00am-2:00pm, schedule will be published before exam).
Sign up for one of the exam days in AIS before June 16, 9:30am.
Remedial exams will take place in the last week of the exam period. Beware, there will not be much time to prepare a better project. Projects should be submitted as homeworks to /submit/project.
· The Cpp homework is due May 15, 9:00am. Use the remaining time to work on your projects.


Lpar


Today we will work with the parallel computing framework Apache Beam. Such frameworks are usually run in the cloud, distributing computation over many machines, but today we will run everything locally.

Apache Beam

Running locally

On your own machine, please install the required package with

pip install apache-beam

You are given a basic template with comments in /tasks/par/example_job.py

You can run it as follows:

python3 example_job.py --output out --input "/tasks/par/dataset/splitaa"

This job reads a single input file and stores the results in files whose names start with out (you can change this prefix if you want). Running on a single file like this is very useful for debugging.

The job itself simply counts the number of occurrences of each base in the input data (and discards the FASTQ headers and any other non-sequence lines).
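For reference, a FASTQ file stores each read as four lines: a header starting with @, the sequence itself, a separator line starting with +, and a line of quality scores. Only the sequence line consists purely of the bases A, C, G, T, so only those lines survive the filtering step described below. An illustrative record (the values are made up):

@read1 length=8
ACGTTGCA
+
IIIIHHHG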

You can use the --input parameter to read an input file (or multiple files) from your hard drive.

To run on multiple input files, you can do:

python example_job.py --output out --input "/tasks/par/dataset/*"

To run using multiple CPU cores, you can do:

python example_job.py --output out --input "/tasks/par/dataset/*" --direct_running_mode=multi_processing --direct_num_workers=4
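The flags above reach the pipeline through the usual Beam argument-parsing boilerplate: the template parses --input and --output itself and hands everything else to the runner as pipeline options. A minimal sketch of that part (following the standard Beam wordcount example; the exact names in /tasks/par/example_job.py may differ) could look like this:

import argparse

import apache_beam as beam
from apache_beam.io import ReadFromText, WriteToText
from apache_beam.options.pipeline_options import PipelineOptions

parser = argparse.ArgumentParser()
parser.add_argument('--input', required=True, help='input file or glob pattern')
parser.add_argument('--output', required=True, help='prefix of the output files')
known_args, pipeline_args = parser.parse_known_args()

# Runner flags such as --direct_running_mode and --direct_num_workers
# stay in pipeline_args and are interpreted by Beam itself.
pipeline_options = PipelineOptions(pipeline_args)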


This is the main part of the code:

with beam.Pipeline(options=pipeline_options) as p:

    # Read the text file[pattern] into a PCollection.
    lines = p | 'Read' >> ReadFromText(known_args.input)

    counts = (
        lines
        | 'Filter' >> beam.Filter(good_line)
        | 'Split' >> beam.ParDo(WordExtractingDoFn())
        | 'GroupAndSum' >> beam.CombinePerKey(sum))

    # Format the counts into a PCollection of strings.
    def format_result(word, count):
        return '%s: %d' % (word, count)

    output = counts | 'Format' >> beam.MapTuple(format_result)

    # Write the output using a "Write" transform that has side effects.
    output | 'Write' >> WriteToText(known_args.output)

First, we create a PCollection (think of it as a distributed, unordered collection of elements, similar to an array but without meaningful indices). Then we apply various Beam transforms over it.

  • Read: We use ReadFromText to load lines from a text file. Each line becomes an element in the PCollection.
  • Filter: The beam.Filter transform removes unwanted data early. It takes a function (in our case, good_line) and keeps only the elements for which the function returns True. Here, good_line(line) checks whether a line contains only valid genome characters (A, C, G, or T); lines that don't match are discarded (a sketch of good_line and the DoFn below follows this list).
  • ParDo: The beam.ParDo transform is a powerful and flexible way to process each element individually (or produce multiple outputs from a single input). Here, we use a custom DoFn (WordExtractingDoFn), which for each line emits multiple (character, 1) pairs — essentially "splitting" the line into individual letters and tagging each with a count of 1. An important difference between Map and ParDo: beam.Map produces exactly one output per input, while beam.ParDo can emit zero, one, or many outputs per input. This makes ParDo suitable for more complex transformations, like splitting, filtering inside a DoFn, or emitting multiple records from a single input.
  • CombinePerKey: After we have a lot of (character, 1) pairs, we need to combine them: count how many times each character appeared. beam.CombinePerKey(sum) does exactly this: it groups all elements by key (the character) and applies a combining function (here, Python’s built-in sum) to the associated values. So if there are ten (A, 1) pairs, they get combined into one (A, 10).
  • MapTuple (Format): We then format the results into a human-readable string using beam.MapTuple. It takes the key and the value and formats them as "character: count".
  • Write: Finally, the formatted strings are written to a text file using WriteToText.
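The template in /tasks/par/example_job.py already contains good_line and WordExtractingDoFn. A minimal sketch consistent with the description above (the real file may differ in details, e.g. how it treats lower-case letters or trailing whitespace) could look like this:

import apache_beam as beam

def good_line(line):
    # Keep only lines that consist solely of the four bases.
    return len(line) > 0 and all(ch in 'ACGT' for ch in line)

class WordExtractingDoFn(beam.DoFn):
    def process(self, element):
        # Emit one (character, 1) pair per base; a DoFn may yield
        # zero, one, or many outputs for a single input element.
        for ch in element:
            yield (ch, 1)

For contrast, beam.Map applied to a line can only ever return a single output element, which is why the splitting step uses ParDo (a callable that yields multiple elements could also be wrapped in beam.FlatMap).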

A few additional important notes:

Apache Beam is not just a library that runs a pipeline step by step in a linear fashion. The code above only defines the pipeline structure; the actual execution is deferred, so Beam can optimize and parallelize the computation under the hood, possibly splitting it across multiple machines and processing elements concurrently.

Transformations like Filter, ParDo, and CombinePerKey are lazy: they build up the pipeline graph but don't actually execute until the pipeline is run.
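You can see this deferred behaviour even in a tiny, self-contained pipeline (this sketch uses beam.Create to build an in-memory PCollection instead of reading files; the data is just an illustration):

import apache_beam as beam

p = beam.Pipeline()
counts = (
    p
    | 'Create' >> beam.Create(['ACGT', 'AACC'])
    | 'Split' >> beam.FlatMap(lambda line: [(ch, 1) for ch in line])
    | 'GroupAndSum' >> beam.CombinePerKey(sum)
    | 'Print' >> beam.Map(print))

# Nothing has been computed yet; only the pipeline graph exists.
result = p.run()            # execution starts here
result.wait_until_finish()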

You might want to check out more examples in the Apache Beam documentation.

Tips