1-DAV-202 Data Management 2024/25
Lpar
Today we will work with the parallel computing framework Apache Beam. Usually, such frameworks run in the cloud and distribute computation over multiple machines, but today we will run them locally.
Apache Beam
Running locally
On your own machine, please install the package with
pip install 'apache-beam[gcp]'
You are given a basic template with comments in /tasks/par/example_job.py
You can run it as follows:
python3 example_job.py --output out
This job reads one input file and stores the results into a file whose name starts with out. You can change the name if you want. Running the job locally like this is very useful for debugging.
The actual job simply counts the number of occurrences of each base in the input data (and discards any fastq headers).
You can use the parameter --input to read an input file from your hard drive (or multiple files).
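For example, assuming a FASTQ file data/reads.fastq on disk (the path is made up for illustration; ReadFromText also accepts glob patterns, which is how you would read multiple files):

python3 example_job.py --input data/reads.fastq --output out
python3 example_job.py --input 'data/*.fastq' --output out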
This is the relevant part of the code:
with beam.Pipeline(options=pipeline_options) as p:

    # Read the text file[pattern] into a PCollection.
    lines = p | 'Read' >> ReadFromText(known_args.input)

    counts = (
        lines
        | 'Filter' >> beam.Filter(good_line)
        | 'Split' >> (beam.ParDo(WordExtractingDoFn()))
        | 'GroupAndSum' >> beam.CombinePerKey(sum))

    # Format the counts into a PCollection of strings.
    def format_result(word, count):
        return '%s: %d' % (word, count)

    output = counts | 'Format' >> beam.MapTuple(format_result)

    # Write the output using a "Write" transform that has side effects.
    # pylint: disable=expression-not-assigned
    output | 'Write' >> WriteToText(known_args.output)
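The good_line filter and the WordExtractingDoFn used above are defined elsewhere in the template; a minimal sketch of what they might look like for this job (the bodies below are an assumption for illustration, not the template's exact code):

import apache_beam as beam

def good_line(line):
    # Keep only lines that look like DNA sequences; this drops
    # fastq headers, separators and quality strings (assumed heuristic).
    return len(line) > 0 and all(c in 'ACGTN' for c in line)

class WordExtractingDoFn(beam.DoFn):
    def process(self, element):
        # Emit a (base, 1) pair for every letter of the line;
        # the 'GroupAndSum' step then adds the ones up per base.
        for base in element:
            yield (base, 1)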
First we create a collection of data (think of it as a big array where indices are not significant). Then we apply various Beam functions to it: first we filter it to keep only good lines, then we extract the relevant parts of each line (we emit (c, 1) for each letter c), and finally we group the results by key (the first part of the tuple) and sum the values (the second part of the tuple).
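To see this chain on a toy input without any files, a self-contained variant might look as follows (the input values are made up; FlatMap plays the role of the ParDo step):

import apache_beam as beam

with beam.Pipeline() as p:
    (p
     | 'Create' >> beam.Create(['ACGT', 'AAC'])                      # in-memory "lines"
     | 'Pairs' >> beam.FlatMap(lambda line: [(c, 1) for c in line])  # emit (base, 1)
     | 'Sum' >> beam.CombinePerKey(sum)                              # add up ones per base
     | 'Print' >> beam.Map(print))                                   # ('A', 3), ('C', 2), ...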
Note that this is just a template for the job; Beam decides which part of the computation runs where and parallelizes things automatically.
One might ask what the difference between ParDo and Map is. Map outputs exactly one element per input element; ParDo can output as many as it wants (including none).
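A quick illustration of the difference (FlatMap is a lambda-friendly transform with the same one-to-many behaviour as ParDo; the element value below is made up):

import apache_beam as beam

with beam.Pipeline() as p:
    lines = p | beam.Create(['ACGT'])
    # Map: exactly one output per input element.
    lines | 'Len' >> beam.Map(len)         # one element: 4
    # FlatMap (ParDo-style): any number of outputs per input element.
    lines | 'Chars' >> beam.FlatMap(list)  # four elements: 'A', 'C', 'G', 'T'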
You might want to check out more examples in the Beam documentation.
Tips
- If you run the code locally, you can use print in processing functions (e.g. inside WordExtractingDoFn.process)
- CombinePerKey requires the combining function to be associative and commutative. If you want something more complicated, look at the averaging example (example 5) at https://beam.apache.org/documentation/transforms/python/aggregation/combineperkey/ ; a condensed sketch follows below.
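For reference, such an averaging combiner carries a (sum, count) accumulator through the combine; a condensed sketch along the lines of that example (not a verbatim copy):

import apache_beam as beam

class AverageFn(beam.CombineFn):
    def create_accumulator(self):
        # Running (sum, count) pair.
        return (0.0, 0)

    def add_input(self, accumulator, input):
        total, count = accumulator
        return total + input, count + 1

    def merge_accumulators(self, accumulators):
        # Partial accumulators from different workers are merged here.
        totals, counts = zip(*accumulators)
        return sum(totals), sum(counts)

    def extract_output(self, accumulator):
        total, count = accumulator
        return total / count if count else float('NaN')

# Usage: key_value_pairs | beam.CombinePerKey(AverageFn())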