1-DAV-202 Data Management 2024/25
Lpar
Today we will work with the parallel computing framework Apache Beam. Such frameworks are usually run in the cloud to spread a computation over multiple machines, but today we will run it locally.
Apache Beam
This is the relevant part of the code:
with beam.Pipeline(options=pipeline_options) as p:

    # Read the text file[pattern] into a PCollection.
    lines = p | 'Read' >> ReadFromText(known_args.input)

    counts = (
        lines
        | 'Filter' >> beam.Filter(good_line)
        | 'Split' >> (beam.ParDo(WordExtractingDoFn()))
        | 'GroupAndSum' >> beam.CombinePerKey(sum))

    # Format the counts into a PCollection of strings.
    def format_result(word, count):
        return '%s: %d' % (word, count)

    output = counts | 'Format' >> beam.MapTuple(format_result)

    # Write the output using a "Write" transform that has side effects.
    # pylint: disable=expression-not-assigned
    output | 'Write' >> WriteToText(known_args.output)
First we create a collection of data (think of it as a big array whose indices are not significant). Then we apply various Beam transforms to it: we filter it to keep only the good lines, extract the relevant parts of each line (we emit (c, 1) for each letter c), and finally group the results by key (the first element of the tuple) and sum the values (the second element of the tuple).
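To make the three steps concrete, here is a minimal single-machine sketch in plain Python of what the pipeline computes. It is not Beam code and nothing in it runs in parallel. The filter criterion in good_line (keep non-empty lines) and the helper names extract_letters and count_letters are assumptions made for illustration; the lab's real good_line and WordExtractingDoFn may differ.

```python
from collections import defaultdict


def good_line(line):
    # Hypothetical criterion: keep only lines that are not blank.
    return bool(line.strip())


def extract_letters(line):
    # Emit one (letter, 1) pair per alphabetic character,
    # mirroring what the 'Split' step is described as doing.
    for c in line:
        if c.isalpha():
            yield (c, 1)


def count_letters(lines):
    totals = defaultdict(int)
    for line in filter(good_line, lines):          # 'Filter'
        for key, value in extract_letters(line):   # 'Split'
            totals[key] += value                   # 'GroupAndSum'
    return dict(totals)


print(count_letters(["abba", "", "cab"]))  # {'a': 3, 'b': 3, 'c': 1}
```

In Beam the same logic is spread over workers, so the order in which pairs reach the summing step is not fixed; that is why the summing function must not depend on order.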
Note that this is just a template for the job. Beam decides which part of the computation runs where and parallelizes it automatically.
One might ask what the difference between ParDo and Map is. Map outputs exactly one element for each input element. ParDo can output any number of elements per input, including none.
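The difference in output cardinality can be illustrated with a plain-Python analogy. This is a sketch of the two contracts only, not Beam code: map_like, pardo_like and letters are hypothetical helper names, and real Beam transforms operate on PCollections rather than lists.

```python
def map_like(fn, elements):
    # One output per input: the result has the same length as the input.
    return [fn(x) for x in elements]


def pardo_like(process, elements):
    # 'process' returns an iterable of zero or more outputs per input;
    # all of them are flattened into a single result.
    return [out for x in elements for out in process(x)]


def letters(line):
    # Yields nothing for lines without letters, several pairs for others.
    for c in line:
        if c.isalpha():
            yield (c, 1)


lines = ["ab", "", "c!"]
print(map_like(len, lines))        # [2, 0, 2]
print(pardo_like(letters, lines))  # [('a', 1), ('b', 1), ('c', 1)]
```

Three input lines give three outputs under the Map-style contract, while the ParDo-style contract produces two, zero and one outputs respectively.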
You might want to check out more examples in the Beam documentation.
Tips
- First run your code locally, where it is much faster to iterate. Only once you are satisfied with the result, run it in the cloud on the full dataset.
- When you run the code locally, you can use print in the processing functions (e.g. inside WordExtractingDoFn.process).
- CombinePerKey requires the combining function to be associative and commutative. If you want something more complicated, look at the averaging example (example 5) here: https://beam.apache.org/documentation/transforms/python/aggregation/combineperkey/
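The last tip can be illustrated with averaging, which is not associative when applied to partial results. The sketch below is plain Python, not Beam code. MeanAccumulator is a hypothetical stand-in that uses the same four method names as Beam's CombineFn (create_accumulator, add_input, merge_accumulators, extract_output), and combine is an assumed helper that imitates workers reducing separate chunks before their partial results are merged.

```python
class MeanAccumulator:
    """Plain-Python stand-in with the four methods of a Beam CombineFn."""

    def create_accumulator(self):
        return (0.0, 0)  # (running sum, number of values)

    def add_input(self, acc, value):
        total, count = acc
        return (total + value, count + 1)

    def merge_accumulators(self, accumulators):
        accumulators = list(accumulators)
        return (sum(a[0] for a in accumulators),
                sum(a[1] for a in accumulators))

    def extract_output(self, acc):
        total, count = acc
        return total / count if count else float("nan")


def combine(fn, chunks):
    # Each chunk plays the role of the values seen by one worker.
    partials = []
    for chunk in chunks:
        acc = fn.create_accumulator()
        for value in chunk:
            acc = fn.add_input(acc, value)
        partials.append(acc)
    return fn.extract_output(fn.merge_accumulators(partials))


def mean_of_means(chunks):
    # Naive approach: average the per-chunk averages.
    means = [sum(c) / len(c) for c in chunks]
    return sum(means) / len(means)


split_a = [[1, 2], [3]]
split_b = [[1], [2, 3]]
print(combine(MeanAccumulator(), split_a), combine(MeanAccumulator(), split_b))
print(mean_of_means(split_a), mean_of_means(split_b))
```

Carrying (sum, count) pairs gives 2.0 for both ways of splitting the data, while the naive mean of means gives 2.25 and 1.75, so its result depends on how the values happen to be distributed.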