1-DAV-202 Data Management 2024/25

· Grades from marked homeworks are on the server in the file /grades/userid.txt
· Dates of project submission and oral exams:
Early: submit the project by Wednesday, May 28, 9:00am; oral exams Friday, May 30, 9:00am (limited to 8 students).
Regular: submit the project by Monday, June 16, 9:00am; oral exams Thursday, June 19 and Friday, June 20 (estimated 9:00am-2:00pm; the schedule will be published before the exam).
Sign up for one of the exam days in AIS before June 16, 9:30am.
Remedial exams will take place in the last week of the exam period. Beware: there will not be much time to prepare a better project. Projects should be submitted the same way as homeworks, to /submit/project.
· The Cpp homework is due May 15, 9:00am. Use the time to work on your projects.


Lpar


Today we will work with the parallel computing framework Apache Beam. Usually these frameworks run in the cloud, distributing computation over multiple machines, but today we will run them locally.

Apache Beam

This is the relevant part of the code:

  with beam.Pipeline(options=pipeline_options) as p:

    # Read the text file[pattern] into a PCollection.
    lines = p | 'Read' >> ReadFromText(known_args.input)

    counts = (
        lines
        | 'Filter' >> beam.Filter(good_line)
        | 'Split' >> (beam.ParDo(WordExtractingDoFn()))
        | 'GroupAndSum' >> beam.CombinePerKey(sum))

    # Format the counts into a PCollection of strings.
    def format_result(word, count):
      return '%s: %d' % (word, count)

    output = counts | 'Format' >> beam.MapTuple(format_result)

    # Write the output using a "Write" transform that has side effects.
    # pylint: disable=expression-not-assigned
    output | 'Write' >> WriteToText(known_args.output)

First we create a collection of data (think of it as a big array where the indices are not significant). Then we apply various Beam transforms to it: first we filter it to keep only good lines, then we extract the relevant parts of each line (we emit (c, 1) for each letter c), and finally we group the results by key (the first element of the tuple) and sum the values (the second element of the tuple).
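The snippet above assumes two helpers defined elsewhere in the job. A minimal sketch of what they might look like, matching the description above (the exact filtering criterion is an assumption, as it is not shown in the snippet):

  import apache_beam as beam

  # Assumed predicate: keep only non-empty lines (what counts as a
  # "good" line is not shown in the snippet above).
  def good_line(line):
    return len(line.strip()) > 0

  # Emits a (c, 1) pair for every letter c in the line, matching the
  # description of the 'Split' step above.
  class WordExtractingDoFn(beam.DoFn):
    def process(self, element):
      for c in element:
        if c.isalpha():
          yield (c, 1)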

Note that this is just a template for the job. Beam decides which part of the computation runs where and parallelizes things automatically.
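For example, when running locally, the pipeline options can simply select the DirectRunner, which executes the whole template on your machine (a sketch; the flag shown is an assumption about how this particular job is configured):

  from apache_beam.options.pipeline_options import PipelineOptions

  # DirectRunner runs everything locally; switching the runner (e.g. to
  # DataflowRunner) sends the same job template to the cloud without
  # changing the pipeline code itself.
  pipeline_options = PipelineOptions(['--runner=DirectRunner'])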

One might ask what the difference between ParDo and Map is. Map outputs exactly one element per input element; ParDo can output as many as it wants, including none.
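A small illustration of the difference, reusing the lines collection from above (beam.FlatMap is the convenience wrapper for a multi-output ParDo over a simple function):

  # Map: exactly one output element per input element.
  lengths = lines | 'Lengths' >> beam.Map(len)          # "abc" -> 3

  # FlatMap/ParDo: zero or more output elements per input element.
  words = lines | 'Words' >> beam.FlatMap(str.split)    # "a b" -> "a", "b"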

You might want to check out more examples in the Beam documentation.

Tips

  • First run your code locally; it is much faster to iterate. Only once you are satisfied with the result, run it in the cloud on the full dataset.
  • If you run the code locally, you can use print in your processing functions (e.g. inside WordExtractingDoFn.process).
  • CombinePerKey requires the combining function to be associative and commutative. If you want something more complicated, look at the averaging example (example 5) at https://beam.apache.org/documentation/transforms/python/aggregation/combineperkey/ and the sketch below.
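
For instance, sum qualifies because addition is associative and commutative, but a plain mean does not: averaging partial averages gives the wrong answer. Along the lines of the linked averaging example, a mean can instead be computed with a CombineFn that carries a (sum, count) accumulator (a sketch; the names are illustrative):

  import apache_beam as beam

  # Accumulates (sum, count) per key; both combining steps below are
  # associative and commutative, so Beam may merge partial results in
  # any order and on any machine.
  class MeanFn(beam.CombineFn):
    def create_accumulator(self):
      return (0.0, 0)

    def add_input(self, accumulator, value):
      s, n = accumulator
      return (s + value, n + 1)

    def merge_accumulators(self, accumulators):
      sums, counts = zip(*accumulators)
      return (sum(sums), sum(counts))

    def extract_output(self, accumulator):
      s, n = accumulator
      return s / n if n else float('nan')

  # Usage, assuming pairs is a PCollection of (key, number) tuples:
  # means = pairs | 'Mean' >> beam.CombinePerKey(MeanFn())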