1-DAV-202 Data Management 2023/24
Previously 2-INF-185 Data Source Integration

· Grades from marked homeworks are on the server in file /grades/userid.txt
· Dates of project submission and oral exams:
Early: submit the project by May 24, 9:00am; oral exams May 27, 1:00pm (limit 5 students).
Otherwise submit the project by June 11, 9:00am; oral exams June 18 and 21 (estimated 9:00am-1:00pm; the schedule will be published before the exam).
Sign up for one of the exam days in AIS before June 11.
Remedial exams will take place in the last week of the exam period. Beware, there will not be much time to prepare a better project. Projects should be submitted in the same way as homeworks, to /submit/project.
· Cloud homework is due on May 20, 9:00am.


Lcloud (latest revision as of 05:50, 22 July 2024)

Today we will work with Google Cloud (GCP), which is a cloud computing platform. GCP contains many services (virtual machines, Kubernetes, storage, databases, ...). We are mainly interested in Dataflow and Storage. Dataflow allows highly parallel computation on large datasets. We will use an educational account which gives you a certain amount of resources for free.

Basic setup

You should have received instructions by email on how to create a GCloud account and get cloud credits. You should be able to log in to the Google Cloud console.

Now:

  • Log in to some Linux machine (ideally vyuka).
  • If the machine is not vyuka, install the gcloud command-line package (we recommend installing it via snap: https://cloud.google.com/sdk/docs/downloads-snap).
  • Run the following command to initialize and authorize your GCloud configuration:
gcloud init --console-only
  • Follow the instructions (copy the provided link to your browser, log in, and then copy the code back to the console).

Input files and data storage

Today we will use Google Cloud Storage (https://cloud.google.com/storage) to store input and output files. Think of it as a limited external disk: you can upload and download whole files, but there is no random access to the middle of a file.

Run the following two commands to check if you can see the "bucket" (data storage) associated with this lecture:

# the following command should give you a big list of files
gsutil ls gs://fmph-mad-2024-public/

# this command downloads one file from the bucket
gsutil cp gs://fmph-mad-2024-public/splitaa splitaa

# the following command prints the file in your console 
# (no need to do this).
gsutil cat gs://fmph-mad-2024-public/splitaa

You should also create your own bucket (storage area). Pick your own name; it must be globally unique:

gsutil mb gs://mysuperawesomebucket

If you get "AccessDeniedException: 403 The billing account for the owning project is disabled in state absent", you should open project in web UI (console.cloud.google.com), head to page Billing -> Link billing account and select "XY for Education".


Apache Beam and Dataflow

We will be using Apache Beam in this session (because PySpark stinks).

Running locally

If you want to use your own machine, please install the required packages with:

pip install 'apache-beam[gcp]'

You are given a basic template with comments in /tasks/cloud/example_job.py.

You can run it locally as follows:

First run (this is needed just once):

gcloud auth application-default login --no-launch-browser

Then:

python3 example_job.py --output out

This job downloads one file from cloud storage and stores the result in files whose names start with out. You can change the name if you want. This is very useful for debugging.

The actual job just counts the number of occurrences of each base in the input data (and discards any FASTQ headers).

You can also use the --input parameter to process an input file from your hard drive.

Running in Dataflow

Now you can run the Beam job in Dataflow on a small sample:

python3 example_job.py --output gs://YOUR_BUCKET/out/outxy --region europe-west1 --runner DataflowRunner --project PROJECT_ID --temp_location gs://YOUR_BUCKET/temp/ --input gs://fmph-mad-2024-public/splitaa

You can find PROJECT_ID using:

gcloud projects list

You will probably get an error like:

Dataflow API has not been used in project XYZ before or it is disabled. Enable it by visiting https://console.developers.google.com/apis/api/dataflow.googleapis.com/overview?project=XYZ then retry. If you enabled this API recently, wait a few minutes for the action to propagate to our systems and retry.

Visit the URL (from your error message, not from this lecture), click Enable API, and then run the command again.
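Alternatively, the Dataflow API can be enabled from the command line (replace PROJECT_ID with your own project id):

gcloud services enable dataflow.googleapis.com --project=PROJECT_ID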

If you get an error containing ZONE_RESOURCE_POOL_EXHAUSTED, try changing the region to us-east1 or some other region from https://cloud.google.com/compute/docs/regions-zones#available.

You can then download the output using:

gsutil cp gs://YOUR_BUCKET/out/outxy* .

Now you can run the job on the full dataset (this is what you should be doing in the homework):

python3 example_job.py --output gs://YOUR_BUCKET/out/outxy --region europe-west1 --runner DataflowRunner --project PROJECT_ID --temp_location gs://YOUR_BUCKET/temp/ --input gs://fmph-mad-2024-public/* --num_workers 5 --worker_machine_type n2-standard-4

If you want to watch progress:

  • Go to the web console. Find Dataflow in the menu (or type it into the search bar), go to Jobs, and select your job.
  • If you want to see the machines being magically created, go to VM Instances.

Apache Beam

This is the relevant part of the code:

  with beam.Pipeline(options=pipeline_options) as p:

    # Read the text file[pattern] into a PCollection.
    lines = p | 'Read' >> ReadFromText(known_args.input)

    counts = (
        lines
        | 'Filter' >> beam.Filter(good_line)
        | 'Split' >> (beam.ParDo(WordExtractingDoFn()))
        | 'GroupAndSum' >> beam.CombinePerKey(sum))

    # Format the counts into a PCollection of strings.
    def format_result(word, count):
      return '%s: %d' % (word, count)

    output = counts | 'Format' >> beam.MapTuple(format_result)

    # Write the output using a "Write" transform that has side effects.
    # pylint: disable=expression-not-assigned
    output | 'Write' >> WriteToText(known_args.output)

First we create a collection of data (think of it as a big array where the indices are not significant). Then we apply various Beam functions over it. First we filter it to keep only good lines, then we extract the relevant parts of each line (we emit (c, 1) for each letter c), and then we group the results by key (the first part of the tuple) and sum the values (the second part of the tuple).
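The good_line filter and WordExtractingDoFn used above live in /tasks/cloud/example_job.py. As a rough idea only (the real template may differ), they could look something like this:

import apache_beam as beam

# Hypothetical sketch; the actual definitions are in /tasks/cloud/example_job.py.
def good_line(line):
    # keep only sequence lines, drop FASTQ header lines (a simple heuristic)
    return len(line) > 0 and not line.startswith('@') and not line.startswith('+')

class WordExtractingDoFn(beam.DoFn):
    def process(self, element):
        # emit (base, 1) for every character, so that CombinePerKey(sum) can count bases
        for base in element.strip():
            yield (base, 1)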

Note that this is just a template for the job. Beam decides which part of the computation runs where and parallelizes things automatically.

One might ask what the difference between ParDo and Map is. Map outputs exactly one element per input; ParDo can output as many elements as it wants.
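A minimal local illustration of the difference (assuming apache-beam is installed; ExplodeDoFn is a made-up name for this example):

import apache_beam as beam

class ExplodeDoFn(beam.DoFn):
    # emits one output element per character of the input string
    def process(self, element):
        for ch in element:
            yield ch

with beam.Pipeline() as p:  # runs locally with the default DirectRunner
    lines = p | beam.Create(['ACGT', 'GG'])
    # Map: exactly one output per input -> 4, 2
    lengths = lines | 'Lengths' >> beam.Map(len)
    # ParDo: any number of outputs per input -> A, C, G, T, G, G
    chars = lines | 'Explode' >> beam.ParDo(ExplodeDoFn())
    lengths | 'PrintLengths' >> beam.Map(print)
    chars | 'PrintChars' >> beam.Map(print)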

You might want to check out more examples in the Beam documentation: https://beam.apache.org/documentation/transforms/python/elementwise/map/

Tips

  • First run your code locally; it is much faster to iterate. Only when you are satisfied with the result, run it in the cloud on the full dataset.
  • If you run code locally, you can use print in the processing functions (e.g. inside WordExtractingDoFn.process).
  • CombinePerKey requires the combining function to be associative and commutative. If you want something more complicated, look at the averaging example (example 5) at https://beam.apache.org/documentation/transforms/python/aggregation/combineperkey/ (a sketch is also shown below).
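For reference, a sketch of a non-trivial combiner (per-key averaging), closely following the averaging example in the linked Beam documentation:

import apache_beam as beam

class AverageFn(beam.CombineFn):
    # the accumulator is a (sum, count) pair
    def create_accumulator(self):
        return (0.0, 0)

    def add_input(self, accumulator, input):
        total, count = accumulator
        return (total + input, count + 1)

    def merge_accumulators(self, accumulators):
        totals, counts = zip(*accumulators)
        return (sum(totals), sum(counts))

    def extract_output(self, accumulator):
        total, count = accumulator
        return total / count if count else float('NaN')

with beam.Pipeline() as p:
    (p
     | beam.Create([('a', 1), ('a', 3), ('b', 10)])
     | beam.CombinePerKey(AverageFn())  # -> ('a', 2.0), ('b', 10.0)
     | beam.Map(print))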