1-DAV-202 Data Management 2023/24
Previously 2-INF-185 Data Source Integration
Today we will work with Google Cloud Platform (GCP), a cloud computing platform. GCP offers many services (virtual machines, Kubernetes, storage, databases, ...). We are mainly interested in Dataflow and Storage. Dataflow allows highly parallel computation on large datasets. We will use an educational account which gives you a certain amount of resources for free.
Basic setup
You should have received instructions on how to create a GCloud account via MS Teams. You should be able to log in to the Google Cloud console. (TODO picture).
Now:
- Log in to some Linux machine (ideally vyuka).
- If the machine is not vyuka, install the gcloud command-line package (I recommend installing it via snap: [1]).
- Run gcloud init --console-only
- Follow the instructions (copy the link into a browser, log in, and then copy the code back into the console).
Input files and data storage
Today we will use Google Cloud Storage to store input files and outputs. Think of it as a limited external disk (more like Google Drive than Dropbox): you can upload and download files, but there is no random access to the middle of a file.
Run the first two commands below to check whether you can see the "bucket" (data storage) associated with this lecture:
# the following command should give you a big list of files
gsutil ls gs://mad-2022-public/
# this command downloads one file from the bucket
gsutil cp gs://mad-2022-public/splitaa splitaa
# the following command prints the file in your console
# (no need to do this).
gsutil cat gs://mad-2022-public/splitaa
You should also create your own bucket (storage area). Pick your own name; it must be globally unique:
gsutil mb gs://mysuperawesomebucket
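If you prefer to work with buckets from Python instead of gsutil, the google-cloud-storage client library can do the same operations (install it with pip install google-cloud-storage if it is not already present). The following is only an optional sketch: it reuses the lecture's bucket names as placeholders and needs credentials, e.g. run gcloud auth application-default login first, or point GOOGLE_APPLICATION_CREDENTIALS at a service-account key as described further below.

from google.cloud import storage

# Credentials are taken from application-default credentials
# or from the GOOGLE_APPLICATION_CREDENTIALS environment variable.
client = storage.Client(project="PROJECT_ID")  # replace with your project-id

# List the objects in the public lecture bucket.
for blob in client.list_blobs("mad-2022-public"):
    print(blob.name)

# Download one file into the current directory.
client.bucket("mad-2022-public").blob("splitaa").download_to_filename("splitaa")

# Upload the file into your own bucket (use the name you picked above).
client.bucket("mysuperawesomebucket").blob("splitaa").upload_from_filename("splitaa")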
Apache Beam and Dataflow
We will be using Apache Beam in this session (because PySpark stinks).
Running locally
If you want to use your own machine, install the required packages with pip install 'apache-beam[gcp]'
You are given a basic template with comments in /tasks/cloud/example_job.py
You can run it locally as follows:
python3 example_job.py --output out
This job downloads one file from cloud storage and writes the result into files whose names start with out. You can change the name if you want. Running locally like this is very useful for debugging.
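The file /tasks/cloud/example_job.py is the authoritative template; the sketch below is only a hypothetical minimal word-count pipeline of the same shape (the default input and all names here are illustrative, not taken from the template). It shows the usual pattern: parse --input/--output yourself and hand any remaining flags to Beam, so the same script runs locally or on Dataflow.

import argparse

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def run(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--input", default="gs://mad-2022-public/splitaa")
    parser.add_argument("--output", required=True)
    args, beam_args = parser.parse_known_args(argv)
    # Flags we did not consume (--runner, --project, --region, ...) go to Beam.
    options = PipelineOptions(beam_args)
    with beam.Pipeline(options=options) as p:
        (p
         | "Read" >> beam.io.ReadFromText(args.input)
         | "Split" >> beam.FlatMap(str.split)
         | "PairWithOne" >> beam.Map(lambda word: (word, 1))
         | "Count" >> beam.CombinePerKey(sum)
         | "Format" >> beam.MapTuple(lambda word, count: "%s\t%d" % (word, count))
         | "Write" >> beam.io.WriteToText(args.output))


if __name__ == "__main__":
    run()

Without a --runner flag Beam uses the local DirectRunner, which is what the command above does.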
Running in Dataflow
First we need to create a service account (an account for a machine rather than a person). Run the following commands:
- gcloud projects list
- This shows you the project ID; you will need it below.
- gcloud iam service-accounts create mad-sacc
- This creates a service account named mad-sacc.
- gcloud projects add-iam-policy-binding PROJECT_ID --member="serviceAccount:mad-sacc@PROJECT_ID.iam.gserviceaccount.com" --role=roles/storage.objectAdmin
- gcloud projects add-iam-policy-binding PROJECT_ID --member="serviceAccount:mad-sacc@PROJECT_ID.iam.gserviceaccount.com" --role=roles/dataflow.admin
- gcloud projects add-iam-policy-binding PROJECT_ID --member="serviceAccount:mad-sacc@PROJECT_ID.iam.gserviceaccount.com" --role=roles/editor
- These commands give your service account the needed permissions. Do not forget to change PROJECT_ID to your project ID. Also note that we are quite liberal with the permissions; this would not be ideal in production.
- gcloud iam service-accounts keys create key.json --iam-account=mad-sacc@PROJECT_ID.iam.gserviceaccount.com
- This creates a key for your service account, named key.json (in your current directory).
- export GOOGLE_APPLICATION_CREDENTIALS=/home/jano/hrasko/key.json
- This sets up an environment variable (change the path to the actual location of your key). You will need to set it every time you open a console (or put it into your .bashrc).
Now you can run a Beam job in Dataflow (substitute your own script name, project, and bucket in the command below):
python3 wordcount.py --output gs://mad-2022-usamec/out/outxy --region europe-west1 --runner DataflowRunner --project mad-2022 --temp_location gs://mad-2022-usamec/temp/ --input gs://mad-2022-public/splitaa
You will probably get an error like: Dataflow API has not been used in project XYZ before or it is disabled. Enable it by visiting https://console.developers.google.com/apis/api/dataflow.googleapis.com/overview?project=XYZ then retry. If you enabled this API recently, wait a few minutes for the action to propagate to our systems and retry.
Visit the URL (from your error message, not from this lecture) and click Enable API.
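As an aside, the Dataflow-specific settings do not have to be passed on the command line; they can also be set in code through pipeline options. A small sketch (it reuses the project and bucket names from the command above purely as placeholders):

from apache_beam.options.pipeline_options import (
    GoogleCloudOptions, PipelineOptions, StandardOptions)

options = PipelineOptions()
options.view_as(StandardOptions).runner = "DataflowRunner"
gcp_options = options.view_as(GoogleCloudOptions)
gcp_options.project = "mad-2022"                          # your project-id
gcp_options.region = "europe-west1"
gcp_options.temp_location = "gs://mad-2022-usamec/temp/"  # a folder in your own bucket
# Then pass these options to beam.Pipeline(options=options) as in the local example.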
You can run it in the cloud on the whole dataset as follows:
python3 example_job.py -r emr --region us-east-1 s3://idzbucket2 \
--num-core-instances 4 -o s3://<your bucket>/<some directory>
For testing we recommend using a smaller sample as follows:
python3 example_job.py -r emr --region us-east-1 s3://idzbucket2/splita* \
--num-core-instances 4 -o s3://<your bucket>/<some directory>
Other useful commands
You can download the output as follows:
# list of files
aws s3 ls s3://<your bucket>/<some directory>/
# download
aws s3 cp s3://<your bucket>/<some directory>/ . --recursive
If you want to watch progress:
- Click on the AWS Console button in your workbench (Vocareum).
- Set the region (top right) to N. Virginia (us-east-1).
- Click on Services, then EMR.
- Click on the running job, then Steps, view logs, syslog.