1-DAV-202 Data Management 2023/24
Previously 2-INF-185 Data Source Integration

· Grades from marked homeworks are on the server in file /grades/userid.txt
· Dates of project submission and oral exams:
Early: submit project May 24 9:00am, oral exams May 27 1:00pm (limit 5 students).
Otherwise submit project June 11, 9:00am, oral exams June 18 and 21 (estimated 9:00am-1:00pm, schedule will be published before exam).
Sign up for one of the exam days in AIS before June 11.
Remedial exams will take place in the last week of the exam period. Beware, there will not be much time to prepare a better project. Projects should be submitted as homeworks to /submit/project.
· Cloud homework is due on May 20 9:00am.


Lcloud

From MAD

Revision as of 20:24, 2 May 2022

Today we will work with Google Cloud Platform (GCP), a cloud computing platform. GCP offers many services (virtual machines, Kubernetes, storage, databases, ...). We are mainly interested in Dataflow and Storage. Dataflow allows highly parallel computation on large datasets. We will use an educational account which gives you a certain amount of resources for free.

Basic setup

You should have received instructions on how to create a GCloud account via MS Teams. You should be able to log in to the Google Cloud console. (TODO picture).

Now:

  • Log in to some Linux machine (ideally vyuka)
  • If the machine is not vyuka, install the gcloud command-line package (I recommend via snap: https://cloud.google.com/sdk/docs/downloads-snap).
  • Run gcloud init --console-only
  • Follow the instructions (copy the link to a browser, log in, and then copy the code back to the console).

Input files and data storage

Today we will use Gcloud storage to store input files and outputs. Think of it as a limited external disk (more like Google Drive than Dropbox): you can upload and download whole files, but there is no random access to the middle of a file.

Run the following two commands to check if you can see the "bucket" (data storage) associated with this lecture:

# the following command should give you a big list of files
gsutil ls gs://mad-2022-public/

# this command downloads one file from the bucket
gsutil cp gs://mad-2022-public/splitaa splitaa

# the following command prints the file in your console 
# (no need to do this).
gsutil cat gs://mad-2022-public/splitaa

You should also create your own bucket (storage area). Pick your own name; it must be globally unique:

gsutil mb gs://mysuperawesomebucket

MapReduce

We will be using MapReduce in this session. It is a somewhat outdated concept, but it is simple enough for us and runs out of the box on AWS. If you ever want to work with big data in practice, try something more modern like Apache Beam. And avoid PySpark if you can.

For a tutorial on MapReduce, check out PythonHosted.org or TutorialsPoint.com.
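To see what the framework does for you, the map/shuffle/reduce phases can be sketched in plain Python. This is a toy word count, not the course template; all function names here are illustrative, and a real MapReduce runtime would distribute the shuffle across machines:

```python
from collections import defaultdict

def mapper(line):
    # map phase: emit a (key, value) pair for each word in the line
    for word in line.split():
        yield word, 1

def reducer(word, counts):
    # reduce phase: combine all values collected for one key
    yield word, sum(counts)

def run_job(lines):
    # shuffle phase: group mapper output by key
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    # run the reducer once per key group
    result = {}
    for key, values in groups.items():
        for out_key, out_value in reducer(key, values):
            result[out_key] = out_value
    return result

print(run_job(["a b a", "b c"]))  # {'a': 2, 'b': 2, 'c': 1}
```

In mrjob you only write the mapper and reducer; the library supplies the shuffle and runs the same code locally or on a cluster.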

Template

If you want to use your own machine, please install packages with pip install mrjob boto3

You are given a basic template with comments in /tasks/cloud/example_job.py

You can run it locally as follows:

python3 example_job.py <input file> -o <output_dir>

You can run it in the cloud on the whole dataset as follows:

python3 example_job.py -r emr --region us-east-1 s3://idzbucket2 \
  --num-core-instances 4 -o s3://<your bucket>/<some directory>

For testing we recommend using a smaller sample as follows:

python3 example_job.py -r emr --region us-east-1 s3://idzbucket2/splita* \
  --num-core-instances 4 -o  s3://<your bucket>/<some directory>

Other useful commands

You can download the output as follows:

# list of files
aws s3 ls s3://<your bucket>/<some directory>/
# download
aws s3 cp s3://<your bucket>/<some directory>/ . --recursive

If you want to watch progress:

  • Click on the AWS Console button in your workbench (Vocareum).
  • Set the region (top right) to N. Virginia (us-east-1).
  • Click on Services, then EMR.
  • Click on the running job, then Steps, view logs, syslog.