1-DAV-202 Data Management 2023/24
Previously 2-INF-185 Data Source Integration

· Grades from marked homeworks are on the server in file /grades/userid.txt
· Dates of project submission and oral exams:
Early: submit project May 24 9:00am, oral exams May 27 1:00pm (limit 5 students).
Otherwise submit project June 11, 9:00am, oral exams June 18 and 21 (estimated 9:00am-1:00pm, schedule will be published before exam).
Sign up for one of the exam days in AIS before June 11.
Remedial exams will take place in the last week of the exam period. Beware, there will not be much time to prepare a better project. Projects should be submitted as homeworks to /submit/project.
· Cloud homework is due on May 20 9:00am.


Lcloud

Revision as of 14:34, 15 April 2021

Today we will work with Amazon Web Services (AWS), which is a cloud computing platform. It allows highly parallel computation on large datasets. We will use an educational account which gives you a certain amount of resources for free.


Credentials

  • First, create a .aws/credentials file in your home folder with valid AWS credentials.
  • Also run `aws configure`. Press Enter to skip the access key ID and secret access key, type `us-east-1` for the region, and press Enter again for the output format.
  • Please use the credentials which were sent to you via email and follow the steps here (there is a cursor on each screen):

https://docs.google.com/presentation/d/1GBDErp5xhrV2zLF5kKdwnOAjtmDEFN0pw3RFval419s/edit#slide=id.p

  • Sometimes these credentials expire. In that case repeat the same steps to get new ones.
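
If a later command fails, it can help to check the credentials file first. The following is a minimal sketch, not part of the course materials: it assumes the file uses the standard AWS INI layout with a `[default]` profile, and the helper names `read_aws_profile` and `missing_keys` are made up for this example.

```python
import configparser
import os


def read_aws_profile(path, profile="default"):
    """Return the key/value pairs of one profile from an AWS credentials file.

    Returns None when the file or the profile is missing.
    """
    # interpolation=None keeps '%' characters in secrets from being interpreted
    parser = configparser.ConfigParser(interpolation=None)
    # parser.read() silently skips files that do not exist
    if not parser.read(path) or not parser.has_section(profile):
        return None
    return dict(parser[profile])


def missing_keys(profile_values):
    """List the required credential keys that are absent or empty."""
    required = ("aws_access_key_id", "aws_secret_access_key")
    return [key for key in required if not profile_values.get(key)]


if __name__ == "__main__":
    values = read_aws_profile(os.path.expanduser("~/.aws/credentials"))
    if values is None:
        print("no [default] profile found in ~/.aws/credentials")
    else:
        print("missing keys:", missing_keys(values) or "none")
```

The check only looks at the file's structure. It cannot tell whether the credentials have expired; for that you still need to run an actual aws command.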

AWS command line

  • We will access AWS using the aws command installed on our server.
  • You can also install it on your own machine using pip install awscli

Input files and data storage

Today we will use Amazon S3 cloud storage to store input files. Run the following commands to check that you can see the "bucket" (data storage) associated with this lecture:

# the following command should give you a big list of files
aws s3 ls s3://idzbucket2

# this command downloads one file from the bucket
aws s3 cp s3://idzbucket2/splitaa splitaa

# the following command prints the file in your console 
# (no need to do this).
aws s3 cp s3://idzbucket2/splitaa -

You should also create your own bucket (storage area). Pick your own name; it must be globally unique:

aws s3 mb s3://mysuperawesomebucket
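
Bucket creation also fails if the name breaks the S3 naming rules. The sketch below checks a simplified subset of those rules (3 to 63 characters; only lowercase letters, digits, dots and hyphens; a letter or digit at both ends; no adjacent dots). The function name `looks_like_valid_bucket_name` is made up for this example, and the check cannot tell whether the name is already taken by someone else.

```python
import re

# Simplified subset of the S3 naming rules: 3-63 characters, only lowercase
# letters, digits, dots and hyphens, starting and ending with a letter or digit.
_BUCKET_NAME = re.compile(r"[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]")


def looks_like_valid_bucket_name(name):
    """Return True if name passes the basic S3 bucket naming rules."""
    if ".." in name:  # adjacent dots are not allowed
        return False
    return _BUCKET_NAME.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ["mysuperawesomebucket", "MyBucket", "ab"]:
        print(candidate, looks_like_valid_bucket_name(candidate))
```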

MapReduce

We will be using MapReduce in this session. It is a somewhat outdated concept, but it is simple enough for us and runs out of the box on AWS. If you ever want to work with big data in practice, try something more modern like Apache Beam. And avoid PySpark if you can.

For a tutorial on MapReduce, check out PythonHosted.org or TutorialsPoint.com.
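
To make the idea concrete, here is a small single-process sketch of the three MapReduce phases (map, shuffle, reduce) on a word-count problem. It is plain Python for illustration only and is not the course template; the names `mapper`, `reducer` and `run_mapreduce` are chosen for this example.

```python
from collections import defaultdict


def mapper(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.split():
        yield word.lower(), 1


def reducer(word, counts):
    """Reduce step: combine all values emitted for one key."""
    return word, sum(counts)


def run_mapreduce(lines):
    """Run map, shuffle and reduce in a single process."""
    # Shuffle: group the mapper output by key
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    # Reduce each group independently; on a cluster these calls run in parallel
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


if __name__ == "__main__":
    print(run_mapreduce(["a rose is a rose", "is it"]))
```

The mapper sees one line at a time and the reducer sees one key at a time. Because neither needs any other state, the framework can spread both steps over many machines.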

Template

You are given a basic template with comments in /tasks/cloud/example_job.py

You can run it locally as follows:

python3 example_job.py <input file> -o <output_dir>

You can run it in the cloud on the whole dataset as follows:

python3 example_job.py -r emr --region us-east-1 s3://idzbucket2 \
  --num-core-instances 4 -o s3://<your bucket>/<some directory>

For testing we recommend using a smaller sample as follows:

python3 example_job.py -r emr --region us-east-1 s3://idzbucket2/splita* \
  --num-core-instances 4 -o s3://<your bucket>/<some directory>

Other useful commands

You can download output as follows:

# list of files
aws s3 ls s3://<your bucket>/<some directory>/
# download
aws s3 cp s3://<your bucket>/<some directory>/ . --recursive
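
Once the output is downloaded, the individual files can be merged in Python. The sketch below rests on two assumptions that you should check against your own output: that the result files are named part-*, and that each line holds a JSON-encoded key, a tab, and a JSON-encoded value. The function name `read_job_output` is made up for this example.

```python
import glob
import json
import os


def read_job_output(directory):
    """Yield (key, value) pairs from all part-* files in a directory."""
    for path in sorted(glob.glob(os.path.join(directory, "part-*"))):
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                line = line.rstrip("\n")
                if not line:
                    continue
                # Assumed line format: JSON key, a tab, JSON value
                key, value = line.split("\t", 1)
                yield json.loads(key), json.loads(value)


if __name__ == "__main__":
    # Reads the part-* files downloaded into the current directory
    for key, value in read_job_output("."):
        print(key, value)
```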

If you want to watch progress:

  • Click on the AWS Console button in the workbench (Vocareum).
  • Set the region (top right) to N. Virginia (us-east-1).
  • Click on Services, then EMR.
  • Click on the running job, then Steps, View logs, syslog.