2-INF-185 Integration of Data Sources 2016/17

Submit HW10 and HW11 by Tuesday 30.5. 9:00.
Project submission deadlines:
1st deadline: Sunday 11.6. 22:00
2nd deadline: Wednesday 21.6. 22:00
Both deadlines are regular ones; the first is intended for students finishing their studies or for those who want to complete the course earlier. In both cases, short in-person meetings with the instructors (discussion of the project and closing of the grade) will take place a few days after submission. The exact days and times will be arranged later. Submit projects the same way as homework assignments, to /submit/projekt


L10


HW10

Hadoop is a distributed filesystem and a map/reduce framework. It allows large data sets to be stored across multiple computers and processed efficiently using the map/reduce paradigm.

Hadoop setup

In order to use hadoop on the server, you must source a script that sets up the environment variables needed to run hadoop:

# set up hadoop environment
source ~yoyo/hadoop/hadoop-env.sh

After that you should be able to run the hadoop command.
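As a quick sanity check that the environment is set up, you can run any hadoop subcommand, for example:

# print the installed hadoop version (sanity check)
hadoop version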

Distributed filesystem

Before any data can be processed, it must be stored in Hadoop's distributed filesystem (HDFS). Hadoop exposes the usual filesystem manipulation commands:

# hadoop filesystem with usual commands
hadoop fs -ls
hadoop fs -ls /
hadoop fs -mkdir  input
hadoop fs -cp /input/example/pg100.txt input/
hadoop fs -ls input

Note that paths beginning with / are absolute; paths without a leading slash are relative to your home directory, which is /user/USERNAME.
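For example, assuming your login is USERNAME (a placeholder here), the following two commands list the same directory:

# relative path, resolved against /user/USERNAME
hadoop fs -ls input
# the same directory addressed by an absolute path
hadoop fs -ls /user/USERNAME/input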

To transfer files between the server's filesystem and HDFS you can use the put and get commands (you can also use the cat command to print the contents of short text files).

hadoop fs -put someLocalFile.txt input/
hadoop fs -get output/part-00000 .
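A short sketch of using cat to peek at a text file stored in HDFS (the file name is just the example copied above):

# print the beginning of a text file stored in HDFS
hadoop fs -cat input/pg100.txt | head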

Map Reduce

Map/reduce is a programming model and framework for processing large amounts of (distributed) data in a distributed manner.

The model is built around the map and reduce functions. Input data (files) are first processed by the map function into key-value pairs; these are redistributed so that all pairs with the same key end up together, and then processed by the reduce function to produce the final result.

Multiple "mappers" and "reducers" can run on different machines (in particular, mappers can run directly on the nodes where the data resides, so no copying of the data is needed).

Simple word count example

  • Given a lot of large files distributed over several servers, produce word statistics: for each word, how many times it occurs in the files.
  • map: for each input file, count the occurrences of each word in this file
  • the result is a list of tab-separated pairs: word number
  • hadoop will group together map results from all partial files and will put equal keys together
  • reduce will sum the adjacent values for each key

Map part

  • read the file word by word
  • for each word from the input, print the word, a tab and 1 (hadoop will group equal words together anyway)
echo "abc abc def abc " | ~yoyo/example/map.awk

Redistribute / shuffle

Hadoop will collect equal keys together, i.e. the result looks something like this:

echo "abc abc def abc " | ~yoyo/example/map.awk  | sort

Reduce part

The reducer then does something similar to uniq -c, i.e. it counts adjacent equal words. Here it is implemented in bash:

echo "abc abc def abc " | ~yoyo/example/map.awk  | sort | ~yoyo/example/reduce.sh


Running in hadoop

Normally, Hadoop map/reduce transformations are written in Java. However, the default installation of hadoop contains a map/reduce class (hadoop streaming) that can execute an arbitrary program, writing the input to it and reading the output from it.

One technical problem with this solution is that the executed programs need to be present on all nodes where the mappers / reducers will run. They can either be installed on every node or stored in the hadoop filesystem. The streaming class has an option (-file) to automatically upload any program / data files along with the job.


# the streaming library
JAR=${HADOOP_PREFIX}/share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar
# extra options for the streaming job (none needed here)
BINKS=""

# use files in the /input/example directory for input
# place output in the out directory
# use map.awk and reduce.sh as mapper / reducer; upload them to the hadoop fs along with the job

hadoop jar $JAR $BINKS -mapper map.awk -reducer reduce.sh  -input /input/example/ -output out  -file map.awk -file reduce.sh
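After the job finishes, the results are in the out directory on HDFS. A sketch of how to inspect them (part-00000 is the usual name of the first output file, but the exact names depend on the number of reducers):

# list the job output files
hadoop fs -ls out
# print the beginning of the first output file
hadoop fs -cat out/part-00000 | head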