Project 2 Part 2
Task Solutions
Kiosk
- Job 1
- mapper
- /bin/cat
- reducer
- kiosk-reducer-1.py
- Job 2
- mapper
- kiosk-mapper-2.py
- reducer
- kiosk-reducer-2.py
Fee History
- Job 1
- mapper
- /bin/cat
- reducer
- fee-history-reducer-1.py
- Job 2
- mapper
- /bin/cat
- reducer
- fee-history-reducer-2.py
Checkout
Assumes that the ticket is given as an argument to the script.
- mapper
- checkout-mapper.py
- reducer
- /bin/cat
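Because the ticket is fixed when the script starts, the mapper can do all the filtering itself and the reducer can be a plain pass-through such as /bin/cat. The following is a minimal sketch of that idea, not the contents of checkout-mapper.py: the tab-separated record layout, the function names, and the sample tickets are all assumptions made for illustration.

```python
import io


def checkout_mapper(lines, ticket):
    """Yield only the 'ticket<TAB>rest' lines that belong to the given ticket."""
    for line in lines:
        key = line.split("\t", 1)[0].strip()
        if key == ticket:
            yield line


def run(argv, infile, outfile):
    """Entry point shaped like a script: argv[1] is the ticket to look up."""
    if len(argv) != 2:
        raise SystemExit("usage: checkout-sketch.py TICKET")
    outfile.writelines(checkout_mapper(infile, argv[1]))


# Hypothetical records. A real script would instead call
# run(sys.argv, sys.stdin, sys.stdout).
records = io.StringIO("T100\tlot=A\nT200\tlot=B\nT100\tfee=4.50\n")
out = io.StringIO()
run(["checkout-sketch.py", "T100"], records, out)
```

Here `out` ends up holding only the two T100 lines, which an identity reducer would then pass through unchanged.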
Out-of-State Earnings
- mapper
- oose-mapper.py
- combiner
- oose-reducer.py
- reducer
- oose-reducer.py
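Using the same script as combiner and reducer only works when the reduction can be applied in stages: the script must emit records in the same format it reads, and the operation must give the same answer whether it sees raw values or partial results. The sketch below illustrates this with a per-key sum. It assumes, without confirmation from oose-reducer.py itself, that the task sums earnings per key; the record format and sample data are hypothetical.

```python
from itertools import groupby


def sum_by_key(lines):
    """Sum the numeric value for each key in sorted 'key<TAB>value' lines.

    Output lines use the same 'key<TAB>value' format as the input, which is
    what lets one script serve as both combiner and reducer.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    out = []
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(float(value) for _, value in group)
        out.append(f"{key}\t{total:.2f}")
    return out


# Hypothetical mapper output on two nodes (state code, earnings).
node_a = ["AZ\t3.00", "AZ\t2.50", "NV\t4.00"]
node_b = ["AZ\t1.50", "NV\t1.00"]

# Combiner pass on each node, then a final reducer pass over the merged,
# re-sorted partial sums.
combined = sorted(sum_by_key(node_a) + sum_by_key(node_b))
with_combiner = sum_by_key(combined)

# Reducer pass alone over all of the mapper output.
without_combiner = sum_by_key(sorted(node_a + node_b))
```

Both paths produce the same totals. A reducer that computed, say, an average per key could not be reused as a combiner this way.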
Utilities
MapReduce runner
- File
- mr.py
- Invocation
- ./mr.py module-name
- Description
- The mr.py script loads a module and invokes either its mapper or reducer function. The module is a simple script (such as those linked above) that defines either mapper(key, value) or reducer(key, values), but not both. The purpose of mr.py is to simplify writing mapper and reducer modules: it takes over the drudgery of instantiating a MapReduce.Mapper or MapReduce.Reducer class and calling its process method, so you do not have to repeat that in each new module you write.
MapReduce module
- File
- MapReduce.py
- Description
- The MapReduce module contains two classes, Mapper and Reducer. You extend these classes and override their dispatch method to implement your own mappers and reducers. The mr.py script does this extension for you.
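To make the process/dispatch split concrete, here is a hypothetical sketch of what such classes might look like. Only the names Mapper, Reducer, process, and dispatch come from the description above; the tab-separated line format, the method signatures, and the make_reducer helper are assumptions, and the real MapReduce.py may differ.

```python
import io
from itertools import groupby


class Mapper:
    """Hypothetical base class: process() drives I/O, dispatch() does the work."""

    def dispatch(self, key, value):
        raise NotImplementedError

    def process(self, infile, outfile):
        for line in infile:
            key, _, value = line.rstrip("\n").partition("\t")
            for out_key, out_value in self.dispatch(key, value):
                outfile.write(f"{out_key}\t{out_value}\n")


class Reducer(Mapper):
    """Same contract, but dispatch() receives every value for one key."""

    def process(self, infile, outfile):
        # partition() returns (key, sep, value); [::2] keeps key and value.
        pairs = (line.rstrip("\n").partition("\t")[::2] for line in infile)
        for key, group in groupby(pairs, key=lambda kv: kv[0]):
            values = [value for _, value in group]
            for out_key, out_value in self.dispatch(key, values):
                outfile.write(f"{out_key}\t{out_value}\n")


def make_reducer(fn):
    """Wrap a plain reducer(key, values) function in a Reducer subclass."""
    class FunctionReducer(Reducer):
        def dispatch(self, key, values):
            return fn(key, values)
    return FunctionReducer()


def count_values(key, values):
    yield key, len(values)


out = io.StringIO()
make_reducer(count_values).process(io.StringIO("a\t1\na\t2\nb\t3\n"), out)
```

In a streaming job the two file arguments would be sys.stdin and sys.stdout. Note that the Reducer relies on its input arriving sorted by key, which Hadoop guarantees between the map and reduce phases.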
utils module
- File
- utils.py
- Description
- Contains useful utilities such as fee calculation.
Automation Tips
With Bash
Unless you are skilled with Make, I recommend automating your projects with simple bash scripts like this one:
#!/usr/bin/env bash
# Run the two kiosk jobs with Hadoop Streaming.
SRCDIR=$(pwd)
HADOOP_VERSION=0.15.3
HADOOP_HOME=${HOME}/src/hadoop-${HADOOP_VERSION}
######################################################################
cd ${HADOOP_HOME} || exit 1
# Clear output left over from a previous run.
bin/hadoop dfs -rmr intermediate
bin/hadoop dfs -rmr kiosk
bin/hadoop dfs -mkdir intermediate
# Job 1: each input path gets its own -input option.
bin/hadoop jar contrib/hadoop-${HADOOP_VERSION}-streaming.jar \
    -mapper ${SRCDIR}/mapper1.py                              \
    -reducer ${SRCDIR}/reducer1.py                            \
    -input inputs/intake                                      \
    -input inputs/discharge                                   \
    -output intermediate/kiosk
# Job 2
bin/hadoop jar contrib/hadoop-${HADOOP_VERSION}-streaming.jar \
    -mapper ${SRCDIR}/mapper2.py                              \
    -reducer ${SRCDIR}/reducer2.py                            \
    -input inputs/parking                                     \
    -output kiosk
By the way, you can align your backslashes like this in Emacs: select the code that contains the backslashes to align and enter the command M-x align-regexp \\
With Make
If you understand Make then you may want to
peruse this Makefile and
the last-ticket.sh helper script
that I made for testing my project. Here are a few of the useful
targets that it defines:
- make stop
- Stop Hadoop.
- make start
- Start Hadoop.
- make reset
- Stops Hadoop, deletes all the temp directories in the cluster, and reformats the DFS. Useful if you change the configuration or have trouble with one of the nodes in your cluster.
- make test_python PRINT=yes
- Runs all tests on the command line and prints the results. Omit PRINT=yes (or say PRINT=no) to suppress printing. You probably only want to print if you are using a small dataset.
- make test_hadoop PRINT=yes
- Runs all tests in Hadoop and prints the results. Again, you probably want to omit PRINT=yes if you are testing with a big dataset.
- make docs
- Colorizes the Python code and copies the Makefile into doc/code (this is the colorized code you see when you click the links above).
Note that to run the Hadoop tests with this Makefile you have to build the targets in a certain order (for example, make dfs/inputs before make dfs/kiosk before make dfs/checkout). Also note that the Makefile expects an inputs symlink in the main code directory that points to a directory containing the intake, discharge, and parking files, something like ln -s /var/local/cs147a-spr08/inputs-big inputs.
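For quick checks on a small dataset without starting Hadoop, a single job can also be emulated in-process. The sketch below is one common way to do that (map, then sort by key, then reduce); it is not the recipe the Makefile uses for test_python, and the function names and sample records are hypothetical.

```python
from itertools import groupby


def run_local(mapper, reducer, records):
    """Emulate map -> sort -> reduce for one job, entirely in memory."""
    mapped = [pair for key, value in records for pair in mapper(key, value)]
    mapped.sort(key=lambda kv: kv[0])  # stands in for Hadoop's sort by key
    output = []
    for key, group in groupby(mapped, key=lambda kv: kv[0]):
        output.extend(reducer(key, [value for _, value in group]))
    return output


def identity_mapper(key, value):
    # Plays the role of /bin/cat as a mapper.
    yield key, value


def count_reducer(key, values):
    yield key, len(values)


# Hypothetical (ticket, event) records.
records = [("T1", "in"), ("T2", "in"), ("T1", "out")]
result = run_local(identity_mapper, count_reducer, records)
```

Because run_local returns a list of (key, value) pairs, the output of one job can be fed straight in as the records of the next, mirroring a two-job pipeline.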