Project 2 Part 2

Task Solutions

Kiosk

  1. mapper
    /bin/cat
    reducer
    kiosk-reducer-1.py
  2. mapper
    kiosk-mapper-2.py
    reducer
    kiosk-reducer-2.py

Fee History

  1. mapper
    /bin/cat
    reducer
    fee-history-reducer-1.py
  2. mapper
    /bin/cat
    reducer
    fee-history-reducer-2.py

Checkout

Assumes that the ticket is given as an argument to the script.

mapper
checkout-mapper.py
reducer
/bin/cat
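
As a rough illustration of passing the ticket on the command line, here is a minimal sketch of a streaming mapper that keeps only the records for one ticket. The tab-separated record layout, the names filter_ticket and main, and the sample records are assumptions made for illustration; they are not taken from checkout-mapper.py. The sketch is written for Python 3.

```python
import sys


def filter_ticket(lines, ticket):
    """Yield (key, value) for tab-separated records whose key equals ticket."""
    for line in lines:
        key, _, value = line.rstrip("\n").partition("\t")
        if key == ticket:
            yield key, value


def main(argv, stdin=sys.stdin, stdout=sys.stdout):
    """Entry point a script could call as main(sys.argv)."""
    if len(argv) != 2:
        sys.exit("usage: mapper.py TICKET")
    for key, value in filter_ticket(stdin, argv[1]):
        stdout.write(f"{key}\t{value}\n")


# Hypothetical records, for illustration only.
sample = ["T1\tintake 08:00\n", "T2\tintake 09:15\n", "T1\tdischarge 10:30\n"]
matches = list(filter_ticket(sample, "T1"))
```

Because the mapper does all the filtering, /bin/cat is enough as the reducer. With Hadoop streaming, the argument would typically be supplied by quoting the whole mapper command, for example -mapper "checkout-mapper.py TICKET".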

Out-of-State Earnings

mapper
oose-mapper.py
combiner
oose-reducer.py
reducer
oose-reducer.py
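
Reusing the reducer script as the combiner only gives correct results when the reduction is associative and commutative, as a sum is. The sketch below checks that property for a summing reducer. That oose-reducer.py sums fees per key is an assumption, and the state codes and amounts are made up for illustration.

```python
from collections import defaultdict


def sum_reducer(pairs):
    """Sum the values for each key and return sorted (key, total) pairs."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())


# Hypothetical (state, fee in cents) records split across two map tasks.
split_a = [("NV", 500), ("OR", 250), ("NV", 100)]
split_b = [("OR", 750), ("NV", 400)]

# Without a combiner: the reducer sees every record.
direct = sum_reducer(split_a + split_b)

# With the reducer as combiner: each split is pre-summed, then reduced again.
combined = sum_reducer(sum_reducer(split_a) + sum_reducer(split_b))
```

Both paths produce the same totals, which is what makes it safe to run the same script in both roles. A reducer that computed, say, an average would not have this property.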

Utilities

MapReduce runner

File
mr.py
Invocation
./mr.py module-name
Description
The mr.py script loads a module and invokes either its mapper or reducer function. The module is a simple script (such as those linked above) which defines either mapper(key, value) or reducer(key, values), but not both. The purpose of mr.py is to simplify writing the mapper and reducer modules by abstracting out the drudgery of instantiating a MapReduce.Mapper or MapReduce.Reducer class and calling its process method in each new module you write.
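
The dispatch described above might look roughly like the following sketch. It is not the actual mr.py: the function names pick_role and load_role are hypothetical, and the sketch targets Python 3 and only shows how a runner could decide which role a module plays.

```python
import importlib
import types


def pick_role(module):
    """Return ("mapper", fn) or ("reducer", fn) for a module defining exactly one."""
    has_mapper = callable(getattr(module, "mapper", None))
    has_reducer = callable(getattr(module, "reducer", None))
    if has_mapper == has_reducer:
        raise ValueError("module must define mapper or reducer, but not both")
    role = "mapper" if has_mapper else "reducer"
    return role, getattr(module, role)


def load_role(module_name):
    """Import a module by name (as in ./mr.py module-name) and pick its role."""
    return pick_role(importlib.import_module(module_name))


# Hypothetical in-memory module standing in for a real mapper script.
demo = types.ModuleType("demo_module")
demo.mapper = lambda key, value: [(key, value.upper())]
role, function = pick_role(demo)
```

A runner built this way would then wrap the chosen function in the matching MapReduce class and call its process method.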

MapReduce module

File
MapReduce.py
Description
The MapReduce module contains two classes, Mapper and Reducer. You extend these classes and override their dispatch method to implement your own mappers and reducers. This extension is done in the mr.py script.
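
To make the process/dispatch split concrete, here is a minimal sketch of what such classes could look like. It is not the contents of MapReduce.py: it assumes tab-separated key/value lines on standard input (the usual Hadoop streaming layout), assumes that dispatch yields output pairs, and uses a hypothetical CountReducer subclass as the example. It is written for Python 3.

```python
import io
import sys
from itertools import groupby


def parse(line):
    """Split one streaming line into (key, value) at the first tab."""
    key, _, value = line.rstrip("\n").partition("\t")
    return key, value


class Mapper:
    def dispatch(self, key, value):
        raise NotImplementedError

    def process(self, stdin=sys.stdin, stdout=sys.stdout):
        for line in stdin:
            for out_key, out_value in self.dispatch(*parse(line)):
                stdout.write(f"{out_key}\t{out_value}\n")


class Reducer:
    def dispatch(self, key, values):
        raise NotImplementedError

    def process(self, stdin=sys.stdin, stdout=sys.stdout):
        # Streaming delivers reducer input sorted by key, so equal keys are adjacent.
        pairs = (parse(line) for line in stdin)
        for key, group in groupby(pairs, key=lambda kv: kv[0]):
            values = (value for _, value in group)
            for out_key, out_value in self.dispatch(key, values):
                stdout.write(f"{out_key}\t{out_value}\n")


class CountReducer(Reducer):
    """Hypothetical subclass: counts the records seen for each key."""

    def dispatch(self, key, values):
        yield key, sum(1 for _ in values)


def run_demo():
    out = io.StringIO()
    CountReducer().process(io.StringIO("a\t1\na\t2\nb\t3\n"), out)
    return out.getvalue()
```

Keeping the line parsing and key grouping in process means a subclass only has to say what happens to one key at a time.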

utils module

File
utils.py
Description
Contains useful utilities such as fee calculation.

Automation Tips

With Bash

Unless you are skilled with Make, I recommend automating your projects with simple bash scripts like this one:

#!/usr/bin/env bash

SRCDIR=`pwd`
HADOOP_VERSION=0.15.3
HADOOP_HOME=${HOME}/src/hadoop-${HADOOP_VERSION}

######################################################################

cd ${HADOOP_HOME}

bin/hadoop dfs -rmr intermediate
bin/hadoop dfs -rmr kiosk
bin/hadoop dfs -mkdir intermediate

bin/hadoop jar contrib/hadoop-${HADOOP_VERSION}-streaming.jar \
 -mapper  ${SRCDIR}/mapper1.py                                \
 -reducer ${SRCDIR}/reducer1.py                               \
 -input   inputs/intake -input inputs/discharge               \
 -output  intermediate/kiosk

bin/hadoop jar contrib/hadoop-${HADOOP_VERSION}-streaming.jar \
 -mapper  ${SRCDIR}/mapper2.py                                \
 -reducer ${SRCDIR}/reducer2.py                               \
 -input   inputs/parking                                      \
 -output  kiosk

By the way, you can align your backslashes like this in Emacs: select the code that contains the backslashes, run the command M-x align-regexp, and enter \\ as the regular expression.
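
If you would rather drive the jobs from Python than from bash, the same command lines can be assembled programmatically. The sketch below is one possible approach, not part of the project code: streaming_command is a hypothetical helper, and it repeats -input once per path, which is the form the streaming documentation describes for multiple inputs.

```python
def streaming_command(jar, mapper, reducer, inputs, output):
    """Build the argument list for one Hadoop streaming job."""
    command = ["bin/hadoop", "jar", jar, "-mapper", mapper, "-reducer", reducer]
    for path in inputs:
        command += ["-input", path]  # one -input flag per input path
    command += ["-output", output]
    return command


# Same jar and job layout as the bash script above.
jar = "contrib/hadoop-0.15.3-streaming.jar"
stage1 = streaming_command(
    jar, "mapper1.py", "reducer1.py",
    ["inputs/intake", "inputs/discharge"], "intermediate/kiosk",
)
stage2 = streaming_command(
    jar, "mapper2.py", "reducer2.py", ["inputs/parking"], "kiosk",
)
```

Each list can be passed to subprocess.run(command, cwd=hadoop_home, check=True), where hadoop_home is whatever path you use for HADOOP_HOME, so that a failed first stage stops the script before the second stage starts.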

With Make

If you understand Make then you may want to peruse this Makefile and the last-ticket.sh helper script that I made for testing my project. Here are a few of the useful targets that it defines:

make stop
Stop Hadoop.
make start
Start Hadoop.
make reset
Stops Hadoop, deletes all the temp directories in the cluster, and reformats the DFS. Useful if you change configuration or have trouble with one of the nodes in your cluster.
make test_python PRINT=yes
Runs all tests on the command line and prints the results. Omit PRINT=yes (or pass PRINT=no) to suppress the output. You probably only want to print when you are using a small dataset.
make test_hadoop PRINT=yes
Runs all tests in Hadoop and prints results. Again, you probably want to leave off the PRINT=yes statement if you are testing with a big dataset.
make docs
Colorizes the Python code and copies the Makefile into doc/code (this is the colorized code you see when you click the links above).

Note that to run the Hadoop tests with this Makefile you have to build targets in a certain order (for example, make dfs/inputs before make dfs/kiosk before make dfs/checkout). The Makefile also expects an inputs directory containing the intake, discharge, and parking files to be symlinked into the main code directory, with something like ln -s /var/local/cs147a-spr08/inputs-big inputs.
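
Testing on the command line before testing in Hadoop works because a streaming job is just map, sort, reduce. The sketch below simulates that pipeline in memory. It is not what the Makefile does internally; run_local and the word-count functions are hypothetical stand-ins for your own mapper(key, value) and reducer(key, values).

```python
from itertools import groupby


def run_local(mapper, reducer, records):
    """Simulate map, shuffle/sort, and reduce in memory on a small dataset."""
    mapped = [pair for key, value in records for pair in mapper(key, value)]
    mapped.sort(key=lambda kv: kv[0])  # stands in for Hadoop's sort by key
    results = []
    for key, group in groupby(mapped, key=lambda kv: kv[0]):
        results.extend(reducer(key, [value for _, value in group]))
    return results


def word_mapper(key, value):
    """Hypothetical mapper: emit (word, 1) for every word in the line."""
    for word in value.split():
        yield word, 1


def count_reducer(key, values):
    """Hypothetical reducer: total the counts for one word."""
    yield key, sum(values)


output = run_local(word_mapper, count_reducer, [(0, "a b a"), (1, "b c")])
```

A harness like this catches most logic errors in seconds, leaving the slower Hadoop runs for checking configuration and scale.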