This tutorial mirrors the Pythonic multifetch example, but accomplishes the same task using the Hadoop Java API.
Same as for the Pythonic example.
Again, same as the Pythonic example, except in Java.
View the source code for MultiFetch.java.
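The linked file is the authoritative source; as a rough sketch only (not the tutorial's actual code), the per-URL work a MultiFetch-style mapper performs — fetch a page, extract its title, emit a (url, title) pair — reduces to logic like the following, with the Hadoop plumbing and the HTTP fetch omitted:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the core logic behind a MultiFetch-style mapper:
// given a page's HTML, pull out the contents of its <title> tag. The real
// MultiFetch.java (linked above) wraps logic like this in a Hadoop mapper
// that also fetches each URL over HTTP.
public class TitleExtractor {
    // Case-insensitive match for <title>...</title>; DOTALL lets the
    // title span line breaks.
    private static final Pattern TITLE =
        Pattern.compile("<title[^>]*>(.*?)</title>",
                        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    public static String extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1).trim() : "";
    }

    public static void main(String[] args) {
        String page = "<html><head><title>Example Domain</title></head></html>";
        // In the real job, this pair would be emitted through the
        // mapper's output collector rather than printed.
        System.out.println("url\t" + extractTitle(page));
    }
}
```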
cat task_${TASKID}_m_*/stderr

where ${TASKID} is the ID of the task, taken from the MapReduce console output or the web interface.
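These stderr files capture anything a task writes to standard error, so plain System.err.println calls in the mapper (a common debugging technique, not necessarily something the tutorial's code does) show up there. A minimal sketch:

```java
public class StderrDebug {
    // Format a debug line for a fetched URL; a fixed prefix makes the
    // lines easy to grep out of the task's stderr file. The method and
    // its format are illustrative, not part of MultiFetch.java.
    public static String debugLine(String url, int bytes) {
        return "DEBUG fetched " + url + " (" + bytes + " bytes)";
    }

    public static void main(String[] args) {
        // Inside a mapper, a line like this ends up in task_*_m_*/stderr.
        System.err.println(debugLine("http://example.com/", 1270));
    }
}
```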
mkdir multifetch_classes
javac -classpath ${HADOOP_HOME}/hadoop-${HADOOP_VERSION}-core.jar \
    -d multifetch_classes MultiFetch.java
jar -cvf $HOME/proj/hadoop/MultiFetch.jar -C multifetch_classes/ .
Load the input URLs into the DFS in the same way as described in the Pythonic example.
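Assuming, as in the Pythonic example, that the input is plain text files containing one URL per line (an assumption; check that example for the exact format), the local files to upload into the DFS could be generated like this before running the usual dfs -put:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

public class MakeUrlInputs {
    // Write one URL per line into a local file, ready to be copied into
    // the DFS afterwards (e.g. with: bin/hadoop dfs -put urls urls).
    // The file layout here is illustrative, not mandated by the tutorial.
    public static Path writeUrls(Path file, List<String> urls) throws IOException {
        Files.createDirectories(file.getParent());
        return Files.write(file, urls);
    }

    public static void main(String[] args) throws IOException {
        List<String> urls = Arrays.asList(
            "http://example.com/",   // sample URLs; substitute your own
            "http://example.org/");
        Path out = writeUrls(Paths.get("urls", "urls_1.txt"), urls);
        System.out.println("wrote " + out);
    }
}
```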
bin/hadoop jar $HOME/proj/hadoop/MultiFetch.jar \
    edu.brandeis.cs147a.examples.MultiFetch \
    urls/* \
    titles
Set up a real Hadoop cluster, or go back to the Python version of the example.