This tutorial will help you write your first Hadoop program.
First, we write a Python program that fetches the titles of one or more web pages:
#!/usr/bin/env python

import sys, urllib, re

title_re = re.compile("<title>(.*?)</title>",
                      re.MULTILINE | re.DOTALL | re.IGNORECASE)

for url in sys.argv[1:]:
    match = title_re.search(urllib.urlopen(url).read())
    if match:
        print url, "\t", match.group(1).strip()
chmod u+x multifetch.py

Then execute it like this:
./multifetch.py http://www.cs.brandeis.edu http://www.brandeis.edu

The output should look something like this:
http://www.cs.brandeis.edu 	 Michtom School of Computer Science
http://www.brandeis.edu 	 Brandeis University
This program does not do any error checking, so it will not behave correctly if you give it an invalid URL. Also note that it silently skips a URL if it cannot find a title.
See if you can guess what each line of this short program does, even if you are not familiar with Python. If you are unsure about any part of this program, please ask the TA; we are always happy to talk about code with you!
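As noted above, this first version does no error checking. If you want the script to keep going when a URL is bad or unreachable, one option is to wrap the fetch in a try/except block. Here is a minimal sketch of that idea; it is not the version used in the rest of this tutorial:

#!/usr/bin/env python
# Sketch only: multifetch.py with minimal error handling added.
import sys, urllib, re

title_re = re.compile("<title>(.*?)</title>",
                      re.MULTILINE | re.DOTALL | re.IGNORECASE)

for url in sys.argv[1:]:
    try:
        content = urllib.urlopen(url).read()
    except IOError, e:
        # Report the failure on stderr and move on to the next URL.
        print >> sys.stderr, "could not fetch", url, ":", e
        continue
    match = title_re.search(content)
    if match:
        print url, "\t", match.group(1).strip()
    else:
        # Make the "no title" case visible instead of silently skipping it.
        print >> sys.stderr, "no title found for", url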
Next we expand multifetch.py so that it reads the URLs to fetch from standard input (one per line), which is how Hadoop Streaming will feed input to our mapper.
#!/usr/bin/env python
#
# Adapted from an example by Michael G. Noll at:
#
# http://www.michael-noll.com/wiki/Writing_An_Hadoop_MapReduce_Program_In_Python
#
import sys, urllib, re

title_re = re.compile("<title>(.*?)</title>",
                      re.MULTILINE | re.DOTALL | re.IGNORECASE)

# Read pairs as lines of input from STDIN
for line in sys.stdin:
    # We assume that we are fed a series of URLs, one per line
    url = line.strip()
    # Fetch the content and output the title (pairs are tab-delimited)
    match = title_re.search(urllib.urlopen(url).read())
    if match:
        print url, "\t", match.group(1).strip()
echo "http://www.cs.brandeis.edu" >urls echo "http://www.nytimes.com" >>urls cat urls | ./multifetch.py
Now we must write the reducer. For this example our reducer does not do anything interesting, it just outputs all of the input pairs with no aggregation or transformation. This is a perfectly valid reducer, but for your projects you will probably have to do more.
#!/usr/bin/env python
#
# Adapted from an example by Michael G. Noll at:
#
# http://www.michael-noll.com/wiki/Writing_An_Hadoop_MapReduce_Program_In_Python
#
from operator import itemgetter
import sys

for line in sys.stdin:
    line = line.strip()
    print line
chmod u+x reducer.py
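The identity reducer above is all this example needs, but your own reducers will usually aggregate the values that share a key. As a rough sketch of what that can look like, here is a hypothetical sum_reducer.py (not part of this tutorial) that assumes its mapper emits tab-delimited key/count pairs and that the input has already been sorted by key, as Hadoop guarantees:

#!/usr/bin/env python
# Hypothetical sketch: a reducer that sums an integer count for each key.
# It assumes tab-delimited "key<TAB>count" input lines, already sorted by key.
import sys

current_key = None
current_sum = 0

for line in sys.stdin:
    key, value = line.strip().split("\t", 1)
    if key == current_key:
        current_sum += int(value)
    else:
        # We have moved on to a new key, so emit the total for the old one.
        if current_key is not None:
            print current_key, "\t", current_sum
        current_key = key
        current_sum = int(value)

# Emit the total for the final key.
if current_key is not None:
    print current_key, "\t", current_sum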
The tutorial's reducer doesn't actually do anything noticeable to the output, but in general, if you want to test your Python mapper and reducer on the command line, you can do something like this (generic names are used here rather than the specific names above; the same technique applies to any MapReduce code written in Python):
cat inputs | ./mapper.py | sort | ./reducer.py

Note that sort is a UNIX command-line tool that sorts its input. This is necessary to simulate Hadoop's behavior, since the Hadoop reduce phase is preceded by a sort phase.
Running on the command line is not a substitute for testing with Hadoop, but it can be helpful as you are writing your code and will give you a good sense of what your code is doing without waiting on the overhead of starting a Hadoop task.
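For example, with the scripts from this tutorial and the urls file created earlier, the whole pipeline can be tested like this:

cat urls | ./multifetch.py | sort | ./reducer.py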
First we must create some input data and put it in the HDFS. Make two or more files named urlX where X is a number. Each file should contain exactly one URL. For example, here we create two files:
echo "http://www.cs.brandeis.edu" >url1 echo "http://www.nytimes.com" >url2This creates these two files:
http://www.cs.brandeis.edu
http://www.nytimes.com
Now we must put these files in the HDFS. Remember the command to put files into HDFS? We will also use the mkdir command so that the input files are in their own directory.
bin/hadoop dfs -mkdir urls
bin/hadoop dfs -put url1 urls/
bin/hadoop dfs -put url2 urls/

You may be able to come up with a more efficient way to put many such url files into HDFS. This part of the tutorial has been updated: if you are running on a real cluster, you need to include a full, absolute path as illustrated below, and this path must be in your NFS-mounted home directory.
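As one possible answer to the "more efficient way" question above, you can script the creation and upload of the url files. This is just a sketch, assuming a POSIX shell and the two URLs used earlier:

# Sketch: generate and upload several urlN files in a loop.
i=1
for u in http://www.cs.brandeis.edu http://www.nytimes.com; do
    echo "$u" > url$i
    bin/hadoop dfs -put url$i urls/
    i=$((i+1))
done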
At long last, we can run the command:
bin/hadoop jar contrib/hadoop-0.15.2-streaming.jar \
    -mapper $HOME/proj/hadoop/multifetch.py \
    -reducer $HOME/proj/hadoop/reducer.py \
    -input urls/* \
    -output titles

This should output something like this:
null=@@@userJobConfProps_.get(stream.shipped.hadoopstreaming
packageJobJar: [/tmp/hadoop-ross/hadoop-unjar5997/] [] /tmp/streamjob5998.jar tmpDir=null
08/01/11 10:34:43 INFO mapred.FileInputFormat: Total input paths to process : 5
08/01/11 10:34:44 INFO streaming.StreamJob: getLocalDirs(): [/tmp/hadoop-ross/mapred/local]
08/01/11 10:34:44 INFO streaming.StreamJob: Running job: job_200801111003_0003
08/01/11 10:34:44 INFO streaming.StreamJob: To kill this job, run:
08/01/11 10:34:44 INFO streaming.StreamJob: /home/ross/cs147a/hadoop/hadoop-0.15.2/bin/../bin/hadoop job -Dmapred.job.tracker=localhost:9001 -kill job_200801111003_0003
08/01/11 10:34:44 INFO streaming.StreamJob: Tracking URL: http://localhost:50030/jobdetails.jsp?jobid=job_200801111003_0003
08/01/11 10:34:45 INFO streaming.StreamJob:  map 0%  reduce 0%
08/01/11 10:34:52 INFO streaming.StreamJob:  map 40%  reduce 0%
08/01/11 10:34:53 INFO streaming.StreamJob:  map 80%  reduce 0%
08/01/11 10:34:54 INFO streaming.StreamJob:  map 100%  reduce 0%
08/01/11 10:35:03 INFO streaming.StreamJob:  map 100%  reduce 100%
08/01/11 10:35:03 INFO streaming.StreamJob: Job complete: job_200801111003_0003
08/01/11 10:35:03 INFO streaming.StreamJob: Output: titles

And you should be able to view the results like this:
bin/hadoop dfs -cat titles/part-00000
http://www.cs.brandeis.edu 	 Michtom School of Computer Science
http://www.nytimes.com 	 The New York Times - Blah blah ...
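If you would rather inspect the whole output directory on your local filesystem, you can also copy it out of HDFS (the local directory name titles-local here is just an example):

bin/hadoop dfs -get titles titles-local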
You can look at the job statistics by following these two URLs: the Map/Reduce administration page at http://localhost:50030/ and the NameNode status page at http://localhost:50070/.
You can also use the Map/Reduce administration link to view the status of a job as it runs (but you must refresh the page manually; it does not refresh automatically).
These links have the port numbers from the Quickstart. If you are on a Berry patch machine you should have changed them to your assigned port numbers. The Map/Reduce administration link should have the port you assigned to the mapred.job.tracker.info.port property in conf/hadoop-site.xml. The NameNode status link should have the port you assigned to the dfs.info.port property. If you are on your personal computer then the links should work as-is.
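For reference, the relevant entries in conf/hadoop-site.xml might look something like the sketch below; the port values shown are just the Quickstart defaults, so substitute the ports assigned to you:

<!-- Sketch of the two web UI port properties in conf/hadoop-site.xml. -->
<property>
  <name>mapred.job.tracker.info.port</name>
  <value>50030</value>  <!-- Map/Reduce administration page -->
</property>
<property>
  <name>dfs.info.port</name>
  <value>50070</value>  <!-- NameNode status page -->
</property>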
There is a Java version of the multifetch example. Please read it, and then set up a real Hadoop cluster!