Handout: 27 November, 2007
Due: 7 December, 2007
In this assignment you will have to work at two different tasks. A first one, in which you have to implement a POS tagger, and a second one, in which you have to test your heuristics and take some decisions.
Write a Python program that assigns tags to all tokens in a text. You should use our reference tokenizer to perform tokenization.
The goal of the tagger is to turn input like:
"The green dog."
into
[['The', 'DT'], ['green', 'JJ, NN'], ['dog', 'NN', 'VB'], ['.', '.']]
which corresponds to the text form "The/DT green/JJ|NN
dog/NN|VB ./.".
You'll have to reduce the ambiguity ratio by using a set of heuristics. However, you should not just throw away all tags that you don't like. The trick is to disambiguate correctly. For example,
The/DT green/JJ|NN dog/NN|VB ./.
should be disambiguated to
The/DT green/NN dog/VB ./.
rather than to
The/DT green/JJ dog/NN ./.
since this is a full sentence, and not only a NP.
(The/DT green/JJ|NN ... stands for
[['The', 'DT'], ['green', 'JJ', 'NN'],
...].)
The goal therefore is to reduce the ambiguity ratio without throwing away correct tags. Here are the steps that you have to take:
dictionary. If you get annoyed by waiting for the
program to read in the lexicon, then you should look into the
shelve module.advanced VBN JJ VBD VBN|VBD fresh JJ JJ|RB running VBG JJ NN NN|VBG
The lexicon file is created from a tagged corpus where the tags
were assumed to be correct. But in some instances, the tagger that
tagged the corpus could not disambiguate two tags. This is how
VBN|VBD and JJ|RB got into the lexicon. For this assignment, you
should see the tags of "fresh" as a set of three tags: JJ, JJ and
RB. But at some point, you will have to do something about these
tags since having duplicate tags doesn't do any good for the
ambiguity-length ratio. You may choose to delete duplicate tags at
this point, or while loading the dictionary; alternatively you may
also let the heuristics of the next step take care of it.
Your heuristics should be coded as instances of a class called Heuristic, not as a series of if-then statements or another kind of data. Here's one idea for coding them:
initDet = Heuristic("initial_det", # name of the rule
['DT', 'NNP', 'JJ'], # list of tags expected on token
['^'], # left context is beginning of line
[], # right context not important here
(DELETE, ('NNP', 'JJ'))) # tags to be deleted from token
adjNoun = Heuristic("adjective_noun",
['NN', 'VB'], # list of tags expected on token
['JJ'], # left context is beginning of line
[], # right context not important here
(DELETE, ('VB'))) # tags to be deleted from token
In order to develop your set of heuristics, you have to be able to know how well your tagger is performing. We will use two measures of performance: accuracy and ambiguity ratio.
Write a class named Tester that implements these two
functions.
The point of writing this class is for you to make changes to your
heuristics and see the impact on the quality of the tagger. Therefore,
a part of what you will submit is also a log of the tuning
process. Give us as much detail as we need (and only that level of
detail, please!) to understand why you made the decisions you
made. We want to know what considerations you made about
obtaining a good balance between accuracy and ambiguity ratio.
Work can be done individually or in groups of at most 3. We strongly encourage you to work in groups.
Email a file named tagger.py. The program has to be run by calling
python tagger.py input_file output_file
goldStandard_file.
goldStandard_file is the unambiguously tagged version of
the text.
output_file should show the output in the following form:
[['At', 'IN'], ['least', 'JJS'], ['two', 'CD'], ['people', 'NNS'], ['were', 'VBD'], ['killed', 'VBN', 'VBD'], ['and', 'CC'], ['several', 'JJ'], ['others', 'NNS'], ['injured', 'VBN', 'VBD', 'JJ'], ['.', '.', 'RP'], ['<PP>', '<PP>'], ['Officials', 'NNS'], ['said', 'VBD'], (...)] ACCURACY: .97 AMBIGUITY RATIO: 2.36363636364(N.B. This is a made up example. The numbers aren't right.)
Note that the list contains the whole text (not chunked by paragraphs), and that paragraphs marker have to be appropriately tagged as well.
You must submit your code (tagger.py) together with the output files, and the log containing your considerations and the decisions you made. In case that the code doesn't run properly, indicate it there.