Programming Assignment #3: POS Tagger

Handout: 27 November, 2007
Due: 7 December, 2007

Specification

In this assignment you will have to work at two different tasks. A first one, in which you have to implement a POS tagger, and a second one, in which you have to test your heuristics and take some decisions.

Writing a POS tagger:

Write a Python program that assigns tags to all tokens in a text. You should use our reference tokenizer to perform tokenization.

The goal of the tagger is to turn input like:

	"The green dog."

into

	[['The', 'DT'], ['green', 'JJ, NN'], ['dog', 'NN', 'VB'], ['.', '.']]

which corresponds to the text form "The/DT green/JJ|NN dog/NN|VB ./.".

You'll have to reduce the ambiguity ratio by using a set of heuristics. However, you should not just throw away all tags that you don't like. The trick is to disambiguate correctly. For example,

   The/DT green/JJ|NN dog/NN|VB ./.

should be disambiguated to

   The/DT green/NN dog/VB ./.

rather than to

   The/DT green/JJ dog/NN ./.

since this is a full sentence, and not only a NP.

(The/DT green/JJ|NN ... stands for [['The', 'DT'], ['green', 'JJ', 'NN'], ...].)

The goal therefore is to reduce the ambiguity ratio without throwing away correct tags. Here are the steps that you have to take:

  1. We provide you a non-tokenized input text, a lexicon, and its reference tagset. We are also providing you with a the unambiguously tagged version of that input text. Presumably this text has the correct tags (although some errors might still occur).
    Decide how you want to use the lexicon. The easiest thing to do is to read the lexicon file and put all lexical entries into a dictionary. If you get annoyed by waiting for the program to read in the lexicon, then you should look into the shelve module.

  2. Go through a sentence and add tags to all tokens. Note that in the unambiguously tagged version abbreviations are tagged as 'ABBREV', numbers as 'CD' and the rest of unknown words as 'NNP'.

    Also, some of the lexical entries have a tag that is ambiguous. For example,
    	advanced VBN JJ VBD VBN|VBD
    	fresh JJ JJ|RB
    	running VBG JJ NN NN|VBG

    The lexicon file is created from a tagged corpus where the tags were assumed to be correct. But in some instances, the tagger that tagged the corpus could not disambiguate two tags. This is how VBN|VBD and JJ|RB got into the lexicon. For this assignment, you should see the tags of "fresh" as a set of three tags: JJ, JJ and RB. But at some point, you will have to do something about these tags since having duplicate tags doesn't do any good for the ambiguity-length ratio. You may choose to delete duplicate tags at this point, or while loading the dictionary; alternatively you may also let the heuristics of the next step take care of it.

  3. Define heuristics that perform partial disambiguation of the tokens. Some of these heuristics will appear amongst the hints below. But you also have to define some (at least 5) heuristics yourself. The task is to show that you can write the code that applies the heuristics, that you can create heuristics that do some disambiguation (without introducing too many errors), and that you come to get aware of some of the problems. You don't have to write a perfect disambiguater. The Brill-tagger, for example, uses 800 rules for disambiguation (give and take a few hundred).

    Your heuristics should be coded as instances of a class called Heuristic, not as a series of if-then statements or another kind of data. Here's one idea for coding them:

        initDet = Heuristic("initial_det",           # name of the rule
                            ['DT', 'NNP', 'JJ'],     # list of tags expected on token
                            ['^'],                   # left context is beginning of line 
                            [],                      # right context not important here
                            (DELETE, ('NNP', 'JJ'))) # tags to be deleted from token  
    
        adjNoun = Heuristic("adjective_noun",
                            ['NN', 'VB'],     # list of tags expected on token
                            ['JJ'],           # left context is beginning of line 
                            [],               # right context not important here
                            (DELETE, ('VB'))) # tags to be deleted from token  
    
    

Testing task:

In order to develop your set of heuristics, you have to be able to know how well your tagger is performing. We will use two measures of performance: accuracy and ambiguity ratio.

Write a class named Tester that implements these two functions.
The point of writing this class is for you to make changes to your heuristics and see the impact on the quality of the tagger. Therefore, a part of what you will submit is also a log of the tuning process. Give us as much detail as we need (and only that level of detail, please!) to understand why you made the decisions you made. We want to know what considerations you made about obtaining a good balance between accuracy and ambiguity ratio.

Submission

Work can be done individually or in groups of at most 3. We strongly encourage you to work in groups.

Email a file named tagger.py. The program has to be run by calling python tagger.py input_file output_file goldStandard_file.
goldStandard_file is the unambiguously tagged version of the text.
output_file should show the output in the following form:

[['At', 'IN'],
['least', 'JJS'],
['two', 'CD'],
['people', 'NNS'],
['were', 'VBD'],
['killed', 'VBN', 'VBD'],
['and', 'CC'],
['several', 'JJ'],
['others', 'NNS'],
['injured', 'VBN', 'VBD', 'JJ'],
['.', '.', 'RP'],
['<PP>', '<PP>'],
['Officials', 'NNS'], 
['said', 'VBD'],
(...)]

ACCURACY:  .97
AMBIGUITY RATIO: 2.36363636364 
(N.B. This is a made up example. The numbers aren't right.)

Note that the list contains the whole text (not chunked by paragraphs), and that paragraphs marker have to be appropriately tagged as well.

You must submit your code (tagger.py) together with the output files, and the log containing your considerations and the decisions you made. In case that the code doesn't run properly, indicate it there.

Support material


CoSci101a
Last modified: Mon Sep 22 15:34:59 EDT 2003