A misty grey day on the hill and Clover is huddled in her house with her great friend Chicory. So, perfect conditions for fiddling with part-of-speech tagging in R...
My goal was to try out the openNLP library in R on a learner corpus of short-answer responses. With the openNLP library installed, the main issue was that I didn't have Java 1.6 installed, which made R crash with an unhelpful fatal error message when loading the library with:
All was not lost, however. This helpful post from R-bloggers sorted out the issue for me:
The tagger uses the Penn Treebank tagset, and the next job is to compare its performance with the NLTK POS tagger and the Stanford tagger. I can never remember what all the tags are, but a handy index to the Penn Treebank tags is here:
And here's the openNLP output for a model answer:
Because/IN blood/NN ejected/VBN from/IN the/DT left/JJ ventricle/NN into/IN the/DT aorta/NN is/VBZ under/IN high/JJ pressure/NN and/CC flow/NN in/IN the/DT aorta/NN and/CC arterial/JJ system/NN is/VBZ pulsatile/NN ./.
Compare with the NLTK out-of-the-box tagger, which also uses the Penn Treebank tagset:
Because/IN blood/NN ejected/VBD from/IN the/DT left/NN ventricle/NN into/IN the/DT aorta/NN is/VBZ under/IN high/JJ pressure/NN and/CC flow/NN in/IN the/DT aorta/NN and/CC arterial/JJ system/NN is/VBZ pulsatile/JJ ./.
The two outputs differ on three words: 'ejected', 'left' and 'pulsatile'. 'Ejected' is a past participle here ("blood ejected from the left ventricle" is a reduced relative clause), so openNLP's VBN is right and NLTK's VBD (past tense) is not. Pulsatile, in the physiology sense, is an adjective, and the NLTK tagger got this right, whereas openNLP tagged it as a noun. In general 'left' can be either a noun or an adjective, but here it modifies 'ventricle', so openNLP's adjective tag fits better than NLTK's noun tag.
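Spotting tag differences by eye is error-prone, so here is a minimal sketch in Python of a token-by-token comparison of the two outputs above. It only assumes that both taggers emit space-separated word/TAG tokens over the same tokenisation; the function names parse_tagged and tag_differences are my own, not part of openNLP or NLTK.

```python
def parse_tagged(text):
    """Split 'word/TAG word/TAG ...' into a list of (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # Split on the LAST slash so tokens such as './.' parse correctly.
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


def tag_differences(output_a, output_b):
    """Return (word, tag_a, tag_b) for every token the two outputs tag differently."""
    pairs_a = parse_tagged(output_a)
    pairs_b = parse_tagged(output_b)
    # This sketch assumes identical tokenisation; fail loudly otherwise.
    if [w for w, _ in pairs_a] != [w for w, _ in pairs_b]:
        raise ValueError("the two outputs do not contain the same tokens")
    return [
        (word, tag_a, tag_b)
        for (word, tag_a), (_, tag_b) in zip(pairs_a, pairs_b)
        if tag_a != tag_b
    ]


opennlp_output = (
    "Because/IN blood/NN ejected/VBN from/IN the/DT left/JJ ventricle/NN "
    "into/IN the/DT aorta/NN is/VBZ under/IN high/JJ pressure/NN and/CC "
    "flow/NN in/IN the/DT aorta/NN and/CC arterial/JJ system/NN is/VBZ "
    "pulsatile/NN ./."
)
nltk_output = (
    "Because/IN blood/NN ejected/VBD from/IN the/DT left/NN ventricle/NN "
    "into/IN the/DT aorta/NN is/VBZ under/IN high/JJ pressure/NN and/CC "
    "flow/NN in/IN the/DT aorta/NN and/CC arterial/JJ system/NN is/VBZ "
    "pulsatile/JJ ./."
)

# Print one line per token whose tag differs between the two taggers.
for word, opennlp_tag, nltk_tag in tag_differences(opennlp_output, nltk_output):
    print(f"{word}: openNLP={opennlp_tag}, NLTK={nltk_tag}")
```

Running this prints one line for each token the two taggers disagree on. The same function should work for a Stanford tagger comparison too, provided its output is first put into the same word/TAG form.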
But overall, not bad on a specific disciplinary dataset.
Clover couldn't care less. She knows that Clover is a NNP and a very important one at that.
Labels: NLP, POS-tagging