A Part-of-Speech Tagger for Afrikaans ('n Woordsoortetiketteerder vir Afrikaans)
Abstract
A part-of-speech (POS) tagger is a core technology for many human language technology (HLT) applications, so developing one is of great importance for any language with an emerging HLT industry. This article describes the development of a first POS tagger for Afrikaans. The tagger was developed by training the TnT algorithm, a machine learning algorithm based on Hidden Markov Models, on annotated Afrikaans data; the reasons for choosing this algorithm are explained in the article. The tagger labels the words of an input text using a tagset developed specifically for Afrikaans. Because this tagset can be applied at different levels of specificity, the tagger is evaluated with both a very specific, fine-grained tagset and a much more general one, to determine how the size of a tagset affects the accuracy of a POS tagger. With the complete tagset of 139 very specific tags, the tagger tags 85.87% of words correctly after being trained on only 20 000 words. With a tagset of only 13 general tags, it is 93.69% accurate on the same text after being trained on the same 20 000 words. With the specific tagset (139 tags), the tagger developed here is not accurate enough to be built into applications, but it can be used to annotate more training data semi-automatically. That training data can, in turn, be used to train a more accurate tagger for use in applications such as grammar checkers, syntactic parsers and machine translation systems.
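The comparison between a fine-grained and a general tagset can be illustrated with a minimal sketch. It is not the evaluation code used in the article: the tag names, the `category.feature` naming convention and the `coarse` mapping are all hypothetical, chosen only to show why the same tagger output tends to score higher once tags are collapsed into general classes.

```python
from typing import Callable, Sequence


def tagging_accuracy(
    gold: Sequence[str],
    predicted: Sequence[str],
    project: Callable[[str], str] = lambda tag: tag,
) -> float:
    """Fraction of tokens whose predicted tag matches the gold tag.

    `project` maps each tag to the granularity at which it is compared;
    the default compares the tags unchanged (fine-grained evaluation).
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty sequence")
    correct = sum(project(g) == project(p) for g, p in zip(gold, predicted))
    return correct / len(gold)


def coarse(tag: str) -> str:
    # Assumption: a fine-grained tag looks like "N.sg"; the part before
    # the dot is the general word class. This is an invented convention,
    # not the Afrikaans tagset described in the article.
    return tag.split(".")[0]


# Invented gold-standard and tagger output for five tokens.
gold = ["N.sg", "V.pres", "N.pl", "ADJ.attr", "N.sg"]
pred = ["N.pl", "V.pres", "N.pl", "ADJ.attr", "V.past"]

fine_accuracy = tagging_accuracy(gold, pred)
coarse_accuracy = tagging_accuracy(gold, pred, project=coarse)
print(f"fine-grained: {fine_accuracy:.2f}, general: {coarse_accuracy:.2f}")
```

In this toy data the first token is tagged with the wrong number feature but the right word class, so it counts as an error only under the fine-grained tagset. Errors of that kind are one reason a smaller tagset yields a higher accuracy figure for the same output.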
Southern African Linguistics and Applied Language Studies 2008, 26(1): 119–134