Penalizing unknown words’ emissions in HMM POS tagger based on Malay affix morphemes


H. Mohamed
N. Omar
M.J.A. Aziz

Abstract

The challenge in unsupervised Hidden Markov Model (HMM) training for a POS tagger is that training depends on an untagged corpus; the only supervised data limiting the possible tags of words is a dictionary. Therefore, training cannot properly map the possible tags of unknown words. The exact morphemes of prefixes, suffixes and circumfixes in the agglutinative Malay language are examined to assign probable tags to unknown words based on linguistically meaningful affixes, using a morpheme-based POS guessing algorithm. The algorithm has been integrated into the Viterbi algorithm, which uses HMM-trained parameters to tag new sentences. In the experiment, the tagger handles unknown words in three ways: first with character-based prediction, next with the morpheme-based POS guessing algorithm, and lastly with a combination of the two.
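The idea of penalizing unknown words' emissions by affix can be illustrated with a minimal Python sketch. It is not the authors' implementation. The affix table, the tag set, the minimum stem length and the penalty value are all assumptions chosen for illustration, and the function names `guess_tags` and `unknown_emission` are hypothetical.

```python
# Illustrative sketch only: the affix table, tag set and penalty below are
# assumptions for demonstration, not the rules or values used in the paper.

ALL_TAGS = ("NOUN", "VERB", "ADJ", "OTHER")
MIN_STEM = 3  # assumed minimum stem length left after stripping an affix

# (prefix, suffix, candidate tags); circumfixes are checked first.
CIRCUMFIXES = [
    ("ke", "an", {"NOUN"}),
    ("per", "an", {"NOUN"}),
    ("pe", "an", {"NOUN"}),
]
# Longer prefixes come first so that "meng" is tried before "me".
PREFIXES = [
    ("meng", {"VERB"}),
    ("mem", {"VERB"}),
    ("men", {"VERB"}),
    ("me", {"VERB"}),
    ("ber", {"VERB"}),
    ("ter", {"VERB", "ADJ"}),
    ("di", {"VERB"}),
    ("pe", {"NOUN"}),
]
SUFFIXES = [
    ("kan", {"VERB"}),
    ("an", {"NOUN"}),
    ("i", {"VERB"}),
]


def guess_tags(word):
    """Return the candidate tags suggested by the word's affixes."""
    w = word.lower()
    for pre, suf, tags in CIRCUMFIXES:
        stem_len = len(w) - len(pre) - len(suf)
        if w.startswith(pre) and w.endswith(suf) and stem_len >= MIN_STEM:
            return set(tags)
    for pre, tags in PREFIXES:
        if w.startswith(pre) and len(w) - len(pre) >= MIN_STEM:
            return set(tags)
    for suf, tags in SUFFIXES:
        if w.endswith(suf) and len(w) - len(suf) >= MIN_STEM:
            return set(tags)
    # No recognised affix: every tag stays possible.
    return set(ALL_TAGS)


def unknown_emission(word, tag, penalty=0.01):
    """Emission score for an unknown word, normalised over ALL_TAGS.

    Tags not suggested by the affixes are scaled down by `penalty`
    instead of being ruled out entirely.
    """
    candidates = guess_tags(word)
    weights = {t: (1.0 if t in candidates else penalty) for t in ALL_TAGS}
    return weights[tag] / sum(weights.values())


if __name__ == "__main__":
    for w in ("membaca", "kebaikan", "makanan"):
        print(w, sorted(guess_tags(w)))
```

In a Viterbi decoder, a function like `unknown_emission` would stand in for the trained emission probability whenever a word is missing from the dictionary. Penalized tags keep a small non-zero score, so the transition probabilities can still override an incorrect affix guess.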


Keywords: Malay POS tagger; morpheme-based; HMM.


Journal Identifiers


eISSN:
print ISSN: 1112-9867