Developing a tagset and tagger for the African languages of South Africa with special reference to Xhosa
Abstract
There are currently two distinct but not necessarily
mutually exclusive approaches to the retrieval of information from linguistic
corpora. ‘Corpus-driven’ approaches rely solely on the corpus itself to yield
significant patterns. With the exception of orthographic spacing, no additional
annotations to a ‘raw’ corpus are used to guide searches and the retrieval of
information from the corpus. Typically, key word in context (KWIC) analyses are
applied to relevant concordance lines to extract statistically significant
lexical and grammatical patterns. In ‘corpus-based’ approaches, on the other
hand, information is retrieved from an enriched corpus on the basis of
annotations in the form of linguistic tags. That is, the
annotations are used to direct the searches to specific grammatical and lexical
phenomena in a corpus.
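The contrast between the two approaches can be illustrated with a minimal sketch. It is not taken from the article: the function names `kwic` and `find_by_tag`, the English sample tokens and the tag labels `DET`, `N` and `V` are all assumptions chosen for illustration. The first function works on a raw, whitespace-tokenised corpus (corpus-driven); the second relies on tags that have already been added to the corpus (corpus-based).

```python
from typing import List, Tuple


def kwic(tokens: List[str], keyword: str, width: int = 3) -> List[str]:
    """Corpus-driven search: one concordance line per occurrence of keyword.

    Uses only the raw tokens; matching is case-insensitive.
    """
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines


def find_by_tag(tagged: List[Tuple[str, str]], tag: str) -> List[str]:
    """Corpus-based search: return every word carrying the given tag."""
    return [word for word, t in tagged if t == tag]


# Hypothetical sample data (placeholder English tokens and tag labels).
raw_tokens = "the dog saw the cat".split()
tagged_tokens = [("the", "DET"), ("dog", "N"), ("saw", "V"),
                 ("the", "DET"), ("cat", "N")]

concordance = kwic(raw_tokens, "the", width=1)
nouns = find_by_tag(tagged_tokens, "N")
```

In the raw corpus, a search can only be phrased in terms of word forms; in the tagged corpus, the same data can be queried by grammatical category without knowing the word forms in advance.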
In this article, we propose a
corpus-based approach and a tagset to be used on a corpus of spoken language
for the African languages of South Africa. A number of problematic linguistic
phenomena, such as fixed expressions, agglutination and morphemic merging, as
well as spoken language phenomena such as interrupted words, often affect
tagging principles. These problematic phenomena are discussed and
illustrated. The development of the tagset is based on the morphosyntactic properties
of Xhosa for reasons that are outlined in the article.
Manual tagging of a large corpus
would be quite a daunting and time-consuming task, not to mention the potential
for various kinds of errors. This problem is solved in a two-step process. Firstly,
a computer-based drag-and-drop tagger was developed to facilitate the manual
tagging of a so-called training corpus. Secondly, this training corpus forms the
input to the development of an automatic tagger. The principles and procedures
for the development of an automatic tagger for African languages are also
discussed.
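The two-step idea, a manually tagged training corpus feeding an automatic tagger, can be sketched with a simple most-frequent-tag baseline. This is an assumption made for illustration only: the abstract does not state which tagging method the authors use, and the names `train_unigram_tagger` and `tag_tokens`, the `UNK` fallback label and the sample data are all hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

TaggedSentence = List[Tuple[str, str]]


def train_unigram_tagger(training_corpus: List[TaggedSentence]) -> Dict[str, str]:
    """Learn, for each word form, the tag it most often received in the
    manually tagged training corpus."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for sentence in training_corpus:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


def tag_tokens(tokens: List[str], model: Dict[str, str],
               default: str = "UNK") -> List[Tuple[str, str]]:
    """Tag new text automatically; unseen words get the default label."""
    return [(tok, model.get(tok.lower(), default)) for tok in tokens]


# Hypothetical manually tagged training corpus (placeholder tokens and tags).
training = [
    [("the", "DET"), ("dog", "N")],
    [("the", "DET"), ("run", "V")],
    [("run", "N")],
    [("run", "V")],
]

model = train_unigram_tagger(training)
result = tag_tokens(["The", "run", "zebra"], model)
```

A word-form lookup of this kind is likely to be too coarse for agglutinative languages, where a single orthographic word may combine several morphemes; it is shown only to make the role of the training corpus concrete.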
Southern African Linguistics and
Applied Language Studies 2003, 21(4): 223–237