Semi-automatic Term Extraction for the African Languages, with Special Reference to Northern Sotho


E Taljard
G-M de Schryver

Abstract

Worldwide, semi-automatically extracting terms from corpora is becoming the norm for the compilation of terminology lists, term banks or dictionaries for special purposes. If African-language terminologists are willing to take their rightful place in the new millennium, they must not only take cognisance of this trend but also be ready to implement the new technology. This article argues that the best way to do both at this stage is to opt for computationally straightforward alternatives (i.e. use 'raw corpora') and to make use of widely available software tools (e.g. WordSmith Tools). The main aim is therefore to discover whether or not the semi-automatic extraction of terminology from untagged and unmarked running text by means of basic corpus query software is feasible for the African languages. To answer this question, a full-blown case study revolving around Northern Sotho linguistic texts is discussed in great detail. The computational results are compared throughout with the outcome of a manual excerption, and vice versa. Attention is given to the concepts 'recall' and 'precision'; different approaches are suggested for the treatment of single-word terms versus multi-word terms; and the various findings are summarised in a Linguistics Terminology lexicon presented as an Appendix.
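As an illustration of how 'recall' and 'precision' relate a semi-automatically extracted candidate list to a manual excerption, the following is a minimal sketch only. It assumes the standard definitions (recall is the share of manually identified terms that were retrieved; precision is the share of retrieved candidates that are genuine terms). The function name `recall_precision` and the placeholder term labels are hypothetical and are not taken from the article or from WordSmith Tools.

```python
def recall_precision(extracted, manual):
    """Compare a candidate term list with a manually excerpted term list.

    Assumption: both arguments are iterables of strings, and the manual
    excerption is treated as the reference list of true terms.
    """
    extracted_set = set(extracted)
    manual_set = set(manual)

    # Candidates that the manual excerption also accepted as terms.
    hits = extracted_set & manual_set

    # Recall: fraction of the manually found terms that were retrieved.
    recall = len(hits) / len(manual_set) if manual_set else 0.0
    # Precision: fraction of the retrieved candidates that are true terms.
    precision = len(hits) / len(extracted_set) if extracted_set else 0.0
    return recall, precision


# Placeholder labels stand in for real term candidates (hypothetical data).
candidates = ["term_a", "term_b", "term_c", "term_d"]
reference = ["term_a", "term_b", "term_e"]
recall, precision = recall_precision(candidates, reference)
print(f"recall={recall:.2f} precision={precision:.2f}")
```

In this toy case two of the three reference terms are retrieved, and two of the four candidates are genuine terms, so the two measures differ; a real evaluation would use the full candidate and excerption lists.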

Keywords: terminology, terminography, manual excerption, reading and marking, semi-automatic term extraction, retrieval, African languages, Northern Sotho (Sepedi), raw corpora, Pretoria Sepedi Corpus (PSC), WordSmith Tools, weirdness ratio, key word, log-likelihood, recall, precision, mother term, single-word term, multi-word term, stem, root, key-word-in-context (KWIC), collocation, collocate, lexical gap, cluster, linguistics terminology lexicon

Journal Identifiers


eISSN: 2224-0039
print ISSN: 1684-4904