Machine learning for document classification in an archive of the National Afrikaans Literary Museum and Research Centre


Susan Brokensha
Eduan Kotzé
Burgert Senekal

Abstract

https://dx.doi.org/10.4314/jsasa.v56i1.10


ISSN: 1012-2796


©SASA 2023

Most archives were established before the digital age, when hard copies were archived in much smaller volumes. In the information age, archives struggle to accommodate the large volumes of material produced. In addition, many archives, including those in South Africa, have had to contend with budget cuts that reduced the number of staff available. If digital material is not archived now, gaps may appear in the future historical record. Moreover, as digital humanities gains wider acceptance, large corpora of digital material are needed, which archives could provide. This study aimed to investigate whether document classification using machine learning classifiers is feasible in a South African archive context, with a focus on the National Afrikaans Literary Museum and Research Centre (NALN). The researchers created and trained a document classification model and tested its accuracy against human classifiers. The study followed a basic linguistic approach to prepare specific text documents for classification into Galloway and Roux’s (2019) six categories, namely articles, media reports, books, interviews, reviews, and dissertations and theses. Two annotators classified the documents, after which the annotated corpus was employed as training data for machine learning models. Following Rolan et al. (2018), Suominen (2019), and Connelly et al. (2020), Python libraries were used for document classification. The researchers show that machine learning classifiers can accurately categorise documents into different types. If implemented, such classifiers could allow archives to improve their collection efforts without spending more on salaries. One way of coping with the information explosion is to develop metadata generation tools based on machine learning and artificial intelligence. If metadata could be generated automatically, it would reduce the pressure on archival personnel by providing a way to handle larger volumes.
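
The workflow summarised above (annotate documents, train a classifier on the annotated corpus, then compare its output with human labels) can be illustrated with a minimal Python sketch. This is not the authors' model: the abstract does not name the libraries or algorithms used, so the sketch substitutes a small multinomial Naive Bayes classifier written with the standard library only. The training snippets, the held-out snippets, the label strings, and the names `NaiveBayes`, `tokenize`, and `accuracy` are all hypothetical, invented here for illustration.

```python
import math
import re
from collections import Counter, defaultdict

# Label strings for the six document types named in the abstract
# (the exact strings are an assumption made for this sketch).
CATEGORIES = ["article", "media report", "book",
              "interview", "review", "dissertation or thesis"]

# Made-up, human-labelled training snippets (one per category).
TRAIN = [
    ("peer reviewed journal article with abstract keywords and references", "article"),
    ("newspaper report on the festival published by the daily press", "media report"),
    ("novel in twelve chapters with a foreword by the publisher", "book"),
    ("interview in which the poet answers questions about her work", "interview"),
    ("review praising the new poetry collection and its imagery", "review"),
    ("thesis submitted in fulfilment of the requirements for the degree",
     "dissertation or thesis"),
]

# Made-up held-out snippets with the label a human annotator might assign.
HELD_OUT = [
    ("the poet answers questions in a long interview", "interview"),
    ("thesis submitted for the degree of master", "dissertation or thesis"),
    ("review praising the poetry collection", "review"),
]


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (illustrative only)."""

    def fit(self, texts, labels):
        self.doc_counts = Counter(labels)        # documents per category
        self.word_counts = defaultdict(Counter)  # token counts per category
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        self.total_docs = len(labels)
        return self

    def predict(self, text):
        # Tokens never seen in training carry no evidence and are skipped.
        tokens = [t for t in tokenize(text) if t in self.vocab]
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / self.total_docs)  # log prior
            for t in tokens:
                score += math.log((counts[t] + 1) / denom)  # smoothed likelihood
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def accuracy(predicted, gold):
    """Fraction of predictions that match the human-assigned labels."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


clf = NaiveBayes().fit([t for t, _ in TRAIN], [l for _, l in TRAIN])
gold = [label for _, label in HELD_OUT]
predicted = [clf.predict(text) for text, _ in HELD_OUT]
print(list(zip(predicted, gold)))
print("accuracy against human labels:", accuracy(predicted, gold))
```

The toy data is far too small to say anything about real-world performance. It only shows the shape of the pipeline: labelled text goes in, a category prediction comes out, and agreement with human labels is measured as accuracy.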

