Amharic Speech Recognition Using Joint Transformer and Connectionist Temporal Classification with Character-Based and Sub-word-Based Acoustic and Language Models


Alemayehu Yilma Demisse
Bisrat Derebssa Dufera

Abstract

Sequence-to-sequence attention-based models have gained considerable attention in recent times for automatic speech recognition (ASR). The transformer architecture has been extensively employed for a variety of sequence-to-sequence transformation problems, including machine translation and ASR. This architecture avoids the sequential computation used in recurrent neural networks and thereby improves the iteration rate during training. Connectionist temporal classification (CTC), on the other hand, is widely employed to accelerate the convergence of the sequence-to-sequence model by explicitly learning a better alignment between the input speech feature sequence and the output label sequence. Amharic, a Semitic language spoken by 57.5 million people in Ethiopia, is morphologically rich, which poses a challenge for continuous speech recognition because a root word can be conjugated and inflected into thousands of words reflecting subject, object, tense, and quantity. In this research, connectionist temporal classification is integrated with the transformer for continuous Amharic speech recognition. A suitable acoustic modeling unit for an Amharic speech recognition system is also investigated by utilizing character-based and sub-word-based models. The results show a best character error rate of 8.04% for the character-based model with a character-level language model (LM) and a best word error rate of 22.31% for the sub-word-based model with a sub-word-level LM.
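The two evaluation metrics and the joint training objective mentioned above can be made concrete with a short sketch. The character error rate (CER) is conventionally computed as the Levenshtein edit distance between the reference and hypothesis transcripts, normalised by the reference length; hybrid CTC/attention systems typically minimise an interpolated loss L = λ·L_CTC + (1 − λ)·L_attention. The function names and the interpolation weight λ = 0.3 below are illustrative assumptions, not values taken from this paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences via dynamic programming."""
    n = len(hyp)
    dp = list(range(n + 1))          # dp[j] = distance for empty-ref prefix
    for i in range(1, len(ref) + 1):
        prev, dp[0] = dp[0], i       # prev holds dp[i-1][j-1]
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                              # deletion
                        dp[j - 1] + 1,                          # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))      # substitution
            prev = cur
    return dp[n]

def cer(reference, hypothesis):
    """Character error rate: edit distance normalised by reference length."""
    return edit_distance(reference, hypothesis) / len(reference)

def joint_loss(loss_ctc, loss_attention, lam=0.3):
    """Interpolated multitask objective used in hybrid CTC/attention training.

    lam=0.3 is a commonly used default in the hybrid CTC/attention
    literature, assumed here for illustration only.
    """
    return lam * loss_ctc + (1.0 - lam) * loss_attention
```

Word error rate (WER) follows the same formula applied to whitespace-separated tokens rather than characters, e.g. `cer(ref.split(), hyp.split())` on token lists.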


Journal Identifiers


print ISSN: 0514-6216