Amharic Speech Recognition Using Joint Transformer and Connectionist Temporal Classification with Character-Based and Sub-word-Based Acoustic and Language Models
Abstract
Sequence-to-sequence attention-based models have gained considerable attention in recent times for automatic speech recognition (ASR). The transformer architecture has been extensively employed for a variety of sequence-to-sequence transformation problems, including machine translation and ASR. This architecture avoids the sequential computation used in recurrent neural networks, which leads to an improved iteration rate during the training phase. Connectionist temporal classification (CTC), on the other hand, is widely employed to accelerate the convergence of the sequence-to-sequence model by explicitly learning a better alignment between the input speech features and the output label sequence. Amharic, a Semitic language spoken by 57.5 million people in Ethiopia, is a morphologically rich language that poses a challenge for continuous speech recognition because a root word can be conjugated and inflected into thousands of surface forms to reflect subject, object, tense, and quantity. In this research, connectionist temporal classification is integrated with the transformer for continuous Amharic speech recognition. A suitable acoustic modeling unit for an Amharic speech recognition system is also investigated by comparing character-based and sub-word-based models. The results show a best character error rate of 8.04% for the character-based model with a character-level language model (LM) and a best word error rate of 22.31% for the sub-word-based model with a sub-word-level LM.
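To make the CTC alignment idea in the abstract concrete, the following is a minimal, hedged sketch of greedy CTC decoding: CTC lets the model emit one label (or a blank) per input frame, and the decoder recovers the label sequence by merging consecutive duplicates and dropping blanks. The blank symbol and the frame sequence below are illustrative assumptions, not data from the paper.

```python
# Illustrative sketch of greedy CTC collapsing (not the paper's implementation).
BLANK = "_"  # assumed blank symbol used by CTC to separate repeated labels

def ctc_greedy_collapse(frame_labels):
    """Collapse a per-frame label sequence into an output string:
    merge runs of identical labels, then remove blank symbols."""
    out = []
    prev = None
    for lab in frame_labels:
        if lab != prev and lab != BLANK:
            out.append(lab)
        prev = lab
    return "".join(out)

# Hypothetical per-frame output: "hh_e_ll_llo" collapses to "hello";
# the blank between the two "ll" runs preserves the double letter.
print(ctc_greedy_collapse(list("hh_e_ll_llo")))
```

In the joint model described here, this per-frame alignment objective is trained alongside the transformer's attention-based decoder, which is what speeds up convergence.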