KEYWORDS
Speech recognition, Word Error Rate, Accuracy, Bi-directional LSTM, Recurrent Neural Networks
Introduction
Speech recognition is the conversion of spoken words into text, enabling speech information to be created, stored, and reused. Speech-to-text is a crucial application because text is easy to store; however, indexing a specific utterance can be difficult, as speech signals can be swift, intuitive, slow, and unpredictable at different times (Gupta & Joshi, 2018). Over the years, speech processing has evolved, with earlier state-of-the-art systems progressively replaced. According to Nassif et al. (2019), traditional speech recognition systems encode the speech signal using Gaussian Mixture Models (GMMs) combined with Hidden Markov Models (HMMs), under the assumption that short frames of speech can be treated as stationary. The Dynamic Time Warping (DTW) algorithm has also been used for speech recognition and, although it has largely been superseded by newer algorithms, it remains a popular technique. DTW works by finding the optimal alignment, i.e., the minimum cumulative distance, between two utterances, thereby compensating for differences in speaking rate between speakers. In general, speech recognition systems can be viewed in four stages: signal preprocessing, feature extraction, feature selection, and modeling.
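For illustration, the alignment idea behind DTW can be sketched with the standard dynamic-programming recurrence over two feature sequences (a minimal NumPy sketch, not the implementation used in this study; the sequence names and the Euclidean frame distance are illustrative assumptions):

```python
import numpy as np

def dtw_distance(x, y):
    """Minimal DTW: cumulative cost of the best alignment between
    two feature sequences x (n, d) and y (m, d)."""
    n, m = len(x), len(y)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(x[i - 1] - y[j - 1])   # local frame distance
            # extend the cheapest of the three allowed predecessor paths
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return cost[n, m]

# Two renditions of the same pattern spoken at different rates still align closely.
fast = np.sin(np.linspace(0, 3, 40)).reshape(-1, 1)
slow = np.sin(np.linspace(0, 3, 70)).reshape(-1, 1)
print(dtw_distance(fast, slow))
```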
Automatic speech recognition using deep neural networks has grown in popularity. This technology typically involves categorizing acoustic templates into pre-established classes. Numerous studies have shown that deep neural networks outperform conventional models at speech recognition. For example, the Microsoft Audio Video Indexing Service, a deep learning-based speech system disclosed by Microsoft, reported a 30% reduction in word error rate (WER) on four benchmarks compared with state-of-the-art Gaussian mixture models (Samal, Jena, & Manjhi, 2019). Speech recognition is fundamentally a time-series problem.
Deep neural networks are constructed as a series of layers, each containing neurons that take input from the preceding layer and perform a single computation. In feedforward neural networks the connections are unidirectional, so the outputs of one layer are passed only to the next layer. A drawback of feedforward networks is their inability to carry historical information forward. In addition, problems such as varying speaking rates and temporal dependencies frequently arise when deep neural networks are applied to speech recognition. By modeling a fixed sliding window of acoustic frames, deep neural networks can handle limited temporal dependencies, but they cannot accommodate varying speaking rates. Recurrent neural networks (RNNs) are a subset of artificial neural networks in which the connections between nodes can form cycles: loops in the hidden layers store information from previous time steps, allowing the output of a node to influence its own subsequent input and the prediction at the current step. Because of this mechanism, RNNs can handle the difficulty of diverse speaking rates (Apeksha Shewalkar, 2019).
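The difference can be sketched with a single recurrent cell unrolled over time: the hidden state carries information forward from earlier frames, whereas a feedforward layer sees each frame in isolation (a minimal NumPy sketch with illustrative weight shapes, not the acoustic model evaluated later in this paper):

```python
import numpy as np

rng = np.random.default_rng(0)
feat_dim, hidden_dim, T = 13, 8, 50        # e.g. 13 MFCCs per frame, 50 frames

W_x = rng.normal(scale=0.1, size=(hidden_dim, feat_dim))    # input-to-hidden weights
W_h = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))  # hidden-to-hidden weights (the "loop")
b = np.zeros(hidden_dim)

frames = rng.normal(size=(T, feat_dim))    # stand-in for an acoustic feature sequence
h = np.zeros(hidden_dim)                   # hidden state, initially empty

for x_t in frames:
    # the previous hidden state h feeds back in, so the output at each step
    # depends on the whole history of frames seen so far
    h = np.tanh(W_x @ x_t + W_h @ h + b)

print(h)  # summary of the entire utterance up to frame T
```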
Connectionist Temporal Classification (CTC) is an output layer and loss function for recurrent networks that removes the need for pre-segmented training data and for post-processing of the output to convert it into labeled sequences. Long Short-Term Memory (LSTM) networks address the issue of long-term dependencies in the data, which makes them very effective. Gated Recurrent Unit (GRU) networks also work well with sequential data and thus likewise handle long-term dependencies effectively. Although integrated models have been developed to address the issues of accuracy and speed in speech recognition, little attention has been paid to how a single model can be integrated to address all the issues raised by the various proposed algorithms. In this study, we integrate BLSTM and recurrent neural networks to address the highlighted challenges, specifically the continuous speech input stream.
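As a sketch of how CTC sidesteps pre-segmentation, Keras exposes a batched CTC cost that aligns unsegmented label sequences with frame-level softmax outputs (a minimal sketch with made-up shapes and random tensors; the vocabulary size and sequence lengths are illustrative assumptions, not this paper's configuration):

```python
import numpy as np
import tensorflow as tf

batch, time_steps, num_classes = 2, 50, 29   # 28 output symbols + 1 CTC blank
max_label_len = 10

# frame-level class probabilities from an RNN/BLSTM acoustic model (random here)
y_pred = tf.nn.softmax(tf.random.normal((batch, time_steps, num_classes)))
# unsegmented target label sequences (symbol indices), padded to max_label_len
y_true = tf.constant(np.random.randint(0, 28, size=(batch, max_label_len)), dtype=tf.int32)

input_length = tf.fill((batch, 1), time_steps)     # frames per utterance
label_length = tf.fill((batch, 1), max_label_len)  # symbols per transcript

# CTC marginalizes over all alignments, so no frame-level segmentation is needed
loss = tf.keras.backend.ctc_batch_cost(y_true, y_pred, input_length, label_length)
print(loss.numpy())  # one loss value per utterance in the batch
```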
In recent years, transformer architectures based on self-attention mechanisms have been shown to outperform Bidirectional Long Short-Term Memory (BLSTM) networks, giving good results in acoustic modeling (Yongqiang Wang, 2020). Transformers address gaps such as the limits on the length of speech signal over which long-range dependencies can be captured and, unlike recurrent models, they allow parallelization.
The contributions of this study are:
1. Through an investigation of several architectures (LSTM, GRU, and RNN), this paper proposes an enhanced model, the Bidirectional LSTM Recurrent Neural Network (BLSTM-RNN).
2. We demonstrate the superiority of the proposed model in reducing the word error rate and improving accuracy on a continuous input stream without increasing the required bandwidth.
The paper is structured as follows. Section 2 discusses related work in the area of speech recognition. Section 3 discusses the proposed approach. Section 4 describes the datasets and experimental environment. Section 5 discusses the results. The conclusion and future work are provided in Section 6.
Literature Review
In this section, we review documented work on ways to improve and achieve better results in speech recognition. Hori et al. (2017) developed a model with a deep CNN encoder together with an RNN-LM, using advances in joint CTC-attention for speech recognition. The model achieved a 5-10% reduction in word error rate compared with other traditional speech recognition systems. This reduction was the result of training the LSTM language model separately and combining the attention-based decoder predictions with the CTC predictions. The study recommended further improving the model by pre-training the RNN-LM on large quantities of unlabeled data.
Kumar et al. (2018) provided a survey of the various architectures involved in converting the spoken word to text and the deep learning techniques used. The paper discussed different models based on Convolutional Neural Networks (CNNs), Deep Belief Networks (DBNs), and Recurrent Neural Networks (RNNs). Their findings show that these neural networks (DBNs and CNNs) replaced Gaussian mixtures because they perform better on large vocabularies. The survey recommended the use of one unified neural network to achieve end-to-end speech recognition. In their conclusions, the use of RNNs in a hybrid deep neural network for acoustic modeling, together with a language model and the CTC loss function, produced mixed results. To eliminate difficult processing stages and to learn rich representations directly from raw inputs, deep learning neural networks were recommended.
James et al. (2018) studied speech recognition with Long Short-Term Memory neural networks, specifically for controlling electronic devices. The developed model achieved a word error rate of 6.8%, compared with 11.9% for a baseline hybrid system. Using the LSTM's modeling capacity for direct speech processing and exploiting past information to reliably estimate speech parameters contributed to the better performance of the model. The results of this study show the promising potential of LSTM as a technique for continuous speech recognition and for controlling separate electronic devices, and it can be implemented as a hardware model.
Mokgonyane et al. (2019) developed a machine learning system for automatic speech recognition using the classifier models support vector machine (SVM), k-nearest neighbors (KNN), random forest (RF), and multi-layer perceptron (MLP). The researchers used the Auto-WEKA data mining tool to find the best classifier with optimal hyperparameters, and 10-fold cross-validation was used to assess model performance. The experimental results demonstrated that the LSTM model performed better on both the training and test data than the CNN and DNN models trained with numerous combined parameters, and noise reduction was the most effective step, lowering the error rate from 31.23% to 25.89%. These findings strongly motivated our further study of recurrent neural networks and bidirectional long short-term memory.
Daneshfar and Kabudian (2020) created a speech emotion recognition model using a modified quantum-behaved particle swarm optimization (QPSO) algorithm and discriminative dimension reduction. Applying the technique to emotional datasets revealed that the pQPSO algorithm outperformed the traditional QPSO and wQPSO algorithms in terms of accuracy, as well as deep neural networks, traditional dimensionality reduction techniques, and other cutting-edge techniques on the same datasets. The pQPSO technique can also be used to uncover new emotion-related features and to optimize the parameters used for classification, the MFCC filter bank, the GMM, and dimensionality reduction via a transformation matrix.
Koduru et al. (2020) studied how to improve the recognition of emotions in speech using feature extraction algorithms. Based on the simulation results, Decision Trees achieved 85% accuracy, LDA 65%, and the Support Vector Machine (SVM) 70%. By extracting more of the information needed to identify emotions and representing the characteristics of the signal efficiently, the proposed system achieved higher accuracy than other models.
Ho et al. (2020) developed a model using an open-source dataset (House Twenty). The study applied Dynamic Time Warping and transformers to the prediction of time series data. The prediction error rate was greatly reduced, to 27.79% from 45.70%, compared with a support vector machine, a commonly used method for time series prediction. The research recommended that more experiments be conducted to compare with other methods.
Gulati et al. (2020) conducted a study on speech recognition using the convolution-augmented transformer known as the Conformer. The developed model achieved a Word Error Rate (WER) of 2.1%/4.3% without a language model and 1.9%/3.9% with an external language model on the test/test-other data. Using a smaller model of about 10M parameters, the model achieved a WER of 2.7%/6.3%.
Pawar and Kokate (2021) used Mel-frequency cepstral coefficients to build a convolutional neural network (CNN) for emotion recognition. In the study, the CNN architecture combined with deep learning offered the best classification accuracy by simplifying speech with the aid of combined feature extraction algorithms, such as pitch and energy, Mel-Frequency Cepstral Coefficients (MFCC), and Mel Energy Spectrum Dynamic Coefficients (MEDC), as well as selection techniques. CNN was shown to perform better and to deliver the best results in terms of ROC and AUC characteristic curves when compared with other approaches, such as KNN classifiers. The study suggested using appropriate feature extraction and classification techniques for greater emotion recognition precision.
Methodology
The dataset
The LibriSpeech ASR corpus, which contains 1000 hours of recorded speech, served as the research dataset. A better language model in the most recent release of the dataset is a key contributor to lower Word Error Rate (WER) values. The corpus is freely accessible and designed for training acoustic and language models, and the data are organized into training, validation, and test folders. It consists of 16 kHz audio files of spoken English, ranging in length from 2 to 15 seconds, extracted from audiobooks read aloud for the LibriVox project. To conduct this experiment, the audio files were first converted into single-channel (mono) WAV files (.wav extension) with a 64 kbps bit rate and a 16 kHz sample rate, encoded in PCM format, and cut or padded to an equal length of 10 seconds. As part of pre-processing, any punctuation other than apostrophes was removed from the text transcriptions and all characters were converted to lowercase. There are 64,220 training examples in total.
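The audio and transcript preparation described above can be sketched as follows (a minimal sketch using the librosa and soundfile libraries; the file paths and helper names are illustrative assumptions, not the exact pipeline used in this study):

```python
import re
import numpy as np
import librosa
import soundfile as sf

SR = 16000          # 16 kHz sample rate
CLIP_SECONDS = 10   # cut/pad every utterance to 10 s
TARGET_LEN = SR * CLIP_SECONDS

def prepare_audio(in_path, out_path):
    """Load an audio file as 16 kHz mono, pad/trim to 10 s, write a PCM WAV file."""
    audio, _ = librosa.load(in_path, sr=SR, mono=True)
    if len(audio) < TARGET_LEN:
        audio = np.pad(audio, (0, TARGET_LEN - len(audio)))   # pad with silence
    else:
        audio = audio[:TARGET_LEN]                            # cut to 10 s
    sf.write(out_path, audio, SR, subtype="PCM_16")

def prepare_transcript(text):
    """Lowercase the transcript and keep only letters, spaces, and apostrophes."""
    return re.sub(r"[^a-z' ]", "", text.lower())

# hypothetical example files
prepare_audio("LibriSpeech/train-clean-100/19/198/19-198-0001.flac", "19-198-0001.wav")
print(prepare_transcript("NOR is Mister Quilter's manner less interesting!"))
```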
Experimental Set-up
Experimentation was done on a custom Google Cloud Compute Engine machine accessed from a personal laptop computer. Python 3.6 from Continuum Analytics' Anaconda was used to create a virtual environment, and the scikit-learn, TensorFlow, Keras, and GraphLab Create libraries were configured to work in Jupyter Notebook. Jupyter Notebook was accessed through the browser, and all developed Python code was typed and executed there. MATLAB software assisted with the algorithm development.
Evaluation Measures
In speech recognition, there are two different types of performance or evaluation measures, which are based on (1) accuracy, and (2) speed. Evaluation measures based on accuracy include WER and mean edit distance.
Word Error Rate (WER) is calculated as:
\begin{equation} WER=\frac{S+I+D}{N}\times 100 \nonumber \end{equation}
where:
S is the number of substitutions,
I is the number of insertions,
D is the number of deletions, and
N is the total number of words in the actual transcript.
The interpretation of WER is that the lower the WER, the better the speech recognition.
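For illustration, the WER formula can be computed by aligning the reference and hypothesis transcripts with a Levenshtein-style dynamic program, which yields the combined count of substitutions, insertions, and deletions (a minimal sketch, not the evaluation code used in the experiments):

```python
def word_error_rate(reference, hypothesis):
    """WER = (S + I + D) / N * 100, computed over word sequences."""
    ref, hyp = reference.split(), hypothesis.split()
    n, m = len(ref), len(hyp)
    # dp[i][j] = minimum number of edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i              # i deletions
    for j in range(m + 1):
        dp[0][j] = j              # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])  # match or substitution
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[n][m] / n * 100

print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))  # 1 sub + 1 del -> 33.3%
```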
Mean edit distance is a measure of how many changes must be made to one string to transform it into the string it is being compared with. Let the normalized edit distance between two words/strings A and B be d(A, B). The mean edit distance is calculated as:
\begin{equation} d\left(A,B\right)=\min\left(\frac{W\left(P\right)}{N}\right) \nonumber \end{equation}
where:
P is an editing path between string A and string B,
W(P) is the total sum of the weights of all the edit operations in P, and
N is the total number of edit operations (the length of the editing path P).
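With unit weights for insertions, deletions, and substitutions, the numerator W(P) reduces to the ordinary edit distance; the following character-level sketch normalizes it by a lower bound on the path length (an illustrative simplification of the definition above, not the study's exact weighting scheme):

```python
def normalized_edit_distance(a, b):
    """Character-level edit distance with unit weights, normalized by a
    lower bound on the length of the editing path (illustrative only)."""
    n, m = len(a), len(b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j - 1] + cost,  # match / substitution
                           dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1)         # insertion
    edits = dp[n][m]             # W(P) under unit weights
    path_length = max(n, m)      # lower bound on the number of operations in P
    return edits / path_length

print(normalized_edit_distance("speech", "speach"))  # one substitution -> 1/6
```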
Parameter setup
The following parameters were used to run the model: dropout rate = 30%, number of epochs = 20, training batch size = 16, test batch size = 8, activation function = ReLU, neuron count in hidden layers = 1,000, Adam optimizer with β1 = 0.9, β2 = 0.999, ε = 1e-8, and learning rate = 0.0001.
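A Keras sketch of a model wired with these hyperparameters is shown below (a minimal sketch under the assumption that the BLSTM-RNN stacks a bidirectional LSTM layer over a simple RNN layer with a per-frame softmax output; the feature size, output vocabulary, and exact layer arrangement are assumptions, not the authors' published architecture):

```python
import tensorflow as tf
from tensorflow.keras import layers, models, optimizers

NUM_FEATURES = 161   # assumed spectrogram feature size per frame
NUM_CLASSES = 29     # assumed: 26 letters + space + apostrophe + blank

inputs = layers.Input(shape=(None, NUM_FEATURES))                 # variable-length utterances
x = layers.Bidirectional(layers.LSTM(1000, return_sequences=True,
                                     activation="relu"))(inputs)  # BLSTM layer, 1,000 units
x = layers.Dropout(0.3)(x)                                        # 30% dropout
x = layers.SimpleRNN(1000, return_sequences=True,
                     activation="relu")(x)                        # recurrent layer, 1,000 units
x = layers.Dropout(0.3)(x)
outputs = layers.TimeDistributed(layers.Dense(NUM_CLASSES, activation="softmax"))(x)
model = models.Model(inputs, outputs)

optimizer = optimizers.Adam(learning_rate=1e-4, beta_1=0.9,
                            beta_2=0.999, epsilon=1e-8)
model.compile(optimizer=optimizer, loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()

# training would then use the stated batch sizes and epoch count, e.g.:
# model.fit(train_ds, validation_data=val_ds, epochs=20, batch_size=16)
```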
Results and Analysis
The experiments were conducted on six models: RNN, GRU, LSTM, Bidirectional LSTM, LSTM with bias, and the Bidirectional LSTM-RNN model. The models were run with 1,000 nodes in each hidden layer.
Table 1.0 provides the WER results. RNN achieved 94.34% on the training dataset, 94.05% on the validation dataset, and 94.31% on the testing dataset. GRU achieved 29.95% on the training dataset, 94.05% on the validation dataset, and 94.31% on the testing dataset. LSTM achieved 28.30% on the training dataset, 31.76% on the validation dataset, and 32.82% on the testing dataset. BLSTM achieved 19.92% on the training dataset, 25.07% on the validation dataset, and 26.26% on the testing dataset. LSTM with bias initialized to one achieved 19.92% on the training dataset, 33.67% on the validation dataset, and 34.42% on the testing dataset. Bidirectional LSTM-RNN achieved 8.92% on the training dataset, 11.46% on the validation dataset, and 13.07% on the testing dataset, with the BLSTM-RNN model scoring best.
Table 1.1 provides the accuracy results. Simple RNN achieved 5.66% on the training dataset, 5.95% on the validation dataset, and 5.69% on the testing dataset. GRU achieved 70.05% on the training dataset, 66.37% on the validation dataset, and 65.58% on the testing dataset. LSTM achieved 71.7% on the training dataset, 68.24% on the validation dataset, and 67.18% on the testing dataset. BLSTM achieved 80.08% on the training dataset, 74.93% on the validation dataset, and 73.74% on the testing dataset. LSTM with bias initialized to one achieved 61.6% on the training dataset, 58.18% on the validation dataset, and 57.16% on the testing dataset. Bidirectional LSTM-RNN achieved 91.08% on the training dataset, 88.54% on the validation dataset, and 86.93% on the testing dataset, with BLSTM-RNN again scoring best.
In terms of the mean edit distance, as shown in Table 1.2, the BLSTM-RNN model achieved the best value.
Table 1.0 Word Error Rate