KEYWORDS
Speech recognition, Word Error Rate, Accuracy, Bi-directional LSTM,
Recurrent Neural Networks
Introduction
Speech recognition is the conversion of the spoken word into text, enabling the creation and use of speech information. Speech-to-text is a crucial application because text is easy to store; however, indexing a specific utterance can be difficult, as speech signals can be swift and intuitive at some times and slow and unpredictable at others (Gupta & Joshi, 2018). Over the years, speech processing has evolved and earlier state-of-the-art systems have gradually been replaced. According to Nassif et al. (2019), traditional speech recognition systems encode the speech impulses using a Gaussian Mixture Model (GMM) built on a Hidden Markov Model (HMM), under the assumption that short segments of speech can be treated as stationary signals. The Dynamic Time Warping (DTW) algorithm has also been used for speech recognition and, although it has largely been superseded by newer algorithms, it remains a popular technique. DTW works by identifying the optimal (shortest) alignment distance between two speakers' utterances, thereby addressing differences in speaking rate. In general, speech recognition systems can be viewed in four stages: signal preprocessing, feature extraction, feature selection, and modeling.
Automatic speech recognition using deep neural networks has grown in popularity. This approach typically involves categorizing acoustic templates into pre-established classes. Numerous studies have shown that deep neural networks outperform conventional models at speech recognition. For example, the Microsoft Audio Video Indexing Service, a deep-learning-based speech system disclosed by Microsoft, showed a 30% reduction in word error rate (WER) on four benchmarks when compared to state-of-the-art models based on Gaussian mixtures (Samal, Jena, & Manjhi, 2019). Speech recognition is inherently a time-series problem.
Deep neural networks are constructed as a series of layers, each containing neurons that receive input from the preceding layer and perform a single computation. In feedforward neural networks the connections are unidirectional, so the outputs of one layer are passed only to the next layer. One drawback of feedforward networks is their inability to carry historical information forward. In addition, problems such as varying speaking rates and temporal dependencies frequently arise when deep neural networks are applied to speech recognition. By modeling a fixed sliding window of acoustic frames, deep neural networks can handle temporal dependencies to some extent, but they cannot accommodate varying speaking rates. Recurrent neural networks (RNNs) are a subset of artificial neural networks in which connections between nodes can form cycles: loops in the hidden layers store information from previous time steps, allowing the output of a node to influence its subsequent input and the prediction at the current step. This mechanism enables RNNs to cope with varying speaking rates (Shewalkar et al., 2019).
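To make this recurrence explicit, a simple RNN updates its hidden state at each time step from the current acoustic frame and the previous hidden state; the standard textbook update rule is shown below for illustration and is not taken from the cited studies: -
\begin{equation}
h_t=\tanh\left(W_{xh}x_t+W_{hh}h_{t-1}+b_h\right),\qquad y_t=W_{hy}h_t+b_y\nonumber \\
\end{equation}
where x_t is the acoustic feature vector at time step t, h_t is the hidden state carried over from step t-1, y_t is the output, and the W matrices and b vectors are learned parameters.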
Connectionist Temporal Classification (CTC) is an output layer and training criterion that removes the limitations of plain RNN training, which otherwise requires pre-segmented training data as well as post-processing of the output to convert it into labeled sequences. Long Short-Term Memory (LSTM) networks address the issue of long-term dependencies in the data, which makes them very effective. Gated Recurrent Unit (GRU) networks also work well with sequential data and likewise handle long-term dependencies effectively.
In recent years, transformer architectures based on self-attention mechanisms have demonstrated the ability to outperform Bidirectional Long Short-Term Memory (BLSTM) networks, giving good results in acoustic modeling (Wang et al., 2020). Transformers address gaps such as the limits on the length of speech signal over which long-range dependencies can be captured, and, unlike recurrent models, they allow parallelization.
Although integrated models have been developed to address the issues of accuracy and speed in speech recognition, insufficient attention has been paid to how a model can be integrated to address all the issues raised by the various algorithms proposed. In this study, we integrate BLSTM and recurrent neural networks to address the highlighted challenges, specifically the continuous speech input stream.
Contributions of this study: -
1. Through investigations of several architectures (LSTM, GRU, and RNN), this paper proposes an enhanced model, the Bidirectional LSTM Recurrent Neural Network (BLSTM-RNN).
2. The proposed model is demonstrated to be superior in terms of word error rate and accuracy on a continuous input stream without increasing the required bandwidth.
The paper is structured as follows. Section 2 discusses related work in the area of speech recognition. Section 3 discusses the proposed approach. Section 4 describes the datasets and experimental environment. Section 5 discusses the results. The conclusion and future work are provided in Section 6.
Literature Review
In this section, this research reviews work that has been documented on ways to improve and achieve better results in speech recognition. Hori et al. (2017) developed a model with a deep CNN encoder together with an RNN language model (RNN-LM), using advances in joint CTC-attention for speech recognition. This model achieved a 5-10% reduction in word error rate in comparison with traditional speech recognition systems. The reduction resulted from training the LSTM language model exclusively as well as combining the attention-based decoder predictions with the CTC predictions. The study recommended further improving the model by pre-training the RNN-LM on large quantities of unlabeled data.
Kumar et al. (2018) provided a survey of the various architectures involved in converting spoken words to text and the deep learning techniques used. Different models based on Convolutional Neural Networks (CNNs), Deep Belief Networks (DBNs), and Recurrent Neural Networks (RNNs) were discussed. Their findings show that these neural networks (DBNs and CNNs) replaced Gaussian mixtures because they perform better on large vocabularies. The survey recommended the use of one unified neural network to achieve end-to-end speech recognition. Its conclusions note that using RNNs in a hybrid deep neural network for acoustic modeling, with a language model and a CTC loss function, produced mixed results. To eliminate difficult processing stages and to learn rich representations directly from raw inputs, deep learning neural networks were recommended.
James et al. (2018) studied speech recognition with Long Short-Term Memory neural networks, specifically for controlling electronic devices. The developed model achieved a word error rate of 6.8%, compared with 11.9% for a baseline hybrid system. Using the LSTM's modeling capacity for direct speech processing and exploiting past information to reliably estimate speech parameters contributed to the better performance of the model. The results show the promising potential of LSTM as a technique for continuous speech recognition that can be used to control separate electronic devices and implemented as a hardware model.
Mokgonyane et al. (2019) built a machine learning system for automatic speech recognition using support vector machine (SVM), k-nearest neighbors (KNN), random forest (RF), and multi-layer perceptron (MLP) classifier models. The researchers used the Auto-WEKA data mining tool with optimal hyperparameters to find the best classifier, and 10-fold cross-validation was used to assess model performance. The experimental results demonstrated that the LSTM model performed better on both the training and test data than the compared CNN, DNN, and LSTM models combining numerous parameters; noise reduction was the most effective step, improving results from 31.23% to 25.89%. This work strongly influenced the present researchers' motivation to perform additional studies on recurrent neural networks and bidirectional long short-term memory.
Daneshfar & Kabudian (2020) created a speech emotion recognition model using a modified quantum-behaved particle swarm optimization (pQPSO) and discriminative dimension reduction. Applying the technique to emotional datasets revealed that the pQPSO algorithm outperformed the traditional QPSO and wQPSO algorithms in terms of accuracy, as well as deep neural networks, traditional dimensionality reduction techniques, and other state-of-the-art techniques on the same datasets. The pQPSO technique can also be used to uncover new emotion-related features and to optimize the parameters used for classification, the MFCC filter bank, the GMM, and dimensionality reduction via a transformation matrix.
Koduru et al. (2020) studied how to improve the recognition of emotions in speech using feature extraction algorithms. In their simulation results, decision trees achieved 85% accuracy, LDA 65%, and the support vector machine (SVM) 70%. The proposed system was able to extract more of the information needed to identify emotions and to represent the characteristics of the signal efficiently, thus achieving higher accuracy than other models.
Ho et al. (2020) developed a model on an open-source dataset (House Twenty) based on Dynamic Time Warping and transformers for time-series prediction. The prediction error rate was greatly reduced, from 45.70% with a support vector machine, a commonly used method for time-series prediction, to 27.79%. The research recommended that more experiments be conducted to compare with other methods.
Gulati et al. (2020) conducted a study on speech recognition using the convolution-augmented transformer known as the Conformer. The developed model achieved a word error rate (WER) of 2.1%/4.3% without a language model and 1.9%/3.9% with an external language model on the test/test-other data. Using a smaller model of about 10M parameters, a WER of 2.7%/6.3% was achieved.
Pawar & Kokate (2021) used Mel-frequency cepstral coefficients to build a convolutional neural network (CNN) for emotion recognition. In that study, the CNN architecture combined with deep learning offered the best classification accuracy by simplifying the speech signal with combined feature extraction algorithms such as pitch and energy, Mel-Frequency Cepstral Coefficients (MFCC), and Mel Energy Spectrum Dynamic Coefficients (MEDC), as well as selection techniques. CNN was shown to perform better and to deliver the best results in terms of ROC curves and AUC when compared with other approaches such as KNN classifiers. The study suggested using appropriate feature extraction and classification techniques for greater emotion recognition precision.
Methodology
The dataset
The LibriSpeech ASR corpus, which contains 1000 hours of recorded speech, served as the research dataset. It is freely accessible and designed for training acoustic and language models, and an improved language model in the most recent release is a key contributor to lower word error rate (WER) values. The corpus is split into training, validation, and test folders. It consists of 16 kHz audio recordings of spoken English, ranging in length from 2 to 15 seconds, extracted from audiobooks read aloud for the LibriVox project. To conduct this experiment, the audio files were first converted to single-channel (mono) WAV files (.wav extension) with a 64k bit rate and a 16 kHz sample rate, encoded in PCM format, and cut or padded to an equal length of 10 seconds. As part of pre-processing, any punctuation other than apostrophes was removed from the text transcriptions and all characters were converted to lowercase. There are 64,220 training examples in total.
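For illustration, the audio and transcript preparation described above could be implemented along the following lines; the file paths, the librosa/soundfile libraries, and the helper names are assumptions for this sketch rather than the scripts actually used in the study.

```python
# Hypothetical pre-processing sketch: convert audio to 16 kHz mono PCM WAV,
# cut/pad to 10 s, and normalize transcripts (lowercase, keep apostrophes only).
import re
import numpy as np
import librosa          # assumed available for loading/resampling audio
import soundfile as sf  # assumed available for writing PCM WAV files

SAMPLE_RATE = 16000
CLIP_SECONDS = 10
TARGET_LEN = SAMPLE_RATE * CLIP_SECONDS

def prepare_audio(in_path: str, out_path: str) -> None:
    # Load as mono and resample to 16 kHz.
    audio, _ = librosa.load(in_path, sr=SAMPLE_RATE, mono=True)
    # Cut or zero-pad to exactly 10 seconds.
    if len(audio) > TARGET_LEN:
        audio = audio[:TARGET_LEN]
    else:
        audio = np.pad(audio, (0, TARGET_LEN - len(audio)))
    # Write 16-bit PCM WAV.
    sf.write(out_path, audio, SAMPLE_RATE, subtype="PCM_16")

def prepare_transcript(text: str) -> str:
    # Lowercase and strip all punctuation except apostrophes.
    return re.sub(r"[^a-z' ]", "", text.lower()).strip()

# Example usage (paths are placeholders):
# prepare_audio("librispeech/sample.flac", "wav/sample.wav")
# print(prepare_transcript("GO, do that THING!"))  # -> "go do that thing"
```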
Experimental Set-up
Experimentation was done on a custom machine in the Google Cloud Compute Engine, accessed from a personal laptop computer. Python 3.6 from the Continuum Analytics Anaconda distribution was used to create a virtual environment, with the scikit-learn, TensorFlow, Keras, and GraphLab Create libraries configured to work in Jupyter Notebook. Jupyter Notebook was accessed through the browser, and all Python code was written and executed there. MATLAB software assisted with algorithm development.
Evaluation Measures
In speech recognition, there are two different types of performance or
evaluation measures, which are based on (1) accuracy, and (2) speed.
Evaluation measures based on accuracy include WER and mean edit
distance.
Word Error Rate (WER) is calculated as: -
\begin{equation}
WER=\frac{S+I+D}{N}\times 100\nonumber \\
\end{equation}
Where: -
S is the number of substitutions,
I is the number of insertions,
D is the number of deletions, and
N is the total number of words in the actual transcript.
The interpretation of WER is that the lower the WER, the better the speech recognition.
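For concreteness, WER can be computed from the word-level edit (Levenshtein) distance between the reference transcript and the recognizer's hypothesis; the following is a minimal sketch, with illustrative function and variable names that are not taken from the study's code.

```python
# Minimal WER sketch: count substitutions, insertions and deletions via
# dynamic programming over words, then divide by the reference length N.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    n, m = len(ref), len(hyp)
    # d[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words.
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i          # deletions
    for j in range(m + 1):
        d[0][j] = j          # insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1   # substitution cost
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # match / substitution
    return d[n][m] / n * 100  # (S + I + D) / N as a percentage

# Example: one substitution in a four-word reference gives a WER of 25%.
print(wer("the cat sat down", "the cat sat now"))  # 25.0
```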
Mean Edit Distance - a measure of how many changes must be made to one string to transform it into the string it is being compared to. Let the normalized edit distance between two words/strings A and B be d(A, B). The mean edit distance is calculated by: -
\begin{equation}
d\left(A,B\right)=min(\frac{W\left(P\right)}{N})\nonumber \\
\end{equation}
Where: -
P is an editing path between string A and string B,
W(P) is the total sum of the weights of all the edit operations in P, and
N is the total number of edit operations (the length of the editing path P).
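A character-level version of this normalized edit distance can be sketched as follows, reusing the same dynamic-programming idea as the WER computation above; unit weights per edit operation and normalization by the longer string length are assumptions, since the paper does not specify the weighting scheme.

```python
# Normalized edit distance sketch at character level: edit distance divided
# by the path length, approximated here by the longer string length under
# unit weights.
def normalized_edit_distance(a: str, b: str) -> float:
    n, m = len(a), len(b)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[n][m] / max(n, m, 1)

# The mean edit distance over a test set is the average of this value
# across all (reference, hypothesis) pairs.
print(normalized_edit_distance("speech", "speach"))  # ~0.167 (one substitution)
```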
Parameter setup
The following parameters were used to train the models: dropout rate = 30%, number of epochs = 20, training batch size = 16, test batch size = 8, activation function = ReLU, neuron count in hidden layers = 1,000, Adam optimizer with β1 = 0.9, β2 = 0.999, ε = 1e-8, and learning rate = 0.0001.
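For reference, a Keras configuration consistent with these hyperparameters might look like the sketch below. The exact layer stack, input feature size, output alphabet, and training loss are not fully specified in the paper, so they are assumptions here; the sketch simply pairs a bidirectional LSTM layer with a simple RNN layer under the reported settings.

```python
# Hypothetical BLSTM-RNN acoustic model sketch using the reported hyperparameters.
import tensorflow as tf
from tensorflow.keras import layers, models

N_FEATURES = 161   # assumed spectrogram features per frame (not from the paper)
N_CLASSES = 29     # assumed output alphabet: 26 letters + space + apostrophe + blank

model = models.Sequential([
    layers.Input(shape=(None, N_FEATURES)),                       # variable-length input
    layers.Bidirectional(layers.LSTM(1000, return_sequences=True)),
    layers.Dropout(0.3),                                           # 30% dropout
    layers.SimpleRNN(1000, return_sequences=True, activation="relu"),
    layers.Dropout(0.3),
    layers.TimeDistributed(layers.Dense(N_CLASSES, activation="softmax")),
])

optimizer = tf.keras.optimizers.Adam(
    learning_rate=0.0001, beta_1=0.9, beta_2=0.999, epsilon=1e-8
)
# Placeholder loss for the sketch; the paper does not state the loss used, and a
# CTC-style loss over the per-frame softmax outputs would be a typical choice.
model.compile(optimizer=optimizer, loss="categorical_crossentropy")
model.summary()
```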
Results and Analysis
The experiments were conducted on six models: RNN, GRU, LSTM, Bidirectional LSTM (BLSTM), LSTM with bias, and the Bidirectional LSTM-RNN (BLSTM-RNN) model. Each model was run with 1,000 nodes in each hidden layer.
Table 1.0 provides the WER results. RNN achieved 94.34% on the training dataset, 94.05% on the validation dataset, and 94.31% on the testing dataset. GRU achieved 29.95% on the training dataset, 94.05% on the validation dataset, and 94.31% on the testing dataset. LSTM achieved 28.30% on the training dataset, 31.76% on the validation dataset, and 32.82% on the testing dataset. BLSTM achieved 19.92% on the training dataset, 25.07% on the validation dataset, and 26.26% on the testing dataset. LSTM with bias initialized to one achieved 19.92% on the training dataset, 33.67% on the validation dataset, and 34.42% on the testing dataset. Bidirectional LSTM-RNN achieved 8.92% on the training dataset, 11.46% on the validation dataset, and 13.07% on the testing dataset, making the BLSTM-RNN model the best performer.
Table 1.1 provides the accuracy results. Simple RNN achieved 5.66% on the training dataset, 5.95% on the validation dataset, and 5.69% on the testing dataset. GRU achieved 70.05% on the training dataset, 66.37% on the validation dataset, and 65.58% on the testing dataset. LSTM achieved 71.7% on the training dataset, 68.24% on the validation dataset, and 67.18% on the testing dataset. BLSTM achieved 80.08% on the training dataset, 74.93% on the validation dataset, and 73.74% on the testing dataset. LSTM with bias initialized to one achieved 61.6% on the training dataset, 58.18% on the validation dataset, and 57.16% on the testing dataset. Bidirectional LSTM-RNN achieved 91.08% on the training dataset, 88.54% on the validation dataset, and 86.93% on the testing dataset, again scoring best. In terms of the mean edit distance, as shown in Table 1.2, the BLSTM-RNN model achieved the best value.
Table 1.0 Word Error Rate.