Figure 5. Signal processing of the REAL audio. The raw signal from the APD is processed sequentially by detrending, bandpass filtering, denoising, and normalization.
REAL audio neural network model
The overall network design combines CNNs and an LSTM to learn meaningful time-frequency information from laser spectrograms and estimate the recovered audio. CNNs and LSTMs have been used extensively in speech processing, especially for speech enhancement and source separation.[33-35] This section describes the full details of our laser audio processing network architecture.
The recorded dataset contained laser audios (as training and testing sets) and the corresponding microphone audios as ground truth for supervised learning. Time synchronization must be ensured for each pair of recorded trials. In preprocessing, all audios are converted to single-channel with a sampling rate of 16 kHz, and the temporal sequences of each audio pair (REAL and microphone) are aligned with MATLAB,[36] which estimates the time delay that maximizes the cross-correlation between the two signals. The synchronized audios are then divided into 3-second segments, and each aligned pair of segments is regarded as a sample instance of the final dataset. We discard all segments with utterances shorter than 1 second. After filtering, 924 valid sample instances are obtained; the dataset is then randomly shuffled and divided into training and testing subsets with a ratio of 7:3. Note that the REAL signals in this dataset do not require spectral subtraction, since the network itself is capable of denoising.
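As an illustration, the alignment, segmentation, and split described above can be sketched in Python with NumPy/SciPy (the original alignment was done in MATLAB; the energy threshold used here to approximate the 1-second utterance check is our own assumption, not a value from this work):

```python
import numpy as np
from scipy.signal import correlate

SR = 16_000          # target sampling rate (Hz)
SEG_LEN = 3 * SR     # 3-second segments

def align(laser, mic):
    """Shift the signals so their cross-correlation peaks at zero lag."""
    lag = int(np.argmax(correlate(mic, laser, mode="full"))) - (len(laser) - 1)
    if lag >= 0:
        mic = mic[lag:]
    else:
        laser = laser[-lag:]
    n = min(len(laser), len(mic))
    return laser[:n], mic[:n]

def segment_pairs(laser, mic, min_voiced=SR, energy_thresh=0.02):
    """Cut an aligned pair into 3-s pieces, keeping pairs with >= 1 s of utterance."""
    pairs = []
    for start in range(0, len(laser) - SEG_LEN + 1, SEG_LEN):
        l_seg = laser[start:start + SEG_LEN]
        m_seg = mic[start:start + SEG_LEN]
        # crude voiced-duration proxy: count samples above an energy threshold (assumption)
        if np.sum(np.abs(m_seg) > energy_thresh) >= min_voiced:
            pairs.append((l_seg, m_seg))
    return pairs

def train_test_split(pairs, ratio=0.7, seed=0):
    """Randomly shuffle sample instances and split them 7:3."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(pairs))
    cut = int(ratio * len(pairs))
    return [pairs[i] for i in idx[:cut]], [pairs[i] for i in idx[cut:]]
```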
We used a modified version of the VoiceFilter architecture[34] as the base of our network (Figure 4c). The speaker voice embedding was removed from the original architecture because it was designed to separate the voice of a speaker of interest from several concurrent speakers, whereas our aim is to recover the original audio from the laser signal. The soft mask prediction of the earlier work is removed for the same reason. The network is trained to minimize the difference between the estimated magnitude spectrogram and the target magnitude spectrogram; the phase of the estimated spectrogram is taken directly from the laser audio. The complete model is composed of 8 convolutional layers, 1 LSTM layer, and 1 fully connected layer, each with a ReLU activation except the last layer, which has a sigmoid activation. The feature maps output by the convolutional layers are concatenated along the frequency axis, and the concatenated feature map is fed to the LSTM layer frame by frame in temporal order. To reduce overfitting, a dropout layer with a 20% drop rate is added between the convolutional blocks. We apply an initial learning rate of 10⁻³, decreased by a factor of 10 every 100 epochs. The model is implemented with the PyTorch deep learning framework (https://pytorch.org/).[37] The Adam optimizer in PyTorch is used to train the model by minimizing the cross-entropy loss.
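A minimal PyTorch sketch consistent with this description is shown below; the kernel sizes, channel counts, LSTM hidden size, spectrogram dimension (n_freq), and the exact placement of dropout are illustrative assumptions rather than values reported here:

```python
import torch
import torch.nn as nn

class LaserAudioNet(nn.Module):
    """Sketch of the modified VoiceFilter-style network: 8 conv layers, 1 LSTM, 1 FC."""

    def __init__(self, n_freq=601, cnn_channels=64, lstm_hidden=400):
        super().__init__()
        layers, in_ch = [], 1
        for _ in range(8):                              # 8 convolutional layers with ReLU
            layers += [nn.Conv2d(in_ch, cnn_channels, kernel_size=5, padding=2),
                       nn.BatchNorm2d(cnn_channels),
                       nn.ReLU(),
                       nn.Dropout(0.2)]                 # 20% dropout between conv blocks
            in_ch = cnn_channels
        self.cnn = nn.Sequential(*layers)
        # conv feature maps are concatenated along the frequency axis before the LSTM
        self.lstm = nn.LSTM(cnn_channels * n_freq, lstm_hidden, batch_first=True)
        self.fc = nn.Linear(lstm_hidden, n_freq)        # final layer, sigmoid activation

    def forward(self, mag):                             # mag: (batch, time, freq) magnitudes
        x = mag.unsqueeze(1)                            # -> (batch, 1, time, freq)
        x = self.cnn(x)                                 # -> (batch, C, time, freq)
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)  # stack channels along frequency
        x, _ = self.lstm(x)                             # process frames in temporal order
        return torch.sigmoid(self.fc(x))                # estimated magnitude spectrogram

# Training setup described in the text. Interpreting "cross-entropy loss" as binary
# cross-entropy between magnitude spectrograms normalized to [0, 1] is our reading.
model = LaserAudioNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=100, gamma=0.1)
criterion = nn.BCELoss()
```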
To evaluate the performance of different audio enhancement models, we use two metrics: the source-to-distortion ratio (SDR) and short-time objective intelligibility (STOI). SDR is a common metric for evaluating speech enhancement performance[10,34,38] and is typically expressed as an energy ratio (in dB) between the target signal and the total error from interference, noise, and artifacts.[34] Intelligibility is another indicator of effectiveness in sound perception, and STOI is the state-of-the-art intelligibility metric.[5] STOI uses a discrete-Fourier-transform-based time-frequency decomposition to measure the correlation between the short-time temporal envelopes of a clean utterance and a separated utterance,[30] and its value can be interpreted as a percent-correct score between 0 and 1. On both metrics, the recovered REAL signal improves significantly after enhancement, making REAL signals captured from the throat intelligible.
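For reference, both metrics can be computed as sketched below, using a simplified projection-based SDR (rather than the full BSS Eval decomposition) and the open-source pystoi package; neither is necessarily the exact toolkit used in this work:

```python
import numpy as np
from pystoi import stoi  # pip install pystoi

def sdr_db(reference, estimate, eps=1e-8):
    """Simplified SDR (dB): energy of the component of the estimate aligned with
    the reference (target) over the energy of the residual (distortion)."""
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    s_target = alpha * reference
    distortion = estimate - s_target
    return 10 * np.log10((np.sum(s_target**2) + eps) / (np.sum(distortion**2) + eps))

# Usage on a 16 kHz clean/enhanced pair of equal length:
# sdr_value = sdr_db(clean, enhanced)
# stoi_value = stoi(clean, enhanced, 16000, extended=False)  # ~0..1, higher is better
```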