Speech Denoising with Residual Attention U-Net

We applied end-to-end speech denoising to remove background noise from noisy, monaural speech signals by directly processing the raw waveform. Recent approaches have demonstrated effective results using various deep neural network (DNN) architectures. We propose the residual attention U-Net, a U-Net whose stacked encoder and decoder layers are built from residual channel attention blocks and connected at matching depths. We evaluated the proposed method on an unseen test set for single-channel speech denoising. Both objective and subjective evaluations indicated that our proposed method was preferred over other speech denoising methods.


Introduction
In audio signal processing, various kinds of information, such as text (speech recognition), speaker identity (speaker verification), and emotion (emotion recognition), can be extracted from speech signals. The performance of these technologies degrades when the signal-to-noise ratio (SNR) is less than 20 dB (1,2). For high-accuracy information extraction, dedicated audio devices such as headsets, which capture speech at a high SNR, have typically been used. Requiring such devices limits the use cases. Therefore, speech denoising (3) (or speech enhancement) is required as a preprocessing step. Our goal is to develop monaural (single-channel) speech denoising technology that raises the SNR from the 5-10 dB typical of real environments to 20 dB or more. If this can be achieved, devices that are not dedicated to a single speaker, such as general-purpose microphones or conference microphones, can be used for audio information extraction, which extends its scope of application.
Because our research focuses on monaural speech denoising, multi-channel recording techniques such as microphone arrays (4) cannot be applied. Traditionally, spectral subtraction (5) and the Wiener filter (6) were popular algorithms that construct a noise reduction filter from the distribution and characteristics of the noise. Recently, denoising methods using deep neural networks (DNNs) have been proposed (7,8).
In terms of the features used, typical speech denoising methods estimate background noise from spectrogram features of the speech signal. The short-time Fourier transform (STFT) is used to convert speech signals to a spectrogram. The denoising operation has been applied to both the magnitude (9) and complex (10) components of a spectrogram. A disadvantage is that these approaches depend on several parameters (such as the STFT window size and the overlap between audio frames) that affect the time and frequency resolution. Moreover, signal artifacts arise due to time aliasing when using the inverse STFT.
To resolve these issues, end-to-end speech denoising methods that directly operate on raw waveforms have been proposed (11,12,13). These methods must handle long input sequences because speech signals are densely sampled. Therefore, the networks become deeper, and deeper networks are more difficult to train.
Our contributions are as follows. Based on the U-Net architecture (14), which connects layers of the same depth in stacked encoder/decoder models, we added a residual attention network to each encoder and decoder block. In a test on unseen data for single-channel speech denoising, we confirmed that our proposed method performed better than other speech denoising methods in both objective and subjective evaluations. Finally, we confirmed that speech denoising improves downstream information extraction such as speech recognition.
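The dependence on STFT parameters can be made concrete with a little arithmetic. The sketch below is only an illustration: the function name and the two window settings are assumptions chosen for the example, not values taken from the methods cited above.

```python
def stft_resolution(sample_rate, window_size, overlap):
    """Return (frequency resolution in Hz, time step between frames in seconds)."""
    if not 0 <= overlap < window_size:
        raise ValueError("overlap must satisfy 0 <= overlap < window_size")
    hop = window_size - overlap              # samples between successive frames
    freq_res_hz = sample_rate / window_size  # spacing between frequency bins
    time_step_s = hop / sample_rate          # spacing between spectrogram columns
    return freq_res_hz, time_step_s


# Hypothetical settings at a 16 kHz sampling rate with 50% overlap.
short_window = stft_resolution(16000, 256, 128)
long_window = stft_resolution(16000, 1024, 512)
print(short_window)  # coarser in frequency, finer in time
print(long_window)   # finer in frequency, coarser in time
```

A longer window narrows the frequency bins but widens the time step, so no single setting is best for both axes.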

Related Works
As speech signals are densely sampled (16,000 samples per second), end-to-end speech denoising methods need to deal with long input sequences. Therefore, to train on long sequences at high speed, architectures based on convolutional neural networks (CNNs) such as U-Net and dilated CNN (15) have been proposed.
The U-Net is an extended autoencoder that consists of an encoder network and a decoder network. Both are multilayered: each encoder layer halves the resolution of the features, and each decoder layer doubles it. The intermediate layers are linked by skip connections. Although the U-Net was originally used for semantic segmentation, it has also been used for illustration conversion (14) and extraction from aerial images (16). For speech denoising, SEGAN (11), which uses generative adversarial networks (18), has been proposed. Wave-U-Net (19) has been proposed for speech source separation.
Dilated CNN is an architecture based on stacked dilated convolutions, in which the filter is applied while skipping input values at a fixed step. A dilated convolution is equivalent to a convolution with a larger filter but fewer parameters, and stacking filters with exponentially increasing dilation allows the network to cover long speech signals. Denoising WaveNet (12) was developed as a speech denoising method based on the architecture of WaveNet (20), which can achieve high-quality text-to-speech results. Speech denoising with Deep Feature Loss (13) proposed optimizing the denoising network with a loss computed from different networks trained for acoustic environment detection and domestic audio tagging.
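The effect of dilation on the span of input seen by a network can be sketched as follows. This is a minimal illustration rather than the implementation of any cited method: the kernel size of 3, the ten layers, and the function names are assumptions made for the example.

```python
def dilated_conv1d(signal, kernel, dilation):
    """'Valid' 1-D dilated convolution in the cross-correlation form used by CNNs."""
    span = (len(kernel) - 1) * dilation  # distance between first and last tap
    return [
        sum(w * signal[t + i * dilation] for i, w in enumerate(kernel))
        for t in range(len(signal) - span)
    ]


def receptive_field(kernel_size, dilations):
    """Number of input samples seen by one output of a stack of dilated layers."""
    field = 1
    for d in dilations:
        field += (kernel_size - 1) * d
    return field


# A two-tap filter with dilation 2 skips every other input value.
skipped = dilated_conv1d([1, 2, 3, 4, 5, 6], [1, 1], dilation=2)

# Ten layers with kernel size 3: exponentially increasing vs. no dilation.
rf_dilated = receptive_field(3, [2 ** i for i in range(10)])
rf_plain = receptive_field(3, [1] * 10)
print(skipped, rf_dilated, rf_plain)
```

Both stacks use the same number of weights per layer, yet the exponentially dilated stack covers a far longer stretch of the waveform.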
The U-Net is a very simple network compared to dilated CNN. In the following sections, we describe the U-Net-based approach to improve the model performance.

Definitions
As discussed above, network optimization becomes difficult as the network becomes deeper. To solve this problem, we adopt a residual attention network (21,22) in a U-Net architecture to improve the performance of speech denoising.
By definition, speech signals and background noise are typically formulated as $M = S + E$, where $M \in [-1,1]^{T \times C}$ is a noisy speech signal, $S \in [-1,1]^{T \times C}$ is a clean speech signal, $E \in [-1,1]^{T \times C}$ is background noise, $T$ is the number of audio samples, and $C$ is the number of audio channels. As we are focusing on speech denoising in monaural audio, $C$ is set to 1. The goal of speech denoising is to estimate $S$ (and $E$), given $M$. Hereinafter, speech signals are represented as arrays of shape $(T, 1)$, where the first and second elements represent the temporal and channel axes, respectively. In our evaluation, we used speech signals with a sampling frequency of 16 kHz and set $T = 16384$, which corresponds to about one second of speech.
Figure 1 shows our denoising model, which follows the U-Net architecture (14,19). This model comprises an encoder, a decoder, and skip connections (the left block, right block, and horizontal arrows in Fig. 1). "Conv(c,k,s)" in Fig. 1 denotes a one-dimensional convolution with c filters (channels), kernel size k, and stride s. To prevent the size from changing after a convolution, zero padding is applied at the edges of the temporal axis. "Downsampling" in Fig. 1 denotes average pooling with size 2 along the temporal axis. "Upsampling" in Fig. 1 repeats each temporal step twice along the temporal axis. "Concatenate" concatenates two inputs along the channel axis. "Block Enc/Dec" in Fig. 1 denotes a group of convolutional operations.
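The "Downsampling", "Upsampling", and "Concatenate" operations can be sketched on small arrays. The code below is an illustrative sketch in plain Python, not the authors' implementation; a signal of shape (t, c) is represented as a list of t time steps, each a list of c channel values, and the function names are assumptions.

```python
def downsample(x):
    """Average pooling with size 2 along the temporal axis: (t, c) -> (t // 2, c)."""
    return [
        [(a + b) / 2.0 for a, b in zip(x[i], x[i + 1])]
        for i in range(0, len(x) - 1, 2)
    ]


def upsample(x):
    """Repeat every temporal step twice: (t, c) -> (2 * t, c)."""
    return [list(step) for step in x for _ in range(2)]


def concatenate(a, b):
    """Join arrays of shape (t, c1) and (t, c2) along the channel axis."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same temporal length")
    return [row_a + row_b for row_a, row_b in zip(a, b)]


x = [[1.0], [3.0], [5.0], [7.0]]   # shape (4, 1)
down = downsample(x)               # shape (2, 1)
up = upsample(down)                # shape (4, 1)
skip = concatenate(x, up)          # shape (4, 2), as on a skip connection
print(down, up, skip)
```

Because upsampling restores the temporal length removed by downsampling, encoder and decoder features of the same depth can be concatenated along the channel axis.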

The base architecture
Our model takes as input noisy speech signals, which are mixtures of clean speech signals and background noise and have the shape (16384, 1), and outputs denoised speech signals of the same shape. We also calculate the estimated background noise by subtracting the denoised speech signals from the noisy speech signals; this estimate is used in the loss function described in the next subsection.

Residual Attention block
Next, we describe the components of the encoder and decoder, which are shown in Fig. 2. Figures 2(a), 2(b), and 2(c) show the normal block, residual block, and residual attention block, respectively. The normal block is the usual U-Net block, with two convolutional operations followed by batch normalization and ReLU activation ("BatchNorm" and "ReLU" in Fig. 2). The residual block adds a residual path (23) to the normal block to mitigate vanishing gradients. We used a one-dimensional convolution with c filters and a kernel of size 1 as the residual path. The residual attention block adds channel-wise attention to a residual block so that channel-wise features are adaptively reweighted. "CA(r)" in Fig. 2(c) indicates the channel-wise attention block with parameter r. Figure 2(d) shows the channel-wise attention block when both the input and output have the shape (t, c). This architecture, known as a squeeze-and-excitation network (24), is composed of an information compression step and an information expansion step. "Global Pooling" in Fig. 2(d) outputs the channel-wise global temporal average ((t, c) → (1, c)). Convolutional operations compress and expand the channel information ((1, c) → (1, c/r) → (1, c)). The channel information is mapped to the range [0,1] by the sigmoid function ("Sigmoid" in Fig. 2(d)) and then multiplied channel-wise with the features. Parameter r in Figs. 2(c) and 2(d) is the reduction parameter, which influences the precision and computational complexity. Based on previous studies (24), we set r = 8.
Parameter c in Fig. 2 is the number of convolutional filters used in the block. Table 1 shows the relationship between blocks and parameters. With SEGAN, the number of convolutional filters increases multiplicatively with the depth of the block, which causes the network scale to increase. Thus, we used a value that increases incrementally with the depth of the block.
Our proposed model takes noisy speech signals as input and outputs denoised speech signals and background noise. We trained the model so that the estimated speech signals and background noise are close to the true clean speech signals and true background noise. Following previous works (12,13), we defined the loss function over both the speech and noise estimates, where $s_t$ is a clean speech signal, $e_t$ is background noise, $\hat{s}_t$ is a denoised speech signal, and $\hat{e}_t$ is estimated background noise at sampling time $t$.
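The data flow through the channel-wise attention block can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the authors' TensorFlow code: on a (1, c) input a size-1 convolution reduces to a matrix product, biases are omitted, the weights are random placeholders, and the ReLU between compression and expansion is borrowed from squeeze-and-excitation networks because the text does not name that activation.

```python
import math
import random


def channel_attention(x, w_down, w_up):
    """Apply channel-wise attention to x of shape (t, c).

    w_down has shape (c, c // r) and w_up has shape (c // r, c).
    """
    t, c = len(x), len(x[0])
    hidden_size = len(w_down[0])

    # Global pooling: channel-wise temporal average, (t, c) -> (1, c).
    pooled = [sum(step[ch] for step in x) / t for ch in range(c)]

    # Compression, (1, c) -> (1, c / r). The ReLU here is an assumption.
    hidden = [
        max(0.0, sum(pooled[i] * w_down[i][j] for i in range(c)))
        for j in range(hidden_size)
    ]

    # Expansion, (1, c / r) -> (1, c), then a sigmoid gate in [0, 1].
    gates = [
        1.0 / (1.0 + math.exp(-sum(hidden[j] * w_up[j][k] for j in range(hidden_size))))
        for k in range(c)
    ]

    # Channel-wise multiplication of every time step by its channel gate.
    return [[step[ch] * gates[ch] for ch in range(c)] for step in x]


random.seed(0)
t, c, r = 4, 16, 8  # r = 8 as in the text; t and c are illustrative
x = [[random.uniform(-1, 1) for _ in range(c)] for _ in range(t)]
w_down = [[random.uniform(-1, 1) for _ in range(c // r)] for _ in range(c)]
w_up = [[random.uniform(-1, 1) for _ in range(c)] for _ in range(c // r)]
y = channel_attention(x, w_down, w_up)
print(len(y), len(y[0]))  # the output keeps the input shape (t, c)
```

Since every gate lies in [0, 1], the block can only attenuate a channel, never amplify it; a larger r shrinks the hidden layer and therefore the number of attention weights.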
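One loss with this structure is sketched below. The exact formula is an assumption: the sketch uses a mean absolute error on the speech estimate plus a mean absolute error on the noise estimate, in the style of the loss used by Denoising WaveNet (12), and the function name and sample values are hypothetical.

```python
def denoising_loss(clean, noisy, denoised):
    """Assumed loss: mean |s_t - s_hat_t| plus mean |e_t - e_hat_t| over time."""
    if not (len(clean) == len(noisy) == len(denoised)) or not clean:
        raise ValueError("inputs must be non-empty and of equal length")
    # True and estimated background noise, both obtained by subtraction
    # from the noisy signal, as described in the text.
    true_noise = [m - s for m, s in zip(noisy, clean)]
    est_noise = [m - s_hat for m, s_hat in zip(noisy, denoised)]
    n = len(clean)
    speech_term = sum(abs(s - s_hat) for s, s_hat in zip(clean, denoised)) / n
    noise_term = sum(abs(e - e_hat) for e, e_hat in zip(true_noise, est_noise)) / n
    return speech_term + noise_term


clean = [0.0, 0.5, -0.5, 0.25]
noisy = [0.1, 0.4, -0.3, 0.25]
denoised = [0.0, 0.5, -0.4, 0.25]
loss = denoising_loss(clean, noisy, denoised)
perfect = denoising_loss(clean, noisy, clean)
print(loss, perfect)
```

When the noise estimate is derived by subtraction, the two terms are equal in value, so in this assumed form the noise term acts as a second view of the same error rather than an independent constraint.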

Dataset
We evaluated the effectiveness of our proposed method through both objective and subjective evaluations. We used a single-channel noisy dataset created by Valentini et al. (25). We chose this dataset because it is open to the public (26) and includes many types of noise for many different speakers, which fits the purposes of this work. The training set is generated from the speech data of 28 speakers (14 males and 14 females) from the VCTK Corpus (27) and ten types of background noise (two artificially generated noises, namely speech-shaped noise and babble noise, and eight real noise recordings from the DEMAND (28) database). Each noise is mixed at four signal-to-noise ratios (SNRs) of 0, 5, 10, and 15 dB. Therefore, the training set has 40 different noise conditions. The original dataset was sampled at 48 kHz; we resampled it to 16 kHz and converted it to monaural sound. The complete training set comprises 11,572 files of clean/noisy data. We loaded the audio files and divided them into segments of 16,384 samples. Of the total data, 90% is used for training and 10% for validation. We created the true background noise by subtracting the clean data from the noisy data. The test set contains two speakers (one male, one female) and five other noises from the DEMAND database. The noise types are "living room," "office space," "bus," "cafeteria," and "public square." None of these were used in the training set. The noises of the test set, which has 824 files of clean/noisy data, were mixed at SNRs of 2.5, 7.5, 12.5, and 17.5 dB. For denoising, the end of each noisy file was zero-padded so that its length was a multiple of 16,384 samples. The padded segments were fed to the model, and the padded portion was removed from the denoised output. Therefore, the number of samples in the denoised data is equal to that in the original.
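The padding and trimming applied at test time can be sketched as follows. The function names are hypothetical and the file length is an arbitrary example; only the segment length of 16,384 samples comes from the text.

```python
SEGMENT = 16384  # samples per model input, as in the text


def pad_and_segment(samples, segment=SEGMENT):
    """Zero-pad the end to a multiple of `segment`, then split into segments."""
    pad = (segment - len(samples) % segment) % segment
    padded = list(samples) + [0.0] * pad
    return [padded[i:i + segment] for i in range(0, len(padded), segment)]


def join_and_trim(segments, original_length):
    """Concatenate processed segments and drop the padded tail."""
    joined = [value for seg in segments for value in seg]
    return joined[:original_length]


noisy = [0.5] * 40000                 # hypothetical file of 2.5 s at 16 kHz
segments = pad_and_segment(noisy)     # three segments of 16,384 samples
# A real pipeline would run the model on each segment here; this sketch
# passes the segments through unchanged.
restored = join_and_trim(segments, len(noisy))
print(len(segments), len(restored))
```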

Experimental Setup
The proposed method was implemented in TensorFlow 1.14 (29). For training, we used the previously described loss function with the Adam optimizer and a learning rate of 0.001. The training batch size was 16, and the number of training epochs was set to 200.

Objective Evaluation
First, we evaluated the effectiveness of speech denoising by calculating the following objective measures (the higher, the better) provided in the references (3,9): perceptual evaluation of speech quality (PESQ), using the wide-band version recommended in ITU-T P.862.2; a predictor of signal distortion attending only to the speech signal (CSIG); a predictor of the intrusiveness of background noise (CBAK); a predictor of overall quality (COVL); segmental signal-to-noise ratio (SSNR); and SNR. SSNR and SNR are expressed in decibels.
Figure 3 shows the validation loss for each block type during training. The "Normal", "Residual", and "Proposal" labels indicate that the U-Net encoder/decoder blocks are those shown in Fig. 2(a), Fig. 2(b), and Fig. 2(c), respectively. The validation losses of "Residual" and "Proposal" are smaller than that of "Normal" because of the residual path in each block. Table 2 shows the objective measures for each block type. The "Block type" label in the table refers to the component of the U-Net encoder/decoder blocks. In these results, our proposal outperforms the other block types. We believe that combining the residual attention block with the U-Net leads to better performance.
Table 3 compares the objective measures with those of various conventional speech denoising methods. The "Noisy" label means that no denoising method was applied. The "Wiener" label refers to the results of the Wiener filter (6). These are the baselines. The "SEGAN" and "DFL" labels refer to the results of the end-to-end speech denoising methods SEGAN (11) and Deep Feature Loss (13), respectively. The models for these methods were trained on the same dataset described in the Dataset section. As shown in Table 3, our method outperformed all the other denoising methods. Moreover, our method improved the SNR to 19.2 dB. This value is close to the 20 dB at which information extraction technologies perform well.
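The two decibel-valued measures can be sketched directly from their definitions. The code below is illustrative and not the evaluation script used for the tables: the frame length of 256 samples and the clipping of each frame to the range from -10 to 35 dB are assumed conventions, and the sine-wave signals are synthetic.

```python
import math


def snr_db(clean, estimate):
    """Overall SNR in dB: clean-signal energy over error energy."""
    signal = sum(s * s for s in clean)
    noise = sum((s - y) ** 2 for s, y in zip(clean, estimate))
    if noise == 0.0:
        return math.inf
    if signal == 0.0:
        return -math.inf
    return 10.0 * math.log10(signal / noise)


def segmental_snr_db(clean, estimate, frame=256, lo=-10.0, hi=35.0, eps=1e-10):
    """Mean of per-frame SNRs in dB, each clipped to [lo, hi] (assumed values)."""
    values = []
    for start in range(0, len(clean) - frame + 1, frame):
        s = clean[start:start + frame]
        y = estimate[start:start + frame]
        signal = sum(v * v for v in s)
        noise = sum((a - b) ** 2 for a, b in zip(s, y))
        value = 10.0 * math.log10((signal + eps) / (noise + eps))
        values.append(min(hi, max(lo, value)))
    if not values:
        raise ValueError("signal is shorter than one frame")
    return sum(values) / len(values)


clean = [math.sin(0.1 * n) for n in range(1024)]
estimate = [0.9 * s for s in clean]  # synthetic 10% amplitude error
print(snr_db(clean, estimate), segmental_snr_db(clean, estimate))
```

Averaging clipped per-frame values keeps a few silent or very clean frames from dominating the result, which is why SSNR and SNR can differ on real recordings even though they agree on this synthetic example.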

Subjective Evaluation
Second, a subjective evaluation was carried out with 32 native English listeners. Twenty samples were randomly selected from the test set with SNRs of 2.5 and 7.5 dB. Listeners were presented with two samples: the noisy speech and the speech denoised by each of the denoising methods listed in Table 3. The listeners rated the overall quality of the enhanced speech on a scale from 1 to 5, with 1 being described as "BAD: degraded speech with very intrusive noise" and 5 being "EXCELLENT: very natural speech with no degradation or noticeable noise". Listeners could listen to each sample as many times as they wanted, and they were asked to focus on both signal distortion and noise intrusiveness. Table 4 summarizes the results as mean opinion scores (MOS) averaged over all listeners. Our proposed method had the highest score among the compared methods.

Improvement of speech recognition
Finally, we evaluated the improvement in speech recognition accuracy brought by speech denoising. Our intent here was to determine whether the denoised speech signals closely resemble the clean speech signals. As the accuracy of speech recognition depends highly on the acoustic model, we defined the recovery rate $R$ from the word error rate (WER) of the speech recognition as follows: $R = (W_n - W_d) / (W_n - W_c)$, where $W_c$, $W_n$, and $W_d$ denote the WER when using clean, noisy, and denoised speech signals, respectively. If $R$ is close to 1.0, the denoised speech signals are nearly equal to the clean speech signals, and if $R$ is 0.0 or less, the speech denoising method does not affect or worsens the results. We used two publicly available speech recognition engines, Julius (30) and DeepSpeech (31). For Julius, we used version 4.4 (32) and the DNN English speech models for the acoustic model (33). For DeepSpeech, we used version 0.4.1 (34) and the same version of the English model. For each denoising method, we ran speech recognition on the 824 noisy samples and calculated the recovery rate from the recognized text. Table 5 shows the recovery rate $R$ achieved by speech denoising when using Julius and DeepSpeech. As observed, all the speech denoising methods except "Wiener" were effective in improving the accuracy of speech recognition, as their rates were greater than 0. Comparing all methods, we can see that our proposed method outperformed "DFL", which had the best performance among the conventional methods.
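The recovery rate can be sketched in code by taking R as the fraction of the noisy-to-clean WER gap that denoising closes, which is the form consistent with the behavior described above (R near 1 when the denoised WER matches the clean WER, R at or below 0 when denoising does not help). The function name and the WER values are hypothetical and are not results from Table 5.

```python
def recovery_rate(wer_clean, wer_noisy, wer_denoised):
    """Fraction of the WER gap between noisy and clean speech closed by denoising."""
    gap = wer_noisy - wer_clean
    if gap == 0:
        raise ValueError("noisy and clean WER are equal; R is undefined")
    return (wer_noisy - wer_denoised) / gap


# Hypothetical WERs for illustration only.
r_good = recovery_rate(wer_clean=0.10, wer_noisy=0.40, wer_denoised=0.16)
r_none = recovery_rate(wer_clean=0.10, wer_noisy=0.40, wer_denoised=0.40)
r_worse = recovery_rate(wer_clean=0.10, wer_noisy=0.40, wer_denoised=0.50)
print(r_good, r_none, r_worse)
```

Normalizing by the noisy-to-clean gap makes results from engines with different baseline accuracy, such as Julius and DeepSpeech, comparable on one scale.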

Conclusions
In this work, we applied a residual attention network to a U-Net architecture for end-to-end speech denoising. We conducted both objective and subjective evaluations and found that our proposed method outperformed conventional methods. Future work will include further improvements such as adopting generative adversarial networks and more effective loss functions.