Blind Source Separation and Human Speech Extraction for Three Sound Sources Using Silent Interval

In this paper, we propose a blind source separation method for multiple sound source signals. We have previously proposed a method for estimating noise-free source signals from observed mixture signals. That method estimates the rotation angle of the distribution of the observed mixture signals and rotates the distribution by the estimated angle. It can estimate two source signals from two mixture signals. In the case of three source signals, however, the sources cannot be estimated in a single pass. Therefore, we apply the processing to the observed mixture signals multiple times. In this way, we propose a method for estimating the original sources.


Introduction
We live in an environment filled with a variety of sounds, and we can recognize target speech among many of them. However, it is hard for a machine to recognize target sounds in such noisy places. Recently, many methods for separating noise from target sounds have been proposed. Blind source separation (BSS) is a method for estimating source signals from observed mixture signals without information about the sources or the transfer functions. For BSS, independent component analysis (ICA) is well known (1,2). When the sources are statistically independent of each other, ICA can separate unknown sources from mixtures.
However, ICA takes a long time to separate the source signals from the mixture signals, so it is not well suited to real-time processing. Furthermore, since ICA is a separation algorithm, an additional target extraction step is needed.
We have already proposed a rotation BSS method for two sound source signals (3,4). In the case of two sound sources, that method can estimate the source signals. In this paper, we propose a new BSS method that can be applied even when the number of original source signals is three. The proposed method proceeds as follows. First, we create a joint distribution and a histogram from the mixture signals observed by a pair of microphones. Second, the rotation angle of the distribution is estimated. A noise-removed mixture signal is then estimated by rotating the distribution by this angle. In the same way, we estimate another noise-removed mixture signal from a different pair of microphones. This yields mixture signals of two sources with the third source removed. Finally, we apply the same processing again to the two noise-removed mixture signals. As a consequence, the source signal can be estimated. The proposed method outputs the human speech, which means that it can both separate and extract the human speech.
To verify the proposed method, several simulations were carried out in which the source signals were the speech of two speakers and one stationary noise. The simulation results show that the proposed method removes the noise from the mixture signals in the first processing stage, because the speech of the two speakers includes silent intervals. In addition, the source signals can be obtained by reprocessing the noise-free signals. These results lead to the conclusion that the proposed method can extract the human speech.

Rotation BSS
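The observed mixture signals $x_m(t)$ ($m = 1, \dots, M$) are expressed as
$$x_m(t) = \sum_{n=1}^{N} a_{mn} s_n(t) \qquad (1)$$
where $s_n(t)$ denotes the $n$-th source signal, $a_{mn}$ denotes an unknown mixing parameter, and $N$ and $M$ denote the numbers of sources and microphones, respectively. The estimated signals $y_n(t)$ for the sources are expressed as
$$y_n(t) = \sum_{m=1}^{M} w_{nm} x_m(t) \qquad (2)$$
where $w_{nm}$ denotes an estimated separating parameter. Many methods have been proposed to estimate $w_{nm}$ (1,2).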
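A minimal sketch of the mixing and separation model, assuming an instantaneous (non-convolutive) mixture in which each observed signal is a weighted sum of the sources. The function names `mix` and `unmix` and all numerical values are hypothetical and serve only as an illustration.

```python
import numpy as np

def mix(A, S):
    """Instantaneous mixture: row m of the result is x_m(t) = sum_n a_mn * s_n(t)."""
    return np.asarray(A) @ np.asarray(S)

def unmix(W, X):
    """Separation: row n of the result is y_n(t) = sum_m w_nm * x_m(t)."""
    return np.asarray(W) @ np.asarray(X)

# Toy example with N = M = 2 and four time samples (illustrative values only).
A = np.array([[1.0, 0.8],
              [0.6, 1.0]])
S = np.array([[0.0, 1.0, -1.0, 0.5],
              [0.2, -0.3, 0.1, 0.4]])
X = mix(A, S)

# If A were known, W = A^{-1} would recover S exactly.
# A BSS method has to estimate W from X alone.
Y = unmix(np.linalg.inv(A), X)
```

In the blind setting neither `A` nor `S` is available, so the inverse used here only marks the ideal result that an estimated separating matrix should approach.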
To separate and extract target human speech, we have already proposed a rotation BSS (3,4). The rotation BSS is based on rotating the distribution of the observed signals.
Consider the case in which a human utterance and noise exist, as shown in Fig. 1(a), and the mixture signals shown in Fig. 1(b) are observed by two microphones.
Fig. 1: Human speech and car noise
The directional histogram of all $\phi(t)$, shown in Fig. 2(b), indicates that the components of the joint distribution are concentrated in one direction. The reason is that human speech contains silent intervals, and only noise is observed during those intervals.
From the above discussion, we define $\theta$ as the mode (the most frequent value) of $\phi(t)$. Based on $\theta$, the signals are rotated by the rotation matrix, which gives $y_1(t)$ and $y_2(t)$ in Eq. (5).
Fig. 2: Joint distribution and its directional histogram
To extract the target human speech $y(t)$, we calculate Eq. (6), which retains the $y_2(t)$ component, because $y_1(t)$ and $y_2(t)$ in Eq. (5) are the estimated noise and the estimated target signal, respectively.
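The two-source procedure can be sketched as follows. This is a sketch under stated assumptions rather than the exact published formulation: the direction of each sample is taken as $\phi(t) = \arctan(x_2(t)/x_1(t))$ folded into $[0, \pi)$, the rotation uses one particular sign convention, and the histogram uses 360 bins. The function name `rotation_bss` and the synthetic signals (gated Gaussian "speech" and stationary Gaussian "noise") are hypothetical.

```python
import numpy as np

def rotation_bss(x1, x2, bins=360):
    """Estimate the dominant direction of (x1, x2) and rotate it onto the first axis."""
    # Direction of every sample, folded to [0, pi) so that opposite signs coincide.
    phi = np.mod(np.arctan2(x2, x1), np.pi)
    counts, edges = np.histogram(phi, bins=bins, range=(0.0, np.pi))
    k = int(np.argmax(counts))
    theta = 0.5 * (edges[k] + edges[k + 1])   # mode of phi (bin centre)
    c, s = np.cos(theta), np.sin(theta)
    y1 = c * x1 + s * x2     # component along the dominant (noise) direction
    y2 = -s * x1 + c * x2    # component orthogonal to it (target estimate)
    return theta, y1, y2

# Synthetic data: the "speech" is silent in every other 2000-sample block,
# starting with silence; the "noise" is always present.
rng = np.random.default_rng(0)
n = 20000
gate = (np.arange(n) // 2000) % 2
speech = rng.standard_normal(n) * gate
noise = 0.5 * rng.standard_normal(n)
x1 = 1.0 * speech + 0.8 * noise
x2 = 0.6 * speech + 1.0 * noise

theta, y_noise, y_speech = rotation_bss(x1, x2)
```

During the silent blocks every sample lies exactly on the noise direction, so that direction dominates the histogram, and the component orthogonal to it is proportional to the speech up to the bin resolution. The sign and scale of `y_speech` are arbitrary in this sketch.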

Rotation BSS for Three Source Signals
In this section, we describe an estimation method for source signals based on the rotation BSS in the case of three sources.
The components of a source signal without silent intervals appear as a peak in the histogram. In other words, the peak of the histogram represents the noise components.
We define the ratios $r_1$ and $r_2$ of the observed signals as $r_1 = x_2(t)/x_1(t)$ and $r_2 = x_3(t)/x_1(t)$. Eqs. (12) and (13) can be rewritten as follows.
The separated signal is estimated by Eqs. (3), (4) and (6) using the estimated signals $y_1(t)$ and $y_2(t)$. When $s_3(t)$ has more silent intervals than $s_2(t)$, the separated signal is an estimate of $s_3(t)$. When $s_2(t)$ has more silent intervals than $s_3(t)$, $s_2(t)$ is estimated by Eq. (16).
Next, consider the case in which $s_2(t)$ has fewer silent intervals than $s_1(t)$ and $s_3(t)$. Then $r_1$ becomes $a_{22}/a_{12}$ and $r_2$ becomes $a_{32}/a_{12}$. In the same way, we can calculate $y_1(t)$ and $y_2(t)$, where $l_4 = \sqrt{a_{12}^2 + a_{22}^2}$ and $l_5 = \sqrt{a_{12}^2 + a_{32}^2}$.
In the same way, the final separated signal is obtained for the case in which $s_3(t)$ has more silent intervals than $s_1(t)$, where $l_6 = \sqrt{\beta_{11}^2 + \beta_{21}^2}$. When $s_1(t)$ has more silent intervals than $s_3(t)$, $s_1(t)$ is estimated.
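As a worked check of this second case, the following derivation expands $y_1(t)$ and $y_2(t)$. It assumes that the rotation angle satisfies $\tan\theta = r$ and that the rotated output has the form $-x_1\sin\theta + x_j\cos\theta$. The sign convention and the reading of $\beta_{11}$ and $\beta_{21}$ as the $s_1(t)$ coefficients are assumptions, not taken from the original equations.

```latex
% Assumption: \tan\theta_1 = r_1 = a_{22}/a_{12}, hence
% \cos\theta_1 = a_{12}/l_4 and \sin\theta_1 = a_{22}/l_4 (likewise for r_2 and l_5).
\begin{align*}
y_1(t) &= \frac{a_{12}\,x_2(t) - a_{22}\,x_1(t)}{l_4}
        = \frac{(a_{12}a_{21} - a_{22}a_{11})\,s_1(t) + (a_{12}a_{23} - a_{22}a_{13})\,s_3(t)}{l_4}, \\
y_2(t) &= \frac{a_{12}\,x_3(t) - a_{32}\,x_1(t)}{l_5}
        = \frac{(a_{12}a_{31} - a_{32}a_{11})\,s_1(t) + (a_{12}a_{33} - a_{32}a_{13})\,s_3(t)}{l_5}.
\end{align*}
```

The $s_2(t)$ terms cancel in both expressions, so $y_1(t)$ and $y_2(t)$ form a two-source mixture of $s_1(t)$ and $s_3(t)$ to which the two-source rotation BSS can be applied again. Under the assumed reading, $\beta_{11}$ and $\beta_{21}$ are the $s_1(t)$ coefficients of $y_1(t)$ and $y_2(t)$, and $l_6$ plays the same normalizing role in the second rotation that $l_4$ and $l_5$ play in the first.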
Finally, consider the case of $r_1 = a_{23}/a_{13}$ and $r_2 = a_{33}/a_{13}$. Similarly, we can calculate $y_1(t)$ and $y_2(t)$. In this case, when $s_2(t)$ has more silent intervals than $s_1(t)$, the separated signal is an estimate of $s_2(t)$.
Judging from the above, the estimated signals $y(t)$ are the human source signals that contain silent intervals. Since the proposed method removes the sound sources one by one, it does not need to be extended to a multidimensional space when the number of sources increases.
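The one-by-one removal of sources can be sketched as a cascade of two-source rotations. This is a sketch under stated assumptions: the sample direction is $\arctan$ of the signal ratio folded into $[0, \pi)$, the histogram uses 360 bins, and near-zero samples are skipped when building the histogram (a robustness step added here, not part of the original description). The names `remove_dominant` and `separate_three`, the mixing matrix, and the synthetic signals are hypothetical.

```python
import numpy as np

def remove_dominant(xa, xb, bins=360, min_rel_amp=0.1):
    """Cancel the most frequent direction of the (xa, xb) joint distribution."""
    r = np.hypot(xa, xb)
    # Assumption: ignore near-zero samples, whose direction is unreliable.
    keep = r > min_rel_amp * np.sqrt(np.mean(r ** 2))
    phi = np.mod(np.arctan2(xb[keep], xa[keep]), np.pi)
    counts, edges = np.histogram(phi, bins=bins, range=(0.0, np.pi))
    k = int(np.argmax(counts))
    theta = 0.5 * (edges[k] + edges[k + 1])   # mode of the directions
    # Component orthogonal to the dominant direction.
    return -np.sin(theta) * xa + np.cos(theta) * xb

def separate_three(x1, x2, x3):
    y1 = remove_dominant(x1, x2)     # first pass on microphones 1 and 2
    y2 = remove_dominant(x1, x3)     # first pass on microphones 1 and 3
    return remove_dominant(y1, y2)   # second pass on the two intermediate signals

# Synthetic sources: s1 and s2 are gated ("speech"), s3 is stationary noise.
rng = np.random.default_rng(1)
n = 20000
t = np.arange(n)
gate1 = ((t >= 3000) & (t < 11000)) | (t >= 14000)   # s1 active 70% of the time
gate2 = t >= 11000                                    # s2 active 45% of the time
s1 = rng.standard_normal(n) * gate1
s2 = rng.standard_normal(n) * gate2
s3 = 0.5 * rng.standard_normal(n)

A = np.array([[1.0, 0.3, 0.6],
              [0.2, 1.0, 0.5],
              [0.3, 0.4, 0.7]])
x1, x2, x3 = A @ np.vstack([s1, s2, s3])

y = separate_three(x1, x2, x3)
```

In this example the first pass removes the always-active noise `s3`, because its direction dominates while both speakers are silent. The second pass then removes `s1`, which has fewer silent intervals, so `y` is proportional to `s2` up to sign, scale, and the bin resolution.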

Simulation
To verify the proposed method, several simulations were carried out. Two sources were speaker speech data (5) and one source was car noise (6).
The source signals and the mixture signals are shown in Fig. 3(a) and (b), respectively. The separated signal $y_1(t)$ obtained from $x_1(t)$ and $x_2(t)$ by the proposed method is shown in Fig. 3(c).
The noise is removed in this waveform. In the same way, the separated signal $y_2(t)$ obtained from $x_1(t)$ and $x_3(t)$ is shown in Fig. 3(d). In this case as well, the noise is removed in the waveform. In addition, Fig. 3(e) shows the waveform obtained using the estimated signals (c) and (d). This figure shows that x

Conclusion
We have proposed a new BSS method for three sound sources.
This method is based on the rotation BSS for two sound sources.
The proposed method can extract the target speaker's speech from three sound sources. Furthermore, it does not need to be extended to a multidimensional space when the number of sources increases. Moreover, the source signals can be estimated by repeating simple processing. These results lead to the conclusion that the proposed method can estimate the source signal when the number of sound sources increases. In future research, we will apply the proposed method to cases with more sound sources.

Fig. 3: Simulation results

The ratio $r_1$ expands to
$$r_1 = \frac{a_{21} s_1(t) + a_{22} s_2(t) + a_{23} s_3(t)}{a_{11} s_1(t) + a_{12} s_2(t) + a_{13} s_3(t)}$$
First, consider the case in which $s_1(t)$ has fewer silent intervals than $s_2(t)$ and $s_3(t)$. The ratios are then estimated as $r_1 = a_{21}/a_{11}$ and $r_2 = a_{31}/a_{11}$. Under this condition, the estimated signal $y_1(t)$ is generated without the source signal $s_1(t)$.