Fractional Lower-order Statistics for Yangzhou Dialectal Speech Recognition

With the arrival of the information age built on digital technology, people increasingly rely on machines of many kinds to receive, process, and transfer information. As computers become widely used, natural communication between people and machines without a keyboard or mouse is becoming a reality. The multimedia era requires speech recognition systems to move from the laboratory into practical use. Isolated-word speech recognition systems can bring benefits to people in daily life. However, because of ambient noise, both Gaussian and non-Gaussian, the performance of isolated-word speech recognition products struggles to meet demand. Even though isolated-word recognition systems are quite mature, many problems remain and many aspects need improvement. This paper focuses on two problems of isolated-word speech recognition systems: 1) Preprocessing in noisy environments. Researchers generally consider Gaussian noise, but in everyday life non-Gaussian noise cannot be neglected. Handling both makes good endpoint detection possible. Studies have shown that, in a speech system using an isolated-word recognizer, more than 50% of the error rate is attributable to the endpoint detector. 2) The Yangzhou dialect. Recognizing Yangzhou speech through speech input and establishing a commonly usable model is practical for information exchange between dialects and for speech recognition.


Introduction
The world has a variety of natural languages, and most natural languages contain many dialects. The differences between natural languages are considerable. Some dialects differ not only in pronunciation but also in their writing systems (1).
With the development of information and communication technologies (ICTs), many speech recognition and synthesis systems have been designed in the last decade. Most of these are designed for languages with large speaker populations. Speech recognition systems and speech models for languages or dialects with smaller speaker populations remain rare. If speech input is used to recognize the Yangzhou dialect and to establish speech models, information can be exchanged automatically between dialects, which brings speech recognition technology closer to practical use. Consequently, in this paper we address this need and propose a novel isolated-word recognition system that copes with non-Gaussian noise.
The rest of the paper is organized as follows. Section 2 covers endpoint detection, where we show that our new method supports good word recognition. Robust feature extraction is described in Section 3, where we use standard Mel Frequency Cepstral Coefficients (MFCC). Section 4 applies a dynamic programming technique for time normalization and alignment. Finally, Section 5 summarizes our study and draws conclusions.

Endpoint Detection
Fig. 1: Preprocessing of the voice signal.
Endpoint detection, which aims to distinguish speech from non-speech segments in a speech signal, is considered one of the most important preprocessing steps in speech recognition. Correct selection of the endpoints improves both the accuracy and the speed of speech recognition systems. Research has shown that even small endpoint errors often result in a relatively significant degradation in digit accuracy. In practice, one or more problems usually make accurate endpoint detection difficult. One particular class of problems is background noise (2).
To solve the problem of speech endpoint detection in real-world noisy environments, a new robust feature intended for speech endpoint detection is proposed. We consider both Gaussian white noise and non-Gaussian noise, and we use a low-pass filter together with an adaptive filter based on the fractional lower-order lp-norm to process these kinds of noise.

Short-time Average Energy and Zero-Crossing Rate
Short-time average energy and zero-crossing rate are used to distinguish unvoiced speech from voiced speech. The amplitude of unvoiced speech is smaller than that of voiced speech, and its energy is clearly lower. We can therefore distinguish unvoiced from voiced speech with an energy function.
However, in many cases the energy function alone cannot separate them clearly. The zero-crossing rate can also be used for this purpose. When unvoiced sounds are produced, the waveform crosses the zero line more often. We can therefore also distinguish unvoiced from voiced speech by the zero-crossing rate.
In practice a threshold is usually set: if two successive samples have different signs, the signal is counted as crossing the zero line. In this paper we set the threshold to 0.02.
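The two features described above can be sketched in a few lines of Python. This is a minimal illustration and not the paper's implementation: the function names, the framing helper, and the reading of the 0.02 threshold as a dead zone (a sign change is counted only when the two samples differ by more than the threshold) are assumptions.

```python
def frame_signal(x, frame_len, hop):
    """Split a list of samples into overlapping frames (a partial last frame is dropped)."""
    return [x[i:i + frame_len] for i in range(0, len(x) - frame_len + 1, hop)]


def short_time_energy(frame):
    """Short-time average energy: mean of the squared samples in one frame."""
    return sum(s * s for s in frame) / len(frame)


def zero_crossing_rate(frame, threshold=0.02):
    """Fraction of adjacent sample pairs that change sign.

    Assumption: the threshold acts as a dead zone, so a crossing is counted
    only when the two samples also differ by more than `threshold`.
    """
    crossings = 0
    for prev, cur in zip(frame, frame[1:]):
        if prev * cur < 0 and abs(cur - prev) > threshold:
            crossings += 1
    return crossings / (len(frame) - 1)


# Toy frames: a slow, large square wave stands in for voiced speech,
# a fast, small alternating sequence stands in for unvoiced speech.
voiced = [0.5 if (k // 20) % 2 == 0 else -0.5 for k in range(80)]
unvoiced = [0.05 * (-1) ** k for k in range(80)]
```

On these toy frames the voiced frame has the higher energy and the unvoiced frame has the higher zero-crossing rate, which is the contrast the two features rely on.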

LMP
The classic adaptive filtering method is the least mean square (LMS) method. However, it is not suitable for processing signals corrupted by non-Gaussian noise. We therefore present an adaptive filter based on the fractional lower-order lp-norm, the least mean p-norm (LMP) approach, for noisy signals.
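As a rough sketch of how an LMP adaptive filter operates, the code below assumes the commonly published form of the least mean p-norm update, w(n+1) = w(n) + μ p |e(n)|^(p-1) sign(e(n)) x(n), which minimizes the p-th moment of the error instead of its square. The function name `lmp_filter`, the FIR structure, and the number of taps are illustrative assumptions; only the default values μ = 0.00001 and p = 1.1 are taken from this paper.

```python
def lmp_filter(x, d, num_taps, mu=1e-5, p=1.1):
    """Adapt an FIR filter so that its output tracks d, using an LMP-style update.

    x: input samples, d: desired samples (same length as x).
    Returns (weights, outputs, errors).
    Assumption: standard LMP update from the literature, not necessarily
    the exact variant used in the paper.
    """
    w = [0.0] * num_taps
    outputs, errors = [], []
    for n in range(len(x)):
        # Most recent num_taps inputs, newest first, zero-padded at the start.
        u = [x[n - k] if n - k >= 0 else 0.0 for k in range(num_taps)]
        y = sum(wk * uk for wk, uk in zip(w, u))
        e = d[n] - y
        if e != 0.0:
            # Gradient estimate of |e|^p with respect to w is -p*|e|^(p-1)*sign(e)*u.
            g = mu * p * abs(e) ** (p - 1) * (1.0 if e > 0 else -1.0)
            w = [wk + g * uk for wk, uk in zip(w, u)]
        outputs.append(y)
        errors.append(e)
    return w, outputs, errors
```

With p = 2 the update reduces to the LMS rule (up to a constant factor in the step size), which is why LMP is usually described as a generalization of LMS for impulsive, non-Gaussian noise.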
In this method, the cost function of the adaptive system is expressed through the error function, subject to the condition 0 < p < α, where α is directly related to p. From this cost function we obtain the gradient estimate of the lp-norm. The LMP algorithm then consists of an initialization step followed by an update that is repeated for n = 1 to the length L. In this paper we set the step size μ = 0.00001 and p = 1.1. From the figure we can see that LMP is better for noise elimination.

Linear Prediction Coding
As applied to speech, the theory of linear prediction coding (LPC) has been well understood for many years. The basic idea behind the LPC model (3) is that a given speech sample at time n, s(n), can be approximated as a linear combination of the past p speech samples, such that

$$ s(n) \approx a_1 s(n-1) + a_2 s(n-2) + \cdots + a_p s(n-p) \qquad (9) $$

where the coefficients $a_1, a_2, \ldots, a_p$ are assumed constant over the speech analysis frame. We convert Eq. (9) to an equality by including an excitation term, $G\,u(n)$, giving

$$ s(n) = \sum_{i=1}^{p} a_i\, s(n-i) + G\,u(n) \qquad (10) $$

where u(n) is a normalized excitation and G is the gain of the excitation. By expressing Eq. (10) in the z-domain we get the relation

$$ S(z) = \sum_{i=1}^{p} a_i z^{-i} S(z) + G\,U(z) \qquad (11) $$

leading to the transfer function

$$ H(z) = \frac{S(z)}{G\,U(z)} = \frac{1}{1 - \sum_{i=1}^{p} a_i z^{-i}} \qquad (12) $$

and its frequency response $H(e^{j\omega})$ is called the LPC spectrum. In regions of large signal energy (at spectral peaks), the LPC spectrum is close to the signal spectrum; in spectral valleys, the distance between them is large.
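One common way to estimate the predictor coefficients and the gain for a frame is the autocorrelation method solved with the Levinson-Durbin recursion. The sketch below assumes that method, which this paper does not explicitly name; the function names are hypothetical, and the default order of 12 follows the value of p used in this paper.

```python
import math


def autocorrelation(frame, max_lag):
    """Return r[0..max_lag], the short-time autocorrelation of one frame."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def lpc_coefficients(frame, order=12):
    """Estimate (a, G) so that s(n) is approximated by sum_i a[i-1] * s(n-i).

    Assumption: autocorrelation method with the Levinson-Durbin recursion.
    Returns the list [a_1, ..., a_p] and the gain G.
    """
    r = autocorrelation(frame, order)
    a = [0.0] * (order + 1)  # a[0] is unused so that a[i] matches a_i
    err = r[0]               # prediction error energy of the order-0 model
    for i in range(1, order + 1):
        if err <= 0.0:       # silent or perfectly predictable frame
            break
        # Reflection coefficient for model order i.
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        prev = a[:]
        a[i] = k
        for j in range(1, i):
            a[j] = prev[j] - k * prev[i - j]
        err *= 1.0 - k * k
    return a[1:], math.sqrt(max(err, 0.0))
```

For a decaying sequence such as s(n) = 0.9^n, which is the response of a one-pole system to a unit impulse, a second-order fit recovers a_1 close to 0.9, a_2 close to 0, and a gain close to 1.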
Taking all of this into account, if the value of p is large enough, an all-pole model can be used in place of a model with zeros. However, a larger p increases computation and storage. Here, we use p = 12.

Mel Frequency Cepstral Coefficient
According to the characteristics of the human ear (4), we calculate the coefficients from the outputs of a filter bank, where $S_k$ is the output power of the k-th filter of the filter bank and n runs from 1 to 12. We can also calculate the log energy of each frame and use it as one of the coefficients. The Mel-frequency cepstrum can then be calculated from Eq. (14).
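The step from filter-bank output powers to cepstral coefficients can be sketched as follows. The code assumes the widely used cosine-transform form c_n = Σ_k log(S_k) cos(n (k - 1/2) π / K) with K filters, which may differ in scaling from the equation used in this paper; the function names, the choice of 24 filters in the example, and the small floor that guards the logarithm are assumptions.

```python
import math


def mfcc_from_filterbank(powers, num_coeffs=12):
    """Cepstral coefficients c_1..c_num_coeffs from mel filter-bank output powers S_k.

    Assumption: c_n = sum_k log(S_k) * cos(n * (k - 1/2) * pi / K), k = 1..K.
    """
    num_filters = len(powers)
    log_powers = [math.log(max(s, 1e-12)) for s in powers]  # floor avoids log(0)
    return [
        # k runs from 0 here, so (k + 0.5) plays the role of (k - 1/2) for k = 1..K.
        sum(lp * math.cos(n * (k + 0.5) * math.pi / num_filters)
            for k, lp in enumerate(log_powers))
        for n in range(1, num_coeffs + 1)
    ]


def log_energy(frame):
    """Log energy of one frame, usable as an additional coefficient."""
    return math.log(max(sum(s * s for s in frame), 1e-12))


# Example: a perfectly flat spectrum across 24 filters has no cepstral structure.
flat = mfcc_from_filterbank([2.0] * 24)
```

A flat filter-bank output yields coefficients that are all numerically zero, since the cosine terms cancel; any spectral shape shows up as non-zero coefficients.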

Dynamic Time Warping
Dynamic Time Warping (5) is based on dynamic programming, an extensively studied and widely used tool in operations research for solving sequential decision problems. To illustrate its applicability in speech recognition, and particularly to the problem of time alignment and normalization, the algorithm is summarized in Eq. (20).
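A minimal sketch of dynamic time warping for isolated-word matching is given below. It assumes the basic symmetric local constraint (a step from the left, from below, or along the diagonal) and a Euclidean frame distance, which may differ from the constraints used in this paper; the function names and the template dictionary are hypothetical.

```python
import math


def dtw_distance(ref, test):
    """Accumulated distance of the best time alignment between two feature sequences.

    ref, test: lists of feature vectors (lists of floats), possibly of different lengths.
    Assumption: steps (i-1, j), (i, j-1), (i-1, j-1) with Euclidean local distance.
    """
    n, m = len(ref), len(test)
    inf = float("inf")
    # acc[i][j] = best accumulated distance aligning ref[:i] with test[:j].
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.sqrt(sum((a - b) ** 2
                                  for a, b in zip(ref[i - 1], test[j - 1])))
            acc[i][j] = local + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return acc[n][m]


def recognize(templates, test):
    """Return the label of the template with the smallest DTW distance to `test`.

    templates: dict mapping a word label to its reference feature sequence.
    """
    return min(templates, key=lambda word: dtw_distance(templates[word], test))
```

In a recognizer of this kind, each vocabulary word is stored as one or more reference feature sequences (for example MFCC frames), and an unknown utterance is assigned the label of the closest template.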

Summary and Conclusions
We can conclude from Table 1 that the LMP algorithm achieves higher recognition accuracy in both non-Gaussian and Gaussian noise environments (6). In the experiments we found that recognition accuracy decreases as the number of words increases. We also found that it is difficult to distinguish between the digits "1" and "7", because these two words are pronounced similarly in the Yangzhou dialect. We will adopt a new method to solve this problem.

Table 1. Experimental Results of Yangzhou Dialectal Recognition