Use of Neural Network Classifier for Detecting Human Emotion via Speech Spectrogram

The goal of this research is to present an algorithm for speech emotion recognition based on neural networks. The input to the neural network is a set of partitions of the speech spectrogram. The proposed algorithm is based on feature extraction from a two-dimensional time-frequency representation of the speech signal. The method was tested on the Berlin Emotion Database, which contains 7 emotion classes. The experimental results show that the method works well in the time-frequency domain. Moreover, the proposed framework can efficiently identify the correct speech emotion compared with the reference method.

Keywords—Speech Emotion Recognition; Neural Network; Spectrogram; Feature Extraction


I. INTRODUCTION
Speech Emotion Recognition (SER) is a challenging research area in the field of Human-Computer Interaction (HCI). A computer cannot fully understand the emotional content of speech unless speech processing is employed. Several algorithms have been introduced to enable computers to understand and classify the various types of emotion in human speech. Recognizing emotion from speech benefits applications that require man-machine interaction, such as computer tutoring, automatic translation, mobile interaction, health care, and children's education. Some researchers have used statistics of different speech attributes, such as pitch, formants, amplitude, or power, to represent each sound. These features can be classified into three categories: prosodic features such as pitch (f0), intensity, and duration; voice quality; and spectral features such as Mel-Frequency Cepstral Coefficients (MFCC) or Linear Prediction Cepstral Coefficients (LPCC) [7].
Emotion feature extraction is an important issue in speech emotion classification. The most popular features are prosodic features, namely pitch and energy, and spectral features such as MFCC and LPCC. The limitation of such systems is that the algorithms analyze each parameter separately and then combine them into a set of feature vectors [9]. Another approach to feature extraction uses the two-dimensional magnitude spectrogram, which is a graphical display of the squared magnitude of the time-frequency representation of speech [9]. The magnitude spectrogram usually contains distinctive patterns that capture different characteristics of the emotional speech signal [11].
Several algorithms have been developed for the emotional classification of speech signals. Eun Ho Kim [1] presented speech recognition using Eigen-FFT in clean and noisy environments; the accuracy of this algorithm is 90 percent. Qingli Zhang [2] used a technique based on the Gaussian Mixture Model for the classification of audio signals. Yazid Attabi and Pierre Dumouchel [3] presented an algorithm for speech classification that combines a Gaussian Mixture Model with a Support Vector Machine to increase the accuracy of emotional classification. Yixiong Pan [4] presented emotion classification of audio signals using a Support Vector Machine (SVM). The limitation of each of these methods is that the feature extraction process analyzes each parameter separately and then combines the features into a single feature space. To address this limitation, a novel approach to speech classification that does not rely on separate features was proposed. A new research trend in emotion recognition/classification is to use a spectrogram in the feature extraction process. This approach is based on the analysis of a specific part of the spectrogram called the region of interest (ROI) [6]. Kun-Ching Wang [5][6] presented a novel texture image feature for emotion sensing in speech, in which texture information derived from the spectrogram image is extracted using Laws' masks to characterize the emotional state. The experimental results showed that the method based on the spectrogram image provides significant classification performance for emotion recognition in speech. Liang He et al. [9] presented and tested an approach to feature selection based on the analysis of a two-dimensional time-frequency representation of speech. Their method combines the wavelet packet transform with log-Gabor filters. In their method, the highest classification accuracy was achieved when using single vowels. J. Pribil and A. Pribilova [8] focused on the application of emotional speech conversion, evaluating sentences after spectral and prosodic modification. The results of their evaluation experiment confirmed that the spectrogram can be successfully used for visual comparison of different types of emotional synthetic speech. A. Harimi et al. [10] recognized human emotion by analyzing the acoustics of speech sounds, proposing Spectral Pattern features (SPs) and Harmonic Energy features (HEs) for emotion recognition. These features are extracted from the spectrogram of the speech signal using image processing techniques.
The results of these studies confirm that a spectrogram can capture the characteristics of speech. Accordingly, in this paper, the spectrogram of the speech signal is used for emotion recognition. The important features are extracted from the spectrogram. We provide a new neural network architecture that reduces the difficulty of the classification process and reduces the number of features sent to the neural network compared with other methods.
The process for detecting speech emotion based on the spectrogram is discussed in Section II. Results and discussion are presented in Section III, and the conclusions of the research are presented in Section IV.

II. PROPOSED ALGORITHMS
In this section, we first briefly review the fundamentals of the speech signal. Next, the speech emotion recognition system based on the spectrogram is introduced. Finally, the neural network used to label the emotion of the speech signal is described.

A. Speech features.
A speech signal contains basic attributes used for feature extraction. An example of a speech signal is shown in Fig. 1.
Every speech signal has several important features: pitch (f0), log-energy, formant frequencies, mel-band energies, and mel-frequency cepstral coefficients. These features are described in more detail below.
1) Frequency or pitch
The pitch is the fundamental frequency (f0) of the speech signal. It refers to the tone of speech. The pitch of each signal differs depending on the speed of vibration of the sound source. Examples of low-frequency and high-frequency waves are shown in Fig. 2.
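As a concrete illustration of pitch, f0 can be estimated from a voiced frame with the classical autocorrelation method. This is a minimal sketch, not the paper's implementation; the search range of 75-400 Hz is an assumed typical range for speech.

```python
import numpy as np

def estimate_pitch(signal, sample_rate, fmin=75.0, fmax=400.0):
    """Estimate the fundamental frequency (f0) of a voiced frame via
    autocorrelation: the lag of the strongest peak is the pitch period."""
    signal = signal - np.mean(signal)
    corr = np.correlate(signal, signal, mode="full")
    corr = corr[len(corr) // 2:]              # keep non-negative lags only
    lag_min = int(sample_rate / fmax)         # shortest period considered
    lag_max = int(sample_rate / fmin)         # longest period considered
    peak_lag = lag_min + np.argmax(corr[lag_min:lag_max])
    return sample_rate / peak_lag

# A 200 Hz sine sampled at 16 kHz should yield an estimate near 200 Hz.
fs = 16000
t = np.arange(0, 0.05, 1 / fs)
tone = np.sin(2 * np.pi * 200 * t)
print(estimate_pitch(tone, fs))
```

Real speech requires voiced/unvoiced detection before pitch tracking; this sketch assumes a purely voiced frame.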

2) Wavelength
Wavelength refers to the distance between two adjacent peaks, which is one lambda (λ), occurring during the recording of the sound wave. An example of wavelength is shown in Fig. 3.

3) Amplitude
Amplitude refers to the height of the waves. It represents the intensity of the sound, or volume (loudness). An example of amplitude is shown in Fig. 4.

4) Prosodic features
Prosodic features are the most widely used features in SER. To extract these features, statistical properties of the pitch and energy tracking contours are commonly used. The statistical functionals include: min, max, range, mean, median, trimmed mean 10%, trimmed mean 25%, the 1st, 5th, 10th, 25th, 75th, 90th, 95th, and 99th percentiles, interquartile range, average deviation, standard deviation, skewness, and kurtosis.
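The statistical functionals listed above can be computed directly from a pitch or energy contour. The following sketch uses standard numpy/scipy routines; it illustrates the feature list rather than reproducing the paper's exact code.

```python
import numpy as np
from scipy import stats

def contour_statistics(contour):
    """Statistical functionals of a pitch/energy contour (illustrative)."""
    c = np.asarray(contour, dtype=float)
    feats = {
        "min": c.min(),
        "max": c.max(),
        "range": c.max() - c.min(),
        "mean": c.mean(),
        "median": np.median(c),
        "trimmed_mean_10": stats.trim_mean(c, 0.10),
        "trimmed_mean_25": stats.trim_mean(c, 0.25),
        "iqr": np.percentile(c, 75) - np.percentile(c, 25),
        "std": c.std(ddof=1),
        "avg_dev": np.mean(np.abs(c - c.mean())),
        "skewness": stats.skew(c),
        "kurtosis": stats.kurtosis(c),
    }
    for p in (1, 5, 10, 25, 75, 90, 95, 99):   # the listed percentiles
        feats[f"p{p}"] = np.percentile(c, p)
    return feats

# Toy f0 contour in Hz (hypothetical values for illustration).
f0 = [110, 115, 120, 118, 125, 130, 128, 122]
s = contour_statistics(f0)
print(s["min"], s["max"], s["range"], s["mean"])
```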

5) Spectral features
Two types of spectral features, the Mel-Frequency Cepstral Coefficients (MFCCs) and formants, are reported to be effective for emotion recognition. The first 12 MFCCs and 4 formants are extracted from 20 ms Hamming-windowed speech frames every 10 ms, and their contours are formed.

B. Speech emotion recognition system 1) Emotional speech input:
The emotional speech data set is selected from a standard database to increase the efficiency of the proposed framework.
2) Pre-processing: The continuous-time signal is selected. Pre-emphasis for flattening the speech spectrum is performed by a high-pass Finite Impulse Response (FIR) filter. After that, the signal amplitude is normalized to the range -1 to 1.
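The pre-processing step above can be sketched as a first-order FIR pre-emphasis filter followed by peak normalisation. The coefficient 0.97 is a commonly used value assumed here; the paper does not state its coefficient.

```python
import numpy as np

def preprocess(signal, alpha=0.97):
    """Pre-emphasis y[n] = x[n] - alpha * x[n-1] (a high-pass FIR filter
    that flattens the speech spectrum), then normalise to [-1, 1]."""
    x = np.asarray(signal, dtype=float)
    y = np.append(x[0], x[1:] - alpha * x[:-1])   # first-order FIR filter
    return y / np.max(np.abs(y))                  # peak-normalise amplitude

out = preprocess([0.5, 0.6, 0.4, -0.2])
print(np.max(np.abs(out)))
```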
3) Framing: The speech signal is segmented into small frames with a time length of 25 ms.

4) Windowing:
The discontinuities of the speech signal at the edges of each frame are reduced by using a Hamming window.

5) FFT:
FFT is an abbreviation of the Fast Fourier Transform. Essentially, the FFT is the DFT, transforming the discrete-time signal from the time domain into the frequency domain; the difference is that the FFT is faster and more computationally efficient.
6) Spectrogram: After the speech signal in each segment is transformed from the time domain to the frequency domain by the FFT, the amplitude-frequency plot is rotated by 90 degrees and the spectral amplitude is mapped to a gray-level value (0-255): the higher the amplitude, the darker the corresponding region. This time-versus-frequency representation of a speech signal is referred to as a spectrogram, and it is used in the classification step. An example of the time-frequency representation of a speech signal is shown in Fig. 5.

7) Feature extraction: In this step, the spectrogram, i.e., the time-frequency representation of the speech signal, is converted into the input of the neural network as follows. First, convert the spectrogram into a grayscale image. Next, use a threshold to convert the image to binary, setting values below the threshold to zero and values above the threshold to one. After that, divide the spectrogram into a 4x4 grid and sum the values in each sub-area. The resulting 16 values are used in the neural network training process, as shown in Fig. 6.

Fig. 6. The process for converting the spectrogram into the input of the neural network.
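The threshold-and-grid procedure above can be sketched as follows. The threshold value of 128 is an assumed example; the paper does not specify its threshold.

```python
import numpy as np

def spectrogram_features(gray, threshold=128):
    """Binarise a grayscale spectrogram, split it into a 4x4 grid, and
    sum the ones in each cell, yielding the 16 neural network inputs."""
    binary = (gray > threshold).astype(int)       # 0 below, 1 above threshold
    bands = np.array_split(binary, 4, axis=0)     # 4 horizontal bands
    feats = [cell.sum() for band in bands
             for cell in np.array_split(band, 4, axis=1)]  # 4 cells per band
    return np.array(feats)                        # length-16 feature vector

# Toy 8x8 "spectrogram" with gray levels 0..252 (illustrative only).
gray = np.arange(64).reshape(8, 8) * 4
f = spectrogram_features(gray)
print(f)
```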

8) Neural network for emotion speech classification.
In this paper, a neural network is created to classify each spectrogram. The formulation of a single neuron in the network is given in Equation (1),

n = Σ_{i=1}^{z} w_i x_i + b      (1)

where n is the weighted summation, x_i is an input vector element, w_i is a weight, z is the number of input attributes, b is the bias, and i is the vector index from 1 to z. The input attributes to the neural network come from the window-sizing process, i.e., the summation of identified points in each sub-window. We use 16 input attributes. The neural network architecture used to learn the relationship of the data is shown in Fig. 7.
The network is trained only on the training set, with the stopping criteria and network parameters fixed in advance. A feed-forward multilayer neural network with the back-propagation learning algorithm is used. The network consists of one input layer, one hidden layer, and one output layer. Sets of inputs and desired outputs belonging to the training patterns are fed into the neural network to learn the relationship of the data. The manifolds of these data sets are generated using only the set of nearest data, an approach called local training. The difference between our proposed algorithm and the traditional method is the use of this semi-supervised technique. The hidden layer adjusts the weights connected to each input node. The root-mean-square error is calculated from the desired output and the calculated output. If the error does not satisfy the predefined value, the error is propagated back to the preceding layer, from the upper layer toward the input layer. The algorithm adjusts the weights from their initial values until the mean square error is satisfactory.
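The training loop described above can be sketched as a minimal one-hidden-layer feed-forward network with back-propagation. The layer sizes, learning rate, and toy AND data below are illustrative assumptions, not the paper's 16-input configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy training patterns: learn logical AND (stand-in for emotion labels).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [0], [0], [1]], dtype=float)

W1 = rng.normal(0, 1, (2, 4)); b1 = np.zeros(4)   # input -> hidden
W2 = rng.normal(0, 1, (4, 1)); b2 = np.zeros(1)   # hidden -> output
lr = 0.5

for epoch in range(20000):
    H = sigmoid(X @ W1 + b1)          # forward pass, hidden layer
    Y = sigmoid(H @ W2 + b2)          # forward pass, output layer
    err = Y - T
    mse = float(np.mean(err ** 2))    # mean-square-error stopping criterion
    if mse < 1e-3:                    # predefined error threshold reached
        break
    # Back-propagate the error from the output layer toward the input layer.
    dY = err * Y * (1 - Y)
    dH = (dY @ W2.T) * H * (1 - H)
    W2 -= lr * H.T @ dY; b2 -= lr * dY.sum(axis=0)
    W1 -= lr * X.T @ dH; b1 -= lr * dH.sum(axis=0)

print(mse)
```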

III. EXPERIMENTAL RESULTS AND DISCUSSIONS
A. Experimental setup
The Berlin Emotion Database (German emotional speech) [11] is a well-known public database. It contains 535 speech files covering 7 emotion classes: anger, joy, boredom, neutral, disgust, fear, and sadness. The database includes 10 different texts expressed by ten professional actors (5 male and 5 female). Table I lists the number of samples for each emotion category. In our experiments, we selected 5 speech emotions: anger, sadness, happiness, neutral, and fear. These five emotions show clearly different spectrograms with distinct characteristics. Details of the emotions in the speech database are given in Table I. Tables II-IV are confusion matrices, which are widely used tools that reflect the performance of an algorithm. Each row of the matrix represents the instances of a predicted class, while each column represents the instances of an original class, making it easy to visualize the classifier's errors in predicting the instances of each original class. The percentage of test data classified to each original emotion is shown inside the tables.
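The confusion-matrix layout described above (rows = predicted class, columns = original class, entries = percentage of each original class) can be built as follows. The labels and toy data are illustrative, not the paper's results.

```python
import numpy as np

EMOTIONS = ["anger", "sadness", "happiness", "neutral", "fear"]

def confusion_matrix(true_labels, predicted_labels, n_classes):
    """Row = predicted class, column = original class; each column is
    expressed as percentages of that original class's instances."""
    counts = np.zeros((n_classes, n_classes))
    for t, p in zip(true_labels, predicted_labels):
        counts[p, t] += 1
    return 100.0 * counts / counts.sum(axis=0, keepdims=True)

# Toy labels (class indices); a 2-class example for brevity.
true = [0, 0, 0, 0, 1, 1, 1, 1]
pred = [0, 0, 0, 1, 1, 1, 1, 0]
cm = confusion_matrix(true, pred, n_classes=2)
print(cm)
```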

B. Experimental results.
The experimental results for speech emotion classification are shown in Tables II-IV. Table II shows the confusion matrix of the proposed algorithm for combined male and female speech, Table III for male speech only, and Table IV for female speech only.
The accuracies obtained with the proposed algorithm are summarized below (in percent):

Emotion     Male + Female    Male     Female
Anger       80.67            76.67    84.67
Sadness     79.35            75.35    82.35
Happiness   82.47            79.47    85.47
Neutral     76.80            73.80    78.80
Fear        83.28            74.28    85.28
From the experimental results, we confirmed that the spectrogram image can be successfully used for comparison of different approaches to emotional speech, and that the method works well in the time-frequency domain. Moreover, using a neural network as the classifier yields satisfactory classification accuracy, and the number of attributes is reduced compared with previous methods.

IV. CONCLUSIONS
This research presents a novel algorithm for detecting human emotion from speech by using the speech spectrogram. The important features are extracted from the spectrogram, and a new neural network architecture is provided to reduce the difficulty of the classification process and the number of features sent to the neural network compared with other methods. We have presented a new approach to feature extraction based on the analysis of a two-dimensional time-frequency representation of the speech signal. The algorithm was tested on the EMO-DB database. The experimental results show that the proposed framework can efficiently find the correct speech emotion compared with the traditional method.

Fig. 5. Time vs frequency representation of a speech signal.

TABLE I. NUMBER OF SAMPLES IN THE BERLIN DATABASE