Speech Section Extraction Method Using Image and Voice Information

Meeting minutes are useful for efficient operations and meetings. The labor cost and time for taking the minutes can be minimized using a system that can assign a speaker to the minutes. An automatic speaker identification method using lip movements and speech data obtained by an omnidirectional camera was developed. To improve the accuracy of speaker identification, it is necessary to use extracted speech sections. The proposed speech section extraction method was studied as a preprocessor for speaker identification. The proposed method consists of three processes: i) extraction of speaking frames using lip movements, ii) extraction of speaking frames using voices, and iii) discrimination of speech sections using these extraction results. In the extraction of speech sections using lip movements, the width of the nose was used to calculate the threshold automatically. By using a threshold based on the nose width, the speech sections can be extracted even when the distance between the camera and the subject changes. Finally, speech video data of 11 sentences (154 data points) from 14 subjects were used to evaluate the usefulness of the method. The method achieved a high F-measure of 0.96 on average. The results reveal that the proposed method can extract speech sections even when the distance between the camera and the subject changes.


Introduction
In Japan, many companies are striving to improve the efficiency of their operations and meetings as part of work-style reform (1). Meeting minutes are useful for efficient operations and meetings. The labor cost and time for taking the minutes can be minimized using a system that can assign a speaker to the minutes. There are several practical speaker identification methods: one is to assign a microphone to each person, and another is to identify the speaker by voiceprint (2). However, these methods require preparation in advance because they need multiple microphones and preregistered voiceprints. A speaker identification method that applies a convolutional neural network to image and voice data has also been proposed (3); however, it requires hundreds of hours of data to train the model. These problems can be solved by analyzing the relationship between lip movement and voice. Fig. 1 shows an automatic speaker identification system that uses lip movements and speech captured by an omnidirectional camera (4). The similarity between the lip movements obtained from the speech and those obtained from the images was evaluated, and the speaker was identified. To improve the accuracy of speaker identification, it is necessary to use extracted speech sections.
In this article, a speech section extraction method that uses image and voice information as a preprocessor for speaker identification is proposed. Video data (154 data points) from 14 subjects were used to evaluate the usefulness of the method.

Data
The data acquisition environment is shown in Fig. 2. In this study, videos of 11 Japanese sentences (selected from Web news articles) spoken by 14 subjects (A-N, of Asian descent) were used. The videos were acquired using an omnidirectional camera (RICOH: THETA V) and a microphone (RICOH: TA-1). The omnidirectional camera was set horizontal to the ground. The lens faced the subject, and the camera height was set to approximately 100 cm. The data used in this investigation were acquired in accordance with the ethical regulations concerning studies involving humans at Akita University, Japan.

Overview
An overview of the proposed method is shown in Fig. 3. The proposed method consists of three processes: i) extraction of speaking frames using lip movements, ii) extraction of speaking frames using voices, and iii) discrimination of speech sections using these extraction results. First, the face areas in the image were extracted, and the height of the inside of the mouth was acquired. Second, the speaking frames were extracted using a time-series variation of the height of the inside of the mouth. Third, the speaking frames were re-extracted. Fourth, voice features were extracted from the voice data, and speaking frames were extracted using a time-series variation of voice features. Finally, the results of the speech sections were calculated using the speaking frames that were extracted using images and voice.

Extraction of height of the inside of the mouth
The face area was detected using the open-source library Dlib (5,6), which can detect 68 feature points on the face. Twenty of these feature points describe the shape of the lips. Two of the lip feature points were used to calculate the height of the inside of the mouth. An example of the calculation results of the height of the inside of the mouth is shown in Fig. 4. The time-series variation of the height was calculated and smoothed using a moving average filter. The resulting waveforms were used as lip movement features.
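The feature extraction above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the choice of inner-lip landmark indices 62 and 66 (from Dlib's standard 68-point layout) and the moving-average window length of 5 are assumptions, since the paper does not state them.

```python
# Sketch: mouth-interior height from a 68-point landmark set, then smoothing.
# Indices 62 (inner upper lip) and 66 (inner lower lip) follow Dlib's
# standard layout but are an assumption; the paper does not name the pair.

def mouth_height(landmarks):
    """landmarks: sequence of 68 (x, y) points in Dlib's 68-point order."""
    (_, y_top) = landmarks[62]
    (_, y_bot) = landmarks[66]
    return abs(y_bot - y_top)

def moving_average(series, window=5):
    """Trailing-window moving average used to smooth the time series.
    The paper does not state the window length; 5 is an assumption."""
    out = []
    for i in range(len(series)):
        lo = max(0, i - window + 1)
        chunk = series[lo:i + 1]
        out.append(sum(chunk) / len(chunk))
    return out
```

In a real pipeline, `landmarks` would come from `dlib.shape_predictor` applied to each detected face rectangle.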

Speaking frame extraction by lip movement
The movement of the lips in the speaking frames is larger than that in the nonspeaking frames. This difference was used to extract the speaking frames. First, the maximum and minimum values of the lip movement features were calculated from the target frame and the 10 frames before and after it (21 frames). Second, the difference (Lf) between the maximum and minimum values was calculated. Third, if Lf was larger than or equal to the threshold (Th), the frame was extracted as a speaking frame. Th is calculated using Equation (1), where Nw is the median width of the nose in the video. An example of the nose width calculation is shown in Fig. 4. Equation (1) was determined based on the results of a preliminary experiment using subjects A-F.
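The sliding-window test above can be sketched as follows. Since Equation (1) is not reproduced in the text, the threshold derived from the median nose width Nw is passed in as a parameter rather than computed.

```python
def lip_speaking_frames(lip_feature, th, half_window=10):
    """Mark frame i as speaking when the range (max - min) of the
    21-frame window centred on i, called Lf in the paper, is at least
    the threshold Th.  Th comes from Equation (1) as a function of the
    median nose width Nw; the formula is not reproduced here, so the
    caller supplies th directly."""
    n = len(lip_feature)
    flags = []
    for i in range(n):
        lo = max(0, i - half_window)
        hi = min(n, i + half_window + 1)
        window = lip_feature[lo:hi]
        lf = max(window) - min(window)
        flags.append(lf >= th)
    return flags
```

At the edges of the sequence the window is simply truncated; the paper does not say how boundary frames are handled, so this is one plausible choice.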

Speaking frame re-extraction
The speaking frames were re-extracted because some speaking frames with small lip movements were not extracted correctly. Specifically, frames that had a speaking frame within the next five frames were re-extracted as speaking frames.
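The re-extraction rule amounts to a short look-ahead pass over the frame flags, as in this sketch:

```python
def reextract(flags, lookahead=5):
    """Re-extraction: a nonspeaking frame becomes a speaking frame if any
    of the next `lookahead` frames was extracted as a speaking frame."""
    out = list(flags)
    for i in range(len(flags)):
        if not out[i] and any(flags[i + 1:i + 1 + lookahead]):
            out[i] = True
    return out
```

This fills short gaps caused by small lip movements while leaving long nonspeaking runs untouched.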

Voice feature extraction
The mel-frequency cepstrum coefficient (MFCC) is a feature widely used in speech recognition. The low-dimensional MFCCs carry information about the acoustic properties of the vocal tract and the shape of the oral cavity. In this study, the zeroth dimension of the MFCC, the lowest dimension, was used as the voice feature. The time-series data of the voice features were smoothed using a moving average filter and used for speaking frame extraction.
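A rough, self-contained stand-in for the zeroth MFCC is sketched below. The zeroth coefficient is the DCT-0 of the log mel spectrum and behaves like an overall log-energy term, so this sketch keeps only the log frame energy; the framing parameters are illustrative assumptions. In practice one would use a library routine such as `librosa.feature.mfcc(y=y, sr=sr, n_mfcc=1)[0]`.

```python
import math

def frame_signal(samples, frame_len, hop):
    """Split a sample stream into overlapping frames."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]

def mfcc0_like(frame):
    """Log frame energy as a crude proxy for MFCC dimension 0.
    (The true coefficient sums the log mel-band energies; this sketch
    keeps only the overall energy term.)"""
    energy = sum(s * s for s in frame) + 1e-12  # avoid log(0)
    return math.log(energy)
```

The resulting per-frame series would then be smoothed with the same moving-average filter used for the lip features.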

Speaking frame extraction by voice
The voice feature values of the speaking frames are larger than those of the nonspeaking frames and remain large. The speaking frames were extracted by focusing on these characteristics. First, the maximum and minimum values of the voice features were calculated from the target frame and the 20 frames before and after it (41 frames). Second, the difference (Dm) between the maximum and minimum values was calculated. Third, the difference (Vf) between the voice feature and Dm in each frame was calculated. Finally, frames in which Vf was larger than or equal to the threshold of 0.00 were extracted as speaking frames.
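Implemented literally as described, the voice-based extraction looks like the following sketch (edge frames use a truncated window, which is an assumption):

```python
def voice_speaking_frames(voice_feature, half_window=20, threshold=0.0):
    """Per the description: Dm is the range (max - min) of the 41-frame
    window around the target frame; Vf is the feature value minus Dm;
    the frame is a speaking frame when Vf >= 0.00."""
    n = len(voice_feature)
    flags = []
    for i in range(n):
        lo = max(0, i - half_window)
        hi = min(n, i + half_window + 1)
        window = voice_feature[lo:hi]
        dm = max(window) - min(window)
        flags.append(voice_feature[i] - dm >= threshold)
    return flags
```

Intuitively, a frame passes only when its own feature value is at least as large as the local variation, which favors frames that are both large and sustained.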

Speech section extraction
The speech sections were defined as runs of consecutive speaking frames extracted using the lip movement features. In this process, a section was finalized as a speech section only if it contained speaking frames extracted using the voice features. Three lip movement feature frames and ten voice feature frames correspond to 0.1 s of video. An example of speech section extraction is shown in Fig. 5.
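Combining the two flag streams can be sketched as below. The 10/3 audio-to-video frame ratio follows the statement that 3 lip-feature frames and 10 voice-feature frames both correspond to 0.1 s; the index mapping itself is an assumption.

```python
def finalize_sections(lip_flags, voice_flags, audio_per_video=10 / 3):
    """Candidate sections are runs of consecutive lip-based speaking
    frames; a section is kept only if the matching span of voice-based
    flags contains at least one speaking frame.  The 10/3 ratio follows
    the paper: 3 lip frames ~ 10 voice frames ~ 0.1 s."""
    sections = []
    i, n = 0, len(lip_flags)
    while i < n:
        if not lip_flags[i]:
            i += 1
            continue
        j = i
        while j < n and lip_flags[j]:
            j += 1
        # map the video-frame span [i, j) to voice-feature indices
        a_lo = int(i * audio_per_video)
        a_hi = int(j * audio_per_video)
        if any(voice_flags[a_lo:a_hi]):
            sections.append((i, j - 1))  # inclusive video-frame span
        i = j
    return sections
```

Each returned pair gives the first and last video frame of a finalized speech section.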

Labeling of speaking frames
The speech start and end frames were set manually. The frames between the speech start frame and the end frame were defined as the speaking section.

Evaluation index
The F-measure was used as the evaluation index. It was calculated using Equations (2)-(4), where TP represents the number of frames in which speaking frames are correctly extracted as speaking frames, FP represents the number of frames in which nonspeaking frames are incorrectly extracted as speaking frames, and FN represents the number of frames in which speaking frames are incorrectly extracted as nonspeaking frames. The F-measure ranges from 0.00 to 1.00, and the closer the value is to 1.00, the higher the success rate.
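Equations (2)-(4) are not reproduced in the text; given the definitions of TP, FP, and FN above, they are presumably the standard precision, recall, and F-measure formulas:

```latex
\mathrm{Precision} = \frac{TP}{TP + FP} \qquad (2)
```
```latex
\mathrm{Recall} = \frac{TP}{TP + FN} \qquad (3)
```
```latex
F\text{-measure} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \qquad (4)
```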

Evaluation procedure
The automatic threshold (Th) setting of the proposed method was evaluated using videos of 11 sentences (154 data points) from 14 subjects. A method with a fixed Th of 1.5 was used for comparison; this value was the best one found in a preliminary study that used the data of subjects A-F. In this evaluation, the median nose width (Nw) and the height of the inside of the mouth were multiplied by weights (set in increments of 0.1 from 0.5 to 5.0) to pseudo-change the distance between the camera and the subject. The F-measures of the proposed method and the comparison method were then compared for each weight value.
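The pseudo-distance sweep can be sketched as follows; the function names are illustrative, not the authors':

```python
def weight_sweep(start=0.5, stop=5.0, step=0.1):
    """Weights 0.5 to 5.0 in 0.1 increments, as in the evaluation."""
    n = int(round((stop - start) / step)) + 1
    return [round(start + step * k, 1) for k in range(n)]

def apply_weight(nose_width, mouth_heights, w):
    """Pseudo-change of the camera-subject distance: scale the median
    nose width Nw and the mouth-interior heights by the same weight,
    mimicking the subject appearing larger or smaller in the image."""
    return nose_width * w, [h * w for h in mouth_heights]
```

Because both Nw and the lip feature scale together, a threshold derived from Nw via Equation (1) tracks the apparent size of the face, which is what the evaluation is designed to test.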

Results
The mean F-measure values for the 14 subjects at each weight value are shown in Fig. 6. In the comparison method, the average F-measure decreased as the weight increased or decreased. In the proposed method, however, the mean F-measure remained higher than 0.95 even when the weight value changed. The resolution is approximately 0.80 mm per pixel when the distance between the camera and the subject is 50 cm, and it changes in proportion to the distance; for example, it is approximately 1.70 mm per pixel at a distance of 100 cm. The threshold (Th) calculation formula was examined for weights of 0.5-5.0 (resolutions of approximately 1.60-0.16 mm per pixel). Therefore, the proposed method can be used up to a camera-subject distance of approximately 100 cm. The result suggests that the proposed method can extract speech sections even when the distance between the camera and the subject changes. Table 1 lists the mean F-measure values for each subject. The mean F-measure improved for all 14 subjects. The results suggest that the proposed method can correctly extract speaking sections for different subjects.

Conclusions
A speech section extraction method using image and voice information as a preprocessor for speaker identification was proposed. Based on several experiments, it was determined that i) the proposed method can extract speech sections, even when the distance between the camera and the subject changes, and ii) the proposed method can correctly extract speaking sections in different subjects.