Fluctuation in Recognition Accuracy due to Utterance Experience in Command Input Interface using Lip Motion

A command input interface based on lip motion features can be used either with or without vocalization. Our previous study revealed the fluctuations of lip movement attributable to vocalization and the difference in command recognition accuracy between the two utterance conditions. However, the change in lip motion due to utterance experience is not yet clear. To develop a practical command input interface using lip motion features, it is necessary to clarify the relationship between utterance experience and command recognition accuracy. In this study, we acquired lip motion data from six participants over 10 days and conducted a command recognition experiment using a DTW matching method to investigate the change in command recognition accuracy. According to our experimental results, command recognition accuracy decreased significantly when using vocalized data.


Introduction
Lip motion features due to utterances can provide important information for recognizing lip motions as commands. Because lip motion features are perceived visually, they are a natural basis for systems with non-contact-type interfaces. These systems have the advantages of high convenience and sanitation for users. Some researchers have demonstrated the potential of using lip motion features for command recognition (1)-(3). In our previous paper, we proposed a non-contact-type command input interface using lip motion features and reported that lip motion features such as changes in lip width and length are useful for command recognition (3). These are simple features that are easily acquired from sequential lip images and can be used to obtain high recognition accuracy. Additionally, these lip motion features can be acquired with or without vocalization, giving them greater flexibility in situations that may require silence. Unfortunately, the velocity and position of the lips may change under vocalization due to the action of other vocal organs (4), and the number of frames varies depending on the speech speed. Thus, we investigated the fluctuations of lip movement attributable to vocalization. Our previous study revealed the differences in lip motion features between voiced and non-voiced data and found that good recognition accuracy can be obtained in almost all cases when the input data and the registered data are of the same voicing state (5). However, studies regarding changes in lip motion caused by utterance experience have not yet been conducted.
In this paper, we investigate the change in command recognition accuracy attributable to utterance experience. We acquired lip motion data while participants uttered five commands with and without vocalization over 10 days and conducted a command recognition experiment using the method given in reference (5). Our experiments led to some new findings for the establishment of a command input interface using lip motion features.

Acquisition Environment and Commands Used
In this study, we acquired sequential images of six participants (id01-id06, all of Asian descent) as they uttered five commands. The data acquisition environment and process were essentially identical to the conditions given in reference (5). Figure 1 shows the data acquisition environment. Utterance data for the six participants were recorded by a CCD video camera (Point Grey Research: Grasshopper) under the following conditions:
(a) The distance between the CCD video camera and the participant was fixed at 60 cm.
(b) A guide for lip position was shown to all participants. The dimensions of this guide were 75 × 40 pixels.
(c) Utterance videos were recorded under ordinary fluorescent lights (indoors, at a brightness of 650-800 lx).
(d) No lipstick or supplementary lighting was used.
The commands uttered were five Japanese words, each of which contained four to nine syllables. Table 1 lists the commands and the selection criteria used.

Data Acquisition Flow
Figure 2 shows an outline of the data acquisition schedule for an individual participant. Data acquisition was scheduled on 10 days within a period of three weeks for each participant. Figure 3 shows the data acquisition process per participant per day. Data acquisition was performed using a process similar to that in reference (5), involving four steps. First, the participant sat in a chair in front of the monitor and remained quiet and rested for 2 min to calm the participant's physical and mental state. Second, the participant uttered each command six times without vocalization. A 30 s interval was maintained between the groups of utterances of each distinct command. Third, the participant rested for an additional 2 min to reduce the effect of the utterances without vocalization. Finally, the participant uttered each command six times with vocalization, as in step 2. Sixty utterance videos were recorded per participant per day, totaling 600 utterance videos per participant over the 10 days within a three-week period. Thus, 3600 utterance videos were recorded for all participants combined.
The decision to have participants utter each command six times was based on the results of a questionnaire used in previous research (6). These results revealed that most participants experienced negligible stress when uttering one command up to six times. As listed in Table 1, each command word was selected so that it contains an "n" sound, a plosive, and a long vowel. Utterance data for the 2nd through 10th days were acquired within 20 days of the first acquisition day. In steps 2 and 4, the mouth was closed at the beginning and end of each utterance.

Fig. 3. Data acquisition process.

Preprocessing
Preprocessing was executed in four steps. First, the utterance videos were converted into sequential face images at 30 fps for acquisition of the lip motion features. Second, an automatic extraction method that employs fuzzy reasoning was applied to extract the shape of the lips (7) in each sequential face image (see Figure 4). Third, we measured the horizontal length (diX) and vertical length (diY) of the lips in all lip images. Finally, we manually extracted the utterance section from each set of utterance data. The utterance section was determined by an operator in accordance with the following criteria:
(a) The first frame of the utterance section was set to the frame immediately before the start of the lip movement.
(b) The end frame of the utterance section was set to the frame immediately following the end of the lip movement.
Because we needed to extract the utterance section as accurately as possible using the criteria above, automatic extraction methods were not used.
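The lip-size measurement in the third preprocessing step can be sketched as follows. This is a minimal illustration, assuming the fuzzy-reasoning extraction of reference (7) has already produced a binary lip-region mask per frame; the function name and the mask representation are our own, not part of the original method.

```python
import numpy as np

def lip_dimensions(lip_mask: np.ndarray) -> tuple[int, int]:
    """Measure the horizontal length (diX) and vertical length (diY)
    of the lips from a binary lip-region mask (True = lip pixel)."""
    ys, xs = np.nonzero(lip_mask)
    if xs.size == 0:
        return 0, 0  # no lip pixels detected in this frame
    di_x = int(xs.max() - xs.min() + 1)  # horizontal extent (lip width)
    di_y = int(ys.max() - ys.min() + 1)  # vertical extent (lip length)
    return di_x, di_y
```

Applying this per frame of a 320 × 240 image sequence yields the diXi and diYi series used in the next section.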

Calculation of Lip Motion Features
Lip movements such as changes in lip width and length provide important information about uttered content and can be used for command input interfaces (3). In this study, we conducted a command recognition experiment to investigate the variation of command recognition accuracy due to utterance experience in both utterance conditions (i.e., without vocalization and with vocalization), as described in reference (5). Three lip motion features (raXi, raYi, and Ri) were calculated in the same manner as in reference (3) (equations (1) to (3) therein). Lip motion features raXi and raYi are the normalized values of diXi and diYi, respectively, and Ri is the ratio of lip width to lip length.
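The feature calculation can be sketched as below. Note that the exact normalization scheme is given by equations (1) to (3) of reference (3); dividing by the first (closed-mouth) frame is our assumption for illustration only.

```python
import numpy as np

def lip_motion_features(di_x, di_y):
    """Compute the three per-frame lip motion features.
    Normalization by the first frame is an assumption; the exact
    scheme is defined in reference (3)."""
    di_x = np.asarray(di_x, dtype=float)
    di_y = np.asarray(di_y, dtype=float)
    ra_x = di_x / di_x[0]  # raXi: normalized lip width
    ra_y = di_y / di_y[0]  # raYi: normalized lip length
    r = di_x / di_y        # Ri: ratio of lip width to lip length
    return ra_x, ra_y, r
```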

Generation of Standard Patterns
In this study, we generated standard patterns to use as the registered data based on the research results in references (3) and (5). Each standard pattern was generated from 12 utterances acquired on the first and second days. Figure 5 shows the outline of the three-step process for generating the standard pattern per command per voicing state. First, the lip motion features of the 12 utterance data acquired on the first and second days were input. These 12 data have the same conditions (i.e., same command, same participant, and same voicing state) and thus create a unique pattern for each distinct combination of command, participant, and voicing state. Second, the number of utterance frames of the 12 data was standardized in accordance with the mean number of frames across the 12 utterances. Third, the mean values of each of the three lip motion features (raXi, raYi, and Ri) across the 12 utterance data were calculated, and these mean values were set as the feature values for the standard pattern. In this way, we created 10 standard patterns per participant, comprising five non-vocalized and five vocalized patterns; thus, 60 standard patterns (6 participants, 5 commands, 2 voicing states) were generated.

Fig. 4. Lip area and lip sizes: extraction of the lip region from sequential face images by fuzzy reasoning using color information (7), and measurement of diXi and diYi (image size: 320 × 240 pixels).
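The three-step generation process above can be sketched as follows. Linear resampling to the mean frame count is our assumption for the length standardization; the original procedure is specified in references (3) and (5).

```python
import numpy as np

def make_standard_pattern(utterances):
    """Generate a standard pattern from feature sequences of the same
    command, participant, and voicing state. Each element of
    `utterances` is an array of shape (n_frames, 3) holding
    (raX, raY, R) per frame. Length standardization is done here by
    linear resampling to the mean frame count (an assumption)."""
    mean_len = int(round(np.mean([len(u) for u in utterances])))
    resampled = []
    for u in utterances:
        u = np.asarray(u, dtype=float)
        src = np.linspace(0.0, 1.0, len(u))
        dst = np.linspace(0.0, 1.0, mean_len)
        # interpolate each of the three features independently
        resampled.append(np.column_stack(
            [np.interp(dst, src, u[:, k]) for k in range(u.shape[1])]))
    # averaging across the utterances gives the standard pattern
    return np.mean(resampled, axis=0)
```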

Data-matching Procedure
We calculated the similarity between the standard pattern and the input data using the inclination-limited DTW matching method used in references (3) and (5). The correspondence of matching nodes was limited to the two nearest nodes, and the weight coefficients were set to the same values as in those references.
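A minimal sketch of slope-constrained DTW is shown below. The allowed steps (1,1), (1,2), and (2,1) approximate the inclination limit; the specific weight coefficients of references (3) and (5) are not reproduced here, so all steps are weighted equally, which is an assumption.

```python
import numpy as np

def dtw_score(a, b):
    """Slope-constrained DTW distance between two feature sequences
    of shape (n_frames, 3). Lower scores mean higher similarity."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    inf = np.inf
    d = np.full((n + 1, m + 1), inf)
    d[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Euclidean distance between per-frame feature vectors
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            # inclination limit: only steps (1,1), (1,2), (2,1)
            prev = min(d[i - 1, j - 1],
                       d[i - 1, j - 2] if j >= 2 else inf,
                       d[i - 2, j - 1] if i >= 2 else inf)
            if np.isfinite(prev):
                d[i, j] = cost + prev
    return d[n, m] / (n + m)  # normalize by sequence lengths
```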

Outline of Experiment
Figure 6 shows the outline of this experiment. To consider the influence of utterance experience for each utterance condition, we calculated the similarity between all combinations of the input data for each day and the 10 kinds of standard patterns. Further, the matching experiment was performed on individual data; i.e., the same participant's input data and standard patterns were used. After the calculation, the pair having the minimum DTW score was chosen for each command and voicing state. If the chosen pair was for the same command, the command recognition was judged successful. The recognition rate was calculated using equation (4).
For the matching experiment, we used as input data 2880 utterance data acquired from the third day to the tenth day; the data consisted of 1440 non-vocalized and 1440 vocalized data (i.e., six utterances per command and five commands per participant for each of the eight days from the third day to the tenth day).
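The decision rule above can be sketched as follows. We assume equation (4) is the usual ratio (successful recognitions / input data) × 100; the `dtw_score` argument stands for any similarity function, and the data layout is hypothetical.

```python
def recognition_rate(input_data, standard_patterns, dtw_score):
    """Match each input utterance against the standard patterns of its
    own voicing state; recognition succeeds when the minimum DTW score
    falls on the pattern of the same command.

    input_data: list of (true_command, feature_sequence) pairs.
    standard_patterns: dict mapping command name -> standard pattern.
    Returns the recognition rate in percent (assumed equation (4))."""
    correct = 0
    for true_command, features in input_data:
        scores = {cmd: dtw_score(features, pattern)
                  for cmd, pattern in standard_patterns.items()}
        predicted = min(scores, key=scores.get)  # minimum DTW score
        if predicted == true_command:
            correct += 1
    return 100.0 * correct / len(input_data)
```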

Results and Discussion
Table 2 summarizes the experimental results using the same voicing condition for the input data and registered data (i.e., non-voiced input data were matched to non-voiced registered data, and voiced input data to voiced registered data). The results show that good recognition rates of over 90% were obtained in both cases. On the other hand, focusing on the difference between voicing states, the overall recognition rate of the non-vocalized data was slightly higher than that of the vocalized data. This difference appears to increase as the acquisition day becomes later; a significant difference is seen after the 9th day. Figure 7 shows the change in command recognition rates for both voicing states over the eight days and the corresponding straight lines obtained by the least-squares method. The recognition rate has a decreasing tendency in both cases, with rates of decrease of 0.1% per day for the non-vocalized data and 1.0% per day for the vocalized data. For the non-vocalized data, command recognition accuracy decreased as participants became accustomed to uttering the commands, but the effect was small. On the other hand, command recognition accuracy for the vocalized data decreased significantly as participants became accustomed to uttering the commands. In addition, the variance of the recognition rate across the eight days for the vocalized data was about 3.4 times that for the non-vocalized data. These results indicate an overall tendency in the participants' mouth movement with and without vocalization: participants moved their mouths clearly when uttering commands without vocalization. The basic reason for this difference in mouth movement is the participants' awareness during utterance: when uttering commands without vocalization, they must pay attention to their mouth movements, whereas when uttering commands with vocalization, they must pay attention to their voice. Thus, lip motion features are influenced by utterance experience when uttering commands with vocalization. It was revealed that lip motion changed with utterance experience and that the lip motion fluctuation due to utterance experience in the vocalized data was larger than in the non-vocalized data (8). Therefore, it is necessary to develop an adaptive learning method for updating the registered data to maintain high recognition accuracy.
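The least-squares trend lines and the variance comparison described above can be computed as sketched below; the daily-rate values used in the assertion are illustrative only, not the experimental data of Table 2.

```python
import numpy as np

def trend_and_variance(rates_by_day):
    """Fit a least-squares line to the daily recognition rates
    (days 3 through 10) and return the slope (% per day) and the
    variance of the rates across the eight days."""
    days = np.arange(3, 3 + len(rates_by_day))
    slope, _intercept = np.polyfit(days, rates_by_day, 1)
    return float(slope), float(np.var(rates_by_day))
```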

Conclusions
This study investigated the relationship between utterance experience and command recognition accuracy. To investigate the influence of utterance experience, we acquired lip motion data over 10 days and conducted a command recognition experiment using the acquired data, calculating similarities between the input data and the standard patterns for all combinations for each participant. Our experiments provide the following conclusions:
- Experimental results revealed that command recognition accuracy decreased slightly in both voicing states, i.e., with and without vocalization.
-The command recognition accuracy of vocalized data decreased significantly as participants became accustomed to uttering those commands.
For future work, we plan to investigate the influence of utterance experience over a longer term.We also plan to develop an adaptive learning method for updating registered data to maintain high recognition accuracy.

Fig. 2. Data acquisition schedule for a single participant.

Table 1. Commands used and the selection criteria.

Table 2. Results of experiment.