A Proposal for Creating Syllabic Datasets for Japanese Language Lipreading Using Machine Learning

Although lip-reading using image processing and machine learning has been performed at the word level, LipNet, (1) a network that enables recognition at the sentence level, improves recognition accuracy over word-level methods. However, these results were obtained for English speakers, and no experimental results have been reported for Japanese speakers. This study aims to create an experimental database of Japanese speech scenes containing all 50 Japanese sounds and to evaluate the recognition accuracy using LipNet. (1)

Keywords: machine learning, lipreading


Introduction
Machine lip-reading using image processing and machine learning has been researched for a long time. Even so, it remains considerably more difficult to recognize untrained words than trained ones. If a machine lip-reading system could reach the accuracy of speech recognition, the following applications would become possible. First, lip-reading itself is one of the tools for communicating with the hearing impaired, so better recognition would allow smoother communication. Second, it could be used in situations where voices are difficult to hear, such as through security cameras or in remote or noisy environments, where it is expected to help solve crimes. Although slightly different from security cameras, a video assistant referee (VAR) system has recently been introduced in soccer matches, where verbal abuse is often a problem. Soccer stadiums are filled with cheering, so it is difficult to judge from video exactly what was said. The use of machine lip-reading in this system may therefore prevent misjudgments.
In recent years, voice input software such as Siri and Google Assistant has become popular due to the proliferation of smartphones. Figure 1 shows that nearly half of the respondents find searching by text input troublesome. However, according to Fig. 2, 67.6% of male respondents and 74.6% of female respondents felt embarrassed about using voice input in front of others. Thus, more than half of the respondents felt embarrassed about speech input in front of others. If Japanese machine lip-reading becomes as accurate as speech recognition, it may therefore be possible to enter text without uttering a word.

Data Preprocessing
The GRID corpus (4) is processed with Dlib's (2) face detector and the iBug (3) face landmark predictor with 68 landmarks, in combination with an online Kalman filter, to apply an affine transform. Then, the region around the lips is extracted at a size of 100×50 pixels per frame.
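As a concrete illustration, the following is a minimal sketch of this lip-region extraction using Dlib (2) and the iBug (3) 68-landmark model. The model filename is the one commonly distributed with Dlib, and the simple centered crop stands in for the Kalman-filtered affine transform, which is omitted here for brevity.

```python
# Minimal sketch: crop a 100x50 lip region using Dlib's face detector
# and the iBug 300-W 68-landmark shape predictor.
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def extract_lip_region(frame_bgr, size=(100, 50)):
    """Return a 100x50 crop centered on the mouth, or None if no face is found."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = detector(gray)
    if not faces:
        return None
    shape = predictor(gray, faces[0])
    # Landmarks 48-67 outline the mouth in the 68-point iBug scheme.
    mouth = np.array([(shape.part(i).x, shape.part(i).y) for i in range(48, 68)])
    cx, cy = mouth.mean(axis=0).astype(int)
    w, h = size
    crop = frame_bgr[cy - h // 2 : cy + h // 2, cx - w // 2 : cx + w // 2]
    if crop.size == 0:
        return None
    return cv2.resize(crop, size)
```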
There were 34 subjects, each recorded speaking 1,000 different utterances. All videos were 3 seconds long with a frame rate of 25 fps. Because some data were corrupted, the actual number of usable speech scenes was 32,746. In addition, an alignment is provided for each video. An example of an alignment is shown in Fig. 3, where "sil" and "sp" denote silent intervals.
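For illustration, the following is a minimal sketch of reading such an alignment file, assuming the usual GRID format of one "start end word" triple per line; the silence tokens "sil" and "sp" are dropped when building the transcript.

```python
# Minimal sketch: read a GRID-style alignment file into a transcript.
def read_alignment(path):
    words = []
    with open(path) as f:
        for line in f:
            start, end, token = line.split()
            if token not in ("sil", "sp"):  # skip silence markers
                words.append(token)
    return " ".join(words)
```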

Dlib (2)
Dlib (2) is a general-purpose, cross-platform open-source software library written in C++. It includes components that handle many tasks, such as networking, threading, GUIs, data structures, linear algebra, statistical machine learning, image processing, data mining, XML and text analysis, numerical optimization, and Bayesian networks. Much of the recent development has focused on the creation of statistical machine learning tools.

iBug Facial Landmark Predictor
The trained model was built using the iBug (3) 300-W dataset. Using this model together with the previously mentioned Dlib (2) face detector, we obtain the 68 landmarks shown in Fig. 4.

LipNet (1)
Traditionally, lip-reading has been performed by word classification only. LipNet (1) is the first lip-reading model to operate at the sentence level. It achieved an accuracy of 95.2%, surpassing both experienced human lip readers and the previous word-level state-of-the-art accuracy of 86.4%. Fig. 5 shows the LipNet (1) network. A sequence of T frames is input and processed by three layers of spatiotemporal convolutional neural networks (STCNNs), which are convolutional neural networks that convolve over space and time. The STCNN output is then processed in both directions by gated recurrent units (GRUs), a type of recurrent neural network (RNN). LipNet (1) is trained with connectionist temporal classification (CTC).
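As a rough illustration, the following is a condensed PyTorch sketch of this architecture: three STCNN blocks, a bidirectional GRU, and a linear layer producing per-frame logits for the CTC loss. Channel sizes and kernel shapes follow the LipNet (1) paper, but details such as dropout and initialization are omitted, so treat this as an assumption-laden sketch rather than a faithful reimplementation.

```python
# Condensed sketch of a LipNet-style network: 3x STCNN -> Bi-GRU -> CTC logits.
import torch
import torch.nn as nn

class LipNetSketch(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        self.stcnn = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=(3, 5, 5), stride=(1, 2, 2), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),
            nn.Conv3d(32, 64, kernel_size=(3, 5, 5), stride=(1, 1, 1), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),
            nn.Conv3d(64, 96, kernel_size=(3, 3, 3), padding=(1, 1, 1)),
            nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),
        )
        # With 100x50 input crops, the spatial map shrinks to 6x3 (96 channels).
        self.gru = nn.GRU(input_size=96 * 3 * 6, hidden_size=256,
                          num_layers=2, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(512, vocab_size + 1)  # +1 for the CTC blank symbol

    def forward(self, x):            # x: (batch, 3, T, 50, 100)
        feats = self.stcnn(x)        # (batch, 96, T, 3, 6)
        b, c, t, h, w = feats.shape
        feats = feats.permute(0, 2, 1, 3, 4).reshape(b, t, c * h * w)
        out, _ = self.gru(feats)     # (batch, T, 512)
        return self.fc(out)          # per-frame logits for the CTC loss
```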

An STCNN (spatiotemporal convolutional neural network) can process video data by convolving across time as well as the spatial dimensions:

[stconv(x, w)]_{c'tij} = \sum_{c=1}^{C} \sum_{t'=1}^{k_t} \sum_{i'=1}^{k_w} \sum_{j'=1}^{k_h} w_{c'ct'i'j'} \, x_{c,\, t+t',\, i+i',\, j+j'},

where x is the input with C channels and w is a kernel of size k_t × k_w × k_h with C' output channels.
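As a quick sanity check, this stconv definition corresponds to a standard 3D convolution in cross-correlation form; the following sketch verifies one output element numerically with PyTorch.

```python
# Numeric check that conv3d computes the stconv sum defined above
# (cross-correlation form, no padding or bias).
import torch
import torch.nn.functional as F

x = torch.randn(1, 2, 4, 5, 5)   # (batch, C, T, H, W)
w = torch.randn(3, 2, 3, 3, 3)   # (C', C, k_t, k_w, k_h)
y = F.conv3d(x, w)               # output shape: (1, 3, 2, 3, 3)

# Recompute output element [c'=0, t=0, i=0, j=0] directly from the definition.
manual = (w[0] * x[0, :, 0:3, 0:3, 0:3]).sum()
assert torch.allclose(y[0, 0, 0, 0, 0], manual, atol=1e-5)
```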

WER and CER
Recognition accuracy in LipNet (1) is evaluated using the word error rate (WER) (5) and the character error rate (CER). WER (5) is defined as

WER = (S + D + I) / N,

where S, D, and I are the numbers of substituted, deleted, and inserted words between the predicted and reference transcripts, and N is the number of words in the reference. CER is defined analogously at the character level.
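A minimal sketch of computing these metrics via the Levenshtein distance follows; wer operates on word sequences and cer on character sequences.

```python
# Minimal sketch: WER and CER via Levenshtein (edit) distance.
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)]

def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    return edit_distance(ref, hyp) / len(ref)

def cer(reference, hypothesis):
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
```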

Data Set Creation
The dataset was created following the structure of the GRID corpus (4) used in LipNet, (1) in the order "command + color + preposition + letter + digit + adverb"; the GRID corpus (4) contains all 26 letters of the alphabet. Accordingly, the words were chosen so that they contained all 50 Japanese sounds. Fig. 6 shows the structure of the GRID corpus. (4) The data was recorded on an iPad (7th generation) at a frame rate of 30 fps. Then, to apply the LipNet (1) structure, the region around the lips was extracted at a size of 100×50 pixels per frame using a combination of the Dlib (2) face detector and the iBug (3) face landmark predictor with 68 landmarks. Furthermore, since the number of frames differed between speech scenes, we fitted all speech scenes to 400 frames by padding with the final closed-lip frame.
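A minimal sketch of this frame-count normalization follows, assuming each speech scene is a sequence of 100×50 lip crops; truncating overlong clips is our assumption and is not stated in the text.

```python
# Minimal sketch: normalize every speech scene to 400 frames by
# repeating the final (closed-lip) frame.
import numpy as np

def pad_to_length(frames, target=400):
    """Pad (or truncate) a speech scene to `target` frames."""
    frames = list(frames)
    if len(frames) < target:
        frames += [frames[-1]] * (target - len(frames))
    return np.stack(frames[:target])   # e.g. shape (400, 50, 100, 3)
```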

Experiment 1
First, we filmed the speech scenes in Table 1 for 15 subjects (225 speech scenes) and applied them to the LipNet (1) structure. The resulting WER (5) and CER are shown in Table 2.

Fig. 6. An example of a created alignment.

Table 1. Created datasets
word order  verb    color   preposition  letter  digit  modifier
1           wakaru  aka     kara         e       ichi   totemo
2           wakaru  midori  made         ko      ni     marude
3           wakaru  shiro   yori         su      san    sibaraku
4           wakaru  ao      kara         se      yon    totemo
5           wakaru  kiiro   made         so      roku   marude
6           tatsu   aka     yori         nu      roku   sibaraku
7           tatsu   midori  kara         ne      nana   totemo
8           tatsu   shiro   made         hi      hachi  marude
9           tatsu

As shown in Table 2, the recognition accuracy is very low. One possible reason is that LipNet (1) itself is not suited to all Japanese sounds. Here, we focus instead on the small amount of data and on data augmentation, a method of increasing the amount of training data for machine learning. Augmentation methods include "contrast adjustment", "scaling", and "flipping".

Experiment 2
Among the augmentation methods described above, we expected left-right flipping to be effective because human faces are generally asymmetrical. Therefore, by flipping each scene left-right, the data for 30 people (450 speech scenes), including the original data from Experiment 1, were applied to the LipNet (1) structure. The accuracy increased slightly compared with Experiment 1. Therefore, under the conditions of this experiment, we consider that data augmentation by left-right flipping is effective and that accuracy increases as the amount of data increases.
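A minimal sketch of this left-right flipping follows, assuming each preprocessed speech scene is a NumPy array of shape (T, H, W, C).

```python
# Minimal sketch: double the dataset by mirroring each speech scene
# along the width axis, as in Experiment 2.
import numpy as np

def flip_scene(frames):
    """Mirror every frame of a speech scene left-right."""
    return frames[:, :, ::-1, :].copy()

def augment_dataset(scenes, labels):
    """Return originals plus flipped copies (doubles the data)."""
    augmented = list(scenes) + [flip_scene(s) for s in scenes]
    return augmented, list(labels) * 2
```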