Development of Action-Recognition Technology Using LSTM Based on Skeleton Data

Recently, the population of workers aged over 55 years has been growing at work sites. Middle-aged adult workers have a higher occurrence rate of work accidents than younger workers. Therefore, it is necessary to develop a safety-management system to ensure their safety. This paper proposes an approach for the recognition of human actions based on human-skeleton data as part of such a system for the construction industry. The proposed approach consists of four processes: i) extraction of skeleton data from captured video data, ii) interpolation of missed skeleton joint points, iii) calculation of features from the interpolated skeleton data, and iv) construction of an action-recognition model using the interpolated data and calculated features. We evaluated the action-recognition accuracy for 5 types of actions performed by 6 subjects. The evaluation achieved a high average recognition accuracy of 93.1%. The results reveal that the proposed approach can be used to recognize actions from video data and that the interpolation methods significantly improve its action-recognition accuracy.


Introduction
Recently, with the development and improvement of deep learning, industrial and research communities have increasingly focused on deep learning for practical applications, such as work assistance, video-based monitoring, and surveillance. In the last decade, the population of middle-aged adults, those aged 55-65 years, has been growing at work sites, especially in the construction industry, where the proportion of older workers is higher than in all industries combined (1). In general, older workers have a wealth of knowledge and experience that equips them with the ability to make correct decisions and lead groups to success in business (2). However, as physical and mental functioning declines with age, older workers have a higher occurrence rate of work accidents than younger workers. Therefore, it is necessary to develop a safety-management system to ensure the safety of elderly workers at construction sites. Fig. 1 shows the safety-management system working in the workplace.
At present, many researchers have proposed methods using 3D skeleton data or RGB-D sensors to recognize workers' actions and thereby improve productivity and safety on construction sites (3,4). However, depending on the construction site, it may be difficult to set up multiple devices in a limited space. On the other hand, image-based action-recognition approaches have achieved significant results and can recognize the basic work actions of construction workers in the workplace (5,6). A problem remains, however: training data are insufficient because the worker's body is hidden and action features cannot be acquired correctly when the detected person is carrying large goods or interacting with construction equipment. Since previous work on action recognition has not examined this issue in detail, an interpolation approach is needed to complement action data from construction workers.
In this paper, we propose an approach for the recognition of human actions using long short-term memory (7) (LSTM) based on human-skeleton data as part of the management system. In this approach, the human-skeleton data are extracted from the classified image data with OpenPose (8) and interpolated using four methods when missing points exist. Furthermore, features such as joint-joint orientation, joint-joint distance, and frame-frame trajectory are calculated from the interpolated skeleton data. Finally, a time-sequence dataset is created from the skeleton data and features to train the classification model. A performance-evaluation experiment determined the usability of the calculated time-series features for action recognition. Finally, precision and recall were demonstrated using the pattern with the highest recognition accuracy in the cross-validation.

Data
To verify that the proposed approach has the ability to recognize human actions, 5 kinds of basic workplace actions were simulated in our video data; they are defined below.
(1) Walking: a person starts walking at a constant speed while waving their hands.
(5) Climbing down: a person starts climbing down from a stepladder step by step, using their hands.
The video data were recorded at 1920×1080 pixels and 60 fps. A total of 6 subjects, comprising 4 men and 2 women, participated in this experiment. The actions of each subject were recorded 3 times with a monocular camera (FDR-AX60, SONY) under general fluorescent lights (500-700 lx). In particular, walking actions were captured from the left and right as subjects walked between left and right, as well as when the subjects approached and moved away from the camera. Moreover, squatting and climbing actions were captured from the front, side, and back. Fig. 2 shows the data-acquisition environment in the experiment. The data used in this investigation were acquired in accordance with the ethical regulations concerning studies involving humans at Akita University, Japan.

Overview
As Fig. 3 shows, first, the video data of the subjects are manually classified into each action frame by frame. Then, the skeleton data are extracted. Furthermore, the raw skeleton data are interpolated, as some points are missing. In addition, we calculate three time-sequential features to quantify the action information from the skeleton data. Finally, the interpolated skeleton data and features are combined to train the LSTM model, and the trained model is used to recognize the actions of each subject.

Human-Skeleton Data Extraction
In this research, OpenPose with the BODY_25 (8) model was used to obtain 25 skeletal joint points and detection confidences, which are accuracy scores (0.0-1.0) for the currently detected joint points, from each frame of the video data. Owing to the camera angle, parts of the subject's body may be hidden and undetectable, so the raw skeleton data contain many missing points. For this reason, we define a missing value as the coordinate of a joint location whose detection confidence is lower than 0.3.
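As an illustration, the thresholding step can be sketched in Python as follows. This is a minimal sketch: the function name and the NaN convention for marking missing points are assumptions for illustration, while the (x, y, confidence) layout of the 25 BODY_25 joints follows OpenPose's output format.

```python
import numpy as np

CONF_THRESHOLD = 0.3  # confidence below this marks a joint as "missing" (as in the text)

def mask_low_confidence(keypoints):
    """Replace joint coordinates whose detection confidence is below the
    threshold with NaN so that later interpolation steps can locate them.

    keypoints: array of shape (25, 3) -- (x, y, confidence) per BODY_25 joint.
    Returns an array of shape (25, 2) with missing joints set to NaN.
    """
    kp = np.asarray(keypoints, dtype=float)
    coords = kp[:, :2].copy()
    coords[kp[:, 2] < CONF_THRESHOLD] = np.nan
    return coords
```

The NaN markers make the missing points easy to detect downstream with `np.isnan` before interpolation.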

Interpolation of Skeleton Data
Owing to the camera angles, such as from the side or back, some joint locations of the subject are hidden by the body or the stepladder. To reduce the influence of missing points, we propose four interpolation methods to process the raw skeleton data. The interpolation procedures using the four methods are listed below.
(1) To interpolate symmetric joint points (e.g., hands and feet), the coordinate values of a missing point are interpolated from the corresponding point on the opposite side, which could be detected. (2) To interpolate asymmetric missing points, where a limited number of frames (n) lie between two frames of detectable points (P), the coordinate values of the missing points (Pm) are interpolated linearly, adding increments calculated from the coordinates of the first and second detected points. Formula (1) shows the calculation method:

Pm(t + i) = P(t) + (i / n) {P(t + n) − P(t)}     (1)

where t is the position number of the frame in which the joint was first detected, t + n is the position number of the frame in which it was detected again, and i is the position number of frames counted from the first detectable frame of the joint point. Formulas (2) and (3) show the calculation methods.
where t is the position number of the frame 5 frames before or after the first frame of the missing point in the video data, and i is the position number of frames counted from the first detectable frame of the joint point. Then, using trigonometric functions, the coordinates of the missing eye points are obtained.
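The linear interpolation of Formula (1) can be sketched as follows, assuming that missing coordinates are marked with NaN. Note that `np.interp` also clamps NaNs at the edges of a track to the nearest detected value, which differs from the paper's extrapolation methods and is used here only to keep the sketch short.

```python
import numpy as np

def interpolate_gap(track):
    """Linearly interpolate runs of missing (NaN) values in a 1-D
    joint-coordinate track, following Formula (1): for n - 1 missing
    frames between detections at t and t + n,
        Pm(t + i) = P(t) + (i / n) * (P(t + n) - P(t)).
    """
    track = np.asarray(track, dtype=float).copy()
    missing = np.isnan(track)
    idx = np.arange(len(track))
    # np.interp fills interior NaNs linearly from the surrounding detections
    track[missing] = np.interp(idx[missing], idx[~missing], track[~missing])
    return track
```

Applying the function to each coordinate (x and y) of each joint independently reconstructs short occluded segments of the trajectory.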

Feature Extraction
In this research, 3 kinds of features were calculated to quantify the action information using the interpolated skeleton data. ⅰ) A total of 21 orientation features were extracted from pairs of joint points, of which 15 were based on the neck point paired with arbitrary points within points 2 to 14 (8), and the other 10 were extracted between pairs of points within the same range of joint points. ⅱ) The distance features were extracted from two arbitrary points within the range from points 0 to 14 in the same frame. ⅲ) The trajectory features were extracted from the same point in 2 consecutive frames.
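The three kinds of features can be sketched as follows. The exact definitions (angle in radians for orientation, Euclidean distance, and frame-to-frame displacement for trajectory) are plausible interpretations of the text, not formulas stated in it.

```python
import numpy as np

def orientation(p, q):
    """Orientation feature: angle (radians) of the vector from joint p to joint q."""
    return np.arctan2(q[1] - p[1], q[0] - p[0])

def distance(p, q):
    """Distance feature: Euclidean distance between two joints in the same frame."""
    return float(np.hypot(q[0] - p[0], q[1] - p[1]))

def trajectory(p_prev, p_curr):
    """Trajectory feature: displacement of the same joint across 2 consecutive frames."""
    return (p_curr[0] - p_prev[0], p_curr[1] - p_prev[1])
```

Each function takes (x, y) joint coordinates; evaluating them over the stated joint-pair ranges yields the per-frame feature vector.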
The interpolated skeleton coordinates and obtained features were used to create a time-sequence dataset to train the LSTM model and test its performance.

Training process of LSTM model
To recognize the actions captured from the subjects using the coordinates and features, we designed an LSTM model to learn the patterns of actions from time-series features over an arbitrary set of 15 consecutive frames. Structurally, the LSTM model has 6 layers: a fully connected input layer with 15 neurons, into which the dataset is fed in the set shape; middle LSTM layers with 128, 64, 64, and 32 neurons; and a fully connected output layer with 5 neurons and the Softmax function. Adam (9) was used as the optimization function when training the model.
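The following Keras code shows one possible realization of this architecture. The input shape of 15 frames × 306 features follows the dataset description, while modeling the 15-neuron input layer as a time-distributed fully connected layer is an assumption about the original implementation.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Hypothetical sketch of the 6-layer LSTM model described in the text.
model = keras.Sequential([
    keras.Input(shape=(15, 306)),              # 15 consecutive frames, 306 features each
    layers.TimeDistributed(layers.Dense(15)),  # fully connected input layer (15 neurons)
    layers.LSTM(128, return_sequences=True),   # middle LSTM layers: 128, 64, 64, 32
    layers.LSTM(64, return_sequences=True),
    layers.LSTM(64, return_sequences=True),
    layers.LSTM(32),
    layers.Dense(5, activation="softmax"),     # 5 action classes
])
model.compile(optimizer="adam",                # Adam optimizer, as in the text
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```

Stacking LSTM layers requires `return_sequences=True` on all but the last so each layer passes the full 15-step sequence to the next.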

Dataset creation
In this study, image data were obtained from the video data using FFmpeg (10), which converts a video stream to images at a rate of 29.97 fps. Then, action-image data were manually selected so that only images in which the action occurs are included. Samples of the dataset were obtained by scanning 15 consecutive frames and shifting by 1 frame in time-series order, yielding a total of 116,028 samples, as shown in Table 1. Each sample includes 15 consecutive frames, from which 306 features (e.g., coordinates, orientation, distance, trajectory) were calculated. The samples were then used as input to the trained LSTM model, which recognized the type of action (e.g., walking, squatting up, climbing up) for each sample.
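The sliding-window procedure above can be sketched as follows; the function name and array layout are illustrative assumptions.

```python
import numpy as np

def make_windows(features, window=15, step=1):
    """Cut a (num_frames, num_features) time series into overlapping samples
    of `window` consecutive frames, shifting by `step` frame(s) each time,
    mirroring the dataset-creation procedure in the text."""
    features = np.asarray(features)
    n = features.shape[0]
    return np.stack([features[i:i + window]
                     for i in range(0, n - window + 1, step)])
```

A clip of N frames thus yields N − 14 samples, which is how the dataset grows to 116,028 samples across all clips.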

Contents of Experiment
In this research, we evaluated the proposed approach with the dataset built from the time-series coordinates and calculated features of the captured video data of 6 subjects. All video data from 5 subjects were used as training data, and the video data from the remaining subject were used as testing data to evaluate the performance of the proposed approach. Furthermore, cross-validation was performed on six different combinations (Patterns 1-6) of training and testing subjects. Recognition was counted as successful if the time-series features loaded into the trained LSTM model were predicted as the same action as that manually classified for the same video data. The recognition success rate was compared for each predicted video-data label, with the LSTM model trained for 10 to 100 epochs in increments of 1.
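The six patterns correspond to leave-one-subject-out splits, which can be sketched as follows (the subject labels A-F are illustrative; the text names only subject C explicitly).

```python
SUBJECTS = ["A", "B", "C", "D", "E", "F"]

def loso_patterns(subjects):
    """Yield (train_subjects, test_subject) pairs, one per
    cross-validation pattern: each subject is held out once."""
    for test in subjects:
        train = [s for s in subjects if s != test]
        yield train, test
```

Holding out a whole subject (rather than random frames) ensures the model is tested on a person it has never seen, which matches the generalization claim being evaluated.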

Experiment Results
In this experiment, two performance results were obtained using the same LSTM model settings with different datasets. One result was obtained using datasets without interpolated coordinates, whose time-series features were calculated without interpolated joint points; the other was obtained using datasets containing interpolated coordinates, whose time-series features were calculated from the interpolated joint points. A comparison of the two results, showing the highest average recognition accuracies and the recognition accuracy of each pattern at specific epochs, is given in Table 2.
In contrast, the recognition accuracy improved significantly when the datasets were interpolated using the proposed approach. The approach using the interpolation methods obtained recognition accuracies that varied with the number of epochs, with a maximum of 94.5% and a minimum of 87.7% after 20 epochs. Furthermore, the average recognition accuracy was highest at the 81st epoch, with all subjects achieving a recognition accuracy of 89.0% or higher. In addition, when subject C was used as the test data with interpolation, the maximum recognition accuracy was 97.0%. The confusion matrix with precision and recall, calculated from the video data with the highest recognition rate (Pattern 3 at the 81st epoch), is shown in Table 3. Maximum values of 100.0% for precision and 98.0% for recall were obtained using the proposed method. Therefore, it is clear that the proposed approach can achieve high action-recognition accuracy. However, it incorrectly predicted climbing up as climbing down. We consider this action difficult to learn because of the similarity of the climbing movements in the two directions. Moreover, these action data comprised only a few frames, from which it was difficult to recognize the difference between two similar actions. Accordingly, the proposed approach needs further discussion regarding the appropriate number of frames in the action data. Additionally, more kinds of actions and subjects need to be captured to test the robustness of the proposed approach.

Conclusions
In this paper, we proposed an action-recognition approach that uses four interpolation methods to process human-skeleton data for training an LSTM model. Based on the experiment, it was determined that ⅰ) the proposed approach can be used to correctly recognize basic workplace actions from video data and ⅱ) the interpolation methods can improve its performance. We have not yet had the opportunity to obtain data from actual construction sites. The versatility and usefulness of the proposed approach will be examined using simulated construction-site action data in the future. In addition, the level of accuracy required in actual situations will be considered in further work.