Behavioral Estimation for Multiple Possession Positions Using Smartphone Accelerometers

With the widespread use of smartphones and wearable devices, a variety of research has been conducted using their built-in sensors. For example, height estimation and road condition estimation have been performed. In addition, behavior estimation of the smartphone holder, possession position estimation, and person identification have also been conducted. However, most measurement data are collected with the possession position fixed at a single location, so the variation in possession position that occurs in practice is not considered when estimating behavior. In this research, we aim to estimate a person’s behavior while considering multiple possession positions, which is necessary when behavior estimation is deployed as a practical system. In addition, by treating the time series data acquired by the 3-axis acceleration sensor as 2-dimensional images using the GAF algorithm (1), classification is performed by machine learning.


Introduction
In recent years, smartphones and wearable devices such as smartwatches have become popular. Along with this, various studies using their built-in sensors have been conducted. For example, efforts are being made to estimate a person's behavior, state, and height, and to estimate the snow cover on roads. Another advantage of sensors is that they protect privacy better than data acquisition by cameras. Some systems use sensors to assist in monitoring the elderly. The following describes previous research on human behavior recognition and individual identification that is highly relevant to this research.

Person identification based on acceleration sensors for multiple walking states (3)
Manabe et al. used a smartphone accelerometer to recognize three walking states (level walking, stair climbing, and stair descending) and then used a discriminator defined for each walking state to identify the person. The gait state recognition rate was 95.7%. The person identification rates were 85.0% (level walking), 90.0% (stair climbing), and 77.0% (stair descending). When gait state recognition and person identification were performed step by step, the person identification rate was 80.4%.

Activity Recognition using Cell Phone Accelerometers (4)
Jennifer R. Kwapisz et al. acquired acceleration data for "walking," "jogging," "stair climbing," "sitting," and "standing" using smartphones carried in the front pockets of 29 subjects (the WISDM dataset). The best discrimination accuracy, 91.7%, was achieved using a multilayer perceptron.

Human Activity Recognition Based on Gramian Angular Field and Deep Convolutional Neural Network (5)
Hongji Xu et al. proposed a new network, Mdk-ResNet, for large-scale time-series human behavior datasets, including the aforementioned WISDM dataset (4), which achieves higher recognition accuracy than conventional methods. On the WISDM dataset (4) in particular, the accuracy is 9.88% higher than that of the MLP-based method by Jennifer R. Kwapisz et al.
Although various studies have been conducted as described above, most of them assume that the smartphone is held in a single fixed position. In reality, however, a smartphone is not always carried in the same place. Therefore, the purpose of this study is to estimate the behavior of a person carrying a smartphone while taking multiple possession positions into account.

GAF Algorithm (1)
The GAF algorithm (1) was proposed by Zhiguang Wang for converting 1D time series data into 2D image data. First, given a time series X = {x_1, x_2, ..., x_n} of n real-valued observations, the data are normalized so that all values fall in the interval [-1, 1]. Next, the normalized one-dimensional time series is transformed from a Cartesian coordinate system to a polar coordinate system.
When transformed to the polar coordinate system, the angle ϕ lies in [0, π/2] if the data are normalized to [0, 1], and in [0, π] if they are normalized to [-1, 1]. After the normalized time series is transformed into the polar coordinate system, the temporal correlation between each pair of sampling points can be expressed in terms of angle by calculating the trigonometric sum and difference between them (6). Based on this, the Gramian Summation Angular Field (GASF) and the Gramian Difference Angular Field (GADF) are defined, with entries cos(ϕ_i + ϕ_j) and sin(ϕ_i − ϕ_j), respectively.
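The transformation described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the implementation used in this study: the function names `rescale` and `gramian_angular_fields` are hypothetical, min-max scaling to [-1, 1] is assumed as the normalization, and the radial (time stamp) coordinate of the polar representation is omitted because it does not enter the GASF or GADF.

```python
import numpy as np


def rescale(x, low=-1.0, high=1.0):
    """Min-max rescale a 1D series into [low, high] (assumed normalization)."""
    x = np.asarray(x, dtype=float)
    x_min, x_max = x.min(), x.max()
    if x_max == x_min:
        # A constant series has no range; map it to the interval midpoint.
        return np.full_like(x, (low + high) / 2.0)
    return low + (x - x_min) * (high - low) / (x_max - x_min)


def gramian_angular_fields(x):
    """Return the (GASF, GADF) pair for a 1D series as two n x n arrays."""
    # Clip to guard against values slightly outside [-1, 1] due to rounding.
    x_tilde = np.clip(rescale(x), -1.0, 1.0)
    # Polar encoding: the angle phi lies in [0, pi] for data in [-1, 1].
    phi = np.arccos(x_tilde)
    # Broadcasting builds all pairwise angle sums and differences.
    gasf = np.cos(phi[:, None] + phi[None, :])
    gadf = np.sin(phi[:, None] - phi[None, :])
    return gasf, gadf
```

By construction the GASF is symmetric and the GADF is antisymmetric with a zero diagonal, which is a quick sanity check on any implementation.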

Dilated Convolution
Using large convolutional kernels increases the number of network parameters and the computational cost. In addition, pooling layers usually cause some information loss. Dilated convolution addresses these problems by inserting spacing between kernel elements, which enlarges the receptive field without adding parameters or losing information.
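To make the effect of the kernel spacing concrete, the following sketch computes the span covered by a dilated kernel, k + (k - 1)(d - 1) for kernel size k and dilation rate d, and applies a dilated kernel to a 1D signal. The helper names are hypothetical, and the 1D, unpadded setting is chosen only for readability.

```python
import numpy as np


def effective_kernel_size(kernel_size, dilation):
    """Span covered by a kernel whose taps are `dilation` samples apart."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def dilated_conv1d(signal, kernel, dilation=1):
    """Unpadded 1D cross-correlation with a dilated kernel."""
    signal = np.asarray(signal, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    span = effective_kernel_size(len(kernel), dilation)
    n_out = len(signal) - span + 1
    if n_out <= 0:
        raise ValueError("signal is shorter than the dilated kernel span")
    # Offsets of the kernel taps relative to the output position.
    taps = np.arange(len(kernel)) * dilation
    return np.array([np.dot(signal[i + taps], kernel) for i in range(n_out)])


# A 3-tap kernel covers 3 samples at dilation 1 and 5 samples at dilation 2,
# while the number of weights stays at 3 in both cases.
span_normal = effective_kernel_size(3, 1)
span_dilated = effective_kernel_size(3, 2)
```

The same three weights therefore see a wider context as the dilation rate grows, which is the property the text refers to as increasing the receptive field.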

Mdk-Res Module
The features of the previous layer are input to the Mdk-Res module, which processes them with four convolution blocks. The results of the four blocks and the input features are then combined to form the output. The four convolutions use padding="same" and stride=1 so that the output has the same dimensions as the input. The green block is the previous layer of the network, the peach block has a kernel size of 1 × 1, the blue block has a kernel size of 3 × 3, the yellow block has a kernel size of 3 × 3 and a dilation rate of 2, and the red block has a kernel size of 3 × 3 and a dilation rate of [...]. The Mdk-Res module uses multiple normal and dilated convolution kernels simultaneously, which helps the network extract features at various scales. In addition, as in a regular ResNet, residual learning is used to suppress the vanishing gradient problem.
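The structure of the module can be illustrated with a single-channel NumPy sketch. Several points here are assumptions, not details from the text: the branch outputs and the input are combined by element-wise addition, activation and normalization layers are left out, the function names are hypothetical, and the dilation rate of the fourth branch is a placeholder supplied by the caller.

```python
import numpy as np


def conv2d_same(x, kernel, dilation=1):
    """Single-channel 2D cross-correlation, stride 1, zero 'same' padding.

    Assumes odd kernel sizes so the padding is symmetric.
    """
    x = np.asarray(x, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    kh, kw = kernel.shape
    ph = dilation * (kh - 1) // 2
    pw = dilation * (kw - 1) // 2
    padded = np.pad(x, ((ph, ph), (pw, pw)))
    h, w = x.shape
    out = np.zeros((h, w), dtype=float)
    for i in range(kh):
        for j in range(kw):
            r, c = i * dilation, j * dilation
            # Each kernel tap contributes a shifted copy of the input.
            out += kernel[i, j] * padded[r:r + h, c:c + w]
    return out


def mdk_res_block(x, kernels, dilations):
    """Sum of parallel (dilated) convolution branches plus the input.

    Element-wise addition is an assumed way of combining the branches.
    """
    x = np.asarray(x, dtype=float)
    branches = [conv2d_same(x, k, d) for k, d in zip(kernels, dilations)]
    return x + sum(branches)


# Illustrative usage with random weights: 1x1, 3x3, 3x3 with dilation 2,
# and 3x3 with a placeholder dilation rate for the fourth branch.
PLACEHOLDER_DILATION = 3  # hypothetical value, not stated in the text
rng = np.random.default_rng(0)
features = rng.normal(size=(16, 16))
kernels = [rng.normal(size=(1, 1))] + [rng.normal(size=(3, 3)) for _ in range(3)]
output = mdk_res_block(features, kernels, [1, 1, 2, PLACEHOLDER_DILATION])
```

Because every branch uses stride 1 and "same" padding, `output` has the same height and width as `features` regardless of the dilation rates, which is what allows the residual connection to be applied directly.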

Data Acquisition
We defined four smartphone possession positions, "front pocket (screen facing outward)," "back pocket (screen facing outward)," "swinging arms (screen facing inward)," and "looking at the screen," and three actions, "walking," "stair climbing," and "stair descending." Three-axis acceleration was measured for each class with an iPhone SE (2nd generation). The sampling rate was set to 20 Hz, as in the WISDM dataset (4). Approximately 30 seconds of data per class were acquired from each of five subjects.

Data Preprocessing
The acquired data were segmented with a window size of 64 and an overlap of 50%, as in the study by Hongji Xu et al. The GAF algorithm was then used to convert each window of time series data into two-dimensional images. Since GASF and GADF images were generated for each of acceleration x, acceleration y, and acceleration z, six image channels were generated per window.
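The preprocessing steps above can be sketched as follows. The window size of 64 and the 50% overlap are taken from the text; the function names, the per-window min-max normalization to [-1, 1], and the channel order (GASF then GADF for each of x, y, z) are assumptions made for illustration.

```python
import numpy as np

WINDOW_SIZE = 64  # samples per window (from the text)
OVERLAP = 0.5     # 50% overlap between consecutive windows (from the text)


def sliding_windows(samples, window_size=WINDOW_SIZE, overlap=OVERLAP):
    """Split (n_samples, 3) acceleration data into overlapping windows."""
    samples = np.asarray(samples, dtype=float)
    if len(samples) < window_size:
        raise ValueError("recording is shorter than one window")
    step = int(window_size * (1.0 - overlap))
    starts = range(0, len(samples) - window_size + 1, step)
    return np.stack([samples[s:s + window_size] for s in starts])


def gaf_pair(series):
    """GASF and GADF of one axis, normalized per window (assumption)."""
    lo, hi = series.min(), series.max()
    if hi == lo:
        scaled = np.zeros_like(series)
    else:
        scaled = 2.0 * (series - lo) / (hi - lo) - 1.0
    phi = np.arccos(np.clip(scaled, -1.0, 1.0))
    gasf = np.cos(phi[:, None] + phi[None, :])
    gadf = np.sin(phi[:, None] - phi[None, :])
    return gasf, gadf


def window_to_image(window):
    """Convert one (window_size, 3) window into a 6-channel image."""
    channels = []
    for axis in range(window.shape[1]):
        gasf, gadf = gaf_pair(window[:, axis])
        channels.extend([gasf, gadf])  # assumed order: GASF, GADF per axis
    return np.stack(channels)


# Synthetic stand-in for a 30-second recording at 20 Hz (600 samples).
recording = np.random.default_rng(0).normal(size=(600, 3))
windows = sliding_windows(recording)
image = window_to_image(windows[0])
```

With these settings a 600-sample recording yields 17 windows, and each window becomes an array of shape (6, 64, 64).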

Learning and Results
The time series data were converted into 2D images using the GAF algorithm (1) and then used to train Mdk-ResNet (5). The train:val:test ratio was set to 6:2:2. Adam was used as the optimizer, and the learning rate was set to 0.001. Early stopping was used to prevent overfitting. The following table shows the learning curve and the values of Accuracy, Precision, Recall, and F-Measure.
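As a rough illustration of this setup, the sketch below shows a random 6:2:2 split of sample indices and a patience-based early stopping rule. The ratio comes from the text; the helper names, the random (non-stratified) split, the monitored quantity (validation loss), and the patience value are assumptions, since the text does not specify them. The model and the Adam optimizer are not reproduced here.

```python
import numpy as np


def split_indices(n_items, ratios=(0.6, 0.2, 0.2), seed=0):
    """Shuffle item indices and split them into train, val, and test parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_items)
    n_train = int(round(n_items * ratios[0]))
    n_val = int(round(n_items * ratios[1]))
    train = order[:n_train]
    val = order[n_train:n_train + n_val]
    test = order[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test


class EarlyStopping:
    """Stop when the validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=5):  # patience value is an assumption
        self.patience = patience
        self.best_loss = float("inf")
        self.bad_epochs = 0

    def should_stop(self, val_loss):
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Illustrative size: 12 classes with roughly 100 images each.
train_idx, val_idx, test_idx = split_indices(1200)
```

In a training loop, `should_stop` would be called once per epoch with the current validation loss, and training would end the first time it returns True.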

Conclusions
In this study, we created a dataset that takes the possession position into account, trained Mdk-ResNet on it, and evaluated the accuracy of estimating simple actions. The resulting accuracy was high, indicating that the classes can be discriminated sufficiently well. However, the amount of data used in this study was somewhat small for machine learning, at about 100 images per class after conversion. Therefore, in the future, we would like to increase the amount of data, add possession positions and behaviors that were not acquired this time to increase the number of classes, and verify whether they can still be identified.