Learning-based Hand Gesture Sensing and Controlling Techniques for Automotive Electronics

The control interfaces of current automotive electronics are mostly designed in push-button or touch-panel styles. To decrease the eyes-off-the-road time and increase driving safety, this paper proposes a novel hand gesture sensing and controlling system based on signal processing and statistical machine learning techniques. Instead of traditional high-resolution cameras or depth sensors, the proposed system uses a low-cost hardware design composed of a proximity sensor chip, a digital signal processor, a microcontroller unit, and three programmable LED light sources. The learning-based method starts with the collection of a large-scale gesture database with different labels under various environmental settings. The first gesture recognition method, based on dynamic time warping similarity measurements, achieves high accuracy at the cost of heavy computation. In addition, we developed time-series feature extraction methods that transform the samples into a low-dimensional space. Three statistical learning approaches, namely linear discriminant analysis, artificial neural networks, and support vector machines, are used to train multi-class gesture classifiers in the feature space. A demo system based on an automotive music player shows that the proposed system can efficiently and effectively recognize the pre-defined control gestures and improve driving safety dramatically.


Introduction
Human-computer interaction (HCI) is an interdisciplinary research field that focuses on the design and implementation of communication methods between humans and machines. In the current user interfaces of automotive electronics, the control panels of car audio, air conditioning, and navigation systems may merely satisfy the functional requirements. From an ergonomic perspective, more intuitive control methods should be developed to decrease the operating time of automotive electronics and let drivers focus on the road ahead. In this paper, a novel hand gesture controlling system is proposed based on various signal processing and statistical machine learning techniques. This new HCI design may decrease the eyes-off-the-road time and increase driving safety dramatically.
Traditional HCI research focuses on the design of user-friendly interfaces and a comfortable user experience. With the rapid growth of novel information and communication technology, a variety of interdisciplinary applications and research issues have been developed for next-generation HCI techniques. Along with the rapid development of audio signal processing and voice recognition techniques, virtual assistants based on artificial intelligence technology have been developed and integrated into mobile devices, such as the well-known Siri in iOS. Kounoudes et al. (1) also proposed a user authentication procedure that analyzes each user's specific voice.
HCI focuses not only on the interaction between users and machines but also on human-centered design. Yang et al. (2) conducted face detection research to improve the interaction between users and machines. Pantic et al. (3) surveyed more than 100 state-of-the-art facial expression analysis methods, which are also important for further emotional design in human-centric HCI.
Currently, the touch panel is the most common way to control smartphones and tablet PCs. Several advanced operational functions, such as multi-touch and customized gestures, have been developed to improve the usability of touch panel devices. Moreover, users can control machines without touching them by performing gestures in the air. Air-gesture sensing has been developed for a long time. Panwar et al. (4) provided a real-time recognition system that analyzes basic gesture features, including direction, finger patterns, and the position of each finger. Shimada et al. (5) studied the relation among the rotational directions of hands. Panwar et al. (4) also used a single camera to develop a hand gesture recognition algorithm with four major steps: segmentation, orientation detection, feature extraction, and classification.
RGB-D cameras and depth sensors are also widely applied to the gesture recognition problem. Ren et al. (6) used a Kinect sensor to capture the depth information of finger parts and developed a novel part-based hand gesture recognition system. They claimed that this system is robust to cluttered backgrounds and lighting conditions. Ohn-Bar et al. (7) developed a human-machine interface application of hand gestures in the car. The RGB-D data is pre-processed in a temporal segmentation module and then classified in a gesture recognition module. Combining research on color images and depth information, Doliotis et al. (8) applied dynamic time warping techniques to compare the gesture recognition accuracy obtained with these two kinds of input data. In addition, most early works were surveyed by Rautaray et al. (9), who discussed the techniques and applications of hand gestures in HCI.
In this paper, an air-gesture recognition system is proposed to intelligently analyze the time-series signals captured by multiple distance sensors. The design concept of air-gesture techniques is to create more intuitive and convenient interaction methods for users to control electrical devices, such as radios, mobile phones, tablet PCs, and other consumer electronics.

Proposed System Overview
This section gives an overview of the proposed gesture control system, including the sensing hardware and the decision system with dynamic time warping similarity measurements. The detailed definitions of the eight control gestures are given in Section 2.2 (Gesture Design). These gestures are used in the experiments and the practical demo system.

Signal Acquisition Device
The proposed algorithm is developed based on the Si1143 proximity sensor, which is composed of three LED light sources, highly sensitive photodiodes that detect visible and infrared light, a signal converter, and a digital signal processor. The LEDs are soldered at three corners of the sensor PCB and are enabled sequentially during the signal acquisition process. As a low-power, reflectance-based, infrared proximity and ambient light sensor with an I2C digital interface, the Si1143 sensor can estimate the relative distance between an object and the sensor chip.
The P8X32A QuickStart board contains a multi-core parallel microprocessor, digital input/output pins, and temporary data storage space. Each program written in the Spin language can use 2 KB of instruction and data storage. In our signal acquisition subsystem, the distance measurements from the gesture sensor are sequentially recorded in the temporary storage space of the P8X32A QuickStart board. The computer acquires these data through a serial port connection.

Gesture Design
After the light source synchronization process resamples the input data sequence into three independent signals, each training and testing gesture sample can be represented as three 1D time series. Defining the target gestures is the first step in designing a gesture recognition and control system. These gestures should be designed in an ergonomic style so that the user feels natural and comfortable. In addition, the design has to avoid ambiguous gesture definitions, which may lead to much lower classification accuracy. For example, it is very difficult to correctly identify a complex gesture that is composed of two other simple target gestures. In this paper, eight control gestures are defined, comprising four fundamental sliding movements, three commonly used symbol drawings, and a grabbing gesture in 3D space.

Dynamic Time Warping
Dynamic Time Warping (DTW) is a temporal matching algorithm for measuring the similarity between two sequences that may differ in speed. The DTW algorithm has been widely applied in audio and video processing systems. The gesture inputs are similar to multi-channel audio data in that they are one-dimensional time-series signals. The similarity and distance between pairs of temporal signals cannot be calculated by simple Euclidean distance measurements because the signals may have different vector lengths.
The idea of DTW is to separate the temporal signals into multiple short units and to find the optimal unit matching between two time-series sequences. A learning gesture sample in our database can be represented as a vector $t = (t_1^1, t_2^1, \cdots, t_m^1, t_1^2, \cdots, t_m^3)$, and the testing vector as $r = (r_1^1, \cdots, r_n^3)$, where $m$ and $n$ are the numbers of signal units of the two samples, respectively. The goal of DTW is to find a route $\{(p_1, q_1), (p_2, q_2), \ldots, (p_k, q_k)\}$ with minimal distance measurement:

$$\min \sum_{s=1}^{k} d\left(t_{p_s}, r_{q_s}\right) \tag{1}$$

where $d(\cdot,\cdot)$ denotes the unit-to-unit distance. In addition, the estimated route should follow these constraints: (i) $(p_1, q_1) = (1, 1)$ and $(p_k, q_k) = (m, n)$; (ii) assuming a point on the optimal path is represented by $(i, j)$, there are only three possible ways to reach this point, namely from $(i-1, j)$, $(i, j-1)$, and $(i-1, j-1)$.

$D(i, j)$ is defined as the DTW similarity measurement of the signals $\{t_1, t_2, \ldots, t_i\}$ and $\{r_1, r_2, \ldots, r_j\}$ with the corresponding optimal route. To solve this problem, we can formulate the optimization in a recursive form to estimate $D(m, n)$. The recursive relationship of $D(i, j)$ can be represented as follows:

$$D(i, j) = d(t_i, r_j) + \min\{D(i-1, j),\ D(i, j-1),\ D(i-1, j-1)\} \tag{2}$$

Another important issue is the time complexity: calculating $D(m, n)$ requires $O(m \times n)$ computation. Fig. 3 uses intensity to visualize the unit-to-unit similarity, where dark pixels indicate a small between-signal difference. The DTW route calculated by the recursive energy minimization process is displayed as the red line segments.
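The recursion described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the experiments: the function names `dtw_distance` and `gesture_distance`, the absolute difference used as the unit-to-unit distance, and the choice to sum the distances of the three LED channels are all assumptions made for this example.

```python
from math import inf


def dtw_distance(t, r, dist=lambda a, b: abs(a - b)):
    """Return D(m, n), the minimal accumulated distance between t and r.

    `dist` is the unit-to-unit distance; the absolute difference is an
    assumption made for this sketch.
    """
    m, n = len(t), len(r)
    # D[i][j] is the DTW distance between the prefixes t[:i] and r[:j].
    # Row 0 and column 0 are padding so that the route must start at (1, 1).
    D = [[inf] * (n + 1) for _ in range(m + 1)]
    D[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = dist(t[i - 1], r[j - 1])
            # A cell can only be reached from its left, lower, or diagonal neighbor.
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[m][n]


def gesture_distance(sample_a, sample_b):
    """Sum the per-channel DTW distances of two three-channel samples.

    Summing over channels is an assumption of this sketch.
    """
    return sum(dtw_distance(a, b) for a, b in zip(sample_a, sample_b))
```

The two nested loops make the O(m×n) cost visible: every cell of the table is filled exactly once. A slower repetition of the same pattern, such as `[1, 2, 2, 3]` against `[1, 2, 3]`, is matched with zero distance even though the lengths differ.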

Learning-based Approach
In addition to the direct pattern matching method, learning approaches can sometimes ignore noise and achieve better performance. The first step of a statistical learning system is to extract features from the raw signals. Then, three different kinds of learning algorithms are described to train the gesture recognition model.

Time-series Feature Extraction
In our research, two major feature representation methods are adopted. The first kind of feature estimates the accumulative energy of each signal independently. For each of the three LED signals, we record the relative position on the timeline at which the signal reaches 25%, 50%, and 75% of its total magnitude summation. From Fig. 2(c), we can observe that the red signal of the RS motion accumulates 25% earlier than the other two signals. This is because the LED3 light source is located on the left side of the sensor board, so a left-to-right motion enters the sensing area of LED3 first. Nine features can be extracted by this method.
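The accumulative energy features can be illustrated with a short Python sketch. The function name `accumulative_energy_features`, the use of absolute values as magnitudes, and the padding value for an all-zero signal are assumptions of this example rather than details given in the paper.

```python
def accumulative_energy_features(channels, fractions=(0.25, 0.50, 0.75)):
    """Return, per channel, the timeline ratios at which the running
    magnitude sum first reaches each fraction of the channel total.

    Three channels and three fractions give nine features.
    """
    features = []
    for signal in channels:
        total = sum(abs(v) for v in signal)
        running = 0.0
        k = 0  # index of the next fraction still to be reached
        for idx, value in enumerate(signal):
            running += abs(value)
            while k < len(fractions) and total > 0 and running >= fractions[k] * total:
                # Position on the timeline, normalized to (0, 1].
                features.append((idx + 1) / len(signal))
                k += 1
        # Assumption: fractions never reached (all-zero signal) are padded with 1.0.
        features.extend([1.0] * (len(fractions) - k))
    return features
```

A channel whose energy arrives early produces small ratios, which is how the example of the RS motion above shows up in the feature vector: the LED3 channel reaches its 25% mark before the other two.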
The second feature extraction method uses the binary comparison results among the three LED signals at specific times. This method also yields nine features.
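One possible reading of the binary comparison features is sketched below. The paper does not state which time instants or which signal pairs are compared, so this example assumes three sampling positions (25%, 50%, and 75% of the gesture duration) and the three pairwise comparisons among the LED signals, which gives nine binary features. The function name `binary_comparison_features` and these choices are hypothetical.

```python
from itertools import combinations


def binary_comparison_features(channels, positions=(0.25, 0.50, 0.75)):
    """Return 0/1 flags telling which channel is larger at each position.

    The sampling positions and the pairwise comparison scheme are
    assumptions of this sketch, not values taken from the paper.
    """
    length = min(len(c) for c in channels)
    features = []
    for p in positions:
        idx = min(int(p * length), length - 1)
        # Compare every pair of channels: (LED1, LED2), (LED1, LED3), (LED2, LED3).
        for a, b in combinations(range(len(channels)), 2):
            features.append(1 if channels[a][idx] > channels[b][idx] else 0)
    return features
```

Because these features only encode the ordering of the signals, they discard the magnitude information that the accumulative energy features retain.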
The learning database includes 430 gesture learning samples captured from 20 subjects. The leave-one-out testing framework is adopted for the accuracy experiments.

Linear Discriminant Analysis
Linear Discriminant Analysis (LDA) is a statistical analysis method that builds a discriminant function with the fewest misclassifications. In its basic form, the data are classified into two categories in the feature space. The most popular methods are Fisher's linear discriminant, Bayes' linear discriminant, and the distance discriminant.
In our research, the training samples and the corresponding label vector are used as the input data for the LDA model construction. The distances between samples and populations are estimated for the optimization process. Then, this model can predict the labels of the testing samples according to the rules learned from the training dataset.
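As an illustration of the two-category case, the following Python sketch implements Fisher's linear discriminant for two-dimensional features. It is a simplified example under stated assumptions: the features are restricted to two dimensions so that the scatter matrix can be inverted by hand, the decision threshold is placed midway between the projected class means, and the function names are hypothetical. Extending it to eight gestures (for example, one classifier per gesture) is not shown.

```python
def mean_vector(rows):
    """Component-wise mean of a list of equal-length feature vectors."""
    return [sum(r[k] for r in rows) / len(rows) for k in range(len(rows[0]))]


def fisher_lda_2d(class0, class1):
    """Return (w, threshold) of Fisher's discriminant for 2-D features."""
    m0, m1 = mean_vector(class0), mean_vector(class1)
    # Pooled within-class scatter matrix S_w (2 x 2).
    s = [[0.0, 0.0], [0.0, 0.0]]
    for rows, m in ((class0, m0), (class1, m1)):
        for x in rows:
            d = (x[0] - m[0], x[1] - m[1])
            for a in range(2):
                for b in range(2):
                    s[a][b] += d[a] * d[b]
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    if abs(det) < 1e-12:
        raise ValueError("within-class scatter matrix is singular")
    # w = S_w^{-1} (m1 - m0), using the closed-form inverse of a 2 x 2 matrix.
    diff = (m1[0] - m0[0], m1[1] - m0[1])
    w = [
        (s[1][1] * diff[0] - s[0][1] * diff[1]) / det,
        (-s[1][0] * diff[0] + s[0][0] * diff[1]) / det,
    ]
    # Assumption: threshold halfway between the projected class means.
    proj0 = w[0] * m0[0] + w[1] * m0[1]
    proj1 = w[0] * m1[0] + w[1] * m1[1]
    return w, 0.5 * (proj0 + proj1)


def predict(w, threshold, x):
    """Return 1 if x projects beyond the threshold, else 0."""
    return 1 if w[0] * x[0] + w[1] * x[1] > threshold else 0
```

For instance, training on two well-separated clusters and calling `predict` on a new point returns the label of the cluster whose projected mean lies on the same side of the threshold.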

Artificial Neural Networks
An artificial neural network (ANN) is a parallel computing system that consists of individual artificial nodes and imitates the neural network model of biological systems. In an ANN, simple artificial nodes, known as "neurons", are connected together to form a network that mimics a biological neural network. Each artificial neuron represents a specific output function. Each connection between two neurons carries an adaptive weight that is used during training and prediction. The output of an ANN depends on the weights and output functions after convergence. Most ANNs are multilayer networks that include an input layer, an output layer, and hidden layers. The nodes of the input layer receive the time-series features of each learning sample. The output layer produces the final decisions after transmitting, analyzing, and weighting. The hidden layers consist of the neurons and connections between the input layer and the output layer.
ANNs have been widely applied in various applications, such as pattern recognition, identification, classification, speech, computer vision, and control systems. Our application is gesture recognition. In this paper, binary ANNs are extended to solve a multi-class classification problem, namely gesture recognition. Eight output nodes are adopted, one binary decision per node, to determine whether a sample belongs to the corresponding target gesture or not.
The major process of ANN learning is to adjust the weights of each layer by the back-propagation algorithm. The accuracy may be increased by repeatedly updating the weights.
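The eight-output design can be illustrated with a forward pass followed by a decision rule. The sketch below is an assumption-laden example: the layer sizes (18 inputs, 10 hidden neurons), the sigmoid output function, the 0.5 decision threshold, the rejection rule, and all function names are hypothetical, and the weights are random placeholders rather than weights learned by back propagation.

```python
import math
import random

GESTURES = ["US", "DS", "RS", "LS", "TD", "CiD", "CoD", "3DG"]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def random_layer(n_out, n_in, rng):
    """Create placeholder weights and biases in [-1, 1] (untrained)."""
    weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [rng.uniform(-1, 1) for _ in range(n_out)]
    return weights, biases


def layer_forward(inputs, weights, biases):
    """Weighted sum of the inputs followed by the sigmoid output function."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def forward(features, hidden_w, hidden_b, out_w, out_b):
    """Input layer -> one hidden layer -> eight output nodes."""
    hidden = layer_forward(features, hidden_w, hidden_b)
    return layer_forward(hidden, out_w, out_b)


def decode(outputs, threshold=0.5):
    """Pick the most confident of the eight binary decisions.

    Assumption: if no output node exceeds the threshold, the sample is rejected.
    """
    best = max(range(len(outputs)), key=lambda k: outputs[k])
    return GESTURES[best] if outputs[best] >= threshold else "reject"
```

With trained weights, `decode(forward(features, ...))` would return one of the eight gesture labels, or `"reject"` when none of the binary decisions is positive.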

Support vector machines
In data classification and regression analysis applications, a support vector machine (SVM) is a supervised learning model with associated learning algorithms that analyze data and recognize patterns. Given a set of training data with specific labels, the SVM algorithm builds a model that can assign new samples to the corresponding category. The solid lines in Fig. 4 are the so-called support hyper-planes, which are defined by the normal vector w to the hyper-plane and a constant b. To find the optimal separating hyper-plane, we first need to find the two parallel support hyper-planes with the largest margin. Here d is the distance between the separating hyper-plane and a support hyper-plane, so the margin can be calculated as in equation (4). In addition, the largest margin can be estimated by the minimization system listed in equation (5). In our research, we used the C-support vector classification method and the radial basis kernel function.
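For reference, the quantities named above can be written in the standard textbook form. The following block is an assumed reconstruction based on the usual hard-margin formulation; the numbering (3) for the support hyper-planes, the sign convention, the label notation $y_i \in \{-1, +1\}$, and the kernel parameter $\gamma$ are assumptions and may differ from the authors' original equations.

```latex
\begin{align}
  \mathbf{w}^{\top}\mathbf{x} + b &= \pm 1
    && \text{(support hyper-planes)} \tag{3}\\
  \text{margin} = 2d &= \frac{2}{\lVert \mathbf{w} \rVert}
    && \text{(distance between the two support hyper-planes)} \tag{4}\\
  \min_{\mathbf{w},\, b}\ \frac{1}{2}\lVert \mathbf{w} \rVert^{2}
    \quad &\text{subject to}\quad
    y_i\left(\mathbf{w}^{\top}\mathbf{x}_i + b\right) \ge 1,\quad i = 1, \dots, N \tag{5}\\
  K(\mathbf{x}, \mathbf{x}') &= \exp\!\left(-\gamma \lVert \mathbf{x} - \mathbf{x}' \rVert^{2}\right)
    && \text{(radial basis kernel)} \notag
\end{align}
```

Maximizing the margin $2 / \lVert \mathbf{w} \rVert$ in (4) is equivalent to minimizing $\tfrac{1}{2}\lVert \mathbf{w} \rVert^{2}$ in (5). C-support vector classification relaxes the constraints in (5) with slack variables weighted by a penalty parameter $C$.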

Experimental results
Various experimental results are listed in this section to demonstrate the effectiveness of the proposed system. The most accurate but time-consuming DTW pattern matching approach is examined first. Then, the learning-based approaches with three different learning methods under different feature space settings are compared to find the optimal hand gesture recognition model. Finally, we found that different strategies must be adopted for the eight different gestures.

We first collected a database of 3440 gesture signals as the matching pool. The leave-one-out testing framework is adopted to evaluate the performance of the DTW pattern matching method. In other words, each signal in the database is predicted from the other 3439 signals and their corresponding labels. Table 1 shows the confusion matrix of the prediction results of the DTW pattern matching method. Each column in this matrix represents a group of testing gestures. Taking the US gestures as an example, 99.3% are correctly recognized and only 0.7% are misclassified as the RS gesture. In addition, a reject decision means that the system judges that the input signal does not belong to any of the eight target gestures.
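The leave-one-out evaluation and the percentage confusion matrix can be sketched as follows. This is an illustrative example only: it assumes nearest-neighbor matching, a distance threshold for the reject decision (`reject_above`), and hypothetical function names. Any distance function can be plugged in, including a DTW distance.

```python
from collections import Counter, defaultdict


def leave_one_out_predictions(samples, labels, distance, reject_above=None):
    """Predict each sample from all other samples by nearest-neighbor matching."""
    predictions = []
    for i, query in enumerate(samples):
        best_label, best_dist = "reject", float("inf")
        for j, candidate in enumerate(samples):
            if j == i:
                continue  # leave the query itself out of the matching pool
            d = distance(query, candidate)
            if d < best_dist:
                best_label, best_dist = labels[j], d
        # Assumption: a match that is too far away is treated as a reject decision.
        if reject_above is not None and best_dist > reject_above:
            best_label = "reject"
        predictions.append(best_label)
    return predictions


def confusion_matrix_percent(labels, predictions):
    """Return {true gesture: {predicted label: percentage}}."""
    counts = defaultdict(Counter)
    for truth, pred in zip(labels, predictions):
        counts[truth][pred] += 1
    return {
        truth: {pred: 100.0 * c / sum(row.values()) for pred, c in row.items()}
        for truth, row in counts.items()
    }
```

Each inner dictionary corresponds to one group of testing gestures, so its percentages sum to 100, matching the per-gesture reading of Table 1. The double loop also shows why this evaluation is expensive: every query is compared against the entire remaining pool.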

Performance of DTW pattern matching
The accuracy of the DTW algorithm is above 90% for every gesture, and for half of the gestures it is higher than 95%. However, the relatively high computational complexity may lead to a longer response time, which makes it difficult to apply this kind of approach to real-time applications. Therefore, the learning-based approaches are proposed to improve the system efficiency.

Performance of learning-based approaches
Three different learning-based approaches are examined in this section. Table 2 shows the system performance of these methods under different feature settings. From this table, we can see that the accuracy of the dynamic time warping algorithm is almost always higher than that of the three learning-based methods across the eight gestures. In addition, we also examine the discriminative power of the different feature representation methods. In this table, F1 and F2 denote the two kinds of feature extraction methods described in Section 3.1 (Time-series Feature Extraction). The experimental results of the linear discriminant analysis (LDA) algorithm show that the accuracy of feature set F2 is lower on average than that of F1. Even the richer feature representation F1+F2 does not work better than F1. Among the three kinds of learning-based approaches, ANNs with both kinds of features achieve a 94.80% recognition rate, which is close to the performance of DTW matching.
Regarding the feature representation methods, F2 cannot describe 3DG well: its accuracy with the three learning methods ranges only from 9.53% to 24.19%.

Hybrid Strategy
Table 2 also shows that a hybrid strategy must be adopted to recognize different gestures, because the feature representation methods and learning kernels may affect the recognition rate dramatically. From Table 2, the most accurate learning-based setting for each hand gesture is chosen to form the final decisions. Table 3 shows the recognition rate with the hybrid strategy.
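The selection step of the hybrid strategy can be sketched as a simple lookup over a per-gesture accuracy table. The function name `choose_hybrid_strategy` and the table layout are hypothetical, and the numbers in the usage example are made-up placeholders for illustration only; they are not the values reported in Table 2.

```python
def choose_hybrid_strategy(accuracy):
    """Pick, for every gesture, the (learner, feature set) with the best accuracy.

    `accuracy` maps (learner, feature_set) -> {gesture: accuracy in percent}.
    """
    best = {}
    for config, per_gesture in accuracy.items():
        for gesture, acc in per_gesture.items():
            if gesture not in best or acc > best[gesture][1]:
                best[gesture] = (config, acc)
    return {gesture: config for gesture, (config, _) in best.items()}


# Placeholder accuracies (NOT taken from Table 2), shown only to illustrate usage.
example_table = {
    ("LDA", "F1"): {"US": 90.0, "3DG": 80.0},
    ("ANN", "F1+F2"): {"US": 95.0, "3DG": 70.0},
}
strategy = choose_hybrid_strategy(example_table)
```

At recognition time, the resulting dictionary indicates which trained model and feature set should be responsible for the decision on each gesture.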

Driving Safety Experiments
Finally, a demo system based on the new HCI design for automotive electronics is developed to examine the improvement in driving safety.

Driving Experiments Overview
The demo system is designed to simulate the driving environment. Thirty subjects with a driver's license participated in our driving safety experiments. In this experiment, the drivers are asked to control the music player through a sequence of events while driving. Three kinds of control systems are examined: a traditional push-button interface, a 9.7-inch touch panel, and the proposed air-gesture techniques.
In this experiment, the driving safety and the control fluency are compared to find potential problems and to evaluate the improvement offered by the proposed air-gesture techniques. The experimental environment is shown in Fig. 5.

Experimental flow
All the subjects go through the same experimental flow. They are told about the objective and flow of the experiment and about the events. They also learn how to control the music player with the three different interfaces before starting to drive. The experiment is repeated three times, once for each of the three devices. The experimental flow is shown in Fig. 6.
Six different events are announced while the subjects drive the car. These events include a music play request, volume increase and decrease, music selection at the request of the passenger, and system pause and resume during a phone call event.

Numerical Evaluation of Driving Safety
Three indexes of the ergonomic experiment are measured in this human-computer interaction experiment: mission accomplishment time, eyes-off-the-road time, and average error operations. The mission accomplishment time is the time interval between the end of a vocal command and the accomplishment of the event. Eyes-off-the-road time is a widely used index that measures the total time during which the subjects' eyes leave their frontal sight. Average error operations is the third index, which counts the number of error trials. From Tables 4 and 6, it is obvious that the air-gesture interface needs more trials to accomplish the missions. We surmise that the main reason is that the subjects are not yet accustomed to waving the pre-defined gestures without touching any device. When the subjects want to finish the hand movement quickly and return to the driving task as soon as possible, short and dense gesture signals are captured. It is very difficult to recognize this kind of gesture, which is dissimilar to the learning dataset.
Table 5 shows that the proposed method can decrease the eyes-off-the-road time and increase driving safety dramatically. Almost all these events can be accomplished with less than half of the eyes-off-the-road time.
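The three indexes can be computed from a simple per-event log, as in the following sketch. The log format (keys `command_end`, `accomplished`, `glances`, and `errors`, with times in seconds) and the function names are hypothetical; the paper does not describe how the measurements were recorded.

```python
def eyes_off_road_time(glances):
    """Total duration of the (start, end) intervals spent looking away from the road."""
    return sum(end - start for start, end in glances)


def summarize_trials(trials):
    """Average the three safety indexes over a list of event records."""
    n = len(trials)
    return {
        # Interval between the end of the vocal command and event accomplishment.
        "mission_time": sum(t["accomplished"] - t["command_end"] for t in trials) / n,
        "eyes_off_road": sum(eyes_off_road_time(t["glances"]) for t in trials) / n,
        "error_operations": sum(t["errors"] for t in trials) / n,
    }
```

Running `summarize_trials` once per interface (push-button, touch panel, air gesture) yields one row for each of the three comparison tables.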

Conclusions
In our research, the Si1143 gesture sensor is used to capture three synchronized time-series signals. A pattern matching algorithm and machine learning techniques are adopted to develop an air-gesture controlling method for automotive electronics. The proposed method based on dynamic time warping (DTW) achieves the most accurate recognition rate, but it also needs the longest response time because of its high computational complexity. Then, two different kinds of feature extraction methods and three learning kernels are used to develop the learning-based approaches. These learning methods include Linear Discriminant Analysis (LDA), Support Vector Machines (SVMs), and Artificial Neural Networks (ANNs). A hybrid strategy, which uses different feature spaces and learning methods to recognize different gestures, is used as the final strategy to recognize the hand gestures in an efficient and effective way. Furthermore, the gesture sensing HCI techniques are applied to the driving safety experiments.
The proposed system can dramatically decrease the eyes-off-the-road time and increase the driving safety.
In the future, we believe this kind of air-gesture sensing technique can be extended to wearable device HCI design, because the touch panels of some wearable devices, such as smart watches, are too small. Traditional user interfaces of smartphones and tablet PCs are not suitable for this kind of device. Wearable devices combined with a gesture sensor can simplify operations and thus eliminate the inconvenience of a screen that is too small to control. In addition, this technique can also be used in medical devices, where it can avoid contagion.

Fig. 1. Signal acquisition subsystem includes (a) Si1143 gesture sensor and (b) P8X32A QuickStart board

The eight control gestures defined in the Gesture Design section are:
(a) Upward Sliding (US): bottom-up palm movement.
(b) Downward Sliding (DS): top-down palm movement.
(c) Rightward Sliding (RS): sliding from left to right.
(d) Leftward Sliding (LS): sliding from right to left.
(e) Tick drawing (TD): draw a tick in the x-y plane.
(f) Circle drawing (CiD): draw a circle in the x-y plane.
(g) Cross drawing (CoD): draw a cross in the x-y plane.
(h) 3D grabbing (3DG): grabbing gesture from the sensor to a higher z depth in 3D space.

Fig. 2 shows some example input signals captured by the proximity sensor after data synchronization and time-series signal pre-processing. In each sub-figure, the x-axis is the time domain, the y-axis represents the magnitude of the signals, and the proximity signals captured with the three different light sources are visualized by three lines in different colors.

Fig. 2. The example synchronized proximity signals of different gestures

Fig. 3. The unit-to-unit similarity table and estimated DTW route

Fig. 5. The environment of driving safety experiments

Table 1. The confusion matrix of DTW matching (%)

Table 2. The system performance of three learning-based approaches under different feature settings

Table 3. Gesture recognition rate with hybrid strategy

Table 4. Mission accomplishment time

Table 5. Eyes-off-the-road time

Table 6. Average error operations

Tables 4, 5, and 6 show the experimental results for mission accomplishment time, eyes-off-the-road time, and average number of error operations, respectively. The numbers of error operations vary according to the habits and dexterity of the users in controlling different devices.