Face Recognition with Age, Gender and Emotion Estimations

In this paper, we propose an application that can help retail industries adjust purchase lists and operation modes to increase sales and improve service quality. Convolutional neural network and residual network structures are used to estimate the age, gender, and emotion of a face, which can help merchants build their own databases, analyze users' shopping preferences, and track sentiment as feedback on the shopping experience.

Keywords: face detection, age estimation, gender estimation, emotion estimation.


Introduction
A face contains much information, such as age, gender, and emotion. Age and gender are features of personal identity and play important roles in social life. Automatic prediction of age and gender can be applied in many areas, such as intelligent human-machine interfaces, security, cosmetics, and e-commerce, while the recognition of emotions helps us understand the psychological state of participants. Together, the estimation of age, gender, and emotion makes the feedback more comprehensive. We propose a system that estimates the age, gender, and emotion of a face using convolutional neural network and residual network structures. The system can be applied in retail, helping merchants build their own databases to analyze the shopping preferences and shopping experiences of users of different ages and genders, and then adjust purchase lists or operating models in a timely manner. Its advantages are as follows.
On the one hand, it can track customers' face vectors and collect useful information such as age and gender. With this data, merchants can categorize customers and count their shopping preferences, so as to design targeted marketing strategies and in-store activities.
On the other hand, it can capture customer sentiment. By tracking customers' emotional responses, the system provides valuable feedback on in-store promotions, enabling retailers to improve product categorization and service quality.
We create a face recognition system based on three pre-trained convolutional neural network (CNN) models that recognize faces and predict age, gender, and emotion from images or videos.
Facial images with gender and age tags from the IMDB Database were selected to train the age and gender models, and the FER-2013 Database with emotion tags was used to obtain the emotion model. The position of the face is obtained using the Haar cascade classifier. The age, gender, and emotion models are then trained separately in conjunction with the residual network. Finally, the age, gender, and emotion of faces in a video can be estimated in real time.

Database
In interaction with a user, the machine can infer the user's age, gender, and emotion from facial information alone. This task is highly complex across different face samples. Using machine learning techniques, models with millions of parameters are trained on tens of thousands of samples. (1) This section introduces the databases used for real-time estimation of age, gender, and emotion.

IMDB-WIKI Database
In this paper, we use the IMDB-WIKI Database to train age and gender features and use the resulting model for estimation. IMDB-WIKI is a database of facial images with the corresponding age and gender labels. A total of 524,230 images containing age and gender information were obtained from the IMDB and WIKI websites. Table 1 shows the distribution of pictures obtained from IMDB and WIKI. Part of the faces in the IMDB Database and the age distribution in IMDB-WIKI are shown in Fig. 1 and Fig. 2, respectively.

FER-2013 Database
Similar to IMDB-WIKI, the FER-2013 Database is used to train facial emotions, and the resulting model is used for facial emotion recognition.
The FER-2013 facial emotion database consists of 35,886 facial emotion images: 28,708 training pictures (Training), 3,589 public test pictures (PublicTest), and 3,589 private test pictures (PrivateTest). Each picture is a grayscale image with a fixed size of 48×48. There are 7 kinds of emotions, represented by the numbers 0-6. The labels and corresponding emotions are shown in Table 2.
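Table 2 is not reproduced here, but the FER-2013 label set is standard; a minimal sketch of the mapping, assuming the conventional label order:

```python
# Conventional FER-2013 label mapping (0-6); the exact wording in
# Table 2 of this paper may differ slightly.
EMOTIONS = {
    0: "Angry",
    1: "Disgust",
    2: "Fear",
    3: "Happy",
    4: "Sad",
    5: "Surprise",
    6: "Neutral",
}

def label_to_emotion(label: int) -> str:
    """Map a FER-2013 integer label (0-6) to its emotion name."""
    return EMOTIONS[label]

print(label_to_emotion(3))  # prints "Happy" under the mapping assumed above
```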

Model
Different targets require different detection and recognition models. In our system, one detection model (the face detection model) and three estimation models (for age, gender, and emotion) need to be considered.

Face Detection
For training and testing images, we use the Haar cascade classifier in OpenCV to obtain the position of the face. The earliest Haar features were proposed by Papageorgiou et al. (4) Then, Paul Viola and Michael Jones proposed a method for quickly computing Haar features using the integral image. (5) Later, Rainer Lienhart and Jochen Maydt extended the Haar feature library with diagonal features. (6) The Haar cascade classifier in OpenCV is based on this extended feature library.
The Haar-like feature is a feature for real-time face tracking. Each class of Haar feature describes the contrast pattern of adjacent image regions. For a given image, the features may vary with the size of the region, which may also be referred to as the window size. However, two images that differ only in scale should have similar features. Therefore, it is useful to be able to generate features for windows of different sizes. These feature sets are organized as a cascade, and the Haar cascade has scale invariance.

Age and Gender Estimation
The age and gender of face images were trained using the IMDB-WIKI Database. The literature (2), (3) used a pre-trained VGG network and proposed a novel regression algorithm for age classification: the softmax scores over the 101 classes between 0 and 100 are multiplied by the corresponding ages 0-100 and summed to obtain the final estimated age. The process is shown in Fig. 3. Using the IMDB Database, the accuracy of gender prediction can reach 96%. Different from (2), (3), this paper trains a wide residual network from scratch. On the basis of (7), the wide residual network was modified by adding two classification layers at its top for age and gender estimation. In (2), (3), age and gender were estimated independently by two different CNNs, while in this paper a single CNN estimates both simultaneously.
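The expected-value step described above can be illustrated in a few lines of NumPy (the `expected_age` helper is hypothetical, not taken from the paper's code):

```python
import numpy as np

def expected_age(logits):
    """Turn 101 raw class scores (ages 0-100) into one age estimate:
    softmax the scores, then take the probability-weighted sum of ages."""
    logits = np.asarray(logits, dtype=np.float64)
    probs = np.exp(logits - logits.max())   # numerically stable softmax
    probs /= probs.sum()
    ages = np.arange(101)                   # class labels 0..100
    return float((probs * ages).sum())

# A distribution sharply peaked at class 30 yields an estimate near 30.
scores = np.zeros(101)
scores[30] = 10.0
print(expected_age(scores))
```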
First, download the IMDB-WIKI Database, filter out the noisy data, and serialize the images and tags for training into a ".mat" file. Then, train the network with this training data.
Whenever the validation loss reaches a new minimum, the age and gender weight files are saved as "weights.*.hdf5" and "gender mini XCEPTION.*.hdf5", respectively. The workflow is shown in Fig. 4.
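The save-on-new-minimum rule can be sketched in plain Python (in Keras this is what the ModelCheckpoint callback with save_best_only=True applies during training; `save_fn` below is a hypothetical stand-in for writing a weight file):

```python
def checkpoint_on_improvement(val_losses, save_fn):
    """Save weights whenever the validation loss reaches a new minimum,
    mirroring the "weights.*.hdf5" naming used above."""
    best = float("inf")
    saved = []
    for epoch, loss in enumerate(val_losses):
        if loss < best:          # new minimum -> persist this epoch's weights
            best = loss
            name = f"weights.{epoch:02d}-{loss:.2f}.hdf5"
            save_fn(name)
            saved.append(name)
    return saved

# Epochs 0, 1, and 3 improve on the best loss so far; epoch 2 does not.
print(checkpoint_on_improvement([0.9, 0.7, 0.8, 0.5], lambda name: None))
```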
In the training network, we used a residual network. (8) The structure of its internal residual block is shown in Fig. 5. The residual module modifies the mapping between subsequent layers so that the learned features become the difference between the original feature map and the desired feature. Therefore, the desired mapping H(x) is reformulated as the easier learning problem F(x), as in equation (1).
F(x) = H(x) - x    (1)

Fig. 3: Pipeline of DEX method (with one CNN) for apparent age estimation
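The identity shortcut in equation (1) can be illustrated with a toy example (`conv` is a hypothetical stand-in for the block's stacked convolution layers):

```python
import numpy as np

def residual_block(x, conv):
    """Instead of learning the target mapping H(x) directly, the block
    learns the residual F(x) = H(x) - x; the output recovers H(x) = F(x) + x."""
    return conv(x) + x

x = np.array([1.0, 2.0, 3.0])
# If the desired mapping is the identity, the block only needs to learn
# F(x) = 0, which is easy; the shortcut then passes x through unchanged.
identity_residual = lambda v: np.zeros_like(v)
print(residual_block(x, identity_residual))
```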

Emotion Estimation
The FER-2013 Database is used to train the emotions of the face. First, download the FER-2013 Database. There are several versions of the FER-2013 Database; this article uses the csv version. After downloading, perform the emotion training and classification operation to obtain the "fer2013 mini XCEPTION.*.hdf5" file. The next steps are similar to the workflow for age and gender estimation.
In the csv file, emotions, images, and the intended usage are stored as data rather than pictures. As shown in Fig. 6, the first line is the header, which gives the meaning of each column. The first column is the emotion tag, containing the seven numbers 0-6. The second column is the original image data. Use the pandas library to parse the csv file, then save the original image data as jpg files, sort them by usage and label, and save them to the corresponding folders. The running result is shown in Fig. 7. The subfolders for the corresponding emotions in the PrivateTest, PublicTest, and Training folders are shown in Fig. 8. Face images of the corresponding emotion are stored under each emotion folder. Part of the emotion pictures in the FER-2013 Database is shown in Fig. 9. In our tests, the accuracy reaches 66% on the emotion classification task.
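The csv-to-folder conversion described above can be sketched as follows, assuming the standard fer2013.csv column names ("emotion", "pixels", "Usage"); `export_fer2013` is a hypothetical helper, not the paper's code:

```python
import os

import numpy as np
import pandas as pd
from PIL import Image

def export_fer2013(csv_path, out_dir):
    """Parse fer2013.csv and save each row as a jpg under
    out_dir/<Usage>/<emotion>/, e.g. out_dir/Training/3/0.jpg."""
    df = pd.read_csv(csv_path)
    for i, row in df.iterrows():
        # "pixels" holds 48*48 space-separated grayscale values.
        pixels = np.array(row["pixels"].split(), dtype=np.uint8)
        img = Image.fromarray(pixels.reshape(48, 48))
        # Sort by purpose (Training/PublicTest/PrivateTest) and emotion label.
        folder = os.path.join(out_dir, row["Usage"], str(row["emotion"]))
        os.makedirs(folder, exist_ok=True)
        img.save(os.path.join(folder, f"{i}.jpg"))
```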

Experiments
To verify correctness, some experiments were carried out. This section describes the environment configuration and some experimental results.
Anaconda is an open-source package and virtual environment management system, built on the conda toolkit, for Windows, macOS, and Linux, supporting libraries such as scikit-learn, TensorFlow, and SciPy. This experiment is based on the Windows version of the Anaconda environment.
TensorFlow, a Python library for numerical computation, is widely used for programming machine learning algorithms. Keras is an open-source artificial neural network library written in Python that serves as a high-level interface to TensorFlow, Microsoft CNTK, and Theano for the design, debugging, evaluation, application, and visualization of deep learning models.

Real-Time Estimate Results
For the age, gender, and emotion of the face, we used a webcam for real-time detection; the estimation results are shown in Fig. 10 and Fig. 11. The predicted age, gender, and emotion are displayed above each face. For age prediction, an error within plus or minus 2 years is considered correct. We performed several experiments, and the results are shown in Table 3.
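The plus-or-minus-2-years rule above can be expressed as a small accuracy helper (hypothetical, for illustration only):

```python
def age_accuracy(predicted, actual, tolerance=2):
    """Fraction of predictions within +/- tolerance years of the true age."""
    correct = sum(abs(p - a) <= tolerance for p, a in zip(predicted, actual))
    return correct / len(actual)

# Two of the three predictions fall within 2 years of the true age.
print(age_accuracy([25, 31, 40], [24, 35, 41]))
```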
From Table 3, it can be seen that the prediction accuracy for gender is the highest, at more than 95%. Next is age estimation, with an accuracy of more than 86%. The recognition accuracy of these two attributes is sufficient to support user classification and feedback on shopping preferences, helping merchants adjust purchase lists in a timely manner. However, the accuracy of emotion estimation is lower, at no more than 75%. We therefore computed the accuracy of each individual emotion, with results shown in Table 4. We found that the accuracy for the angry, happy, and sad emotions is high, so it is possible to focus on these three emotions when collecting feedback on the user's shopping experience.

Conclusion and Future Works
From face information, the proposed system can correctly detect faces in real time and estimate age, gender, and emotion. Applied in retail, it can help merchants build their own databases and analyze shopping preferences so as to adjust purchase lists and operation modes in a timely manner. Moreover, through emotion feedback, retailers can gauge user experience and improve service quality.
At the same time, there are several directions for future work. First, the prediction accuracy needs to be improved, particularly for emotion estimation. Second, we want to build a user interface for easy operation and data management. Finally, we plan to incorporate member information to improve personalized service.