Accurate Heart Rate Measuring from Face Video Using Efficient Pixel Selection and Tracking

As the coronavirus (COVID-19) spreads around the world, we are increasingly cognizant of our health on a daily basis. This paper focuses on heart rate monitoring, utilizing remote monitoring methodology as a vital indicator of health status. Remote photoplethysmography (rPPG) is a well-known technique in human remote monitoring for calculating the heart rate from face videos. Since rPPG analyzes small changes in color and motion, physical factors (e.g., breathing and adjusting posture) and environmental factors (e.g., illumination and shade) make it difficult to measure heart rate with precision. To resolve these challenges, this paper proposes a system that effectively combines the following methods: 1) the Lucas-Kanade method to dynamically track each skin pixel, 2) selection of proper pixels that are not affected by environmental fluctuations in light and shade, 3) delineation of the heart rate signal from noisy to precise data to improve accuracy, and 4) the Fast Fourier Transform (FFT) to estimate the main frequency of the signal and determine the heart rate. The results of the experiment showed that the mean absolute error (MAE) of the heart rate was 3.4 bpm over 72 face videos.


Introduction
Since a new variant of coronavirus (COVID-19) was found in 2019, newspapers have reported many cases of contact infection. To mitigate the spread of COVID-19, people are working remotely from home rather than in the workplace. This situation requires people to be vigilant about their overall health and well-being. Additionally, healthcare devices that require skin contact pose a further risk of contact infection.
As a result of the COVID-19 pandemic, non-contact healthcare devices are in demand. We focus on the remote monitoring of heart rate as one of the vital indicators of health status. Non-contact remote monitoring systems, such as remote photoplethysmography (rPPG), are divided into two categories: those that monitor (1) changes in facial color and (2) movements of the face. Regarding facial color, physical factors (e.g., breathing and adjustments to posture) prevent the monitoring from correctly tracking identical pixels on the face. In addition, environmental factors (e.g., illumination and shade) make it difficult to detect the skin color changes caused by fluctuation in heart rate. Regarding facial movements, the system tracks the small nodding movements of the head caused by the heartbeat; however, other motions of the body prevent the monitoring system from extracting the heartbeat motion. For these reasons, remote monitoring data are made less accurate by external noise.
In this paper, we address the concept of detecting facial color changes to estimate heart rate with a combination of several technologies. The proposed remote monitoring system tracks independent pixels on the face correctly and selects proper feature points to observe the periodic signal. This paper is organized as follows. In section 2, we review previous research on heart rate measurement. In section 3, we describe our proposed system. In sections 4 and 5, the proposed system is evaluated, and we discuss the differences from previous research. In section 6, we present the conclusion of this paper.

Heart rate measurement
The electrocardiogram (ECG), developed by Willem Einthoven, measures the electric potential difference between two points on the human body. It can precisely measure the heart rate by counting periodic spikes of the potential difference. In contrast, plethysmography detects the change in volume of the blood vessels. As a derivation of plethysmography, photoplethysmography (PPG) uses light, either transmitted or reflected, to observe blood flow. The transmitted approach shines a light source on a thin skin area and measures the amount of transmitted light; it is necessary to place a part of the human body (e.g., earlobe or finger) between the light source and the receiver. The reflected approach also uses a light source but measures the amount of reflected light. In recent years, the reflected approach has become widely used because devices such as the Apple Watch are equipped with a built-in sensor using reflected light.
All technologies mentioned above require skin contact for heart rate measurement. Remote photoplethysmography (rPPG), on the other hand, does not require skin contact but remotely observes human skin tone. Juan et al. measured the average skin tone using the G-channel of RGB in the ROI (forehead and cheeks) because hemoglobin mainly absorbs green and yellow light (1). Yang et al. estimated heart rate from head movements (7). These methods (6, 7) that extract head movements have the advantage of determining heart rate even when the face is hidden by something, like a mask. Their disadvantage is low noise tolerance: body movements, such as breathing and adjusting posture, prevent correct observation of head movement.

Noise tolerant heart rate measurement
This paper measures heart rate accurately from facial color by tracking feature points on the face and selecting noiseless points. The overall process of the proposed system is shown in Fig. 1. The system consists of tracking feature points, discriminating useful feature points, delineating the heart rate signal via color from noisy to precise data points to improve accuracy, and frequency analysis.

Feature point tracking
Many systems proposed in previous papers specified an ROI on the face within which the averages of RGB values are calculated. If the ROI is too large, non-skin areas such as hair and beard might be included, making the averaged color values incorrect. On the other hand, if the ROI is too small, the total number of pixels becomes small, and the estimated time series of the signal is distorted. For these reasons, previous systems had difficulty setting a proper ROI size. In this paper, we do not set an ROI explicitly; we calculate the sequences of RGB values for all pixels, select noiseless points, and average the selected RGB values afterwards.
The following are the procedures of the calculation. After the first frame of the video is extracted, the face is detected using Haar-like features. The detected area is shrunk to the central 90% in height and 50% in width because the detected face area often includes not only the face but also the background. Since there is a lot of movement around the eyes and mouth, which makes tracking difficult, we separately select two regions of the face with less movement: the forehead, a band of 20% of the face height from the top edge or hairline, and the mid-facial region, a band of 20% of the face height starting at the 55% position from the top, as shown in Fig. 2. Selecting these two regions makes it possible to deal with the case where the forehead region is hidden by hair.
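For illustration, the cropping arithmetic above can be sketched as follows. The face box is assumed to come from a detector such as OpenCV's Haar cascade; the helper name and its integer-rounding choices are ours, not the paper's.

```python
def face_regions(x, y, w, h):
    """Given a detected face box (x, y, w, h), return the forehead and
    mid-face measurement regions described in the text. The fractions
    (50% width, 90% height, 20% bands, 55% offset) follow the paper."""
    # Shrink the box to the central 50% in width and 90% in height.
    w2, h2 = int(w * 0.5), int(h * 0.9)
    x2 = x + (w - w2) // 2
    y2 = y + (h - h2) // 2
    # Forehead: the top 20% band of the shrunk face region.
    forehead = (x2, y2, w2, int(h2 * 0.2))
    # Mid-face: a 20%-high band starting at 55% from the top.
    midface = (x2, y2 + int(h2 * 0.55), w2, int(h2 * 0.2))
    return forehead, midface
```

Each returned tuple is an (x, y, width, height) rectangle ready for slicing out of the frame.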
The next step detects skin pixels within the designated regions, as shown in Fig. 3. We specify the range of skin color and extract pixels within that range, referring to the paper (8). Pixels within the ranges in Equations (1) and (2) are extracted as candidates, and N pixels are randomly selected among them. The system tracks the selected points using the Lucas-Kanade method and calculates the sequences of RGB values for the points.
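A minimal sketch of the candidate-selection step follows. The paper's exact thresholds in Equations (1) and (2) are not reproduced here; as a stand-in we use a common RGB skin-color rule, so the numeric ranges below are an assumption, not the paper's.

```python
import numpy as np

def skin_candidates(frame_rgb, n_points=2000, seed=0):
    """Pick random skin-colored pixels from an RGB frame as tracking
    candidates. The thresholds are a placeholder skin rule, not the
    paper's Eqs. (1)-(2)."""
    r = frame_rgb[..., 0].astype(int)
    g = frame_rgb[..., 1].astype(int)
    b = frame_rgb[..., 2].astype(int)
    spread = frame_rgb.max(-1).astype(int) - frame_rgb.min(-1)
    mask = (r > 95) & (g > 40) & (b > 20) & (r > g) & (r > b) & (spread > 15)
    ys, xs = np.nonzero(mask)
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(xs), size=min(n_points, len(xs)), replace=False)
    # (x, y) coordinates as float32, the layout expected by LK trackers.
    return np.stack([xs[idx], ys[idx]], axis=1).astype(np.float32)
```

In the full system these candidate points would then be passed to `cv2.calcOpticalFlowPyrLK` for frame-to-frame Lucas-Kanade tracking, accumulating one RGB sequence per point.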

Normalizing RGB signals using small windows
The angle between the surface of the face and the light source subtly differs due to the shape of the surface, and the intensity of the illuminating light also differs depending on the part of the face. These factors may cause large changes in the sequence of RGB values even when the face moves slightly; therefore, we need to select proper feature points whose RGB values change little within small windows of the sequence.
For the i-th feature point from frame t to t + 1, the normalized amounts of change in RGB values are defined in Eq. (3):

    Δ_R^i(t) = (R^i(t + 1) − R^i(t)) / R^i(t)    (3)

Here R^i(t) stands for the R value of the i-th point at frame t, and Δ_R^i(t) for the variation of R from t to t + 1. Notations for G and B are the same as those for R. According to the paper (9), Δ_R, Δ_G and Δ_B fall within the range of [−0.02, 0.02] if the point is correctly tracked with no illumination changes. However, since observation in an actual environment is very noisy, there may not be enough points within the range. Therefore, we designed another way to select the SN best feature points within small windows as follows. We use overlapped windows where the number of frames in a window is W and the number of overlapped frames is W/2, so the frame sequence of the w-th window starts at wW/2 and ends at wW/2 + W − 1 in Eq. (4).
The maximum variance of the color changes for the i-th feature point in the w-th window is defined as v^i_w, the largest of the variances of Δ_R^i, Δ_G^i and Δ_B^i within the window (Eq. (5)). The index sequence for the ascending sort of v^i_w in the w-th window is defined as s_w; in other words, s_w(0) is the index of the point with the smallest v^i_w. We define the sequences of averages of the color values over the top-SN points in window w as R_w, G_w and B_w. As a notation for sequences, we use Seq[x] in the following equations. These sequences need to be normalized to reduce the discrepancy among windows. We calculate the window means R̄_w, Ḡ_w and B̄_w and finally define the normalized sequences r̄_w, ḡ_w and b̄_w, e.g. r̄_w(t) = R_w(t) / R̄_w, as in Eqs. (6) and (7).
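The selection step above can be sketched as follows. The symbol names (window length W as `win`, cutoff SN as `sn`) and the half-overlap bookkeeping are our reading of the text, so treat this as an illustrative sketch rather than the paper's exact implementation.

```python
import numpy as np

def stable_point_average(rgb, win=64, sn=200):
    """Window-based selection of stable feature points.

    rgb: array of shape (P, T, 3) holding the RGB sequence of each of
    P tracked points over T frames. Returns one normalized (win, 3)
    color sequence per half-overlapping window."""
    step = win // 2
    out = []
    for start in range(0, rgb.shape[1] - win + 1, step):
        w = rgb[:, start:start + win, :].astype(float)
        # Normalized frame-to-frame change per point and channel (Eq. (3)).
        delta = (w[:, 1:, :] - w[:, :-1, :]) / np.maximum(w[:, :-1, :], 1e-6)
        # Largest channel variance per point: small means stably tracked.
        score = delta.var(axis=1).max(axis=1)
        best = np.argsort(score)[:sn]          # SN most stable points
        avg = w[best].mean(axis=0)             # averaged (win, 3) colors
        out.append(avg / avg.mean(axis=0))     # normalize within the window
    return out
```

Averaging only the `sn` lowest-variance points discards pixels disturbed by motion or lighting, which is the core of the noise-tolerance claim.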

CHROM method
We calculate the chrominance signal from the sequences of RGB values to amplify the color of blood flow. The chrominance signal is defined in Equations (8), (9) and (10):

    X(t) = 3r̄(t) − 2ḡ(t)    (8)
    Y(t) = 1.5r̄(t) + ḡ(t) − 1.5b̄(t)    (9)
    S(t) = X(t) − (σ_X / σ_Y) Y(t)    (10)

where σ_X and σ_Y are the standard deviations of X and Y.
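A direct transcription of these three equations, assuming the standard CHROM formulation of de Haan and Jeanne on a (T, 3) array of normalized RGB values:

```python
import numpy as np

def chrom_signal(rgb_norm):
    """CHROM chrominance signal from normalized RGB traces of shape (T, 3).

    X and Y are fixed chrominance projections; alpha rescales Y so that
    residual specular/motion components cancel in S = X - alpha * Y."""
    r, g, b = rgb_norm[:, 0], rgb_norm[:, 1], rgb_norm[:, 2]
    x = 3.0 * r - 2.0 * g              # Eq. (8)
    y = 1.5 * r + g - 1.5 * b          # Eq. (9)
    alpha = x.std() / max(y.std(), 1e-12)
    return x - alpha * y               # Eq. (10)
```

Note that when all three channels carry an identical disturbance (pure intensity change), X and Y become proportional and S cancels it out, which is why CHROM suppresses illumination noise.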

Selection in relation to environment
The CHROM method amplifies the color of blood flow to measure heart rates accurately under white light, while the simple G-channel performs better in other environments, such as under neutral white light bulbs. Therefore, the final heart rate signal is selected according to the environment, as shown in Eq. (11).

    h(t) = S(t),    if white light or sunlight
    h(t) = G_H(t),  otherwise    (11)

where G_H stands for the sequence ḡ(t) multiplied by a Hanning window.
The FFT is applied to the heart rate signal with zero-padding extension to calculate the main frequency f_max. The estimated heart rate, HR, is f_max multiplied by 60 in Eq. (12):

    HR = 60 · f_max    (12)
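The frequency-analysis step can be sketched as below. The 0.7-4.0 Hz search band (42-240 bpm) and the zero-padding factor are our assumptions; the paper specifies only the windowing, zero padding, and HR = 60 · f_max.

```python
import numpy as np

def estimate_hr(signal, fps=30.0, pad_factor=8):
    """Estimate heart rate in bpm: remove the mean, apply a Hanning
    window, zero-pad the FFT for finer frequency bins, and take the
    dominant frequency in a plausible heart-rate band."""
    x = (signal - signal.mean()) * np.hanning(len(signal))
    n = len(x) * pad_factor                    # zero-padding extension
    spec = np.abs(np.fft.rfft(x, n))
    freqs = np.fft.rfftfreq(n, d=1.0 / fps)
    band = (freqs >= 0.7) & (freqs <= 4.0)     # assumed 42-240 bpm band
    f_max = freqs[band][np.argmax(spec[band])]
    return 60.0 * f_max                        # Eq. (12)
```

With 30 fps video and 8x zero padding over a 30 s clip, the bin spacing is 30/7200 Hz, i.e. 0.25 bpm, fine enough for the reported MAE of 3.4 bpm.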

Experimental setting
For this experiment, we took videos of 10 men and 13 women with the camera of the iPhone SE (2nd generation). The pixel resolution of the video was 3840×2160 (4K), and the frame rate was 30 fps. The lengths of the videos were 30 to 45 seconds. The light source was either sunlight or indoor light. The distance between the camera and the subject was 0.3 to 0.5 meters. The subjects wore a Citizen electronic blood pressure monitor (CH-602B) and sat in front of the camera without moving. We implemented the system using OpenCV (Python). We extracted 2000 feature points (N = 2000) from each image and set SN = 200 as the parameters described in section 3.

Experiment
We evaluated the accuracy of the proposed system compared with combinations of techniques proposed in previous research. We tested two types of tracking methods: region tracking and feature point tracking. For region tracking, we set a fixed rectangular ROI on the forehead, tracked its relative position within the whole face region, and calculated the average color values in the ROI. For feature point tracking, we tracked several randomly selected feature points on the face using the Lucas-Kanade method and calculated the average values for those points. We also tested three types of signal extraction methods: the G-channel, extraction with ICA, and the CHROM method. We further compared the system with a motion-based method and evaluated processing time.

Results of experiment
The performance results of the experiment with 72 facial videos are shown in Table 1. The accuracies with region tracking were lower than those with feature point tracking because the forehead areas of many subjects were hidden by hair, and the color values in the ROI could not be averaged correctly. Our system determined the heart rate with an MAE of 3.4 bpm. The CHROM method was the best among the existing methods with an MAE of 6.6 bpm, but our system exceeded CHROM by using the n-best feature points for the small windows, even when feature point tracking was applied to CHROM. The performance of the motion-based method was worse than that of the color-based methods.
From the viewpoint of processing time, feature point tracking was faster than region tracking: the cost of face detection and skin-area extraction was high, while the cost of pixel tracking with the Lucas-Kanade method was low. Tables 2 and 3 show the MAE of our method by gender and lighting category. Figure 4 shows the correspondence between theoretical and estimated HRs; a plot falls on the line when the estimated value equals the theoretical value. Plots in subfigures (2) and (3) of Fig. 4 are color-coded by gender and light source, respectively.
Although there was no explicit difference in the accuracy of the proposed system by gender, there was a difference in the success rate of tracking: the average number of feature points that could be tracked was 1587 for men but 1094 for women. We presume the surface of the women's skin was covered with cosmetics and smoother in texture.
We used the G-channel under neutral white light and the CHROM method in the other environments according to Eq. (11), but the accuracy under the neutral white light bulb was the lowest of all in Table 3. The accuracy became even worse when we applied the CHROM method to the neutral white light source. Therefore, the selection of the G-channel was reasonable, although a more suitable extraction method for this lighting condition remains future work.

Discussion
Most systems for heart rate measurement require human skin contact. This poses a risk of contact infection when infectious diseases, like coronavirus, are rampant in society. Therefore, we focused on the remote monitoring of heart rate utilizing remote photoplethysmography (rPPG). Juan et al. proposed a measurement method that averages the skin tone of a region of interest (ROI) on the forehead, analyzed with the G-channel of RGB. This method works well when the surface of the face is uncovered, but in actual situations noisy objects, like hair and beard, distort the remotely monitored HR data. To solve this problem, our system uses a pixel tracking method: pixels are selected from the face region and tracked with the Lucas-Kanade method, and then noiseless points are carefully extracted.
De Haan proposed the CHROM method, which amplifies the color of blood flow. This method worked efficiently under white light bulbs but was erroneous under neutral white light bulbs. Our system combines the advantages of the CHROM and G-channel methods for signal extraction, selecting the CHROM method under white light and sunlight, and the G-channel otherwise.
The combination of feature point tracking and signal extraction yields greater accuracy of heart rate measuring than conventional remote monitoring systems.
Our experiments described in section 4 were conducted with a high-resolution 4K camera, but we also applied our method to low-resolution images, such as the video stream of an online communication tool with a resolution of 1280×720. It worked successfully for men in most cases but did not always work for women. We presume this failure came from the difficulty of tracking feature points. Therefore, we need to extend the tracking method to handle such low-resolution images.

Conclusion
In this paper, we proposed a system that is tolerant to physical and environmental factors by effectively combining feature point tracking and signal extraction. The proposed system was able to estimate the heart rate with an accuracy of 3.4 bpm evaluated with MAE, exceeding the performance of existing systems. However, there was a slight difference in estimation accuracy due to environmental differences in lighting. To build a more stable system, it is necessary to divide the measuring conditions more finely and to construct models and methods suitable for each condition. In addition, we should develop a convenient system that automatically switches methods according to the environment.