A Pre-Focusing Method For Detecting Pedestrians Running Out Into The Street Using A Mobile Stereo Camera

For mobile stereo camera systems, constructing the three-dimensional (3D) surroundings is an important process for ensuring safe navigation; however, it is very time consuming. To decrease the number of traffic accident deaths, pedestrians entering the street need to be detected efficiently. Thus, a pre-focusing method (PFM) is proposed to decrease the time required to retrieve matching pixel pairs in stereo pair images. The PFM provides pre-focused stereo pair images in which an object at a pre-determined distance displays no disparity between the images. Pre-focusing is performed by applying a perspective transformation to the image pair, whose coefficients are obtained in a calibration stage of the PFM. An object at that distance is extracted by evaluating the spatial similarity between the pre-focused stereo pair images, without scanning to find matching point pairs. To avoid errors due to edges and uniform areas, the smooth test, a nonparametric test, is introduced to evaluate the spatial similarity. A watching window is introduced to detect pedestrians. In this study, the PFM was successfully applied to actual video data captured by a stereo video camera mounted on a bicycle.


Introduction
Various types of autonomous cars are currently being developed and tested on actual roads. They sense their environment to safely reach their destination without human input. In the sensing process, sensors such as infrared lasers, cameras, and millimeter-wave radars are used. Computer vision is also used to interpret the sensor outputs. In fact, TOYOTA has already developed a practical Pre-Crash Safety System (1). Camera-based systems have also been developed, e.g., SUBARU's Eyesight and HONDA's Intelligent night vision system (2). These systems commonly acquire three-dimensional (3D) environmental road information to control the vehicles without crashing.
The authors believe that another point of view is required in these sensing systems to decrease traffic accident deaths caused by pedestrians entering the street (3). Most stereo camera-based sensing systems require a stereo matching process to obtain the 3D environmental road information (4)(5)(6). This process is, however, very time-consuming. There are at least two possible approaches to solve this problem. One is to use a sub-system, such as a GPGPU, to accelerate the processing (7); the other is to adopt a smart method that avoids stereo matching. The former is an excellent solution but is very expensive. The latter is a low-cost, practical solution to detect whether an object exists in the volume of interest; however, such a method still needs to be developed.
In this study, we propose a pre-focusing method (PFM) to detect an object in a 3D region of interest without the stereo matching process, in particular without local mask scanning to find corresponding point pairs. Objects captured in stereo pair images usually have disparities according to their distance from the camera. The stereo matching process calculates the distance via these disparities. In the proposed PFM, a perspective transformation is applied to one of the stereo pair images to remove the relative disparity for an object at a pre-determined distance. The residual disparities are locally evaluated via an index of spatial similarity, and areas with a high index indicate an object at the pre-determined distance. The spatial similarity may be affected by the shape of the density histogram in a local area; therefore, we use a nonparametric algorithm to avoid this. To quickly detect pedestrians entering the road, we also introduce a watching window to reduce the processing time.

Methodology
The principle of the proposed PFM is based on the fact that a quadrangle on a plane in a 3D space can be perfectly registered onto any quadrangle on another plane in that space. The registration is performed via a perspective transformation whose parameters are determined by the coordinate pairs of the four vertices of both quadrangles. After applying the transformation, an object at the pre-determined distance has no positional deviation between the stereo pair images. Therefore, an object, e.g., a pedestrian, at the chosen distance is detected as an area with no displacement.

Perspective transformation
Using a perspective transformation, we can map a quadrangle on a plane onto another plane, as shown in Fig. 1. The transformation is described as

$$ u = \frac{a_1 x + a_2 y + a_3}{a_7 x + a_8 y + 1}, \qquad v = \frac{a_4 x + a_5 y + a_6}{a_7 x + a_8 y + 1}, \tag{1} $$

where (x, y) and (u, v) are coordinates on the two planes and $a_i$ (i = 1, ..., 8) are the coefficients of the transformation. These coefficients are determined by at least four coordinate pairs $(x_i, y_i)$-$(u_i, v_i)$ (i = 1, ..., n). Multiplying both sides of Eq. (1) by the denominator, we obtain the matrix equation

$$ M\boldsymbol{a} = \boldsymbol{b}, \tag{2} $$

where

$$ M = \begin{pmatrix} x_1 & y_1 & 1 & 0 & 0 & 0 & -u_1 x_1 & -u_1 y_1 \\ 0 & 0 & 0 & x_1 & y_1 & 1 & -v_1 x_1 & -v_1 y_1 \\ & & & & \vdots & & & \\ x_n & y_n & 1 & 0 & 0 & 0 & -u_n x_n & -u_n y_n \\ 0 & 0 & 0 & x_n & y_n & 1 & -v_n x_n & -v_n y_n \end{pmatrix}, \tag{3} $$

$$ \boldsymbol{a} = (a_1\ a_2\ \cdots\ a_8)^{\mathsf{T}}, \tag{4} $$

$$ \boldsymbol{b} = (u_1\ v_1\ u_2\ v_2\ \cdots\ u_n\ v_n)^{\mathsf{T}}. \tag{5} $$

Equation (2) is solved in the least-squares sense as

$$ \boldsymbol{a} = (M^{\mathsf{T}} M)^{-1} M^{\mathsf{T}} \boldsymbol{b}. \tag{6} $$

In the proposed PFM, we apply a perspective transformation to one of the stereo pair images so that planar objects at a pre-determined distance have no disparity, as shown in Fig. 2. A checkered pattern is used as the planar object, and its extracted corner points are used to obtain the coefficients of the perspective transformation. The pre-focusing is achieved by applying the transformation to one of the stereo pair images. Now, an object at the pre-determined distance has no disparity in the pre-focused stereo pair images. Next, the object at the pre-determined distance is detected by evaluating the local similarity.

In the pre-focused stereo pair images, an area with no disparity suggests that an object exists at the pre-determined distance. To find such an area, we locally evaluate the spatial similarity using a small square mask. The similarity can be calculated as the correlation coefficient when fitting a straight line to the density scatter diagram of the pixels in the mask. The correlation coefficient is, however, affected by the distribution of the pixel density, i.e., the shape of the density histogram, in the mask: it requires a unimodal distribution with an adequate dynamic range, and these conditions are often not satisfied, especially in road scenes.
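As an illustrative sketch (not code from the paper), the coefficient estimation of Eq. (6) and the point mapping of Eq. (1) can be implemented in a few lines of Python with NumPy; the function names `perspective_coefficients` and `warp_point` are our own:

```python
import numpy as np

def perspective_coefficients(src_pts, dst_pts):
    """Solve Eq. (6), a = (M^T M)^-1 M^T b, for the eight coefficients
    a1..a8 from n >= 4 corresponding points (x, y) -> (u, v)."""
    M, b = [], []
    for (x, y), (u, v) in zip(src_pts, dst_pts):
        # Two rows of Eq. (3) per point pair, from Eq. (1) cleared of its denominator.
        M.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        M.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.extend([u, v])
    M, b = np.asarray(M, float), np.asarray(b, float)
    a, *_ = np.linalg.lstsq(M, b, rcond=None)  # least-squares solution of Eq. (2)
    return a

def warp_point(a, x, y):
    """Apply the transformation of Eq. (1) to a single point."""
    d = a[6] * x + a[7] * y + 1.0
    return (a[0] * x + a[1] * y + a[2]) / d, (a[3] * x + a[4] * y + a[5]) / d
```

With exactly four point pairs in general position the system is determined; with more (e.g., the 70 checkerboard corners used later), the least-squares form of Eq. (6) averages out corner-detection noise.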

Neyman's smooth test
In the proposed PFM, we evaluate the spatial similarity by comparing the density histograms of the pixels in a local mask on the two pre-focused images. The smooth test, in which the goodness of fit of the mean and variance of the histogram is tested, is used for the evaluation. The smooth test is a nonparametric test based on the fact that a distribution, i.e., a histogram, transferred through its cumulative distribution function forms a uniform distribution. We assume that a distribution A is the same as another distribution B, from which the cumulative distribution function is obtained, when the distribution transferred from A forms a uniform distribution. The first and second moments are used to evaluate the uniformity of the transferred distribution.
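The probability-integral-transform fact underlying the test can be checked numerically. The following sketch is our own illustration (an arbitrary normal distribution stands in for a density histogram): samples are transferred through an empirical cumulative distribution function, and the first two moments of the transferred values should be close to 1/2 and 1/3, the moments of a uniform distribution on [0, 1].

```python
import numpy as np

rng = np.random.default_rng(0)

# Distribution B: stand-in for the density histogram of one mask.
b = rng.normal(loc=128, scale=20, size=10000)

def empirical_cdf(samples):
    """Return F(x): the fraction of `samples` that are <= x."""
    s = np.sort(samples)
    return lambda x: np.searchsorted(s, x, side="right") / len(s)

F = empirical_cdf(b)

# Distribution A drawn from the same law: its transfer F(A) is ~uniform.
a_same = rng.normal(loc=128, scale=20, size=10000)
y = F(a_same)
print(y.mean(), (y ** 2).mean())  # close to 1/2 and 1/3 for a uniform distribution
```

If A came from a different distribution, the transferred values would pile up near 0 or 1 and the moments would deviate, which is exactly what the statistics of the next subsection measure.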
Let $f(x)$ and $F(x)$ be the density distribution function in a local mask on one of the pre-focused images and its cumulative distribution function, respectively, and let $y_i = F(x_i)$ be the transferred density of the $i$-th pixel ($i = 1, \ldots, N$) in the corresponding mask on the other image. To calculate the first and second moments of the transferred distribution, we define the statistics T and S as

$$ T = \sum_{i=1}^{N} y_i \quad\text{and}\quad S = \sum_{i=1}^{N} y_i^2. $$

When the transferred distribution is uniform on [0, 1], we obtain the expected values as

$$ E[T] = N/2 \quad\text{and}\quad E[S] = N/3. $$

Therefore, the goodness of fit in mean difference is given as

$$ g_1 = \frac{3}{N}\,(2T - N)^2, $$

and that of the dispersion is given as

$$ g_2 = \frac{5}{N}\,(6S - 6T + N)^2. $$

Therefore, the total goodness of fit is described (8) as

$$ g = g_1 + g_2. $$

As the goodness g asymptotically approaches a chi-squared distribution, the threshold for the test is determined by a percent point of the chi-squared distribution. When the g/N value obtained from the pixels in a small mask is below the threshold, we recognize the area as part of an object at the pre-determined distance. The small mask is scanned over the pre-focused stereo pair images, skipping the edges in the image. For pedestrians entering the street, there is no need to process the entire image; therefore, we set a region of interest (ROI) with some lattice points in the image and apply the smooth test only at those points. A pedestrian is detected when at least a fixed minimum number of points give a g/N value below the threshold.
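A minimal sketch of the statistic, assuming the T/S formulation described above (the function names and the closed-form chi-squared percent point for two degrees of freedom are ours, not the paper's):

```python
import numpy as np

def smooth_test_g(mask_a, mask_b):
    """Order-2 smooth-test statistic g for the pixel densities of two masks.
    mask_b defines the reference distribution via its empirical CDF;
    mask_a is transferred through it and tested for uniformity."""
    b = np.sort(np.ravel(mask_b))
    y = np.searchsorted(b, np.ravel(mask_a), side="right") / b.size
    N = y.size
    T, S = y.sum(), (y ** 2).sum()
    g1 = (3.0 / N) * (2.0 * T - N) ** 2            # mean-difference component
    g2 = (5.0 / N) * (6.0 * S - 6.0 * T + N) ** 2  # dispersion component
    return g1 + g2                                  # ~ chi-squared, 2 d.o.f.

def chi2_2dof_ppf(p):
    """Percent point of the chi-squared distribution with 2 d.o.f. (closed form)."""
    return -2.0 * np.log(1.0 - p)
```

Identical masks yield a small g (good fit to uniformity), while a shifted density yields a large one; a decision threshold can be taken from `chi2_2dof_ppf`, e.g. about 5.99 at the 95% point.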

Experimental results and discussion
We used a digital camera, a FinePix REAL 3D W3 (Fuji Film), as the stereo camera, as shown in Fig. 3. The camera can record stereo video data. The image size was 1280 × 720 pixels, and the distance between the two lenses was 75 mm. The camera also had a zooming function. The recorded data were partitioned and processed as sequential still images.

Coefficients for the perspective transformation
A large-sized checkerboard plane was observed at a zoom factor of three by the stereo camera, which slid on a rail so that its optical axis did not change during the series of observations. The observation distance was changed from 5 to 10 [m] in 1 [m] steps. Figure 4(a) shows the color composite of the stereo pair observed at a distance of 10 [m], in which the left image is assigned to red and the right image to green and blue. We can see some disparities on the checkerboard. The pre-focusing coefficients were obtained so that the right image was well rectified to the left image. Figure 4(b) shows an example of the rectification, in which 70 corner points were extracted and used for the calculation. We can see that the disparities in Fig. 4(a) are removed in Fig. 4(b). Some of the coefficient values are listed in Table 1.

Table 1. Coefficients for the perspective transformation.
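Once the coefficients are available, pre-focusing amounts to resampling one image through the transformation. The following nearest-neighbour sketch is ours, not the paper's implementation; it assumes the coefficients are estimated in the destination-to-source direction (i.e., with the point lists swapped in the solver), so that each output pixel looks up its source pixel directly:

```python
import numpy as np

def prefocus(image, a):
    """Resample `image` through Eq. (1), with (x, y) taken as destination
    coordinates and (u, v) as source coordinates (inverse mapping).
    Pixels that map outside the source image are set to zero."""
    h, w = image.shape[:2]
    yy, xx = np.mgrid[0:h, 0:w].astype(float)       # destination grid
    d = a[6] * xx + a[7] * yy + 1.0
    u = np.rint((a[0] * xx + a[1] * yy + a[2]) / d).astype(int)
    v = np.rint((a[3] * xx + a[4] * yy + a[5]) / d).astype(int)
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    out = np.zeros_like(image)
    out[inside] = image[v[inside], u[inside]]        # nearest-neighbour sampling
    return out
```

A production system would use bilinear interpolation (or a library warp routine), but the nearest-neighbour version is enough to show the mapping.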

Extraction of the focused area
Figure 5(a) shows one of the original frames in the video data captured by the bicycle-mounted stereo camera. In the frame, a pedestrian near the left edge of the frame was 10 [m] from the camera. We can see that the object disparities depend on their distances from the camera. The 3D surroundings can be obtained by searching for corresponding point pairs over the image. Figure 5(b) shows the result of pre-focusing at 10 [m]. We can see that only the pedestrian at the pre-determined distance was well rectified, while the other objects still had some disparities. To extract the focused area, we evaluated the spatial similarity between the pre-focused stereo pair images. We compared the performance of the smooth test (Fig. 6(b)) with that of the correlation coefficient method (Fig. 6(a)). A 5 × 5 square mask was used for both results. The extractions in Figs. 6(a) and 6(b) took approximately 0.4 and 1.4 [s], respectively, on a PC (CPU: i7-4790 at 3.6 GHz, 32 GB memory). Areas were extracted when the correlation coefficient exceeded 0.7 in Fig. 6(a) and when the g/N value was below 100 in Fig. 6(b). These thresholds were experimentally selected so that the ratio of correctly extracted areas to incorrect ones was maximized. Comparing the results, the correlation coefficient method showed poor detection performance for the focused pedestrian and larger incorrect areas at edges and uniform-density regions in the scene. Conversely, the smooth test yielded higher detection performance and fewer errors. The edges in the scene affected both the correlation coefficient and the g value of the smooth test. To reduce incorrect detections at the edges, we eliminated them from the processing regions. The edges were detected by a Prewitt filter with a threshold of 20. Figure 7 shows the color composite result, in which the areas detected by the smooth test are assigned to red, those detected by the correlation coefficient method to green, and the edges to blue. The white regions in Fig. 7 indicate edges that have high spatial similarity; therefore, their elimination is expected to reduce noise in the background. In fact, the performance when detecting the focused area was significantly improved, as shown in Fig. 8.

Fig. 8. Eliminating edges when extracting the pedestrian. (a) Extracted pedestrian at the pre-determined distance. (b) No pedestrian extracted at the pre-determined distance.
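The edge elimination step can be sketched as follows, assuming a plain 3 × 3 Prewitt gradient magnitude with the threshold of 20 mentioned above (the border handling is our simplification):

```python
import numpy as np

def prewitt_edges(gray, threshold=20.0):
    """Boolean edge mask via 3x3 Prewitt gradients; True marks pixels to be
    excluded from the similarity evaluation. Border pixels stay unmarked."""
    g = gray.astype(float)
    gx = np.zeros_like(g)
    gy = np.zeros_like(g)
    # Horizontal gradient: right column minus left column of the 3x3 window.
    gx[1:-1, 1:-1] = (g[:-2, 2:] + g[1:-1, 2:] + g[2:, 2:]
                      - g[:-2, :-2] - g[1:-1, :-2] - g[2:, :-2])
    # Vertical gradient: bottom row minus top row of the 3x3 window.
    gy[1:-1, 1:-1] = (g[2:, :-2] + g[2:, 1:-1] + g[2:, 2:]
                      - g[:-2, :-2] - g[:-2, 1:-1] - g[:-2, 2:])
    return np.hypot(gx, gy) > threshold
```

Masked pixels are simply skipped during the similarity scan, which removes the white (edge, yet high-similarity) regions seen in Fig. 7.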

Detection of pedestrian at pre-determined distance
To quantitatively evaluate the performance of the proposed PFM in detecting a pedestrian entering a street, we took a short video using the stereo camera attached to a bicycle. The focused distance was 10 [m]. For safety, the pedestrian was recorded from a distance. We introduced a watching window with 15 lattice points and tentatively set the window near the left edge of the scene; however, the position needs to be carefully designated in an actual system. Figure 9 shows the position of the window. The video data were divided into 130 stereo pair images, and each pair was processed by the PFM with a 5 × 5 pixel mask. Figure 10 shows the number of lattice points on which an object area was extracted in each image pair. In the video data, the pedestrian was captured near the end; therefore, the extracted object area was part of the pedestrian. The result shown in Fig. 10 indicates that the pedestrian can be detected by counting the number of lattice points.

Fig. 10. The number of lattice points on which an object area was extracted.
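The watching-window decision can be sketched as below. The lattice coordinates, the minimum point count, and the default g/N threshold of 100 are illustrative assumptions, and any similarity statistic can be plugged in via the `g_over_n` callable:

```python
import numpy as np

# Hypothetical lattice: 15 points (3 rows x 5 columns) near the left edge.
LATTICE = [(40 + 30 * r, 20 + 30 * c) for r in range(3) for c in range(5)]

def detect_pedestrian(left, right_prefocused, g_over_n,
                      threshold=100.0, min_points=3, mask_half=2):
    """Count lattice points whose 5x5 mask passes the similarity test.
    `g_over_n` returns the g/N statistic for two masks (e.g., the smooth
    test); a pedestrian is reported when at least `min_points` lattice
    points fall below `threshold`."""
    hits = 0
    for (r, c) in LATTICE:
        a = left[r - mask_half:r + mask_half + 1, c - mask_half:c + mask_half + 1]
        b = right_prefocused[r - mask_half:r + mask_half + 1,
                             c - mask_half:c + mask_half + 1]
        if g_over_n(a, b) < threshold:
            hits += 1
    return hits >= min_points, hits
```

Because only 15 masks are evaluated per frame instead of a full-image scan, the per-frame cost is essentially constant regardless of image size.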

Focusing performance
We captured another pedestrian crossing the street at a distance of 8 [m], as shown in Fig. 11. Using this stereo pair image, we evaluated the focusing performance of the proposed PFM. We applied transformations with focusing coefficients for 5 to 10 [m] to the image pair and performed the area extraction using the smooth test, as in the previous result. The performance was evaluated using the number of lattice points on which an object area was extracted, as shown in Fig. 12. The result indicates that the number of lattice points reached a maximum at the pre-determined distance of 8 [m] and that the in-focus region extended a few meters in distance. The extension of the in-focus region resulted from the small baseline (75 mm) of the stereo camera; a longer baseline would provide higher focusing performance. The peak of the curve included only 8 of the 15 points because the pedestrian was dressed in dark, textureless clothes, and the points on the clothes did not contribute to the detection. The PFM does not work well for a pedestrian dressed entirely in black, as is the case with conventional 3D matching methods.

Fig. 9. A watching window with 15 lattice points.
Fig. 12. Focusing performance of the PFM.
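The distance sweep itself is simple orchestration. A hypothetical sketch (all function names are ours): evaluate each calibrated distance and take the one with the most in-focus lattice points as the object's distance.

```python
def focus_sweep(left, right, coeffs_by_distance, prefocus_fn, count_fn):
    """For each calibrated distance, pre-focus the right image and count
    the lattice points classified as in-focus. Returns {distance: count}."""
    return {d: count_fn(left, prefocus_fn(right, a))
            for d, a in coeffs_by_distance.items()}

def best_distance(counts):
    """The distance whose count peaks, within the depth resolution
    allowed by the stereo baseline."""
    return max(counts, key=counts.get)
```

`prefocus_fn` and `count_fn` stand for the warping and lattice-counting steps described earlier; how sharply the counts peak depends on the baseline, as the 75 mm result above shows.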

Conclusions
In this paper, we proposed a pre-focusing method, PFM, to detect pedestrians entering the street using a stereo camera. The method modifies one of the stereo pair images so that no relative disparity remains for an object at a pre-determined distance. The object is detected as an area with high spatial similarity between the pre-focused stereo pair images. The similarity is evaluated nonparametrically using the smooth test to avoid errors due to edges and/or constant-density areas, such as the road surface. An edge-eliminating technique was also used to reduce false-positive detections and to achieve a higher signal-to-noise ratio. To detect pedestrians entering the street, we introduced a watching window with interior lattice points; the number of lattice points on which an object area was extracted correlated with the existence of a pedestrian. Restricting the processing area was effective in decreasing the processing time. Future areas of study include improving the accuracy of pedestrian detection by tuning the parameters of the similarity evaluation, designing a prototype system, and evaluating the total performance with regard to pedestrians entering the street.

Fig. 2. Planar objects and a stereo camera.

Fig. 3. A stereo camera used in the experiments.
Fig. 4. Checkerboard images at a distance of 10 [m] (red: left image; green and blue: right image). (a) Color composite image of the original observations. (b) Color composite image with pre-focusing.
Fig. 6. Extraction of the focused area. (a) Result processed by the correlation coefficient method. (b) Result processed by the smooth test.

Fig. 7. Color composite image. Red shows the result of the smooth test, green that of the correlation coefficient method, and blue the edges.

Fig. 11. A pedestrian at 8 [m] from the camera with the watching window (red: left image; green and blue: right image).