Merging Scored Bounding Boxes with Gaussian Mixture Model for Object Detection

Object detection has long struggled with the problem of how to localize the accurate position of a target from a large number of scored detections. One of the most widely used methods is non-maximum suppression (NMS). However, because this method can only select high-score detections locally, the result is sometimes less accurate. In this paper, we propose a novel approach that merges all the scored bounding boxes with a Gaussian Mixture Model (GMM), taking not only the spatial information but also the score of each detection into account. We report experiments on both pedestrian detection and face detection with publicly available datasets. The drawbacks of NMS are overcome to some extent, and the proposed method outperforms other conventional methods.


Introduction
Most object detection methods (1)(2) rely on sliding windows (3), scanned from top left to bottom right. Each candidate window is sent to a classifier/detector, which returns a score reflecting how confident the classifier is that the window is positive. Usually, this procedure produces a large number of detections whose scores increase as the windows get closer to the correct location of the target. What users are most interested in is only the window which tightly encloses the target, rather than the numerous raw scored detections; i.e., we need to merge all the detections into, ideally, one bounding box (BB) per object.
The process of merging detections usually exploits non-maximum suppression (4) (NMS). The steps of merging detections with NMS are as follows: 1. Select the best-scored window among all of the detections.
2. Delete redundant detections according to a predetermined threshold. 3. Repeat steps 1-2 until there is no more BB that needs to be suppressed. In the second step, the threshold controls how much suppression is applied. Fast and simple though NMS is, its accuracy suffers greatly from false positives. On the other hand, both accuracy and processing time depend on the sampling step of the sliding window. No classifier reaches such ideal accuracy that the score grows monotonically as the sliding window approaches the target. When false positives show up, the precision of NMS decreases considerably because it only chooses the bounding box with the highest score. Also, it is time-consuming to slide the window with a step of one pixel. On the other hand, sampling BBs with a relatively large step inevitably means the BB at the best location may not be sampled, in which case the nearby BBs become the best choices. In conclusion, considering these characteristics of NMS, the top-scored BB may not be the ideal choice.

Fig. 1 An example of merging all raw detections. Raw detections (top), the results of NMS (bottom left), and the results of the proposed method (bottom right) are shown respectively. The results of NMS are visually not the best bounding boxes, since they are either too large or sometimes too small. On the contrary, our proposed method takes both the spatial information and the score of each detection into account, which leads to better results.
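The greedy NMS procedure described above can be sketched as follows. This is a generic version for (x, y, w, h) boxes, not necessarily the exact variant used in our experiments; `iou` and `nms` are illustrative helper names.

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boxes given as (x, y, w, h)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2 = min(a[0] + a[2], b[0] + b[2])
    y2 = min(a[1] + a[3], b[1] + b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, thresh=0.5):
    """Greedy NMS: keep the best-scored box, drop overlapping ones, repeat."""
    order = np.argsort(scores)[::-1]      # step 1: best-scored box first
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(best)
        # step 2: suppress boxes whose IoU with the best box exceeds thresh
        order = np.array([i for i in order[1:]
                          if iou(boxes[best], boxes[i]) < thresh])
    return keep                            # step 3: repeat until empty
```

The threshold trades off suppression strength: a lower value merges more aggressively, a higher value keeps more nearby boxes.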
It is necessary to overcome this drawback resulting from the limitation of NMS. Since the problem of merging all the detections can be interpreted as a clustering problem (5), we utilize a Gaussian Mixture Model (GMM) (6)(7) to cope with the merging problem.
The widely used clustering methods are mean-shift (8)(9)(10) and K-means (11)(12)(13). However, mean-shift depends on sufficiently high-density data. Also, data points are likely to fall into the wrong cluster when the data are relatively natural. As for K-means, it is generally incapable of finding non-convex clusters, which lowers its versatility across various forms of data. A different idea is to formulate a probabilistic distribution to cluster the detections, as proposed in (14). However, the scale-sensitive Gaussians (SSG) in that work form a 3-d model with position and scale.
The method neglects potentially essential parameters of detections (e.g., score and anisotropic scaling) and only sticks to the uniform scaling (one parameter) and the location of the BB (two parameters), which turns out to be inadequate.
In the case of GMM, all the detections obtained from the detector can be considered to follow some probabilistic distribution, and our contribution in this paper is to use a mixture of Gaussian distributions to approximate the distribution of the BBs' parameters. Since each BB has 5 parameters, the proposed GMM is five-dimensional. Nevertheless, GMM needs a pre-fixed number of clusters k. In this paper, k is determined by the number of NMS's outputs. However, NMS cannot always perform accurately, as mentioned above, and it empirically tends to provide redundant results. To overcome this problem, we add an additional step that prunes the results from GMM by reapplying NMS.

Proposed method
In this section, we introduce how we exploit the GMM and the EM algorithm to estimate the parameters of BBs in order to merge the detections.

Gaussian Mixture Model
The probability density function of a GMM with K components is

p(x \mid \Theta) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k),   (1)

where x is the given n-dimensional data point, \pi_k is the mixture coefficient of the k-th component, ranging over [0, 1] with \sum_{k=1}^{K} \pi_k = 1, and \mu_k and \Sigma_k are the mean and covariance matrix of the k-th Gaussian component, respectively. \Theta denotes the vector of all unknown parameters. Theoretically, with a sufficient number of Gaussian components, a GMM can approximate any distribution to any precision.
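Equation (1) can be evaluated directly. The sketch below, assuming NumPy-style arrays, implements the mixture density with a hand-rolled Gaussian density; `gaussian_pdf` and `gmm_pdf` are illustrative names.

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Density of a d-dimensional Gaussian N(x | mean, cov)."""
    d = len(mean)
    diff = x - mean
    inv = np.linalg.inv(cov)
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * diff @ inv @ diff) / norm

def gmm_pdf(x, weights, means, covs):
    """Eq. (1): p(x | Theta) = sum_k pi_k * N(x | mu_k, Sigma_k)."""
    return sum(w * gaussian_pdf(x, m, c)
               for w, m, c in zip(weights, means, covs))
```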
The data obtained by the detector can be formulated as D = \{d_1, d_2, d_3, \dots, d_N\}. Each d_i = (x_i, y_i, w_i, h_i, s_i) is a 5-d data point, in which (x_i, y_i) is the BB's top-left corner, w_i and h_i are the width and height of the BB, and s_i is the score of the BB obtained from the classifier. We consider that all the BBs follow this GMM with k components, so the goal is to determine the parameters \pi, \mu, and \Sigma of the GMM. Given that the model is a 5-d mixture model, the parameters to be estimated for each component are \pi_k, \mu_k = (\mu_{x_k}, \mu_{y_k}, \mu_{w_k}, \mu_{h_k}, \mu_{s_k}), and \Sigma_k.
Generally, we estimate the parameters by maximizing the following log-likelihood function:

\ln L(\Theta) = \sum_{i=1}^{N} \ln \left( \sum_{k=1}^{K} \pi_k \, \mathcal{N}(d_i \mid \mu_k, \Sigma_k) \right).   (2)

EM Optimization
It is extremely difficult to differentiate Equation (2), set the derivative to zero, and solve it directly. Meanwhile, the data are actually incomplete, since each point is missing the label that tells which cluster it belongs to. Let the latent variables Z = \{z_1, z_2, \dots, z_N\} be the missing part of the data, where each z_i is a K-dimensional indicator vector. We use the EM algorithm (15)(16)(17), which is effective when latent variables are involved, to maximize Equation (2). The E-step and the M-step are the two parts of the EM algorithm. In the E-step we calculate the posterior probability of each mixture component, based on which we determine the new parameters in the M-step, so that the value of (2) gradually increases. Iterating the E-step and M-step until convergence, the EM algorithm eventually maximizes (2).

E-step:
In this step we compute \gamma_{ik}, the probability that the i-th point is generated by, or belongs to, the k-th Gaussian component. Through Bayes' theorem, we can compute \gamma_{ik} using the parameters estimated in the last iteration:

\gamma_{ik} = \frac{\pi_k \, \mathcal{N}(d_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(d_i \mid \mu_j, \Sigma_j)}.   (3)

M-step:
In this step we increase Equation (2) by raising its lower bound, obtained with the help of Jensen's inequality and the \gamma_{ik} derived in the last E-step; i.e., Equation (2) is bounded below by

\sum_{i=1}^{N} \sum_{k=1}^{K} \gamma_{ik} \ln \frac{\pi_k \, \mathcal{N}(d_i \mid \mu_k, \Sigma_k)}{\gamma_{ik}}.   (4)

Taking the differential of (4) and setting it to zero, we obtain each parameter \pi_k, \mu_k, and \Sigma_k of each Gaussian component. The spatial components (\mu_{x_k}, \mu_{y_k}, \mu_{w_k}, \mu_{h_k}) of each converged mean then give the parameters (x_k, y_k, w_k, h_k) of the corresponding result BB.
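The E-step and M-step above can be sketched in one routine. This is a standard GMM EM iteration with the usual closed-form updates; the small regularizer added to each covariance is our own assumption to avoid singular matrices, not part of the derivation.

```python
import numpy as np

def _gauss(X, mean, cov):
    """Vectorized Gaussian density N(x | mean, cov) for each row of X."""
    d = X.shape[1]
    diff = X - mean
    inv = np.linalg.inv(cov)
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * np.einsum('nd,de,ne->n', diff, inv, diff)) / norm

def em_step(X, weights, means, covs, eps=1e-6):
    """One EM iteration: E-step computes gamma_ik as in Eq. (3);
    M-step re-estimates pi_k, mu_k, Sigma_k from the lower bound."""
    N, K = X.shape[0], len(weights)
    # E-step: responsibilities gamma_ik (posterior of component k for point i)
    gamma = np.column_stack([weights[k] * _gauss(X, means[k], covs[k])
                             for k in range(K)])
    gamma /= gamma.sum(axis=1, keepdims=True)
    # M-step: closed-form updates derived by differentiating the bound
    Nk = gamma.sum(axis=0)
    new_weights = Nk / N
    new_means = (gamma.T @ X) / Nk[:, None]
    new_covs = []
    for k in range(K):
        diff = X - new_means[k]
        cov = (gamma[:, k, None] * diff).T @ diff / Nk[k]
        new_covs.append(cov + eps * np.eye(X.shape[1]))  # avoid singularity
    return new_weights, new_means, new_covs
```

Iterating `em_step` until the parameters stop changing gives the converged mixture.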

Experiment
In this section, we report the experimental results on merging scored bounding boxes. We compare our proposed method against several conventional approaches on different test images and classification tasks. We use the IoU (18) between each detection result and the ground truth, and its average value, as the evaluation criterion.

Setup of Experiment
As a preparation, we feed each image into a detector and obtain scored detections. The features and classifiers we used are listed in Table 1.
The images we use are from the Penn-Fudan (21) and INRIA (22) person databases for pedestrian detection, and the WIDER dataset (23) for face detection. The experiment is carried out on 60 images for face detection and 100 images for pedestrian detection, 160 images in total. The face images we choose obey the following rules: 1. Faces in the images all face the front, with as little occlusion as possible.
2. There are no hidden faces in the background. 3. The faces are not blurred. These three rules are to make sure that the detector can recognize all the faces and give scored detections, as this paper does not focus on the performance of the detector.
Given that some face images in the database are too large and contain wide face-free regions, we simply crop off the redundant parts without faces (such as scenery) for higher detection efficiency.
Since NMS does not need the pre-fixed cluster number K but our method does, the K we choose for GMM is the number of clusters NMS returns. But due to noise, the K returned by NMS tends to be larger than the true number of targets.
Notice that since the GMM we use is a 5-d model including the score, the parameters of each Gaussian component are also 5-dimensional. It is then necessary to apply NMS to the output of GMM again. Through this step, the output of GMM is refined, as nearby low-scored results can all be discarded.
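The whole pipeline (a first NMS pass to pick K, a 5-d GMM fit, then a second NMS pass over the component means) can be sketched as below. It assumes scikit-learn's `GaussianMixture` is available; the function names and thresholds are illustrative, not the exact implementation used in our experiments.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def _nms(boxes, scores, thresh=0.5):
    """Greedy NMS over (x, y, w, h) boxes; returns indices of kept boxes."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 0] + boxes[i, 2], boxes[rest, 0] + boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 1] + boxes[i, 3], boxes[rest, 1] + boxes[rest, 3])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (boxes[i, 2] * boxes[i, 3]
                       + boxes[rest, 2] * boxes[rest, 3] - inter)
        order = rest[iou < thresh]
    return keep

def merge_detections(dets, nms_thresh=0.5):
    """Merge scored detections given as an (N, 5) array of [x, y, w, h, score]:
    K from a first NMS pass, fit a 5-d GMM, keep component means, re-apply NMS."""
    K = max(1, len(_nms(dets[:, :4], dets[:, 4], nms_thresh)))
    gmm = GaussianMixture(n_components=K, covariance_type='full',
                          random_state=0).fit(dets)
    merged = gmm.means_                  # one 5-d mean per component
    keep = _nms(merged[:, :4], merged[:, 4], nms_thresh)
    return merged[keep]
```

The second NMS pass is what prunes the redundant components that arise when the first pass overestimates K.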

Evaluation
The evaluation uses the intersection over union (IoU) of the ground-truth BB B_{gt} and the result BB B_{result}:

IoU_m = \frac{|B_{result}^{(m)} \cap B_{gt}^{(m)}|}{|B_{result}^{(m)} \cup B_{gt}^{(m)}|},

where IoU_m is the IoU of the m-th result BB and its corresponding ground-truth BB.
We divide the total of 160 images into 8 groups, 5 for face detection and 3 for pedestrian detection, according to the number of targets in each image. Specifically, groups 1 to 5 in face detection contain 1-2, 3-4, 5-6, 7-8, and 9-10 faces respectively, and groups 6 to 8 in pedestrian detection contain 1-2, 3-4, and 5-6 pedestrians respectively. Each group contains 20 images. Letting i be the image index, j be the group number, and n_{ij} be the total number of targets in the i-th image of the j-th group, we denote A_{ij} as the average IoU of the i-th image in the j-th group, and A_j as the average IoU of the j-th group.

Table 1. Features and classifiers for different detection tasks

            Face detection (19)    Pedestrian detection (20)
feature     Haar-like              HOG
classifier  Adaboost               Linear SVM

Fig. 2 Example of comparative results of the image with 6 faces.
There exists the possibility that several result BBs intersect the same ground-truth BB. In this paper, we only take the result BB that has the maximum IoU with the ground-truth BB among all the intersecting result BBs.
Also, one result BB may intersect more than one ground-truth BB at the same time, especially when two ground-truth BBs are relatively close. In this situation, we assign the result BB to the ground-truth BB with which its IoU is highest; each result BB can only be counted once, and each ground-truth BB can have only one result BB as its result. Fig. 2-Fig. 7 show some comparative results of the conventional approaches and the proposed approach. In Fig. 2-Fig. 4, we can find that traditional NMS tends to select a BB that contains only part of the face, regardless of other better BBs with lower scores in some cases, and the same holds for the pedestrian detection task in Fig. 5-Fig. 7. This is the major disadvantage mentioned before: NMS only chooses the given top-scored detection as the output. Instead, GMM can make a comprehensive decision by judging the label of each detection to find the center of each cluster, avoiding the influence of false positives.
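The matching rules above can be sketched as a greedy one-to-one assignment by descending IoU; `match_results` is a hypothetical helper name, not part of our evaluation code.

```python
def match_results(results, gts):
    """Greedy one-to-one matching of result BBs to ground-truth BBs by IoU:
    each ground-truth box gets at most one result box, and vice versa."""
    def iou(a, b):
        x1, y1 = max(a[0], b[0]), max(a[1], b[1])
        x2 = min(a[0] + a[2], b[0] + b[2])
        y2 = min(a[1] + a[3], b[1] + b[3])
        inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
        return inter / (a[2] * a[3] + b[2] * b[3] - inter) if inter > 0 else 0.0

    # All candidate (result, ground truth) pairs, best IoU first
    pairs = sorted(((iou(r, g), i, j) for i, r in enumerate(results)
                    for j, g in enumerate(gts)), reverse=True)
    used_r, used_g, matched = set(), set(), {}
    for v, i, j in pairs:
        if v > 0 and i not in used_r and j not in used_g:
            matched[j] = (i, v)   # ground truth j -> (result index, IoU)
            used_r.add(i)
            used_g.add(j)
    return matched
```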
Nevertheless, neither the conventional approaches nor the proposed approach shows good results in some cases, such as the second face from the left in Fig. 3, or the first person in Fig. 7. One possible reason is that the detections themselves are not sufficient or satisfactory, so that GMM cannot work well. Another explanation is that since GMM is based on the assumption that the data follow a Gaussian distribution, its limitation can appear when the data do not follow a Gaussian distribution very well.

Conclusion
In this paper, we proposed a new approach that merges scored bounding boxes for object detection by combining a Gaussian Mixture Model with non-maximum suppression. The results of the experiment on 160 images with different sizes of faces and pedestrians show that the IoU of the final detections increased by 7.8% on average compared with NMS. However, the proposed method still has much room for improvement as future work. For example, a more powerful model such as a generalized GMM could be applied to achieve better performance.

Fig. 3 Example of comparative results of the image with 7 faces.

Fig. 5
Fig. 4 Example of comparative results of the image with 10 faces.

Fig. 6 Example of comparative results of the image with 3 pedestrians.

Fig. 7 Example of comparative results of the image with 4 pedestrians.

Table 2. Average IoU A_{ij} for each image in each group; A_{ij} is used to calculate the group average A_j in Equation (10).