A Survey of Small Object Detection Based on Deep Learning

As a basic visual recognition problem in computer vision, object detection has made great progress, first with traditional hand-crafted features and later with deep learning algorithms. However, research on small object detection has only begun to appear in recent years and has become a hot and difficult topic in the field; most methods are built on existing object detection algorithms and modify them to improve detection accuracy. With the rapid development of deep learning, small object detection based on deep learning has made great progress. It is widely needed in automatic driving, remote sensing image detection, criminal investigation and other fields, so research on small object detection has strong practical value. In this paper, the existing research on small object detection is reviewed in detail. Firstly, the existing algorithms are divided into one-stage and two-stage methods according to the number of detection stages, and the characteristics of these algorithms are analyzed. Secondly, the commonly used small object detection datasets are introduced. Finally, the challenges of small object detection are summarized, and future research directions are discussed.


Introduction
Small object detection has extensive and important applications in many fields [1]. For example, in automatic driving, pedestrians and traffic signs in the images collected by cars are often very small, but accurately detecting these small objects is an important guarantee of safe automatic driving. In medicine, the successful detection of small tumors is the premise of early and accurate diagnosis. In automatic industrial inspection, locating small defects on material surfaces also reflects the importance of small object detection. In satellite image analysis, it is necessary to detect objects such as cars and ships, which are often difficult to detect because of their small scale. In criminal investigation, the efficiency of solving cases can be greatly improved by effectively identifying small abnormal items. Therefore, studying effective methods for small object detection and improving detection performance on small objects is an important and urgent research topic.
There are two main definitions of small objects [2]. The first is the absolute small object: the COCO dataset [3] [43] specifies that an object occupying fewer than 32×32 pixels can be regarded as a small object. The second is the relative small object: an object is considered relatively small when its size is less than 0.1 times the size of the original image. Small objects are difficult to detect for two reasons. Firstly, a small object has few pixels and therefore little feature information. Secondly, because it has so few pixels, its features become even harder to extract after multiple down-sampling operations [4]. Therefore, small object detection faces great challenges. At present, there are few algorithms designed specifically for small object detection, and most existing approaches modify general object detection algorithms to improve their performance on small objects [5] [6] [7].
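The two definitions above can be sketched as simple size checks. This is only an illustration: the function names are hypothetical, and the relative definition is read here as "both box sides are below 0.1 times the corresponding image side", which is one possible interpretation of the text rather than a fixed standard.

```python
def is_absolute_small(box_w, box_h, max_area=32 * 32):
    """Absolute definition: the object covers fewer than 32x32 pixels."""
    return box_w * box_h < max_area


def is_relative_small(box_w, box_h, img_w, img_h, ratio=0.1):
    """Relative definition (assumed reading): both sides of the box are
    shorter than `ratio` times the matching side of the image."""
    return box_w < ratio * img_w and box_h < ratio * img_h


if __name__ == "__main__":
    # A 20x25 px pedestrian in a 1920x1080 frame is small under both rules.
    print(is_absolute_small(20, 25), is_relative_small(20, 25, 1920, 1080))
    # A 100x100 px object is not absolutely small, but is relatively small here.
    print(is_absolute_small(100, 100), is_relative_small(100, 100, 1920, 1080))
```

The example shows that the two definitions can disagree, so a paper should state which one it uses.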
In traditional machine-learning-based object detection, an image pyramid is constructed and small objects are mainly detected at the bottom level of the pyramid. With the emergence and development of deep learning, the use of image pyramids to detect objects of different scales has gradually been replaced by convolutional neural networks [8], which can effectively improve detection performance across scales by forming multi-level, rich feature representations. In a convolutional neural network, low-level features contain abundant detail, which is conducive to small object detection; high-level features contain rich semantic information, which is conducive to large object detection. With continued in-depth research, the detection of small objects has improved greatly, but its performance still lags well behind that of medium and large objects.
Regarding the progress of research on small object detection, Mask R-CNN, proposed by He et al. [29] in 2017, made a breakthrough. It not only addresses the longstanding problem of precision loss in R-CNN, but also reduces missed detections of small objects; in addition, the contour of each detected object can be segmented, which makes localization more accurate. Building on the good performance of Mask R-CNN, Yang et al. [30] proposed a ship detection method together with a training strategy of hard sample relearning. Training on hard samples improves the network's ability to adapt to difficult samples and therefore to locate and recognize objects in complex scenes. This weakens the interference of cloud occlusion and of shore and land background, reduces missed detections of small ships, and solves the problem of ship segmentation and recognition in complex scenes to a certain extent. Hu et al. [31] proposed an algorithm for small face detection that studies the problem from three aspects: scale invariance, image resolution, and contextual reasoning. In a picture containing 1000 faces, it can detect about 800.

Research status of small object detection algorithms
At present, small object detection algorithms are based on convolutional neural networks [9] [10] [11]. The popular algorithms can be divided into two categories. The first category is the one-stage object detection method, which uses a convolutional neural network to predict object categories and locations directly, without extracting candidate object regions. Bounding box localization is treated as a regression problem, so the network directly outputs class probabilities and location coordinates, which gives it the advantage of speed. The second category is the two-stage object detection method: firstly, candidate regions that may contain objects are generated by selective search [12], edge detection, a region proposal network or other methods; then a convolutional neural network extracts features from the candidate regions for accurate class estimation and bounding box regression. The final detection result is therefore obtained in two stages. This kind of method has high accuracy, but its processing efficiency and inference speed are low, which makes it unsuitable for real-time detection tasks. Both one-stage and two-stage methods face the difficulty of detecting small objects.

Roadmap of small object detection
In the past ten years, small object detection algorithms based on deep learning have mainly fallen into two categories: one-stage object detection methods and two-stage object detection methods. The main development roadmap is shown in Fig.1.

One-stage small object detection algorithm
(a) YOLO algorithm YOLO is the first single-stage object detector in the era of deep learning. YOLO is the abbreviation of "You Only Look Once". As the name suggests, feature extraction is performed only once on the input image. The authors completely abandoned the previous "region proposal + classification" detection mode and followed a completely different philosophy: a single neural network is applied to the entire input image, dividing the image into a grid of cells and simultaneously predicting the bounding boxes and class probabilities for each cell. Later, Joseph Redmon made a series of improvements to YOLO and proposed YOLOv2 [14], which further improved detection accuracy while maintaining a high detection speed. Although YOLOv2 is much faster than two-stage detectors, its localization accuracy still lags behind them, especially for small objects. Afterwards, the improved YOLOv3 [15] borrowed the residual network structure to build a deeper network with multi-scale detection, which improved the detection of small objects.
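The grid idea can be made concrete with a minimal sketch. It assumes a 7×7 grid and the convention from the original YOLO formulation that the cell containing an object's center is responsible for predicting it; the function names are hypothetical and no network is involved.

```python
def responsible_cell(cx, cy, img_w, img_h, grid=7):
    """Return (row, col) of the grid cell that contains the box center."""
    col = min(int(cx / img_w * grid), grid - 1)
    row = min(int(cy / img_h * grid), grid - 1)
    return row, col


def cell_offsets(cx, cy, img_w, img_h, grid=7):
    """Center position relative to the responsible cell, each in [0, 1)."""
    row, col = responsible_cell(cx, cy, img_w, img_h, grid)
    return cx / img_w * grid - col, cy / img_h * grid - row


if __name__ == "__main__":
    # Two small objects whose centers are only 20 px apart in a 448x448 image.
    print(responsible_cell(30, 30, 448, 448))
    print(responsible_cell(50, 40, 448, 448))
```

In this example both centers fall into the same 64×64 px cell. With a coarse grid, neighbouring small objects compete for the same cell, which is one plausible reason why grid-based one-stage detectors struggle with dense small objects.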
(b) SSD algorithm SSD [16] was proposed by Wei Liu et al. in 2015 and is the second single-stage detector based on deep learning. The main contribution of SSD is to run detection on feature maps of different scales and to adopt anchor boxes of different scales and aspect ratios, which significantly improved the detection accuracy of single-stage detectors, especially for some small objects. SSD has advantages in both detection speed and accuracy. The main difference between SSD and previous detectors is that SSD detects on feature maps of different scales, while previous detectors detect only on the last layer. Although SSD is simple and fast, its detection accuracy is somewhat inferior to that of two-stage detectors.
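As an illustration of anchor boxes with different scales and aspect ratios, the sketch below follows the linear scale rule described in the SSD paper, where the scale grows evenly from the finest to the coarsest feature map. The values 0.2 and 0.9, the three aspect ratios, and the function names are assumptions chosen for the example.

```python
import math


def ssd_scales(num_maps, s_min=0.2, s_max=0.9):
    """Box scale per feature map, linearly spaced from s_min to s_max
    (scales are fractions of the input image side)."""
    if num_maps == 1:
        return [s_min]
    step = (s_max - s_min) / (num_maps - 1)
    return [s_min + step * k for k in range(num_maps)]


def default_boxes(scale, aspect_ratios=(1.0, 2.0, 0.5)):
    """(width, height) of each default box; the area stays scale**2."""
    return [(scale * math.sqrt(ar), scale / math.sqrt(ar))
            for ar in aspect_ratios]


if __name__ == "__main__":
    for k, s in enumerate(ssd_scales(6)):
        shapes = [(round(w, 3), round(h, 3)) for w, h in default_boxes(s)]
        print(f"feature map {k}: scale={s:.2f} boxes={shapes}")
```

The smallest boxes are attached to the highest-resolution feature map, which is why detecting on several scales helps with small objects.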
(c) Retina-Net algorithm Lin et al. [17] investigated experimentally why single-stage detectors are less accurate and proposed Retina-Net in 2017. They argue that the main reason is the imbalance between positive and negative samples during training. To solve this problem, they modified the cross-entropy loss and designed a loss function named focal loss. This function reduces the weight of easy-to-classify samples so that the network focuses more on difficult-to-classify samples during training. It thereby addresses the extreme imbalance between positive and negative samples and prevents the detection loss from being dominated by a large number of negative samples. Focal loss allows a single-stage detector to achieve the accuracy of a two-stage detector while maintaining a high detection speed.
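The down-weighting of easy samples can be sketched with the binary form of focal loss, FL(p_t) = -α_t (1 - p_t)^γ log(p_t). The defaults γ = 2 and α = 0.25 and the helper names are assumptions for illustration; this is a scalar sketch, not a training-ready implementation.

```python
import math


def cross_entropy(p, y, eps=1e-12):
    """Binary cross entropy for predicted foreground probability p, label y."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(max(p_t, eps))


def focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-12):
    """Cross entropy scaled by alpha_t * (1 - p_t)**gamma, so that
    well-classified samples (p_t close to 1) contribute very little."""
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(max(p_t, eps))


if __name__ == "__main__":
    easy_negative = (0.05, 0)   # background correctly scored low
    hard_positive = (0.10, 1)   # object scored far too low
    for name, (p, y) in [("easy negative", easy_negative),
                         ("hard positive", hard_positive)]:
        print(name, "CE:", cross_entropy(p, y), "FL:", focal_loss(p, y))
```

Relative to cross entropy, the loss of the easy negative shrinks by several orders of magnitude while the hard positive keeps most of its weight, so a large number of easy background samples can no longer dominate the total loss.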
In summary, one-stage small object detection methods are fast but less accurate, whereas two-stage methods are more accurate but slow. The goal is for one-stage methods to reach the accuracy of two-stage methods without sacrificing their original speed.

Two-stage small object detection algorithm
(a) R-CNN In 2014, Girshick et al. proposed the R-CNN network structure. Firstly, about 2000 candidate regions are extracted with the selective search [12] method, and a convolutional neural network computes the feature map of each candidate region. The features are then classified with a support vector machine to distinguish background from objects. Since there are usually multiple overlapping candidate regions near an object, the non-maximum suppression [19] method is used to remove the overlapping candidates, and finally bounding box regression is used to correct the position of each candidate box. Because of its creative fusion of convolutional neural networks and object detection, R-CNN is known as the pioneer of modern object detection.
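The non-maximum suppression step mentioned above can be sketched as a greedy procedure: keep the highest-scoring box, discard remaining boxes that overlap it too much, and repeat. The IoU threshold of 0.5 and the box format (x1, y1, x2, y2) are assumptions made for this example.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap any already kept box too much.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


if __name__ == "__main__":
    boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
    scores = [0.9, 0.8, 0.7]
    print(nms(boxes, scores))  # [0, 2]: the second box duplicates the first
```

For small objects a shift of a few pixels already lowers the IoU considerably, so the choice of threshold affects them more than large objects.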
(b) SPPNet In 2015, He Kaiming et al. proposed the Spatial Pyramid Pooling Network [20]. Previous convolutional neural networks required a fixed input size; for example, the input size of AlexNet [21] is 224×224. The main contribution of SPPNet is the Spatial Pyramid Pooling (SPP) layer, which enables a convolutional neural network to generate a fixed-size representation regardless of the size of the image or region of interest, without rescaling. When SPPNet is used for object detection, features are extracted only once for the whole image, and the SPP layer then converts the feature map of each candidate region to a fixed size for training the detector, which avoids repeated computation of convolutional features. SPPNet is more than 20 times faster than R-CNN without sacrificing any detection accuracy.
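A minimal single-channel sketch of spatial pyramid pooling is given below. It assumes max pooling over 1×1, 2×2 and 4×4 grids of bins, so every input map, whatever its size, yields 1 + 4 + 16 = 21 values; the pyramid levels and the function name are illustrative choices rather than the exact configuration of SPPNet.

```python
def spatial_pyramid_pool(feature_map, levels=(1, 2, 4)):
    """Max-pool a 2D map (list of rows) into n x n bins for each level
    and concatenate the results into one fixed-length list."""
    h, w = len(feature_map), len(feature_map[0])
    pooled = []
    for n in levels:
        for i in range(n):
            r0 = (i * h) // n                # floor of bin start
            r1 = -(-((i + 1) * h) // n)      # ceiling of bin end
            for j in range(n):
                c0 = (j * w) // n
                c1 = -(-((j + 1) * w) // n)
                pooled.append(max(feature_map[r][c]
                                  for r in range(r0, r1)
                                  for c in range(c0, c1)))
    return pooled


if __name__ == "__main__":
    small_region = [[r * 7 + c for c in range(7)] for r in range(5)]    # 5x7
    large_region = [[r * 9 + c for c in range(9)] for r in range(13)]   # 13x9
    print(len(spatial_pyramid_pool(small_region)),
          len(spatial_pyramid_pool(large_region)))  # 21 21
```

Because the output length depends only on the pyramid levels, candidate regions of any size can be fed to the same fully connected layers.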
However, although SPPNet effectively improves the detection speed, it still has some shortcomings. Firstly, training is still multi-stage. Secondly, SPPNet only fine-tunes its fully connected layers and ignores the preceding convolutional layers.
(c) Fast RCNN In 2015, Girshick proposed Fast RCNN [22], which is a further improvement of R-CNN and SPPNet [20] [23]. It was designed to address the high time and memory consumption and the slow test speed of R-CNN. Fast RCNN makes it possible to train the detector and the bounding box regressor simultaneously in the same network configuration. Fast RCNN successfully integrates the advantages of R-CNN and SPPNet, but there is still room for improvement in how candidate regions are proposed. A question then naturally arises: "Can we use the CNN model to generate object proposals?"
(d) Faster RCNN Faster R-CNN [24] [25] was proposed shortly after Fast RCNN was released. Faster RCNN is the first near real-time object detector based on deep learning and the first truly end-to-end object detector. Its greatest improvement is the region proposal network, which makes generating candidate regions almost free of extra computation. From R-CNN to Faster RCNN, most independent modules of the object detection system (such as candidate region generation, feature extraction, and bounding box regression) have gradually been integrated into a unified end-to-end learning framework. Although Faster RCNN breaks through the speed bottleneck of Fast RCNN, there is still computational redundancy in the subsequent detection stage. Various improvements were proposed later, including RFCN [26] and Light head RCNN [27].
(e) Feature Pyramid Networks In 2017, Lin et al. proposed the feature pyramid network (FPN) [28] for small object detection based on Faster RCNN. Before the feature pyramid structure was introduced, most detectors detected only at the top level of the network. Although the high-level feature maps extracted by a convolutional neural network are conducive to category recognition, they are not good for object localization. Therefore, the feature pyramid network adopts a top-down architecture with lateral connections to fuse high-level semantic information with low-level location information. As a result, it shows great progress in detecting objects of various scales. FPN has now become a basic component of many of the latest detectors.
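The top-down pathway with lateral connections can be sketched on single-channel maps as follows. The sketch assumes each level is exactly twice the resolution of the next coarser one, uses nearest-neighbour upsampling and element-wise addition, and omits the 1×1 and 3×3 convolutions that an FPN applies around the merge; the function names are hypothetical.

```python
def upsample2x(fmap):
    """Nearest-neighbour upsampling of a 2D map by a factor of two."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def merge(top, lateral):
    """Add the upsampled coarser map to the finer lateral map."""
    up = upsample2x(top)
    return [[u + l for u, l in zip(up_row, lat_row)]
            for up_row, lat_row in zip(up, lateral)]


def top_down(backbone_maps):
    """backbone_maps: list ordered from finest (low-level) to coarsest
    (high-level). Returns fused maps in the same order."""
    fused = [backbone_maps[-1]]
    for lateral in reversed(backbone_maps[:-1]):
        fused.insert(0, merge(fused[0], lateral))
    return fused


if __name__ == "__main__":
    c3 = [[1] * 4 for _ in range(4)]    # fine map, weak semantics
    c4 = [[10] * 2 for _ in range(2)]
    c5 = [[100]]                        # coarse map, strong semantics
    p3, p4, p5 = top_down([c3, c4, c5])
    print(p3[0][0], p4[0][0], p5[0][0])  # 111 110 100
```

After fusion the finest map carries contributions from every coarser level, which is how high-resolution features used for small objects gain semantic information.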

Dataset of small object detection
At present, the object detection algorithms based on deep learning are data-driven. However, in the general object detection dataset, the number of large and medium objects is far higher than that of small objects. For the research of small objects, a specific dataset for small objects is needed. At present, the datasets for small objects include public datasets and datasets applied in various fields, among which several commonly used datasets and their parameters are shown in Table 1.

COCO [3] is a common object detection dataset, including natural pictures and common object pictures in life. The background is more complicated, the number of objects is larger, and the size of object is smaller. For small object detection tasks, the current standard for measuring the quality of a model is more inclined to use the detection results on the COCO dataset.
As a benchmark dataset for object detection, PASCAL VOC [32] has 20 categories and includes 11,530 images for training and evaluation. From 2005 to 2012, an image recognition competition was held every year. The main purpose of the competition was to recognize certain categories of objects in real scenes, which is a supervised learning problem. The training set is given in the form of labeled pictures.

Application datasets in various fields
Tiny Person [33] is a dataset proposed by the University of Chinese Academy of Sciences, which only contains the category of person. The training set contains 794 images and the test set contains 816 images.
Tsinghua-Tencent100K is a large-scale traffic sign dataset [34], which provides 100,000 images containing 30,000 traffic sign instances. The traffic signs in its images are small enough that it has become a classic dataset for small object detection of traffic signs. The images cover most lighting and weather conditions, and all of them are manually labeled. The signs fall into three categories: prohibitions, instructions, and warnings. Tsinghua-Tencent100K is the only public traffic sign dataset in China that is based on real scenes and offers high resolution and a large number of images, which fills the gap left by the lack of benchmark datasets in China.
Aerial remote sensing images have scale diversity. The shooting height ranges from several hundred meters to nearly 10,000 meters, so ground objects differ in size even when they are of the same type. Most of the objects in aerial remote sensing images are small objects that cover only dozens of pixels or even just a few. Therefore, most aerial remote sensing object detection datasets are small object detection datasets. The commonly used datasets are as follows.
DOTA [35] is a dataset for remote sensing image object detection, which contains 15 categories and a total of 2806 images, each no larger than 4000×4000 pixels.
The UCAS-AOD dataset [36] contains only two categories, vehicles and aircraft, with 7,482 aircraft samples and 7,114 vehicle samples.
NWPU VHR-10 [37] is an aerospace remote sensing object detection dataset annotated by Northwestern Polytechnical University, including 10 categories such as aircraft, ships, and vehicles. The dataset has a total of 800 images, including 650 images containing objects and 150 background images.
The RSOD-Dataset [38] contains four types of objects: aircraft, playground, overpass, and oil tank. There are 446 images for aircraft, 189 images for playground, 176 images for overpass, and 165 images for oil tanks.

Challenges of small object detection
Most object detection algorithms based on deep learning are aimed at medium and large objects of a certain size or proportion, and are difficult to apply to small object detection in complex backgrounds. Specifically, small object detection faces the following challenges:
(a) The low-level features lack semantic information. In existing object detection models, the low-level features of the backbone network are generally used to detect small objects, but these features lack semantic information, which makes small objects harder to detect.
(b) There are few datasets for algorithm research, and the object scale distribution is uneven. At present, the datasets widely used by mainstream object detection algorithms (PASCAL VOC, COCO) have few training samples for small objects, so small objects cannot be fully learned during model training. The COCO dataset contains a large number of small objects, but their distribution is very uneven: only 52.3% of the images contain small objects. This makes the trained model more suitable for medium and large objects, while its generalization to small objects is weak and its detection accuracy on them is low.
(c) Related algorithms lack generality and are difficult to transfer directly to other fields. The backbone networks of existing object detection models are all trained on classification datasets, but the scale distribution of objects in classification datasets differs from that in detection datasets.
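The kind of statistic quoted in challenge (b), the share of images that contain at least one small object, can be computed from box annotations as sketched below. The annotation format (image id, box width, box height) and the toy data are hypothetical, and "small" is taken to mean an area under 32×32 pixels.

```python
def small_object_image_share(annotations, num_images, max_area=32 * 32):
    """Fraction of images with at least one box smaller than max_area.

    annotations: iterable of (image_id, box_w, box_h) tuples.
    num_images: total number of images, including those without boxes.
    """
    images_with_small = {image_id
                         for image_id, box_w, box_h in annotations
                         if box_w * box_h < max_area}
    return len(images_with_small) / num_images


if __name__ == "__main__":
    toy_annotations = [
        ("img1", 20, 20),    # small
        ("img1", 200, 150),  # large
        ("img2", 300, 300),  # large
        ("img3", 10, 30),    # small
        ("img4", 64, 64),    # medium
    ]
    print(small_object_image_share(toy_annotations, num_images=4))  # 0.5
```

Running such a check on a training set before model design gives a quick indication of how skewed the scale distribution is.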

Research trends
At present, the core problem in research on small object detection based on deep learning is how to improve the feature representation of small objects so that it contains rich semantic information, which is also the key to improving detection performance. Small object detection has important application value in real life. In criminal investigation, small packages on a table, small pedestrians in the corner of surveillance videos, and small logos on clothes can all be clues for solving a case. In agriculture, identifying insect pests and diseases makes it possible to monitor crop health in real time and to treat problems early, which reduces the loss of crop yield and quality caused by diseases and pests and contributes to the development of precision and intelligent agriculture. Therefore, small object detection has important research significance.
Compared with the detection performance on large and medium objects, there is still a big gap in the performance of small object detection. In the future, better and more comprehensive multi-scale feature fusion strategies can be explored to further improve detection performance on small objects. Many problems remain to be solved. This section points out future research directions in response to the difficulties of small object detection.
(a) Make full use of the low-level layers of the network for multi-scale detection. The low-level features of a convolutional neural network have weak semantic information but high resolution, whereas high-level features have strong semantic information but low resolution, so low-level features are very important for detecting small objects. Through feature fusion, all scales of the convolutional neural network can carry strong semantic information, and a more complete feature pyramid network can be established [39].
(b) Increase the amount of training data for small objects. Data augmentation is a key component of training deep learning models, and an optimized data augmentation strategy can effectively improve the detection of small objects [40] [41]. At present, research on small objects mainly relies on general object datasets, so the network cannot learn the characteristics of small objects in depth. Subsequent work can establish dedicated datasets for small object detection, which would allow the characteristics of small objects to be learned fully and would be more conducive to detecting them.
(c) Improve detection in complex environments. Object detection in complex environments is one of the core issues in the field of computer vision. Complex backgrounds involve object ambiguity, overlapping objects, object occlusion and so on, and detecting small objects in such backgrounds is even more difficult. Zheng et al. [42] proposed a modified single shot multibox detector (SSD) algorithm for object occlusion, which was tested on the VOC and COCO datasets. The experimental results show that the improved method achieves a much better detection effect. Subsequent work can continue in this direction to ensure the accuracy and robustness of detection in complex environments.