A Semi-Dense Feature-based VSLAM System

SLAM (1) (Simultaneous Localization and Mapping) technology is the basis of a mobile robot, also one of the most important conditions for robot intelligence. VSLAM (Visual Simultaneous Localization and Mapping) technology uses visual sensors to realize the function that a moving robot builds incremental environment maps and simultaneously realizes the self-positioning. Currently, feature-based SLAM approaches are to build sparse maps, but we can ’ t realize robot path navigation and obstacle avoidance by using sparse maps. Therefore, this paper proposed a semi-dense VSLAM with monocular cameras. Our semi-dense mapping was optimised by local bundle adjustment and loop closing, and the depth measurement of the images used the ways of polar search and block matching. We evaluated our system in our laboratory, and it performed well and steadily.


Introduction
Simultaneous positioning and map construction algorithms are used to track the camera's current position while creating an environment map based on sensor datas.VSLAM is a solution that uses visual sensor to perceive the environment around the camera.Real-time, high-accuracy and robust SLAM algorithm are all fundamental condition for a robot to realize the components of navigation, path planning and autonomous navigation.Currently, SLAM technology is mostly applied in the fields of self-driving cars, sweeping robots and unmanned aircraft.Compared with traditional laser sensor, high-precision IMU and GPS sensors, the camera costs less and consumes less power.With the improvement of algorithm and processor performance, the accuracy of VSLAM is also improving.When we come to the areas where GPS signals are not available, such as enclosed rooms and remote mountains, VSLAM is irreplaceable.
At present, the main research direction of VSLAM is feature-based sparse SLAM and direct dense VSLAM.ORB-SLAM (9) and LSD-SLAM are two typical methods.ORB-SLAM algorithm can complete real-time tracking and mapping regardless of indoor or outdoor environment, and execute closed-loop detection and relocation very well.LSD-SLAM is a direct monocular SLAM algorithm that can construct large-scale environment maps, estimate camera pose accurately based on direct image registration, and reconstruct semi-dense depth image by key frame pose.However, ORB-SLAM is more robust in scenarios where illumination changes obviously.
The sparse maps constructed by feature-based (6) VSLAM system perform badly to realize the robot navigation and obstacle avoidance.Based on the feature-based monocular ORB-SLAM system, this paper proposes a feature-based semi-dense map reconstruction technique.

System master plan
This system is composed of the following four threads: tracking (4) , local mapping, loop closing, semi-dense mapping, the system structure diagram is shown in Fig. 1.

Tracking
The main idea is to find more congruent relationship between the current frame and the local map, in order to optimize the pose of current frame.The following three motion models in the Tracking are all used to get the pose of the current frame.

(a) Tracking with Motion Model
Assuming that the object was moving at a uniform speed, the camera pose of the current frame could be estimated by the pose and velocity of the last frame.The speed Fig. 1.System structure diagram of the last frame can be calculated by several previous frames.This model is suitable for the cases that camera is moving along with the straight line at a uniform speed.

(b) Tracking with Reference key frame
If the motion model has failed, you can try to match the last key frame, because the current frame is not far from the last key frame.We can use BoW(Bag of Words) to speed up the match process.Firstly, we calculate the BoW of the current frame and set the initial position as the pose of the last frame; Secondly, we need to search for feature matching according to pose and BoW dictionary; Finally, the matching feature is used to optimize camera pose.

(c) Relocalization
If the current frame also fails to match the nearest neighbor key frame, this means that the current frame has been lost.At this point, you have to match all the key frames to find the right place.Firstly, calculate the BoW vector of the current frame.Secondly, several key frames are choosed as alternatives by BoW dictionary.Thirdly, key frames with enough feature points are selected.Finally, the camera pose can be determined by feature point matching.We tend to select the key frame that has enough interior feature points.

Local Mapping
The current key frame is inserted into the map, but some key frames are found redundant after judgement, we have to delete it.Due to triangulation error, originally one point may be predicted into two points by mistake.The two points need to be fused into one.
The local map is associated with the previous tracking and subsequent loop closing, and the key frame is generated by the tracking thread, while the condition of local bundle adjustment for the local map is that there is no loop and the key frame queue is empty.The elements contained in the map are useful for loading and saving maps.

Loop Closing
Cyclic detection mainly consists of two steps: 1. Cyclic information detection and confirmation; 2. Modify and optimize the pose diagram.Regardless of monocular, binocular or RGBD camera, all of them exist the problem of measurement drift.As the path continues extending, the error of the previous frames will be transmitted to the following frames, resulting in a very large error of the pose of the last frame in the world coordinate system.In addition to using optimization methods to adjust the local and global pose, loop closing can also a good way to optimize the pose.
In other words, when the camera is moving, it returns to the same pose as before, the map forms a closed loop.The camera can automatically identify the previous location, and then transmit this information back to the entire map to get a fairly accurate map information, this is what loop closing works.Therefore, loop closing is a very useful method for large-scale (3) map reconstruction. (7)

Semi-dense Mapping
Suppose there is a sequence of video, and we get the trajectory of each frame.Now we take the first image as the reference frame and calculate the depth of each pixel in the reference frame.First, we should find out how we completed the process in the feature-based section: 1. First, we extract feature points from images and calculate the matching between feature points according to descriptors.In other words, by means of feature points, we track a point in space and know its pose in different images.
2. Then, since we cannot determine the pose of feature points with only one image, we must estimate its depth through observation from different perspectives, we will use the triangulation here.
In the semi-dense depth map estimation, the difference is that we cannot compute descriptors with every pixel as an feature point.So matching becomes an important part in the semi-dense (8) depth estimation problem.Now we requires the use of polar search and block matching techniques.When we know the position of pixels in each graph, we can use triangulation, like in the feature-based SLAM, to determine its depth.However the difference is that we're going to use triangulation for many times to make the depth estimation converge, not just one time.We hope that the depth estimation can gradually converge from an uncertain number to a stable number as the measurement times increasing, this is the depth filter technique (2) .

Polar search and block matching
As is shown in Fig. 2.We can see the geometry of same point from different angles.The camera on the left detects a certain pixel p 1 .Since this is a monocular camera, we cannot measure its depth, so its depth could be in some fixed region, instead the number is between the minimum and infinity.Therefore, the corresponding spatial points of the pixel are distributed on a line segment.From another perspective, the projection of this line segment also forms a line on the image plane, which is known as a polar line.When we know the motion between the two cameras, this polar line can also be determined.
In the feature-based method, we find the position of p 2 Fig. 2. Polar search diagram through feature point matching.However, we do not have descriptors, so we can only search the polar line for points that are similar to p 1 .Specifically, we might go from one end of the polar line to the other of the second image, one by one , comparing similarity of each pixel to p 1 .From the view of direct pixels comparison, this approach is similar to the direct method.
In the direct method, we also know that comparing the brightness value of a single pixel is not very stable and reliable.Since the brightness of individual pixels is indistinguishable, we can try to compare pixel blocks.We take a small piece of size w × w around p 1 , and then take many small pieces of the same size on the polar line for comparison, to some extent we improved the discrimination, this is block matching.During this process, the comparison only makes sense when we assume that the entire chunk grayscale remained invariability across different images.Now we take the little pieces around p 1 , and we take a bunch of little pieces on the polar line.We might call the small pieces around p 1 as A ∈ R w × w, and call the small pieces on the polar line as B i , i = 1 • • • n.To calculate the difference between small blocks, we can take the NCC method: NCC (Normalized Cross Correlation): This approach is a bit more complex, calculating the correlation between two small pieces: The correlation that is close to zero , means two images are not similar, while a correlation that is close to one means very similar.There is a contradiction between precision and efficiency in this calculation method.Methods with good accuracy (6) often requires complex calculations, while simple and fast algorithms are often ineffective.This requires a balance in practical engineering.

Gauss depth filter
The pixel depth estimation can also be converted to a state estimation problem.There are two solutions: filter and nonlinear optimization.Although the nonlinear optimization affection is better, considering that the front end has occupied a lot of computational load in the case of SLAM, which has a strong real-time (5) requirement, the filtering method with less computational load is more suitable for the map construction, so we use Gauss Distribution here.
Set the depth of a pixel to obey: Every time a new data comes in, we will calculate its depth.Similarly, suppose that this observation is also a Gauss Distribution: How to use the observed information to update the original distribution of d, is exactly an information fusion problem.According to appendix A, we understand that the product of two Gauss Distributions is still a Gauss Distribution.Assuming that the distribution of fused d is N(µ f use,σ 2 f u s e ), then according to the product of Gauss Distribution, we can get:

Uncertainty analysis
Considering a certain polar line search, we found the corresponding point p 2 of p 1 , and observed the depth value of p 1 , which is shown in fig. 3. We think that the 3D point corresponding to p 1 is P. Therefore, we can call O 1 P as p, O 1 O 2 as the translation t of the camera, and O 2 P as a.And I'll call the bottom two angles of this triangle α , β.Now, considering that there is a pixel size error on the polar line l 2 that makes the β Angle become β', and p becomes p', and call the upper angle γ.So we need to figure out how much the error of this one pixel cause from p' to p.
Disturbance of p 2 by one pixel will produce a delta beta change, since the camera focal length is f: Thus, we determine the depth uncertainty caused by the uncertainty of a single pixel.If it is considered that the block matching of polar line search has only one pixel error, then it can be set as follows: In practical engineering, when the uncertainty is less than a certain threshold, the depth data can be considered to have converged.In summary, we present a complete process for estimating dense depth: (a) Assume that the depth of all pixels satisfies some initial Gauss Distribution.

(b)
When a new data is generated , the pose of projection point is determined through polar line search and block matching.These 4 steps constitute a set of feasible depth estimation methods.

Experiment environment
We test the project in our laboratory, and used the intelligence sweeper to run the SLAM system, the experiment environment is shown in the Table 1.

Results of the small-scale environment
The picture is a view of my laboratory, and we can reconstruct the semi-dense map of the pircture, the point cloud in the picture is more dense , and the feature points is much more too, which is shown in the Fig. 4.

Results of big-scale environment
We walked for a round around the desk with the camera, found that the approach of this thesis is very robust and finish the task very well.And when we go back to the initial position, this approach can detect a loop and optimize the whole map.As is shown in Fig. 5.

Conclusions
The proposed semi-dense monocular VSLAM based on ORB features is more robust and efficient than the direct VSLAM method.Due to the existence of loop closing, the target can be quickly recovered after losing very easily.Semidense VSLAM is important for semantic segmentation, object recognition and 3D environment reconstruction (10) .So the semi-dense VSLAM will make a great contribution to future mobile robot development.This technology makes it become possible for mobile robots to possess the components of navigation, path planning and autonomous navigation.

Fig. 4 .
Fig. 4. Semi-dense VSLAM of a view in the lab

Table 1 .
Experiment environment