Search-Free Gridding and Temporal Local Matching Based Observation for High Frame Rate and Ultra-Low Delay SLAM System

Tracking of camera pose in simultaneous localization and mapping (SLAM) using a particle filter has high potential for parallelism and is promising for high frame rate and ultra-low delay applications. However, the heavy computation and hardware-unfriendly algorithms in projection and matching, which account for 88.6% of the particle filter's processing time, are obstacles to acceleration. This paper proposes search-free gridding based redundant projection removal, non-iteration linear function based division, and temporal 2D local matching for particle filter based SLAM on FPGA. Search-free gridding based redundant projection removal removes map points in dense areas and reduces the number of projections. Non-iteration linear function based division exploits the particularity of the projection problem in SLAM and achieves a low delay partial division in projection. Temporal 2D local matching uses the characteristics of high frame rate video and takes hardware implementation into account to realize low delay local matching. The algorithm is tested on 762 fps VGA (640×480) indoor sequences. Software results show that the algorithm has 1.20° error in rotation and 1.90 cm error in translation compared with ORB-SLAM. Hardware results show that matching and projection have 0.457 ms latency when processing 1000 feature points.


Introduction
Visual simultaneous localization and mapping (visual SLAM) estimates camera poses and reconstructs a sparse map of the environment at the same time. The current mainstream framework of SLAM consists of tracking and mapping (1). In mapping, an environment map is created from selected frames, which are known as keyframes. In tracking, camera poses are estimated using images and the positions of map points. In real-time applications such as drone flight control and autopilot, high frame rate and ultra-low delay estimation of camera poses is needed. A high frame rate and ultra-low delay matching system on FPGA already exists (2), but no high frame rate and ultra-low delay SLAM system has been proposed.
Mainstream tracking algorithms in SLAM are bundle adjustment (3) based methods and filter based methods (4,5). In ORB-SLAM (6), bundle adjustment provides accurate estimation, but it is an optimization based method that needs several iterations. Filter based methods include extended Kalman filter (EKF) (7) and particle filter (4,8) based methods. The particle filter has higher potential for parallelism and is more suitable for high frame rate and ultra-low delay applications. However, particle filter based SLAM uses a Kalman filter to estimate the positions of landmarks, and it has larger error in large scale sequences than bundle adjustment. So, in this paper, the mapping part uses bundle adjustment to realize local and global map point optimization, and the tracking part employs a particle filter. For the features, we use the 256-bit ORB feature (9) because it is a binary feature that is simple enough to be realized on hardware. It also has a shorter extraction time than other features such as SIFT (10) and SURF (11). To achieve more parallelism and ultra-low delay, FPGA is chosen instead of GPU (12).
According to software simulation, projection and matching are the most time-consuming parts of the particle filter, taking 76.6% and 12% of the processing time, respectively. In projection, different particles have different poses. Since a reprojection error based likelihood is used in the particle filter, the 2D projections of all map points in each particle's camera plane need to be calculated. This leads to a large amount of computation. The unavoidable division in projection also leads to high latency in hardware implementation. Complex patterns in an image have more feature points crowded together, and after triangulation in mapping, more corresponding map points are crowded together in 3D space. Conventional work (6) uses a gridding based method to thin out dense feature points, but it needs a search in each level of grids. It divides the image into several layers of grids, a structure also known as an octree. In each layer it searches for regions that still have dense features and then makes smaller grids in those regions. Finally, it keeps only one feature point in each grid to achieve a balanced distribution. This algorithm is suitable for software. If it is implemented on hardware, however, it leads to large latency in the worst case because each feature needs to be processed in each layer. Search-free gridding based redundancy removal is proposed as a hardware-friendly algorithm that also reduces map points in complex pattern areas. It uses a pipelined design that processes every projection in the last frame in the same way, which reduces time consumption compared with conventional work. For division, conventional works usually consider a fully functioning division. It is an iteration based method that calculates the result bit by bit and causes high latency. In the projection problem of SLAM, however, a fully functioning division is not needed. So, a partially functioning non-iteration division is proposed to decrease latency.
The matching part matches ORB features and map points according to their descriptors. Some conventional work uses a tracker (13) to track corner features, but it is only a visual odometry without mapping. Matching is the only part that uses image information, so it needs to be considered in the hardware dataflow. Brute force matching is simple and has been realized in a high frame rate and ultra-low delay system. However, it leads to large resource cost or high latency when a feature point is compared with hundreds of map points, as in the SLAM problem. To ensure accuracy and avoid unnecessary descriptor comparisons, conventional work (5,6) uses camera movement and the positions of map points to constrain the map points that a feature point is compared with. However, this method still checks each feature point against every map point. In high frame rate video, the difference between two frames is small even if the camera moves very fast. Considering this characteristic, as well as hardware implementation, temporal 2D local matching is proposed to reduce redundant matching while keeping accuracy.
This paper is organized as follows: Section 2 introduces the particle filter based SLAM framework and the three proposals. Section 3 shows the hardware dataflow and experimental results. Section 4 is the conclusion.

Algorithm framework
The particle filter based SLAM framework is shown in Fig. 1. The inputs are ORB feature points and map points. Map points are provided by the mapping part of ORB-SLAM. The initialization and loop closing parts are also the same as in ORB-SLAM (6). The output of the particle filter is the camera pose of each frame. The particle filter consists of prediction, observation and estimation. Each particle represents a different camera pose, which is an element of the Lie group of 3-dimensional Euclidean transformations, SE(3). Scale is not considered here. So, each particle has 3 degrees of freedom in rotation and 3 in translation in the Lie algebra se(3). Rotation is represented as Euler angles, ignoring gimbal lock. The state vector of a particle represents the camera pose and velocity in frame k. In prediction, we assume that each particle has a constant velocity with zero-mean Gaussian noise in each element of se(3).
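The prediction step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's exact formulation: the element ordering of the 6-vector, the choice to add the noise to the velocity, and the names `predict`, `sigma_trans` and `sigma_rot` are all hypothetical.

```python
import random

def predict(pose, velocity, dt, sigma_trans, sigma_rot, rng=random):
    """Constant-velocity prediction for one particle.

    pose, velocity: 6-element lists, assumed here to be ordered as
    [tx, ty, tz, rx, ry, rz] (translation first, then rotation).
    dt: time of one frame, e.g. 1/762 s for a 762 fps sequence.
    """
    # Assumption: separate standard deviations for translation and rotation.
    sigmas = [sigma_trans] * 3 + [sigma_rot] * 3
    # Assumption: the zero-mean Gaussian noise perturbs the velocity.
    new_velocity = [v + rng.gauss(0.0, s) for v, s in zip(velocity, sigmas)]
    # The pose advances by velocity * dt over one frame.
    new_pose = [p + v * dt for p, v in zip(pose, new_velocity)]
    return new_pose, new_velocity
```

Because every particle is updated independently, this step maps naturally onto parallel hardware units.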

Fig. 1. Particle filter based SLAM framework. Input: ORB feature points and map points. Output: camera poses.
In the prediction model, Δt is the time of one frame and I_6 represents a 6×6 identity matrix. The variance of the Gaussian distribution is different for translation and rotation. Then, in observation, matching between 2D feature points and 3D map points is done first. Second, each map point is projected with each camera pose and the reprojection error of each particle at moment k is calculated. The reprojection error is the difference between a feature point and the projection of its matched map point in the particle's camera plane, as shown in equation (7), where the superscript denotes the particle index. Equation (8) is the observation model of one map point in one particle pose, which is one part of the observation in the particle filter. It is also known as the camera projection model. The reciprocal of the sum of the reprojection errors of each particle is the likelihood. To emphasize poses with smaller reprojection error and discard those with higher reprojection error, we take the square of the reciprocal as the likelihood, as in equation (9): the smaller the reprojection error of a particle, the higher its likelihood. For the estimation, we first do resampling according to the likelihood. Then the weighted average pose in se(3) is calculated and taken as the output.
In the initialization of SLAM, we use an automatic initialization (6) which considers a homography matrix for planar scenes and a fundamental matrix for non-planar scenes. After the initialization of SLAM, the particle filter is initialized from the initial camera pose. Initial Gaussian noise is added to the initial pose, and particles with different poses and velocities are generated.
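The likelihood and estimation steps can be sketched as below. This is a hedged illustration: the small constant `eps`, the multinomial resampling scheme, and the element-wise averaging of 6-vectors are assumptions, since the text only states that the likelihood is the squared reciprocal of the summed reprojection error and that resampling is followed by a weighted average.

```python
import random

def likelihoods(total_errors):
    """Squared reciprocal of each particle's summed reprojection error."""
    eps = 1e-12  # assumption: avoids division by zero for a perfect match
    return [(1.0 / (e + eps)) ** 2 for e in total_errors]

def normalize(weights):
    """Scale the weights so that they sum to one."""
    total = sum(weights)
    return [w / total for w in weights]

def resample(particles, weights, rng=random):
    """Multinomial resampling (assumed scheme): draw particles with
    probability proportional to their likelihood."""
    return rng.choices(particles, weights=weights, k=len(particles))

def weighted_mean_pose(particles, weights):
    """Element-wise weighted average of 6-vector poses."""
    w = normalize(weights)
    return [sum(wi * p[d] for wi, p in zip(w, particles)) for d in range(6)]
```

Squaring the reciprocal widens the gap between good and bad particles: a particle with half the error receives four times the weight instead of two.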

P1: Search-Free Gridding based Redundant Projection Removal
When features are extracted from images without any constraint, more features are found in areas with complex patterns. Thus, the corresponding map points also gather in such areas. This method uses 2D projections in the last frame to judge whether map points in 3D space are gathering together, and removes some map points in dense areas to achieve a balanced distribution. The last frame's projection is used because it is simpler to judge in 2D space than in 3D space, and the difference between two adjacent frames in high frame rate video is very small. In the proposed method, every projection goes through the same three steps. Projections are first assigned to grids, for example 32×24 grids. This 32×24 grid is called the top layer. As shown in Fig. 2, each grid in the top layer may contain several projections. We keep only the one projection whose corresponding map point is seen by the largest number of keyframes. If two projections are observed by the same number of keyframes, we choose the one processed first. Map points seen by more keyframes have existed longer than other map points and are more stable after bundle adjustment in mapping.
The bottom layer has larger grids than the top layer. In case projections are still crowded together, we constrain the maximum number of projections in each grid of the bottom layer. After that, map points are well distributed in 2D space. The map points corresponding to the remaining projections are considered necessary and are processed in the projection step.
In hardware implementation, one projection is input in each clock cycle. All projections are processed in a pipeline. Gridding is based on a binary search on the projection's position. The result of gridding is stored in a RAM in which each element represents one grid. Thus, when the top layer has 32×24 grids and the input has N projections, 5 clock cycles are needed for the binary search and one clock cycle is needed in each layer to judge whether the projection should be kept. Overall, this design has a latency of only N + 6 clock cycles.
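A software sketch of the two-layer redundancy removal is given below. It is an illustration only: the bottom-layer grid size `bottom`, the cap `max_per_bottom`, and the rule for which projections survive the cap (processing order) are hypothetical, and the grid index is found by division here rather than by the binary search used in hardware.

```python
def remove_redundant(projections, width=640, height=480,
                     top=(32, 24), bottom=(8, 6), max_per_bottom=8):
    """Return indices of the map points to keep.

    projections: list of (u, v, n_keyframes), the last-frame 2D projection
    of each map point and the number of keyframes that observe it.
    """
    top_w, top_h = width / top[0], height / top[1]
    best = {}  # top-layer cell -> index of the projection kept in that cell
    for i, (u, v, n_kf) in enumerate(projections):
        if not (0 <= u < width and 0 <= v < height):
            continue  # projection falls outside the image
        cell = (int(u // top_w), int(v // top_h))
        j = best.get(cell)
        # Strict ">" keeps the first processed projection on ties.
        if j is None or n_kf > projections[j][2]:
            best[cell] = i

    # Bottom layer: larger cells with a cap on projections per cell.
    bot_w, bot_h = width / bottom[0], height / bottom[1]
    counts = {}
    kept = []
    for i in sorted(best.values()):
        u, v, _ = projections[i]
        cell = (int(u // bot_w), int(v // bot_h))
        if counts.get(cell, 0) < max_per_bottom:
            counts[cell] = counts.get(cell, 0) + 1
            kept.append(i)
    return kept
```

No step depends on how dense a region is, so every projection costs the same amount of work. That property is what allows a fixed-depth pipeline.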

P2: Non-iteration Linear Function based Division
In projection, the 2D projection of each map point in each particle's camera plane needs to be calculated. In the camera projection model, a division by the map point's depth is first needed when transforming from camera coordinates to film (image plane) coordinates. The camera projection model is shown in equation (10). In SLAM, a map point that is too close to the camera means that an object blocks the camera. This situation is almost impossible in normal sequences. It is also almost impossible that a map point is too far away from the camera, because triangulating a distant map point has large error. Even if a map point that is too close to or too far away from the camera exists, its position has not converged and has large error. In short, only map points within an appropriate range of the camera are expected to have small error.
Considering this fact, the map point selection in Fig. 3 is proposed. In this method, a piecewise linear function replaces the reciprocal calculation in division. The piecewise function has 16 intervals, each with its own pair of coefficients. We first use binary search to find the interval, then calculate the approximation of the reciprocal of z. After the reciprocal is obtained, the multiplications with x and y are done. The result has little error compared with the true value but has shorter latency than common iteration based division.
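The piecewise linear reciprocal can be sketched as follows. The valid depth range, the equal-width intervals and the chord-through-endpoints fit are assumptions made for illustration. The text only specifies 16 intervals, a binary search for the interval, and a linear approximation of 1/z followed by multiplications with x and y.

```python
from bisect import bisect_right

def build_table(z_min, z_max, n=16):
    """For each of n equal-width intervals (assumed layout), store the
    slope and intercept of the chord of 1/z through the endpoints."""
    step = (z_max - z_min) / n
    bounds = [z_min + i * step for i in range(n + 1)]
    coeffs = []
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        a = (1.0 / hi - 1.0 / lo) / (hi - lo)  # slope of the chord
        b = 1.0 / lo - a * lo                  # intercept of the chord
        coeffs.append((a, b))
    return bounds, coeffs

def approx_reciprocal(z, bounds, coeffs):
    """Approximate 1/z, or return None if z is outside the valid range."""
    if not (bounds[0] <= z <= bounds[-1]):
        return None  # map point too close or too far: rejected
    # Binary search for the interval (4 comparisons for 16 intervals).
    i = min(bisect_right(bounds, z) - 1, len(coeffs) - 1)
    a, b = coeffs[i]
    return a * z + b

def project_normalized(x, y, z, bounds, coeffs):
    """Divide x and y by z using the approximate reciprocal."""
    r = approx_reciprocal(z, bounds, coeffs)
    if r is None:
        return None
    return x * r, y * r
```

With a hypothetical range of 1.0 to 5.0, the worst relative error of this chord fit is about 1.25% and occurs in the interval nearest the camera. Narrower intervals at small depths would reduce it.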

P3: Temporal 2D Local Matching
In the SLAM problem, matching between feature points and map points is needed. The most important characteristic of high frame rate video is that the difference between two frames is small. The movement of a feature point between two frames is less than 10 pixels. As shown in Fig. 4, this method first calculates the 2D projections of map points in the last frame's camera plane. The map points used here are only the necessary map points from Proposal 1.
When matching feature points and map points, we compare the Hamming distance of descriptors only between a feature point and the map points whose projections in the last frame lie in a local search range. If there are several projections in the search range, the one with the smallest Hamming distance is selected, and this distance must be smaller than a threshold. Matching is one-to-one: one map point is matched with only one feature point. Considering the dataflow, only the first matched feature point is stored.
For storage, as shown in Fig. 5, we use a RAM whose address is directly determined by the position of the projection. Each element in the RAM represents a grid of 16×16 pixels. It stores the address of the oldest map point whose projection in the last frame is in this grid. After Proposal 1 assigns projections in the last frame to grids and removes dense map points, the addresses of the remaining map points are written to the RAM in this proposal. Only one map point is stored per grid here as well, and the selection method is the same. After this preparation, when a feature point is input from the feature extraction part, we first calculate its position in the RAM that stores the temporal projections. Then we compare its Hamming distance with the map points whose addresses are in this RAM element and the surrounding eight elements. So, one feature point is compared with at most nine map points. This reduces complexity and raises matching accuracy. This storage method is sparse: it has many empty elements but provides direct and fast access.
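The grid lookup and nine-cell comparison can be sketched as below. This is a simplified software model under stated assumptions: a dictionary stands in for the RAM, descriptors are Python integers holding 256 bits, the threshold value is hypothetical, and when two projections fall into one cell the first stored one is kept (the paper selects by keyframe count instead).

```python
CELL = 16                                   # grid cell size in pixels
GRID_W, GRID_H = 640 // CELL, 480 // CELL   # 40 x 30 cells for VGA

def hamming(a, b):
    """Hamming distance between two binary descriptors stored as ints."""
    return bin(a ^ b).count("1")

def build_grid(kept_projections):
    """kept_projections: list of (u, v, map_point_index) from the last
    frame, assumed to be already thinned by redundancy removal."""
    grid = {}
    for u, v, idx in kept_projections:
        cell = (int(u) // CELL, int(v) // CELL)
        if 0 <= cell[0] < GRID_W and 0 <= cell[1] < GRID_H:
            grid.setdefault(cell, idx)  # simplification: first stored wins
    return grid

def match_feature(u, v, desc, grid, map_descs, taken, threshold=50):
    """Return the matched map point index, or None if there is no match.

    taken: set of map point indices already matched. It enforces the
    one-to-one rule by letting only the first matching feature win.
    threshold: hypothetical Hamming distance limit.
    """
    cx, cy = int(u) // CELL, int(v) // CELL
    best_idx, best_dist = None, threshold
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            idx = grid.get((cx + dx, cy + dy))
            if idx is None or idx in taken:
                continue
            d = hamming(desc, map_descs[idx])
            if d < best_dist:
                best_idx, best_dist = idx, d
    if best_idx is not None:
        taken.add(best_idx)
    return best_idx
```

Since a feature moves less than 10 pixels between frames, a 16-pixel cell plus its eight neighbours always covers the true correspondence, while the comparison count stays bounded at nine.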

Experiment environment
The algorithm is tested on a computer with an Intel Core i7 CPU and 8 GB memory using Visual Studio 2013. Third party libraries are OpenCV 2.4.12, g2o, Eigen 3.2.10 and DBoW2. For hardware implementation, we used Vivado High-Level Synthesis 2017.2 and ISE 14.7. The FPGA device is a Xilinx Kintex-7 xc7k480tffv1156-2.
We use 9 high frame rate test sequences to simulate the high frame rate hardware implementation. The resolution is 640×480 and the frame rate is 762 fps. Each sequence has 2000 frames. Four examples are shown in Fig. 6. Table 1 gives explanations of the sequences. Rotation here is the angle obtained from the rotation matrix using the exponential map. Sequence names correspond to the direction of camera movement. Circle means that the camera moves along a circle. Rand 1 and Rand 2 mean random movement: Rand 1 is a zigzag movement and Rand 2 is a helix movement. These sequences all include strong rotations and fast translation.

Accuracy evaluation
Accuracy of the tracking algorithm is evaluated in software against ORB-SLAM (6). Because this paper targets hardware implementation, fixed-point numbers with three integer bits and ten fractional bits are used. The number of particles is 512. In each image, we extract 1000 feature points. Since we recorded the sequences ourselves, we measured the distance of movement and therefore know the scale in monocular SLAM. We first transform our result into the coordinate system of the ORB-SLAM result, so that the two trajectories have the same initialization. Then the accuracy is evaluated by the root mean square error (RMSE) of rotation and translation separately, computed from the absolute differences of rotation and translation in each frame. These differences are defined in equations (15) and (16). The rotation difference is taken between the rotation part of the camera pose from the particle filter at time k and that from ORB-SLAM. In equation (16), the translation difference is taken between the translation (x, y, z)^T of the camera pose from the particle filter at time k and that from ORB-SLAM. Because of the randomness of SLAM, each sequence is tested five times and we take the median value as the result. Compared with the basic particle filter framework without the proposals, the total accuracy changes little after adding the three proposals. The first proposal removes some map points but makes the distribution of map points more balanced. Rotation error decreases by 0.04 degree compared with the basic framework, but translation error increases by 0.05 cm. Although each sequence is tested 5 times and the median error is taken as the result, some randomness remains, so the proposed methods have less error than the basic framework in some sequences and larger error in others. Among the nine sequences, Z1 is the slowest one, so it has the smallest error. Rand 1 is the fastest one and has larger error than the other sequences, both in the conventional method and after adding Proposals 1 and 2.

Fig. 6. Examples of test sequences: X1, Y1, Z1 and Rand 1.
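The evaluation metric can be sketched as follows. The exact forms of equations (15) and (16) are not legible in this copy, so the sketch assumes the usual choices: the rotation difference is the angle of the relative rotation between the two rotation matrices, and the translation difference is the Euclidean distance. All function names are hypothetical.

```python
import math

def rotation_angle_deg(R_a, R_b):
    """Angle (degrees) of the relative rotation R_a^T R_b, from its trace.

    R_a, R_b: 3x3 rotation matrices as nested lists.
    """
    # trace(R_a^T R_b) equals the sum of element-wise products.
    trace = sum(R_a[k][i] * R_b[k][i] for i in range(3) for k in range(3))
    # Clamp to [-1, 1] to guard against rounding error before acos.
    cos_theta = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_theta))

def translation_distance(t_a, t_b):
    """Euclidean distance between two translations (x, y, z)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(t_a, t_b)))

def rmse(errors):
    """Root mean square of per-frame absolute differences."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))
```

Per-frame differences between the particle filter trajectory and the ORB-SLAM trajectory would be collected with the first two functions and then passed to `rmse`, once for rotation and once for translation.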

Hardware performance
The three proposals are implemented in hardware and compared with conventional works. Fig. 7 shows the data flow of the hardware implementation.
The temporal inputs are the frame k-1 projections, the map points and the particles' camera poses. Temporal inputs are stored in RAMs. The update of the temporal inputs is not considered. ORB features are input as a stream. For each frame, redundancy removal is done first as a preparation. This part takes only 0.01 ms and is not a problem in a practical system. After this preparation, when an ORB feature is input into observation, it is first matched with map points. Then the matched map point and the position of the feature are input into the parallel projection to calculate the reprojection error. To realize parallelism, the particles' camera poses are also stored in several RAMs so that they can be provided simultaneously. Hardware performance with a parallelism of 32 is shown in Table 4. Compared with the software result, the hardware implementation achieves dramatic acceleration. Proposal 1 accelerates the redundancy removal part, Proposal 2 accelerates the projection part, and Proposal 3 accelerates the matching part. The parallelism of 32 in projection is chosen to balance performance and resource utilization and is not fixed. Table 5 and Table 6 show the performance and resource utilization for different degrees of parallelism. Because only the projection part is parallelized, doubling the degree of parallelism does not halve the processing time. Maximum frequency is tested using Xilinx ISE 14.7.
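Why doubling the parallelism does not halve the processing time can be seen with a simple timing model. The sketch below is purely illustrative: the function name and the numbers in the usage note are hypothetical and are not measurements from this paper.

```python
def frame_time(t_serial, t_parallel_work, p):
    """Processing time when only part of the work is parallelized.

    t_serial: time of the parts that are not parallelized.
    t_parallel_work: time the projection part would take on one unit.
    p: number of parallel projection units.
    """
    return t_serial + t_parallel_work / p
```

For example, with a hypothetical serial part of 0.1 and a parallel workload of 3.2 (arbitrary time units), 32 units give 0.2 and 64 units give 0.15. The time drops by a quarter rather than by half, because the serial part is unchanged.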

Conclusions
This paper proposes search-free gridding based redundant projection removal, non-iteration linear function based division and temporal 2D local matching for particle filter based SLAM on FPGA. Search-free gridding based redundant projection removal reduces map points in complex pattern areas and reduces the number of projections. Non-iteration linear function based division accelerates the division in the projection problem of SLAM. Temporal 2D local matching reduces the amount of matching while keeping accuracy. The proposed methods are tested on high frame rate sequences with challenging movement. Compared with ORB-SLAM (6), the proposed method has 1.20° error in rotation and 1.90 cm error in translation on average. The matching and projection parts, which together take 88.6% of the time in software, are implemented in hardware and have 0.457 ms latency when 32 processing units run in parallel in projection. For future work, prediction and estimation in the particle filter still need to be implemented. Resampling in the estimation part is the only sequential part of the particle filter, and there are several methods to make it parallel. For a complete hardware implementation, the data communication between tracking and mapping, the storage and update of the particle filter's state vector, and the dataflow from the camera also need to be considered.

Table 2. Software accuracy: rotation RMSE

Table 3. Software accuracy: translation RMSE

Table 4. Hardware performance

Table 5. Performance with different parameters

Table 6. Resource utilization with different parameters