Literature Review and Related Work

2.1 Introduction

This section discusses the related work and contributions made in the past that relate to the work proposed in this thesis. The topics covered are as follows:

- Latest work related to enhancing monocular VO.
- Latest work related to enhancing stereo VO.
- Sensor fusion and state estimation.
- Research related to reducing drift in odometry-based sensors.

2.2 Autonomous Vehicle Perception

Over the past two decades, the development of artificial perception modules that give autonomous vehicles greater situational awareness has grown very rapidly. New developments in both hardware and software allow vehicles to gain awareness through either exteroceptive or proprioceptive sensors [16]. These data are processed by different algorithms [17] to provide the information the vehicle needs in order to navigate. The sensors that vehicles rely on to achieve this suffer from different types of error, depending on the sensor's physical parameters or on the way the algorithm handles the raw sensor data. Algorithms that combine readings from different sensors to obtain more accurate information have been thoroughly investigated; many of them rely on the Kalman filter or on Bayesian filters [18].

2.2.1 Pose-increment integration based sensors

Sensors capable of measuring small distance increments, such as encoders, have been widely used in mobile robot applications. Early autonomous mobile robot architectures adopted wheel encoders, an approach also known as dead reckoning, in order to localize the robot. Many modern localization algorithms that utilize particle filter techniques, such as Adaptive Monte Carlo Localization (AMCL), rely on wheel encoders in addition to other sensors to perform localization [19]. However popular wheel encoders have historically been in autonomous mobile robots, they offer relatively low localization accuracy, and much work has been done throughout the past two decades to improve it. Work in [20] presents an approach in which the accuracy of the wheel encoder is improved by fusing the readings of a magnetic compass with those of a gyroscope to produce more accurate pose information; the magnetic field of the vehicle itself is compensated for to avoid interference with the magnetic compass. Other contributions utilized similar techniques [21, 22]. Work in [23] proposes a method of improving wheel-encoder localization that not only incorporates other sensors such as an Inertial Measurement Unit (IMU), but also uses two different Kalman filters, one designed specifically for the no-slip scenario and the other for the slip scenario, together with a supervisory algorithm that selects between the two depending on whether slip occurs. Such a hybrid approach allows this scheme to overcome the slip challenge to a certain extent. "Odometry sensor" is now a term used to describe any sensor that relies on an integration process to solve the vehicle localization problem. Light Detection and Ranging (LIDAR) is a powerful example of a modern odometry sensor: an array of light beams is projected into the field surrounding the vehicle, building a point-cloud representation of the vehicle's surroundings [24], and variations in this point cloud during the vehicle's movement allow us to deduce the odometry increment at each time step and hence localize the vehicle or robot.
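To make the pose-increment integration described above concrete, the following minimal Python sketch (an illustration, not code from this thesis) integrates differential-drive wheel-encoder increments into a pose estimate; the function name, wheel-base value and encoder readings are hypothetical.

```python
import numpy as np

def dead_reckon(pose, d_left, d_right, wheel_base):
    """Integrate one differential-drive encoder increment.

    pose: (x, y, theta); d_left/d_right: wheel travel since the last
    step in metres; wheel_base: distance between the wheels in metres.
    """
    x, y, theta = pose
    d_center = 0.5 * (d_left + d_right)        # forward motion of the robot centre
    d_theta = (d_right - d_left) / wheel_base  # heading change
    # Integrate the increment at the mid-point heading.
    x += d_center * np.cos(theta + 0.5 * d_theta)
    y += d_center * np.sin(theta + 0.5 * d_theta)
    theta += d_theta
    return (x, y, theta)

pose = (0.0, 0.0, 0.0)
for d_l, d_r in [(0.10, 0.11), (0.10, 0.12)]:  # hypothetical encoder readings
    pose = dead_reckon(pose, d_l, d_r, wheel_base=0.5)
print(pose)
```

Because each increment carries a small error, the integrated pose drifts over time; this drift is the central limitation of odometry sensors discussed throughout this chapter.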
Algorithms that utilize LIDARs have been extensively developed. One example is [25], where a real-time method for odometry and mapping is proposed; the technique relies on a two-dimensional LIDAR moving in six degrees of freedom (DOF) and produces accurate results by separating the problem into two main algorithms: a high-speed one that measures the velocity, and a comparatively low-speed but highly accurate one responsible for matching and point-cloud registration. Such a combination yields a computationally efficient yet accurate algorithm.

Cameras are among the most popular sensors and are utilized wherever human-like detection and awareness is to be mimicked. Cameras have a very long history of development: Chinese texts dating back to 300-400 B.C., found in a book called Mozi [26], contain details of a conceptual design of a pinhole camera and discuss an effect called the camera obscura. Developments have been made ever since, leading to the modern cameras we have in hand today. Localization using feature-based Visual Odometry (VO) is one of the most widely used approaches in modern autonomous vehicles. VO can be categorized in two main ways: by the type of camera used or by the method of computing the odometry. Regarding the method of computing, feature-based VO is a technique in which the change in the locations of certain features between the camera frame at time T and the camera frame at time T+1 is transformed into vehicle odometry. Direct visual odometry utilizes a different approach, in which the transformation between entire successive camera frames, computed from the intensities of the whole camera image at each time step, is used to compute the odometry of the vehicle [27].

Figure 2.1: Typical Monocular Visual Odometry Pipeline

Monocular visual odometry is a sub-category (by camera type) of VO that utilizes only one camera [28] to estimate the vehicle motion; a typical monocular VO pipeline is shown in Figure 2.1. Input images are subjected to feature detection and description algorithms such as Speeded Up Robust Features, which we will discuss thoroughly later in this chapter. The detected features are then matched between the image frames, and an outlier rejection scheme such as RANdom SAmple Consensus (RANSAC) is used to remove outliers (false matches). RANSAC assumes a hypothesis and validates it using the matched pairs; points are said to be outliers if they fail to follow the hypothesis that most of the other pairs follow. The matched pairs are then used to estimate the camera essential matrix in the case of a calibrated camera, or the fundamental matrix in the case of a non-calibrated camera. The technique used to estimate the essential matrix is called the 5-point algorithm and will be discussed thoroughly in Chapter 3. The camera essential matrix contains information about the rotation matrix and the translation vector of the camera movement from time T-1 to time T; this information is used to compute the vehicle's incremental motion and hence the vehicle pose.
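A minimal sketch of one iteration of this pipeline, using OpenCV, might look as follows. This is an illustration rather than the thesis implementation: it assumes grayscale frames and a known intrinsic matrix K, and it substitutes ORB for SURF since ORB ships with the core OpenCV package.

```python
import cv2
import numpy as np

def monocular_vo_step(img_prev, img_curr, K):
    """One monocular VO increment: detect and match features, reject
    outliers with RANSAC, and recover (R, t) up to an unknown scale."""
    orb = cv2.ORB_create(2000)                       # feature detector/descriptor
    kp1, des1 = orb.detectAndCompute(img_prev, None)
    kp2, des2 = orb.detectAndCompute(img_curr, None)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(des1, des2)
    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])
    # The 5-point algorithm runs inside a RANSAC loop to reject mismatches.
    E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC,
                                   prob=0.999, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=mask)
    return R, t

# Hypothetical usage with KITTI-like intrinsics:
# K = np.array([[718.9, 0, 607.2], [0, 718.9, 185.2], [0, 0, 1.0]])
# R, t = monocular_vo_step(frame0, frame1, K)
```

Note that recoverPose returns t as a unit vector: the magnitude of the translation is exactly the unobservable scale discussed next.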
Being dependent on a single camera, monocular visual odometry can estimate the vehicle motion only up to a scale that is unobservable to the algorithm. To tackle this problem, Structure from Motion (SFM) algorithms [29, 30] usually utilize other onboard sensors, such as the IMU, in order to estimate this scale. Recent research has also utilized deep learning algorithms to estimate the unobservable scale in monocular VO without relying on other sensors [31].

Figure 2.2: Typical Stereo Visual Odometry Pipeline

Figure 2.3: Inputs and outputs of the image rectification function

Stereo visual odometry [32] utilizes a two-camera configuration which, unlike the one-camera configuration of monocular visual odometry, allows the algorithm to estimate the complete motion of the vehicle at an observable scale; this is why it is found to be considerably more effective than monocular VO. A comparison between the two approaches is presented in [33]. A typical stereo visual odometry pipeline is shown in Figure 2.2. The algorithm starts by detecting and matching features belonging to objects between successive left-camera image pairs. The right and left stereo camera images are then rectified; the rectification process reduces the two-dimensional stereo problem to a one-dimensional problem by ensuring that all corresponding points between the right and left images lie on the same row. The inputs and outputs of the image rectification function are shown in Figure 2.3. The camera extrinsic parameters form a transformation matrix (rotation and translation) that maps points from world coordinates to camera coordinates, while the intrinsic parameters form a 3×3 matrix, containing the camera focal length, principal point and skew coefficient, that maps camera coordinates to pixel coordinates. The re-projection matrix is a 4×4 matrix that includes the rectified camera focal length, the rectified principal point and the baseline of the rectified stereo camera.

Red Green Blue Depth (RGB-D) visual odometry [34] relies on a special type of camera that is usually equipped with an additional sensor, either another camera or an infrared sensor, used to generate a depth map representing the distance of each pixel from the camera, in addition to the normal red, green and blue channels of a typical camera. Such a device is usually deployed on board indoor robots [35]. Figure 2.4 shows the pipeline of a typical RGB-D visual odometry.

Figure 2.4: Typical RGB-D Visual Odometry Pipeline

An introduction to modern scale-invariant feature detectors and descriptors was made in [36], where a revolutionary interest point detector called the Scale Invariant Feature Transform (SIFT) was proposed. SIFT is based on the idea of blurring an image with Gaussian filters at different scales; the resulting blurred images are then subtracted in a process called the Difference of Gaussians (DOG), shown in Figure 2.5. Extreme points that stand out after performing the DOG across different octaves are what we call interest points. After such points are detected, a 128-value feature vector is formed using a 4 × 4 region around the key point: for each sub-region, gradients are accumulated into a histogram with eight directions, forty-five degrees apart.

Figure 2.5: The process of calculating the DOG between different image Gaussian scales and octaves
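The DOG construction can be sketched in a few lines of Python with OpenCV. This is an illustration only: the image path and scale values are hypothetical, and a full SIFT implementation would repeat this per octave and search for extrema across both space and scale.

```python
import cv2

# Hypothetical input; SIFT operates on grayscale intensities.
img = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE).astype("float32")

# Blur the image at several increasing scales within one octave.
sigmas = [1.6 * (2 ** (i / 3.0)) for i in range(5)]
blurred = [cv2.GaussianBlur(img, (0, 0), s) for s in sigmas]

# Difference of Gaussians: subtract adjacent scales. Pixels that are
# extrema across neighbouring DOG layers become candidate interest points.
dogs = [b2 - b1 for b1, b2 in zip(blurred, blurred[1:])]
```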
Contributions to enhancing the speed of scale-invariant feature detection and description were made in [37]. Speeded Up Robust Features (SURF) utilizes several approximations that enable the algorithm to dramatically decrease the computation time while still performing efficiently. SURF depends on integral images, where each pixel value is the summation of the pixels above and to the left of that pixel; the algorithm then uses an approximation of the determinant of the Hessian matrix. The Hessian matrix can be expressed as

\[
\mathcal{H}(\mathbf{x},\sigma) =
\begin{bmatrix}
L_{xx}(\mathbf{x},\sigma) & L_{xy}(\mathbf{x},\sigma) \\
L_{xy}(\mathbf{x},\sigma) & L_{yy}(\mathbf{x},\sigma)
\end{bmatrix}
\tag{2.1}
\]

where \(\mathbf{x}\) represents the pixel location, \(\sigma\) represents the scale, and \(L_{xx}(\mathbf{x},\sigma)\) denotes the convolution of the second-order Gaussian derivative with the image at \(\mathbf{x}\) (similarly for \(L_{xy}\) and \(L_{yy}\)). SURF utilizes a box-filter approximation to this matrix, leading to a further approximation of the Laplacian of Gaussian (LOG). Instead of using different image sizes, SURF changes the box-filter size, and a 3×3×3 non-maximum suppression is applied to select the interest points. The SURF descriptor relies on finding the dominant orientation using Haar wavelet filters in both the x and y dimensions. Once the dominant orientation is found, a 4 × 4 grid of sub-regions is formed around the point of interest, and four features, expressed as Σdx, Σdy, Σ|dx| and Σ|dy|, are extracted for each of the 16 sub-regions, forming a 64-value feature vector. This is smaller than the 128-value vector formed by the SIFT algorithm, yielding again a relatively faster algorithm capable of running in real time. The work in [38] thoroughly discusses the performance comparison of the two algorithms.

The feature matching and association processes of the major known detectors and descriptors are often found to be vulnerable to mismatching. Since feature-based visual odometry depends by nature on feature matching between successive camera frames, errors associated with feature mismatching result in an overall error in the odometry estimate. Several approaches are imposed on feature-based visual odometry in order to eliminate the mismatching of features, such as RANSAC [39], where a hypothesis model is fitted from randomly selected matched pairs of features; the model is then verified against the remaining matched features, and pairs that fit the model are regarded as inliers while the rest are regarded as outliers. In [40], work is presented that utilizes RANSAC to improve the accuracy of stereo-image VO. Work in [41] proposes an outlier rejection technique suited to high-speed and large-scale scenarios, such as driving on a highway. In [42], an outlier rejection algorithm is proposed based on the optical flow of one camera.

Since odometry sensors rely on integration to estimate the position and orientation, individual errors in each incremental motion accumulate at every time step. As mentioned previously, VO algorithms are regarded as odometry algorithms that primarily rely on integration to calculate the current pose; because this approach results in drift error that cannot be entirely eliminated, several contributions have been made to reduce it. In [43], the authors proposed a new descriptor named SYnthetic BAsis (SYBA), developed with the aim of reducing falsely matched features between camera frames; SYBA uses a sliding-window technique in which a detected feature is matched against all the frames in the window instead of only the previous frame.
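The accumulation effect described above can be illustrated with a minimal, purely hypothetical simulation: a vehicle advances one metre per step, each measured increment carries small zero-mean noise, and the integrated position estimate drifts away from the truth like a random walk.

```python
import numpy as np

rng = np.random.default_rng(0)
true_step = np.array([1.0, 0.0])   # vehicle moves 1 m along x each step
steps = 1000
# Each measured increment carries 2 cm of zero-mean noise.
noisy = true_step + rng.normal(0.0, 0.02, size=(steps, 2))
est = np.cumsum(noisy, axis=0)     # odometry: integrate every increment
truth = np.cumsum(np.tile(true_step, (steps, 1)), axis=0)
err = np.linalg.norm(est - truth, axis=1)
# Drift keeps growing with distance travelled even though each
# individual increment error is tiny and unbiased.
print(err[9], err[99], err[999])
```

Unlike per-frame measurement noise, this drift cannot be averaged out after the fact, which motivates the correction techniques surveyed below.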
Convolutional Neural Networks (CNN) have also been utilized to suppress the drift error in VO algorithms. Work in [44] proposes an approach that uses a Bayesian CNN to identify the direction of the sun; this information is used as a global reference for the orientation estimate, hence reducing the VO drift error. Work in [45] proposes an algorithm that combines direct VO and feature-based VO. Direct VO usually yields an accurate motion estimate, but it is computationally expensive compared with feature-based VO; the proposed approach provides a relatively accurate yet fast motion estimate without the full computational cost of direct VO, and it can be extended to multiple cameras, including fisheye cameras. The authors tested the proposed algorithm in different scenarios and compared it with algorithms commonly used in Simultaneous Localization And Mapping (SLAM), such as ORB SLAM without loop closure [46]. Work in [47] provides a study of the characteristics of the drift error associated with VO algorithms and its effect on autonomous vehicle control. The paper utilizes an experimental platform called ARTEMIPS, which is equipped with a MANTA G-507C monocular camera. Tests showed that the VO algorithms had a larger computation cost than GPS; in addition, the VO output suffers from drift error that accumulates over time. The authors also provided a model for this drift, which they suggest using to estimate and reduce it, given that VO is essential when GPS is unavailable. Work in [48] demonstrates an unsupervised deep learning algorithm that reduces the pose error in stereo VO. The algorithm proposed by the authors relies on a typical stereo VO to estimate the pose, together with an unsupervised deep network that receives a pair of stereo camera images and the VO output and generates a depth map and a mask providing information about dynamic objects. The authors utilized unsupervised learning to evade the need for a diverse and relatively large labelled dataset. The proposed network was trained on 46,000 stereo image pairs and tested on the KITTI dataset. The authors compared their results with other deep VO algorithms, and the results showed the ability of the approach to generate accurate motion estimates.

2.3 Autonomous Vehicle Sensor Fusion for Localization

Localization of autonomous vehicles or robots is a very challenging problem to tackle, as discussed in Section 2.2; data is fed into the localization algorithm in order to deduce the vehicle pose with an acceptable degree of accuracy. The Kalman filter was first introduced by Rudolf Kalman in 1960 [49]. His revolutionary work led to the creation of an optimal linear Gaussian state estimator: in other words, assuming the system whose states we would like to estimate is both linear and Gaussian, the Kalman filter provides the optimal estimates for that system. A typical Kalman filter assumes Gaussian white noise for both the process and the measurement; such noise is inherently independent at each time step, meaning that the noise at each time step is independent of that at previous time steps. A Kalman filter operates in a two-step configuration: a prediction step using the process model, followed by a correction step using the measurement model. The process is shown in Figures 2.6 and 2.7.

Figure 2.6: The pipeline of a typical Kalman filter

Figure 2.7: MathWorks demonstration of the Kalman filter estimator, showing the optimal Gaussian state estimate given the prediction and measurement distributions

Work in [50] emphasizes the usage of Kalman filters in mobile robot localization. Several variations of the Kalman filter state estimator have been proposed with the aim of adapting the idea to complex non-linear systems such as autonomous vehicles.
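The two-step cycle described above fits in a few lines of NumPy. The following generic sketch is illustrative only; the model matrices F, Q, H and R are assumed given.

```python
import numpy as np

def kf_predict(x, P, F, Q):
    """Prediction step: propagate the state and covariance through
    the linear process model x' = F x + w, with w ~ N(0, Q)."""
    x = F @ x
    P = F @ P @ F.T + Q
    return x, P

def kf_update(x, P, z, H, R):
    """Correction step: fuse the measurement z = H x + v, v ~ N(0, R)."""
    S = H @ P @ H.T + R                  # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)       # Kalman gain
    x = x + K @ (z - H @ x)              # corrected state
    P = (np.eye(len(x)) - K @ H) @ P     # corrected covariance
    return x, P
```

For a concrete system, F encodes the motion model and H selects what the sensor observes; the gain K automatically weights the prediction against the measurement according to their covariances.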
Since the normal Kalman filter is aimed at linear systems, the Extended Kalman Filter (EKF) [51] was proposed in order to deal with non-linear systems. Figure 2.8 shows a typical pipeline for a sensor fusion algorithm that estimates the vehicle pose using an onboard stereo camera and an RTK GPS sensor.

Figure 2.8: Camera and GPS sensor fusion pipeline

Work in [52] adopts the EKF for autonomous vehicle and mobile robot localization with the aim of providing an outdoor parking system; the proposed work incorporates an extended Kalman filter that handles both odometry and 3D-LIDAR data, and the algorithm additionally uses landmarks pinned on a map. Later on, the work presented in [53], known as the Unscented Kalman Filter, introduced a variant of the Kalman filter that propagates a set of sigma points through the non-linear model instead of linearizing it, which is found to be very effective. Sensor fusion techniques that depend on probabilistic estimation approaches, such as Moving Horizon Estimation [54], pose graph optimization [55], the EKF and so on, need the uncertainty in the position and orientation estimates to be quantified using the covariance matrix of the measurements in order to enhance the accuracy of the estimation. In [56], an adaptive Kalman filter localization approach is presented with the aim of estimating the Kalman filter covariance matrices of drift-prone sensors with the aid of drift-free sensors; such a technique showed its effectiveness as long as data is always available from the drift-free sensor, which in this case is the Global Positioning System (GPS). In [57], a mutual localization algorithm is proposed based on probabilistic multiple regression and a dynamic filtering algorithm, enabling multiple agents to aid each other in obtaining accurate information about their locations. In [58], an adaptive neural-network-aided Kalman filter is proposed in which the process and measurement covariance matrices are predicted by a neural network with a time-series input instead of being fixed to static values; such an approach is claimed to provide better state estimation than predefined covariance matrices, and the algorithm is validated by deploying it on the Clearpath Husky robot. Visual SLAM is one of the most common approaches to solving the vehicle localization problem; the work presented in [59] offers an open-source, cross-platform visual SLAM algorithm designed to utilize different types of cameras and provide robust performance, and the algorithm is tested on fisheye and equirectangular cameras. In [60], a dual-marker system is proposed to enhance a camera-based position estimate; fusing this dual-marker system with the robot odometry through a Kalman filter provides more accurate localization for SLAM algorithms on indoor navigating robots, and such techniques are also found to be relatively more cost-effective than motion-capture setups.
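As an illustration of how the EKF handles a non-linear motion model (a sketch under assumed models, not any of the cited systems), the following Python function fuses a unicycle odometry prediction with a GPS-like position measurement; the Jacobian F is the linearization that distinguishes the EKF from the linear filter.

```python
import numpy as np

def ekf_step(x, P, u, z, Q, R, dt):
    """One EKF cycle for a unicycle state x = [px, py, theta] with
    control u = [v, omega] and a GPS-like position measurement z = [px, py]."""
    px, py, th = x
    v, w = u
    # Non-linear prediction through the motion model.
    x_pred = np.array([px + v * dt * np.cos(th),
                       py + v * dt * np.sin(th),
                       th + w * dt])
    # Jacobian of the motion model w.r.t. the state (the EKF linearization).
    F = np.array([[1.0, 0.0, -v * dt * np.sin(th)],
                  [0.0, 1.0,  v * dt * np.cos(th)],
                  [0.0, 0.0,  1.0]])
    P_pred = F @ P @ F.T + Q
    # Measurement model: the GPS observes position directly.
    H = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0]])
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)
    x_new = x_pred + K @ (z - H @ x_pred)
    P_new = (np.eye(3) - K @ H) @ P_pred
    return x_new, P_new
```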
Machine-learning-based techniques have been the focus of interest in recent years in many applications, and algorithms utilizing this approach have been developed with the aim of reducing computation time while, given good training, obtaining results that have the potential to outperform traditional probabilistic filtering techniques and conventional motion estimation algorithms. Work in [61] proposes an end-to-end monocular VO algorithm that, instead of performing feature detection and matching for motion estimation, relies on a network trained to output the vehicle odometry through a combination of Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN); the algorithm is trained and tested on the KITTI dataset [62]. Work in [63] proposes a camera and 3D-LIDAR based end-to-end visual localization technique in which deep neural networks are trained to extract and describe features from the images and match them onto a map built from 3D LIDAR data, the LIDAR being used only for map building. The neural networks are designed to extract optimal features from the images at different scales, so that the features are reliable enough to be immune to extensive scene changes and distortions; this technique performed accurately when compared with other high-performance conventional LIDAR-based localization algorithms. Work in [64] proposes a learning-based VO that addresses sensor fusion in end-to-end learning algorithms. Such an approach tackles problems related to raw sensor data, enabling a more robust motion estimate. The approach fuses data from monocular VO with inertial measurements, proposing a dual fusion mode in which either probabilistic or deterministic fusion is selected, enabling the algorithm to handle different scenarios and data corruptions. The algorithm is tested and validated using the KITTI dataset [62], the EuRoC dataset [65] and the PennCOSYVIO dataset [66], and the results show good performance of the proposed approach.

2.4 KITTI Dataset

The Karlsruhe Institute of Technology and Toyota Technological Institute (KITTI) dataset [62] is one of the most used datasets in the field of autonomous mobile systems. The dataset contains traffic data collected from a high-resolution RGB camera, a stereo vision camera and a Velodyne HDL-64E laser scanner mounted on the vehicle shown in Figure 2.11.

Figure 2.9: Velodyne's HDL-64 S3 LIDAR

The dataset includes 11 sequences captured in different urban environments and scenarios. All 11 sequences contain the images captured by the stereo vision camera together with the ground truth, which is captured using the high-accuracy RTK GPS sensor mounted on the vehicle. Stereo camera frames are provided in both colour and greyscale, and rectified stereo images, which are used in this thesis, are also provided for the 11 sequences. Stereo camera parameters such as the baseline, the principal point and the focal length are provided in the dataset. The KITTI dataset is very diverse, allowing researchers to test their algorithms under different navigation conditions; Figure 2.10 shows a group of images captured from the dataset to illustrate this diversity. Another 11 sequences are available in the dataset, however without the ground truth for validation.

Figure 2.10: Snapshots of the different scenarios in the KITTI dataset

Figure 2.11: KITTI test platform
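For completeness, here is a short sketch of how a VO estimate can be scored against the KITTI ground truth: each line of a KITTI odometry pose file stores the 12 row-major entries of a 3×4 pose matrix [R|t]. The file paths below are hypothetical, and this simple end-point error is only a crude stand-in for the official KITTI evaluation metrics.

```python
import numpy as np

def load_kitti_poses(path):
    """Parse a KITTI odometry pose file: 12 floats per line,
    the row-major entries of a 3x4 matrix [R | t]."""
    with open(path) as f:
        return [np.array(line.split(), dtype=float).reshape(3, 4) for line in f]

def translation_error(est, gt):
    """Euclidean distance between estimated and ground-truth positions."""
    return [float(np.linalg.norm(Te[:, 3] - Tg[:, 3])) for Te, Tg in zip(est, gt)]

gt = load_kitti_poses("poses/00.txt")        # hypothetical paths
est = load_kitti_poses("results/00_vo.txt")
print(max(translation_error(est, gt)))       # worst-case drift over the sequence
```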
2.5 Gap of Knowledge

Modern localization and state estimation sensors and algorithms are able to estimate the vehicle pose accurately during navigation. Nevertheless, such techniques still need to be improved in order to ensure the feasibility and reliability of autonomous systems in the following respects:

- Reducing the drift error associated with odometry sensors and algorithms using relatively non-complex techniques that can be practically implemented in modern-day vehicles.
- Enhancing the range and capabilities of high-accuracy GPS sensors.
- Improving the adaptability of onboard sensors to the conditions of the navigation environment.

Major leaps have been made in the state of the art, as discussed in Sections 2.2 and 2.3: newly developed high-accuracy sensors such as RTK-GPS, LIDARs and cameras, together with efficient processing hardware such as high-performance GPUs, provide autonomous systems with very reliable perception of their navigation environment, allowing better localization performance. However, some of these technologies are subject to constraints that limit their capabilities. RTK-GPS needs an infrastructure of ground stations in order to achieve the desired accuracy, and such infrastructure is still not widespread in today's cities around the world [67]; this infrastructure dependency makes RTK-GPS vulnerable to signal losses, which can significantly reduce its accuracy. LIDARs, on the other hand, tend to have high manufacturing costs, making the sensor commercially expensive to integrate into modern vehicles; in addition, most high-efficiency LIDARs on the market, such as Velodyne's HDL-64 shown in Figure 2.9, include mechanical moving parts that increase the sensor's vulnerability. Moreover, accurate LIDARs suffer from drift error, which can be minimized using efficient but complex approaches that may impose feasibility and applicability constraints on commercial autonomous transportation vehicles. Changes in the navigation environment, such as weather changes or variations in artificial road lighting, significantly impact onboard exteroceptive sensors such as cameras, affecting their accuracy and the availability of the extracted information. Recent incidents, such as the accident involving Uber's self-driving car in the state of Arizona, where the vehicle failed to use its onboard sensors (which included a LIDAR) and algorithms to identify a pedestrian, leading to her death [68], expose the gaps in the state-of-the-art technology that we depend on today. Cameras, on the other hand, are less expensive than LIDARs and are efficient sensors that do not require any additional infrastructure; however, as discussed, cameras are vulnerable to many errors resulting from illumination changes in the environment, which affect their accuracy and demand relatively complicated algorithms to reduce such errors [69]. This kind of vulnerability is inherently transferred to VO algorithms, since they rely mainly on cameras; examples of such illumination and environmental changes are shown in Figure 2.10. This thesis contributes to reducing the drift error associated with onboard odometry sensors by proposing a robust machine learning model that is capable of reducing the drift in visual odometry algorithms.
Abstract

Autonomous systems have been evolving rapidly over the last few decades. Research and development of autonomous system modules enables such systems to perform complex tasks in an accurate and efficient manner. This development has contributed to the stable growth of the world economy through its direct impact on productivity and labour reduction, allowing people to focus on more important tasks. Autonomous mobile systems have become the new challenge for modern robotics and automotive engineering, and the work in this thesis addresses some of the challenges related to them. The localization module is crucial for any autonomous mobile system. Several onboard sensors are used to detect the vehicle's location, such as cameras, satellite global positioning systems and light detection and ranging, and sensor fusion techniques fuse the data from these sensors in order to provide an accurate estimate of the vehicle's pose. This thesis focuses on enhancing camera-based localization: a neural-network-based machine learning model is proposed that is able to refine the pose estimate calculated by visual odometry algorithms. Visual odometry algorithms allow the vehicle localization module to estimate the incremental changes in the vehicle's pose by detecting the variations between successive camera frames. This thesis addresses both monocular and stereo visual odometry algorithms using drift-reduction machine learning models that correlate the errors in the visual odometry algorithms with the physical changes in the image and hence reduce such errors. The work proposes two different types of machine learning models, one dedicated to translation and one to orientation. The Drift Reduction Neural Networks (DRNN) were designed in a manner that enables them to generalize over the training data and to avoid overfitting, and they were also able to adapt to different navigation environments and scenarios. The thesis further proposes a hybrid visual odometry algorithm that utilizes the developed machine learning models together with monocular and stereo visual odometry. Results showed the efficacy and robustness of the proposed algorithms, as they reduced the orientation error by up to 78% and the translation error by up to 89.9% when compared with standard visual odometry algorithms.