# Self-Localization and Odometry

# Self-Localization and Odometry

## Table of Contents

## Introduction

Self-localization is one of the main challenges in the application of autonomous systems. These strategies can be divided into two major categories, GPS-based and Odometry. Odometry or position tracking is a form of navigation to detect the position and orientation of a robot by measuring the distance and angle of the robot’s movement using sensor data (e.g., inertial, visual, and radar). Position tracking is a fundamental task in autonomous navigation and it is a key component in many other applications, such as robotics, autonomous vehicles, and augmented reality. There are many odometry sensors, such as wheel encoders, inertial measurement units (IMU), and LiDAR. Odometry can be divided into two categories: dead reckoning and visual odometry. The process of dead reckoning involves calculating the current position from a previously determined position and orientation, taking into account acceleration, speed, and heading direction over a given period of time. Instead, visual odometry uses optical sensor data to analyze image sequences and provide incremental online pose estimation. In recent years, much attention has been drawn to this technique because it has high accuracy and generates less drift error than conventional methods but the high computation cost is one of its main challenges. Also, inertial sensor readings could be fused with visual odometry or can be used alone to estimate a robot’s position and orientation.

Inertial Odometery techniques typically use a combination of accelerometers and gyroscopes to estimate the 3D motion of a robot. The accelerometer measures the linear acceleration, while the gyroscope measures the angular velocity. IMUs despite all other types of sensors are independent of the environment and are egocentric. Moreover, recent advances in Micro-Electro-Mechanical Systems (MEMS) technology have enabled IMUs to become smaller, cheaper, and more accurate. They are now available for use in mobile robots, smartphones, drones, and autonomous vehicles. Low cost MEMS based IMUs are suffering from noise, drift and bias errors. Machnie learning approaches can be used to compensate these noises and biases.

The most challenging part of odometry estimation is trajectory tracking, independent of the type of motion. Estimate direction of gravity vector, noise and bias, simultanously is a challenging task. One of the simple suloition is double integration of accelerometer data. But, it is not accurate enough and could lead to high drift errors in the output. Another sulotion is to use a Kalman filter. The Kalman filter is a recursive Bayesian estimator that uses a series of measurements observed over time, containing statistical noise and other inaccuracies, and produces estimates of unknown variables that tend to be more precise than those based on a single measurement alone, by estimating a joint probability distribution over the variables for each timeframe.

Recent studies have proposed a new deep learning aproch which known as Inertial Odometry Neuran Network (IONet) which could be used to estimate the 3D motion using IMUs measurements [aboldeoopio]. IONets are based on the idea of using a deep neural network to learn the relationship between the IMU raw data and the ground truth without any handcrafted engineering [chen2019]. The main advantage of this approach is that it can learn the features of the data and can be used to estimate the odometry of a robot but it requires a large amount of data to train the model. Also, it is computationally expensive. Perviouse studies shown that IONets outperforms the conventional methods in terms of accuracy and robustness.

In this paper, we propose a novel deep learning approach to estimate the odometry of a robot. The proposed method is based on a deep neural network that uses Long-Short Term Memory (LSTM) layers to learn the complex relationship between the IMU raw data and the ground truth. The proposed method is evaluated on a real-world dataset. The results show that the proposed method outperforms the state-of-the-art methods in terms of accuracy and computational cost.

## Related Work

### Inertial Navigation Systems

A Strapdown Inertial Navigation System (SINS) works by double integrating accelerometer readings on a rigidly mounted vehicle to determine the postion. MEMS bsaed IMUs which are deployed in mobile robots, smartphones, and drones postioning and navigtion systems are suffering from noise and drift which could lead to high errors in the output, due to accumulate positonal error and make them unusable for long term applications. So, typiclly the IMUs reading are fused with other sensors such as GPS, wheel encoders, and LiDAR to improve the accuracy of the system. One of the popular fusion tehniques for comercial use is Kalman Filter and its variants (i.e., Extended Kalman Filter, and Unscented Kalman Filter). Kalman filter is a recursive Bayesian estimator that uses a series of measurements observed over time, containing statistical noise and other inaccuracies, and produces estimates of unknown variables that tend to be more precise than those based on a single measurement alone, by estimating a joint probability distribution over the variables for each timeframe. [Inertial odometry on handheld smartphones] Solin et al. used EKF to develop a probabilistic approach for Online inertial odometry based on double integrating rotated accelerations using IMU measurements. In [A Tutorial on Quantitative Trajectory Evaluation for Visual(-Inertial) Odometry] quantitative principled method has been presented to eavaluate trajectory esimated be visual odometry and visual-inertial odometry algorithms. A monocolular visual-inertial odometry algorithm [Robust Visual Inertial Odometry Using a Direct EKF-Based Approach] presented using EKF. In [Keyframe-based visual–inertial odometry using nonlinear optimization], IMU measurements have been tightly coupled with image keypoints to solve a non-linear optimization problem for localization. Qin et al. introduce VINS-Mono a real-time visual-inertial navigation system which use monocular camera and low-cost IMU for 6 DoF state estimation.

### Deep Learning Approaches

Deep Learning approaches use End-to-End learning framework to estimate the position and orientation of the sensor given IMUs reading and initial state variables. As a result of these approaches, sequence data was handled in an excellent manner. Most of purposed IONets which knows as Visual-Inertial Odometry Neural Network (VIONet) have used optical sensors for solve the localization problem. Some recent works have proposed IONets which uses IMUs data only such as [ionet, AboldDeepIo, lima, …].

RIDI, uased a Support Vector Machine (SVM) for phone location classification (i.e., body, hand, leg, bag) and then fed the SVM outputs to one of the eight different Support Vector Regression (SVR) models to estimate the velocity of the phone and the output is used for acceleratin correction on a 2D map for Pedestrian Dead Reckoning (PDR).

Chen et.al presented an IONet based on LSTM layers which have shown acceptable performance PDR task. This model consist of two LSTM layers with 96 cells in each layer. Inputs are accelerometer and gyroscope measurements collected from smartphones and the outputs are the displacement in polar coordinate with a focus on 2D planar trajectory estimation for pdestrian tracking. As this network is not robust to noise and IMU sampling rate can not perform well in real world applications and only achieved good resualts forl humane pose estimation. The proposed model use sequence of IMU data as a window with the size of N measurements to compensate the drift error in output.

AbolDeepIO proposed an LSTM based model with three layers to estimate changes in postion and orientation but it is not capable for trajectory predication. Each layer coresponded to one input (i.e., sampleing rate, accelerometer, and gyroscope) to extract the feature from input data. During the training process, the model was exposed to simulated noises in order to make it more robust to noises and to take time intervals between IMU measurements into account, which meant the model was more robust to changes in sampling rate.

Lima et. al., presented a 6-DoF end-to-end deep learning frame work for inertial odometry. In this study, the authors used two CNN layer for each input (i.e., accelerometer and gyroscope) which concanated together and fed to a LSTM section. The LSTM section consists of two LSTM layer with 128 hidden cells in each layer. The output of the LSTM section is fed to a fully connected layer with 7 neurons to predict the 6-DoF relative pose representaion which conists of a 3d translation vector and a unit quaternion. The model has been trained on OxIOD dataset and EuRoC MAV dataset, seprately. IMU measurements has been fed into the model as a sequent of windows with the size of 200 measurment and stride of 10. This windows consits of 100 perviouse and 100 future measurements. As the IMU sampling rate has not been included in the model’s input, it is highly sensetive to any changes in sampling rate.

Kim et. al., preoped an Extended IoNet which is connected to the Nine-Axis IONet and the Pose TuningNet for perfomance improvment of the pose estimation. The inputs are geomagnetic, gyroscope and accelerometer measurements and the output is the 6-DoF relative pose representaion which conists of a 3d translation vector and a unit quaternion.

IDOL is a two stage deep learning framework for localization and orientation estimation based on inertial sensors measuremnets. In the first stage, IMU measurements fed into the Orientation Module which consits of LSTM and Fully Connected layers and EKF filter to estimate the orientation of the sensor in quaternion representation. In the second stage, the estimated orientation is fed into the Positoon Module which is a Bidirectional LSTM layer to estimate the position and minimize displacement error in each time window.

RoNIN presented three deep learning models based on LSTM, Residual Network (ResNet), and Temporal Convolutional Network (TCN) to regress the velocity vector and estimate the trajectory in 2d plane. TILO coupled a neural network with an EKF for 3D inertial odometry and the corresponing uncertainty based on IMU readings. Deep Learning model used to estimate the displacement and uncertainty using IMU measurements. The model’s output fed into an EKF to estimate the state of the system. OriNet used LSTM based deep learning framework to estimate the 3D orientation in the quaternion form from IMU sensor readings. To correct the gyroscope bias, the authors presented a Gentic Algorithms (GA) based method.

Weber et al. presented the RIANN model which used Gated Recurrent Unit (GRU) layers to estimate the Attiude (roll and pitch) directlu from the IMU sensor reading. This model accept the frequency of the IMU sensor as an input which made it robust to sampling rate changes. The model used multiple IMU datasets (i.e., BROAD, TUM-VI, OxIOD, RepoIMU, Sassari, and EuRoC MAV) for trainging, validation, and test. As the model only accept sampling rate, accelerometer, and gyroscope as inputs, it is not capable to estimate the yaw angle.

## Background

### Inertial Navigation Principles

Inertial Navigation algorithms use Newtonian mechanics to track the position and orientation using inertial sensors (i.e., accelerometer and gyroscope). The basic principle of inertial navigation is to measure the acceleration and angular velocity of the object and integrate them to estimate the position and orientation. Gyroscope measures the inertial angular velocity expressed in body frame with respect to an inertial frame and relays on the principle of the angular momentum conservation. Its outputs are reliable in high frequency responses. The gyro measurements can be mathematically modeled by following equations:

$$ \begin{equation} \begin{gathered} \omega_{measured}^b=\omega_{true}^b+b_g+N_{g}^b \end{gathered} \end{equation} $$

where $\omega_{measured}^b$ is the measured angular velocity, $\omega_{true}^b$ is the true angular velocity, $b_g$ is the gyro bias, and $N_{g}^b$ is the gyro noise. Accelerometer measures the force per unit mass (specific force) or the non-gravitational acceleration (usually in m/s^2) and is drift-free but have high frequency noises in its outputs. Accelerometers measure the difference between linear accelerations in the body frame and the earth’s gravitational field vector. When linear accelerations do not exist, they will measure the rotated gravitational field vector, which can be used to compute roll and pitch angles. Due to the inability to measure rotation around gravitational field vector, accelerometer data cannot be used to compute the yaw angle. The accelerometer measurements can be mathematically modeled by following equations:

$$ \begin{equation} \begin{gathered} f^b=a^b-g^b+N^b \end{gathered} \end{equation} $$

where $f^b$ is the measured specific force, $a^b$ is the true acceleration, $g^b$ is the gravitational field vector, and $N^b$ is the accelerometer noise. All IMUs readings are expressed in the body frame. To transform the IMU readings to the inertial frame, the orientation of the body frame with respect to the inertial frame is needed. The orientation of the sensor can be represented by a unit quaternion which is a four-dimensional vector which defined as follows:

$$ \begin{equation} \begin{gathered} \mathbf{q}=\begin{bmatrix} q_0 \ q_1 \ q_2 \ q_3 \end{bmatrix} \end{gathered} \end{equation} $$

where $\mathbf{q}$ is the unit quaternion. The rotation matrices can be used to map the IMU readings to the inertial frame as follows:

$$ \begin{equation} \begin{gathered} \phi=\arctan(\frac{A_x}{\sqrt{A_y^2+A_z^2}}) \end{gathered} \end{equation} $$

$$ \begin{equation} \begin{gathered} \theta=\arctan(\frac{A_y}{\sqrt{A_x^2+A_z^2}}) \end{gathered} \end{equation} $$

where $\phi$ is the roll angle and $\theta$ is the pitch angle.

where $\mathbf{\dot{q} }$ is the attitude rates, $\omega_x$, $\omega_y$, and $\omega_z$ are the angular velocity in x, y, and z axis respectively. The euler angles could be calculated from the quaternion represention using the following equations:

where $\phi$ is the roll angle, $\theta$ is the pitch angle, and $\psi$ is the yaw angle. To calculate position, Accelerometer readings first transformed to the inertial frame using the rotation matrix $\mathbf{C^b_n}$ and then integrated into velocity. Also, the velocity vector in the body frame can be calculated using the following:

$$ \begin{equation} \begin{gathered} \mathbf{V^b}= \begin{bmatrix} \begin{aligned} V_x^b \ V_y^b \ V_z^b \end{aligned} \end{bmatrix} \end{gathered} \end{equation} $$

where $\mathbf{V^b}$ is the velocity vector in the body frame. Velocity can be updated using the following equations:

$$ \begin{equation} \begin{gathered} \mathbf{V(t)}=\mathbf{V(t-1)}+\mathbf{C^b_n}a(t)\delta t \end{gathered} \end{equation} $$

where $\mathbf{V(t)}$ is the velocity at time $t$, $\mathbf{V(t-1)}$ is the velocity at time $t-1$, $\mathbf{C^b_n}$ is the rotation matrix from the body frame to the inertial frame, $\mathbf{a(t)}$ is the acceleration at time $t$ , and $\delta t$ is the time interval between two consecutive measurements. The location of the sensor can be calculated by integrating the velocity vector. The location can be updated using the following equations:

$$ \begin{equation} \begin{gathered} \mathbf{P(t)}=\mathbf{P(t-1)}+\mathbf{V(t-1)}\delta t \end{gathered} \end{equation} $$

where $\mathbf{P(t)}$ is the location at time $t$, $\mathbf{P(t-1)}$ is the location at time $t-1$, $\mathbf{V(t-1)}$ is the velocity at time $t-1$, and $\delta t$ is the time interval between two consecutive measurements.

The transformation of latent system states could be defines as a transfer functoin of state space model between two consecutive time steps. The transfer function of the state space model is defined as:

$$
\begin{equation}
\begin{gathered}
\begin{bmatrix} \mathbf{C^b_n} \ \mathbf{V} \ \mathbf{P} \end{bmatrix}*t = \begin{bmatrix} \mathbf{C^b_n} \ \mathbf{V} \ \mathbf{P} \end{bmatrix}*{t-1} + \begin{bmatrix} \mathbf{A} \ \mathbf{\omega} \end{bmatrix}_t
\end{gathered}
\end{equation}
$$

The displacement between two consecutive time steps is calculated using the following equations:

$$ \begin{equation} \begin{gathered} \mathbf{\delta P} = \int_{t-1}^{t} \mathbf{V} dt \end{gathered} \end{equation} $$

The displacement between two consecutive time steps is used to update the location of the sensor. The location of the sensor is updated using the following equations:

$$ \begin{equation} \begin{gathered} \mathbf{P} = \mathbf{P}_{t-1} + \mathbf{\delta P} \end{gathered} \end{equation} $$

The location of the sensor is used to calculate the distance between the sensor and the target. The distance between the sensor and the target is calculated using the following equations:

$$ \begin{equation} \begin{gathered} \mathbf{d} = \mathbf{P} - \mathbf{P}_t \end{gathered} \end{equation} $$

where $\mathbf{d}$ is the distance between the sensor and the target, $\mathbf{P}$ is the location of the sensor, and $\mathbf{P}_t$ is the location of the target. The distance between the sensor and the target is used to calculate the heading angle of the target. The heading angle of the target is calculated using the following equations:

where $\phi$ is the heading angle in the x axis, $\theta$ is the heading angle in the y axis, and $\psi$ is the heading angle in the z axis. It is noticable that the headning angle could not be caclulated by accelerometer, and gyroscope could only calulate the rate of change of the heading angle.

### 6-DoF relative pose representaion

There are various 6-DoF relative pose representaion, one approach is using sepherical coortdinate system to extend the polar coordinate system into 3D space. The relative pose in the sepherical coortdinate system is defined by adding the calculated changes in postion $\mathbf{\delta P}$ and the orientation $\delta \theta$ and $\delta \psi$ to the given pervious position and orientation, respectively. The relative pose in the sepherical coortdinate system is defined as:

One of disadvantages of this approch is its limitation to detect the backward and sideways movements [lima]. The other approch is using 3D translation vector and a unit quaternion which is defined as follows:

$$ \begin{equation} \begin{gathered} P_t = P_{t-1} + R(q_{t-1})\Delta P\ q_t = q_{t-1} \otimes {\Delta q}\ \end{gathered} \end{equation} $$

where $P_t$ is the location of the sensor at time $t$, $P_{t-1}$ is the location of the sensor at time $t-1$, $R(q_{t-1})$ is the rotation matrix of the sensor at time $t-1$, $\Delta P$ is the displacement between two consecutive time steps, $q_t$ is the orientation of the sensor at time $t$, $q_{t-1}$ is the orientation of the sensor at time $t-1$, $\Delta q$ is the change in orientation between two consecutive time steps, and $\otimes$ is the quaternion multiplication operator. The quaternion multiplication operator is defined as:

The predicted quaternions must be normalized to ensure that the quaternion is a unit quaternion using the following:

$$ \begin{equation} \begin{gathered} q_t = \frac{q_t}{\lVert q_t \rVert}= \frac{q_t}{\sqrt{q_t^0q_t^0 + q_t^1q_t^1 + q_t^2q_t^2 + q_t^3q_t^3}} \end{gathered} \end{equation} $$

## 6 DoF Inertial Odometry Neural Network

The proposed IONet takes IMU measurements (i.e., accelerometer and gyroscope) and its sampling rate as input and outputs the 6-DoF relative pose. The proposed IONet is composed of two parts: the feature extraction network and the relative pose estimation network. The feature extraction network is used to extract the features from the IMU measurements. The relative pose estimation network is used to estimate the 6-DoF relative pose from the extracted features. The proposed IONet is illustrated in Figure []. This model aims to improve the performance of 3D trajectory estimation and hanle IMU measurements with different sampling rates and its noise and bias.

### Network Architecture

This model is based on Recurrent Neural Network (RNN) which is suitable for processing sequential data. The is containing sensor sampling rate (Hz), triaxial accelerometer (m/s^2), and triaxial gyroscope (rad/s) as input.

After concatinating the outputs of these layers, they fed into LSTM layer

### Error Metrices

The oputput of this model has different units and scales, which makes it difficult to compare the output of the model with the ground truth. To overcome this problem, the error metrices must be divided into two categories: the error metrices for the position and the error metrices for the orientation. As the error is a geometric quantity, it is not reasonable to use an Algebraics error metrices such as the mean squared error (MSE) or the mean absolute error (MAE). In the following some of the error metrices are discussed.

#### Quaternion Inner Product

Quaternion Inner Product of two quaternions is represented the angle between the predicted orientation and the true orientation. This makes the dot product equal to the angle between two points on the quaternion hypersphere. The quaternion inner product is defined as:

$$ \begin{equation} \begin{gathered} q \cdot p = q_w p_w + q_x p_x + q_y p_y + q_z p_z \end{gathered} \end{equation} $$

The Quaternion Inner Product will return the quaternion differnce between two quaternion, so if the angle between two quaternions is equal to 0, the Quaternion Inner Product will return 1. The Quaternion Inner Product Loss Function becomes:

$$ \begin{equation} \begin{gathered} L_{QIP} = 1 - | q \cdot p | \end{gathered} \end{equation} $$

The angle between two quaternions can be calculated using quaternion inner product as follows:

$$ \begin{equation} \begin{gathered} L_{QIPA} = \theta = \arccos(q \cdot p) \end{gathered} \end{equation} $$

#### Quaternion Multiplicative Error

The Quaternion Multiplicative Error is defined as the angle between the predicted orientation and the true orientation using Hamilton product and it is defined as:

$$ \begin{equation} \begin{gathered} L_{QME} = 2 \cdot \lVert imag(q \otimes p^{\star}) \rVert _1 \end{gathered} \end{equation} $$

where $p^{\star}$ is the complex conjugate of the quaternion $p$. Another way to calculate the Quaternion Multiplicative Error is as follows:

$$ \begin{equation} \begin{gathered} L_{QMEA} = 2 \cdot \arccos(|scaler( q \otimes p^{\star}) |) \end{gathered} \end{equation} $$

By using $arccos$ function in the implementation, the scaler part of $(q \otimes p^{\star})$ could be leads to a value greater than 1, which is not possible. To overcome this problem, the scaler part of $(q \otimes p^{\star})$ is clamped to the range of [-1, 1]. The value clipping could lose the information about the angle between two quaternions, so another approch is replacing the $arccos$ function with a linear function to avoid exploiding the gradient. The linear function is defined as:

$$ \begin{equation} \begin{gathered} L_{QMEAL} = 1 - \sqrt{q^2_w + q^2_z} \end{gathered} \end{equation} $$

where $q_w$ and $q_z$ are the squared values of $q \otimes p^{\star}$.

#### Quaternion Shortest Geodesic Distance

The Quaternion Shortest Geodesic Distance is defined as the angle between the predicted orientation and the true orientation using the shortest geodesic distance on the quaternion hypersphere. The Quaternion Shortest Geodesic Distance is defined as:

The loss function corsponding to the Quaternion Shortest Geodesic Distance is defined as:

$$ \begin{equation} \begin{gathered} L_{QSGD} = | 1 - (|scaler( q \otimes p^{\star}) |) | \end{gathered} \end{equation} $$

or,

$$ \begin{equation} \begin{gathered} L_{QSGD2} = \sqrt{1-\sqrt{scaler( q \otimes p^{\star})^2}} \end{gathered} \end{equation} $$

The error between ground truth position and predicted position can be calculated using the Euclidean distance between the predicted position and the true position. The Euclidean distance is defined as:

$$ \begin{equation} \begin{gathered} L_{ED} = \sqrt{(x_t - x_p)^2 + (y_t - y_p)^2 + (z_t - z_p)^2} \end{gathered} \end{equation} $$

### Loss Function

The loss function is used to calculate the error between the predicted output and the ground truth. But the output of the model has different units and scales, which makes it difficult to compare the output of the model with the ground truth. So, this problem could be consider as a Multi-Task Learning problem. The Multi-Task Learning problem is a problem in which the model has to predict multiple outputs. a total loss function must be used