VisIRNet: Deep Image Alignment for UAV-taken Visible and Infrared Image Pairs
Abstract
This paper proposes a deep learning based solution for multi-modal image alignment regarding UAV-taken images. Many recently proposed state-of-the-art alignment techniques rely on using Lucas-Kanade (LK) based solutions for a successful alignment. However, we show that we can achieve state of the art results without using LK-based methods. Our approach carefully utilizes a two-branch convolutional neural network (CNN) built on feature embedding blocks. We propose two variants of our approach, where in the first variant (ModelA), we directly predict the new coordinates of only the four corners of the image to be aligned; and in the second one (ModelB), we predict the homography matrix directly. Applying alignment on the image corners forces the algorithm to match only those four corners, as opposed to computing and matching many (key)points, since the latter may cause many outliers, yielding less accurate alignment. We test our proposed approach on four aerial datasets and obtain state of the art results, when compared to the existing recent deep LK-based architectures.
Index Terms:
Multimodal image registration, image alignment, deep learning, infrared image registration, Lucas-Kanade algorithms, corner matching, UAV image processing.

I Introduction
Recent advancements in Unmanned Aerial Vehicle (UAV), computing and sensor technologies have allowed the use of UAVs for various earth observation applications. Many UAV systems are equipped with multiple cameras today, as cameras provide reasonable and relatively reliable information about the surrounding scene in the form of multiple images or image pairs. Such image pairs can be taken by different cameras, at different view-points, in different modalities or at different resolutions. In such situations, the same objects or the same features might appear at different coordinates on each image and, therefore, an image alignment (registration) step is needed prior to applying many other image based computer vision applications such as image fusion, object detection, segmentation or object tracking as in [45, 47, 46].
The infrared spectrum and the visible spectrum may reflect different properties of the same scene. Consequently, images taken in those modalities typically differ from each other. On many digital cameras, the visible spectrum is captured and stored in the form of the Red-Green-Blue (RGB) image model, and a typical visible spectrum camera captures visible light ranging from approximately 400 to 700 nanometers in wavelength [7, 8, 37]. Infrared cameras, on the other hand, capture wavelengths longer than those of visible light, falling between 700 and 10000 nanometers [11]. Infrared images can be further categorized into different wavelength ranges, such as near-infrared (NIR), mid-infrared (MIR) and far-infrared (FIR), each capturing different types of information in the spectrum [11, 12, 17, 18].
Image alignment is, essentially, the process of mapping pixel coordinates from different coordinate system(s) into one common coordinate system. This problem is studied under different names, including image registration and image alignment; we use the terms alignment and registration interchangeably in this paper. Typically, alignment is performed on image pairs by mapping one image (the source) onto the other one (the target) [20]. Image alignment is a common problem in many image-based applications, where the target and source images can be acquired by sensors using the same modality or different modalities. It has a wide range of applications in many fields, including medical imaging [1, 23], UAV applications [38, 24], image stitching [10] and remote sensing [5, 32, 31, 6, 39].

Image alignment, in many cases, can be reduced to the problem of estimating the parameters of the perspective transformation between two images acquired by two separate cameras, where we assume that the cameras are located on the same UAV system. Fig. 1 summarizes such an image alignment process, where the input consists of a higher resolution RGB image and a lower resolution IR image (visualized in pseudocolors in the figure). The output of the registration algorithm is the registered (aligned) IR image in the RGB image’s coordinate system. As a perspective transformation [22] is typically sufficient for UAV setups containing nearby onboard cameras, our registration process uses a registration function based on the homography (H) matrix. H contains 8 unknown (projection) parameters, and the goal of the registration process is to predict those 8 unknown parameters, directly or indirectly.
In the relevant literature, registering RGB and IR image pairs is done both by using classical techniques (such as the Scale-Invariant Feature Transform, SIFT, [35] along with the Random Sample Consensus, RANSAC, [16] algorithm as in [3]) and by using more recent deep learning based techniques as in [36, 9, 54]. Classical techniques include feature-based [42, 52] and intensity-based [41] methods. Feature-based [42, 52] methods essentially find correspondences between salient features detected in the images [49]. Salient features are computed in each image by using approaches such as SIFT [34], Speeded-Up Robust Features (SURF) [4], the Harris corner detector [21] and the Shi-Tomasi corner detector [26]. The features from both images are then matched to find correspondences, as in [48, 44, 43], and to compute the transformation parameters in the form of a homography matrix. In the literature, the RANSAC [48] algorithm is commonly used to compute the homography matrix that minimizes the total number of outliers. Intensity-based [41] methods compare intensity patterns in images via similarity metrics. By estimating the movement of each pixel, optical flow is computed and used to represent the overall motion parameters. The methods in [15, 2] use LK-based algorithms that take initial parameters and iteratively estimate small changes in those parameters to minimize the error. A typical intensity-based registration technique essentially uses a form of similarity as its metric or registration criterion, including Mean Squared Error (MSE) [19], cross-correlation [30], the Structural Similarity Index (SSIM) and the Peak Signal-to-Noise Ratio (PSNR) [53]. Such metrics are not sufficient when the source and target images are acquired with different modalities, which can yield poor performance when such intensity-based methods are used.
Overall, such classical approaches are typically based on finding and matching similar salient keypoints in image pairs and, therefore, they can yield unsatisfactory results in various multi-modal registration applications.
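For reference, the classical feature-based pipeline discussed above can be summarized with the following minimal sketch, which uses OpenCV's SIFT implementation together with RANSAC-based homography estimation. This is the baseline pipeline, not the method proposed in this paper, and the ratio-test and RANSAC thresholds are illustrative assumptions:

```python
# Minimal sketch of the classical feature-based registration pipeline
# (SIFT keypoints + RANSAC homography) using OpenCV. Thresholds are
# illustrative; this is not the approach proposed in the paper.
import cv2
import numpy as np

def sift_ransac_register(src_gray, dst_gray, ratio=0.75, ransac_thresh=3.0):
    """Estimate a homography mapping src_gray onto dst_gray."""
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(src_gray, None)
    kp2, des2 = sift.detectAndCompute(dst_gray, None)
    if des1 is None or des2 is None:
        return None  # too few salient features (common across modalities)

    # Match descriptors and keep matches passing Lowe's ratio test.
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    knn = matcher.knnMatch(des1, des2, k=2)
    good = [m for m, n in (p for p in knn if len(p) == 2)
            if m.distance < ratio * n.distance]
    if len(good) < 4:
        return None  # at least 4 correspondences are needed to estimate H

    pts1 = np.float32([kp1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    pts2 = np.float32([kp2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(pts1, pts2, cv2.RANSAC, ransac_thresh)
    return H
```

As discussed above, when the two modalities produce few repeatable keypoints, this pipeline can fail to return a usable H at all.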
Relevant deep alignment approaches are using a form of keypoint matching, template matching or Lukas-Kanade (LK) based approaches as in [29, 9]. Those techniques typically consider multiple points or important regions in images to compute the homography matrix H which contains the transformation parameters. However, having the information of four matching points represented by their corresponding 2D coordinates , where is sufficient to estimate H. Therefore, if found accurately, four matching image-corner points between the IR and RGB images would be enough to perform accurate registration between the IR and RGB images. While many techniques based on keypoint extraction can be employed to find matching keypoints between the images, we argue that the corner points on the borders of one image can also be considered as keypoints, and by using those corners of the image, we do not need to utilize any keypoint extraction step.
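Since four point correspondences fully determine the eight unknowns of H, a prediction of the four image corners is sufficient for registration. The short sketch below illustrates this fact with OpenCV's getPerspectiveTransform; the predicted corner values are hypothetical numbers used only for illustration:

```python
# Sketch: four matching corner points fully determine the homography.
# cv2.getPerspectiveTransform solves the exact 8-parameter system from
# the 4 source corners and their (hypothetical) predicted targets.
import cv2
import numpy as np

h, w = 128, 128                               # source (IR) patch size
src_corners = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
pred_corners = np.float32([[21.5, 30.2],      # illustrative predictions of
                           [160.7, 25.9],     # where the IR corners land on
                           [171.3, 158.4],    # the RGB image plane
                           [18.9, 166.0]])
H = cv2.getPerspectiveTransform(src_corners, pred_corners)
print(H)  # 3x3 matrix with H[2, 2] == 1
```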
In this paper, we propose a novel deep approach for registering IR and RGB image pairs where, instead of predicting the homography matrix, we directly predict the locations of the four corner points of the entire image. This approach removes the additional iterative steps introduced by LK-based algorithms and eliminates the steps of computing and matching keypoints. Our main contributions can be listed as follows: (i) we introduce a novel deep approach for the problem of aligning IR images onto RGB images taken by UAVs, where the resolutions of the input images differ from each other; (ii) we introduce a novel two-branch deep solution for registration without relying on Lucas-Kanade based iterative methods; (iii) instead of predicting the homography matrix directly, we predict the corresponding coordinates of the four corner points of the smaller image on the larger image; (iv) we study and report the performance of our approach on multiple aerial datasets and present state-of-the-art results.
II Related Work
Many recent techniques performing image alignment rely on deep learning. Convolutional Neural Networks (CNNs) form a pipeline of convolutional layers where filters learn unique features at distinct levels of the network. For example, the authors of [14] proposed a Deep Image Homography Estimation Network (DHN) that uses CNNs to learn meaningful features in both images and directly predicts the eight transformation parameters of the homography matrix. Later, the authors of [28] proposed using a series of networks to regress the homography parameters in their approach. The later networks in their proposed architecture aim to gradually improve the performance of the earlier networks. Their method builds on top of DHN [14]. Another work [9] proposed incorporating the LK algorithm into the deep learning pipeline.
The authors of [54] used a CNN-based network and introduced a learning-based Lucas-Kanade block. In their work, they designed modality-specific pipelines for the source and template images, respectively. At the end of each block, there is a unique feature construction function. Instead of using the output feature maps directly, they constructed features based on the eigenvalues and eigenvectors of the output feature maps. The features constructed from the source and template network channels have a similar learned representation. Transformation parameters found at a lower scale are given as input to the next level, and the LK algorithm iterates until a certain threshold is reached. In another work, the authors of [13] utilized disentangled convolutional sparse coding to separate domain-specific and shared features of multi-modal images for improved registration accuracy. Multi-scale Generative Adversarial Networks (GANs) have also been used to estimate homography parameters, as in [36].

The architectural comparisons of the above-mentioned networks are provided in Fig. 2. In DHN [14], the image to be transformed (denoted as IIR in the figure) is padded to have the same dimensions as the target image (IRGB), and the two are concatenated channel-wise. The concatenated images are given to the Deep Homography Network (DHN) for the direct regression of the 8 values of the homography matrix. On the other hand, Multi-scale Homography Estimation (MHN) [28] adopts a series of networks (Neti). The inputs for Net2 are a concatenation of IIR and IRGB. For the succeeding levels, the warping function first performs the projective inverse warping operation on the infrared image (IIR) via the homography matrix predicted at the previous level. The resulting image is first concatenated with IRGB and then given as input to Neti. For the following levels, the current matrix and the previously predicted matrices are multiplied to form the final prediction. This way, MHN aims to learn to correct mistakes made in the earlier levels. The Cascaded Lucas-Kanade Network (CLKN) [9] uses separate networks for each modality. It uses levels of different scales in the form of feature pyramid networks and performs registration from the smallest scale to the largest. The homography matrix from the earlier LK-layer is given as input to the next. Deep Lucas-Kanade Feature Maps (DLKFM) [54] also performs coarse-to-fine registration, as shown in Fig. 2. It uses a special feature construction block (fcb). The fcb block takes in the feature maps and transforms them into new features based on the eigenvectors and the covariance matrix. The constructed features capture principal information, and the registration is performed on the constructed feature maps; thus, it aims to increase the accuracy of the LK-layer. Our approach uses separate feature embedding blocks to process each modality separately. It is trained to extract modality-specific features so that the output feature maps of different modalities can have similar feature representations.
III Proposed Approach: VisIRNet
In our proposed approach, we aim at performing accurate single- and multi-modal image registration that is free of the iterative nature of LK-based algorithms. We name our network VisIRNet; with it, we aim to predict the locations of the corners of the input image on the target image directly, since having four matching points is sufficient to compute the homography parameters. In our proposed architecture, we assume that there are two input images with different resolutions. The overview of our architecture is given in Fig. 3. Our approach first processes the two inputs separately by passing them through their respective feature embedding blocks and extracts representative features. Those features are then combined and given to the regression block as input. The goal of the regression block is to compute the transformation parameters accurately. The output of the regression block is eight dimensional (representing either the eight homography parameters or the coordinates of the four corner points of the source image on the target image).
III-A Preliminaries
Perspective Transformation: Here, by perspective transformation, we mean a linear transformation in the homogeneous coordinate system which, in some sense, warps the source image onto the target image. The homography matrix consists of the transformation parameters needed for the perspective transformation. The elements of the 3x3 homography matrix represent the amount of rotation, translation, scaling and skewing. The homography matrix H is defined as follows:

$$H = \begin{bmatrix} h_1 & h_2 & h_3 \\ h_4 & h_5 & h_6 \\ h_7 & h_8 & h_9 \end{bmatrix} \quad (1)$$

where the last element ($h_9$) is set to 1 to ensure the validity of the conversion from homogeneous to Cartesian coordinates. The warping function maps a set of coordinates to another coordinate system via H. Let $p = (x, y)$ be the location of a point in the coordinate set of the source image, and let $W(p; H)$ be the warping function that warps the given coordinate with the parameter set H onto the target image:

$$p' = W(p; H) \quad (2)$$

The warping process is a linear transformation in the homogeneous coordinate system. Therefore, the Cartesian coordinates are first transformed into the homogeneous coordinate system by adding an extra dimension to the 2D Cartesian pixel coordinates. Let $p$ be the pixel with coordinates $(x, y)$. The homogeneous coordinates of $p$ can be represented by setting the additional coordinate to 1, i.e., $\tilde{p} = (x, y, 1)$. Once we have the homography matrix, we warp any given pixel location represented by $\tilde{p}$ to its warped version in the other image’s Cartesian coordinates as follows:

$$\begin{bmatrix} \tilde{x}' \\ \tilde{y}' \\ \tilde{z}' \end{bmatrix} = H \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} \quad (3)$$

where $(\tilde{x}', \tilde{y}', \tilde{z}')$ are the warped homogeneous coordinates of $p$, which can be converted to Cartesian coordinates by simply dividing by the $\tilde{z}'$ value. Therefore, we can obtain the final warped 2D pixel coordinates $p' = (x', y')$ in Cartesian coordinates as follows:

$$x' = \tilde{x}' / \tilde{z}' \quad (4)$$

$$y' = \tilde{y}' / \tilde{z}' \quad (5)$$
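The warping in Eqs. (2)-(5) can be written in a few lines of code. The following NumPy sketch applies H to a single 2D pixel coordinate; it simply mirrors the equations above and is not tied to any particular implementation detail of the paper:

```python
# Sketch of the warping function W(p; H) in Eqs. (2)-(5): lift the 2D pixel
# coordinate to homogeneous coordinates, apply H, and divide by the third
# component to return to Cartesian coordinates.
import numpy as np

def warp_point(p, H):
    x, y = p
    xh, yh, zh = H @ np.array([x, y, 1.0])   # Eq. (3)
    return xh / zh, yh / zh                  # Eqs. (4) and (5)
```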
III-B Network structure
Our proposed network is composed of a multi-modal feature embedding block (MMFEB) and a regression block (see Fig. 3). The regression block is responsible for predicting the 8 homography matrix parameters directly or indirectly. In this paper, we study the performance of two variants of our proposed model, which we call ModelA and ModelB. ModelA predicts the coordinates of the corner points, while its variant, ModelB, predicts the homography parameters directly. In ModelA, the 4 corners are enough to find the homography matrix. Therefore, the last layer has 8 neurons: for the four corner coordinates in ModelA, or for the eight unknown homography parameters in ModelB.

Multi-modal Feature Embedding Backbone: The MMFEB is responsible for producing a combined representative feature set formed of fine-level features for both input images. The network then uses that combined representative feature set to transform the source image onto the target image. We adopt the idea of giving the RGB and infrared modalities separate branches, as in [54]. We use two identical networks (branches) with the same structure but with different parameters for the RGB and infrared images, respectively. Therefore, the multi-modal feature embedding block has two parallel branches with identical architectures (however, they do not share parameters), namely the RGB branch and the infrared branch. We first train the multi-modal feature embedding backbone by using the average similarity loss (see Eq. 6). To compute the similarity loss, we first generate a recti-linear grid representing locations in the infrared coordinate system, as in spatial transformers [25]. Then, we use the ground truth homography matrix to warp the grid onto the RGB coordinate system, resulting in a warped curvilinear grid representing the projected locations. We use bi-linear interpolation [40, 27] to sample those warped locations on the RGB feature maps. After that, we can compute the similarity loss between the IR feature maps and the re-sampled RGB feature maps. Algorithm 1 provides the algorithmic steps for training the feature embedding block with this loss.
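As an illustration of this computation, the following NumPy sketch generates the recti-linear grid, warps it with the ground-truth homography, bilinearly re-samples the RGB feature maps at the warped locations and averages the per-location difference. It assumes single-channel feature maps and a mean-squared difference as the similarity measure; the actual tensor layout and measure used in Algorithm 1 may differ:

```python
# Minimal NumPy sketch of the MMFEB similarity computation (grid warping,
# bilinear re-sampling, averaged difference). Single-channel feature maps
# and a mean-squared difference are assumptions for illustration.
import numpy as np

def bilinear_sample(fmap, xs, ys):
    """Sample fmap (H x W) at float locations (xs, ys) with bilinear weights."""
    H, W = fmap.shape
    x0 = np.clip(np.floor(xs).astype(int), 0, W - 2)
    y0 = np.clip(np.floor(ys).astype(int), 0, H - 2)
    dx, dy = xs - x0, ys - y0
    top = fmap[y0, x0] * (1 - dx) + fmap[y0, x0 + 1] * dx
    bot = fmap[y0 + 1, x0] * (1 - dx) + fmap[y0 + 1, x0 + 1] * dx
    return top * (1 - dy) + bot * dy

def similarity_loss(f_ir, f_rgb, H_gt):
    """f_ir: IR feature map (h x w); f_rgb: RGB feature map; H_gt: 3x3."""
    h, w = f_ir.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(float)              # recti-linear grid
    ones = np.ones_like(xs)
    warped = H_gt @ np.stack([xs.ravel(), ys.ravel(), ones.ravel()])
    wx, wy = warped[0] / warped[2], warped[1] / warped[2]   # projected grid
    resampled = bilinear_sample(f_rgb, wx, wy).reshape(h, w)
    return np.mean((f_ir - resampled) ** 2)                 # averaged difference
```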
The MMFEB is trained by using the similarity loss (see Eq. 6), which is detailed in Algorithm 3. The steps for training the MMFEB are given in Algorithm 1. The regression block is trained with the homography loss in combination with the average corner error (see the subsection ”Average Corner Error (Ace)” below for its definition), yielding the total loss used to train our model. Table I summarises the structure of the MMFEB.
Layer | Filter Number | Filter-dims | Stride | Padding | Repetition |
Conv2D | 64 | 3x3 | 1 | SAME | x1 |
BN | - | - | - | - | |
Conv2D | 64 | 3x3 | 1 | SAME | |
BN | - | - | - | - | |
Relu | - | - | - | - | x3 |
Conv2D | 64 | 3x3 | 1 | SAME | |
BN | - | - | - | - | |
Relu | - | - | - | - | |
Conv2D | 64 | 3x3 | 1 | SAME | |
BN | - | - | - | - | x1 |
Conv2D | 64 | 3x3 | 1 | SAME |
Regression block: The second main stage of our pipeline is the regression block, which is responsible for making the final prediction. The prediction is the four corner locations for ModelA, or the unknown parameters of the homography matrix for ModelB. The RGB and infrared feature maps are extracted by passing the RGB image and the infrared image through their respective branches of the feature embedding block. Note that the two sets of feature maps have different dimensions. Therefore, we apply zero-padding to the lower dimensional (infrared) feature maps to bring their dimensions to those of the RGB feature maps. We then concatenate the padded infrared feature maps with the RGB feature maps channel-wise and use the result as the input for the regression block.
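A small sketch of this padding and concatenation step is given below; the spatial sizes and channel counts are illustrative assumptions rather than the exact values used in our implementation:

```python
# Sketch of the padding and concatenation step: the infrared feature maps
# (128x128) are zero-padded to the RGB spatial size (192x192) and the two
# tensors are concatenated channel-wise. Shapes are illustrative assumptions.
import tensorflow as tf

f_rgb = tf.random.normal([1, 192, 192, 64])   # RGB-branch feature maps
f_ir = tf.random.normal([1, 128, 128, 64])    # infrared-branch feature maps

pad_h = 192 - 128
pad_w = 192 - 128
f_ir_padded = tf.pad(f_ir, [[0, 0], [0, pad_h], [0, pad_w], [0, 0]])
regression_input = tf.concat([f_rgb, f_ir_padded], axis=-1)
print(regression_input.shape)                 # (1, 192, 192, 128)
```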
The architecture of the regression block is further divided into two sub-parts, as shown in Fig. 3. The first part is composed of 6 levels. Apart from the last level, each level is composed of 2 sub-levels followed by a max-pooling layer. Each sub-level is a convolution layer followed by a batch normalization layer and a ReLU activation function. The sub-levels m and n of a level l are identical in terms of the filters, kernel size, stride and padding used for that level. The last level does not have a max-pooling layer. The second part has two 1024-unit dense layers followed by a dropout layer and an 8-unit dense output layer for the 8 parameters of the homography matrix or the corner coordinates. The feature maps from the first part are flattened and given to the second part, where the homography matrix parameters or the corner coordinates are predicted, according to the model used. Table II gives detailed information about the first and second parts of the regression head, and a code sketch of this block follows Table II.
Level | Number of filters / Units | Filter-dims | Stride | Padding | Activation |
L1 | 32 | 3x3 | 1 | SAME | |
max-pool | - | 2x2 | 2 | SAME | |
L2 | 64 | 3x3 | 1 | SAME | |
max-pool | - | 2x2 | 2 | SAME | |
L3 | 64 | 3x3 | 1 | SAME | |
max-pool | - | 2x2 | 2 | SAME | |
L4 | 128 | 3x3 | 1 | SAME | |
max-pool | - | 2x2 | 2 | SAME | |
L5 | 128 | 3x3 | 1 | SAME | |
max-pool | - | 2x2 | 2 | SAME | |
L6 | 256 | 3x3 | 1 | SAME | |
Flatten | |||||
Dense | 1024 | - | - | Relu | |
Dense | 1024 | - | - | Linear | |
Dropout | 20% | - | - | - | |
Dense | 8 | - | - | Linear |
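The following Keras sketch mirrors the regression block in Table II: six levels of two identical Conv-BN-ReLU sub-levels with max-pooling after all but the last level, followed by the dense head. The input shape (the channel count of the concatenated feature maps) and the exact dropout placement are assumptions rather than reported implementation details:

```python
# A compact Keras sketch of the regression block in Table II. The input
# channel count (128) and the dropout placement are assumptions.
import tensorflow as tf
from tensorflow.keras import layers, models

def conv_bn_relu(x, filters):
    x = layers.Conv2D(filters, 3, strides=1, padding="same")(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU()(x)

def build_regression_block(input_shape=(192, 192, 128)):
    inputs = layers.Input(shape=input_shape)        # concatenated feature maps
    x = inputs
    for i, filters in enumerate([32, 64, 64, 128, 128, 256]):   # levels L1..L6
        x = conv_bn_relu(x, filters)                # sub-level m
        x = conv_bn_relu(x, filters)                # sub-level n
        if i < 5:                                   # no pooling after L6
            x = layers.MaxPooling2D(2, strides=2, padding="same")(x)
    x = layers.Flatten()(x)
    x = layers.Dense(1024, activation="relu")(x)
    x = layers.Dense(1024)(x)                       # linear, as in Table II
    x = layers.Dropout(0.2)(x)
    outputs = layers.Dense(8)(x)                    # 4 corners or 8 H params
    return models.Model(inputs, outputs)
```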
III-C Loss
While the MMFEB uses the similarity loss, we use two loss terms, based on the corner error and on the homography, for the regression head.
Similarity loss: The similarity loss is used to train MMFEB and is defined as follows:
$$L_{sim} = \frac{1}{h_{ir}\, w_{ir}} \sum_{i=1}^{h_{ir}} \sum_{j=1}^{w_{ir}} \big( F_{ir}(i,j) - \tilde{F}_{rgb}(i,j) \big)^2 \quad (6)$$
where $F_{ir}(i,j)$ is the value at location $(i,j)$ of the infrared feature maps of size $h_{ir} \times w_{ir}$, and $\tilde{F}_{rgb}(i,j)$ is the value at location $(i,j)$ on the re-sampled RGB feature maps. Note that $(i,j)$ is a location in the coordinate system constrained by the infrared image height and width. The algorithmic details of the similarity loss are provided in Algorithm 3.
Homography loss term: ModelB is trained to predict the values of the elements of the homography matrix. Therefore, its output is the 8 unknown elements of the 3x3 matrix (where the ninth element is set to 1). The homography-based loss term $L_h$ is defined as follows: let $h = [h_1, h_2, \dots, h_8, 1]$ be the elements of the vectorized ground truth homography matrix H. Similarly, let $\hat{h} = [\hat{h}_1, \hat{h}_2, \dots, \hat{h}_8, 1]$ be the elements of $\hat{H}$, the predicted homography matrix. Then $L_h = \sum_{i=1}^{8} |h_i - \hat{h}_i|^p$, where $L_h$ represents the homography loss based on the $L_p$ distance (we use the $L_1$ and $L_2$ distances in our experiments).
Average Corner Error (Ace): Ace is computed as the average of the squared differences between the predicted and ground truth locations of the corner points. For ModelB, we use the predicted homography matrix to transform the 4 corners of the infrared image onto the coordinate system of the RGB image and, together with the ground truth locations, we compute Ace. Let $c_i$ be a corner at the $(x_i, y_i)$ coordinates on the infrared image and let $c'_i$ be its warped equivalent in the RGB coordinate space, such that $c'_i = W(c_i; h)$, where W is the warping function.

$$Ace = \frac{1}{4} \sum_{i=1}^{4} D(c_i) \quad (7)$$

where D is defined as $D(c_i) = \lVert W(c_i; h) - W(c_i; \hat{h}) \rVert_2^2$, and where $h$ and $\hat{h}$ are the ground truth and predicted vectorized homography matrices, respectively. The total loss for ModelB is then computed as $L_{ModelB} = L_h + \lambda \, Ace$, where $\lambda$ is a weight factor (a hyperparameter).
SkyData | VEDAI | Average | |||
18.6 | 21.6 | 19.1 | 128.3 | 36.85 | |
18.5 | 35.8 | 18.5 | 178.1 | 62.7 | |
18.5 | 20.1 | 19.1 | 134.2 | 47.9 |
In ModelA, we predict the x and y locations of the 4 corner points instead of computing the homography matrix. This makes it possible for the network to learn to predict exact locations (landmarks) instead of focusing on one solution. As shown in our experiments (see Fig. 5 for qualitative and Fig. 7 for quantitative results), ModelA converges faster and yields better results, while minimizing outliers. We use a slightly modified version of Ace for ModelA, such that $W(c_i; h)$ becomes the ground truth corner coordinate in the RGB coordinate space and the predicted corner $\hat{c}_i$ is taken directly from the network output. For ModelA, Ace is defined as follows:

$$Ace = \frac{1}{4} \sum_{i=1}^{4} \lVert W(c_i; h) - \hat{c}_i \rVert_2^2 \quad (8)$$
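To make the loss terms concrete, the following NumPy sketch computes Ace and the ModelB total loss following Eqs. (7), (8) and the definitions above; the weight $\lambda$ and the distance exponent are hyperparameter assumptions, not values reported here:

```python
# Hedged sketch of the loss terms: Ace averages the squared distances between
# ground-truth and predicted corner locations, and the homography loss is an
# Lp distance between the vectorized matrices. lam and p are assumptions.
import numpy as np

def warp_points(points, H):                  # points: (N, 2)
    homog = np.hstack([points, np.ones((len(points), 1))])
    warped = (H @ homog.T).T
    return warped[:, :2] / warped[:, 2:3]

def ace(gt_corners, pred_corners):
    # average squared corner error over the four corners, Eqs. (7)/(8)
    return np.mean(np.sum((gt_corners - pred_corners) ** 2, axis=1))

def model_b_total_loss(H_gt, H_pred, ir_corners, lam=1.0, p=2):
    l_h = np.sum(np.abs(H_gt.ravel()[:8] - H_pred.ravel()[:8]) ** p)
    corners_gt = warp_points(ir_corners, H_gt)
    corners_pred = warp_points(ir_corners, H_pred)
    return l_h + lam * ace(corners_gt, corners_pred)

# ModelA predicts the corner locations directly, so its loss is simply
# ace(ground_truth_corners, predicted_corners).
```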
In addition to this similarity loss, we also used two additional loss functions in the MMFEB block during our ablation study: an L1-based similarity loss and an SSIM-based similarity loss. They are briefly defined below.

$$L_{1}^{sim} = \frac{1}{h_{ir}\, w_{ir}} \sum_{i=1}^{h_{ir}} \sum_{j=1}^{w_{ir}} \big| F_{ir}(i,j) - \tilde{F}_{rgb}(i,j) \big| \quad (9)$$

$$L_{ssim} = 1 - SSIM\big( F_{ir}, \tilde{F}_{rgb} \big) \quad (10)$$

where SSIM is used as defined in [45].
IV Experiments
In this section, we describe our experimental procedures, the datasets used and our evaluation metrics. We begin by describing the datasets.
Datasets: In our experiments, we use the SkyData (www.skydatachallenge.com) dataset containing RGB and IR image pairs, MSCOCO [33], Google Maps and Google Earth (as taken from DLKFM [54]), and the VEDAI [50] dataset. Please refer to Table IV for more details about the datasets used in our experiments. SkyData is originally a video-based dataset which provides each frame of the videos in image format.
Generating the training and test sets: To train the algorithms, we need unregistered and registered (ground truth) image pairs. For SkyData, we randomly select frame pairs from each video sequence.

For each dataset that we use, we generate the training and test sets as follows: (i) Select a registered image pair at a higher resolution. (ii) Sample (crop) regions around the center of the image to get smaller patches of 192x192 pixels. This process is done in parallel for the visible and infrared images. (iii) If the extracted patches are not sufficiently aligned, manually align them. (iv) For each pair, select a subset of the IR image by randomly selecting 4 distinct locations on the image. (v) Find the perspective transformation parameters that map those randomly chosen points to the following fixed locations: (0,0), (w,0), (w,h), (0,h), so that they correspond to the corners of the unregistered IR image patch, where we assume that the unregistered patch is w x h dimensional (in our experiments, both w and h are set to 128). This process creates an unregistered infrared patch (from the already registered ground truth) that needs to be placed back in its true position. (vi) Use those 4 initially selected points as the ground truth corners for the registered image. (vii) Repeat this process to create multiple different image pairs. The newly created dataset is then split into training and test sets. (viii) The RGB images are used as the target set and the transformed infrared patches are used as the source set (for both training and testing). This process is done on randomly selected registered pairs for each dataset. Fig. 4 also illustrates this process on a pair of RGB and IR images. The list of all the datasets used and their details are summarized in Table IV.
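The following sketch illustrates steps (iv)-(vi) above for a single registered pair, using OpenCV to compute the perspective transformation from the 4 randomly chosen points to the fixed patch corners. The way the 4 points are sampled here (a perturbed quadrilateral around the image center) and the perturbation range are illustrative assumptions:

```python
# Sketch of steps (iv)-(vi): pick 4 random points on the registered 192x192
# IR image, map them to the fixed corners of a 128x128 patch, and warp to
# create the unregistered IR patch. Sampling strategy is an assumption.
import cv2
import numpy as np

def make_unregistered_pair(ir_registered, patch=128, margin=32, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    H_img, W_img = ir_registered.shape[:2]            # e.g. 192 x 192
    # (iv) randomly select 4 distinct locations on the registered IR image
    cx, cy = W_img / 2, H_img / 2
    base = np.float32([[cx - margin, cy - margin], [cx + margin, cy - margin],
                       [cx + margin, cy + margin], [cx - margin, cy + margin]])
    jitter = rng.uniform(-margin / 2, margin / 2, size=(4, 2)).astype(np.float32)
    gt_corners = base + jitter
    # (v) map those points to the fixed corners of the unregistered patch
    dst = np.float32([[0, 0], [patch, 0], [patch, patch], [0, patch]])
    H = cv2.getPerspectiveTransform(gt_corners, dst)
    ir_unregistered = cv2.warpPerspective(ir_registered, H, (patch, patch))
    # (vi) gt_corners are the ground-truth corner locations for training
    return ir_unregistered, gt_corners, H
```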
Dataset | Modality | Training set | Test Set |
SkyData | RGB + Infrared | 27700 | 7990 |
MSCOCO [33] | Single modality (RGB) | 82600 | 6400 |
Google Maps | RGB + map (vector) | 8800 | 888 |
Google Earth | RGB + RGB | 8750 | 850 |
VEDAI [50] | RGB + Infrared | 8722 | 3738 |
Model | BatchS | loss | mean | std | min | 25% | 50% | 75% | max |
ModelB | 8 | L1 | 0.6 | 0.44 | 0.03 | 0.32 | 0.49 | 0.76 | 8.38 |
ModelB | 8 | L2 | 1.84 | 4.49 | 0.0 | 0.37 | 0.89 | 2.03 | 219.33 |
ModelA | 8 | Ace | 2.49 | 4.15 | 0.0 | 0.48 | 1.23 | 2.85 | 63.65 |
ModelB | 16 | L1 | 0.6 | 0.4 | 0.03 | 0.34 | 0.51 | 0.76 | 6.08 |
ModelB | 16 | L2 | 1.94 | 4.05 | 0.0 | 0.3 | 0.8 | 2.02 | 92.81 |
ModelA | 16 | Ace | 2.14 | 4.1 | 0.0 | 0.4 | 1.04 | 2.42 | 176.51 |
ModelB | 32 | L1 | 0.7 | 0.46 | 0.02 | 0.4 | 0.61 | 0.9 | 8.17 |
ModelB | 32 | L2 | 2.79 | 5.3 | 0.0 | 0.5 | 1.31 | 3.08 | 149.63 |
ModelA | 32 | Ace | 2.98 | 4.71 | 0.0 | 0.57 | 1.51 | 3.5 | 81.88 |
ModelB | 64 | L1 | 0.73 | 0.49 | 0.03 | 0.39 | 0.61 | 0.93 | 7.03 |
ModelB | 64 | L2 | 3.62 | 5.94 | 0.0 | 0.71 | 1.82 | 4.09 | 107.96 |
ModelA | 64 | Ace | 3.65 | 5.17 | 0.0 | 0.91 | 2.22 | 4.57 | 157.6 |
Model | BatchS | Loss | Mean | Std | Min | 25% | 50% | 75% | Max |
ModelB | 8 | L1 | 21.38 | 44.01 | 2.53 | 8.65 | 12.57 | 18.27 | 235.99 |
ModelB | 8 | L2 | 25.15 | 119.95 | 2.30 | 11.83 | 13.11 | 15.53 | 893.01 |
ModelA | 8 | Ace | 3.96 | 1.64 | 0.93 | 2.84 | 3.68 | 4.72 | 19.59 |
ModelB | 16 | L1 | 25.83 | 108.53 | 1.66 | 8.02 | 10.97 | 16.27 | 743.54 |
ModelB | 16 | L2 | 31.27 | 143.28 | 2.29 | 10.52 | 13.65 | 18.47 | 1180.64 |
ModelA | 16 | Ace | 3.83 | 1.77 | 0.78 | 2.64 | 3.46 | 4.61 | 19.19 |
ModelB | 32 | L1 | 25.17 | 101.33 | 1.87 | 6.08 | 8.80 | 16.97 | 803.25 |
ModelB | 32 | L2 | 6.50 | 4.11 | 2.05 | 5.43 | 6.40 | 7.35 | 15.07 |
ModelA | 32 | Ace | 5.32 | 1.87 | 1.35 | 4.04 | 5.02 | 6.22 | 22.21 |
ModelB | 64 | L1 | 30.60 | 177.66 | 3.41 | 11.07 | 11.65 | 12.41 | 1298.48 |
ModelB | 64 | L2 | 31.86 | 135.19 | 1.74 | 8.10 | 13.46 | 19.23 | 981.56 |
ModelA | 64 | Ace | 5.01 | 2.01 | 0.95 | 3.62 | 4.65 | 6.0 | 18.94 |








mean | std | min | 25% | 50% | 75% | max | |
MHN [28] | 14.5 | 67.29 | 1.19 | 8.11 | 11.36 | 15.4 | 4296.21 |
DHN [14] | 77.96 | 854.59 | 4.1 | 48.34 | 65.95 | 82.19 | 76119.41 |
DLKFM [54] | 93.4 | 2894.7 | 7.32 | 32.08 | 40.73 | 51.95 | 258091.03 |
Ours | 3.83 | 1.77 | 0.78 | 2.64 | 3.46 | 4.61 | 19.19 |
CLKN [9] | 77.31 | 862.85 | 5.47 | 38.66 | 47.72 | 57.74 | 73661.8 |
SIFT [35] | 43477.63 | 201670.37 | 2.13 | 232.88 | 1275.96 | 1285.6 | 100000.0 |
mean | std | min | 25% | 50% | 75% | max | |
DLKFM [54] | 382.4 | 3363.72 | 10.93 | 101.43 | 120.09 | 187.6 | 189375.0 |
Ours | 24.76 | 4.77 | 4.09 | 21.59 | 24.88 | 28.03 | 42.77 |
MHN [28] | 374.53 | 5519.56 | 8.74 | 56.4 | 103.68 | 141.99 | 319559.19 |
SIFT [35] | 40221.01 | 195588.26 | 0.11 | 2.07 | 88.96 | 1264.86 | 100000.0 |
DHN [14] | 163.87 | 427.16 | 41.59 | 128.8 | 137.43 | 146.43 | 19138.92 |
CLKN [9] | 99.49 | 709.1 | 2.45 | 37.97 | 52.29 | 75.13 | 30289.16 |
mean | std | min | 25% | 50% | 75% | max | |
DHN [14] | 1073.9 | 25214.41 | 8.85 | 52.1 | 67.17 | 83.78 | 733889.5 |
Ours | 10.91 | 3.61 | 4.19 | 8.25 | 10.5 | 12.9 | 31.13 |
SIFT [35] | 1334.41 | 34297.53 | 0.51 | 2.63 | 8.45 | 108.25 | 100000.0 |
DLKFM [54] | 27.65 | 121.65 | 0.36 | 7.41 | 18.5 | 28.96 | 2733.34 |
MHN [28] | 118.86 | 257.47 | 12.62 | 45.06 | 60.91 | 106.44 | 4552.5 |
CLKN [9] | 14.58 | 36.57 | 0.38 | 3.15 | 6.51 | 14.51 | 730.46 |
mean | std | min | 25% | 50% | 75% | max | |
MHN [28] | 319.68 | 1003.02 | 9.48 | 40.68 | 73.54 | 208.8 | 12910.34 |
SIFT [35] | 178912.9 | 382210.28 | 48.79 | 1273.04 | 1281.86 | 1292.6 | 100000.0 |
CLKN [9] | 123.96 | 453.65 | 13.22 | 56.43 | 66.9 | 80.77 | 8496.15 |
DHN [14] | 131.27 | 410.38 | 12.28 | 30.27 | 43.08 | 115.07 | 9866.02 |
Ours | 9.57 | 4.15 | 2.6 | 6.82 | 8.68 | 11.28 | 33.67 |
DLKFM [54] | 77.78 | 183.6 | 0.47 | 20.27 | 38.76 | 66.28 | 3251.74 |
mean | std | min | 25% | 50% | 75% | max | |
DLKFM [54] | 67.16 | 2515.37 | 0.06 | 0.44 | 8.31 | 31.12 | 200374.44 |
Ours | 3.67 | 2.45 | 0.64 | 2.28 | 2.99 | 4.13 | 27.14 |
CLKN [9] | 6.45 | 8.96 | 0.1 | 1.68 | 3.86 | 8.01 | 280.96 |
DHN [14] | 622.38 | 7493.6 | 3.0 | 141.93 | 194.88 | 383.56 | 580642.19 |
SIFT [35] | 3308.58 | 57236.2 | 0.07 | 0.37 | 0.65 | 1.38 | 100000.0 |
MHN [28] | 15.5 | 7.31 | 1.84 | 10.17 | 14.39 | 19.32 | 90.29 |
Evaluation metrics: As shown in Table VI, we quantitatively evaluate the performance of our models using Ace and the homography error. We compute each algorithm’s result distribution in terms of quantiles, mean, standard deviation and min-max values for a given test set. Quartiles are a set of descriptive statistics which summarize the central tendency and variability of data [51]. Quartiles are a specific type of quantiles that divide the data into four equal parts. The three quartiles are denoted as Q1, Q2 (also known as the median) and Q3. The 25% (Q1), 50% (Q2) and 75% (Q3) values indicate that the corresponding percentage of the data falls below them (the bottom-right illustration in Fig. 6 also illustrates these terms). To find the quartiles, we first sort the elements of the data being analysed in ascending order. The first quartile is then the element at position (dataset size)x(1/4) in the sorted data; likewise, the second quartile is the element at position (dataset size)x(2/4), and the third quartile is the element at position (dataset size)x(3/4). Samples that fall outside of [Q1 - 1.5 IQR, Q3 + 1.5 IQR], where IQR is the inter-quartile range, are considered outliers. The box plot in Fig. 6 illustrates the above-mentioned description visually.
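For clarity, the statistics reported in the tables and the outlier rule used in the box plots can be computed as in the following sketch (per-pair Ace values as input):

```python
# Sketch of the descriptive statistics reported in the tables: quartiles of
# the per-pair Ace values and the standard 1.5*IQR outlier rule used by the
# box plots.
import numpy as np

def ace_statistics(ace_values):
    ace_values = np.asarray(ace_values, dtype=float)
    q1, q2, q3 = np.percentile(ace_values, [25, 50, 75])
    iqr = q3 - q1
    outliers = ace_values[(ace_values < q1 - 1.5 * iqr) |
                          (ace_values > q3 + 1.5 * iqr)]
    return {"mean": ace_values.mean(), "std": ace_values.std(),
            "min": ace_values.min(), "25%": q1, "50%": q2, "75%": q3,
            "max": ace_values.max(), "num_outliers": len(outliers)}
```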
Table III shows an ablation study on using different loss functions in each block of our architecture. The metric used in the table is Ace, and the best values are shown in bold. The loss functions listed in the rows were used to train the MMFEB block and correspond to the similarity losses defined in Section III-C (Eqs. 6, 9 and 10). The regression block is trained with the losses described in Section III-C. In the table, the last column shows the average error over both ModelA and ModelB (and over the two datasets, SkyData and VEDAI) for each of the loss functions used in the MMFEB block.
Next, we provide experimental results on the effect of the hyperparameters that we studied for both ModelA and ModelB. Table V summarizes those results. In particular, we studied the effect of using different loss functions (L1, L2 and Ace) and different batch sizes for both models. All of these experiments were performed on the SkyData set. The best results are shown in bold. Overall, ModelA showed promising results, achieving better results than ModelB. Therefore, for the rest of our experiments, we kept using ModelA only.
Fig. 6 uses a box plot, also known as a box-and-whisker plot, to display the distribution of the average corner error for different datasets and different models. It provides a summary of key statistical measures such as the minimum, first quartile (Q1), median (Q2), third quartile (Q3) and maximum. The length of the box indicates the spread of the middle 50% of the data. The line inside the box represents the median (Q2). The whiskers extend from the box and represent the variability of the data beyond the quartiles; in our case, they extend to Q1 - 1.5 IQR and Q3 + 1.5 IQR. Individual data points that lie outside the whiskers are considered outliers and are plotted with diamonds. The figure compares the results of 6 algorithms on 5 different datasets.
Fig. 7 shows the performance of 6 methods (SIFT, DHN, MHN, CLKN, DLKFM and ours) on the SkyDataV1 dataset in terms of the average corner error. SkyData has RGB and infrared image pairs. In this figure, we aim to show that feature-based registration techniques such as SIFT perform poorly, whereas methods that leverage neural networks and learn representations are superior.
Fig. 5 gives detailed qualitative results of our experiments. Each row represents a sample taken from a different dataset. The columns represent the inputs and the results of the different approaches. The target (first column, 192x192) and the source (second column, 128x128) are the input image pair. Warped (third column) is the ground truth projection of the source onto the coordinate system of the target image, and Registered (fourth column) is the warped image overlaid on the target image. Columns 5 to 10 show the registered and overlaid results of SIFT, DHN, MHN, CLKN, DLKFM and ours (ModelA) for the given input pair. While almost all algorithms perform relatively well on the Google Earth pair (which provides similar modalities for both target and source images), when the modalities differ significantly, as in the SkyData, Google Maps and VEDAI pairs, the figure shows that the SIFT, CLKN, MHN, DHN and DLKFM algorithms can struggle to align them and may not converge to any useful result near the ground truth (see the SIFT and CLKN results), while our approach converges to the ground truth, yielding a small Ace error for each of those sample pairs.
Table VI reports the results of the different approaches for each dataset separately. In Table VI(e), since MSCOCO is a single-modality dataset, SIFT performs relatively well, but there are cases where the algorithm could not find a homography due to an insufficient number of matched pairs. Google Earth in (c) also has RGB image pairs, but from different seasons; the SIFT algorithm is still able to pick enough salient features, and therefore its performance remains reasonable. Google Maps (d), SkyData (a) and VEDAI (b) contain pairs with significant modality differences. The deep learning based approaches were able to perform registration, but often with a high number of outliers. Our approach was able to perform registration on both single- and multi-modal image pairs; in particular, we were able to keep the maximum error low, as opposed to the LK-based approaches.
V Conclusion and Discussion
In this paper, we introduced a novel image alignment algorithm that we call VisIRNet. VisIRNet has two branches and does not include any stage to compute keypoints. Our experimental results show that our proposed algorithm achieves state-of-the-art results when compared to LK-based deep approaches.
Our method’s main advantages can be listed as follows: (a) Number of iterations during inference: The above-mentioned Lucas-Kanade based methods also iterate a number of times during the inference stage (after the training stage), trying to minimize the loss at each iteration. However, those methods are not guaranteed to converge to the optimal solution, and the number of iterations, chosen as a hyperparameter, is often an arbitrary number at inference time. Such iterative approaches introduce uncertainty in the processing time, as convergence can happen after the first iteration in some situations and only after the last iteration in others. Such uncertainty also affects the real-time processing of images, as it can introduce varying frames-per-second values. Our method uses a single pass during inference, which makes it more applicable to real-time applications.
(b) Dependence on the initial H estimate: In addition to the above-mentioned difference, LK-based algorithms require an initial estimate of the homography matrix, and their performance (and the number of iterations required for convergence) directly depends on that initial estimate of H, which is therefore typically given as an input (hyperparameter). While we also initialize the weights of our architecture, we do not need an initial estimate of the homography matrix as an input to the architecture.
Image alignment on image pairs taken by different onboard cameras on UAVs is a challenging and important topic for various applications. When the images to be aligned are acquired with different modalities, classical approaches, such as the SIFT and RANSAC combination, can yield insufficient results. Deep learning techniques can be more reliable in such situations, as our results demonstrate. LK-based deep techniques have recently shown promise; however, we demonstrate with our approach (VisIRNet) that, without designing any LK-based block and by focusing only on the four corner points, we can sufficiently train deep architectures for image alignment.
Acknowledgments
This paper has been produced benefiting from the 2232 International Fellowship for Outstanding Researchers Program of TÜBİTAK (Project No:118C356). However, the entire responsibility of the paper belongs to the owner of the paper.
References
- [1] F. Alam, S. Ur Rahman, A. Din, and F. Qayum. Medical image registration: Classification, applications and issues. Journal of Postgraduate Medical Institute, 32:300–3007, 12 2018.
- [2] S. Baker and I. Matthews. Lucas-kanade 20 years on: A unifying framework. International journal of computer vision, 56(3):221–255, 2004.
- [3] D. Barath and Z. Kukelova. Relative pose from sift features, 2022.
- [4] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool. Surf: Speeded up robust features. Computer vision and image understanding, 110(3):346–359, 2006.
- [5] Y. Bentoutou, N. Taleb, K. Kpalma, and J. Ronsin. An automatic image registration for applications in remote sensing. IEEE Transactions on Geoscience and Remote Sensing, 43(9):2127–2137, 2005.
- [6] Y. Bentoutou, N. Taleb, K. Kpalma, and J. Ronsin. An automatic image registration for applications in remote sensing. Geoscience and Remote Sensing, IEEE Transactions on, 43:2127 – 2137, 10 2005.
- [7] S. Bhowmick. The RGB rendering of visible wavelength lights (2019 02 28 14 47 31 UTC). 01 2017.
- [8] D. Carreres-Prieto, J. T. García, F. Cerdán-Cartagena, and J. Suardiaz-Muro. Performing calibration of transmittance by single rgb-led within the visible spectrum. Sensors, 20(12), 2020.
- [9] C.-H. Chang, C.-N. Chou, and E. Y. Chang. Clkn: Cascaded lucas-kanade networks for image alignment. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
- [10] S. Chen, F. Yu, and X. Zhu. Real-time registration in image stitching under the microscope. In 2018 IEEE 9th International Conference on Software Engineering and Service Science (ICSESS), pages 907–911, 2018.
- [11] C. Cucci, A. Casini, P. Marcello, and L. Stefani. Extending hyper-spectral imaging from vis to nir spectral regions: A novel scanner for the in-depth analysis of polychrome surfaces. Proc SPIE, 8790:09–, 05 2013.
- [12] J. Delaney, M. Thoury, J. Zeibel, P. Ricciardi, K. Morales, and K. Dooley. Visible and infrared imaging spectroscopy of paintings and improved reflectography. Heritage Science, 4, 03 2016.
- [13] X. Deng, E. Liu, S. Li, Y. Duan, and M. Xu. Interpretable Multi-Modal Image Registration Network Based on Disentangled Convolutional Sparse Coding. IEEE Transactions on Image Processing, 32:1078–1091, Jan. 2023.
- [14] D. DeTone, T. Malisiewicz, and A. Rabinovich. Deep image homography estimation. CoRR, abs/1606.03798, 2016.
- [15] B. Duvenhage, J. Delport, and J. de Villiers. Implementation of the lucas-kanade image registration algorithm on a gpu for 3d computational platform stabilisation. pages 83–90, 06 2010.
- [16] M. A. Fischler and R. C. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395, 1981.
- [17] G. Fox. The brewing industry and the opportunities for real-time quality analysis using infrared spectroscopy. Applied Sciences, 10, 01 2020.
- [18] R. Gade and T. B. Moeslund. Thermal cameras and applications: a survey. Machine Vision and Applications, 25(1):245–262, Jan 2014.
- [19] R. C. Gonzalez, R. E. Woods, and S. L. Eddins. Digital image processing. 2008.
- [20] A. A. Goshtasby. Image Registration - Principles, Tools and Methods. Advances in Computer Vision and Pattern Recognition. Springer, 2012.
- [21] C. Harris and M. Stephens. A combined corner and edge detector. In Proceedings of the 4th Alvey Vision Conference, pages 147–151, 1988.
- [22] R. Hartley and A. Zisserman. Multiple view geometry in computer vision. Cambridge university press, 2003.
- [23] D. L. G. Hill, P. G. Batchelor, M. Holden, and D. J. Hawkes. Medical image registration. Physics in Medicine and Biology, 46(3):R1, mar 2001.
- [24] S.-M. Huang, C.-C. Huang, and C.-C. Chou. Image registration among uav image sequence and google satellite image under quality mismatch. In 2012 12th International Conference on ITS Telecommunications, pages 311–315, 2012.
- [25] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu. Spatial transformer networks. CoRR, abs/1506.02025, 2015.
- [26] L. Juranek, J. Stastny, and V. Skorpil. Effect of low-pass filters as a shi-tomasi corner detector’s window functions. 07 2018.
- [27] E. J. Kirkland. Bilinear Interpolation, pages 261–263. Springer US, Boston, MA, 2010.
- [28] H. Le, F. Liu, S. Zhang, and A. Agarwala. Deep homography estimation for dynamic scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
- [29] R. Lei, B. Yang, D. Quan, Y. Li, B. Duan, S. Wang, H. Jia, B. Hou, and L. Jiao. Deep global feature-based template matching for fast multi-modal image registration. In 2021 IEEE International Geoscience and Remote Sensing Symposium IGARSS, pages 5457–5460. IEEE, 2021.
- [30] J. P. Lewis. Fast normalized cross-correlation. Vision interface, 10(1):120–123, 1995.
- [31] Y. F. LI Fuyu. Summarization of sift-based remote sensing image registration techniques. Remote Sensing for Natural Resources, 28(2):14, 2016.
- [32] Z. li Song, S. Li, and T. F. George. Remote sensing image registration approach based on a retrofitted sift algorithm and lissajous-curve trajectories. Opt. Express, 18(2):513–522, Jan 2010.
- [33] T. Lin, M. Maire, S. J. Belongie, L. D. Bourdev, R. B. Girshick, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: common objects in context. CoRR, abs/1405.0312, 2014.
- [34] D. G. Lowe. Object recognition from local scale-invariant features. In Proceedings of the seventh IEEE international conference on computer vision, volume 2, pages 1150–1157. Ieee, 1999.
- [35] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International journal of computer vision, 60(2):91–110, 2004.
- [36] Y. Luo, X. Wang, Y. Wu, and C. Shu. Infrared and visible image homography estimation using multiscale generative adversarial network. Electronics, 12(4), 2023.
- [37] M. Magnusson, J. Sigurdsson, S. E. Armansson, M. O. Ulfarsson, H. Deborah, and J. R. Sveinsson. Creating rgb images from hyperspectral images using a color matching function. In IGARSS 2020 - 2020 IEEE International Geoscience and Remote Sensing Symposium, pages 2045–2048, 2020.
- [38] M. I. McCartney, S. Zein-Sabatto, and M. Malkani. Image registration for sequence of visual images captured by uav. In 2009 IEEE Symposium on Computational Intelligence for Multimedia Signal and Vision Processing, pages 91–97, 2009.
- [39] V. Mochalov, O. Grigorieva, D. Zhukov, A. Markov, and A. Saidov. Remote sensing image processing based on modified fuzzy algorithm. In R. Silhavy, editor, Artificial Intelligence and Bioinspired Computational Methods, pages 563–572, Cham, 2020. Springer International Publishing.
- [40] P. Monasse. Extraction of the Level Lines of a Bilinear Image. Image Processing On Line, 9:205–219, 2019. https://doi.org/10.5201/ipol.2019.269.
- [41] Y. Mumtaz Ahmad, S. Sahran, A. Adam, and S. Osman. Linear intensity-based image registration. International Journal of Advanced Computer Science and Applications, 9:211–217, 01 2018.
- [42] P. P. N, S. G. S, and V. K. Govindan. Threshold accepting approach for image registration. UACEE International Journal of Computer Science and its Applications, 2, 2012.
- [43] S. Ozer. Similarity domains machine for scale-invariant and sparse shape modeling. IEEE Transactions on Image Processing, 28(2):534–545, 2018.
- [44] S. Özer. Feature matching with similarity domains network. In 2020 28th Signal Processing and Communications Applications Conference (SIU), pages 1–4. IEEE, 2020.
- [45] S. Özer, M. Ege, and M. A. Özkanoglu. Siamesefuse: A computationally efficient and a not-so-deep network to fuse visible and infrared images. Pattern Recognition, 129:108712, 2022.
- [46] S. Ozer, E. Ilhan, M. A. Ozkanoglu, and H. A. Cirpan. Offloading deep learning powered vision tasks from uav to 5g edge server with denoising. IEEE Transactions on Vehicular Technology, 2023.
- [47] M. A. Özkanoğlu and S. Ozer. Infragan: A gan architecture to transfer visible images to infrared domain. Pattern Recognition Letters, 155:69–76, 2022.
- [48] R. Raguram, J.-M. Frahm, and M. Pollefeys. A comparative analysis of ransac techniques leading to adaptive real-time random sample consensus. volume 5303, pages 500–513, 10 2008.
- [49] L. Ray. 2-d and 3-d image registration for medical, remote sensing, and industrial applications. Journal of Electronic Imaging, 14:9901–, 07 2005.
- [50] S. Razakarivony and F. Jurie. Vehicle detection in aerial imagery : A small target detection benchmark. Journal of Visual Communication and Image Representation, 34:187–203, 2016.
- [51] J. A. Rice. Mathematical Statistics and Data Analysis. Duxbury Press, Belmont, CA, 3rd edition, 2007.
- [52] S. T. Vijay and P. N. Pournami. Feature based image registration using heuristic nearest neighbour search. In 2018 22nd International Computer Science and Engineering Conference (ICSEC), pages 1–3, 2018.
- [53] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004.
- [54] Y. Zhao, X. Huang, and Z. Zhang. Deep lucas-kanade homography for multimodal image alignment. CoRR, abs/2104.11693, 2021.
Authors’ Bio:
Sedat Ozer received his M.Sc. degree from Univ. of Massachusetts, Dartmouth and his Ph.D. degree from Rutgers University, NJ. He has worked as a research associate in various institutions including Univ. of Virginia and Massachusetts Institute of Technology. His research interests include pattern analysis, remote sensing, object detection & segmentation, object tracking, visual data analysis, geometric and explainable AI algorithms and explainable fusion algorithms. As a recipient of TUBITAK’s international outstanding research fellow and as an Assistant Professor, he is currently at the department of Computer Science at Ozyegin University.
Alain Patrick Ndigande received his B.Eng. degree from Kocaeli University, Turkey, in 2022. He is currently a M.Sc. student at Ozyegin University. His current research interests are deep learning, image registration and remote sensing.