
An Invariant Feature Extraction Method for Multi-Modal Image Matching

Abstract

This paper presents an effective invariant feature extraction and matching algorithm for multi-modal images, targeting the application of multi-source data analysis. Focusing on the differences and correlations among multi-modal images, a feature-based matching algorithm is implemented. The key technologies include phase congruency (PC) and the Shi-Tomasi feature point for keypoint detection, the LogGabor filter and a weighted partial main orientation map (WPMOM) for feature extraction, and a multi-scale process to handle scale differences and optimize matching results. Experimental results on practical data from multiple sources show that the algorithm performs effectively on multi-modal images and achieves accurate spatial alignment, demonstrating practical application value and good generalization.

Index Terms—  Image matching, multi-modal, multi-source, invariant feature, phase congruency, Gabor filter

1 Introduction

Image matching has long been a classical, important, and challenging task in image processing, especially for multi-modal images. Nowadays, the focus of image matching and registration has shifted to multi-modal and multi-source techniques, which are more critical to practical applications. Multi-modal image matching is an essential step for data fusion, change detection, collaborative classification, joint analysis, and other image technologies.

Image matching algorithms are generally divided into area-based methods and feature-based methods according to their technical means [1], of which the feature-based ones are the most widely applied today. By locating keypoints at the same positions and matching them as accurately as possible through similar or invariant features, the spatial transformation between images is represented in a sparse way. This type of algorithm is a good solution to automatic multi-modal image matching, which has been a research hot spot in recent years.

Chen et al. [2] proposed the partial intensity invariant feature descriptor (PIIFD) based on the scale-invariant feature transform (SIFT) [3, 4, 5] for multi-source retinal images, which reduces intensity distortion problems. PSO-SIFT [6] uses a second-order gradient and an enhanced matching method for multi-source remote sensing images. Ye et al. developed the histogram of oriented phase congruency (HOPC) [7, 8], in which an extended phase congruency model is designed for feature description. The radiation-variation insensitive feature transform (RIFT) [9] utilizes a maximum index map (MIM) built from LogGabor filter responses as the feature map, which has high invariance to intensity distortion. The multi-scale histogram of local main orientation (MS-HLMO) [10] comprehensively considers the differences among multi-source images and proposes a framework focused on orientation features, achieving intensity, rotation, and scale robustness on multi-modal remote sensing images. Three key problems emerge from these studies: 1) intensity distortion, 2) rotation, and 3) scale difference, which are the main difficulties of multi-modal image matching and need to be solved. The features extracted from images should be as robust as possible to these three aspects.

This research focuses on analyzing the invariant features of multi-modal images, and a feature-based image matching algorithm is proposed, which uses phase congruency (PC) and the Shi-Tomasi feature point for keypoint detection, the LogGabor filter and a weighted partial main orientation map (WPMOM) for feature description, and a multi-scale process to handle scale differences and optimize matching results. Experiments on remote sensing and medical data from multiple sources indicate that the proposed method has excellent, stable performance in multi-modal image registration and achieves accurate spatial alignment, showing practical application value and good generalization ability.

2 PROPOSED MATCHING METHOD


Fig. 1: The proposed multi-modal image matching framework.

The process of the proposed algorithm is shown in Fig.1. The input multi-modal image pair is preprocessed. Then, keypoint detection based on phase congruency and the Shi-Tomasi detector is performed. Feature extraction is carried out in a multi-scale process, in which Gaussian pyramids are first built. The odd-LogGabor filter is used to extract features from the images, and a weighted partial main orientation map (WPMOM) is then calculated based on the LogGabor feature. The generalized GLOH-like descriptor (GGLOH) [10] is adopted to extract features for each keypoint. The keypoints are then matched through a multi-scale matching strategy. Finally, the spatial transformation is determined by the matched feature points.

2.1 Feature points detection

Harris corner detection [11] is a stable and robust keypoint extraction method that has been widely used in multi-modal image matching. However, for multi-source images, the textures are likely to be very different, such as in the optical image and digital map shown in Fig.2(a). There are texture features in the optical image that do not exist in the digital map. Due to the influence of this inconsistent texture information, keypoints are difficult to locate at the same positions, so the repetition rate of points is very low, as shown in Fig.2(b). The extracted feature points should instead focus on the salient and stable structures of the image, which motivates the use of the image's phase congruency.

Phase congruency (PC) has been proved to be robust under different imaging modalities [7, 8, 9], as it extracts stable structural features. The 2D-PC model can be calculated using components at multiple scales $s$ and orientations $o$ of the LogGabor wavelet [12, 13]:

\mathbf{PC}(x,y)=\frac{\sum_{s}\sum_{o}\omega_{o}(x,y)\,\lfloor \mathbf{A}_{s,o}(x,y)\,\Delta\phi_{s,o}(x,y)-T \rfloor}{\sum_{s}\sum_{o}\mathbf{A}_{s,o}(x,y)+\varepsilon}    (1)

where $\omega_{o}(x,y)$ is the weighting factor based on frequency spread, $\mathbf{A}_{s,o}$ is the amplitude component of the LogGabor response, $\Delta\phi_{s,o}$ is the phase deviation, $T$ is the noise compensation, and $\lfloor\cdot\rfloor$ is a truncation function that returns its argument when it is positive and zero otherwise. A maximum moment map $\mathbf{M}_{\psi}$, which indicates edge features, and a minimum moment map $\mathbf{m}_{\psi}$, which indicates corner features, are then calculated as:

\mathbf{a}=\sum_{o}\big(\mathbf{PC}_{\theta_{0}}\cos(\theta_{0})\big)^{2}    (2)
\mathbf{b}=2\sum_{o}\big(\mathbf{PC}_{\theta_{0}}\cos(\theta_{0})\big)\cdot\big(\mathbf{PC}_{\theta_{0}}\sin(\theta_{0})\big)    (3)
\mathbf{c}=\sum_{o}\big(\mathbf{PC}_{\theta_{0}}\sin(\theta_{0})\big)^{2}    (4)
\psi=\frac{1}{2}\arctan\left(\frac{\mathbf{b}}{\mathbf{a}-\mathbf{c}}\right)    (5)
\mathbf{M}_{\psi}=\frac{1}{2}\left(\mathbf{c}+\mathbf{a}+\sqrt{\mathbf{b}^{2}+(\mathbf{a}-\mathbf{c})^{2}}\right)    (6)
\mathbf{m}_{\psi}=\frac{1}{2}\left(\mathbf{c}+\mathbf{a}-\sqrt{\mathbf{b}^{2}+(\mathbf{a}-\mathbf{c})^{2}}\right)    (7)

where $\mathbf{PC}_{\theta_{0}}$ is the PC map at orientation $\theta_{0}$. The two feature maps are then added together to obtain a single map containing both edge and corner features:

\mathbf{M}_{w}=\mathbf{M}_{\psi}+\mathbf{m}_{\psi}    (8)
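As a concrete illustration of Eqs. (1)-(8), the following is a minimal numpy sketch that combines precomputed LogGabor quantities into the PC map and then fuses the per-orientation PC maps into the single edge-plus-corner map $\mathbf{M}_{w}$; the array shapes, default constants, and function names are illustrative assumptions, not the released implementation.

```python
import numpy as np

def phase_congruency(A, dphi, w, T=0.1, eps=1e-4):
    """Sketch of Eq. (1). A and dphi are assumed precomputed LogGabor
    amplitude and phase-deviation arrays of shape (S, O, H, W); w is the
    frequency-spread weight of shape (O, H, W); T is noise compensation."""
    energy = np.maximum(A * dphi - T, 0.0)        # the truncation in Eq. (1)
    return (w[None] * energy).sum(axis=(0, 1)) / (A.sum(axis=(0, 1)) + eps)

def pc_moment_map(pc_maps, thetas):
    """Sketch of Eqs. (2)-(8): per-orientation PC maps (O, H, W) and their
    orientations (radians) combined into the edge+corner feature map M_w."""
    c = np.cos(thetas)[:, None, None]
    s = np.sin(thetas)[:, None, None]
    a = ((pc_maps * c) ** 2).sum(axis=0)                     # Eq. (2)
    b = 2.0 * ((pc_maps * c) * (pc_maps * s)).sum(axis=0)    # Eq. (3)
    cc = ((pc_maps * s) ** 2).sum(axis=0)                    # Eq. (4)
    root = np.sqrt(b ** 2 + (a - cc) ** 2)
    M_psi = 0.5 * (cc + a + root)                            # Eq. (6): edges
    m_psi = 0.5 * (cc + a - root)                            # Eq. (7): corners
    return M_psi + m_psi                                     # Eq. (8)
```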
Fig. 2: An example of multi-modal image feature detection. (a) Optical-map pair. (b) Harris points. (c) PC-Harris points.

The Shi-Tomasi [14] feature is an improvement of Harris, based on the observation that the stability of a corner is determined by the smaller eigenvalue of the Harris feature matrix. Therefore, the Shi-Tomasi detector is applied to the feature map to extract stable feature points:

cornerness=\min(\lambda_{1},\lambda_{2})    (9)
\mathbf{M}=\begin{bmatrix} \sum_{\mathbf{W}}\mathbf{M}_{w,x}^{2} & \sum_{\mathbf{W}}\mathbf{M}_{w,x}\mathbf{M}_{w,y} \\ \sum_{\mathbf{W}}\mathbf{M}_{w,x}\mathbf{M}_{w,y} & \sum_{\mathbf{W}}\mathbf{M}_{w,y}^{2} \end{bmatrix}    (10)

where $\lambda_{1},\lambda_{2}$ are the eigenvalues of $\mathbf{M}$, $\mathbf{M}_{w,x}$ and $\mathbf{M}_{w,y}$ are the gradients of the weighted moment map $\mathbf{M}_{w}$ in the $x$ and $y$ directions, and $\mathbf{W}$ is a Gaussian window. After filtering the images using Eqs. (1)-(10), with local non-maximum suppression and threshold judgment, the keypoints are obtained from the multi-modal images, as shown in Fig.2(c); their repetition rate is greatly improved.
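A minimal sketch of this detection step is given below, assuming $\mathbf{M}_{w}$ has already been computed; the Gaussian window size, suppression radius, and point budget are illustrative, and the threshold judgment is replaced here by simply keeping the strongest responses.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter

def shi_tomasi_keypoints(M_w, sigma=1.5, n_points=1000):
    """Sketch of Eqs. (9)-(10): Shi-Tomasi detection performed on the
    weighted moment map M_w instead of the raw image."""
    gy, gx = np.gradient(M_w)
    # Structure tensor entries smoothed by a Gaussian window W
    Ixx = gaussian_filter(gx * gx, sigma)
    Ixy = gaussian_filter(gx * gy, sigma)
    Iyy = gaussian_filter(gy * gy, sigma)
    # Smaller eigenvalue of the 2x2 structure tensor (closed form)
    tr, det = Ixx + Iyy, Ixx * Iyy - Ixy ** 2
    cornerness = 0.5 * (tr - np.sqrt(np.maximum(tr ** 2 - 4.0 * det, 0.0)))
    # Local non-maximum suppression, then keep the strongest responses
    peaks = (cornerness == maximum_filter(cornerness, size=5))
    ys, xs = np.nonzero(peaks)
    order = np.argsort(cornerness[ys, xs])[::-1][:n_points]
    return np.stack([xs[order], ys[order]], axis=1)   # (x, y) coordinates
```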

2.2 Multi-modal robust feature extraction

So far, there have been many studies on multi-modal image matching techniques, and a variety of robust features have been proposed, each with its own effects and advantages. The key problem lies in finding invariant or similar features among images of different modalities. We re-examine the characteristics and matching methods of multi-modal image data and distill a core idea, analyzed below, to guide our feature design.

Fig. 3: Local intensity distortions in multi-modal images and their robust gradient orientation features.

To intuitively show the characteristics of the multi-modal robust feature, we briefly summarize the common intensity distortions in multi-modal images, as shown in Fig.3. The four images can be regarded as the edge of an object or the interface between two substances. Assume that the center point is the keypoint and the image block represents the intensity information of its neighborhood. In the first case of Fig.3, the intensity amplitude on the left is lower than on the right. In the second, the gradient amplitude of the two parts changes. In the third, due to large intensity distortion, the gradient orientation is reversed. The fourth represents a general situation, regarded as non-linear intensity distortion or some degradation such as down-sampling or blurring. Notice that the gradient orientation remains on the same line, and for the reversed case, if the angle is limited to [-90°, 90°), the orientation is still 0°.
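A short numerical check of this half-circle convention: folding angles into [-90°, 90°) makes a gradient and its reversed counterpart (the third case in Fig.3) share the same orientation value.

```python
import numpy as np

theta = np.array([30.0, 30.0 - 180.0])   # a gradient and its reversal, in degrees
folded = (theta + 90.0) % 180.0 - 90.0   # fold into [-90, 90)
print(folded)                            # both entries give 30.0
```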

Fig. 4: The construction process of the LogGabor-based weighted PMOM (WPMOM).

In multi-source images, various intensity distortions may be caused by differences in sensors, acquisition times, environments, etc., but the local morphology of the most detailed parts of the image basically falls into the four cases in Fig.3. In summary, the magnitude of the local gradient varies, but the orientation is basically stable, which defines the feature information that should be focused on. A similar idea appears in MS-HLMO [10], which counts only the gradient orientation in its PMOM feature map and discards the amplitude. However, PMOM is still calculated from the classic gradient, which is very sensitive to image noise and inconsistency. In this research, the orientation is therefore obtained with an odd-LogGabor filter. The odd-LogGabor wavelet has been widely used as a representation of the gradient operator in image processing and is much more robust. The image's gradients along the $x$ and $y$ directions based on the odd-LogGabor filter are calculated as:

\left\{\begin{array}{l} \mathbf{G}_{x}^{\mathrm{LG}}(x,y)=\sum_{s}\sum_{o}\mathbf{I}(x,y)\ast\mathbf{LG}_{s,o}^{\mathrm{odd}}(x,y)\cdot\cos(o) \\ \mathbf{G}_{y}^{\mathrm{LG}}(x,y)=\sum_{s}\sum_{o}\mathbf{I}(x,y)\ast\mathbf{LG}_{s,o}^{\mathrm{odd}}(x,y)\cdot\sin(o) \end{array}\right.    (11)

where $\mathbf{LG}_{s,o}^{\mathrm{odd}}$ denotes an odd-LogGabor filter with scale $s$ and orientation $o$. The odd-LogGabor responses $\mathbf{G}_{x}^{\mathrm{LG}}(x,y)$ and $\mathbf{G}_{y}^{\mathrm{LG}}(x,y)$ are taken as the representation of the gradients in the average squared gradient (ASG) calculation:

\begin{bmatrix} \mathbf{G}_{W_{\sigma},s,x} \\ \mathbf{G}_{W_{\sigma},s,y} \end{bmatrix}=\begin{bmatrix} \sum_{W_{\sigma}} (\mathbf{G}_{x}^{\mathrm{LG}})^{2}-(\mathbf{G}_{y}^{\mathrm{LG}})^{2} \\ \sum_{W_{\sigma}} 2\,\mathbf{G}_{x}^{\mathrm{LG}}\mathbf{G}_{y}^{\mathrm{LG}} \end{bmatrix}    (12)
Table 1: Average NCMs of multi-modal image matching results comparison by six methods. The first six columns are remote sensing groups; the last two are medical groups.

Method     | optical-optical | optical-infrared | optical-depth | optical-SAR | optical-map | others | optical | others
-----------|-----------------|------------------|---------------|-------------|-------------|--------|---------|-------
SIFT       | 21.6            | 2.8              | 0.9           | 1.1         | 0.1         | 7.9    | 10.6    | 0
PSO-SIFT   | 14.9            | 50.7             | 3.0           | 0           | 1.0         | 14.2   | 8.9     | 0
PIIFD      | 44.0            | 34.0             | 11.4          | 12.3        | 2.1         | 2.0    | 6.5     | 0
MS-HLMO    | 208.1           | 270.3            | 75.4          | 112.7       | 91.1        | 108.2  | 97.0    | 0
Proposed   | 227.2           | 269.3            | 89.0          | 147.7       | 101.1       | 109.8  | 183.5   | 11.0
RIFT       | 390.5           | 279.3            | 294.0         | 263.8       | 255.0       | 160.3  | 157.0   | 3.1
Proposed+  | 523.2           | 447.0            | 340.9         | 349.2       | 382.4       | 275.8  | 317.0   | 93.0

Another consideration is that, according to visual saliency, pixels closer to the center pixel are more important. An improvement is therefore made to the PMOM: within a local area, features from a larger scale should have less influence. Each scale is given a weight inversely proportional to the scale, and the local orientation is then calculated as:

\mathbf{G}_{\mathrm{PMOM}}^{\mathrm{LG}}=\frac{1}{2}\angle\left(\sum_{\sigma}\frac{1}{\sigma}\mathbf{G}_{W_{\sigma},s,x},\ \sum_{\sigma}\frac{1}{\sigma}\mathbf{G}_{W_{\sigma},s,y}\right)    (13)
\angle(X,Y)=\left\{\begin{array}{ll} \arctan(Y/X), & X\geq 0 \\ \arctan(Y/X)+\pi, & X<0,\ Y\geq 0 \\ \arctan(Y/X)-\pi, & X<0,\ Y<0 \end{array}\right.    (14)

This orientation feature map $\mathbf{G}_{\mathrm{PMOM}}^{\mathrm{LG}}$, based on the odd-LogGabor filter and the weighted PMOM (WPMOM), has much stronger invariance and stability on multi-modal images. The schematic diagram of the feature extraction process is shown in Fig.4.
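To make Eqs. (11)-(14) concrete, the following is a minimal numpy sketch of the WPMOM computation; the odd-LogGabor responses are assumed to be precomputed, the local windows are approximated by box averages (the mean only rescales the sums and does not change the angle), and the window sizes and weights are illustrative.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def wpmom(resp, o_angles, window_sizes=(3, 5, 7)):
    """Sketch of Eqs. (11)-(14).

    resp         : image convolved with each odd-LogGabor filter, (S, O, H, W)
    o_angles     : filter orientations in radians, shape (O,)
    window_sizes : local window sizes; larger windows get smaller weights,
                   playing the role of the 1/sigma weighting in Eq. (13)
    """
    cos_o = np.cos(o_angles)[None, :, None, None]
    sin_o = np.sin(o_angles)[None, :, None, None]
    Gx = (resp * cos_o).sum(axis=(0, 1))                   # Eq. (11)
    Gy = (resp * sin_o).sum(axis=(0, 1))

    num_x = np.zeros_like(Gx)
    num_y = np.zeros_like(Gy)
    for w in window_sizes:
        # Average squared gradient over a local window (Eq. 12)
        num_x += uniform_filter(Gx ** 2 - Gy ** 2, size=w) / w
        num_y += uniform_filter(2.0 * Gx * Gy, size=w) / w

    # Eqs. (13)-(14): arctan2 realizes the four-quadrant angle, and the
    # factor 1/2 maps opposite gradients to the same orientation
    return 0.5 * np.arctan2(num_y, num_x)
```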

2.3 Feature points description and matching

After the crucial invariant feature map is obtained, the next step is to extract a descriptor for each keypoint with a selected structure. Theoretically, any valid descriptor structure is feasible. In the proposed framework, the generalized gradient location and orientation histogram-like (GGLOH) feature descriptor [10] is adopted for feature description, as it has shown better performance and is easier to operate. The WPMOM value at each keypoint is taken as the reference orientation, and the WPMOM values within the local area are accumulated with GGLOH to obtain descriptor vectors of the keypoints, which are then used for matching.
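As an illustration of this step, here is a hedged sketch of a generic GLOH-like log-polar orientation histogram built on the WPMOM map; the actual GGLOH layout of [10] (cell geometry, radius, ring, sector, and bin counts) may differ, so all parameters below are illustrative.

```python
import numpy as np

def gloh_like_descriptor(orient_map, kp, radius=24, n_rad=3, n_ang=8, n_bins=12):
    """Generic GLOH-like descriptor sketch over a WPMOM orientation map.

    orient_map : WPMOM values in [-pi/2, pi/2), shape (H, W)
    kp         : (x, y) keypoint location
    """
    x0, y0 = int(round(kp[0])), int(round(kp[1]))
    ref = orient_map[y0, x0]                 # reference orientation at the keypoint
    H, W = orient_map.shape
    hist = np.zeros((1 + n_rad * n_ang, n_bins))   # center cell + rings x sectors
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            r = np.hypot(dx, dy)
            y, x = y0 + dy, x0 + dx
            if r > radius or not (0 <= y < H and 0 <= x < W):
                continue
            # Orientation relative to the reference, folded into [-pi/2, pi/2)
            rel = (orient_map[y, x] - ref + np.pi / 2) % np.pi - np.pi / 2
            ob = int((rel + np.pi / 2) / np.pi * n_bins) % n_bins
            if r < radius / 4:               # central cell
                cell = 0
            else:                            # log-polar ring and sector indices
                ring = min(int(np.log(r / (radius / 4)) / np.log(4.0) * n_rad),
                           n_rad - 1)
                sector = int(((np.arctan2(dy, dx) - ref) % (2 * np.pi))
                             / (2 * np.pi) * n_ang) % n_ang
                cell = 1 + ring * n_ang + sector
            hist[cell, ob] += 1.0
    vec = hist.ravel()
    return vec / (np.linalg.norm(vec) + 1e-12)   # L2-normalized descriptor
```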

To handle scale differences, features are extracted in the scale space built with Gaussian pyramids. The multi-scale strategy proposed in [10] has shown effective results and is adapted to the registration process. The commonly used nearest neighbor (NN) matching and fast sample consensus (FSC) outlier removal are utilized in the proposed algorithm, and also for each method in the experiments, to ensure a fair comparison.
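The following is a minimal sketch of this matching stage under simplifying assumptions: brute-force nearest-neighbor matching with a ratio test, followed by a RANSAC-style affine consensus used here as a stand-in for FSC (the actual FSC procedure differs).

```python
import numpy as np

def nn_match(desc1, desc2, ratio=0.9):
    """Nearest-neighbor matching with a ratio test (a common stand-in;
    the paper uses plain NN matching followed by FSC outlier removal)."""
    d = np.linalg.norm(desc1[:, None, :] - desc2[None, :, :], axis=2)
    nn = np.argmin(d, axis=1)
    sorted_d = np.sort(d, axis=1)
    keep = sorted_d[:, 0] < ratio * sorted_d[:, 1]
    return np.stack([np.nonzero(keep)[0], nn[keep]], axis=1)

def ransac_affine(p1, p2, n_iter=2000, tol=3.0):
    """RANSAC-style consensus, a simplified stand-in for fast sample
    consensus (FSC); estimates a 2D affine transform from matched points."""
    A_full = np.hstack([p1, np.ones((len(p1), 1))])      # homogeneous coordinates
    best_inliers = np.zeros(len(p1), dtype=bool)
    rng = np.random.default_rng(0)
    for _ in range(n_iter):
        idx = rng.choice(len(p1), 3, replace=False)
        try:
            T = np.linalg.solve(A_full[idx], p2[idx])    # 3x2 affine parameters
        except np.linalg.LinAlgError:
            continue                                     # degenerate sample
        err = np.linalg.norm(A_full @ T - p2, axis=1)
        inliers = err < tol
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # Refit on all inliers with least squares
    T, *_ = np.linalg.lstsq(A_full[best_inliers], p2[best_inliers], rcond=None)
    return T, best_inliers
```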

3 EXPERIMENTS AND ANALYSIS

To comprehensively test the proposed method, the data sets provided in [15] and [10] are adopted, and we also carefully prepared multi-source images from a broader field, which include remote sensing, medical, and other natural or artificial images. The sensor types include optical, infrared, depth, synthetic aperture radar (SAR), digital map, etc. SIFT [5], PIIFD [2], PSO-SIFT [6], RIFT [9], and MS-HLMO [10] are selected for comparison. The number of correct matches (NCM) is employed as the evaluation metric. All experiments are implemented in MATLAB R2021b on a Windows PC with an Intel Core i7-8700 CPU.

The average NCM of each group of multi-modal image pairs is listed in Table 1. Note that, because RIFT handles rotation by enumeration rather than by an invariant feature, the proposed method is adjusted with the same strategy for a fairer comparison, marked as Proposed+. The table shows that the proposed method obtains the most NCMs in almost all groups. Among the compared algorithms, MS-HLMO performs best, and its average NCM exceeds that of the proposed method in one group. A detailed analysis shows that MS-HLMO achieves very high NCMs on some image pairs, which pulls up its average, while it fails completely on some others. The proposed method performs satisfactorily on all images. Therefore, the proposed method has better robustness and generalization and is less affected by the data modality. Fig.5 shows some examples of the keypoint matching results.

Fig. 5: Examples of multi-modal image matching from remote sensing and medical fields. (a) HSI-MSI. (b) Optical-infrared. (c) Optical-depth. (d) Optical-SAR. (e) Optical-map. (f) SAR-SAR. (g) Tissue fluorescence. (h) Staining. (i) Retina fundus.

By inspecting the transformed and aligned images and comparing them with manual matching results, the spatial errors of the matched image pairs are found to be within 1-2 pixels. The proposed algorithm is thus proved to be effective.

4 CONCLUSION

In this paper, an effective feature-based image matching algorithm is proposed for multi-modal images. The invariant features are analyzed and summarized, leading to a robust feature based on local gradient orientation and the LogGabor filter. Experiments on a comprehensive set of multi-source data show that the algorithm accomplishes the matching tasks well, with image matching errors within 1-2 pixels. It is concluded that the algorithm has good robustness, stability, and generalization.

References

  • [1] Barbara Zitova and Jan Flusser, “Image registration methods: a survey,” Image and Vision Computing, vol. 21, no. 11, pp. 977–1000, 2003.
  • [2] Jian Chen, Jie Tian, Noah Lee, Jian Zheng, R Theodore Smith, and Andrew F Laine, “A partial intensity invariant feature descriptor for multimodal retinal image registration,” IEEE Transactions on Biomedical Engineering, vol. 57, no. 7, pp. 1707–1718, 2010.
  • [3] David G Lowe, “Object recognition from local scale-invariant features,” in Proceedings of the Seventh IEEE International Conference on Computer Vision. IEEE, 1999, vol. 2, pp. 1150–1157.
  • [4] David G Lowe, “Local feature view clustering for 3d object recognition,” in Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001. IEEE, 2001, vol. 1, pp. I–I.
  • [5] David G Lowe, “Distinctive image features from scale-invariant keypoints,” International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.
  • [6] Wenping Ma, Zelian Wen, Yue Wu, Licheng Jiao, Maoguo Gong, Yafei Zheng, and Liang Liu, “Remote sensing image registration with modified sift and enhanced feature matching,” IEEE Geoscience and Remote Sensing Letters, vol. 14, no. 1, pp. 3–7, 2016.
  • [7] Yuanxin Ye, Jie Shan, Lorenzo Bruzzone, and Li Shen, “Robust registration of multimodal remote sensing images based on structural similarity,” IEEE Transactions on Geoscience and Remote Sensing, vol. 55, no. 5, pp. 2941–2958, 2017.
  • [8] Yuanxin Ye, Jie Shan, Siyuan Hao, Lorenzo Bruzzone, and Yao Qin, “A local phase based invariant feature for remote sensing image matching,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 142, pp. 205–221, 2018.
  • [9] Jiayuan Li, Qingwu Hu, and Mingyao Ai, “Rift: Multi-modal image matching based on radiation-variation insensitive feature transform,” IEEE Transactions on Image Processing, vol. 29, pp. 3296–3310, 2019.
  • [10] Chenzhong Gao, Wei Li, Ran Tao, and Qian Du, “Ms-hlmo: Multiscale histogram of local main orientation for remote sensing image registration,” IEEE Transactions on Geoscience and Remote Sensing.
  • [11] Christopher G Harris, Mike Stephens, et al., “A combined corner and edge detector.,” in Alvey Vision Conference. Citeseer, 1988, vol. 15, pp. 10–5244.
  • [12] Peter Kovesi, “Phase congruency: A low-level image invariant,” Psychological research, vol. 64, no. 2, pp. 136–148, 2000.
  • [13] Peter Kovesi, “Phase congruency detects corners and edges,” in The australian pattern recognition society conference: DICTA, 2003, vol. 2003.
  • [14] Jianbo Shi and Carlo Tomasi, “Good features to track,” in 1994 Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 1994, pp. 593–600.
  • [15] Yongxiang Yao, Yongjun Zhang, Yi Wan, Xinyi Liu, Xiaohu Yan, and Jiayuan Li, “Multi-modal remote sensing image matching considering co-occurrence filter,” IEEE Transactions on Image Processing, vol. 31, pp. 2584–2597, 2022.