
Local Contrast and Global Contextual Information Make Infrared Small Object Salient Again

Chenyi Wang   Huan Wang   Peiwen Pan
NJUST
{wcyjerry, Nanjing}@qq.com
Abstract

Infrared small object segmentation (ISOS) aims to segment small objects covering only a few pixels from cluttered backgrounds in infrared images. It is highly challenging because: 1) small objects lack sufficient intensity, shape and texture information; 2) small objects are easily lost in the process by which detection models, e.g., deep neural networks, obtain high-level semantic features and image-level receptive fields through successive downsampling. This paper proposes a reliable segmentation model for ISOS, dubbed UCFNet (U-shape network with central difference convolution and fast Fourier convolution), which handles both issues well. It builds upon central difference convolution (CDC) and fast Fourier convolution (FFC). On one hand, CDC can effectively guide the network to learn the contrast information between small objects and the background, as contrast information is essential to how the human visual system deals with the ISOS task. On the other hand, FFC can gain image-level receptive fields and extract global information on high-resolution feature maps while preventing small objects from being overwhelmed. Experiments on several public datasets demonstrate that our method significantly outperforms state-of-the-art ISOS models and can provide useful guidelines for designing better ISOS deep models. Code is available at https://github.com/wcyjerry/BasicISOS.

1 Introduction

Infrared small object segmentation (ISOS) is a key technique broadly used in early warning systems, night navigation, maritime surveillance, UAV search and tracking, and the like, owing to its all-weather operation, long detection range and concealment. Improving its performance is therefore of great significance.

Figure 1: Examples of ISOS with the object indicated by red bounding boxes; a close-up is shown in the top left corner. Left: an airplane with a recognizable shape at a remote distance. Middle: a car whose shape is almost lost and can only be identified by estimation. Right: a small dim object drowned in a cloud background.

Research on ISOS has been conducted for several decades. Many methods have been proposed, which can be roughly categorized into (1) traditional methods focusing on signal processing and prior knowledge, and (2) deep learning models relying on Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs).

Traditional methods consist of three representative subcategories: background-oriented methods, object-oriented methods, and low-rank decomposition methods. Background-oriented methods like Max-mean/Max-median[12] and Top-Hat[42] separate objects from complex backgrounds by using various filters to estimate the scene background. Object-oriented methods segment small objects by designing different measures. For instance, LCM[3], ILCM[17] and TLLCM[16] use the contrast measure between a central point and its surroundings, and PatchSim[1] applies patch similarities to suppress false alarms. Low-rank decomposition methods are frequently based on robust principal component analysis (RPCA), treating the input as a superposition of a low-rank background and sparse objects and solving the detection problem via optimization techniques. Take the infrared patch-image (IPI)[14] model as an example: it adopts a patch-sliding design that better exploits the non-local self-correlation properties of images via RPCA. Subsequently, recent works have put effort into designing sound low-rank and prior constraints[28, 35], exploiting spatio-temporal and multi-mode correlation[7], and applying advanced optimization schemes[9]. Though traditional methods have achieved some success in experimental scenarios, they are sensitive to hyper-parameter settings, lack generalization, and suffer from low performance in complex real scenes.

As deep learning has become mainstream in many computer vision tasks, many pioneers have achieved great improvements in ISOS using deep neural networks. Wang et al. [36] used two generators, each focusing on a different task (miss detection and false alarm), and one discriminator to obtain a balanced result between the two generators. Dai et al. [10] proposed a novel feature fusion method named asymmetric contextual modulation (ACM) and introduced the first public ISOS dataset in real scenes. They then proposed ALC [11], which includes an unlearnable conditional local contrast module. Liu et al. [21] first introduced transformer blocks into the ISOS task and obtained strong results. Zhang et al. introduced an attention-guided context module to help AGPCNet[46] focus on small objects. Zhang et al. proposed ISNet[45], which takes shape into account to achieve better performance. Though existing deep learning methods have achieved strong results, they mostly focus on feature fusion.

However, the above models more or less neglect two inherent problems in ISOS: infrared small objects lack sufficient common information such as color and shape, and small objects can be drowned in the excessive downsampling process. This paper aims to provide a better solution to these problems of missing information and drowned objects. Specifically, we design a U-shape network as our ISOS backbone. We then employ central difference convolution (CDC), a convolution operator inspired by the human visual system, to extract essential contrast information. As for the vanishing problem, we leverage fast Fourier convolution (FFC), which performs convolution in the frequency domain and thus easily obtains global information while avoiding object disappearance. We demonstrate the effectiveness of CDC against other advanced convolution operators, and we also compare FFC with other global information extraction methods. Our final model, coined UCFNet, achieves new state-of-the-art performance over competitive ISOS methods on existing public datasets. The contributions of our work can be summarized as:

  • We reveal two inherent problems in infrared small object segmentation by analysing experiments with common segmentation methods, and propose a baseline network structure.

  • We further propose using CDC to extract local contrast information, motivated by the human visual system, which is essential for infrared small object detection; meanwhile, we use FFC to obtain image-level receptive fields and global context while maintaining high resolution.

  • We comprehensively evaluate the proposed method on two public datasets; our UCF achieves new state-of-the-art performance over other methods.

2 Related Work

2.1 ISOS

There are two main types of ISOS methods: traditional methods based on mathematical modeling and deep learning methods based on neural networks.

Among the traditional methods, Max-mean/Max-median[12] and Top-Hat[42] used filters to separate targets from the background. LCM[3], ILCM[17], TLLCM[16] and MPCM[39] segmented small objects by designing saliency measures. The IPI model[14] treated the input as a superposition of a low-rank background and sparse targets and solved the problem via low-rank decomposition; further methods such as sound low-rank models[28] and prior constraints[35] were proposed based on IPI. These methods suffer from low performance in complex scenarios.

As for deep learning methods, Wang et al. [36] used a conditional GAN[26] with two generators and one discriminator to strike a good balance between miss detection and false alarm (MDvsFA). Dai et al. [10] proposed asymmetric contextual modulation to help the network perform well and introduced SIRST, the first public ISOS dataset in real scenes; Dai et al. [11] further applied a handcrafted dilated local contrast measure in the network. Liu et al. [21] first introduced multi-head self-attention into ISOS tasks and obtained good results. Zhang et al. [46] proposed AGPCNet with an attention-guided context block and a context pyramid module. Zhang et al. [45] took shape into account, designed a Taylor finite difference edge block and two-orientation attention, and also proposed a more challenging dataset, IRSTD. Deep learning methods have become increasingly dominant in ISOS due to their strong robustness and generalization.

2.2 Convolution Operators

Ordinary convolution divides the feature map into patches of the same size as the convolution kernel and then performs a weighted sum; this fixed operation at each position may be suboptimal for some specific tasks. Therefore, researchers have proposed several advanced convolution operators. To solve the problem of standard convolution treating all input pixels as valid in image inpainting tasks, Liu et al. [22] proposed partial convolution, where the convolution is masked and renormalized to be conditioned only on valid pixels. Yu et al. [40] proposed gated convolution to provide a learnable dynamic feature selection mechanism for each channel at each spatial location across all layers. Since vanilla convolution is limited in modeling geometric transformations due to its fixed geometric structure, Dai et al. [6] proposed deformable convolution, which enhances this ability by adding offsets learned from the target task. Zhu et al. [48] then proposed deformable convolution v2 to reduce the impact of irrelevant regions by adding weights and mimicking features from R-CNN[29]. Yu et al. [41] proposed central difference convolution for face anti-spoofing, which captures intrinsic detailed patterns by aggregating both intensity and gradient information.

2.3 Global Information Extraction

One common concern is a network's ability to grasp local and global information. Local context is usually easy to extract with convolution, while how global the extracted information can be is usually determined by the receptive fields of the network. The most common strategy to enlarge receptive fields is to stack convolutions and downsamplings: successive convolutions enlarge receptive fields linearly, while downsamplings enlarge them multiplicatively. Dilated (atrous) convolution inserts holes between pixels in convolutional kernels and obtains larger receptive fields than standard convolution; Chen et al. [4] proposed atrous spatial pyramid pooling (ASPP) to capture multi-scale information, and Wang et al. [37] designed hybrid dilated convolution to enlarge receptive fields, both achieving good improvements in semantic segmentation. Attention mechanisms [38, 34] can gain global information by calculating the correlation of each pixel with all other pixels; Fu et al. proposed spatial attention and channel attention [13] and achieved strong results. Chi et al. [5] proposed fast Fourier convolution, which performs convolution in the frequency domain to exert a global influence in the spatial domain and can thus extract image-level information. Suvorov et al. proposed LAMA[33], which successfully applied it to large-mask image inpainting. Berenguel et al. [2] then used FFC in monocular depth estimation and semantic segmentation.

3 Method

3.1 Inspiration

We draw inspiration from a series of experiments using common segmentation networks, including FPN[20], U-Net[30], PSPNet[47] and DeepLabv3[4]. The results in Table 1 reveal two anomalies: first, performance is quite polarized, with models that fuse more low-level features (FPN and U-Net) significantly outperforming the others; second, performance drops slightly, rather than improving, as network width and depth increase.

Table 1: Results of common segmentation models in ISOS tasks.
Method        Backbone     Base width   IoU↑    Params (M)↓
U-Net[30]     –            64           69.89   31.04
FPN[20]       ResNet-18    64           70.38   14.12
FPN[20]       ResNet-50    256          70.12   27.19
FPN[20]       ResNet-101   256          69.64   46.18
PSPNet[47]    ResNet-50    256          22.73   46.65
DeepLabv3[4]  ResNet-50    256          39.71   35.86

We further visualize the attention maps of the 4 stages of FPN using Grad-CAM[31, 15]. As shown in Fig. 2, the small object is completely lost due to excessive downsampling. Meanwhile, directly increasing the width and depth of the network fails to mine more information and may lead to parameter redundancy. Based on these observations, we summarize two essential guidelines for ISOS: 1) extracting global information while maintaining high resolution can prevent small objects from being overwhelmed; 2) extracting additional essential information using modules dedicated to ISOS can be more effective than directly increasing the depth and width of a network.

Figure 2: Visualization of the 4 stages in FPN. Objects are completely lost in stage 4 with a 32× downsampling.
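For reproducibility, the sketch below shows how such stage-wise attention maps can be produced with the pytorch-grad-cam package [15]. A plain ResNet-18 classifier stands in for the FPN backbone here, and the dummy input and classifier target are illustrative assumptions rather than the paper's exact setup.

```python
import torch
from torchvision.models import resnet18
from pytorch_grad_cam import GradCAM
from pytorch_grad_cam.utils.model_targets import ClassifierOutputTarget

# Hypothetical stand-in backbone: the paper probes the 4 stages of FPN,
# which correspond to layer1..layer4 of a ResNet encoder.
model = resnet18(weights=None).eval()
stages = [model.layer1, model.layer2, model.layer3, model.layer4]

x = torch.randn(1, 3, 256, 256)  # dummy input frame
for i, stage in enumerate(stages, start=1):
    cam = GradCAM(model=model, target_layers=[stage])
    # the CAM is rescaled to the input resolution by the library
    heatmap = cam(input_tensor=x, targets=[ClassifierOutputTarget(0)])
    print(f"stage {i}: attention map shape {heatmap.shape}")  # (1, 256, 256)
```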

3.2 Overview of UCFNet

Based on the findings in Sec. 3.1, we propose a U-shape network with central difference convolution (CDC) and fast Fourier convolution (FFC). The whole framework is shown in Fig. 3. The number of base channels is 32 and only 4 downsampling operations are performed in the whole network, which satisfies the needs of ISOS tasks while avoiding inefficient redundancy. Specifically, during the downsampling stage, we use central difference convolution residual blocks, each containing a standard convolution and a CDC, with a residual convolution for channel alignment. We use two cascaded FFCs to form our fast Fourier convolution residual block, which extracts global context at high resolution. In what follows, we briefly introduce central difference convolution in Sec. 3.3 and fast Fourier convolution in Sec. 3.4.

Figure 3: The scheme of the proposed UCFNet. Built on central difference convolution residual blocks and fast Fourier convolution residual blocks, it can effectively extract both local contrast information and global context.
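For concreteness, here is a layout-only sketch of the U-shape described above, assuming a base width of 32, 4 downsamplings, and a single-channel infrared input; a plain conv block acts as a placeholder where the CDC residual blocks (Sec. 3.3) and the FFC residual bottleneck (Sec. 3.4) would slot in, and all module choices beyond those stated numbers are illustrative.

```python
import torch
import torch.nn as nn

def conv_block(cin, cout):
    # placeholder for the CDC residual block of Sec. 3.3
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1, bias=False),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

class UShapeSkeleton(nn.Module):
    def __init__(self, base=32):
        super().__init__()
        chs = [base * 2 ** i for i in range(5)]  # 32, 64, 128, 256, 512
        self.enc = nn.ModuleList(
            conv_block(1 if i == 0 else chs[i - 1], chs[i]) for i in range(5))
        self.pool = nn.MaxPool2d(2)  # applied 4 times in total
        self.up = nn.ModuleList(
            nn.ConvTranspose2d(chs[i], chs[i - 1], 2, stride=2)
            for i in range(4, 0, -1))
        self.dec = nn.ModuleList(
            conv_block(2 * chs[i - 1], chs[i - 1]) for i in range(4, 0, -1))
        self.head = nn.Conv2d(base, 1, 1)  # binary segmentation map

    def forward(self, x):
        feats = []
        for i, enc in enumerate(self.enc):
            x = enc(x if i == 0 else self.pool(x))
            feats.append(x)
        # the deepest features would pass through the FFC residual blocks here
        for up, dec, skip in zip(self.up, self.dec, reversed(feats[:-1])):
            x = dec(torch.cat([up(x), skip], dim=1))
        return self.head(x)
```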

3.3 Central Difference Convolution

Convolution can effectively extract color, shape, texture and other information, but standard convolution may be limited in ISOS tasks because such information is lacking. Inspired by the human visual system's sensitivity to intensity differences and contrast, we utilize central difference convolution to guide our network. CDC introduces additional contrast information by computing the difference between the center point and the other points within the convolution window, as described in Eq. 1,

f(x,y) = \sum_{(i,j)\in R} w_{(i,j)} \cdot \left( F_{(x+i,y+j)} - F_{(x,y)} \right) \qquad (1)

In addition, CDC is often combined with vanilla convolution to retain some traditional feature extraction capability; the whole process can be expressed as Eq. 2:

CDC(x,y) = \theta \cdot \sum_{(i,j)\in R} w_{(i,j)} \left( F_{(x+i,y+j)} - F_{(x,y)} \right) + (1-\theta) \cdot \sum_{(i,j)\in R} w_{(i,j)} F_{(x+i,y+j)} \qquad (2)

where the hyperparameter θ ∈ (0, 1) determines the contribution of CDC relative to vanilla convolution. The window size R used to calculate the difference equals the convolution kernel size, while the receptive field naturally grows during the forward pass, making CDC capable of extracting multi-level contrast information. This further helps the network identify infrared small objects of different sizes and effectively guides the network to perform well.
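A minimal PyTorch sketch of Eq. 2 follows, assuming the standard reformulation in which the central-difference term equals a vanilla convolution minus a 1×1 convolution whose weights are the kernel's spatial sums (so no explicit per-window differencing is needed); the class and argument names are ours.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CentralDifferenceConv2d(nn.Module):
    """Central difference convolution (Eq. 2), combining a vanilla term
    and a difference term weighted by theta."""
    def __init__(self, in_ch, out_ch, kernel_size=3, padding=1, theta=0.7):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size,
                              padding=padding, bias=False)
        self.theta = theta

    def forward(self, x):
        vanilla = self.conv(x)  # sum_w w * F(x+i, y+j)
        # sum_w w * F(x, y): a 1x1 conv with each kernel's spatial sum
        kernel_sum = self.conv.weight.sum(dim=(2, 3), keepdim=True)
        center = F.conv2d(x, kernel_sum)
        diff = vanilla - center  # sum_w w * (F(x+i,y+j) - F(x,y))
        return self.theta * diff + (1 - self.theta) * vanilla
```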

3.4 Fast Fourier Convolution

Another issue in ISOS is that small objects are easily overwhelmed by the excessive downsampling layers used to obtain global information. To resolve this contradiction, rather than using a Transformer model with long-range dependency, we employ fast Fourier convolution (FFC), which gains image-level receptive fields at high resolution, so we can extract global context without losing small objects. We follow the FFC structure in LAMA[33]. Specifically, we split the channels into two parallel branches: the local branch uses standard convolution to extract local information, while the global branch applies the Fourier Unit (FU) to gain global context; two additional cross-branch shortcuts perform information fusion. The whole process can be described as:

Y_l = Conv_{l \to l}(x_l) + Conv_{g \to l}(x_g) \qquad (3)
Y_g = Conv_{l \to g}(x_l) + FU_{g \to g}(x_g) \qquad (4)

The Fourier Unit is the key to obtaining global context, because it transforms input features from the spatial domain to the frequency domain, where each single point corresponds to all points in the spatial domain. Therefore, a convolution conducted there with a small kernel size influences the whole image in the spatial domain and obtains global contextual information. The FU performs the following steps:

  1) Transform the input feature map from the spatial domain to the frequency domain with a real FFT2d and concatenate the real and imaginary parts:

     RealFFT2d: \mathbb{R}^{H\times W\times C} \rightarrow \mathbb{C}^{H\times\frac{W}{2}\times C}
     ComplexToReal: \mathbb{C}^{H\times\frac{W}{2}\times C} \rightarrow \mathbb{R}^{H\times\frac{W}{2}\times 2C}

  2) Apply convolution, normalization and an activation function in the frequency domain:

     Conv \circ Norm \circ Act: \mathbb{R}^{H\times\frac{W}{2}\times 2C} \rightarrow \mathbb{R}^{H\times\frac{W}{2}\times 2C}

  3) Transform back from the frequency domain to the spatial domain:

     RealToComplex: \mathbb{R}^{H\times\frac{W}{2}\times 2C} \rightarrow \mathbb{C}^{H\times\frac{W}{2}\times C}
     Inverse RealFFT2d: \mathbb{C}^{H\times\frac{W}{2}\times C} \rightarrow \mathbb{R}^{H\times W\times C}

Finally, we concatenate the local and global branches into one output, which contains rich local and global information.
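The following sketch puts Eqs. (3)-(4) and the three FU steps together in PyTorch. The channel split ratio, the 1×1 frequency-domain convolution, and the normalization choice are assumptions in the spirit of LAMA[33], not the paper's exact configuration; note also that torch.fft.rfft2 keeps W/2+1 frequency bins rather than the W/2 written above.

```python
import torch
import torch.nn as nn

class FourierUnit(nn.Module):
    """FU sketch: real FFT -> conv/norm/act on stacked real+imaginary
    parts -> inverse real FFT."""
    def __init__(self, channels):
        super().__init__()
        # operates on 2C channels (real and imaginary parts concatenated)
        self.conv = nn.Conv2d(2 * channels, 2 * channels, 1, bias=False)
        self.norm = nn.BatchNorm2d(2 * channels)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        b, c, h, w = x.shape
        # 1) spatial -> frequency: (B, C, H, W) -> complex (B, C, H, W//2+1)
        z = torch.fft.rfft2(x, norm="ortho")
        z = torch.cat([z.real, z.imag], dim=1)       # (B, 2C, H, W//2+1)
        # 2) pointwise conv + norm + act in the frequency domain
        z = self.act(self.norm(self.conv(z)))
        # 3) frequency -> spatial
        real, imag = z.chunk(2, dim=1)
        z = torch.complex(real, imag)
        return torch.fft.irfft2(z, s=(h, w), norm="ortho")

class FFC(nn.Module):
    """Two-branch FFC layer implementing Eqs. (3)-(4): a local branch with
    standard convolution, a global branch with the FU, and two cross-branch
    convolutions acting as the shortcuts."""
    def __init__(self, channels, global_ratio=0.5):
        super().__init__()
        cg = int(channels * global_ratio)   # global-branch channels
        cl = channels - cg                  # local-branch channels
        self.conv_l2l = nn.Conv2d(cl, cl, 3, padding=1, bias=False)
        self.conv_l2g = nn.Conv2d(cl, cg, 3, padding=1, bias=False)
        self.conv_g2l = nn.Conv2d(cg, cl, 3, padding=1, bias=False)
        self.fu = FourierUnit(cg)
        self.cl = cl

    def forward(self, x):
        x_l, x_g = x[:, :self.cl], x[:, self.cl:]        # split channels
        y_l = self.conv_l2l(x_l) + self.conv_g2l(x_g)    # Eq. (3)
        y_g = self.conv_l2g(x_l) + self.fu(x_g)          # Eq. (4)
        return torch.cat([y_l, y_g], dim=1)              # fuse both branches
```

Cascading two such layers with a residual connection yields the FFC residual block described in Sec. 3.2.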

4 Experiment

4.1 Datasets and Metrics

Datasets. We evaluate our method on two widely used ISOS datasets: SIRST[10] and IRSTD[45]. SIRST contains 427 real-scene infrared images, and half of its objects cover only 0.1% of the pixels of the whole image. IRSTD is larger, containing 1001 images with more challenging objects and more complex backgrounds. Both SIRST and IRSTD are split into training and test sets with a ratio of 8:2.

Metrics. Following previous works, we use pixel-level metrics (IoU and nIoU) and target-level metrics (Pd and Fa) to measure our method. Intersection over Union (IoU) and normalized Intersection over Union (nIoU) can be described as:

IoU = \frac{\sum_{i=0}^{n} tp_i}{\sum_{i=0}^{n} \left( tp_i + fp_i + fn_i \right)} \qquad (5)
nIoU = \frac{1}{n} \sum_{i=0}^{n} \frac{tp_i}{tp_i + fp_i + fn_i} \qquad (6)

where n denotes the total number of samples, tp denotes true positives, fp denotes false positives and fn denotes false negatives. The target-level metrics, probability of detection (Pd) and false alarm rate (Fa), can be described as:

Pd = \frac{1}{n} \sum_{i=0}^{n} \frac{N^{i}_{pred}}{N^{i}_{all}} \qquad (7)
Fa = \frac{1}{n} \sum_{i=0}^{n} \frac{P^{i}_{false}}{P^{i}_{all}} \qquad (8)

where N_pred and N_all denote the number of correctly detected objects and the total number of objects, and P_false and P_all denote the pixels of falsely detected objects and the total number of image pixels. A detection is regarded as correct when the distance between the centers of the predicted result and the ground truth is less than 4 pixels.
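As a reference implementation, the sketch below computes the four metrics from lists of binary masks. The connected-component matching for Pd/Fa (centroid distance under 4 pixels, unmatched predictions counted as false-alarm pixels) follows the description above, while details such as the eps terms and the empty-mask handling are our assumptions.

```python
import numpy as np
from scipy import ndimage

def pixel_metrics(preds, gts, eps=1e-8):
    # Eqs. (5)-(6): tp/fp/fn counted per image over boolean masks
    tps, fps, fns = [], [], []
    for p, g in zip(preds, gts):
        tps.append(np.logical_and(p, g).sum())
        fps.append(np.logical_and(p, ~g).sum())
        fns.append(np.logical_and(~p, g).sum())
    tps, fps, fns = map(np.asarray, (tps, fps, fns))
    iou = tps.sum() / (tps.sum() + fps.sum() + fns.sum() + eps)   # Eq. (5)
    niou = np.mean(tps / (tps + fps + fns + eps))                 # Eq. (6)
    return float(iou), float(niou)

def target_metrics(preds, gts, dist_thresh=4):
    # Eqs. (7)-(8): centroid matching under the 4-pixel rule stated above
    pds, fas = [], []
    for p, g in zip(preds, gts):
        lbl_p, n_p = ndimage.label(p)
        lbl_g, n_g = ndimage.label(g)
        cents_p = ndimage.center_of_mass(p, lbl_p, range(1, n_p + 1))
        cents_g = ndimage.center_of_mass(g, lbl_g, range(1, n_g + 1))
        near = lambda a, b: np.hypot(a[0] - b[0], a[1] - b[1]) < dist_thresh
        # ground-truth objects hit by at least one predicted centroid
        detected = sum(any(near(cg, cp) for cp in cents_p) for cg in cents_g)
        # pixels of predicted components matching no ground-truth object
        false_px = sum((lbl_p == i).sum()
                       for i, cp in enumerate(cents_p, start=1)
                       if not any(near(cp, cg) for cg in cents_g))
        pds.append(detected / max(n_g, 1))   # Eq. (7)
        fas.append(false_px / p.size)        # Eq. (8)
    return float(np.mean(pds)), float(np.mean(fas))
```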

4.2 Implementation Details

We conduct experiments on a computer with a 2.50 GHz CPU, 16 GB RAM and a GeForce RTX 3090 GPU, using PyTorch. We use the AdamW optimizer with an initial learning rate of 0.001 decayed by a cosine annealing schedule[25], and we use binary cross entropy loss and soft IoU loss as our criterion. Each experiment is trained for 300 epochs with a batch size of 8. Our UCF achieves its best performance with a CDC ratio θ of 0.7 and 5 FFC residual blocks.
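Since the weighting between the two loss terms is not stated, the following sketch assumes an unweighted sum of BCE and a soft IoU term computed on the sigmoid probability map:

```python
import torch
import torch.nn.functional as F

def bce_soft_iou_loss(logits, target, eps=1e-6):
    # binary cross entropy on the raw logits
    bce = F.binary_cross_entropy_with_logits(logits, target)
    # soft IoU: intersection over union computed on probabilities per sample
    prob = torch.sigmoid(logits)
    inter = (prob * target).sum(dim=(1, 2, 3))
    union = (prob + target - prob * target).sum(dim=(1, 2, 3))
    soft_iou = 1.0 - (inter + eps) / (union + eps)
    return bce + soft_iou.mean()  # equal weighting is an assumption
```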

4.3 Quantitative Results

We evaluated several traditional methods, deep learning methods for ISOS, and common segmentation methods, and compared their results in Table 2. Overall, deep learning methods outperform traditional ones due to their powerful feature extraction and generalization capabilities. Our proposed UCF demonstrates superior performance over the other deep learning methods on both datasets across all metrics. Notably, on SIRST, UCF achieves 80.89 IoU and 78.72 nIoU, a large improvement over other methods, while also achieving a perfect detection rate and an extremely low false alarm rate of only 2.22×10⁻⁶. On IRSTD, our method also outperforms state-of-the-art deep learning methods by 3-8 points in IoU and nIoU, achieving scores of 68.92 and 69.26, respectively.

Method        SIRST: IoU / nIoU / Pd / Fa         IRSTD: IoU / nIoU / Pd / Fa

Traditional methods:
Top-Hat[42]   5.86 / 25.42 / 78.90 / 1397.12      4.26 / 15.08 / 67.00 / 422.25
LCM[3]        6.84 / 8.96 / 77.06 / 183.15        4.45 / 4.73 / 57.58 / 66.56
WLDM[23]      22.28 / 28.62 / 87.16 / 98.34       9.77 / 16.07 / 63.97 / 177.35
NARM[43]      25.95 / 32.23 / 79.82 / 19.74       7.77 / 12.24 / 61.96 / 12.24
PSTNN[44]     39.44 / 47.72 / 83.49 / 41.07       16.44 / 25.91 / 65.32 / 76.92
IPI[14]       40.48 / 50.95 / 91.74 / 148.37      14.40 / 31.29 / 86.35 / 450.36
RIPT[8]       25.49 / 33.01 / 85.32 / 24.75       8.15 / 16.12 / 68.35 / 26.36
NIPPS[9]      33.16 / 40.91 / 80.73 / 23.64       16.38 / 27.10 / 70.37 / 63.27

Deep learning methods:
MDvsFA[36]    56.17 / 59.84 / 90.88 / 177.90      50.85 / 45.97 / 81.48 / 23.01
ACM[10]       72.45 / 72.15 / 93.52 / 12.39       63.38 / 60.80 / 91.58 / 15.31
Res-Vit[21]   72.82 / 71.22 / 98.15 / 27.15       61.89 / 60.64 / 90.91 / 12.64
AGPC[46]      73.69 / 72.60 / 98.17 / 16.99       66.29 / 65.23 / 92.83 / 13.12
DNANet[19]    74.16 / 75.65 / 98.17 / 30.21       64.81 / 64.51 / 93.27 / 16.05
HRNet[32]     76.50 / 72.86 / 99.08 / 2.88        64.78 / 59.47 / 92.95 / 16.95
U2Net[27]     74.54 / 73.18 / 98.17 / 18.32       64.75 / 62.32 / 92.61 / 18.18
SwinT[24]     70.53 / 69.89 / 92.19 / 33.42       59.89 / 58.78 / 86.59 / 17.74
NAT[18]       74.33 / 71.67 / 97.25 / 10.26       63.23 / 62.01 / 91.92 / 15.53
UCF (Ours)    80.89 / 78.92 / 100.00 / 2.26       68.92 / 69.26 / 93.60 / 11.01
Table 2: Quantitative evaluation of ISOS on the SIRST and IRSTD datasets. We report the pixel-level metrics IoU (%) and nIoU (%) and the object-level metrics Pd (%) and Fa (10⁻⁶). All deep learning methods outperform the traditional methods, and our UCF achieves the best performance on all metrics on both datasets.

We further investigate the dynamic relationship between precision and recall using LCM[3], IPI[14], ACM[10], AGPC[46], and our proposed UCF. The ROC curves are presented in Fig. 4, where the area under the curve (AUC) is a key metric for quantitatively evaluating the ROC. Our UCF achieves the highest AUC and F-score on both datasets, as shown in Table 3.

Figure 4: ROC curves on SIRST (solid lines) and IRSTD (dotted lines) of different methods.
Table 3: F-score and area under the ROC curve (AUC) on both datasets for different methods.
Method     SIRST: F-score↑ / AUC↑     IRSTD: F-score↑ / AUC↑
LCM[3]     12.80 / 0.058              8.52 / 0.099
IPI[14]    57.63 / 0.448              25.17 / 0.248
ACM[10]    84.02 / 0.684              77.59 / 0.719
AGPC[46]   84.85 / 0.765              79.73 / 0.734
UCF        89.43 / 0.843              81.60 / 0.745

4.4 Qualitative Comparisons

Several visualization results for different methods are presented in Fig. 5. These results clearly demonstrate that our proposed UCF method not only achieves a higher detection rate and fewer false alarms at the object level, but also predicts object shapes more accurately.

Figure 5: Qualitative results of different methods. Red bounding boxes indicate true objects and yellow bounding boxes indicate predicted false alarms. As seen in the first four rows, in complicated backgrounds our UCF produces fewer false alarms and correctly detects more targets at the target level. From the last two rows, we can see that UCF describes shape and texture in more detail in relatively simple scenes. (Best viewed by zooming in.)

4.5 Ablation Experiments

We conduct a series of ablation experiments on SIRST to investigate the effectiveness of CDC and FFC. Table 4 demonstrates the general effectiveness of both components. CDC improves performance by incorporating local contrast information, while FFC provides valuable global information. When combined, CDC and FFC achieve the best results in terms of IoU, nIoU, and Pd. However, the Fa score is slightly worse than when using only FFC. This is because CDC tends to recover shape and texture details, which can occasionally introduce false pixels.

Table 4: Ablation study of CDC and FFC in IoU (%), nIoU (%), Pd (%), Fa (10⁻⁶).
Method IoU↑ nIoU↑ Pd↑ Fa↓
UCF (vanilla) 74.14 74.89 96.33 4.83
UCF + CDC 75.95 76.29 97.25 3.46
UCF + FFC 79.55 77.23 100.00 1.82
UCF + CDC + FFC 80.89 78.72 100.00 2.22

Effectiveness of CDC.

(a) IoU curves for different values of the parameter θ in CDC; the best performance is obtained when θ = 0.7.
(b) IoU curves for different numbers of FFC residual blocks; the best performance is achieved with 5 FFC residual blocks.
Figure 6: IoU curves on SIRST with different parameter settings.

We conduct experiments with different values of the hyperparameter θ, which determines the ratio between CDC and vanilla convolution. As shown in Fig. 6(a), CDC consistently outperforms vanilla convolution by mining essential contrast information, and we achieve the best performance when θ = 0.7. In addition, we compare the performance of deformable convolution and gated convolution with that of CDC. As shown in Table 5, gated convolution yields only a slight improvement over vanilla convolution, while deformable convolution drops significantly; this is because shape information is quite insufficient in ISOS tasks and deformable convolution fails to learn offsets that follow the target's shape. These results indicate that our proposed CDC has clear advantages over other convolution operators for ISOS.

Table 5: Ablation study of CDC against other convolution operators in metrics of IoU (%), nIoU (%), Pd (%), Fa (10⁻⁶).
Conv Operator IoU↑ nIoU↑ Pd↑ Fa↓
Vanilla 74.14 74.89 96.33 4.83
Gated[40] 74.31 74.57 97.25 23.20
DeformableV1[6] 68.00 69.60 93.58 12.60
DeformableV2[48] 69.72 72.67 96.33 29.90
CDC (θ = 0.7) 75.95 76.29 97.25 3.46
Table 6: Ablation study of FFC against other global information extraction methods (vanilla conv blocks, multi-dilated conv blocks and a double attention block) in metrics of IoU (%), nIoU (%), Pd (%), Fa (10⁻⁶).
Method IoU↑ nIoU↑ Pd↑ Fa↓
Vanilla Conv blocks 75.31 73.91 98.17 12.29
Dilated Conv blocks[37] 77.34 75.51 99.08 6.56
Double attention[13] 77.87 75.84 100.00 12.51
FFC blocks 79.55 77.23 100.00 1.82

Effectiveness of FFC. We conduct experiments to study the hyperparameter n, which determines the number of FFC residual blocks used in our method. As shown in Fig. 6(b), the FFC residual block, designed to extract global context, significantly improves performance, and we achieve the best results with five FFC residual blocks. We further explore other methods for extracting global information, such as dilated convolution and double attention, as mentioned in Sec. 2.3. From Table 6 we can see that both dilated convolution and double attention improve performance by enlarging receptive fields and extracting global context. However, dilated convolution captures image-level information only in deep layers, while attention mechanisms lack a local inductive bias; therefore, they are inferior to FFC in ISOS tasks.

5 Conclusion

In this study, we identified and analyzed two important issues in ISOS that affect model performance. Drawing inspiration from these issues, we propose a simple yet effective method. Specifically, to address the first issue of insufficient information, we use central difference convolution to guide the network's focus on local contrast information. To deal with the second issue, we employ fast Fourier convolution to extract global context from high-resolution feature maps, preventing small objects from being overwhelmed. Extensive experiments validate that our UCF model shows great superiority over other state-of-the-art methods on both public datasets. Our proposed model can also serve as a guide for further investigations into ISOS tasks.

References

  • [1] Kun Bai, Yuehuang Wang, and Qiong Song. Patch similarity based edge-preserving background estimation for single frame infrared small target detection. 2016 IEEE International Conference on Image Processing (ICIP), pages 181–185, 2016.
  • [2] Bruno Berenguel-Baeta, Jesus Bermudez-Cameo, and Jose J Guerrero. Fredsnet: Joint monocular depth and semantic segmentation with fast fourier convolutions. arXiv preprint arXiv:2210.01595, 2022.
  • [3] CL Philip Chen, Hong Li, Yantao Wei, Tian Xia, and Yuan Yan Tang. A local contrast method for small infrared target detection. IEEE transactions on geoscience and remote sensing, 52(1):574–581, 2013.
  • [4] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence, 40(4):834–848, 2017.
  • [5] Lu Chi, Borui Jiang, and Yadong Mu. Fast fourier convolution. Advances in Neural Information Processing Systems, 33:4479–4488, 2020.
  • [6] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. In Proceedings of the IEEE international conference on computer vision, pages 764–773, 2017.
  • [7] Yimian Dai and Yiquan Wu. Reweighted infrared patch-tensor model with both nonlocal and local priors for single-frame small target detection. IEEE journal of selected topics in applied earth observations and remote sensing, 10(8):3752–3767, 2017.
  • [8] Yimian Dai and Yiquan Wu. Reweighted infrared patch-tensor model with both nonlocal and local priors for single-frame small target detection. IEEE journal of selected topics in applied earth observations and remote sensing, 10(8):3752–3767, 2017.
  • [9] Yimian Dai, Yiquan Wu, and Yu Song. Infrared small target and background separation via column-wise weighted robust principal component analysis. Infrared Physics & Technology, 77:421–430, 2016.
  • [10] Yimian Dai, Yiquan Wu, Fei Zhou, and Kobus Barnard. Asymmetric contextual modulation for infrared small target detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 950–959, 2021.
  • [11] Yimian Dai, Yiquan Wu, Fei Zhou, and Kobus Barnard. Attentional Local Contrast Networks for Infrared Small Target Detection. IEEE Transactions on Geoscience and Remote Sensing, pages 1–12, 2021.
  • [12] Suyog D. Deshpande, Meng Hwa Er, Ronda Venkateswarlu, and Philip Chan. Max-mean and max-median filters for detection of small targets. In Optics & Photonics, 1999.
  • [13] Jun Fu, Jing Liu, Haijie Tian, Yong Li, Yongjun Bao, Zhiwei Fang, and Hanqing Lu. Dual attention network for scene segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3146–3154, 2019.
  • [14] Chenqiang Gao, Deyu Meng, Yi Yang, Yongtao Wang, Xiaofang Zhou, and Alexander Hauptmann. Infrared patch-image model for small target detection in a single image. IEEE Transactions on Image Processing, 22:4996–5009, 2013.
  • [15] Jacob Gildenblat and contributors. Pytorch library for cam methods. https://github.com/jacobgil/pytorch-grad-cam, 2021.
  • [16] Jinhui Han, Sibang Liu, Gang Qin, Qian Zhao, Honghui Zhang, and Nana Li. A local contrast method combined with adaptive background estimation for infrared small target detection. IEEE Geoscience and Remote Sensing Letters, 16(9):1442–1446, 2019.
  • [17] Jinhui Han, Yong Ma, Bo Zhou, Fan Fan, Kun Liang, and Yu Fang. A robust infrared small target detection algorithm based on human visual system. IEEE Geoscience and Remote Sensing Letters, 11(12):2168–2172, 2014.
  • [18] Ali Hassani, Steven Walton, Jiachen Li, Shen Li, and Humphrey Shi. Neighborhood attention transformer. arXiv preprint arXiv:2204.07143, 2022.
  • [19] Boyang Li, Chao Xiao, Longguang Wang, Yingqian Wang, Zaiping Lin, Miao Li, Wei An, and Yulan Guo. Dense nested attention network for infrared small target detection. IEEE Transactions on Image Processing, 2022.
  • [20] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2117–2125, 2017.
  • [21] Fangcen Liu, Chenqiang Gao, Fang Chen, Deyu Meng, Wangmeng Zuo, and Xinbo Gao. Infrared small-dim target detection with transformer under complex backgrounds. arXiv preprint arXiv:2109.14379, 2021.
  • [22] Guilin Liu, Fitsum A Reda, Kevin J Shih, Ting-Chun Wang, Andrew Tao, and Bryan Catanzaro. Image inpainting for irregular holes using partial convolutions. In Proceedings of the European conference on computer vision (ECCV), pages 85–100, 2018.
  • [23] Jie Liu, Ziqing He, Zuolong Chen, and Lei Shao. Tiny and dim infrared target detection based on weighted local contrast. IEEE Geoscience and Remote Sensing Letters, 15(11):1780–1784, 2018.
  • [24] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10012–10022, 2021.
  • [25] Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.
  • [26] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
  • [27] Xuebin Qin, Zichen Zhang, Chenyang Huang, Masood Dehghan, Osmar R Zaiane, and Martin Jagersand. U2-net: Going deeper with nested u-structure for salient object detection. Pattern recognition, 106:107404, 2020.
  • [28] Sur Singh Rawat, Sashi Kant Verma, and Yatindra Kumar. Reweighted infrared patch image model for small target detection based on non-convex lp-norm minimisation and tv regularisation. IET image processing, 14(9):1937–1947, 2020.
  • [29] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in neural information processing systems, 28, 2015.
  • [30] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015.
  • [31] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE international conference on computer vision, pages 618–626, 2017.
  • [32] Ke Sun, Bin Xiao, Dong Liu, and Jingdong Wang. Deep high-resolution representation learning for human pose estimation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5693–5703, 2019.
  • [33] Roman Suvorov, Elizaveta Logacheva, Anton Mashikhin, Anastasia Remizova, Arsenii Ashukha, Aleksei Silvestrov, Naejin Kong, Harshith Goka, Kiwoong Park, and Victor Lempitsky. Resolution-robust large mask inpainting with fourier convolutions. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2149–2159, 2022.
  • [34] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  • [35] Minjie Wan, Guohua Gu, Yunkai Xu, Weixian Qian, Kan Ren, and Qian Chen. Total variation-based interframe infrared patch-image model for small target detection. IEEE Geoscience and Remote Sensing Letters, 19:1–5, 2021.
  • [36] Huan Wang, Luping Zhou, and Lei Wang. Miss detection vs. false alarm: Adversarial learning for small object segmentation in infrared images. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8509–8518, 2019.
  • [37] Panqu Wang, Pengfei Chen, Ye Yuan, Ding Liu, Zehua Huang, Xiaodi Hou, and Garrison Cottrell. Understanding convolution for semantic segmentation. In 2018 IEEE winter conference on applications of computer vision (WACV), pages 1451–1460. IEEE, 2018.
  • [38] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7794–7803, 2018.
  • [39] Yantao Wei, Xinge You, and Hong Li. Multiscale patch-based contrast measure for small infrared target detection. Pattern Recognition, 58:216–226, 2016.
  • [40] Jiahui Yu, Zhe Lin, Jimei Yang, Xiaohui Shen, Xin Lu, and Thomas S Huang. Free-form image inpainting with gated convolution. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4471–4480, 2019.
  • [41] Zitong Yu, Chenxu Zhao, Zezheng Wang, Yunxiao Qin, Zhuo Su, Xiaobai Li, Feng Zhou, and Guoying Zhao. Searching central difference convolutional networks for face anti-spoofing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5295–5305, 2020.
  • [42] Ming Zeng, Jianxun Li, and Zhangxiao Peng. The design of top-hat morphological filter and application to infrared target detection. Infrared Physics & Technology, 48:67–76, 2006.
  • [43] Landan Zhang, Lingbing Peng, Tianfang Zhang, Siying Cao, and Zhenming Peng. Infrared small target detection via non-convex rank approximation minimization joint l2,1 norm. Remote Sensing, 10(11):1821, 2018.
  • [44] Landan Zhang and Zhenming Peng. Infrared small target detection based on partial sum of the tensor nuclear norm. Remote Sensing, 11(4):382, 2019.
  • [45] Mingjin Zhang, Rui Zhang, Yuxiang Yang, Haichen Bai, Jing Zhang, and Jie Guo. Isnet: Shape matters for infrared small target detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 877–886, 2022.
  • [46] Tianfang Zhang, Siying Cao, Tian Pu, and Zhenming Peng. Agpcnet: Attention-guided pyramid context networks for infrared small target detection, 2021.
  • [47] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2881–2890, 2017.
  • [48] Xizhou Zhu, Han Hu, Stephen Lin, and Jifeng Dai. Deformable convnets v2: More deformable, better results. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9308–9316, 2019.