
Adaptive Guidance Learning for Camouflaged Object Detection

Zhennan Chen, Xuying Zhang, Tian-Zhu Xiang*, Ying Tai* Z. Chen and Y. Tai are with the PCALab, School of Intelligence Science and Technology, Nanjing University, Suzhou, China. (e-mail: [email protected], [email protected]). X. Zhang is with VCIP, College of Computer Science, Nankai University, Tianjin, China. (e-mail: [email protected]). T.-Z. Xiang is with the G42, Abu Dhabi, UAE. (e-mail: [email protected]). (* Corresponding author: T.-Z. Xiang and Y. Tai)
Abstract

Camouflaged object detection (COD) aims to segment objects visually embedded in their surroundings, which is a very challenging task due to the high similarity between the objects and the background. To address this, most methods incorporate additional information (e.g., boundary, texture, and frequency clues) to guide feature learning for better detecting camouflaged objects from the background. Although progress has been made, these methods are typically tailored to a specific auxiliary cue, and thus lack adaptability and do not consistently achieve high segmentation performance. To this end, this paper proposes an adaptive guidance learning network, dubbed AGLNet, a unified end-to-end learnable model for exploring and adapting different additional cues in CNN models to guide accurate camouflaged feature learning. Specifically, we first design a straightforward additional information generation (AIG) module to learn additional camouflaged object cues, which can be adapted for the exploration of effective camouflaged features. Then we present a hierarchical feature combination (HFC) module to deeply integrate additional cues and image features to guide camouflaged feature learning in a multi-level fusion manner. Followed by a recalibration decoder (RD), different features are further aggregated and refined for accurate object prediction. Extensive experiments on three widely used COD benchmark datasets demonstrate that the proposed method achieves significant performance improvements under different additional cues, and outperforms 20 recent state-of-the-art methods by a large margin. Our code will be made publicly available at: https://github.com/ZNan-Chen/AGLNet.

Index Terms:
Camouflaged object detection, auxiliary cues.
Figure 1: Visual comparisons between (a) FDCOD [1] and (b) DGNet [2] with different additional cues. We show the feature maps from the layer preceding the network output to better visualize model behavior. From (a), we can see that FDCOD exploits frequency-domain clues well for camouflaged object detection, but does not transfer to boundary cues. From (b), DGNet fails to identify the camouflaged objects with frequency-domain clues due to the weak feature changes around objects in the frequency domain.

I Introduction

Camouflaged object detection (COD) is the task of spotting and segmenting objects that are perfectly hidden in complex environments. Recent years have witnessed increasing research enthusiasm from the computer vision community on COD, which facilitates wide applications in various fields, such as medicine (e.g., polyp segmentation [3] and lung infection segmentation [4]), industry (e.g., surface defect detection [5] and autonomous driving [6]), art [7] (e.g., recreational arts and style transformation), ecology [8] (e.g., species search) and society [9] (e.g., search and rescue).

In recent years, numerous deep learning-based methods have been proposed for camouflaged object detection and have made great progress. Some methods adopt a coarse-to-fine learning strategy to explore contextual cues and aggregate multi-scale features for COD, such as SINet [10], PFNet [11] and FSPNet [12]. Some methods introduce uncertainty-aware learning to model the confidence of model predictions, such as UGTR [13] and ZoomNet [14]. As we know, species often adopt various camouflage strategies to deliberately conceal themselves in the surroundings for self-protection, resulting in high intrinsic similarity in appearance (e.g., color, texture, and shape) between candidate objects and backgrounds. This camouflage ability easily deceives the visual system [15], which makes it very difficult to identify camouflaged targets from a single image feature alone. To address the above limitations, some methods resort to additional information, such as boundary [16, 17, 18], texture [19], edge [2], saliency [20], and frequency [1]. However, we observe that almost all of these methods are designed for a specific type of additional information; they thus lack sufficient adaptability to different types of additional cues and do not consistently achieve good detection performance. For instance, as shown in Fig. 1(a), FDCOD [1] is specially designed to incorporate frequency-domain clues for effective camouflaged object detection, but is not applicable to other additional features (e.g., boundary). Similarly, as shown in Fig. 1(b), the spectrum feature shows only a small gradient difference around the camouflaged object, so DGNet [2], which is tailored to additional edge features (i.e., Canny edges), fails to detect the camouflaged object under frequency-domain clues.

To this end, we propose an adaptive guidance learning network, termed AGLNet, which unifies the exploration and guidance of any kind of effective additional cue into an end-to-end learnable model that fully aggregates additional features and image features to guide camouflaged object detection. Specifically, the additional cue is first learned in convolutional space by the designed additional information generation (AIG) module. Then, the learned additional cue is fully integrated with image features in a multi-level fusion manner by the proposed hierarchical feature combination (HFC) module to guide camouflaged feature learning. After that, a recalibration decoder (RD) is presented to further fuse and refine different features for accurate camouflaged object segmentation through multi-layer, multi-step calibration. Extensive experiments show that the proposed model can be adapted to explore and incorporate various additional information, such as boundary, edge, texture, and frequency cues. Our contributions can be summarized as follows:

  • We propose a powerful adaptive guidance learning network that can incorporate various additional cues into image features to guide the detection of camouflaged objects. To the best of our knowledge, we are the first to explore a unified end-to-end framework that adapts to various additional information for COD tasks.

  • We propose a hierarchical feature combination (HFC) module to deeply integrate additional cues with image features in a multi-level manner to make full use of additional information. Furthermore, we design a recalibration decoder (RD) for iterative calibration and aggregation of different features for object prediction.

  • Extensive quantitative and qualitative experiments demonstrate the applicability and effectiveness of the proposed method to different additional information and its superior performance, surpassing 20 recent state-of-the-art COD methods by a large margin.

II Related Work

II-A Camouflaged Object Detection

Camouflaged object detection (COD) is a challenging task that aims to discover objects that are highly similar to the environment [21]. Recent developments in deep learning techniques and the release of large-scale COD datasets (e.g., COD10K [10]) have paved the way for research into deep learning-based camouflaged object detection. Since then, numerous COD methods have been proposed, and the performance leaderboard has been continuously refreshed on several widely used COD benchmarks. Some methods adopt a coarse-to-fine learning strategy to explore and integrate multi-scale camouflaged features for object detection, such as SINet [10, 22], CamoFormer [23], C2FNet [24], PFNet [11], SegMaR [25], PreyNet [26] and FSPNet [12]. In particular, C2FNet proposes a context-aware fusion network that combines multi-level features and attention coefficients to generate rich contextual information. PFNet uses a distraction mining strategy to locate potential objects globally and refines predictions by focusing on key regions. SegMaR, pioneering multi-stage detection, excels in small camouflaged object scenarios. PreyNet also utilizes multi-stage detection to distinguish between sensory and cognitive mechanisms. Furthermore, FSPNet introduces a novel transformer-based pyramid network, achieving accurate segmentation of camouflaged objects by gradually shrinking neighboring transformer features in a hierarchical manner. Some methods incorporate confidence-aware learning to improve feature learning for difficult samples, such as UGTR [13] and ZoomNet [14]. UGTR combines Bayesian learning and transformers to address the camouflaged object problem by introducing both probabilistic information and determinism. ZoomNet considers the expressive characteristics of different scales and enhances feature expression through scale aggregation. Inspired by the advances in multi-modal learning [27, 28], some methods introduce additional information, such as boundary [16, 17, 18, 29], edge [2, 30], texture [19], fixations [31], motion [32, 33], and saliency [20], to facilitate camouflaged feature exploration. Classification [34] and salient object detection [35] have also been jointly learned with COD in multi-task learning frameworks; the idea behind [34] and [35] is that introducing different tasks can enhance the accuracy and robustness of camouflaged object segmentation. More recently, collaborative feature exploration from a group of relevant images has been proposed to enhance camouflaged object detection performance by learning from multiple images containing objects of the same category [36, 20].

Figure 2: Overall architecture of our adaptive guidance learning network (AGLNet) for COD. The input image is processed by a visual backbone and an additional information generation (AIG) module to extract multi-scale image features and learn additional cues, respectively. Both sets of features are deeply integrated to guide the learning of camouflaged features in the hierarchical feature combination (HFC) module, which consists of combination and decoupling. Finally, the fused features are iteratively aggregated and refined with backbone and additional features by the recalibration decoder (RD) for object prediction.

II-B Additional Cues for COD

In salient object detection, many methods have attempted to enhance performance by integrating additional information, such as edge information [37, 38] and high-resolution input [39, 40]. Zhao et al. [37] utilize extensive edge and location information to more precisely locate the boundaries of salient objects. Zeng et al. [39] take higher-resolution image features as input, combining global semantic information with local high-resolution details to iteratively produce high-resolution predictions.

Building on these salient object detection methods, various studies have explored integrating additional cues into camouflaged object detection.

By introducing auxiliary cues such as texture maps and edge maps into camouflaged models, these models can be sensitized to subtle distinctions between foreground and background, notably variations in texture, edge salience, or gradient transitions. Moreover, some researchers argue that relying exclusively on the RGB color space may not fully exploit the information contained in images; consequently, the frequency domain has been proposed as an auxiliary cue. Frequency-domain analysis provides information about different frequencies in an image, which may not be prominent or easily detectable in the RGB domain. For example, Zhu et al. [18] and Sun et al. [16] introduce boundary cues to highlight the camouflaged boundary between the background and foreground of an image and enhance the model's understanding of the boundary. Ji et al. [41] and He et al. [30] incorporate edge information to explore the semantic information of target edges. Zhu et al. [19] combine texture labels to make the network focus more on the structure and details of the target. Zhong et al. [1] and Cong et al. [42] use frequency-domain cues to improve camouflaged object detection. He et al. [43] decompose foreground and background features into different frequency bands while constructing edge information to assist in generating accurate predictions.

Despite the strides made in combining image features with auxiliary cues for COD, challenges remain. Most existing approaches are bespoke solutions tailored to a specific type of additional information, which limits their applicability to other types of cues. In this paper, we propose a novel adaptive guidance learning model that alleviates this issue through a unified end-to-end framework that adapts to any kind of additional information and guides camouflaged feature learning via hierarchical feature combination.

III Methodology

The overall architecture of the proposed AGLNet is shown in Fig. 2. The framework first uses additional information generation (AIG) to learn additional cues, which can be used as guidance for camouflaged feature learning. Then, the learned additional features are deeply integrated with multi-scale backbone features to explore the critical camouflaged object features by the designed hierarchical feature combination (HFC) module. To make full use of additional cues, we adopt a multi-level fusion manner to incorporate additional information at the combination stage and decoupling stage. After that, a recalibration decoder (RD) is adopted to aggregate and refine multiple features in a multi-level calibration manner for camouflaged object segmentation.

III-A Additional Information Generation (AIG)

The additional cues contain valuable features that are not captured by the backbone network. Some additional information, such as frequency-domain cues, also shows large modal differences from RGB spatial-domain features. If the two features are integrated directly, they may interfere with each other, resulting in the loss of key features or the introduction of noise. To avoid this issue, we design a simple but effective additional information generation (AIG) module to learn the additional cues in CNN space, so that they can be easily merged into image features. Specifically, AIG contains three layers, each consisting of an average pooling operation and a convolution operation. AIG learns the additional feature $\mathcal{A}\in\mathbb{R}^{\frac{H}{8}\times\frac{W}{8}\times C}$ ($C=64$) from the input RGB image $\mathbf{I}\in\mathbb{R}^{H\times W\times 3}$, which provides explicit semantic cues complementary to the backbone features. A 1$\times$1 convolution layer is then adopted to produce the prediction of additional cues $r^{s}$, supervised by the ground truth of the additional information. We follow existing approaches in the COD task to obtain additional information labels, including boundary, texture, edge (i.e., Canny), and frequency, which are detailed below.
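To make the AIG design concrete, a minimal PyTorch sketch is given below. It is only an illustration under stated assumptions (stride-2 average pooling in each of the three layers to reach the 1/8 resolution, 3×3 convolutions with BatchNorm/ReLU), not the authors' exact implementation; the cue-label options that supervise $r^{s}$ are listed after it.

```python
import torch
import torch.nn as nn

class AIG(nn.Module):
    """Additional Information Generation (sketch).

    Three (AvgPool -> Conv) layers map the RGB image I (H x W x 3) to an
    additional-cue feature A of shape (H/8, W/8, C) with C = 64, plus a 1x1
    convolution that predicts the cue map r^s for supervision. Kernel sizes
    and the use of BatchNorm/ReLU are assumptions, not taken from the paper.
    """

    def __init__(self, channels: int = 64):
        super().__init__()

        def block(cin, cout):
            return nn.Sequential(
                nn.AvgPool2d(kernel_size=2, stride=2),   # halve the resolution
                nn.Conv2d(cin, cout, kernel_size=3, padding=1),
                nn.BatchNorm2d(cout),
                nn.ReLU(inplace=True),
            )

        self.layers = nn.Sequential(
            block(3, channels),          # H/2
            block(channels, channels),   # H/4
            block(channels, channels),   # H/8
        )
        self.pred = nn.Conv2d(channels, 1, kernel_size=1)  # produces r^s

    def forward(self, image):
        a = self.layers(image)           # A: (B, C, H/8, W/8)
        r_s = self.pred(a)               # cue prediction, supervised by MSE
        return a, r_s

# Usage: a, r_s = AIG()(torch.randn(1, 3, 704, 704))  -> a has shape (1, 64, 88, 88)
```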

  • Boundary [16, 17, 18]. Get the object boundary from ground truth (GT) as the corresponding boundary map.

  • Texture [19]. Get object texture via contour edge map (ConEdge), texture map (Texture), and GT:

    \text{ConEdge}+\text{Canny}\times GT=\text{Texture}   (1)
  • Canny [2]. Get the object’s canny information via standard canny edge detector [44] and GT:

    \mathbf{Z}^{G}=\mathcal{F}_{E}(\mathcal{I}(x,y))\otimes\mathbf{Z}^{C}   (2)

    where $\mathcal{F}_{E}$ represents the standard Canny edge detector for the input $\mathcal{I}(x,y)$ with discrete pixel coordinates $(x,y)$, $\otimes$ means element-wise multiplication, and $\mathbf{Z}^{C}$ means the object-level ground truth.

  • Frequency [1]. Get frequency domain information by discrete cosine transform (DCT) of RGB images.
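For reference, these additional-cue labels can be generated offline from the RGB image and its ground-truth mask with standard tools. The snippet below is a hedged sketch: the boundary is approximated by morphological dilation minus erosion, the Canny thresholds (50, 150) are arbitrary choices, and the 2D DCT uses the common SciPy idiom; none of these exact settings are prescribed by the paper.

```python
import cv2
import numpy as np
from scipy.fftpack import dct

def boundary_label(gt_mask: np.ndarray, width: int = 3) -> np.ndarray:
    """Object boundary from a binary {0, 1} uint8 GT mask (dilation minus erosion)."""
    kernel = np.ones((width, width), np.uint8)
    return cv2.dilate(gt_mask, kernel) - cv2.erode(gt_mask, kernel)

def canny_label(image_gray: np.ndarray, gt_mask: np.ndarray) -> np.ndarray:
    """Eq. (2): Canny edges of the (uint8 grayscale) image masked by the object-level GT."""
    edges = cv2.Canny(image_gray, 50, 150)           # thresholds are assumptions
    return (edges / 255.0) * gt_mask                  # element-wise multiplication

def frequency_label(image_gray: np.ndarray) -> np.ndarray:
    """2D discrete cosine transform of the grayscale image."""
    x = image_gray.astype(np.float32)
    return dct(dct(x.T, norm='ortho').T, norm='ortho')
```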

III-B Hierarchical Feature Combination (HFC)

Preliminary. Motivated by the observation [45] that low-level features consume more computational resources and contribute less to performance, we adopt the top three high-level features of the visual backbone as our multi-scale backbone features, denoted as $\mathcal{X}^{r}_{i}$, $i\in\{1,2,3\}$, whose resolutions are $\frac{H}{k}\times\frac{W}{k}$, $k\in\{8,16,32\}$.

Combination. We observe that cascade structures can efficiently aggregate multi-level features for accurate object detection [45] and that the cooperation of adjacent features can localize objects well. Therefore, we design a novel multi-scale feature combination (MFC) module. First, we build a convolution block with different kernel sizes to enhance visual features. Specifically, we process the feature $\mathcal{X}^{r}_{i}$ with a $1\times 1$ convolution, followed by two parallel convolutions with a $5\times 5$ kernel and a $7\times 7$ kernel, respectively. Then, element-wise summation is performed over the features of the two branches. Finally, the summed features are fed to a $3\times 3$ convolution layer to obtain the final result $\mathcal{X}^{c}_{i}\in\mathbb{R}^{\frac{H}{k}\times\frac{W}{k}\times C}$. We utilize the high-resolution details of the shallow layers for accurate localization and the semantic information of the deep layers to ensure semantic consistency between layers. This is defined as:

\left\{\begin{aligned} g_{3}&=\mathcal{X}^{c}_{3},\\ g_{2}&=\mathcal{X}^{c}_{2}\otimes{\rm UP}_{\times 2}(g_{3}),\\ g_{1}&=\mathcal{X}^{c}_{1}\otimes{\rm UP}_{\times 2}(g_{2})\otimes{\rm UP}_{\times 2}(\mathcal{X}^{c}_{2})\otimes{\rm UP}_{\times 4}(g_{3})\end{aligned}\right.   (3)

where ${\rm UP}_{\times t}(\cdot)$ denotes a $t\times$ bilinear upsampling operation and $g_{i}\in\mathbb{R}^{\frac{H}{k}\times\frac{W}{k}\times C}$.

Next, those features are integrated with the additional information feature $\mathcal{A}$ to generate the initial combined feature $\mathcal{S}$. The integration process can be denoted as:

\left\{\begin{aligned} \mathcal{S}_{3}&={\rm Conv}_{3\times 3}([{\rm Dn}_{\times 4}(\mathcal{A}),g_{3}]),\\ \mathcal{S}_{2}&={\rm Conv}_{3\times 3}([{\rm Dn}_{\times 2}(\mathcal{A}),g_{2},\mathcal{S}_{3}]),\\ \mathcal{S}_{1}&={\rm Conv}_{3\times 3}([\mathcal{A},g_{1},\mathcal{S}_{2}]),\\ \mathcal{S}&={\rm Conv}_{3\times 3}(\mathcal{S}_{1}),\end{aligned}\right.   (4)

where $[\cdot]$ is concatenation, ${\rm Dn}_{\times t}(\cdot)$ is a $t\times$ bilinear downsampling and ${\rm Conv}_{3\times 3}$ is a $3\times 3$ convolution operation. The corresponding channel numbers of $\mathcal{S}_{3},\mathcal{S}_{2},\mathcal{S}_{1},\mathcal{S}$ are $C$, $2C$, $3C$, $3C$, respectively.
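To illustrate how Eqs. (3) and (4) fit together, a compact PyTorch sketch of the Combination stage is given below. It is our reading of the equations: padding choices, the upsampling of $\mathcal{S}_{3}$ and $\mathcal{S}_{2}$ before concatenation (implied by the differing resolutions), and the omission of normalization layers are assumptions rather than details taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def up(x, t):    # t-times bilinear upsampling, UP_{xt}(.)
    return F.interpolate(x, scale_factor=t, mode='bilinear', align_corners=False)

def down(x, t):  # t-times bilinear downsampling, Dn_{xt}(.)
    return F.interpolate(x, scale_factor=1.0 / t, mode='bilinear', align_corners=False)

class EnhanceBlock(nn.Module):
    """1x1 conv -> parallel 5x5 and 7x7 convs (summed) -> 3x3 conv, producing X_i^c."""
    def __init__(self, cin, c=64):
        super().__init__()
        self.reduce = nn.Conv2d(cin, c, 1)
        self.branch5 = nn.Conv2d(c, c, 5, padding=2)
        self.branch7 = nn.Conv2d(c, c, 7, padding=3)
        self.fuse = nn.Conv2d(c, c, 3, padding=1)

    def forward(self, x):
        x = self.reduce(x)
        return self.fuse(self.branch5(x) + self.branch7(x))

class Combination(nn.Module):
    def __init__(self, c=64):
        super().__init__()
        self.s3 = nn.Conv2d(2 * c, c, 3, padding=1)        # [Dn_x4(A), g3]      -> C
        self.s2 = nn.Conv2d(3 * c, 2 * c, 3, padding=1)    # [Dn_x2(A), g2, S3]  -> 2C
        self.s1 = nn.Conv2d(4 * c, 3 * c, 3, padding=1)    # [A, g1, S2]         -> 3C
        self.s = nn.Conv2d(3 * c, 3 * c, 3, padding=1)     # S                   -> 3C

    def forward(self, x1c, x2c, x3c, a):
        # Cascade of Eq. (3): element-wise products of adjacent levels.
        g3 = x3c
        g2 = x2c * up(g3, 2)
        g1 = x1c * up(g2, 2) * up(x2c, 2) * up(g3, 4)
        # Integration of Eq. (4); S3/S2 are upsampled before concatenation (assumed).
        s3 = self.s3(torch.cat([down(a, 4), g3], dim=1))
        s2 = self.s2(torch.cat([down(a, 2), g2, up(s3, 2)], dim=1))
        s1 = self.s1(torch.cat([a, g1, up(s2, 2)], dim=1))
        return self.s(s1)                                  # S, with 3C channels

# x_i^c would be produced by EnhanceBlock applied to each backbone feature X_i^r.
```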

Decoupling. To further explore camouflaged object semantics, we design a dual-branch architecture to guide decoupling. In the first branch, $\mathcal{S}$ is decoupled into three groups of features, i.e., $\{s_{1},s_{2},s_{3}\}$, which are then each processed by a convolution operation. In the other branch, $\mathcal{S}$ is first processed by two convolution layers after an average pooling. The activation function of the last convolution layer is Softmax, which is used to learn the weights of the feature channels, namely $w\in\mathbb{R}^{1\times 1\times 3C}$. The weight $w$ is then split into $\{w_{1},w_{2},w_{3}\}$, and each decoupled feature is multiplied by its corresponding weight.

To capture more features of camouflaged objects, additional information features are incorporated to guide feature learning of camouflaged objects. The above operations are described as:

\left\{\begin{aligned} d_{1}&={\rm Conv}_{3\times 3}([w_{1}\otimes{\rm Conv}_{3\times 3}(s_{1}),\mathcal{A}]),\\ d_{2}&={\rm Conv}_{3\times 3}([w_{2}\otimes{\rm Conv}_{3\times 3}(s_{2}),\mathcal{A}]),\\ d_{3}&={\rm Conv}_{3\times 3}([w_{3}\otimes{\rm Conv}_{3\times 3}(s_{3}),\mathcal{A}]),\end{aligned}\right.   (5)

where $\otimes$ denotes element-wise multiplication. Then, we obtain the initial prediction map $r_{4}={\rm Conv}_{1\times 1}([d_{1},d_{2},d_{3}])$.
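A hedged PyTorch sketch of this dual-branch Decoupling step follows; the widths of the group convolutions, the ReLU between the two weight-branch convolutions, and the output channel count of each $d_{i}$ are our assumptions, while the split/re-weight/fuse structure follows Eq. (5).

```python
import torch
import torch.nn as nn

class Decoupling(nn.Module):
    """Split S (3C channels) into three groups, reweight them with learned
    channel weights, fuse each group with the additional feature A (Eq. (5)),
    and produce the initial prediction r4."""

    def __init__(self, c=64):
        super().__init__()
        self.group_convs = nn.ModuleList(
            [nn.Conv2d(c, c, 3, padding=1) for _ in range(3)])
        # Weight branch: global average pooling -> two 1x1 convs, Softmax at the end.
        self.weight = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(3 * c, 3 * c, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(3 * c, 3 * c, 1),
            nn.Softmax(dim=1),
        )
        self.fuse = nn.ModuleList(
            [nn.Conv2d(2 * c, c, 3, padding=1) for _ in range(3)])
        self.pred = nn.Conv2d(3 * c, 1, 1)   # r4 = Conv_1x1([d1, d2, d3])

    def forward(self, s, a):
        s1, s2, s3 = torch.chunk(s, 3, dim=1)              # decouple S
        w1, w2, w3 = torch.chunk(self.weight(s), 3, dim=1)  # channel weights
        d = [fuse(torch.cat([w * conv(si), a], dim=1))       # Eq. (5)
             for fuse, conv, si, w in zip(self.fuse, self.group_convs,
                                          (s1, s2, s3), (w1, w2, w3))]
        return self.pred(torch.cat(d, dim=1))                # initial map r4
```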

TABLE I: The specific parameters of the $M^{k}_{i}$ module.
No. Operation Input Size Output Size
#1 Conv1×1 $4\times C$ $4\times\frac{C}{2^{q}}$
   Split & Concat $4\times\frac{C}{2^{q}}$ $4\times\frac{C}{2^{q}}+2^{(n+1)}$
#2 Conv1×1 $4\times\frac{C}{2^{q}}+2^{(n+1)}$ $3\times\frac{C}{2^{q}}$
   Split & Concat $3\times\frac{C}{2^{q}}$ $3\times\frac{C}{2^{q}}+2^{(n+1)}$
#3 Conv1×1 $3\times\frac{C}{2^{q}}+2^{(n+1)}$ $2\times\frac{C}{2^{q}}$
   Split & Concat $2\times\frac{C}{2^{q}}$ $2\times\frac{C}{2^{q}}+2^{(n+1)}$

III-C Recalibration Decoder (RD)

Inspired by [46], after extracting the aggregated features, we design a recalibration decoder (RD) module which employs iterative calibration to refine the consistency of image features and additional features. RD further combines multi-scale backbone features and additional features to enhance the feature representation. It consists of three levels of iterative optimization, and each level is a well-designed feature refiner (FR), whose architecture is shown in Fig. 2. For each FR, the backbone visual feature $\mathcal{X}^{r}_{i}$ of the corresponding scale is first split and then combined with the prediction map from the previous level and the learned additional features. In FR, we perform multiple feature splits and merges, where the number of splits is $n=\{4,3,2\}$, and each split feature is merged with $r_{i+1}$ and the additional cue mask. The specific parameters of the $M^{k}_{i}$ module are shown in Table I, where $q=\{2^{2},2^{1},2^{0}\}$. The RD module can be formulated as:

\left\{\begin{aligned} r_{3}&={\rm FR}_{3}(\mathcal{X}^{r}_{3},r_{4},r^{s}),\\ r_{2}&={\rm FR}_{2}(\mathcal{X}^{r}_{2},r_{3},r^{s}),\\ r_{1}&={\rm FR}_{1}(\mathcal{X}^{r}_{1},r_{2},r^{s}),\end{aligned}\right.   (6)

On one hand, FR can facilitate the fusion of image features and additional features at different scales. On the other hand, FR adopts multiple iterations to boost accurate segmentation within a certain scale.
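At a high level, the RD stage amounts to three FR calls applied from deep to shallow (Eq. (6)), each consuming the backbone feature of the corresponding scale, the prediction from the previous level, and the learned cue prediction $r^{s}$. The sketch below deliberately simplifies the $M^{k}_{i}$ split-and-concat machinery of Table I into a plain residual convolutional refinement and uses placeholder channel widths; it illustrates only the data flow, not the authors' exact FR design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleFR(nn.Module):
    """Simplified Feature Refiner: fuse a backbone feature with the previous
    prediction r_{i+1} and the cue prediction r^s, then refine it iteratively.
    The actual FR uses the M_i^k split-and-concat blocks of Table I instead."""

    def __init__(self, cin, c=64, iterations=3):
        super().__init__()
        self.reduce = nn.Conv2d(cin + 2, c, 3, padding=1)  # +2: r_{i+1} and r^s maps
        self.refine = nn.ModuleList(
            [nn.Conv2d(c, c, 3, padding=1) for _ in range(iterations)])
        self.pred = nn.Conv2d(c, 1, 1)

    def forward(self, x, r_prev, r_s):
        size = x.shape[-2:]
        r_prev = F.interpolate(r_prev, size=size, mode='bilinear', align_corners=False)
        r_s = F.interpolate(r_s, size=size, mode='bilinear', align_corners=False)
        f = self.reduce(torch.cat([x, r_prev, r_s], dim=1))
        for conv in self.refine:                     # multi-step calibration
            f = F.relu(conv(f)) + f
        return self.pred(f)                          # prediction r_i at this scale

class RD(nn.Module):
    def __init__(self, backbone_channels=(64, 128, 256)):  # placeholder widths
        super().__init__()
        self.frs = nn.ModuleList([SimpleFR(c) for c in backbone_channels])

    def forward(self, feats, r4, r_s):
        # feats = (X_1^r, X_2^r, X_3^r); process from the deepest level, Eq. (6).
        r, preds = r4, [None, None, None]
        for i in (2, 1, 0):
            r = self.frs[i](feats[i], r, r_s)
            preds[i] = r
        return preds                                 # [r1, r2, r3]
```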

III-D Loss Function

Our loss function consists of additional information generation loss and camouflaged object detection loss. For the former, we reshape $r^{s}$ to the input image size and calculate the MSE loss. For the latter, we also reshape each prediction ($r_{i}$) to the input image size and adopt the weighted BCE loss ($\mathcal{L}_{BCE}$) and the weighted IoU loss ($\mathcal{L}_{IoU}$) [47]. Therefore, our loss function is defined as:

\mathcal{L}_{total}=\sum\nolimits_{i=1}^{3}\left(\mathcal{L}_{BCE}(r_{i},GT)+\mathcal{L}_{IoU}(r_{i},GT)\right)+\mathcal{L}_{MSE}(r^{s},\mathcal{D}^{s}),   (7)

where $\mathcal{D}^{s}$ denotes the ground truth of the additional information.
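A common PyTorch realization of this objective is sketched below: the weighted BCE/IoU term follows the widely used structure loss of [47] (with its standard local-average-pooling weighting), and the MSE term supervises the cue prediction $r^{s}$. The resizing step matches the description above, while the exact weighting hyper-parameters are assumptions.

```python
import torch
import torch.nn.functional as F

def structure_loss(pred, mask):
    """Weighted BCE + weighted IoU loss, as popularized by F3Net [47]."""
    weit = 1 + 5 * torch.abs(
        F.avg_pool2d(mask, kernel_size=31, stride=1, padding=15) - mask)
    wbce = F.binary_cross_entropy_with_logits(pred, mask, reduction='none')
    wbce = (weit * wbce).sum(dim=(2, 3)) / weit.sum(dim=(2, 3))

    pred = torch.sigmoid(pred)
    inter = ((pred * mask) * weit).sum(dim=(2, 3))
    union = ((pred + mask) * weit).sum(dim=(2, 3))
    wiou = 1 - (inter + 1) / (union - inter + 1)
    return (wbce + wiou).mean()

def total_loss(preds, gt, r_s, cue_gt):
    """Eq. (7): structure losses over r1..r3 plus MSE on the cue map r^s.
    All predictions are resized to the ground-truth resolution first."""
    size = gt.shape[-2:]
    loss = sum(structure_loss(
        F.interpolate(r, size=size, mode='bilinear', align_corners=False), gt)
        for r in preds)
    r_s = F.interpolate(r_s, size=size, mode='bilinear', align_corners=False)
    return loss + F.mse_loss(r_s, cue_gt)
```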
TABLE II: Quantitative comparisons of our proposed method and 20 other state-of-the-art methods on three widely used benchmark datasets. The higher the $S_{\alpha}$, $F_{\beta}^{\omega}$, $F_{m}$, and $E_{m}$, the better the performance; the smaller the $MAE$, the better. The best results are marked in bold, and the second-best results are underlined.
Method COD10K NC4K CAMO
$S_{\alpha}\uparrow$ $F_{\beta}^{\omega}\uparrow$ $F_{m}\uparrow$ $E_{m}\uparrow$ $MAE\downarrow$ $S_{\alpha}\uparrow$ $F_{\beta}^{\omega}\uparrow$ $F_{m}\uparrow$ $E_{m}\uparrow$ $MAE\downarrow$ $S_{\alpha}\uparrow$ $F_{\beta}^{\omega}\uparrow$ $F_{m}\uparrow$ $E_{m}\uparrow$ $MAE\downarrow$
2020 SINet [10] 0.772 0.543 0.640 0.810 0.051 0.810 0.665 0.741 0.841 0.066 0.753 0.602 0.676 0.774 0.097
2021 PFNet [11] 0.797 0.656 0.698 0.875 0.039 0.826 0.743 0.783 0.884 0.054 0.774 0.683 0.737 0.832 0.087
2021 LSR [31] 0.805 0.660 0.703 0.876 0.039 0.832 0.743 0.785 0.888 0.053 0.793 0.703 0.753 0.850 0.083
2021 C2FNet [24] 0.811 0.680 0.722 0.890 0.036 0.839 0.763 0.805 0.896 0.050 0.782 0.698 0.751 0.838 0.082
2021 MGL [17] 0.815 0.667 0.709 0.852 0.035 0.832 0.739 0.782 0.868 0.053 0.772 0.670 0.725 0.811 0.089
2021 UGTR [13] 0.818 0.668 0.725 0.894 0.035 0.839 0.749 0.812 0.892 0.048 0.784 0.687 0.741 0.844 0.086
2021 UJSC [35] 0.818 0.702 0.737 0.892 0.033 0.840 0.772 0.817 0.899 0.047 0.793 0.721 0.766 0.854 0.078
2022 SINet-V2 [22] 0.815 0.674 0.711 0.885 0.037 0.848 0.768 0.801 0.902 0.047 0.819 0.743 0.781 0.882 0.070
2022 R-MGL_v2 [48] 0.816 0.689 0.733 0.879 0.034 0.838 0.758 0.801 0.899 0.050 0.769 0.672 0.731 0.847 0.086
2022 BSANet [18] 0.818 0.699 0.738 0.890 0.034 0.841 0.771 0.817 0.897 0.048 0.794 0.717 0.763 0.851 0.079
2022 FAPNet [49] 0.822 0.694 0.731 0.888 0.036 0.851 0.775 0.810 0.899 0.047 0.815 0.734 0.776 0.865 0.076
2022 BGNet [16] 0.831 0.722 0.753 0.901 0.033 0.851 0.788 0.820 0.907 0.044 0.812 0.749 0.789 0.870 0.073
2022 SegMaR [25] 0.833 0.724 0.757 0.899 0.034 0.841 0.781 0.820 0.896 0.046 0.815 0.753 0.795 0.874 0.071
2022 FDCOD [1] 0.837 0.731 0.749 0.918 0.030 0.834 0.750 0.784 0.894 0.052 0.844 0.778 0.809 0.898 0.062
2022 ZoomNet [14] 0.838 0.729 0.766 0.888 0.029 0.853 0.784 0.818 0.896 0.043 0.820 0.752 0.794 0.878 0.066
2023 DGNet [2] 0.822 0.693 0.728 0.896 0.033 0.857 0.784 0.814 0.911 0.042 0.839 0.769 0.806 0.901 0.057
2023 FEDER [30] 0.822 0.716 0.751 0.900 0.032 0.847 0.789 0.824 0.907 0.044 0.802 0.738 0.781 0.867 0.071
2023 PopNet [50] 0.851 0.757 0.786 0.910 0.028 0.861 0.802 0.833 0.909 0.042 0.808 0.744 0.784 0.859 0.077
2023 HitNet [40] 0.868 0.798 0.806 0.932 0.024 0.870 0.825 0.853 0.921 0.039 0.844 0.801 0.831 0.902 0.057
2023 FSPNet [12] 0.851 0.735 0.769 0.895 0.026 0.879 0.816 0.843 0.915 0.035 0.856 0.799 0.830 0.899 0.050
AGLNet-Boundary 0.870 0.785 0.808 0.930 0.024 0.883 0.830 0.854 0.929 0.035 0.867 0.816 0.843 0.917 0.053
AGLNet-Texture 0.871 0.786 0.809 0.928 0.024 0.884 0.834 0.857 0.930 0.034 0.868 0.823 0.850 0.920 0.049
AGLNet-Canny 0.873 0.789 0.811 0.930 0.023 0.884 0.833 0.855 0.929 0.034 0.870 0.823 0.848 0.916 0.050
AGLNet-Frequency 0.875 0.791 0.813 0.933 0.023 0.889 0.836 0.858 0.934 0.033 0.874 0.825 0.851 0.918 0.050
(a) Image, (b) GT, (c) Ours, (d) FSPNet, (e) FEDER, (f) FDCOD, (g) DGNet, (h) SINet
Figure 3: Qualitative comparison of our proposed method and other representative COD methods. Our method provides better performance than all competitors for camouflaged object segmentation in various complex scenes.

IV Experiments

IV-A Experimental Setup

Datasets. We conduct experiments on three widely used benchmark datasets of the COD task, i.e., CAMO [34], COD10K [10] and NC4K [31]. In particular, CAMO, covering eight categories, contains 1,250 camouflaged images and 1,250 non-camouflaged images. COD10K consists of 5,066 camouflaged, 1,934 non-camouflaged, and 3,000 background images; it is currently the largest COD dataset and covers 10 superclasses and 78 subclasses. NC4K is a recently published dataset with a total of 4,121 camouflaged images. Following the standard practice of COD tasks, we use 3,040 images from COD10K and 1,000 images from CAMO as the training set and the remaining data as the test set.

Evaluation Metrics. According to the standard evaluation protocol of COD, we employ the five common metrics to evaluate our model, i.e., structure-measure ($S_{\alpha}$) [51], weighted F-measure ($F_{\beta}^{\omega}$) [52], mean F-measure ($F_{m}$), mean E-measure ($E_{m}$) [53] and mean absolute error ($MAE$).
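As a simple illustration of the protocol, $MAE$ is the mean absolute difference between the (normalized) prediction map and the binary ground truth; the remaining metrics follow their original definitions in [51, 52, 53]. A minimal sketch:

```python
import numpy as np

def mae(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean absolute error between a prediction map and a binary GT mask.
    The min-max normalization of the prediction is a common convention."""
    pred = (pred - pred.min()) / (pred.max() - pred.min() + 1e-8)
    gt = (gt > 0.5).astype(np.float64)
    return float(np.mean(np.abs(pred - gt)))
```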

Implementation Details. All experiments are implemented with the PyTorch toolbox. The visual backbone we adopt is EfficientNet-B4 [54] pretrained on ImageNet unless otherwise stated. We use Adam [55] as our model optimizer with a learning rate of 1e-4, which is adjusted according to a cosine annealing strategy [56] with a period of 40 epochs and a minimum learning rate of 1e-5. We train our AGLNet for 100 epochs with a batch size of 8, which takes about 9 hours on an NVIDIA GeForce RTX 3090 GPU. During training and inference, the input images are resized to 704$\times$704 via bilinear interpolation and augmented by random flipping, cropping, and color jittering.
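The training recipe above maps directly onto standard PyTorch components. The following sketch reproduces the optimizer and cosine-annealing schedule as described (Adam, learning rate 1e-4 annealed to 1e-5 with a 40-epoch period, 100 epochs, batch size 8); `AGLNet`, `train_loader`, and the model's output interface are placeholders, not the released code.

```python
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import CosineAnnealingLR

model = AGLNet().cuda()                 # placeholder: the full network
optimizer = Adam(model.parameters(), lr=1e-4)
scheduler = CosineAnnealingLR(optimizer, T_max=40, eta_min=1e-5)  # 40-epoch period

for epoch in range(100):
    model.train()
    for images, gt, cue_gt in train_loader:     # placeholder loader, 704x704 inputs
        images, gt, cue_gt = images.cuda(), gt.cuda(), cue_gt.cuda()
        preds, r_s = model(images)              # assumed interface: (r1, r2, r3), r^s
        loss = total_loss(preds, gt, r_s, cue_gt)   # Eq. (7), as sketched above
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```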

Competitors. Our AGLNet is compared with 20 recent state-of-the-art methods, including SINet [10], PFNet [11], LSR [31], C2FNet [24], MGL [17], UGTR [13], UJSC [35], SINet-V2 [22], R-MGL_v2 [48], BSANet [18], FAPNet [49], BGNet [16], SegMaR [25], FDCOD [1], ZoomNet [14], DGNet [2], FEDER [30], PopNet [50], HitNet [40], FSPNet [12]. For a fair comparison, all results are either provided by the authors or reproduced by an open-source model re-trained on the same training set with the recommended setting.

IV-B Comparisons with the State-of-the-arts

Quantitative Evaluation. Table II shows the comparison results of AGLNet with 20 cutting-edge methods. We use the most common additional cues currently available, including boundary, texture, Canny, and frequency. It can be seen that our proposed AGLNet achieves significant performance improvements, regardless of which additional information is used, and outperforms the other comparison methods on all datasets. Note that, unless otherwise specified, AGLNet below refers to the variant using frequency as the additional clue. Compared with FDCOD [1], which also introduces frequency-domain cues for COD, our AGLNet shows a large performance improvement. Specifically, our method increases the performance on the three COD datasets by an average of 4.9%, 8.6%, 7.7%, 2.8%, and 26.4% for $S_{\alpha}$, $F_{\beta}^{\omega}$, $F_{m}$, $E_{m}$ and $MAE$, respectively. The performance improvement can be attributed to the learnable additional information exploration and the deep multi-level integration within the same domain in the proposed AGLNet, which leads to a more effective incorporation of additional cues to guide object prediction. Compared with another well-performing method that does not use additional cues, e.g., ZoomNet, our method improves $S_{\alpha}$ by 5.1%, $F_{\beta}^{\omega}$ by 8.3%, $F_{m}$ by 6.9%, $E_{m}$ by 3.5% and $MAE$ by 23.6% on average. Compared with the second-best competitor, HitNet, which adopts a powerful transformer as the backbone, our method still achieves better detection performance, with 2.1%, 1.3%, 0.5%, 1.4% and 15.4% improvements on the NC4K dataset, and 3.6%, 3.0%, 2.4%, 1.8% and 12.3% improvements on the CAMO dataset. As a result, our AGLNet shows effectiveness and superior performance in detecting camouflaged objects compared with existing methods.

TABLE III: Ablation studies of our AGLNet. Note that Combination and Decoupling are sub-components of HFC.
No. Component COD10K NC4K CAMO
Baseline Combination Decoupling RD AIG $S_{\alpha}\uparrow$ $F_{\beta}^{\omega}\uparrow$ $F_{m}\uparrow$ $E_{m}\uparrow$ $MAE\downarrow$ $S_{\alpha}\uparrow$ $F_{\beta}^{\omega}\uparrow$ $F_{m}\uparrow$ $E_{m}\uparrow$ $MAE\downarrow$ $S_{\alpha}\uparrow$ $F_{\beta}^{\omega}\uparrow$ $F_{m}\uparrow$ $E_{m}\uparrow$ $MAE\downarrow$
#1 \checkmark 0.829 0.701 0.731 0.881 0.033 0.846 0.732 0.776 0.871 0.050 0.840 0.736 0.780 0.868 0.068
#2 \checkmark \checkmark 0.859 0.757 0.782 0.905 0.026 0.879 0.813 0.841 0.912 0.034 0.860 0.808 0.837 0.905 0.051
#3 \checkmark \checkmark \checkmark 0.862 0.773 0.794 0.918 0.026 0.880 0.824 0.850 0.921 0.035 0.863 0.810 0.838 0.913 0.052
#4 \checkmark \checkmark \checkmark \checkmark 0.865 0.779 0.799 0.921 0.025 0.882 0.828 0.852 0.926 0.035 0.866 0.814 0.841 0.914 0.052
#OUR \checkmark \checkmark \checkmark \checkmark \checkmark 0.875 0.791 0.813 0.933 0.023 0.889 0.836 0.858 0.934 0.033 0.874 0.825 0.851 0.918 0.050
Figure 4: Visual comparison of the proposed Combination part. (a) input image, (b) ground-truth, (c) baseline, and (d) baseline + Combination.
Figure 5: Visual comparison of the Decoupling part. (a) input image, (b) ground-truth, (c) baseline + Combination, and (d) baseline + Combination + Decoupling. Red circles show improvements.
Figure 6: Visual comparison of the proposed RD module. (a) input image, (b) ground-truth, (c) baseline + HFC, and (d) baseline + HFC + RD.
Figure 7: Visual comparison of the proposed AIG. (a) input image, (b) ground-truth, (c) AGLNet w/o AIG, and (d) AGLNet.

Qualitative Evaluation. Figure 3 shows the visual comparisons between our AGLNet and other representative COD methods in some challenging scenarios, including tiny objects (e.g., lines 1-2), occlusions (e.g., lines 3-4), and multiple objects (e.g., lines 5-6). These comparisons intuitively show a more competitive visual performance of our proposed AGLNet. With a good integration of the discriminative information provided by the additional cue, our AGLNet provides more accurate and complete camouflaged object localization and segmentation under various complex and highly similar backgrounds, even with the interference of noisy objects/regions (salient but non-camouflaged).

Figure 8: Visualization of the ablation experiment results for each parameter of FR. Note that we have normalized the results for better display.

IV-C Ablation Studies

Overview. We perform ablation studies on key components to verify their effectiveness and analyze their impacts on performance, as shown in Table III. Note that, for the baseline model, we remove all the additional modules, and then use convolution blocks to fuse the multi-level features in a top-down manner and generate predictions. Experimental results demonstrate that our designed HFC (including combination and decoupling), RD, and AIG can improve detection performance. When they are combined to build AGLNet, significant improvements in all evaluation metrics are observed.

Effectiveness of Combination. As can be seen from Table III (#2), compared with the baseline, Combination achieves a significant performance improvement, providing average gains of 3.3%, 9.6%, 7.6%, 3.9%, and 26.1% on $S_{\alpha}$, $F_{\beta}^{\omega}$, $F_{m}$, $E_{m}$ and $MAE$ over the three datasets, respectively. The Combination part fully interacts with and accumulates critical cues by dense aggregation of multi-scale backbone features, thus greatly enhancing the feature representation for COD. Figure 4 provides some visual results, showing the effectiveness of Combination for improving performance.

Figure 9: Ablation studies of the model adaptability to different additional cues on the COD10K dataset, in terms of (a) $S_{\alpha}$, (b) $F_{\beta}^{\omega}$, (c) $F_{m}$, (d) $E_{m}$, and (e) $MAE$.
Figure 10: Visual comparison of different additional cues. (a) input image, (b) ground-truth, (c) AGLNet w/o AIG, and (d)-(g) AGLNet using boundary, texture, Canny, and frequency as additional cues, respectively.
TABLE IV: Ablation studies of different visual backbones.
Method (Backbone) COD10K NC4K CAMO Params. (M)
$S_{\alpha}\uparrow$ $F_{\beta}^{\omega}\uparrow$ $F_{m}\uparrow$ $E_{m}\uparrow$ $MAE\downarrow$ $S_{\alpha}\uparrow$ $F_{\beta}^{\omega}\uparrow$ $F_{m}\uparrow$ $E_{m}\uparrow$ $MAE\downarrow$ $S_{\alpha}\uparrow$ $F_{\beta}^{\omega}\uparrow$ $F_{m}\uparrow$ $E_{m}\uparrow$ $MAE\downarrow$
ZoomNet (ResNet-50) 0.838 0.729 0.766 0.888 0.029 0.853 0.784 0.818 0.896 0.043 0.820 0.752 0.794 0.878 0.066 32.382
AGLNet (ResNet-50) 0.849 0.740 0.773 0.920 0.028 0.857 0.789 0.823 0.902 0.042 0.842 0.768 0.803 0.888 0.064 114.09
FDCOD (Res2Net-50) 0.837 0.731 0.749 0.918 0.030 0.834 0.750 0.784 0.894 0.052 0.844 0.778 0.809 0.898 0.062 197.41
AGLNet (Res2Net-50) 0.856 0.753 0.784 0.926 0.028 0.863 0.793 0.826 0.906 0.042 0.851 0.779 0.819 0.895 0.061 114.69
DGNet (EfficientNet-B4) 0.822 0.693 0.728 0.896 0.033 0.857 0.784 0.814 0.911 0.042 0.839 0.769 0.806 0.901 0.057 21.02
AGLNet (EfficientNet-B4) 0.875 0.791 0.813 0.933 0.023 0.889 0.836 0.858 0.934 0.033 0.874 0.825 0.851 0.918 0.050 93.65
TABLE V: Ablation studies of AGLNet at different input resolutions.
Method COD10K NC4K CAMO
$S_{\alpha}\uparrow$ $F_{\beta}^{\omega}\uparrow$ $F_{m}\uparrow$ $E_{m}\uparrow$ $MAE\downarrow$ $S_{\alpha}\uparrow$ $F_{\beta}^{\omega}\uparrow$ $F_{m}\uparrow$ $E_{m}\uparrow$ $MAE\downarrow$ $S_{\alpha}\uparrow$ $F_{\beta}^{\omega}\uparrow$ $F_{m}\uparrow$ $E_{m}\uparrow$ $MAE\downarrow$
SegMaR (704*704) 0.830 0.708 0.745 0.894 0.033 0.845 0.762 0.799 0.892 0.050 0.792 0.701 0.748 0.843 0.085
ZoomNet (704*704) 0.842 0.738 0.778 0.891 0.029 0.854 0.786 0.822 0.896 0.043 0.797 0.721 0.768 0.845 0.080
FDCOD (704*704) 0.843 0.733 0.761 0.902 0.030 0.842 0.761 0.793 0.896 0.049 0.850 0.784 0.821 0.886 0.059
HitNet (704*704) 0.868 0.798 0.806 0.932 0.024 0.870 0.825 0.853 0.921 0.039 0.844 0.801 0.831 0.902 0.057
AGLNet (704*704) 0.875 0.791 0.813 0.933 0.023 0.889 0.836 0.858 0.934 0.033 0.874 0.825 0.851 0.918 0.050

Effectiveness of Decoupling. As shown in Table III (#3), adding the Decoupling part further improves detection performance by an average of 1.2%, 0.9% and 1.1% on $F_{\beta}^{\omega}$, $F_{m}$ and $E_{m}$ over the three datasets, respectively. Decoupling adopts feature splitting and group-wise exploration to dig deep into different feature groups for fine discriminative features, which strengthens the feature representation and compensates for more details during fine feature exploration. Figure 5 shows some visual results. We can see that the Decoupling part plays a crucial role in fine feature exploration, recovering more details such as edges (e.g., row 1), textures (e.g., row 2), and torsos (e.g., row 3) for camouflaged object detection.

Effectiveness of RD. The RD module utilizes three FR components, which combine multi-scale backbone features to further refine feature representation for camouflaged object detection. As shown in Table III (#4), RD further increases the detection performance. Figure 6 provides some visual comparison results, showing the effectiveness of the proposed RD module.

Parameter Analysis for FR. Fig. 8 shows the performance comparison of the $M^{k}_{i}$ module in FR under different numbers of iterations. We can see that the model performance is relatively robust to the number of iterations. In our experiments, we adopt three iterations of the $M^{k}_{i}$ module, which achieves slightly better performance for camouflaged object detection. Besides, inside the $M^{k}_{i}$ module, we perform multiple split and merge operations to deeply explore critical features for camouflaged object detection. Fig. 8 also shows the performance comparison under different numbers of split and merge operations; in our experiments, we employ three split-merge operations, which achieve the best detection performance. In addition, Fig. 8 provides quantitative comparisons of the parameters $q$ and $n$ in FR under different settings. Experiments show that the performance of FR is relatively robust under different parameter settings. We choose $q=\{4,2,1\}$ and $n=\{4,3,2\}$, respectively, which achieve slightly better performance in camouflaged object detection.

Effectiveness of AIG. The AIG module learns additional information features of camouflaged objects and then incorporates them into image features for camouflaged object segmentation. To make full use of the additional cues, multi-level fusion is adopted to incorporate additional information features into different stages of the model, including HFC and RD, for deep aggregation. As shown in Table III (#OUR), the performance gains are 1.5%, 1.8%, 1.3% and 8.0% in terms of $F_{\beta}^{\omega}$, $F_{m}$, $E_{m}$ and $MAE$ on COD10K, respectively, demonstrating the effectiveness of additional information features for camouflaged object detection.

Model Adaptability to Different Additional Cues. Fig. 9 shows a comparison of the adaptability of some representative COD methods to different additional cues, including boundary, texture, Canny, and frequency. We adopt FDCOD [1] and DGNet [2] for comparison: the former introduces frequency-domain information and the latter integrates edge cues. We can see that FDCOD achieves better results using frequency-domain cues, but its performance drops significantly with other additional cues. Conversely, DGNet shows poor performance when using the frequency-domain cues. These methods are tailored to specific auxiliary cues and are not applicable to other additional cues. By contrast, our proposed method shows outstanding adaptability to different additional cues: it achieves the best performance under each type of additional information, with only minor performance differences, when compared to the other competitors. Fig. 10 shows the feature maps for different additional cues. We can see that: (a) the boundary provides relatively little object information (only the edges of the object silhouette), so it is more sensitive to the object silhouette but explores more limited effective object features than the other three additional cues; (b) boundary and texture only provide cues about object regions, while Canny and frequency also provide contextual information (i.e., the background of the objects), which helps to improve scene understanding and increases detection performance; (c) frequency provides additional information beyond the human visual system and shows the best results.

Backbone Analysis. We also test different backbones to verify the performance of the proposed method for COD. ZoomNet [14], FDCOD [1] and DGNet [2] are state-of-the-art methods built on the commonly used ResNet-50 [57], Res2Net-50 [58] and EfficientNet-B4 [54] backbones, respectively; therefore, we take these three methods as competitors. As shown in Table IV, we test our proposed AGLNet with the three kinds of backbones and find that the proposed method significantly outperforms the other competitors. Specifically, compared with FDCOD, AGLNet (Res2Net-50) achieves 2.3%, 3.0% and 4.7% improvements in $S_{\alpha}$, $F_{\beta}^{\omega}$ and $F_{m}$ on the COD10K dataset, respectively. Compared with DGNet, AGLNet (EfficientNet-B4) achieves average improvements of 4.9%, 9.4%, 7.6%, 2.8% and 21.3% in $S_{\alpha}$, $F_{\beta}^{\omega}$, $F_{m}$, $E_{m}$, and $MAE$ over the three datasets, respectively. Overall, our AGLNet achieves the best performance with EfficientNet-B4 as the backbone. Besides, among the AGLNet variants, AGLNet (EfficientNet-B4) has the fewest parameters, though still more than DGNet and ZoomNet. The design of light-weight models is a focus of our future work.

Input Resolution. We also conduct a series of ablation experiments to analyze the impact of input image resolution on detection performance. As shown in Table V, under the same resolution (704$\times$704), the proposed AGLNet significantly outperforms the comparison methods. This is because: a) high-resolution input provides more effective object details to improve detection; b) high-resolution input also introduces noise interference, so a good network design is needed to better explore critical cues. The proposed method deeply integrates additional features with image features and employs a recalibration decoder, providing very compelling performance. In fact, in our experiments, our method achieves the best results at all common input resolutions.

V Conclusion

This paper proposes an adaptive guidance learning framework that can, in principle, handle any of the additional cues commonly used in COD tasks while achieving significant performance gains. To our knowledge, this is the first work to study a unified end-to-end model that adapts different additional information for COD tasks. The proposed method designs an additional information generation module to learn the additional cues, which are then deeply integrated with image features by the hierarchical feature combination module to guide the learning of camouflaged features. Extensive experiments show its superiority over 20 other state-of-the-art methods on three datasets.

References

  • [1] Y. Zhong, B. Li, L. Tang, S. Kuang, S. Wu, and S. Ding, “Detecting camouflaged object in frequency domain,” in CVPR, 2022, pp. 4504–4513.
  • [2] G.-P. Ji, D.-P. Fan, Y.-C. Chou, D. Dai, A. Liniger, and L. Van Gool, “Deep gradient learning for efficient camouflaged object detection,” Machine Intelligence Research, vol. 20, no. 1, pp. 92–108, 2023.
  • [3] D.-P. Fan, G.-P. Ji, T. Zhou, G. Chen, H. Fu, J. Shen, and L. Shao, “Pranet: Parallel reverse attention network for polyp segmentation,” in MICCAI.   Springer, 2020, pp. 263–273.
  • [4] D.-P. Fan, T. Zhou, G.-P. Ji, Y. Zhou, G. Chen, H. Fu, J. Shen, and L. Shao, “Inf-net: Automatic covid-19 lung infection segmentation from ct images,” IEEE TMI, vol. 39, no. 8, pp. 2626–2637, 2020.
  • [5] D. Tabernik, S. Šela, J. Skvarč, and D. Skočaj, “Segmentation-based deep-learning approach for surface-defect detection,” Journal of Intelligent Manufacturing, vol. 31, no. 3, pp. 759–776, 2020.
  • [6] A. Prakash, K. Chitta, and A. Geiger, “Multi-modal fusion transformer for end-to-end autonomous driving,” in CVPR, 2021, pp. 7077–7087.
  • [7] R. Feng and B. Prabhakaran, “Facilitating fashion camouflage art,” in ACM MM, 2013, pp. 793–802.
  • [8] K. S. Kumar and A. Abdul Rahman, “Early detection of locust swarms using deep learning,” in Advances in machine learning and computational intelligence.   Springer, 2021, pp. 303–310.
  • [9] T. Liu, Y. Zhao, Y. Wei, Y. Zhao, and S. Wei, “Concealed object detection for activate millimeter wave image,” IEEE Transactions on Industrial Electronics, vol. 66, no. 12, pp. 9909–9917, 2019.
  • [10] D.-P. Fan, G.-P. Ji, G. Sun, M.-M. Cheng, J. Shen, and L. Shao, “Camouflaged object detection,” in CVPR, 2020, pp. 2777–2787.
  • [11] H. Mei, G.-P. Ji, Z. Wei, X. Yang, X. Wei, and D.-P. Fan, “Camouflaged object segmentation with distraction mining,” in CVPR, 2021, pp. 8772–8781.
  • [12] Z. Huang, H. Dai, T.-Z. Xiang, S. Wang, H.-X. Chen, J. Qin, and H. Xiong, “Feature shrinkage pyramid for camouflaged object detection with transformers,” in CVPR, 2023, pp. 5557–5566.
  • [13] F. Yang, Q. Zhai, X. Li, R. Huang, A. Luo, H. Cheng, and D.-P. Fan, “Uncertainty-guided transformer reasoning for camouflaged object detection,” in ICCV, 2021, pp. 4146–4155.
  • [14] Y. Pang, X. Zhao, T.-Z. Xiang, L. Zhang, and H. Lu, “Zoom in and out: A mixed-scale triplet network for camouflaged object detection,” in CVPR, 2022, pp. 2160–2170.
  • [15] M. Stevens and S. Merilaita, “Animal camouflage: current issues and new perspectives,” Philosophical Transactions of the Royal Society B: Biological Sciences, vol. 364, no. 1516, pp. 423–427, 2009.
  • [16] Y. Sun, S. Wang, C. Chen, and T.-Z. Xiang, “Boundary-guided camouflaged object detection,” in IJCAI, 2022, pp. 1335–1341.
  • [17] Q. Zhai, X. Li, F. Yang, C. Chen, H. Cheng, and D.-P. Fan, “Mutual graph learning for camouflaged object detection,” in CVPR, 2021, pp. 12 997–13 007.
  • [18] H. Zhu, P. Li, H. Xie, X. Yan, D. Liang, D. Chen, M. Wei, and J. Qin, “I can find you! boundary-guided separated attention network for camouflaged object detection,” in AAAI, 2022, pp. 3608–3616.
  • [19] J. Zhu, X. Zhang, S. Zhang, and J. Liu, “Inferring camouflaged objects by texture-aware interactive guidance network,” in AAAI, 2021, pp. 3599–3607.
  • [20] X. Zhang, B. Yin, Z. Lin, Q. Hou, D.-P. Fan, and M.-M. Cheng, “Referring camouflaged object detection,” arXiv preprint arXiv:2306.07532, 2023.
  • [21] Z. Chen, R. Gao, T. Xiang, and F. Lin, “Diffusion model for camouflaged object detection,” in ECAI.   IOS Press, 2023, pp. 445–452.
  • [22] D.-P. Fan, G.-P. Ji, M.-M. Cheng, and L. Shao, “Concealed object detection,” IEEE TPAMI, vol. 44, no. 10, pp. 6024–6042, 2022.
  • [23] B. Yin, X. Zhang, Q. Hou, B.-Y. Sun, D.-P. Fan, and L. Van Gool, “Camoformer: Masked separable attention for camouflaged object detection,” arXiv preprint arXiv:2212.06570, 2022.
  • [24] Y. Sun, G. Chen, T. Zhou, Y. Zhang, and N. Liu, “Context-aware cross-level fusion network for camouflaged object detection,” in IJCAI, 2021, pp. 1025–1031.
  • [25] Q. Jia, S. Yao, Y. Liu, X. Fan, R. Liu, and Z. Luo, “Segment, magnify and reiterate: Detecting camouflaged objects the hard way,” in CVPR, 2022, pp. 4713–4722.
  • [26] M. Zhang, S. Xu, Y. Piao, D. Shi, S. Lin, and H. Lu, “Preynet: Preying on camouflaged objects,” in ACM MM, 2022, pp. 5323–5332.
  • [27] X. Zhang, X. Sun, Y. Luo, J. Ji, Y. Zhou, Y. Wu, F. Huang, and R. Ji, “Rstnet: Captioning with adaptive attention on visual and non-visual words,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 15 465–15 474.
  • [28] X. Zhang, B.-W. Yin, Y. Chen, Z. Lin, Y. Li, Q. Hou, and M.-M. Cheng, “Temo: Towards text-driven 3d stylization for multi-object meshes,” arXiv preprint arXiv:2312.04248, 2023.
  • [29] B. Dong, J. Pei, R. Gao, T.-Z. Xiang, S. Wang, and H. Xiong, “A unified query-based paradigm for camouflaged instance segmentation,” in Proceedings of the 31st ACM International Conference on Multimedia, 2023, pp. 2131–2138.
  • [30] C. He, K. Li, Y. Zhang, L. Tang, Y. Zhang, Z. Guo, and X. Li, “Camouflaged object detection with feature decomposition and edge reconstruction,” in CVPR, 2023, pp. 22 046–22 055.
  • [31] Y. Lv, J. Zhang, Y. Dai, A. Li, B. Liu, N. Barnes, and D.-P. Fan, “Simultaneously localize, segment and rank the camouflaged objects,” in CVPR, 2021, pp. 11 591–11 601.
  • [32] X. Cheng, H. Xiong, D.-P. Fan, Y. Zhong, M. Harandi, T. Drummond, and Z. Ge, “Implicit motion handling for video camouflaged object detection,” in CVPR, 2022, pp. 13 864–13 873.
  • [33] Y. Pang, X. Zhao, T.-Z. Xiang, L. Zhang, and H. Lu, “Zoomnext: A unified collaborative pyramid network for camouflaged object detection,” arXiv 2310.20208, 2023.
  • [34] T.-N. Le, T. V. Nguyen, Z. Nie, M.-T. Tran, and A. Sugimoto, “Anabranch network for camouflaged object segmentation,” CVIU, vol. 184, pp. 45–56, 2019.
  • [35] A. Li, J. Zhang, Y. Lv, B. Liu, T. Zhang, and Y. Dai, “Uncertainty-aware joint salient object and camouflaged object detection,” in CVPR, 2021, pp. 10 071–10 081.
  • [36] C. Zhang, H. Bi, T.-Z. Xiang, R. Wu, J. Tong, and X. Wang, “Collaborative camouflaged object detection: A large-scale dataset and benchmark,” IEEE Transactions on Neural Networks and Learning Systems, pp. 1–15, 2023.
  • [37] J.-X. Zhao, J.-J. Liu, D.-P. Fan, Y. Cao, J. Yang, and M.-M. Cheng, “Egnet: Edge guidance network for salient object detection,” in ICCV, October 2019, pp. 8778–8787.
  • [38] W. Wang, S. Zhao, J. Shen, S. C. Hoi, and A. Borji, “Salient object detection with pyramid attention and salient edges,” in CVPR, 2019, pp. 1448–1457.
  • [39] Y. Zeng, P. Zhang, J. Zhang, Z. Lin, and H. Lu, “Towards high-resolution salient object detection,” in ICCV, 2019, pp. 7234–7243.
  • [40] X. Hu, S. Wang, X. Qin, H. Dai, W. Ren, D. Luo, Y. Tai, and L. Shao, “High-resolution iterative feedback network for camouflaged object detection,” in AAAI, 2023, pp. 881–889.
  • [41] G.-P. Ji, L. Zhu, M. Zhuge, and K. Fu, “Fast camouflaged object detection via edge-based reversible re-calibration network,” Pattern Recognition, vol. 123, p. 108414, 2022.
  • [42] R. Cong, M. Sun, S. Zhang, X. Zhou, W. Zhang, and Y. Zhao, “Frequency perception network for camouflaged object detection,” in Proceedings of the 31st ACM International Conference on Multimedia, 2023, pp. 1179–1189.
  • [43] C. He, K. Li, Y. Zhang, L. Tang, Y. Zhang, Z. Guo, and X. Li, “Camouflaged object detection with feature decomposition and edge reconstruction,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2023, pp. 22 046–22 055.
  • [44] J. Canny, “A computational approach to edge detection,” IEEE TPAMI, vol. 8, no. 6, pp. 679–698, 1986.
  • [45] Z. Wu, L. Su, and Q. Huang, “Cascaded partial decoder for fast and accurate salient object detection,” in CVPR, 2019, pp. 3907–3916.
  • [46] H. Liu, J. Zhang, K. Yang, X. Hu, and R. Stiefelhagen, “Cmx: Cross-modal fusion for rgb-x semantic segmentation with transformers,” arXiv preprint arXiv:2203.04838, 2022.
  • [47] J. Wei, S. Wang, and Q. Huang, “F3net: fusion, feedback and focus for salient object detection,” in AAAI, 2020, pp. 12 321–12 328.
  • [48] Q. Zhai, X. Li, F. Yang, Z. Jiao, P. Luo, H. Cheng, and Z. Liu, “Mgl: Mutual graph learning for camouflaged object detection,” IEEE Transactions on Image Processing, vol. 32, pp. 1897–1910, 2022.
  • [49] T. Zhou, Y. Zhou, C. Gong, J. Yang, and Y. Zhang, “Feature aggregation and propagation network for camouflaged object detection,” IEEE Transactions on Image Processing, vol. 31, pp. 7036–7047, 2022.
  • [50] Z. Wu, D. P. Paudel, D.-P. Fan, J. Wang, S. Wang, C. Demonceaux, R. Timofte, and L. Van Gool, “Source-free depth for object pop-out,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 1032–1042.
  • [51] D.-P. Fan, M.-M. Cheng, Y. Liu, T. Li, and A. Borji, “Structure-measure: A new way to evaluate foreground maps,” in ICCV, 2017, pp. 4548–4557.
  • [52] R. Margolin, L. Zelnik-Manor, and A. Tal, “How to evaluate foreground maps?” in CVPR, 2014, pp. 248–255.
  • [53] D.-P. Fan, G.-P. Ji, X. Qin, and M.-M. Cheng, “Cognitive vision inspired object segmentation metric and loss function,” SCIENTIA SINICA Informationis, vol. 6, p. 6, 2021.
  • [54] M. Tan and Q. Le, “Efficientnet: Rethinking model scaling for convolutional neural networks,” in ICML.   PMLR, 2019, pp. 6105–6114.
  • [55] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in ICLR, vol. 9, 2015.
  • [56] I. Loshchilov and F. Hutter, “Sgdr: Stochastic gradient descent with warm restarts,” ICLR, 2017.
  • [57] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016, pp. 770–778.
  • [58] S.-H. Gao, M.-M. Cheng, K. Zhao, X.-Y. Zhang, M.-H. Yang, and P. Torr, “Res2net: A new multi-scale backbone architecture,” IEEE TPAMI, vol. 43, no. 2, pp. 652–662, 2019.