Adaptive Guidance Learning for Camouflaged Object Detection
Abstract
Camouflaged object detection (COD) aims to segment objects visually embedded in their surroundings, which is a very challenging task due to the high similarity between the objects and the background. To address it, most methods incorporate additional information (e.g., boundary, texture, and frequency clues) to guide feature learning for better detecting camouflaged objects from the background. Although progress has been made, these methods are typically tailored to specific auxiliary cues, and thus lack adaptability and do not consistently achieve high segmentation performance. To this end, this paper proposes an adaptive guidance learning network, dubbed AGLNet, a unified end-to-end learnable model for exploring and adapting different additional cues in CNN models to guide accurate camouflaged feature learning. Specifically, we first design a straightforward additional information generation (AIG) module to learn additional camouflaged object cues, which can be adapted for the exploration of effective camouflaged features. We then present a hierarchical feature combination (HFC) module to deeply integrate additional cues and image features to guide camouflaged feature learning in a multi-level fusion manner. A recalibration decoder (RD) then further aggregates and refines the different features for accurate object prediction. Extensive experiments on three widely used COD benchmark datasets demonstrate that the proposed method achieves significant performance improvements under different additional cues, and outperforms 20 recent state-of-the-art methods by a large margin. Our code will be made publicly available at: https://github.com/ZNan-Chen/AGLNet.
Index Terms:
Camouflaged object detection, auxiliary cues.

I Introduction
Camouflaged object detection (COD) is the task of spotting and segmenting objects that are perfectly hidden in complex environments. Recent years have witnessed increasing research enthusiasm from the computer vision community on COD, which facilitates a wide range of applications in various fields, such as medicine (e.g., polyp segmentation [3] and lung infection segmentation [4]), industry (e.g., surface defect detection [5] and autonomous driving [6]), art [7] (e.g., recreational arts and style transformation), ecology [8] (e.g., species search), and society [9] (e.g., search and rescue).
In recent years, numerous deep learning-based methods have been proposed for camouflaged object detection and have made great progress. Some methods adopt a coarse-to-fine learning strategy to explore contextual cues and aggregate multi-scale features for COD, such as SINet [10], PFNet [11], and FSPNet [12]. Other methods introduce uncertainty-aware learning to model the confidence of model predictions, such as UGTR [13] and ZoomNet [14]. As we know, species often adopt various camouflage strategies to conceal themselves deliberately in their surroundings for self-protection, resulting in high intrinsic similarities in appearance (e.g., color, texture, and shape) between candidate objects and backgrounds. This camouflage ability easily deceives the visual system [15], which makes it very difficult to identify camouflaged targets from a single image feature alone. To address the above limitations, some methods resort to additional information, such as boundary [16, 17, 18], texture [19], edge [2], saliency [20], and frequency [1] cues. However, we observe that almost all of these methods are designed for a specific type of additional information, and thus lack sufficient adaptability to different types of additional cues and do not consistently achieve good detection performance. For instance, as shown in Fig. 1(a), FDCOD [1] is specially designed to incorporate frequency domain clues for effective camouflaged object detection, but is not applicable to other additional features (e.g., boundary). Similarly, as shown in Fig. 1(b), the spectrum feature shows only a small gradient difference around the camouflaged object, so DGNet [2], which is tailored to additional edge features (i.e., Canny edges), fails to detect the camouflaged object under frequency domain clues.
To this end, we propose an adaptive guidance learning network, termed AGLNet, which unifies the exploration and guidance of any kind of effective additional cue into an end-to-end learnable model that fully aggregates additional features and image features to guide camouflaged object detection. Specifically, the additional cue is first learned in convolutional space by the designed additional information generation (AIG) module. Then, the learned additional cue is fully integrated with image features in a multi-level fusion manner by the proposed hierarchical feature combination (HFC) module to guide camouflaged feature learning. After that, a recalibration decoder (RD) further fuses and refines different features through multi-layer, multi-step calibration for accurate camouflaged object segmentation. Extensive experiments show that the proposed model can be adapted to explore and incorporate various kinds of additional information, such as boundary, edge, texture, and frequency cues. Our contributions can be summarized as follows:
• We propose a powerful adaptive guidance learning network that can incorporate various additional cues into image features to guide the detection of camouflaged objects. To the best of our knowledge, we are the first to explore a unified end-to-end framework that adapts to various additional information for COD tasks.
• We propose a hierarchical feature combination (HFC) module to deeply integrate additional cues with image features in a multi-level manner to make full use of additional information. Furthermore, we design a recalibration decoder (RD) for iterative calibration and aggregation of different features for object prediction.
• Extensive quantitative and qualitative experiments demonstrate the applicability and effectiveness of the proposed method with different additional information and its superior performance over 20 recent state-of-the-art COD methods by a large margin.
II Related Work
II-A Camouflaged Object Detection
Camouflaged object detection (COD) is a challenging task that aims to discover objects that are highly similar to their environment [21]. Recent developments in deep learning techniques and the release of large-scale COD datasets (e.g., COD10K [10]) have paved the way for research into deep learning-based camouflaged object detection. Since then, several COD methods have been proposed, and the performance leaderboard has been continuously refreshed on several widely used COD benchmarks. Some methods adopt a coarse-to-fine learning strategy to explore and integrate multi-scale camouflaged features for object detection, such as SINet [10, 22], Camoformer [23], C2FNet [24], PFNet [11], SegMaR [25], PreyNet [26], and FSPNet [12]. In particular, C2FNet proposes a context-aware fusion network that combines multi-level features and attention coefficients to generate rich contextual information. PFNet uses a distraction mining strategy to locate potential objects globally and refines predictions by focusing on key regions. SegMaR, pioneering multi-stage detection, excels in small camouflaged object scenarios. PreyNet also utilizes multi-stage detection to distinguish between sensory and cognitive mechanisms. Furthermore, FSPNet introduces a novel transformer-based pyramid network, achieving accurate segmentation of camouflaged objects by gradually shrinking neighboring transformer features in a hierarchical manner. Some methods incorporate confidence-aware learning to improve feature learning for difficult samples, such as UGTR [13] and ZoomNet [14]. UGTR combines Bayesian learning with a transformer to model both probabilistic and deterministic information for the camouflaged object problem. ZoomNet exploits the expressive characteristics of different scales and enhances feature expression through scale aggregation. Inspired by advances in multi-modal learning [27, 28], some methods introduce additional information, such as boundary [16, 17, 18, 29], edge [2, 30], texture [19], fixations [31], motion [32, 33], and saliency [20], to facilitate camouflaged feature exploration. Classification [34] and salient object detection [35] have each been jointly learned with COD within a multi-task learning framework. The concept behind [34] and [35] is that introducing different tasks can enhance the accuracy and robustness of camouflaged object segmentation. More recently, collaborative feature exploration from a group of relevant images has been proposed to enhance camouflaged object detection performance by learning from multiple images containing objects of the same category [36, 20].

II-B Additional Cues for COD
In salient object detection, many methods have attempted to enhance performance by integrating additional information, such as edge information [37, 38] and high-resolution input [39, 40]. Zhao et al. [37] have utilized extensive edge and location information to more precisely locate the boundaries of salient objects. Zeng et al. [39] take higher-resolution image features as input, combining global semantic information with local high-resolution details to iteratively produce high-resolution predictions.
Building on these salient object detection methods, various studies have explored integrating additional cues into camouflaged object detection.
By introducing auxiliary cues such as texture maps and edge maps into camouflaged object models, these models become more sensitive to subtle distinctions between foreground and background elements, notably variations in texture, the salience of edges, or gradient transitions. Moreover, some researchers argue that relying solely on the RGB color space does not fully exploit the information contained in images; consequently, the frequency domain has been proposed as an auxiliary cue. Frequency domain analysis can reveal information at different frequencies in an image that might not be prominent or easily detectable in the RGB domain. For example, Zhu et al. [18] and Sun et al. [16] have introduced boundary cues to highlight the camouflaged boundary between the background and foreground of an image and to enhance the model's understanding of the boundary. Ji et al. [41] and He et al. [30] have incorporated edge information to explore the semantic information of target edges. Zhu et al. [19] combined texture labels to make the network focus more on the structure and details of the target. Zhong et al. [1] and Cong et al. [42] have used frequency domain cues to improve camouflaged object detection. He et al. [43] decompose foreground and background features into different frequency bands while constructing edge information to assist in generating accurate predictions.
Despite the strides made in combining image features with auxiliary cues for COD, challenges remain. In particular, existing approaches are bespoke solutions tailored to specific types of additional information, which limits their applicability to other types of additional cues. In this paper, we propose a novel adaptive guidance learning model that alleviates this issue through a unified end-to-end framework, which adapts to any kind of additional information and guides camouflaged feature learning via hierarchical feature combination.
III Methodology
The overall architecture of the proposed AGLNet is shown in Fig. 2. The framework first uses the additional information generation (AIG) module to learn additional cues, which serve as guidance for camouflaged feature learning. Then, the learned additional features are deeply integrated with multi-scale backbone features to explore the critical camouflaged object features by the designed hierarchical feature combination (HFC) module. To make full use of additional cues, we adopt a multi-level fusion manner to incorporate additional information at both the combination stage and the decoupling stage. After that, a recalibration decoder (RD) is adopted to aggregate and refine multiple features in a multi-level calibration manner for camouflaged object segmentation.
III-A Additional Information Generation (AIG)
The additional cues contain valuable features that are not perceived by the backbone network. Some additional information, such as frequency domain cues, also shows large modal differences from RGB spatial domain features. If the two kinds of features are integrated directly, they may interfere with each other, resulting in the loss of key features or the introduction of noise. To avoid this issue, we design a simple but effective additional information generation (AIG) module to learn the additional cues in CNN feature space, so that they can be easily merged into image features. Specifically, AIG contains three layers, each consisting of an average pooling operation and a convolution operation. AIG learns a 64-channel additional feature from the input RGB image, which provides explicit semantic cues complementary to the backbone features. A 1×1 convolution layer is then adopted to produce the prediction of the additional cues, supervised by the ground truth of the additional information. We follow existing approaches in the COD task to obtain additional information labels, including boundary, texture, edge (i.e., Canny), and frequency labels.
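For illustration, a minimal PyTorch-style sketch of such an AIG module is given below; the 3×3 kernel sizes, ReLU activations, pooling strides, and all module/variable names are assumptions for exposition, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class AIG(nn.Module):
    """Sketch of the Additional Information Generation module: three
    (average pooling -> convolution) layers learn a 64-channel additional
    feature from the RGB image, and a 1x1 convolution predicts the cue map
    that is supervised by the additional-information label."""
    def __init__(self, channels=64):
        super().__init__()
        self.layers = nn.Sequential(
            nn.AvgPool2d(2), nn.Conv2d(3, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.AvgPool2d(2), nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.AvgPool2d(2), nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.predict = nn.Conv2d(channels, 1, kernel_size=1)  # cue prediction head

    def forward(self, image):
        feat = self.layers(image)       # 64-channel additional feature
        cue_pred = self.predict(feat)   # supervised by the additional-cue ground truth
        return feat, cue_pred
```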
III-B Hierarchical Feature Combination (HFC)
Preliminary. Motivated by the observation [45] that low-level features consume more computational resources while contributing less to performance, we adopt the top three high-level features of the visual backbone as our multi-scale backbone features.
Combination. We observe that cascade structures can efficiently aggregate multi-level features for accurate object detection [45] and that the cooperation of adjacent features helps localize objects well. Therefore, we design a novel multi-scale feature combination (MFC) module. First, we build a convolution block with different kernel sizes to enhance visual features. Specifically, each backbone feature is processed by a convolution operation, followed by two parallel convolution operations with different kernel sizes. Then, element-wise summation is performed over the features of the two branches. Finally, the summed features are fed to a convolution layer to produce the enhanced feature. We utilize the high-resolution details of the shallow layers for accurate localization and the semantic information of the deep layers to ensure semantic consistency between layers; adjacent features are aligned by a bilinear upsampling operation before aggregation.
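A rough sketch of this enhancement block is shown below; since the exact kernel sizes of the two parallel branches are not recoverable from the text, 3×3 and 5×5 are assumed here, and all names are hypothetical.

```python
import torch
import torch.nn as nn

class CombinationBlock(nn.Module):
    """Sketch of the multi-kernel enhancement block in the Combination stage:
    conv -> two parallel convs with different kernels -> element-wise sum -> conv."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.pre = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.branch_small = nn.Conv2d(out_ch, out_ch, 3, padding=1)  # assumed 3x3 kernel
        self.branch_large = nn.Conv2d(out_ch, out_ch, 5, padding=2)  # assumed 5x5 kernel
        self.post = nn.Conv2d(out_ch, out_ch, 3, padding=1)

    def forward(self, x):
        x = self.pre(x)
        fused = self.branch_small(x) + self.branch_large(x)  # element-wise summation
        return self.post(fused)
```

Adjacent enhanced features would then be aligned by bilinear upsampling before aggregation, as described above.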
Next, these enhanced features are integrated with the additional information feature to generate the initial combined features. Specifically, the additional feature is bilinearly downsampled to each scale, concatenated with the corresponding image feature, and fused by a 3×3 convolution operation.
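A minimal sketch of this integration step follows (resize the cue feature to the target scale, concatenate along channels, fuse with a 3×3 convolution); the channel numbers and function names are placeholders, since the original channel configuration is not recoverable here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fuse_with_cue(feat, cue_feat, fuse_conv):
    """Resize the additional-cue feature to the scale of `feat`, concatenate
    along channels, and fuse with a 3x3 convolution."""
    cue = F.interpolate(cue_feat, size=feat.shape[-2:], mode='bilinear', align_corners=False)
    return fuse_conv(torch.cat([feat, cue], dim=1))

# hypothetical usage: 64-channel image feature fused with a 64-channel cue feature
fuse_conv = nn.Conv2d(64 + 64, 64, kernel_size=3, padding=1)
combined = fuse_with_cue(torch.randn(1, 64, 44, 44), torch.randn(1, 64, 88, 88), fuse_conv)
```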
Decoupling. To further explore camouflaged object semantics, we design a dual-branch architecture to guide decoupling. In the first branch, the combined feature is decoupled into three groups of features, each of which is then processed by a convolution operation. In the other branch, the combined feature is first processed by two convolution layers after an average pooling. The activation function of the last convolution layer is Softmax, which is used to learn the weights of the feature channels. The learned weights are then split into three corresponding groups, and each decoupled feature is multiplied by its corresponding weight.
To capture more features of camouflaged objects, the additional information features are further incorporated to guide the feature learning of camouflaged objects; in these operations, the learned channel weights are applied by element-wise multiplication. Then, we obtain the initial prediction map.
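The dual-branch decoupling described above can be sketched as follows. The group sizes, kernel sizes, and the way the additional cue is injected (here a sigmoid-gated single-channel cue map) are assumptions for illustration; all names are hypothetical.

```python
import torch
import torch.nn as nn

class Decoupling(nn.Module):
    """Sketch of the dual-branch decoupling: one branch splits the combined
    feature into three groups and refines each with a convolution; the other
    learns softmax channel weights that re-scale the groups."""
    def __init__(self, channels=192, groups=3):
        super().__init__()
        self.groups = groups
        g = channels // groups
        self.refine = nn.ModuleList([nn.Conv2d(g, g, 3, padding=1) for _ in range(groups)])
        self.weight_branch = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 1), nn.Softmax(dim=1),  # channel weights
        )
        self.predict = nn.Conv2d(channels, 1, 1)  # initial prediction map

    def forward(self, combined, cue_mask=None):
        feats = torch.chunk(combined, self.groups, dim=1)
        weights = torch.chunk(self.weight_branch(combined), self.groups, dim=1)
        out = [conv(f) * w for conv, f, w in zip(self.refine, feats, weights)]  # element-wise mult.
        out = torch.cat(out, dim=1)
        if cue_mask is not None:
            # illustrative cue guidance: single-channel cue map, already resized to this scale
            out = out * torch.sigmoid(cue_mask)
        return out, self.predict(out)
```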
No. | Operation | Input Size | Output Size
---|---|---|---
#1 | Conv1×1 | |
 | Split & Concat | |
#2 | Conv1×1 | |
 | Split & Concat | |
#3 | Conv1×1 | |
 | Split & Concat | |
III-C Recalibration Decoder (RD)
Inspired by [46], after extracting the aggregated features, we design a recalibration decoder (RD) module that employs iterative calibration to refine the consistency between image features and additional features. RD further combines multi-scale backbone features and additional features to enhance the feature representation. It consists of three levels of iterative optimization, and each level is a well-designed feature refiner (FR), whose architecture is shown in Fig. 2. For each FR, the backbone visual feature of the corresponding scale is first split, and then combined with the prediction map from the previous level and the learned additional features. Within FR, we perform multiple feature splits and merges, where each split feature is merged with the previous prediction and the additional cue mask. The specific parameters of this module are shown in Table I.
On one hand, FR can facilitate the fusion of image features and additional features at different scales. On the other hand, FR adopts multiple iterations to boost accurate segmentation within a certain scale.
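Since the exact formulation is not recoverable here, the sketch below only illustrates the general idea of one FR level as described: split the backbone feature of the corresponding scale, merge each split with the previous-level prediction and the additional cue mask, then fuse and predict again. Split counts, channel widths, and fusion details are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureRefiner(nn.Module):
    """Sketch of one FR level in the recalibration decoder."""
    def __init__(self, channels=192, splits=3):
        super().__init__()
        self.splits = splits
        g = channels // splits
        self.merge = nn.ModuleList(
            [nn.Conv2d(g + 2, g, 3, padding=1) for _ in range(splits)]  # +2: prev pred + cue mask
        )
        self.fuse = nn.Conv2d(channels, channels, 3, padding=1)
        self.predict = nn.Conv2d(channels, 1, 1)

    def forward(self, backbone_feat, prev_pred, cue_mask):
        size = backbone_feat.shape[-2:]
        prev = F.interpolate(prev_pred, size=size, mode='bilinear', align_corners=False)
        cue = F.interpolate(cue_mask, size=size, mode='bilinear', align_corners=False)
        refined = [m(torch.cat([f, prev, cue], dim=1))
                   for m, f in zip(self.merge, torch.chunk(backbone_feat, self.splits, dim=1))]
        feat = self.fuse(torch.cat(refined, dim=1))
        return feat, self.predict(feat)  # the prediction is fed to the next FR level
```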
III-D Loss Function
Our loss function consists of the additional information generation loss and the camouflaged object detection loss. For the former, we resize the predicted additional cue map to the input image size and calculate the MSE loss against the additional information label. For the latter, we also resize each prediction map to the input image size and adopt the weighted BCE loss and the weighted IoU loss [47]. The overall loss combines the additional information generation loss with the detection losses over all prediction maps.
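As a reference, the sketch below follows the widely used implementation of the weighted BCE + weighted IoU objective from F3Net [47] and adds an MSE term on the resized additional-cue prediction; the boundary-weighting window size (31), the weighting factor (5), and the equal loss weights are assumptions rather than the authors' exact settings.

```python
import torch
import torch.nn.functional as F

def structure_loss(pred, mask):
    """Weighted BCE + weighted IoU loss in the style of F3Net [47]:
    pixels near object boundaries receive larger weights."""
    weit = 1 + 5 * torch.abs(F.avg_pool2d(mask, 31, stride=1, padding=15) - mask)
    wbce = F.binary_cross_entropy_with_logits(pred, mask, reduction='none')
    wbce = (weit * wbce).sum(dim=(2, 3)) / weit.sum(dim=(2, 3))

    pred = torch.sigmoid(pred)
    inter = ((pred * mask) * weit).sum(dim=(2, 3))
    union = ((pred + mask) * weit).sum(dim=(2, 3))
    wiou = 1 - (inter + 1) / (union - inter + 1)
    return (wbce + wiou).mean()

def total_loss(cue_pred, cue_gt, cod_preds, mask):
    """MSE on the resized additional-cue prediction plus structure losses
    over all camouflaged-object prediction maps."""
    size = mask.shape[-2:]
    cue_pred = F.interpolate(cue_pred, size=size, mode='bilinear', align_corners=False)
    loss = F.mse_loss(cue_pred, cue_gt)
    for p in cod_preds:
        p = F.interpolate(p, size=size, mode='bilinear', align_corners=False)
        loss = loss + structure_loss(p, mask)
    return loss
```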
Method | COD10K: Sα↑ | Fβw↑ | Fβm↑ | Eφ↑ | MAE↓ | NC4K: Sα↑ | Fβw↑ | Fβm↑ | Eφ↑ | MAE↓ | CAMO: Sα↑ | Fβw↑ | Fβm↑ | Eφ↑ | MAE↓
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
2020 SINet [10] | 0.772 | 0.543 | 0.640 | 0.810 | 0.051 | 0.810 | 0.665 | 0.741 | 0.841 | 0.066 | 0.753 | 0.602 | 0.676 | 0.774 | 0.097 |
2021 PFNet [11] | 0.797 | 0.656 | 0.698 | 0.875 | 0.039 | 0.826 | 0.743 | 0.783 | 0.884 | 0.054 | 0.774 | 0.683 | 0.737 | 0.832 | 0.087 |
2021 LSR [31] | 0.805 | 0.660 | 0.703 | 0.876 | 0.039 | 0.832 | 0.743 | 0.785 | 0.888 | 0.053 | 0.793 | 0.703 | 0.753 | 0.850 | 0.083 |
2021 C2FNet [24] | 0.811 | 0.680 | 0.722 | 0.890 | 0.036 | 0.839 | 0.763 | 0.805 | 0.896 | 0.050 | 0.782 | 0.698 | 0.751 | 0.838 | 0.082 |
2021 MGL [17] | 0.815 | 0.667 | 0.709 | 0.852 | 0.035 | 0.832 | 0.739 | 0.782 | 0.868 | 0.053 | 0.772 | 0.670 | 0.725 | 0.811 | 0.089 |
2021 UGTR [13] | 0.818 | 0.668 | 0.725 | 0.894 | 0.035 | 0.839 | 0.749 | 0.812 | 0.892 | 0.048 | 0.784 | 0.687 | 0.741 | 0.844 | 0.086 |
2021 UJSC [35] | 0.818 | 0.702 | 0.737 | 0.892 | 0.033 | 0.840 | 0.772 | 0.817 | 0.899 | 0.047 | 0.793 | 0.721 | 0.766 | 0.854 | 0.078 |
2022 SINet-V2 [22] | 0.815 | 0.674 | 0.711 | 0.885 | 0.037 | 0.848 | 0.768 | 0.801 | 0.902 | 0.047 | 0.819 | 0.743 | 0.781 | 0.882 | 0.070 |
2022 R-MGL_v2 [48] | 0.816 | 0.689 | 0.733 | 0.879 | 0.034 | 0.838 | 0.758 | 0.801 | 0.899 | 0.050 | 0.769 | 0.672 | 0.731 | 0.847 | 0.086 |
2022 BSANet [18] | 0.818 | 0.699 | 0.738 | 0.890 | 0.034 | 0.841 | 0.771 | 0.817 | 0.897 | 0.048 | 0.794 | 0.717 | 0.763 | 0.851 | 0.079 |
2022 FAPNet [49] | 0.822 | 0.694 | 0.731 | 0.888 | 0.036 | 0.851 | 0.775 | 0.810 | 0.899 | 0.047 | 0.815 | 0.734 | 0.776 | 0.865 | 0.076 |
2022 BGNet [16] | 0.831 | 0.722 | 0.753 | 0.901 | 0.033 | 0.851 | 0.788 | 0.820 | 0.907 | 0.044 | 0.812 | 0.749 | 0.789 | 0.870 | 0.073 |
2022 SegMaR [25] | 0.833 | 0.724 | 0.757 | 0.899 | 0.034 | 0.841 | 0.781 | 0.820 | 0.896 | 0.046 | 0.815 | 0.753 | 0.795 | 0.874 | 0.071 |
2022 FDCOD [1] | 0.837 | 0.731 | 0.749 | 0.918 | 0.030 | 0.834 | 0.750 | 0.784 | 0.894 | 0.052 | 0.844 | 0.778 | 0.809 | 0.898 | 0.062 |
2022 ZoomNet [14] | 0.838 | 0.729 | 0.766 | 0.888 | 0.029 | 0.853 | 0.784 | 0.818 | 0.896 | 0.043 | 0.820 | 0.752 | 0.794 | 0.878 | 0.066 |
2023 DGNet [2] | 0.822 | 0.693 | 0.728 | 0.896 | 0.033 | 0.857 | 0.784 | 0.814 | 0.911 | 0.042 | 0.839 | 0.769 | 0.806 | 0.901 | 0.057 |
2023 FEDER [30] | 0.822 | 0.716 | 0.751 | 0.900 | 0.032 | 0.847 | 0.789 | 0.824 | 0.907 | 0.044 | 0.802 | 0.738 | 0.781 | 0.867 | 0.071 |
2023 PopNet [50] | 0.851 | 0.757 | 0.786 | 0.910 | 0.028 | 0.861 | 0.802 | 0.833 | 0.909 | 0.042 | 0.808 | 0.744 | 0.784 | 0.859 | 0.077 |
2023 HitNet [40] | 0.868 | 0.798 | 0.806 | 0.932 | 0.024 | 0.870 | 0.825 | 0.853 | 0.921 | 0.039 | 0.844 | 0.801 | 0.831 | 0.902 | 0.057 |
2023 FSPNet [12] | 0.851 | 0.735 | 0.769 | 0.895 | 0.026 | 0.879 | 0.816 | 0.843 | 0.915 | 0.035 | 0.856 | 0.799 | 0.830 | 0.899 | 0.050 |
AGLNet-Boundary | 0.870 | 0.785 | 0.808 | 0.930 | 0.024 | 0.883 | 0.830 | 0.854 | 0.929 | 0.035 | 0.867 | 0.816 | 0.843 | 0.917 | 0.053 |
AGLNet-Texture | 0.871 | 0.786 | 0.809 | 0.928 | 0.024 | 0.884 | 0.834 | 0.857 | 0.930 | 0.034 | 0.868 | 0.823 | 0.850 | 0.920 | 0.049 |
AGLNet-Canny | 0.873 | 0.789 | 0.811 | 0.930 | 0.023 | 0.884 | 0.833 | 0.855 | 0.929 | 0.034 | 0.870 | 0.823 | 0.848 | 0.916 | 0.050 |
AGLNet-Frequency | 0.875 | 0.791 | 0.813 | 0.933 | 0.023 | 0.889 | 0.836 | 0.858 | 0.934 | 0.033 | 0.874 | 0.825 | 0.851 | 0.918 | 0.050 |

IV Experiments
IV-A Experimental Setup
Datasets. We conduct experiments on three widely used benchmark datasets for the COD task, i.e., CAMO [34], COD10K [10], and NC4K [31]. In particular, CAMO, covering eight categories, contains 1,250 camouflaged images and 1,250 non-camouflaged images. COD10K consists of 5,066 camouflaged, 1,934 non-camouflaged, and 3,000 background images, and is currently the largest COD dataset, covering 10 superclasses and 78 subclasses. NC4K is a newly published dataset with a total of 4,121 camouflaged images. Following the standard practice of COD tasks, we use 3,040 images from COD10K and 1,000 images from CAMO as the training set and the remaining data as the test sets.
Evaluation Metrics. According to the standard evaluation protocol of COD, we employ five common metrics to evaluate our model, i.e., structure-measure (Sα) [51], weighted F-measure (Fβw) [52], mean F-measure (Fβm), mean E-measure (Eφ) [53], and mean absolute error (MAE).
Implementation Details. All experiments are implemented with the PyTorch toolbox. The visual backbone is EfficientNet-B4 [54] pretrained on ImageNet unless otherwise stated. We use Adam [55] as the optimizer with a learning rate of 1e-4, which is adjusted by a cosine annealing strategy [56] with a period of 40 epochs and a minimum learning rate of 1e-5. We train our AGLNet for 100 epochs with a batch size of 8, which takes about 9 hours on an NVIDIA GeForce RTX 3090 GPU. During training and inference, the input images are resized to 704×704 via bilinear interpolation and augmented by random flipping, cropping, and color jittering.
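For reference, a minimal sketch of this optimization setup in PyTorch is given below; the model and data pipeline are placeholders, and the specific scheduler class is an assumption, while the learning rates, annealing period, and epoch count follow the text above.

```python
import torch

# placeholder model standing in for AGLNet; the real model and dataloader are omitted
model = torch.nn.Conv2d(3, 1, 3, padding=1)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=40, eta_min=1e-5)   # cosine annealing: period 40 epochs, min lr 1e-5

for epoch in range(100):                 # 100 epochs, batch size 8, 704x704 inputs (see text)
    # ... one training epoch over the COD10K + CAMO training split ...
    optimizer.step()                     # would follow loss.backward() in a real loop
    scheduler.step()
```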
Competitors. Our AGLNet is compared with 20 recent state-of-the-art methods, including SINet [10], PFNet [11], LSR [31], C2FNet [24], MGL [17], UGTR [13], UJSC [35], SINet-V2 [22], R-MGL_v2 [48], BSANet [18], FAPNet [49], BGNet [16], SegMaR [25], FDCOD [1], ZoomNet [14], DGNet [2], FEDER [30], PopNet [50], HitNet [40], and FSPNet [12]. For a fair comparison, all results are either provided by the authors or reproduced by re-training the open-source models on the same training set with the recommended settings.
IV-B Comparisons with the State-of-the-arts
Quantitative Evaluation. Table II shows the comparison results of AGLNet with 20 cutting-edge methods. We use the most common additional cues, including boundary, texture, Canny edge, and frequency. It can be seen that our proposed AGLNet achieves significant performance improvements regardless of which additional information is used, and outperforms the other comparison methods on all datasets. Note that, unless otherwise specified, AGLNet below refers to the variant using frequency as the additional cue. Compared with FDCOD [1], which also introduces frequency domain cues for COD, our AGLNet shows a large performance improvement. Specifically, our method improves the performance on the three COD datasets by an average of 4.9%, 8.6%, 7.7%, 2.8%, and 26.4% for Sα, Fβw, Fβm, Eφ, and MAE, respectively. The performance improvement can be attributed to the learnable additional information exploration and the deep multi-level integration within the same domain in the proposed AGLNet, which leads to a more effective incorporation of additional cues to guide object prediction. Compared with another well-performing method that does not use additional cues, e.g., ZoomNet, our method improves Sα by 5.1%, Fβw by 8.3%, Fβm by 6.9%, Eφ by 3.5%, and MAE by 23.6% on average. Compared with the second-best competitor, HitNet, which adopts a powerful transformer as the backbone, our method still achieves better detection performance, with 2.1%, 1.3%, 0.5%, 1.4%, and 15.4% gains on the NC4K dataset, and 3.6%, 3.0%, 2.4%, 1.8%, and 12.3% gains on the CAMO dataset. As a result, our AGLNet shows effectiveness and superior performance in detecting camouflaged objects compared with the existing methods.
No. | Baseline | Combination | Decoupling | RD | AIG | COD10K: Sα↑ | Fβw↑ | Fβm↑ | Eφ↑ | MAE↓ | NC4K: Sα↑ | Fβw↑ | Fβm↑ | Eφ↑ | MAE↓ | CAMO: Sα↑ | Fβw↑ | Fβm↑ | Eφ↑ | MAE↓
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
#1 | ✓ | | | | | 0.829 | 0.701 | 0.731 | 0.881 | 0.033 | 0.846 | 0.732 | 0.776 | 0.871 | 0.050 | 0.840 | 0.736 | 0.780 | 0.868 | 0.068
#2 | ✓ | ✓ | | | | 0.859 | 0.757 | 0.782 | 0.905 | 0.026 | 0.879 | 0.813 | 0.841 | 0.912 | 0.034 | 0.860 | 0.808 | 0.837 | 0.905 | 0.051
#3 | ✓ | ✓ | ✓ | | | 0.862 | 0.773 | 0.794 | 0.918 | 0.026 | 0.880 | 0.824 | 0.850 | 0.921 | 0.035 | 0.863 | 0.810 | 0.838 | 0.913 | 0.052
#4 | ✓ | ✓ | ✓ | ✓ | | 0.865 | 0.779 | 0.799 | 0.921 | 0.025 | 0.882 | 0.828 | 0.852 | 0.926 | 0.035 | 0.866 | 0.814 | 0.841 | 0.914 | 0.052
#OUR | ✓ | ✓ | ✓ | ✓ | ✓ | 0.875 | 0.791 | 0.813 | 0.933 | 0.023 | 0.889 | 0.836 | 0.858 | 0.934 | 0.033 | 0.874 | 0.825 | 0.851 | 0.918 | 0.050

Qualitative Evaluation. Figure 3 shows the visual comparisons between our AGLNet and other representative COD methods in some challenging scenarios, including tiny objects (e.g., lines 1-2), occlusions (e.g., lines 3-4), and multiple objects (e.g., lines 5-6). These comparisons intuitively show a more competitive visual performance of our proposed AGLNet. With a good integration of the discriminative information provided by the additional cue, our AGLNet provides more accurate and complete camouflaged object localization and segmentation under various complex and highly similar backgrounds, even with the interference of noisy objects/regions (salient but non-camouflaged).

IV-C Ablation Studies
Overview. We perform ablation studies on key components to verify their effectiveness and analyze their impacts on performance, as shown in Table III. Note that, for the baseline model, we remove all the additional modules, and then use convolution blocks to fuse the multi-level features in a top-down manner and generate predictions. Experimental results demonstrate that our designed HFC (including combination and decoupling), RD, and AIG can improve detection performance. When they are combined to build AGLNet, significant improvements in all evaluation metrics are observed.
Effectiveness of Combination. As can be seen from Table III (#2), compared with the baseline, Combination achieves a significant performance improvement, providing average gains of 3.3%, 9.6%, 7.6%, 3.9%, and 26.1% on Sα, Fβw, Fβm, Eφ, and MAE across the three datasets, respectively. The Combination part fully interacts with and accumulates critical cues through dense aggregation of multi-scale backbone features, thus greatly enhancing the feature representation for COD. Figure 4 provides some visual results, showing the effectiveness of Combination for improving performance.

Method (Backbone) | COD10K: Sα↑ | Fβw↑ | Fβm↑ | Eφ↑ | MAE↓ | NC4K: Sα↑ | Fβw↑ | Fβm↑ | Eφ↑ | MAE↓ | CAMO: Sα↑ | Fβw↑ | Fβm↑ | Eφ↑ | MAE↓ | Params. (M)
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
ZoomNet (ResNet-50) | 0.838 | 0.729 | 0.766 | 0.888 | 0.029 | 0.853 | 0.784 | 0.818 | 0.896 | 0.043 | 0.820 | 0.752 | 0.794 | 0.878 | 0.066 | 32.382 |
AGLNet (ResNet-50) | 0.849 | 0.740 | 0.773 | 0.920 | 0.028 | 0.857 | 0.789 | 0.823 | 0.902 | 0.042 | 0.842 | 0.768 | 0.803 | 0.888 | 0.064 | 114.09 |
FDCOD (Res2Net-50) | 0.837 | 0.731 | 0.749 | 0.918 | 0.030 | 0.834 | 0.750 | 0.784 | 0.894 | 0.052 | 0.844 | 0.778 | 0.809 | 0.898 | 0.062 | 197.41 |
AGLNet (Res2Net-50) | 0.856 | 0.753 | 0.784 | 0.926 | 0.028 | 0.863 | 0.793 | 0.826 | 0.906 | 0.042 | 0.851 | 0.779 | 0.819 | 0.895 | 0.061 | 114.69 |
DGNet (EfficientNet-B4) | 0.822 | 0.693 | 0.728 | 0.896 | 0.033 | 0.857 | 0.784 | 0.814 | 0.911 | 0.042 | 0.839 | 0.769 | 0.806 | 0.901 | 0.057 | 21.02 |
AGLNet (EfficientNet-B4) | 0.875 | 0.791 | 0.813 | 0.933 | 0.023 | 0.889 | 0.836 | 0.858 | 0.934 | 0.033 | 0.874 | 0.825 | 0.851 | 0.918 | 0.050 | 93.65 |
Method | COD10K: Sα↑ | Fβw↑ | Fβm↑ | Eφ↑ | MAE↓ | NC4K: Sα↑ | Fβw↑ | Fβm↑ | Eφ↑ | MAE↓ | CAMO: Sα↑ | Fβw↑ | Fβm↑ | Eφ↑ | MAE↓
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
SegMaR (704*704) | 0.830 | 0.708 | 0.745 | 0.894 | 0.033 | 0.845 | 0.762 | 0.799 | 0.892 | 0.050 | 0.792 | 0.701 | 0.748 | 0.843 | 0.085 |
ZoomNet (704*704) | 0.842 | 0.738 | 0.778 | 0.891 | 0.029 | 0.854 | 0.786 | 0.822 | 0.896 | 0.043 | 0.797 | 0.721 | 0.768 | 0.845 | 0.080 |
FDCOD (704*704) | 0.843 | 0.733 | 0.761 | 0.902 | 0.030 | 0.842 | 0.761 | 0.793 | 0.896 | 0.049 | 0.850 | 0.784 | 0.821 | 0.886 | 0.059 |
HitNet (704*704) | 0.868 | 0.798 | 0.806 | 0.932 | 0.024 | 0.870 | 0.825 | 0.853 | 0.921 | 0.039 | 0.844 | 0.801 | 0.831 | 0.902 | 0.057 |
AGLNet (704*704) | 0.875 | 0.791 | 0.813 | 0.933 | 0.023 | 0.889 | 0.836 | 0.858 | 0.934 | 0.033 | 0.874 | 0.825 | 0.851 | 0.918 | 0.050 |
Effectiveness of Decoupling. As shown in Table III (#3), the addition of the Decoupling part significantly improves detection performance, by an average of 1.2%, 0.9%, and 1.1% on Fβw, Fβm, and Eφ across the three datasets, respectively. Decoupling adopts feature splitting and group-wise exploration to dig deep into different feature groups for fine discriminative features, which strengthens the feature representation. The decoupling operation compensates for more details for fine feature exploration. Figure 5 shows some visual results. We can see that the Decoupling part plays a crucial role in fine feature exploration, compensating for more details, such as edges (e.g., row 1), textures (e.g., row 2), and torsos (e.g., row 3), for camouflaged object detection.
Effectiveness of RD. The RD module utilizes three FR components, which combine multi-scale backbone features to further refine feature representation for camouflaged object detection. As shown in Table III (#4), RD further increases the detection performance. Figure 6 provides some visual comparison results, showing the effectiveness of the proposed RD module.
Parameter Analysis for FR. Fig. 8 shows the performance comparison of the FR module under different numbers of iterations. We can see that the model performance is relatively robust to the number of iterations; in our experiments, we adopt three iterations, which achieves the best performance for camouflaged object detection by a small margin. Besides, inside the FR module, we perform multiple split and merge operations to deeply explore critical features for camouflaged object detection. Fig. 8 also shows the performance comparison under different numbers of split and merge operations; we employ three split-merge operations, which achieve the best detection performance. Overall, experiments show that the performance of FR is relatively robust under different parameter settings, and the chosen settings achieve slightly better performance in camouflaged object detection.
Effectiveness of AIG. The AIG module learns additional information features of camouflaged objects, which are then incorporated into the image features for camouflaged object segmentation. To make full use of the additional cues, multi-level fusion is adopted to incorporate the additional information features into different stages of the model, including HFC and RD, for deep aggregation. As shown in Table III (#OUR), the performance gains are 1.5%, 1.8%, 1.3%, and 8.0% in terms of Fβw, Fβm, Eφ, and MAE on COD10K, respectively, demonstrating the effectiveness of additional information features for camouflaged object detection.
Model Adaptability to Different Additional Cues. Fig. 9 compares the adaptability of some representative COD methods to different additional cues, including boundary, texture, Canny edge, and frequency. We adopt FDCOD [1] and DGNet [2] for comparison; the former introduces frequency domain information and the latter integrates edge cues. We can see that FDCOD achieves better results using frequency domain cues, but its performance drops significantly with other additional cues. In contrast, DGNet shows poor performance when using frequency domain cues. These methods are tailored to specific auxiliary cues and are not applicable to other additional cues. By contrast, our proposed method shows outstanding adaptability to different additional cues: it achieves the best performance under each type of additional information, with only minor performance differences among them, when compared to the other competitors. Fig. 10 shows the feature maps of different additional cues. We can see that: (a) the boundary provides relatively little object information (only the edges of the object silhouette), so it is more sensitive to the object silhouette but explores more limited effective object features than the other three additional cues; (b) boundary and texture only provide cues about object regions, while Canny edge and frequency also provide contextual information (i.e., the background of the objects), which helps to improve scene understanding and increase detection performance; (c) frequency provides additional information beyond the human visual system and shows the best results.
Backbone Analysis. We also test different backbones to verify the performance of the proposed method for COD. ZoomNet [14], FDCOD [1], and DGNet [2] are state-of-the-art methods built on the commonly used ResNet-50 [57], Res2Net-50 [58], and EfficientNet-B4 [54] backbones, respectively; therefore, we take these three methods as competitors. As shown in Table IV, we test our proposed AGLNet with each of the three backbones and find that it outperforms the other competitors significantly. Specifically, compared with FDCOD, AGLNet (Res2Net-50) achieves 2.3%, 3.0%, and 4.7% improvements in Sα, Fβw, and Fβm on the COD10K dataset, respectively. Compared with DGNet, AGLNet (EfficientNet-B4) achieves average improvements of 4.9%, 9.4%, 7.6%, 2.8%, and 21.3% in Sα, Fβw, Fβm, Eφ, and MAE across the three datasets, respectively. Overall, our AGLNet achieves the best performance with EfficientNet-B4 as the backbone. Besides, among the AGLNet variants, AGLNet (EfficientNet-B4) has the fewest parameters, but it is still larger than DGNet and ZoomNet. The design of lightweight models is a focus of our future work.
Input Resolution. We also conduct a series of ablation experiments to analyze the impact of input image resolution on detection performance. As shown in Table V, under the same resolution (704×704), the proposed AGLNet significantly outperforms the comparison methods. This is because: a) high-resolution input provides more effective object details to improve detection; b) high-resolution input also introduces noise interference, so a well-designed network can better explore critical cues. The proposed method deeply integrates additional features and image features and further refines them with the recalibration decoder, providing very compelling performance. In fact, in our experiments, our method achieves the best results at all common input resolutions.
V Conclusion
This paper proposes an adaptive guidance learning framework that can, in principle, handle any of the additional cues commonly used in COD tasks while achieving significant performance gains. To our knowledge, this is the first work to study a unified end-to-end model that adapts to different additional information for COD tasks. The proposed method designs an additional information generation module to learn the additional cues, which are then deeply integrated with image features by the hierarchical feature combination module to guide the learning of camouflaged features, followed by a recalibration decoder for refined prediction. Extensive experiments on three datasets show its superiority over 20 state-of-the-art methods.
References
- [1] Y. Zhong, B. Li, L. Tang, S. Kuang, S. Wu, and S. Ding, “Detecting camouflaged object in frequency domain,” in CVPR, 2022, pp. 4504–4513.
- [2] G.-P. Ji, D.-P. Fan, Y.-C. Chou, D. Dai, A. Liniger, and L. Van Gool, “Deep gradient learning for efficient camouflaged object detection,” Machine Intelligence Research, vol. 20, no. 1, pp. 92–108, 2023.
- [3] D.-P. Fan, G.-P. Ji, T. Zhou, G. Chen, H. Fu, J. Shen, and L. Shao, “Pranet: Parallel reverse attention network for polyp segmentation,” in MICCAI. Springer, 2020, pp. 263–273.
- [4] D.-P. Fan, T. Zhou, G.-P. Ji, Y. Zhou, G. Chen, H. Fu, J. Shen, and L. Shao, “Inf-net: Automatic covid-19 lung infection segmentation from ct images,” IEEE TMI, vol. 39, no. 8, pp. 2626–2637, 2020.
- [5] D. Tabernik, S. Šela, J. Skvarč, and D. Skočaj, “Segmentation-based deep-learning approach for surface-defect detection,” Journal of Intelligent Manufacturing, vol. 31, no. 3, pp. 759–776, 2020.
- [6] A. Prakash, K. Chitta, and A. Geiger, “Multi-modal fusion transformer for end-to-end autonomous driving,” in CVPR, 2021, pp. 7077–7087.
- [7] R. Feng and B. Prabhakaran, “Facilitating fashion camouflage art,” in ACM MM, 2013, pp. 793–802.
- [8] K. S. Kumar and A. Abdul Rahman, “Early detection of locust swarms using deep learning,” in Advances in machine learning and computational intelligence. Springer, 2021, pp. 303–310.
- [9] T. Liu, Y. Zhao, Y. Wei, Y. Zhao, and S. Wei, “Concealed object detection for activate millimeter wave image,” IEEE Transactions on Industrial Electronics, vol. 66, no. 12, pp. 9909–9917, 2019.
- [10] D.-P. Fan, G.-P. Ji, G. Sun, M.-M. Cheng, J. Shen, and L. Shao, “Camouflaged object detection,” in CVPR, 2020, pp. 2777–2787.
- [11] H. Mei, G.-P. Ji, Z. Wei, X. Yang, X. Wei, and D.-P. Fan, “Camouflaged object segmentation with distraction mining,” in CVPR, 2021, pp. 8772–8781.
- [12] Z. Huang, H. Dai, T.-Z. Xiang, S. Wang, H.-X. Chen, J. Qin, and H. Xiong, “Feature shrinkage pyramid for camouflaged object detection with transformers,” in CVPR, 2023, pp. 5557–5566.
- [13] F. Yang, Q. Zhai, X. Li, R. Huang, A. Luo, H. Cheng, and D.-P. Fan, “Uncertainty-guided transformer reasoning for camouflaged object detection,” in ICCV, 2021, pp. 4146–4155.
- [14] Y. Pang, X. Zhao, T.-Z. Xiang, L. Zhang, and H. Lu, “Zoom in and out: A mixed-scale triplet network for camouflaged object detection,” in CVPR, 2022, pp. 2160–2170.
- [15] M. Stevens and S. Merilaita, “Animal camouflage: current issues and new perspectives,” Philosophical Transactions of the Royal Society B: Biological Sciences, vol. 364, no. 1516, pp. 423–427, 2009.
- [16] Y. Sun, S. Wang, C. Chen, and T.-Z. Xiang, “Boundary-guided camouflaged object detection,” in IJCAI, 2022, pp. 1335–1341.
- [17] Q. Zhai, X. Li, F. Yang, C. Chen, H. Cheng, and D.-P. Fan, “Mutual graph learning for camouflaged object detection,” in CVPR, 2021, pp. 12 997–13 007.
- [18] H. Zhu, P. Li, H. Xie, X. Yan, D. Liang, D. Chen, M. Wei, and J. Qin, “I can find you! boundary-guided separated attention network for camouflaged object detection,” in AAAI, 2022, pp. 3608–3616.
- [19] J. Zhu, X. Zhang, S. Zhang, and J. Liu, “Inferring camouflaged objects by texture-aware interactive guidance network,” in AAAI, 2021, pp. 3599–3607.
- [20] X. Zhang, B. Yin, Z. Lin, Q. Hou, D.-P. Fan, and M.-M. Cheng, “Referring camouflaged object detection,” arXiv preprint arXiv:2306.07532, 2023.
- [21] Z. Chen, R. Gao, T. Xiang, and F. Lin, “Diffusion model for camouflaged object detection,” in ECAI. IOS Press, 2023, pp. 445–452.
- [22] D.-P. Fan, G.-P. Ji, M.-M. Cheng, and L. Shao, “Concealed object detection,” IEEE TPAMI, vol. 44, no. 10, pp. 6024–6042, 2022.
- [23] B. Yin, X. Zhang, Q. Hou, B.-Y. Sun, D.-P. Fan, and L. Van Gool, “Camoformer: Masked separable attention for camouflaged object detection,” arXiv preprint arXiv:2212.06570, 2022.
- [24] Y. Sun, G. Chen, T. Zhou, Y. Zhang, and N. Liu, “Context-aware cross-level fusion network for camouflaged object detection,” in IJCAI, 2021, pp. 1025–1031.
- [25] Q. Jia, S. Yao, Y. Liu, X. Fan, R. Liu, and Z. Luo, “Segment, magnify and reiterate: Detecting camouflaged objects the hard way,” in CVPR, 2022, pp. 4713–4722.
- [26] M. Zhang, S. Xu, Y. Piao, D. Shi, S. Lin, and H. Lu, “Preynet: Preying on camouflaged objects,” in ACM MM, 2022, pp. 5323–5332.
- [27] X. Zhang, X. Sun, Y. Luo, J. Ji, Y. Zhou, Y. Wu, F. Huang, and R. Ji, “Rstnet: Captioning with adaptive attention on visual and non-visual words,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 15 465–15 474.
- [28] X. Zhang, B.-W. Yin, Y. Chen, Z. Lin, Y. Li, Q. Hou, and M.-M. Cheng, “Temo: Towards text-driven 3d stylization for multi-object meshes,” arXiv preprint arXiv:2312.04248, 2023.
- [29] B. Dong, J. Pei, R. Gao, T.-Z. Xiang, S. Wang, and H. Xiong, “A unified query-based paradigm for camouflaged instance segmentation,” in Proceedings of the 31st ACM International Conference on Multimedia, 2023, pp. 2131–2138.
- [30] C. He, K. Li, Y. Zhang, L. Tang, Y. Zhang, Z. Guo, and X. Li, “Camouflaged object detection with feature decomposition and edge reconstruction,” in CVPR, 2023, pp. 22 046–22 055.
- [31] Y. Lv, J. Zhang, Y. Dai, A. Li, B. Liu, N. Barnes, and D.-P. Fan, “Simultaneously localize, segment and rank the camouflaged objects,” in CVPR, 2021, pp. 11 591–11 601.
- [32] X. Cheng, H. Xiong, D.-P. Fan, Y. Zhong, M. Harandi, T. Drummond, and Z. Ge, “Implicit motion handling for video camouflaged object detection,” in CVPR, 2022, pp. 13 864–13 873.
- [33] Y. Pang, X. Zhao, T.-Z. Xiang, L. Zhang, and H. Lu, “Zoomnext: A unified collaborative pyramid network for camouflaged object detection,” arXiv preprint arXiv:2310.20208, 2023.
- [34] T.-N. Le, T. V. Nguyen, Z. Nie, M.-T. Tran, and A. Sugimoto, “Anabranch network for camouflaged object segmentation,” CVIU, vol. 184, pp. 45–56, 2019.
- [35] A. Li, J. Zhang, Y. Lv, B. Liu, T. Zhang, and Y. Dai, “Uncertainty-aware joint salient object and camouflaged object detection,” in CVPR, 2021, pp. 10 071–10 081.
- [36] C. Zhang, H. Bi, T.-Z. Xiang, R. Wu, J. Tong, and X. Wang, “Collaborative camouflaged object detection: A large-scale dataset and benchmark,” IEEE Transactions on Neural Networks and Learning Systems, pp. 1–15, 2023.
- [37] J.-X. Zhao, J.-J. Liu, D.-P. Fan, Y. Cao, J. Yang, and M.-M. Cheng, “Egnet: Edge guidance network for salient object detection,” in ICCV, October 2019, pp. 8778–8787.
- [38] W. Wang, S. Zhao, J. Shen, S. C. Hoi, and A. Borji, “Salient object detection with pyramid attention and salient edges,” in CVPR, 2019, pp. 1448–1457.
- [39] Y. Zeng, P. Zhang, J. Zhang, Z. Lin, and H. Lu, “Towards high-resolution salient object detection,” in ICCV, 2019, pp. 7234–7243.
- [40] X. Hu, S. Wang, X. Qin, H. Dai, W. Ren, D. Luo, Y. Tai, and L. Shao, “High-resolution iterative feedback network for camouflaged object detection,” in AAAI, 2023, pp. 881–889.
- [41] G.-P. Ji, L. Zhu, M. Zhuge, and K. Fu, “Fast camouflaged object detection via edge-based reversible re-calibration network,” Pattern Recognition, vol. 123, p. 108414, 2022.
- [42] R. Cong, M. Sun, S. Zhang, X. Zhou, W. Zhang, and Y. Zhao, “Frequency perception network for camouflaged object detection,” in Proceedings of the 31st ACM International Conference on Multimedia, 2023, pp. 1179–1189.
- [43] C. He, K. Li, Y. Zhang, L. Tang, Y. Zhang, Z. Guo, and X. Li, “Camouflaged object detection with feature decomposition and edge reconstruction,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2023, pp. 22 046–22 055.
- [44] J. Canny, “A computational approach to edge detection,” IEEE TPAMI, vol. 8, no. 6, pp. 679–698, 1986.
- [45] Z. Wu, L. Su, and Q. Huang, “Cascaded partial decoder for fast and accurate salient object detection,” in CVPR, 2019, pp. 3907–3916.
- [46] H. Liu, J. Zhang, K. Yang, X. Hu, and R. Stiefelhagen, “Cmx: Cross-modal fusion for rgb-x semantic segmentation with transformers,” arXiv preprint arXiv:2203.04838, 2022.
- [47] J. Wei, S. Wang, and Q. Huang, “F3net: fusion, feedback and focus for salient object detection,” in AAAI, 2020, pp. 12 321–12 328.
- [48] Q. Zhai, X. Li, F. Yang, Z. Jiao, P. Luo, H. Cheng, and Z. Liu, “Mgl: Mutual graph learning for camouflaged object detection,” IEEE Transactions on Image Processing, vol. 32, pp. 1897–1910, 2022.
- [49] T. Zhou, Y. Zhou, C. Gong, J. Yang, and Y. Zhang, “Feature aggregation and propagation network for camouflaged object detection,” IEEE Transactions on Image Processing, vol. 31, pp. 7036–7047, 2022.
- [50] Z. Wu, D. P. Paudel, D.-P. Fan, J. Wang, S. Wang, C. Demonceaux, R. Timofte, and L. Van Gool, “Source-free depth for object pop-out,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 1032–1042.
- [51] D.-P. Fan, M.-M. Cheng, Y. Liu, T. Li, and A. Borji, “Structure-measure: A new way to evaluate foreground maps,” in ICCV, 2017, pp. 4548–4557.
- [52] R. Margolin, L. Zelnik-Manor, and A. Tal, “How to evaluate foreground maps?” in CVPR, 2014, pp. 248–255.
- [53] D.-P. Fan, G.-P. Ji, X. Qin, and M.-M. Cheng, “Cognitive vision inspired object segmentation metric and loss function,” SCIENTIA SINICA Informationis, vol. 6, p. 6, 2021.
- [54] M. Tan and Q. Le, “Efficientnet: Rethinking model scaling for convolutional neural networks,” in ICML. PMLR, 2019, pp. 6105–6114.
- [55] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in ICLR, vol. 9, 2015.
- [56] I. Loshchilov and F. Hutter, “Sgdr: Stochastic gradient descent with warm restarts,” ICLR, 2017.
- [57] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016, pp. 770–778.
- [58] S.-H. Gao, M.-M. Cheng, K. Zhao, X.-Y. Zhang, M.-H. Yang, and P. Torr, “Res2net: A new multi-scale backbone architecture,” IEEE TPAMI, vol. 43, no. 2, pp. 652–662, 2019.