
Edge-Aware Mirror Network for Camouflaged Object Detection
*Corresponding author. This work was supported in part by the National Natural Science Foundation of China under Grant 41927805.

1st Dongyue Sun
College of Computer Science and Technology
Ocean University of China
Qingdao, China
[email protected]

2nd Shiyao Jiang
College of Computer Science and Technology
Ocean University of China
Qingdao, China
[email protected]

3rd Lin Qi
College of Computer Science and Technology
Ocean University of China
Qingdao, China
[email protected]
Abstract

Existing edge-aware camouflaged object detection (COD) methods normally output the edge prediction at an early stage. However, edges are important and fundamental cues for the subsequent segmentation task. Due to the high visual similarity between camouflaged targets and their surroundings, an edge prior predicted at an early stage usually introduces erroneous foreground-background separations and contaminates the features used for segmentation. To tackle this problem, we propose a novel Edge-aware Mirror Network (EAMNet), which models edge detection and camouflaged object segmentation as a cross-refinement process. More specifically, EAMNet has a two-branch architecture, where a segmentation-induced edge aggregation module and an edge-induced integrity aggregation module are designed to cross-guide the segmentation branch and the edge detection branch. A guided-residual channel attention module, which leverages residual connections and gated convolutions, is further designed to better extract structural details from low-level features. Quantitative and qualitative experimental results show that EAMNet outperforms existing cutting-edge baselines on three widely used COD datasets. Code is available at https://github.com/sdy1999/EAMNet.

Index Terms:
Camouflaged object detection, Low-level features, Cross refinement, Edge cues

I Introduction

Camouflage refers to the phenomenon that wild animals adapt their colors and textures to the surroundings in order to hide themselves and deceive other animals (e.g., predators). Camouflaged object detection (COD) [1] aims to search for and segment camouflaged objects from a single image. Analyzing and exploring camouflage patterns can benefit a range of downstream applications such as polyp segmentation [2] and recreational art [3]. Camouflaged objects exhibit high visual similarity to the background, making COD more challenging than generic object detection and difficult for traditional hand-crafted algorithms to handle [4].

With the release of large-scale and well-annotated COD datasets [1, 5, 6], many deep learning based COD models have been proposed. Fan et al. [1] constructed the COD10K dataset, which contains 5,066 samples of camouflaged objects, and proposed a search-identification network (SINet). SINet first uses a search module to roughly locate the camouflaged object, and then uses an identification module for precise segmentation. Inspired by the design ethos of SINet, a variety of approaches focusing on cross-level feature fusion have been proposed [7, 8]. By leveraging well-designed fusion schemes, these methods can roughly locate regions containing the camouflaged object, but they still fail to clarify the indistinct boundary between the camouflaged object and its surroundings, resulting in fuzzy segmentations.


Figure 1: Model paradigm (a), represented by SINet [1], focuses on exploring multi-level segmentation feature fusion; model paradigm (b), represented by BGNet [9], introduces edge cues to further enhance the representation of segmentation features; model paradigm (c), adopted by our approach, models edge detection and camouflaged object segmentation as a cross-refinement process. Our model outperforms the first two paradigms in the integrity of the camouflaged object segmentation (d).

To overcome the above limitation, researchers have introduced edge cues to enhance the representation of segmentation features [10, 9]. As shown in Figure 1(b), the common practice is to generate an edge prediction at an early stage of the network and use it as an edge prior to guide the fusion of multi-level segmentation features. However, these methods still suffer from two shortcomings: (1) low-level features, which preserve low-frequency texture information, receive little attention; (2) the edge prior is injected into segmentation features without considering the accuracy of the early generated edge prediction. Due to the lack of semantic information, edge predictions generated at an early stage tend to lose the integrity of the camouflaged object, which misleads segmentation into erroneous foreground predictions. Following ICON [11], we use integrity to refer to the whole body of a camouflaged object. As shown in Figure 1(d), compared to the result of SINet, the prediction of BGNet has clearer boundaries but still fails to identify the whole body of the batfish.

In this paper, we propose EAMNet, which consists of an edge detection branch and a segmentation branch. The design ethos of EAMNet is to utilize edge features to enhance the segmentation features; the enhanced segmentation features then yield a more complete foreground representation, which in turn is used to enhance the edge features. Two components are proposed to implement this cross guidance at multiple scales: (1) the edge-induced integrity aggregation (EIA) module introduces the edge stream to enhance segmentation features and explores the integrity of camouflaged objects along the channel and spatial dimensions; (2) the segmentation-induced edge aggregation (SEA) module introduces the segmentation stream to help the edge branch capture the whole shape of the camouflaged object. Moreover, under the assumption that correct semantic guidance and a low-frequency-preserving residual connection are the two essential factors for leveraging low-level features, we propose the guided-residual channel attention (GCA) module to extract the structural details of camouflaged objects preserved in low-level features. Experimental results show that the proposed EAMNet surpasses nine recent state-of-the-art methods under four widely used metrics on three COD datasets.


Figure 2: The overall architecture of EAMNet, which features three key components: the segmentation-induced edge aggregation module (SEA), the edge-induced integrity aggregation module (EIA), and the guided-residual channel attention module (GCA).

II RELATED WORKS

Traditional COD methods mostly separate the camouflaged object from its surroundings by utilizing hand-crafted features such as texture, color, and brightness, and they often fail to deal with complex scenes [4].

Recently, deep learning-based approaches have become dominant in the COD field [1, 7, 12]. Fan et al. (SINet) [1] design a search module and an identification module to detect camouflaged objects by simulating the process of animal hunting. Based on the observation that people usually search for camouflaged objects by changing the viewpoint of the same scene, Yan et al. (MirrorNet) [13] introduce a flipped image stream to better locate potential camouflaged objects. Beyond these bio-inspired methods, Chen et al. (C2Net) [7] design an attention-induced cross-level fusion module to fully capture valuable context information and boost the accuracy of COD. Yang et al. (UGTR) [14] integrate the benefits of Bayesian learning and transformer reasoning to leverage the uncertainty and probabilistic information in COD data annotation. To further clarify indistinct boundaries, Sun et al. [9] design an edge-guidance feature module to embed edge cues into segmentation features, and Zhu et al. [10] utilize adaptive space normalization to do so in a more effective way. Notably, there are also salient object detection (SOD) methods [15, 16] that explore the cross guidance between edge detection and salient object segmentation, but their design ethos is to detect salient objects that are strongly distinct from their surroundings; due to the essential difference between “camouflaged” and “salient”, it is difficult to directly apply them to the COD field.

III METHODOLOGY

III-A Overview

Figure 2 shows the overview of the proposed EAMNet, which consists of three kinds of key components: the bifurcated backbone, the edge detection branch, and the segmentation branch. The edge detection branch consists mainly of SEA modules, which aggregate multi-level edge features under the guidance of the segmentation branch. The segmentation branch consists mainly of EIA modules, which aggregate multi-level segmentation features under the guidance of the edge detection branch. Both branches adopt a coarse-to-fine strategy under the supervision of the edge map and the ground-truth map, respectively, which means that the predictions in the two branches are progressively optimized from low to high resolution.

Specifically, given an input image I, we first adopt Res2Net-50 [17] as the backbone to extract features at four levels, denoted as F=\{f_{i}, i=1,2,3,4\}. We then feed F into channel reduction modules containing a 1×1 convolutional layer to obtain multi-level edge features with a channel size of 64, denoted as F_{e}=\{f^{e}_{i}, i=1,2,3,4\}, and multi-level segmentation features, denoted as F_{s}=\{f^{s}_{i}, i=1,2,3,4\}. After that, we stack three SEA modules in the edge detection branch to aggregate the multi-level edge features F_{e} in a coarse-to-fine manner, and three EIA modules in the segmentation branch to aggregate the segmentation features F_{s}. For clearer description, we define the outputs of the SEA modules as E=\{E_{i}, i=1,2,3\} with f^{e}_{4} as E_{4}, and the outputs of the EIA modules as S=\{S_{i}, i=1,2,3\} with f^{s}_{4} as S_{4}. The SEA and EIA modules interact in a cascaded manner to implement the cross refinement between the two branches. Moreover, in both branches, the shallow detail features f^{s}_{1} and f^{e}_{1} are first fed into three cascaded GCA modules to filter out the abundant background noise by leveraging the semantic stream from the corresponding aggregation modules. We take the last EIA module's prediction as the final segmentation result at the testing stage.
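To make this data flow concrete, the following is a minimal, runnable PyTorch sketch of the cascaded cross refinement. The SEA and EIA modules are replaced by trivial concatenate-and-convolve stubs and the GCA modules are omitted, so the sketch illustrates only the inter-branch wiring under our reading of the figure, not the actual modules (which are sketched in Sections III-B to III-D).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FusionStub(nn.Module):
    """Stand-in for SEA/EIA: upsample all inputs to the first one's resolution,
    concatenate, and fuse with a 3x3 conv block. Only the data flow is real."""
    def __init__(self, n_inputs, ch=64):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(ch * n_inputs, ch, 3, padding=1),
            nn.BatchNorm2d(ch), nn.ReLU(inplace=True))

    def forward(self, *xs):
        size = xs[0].shape[-2:]
        xs = [x if x.shape[-2:] == size else
              F.interpolate(x, size=size, mode='bilinear', align_corners=False)
              for x in xs]
        return self.fuse(torch.cat(xs, dim=1))


# 64-channel edge / segmentation features at 1/4 ... 1/32 resolution of a 384x384 input.
Fe = [torch.randn(1, 64, 96 >> i, 96 >> i) for i in range(4)]   # f_1^e ... f_4^e
Fs = [torch.randn(1, 64, 96 >> i, 96 >> i) for i in range(4)]   # f_1^s ... f_4^s

sea = nn.ModuleList([FusionStub(3) for _ in range(3)])   # inputs: f_i^e, E_{i+1}, S_{i+1}
eia = nn.ModuleList([FusionStub(3) for _ in range(3)])   # inputs: f_i^s, S_{i+1}, E_i

E, S = [None] * 4, [None] * 4
E[3], S[3] = Fe[3], Fs[3]                    # E_4 = f_4^e, S_4 = f_4^s
for i in (2, 1, 0):                          # coarse-to-fine cascade (i = 3, 2, 1 in the paper)
    E[i] = sea[i](Fe[i], E[i + 1], S[i + 1])     # SEA: edge branch guided by the segmentation stream
    S[i] = eia[i](Fs[i], S[i + 1], E[i])         # EIA: segmentation branch guided by the edge stream

final_pred = nn.Conv2d(64, 1, 1)(S[0])       # the last EIA output gives the final segmentation logits
print(final_pred.shape)                      # torch.Size([1, 1, 96, 96])
```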

III-B Edge-Induced Integrity Aggregation Module

The EIA module aims to inject the edge cues into the representation learning of segmentation features, and to further explore integrity information along both the channel and spatial dimensions. Figure 3 shows the detailed structure of the EIA module, which consists of two stages: the first for feature fusion and the second for integrity exploration.

Specifically, the ith EIA module takes the segmentation feature f^{s}_{i}\in R^{W_{1}\times H_{1}\times 64}, the higher-level segmentation feature S_{i+1}\in R^{W_{2}\times H_{2}\times 64} from the previous EIA module, and the edge feature E_{i}\in R^{W_{1}\times H_{1}\times 64} from the previous SEA module as inputs. Inspired by [18], we first employ a mirror multiplication strategy \mathcal{M} consisting of two paths to fully aggregate the two-level segmentation features f^{s}_{i} and S_{i+1} (after up-sampling). In the main path, the higher-level feature S_{i+1} is fed into a 3×3 convolutional layer to generate a semantic mask, which is then multiplied with f^{s}_{i} to enhance the response of the camouflaged object. The mirror path, on the other hand, utilizes the lower-level feature f^{s}_{i} to generate a detail mask, which is then multiplied with S_{i+1} to preserve fine-grained details that might have been lost in the coarser segmentation map. The two paths are concatenated with the edge feature E_{i} and fed into a fusion block containing two 3×3 convolutional layers to implement the guidance from the edge branch to the segmentation branch. Note that a 3×3 convolutional layer in this paper consists of a 3×3 convolution, batch normalization, and a ReLU function. We denote the final fused feature as \widetilde{f^{s}_{i}}, and the whole fusion process can be formulated as:

\widetilde{f^{s}_{i}}=C_{3\times 3}(\textit{Concat}(\mathcal{M}(f^{s}_{i},S_{i+1}),E_{i})),  (1)

After that, the fused feature \widetilde{f^{s}_{i}} is fed into a three-branch structure B=\{B_{j}, j=1,2,3\} to mine integrity cues from a multi-scale perspective. Each branch contains a 3×3 convolutional layer and a 3×3 atrous convolutional layer with a dilation rate of n_{j}; in this paper, we set n_{j}=\{1,3,5\}. The three branches are concatenated and fed into two cascaded 3×3 convolutional layers to reduce the channels to 64. To further enhance the response of critical channels that preserve weak integrity cues of the camouflaged object (e.g., the legs of a crab), we adopt a multi-scale channel attention module (MS) [19], which has a two-branch architecture. The attention matrix M of MS is calculated as M(X)=L(X)+G(X), where G(X) discovers global information by leveraging a global average pooling (GAP) operation, while L(X) adopts point-wise convolutions to obtain local contexts. The whole exploration process can be formulated as:

S_{i}=\textit{MS}(C_{3\times 3}(\textit{Concat}(B_{1}(\widetilde{f^{s}_{i}}),B_{2}(\widetilde{f^{s}_{i}}),B_{3}(\widetilde{f^{s}_{i}})))),  (2)

where S_{i} denotes the final output feature of the ith EIA module, which is separately fed into the next SEA module to provide the edge branch with integrity cues, into the next EIA module to implement the coarse-to-fine fusion, and into the corresponding GCA module for background noise filtering. Note that, with a 1×1 convolution to change the channels of the feature S_{i}, we obtain the segmentation predictions P^{s}=\{P^{s}_{i}, i=1,2,3\} of the camouflaged object.
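To make the two-stage computation concrete, the following is a minimal PyTorch sketch of an EIA module under Eq. (1)-(2). The exact realization of the mirror multiplication \mathcal{M} (two masking paths concatenated with E_{i}) and the internal layout of the MS module follow our reading of the text and of AFF [19]; they are assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def conv_bn_relu(in_ch, out_ch, k=3, dilation=1):
    """The 3x3 convolutional layer of the paper: conv + BN + ReLU."""
    pad = dilation * (k // 2)
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, k, padding=pad, dilation=dilation),
        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))


class MSCAM(nn.Module):
    """Multi-scale channel attention (MS) following AFF [19]: M(X) = L(X) + G(X),
    with a sigmoid gating as in the original module (a detail not spelled out here)."""
    def __init__(self, ch=64, r=4):
        super().__init__()
        mid = ch // r
        self.local = nn.Sequential(nn.Conv2d(ch, mid, 1), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
                                   nn.Conv2d(mid, ch, 1), nn.BatchNorm2d(ch))
        self.glob = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                  nn.Conv2d(ch, mid, 1), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
                                  nn.Conv2d(mid, ch, 1), nn.BatchNorm2d(ch))

    def forward(self, x):
        return x * torch.sigmoid(self.local(x) + self.glob(x))


class EIA(nn.Module):
    """Sketch of the edge-induced integrity aggregation module (Eq. 1-2)."""
    def __init__(self, ch=64):
        super().__init__()
        self.sem_mask = conv_bn_relu(ch, ch)      # semantic mask from S_{i+1} (main path)
        self.det_mask = conv_bn_relu(ch, ch)      # detail mask from f_i^s     (mirror path)
        self.fuse = nn.Sequential(conv_bn_relu(3 * ch, ch), conv_bn_relu(ch, ch))   # Eq. (1)
        self.branches = nn.ModuleList([                                             # dilations {1, 3, 5}
            nn.Sequential(conv_bn_relu(ch, ch), conv_bn_relu(ch, ch, dilation=d)) for d in (1, 3, 5)])
        self.reduce = nn.Sequential(conv_bn_relu(3 * ch, ch), conv_bn_relu(ch, ch))
        self.ms = MSCAM(ch)
        self.pred = nn.Conv2d(ch, 1, 1)           # 1x1 conv -> segmentation prediction P_i^s

    def forward(self, f_s, s_high, e):
        s_high = F.interpolate(s_high, size=f_s.shape[-2:], mode='bilinear', align_corners=False)
        main = f_s * self.sem_mask(s_high)        # mirror multiplication M(f_i^s, S_{i+1}):
        mirror = s_high * self.det_mask(f_s)      # each level masks the other
        fused = self.fuse(torch.cat([main, mirror, e], dim=1))                      # Eq. (1)
        multi = torch.cat([b(fused) for b in self.branches], dim=1)
        s_i = self.ms(self.reduce(multi))                                            # Eq. (2)
        return s_i, self.pred(s_i)


# toy usage: f_i^s and E_i at 48x48, S_{i+1} at 24x24 (batch of 2 so BN statistics are valid)
eia = EIA()
s_i, p_s = eia(torch.randn(2, 64, 48, 48), torch.randn(2, 64, 24, 24), torch.randn(2, 64, 48, 48))
print(s_i.shape, p_s.shape)   # torch.Size([2, 64, 48, 48]) torch.Size([2, 1, 48, 48])
```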

III-C Segmentation-Induced Edge Aggregation Module

The SEA module aims to inject the integrity cues into the representation learning of edge features, which shares the same design ethos as the EIA module, but is much lighter.

Specifically, the ith SEA module takes the edge feature f^{e}_{i}\in R^{W_{1}\times H_{1}\times 64}, the higher-level edge feature E_{i+1}\in R^{W_{2}\times H_{2}\times 64} from the previous SEA module, and the segmentation feature S_{i+1}\in R^{W_{2}\times H_{2}\times 64} from the EIA module as inputs. The two-level edge features f^{e}_{i} and E_{i+1} (after up-sampling) are concatenated and fed into a fusion block containing two 3×3 convolutional layers for full fusion, and then we adopt the mirror multiplication strategy \mathcal{M}, followed by an additional residual connection, to inject the integrity cues from the segmentation feature S_{i+1} (after up-sampling) into the fused edge feature f^{e}_{i}. We denote the final fused feature as \widetilde{f^{e}_{i}}, and the whole fusion process can be formulated as:

\left\{\begin{array}{l} f^{e}_{i}=C_{3\times 3}(\textit{Concat}(f^{e}_{i},E_{i+1})),\\ \widetilde{f^{e}_{i}}=\mathcal{M}(f^{e}_{i},S_{i+1})+f^{e}_{i}. \end{array}\right.  (3)

After that, we feed \widetilde{f^{e}_{i}} into two cascaded 3×3 convolutional layers to explore its discriminative representation in foreground-background blurred regions, and the final output feature E_{i} is fed into a 1×1 convolution for channel reduction to obtain the edge predictions P^{e}=\{P^{e}_{i}, i=1,2,3\} of the camouflaged object.
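A corresponding sketch of the SEA module under Eq. (3) is given below. Here \mathcal{M} is realized as two masking paths whose outputs are summed so that the channel count matches the residual term; this choice, like the layer widths, is an assumption rather than the official implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def cbr(in_ch, out_ch):
    """3x3 conv + BN + ReLU, the same building block as in the EIA sketch."""
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1),
                         nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))


class SEA(nn.Module):
    """Sketch of the segmentation-induced edge aggregation module (Eq. 3)."""
    def __init__(self, ch=64):
        super().__init__()
        self.fuse = nn.Sequential(cbr(2 * ch, ch), cbr(ch, ch))   # fuse f_i^e with E_{i+1}
        self.seg_mask = cbr(ch, ch)     # mask generated from the segmentation feature S_{i+1}
        self.edge_mask = cbr(ch, ch)    # mask generated from the fused edge feature
        self.refine = nn.Sequential(cbr(ch, ch), cbr(ch, ch))     # two cascaded 3x3 conv layers
        self.pred = nn.Conv2d(ch, 1, 1)                           # 1x1 conv -> edge prediction P_i^e

    def forward(self, f_e, e_high, s_high):
        size = f_e.shape[-2:]
        e_high = F.interpolate(e_high, size=size, mode='bilinear', align_corners=False)
        s_high = F.interpolate(s_high, size=size, mode='bilinear', align_corners=False)
        fe = self.fuse(torch.cat([f_e, e_high], dim=1))           # first line of Eq. (3)
        # mirror multiplication with the segmentation stream plus the residual connection;
        # summing the two masked paths (instead of concatenating) keeps the channel count at 64
        fe_tilde = fe * self.seg_mask(s_high) + s_high * self.edge_mask(fe) + fe
        e_i = self.refine(fe_tilde)
        return e_i, self.pred(e_i)


sea = SEA()
e_i, p_e = sea(torch.randn(2, 64, 48, 48), torch.randn(2, 64, 24, 24), torch.randn(2, 64, 24, 24))
print(e_i.shape, p_e.shape)   # torch.Size([2, 64, 48, 48]) torch.Size([2, 1, 48, 48])
```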

III-D Guided-Residual Channel Attention Module

In natural images, camouflaged objects tend to be small, which makes the structural details preserved in low-level features essential for capturing the integrity of camouflaged objects. Therefore, we propose the guided-residual channel attention (GCA) module, whose structure is illustrated in Figure 2. The GCA module first applies a guide-flow block that uses semantic features from the SEA modules (or the EIA modules in the segmentation branch) to help filter out background noise in the detail feature, and then uses a channel attention block to further capture the inter-channel dependencies of the detail feature.


Figure 3: Illustration of the EIA module.

As shown in Figure 4, the guide-flow block takes the detail feature D\in R^{W_{1}\times H_{1}\times 64} and the semantic feature S\in R^{W_{2}\times H_{2}\times 64} as inputs. The guidance map G\in R^{W_{1}\times H_{1}\times 1} is obtained as:

G=\textit{Sigmoid}(C_{1\times 1}(\textit{Concat}(D,C_{1\times 1}(S)))),  (4)

where C_{1\times 1} denotes a normalized 1×1 convolutional layer. Because the guidance map is generated from both S and D, it can fully utilize the positional information of camouflaged objects in the semantic feature S while also taking into account the structural information of camouflaged objects preserved in the detail feature D. The new detail feature \widetilde{D}\in R^{W_{1}\times H_{1}\times C} is calculated as:

\widetilde{D}=C_{1\times 1}((G\cdot D)+D),  (5)

where \cdot denotes the element-wise product and C_{1\times 1} denotes a 1×1 convolutional layer. We then feed \widetilde{D} into a convolutional block followed by a multi-scale channel attention module MS to further explore its foreground representation along the channel dimension. Moreover, an additional residual connection is utilized to preserve the low-frequency structural information of camouflaged objects to the maximum extent possible. The process can be formulated as:

D=\textit{MS}(C_{3\times 3}(\widetilde{D}))+\widetilde{D},  (6)

where C_{3\times 3} denotes two 3×3 convolutions with a ReLU function in the middle, and D denotes the final output detail feature.
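The guide-flow block and the subsequent channel attention of Eq. (4)-(6) can be sketched as follows. The normalization attached to C_{1\times 1} and the batch-normalization layers inside the MS module are omitted for brevity, and the layer widths are assumed; this is a sketch of our reading, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MSCAM(nn.Module):
    """Multi-scale channel attention [19]; BN layers are dropped here for brevity."""
    def __init__(self, ch=64, r=4):
        super().__init__()
        mid = ch // r
        self.local = nn.Sequential(nn.Conv2d(ch, mid, 1), nn.ReLU(inplace=True), nn.Conv2d(mid, ch, 1))
        self.glob = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                  nn.Conv2d(ch, mid, 1), nn.ReLU(inplace=True), nn.Conv2d(mid, ch, 1))

    def forward(self, x):
        return x * torch.sigmoid(self.local(x) + self.glob(x))


class GCA(nn.Module):
    """Sketch of the guided-residual channel attention module (Eq. 4-6)."""
    def __init__(self, ch=64):
        super().__init__()
        self.reduce_s = nn.Conv2d(ch, ch, 1)       # inner C_1x1(S) in Eq. (4)
        self.gate = nn.Conv2d(2 * ch, 1, 1)        # outer C_1x1 producing the 1-channel guidance map G
        self.proj = nn.Conv2d(ch, ch, 1)           # C_1x1 in Eq. (5)
        self.body = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
                                  nn.Conv2d(ch, ch, 3, padding=1))   # C_3x3 in Eq. (6)
        self.ms = MSCAM(ch)

    def forward(self, d, s):
        s = F.interpolate(s, size=d.shape[-2:], mode='bilinear', align_corners=False)
        g = torch.sigmoid(self.gate(torch.cat([d, self.reduce_s(s)], dim=1)))   # Eq. (4)
        d_tilde = self.proj(g * d + d)                                          # Eq. (5)
        return self.ms(self.body(d_tilde)) + d_tilde                            # Eq. (6)


gca = GCA()
out = gca(torch.randn(1, 64, 96, 96), torch.randn(1, 64, 48, 48))   # detail at 1/4, semantic at 1/8
print(out.shape)   # torch.Size([1, 64, 96, 96])
```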


Figure 4: Illustration of the Guide-flow block.

III-E Loss Function

We utilize a co-supervision strategy to jointly train the two branches. Following previous state-of-the-art COD methods [1, 9], we adopt the weighted binary cross-entropy loss (L^{w}_{BCE}) and the weighted intersection-over-union loss (L^{w}_{IOU}) for the supervision of the camouflaged mask (M_{c}), and the dice loss (L_{dice}) for the supervision of the edge mask (M_{e}). The total loss of EAMNet is calculated as: L_{total}=\sum_{i=1}^{3}\lambda_{i}(L^{w}_{BCE}(M_{c},P^{s}_{i})+L^{w}_{IOU}(M_{c},P^{s}_{i})+L_{dice}(M_{e},P^{e}_{i})), where \lambda_{i} is a trade-off parameter set to \lambda_{i}=\frac{1}{2^{i-1}}, P^{s}_{i} is the segmentation prediction, and P^{e}_{i} is the edge prediction of the camouflaged object.
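For reference, a hedged sketch of this loss is given below. The weighted BCE/IoU terms follow the boundary-aware pixel weighting that is widely used in prior COD work [1, 9]; the exact weighting window and the dice smoothing constant are assumptions, not values quoted from this paper.

```python
import torch
import torch.nn.functional as F


def structure_loss(pred, mask):
    """Weighted BCE + weighted IoU in the common public formulation (assumed here)."""
    weit = 1 + 5 * torch.abs(F.avg_pool2d(mask, 31, stride=1, padding=15) - mask)
    wbce = F.binary_cross_entropy_with_logits(pred, mask, reduction='none')
    wbce = (weit * wbce).sum(dim=(2, 3)) / weit.sum(dim=(2, 3))

    pred = torch.sigmoid(pred)
    inter = ((pred * mask) * weit).sum(dim=(2, 3))
    union = ((pred + mask) * weit).sum(dim=(2, 3))
    wiou = 1 - (inter + 1) / (union - inter + 1)
    return (wbce + wiou).mean()


def dice_loss(pred, edge, eps=1.0):
    """Dice loss for the edge predictions; eps is a smoothing constant (assumed)."""
    pred = torch.sigmoid(pred)
    inter = (pred * edge).sum(dim=(2, 3))
    return (1 - (2 * inter + eps) / (pred.sum(dim=(2, 3)) + edge.sum(dim=(2, 3)) + eps)).mean()


def total_loss(seg_preds, edge_preds, mask, edge):
    """L_total = sum_i lambda_i * (L_bce^w + L_iou^w + L_dice), lambda_i = 1 / 2^(i-1)."""
    loss = 0.0
    for i, (ps, pe) in enumerate(zip(seg_preds, edge_preds)):   # i = 0, 1, 2 -> lambda = 1, 1/2, 1/4
        lam = 0.5 ** i
        ps = F.interpolate(ps, size=mask.shape[-2:], mode='bilinear', align_corners=False)
        pe = F.interpolate(pe, size=edge.shape[-2:], mode='bilinear', align_corners=False)
        loss = loss + lam * (structure_loss(ps, mask) + dice_loss(pe, edge))
    return loss
```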

TABLE I: Quantitative comparison with state-of-the-art methods for COD on three benchmarks using four evaluation metrics (i.e., S_{\alpha}, E_{\phi}, F^{w}_{\beta} and M). ”\uparrow” / ”\downarrow” indicates that higher/lower is better. The top two results are highlighted in red and blue.
CAMO COD10K NC4K
Method Pub./Year S_{\alpha}\uparrow E_{\phi}\uparrow F^{w}_{\beta}\uparrow M\downarrow S_{\alpha}\uparrow E_{\phi}\uparrow F^{w}_{\beta}\uparrow M\downarrow S_{\alpha}\uparrow E_{\phi}\uparrow F^{w}_{\beta}\uparrow M\downarrow
SINet[1] CVPR’20 0.745 0.804 0.644 0.092 0.776 0.864 0.631 0.043 0.808 0.871 0.723 0.058
S-MGL[20] CVPR’21 0.772 0.806 0.664 0.089 0.811 0.844 0.654 0.037 0.829 0.862 0.731 0.055
R-MGL[20] CVPR’21 0.775 0.812 0.673 0.088 0.814 0.851 0.666 0.035 0.833 0.867 0.739 0.053
UGTR[14] ICCV’21 0.784 0.821 0.683 0.086 0.817 0.852 0.665 0.036 0.839 0.874 0.746 0.052
SINet-V2[12] TPAMI’21 0.820 0.882 0.743 0.070 0.815 0.887 0.680 0.037 0.847 0.903 0.770 0.048
DCTNet[8] TMM’22 0.778 0.804 0.667 0.084 0.790 0.821 0.616 0.041 - - - -
C_{2}-Net[7] TCSVT’22 0.800 0.869 0.730 0.077 0.811 0.891 0.691 0.036 - - - -
BSANet[10] AAAI’22 0.796 0.851 0.717 0.079 0.818 0.891 0.699 0.034 - - - -
BGNet[9] IJCAI’22 0.812 0.870 0.749 0.073 0.831 0.901 0.722 0.033 0.851 0.907 0.788 0.044
Ours 0.831 0.890 0.763 0.064 0.839 0.907 0.733 0.029 0.862 0.916 0.801 0.040
TABLE II: Ablation study on the GCA module and the cross-refinement process. ’w/o’ denotes ’without’; EDB denotes the whole edge detection branch. The best results are highlighted in bold.
CAMO COD10K NC4K
Model S_{\alpha}\uparrow E_{\phi}\uparrow F^{w}_{\beta}\uparrow M\downarrow S_{\alpha}\uparrow E_{\phi}\uparrow F^{w}_{\beta}\uparrow M\downarrow S_{\alpha}\uparrow E_{\phi}\uparrow F^{w}_{\beta}\uparrow M\downarrow
a. w/o eGCA 0.827 0.890 0.764 0.065 0.836 0.911 0.733 0.030 0.856 0.913 0.793 0.043
b. w/o sGCA 0.827 0.886 0.766 0.067 0.835 0.908 0.731 0.030 0.857 0.915 0.797 0.043
c. w/o eGCA & sGCA 0.823 0.883 0.757 0.067 0.832 0.906 0.723 0.030 0.852 0.910 0.785 0.045
d. w/o EDB 0.820 0.887 0.758 0.069 0.825 0.894 0.708 0.032 0.848 0.905 0.778 0.046
e. Ours 0.831 0.890 0.763 0.064 0.839 0.907 0.733 0.029 0.862 0.916 0.801 0.040

IV EXPERIMENTS

IV-A Experimental Setup

Datasets. We evaluate EAMNet on three benchmark datasets: COD10K [1], NC4K [5], and CAMO [6]. COD10K is the most challenging dataset to date, consisting of 78 subclasses with 5,066 samples. CAMO is also a widely used COD dataset consisting of 1,250 samples. Following the data partition of SINet [1], we use 3,040 samples from COD10K and 1,000 samples from CAMO for training, and the remaining samples for testing. We also use the NC4K dataset (4,121 samples) to evaluate the generalization ability of EAMNet.

Implementation Details. The proposed EAMNet is implemented in PyTorch. Following the training settings of recent methods [1, 7, 12], Res2Net [17] pretrained on ImageNet is employed as the backbone. The input images are resized to 384×384 and augmented by random horizontal flipping. AdamW with weight decay is chosen as the optimizer. The learning rate is set to 5e-5 and follows a linear warm-up and linear decay strategy, divided by 10 every 50 epochs. The entire model is trained for 100 epochs with a batch size of 24 on a single NVIDIA RTX 3090 GPU.
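A minimal optimizer/schedule sketch matching the reported settings (AdamW, lr = 5e-5, 100 epochs, batch size 24) is shown below. Since the schedule description combines a linear warm-up/decay with a step decay, the sketch implements only the linear warm-up + linear decay reading; the warm-up length and weight-decay value are assumptions.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 1, 3, padding=1)                   # stand-in for EAMNet
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5, weight_decay=1e-4)   # decay value assumed

warmup_epochs, total_epochs = 5, 100                    # warm-up length assumed; 100 epochs as reported


def lr_lambda(epoch):
    if epoch < warmup_epochs:
        return (epoch + 1) / warmup_epochs                                       # linear warm-up
    return max(0.0, (total_epochs - epoch) / (total_epochs - warmup_epochs))     # linear decay


scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for epoch in range(total_epochs):
    # ... one training epoch over the 4,040 training images, batch size 24, 384x384 inputs ...
    scheduler.step()
```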

Evaluation Metrics. We apply four evaluation metrics widely used in the COD task: the S-measure (S_{\alpha}) [21], E-measure (E_{\phi}) [22], weighted F-measure (F^{w}_{\beta}) [23], and mean absolute error (M) [24]. In general, a better COD method presents higher S_{\alpha}, E_{\phi}, F^{w}_{\beta} and lower M.
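Among the four metrics, M has the simplest definition and is sketched below; the S-measure, E-measure, and weighted F-measure have more involved definitions and are typically computed with the official evaluation toolkits.

```python
import torch


def mae(pred, gt):
    """Mean absolute error M: mean |P - G| over all pixels, with the prediction
    and the binary ground truth both in [0, 1]."""
    return torch.abs(pred.float() - gt.float()).mean().item()


# toy usage
p = torch.rand(1, 1, 384, 384)                    # a camouflage map in [0, 1]
g = (torch.rand(1, 1, 384, 384) > 0.5).float()    # a binary ground-truth mask
print(mae(p, g))
```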

IV-B Performance Comparison

Table I shows the quantitative comparison between EAMNet and 9 state-of-the-art methods in terms of S_{\alpha}, E_{\phi}, F^{w}_{\beta} and M; it can be observed that our EAMNet performs better than these methods on all metrics. Specifically, on CAMO, EAMNet outperforms the two edge-aware methods BGNet [9] and BSANet [10] by 1.9% and 3.5% in terms of the structural similarity measure S_{\alpha}, respectively. This suggests that EAMNet excels in mining the complete structure of the camouflaged object. The improved performance can be attributed to the two-branch architecture, which allows for more accurate edge prior mining, and to the EIA module, which fully aggregates the edge prior and segmentation features in both spatial and channel dimensions. Additionally, the proposed GCA module makes full use of the low-level features to boost COD performance.

We further evaluate the generalization ability of EAMNet on the NC4K dataset, where it outperforms the second-best method, BGNet, by 1.1%, 0.9%, and 1.3% in terms of S_{\alpha}, E_{\phi}, and F^{w}_{\beta}, respectively. Figure 5 provides visual comparisons between EAMNet and the other methods in challenging scenes. Our method demonstrates superior performance in accurately detecting and segmenting the whole shape of the camouflaged object, as shown by the katydid image in the first row, where EAMNet achieves a structural similarity measure S_{\alpha} of 85% while BGNet achieves only 60%.


Figure 5: Visual comparison of the proposed EAMNet with five state-of-the-art COD methods in tough scenarios.

IV-C Ablation study

To verify the effectiveness of the overall cross-refinement strategy, we replace the whole edge detection branch with the boundary detection module of BSANet [10], which generates edge features only in the early stage using the multi-level features extracted by Res2Net. As shown in Table II, compared to model d, model e achieves obvious performance improvements on all benchmark datasets, with average gains of 1.3%, 0.9%, and 1.7% in terms of S_{\alpha}, E_{\phi}, and F^{w}_{\beta}. This is because the edge prior generated only in the early stage lacks integrity information, a problem that the cooperation between the EIA and SEA modules in EAMNet helps to solve. In addition, to verify the effectiveness of the GCA module, we remove the three GCA modules in the segmentation branch (sGCA) and in the edge detection branch (eGCA), respectively. Compared to model e, model c shows a significant decrease on all metrics, which indicates the GCA module's powerful extraction of structural details from low-level features.

V CONCLUSION

In this paper, we propose a novel framework based on the cross refinement between edge detection and camouflaged object segmentation, called EAMNet. In particular, the EIA and SEA modules are proposed to implement the cross-refinement relations. Moreover, we design the GCA module to better extract structural details from low-level features and boost the accuracy of camouflaged object detection. Extensive experiments on three benchmark datasets show that EAMNet outperforms other state-of-the-art COD methods. In the future, we plan to further explore EAMNet on weakly supervised COD datasets.

References

  • [1] Deng-Ping Fan, Ge-Peng Ji, Guolei Sun, Ming-Ming Cheng, Jianbing Shen, and Ling Shao, “Camouflaged object detection,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 2777–2787.
  • [2] Deng-Ping Fan, Ge-Peng Ji, Tao Zhou, Geng Chen, Huazhu Fu, Jianbing Shen, and Ling Shao, “Pranet: Parallel reverse attention network for polyp segmentation,” in Medical Image Computing and Computer Assisted Intervention–MICCAI 2020: 23rd International Conference, Lima, Peru, October 4–8, 2020, Proceedings, Part VI 23. Springer, 2020, pp. 263–273.
  • [3] Ranran Feng and Balakrishnan Prabhakaran, “Facilitating fashion camouflage art,” in Proceedings of the 21st ACM international conference on Multimedia, 2013, pp. 793–802.
  • [4] Ajoy Mondal, “Camouflaged object detection and tracking: A survey,” International Journal of Image and Graphics, vol. 20, no. 04, pp. 2050028, 2020.
  • [5] Yunqiu Lv, Jing Zhang, Yuchao Dai, Aixuan Li, Bowen Liu, Nick Barnes, and Deng-Ping Fan, “Simultaneously localize, segment and rank the camouflaged objects,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 11591–11601.
  • [6] Trung-Nghia Le, Tam V Nguyen, Zhongliang Nie, Minh-Triet Tran, and Akihiro Sugimoto, “Anabranch network for camouflaged object segmentation,” Computer vision and image understanding, vol. 184, pp. 45–56, 2019.
  • [7] Geng Chen, Si-Jie Liu, Yu-Jia Sun, Ge-Peng Ji, Ya-Feng Wu, and Tao Zhou, “Camouflaged object detection via context-aware cross-level fusion,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 32, no. 10, pp. 6981–6993, 2022.
  • [8] Wei Zhai, Yang Cao, HaiYong Xie, and Zheng-Jun Zha, “Deep texton-coherence network for camouflaged object detection,” IEEE Transactions on Multimedia, 2022.
  • [9] Yujia Sun, Shuo Wang, Chenglizhao Chen, and Tian-Zhu Xiang, “Boundary-guided camouflaged object detection,” arXiv preprint arXiv:2207.00794, 2022.
  • [10] Hongwei Zhu, Peng Li, Haoran Xie, Xuefeng Yan, Dong Liang, Dapeng Chen, Mingqiang Wei, and Jing Qin, “I can find you! boundary-guided separated attention network for camouflaged object detection,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2022, vol. 36, pp. 3608–3616.
  • [11] Mingchen Zhuge, Deng-Ping Fan, Nian Liu, Dingwen Zhang, Dong Xu, and Ling Shao, “Salient object detection via integrity learning,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
  • [12] Deng-Ping Fan, Ge-Peng Ji, Ming-Ming Cheng, and Ling Shao, “Concealed object detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 10, pp. 6024–6042, 2021.
  • [13] Jinnan Yan, Trung-Nghia Le, Khanh-Duy Nguyen, Minh-Triet Tran, Thanh-Toan Do, and Tam V Nguyen, “Mirrornet: Bio-inspired camouflaged object segmentation,” IEEE Access, vol. 9, pp. 43290–43300, 2021.
  • [14] Fan Yang, Qiang Zhai, Xin Li, Rui Huang, Ao Luo, Hong Cheng, and Deng-Ping Fan, “Uncertainty-guided transformer reasoning for camouflaged object detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 4146–4155.
  • [15] Zhe Wu, Li Su, and Qingming Huang, “Stacked cross refinement network for edge-aware salient object detection,” in Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 7264–7273.
  • [16] Runmin Wu, Mengyang Feng, Wenlong Guan, Dong Wang, Huchuan Lu, and Errui Ding, “A mutual learning method for salient object detection with intertwined multi-supervision,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 8150–8159.
  • [17] Shang-Hua Gao, Ming-Ming Cheng, Kai Zhao, Xin-Yu Zhang, Ming-Hsuan Yang, and Philip Torr, “Res2net: A new multi-scale backbone architecture,” IEEE transactions on pattern analysis and machine intelligence, vol. 43, no. 2, pp. 652–662, 2019.
  • [18] Zuyao Chen, Qianqian Xu, Runmin Cong, and Qingming Huang, “Global context-aware progressive aggregation network for salient object detection,” in Proceedings of the AAAI conference on artificial intelligence, 2020, vol. 34, pp. 10599–10606.
  • [19] Yimian Dai, Fabian Gieseke, Stefan Oehmcke, Yiquan Wu, and Kobus Barnard, “Attentional feature fusion,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2021, pp. 3560–3569.
  • [20] Aixuan Li, Jing Zhang, Yunqiu Lv, Bowen Liu, Tong Zhang, and Yuchao Dai, “Uncertainty-aware joint salient object and camouflaged object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 10071–10081.
  • [21] Deng-Ping Fan, Ming-Ming Cheng, Yun Liu, Tao Li, and Ali Borji, “Structure-measure: A new way to evaluate foreground maps,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 4548–4557.
  • [22] Deng-Ping Fan, Cheng Gong, Yang Cao, Bo Ren, Ming-Ming Cheng, and Ali Borji, “Enhanced-alignment measure for binary foreground map evaluation,” arXiv preprint arXiv:1805.10421, 2018.
  • [23] Ran Margolin, Lihi Zelnik-Manor, and Ayellet Tal, “How to evaluate foreground maps?,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 248–255.
  • [24] Federico Perazzi, Philipp Krähenbühl, Yael Pritch, and Alexander Hornung, “Saliency filters: Contrast based filtering for salient region detection,” in 2012 IEEE conference on computer vision and pattern recognition. IEEE, 2012, pp. 733–740.