
Dingwen Zhang, Northwestern Polytechnical University, Xi’an 710129, China. E-mail: [email protected]
Liangbo Cheng, Northwestern Polytechnical University, Xi’an 710129, China. E-mail: [email protected]
🖂 Yi Liu, Changzhou University, Changzhou 213000, China. E-mail: [email protected]
Xinggang Wang, Huazhong University of Science and Technology, Wuhan 430074, China. E-mail: [email protected]
🖂 Junwei Han, Northwestern Polytechnical University, Xi’an 710129, China. E-mail: [email protected]

Mamba Capsule Routing Towards Part-Whole Relational Camouflaged Object Detection

Dingwen Zhang · Liangbo Cheng · Yi Liu🖂 · Xinggang Wang · Junwei Han🖂
(Received: date / Accepted: date)
Abstract

The part-whole relational property endowed by Capsule Networks (CapsNets) has proven successful for camouflaged object detection (COD) owing to its segmentation integrity. However, the previous Expectation-Maximization (EM) capsule routing algorithm, with its heavy computation and large number of parameters, obstructs this trend. The primary cause lies in the pixel-level capsule routing. Alternatively, in this paper, we propose a novel mamba capsule routing at the type level. Specifically, we first extract the implicit latent state in mamba as capsule vectors, which abstract type-level capsules from their pixel-level counterparts. These type-level mamba capsules are fed into the EM routing algorithm to obtain the high-layer mamba capsules, which greatly reduces the computation and parameters caused by pixel-level capsule routing for part-whole relationship exploration. On top of that, to retrieve the pixel-level capsule features for further camouflaged prediction, we achieve this on the basis of the low-layer pixel-level capsules with the guidance of the correlations between adjacent-layer type-level mamba capsules. Extensive experiments on three widely used COD benchmark datasets demonstrate that our method significantly outperforms state-of-the-art methods. Code is available at https://github.com/Liangbo-Cheng/mamba_capsule.

Keywords:
Camouflaged object detection · Part-whole relationship · Capsule network · Mamba

1 Introduction

Camouflage is a form of protective adaptation exhibited by animals in nature, wherein they alter their appearance, texture, and coloration to enhance their ability to hunt prey and ensure survival. The goal of camouflaged object detection (COD) is to accurately locate and segment concealed objects within their camouflaged surroundings. Thanks to this ability, COD has been widely applied in biological conservation Cai et al (2023); Yang et al (2023), industrial detection Rahmon et al (2024); Wang et al (2024), artistic creation Zhao et al (2024); Huang et al (2024), and medical image segmentation Liu et al (2024a); Zhao et al (2021), etc.

Early researchers extracted hand-crafted features Vistnes (1989); Fazekas et al (2009); Singh et al (2013); Andreopoulos and Tsotsos (2013), such as color, texture and optical flow, to detect the camouflaged target. However, these approaches suffer from poor discrimination between foreground and background, resulting in unsatisfactory detection performance. Recently, deep learning based frameworks have advanced the development of COD. In particular, since the advent of the large-scale COD dataset COD10K Fan et al (2020), a large number of works have addressed COD by simulating biological mechanisms Fan et al (2020) and human visual patterns Pang et al (2022). They mostly elaborate indistinguishable-feature mining modules Chen et al (2022); Sun et al (2021); Mei et al (2023); Luo et al (2023) and encompass multiple tasks He et al (2023); Le et al (2019); Sun et al (2022); Zhu et al (2022) using Convolutional Neural Networks (CNNs) Bideau et al (2024); Zhai et al (2023); Li et al (2021); Cheng et al (2022) and Transformers Yang et al (2021); Pei et al (2022); Mei et al (2023); Luo et al (2024). However, the strong inherent similarity between the camouflaged object and its background restricts the feature extraction capability of both CNN and Transformer networks that try to find discriminative regions, easily causing incomplete detection with missing object details or lost local parts.

To address this issue, the part-whole relational property endowed by Capsule Networks (CapsNets) Liu et al (2021b) has been proven successful for the complete segmentation of camouflaged objects, which is implemented by excavating the relevant parts of the object Liu et al (2019). However, the previous Expectation-Maximization (EM) routing Hinton et al (2018) makes part-whole relational COD Liu et al (2021b) challenging in terms of computational complexity, parameter count, and inference speed. The reason is that the previous pixel-level EM routing inevitably generates large-scale capsule assignments at the pixel level, causing large-scale dense computation.

Recently, Vision Mamba (VMamba) Liu et al (2024c) has successfully adapted mamba Gu and Dao (2024), renowned for efficient modeling of long sequences, to computer vision tasks. Specifically, four-direction scans are applied to the 2D input to obtain 1D sequence tokens, which are fed into selective State Space Models (SSMs) Gu et al (2022, 2024) to attend to the important tokens for further image recognition. During the selective SSM stage, the 1D sequence latent state implicitly models the global context. Besides, the four-direction scans in VMamba provide spatial context for the latent state.

In this paper, inspired by VMamba Liu et al (2024c), we introduce VMamba into the task of part-whole relational COD with the aim of designing a lightweight capsule routing. To this end, we propose a novel Mamba Capsule Routing Network (MCRNet) to detect camouflaged objects. Specifically, the 2D pixel-level spatial capsules are first fed into the scanning mechanism to obtain 1D capsule tokens in various scanning directions, which are input into the selective SSM Gu and Dao (2024) to learn the 1D implicit latent state capsules, named mamba capsules. This Mamba Capsule Generation (MCG) module extracts 1D type-level capsules from the 2D pixel-level primary capsules. The EM routing Hinton et al (2018) absorbs the mamba capsules to generate their high-layer versions (high-layer and whole-level are used interchangeably) at the type level, which implements the part-whole relationships with lightweight routing computation. On top of that, to retrieve the pixel-level spatial details from the high-layer mamba capsules for the final dense prediction of the camouflaged object, we design a Capsules Spatial Details Retrieval (CSDR) module. Concretely, we rely on two components, the low-layer (low-layer and part-level are used interchangeably) pixel-level capsules and the correlation of adjacent-layer mamba capsules, to recover the spatial details of the high-layer capsules. Using this mechanism, spatial capsules in the four scanning directions are integrated into a uniform direction for the final camouflage detection.

Figure 1: Different capsule routings for part-whole relational camouflaged object detection. The original EM routing Hinton et al (2018) involves a significant number of parameters and high routing complexity due to the dense routing at the pixel level. In contrast, our proposed MCRNet compresses spatially pixel-level capsules into type-level capsules, leading to type-level capsule routing with substantially reduced complexity. On top of that, capsules spatial details retrieval is used to recover the spatial details of the mamba capsules for further camouflaged object detection.

To sum up, the contributions of this paper are as follows:

  • We design a novel MCRNet for the part-whole relational COD task, which significantly reduces the capsule routing complexity. To the best of our knowledge, it is the first attempt to employ mamba for CapsNets and for the task of COD.

  • We propose an MCG module to generate type-level mamba capsules from the pixel-level capsules, which enables lightweight capsule routing.

  • We design a CSDR module to retrieve the spatial details from the high-layer type-level mamba capsules for the dense detection of camouflaged objects.

  • Comprehensive experiments demonstrate that our proposed MCRNet achieves superior performance on three widely-used COD datasets compared to 25 existing state-of-the-art methods.

This paper is organized as follows. Sec. 2 reviews the references related to our work. Sec. 3 summarizes the preliminaries of mamba. Sec. 4 describes the details of the proposed MCRNet. Sec. 5 presents extensive experiments and analyses of our method. Sec. 6 concludes the paper.

2 Related Work

In this section, we will review references related to our work, including camouflaged object detection, capsule network, and vision mamba.

2.1 Camouflaged Object Detection

The task of COD refers to accurately segmenting objects that are intentionally designed or naturally evolved to blend into their surroundings, which is rather challenging due to the high similarity between the target object and the background. Researchers have contributed many works that have greatly advanced this field. In the early stage, most works were developed based on hand-crafted features, e.g., colour Huerta et al (2007), 3D convexity Pan et al (2011) and intensity features Sengottuvelan et al (2008). However, they are relatively less robust and prone to fail in complex scenarios with low contrast. With the popularity of deep learning Fan et al (2020), recent works focus on mining more detailed features in a learning manner to distinguish camouflaged objects from their surroundings. Inspired by biological mechanisms in nature or human visual psychological patterns, Fan et al. Fan et al (2022) mimicked the behavior of predators to simulate the search for and identification of prey. Pang et al. Pang et al (2022) adopted the zoom mechanism humans use when observing fuzzy objects for the task of COD. Jia et al. Jia et al (2022) segmented, magnified and reiterated the camouflaged object in a coarse-to-fine manner with a multi-stage strategy. Besides, some feature mining modules are elaborated for unearthing the subtle discriminative features of camouflaged objects. For example, Zhu et al. Zhu et al (2021) focused on texture-aware learning. Mei et al. Mei et al (2021) aimed to learn contextual-aware information. He et al. He et al (2023) and Zhong et al. Zhong et al (2022) introduced frequency clues to aid camouflaged object detection. Moreover, incorporating auxiliary tasks into the COD task can facilitate precise segmentation maps, such as classification Le et al (2019), edge/boundary detection Sun et al (2022); Zhu et al (2022); Luo et al (2024) and object ranking Lv et al (2021). The methods above, which are based on CNNs and Transformers, exhibit suboptimal performance in detecting camouflaged objects with high similarity and low contrast to their background. Alternatively, Liu et al. Liu et al (2021b) made the first successful attempt to address the COD task from a part-whole relational perspective.

In this paper, following the pipeline of part-whole relational COD, our work takes a further step toward lightweight capsule routing, which is achieved by introducing VMamba Liu et al (2024c) into CapsNets Hinton et al (2018).

2.2 Capsule Network

The history of part-whole representation goes back several decades. Krivic and Solina Krivic and Solina (2004) recognized articulated objects based on part-level descriptions obtained by the Segmentor system Chen and Schmitt (1993). Girshick et al. Girshick et al (2015) designed a CNN to formulate the deformable part model using distance transform pooling, object geometry filters, and maxout units. To address the problem of spatial invariance in CNNs, Hinton et al. Hinton et al (2011); Sabour et al (2017); Hinton et al (2018) explored part-whole relationships with CapsNets, which route low-level capsules (parts) to their familiar high-level ones (wholes). As the previous EM routing Hinton et al (2018) consumes excessive computational resources, many improved works focus on lightweight routing. Liu et al. Liu et al (2022b) disentangled two orthogonal 1D routings, which greatly reduce parameters and routing complexity, resulting in faster inference than the previous omnidirectional 2D routing adopted by the EM routing strategy. Liu et al. Liu et al (2024b) presented a residual pose routing. Likewise, Geng et al. Geng et al (2024) designed an orthogonal sparse attention routing to reduce redundancy and parameters.

Although the above improvements to capsule networks have made great progress in reducing redundancy, the EM routing Hinton et al (2018) still remains at the pixel level, resulting in large-scale capsule assignments and high computational complexity. Unlike previous works, we introduce VMamba Liu et al (2024c) to generate type-level mamba capsules from the pixel-level capsules for routing, which ensures lightweight computation.

Figure 2: The overall architecture of MCRNet. The long-range context from the Swin Transformer Liu et al (2021c) is first fed into the designed MCG module. In MCG, each type of constructed primary capsules is scanned in four directions, and the resulting sequences are input into the selective SSM Gu and Dao (2024) module to obtain the implicit latent state, which is treated as the type-level mamba capsules for subsequent routing to learn the high-layer mamba capsules. Afterwards, the proposed CSDR module is used to retrieve the spatial details of the mamba capsules for the final camouflaged prediction. To learn primitive object edges, the object boundary label is also taken into account for training.

2.3 Vision Mamba

Considering that the complexity of the self-attention mechanism in transformers increases quadratically with image size, mamba Gu and Dao (2024), which has recently shown good performance in long sequence modeling, can be a promising alternative. Owing to its low complexity, mamba has been adopted by the computer vision community. For example, Zhu et al. Zhu et al (2024) designed the first mamba-based backbone network with linear computational complexity while retaining the advantages of vision transformers, which showcases superior performance and the ability to capture complex visual dynamics. Liu et al. Liu et al (2024c) designed a cross-scan mechanism to bridge the gap between 1D array scanning and 2D plane traversing. Ma et al. Ma et al (2024) proposed a hybrid CNN-SSM structure to capture local fine-grained features and remote dependencies in images for biomedical image segmentation. Liang et al. Liang et al (2024) introduced a reordering strategy to scan data in a specific sequence to capture point cloud structures.

In this paper, inspired by this computational efficiency, we introduce VMamba Liu et al (2024c) into CapsNets Hinton et al (2018) to solve the part-whole relational COD task, generating type-level mamba capsules from the pixel-level versions using the implicit hidden state for further capsule routing.

Figure 3: Details of MCG. The generated primary capsules are scanned in different directions into capsule sequences, which are input into the selective SSM Gu and Dao (2024) module. The final latent state is chosen as the mamba capsule vectors.

3 Preliminaries

In this section, we will review the mechanism of SSMs Gu and Dao (2024); Liu et al (2024c) in detail, which is involved in the proposed MCG module.

The original SSMs Kalman (1960) are regarded as Linear Time Invariant (LTI) systems that map the input stimulation $\mathbf{x}(t)\in\mathbb{R}$ to the response $\mathbf{y}(t)\in\mathbb{R}$ through the latent state $\mathbf{h}(t)\in\mathbb{R}^{N}$, which is expressed as linear ordinary differential equations

$$\mathbf{h}'(t)=\mathbf{A}\mathbf{h}(t)+\mathbf{B}\mathbf{x}(t),\quad \mathbf{y}(t)=\mathbf{C}\mathbf{h}(t), \qquad (1)$$

where $\mathbf{A}\in\mathbb{R}^{N\times N}$ is the evolution parameter, and $\mathbf{B}\in\mathbb{R}^{N\times 1}$ and $\mathbf{C}\in\mathbb{R}^{1\times N}$ are the projection parameters.

To facilitate integration into deep learning models, the continuous system is discretized by introducing a time-scale parameter $\mathbf{\Delta}$ that transforms the continuous parameters $\mathbf{A}$, $\mathbf{B}$ into discrete parameters $\overline{\mathbf{A}}$, $\overline{\mathbf{B}}$ using the zero-order hold (ZOH) rule:

$$\overline{\mathbf{A}}=\exp(\mathbf{\Delta}\mathbf{A}),\quad \overline{\mathbf{B}}=(\mathbf{\Delta}\mathbf{A})^{-1}(\exp(\mathbf{\Delta}\mathbf{A})-\mathbf{I})\cdot\mathbf{\Delta}\mathbf{B}. \qquad (2)$$

After discretization, Eq. (1) can be rewritten as

$$\mathbf{h}_t=\overline{\mathbf{A}}\mathbf{h}_{t-1}+\overline{\mathbf{B}}\mathbf{x}_t,\quad \mathbf{y}_t=\mathbf{C}\mathbf{h}_t. \qquad (3)$$

Finally, the model calculates the output by global convolution

$$\overline{\mathbf{K}}=(\mathbf{C}\overline{\mathbf{B}},\,\mathbf{C}\overline{\mathbf{A}}\overline{\mathbf{B}},\,\dots,\,\mathbf{C}\overline{\mathbf{A}}^{L-1}\overline{\mathbf{B}}),\quad \mathbf{y}=\mathbf{x}\ast\overline{\mathbf{K}}, \qquad (4)$$

where $L$ is the length of the input sequence $\mathbf{x}$, $\overline{\mathbf{K}}\in\mathbb{R}^{L}$ is a structured convolutional kernel, and $\ast$ denotes the convolution operation.
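As a concrete illustration of the two equivalent views in Eqs. (3) and (4), the minimal sketch below runs a scalar-input SSM both as a recurrent scan and through the global convolution kernel and checks that the outputs agree. The random parameters, sizes and function names are illustrative assumptions, not the paper's implementation.

```python
import torch

def ssm_scan(x, A_bar, B_bar, C):
    """Discretized LTI SSM of Eq. (3): h_t = A_bar h_{t-1} + B_bar x_t, y_t = C h_t.
    x: (L,) scalar input sequence; A_bar: (N, N); B_bar: (N, 1); C: (1, N)."""
    N = A_bar.shape[0]
    h = torch.zeros(N, 1)
    ys = []
    for t in range(x.shape[0]):
        h = A_bar @ h + B_bar * x[t]    # state update
        ys.append((C @ h).squeeze())    # readout
    return torch.stack(ys)

def ssm_conv(x, A_bar, B_bar, C):
    """Equivalent computation via the global convolution kernel of Eq. (4)."""
    L = x.shape[0]
    K = torch.stack([(C @ torch.matrix_power(A_bar, k) @ B_bar).squeeze() for k in range(L)])
    # y_t = sum_{k=0..t} K_k * x_{t-k}  (causal convolution)
    return torch.stack([sum(K[k] * x[t - k] for k in range(t + 1)) for t in range(L)])

# Tiny numerical check on random parameters (illustrative only).
torch.manual_seed(0)
N, L = 4, 6
A_bar = 0.5 * torch.eye(N)
B_bar = torch.randn(N, 1)
C = torch.randn(1, N)
x = torch.randn(L)
print(torch.allclose(ssm_scan(x, A_bar, B_bar, C), ssm_conv(x, A_bar, B_bar, C), atol=1e-5))
```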

To tackle the limitation of LTI SSMs (Eq. (1)) in capturing contextual information, Gu et al. Gu and Dao (2024) proposed a novel parameterization method for SSMs that integrates an input-dependent selection mechanism, i.e.,

$$\mathbf{B}=\mathrm{Linear}_{N}(\mathbf{x}),\quad \mathbf{C}=\mathrm{Linear}_{N}(\mathbf{x}),\quad \mathbf{\Delta}=\mathrm{Linear}_{D}(\mathbf{x}), \qquad (5)$$

where $\mathrm{Linear}_{N}$ and $\mathrm{Linear}_{D}$ are parameterized projections to dimension $N$ and $D$, respectively. Eq. (5) selectively focuses on or ignores specific tokens according to the input sequence, enabling the model to process information efficiently.
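A minimal sketch of the input-dependent parameterization in Eq. (5) is given below; the module name, the per-channel time scale and the softplus activation are assumptions in the spirit of selective SSMs rather than the exact released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelectiveParams(nn.Module):
    """Predict B, C and the time scale Delta from the input tokens (Eq. (5)).
    D is the token/channel dimension and N the state dimension, following the paper's notation."""
    def __init__(self, D, N):
        super().__init__()
        self.to_B = nn.Linear(D, N)
        self.to_C = nn.Linear(D, N)
        self.to_delta = nn.Linear(D, D)  # one time scale per channel (assumption)

    def forward(self, x):                         # x: (batch, L, D)
        B = self.to_B(x)                          # (batch, L, N)
        C = self.to_C(x)                          # (batch, L, N)
        delta = F.softplus(self.to_delta(x))      # positive time scales, (batch, L, D)
        return B, C, delta

params = SelectiveParams(D=16, N=8)
B, C, delta = params(torch.randn(2, 10, 16))
print(B.shape, C.shape, delta.shape)
```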

4 Proposed Method

In this section, we will illustrate the details of MCRNet for camouflaged object detection.

4.1 Overview

Fig. 2 depicts the overall architecture of our proposed MCRNet, including a Swin Transformer encoder Liu et al (2021c), a Mamba Capsule Generation (MCG) module, a Capsules Spatial Details Retrieval (CSDR) module, and a multi-task learning decoder Liu et al (2021a). To be specific, the input image is divided into non-overlapping patches after data augmentation, which are fed into the Swin Transformer Liu et al (2021c) to capture long-range context features. On top of that, MCG is designed to generate type-level mamba capsules from the pixel-level capsules for further capsule routing to obtain the high-layer mamba capsules. To learn the spatial details of the mamba capsules, CSDR is developed to retrieve the spatial resolution of the high-layer type-level mamba capsules. Finally, a multi-task learning decoder covering camouflage detection and edge detection is designed to detect the camouflaged object with accurate boundaries.

4.2 Transformer Encoder

The transformer encoder first splits the input RGB image $\mathbf{I}\in\mathbb{R}^{C\times H\times W}$ into non-overlapping patches ($p\times p$, with $p=16$) by a patch embedding module, where $C$, $H$ and $W$ denote the channel size, height and width of the image $\mathbf{I}$, respectively. These image patches are linearly projected into a 1D sequence of token embeddings $\mathbf{F}^{E}\in\mathbb{R}^{l\times d}$, where $l=HW/p^{2}$ and $d$ are the length of the patch sequence and the channel dimension, respectively. The Swin Transformer Liu et al (2021c) encoder is used to capture global dependencies $\mathbf{F}_{i}^{E}\in\mathbb{R}^{l_{i}\times d_{i}}$, where $i\in[0,1,2,3]$ indicates the index of blocks in the encoder, and $l_{i}$ and $d_{i}$ denote the length of the sequence and the channel dimension of the token, respectively. Its shifted windowing mechanism reduces the computational burden with more efficient batched computation, yielding both efficiency and high performance.
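For readers unfamiliar with this step, the following is a minimal sketch of the patch-embedding idea described above (a $p\times p$ strided convolution that splits the image into non-overlapping patches and projects each patch to a $d$-dimensional token); it is not the Swin Transformer code, and the layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into p x p non-overlapping patches and project each to a d-dim token."""
    def __init__(self, in_ch=3, dim=96, patch=16):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)

    def forward(self, img):                        # img: (B, C, H, W)
        tokens = self.proj(img)                    # (B, d, H/p, W/p)
        return tokens.flatten(2).transpose(1, 2)   # (B, HW/p^2, d) = (B, l, d)

emb = PatchEmbed()
print(emb(torch.randn(1, 3, 352, 352)).shape)  # torch.Size([1, 484, 96])
```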

4.3 Mamba Capsule Generation

In this subsection, we will detail the MCG module that learns type-level mamba capsules from the pixel-level capsules, which is composed of primary capsule generation, multi-direction serialization, implicit latent state learning and mamba capsule acquisition.

Step 1: Primary capsules generation. As shown in Fig. 3, the feature sequence $\mathbf{F}_{2}^{E}\in\mathbb{R}^{l_{2}\times d_{2}}$ obtained from the encoder is reshaped into $\mathbf{F}'\in\mathbb{R}^{h_{2}\times w_{2}\times d_{2}}$ to facilitate the subsequent generation of Primary Capsules (PrimaryCaps) $\mathbf{P}\in\mathbb{R}^{h_{2}\times w_{2}\times O\times U}$, which contain the pose matrix $\mathbf{P}_{pose}\in\mathbb{R}^{h_{2}\times w_{2}\times O_{P}\times U}$ and the activation value $\mathbf{P}_{act}\in\mathbb{R}^{h_{2}\times w_{2}\times O_{A}\times U}$, where $O=\{O_{P}=16, O_{A}=1\}$ denotes the dimensions of the pose matrix and the activation, and $U$ represents the number of primary capsules, i.e.,

$$\mathbf{P}=\mathrm{Cat}(\mathbf{P}_{pose},\mathbf{P}_{act})=\mathrm{Cat}(\Phi(\mathbf{F}'),\mathrm{Sigmoid}(\Phi(\mathbf{F}'))), \qquad (6)$$

where $\Phi(\cdot)$ represents the sequence of convolution, batch normalization and ReLU, $\mathrm{Sigmoid}(\cdot)$ is the sigmoid function, and $\mathrm{Cat}(\cdot)$ denotes concatenation.
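A minimal sketch of Eq. (6) is shown below. The convolution-BN-ReLU branch $\Phi(\cdot)$ is approximated by two separate $1\times 1$ convolution branches for the pose and the activation, and all channel sizes ($d_2$, $U$, $O_P$) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PrimaryCaps(nn.Module):
    """Generate U primary capsules, each with a 16-d pose and a 1-d sigmoid activation."""
    def __init__(self, in_dim, U=32, pose_dim=16):
        super().__init__()
        self.U, self.pose_dim = U, pose_dim
        self.phi_pose = nn.Sequential(
            nn.Conv2d(in_dim, U * pose_dim, 1), nn.BatchNorm2d(U * pose_dim), nn.ReLU())
        self.phi_act = nn.Sequential(
            nn.Conv2d(in_dim, U, 1), nn.BatchNorm2d(U), nn.ReLU())

    def forward(self, feat):                          # feat: (B, d2, h2, w2)
        B, _, h, w = feat.shape
        pose = self.phi_pose(feat).view(B, self.U, self.pose_dim, h, w)
        act = torch.sigmoid(self.phi_act(feat)).view(B, self.U, 1, h, w)
        return torch.cat([pose, act], dim=2)          # (B, U, O_P + O_A, h, w)

caps = PrimaryCaps(in_dim=384)
print(caps(torch.randn(2, 384, 22, 22)).shape)  # torch.Size([2, 32, 17, 22, 22])
```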

Step 2: Multi-direction serialization. Using the scanning mechanism of VMamba Liu et al (2024c) to provide more accurate and richer 2D spatial context, the 2D primary capsules are serialized into four groups of 1D capsule sequences $\mathbf{S}_{g}=\{\mathbf{S}_{1},\dots,\mathbf{S}_{G}\}\in\mathbb{R}^{V\times O\times U}$, where $G=4$ denotes the four scanning directions shown in Fig. 3 (positive Z shape, inverted Z shape, positive N shape, and inverted N shape), $U$ represents the number of capsule sequences, $V=h_{2}\times w_{2}$ is the length of each capsule sequence, and $O$ is the dimension of each capsule token. For a certain scanning direction $g$, each capsule sequence $\mathbf{S}_{g}(u)\in\{\mathbf{S}_{g}(1),\dots,\mathbf{S}_{g}(U)\}\in\mathbb{R}^{V\times O}$ represents a different object part. Each capsule sequence $\mathbf{S}_{g}(u)$ contains $V$ capsule tokens $\mathbf{S}_{g}(u,v)\in\{\mathbf{S}_{g}(u,1),\dots,\mathbf{S}_{g}(u,V)\}\in\mathbb{R}^{O}$.
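The serialization of Step 2 can be sketched as follows; the exact correspondence between the four directions and the flip/transpose operations is an assumption that loosely follows Fig. 3.

```python
import torch

def four_direction_scan(P):
    """Flatten 2D capsules into four 1D sequences.
    P: (B, U, O, h, w).  Returns four tensors of shape (B, U, V, O), V = h * w."""
    z_pos = P.flatten(3)                   # row-major ("Z" shape)
    n_pos = P.transpose(3, 4).flatten(3)   # column-major ("N" shape)
    z_inv = z_pos.flip(-1)                 # reversed row-major (inverted Z)
    n_inv = n_pos.flip(-1)                 # reversed column-major (inverted N)
    # move the scan dimension before the capsule dimension O: (B, U, V, O)
    return [s.permute(0, 1, 3, 2) for s in (z_pos, z_inv, n_pos, n_inv)]

seqs = four_direction_scan(torch.randn(2, 32, 17, 22, 22))
print([s.shape for s in seqs])  # four tensors of shape (2, 32, 484, 17)
```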

Step 3: Implicit latent state learning. During the selective SSM, the latest implicit latent state is determined by both the accumulated latent state and the current input token, which can be formulated as

$$\mathbf{h}_{g}(u,v)=\overline{\mathbf{A}}\,\mathbf{h}_{g}(u,v-1)+\overline{\mathbf{B}}\,\mathbf{S}_{g}(u,v), \qquad (7)$$

where $\mathbf{S}_{g}(u,v)$ represents the $v$-th token in the $u$-th capsule sequence obtained in scanning direction $g$, $\mathbf{h}_{g}(u,v)\in\mathbb{R}^{N}$ is the updated implicit latent state after the token $\mathbf{S}_{g}(u,v)$ is input, and $\overline{\mathbf{A}}$ and $\overline{\mathbf{B}}$ are the discretized evolution parameters of the model, which are computed from the input sequence.

The most recent output mamba token can be obtained by utilizing this latest latent state

$$\mathbf{F}_{g}^{M}(u,v)=\mathbf{C}\,\mathbf{h}_{g}(u,v), \qquad (8)$$

where $\mathbf{F}_{g}^{M}(u,v)$ denotes the output mamba token corresponding to $\mathbf{S}_{g}(u,v)$, and $\mathbf{C}$ is the projection parameter of the model, which is computed from the input sequence.

As each token $\mathbf{S}_{g}(u,v)$ in the sequence $\mathbf{S}_{g}(u)$ is fed into the selective SSM, the implicit latent state $\mathbf{h}_{g}(u,v)$ is updated constantly. The final implicit latent state $\mathbf{h}_{g}(u,V)$, which is defined as the mamba capsule vector, completes the global modeling of the sequence $\mathbf{S}_{g}(u)$. Algorithm 1 lists the learning process of the mamba capsule vectors in detail.

Algorithm 1 Mamba Capsule Vectors Learning. $\mathbf{S}$ is the input capsule sequence, $\mathbf{F}^{M}$ is the output mamba sequence, and $\mathbf{h}$ is the mamba capsule vectors, i.e., the implicit latent state from the last iteration.
Procedure:
  1. Initialize parameter A: (D, N)
  2. Learn parameters B, C, $\mathbf{\Delta}$:
       B: (B, L, N) ← $\mathrm{Linear}_{N}(\mathbf{S})$
       C: (B, L, N) ← $\mathrm{Linear}_{N}(\mathbf{S})$
       /* $\mathrm{Linear}_{N}$ is a linear projection to dimension $N$ */
       $\mathbf{\Delta}$: (B, L, D) ← $\mathrm{softplus}(\Omega + \mathrm{Broadcast}_{D}(\mathrm{Linear}_{1}(\mathbf{S})))$
       /* softplus ensures positive activation; $\Omega$ denotes the initialized parameter */
  3. Discretize parameters $\overline{\mathbf{A}}$, $\overline{\mathbf{B}}$:
       $\overline{\mathbf{A}}$, $\overline{\mathbf{B}}$: (B, L, D, N) ← discretize($\mathbf{\Delta}$, A, B)   /* Eq. (2) */
  4. Get the mamba capsule vectors $\mathbf{h}$ and the sequence $\mathbf{F}^{M}$:
       $\mathbf{h}$: (B, D, N) ← SSM($\overline{\mathbf{A}}$, $\overline{\mathbf{B}}$, C)($\mathbf{S}$)
       $\mathbf{F}^{M}$: (B, L, D) ← SSM($\overline{\mathbf{A}}$, $\overline{\mathbf{B}}$, C)($\mathbf{S}$)   /* Eqs. (7) and (8) */
  5. Return $\mathbf{h}$, $\mathbf{F}^{M}$.
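Below is a simplified, per-sequence sketch of the selective scan in Algorithm 1, using the common approximation $\overline{\mathbf{B}}\approx\mathbf{\Delta}\mathbf{B}$ for the ZOH of $\mathbf{B}$; the parameter shapes and initialization are assumptions, and the final latent state is returned in the sense of Step 3, i.e., as the raw material of the mamba capsule vector.

```python
import torch

def selective_scan(S, A, B, C, delta):
    """Simplified selective scan for one capsule sequence.
    S: (L, D) input capsule tokens; A: (D, N) state parameter (negative for stability);
    B, C: (L, N) input-dependent projections; delta: (L, D) time scales.
    Returns the output sequence F_M (L, D) and the final latent state h (D, N)."""
    L, D = S.shape
    N = A.shape[1]
    h = torch.zeros(D, N)
    outputs = []
    for t in range(L):
        A_bar = torch.exp(delta[t].unsqueeze(-1) * A)        # (D, N), ZOH on A
        B_bar = delta[t].unsqueeze(-1) * B[t].unsqueeze(0)    # (D, N), simplified ZOH on B
        h = A_bar * h + B_bar * S[t].unsqueeze(-1)            # Eq. (7)
        outputs.append((h * C[t].unsqueeze(0)).sum(-1))       # Eq. (8), output token (D,)
    return torch.stack(outputs), h

L, D, N = 484, 17, 16
F_M, h = selective_scan(torch.randn(L, D), -torch.rand(D, N),
                        torch.randn(L, N), torch.randn(L, N), torch.rand(L, D))
print(F_M.shape, h.shape)  # torch.Size([484, 17]) torch.Size([17, 16])
```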

Step 4: Mamba capsule acquisition. Ultimately, for a certain scanning direction $g$, we obtain the learned mamba sequences $\mathbf{F}_{g}^{M}\in\mathbb{R}^{V\times O\times U}$ and the implicit latent state $\mathbf{h}_{g}\in\mathbb{R}^{N\times U}$. Since the final latent state $\mathbf{h}_{g}(u,V)$ implicitly explores the global context of the corresponding sequence, we take it as the capsule pose vector $\mathbf{M}_{pose,g}\in\mathbb{R}^{N\times U}$, which holds a comprehensive understanding of the pixel-level capsules. Based on the pose vector $\mathbf{M}_{pose,g}$, we compute the activation values $\mathbf{M}_{act,g}\in\mathbb{R}^{O_{A}\times U}$ through

$$\mathbf{M}_{act,g}=\mathrm{Sigmoid}(\Phi(\mathbf{M}_{pose,g})). \qquad (9)$$

To this end, the type-level Mamba Capsules (MambaCaps) $\mathbf{M}_{g}\in\mathbb{R}^{1\times 1\times O\times U}$ are constructed as

$$\mathbf{M}_{g}=\mathrm{Unsqueeze}(\mathrm{Cat}(\mathbf{M}_{pose,g},\mathbf{M}_{act,g})), \qquad (10)$$

where $\mathrm{Unsqueeze}(\cdot)$ denotes the unsqueeze operation. In Eq. (10), the type-level mamba capsules are derived from the pixel-level capsules while preserving the global context, which significantly reduces the routing computation.
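A minimal sketch of Eqs. (9)-(10) is given below; the learnable $\Phi(\cdot)$ of Eq. (9) is replaced by a plain linear map for illustration, and the capsule counts are assumptions.

```python
import torch

def build_mamba_caps(pose):
    """Derive an activation from each pose vector and stack pose + activation
    into type-level mamba capsules. pose: (N, U) pose vectors."""
    N, U = pose.shape
    W = torch.randn(1, N) * 0.1              # stand-in for the learnable Phi(.) of Eq. (9)
    act = torch.sigmoid(W @ pose)            # (1, U), Eq. (9)
    caps = torch.cat([pose, act], dim=0)     # (N + 1, U), Eq. (10) concatenation
    return caps.unsqueeze(0).unsqueeze(0)    # (1, 1, O, U) after the unsqueeze of Eq. (10)

M = build_mamba_caps(torch.randn(16, 32))
print(M.shape)  # torch.Size([1, 1, 17, 32])
```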

Figure 4: Details of CSDR. The correlation between adjacent-layer mamba capsules is computed and integrated with the low-layer pixel-level capsules to recover the spatial details of the high-layer mamba capsules. The bottom part shows that the four scanning directions are transformed into a uniform direction to fuse the spatial details of the high-layer mamba capsules from different scanning directions via multi-head self-attention.

4.4 Capsules Spatial Details Retrieval

In this subsection, we will detail the CSDR module for further camouflaged prediction, which is composed of high-layer mamba capsule learning, adjacent-layer mamba capsule correlation and high-layer capsule spatial details retrieval.

Step 1: High-layer mamba capsules learning. As shown in Fig. 4, for a certain scanning direction $g$, the obtained mamba capsules are fed into EM routing Hinton et al (2018) with iterative refinement to mine the part-whole relationships at the type level, and the high-layer mamba capsules can be computed via

$$[\widetilde{\mathbf{M}}_{g}(1),\ldots,\widetilde{\mathbf{M}}_{g}(U)]=\mathrm{EM}[\mathbf{M}_{g}(1),\ldots,\mathbf{M}_{g}(U)], \qquad (11)$$

where $\mathrm{EM}(\cdot)$ represents the EM routing algorithm Hinton et al (2018) and $\widetilde{\mathbf{M}}_{g}=[\widetilde{\mathbf{M}}_{g}(1),\ldots,\widetilde{\mathbf{M}}_{g}(U)]\in\mathbb{R}^{1\times 1\times O\times U}$ denotes the $U$ explored high-layer mamba capsules under scanning direction $g$.

Step 2: Adjacent-layer mamba capsules correlation. To utilize the type-level mamba capsules for the dense prediction of camouflaged objects, it is necessary to retrieve the spatial details of the mamba capsules.

For a certain scanning direction $g$, the cosine similarity matrix $\mathbf{E}_{g}\in\mathbb{R}^{U\times U}$ depicts the degree of similarity between the adjacent-layer type-level mamba capsules $\mathbf{M}_{g}(m)$ and $\widetilde{\mathbf{M}}_{g}(n)$, where $m\in[1,\dots,U]$ is the row index of $\mathbf{E}_{g}$ and the index of $\mathbf{M}_{g}$, and $n\in[1,\dots,U]$ is the column index of $\mathbf{E}_{g}$ and the index of $\widetilde{\mathbf{M}}_{g}$. It can be computed by

$$\mathbf{E}_{g}(m,n)=\frac{\sum_{x=y=1}^{O}\mathbf{M}_{g}(m,x)\times\widetilde{\mathbf{M}}_{g}(n,y)}{\sqrt{\sum_{x=1}^{O}(\mathbf{M}_{g}(m,x))^{2}}\sqrt{\sum_{y=1}^{O}(\widetilde{\mathbf{M}}_{g}(n,y))^{2}}}, \qquad (12)$$

where $x\in[1,\dots,O]$ and $y\in[1,\dots,O]$ denote the element indices of $\mathbf{M}_{g}(m)$ and $\widetilde{\mathbf{M}}_{g}(n)$, respectively.

To further enhance the discriminability of the cosine similarity between adjacent-layer mamba capsules, the sigmoid function is applied for activation:

$$\widehat{\mathbf{E}}_{g}(m,n)=\mathrm{Sigmoid}(\mathbf{E}_{g}(m,n)). \qquad (13)$$

In the cosine similarity matrix $\widehat{\mathbf{E}}_{g}$, the element in row $m$ and column $n$ represents the degree of correlation between the $m$-th low-layer mamba capsule and the $n$-th high-layer mamba capsule, which also reflects the correlation between the corresponding pixel-level capsules in adjacent layers.
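The correlation of Eqs. (12)-(13) amounts to an all-pairs cosine similarity followed by a sigmoid, as in this short sketch (shapes are illustrative):

```python
import torch
import torch.nn.functional as F

def capsule_correlation(M_low, M_high):
    """Cosine similarity between every low-layer and every high-layer mamba capsule,
    followed by a sigmoid. M_low, M_high: (U, O), one row per capsule type."""
    E = F.cosine_similarity(M_low.unsqueeze(1), M_high.unsqueeze(0), dim=-1)  # (U, U), Eq. (12)
    return torch.sigmoid(E)                                                   # Eq. (13)

E_hat = capsule_correlation(torch.randn(32, 17), torch.randn(32, 17))
print(E_hat.shape, float(E_hat.min()), float(E_hat.max()))  # (32, 32), values in (0, 1)
```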

Step 3: High-layer capsules spatial details retrieval. For a certain scanning direction $g$, the activation values $\mathbf{S}_{act,g}\in\mathbb{R}^{V\times O_{A}\times U}$ of the capsule sequences are multiplied with the activated cosine similarity matrix to obtain the feature map $\mathbf{F}_{g}^{R}\in\mathbb{R}^{V\times O_{A}\times U}$, i.e.,

$$\mathbf{F}_{g}^{R}=\mathrm{Split}(\mathbf{S}_{g})\times\widehat{\mathbf{E}}_{g}, \qquad (14)$$

where $\mathrm{Split}(\cdot)$ denotes splitting the final dimension along the channel axis to obtain the activation values $\mathbf{S}_{act,g}$.

The obtained relational sequences $\mathbf{F}_{g}^{R}$ are integrated with the learned mamba sequences $\mathbf{F}_{g}^{M}$ to obtain the sequence $\mathbf{F}_{g}^{C}$. Finally, the four sequences $\mathbf{F}_{g}^{C}$ from the different scanning directions are integrated through flip and transpose operations to ensure consistency with the first scanning direction:

$$\mathbf{F}^{D}=\mathrm{MSA}(\mathrm{Cat}(\mathbf{F}_{1}^{C},\Psi(\mathbf{F}_{2}^{C}),\Gamma(\mathbf{F}_{3}^{C}),\Psi(\Gamma(\mathbf{F}_{4}^{C})))), \qquad (15)$$

where $\mathrm{MSA}(\cdot)$ denotes multi-head self-attention, and $\Psi(\cdot)$ and $\Gamma(\cdot)$ represent the transpose and flip operations, respectively.
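A toy sketch of the direction unification in Eq. (15) is shown below. The ordering of the four directions and the concrete undo operations are assumptions that mirror the earlier scan sketch; the released code may arrange them differently.

```python
import torch
import torch.nn as nn

def unify_directions(F_C, h, w):
    """Bring the four scan directions back to the first one and fuse them with
    multi-head self-attention. F_C: list of four tensors of shape (B, V, d),
    assumed ordered as (Z, inverted Z, N, inverted N)."""
    z, z_inv, n, n_inv = F_C
    B, V, d = z.shape
    undo_flip = lambda s: s.flip(1)                                              # reverse sequence order
    undo_transpose = lambda s: s.view(B, w, h, d).transpose(1, 2).reshape(B, V, d)  # column- to row-major
    aligned = torch.cat([z, undo_flip(z_inv),
                         undo_transpose(n), undo_transpose(undo_flip(n_inv))], dim=-1)
    msa = nn.MultiheadAttention(embed_dim=4 * d, num_heads=4, batch_first=True)
    F_D, _ = msa(aligned, aligned, aligned)       # multi-head self-attention fusion
    return F_D                                    # (B, V, 4d)

F_C = [torch.randn(2, 484, 64) for _ in range(4)]
print(unify_directions(F_C, 22, 22).shape)  # torch.Size([2, 484, 256])
```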

4.5 Transformer Decoder

In the decoder, following the idea of VST Liu et al (2021a), two designed task-related tokens (i.e., a camouflage token $\mathbf{t}^{c}\in\mathbb{R}^{1\times d}$ and a boundary token $\mathbf{t}^{b}\in\mathbb{R}^{1\times d}$) are appended to the obtained tokens $\mathbf{F}^{D}$ for camouflaged object segmentation and edge detection, respectively. All tokens are then processed by transformer layers to capture global dependencies. In each layer, a patch-task attention between $\mathbf{F}_{j}^{D}$ and $\mathbf{t}_{j}^{c}$ is designed for the camouflage prediction $\mathbf{F}_{j}^{c}$, where $j\in[0,1,2]$ indicates the index of blocks in the decoder. $\mathbf{F}_{j}^{D}$ is mapped to queries $\mathbf{Q}_{j}^{c}\in\mathbb{R}^{l_{j}\times d_{j}}$, and $\mathbf{t}_{j}^{c}$ is mapped to a key $\mathbf{k}_{j}^{c}\in\mathbb{R}^{1\times d_{j}}$ and a value $\mathbf{v}_{j}^{c}\in\mathbb{R}^{1\times d_{j}}$, where $l_{j}$ and $d_{j}$ denote the length of the sequence and the channel dimension of the token, respectively. The camouflage prediction $\mathbf{F}_{j}^{c}$ can be computed by

$$\mathbf{F}_{j}^{c}=\mathrm{Sigmoid}\left(\mathbf{Q}_{j}^{c}\times{\mathbf{k}_{j}^{c}}^{T}/\sqrt{d}\right)\times\mathbf{v}_{j}^{c}\oplus\mathbf{F}_{j}^{D}, \qquad (16)$$

where $(\cdot)^{T}$ and $\oplus$ represent the matrix transpose and element-wise addition, respectively. In a similar way, for boundary prediction, $\mathbf{F}_{j}^{D}$ is mapped to queries $\mathbf{Q}_{j}^{b}$ and $\mathbf{t}_{j}^{b}$ is mapped to a key $\mathbf{k}_{j}^{b}$ and a value $\mathbf{v}_{j}^{b}$ to obtain

$$\mathbf{F}_{j}^{b}=\mathrm{Sigmoid}\left(\mathbf{Q}_{j}^{b}\times{\mathbf{k}_{j}^{b}}^{T}/\sqrt{d}\right)\times\mathbf{v}_{j}^{b}\oplus\mathbf{F}_{j}^{D}. \qquad (17)$$

Afterwards, two linear transformations with sigmoid activation are applied to project $\mathbf{F}_{j}^{c}$ and $\mathbf{F}_{j}^{b}$ to scalars in $[0,1]$, yielding the final 2D camouflage map $\widetilde{\mathbf{F}}_{j}^{c}$ and boundary map $\widetilde{\mathbf{F}}_{j}^{b}$ at the corresponding scale.
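The patch-task attention of Eqs. (16)-(17) can be sketched as follows; the projection layers are illustrative assumptions, and the same module would be instantiated separately for the camouflage and boundary tokens.

```python
import torch
import torch.nn as nn

class PatchTaskAttention(nn.Module):
    """Patch tokens attend to a single task token; the weighted value is added back."""
    def __init__(self, d):
        super().__init__()
        self.d = d
        self.to_q = nn.Linear(d, d)
        self.to_k = nn.Linear(d, d)
        self.to_v = nn.Linear(d, d)

    def forward(self, F_D, t_task):                                  # F_D: (B, l, d), t_task: (B, 1, d)
        Q = self.to_q(F_D)                                           # (B, l, d)
        k = self.to_k(t_task)                                        # (B, 1, d)
        v = self.to_v(t_task)                                        # (B, 1, d)
        attn = torch.sigmoid(Q @ k.transpose(1, 2) / self.d ** 0.5)  # (B, l, 1), as in Eq. (16)
        return attn * v + F_D                                        # weighted value plus residual

pta = PatchTaskAttention(d=64)
print(pta(torch.randn(2, 484, 64), torch.randn(2, 1, 64)).shape)  # torch.Size([2, 484, 64])
```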

4.6 Loss Function

In this work, both the weighted binary cross-entropy (BCE) loss ($l_{wbce}$) and the Intersection over Union (IoU) loss ($l_{iou}$) Yu et al (2016) are adopted to train the network. Suppose $\widetilde{\mathbf{F}}^{c}$, $\widetilde{\mathbf{F}}^{b}$, $\mathbf{G}^{c}$ and $\mathbf{G}^{b}$ are the predicted camouflage map, the predicted boundary map, and the corresponding camouflage and boundary ground truths, respectively. $l_{wbce}$ is formulated as follows:

$$l_{wbce}=\sum_{j}\left[l_{bce}\left(\widetilde{\mathbf{F}}_{j}^{c},\mathbf{G}_{j}^{c}\right)+l_{bce}\left(\widetilde{\mathbf{F}}_{j}^{b},\mathbf{G}_{j}^{b}\right)\right]\times w_{j}, \qquad (18)$$

where $j$ is the index of blocks in the decoder and $w_{j}$ is a set of hyperparameters, with $[w_{0},w_{1},w_{2},w_{3}]$ set to $[1,\ 0.8,\ 0.5,\ 0.5]$. In Eq. (18),

$$l_{bce}=-\frac{1}{n}\sum_{m}\left[\mathbf{G}_{m}\log(\widetilde{\mathbf{F}}_{m})+(1-\mathbf{G}_{m})\log(1-\widetilde{\mathbf{F}}_{m})\right], \qquad (19)$$

where $m$ represents the pixel index and $n$ is the total number of pixels.

$l_{iou}$ is defined at the input scale:

$$l_{iou}=1-\frac{\sum_{m}\widetilde{\mathbf{F}}^{c}(m)\,\mathbf{G}^{c}(m)}{\sum_{m}\left[\widetilde{\mathbf{F}}^{c}(m)+\mathbf{G}^{c}(m)-\widetilde{\mathbf{F}}^{c}(m)\,\mathbf{G}^{c}(m)\right]}. \qquad (20)$$
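A minimal sketch of the training objective in Eqs. (18)-(20) is given below; the number of decoder scales and their resolutions are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def iou_loss(pred, gt):
    """Eq. (20): soft IoU loss on a camouflage prediction with values in [0, 1]."""
    inter = (pred * gt).sum()
    union = (pred + gt - pred * gt).sum()
    return 1.0 - inter / (union + 1e-8)

def total_loss(preds_c, preds_b, gts_c, gts_b, weights=(1.0, 0.8, 0.5, 0.5)):
    """Eqs. (18)-(19): per-scale BCE on camouflage and boundary maps, weighted by w_j,
    plus the IoU loss on the full-resolution camouflage map (assumed to be the first entry)."""
    l_wbce = sum(w * (F.binary_cross_entropy(pc, gc) + F.binary_cross_entropy(pb, gb))
                 for pc, pb, gc, gb, w in zip(preds_c, preds_b, gts_c, gts_b, weights))
    return l_wbce + iou_loss(preds_c[0], gts_c[0])

scales = (352, 176, 88, 44)
preds_c = [torch.rand(1, 1, s, s) for s in scales]
preds_b = [torch.rand(1, 1, s, s) for s in scales]
gts_c = [(torch.rand(1, 1, s, s) > 0.5).float() for s in scales]
gts_b = [(torch.rand(1, 1, s, s) > 0.5).float() for s in scales]
print(float(total_loss(preds_c, preds_b, gts_c, gts_b)))
```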

5 Experiment and Analysis

In this section, we will carry out abundant experiments and analysis to provide a comprehensive understanding of the proposed method.

5.1 Experimental Settings

Dataset. We evaluate the proposed method on three widely used public benchmarks.

CAMO Le et al (2019) is the first COD dataset, containing 1,250 camouflaged images with 1,000 training images and 250 testing images.

COD10K Fan et al (2020) is a large-scale COD dataset, consisting of 3,040 training images and 2,026 testing images.

NC4K Lv et al (2021) is a recently released large-scale COD dataset containing 4,121 images.

To ensure consistency with previous studies Fan et al (2022), the 3,040 training samples from COD10K and the 1,000 training samples from CAMO are used as the training set, while the test set consists of the 2,026 test images in COD10K, the 250 test images in CAMO and the entire NC4K dataset.

Table 1: Quantitative comparison with 25 SOTA methods on three benchmark datasets. Note that ↑ / ↓ denote that larger/smaller is better, respectively. “——” means not available. The best and second best results are bolded and underlined, respectively.

Method | CAMO (250 images): MAE↓ Fm↑ Em↑ Sm↑ | COD10K (2026 images): MAE↓ Fm↑ Em↑ Sm↑ | NC4K (4121 images): MAE↓ Fm↑ Em↑ Sm↑

CapsNet based method
POCINet Liu et al (2021b) | 0.110 0.662 0.777 0.7017 | 0.051 0.0614 0.825 0.751 | —— —— —— ——

CNNs based methods
SINet Fan et al (2020) | 0.091 0.708 0.829 0.746 | 0.042 0.691 0.874 0.777 | 0.058 0.775 0.883 0.810
MGL Zhai et al (2021) | 0.089 0.725 0.811 0.772 | 0.035 0.709 0.852 0.815 | 0.053 0.782 0.868 0.832
PFNet Mei et al (2021) | 0.085 0.758 0.855 0.782 | 0.039 0.725 0.891 0.800 | 0.053 0.799 0.899 0.829
LSR Lv et al (2021) | 0.080 0.753 0.854 0.787 | 0.037 0.732 0.892 0.805 | 0.048 0.815 0.907 0.839
C2FNet Sun et al (2021) | 0.079 0.770 0.864 0.796 | 0.036 0.743 0.900 0.813 | 0.049 0.810 0.904 0.840
UJSC Li et al (2021) | 0.073 0.779 0.873 0.800 | 0.035 0.738 0.891 0.809 | 0.046 0.816 0.906 0.841
SLTNet Cheng et al (2022) | 0.082 0.763 0.848 0.792 | 0.036 0.681 0.875 0.804 | 0.049 0.787 0.886 0.830
OCENet Liu et al (2022a) | 0.080 0.777 0.865 0.802 | 0.033 0.764 0.906 0.827 | 0.045 0.832 0.913 0.853
BSANet Zhu et al (2022) | 0.079 0.763 0.851 0.794 | 0.034 0.738 0.890 0.818 | 0.048 0.817 0.897 0.841
FAPNet Zhou et al (2022) | 0.076 0.792 0.880 0.815 | 0.036 0.758 0.902 0.822 | 0.047 0.826 0.911 0.851
BGNet Sun et al (2022) | 0.073 0.799 0.882 0.812 | 0.033 0.774 0.916 0.831 | 0.044 0.833 0.916 0.851
SegMaR Jia et al (2022) | 0.071 0.803 0.884 0.816 | 0.034 0.775 0.907 0.833 | 0.046 0.827 0.907 0.841
SINet-v2 Fan et al (2022) | 0.070 0.801 0.895 0.820 | 0.037 0.752 0.906 0.815 | 0.047 0.823 0.914 0.847
FDCOD Zhong et al (2022) | 0.062 0.809 0.898 0.844 | 0.030 0.749 0.918 0.837 | 0.052 0.784 0.894 0.834
ZoomNet Pang et al (2022) | 0.066 0.805 0.892 0.820 | 0.029 0.780 0.911 0.839 | 0.043 0.828 0.912 0.853
R-MGL-v2 Zhai et al (2023) | 0.086 0.731 0.847 0.769 | 0.034 0.733 0.879 0.816 | 0.050 0.801 0.899 0.838
PopNet Wu et al (2023) | 0.077 0.784 0.859 0.808 | 0.028 0.786 0.910 0.851 | 0.042 0.833 0.909 0.861
FEDER He et al (2023) | 0.071 0.789 0.873 0.802 | 0.032 0.768 0.905 0.822 | 0.044 0.833 0.915 0.847
DGNet Ji et al (2023) | 0.057 0.822 0.915 0.839 | 0.033 0.759 0.911 0.823 | 0.042 0.833 0.922 0.857
DINet Zhou et al (2024) | 0.068 0.807 0.886 0.821 | 0.031 0.780 0.914 0.832 | 0.043 0.839 0.919 0.856

Transformer based methods
UGTR Yang et al (2021) | 0.086 0.754 0.855 0.785 | 0.035 0.742 0.891 0.818 | 0.051 0.807 0.899 0.839
OSFormer Pei et al (2022) | 0.073 0.767 0.858 0.799 | 0.034 0.701 0.881 0.811 | 0.049 0.790 0.891 0.832
FDCOD Zhong et al (2022) | 0.062 0.809 0.898 0.844 | 0.030 0.749 0.918 0.837 | 0.052 0.784 0.894 0.834
VSCode Luo et al (2024) | 0.060 0.818 0.908 0.836 | 0.029 0.795 0.925 0.847 | 0.038 0.853 0.930 0.874
MCRNet (Ours) | 0.054 0.847 0.915 0.854 | 0.026 0.807 0.924 0.854 | 0.036 0.857 0.930 0.875

Evaluation metrics. Four commonly used metrics are employed for the COD task to assess model performance, including the Mean Absolute Error ($MAE$) Achanta et al (2009), the maximum F-measure ($F_{m}$) Margolin et al (2014), the maximum enhanced-alignment measure ($E_{m}$) Fan et al (2018) and the structure-measure ($S_{m}$) Fan et al (2017). Given a continuous camouflage map $F$, a binary mask $\hat{F}$ is obtained by thresholding $F$. Precision is defined as $Precision=|\hat{F}\cap G|/|\hat{F}|$, and recall is defined as $Recall=|\hat{F}\cap G|/|G|$.

$MAE$ is defined as

$$MAE=\frac{1}{\hat{W}\times\hat{H}}\sum_{i}\left|F(i)-G(i)\right|, \qquad (21)$$

where $\hat{W}$ and $\hat{H}$ are the width and height of the image, respectively.

The maximum F-measure ($F_{m}$) is the maximum value of the F-measure ($F_{\beta}$) over different thresholds. The F-measure ($F_{\beta}$) is an overall performance indicator, computed by

$$F_{\beta}=\frac{(1+\beta^{2})\,Precision\times Recall}{\beta^{2}\,Precision+Recall}. \qquad (22)$$

As suggested in Margolin et al (2014), $\beta^{2}=0.3$.

The maximum enhanced-alignment measure ($E_{m}$) is the maximum value of the E-measure over different thresholds, which combines local pixel values with the image-level mean value to jointly evaluate the similarity between the prediction and the ground truth.

The structure-measure ($S_{m}$) is computed by

$$S_{m}=\alpha S_{o}+(1-\alpha)S_{r}, \qquad (23)$$

where $S_{o}$ and $S_{r}$ represent the object-aware and region-aware structural similarities between the prediction and the ground truth, respectively, and $\alpha$ is set to 0.5 Fan et al (2017).
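For reference, the sketch below computes $MAE$ and the maximum F-measure on a single prediction; $E_m$ and $S_m$ are omitted for brevity, and the thresholding scheme is a simple illustrative choice.

```python
import numpy as np

def mae(pred, gt):
    """Eq. (21): mean absolute error between the continuous prediction and the binary GT."""
    return np.abs(pred - gt).mean()

def max_f_measure(pred, gt, beta2=0.3, steps=255):
    """Eq. (22): maximum F-measure over a sweep of thresholds in (0, 1)."""
    best = 0.0
    for t in np.linspace(0, 1, steps, endpoint=False)[1:]:
        binary = pred >= t
        tp = np.logical_and(binary, gt > 0.5).sum()
        precision = tp / (binary.sum() + 1e-8)
        recall = tp / ((gt > 0.5).sum() + 1e-8)
        f = (1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-8)
        best = max(best, f)
    return best

pred = np.random.rand(352, 352)
gt = (np.random.rand(352, 352) > 0.5).astype(np.float64)
print(mae(pred, gt), max_f_measure(pred, gt))
```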

Implementation details. The proposed MCRNet is implemented in PyTorch. The tiny Swin Transformer (Swin-T) Liu et al (2021c) is adopted as the network encoder, and the other modules are randomly initialized. Each image is resized to 384×384 pixels and then randomly cropped to 352×352 for training. The network employs the Adam optimizer Kingma and Ba (2015) with an initial learning rate of 0.0001, which is reduced by a factor of 10 at half and three-quarters of the total training steps. The complete training process consists of 150,000 training steps with a batch size of 8 on a 4090 GPU.

Figure 5: Visual comparisons of the proposed MCRNet and other popular SOTA methods. The proposed MCRNet segments the camouflaged objects well in challenging scenes, including small objects, large objects, the objects with uncertain boundaries, the objects that are obscured, and the concealed persons.
Table 2: Ablation study. “B” denotes the baseline with the Swin Transformer-T backbone. “MCG” indicates the incorporation of mamba capsule routing into the baseline. “CSDR” represents the capsules spatial details retrieval.

Candidate | CAMO (250 images): MAE↓ Fm↑ Em↑ Sm↑ | COD10K (2026 images): MAE↓ Fm↑ Em↑ Sm↑ | NC4K (4121 images): MAE↓ Fm↑ Em↑ Sm↑
(a) B | 0.0615 0.8250 0.9087 0.8398 | 0.0292 0.7921 0.9181 0.8449 | 0.0404 0.8435 0.9242 0.8663
(b) B + MCG | 0.0581 0.8338 0.9107 0.8412 | 0.0269 0.8053 0.9221 0.8507 | 0.0382 0.8540 0.9287 0.8710
(c) B + MCG + CSDR | 0.0547 0.8466 0.9151 0.8536 | 0.0264 0.8069 0.9236 0.8544 | 0.0366 0.8570 0.9301 0.8751

5.2 Comparison with the State-of-the-arts

In order to demonstrate the efficacy of the proposed method, a comparative analysis is conducted with 25 recent state-of-the-art methods, including the capsule network based method (i.e., POCINet Liu et al (2021b)), the CNNs based methods (SINet Fan et al (2020), MGL Zhai et al (2021), PFNet Mei et al (2021), LSR Lv et al (2021), C2FNet Sun et al (2021), UJSC Li et al (2021), R-MGL-v2 Zhai et al (2023), SLTNet Cheng et al (2022), OCENet Liu et al (2022a), BSANet Zhu et al (2022), FAPNet Zhou et al (2022), BGNet Sun et al (2022), SegMaR Jia et al (2022), SINet-v2 Fan et al (2022), FDCOD Zhong et al (2022), ZoomNet Pang et al (2022), PopNet Wu et al (2023), FEDER He et al (2023), DGNet Ji et al (2023), DINet Zhou et al (2024)) and the transformer based methods (UGTR Yang et al (2021), OSFormer Pei et al (2022), FDCOD Zhong et al (2022), VSCode Luo et al (2024)). For a fair comparison, all predictions of these methods are either provided by the authors or generated by retraining the models with their open-source code.

Quantitative analysis. Table 1 summarizes the quantitative comparison of the proposed approach against the 25 competitors on the three COD datasets in terms of the four evaluation metrics. From the results, it can be seen that the proposed MCRNet comprehensively surpasses all existing state-of-the-art methods. Compared to the second-best method, the transformer based VSCode Luo et al (2024), our method achieves average performance gains of 8.5%, 1.7%, 0.2% and 1.0% in terms of $MAE$, $F_{m}$, $E_{m}$ and $S_{m}$, respectively, averaged over the three datasets. Compared with the CNN based FEDER He et al (2023), which also performs object segmentation and edge detection, it shows significant average improvements of 21.1%, 5.1%, 2.9% and 4.5% on the four metrics, respectively. Compared to the multi-scale method ZoomNet Pang et al (2022), the average gains are 15.9%, 4.1%, 2.0% and 2.8%, respectively. Besides, compared with the CapsNets based method POCINet Liu et al (2021b), we achieve a significant improvement in segmentation accuracy, which benefits from the mamba capsule routing in our MCRNet.

Table 3: Ablation study on different type-level capsule generations. “FC” indicates the fully-connected operation, and “Mamba” represents our method.

Generation | CAMO (250 images): MAE↓ Fm↑ Em↑ Sm↑ | COD10K (2026 images): MAE↓ Fm↑ Em↑ Sm↑ | NC4K (4121 images): MAE↓ Fm↑ Em↑ Sm↑
FC | 0.0581 0.8338 0.9107 0.8412 | 0.0269 0.8050 0.9202 0.8500 | 0.0382 0.8540 0.9216 0.8705
Mamba | 0.0547 0.8466 0.9151 0.8536 | 0.0264 0.8069 0.9236 0.8544 | 0.0366 0.8570 0.9301 0.8751

Table 4: Ablation study on the number of mamba capsules. All models are trained based on Table 2 (c).

Number | CAMO (250 images): MAE↓ Fm↑ Em↑ Sm↑ | COD10K (2026 images): MAE↓ Fm↑ Em↑ Sm↑ | NC4K (4121 images): MAE↓ Fm↑ Em↑ Sm↑
0 | 0.0615 0.8250 0.9087 0.8398 | 0.0292 0.7921 0.9181 0.8449 | 0.0404 0.8435 0.9242 0.8663
32 | 0.0547 0.8466 0.9151 0.8536 | 0.0264 0.8069 0.9236 0.8544 | 0.0366 0.8570 0.9301 0.8751
64 | 0.0548 0.8421 0.9127 0.8489 | 0.0266 0.8036 0.9217 0.8522 | 0.0365 0.8592 0.9315 0.8760
96 | 0.0545 0.8455 0.9136 0.8500 | 0.0268 0.8045 0.9229 0.8529 | 0.0371 0.8570 0.9309 0.8746
128 | 0.0575 0.8372 0.9105 0.8470 | 0.0272 0.8035 0.9224 0.8511 | 0.0373 0.8557 0.9292 0.8731

Qualitative analysis. Fig. 5 presents the segmentation visualizations of MCRNet alongside ten popular SOTA methods. From the three test sets, camouflaged objects of diverse sizes and camouflage scenes of various types are selected for visualization, encompassing small objects, large objects, objects with uncertain boundaries, objects that are obscured, and concealed persons. As can be seen from the first and second rows, small camouflaged objects are detected extremely well without being missed, particularly in the second row, where the objects bear a high resemblance to the background. For large camouflaged objects, they are identified with good completeness. In the case of fuzzy boundaries, the camouflaged object is detected entirely from the low-contrast background. It is worth mentioning that objects obscured by the jungle are distinguished meticulously. Similarly, favorable detection is also attained in the scenes containing persons. The aforementioned outstanding segmentation of camouflaged objects is attributed to the exploration of the part-whole relationships by the proposed MCRNet.

5.3 Ablation Analysis

Table 5: Ablation study on different scanning directions for capsules sequence.

Capsule Sequence Order | CAMO (250 images): MAE↓ Fm↑ Em↑ Sm↑ | COD10K (2026 images): MAE↓ Fm↑ Em↑ Sm↑ | NC4K (4121 images): MAE↓ Fm↑ Em↑ Sm↑
One Direction | 0.0572 0.8396 0.9111 0.8463 | 0.0268 0.8040 0.9236 0.8527 | 0.0370 0.8567 0.9301 0.8748
Two Directions | 0.0561 0.8389 0.9097 0.8470 | 0.0262 0.8063 0.9239 0.8542 | 0.0373 0.8555 0.9301 0.8741
Four Directions (Ours) | 0.0547 0.8466 0.9151 0.8536 | 0.0264 0.8069 0.9236 0.8544 | 0.0366 0.8570 0.9301 0.8751

Table 6: FLOPs, Parameters and Time of different capsule routing algorithms for part-whole relational COD.

Network | FLOPs (G)↓ | Params (M)↓ | Time (s)↓ | CAMO (250 images): MAE↓ Fm↑ Em↑ Sm↑ | COD10K (2026 images): MAE↓ Fm↑ Em↑ Sm↑ | NC4K (4121 images): MAE↓ Fm↑ Em↑ Sm↑
EM Routing Hinton et al (2018) | 155.16 | 77.96 | 0.039 | 0.0623 0.8269 0.9045 0.8378 | 0.0292 0.7931 0.9168 0.8453 | 0.0395 0.8481 0.9250 0.8686
DCR Liu et al (2022b) | 150.37 | 73.62 | 0.036 | 0.0599 0.8310 0.9101 0.8425 | 0.0279 0.7967 0.9210 0.8477 | 0.0384 0.8511 0.9284 0.8699
MCR (Ours) | 145.74 | 69.11 | 0.028 | 0.0547 0.8466 0.9151 0.8536 | 0.0264 0.8069 0.9236 0.8544 | 0.0366 0.8570 0.9301 0.8751

Effectiveness of MCG and CSDR. The proposed MCG and CSDR modules play a significant role in enabling MCRNet to perform part-whole relational camouflaged object detection. To examine the contributions of these two components, we design ablation studies by removing them from the entire framework. Table 2 and Fig. 6 report the performance and visualizations of the ablation study. Comparing the fourth and fifth rows in Fig. 6, it can be observed that MCG enables the model to better separate the camouflaged object from its surroundings, which is also confirmed by the performance of Table 2 (a) and (b). This is attributed to the mamba capsules, whose latent state mechanism in the selective SSM Gu and Dao (2024) models global spatial structure information. Comparing the third and fourth rows in Fig. 6, it can be seen that CSDR effectively detects regions that the other variants miss and enhances the integrity of the camouflaged object. A similar conclusion can be drawn by comparing Table 2 (b) and (c). This proves that the spatial details retrieved by CSDR help the segmentation of camouflaged objects.

Figure 6: Visual comparisons for the ablation of MCG and CSDR. “B”, “M” and “C” stand for Baseline, MCG and CSDR, respectively. “+M+C” marked in orange represents the entire MCRNet.

Generation of type-level capsules. To verify the validity of the MCG module for type-level mamba capsule generation, we compare it with a straightforward alternative that uses a linear mapping to generate the type-level capsules. As shown in Table 3, the capsules generated with mamba outperform those generated with the fully-connected layer. This superiority can be attributed to the fact that the latent state of the selective SSM Gu and Dao (2024) in VMamba Liu et al (2024c) accumulates selectively chosen token information for a comprehensive global context. By contrast, the fully-connected layer simply computes a weighted sum of all tokens in the sequence without preserving the spatial structure context well.

Number of mamba capsules. Table 4 shows the impact of the number of mamba capsules on the model's detection capability. As can be seen in Table 4, too few capsules undermine the characterization ability for camouflaged objects, while too many capsules result in overfitting and a decline in detection performance. Based on this analysis, the best balance between detection performance and generalization ability is achieved by setting the number of mamba capsules to 32, which is the setting adopted in this paper for efficient experiments.

Scanning directions for the capsule sequences. Table 5 explores various serialization directions in MCRNet, including one direction, two directions and four directions, as illustrated in Fig. 3. Specifically, the one-direction and two-direction settings use the 'Z' scan and the 'Z' & 'N' scans, respectively. In Table 5, the combination of four scans achieves the best performance, which indicates the benefit of multiple scanning directions for global context extraction.

Efficiency analysis. To explore the efficiency of the proposed MCRNet for part-whole relational COD based on CapsNets, we replace the proposed mamba capsule routing with previous capsule routing algorithms, including EM routing Hinton et al (2018) and Disentangled Capsule Routing (DCR) Liu et al (2022b), within the entire MCRNet framework. In Table 6, the proposed mamba capsule routing achieves the lowest FLOPs, the fewest parameters and the fastest inference, while performing best on all datasets, which demonstrates its efficiency and performance superiority.

5.4 Failure Cases

Figure 7: Failure cases. From top to bottom: Images, GT, and results of our method.

Fig. 7 displays some failure cases of our camouflaged detector in extremely complex scenes. For instance, as depicted in the left two columns of Fig. 7, the detection of camouflaged targets is influenced by salient objects, leading to unnecessary object parts being detected and a shift in detection focus. Besides, in the scenes shown in the right two columns of Fig. 7, the performance degrades when detecting small objects away from the image center owing to the increased observation distance, making it challenging to discern camouflage on small objects. In future work, building on part-whole relational methods, we will leverage the power of Large Language Models (LLMs) Ouyang et al (2022) to aid the understanding of camouflaged scenes for better concealed object search and identification.

6 Conclusions

In this paper, we have designed the Mamba Capsule Routing Network (MCRNet) for the part-whole relational COD task. To make capsule routing lightweight for part-whole relationship exploration, the MCG module was designed to generate type-level mamba capsules from the pixel-level capsules, which ensures lightweight capsule routing at the type level. On top of that, the CSDR module was designed to retrieve the spatial details of the mamba capsules for the final camouflaged object detection. Extensive experiments have demonstrated that our proposed modules significantly enhance the detection performance on camouflaged objects. In future work, we will apply LLMs Ouyang et al (2022) to improve camouflaged object detection from a multi-modal understanding perspective.

Data Availability Statement

The datasets used and analyzed during the current study (CAMO, COD10K and NC4K) are publicly available benchmark datasets.

The models and source data generated and analyzed during the current study are available from the corresponding author upon reasonable request.

References

  • Achanta et al (2009) Achanta R, Hemami S, Estrada F, Susstrunk S (2009) Frequency-tuned salient region detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1597–1604
  • Andreopoulos and Tsotsos (2013) Andreopoulos A, Tsotsos JK (2013) A computational learning theory of active object recognition under uncertainty. International Journal of Computer Vision 101(1):95–142
  • Bideau et al (2024) Bideau P, Learned-Miller E, Schmid C, Alahari K (2024) The right spin: Learning object motion from rotation-compensated flow fields. International Journal of Computer Vision 132(1):40–55
  • Cai et al (2023) Cai L, McGuire NE, Hanlon R, Mooney TA, Girdhar Y (2023) Semi-supervised visual tracking of marine animals using autonomous underwater vehicles. International Journal of Computer Vision 131(6):1406–1427
  • Chen et al (2022) Chen G, Liu SJ, Sun YJ, Ji GP, Wu YF, Zhou T (2022) Camouflaged object detection via context-aware cross-level fusion. IEEE Transactions on Circuits and Systems for Video Technology 32(10):6981–6993
  • Chen and Schmitt (1993) Chen X, Schmitt F (1993) Vision-based construction of cad models from range images. In: Proceedings of the IEEE International Conference on Computer Vision, pp 129–136
  • Cheng et al (2022) Cheng X, Xiong H, Fan DP, Zhong Y, Harandi M, Drummond T, Ge Z (2022) Implicit motion handling for video camouflaged object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 13864–13873
  • Fan et al (2017) Fan DP, Cheng MM, Liu Y, Li T, Borji A (2017) Structure-measure: A new way to evaluate foreground maps. In: Proceedings of the IEEE International Conference on Computer Vision, pp 4558–4567
  • Fan et al (2018) Fan DP, Gong C, Cao Y, Ren B, Cheng MM, Borji A (2018) Enhanced-alignment measure for binary foreground map evaluation. In: Proceedings of the International Joint Conference on Artificial Intelligence, pp 698–704
  • Fan et al (2020) Fan DP, Ji GP, Sun G, Cheng MM, Shen J, Shao L (2020) Camouflaged object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2774–2784
  • Fan et al (2022) Fan DP, Ji GP, Cheng MM, Shao L (2022) Concealed object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 44(10):6024–6042
  • Fazekas et al (2009) Fazekas S, Amiaz T, Chetverikov D, Kiryati N (2009) Dynamic texture detection based on motion analysis. International Journal of Computer Vision 82(1):48–63
  • Geng et al (2024) Geng X, Wang J, Gong J, Xue Y, Xu J, Chen F, Huang X (2024) Orthcaps: An orthogonal capsnet with sparse attention routing and pruning. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 6037–6046
  • Girshick et al (2015) Girshick R, Iandola F, Darrell T, Malik J (2015) Deformable part models are convolutional neural networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 437–446
  • Gu and Dao (2024) Gu A, Dao T (2024) Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752
  • Gu et al (2022) Gu A, Goel K, Re C (2022) Efficiently modeling long sequences with structured state spaces. In: International Conference on Learning Representations
  • Gu et al (2024) Gu A, Johnson I, Goel K, Saab K, Dao T, Rudra A, Ré C (2024) Combining recurrent, convolutional, and continuous-time models with linear state-space layers. In: Advances in Neural Information Processing Systems
  • He et al (2023) He C, Li K, Zhang Y, Tang L, Zhang Y, Guo Z, Li X (2023) Camouflaged object detection with feature decomposition and edge reconstruction. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 22046–22055
  • Hinton et al (2011) Hinton GE, Krizhevsky A, Wang SD (2011) Transforming auto-encoders. In: Proceedings of the International Conference on Artificial Neural Networks, pp 44–51
  • Hinton et al (2018) Hinton GE, Sabour S, Frosst N (2018) Matrix capsules with em routing. In: Proceedings of the International Conference on Learning Representations, pp 3856–3866
  • Huang et al (2024) Huang P, Zhang D, Cheng D, Han L, Zhu P, Han J (2024) M-rrfs: A memory-based robust region feature synthesizer for zero-shot object detection. International Journal of Computer Vision pp 1–22
  • Huerta et al (2007) Huerta I, Rowe D, Mozerov M, Gonzàlez J (2007) Improving background subtraction based on a casuistry of colour-motion segmentation problems. In: Pattern Recognition and Image Analysis, pp 475–482
  • Ji et al (2023) Ji GP, Fan DP, Chou YC, Dai D, Liniger A, Van Gool L (2023) Deep gradient learning for efficient camouflaged object detection. Machine Intelligence Research 20:92–108
  • Jia et al (2022) Jia Q, Yao S, Liu Y, Fan X, Liu R, Luo Z (2022) Segment, magnify and reiterate: Detecting camouflaged objects the hard way. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4703–4712
  • Kalman (1960) Kalman RE (1960) A New Approach to Linear Filtering and Prediction Problems. Journal of Basic Engineering 82(1):35–45
  • Kingma and Ba (2015) Kingma DP, Ba J (2015) Adam: A method for stochastic optimization. In: International Conference on Learning Representations
  • Krivic and Solina (2004) Krivic J, Solina F (2004) Part-level object recognition using superquadrics. Computer Vision and Image Understanding 95(1):105–126
  • Le et al (2019) Le TN, Nguyen TV, Nie Z, Tran MT, Sugimoto A (2019) Anabranch network for camouflaged object segmentation. Computer Vision and Image Understanding 184:45–56
  • Li et al (2021) Li A, Zhang J, Lv Y, Liu B, Zhang T, Dai Y (2021) Uncertainty-aware joint salient object and camouflaged object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 10066–10076
  • Liang et al (2024) Liang D, Zhou X, Xu W, Zhu X, Zou Z, Ye X, Tan X, Bai X (2024) Pointmamba: A simple state space model for point cloud analysis. arXiv preprint arXiv:2402.10739
  • Liu et al (2022a) Liu J, Zhang J, Barnes N (2022a) Modeling aleatoric uncertainty for camouflaged object detection. In: IEEE Winter Conference on Applications of Computer Vision, pp 2613–2622
  • Liu et al (2024a) Liu J, Lin R, Wu G, Liu R, Luo Z, Fan X (2024a) Coconet: Coupled contrastive learning network with multi-level feature ensemble for multi-modality image fusion. International Journal of Computer Vision 132(5):1748–1775
  • Liu et al (2021a) Liu N, Zhang N, Wan K, Shao L, Han J (2021a) Visual saliency transformer. In: Proceedings of the IEEE International Conference on Computer Vision, pp 4702–4712
  • Liu et al (2019) Liu Y, Zhang Q, Zhang D, Han J (2019) Employing deep part-object relationships for salient object detection. In: Proceedings of the IEEE International Conference on Computer Vision, pp 1232–1241
  • Liu et al (2021b) Liu Y, Zhang D, Zhang Q, Han J (2021b) Integrating part-object relationship and contrast for camouflaged object detection. IEEE Transactions on Information Forensics and Security 16:5154–5166
  • Liu et al (2022b) Liu Y, Zhang D, Liu N, Xu S, Han J (2022b) Disentangled capsule routing for fast part-object relational saliency. IEEE Transactions on Image Processing 31:6719–6732
  • Liu et al (2024b) Liu Y, Cheng D, Zhang D, Xu S, Han J (2024b) Capsule networks with residual pose routing. IEEE Transactions on Neural Networks and Learning Systems pp 1–14
  • Liu et al (2024c) Liu Y, Tian Y, Zhao Y, Yu H, Xie L, Wang Y, Ye Q, Liu Y (2024c) Vmamba: Visual state space model. arXiv preprint arXiv:2401.10166
  • Liu et al (2021c) Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, Lin S, Guo B (2021c) Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE International Conference on Computer Vision, pp 9992–10002
  • Luo et al (2023) Luo N, Pan Y, Sun R, Zhang T, Xiong Z, Wu F (2023) Camouflaged instance segmentation via explicit de-camouflaging. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 17918–17927
  • Luo et al (2024) Luo Z, Liu N, Zhao W, Yang X, Zhang D, Fan DP, Khan F, Han J (2024) Vscode: General visual salient and camouflaged object detection with 2d prompt learning. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 17169–17180
  • Lv et al (2021) Lv Y, Zhang J, Dai Y, Li A, Liu B, Barnes N, Fan DP (2021) Simultaneously localize, segment and rank the camouflaged objects. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 11586–11596
  • Ma et al (2024) Ma J, Li F, Wang B (2024) U-mamba: Enhancing long-range dependency for biomedical image segmentation. arXiv preprint arXiv:2401.04722
  • Margolin et al (2014) Margolin R, Zelnik-Manor L, Tal A (2014) How to evaluate foreground maps. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 248–255
  • Mei et al (2021) Mei H, Ji GP, Wei Z, Yang X, Wei X, Fan DP (2021) Camouflaged object segmentation with distraction mining. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 8772–8781
  • Mei et al (2023) Mei H, Xu K, Zhou Y, Wang Y, Piao H, Wei X, Yang X (2023) Camouflaged object segmentation with omni perception. International Journal of Computer Vision 131(11):3019–3034
  • Ouyang et al (2022) Ouyang L, Wu J, Jiang X, Almeida D, Wainwright CL, Mishkin P, Zhang C, Agarwal S, Slama K, Ray A, Schulman J, Hilton J, Kelton F, Miller L, Simens M, Askell A, Welinder P, Christiano PF, Leike J, Lowe R (2022) Training language models to follow instructions with human feedback. In: Advances in Neural Information Processing Systems
  • Pan et al (2011) Pan Y, Chen Y, Fu Q, Zhang P, Xu X (2011) Study on the camouflaged target detection method based on 3d convexity. Mathematical Models and Methods in Applied Sciences 5:152
  • Pang et al (2022) Pang Y, Zhao X, Xiang TZ, Zhang L, Lu H (2022) Zoom in and out: A mixed-scale triplet network for camouflaged object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2160–2170
  • Pei et al (2022) Pei J, Cheng T, Fan DP, Tang H, Chen C, Van Gool L (2022) Osformer: One-stage camouflaged instance segmentation with transformers. In: Proceedings of the European Conference on Computer Vision, pp 19–37
  • Rahmon et al (2024) Rahmon G, Palaniappan K, Toubal IE, Bunyak F, Rao R, Seetharaman G (2024) Deepftsg: Multi-stream asymmetric use-net trellis encoders with shared decoder feature fusion architecture for video motion segmentation. International Journal of Computer Vision 132(3):776–804
  • Sabour et al (2017) Sabour S, Frosst N, Hinton GE (2017) Dynamic routing between capsules. In: Advances in Neural Information Processing Systems, pp 3859–3869
  • Sengottuvelan et al (2008) Sengottuvelan P, Wahi A, Shanmugam A (2008) Performance of decamouflaging through exploratory image analysis. In: International Conference on Emerging Trends in Engineering and Technology, pp 6–10
  • Singh et al (2013) Singh SK, Dhawale CA, Misra S (2013) Survey of object detection methods in camouflaged image. Ieri Procedia 4:351–357
  • Sun et al (2021) Sun Y, Chen G, Zhou T, Zhang Y, Liu N (2021) Context-aware cross-level fusion network for camouflaged object detection. In: Proceedings of the International Joint Conference on Artificial Intelligence, pp 1025–1031
  • Sun et al (2022) Sun Y, Wang S, Chen C, Xiang TZ (2022) Boundary-guided camouflaged object detection. In: Proceedings of the International Joint Conference on Artificial Intelligence, pp 1335–1341
  • Vistnes (1989) Vistnes R (1989) Texture models and image measures for texture discrimination. International Journal of Computer Vision 3(4):313–336
  • Wang et al (2024) Wang J, Liu X, Yin Z, Wang Y, Guo J, Qin H, Wu Q, Liu A (2024) Generate transferable adversarial physical camouflages via triplet attention suppression. International Journal of Computer Vision
  • Wu et al (2023) Wu Z, Paudel DP, Fan DP, Wang J, Wang S, Demonceaux C, Timofte R, Van Gool L (2023) Source-free depth for object pop-out. In: Proceedings of the IEEE International Conference on Computer Vision, pp 1032–1042
  • Yang et al (2021) Yang F, Zhai Q, Li X, Huang R, Luo A, Cheng H, Fan DP (2021) Uncertainty-guided transformer reasoning for camouflaged object detection. In: Proceedings of the IEEE International Conference on Computer Vision, pp 4126–4135
  • Yang et al (2023) Yang X, Burghardt T, Mirmehdi M (2023) Dynamic curriculum learning for great ape detection in the wild. International Journal of Computer Vision 131(5):1163–1181
  • Yu et al (2016) Yu J, Jiang Y, Wang Z, Cao Z, Huang T (2016) Unitbox: An advanced object detection network. In: Proceedings of the ACM international conference on Multimedia, pp 516–520
  • Zhai et al (2021) Zhai Q, Li X, Yang F, Chen C, Cheng H, Fan DP (2021) Mutual graph learning for camouflaged object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 12997–13007
  • Zhai et al (2023) Zhai Q, Li X, Yang F, Jiao Z, Luo P, Cheng H, Liu Z (2023) Mgl: Mutual graph learning for camouflaged object detection. IEEE Transactions on Image Processing 32:1897–1910
  • Zhao et al (2021) Zhao X, Zhang L, Lu H (2021) Automatic polyp segmentation via multi-scale subtraction network. In: Medical Image Computing and Computer Assisted Intervention, vol 12901, pp 120–130
  • Zhao et al (2024) Zhao X, Pang Y, Zhang L, Lu H, Zhang L (2024) Towards diverse binary segmentation via a simple yet general gated network. International Journal of Computer Vision 132:1–20
  • Zhong et al (2022) Zhong Y, Li B, Tang L, Kuang S, Wu S, Ding S (2022) Detecting camouflaged object in frequency domain. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4494–4503
  • Zhou et al (2022) Zhou T, Zhou Y, Gong C, Yang J, Zhang Y (2022) Feature aggregation and propagation network for camouflaged object detection. IEEE Transactions on Image Processing 31:7036–7047
  • Zhou et al (2024) Zhou X, Wu Z, Cong R (2024) Decoupling and integration network for camouflaged object detection. IEEE Transactions on Multimedia 26:7114–7129
  • Zhu et al (2022) Zhu H, Li P, Xie H, Yan X, Liang D, Chen D, Wei M, Qin J (2022) I can find you! boundary-guided separated attention network for camouflaged object detection. Proceedings of the AAAI Conference on Artificial Intelligence 36(3):3608–3616
  • Zhu et al (2021) Zhu J, Zhang X, Zhang S, Liu J (2021) Inferring camouflaged objects by texture-aware interactive guidance network. Proceedings of the AAAI Conference on Artificial Intelligence 35(4):3599–3607
  • Zhu et al (2024) Zhu L, Liao B, Zhang Q, Wang X, Liu W, Wang X (2024) Vision mamba: Efficient visual representation learning with bidirectional state space model. In: International Conference on Machine Learning