
MirrorNet: Bio-Inspired Camouflaged Object Segmentation

Jinnan Yan (University of Dayton, US); Trung-Nghia Le (National Institute of Informatics, Japan); Khanh-Duy Nguyen (University of Information Technology, VNU-HCM, Vietnam National University, Ho Chi Minh City, Vietnam); Minh-Triet Tran (University of Science and John von Neumann Institute, VNU-HCM, Vietnam National University, Ho Chi Minh City, Vietnam); Thanh-Toan Do (Monash University, Australia); Tam V. Nguyen* (University of Dayton, US). *Corresponding author, e-mail: [email protected]
Abstract

Camouflaged objects are generally difficult to detect in their natural environment, even for human beings. In this paper, we propose a novel bio-inspired network, named MirrorNet, that leverages both instance segmentation and a mirror stream for camouflaged object segmentation. Unlike existing segmentation networks, our proposed network possesses two segmentation streams: the main stream and the mirror stream, corresponding to the original image and its flipped image, respectively. The output of the mirror stream is then fused with the main stream's result to form the final camouflage map and boost segmentation accuracy. Extensive experiments conducted on the public CAMO dataset demonstrate the effectiveness of our proposed network. Our proposed method achieves 89% accuracy, outperforming state-of-the-art methods.

Index Terms:
Camouflaged Object Segmentation, Bio-Inspired Network.

I Introduction

The term “camouflage” was originally used to describe the behavior of an animal or insect trying to hide itself in its surroundings to hunt or to avoid being hunted [1]; such objects are referred to as naturally camouflaged objects [2]. This ability reduces the risk of being detected and increases survival probability. For example, chameleons or fishes can change the appearance of their bodies to match the color and pattern of the surrounding environment (see Fig. 1). Humans adopted this mechanism and began to apply it widely on the battlefield. For example, soldiers and war equipment are camouflaged by dressing or coloring them to blend with their surroundings (see Fig. 1); these are referred to as artificially camouflaged objects [2]. Artificial camouflage has also been applied in entertainment (e.g., magic shows) and art (e.g., body painting). Figure 1 shows a few examples of both naturally and artificially camouflaged objects in real life, in which the camouflaged objects cannot be easily identified.

Refer to caption
Figure 1: A few examples of camouflaged objects in CAMO dataset [2]. From left to right, naturally camouflaged objects (e.g., fish, chameleon, insect) are followed by artificially camouflaged objects (e.g., soldier, body painting).

Autonomously detecting/segmenting camouflaged objects is thus a difficult task in which discriminative features do not play an important role, since we have to look past the objects that naturally capture our attention and focus on those that blend into the background. While detecting camouflaged objects is technically challenging on the one hand, it is, on the other, beneficial in various practical scenarios, including surveillance systems and search-and-rescue missions.

Refer to caption
Figure 2: Failure examples of state-of-the-art methods on camouflaged object segmentation. From top to bottom: the original images, the ground-truth maps, the camouflage maps of ANet-DHS [2] and ANet-SRM [2], and the instance segmentation results of Mask R-CNN [3]. We remark that all these methods are fine-tuned on the camouflage dataset [2].

Figure 2 shows several examples in which camouflaged objects fail to be detected by state-of-the-art object segmentation networks. Moreover, there are plenty of creatures in nature that have evolved over time to blend into their surroundings and are even harder to detect. Therefore, studying these less obvious objects, which we call camouflaged objects in this paper, is necessary for any field that targets detecting all objects in all scenes. Note that camouflage is a very subjective concept here. An object such as a person, which is regarded as a common class in MS-COCO [4], can be considered a camouflaged object when he is hiding himself as a sniper. In other words, we treat all objects that are too similar to their surroundings, because of their color, texture, or both, as camouflaged objects, and their complement set as non-camouflaged objects. It is time-consuming and laborious to collect data for every target object in every particular scene that presents camouflage in order to improve models, and such an approach does not adapt well. Moreover, what camouflaged objects have in common is that their features differ very little from their surroundings. Therefore, it is reasonable to treat all objects of different classes that are similar to the background as one class.

Most state-of-the-art Convolutional Neural Network (CNN)-based models [5, 6, 7] simulate the human brain. Therefore, CNN-based models may also be fooled, just like humans [8]. This is also true for camouflaged objects, which attempt to fool human visual perception: over time, they evolve to blend their appearance into the background. On the other hand, color-blind people are often better at detecting camouflaged objects [9], perhaps because they rely less on color and more on form and texture to discern the world around them. This motivates us to look at other bio-inspired features.

An object becomes successfully camouflaged when it elegantly blends into the surrounding environment, creating a familiar natural scene that hides the object. By changing the viewpoint of the same scene, we expect a chance to escape from such an illusion. This visual-psychological phenomenon motivates our bio-inspired solution of detecting camouflaged objects by changing the viewpoint of the scene. We realize that even a simple flipping operation can generate a new view of the same scene. Indeed, the flipped image accidentally breaks the natural layout, which leads to a larger difference between the background and the camouflaged objects.

Inspired by the above visual-psychological phenomenon, we look for a method that breaks a camouflaged object's ability to escape human/machine vision by changing the viewpoint. In this paper, we propose a new, simple yet efficient bio-inspired framework that greatly improves existing camouflaged object recognition performance. The framework leverages the advantages of cutting-edge models [10, 5], which are well trained on large-scale datasets such as ImageNet [11] and MS-COCO [4], allowing it to surpass ANet [2], the state-of-the-art method for camouflaged object segmentation. The proposed framework consists of two streams: the main stream, which segments the original image, and the mirror stream, which utilizes the flipped image; the flipped-image stream exploits the bio-inspired effort to break the natural setting of camouflaged objects. Indeed, the viewpoint change induced by the flipping operation helps escape the easy-to-be-fooled appearance of camouflaged objects. Leveraging this simple but effective approach uncovers more details in the predicted maps and further improves the performance. To address the insufficiency of training data, we also utilize object-level flipping for natural data augmentation during network training. Extensive experiments on the benchmark CAMO dataset [2] for camouflaged object segmentation show the superiority of our proposed method over the state-of-the-art. The code and results will be published on our websites: https://sites.google.com/view/ltnghia/research/camo and https://sites.google.com/site/vantam/research.

Our contributions are as follows.

  • Differently from the state-of-the-art ANet [2], where the classification stream and the segmentation stream are applied on the whole image, in this work we first obtain object proposals and then apply segmentation on each object proposal.

  • As an effort to break the natural setting of camouflaged objects, we integrate an additional mirror stream, which utilizes horizontally flipped images, to support the main stream. We also introduce Data Augmentation in the Wild to address the data insufficiency.

  • Last but not least, we conduct extensive experiments to demonstrate the performance of the proposed framework.

The rest of this paper is organized as follows. Section II first reviews related work. Section III briefly introduces the motivation and discusses the proposed framework in detail. Section IV reports the experimental results and the ablation study. Finally, Section V draws the conclusion and paves the way for future work.

II Related Work

In the literature, the mainstream of the computer vision community mainly focuses on the detection/segmentation of non-camouflaged objects, i.e., salient objects or objects of predefined classes. There exist many research works on both general object detection [12, 13, 14, 15] and salient object segmentation [16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26]. Meanwhile, camouflaged object recognition has not been well explored in the literature. Early works related to camouflage are dedicated to detecting the foreground region even when part of its texture is similar to the background [27, 28, 29]. These works distinguish foreground and background based on simple features such as color, intensity, shape, orientation, and edges. A few methods based on handcrafted low-level features (i.e., texture [30, 31, 32] and motion [33, 34]) have been proposed to tackle the problem of camouflage detection. However, all these methods work only for a few cases with simple and non-uniform backgrounds, and their performance is unsatisfactory for camouflaged object segmentation due to the strong similarity between the foreground and the background. Recently, Le et al. [2] proposed an end-to-end network, called ANet, for camouflaged object segmentation that integrates classification information into segmentation. Fan et al. [35] introduced SINet with two main modules, namely the search module S and the identification module I. Le et al. [36] explored camouflaged instance segmentation by training Mask R-CNN [3] on the CAMO dataset [2].

Differently from existing work, which only intervenes in the training process, we also utilize a bio-inspired mirror stream in the inference process to guide the network towards the right track for camouflaged object segmentation. We remark that our mirror stream and the image-flipping augmentation of existing work differ in purpose and philosophy. Data augmentation applies image flipping only in the training phase to improve the performance of networks; it treats the original image and its flipped image as two totally different images, aiming to learn as many cases as possible. Meanwhile, our mirror stream is independent of image-flipping augmentation: we apply flipped images together with the original images in the inference phase, treating the normal image and the flipped image as two faces of the same original image. Besides solving the standard case (the normal image), we also want to solve the rare case (the flipped image).

III Proposed Framework

III-A Bio-inspired Motivation

In biological vision studies, there exist viewpoint-invariant theories and viewpoint-dependent theories [37, 38, 39]. In viewpoint-invariant theories, once a particular object has been stored, the recognition of that object from novel viewpoints should be unaffected, provided that the necessary features can be recovered from that view. In viewpoint-dependent theories, once a particular object has been stored, recognition of that object from novel views may be impaired relative to recognition of previously stored views. However, these theories were formed in a simple setting with outstanding stimuli (non-camouflaged objects) and a clear background, e.g., a white background. Indeed, it is natural for human vision to easily detect non-camouflaged objects such as salient objects, since they stand out from the background, while it is harder to detect camouflaged objects, since they are similar to the background. In other words, the visual difference between the background and the non-camouflaged object(s) should be larger than that between the background and the camouflaged object(s). However, when we flip the images, humans are no longer fooled by the natural settings and notice differences in the images. Therefore, the flipped images accidentally break the natural layout, which leads to a larger distance between the background and the camouflaged objects.

To verify this intuition, we compute the visual difference between the main objects (the camouflaged and the non-camouflaged objects) and the background. In particular, we adopt the well-known deep learning model FCN [40] and its variants to compute the difference distance. We use the model pretrained on PASCAL-VOC [41] (with 20 semantic classes and the background class). We extract the confidence scores at the second-to-last layer of FCN (h\times w\times 21). Then, we compute the mean semantic vectors s_{bg} and s_{fg} for the background and the camouflaged/non-camouflaged object regions (obtained from the ground-truth maps), respectively. This yields a 21-dim vector for each region. \ell_{2} normalization is applied to both s_{bg} and s_{fg}. Then the visual difference d between the main objects (the camouflaged/non-camouflaged objects) and the background is computed as d=\|s_{fg}-s_{bg}\|_{2}, where \|\cdot\|_{2} is the Euclidean distance. We also compute the difference between the camouflaged/non-camouflaged object and the background in terms of color (RGB and l\alpha\beta) and texture (texton [42]).
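For clarity, the following is a minimal NumPy sketch of how this distance can be computed, assuming the per-pixel FCN confidence scores and the binary ground-truth mask are already available (array names are illustrative, not the authors' code):

```python
import numpy as np

def visual_difference(scores, gt_mask):
    """Distance d between the mean semantic vectors of the object and background.

    scores:  (H, W, 21) per-pixel confidence scores from FCN (assumed given).
    gt_mask: (H, W) binary ground-truth mask, 1 for the object region."""
    s_fg = scores[gt_mask == 1].mean(axis=0)   # mean 21-dim vector of the object
    s_bg = scores[gt_mask == 0].mean(axis=0)   # mean 21-dim vector of the background
    # l2-normalize both vectors before measuring the distance
    s_fg = s_fg / (np.linalg.norm(s_fg) + 1e-8)
    s_bg = s_bg / (np.linalg.norm(s_bg) + 1e-8)
    return np.linalg.norm(s_fg - s_bg)         # Euclidean distance d
```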

Method Non-Camo Camo Camo-Flipped
Color (RGB) 0.112 0.085 0.085
Color (l\alpha\beta) 0.217 0.143 0.143
Texton [42] 0.323 0.195 0.195
FCN (32s) [40] 0.520 0.352 0.354
FCN (8s) [40] 0.641 0.409 0.412
CRF-RNN [43] 0.786 0.460 0.462
TABLE I: The visual difference between the camouflaged/non-camouflaged objects and the background with different methods.
Refer to caption
Figure 3: The overview of our proposed framework. MirrorNet consists of two streams, namely, the main stream for original image segmentation and the mirror stream for horizontally flipped image segmentation.

Table I shows the visual difference under different settings. The differences in terms of color are much smaller than those of features extracted from deep learning models. In the color space, l\alpha\beta shows a larger distance than RGB. Meanwhile, the textural features, i.e., Texton [42], yield a larger distance, which indicates the usefulness of texture in the task of camouflaged object segmentation. The even larger distances in feature space show that features from deep learning models may be helpful for detecting and segmenting camouflaged objects. Note that the features extracted from FCN, a CNN-based model, actually capture texture and edge information. In fact, FCN and other CNNs contain convolutional layers that learn spatial hierarchies of patterns by preserving spatial relationships. In these networks, the first convolutional layer can learn basic elements such as textures and edges [44]. The second convolutional layer can then learn patterns composed of the basic elements learned in the previous layer. The training process continues until the deep network learns significantly complex patterns and abstract visual concepts. This demonstrates the importance of low-level information such as texture and edges in the task of camouflaged object segmentation.

In addition, the difference values of the camo and camo-flipped images with respect to the background are identical in terms of color, which shows that the flipped stream is not useful in the color space. However, when we look at the differences in the feature space, the camo and camo-flipped values differ (the camo-flipped value is slightly larger). This suggests that the camo flip may be helpful for the task in the feature space.

Taking a closer look, the distance between the main object and the background is larger in non-camouflaged images and smaller in camouflaged images in all FCN and CRF-RNN settings. In camouflaged images, most pixels of camouflaged objects are classified as the background class, which leads to the small distance. Meanwhile, most pixels of the main non-camouflaged objects are classified into non-background classes, which causes the larger distance. We are also interested in the horizontally flipped images. Therefore, we also compute the visual difference between the background and the camouflaged objects in the horizontally flipped images. As shown in Table I, the camouflaged-flipped images generally produce a larger distance. It seems that the horizontally flipped images may contain useful information for the segmentation task. Note that the images of camouflaged objects are taken in natural settings. Therefore, the flipped images accidentally break the natural layout, which leads to a larger distance between the background and the camouflaged objects. This observation paves the way for our proposed method.

A camouflaged object carries both intrinsic and extrinsic information. The intrinsic information refers to the difference between the camouflaged object and the background: the distance may be small in the color space but larger in another space, e.g., a deep learned feature space or a semantic space. Meanwhile, the extrinsic information refers to external changes applied to the camouflaged object, for example, rotation, translation, and flipping, which can make the camouflaged object easier to recognize. Our proposed framework therefore captures both the intrinsic and the extrinsic information of the camouflaged object.

III-B MirrorNet Overview

Figure 3 depicts the overview of our proposed MirrorNet, a bio-inspired network for camouflaged object segmentation. MirrorNet consists of two streams: the main stream, which segments the original image, and the mirror stream, which segments the flipped image; the horizontally flipped image stream exploits the bio-inspired effort to break the natural setting of camouflaged objects. Each stream traverses the camouflaged object proposal and camouflaged object segmentation components and yields segmentation masks. The output masks from the two streams are finally fused to produce a pixel-wise accurate camouflage map that uniformly covers camouflaged objects. Here, the camouflaged object segmentation targets the intrinsic information, while the mirror stream aims to capture the extrinsic information.
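Before fusion, the mirror-stream predictions must be mapped back to the original image coordinates. The short PyTorch sketch below illustrates this geometric step under the assumption that boxes are given as [x1, y1, x2, y2] in pixel coordinates; it is an illustration, not part of the released code:

```python
import torch

def flip_back(boxes, masks, image_width):
    """Map mirror-stream detections (computed on the horizontally flipped image)
    back to the original image coordinates before fusion.

    boxes: (N, 4) tensor of [x1, y1, x2, y2]; masks: (N, H, W) tensor."""
    flipped = boxes.clone()
    flipped[:, 0] = image_width - boxes[:, 2]     # new x1 from the old x2
    flipped[:, 2] = image_width - boxes[:, 0]     # new x2 from the old x1
    return flipped, torch.flip(masks, dims=[-1])  # un-flip the masks as well
```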

III-C Network Design and Architecture

Refer to caption
Figure 4: Exemplary samples of Data Augmentation in the Wild. From top to bottom: the original images with the ground-truth map, the flipped images, the cloning instance images.

III-C1 Camouflaged Object Proposal

MirrorNet first attempts to localize the possible positions containing camouflaged objects via the Camouflaged Object Proposal component. Inspired by [14], we aim to find the object positions, the object classes, and their camouflage masks in images simultaneously. Following the standard design in computer vision, the object position is defined by a rectangle with respect to the top-left corner of the image; the object class is defined over the rectangle; and the object mask is encoded at every pixel inside the rectangle. Ideally, we want to detect all relevant objects in the image and map each pixel of these objects to its most probable camouflage/non-camouflage label.

Here, our Camouflaged Object Proposal component adopts the Region Proposal Network (RPN) [14]. RPN shares weights with the main convolutional backbone and outputs bounding boxes (RoIs / object proposals) at various sizes. For each RoI, a fixed-size feature map is pooled from the image feature map using the RoIPool layer [14]. The RoIPool layer works by dividing the RoI into a regular grid and then max-pooling the feature map values in each grid cell. This quantization, however, causes misalignments between the RoI and the extracted features due to the harsh rounding operations when mapping the RoI coordinates from the input image space to the feature map space and when dividing the RoI into grid cells. To address this problem, our Camouflaged Object Proposal component integrates Precise RoI Pooling (PrRoI) [45]. Indeed, PrRoI Pooling uses average pooling instead of max pooling for each bin and has a continuous gradient with respect to the bounding box coordinates. That is, one can take the derivatives of a loss function with respect to the coordinates of each RoI and optimize the RoI coordinates. PrRoI Pooling also differs from the RoI Align proposed in Mask R-CNN: it uses full integration-based average pooling instead of sampling a constant number of points, which makes the gradient with respect to the coordinates continuous.
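The difference between the pooling variants can be sketched with torchvision's roi_align; note that dense point sampling only approximates PrRoI Pooling, whereas the actual PrRoI layer [45] integrates the interpolated feature exactly over each bin (tensor shapes and box values below are illustrative):

```python
import torch
from torchvision.ops import roi_align

# Feature map from the backbone: (batch, channels, H, W); values are illustrative.
features = torch.randn(1, 256, 50, 50)
# One RoI in (batch_index, x1, y1, x2, y2) format, in feature-map coordinates.
rois = torch.tensor([[0.0, 3.7, 5.2, 20.1, 18.6]])

# RoI Align (Mask R-CNN): averages a fixed number of sampled points per bin.
pooled_align = roi_align(features, rois, output_size=(7, 7),
                         spatial_scale=1.0, sampling_ratio=2, aligned=True)

# PrRoI Pooling instead integrates the bilinearly interpolated feature exactly
# over each bin, so the pooled value is continuous in the box coordinates.
# A dense sampling_ratio only approximates that behaviour here.
pooled_prroi_like = roi_align(features, rois, output_size=(7, 7),
                              spatial_scale=1.0, sampling_ratio=8, aligned=True)
print(pooled_align.shape)  # torch.Size([1, 256, 7, 7])
```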

III-C2 Camouflaged Object Segmentation

Following [14], we use a multi-task loss \mathcal{L} to jointly train the bounding box class, the bounding box position, and the camouflage map (mask) as follows:

\mathcal{L}=\mathcal{L}_{cls}+\mathcal{L}_{loc}+\mathcal{L}_{mask}, (1)

where \mathcal{L}_{cls} and \mathcal{L}_{loc} are defined on the output of the detection branch, while \mathcal{L}_{mask} is defined on the output of the segmentation branch. The object classification loss \mathcal{L}_{cls}(p,u) is the multinomial cross-entropy loss computed as follows:

\mathcal{L}_{cls}(p,u)=-\log{p_{u}}, (2)

where p_{u} is the softmax output for the true class u. The bounding box regression loss \mathcal{L}_{loc}(t^{u},v) is computed as the Smooth L1 loss [13] between the regressed box offset t^{u} (corresponding to the ground-truth object class u) and the ground-truth box offset v:

\mathcal{L}_{loc}(t^{u},v)=\sum_{i\in\{x,y,w,h\}}Smooth_{L1}(t^{u}_{i}-v_{i}) (3)

The segmentation loss \mathcal{L}_{mask}(m,s) is the multinomial cross-entropy loss computed as follows:

\mathcal{L}_{mask}(m,s)=\frac{-1}{N}\sum_{i\in RoI}{\log{m^{i}_{s_{i}}}}, (4)

where m^{i}_{s_{i}} is the softmax output at pixel i for the true label s_{i}, and N is the number of pixels in the RoI.
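As a concrete illustration, the following PyTorch sketch computes Eqs. (1)-(4) for a batch of RoIs; the tensor shapes are assumptions, and the box term is simplified to a single Smooth L1 call rather than selecting the offsets of the ground-truth class:

```python
import torch
import torch.nn.functional as F

def multitask_loss(cls_logits, labels, box_pred, box_target, mask_logits, mask_target):
    """Sketch of Eqs. (1)-(4) for N RoIs (shapes are illustrative):
    cls_logits (N, C), labels (N,), box_pred/box_target (N, 4),
    mask_logits (N, C, H, W), mask_target (N, H, W) with per-pixel labels.
    The box term is simplified: it should use only the offsets of the true class."""
    loss_cls = F.cross_entropy(cls_logits, labels)         # Eq. (2)
    loss_loc = F.smooth_l1_loss(box_pred, box_target)      # Eq. (3)
    loss_mask = F.cross_entropy(mask_logits, mask_target)  # Eq. (4), averaged over pixels
    return loss_cls + loss_loc + loss_mask                 # Eq. (1)
```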

III-C3 Mask Fusion

We first flip back all detected bounding boxes and their corresponding camouflage maps (masks) from the mirror stream. Then, we use a threshold \theta=0.5 to eliminate bounding boxes with low prediction scores. After that, the prediction scores of the bounding boxes from the two streams are sorted in descending order, and we apply the “winner take all” strategy to prune redundant boxes: for each bounding box, from the highest prediction score to the lowest, we check the bounding boxes with lower scores, and if a lower-scoring bounding box has 50% mutual overlap with a higher-scoring one, the lower-scoring box is excluded. We then discard the bounding boxes (and their corresponding camouflage maps) classified as non-camouflage.

Finally, we accumulate the camouflage maps (masks) from the remaining bounding boxes and normalize the output, resulting in the final camouflage map.
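A minimal sketch of this fusion step is given below, using torchvision's IoU-based NMS as a stand-in for the described mutual-overlap check; the input names and the assumption that mirror-stream boxes/masks have already been flipped back are ours:

```python
import torch
from torchvision.ops import nms

def fuse_masks(boxes, scores, masks, is_camouflage, score_thr=0.5, overlap_thr=0.5):
    """Fuse detections pooled from both streams into one camouflage map.

    boxes: (N, 4) [x1, y1, x2, y2]; scores: (N,); masks: (N, H, W) full-image
    binary masks; is_camouflage: (N,) bool. Mirror-stream boxes/masks are assumed
    to be already flipped back to the original coordinates."""
    keep = scores >= score_thr                       # drop low-confidence boxes
    boxes, scores = boxes[keep], scores[keep]
    masks, is_camouflage = masks[keep], is_camouflage[keep]

    # "Winner take all": greedily suppress boxes that overlap a higher-scoring
    # box by more than 50% (IoU is used here in place of the mutual-overlap check).
    keep_idx = nms(boxes, scores, iou_threshold=overlap_thr)
    masks, is_camouflage = masks[keep_idx], is_camouflage[keep_idx]

    # Discard detections classified as non-camouflage, accumulate the rest,
    # and normalize to obtain the final camouflage map.
    camo_map = masks[is_camouflage].float().sum(dim=0)
    if camo_map.max() > 0:
        camo_map = camo_map / camo_map.max()
    return camo_map                                  # (H, W) map in [0, 1]
```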

III-D Data Augmentation in the Wild

Since non-camouflaged object segmentation has attracted far more attention than camouflaged object segmentation, there are only a few relevant datasets, and most of them contain too few samples. Therefore, we adopt the recently built CAMO dataset [2], which was proposed and benchmarked by the previous state-of-the-art work on camouflaged object segmentation, for training the instance segmentation framework. The dataset is divided into camouflage and non-camouflage categories, each containing 1,000 training images and 250 testing images, with a total of 2,500 manually annotated ground truths. Most of the dataset images depict mammals, insects, birds, and aquatic animals, each in approximately similar proportions, along with a small number of reptiles, human art, soldiers, and amphibians. The diversity of species in this dataset makes our model very adaptable, but it must be pointed out that the dataset still has insufficient samples compared to mainstream datasets like COCO [4].

We first compute the number of connected components on the binary ground-truth maps of the CAMO training set. There are many scenarios where multiple connected components belong to one instance. To avoid cherry-picking the training samples, we exclude all training images with more than two connected components. Then, we compute the bounding box of the single component in each remaining image. In order to increase the number of training samples, we introduce Data Augmentation in the Wild. In particular, we first translate and flip instances. Furthermore, this augmentation is essentially different from data augmentation for non-camouflaged objects, because whether an object is camouflaged depends not only on its own features but also on its surroundings. Therefore, we also clone the object instances and place them onto image regions whose background color differs only slightly; a sketch of this cloning step is shown below. Figure 4 shows some samples of the augmented data. In this way, we increase the number of training samples, thus alleviating the problem of insufficient data.
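The following NumPy sketch illustrates the instance-cloning idea under simplifying assumptions (a single instance per image and a destination box that lies fully inside the image); in practice the destination would be chosen so that its background color is close to the instance's original surroundings:

```python
import numpy as np

def clone_instance(image, mask, dst_xy):
    """Clone the instance defined by `mask` and paste it at `dst_xy` (top-left
    corner of the destination box, assumed to lie fully inside the image).

    image: (H, W, 3) uint8; mask: (H, W) binary; returns augmented image and mask."""
    ys, xs = np.where(mask > 0)
    y0, y1, x0, x1 = ys.min(), ys.max() + 1, xs.min(), xs.max() + 1
    patch, patch_mask = image[y0:y1, x0:x1], mask[y0:y1, x0:x1]

    out_img, out_mask = image.copy(), mask.copy()
    dx, dy = dst_xy
    h, w = patch.shape[:2]
    region = out_img[dy:dy + h, dx:dx + w]
    region[patch_mask > 0] = patch[patch_mask > 0]      # paste only instance pixels
    out_mask[dy:dy + h, dx:dx + w][patch_mask > 0] = 1  # update the ground truth
    return out_img, out_mask
```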

III-E Implementation Details

To demonstrate the generality and flexibility of our proposed MirrorNet, we employ various recent state-of-the-art backbones, i.e., ResNet [10] and ResNeXt [5]. Our implementation is based on the published code of Mask R-CNN (https://github.com/facebookresearch/maskrcnn-benchmark), which can be adapted to train the camouflaged object proposal and camouflaged object segmentation components. Furthermore, we replace the RoI Align layer with a Precise RoI Pooling (PrRoI) layer [45].

The training process is conducted by fine-tuning the available pre-trained models on the camouflage and non-camouflage images of the CAMO dataset [2] together with our Data Augmentation in the Wild. In particular, we set the size of each mini-batch to 256 and used Stochastic Gradient Descent (SGD) optimization with a momentum of 0.9 and a weight decay of 0.0001. We trained our network for 120k iterations with a base learning rate of 0.00125, which is decreased by a factor of 10 at 80k iterations and again at 100k iterations. We also remark that we implemented our method in PyTorch and conducted all the experiments on a computer with a 2.40GHz processor (Intel Xeon CPU E5-2620), 64 GB of RAM, and one GeForce GTX TITAN X GPU.
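A minimal PyTorch sketch of this optimization schedule is given below; the placeholder module and dummy loss only keep the snippet self-contained and stand in for MirrorNet and its multi-task loss:

```python
import torch

# Placeholder module standing in for MirrorNet; the dummy loss below only keeps
# the snippet runnable.
model = torch.nn.Conv2d(3, 1, kernel_size=3)
optimizer = torch.optim.SGD(model.parameters(), lr=0.00125,
                            momentum=0.9, weight_decay=0.0001)
# Divide the learning rate by 10 at 80k and again at 100k iterations (120k total).
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[80000, 100000], gamma=0.1)

for iteration in range(120000):
    # In the real setup, the loss would be the multi-task loss of Eq. (1)
    # computed on a mini-batch of 256 RoIs.
    loss = model(torch.randn(1, 3, 8, 8)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```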

IV Experiments

In this section, we first introduce the dataset and evaluation criteria used in the experiments. We then compare our MirrorNet with state-of-the-art methods on the CAMO dataset [2] to demonstrate that the instance segmentation and the mirror stream can boost camouflaged object segmentation. We also present the efficiency of our MirrorNet through an ablation study.

IV-A Benchmark Dataset and Evaluation Criteria

We used all the testing images in the CAMO dataset for the evaluation. We note that the zero-mask ground-truth labels (where all pixels have zero values) correspond to the non-camouflaged object images.

Similarly to [2], we used the F-measure (F_{\beta}) [46], Intersection Over Union (IOU) [40], and Mean Absolute Error (MAE) as the metrics to evaluate the obtained results. The first metric, F-measure, is a balanced measurement between precision and recall as follows:

F_{\beta}=\frac{\left(1+\beta^{2}\right)Precision\times Recall}{\beta^{2}\times Precision+Recall}. (5)

Note that we set \beta^{2}=0.3 as in [46, 22] to put an emphasis on precision. IOU is the ratio of the overlap area to the union area between the predicted camouflage map and the ground-truth map. Meanwhile, MAE is the average of the pixel-wise absolute differences between the predicted camouflage map and the ground truth.

For MAE, we used the raw grayscale camouflage map. For the other metrics, we binarized the results depending on two contexts. In the first context, we assume that camouflaged objects are always present in every image, like salient objects; we used an adaptive threshold [47] \theta=\mu+\eta, where \mu and \eta are the mean value and the standard deviation of the map, respectively. In the second context, which is much closer to a real-world scenario, we assume that the existence of camouflaged objects is not guaranteed in each image; we used the fixed threshold \theta=0.5.

As proposed in [35], we also use the E-measure (E_{\phi}) [48], S-measure (S_{\alpha}) [49], and weighted F-measure (F_{\beta}^{\omega}) [50] as alternative metrics.
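For reference, a minimal NumPy sketch of the MAE, adaptive-threshold F-measure, and IOU computation is shown below (the E-, S-, and weighted F-measures follow their respective papers and are omitted):

```python
import numpy as np

def evaluate(pred, gt, beta2=0.3):
    """MAE on the raw map, then F-measure and IOU after binarizing with the
    adaptive threshold mu + eta (mean plus standard deviation of the map).
    pred: (H, W) camouflage map in [0, 1]; gt: (H, W) binary ground truth."""
    mae = np.abs(pred - gt).mean()

    theta = pred.mean() + pred.std()              # adaptive threshold
    binary = (pred >= theta).astype(np.uint8)

    tp = np.logical_and(binary == 1, gt == 1).sum()
    precision = tp / (binary.sum() + 1e-8)
    recall = tp / (gt.sum() + 1e-8)
    f_beta = (1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-8)

    iou = tp / (np.logical_or(binary == 1, gt == 1).sum() + 1e-8)
    return mae, f_beta, iou
```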

IV-B Comparison with FCN and ANet Variants

TABLE II: Comparison with FCN and ANet variants on two settings: Only Camouflaged Images (the left part), and the full CAMO dataset (the right part). The evaluation is based on F-measure [46] (the higher the better), IOU [40] (the higher the better), and MAE (the lower the better). The 1st and 2nd places are shown in blue and red, respectively.
Dataset in test: Only Camouflaged Images (left half of each row) and Full CAMO dataset (right half of each row).
Groups / Method — columns per half: MAE \Downarrow, then Fβ \Uparrow and IOU \Uparrow under the Adaptive Threshold, then Fβ \Uparrow and IOU \Uparrow under the Fixed Threshold.
FCN-finetuned DHS [51] 0.138 59.6 38.8 61.4 36.7 0.072 79.6 67.9 80.8 68.1
DSS [52] 0.145 58.2 38.5 58.4 38.1 0.076 79.0 68.6 79.2 68.7
SRM [16] 0.127 66.3 45.4 65.6 42.1 0.067 83.0 71.7 83.1 70.8
WSS [53] 0.149 64.2 43.9 63.8 38.2 0.085 81.1 67.8 82.0 68.7
ANet setting [2] ANet-DHS 0.130 62.6 43.7 63.1 42.3 0.072 81.2 71.2 81.4 70.5
ANet-DSS 0.132 58.7 40.4 60.7 39.0 0.067 79.5 70.1 80.4 69.4
ANet-SRM 0.126 65.4 47.5 66.2 46.6 0.069 82.6 73.2 83.0 72.7
ANet-WSS 0.140 66.1 45.9 64.3 40.7 0.078 82.6 71.0 82.0 69.7
MirrorNet setting MirrorNet (ResNet-50) 0.100 72.3 58.1 72.5 58.4 0.062 86.1 77.9 86.2 78.0
MirrorNet (ResNet-101) 0.089 73.4 60.4 73.5 60.5 0.053 86.7 79.4 86.7 79.4
MirrorNet (ResNeXt-101) 0.084 76.6 62.7 76.9 62.9 0.051 88.2 80.4 88.4 80.5
MirrorNet (ResNeXt-152) 0.077 78.4 65.8 78.5 65.7 0.045 89.3 82.2 89.3 82.2
TABLE III: Ablation study results on the full CAMO dataset. The evaluation is based on F-measure [46] (the higher the better), IOU [40] (the higher the better), and MAE (the lower the better). The best results are shown in blue. The detail of each baseline is clearly introduced in Section IV-C.
Method — Settings (Main Stream, Mirror Stream, Data Augmentation in the Wild, Pooling); each baseline's stream/augmentation configuration is given in Section IV-C.
Columns after the Pooling entry: MAE \Downarrow, then Fβ \Uparrow and IOU \Uparrow under the Adaptive Threshold, then Fβ \Uparrow and IOU \Uparrow under the Fixed Threshold.
Baseline 1 PrRoI 0.051 88.4 80.8 88.6 80.8
Baseline 2 PrRoI 0.050 88.6 81.3 88.8 81.4
Baseline 3 PrRoI 0.048 88.6 82.0 88.8 82.0
Baseline 4 RoI Align 0.047 88.8 81.8 88.9 81.8
MirrorNet PrRoI 0.045 89.3 82.2 89.4 82.2

We conduct the experiments on the CAMO dataset with two settings, namely, the camouflaged-only image set and the full set. To investigate the impact of different backbones on our proposed MirrorNet, we fine-tuned pre-trained models of ResNet (ResNet-50, ResNet-101) [10] and ResNeXt (ResNeXt-101, ResNeXt-152) [5] on CAMO. It is interesting for the computer vision community to see the performance of different frameworks on the camouflaged object segmentation problem. In particular, we compare our MirrorNet with the ANet family of networks [2] (denoted by ANet-DHS, ANet-DSS, ANet-SRM, ANet-WSS), the state-of-the-art for camouflaged object segmentation, and the original FCNs (DHS [51], DSS [52], SRM [16], and WSS [53]). All these methods were also fine-tuned on the CAMO dataset with their published parameters.

As shown in Table II, ANet variants outperform the FCN-finetuned methods with the same backbone. Meanwhile, our proposed MirrorNet significantly outperforms all baselines, achieving the best results with the ResNeXt-152 backbone. This is consistent with our discussion in Section III-A, i.e., better backbones tend to work better in MirrorNet. Note that all baselines, i.e., the FCN-finetuned and ANet variants, are trained and tested separately on the different sets of data, for example, training and testing on only the camouflaged image set, or training and testing on the full set. Meanwhile, our MirrorNet and its variants are trained only on the full set; nevertheless, their performance is also significantly better on the camouflaged-only image set.

IV-C Ablation Study

In this subsection, we investigate the impact of different components in our proposed MirrorNet such as dual stream, data augmentation, and RoI pooling mechanisms. Experimental results are shown in Table III. We remark that we use the ResNeXt-152 backbone for all compared methods.

Dual Stream: We investigate the performance of each stream in MirrorNet, namely, the main stream and the mirror stream. We compare our complete MirrorNet, which has both streams, with variants using a single stream (denoted by Baseline 1 for only the main stream and Baseline 2 for only the mirror stream). As shown in Table III, our complete MirrorNet outperforms the baselines using an individual stream. In addition, the mirror stream alone performs better than the main stream alone.

Data Augmentation in the Wild: We suspected that one of the major factors limiting camouflaged object segmentation was the small amount of training data. To evaluate the impact of Data Augmentation in the Wild, we compare against MirrorNet trained without Data Augmentation in the Wild, denoted by Baseline 3. Table III shows that training with Data Augmentation in the Wild outperforms training without the augmented data.

RoI Pooling: We further investigate the performance of different RoI pooling mechanisms, namely, RoI Align (denoted by Baseline 4) and PrRoI. As also seen in Table III, PrRoI-based MirrorNet surpasses RoI Align-based MirrorNet. In addition, RoI Align-based MirrorNet outperforms both the main stream and the mirror stream (with PrRoI), which clearly shows the importance of the mask fusion.

Refer to caption
Figure 5: Comparison of the results from different methods. From left to right: the original image, the ground truth map, the predicted camouflaged maps of ANet-SRM [2], SINet [35], and our methods MirrorNet (ResNet-101), MirrorNet (ResNeXt-101), MirrorNet (ResNeXt-152).
TABLE IV: Comparison of state-of-the-art methods on the CAMO dataset (camouflaged objects only). The evaluation is based on E-measure (E_{\phi}) [48] (the higher the better), S-measure (S_{\alpha}) [49] (the higher the better), weighted F-measure (F_{\beta}^{\omega}) [50] (the higher the better), and MAE (the smaller the better). The best results are shown in blue.
Method Year Evaluation Metrics
S_{\alpha} \Uparrow E_{\phi} \Uparrow F_{\beta}^{\omega} \Uparrow MAE \Downarrow
UNet++ [54] 2018 0.599 0.653 0.392 0.149
PiCANet [55] 2018 0.609 0.584 0.356 0.156
MSRCNN [56] 2019 0.617 0.669 0.454 0.133
PoolNet [57] 2019 0.702 0.698 0.494 0.129
BASNet [20] 2019 0.618 0.661 0.413 0.159
PFANet [19] 2019 0.659 0.622 0.391 0.172
CPD [18] 2019 0.726 0.729 0.550 0.115
HTC [58] 2019 0.476 0.442 0.174 0.172
EGNet [17] 2019 0.732 0.768 0.583 0.104
ANet-SRM [2] 2019 0.682 0.685 0.484 0.126
SINet [35] 2020 0.751 0.771 0.606 0.100
MirrorNet (Ours) 2020 0.785 0.849 0.719 0.077

IV-D Comparison with State-of-the-art Methods

We further compare the performance of MirrorNet with the state-of-the-art results reported in [35]. Table IV shows the performance of different methods in terms of E-measure (E_{\phi}) [48], S-measure (S_{\alpha}) [49], weighted F-measure (F_{\beta}^{\omega}) [50], and MAE [2]. Note that these metrics were adopted in [35]. As can be seen, the more recently proposed methods tend to achieve better results. Our proposed MirrorNet achieves the best performance in terms of E_{\phi}, S_{\alpha}, F_{\beta}^{\omega}, and MAE; with the ResNeXt-152 backbone, it surpasses the state-of-the-art methods by a remarkable margin in all metrics.

Figure 5 shows the visual comparison of different methods. As illustrated in the figure, our MirrorNet variants yield better results than all state-of-the-art methods: our results are close to the ground truth and focus on the camouflaged objects. These results show that our MirrorNet successfully segments the camouflaged objects in the CAMO dataset [2]. As discussed in [2], the CAMO dataset consists of images from challenging scenarios such as object appearance (the camouflaged object has a similar color and texture to the background), background clutter, shape complexity, small objects, object occlusion, multiple objects, and distraction. This demonstrates that our MirrorNet can handle a wide variety of camouflaged objects.

V Conclusion

In this paper, we propose a simple and flexible end-to-end network, namely MirrorNet, for camouflaged object segmentation. Our bio-inspired MirrorNet leverages both instance segmentation and a mirror stream to segment camouflaged objects in images. In particular, we propose to use the main stream and the mirror stream, which embeds image flipping, to effectively capture different layouts of the scene, leading to significant gains in identifying camouflaged objects. The extensive experimental results show that our proposed method achieves state-of-the-art performance on the public CAMO dataset.

In the future, we will investigate different adversarial schemes. In addition, we aim to further explore the problem of camouflaged instance segmentation. As shown in the ablation study, our data augmentation slightly improves the performance; note that data augmentation is not the main focus of this paper, and exploring different data augmentation methods is an interesting topic left for future work. We are also interested in applying camouflaged object segmentation to practical systems [59].

Acknowledgment

This research is funded in part by the Vingroup Innovation Foundation (VINIF) under project code VINIF.2019.DA19, the National Science Foundation (NSF) under Grant No. 2025234, and JSPS KAKENHI Grant No. 20K23355. We gratefully acknowledge NVIDIA.

References

  • [1] S. Singh, C. Dhawale, and S. Misra, “Survey of object detection methods in camouflaged image,” Journal of IERI Procedia, vol. 4, pp. 351 – 357, 2013.
  • [2] T. Le, T. V. Nguyen, Z. Nie, M. Tran, and A. Sugimoto, “Anabranch network for camouflaged object segmentation,” Computer Vision and Image Understanding, vol. 184, pp. 45–56, 2019.
  • [3] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask r-cnn,” in International Conference on Computer Vision, 2017, pp. 2980–2988.
  • [4] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in European Conference on Computer Vision, 2014, pp. 740–755.
  • [5] S. Xie, R. B. Girshick, P. Dollár, Z. Tu, and K. He, “Aggregated residual transformations for deep neural networks,” in IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5987–5995.
  • [6] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi, “Inception-v4, inception-resnet and the impact of residual connections on learning,” in AAAI Conference on Artificial Intelligence, 2017, pp. 4278–4284.
  • [7] G. Huang, Z. Liu, L. v. d. Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in IEEE Conference on Computer Vision and Pattern Recognition, July 2017, pp. 2261–2269.
  • [8] M. A. Alcorn, Q. Li, Z. Gong, C. Wang, L. Mai, W.-S. Ku, and A. Nguyen, “Strike (with) a pose: Neural networks are easily fooled by strange poses of familiar objects,” in Conf. on Computer Vision and Pattern Recognition, 2019.
  • [9] L. Prasad, “Outsmarting the art of camouflage,” Discover Magazine, 2016, available online at http://blogs.discovermagazine.com/crux/2016/11/02/outsmarting-the-art-of-camouflage/.
  • [10] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
  • [11] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet Large Scale Visual Recognition Challenge,” International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
  • [12] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in 2014 CVPR, June 2014, pp. 580–587.
  • [13] R. B. Girshick, “Fast R-CNN,” in International Conference on Computer Vision, 2015, pp. 1440–1448.
  • [14] S. Ren, K. He, R. B. Girshick, and J. Sun, “Faster R-CNN: towards real-time object detection with region proposal networks,” in Conference on Neural Information Processing Systems, 2015, pp. 91–99.
  • [15] T.-N. Le, A. Sugimoto, S. Ono, and H. Kawasaki, “Attention r-cnn for accident detection,” in IEEE Intelligent Vehicles Symposium, 2020.
  • [16] T. Wang, A. Borji, L. Zhang, P. Zhang, and H. Lu, “A stagewise refinement model for detecting salient objects in images,” in International Conference on Computer Vision, Oct 2017.
  • [17] J.-X. Zhao, J.-J. Liu, D.-P. Fan, Y. Cao, J. Yang, and M.-M. Cheng, “Egnet: Edge guidance network for salient object detection,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 8779–8788.
  • [18] Z. Wu, L. Su, and Q. Huang, “Cascaded partial decoder for fast and accurate salient object detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 3907–3916.
  • [19] T. Zhao and X. Wu, “Pyramid feature attention network for saliency detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 3085–3094.
  • [20] X. Qin, Z. Zhang, C. Huang, C. Gao, M. Dehghan, and M. Jagersand, “Basnet: Boundary-aware salient object detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 7479–7489.
  • [21] T.-N. Le and A. Sugimoto, “Semantic instance meets salient object: Study on video semantic salient instance segmentation,” in IEEE Winter Conference on Applications of Computer Vision, 2019, pp. 1779–1788.
  • [22] T. V. Nguyen and J. Sepulveda, “Salient object detection via augmented hypotheses,” in Proceedings of the International Joint Conference on Artificial Intelligence, 2015, pp. 2176–2182.
  • [23] T.-N. Le and A. Sugimoto, “Contrast based hierarchical spatial-temporal saliency for video,” in Pacific-Rim Symposium on Image and Video Technology, 2015, pp. 734–748.
  • [24] T. V. Nguyen, K. Nguyen, and T. Do, “Semantic prior analysis for salient object detection,” IEEE Transactions on Image Processing, vol. 28, no. 6, pp. 3130–3141, 2019.
  • [25] T.-N. Le and A. Sugimoto, “Deeply supervised 3d recurrent fcn for salient object detection in videos,” British Machine Vision Conference, 2017.
  • [26] T. Le and A. Sugimoto, “Video salient object detection using spatiotemporal deep features,” IEEE Transactions on Image Processing, vol. 27, no. 10, pp. 5002–5015, Oct 2018.
  • [27] M. Galun, E. Sharon, R. Basri, and A. Brandt, “Texture segmentation by multiscale aggregation of filter responses and shape elements,” in International Conference on Computer Vision, Oct 2003, pp. 716–723.
  • [28] L. Song and W. Geng, “A new camouflage texture evaluation method based on wssim and nature image features,” in International Conference on Multimedia Technology, Oct 2010, pp. 1–4.
  • [29] F. Xue, C. Yong, S. Xu, H. Dong, Y. Luo, and W. Jia, “Camouflage performance analysis and evaluation framework based on features fusion,” Multimedia Tools and Applications, vol. 75, no. 7, pp. 4065–4082, Apr 2016.
  • [30] Y. Pan, Y. Chen, Q. Fu, P. Zhang, and X. Xu, “Study on the camouflaged target detection method based on 3d convexity,” Modern Applied Science, vol. 5, no. 4, p. 152, 2011.
  • [31] Z. Liu, K. Huang, and T. Tan, “Foreground object detection using top-down information based on em framework,” IEEE Trans. on Image Processing, vol. 21, no. 9, pp. 4204–4217, 2012.
  • [32] A. W. P. Sengottuvelan and A. Shanmugam, “Performance of decamouflaging through exploratory image analysis,” in International Conference on Emerging Trends in Engineering and Technology, July 2008, pp. 6–10.
  • [33] J. Yin, Y. Han, W. Hou, and J. Li, “Detection of the mobile object with camouflage color under dynamic background based on optical flow,” Procedia Engineering, vol. 15, no. Supplement C, pp. 2201 – 2205, 2011.
  • [34] J. Gallego and P. Bertolino, “Foreground object segmentation for moving camera sequences based on foreground-background probabilistic models and prior probability maps,” in ICIP, 2014, pp. 3312–3316.
  • [35] D.-P. Fan, G.-P. Ji, G. Sun, M.-M. Cheng, J. Shen, and L. Shao, “Camouflaged object detection,” in IEEE Conference on Computer Vision and Pattern Recognition, 2020.
  • [36] T.-N. Le, V. Nguyen, C. Le, T.-C. Nguyen, M.-T. Tran, and T. V. Nguyen, “Camoufinder: Finding camouflaged instances in images,” in AAAI Conference on Artificial Intelligence, 2021, pp. 1–4.
  • [37] K. D. Wilson and M. J. Farah, “When does the visual system use viewpoint-invariant representations during recognition?” Cognitive Brain Research, vol. 16, no. 3, pp. 399–415, 2003.
  • [38] E. Burghund and C. Marsolek, “Viewpoint-invariant and viewpoint-dependent object recognition in dissociable neural subsystems,” Psychonomic Bulletin and Review, vol. 7, pp. 480–489, 2000.
  • [39] Y. Li, Z. Pizlo, and R. M. Steinman, “A computational model that recovers the 3d shape of an object from a single 2d retinal representation,” Vision research, vol. 49, no. 9, pp. 979–991, 2009.
  • [40] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in IEEE Conference on Computer Vision and Pattern Recognition, June 2015, pp. 3431–3440.
  • [41] M. Everingham, L. J. V. Gool, C. K. I. Williams, J. M. Winn, and A. Zisserman, “The pascal visual object classes (VOC) challenge,” International Journal of Computer Vision, vol. 88, no. 2, pp. 303–338, 2010.
  • [42] J. Shotton, J. M. Winn, C. Rother, and A. Criminisi, “Textonboost for image understanding: Multi-class object recognition and segmentation by jointly modeling texture, layout, and context,” International Journal of Computer Vision, vol. 81, no. 1, pp. 2–23, 2009.
  • [43] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. H. S. Torr, “Conditional random fields as recurrent neural networks,” in IEEE International Conference on Computer Vision, 2015, pp. 1529–1537.
  • [44] Q. V. Le, M. Ranzato, R. Monga, M. Devin, G. Corrado, K. Chen, J. Dean, and A. Y. Ng, “Building high-level features using large scale unsupervised learning,” in Proceedings of International Conference on Machine Learning, 2012.
  • [45] B. Jiang, R. Luo, J. Mao, T. Xiao, and Y. Jiang, “Acquisition of localization confidence for accurate object detection,” in European Conference on Computer Vision, 2018, pp. 816–832.
  • [46] R. Achanta, S. Hemami, F. Estrada, and S. Susstrunk, “Frequency-tuned salient region detection,” in IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 1597–1604.
  • [47] Y. Jia and M. Han, “Category-independent object-level saliency detection,” in International Conference on Computer Vision, 2013, pp. 1761–1768.
  • [48] D. Fan, C. Gong, Y. Cao, B. Ren, M. Cheng, and A. Borji, “Enhanced-alignment measure for binary foreground map evaluation,” in IJCAI, 2018, pp. 698–704.
  • [49] D.-P. Fan, M.-M. Cheng, Y. Liu, T. Li, and A. Borji, “Structure-measure: A new way to evaluate foreground maps,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 4548–4557.
  • [50] R. Margolin, L. Zelnik-Manor, and A. Tal, “How to evaluate foreground maps?” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 248–255.
  • [51] N. Liu and J. Han, “Dhsnet: Deep hierarchical saliency network for salient object detection,” in IEEE Conference on CVPR, 2016, pp. 678–686.
  • [52] Q. Hou, M.-M. Cheng, X. Hu, A. Borji, Z. Tu, and P. Torr, “Deeply supervised salient object detection with short connections,” in IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • [53] L. Wang, H. Lu, Y. Wang, M. Feng, D. Wang, B. Yin, and X. Ruan, “Learning to detect salient objects with image-level supervision,” in IEEE Conference on Computer Vision and Pattern Recognition, July 2017.
  • [54] Z. Zhou, M. M. R. Siddiquee, N. Tajbakhsh, and J. Liang, “Unet++: A nested u-net architecture for medical image segmentation,” in Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support.   Springer, 2018, pp. 3–11.
  • [55] N. Liu, J. Han, and M.-H. Yang, “Picanet: Learning pixel-wise contextual attention for saliency detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3089–3098.
  • [56] Z. Huang, L. Huang, Y. Gong, C. Huang, and X. Wang, “Mask scoring r-cnn,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 6409–6418.
  • [57] J.-J. Liu, Q. Hou, M.-M. Cheng, J. Feng, and J. Jiang, “A simple pooling-based design for real-time salient object detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 3917–3926.
  • [58] K. Chen, J. Pang, J. Wang, Y. Xiong, X. Li, S. Sun, W. Feng, Z. Liu, J. Shi, W. Ouyang et al., “Hybrid task cascade for instance segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2019, pp. 4974–4983.
  • [59] T. V. Nguyen, Q. Zhao, and S. Yan, “Attentive systems: A survey,” International Journal of Computer Vision, vol. 126, no. 1, pp. 86–110, 2018.