An Effective and Robust Detector for Logo Detection
Abstract.
In recent years, intellectual property (IP), which covers literary works, inventions, artistic works, etc., has gradually attracted more and more attention. In particular, with the rise of e-commerce, IP covers not only product designs and brands, but also the images and videos displayed on e-commerce platforms. Unfortunately, some attackers adopt adversarial methods to fool well-trained logo detection models for infringement. To overcome this problem, a novel logo detector based on the mechanism of looking and thinking twice is proposed in this paper for robust logo detection. Unlike other mainstream detectors, the proposed detector can effectively detect small objects and long-tail objects, and is robust to adversarial images. In detail, we extend the DetectoRS algorithm to a cascade schema with an equalization loss function, multi-scale transformations, and adversarial data augmentation. A series of experimental results show that the proposed method can effectively improve the robustness of the detection model. Moreover, we applied the proposed method to the ACM MM2021 Robust Logo Detection competition organized by Alibaba on the Tianchi platform and placed second among 36,489 teams. Code is available at https://github.com/jiaxiaojunQAQ/Robust-Logo-Detection.
1. Introduction
On e-commerce platforms, a commodity logo is one of the intellectual property rights of commodities (Machado et al., 2015; Bossel et al., 2019). It represents the creations and ideas of business owners. However, some illegal merchants adopt adversarial attack methods to fool well-trained logo detection models for infringement (Trappey et al., 2021). Adversarial examples (Szegedy et al., 2013; Zhao et al., 2020), which are generated by adding indistinguishable perturbations to benign images, can fool a well-trained logo detection model. Hence, we need a robust logo detection model to protect intellectual property (IP). Logo detection belongs to object detection, which aims to recognize and locate objects. A series of object detection methods (Ren et al., 2015; Cai and Vasconcelos, 2018; Qiao et al., 2021; Hu et al., 2021) have been proposed in recent years. These detectors have achieved excellent performance on conventional benchmarks such as PASCAL VOC (Everingham et al., 2010) and MS COCO (Lin et al., 2014). However, they are not suitable for natural images that follow a long-tailed Zipfian distribution in realistic scenarios. Moreover, these models are easily attacked by adversarial examples. For logo detection in a realistic scenario, the model not only needs to detect and recognize logos but also needs to defend against adversarial examples. In detail, the logo detection task faces the following challenges: 1) small-object detection, 2) long-tail category detection, and 3) robust detection on adversarial images.

Small-object detection is one of the difficulties in the field of object detection. For logo detection in a realistic scenario, the average logo occupies less than 1% of the pixels of the whole image. As for the long-tail category distribution, natural images follow a long-tailed Zipfian distribution, in which the annotations of tail classes are insufficient for training a detector. Robust detection on adversarial images is also very important in a realistic scenario, because some illegal merchants adopt adversarial attack methods to fool well-trained logo detection models for infringement. Hence, a robust detection model that can deal with adversarial images is crucial. To the best of our knowledge, there are few studies on the robustness of object detection models against adversarial examples, yet research on the robustness of logo detection against adversarial examples is of great importance in a realistic scenario.
To overcome the above problems, a novel logo detector based on the mechanism of looking and thinking twice is proposed in this paper. More concretely, we extend the DetectoRS algorithm to a cascade schema with an equalization loss function, multi-scale transformations, and adversarial data augmentation. We train our logo detection model on the logo detection database Open Brands (Jin et al., 2020), which consists of 584,920 natural images with 1,303,563 annotations. We use a multi-scale training and testing strategy to improve the model's ability to detect small objects. As for long-tail category detection, we adopt equalization loss v2 (EQL-v2) (Tan et al., 2021) to protect the learning of tail categories. As for detection robustness against adversarial examples, we use adversarial data augmentation to improve the robustness of the detector. Moreover, we find that the multi-scale training and testing strategy not only improves the ability to detect small objects but also improves the robustness against adversarial examples. The detection framework is shown in Fig. 1. Our main contributions can be summarized as follows:
• We propose a novel logo detector based on the mechanism of looking and thinking twice to improve the robustness against adversarial images.
• We find that the multi-scale training and testing strategy not only improves the ability to detect small objects but also improves the robustness against adversarial examples.
• A series of experimental results show that the proposed method can improve the robustness of the detector in the realistic scenario.
The rest of this paper is organized as follows. The proposed algorithms are described in Section 2. The experimental results and analysis are presented in Section 3. Finally, we summarize the work in Section 4.
2. Methodology
In this section, we first introduce the logo detection model proposed in this paper. Then we focus on the three modules designed to solve the problems mentioned above.
2.1. Logo detection model
An effective detection model plays an important role both in defending against adversarial images and in detection performance. We extend the original DetectoRS model to a cascade schema in our experiments. The reason we combine DetectoRS with the cascade architecture is that the two methods share the same mechanism of looking and thinking twice, which is similar to the human vision system. Therefore, we believe this combination may be robust to various adversarial perturbations. The network structure is shown in Fig. 1. In detail, we adopt ResNet-50 (He et al., 2016) as the backbone network. The Recursive Feature Pyramid (RFP) is used as the neck layer, which can reuse and refine the extracted features. It can be defined as:
(1) $f_i = \mathbf{F}_i(f_{i+1}, x_i), \quad x_i = \mathbf{B}_i(x_{i-1}, \mathbf{R}_i(f_i)), \quad i = 1, \ldots, S,$
where $\mathbf{B}_i$ denotes the $i$-th stage of the bottom-up backbone, $\mathbf{F}_i$ denotes the top-down FPN (Ren et al., 2017) operation, $\mathbf{R}_i$ denotes the feature transformations before connecting them back to the bottom-up backbone, $f_i$ is the feature map, and $x_0$ is the input image. Additionally, we use the Switchable Atrous Convolution (SAC) to adaptively select the receptive field, and an attention-based feature fusion mechanism, which is the same as the global context module of SENet (Hu et al., 2018), to enhance the network's representation ability in the proposed detector.
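To make the recursion in Eq. (1) concrete, the following toy sketch unrolls the bottom-up/top-down loop twice with stand-in scalar transforms for $\mathbf{B}_i$, $\mathbf{F}_i$, and $\mathbf{R}_i$; the real detector uses backbone stages, FPN layers, and feature transforms in their place, so everything below is an illustrative placeholder:

```python
def recursive_feature_pyramid(x0, num_stages=3, num_loops=2):
    # Stand-in transforms (plain floats): B = bottom-up stage,
    # F = top-down FPN step, R = transform before feeding features
    # back to the backbone. All are simplified placeholders.
    B = lambda x_prev, fb: 0.5 * x_prev + (0.0 if fb is None else 0.5 * fb)
    F = lambda f_top, x: x + (0.0 if f_top is None else 0.5 * f_top)
    R = lambda f: 0.1 * f

    feats = [None] * num_stages
    for _ in range(num_loops):  # unrolled recursion: "looking and thinking twice"
        # bottom-up pass: x_i = B_i(x_{i-1}, R_i(f_i))
        xs, x = [], x0
        for i in range(num_stages):
            x = B(x, None if feats[i] is None else R(feats[i]))
            xs.append(x)
        # top-down pass: f_i = F_i(f_{i+1}, x_i)
        f_top = None
        for i in reversed(range(num_stages)):
            feats[i] = F(f_top, xs[i])
            f_top = feats[i]
    return feats
```

The second loop re-runs the backbone on features refined by the first pass, which is the essence of the RFP design.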
2.2. Multi-scale training and testing strategy
We adopt the multi-scale training and testing strategy, which is a simple but effective method. We choose this strategy to solve the small-object problem on the one hand and to improve the robustness of the model on the other. For small-object detection, one of the most intuitive solutions is to enlarge the input size of the network. To avoid destroying the receptive field of the detection model, a multi-scale training and testing strategy is used in our experiments. In this way, while accounting for small-object detection, we also improve the detection performance on normal-sized objects. In the experiments, we set the maximum side length of the training and testing scales to 1333, and the short-side scale ranges from 800 to 1100. Moreover, we find that the multi-scale training and testing strategy can also improve the robustness against adversarial examples. This is because multi-scale image transformations can break up the particular structure of the adversarial perturbations. Some studies (Song et al., 2017; Jia et al., 2019) have shown that breaking up the adversarial perturbation structure is an effective way to defend against adversarial examples.
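A minimal sketch of the scale sampling described above (short side drawn from 800-1100, long side capped at 1333); the function name and exact sampling scheme are illustrative, and the actual resizing is left to the data pipeline:

```python
import random

def sample_train_scale(h, w, short_range=(800, 1100), max_long=1333):
    """Pick a target (h, w) so the short side matches a random value in
    short_range while the long side never exceeds max_long, preserving
    the aspect ratio. Illustrative sketch, not the pipeline's own code."""
    target_short = random.randint(*short_range)
    short, long = min(h, w), max(h, w)
    # scale to the sampled short side, then shrink further if the
    # long side would exceed the cap
    scale = min(target_short / short, max_long / long)
    return round(h * scale), round(w * scale)
```

At test time, the same resize is applied at several fixed scales and the detections are merged, which is the standard multi-scale testing recipe.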
2.3. Equalization loss
The cross-entropy loss has an inhibitory effect on non-ground-truth classes: when calculating the gradient, it produces a negative gradient that suppresses the other categories. A long-tail category is affected by the negative gradients from all other categories, and their accumulation can exceed the positive gradient the category generates itself, making it hard for the detection model to learn the features of long-tail categories. A simple but effective remedy is to directly ignore the loss terms that have a negative effect on the long-tail categories when calculating the loss. In this way, the detection model can learn better features for long-tail categories. It can be simply defined as:
(2) $L = -\sum_{j=1}^{C} w_j \log(p_j),$
where $C$ represents the number of categories, $w_j$ represents the weight of category $j$, and $p_j$ represents the probability of category $j$ predicted by the network for the current proposal.
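The ignore idea in Eq. (2) can be sketched as a per-category sigmoid cross-entropy in which the weight $w_j$ is set to 0 for negative samples of tail categories; this is an illustrative simplification, not the exact loss used in the detector:

```python
import numpy as np

def masked_sigmoid_ce(logits, labels, tail_mask):
    """Per-category cross-entropy where tail categories receive no
    suppressing (negative-sample) gradient.
    logits, labels: (C,) arrays; tail_mask: (C,) bool, True for tail classes.
    Illustrative sketch only."""
    p = 1.0 / (1.0 + np.exp(-logits))
    # w_j = 0 only when category j is a tail class AND this is a negative sample
    w = np.where(tail_mask & (labels == 0), 0.0, 1.0)
    ce = -(labels * np.log(p + 1e-12) + (1 - labels) * np.log(1 - p + 1e-12))
    return float(np.sum(w * ce))
```

Zeroing $w_j$ removes exactly the negative-gradient terms that would otherwise accumulate against rare classes.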
However, this loss function ignores the influence of the background candidate regions. To overcome this problem, we use equalization loss v2 (Tan et al., 2021) as the classification loss function in the detection model. Equalization loss v2 is based on gradient guidance: it weights the positive and negative gradients according to the cumulative ratio of the positive gradients to the negative gradients. The positive weights and negative weights can be defined as:
(3) $q_j^{(t)} = 1 + \alpha \left(1 - f\!\left(g_j^{(t)}\right)\right), \quad r_j^{(t)} = f\!\left(g_j^{(t)}\right),$
where $t$ represents the number of iterations and $f(x) = \frac{1}{1 + e^{-\gamma (x - \mu)}}$.
After the positive and negative gradient weights are obtained, the current gradient values can be updated separately. They can be defined as:
(4) $\nabla_{z_j}^{pos\,\prime}\big(\mathcal{L}^{(t)}\big) = q_j^{(t)}\, \nabla_{z_j}^{pos}\big(\mathcal{L}^{(t)}\big), \quad \nabla_{z_j}^{neg\,\prime}\big(\mathcal{L}^{(t)}\big) = r_j^{(t)}\, \nabla_{z_j}^{neg}\big(\mathcal{L}^{(t)}\big),$
where $z_j$ represents the output of the classifier, and $\mathcal{L}$ represents the loss function. Moreover, the cumulative ratio of positive to negative gradients at time $t$ can be defined as $g_j^{(t)} = \frac{\sum_{\tau=0}^{t-1} \left|\nabla_{z_j}^{pos\,\prime}(\mathcal{L}^{(\tau)})\right|}{\sum_{\tau=0}^{t-1} \left|\nabla_{z_j}^{neg\,\prime}(\mathcal{L}^{(\tau)})\right|}$.
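The re-weighting of Eqs. (3)-(4) can be sketched as follows; the hyper-parameter values $\alpha$, $\gamma$, and $\mu$ below are illustrative defaults, not necessarily the ones used in our experiments:

```python
import math

def eqlv2_weights(g, alpha=4.0, gamma=12.0, mu=0.8):
    """Compute the EQL-v2 style weights for one category, given g, the
    accumulated ratio of its positive to negative gradients.
    Uses the mapping f(x) = 1 / (1 + exp(-gamma * (x - mu))).
    Returns (q, r): weights for the positive and negative gradients.
    Illustrative sketch; hyper-parameters are assumed defaults."""
    f = 1.0 / (1.0 + math.exp(-gamma * (g - mu)))
    q = 1.0 + alpha * (1.0 - f)  # boost positive gradients of suppressed (tail) classes
    r = f                        # down-weight their negative gradients
    return q, r
```

A tail category with a small gradient ratio $g$ thus gets a large positive weight and a small negative weight, which counteracts the accumulated suppression described above.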
2.4. Adversarial data augmentation
In addition to the multi-scale input transformation strategy, we also simulate the generation of adversarial attack noise and enhance the robustness of the detection model through adversarial data augmentation. In a realistic scenario, illegal merchants usually generate adversarial examples under unrestricted attacks to fool the well-trained logo detection model for infringement. Considering this, we first generate adversarial examples by adding Gaussian noise, adding rain, adding fog, blurring the image, etc. Then we inject these adversarial examples into the training dataset and train the detection model on it. In this way, the robustness against adversarial images is improved.
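As a sketch, the Gaussian-noise corruption above could be injected as follows (rain, fog, and blur are added analogously); the function name and noise level are illustrative, not the exact augmentation pipeline used in training:

```python
import numpy as np

def augment_with_noise(img, sigma=15.0, rng=None):
    """Add zero-mean Gaussian noise to a uint8 HxWxC image and clip back
    to the valid range. Illustrative sketch of one corruption type."""
    if rng is None:
        rng = np.random.default_rng(0)  # fixed seed only for this demo
    noisy = img.astype(np.float32) + rng.normal(0.0, sigma, size=img.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)
```

Each corrupted copy keeps the original bounding-box annotations, since these pixel-level corruptions do not move the objects.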

3. Experiment
We used the proposed detector in the ACM MM2021 Security AI Challenger Phase 7: Robust Logo Detection competition and achieved an evaluation score of 0.650807, the second-highest result among all participating teams. Some detection results are shown in Fig. 2. These images show obvious signs of manipulation from various unknown adversarial operations that would disable an ordinary detector, yet our detector remains robust to those apparent perturbations. The competition result demonstrates the rationality and effectiveness of our method. In the following, we report ablation experiments that analyze and record the changes in the detection performance of our algorithm.
In the competition, the mAP indicator is used to evaluate detection performance. The mAP index is the average of the AP metrics over all detection categories. AP computes the average precision over recall values ranging from 0 to 1, i.e., the area under the precision-recall curve, so the higher the mAP value, the better the detection performance, and vice versa. A detection is considered a true positive if the area overlap ratio (IoU) between the predicted bounding box and the ground-truth bounding box exceeds a preset threshold. To evaluate the detection performance more strictly and accurately, multiple IoU thresholds are selected in the final evaluation index.
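The true-positive test described above reduces to an IoU check; a minimal sketch (the threshold value is configurable, and box coordinates are assumed to be (x1, y1, x2, y2)):

```python
def iou(box_a, box_b):
    """Area overlap ratio (intersection-over-union) of two boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def is_true_positive(pred, gt, thresh=0.5):
    """A prediction counts as a true positive when its IoU with the
    ground-truth box meets the preset threshold."""
    return iou(pred, gt) >= thresh
```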
There are 50,472 test images, all polluted with image disturbances. As shown in Fig. 2, the adversarial perturbations are obvious, and these adversarial images can invalidate common detectors such as Faster RCNN (Ren et al., 2017), RetinaNet (Lin et al., 2020), and Libra RCNN (Pang et al., 2019). To show the effectiveness of the proposed detector, ablation studies on multi-scale training/testing, equalization loss, and data augmentation are shown in Fig. 3. There are five variant methods: cascade RCNN with a ResNet-50 backbone (Cai and Vasconcelos, 2018); cascade RCNN with SAC and RFP (cascade DetectoRS); cascade DetectoRS with multi-scale training/testing; cascade DetectoRS with multi-scale training/testing and equalization loss; and cascade DetectoRS with multi-scale training/testing, equalization loss, and data augmentation. We train all models for 20 epochs with the learning rate multiplied by 0.01 after epochs 8, 12, and 16.

It is obvious that all our improvements are effective. Cascade DetectoRS is about 8% ahead of cascade RCNN. Multi-scale training and testing leads to an approximately 3.9% increase in detection performance. Remarkably, equalization loss brings about a 16.1% performance improvement. The improvement from data augmentation is less obvious than the other schemes, but it is still effective. In addition, replacing standard NMS with Soft-NMS (Bodla et al., 2017) in our experiments also brings an obvious improvement. We conjecture that Soft-NMS retains more small objects and thus improves the recall values. Therefore, we can conclude that the proposed detector is effective and can deal with adversarial images well.
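For reference, the Gaussian variant of Soft-NMS decays the scores of overlapping boxes instead of discarding them, which is why more small, crowded objects survive; a minimal sketch (the decay parameter `sigma` and cutoff `min_score` are illustrative defaults):

```python
import math

def _iou(a, b):
    # intersection-over-union of two (x1, y1, x2, y2) boxes
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

def soft_nms(boxes, scores, sigma=0.5, min_score=0.001):
    """Gaussian Soft-NMS sketch: keep every box, but decay the score of
    each remaining box by exp(-iou^2 / sigma) against the current best."""
    dets = sorted(zip(boxes, scores), key=lambda d: -d[1])
    kept = []
    while dets:
        best = dets.pop(0)  # highest remaining score wins
        kept.append(best)
        decayed = [(b, s * math.exp(-_iou(best[0], b) ** 2 / sigma))
                   for b, s in dets]
        dets = sorted((d for d in decayed if d[1] > min_score),
                      key=lambda d: -d[1])
    return kept
```

Unlike hard NMS, a box with high overlap is merely demoted, so it can still be kept if no stronger detection explains that region.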
4. Conclusion
In this paper, a novel logo detector based on the mechanism of looking and thinking twice is proposed for robust logo detection against adversarial images in a realistic scenario. Specifically, we adopt three effective strategies to overcome the difficulties that a logo detection model meets in the realistic scenario. In detail, the multi-scale training and testing strategy is used to improve the model's ability to detect small objects; equalization loss is used to protect the learning of tail categories; and adversarial data augmentation is used to improve the robustness against adversarial examples. Moreover, we find that the multi-scale training and testing strategy can also improve the robustness of the detector. A series of experimental results show that the proposed method can effectively improve the robustness of the detector.
References
- Bodla et al. (2017) Navaneeth Bodla, Bharat Singh, Rama Chellappa, and Larry S. Davis. 2017. Soft-NMS — Improving Object Detection with One Line of Code. In 2017 IEEE International Conference on Computer Vision (ICCV). 5562–5570.
- Bossel et al. (2019) Vera Bossel, Kelly Geyskens, and Caroline Goukens. 2019. Facing a trend of brand logo simplicity: The impact of brand logo design on consumption. Food Quality and Preference 71 (2019), 129–135.
- Cai and Vasconcelos (2018) Zhaowei Cai and Nuno Vasconcelos. 2018. Cascade r-cnn: Delving into high quality object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition. 6154–6162.
- Everingham et al. (2010) Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. 2010. The pascal visual object classes (voc) challenge. International journal of computer vision 88, 2 (2010), 303–338.
- He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 770–778.
- Hu et al. (2018) Jie Hu, Li Shen, Samuel Albanie, Gang Sun, and Enhua Wu. 2018. Squeeze-and-Excitation Networks. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vol. 42. 2011–2023.
- Hu et al. (2021) Miao Hu, Yali Li, Lu Fang, and Shengjin Wang. 2021. A2-FPN: Attention Aggregation Based Feature Pyramid Network for Instance Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 15343–15352.
- Jia et al. (2019) Xiaojun Jia, Xingxing Wei, Xiaochun Cao, and Hassan Foroosh. 2019. Comdefend: An efficient image compression model to defend adversarial examples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 6084–6092.
- Jin et al. (2020) Xuan Jin, Wei Su, Rong Zhang, Yuan He, and Hui Xue. 2020. The Open Brands Dataset: Unified brand detection and recognition at scale. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 4387–4391.
- Lin et al. (2020) Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollar. 2020. Focal Loss for Dense Object Detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 42, 2 (2020), 318–327.
- Lin et al. (2014) Tsung-Yi Lin, Michael Maire, Serge J. Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common Objects in Context. In European Conference on Computer Vision. 740–755.
- Machado et al. (2015) Joana Cesar Machado, Leonor Vacas de Carvalho, Anna Torres, and Patrício Costa. 2015. Brand logo design: examining consumer response to naturalness. Journal of Product & Brand Management (2015).
- Pang et al. (2019) Jiangmiao Pang, Kai Chen, Jianping Shi, Huajun Feng, Wanli Ouyang, and Dahua Lin. 2019. Libra R-CNN: Towards Balanced Learning for Object Detection. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 821–830.
- Qiao et al. (2021) Siyuan Qiao, Liang-Chieh Chen, and Alan Yuille. 2021. DetectoRS: Detecting Objects with Recursive Feature Pyramid and Switchable Atrous Convolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10213–10224.
- Ren et al. (2015) Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in neural information processing systems 28 (2015), 91–99.
- Ren et al. (2017) Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2017. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 39, 6 (2017), 1137–1149.
- Song et al. (2017) Yang Song, Taesup Kim, Sebastian Nowozin, Stefano Ermon, and Nate Kushman. 2017. Pixeldefend: Leveraging generative models to understand and defend against adversarial examples. arXiv preprint arXiv:1710.10766 (2017).
- Szegedy et al. (2013) Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. 2013. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199 (2013).
- Tan et al. (2021) Jingru Tan, Xin Lu, Gang Zhang, Changqing Yin, and Quanquan Li. 2021. Equalization Loss v2: A New Gradient Balance Approach for Long-tailed Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1685–1694.
- Trappey et al. (2021) Amy JC Trappey, Charles V Trappey, and Samuel Shih. 2021. An intelligent content-based image retrieval methodology using transfer learning for digital IP protection. Advanced Engineering Informatics 48 (2021), 101291.
- Zhao et al. (2020) Yusheng Zhao, Huanqian Yan, and Xingxing Wei. 2020. Object Hider: Adversarial Patch Attack Against Object Detectors. arXiv preprint arXiv:2010.14974 (2020).