
Exploring Color Invariance through Image-Level Ensemble Learning

Yunpeng Gong1    Jiaquan Li1    Lifei Chen2    Min Jiang1🖂 1School of Informatics, Xiamen University, China
2College of Computer and Cyber Security, Fujian Normal University, China  [email protected],  [email protected],  [email protected],  [email protected]
Abstract

In the field of computer vision, the persistent presence of color bias, resulting from fluctuations in real-world lighting and camera conditions, presents a substantial challenge to the robustness of models. This issue is particularly pronounced in complex wide-area surveillance scenarios, such as person re-identification and industrial dust segmentation, where models often overfit to color information during training and consequently degrade under environmental variations. There is therefore a need to effectively adapt models to cope with the complexities of camera conditions. To address this challenge, this study introduces a learning strategy named Random Color Erasing, which draws inspiration from ensemble learning. This strategy selectively erases partial or complete color information in the training data without disrupting the original image structure, thereby achieving a balanced weighting of color features and other features within the neural network. This approach mitigates the risk of overfitting and enhances the model’s ability to handle color variation, thereby improving its overall robustness. The proposed approach serves as an ensemble learning strategy with strong interpretability, and a comprehensive analysis of it is presented in this paper. Across various tasks such as person re-identification and semantic segmentation, our approach consistently improves strong baseline methods. Notably, in comparison to existing methods that prioritize color robustness, our strategy significantly enhances performance in cross-domain scenarios. The code is available at https://github.com/layumi/Person_reID_baseline_pytorch/blob/master/random_erasing.py or https://github.com/finger-monkey/Data-Augmentation.

1 Introduction

With the rapid advancement of computer vision and deep learning techniques, wide-area surveillance Yuan et al. (2023); Gong et al. (2022); Yunpeng Gong (2024) systems are crucial in various applications, such as public safety and industrial monitoring. Wide-area surveillance involves the monitoring and analysis of large-scale environments through diverse camera setups, providing critical information for decision-making and security. One of the primary challenges faced by deep learning models in such scenarios is the presence of color deviation arising from varying camera environments.

In wide-area surveillance scenarios, models are susceptible to color domain deviations due to diverse camera settings, complex lighting, and extensive scene coverage. These deviations are exacerbated by variations in lighting and environmental conditions, impacting color representation Zhong et al. (2018b). These problems can lead to model instability and compromise performance in tasks such as person re-identification and industrial dust segmentation.

Figure 1: The retrieval results of the model trained with visible (RGB) image and the model trained with grayscale image on the Market1501 dataset. The numbers on the images indicate the rank of similarity in the retrieval results, the red and green numbers denote the wrong and correct results, respectively.
Figure 2: The figure illustrates two approaches for learning color invariance. The left panel depicts an ensemble whose component models are trained separately on samples with and without color information. On the right, our proposed image-level ensemble learning strategy, Random Color Erasing, achieves color invariance by applying random local homogeneous grayscale transformations, integrating samples with and without color information within a single model and thereby achieving an effect equivalent to an ensemble of models.

1.0.1 Person Re-identification

Person re-identification (ReID) is a challenging computer vision task focused on recognizing individuals across camera views Zhou et al. (2023). The core challenge of ReID lies in the variations introduced by pose changes, camera viewpoints, occlusions, lighting conditions, etc., which result in significant intra-class differences Zheng et al. (2019). Such variation often causes substantial changes in the appearance of the same pedestrian across images, and hence large metric distance differences. Consequently, ReID models may exhibit an excessive reliance on color information, which is often the most salient and easily distinguishable feature, as emphasized by Gong et al. (2022). This reliance stems from the difficulty of feature matching under such variations.

There is no doubt that color features are important discriminative features, but they can prevent the model from making correct predictions in some cases. Fig. 1 demonstrates that the color deviation between query and gallery images affects retrieval outcomes and, interestingly, some samples exhibit improved retrieval results when color information is ignored.

1.0.2 Semantic Segmentation

In specific industrial production settings such as construction sites, mining fields, and manufacturing factories, dust (or smoke) represents a prevalent issue, significantly impacting construction progress and the health of workers.

Industrial dust (or smoke) segmentation Yuan et al. (2023) poses another highly challenging task in the realm of wide-area surveillance scenarios. Smoke or dust exhibits dynamic morphological variations as it flows and diffuses in the air due to its minute particles and gaseous nature. Its shape and distribution are influenced by surrounding environments, airflow, and temperature, leading to irregular and unstable characteristics in spatial representation. These factors also make the model overly reliant on color information, resulting in insufficient robustness to color deviations. Owing to these challenges, boundary regions often present difficulties for segmentation models, which tend to exhibit prediction errors there, leading to suboptimal performance in practical applications.

To tackle the aforementioned challenge, this paper introduces an innovative strategy known as Random Color Erasing (RCE). RCE is devised to restore the balance between color features and other crucial discriminative features within a neural network, and it can take several forms. RCE involves randomly selecting an area within an RGB image and replacing its pixels with the corresponding area of the grayscale image (or vice versa, selecting an area within a grayscale image and substituting its pixels with those from the equivalent region of the RGB image). This process generates new training samples with varying degrees of homogeneously mixed domains, contributing to model robustness during training. Compared with existing techniques based on Generative Adversarial Networks (GANs), the proposed approach is more efficient and effective: it avoids introducing additional noise while conserving substantial computational resources.

In addition, this paper analyzes the relationship between the proposed RCE and the generalization ability of neural networks from the perspective of classification, and reveals the intrinsic reasons why networks trained with RCE may outperform ordinary networks. Experiments show that the proposed method increases the robustness of the model to color deviation and bridges the domain gap between different datasets.

The main contributions of this paper are summarized as follows:

• This paper introduces Random Color Erasing (RCE), which enhances color-invariant learning in varied visual scenarios. This strategy effectively mitigates overfitting, thereby improving the model’s generalization ability.

• This paper analyzes, from the perspective of classification, why a network trained with RCE may outperform an ordinary network.

• Extensive experiments on two distinct visual tasks and five baseline models with diverse architectures demonstrate the effectiveness of the proposed ensemble learning strategy. In particular, our method shows significant potential in cross-domain testing, surpassing traditional approaches, and provides a novel perspective for subsequent research.

2 Related Works

Over the years, ReID research has been devoted to seeking effective solutions that enhance the model’s robustness to color variations. Numerous methods have been proposed to this end, and they can be broadly categorized into classical approaches and GAN-based methods.

2.1 Classical Methods

To address the problem of color deviation, the early work Li et al. (2014) used filters and a maximum grouping layer to learn illumination transformations, divided pedestrian images into smaller patches to calculate similarity, and uniformly handled misalignment, occlusion, and illumination variation within a deep neural network. Liao et al. (2015) performed pre-processing before feature extraction, using the multiscale Retinex algorithm to enhance the color information of shadowed regions and thereby alleviate the color changes caused by varying lighting conditions.

Many data augmentation methods, such as random cropping and flipping, are widely used in computer vision, and different methods address distinct issues. CutMix Yun et al. (2019) replaces one patch of an image with a patch from another image; it improves the robustness and generalization of deep learning models by encouraging them to learn from mixed and diverse training samples. AutoAugment Cubuk et al. (2019) is an automated data augmentation strategy that incorporates a series of augmentations such as image shear transformations, color adjustments, and brightness/contrast adjustments. Co-Mixup Kim et al. (2021) combines CutMix and Mixup Guo et al. (2019) to enhance model performance and generalization in deep learning tasks. Random erasing and Cutout Zhong et al. (2020); DeVries and Taylor (2017) add noise blocks to the image to regularize the network, which also helps address the occlusion problem.

Please note that our approach differs fundamentally in motivation and effect from the methods mentioned above. For instance, random erasing primarily aims to address the drop in generalization performance when the target is occluded. Although this method appears to eliminate color, it actually disrupts the structural information of the image directly. Consequently, the model cannot learn color invariance by relating corresponding regions with and without color; in fact, when the camera style changes, it may even degrade the model’s original performance. This observation is further validated in our cross-domain experiments.

With the increasing maturity of GANs, GAN-based approaches for data augmentation have become an active research field.

2.2 GAN-Based Methods

In ReID, GAN-based methods Gu et al. (2022); Zheng et al. (2019); Zhong et al. (2018b); Liu et al. (2018) aim to mitigate the effects of pedestrian clothing changes, color deviation, or human-pose variation, and to improve the robustness of the model by learning invariant features from variations of the input. The appearance details and emphases of different GAN-based methods differ, but they all aim to compensate for the difference between the source and target domains. For example, CamStyle Zhong et al. (2018b) generates new data by transferring different camera styles so as to learn features invariant across cameras, increasing the robustness of the model to camera style changes. CycleGAN Zhu et al. (2017) was applied in Deng et al. (2018); Zhong et al. (2019) to transfer pedestrian image styles from one dataset to another. StarGAN Choi et al. (2018) was used by Zhong et al. (2018a) to generate pedestrian images with different camera styles. Wei et al. (2018) proposed PTGAN to achieve pedestrian image transfer across different ReID datasets: it uses semantic segmentation to extract foreground masks to assist style transfer, converting the background into the desired style of the target dataset while keeping the foreground unchanged. Different from global style transfer, DGNet Zheng et al. (2019) utilizes GANs to transfer clothing color among different pedestrians, generating more diverse data to reduce the impact of color changes on the model, which effectively improves its generalization ability. CCFA Han et al. (2023) is a variant of DGNet; in contrast to DGNet, which primarily emphasizes learning color invariance, CCFA places greater emphasis on the insufficient diversity of clothing styles within the training data. Therefore, in the experiments conducted in this paper, our primary focus is to compare with DGNet.

Up to now, GAN-based methods in ReID are considered the optimal choice for enhancing the model’s robustness to color variations. By exploring color invariance in ensemble learning, we now present a novel solution in our research.

Algorithm 1 Random Color Erasing

Input: Input image $x_{i}^{v}$; global erasing probability $p_{g}$; erasing probability $p_{r}$.

Output: Color-erased image $I$.

1:  Initialize $p \leftarrow \mathrm{Rand}(0, 1)$.
2:  if $p \geq p_{r}$ then
3:     return $x_{i}^{v}$.
4:  else
5:     $p \leftarrow \mathrm{Rand}(0, 1)$.
6:     if $p \leq p_{g}$ then
7:        return $I \leftarrow t(x_{i}^{v})$.
8:     else
9:        $x_{i}^{g} \leftarrow t(x_{i}^{v})$.
10:       $rect \leftarrow \mathrm{RandPosition}(x_{i}^{v})$.
11:       $I \leftarrow LT(x_{i}^{v}, x_{i}^{g}, rect)$.
12:       return $I$.
13:    end if
14: end if

3 Proposed Methods

The proposed strategy RCE includes both global and local color erasing; global color erasing can be viewed as a special case of random color erasing. The corresponding analysis of the proposed method is provided at the end of this section. A diagram of the proposed method is presented in Fig. 2, and the procedure of RCE is outlined in Alg. 1.

3.1 Global Color Erasing

During data loading, M images per person and K identities are randomly sampled to constitute a training batch of size $B = M \times K$. The batch is denoted as $x^{v} = \{x_{i}^{v} \mid i = 1, 2, \ldots, M \times K\}$.

The proposed method randomly performs global grayscale transformation on the training batch with a probability, and then inputs the processed images into the model for training. This grayscale transformation process can be defined as:

$x_{i}^{g} = t(x_{i}^{v})$   (1)

where $t(\cdot)$ is the grayscale conversion function, implemented as a pixel-by-pixel accumulation over the R, G, and B channels of the original visible RGB image.
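A common instantiation of $t(\cdot)$, assumed here for concreteness (the paper only specifies a per-pixel accumulation over the three channels), is the ITU-R BT.601 weighting, with the resulting luminance replicated across the R, G, and B channels:

$t(x_{i}^{v})_{(h,w)} = 0.299\,R_{(h,w)} + 0.587\,G_{(h,w)} + 0.114\,B_{(h,w)}$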

3.2 Local Color Erasing

Figure 3: Analysis of the Random Color Erasing strategy. Assuming that the predictions of the component networks are combined by majority voting, each component network is regarded as a classifier. The component predictions are combined via weighted voting for classification, where the weights are determined by the algorithms themselves. Some of the classifiers are trained using visible images, while others are trained using grayscale images.

The local color erasing for each visible image $x_{i}^{v}$ can be achieved by the following equations:

$rect = \mathrm{RandPosition}(x_{i}^{v}),$   (2)
$x_{i}^{lg} = LT(x_{i}^{v}, x_{i}^{g}, rect).$   (3)

where the function $LT(\cdot)$ can be represented as:

$LT(x_{i}^{v}, x_{i}^{g}, rect) = x_{i}^{v} - x_{i}^{v}(rect) + x_{i}^{g}(rect).$   (4)

$\mathrm{RandPosition}(\cdot)$ generates a random rectangle within the image, and $LT(\cdot)$ replaces the pixels inside this rectangle of the visible image $x_{i}^{v}$ with the pixels at the corresponding position in the grayscale image $x_{i}^{g}$. $x_{i}^{lg}$ is the resulting sample after the local grayscale transformation.

During model training, we apply the local grayscale transformation $LT(\cdot)$ to the training batch at random with a certain probability. For an image $x_{i}^{v}$ within a batch, $p_{r}$ denotes the probability of applying color erasing to $x_{i}^{v}$. In this process, a rectangular region of the image is randomly selected and replaced with the pixels from the corresponding region of its grayscale counterpart, which introduces varying levels of grayscale information into the training images while preserving the structural integrity of the objects. For the specific details of randomly selecting rectangular regions, please refer to random erasing Zhong et al. (2020); we adopt a similar procedure and parameter settings.
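For concreteness, the following is a minimal PyTorch-style sketch of Alg. 1 operating on CHW float tensors. The grayscale weights and the rectangle-sampling bounds (taken from the defaults of random erasing Zhong et al. (2020)) are assumptions for illustration rather than the exact settings of our implementation.

import math
import random
import torch

def to_grayscale(img):
    # t(x): per-pixel accumulation of the R, G, B channels (BT.601 weights assumed),
    # replicated to three channels so the tensor shape is unchanged.
    gray = 0.299 * img[0] + 0.587 * img[1] + 0.114 * img[2]
    return gray.unsqueeze(0).expand_as(img).clone()

def rand_position(img, s_l=0.02, s_h=0.4, r_1=0.3):
    # RandPosition(x): sample a rectangle by target area ratio and aspect ratio,
    # following the sampling scheme of random erasing.
    _, h, w = img.shape
    for _ in range(100):                       # retry until the rectangle fits
        area = random.uniform(s_l, s_h) * h * w
        ratio = random.uniform(r_1, 1.0 / r_1)
        rh = int(round(math.sqrt(area * ratio)))
        rw = int(round(math.sqrt(area / ratio)))
        if 0 < rh < h and 0 < rw < w:
            y0 = random.randint(0, h - rh)
            x0 = random.randint(0, w - rw)
            return y0, x0, rh, rw
    return 0, 0, h, w                          # fallback: the whole image

def random_color_erasing(img, p_r=0.4, p_g=0.15):
    # Alg. 1: keep the RGB image with probability 1 - p_r; otherwise erase color
    # globally (probability p_g) or locally within a random rectangle.
    if random.random() >= p_r:
        return img
    gray = to_grayscale(img)
    if random.random() <= p_g:
        return gray                            # global color erasing
    y0, x0, rh, rw = rand_position(img)        # local color erasing
    out = img.clone()
    out[:, y0:y0 + rh, x0:x0 + rw] = gray[:, y0:y0 + rh, x0:x0 + rw]
    return out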

3.3 Analysis of Random Color Erasing Strategy

Assume the task is to approximate a function $f: Q^{m} \rightarrow Y$ using an ensemble composed of $N$ component neural networks, where $Y$ is the set of class labels, and the predictions of the component networks are combined by majority voting: each component network votes for a class, and the class label receiving the largest number of votes is taken as the output of the ensemble. For convenience of discussion, assume that $Y$ contains only two class labels. The set of two class labels is usually denoted as $\{0, 1\}$, but for ease of derivation we use $\{-1, +1\}$ instead, so the function to be approximated is $f: Q^{m} \rightarrow \{-1, +1\}$. Note that the following derivation can also be extended to the case where $Y$ contains more than two class labels.

Assume there are $k$ samples with expected output $O = [o_{1}, o_{2}, \ldots, o_{k}]^{T}$, where $o_{j}$ denotes the expected output on the $j$-th instance, and the actual output of the $i$-th component neural network is $F_{i} = [f_{i1}, f_{i2}, \ldots, f_{ik}]^{T}$, where $f_{ij}$ denotes the actual output of the $i$-th component network on the $j$-th sample. $O$ and $F_{i}$ satisfy $o_{j} \in \{-1, +1\}\ (j = 1, 2, \ldots, k)$ and $f_{ij} \in \{-1, +1\}\ (i = 1, 2, \ldots, N;\ j = 1, 2, \ldots, k)$, respectively. If the actual output of the $i$-th component network on the $j$-th sample agrees with the expected output then $f_{ij} o_{j} = +1$; otherwise $f_{ij} o_{j} = -1$. Thus the generalization error of the $i$-th component neural network on these $k$ samples is:

$E_{i} = \frac{1}{k} \sum_{j=1}^{k} Error(f_{ij} o_{j})$   (5)

where $Error(\cdot)$ is a function defined as:

$Error(x) = \begin{cases} 1, & \text{if } x = -1 \\ 0, & \text{if } x = +1 \end{cases}$   (6)

Here we introduce a vector $S = [S_{1}, S_{2}, \ldots, S_{k}]^{T}$, where $S_{j}$ denotes the sum of the actual outputs of all the component neural networks on the $j$-th sample:

$S_{j} = \sum_{i=1}^{N} f_{ij}$   (7)

Then the output of the neural network ensemble on the j-th sample is:

$\hat{f_{j}} = Sgn(S_{j})$   (8)

where $\hat{f}_{j} \in \{-1, 0, +1\}\ (j = 1, 2, \ldots, k)$, and $Sgn(\cdot)$ is a function defined as:

$Sgn(x) = \begin{cases} 1, & \text{if } x > 0 \\ 0, & \text{if } x = 0 \\ -1, & \text{if } x < 0 \end{cases}$   (9)

If the output of the ensemble on the $j$-th sample agrees with the expected output then $\hat{f_{j}} o_{j} = +1$; if it is wrong then $\hat{f}_{j} o_{j} = -1$. Otherwise $\hat{f}_{j} o_{j} = 0$, which indicates a tie on the $j$-th sample: half of the component networks vote for +1 while the other half vote for -1. Thus the generalization error of the ensemble is:

$\hat{E} = \frac{1}{k} \sum_{j=1}^{k} Error(\hat{f_{j}} o_{j})$   (10)

Now suppose that the $e$-th component neural network is instead trained using grayscale images. The output of the new ensemble on the $j$-th instance is:

$\hat{f_{j}}' = Sgn(S_{j(i \neq e)} + f_{ej})$   (11)

where $S_{j(i \neq e)} = \sum_{i \neq e} f_{ij}$ excludes the original $e$-th component and $f_{ej}$ now denotes the output of the grayscale-trained $e$-th network,

and the generalization error of the new ensemble is:

$\hat{E}' = \frac{1}{k} \sum_{j=1}^{k} Error(\hat{f_{j}}' o_{j})$   (12)

Assume that including a certain number of component networks without color information does not impair the overall performance of the ensemble, and that neglecting color information may even lead to improved retrieval results for certain samples, i.e.,

$Error(\hat{f_{j}}' o_{j}) \leqslant Error(\hat{f_{j}} o_{j})$   (13)

From Eq. (10) and Eq. (12) we can derive that if Eq. (13) is satisfied then $\hat{E}$ is not smaller than $\hat{E}'$, i.e.:

$\sum_{j=1}^{k} \left\{ Error(Sgn(S_{j}) o_{j}) - Error(Sgn(S_{j(i \neq e)} + f_{ej}) o_{j}) \right\} \geq 0$   (14)

Eq. (14) indicates that the ensemble including the $e$-th component neural network trained on grayscale images is better than the one that does not include it, as shown in Fig. 3.

We can now conclude that, in the context of classification, when a number of neural networks are available, using only a part of them to fit color features may be better than having all of them fit color features.
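As a purely illustrative check of the voting formulation in Eqs. (5)-(10), the following toy simulation with synthetic $\pm 1$ predictions shows how the component errors $E_{i}$ and the ensemble error $\hat{E}$ are computed; the accuracies and sizes are arbitrary assumptions and do not correspond to any experiment in this paper.

import numpy as np

rng = np.random.default_rng(0)
k, N = 1000, 5                                   # k samples, N component networks
o = rng.choice([-1, 1], size=k)                  # expected outputs o_j

# Component outputs f_ij: each network agrees with o_j with probability acc[i].
acc = [0.7, 0.7, 0.7, 0.7, 0.7]
F = np.array([np.where(rng.random(k) < a, o, -o) for a in acc])

S = F.sum(axis=0)                                # Eq. (7): S_j
f_hat = np.sign(S)                               # Eq. (8): majority vote (N is odd, so no ties)
E_i = (F * o == -1).mean(axis=1)                 # Eq. (5): per-network generalization errors
E_hat = (f_hat * o == -1).mean()                 # Eq. (10): ensemble generalization error
print(E_i, E_hat)                                # the ensemble error is typically below each E_i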

4 Experimental Comparison and Analysis

In this section, we conducted a sequence of experiments involving two distinct visual tasks, employing five baseline models with varying architectures, and evaluating their performance across five diverse datasets. Additionally, we assessed the effectiveness of six different data augmentation strategies.

4.1 Person Re-identification Datasets and Evaluation Criteria

We evaluate our method on three ReID datasets: Market-1501 Zheng et al. (2015), DukeMTMC Zheng et al. (2017), and MSMT17 Wei et al. (2018). These three datasets are among the most representative and extensively utilized in ReID research. Together, they encompass multiple seasons, time periods, high-definition, and low-definition cameras, providing rich scenes, backgrounds, and intricate lighting variations.

Following existing works Zheng et al. (2015), we employ Rank-k precision, Cumulative Matching Characteristics (CMC), and mean Average Precision (mAP) as evaluation metrics. Rank-1 denotes the average accuracy of the top-ranked result for each query image. mAP denotes the mean average precision: the query results are sorted by similarity, and the closer the correct results are to the top of the list, the higher the precision. As part of a ReID system, re-ranking Zhong et al. (2017) is typically employed to reorganize the initial retrieval results so that they reflect the similarity between images more accurately.
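For readers unfamiliar with these metrics, the following schematic sketch shows how Rank-1 and mAP can be computed from precomputed query/gallery features and identity labels; it follows the common evaluation protocol in spirit and omits details such as filtering same-camera gallery matches, so it is not the exact evaluation code of any baseline used here.

import numpy as np

def rank1_and_map(q_feat, g_feat, q_ids, g_ids):
    # Cosine similarity between L2-normalized query and gallery features.
    q = q_feat / np.linalg.norm(q_feat, axis=1, keepdims=True)
    g = g_feat / np.linalg.norm(g_feat, axis=1, keepdims=True)
    sim = q @ g.T
    rank1_hits, average_precisions = 0.0, []
    for i in range(len(q_ids)):
        order = np.argsort(-sim[i])                       # gallery sorted by decreasing similarity
        matches = (g_ids[order] == q_ids[i]).astype(float)
        rank1_hits += matches[0]                          # is the top-ranked result correct?
        hits = np.cumsum(matches)
        precision_at_k = hits / (np.arange(len(matches)) + 1)
        average_precisions.append((precision_at_k * matches).sum() / max(matches.sum(), 1.0))
    return rank1_hits / len(q_ids), float(np.mean(average_precisions))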

Figure 4: Performance of our Global Color Erasing (GCE) and Random Color Erasing (RCE) with varying hyperparameters on the Market1501 ReID dataset. The results are reported with re-ranking applied by default.

4.2 Semantic Segmentation Datasets and Evaluation Criteria

DustProj consists of 1343 training images, 199 validation images, and 261 testing images. It was annotated by non-experts for training and validation, while professional annotators labeled the test set. The dataset is tailored for practical dust analysis in industrial settings, and it has limitations due to challenging conditions.

The DSS Smoke Segmentation dataset Yuan et al. (2019) contains 73632 images, with 70632 for training and 3000 for testing, categorized into DS01, DS02, and DS03 groups.

For evaluation, we use mean Intersection over Union (mIoU) as a comprehensive segmentation accuracy metric.
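As a brief illustration of the metric (not the evaluation code of any particular baseline), mIoU averages the per-class intersection-over-union between the predicted and ground-truth label maps:

import numpy as np

def mean_iou(pred, target, num_classes):
    # pred and target are integer label maps of the same shape.
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, target == c).sum()
        union = np.logical_or(pred == c, target == c).sum()
        if union > 0:                     # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))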

4.3 Hyper-Parameter Setting and Ablation Study

During CNN training, two hyper-parameters need to be evaluated. One is the global color erasing (GCE) probability $p_{g}$. The results for different $p_{g}$ are shown in Fig. 4. We can see that when $p_{g}$ is set between 0.00 and 0.15, the model outperforms the baseline in both Rank-1 and mAP. Note that $p_{g} = 0$ corresponds to the accuracy of the original baseline, while $p_{g} = 1$ indicates that the model is trained exclusively with grayscale images. The result for $p_{g} = 1$ closely approximates that for $p_{g} = 0.7$, indicating that training the model only with grayscale images does not yield satisfactory results.

GCE can be viewed as a special case of random color erasing (RCE). Based on the experimental observations above, we implement GCE with a probability of $p_{g} = 0.15$ within the framework of RCE and fix this parameter when evaluating the total RCE probability $p_{r}$. The results for different $p_{r}$ are shown in Fig. 4. Clearly, the model achieves the best performance when $p_{r} = 0.40$.
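For illustration, the selected probabilities could be plugged into a standard torchvision-style training pipeline as follows, reusing the random_color_erasing sketch from Section 3.2; the surrounding transforms and image size are common ReID defaults and are assumptions here, not the exact configuration of the baselines.

from torchvision import transforms

train_transform = transforms.Compose([
    transforms.Resize((256, 128)),                 # common pedestrian image size
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    # apply RCE with total probability p_r = 0.40 and global erasing probability p_g = 0.15
    transforms.Lambda(lambda x: random_color_erasing(x, p_r=0.40, p_g=0.15)),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])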

Table 1: Performance comparison with state-of-the-art methods on the Market1501 and DukeMTMC datasets.
Methods References Market1501 DukeMTMC
Rank-1/mAP(%) Rank-1/mAP(%)
HOReID CVPR(2020) 94.2/84.9 86.9/75.6
OfM AAAI(2021) 94.9/87.9 89.0/78.6
PAT CVPR(2021) 95.4/88.0 88.8/78.2
DRL-Net TMM(2022) 94.7/86.9 88.1/76.6
DCAL CVPR(2022) 94.7/87.5 89.0/80.1
CLIP-ReID AAAI(2023) 95.4/90.5 90.8/83.1
FastReID ACM MM(2023) 96.3/90.3 92.4/83.2
FastReID+RCE (ours) 96.5/91.2 92.8/84.2
Table 2: Performance comparison with different methods on the MSMT17 dataset.
Methods MSMT17
Rank-1(%) mAP(%)
FastReID 82.1 60.6
FastReID+REA 85.1 63.8
FastReID+REA+AutoAugment 84.2 61.4
FastReID+REA+RCE(ours) 86.2 65.9
Table 3: Performance comparison on the Market1501 and DukeMTMC datasets. +RK indicates the use of re-ranking.
Methods Market1501 DukeMTMC
Rank-1/mAP(%) Rank-1/mAP(%)
SB 94.5/85.9 86.4/76.4
SB+RK 95.4/94.2 90.3/89.1
SB+RCE(ours) 95.1/87.2 87.8/77.3
SB+RCE + RK(ours) 95.9/94.4 91.0/89.4
FastReID 96.3/90.3 92.4/83.2
FastReID+RK 96.8/95.3 94.4/92.2
FastReID+RCE(ours) 96.5/91.2 92.8/84.2
FastReID+RCE+RK(ours) 96.9/95.6 94.3/92.7
Table 4: Performance comparison between our RCE and DGNet on Market1501.
Methods Market1501
Rank-1(%) mAP(%)
baseline 88.5 71.6
baseline+DGNet 88.9 72.1
baseline+RCE(ours) 90.0 74.9

The results of the ablation experiments are also reflected in Fig. 4. GCE outperforms the baseline within a probability range of up to 0.15. Combining local and global color erasing, referred to as RCE, further improves over the baseline: with RCE, we achieve improvements of 1.5 percentage points in Rank-1 and 3.3 percentage points in mAP over the baseline, as well as improvements of 2.0 percentage points in Rank-1 and 2.7 percentage points in mAP.

4.4 Comparison Experiments on Person Re-identification

We extend our evaluation to State-of-the-Art ReID baselines. As demonstrated in Tab. 3, our method exhibits improvements of 0.6 and 1.3 percentage points in Rank-1 accuracy and mean average precision (mAP) for SB Luo et al. (2019) on Market1501. Moreover, we observe enhancements of 0.2 percentage points in Rank-1 accuracy and 0.9 percentage points in mAP for FastReID He et al. (2023). Importantly, these improvements hold consistently across both DukeMTMC and MSMT17 datasets (see Tab. 2).

Table 5: Cross-domain tests. M→D means that we train the model on Market1501 and evaluate it on DukeMTMC, D→M means the reverse.
Methods Cross-Domain
M→D D→M
Rank-1/mAP(%) Rank-1/mAP(%)
baseline 37.8/27.0 51.2/31.8
baseline+REA 29.5/18.4 43.6/24.1
baseline+DGNet 36.7/25.6 52.4/31.6
baseline+RCE(ours) 39.7/27.9 55.1/35.2
SB 45.5/37.0 58.2/37.8
SB+REA 33.6/24.3 51.6/32.3
SB+REA+RCE(ours) 37.8/27.8 55.4/35.7
SB+RCE(ours) 48.2/37.9 65.0/43.7

Comparison with State-of-the-Art methods. We compare our method with the State-of-the-Art methods, including HOReID Wang et al. (2020), OfM Zhang et al. (2021), PAT Li et al. (2021), DRL-Net Jia et al. (2022), DCAL Zhu et al. (2022) and CLIP-ReID Li et al. (2023). As shown in Tab. 1, we achieve state-of-the-art performance on FastReID by using proposed learning strategy RCE.

Our method improves accuracy beyond FastReID’s default configuration, whose data augmentation already includes AutoAugment Cubuk et al. (2019) and random erasing augmentation (REA) Zhong et al. (2020). Notably, AutoAugment, an automated data augmentation strategy that employs a series of augmentation techniques to simulate color deviations, does not consistently improve model performance.

Another comparison of our method with state-of-the-art GAN-based ReID methods is shown in Tab. 4 and Tab. 5. As can be seen from Tab. 4, our method delivers a performance improvement that far exceeds that of the GAN-based method DGNet, surpassing it by 2.8 percentage points in mAP. As can be seen from Tab. 5, the generalization ability of the proposed method in the M→D cross-domain test improves by 1.9 percentage points in Rank-1 over the baseline, and the improvement is even more pronounced in the D→M cross-domain test, which further shows that the proposed method is better than DGNet. Notably, using data generated by DGNet for model training leads to poor cross-domain performance.

Table 6: Performance comparison of two strong baselines on the DustProj and DSS datasets.
Methods MIoU (%) in DustProj MIoU (%) in DSS (DS01 DS02 DS03)
Segformer 59.2 74.8 75.4 75.1
DDRNet 53.0 75.0 74.2 73.1
Segformer+RCE(ours) 60.5 78.2 77.7 77.5
DDRNet+RCE(ours) 60.9 79.7 79.6 80.7

Cross-domain tests. Cross-domain ReID aims to adapt a model trained on a labeled source domain dataset to a target domain dataset. However, higher accuracy does not always translate to better generalization, as highlighted by Luo et al. (2019). To investigate the effectiveness of our method in cross-domain experiments, we compare RCE with other methods in cross-domain tests between Market-1501 and DukeMTMC, as illustrated in Tab. 5.

As shown in Tab. 5, because random erasing augmentation (REA) directly disrupts the structural information of samples during training, it has adverse effects in cross-domain scenarios and undermines the model’s performance. In contrast, our RCE demonstrates more pronounced gains in cross-domain testing through ensemble learning at the image level. This clearly indicates that the proposed method effectively guides the model to learn color invariance and highlights its exceptional capability in enhancing model robustness against color variations.

4.5 Visualization Analysis

Grad-CAM Selvaraju et al. (2017) employs gradients from the final CNN convolutional layer to visualize neuron importance for the output prediction, highlighting the regions that influence model predictions. As shown in Fig. 5, the normally trained model activates irrelevant areas under color deviation, whereas the RCE-trained model effectively activates the critical regions. This indicates that RCE exhibits significant robustness against color deviations.

Table 7: Comparison of different data augmentation methods on the DSS dataset.
Methods MIoU (%) in DSS
DS01 DS02 DS03
DDRNet 75.0 74.2 73.1
DDRNet+Cutout 75.5 73.8 75.1
DDRNet+Co-Mixup 77.4 76.9 77.2
DDRNet+RCE(ours) 79.7 79.6 80.7
Figure 5: Comparison of Grad-CAM activation maps between a normally trained model and a model trained using our Random Color Erasing.

4.6 Comparison Experiments on Semantic Segmentation

Observing Tab. 6, it can be noted that on both the dust and smoke semantic segmentation datasets, our method consistently delivers significant improvements over strong baselines with different architectures, including Segformer Xie et al. (2021) and DDRNet Pan et al. (2023). Notably, our method achieves improvements of over 7 percentage points on DDRNet on both DustProj and DS03. In addition, we compare our RCE with Cutout and Co-Mixup in Tab. 7. The experiments demonstrate that our approach achieves superior performance in this task.

Our approach consistently improves the results of various baseline models across different tasks and datasets, demonstrating the effectiveness of the proposed method.

5 Conclusion

This paper introduces a novel image-level ensemble learning method, namely Random Color Erasing (RCE), to enhance the robustness of deep learning models in various visual scenarios encountering color variations. By effectively rebalancing the emphasis between color features and other discriminative features, RCE demonstrates its potential in improving model generalization performance. We provide an interpretable analysis of the proposed method, and experiments conducted on ReID and industrial dust (or smoke) segmentation across different architectures and datasets validate the effectiveness of our approach. In conclusion, we aim for this research to serve as a valuable foundation for advancing the field of computer vision.

References

  • Choi et al. [2018] Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, and Jaegul Choo. Stargan: Unified generative adversarial networks for multi-domain image-to-image translation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8789–8797, 2018.
  • Cubuk et al. [2019] Ekin Dogus Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V. Le. Autoaugment: Learning augmentation policies from data. In CVPR, 2019.
  • Deng et al. [2018] Weijian Deng, Liang Zheng, Qixiang Ye, Guoliang Kang, Yi Yang, and Jianbin Jiao. Image-image domain adaptation with preserved self-similarity and domain-dissimilarity for person re-identification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 994–1003, 2018.
  • DeVries and Taylor [2017] Terrance DeVries and Graham W Taylor. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552, 2017.
  • Gong et al. [2022] Yunpeng Gong, Liqing Huang, and Lifei Chen. Person re-identification method based on color attack and joint defence. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pages 4313–4322, June 2022.
  • Gu et al. [2022] Xinqian Gu, Hong Chang, Bingpeng Ma, Shutao Bai, Shiguang Shan, and Xilin Chen. Clothes-changing person re-identification with rgb modality only. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1060–1069, 2022.
  • Guo et al. [2019] Hongyu Guo, Yongyi Mao, and Richong Zhang. Mixup as locally linear out-of-manifold regularization. In Proceedings of the AAAI conference on artificial intelligence, volume 33, pages 3714–3722, 2019.
  • Han et al. [2023] Ke Han, Shaogang Gong, Yan Huang, Liang Wang, and Tieniu Tan. Clothing-change feature augmentation for person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22066–22075, 2023.
  • He et al. [2023] Lingxiao He, Xingyu Liao, Wu Liu, Xinchen Liu, Peng Cheng, and Tao Mei. Fastreid: A pytorch toolbox for general instance re-identification. In Proceedings of the 31st ACM International Conference on Multimedia, pages 9664–9667, 2023.
  • Jia et al. [2022] Mengxi Jia, Xinhua Cheng, Shijian Lu, and Jian Zhang. Learning disentangled representation implicitly via transformer for occluded person re-identification. IEEE Transactions on Multimedia, 25:1294–1305, 2022.
  • Kim et al. [2021] JangHyun Kim, Wonho Choo, Hosan Jeong, and Hyun Oh Song. Co-mixup: Saliency guided joint mixup with supermodular diversity. In International Conference on Learning Representations, 2021.
  • Li et al. [2014] Wei Li, Rui Zhao, Tong Xiao, and Xiaogang Wang. Deepreid: Deep filter pairing neural network for person re-identification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 152–159, 2014.
  • Li et al. [2021] Yulin Li, Jianfeng He, Tianzhu Zhang, Xiang Liu, Yongdong Zhang, and Feng Wu. Diverse part discovery: Occluded person re-identification with part-aware transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2898–2907, 2021.
  • Li et al. [2023] Siyuan Li, Li Sun, and Qingli Li. Clip-reid: Exploiting vision-language model for image re-identification without concrete text labels. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 1405–1413, 2023.
  • Liao et al. [2015] Shengcai Liao, Yang Hu, Xiangyu Zhu, and Stan Z Li. Person re-identification by local maximal occurrence representation and metric learning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2197–2206, 2015.
  • Liu et al. [2018] Jinxian Liu, Bingbing Ni, Yichao Yan, Peng Zhou, Shuo Cheng, and Jianguo Hu. Pose transferrable person re-identification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4099–4108, 2018.
  • Luo et al. [2019] Hao Luo, Youzhi Gu, Xingyu Liao, Shenqi Lai, and Wei Jiang. Bag of tricks and a strong baseline for deep person re-identification. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pages 0–0, 2019.
  • Pan et al. [2023] Huihui Pan, Yuanduo Hong, Weichao Sun, and Yisong Jia. Deep dual-resolution networks for real-time and accurate semantic segmentation of traffic scenes. IEEE Transactions on Intelligent Transportation Systems, 24(3):3448–3460, 2023.
  • Selvaraju et al. [2017] Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In ICCV, pages 618–626, 2017.
  • Wang et al. [2020] Guan’an Wang, Shuo Yang, Huanyu Liu, Zhicheng Wang, Yang Yang, Shuliang Wang, Gang Yu, Erjin Zhou, and Jian Sun. High-order information matters: Learning relation and topology for occluded person re-identification. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6449–6458, 2020.
  • Wei et al. [2018] Longhui Wei, Shiliang Zhang, Wen Gao, and Qi Tian. Person transfer gan to bridge domain gap for person re-identification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 79–88, 2018.
  • Xie et al. [2021] Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M Alvarez, and Ping Luo. Segformer: Simple and efficient design for semantic segmentation with transformers. Advances in Neural Information Processing Systems, 34:12077–12090, 2021.
  • Yuan et al. [2019] Feiniu Yuan, Lin Zhang, Xue Xia, Boyang Wan, Qinghua Huang, and Xuelong Li. Deep smoke segmentation. Neurocomputing, 357:248–260, 2019.
  • Yuan et al. [2023] Feiniu Yuan, Kang Li, Chunmei Wang, and Zhijun Fang. A lightweight network for smoke semantic segmentation. Pattern Recognition, 137:109289, 2023.
  • Yun et al. [2019] Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. Cutmix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE/CVF international conference on computer vision, pages 6023–6032, 2019.
  • Yunpeng Gong [2024] Yunpeng Gong, Zhun Zhong, Zhiming Luo, Yansong Qu, Rongrong Ji, and Min Jiang. Cross-modality perturbation synergy attack for person re-identification. https://arxiv.org/pdf/2401.10090.pdf, 2024.
  • Zhang et al. [2021] Enwei Zhang, Xinyang Jiang, Hao Cheng, Ancong Wu, Fufu Yu, Ke Li, Xiaowei Guo, Feng Zheng, Weishi Zheng, and Xing Sun. One for more: Selecting generalizable samples for generalizable reid model. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 3324–3332, 2021.
  • Zheng et al. [2015] Liang Zheng, Liyue Shen, Lu Tian, Shengjin Wang, Jingdong Wang, and Qi Tian. Scalable person re-identification: A benchmark. In Proceedings of the IEEE international conference on computer vision, pages 1116–1124, 2015.
  • Zheng et al. [2017] Zhedong Zheng, Liang Zheng, and Yi Yang. Unlabeled samples generated by gan improve the person re-identification baseline in vitro. In Proceedings of the IEEE international conference on computer vision, pages 3754–3762, 2017.
  • Zheng et al. [2019] Zhedong Zheng, Xiaodong Yang, Zhiding Yu, Liang Zheng, Yi Yang, and Jan Kautz. Joint discriminative and generative learning for person re-identification. In proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2138–2147, 2019.
  • Zhong et al. [2017] Zhun Zhong, Liang Zheng, Donglin Cao, and Shaozi Li. Re-ranking person re-identification with k-reciprocal encoding. In CVPR, 2017.
  • Zhong et al. [2018a] Zhun Zhong, Liang Zheng, Shaozi Li, and Yi Yang. Generalizing a person retrieval model hetero-and homogeneously. In Proceedings of the European conference on computer vision (ECCV), pages 172–188, 2018.
  • Zhong et al. [2018b] Zhun Zhong, Liang Zheng, Zhedong Zheng, Shaozi Li, and Yi Yang. Camera style adaptation for person re-identification. In CVPR, 2018.
  • Zhong et al. [2019] Zhun Zhong, Liang Zheng, Zhiming Luo, Shaozi Li, and Yi Yang. Invariance matters: Exemplar memory for domain adaptive person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 598–607, 2019.
  • Zhong et al. [2020] Zhun Zhong, Liang Zheng, Guoliang Kang, Shaozi Li, and Yi Yang. Random erasing data augmentation. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pages 13001–13008, 2020.
  • Zhou et al. [2023] Xiao Zhou, Yujie Zhong, Zhen Cheng, Fan Liang, and Lin Ma. Adaptive sparse pairwise loss for object re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19691–19701, 2023.
  • Zhu et al. [2017] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, pages 2223–2232, 2017.
  • Zhu et al. [2022] Haowei Zhu, Wenjing Ke, Dong Li, Ji Liu, Lu Tian, and Yi Shan. Dual cross-attention learning for fine-grained visual categorization and object re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4692–4702, 2022.