
CINFormer: Transformer network with multi-stage CNN feature injection for surface defect segmentation

Xiaoheng Jiang, Kaiyi Guo, Yang Lu, Feng Yan, Hao Liu, Jiale Cao, Mingliang Xu, and Dacheng Tao, Fellow, IEEE Manuscript created August 2023. This work was supported in part by the National Key Research and Development Program of China under Grant 2021YFB3301500; in part by the National Natural Science Foundation of China under Grants 62172371, U21B2037, 62102370, and 62272421; in part by the Natural Science Foundation of Henan Province under Grant 232300421093 and the Foundation for University Key Research of Henan Province under Grant 21A520040; and in part by the CAAI-Huawei MindSpore OpenFund. (Corresponding author: Mingliang Xu) Xiaoheng Jiang, Yang Lu, Hao Liu, and Mingliang Xu are with the School of Computer Science and Artificial Intelligence, Zhengzhou University, Zhengzhou, China; the Engineering Research Center of Intelligent Swarm Systems, Ministry of Education, Zhengzhou, China; and the National Supercomputing Center in Zhengzhou, Zhengzhou, China (e-mail: [email protected], [email protected], [email protected], [email protected]) Kaiyi Guo and Feng Yan are with the School of Computer Science and Artificial Intelligence, Zhengzhou University, Zhengzhou, China (e-mail: [email protected], [email protected]) Jiale Cao is with the School of Electrical and Information Engineering, Tianjin University, Tianjin, China (e-mail: [email protected]) Dacheng Tao is with the Sydney AI Centre and the School of Computer Science, Faculty of Engineering, The University of Sydney, Darlington, NSW 2008, Australia (e-mail: [email protected])
Abstract

Surface defect inspection is of great importance for industrial manufacturing and production. Although defect inspection methods based on deep learning have made significant progress, they still face challenges such as indistinguishable weak defects and defect-like interference in the background. To address these issues, we propose a transformer network with multi-stage CNN (Convolutional Neural Network) feature injection for surface defect segmentation, which is a UNet-like structure named CINFormer. CINFormer presents a simple yet effective feature integration mechanism that injects the multi-level CNN features of the input image into different stages of the transformer network in the encoder. This preserves the merit of the CNN in capturing detailed features and that of the transformer in suppressing background noise, which facilitates accurate defect detection. In addition, CINFormer presents a Top-K self-attention module that focuses on the tokens carrying more important information about the defects, so as to further reduce the impact of the redundant background. Extensive experiments conducted on the surface defect datasets DAGM 2007, Magnetic tile, and NEU show that the proposed CINFormer achieves state-of-the-art performance in defect detection.

Index Terms:
CNN feature injection, Transformer, top-K self-attention, surface defect segmentation.

I Introduction

Defect inspection [1, 2] is an important task in automated industrial production. It aims to detect anomalous or defective areas on the surface of products. Traditional manual defect inspection relies on experienced workers and suffers from low efficiency and poor accuracy.

With the development of deep learning [3, 4], deep convolutional neural networks (CNNs) have achieved great success in many computer vision tasks such as image classification [5, 6, 7], object detection [8, 9, 10, 11], and semantic segmentation [12, 13]. Therefore, many CNN-based defect inspection methods [14, 15] have sprung up. Among them, pixel-level defect detection methods can provide fine-grained information about defects. Most of these methods [16, 17, 18, 19] introduce attention mechanisms into the models to improve the defect detection ability. However, these methods cannot accurately detect defects in complex scenes with noise interference, especially when the defects present a weak appearance.

Figure 1: Comparison of feature visualization for CNN (a), transformer (b), and the proposed CINFormer (c). It is noted that the features are from the last stage of the corresponding models. It can be observed that CINFormer can better focus on defect areas and suppress redundant background interference.

Recently, transformers [20] have achieved excellent performance in natural language processing. With the advent of ViT [21], transformers have also shown great potential in computer vision, as they can model long-distance dependencies through the self-attention module. However, it is difficult for existing transformer models to accurately capture detailed defect information since the self-attention module is essentially a low-pass filter [22]. As a result, it tends to filter out high-frequency signals, which results in losing the detailed information of defects, especially for weak defects, as presented in Fig. 1.

To address these problems, some works [23, 24, 25] combine CNN features and transformer features to strengthen the representation ability of the features. Some other methods [26, 27, 28] introduce the local properties of convolution into the transformer model to better learn local features. In this way, the model can capture detailed information while suppressing background interference. Although these methods make the transformer better capture detailed information, they still confront challenges caused by weak defects and complex backgrounds.

This paper aims to design a cooperation strategy that can effectively combine the CNN network and the transformer network for surface defect segmentation. To this end, we propose a UNet-like transformer network named CINFormer. Specifically, a CNN stem is used to generate multi-level convolutional features, which are injected into different stages of the transformer in the encoder of CINFormer. It is noted that the feature injection is a one-way procedure from the CNN to the transformer. This allows CINFormer to better absorb rich original detailed defect features while maintaining the ability to suppress background noise. Furthermore, a Top-K self-attention module is proposed to remedy the situation in which detailed defect features are drowned out to some extent by the background information. It selects the more meaningful tokens and channels by ranking their variances to calculate attention. This alleviates the impact of redundant background information, which facilitates the detection of weak defects.

It should be noted that the proposed injection strategy is inspired by Conformer [23] but differs from it. Conformer is a bidirectional feature injection pattern in which the features of the CNN and the transformer interact with each other. Though this strategy can promote the full merging of features from the transformer and the CNN, it simultaneously compromises the ability of the CNN branch to represent the detailed features of weak defects. In contrast, our proposed one-way multi-stage CNN feature injection method can effectively retain detailed CNN features and make them fully merged with transformer features, which facilitates defect detection. In addition, Conformer needs to train the CNN and Transformer branches simultaneously. Different from it, our CINFormer can fine-tune just the Transformer part while keeping the CNN part fixed. Even so, CINFormer outperforms Conformer on the defect datasets. When trained in an end-to-end way, CINFormer obtains a better performance.

In summary, our contributions are as follows:

  1. We propose a UNet-like defect segmentation network called CINFormer that combines CNN and transformer as the encoder. The multi-level CNN features are injected into different stages of the transformer network to strengthen the representation capability of the features.

  2. We present a Top-K self-attention module to alleviate the impact of the redundant background information. It helps focus on essential defect features while suppressing background interference.

  3. The extensive experiments on three typical surface defect datasets demonstrate that the proposed method is effective in different defect scenes.

II Related Work

Figure 2: The architecture of the CINFormer. The CINFormer is a UNet-like encoder-decoder network. The encoder is a transformer network with multi-level convolutional features injected into its different stages. In addition, the Top-K self-attention module is introduced into the transformer block to reduce the impact of redundant background information. It selects the top $k$ most important tokens and channels to retain the most useful information while suppressing the redundant background information.

II-A CNN-based Defect Inspection

Traditional machine vision methods utilize low-level texture features to detect defects. These methods cannot effectively detect complex defects with low-contrast appearance and small size. Recently, with the rapid development of deep learning, some CNN-based methods have been widely applied to surface defect detection. For instance, Tabernik et al. [14] proposed a two-stage network, which introduces a segmentation subnetwork into the classification network. Cui et al. [29] integrate fine-grained details of preceding layers at each deep layer, which facilitates the detection of small objects.

However, it is hard for these methods to identify defects in complex scenes. Benefiting from the attention mechanism that highlights important features, some methods introduce it into the model to strengthen the feature representation. Based on the human visual gain mechanism, Wei et al. [18] constructed a Faster VG-RCNN that integrates an attention-related visual gain mechanism to better detect small defects. Su et al. [30] integrate an improved multi-head non-local self-attention into FPN to highlight defect features. Yang et al. [31] introduce a spatial attention module into each residual unit of the backbone to make the network adaptively focus on defect regions. Similarly, Jiang et al. [32] use joint channel-spatial attention weights to selectively fuse high-level and low-level features. Xiang et al. [33] introduce an attention module after high-level and low-level feature fusion to strengthen the feature representation. However, these CNN-based methods are still challenged by defects in complex scenes because CNN features are vulnerable to background interference.

II-B Vision Transformer

Recently, Dosovitskiy et al. [21] proposed the vision transformer, which shows a strong ability to capture global features benefiting from the self-attention mechanism. To deal with variations in the scale of objects, Liu et al. [26] proposed a hierarchical transformer based on window self-attention. Wang et al. [34] proposed CrossFormer to improve the Transformer's ability to build cross-scale interactions. Xie et al. [35] constructed a hierarchical transformer as an encoder to generate multi-level features and fused these features through MLP layers in the decoder. These methods focus on drawing global relationships and are robust against background interference. However, they are prone to losing detailed information because they do not pay attention to local information.

To deal with the above problem, Liu et al. [26] proposed a local window mechanism to perform self-attention operations within each window. Yang et al. [36] proposed Focal Self-Attention (FSA), which pays fine-grained attention to the area around the current token and coarse-grained attention to the area far away from the current token; in this way, it can more effectively capture local and global attention. Xia et al. [37] proposed a deformable self-attention module in which the keys and values in self-attention are sampled to flexibly learn informative features. Fan et al. [38] proposed a multiscale Transformer, which creates a multiscale feature pyramid that learns low-level information at early layers and semantic information at deep layers. Ren et al. [39] proposed Shunted Self-Attention (SSA), which can capture multi-scale features through multi-scale token aggregation. However, these methods still do not solve the problem that self-attention acts as a low-pass filter, making it difficult for them to deal with weak or minor defects.

II-C Combination of Transformer and CNN

Due to its local properties, a CNN is prone to bringing in noise when dealing with defects in complex scenes. In contrast, the transformer model can draw long-distance dependencies and suppress noise, while it is weak in capturing detailed information about objects. Therefore, some methods improve network performance by combining CNN and transformer networks. For example, Peng et al. [23] designed a hybrid network structure termed Conformer, which enhances representation learning by combining local and global features in an interactive way. Wu et al. [25] use convolutional projection instead of linear projection in each self-attention block, which enables the transformer model to capture local details. Xie et al. [40] proposed a novel one-stage framework called Pyramid Grafting Network (PGNet), which utilizes global information to strengthen the CNN's feature representation in the decoder. Li et al. [24] designed a novel Unified transformer (UniFormer) to learn both local and global token affinity by integrating 3D convolution and spatiotemporal self-attention. Although these methods inherit the local properties of CNNs while maintaining the global merits of transformers, they still confront challenges posed by weak defects and complex backgrounds.

Different from the above methods, the proposed CINFormer presents a one-way multi-stage feature injection from the CNN to the transformer network, which can better preserve the local details of the CNN. Furthermore, a Top-K self-attention module is proposed to make the model focus on defect information by selecting important tokens or channels to calculate self-attention.

III Proposed Method

We briefly introduce the overall architecture of the model in Section III-A. Then the transformer with CNN injection is described in Section III-B. Finally, the proposed Top-K self-attention module is described in Section III-C.

III-A Overall Architecture

As depicted in Fig. 2, the proposed network is constructed based on a U-shaped architecture. A multi-level transformer network is adopted as the encoder, in which each level is injected with CNN features. Specifically, we adopt ResNet-18 [5] and Swin-T [26] as the CNN and transformer backbones, respectively. Based on the ResNet-18, a CNN stem is constructed to generate hierarchical CNN features, denoted as $\{R_i \mid i=1,2,3,4\}$. It is noted that the transformer network takes the convolutional feature $R_1$ instead of the image as its input. The other convolutional features $R_2$, $R_3$, and $R_4$ are injected into the features of different stages of the transformer network, providing the transformer with detailed defect features. The detailed CNN features are beneficial for segmenting small defects.

The decoder consists of four decoding stages, each of which fuses high-level features and the corresponding low-level features in the encoder. Besides, a Top-K self-attention module is presented to alleviate the impact of redundant background information. The proposed Top-K self-attention module selects important tokens or channels to calculate self-attention, which is used in the last two stages of the transformer network to help the model attend to important defect regions.

III-B Transformer with CNN Injection

The illustration of the transformer with CNN injection is shown in Fig. 2. In a CNN, features at deep layers tend to contain rich semantic information about defects, while features at shallow layers contain abundant detailed information about defects. We therefore construct a CNN stem through FPN [41] to generate multi-level features. The FPN fuses high-level and low-level features from top to bottom: high-level features are upsampled, concatenated with the low-level features, and fed into 3×3 convolutions for further fusion. Compared with the multi-level features of the CNN backbone, the multi-level features generated by the FPN contain richer semantic and detailed information. So we use the multi-level features $\{R_i \mid i=1,2,3,4\}$ in the top-down path as the features injected into the transformer.
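To make the top-down fusion concrete, the following is a minimal PyTorch sketch of such an FPN-style CNN stem. The class name, channel widths, and the use of default nearest-neighbor upsampling are illustrative assumptions rather than the authors' exact implementation; the sketch only mirrors the upsample-concatenate-3×3-convolution pattern described above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FPNStem(nn.Module):
    """Illustrative top-down CNN stem that turns ResNet-18 stage outputs
    (c1..c4, shallow to deep) into injection features R1..R4."""

    def __init__(self, in_channels=(64, 128, 256, 512), out_channels=96):
        super().__init__()
        # 1x1 lateral convolutions unify the channel widths of the backbone stages
        self.laterals = nn.ModuleList(
            [nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels])
        # 3x3 convolutions fuse the concatenated high- and low-level features
        self.fuse = nn.ModuleList(
            [nn.Conv2d(2 * out_channels, out_channels, kernel_size=3, padding=1)
             for _ in range(3)])

    def forward(self, c1, c2, c3, c4):
        p1, p2, p3, p4 = [lat(c) for lat, c in zip(self.laterals, (c1, c2, c3, c4))]
        r4 = p4
        # Top-down path: upsample, concatenate with the lower level, fuse with a 3x3 conv
        r3 = self.fuse[0](torch.cat([F.interpolate(r4, size=p3.shape[-2:]), p3], dim=1))
        r2 = self.fuse[1](torch.cat([F.interpolate(r3, size=p2.shape[-2:]), p2], dim=1))
        r1 = self.fuse[2](torch.cat([F.interpolate(r2, size=p1.shape[-2:]), p1], dim=1))
        return r1, r2, r3, r4
```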

The convolutional feature $R_1$ is taken as the input of the transformer network, generating transformer features $\{S_i \mid i=1,2,3,4\}$ at different stages. The other convolutional features are injected into the subsequent transformer stages, respectively. Each transformer stage takes the fusion of the transformer features and the convolutional features as input. Specifically, a $1\times 1$ convolution is applied to the convolutional feature $R_i$ to adjust the number of channels. The transformed convolutional feature is reshaped from $\mathbb{R}^{C\times H\times W}$ into $\mathbb{R}^{C\times N}$, where $N=H\times W$. The reshaped feature is fused with the transformer feature $S_{i-1}$ and fed into transformer stage $i$. Mathematically, the injection process can be described as follows:

$$R'_{i}=\mathrm{Reshape}(\mathrm{Conv}(R_{i})) \qquad (1)$$
$$Y=\mathrm{Linear}(\mathrm{Concat}(S_{i-1},R'_{i})) \qquad (2)$$

where $\mathrm{Reshape}(\cdot)$ denotes the reshape operation, $\mathrm{Concat}(\cdot)$ denotes the concatenation operation, $\mathrm{Conv}(\cdot)$ denotes the convolution operation, and $\mathrm{Linear}(\cdot)$ denotes the linear projection.
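As a concrete illustration of Eqs. (1) and (2), the sketch below shows one injection step in PyTorch. The module name and the assumption that the token dimension of stage $i$ matches the projected CNN channel width are ours; the exact channel arrangement in CINFormer may differ.

```python
import torch
import torch.nn as nn


class CNNInjection(nn.Module):
    """Illustrative one-way injection of CNN feature R_i into the transformer
    tokens S_{i-1} that enter stage i (Eqs. (1)-(2))."""

    def __init__(self, cnn_channels, token_dim):
        super().__init__()
        self.proj = nn.Conv2d(cnn_channels, token_dim, kernel_size=1)  # channel adjustment
        self.linear = nn.Linear(2 * token_dim, token_dim)              # fusion after concat

    def forward(self, s_prev, r_i):
        # r_i: (B, C, H, W) CNN feature; s_prev: (B, N, C) tokens, N = H * W
        r = self.proj(r_i)                                  # Eq. (1): Conv
        r = r.flatten(2).transpose(1, 2)                    # Eq. (1): Reshape to (B, N, C)
        y = self.linear(torch.cat([s_prev, r], dim=-1))     # Eq. (2): Concat + Linear
        return y                                            # input tokens of transformer stage i
```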

III-C Top-K Self-attention Module

Traditional self-attention computes the feature similarity between any two spatial locations and utilizes all tokens to produce the attention map, which brings in redundant background information. Since defects are usually weak and small and easily confused with the background, this redundant information is likely to drown out detailed defect information. To remedy this problem, we present a simple but effective Top-K self-attention module, which is used in the last two transformer stages. The proposed self-attention module selects the top $k$ important tokens and channels to compute the attention. As a result, it can highlight important defect features and filter out redundant background information, while also incurring lower computational complexity than the original self-attention module. The details of the proposed self-attention module are shown in Fig. 3.

Figure 3: The illustration of the Top-K self-attention mechanism. (a)–(d) represent tokens without selection, tokens with token selection, tokens with channel selection, and tokens with both token and channel selection, respectively. The unselected tokens and channels are shown in gray.

Fig. 7 visualizes the feature maps of the original self-attention and the Top-K self-attention module. Compared with the original self-attention, the Top-K self-attention better focuses on defect information while suppressing redundant background information.

In our module, given an input feature $X\in\mathbb{R}^{N\times C}$, it is linearly mapped into queries Q, keys K, and values V. Mathematically, we have:

$$\mathbf{Q}=XW_{q},\quad \mathbf{K}=XW_{k},\quad \mathbf{V}=XW_{v} \qquad (3)$$

where $W_{q}$, $W_{k}$, and $W_{v}$ represent three projection matrices in $\mathbb{R}^{C\times C}$.

With Q, K, and V obtained, we use the Top-K self-attention mechanism to select the top $k$ most important tokens and channels to keep the most useful information and suppress the redundant background information. Specifically, for token selection, we compute the variance statistics of tokens along the channel dimension. If the variance of a token is higher, it tends to contain more useful information about defects. Therefore, we select the top $k$ important tokens by ranking their variances, generating the indexes of the selected tokens. Similarly, for channel selection, we first compute the variance statistics of channels along the spatial dimension and then select the top $k$ important channels by sorting the channel variances, generating the indexes of the selected channels.
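A minimal sketch of this variance-based selection is given below; the function name and per-sample batching are our own conventions, and only the ranking logic follows the description above.

```python
import torch


def topk_indices(q, k_tokens, k_channels):
    """Select the indexes of the top-k tokens and channels of q (B, N, C)
    by ranking their variances (illustrative helper)."""
    token_var = q.var(dim=-1)       # (B, N): variance of each token over the channels
    channel_var = q.var(dim=1)      # (B, C): variance of each channel over the tokens
    token_idx = token_var.topk(k_tokens, dim=-1).indices         # (B, k_tokens)
    channel_idx = channel_var.topk(k_channels, dim=-1).indices   # (B, k_channels)
    return token_idx, channel_idx
```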

In the Top-K self-attention module, the Top-K token indexes and channel indexes are first computed on Q. Then they are used to select the most important tokens and channels from the queries Q, generating Q′. Meanwhile, they are used to select the most important channels from the keys K and the most important tokens from the values V, generating K′ and V′, respectively. Notably, the Top-K token indexes and channel indexes are computed only once, on Q, because the self-attention module first calculates the similarities of Q and K and then performs matrix multiplication with V. If the indexes were re-computed on K and V, the corresponding tokens and channels at the same positions in the selected Q, K, and V would not match. With the obtained Q′, K′, and V′, the Top-K self-attention is calculated as follows:

$$A=\mathrm{Softmax}\left(\frac{\mathbf{Q}'\mathbf{K}'^{T}}{\sqrt{C}}\right) \qquad (4)$$
$$Z=A\mathbf{V}' \qquad (5)$$

where $\mathrm{Softmax}(\cdot)$ denotes the Softmax function. Compared with the original Q, K, and V, the selected Q′, K′, and V′ have fewer dimensions, so the proposed Top-K self-attention also reduces the computational cost of the self-attention module.

In addition, we introduce a constraint vector to counteract the expansion of unselected regions caused by the softmax operation. In the Top-K self-attention, each unselected token in the key also computes similarity with the selected tokens in the query, generating a score in the attention map. The constraint vector in (6) and (7) is introduced to suppress the scores of these unselected tokens. Mathematically, we have:

$$A_{c}=\mathrm{Sig}\left(\mathrm{Layernorm}\left(\mathrm{mean}\left(\frac{\mathbf{Q}'\mathbf{K}'^{T}}{\sqrt{C}}\right)\right)\right)\ast\gamma \qquad (6)$$
$$Z'=ZA_{c} \qquad (7)$$

where $\mathrm{mean}(\cdot)$ is the mean operation, $\mathrm{Sig}(\cdot)$ is the Sigmoid function, and $\gamma$ is a learnable parameter.
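Putting Eqs. (3)-(7) together, the following is a hedged PyTorch sketch of the Top-K self-attention (single head, no windowing). Several bookkeeping details are our assumptions because the text leaves them open: the values keep all tokens so that the matrix product in Eq. (5) is well defined, the mean in Eq. (6) is taken over the key dimension, and the outputs of the selected tokens are scattered back while unselected tokens pass through unchanged.

```python
import torch
import torch.nn as nn


class TopKSelfAttention(nn.Module):
    """Hedged sketch of the Top-K self-attention (Eqs. (3)-(7)); the handling
    of K', V' and the scatter-back step are interpretations, not release code."""

    def __init__(self, dim, k_tokens, k_channels):
        super().__init__()
        self.wq = nn.Linear(dim, dim, bias=False)
        self.wk = nn.Linear(dim, dim, bias=False)
        self.wv = nn.Linear(dim, dim, bias=False)
        self.norm = nn.LayerNorm(k_tokens)        # normalizes the pooled attention scores
        self.gamma = nn.Parameter(torch.ones(1))  # learnable scale gamma in Eq. (6)
        self.k_tokens, self.k_channels = k_tokens, k_channels

    def forward(self, x):
        B, N, C = x.shape
        q, k, v = self.wq(x), self.wk(x), self.wv(x)                   # Eq. (3)

        # Variance-based Top-K indexes, computed once on Q
        t_idx = q.var(dim=-1).topk(self.k_tokens, dim=-1).indices      # (B, k_t)
        c_idx = q.var(dim=1).topk(self.k_channels, dim=-1).indices     # (B, k_c)

        q_sel = torch.gather(q, 1, t_idx.unsqueeze(-1).expand(-1, -1, C))                        # Q: tokens
        q_sel = torch.gather(q_sel, 2, c_idx.unsqueeze(1).expand(-1, self.k_tokens, -1))         # Q: channels
        k_sel = torch.gather(k, 2, c_idx.unsqueeze(1).expand(-1, N, -1))                         # K: channels

        logits = q_sel @ k_sel.transpose(-2, -1) / C ** 0.5            # (B, k_t, N)
        attn = logits.softmax(dim=-1)                                  # Eq. (4)
        z = attn @ v                                                   # Eq. (5); values keep all tokens (assumption)

        # Constraint vector, Eqs. (6)-(7): rescale each selected token's output
        a_c = torch.sigmoid(self.norm(logits.mean(dim=-1))) * self.gamma   # (B, k_t)
        z = z * a_c.unsqueeze(-1)

        # Scatter the refined tokens back; unselected tokens are left unchanged (assumption)
        return x.scatter(1, t_idx.unsqueeze(-1).expand(-1, -1, C), z)
```

Such a module is meant to stand in for the self-attention of the last two transformer stages, with the selection size corresponding to the K values studied in Table VI.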

As stated in [42], the relations among different tokens and the semantic information at shallow layers are weak. Therefore, the proposed Top-K self-attention mechanism is only used in the last two stages of the network. In the ablation experiments, we also observe a performance degradation when the Top-K self-attention module is used in the first two stages.

IV Experiments

TABLE I: Comparisons of different models on the three defect datasets. It should be noted that CINFormer* denotes that the weights of the ResNet-18 are frozen during training.
Model  #Param  FLOPs  mIoU (MT)  mIoU (NEU)  mIoU (DAGM)
CNN
SPNet [2020] 13M 1.9G 71.8 77.6 64.5
UNet [2015] 17M 3.1G 78.7 74.7 68.2
OCNet [2021] 15M 11.9G 80.6 75.9 68.6
GCAPNet [2020] 67M 26.6G 82.9 75.8 69.6
ResNet-50 [2016] 24M 8.2G 80.6 80.7 -
Vision Transformer
PVT [2021] 20M 2.5G 60.1 79.5 63.5
SegFormer [2021] 8M 21.1G 68.6 82.9 70.1
DAT [2022] 27M 4.6G 65.5 80.8 63.0
Swin-T [2021] 28M 4.5G 62.0 80.2 70.3
CNN & Vision Transformer
UniFormer [2022] 22M 3.5G 58.4 80.1 70.4
Conformer [2021] 103M 23.5G 76.5 80.9 72.4
VST [2021] 44M 23.2G - 82.8 70.1
PGNet [2022] 72M 17.6G 79.1 79.9 71.2
CINFormer* 30M 7.1G 84.2 83.8 75.2
CINFormer 30M 7.1G 86.5 85.7 78.1
Figure 4: Visualization of segmentation results obtained by various methods. The top two rows are from NEU, the middle two rows are from MT, and the last two rows are from DAGM. Different colors on each dataset represent different kinds of defects. The inconsistent green color in the third row and the inconsistent red color in the fifth and sixth rows indicate that the corresponding methods wrongly recognize the types of these defect areas, and an uncolored image means that the method fails to identify any defect.

IV-A Datasets

In the experiment, NEU [43], DAGM 2007 [44], and Magnetic tile [45] (MT) defect datasets are selected to demonstrate the effectiveness of the proposed method. The details of the three defect datasets are illustrated as follows.

NEU is a steel strip surface defect dataset, which contains three types of defects: patch, inclusion, and scratch. There are 300 images for each type of defect, and each image is provided with pixel-level ground truth. The intra-class difference and inter-class similarity of defects bring great challenges to defect segmentation.

DAGM 2007 is an artificially generated dataset whose samples are very similar to real-world ones. It contains 10 types of defects generated from various texture and defect models. The defect region in each defect image is roughly marked by an ellipse. Most of the defects are low-contrast and affected by complex background interference, so it is hard to detect them accurately.

MT is a magnetic tile defect dataset, which contains 1344 images. There are 392 defect images, including five categories of defects: uneven, fray, crack, blowhole, and break. These defects exhibit complicated appearances with varied scales and low contrast, which brings challenges to defect segmentation.

Image Size: The images in the three defect datasets vary from 200×200 to 512×512 in size. It is a common practice to resize the image to a fixed resolution in the defect segmentation task [17, 46].

Figure 5: Comparison of feature visualization of different methods. The features are from the last stage of the corresponding models. The yellow color denotes a high response for defect regions while the blue color denotes the background.

IV-B Implementation Details

We implement the proposed method based on PyTorch [47]. All experiments are conducted on a Tesla V100 platform. The CNN stem adopts ResNet-18 pre-trained on ImageNet [48]. We use the AdamW [49] optimizer with a learning rate of 0.00075 and a weight decay of 0.005 to train the model. Meanwhile, a cosine decay learning rate scheduler is adopted during the training stage. For each dataset, 70% of the images in each category are selected as the training set and the remaining images are used as the test set. Each image is resized to 224×224 during both the training and test stages by following the common practice [17, 46]. With a batch size of 4, the model is trained for 150, 100, and 100 epochs on the NEU, DAGM 2007, and MT defect datasets, respectively, where the first 20 epochs are used for warming up. In the experiments, the mean Intersection over Union (mIoU) is adopted to evaluate the performance of different models. We utilize the cross-entropy loss as the supervision to train the network.
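The optimizer and schedule described above can be assembled roughly as follows. This is a hedged sketch: the helper name is ours, and the paper does not specify whether the schedule is stepped per epoch or per iteration, so per-epoch stepping is assumed here.

```python
import math

import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR


def build_training_setup(model, total_epochs, warmup_epochs=20):
    """Sketch of the training recipe above: AdamW (lr 7.5e-4, weight decay 5e-3),
    20 warm-up epochs, then cosine decay, with pixel-wise cross-entropy loss."""
    optimizer = AdamW(model.parameters(), lr=0.00075, weight_decay=0.005)

    def lr_lambda(epoch):
        if epoch < warmup_epochs:
            return (epoch + 1) / warmup_epochs                      # linear warm-up
        progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
        return 0.5 * (1.0 + math.cos(math.pi * progress))           # cosine decay

    scheduler = LambdaLR(optimizer, lr_lambda)
    criterion = torch.nn.CrossEntropyLoss()                         # pixel-wise supervision
    return optimizer, scheduler, criterion
```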

Figure 6: The illustration of different CNN injection structures. (a)–(c) represent the bidirectional feature interaction structure, the convolution-followed-by-transformer structure, and the proposed multi-stage CNN feature injection structure, respectively.

IV-C Comparison with State-of-the-arts

The proposed CINFormer is compared with 13 state-of-the-art models on the DAGM 2007, NEU, and MT defect datasets. These methods are classified into CNN-based methods including UNet [50], ResNet [5], SPNet [51], GCAPNet [52], and OCNet [53]; vision transformer based methods including PVT [28], SegFormer [35], Swin-T [26], and DAT [37]; and methods combining CNN and transformer including UniFormer [24], Conformer [23], VST [54], and PGNet [40].

The comparison results of various methods are given in Table I. It is seen that the CNN-based methods are superior to the transformer-based methods on MT. This is because the MT dataset contains many weak defects and CNN is better at capturing defect details than the transformer. On the contrary, the transformer-based methods outperform the CNN-based methods on NEU. This is because most of the defects are accompanied by relatively complex background disturbances, which indicates that transformer models can better suppress background interference in complex scenes than CNNs. On the DAGM dataset, the transformer-based methods and the CNN-based methods perform similarly. As to the existing hybrid methods combining CNN and transformer, they perform differently on the three datasets: they generally perform worse than CNNs on MT, perform almost the same as transformers on NEU, and perform a little better on DAGM. This means the existing hybrid methods do not effectively exploit the respective merits of the two architectures on the surface defect datasets. The proposed CINFormer outperforms the above methods on all three defect datasets. It is noted that CINFormer* fine-tunes only the Transformer part while keeping the weights of the ResNet-18 frozen during training. Even so, it outperforms Conformer on the defect datasets. When trained in an end-to-end way, CINFormer obtains the best results on the three defect datasets. In general, the experimental results demonstrate the effectiveness of the proposed one-way CNN injection.

Fig. 4 presents some segmentation results of various methods. It is found that some methods, such as ResNet and UNet, fail to detect small or low-contrast defects. Some methods such as SegFormer and Swin-T even misclassify the defects, manifested by inconsistent colors between the predictions and the ground truth (GT) on defect regions in Fig. 4. On the contrary, the proposed CINFormer can accurately detect those defect regions. In addition, Fig. 5 presents the feature visualization results of various models on three defect samples. It is found that our method can better focus on defect regions while suppressing background interference.

IV-D Ablation Study

To further demonstrate the effectiveness of the proposed CNN injection strategy and the Top-K self-attention module, we conduct the following ablation experiments.

Ablation on the Transformer with CNN Injection: To demonstrate the superiority of the proposed CNN feature injection manner, we conduct several experiments based on the models with and without the Top-K self-attention module. We first compare the Swin-T model with the CINFormer model without the Top-K self-attention module, denoted as CINFormer (w/o), and then compare the two models with the Top-K self-attention module, where the Swin-T model with the Top-K self-attention module is denoted as Swin-T (Top-K). The experimental results in Table II show that CINFormer outperforms Swin-T in both conditions: CINFormer (w/o) surpasses Swin-T by 6.5 points and CINFormer surpasses Swin-T (Top-K) by 7.0 points in terms of mIoU on DAGM. This indicates that the injected multi-level CNN features are beneficial for the transformer to detect defects.

Furthermore, the proposed multi-stage CNN injection manner is compared with other hybrid structures of CNN and transformer, as shown in Fig. 6 (a) and (b). As shown in Table III, the proposed method obtains improvements of 1.0 and 1.8 points over the other two structures, respectively. It is noted that structure (a) is a bidirectional feature injection pattern with features of CNN and transformer interacting with each other, which is similar to Conformer [23]. Though this kind of bidirectional interaction can promote the full merging of CNN features and transformer features, it impairs the ability of the CNN branch to represent the detailed features. As a result, our proposed one-way CNN feature injection method outperforms it on the defect dataset.

TABLE II: Comparison of CINFormer (w/o) and Swin-T on the DAGM dataset. CINFormer (w/o) denotes the CINFormer without the Top-K self-attention module. Swin-T (Top-K) denotes the Swin transformer with the Top-K self-attention.
Model  #Param  FLOPs  mIoU (DAGM)
Swin-T 28M 4.5G 70.3
CINFormer (w/o) 32M 7.3G 76.8
Swin-T (Top-K) 26M 4.5G 71.1
CINFormer 30M 7.1G 78.1
TABLE III: Comparison of different CNN feature injection structures presented in Fig. 6 on the DAGM dataset.
Structure  #Param  FLOPs  Time/epoch  mIoU (DAGM)
(a) 28M 6.7G 102s 75.8
(b) 29M 7.1G 101s 75.0
(c) 30M 7.3G 95s 76.8
TABLE IV: Comparison of the proposed Top-K self-attention module and the traditional self-attention module in different transformer models. CINFormer(w/o) denotes the CINFormer without the Top-K self-attention module.
Model  #Param  FLOPs  mIoU (MT)  mIoU (NEU)  mIoU (DAGM)
Swin-T [2021] 28M 4.5G 64.0 80.0 70.3
Swin-T (Top-K) 26M 4.3G 66.7 81.6 71.1
Conformer [2021] 103M 23.6G 76.5 80.9 71.6
Conformer (Top-K) 103M 23.0G 76.9 81.2 72.4
Uniformer [2022] 21M 3.6G 58.2 80.1 69.7
Uniformer (Top-K) 20M 3.4G 60.2 80.6 70.4
CINFormer (w/o) 32M 7.3G 85.6 85.2 76.8
CINFormer 30M 7.1G 86.5 85.7 78.1
TABLE V: Comparison of using the Top-K self-attention at different stages on the DAGM dataset. It should be noted that "✓" denotes that the self-attention module is replaced with the Top-K self-attention in stage $i$.
Model Stage #Params FLOPs DAGM
1 2 3 4 mIoU
Swin-T (Top-K) 27M 4.4G 70.3
27M 4.3G 70.5
26M 4.3G 71.1
26M 4.3G 70.7
25M 4.2G 70.8
CINFormer 31M 7.2G 77.9
31M 7.2G 78.1
30M 7.1G 78.1
30M 7.1G 77.9
29M 7.0G 77.8

Ablation on the Top-K Self-attention Module: To demonstrate the effectiveness of the proposed Top-K self-attention module in the transformer network, we use it to replace the self-attention module of different transformer models. Specifically, we only replace the self-attention module in the last two stages of the transformer models. The experimental results in Table IV show that the introduction of Top-K self-attention module improves mIoU by 0.8, 0.8, 0.7, and 1.3 points compared with Swin-T, Conformer, Uniformer, and CINFormer (w/o) on the DAGM defect dataset, respectively. This indicates that the proposed Top-K self-attention module is a plug-and-play module, which can improve the performance of various transformer models in defect detection. Meanwhile, the proposed Top-K self-attention module reduces the computational complexity of the transformer model because it only selects important tokens and channels to calculate attention. Furthermore, we present the feature visualization of CINFormer and CINFormer (w/o), as shown in Fig. 7. The introduction of the Top-K self-attention module can make CINFormer focus on defect regions and suppress background interference.

Configuration of the Top-K Self-attention Module: First, we compare the performance of models using the Top-K self-attention at different stages. To be specific, experiments are conducted based on the models with and without the CNN injection module, as shown in Table V. The Swin-T model contains four stages, denoted as 1, 2, 3, and 4, respectively. We gradually add the proposed Top-K self-attention to the transformer network starting from the last stage. The proposed CINFormer achieves the best performance when the Top-K self-attention is used in the last two stages. This is because a single token in the earlier stages is not highly correlated with its surrounding tokens [42], which makes it hard to reconstruct the abandoned tokens. Therefore, using the Top-K self-attention mechanism in the early stages makes the model lose some important detailed information during feature extraction, thereby reducing the recognition accuracy. In contrast, a single token in the later stages has a strong correlation with its surrounding tokens, where the defect (target) tokens are easily affected by background noise, so the Top-K selection helps suppress this interference. To sum up, when the Top-K self-attention modules are used in the last two stages, the proposed CINFormer achieves the best balance between accuracy and cost.

Secondly, we further investigate the effect of the hyperparameter K on model performance. The experiments are conducted on the models with and without the CNN injection module. K is set to 14, 21, 28, 35, and 42, respectively. As shown in Table VI, the proposed CINFormer achieves the best performance when K is set to 28. This is because a small K discards many tokens, including some carrying useful information, whereas a large K retains too much redundant background that overwhelms the detailed information, degenerating toward the original self-attention mechanism.

Figure 7: Comparison of the feature visualization of CINFormer and CINFormer without Top-K self-attention module denoted as CINFormer (w/o).
TABLE VI: Comparison of different K values of the Top-K self-attention module on the DAGM dataset. It should be noted that Swin-T (Top-K) denotes the Swin transformer with the Top-K self-attention.
Model  Number-K  #Param  FLOPs  mIoU (DAGM)
14 25M 4.2G 70.7
21 26M 4.3G 70.8
Swin-T (Top-K) 28 26M 4.3G 71.1
35 27M 4.5G 71.0
42 27M 4.5G 71.0
14 30M 7.0G 77.3
21 30M 7.0G 77.6
CINFormer 28 30M 7.1G 78.1
35 31M 7.1G 78.0
42 31M 7.2G 77.0

V Conclusion

In this paper, we propose a novel transformer network with multi-stage CNN feature injection for surface defect segmentation, termed CINFormer. CINFormer is a UNet-like architecture in which the encoder takes advantage of the transformer and the multi-level CNN features to promote the representation capacity of the features. In the meantime, a Top-K self-attention module is presented to reduce the impact of redundant background information. It selects valuable tokens and channels to guide the model to highlight defect regions. The extensive experiments on the DAGM 2007, Magnetic tile, and NEU defect datasets demonstrate the effectiveness of the proposed CINFormer in different defect scenes.

References

  • [1] P. Bergmann, M. Fauser, D. Sattlegger, and C. Steger, “Mvtec ad–a comprehensive real-world dataset for unsupervised anomaly detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9592–9600, 2019.
  • [2] H. Liu, X. Xu, E. Li, S. Zhang, and X. Li, “Anomaly detection with representative neighbors,” IEEE Transactions on Neural Networks and Learning Systems, vol. 34, no. 6, pp. 2831–2841, 2021.
  • [3] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning representations by back-propagating errors,” Nature, vol. 323, no. 6088, pp. 533–536, 1986.
  • [4] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” Advances in neural information processing systems, vol. 25, 2012.
  • [5] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
  • [6] M. Tan and Q. Le, “Efficientnet: Rethinking model scaling for convolutional neural networks,” in Proceedings of the International conference on machine learning, pp. 6105–6114, 2019.
  • [7] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” Proceedings of the International Conference on Learning Representations, 2014.
  • [8] Y.-P. Tang, X.-S. Wei, B. Zhao, and S.-J. Huang, “Qbox: Partial transfer learning with active querying for object detection,” IEEE Transactions on Neural Networks and Learning Systems, vol. 34, no. 6, pp. 3058–3070, 2021.
  • [9] J. Li, Z. Wang, Z. Pan, Q. Liu, and D. Guo, “Looking at boundary: Siamese densely cooperative fusion for salient object detection,” IEEE Transactions on Neural Networks and Learning Systems, vol. 34, no. 7, pp. 3580–3593, 2021.
  • [10] J. Cao, Y. Pang, J. Han, and X. Li, “Hierarchical regression and classification for accurate object detection,” IEEE Transactions on Neural Networks and Learning Systems, vol. 34, no. 5, pp. 2425–2439, 2023.
  • [11] M. Li, Y. Zhang, M. Xiao, W. Zhang, and X. Sun, “Unsupervised learning for salient object detection via minimization of bilinear factor matrix norm,” IEEE Transactions on Neural Networks and Learning Systems, vol. 34, no. 3, pp. 1354–1366, 2023.
  • [12] P. Wang, P. Chen, Y. Yuan, D. Liu, Z. Huang, X. Hou, and G. Cottrell, “Understanding convolution for semantic segmentation,” in IEEE winter conference on applications of computer vision, pp. 1451–1460, 2018.
  • [13] N. Zhang, X. Chen, X. Xie, S. Deng, C. Tan, M. Chen, F. Huang, L. Si, and H. Chen, “Document-level relation extraction as semantic segmentation,” in Proceedings of the Twenty-Ninth International Conference on International Joint Conferences on Artificial Intelligence, 2021.
  • [14] D. Tabernik, S. Šela, J. Skvarč, and D. Skočaj, “Segmentation-based deep-learning approach for surface-defect detection,” Journal of Intelligent Manufacturing, vol. 31, no. 3, pp. 759–776, 2020.
  • [15] Y. He, K. Song, Q. Meng, and Y. Yan, “An end-to-end steel surface defect detection approach via fusing multiple hierarchical features,” IEEE Transactions on Instrumentation and Measurement, vol. 69, no. 4, pp. 1493–1504, 2019.
  • [16] B. Fang, X. Long, F. Sun, H. Liu, S. Zhang, and C. Fang, “Tactile-based fabric defect detection using convolutional neural network with attention mechanism,” IEEE Transactions on Instrumentation and Measurement, vol. 71, pp. 1–9, 2022.
  • [17] H. Dong, K. Song, Y. He, J. Xu, Y. Yan, and Q. Meng, “Pga-net: Pyramid feature fusion and global context attention network for automated surface defect detection,” IEEE Transactions on Industrial Informatics, vol. 16, no. 12, pp. 7448–7458, 2019.
  • [18] B. Wei, K. Hao, L. Gao, and X.-s. Tang, “Detecting textile micro-defects: A novel and efficient method based on visual gain mechanism,” Information Sciences, vol. 541, pp. 60–74, 2020.
  • [19] B. Su, H. Chen, K. Liu, and W. Liu, “Rcag-net: Residual channelwise attention gate network for hot spot defect detection of photovoltaic farms,” IEEE Transactions on Instrumentation and Measurement, vol. 70, pp. 1–14, 2021.
  • [20] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in Neural Information Processing Systems, vol. 30, 2017.
  • [21] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” in Proceedings of the International Conference on Learning Representations, 2021.
  • [22] N. Park and S. Kim, “How do vision transformers work?,” in Proceedings of the International Conference on Learning Representations, 2022.
  • [23] Z. Peng, W. Huang, S. Gu, L. Xie, Y. Wang, J. Jiao, and Q. Ye, “Conformer: Local features coupling global representations for visual recognition,” in Proceedings of the IEEE International Conference on Computer Vision, pp. 367–376, 2021.
  • [24] K. Li, Y. Wang, P. Gao, G. Song, Y. Liu, H. Li, and Y. Qiao, “Uniformer: Unified transformer for efficient spatiotemporal representation learning,” Proceedings of the International Conference on Learning Representations, 2022.
  • [25] H. Wu, B. Xiao, N. Codella, M. Liu, X. Dai, L. Yuan, and L. Zhang, “Cvt: Introducing convolutions to vision transformers,” in Proceedings of the IEEE International Conference on Computer Vision, pp. 22–31, 2021.
  • [26] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proceedings of the IEEE International Conference on Computer Vision, pp. 10012–10022, 2021.
  • [27] Z. Qin, W. Sun, H. Deng, D. Li, Y. Wei, B. Lv, J. Yan, L. Kong, and Y. Zhong, “cosformer: Rethinking softmax in attention,” Proceedings of the International Conference on Learning Representations, 2022.
  • [28] W. Wang, E. Xie, X. Li, D.-P. Fan, K. Song, D. Liang, T. Lu, P. Luo, and L. Shao, “Pyramid vision transformer: A versatile backbone for dense prediction without convolutions,” in Proceedings of the IEEE International Conference on Computer Vision, pp. 568–578, 2021.
  • [29] J. Yang, G. Fu, W. Zhu, Y. Cao, Y. Cao, and M. Y. Yang, “A deep learning-based surface defect inspection system using multiscale and channel-compressed features,” IEEE Transactions on Instrumentation and Measurement, vol. 69, no. 10, pp. 8032–8042, 2020.
  • [30] B. Su, H. Chen, and Z. Zhou, “Baf-detector: An efficient cnn-based detector for photovoltaic cell defect detection,” IEEE Transactions on Industrial Electronics, vol. 69, no. 3, pp. 3161–3171, 2021.
  • [31] L. Yang, S. Xu, J. Fan, E. Li, and Y. Liu, “A pixel-level deep segmentation network for automatic defect detection,” Expert Systems with Applications, vol. 215, p. 119388, 2023.
  • [32] X. Jiang, F. Yan, Y. Lu, K. Wang, S. Guo, T. Zhang, Y. Pang, J. Niu, and M. Xu, “Joint attention-guided feature fusion network for saliency detection of surface defects,” IEEE Transactions on Instrumentation and Measurement, vol. 71, pp. 1–12, 2022.
  • [33] X. Xiang, M. Liu, S. Zhang, P. Wei, and B. Chen, “Multi-scale attention and dilation network for small defect detection,” Pattern Recognition Letters, 2023.
  • [34] W. Wang, L. Yao, L. Chen, B. Lin, D. Cai, X. He, and W. Liu, “Crossformer: A versatile vision transformer hinging on cross-scale attention,” Proceedings of the International Conference on Learning Representations, 2022.
  • [35] E. Xie, W. Wang, Z. Yu, A. Anandkumar, J. M. Alvarez, and P. Luo, “Segformer: Simple and efficient design for semantic segmentation with transformers,” Advances in Neural Information Processing Systems, vol. 34, 2021.
  • [36] J. Yang, C. Li, P. Zhang, X. Dai, B. Xiao, L. Yuan, and J. Gao, “Focal self-attention for local-global interactions in vision transformers,” Advances in Neural Information Processing Systems, 2021.
  • [37] Z. Xia, X. Pan, S. Song, L. E. Li, and G. Huang, “Vision transformer with deformable attention,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4794–4803, 2022.
  • [38] H. Fan, B. Xiong, K. Mangalam, Y. Li, Z. Yan, J. Malik, and C. Feichtenhofer, “Multiscale vision transformers,” in Proceedings of the IEEE International Conference on Computer Vision, pp. 6824–6835, 2021.
  • [39] S. Ren, D. Zhou, S. He, J. Feng, and X. Wang, “Shunted self-attention via multi-scale token aggregation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 10853–10862, 2022.
  • [40] C. Xie, C. Xia, M. Ma, Z. Zhao, X. Chen, and J. Li, “Pyramid grafting network for one-stage high resolution saliency detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 11717–11726, 2022.
  • [41] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, “Feature pyramid networks for object detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2117–2125, 2017.
  • [42] Z. Yang, Z. Li, A. Zeng, Z. Li, C. Yuan, and Y. Li, “Vitkd: Practical guidelines for vit feature knowledge distillation,” in Proceedings of the International Conference on Learning Representations, 2022.
  • [43] K. Song and Y. Yan, “A noise robust method based on completed local binary patterns for hot-rolled steel strip surface defects,” Applied Surface Science, vol. 285, pp. 858–864, 2013.
  • [44] M. Jager, C. Knoll, and F. A. Hamprecht, “Weakly supervised learning of a classifier for unusual event detection,” IEEE Transactions on Image Processing, vol. 17, no. 9, pp. 1700–1708, 2008.
  • [45] Y. Huang, C. Qiu, and K. Yuan, “Surface defect saliency of magnetic tile,” The Visual Computer, vol. 36, no. 1, pp. 85–96, 2020.
  • [46] X. Zhou, H. Fang, Z. Liu, B. Zheng, Y. Sun, J. Zhang, and C. Yan, “Dense attention-guided cascaded network for salient object detection of strip steel surface defects,” IEEE Transactions on Instrumentation and Measurement, vol. 71, pp. 1–14, 2022.
  • [47] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, “Automatic differentiation in pytorch,” 2017.
  • [48] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255, 2009.
  • [49] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” Proceedings of the International Conference on Learning Representations, 2015.
  • [50] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical image computing and computer-assisted intervention, pp. 234–241, 2015.
  • [51] Q. Hou, L. Zhang, M.-M. Cheng, and J. Feng, “Strip pooling: Rethinking spatial pooling for scene parsing,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4003–4012, 2020.
  • [52] Z. Chen, Q. Xu, R. Cong, and Q. Huang, “Global context-aware progressive aggregation network for salient object detection,” in Proceedings of the AAAI conference on artificial intelligence, vol. 34, pp. 10599–10606, 2020.
  • [53] Y. Yuan, L. Huang, J. Guo, C. Zhang, X. Chen, and J. Wang, “Ocnet: Object context for semantic segmentation,” International Journal of Computer Vision, vol. 129, no. 8, pp. 2375–2398, 2021.
  • [54] N. Liu, N. Zhang, K. Wan, L. Shao, and J. Han, “Visual saliency transformer,” in Proceedings of the IEEE International Conference on Computer Vision, pp. 4722–4732, 2021.