Defect Image Sample Generation With Diffusion Prior for Steel Surface Defect Recognition
Abstract
Steel surface defect recognition is an industrial problem of great practical value. Data insufficiency is the major challenge in training a robust defect recognition network. Existing methods have investigated enlarging the dataset by generating samples with generative models. However, their generation quality is still limited by the insufficiency of defect image samples. To this end, we propose Stable Surface Defect Generation (StableSDG), which transfers the vast generation distribution embedded in the Stable Diffusion model to steel surface defect image generation. To tackle the distinctive distribution gap between steel surface images and images generated by the diffusion model, we propose two processes. First, we align the distributions by adapting the parameters of the diffusion model, in both the token embedding space and the network parameter space. Second, in the generation process, we propose image-oriented generation rather than generation from pure Gaussian noise. We conduct extensive experiments on steel surface defect datasets, demonstrating state-of-the-art performance in generating high-quality samples and in training recognition models, and both designed processes are significant for the performance.
Note to Practitioners—This article introduces StableSDG, a method that generates realistic defect images even with limited data. It overcomes the shortcomings of current deep learning approaches that need large datasets to train from scratch. Our solution is to adapt a text-to-image diffusion model for defect generation. The proposed strategy involves two processes: training to adapt token embeddings and model parameters, and generation from partially perturbed defect images. The results show enhanced generation quality and improved accuracy for recognition models trained on the expanded dataset. StableSDG can be practically applied to efficiently enlarge a defect dataset, even when starting with a small amount of data.
Index Terms:
Text-to-image diffusion, data expansion, deep learning, textual inversion, low-rank adaptation, defect image generation, steel surface defect recognition.
I Introduction
Steel surface defect recognition aims at categorizing imperfections found on steel products. This practice plays a vital role in enhancing the quality of these products [1]. Traditional techniques, which require manual inspections for regular evaluations of structural and functional necessities, are not just time-consuming but also require significant manpower [2].
In contrast, Automated Visual Inspection (AVI) offers distinct benefits in accuracy and efficiency. Among existing methods for defect recognition in industrial manufacturing [3, 4, 5], deep learning techniques are increasingly prevalent [6, 7, 8]. These methods recognize defects with a neural network that learns defect patterns from a collection of defect samples and their corresponding labels in an end-to-end process. Unfortunately, because defects do not occur on a predictable schedule, training often ends up with a shortage of images, which makes it difficult for the neural networks to work effectively.
To tackle this problem, a straightforward solution is to expand the defect dataset. This can be done by creating more samples using generative models [9, 10, 11, 12, 13, 14, 15]. Niu et al. [10] propose the surface defect-generation adversarial network (SDGAN), which includes two generators and four discriminators, to generate defect samples from defect-free images. Zhang et al. [13] design Defect-GAN with a compositional layer-based architecture to achieve the generation and removal of defects on surface images. Inspired by the principle of image-to-image translation, Zhao et al. [11] design transP2P, which combines a transformer and a U-Net: the former captures global features, while the latter better extracts local details, so as to transform defect-free images into defect images. However, training these generative models from scratch is challenging when the image samples are insufficient, which often leads to undesired patterns in the generated samples.
Recently, text-to-image generative models [16, 17, 18, 19, 20, 21] have demonstrated impressive capabilities. They embed a vast image distribution and can generate samples with high fidelity and diversity. One such open-source model, Stable Diffusion [22], enables many powerful downstream applications through its efficient latent diffusion approach [23, 24, 25]. To further improve generation fidelity to the content provided in a few reference images, existing methods have explored how to inject the shared concept of the reference images into the diffusion model for such customized generation [26, 27, 28, 29, 30]. Chen et al. [30] first propose full-parameter adaptation to fit the diffusion model to the provided images. To alleviate catastrophic forgetting when the number of references is limited, Han et al. [28] introduce a small set of trainable parameters to the diffusion model through SVD decomposition, so as to align with the provided images while avoiding over-fitting. Chen et al. [29] introduce an image-conditioned adapter that preserves the concept features of the provided images without optimizing the network parameters. Although these methods can generate high-fidelity samples with limited data, their generated content largely overlaps with the original generation distribution, which has a distinctive gap from the defect image distribution. As a result, using Stable Diffusion to directly generate steel surface defects is ineffective and could negatively impact classifier training by introducing low-quality data. In this paper, we propose StableSDG, which leverages the strong generative capabilities of the Stable Diffusion model to generate defect image samples. To adapt the power of Stable Diffusion for generating high-quality steel surface defect data, we propose a pipeline that includes generator adaptation and data generation processes:
In the process of generator adaptation, rather than full-parameter adaptation, we use a combination of Textual Inversion [26] and low-rank adaptation [31] to align the diffusion model with the distribution of defective images with limited parameter change. In the process of data generation, rather than generating from pure Gaussian noise, we propose to start the process from partially perturbed dataset samples. Our proposed pipeline is shown to be effective in generating defective image samples with limited data, achieving state-of-the-art performance in producing high-quality samples, and improving the performance of the defect recognition model.
In conclusion, our contributions are described as follows:
• To tackle the scarcity of defect image samples, we propose to employ the powerful Stable Diffusion model for steel surface defect generation. To our knowledge, this is the first time that a text-to-image diffusion prior is employed for industrial image generation.
• An effective pipeline, StableSDG, is proposed to adapt the text-to-image generative network for generating defect images across a large distribution gap. It consists of efficient adaptation of the network in both the token embedding space and the network parameter space during training, together with a generation scheme that starts from image-oriented initialization.
• We conduct extensive experiments demonstrating that StableSDG generates defect images with higher fidelity than existing methods. Besides, it can effectively expand the defect dataset and substantially improve the accuracy, by around 10%, on the task of continuous casting billet surface defect recognition.
The rest of this paper is organized as follows. In Section II, we review related work on defect image generation and recognition, as well as existing efforts that apply the Stable Diffusion prior to customized generation. In Section III, we present the strategies that compose StableSDG for generating defect images using the Stable Diffusion prior. In Section IV, we conduct experiments to evaluate the quality of the generated defect images and validate their effectiveness in improving the performance of the recognition model. Section V summarizes our work.
II Related Work
In this section, we discuss existing work related to defect image generation and recognition. We also cover advances in using the text-to-image diffusion prior for customized generation tasks given limited reference images.
II-A Defect Image Generation and Recognition
The unpredictable occurrence of defects results in insufficient training data, making it very challenging to train a robust defect recognition model. To address this, existing methods can be categorized into two approaches: 1) developing algorithms to effectively train recognition models with limited data [32, 33, 34] and 2) generating additional samples to expand the dataset for training the recognition model [10, 11, 12, 13, 14, 15].
To improve performance through the training algorithm, Song et al. [32] design a dynamic weighting module for discriminative feature extraction and a covariance metric module for similarity measurement. Wang et al. [34] propose to pre-train the model with unlabeled data to learn effective image representations, then fine-tune it with labeled data. In addition to recognizing defects in images, there are also methods designed to locate them [35, 36, 37].
For defect dataset expansion, traditional approaches typically simulate defects through digital image processing or artificially introduce defects into defect-free workpieces [38, 39, 40]. However, these methods can only create relatively simple defects with minimal diversity, often resulting in significant wastage and increased cost. Thanks to the excellent image generation performance of deep learning, data expansion becomes achievable by utilizing random variables sampled from known distributions. The basic generative models are Variational Autoencoders (VAEs) [41] and Generative Adversarial Networks (GANs) [42]; models derived from them [43, 44, 45] show satisfactory generative ability and have been widely used in general image generation tasks. Building on these variants, several methods have been proposed for defect generation. For example, Yun et al. [9] propose a conditional VAE (CVAE) whose decoder input is the encoding of the defect label concatenated with the latent variable, so as to generate images for each type of defect, while GAN-based defect generation methods usually generate defect samples from defect-free images. SDGAN [10] contains two generators and four discriminators to expand a commutator cylinder surface defect image dataset using a large number of defect-free images from industrial sites. Zhang et al. [13] design Defect-GAN with a compositional layer-based architecture to achieve the generation and removal of defects on surface images. Inspired by the principle of image-to-image translation, Zhao et al. [11] introduce transP2P, which combines a transformer and a U-Net: the former captures global features, while the latter better extracts local details, so as to transform defect-free images into defect images. Duan et al. [14] transfer a model pre-trained on defect-free images to defect images to produce reasonable defect masks and accordingly manipulate the features within the masked regions. Yang et al. [46] train a denoising diffusion probabilistic model [47] from scratch to generate data for fault diagnosis. Furthermore, some methods [12, 15] are proposed to control the regions and strength of generated defects.
These methods all require training models from scratch with a vast collection of images. However, due to constraints in industrial settings, like intricate lighting conditions and noise interference, collecting a comprehensive set of defect-free or defect samples is difficult. Motivated by the text-to-image diffusion model being embedded with a wide image distribution, we explore the use of text-to-image priors as a solution to defect image generation with limited available data.
II-B Customized Generation with Text-to-image Prior
Text-to-image generators [18, 19, 20, 21] based on diffusion models [47, 48] show a high capacity for high-fidelity generation given diverse and abstract textual descriptions. One of the most notable examples is the Stable Diffusion model [22], which excels at generating images with low computational cost. However, these generators cannot be tailored to individual preferences: users are confined to the concepts the network has been trained on. Considering the generation of defect images, when the term "steel surface defect" is given as a prompt, the images generated by the Stable Diffusion model miss the intricate textures and differ significantly from steel surface defect images obtained in real environments, as illustrated in Fig. 1. To facilitate customized image generation, one could modify the Stable Diffusion model by incorporating new images into it. Yet adapting the entire model with just a handful of images can significantly disrupt the learning process: the network may rapidly overfit to the new images and lose the broad array of concepts it was initially trained on. As a result, there is a need for a regulated adaptation method that introduces new concepts into the pre-trained model without compromising its original knowledge.

A technique known as Textual Inversion [26] has been introduced, which learns a new token embedding from a small number of training examples and a prompt describing an unfamiliar concept. Zhang et al. [49] introduce a spatial regularization approach aimed at balancing attention among composed concepts. Cai et al. [50] address the interference of unrelated information by utilizing multiple tokens for image representation, mitigating its impact on the target concept. Additionally, Kumari et al. [51] achieve joint training for multiple concepts. However, the effectiveness of these methods is constrained by the limited number of parameters that can be trained within the token embedding, so the generative capabilities of the model are not fully exploited. Another method for customizing Stable Diffusion, named DreamBooth [27], aims to maintain the integrity of the original knowledge by retraining the model on a combination of images generated by the original model and the new target images. Learning the entire generative model for each newly introduced concept is not only expensive but also carries the risk of the model becoming too finely tuned to the new images; this overfitting can lead to negative consequences such as catastrophic forgetting. To address these concerns, Chen et al. [52] introduce apprenticeship learning to text-to-image generation, but the single apprentice model needs to be trained on a large amount of data, so this method may not be suitable when data availability is limited. In terms of adapting large pre-trained models, low-rank adaptation [31] is proposed to update a small set of parameters that can significantly influence the behavior of the model, avoiding overfitting when adding new capabilities. This approach allows a targeted modification of the model's function while leveraging the rich representations learned during its initial extensive training phase. Nevertheless, the content produced by these adaptation methods frequently shows substantial overlap with the original generative distribution, and high-quality results may not be achievable when the target images deviate significantly from that distribution.
It should be noted that, to the best of our knowledge, this is the first work that preserves pre-existing knowledge while injecting new defect concepts into the Stable Diffusion model, despite the distinctive distribution disparity between steel surface images and the images generated by the diffusion model. When only a few images are available, our approach is still capable of generating defect images with high fidelity. By using our model to expand the defect dataset, the performance of recognition models is substantially enhanced.

III Proposed Method
In this section, we introduce our method for leveraging the Stable Diffusion prior to expand the dataset under limited data resources. Specifically, we implement the Stable Diffusion model [22] as the diffusion prior; it performs the diffusion in the latent space instead of the image space and is widely used for image generation tasks [26, 27, 28, 29, 30]. We conduct our proposed StableSDG, which is composed of two processes, to generate images of each defect category. Through iterative quality evaluation, we tune the hyperparameters to achieve optimal image generation. With the best hyperparameters, we generate high-quality images to expand the dataset. The generated images of each defect category, together with the ground-truth images, are collected to train the defect recognition model. The overall pipeline is shown in Fig. 2.

III-A Preliminary
The diffusion model [47] is a class of generative models that learns the data distribution through a process of stepwise noise addition and subsequent recovery of the initial data. For text-to-image generation, Stable Diffusion [22] is widely adopted for its efficient training and generation, achieved by conducting the diffusion and denoising processes in a low-dimensional latent space. At its core, Stable Diffusion combines an autoencoder with a text-conditioned latent diffusion model. In this section, we review the key components of Stable Diffusion, from the basic autoencoder and latent diffusion techniques to the text-conditioned latent diffusion model.
III-A1 Auto-encoder
Stable Diffusion consists of an encoder $\mathcal{E}$ and a decoder $\mathcal{D}$ to transform images to latent codes and vice versa. The encoder maps an image $x \in \mathbb{R}^{H \times W \times 3}$ into a latent code $z = \mathcal{E}(x)$, where $z \in \mathbb{R}^{h \times w \times c}$ with $h < H$ and $w < W$. The decoder maps such latent codes back to images. With sufficient training, it holds that $\mathcal{D}(\mathcal{E}(x)) \approx x$.
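Below is a minimal sketch of this encode/decode round trip using the AutoencoderKL shipped with Stable Diffusion v1.5 through the diffusers library; the checkpoint identifier (from [64]) and the 8x spatial down-sampling factor are assumptions about the public release, not specifics stated in this paper.

```python
# Hedged sketch of the auto-encoder round trip D(E(x)) ~ x.
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="vae")
vae.eval()

x = torch.rand(1, 3, 512, 512) * 2 - 1          # an image tensor scaled to [-1, 1]
with torch.no_grad():
    z = vae.encode(x).latent_dist.sample()      # latent code z = E(x)
    x_rec = vae.decode(z).sample                # reconstruction D(z)

print(z.shape)        # e.g. torch.Size([1, 4, 64, 64]): much lower-dimensional than x
# For real in-distribution images, D(E(x)) closely reconstructs x.
```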
III-A2 Latent Diffusion Model
The latent code $z_0$ is diffused into a series of increasingly noisy states $z_1, \dots, z_T$. Each noisy state follows the marginal distribution $q(z_t \mid z_0) = \mathcal{N}(z_t; \alpha_t z_0, \sigma_t^2 I)$, where $z_t$ can be sampled by $z_t = \alpha_t z_0 + \sigma_t \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$. Here $\alpha_t$ and $\sigma_t$ are scalars such that $\alpha_t^2 + \sigma_t^2 = 1$. For the timesteps $t = 1, \dots, T$, $\alpha_t$ and $\sigma_t$ are configured such that as $t \to T$, the posterior distribution $q(z_T \mid z_0)$ approaches a standard normal distribution. For the generation procedure, the model applies a reverse process to recover the clean latent $z_0$ from the noisy end state $z_T$. This reverse process can be represented by a Markov chain $p_\theta(z_{0:T}) = p(z_T) \prod_{t=1}^{T} p_\theta(z_{t-1} \mid z_t)$, which is a product of transition kernels parameterized by $\theta$. Each transition kernel $p_\theta(z_{t-1} \mid z_t)$ is a normal distribution with mean $\mu_\theta(z_t, t)$ and variance $\tilde{\sigma}_t^2 I$. Estimating $\mu_\theta$ is equivalent to predicting the noise $\epsilon$ in $z_t$, and such noise prediction is done with a neural network $\epsilon_\theta$ [47, 48]. In Stable Diffusion, $\epsilon_\theta$ is a U-Net [53] composed of convolution and attention operations. Besides, its noise prediction is under text guidance, which we explain next.
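The forward (noising) process above can be written in a few lines of PyTorch. The linear beta schedule below is an illustrative assumption, not the exact Stable Diffusion schedule.

```python
# Minimal sketch of z_t = alpha_t * z_0 + sigma_t * eps under a variance-preserving schedule.
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)                 # per-step noise variances (assumed schedule)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
alpha = alphas_cumprod.sqrt()                         # alpha_t
sigma = (1.0 - alphas_cumprod).sqrt()                 # sigma_t, so alpha_t^2 + sigma_t^2 = 1

z0 = torch.randn(1, 4, 64, 64)                        # a clean latent code
t = torch.randint(0, T, (1,))                         # a random diffusion timestep
eps = torch.randn_like(z0)                            # Gaussian noise
zt = alpha[t].view(-1, 1, 1, 1) * z0 + sigma[t].view(-1, 1, 1, 1) * eps
# As t approaches T, alpha_t -> 0 and sigma_t -> 1, so z_T is close to pure Gaussian noise.
# The denoising network epsilon_theta is trained to predict eps from (z_t, t).
```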
III-A3 Text-conditioned Latent Diffusion Model
The noise prediction in Stable Diffusion is conditioned on a textual description, a.k.a. the text prompt $y$. To utilize this condition, the text prompt is encoded with the CLIP text encoder [54], which maps strings to low-dimensional token embeddings. We denote this encoder as $\tau_\phi$; it is composed of a tokenizer $f_{\mathrm{tok}}$ followed by a Transformer network [55] $f_{\mathrm{trans}}$, and we denote their pre-trained parameter set as $\phi$. In detail, the words in the string $y$ are tokenized into the token embeddings $\mathbf{v} = f_{\mathrm{tok}}(y) \in \mathbb{R}^{L \times d}$, where $L$ is the token embedding length and $d$ is the feature dimension. These token embeddings are then converted into a text embedding $c = f_{\mathrm{trans}}(\mathbf{v})$. The noise prediction is thus formulated as $\epsilon_\theta(z_t, t, c)$, which takes the timestep $t$ and the encoded text prompt $c$ as conditions. With a dataset of text-image pairs and the pre-trained auto-encoder, the training objective for the network is the following loss function:
$$\mathcal{L}(\theta) = \mathbb{E}_{(x, y),\, \epsilon \sim \mathcal{N}(0, I),\, t}\left[\left\| \epsilon - \epsilon_\theta\big(z_t, t, \tau_\phi(y)\big) \right\|_2^2\right] \qquad (1)$$
which takes the expectation of the mean squared error over the text-image pairs $(x, y)$, the noise $\epsilon$, and the timestep $t$. In terms of conditional generation, on top of the transition kernel mentioned in the previous section, diffusion models balance the fidelity and diversity of conditional generation using classifier-free guidance [56]:
$$\tilde{\epsilon}_\theta(z_t, t, c) = \epsilon_\theta(z_t, t, \varnothing) + w \left( \epsilon_\theta(z_t, t, c) - \epsilon_\theta(z_t, t, \varnothing) \right) \qquad (2)$$
where $\epsilon_\theta(z_t, t, \varnothing)$ is the noise prediction without text guidance, and $w$ is a scalar that adjusts the influence of the condition on the generative process. Generating images in Stable Diffusion thus amounts to iterating $p_\theta(z_{t-1} \mid z_t)$ [47], starting from pure Gaussian noise $z_T \sim \mathcal{N}(0, I)$ and ending in $z_0$, which can be decoded into an image via $\hat{x} = \mathcal{D}(z_0)$.
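A hedged sketch of the classifier-free-guidance combination in Eq. (2) is given below. The `unet`, `text_emb`, and `empty_emb` arguments are assumed to come from a (possibly adapted) Stable Diffusion model, e.g. a diffusers UNet2DConditionModel and CLIP text embeddings; the sampler loop that repeatedly calls this function from $t = T$ down to $t = 0$ is omitted.

```python
# Hedged sketch of classifier-free guidance (Eq. 2).
def cfg_noise_prediction(unet, z_t, t, text_emb, empty_emb, w=5.0):
    eps_cond = unet(z_t, t, encoder_hidden_states=text_emb).sample     # with text guidance
    eps_uncond = unet(z_t, t, encoder_hidden_states=empty_emb).sample  # without text guidance
    return eps_uncond + w * (eps_cond - eps_uncond)                    # guided noise estimate
```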
III-B StableSDG
To adapt the powerful Stable Diffusion model for generating high-quality steel surface defect data, we propose StableSDG, which includes two processes, i.e., generator adaptation and data generation. Fig. 3 presents an overview of our method, and the details are shown in Algorithm 1. Next, we present the two processes in detail.
III-B1 Generator Adaptation
This process adapts Stable Diffusion to generate images of steel surface defects, which deviate significantly from the model's original image distribution. We propose the following two strategies in this process.
Token Embedding Adaptation. The discrepancy between the defect images from production collection and the images generated by the Stable Diffusion model, as indicated in Fig. 1, partially stems from an inaccurate text prompt. This can be enhanced by having the text prompt better represent steel surface defect images. We adopt the strategy proposed by [26], which optimizes the token embeddings as they are differentiable. Specifically, consider the prompt $y$ written as "A photo of unknown"; its tokenization $\mathbf{v} = f_{\mathrm{tok}}(y)$ is a sequence of token embeddings, i.e., $\mathbf{v} = [v_1, \dots, v_L]$. We use the notation $v_*$ to index the part of the embeddings in this sequence that corresponds to the sub-string unknown, which refers to the specific defect concept. We denote the remaining token embeddings as $\mathbf{v}_{\setminus *}$, so that $\mathbf{v} = [\mathbf{v}_{\setminus *}, v_*]$. We optimize $v_*$ via:
$$v_*^{\ast} = \arg\min_{v_*} \; \mathbb{E}_{x,\, \epsilon \sim \mathcal{N}(0, I),\, t}\left[\left\| \epsilon - \epsilon_\theta\big(z_t, t, f_{\mathrm{trans}}([\mathbf{v}_{\setminus *}, v_*])\big) \right\|_2^2\right] \qquad (3)$$
which minimizes the training loss as in Equation 1 and takes the expectation over a specific category of defect images $x$ (e.g., crazing, inclusion), diffusion timesteps $t$, and noise $\epsilon$. We denote the full token embeddings after the optimization as $\mathbf{v}^{\ast} = [\mathbf{v}_{\setminus *}, v_*^{\ast}]$, which is used for the following adaptation.
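One possible realization of this stage registers a placeholder token for the defect concept and optimizes only its embedding row, as sketched below. `compute_diffusion_loss` stands for the noise-prediction MSE of Eq. (1) evaluated on a batch of defect images; it is a hypothetical helper, not a library function, and the placeholder token name is an assumption.

```python
# Hedged sketch of token embedding adaptation (Eq. 3): only v_* is trainable.
import torch
from transformers import CLIPTokenizer, CLIPTextModel

model_id = "runwayml/stable-diffusion-v1-5"     # checkpoint identifier from [64]
tokenizer = CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder")

tokenizer.add_tokens(["<unknown>"])             # placeholder token for the defect concept
text_encoder.resize_token_embeddings(len(tokenizer))
new_id = tokenizer.convert_tokens_to_ids("<unknown>")

embeddings = text_encoder.get_input_embeddings().weight   # full token embedding table
text_encoder.requires_grad_(False)                         # freeze all network weights
embeddings.requires_grad_(True)                            # only the embedding table receives gradients
optimizer = torch.optim.Adam([embeddings], lr=5e-4)

for step in range(1000):
    loss = compute_diffusion_loss("A photo of <unknown>", tokenizer, text_encoder)  # hypothetical helper
    optimizer.zero_grad()
    loss.backward()
    embeddings.grad[:new_id].zero_()            # zero gradients of every embedding except the new token
    embeddings.grad[new_id + 1:].zero_()
    optimizer.step()
```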
Network Parameter Adaptation. Given that optimization within the token embedding space alone may bring only limited improvement in the fidelity of the generated content, further enhancement can be achieved through adaptation of the network parameters. However, since the Stable Diffusion model comprises billions of parameters, compared with the limited amount of defect images, attempting to fine-tune the entire network, as done in [27], would lead to over-fitting. We therefore fine-tune the model through low-rank adaptation [31], which allows constrained parameter change. For the weight $W_0 \in \mathbb{R}^{m \times n}$ of each dense layer, it conducts parameter adaptation by imposing a low-rank decomposition:
$$W = W_0 + \Delta W = W_0 + BA \qquad (4)$$
where $\Delta W$ is the change in weights, represented by the product of two matrices $B \in \mathbb{R}^{m \times r}$ and $A \in \mathbb{R}^{r \times n}$. Here, the rank $r$ is significantly smaller than $\min(m, n)$. As a result, the number of parameters for adaptation is significantly reduced from $mn$ to $r(m + n)$. In practice, we set $r = 1$. Matrix $A$ is initialized following a Gaussian distribution, and matrix $B$ is initialized to zero, which ensures that $\Delta W = 0$ at the start of training. Throughout the training process, $A$ and $B$ are adjusted while $W_0$ remains static. As shown in Fig. 3, such low-rank adaptation is conducted for all the attention layers [55] in the CLIP text encoder $\tau_\phi$ and the U-Net $\epsilon_\theta$, adapting their original parameter sets from $\{\phi, \theta\}$ to $\{\phi', \theta'\}$. Both $\phi'$ and $\theta'$ are fine-tuned using the following objective function:
$$\min_{\phi', \theta'} \; \mathbb{E}_{x,\, \epsilon \sim \mathcal{N}(0, I),\, t}\left[\left\| \epsilon - \epsilon_{\theta'}\big(z_t, t, f_{\mathrm{trans}}^{\phi'}(\mathbf{v}^{\ast})\big) \right\|_2^2\right] \qquad (5)$$
This approach facilitates adaptation of the model with a much lower risk of over-fitting, because of the drastically reduced parameter space.
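A generic low-rank adapted linear layer implementing Eq. (4) is sketched below; this is an illustrative module, not the exact implementation used in the paper, and the Gaussian initialization scale is an assumption.

```python
# Minimal LoRA linear layer: W = W0 + B A with rank r, B initialized to zero.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 1, scale: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():                  # W0 (and bias) stay frozen
            p.requires_grad_(False)
        m, n = base.out_features, base.in_features
        self.A = nn.Parameter(torch.randn(r, n) * 0.01)   # Gaussian initialization
        self.B = nn.Parameter(torch.zeros(m, r))          # zero initialization -> delta W = 0 at start
        self.scale = scale

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

# Example: wrap one projection layer of an attention block.
proj = nn.Linear(768, 768)
lora_proj = LoRALinear(proj, r=1)
y = lora_proj(torch.randn(2, 77, 768))
trainable = sum(p.numel() for p in lora_proj.parameters() if p.requires_grad)
print(trainable)   # r * (m + n) = 1536 instead of m * n = 589824
```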
III-B2 Data Generation
To enhance the quality of defect image generation with the adapted Stable Diffusion, we introduce the following strategy.
Image-oriented Generation. Based on the adapted token embeddings $\mathbf{v}^{\ast}$ and the network parameters $\{\phi', \theta'\}$, we generate the latent code $\hat{z}_0$ from a partially perturbed code $z_{t_s}$, where $t_s = \lfloor s \cdot T \rfloor$ denotes the decreased maximum degree of noise diffusion and $s \in (0, 1]$ is the denoising strength. The scalar $s$ controls the similarity between the generated image and the original image $x$: the smaller the value, the higher the similarity. The image-oriented generation can be represented as iterating $p_{\theta'}(z_{t-1} \mid z_t)$ from $t = t_s$ down to $t = 0$. With $\hat{z}_0$, the generated image is $\hat{x} = \mathcal{D}(\hat{z}_0)$.
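The diffusers img2img pipeline exposes exactly the two hyperparameters tuned in Table I (strength $s$ and guidance scale $w$), so a hedged sketch of image-oriented generation with the adapted model could look as follows; the adapter and embedding file paths are placeholders, and loading the adapted weights this way is an assumption.

```python
# Hedged sketch of image-oriented generation: denoise from a partially perturbed defect image.
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.load_lora_weights("path/to/adapted_lora")             # adapted U-Net / text-encoder weights (assumed path)
pipe.load_textual_inversion("path/to/unknown_embedding")   # adapted token embedding v_* (assumed path)

init_image = Image.open("defect_sample.png").convert("RGB").resize((512, 512))
result = pipe(
    prompt="A photo of unknown",
    image=init_image,
    strength=0.5,              # denoising strength s: smaller keeps the output closer to the input
    guidance_scale=5.0,        # classifier-free guidance scale w
    num_inference_steps=50,
)
result.images[0].save("generated_defect.png")
```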
TABLE I: Final hyperparameters of image-oriented generation for each defect category.
|                | NEU |     |     |     |     |     | CCBSD |     |     |     |
|                | Cr  | In  | Pa  | PS  | RS  | Sc  | Inc   | Ind | Ox  | SG  |
| Strength       | 0.2 | 0.9 | 0.4 | 0.5 | 0.1 | 0.5 | 0.5   | 0.2 | 0.6 | 0.5 |
| Guidance scale | 2   | 9   | 6   | 3   | 8   | 7   | 5     | 7   | 3   | 4   |
III-C Quality Evaluation
To improve the quality of generated defect images, we iteratively adjust the model hyperparameters based on the Fréchet Inception Distance (FID) evaluation metric [57]. FID measures the similarity between the real and generated image distributions through

$$\mathrm{FID} = \left\| \mu_r - \mu_g \right\|_2^2 + \mathrm{Tr}\left( \Sigma_r + \Sigma_g - 2\left( \Sigma_r \Sigma_g \right)^{1/2} \right) \qquad (6)$$

where $\mu_r$ and $\mu_g$ are the mean feature vectors of the real and generated images, respectively, and $\Sigma_r$ and $\Sigma_g$ are the corresponding covariance matrices of the feature vectors. We adjust the guidance scale and strength of each defect category for lower FID scores. The final hyperparameters of image-oriented generation are detailed in Table I.
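Eq. (6) can be computed from pre-extracted Inception features as sketched below; `real_feats` and `gen_feats` are assumed to be (N, 2048) activations of the Inception-v3 pooling layer for the real and generated defect images.

```python
# Sketch of the FID computation in Eq. (6).
import numpy as np
from scipy import linalg

def compute_fid(real_feats: np.ndarray, gen_feats: np.ndarray) -> float:
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    sigma_r = np.cov(real_feats, rowvar=False)
    sigma_g = np.cov(gen_feats, rowvar=False)
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)   # (Sigma_r Sigma_g)^(1/2)
    if np.iscomplexobj(covmean):                               # drop tiny imaginary parts
        covmean = covmean.real
    return float(np.sum((mu_r - mu_g) ** 2) + np.trace(sigma_r + sigma_g - 2 * covmean))
```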
III-D Defect Recognition
After the quality evaluation, we obtain the optimal distribution of the generated defect dataset. Then, recognition models [58, 59, 60, 61, 62] are trained on the expanded defect dataset. The size of the input defect image is 200 × 200 × 3. The training objective for the recognition network is to minimize the cross-entropy loss function:
$$\mathcal{L}_{\mathrm{CE}} = -\,\mathbb{E}_{(x, y) \sim \hat{p}_{\mathrm{data}}}\left[ \sum_{k} y_k \log q_k(x) \right] \qquad (7)$$
where $\hat{p}_{\mathrm{data}}$ is the empirical distribution of the training set, and $q(x)$ is the predicted class distribution from the recognition network. The image $x$ is sampled from the combination of the original defect dataset and the generated defect images.
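A minimal sketch of training a recognition model on the expanded dataset (Eq. 7) is given below; the directory names are placeholders, and torchvision's ImageFolder is assumed to hold one sub-folder per defect category with matching class names for the real and generated splits.

```python
# Sketch of classifier training on the pooled real + generated defect images.
import torch
import torch.nn as nn
from torch.utils.data import ConcatDataset, DataLoader
from torchvision import datasets, models, transforms

tf = transforms.Compose([transforms.Resize((200, 200)), transforms.ToTensor()])
real = datasets.ImageFolder("data/real_defects", transform=tf)        # original samples
fake = datasets.ImageFolder("data/generated_defects", transform=tf)   # generated samples
loader = DataLoader(ConcatDataset([real, fake]), batch_size=32, shuffle=True)

model = models.squeezenet1_1(num_classes=6)          # e.g. the six NEU categories
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()                    # cross-entropy objective of Eq. (7)

model.train()
for images, labels in loader:
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
```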

IV Experiment
In this section, we first introduce the experimental settings (IV-A), and then validate the effectiveness of StableSDG with the ablation study (IV-B). To verify the superiority of the proposed method, we evaluate the generated image quality and the performance of the recognition models on the Northeastern University surface defect database [63] (IV-C), and the continuous casting billet surface defect dataset (IV-D) respectively.
IV-A Experimental Settings
Datasets. We conduct experiments on the Northeastern University surface defect database (NEU) [63], an open-source steel surface defect dataset. It consists of six typical surface defects of hot-rolled steel strip, including Crazing (Cr), Inclusion (In), Patches (Pa), Pitted Surface (PS), Rolled-in Scale (RS) and Scratches (Sc). Each category has 300 grayscale defect samples with 200 × 200 resolution. To verify the performance of the proposed method in practical applications, we build the continuous casting billet surface defect dataset (CCBSD). Following prior art on processing steel images from industrial production [63], we convert the images to grayscale followed by binarization to obtain the region of interest, i.e., the region where the continuous casting billet is located. Subsequently, each image is segmented into multiple sub-images with a 1:1 aspect ratio, each resized to a resolution of 200 × 200 pixels. To minimize the possibility of misrecognition, we employ human annotators to generate the ground-truth labels. Due to the insufficiency of defect images, the constructed initial dataset contains only 200 samples per class. The dataset will be made publicly available. Fig. 4 shows common surface defects of continuous casting billet, such as inclusion (Inc), indentation (Ind), oxidation (Ox) and slag groove (SG).
TABLE II: Ablation study on NEU: FID scores of images generated with different combinations of the proposed stages.
| Generator Adaptation |               | Data Generation | FID Scores |
| Token Emb.           | Network Para. | Image-oriented  |            |
| ✓                    |               |                 | 245.35     |
| ✓                    |               | ✓               | 138.19     |
|                      | ✓             |                 | 110.38     |
|                      | ✓             | ✓               | 70.56      |
| ✓                    | ✓             |                 | 106.52     |
| ✓                    | ✓             | ✓               | 64.49      |
| Full-parameter adaptation |          |                 | 111.61     |
| Full-parameter adaptation |          | ✓               | 70.96      |

Implementations. For data generation, we set up benchmarks on the NEU and CCBSD datasets and compare with existing state-of-the-art methods, i.e., DCGAN [43], StyleGAN3 [44], DDIM [48], Textual Inversion [26], DreamBooth [27], and Yang et al. [46]. Our StableSDG is based on Stable Diffusion v1.5 [64]. We conduct training with the Adam optimizer [65] and a batch size of 4. Both the generator adaptation and the data generation processes run for 1,000 iterations, with learning rates of 5e-4 and 1e-4, respectively. For the baseline methods, we adhere to their official implementations and tune the hyperparameters to ensure the best possible performance. Each model is used to generate 1,000 images per defect category. For classifier training, we use data augmentation, i.e., rotation within a limited angle range and random flipping along the horizontal and vertical axes. We conduct a thorough experiment with various classifiers, i.e., VGG [59], ResNet [60], SqueezeNet [61] and DenseNet [62]. We train these networks with the Adam optimizer, a learning rate of 1e-4, and a batch size of 32, select the best-performing networks based on their validation performance, and evaluate them on the test sets. All experiments are conducted on one NVIDIA V100 GPU.
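A possible torchvision realization of the augmentation described above (rotation plus random horizontal and vertical flips) is sketched below; the rotation range and the absence of normalization are illustrative assumptions, not the exact settings used in the paper.

```python
# Hedged sketch of the classifier-training augmentation pipeline.
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.Resize((200, 200)),               # input size 200 x 200
    transforms.RandomRotation(degrees=15),       # assumed rotation range
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomVerticalFlip(p=0.5),
    transforms.ToTensor(),
])
```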

TABLE III: FID scores with different text prompts for token embedding adaptation.
| Prompt               | FID Scores |
| "A photo of defect"  | 65.56      |
| "A photo of patches" | 68.17      |
| "A photo of unknown" | 64.60      |
TABLE IV: FID scores of image-oriented generation with different guidance scales and strengths.
|          |     | Guidance Scale |        |       |       |       |
|          |     | 3              | 4      | 5     | 6     | 7     |
| Strength | 0.3 | 112.02         | 103.54 | 97.42 | 94.09 | 94.56 |
|          | 0.4 | 97.95          | 93.12  | 90.73 | 89.69 | 90.51 |
|          | 0.5 | 99.93          | 94.31  | 93.34 | 94.92 | 97.68 |
|          | 0.6 | 99.60          | 96.88  | 94.83 | 97.71 | 97.20 |
|          | 0.7 | 99.56          | 95.02  | 93.28 | 90.04 | 95.87 |

IV-B Ablation Study
We conduct ablation studies on NEU to observe the impact of different components and identify the best configuration and hyperparameters for our proposed method.
Impact of Multiple Stages. To assess the efficacy of our proposed method with its three stages, i.e., token embedding adaptation, network parameter adaptation, and image-oriented generation, we evaluate the quality of images generated with different combinations of these stages. From Table II, we find that 1) omitting any stage of our proposed StableSDG leads to a deterioration in image quality, confirming the importance of each stage in the process; 2) compared with full-parameter adaptation, i.e., re-training the Stable Diffusion model on the limited defect image collection [27], our method yields improved generation quality. This improvement stems from our effective adaptation to the data distribution of steel surface defects, reducing the deterioration usually caused by full-parameter fine-tuning and catastrophic forgetting. We also present intermediate results of StableSDG through its three stages and across various defect categories in CCBSD, as illustrated in Fig. 5. The model incrementally generates samples that closely resemble real defect images.
Impact of the Prompt. Different text prompts within the text-to-image diffusion prior might influence generation quality. We explore several text prompts and quantitatively evaluate their effects in Table III. For each prompt, we optimize the token embedding related to the defect category, such as defect, patches, and unknown, during the token embedding adaptation stage. The results show that using unknown for new defect categories results in lower FID scores, suggesting that unknown as an initialization helps avoid local optima and performs better than more specific prompts like defect, which yield poorer outcomes.
TABLE V: FID scores of defect image generation on NEU.
| Method                           | Cr     | In     | Pa     | PS     | RS     | Sc     |
| DCGAN (CCC 18 [43])              | 169.83 | 230.78 | 153.15 | 300.34 | 182.07 | 291.41 |
| StyleGAN3 (NIPS 21 [44])         | 97.12  | 209.54 | 189.30 | 148.97 | 111.57 | 221.25 |
| DDIM (ICLR 21 [48])              | 151.51 | 83.26  | 108.71 | 139.12 | 216.07 | 105.43 |
| Textual Inversion (ICLR 23 [26]) | 246.53 | 166.09 | 173.13 | 139.13 | 278.76 | 151.64 |
| DreamBooth (CVPR 23 [27])        | 79.64  | 173.97 | 133.84 | 116.04 | 136.93 | 119.49 |
| Yang et al. (TII 24 [46])        | 68.54  | 78.07  | 91.86  | 75.22  | 82.41  | 96.99  |
| StableSDG (Ours)                 | 46.90  | 72.26  | 64.60  | 59.33  | 58.01  | 85.84  |
Impact of the LoRA Rank. Fig. 6 shows how the LoRA rank $r$ influences both the generation quality and the number of trainable parameters. While increasing $r$ leads to a higher number of trainable parameters, it does not significantly improve performance. Consequently, we set $r = 1$ in our study.
Impact of Guidance Scale and Strength. Additionally, we underscore the importance of the guidance scale and strength as the hyperparameters of image-oriented generation. As detailed in Table IV, there is about a 20% gap between the highest and lowest FID scores, indicating that well-chosen hyperparameters bring a significant improvement in the quality of generated images.
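The per-category hyperparameter search behind Tables I and IV can be sketched as a simple grid search; `fid_for` is a hypothetical helper that generates a batch with the adapted model (as in the image-oriented sketch above) and evaluates Eq. (6) against the real samples of the category.

```python
# Hedged sketch of the (guidance scale, strength) grid search scored by FID.
import itertools

guidance_grid = [3, 4, 5, 6, 7]            # candidate guidance scales (Table IV)
strength_grid = [0.3, 0.4, 0.5, 0.6, 0.7]  # candidate denoising strengths (Table IV)

best_pair, best_fid = None, float("inf")
for w, s in itertools.product(guidance_grid, strength_grid):
    fid = fid_for(guidance_scale=w, strength=s)   # hypothetical helper
    if fid < best_fid:
        best_pair, best_fid = (w, s), fid
print("selected (guidance scale, strength):", best_pair, "FID:", best_fid)
```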

IV-C Performance on NEU
Generation quality. We evaluate StableSDG for defect image generation on the NEU dataset. Table V shows the image generation performance, measured by FID, on the six categories of the dataset. Qualitative comparisons are shown in Fig. 7, where regions with abnormal textures or anomalous patterns in the generated images are marked with red boxes. It is observed that images generated by DCGAN contain abnormal textures in the backgrounds, notably in the "In", "PS" and "Sc" defect categories. The outputs of StyleGAN3 closely resemble the original images overall, but there are noticeable anomalous patterns in specific regions across most defect categories. The results of Textual Inversion exhibit distinct bright and dark stripes, whereas the backgrounds of images generated by DreamBooth are more uniform, though some areas remain unsatisfactory. Regarding DDIM, the generated images are somewhat lacking in detail, and there is a noticeable inconsistency in brightness, particularly for the defect category "Pa", as is also the case with Yang et al. In contrast, our StableSDG achieves the lowest FID in every defect category. Additionally, our method incurs the lowest training cost, requiring the least training time (21.60 minutes), as illustrated in Fig. 8.
Data substitution. To further prove that our generated data and the collected samples have high distribution overlap, we compare the recognition performance of models trained on the datasets before and after data substitution. The NEU dataset is divided into three portions, the training, validation, and testing subsets, in the ratio of 8:1:1. We then train the recognition model using only a fraction of the training subset. Let $\gamma$ represent the fraction of the original real images used for training, with values set to 0.8, 0.6, and 0.4, respectively. We use SqueezeNet [61] as the recognition model, and the evaluation results are shown in Fig. 9. It is not difficult to find that a smaller $\gamma$ corresponds to fewer training samples and causes a corresponding decrease in the accuracy of the recognition model. This trend highlights the importance of the amount of training data for the performance of the recognition model. When we supplement the training data with generated samples to match the original size of the training subset, the recognition model achieves accuracy comparable to that obtained with the complete original training subset, proving the effectiveness of the defect samples generated by the proposed method.

TABLE VI: Recognition accuracy (%) with different dataset expansion methods for defect recognition on NEU.
| Networks   | Expansion Methods | Accuracy (%) |
| AlexNet    | None              | 81.67 |
|            | DCGAN             | 92.22 |
|            | StyleGAN3         | 94.44 |
|            | DDIM              | 96.67 |
|            | Textual Inversion | 91.67 |
|            | DreamBooth        | 92.22 |
|            | Yang et al.       | 96.11 |
|            | StableSDG         | 98.33 |
| VGG        | None              | 86.67 |
|            | DCGAN             | 89.44 |
|            | StyleGAN3         | 93.89 |
|            | DDIM              | 96.67 |
|            | Textual Inversion | 90.00 |
|            | DreamBooth        | 95.56 |
|            | Yang et al.       | 97.78 |
|            | StableSDG         | 98.33 |
| ResNet     | None              | 77.78 |
|            | DCGAN             | 87.22 |
|            | StyleGAN3         | 91.67 |
|            | DDIM              | 95.00 |
|            | Textual Inversion | 92.78 |
|            | DreamBooth        | 93.89 |
|            | Yang et al.       | 95.56 |
|            | StableSDG         | 97.78 |
| SqueezeNet | None              | 62.78 |
|            | DCGAN             | 79.44 |
|            | StyleGAN3         | 93.89 |
|            | DDIM              | 96.11 |
|            | Textual Inversion | 90.56 |
|            | DreamBooth        | 91.11 |
|            | Yang et al.       | 96.11 |
|            | StableSDG         | 97.78 |
| DenseNet   | None              | 73.33 |
|            | DCGAN             | 80.00 |
|            | StyleGAN3         | 91.67 |
|            | DDIM              | 92.78 |
|            | Textual Inversion | 85.56 |
|            | DreamBooth        | 92.78 |
|            | Yang et al.       | 92.78 |
|            | StableSDG         | 94.44 |
Dataset expansion. Considering that replacing real samples in the training subset with generated ones yields results nearly equal to the original accuracy, we introduce an additional 1,000 generated samples to the training subset within the $\gamma = 0.4$ configuration (where each class has 96 real images) to determine whether the recognition performance can be further enhanced. This experiment is conducted with multiple off-the-shelf network architectures, i.e., AlexNet [58], VGG [59], ResNet [60], SqueezeNet [61] and DenseNet [62]. According to Table VI, there is a consistent performance improvement when generated images are introduced by any of the methods, showing that expanding the defect dataset with a generative model is significantly helpful for higher defect recognition accuracy. Compared with other data expansion methods, StableSDG exhibits a superior ability to enhance the performance of steel surface defect recognition. The accuracy of the five recognition models is improved by approximately 20% on average, demonstrating that our method can expand the dataset more effectively.
TABLE VII: FID scores of defect image generation on CCBSD.
| Method                           | Inclusion | Indentation | Oxidation | Slag Groove |
| DCGAN (CCC 18 [43])              | 328.58    | 270.33      | 277.52    | 242.12      |
| StyleGAN3 (NIPS 21 [44])         | 167.06    | 157.19      | 99.08     | 175.54      |
| DDIM (ICLR 21 [48])              | 166.76    | 176.14      | 81.60     | 113.24      |
| Textual Inversion (ICLR 23 [26]) | 257.31    | 146.15      | 282.77    | 202.93      |
| DreamBooth (CVPR 23 [27])        | 206.12    | 137.22      | 130.85    | 235.51      |
| Yang et al. (TII 24 [46])        | 165.78    | 161.19      | 102.54    | 127.23      |
| StableSDG (Ours)                 | 112.68    | 72.18       | 71.79     | 97.13       |

TABLE VIII: Recognition accuracy (%) with different dataset expansion methods for defect recognition on CCBSD.
| Networks   | Expansion Methods | Accuracy (%) |
| AlexNet    | None              | 82.50 |
|            | DCGAN             | 92.50 |
|            | StyleGAN3         | 95.00 |
|            | DDIM              | 94.17 |
|            | Textual Inversion | 93.33 |
|            | DreamBooth        | 93.33 |
|            | Yang et al.       | 95.83 |
|            | StableSDG         | 98.33 |
| VGG        | None              | 80.00 |
|            | DCGAN             | 93.33 |
|            | StyleGAN3         | 95.83 |
|            | DDIM              | 96.67 |
|            | Textual Inversion | 94.17 |
|            | DreamBooth        | 95.00 |
|            | Yang et al.       | 95.83 |
|            | StableSDG         | 98.33 |
| ResNet     | None              | 88.33 |
|            | DCGAN             | 95.00 |
|            | StyleGAN3         | 96.67 |
|            | DDIM              | 95.83 |
|            | Textual Inversion | 95.83 |
|            | DreamBooth        | 96.67 |
|            | Yang et al.       | 96.67 |
|            | StableSDG         | 99.17 |
| SqueezeNet | None              | 85.00 |
|            | DCGAN             | 88.33 |
|            | StyleGAN3         | 89.17 |
|            | DDIM              | 89.17 |
|            | Textual Inversion | 87.50 |
|            | DreamBooth        | 90.00 |
|            | Yang et al.       | 90.00 |
|            | StableSDG         | 91.67 |
| DenseNet   | None              | 90.00 |
|            | DCGAN             | 93.33 |
|            | StyleGAN3         | 97.50 |
|            | DDIM              | 98.33 |
|            | Textual Inversion | 93.33 |
|            | DreamBooth        | 95.00 |
|            | Yang et al.       | 96.67 |
|            | StableSDG         | 99.17 |
IV-D Performance on CCBSD
Generation quality. To further evaluate StableSDG, we also conduct experiments on CCBSD. The dataset contains four categories of defects with 200 samples per category: 140 for training, 30 for validation, and 30 for testing. The quantitative comparison of defect image generation is shown in Table VII, which shows that StableSDG achieves lower FID scores. For qualitative comparison, Fig. 10 presents the images generated by StableSDG and other generative methods. We can see that the samples generated by DCGAN show artificial textures in the background. Textual Inversion and DreamBooth improve the quality of their generated images, though some atypical patterns persist. Meanwhile, the samples from StyleGAN3, DDIM, and Yang et al. appear visually similar to real images but suffer from some blurriness. In contrast, the samples generated by StableSDG display well-defined defect characteristics. This indicates the superiority of our method in expanding continuous casting billet surface defect samples.
Dataset expansion. For each defect category, 1,000 samples are generated by each method. Then we adopt the aforementioned recognition models to perform continuous casting billet surface defect recognition on the dataset before and after expansion. The experimental results are shown in Table VIII. It can be found that the accuracy of the recognition models is higher after adding generated samples to the original training set. Compared with the other expansion methods, StableSDG has greater advantages in improving the performance of continuous casting billet surface defect recognition. The five recognition models experience an average accuracy improvement of approximately 12%, which confirms the effectiveness of our method in expanding the dataset. This enhancement also enables the recognition models to be further utilized for recognizing surface defects of continuous casting billets in industrial manufacturing settings.
V Conclusion
The scarcity of data samples presents a significant challenge for deploying deep learning techniques in the recognition of steel surface defects. To address this problem, we introduce StableSDG, which leverages the text-to-image prior for defect image generation. During generator adaptation, StableSDG adapts the model within both the token embedding space and the network parameter space. When generating data, it produces samples from image-oriented initialization instead of starting from pure Gaussian noise. The experimental results on NEU and CCBSD verify that the proposed method can generate defect images with high fidelity, which greatly improves the performance of recognition models. However, the proposed method is conditioned only on text prompts, so the image generation is stochastic and lacks spatial direction. In the future, we plan to explore other modalities as conditions, e.g., spatial conditions, to generate images that adhere to the spatial conditioning input. In doing so, we can generate defect sample images with bounding box information, which can be leveraged to improve the performance of neural networks in defect detection tasks.
References
- [1] S. Ghorai, A. Mukherjee, M. Gangadaran, and P. K. Dutta, “Automatic defect detection on hot-rolled flat steel products,” IEEE Transactions on Instrumentation and Measurement, vol. 62, no. 3, pp. 612–621, 2012.
- [2] S. I. Hassan, L. M. Dang, I. Mehmood, S. Im, C. Choi, J. Kang, Y.-S. Park, and H. Moon, “Underground sewer pipe condition assessment based on convolutional neural networks,” Automation in Construction, vol. 106, p. 102849, 2019.
- [3] Y.-h. Ai and K. Xu, “Surface detection of continuous casting slabs based on curvelet transform and kernel locality preserving projections,” Journal of Iron and Steel Research International, vol. 20, no. 5, pp. 80–86, 2013.
- [4] D.-C. Choi, Y.-J. Jeon, S. J. Lee, J. P. Yun, and S. W. Kim, “Algorithm for detecting seam cracks in steel plates using a gabor filter combination method,” Applied optics, vol. 53, no. 22, pp. 4865–4872, 2014.
- [5] C. Dongyan, X. Kewen, N. Aslam, and H. Jingzhong, “Defect classification recognition on strip steel surface using second-order cone programming-relevance vector machine algorithm,” Journal of Computational and Theoretical Nanoscience, vol. 13, no. 9, pp. 6141–6148, 2016.
- [6] S. Cheon, H. Lee, C. O. Kim, and S. H. Lee, “Convolutional neural network for wafer surface defect classification and the detection of unknown defect class,” IEEE Transactions on Semiconductor Manufacturing, vol. 32, no. 2, pp. 163–170, 2019.
- [7] I. Konovalenko, P. Maruschak, J. Brezinová, J. Viňáš, and J. Brezina, “Steel surface defect classification using deep residual neural network,” Metals, vol. 10, no. 6, p. 846, 2020.
- [8] Y. Wang, L. Gao, Y. Gao, and X. Li, “A new graph-based semi-supervised method for surface defect classification,” Robotics and Computer-Integrated Manufacturing, vol. 68, p. 102083, 2021.
- [9] J. P. Yun, W. C. Shin, G. Koo, M. S. Kim, C. Lee, and S. J. Lee, “Automated defect inspection system for metal surfaces based on deep learning and data augmentation,” Journal of Manufacturing Systems, vol. 55, pp. 317–324, 2020.
- [10] S. Niu, B. Li, X. Wang, and H. Lin, “Defect image sample generation with gan for improving defect recognition,” IEEE Transactions on Automation Science and Engineering, vol. 17, no. 3, pp. 1611–1622, 2020.
- [11] C. Zhao, W. Xue, W. Fu, Z. Li, and X. Fang, “Defect sample image generation method based on gans in diamond tool defect detection,” IEEE Transactions on Instrumentation and Measurement, 2023.
- [12] Y. Zhang, Y. Wang, Z. Jiang, F. Liao, L. Zheng, D. Tan, J. Chen, and J. Lu, “Diversifying tire-defect image generation based on generative adversarial network,” IEEE Transactions on Instrumentation and Measurement, vol. 71, pp. 1–12, 2022.
- [13] G. Zhang, K. Cui, T.-Y. Hung, and S. Lu, “Defect-gan: High-fidelity defect synthesis for automated defect inspection,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2021, pp. 2524–2534.
- [14] Y. Duan, Y. Hong, L. Niu, and L. Zhang, “Few-shot defect image generation via defect-aware feature manipulation,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 1, 2023, pp. 571–578.
- [15] W. Li, C. Gu, J. Chen, C. Ma, X. Zhang, B. Chen, and S. Wan, “Dls-gan: generative adversarial nets for defect location sensitive data augmentation,” IEEE Transactions on Automation Science and Engineering, 2023.
- [16] M. Ding, Z. Yang, W. Hong, W. Zheng, C. Zhou, D. Yin, J. Lin, X. Zou, Z. Shao, H. Yang et al., “Cogview: Mastering text-to-image generation via transformers,” Advances in Neural Information Processing Systems, vol. 34, pp. 19 822–19 835, 2021.
- [17] A. Ramesh, M. Pavlov, G. Goh, S. Gray, C. Voss, A. Radford, M. Chen, and I. Sutskever, “Zero-shot text-to-image generation,” in International Conference on Machine Learning. PMLR, 2021, pp. 8821–8831.
- [18] A. Nichol, P. Dhariwal, A. Ramesh, P. Shyam, P. Mishkin, B. McGrew, I. Sutskever, and M. Chen, “Glide: Towards photorealistic image generation and editing with text-guided diffusion models,” arXiv preprint arXiv:2112.10741, 2021.
- [19] Y. Balaji, S. Nah, X. Huang, A. Vahdat, J. Song, K. Kreis, M. Aittala, T. Aila, S. Laine, B. Catanzaro et al., “ediffi: Text-to-image diffusion models with an ensemble of expert denoisers,” arXiv preprint arXiv:2211.01324, 2022.
- [20] A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen, “Hierarchical text-conditional image generation with clip latents,” arXiv preprint arXiv:2204.06125, vol. 1, no. 2, p. 3, 2022.
- [21] C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. L. Denton, K. Ghasemipour, R. Gontijo Lopes, B. Karagol Ayan, T. Salimans et al., “Photorealistic text-to-image diffusion models with deep language understanding,” Advances in Neural Information Processing Systems, vol. 35, pp. 36 479–36 494, 2022.
- [22] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 10 684–10 695.
- [23] J. Wang, Z. Yue, S. Zhou, K. C. Chan, and C. C. Loy, “Exploiting diffusion prior for real-world image super-resolution,” arXiv preprint arXiv:2305.07015, 2023.
- [24] B. Xia, Y. Zhang, S. Wang, Y. Wang, X. Wu, Y. Tian, W. Yang, and L. Van Gool, “Diffir: Efficient diffusion model for image restoration,” arXiv preprint arXiv:2303.09472, 2023.
- [25] B. Kawar, S. Zada, O. Lang, O. Tov, H. Chang, T. Dekel, I. Mosseri, and M. Irani, “Imagic: Text-based real image editing with diffusion models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 6007–6017.
- [26] R. Gal, Y. Alaluf, Y. Atzmon, O. Patashnik, A. H. Bermano, G. Chechik, and D. Cohen-Or, “An image is worth one word: Personalizing text-to-image generation using textual inversion,” arXiv preprint arXiv:2208.01618, 2022.
- [27] N. Ruiz, Y. Li, V. Jampani, Y. Pritch, M. Rubinstein, and K. Aberman, “Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 22 500–22 510.
- [28] L. Han, Y. Li, H. Zhang, P. Milanfar, D. Metaxas, and F. Yang, “Svdiff: Compact parameter space for diffusion fine-tuning,” arXiv preprint arXiv:2303.11305, 2023.
- [29] W. Chen, H. Hu, Y. Li, N. Rui, X. Jia, M.-W. Chang, and W. W. Cohen, “Subject-driven text-to-image generation via apprenticeship learning,” arXiv preprint arXiv:2304.00186, 2023.
- [30] H. Chen, Y. Zhang, X. Wang, X. Duan, Y. Zhou, and W. Zhu, “Disenbooth: Disentangled parameter-efficient tuning for subject-driven text-to-image generation,” arXiv preprint arXiv:2305.03374, 2023.
- [31] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, “Lora: Low-rank adaptation of large language models,” arXiv preprint arXiv:2106.09685, 2021.
- [32] Y. Song, Z. Liu, S. Ling, R. Tang, G. Duan, and J. Tan, “Coarse-to-fine few-shot defect recognition with dynamic weighting and joint metric,” IEEE Transactions on Instrumentation and Measurement, vol. 71, pp. 1–10, 2022.
- [33] Y. Wang, L. Gao, Y. Gao, and X. Li, “A graph guided convolutional neural network for surface defect recognition,” IEEE Transactions on Automation Science and Engineering, vol. 19, no. 3, pp. 1392–1404, 2022.
- [34] T. Wang, Z. Li, Y. Xu, J. Chen, A. Genovese, V. Piuri, and F. Scotti, “Few-shot steel surface defect recognition via self-supervised teacher-student model with min-max instances similarity,” IEEE Transactions on Instrumentation and Measurement, 2023.
- [35] X. Dong, C. J. Taylor, and T. F. Cootes, “Defect classification and detection using a multitask deep one-class cnn,” IEEE Transactions on Automation Science and Engineering, vol. 19, no. 3, pp. 1719–1730, 2021.
- [36] D. Mo, W. K. Wong, Z. Lai, and J. Zhou, “Weighted double-low-rank decomposition with application to fabric defect detection,” IEEE Transactions on Automation Science and Engineering, vol. 18, no. 3, pp. 1170–1190, 2020.
- [37] Y. Zhang, H. Wang, W. Shen, and G. Peng, “Duak: Reinforcement learning-based knowledge graph reasoning for steel surface defect detection,” IEEE Transactions on Automation Science and Engineering, 2023.
- [38] Q. Huang, Y. Wu, J. Baruch, P. Jiang, and Y. Peng, “A template model for defect simulation for evaluating nondestructive testing in x-radiography,” IEEE Transactions on Systems, Man, and Cybernetics-Part A: Systems and Humans, vol. 39, no. 2, pp. 466–475, 2009.
- [39] D. Mery, D. Hahn, and N. Hitschfeld, “Simulation of defects in aluminium castings using cad models of flaws and real x-ray images,” Insight-Non-Destructive Testing and Condition Monitoring, vol. 47, no. 10, pp. 618–624, 2005.
- [40] D. Mery and D. Filbert, “Automated flaw detection in aluminum castings based on the tracking of potential defects in a radioscopic image sequence,” IEEE Transactions on Robotics and Automation, vol. 18, no. 6, pp. 890–901, 2002.
- [41] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” arXiv preprint arXiv:1312.6114, 2013.
- [42] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” Advances in neural information processing systems, vol. 27, 2014.
- [43] J. Li, J. Jia, and D. Xu, “Unsupervised representation learning of image-based plant disease with deep convolutional generative adversarial networks,” in 2018 37th Chinese control conference (CCC). IEEE, 2018, pp. 9159–9163.
- [44] T. Karras, M. Aittala, S. Laine, E. Härkönen, J. Hellsten, J. Lehtinen, and T. Aila, “Alias-free generative adversarial networks,” Advances in Neural Information Processing Systems, vol. 34, pp. 852–863, 2021.
- [45] A. Sauer, K. Schwarz, and A. Geiger, “Stylegan-xl: Scaling stylegan to large diverse datasets,” in ACM SIGGRAPH 2022 conference proceedings, 2022, pp. 1–10.
- [46] X. Yang, T. Ye, X. Yuan, W. Zhu, X. Mei, and F. Zhou, “A novel data augmentation method based on denoising diffusion probabilistic model for fault diagnosis under imbalanced data,” IEEE Transactions on Industrial Informatics, 2024.
- [47] J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” Advances in neural information processing systems, vol. 33, pp. 6840–6851, 2020.
- [48] J. Song, C. Meng, and S. Ermon, “Denoising diffusion implicit models,” arXiv preprint arXiv:2010.02502, 2020.
- [49] X. Zhang, X.-Y. Wei, J. Wu, T. Zhang, Z. Zhang, Z. Lei, and Q. Li, “Compositional inversion for stable diffusion models,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 7, 2024, pp. 7350–7358.
- [50] Y. Cai, Y. Wei, Z. Ji, J. Bai, H. Han, and W. Zuo, “Decoupled textual embeddings for customized image generation,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 2, 2024, pp. 909–917.
- [51] N. Kumari, B. Zhang, R. Zhang, E. Shechtman, and J.-Y. Zhu, “Multi-concept customization of text-to-image diffusion,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 1931–1941.
- [52] W. Chen, H. Hu, Y. Li, N. Ruiz, X. Jia, M.-W. Chang, and W. W. Cohen, “Subject-driven text-to-image generation via apprenticeship learning,” Advances in Neural Information Processing Systems, vol. 36, 2024.
- [53] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in Medical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18. Springer, 2015, pp. 234–241.
- [54] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in International conference on machine learning. PMLR, 2021, pp. 8748–8763.
- [55] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
- [56] J. Ho and T. Salimans, “Classifier-free diffusion guidance,” arXiv preprint arXiv:2207.12598, 2022.
- [57] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, “Gans trained by a two time-scale update rule converge to a local nash equilibrium,” Advances in neural information processing systems, vol. 30, 2017.
- [58] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” Advances in neural information processing systems, vol. 25, 2012.
- [59] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
- [60] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
- [61] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer, "Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <0.5 MB model size," arXiv preprint arXiv:1602.07360, 2016.
- [62] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 4700–4708.
- [63] K. Song and Y. Yan, “A noise robust method based on completed local binary patterns for hot-rolled steel strip surface defects,” Applied Surface Science, vol. 285, pp. 858–864, 2013.
- [64] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “Stable-diffusion-v1-5,” https://huggingface.co/runwayml/stable-diffusion-v1-5.
- [65] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.