Progressive Alignment with VLM-LLM Feature to Augment
Defect Classification for the ASE Dataset
Abstract
Traditional defect classification approaches face two barriers. (1) Insufficient training data and unstable data quality. Collecting enough defective samples is expensive and time-consuming, which leads to dataset variance and makes recognition and learning difficult. (2) Over-dependence on the visual modality. When the image pattern and texture are monotonic across all defect classes in a given dataset, the performance of a conventional AOI system cannot be guaranteed. A central question is: how can these two problems be solved when they occur at the same time? A feasible strategy is to exploit additional features within the dataset and to combine an eminent vision-language model (VLM) and large language model (LLM) with their remarkable zero-shot capability. In this work, we propose the special ASE dataset for defect classification, which includes rich data descriptions recorded on each image, although its defect features are hard to learn directly. Second, we present prompting for the VLM-LLM on the proposed ASE dataset to activate extra-modality features from images and enhance performance. Then, we design a novel progressive feature alignment (PFA) block that refines image-text features to alleviate the difficulty of alignment under the few-shot scenario. Finally, the proposed cross-modality attention fusion (CMAF) module effectively fuses features from different modalities. Experimental results demonstrate our method's effectiveness over several defect classification methods on the ASE dataset.
1 Introduction

In the realm of industrial manufacturing, defect recognition is paramount, serving a dual purpose: enhancing product quality and efficiency and indirectly curbing production costs by reducing the prevalence of both false negatives and positives. At its core, defect detection leverages a myriad of analytical instruments to discern features rooted in the physical properties of diverse products. Once these features are harvested, models are then deployed for defect recognition.
Among the plethora of defect classification strategies, Automatic Optical Inspection (AOI) emerges as a predominant choice, especially in the industrial domain: the cost of manual inspection is high, and accuracy may decrease due to human fatigue. AOI harnesses high-resolution imaging tools to inspect products throughout the manufacturing phase, and in recent years it has tapped into deep-learning approaches such as convolutional neural networks (CNNs) to spot defects. By juxtaposing the visual attributes of standard samples with defective ones, a model can efficiently pinpoint anomalies. Many researchers have proposed defect detection or classification methods to improve system quality, such as [42, 70, 54, 19, 24, 65, 5, 29, 12].
Despite its success over the years, however, the AOI methodology is not without limitations. Datasets are often of low quality and have insufficient sample sizes, partly because collecting defect data is costly. Unpredictable lighting conditions and camera shifts within the AOI system undercut the reliability of the model, and further challenges such as sparse training data and sample imbalance can yield performance that falls short of expectations. These factors degrade deep models. Several works adopt ensemble methods [31] or feature enhancement [66, 21] to improve performance, but the effect is still limited because of variation across datasets and other negative factors.
On the other hand, the dawn of multimodal learning has ushered in a new era in which features from disparate modalities can be synergized, amplifying model performance. This integration has catalyzed advances at the intersection of computer vision and natural language processing. Many researchers have proposed cross-modality learning frameworks that boost downstream applications by using the well-known Vision-Language Model (VLM) and Large Language Model (LLM), and their influence spans the field of deep learning. Thanks to the zero-shot capabilities learned from numerous high-quality image/text pairs, it becomes possible to augment a defect classification system by extracting VLM-LLM features without costly data collection and laborious labeling, thereby mitigating data insufficiency and over-dependence on AOI and the vision modality.
The major novelties and contributions of this paper are threefold:
• Prompting with VLM-LLM to augment performance on the proposed ASE dataset: Traditional vision-based methods cannot easily solve the ASE dataset because (1) the pattern of the ASE data is very monotonous and (2) the number of samples is insufficient. We leverage the zero-shot capabilities of the VLM-LLM through prompt engineering to capture external-modality features and improve performance on the binary and multi-class classification tasks of the ASE dataset.
• Progressive feature alignment block: the novel Progressive Feature Alignment (PFA) block effectively aligns image-text representations with a Progressive Training Strategy (PTS) in a contrastive learning manner, selecting the negative samples at the beginning. Afterwards, it gradually samples more training data to align the features of the two image-text data pairs iteratively, addressing the challenge that features are difficult to align with a small number of samples.
• Cross-modality attention fusion module and task-specific data augmentation: the proposed Cross-Modality Attention Fusion (CMAF) module enables our model to adaptively fuse high-fidelity features from different modality branches. With the dedicated Task-specific Data Augmentation (TDA) for the ASE dataset, the source domain can be enlarged, further improving recognition of novel samples. Experimental results demonstrate our method's promising performance compared with other methods on the ASE dataset.
2 Related Work

2.1 Defect Recognition under Few-shot Scenarios
Defect recognition has witnessed remarkable improvements with the integration of deep learning and machine learning algorithms. Overall, defect recognition can be divided into two groups: (1) defect detection and (2) defect classification. The major issue is that when the dataset is small or the image quality is poor, the performance of deep models is limited.
Defect Detection. Generally speaking, defect detection aims to point out where defects exist, and detection tasks are more common than classification. Various works and open-source tools have been designed to handle different types of defect detection, such as [56, 3, 70, 54, 19]. Recently, to deal with the challenges mentioned above, CS-Flow [42] leverages cross-scale feature flow to integrate high- and low-level features, enhancing the learned representation, while XDNet [21] relies on a self-supervised learning strategy to explore the information within a given dataset.
Defect Classification. Different from detection, the goal of defect classification is to determine which class a sample belongs to, so it can be treated as an image classification task. A simple approach is to train a CNN [53, 25, 57] with a suitable augmentation pipeline [52] or data synthesis [22], and to ensemble different CNN backbones to enhance few-shot capabilities [31]. Some researchers have proposed incorporating low-level features such as SIFT [41] or depth information [46, 55, 10] to bolster model performance.
2.2 Prompting Learning for VLM and LLM
While traditional deep learning typically relies on single-modality data, CLIP [49] has pioneered the integration of vision and language. Benefiting from large-scale, high-quality pre-training on vision-language data, multi-modality models [33, 34, 35, 11, 67] have shown astonishing zero-shot capabilities. The VLM paradigm introduced by CLIP, along with the versatility of LLMs like GPT-3 [16] and LLaMA [27, 20], has had a significant impact on the deep learning community. VPT [30] further enhances downstream task performance using fine-tuning-free prompt techniques.
VLM Prompting in Vision. With the famous CoCoOp and MaPLe [72, 71, 32], which introduced prompt learning with VLMs in vision, there has been a proliferation of works utilizing VLMs to enhance computer vision tasks. Hu et al. and Awal et al. utilized VLM features and accurate prompting to align depth and visual semantic representations for few-shot depth estimation [64, 14]. Liang et al. proposed CLIP-LIT [36], which deploys an iterative prompting strategy with a CLIP prior for open-world image enhancement. Subramanyam et al. presented CREPE [50] and explored the potential of VLMs to improve the performance of visual-relationship prediction.
LLM Prompting in Vision. The concept of language instruction has also been embraced by the computer vision community for defining image-to-text tasks. Flamingo [17] stands out as a groundbreaking work in this regard, utilizing both vision and language inputs as prompts and achieving impressive few-shot results across various vision-language tasks such as image captioning and Visual Question Answering (VQA) [48, 11, 34, 35]. Prophet [73, 68] guides an LLM with answer heuristics for knowledge-based VQA without incorporating external knowledge, and VisionLLM [61] presents remarkable results on several benchmarks.
2.3 Multi-Modality Feature Alignment and Fusion
Multi-modality alignment and fusion are among the most essential topics in multi-modal learning, focusing mainly on how to incorporate modality-wise features into a joint representation for downstream applications. Features from different modalities embedded in a high-dimensional space may be complementary, so their information can benefit each other [7]. Motivated by the limitations of single-modality approaches, which suffer from dataset variation and poor generalizability, CLIP [49] and ALBEF [33], which align image-text pair representations in a high-dimensional latent space, were introduced, opening an alternative avenue for tackling zero/few-shot and low-quality dataset scenarios [47, 37]. If multi-modal features are processed effectively, rich feature representations can be obtained [47, 8, 45, 38, 15, 26], outperforming single-modality methods.
Multi-modal fusion strategies can typically be classified into feature-level fusion, decision-level fusion, and hybrid fusion [44, 23, 2, 4]. Feature-level fusion conjoins features from different modalities to achieve a high-level joint representation within a single model [59, 60]. Decision-level fusion includes the classical Mixture-of-Experts (MoE) [43] and model-ensemble strategies, where each expert specializes in a subset of the given modalities. Hybrid fusion is the most complicated design [63, 13]; this type of strategy achieves the best results but incurs the highest computational burden.
3 Methodology
An overview of our approach is shown in Figure 3. We first introduce the ASE dataset and the limitations of conventional methods, and then illustrate our method in detail.


3.1 Proposed ASE Dataset
The dataset is provided by ASE Corporation and consists of two parts; the details are shown in Figure 1. The dataset contains five classes, with detailed insights provided in Figure 2. (1) An image that records the drilling machine's drill positions within a fixed range. The goal is to ensure that the drilled holes are positioned as close to the center of the circle as possible, with no shape deviation (no defect). (2) The information about the drilled hole positions, including frequency statistics within specific intervals as well as the corresponding mean and standard deviation. The second part is recorded by a precise measuring machine, ensuring high quality.
Limitations and Challenges. The ASE dataset exhibits a monotonous texture/pattern compared to conventional images. Because of their inductive biases, such as locality and spatial invariance, CNN models cannot learn sufficiently strong low-level features within shallow layers. Vision Transformer (ViT) [58] approaches, while effective at focusing on global dependencies in the image thanks to their attention-based design, are constrained by the limited number of samples [74], leading to poor performance. In sum, conventional vision-only models cannot adequately address the ASE dataset; since mainstream defect classification methods are CNN-based, it is necessary to develop a multi-modal method to deal with this difficulty. The Grad-CAM visualization in Figure 4 shows that the CNN focuses on the middle region but is not sensitive enough to capture the shape or distribution of the pink dots, while the ViT-based model fails to concentrate on the central area.
Table 1: Per-class statistics of the ASE dataset: sample count and the per-axis mean (μ) and standard deviation (σ) of the recorded hole positions, modeled as a bivariate normal distribution N(μ, σ).

Class | Count | μx | μy | σx | σy
---|---|---|---|---|---
Type-0 (normal) | 225 | 0.04 | -0.05 | 3.71 | 3.52
Type-1 (defective) | 92 | 2.73 | 0.59 | 7.38 | 5.52
Type-2 (defective) | 44 | 6.43 | -3.21 | 8.27 | 8.63
Type-3 (defective) | 50 | -1.10 | 0.65 | 8.14 | 6.44
Type-4 (defective) | 44 | -0.21 | -0.01 | 9.44 | 8.77
3.2 Prompting to VLM-LLM for ASE dataset
LLMs possess the capability to retain long-term memory of data and accommodate extensive textual input through their large token capacity. Additionally, they can engage in high-level decision-making through iterative question-answering processes. These attributes form the primary motivation for leveraging LLMs to enhance defect classification. In contrast, VLMs are limited to receiving input in the form of individual image-text pairs. While they can perform rudimentary visual reasoning based on the textual content, their ability is restricted to providing basic descriptions of images.
Industrial defect classification, medical diagnosis, product identification, and similar datasets often contain textual or numerical records in addition to images; the ASE dataset, shown in Figure 1, is a prime example. Taking advantage of this, we extract the numerical information using OCR and combine it with our prior knowledge to serve as input to the LLM for subsequent tasks. Recognizing the superior capability of the VLM in simple visual reasoning and image description, we use straightforward prompting such as "Please comprehensively describe the distribution and shape of the image" to obtain basic textual information. Subsequently, we combine the output of the VLM with the data extracted through OCR and our prepared prior knowledge, and employ more complex prompting, as depicted in Figure 3, to perform advanced reasoning. The process can be written as:
$$t_{\mathrm{VLM}} = \mathrm{VLM}\big(x_{\mathrm{img}},\, p_{\mathrm{VLM}}\big) \quad \text{(VLM's answer)},$$
$$t_{\mathrm{LLM}} = \mathrm{LLM}\big(t_{\mathrm{VLM}},\, \mathrm{OCR}(x_{\mathrm{img}}),\, p_{\mathrm{LLM}}\big) \quad \text{(LLM's answer)}. \tag{1}$$
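To make the prompting stage concrete, the following is a minimal, hedged sketch of Formula (1). Here `ocr_fn`, `vlm_fn`, and `llm_fn` are hypothetical callables wrapping CnOCR, the VLM, and the LLM (their real interfaces are not specified in this paper), and the complex LLM prompt template is an illustrative assumption standing in for the prompting depicted in Figure 3.

```python
# Hedged sketch of the VLM-LLM prompting pipeline in Formula (1).
# ocr_fn / vlm_fn / llm_fn are placeholder callables, not the actual interfaces.

VLM_PROMPT = ("Please comprehensively describe the distribution "
              "and shape of the image.")

LLM_PROMPT = ("Visual description: {description}\n"
              "OCR-ed statistics: {statistics}\n"
              "Based on the above, reason about whether the drilled holes "
              "deviate from the center and describe the likely defect type.")  # illustrative prompt only

def vlm_llm_answers(image, ocr_fn, vlm_fn, llm_fn):
    statistics = ocr_fn(image)            # numerical record printed on the image
    t_vlm = vlm_fn(image, VLM_PROMPT)     # VLM's answer (Formula 1, first line)
    t_llm = llm_fn(LLM_PROMPT.format(description=t_vlm, statistics=statistics))
    return t_vlm, t_llm                   # LLM's answer (Formula 1, second line)
```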
Thanks to the robust zero-shot recognition abilities of VLMs and LLMs, remarkable results are attainable without additional fine-tuning. Additionally, to mitigate the risk of catastrophic forgetting and maintain zero-shot capabilities, we train two adapters tailored to the ASE dataset.
After the VLM-LLM answering, we use a pre-trained image encoder (ResNet-50 [25]) and text encoder (BERT-base [28]) to obtain representations for modality fusion. The encoded representations are defined in Formula (2). We fine-tune on the ASE dataset for 30 epochs to warm up these encoders before alignment and fusion (refer to Sections 3-3 and 3-4).
$$f_{\mathrm{img}} = E_{\mathrm{img}}(x_{\mathrm{img}}), \qquad f_{\mathrm{VLM}} = E_{\mathrm{text}}(t_{\mathrm{VLM}}), \qquad f_{\mathrm{LLM}} = E_{\mathrm{text}}(t_{\mathrm{LLM}}). \tag{2}$$
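As a concrete reference, here is a minimal PyTorch sketch of Formula (2), assuming ImageNet-pre-trained ResNet-50 weights and the public `bert-base-uncased` checkpoint; the pooled outputs stand in for $f_{\mathrm{img}}$, $f_{\mathrm{VLM}}$, and $f_{\mathrm{LLM}}$, and the 30-epoch warm-up fine-tuning is omitted.

```python
import torch
import torchvision.models as models
from transformers import BertModel, BertTokenizer

# One image encoder and a shared text encoder produce the three branch features.
image_encoder = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
image_encoder.fc = torch.nn.Identity()        # keep the 2048-d pooled feature

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
text_encoder = BertModel.from_pretrained("bert-base-uncased")

@torch.no_grad()
def encode(x_img, t_vlm, t_llm):
    f_img = image_encoder(x_img)                                   # (B, 2048)
    tok_v = tokenizer(t_vlm, return_tensors="pt", padding=True, truncation=True)
    tok_l = tokenizer(t_llm, return_tensors="pt", padding=True, truncation=True)
    f_vlm = text_encoder(**tok_v).pooler_output                    # (B, 768)
    f_llm = text_encoder(**tok_l).pooler_output                    # (B, 768)
    return f_img, f_vlm, f_llm
```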
3.3 Progressive Feature Alignment


In recent years, the progressive training strategy (PTS) [18] has shown potential to improve model performance, enhancing representation learning without increasing the sample-size burden. PTS begins by selecting a subset of positive and negative samples from the training dataset according to a given initial sampling rate. In our work, we define positive and negative samples by ranking the self-similarity within image-text pairs across the training set; negative (low self-similarity) samples are given priority in the early stages (refer to Figure 5). PTS divides the entire training set into sub-blocks, trains on only a small portion of the whole dataset, and gradually increases the training data at each stage, with the remaining samples used for validation. Benefiting from PTS, the network's parameters are well initialized and converge more easily. Inspired by [33, 37, 51], which aim to learn a powerful unimodal representation before fusion, we also adopt a contrastive learning manner for our model, as shown in Figure 6.
Directly aligning features is not effective when there are insufficient samples [62]. The philosophy behind the design of Progressive Feature Alignment (PFA) is to leverage the advantage of PTS to counter the inefficacy of multi-modality learning on an insufficiently large dataset. In the proposed PFA block, we use PTS to gradually align the modality representations encoded by the VLM-LLM and the image encoder, denoted $f_{\mathrm{img}}$, $f_{\mathrm{VLM}}$, and $f_{\mathrm{LLM}}$ respectively, where all representation vectors are projected to lower-dimensional (256-d) embeddings. For every single batch of the sub-training set in the PTS framework, we form the two data pairs ($f_{\mathrm{img}}$, $f_{\mathrm{VLM}}$) and ($f_{\mathrm{img}}$, $f_{\mathrm{LLM}}$) as the input of the PFA block; we first align ($f_{\mathrm{img}}$, $f_{\mathrm{VLM}}$) on the divided training set and then align ($f_{\mathrm{img}}$, $f_{\mathrm{LLM}}$). Subsequently, we train for 15 epochs and add more training data progressively, until all training data are used.
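The progressive sampling schedule can be sketched as follows; this is a hedged illustration that assumes cosine self-similarity of the paired embeddings and the stage sampling rates reported in Table 4 (0.2/0.4/0.6/0.8/1), not the exact implementation.

```python
import torch

def pts_subsets(image_feats, text_feats, rates=(0.2, 0.4, 0.6, 0.8, 1.0)):
    """Sketch of the Progressive Training Strategy used inside PFA.

    Pairs are ranked by image-text self-similarity; low-similarity ("negative")
    pairs are scheduled first, and each stage enlarges the training subset
    according to the given sampling rates.
    """
    img = torch.nn.functional.normalize(image_feats, dim=-1)
    txt = torch.nn.functional.normalize(text_feats, dim=-1)
    self_sim = (img * txt).sum(dim=-1)      # cosine similarity of each image-text pair
    order = torch.argsort(self_sim)         # ascending: hardest (negative) pairs first
    n = image_feats.size(0)
    return [order[: max(1, int(r * n))] for r in rates]   # index subset per stage
```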
For each image and text, we calculate the softmax-normalized image-to-text and text-to-image similarity as:
$$p^{\mathrm{i2t}}_{j}(I) = \frac{\exp\big(\mathrm{sim}(I, T_j)/\tau\big)}{\sum_{k}\exp\big(\mathrm{sim}(I, T_k)/\tau\big)}, \qquad p^{\mathrm{t2i}}_{j}(T) = \frac{\exp\big(\mathrm{sim}(T, I_j)/\tau\big)}{\sum_{k}\exp\big(\mathrm{sim}(T, I_k)/\tau\big)}, \tag{3}$$
where $\tau$ is a learnable temperature parameter and $\mathrm{sim}(\cdot,\cdot)$ denotes cosine similarity. Finally, the image-text contrastive loss of the proposed PFA is defined as the cross-entropy $H$ between the similarity $p$ and the ground truth $y$:
$$H(y, p) = -\sum_{j} y_j \log p_j, \tag{4}$$
$$\mathcal{L}_{\mathrm{itc}} = \tfrac{1}{2}\,\mathbb{E}_{(I,T)}\Big[H\big(y^{\mathrm{i2t}}(I),\, p^{\mathrm{i2t}}(I)\big) + H\big(y^{\mathrm{t2i}}(T),\, p^{\mathrm{t2i}}(T)\big)\Big]. \tag{5}$$
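A minimal sketch of Formulas (3)-(5), assuming in-batch negatives and a one-hot ground truth as in CLIP/ALBEF-style contrastive learning; in PFA the same loss would be applied to the ($f_{\mathrm{img}}$, $f_{\mathrm{VLM}}$) and ($f_{\mathrm{img}}$, $f_{\mathrm{LLM}}$) pairs.

```python
import torch
import torch.nn.functional as F

def itc_loss(img_emb, txt_emb, temperature):
    """Symmetric image-text contrastive loss (Formulas 3-5).

    `temperature` is a learnable scalar (e.g. an nn.Parameter); the ground
    truth is the identity pairing within the batch.
    """
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits_i2t = img @ txt.t() / temperature     # (B, B); softmax applied inside CE
    logits_t2i = logits_i2t.t()
    target = torch.arange(img.size(0), device=img.device)
    return 0.5 * (F.cross_entropy(logits_i2t, target) +
                  F.cross_entropy(logits_t2i, target))
```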
3.4 Cross-modality Attention Fusion

Directly concatenating the representations from different branches may lose information, leading to reduced performance. Therefore, we propose the cross-modality attention fusion (CMAF) module to incorporate them together and avoid this loss. Suppose the aligned features of the image, VLM, and LLM branches are denoted by $\tilde{f}_{\mathrm{img}}$, $\tilde{f}_{\mathrm{VLM}}$, and $\tilde{f}_{\mathrm{LLM}}$; the projection is defined as follows:
$$z_m = \phi\big(\tilde{f}_m\big), \qquad m \in \{\mathrm{img}, \mathrm{VLM}, \mathrm{LLM}\}, \tag{6}$$
where $\phi$ consists of multiple stages that project the input feature into a lower-dimensional space. First, a convolution projects $\tilde{f}_m$ into $\mathbb{R}^{d'}$, where $d'$ is the reduced dimensionality. We then perform cross-attention [9] by
$$\mathrm{Attn}(z_q, z_k) = \mathrm{softmax}\!\left(\frac{Q(z_q)\,K(z_k)^{\top}}{\sqrt{d'}}\right) V(z_k), \tag{7}$$
where $g(\cdot)$ projects the concatenated cross-attention outputs to obtain adaptive weights, i.e., $w = \sigma\big(g(\cdot)\big)$ with $\sigma$ the sigmoid function. In this way, the different modalities are fused judiciously, and the proposed CMAF remains robust thanks to its adaptivity, as follows:
$$\hat{f}^{(c)} = w_c\, A^{(c)}, \qquad c = 1, \dots, d', \tag{8}$$
where $w_c$ indicates the $c$-th channel of $w$ and $A^{(c)}$ the $c$-th channel of the concatenated cross-attention features. Eventually, the predicted class is obtained via a multi-layer perceptron as $\hat{y} = \mathrm{MLP}(\hat{f})$.
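A hedged PyTorch sketch of CMAF follows. The layer sizes, the use of a single multi-head attention over the three modality tokens, and the MLP head width are illustrative assumptions; only the 256-d projection, the sigmoid gating, and the channel-wise adaptive re-weighting follow the text.

```python
import torch
import torch.nn as nn

class CMAF(nn.Module):
    """Sketch of Cross-Modality Attention Fusion (Formulas 6-8)."""

    def __init__(self, dims=(2048, 768, 768), d_proj=256, n_heads=4, n_cls=5):
        super().__init__()
        # Formula (6): per-branch projection into the shared d'-dimensional space.
        self.proj = nn.ModuleList([nn.Conv1d(d, d_proj, kernel_size=1) for d in dims])
        self.attn = nn.MultiheadAttention(d_proj, n_heads, batch_first=True)
        self.gate = nn.Linear(3 * d_proj, 3 * d_proj)      # g(.) producing adaptive weights
        self.head = nn.Sequential(nn.Linear(d_proj, d_proj), nn.ReLU(),
                                  nn.Linear(d_proj, n_cls))

    def forward(self, f_img, f_vlm, f_llm):
        z = [p(f.unsqueeze(-1)).squeeze(-1)
             for p, f in zip(self.proj, (f_img, f_vlm, f_llm))]
        tokens = torch.stack(z, dim=1)                     # (B, 3, d')
        # Formula (7): attention mixes the three modality tokens.
        attended, _ = self.attn(tokens, tokens, tokens)    # (B, 3, d')
        # Formula (8): sigmoid-gated, channel-wise adaptive weighting and fusion.
        w = torch.sigmoid(self.gate(attended.flatten(1))).view_as(attended)
        fused = (w * attended).sum(dim=1)                  # (B, d')
        return self.head(fused)                            # class logits via MLP
```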
3.5 Task-specific Data Augmentation
Data augmentation is crucial for improving model performance, especially when there is a domain gap or a limited sample size; it increases the diversity of the data domain to improve the model's recognition ability. However, because the text and numeric information is paired with each image in the ASE dataset, common augmentation strategies such as geometric and HSV transformations are incompatible. We therefore design a simple but effective Task-specific Data Augmentation (TDA), following an offline synthesis manner [52, 22], to address the data insufficiency.
To synthesize ASE data, we sample $N$ data points from the class-specific bivariate Gaussian distribution. For each sampled point, we set the pixel at its coordinates in the image to a pink dot. The number of samples $N$ depends on the dataset's configuration; for instance, we can use the OCR-ed data to compute the total number of points across all radii (refer to Figure 1). After the sampling process, we obtain augmented image-text pair data for the ASE dataset. The TDA pipeline can be defined as follows (a sketch is given after Formula (10)):
$$P_c = \big\{\, p_n \sim \mathcal{N}(\mu_c, \Sigma_c) \,\big\}_{n=1}^{N}, \tag{9}$$
$$\big(x_{\mathrm{aug}},\, t_{\mathrm{aug}}\big) = \mathrm{TDA}\big(P_c\big), \tag{10}$$

where $\mathcal{N}(\mu_c, \Sigma_c)$ is the bivariate Gaussian estimated for class $c$, and $\mathrm{TDA}(\cdot)$ renders the sampled points as pink dots and updates the paired textual record accordingly.
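A minimal sketch of TDA under stated assumptions: the per-axis statistics are treated as an axis-aligned bivariate Gaussian, and the canvas size, pixel scale, and dot color are illustrative choices rather than the paper's exact rendering.

```python
import numpy as np

PINK = (255, 105, 180)   # illustrative dot color

def synthesize_sample(mu, sigma, n_points, img_size=224, scale=8.0, rng=None):
    """Draw hole positions from N(mu, sigma) per axis and render them as pink dots."""
    rng = rng or np.random.default_rng()
    points = rng.normal(loc=mu, scale=sigma, size=(n_points, 2))   # Formula (9)
    canvas = np.zeros((img_size, img_size, 3), dtype=np.uint8)
    center = img_size // 2
    for dx, dy in points:                                          # Formula (10): render dots
        x = int(np.clip(center + scale * dx, 0, img_size - 1))
        y = int(np.clip(center + scale * dy, 0, img_size - 1))
        canvas[y, x] = PINK
    stats = {"mu": points.mean(axis=0), "sigma": points.std(axis=0)}
    return canvas, stats        # augmented image plus the paired numeric record

# e.g. a synthetic Type-2-like sample using the statistics in Table 1
# (the point count 300 is a placeholder; in practice it comes from the OCR-ed data):
# img, record = synthesize_sample(mu=(6.43, -3.21), sigma=(8.27, 8.63), n_points=300)
```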
4 Experiment
Table 2: Performance comparison on the ASE dataset (f1-scores, %). For binary classification we report the macro f1-score and the per-class f1-scores (normal / defective); for multi-class classification we report the macro f1-score and the f1-score of each type.

Method | Binary Macro-F1 | Normal | Defective | Multi Macro-F1 | Type-0 | Type-1 | Type-2 | Type-3 | Type-4
---|---|---|---|---|---|---|---|---|---
DeiT-B [58] | 73.59 | 72.72 | 74.46 | 39.96 | 72.88 | 47.82 | 34.09 | 20.00 | 25.00
CrossViT-18 [9] | 73.16 | 72.39 | 73.91 | 51.33 | 76.44 | 59.78 | 31.81 | 50.00 | 38.63
EfficientNet-b6a [57] | 77.48 | 74.81 | 80.15 | 64.58 | 83.55 | 71.73 | 56.81 | 54.00 | 56.81
Karmakar et al. [31] | 78.02 | 78.07 | 77.97 | 60.69 | 76.44 | 65.21 | 56.81 | 48.54 | 56.54
Xie et al. [10] | 78.46 | 78.41 | 78.50 | 58.78 | 77.77 | 57.60 | 50.00 | 54.00 | 54.54
Proposed | 85.65 | 84.70 | 86.59 | 75.91 | 87.44 | 70.65 | 70.45 | 76.00 | 75.00
Proposed with TDA | 92.52 | 92.37 | 92.67 | 80.77 | 91.55 | 88.04 | 75.00 | 72.00 | 77.27
4.1 Implementation Details
General settings. The details of the ASE dataset are described in Section 3-1. It contains 455 samples, 325 of which are used as the training set and 130 as the testing set in our experiments. We extract the external-modality information using CnOCR [1, 69], an efficient OCR package. Instruct-BLIP2 [11], a powerful VLM renowned for generating long-form context and for its promising performance across several VQA benchmarks, is employed, and LLaMA2-7B [27] is chosen for generating the high- and low-level context for decision making. We use the proposed TDA to synthesize training samples, enlarging the training set from 325 to 650. Both the VLM and the LLM use fixed prompting, as detailed in Section 3-2. ResNet-50 [25] serves as the vision encoder, and BERT-base [28] serves as the text encoder following inference from the VLM and LLM. Our experiments are conducted on an NVIDIA GeForce RTX 3090.
Loss function and hyperparameter settings. For the alignment phase, the loss function is described in Section 3-3; after PFA, the loss is switched to cross-entropy for training the remaining network. The AdamW optimizer [40, 39] is employed with ($\beta_1$, $\beta_2$) = (0.9, 0.999). The initial learning rate is decayed step-wise every 15 epochs by a fixed scaling factor, and the model is trained for 60 epochs in total in the fusion phase.
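For reference, a hedged sketch of the fusion-phase training setup is given below; `build_fusion_phase_training` is a hypothetical helper, and the learning rate, decay factor, and weight decay are placeholder values, while the betas, the 15-epoch decay step, and the cross-entropy objective follow the text.

```python
import torch

def build_fusion_phase_training(model, lr=1e-4, gamma=0.5, weight_decay=1e-2):
    """Placeholder lr/gamma/weight_decay; betas and the 15-epoch step follow the paper."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr,
                                  betas=(0.9, 0.999), weight_decay=weight_decay)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=15, gamma=gamma)
    criterion = torch.nn.CrossEntropyLoss()   # used after the PFA alignment phase
    return optimizer, scheduler, criterion
```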

Evaluation metric. In our experiments, we evaluate model performance using the f1-score for each class and the macro-f1-score. The f1-score is defined as:
$$\mathrm{F1} = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}, \tag{11}$$
where precision and recall are computed for each individual class. The macro f1-score is the average of the f1-scores for all classes:
$$\text{Macro-F1} = \frac{1}{C}\sum_{i=1}^{C} \mathrm{F1}_i, \tag{12}$$
where $C$ is the number of classes and $\mathrm{F1}_i$ is the f1-score of the $i$-th class. These metrics provide a balanced evaluation of the model's ability to classify each class accurately and of its overall performance across all classes.
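In practice, both metrics can be computed directly, for example with scikit-learn; the snippet below assumes integer labels 0-4 for the five-way ASE task.

```python
from sklearn.metrics import f1_score

def evaluate(y_true, y_pred, n_classes=5):
    """Per-class f1 (Formula 11) and the macro average (Formula 12)."""
    labels = list(range(n_classes))
    per_class = f1_score(y_true, y_pred, average=None, labels=labels)
    macro = f1_score(y_true, y_pred, average="macro", labels=labels)
    return per_class, macro
```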
4.2 Performance Comparison
To demonstrate the efficacy of our method on the ASE dataset, we selected three prominent image classification models as baselines: CrossViT-18 [9], DeiT-B [58], and EfficientNet-b6a [57]. Additionally, we evaluated our approach against two specialized few-shot defect classification frameworks: the method of Karmakar et al. [31], which is mainly based on transfer learning and ensembling, and the method of Xie et al. [66], which is rooted in exploring extra features within the given image.
The experimental results are shown in Table 2, where our method demonstrates outstanding results. Despite being designed, respectively, to enhance performance through multi-scale feature fusion and to increase data utilization efficiency, both CrossViT [9] and DeiT [58] under-perform the simple EfficientNet classifier [57] on the ASE dataset, in both the binary and the multi-class classification tasks.
4.3 Ablation Study
In this section, we ablate the important design elements of the proposed method; the results are shown in Table 3. Overall, the PFA block is the most beneficial to the model's effectiveness: it aligns the features of the AOI images with those of the VLM/LLM image-text pairs, enabling the model to handle the minority samples in the multi-class setting, especially tail classes such as Types 2, 3, and 4. Furthermore, the proposed TDA designed for the ASE data also effectively enhances the model's performance.
Table 3: Ablation of the proposed components (macro f1-score, %).

PFA | CMAF | TDA | Binary | Multi-class
---|---|---|---|---
✓ | | | 83.47 | 69.72
 | ✓ | | 78.19 | 64.58
✓ | ✓ | | 85.65 | 75.91
 | ✓ | ✓ | 80.60 | 69.19
✓ | | ✓ | 86.50 | 76.64
✓ | ✓ | ✓ | 92.52 | 80.77
Table 4: Ablation of the alignment strategy and PTS stride settings in the PFA block (macro f1-score, %).

Method | Binary | Multi-class
---|---|---
Without alignment | 80.60 | 64.19
Direct alignment (whole) | 84.76 | 70.45
PFA (0.2/0.6/1) | 90.46 | 78.13
PFA (0.2/0.4/0.6/0.8/1) | 92.52 | 80.77
Table 5: Ablation of the feature fusion strategy (macro f1-score, %).

Method | Binary | Multi-class
---|---|---
Direct Concatenation | 86.50 | 76.44
CMAF (w/o sigmoid) | 87.56 | 77.56
CMAF (w/ sigmoid) | 92.52 | 80.77
Stride settings in the PFA block. The stride is crucial, as it determines the proportion of data added to the progressive training at each iteration. A smaller stride can align the embedded vectors of negative image-text pairs more precisely, yielding better performance, but it leads to additional training time and computational load. Conversely, an overly large stride results in limited effectiveness, as shown in Figure 8. An appropriate stride guides the model to align the challenging samples effectively without compromising performance. Ablation studies measuring performance across different strides are presented in Table 4.
Feature fusion strategy. The feature fusion method affects whether semantic information can interact well during training. Direct concatenation without an additional fusion network, such as a self-attention mechanism, may lead to information loss. In contrast, the proposed CMAF contains a sigmoid function that allows features from different modalities to adaptively update each other's weights without overly relying on any single modality. The experimental results in Table 5 affirm the efficacy of our approach.
5 Limitation
Our proposed method has demonstrated excellent performance on the ASE dataset. Scenarios that face similar challenges, including data insufficiency and low-quality recorded images, include (1) defect recognition in industry, (2) product description or classification, and (3) healthcare and medical analysis. While visual-modality data is often readily accessible due to the low cost of cameras, paired additional-modality data requires databases for recording. This specificity limits the generalizability of the proposed method.
Furthermore, we believe the proposed PFA is a promising training strategy, especially in scenarios where data scarcity and high inter-/intra-class variability within datasets make model convergence challenging, since it reduces the difficulty of achieving convergence or meeting certain criteria during back-propagation. A drawback, however, is the increased computational burden resulting from the need to hold the loaded data in memory at every iteration.
6 Conclusion
In this paper, the special ASE dataset is proposed. The ASE dataset suffers from data insufficiency and monotonic image patterns, which limit conventional deep models such as CNNs and Transformers. To address these challenges, we leverage the zero-shot capabilities of the VLM-LLM to enhance performance on both binary and multi-class classification tasks by capturing external-modality features through prompt engineering. Subsequently, the novel progressive feature alignment block, which utilizes a progressive training strategy and contrastive learning, effectively aligns image-text representations and progressively incorporates more training data to overcome the alignment difficulty caused by the limited sample size. Lastly, the cross-modality attention fusion module adaptively fuses features from different modalities. In summary, our proposed method significantly bolsters the model's performance on the ASE dataset, as demonstrated by our experimental results.
References
- [1] CnOCR. https://github.com/breezedeus/cnocr.
- Ahuja et al. [2017] Chaitanya Ahuja, Louis Philippe Morency, et al. Multimodal machine learning: A survey and taxonomy. IEEE Transactions of Pattern Analysis and Machine Intelligence (TPAMI), pages 1–20, 2017.
- Akcay et al. [2022] Samet Akcay, Dick Ameln, Ashwin Vaidya, Barath Lakshmanan, Nilesh Ahuja, and Utku Genc. Anomalib: A deep learning library for anomaly detection. arXiv preprint, arXiv:2202.08341, 2022.
- Bayoudh et al. [2022] Khaled Bayoudh, Raja Knani, Fayçal Hamdaoui, and Abdellatif Mtibaa. A survey on deep multimodal learning for computer vision: advances, trends, applications, and datasets. The Visual Computer, 38(8):2939–2970, 2022.
- Bhatt et al. [2021] Prahar M. Bhatt, Rishi K. Malhan, Pradeep Rajendran, Brual C. Shah, Shantanu Thakar, Yeo Jung Yoon, and Satyandra K. Gupta. Image-based surface defect detection using deep learning: A review. Journal of Computing and Information Science in Engineering, 2021.
- Chattopadhay et al. [2018] Aditya Chattopadhay, Anirban Sarkar, Prantik Howlader, and Vineeth N Balasubramanian. Grad-cam++: Generalized gradient-based visual explanations for deep convolutional networks. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV), pages 839–847, 2018.
- Xing et al. [2019] Chen Xing, Negar Rostamzadeh, Boris Oreshkin, and Pedro Pinheiro. Adaptive cross-modal few-shot learning. In Advances in Neural Information Processing Systems (NeurIPS), 2019.
- Chitta et al. [2023] Kashyap Chitta, Aditya Prakash, Bernhard Jaeger, Zehao Yu, Katrin Renz, and Andreas Geiger. Transfuser: Imitation with transformer-based sensor fusion for autonomous driving. Pattern Analysis and Machine Intelligence (TPAMI), 2023.
- Chen et al. [2021] Chun-Fu (Richard) Chen, Quanfu Fan, and Rameswar Panda. Crossvit: Cross-attention multi-scale vision transformer for image classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021.
- Dong et al. [2021] Xinghui Dong, Christopher J. Taylor, and Tim F. Cootes. Defect classification and detection using a multitask deep one-class cnn. IEEE Transactions on Automation Science and Engineering (TASE), 2021.
- Dai et al. [2023] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint, arXiv:2305.06500, 2023.
- Domen et al. [2020] Tabernik Domen, Šela Samo, Skvarč Jure, and Skočaj Danijel. Segmentation-based deep-learning approach for surface-defect detection. 2020.
- Guo et al. [2022] Dongxu Guo, Taylor Mordan, and Alexandre Alahi. Pedestrian stop and go forecasting with hybrid feature fusion. In Proceedings of the International Conference on Robotics and Automation (ICRA), 2022.
- Auty and Mikolajczyk [2023] Dylan Auty and Krystian Mikolajczyk. Learning to prompt clip for monocular depth estimation: Exploring the limits of human language. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023.
- Elhafsi et al. [2023] Amine Elhafsi, Rohan Sinha, Christopher Agia, Edward Schmerling, Issa A. D. Nesnas, and Marco Pavone. Semantic anomaly detection with large language models. Auton. Robots, page 1035–1055, 2023.
- Brown et al. [2020] Tom Brown et al. Language models are few-shot learners. In Advances in Neural Information Processing Systems (NeurIPS), pages 1877–1901, 2020.
- Alayrac et al. [2022] Jean-Baptiste Alayrac et al. Flamingo: a visual language model for few-shot learning. In Advances in Neural Information Processing Systems (NeurIPS), 2022.
- Wang et al. [2023a] Keyao Wang et al. Dynamic feature queue for surveillance face anti-spoofing via progressive training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023a.
- Mundt et al. [2019] Martin Mundt et al. Meta-learning convolutional neural architectures for multi-target concrete defect classification with the concrete defect bridge image dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
- Touvron et al. [2023b] Hugo Touvron et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint, arXiv:2307.09288, 2023b.
- Lee et al. [2023c] Xian-Yeow Lee et al. Xdnet: A few-shot meta-learning approach for cross-domain visual inspection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshop (CVPRW), 2023c.
- Farady et al. [2023] Isack Farady, Chih Yang Lin, and Ming Ching Chang. Preaugnet: improve data augmentation for industrial defect classification with small-scale training data. Journal of Intelligent Manufacturing, 2023.
- Gao et al. [2020] Jing Gao, Peng Li, Zhikui Chen, and Jianing Zhang. A survey on deep learning for multimodal data fusion. Neural Computation, 2020.
- Haurum and Moeslund [2021] Joakim Bruslund Haurum and Thomas B. Moeslund. Sewer-ml: A multi-label sewer defect classification dataset and benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13456–13467, 2021.
- He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
- Hsu et al. [2023] Chih-Chung Hsu, Chia-Ming Lee, Xiu-Yu Hou, and Chi-Han Tsai. Gradient boost tree network based on extensive feature analysis for popularity prediction of social posts. In Proceedings of the 31st ACM International Conference on Multimedia (ACMMM), page 9451–9455, 2023.
- Hugo et al. [2023] Touvron Hugo, Lavril Thibaut, Izacard Gautier, Martinet Xavier, Lachaux Marie-Anne, Lacroix Timothée, Rozière Baptiste, Goyal Naman, Hambro Eric, Azhar Faisal, Rodriguez Aurelien, Joulin Armand, Grave Edouard, and Lample Guillaume. Llama: Open and efficient foundation language models. arXiv preprint, arXiv:2302.13971, 2023.
- Jacob et al. [2018] Devlin Jacob, Chang Ming-Wei, Lee Kenton, and Toutanova Kristina. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint, arXiv:1810.04805, 2018.
- Jakob et al. [2021] Božič Jakob, Tabernik Domen, and Skočaj Danijel. Mixed supervision for surface-defect detection: from weakly to fully supervised learning. 2021.
- Jia et al. [2022] Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Belongie, Bharath Hariharan, and Ser-Nam Lim. Visual prompt tuning. In Proceedings of the European Conference on Computer Vision (ECCV), 2022.
- Karmakar et al. [2023] Soumyajit Karmakar, Abeer Banerjee, Prashant Gidde, Sumeet Saurav, and Sanjay Singh. Convolutional ensembling based few-shot defect detection technique. In Proceedings of the Thirteenth Indian Conference on Computer Vision, Graphics and Image Processing, New York, NY, USA, 2023.
- khattak et al. [2023] Muhammad Uzair khattak, Hanoona Rasheed, Muhammad Maaz, Salman Khan, and Fahad Shahbaz Khan. Maple: Multi-modal prompt learning. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
- Li et al. [2021] Junnan Li, Ramprasaath Selvaraju, Akhilesh Gotmare, Shafiq Joty, Caiming Xiong, and Steven Chu Hong Hoi. Align before fuse: Vision and language representation learning with momentum distillation. In Advances in neural information processing systems (NeurIPS), pages 9694–9705, 2021.
- Li et al. [2022] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In Proceedings of the International conference on machine learning (ICML), pages 12888–12900, 2022.
- Li et al. [2023] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint, arXiv:2301.12597, 2023.
- Liang et al. [2023] Zhexin Liang, Chongyi Li, Shangchen Zhou, Ruicheng Feng, and Chen Change Loy. Iterative prompt learning for unsupervised backlit image enhancement. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 8094–8103, 2023.
- Lin et al. [2023] Zhiqiu Lin, Samuel Yu, Zhiyi Kuang, Deepak Pathak, and Deva Ramanan. Multimodality helps unimodality: Cross-modal few-shot learning with multimodal models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
- Liu et al. [2023] Chang Liu, Henghui Ding, Yulun Zhang, and Xudong Jiang. Multi-modal mutual attention and iterative interaction for referring image segmentation. IEEE Transactions on Image Processing (TIP), 32:3054–3065, 2023.
- Loshchilov and Hutter [2019] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In Proceedings of the International Conference on Learning Representations (ICLR), 2019.
- Loshchilov and Hutter [2018] Ilya Loshchilov and Frank Hutter. Fixing weight decay regularization in adam. In Proceedings of the International Conference on Learning Representations (ICLR), 2018.
- [41] David G. Lowe. Object recognition from local scale-invariant features. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshop (CVPR).
- Marco et al. [2022] Rudolph Marco, Wehrbein Tom, Rosenhahn Bodo, and Wandt Bastian. Fully convolutional cross-scale-flows for image-based defect detection. In Proceedings of the Winter Conference on Applications of Computer Vision (WACV), 2022.
- Masoudnia and Ebrahimpour [2014] S. Masoudnia and R. Ebrahimpour. Mixture of experts: a literature survey. Artificial Intelligence Review, 2014.
- Pawłowski et al. [2023] M. Pawłowski, A. Wróblewska, and S. Sysko-Romańczuk. Effective techniques for multimodal data fusion: A comparative analysis. Sensors (Basel), 2023.
- Prakash et al. [2021] Aditya Prakash, Kashyap Chitta, and Andreas Geiger. Multi-modal fusion transformer for end-to-end autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
- Qian et al. [2021] Xie Qian, Li Dawei, Xu Jinxuan, Yu Zhenghao, and Wang Jun. Automatic detection and classification of sewer defects via hierarchical deep learning. IEEE Transactions on Automation Science and Engineering (TASE), 2021.
- Qu et al. [2022] Linhao Qu, Shaolei Liu, Manning Wang, and Zhijian Song. Transmef: A transformer-based multi-exposure image fusion framework using self-supervised multi-task learning. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 2126–2134, 2022.
- Awal et al. [2024] Rabiul Awal, Le Zhang, and Aishwarya Agrawal. Investigating prompting techniques for zero- and few-shot visual question answering. arXiv preprint, arXiv:2307.09288, 2024.
- Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Proceedings of the International conference on machine learning (ICML), pages 8748–8763, 2021.
- Rakshith et al. [2023] Subramanyam Rakshith, T. S. Jayram, Anirudh Rushil, and Jayaraman J. Thiagarajan. Crepe: Learnable prompting with clip improves visual relationship prediction. arXiv preprint, arXiv:2307.04838, 2023.
- Robinson et al. [2021] Joshua David Robinson, Ching-Yao Chuang, Suvrit Sra, and Stefanie Jegelka. Contrastive learning with hard negative samples. In International Conference on Learning Representations, 2021.
- Shanmugam et al. [2021] Divya Shanmugam, Davis Blalock, Guha Balakrishnan, and John Guttag. Better aggregation in test-time augmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 1214–1223, 2021.
- Simonyan and Zisserman [2014] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint, arXiv:1409.1556, 2014.
- Singh and Desai [2023] Swarit Anand Singh and K. A. Desai. Automated surface defect detection framework using machine vision and convolutional neural networks. In Journal of Intelligent Manufacturing, 2023.
- Syed et al. [2022] Ibrahim Syed, Minh Hassan, Mehmood Dang, Irfan, Im Suhyeon, Choi Changho, Kang Jaemo, Park Young-Soo, and Moon Hyeonjoon. Underground sewer pipe condition assessment based on convolutional neural networks. Automation in Construction, 2022.
- Tabernik et al. [2020] Domen Tabernik, Samo Šela, Jure Skvarč, and Danijel Skočaj. Segmentation-based deep-learning approach for surface-defect detection. pages 759–776, 2020.
- Tan and Le [2019] Mingxing Tan and Quoc Le. Efficientnet: Rethinking model scaling for convolutional neural networks. In Proceedings of the International conference on machine learning (ICML), pages 6105–6114, 2019.
- Touvron et al. [2021] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers and distillation through attention. In Proceedings of the 38th International Conference on Machine Learning (ICML), 2021.
- Wang et al. [2020] Yikai Wang, Wenbing Huang, Fuchun Sun, Tingyang Xu, Yu Rong, and Junzhou Huang. Deep multimodal fusion by channel exchanging. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
- Wang et al. [2022] Yikai Wang, Fuchun Sun, Wenbing Huang, Fengxiang He, and Dacheng Tao. Channel exchanging networks for multimodal and multitask dense image prediction. IEEE Transaction on Pattern Analysis and Machine Intelligence (TPAMI), 2022.
- Wenhai et al. [2023] Wang Wenhai, Chen Zhe, Chen Xiaokang, Wu Jiannan, Zhu Xizhou, Zeng Gang, Luo Ping, Lu Tong, Zhou Jie, Qiao Yu, and Dai Jifeng. Visionllm: Large language model is also an open-ended decoder for vision-centric tasks. arXiv preprint, arXiv:2305.11175, 2023.
- Xu et al. [2021] Chengming Xu, Chen Liu, Li Zhang, Chengjie Wang, Jilin Li, Feiyue Huang, Xiangyang Xue, and Yanwei Fu. Learning dynamic alignment via meta-filter for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
- Xue and Marculescu [2023] Zihui Xue and Radu Marculescu. Dynamic multimodal fusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshop (CVPRW), Multi-Modal Learning and Applications Workshop, 2023.
- Xueting et al. [2024] Hu Xueting, Zhang Ce, Zhang Yi, Hai Bowen, Yu Ke, and He Zhihai. Learning to adapt clip for few-shot monocular depth estimation. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV), 2024.
- Yajun et al. [2021] Chen Yajun, Ding Yuanyuan, Zhao Fan, Zhang Erhu, Wu Zhangnan, and Shao Linhao. Surface defect detection methods for industrial products: A review. Applied Sciences, 2021.
- Yu et al. [2024] Gong Yu, Liu Mingzhou, Wang Xiaoqiao, Liu Conghu, and Hu Jing. Few-shot defect detection using feature enhancement and image generation for manufacturing quality inspection. In Applied Intelligence, 2024.
- Yuxin et al. [2023] Fang Yuxin, Wang Wen, Xie Binhui, Sun Quan, Wu Ledell, Wang Xinggang, Huang Tiejun, Wang Xinlong, and Cao Yue. Eva: Exploring the limits of masked visual representation learning at scale. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
- Zhenwei et al. [2023] Shao Zhenwei, Yu Zhou, Wang Meng, and Yu Jun. Prompting large language models with answer heuristics for knowledge-based visual question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshop (CVPR), 2023.
- Zhi et al. [2016] Tian Zhi, Huang Weilin, He Tong, He Pan, and Qiao Yu. Detecting text in natural image with connectionist text proposal network. 2016.
- Ren et al. [2022] Zhonghe Ren, Fengzhou Fang, Ning Yan, and You Wu. State of the art in defect detection based on machine vision. 2022.
- Zhou et al. [2022a] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Conditional prompt learning for vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022a.
- Zhou et al. [2022b] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models. International Journal of Computer Vision (IJCV), 2022b.
- Zhou et al. [2023] Yu Zhou, Ouyang Xuecheng, Shao Zhenwei, Wang Meng, and Yu Jun. Prophet: Prompting large language models with complementary answer heuristics for knowledge-based visual question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshop (CVPR), 2023.
- Zhu et al. [2023] Haoran Zhu, Boyuan Chen, and Carter Yang. Understanding why vit trains badly on small datasets: An intuitive perspective. arXiv preprint arXiv:2302.03751, 2023.