A Comprehensive Survey on Transformers for Computer Vision Applications
Abstract
Vision Transformers (ViTs), a special type of transformer, are applied to various computer vision (CV) applications such as image recognition. Several potential problems with convolutional neural networks (CNNs) can be solved with ViTs, and different variants of ViTs are used for image coding tasks like compression, super-resolution, segmentation, and denoising. The purpose of this survey is to present the applications of ViTs in CV; to the best of our knowledge, it is the first survey of its kind on ViTs for CV. In the first step, we classify the different CV applications where ViTs are applicable, including image classification, object detection, image segmentation, image compression, image super-resolution, image denoising, and anomaly detection. Our next step is to review the state-of-the-art in each category and list the available models. Following that, we present a detailed analysis and comparison of each model and list its pros and cons. After that, we present our insights and lessons learned for each category. Moreover, we discuss several open research challenges and future research directions.
Index Terms:
Vision Transformers, Computer Vision, Deep learning, Image coding
I Introduction
Vision transformers (ViTs) are designed for vision-related tasks, including image recognition [1]. Originally, transformers were used for natural language processing (NLP). Bidirectional encoder representations from transformers (BERT) [2] and the generative pre-trained transformer 3 (GPT-3) [3] were the pioneering transformer models for NLP. In contrast, classical image processing systems use convolutional neural networks (CNNs) for different computer vision (CV) tasks. The most common CNN models are AlexNet [4], ResNet [5], VGG [6], GoogleNet [7], Xception [8], Inception [9], DenseNet [10], and EfficientNet [11].
Transformers compute attention between every pair of input tokens, so the cost rises rapidly as the number of tokens increases. The pixel is the most basic unit of an image, and computing the relationship between every pair of pixels in a typical image would be prohibitively time- and memory-intensive. ViTs therefore take the several steps described below.
1. ViTs divide the full image into a grid of small image patches.
2. ViTs apply a linear projection to embed each patch.
3. Each embedded patch becomes a token, and the resulting sequence of embedded patches is passed to the transformer encoder (TE).
4. The TE encodes the input patches, its output is given to the multi-layer perceptron (MLP) head, and the output of the MLP head is the class of the input image.
Figure 1 shows the basic structure of a ViT. First, the input image is divided into smaller patches. Each patch is then embedded using a linear projection. The embedded patches become tokens that are given to the TE as inputs. The TE uses multi-head attention and normalization to encode the information in the embedded patches. The TE output is given to the MLP head, and the MLP head output is the class of the input image.
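To make this pipeline concrete, the following is a minimal sketch of a ViT-style classifier in PyTorch. It is an illustrative toy model, not the exact architecture of [13]; the name `TinyViT` and all hyperparameters are ours, and the four numbered steps above are marked in the comments.

```python
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    """Toy ViT-style classifier: patch embedding -> transformer encoder -> MLP head."""
    def __init__(self, image_size=224, patch_size=16, dim=192, depth=4, heads=3, num_classes=10):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # Steps 1-2: split the image into patches and linearly project each patch
        # (a strided convolution does both at once).
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))               # learnable class token
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))  # positional embeddings
        # Step 3: transformer encoder (multi-head attention + normalization + MLP blocks).
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        # Step 4: MLP head mapping the class token to class logits.
        self.head = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, num_classes))

    def forward(self, x):                                        # x: (batch, 3, H, W)
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)  # (batch, num_patches, dim)
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed
        encoded = self.encoder(tokens)
        return self.head(encoded[:, 0])                          # predict from the class token

logits = TinyViT()(torch.randn(2, 3, 224, 224))                  # logits shape: (2, 10)
```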

For image classification, the most popular architecture uses only the TE to transform the sequence of input tokens; however, the transformer decoder can also be used for other purposes. Since their introduction in 2017, transformers have rapidly spread across NLP, becoming one of the most widely used and promising designs [12].
ViTs were applied to CV tasks in 2021 [13]. The idea was to construct a sequence of patches that, once flattened into vectors, are treated like words by a standard transformer. Whereas the attention mechanism of NLP transformers captures the relationships between different words in a text, in CV it captures how the different patches of an image relate to one another.
In 2021, a pure transformer outperformed CNNs in image classification [13]. In June 2021, a transformer backend was added to the conventional ResNet, drastically lowering costs while enhancing accuracy [14],[15].
In the same year, several key ViT variants were released. These variants were more efficient, more accurate, or tailored to specific domains. The Swin transformer is the most prominent of them [16]: using a multi-stage approach and a modified attention mechanism, it achieved state-of-the-art performance on object detection datasets. There is also the TimeSformer, which was proposed for video understanding and can capture spatial and temporal information through divided space-time attention [17].
ViT performance is influenced by decisions such as the optimizer, dataset-specific hyperparameters, and network depth, whereas CNNs are significantly easier to optimize. CNNs also perform admirably even when trained on much less data than ViTs require. This distinct behavior appears to stem from inductive biases that CNNs can use to grasp the particularities of images more quickly, even though these biases restrict them and make it harder to capture global relationships. ViTs, on the other hand, are free of these biases, which allows them to capture a broader and more global set of relationships at the cost of more demanding training on data [18].
ViTs are also more resistant to input visual distortions such as adversarial patches and permutations [19]. That said, preferring one architecture over the other may not be the best choice: combining convolutional layers with ViTs has been shown to yield excellent results in numerous CV tasks [20]-[22].
Because of the massive amount of data required to train these models, alternative training approaches have been developed. These make it feasible to train a neural network largely autonomously, allowing it to infer the characteristics of a given problem without requiring a large dataset or precise labeling. The ability to train ViTs without a massive vision dataset may be what makes this novel architecture so appealing.
ViTs have been employed in numerous CV tasks with outstanding and, in some cases, state-of-the-art results. The following are some of the important application areas:
• Image Classification
• Anomaly Detection
• Object Detection
• Image Compression
• Image Segmentation
• Video Deepfake Detection
• Cluster Analysis
Figure 2 shows that the percentages of ViT applications devoted to image classification, object detection, image segmentation, image compression, image super-resolution, image denoising, and anomaly detection are 50%, 40%, 3%, less than 1%, less than 1%, 2%, and 3%, respectively.


ViTs have been widely utilized in CV tasks and can solve several problems faced by CNNs. Different variants of ViTs are used for image compression, super-resolution, denoising, and segmentation. With the advancement of ViTs for CV applications, a state-of-the-art survey is required that highlights the performance of ViTs on CV tasks. In this survey, we first classify the CV applications where ViTs are used, namely image classification, object detection, image segmentation, image compression, image super-resolution, image denoising, and anomaly detection. In the next step, we survey the state-of-the-art in each CV application and tabulate the existing ViT-based models. We also discuss the pros and cons of each listed model and present the lessons learned for each CV application. In the end, we discuss several open research issues and future directions.
The rest of the survey is organized as follows. Section II presents related work, and Section III presents applications of ViTs in CV. Section IV presents open research challenges and future directions. Lastly, Section V concludes the survey. Figure 3 shows the organization of the survey.
Acronym | Meaning |
---|---|
AID | Aerial image dataset |
AP | Average precision |
AQI | Air quality index |
AUC | Area under the curve |
AUROC | Area under receiver operating characteristic curve |
Box average precision | |
BERT | Bidirectional encoder representations from transformers |
bpp | Bits per pixel |
BrT | Bridged transformer |
BTAD | BeanTech anomaly detection |
CIFAR | Canadian institute for advanced research |
CV | Computer vision |
CNN | Convolutional neural network |
DHViT | Deep hierarchical ViT |
DOViT | Double output ViT |
ET-GSNet | Excellent teacher guiding small networks |
GAOs-1 | Get AQI in one shot-1 |
GAOs-2 | Get AQI in one shot-2 |
GPT-3 | Generative pre-trained transformer 3 |
HD | Hausdorff distance |
IoU | Intersection over union |
JI | Jaccard index |
LiDAR | Light detection and ranging |
mAP | Mean average precision |
MLP | Multi layer perceptron |
MIL-ViT | Multiple instance enhanced ViT |
MIM | Masked image modeling |
MITformer | Multi-instance ViT |
MSE | Mean squared error |
MS-SSIM | Multi-scale structural similarity |
NLP | Natural language processing |
NWPU | Northwestern Polytechnical University |
PSNR | Peak signal to noise ratio |
PRO | Per region overlap |
PUAS | Planet understanding the amazon from space |
RFMiD2020 | 2020 retinal fundus multi-disease image dataset |
R-CNN | Region-based convolutional neural network |
RMSE | Root mean square error |
SSIM | Structural similarity |
TE | Transformer encoder |
UCM | UC-Merced land use dataset |
ViTs | Vision transformers |
VT-ADL | ViT network for image anomaly detection and localization |
ViT-PP | ViT with post processing |
YOLOS | You only look at one sequence |
II Related Work
A number of surveys have been conducted on ViTs in the literature. The authors of [23] reviewed the theoretical concepts, foundations, and applications of memory-efficient transformers and discussed the applications of efficient transformers in NLP; CV tasks, however, were not covered. A similar study, [24], examined the theoretical aspects of ViTs, the foundations of transformers, the role of multi-head attention in transformers, and applications of transformers in image classification, segmentation, super-resolution, and object detection. That survey did not include applications of transformers for image denoising and compression.
In [26], the authors described transformer architectures for segmenting, classifying, and detecting objects in images. The survey did not include CV and image processing tasks such as image super-resolution, denoising, and compression.
Lin et al. in [25] summarized different transformer architectures for NLP; the survey, however, did not include any applications of transformers to CV tasks. In [27], the authors discussed different transformer architectures for computational visual media, covering low-level vision and generation tasks such as image colorization, image super-resolution, image generation, and text-to-image conversion, as well as high-level vision tasks such as segmentation and object detection. However, the survey did not discuss transformers for image compression and classification.
Table II summarizes all existing surveys on ViTs. An analysis of Table II makes it evident that a survey is needed to provide insight into the latest developments in ViTs for several image processing and CV tasks, including classification, detection, segmentation, compression, denoising, and super-resolution.
Survey | Year | Image Classification | Object Detection | Image Segmentation | Image Compression | Image Super Resolution | Image Denoising | Anomaly Detection | Contributions and limitations
---|---|---|---|---|---|---|---|---|---
[23] | 2020 | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | Survey of foundations and applications of efficient transformers
[24] | 2021 | ✓ | ✓ | ✓ | ✗ | ✓ | ✗ | ✗ | Survey of basic concepts and applications of transformers in CV
[25] | 2021 | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | Survey of different transformer architectures
[26] | 2021 | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ | Survey of different transformer architectures for image classification, object detection, and image segmentation
[27] | 2022 | ✗ | ✓ | ✓ | ✗ | ✓ | ✗ | ✗ | Survey of transformers in computational visual media
Our survey | 2022 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | Survey of applications of transformers in CV; new outlook on open research gaps

III Applications of ViTs in CV
In addition to the classical ViT, modified versions of it are used for object detection, image segmentation, compression, super-resolution, denoising, and anomaly detection. Figure 4 shows the organization of Section III.
III-A ViTs for Image Classification
In image classification, the image is first divided into patches; these patches are linearly embedded and fed to the transformer encoder, where multi-head attention, normalization, and MLP layers are applied to the embedded patches. The encoder output is fed to the MLP head, which predicts the output class. These classical ViTs have been used by many researchers to classify visual objects.
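In practice, many of the classification studies below fine-tune a pre-trained ViT on a domain-specific dataset rather than training from scratch. Below is a hedged sketch of that workflow, assuming a recent torchvision with ImageNet-pretrained ViT-B/16 weights; the 5-class head is purely illustrative and not tied to any of the surveyed papers.

```python
import torch
from torchvision.models import vit_b_16, ViT_B_16_Weights

# Load an ImageNet-pretrained ViT-B/16 and swap its classification head
# for a hypothetical 5-class downstream task.
weights = ViT_B_16_Weights.IMAGENET1K_V1
model = vit_b_16(weights=weights)
model.heads.head = torch.nn.Linear(model.heads.head.in_features, 5)

preprocess = weights.transforms()       # resizing/normalization expected by the checkpoint
image = torch.rand(3, 256, 256)         # stand-in for a real image tensor
logits = model(preprocess(image).unsqueeze(0))
print(logits.shape)                     # torch.Size([1, 5])
```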
In [28], the authors proposed CrossViT-9, CrossViT-15, and CrossViT-18 for image classification. They used the ImageNet1K, CIFAR10, CIFAR100, pet, crop disease, and ChestXRay8 datasets to evaluate the different CrossViT variants. They achieved 77.1% accuracy on the ImageNet1K dataset with CrossViT-9, and 82.3% and 82.8% accuracy on ImageNet1K with CrossViT-15 and CrossViT-18, respectively. Similarly, the authors obtained 99.0% and 99.11% accuracy with CrossViT-15 and CrossViT-18, respectively, on the CIFAR10 dataset, and 90.77% and 91.36% accuracy on the CIFAR100 dataset with CrossViT-15 and CrossViT-18, respectively. The authors also used CrossViT for pet classification, crop disease classification, and chest X-ray classification. They observed the highest accuracy of 95.07% with CrossViT-18 for pet classification, the highest accuracy of 99.97% with both CrossViT-15 and CrossViT-18 for crop disease classification, and the highest accuracy of 55.94% with CrossViT-18 for chest X-ray classification.
Deng et al. in [29] proposed a combined CNN and ViT model named CTNet for the classification of high-resolution remote sensing images. To evaluate the model, they used the aerial image dataset (AID) and the Northwestern Polytechnical University (NWPU)-RESISC45 dataset. CTNet obtained accuracies of 97.70% and 95.49% on the AID and NWPU-RESISC45 datasets, respectively. In [30], the authors proposed the excellent teacher guiding small networks model (ET-GSNet) for remote sensing image scene classification. They used four datasets: AID, NWPU-RESISC45, the UC-Merced land use dataset (UCM), and OPTIMAL-31, achieving accuracies of 96.88%, 94.50%, 99.29%, and 96.45%, respectively.
Yu et al. in [31] presented the multiple instance enhanced ViT (MIL-ViT) for fundus image classification. They used the APTOS2019 blindness detection dataset and the 2020 retinal fundus multi-disease image dataset (RFMiD2020). MIL-ViT achieved an accuracy of 97.9% on the APTOS2019 dataset and 95.9% on the RFMiD2020 dataset.
Xue et al. in [32] proposed the deep hierarchical ViT (DHViT) for hyperspectral and light detection and ranging (LiDAR) data classification. The authors used the Trento, Houston 2013, and Houston 2018 datasets and obtained accuracies of 99.58%, 99.55%, and 96.40%, respectively.
In [33], the authors elaborated on the use of ViT for satellite imagery multilabel classification and proposed ForestViT. ForestViT demonstrated an accuracy of 94.28% on the planet understanding the amazon from space (PUAS) dataset.
In [34], the researchers put forward LeViT for pavement image classification. They used Chinese and German asphalt pavement datasets to evaluate the model's performance, obtaining an accuracy of 91.56% on the Chinese asphalt pavement dataset and 99.17% on the German asphalt pavement dataset.
In [35], the authors used ViT to distinguish malicious drones from aeroplanes, birds, other drones, and helicopters. They demonstrated the efficiency of ViT for this classification over several CNNs such as AlexNet [4], ResNet-50 [36], MobileNet-V2 [37], ShuffleNet [38], SqueezeNet [39], and EfficientNet-b0 [40]. The ViT model achieved 98.3% accuracy on the malicious drones dataset.
Tanzi et al. in [41] applied ViT to femur fracture classification using a dataset of real X-rays. The model achieved an accuracy of 83% with 77% precision, 76% recall, and a 77% F1-score.
In [42], the authors modified the classical ViT and proposed SeedViT for classifying maize seed quality. Using a custom dataset, the model outperformed CNNs and achieved 96.70% accuracy.
Similarly, in [43], the researchers put forward the double output ViT (DOViT) for air quality classification and measurement. They used two datasets, get AQI in one shot-1 (GAOs-1) and get AQI in one shot-2 (GAOs-2), and achieved 90.32% accuracy on the GAOs-1 dataset and 92.78% accuracy on the GAOs-2 dataset.
In [44], the authors developed a novel multi-instance ViT called MITformer for remote sensing scene classification. They evaluated their model on three different datasets, achieving 99.83% accuracy on the UCM dataset, 97.96% on the AID dataset, and 95.93% on the NWPU dataset.
Table III shows the summary of the application of ViT for image classification.
Research | Model | Dataset | Objective | Accuracy |
---|---|---|---|---|
[28] | CrossViT-9 | ImageNet1K | Image classification | 77.1% |
CrossViT-15 | 82.3% | |||
CrossViT-18 | 82.8% | |||
CrossViT-15 | CIFAR10 | Image classification | 99.0% | |
CIFAR100 | 90.77% | |||
Pet | Pet classification | 94.55% | ||
Crop Diseases | Crop diseases classification | 99.97% | ||
ChestXRay8 | Chest X rays classification | 55.89% | ||
CrossViT-18 | CIFAR10 | Image classification | 99.11% | |
CIFAR100 | 91.36% | |||
Pet | Pet classification | 95.07% | ||
Crop Diseases | Crop diseases classification | 99.97% | ||
ChestXRay8 | Chest X rays classification | 55.94% | ||
[29] | CTNet | AID | Remote sensing scene classification | 97.70% |
NWPU-RESISC45 | 95.49% | |||
[30] | ET-GSNet | AID | Remote sensing image scene classification | 96.88% |
NWPU-RESISC45 | 94.50% | |||
UCM | 99.29% | |||
OPTIMAL-31 | 96.45% | |||
[31] | MIL-ViT | APTOS2019 | Fundus image classification | 97.9% |
RFMiD2020 | 95.9% | |||
[32] | DHViT | Trento | Hyperspectral and LiDAR data classification | 99.58% |
Houston 2013 | 99.55% | |||
Houston 2018 | 96.40% | |||
[33] | ForestViT | PUAS | Satellite imagery multilabel classification | 94.28% |
[34] | LeViT | Chinese asphalt pavement | Pavement image classification | 91.56% |
German asphalt pavement | 99.17% | |||
[35] | ViT | Malicious drone | Malicious drones classification | 98.3% |
[41] | ViT | Real X Rays | Femur fracture classification | 83.00% |
[42] | SeedViT | Maize seeds | Maize seeds quality classification | 96.70% |
[43] | DOViT | GAOs-1 | Air quality classification | 90.32% |
GAOs-2 | 92.78% | |||
[44] | MITformer | UCM | Remote sensing scene classification | 99.83% |
AID | 97.96% | |||
NWPU | 95.93% |
III-B ViTs for Object Detection
Efforts to adapt the pre-trained vanilla ViT to object detection have continued ever since the transformer [12] was brought to CV [13]. Beal et al. [45] were the first to combine a faster region-based convolutional neural network (R-CNN) detector with a supervised pre-trained ViT for object detection. You only look at one sequence (YOLOS) [46] proposes using only a pre-trained ViT encoder to perform object detection in a pure sequence-to-sequence manner. Li et al. [47] were the first to conduct a large-scale study of the vanilla ViT on object detection using masked image modeling (MIM) pre-trained representations [48], [49], confirming the vanilla ViT's promising potential and capacity in object-level recognition.
In [50], the authors proposed an unsupervised learning-based technique using ViT for detecting manipulations in satellite images. They used two different datasets for the evaluation of the framework. The ViT model with post-processing (ViT-PP) achieves an F1-score of 0.354 and a Jaccard index (JI) of 0.275 on dataset 2. The F1-score and JI can be calculated using (1) and (2), respectively.
F1-score = 2TP / (2TP + FP + FN) (1)
JI = TP / (TP + FP + FN) (2)
Here TP, FP, and FN denote true positive, false positive, and false negative, respectively.
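A small self-contained snippet that evaluates (1) and (2); the counts are purely illustrative and are not taken from [50].

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1-score as defined in Eq. (1)."""
    return 2 * tp / (2 * tp + fp + fn)

def jaccard_index(tp: int, fp: int, fn: int) -> float:
    """Jaccard index as defined in Eq. (2)."""
    return tp / (tp + fp + fn)

# Purely illustrative counts.
tp, fp, fn = 8, 3, 5
print(round(f1_score(tp, fp, fn), 3), round(jaccard_index(tp, fp, fn), 3))  # 0.667 0.5
```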
In [51], the authors proposed the bridged transformer (BrT) for 3D object detection from vision and point cloud data. They used the ScanNet-V2 [53] and SUN RGB-D [54] datasets to validate their model, which demonstrated a mean average precision (mAP)@0.5 of 55.2 on the ScanNet-V2 dataset and 48.1 on the SUN RGB-D dataset.
Similarly, in [52], the authors proposed a transformer-based framework for 3D object detection using point cloud data. They also used the ScanNet-V2 [53] and SUN RGB-D [54] datasets to validate their model, which demonstrated an [email protected] of 52.8 on the ScanNet-V2 dataset and 45.2 on the SUN RGB-D dataset.
Table IV shows the application of ViTs for object detection.
Research | Model | Dataset | Objective | Performance metric | Value |
---|---|---|---|---|---|
[46] | YOLOS | COCO | Object detection | Box average precision | 42.0 |
[50] | ViT | Satellite images (dataset 2) | Manipulation detection | F1-score | 0.354 |
JI | 0.275 | ||||
[51] | BrT | ScanNet-V2 | 3D object detection | [email protected] | 55.2 |
SUN RGB-D | 48.1 | ||||
[52] | ViT based | ScanNet-V2 | 3D object detection using point cloud data | [email protected] | 52.8 |
SUN RGB-D | 45.2 |
III-C ViTs for Image Segmentation
Image segmentation can also be performed with transformers. A combination of ViT and U-Net was used in [55] to segment medical images; the authors replaced the encoder of the classical U-Net with a transformer. The multi-atlas abdomen labeling challenge dataset from MICCAI 2015 was used. Using images of resolution 224, TransUNet achieved an average Dice score of 77.48%, while using images of resolution 512, it achieved an average Dice score of 84.36%.
In [56], the authors proposed a ViT for biomedical image segmentation (ViTBIS). Transformers were used for both the encoder and the decoder of the model. The MICCAI 2015 multi-atlas abdomen labeling challenge dataset and the Brain Tumor Segmentation (BraTS 2019) challenge dataset were used. The evaluation metrics were the Dice score and the Hausdorff distance (HD) [64]. On the MICCAI 2015 dataset, the average Dice score was 80.45% and the average HD was 21.24%.
Hatamizadeh et al. in [57] proposed UNetFormer for medical image segmentation. The model contains a transformer-based encoder, decoder, and bottleneck. They used the medical segmentation decathlon (MSD) [58] and BraTS 2021 [59] datasets to test UNetFormer, evaluating Dice scores and HD. On the MSD dataset, the Dice score was 96.03% for the liver and 59.16% for the tumor, whereas the HD was 7.21% for the liver and 8.49% for the tumor. Moreover, the average Dice score on the BraTS 2021 dataset was 91.54%.
In [60], the authors proposed a novel language-aware ViT (LAVT) for referring image segmentation. They used four datasets for the evaluation of the model: RefCOCO [61], RefCOCO+ [61], G-Ref (UMD partition) [62], and G-Ref (Google partition) [63], with intersection over union (IoU) as the performance metric. The IoU was 72.73% on RefCOCO, 62.14% on RefCOCO+, 61.24% on G-Ref (UMD partition), and 60.50% on G-Ref (Google partition).
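As a reference for the metrics reported in Table V below, here is a minimal NumPy sketch of the Dice score and IoU for binary masks. The masks are toy examples, and the evaluation protocols of [55]-[60] differ in details such as per-class averaging.

```python
import numpy as np

def dice_score(pred: np.ndarray, target: np.ndarray, eps: float = 1e-7) -> float:
    """Dice score between two binary masks: 2|A ∩ B| / (|A| + |B|)."""
    inter = np.logical_and(pred, target).sum()
    return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

def iou(pred: np.ndarray, target: np.ndarray, eps: float = 1e-7) -> float:
    """Intersection over union between two binary masks: |A ∩ B| / |A ∪ B|."""
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    return (inter + eps) / (union + eps)

# Toy 4x4 masks for illustration only.
pred = np.array([[1, 1, 0, 0]] * 4, dtype=bool)
target = np.array([[1, 0, 0, 0]] * 4, dtype=bool)
print(round(dice_score(pred, target), 3), round(iou(pred, target), 3))  # 0.667 0.5
```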
Table V shows the application of ViT for the image segmentation.
Research | Model | Dataset | Objective | Performance metric | Value |
---|---|---|---|---|---|
[55] | TransUNet | MICCAI 2015 | Medical image segmentation | Dice score | 77.48% |
[56] | ViTBIS | MICCAI 2015 | Medical image segmentation | Dice score | 80.45% |
HD | 21.24% | ||||
[57] | UNetFormer | MSD | Liver segmentation | Dice score | 96.03% |
HD | 7.21% | ||||
Tumor segmentation | Dice score | 59.16% | |||
HD | 8.49% | ||||
BraTS 2021 | Brain tumor segmentation | Dice score | 91.54% | ||
[60] | LAVT | RefCOCO | Image segmentation | IoU | 72.73% |
RefCOCO+ | 62.14% | | | |
G-Ref (UMD partition) | 61.24% | ||||
G-Ref (Google partition) | 60.50% |
III-D ViTs for Image Compression
In recent years, learning-based image compression has been an active focus of research. Different CNN-based architectures have proved effective for learned lossy image compression, and as ViTs evolved, learning-based image compression was also performed with transformer-based models. In [65], the authors modified the entropy module of the Ballé 2018 model [66] with a ViT. Because the entropy module used a transformer, the model was called the Entroformer. The Entroformer effectively captured long-range dependencies in probability distribution estimation. The authors demonstrated its performance on the Kodak dataset: when the model was optimized for the mean squared error (MSE) loss function, the average peak signal-to-noise ratio (PSNR) and multi-scale structural similarity (MS-SSIM) were 27.63 dB and 0.90132, respectively.
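PSNR, which also appears in the super-resolution and denoising results below, is derived from the MSE as PSNR = 10·log10(MAX²/MSE). A minimal sketch with toy 8-bit images follows; it is illustrative only and not the Kodak evaluation of [65].

```python
import numpy as np

def psnr(original: np.ndarray, reconstructed: np.ndarray, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB: 10 * log10(MAX^2 / MSE)."""
    mse = np.mean((original.astype(np.float64) - reconstructed.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)

# Toy 8-bit images for illustration only.
rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(64, 64), dtype=np.uint8)
noisy = np.clip(img + rng.normal(0, 5, img.shape), 0, 255).astype(np.uint8)
print(round(psnr(img, noisy), 2), "dB")
```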
III-E ViTs for Image Super Resolution
CNNs have been used to perform image super-resolution, and given ViTs' advantages over CNNs, super-resolution can also be achieved with transformers. A spatio-temporal ViT, a transformer-based model for the super-resolution of microscopy images, was developed in [67]; the model also addressed video super-resolution. To test its performance, the authors used a video dataset and calculated the PSNR for static and dynamic videos, considering static, medium, fast, and extreme motion. The PSNR for static motion was 34.74 dB, whereas the PSNRs for medium, fast, and extreme motion were 30.15 dB, 26.04 dB, and 22.95 dB, respectively.
III-F ViTs for Image Denoising
Image denoising has also been a challenging problem for researchers, and ViTs have been applied here as well. A transformer was used to denoise CT images in [68], where the authors proposed a model called TED-Net for low-dose CT denoising, using transformers for both the encoder and decoder parts. On the AAPM-Mayo clinic LDCT Grand Challenge dataset, they obtained a structural similarity (SSIM) of 0.9144 and a root mean square error (RMSE) of 8.7681.
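The RMSE and SSIM metrics used to evaluate TED-Net can be computed as in the sketch below, assuming scikit-image is available; the images are toy arrays, not the AAPM-Mayo data.

```python
import numpy as np
from skimage.metrics import structural_similarity

rng = np.random.default_rng(0)
clean = rng.integers(0, 256, size=(128, 128)).astype(np.float64)   # toy "clean" image
denoised = clean + rng.normal(0, 5, clean.shape)                    # toy "denoised" output

rmse = np.sqrt(np.mean((clean - denoised) ** 2))                    # root mean square error
ssim = structural_similarity(clean, denoised, data_range=255)       # structural similarity
print(round(rmse, 4), round(ssim, 4))
```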
Research | Model | Dataset | Objective | Performance metric | Value |
---|---|---|---|---|---|
[69] | VT-ADL | MNIST | Anomaly detection and localization | PRO | 0.984 |
MVTec | 0.807 | ||||
BTAD | 0.89 | ||||
[71] | AnoViT | MNIST | Anomaly detection and localization | AUROC | 92.4% |
CIFAR | 60.1% | ||||
MVTec | 78.0% | ||||
[72] | TransAnomaly (without swm) | Pred1 | Anomaly detection in videos | AUC | 84.00% |
Pred2 | 96.10% | ||||
Avenue | 85.80% | ||||
TransAnomaly (with swm) | Pred1 | 86.70% | |||
Pred2 | 96.40% | ||||
Avenue | 87.00% |
III-G ViTs for Anomaly Detection
Additionally, ViT is used for anomaly detection. A novel ViT network for image anomaly detection and localization (VT-ADL) was developed in [69]. In their study, the authors used a real-world dataset called BTAD. The model was also tested on two publicly available datasets, MNIST and MVTec [70]. For all three datasets, they calculated the model’s per region overlap (PRO) score. A mean PRO score of 0.984 was obtained for the MNIST dataset, 0.807 for the MVTec dataset, and 0.89 for the BTAD dataset.
Similarly, in [71], the authors proposed AnoViT for the detection and localization of anomalies, using the MNIST, CIFAR, and MVTecAD datasets. On the MNIST, CIFAR, and MVTecAD datasets, the mean area under the receiver operating characteristic (AUROC) curve was 92.4%, 60.1%, and 78.0%, respectively.
Yuan et al. in [72] proposed TransAnomaly, a video ViT and U-Net-based framework for detecting anomalies in videos. They used three datasets: Pred1, Pred2, and Avenue. The area under the curve (AUC) for the three datasets was 84.00%, 96.10%, and 85.80%, respectively, without the sliding window method (swm). With the swm, the model gave AUCs of 86.70%, 96.40%, and 87.00% for the Pred1, Pred2, and Avenue datasets, respectively.
Table VI shows the summary of the ViT for anomaly detection.
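The image- and frame-level AUROC/AUC values in Table VI measure how well an anomaly score separates normal from anomalous samples. A minimal sketch assuming scikit-learn follows; the labels and scores are toy values unrelated to the surveyed datasets, and the PRO score, which requires per-region segmentation overlap, is not shown.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# 1 = anomalous, 0 = normal; higher score = more anomalous (toy values).
labels = np.array([0, 0, 0, 1, 1, 0, 1, 0])
scores = np.array([0.10, 0.30, 0.20, 0.80, 0.65, 0.40, 0.90, 0.15])
print("AUROC:", roc_auc_score(labels, scores))   # 1.0 for this perfectly separable toy case
```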
IV Open Research Challenges and Future Directions
Despite showing promising results for different image coding and CV tasks, ViTs still face implementation challenges, including high computational costs, large training datasets, neural architecture search, interpretability of transformers, and efficient hardware designs. The purpose of this section is to explain these challenges and the future directions for ViTs.
IV-A High computational cost
ViT-based models contain millions of parameters, and training them requires computers with high computational power. These high-performance machines are expensive and increase the overall cost of using ViTs. In comparison to CNNs, ViTs perform better; however, their computational cost is much higher, in part because self-attention scales quadratically with the number of input tokens. Reducing the computational cost of ViTs is one of the biggest challenges researchers face.
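As a rough illustration of why attention dominates this cost, the snippet below counts tokens and attention entries for a ViT-B/16-style configuration. The arithmetic is illustrative only; real FLOP counts depend on the implementation.

```python
# Back-of-the-envelope attention cost for a ViT-B/16-style model.
image_size, patch_size, layers, heads = 224, 16, 12, 12
tokens = (image_size // patch_size) ** 2 + 1        # 196 patches + 1 class token = 197
attn_entries = layers * heads * tokens ** 2         # attention scores per forward pass
print(tokens, f"{attn_entries:,}")                   # 197 and 5,588,496 (~5.6M entries)
```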
IV-B Large training dataset
Training ViTs requires a large amount of data, and ViTs perform poorly with small training datasets. A ViT trained on the ImageNet1K dataset performs worse than ResNet, whereas a ViT trained on ImageNet21K performs better than ResNet.
IV-C Neural architecture search (NAS)
NAS has been explored a great deal for CNNs. In contrast, NAS has not yet been widely explored for ViTs. NAS exploration for ViTs gives young investigators a new direction for future work.
IV-D Interpretability of transformers
It is difficult to visualize the relative contribution of input tokens to the final predictions with ViTs since the attention that originates in each layer is intermixed in succeeding layers. The problem remains unresolved.
IV-E Hardware-efficient designs
Power and processing requirements can make large-scale ViT networks unsuitable for edge devices and resource-constrained contexts such as the internet of things (IoT).
V Conclusion
It is becoming more common to use ViTs instead of CNNs for image coding and CV. The use of ViTs for classification, detection, segmentation, compression, and image super-resolution has risen dramatically since the introduction of the classical ViT for image classification. This survey first reviewed the existing surveys on ViTs in the literature and then highlighted the applications of different variants of ViTs in CV, examining their use for image classification, object detection, image segmentation, image compression, image super-resolution, image denoising, and anomaly detection. We also presented the lessons learned in each category. Additionally, we discussed the open research challenges faced during the implementation of ViTs, such as high computational costs, large training datasets, interpretability of transformers, and hardware efficiency. By providing future directions, we offer young researchers a new perspective.
References
- [1] B. Heo, S. Yun, D. Han, S. Chun, J. Choe, and S. J. Oh, “Rethinking Spatial Dimensions of Vision Transformers,” In Proc. of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 11936–11945.
- [2] I. Tenney, D. Das and E. Pavlick, “BERT rediscovers the classical NLP pipeline,” arXiv preprint, 2019, arXiv:1905.05950.
- [3] L. Floridi and M. Chiriatti, “GPT-3: Its nature, scope, limits, and consequences,” Minds and Machines, vol. 30(4), pp. 681–694, Dec. 2020.
- [4] A. Krizhevsky, I. Sutskever and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” Advances in neural information processing systems, vol. 25, 2012.
- [5] Z. Wu, C. Shen and A. Van Den Hengel, “Wider or deeper: Revisiting the resnet model for visual recognition,” Pattern Recognition, vol. 90, pp. 119–133, Jun 2019.
- [6] I. Hammad and K. El-Sankary, “Impact of Approximate Multipliers on VGG Deep Learning Network,” in IEEE Access, vol. 6, pp. 60438–60444, Oct 2018.
- [7] X. Yao, X. Wang, Y. Karaca, J. Xie and S. Wang, “Glomerulus Classification via an Improved GoogLeNet,” in IEEE Access, vol. 8, pp. 176916-176923, Sep 2020.
- [8] F. Chollet, “Xception: Deep learning with depthwise separable convolutions,” In Proc. of the IEEE conference on computer vision and pattern recognition, pp. 1251–1258. 2017.
- [9] C. Wang, D. Chen, L. Hao, X. Liu, Y. Zeng, J. Chen and G. Zhang, “Pulmonary Image Classification Based on Inception-v3 Transfer Learning Model,” in IEEE Access, vol. 7, pp. 146533–146541, Oct 2019.
- [10] K. Zhang, Y. Guo, X. Wang, J. Yuan and Q. Ding, “Multiple Feature Reweight DenseNet for Image Classification,” in IEEE Access, vol. 7, pp. 9872–9880, Jan 2019.
- [11] J. Wang, L. Yang, Z. Huo, W. He and J. Luo, “Multi-Label Classification of Fundus Images With EfficientNet,” in IEEE Access, vol. 8, pp. 212499–212508, Nov 2020.
- [12] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
- [13] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani et. al. “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint, 2020, arXiv:2010.11929.
- [14] B. Wu, C. Xu, X. Dai, A. Wan, P. Zhang, Z. Yan, M. Tomizuka, J. Gonzalez, K. Keutzer and P. Vajda, “Visual transformers: Token-based image representation and processing for computer vision,” arXiv preprint, 2020, arXiv:2006.03677.
- [15] T. Xiao, M. Singh, E. Mintun, T. Darrell, P. Dollar and R. Girshick, “Early convolutions help transformers see better,” Advances in Neural Information Processing Systems, 2021, vol. 34, pp. 30392–30400.
- [16] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” In Proc. of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022, 2021.
- [17] G. Bertasius, H. Wang and L. Torresani, “Is space-time attention all you need for video understanding?,” arXiv preprint, 2021, arXiv:2102.05095.
- [18] M. Raghu, T. Unterthiner, S. Kornblith, C. Zhang and A. Dosovitskiy, “Do vision transformers see like convolutional neural networks?” Advances in Neural Information Processing Systems, 34, 2021.
- [19] M. M. Naseer, K. Ranasinghe, S. H. Khan, M. Hayat, F. S. Khan and M. H. Yang, “Intriguing properties of vision transformers,” Advances in Neural Information Processing Systems, 34, 2021.
- [20] Z. Dai, H. Liu, Q. V. Le and M. Tan, “Coatnet: Marrying convolution and attention for all data sizes,” Advances in Neural Information Processing Systems, vol. 34, pp. 3965–3977, 2021.
- [21] H. Wu, B. Xiao, N. Codella, M. Liu, X. Dai, L. Yuan and L. Zhang. “Cvt: Introducing convolutions to vision transformers,” In Proc. of the IEEE/CVF International Conference on Computer Vision, pp. 22–31, 2021.
- [22] D. Coccomini, N. Messina, C. Gennaro and F. Falchi, “Combining efficientnet and vision transformers for video deepfake detection,” arXiv preprint, 2021, arXiv:2107.02612.
- [23] Y. Tay, M. Dehghani, D. Bahri and D. Metzler, “Efficient transformers: A survey,” arXiv preprint, 2020, arXiv:2009.06732.
- [24] S. Khan, M. Naseer, M. Hayat, S. W. Zamir, F. S. Khan and M. Shah, Transformers in vision: A survey, ACM Computing Surveys (CSUR), Feb 2021.
- [25] T. Lin, Y. Wang, X. Liu and X. Qiu, “A survey of transformers,” arXiv preprint, 2021, arXiv:2106.04554.
- [26] Y. Liu, Y. Zhang, Y. Wang, F. Hou, J. Yuan, J. Tian, Y. Zhang, Z. Shi, J. Fan and Z. He, “A Survey of Visual Transformers,” arXiv preprint, 2021, arXiv:2111.06091.
- [27] Y. Xu, H. Wei, M. Lin, Y. Deng, K. Sheng, M. Zhang, F. Tang, W. Dong, F. Huang and C. Xu, “Transformers in computational visual media: A survey,” Computational Visual Media, vol. 8, no. 1, pp. 33–62, Mar 2022.
- [28] C. -F. R. Chen, Q. Fan and R. Panda, “CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification,” In Proc. of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 347–356.
- [29] P. Deng, K. Xu and H. Huang, “When CNNs Meet Vision Transformer: A Joint Framework for Remote Sensing Scene Classification,” in IEEE Geoscience and Remote Sensing Letters, vol. 19, pp. 1–5, 2022, Art no. 8020305.
- [30] K. Xu, P. Deng and H. Huang, “Vision Transformer: An Excellent Teacher for Guiding Small Networks in Remote Sensing Image Scene Classification,” in IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1-15, 2022, Art no. 5618715.
- [31] S. Yu, K. Ma, Q. Bi, C. Bian, M. Ning, N. He, Y. Li, H. Liu and Y. Zheng. “Mil-vt: Multiple instance learning enhanced vision transformer for fundus image classification,” In the Proc. of the International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 45–54. Springer, Cham, 2021.
- [32] Z. Xue, X. Tan, X. Yu, B. Liu, A. Yu and P. Zhang, “Deep Hierarchical Vision Transformer for Hyperspectral and LiDAR Data Classification,” in IEEE Transactions on Image Processing, vol. 31, pp. 3095–3110, 2022.
- [33] M. Kaselimi, A. Voulodimos, I. Daskalopoulos, N. Doulamis and A. Doulamis, “A Vision Transformer Model for Convolution-Free Multilabel Classification of Satellite Imagery in Deforestation Monitoring,” in IEEE Transactions on Neural Networks and Learning Systems, 2022.
- [34] Y. Chen, X. Gu, Z. Liu and J. Liang, “A Fast Inference Vision Transformer for Automatic Pavement Image Classification and Its Visual Interpretation Method,” Remote Sensing, vol. 14, no. 8, p. 1877, 2022.
- [35] S. Jamil, M.S. Abbas and A.M. Roy. “Distinguishing Malicious Drones Using Vision Transformer,” AI vol. 3, no. 2, pp. 260–273, Mar. 2022.
- [36] K. He, X. Zhang, S. Ren and J. Sun, “Deep residual learning for image recognition,” In Proc. of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
- [37] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov and L.-C. Chen, “Mobilenetv2: Inverted residuals and linear bottlenecks,” In Proc. of the IEEE conference on computer vision and pattern recognition, pp. 4510–4520, 2018.
- [38] X. Zhang, X. Zhou, M. Lin and J. Sun, “ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices,” In Proc. of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 6848–6856.
- [39] F.N. Iandola, S. Han, M.W. Moskewicz, K. Ashraf, W.J. Dally and K. Keutzer, “SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size,” arXiv preprint, 2016, arXiv:1602.07360.
- [40] M. Tan and Q. Le, “Efficientnet: Rethinking model scaling for convolutional neural networks,” In Proc. of International conference on machine learning, pp. 6105–6114, 2019.
- [41] L. Tanzi, A. Audisio, G. Cirrincione, A. Aprato and E. Vezzetti, “Vision Transformer for femur fracture classification.” Injury, Apr. 2022.
- [42] J. Chen, T. Luo, J. Wu, Z. Wang and H. Zhang, “A Vision Transformer network SeedViT for classification of maize seeds,” Journal of Food Process Engineering, 1:e13998, May 2022.
- [43] Z. Wang, Y. Yang and S. Yue, “Air Quality Classification and Measurement Based on Double Output Vision Transformer,” in IEEE Internet of Things Journal, Mar. 2022.
- [44] Z. Sha and J. Li, “MITformer: A Multi-Instance Vision Transformer for Remote Sensing Scene Classification,” in IEEE Geoscience and Remote Sensing Letters, May 2022.
- [45] J. Beal, E. Kim, E. Tzeng, D.H. Park, A. Zhai and D. Kislyuk, “Toward transformer-based object detection,” arXiv preprint, 2020. arXiv:2012.09958.
- [46] Y. Fang, B. Liao, X. Wang, J. Fang, J. Qi, R. Wu, J. Niu and W. Liu, “You only look at one sequence: Rethinking transformer in vision through object detection,” Advances in Neural Information Processing Systems, 34, 2021.
- [47] Y. Li, S. Xie, X. Chen, P. Dollar, K. He and R. Girshick, “Benchmarking detection transfer learning with vision transformers,” arXiv preprint, 2021 , arXiv:2111.11429.
- [48] H. Bao, L. Dong, S. Piao and F. Wei, “Beit: Bert pre-training of image transformers,” In ICLR, 2022.
- [49] K. He, X. Chen, S. Xie, Y. Li, P. Dollar and R. Girshick, “Masked autoencoders are scalable vision learners,” In CVPR, 2022.
- [50] J. Horváth, S. Baireddy, H. Hao, D. M. Montserrat and E. J. Delp, “Manipulation Detection in Satellite Images Using Vision Transformer,” In Proc. of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2021, pp. 1032–1041.
- [51] Y. Wang, T. Ye, L. Cao, W. Huang, F. Sun, F. He and D. Tao, “Bridged Transformer for Vision and Point Cloud 3D Object Detection,” In Proc. of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2022.
- [52] Z. Liu, Z. Zhang, Y. Cao, H. Hu and X. Tong, “Group-Free 3D Object Detection via Transformers,” In Proc. of 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 2929–2938.
- [53] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser and M. Nießner, “ScanNet: Richly-Annotated 3D Reconstructions of Indoor Scenes,” In Proc. of 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 2432–2443.
- [54] S. Song, S. P. Lichtenberg and J. Xiao, “SUN RGB-D: A RGB-D scene understanding benchmark suite,” In Proc. of 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 567–576.
- [55] J. Chen, Y. Lu, Q. Yu, X. Luo, E. Adeli, Y. Wang, L. Lu, A.L. Yuille and Y. Zhou, “Transunet: Transformers make strong encoders for medical image segmentation,” arXiv preprint, 2021, arXiv:2102.04306.
- [56] A. Sagar, “ViTBIS: Vision Transformer for Biomedical Image Segmentation,” Clinical Image-Based Procedures, Distributed and Collaborative Learning, Artificial Intelligence for Combating COVID-19 and Secure and Privacy-Preserving Machine Learning, Springer, Cham, 2021, pp. 34–45.
- [57] A. Hatamizadeh, Z. Xu, D. Yang, W. Li, H. Roth and D. Xu, “UNetFormer: A Unified Vision Transformer Model and Pre-Training Framework for 3D Medical Image Segmentation,” arXiv preprint, 2022, arXiv:2204.00631.
- [58] M. Antonelli, A. Reinke, S. Bakas, K. Farahani, B. A. Landman, G. Litjens, B. Menze, O. Ronneberger, R.M. Summers, B. van Ginneken, et al., “The medical segmentation decathlon,” arXiv preprint. 2021, arXiv:2106.05735.
- [59] U. Baid, et al. “The rsna-asnr-miccai brats 2021 benchmark on brain tumor segmentation and radiogenomic classification,” arXiv preprint, 2021, arXiv:2107.02314.
- [60] Z. Yang, J. Wang, Y. Tang, K. Chen, H. Zhao and P. H. Torr, “LAVT: Language-Aware Vision Transformer for Referring Image Segmentation,” arXiv preprint, 2021, arXiv:2112.02244.
- [61] L. Yu, P. Poirson, S. Yang, A. C. Berg and T. L. Berg, “Modeling context in referring expressions,” In European Conference on Computer Vision, Oct. 2016, pp. 69–85. Springer, Cham.
- [62] V. K. Nagaraja, V. I. Morariu and L. S. Davis, “Modeling context between objects for referring expression understanding,” In European Conference on Computer Vision, Oct. 2016, pp. 792–807. Springer, Cham.
- [63] J. Mao, J. Huang, A. Toshev, O. Camburu, A. Yuille and K. Murphy, “Generation and Comprehension of Unambiguous Object Descriptions,” In Proc. of 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 11–20.
- [64] D. P. Huttenlocher, G. A. Klanderman and W. J. Rucklidge, “Comparing images using the Hausdorff distance,” in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 15, no. 9, pp. 850–863, Sept. 1993.
- [65] Y. Qian, X. Sun, M. Lin, Z. Tan and R. Jin, “Entroformer: A Transformer-based Entropy Model for Learned Image Compression,” In Proc. of International Conference on Learning Representations, 2022.
- [66] J. Ballé, D. Minnen, S. Singh, S.J. Hwang and N. Johnston, “Variational Image Compression with a Scale Hyperprior,” In Proc. of International Conference on Learning Representations, 2018.
- [67] C. N. Christensen, M. Lu, E. N. Ward, P. Lio and C. F. Kaminski, “Spatio-temporal Vision Transformer for Super-resolution Microscopy,” arXiv preprint, 2022, arXiv:2203.00030.
- [68] D. Wang, Z. Wu and H. Yu, “TED-net: Convolution-free T2T Vision Transformer-based Encoder-decoder Dilation network for Low-dose CT Denoising,” In International Workshop on Machine Learning in Medical Imaging, 2021, pp. 416–425.
- [69] P. Mishra, R. Verk, D. Fornasier, C. Piciarelli and G. L. Foresti, “VT-ADL: A Vision Transformer Network for Image Anomaly Detection and Localization,” In Proc. of 2021 IEEE 30th International Symposium on Industrial Electronics (ISIE), 2021, pp. 01–06.
- [70] P. Bergmann, M. Fauser, D. Sattlegger, and C. Steger, “Mvtec ad–a comprehensive real-world dataset for unsupervised anomaly detection,” In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 9592–9600.
- [71] Y. Lee and P. Kang, “AnoViT: Unsupervised Anomaly Detection and Localization With Vision Transformer-Based Encoder-Decoder,” in IEEE Access, vol. 10, pp. 46717–46724, May 2022.
- [72] H. Yuan, Z. Cai, H. Zhou, Y. Wang and X. Chen, “TransAnomaly: Video Anomaly Detection Using Video Vision Transformer,” in IEEE Access, vol. 9, pp. 123977–123986, Aug. 2021.