A Comprehensive Survey on Transformers for Computer Vision Applications
Abstract
Vision Transformers (ViTs), a special type of transformer, are applied to various computer vision (CV) applications such as image recognition. Several potential problems with convolutional neural networks (CNNs) can be solved with ViTs, and different variants of ViTs are used for image coding tasks like compression, super-resolution, segmentation, and denoising. The purpose of this survey is to present the applications of ViTs in CV; to the best of our knowledge, it is the first survey of its kind on ViTs for CV. In the first step, we classify the different CV applications where ViTs are applicable, including image classification, object detection, image segmentation, image compression, image super-resolution, image denoising, and anomaly detection. Our next step is to review the state-of-the-art in each category and list the available models. Following that, we present a detailed analysis and comparison of each model and list its pros and cons. After that, we present our insights and lessons learned for each category. Moreover, we discuss several open research challenges and future research directions.
Index Terms:
Vision Transformers, Computer Vision, Deep learning, Image coding
I Introduction
Vision transformers (ViTs) are designed for vision-related tasks, including image recognition [1]. Originally, transformers were used for natural language processing (NLP). Bidirectional encoder representations from transformers (BERT) [2] and the generative pre-trained transformer 3 (GPT-3) [3] were the pioneering transformer models for NLP. In contrast, classical image processing systems use convolutional neural networks (CNNs) for different computer vision (CV) tasks. The most common CNN models are AlexNet [4], ResNet [5], VGG [6], GoogleNet [7], Xception [8], Inception [9], DenseNet [10], and EfficientNet [11].
Transformers compute attention between every pair of input tokens, so the cost rises rapidly as the number of tokens increases. The pixel is the most basic unit of an image, and computing the relationship between every pair of pixels in a typical image would be prohibitively time- and memory-intensive. ViTs therefore take the several steps described below.
1. ViTs divide the full image into a grid of small image patches.
2. ViTs apply a linear projection to embed each patch.
3. Each embedded patch becomes a token, and the resulting sequence of embedded patches is passed to the transformer encoder (TE).
4. The TE encodes the input patches, its output is given to the multi-layer perceptron (MLP) head, and the output of the MLP head is the class of the input image.
Figure 1 shows the basic structure of a ViT. First, the input image is divided into smaller patches. Each patch is then embedded using a linear projection. The embedded patches become tokens that are given to the TE as inputs. The TE uses multi-head attention and normalization to encode the information in the embedded patches. The TE output is given to the MLP head, and the MLP head output is the class of the input image.
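To make this pipeline concrete, the following is a minimal sketch of a ViT-style classifier in PyTorch. It is an illustrative toy model, not the exact architecture of [13]; the name `TinyViT` and all hyperparameters are ours, and the four numbered steps above are marked in the comments.

```python
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    """Toy ViT-style classifier: patch embedding -> transformer encoder -> MLP head."""
    def __init__(self, image_size=224, patch_size=16, dim=192, depth=4, heads=3, num_classes=10):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # Steps 1-2: split the image into patches and linearly project each patch
        # (a strided convolution does both at once).
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))               # learnable class token
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))  # positional embeddings
        # Step 3: transformer encoder (multi-head attention + normalization + MLP blocks).
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        # Step 4: MLP head mapping the class token to class logits.
        self.head = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, num_classes))

    def forward(self, x):                                        # x: (batch, 3, H, W)
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)  # (batch, num_patches, dim)
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed
        encoded = self.encoder(tokens)
        return self.head(encoded[:, 0])                          # predict from the class token

logits = TinyViT()(torch.randn(2, 3, 224, 224))                  # logits shape: (2, 10)
```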

For image classification, the most popular architecture uses only the TE to transform the sequence of input tokens; however, the transformer decoder can also be used for other purposes. Since their introduction in 2017, transformers have rapidly spread across NLP, becoming one of the most widely used and promising designs [12].
ViTs were applied to CV tasks in 2021 [13]. The idea was to construct a sequence of patches that, once flattened into vectors, are treated like words by a standard transformer. Whereas the attention mechanism of NLP transformers captures the relationships between different words in a text, in CV it captures how the different patches of an image relate to one another.
In 2021, a pure transformer outperformed CNNs in image classification [13]. In June 2021, a transformer backend was added to the conventional ResNet, drastically lowering costs while enhancing accuracy [14],[15].
In the same year, several key ViT variants were released. These variants were more efficient, more accurate, or tailored to specific domains. The Swin transformer is the most prominent of them [16]: using a multi-stage approach and a modified attention mechanism, it achieved state-of-the-art performance on object detection datasets. There is also the TimeSformer, which was proposed for video understanding and can capture spatial and temporal information through divided space-time attention [17].
ViT performance is influenced by decisions such as the optimizer, dataset-specific hyperparameters, and network depth, whereas CNNs are significantly easier to optimize. CNNs also perform admirably even when trained on much less data than ViTs require. This distinct behavior appears to stem from inductive biases that CNNs can use to grasp the particularities of images more quickly, even though these biases restrict them and make it harder to capture global relationships. ViTs, on the other hand, are free of these biases, which allows them to capture a broader and more global set of relationships at the cost of more demanding training on data [18].
ViTs are also more resistant to input visual distortions such as adversarial patches and permutations [19]. That said, preferring one architecture over the other may not be the best choice: combining convolutional layers with ViTs has been shown to yield excellent results in numerous CV tasks [20]-[22].
Because of the massive amount of data required to train these models, alternative training approaches have been developed. These make it feasible to train a neural network largely autonomously, allowing it to infer the characteristics of a given problem without requiring a large dataset or precise labeling. The ability to train ViTs without a massive vision dataset may be what makes this novel architecture so appealing.
ViTs have been employed in numerous CV tasks with outstanding and, in some cases, state-of-the-art results. The following are some of the important application areas:
• Image Classification
• Anomaly Detection
• Object Detection
• Image Compression
• Image Segmentation
• Video Deepfake Detection
• Cluster Analysis
Figure 2 shows that the percentages of ViT applications devoted to image classification, object detection, image segmentation, image compression, image super-resolution, image denoising, and anomaly detection are 50%, 40%, 3%, less than 1%, less than 1%, 2%, and 3%, respectively.


ViTs have been widely utilized in CV tasks and can solve several problems faced by CNNs. Different variants of ViTs are used for image compression, super-resolution, denoising, and segmentation. With the advancement of ViTs for CV applications, a state-of-the-art survey is required that highlights the performance of ViTs on CV tasks. In this survey, we first classify the CV applications where ViTs are used, namely image classification, object detection, image segmentation, image compression, image super-resolution, image denoising, and anomaly detection. In the next step, we survey the state-of-the-art in each CV application and tabulate the existing ViT-based models. We also discuss the pros and cons of each listed model and present the lessons learned for each CV application. In the end, we discuss several open research issues and future directions.
The rest of the survey is organized as follows. Section II presents related work, and Section III presents applications of ViTs in CV. Section IV presents open research challenges and future directions. Lastly, Section V concludes the survey. Figure 3 shows the organization of the survey.
Acronym | Meaning |
---|---|
AID | Aerial image dataset |
AP | Average precision |
AQI | Air quality index |
AUC | Area under the curve |
AUROC | Area under receiver operating characteristic curve |
Box average precision | |
BERT | Bidirectional encoder representations from transformers |
bpp | Bits per pixel |
BrT | Bridged transformer |
BTAD | BeanTech anomaly detection |
CIFAR | Canadian institute for advanced research |
CV | Computer vision |
CNN | Convolutional neural network |
DHViT | Deep hierarchical ViT |
DOViT | Double output ViT |
ET-GSNet | Excellent teacher guiding small networks |
GAOs-1 | Get AQI in one shot-1 |
GAOs-2 | Get AQI in one shot-2 |
GPT-3 | Generative pre-trained transformer 3 |
HD | Hausdorff distance |
IoU | Intersection over union |
JI | Jaccard index |
LiDAR | Light detection and ranging |
mAP | Mean average precision |
MLP | Multi layer perceptron |
MIL-ViT | Multiple instance enhanced ViT |
MIM | Masked image modeling |
MITformer | Multi-instance ViT |
MSE | Mean squared error |
MS-SSIM | Multi-scale structural similarity |
NLP | Natural language processing |
NWPU | Northwestern Polytechnical University |
PSNR | Peak signal to noise ratio |
PRO | Per region overlap |
PUAS | Planet understanding the amazon from space |
RFMiD2020 | 2020 retinal fundus multi-disease image dataset |
R-CNN | Region-based convolutional neural network |
RMSE | Root mean square error |
SSIM | Structural similarity |
TE | Transformer encoder |
UCM | UC-Merced land use dataset |
ViTs | Vision transformers |
VT-ADL | ViT network for image anomaly detection and localization |
ViT-PP | ViT with post processing |
YOLOS | You only look at one sequence |
II Related Work
A number of surveys have been conducted on ViTs in the literature. The authors of [23] reviewed the theoretical concepts, foundations, and applications of memory-efficient transformers and discussed the applications of efficient transformers in NLP; CV tasks, however, were not covered. A similar study, [24], examined the theoretical aspects of ViTs, the foundations of transformers, the role of multi-head attention in transformers, and applications of transformers in image classification, segmentation, super-resolution, and object detection. That survey did not include applications of transformers for image denoising and compression.
In [26], the authors described transformer architectures for segmenting, classifying, and detecting objects in images. The survey did not include CV and image processing tasks such as image super-resolution, denoising, and compression.
Lin et al. in [25] summarized different transformer architectures for NLP; the survey, however, did not include any applications of transformers to CV tasks. In [27], the authors discussed different transformer architectures for computational visual media, covering low-level vision and generation tasks such as image colorization, image super-resolution, image generation, and text-to-image conversion, as well as high-level vision tasks such as segmentation and object detection. However, the survey did not discuss transformers for image compression and classification.
Table II summarizes all existing surveys on ViTs. An analysis of Table II makes it evident that a survey is needed to provide insight into the latest developments in ViTs for several image processing and CV tasks, including classification, detection, segmentation, compression, denoising, and super-resolution.
Survey | Year | Image Classification | Object Detection | Image Segmentation | Image Compression | Image Super Resolution | Image Denoising | Anomaly Detection | Contributions and limitations
---|---|---|---|---|---|---|---|---|---
[23] | 2020 | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | Survey of foundations and applications of efficient transformers
[24] | 2021 | ✓ | ✓ | ✓ | ✗ | ✓ | ✗ | ✗ | Survey of basic concepts and applications of transformers in CV
[25] | 2021 | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | Survey of different transformer architectures
[26] | 2021 | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ | Survey of different transformer architectures for image classification, object detection, and image segmentation
[27] | 2022 | ✗ | ✓ | ✓ | ✗ | ✓ | ✗ | ✗ | Survey of transformers in computational visual media
Our survey | 2022 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | Survey of applications of transformers in CV; new outlook on open research gaps

III Applications of ViTs in CV
In addition to the classical ViT, modified versions of it are used for object detection, image segmentation, compression, super-resolution, denoising, and anomaly detection. Figure 4 shows the organization of Section III.
III-A ViTs for Image Classification
In image classification, the image is first divided into patches; these patches are linearly embedded and fed to the transformer encoder, where multi-head attention, normalization, and MLP layers are applied to the embedded patches. The encoder output is fed to the MLP head, which predicts the output class. These classical ViTs have been used by many researchers to classify visual objects.
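In practice, many of the classification studies below fine-tune a pre-trained ViT on a domain-specific dataset rather than training from scratch. Below is a hedged sketch of that workflow, assuming a recent torchvision with ImageNet-pretrained ViT-B/16 weights; the 5-class head is purely illustrative and not tied to any of the surveyed papers.

```python
import torch
from torchvision.models import vit_b_16, ViT_B_16_Weights

# Load an ImageNet-pretrained ViT-B/16 and swap its classification head
# for a hypothetical 5-class downstream task.
weights = ViT_B_16_Weights.IMAGENET1K_V1
model = vit_b_16(weights=weights)
model.heads.head = torch.nn.Linear(model.heads.head.in_features, 5)

preprocess = weights.transforms()       # resizing/normalization expected by the checkpoint
image = torch.rand(3, 256, 256)         # stand-in for a real image tensor
logits = model(preprocess(image).unsqueeze(0))
print(logits.shape)                     # torch.Size([1, 5])
```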
In [28], the authors proposed CrossViT-9, CrossViT-15, and CrossViT-18 for image classification. They used the ImageNet1K, CIFAR10, CIFAR100, pet, crop disease, and ChestXRay8 datasets to evaluate the different CrossViT variants. They achieved 77.1% accuracy on the ImageNet1K dataset with CrossViT-9, and 82.3% and 82.8% accuracy on ImageNet1K with CrossViT-15 and CrossViT-18, respectively. Similarly, the authors obtained 99.0% and 99.11% accuracy with CrossViT-15 and CrossViT-18, respectively, on the CIFAR10 dataset, and 90.77% and 91.36% accuracy on the CIFAR100 dataset with CrossViT-15 and CrossViT-18, respectively. The authors also used CrossViT for pet classification, crop disease classification, and chest X-ray classification. They observed the highest accuracy of 95.07% with CrossViT-18 for pet classification, the highest accuracy of 99.97% with both CrossViT-15 and CrossViT-18 for crop disease classification, and the highest accuracy of 55.94% with CrossViT-18 for chest X-ray classification.
Deng et al. in [29] proposed a combined CNN and ViT model named CTNet for the classification of high-resolution remote sensing images. To evaluate the model, they used the aerial image dataset (AID) and the Northwestern Polytechnical University (NWPU)-RESISC45 dataset. CTNet obtained accuracies of 97.70% and 95.49% on the AID and NWPU-RESISC45 datasets, respectively. In [30], the authors proposed the excellent teacher guiding small networks model (ET-GSNet) for remote sensing image scene classification. They used four datasets: AID, NWPU-RESISC45, the UC-Merced land use dataset (UCM), and OPTIMAL-31, achieving accuracies of 96.88%, 94.50%, 99.29%, and 96.45%, respectively.
Yu et al. in [31] presented the multiple instance enhanced ViT (MIL-ViT) for fundus image classification. They used the APTOS2019 blindness detection dataset and the 2020 retinal fundus multi-disease image dataset (RFMiD2020). MIL-ViT achieved an accuracy of 97.9% on the APTOS2019 dataset and 95.9% on the RFMiD2020 dataset.
Xue et al. in [32] proposed the deep hierarchical ViT (DHViT) for hyperspectral and light detection and ranging (LiDAR) data classification. The authors used the Trento, Houston 2013, and Houston 2018 datasets and obtained accuracies of 99.58%, 99.55%, and 96.40%, respectively.
In [33], the authors elaborated on the use of ViT for satellite imagery multilabel classification and proposed ForestViT. ForestViT demonstrated an accuracy of 94.28% on the planet understanding the amazon from space (PUAS) dataset.
In [34], the researchers put forward LeViT for pavement image classification. They used Chinese and German asphalt pavement datasets to evaluate the model's performance, obtaining an accuracy of 91.56% on the Chinese asphalt pavement dataset and 99.17% on the German asphalt pavement dataset.
In [35], the authors used ViT to distinguish malicious drones from aeroplanes, birds, other drones, and helicopters. They demonstrated the efficiency of ViT for this classification over several CNNs such as AlexNet [4], ResNet-50 [36], MobileNet-V2 [37], ShuffleNet [38], SqueezeNet [39], and EfficientNet-b0 [40]. The ViT model achieved 98.3% accuracy on the malicious drones dataset.
Tanzi et al. in [41] applied ViT to femur fracture classification using a dataset of real X-rays. The model achieved an accuracy of 83% with 77% precision, 76% recall, and a 77% F1-score.
In [42], the authors modified the classical ViT and proposed SeedViT for classifying maize seed quality. Using a custom dataset, the model outperformed CNNs and achieved 96.70% accuracy.
Similarly, in [43], the researchers put forward the double output ViT (DOViT) for air quality classification and measurement. They used two datasets, get AQI in one shot-1 (GAOs-1) and get AQI in one shot-2 (GAOs-2), and achieved 90.32% accuracy on the GAOs-1 dataset and 92.78% accuracy on the GAOs-2 dataset.
In [44], the authors developed a novel multi-instance ViT called MITformer for remote sensing scene classification. They evaluated their model on three different datasets, achieving 99.83% accuracy on the UCM dataset, 97.96% on the AID dataset, and 95.93% on the NWPU dataset.
Table III shows the summary of the application of ViT for image classification.
Research | Model | Dataset | Objective | Accuracy |
---|---|---|---|---|
[28] | CrossViT-9 | ImageNet1K | Image classification | 77.1% |
CrossViT-15 | 82.3% | |||
CrossViT-18 | 82.8% | |||
CrossViT-15 | CIFAR10 | Image classification | 99.0% | |
CIFAR100 | 90.77% | |||
Pet | Pet classification | 94.55% | ||
Crop Diseases | Crop diseases classification | 99.97% | ||
ChestXRay8 | Chest X rays classification | 55.89% | ||
CrossViT-18 | CIFAR10 | Image classification | 99.11% | |
CIFAR100 | 91.36% | |||
Pet | Pet classification | 95.07% | ||
Crop Diseases | Crop diseases classification | 99.97% | ||
ChestXRay8 | Chest X rays classification | 55.94% | ||
[29] | CTNet | AID | Remote sensing scene classification | 97.70% |
NWPU-RESISC45 | 95.49% | |||
[30] | ET-GSNet | AID | Remote sensing image scene classification | 96.88% |
NWPU-RESISC45 | 94.50% | |||
UCM | 99.29% | |||
OPTIMAL-31 | 96.45% | |||
[31] | MIL-ViT | APTOS2019 | Fundus image classification | 97.9% |
RFMiD2020 | 95.9% | |||
[32] | DHViT | Trento | Hyperspectral and LiDAR data classification | 99.58% |
Houston 2013 | 99.55% | |||
Houston 2018 | 96.40% | |||
[33] | ForestViT | PUAS | Satellite imagery multilabel classification | 94.28% |
[34] | LeViT | Chinese asphalt pavement | Pavement image classification | 91.56% |
German asphalt pavement | 99.17% | |||
[35] | ViT | Malicious drone | Malicious drones classification | 98.3% |
[41] | ViT | Real X Rays | Femur fracture classification | 83.00% |
[42] | SeedViT | Maize seeds | Maize seeds quality classification | 96.70% |
[43] | DOViT | GAOs-1 | Air quality classification | 90.32% |
GAOs-2 | 92.78% | |||
[44] | MITformer | UCM | Remote sensing scene classification | 99.83% |
AID | 97.96% | |||
NWPU | 95.93% |
III-B ViTs for Object Detection
Efforts to adapt the pre-trained vanilla ViT to object detection have continued ever since the transformer [12] was brought to CV [13]. Beal et al. [45] were the first to combine a faster region-based convolutional neural network (R-CNN) detector with a supervised pre-trained ViT for object detection. You only look at one sequence (YOLOS) [46] proposes using only a pre-trained ViT encoder to perform object detection in a pure sequence-to-sequence manner. Li et al. [47] were the first to conduct a large-scale study of the vanilla ViT on object detection using masked image modeling (MIM) pre-trained representations [48], [49], confirming the vanilla ViT's promising potential and capacity in object-level recognition.
In [50], the authors proposed an unsupervised learning-based technique using ViT for detecting manipulations in satellite images. They used two different datasets for the evaluation of the framework. The ViT model with post-processing (ViT-PP) achieves an F1-score of 0.354 and a Jaccard index (JI) of 0.275 on dataset 2. The F1-score and JI can be calculated using (1) and (2), respectively.
F1-score = 2TP / (2TP + FP + FN) (1)
JI = TP / (TP + FP + FN) (2)
Here TP, FP, and FN denote true positive, false positive, and false negative, respectively.
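A small self-contained snippet that evaluates (1) and (2); the counts are purely illustrative and are not taken from [50].

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1-score as defined in Eq. (1)."""
    return 2 * tp / (2 * tp + fp + fn)

def jaccard_index(tp: int, fp: int, fn: int) -> float:
    """Jaccard index as defined in Eq. (2)."""
    return tp / (tp + fp + fn)

# Purely illustrative counts.
tp, fp, fn = 8, 3, 5
print(round(f1_score(tp, fp, fn), 3), round(jaccard_index(tp, fp, fn), 3))  # 0.667 0.5
```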
In [51], the authors proposed the bridged transformer (BrT) for 3D object detection from vision and point cloud data. They used the ScanNet-V2 [53] and SUN RGB-D [54] datasets to validate their model, which demonstrated a mean average precision (mAP)@0.5 of 55.2 on the ScanNet-V2 dataset and 48.1 on the SUN RGB-D dataset.
Similarly, in [52], the authors proposed a transformer-based framework for 3D object detection using point cloud data. They also used the ScanNet-V2 [53] and SUN RGB-D [54] datasets to validate their model, which demonstrated an [email protected] of 52.8 on the ScanNet-V2 dataset and 45.2 on the SUN RGB-D dataset.
Table IV shows the application of ViTs for object detection.
Research | Model | Dataset | Objective | Performance metric | Value |
---|---|---|---|---|---|
[46] | YOLOS | COCO | Object detection | Box average precision | 42.0 |
[50] | ViT | Satellite images (dataset 2) | Manipulation detection | F1-score | 0.354 |
JI | 0.275 | ||||
[51] | BrT | ScanNet-V2 | 3D object detection | [email protected] | 55.2 |
SUN RGB-D | 48.1 | ||||
[52] | ViT based | ScanNet-V2 | 3D object detection using point cloud data | [email protected] | 52.8 |
SUN RGB-D | 45.2 |
III-C ViTs for Image Segmentation
Image segmentation can also be performed with transformers. A combination of ViT and U-Net was used in [55] to segment medical images; the authors replaced the encoder of the classical U-Net with a transformer. The multi-atlas abdomen labeling challenge dataset from MICCAI 2015 was used. Using images of resolution 224, TransUNet achieved an average Dice score of 77.48%, while using images of resolution 512, it achieved an average Dice score of 84.36%.
In [56], the authors proposed a ViT for biomedical image segmentation (ViTBIS). Transformers were used for both the encoder and the decoder of the model. The MICCAI 2015 multi-atlas abdomen labeling challenge dataset and the Brain Tumor Segmentation (BraTS 2019) challenge dataset were used. The evaluation metrics were the Dice score and the Hausdorff distance (HD) [64]. On the MICCAI 2015 dataset, the average Dice score was 80.45% and the average HD was 21.24%.
Hatamizadeh et al. in [57] proposed UNetFormer for medical image segmentation. The model contains a transformer-based encoder, decoder, and bottleneck. They used the medical segmentation decathlon (MSD) [58] and BraTS 2021 [59] datasets to test UNetFormer, evaluating Dice scores and HD. On the MSD dataset, the Dice score was 96.03% for the liver and 59.16% for the tumor, whereas the HD was 7.21% for the liver and 8.49% for the tumor. Moreover, the average Dice score on the BraTS 2021 dataset was 91.54%.
In [60], the authors proposed a novel language-aware ViT (LAVT) for referring image segmentation. They used four datasets for the evaluation of the model: RefCOCO [61], RefCOCO+ [61], G-Ref (UMD partition) [62], and G-Ref (Google partition) [63], with intersection over union (IoU) as the performance metric. The IoU was 72.73% on RefCOCO, 62.14% on RefCOCO+, 61.24% on G-Ref (UMD partition), and 60.50% on G-Ref (Google partition).
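As a reference for the metrics reported in Table V below, here is a minimal NumPy sketch of the Dice score and IoU for binary masks. The masks are toy examples, and the evaluation protocols of [55]-[60] differ in details such as per-class averaging.

```python
import numpy as np

def dice_score(pred: np.ndarray, target: np.ndarray, eps: float = 1e-7) -> float:
    """Dice score between two binary masks: 2|A ∩ B| / (|A| + |B|)."""
    inter = np.logical_and(pred, target).sum()
    return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

def iou(pred: np.ndarray, target: np.ndarray, eps: float = 1e-7) -> float:
    """Intersection over union between two binary masks: |A ∩ B| / |A ∪ B|."""
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    return (inter + eps) / (union + eps)

# Toy 4x4 masks for illustration only.
pred = np.array([[1, 1, 0, 0]] * 4, dtype=bool)
target = np.array([[1, 0, 0, 0]] * 4, dtype=bool)
print(round(dice_score(pred, target), 3), round(iou(pred, target), 3))  # 0.667 0.5
```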
Table V shows the application of ViT for the image segmentation.
Research | Model | Dataset | Objective | Performance metric | Value |
---|---|---|---|---|---|
[55] | TransUNet | MICCAI 2015 | Medical image segmentation | Dice score | 77.48% |
[56] | ViTBIS | MICCAI 2015 | Medical image segmentation | Dice score | 80.45% |
HD | 21.24% | ||||
[57] | UNetFormer | MSD | Liver segmentation | Dice score | 96.03% |
HD | 7.21% | ||||
Tumor segmentation | Dice score | 59.16% | |||
HD | 8.49% | ||||
BraTS 2021 | Brain tumor segmentation | Dice score | 91.54% | ||
[60] | LAVT | RefCOCO | Image segmentation | IoU | 72.73% |
RefCOCO+ | 62.14% | | | |
G-Ref (UMD partition) | 61.24% | ||||
G-Ref (Google partition) | 60.50% |
III-D ViTs for Image Compression
In recent years, learning-based image compression has been an active focus of research. Different CNN-based architectures have proved effective for learned lossy image compression, and as ViTs evolved, learning-based image compression was also performed with transformer-based models. In [65], the authors modified the entropy module of the Ballé 2018 model [66] with a ViT. Because the entropy module used a transformer, the model was called the Entroformer. The Entroformer effectively captured long-range dependencies in probability distribution estimation. The authors demonstrated its performance on the Kodak dataset: when the model was optimized for the mean squared error (MSE) loss function, the average peak signal-to-noise ratio (PSNR) and multi-scale structural similarity (MS-SSIM) were 27.63 dB and 0.90132, respectively.
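PSNR, which also appears in the super-resolution and denoising results below, is derived from the MSE as PSNR = 10·log10(MAX²/MSE). A minimal sketch with toy 8-bit images follows; it is illustrative only and not the Kodak evaluation of [65].

```python
import numpy as np

def psnr(original: np.ndarray, reconstructed: np.ndarray, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB: 10 * log10(MAX^2 / MSE)."""
    mse = np.mean((original.astype(np.float64) - reconstructed.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)

# Toy 8-bit images for illustration only.
rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(64, 64), dtype=np.uint8)
noisy = np.clip(img + rng.normal(0, 5, img.shape), 0, 255).astype(np.uint8)
print(round(psnr(img, noisy), 2), "dB")
```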
III-E ViTs for Image Super Resolution
CNNs have been used to perform image super-resolution, and given ViTs' advantages over CNNs, super-resolution can also be achieved with transformers. A spatio-temporal ViT, a transformer-based model for the super-resolution of microscopy images, was developed in [67]; the model also addressed video super-resolution. To test its performance, the authors used a video dataset and calculated the PSNR for static and dynamic videos, considering static, medium, fast, and extreme motion. The PSNR for static motion was 34.74 dB, whereas the PSNRs for medium, fast, and extreme motion were 30.15 dB, 26.04 dB, and 22.95 dB, respectively.
III-F ViTs for Image Denoising
Image denoising has also been a challenging problem for researchers, and ViTs have been applied here as well. A transformer was used to denoise CT images in [68], where the authors proposed a model called TED-Net for low-dose CT denoising, using transformers for both the encoder and decoder parts. On the AAPM-Mayo clinic LDCT Grand Challenge dataset, they obtained a structural similarity (SSIM) of 0.9144 and a root mean square error (RMSE) of 8.7681.
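The RMSE and SSIM metrics used to evaluate TED-Net can be computed as in the sketch below, assuming scikit-image is available; the images are toy arrays, not the AAPM-Mayo data.

```python
import numpy as np
from skimage.metrics import structural_similarity

rng = np.random.default_rng(0)
clean = rng.integers(0, 256, size=(128, 128)).astype(np.float64)   # toy "clean" image
denoised = clean + rng.normal(0, 5, clean.shape)                    # toy "denoised" output

rmse = np.sqrt(np.mean((clean - denoised) ** 2))                    # root mean square error
ssim = structural_similarity(clean, denoised, data_range=255)       # structural similarity
print(round(rmse, 4), round(ssim, 4))
```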
Research | Model | Dataset | Objective | Performance metric | Value |
---|---|---|---|---|---|
[69] | VT-ADL | MNIST | Anomaly detection and localization | PRO | 0.984 |
MVTec | 0.807 | ||||
BTAD | 0.89 | ||||
[71] | AnoViT | MNIST | Anomaly detection and localization | AUROC | 92.4% |
CIFAR | 60.1% | ||||
MVTec | 78.0% | ||||
[72] | TransAnomaly (without swm) | Pred1 | Anomaly detection in videos | AUC | 84.00% |
Pred2 | 96.10% | ||||
Avenue | 85.80% | ||||
TransAnomaly (with swm) | Pred1 | 86.70% | |||
Pred2 | 96.40% | ||||
Avenue | 87.00% |
III-G ViTs for Anomaly Detection
Additionally, ViT is used for anomaly detection. A novel ViT network for image anomaly detection and localization (VT-ADL) was developed in [69]. In their study, the authors used a real-world dataset called BTAD. The model was also tested on two publicly available datasets, MNIST and MVTec [70]. For all three datasets, they calculated the model’s per region overlap (PRO) score. A mean PRO score of 0.984 was obtained for the MNIST dataset, 0.807 for the MVTec dataset, and 0.89 for the BTAD dataset.
Similarly, in [71], the authors proposed AnoViT for the detection and localization of anomalies, using the MNIST, CIFAR, and MVTecAD datasets. On the MNIST, CIFAR, and MVTecAD datasets, the mean area under the receiver operating characteristic (AUROC) curve was 92.4%, 60.1%, and 78.0%, respectively.
Yuan et al. in [72] proposed TransAnomaly, a video ViT and U-Net-based framework for detecting anomalies in videos. They used three datasets: Pred1, Pred2, and Avenue. The area under the curve (AUC) for the three datasets was 84.00%, 96.10%, and 85.80%, respectively, without the sliding window method (swm). With the swm, the model gave AUCs of 86.70%, 96.40%, and 87.00% for the Pred1, Pred2, and Avenue datasets, respectively.
Table VI shows the summary of the ViT for anomaly detection.
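The image- and frame-level AUROC/AUC values in Table VI measure how well an anomaly score separates normal from anomalous samples. A minimal sketch assuming scikit-learn follows; the labels and scores are toy values unrelated to the surveyed datasets, and the PRO score, which requires per-region segmentation overlap, is not shown.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# 1 = anomalous, 0 = normal; higher score = more anomalous (toy values).
labels = np.array([0, 0, 0, 1, 1, 0, 1, 0])
scores = np.array([0.10, 0.30, 0.20, 0.80, 0.65, 0.40, 0.90, 0.15])
print("AUROC:", roc_auc_score(labels, scores))   # 1.0 for this perfectly separable toy case
```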
IV Open Research Challenges and Future Directions
Despite showing promising results for different image coding and CV tasks, ViTs still face implementation challenges, including high computational costs, large training datasets, neural architecture search, interpretability of transformers, and efficient hardware designs. The purpose of this section is to explain these challenges and the future directions for ViTs.
IV-A High computational cost
ViT-based models contain millions of parameters, and training them requires computers with high computational power. These high-performance machines are expensive and increase the overall cost of using ViTs. In comparison to CNNs, ViTs perform better; however, their computational cost is much higher, in part because self-attention scales quadratically with the number of input tokens. Reducing the computational cost of ViTs is one of the biggest challenges researchers face.
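As a rough illustration of why attention dominates this cost, the snippet below counts tokens and attention entries for a ViT-B/16-style configuration. The arithmetic is illustrative only; real FLOP counts depend on the implementation.

```python
# Back-of-the-envelope attention cost for a ViT-B/16-style model.
image_size, patch_size, layers, heads = 224, 16, 12, 12
tokens = (image_size // patch_size) ** 2 + 1        # 196 patches + 1 class token = 197
attn_entries = layers * heads * tokens ** 2         # attention scores per forward pass
print(tokens, f"{attn_entries:,}")                   # 197 and 5,588,496 (~5.6M entries)
```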
IV-B Large training dataset
Training ViTs requires a large amount of data, and ViTs perform poorly with small training datasets. A ViT trained on the ImageNet1K dataset performs worse than ResNet, whereas a ViT trained on ImageNet21K performs better than ResNet.
IV-C Neural architecture search (NAS)
NAS has been explored a great deal for CNNs. In contrast, NAS has not yet been widely explored for ViTs. NAS exploration for ViTs gives young investigators a new direction for future work.
IV-D Interpretability of transformers
It is difficult to visualize the relative contribution of input tokens to the final predictions with ViTs since the attention that originates in each layer is intermixed in succeeding layers. The problem remains unresolved.
IV-E Hardware-efficient designs
Power and processing requirements can make large-scale ViT networks unsuitable for edge devices and resource-constrained contexts such as the internet of things (IoT).
V Conclusion
It is becoming more common to use ViTs instead of CNNs for image coding and CV. The use of ViTs for classification, detection, segmentation, compression, and image super-resolution has risen dramatically since the introduction of the classical ViT for image classification. This survey first reviewed the existing surveys on ViTs in the literature and then highlighted the applications of different variants of ViTs in CV, examining their use for image classification, object detection, image segmentation, image compression, image super-resolution, image denoising, and anomaly detection. We also presented the lessons learned in each category. Additionally, we discussed the open research challenges faced during the implementation of ViTs, such as high computational costs, large training datasets, interpretability of transformers, and hardware efficiency. By providing future directions, we offer young researchers a new perspective.
References
- [1] B. Heo, S. Yun, D. Han, S. Chun, J. Choe, and S. J. Oh, “Rethinking Spatial Dimensions of Vision Transformers,” In Proc. of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 11936–11945.
- [2] I. Tenney, D. Das and E. Pavlick, “BERT rediscovers the classical NLP pipeline,” arXiv preprint, 2019, arXiv:1905.05950.
- [3] L. Floridi and M. Chiriatti, “GPT-3: Its nature, scope, limits, and consequences,” Minds and Machines, vol. 30(4), pp. 681–694, Dec. 2020.
- [4] A. Krizhevsky, I. Sutskever and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” Advances in neural information processing systems, vol. 25, 2012.
- [5] Z. Wu, C. Shen and A. Van Den Hengel, “Wider or deeper: Revisiting the resnet model for visual recognition,” Pattern Recognition, vol. 90, pp. 119–133, Jun 2019.
- [6] I. Hammad and K. El-Sankary, “Impact of Approximate Multipliers on VGG Deep Learning Network,” in IEEE Access, vol. 6, pp. 60438–60444, Oct 2018.
- [7] X. Yao, X. Wang, Y. Karaca, J. Xie and S. Wang, “Glomerulus Classification via an Improved GoogLeNet,” in IEEE Access, vol. 8, pp. 176916-176923, Sep 2020.
- [8] F. Chollet, “Xception: Deep learning with depthwise separable convolutions,” In Proc. of the IEEE conference on computer vision and pattern recognition, pp. 1251–1258. 2017.
- [9] C. Wang, D. Chen, L. Hao, X. Liu, Y. Zeng, J. Chen and G. Zhang, “Pulmonary Image Classification Based on Inception-v3 Transfer Learning Model,” in IEEE Access, vol. 7, pp. 146533–146541, Oct 2019.
- [10] K. Zhang, Y. Guo, X. Wang, J. Yuan and Q. Ding, “Multiple Feature Reweight DenseNet for Image Classification,” in IEEE Access, vol. 7, pp. 9872–9880, Jan 2019.
- [11] J. Wang, L. Yang, Z. Huo, W. He and J. Luo, “Multi-Label Classification of Fundus Images With EfficientNet,” in IEEE Access, vol. 8, pp. 212499–212508, Nov 2020.
- [12] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
- [13] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani et. al. “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint, 2020, arXiv:2010.11929.
- [14] B. Wu, C. Xu, X. Dai, A. Wan, P. Zhang, Z. Yan, M. Tomizuka, J. Gonzalez, K. Keutzer and P. Vajda, “Visual transformers: Token-based image representation and processing for computer vision,” arXiv preprint, 2020, arXiv:2006.03677.
- [15] T. Xiao, M. Singh, E. Mintun, T. Darrell, P. Dollar and R. Girshick, “Early convolutions help transformers see better,” Advances in Neural Information Processing Systems, 2021, vol. 34, pp. 30392–30400.
- [16] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” In Proc. of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022, 2021.
- [17] G. Bertasius, H. Wang and L. Torresani, “Is space-time attention all you need for video understanding?,” arXiv preprint, 2021, arXiv:2102.05095.
- [18] M. Raghu, T. Unterthiner, S. Kornblith, C. Zhang and A. Dosovitskiy, “Do vision transformers see like convolutional neural networks?” Advances in Neural Information Processing Systems, 34, 2021.
- [19] M. M. Naseer, K. Ranasinghe, S. H. Khan, M. Hayat, F. S. Khan and M. H. Yang, “Intriguing properties of vision transformers,” Advances in Neural Information Processing Systems, 34, 2021.
- [20] Z. Dai, H. Liu, Q. V. Le and M. Tan, “Coatnet: Marrying convolution and attention for all data sizes,” Advances in Neural Information Processing Systems, vol. 34, pp. 3965–3977, 2021.
- [21] H. Wu, B. Xiao, N. Codella, M. Liu, X. Dai, L. Yuan and L. Zhang. “Cvt: Introducing convolutions to vision transformers,” In Proc. of the IEEE/CVF International Conference on Computer Vision, pp. 22–31, 2021.
- [22] D. Coccomini, N. Messina, C. Gennaro and F. Falchi, “Combining efficientnet and vision transformers for video deepfake detection,” arXiv preprint, 2021, arXiv:2107.02612.
- [23] Y. Tay, M. Dehghani, D. Bahri and D. Metzler, “Efficient transformers: A survey,” arXiv preprint, 2020, arXiv:2009.06732.
- [24] S. Khan, M. Naseer, M. Hayat, S. W. Zamir, F. S. Khan and M. Shah, Transformers in vision: A survey, ACM Computing Surveys (CSUR), Feb 2021.
- [25] T. Lin, Y. Wang, X. Liu and X. Qiu, “A survey of transformers,” arXiv preprint, 2021, arXiv:2106.04554.
- [26] Y. Liu, Y. Zhang, Y. Wang, F. Hou, J. Yuan, J. Tian, Y. Zhang, Z. Shi, J. Fan and Z. He, “A Survey of Visual Transformers,” arXiv preprint, 2021, arXiv:2111.06091.
- [27] Y. Xu, H. Wei, M. Lin, Y. Deng, K. Sheng, M. Zhang, F. Tang, W. Dong, F. Huang and C. Xu, “Transformers in computational visual media: A survey,” Computational Visual Media, vol. 8, no. 1, pp. 33–62, Mar 2022.
- [28] C. -F. R. Chen, Q. Fan and R. Panda, “CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification,” In Proc. of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 347–356.
- [29] P. Deng, K. Xu and H. Huang, “When CNNs Meet Vision Transformer: A Joint Framework for Remote Sensing Scene Classification,” in IEEE Geoscience and Remote Sensing Letters, vol. 19, pp. 1–5, 2022, Art no. 8020305.
- [30] K. Xu, P. Deng and H. Huang, “Vision Transformer: An Excellent Teacher for Guiding Small Networks in Remote Sensing Image Scene Classification,” in IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1-15, 2022, Art no. 5618715.
- [31] S. Yu, K. Ma, Q. Bi, C. Bian, M. Ning, N. He, Y. Li, H. Liu and Y. Zheng. “Mil-vt: Multiple instance learning enhanced vision transformer for fundus image classification,” In the Proc. of the International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 45–54. Springer, Cham, 2021.
- [32] Z. Xue, X. Tan, X. Yu, B. Liu, A. Yu and P. Zhang, “Deep Hierarchical Vision Transformer for Hyperspectral and LiDAR Data Classification,” in IEEE Transactions on Image Processing, vol. 31, pp. 3095–3110, 2022.
- [33] M. Kaselimi, A. Voulodimos, I. Daskalopoulos, N. Doulamis and A. Doulamis, “A Vision Transformer Model for Convolution-Free Multilabel Classification of Satellite Imagery in Deforestation Monitoring,” in IEEE Transactions on Neural Networks and Learning Systems, 2022.
- [34] Y. Chen, X. Gu, Z. Liu and J. Liang, “A Fast Inference Vision Transformer for Automatic Pavement Image Classification and Its Visual Interpretation Method,” Remote Sensing, vol. 14, no. 8, p. 1877, 2022.
- [35] S. Jamil, M.S. Abbas and A.M. Roy. “Distinguishing Malicious Drones Using Vision Transformer,” AI vol. 3, no. 2, pp. 260–273, Mar. 2022.
- [36] K. He, X. Zhang, S. Ren and J. Sun, “Deep residual learning for image recognition,” In Proc. of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
- [37] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov and L.-C. Chen, “Mobilenetv2: Inverted residuals and linear bottlenecks,” In Proc. of the IEEE conference on computer vision and pattern recognition, pp. 4510–4520, 2018.
- [38] X. Zhang, X. Zhou, M. Lin and J. Sun, “ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices,” In Proc. of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 6848–6856.
- [39] F.N. Iandola, S. Han, M.W. Moskewicz, K. Ashraf, W.J. Dally and K. Keutzer, “SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size,” arXiv preprint, 2016, arXiv:1602.07360.
- [40] M. Tan and Q. Le, “Efficientnet: Rethinking model scaling for convolutional neural networks,” In Proc. of International conference on machine learning, pp. 6105–6114, 2019.
- [41] L. Tanzi, A. Audisio, G. Cirrincione, A. Aprato and E. Vezzetti, “Vision Transformer for femur fracture classification.” Injury, Apr. 2022.
- [42] J. Chen, T. Luo, J. Wu, Z. Wang and H. Zhang, “A Vision Transformer network SeedViT for classification of maize seeds,” Journal of Food Process Engineering, 1:e13998, May 2022.
- [43] Z. Wang, Y. Yang and S. Yue, “Air Quality Classification and Measurement Based on Double Output Vision Transformer,” in IEEE Internet of Things Journal, Mar. 2022.
- [44] Z. Sha and J. Li, “MITformer: A Multi-Instance Vision Transformer for Remote Sensing Scene Classification,” in IEEE Geoscience and Remote Sensing Letters, May 2022.
- [45] J. Beal, E. Kim, E. Tzeng, D.H. Park, A. Zhai and D. Kislyuk, “Toward transformer-based object detection,” arXiv preprint, 2020. arXiv:2012.09958.
- [46] Y. Fang, B. Liao, X. Wang, J. Fang, J. Qi, R. Wu, J. Niu and W. Liu, “You only look at one sequence: Rethinking transformer in vision through object detection,” Advances in Neural Information Processing Systems, 34, 2021.
- [47] Y. Li, S. Xie, X. Chen, P. Dollar, K. He and R. Girshick, “Benchmarking detection transfer learning with vision transformers,” arXiv preprint, 2021 , arXiv:2111.11429.
- [48] H. Bao, L. Dong, S. Piao and F. Wei, “Beit: Bert pre-training of image transformers,” In ICLR, 2022.
- [49] K. He, X. Chen, S. Xie, Y. Li, P. Dollar and R. Girshick, “Masked autoencoders are scalable vision learners,” In CVPR, 2022.
- [50] J. Horváth, S. Baireddy, H. Hao, D. M. Montserrat and E. J. Delp, “Manipulation Detection in Satellite Images Using Vision Transformer,” In Proc. of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2021, pp. 1032–1041.
- [51] Y. Wang, T. Ye, L. Cao, W. Huang, F. Sun, F. He and D. Tao, “Bridged Transformer for Vision and Point Cloud 3D Object Detection,” In Proc. of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2022.
- [52] Z. Liu, Z. Zhang, Y. Cao, H. Hu and X. Tong, “Group-Free 3D Object Detection via Transformers,” In Proc. of 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 2929–2938.
- [53] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser and M. Nießner, “ScanNet: Richly-Annotated 3D Reconstructions of Indoor Scenes,” In Proc. of 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 2432–2443.
- [54] S. Song, S. P. Lichtenberg and J. Xiao, “SUN RGB-D: A RGB-D scene understanding benchmark suite,” In Proc. of 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 567–576.
- [55] J. Chen, Y. Lu, Q. Yu, X. Luo, E. Adeli, Y. Wang, L. Lu, A.L. Yuille and Y. Zhou, “Transunet: Transformers make strong encoders for medical image segmentation,” arXiv preprint, 2021, arXiv:2102.04306.
- [56] A. Sagar, “ViTBIS: Vision Transformer for Biomedical Image Segmentation,” Clinical Image-Based Procedures, Distributed and Collaborative Learning, Artificial Intelligence for Combating COVID-19 and Secure and Privacy-Preserving Machine Learning, Springer, Cham, 2021, pp. 34–45.
- [57] A. Hatamizadeh, Z. Xu, D. Yang, W. Li, H. Roth and D. Xu, “UNetFormer: A Unified Vision Transformer Model and Pre-Training Framework for 3D Medical Image Segmentation,” arXiv preprint, 2022, arXiv:2204.00631.
- [58] M. Antonelli, A. Reinke, S. Bakas, K. Farahani, B. A. Landman, G. Litjens, B. Menze, O. Ronneberger, R.M. Summers, B. van Ginneken, et al., “The medical segmentation decathlon,” arXiv preprint. 2021, arXiv:2106.05735.
- [59] U. Baid, et al. “The rsna-asnr-miccai brats 2021 benchmark on brain tumor segmentation and radiogenomic classification,” arXiv preprint, 2021, arXiv:2107.02314.
- [60] Z. Yang, J. Wang, Y. Tang, K. Chen, H. Zhao and P. H. Torr, “LAVT: Language-Aware Vision Transformer for Referring Image Segmentation,” arXiv preprint, 2021, arXiv:2112.02244.
- [61] L. Yu, P. Poirson, S. Yang, A. C. Berg and T. L. Berg, “Modeling context in referring expressions,” In European Conference on Computer Vision, Oct. 2016, pp. 69–85. Springer, Cham.
- [62] V. K. Nagaraja, V. I. Morariu and L. S. Davis, “Modeling context between objects for referring expression understanding,” In European Conference on Computer Vision, Oct. 2016, pp. 792–807. Springer, Cham.
- [63] J. Mao, J. Huang, A. Toshev, O. Camburu, A. Yuille and K. Murphy, “Generation and Comprehension of Unambiguous Object Descriptions,” In Proc. of 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 11–20.
- [64] D. P. Huttenlocher, G. A. Klanderman and W. J. Rucklidge, “Comparing images using the Hausdorff distance,” in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 15, no. 9, pp. 850–863, Sept. 1993.
- [65] Y. Qian, X. Sun, M. Lin, Z. Tan and R. Jin, “Entroformer: A Transformer-based Entropy Model for Learned Image Compression,” In Proc. of International Conference on Learning Representations, 2022.
- [66] J. Ballé, D. Minnen, S. Singh, S.J. Hwang and N. Johnston, “Variational Image Compression with a Scale Hyperprior,” In Proc. of International Conference on Learning Representations, 2018.
- [67] C. N. Christensen, M. Lu, E. N. Ward, P. Lio and C. F. Kaminski, “Spatio-temporal Vision Transformer for Super-resolution Microscopy,” arXiv preprint, 2022, arXiv:2203.00030.
- [68] D. Wang, Z. Wu and H. Yu, “TED-net: Convolution-free T2T Vision Transformer-based Encoder-decoder Dilation network for Low-dose CT Denoising,” In International Workshop on Machine Learning in Medical Imaging, 2021, pp. 416–425.
- [69] P. Mishra, R. Verk, D. Fornasier, C. Piciarelli and G. L. Foresti, “VT-ADL: A Vision Transformer Network for Image Anomaly Detection and Localization,” In Proc. of 2021 IEEE 30th International Symposium on Industrial Electronics (ISIE), 2021, pp. 01–06.
- [70] P. Bergmann, M. Fauser, D. Sattlegger, and C. Steger, “Mvtec ad–a comprehensive real-world dataset for unsupervised anomaly detection,” In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 9592–9600.
- [71] Y. Lee and P. Kang, “AnoViT: Unsupervised Anomaly Detection and Localization With Vision Transformer-Based Encoder-Decoder,” in IEEE Access, vol. 10, pp. 46717–46724, May 2022.
- [72] H. Yuan, Z. Cai, H. Zhou, Y. Wang and X. Chen, “TransAnomaly: Video Anomaly Detection Using Video Vision Transformer,” in IEEE Access, vol. 9, pp. 123977–123986, Aug. 2021.