
CTformer: Convolution-free Token2Token Dilated Vision Transformer for Low-dose CT Denoising

Dayang Wang1, Fenglei Fan2, Zhan Wu1, Rui Liu3, Fei Wang2, and Hengyong Yu1,*

1 Department of Electrical and Computer Engineering, University of Massachusetts, Lowell, MA, USA
2 Weill Cornell Medicine, Cornell University, New York City, NY, USA
3 3920 Mystic Valley Parkway, Medford, MA, USA

* Dr. Hengyong Yu is the corresponding author.
Abstract

Low-dose computed tomography (LDCT) denoising is an important problem in CT research. Compared to normal-dose CT (NDCT), LDCT images suffer from severe noise and artifacts. Recently, vision transformers have shown superior feature representation ability over convolutional neural networks (CNNs) in many studies. However, unlike CNNs, the potential of vision transformers for LDCT denoising has so far been little explored. To fill this gap, we propose a Convolution-free Token2Token Dilated Vision Transformer (CTformer) for low-dose CT denoising (this manuscript is an extension of our conference paper [1]). The CTformer uses a more powerful token rearrangement to encompass local contextual information and thus avoids convolution. It also dilates and shifts feature maps to capture longer-range interaction. We interpret the CTformer by statically inspecting the patterns of its internal attention maps and dynamically tracing the hierarchical attention flow with an explanatory graph. Furthermore, an overlapped inference mechanism is introduced to effectively eliminate the boundary artifacts that are common for encoder-decoder-based denoising models. Experimental results on the Mayo LDCT dataset suggest that the CTformer outperforms state-of-the-art denoising methods with a low computational overhead (codes are available at github.com/wdayang/CTformer).

Index Terms:
Low-dose CT, denoising, Token2Token transformer, dilation, interpretability.

I INTRODUCTION

The LDCT problem has attracted considerable attention in the community due to its potential for reducing X-ray radiation dose. However, compared to NDCT images, LDCT images suffer from severe noise and artifacts [2] in clinical applications. To overcome this problem, two types of algorithms have been investigated: traditional algorithms and convolutional neural networks (CNNs) [3, 4]. i) Traditional algorithms such as iterative methods suppress the artifacts and noise by using a physical model based on a certain prior. Unfortunately, these algorithms are difficult to adopt in commercial CT scanners because of hardware limitations and high computational cost [5]. ii) With the advent of deep learning, CNNs have become a prevailing approach for LDCT image denoising. Despite their superior learning ability aided by big data [6], CNNs are reported to be limited in capturing long-range contextual information in images [7, 8, 9, 10], which adversely affects the recovery of richer structural information in denoised images.

Recently, the transformer model [8] has shown excellent performance in computer vision [11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21]. Dosovitskiy et al. proposed the first vision transformer (ViT) by simply mapping an image into 16×16 patches (this operation is commonly referred to as tokenization), in analogy to words in a sentence in natural language processing [14]. Yuan et al. further proposed a Token2Token method to empower the transformer model with more diverse information encoding [10]. Next, Liu et al. designed the Swin Transformer, which includes patch fusion and cyclic shift to enlarge the perception of contextual information in tokens [9]. Moreover, Choromanski et al. proposed the Performer to reduce the computational complexity of self-attention by approximating the inherent softmax operator [21]. Currently, the transformer model is poised to replace CNNs as the mainstream deep learning model. On the one hand, compared to CNNs, the transformer model is good at capturing global information and long-range feature interactions, resulting in the utilization of richer information. As shown in Fig. 1, the transformer has diversified and effective features, while the CNN model has many inactive features. On the other hand, the transformer model enjoys higher visual interpretability by virtue of its inherent self-attention block [22, 23, 24], whereas a typical CNN model contains no generic explanation module [25].

Figure 1: Feature map visualization of the pretrained RED-CNN and the CTformer. The transformer model (CTformer) has diversified and effective features, while the CNN model (RED-CNN) has many inactive features.

Despite this success and great promise, the transformer has been little investigated for LDCT denoising. In our opinion, the transformer model is well suited to the LDCT denoising problem. Beyond its effectiveness, a transformer is more desirable for physicians because it is self-explanatory [26], e.g., allowing a physician to make sense of the model's logic. To the best of our knowledge, Zhang et al. were the first to apply the transformer to LDCT denoising [27]. Although this model achieves state-of-the-art performance, it has imperfections in three aspects: i) The model uses the vanilla transformer, which cannot fully explore the potential of the transformer, as relevant studies are rapidly advancing. ii) Intensive convolutions are included in the model, making it essentially a hybrid model; thus, the merits of using a transformer are insufficiently justified. iii) Their work neglects the interpretability that is essential for clinical applications [28].

We aim to fully explore the potential of transformers in LDCT denoising. Specifically, we propose a Convolution-free Token2Token Dilated Vision Transformer (CTformer) for low-dose CT denoising. The CTformer has the following characteristics: i) Although convolution is instrumental in capturing local features when it is combined with transformers on small datasets, it is not a necessity for performance, because token rearrangement can also help complement the local information. Therefore, we completely exclude convolution operations from the proposed CTformer. To the best of our knowledge, the CTformer is the first pure transformer for LDCT denoising. ii) Dilation and a cyclic shift are used in the Token2Token module to enlarge the receptive field, thereby gaining broader contextual information from the feature maps and reducing the computational cost. iii) We utilize an overlapped inference mechanism to address the boundary artifacts that are common in encoder-decoder denoising models. iv) We develop interpretability for the CTformer with visual attention maps and an explanatory graph that shed light on how the CTformer discriminates key structures from noise, as well as on the hierarchical attention flow across layers. Experimental results suggest that the CTformer delivers superior denoising performance over other state-of-the-art methods with fewer trainable parameters and multiply-accumulate operations (MACs).

In summary, our contributions are threefold: i) This work is among the first to apply the vision transformer to the LDCT denoising problem; moreover, the proposed CTformer is the first pure transformer for this task. ii) We introduce dilation and cyclic shift to enhance the tokenization process, utilize a new inference mechanism to fix the boundary artifacts, and develop interpretation methods to unveil the model's denoising patterns. iii) Our experimental results demonstrate the superior denoising performance and model efficiency of the CTformer for LDCT denoising.

II RELATED WORK

The previous studies for the LDCT denoising problem can be categorized into two classes.

Traditional algorithms. Typically, these methods incorporate a physical prior into an iterative reconstruction framework to suppress noise. For example, compressed sensing (CS) has been widely used for the LDCT problem by adopting a sparse representation [29]; e.g., total variation minimization assumes that the clean image is piecewise constant so that its gradients are sparse [30, 31, 32, 33]. Xu et al. used a dictionary to construct the sparse representation for LDCT denoising [34]. In addition to the sparsity prior, Ma et al. designed a non-local mean prior to utilize image voxels across the whole image rather than a local region [35]. However, a growing number of studies [36, 37, 38, 39] have shown that traditional algorithms are surpassed by deep learning models driven by big data.

Convolution models. CNNs have been used for LDCT image reconstruction. Wu et al. used a K-sparse autoencoder to learn image features in an unsupervised fashion and minimized the distance between a normal-dose image and an iterative reconstruction result in the feature space of the autoencoder [36]. Liu et al. proposed a 3D residual convolutional network to estimate an iterative reconstruction (IR) image from an LDCT analytic reconstruction image [40]. Their method saves time because it avoids the time-consuming iterative reconstruction. He et al. proposed the 3pADMM method to address the problems of hyper-parameter optimization and prior knowledge selection in LDCT reconstruction [41].

Besides, a majority of deep LDCT denoising models focus on image post-processing. Chen et al. pioneered this direction by employing convolution, deconvolution, and shortcut connections to build a residual encoder-decoder convolutional neural network (RED-CNN) [37]. Yang et al. used a generative adversarial network with Wasserstein distance (WGAN) aided by a perceptual loss to improve the quality of denoised images [38]. Owing to the excellent performance of WGAN in generating faithful real-world CT images and the role of the perceptual loss in structural fidelity, this model alleviated the over-smoothness of the denoised images. Li et al. employed a GAN armed with the structural similarity loss, the perceptual loss, the adversarial loss, and the sharpness loss to preserve structural details and sharp boundaries [42]. Fan et al. constructed a quadratic neuron-based autoencoder for LDCT image denoising with more robustness and efficiency than conventional CNN-based methods [39]; it was the first autoencoder built on this new type of neuron. Huang et al. proposed a two-stage residual CNN [43], where the first stage uses a stationary wavelet transform for texture denoising, and the second enhances the image structure by combining the average of NDCT images and the denoised image from the first stage.

However, CNN-based models typically lack the ability to capture global contextual information due to their limited receptive fields, and are thus less effective at modeling the structural similarity across the whole image [27, 1, 44].

III METHODS

Figure 2: The CTformer consists of the residual encoder-decoder structure with tokenization/detokenization blocks, four CTformer modules with different sizes of feature maps, and an intermediate transformer block. Tokenization block unfolds image patches into sequential tokens, while detokenization block converts tokens back to the image. Each encoder CTformer module includes a transformer block (TB) and a Token2Token dilation block (T2TD), while each decoder CTformer module consists of an inverse Token2Token dilation block (IT2TD) and a TB, symmetrically.

In the supervised setting, with a deep learning model, the LDCT denoising task is to learn a mapping from a noisy LDCT image $\boldsymbol{x}$ to its paired clean NDCT image $\boldsymbol{y}$. Mathematically, a neural network can be trained by optimizing a mean square error (MSE) loss function as follows:

\min_{\boldsymbol{W}}\ \mathcal{J}(\boldsymbol{W};\boldsymbol{x})=\left\|f(\boldsymbol{W};\boldsymbol{x})-\boldsymbol{y}\right\|_{2},   (1)

where $f(\boldsymbol{W};\boldsymbol{x})$ is a neural network, and $\boldsymbol{W}$ is a collection of parameters for simplicity.
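For concreteness, the following is a minimal PyTorch sketch of one optimization step for the objective in Eq. (1); the function and variable names (training_step, ldct, ndct) are illustrative rather than taken from our released code, and model stands for any image-to-image network $f(\boldsymbol{W};\boldsymbol{x})$ such as the CTformer described below.

import torch
import torch.nn as nn

def training_step(model, ldct, ndct, optimizer):
    """One optimization step on a batch of paired (LDCT, NDCT) patches."""
    optimizer.zero_grad()
    pred = model(ldct)                          # f(W; x)
    loss = nn.functional.mse_loss(pred, ndct)   # MSE between prediction and NDCT target
    loss.backward()
    optimizer.step()
    return loss.item()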

III-A Architecture of the CTformer

As shown in Fig. 2, the proposed CTformer takes a residual encoder-decoder structure with tokenization/detokenization blocks, four CTformer modules, and an intermediate transformer block. In the encoder, CTformer modules A and B include a transformer block (TB) and a Token2Token dilation block (T2TD). In the decoder, CTformer modules C and D symmetrically encompass an inverse Token2Token dilation block (IT2TD) and a TB. The IT2TD block takes the inverse design of the corresponding T2TD block. Now let us introduce the CTformer from its macro to micro structures.

Figure 3: The micro structures of the CTformer: tokenization/detokenization, the transformer block, and the Token2Token dilation block.

Residual encoder-decoder structure. We use a residual encoder-decoder structure as the backbone of the CTformer. The shortcuts only bridge layers at the same level in the encoder and decoder. Denoising in the encoder is inevitably accompanied by information loss, which hurts structural recovery in the decoder; the shortcuts supplement the decoder with feature maps from the encoder to retain structural details. Besides, shortcuts mitigate the gradient vanishing problem so that a deep model can still be stably trained [45].

Tokenization block. As shown in Fig. 3, in the tokenization process, a noisy CT image is unfolded into a sequence of two-dimensional (2D) patches (also referred to as tokens) $\mathbf{T_{0}}\in\mathbb{R}^{b\times n\times d_{0}}$, where $b$ is the batch size, $n$ is the number of tokens, and $d_{0}$ is the token dimension. Throughout this manuscript, we use tokens and patches interchangeably.
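As a minimal sketch of this step, the unfolding below converts a batch of 64×64 single-channel patches into a token sequence of shape (b, n, d0); the kernel size of 7 and stride of 2 follow the settings in Section IV, while the concrete use of torch.nn.Unfold is an illustrative assumption.

import torch
import torch.nn as nn

b, c, h, w = 16, 1, 64, 64                   # a batch of 64x64 single-channel CT patches
x = torch.randn(b, c, h, w)

unfold = nn.Unfold(kernel_size=7, stride=2)  # overlapped splitting into 7x7 patches
T0 = unfold(x).transpose(1, 2)               # (b, n, d0) = (16, 841, 49)
print(T0.shape)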

Transformer block. As shown in Fig. 3, a typical transformer block contains multi-head attention (MHA), layer normalization (LN), an MLP, and residual connections to enhance the expressive power. Specifically, in the self-attention, a token sequence $\mathbf{T_{0}}\in\mathbb{R}^{b\times n\times d_{0}}$ is linearly mapped into three tensors referred to as query, key, and value, denoted as $\mathbf{Q},\mathbf{K},\mathbf{V}\in\mathbb{R}^{b\times n\times d_{m}}$ for short, where $d_{m}$ is the token embedding dimension. Mathematically, we have

\begin{cases}\mathbf{Q}=\mathbf{T_{0}}\mathbf{W_{q}}\\ \mathbf{K}=\mathbf{T_{0}}\mathbf{W_{k}}\\ \mathbf{V}=\mathbf{T_{0}}\mathbf{W_{v}},\end{cases}   (2)

where $\mathbf{W_{q}}$, $\mathbf{W_{k}}$, and $\mathbf{W_{v}}$ are linear operators. Then, the output of the self-attention is calculated as

\mathrm{MHA}(\mathbf{Q},\mathbf{K},\mathbf{V})=\mathrm{softmax}\Big(\frac{\mathbf{Q}\mathbf{K}^{\top}}{\sqrt{d_{k}}}\Big)\mathbf{V},   (3)

where $\frac{1}{\sqrt{d_{k}}}$ is a scaling factor and $d_{k}$ is the key dimension. Besides the exact calculation of Eq. (3), the $\mathrm{softmax}$ operator can be approximated by a kernel method, thereby reducing the complexity of Eq. (3). The transformer using this approximation is also called the Performer [21].

$\mathbf{Att}=\mathrm{softmax}(\mathbf{Q}\mathbf{K}^{\top}/\sqrt{d_{k}})$ is the attention map that will be used in the post-hoc interpretability analysis. Through the transformer block, the output token $\mathbf{T}_{a}\in\mathbb{R}^{b\times n\times d_{a}}$ is

\begin{cases}\mathbf{T}_{a}^{\prime}=\mathrm{MHA}(\mathrm{LN}(\mathrm{MLP}(\mathbf{T}_{0})))+\mathbf{T}_{0}\\ \mathbf{T}_{a}=\mathrm{MLP}(\mathrm{LN}(\mathbf{T}_{a}^{\prime}))+\mathbf{T}_{a}^{\prime}.\end{cases}   (4)
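For clarity, a minimal single-head sketch of Eqs. (2)-(3) is given below; the multi-head arrangement and the Performer approximation used in the CTformer modules are omitted, so this is only an illustrative reference implementation, not our released code.

import torch
import torch.nn as nn

class SingleHeadSelfAttention(nn.Module):
    """Single-head version of Eqs. (2)-(3), without the Performer approximation."""
    def __init__(self, d0, dm):
        super().__init__()
        self.Wq = nn.Linear(d0, dm, bias=False)   # W_q
        self.Wk = nn.Linear(d0, dm, bias=False)   # W_k
        self.Wv = nn.Linear(d0, dm, bias=False)   # W_v
        self.scale = dm ** -0.5                   # 1 / sqrt(d_k)

    def forward(self, T0):                        # T0: (b, n, d0)
        Q, K, V = self.Wq(T0), self.Wk(T0), self.Wv(T0)
        att = torch.softmax(Q @ K.transpose(-2, -1) * self.scale, dim=-1)  # attention map Att
        return att @ V, att                       # output tokens and Att for interpretability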

Token2Token dilation block. The simple tokenization in the vanilla transformer involves only one tokenization step that uses either reshaping or convolution with a fixed stride to convert an image into tokens. It therefore tends to ignore the dependence across neighboring tokens. What's worse, it also makes the attention representations redundant, which limits the feature richness of each layer [10]. To overcome these problems, as shown in Fig. 3, we adopt the recently-proposed T2T block, which uses cascade tokenization in place of the simple tokenization [10]. The T2T block consists of reshaping and unfolding, which not only models the local information from the surrounding image pixels but also yields richer feature representations than convolution. Furthermore, we use cyclic shift and dilation in the T2T block (T2TD) to refine the contextual information fusion and leverage spatial relations across a larger region. Now, let us elaborate on these operations in detail.

Step 1: reshaping. The sequence of tokens $\mathbf{T_{a}}\in\mathbb{R}^{b\times n\times d_{a}}$ produced by the transformer block is first transposed to $\mathbf{T_{a}}^{\top}\in\mathbb{R}^{b\times d_{a}\times n}$ and then reshaped into $\mathbf{F}\in\mathbb{R}^{b\times d_{a}\times h\times w}$:

\mathbf{F}=\mathrm{reshape}(\mathbf{T_{a}}^{\top}),   (5)

where $h=w=\sqrt{n}$ are the height and width of the feature map, respectively.

Step 2: cyclic shift. We employ a cyclic shift to modify the 4D feature maps in each T2TD block. Specifically, the pixel values in the feature maps are shifted in a cyclic way to exploit the information more fully. An inverse cyclic shift is then performed in the symmetric IT2TD block in the decoder to avoid any pixel mismatch in the final denoising results. Through the cyclic shift, the tokens fed into the subsequent transformer blocks are extracted from shifted feature maps rather than fixed patches. Furthermore, tokens at the boundaries of the modified feature maps now include pixels that are not at the boundaries of the original feature maps. In practice, the CTformer shifts the feature map by two pixels to extract new tokens. Fig. 3 illustrates the cyclic shift module,

\mathbf{F}_{c}=\mathrm{cyclicshift}(\mathbf{F}).   (6)
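A minimal sketch of the cyclic shift and its inverse is given below; torch.roll over the two spatial dimensions with a shift of two pixels matches the setting used in the CTformer, and the negative shift recovers the inverse operation applied in the IT2TD block.

import torch

def cyclic_shift(F, shift=2):
    """Cyclically shift feature maps F of shape (b, d, h, w) along h and w (Eq. (6))."""
    return torch.roll(F, shifts=(shift, shift), dims=(2, 3))

def inverse_cyclic_shift(F, shift=2):
    """Inverse shift used in the symmetric IT2TD block to avoid pixel mismatch."""
    return torch.roll(F, shifts=(-shift, -shift), dims=(2, 3))

F = torch.randn(16, 64, 29, 29)
assert torch.equal(inverse_cyclic_shift(cyclic_shift(F)), F)   # the round trip is lossless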

Step 3: dilated unfolding. Dilated unfolding retokenizes the feature maps from the last step with an unfolding operation. To alleviate the information loss in this step, we adopt an overlapped splitting of patches. As a result, the aggregated tokens respect the correlations among neighboring tokens:

\mathbf{T}_{s}=\mathrm{dilatedunfold}(\mathbf{F_{c}}).   (7)

In this stage, the 4D feature maps $\mathbf{F}\in\mathbb{R}^{b\times d\times h\times w}$ are converted back to 3D tokens $\mathbf{T}_{s}\in\mathbb{R}^{b\times n_{s}\times d_{s}}$, where $n_{s}$ and $d_{s}$ represent the new token number and token dimension, respectively. By aggregating surrounding patches and pixels, the local information is favorably preserved, and the number of tokens is changed. Specifically, the token number decreases in the encoder and increases in the decoder.

Instead of the normal unfolding, we endow the unfolding with a dilation to capture longer-range contextual information at a lower computational cost. Mathematically, the receptive field $P$ of the dilation can be calculated as follows:

P=\prod_{i=0}^{1}\left(2^{K_{i}+D_{i}}-1\right),   (8)

where $K_{i}$ and $D_{i}$ denote the kernel size and the dilation rate along a given dimension, respectively. After the dilated unfolding, the input feature map $\mathbf{F}\in\mathbb{R}^{b\times d\times h\times w}$ becomes $\mathbf{T}_{sd}\in\mathbb{R}^{b\times n_{sd}\times d_{sd}}$, where $d_{sd}=d\times\prod_{i}{K_{i}}$ and the total number of tokens $n_{sd}$ after the dilated unfolding operation is calculated as:

n_{sd}=\prod_{i=0}^{1}\left\lfloor\frac{\mathrm{spatial}(i)-\mathrm{dilation}\times(K_{i}-1)-1}{\mathrm{stride}}+1\right\rfloor,   (9)

where $\lfloor\cdot\rfloor$ is the floor function, $\mathrm{spatial}(i)$ denotes the size along the $i$-th dimension with $\mathrm{spatial}(0)=h$ in the height dimension and $\mathrm{spatial}(1)=w$ in the width dimension, and $\mathrm{dilation}$, $\mathrm{kernel}$, and $\mathrm{stride}$ are the corresponding parameters of the unfolding operation. Then, an MLP maps the embedding dimension to a desired size. For a better understanding of our model, a flowchart of the above-discussed tensors is given in Fig. 4.
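As a minimal sketch of Eqs. (7) and (9), the snippet below applies a dilated unfolding with kernel size 3, dilation 2, and stride 1, matching the first encoder T2TD block in Section IV; torch.nn.Unfold is used here only as an illustrative stand-in for the dilated unfolding operation.

import torch
import torch.nn as nn

b, d, h, w = 16, 64, 29, 29
F = torch.randn(b, d, h, w)                          # 4D feature map after reshaping

unfold = nn.Unfold(kernel_size=3, dilation=2, stride=1)
T_sd = unfold(F).transpose(1, 2)                     # (b, n_sd, d_sd), with d_sd = d * 3 * 3

# Token count predicted by Eq. (9): floor((29 - 2*(3-1) - 1)/1 + 1)^2 = 625
n_sd = ((h - 2 * (3 - 1) - 1) // 1 + 1) ** 2
assert T_sd.shape == (b, n_sd, d * 3 * 3)            # (16, 625, 576)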

Figure 4: A flowchart to illustrate the computation of tensors in the CTformer architecture for readers’ reproducibility.

III-B Inference of the CTformer

In the inference phase, unlike a CNN, which can directly process the whole image, the transformer model can only perform inference patch by patch. Because there exists information loss in the bottleneck of an encoder-decoder architecture [46], the denoised results of these patches are inconsistent at their boundaries, causing boundary artifacts in the stitched image. As shown in Fig. 5, the mosaic edges indicated by the red arrows are clearly visible, with artifacts along all four directions. To address this problem, we propose an overlapped inference method. The core of our method is to discard the margin and only keep the center of the model output to stitch the final prediction.

Figure 5: (a) The residual map between the prediction and the NDCT image reveals the boundary artifacts. (b) The profiles of the residual map along the horizontal and vertical axes.

Suppose that the patch size is $p\times p$; we only keep the central $(p-2\eta)\times(p-2\eta)$ part of each patch to form the final prediction image, where $\eta$ is selected to be greater than the width of the artifacts. The overlapped inference demands slightly more computation because we discard the peripheral part of each patch. The increased cost is at the ratio of

\sigma=\Big(\frac{\lceil n/(p-2\eta)\rceil}{\lceil n/p\rceil}\Big)^{2},   (10)

where $n$ is the original image size, and $\lceil\cdot\rceil$ is the ceiling function. Therefore, we need to balance the computational cost with the artifact elimination effect.
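A minimal sketch of the overlapped inference is given below, assuming a square image whose side is a multiple of the effective step p - 2η and reflect padding at the image border; only the central (p - 2η)×(p - 2η) region of each output patch is stitched into the final prediction, and the last line evaluates the extra-cost ratio of Eq. (10) for n = 512, p = 64, and η = 16.

import math
import torch

def overlapped_inference(model, image, p=64, eta=16):
    """Denoise a (1, 1, n, n) image patch by patch, keeping only patch centers."""
    n = image.shape[-1]
    step = p - 2 * eta
    assert n % step == 0, "sketch assumes the image side is a multiple of p - 2*eta"
    padded = torch.nn.functional.pad(image, (eta, eta, eta, eta), mode="reflect")
    out = torch.zeros_like(image)
    with torch.no_grad():
        for i in range(0, n, step):
            for j in range(0, n, step):
                pred = model(padded[..., i:i + p, j:j + p])
                out[..., i:i + step, j:j + step] = pred[..., eta:eta + step, eta:eta + step]
    return out

sigma = (math.ceil(512 / (64 - 2 * 16)) / math.ceil(512 / 64)) ** 2   # = 4.0, cf. Eq. (10)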

III-C Interpretability of the CTformer

In interpretability research, the saliency map is the most popular tool. One can generate a saliency map for a CNN-based classification model after the model is trained [47]. However, for the image-to-image denoising task, deriving saliency maps is not directly applicable because denoising models are essentially regression models. In contrast, even when transformer models are used for denoising, one can leverage the inherent attention modules to obtain saliency maps. Utilizing this advantage, we develop the interpretability of the CTformer by probing the patterns of its attention maps. Thus, one can decode the inner workings of the CTformer, with an emphasis on the processing of important structural and semantic information. This self-interpretability makes the CTformer unique relative to other LDCT denoising models.

Furthermore, we observe that the attention only reflects where the model attends in a static manner, which cannot convey how the attended parts flow across layers in the CTformer. To complement this dynamic information, inspired by [48], we propose to construct an explanatory graph to describe the hierarchical flow of the attention. We take the attended parts as graph nodes and the attention flow as graph edges. Two nodes linked by an edge are usually co-activated and undergo a similar mapping (denoising). Specifically, we first recognize the attended object parts by identifying the peak activations. Then, we build the graph connections between neighboring layers by forwarding a masked feature map and monitoring the high activations.

Node: To identify the object parts, we provide two pixel-based methods: TopK and local maximum (LM) selection. TopK extracts the $K$ highest activations across the attention maps, while LM detects the local maximum activations.

Edge: To construct edges among nodes, we propose to forward a masked feature map. Specifically, given a node (an object part) in a layer, we mask the feature maps and only keep the region around the node. Then, we feed the masked feature maps to obtain the attention map of the next layer. Finally, we extract the highest activation (node) from the obtained attention map and link it to the given node.

By performing the above steps recursively on consecutive layers, the whole explanatory graph is built, informing us how the attention of the CTformer shifts.
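The following rough sketch illustrates the two steps; get_attention and next_layer are hypothetical hooks (not part of our released code) that expose a layer's 2D attention map and forward a masked feature map through the following layer, so only the overall logic is shown.

import torch

def topk_nodes(att, k=60):
    """TopK node selection: (row, col) coordinates of the K highest activations
    in a 2D attention map att."""
    h, w = att.shape
    idx = torch.topk(att.flatten(), k).indices
    return [(int(i) // w, int(i) % w) for i in idx]

def link_edge(next_layer, get_attention, feat, node, radius=2):
    """Edge construction: keep only the region around node in the feature map feat,
    forward the masked map through the next layer, and link node to the peak of
    the resulting attention map."""
    mask = torch.zeros_like(feat)
    r, c = node
    mask[..., max(r - radius, 0):r + radius + 1, max(c - radius, 0):c + radius + 1] = 1.0
    att_next = get_attention(next_layer(feat * mask))   # 2D attention of the next layer
    j = int(torch.argmax(att_next))
    w = att_next.shape[-1]
    return node, (j // w, j % w)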

IV EXPERIMENTS

In this part, our model is trained and evaluated on a publicly available dataset. First, we demonstrate the superior denoising performance and the model efficiency of the CTformer over its counterparts. Then, we confirm the effectiveness of the overlapped inference mechanism. Finally, we elaborate on the model interpretability of the CTformer with the aforementioned interpretation methods.

Dataset. A publicly released dataset from the 2016 NIH-AAPM-Mayo Clinic LDCT Grand Challenge (https://www.aapm.org/GrandChallenge/LowDoseCT/) [49] is used for model training and testing. The dataset includes 2,378 low-dose (quarter) and normal-dose (full) CT images with a 3.0 mm slice thickness from ten anonymous patients. We select the data of patient L506 for evaluation, while the remaining nine patients are used for model training. Data augmentation is also applied: we generate more training images by randomly rotating (90, 180, or 270 degrees) and flipping (up/down, left/right) the original images.

Experiment settings. We list the detailed experimental settings in the following:

  • The experiments run on Ubuntu 18.04.5 LTS with an Intel(R) Core(TM) i9-9920X CPU @ 3.50 GHz, using PyTorch 1.5.0 [50] and CUDA 10.2.0. The model is trained on four NVIDIA GTX 2080Ti 11G GPUs.

  • The intermediate transformer block takes the vanilla design, while the transformer blocks in the CTformer modules use the Performer to facilitate model training. The embedding dimension of all transformer blocks is 64.

  • In tokenization/detokenization, the kernel size for the unfolding/folding is set to 7 with a stride of 2 to reduce the computational cost. For the four CTformer modules, the cyclic shift strides in the T2TD/IT2TD blocks are {2, 2, -2, -2}. The kernel sizes of the unfolding/folding operations are 3 with dilations of {2, 1, 1, 2}, respectively. The strides are set to 1 to avoid information loss. Thus, according to Eq. (9), the corresponding token numbers $n_{1}$, $n_{2}$, and $n_{3}$ for CTformer module A, CTformer module B, and the intermediate transformer block are computed as follows (a small numerical check of these values is sketched after this list):

    \begin{cases}n_{1}=\Big(\big\lfloor\frac{64-1\times(7-1)-1}{2}+1\big\rfloor\Big)^{2}=841\\ n_{2}=\Big(\big\lfloor\frac{\sqrt{841}-2\times(3-1)-1}{1}+1\big\rfloor\Big)^{2}=625\\ n_{3}=\Big(\big\lfloor\frac{\sqrt{625}-1\times(3-1)-1}{1}+1\big\rfloor\Big)^{2}=529,\end{cases}   (11)

    where the spatial sizes $\sqrt{841}=29$ and $\sqrt{625}=25$ follow from the reshaping process in Eq. (5). The transformer token numbers in the decoder are symmetrically arranged as {625, 841}.

  • We randomly extract 4 patches from all available slices for training over 4,000 epochs with a batch size of 16. Within a training batch, fewer patches from more images lead to less fluctuation and bias than more patches from fewer images, because many patches from a single image usually cannot represent the overall data distribution.

  • Adam is adopted to minimize the MSE loss with an initial learning rate of $1.0\times 10^{-5}$, which gradually decreases to $1.0\times 10^{-6}$ with a scheduled decay rate.

  • A margin size of 16 is used for overlapped inference.
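As referenced above, a quick numerical check of the token numbers in Eq. (11) can be sketched as follows, using the kernel, stride, and dilation settings listed in this section; the helper n_tokens simply evaluates Eq. (9) for a square feature map.

import math

def n_tokens(spatial, kernel, dilation=1, stride=1):
    """Token count of Eq. (9) for a square feature map of side `spatial`."""
    return math.floor((spatial - dilation * (kernel - 1) - 1) / stride + 1) ** 2

n1 = n_tokens(64, kernel=7, stride=2)                     # tokenization (input to module A): 841
n2 = n_tokens(int(math.sqrt(n1)), kernel=3, dilation=2)   # after module A's T2TD: 625
n3 = n_tokens(int(math.sqrt(n2)), kernel=3, dilation=1)   # after module B's T2TD: 529
print(n1, n2, n3)                                         # 841 625 529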

Denoising performance. The performance of the CTformer is compared to other state-of-the-art methods, i.e., RED-CNN [37], WGAN-VGG [38], MAP-NN [51], and AD-NET [52]. The selected models are all popular low-dose CT or natural image denoising models published in flagship journals. We retrain all the models based on their officially disclosed codes.

Fig. 6 shows the results of different networks on L506 with Lesion No. 575, and Fig. 7 shows the ROIs from the rectangular area marked in Fig. 6. It can be seen that all methods alleviate noise and artifacts to some extent, but the CTformer generates the clearest and most perceptually pleasing denoised images. Specifically, per the ROIs in Fig. 7, we find that WGAN-VGG and MAP-NN seem to introduce additional shadows and tissues. While RED-CNN and AD-NET produce smoother and clearer images than WGAN-VGG and MAP-NN, there is still blotchy noise around the lesion. In contrast, the CTformer satisfactorily suppresses the noise and artifacts, maintains a high level of spatial smoothness, and keeps the structural details in the restored image. Therefore, we conclude that the CTformer is the best denoiser among its competitors.

Figure 6: The denoised results of different networks on L506 with Lesion No. 575. (a) LDCT, (b) RED-CNN, (c) WGAN-VGG, (d) MAP-NN, (e) AD-NET, (f) the proposed CTformer, and (g) NDCT. The display window is [-160, 240] HU.
Figure 7: The ROIs of the rectangle marked in Fig. 6. (a) LDCT, (b) RED-CNN, (c) WGAN-VGG, (d) MAP-NN, (e) AD-NET, (f) the proposed CTformer, and (g) NDCT.
TABLE I: Quantitative evaluation results of different methods on L506 using SSIM and RMSE. The bold-faced numbers are the best results.
Method #param. MACs SSIM\uparrow RMSE\downarrow
LDCT - - 0.8759 14.2416
RED-CNN 1.85M 5.05G 0.9077 10.1044
WGAN-VGG 34.07M 3.61G 0.9008 11.6370
MAP-NN 3.49M 13.79G 0.9084 9.2959
AD-NET 2.07M 9.49G 0.9105 9.0997
CTformer 1.45M 0.86G 0.9121 9.0233

Additionally, two metrics, the structural similarity (SSIM) and the root mean square error (RMSE), are adopted to quantitatively assess the quality of the denoised images. For fairness, we evaluate the model complexity with the number of trainable parameters (#param.) and MACs. Table I shows the average SSIM and RMSE on all slices of L506. Among the competing methods, only AD-NET achieves an SSIM score over 0.91, and only MAP-NN and AD-NET have RMSE scores below 10. In contrast, our CTformer has the highest SSIM of 0.9121 and the smallest RMSE of 9.0233. Concerning model complexity, MAP-NN has the highest MACs of 13.79G because it uses many repeated modules, while WGAN-VGG has the largest number of trainable parameters (34.07M) because it uses VGG as a feature extractor. In contrast, the CTformer has the smallest number of parameters and the lowest MACs. Compared to its competitors, our model achieves the best performance with the lowest computational cost.

Model efficiency. Model efficiency is an important issue in deep learning. To further verify the efficiency of the CTformer, we compare it with RED-CNN, MAP-NN, and AD-NET by checking the SSIM and RMSE scores of different model sizes. For the CTformer, we change the model size by varying the embedding size of the intermediate transformer block, with embedding sizes set to {64, 256, 512, 1024}, respectively. For the other models, we vary their sizes by using different numbers of filters in each layer. The filter numbers in RED-CNN, MAP-NN, and AD-NET are {64, 96, 128, 256}, {64, 128, 256, 400}, and {64, 96, 128, 256}, respectively.

Fig. 8 shows the SSIM and RMSE scores of different models with respect to the number of parameters and MACs. The highlight of Fig. 8 is that the SSIM curves of the CTformer lie on the top left of the other curves, while its RMSE curves lie on the bottom left. When the numbers of parameters and MACs are comparable, the CTformer always delivers the best scores compared to RED-CNN, MAP-NN, and AD-NET. We conclude that the CTformer has superior model efficiency over its competitors.

Figure 8: The SSIM and RMSE curves of the CTformer and its competitors with respect to the number of parameters and MACs.

Eliminating boundary artifacts. The overlapped inference is performed to eliminate boundary artifacts, as shown in Fig. 9(a). From the ROIs in Fig. 9(b), we can see that the boundary artifacts are obvious when $\eta$ is 0 or 1 but soon become hardly perceivable as $\eta$ further increases. It is worth noting that as $\eta$ varies, the boundary artifacts can appear in different regions because the size of the patches integrated into the final image is different. To further confirm the effectiveness of the overlapped inference, a quantitative analysis on patient L506 is also conducted. As seen from Table II, the SSIM and RMSE scores improve rapidly as $\eta$ goes from 0 to 20, with a good trade-off at 16. The corresponding ratio of the extra computation over the original computation is calculated from Eq. (10) as $\sigma=\big(\frac{\lceil 512/(64-2\times 16)\rceil}{\lceil 512/64\rceil}\big)^{2}=4$. To sum up, the overlapped inference can effectively eliminate the dense boundary artifacts.

Figure 9: (a) The denoised results of non-overlapped inference and overlapped inference. (b) The denoising results of different margin sizes on the ROIs indicated in (a).
TABLE II: The SSIM and RMSE scores improve with the margin size $\eta$.
Margin 0 4 8 12 16 20
SSIM\uparrow 0.9071 0.9098 0.9113 0.9116 0.9121 0.9120
RMSE\downarrow 9.5671 9.1940 9.0890 9.0503 9.0233 9.0223
Figure 10: The attention maps over different input slices on specific positions.
Figure 11: TopK method for extraction of high activations and the corresponding attention graph.
Figure 12: Local maximum method for extraction of peak activations and the corresponding attention graph.

Visual interpretation. To reveal the latent learning behavior of the CTformer, we visualize the attention maps $\mathbf{Att}=\mathrm{softmax}(\mathbf{QK}^{\top}/\sqrt{d_{k}})$ in each layer. Specifically, we derive each attention map by averaging over all grids of $\mathbf{Att}$ and resizing it to the size of the original image. Then, the attention map is superimposed on the image with a transparency rate of 0.4.
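A minimal sketch of this overlay is given below; att is a single-layer attention map Att of shape (n, n) and img a 2D CT slice, while averaging over the query dimension and the bilinear resize are assumptions about the exact averaging and resizing procedure.

import torch
import torch.nn.functional as nnf
import matplotlib.pyplot as plt

def overlay_attention(img, att, alpha=0.4):
    """Average the attention map, resize it to the image size, and superimpose it."""
    att_mean = att.mean(dim=0)                         # average over all grids (query rows)
    side = int(att_mean.numel() ** 0.5)
    att_map = att_mean.reshape(1, 1, side, side)
    att_up = nnf.interpolate(att_map, size=tuple(img.shape), mode="bilinear",
                             align_corners=False)
    plt.imshow(img, cmap="gray")
    plt.imshow(att_up[0, 0], cmap="jet", alpha=alpha)  # transparency rate of 0.4
    plt.axis("off")
    plt.show()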

As shown in Fig. 10, the attention map in the first layer highlights the key object parts. Specifically, there is more attention on the edges than on the interior of key structures such as bones. Moreover, there are scattered dot-like attentions on the protruding textures of the original image. The attention map in the second layer basically resembles the pattern of the first one, but is sparser and less focused on the structures. Next, the pattern in the third layer becomes semantically implicit. Finally, the attention in the fourth layer tends to ignore the edges of objects and emphasizes the regions where noise is concentrated.

Since attentions in different layers focus on different structures, we construct an explanatory graph to illustrate the flow of attention across various layers. In our experiments, the object nodes are represented by the pixel coordinates of the image. We select the top 60 activations in the attention maps as nodes using TopK/LM selection and identify the highest activation under each node’s influence. By applying the proposed method, the whole TopK and LM graph are obtained in Figs. 11 and 12, respectively.

From the TopK graph in Fig. 11, it can be seen that the attention flows across different testing slices have very similar patterns. First, from the first to the second layer, the attentions on the edges still favor other edges in the next layer, as indicated by the white circle in Fig. 11. Second, all high activations from the second layer move to the top area of the third layer in a latent manner. Last, all the top attentions in the third layer spread across the noisy area in the fourth layer. While the TopK graph identifies the flow of the top activations, the LM graph illustrates that of the local protuberant objects. As shown in Fig. 12, the attention graphs of different slices using LM are also analogous. Compared to the TopK graph, one principal distinction of the LM graph is that groups of local maximum activations tend to implicitly concentrate on the same point in the next layer. The white circles in Fig. 12 illustrate some of these concurrent points. Therefore, by inspecting the two attention graphs, the dynamic flow can be clearly followed, and we can figure out how the object parts are co-activated and thus go through similar levels of noise reduction.

In summary, the latent learning behavior of the CTformer can be visually interpreted statically and dynamically. This makes the proposed model more transparent and reliable for diagnostic decisions.

V ABLATION STUDY

In this part, comparative experiments are conducted to study the impact of the T2TD block, the cyclic shift operation, and the number of intermediate transformer blocks.

Figure 13: The performance of the CTformer on case L506 with Lesion No. 576. (a) LDCT, (b) Sole-ViT, (c) CTformer without cyclic shift, (d) CTformer, and (e) NDCT. (f)-(j) are the corresponding magnified ROIs from (a)-(e).

Impact of T2TD block. T2TD blocks are used in the CTformer to enhance the feature integration in the tokenization stage. Compared to fixed-region tokenization, the tokens in T2TD blocks are extracted from various regions of the original images. To verify the effectiveness of this component, a Sole-ViT model without the T2TD module is designed. We adopt a single convolution in the tokenization stage with a filter size of 8 and a stride of 8. Then, five transformer layers with an embedding size of 256 are applied for feature extraction and denoising. An embedding size of 256 rather than 64 is used so that the model size and MACs are close to those of our model, as shown in Table III. Finally, a detokenization with deconvolution is employed to transform the tokens back to the image domain. By investigating the conjunction area inside the blue circle in Fig. 13(g), we can see that Sole-ViT brings in extra blotchy tissues. Meanwhile, Fig. 14 shows that the CTformer converges faster than Sole-ViT and achieves better scores, with margins of 0.0235 in SSIM and 3.3362 in RMSE.

Figure 14: Visualization of the loss, SSIM, and RMSE curves of the CTformer and Sole-ViT on case L506 after different iterations.

Impact of cyclic shift. In this work, the cyclic shift is performed in the T2TD blocks to enlarge the receptive field of our model. Fig. 13 shows that the CTformer with cyclic shift enjoys more spatial smoothness than the CTformer without cyclic shift; the latter introduces some additional noise components. Quantitative results in Table III also confirm the effectiveness of the cyclic shift, which improves the SSIM and RMSE of the model by 0.0026 and 0.1337, respectively.

Impact of block number. In terms of the number of intermediate transformer blocks, we evaluate the CTformer with 1, 2, 4, and 8 blocks to identify their influence. As the block number grows, the network goes deeper; the computational cost increases slowly, but the actual training time climbs up dramatically. However, Table III indicates that the CTformer with only one block yields the best performance over those with more blocks.

TABLE III: Quantitative evaluation results of the Sole-ViT, the CTformer(W/oCS), and the CTformers with different numbers of transformer blocks.
Method TB #param. MACs SSIM\uparrow RMSE\downarrow
Sole-ViT 1 2.92M 0.24G 0.8886 12.3595
CTformer(W/oCS) 1 1.45M 0.86G 0.9095 9.1570
CTformer 1 1.45M 0.86G 0.9121 9.0233
CTformer 2 1.48M 0.87G 0.9115 9.0303
CTformer 4 1.55M 0.91G 0.9108 9.1285
CTformer 8 1.68M 0.98G 0.9115 9.0841

VI CONCLUSION

In this paper, we have proposed a novel convolution-free transformer empowered by dilated tokenization and cyclic shift for LDCT denoising, which is referred to as the CTformer. To the best of our knowledge, the proposed CTformer is the first pure transformer model for LDCT denoising. Also, we have developed interpretation methods for the proposed CTformer to decode its hidden behavior. Moreover, we have proposed the overlapped inference to address the boundary artifacts that are common in encoder-decoder models. Experimental results have demonstrated that the CTformer outperforms its competitors in terms of denoising performance and model efficiency. In the future, more efforts can be made to extend the CTformer to other medical image denoising problems.

References

  • [1] D. Wang, Z. Wu, and H. Yu, “Ted-net: Convolution-free t2t vision transformer-based encoder-decoder dilation network for low-dose ct denoising,” in Machine Learning in Medical Imaging (C. Lian, X. Cao, I. Rekik, X. Xu, and P. Yan, eds.), (Cham), pp. 416–425, Springer International Publishing, 2021.
  • [2] D. J. Brenner and E. J. Hall, “Computed tomography—an increasing source of radiation exposure,” New England journal of medicine, vol. 357, no. 22, pp. 2277–2284, 2007.
  • [3] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
  • [4] F. Fan, D. Wang, H. Guo, Q. Zhu, P. Yan, G. Wang, and H. Yu, “On a sparse shortcut topology of artificial neural networks,” IEEE Transactions on Artificial Intelligence, 2021.
  • [5] X. Yin, Q. Zhao, J. Liu, W. Yang, J. Yang, G. Quan, Y. Chen, H. Shu, L. Luo, and J.-L. Coatrieux, “Domain progressive 3d residual convolution network to improve low-dose ct imaging,” IEEE transactions on medical imaging, vol. 38, no. 12, pp. 2903–2913, 2019.
  • [6] G. Wang, J. C. Ye, and B. De Man, “Deep learning for tomographic image reconstruction,” Nature Machine Intelligence, vol. 2, no. 12, pp. 737–748, 2020.
  • [7] X. Wang, R. Girshick, A. Gupta, and K. He, “Non-local neural networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7794–7803, 2018.
  • [8] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” arXiv preprint arXiv:1706.03762, 2017.
  • [9] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” arXiv preprint arXiv:2103.14030, 2021.
  • [10] L. Yuan, Y. Chen, T. Wang, W. Yu, Y. Shi, Z. Jiang, F. E. Tay, J. Feng, and S. Yan, “Tokens-to-token vit: Training vision transformers from scratch on imagenet,” arXiv preprint arXiv:2101.11986, 2021.
  • [11] H. Wu, B. Xiao, N. Codella, M. Liu, X. Dai, L. Yuan, and L. Zhang, “Cvt: Introducing convolutions to vision transformers,” arXiv preprint arXiv:2103.15808, 2021.
  • [12] X. Chu, B. Zhang, Z. Tian, X. Wei, and H. Xia, “Do we really need explicit position encodings for vision transformers?,” arXiv e-prints, p. arXiv: 2102.10882, 2021.
  • [13] M. Chen, A. Radford, R. Child, J. Wu, H. Jun, D. Luan, and I. Sutskever, “Generative pretraining from pixels,” in International Conference on Machine Learning, pp. 1691–1703, PMLR, 2020.
  • [14] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, and S. Gelly, “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.
  • [15] F. Yang, H. Yang, J. Fu, H. Lu, and B. Guo, “Learning texture transformer network for image super-resolution,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5791–5800, 2020.
  • [16] H. Chen, Y. Wang, T. Guo, C. Xu, Y. Deng, Z. Liu, S. Ma, C. Xu, C. Xu, and W. Gao, “Pre-trained image processing transformer,” arXiv preprint arXiv:2012.00364, 2020.
  • [17] K. Choromanski, V. Likhosherstov, D. Dohan, X. Song, A. Gane, T. Sarlos, P. Hawkins, J. Davis, A. Mohiuddin, and L. Kaiser, “Rethinking attention with performers,” arXiv preprint arXiv:2009.14794, 2020.
  • [18] H. Cao, Y. Wang, J. Chen, D. Jiang, X. Zhang, Q. Tian, and M. Wang, “Swin-unet: Unet-like pure transformer for medical image segmentation,” arXiv preprint arXiv:2105.05537, 2021.
  • [19] H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jégou, “Training data-efficient image transformers & distillation through attention,” arXiv preprint arXiv:2012.12877, 2020.
  • [20] K. Han, A. Xiao, E. Wu, J. Guo, C. Xu, and Y. Wang, “Transformer in transformer,” arXiv preprint arXiv:2103.00112, 2021.
  • [21] K. Choromanski, V. Likhosherstov, D. Dohan, X. Song, A. Gane, T. Sarlos, P. Hawkins, J. Davis, A. Mohiuddin, L. Kaiser, et al., “Rethinking attention with performers,” arXiv preprint arXiv:2009.14794, 2020.
  • [22] S. Abnar and W. Zuidema, “Quantifying attention flow in transformers,” arXiv preprint arXiv:2005.00928, 2020.
  • [23] G. Montavon, A. Binder, S. Lapuschkin, W. Samek, and K.-R. Müller, “Layer-wise relevance propagation: an overview,” Explainable AI: interpreting, explaining and visualizing deep learning, pp. 193–209, 2019.
  • [24] H. Chefer, S. Gur, and L. Wolf, “Transformer interpretability beyond attention visualization,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 782–791, 2021.
  • [25] F.-L. Fan, J. Xiong, M. Li, and G. Wang, “On interpretability of artificial neural networks: A survey,” IEEE Transactions on Radiation and Plasma Medical Sciences, vol. 5, no. 6, pp. 741–760, 2021.
  • [26] A. Singh, S. Sengupta, and V. Lakshminarayanan, “Explainable deep learning models in medical image analysis,” Journal of Imaging, vol. 6, no. 6, p. 52, 2020.
  • [27] Z. Zhang, L. Yu, X. Liang, W. Zhao, and L. Xing, “Transct: Dual-path transformer for low dose computed tomography,” arXiv preprint arXiv:2103.00634, 2021.
  • [28] I. Kim, S. Rajaraman, and S. Antani, “Visual interpretation of convolutional neural network predictions in classifying medical image modalities,” Diagnostics, vol. 9, no. 2, p. 38, 2019.
  • [29] H. Yu and G. Wang, “Compressed sensing based interior tomography,” Physics in medicine & biology, vol. 54, no. 9, p. 2791, 2009.
  • [30] Y. Liu, J. Ma, Y. Fan, and Z. Liang, “Adaptive-weighted total variation minimization for sparse data toward low-dose x-ray computed tomography image reconstruction,” Physics in Medicine & Biology, vol. 57, no. 23, p. 7923, 2012.
  • [31] E. Y. Sidky and X. Pan, “Image reconstruction in circular cone-beam computed tomography by constrained, total-variation minimization,” Physics in Medicine & Biology, vol. 53, no. 17, p. 4777, 2008.
  • [32] Z. Tian, X. Jia, K. Yuan, T. Pan, and S. B. Jiang, “Low-dose ct reconstruction via edge-preserving total variation regularization,” Physics in Medicine & Biology, vol. 56, no. 18, p. 5949, 2011.
  • [33] Y. Zhang, W. Zhang, Y. Lei, and J. Zhou, “Few-view image reconstruction with fractional-order total variation,” JOSA A, vol. 31, no. 5, pp. 981–995, 2014.
  • [34] Q. Xu, H. Yu, X. Mou, L. Zhang, J. Hsieh, and G. Wang, “Low-dose x-ray ct reconstruction via dictionary learning,” IEEE transactions on medical imaging, vol. 31, no. 9, pp. 1682–1697, 2012.
  • [35] J. Ma, J. Huang, Q. Feng, H. Zhang, H. Lu, Z. Liang, and W. Chen, “Low‐dose computed tomography image restoration using previous normal‐dose scan,” Medical physics, vol. 38, no. 10, pp. 5713–5731, 2011.
  • [36] D. Wu, K. Kim, G. El Fakhri, and Q. Li, “Iterative low-dose ct reconstruction with priors trained by artificial neural network,” IEEE transactions on medical imaging, vol. 36, no. 12, pp. 2479–2486, 2017.
  • [37] H. Chen, Y. Zhang, M. K. Kalra, F. Lin, Y. Chen, P. Liao, J. Zhou, and G. Wang, “Low-dose ct with a residual encoder-decoder convolutional neural network,” IEEE transactions on medical imaging, vol. 36, no. 12, pp. 2524–2535, 2017.
  • [38] Q. Yang, P. Yan, Y. Zhang, H. Yu, Y. Shi, X. Mou, M. K. Kalra, Y. Zhang, L. Sun, and G. Wang, “Low-dose ct image denoising using a generative adversarial network with wasserstein distance and perceptual loss,” IEEE transactions on medical imaging, vol. 37, no. 6, pp. 1348–1357, 2018.
  • [39] F. Fan, H. Shan, M. K. Kalra, R. Singh, G. Qian, M. Getzin, Y. Teng, J. Hahn, and G. Wang, “Quadratic autoencoder (q-ae) for low-dose ct denoising,” IEEE transactions on medical imaging, vol. 39, no. 6, pp. 2035–2050, 2019.
  • [40] J. Liu, Y. Zhang, Q. Zhao, T. Lv, W. Wu, N. Cai, G. Quan, W. Yang, Y. Chen, L. Luo, et al., “Deep iterative reconstruction estimation (dire): approximate iterative reconstruction estimation for low dose ct imaging,” Physics in Medicine & Biology, vol. 64, no. 13, p. 135007, 2019.
  • [41] J. He, Y. Yang, Y. Wang, D. Zeng, Z. Bian, H. Zhang, J. Sun, Z. Xu, and J. Ma, “Optimizing a parameterized plug-and-play admm for iterative low-dose ct reconstruction,” IEEE transactions on medical imaging, vol. 38, no. 2, pp. 371–382, 2018.
  • [42] Z. Li, W. Shi, Q. Xing, Y. Miao, W. He, H. Yang, and Z. Jiang, “Low-dose ct image denoising with improving wgan and hybrid loss function,” Computational and Mathematical Methods in Medicine, vol. 2021, 2021.
  • [43] L. Huang, H. Jiang, S. Li, Z. Bai, and J. Zhang, “Two stage residual cnn for texture denoising and structure enhancement on low dose ct image,” Computer methods and programs in biomedicine, vol. 184, p. 105115, 2020.
  • [44] C. Wang, K. Shang, H. Zhang, Q. Li, Y. Hui, and S. K. Zhou, “Dudotrans: Dual-domain transformer provides more attention for sinogram restoration in sparse-view ct reconstruction,” arXiv preprint arXiv:2111.10790, 2021.
  • [45] K. He, X. Zhang, S. Ren, and J. Sun, “Identity mappings in deep residual networks,” in European conference on computer vision, pp. 630–645, Springer, 2016.
  • [46] C. Innamorati, T. Ritschel, T. Weyrich, and N. J. Mitra, “Learning on the edge: Investigating boundary filters in cnns,” International Journal of Computer Vision, vol. 128, no. 4, pp. 773–782, 2020.
  • [47] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, “Grad-cam: Visual explanations from deep networks via gradient-based localization,” in Proceedings of the IEEE international conference on computer vision, pp. 618–626, 2017.
  • [48] Q. Zhang, X. Wang, R. Cao, Y. N. Wu, F. Shi, and S.-C. Zhu, “Extraction of an explanatory graph to interpret a cnn,” IEEE Transactions on Pattern Analysis & Machine Intelligence, vol. 43, no. 11, pp. 3863–3877, 2021.
  • [49] C. H. McCollough, A. C. Bartley, R. E. Carter, B. Chen, T. A. Drees, P. Edwards, D. R. Holmes III, A. E. Huang, F. Khan, and S. Leng, “Low‐dose ct for the detection and classification of metastatic liver lesions: results of the 2016 low dose ct grand challenge,” Medical physics, vol. 44, no. 10, pp. e339–e352, 2017.
  • [50] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, and L. Antiga, “Pytorch: An imperative style, high-performance deep learning library,” Advances in neural information processing systems, vol. 32, pp. 8026–8037, 2019.
  • [51] H. Shan, A. Padole, F. Homayounieh, U. Kruger, R. D. Khera, C. Nitiwarangkul, M. K. Kalra, and G. Wang, “Competitive performance of a modularized deep neural network compared to commercial algorithms for low-dose ct image reconstruction,” Nature Machine Intelligence, vol. 1, no. 6, pp. 269–276, 2019.
  • [52] C. Tian, Y. Xu, Z. Li, W. Zuo, L. Fei, and H. Liu, “Attention-guided cnn for image denoising,” Neural Networks, vol. 124, pp. 117–129, 2020.