ImplantFormer: Vision Transformer based Implant Position Regression Using Dental CBCT Data
Abstract
Implant prosthesis is the most appropriate treatment for dentition defect or dentition loss, and it usually involves a surgical guide design process to decide the implant position. However, such design heavily relies on the subjective experience of dentists. In this paper, a transformer-based Implant Position Regression Network, ImplantFormer, is proposed to automatically predict the implant position from oral CBCT data. We creatively propose to predict the implant position using the 2D axial views of the tooth crown area and to fit a centerline of the implant to obtain the actual implant position at the tooth root. A convolutional stem and a convolutional decoder are designed to coarsely extract image features before patch embedding and to integrate multi-level feature maps for robust prediction, respectively. As both long-range relationships and local features are involved, our approach can better represent global information and achieves better localization performance. Extensive experiments on a dental implant dataset with five-fold cross-validation demonstrate that the proposed ImplantFormer outperforms existing methods.
keywords:
Implant Prosthesis, Dental Implant, Vision Transformer, Deep Learning

1 Introduction
Periodontal disease is the world’s 11th most prevalent oral condition and a major cause of tooth loss in adults, especially the aged [1] [2]. Implant prosthesis is so far the most appropriate treatment for dentition defect/dentition loss, and using a surgical guide in implant prosthesis leads to higher accuracy and efficiency [3] [4] [5]. Cone-beam computed tomography (CBCT) and oral scanning are the common data used for surgical guide design: the CBCT data is used to estimate the implant position, while the oral scanning data is employed to analyze the surface of the teeth. However, such a process takes the dentist a long time, as the patient’s jaw, teeth and soft tissue must be analyzed in the surgical guide design software. Artificial intelligence (AI) based implant position estimation could significantly speed up this process [6].
Recently, deep learning has achieved great success in many tasks of dentistry [7] [8] [9] [10]. For dental implant planning, recent research mainly focuses on implant depth estimation. Kurt et al. [11] utilised multiple pre-trained convolutional networks to segment the teeth and jaws, locate the missing tooth and generate a virtual tooth mask according to the location and tilt of the neighbouring teeth. The implant depth is determined by measuring, on the panoramic radiographic image, the width of the alveolar bone and the distances from the virtual mask to the mandibular canal, maxillary sinus and jaw bone edge. Widiasri et al. introduced Dental-YOLO [12] to detect the alveolar bone and mandibular canal on the sagittal view of CBCT to determine the height and width of the alveolar bone. However, we argue that measuring the depth alone cannot fully determine the implant position. In contrast, the 2D axial view of CBCT is more appropriate for implant position estimation, since the precise implant position can be obtained by stacking multiple 2D axial views. Nevertheless, in clinical practice the implant is inserted into the alveolar bone to act as the tooth root, and the soft tissues attached around the tooth root lead to blurry CT images, which poses a significant challenge to implant position estimation.
In this paper, we propose to train the prediction network using the 2D axial views of the tooth crown, which is exposed to the air and can be clearly captured by CT imaging. To obtain the implant position label at the tooth crown, we first fit the centerline of the implant using the tooth root annotations and then extend the centerline to the crown area (see Fig. 1(b)). By this means, the implant positions at the tooth crown can be obtained and used as labels to train the prediction network. During inference, the outputs of the network are transformed back to the tooth root area as the predicted implant positions (see Fig. 1(d)). Moreover, the fitted centerline combines the prediction results of a series of 2D slices of CBCT, which compensates for the loss of 3D context information without introducing heavy computation costs. Inspired by the way dentists determine the implant position with reference to the neighboring teeth [11] [13], we employ the Vision Transformer (ViT) [14] as the backbone of our Implant Position Regression Network (ImplantFormer). By dividing the input image into equal-sized patches and applying multi-head self-attention (MHSA) to the image patches, every pixel can establish a relationship with all others, which is extremely important for implant position decisions that rely on the texture of neighboring teeth. To further improve the performance of ImplantFormer, we design a convolutional stem to coarsely extract image features before patch embedding, extract multi-level features from the backbone, and introduce a convolutional decoder for further feature transformation and fusion.
The main contributions of this paper are summarized as follows.
-
1.
We creatively propose to predict the implant position using the 2D axial view of the tooth crown area, which has better image quality than the tooth root area and enables precise implant position prediction.
-
2.
The centerline of the implant is fitted using a series of prediction results on the 2D axial view, which introduces 3D context information without incurring heavy computation costs.
-
3.
A transformer-based Implant Position Regression Network (ImplantFormer) is proposed to consider both local context and global information for more robust prediction.
-
4.
The experimental results show that the proposed ImplantFormer outperforms mainstream detectors.
2 Related work
2.1 Deep Learning in Dentistry
Deep learning technology is widely used in many tasks of dentistry, e.g., dental caries detection, 2D and 3D tooth segmentation, and dental implant classification. For dental caries detection, Schwendicke et al. [15] and Casalegno et al. [16] proposed deep convolutional neural networks (CNNs) for the detection and diagnosis of dental caries on near-infrared light transillumination (NILT) and TI images, respectively. For tooth segmentation, Kondo et al. [17] proposed an automated method for tooth segmentation from 3D digitized images captured by a laser scanner, which avoids the complexity of directly processing 3D mesh data. Xu et al. [18] presented a label-free mesh simplification method particularly tailored for preserving tooth boundary information, which is generic and robust to the complex appearance of human teeth. Lian et al. [19] integrated a series of graph-constrained learning modules to hierarchically extract multi-scale contextual features for automatic labeling of raw dental surfaces. Cui et al. [20] proposed a two-stage segmentation method for 3D dental point cloud data: the first stage uses a distance-aware tooth centroid voting scheme to ensure accurate tooth localization, and the second stage designs a confidence-aware cascade segmentation module to segment each individual tooth and resolve ambiguous cases. Qiu et al. [21] decomposed the 3D teeth segmentation task into tooth centroid detection and tooth instance segmentation, providing a novel dental arch estimation method and introducing an arch-aware point sampling (APS) module based on the estimated dental arch for tooth centroid detection. For the task of implant fixture system classification, Sukegawa et al. [22] evaluated the performance of different CNN models for implant classification. Kim et al. [23] proposed an optimal pre-trained network architecture for identifying different types of implants.
2.2 Deep Learning in Object Detection
Object detectors can be divided into two categories, i.e., anchor-based and anchor-free. An anchor-based detector sets pre-defined anchor boxes before training, while an anchor-free detector directly regresses the bounding box of the object. Furthermore, anchor-based detectors can be grouped into one-stage and two-stage methods. YOLO [24] is a classical one-stage detector, which directly predicts the bounding box and category of objects based on the feature maps. A series of improved versions of YOLO [25] [26] [27] [28] have been proposed to improve the performance. Faster R-CNN [29] is a classical two-stage detector that consists of a region proposal network (RPN) and a prediction network (R-CNN [30]). Similarly, a series of detection algorithms [31] [32] [33] have been proposed to improve the performance of two-stage detectors. Different from anchor-based detectors that heavily rely on pre-defined anchor boxes, anchor-free detectors regress the object using heatmaps. CornerNet [34] simplified the prediction of the object bounding box as the regression of the top-left and bottom-right corners. CenterNet [35] further simplified CornerNet by regressing the center of the object. Recently, transformer-based anchor-free detectors have achieved great success in object detection. DETR [36] employs ResNet as the backbone and introduces a transformer-based encoder-decoder architecture for the object detection task. Deformable DETR [37] extends DETR with sparse deformable attention, which significantly reduces the training time.
2.3 Deep Learning in Implant Position Estimation
Computer-aided diagnosis (CAD) systems were applied to dental implant planning early on [6]. Polášková et al. [38] presented a web-based tool that takes patient history and clinical data as input and uses preset threshold levels for various parameters to decide whether or not implants may be placed. Sadighpour et al. [39] developed an ANN model that utilizes a number of input factors to decide the type of prosthesis (fixed or removable) and its specific design for rehabilitation of the edentulous maxilla. Szejka et al. [40] developed an interactive reasoning system that requires the dentist to select the region of interest within a 3D bone model built from computed tomography (CT) images, to help select the optimal implant length and design. However, these CAD systems need manual hyperparameter adjustment.
Recently, researchers have proposed different approaches to determine the implant position using panoramic radiographic images and 2D slices of CBCT. Kurt et al. [11] utilised multiple pre-trained convolutional networks to segment the teeth and jaws, locate the missing tooth and determine the implant location according to the location and tilt of the neighbouring teeth. Widiasri et al. introduced Dental-YOLO [12] to detect the alveolar bone and mandibular canal on the sagittal view of CBCT to determine the height and width of the alveolar bone, which determines the implant position indirectly.

3 Method
The whole procedure of the proposed implant position prediction method is shown in Fig. 1, which consists of four stages: 1) first, we transform the 3D CBCT data of a patient into a series of 2D transverse-plane slices, $I_c$ and $I_r$. As the 3D implant regression problem is now modeled as a series of 2D regressions on $I_c$, the prediction network can be simplified from 3D to 2D, which greatly alleviates the insufficiency of training data; 2) the proposed Algorithm 1 projects the ground-truth annotations from the tooth root ($P_r$) to the tooth crown ($P_c$); 3) ImplantFormer is trained using crown images with the projected labels and predicts the implant position of a new patient at the tooth crown; 4) finally, Algorithm 2 transforms the predicted results back from the tooth crown ($P_c$) to the tooth root ($P_r$).

3.1 CBCT Data Pre-processing
As shown in Fig. 1(a), the 3D CBCT data of a patient $I$ consists of two parts, i.e., the tooth crown data $I_c$ and the tooth root data $I_r$, $I = \{I_c, I_r\}$. Here $I_c = \{c_1, c_2, \ldots, c_M\}$ and $I_r = \{r_1, r_2, \ldots, r_N\}$, where $c_i$ and $r_j$ are the 2D slices of the tooth crown and tooth root, respectively, and $M$ and $N$ are the total numbers of 2D slices of the tooth crown and tooth root, respectively.
In clinical practice, the implant is inserted into the alveolar bone to act as the tooth root. However, the 2D slices captured from the tooth root usually look blurry, and the shape of the tooth root varies much more sharply across slices, which poses a significant challenge to the training of the prediction network. In contrast, the texture of the 2D slices corresponding to the tooth crown area is much richer and more stable. As shown in Fig. 2, over an interval of 10 slices, the distance between two neighboring teeth at the tooth crown increases by only 0.01 cm, compared with 0.08 cm at the tooth root. Meanwhile, the texture of the tooth area (including the implant position and its neighboring teeth) in the crown images is richer and more stable than that of the tooth root, which is beneficial for the network regression. A more detailed comparison of tooth crown and root images is given in the experimental section.
Hence, in this work, we propose to train the prediction network using the crown images $I_c$. However, the ground-truth position of the implant, defined as $P_r = \{(x_i, y_i, z_i)\}$, is annotated in $I_r$. Here $(x_i, y_i)$ are the coordinates of the implant position and $z_i$ is the slice index in $I_r$. To obtain the implant position annotation in $I_c$, a space transform algorithm is proposed. As shown in Fig. 1(b), we first use $P_r$ to fit the center line of the implant, defined as $L$; then $L$ is extended to the crown area and the intersections of $L$ with $I_c$, denoted $P_c$, can be obtained. The detailed procedure is given in Algorithm 1, whose input and output are $P_r$ and $P_c$, respectively. Given the slice index $z$ of a crown slice, we define a space transform $T$ that calculates the corresponding coordinates as $x = k_x z + b_x$ and $y = k_y z + b_y$, where the coefficients $k$ and biases $b$ are obtained by minimizing the residual sum of squares $RSS$.
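For illustration, a minimal NumPy sketch of this space transform is given below. It assumes the linear centerline model described above and (x, y, slice-index) triples as the annotation format; the function names and sample values are placeholders rather than the paper's implementation.

```python
import numpy as np

def fit_centerline(root_points):
    """Least-squares fit of the centerline x = kx*z + bx, y = ky*z + by from
    annotated (x, y, z) implant positions on the tooth-root slices."""
    pts = np.asarray(root_points, dtype=float)
    x, y, z = pts[:, 0], pts[:, 1], pts[:, 2]
    kx, bx = np.polyfit(z, x, deg=1)     # minimizes the residual sum of squares
    ky, by = np.polyfit(z, y, deg=1)
    return kx, bx, ky, by

def project_to_slices(coeffs, slice_indices):
    """Evaluate the fitted line at the given slice indices (crown or root)."""
    kx, bx, ky, by = coeffs
    z = np.asarray(list(slice_indices), dtype=float)
    return np.stack([kx * z + bx, ky * z + by, z], axis=1)

# Root annotations (x, y, slice index) -> projected crown labels (Algorithm 1);
# reversing the roles of the two slice ranges gives the inverse projection
# used at inference time (Algorithm 2).
root_ann = [(120.0, 98.0, 40), (121.0, 99.5, 45), (122.5, 101.0, 50)]
crown_labels = project_to_slices(fit_centerline(root_ann), slice_indices=range(5, 16))
```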

3.2 Transformer Based Implant Position Regression Network (ImplantFormer)
The input and output of ImplantFormer are a 2D slice at the tooth crown and the corresponding implant position, respectively. Similar to existing keypoint regression networks [41] [34], ImplantFormer is based on a Gaussian heatmap, and the center of the implant position in each 2D slice is set as the regression target. The network structure of ImplantFormer is given in Fig. 3, which mainly consists of four components: a convolutional stem, a transformer encoder, a convolutional decoder and a regression branch. Given a tooth crown image of the patient as the network input, we first introduce two Resblocks [42] as the convolutional stem to coarsely extract image features, and then the general patch embedding is generated from the output of the convolutional stem. Subsequently, we employ the ViT-Base-ResNet-50 model as the transformer encoder and extract multi-level features. In the convolutional decoder, we introduce reassemble blocks [43] to transform the multi-level features into multi-resolution feature maps and then integrate them by upsampling and concatenation. Finally, focal loss [44] and L1 loss are employed to supervise the heatmap head and the local offset head, respectively. The output of ImplantFormer is the implant positions at the tooth crown, which are extracted from the predicted heatmap by a post-processing operation.
3.2.1 Convolutional Stem
Adding a convolutional stem before the transformer encoder has been demonstrated to be helpful for extracting finer feature representations [37] [45] [36]. To this end, we design a convolutional stem for ImplantFormer, which consists of two Resblocks. Unlike previous works [46] [47] [48] that use the last-stage feature of a CNN backbone as the input of the transformer encoder, we keep the channel count and size of the output tensor the same as the input image, so that the original ViT can be directly used as the backbone.
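A minimal PyTorch sketch of such a stem is shown below. The exact residual block layout (3×3 convolutions with batch normalization) is an assumption; the essential property illustrated is that the output keeps the input's channel count and spatial size, so a standard ViT patch embedding can follow unchanged.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Basic residual block that preserves spatial size and channel count."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)

class ConvStem(nn.Module):
    """Two residual blocks applied before patch embedding; the output keeps
    the shape of the input image so the unmodified ViT can consume it."""
    def __init__(self, in_channels=3):
        super().__init__()
        self.blocks = nn.Sequential(ResBlock(in_channels), ResBlock(in_channels))

    def forward(self, x):          # x: (B, 3, H, W) -> (B, 3, H, W)
        return self.blocks(x)
```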
3.2.2 Transformer Encoder
ViT has achieved remarkable success in computer vision due to its ability to model long-range dependencies. In our work, the regression of the implant position heavily relies on the texture of the neighboring teeth, which is similar to the way dentists design the surgical guide. Consequently, the prediction network should be able to establish relationships between distant pixels, i.e., between the implant position and the neighboring teeth. ViT divides the input image into equal-sized patches, and patch embedding is applied by flattening each patch into a vector using a linear projection. Then, multi-head self-attention (MHSA) is applied to establish relationships between different image patches. By this means, each pixel can perceive other pixels even over long distances. This characteristic is essential for our work, which uses the texture of the neighboring teeth for implant position prediction.
Hence, in this work, we employ the ViT-Base-ResNet-50 model as the backbone of ImplantFormer. The non-overlapping square patches are generated from the output of the convolutional stem. The patch size is set to 16 for all experiments. Meanwhile, to further improve the performance of ImplantFormer, we extract multi-level features from several layers of the transformer encoder. The produced multi-level features with 256 dimensions are used to generate image-like feature representations for the convolutional decoder.
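The sketch below illustrates how patch embedding and multi-level token collection can be organized. It uses a plain ViT encoder built from standard PyTorch layers as a stand-in for the hybrid ViT-Base-ResNet-50 backbone, and the tapped layer indices are placeholders, not the paper's choice.

```python
import torch
import torch.nn as nn

class MultiLevelViTEncoder(nn.Module):
    """Patch embedding followed by a stack of transformer blocks; tokens from a
    few intermediate depths are collected for the convolutional decoder."""

    def __init__(self, img_size=224, patch_size=16, dim=768, depth=12,
                 heads=12, tap_layers=(2, 5, 8, 11)):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        num_patches = (img_size // patch_size) ** 2
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))
        self.blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                       dim_feedforward=4 * dim,
                                       batch_first=True, norm_first=True)
            for _ in range(depth)
        ])
        self.tap_layers = set(tap_layers)

    def forward(self, x):                        # x: (B, 3, H, W) from the conv stem
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)   # (B, N_p, dim)
        tokens = tokens + self.pos_embed
        feats = []
        for i, blk in enumerate(self.blocks):    # MHSA relates every patch to all others
            tokens = blk(tokens)
            if i in self.tap_layers:
                feats.append(tokens)             # multi-level token features
        return feats
```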
3.2.3 Convolutional Decoder
Unlike transformer-based encoder-decoder structures, we introduce a convolutional decoder for further feature transformation and fusion. As shown in Fig. 3, the reassemble block [43] first transforms the multi-level ViT features into multi-resolution feature maps, and then the image-like feature maps are integrated by upsampling and concatenation. The final concatenated feature map is fed into a convolution for feature smoothing and channel compression.
Specifically, we employ a simple three-stage reassemble block to recover image-like representations from the output tokens of arbitrary layers of the transformer encoder (shown in Fig. 3(a)). The reassemble operation is defined as follows:
$$\mathrm{Reassemble}_s^{\hat{D}}(t) = \left(\mathrm{Resample}_s \circ \mathrm{Concatenate} \circ \mathrm{Read}\right)(t) \tag{1}$$

where $s$ denotes the output size ratio of the recovered representation with respect to the input image, $\hat{D}$ denotes the output feature dimension, and $t = \{t_0, t_1, \ldots, t_{N_p}\}$ denotes the output tokens of the transformer encoder, with $t_0$ being the readout token.

$$\mathrm{Read}: \mathbb{R}^{(N_p + 1) \times D} \rightarrow \mathbb{R}^{N_p \times D} \tag{2}$$

$$\mathrm{Concatenate}: \mathbb{R}^{N_p \times D} \rightarrow \mathbb{R}^{\frac{H}{p} \times \frac{W}{p} \times D} \tag{3}$$

$$\mathrm{Resample}_s: \mathbb{R}^{\frac{H}{p} \times \frac{W}{p} \times D} \rightarrow \mathbb{R}^{\frac{H}{s} \times \frac{W}{s} \times \hat{D}} \tag{4}$$
As shown in the above equations, the reassemble block consists of three steps: Read, Concatenate and Resample. The Read operation first maps the $N_p + 1$ tokens to a set of $N_p$ tokens, where $N_p$ is the number of patches and $D$ is the feature dimension of each token. After the Read block, the resulting tokens are reshaped into an image-like representation by placing each token according to the position of its initial patch in the image; concretely, a spatial concatenation operation is applied to produce a feature map of size $\frac{H}{p} \times \frac{W}{p}$ with $D$ channels. Finally, a spatial resampling layer scales this representation to size $\frac{H}{s} \times \frac{W}{s}$ with $\hat{D}$ features per pixel.
After the reassemble operation, a simple feature fusion method is used to integrate the multi-resolution feature maps. We upsample each feature map by a factor of two and concatenate it with the neighboring feature map. For the final concatenated feature map, a convolution is used for feature smoothing and to reduce the number of channels from 512 to 256.
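A simplified PyTorch sketch of the reassemble and fusion steps is given below, assuming square token grids and the simple "drop the readout token" variant of the Read block; the channel and kernel sizes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Reassemble(nn.Module):
    """Read -> Concatenate (reshape to a grid) -> Resample, as in Eqs. (1)-(4)."""

    def __init__(self, dim=768, out_dim=256, grid=14, scale=1.0, has_readout=False):
        super().__init__()
        self.grid, self.scale, self.has_readout = grid, scale, has_readout
        self.project = nn.Conv2d(dim, out_dim, kernel_size=1)   # D -> D_hat

    def forward(self, tokens):                      # tokens: (B, N_p [+1], D)
        if self.has_readout:
            tokens = tokens[:, 1:]                  # Read: drop the readout token
        b, n, d = tokens.shape
        fmap = tokens.transpose(1, 2).reshape(b, d, self.grid, self.grid)  # Concatenate
        fmap = self.project(fmap)
        if self.scale != 1.0:                       # Resample to H/s x W/s
            fmap = F.interpolate(fmap, scale_factor=self.scale,
                                 mode="bilinear", align_corners=False)
        return fmap

def fuse(coarse, fine, smooth):
    """Upsample the coarser map by 2x, concatenate it with its neighbour and
    compress the channels (e.g. 512 -> 256) with a convolution."""
    up = F.interpolate(coarse, scale_factor=2, mode="bilinear", align_corners=False)
    return smooth(torch.cat([up, fine], dim=1))

# smooth = nn.Conv2d(512, 256, kernel_size=1)   # hypothetical kernel size
```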
3.2.4 Regression Branch
The output of ImplantFormer is an implant position heatmap $\hat{Y} \in [0, 1]^{\frac{W}{R} \times \frac{H}{R}}$, where $R$ is the down-sampling factor of the prediction and is set to 4. The heatmap is expected to be equal to 1 at the center of the implant position and equal to 0 otherwise. Following the standard practice in CenterNet [41], the ground-truth probability of a pixel being the target center point is modeled as a 2D Gaussian kernel, and the heatmap head is optimized by a penalty-reduced focal loss:

$$L_k = \frac{-1}{N} \sum_{xy} \begin{cases} \left(1 - \hat{Y}_{xy}\right)^{\alpha} \log\left(\hat{Y}_{xy}\right) & \text{if } Y_{xy} = 1 \\ \left(1 - Y_{xy}\right)^{\beta} \left(\hat{Y}_{xy}\right)^{\alpha} \log\left(1 - \hat{Y}_{xy}\right) & \text{otherwise} \end{cases} \tag{5}$$

where $\alpha$ and $\beta$ are hyper-parameters of the focal loss, $\hat{Y}$ is the predicted heatmap and $N$ is the number of keypoints in the image. We follow the standard CenterNet settings of $\alpha$ and $\beta$ in all our experiments. To further refine the predicted location, a local offset head is used for each target center point, which is optimized by an L1 loss $L_{off}$. The loss of the local offset is weighted by a constant $\lambda_{off}$. The overall training loss is:
$$L_{det} = L_k + \lambda_{off} L_{off} \tag{6}$$
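A compact sketch of the regression targets and losses described above is given below: a 2D Gaussian ground-truth heatmap, a CenterNet-style penalty-reduced focal loss, and an L1 offset loss. Tensor layouts and function names are illustrative.

```python
import torch

def gaussian_heatmap(height, width, center, sigma):
    """Ground-truth heatmap: a 2D Gaussian peaked at the down-sampled implant centre."""
    ys = torch.arange(height, dtype=torch.float32).view(-1, 1)
    xs = torch.arange(width, dtype=torch.float32).view(1, -1)
    cx, cy = center
    return torch.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

def heatmap_focal_loss(pred, gt, alpha=2, beta=4, eps=1e-6):
    """Penalty-reduced focal loss over the heatmap (alpha=2, beta=4 are the
    CenterNet defaults)."""
    pos = gt.eq(1).float()
    pos_term = ((1 - pred) ** alpha) * torch.log(pred + eps) * pos
    neg_term = ((1 - gt) ** beta) * (pred ** alpha) * torch.log(1 - pred + eps) * (1 - pos)
    num_pos = pos.sum().clamp(min=1)
    return -(pos_term.sum() + neg_term.sum()) / num_pos

def offset_l1_loss(pred_off, gt_off, mask):
    """L1 loss on the sub-pixel offset, evaluated only at keypoint pixels."""
    n = mask.sum().clamp(min=1)
    return (torch.abs(pred_off - gt_off) * mask).sum() / n

# Total loss as in Eq. (6): L_det = L_k + lambda_off * L_off.
```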
Algorithm 1: Project the implant position from the tooth root to the tooth crown.

1. Define a space transform $T$ based on $P_r$: $x = k_x z + b_x$, $y = k_y z + b_y$, where $k$ and $b$ are the coefficient and bias of $T$, respectively.
2. Compute the residual sum of squares on $P_r$: $RSS = \sum_i \left[ (x_i - k_x z_i - b_x)^2 + (y_i - k_y z_i - b_y)^2 \right]$.
3. Employ the least-square method to minimize $RSS$: calculate the derivatives of $RSS$ with respect to $k$ and $b$, respectively, and set them to 0.
4. Use $P_r$ to calculate $k$ and $b$.
5. Substitute $k$, $b$ and the crown slice indices into $T$ to obtain $P_c$.
Algorithm 2: Project the implant position from the tooth crown back to the tooth root.

1. Define a space transform $T$ based on $P_c$: $x = k_x z + b_x$, $y = k_y z + b_y$, where $k$ and $b$ are the coefficient and bias of $T$, respectively.
2. Compute the residual sum of squares on $P_c$.
3. Calculate the derivatives of $RSS$ with respect to $k$ and $b$, and set them to 0.
4. Use $P_c$ to calculate $k$ and $b$.
5. Substitute $k$, $b$ and the root slice indices into $T$ to obtain $P_r$.
3.2.5 Post-processing
The output heatmap is smaller than the input image due to the down-sampling operation. We therefore introduce a post-processing step to extract the implant position from the heatmap and recover the output scale. As shown in Fig. 3(b), the post-processing includes two steps: top-1 keypoint selection and coordinate transformation. We first select the prediction with the highest confidence. Then, the Gaussian heatmap with the selected implant position is directly transformed from the output resolution $\frac{W}{R} \times \frac{H}{R}$ back to the input resolution $W \times H$. The coordinates of the implant position are obtained by extracting the brightest point of the Gaussian heatmap.
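A sketch of this decoding step for the single-implant case is shown below, assuming a 1-channel heatmap and a 2-channel offset map predicted at 1/R resolution with R = 4.

```python
import torch

def decode_top1(heatmap, offset, down_ratio=4):
    """Select the highest-confidence peak and map it back to input-image
    coordinates (single-implant case)."""
    hm = heatmap[0, 0]                                  # (H/R, W/R)
    score, idx = hm.flatten().max(dim=0)
    y, x = divmod(int(idx), hm.shape[1])                # row/column of the peak
    dx, dy = offset[0, :, y, x].tolist()                # sub-pixel refinement
    return (x + dx) * down_ratio, (y + dy) * down_ratio, float(score)
```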
3.3 Project Implant Position from Tooth Crown to Tooth Root
After the post-processing, a set of implant positions at the tooth crown, $P_c$, is obtained. To obtain the real location of the implant at the tooth root, $P_r$, we propose Algorithm 2 to transform the prediction results back from the tooth crown to the tooth root. Algorithm 2 has the same workflow as Algorithm 1, except that the input and output are reversed. As shown in Fig. 1(d), we first use $P_c$ to fit the space line $L$; then $L$ is extended to the root area and the intersections of $L$ with $I_r$, i.e., $P_r$, can be obtained. The detailed procedure is given in Algorithm 2. Moreover, the fitted centerline combines the prediction results of a series of 2D slices of CBCT, which compensates for the loss of 3D context information without introducing heavy computation costs.
4 Experiments and Results
4.1 Dataset Details
We evaluate our method on a dental implant dataset collected from the Shenzhen University General Hospital (SUGH), which contains 3045 2D slices of the tooth crown; the implant positions were annotated by three experienced dentists. The CBCT data were captured using the KaVo 3D eXami machine, manufactured by Imaging Sciences International LLC. Dentists first designed a virtual implant based on the CBCT data using the surgical guide design software; the implant position was then determined as the center of the virtual implant. Some sample images of the dataset are shown in Fig. 4.


4.2 Implementation Details
For our experiments, we use a batch size of 6, the Adam optimizer and a learning rate of 0.0005 for network training. The network is trained for 140 epochs and the learning rate is divided by 10 at the 60th and 100th epochs. Three data augmentation methods, i.e., random crop, random scale and random flip, are employed during training. The original images are randomly cropped to a fixed input size at the training stage; at the inference stage, images are directly resized to the same input size. All models are trained and tested on an NVIDIA GeForce RTX TITAN platform. Five-fold cross-validation is employed for all experiments.
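The optimizer and learning-rate schedule above can be written down as follows; the model is a stand-in and the data pipeline is omitted, so this is only a sketch of the training configuration.

```python
import torch
import torch.nn as nn

# Stand-in module so the schedule is runnable; replace with ImplantFormer.
model = nn.Conv2d(3, 1, kernel_size=3, padding=1)

optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
# Learning rate divided by 10 at the 60th and 100th epochs.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[60, 100], gamma=0.1)

for epoch in range(140):
    # One pass over the training set with random crop / scale / flip augmentation;
    # the loss of Eq. (6) and the backward pass are omitted in this sketch.
    scheduler.step()
```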
4.3 Evaluation Criteria
The diameter of the implant is 3.55 mm, and clinically the mean error between the predicted and ideal implant positions is required to be less than 1 mm, i.e., around 25% of the implant diameter. Therefore, AP75 is used as the evaluation criterion. As the average radius of implants is around 20 pixels, a bounding box of fixed size centered at the keypoint is generated for both the predictions and the ground truth. The calculation of AP is defined as follows:
$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{7}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{8}$$

$$AP = \int_0^1 P(r)\, dr \tag{9}$$
Here TP, FP and FN are the numbers of correct, false and missed predictions, respectively, and P(r) is the precision-recall curve, with recall as the abscissa and precision as the ordinate.
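For reference, a simple all-point approximation of Eq. (9) from a list of scored detections is sketched below; it assumes the matching of detections to ground-truth implants at the 0.75 IoU threshold has already been performed.

```python
import numpy as np

def average_precision(scores, is_tp, num_gt):
    """All-point approximation of AP = integral of P(r) dr; `is_tp` marks whether
    each detection matched a ground-truth implant at IoU >= 0.75."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    tp = np.asarray(is_tp, dtype=float)[order]
    cum_tp = np.cumsum(tp)
    cum_fp = np.cumsum(1.0 - tp)
    recall = cum_tp / max(num_gt, 1)
    precision = cum_tp / np.maximum(cum_tp + cum_fp, 1e-9)
    # rectangle-rule integration of the precision-recall curve
    prev_r, ap = 0.0, 0.0
    for p, r in zip(precision, recall):
        ap += p * (r - prev_r)
        prev_r = r
    return ap

# Example: three detections, two of which hit a ground-truth implant.
print(average_precision(scores=[0.9, 0.8, 0.4], is_tp=[1, 0, 1], num_gt=2))
```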
4.4 Performance Analysis
4.4.1 Tooth Crown vs. Tooth Root
To justify the effectiveness of tooth crown images for implant position estimation, we train ImplantFormer using tooth crown and tooth root images, respectively, and compare their performance in Table 1. We can observe from the table that, with the IoU threshold set to 0.5 and 0.75, the prediction of ImplantFormer using crown images achieves 34.3% and 13.7% AP, respectively, which is 24.9% and 13.1% higher than using tooth root images. The experimental results demonstrate the effectiveness of using tooth crown images for implant position prediction.
| CBCT slice | AP50 (%) | AP75 (%) |
|---|---|---|
| Tooth crown | 34.3±2.2684 | 13.7±0.2045 |
| Tooth root | 9.4±1.5308 | 0.6±0.0516 |
4.4.2 Component Ablation
To demonstrate the effectiveness of the proposed components of ImplantFormer, i.e., the convolutional stem and the convolutional decoder, we conduct ablation experiments to investigate their effect on ImplantFormer. When these components are removed, the patch embedding is generated directly from the input image and the multi-resolution feature maps are directly upsampled and element-wise added to the neighboring feature map. As discussed previously, we use AP75 as the evaluation criterion in the following experiments.
The comparison results are shown in Table 2. We can observe from the table that the proposed components are beneficial for ImplantFormer: the convolutional stem and the convolutional decoder improve the performance by 0.9% and 1.4%, respectively. When both components are combined, the improvement reaches 2.9%.
In Fig. 5, we visualize some examples of the detection results of the different components for further comparison. We can observe from the figure that, for the ViT-base model, the predicted positions are relatively far away from the ground-truth position. In the image of the first row, there is also a false positive detection on the right side. Both the convolutional stem and the convolutional decoder reduce the distance between the prediction and the ground truth, thus improving the prediction accuracy. When the convolutional stem and decoder are combined, the model achieves more accurate predictions and reduces the number of missed or false positive detections.


| Backbone | Conv stem | Conv decoder | AP75 (%) |
|---|---|---|---|
| ViT-Base-ResNet-50 | | | 10.8±0.3491 |
| ViT-Base-ResNet-50 | ✓ | | 11.7±0.4308 |
| ViT-Base-ResNet-50 | | ✓ | 12.2±0.4065 |
| ViT-Base-ResNet-50 | ✓ | ✓ | 13.7±0.2045 |
4.4.3 Comparison to The State-of-the-art Detectors
In Table 3, we compare the AP value of ImplantFormer with other state-of-the-art detectors. As no useful texture is available around the center of the implant, where teeth are missing, the regression of the implant position is mainly based on the texture of the neighboring teeth. Therefore, anchor-free methods (VFNet [49], ATSS [50], RepPoints [51], CenterNet [41]) and a transformer-based method (Deformable DETR [37]) are more suitable for this regression problem. Nevertheless, we also employ two classical anchor-based detectors, Faster RCNN [29] and Cascade RCNN [31], for comparison. ResNet-50 is employed as the feature extraction backbone for these detectors. To further verify the benefit of the vision transformer, we also introduce ResNet-50 as a backbone for ImplantFormer, in which we remove the convolutional stem and the patch embedding operation but keep the feature fusion block.
From Table 3 we can observe that the anchor-based methods fail to predict the implant position, which confirms our concern. The transformer-based methods perform better than the CNN-based networks (e.g., Deformable DETR achieves 12.8% AP, which is 0.7% higher than the best-performing anchor-free network, ATSS). The ViT-based ImplantFormer achieves 13.7% AP, which is 2.2% and 0.9% higher than the ResNet-based ImplantFormer and Deformable DETR, respectively. These experimental results prove the effectiveness of our method, and the ViT-based ImplantFormer achieves the best performance among all benchmarks.
We choose two detectors each from the anchor-free (ATSS and VFNet) and transformer-based (Deformable DETR and ImplantFormer) methods to further demonstrate the superiority of ImplantFormer in implant position prediction. Fig. 6 shows the Euclidean distance between the ground-truth position and the predictions of these detectors, summed over intervals of five pixels. The smaller the distance, the more accurate the implant position prediction. From the figure we can observe that, for the anchor-free detectors, the distances are spread roughly evenly from 0 to 30 pixels, and only 35% of the predictions lie within 10 pixels of the ground-truth position. In contrast, for the transformer-based detectors, the distances mainly lie in the range of 0 to 20 pixels, and more than 70% of the predictions are located within 10 pixels of the ground truth. Considering that the diameter of the implant is around 20 pixels, predictions more than 10 pixels away are of no practical use. Therefore, the transformer-based detectors are more suitable for the implant prediction task. Compared to Deformable DETR, ImplantFormer generates about 8% more predictions within 5 pixels, which indicates that ImplantFormer achieves much better localization even though its AP is only 0.9% higher.
In Fig. 7, we also visualize the detection results of these detectors for four example images. We can observe from the first row of the figure that the transformer-based methods perform better than the anchor-free methods, which is consistent with the distance distribution. As shown in the second and third rows, the anchor-free methods generate false detections, while the transformer-based methods perform accurately. In the last row, a hard case is given, where the patient's teeth are sparse in several places; this leads to erroneous detections by VFNet and Deformable DETR, while the proposed ImplantFormer still performs accurately.

| Methods | Network | Backbone | AP75 (%) |
|---|---|---|---|
| Anchor-based | Faster RCNN | ResNet-50 | - |
| | Cascade RCNN | ResNet-50 | - |
| Anchor-free | CenterNet | ResNet-50 | 10.9±0.2457 |
| | ATSS | ResNet-50 | 12.1±0.2694 |
| | VFNet | ResNet-50 | 11.8±0.8734 |
| | RepPoints | ResNet-50 | 11.2±0.1858 |
| | ImplantFormer | ResNet-50 | 11.5±0.3748 |
| Transformer-based | Deformable DETR | ResNet-50 | 12.8±0.1417 |
| | ImplantFormer | ViT-Base-ResNet-50 | 13.7±0.2045 |

4.4.4 Visualization of Attention Map
To verify whether the localization mechanism of ImplantFormer is in line with that of the dentist, we visualize the attention map of ImplantFormer in Fig. 8. The enlarged portion in Fig. 8(a) shows how the dentist locates the implant position, i.e., by the edges of the neighboring teeth. Fig. 8(b) is the attention map of ImplantFormer, from which we can observe that the network attention falls on the edges of the teeth neighboring the implant position, in accordance with the dentist's localization mechanism. Meanwhile, the network attention around the neighboring teeth also indicates that the vision transformer can establish relationships between distant pixels, i.e., from the implant position to the neighboring teeth.
4.4.5 Comparison of Slice Views of Implant
To validate the effectiveness of the proposed ImplantFormer, we compare four slice views of the actual implant with the predicted implant in Fig. 9. The slice views are different longitudinal views of the CBCT data around the implant, which clearly show the direction and placement of the implant. For the predicted implant, a cylinder with a radius of 10 pixels centered at the predicted implant positions is generated, and the implant depth is set manually. The pixel value of the CBCT data inside the cylinder area is set to 3100. For a fair comparison, the longitudinal views of the CBCT data for both the ground-truth and predicted implants are selected at the same position.
From the figure, we can observe that the implant position and direction generated by ImplantFormer are consistent with the ground truth implant, which confirms the accuracy of the proposed ImplantFormer.
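A sketch of how such a cylinder can be painted into a copy of the CBCT volume for visualization is given below; the (z, y, x) axis order and the per-slice point format are assumptions.

```python
import numpy as np

def render_predicted_implant(volume, centerline_points, radius=10, value=3100):
    """Paint a cylinder of the given radius around the predicted centerline into a
    copy of the CBCT volume (axis order assumed to be (z, y, x))."""
    vol = volume.copy()
    yy, xx = np.mgrid[0:vol.shape[1], 0:vol.shape[2]]
    for x, y, z in centerline_points:        # one predicted (x, y) per axial slice z
        mask = (xx - x) ** 2 + (yy - y) ** 2 <= radius ** 2
        vol[int(round(z))][mask] = value
    return vol
```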
5 Conclusions
In this paper, we introduce a transformer-based Implant Position Regression Network (ImplantFormer) for CBCT-based implant position prediction, which leverages both local context and global information for more robust prediction. We creatively propose to train ImplantFormer using tooth crown images by projecting the annotations from the tooth root to the tooth crown via a space transform algorithm. In the inference stage, the outputs of ImplantFormer are projected back to the tooth root area as the predicted implant positions. Visualizations of the attention map and of the implant slice views demonstrate the effectiveness of our method, and both quantitative and qualitative comparisons show that the proposed ImplantFormer outperforms state-of-the-art object detectors.
References
- [1] H. Elani, J. Starr, J. Da Silva, G. Gallucci, Trends in dental implant use in the us, 1999–2016, and projections to 2026, Journal of dental research 97 (13) (2018) 1424–1430.
- [2] M. Nazir, A. Al-Ansari, K. Al-Khalifa, M. Alhareky, B. Gaffar, K. Almas, Global prevalence of periodontal disease and lack of its surveillance, The Scientific World Journal 2020 (2020).
- [3] E. Varga Jr, M. Antal, L. Major, R. Kiscsatári, G. Braunitzer, J. Piffkó, Guidance means accuracy: A randomized clinical trial on freehand versus guided dental implantation, Clinical oral implants research 31 (5) (2020) 417–430.
- [4] R. Vinci, M. Manacorda, R. Abundo, A. Lucchina, A. Scarano, C. Crocetta, L. Lo Muzio, E. Gherlone, F. Mastrangelo, Accuracy of edentulous computer-aided implant surgery as compared to virtual planning: a retrospective multicenter study, Journal of Clinical Medicine 9 (3) (2020) 774.
- [5] J. Gargallo-Albiol, O. Salomó-Coll, N. Lozano-Carrascal, H.-L. Wang, F. Hernández-Alfaro, Intra-osseous heat generation during implant bed preparation with static navigation: Multi-factor in vitro study, Clinical Oral Implants Research 32 (5) (2021) 590–597.
- [6] F. Amato, A. López, E. M. Peña-Méndez, P. Vaňhara, A. Hampl, J. Havel, Artificial neural networks in medical diagnosis (2013).
- [7] F. Schwendicke, T. Singh, J.-H. Lee, R. Gaudin, A. Chaurasia, T. Wiegand, S. Uribe, J. Krois, et al., Artificial intelligence in dental research: Checklist for authors, reviewers, readers, Journal of dentistry 107 (2021) 103610.
- [8] A. Müller, S. M. Mertens, G. Göstemeyer, J. Krois, F. Schwendicke, Barriers and enablers for artificial intelligence in dental diagnostics: a qualitative study, Journal of Clinical Medicine 10 (8) (2021) 1612.
- [9] M. Kim, M. Chung, Y.-G. Shin, B. Kim, Automatic registration of dental ct and 3d scanned model using deep split jaw and surface curvature, Computer Methods and Programs in Biomedicine 233 (2023) 107467.
- [10] Q. Chen, J. Huang, H. S. Salehi, H. Zhu, L. Lian, X. Lai, K. Wei, Hierarchical cnn-based occlusal surface morphology analysis for classifying posterior tooth type using augmented images from 3d dental surface models, Computer Methods and Programs in Biomedicine 208 (2021) 106295.
- [11] S. Kurt Bayrakdar, K. Orhan, I. S. Bayrakdar, E. Bilgir, M. Ezhov, M. Gusarev, E. Shumilov, A deep learning approach for dental implant planning in cone-beam computed tomography images, BMC Medical Imaging 21 (1) (2021) 86.
- [12] M. Widiasri, A. Z. Arifin, N. Suciati, C. Fatichah, E. R. Astuti, R. Indraswari, R. H. Putra, C. Za’in, Dental-yolo: Alveolar bone and mandibular canal detection on cone beam computed tomography images for dental implant planning, IEEE Access 10 (2022) 101483–101494.
- [13] Y. Liu, Z.-c. Chen, C.-h. Chu, F.-L. Deng, Transfer learning via artificial intelligence for guiding implant placement in the posterior mandible: an in vitro study (2021).
- [14] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., An image is worth 16x16 words: Transformers for image recognition at scale, arXiv preprint arXiv:2010.11929 (2020).
- [15] F. Schwendicke, K. Elhennawy, S. Paris, P. Friebertshäuser, J. Krois, Deep learning for caries lesion detection in near-infrared light transillumination images: A pilot study, Journal of dentistry 92 (2020) 103260.
- [16] F. Casalegno, T. Newton, R. Daher, M. Abdelaziz, A. Lodi-Rizzini, F. Schürmann, I. Krejci, H. Markram, Caries detection with near-infrared transillumination using deep learning, Journal of dental research 98 (11) (2019) 1227–1233.
- [17] T. Kondo, S. H. Ong, K. W. Foong, Tooth segmentation of dental study models using range images, IEEE Transactions on medical imaging 23 (3) (2004) 350–362.
- [18] X. Xu, C. Liu, Y. Zheng, 3d tooth segmentation and labeling using deep convolutional neural networks, IEEE transactions on visualization and computer graphics 25 (7) (2018) 2336–2348.
- [19] C. Lian, L. Wang, T.-H. Wu, F. Wang, P.-T. Yap, C.-C. Ko, D. Shen, Deep multi-scale mesh feature learning for automated labeling of raw dental surfaces from 3d intraoral scanners, IEEE transactions on medical imaging 39 (7) (2020) 2440–2450.
- [20] Z. Cui, C. Li, N. Chen, G. Wei, R. Chen, Y. Zhou, D. Shen, W. Wang, Tsegnet: An efficient and accurate tooth segmentation network on 3d dental model, Medical Image Analysis 69 (2021) 101949.
- [21] L. Qiu, C. Ye, P. Chen, Y. Liu, X. Han, S. Cui, Darch: Dental arch prior-assisted 3d tooth instance segmentation with weak annotations, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 20752–20761.
- [22] S. Sukegawa, K. Yoshii, T. Hara, K. Yamashita, K. Nakano, N. Yamamoto, H. Nagatsuka, Y. Furuki, Deep neural networks for dental implant system classification, Biomolecules 10 (7) (2020) 984.
- [23] J.-E. Kim, N.-E. Nam, J.-S. Shim, Y.-H. Jung, B.-H. Cho, J. J. Hwang, Transfer learning via deep neural networks for implant fixture system classification using periapical radiographs, Journal of clinical medicine 9 (4) (2020) 1117.
- [24] J. Redmon, S. Divvala, R. Girshick, A. Farhadi, You only look once: Unified, real-time object detection, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 779–788.
- [25] A. Bochkovskiy, C.-Y. Wang, H.-Y. M. Liao, Yolov4: Optimal speed and accuracy of object detection, arXiv preprint arXiv:2004.10934 (2020).
- [26] C. Li, L. Li, H. Jiang, K. Weng, Y. Geng, L. Li, Z. Ke, Q. Li, M. Cheng, W. Nie, et al., Yolov6: A single-stage object detection framework for industrial applications, arXiv preprint arXiv:2209.02976 (2022).
- [27] C.-Y. Wang, A. Bochkovskiy, H.-Y. M. Liao, Yolov7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 7464–7475.
- [28] Z. Ge, S. Liu, F. Wang, Z. Li, J. Sun, Yolox: Exceeding yolo series in 2021, arXiv preprint arXiv:2107.08430 (2021).
- [29] S. Ren, K. He, R. Girshick, J. Sun, Faster r-cnn: Towards real-time object detection with region proposal networks, Advances in neural information processing systems 28 (2015).
- [30] R. Girshick, J. Donahue, T. Darrell, J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 580–587.
- [31] Z. Cai, N. Vasconcelos, Cascade r-cnn: Delving into high quality object detection, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 6154–6162.
- [32] P. Sun, R. Zhang, Y. Jiang, T. Kong, C. Xu, W. Zhan, M. Tomizuka, L. Li, Z. Yuan, C. Wang, et al., Sparse r-cnn: End-to-end object detection with learnable proposals, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 14454–14463.
- [33] K. He, G. Gkioxari, P. Dollár, R. Girshick, Mask r-cnn, in: Proceedings of the IEEE international conference on computer vision, 2017, pp. 2961–2969.
- [34] H. Law, J. Deng, Cornernet: Detecting objects as paired keypoints, in: Proceedings of the European conference on computer vision (ECCV), 2018, pp. 734–750.
- [35] K. Duan, S. Bai, L. Xie, H. Qi, Q. Huang, Q. Tian, Centernet: Keypoint triplets for object detection, in: Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 6569–6578.
- [36] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, S. Zagoruyko, End-to-end object detection with transformers, in: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part I 16, Springer, 2020, pp. 213–229.
- [37] X. Zhu, W. Su, L. Lu, B. Li, X. Wang, J. Dai, Deformable detr: Deformable transformers for end-to-end object detection, arXiv preprint arXiv:2010.04159 (2020).
- [38] A. Polášková, J. Feberová, T. Dostálová, P. Kříž, M. Seydlová, et al., Clinical decision support system in dental implantology, MEFANET Journal 1 (1) (2013) 11–14.
- [39] L. Sadighpour, S. M. M. Rezaei, M. Paknejad, F. Jafary, P. Aslani, The application of an artificial neural network to support decision making in edentulous maxillary implant prostheses, Journal of Research and Practice in Dentistry 2014 (2014) i1–10.
- [40] A. L. Szejka, M. Rudek, O. C. Jnr, A reasoning method for determining the suitable dental implant, in: 41st International Conference on Computers & Industrial Engineering, Los Angeles. Proceedings of the 41st International Conference on Computers & Industrial Engineering, 2011.
- [41] X. Zhou, D. Wang, P. Krähenbühl, Objects as points, arXiv preprint arXiv:1904.07850 (2019).
- [42] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
- [43] R. Ranftl, A. Bochkovskiy, V. Koltun, Vision transformers for dense prediction, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 12179–12188.
- [44] T.-Y. Lin, P. Goyal, R. Girshick, K. He, P. Dollár, Focal loss for dense object detection, in: Proceedings of the IEEE international conference on computer vision, 2017, pp. 2980–2988.
- [45] D. Meng, X. Chen, Z. Fan, G. Zeng, H. Li, Y. Yuan, L. Sun, J. Wang, Conditional detr for fast training convergence, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 3651–3660.
- [46] Y. Wang, X. Zhang, T. Yang, J. Sun, Anchor detr: Query design for transformer-based detector, in: Proceedings of the AAAI conference on artificial intelligence, Vol. 36, 2022, pp. 2567–2575.
- [47] Z. Dai, B. Cai, Y. Lin, J. Chen, Up-detr: Unsupervised pre-training for object detection with transformers, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 1601–1610.
- [48] F. Li, H. Zhang, S. Liu, J. Guo, L. M. Ni, L. Zhang, Dn-detr: Accelerate detr training by introducing query denoising, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 13619–13627.
- [49] H. Zhang, Y. Wang, F. Dayoub, N. Sunderhauf, Varifocalnet: An iou-aware dense object detector, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 8514–8523.
- [50] S. Zhang, C. Chi, Y. Yao, Z. Lei, S. Z. Li, Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 9759–9768.
- [51] Z. Yang, S. Liu, H. Hu, L. Wang, S. Lin, Reppoints: Point set representation for object detection, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 9657–9666.