
ARS-DETR: Aspect Ratio-Sensitive Detection Transformer for Aerial Oriented Object Detection

Ying Zeng1, Yushi Chen1, Member, IEEE, Xue Yang2, Qingyun Li1, Junchi Yan2, Senior Member, IEEE
This work was supported by the Natural Science Foundation of China under Grant 62371169, 61971164 and U20B2041 (Corresponding author: Yushi Chen.) Ying Zeng, Yushi Chen and Qingyun Li are with the School of Electronics and Information Engineering, Harbin Institute of Technology, Harbin 150001, China (e-mail: [email protected]; [email protected]; [email protected]) Xue Yang is with OpenGVLab, Shanghai AI Laboratory, Shanghai 200030, China (e-mail: [email protected]) Junchi Yan is with School of Electronic Information and Electrical Engineering, and MoE Key Lab of Artificial Intelligence, Shanghai Jiao Tong University, Shanghai 200030, China (e-mail: [email protected])
Abstract

Oriented object detection in aerial images has progressed considerably in recent years and achieved favorable results. However, high-precision oriented object detection in aerial images remains a challenging task. Some recent works have adopted classification-based methods to predict the angle in order to address the angle boundary problem. However, we have found that these works often neglect the sensitivity of objects with different aspect ratios to the angle. At the same time, it is worth exploring a suitable way to improve the emerging transformer-based approaches in order to adapt them to oriented object detection. In this paper, we propose an Aspect Ratio Sensitive DEtection TRansformer, termed ARS-DETR, for oriented object detection in aerial images. Specifically, a new angle classification method, called Aspect Ratio aware Circle Smooth Label (AR-CSL), is proposed to smooth the angle label in a more reasonable way and discard the hyperparameter introduced by previous work (e.g., CSL). Then, a rotated deformable attention module is designed to rotate the sampling points with the corresponding angles and eliminate the misalignment between region features and sampling points. Moreover, a dynamic weight coefficient based on the aspect ratio is adopted to calculate the angle loss. Comprehensive experiments on several challenging datasets demonstrate that our method achieves a competitive performance in the high-precision oriented object detection task.

Index Terms:
Oriented Object Detection, High-Precision Detection, Detection Transformer, Feature Alignment.

I Introduction

Object detection in aerial images has always been a hot spot in the remote sensing community. With the rapid increase in the number of available high-resolution aerial images [1, 2, 3], accurate and effective detection in these images has become a crucial issue.

Benefiting from the development of deep learning, the emergence of deep Convolutional Neural Networks (CNNs) has greatly influenced the design of detectors and achieved favorable performance in generic object detection. Instead of using handcrafted features for detection, which is cumbersome and not accurate enough, CNNs learn from the training data and update themselves iteratively, exhibiting a strong ability to extract high-level and robust features for accurate detection. Numerous advanced detectors have also been proposed to detect objects using horizontal bounding boxes (HBBs).

Compared with generic images, objects in aerial images often exhibit a wide variety of scales, aspect ratios, and orientations, and are sometimes arranged densely. When HBBs are simply used to detect these objects, they cannot fit the objects well and thus include abundant background or overlap with other objects. Therefore, oriented object detection, which adopts oriented bounding boxes (OBBs) to represent the objects, is more suitable for aerial object detection.

Figure 1: Even though the angle prediction is inaccurate, it still obtains a high performance in terms of AP50.

As a recently emerged task in remote sensing, oriented object detection exhibits a strong ability in analysing the objects in aerial images, and many advanced oriented object detectors have been proposed and achieved favorable performance on aerial images [4, 5, 6, 7]. However, numerous detectors treat oriented object detection as generic detection with an additional angle that needs to be predicted. To achieve this, they simply introduce an extra angle parameter in the detection head. Consequently, their angle predictions are not very accurate, as shown in Fig. 1. Nevertheless, these detectors can still obtain fairly good results under the current metric, i.e., AP50, indicating that AP50 is not accurate enough to reflect the performance of oriented object detectors and that high-precision oriented object detection in aerial images remains a challenging task.

Angle, as a unique parameter in oriented object detection, plays a vital role in high-precision detection. At the same time, the characteristics of the angle also make it difficult to predict and therefore require more attention. Firstly, the periodicity of the angle causes discontinuity at the boundary, leading to suboptimal optimization during training. Secondly, objects with different aspect ratios exhibit varying sensitivities to the angle, which is neglected by most oriented object detectors. Objects with a small aspect ratio, especially those resembling squares, are less sensitive to variations in the angle. Consequently, even when the predicted angle deviates significantly from the ground truth, they can maintain a high Skew Intersection over Union (SkewIoU) between targets and predictions. In contrast, even a slight angle deviation can drastically degrade the SkewIoU when objects have a large aspect ratio.

Among a large number of angle prediction methods, the classification-based method shows a favorable performance [8, 9, 10]. Specifically, it decouples the angle from the bounding boxes and transforms the angle prediction into a classification task, thereby eliminating the boundary problem [11]. Moreover, [12] also shows the strong potential of classification-based methods in high-precision oriented object detection. Nevertheless, some issues remain, such as completely ignoring the correlation between angle and aspect ratio and introducing hyperparameters (e.g., the window radius in [8]). Thus, the accuracy of the angle prediction is hindered to some extent.

Recently, transformer-based detectors [13, 14] have revived the object detection task. Without additional complicated hand-designed components like preset anchors or Non-Maximum Suppression (NMS), the DEtection TRansformer (DETR [13]) regards object detection as a set prediction task and assigns labels by bipartite graph matching, achieving a performance comparable to classical detectors like Faster RCNN [15]. Existing DETR derivatives [14, 16, 17, 18, 19, 20] dramatically improve detection performance and convergence speed, exhibiting the great potential of Transformers for high-precision object detection. Although some DETR-based oriented object detection methods [21, 22] have been proposed, they still use regression to predict the angle and do not take into account the issues caused by boundary discontinuity. Meanwhile, they predict the angle in a simple way and do not explore how to embed angle information into DETR. How to use DETR more naturally in oriented object detection remains an open research topic.

In this paper, we propose an Aspect Ratio Sensitive DEtection TRansformer, called ARS-DETR, to achieve oriented object detection in aerial images. Specifically, a hyperparameter-free Aspect Ratio aware Circle Smooth Label (AR-CSL) is designed to represent the relationship of adjacent angles according to the aspect ratios of objects. Considering the sensitivity of different objects to the angle, AR-CSL uses the SkewIoU of objects with different aspect ratios under each angle deviation to smooth the angle labels. Then, we propose a rotated deformable attention module to embed the angle information into the detector to align the features. Finally, we adopt aspect ratio sensitive matching and loss strategies to dynamically adjust the detector's training, thereby reducing the burden of model training. Extensive experiments on different aerial datasets demonstrate the effectiveness of ARS-DETR in high-precision oriented object detection. In summary, our contributions are four-fold:

  • We analyze the influence of angle deviation in oriented object detection in detail and give the corresponding formula. Additionally, we also analyze the flaws of the current oriented object detection metric (i.e. AP50).

  • A new angle classification method called AR-CSL is designed to smooth angle labels in a more reasonable way. This method adopts the values of the SkewIoU of objects with different aspect ratios under each angle deviation, while also eliminating the hyperparameter of window radius that was introduced by previous work.

  • We propose an angle embedded Rotated Deformable Attention module (RDA) to incorporate the angle information for extracting aligned features. Meanwhile, the Aspect Ratio sensitive Matching (ARM) and Aspect Ratio sensitive Loss (ARL) are developed to adaptively adjust the focus on the angle based on the aspect ratio of the object. In addition, we also combine them with the DeNoising (DN) strategy to further improve the performance of the DETR-based method for oriented object detection.

  • Extensive experiments on three public aerial datasets, DOTA-v1.0, DIOR-R and OHD-SJTU, demonstrate the effectiveness of the proposed model. ARS-DETR achieves a competitive performance on high-precision oriented object detection on all datasets.

II Related Work

II-A Oriented Object Detection

As an emerging task, oriented object detection has made great progress in recent years. A simple solution [23] for the oriented object detection task is to change the anchors or Regions of Interest (RoIs) from the horizontal type to the rotated type. RoI-Transformer [4] constructs a geometry transformation to rotate the proposals to locate objects more accurately. To address the feature misalignment in refined single-stage detectors, R3Det [24] and S2A-Net [25] adopt feature alignment modules to obtain more accurate locations. However, these mainstream regression-based methods often suffer from the boundary problem [8] because predictions fall beyond the defined range, and they need additional complicated treatment. SCRDet [5] designs a novel IoU-Smooth L1 Loss to alleviate the sudden increase in loss caused by angle periodicity and edge exchangeability, which reduces the difficulty of model training.

CSL [8] transforms the prediction of angle from regression to classification, thereby eliminating the boundary problem. It is achieved through the design of a Circle Smooth Label. Gliding vertex [6] glides the vertex of the horizontal bounding box to accurately represent a multi-oriented object. GWD [11], KLD [26] and KFIoU [27] convert the rotated bounding box into a Gaussian distribution to avoid the boundary discontinuity and square-like issue in oriented object detection. PSC [28] provides a unified framework for various periodic fuzzy issues in oriented object detection by mapping rotational periodicity of different cycles into phase of different frequencies.

II-B Angle Classification-based Oriented Object Detection

The classification-based angle prediction method is a novel and effective approach to circumvent the boundary problem while predicting angle accurately, and it has also made a lot of progress. CSL[8] discretizes the angle variable into 180 categories and smooths the angle label via a Gaussian window function. The CSL method directly promotes the development of classification-based oriented object detection algorithms. DCL [9] uses dense coded label to reduce the amount of computation and parameters of CSL. [10] adopts a dynamic weighting mechanism based on CSL to perform precise angle estimation for rotated objects. To overcome the challenges of ambiguity and high costs in angle representation, [29] proposes a multi-grained angle representation method, consisting of coarse-grained angle classification and fine-grained angle regression. TIOE [12] proposes a progressive orientation estimation strategy to approximate the orientation of objects with n-ary codes. AR-BCL [30] uses an aspect ratio-based bidirectional coded label to solve the square-like detection issue [9]. In contrast, the new angle encoding technique proposed in this paper is free from hyperparameters, boundary problem, and square-like issue. Furthermore, it explores the potential of angle classification in high-precision detection, which is overlooked by most of the above methods.

Figure 2: The curves represent the relationship between SkewIoU and angle deviation Δθ under different aspect ratios, where k indicates the aspect ratio. (a) k ≥ 1; (b) 1 ≤ k ≤ 1.5; (c) k > 1.5.

II-C DETR and Its Variants

DETR [13] is a Transformer-based end-to-end object detector that removes hand-designed components such as prior anchor design and NMS. In recent years, DETR has progressed considerably and exhibits strong object detection ability compared with classic detection methods [14, 16, 17, 18, 19, 20]. Deformable DETR [14] proposes a deformable attention module to sample values at adaptive positions around the reference point and utilizes multi-level features to mitigate the slow convergence and high complexity of DETR. DAB-DETR [20] provides explicit positional priors for each query to let the cross-attention module focus on a local region corresponding to a target object by using the anchor box size. DN-DETR [17] and DINO [18] design a denoising auxiliary task that bypasses bipartite graph matching, which not only accelerates training convergence but also leads to better training results. In addition, there have been some DETR-based oriented object detectors [21, 22]. O2DETR [21] is the first attempt to apply DETR to the oriented object detection task, and AO2-DETR [22] introduces an oriented proposal generation and refinement module into the transformer architecture to refine the features. Nevertheless, both of them predict angles in a simple regression manner and do not address boundary discontinuity or embed angle information into DETR.

Figure 3: Two situations for SkewIoU calculation. (a) The situation where Δθ < θ*; (b) the boundary condition between situation 1 and situation 2; (c) the situation where Δθ > θ*.

III Rethinking on Oriented Object Detection

In this section, we analyze the relationship between angle and aspect ratio. Additionally, we also analyze the shortcomings of currently used metric AP50 in oriented object detection and emphasize the importance of high-precision oriented object detection.

III-A Angle and Aspect Ratio

Objects with different aspect ratios have different sensitivities to angles. In order to better observe the relationship between the angle and aspect ratio, we assume that there are two bounding boxes with the same center, width and height, and give the SkewIoU of these two boxes with different aspect ratios under different angle deviations, as shown in Fig. 2. The SkewIoU can be calculated by Eq. 1:

\[
\text{SkewIoU}(k,\Delta\theta)=
\begin{cases}
\dfrac{4k\tan\Delta\theta-m-n}{4k\tan\Delta\theta+m+n}, & \Delta\theta\leq 2\arctan\frac{1}{k}\\[1ex]
\dfrac{1}{2k\sin\Delta\theta-1}, & \Delta\theta>2\arctan\frac{1}{k}
\end{cases}
\qquad(1)
\]
\[
m=\left(1-k\tan\tfrac{\Delta\theta}{2}\right)^{2}\tan^{2}\Delta\theta,
\qquad
n=\left(\frac{-2\sin^{2}\tfrac{\Delta\theta}{2}+k\sin\Delta\theta}{\cos\Delta\theta}\right)^{2},
\]

where k ≥ 1 is the aspect ratio, and Δθ ∈ [0°, 90°] represents the angle deviation, i.e., the absolute value of the angle difference between the two boxes. There is a critical angle boundary threshold (θ* = 2 arctan(1/k)), as shown in Fig. 3. The symmetrical nature depicted in Fig. 2 demonstrates that variations in Δθ are bidirectional.
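To make this relationship concrete, the trend in Fig. 2 can be reproduced numerically. The sketch below is illustrative only and is not part of our implementation: it uses the shapely library to compute the SkewIoU of two same-sized, same-centered boxes for a given aspect ratio k and angle deviation Δθ, and the function name skew_iou is ours for illustration.

```python
from shapely.geometry import Polygon
from shapely import affinity

def skew_iou(k, delta_deg):
    """SkewIoU of two boxes sharing the same center, width and height
    (aspect ratio k = w/h >= 1), one rotated by delta_deg degrees."""
    w, h = float(k), 1.0
    box = Polygon([(-w / 2, -h / 2), (w / 2, -h / 2),
                   (w / 2, h / 2), (-w / 2, h / 2)])
    rotated = affinity.rotate(box, delta_deg, origin=(0.0, 0.0))
    inter = box.intersection(rotated).area
    return inter / (2 * box.area - inter)  # both boxes have the same area

# Reproduce the trend in Fig. 2: small aspect ratios never drop below 0.5,
# large aspect ratios decay quickly as the deviation grows.
for k in (1.0, 1.5, 3.0, 6.0):
    print(k, [round(skew_iou(k, d), 3) for d in (5, 15, 30, 45, 90)])
```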

Fig. 2(a) shows the curves between SkewIoU and angle deviation under different aspect ratios. It can be seen that the SkewIoU variation trends of bounding boxes with different aspect ratios are clearly divided into two types according to the metric AP50 (if the SkewIoU between the prediction box and the ground truth is greater than 0.5, it is judged as a true positive), and the dividing boundary is k = 1.5, as shown in Fig. 2(b) (1 ≤ k ≤ 1.5) and Fig. 2(c) (k > 1.5), respectively. Specifically, Fig. 2(b) shows that when the aspect ratio is smaller than 1.5, the SkewIoU is always greater than 0.5 regardless of the angle deviation (see the pink dashed line). In contrast, when the aspect ratio is greater than 1.5, as shown in Fig. 2(c), the SkewIoU decays rapidly as the angle deviation increases, but the valid angle deviation still covers a wide range. In summary, objects with a small aspect ratio are less sensitive to angle deviation, whereas objects with a large aspect ratio are more sensitive but still exhibit a significant tolerance for angle deviation under AP50.

III-B High-Precision Oriented Object Detection

Angle is a very important parameter in oriented object detection, and the accuracy of its estimation greatly affects subsequent related tasks, such as fine-grained object recognition [31] and object heading estimation [32, 33]; AP50 therefore seems not accurate enough to reflect the performance of high-precision oriented object detection. Hence, we advocate using a more stringent metric, e.g., AP75 (note that achieving a high AP75 is more difficult in oriented object detection than in horizontal object detection, because the rotated bounding box is tighter with fewer redundant areas and thus more sensitive to errors), which is commonly used in generic detection, to measure this challenging task. Under the AP75 metric, not only are the prediction boxes required to be closer to the ground-truth boxes, but the angle prediction requirement is also more stringent. As shown by the gray dashed line in Fig. 2, when AP75 is adopted, regardless of the aspect ratio, the angle deviation must be controlled within a specific range; otherwise, the prediction will not be judged as positive. The larger the aspect ratio, the narrower the range.
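As a rough illustration of how the tolerated angle deviation shrinks when moving from AP50 to AP75, the hypothetical helper below reuses the skew_iou() sketch from Sec. III-A to find the largest deviation that still meets a given IoU threshold.

```python
def max_tolerable_deviation(k, iou_thr, step=0.1):
    """Largest angle deviation (degrees) up to which SkewIoU stays >= iou_thr,
    based on the skew_iou() sketch from Sec. III-A."""
    d = 0.0
    while d <= 90.0 and skew_iou(k, d) >= iou_thr:
        d += step
    return d - step

# The tolerated deviation under AP75 is much narrower than under AP50,
# and it narrows further as the aspect ratio grows.
for k in (1.2, 2.0, 4.0, 8.0):
    print(k, max_tolerable_deviation(k, 0.5), max_tolerable_deviation(k, 0.75))
```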

Tab. I compares the accuracy of some oriented object detectors using AP50 and AP75, respectively. It can be seen that all detectors achieve a high performance in terms of AP50 and the gap among them is small. However, the situation becomes different when AP75 is used: some detectors perform worse than others whose AP50 is lower, e.g., S2A-Net vs. Rotated ATSS. Therefore, AP75 better reflects the performance of high-precision oriented object detection.

TABLE I: Accuracy of some oriented object detectors on the DOTA-v1.0 and DIOR-R datasets.

| Method | DOTA-v1.0 AP50 | DOTA-v1.0 AP75 | DOTA-v1.0 AP50:95 | DIOR-R AP50 | DIOR-R AP75 | DIOR-R AP50:95 |
|---|---|---|---|---|---|---|
| Rotated FCOS [34] | 72.45 | 39.84 | 41.02 | 62.00 | 36.10 | 37.61 |
| S2A-Net [25] | 75.29 | 40.08 | 42.00 | 64.50 | 38.24 | 38.02 |
| Rotated Faster RCNN [15] | 73.96 | 43.44 | 42.93 | 63.41 | 41.80 | 39.72 |
| KLD [26] | 73.46 | 44.74 | 43.70 | 64.63 | 41.60 | 40.34 |
| Rotated ATSS [35] | 73.37 | 44.95 | 43.53 | 63.52 | 42.61 | 40.72 |
| GWD [11] | 73.25 | 45.21 | 44.04 | 60.31 | 40.90 | 39.70 |
| Oriented Reppoints [7] | 74.38 | 46.56 | 44.57 | 66.31 | 44.36 | 42.81 |

As for the AP50:95, which is also often adopted in generic detection, it is no doubt a stricter and more comprehensive metric, but we believe that AP75 could reflect high-precision oriented object detection more directly. AP50:95 contains a large number of metrics. Under the metrics AP50 or AP55, when the angle prediction is not accurate, as shown in Fig. 1, a relatively high value could still be reached. Conversely, under the metrics AP90 or AP95, the performance of oriented detectors degenerates a lot, which is of no significance for comparison in current research. Therefore, AP75, as an intermediate metric, is more balanced and suitable for high-precision oriented object detection. Meanwhile, there is an average operation in AP50:95, which makes metrics like AP50 and AP55 have a great influence on the overall value.

In summary, it is necessary and meaningful to pay more attention to high-precision oriented object detection.

IV Method

Figure 4: The framework of the proposed ARS-DETR. ‘GT’ means ground truth. ‘Train Only’ means it only works during the training process and will be removed during the inference.

In this section we mainly design our methods around the relationship between the angle and aspect ratio. We propose a new angle classification method, AR-CSL, to dynamically adjust the smoothing process according to the aspect ratio. Then we adopt the Deformable DETR[14] as the detection architecture and develop it with rotated deformable attention module, denoising training strategy, aspect ratio sensitive matching and loss to adapt to the oriented object detection.

IV-A Overview

Fig. 4 shows the framework of the proposed ARS-DETR. Given an image, a backbone is first used to extract hierarchical feature maps, and the outputs of the last three stages are used. A 1×1 convolution then maps their channels to a uniform dimension, and the lowest-resolution feature map is obtained via a 3×3 convolution on the final feature map. The multi-scale feature maps, embedded with 2D positional encoding, are fed into the Encoder. Without the top-down structure of FPN [36], multi-scale Deformable Attention exchanges information among the multi-scale feature maps and further refines them. The Encoder output then generates a large number of proposals used by the Decoder. Next, the Top-K scoring proposals are picked as object queries and transformed into output embeddings by Multi-Head Attention and multi-scale Rotated Deformable Attention in the Decoder. Finally, the prediction head decodes the output embeddings into class labels, angle labels and horizontal box coordinates. Additionally, during training, we utilize noised ground truth as queries to stabilize training. At the same time, we adopt Aspect Ratio sensitive Matching (ARM) to adjust the influence of the angle in the matching process and Aspect Ratio sensitive Loss (ARL) to adjust the training strategy for objects with different aspect ratios when calculating the angle loss.

IV-B Aspect Ratio Aware Circle Smooth Label

IV-B1 Rethinking on Circular Smooth Label

Instead of using a regression-based loss function, Circular Smooth Label (CSL) [8] transforms angle prediction into a classification task so that the boundary problem naturally disappears. As shown in Fig. 5(a) and Fig. 5(b), CSL divides the angle into 180 categories and treats the first and last angle categories as adjacent to eliminate the impact of boundary discontinuity. Then, it adopts a Gaussian window function to smooth the angle category label of the objects so as to reflect the correlation among adjacent angle categories and provide a certain tolerance for angle estimation error. The expression of CSL is as follows:

\[
CSL(t)=
\begin{cases}
g(t), & \theta-r<t<\theta+r\\
0, & \text{otherwise}
\end{cases}
\qquad(2)
\]

where g(·) is the window function, t is the angle represented by the label, r is the radius of the window function, and θ is the angle of the ground truth.
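For reference in the discussion below, a minimal sketch of the Gaussian-window encoding of Eq. 2 is given here; setting the Gaussian's σ equal to the window radius r is our simplification, not necessarily the exact parameterization of [8].

```python
import numpy as np

def csl_label(theta_bin, radius=6, num_bins=180):
    """Circular Smooth Label (Eq. 2), a simplified sketch.
    theta_bin: ground-truth angle category index in [0, num_bins)."""
    bins = np.arange(num_bins)
    # circular distance, so the first and last categories are treated as adjacent
    diff = np.abs(bins - theta_bin)
    dist = np.minimum(diff, num_bins - diff)
    label = np.exp(-dist.astype(float) ** 2 / (2 * radius ** 2))
    label[dist >= radius] = 0.0  # the window function is zero outside the radius
    return label

print(csl_label(0, radius=6)[:8])  # smooth values near the ground-truth bin
```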

Although CSL has made some progress, it still has three drawbacks that hinder its performance:

  • Fixed label function. CSL adopts a fixed-radius Gaussian function to learn the correlation among adjacent angles and smooth the label, without considering the objects' aspect ratio, as shown in Fig. 5(a). However, it can be clearly seen from Fig. 2 that the SkewIoU of objects with different aspect ratios differs a lot at adjacent angles. Therefore, the correlation among adjacent angles should not be rigid, and a Gaussian window is likely not the best choice for all objects.

  • Angle discrete granularity insensitivity. CSL is also insensitive to the angle discrete granularity. When the angle discrete granularity ω is 1, indicating the angle is divided into 180 categories, the smoothing result is shown in Fig. 5(a). In contrast, when ω is 15, indicating the angle is divided into 12 categories, the smoothing result is shown in Fig. 5(e). It can be seen from these two results that the smoothing outcomes of CSL remain the same under different angle discrete granularities, which is clearly unreasonable. As ω increases, the correlation among adjacent angles becomes weaker, while CSL is insensitive to this and still gives the same smoothing values to the adjacent angle categories. Hence, the correlation among adjacent angles under different angle discrete granularities should also be taken into account.

  • Hyperparameter introduction. The radius of the window function affects the final performance to some extent. As a hyperparameter, it is a thorny problem to determine its best value when ω changes.

IV-B2 Design of Aspect Ratio Aware Circle Smooth Label

According to the above analysis, the fixed window function and hyperparameter (i.e. radius) hurt the applicability of classification-based oriented object detectors to some extent. In this subsection, we will address the aforementioned issues from the perspective of the encoding form.

Considering that SkewIoU can dynamically reflect the correlation among adjacent angles of different objects, we design an Aspect Ratio aware Circle Smooth Label (AR-CSL) technique to obtain a more reasonable angle prediction, using the SkewIoU instead of a fixed window function to smooth the label. Specifically, we calculate the SkewIoU of the bounding boxes under each angle deviation according to Eq. 1, and take the calculated values as the label of the current angle category bin.

Figure 5: The comparison of the two encoding methods for objects with different aspect ratios at each angle deviation: (a) CSL encoding for all objects (flatly unfolded); (b) circular smooth label; (c) AR-CSL encoding for small aspect ratio objects; (d) AR-CSL encoding for large aspect ratio objects; (e) CSL encoding at large angle discrete granularity; (f) AR-CSL encoding at large angle discrete granularity. For the convenience of comparison, the labels in (a) and (c)-(f) are flatly unfolded; otherwise they would be circular like (b). For CSL (a), a Gaussian window with a fixed window radius is adopted to smooth the angle label regardless of the objects' aspect ratio. For AR-CSL (c)-(d), objects with different aspect ratios are considered and a more reasonable smoothing strategy reflects the correlation among adjacent angles. For CSL (e), the angle discrete granularity ω is overlooked and the same smoothing values are given under different ω. For AR-CSL (f), the smoothing values are calculated dynamically according to the angle deviation and vary under different ω.
Figure 6: Two methods to iterate the angle information in DETR. (a) Simple method: although the angle information is updated iteratively after each layer, it is not embedded into DETR. (b) Ours: the angle information is replaced with a new value after each layer and assists in aligning features.

Then, we normalize the SkewIoU values by max-min normalization, as follows:

\[
AR\text{-}CSL(k,t)=\frac{\text{SkewIoU}(k,\Delta\theta)-\text{SkewIoU}(k)_{min}}{1-\text{SkewIoU}(k)_{min}},
\qquad \Delta\theta=\left|t-\theta\right|,
\qquad(3)
\]

where k is the aspect ratio of the ground truth, t is the angle represented by the label, θ is the angle of the ground truth, and Δθ ∈ [0°, 90°] is the angle deviation.
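A minimal sketch of the AR-CSL encoding of Eq. 3 is shown below. It reuses the skew_iou() helper from the sketch in Sec. III-A (an implementation would typically use the closed form of Eq. 1 instead), and the function name is ours for illustration.

```python
import numpy as np

def ar_csl_label(theta_bin, k, omega=1):
    """Aspect Ratio aware Circle Smooth Label (Eq. 3), a sketch.
    theta_bin: ground-truth angle bin; k: aspect ratio (>= 1);
    omega: angle discrete granularity in degrees (180 must be divisible by omega)."""
    num_bins = 180 // omega
    bins = np.arange(num_bins)
    # circular angle deviation in degrees, so Delta theta lies in [0, 90]
    diff = np.abs(bins - theta_bin) * omega
    delta = np.minimum(diff, 180 - diff)
    iou = np.array([skew_iou(k, d) for d in delta])
    return (iou - iou.min()) / (1.0 - iou.min())  # max-min normalization of Eq. 3

print(ar_csl_label(0, k=5.0, omega=6)[:6])  # sharper smoothing for elongated objects
```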

Compared with CSL, the proposed AR-CSL has the following advantages:

  • Dynamic label function. The smoothing values are dynamically calculated according to the aspect ratios of objects by using SkewIoU, as shown in Fig. 5(c)-(d).

  • Angle discrete granularity sensitivity. Because the angle deviation of different angle categories is accounted for according to Eq. 3, the smoothing values of adjacent categories vary with the angle discrete granularity, as shown in Fig. 5(d)-(f).

  • Hyperparameter free. According to Eq. 1 and Eq. 3, no hyperparameters are introduced. This makes the proposed method more convenient to use.

IV-C Rotated Deformable Attention Module

Fig. 6(a) shows a simple DETR-based oriented detector [21, 22]. This detector merely adds an additional angle parameter in the prediction head to accomplish rotated bounding box estimation. Nevertheless, it fails to embed the angle information into the detector to exploit its maximum potential, resulting in feature misalignment. To address this, we present a Rotated Deformable Attention module (RDA).

Given an input feature map x ∈ ℝ^{C×H×W}, let q ∈ Ω_q index a query element with representation feature z_q ∈ ℝ^C and reference box b_q = [p_q, w_q, h_q, θ_q] = [(x_q, y_q), w_q, h_q, θ_q], where p_q = (x_q, y_q) ∈ [0, 1]² is the center point of the reference box and w_q ∈ [0, 1], h_q ∈ [0, 1], θ_q ∈ [-π/4, π/4] are the width, height and angle of the reference box, respectively. The rotated deformable attention feature is calculated as follows:

\[
RDA(z_q, p_q, x)=\sum_{m=1}^{M}W_{m}\left[\sum_{k=1}^{K}A_{mqk}\cdot W_{m}^{\prime}x(p_{q}+\Delta p_{mqk})\right],
\qquad(4)
\]

where m indexes the attention head and k indexes the sampled points. M is the total number of heads and K is the total number of sampled points. We use M = 8 and K = 4 following [14]. W′_m ∈ ℝ^{(C/M)×C} and W_m ∈ ℝ^{C×(C/M)} are learnable weights. Δp_mqk and A_mqk denote the sampling offset and attention weight of the k-th sampling point in the m-th attention head, respectively. The scalar attention weight A_mqk lies in the range [0, 1] and is normalized by Σ_{k=1}^{K} A_mqk = 1. Each Δp_mqk is calculated by:

\[
\Delta p_{mqk}=\frac{(w,h)}{2K}\left(z_{q}f_{mk}+r_{mk}\right)R^{T}(\theta_{q}),
\qquad(5)
\]

where f_mk ∈ ℝ^{C×2} is a projection matrix. r_mk is the bias of the k-th sampling point in the m-th attention head and is calculated by:

\[
r_{mk}=\frac{k\left(\cos\frac{2\pi m}{M},\sin\frac{2\pi m}{M}\right)}{\max\left(\left|\cos\frac{2\pi m}{M}\right|,\left|\sin\frac{2\pi m}{M}\right|\right)},
\qquad(6)
\]

and R(θ_q) = (cos θ_q, -sin θ_q; sin θ_q, cos θ_q)^T is the rotation matrix.

In this way, we obtain the dynamic sampling points p_q + Δp_mqk and constrain them as much as possible within b_q to extract aligned features.
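The core of RDA can be sketched in a few lines of PyTorch: the learned offsets are scaled by the reference box size and rotated by the query angle before being added to the reference point. The tensor layout and the row/column convention of Eq. 5 used below are illustrative assumptions rather than the exact implementation.

```python
import torch

def rotated_sampling_points(ref_points, wh, theta, raw_offsets):
    """Sampling locations p_q + Delta p_mqk used by RDA (cf. Eq. 5); a sketch.
    ref_points: (N, 2) reference centers p_q; wh: (N, 2) box width/height;
    theta: (N,) query angles in radians; raw_offsets: (N, M, K, 2) learned offsets."""
    num_points = raw_offsets.shape[2]
    # constrain the offsets roughly inside the reference box, as in Deformable DETR
    scaled = raw_offsets * wh[:, None, None, :] / (2 * num_points)
    cos, sin = torch.cos(theta), torch.sin(theta)
    rot = torch.stack([torch.stack([cos, -sin], dim=-1),
                       torch.stack([sin, cos], dim=-1)], dim=-2)   # (N, 2, 2)
    rotated = torch.einsum('nmkj,nij->nmki', scaled, rot)          # rotate each offset
    return ref_points[:, None, None, :] + rotated                  # (N, M, K, 2)

# toy usage: 2 queries, 8 heads, 4 points per head
pts = rotated_sampling_points(torch.rand(2, 2), torch.rand(2, 2),
                              torch.rand(2), torch.randn(2, 8, 4, 2))
print(pts.shape)  # torch.Size([2, 8, 4, 2])
```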

Figure 7: Illustration of the misalignment in Deformable Attention and the alignment in Rotated Deformable Attention. (a) Deformable Attention; (b) misalignment; (c) alignment; (d) Rotated Deformable Attention.
Figure 8: Illustration of different sampling methods. The orange rectangles are bounding boxes, the black arrows represent the offset field, the green dots are regular sampling locations, and the blue dots are sampling locations with offsets. (a) Deformable offsets; (b) fixed offsets with rotation in the bounding box; (c) RDA: deformable offsets with rotation in the bounding box.

As depicted in Fig. 7(a), the sampling points in the Deformable Attention module are adjusted according to the corresponding reference box, so that they are restricted within the reference box and fall within the object as far as possible. However, as shown in Fig. 7(b), when the object is oriented, the sampling points cannot accurately align with the object if a horizontal reference box [22] is still used. Therefore, we design the Rotated Deformable Attention module to align the sampling points with the features by rotating the sampling points according to the embedded angle information, as shown in Fig. 7(c) and Fig. 7(d). Moreover, instead of refining the angle layer by layer, we predict a new angle after each layer independently, as shown in Fig. 6(b).

As shown in Fig. 8, we compare two other sampling methods with our RDA. Fig. 8(a) shows the sampling method in [37]. It learns deformable offsets to augment the spatial sampling locations, but it may sample from wrong locations under weak supervision, especially for densely packed objects. Fig. 8(b) shows the sampling method in [4, 25]. It rotates the sampling points with the angle of the bounding box to align features, but the sampling positions are fixed. Our RDA also rotates the sampling points, but it additionally learns deformable offsets and constrains them according to the width and height of the bounding box. Compared with these methods, RDA provides some flexibility while aligning features.

IV-D Denoising Training

It is verified in [17] that the instability of bipartite graph matching in DETR can result in slow convergence and hence hinder the performance. The proposed DETR-based detector learns to refine the coarse object features and boxes iteratively, which can be simulated by the process of reconstructing noisy ground-truth labels and boxes. Denoising Training, as an auxiliary task of denoising labels and boxes, has fixed target assignment results, so it can mitigate the effect of matching instability and accelerate convergence. Besides, to simulate the prediction of both positive and negative samples, both positive and negative noisy targets are generated for each ground-truth target, which provides a more reasonable optimization goal. Additionally, to transfer Denoising Training to oriented object detection, we also design the task of denoising angles for the training procedure.

Given a ground truth gt = (c_gt, b_gt, θ_gt), where c_gt is the class, b_gt = (x_c, y_c, w, h) is the bounding box, and θ_gt is the angle, the noisy ground truth gt_n = (c_n, b_n, θ_n) is obtained as follows.

The noisy labels c_n are generated by randomly selecting part of the ground-truth labels and overlaying the selected labels with arbitrary object labels at a ratio of α. The target labels of positive samples are assigned the ground-truth labels, while those of negative samples are assigned the background category.

The noisy boxes b_n are generated by randomly moving the four boundaries of the ground-truth boxes. Specifically, the ground-truth box b_gt = (x_c, y_c, w, h) is converted into the format b̂_gt = (x_l, y_u, x_r, y_b), where x_l and x_r represent the horizontal coordinates of the left and right boundaries, and y_u and y_b represent the vertical coordinates of the upper and bottom boundaries, respectively. The negative noisy boxes should have a larger noise scale than the positive ones, because farther proposals should predict negative samples [18]. Hence, random noise offsets ε̂ = (Δx_l, Δy_u, Δx_r, Δy_b), where Δx_l, Δx_r ~ U(-βw/2, βw/2) and Δy_u, Δy_b ~ U(-βh/2, βh/2), are generated for positive noisy boxes, while for negative ones, Δx_l, Δx_r ~ U(-βw, -βw/2) ∪ U(βw/2, βw) and Δy_u, Δy_b ~ U(-βh, -βh/2) ∪ U(βh/2, βh). The noisy boxes are calculated as b̂_n = b̂_gt + ε̂ and then converted back into the format b_n = (x_nc, y_nc, w_n, h_n) as initial box proposals. The decoder learns to denoise b_n and reconstruct b_gt.

The noisy angles θ_n are generated by shifting θ_gt with θ_n = f(θ_gt + Δθ), where Δθ ~ U(-γπ, γπ) for positive noisy angles and γ is the angle noise scale, while for negative ones, Δθ ~ U(-2γπ, 2γπ). f(·) is a periodic function that ensures θ_gt + Δθ remains within the defined range.
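The noise generation described above can be summarized by the following sketch; the tensor handling, the exact wrapping function f(·), and the assumed angle range are simplifications for illustration.

```python
import math
import torch

def signed_uniform(n, lo, hi):
    """Sample n values uniformly from (-hi, -lo) U (lo, hi)."""
    mag = lo + (hi - lo) * torch.rand(n)
    sign = torch.randint(0, 2, (n,)).float() * 2 - 1
    return sign * mag

def noisy_gt(boxes, angles, beta=0.4, gamma=0.05, negative=False):
    """Noisy boxes/angles for denoising training (Sec. IV-D); a simplified sketch.
    boxes: (N, 4) as (x_l, y_u, x_r, y_b); angles: (N,) in radians."""
    n = boxes.shape[0]
    w, h = boxes[:, 2] - boxes[:, 0], boxes[:, 3] - boxes[:, 1]
    lo, hi = (beta / 2, beta) if negative else (0.0, beta / 2)
    eps = torch.stack([signed_uniform(n, lo, hi) * w,   # Delta x_l
                       signed_uniform(n, lo, hi) * h,   # Delta y_u
                       signed_uniform(n, lo, hi) * w,   # Delta x_r
                       signed_uniform(n, lo, hi) * h],  # Delta y_b
                      dim=1)
    g = 2 * gamma if negative else gamma
    d_theta = signed_uniform(n, 0.0, g) * math.pi
    # f(.) wraps the angle back into the defined range (a [-pi/2, pi/2) wrap is assumed here)
    noisy_angle = (angles + d_theta + math.pi / 2) % math.pi - math.pi / 2
    return boxes + eps, noisy_angle
```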

IV-E Aspect Ratio Sensitive Matching and Loss

After decoding the output embeddings from the decoder, the predicted results are matched with the targets to compute the loss during training. Let y denote the ground-truth set of oriented objects and ŷ denote the N predictions. Then we calculate the cost between these two sets and search for a permutation of N elements σ ∈ O_N with the lowest cost:

\[
\hat{\sigma}=\underset{\sigma\in O_{N}}{\arg\min}\sum_{i}^{N}L_{match}(y_{i},\hat{y}_{\sigma(i)}),
\qquad(7)
\]

where L_match(y_i, ŷ_σ(i)) is a pair-wise matching cost between the ground truth y_i and the prediction with index σ(i), which takes into account the class prediction, the angle prediction and the similarity of the predicted and ground-truth horizontal boxes. We define it as follows:

Figure 9: Illustration of Aspect Ratio sensitive Matching. ‘GT’ means ground truth.
\[
L_{match}(y_{i},\hat{y}_{\sigma(i)})=\lambda_{cls}\cdot L_{cls}(c_{i},\hat{c}_{\sigma(i)})+\lambda_{box}\cdot L_{box}(b_{i},\hat{b}_{\sigma(i)})+\lambda_{iou}\cdot L_{iou}(b_{i},\hat{b}_{\sigma(i)})+\lambda_{\theta}\cdot L_{\theta}(\theta_{i},\hat{\theta}_{\sigma(i)}),
\qquad(8)
\]

where c is the class label, b is the horizontal box, and θ is the angle. Additionally, L_cls is the focal loss, L_box is the L1 loss, L_iou is the generalized intersection over union (GIoU) loss, and L_θ is the cross-entropy loss.

Considering that objects with larger aspect ratios are more sensitive to the angle, we introduce a dynamic coefficient to adjust the angle cost, named Aspect Ratio sensitive Matching (ARM), which can be formulated as follows:

\[
L(\theta_{i},\hat{\theta}_{\sigma(i)})\to\frac{2k_{i}}{1+k_{i}}L(\theta_{i},\hat{\theta}_{\sigma(i)}),
\qquad(9)
\]

where k_i is the aspect ratio of the target and k_i ≥ 1.

As shown in Fig. 9, the red ground truth is matched to the blue prediction whose cost is the lowest according to Eq. 8. When ARM is introduced, a ground truth with a large aspect ratio considers the angle deviation more heavily in the cost, so it tends to be matched to the prediction whose angle is closer to its own.

The training loss function is defined consistently with Eq. 8. Meanwhile, we also introduce the Aspect Ratio sensitive Loss (ARL), which takes the same form as Eq. 9, to dynamically adjust the angle loss.
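A minimal sketch of the aspect-ratio-sensitive matching is given below. The per-term cost matrices are assumed to be precomputed, and the loss weights λ are illustrative values rather than the exact configuration; ARL applies the same 2k/(1+k) coefficient to the angle term of the training loss.

```python
import torch
from scipy.optimize import linear_sum_assignment

def ar_weight(k):
    """Aspect-ratio-sensitive coefficient 2k / (1 + k) of Eq. 9, with k >= 1."""
    return 2.0 * k / (1.0 + k)

def ars_match(cost_cls, cost_box, cost_iou, cost_angle, gt_aspect_ratio,
              w_cls=2.0, w_box=5.0, w_iou=2.0, w_theta=1.0):
    """Aspect Ratio sensitive Matching (Eq. 7-9); a sketch.
    Each cost_* tensor has shape (num_queries, num_gt); gt_aspect_ratio has shape (num_gt,)."""
    cost_angle = cost_angle * ar_weight(gt_aspect_ratio)[None, :]  # ARM re-weighting per GT
    cost = (w_cls * cost_cls + w_box * cost_box
            + w_iou * cost_iou + w_theta * cost_angle)
    rows, cols = linear_sum_assignment(cost.detach().cpu().numpy())
    return rows, cols  # matched (query index, ground-truth index) pairs
```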

V Experiments

V-A Datasets and Implementation Details

DOTA-v1.0 [2] is one of the largest datasets for oriented object detection, containing 2,806 large aerial images from different sensors and platforms, ranging from around 800 × 800 to 4,000 × 4,000 pixels, and 188,282 instances. It has 15 common categories: Plane (PL), Baseball diamond (BD), Bridge (BR), Ground track field (GTF), Small vehicle (SV), Large vehicle (LV), Ship (SH), Tennis court (TC), Basketball court (BC), Storage tank (ST), Soccer-ball field (SBF), Roundabout (RA), Harbor (HA), Swimming pool (SP), and Helicopter (HC). We use both the training and validation sets for training, and the test set for testing. We divide the images into 1024 × 1024 sub-images with an overlap of 200 pixels. During training, only random horizontal, vertical and diagonal flipping is adopted to avoid over-fitting, and no other tricks are utilized. The performance on the test set is evaluated on the official DOTA evaluation server.

DIOR-R [3] is an aerial image dataset annotated with oriented bounding boxes, built from the DIOR [38] dataset. There are 23,463 images and 192,518 instances in this dataset, covering 20 common categories: Airplane (APL), Airport (APO), Baseball Field (BF), Basketball Court (BC), Bridge (BR), Chimney (CH), Expressway Service Area (ESA), Expressway Toll Station (ETS), Dam (DAM), Golf Field (GF), Ground Track Field (GTF), Harbor (HA), Overpass (OP), Ship (SH), Stadium (STA), Storage Tank (STO), Tennis Court (TC), Train Station (TS), Vehicle (VE) and Windmill (WM). DIOR-R exhibits high variation in object size, both in spatial resolution and in inter-class and intra-class size variability. Different imaging conditions, weather, seasons, and image quality are the major challenges of DIOR-R. We use both the training and validation sets for training and the test set for testing.

OHD-SJTU [33] is a public dataset for oriented object detection and object heading detection. It contains two datasets of different scales, called OHD-SJTU-S and OHD-SJTU-L. OHD-SJTU-S collects 43 large-scene images ranging from 10,000 × 10,000 to 16,000 × 16,000 pixels and 4,125 instances. In contrast, OHD-SJTU-L adds more categories and instances, containing six object categories and 113,435 instances. In line with previous work [24, 33], we divide the images into 600 × 600 sub-images with an overlap of 150 pixels and scale them to 800 × 800.

All models in this paper are implemented with the PyTorch [39] based framework MMRotate [40] and trained with the AdamW [41] optimizer. The initial learning rate is 10^-4 with 2 images per mini-batch, using the '3x' training schedule on DOTA-v1.0, DIOR-R and OHD-SJTU-L and the '9x' training schedule on OHD-SJTU-S, respectively. In addition, we adopt learning rate warm-up for 500 iterations, and the learning rate is divided by 10 at each decay step.
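A schematic of this optimization setup is sketched below; the model is a placeholder, the weight decay is left at the optimizer default, and the decay epochs are illustrative for a 36-epoch ('3x') run rather than the exact configuration.

```python
import torch

model = torch.nn.Linear(8, 8)  # placeholder for the actual ARS-DETR model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def lr_factor(iteration, epoch, warmup_iters=500, decay_epochs=(27, 33)):
    """Linear warm-up for 500 iterations, then divide the learning rate by 10
    at each decay step (the decay epochs shown are illustrative for a 36-epoch run)."""
    if iteration < warmup_iters:
        return (iteration + 1) / warmup_iters
    return 0.1 ** sum(epoch >= e for e in decay_epochs)

# inside the training loop:
# for g in optimizer.param_groups:
#     g["lr"] = 1e-4 * lr_factor(global_iter, epoch)
```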

V-B Ablation Studies

TABLE II: Comparison of AR-CSL and CSL with different radius R on DOTA-v1.0.

| Method | Metric | CSL (R=2) | CSL (R=4) | CSL (R=6) | CSL (R=8) | AR-CSL |
|---|---|---|---|---|---|---|
| Deformable DETR | AP50 | 72.57 | 72.24 | 72.15 | 72.10 | 72.38 |
| Deformable DETR | AP75 | 43.61 | 42.82 | 44.07 | 43.19 | 45.71 |
TABLE III: Comparison of AR-CSL and CSL under different angle discrete granularity ω on DOTA-v1.0.

| Method | Granularity | AP50 | AP75 |
|---|---|---|---|
| CSL (R=2) | ω=1 | 72.57 | 43.61 |
| CSL (R=2) | ω=6 | 72.22 | 42.51 |
| CSL (R=2) | ω=15 | 71.28 | 39.93 |
| CSL (R=2) | ω=30 | 71.39 | 27.62 |
| CSL (R=6) | ω=1 | 72.15 | 44.07 |
| CSL (R=6) | ω=6 | 71.39 | 42.73 |
| CSL (R=6) | ω=15 | 72.06 | 33.13 |
| CSL (R=6) | ω=30 | 71.01 | 19.61 |
| CSL (R=6/√ω) | ω=1 | 72.15 | 44.07 |
| CSL (R=6/√ω) | ω=6 | 71.70 | 43.02 |
| CSL (R=6/√ω) | ω=15 | 70.45 | 40.59 |
| CSL (R=6/√ω) | ω=30 | 70.55 | 28.72 |
| AR-CSL | ω=1 | 72.38 | 45.71 |
| AR-CSL | ω=6 | 72.44 | 44.32 |
| AR-CSL | ω=15 | 71.90 | 41.52 |
| AR-CSL | ω=30 | 71.38 | 30.25 |
TABLE IV: Comparison of different detectors using CSL and AR-CSL on DOTA-v1.0 (ω = 6).

| Method | Angle encoding | AP50 | AP75 |
|---|---|---|---|
| Deformable DETR (R-50) | CSL (R=6) | 71.39 | 42.73 |
| Deformable DETR (R-50) | AR-CSL | 72.44 | 44.32 |
| RetinaNet (R-50) | CSL (R=6) | 67.63 | 38.64 |
| RetinaNet (R-50) | AR-CSL | 67.98 | 39.18 |
| FCOS (R-50) | CSL (R=6) | 70.60 | 38.42 |
| FCOS (R-50) | AR-CSL | 71.60 | 39.74 |

V-B1 Studies on AR-CSL

In this subsection we adopt Deformable DETR as baseline and compare AR-CSL with CSL and other angle classification methods.

Comparison of CSL and AR-CSL on Deformable DETR. We compare the proposed AR-CSL and CSL with different radii R under different angle discrete granularities ω on DOTA-v1.0, based on Deformable DETR, and the results are shown in Tab. II and Tab. III. Firstly, in Tab. II, we set ω to 1 and compare AR-CSL and CSL with different radii R. Fig. 10 shows the visualization of CSL and AR-CSL. The influence of R on CSL is mainly concentrated on high-precision oriented detection, and the maximum gap reaches 1.25% on AP75 (R=6 with 44.07% vs. R=4 with 42.82%). In contrast, AR-CSL achieves the best performance (45.71% in terms of AP75) without tuning any hyperparameters. Secondly, in Tab. III, we set R in CSL to 2 and 6 and compare them with AR-CSL under different angle discrete granularities ω. As ω increases, the angle interval becomes larger and the angle representation of each category becomes more ambiguous, so the AP75 of both CSL and AR-CSL deteriorates rapidly. However, the AP75 of CSL is more sensitive to the change of ω, which requires further tuning of R. When ω is 1 and 6, the best R is 6 (44.07% and 42.73% on AP75, respectively). When ω is 15 and 30, the best R is 2 (39.93% and 27.62% on AP75, respectively). This is because the correlation among adjacent angle categories decreases as ω increases, so a small R is more suitable. On the contrary, AR-CSL dynamically smooths the angle label with the change of ω, so it is less affected by ω and performs better. Furthermore, we also modify R in CSL to R/√ω so that the radius can adjust itself to some extent under different angle discrete granularities. Compared with CSL with a fixed radius, the modified CSL performs better, but it still lags behind AR-CSL.

Figure 10: Visual comparison between CSL and AR-CSL on DOTA-v1.0.
Figure 11: The location of sampling points before (left) and after (right) using RDA module. Each sampling point is marked as a filled circle whose color indicates its attention weight.

Comparison of different detectors using CSL and AR-CSL. We apply CSL and AR-CSL to other detectors to verify the generalization of AR-CSL, with ω set to 6. The experimental results are shown in Tab. IV. When the detector is changed to RetinaNet, AR-CSL achieves 67.98% on AP50 and 39.18% on AP75. When the detector is changed to FCOS, AR-CSL achieves 71.60% on AP50 and 39.74% on AP75. Compared with CSL, AR-CSL also performs well on AP75 when using different detectors.

TABLE V: Comparison of different angle classification methods on DOTA-v1.0 (Deformable DETR).

| Metric | Reg | CSL [8] | POE [12] | AR-BCL [30] | AR-CSL |
|---|---|---|---|---|---|
| AP50 | 69.39 | 72.15 | 72.34 | 72.27 | 72.38 |
| AP75 | 40.79 | 44.07 | 43.75 | 44.67 | 45.71 |
TABLE VI: Ablation study of different angle prediction types and ways on DOTA-v1.0.

| Angle Pred. Type | Angle Pred. Way | AP50 | AP75 |
|---|---|---|---|
| Regression | θ = θ_ref + Δθ_pred | 69.48 | 40.32 |
| Regression | θ = θ_pred | 69.39 | 40.79 |
| Classification (AR-CSL) | θ = θ_ref + Δθ_pred | 70.81 | 40.64 |
| Classification (AR-CSL) | θ = θ_pred | 72.38 | 45.71 |
TABLE VII: Ablation study of ARS-DETR components on DOTA-v1.0.

| DN | RDA | ARM | ARL | AN | AP50 | AP75 |
|---|---|---|---|---|---|---|
| | | | | | 72.38 | 45.71 |
| ✓ | | | | | 73.14 (+0.76) | 47.04 (+1.33) |
| ✓ | ✓ | | | | 72.80 (+0.42) | 48.06 (+2.35) |
| ✓ | | ✓ | | | 73.41 (+1.03) | 47.63 (+1.92) |
| ✓ | | | ✓ | | 73.31 (+0.93) | 47.58 (+1.87) |
| ✓ | | ✓ | ✓ | | 73.43 (+1.05) | 48.13 (+2.42) |
| ✓ | ✓ | ✓ | ✓ | | 73.90 (+1.52) | 48.62 (+2.91) |
| ✓ | ✓ | ✓ | ✓ | ✓ | 74.16 (+1.78) | 49.41 (+3.70) |
TABLE VIII: Comparison of different sampling methods on DOTA-v1.0.

| Method | Sampling | AP50 | AP75 |
|---|---|---|---|
| ARS-DETR | Deformable Offsets | 73.47 | 46.94 |
| ARS-DETR | Fixed Offsets | 73.58 | 48.76 |
| ARS-DETR | RDA | 74.16 | 49.41 |
TABLE IX: Ablation study of different angle noise scale γ on DOTA-v1.0 (ARS-DETR).

| Metric | γ=0.00 | γ=0.01 | γ=0.02 | γ=0.05 | γ=0.07 |
|---|---|---|---|---|---|
| AP50 | 73.90 | 73.46 | 73.91 | 74.16 | 73.72 |
| AP75 | 48.62 | 48.86 | 49.23 | 49.41 | 48.85 |

Comparison with other angle classification methods on Deformable DETR. To explore the effectiveness of our proposed AR-CSL, we also conduct several experiments using different angle classification methods on Deformable DETR, and the results are shown in Tab. V. Reg is the baseline that uses a regression-based method to predict the angle and achieves 69.39% and 40.79% on AP50 and AP75, respectively. CSL [8] transfers the angle prediction to a classification task and utilizes a Gaussian window to smooth the angle label, greatly improving the performance of oriented object detection by 2.76% on AP50 and 3.28% on AP75. The recent POE [12] adopts n-ary codes to predict the angle, but its performance in high-precision oriented object detection is slightly inferior to CSL. AR-BCL [30] considers square-like objects and introduces a bidirectional angle to improve CSL, which further improves AP75 by 0.6%. Our AR-CSL dynamically considers objects with different aspect ratios and achieves 45.71% on AP75, the best among the compared methods.

TABLE X: Comparisons with the advanced oriented detectors on DOTA-v1.0. R-50 indicates ResNet50 [42]. Swin-T indicates Swin-Transformer [43]. ReR-50 indicates ReResNet50 [25]. D-DETR means Deformable DETR [14]. * indicates that the model adopts the 1x training schedule(12 epochs). Red and blue: top two performances.
Method Backbone PL BD BR GTF SV LV SH TC BC ST SBF RA HA SP HC AP50 AP75 AP50:95
AO2-DETR [22] R-50 87.99 79.46 45.74 66.64 78.90 73.90 73.30 90.40 80.55 85.89 55.19 63.62 51.83 70.15 60.04 70.91 22.60 33.31
Rotated D-DETR [14] R-50 84.89 70.71 46.04 61.92 73.99 78.83 87.71 90.07 77.97 78.41 47.07 54.48 66.87 67.66 55.62 69.48 40.32 40.27
Rotated FCOS* [34] R-50 89.06 76.97 47.92 58.55 79.78 76.95 86.90 90.90 84.87 84.58 57.11 64.68 63.69 69.38 46.87 71.88 37.30 39.80
Rotated FCOS [34] R-50 88.52 77.54 47.06 63.78 80.42 80.50 87.34 90.39 77.83 84.13 55.45 65.84 66.02 72.77 49.17 72.45 39.84 41.02
S2A-Net* [25] R-50 89.25 81.19 51.55 71.39 78.61 77.37 86.77 90.89 86.28 84.64 61.21 65.65 66.07 67.57 50.18 73.91 35.52 39.05
S2A-Net [25] R-50 89.26 84.11 51.97 72.78 78.23 79.41 87.46 90.85 85.62 84.09 60.18 65.90 72.54 71.59 55.31 75.29 40.08 42.00
Rotated RetinaNet* [44] R-50 89.64 82.56 38.43 69.83 77.39 62.74 77.24 90.68 83.79 82.04 59.91 64.83 57.37 64.76 45.56 69.79 37.69 39.64
Rotated RetinaNet [44] R-50 87.33 78.91 46.45 69.81 67.72 62.34 73.59 90.85 82.79 79.37 59.62 61.89 65.01 67.76 44.95 69.23 40.96 40.38
Gliding Vertex* [6] R-50 89.20 75.92 51.31 69.56 78.11 75.63 86.87 90.90 85.40 84.77 53.36 66.65 66.31 69.99 54.39 73.22 37.47 39.52
Gliding Vertex [6] R-50 88.71 77.22 52.00 70.85 73.75 74.81 86.55 90.89 80.41 84.63 57.66 62.88 68.49 71.86 58.17 73.26 41.14 41.29
H2RBox [45] R-50 88.16 80.47 40.88 61.27 79.78 75.25 84.40 90.89 80.05 85.35 58.91 68.46 63.67 71.87 47.18 71.77 41.42 41.49
R3Det* [24] R-50 89.29 75.21 45.41 69.23 75.53 72.89 79.28 90.88 81.02 83.25 58.81 63.15 63.40 62.21 37.41 69.80 36.59 37.82
R3Det [24] R-50 89.24 83.32 48.03 72.52 77.52 76.72 86.48 90.89 82.33 83.51 60.96 63.09 67.58 69.27 49.50 73.40 41.69 41.43
KFIoU [27] R-50 89.20 76.40 51.64 70.15 78.31 76.43 87.10 90.88 81.68 82.22 64.65 64.84 66.77 70.68 49.52 73.37 42.71 41.70
Rotated Faster RCNN* [15] R-50 89.25 82.44 50.05 69.34 78.17 73.59 85.91 90.89 84.08 85.50 57.66 60.96 66.25 69.22 57.74 73.40 39.61 40.75
Rotated Faster RCNN [15] R-50 89.09 78.28 48.93 71.54 74.01 74.99 85.90 90.84 86.87 85.03 57.97 69.74 68.10 71.28 56.88 73.96 43.44 42.93
Rotated D-DETR w/ CSL [8] R-50 86.27 76.66 46.64 65.29 76.80 76.32 87.74 90.77 79.38 82.36 54.00 61.47 66.05 70.46 61.97 72.15 44.07 42.72
SASM [46] R-50 87.51 80.15 51.07 70.35 74.95 75.80 84.23 90.90 80.87 84.93 58.51 65.59 69.74 70.18 42.31 72.47 44.21 43.01
KLD [26] R-50 89.08 84.18 43.77 72.33 79.85 73.58 85.69 90.88 85.14 81.96 65.86 64.60 63.60 68.26 53.19 73.46 44.74 43.70
Rotated ATSS* [35] R-50 88.50 77.73 49.60 69.86 76.87 72.52 82.49 90.83 80.30 82.96 62.34 64.67 64.83 66.81 53.97 72.29 37.81 40.05
Rotated ATSS [35] R-50 88.94 79.89 48.71 70.74 75.80 74.02 84.14 90.89 83.19 84.05 60.48 65.06 66.74 70.14 57.78 73.37 44.95 43.53
GWD* [11] R-50 88.92 77.03 45.90 69.30 72.53 64.06 76.40 90.87 79.20 80.45 57.68 64.37 63.60 64.74 48.26 69.55 38.91 39.50
GWD [11] R-50 89.06 80.56 44.27 73.02 79.51 73.53 85.55 90.89 86.21 83.26 63.17 64.24 63.56 69.04 52.92 73.25 45.21 44.04
PSC [28] R-50 89.65 83.80 43.64 70.98 79.00 71.35 85.08 90.90 84.28 82.51 60.64 65.06 62.52 69.61 54.0 72.87 46.18 43.98
CFA [47] R-50 88.34 83.09 51.92 72.23 79.95 78.68 87.25 90.90 85.38 85.71 59.63 63.05 73.33 70.36 47.86 74.51 46.55 44.41
Oriented Reppoints* [7] R-50 87.78 77.67 49.54 66.46 78.51 73.11 86.58 90.86 83.75 84.34 53.14 65.63 63.70 68.71 45.91 71.71 41.39 40.88
Oriented Reppoints [7] R-50 88.52 80.62 52.68 73.04 79.61 80.73 87.76 90.89 81.82 85.33 59.95 64.88 73.81 69.84 46.18 74.38 46.56 44.57
RoI Trans. [4] R-50 88.70 83.66 54.65 72.72 73.77 78.05 87.39 90.90 80.64 84.76 60.73 63.98 77.61 73.32 54.48 75.03 48.86 45.84
RoI Trans. [4] Swin-T 88.44 85.53 54.56 74.55 73.43 78.39 87.64 90.88 87.23 87.11 64.25 63.27 77.93 74.10 60.03 76.49 50.15 47.60
ARS-DETR R-50 86.97 75.56 48.32 69.20 77.92 77.94 87.69 90.50 77.31 82.86 60.28 64.58 74.88 71.76 66.62 74.16 49.41 46.21
ARS-DETR Swin-T 87.65 76.54 50.64 69.85 79.76 83.91 87.92 90.26 86.24 85.09 54.58 67.01 75.62 73.66 63.39 75.47 51.77 47.77
Figure 12: Examples of detection results on the DOTA-v1.0 dataset using ARS-DETR.
TABLE XI: Comparisons with the advanced oriented detectors on DIOR-R. Red and blue: top two performances.
Method Rotated FCOS [34] S2A-Net  [25] R3Det  [24] Gliding Vertex [6] KFIoU [27] SASM [46] GWD [11] KLD [26] Rotated Faster RCNN [15] Rotated ATSS [35] CFA [47] Oriented Reppoints [7] RoI Trans. [4] ARS-DETR
APL 62.89 67.98 62.55 62.67 58.03 64.78 69.68 66.52 63.07 62.19 61.10 67.80 63.28 68.00
APO 41.38 44.44 43.44 38.56 45.41 49.90 28.83 46.80 40.22 44.63 44.93 48.01 46.05 54.17
BF 71.83 71.63 71.72 71.94 69.52 74.94 74.32 71.76 71.89 71.55 77.62 77.02 71.93 74.43
BC 81.00 81.39 81.48 81.20 81.55 80.38 81.49 81.43 81.36 81.42 84.67 85.37 81.33 81.65
BR 38.01 42.66 36.49 37.73 38.82 34.52 29.62 40.81 39.67 41.08 37.69 38.55 43.71 41.13
CH 72.46 72.72 72.63 72.48 73.36 69.21 72.67 78.25 72.51 72.37 75.71 78.45 72.69 75.66
ESA 77.73 79.03 79.50 78.62 78.08 76.28 76.45 79.23 79.19 78.54 82.68 81.13 80.17 81.92
ETS 67.52 70.40 64.41 69.04 66.41 61.37 63.14 66.63 69.45 67.50 72.03 72.06 70.04 73.07
DAM 28.61 27.08 27.02 22.81 25.23 31.66 27.13 29.01 26.00 30.56 33.41 33.67 31.42 34.89
GF 74.58 75.56 77.36 77.89 79.24 72.22 77.19 78.68 77.93 75.69 77.25 76.00 78.00 76.10
GTF 77.04 81.02 77.17 82.13 78.25 77.81 78.94 80.19 82.28 79.11 79.94 79.89 83.48 78.62
HA 40.66 43.41 40.53 46.22 44.67 44.69 39.11 44.88 46.91 42.77 46.20 45.72 49.04 36.33
OP 53.92 56.45 53.33 54.76 54.45 52.08 42.18 57.23 53.90 56.31 54.27 54.27 58.29 55.41
SH 79.41 81.12 79.66 81.03 80.78 83.64 79.10 80.91 81.03 80.92 87.01 85.13 81.17 84.55
STA 66.33 68.00 69.22 74.88 68.40 62.83 70.41 74.17 75.77 67.78 70.43 76.04 77.93 70.09
STO 67.57 70.03 61.10 62.54 64.52 63.91 58.69 68.02 62.54 69.24 69.58 65.27 62.61 72.23
TC 79.88 87.07 81.54 81.41 81.49 80.79 81.52 81.48 81.42 81.62 81.55 85.38 81.40 81.14
TS 48.10 53.88 52.18 54.25 51.64 56.54 47.78 54.63 54.50 55.45 55.51 59.76 56.05 61.52
VE 46.22 51.12 43.57 43.22 46.03 43.58 44.47 47.80 43.17 47.79 49.53 48.02 44.18 50.57
WM 64.79 65.31 64.13 65.13 59.50 63.14 62.63 64.41 65.73 64.10 64.92 68.92 66.44 70.28
AP50 62.00 64.50 61.91 62.91 62.29 62.21 60.31 64.63 63.41 63.52 65.25 66.31 64.97 66.12
AP75 36.10 38.24 38.40 40.00 40.20 40.40 40.90 41.60 41.80 42.61 43.41 44.36 46.02 45.81
AP50:95 37.61 38.02 37.84 38.34 38.52 39.51 39.70 40.34 39.72 40.72 42.18 42.81 43.31 43.89

V-B2 Studies on ARS-DETR

In this subsection, we adopt Deformable DETR + AR-CSL as our baseline and explore the effectiveness of our proposed methods.

Ablation study of different angle prediction ways in DETR. The angle can be predicted either by regression or by classification, and each type can be implemented in two ways in DETR: direct prediction ($\theta=\theta_{pred}$) or residual prediction ($\theta=\theta_{ref}+\Delta\theta_{pred}$). Tab. VI compares the four combinations and shows that direct prediction performs best for both regression and classification. We suspect that the periodicity of the angle gives the residual prediction two possible optimization directions during training, which leads to the boundary problem and thus hinders the performance to some extent.
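To make the two decoder-side options concrete, the following is a minimal PyTorch-style sketch of the direct and residual forms; the function and argument names are ours for illustration and do not reproduce the paper's code.

```python
import torch

def predict_angle(head_output: torch.Tensor,
                  theta_ref: torch.Tensor,
                  residual: bool = False) -> torch.Tensor:
    """Direct vs. residual angle prediction for one decoder layer (illustrative).

    head_output: angle decoded from the layer's head (a regressed value or the
                 decoded CSL/AR-CSL bin); theta_ref: the reference angle carried
                 over from the previous decoder layer.
    """
    if residual:
        # residual form: theta = theta_ref + delta_theta_pred
        return theta_ref + head_output
    # direct form: theta = theta_pred; the reference angle is not reused, which
    # avoids the two conflicting optimization directions caused by angle periodicity
    return head_output
```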

Ablation study of Rotated Deformable Attention Module. To explore the effectiveness of the Rotated Deformable Attention Module (RDA), we perform ablation studies with and without RDA in Tab. VII. By aligning the sampling points with the region features, RDA brings gains of 1.02% and 0.49% on AP75, from 47.04% and 48.13% to 48.06% and 48.62%, respectively. The visualization is shown in Fig. 11. Without RDA, a large number of sampling points are distributed in the background around the objects or on adjacent objects, resulting in misalignment; meanwhile, sampling points with high attention weights are relatively few and densely arranged. In contrast, with RDA, the sampling points basically align with the objects, and the points with high attention weights are more widely distributed, indicating that the model attends to multiple parts of the objects. Besides, as shown in Tab. VIII, RDA also surpasses the other two sampling methods on both AP50 and AP75, exhibiting its superiority.
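As a reference for how the sampling points can be made to follow an oriented object, the snippet below rotates the deformable-attention sampling offsets by the predicted angle before they are added to the reference point. It is a simplified sketch of the rotation step only, with names chosen by us.

```python
import torch

def rotate_sampling_offsets(offsets: torch.Tensor, theta: torch.Tensor) -> torch.Tensor:
    """Rotate deformable-attention sampling offsets by the predicted angle.

    offsets: (..., num_points, 2) learned (dx, dy) offsets around the reference point.
    theta:   (...,) predicted angle in radians, one per query.
    """
    cos, sin = torch.cos(theta)[..., None], torch.sin(theta)[..., None]  # broadcast over points
    dx, dy = offsets[..., 0], offsets[..., 1]
    rx = cos * dx - sin * dy  # standard 2-D rotation
    ry = sin * dx + cos * dy
    return torch.stack([rx, ry], dim=-1)

# The rotated offsets are then added to the reference point to obtain the sampling
# locations, so the sampled region follows the object's orientation.
```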

Ablation study of Aspect Ratio sensitive Matching and Loss. To investigate the contributions of Aspect Ratio sensitive Matching (ARM) and Aspect Ratio sensitive Loss (ARL), we conduct detailed ablation studies in Tab. VII. The results clearly show that ARM and ARL each improve the performance when used independently, by about 0.59% and 0.54% on AP75, respectively. When used together, they further improve the performance by 1.09% on AP75, verifying the effectiveness of ARM and ARL.
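The weighting idea behind ARL can be sketched as follows; the monotone mapping from aspect ratio to weight is a hypothetical choice of ours, meant only to illustrate that square-like objects receive a small angle-loss weight while elongated objects receive a large one.

```python
import torch

def aspect_ratio_weight(w: torch.Tensor, h: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # hypothetical dynamic coefficient: grows with the aspect ratio and approaches 0
    # for square-like boxes (illustrative, not the paper's exact formula)
    ar = torch.maximum(w, h) / torch.clamp(torch.minimum(w, h), min=eps)
    return 1.0 - 1.0 / ar

def weighted_angle_loss(angle_loss: torch.Tensor, w: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
    # per-object angle loss scaled by the aspect-ratio-dependent weight
    return (aspect_ratio_weight(w, h) * angle_loss).mean()
```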

Ablation study of Denoising training. In the experiments, we mainly explore the influence of the angle noise and separate it from the DN strategy, denoting it as AN. The label noise $\alpha$ and box noise $\beta$ are set to 0.5 and 0.4, respectively, the same as in DINO [18]. Tab. VII shows that the basic DN training strategy (adding noise only to the label and box) improves the performance by 0.76% and 1.33%, reaching 73.14% and 47.04% in terms of AP50 and AP75, respectively. Tab. IX shows that additionally adding noise to the angle further improves the performance: when $\gamma$ is set to 0.05, the best performance is obtained, with 74.16% and 49.41% on AP50 and AP75, respectively. The results in Tab. VII and Tab. IX show that noised ground truths can further help the model learn oriented object detection.
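For reference, a minimal sketch of the angle-noising step for the denoising queries is given below; the uniform jitter and the use of $\gamma$ as a fraction of the angle range are our assumptions, analogous to how DN-style training perturbs boxes.

```python
import math
import torch

def add_angle_noise(theta_gt: torch.Tensor, gamma: float = 0.05,
                    angle_range: float = math.pi) -> torch.Tensor:
    # jitter the ground-truth angle of each denoising query by at most
    # gamma * angle_range (uniform noise); gamma = 0.05 gave the best AP in Tab. IX
    noise = (torch.rand_like(theta_gt) * 2.0 - 1.0) * gamma * angle_range
    return theta_gt + noise
```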

Figure 13: Examples of detection results on the DIOR-R dataset using ARS-DETR.
TABLE XII: Comparisons with the advanced oriented detectors on OHD-SJTU-L. Red and blue: top two performances.
Method PL SH SV LV HA HC AP50 AP75 AP50:95
RRPN [23] 89.55 82.60 57.36 72.26 63.01 45.27 68.34 22.03 31.12
R2CNN [48] 90.02 80.83 63.07 64.16 66.36 55.94 70.06 32.70 35.44
RetinaNet-H [24] 90.22 80.04 63.32 63.49 63.73 53.77 69.10 35.90 36.89
R3Det [24] 89.89 87.69 65.20 78.95 57.06 53.50 72.05 36.51 38.57
Rotated Faster RCNN [15] 81.44 88.30 66.58 75.11 67.97 48.19 71.32 37.43 39.28
Rotated ATSS [35] 81.21 88.73 71.67 76.51 71.48 38.50 71.33 39.82 40.37
RetinaNet-R [24] 90.00 86.90 63.24 86.90 62.85 52.35 72.78 40.13 40.58
OHDet [33] 89.73 86.63 61.37 78.80 63.76 54.62 72.49 43.60 41.29
ARS-DETR 87.48 87.76 65.23 78.51 69.43 56.89 74.20 46.08 43.22
TABLE XIII: Comparisons with the advanced oriented detectors on OHD-SJTU-S. Red and blue: top two performances.
Method PL SH AP50 AP75 AP50:95
RRPN [23] 90.14 76.13 83.13 27.87 40.74
R2CNN [48] 90.91 77.66 84.28 55.00 52.80
RetinaNet-H [24] 90.86 66.32 78.59 58.45 53.07
R3Det [24] 90.82 85.59 88.21 67.13 56.19
Rotated Faster RCNN [15] 90.83 79.18 85.01 62.73 52.17
Rotated ATSS [35] 90.81 86.49 88.65 72.53 59.51
RetinaNet-R [24] 90.82 85.59 89.48 74.62 61.86
OHDet [33] 90.74 87.59 89.06 78.55 63.94
ARS-DETR 90.18 89.71 89.95 80.67 65.49

V-C Comparison with state-of-the-art methods

Results on DOTA-v1.0. We report the results of 16 oriented object detectors in Tab. X. Since different methods use different image resolutions, data pre-processing, data augmentation, backbones, training strategies, and various tricks in their original papers, we implement all detectors in MMRotate [40] with the same settings to make the comparison as fair as possible. All results are obtained with single-scale training and testing under the ‘1x’ (12 epochs) or ‘3x’ (36 epochs) training schedule. With R-50 and Swin-T as the backbone, our method obtains 74.16% and 75.47% on AP50, and 49.41% and 51.77% on AP75, respectively; qualitative detection results are shown in Fig. 12. With ResNet50 as the backbone, ARS-DETR does not match many advanced oriented detectors on AP50, but it has an obvious advantage in high-precision detection and surpasses them on AP75. Specifically, ARS-DETR outperforms RoI Trans. by 0.55% (49.41% vs. 48.86%), Oriented Reppoints by 2.85% (49.41% vs. 46.56%), CFA by 2.86% (49.41% vs. 46.55%), and GWD by 4.2% (49.41% vs. 45.21%). In addition, some detectors perform well on AP50 but degrade considerably on AP75 (e.g., S2A-Net, 75.29% on AP50 and 40.08% on AP75), while others are relatively weak on AP50 but perform favorably on AP75 (e.g., PSC, 72.87% on AP50 and 46.18% on AP75), which further shows that AP50 alone is not a suitable measure of detector performance. Moreover, ARS-DETR also exceeds DETR-based detectors such as Rotated Deformable DETR and AO2-DETR.

Results on DIOR-R. The results on the DIOR-R dataset are shown in Tab. XI. All methods adopt the ‘3x’ training schedule and use R-50 as the backbone. ARS-DETR achieves 66.12% and 45.81% on AP50 and AP75, respectively. The visualization is shown in Fig. 13.

Results on OHD-SJTU. We also compare the performance of several oriented object detection methods on OHD-SJTU, mainly including R2CNN, RRPN, RetinaNet, R3Det, and OHDet. Without any bells and whistles, ARS-DETR achieves 46.08% and 80.67% on AP75 on OHD-SJTU-L and OHD-SJTU-S, respectively, surpassing the other advanced oriented object detectors. The detailed results are shown in Tab. XII and Tab. XIII.

Besides, the above comparisons show that AP50 is not accurate enough to represent the performance of oriented object detectors, especially for high-precision oriented detection. In short, this paper advocates the use of more stringent metrics (e.g., AP75) to further study high-precision oriented object detectors.
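As a quick sanity check of this claim, the snippet below (using shapely, with parameters we picked for illustration) measures how fast the IoU between a box and its own rotated copy drops as the aspect ratio grows: a 10:1 box rotated by only 10° already falls below an IoU of 0.5, while a square with the same angle deviation stays well above 0.75, which is exactly why AP50 barely penalizes angle errors on elongated objects.

```python
from shapely import affinity
from shapely.geometry import box

def self_rotation_iou(w: float, h: float, angle_deg: float) -> float:
    # IoU between an axis-aligned w x h box and the same box rotated about its center
    base = box(-w / 2, -h / 2, w / 2, h / 2)
    rot = affinity.rotate(base, angle_deg, origin=(0, 0))
    inter = base.intersection(rot).area
    return inter / (base.area + rot.area - inter)

print(self_rotation_iou(10, 1, 10))  # elongated box: IoU collapses under a small angle error
print(self_rotation_iou(1, 1, 10))   # square box: IoU stays high for the same error
```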

VI Conclusion

In this paper, we analyze in detail the correlation between the angle and objects with different aspect ratios, and identify the flaws of the current metric (i.e., AP50) in high-precision oriented object detection. The widely used AP50 metric has a large tolerance for angle deviation and therefore cannot accurately reflect the performance of oriented object detectors; using the more stringent AP75 metric is more reasonable. We then design an oriented object detector named ARS-DETR and find that dynamically adjusting the angle-classification smoothing, the matching, and the loss calculation in DETR according to the sensitivity of objects with different aspect ratios to the angle can effectively boost the performance. Additionally, aligning the features in DETR’s decoder and adopting the denoising training strategy further adapt DETR to oriented object detection. Compared with other advanced oriented detectors, ARS-DETR achieves higher detection accuracy across various datasets, especially under the more stringent metric. We hope that this method will facilitate future work on high-precision oriented object detection and the application of DETR to oriented object detection.

References

  • [1] Z. Liu, L. Yuan, L. Weng, and Y. Yang, “A high resolution optical satellite image dataset for ship recognition and some new baselines.” in ICPRAM, 2017, pp. 324–331.
  • [2] G.-S. Xia, X. Bai, J. Ding, Z. Zhu, S. Belongie, J. Luo, M. Datcu, M. Pelillo, and L. Zhang, “Dota: A large-scale dataset for object detection in aerial images,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 3974–3983.
  • [3] G. Cheng, J. Wang, K. Li, X. Xie, C. Lang, Y. Yao, and J. Han, “Anchor-free oriented proposal generator for object detection,” IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–11, 2022.
  • [4] J. Ding, N. Xue, Y. Long, G.-S. Xia, and Q. Lu, “Learning roi transformer for oriented object detection in aerial images,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 2849–2858.
  • [5] X. Yang, J. Yang, J. Yan, Y. Zhang, T. Zhang, Z. Guo, X. Sun, and K. Fu, “Scrdet: Towards more robust detection for small, cluttered and rotated objects,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 8232–8241.
  • [6] Y. Xu, M. Fu, Q. Wang, Y. Wang, K. Chen, G.-S. Xia, and X. Bai, “Gliding vertex on the horizontal bounding box for multi-oriented object detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 4, pp. 1452–1459, 2020.
  • [7] W. Li, Y. Chen, K. Hu, and J. Zhu, “Oriented reppoints for aerial object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 1829–1838.
  • [8] X. Yang and J. Yan, “Arbitrary-oriented object detection with circular smooth label,” in European Conference on Computer Vision, 2020, pp. 677–694.
  • [9] X. Yang, L. Hou, Y. Zhou, W. Wang, and J. Yan, “Dense label encoding for boundary discontinuity free rotation detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 15819–15829.
  • [10] J. Wang, F. Li, and H. Bi, “Gaussian focal loss: Learning distribution polarized angle prediction for rotated object detection in aerial images,” IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–13, 2022.
  • [11] X. Yang, J. Yan, Q. Ming, W. Wang, X. Zhang, and Q. Tian, “Rethinking rotated object detection with gaussian wasserstein distance loss,” in International Conference on Machine Learning. PMLR, 2021, pp. 11830–11841.
  • [12] Q. Ming, L. Miao, Z. Zhou, J. Song, Y. Dong, and X. Yang, “Task interleaving and orientation estimation for high-precision oriented object detection in aerial images,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 196, pp. 241–255, 2023.
  • [13] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-end object detection with transformers,” in European conference on computer vision. Springer, 2020, pp. 213–229.
  • [14] X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai, “Deformable detr: Deformable transformers for end-to-end object detection,” in International Conference on Learning Representations, 2021.
  • [15] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” in Advances in Neural Information Processing Systems, 2015, pp. 91–99.
  • [16] D. Meng, X. Chen, Z. Fan, G. Zeng, H. Li, Y. Yuan, L. Sun, and J. Wang, “Conditional detr for fast training convergence,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 3651–3660.
  • [17] F. Li, H. Zhang, S. Liu, J. Guo, L. M. Ni, and L. Zhang, “Dn-detr: Accelerate detr training by introducing query denoising,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 13619–13627.
  • [18] H. Zhang, F. Li, S. Liu, L. Zhang, H. Su, J. Zhu, L. M. Ni, and H.-Y. Shum, “Dino: Detr with improved denoising anchor boxes for end-to-end object detection,” in International Conference on Learning Representations, 2023.
  • [19] Y. Wang, X. Zhang, T. Yang, and J. Sun, “Anchor detr: Query design for transformer-based detector,” in Proceedings of the AAAI conference on artificial intelligence, vol. 36, no. 3, 2022, pp. 2567–2575.
  • [20] S. Liu, F. Li, H. Zhang, X. Yang, X. Qi, H. Su, J. Zhu, and L. Zhang, “Dab-detr: Dynamic anchor boxes are better queries for detr,” arXiv preprint arXiv:2201.12329, 2022.
  • [21] T. Ma, M. Mao, H. Zheng, P. Gao, X. Wang, S. Han, E. Ding, B. Zhang, and D. Doermann, “Oriented object detection with transformer,” arXiv preprint arXiv:2106.03146, 2021.
  • [22] L. Dai, H. Liu, H. Tang, Z. Wu, and P. Song, “Ao2-detr: Arbitrary-oriented object detection transformer,” IEEE Transactions on Circuits and Systems for Video Technology, 2022.
  • [23] J. Ma, W. Shao, H. Ye, L. Wang, H. Wang, Y. Zheng, and X. Xue, “Arbitrary-oriented scene text detection via rotation proposals,” IEEE Transactions on Multimedia, vol. 20, no. 11, pp. 3111–3122, 2018.
  • [24] X. Yang, J. Yan, Z. Feng, and T. He, “R3det: Refined single-stage detector with feature refinement for rotating object,” in AAAI Conference on Artificial Intelligence, vol. 35, no. 4, 2021, pp. 3163–3171.
  • [25] J. Han, J. Ding, J. Li, and G.-S. Xia, “Align deep features for oriented object detection,” IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–11, 2021.
  • [26] X. Yang, X. Yang, J. Yang, Q. Ming, W. Wang, Q. Tian, and J. Yan, “Learning high-precision bounding box for rotated object detection via kullback-leibler divergence,” Advances in Neural Information Processing Systems, vol. 34, pp. 18381–18394, 2021.
  • [27] X. Yang, Y. Zhou, G. Zhang, J. Yang, W. Wang, J. Yan, X. Zhang, and Q. Tian, “The kfiou loss for rotated object detection,” in International Conference on Learning Representations, 2023.
  • [28] Y. Yu and F. Da, “Phase-shifting coder: Predicting accurate orientation in oriented object detection,” in Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023. [Online]. Available: https://arxiv.org/abs/2211.06368
  • [29] H. Wang, Z. Huang, Z. Chen, Y. Song, and W. Li, “Multigrained angle representation for remote-sensing object detection,” IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–13, 2022.
  • [30] Z. Xiao, B. Xu, Y. Zhang, K. Wang, Q. Wan, and X. Tan, “Aspect ratio-based bidirectional label encoding for square-like rotation detection,” IEEE Geoscience and Remote Sensing Letters, 2023.
  • [31] X. Sun, P. Wang, Z. Yan, F. Xu, R. Wang, W. Diao, J. Chen, J. Li, Y. Feng, T. Xu et al., “Fair1m: A benchmark dataset for fine-grained object recognition in high-resolution remote sensing imagery,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 184, pp. 116–130, 2022.
  • [32] X. Yang, H. Sun, X. Sun, M. Yan, Z. Guo, and K. Fu, “Position detection and direction prediction for arbitrary-oriented ships via multitask rotation region convolutional neural network,” IEEE Access, vol. 6, pp. 50839–50849, 2018.
  • [33] X. Yang and J. Yan, “On the arbitrary-oriented object detection: Classification based approaches revisited,” International Journal of Computer Vision, vol. 130, no. 5, pp. 1340–1365, 2022.
  • [34] Z. Tian, C. Shen, H. Chen, and T. He, “Fcos: Fully convolutional one-stage object detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 9627–9636.
  • [35] S. Zhang, C. Chi, Y. Yao, Z. Lei, and S. Z. Li, “Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 9759–9768.
  • [36] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, “Feature pyramid networks for object detection,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 2117–2125.
  • [37] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei, “Deformable convolutional networks,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 764–773.
  • [38] K. Li, G. Wan, G. Cheng, L. Meng, and J. Han, “Object detection in optical remote sensing images: A survey and a new benchmark,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 159, pp. 296–307, 2020.
  • [39] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga et al., “Pytorch: An imperative style, high-performance deep learning library,” in Advances in neural information processing systems, 2019.
  • [40] Y. Zhou, X. Yang, G. Zhang, J. Wang, Y. Liu, L. Hou, X. Jiang, X. Liu, J. Yan, C. Lyu et al., “Mmrotate: A rotated object detection benchmark using pytorch,” in Proceedings of the 30th ACM International Conference on Multimedia, 2022, pp. 7331–7334.
  • [41] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” in International Conference on Learning Representations, 2018.
  • [42] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask r-cnn,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2017, pp. 2961–2969.
  • [43] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10012–10022.
  • [44] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2017, pp. 2980–2988.
  • [45] X. Yang, G. Zhang, W. Li, X. Wang, Y. Zhou, and J. Yan, “H2rbox: Horizontal box annotation is all you need for oriented object detection,” in International Conference on Learning Representations, 2023.
  • [46] L. Hou, K. Lu, J. Xue, and Y. Li, “Shape-adaptive selection and measurement for oriented object detection,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2022, pp. 923–932.
  • [47] Z. Guo, C. Liu, X. Zhang, J. Jiao, X. Ji, and Q. Ye, “Beyond bounding-box: Convex-hull feature adaptation for oriented and densely packed object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 8792–8801.
  • [48] Y. Jiang, X. Zhu, X. Wang, S. Yang, W. Li, H. Wang, P. Fu, and Z. Luo, “R2cnn: rotational region cnn for orientation robust scene text detection,” arXiv preprint arXiv:1706.09579, 2017.