
ARS-DETR: Aspect Ratio-Sensitive Detection Transformer for Aerial Oriented Object Detection

Ying Zeng1, Yushi Chen1, Member, IEEE, Xue Yang2, Qingyun Li1, Junchi Yan2, Senior Member, IEEE
This work was supported by the Natural Science Foundation of China under Grant 62371169, 61971164 and U20B2041 (Corresponding author: Yushi Chen.) Ying Zeng, Yushi Chen and Qingyun Li are with the School of Electronics and Information Engineering, Harbin Institute of Technology, Harbin 150001, China (e-mail: [email protected]; [email protected]; [email protected]) Xue Yang is with OpenGVLab, Shanghai AI Laboratory, Shanghai 200030, China (e-mail: [email protected]) Junchi Yan is with School of Electronic Information and Electrical Engineering, and MoE Key Lab of Artificial Intelligence, Shanghai Jiao Tong University, Shanghai 200030, China (e-mail: [email protected])
Abstract

Oriented object detection in aerial images has progressed considerably in recent years and achieved favorable results. However, high-precision oriented object detection in aerial images remains a challenging task. Some recent works have adopted classification-based methods to predict the angle in order to address the angle boundary problem. However, we have found that these works often neglect the sensitivity of objects with different aspect ratios to the angle. At the same time, it is worth exploring a suitable way to improve the emerging transformer-based approaches in order to adapt them to oriented object detection. In this paper, we propose an Aspect Ratio Sensitive DEtection TRansformer, termed ARS-DETR, for oriented object detection in aerial images. Specifically, a new angle classification method, called Aspect Ratio aware Circle Smooth Label (AR-CSL), is proposed to smooth the angle label in a more reasonable way and discard the hyperparameter introduced by previous work (e.g., CSL). Then, a rotated deformable attention module is designed to rotate the sampling points with the corresponding angles and eliminate the misalignment between region features and sampling points. Moreover, a dynamic weight coefficient based on the aspect ratio is adopted to calculate the angle loss. Comprehensive experiments on several challenging datasets demonstrate that our method achieves a competitive performance in the high-precision oriented object detection task.

Index Terms:
Oriented Object Detection, High-Precision Detection, Detection Transformer, Feature Alignment.

I Introduction

Object detection in aerial images has always been a hot spot in the remote sensing community. With the rapid increase in the number of available high-resolution aerial images [1, 2, 3], accurate and effective detection in these images has become a crucial issue.

Benefiting from the development of deep learning, the emergence of deep Convolutional Neural Networks (CNNs) has greatly influenced the design of detectors and achieved favorable performance in generic object detection. Instead of using handcrafted features for detection, which is cumbersome and not accurate enough, CNNs learn from the training data and update themselves iteratively, exhibiting a strong ability to extract high-level and robust features for accurate detection. Numerous advanced detectors have also been proposed to detect objects using horizontal bounding boxes (HBBs).

Compared with generic images, objects in aerial images often exhibit a wide variety of scales, aspect ratios, and orientations, and are sometimes arranged densely. When HBBs are simply used to detect these objects, they cannot fit the objects well and thus include abundant background or overlap with other objects. Therefore, oriented object detection, which adopts oriented bounding boxes (OBBs) to represent the objects, is more suitable for aerial object detection.

Figure 1: Even though the angle prediction is inaccurate, it still obtains a high performance in terms of AP50.

As a recently emerged task in remote sensing, oriented object detection exhibits a strong ability in analysing the objects in aerial images, and many advanced oriented object detectors have been proposed and achieved favorable performance on aerial images [4, 5, 6, 7]. However, numerous detectors treat oriented object detection as generic detection with an additional angle that needs to be predicted. To achieve this, they simply introduce an extra angle parameter in the detection head. Consequently, their angle predictions are not very accurate, as shown in Fig. 1. Nevertheless, these detectors can still obtain fairly good results under the current metric, i.e., AP50, indicating that AP50 is not accurate enough to reflect the performance of oriented object detectors and that high-precision oriented object detection in aerial images remains a challenging task.

Angle, as a unique parameter in oriented object detection, plays a vital role in high-precision detection. At the same time, the characteristics of the angle also make it difficult to predict and therefore require more attention. Firstly, the periodicity of the angle causes discontinuity at the boundary, leading to suboptimal optimization during training. Secondly, objects with different aspect ratios exhibit varying sensitivities to the angle, which is neglected by most oriented object detectors. Objects with a small aspect ratio, especially those resembling squares, are less sensitive to variations in the angle. Consequently, even when the predicted angle deviates significantly from the ground truth, they can maintain a high Skew Intersection over Union (SkewIoU) between targets and predictions. In contrast, even a slight angle deviation can drastically degrade the SkewIoU when objects have a large aspect ratio.

Among a large number of angle prediction methods, the classification-based method shows a favorable performance [8, 9, 10]. Specifically, it decouples the angle from the bounding boxes and transforms the angle prediction into a classification task, thereby eliminating the boundary problem [11]. Moreover, [12] also shows the strong potential of classification-based methods in high-precision oriented object detection. Nevertheless, some issues remain, such as completely ignoring the correlation between angle and aspect ratio and introducing hyperparameters (e.g., the window radius in [8]). Thus, the accuracy of the angle prediction is hindered to some extent.

Recently, transformer-based detectors [13, 14] have revived the object detection task. Without additional complicated hand-designed components like preset anchors or Non-Maximum Suppression (NMS), the DEtection TRansformer (DETR [13]) regards object detection as a set prediction task and assigns labels by bipartite graph matching, achieving a performance comparable to classical detectors like Faster RCNN [15]. Existing DETR derivatives [14, 16, 17, 18, 19, 20] dramatically improve detection performance and convergence speed, exhibiting the great potential of Transformers for high-precision object detection. Although some DETR-based oriented object detection methods [21, 22] have been proposed, they still use regression to predict the angle and do not take into account the issues caused by boundary discontinuity. Meanwhile, they predict the angle in a simple way and do not explore how to embed angle information into DETR. How to use DETR more naturally in oriented object detection remains an open research topic.

In this paper, we propose an Aspect Ratio Sensitive DEtection TRansformer, called ARS-DETR, to achieve oriented object detection in aerial images. Specifically, a hyperparameter-free Aspect Ratio aware Circle Smooth Label (AR-CSL) is designed to represent the relationship of adjacent angles according to the aspect ratios of objects. Considering the sensitivity of different objects to the angle, AR-CSL uses the SkewIoU of objects with different aspect ratios under each angle deviation to smooth the angle labels. Then, we propose a rotated deformable attention module to embed the angle information into the detector to align the features. Finally, we adopt aspect ratio sensitive matching and loss strategies to dynamically adjust the detector's training, thereby reducing the burden of model training. Extensive experiments on different aerial datasets demonstrate the effectiveness of ARS-DETR in high-precision oriented object detection. In summary, our contributions are four-fold:

  • We analyze the influence of angle deviation in oriented object detection in detail and give the corresponding formula. Additionally, we also analyze the flaws of the current oriented object detection metric (i.e. AP50).

  • A new angle classification method called AR-CSL is designed to smooth angle labels in a more reasonable way. This method adopts the values of the SkewIoU of objects with different aspect ratios under each angle deviation, while also eliminating the hyperparameter of window radius that was introduced by previous work.

  • We propose an angle embedded Rotated Deformable Attention module (RDA) to incorporate the angle information for extracting aligned features. Meanwhile, the Aspect Ratio sensitive Matching (ARM) and Aspect Ratio sensitive Loss (ARL) are developed to adaptively adjust the focus on the angle based on the aspect ratio of the object. In addition, we also combine them with the DeNoising (DN) strategy to further improve the performance of the DETR-based method for oriented object detection.

  • Extensive experiments on three public aerial datasets, DOTA-v1.0, DIOR-R and OHD-SJTU, demonstrate the effectiveness of the proposed model. ARS-DETR achieves a competitive performance on high-precision oriented object detection on all datasets.

II Related Work

II-A Oriented Object Detection

As an emerging task, oriented object detection has made great progress in recent years. A simple solution [23] for the oriented object detection task is to change the anchors or Regions of Interest (RoIs) from the horizontal type to the rotated type. RoI-Transformer [4] constructs a geometry transformation to rotate the proposals to locate objects more accurately. To address the feature misalignment in refined single-stage detectors, R3Det [24] and S2A-Net [25] adopt feature alignment modules to obtain more accurate locations. However, these mainstream regression-based methods often suffer from the boundary problem [8] because predictions fall beyond the defined range, and they need additional complicated treatment. SCRDet [5] designs a novel IoU-Smooth L1 Loss to alleviate the sudden increase in loss caused by angle periodicity and edge exchangeability, which reduces the difficulty of model training.

CSL [8] transforms the prediction of angle from regression to classification, thereby eliminating the boundary problem. It is achieved through the design of a Circle Smooth Label. Gliding vertex [6] glides the vertex of the horizontal bounding box to accurately represent a multi-oriented object. GWD [11], KLD [26] and KFIoU [27] convert the rotated bounding box into a Gaussian distribution to avoid the boundary discontinuity and square-like issue in oriented object detection. PSC [28] provides a unified framework for various periodic fuzzy issues in oriented object detection by mapping rotational periodicity of different cycles into phase of different frequencies.

II-B Angle Classification-based Oriented Object Detection

The classification-based angle prediction method is a novel and effective approach to circumvent the boundary problem while predicting angle accurately, and it has also made a lot of progress. CSL[8] discretizes the angle variable into 180 categories and smooths the angle label via a Gaussian window function. The CSL method directly promotes the development of classification-based oriented object detection algorithms. DCL [9] uses dense coded label to reduce the amount of computation and parameters of CSL. [10] adopts a dynamic weighting mechanism based on CSL to perform precise angle estimation for rotated objects. To overcome the challenges of ambiguity and high costs in angle representation, [29] proposes a multi-grained angle representation method, consisting of coarse-grained angle classification and fine-grained angle regression. TIOE [12] proposes a progressive orientation estimation strategy to approximate the orientation of objects with n-ary codes. AR-BCL [30] uses an aspect ratio-based bidirectional coded label to solve the square-like detection issue [9]. In contrast, the new angle encoding technique proposed in this paper is free from hyperparameters, boundary problem, and square-like issue. Furthermore, it explores the potential of angle classification in high-precision detection, which is overlooked by most of the above methods.

Figure 2: The curves represent the relationship between SkewIoU and angle deviation Δθ under different aspect ratios, where k indicates the aspect ratio. (a) k ≥ 1; (b) 1 ≤ k ≤ 1.5; (c) k > 1.5.

II-C DETR and Its Variants

DETR [13] is a Transformer-based end-to-end object detector that removes hand-designed components such as prior anchor design and NMS. In recent years, DETR has progressed considerably and exhibits strong object detection ability compared with classic detection methods [14, 16, 17, 18, 19, 20]. Deformable DETR [14] proposes a deformable attention module to sample values at adaptive positions around the reference point and utilizes multi-level features to mitigate the slow convergence and high complexity of DETR. DAB-DETR [20] provides explicit positional priors for each query to let the cross-attention module focus on a local region corresponding to a target object by using the anchor box size. DN-DETR [17] and DINO [18] design a denoising auxiliary task that bypasses bipartite graph matching, which not only accelerates training convergence but also leads to better training results. In addition, there have been some DETR-based oriented object detectors [21, 22]. O2DETR [21] is the first attempt to apply DETR to the oriented object detection task, and AO2-DETR [22] introduces an oriented proposal generation and refinement module into the transformer architecture to refine the features. Nevertheless, both of them predict angles in a simple regression manner and do not address boundary discontinuity or embed angle information into DETR.

Figure 3: Two situations for SkewIoU calculation. (a) The situation where Δθ < θ*; (b) the boundary condition between situation 1 and situation 2; (c) the situation where Δθ > θ*.

III Rethinking on Oriented Object Detection

In this section, we analyze the relationship between angle and aspect ratio. Additionally, we also analyze the shortcomings of currently used metric AP50 in oriented object detection and emphasize the importance of high-precision oriented object detection.

III-A Angle and Aspect Ratio

Objects with different aspect ratios have different sensitivities to angles. In order to better observe the relationship between the angle and aspect ratio, we assume that there are two bounding boxes with the same center, width and height, and give the SkewIoU of these two boxes with different aspect ratios under different angle deviations, as shown in Fig. 2. The SkewIoU can be calculated by Eq. 1:

\[
\text{SkewIoU}(k,\Delta\theta)=
\begin{cases}
\dfrac{4k\tan\Delta\theta-m-n}{4k\tan\Delta\theta+m+n}, & \Delta\theta\leq 2\arctan\frac{1}{k}\\[1ex]
\dfrac{1}{2k\sin\Delta\theta-1}, & \Delta\theta>2\arctan\frac{1}{k}
\end{cases}
\qquad(1)
\]
\[
m=\left(1-k\tan\tfrac{\Delta\theta}{2}\right)^{2}\tan^{2}\Delta\theta,
\qquad
n=\left(\frac{-2\sin^{2}\tfrac{\Delta\theta}{2}+k\sin\Delta\theta}{\cos\Delta\theta}\right)^{2},
\]

where k ≥ 1 is the aspect ratio, and Δθ ∈ [0°, 90°] represents the angle deviation, i.e., the absolute value of the angle difference between the two boxes. There is a critical angle boundary threshold (θ* = 2 arctan(1/k)), as shown in Fig. 3. The symmetrical nature depicted in Fig. 2 demonstrates that variations in Δθ are bidirectional.
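To make this relationship concrete, the trend in Fig. 2 can be reproduced numerically. The sketch below is illustrative only and is not part of our implementation: it uses the shapely library to compute the SkewIoU of two same-sized, same-centered boxes for a given aspect ratio k and angle deviation Δθ, and the function name skew_iou is ours for illustration.

```python
from shapely.geometry import Polygon
from shapely import affinity

def skew_iou(k, delta_deg):
    """SkewIoU of two boxes sharing the same center, width and height
    (aspect ratio k = w/h >= 1), one rotated by delta_deg degrees."""
    w, h = float(k), 1.0
    box = Polygon([(-w / 2, -h / 2), (w / 2, -h / 2),
                   (w / 2, h / 2), (-w / 2, h / 2)])
    rotated = affinity.rotate(box, delta_deg, origin=(0.0, 0.0))
    inter = box.intersection(rotated).area
    return inter / (2 * box.area - inter)  # both boxes have the same area

# Reproduce the trend in Fig. 2: small aspect ratios never drop below 0.5,
# large aspect ratios decay quickly as the deviation grows.
for k in (1.0, 1.5, 3.0, 6.0):
    print(k, [round(skew_iou(k, d), 3) for d in (5, 15, 30, 45, 90)])
```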

Fig. 2(a) shows the curves between SkewIoU and angle deviation under different aspect ratios. It can be seen that the SkewIoU variation trends of bounding boxes with different aspect ratios are clearly divided into two types according to the metric AP50 (if the SkewIoU between the prediction box and the ground truth is greater than 0.5, it is judged as a true positive), and the dividing boundary is k = 1.5, as shown in Fig. 2(b) (1 ≤ k ≤ 1.5) and Fig. 2(c) (k > 1.5), respectively. Specifically, Fig. 2(b) shows that when the aspect ratio is smaller than 1.5, the SkewIoU is always greater than 0.5 regardless of the angle deviation (see the pink dashed line). In contrast, when the aspect ratio is greater than 1.5, as shown in Fig. 2(c), the SkewIoU decays rapidly as the angle deviation increases, but the valid angle deviation still covers a wide range. In summary, objects with a small aspect ratio are less sensitive to angle deviation, whereas objects with a large aspect ratio are more sensitive but still exhibit a significant tolerance for angle deviation under AP50.

III-B High-Precision Oriented Object Detection

Angle is a very important parameter in oriented object detection, and the accuracy of its estimation greatly affects subsequent related tasks, such as fine-grained object recognition [31] and object heading estimation [32, 33]; AP50 therefore seems not accurate enough to reflect the performance of high-precision oriented object detection. Hence, we advocate using a more stringent metric, e.g., AP75 (note that achieving a high AP75 is more difficult in oriented object detection than in horizontal object detection, because the rotated bounding box is tighter with fewer redundant areas and thus more sensitive to errors), which is commonly used in generic detection, to measure this challenging task. Under the AP75 metric, not only are the prediction boxes required to be closer to the ground-truth boxes, but the angle prediction requirement is also more stringent. As shown by the gray dashed line in Fig. 2, when AP75 is adopted, regardless of the aspect ratio, the angle deviation must be controlled within a specific range; otherwise, the prediction will not be judged as positive. The larger the aspect ratio, the narrower the range.
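As a rough illustration of how the tolerated angle deviation shrinks when moving from AP50 to AP75, the hypothetical helper below reuses the skew_iou() sketch from Sec. III-A to find the largest deviation that still meets a given IoU threshold.

```python
def max_tolerable_deviation(k, iou_thr, step=0.1):
    """Largest angle deviation (degrees) up to which SkewIoU stays >= iou_thr,
    based on the skew_iou() sketch from Sec. III-A."""
    d = 0.0
    while d <= 90.0 and skew_iou(k, d) >= iou_thr:
        d += step
    return d - step

# The tolerated deviation under AP75 is much narrower than under AP50,
# and it narrows further as the aspect ratio grows.
for k in (1.2, 2.0, 4.0, 8.0):
    print(k, max_tolerable_deviation(k, 0.5), max_tolerable_deviation(k, 0.75))
```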

Tab. I compares the accuracy of some oriented object detectors using AP50 and AP75, respectively. It can be seen that all detectors achieve a high performance in terms of AP50 and the gap among them is small. However, the situation becomes different when AP75 is used: some detectors perform worse than others whose AP50 is lower, e.g., S2A-Net vs. Rotated ATSS. Therefore, AP75 better reflects the performance of high-precision oriented object detection.

TABLE I: Accuracy of some oriented object detectors on the DOTA-v1.0 and DIOR-R datasets.

| Method | DOTA-v1.0 AP50 | DOTA-v1.0 AP75 | DOTA-v1.0 AP50:95 | DIOR-R AP50 | DIOR-R AP75 | DIOR-R AP50:95 |
|---|---|---|---|---|---|---|
| Rotated FCOS [34] | 72.45 | 39.84 | 41.02 | 62.00 | 36.10 | 37.61 |
| S2A-Net [25] | 75.29 | 40.08 | 42.00 | 64.50 | 38.24 | 38.02 |
| Rotated Faster RCNN [15] | 73.96 | 43.44 | 42.93 | 63.41 | 41.80 | 39.72 |
| KLD [26] | 73.46 | 44.74 | 43.70 | 64.63 | 41.60 | 40.34 |
| Rotated ATSS [35] | 73.37 | 44.95 | 43.53 | 63.52 | 42.61 | 40.72 |
| GWD [11] | 73.25 | 45.21 | 44.04 | 60.31 | 40.90 | 39.70 |
| Oriented Reppoints [7] | 74.38 | 46.56 | 44.57 | 66.31 | 44.36 | 42.81 |

As for the AP50:95, which is also often adopted in generic detection, it is no doubt a stricter and more comprehensive metric, but we believe that AP75 could reflect high-precision oriented object detection more directly. AP50:95 contains a large number of metrics. Under the metrics AP50 or AP55, when the angle prediction is not accurate, as shown in Fig. 1, a relatively high value could still be reached. Conversely, under the metrics AP90 or AP95, the performance of oriented detectors degenerates a lot, which is of no significance for comparison in current research. Therefore, AP75, as an intermediate metric, is more balanced and suitable for high-precision oriented object detection. Meanwhile, there is an average operation in AP50:95, which makes metrics like AP50 and AP55 have a great influence on the overall value.

In summary, it is necessary and meaningful to pay more attention to high-precision oriented object detection.

IV Method

Figure 4: The framework of the proposed ARS-DETR. ‘GT’ means ground truth. ‘Train Only’ means it only works during the training process and will be removed during the inference.

In this section we mainly design our methods around the relationship between the angle and aspect ratio. We propose a new angle classification method, AR-CSL, to dynamically adjust the smoothing process according to the aspect ratio. Then we adopt the Deformable DETR[14] as the detection architecture and develop it with rotated deformable attention module, denoising training strategy, aspect ratio sensitive matching and loss to adapt to the oriented object detection.

IV-A Overview

Fig. 4 shows the framework of the proposed ARS-DETR. Given an image, a backbone is first used to extract hierarchical feature maps, and the outputs of the last three stages are used. A 1×1 convolution then maps their channels to a uniform dimension, and the lowest-resolution feature map is obtained via a 3×3 convolution on the final feature map. The multi-scale feature maps, embedded with 2D positional encoding, are fed into the Encoder. Without the top-down structure of FPN [36], multi-scale Deformable Attention exchanges information among the multi-scale feature maps and further refines them. The Encoder output then generates a large number of proposals used by the Decoder. Next, the Top-K scoring proposals are picked as object queries and transformed into output embeddings by Multi-Head Attention and multi-scale Rotated Deformable Attention in the Decoder. Finally, the prediction head decodes the output embeddings into class labels, angle labels and horizontal box coordinates. Additionally, during training, we utilize noised ground truth as queries to stabilize training. At the same time, we adopt Aspect Ratio sensitive Matching (ARM) to adjust the influence of the angle in the matching process and Aspect Ratio sensitive Loss (ARL) to adjust the training strategy for objects with different aspect ratios when calculating the angle loss.

IV-B Aspect Ratio Aware Circle Smooth Label

IV-B1 Rethinking on Circular Smooth Label

Instead of using a regression-based loss function, Circular Smooth Label (CSL) [8] transforms angle prediction into a classification task so that the boundary problem naturally disappears. As shown in Fig. 5(a) and Fig. 5(b), CSL divides the angle into 180 categories and treats the first and last angle categories as adjacent to eliminate the impact of boundary discontinuity. Then, it adopts a Gaussian window function to smooth the angle category label of the objects so as to reflect the correlation among adjacent angle categories and provide a certain tolerance for angle estimation error. The expression of CSL is as follows:

\[
CSL(t)=
\begin{cases}
g(t), & \theta-r<t<\theta+r\\
0, & \text{otherwise}
\end{cases}
\qquad(2)
\]

where g(·) is the window function, t is the angle represented by the label, r is the radius of the window function, and θ is the angle of the ground truth.
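For reference in the discussion below, a minimal sketch of the Gaussian-window encoding of Eq. 2 is given here; setting the Gaussian's σ equal to the window radius r is our simplification, not necessarily the exact parameterization of [8].

```python
import numpy as np

def csl_label(theta_bin, radius=6, num_bins=180):
    """Circular Smooth Label (Eq. 2), a simplified sketch.
    theta_bin: ground-truth angle category index in [0, num_bins)."""
    bins = np.arange(num_bins)
    # circular distance, so the first and last categories are treated as adjacent
    diff = np.abs(bins - theta_bin)
    dist = np.minimum(diff, num_bins - diff)
    label = np.exp(-dist.astype(float) ** 2 / (2 * radius ** 2))
    label[dist >= radius] = 0.0  # the window function is zero outside the radius
    return label

print(csl_label(0, radius=6)[:8])  # smooth values near the ground-truth bin
```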

Although CSL has made some progress, it still has three drawbacks that hinder its performance:

  • Fixed label function. CSL adopts a fixed-radius Gaussian function to learn the correlation among adjacent angles and smooth the label, without considering the objects' aspect ratio, as shown in Fig. 5(a). However, it can be clearly seen from Fig. 2 that the SkewIoU of objects with different aspect ratios differs a lot at adjacent angles. Therefore, the correlation among adjacent angles should not be rigid, and a Gaussian window is likely not the best choice for all objects.

  • Angle discrete granularity insensitivity. CSL is also insensitive to the angle discrete granularity. When the angle discrete granularity ω is 1, indicating the angle is divided into 180 categories, the smoothing result is shown in Fig. 5(a). In contrast, when ω is 15, indicating the angle is divided into 12 categories, the smoothing result is shown in Fig. 5(e). It can be seen from these two results that the smoothing outcomes of CSL remain the same under different angle discrete granularities, which is clearly unreasonable. As ω increases, the correlation among adjacent angles becomes weaker, while CSL is insensitive to this and still gives the same smoothing values to the adjacent angle categories. Hence, the correlation among adjacent angles under different angle discrete granularities should also be taken into account.

  • Hyperparameter introduction. The radius of the window function affects the final performance to some extent. As a hyperparameter, it is a thorny problem to determine its best value when ω changes.

IV-B2 Design of Aspect Ratio Aware Circle Smooth Label

According to the above analysis, the fixed window function and hyperparameter (i.e. radius) hurt the applicability of classification-based oriented object detectors to some extent. In this subsection, we will address the aforementioned issues from the perspective of the encoding form.

Considering that SkewIoU can dynamically reflect the correlation among adjacent angles of different objects, we design an Aspect Ratio aware Circle Smooth Label (AR-CSL) technique to obtain a more reasonable angle prediction, using the SkewIoU instead of a fixed window function to smooth the label. Specifically, we calculate the SkewIoU of the bounding boxes under each angle deviation according to Eq. 1, and take the calculated values as the label of the current angle category bin.

Figure 5: The comparison of the two encoding methods for objects with different aspect ratios at each angle deviation: (a) CSL encoding for all objects (flatly unfolded); (b) circular smooth label; (c) AR-CSL encoding for small aspect ratio objects; (d) AR-CSL encoding for large aspect ratio objects; (e) CSL encoding at large angle discrete granularity; (f) AR-CSL encoding at large angle discrete granularity. For the convenience of comparison, the labels in (a) and (c)-(f) are flatly unfolded; otherwise they would be circular like (b). For CSL (a), a Gaussian window with a fixed window radius is adopted to smooth the angle label regardless of the objects' aspect ratio. For AR-CSL (c)-(d), objects with different aspect ratios are considered and a more reasonable smoothing strategy reflects the correlation among adjacent angles. For CSL (e), the angle discrete granularity ω is overlooked and the same smoothing values are given under different ω. For AR-CSL (f), the smoothing values are calculated dynamically according to the angle deviation and vary under different ω.
Figure 6: Two methods to iterate the angle information in DETR. (a) Simple method: although the angle information is updated iteratively after each layer, it is not embedded into DETR. (b) Ours: the angle information is replaced with a new value after each layer and assists in aligning features.

Then, we normalize the SkewIoU values by max-min normalization, as follows:

\[
AR\text{-}CSL(k,t)=\frac{\text{SkewIoU}(k,\Delta\theta)-\text{SkewIoU}(k)_{min}}{1-\text{SkewIoU}(k)_{min}},
\qquad \Delta\theta=\left|t-\theta\right|,
\qquad(3)
\]

where k is the aspect ratio of the ground truth, t is the angle represented by the label, θ is the angle of the ground truth, and Δθ ∈ [0°, 90°] is the angle deviation.
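A minimal sketch of the AR-CSL encoding of Eq. 3 is shown below. It reuses the skew_iou() helper from the sketch in Sec. III-A (an implementation would typically use the closed form of Eq. 1 instead), and the function name is ours for illustration.

```python
import numpy as np

def ar_csl_label(theta_bin, k, omega=1):
    """Aspect Ratio aware Circle Smooth Label (Eq. 3), a sketch.
    theta_bin: ground-truth angle bin; k: aspect ratio (>= 1);
    omega: angle discrete granularity in degrees (180 must be divisible by omega)."""
    num_bins = 180 // omega
    bins = np.arange(num_bins)
    # circular angle deviation in degrees, so Delta theta lies in [0, 90]
    diff = np.abs(bins - theta_bin) * omega
    delta = np.minimum(diff, 180 - diff)
    iou = np.array([skew_iou(k, d) for d in delta])
    return (iou - iou.min()) / (1.0 - iou.min())  # max-min normalization of Eq. 3

print(ar_csl_label(0, k=5.0, omega=6)[:6])  # sharper smoothing for elongated objects
```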

Compared with CSL, the proposed AR-CSL has the following advantages:

  • Dynamic label function. The smoothing values are dynamically calculated according to the aspect ratios of objects by using SkewIoU, as shown in Fig. 5(c)-(d).

  • Angle discrete granularity sensitivity. Because the angle deviation of different angle categories is accounted for according to Eq. 3, the smoothing values of adjacent categories vary with the angle discrete granularity, as shown in Fig. 5(d)-(f).

  • Hyperparameter free. According to Eq. 1 and Eq. 3, no hyperparameters are introduced. This makes the proposed method more convenient to use.

IV-C Rotated Deformable Attention Module

Fig. 6(a) shows a simple DETR-based oriented detector [21, 22]. This detector merely adds an additional angle parameter in the prediction head to accomplish rotated bounding box estimation. Nevertheless, it fails to embed the angle information into the detector to exploit its maximum potential, resulting in feature misalignment. To address this, we present a Rotated Deformable Attention module (RDA).

Given an input feature map x ∈ ℝ^{C×H×W}, let q ∈ Ω_q index a query element with representation feature z_q ∈ ℝ^C and reference box b_q = [p_q, w_q, h_q, θ_q] = [(x_q, y_q), w_q, h_q, θ_q], where p_q = (x_q, y_q) ∈ [0, 1]² is the center point of the reference box and w_q ∈ [0, 1], h_q ∈ [0, 1], θ_q ∈ [-π/4, π/4] are the width, height and angle of the reference box, respectively. The rotated deformable attention feature is calculated as follows:

\[
RDA(z_q, p_q, x)=\sum_{m=1}^{M}W_{m}\left[\sum_{k=1}^{K}A_{mqk}\cdot W_{m}^{\prime}x(p_{q}+\Delta p_{mqk})\right],
\qquad(4)
\]

where m indexes the attention head and k indexes the sampled points. M is the total number of heads and K is the total number of sampled points. We use M = 8 and K = 4 following [14]. W′_m ∈ ℝ^{(C/M)×C} and W_m ∈ ℝ^{C×(C/M)} are learnable weights. Δp_mqk and A_mqk denote the sampling offset and attention weight of the k-th sampling point in the m-th attention head, respectively. The scalar attention weight A_mqk lies in the range [0, 1] and is normalized by Σ_{k=1}^{K} A_mqk = 1. Each Δp_mqk is calculated by:

\[
\Delta p_{mqk}=\frac{(w,h)}{2K}\left(z_{q}f_{mk}+r_{mk}\right)R^{T}(\theta_{q}),
\qquad(5)
\]

where f_mk ∈ ℝ^{C×2} is a projection matrix. r_mk is the bias of the k-th sampling point in the m-th attention head and is calculated by:

\[
r_{mk}=\frac{k\left(\cos\frac{2\pi m}{M},\sin\frac{2\pi m}{M}\right)}{\max\left(\left|\cos\frac{2\pi m}{M}\right|,\left|\sin\frac{2\pi m}{M}\right|\right)},
\qquad(6)
\]

and R(θ_q) = (cos θ_q, -sin θ_q; sin θ_q, cos θ_q)^T is the rotation matrix.

In this way, we obtain the dynamic sampling points p_q + Δp_mqk and constrain them as much as possible within b_q to extract aligned features.
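The core of RDA can be sketched in a few lines of PyTorch: the learned offsets are scaled by the reference box size and rotated by the query angle before being added to the reference point. The tensor layout and the row/column convention of Eq. 5 used below are illustrative assumptions rather than the exact implementation.

```python
import torch

def rotated_sampling_points(ref_points, wh, theta, raw_offsets):
    """Sampling locations p_q + Delta p_mqk used by RDA (cf. Eq. 5); a sketch.
    ref_points: (N, 2) reference centers p_q; wh: (N, 2) box width/height;
    theta: (N,) query angles in radians; raw_offsets: (N, M, K, 2) learned offsets."""
    num_points = raw_offsets.shape[2]
    # constrain the offsets roughly inside the reference box, as in Deformable DETR
    scaled = raw_offsets * wh[:, None, None, :] / (2 * num_points)
    cos, sin = torch.cos(theta), torch.sin(theta)
    rot = torch.stack([torch.stack([cos, -sin], dim=-1),
                       torch.stack([sin, cos], dim=-1)], dim=-2)   # (N, 2, 2)
    rotated = torch.einsum('nmkj,nij->nmki', scaled, rot)          # rotate each offset
    return ref_points[:, None, None, :] + rotated                  # (N, M, K, 2)

# toy usage: 2 queries, 8 heads, 4 points per head
pts = rotated_sampling_points(torch.rand(2, 2), torch.rand(2, 2),
                              torch.rand(2), torch.randn(2, 8, 4, 2))
print(pts.shape)  # torch.Size([2, 8, 4, 2])
```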

Figure 7: Illustration of the misalignment in Deformable Attention and the alignment in Rotated Deformable Attention. (a) Deformable Attention; (b) misalignment; (c) alignment; (d) Rotated Deformable Attention.
Figure 8: Illustration of different sampling methods. The orange rectangles are bounding boxes, the black arrows represent the offset field, the green dots are regular sampling locations, and the blue dots are sampling locations with offsets. (a) Deformable offsets; (b) fixed offsets with rotation in the bounding box; (c) RDA: deformable offsets with rotation in the bounding box.

As depicted in Fig. 7(a), the sampling points in the Deformable Attention module are adjusted according to the corresponding reference box, so that they are restricted within the reference box and fall within the object as far as possible. However, as shown in Fig. 7(b), when the object is oriented, the sampling points cannot accurately align with the object if a horizontal reference box [22] is still used. Therefore, we design the Rotated Deformable Attention module to align the sampling points with the features by rotating the sampling points according to the embedded angle information, as shown in Fig. 7(c) and Fig. 7(d). Moreover, instead of refining the angle layer by layer, we predict a new angle after each layer independently, as shown in Fig. 6(b).

As shown in Fig. 8, we compare two other sampling methods with our RDA. Fig. 8(a) shows the sampling method in [37]. It learns deformable offsets to augment the spatial sampling locations, but it may sample from wrong locations under weak supervision, especially for densely packed objects. Fig. 8(b) shows the sampling method in [4, 25]. It rotates the sampling points with the angle of the bounding box to align features, but the sampling positions are fixed. Our RDA also rotates the sampling points, but it additionally learns deformable offsets and constrains them according to the width and height of the bounding box. Compared with these methods, RDA provides some flexibility while aligning features.

IV-D Denoising Training

It is verified in [17] that the instability of bipartite graph matching in DETR can result in slow convergence and hence hinder the performance. The proposed DETR-based detector learns to refine the coarse object features and boxes iteratively, which can be simulated by the process of reconstructing noisy ground-truth labels and boxes. Denoising Training, as an auxiliary task of denoising labels and boxes, has fixed target assignment results, so it can mitigate the effect of matching instability and accelerate convergence. Besides, to simulate the prediction of both positive and negative samples, both positive and negative noisy targets are generated for each ground-truth target, which provides a more reasonable optimization goal. Additionally, to transfer Denoising Training to oriented object detection, we also design the task of denoising angles for the training procedure.

Given a ground truth gt = (c_gt, b_gt, θ_gt), where c_gt is the class, b_gt = (x_c, y_c, w, h) is the bounding box, and θ_gt is the angle, the noisy ground truth gt_n = (c_n, b_n, θ_n) is obtained as follows.

The noisy labels c_n are generated by randomly selecting part of the ground-truth labels and overlaying the selected labels with arbitrary object labels at a ratio of α. The target labels of positive samples are assigned the ground-truth labels, while those of negative samples are assigned the background category.

The noisy boxes b_n are generated by randomly moving the four boundaries of the ground-truth boxes. Specifically, the ground-truth box b_gt = (x_c, y_c, w, h) is converted into the format b̂_gt = (x_l, y_u, x_r, y_b), where x_l and x_r represent the horizontal coordinates of the left and right boundaries, and y_u and y_b represent the vertical coordinates of the upper and bottom boundaries, respectively. The negative noisy boxes should have a larger noise scale than the positive ones, because farther proposals should predict negative samples [18]. Hence, random noise offsets ε̂ = (Δx_l, Δy_u, Δx_r, Δy_b), where Δx_l, Δx_r ~ U(-βw/2, βw/2) and Δy_u, Δy_b ~ U(-βh/2, βh/2), are generated for positive noisy boxes, while for negative ones, Δx_l, Δx_r ~ U(-βw, -βw/2) ∪ U(βw/2, βw) and Δy_u, Δy_b ~ U(-βh, -βh/2) ∪ U(βh/2, βh). The noisy boxes are calculated as b̂_n = b̂_gt + ε̂ and then converted back into the format b_n = (x_nc, y_nc, w_n, h_n) as initial box proposals. The decoder learns to denoise b_n and reconstruct b_gt.

The noisy angles θ_n are generated by shifting θ_gt with θ_n = f(θ_gt + Δθ), where Δθ ~ U(-γπ, γπ) for positive noisy angles and γ is the angle noise scale, while for negative ones, Δθ ~ U(-2γπ, 2γπ). f(·) is a periodic function that ensures θ_gt + Δθ remains within the defined range.
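The noise generation described above can be summarized by the following sketch; the tensor handling, the exact wrapping function f(·), and the assumed angle range are simplifications for illustration.

```python
import math
import torch

def signed_uniform(n, lo, hi):
    """Sample n values uniformly from (-hi, -lo) U (lo, hi)."""
    mag = lo + (hi - lo) * torch.rand(n)
    sign = torch.randint(0, 2, (n,)).float() * 2 - 1
    return sign * mag

def noisy_gt(boxes, angles, beta=0.4, gamma=0.05, negative=False):
    """Noisy boxes/angles for denoising training (Sec. IV-D); a simplified sketch.
    boxes: (N, 4) as (x_l, y_u, x_r, y_b); angles: (N,) in radians."""
    n = boxes.shape[0]
    w, h = boxes[:, 2] - boxes[:, 0], boxes[:, 3] - boxes[:, 1]
    lo, hi = (beta / 2, beta) if negative else (0.0, beta / 2)
    eps = torch.stack([signed_uniform(n, lo, hi) * w,   # Delta x_l
                       signed_uniform(n, lo, hi) * h,   # Delta y_u
                       signed_uniform(n, lo, hi) * w,   # Delta x_r
                       signed_uniform(n, lo, hi) * h],  # Delta y_b
                      dim=1)
    g = 2 * gamma if negative else gamma
    d_theta = signed_uniform(n, 0.0, g) * math.pi
    # f(.) wraps the angle back into the defined range (a [-pi/2, pi/2) wrap is assumed here)
    noisy_angle = (angles + d_theta + math.pi / 2) % math.pi - math.pi / 2
    return boxes + eps, noisy_angle
```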

IV-E Aspect Ratio Sensitive Matching and Loss

After decoding the output embeddings from the decoder, the predicted results are matched with the targets to compute the loss during training. Let y denote the ground-truth set of oriented objects and ŷ denote the N predictions. Then we calculate the cost between these two sets and search for a permutation of N elements σ ∈ O_N with the lowest cost:

\[
\hat{\sigma}=\underset{\sigma\in O_{N}}{\arg\min}\sum_{i}^{N}L_{match}(y_{i},\hat{y}_{\sigma(i)}),
\qquad(7)
\]

where L_match(y_i, ŷ_σ(i)) is a pair-wise matching cost between the ground truth y_i and the prediction with index σ(i), which takes into account the class prediction, the angle prediction and the similarity of the predicted and ground-truth horizontal boxes. We define it as follows:

Figure 9: Illustration of Aspect Ratio sensitive Matching. ‘GT’ means ground truth.
\[
L_{match}(y_{i},\hat{y}_{\sigma(i)})=\lambda_{cls}\cdot L_{cls}(c_{i},\hat{c}_{\sigma(i)})+\lambda_{box}\cdot L_{box}(b_{i},\hat{b}_{\sigma(i)})+\lambda_{iou}\cdot L_{iou}(b_{i},\hat{b}_{\sigma(i)})+\lambda_{\theta}\cdot L_{\theta}(\theta_{i},\hat{\theta}_{\sigma(i)}),
\qquad(8)
\]

where c is the class label, b is the horizontal box, and θ is the angle. Additionally, L_cls is the focal loss, L_box is the L1 loss, L_iou is the generalized intersection over union (GIoU) loss, and L_θ is the cross-entropy loss.

Considering that objects with larger aspect ratios are more sensitive to the angle, we introduce a dynamic coefficient to adjust the angle cost, named Aspect Ratio sensitive Matching (ARM), which can be formulated as follows:

\[
L(\theta_{i},\hat{\theta}_{\sigma(i)})\to\frac{2k_{i}}{1+k_{i}}L(\theta_{i},\hat{\theta}_{\sigma(i)}),
\qquad(9)
\]

where k_i is the aspect ratio of the target and k_i ≥ 1.

As shown in Fig. 9, the red ground truth is matched to the blue prediction whose cost is the lowest according to Eq. 8. When ARM is introduced, a ground truth with a large aspect ratio considers the angle deviation more heavily in the cost, so it tends to be matched to the prediction whose angle is closer to its own.

The training loss function is defined consistently with Eq. 8. Meanwhile, we also introduce the Aspect Ratio sensitive Loss (ARL), which takes the same form as Eq. 9, to dynamically adjust the angle loss.
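A minimal sketch of the aspect-ratio-sensitive matching is given below. The per-term cost matrices are assumed to be precomputed, and the loss weights λ are illustrative values rather than the exact configuration; ARL applies the same 2k/(1+k) coefficient to the angle term of the training loss.

```python
import torch
from scipy.optimize import linear_sum_assignment

def ar_weight(k):
    """Aspect-ratio-sensitive coefficient 2k / (1 + k) of Eq. 9, with k >= 1."""
    return 2.0 * k / (1.0 + k)

def ars_match(cost_cls, cost_box, cost_iou, cost_angle, gt_aspect_ratio,
              w_cls=2.0, w_box=5.0, w_iou=2.0, w_theta=1.0):
    """Aspect Ratio sensitive Matching (Eq. 7-9); a sketch.
    Each cost_* tensor has shape (num_queries, num_gt); gt_aspect_ratio has shape (num_gt,)."""
    cost_angle = cost_angle * ar_weight(gt_aspect_ratio)[None, :]  # ARM re-weighting per GT
    cost = (w_cls * cost_cls + w_box * cost_box
            + w_iou * cost_iou + w_theta * cost_angle)
    rows, cols = linear_sum_assignment(cost.detach().cpu().numpy())
    return rows, cols  # matched (query index, ground-truth index) pairs
```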

V Experiments

V-A Datasets and Implementation Details

DOTA-v1.0 [2] is one of the largest datasets for oriented object detection, containing 2,806 large aerial images from different sensors and platforms, ranging from around 800 × 800 to 4,000 × 4,000 pixels, and 188,282 instances. It has 15 common categories: Plane (PL), Baseball diamond (BD), Bridge (BR), Ground track field (GTF), Small vehicle (SV), Large vehicle (LV), Ship (SH), Tennis court (TC), Basketball court (BC), Storage tank (ST), Soccer-ball field (SBF), Roundabout (RA), Harbor (HA), Swimming pool (SP), and Helicopter (HC). We use both the training and validation sets for training, and the test set for testing. We divide the images into 1024 × 1024 sub-images with an overlap of 200 pixels. During training, only random horizontal, vertical and diagonal flipping is adopted to avoid over-fitting, and no other tricks are utilized. The performance on the test set is evaluated on the official DOTA evaluation server.

DIOR-R [3] is an aerial image dataset annotated with oriented bounding boxes, built from the DIOR [38] dataset. There are 23,463 images and 192,518 instances in this dataset, covering 20 common categories: Airplane (APL), Airport (APO), Baseball Field (BF), Basketball Court (BC), Bridge (BR), Chimney (CH), Expressway Service Area (ESA), Expressway Toll Station (ETS), Dam (DAM), Golf Field (GF), Ground Track Field (GTF), Harbor (HA), Overpass (OP), Ship (SH), Stadium (STA), Storage Tank (STO), Tennis Court (TC), Train Station (TS), Vehicle (VE) and Windmill (WM). DIOR-R exhibits high variation in object size, both in spatial resolution and in inter-class and intra-class size variability. Different imaging conditions, weather, seasons, and image quality are the major challenges of DIOR-R. We use both the training and validation sets for training and the test set for testing.

OHD-SJTU [33] is a public dataset for oriented object detection and object heading detection. It contains two datasets of different scales, called OHD-SJTU-S and OHD-SJTU-L. OHD-SJTU-S collects 43 large-scene images ranging from 10,000 × 10,000 to 16,000 × 16,000 pixels and 4,125 instances. In contrast, OHD-SJTU-L adds more categories and instances, containing six object categories and 113,435 instances. In line with previous work [24, 33], we divide the images into 600 × 600 sub-images with an overlap of 150 pixels and scale them to 800 × 800.

All models in this paper are implemented with the PyTorch [39] based framework MMRotate [40] and trained with the AdamW [41] optimizer. The initial learning rate is 10^-4 with 2 images per mini-batch, using the '3x' training schedule on DOTA-v1.0, DIOR-R and OHD-SJTU-L and the '9x' training schedule on OHD-SJTU-S, respectively. In addition, we adopt learning rate warm-up for 500 iterations, and the learning rate is divided by 10 at each decay step.
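A schematic of this optimization setup is sketched below; the model is a placeholder, the weight decay is left at the optimizer default, and the decay epochs are illustrative for a 36-epoch ('3x') run rather than the exact configuration.

```python
import torch

model = torch.nn.Linear(8, 8)  # placeholder for the actual ARS-DETR model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def lr_factor(iteration, epoch, warmup_iters=500, decay_epochs=(27, 33)):
    """Linear warm-up for 500 iterations, then divide the learning rate by 10
    at each decay step (the decay epochs shown are illustrative for a 36-epoch run)."""
    if iteration < warmup_iters:
        return (iteration + 1) / warmup_iters
    return 0.1 ** sum(epoch >= e for e in decay_epochs)

# inside the training loop:
# for g in optimizer.param_groups:
#     g["lr"] = 1e-4 * lr_factor(global_iter, epoch)
```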

V-B Ablation Studies

TABLE II: Comparison of AR-CSL and CSL with different radius R on DOTA-v1.0.

| Method | Metric | CSL (R=2) | CSL (R=4) | CSL (R=6) | CSL (R=8) | AR-CSL |
|---|---|---|---|---|---|---|
| Deformable DETR | AP50 | 72.57 | 72.24 | 72.15 | 72.10 | 72.38 |
| Deformable DETR | AP75 | 43.61 | 42.82 | 44.07 | 43.19 | 45.71 |
TABLE III: Comparison of AR-CSL and CSL under different angle discrete granularity ω on DOTA-v1.0.

| Method | Granularity | AP50 | AP75 |
|---|---|---|---|
| CSL (R=2) | ω=1 | 72.57 | 43.61 |
| CSL (R=2) | ω=6 | 72.22 | 42.51 |
| CSL (R=2) | ω=15 | 71.28 | 39.93 |
| CSL (R=2) | ω=30 | 71.39 | 27.62 |
| CSL (R=6) | ω=1 | 72.15 | 44.07 |
| CSL (R=6) | ω=6 | 71.39 | 42.73 |
| CSL (R=6) | ω=15 | 72.06 | 33.13 |
| CSL (R=6) | ω=30 | 71.01 | 19.61 |
| CSL (R=6/√ω) | ω=1 | 72.15 | 44.07 |
| CSL (R=6/√ω) | ω=6 | 71.70 | 43.02 |
| CSL (R=6/√ω) | ω=15 | 70.45 | 40.59 |
| CSL (R=6/√ω) | ω=30 | 70.55 | 28.72 |
| AR-CSL | ω=1 | 72.38 | 45.71 |
| AR-CSL | ω=6 | 72.44 | 44.32 |
| AR-CSL | ω=15 | 71.90 | 41.52 |
| AR-CSL | ω=30 | 71.38 | 30.25 |
TABLE IV: Comparison of different detectors using CSL and AR-CSL on DOTA-v1.0 (ω = 6).

| Method | Angle encoding | AP50 | AP75 |
|---|---|---|---|
| Deformable DETR (R-50) | CSL (R=6) | 71.39 | 42.73 |
| Deformable DETR (R-50) | AR-CSL | 72.44 | 44.32 |
| RetinaNet (R-50) | CSL (R=6) | 67.63 | 38.64 |
| RetinaNet (R-50) | AR-CSL | 67.98 | 39.18 |
| FCOS (R-50) | CSL (R=6) | 70.60 | 38.42 |
| FCOS (R-50) | AR-CSL | 71.60 | 39.74 |

V-B1 Studies on AR-CSL

In this subsection we adopt Deformable DETR as baseline and compare AR-CSL with CSL and other angle classification methods.

Comparison of CSL and AR-CSL on Deformable DETR. We compare the proposed AR-CSL and CSL with different radii R under different angle discrete granularities ω on DOTA-v1.0, based on Deformable DETR, and the results are shown in Tab. II and Tab. III. Firstly, in Tab. II, we set ω to 1 and compare AR-CSL and CSL with different radii R. Fig. 10 shows the visualization of CSL and AR-CSL. The influence of R on CSL is mainly concentrated on high-precision oriented detection, and the maximum gap reaches 1.25% on AP75 (R=6 with 44.07% vs. R=4 with 42.82%). In contrast, AR-CSL achieves the best performance (45.71% in terms of AP75) without tuning any hyperparameters. Secondly, in Tab. III, we set R in CSL to 2 and 6 and compare them with AR-CSL under different angle discrete granularities ω. As ω increases, the angle interval becomes larger and the angle representation of each category becomes more ambiguous, so the AP75 of both CSL and AR-CSL deteriorates rapidly. However, the AP75 of CSL is more sensitive to the change of ω, which requires further tuning of R. When ω is 1 and 6, the best R is 6 (44.07% and 42.73% on AP75, respectively). When ω is 15 and 30, the best R is 2 (39.93% and 27.62% on AP75, respectively). This is because the correlation among adjacent angle categories decreases as ω increases, so a small R is more suitable. On the contrary, AR-CSL dynamically smooths the angle label with the change of ω, so it is less affected by ω and performs better. Furthermore, we also modify R in CSL to R/√ω so that the radius can adjust itself to some extent under different angle discrete granularities. Compared with CSL with a fixed radius, the modified CSL performs better, but it still lags behind AR-CSL.

Figure 10: Visual comparison between CSL and AR-CSL on DOTA-v1.0.
Figure 11: The location of sampling points before (left) and after (right) using RDA module. Each sampling point is marked as a filled circle whose color indicates its attention weight.

Comparison of different detectors using CSL and AR-CSL. We apply CSL and AR-CSL to other detectors to verify the generalization of AR-CSL, with ω set to 6. The experimental results are shown in Tab. IV. When the detector is changed to RetinaNet, AR-CSL achieves 67.98% on AP50 and 39.18% on AP75. When the detector is changed to FCOS, AR-CSL achieves 71.60% on AP50 and 39.74% on AP75. Compared with CSL, AR-CSL also performs well on AP75 when using different detectors.

TABLE V: Comparison of different angle classification methods on DOTA-v1.0 (Deformable DETR).

| Metric | Reg | CSL [8] | POE [12] | AR-BCL [30] | AR-CSL |
|---|---|---|---|---|---|
| AP50 | 69.39 | 72.15 | 72.34 | 72.27 | 72.38 |
| AP75 | 40.79 | 44.07 | 43.75 | 44.67 | 45.71 |
TABLE VI: Ablation study of different angle prediction types and ways on DOTA-v1.0.

| Angle Pred. Type | Angle Pred. Way | AP50 | AP75 |
|---|---|---|---|
| Regression | θ = θ_ref + Δθ_pred | 69.48 | 40.32 |
| Regression | θ = θ_pred | 69.39 | 40.79 |
| Classification (AR-CSL) | θ = θ_ref + Δθ_pred | 70.81 | 40.64 |
| Classification (AR-CSL) | θ = θ_pred | 72.38 | 45.71 |
TABLE VII: Ablation study of ARS-DETR components on DOTA-v1.0.

| DN | RDA | ARM | ARL | AN | AP50 | AP75 |
|---|---|---|---|---|---|---|
| | | | | | 72.38 | 45.71 |
| ✓ | | | | | 73.14 (+0.76) | 47.04 (+1.33) |
| ✓ | ✓ | | | | 72.80 (+0.42) | 48.06 (+2.35) |
| ✓ | | ✓ | | | 73.41 (+1.03) | 47.63 (+1.92) |
| ✓ | | | ✓ | | 73.31 (+0.93) | 47.58 (+1.87) |
| ✓ | | ✓ | ✓ | | 73.43 (+1.05) | 48.13 (+2.42) |
| ✓ | ✓ | ✓ | ✓ | | 73.90 (+1.52) | 48.62 (+2.91) |
| ✓ | ✓ | ✓ | ✓ | ✓ | 74.16 (+1.78) | 49.41 (+3.70) |
TABLE VIII: Comparison of different sampling methods on DOTA-v1.0.

| Method | Sampling | AP50 | AP75 |
|---|---|---|---|
| ARS-DETR | Deformable Offsets | 73.47 | 46.94 |
| ARS-DETR | Fixed Offsets | 73.58 | 48.76 |
| ARS-DETR | RDA | 74.16 | 49.41 |
TABLE IX: Ablation study of different angle noise scale γ on DOTA-v1.0 (ARS-DETR).

| Metric | γ=0.00 | γ=0.01 | γ=0.02 | γ=0.05 | γ=0.07 |
|---|---|---|---|---|---|
| AP50 | 73.90 | 73.46 | 73.91 | 74.16 | 73.72 |
| AP75 | 48.62 | 48.86 | 49.23 | 49.41 | 48.85 |

Comparison with other angle classification methods on Deformable DETR. To explore the effectiveness of our proposed AR-CSL, we also conduct several experiments using different angle classification methods on Deformable DETR, and the results are shown in Tab. V. Reg is the baseline that uses a regression-based method to predict the angle and achieves 69.39% and 40.79% on AP50 and AP75, respectively. CSL [8] transfers the angle prediction to a classification task and utilizes a Gaussian window to smooth the angle label, greatly improving the performance of oriented object detection by 2.76% on AP50 and 3.28% on AP75. The recent POE [12] adopts n-ary codes to predict the angle, but its performance in high-precision oriented object detection is slightly inferior to CSL. AR-BCL [30] considers square-like objects and introduces a bidirectional angle to improve CSL, which further improves AP75 by 0.6%. Our AR-CSL dynamically considers objects with different aspect ratios and achieves 45.71% on AP75, the best among the compared methods.

TABLE X: Comparisons with the advanced oriented detectors on DOTA-v1.0. R-50 indicates ResNet50 [42]. Swin-T indicates Swin-Transformer [43]. ReR-50 indicates ReResNet50 [25]. D-DETR means Deformable DETR [14]. * indicates that the model adopts the 1x training schedule(12 epochs). Red and blue: top two performances.
Method Backbone PL BD BR GTF SV LV SH TC BC ST SBF RA HA SP HC AP50 AP75 AP50:95
AO2-DETR [22] R-50 87.99 79.46 45.74 66.64 78.90 73.90 73.30 90.40 80.55 85.89 55.19 63.62 51.83 70.15 60.04 70.91 22.60 33.31
Rotated D-DETR [14] R-50 84.89 70.71 46.04 61.92 73.99 78.83 87.71 90.07 77.97 78.41 47.07 54.48 66.87 67.66 55.62 69.48 40.32 40.27
Rotated FCOS* [34] R-50 89.06 76.97 47.92 58.55 79.78 76.95 86.90 90.90 84.87 84.58 57.11 64.68 63.69 69.38 46.87 71.88 37.30 39.80
Rotated FCOS [34] R-50 88.52 77.54 47.06 63.78 80.42 80.50 87.34 90.39 77.83 84.13 55.45 65.84 66.02 72.77 49.17 72.45 39.84 41.02
S2A-Net* [25] R-50 89.25 81.19 51.55 71.39 78.61 77.37 86.77 90.89 86.28 84.64 61.21 65.65 66.07 67.57 50.18 73.91 35.52 39.05
S2A-Net [25] R-50 89.26 84.11 51.97 72.78 78.23 79.41 87.46 90.85 85.62 84.09 60.18 65.90 72.54 71.59 55.31 75.29 40.08 42.00
Rotated RetinaNet* [44] R-50 89.64 82.56 38.43 69.83 77.39 62.74 77.24 90.68 83.79 82.04 59.91 64.83 57.37 64.76 45.56 69.79 37.69 39.64
Rotated RetinaNet [44] R-50 87.33 78.91 46.45 69.81 67.72 62.34 73.59 90.85 82.79 79.37 59.62 61.89 65.01 67.76 44.95 69.23 40.96 40.38
Gliding Vertex* [6] R-50 89.20 75.92 51.31 69.56 78.11 75.63 86.87 90.90 85.40 84.77 53.36 66.65 66.31 69.99 54.39 73.22 37.47 39.52
Gliding Vertex [6] R-50 88.71 77.22 52.00 70.85 73.75 74.81 86.55 90.89 80.41 84.63 57.66 62.88 68.49 71.86 58.17 73.26 41.14 41.29
H2RBox [45] R-50 88.16 80.47 40.88 61.27 79.78 75.25 84.40 90.89 80.05 85.35 58.91 68.46 63.67 71.87 47.18 71.77 41.42 41.49
R3Det* [24] R-50 89.29 75.21 45.41 69.23 75.53 72.89 79.28 90.88 81.02 83.25 58.81 63.15 63.40 62.21 37.41 69.80 36.59 37.82
R3Det [24] R-50 89.24 83.32 48.03 72.52 77.52 76.72 86.48 90.89 82.33 83.51 60.96 63.09 67.58 69.27 49.50 73.40 41.69 41.43
KFIoU [27] R-50 89.20 76.40 51.64 70.15 78.31 76.43 87.10 90.88 81.68 82.22 64.65 64.84 66.77 70.68 49.52 73.37 42.71 41.70
Rotated Faster RCNN* [15] R-50 89.25 82.44 50.05 69.34 78.17 73.59 85.91 90.89 84.08 85.50 57.66 60.96 66.25 69.22 57.74 73.40 39.61 40.75
Rotated Faster RCNN [15] R-50 89.09 78.28 48.93 71.54 74.01 74.99 85.90 90.84 86.87 85.03 57.97 69.74 68.10 71.28 56.88 73.96 43.44 42.93
Rotated D-DETR w/ CSL [8] R-50 86.27 76.66 46.64 65.29 76.80 76.32 87.74 90.77 79.38 82.36 54.00 61.47 66.05 70.46 61.97 72.15 44.07 42.72
SASM [46] R-50 87.51 80.15 51.07 70.35 74.95 75.80 84.23 90.90 80.87 84.93 58.51 65.59 69.74 70.18 42.31 72.47 44.21 43.01
KLD [26] R-50 89.08 84.18 43.77 72.33 79.85 73.58 85.69 90.88 85.14 81.96 65.86 64.60 63.60 68.26 53.19 73.46 44.74 43.70
Rotated ATSS* [35] R-50 88.50 77.73 49.60 69.86 76.87 72.52 82.49 90.83 80.30 82.96 62.34 64.67 64.83 66.81 53.97 72.29 37.81 40.05
Rotated ATSS [35] R-50 88.94 79.89 48.71 70.74 75.80 74.02 84.14 90.89 83.19 84.05 60.48 65.06 66.74 70.14 57.78 73.37 44.95 43.53
GWD* [11] R-50 88.92 77.03 45.90 69.30 72.53 64.06 76.40 90.87 79.20 80.45 57.68 64.37 63.60 64.74 48.26 69.55 38.91 39.50
GWD [11] R-50 89.06 80.56 44.27 73.02 79.51 73.53 85.55 90.89 86.21 83.26 63.17 64.24 63.56 69.04 52.92 73.25 45.21 44.04
PSC [28] R-50 89.65 83.80 43.64 70.98 79.00 71.35 85.08 90.90 84.28 82.51 60.64 65.06 62.52 69.61 54.0 72.87 46.18 43.98
CFA [47] R-50 88.34 83.09 51.92 72.23 79.95 78.68 87.25 90.90 85.38 85.71 59.63 63.05 73.33 70.36 47.86 74.51 46.55 44.41
Oriented Reppoints* [7] R-50 87.78 77.67 49.54 66.46 78.51 73.11 86.58 90.86 83.75 84.34 53.14 65.63 63.70 68.71 45.91 71.71 41.39 40.88
Oriented Reppoints [7] R-50 88.52 80.62 52.68 73.04 79.61 80.73 87.76 90.89 81.82 85.33 59.95 64.88 73.81 69.84 46.18 74.38 46.56 44.57
RoI Trans. [4] R-50 88.70 83.66 54.65 72.72 73.77 78.05 87.39 90.90 80.64 84.76 60.73 63.98 77.61 73.32 54.48 75.03 48.86 45.84
RoI Trans. [4] Swin-T 88.44 85.53 54.56 74.55 73.43 78.39 87.64 90.88 87.23 87.11 64.25 63.27 77.93 74.10 60.03 76.49 50.15 47.60
ARS-DETR R-50 86.97 75.56 48.32 69.20 77.92 77.94 87.69 90.50 77.31 82.86 60.28 64.58 74.88 71.76 66.62 74.16 49.41 46.21
ARS-DETR Swin-T 87.65 76.54 50.64 69.85 79.76 83.91 87.92 90.26 86.24 85.09 54.58 67.01 75.62 73.66 63.39 75.47 51.77 47.77
Figure 12: Examples of detection results on the DOTA-v1.0 dataset using ARS-DETR.
TABLE XI: Comparisons with the advanced oriented detectors on DIOR-R. Red and blue: top two performances.
Method Rotated FCOS [34] S2A-Net  [25] R3Det  [24] Gliding Vertex [6] KFIoU [27] SASM [46] GWD [11] KLD [26] Rotated Faster RCNN [15] Rotated ATSS [35] CFA [47] Oriented Reppoints [7] RoI Trans. [4] ARS-DETR
APL 62.89 67.98 62.55 62.67 58.03 64.78 69.68 66.52 63.07 62.19 61.10 67.80 63.28 68.00
APO 41.38 44.44 43.44 38.56 45.41 49.90 28.83 46.80 40.22 44.63 44.93 48.01 46.05 54.17
BF 71.83 71.63 71.72 71.94 69.52 74.94 74.32 71.76 71.89 71.55 77.62 77.02 71.93 74.43
BC 81.00 81.39 81.48 81.20 81.55 80.38 81.49 81.43 81.36 81.42 84.67 85.37 81.33 81.65
BR 38.01 42.66 36.49 37.73 38.82 34.52 29.62 40.81 39.67 41.08 37.69 38.55 43.71 41.13
CH 72.46 72.72 72.63 72.48 73.36 69.21 72.67 78.25 72.51 72.37 75.71 78.45 72.69 75.66
ESA 77.73 79.03 79.50 78.62 78.08 76.28 76.45 79.23 79.19 78.54 82.68 81.13 80.17 81.92
ETS 67.52 70.40 64.41 69.04 66.41 61.37 63.14 66.63 69.45 67.50 72.03 72.06 70.04 73.07
DAM 28.61 27.08 27.02 22.81 25.23 31.66 27.13 29.01 26.00 30.56 33.41 33.67 31.42 34.89
GF 74.58 75.56 77.36 77.89 79.24 72.22 77.19 78.68 77.93 75.69 77.25 76.00 78.00 76.10
GTF 77.04 81.02 77.17 82.13 78.25 77.81 78.94 80.19 82.28 79.11 79.94 79.89 83.48 78.62
HA 40.66 43.41 40.53 46.22 44.67 44.69 39.11 44.88 46.91 42.77 46.20 45.72 49.04 36.33
OP 53.92 56.45 53.33 54.76 54.45 52.08 42.18 57.23 53.90 56.31 54.27 54.27 58.29 55.41
SH 79.41 81.12 79.66 81.03 80.78 83.64 79.10 80.91 81.03 80.92 87.01 85.13 81.17 84.55
STA 66.33 68.00 69.22 74.88 68.40 62.83 70.41 74.17 75.77 67.78 70.43 76.04 77.93 70.09
STO 67.57 70.03 61.10 62.54 64.52 63.91 58.69 68.02 62.54 69.24 69.58 65.27 62.61 72.23
TC 79.88 87.07 81.54 81.41 81.49 80.79 81.52 81.48 81.42 81.62 81.55 85.38 81.40 81.14
TS 48.10 53.88 52.18 54.25 51.64 56.54 47.78 54.63 54.50 55.45 55.51 59.76 56.05 61.52
VE 46.22 51.12 43.57 43.22 46.03 43.58 44.47 47.80 43.17 47.79 49.53 48.02 44.18 50.57
WM 64.79 65.31 64.13 65.13 59.50 63.14 62.63 64.41 65.73 64.10 64.92 68.92 66.44 70.28
AP50 62.00 64.50 61.91 62.91 62.29 62.21 60.31 64.63 63.41 63.52 65.25 66.31 64.97 66.12
AP75 36.10 38.24 38.40 40.00 40.20 40.40 40.90 41.60 41.80 42.61 43.41 44.36 46.02 45.81
AP50:95 37.61 38.02 37.84 38.34 38.52 39.51 39.70 40.34 39.72 40.72 42.18 42.81 43.31 43.89

V-B2 Studies on ARS-DETR

In this subsection, we adopt Deformable DETR + AR-CSL as our baseline and explore the effectiveness of our proposed methods.

Ablation study of different angle prediction ways in DETR. The angle can be predicted either by regression or by classification, and each type can be implemented in two ways in DETR: direct prediction ($\theta=\theta_{pred}$) or residual prediction ($\theta=\theta_{ref}+\Delta\theta_{pred}$). Tab. VI compares the four combinations and shows that direct prediction performs best for both regression and classification. We suspect that the periodicity of the angle gives the residual prediction two possible optimization directions during training, which leads to the boundary problem and thus hinders the performance to some extent.
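To make the two decoder-side options concrete, the following is a minimal PyTorch-style sketch of the direct and residual forms; the function and argument names are ours for illustration and do not reproduce the paper's code.

```python
import torch

def predict_angle(head_output: torch.Tensor,
                  theta_ref: torch.Tensor,
                  residual: bool = False) -> torch.Tensor:
    """Direct vs. residual angle prediction for one decoder layer (illustrative).

    head_output: angle decoded from the layer's head (a regressed value or the
                 decoded CSL/AR-CSL bin); theta_ref: the reference angle carried
                 over from the previous decoder layer.
    """
    if residual:
        # residual form: theta = theta_ref + delta_theta_pred
        return theta_ref + head_output
    # direct form: theta = theta_pred; the reference angle is not reused, which
    # avoids the two conflicting optimization directions caused by angle periodicity
    return head_output
```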

Ablation study of Rotated Deformable Attention Module. To explore the effectiveness of the Rotated Deformable Attention Module (RDA), we perform ablation studies with and without RDA in Tab. VII. By aligning the sampling points with the region features, RDA brings gains of 1.02% and 0.49% on AP75, from 47.04% and 48.13% to 48.06% and 48.62%, respectively. The visualization is shown in Fig. 11. Without RDA, a large number of sampling points are distributed in the background around the objects or on adjacent objects, resulting in misalignment; meanwhile, sampling points with high attention weights are relatively few and densely arranged. In contrast, with RDA, the sampling points basically align with the objects, and the points with high attention weights are more widely distributed, indicating that the model attends to multiple parts of the objects. Besides, as shown in Tab. VIII, RDA also surpasses the other two sampling methods on both AP50 and AP75, exhibiting its superiority.
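As a reference for how the sampling points can be made to follow an oriented object, the snippet below rotates the deformable-attention sampling offsets by the predicted angle before they are added to the reference point. It is a simplified sketch of the rotation step only, with names chosen by us.

```python
import torch

def rotate_sampling_offsets(offsets: torch.Tensor, theta: torch.Tensor) -> torch.Tensor:
    """Rotate deformable-attention sampling offsets by the predicted angle.

    offsets: (..., num_points, 2) learned (dx, dy) offsets around the reference point.
    theta:   (...,) predicted angle in radians, one per query.
    """
    cos, sin = torch.cos(theta)[..., None], torch.sin(theta)[..., None]  # broadcast over points
    dx, dy = offsets[..., 0], offsets[..., 1]
    rx = cos * dx - sin * dy  # standard 2-D rotation
    ry = sin * dx + cos * dy
    return torch.stack([rx, ry], dim=-1)

# The rotated offsets are then added to the reference point to obtain the sampling
# locations, so the sampled region follows the object's orientation.
```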

Ablation study of Aspect Ratio sensitive Matching and Loss. To investigate the contributions of Aspect Ratio sensitive Matching (ARM) and Aspect Ratio sensitive Loss (ARL), we conduct detailed ablation studies in Tab. VII. The results clearly show that ARM and ARL each improve the performance when used independently, by about 0.59% and 0.54% on AP75, respectively. When used together, they further improve the performance by 1.09% on AP75, verifying the effectiveness of ARM and ARL.
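The weighting idea behind ARL can be sketched as follows; the monotone mapping from aspect ratio to weight is a hypothetical choice of ours, meant only to illustrate that square-like objects receive a small angle-loss weight while elongated objects receive a large one.

```python
import torch

def aspect_ratio_weight(w: torch.Tensor, h: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # hypothetical dynamic coefficient: grows with the aspect ratio and approaches 0
    # for square-like boxes (illustrative, not the paper's exact formula)
    ar = torch.maximum(w, h) / torch.clamp(torch.minimum(w, h), min=eps)
    return 1.0 - 1.0 / ar

def weighted_angle_loss(angle_loss: torch.Tensor, w: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
    # per-object angle loss scaled by the aspect-ratio-dependent weight
    return (aspect_ratio_weight(w, h) * angle_loss).mean()
```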

Ablation study of Denoising training. In the experiments, we mainly explore the influence of the angle noise and separate it from the DN strategy, denoting it as AN. The label noise $\alpha$ and box noise $\beta$ are set to 0.5 and 0.4, respectively, the same as in DINO [18]. Tab. VII shows that the basic DN training strategy (adding noise only to the label and box) improves the performance by 0.76% and 1.33%, reaching 73.14% and 47.04% in terms of AP50 and AP75, respectively. Tab. IX shows that additionally adding noise to the angle further improves the performance: when $\gamma$ is set to 0.05, the best performance is obtained, with 74.16% and 49.41% on AP50 and AP75, respectively. The results in Tab. VII and Tab. IX show that noised ground truths can further help the model learn oriented object detection.
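For reference, a minimal sketch of the angle-noising step for the denoising queries is given below; the uniform jitter and the use of $\gamma$ as a fraction of the angle range are our assumptions, analogous to how DN-style training perturbs boxes.

```python
import math
import torch

def add_angle_noise(theta_gt: torch.Tensor, gamma: float = 0.05,
                    angle_range: float = math.pi) -> torch.Tensor:
    # jitter the ground-truth angle of each denoising query by at most
    # gamma * angle_range (uniform noise); gamma = 0.05 gave the best AP in Tab. IX
    noise = (torch.rand_like(theta_gt) * 2.0 - 1.0) * gamma * angle_range
    return theta_gt + noise
```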

Figure 13: Examples of detection results on the DIOR-R dataset using ARS-DETR.
TABLE XII: Comparisons with the advanced oriented detectors on OHD-SJTU-L. Red and blue: top two performances.
Method PL SH SV LV HA HC AP50 AP75 AP50:95
RRPN [23] 89.55 82.60 57.36 72.26 63.01 45.27 68.34 22.03 31.12
R2CNN [48] 90.02 80.83 63.07 64.16 66.36 55.94 70.06 32.70 35.44
RetinaNet-H [24] 90.22 80.04 63.32 63.49 63.73 53.77 69.10 35.90 36.89
R3Det [24] 89.89 87.69 65.20 78.95 57.06 53.50 72.05 36.51 38.57
Rotated Faster RCNN [15] 81.44 88.30 66.58 75.11 67.97 48.19 71.32 37.43 39.28
Rotated ATSS [35] 81.21 88.73 71.67 76.51 71.48 38.50 71.33 39.82 40.37
RetinaNet-R [24] 90.00 86.90 63.24 86.90 62.85 52.35 72.78 40.13 40.58
OHDet [33] 89.73 86.63 61.37 78.80 63.76 54.62 72.49 43.60 41.29
ARS-DETR 87.48 87.76 65.23 78.51 69.43 56.89 74.20 46.08 43.22
TABLE XIII: Comparisons with the advanced oriented detectors on OHD-SJTU-S. Red and blue: top two performances.
Method PL SH AP50 AP75 AP50:95
RRPN [23] 90.14 76.13 83.13 27.87 40.74
R2CNN [48] 90.91 77.66 84.28 55.00 52.80
RetinaNet-H [24] 90.86 66.32 78.59 58.45 53.07
R3Det [24] 90.82 85.59 88.21 67.13 56.19
Rotated Faster RCNN [15] 90.83 79.18 85.01 62.73 52.17
Rotated ATSS [35] 90.81 86.49 88.65 72.53 59.51
RetinaNet-R [24] 90.82 85.59 89.48 74.62 61.86
OHDet [33] 90.74 87.59 89.06 78.55 63.94
ARS-DETR 90.18 89.71 89.95 80.67 65.49

V-C Comparison with state-of-the-art methods

Results on DOTA-v1.0. We report the results of 16 oriented object detectors in Tab. X. Since different methods use different image resolutions, data pre-processing, data augmentation, backbones, training strategies, and various tricks in their original papers, we implement all detectors in MMRotate [40] with the same settings to make the comparison as fair as possible. All results are obtained with single-scale training and testing under the ‘1x’ (12 epochs) or ‘3x’ (36 epochs) training schedule. With R-50 and Swin-T as the backbone, our method obtains 74.16% and 75.47% on AP50, and 49.41% and 51.77% on AP75, respectively; qualitative detection results are shown in Fig. 12. With ResNet50 as the backbone, ARS-DETR does not match many advanced oriented detectors on AP50, but it has an obvious advantage in high-precision detection and surpasses them on AP75. Specifically, ARS-DETR outperforms RoI Trans. by 0.55% (49.41% vs. 48.86%), Oriented Reppoints by 2.85% (49.41% vs. 46.56%), CFA by 2.86% (49.41% vs. 46.55%), and GWD by 4.2% (49.41% vs. 45.21%). In addition, some detectors perform well on AP50 but degrade considerably on AP75 (e.g., S2A-Net, 75.29% on AP50 and 40.08% on AP75), while others are relatively weak on AP50 but perform favorably on AP75 (e.g., PSC, 72.87% on AP50 and 46.18% on AP75), which further shows that AP50 alone is not a suitable measure of detector performance. Moreover, ARS-DETR also exceeds DETR-based detectors such as Rotated Deformable DETR and AO2-DETR.

Results on DIOR-R. The results on the DIOR-R dataset are shown in Tab. XI. All methods adopt the ‘3x’ training schedule and use R-50 as the backbone. ARS-DETR achieves 66.12% and 45.81% on AP50 and AP75, respectively. The visualization is shown in Fig. 13.

Results on OHD-SJTU. We also compare the performance of several oriented object detection methods on OHD-SJTU, mainly including R2CNN, RRPN, RetinaNet, R3Det, and OHDet. Without any bells and whistles, ARS-DETR achieves 46.08% and 80.67% on AP75 on OHD-SJTU-L and OHD-SJTU-S, respectively, surpassing the other advanced oriented object detectors. The detailed results are shown in Tab. XII and Tab. XIII.

Besides, the above comparisons show that AP50 is not accurate enough to represent the performance of oriented object detectors, especially for high-precision oriented detection. In short, this paper advocates the use of more stringent metrics (e.g., AP75) to further study high-precision oriented object detectors.
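As a quick sanity check of this claim, the snippet below (using shapely, with parameters we picked for illustration) measures how fast the IoU between a box and its own rotated copy drops as the aspect ratio grows: a 10:1 box rotated by only 10° already falls below an IoU of 0.5, while a square with the same angle deviation stays well above 0.75, which is exactly why AP50 barely penalizes angle errors on elongated objects.

```python
from shapely import affinity
from shapely.geometry import box

def self_rotation_iou(w: float, h: float, angle_deg: float) -> float:
    # IoU between an axis-aligned w x h box and the same box rotated about its center
    base = box(-w / 2, -h / 2, w / 2, h / 2)
    rot = affinity.rotate(base, angle_deg, origin=(0, 0))
    inter = base.intersection(rot).area
    return inter / (base.area + rot.area - inter)

print(self_rotation_iou(10, 1, 10))  # elongated box: IoU collapses under a small angle error
print(self_rotation_iou(1, 1, 10))   # square box: IoU stays high for the same error
```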

VI Conclusion

In this paper, we analyze in detail the correlation between the angle and objects with different aspect ratios, and identify the flaws of the current metric (i.e., AP50) in high-precision oriented object detection. The widely used AP50 metric has a large tolerance for angle deviation and therefore cannot accurately reflect the performance of oriented object detectors; using the more stringent AP75 metric is more reasonable. We then design an oriented object detector named ARS-DETR and find that dynamically adjusting the angle-classification smoothing, the matching, and the loss calculation in DETR according to the sensitivity of objects with different aspect ratios to the angle can effectively boost the performance. Additionally, aligning the features in DETR’s decoder and adopting the denoising training strategy further adapt DETR to oriented object detection. Compared with other advanced oriented detectors, ARS-DETR achieves higher detection accuracy across various datasets, especially under the more stringent metric. We hope that this method will facilitate future work on high-precision oriented object detection and the application of DETR to oriented object detection.

References

  • [1] Z. Liu, L. Yuan, L. Weng, and Y. Yang, “A high resolution optical satellite image dataset for ship recognition and some new baselines.” in ICPRAM, 2017, pp. 324–331.
  • [2] G.-S. Xia, X. Bai, J. Ding, Z. Zhu, S. Belongie, J. Luo, M. Datcu, M. Pelillo, and L. Zhang, “Dota: A large-scale dataset for object detection in aerial images,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 3974–3983.
  • [3] G. Cheng, J. Wang, K. Li, X. Xie, C. Lang, Y. Yao, and J. Han, “Anchor-free oriented proposal generator for object detection,” IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–11, 2022.
  • [4] J. Ding, N. Xue, Y. Long, G.-S. Xia, and Q. Lu, “Learning roi transformer for oriented object detection in aerial images,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 2849–2858.
  • [5] X. Yang, J. Yang, J. Yan, Y. Zhang, T. Zhang, Z. Guo, X. Sun, and K. Fu, “Scrdet: Towards more robust detection for small, cluttered and rotated objects,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 8232–8241.
  • [6] Y. Xu, M. Fu, Q. Wang, Y. Wang, K. Chen, G.-S. Xia, and X. Bai, “Gliding vertex on the horizontal bounding box for multi-oriented object detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 4, pp. 1452–1459, 2020.
  • [7] W. Li, Y. Chen, K. Hu, and J. Zhu, “Oriented reppoints for aerial object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 1829–1838.
  • [8] X. Yang and J. Yan, “Arbitrary-oriented object detection with circular smooth label,” in European Conference on Computer Vision, 2020, pp. 677–694.
  • [9] X. Yang, L. Hou, Y. Zhou, W. Wang, and J. Yan, “Dense label encoding for boundary discontinuity free rotation detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 15819–15829.
  • [10] J. Wang, F. Li, and H. Bi, “Gaussian focal loss: Learning distribution polarized angle prediction for rotated object detection in aerial images,” IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–13, 2022.
  • [11] X. Yang, J. Yan, Q. Ming, W. Wang, X. Zhang, and Q. Tian, “Rethinking rotated object detection with gaussian wasserstein distance loss,” in International Conference on Machine Learning. PMLR, 2021, pp. 11830–11841.
  • [12] Q. Ming, L. Miao, Z. Zhou, J. Song, Y. Dong, and X. Yang, “Task interleaving and orientation estimation for high-precision oriented object detection in aerial images,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 196, pp. 241–255, 2023.
  • [13] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-end object detection with transformers,” in European conference on computer vision. Springer, 2020, pp. 213–229.
  • [14] X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai, “Deformable detr: Deformable transformers for end-to-end object detection,” in International Conference on Learning Representations, 2021.
  • [15] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” in Advances in Neural Information Processing Systems, 2015, pp. 91–99.
  • [16] D. Meng, X. Chen, Z. Fan, G. Zeng, H. Li, Y. Yuan, L. Sun, and J. Wang, “Conditional detr for fast training convergence,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 3651–3660.
  • [17] F. Li, H. Zhang, S. Liu, J. Guo, L. M. Ni, and L. Zhang, “Dn-detr: Accelerate detr training by introducing query denoising,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 13619–13627.
  • [18] H. Zhang, F. Li, S. Liu, L. Zhang, H. Su, J. Zhu, L. M. Ni, and H.-Y. Shum, “Dino: Detr with improved denoising anchor boxes for end-to-end object detection,” in International Conference on Learning Representations, 2023.
  • [19] Y. Wang, X. Zhang, T. Yang, and J. Sun, “Anchor detr: Query design for transformer-based detector,” in Proceedings of the AAAI conference on artificial intelligence, vol. 36, no. 3, 2022, pp. 2567–2575.
  • [20] S. Liu, F. Li, H. Zhang, X. Yang, X. Qi, H. Su, J. Zhu, and L. Zhang, “Dab-detr: Dynamic anchor boxes are better queries for detr,” arXiv preprint arXiv:2201.12329, 2022.
  • [21] T. Ma, M. Mao, H. Zheng, P. Gao, X. Wang, S. Han, E. Ding, B. Zhang, and D. Doermann, “Oriented object detection with transformer,” arXiv preprint arXiv:2106.03146, 2021.
  • [22] L. Dai, H. Liu, H. Tang, Z. Wu, and P. Song, “Ao2-detr: Arbitrary-oriented object detection transformer,” IEEE Transactions on Circuits and Systems for Video Technology, 2022.
  • [23] J. Ma, W. Shao, H. Ye, L. Wang, H. Wang, Y. Zheng, and X. Xue, “Arbitrary-oriented scene text detection via rotation proposals,” IEEE Transactions on Multimedia, vol. 20, no. 11, pp. 3111–3122, 2018.
  • [24] X. Yang, J. Yan, Z. Feng, and T. He, “R3det: Refined single-stage detector with feature refinement for rotating object,” in AAAI Conference on Artificial Intelligence, vol. 35, no. 4, 2021, pp. 3163–3171.
  • [25] J. Han, J. Ding, J. Li, and G.-S. Xia, “Align deep features for oriented object detection,” IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–11, 2021.
  • [26] X. Yang, X. Yang, J. Yang, Q. Ming, W. Wang, Q. Tian, and J. Yan, “Learning high-precision bounding box for rotated object detection via kullback-leibler divergence,” Advances in Neural Information Processing Systems, vol. 34, pp. 18381–18394, 2021.
  • [27] X. Yang, Y. Zhou, G. Zhang, J. Yang, W. Wang, J. Yan, X. Zhang, and Q. Tian, “The kfiou loss for rotated object detection,” in International Conference on Learning Representations, 2023.
  • [28] Y. Yu and F. Da, “Phase-shifting coder: Predicting accurate orientation in oriented object detection,” in Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023. [Online]. Available: https://arxiv.org/abs/2211.06368
  • [29] H. Wang, Z. Huang, Z. Chen, Y. Song, and W. Li, “Multigrained angle representation for remote-sensing object detection,” IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–13, 2022.
  • [30] Z. Xiao, B. Xu, Y. Zhang, K. Wang, Q. Wan, and X. Tan, “Aspect ratio-based bidirectional label encoding for square-like rotation detection,” IEEE Geoscience and Remote Sensing Letters, 2023.
  • [31] X. Sun, P. Wang, Z. Yan, F. Xu, R. Wang, W. Diao, J. Chen, J. Li, Y. Feng, T. Xu et al., “Fair1m: A benchmark dataset for fine-grained object recognition in high-resolution remote sensing imagery,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 184, pp. 116–130, 2022.
  • [32] X. Yang, H. Sun, X. Sun, M. Yan, Z. Guo, and K. Fu, “Position detection and direction prediction for arbitrary-oriented ships via multitask rotation region convolutional neural network,” IEEE Access, vol. 6, pp. 50839–50849, 2018.
  • [33] X. Yang and J. Yan, “On the arbitrary-oriented object detection: Classification based approaches revisited,” International Journal of Computer Vision, vol. 130, no. 5, pp. 1340–1365, 2022.
  • [34] Z. Tian, C. Shen, H. Chen, and T. He, “Fcos: Fully convolutional one-stage object detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 9627–9636.
  • [35] S. Zhang, C. Chi, Y. Yao, Z. Lei, and S. Z. Li, “Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 9759–9768.
  • [36] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, “Feature pyramid networks for object detection,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 2117–2125.
  • [37] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei, “Deformable convolutional networks,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 764–773.
  • [38] K. Li, G. Wan, G. Cheng, L. Meng, and J. Han, “Object detection in optical remote sensing images: A survey and a new benchmark,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 159, pp. 296–307, 2020.
  • [39] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga et al., “Pytorch: An imperative style, high-performance deep learning library,” in Advances in neural information processing systems, 2019.
  • [40] Y. Zhou, X. Yang, G. Zhang, J. Wang, Y. Liu, L. Hou, X. Jiang, X. Liu, J. Yan, C. Lyu et al., “Mmrotate: A rotated object detection benchmark using pytorch,” in Proceedings of the 30th ACM International Conference on Multimedia, 2022, pp. 7331–7334.
  • [41] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” in International Conference on Learning Representations, 2018.
  • [42] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask r-cnn,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2017, pp. 2961–2969.
  • [43] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10012–10022.
  • [44] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2017, pp. 2980–2988.
  • [45] X. Yang, G. Zhang, W. Li, X. Wang, Y. Zhou, and J. Yan, “H2rbox: Horizontal box annotation is all you need for oriented object detection,” in International Conference on Learning Representations, 2023.
  • [46] L. Hou, K. Lu, J. Xue, and Y. Li, “Shape-adaptive selection and measurement for oriented object detection,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2022, pp. 923–932.
  • [47] Z. Guo, C. Liu, X. Zhang, J. Jiao, X. Ji, and Q. Ye, “Beyond bounding-box: Convex-hull feature adaptation for oriented and densely packed object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 8792–8801.
  • [48] Y. Jiang, X. Zhu, X. Wang, S. Yang, W. Li, H. Wang, P. Fu, and Z. Luo, “R2cnn: rotational region cnn for orientation robust scene text detection,” arXiv preprint arXiv:1706.09579, 2017.