
CRRS: Concentric Rectangles Regression Strategy for Multi-point Representation on Fisheye Images

Xihan Wang1,2, Xi Xu1,2, Yu Gao1,2, Yi Yang∗,1,2, Yufeng Yue1,2, Mengyin Fu1,2
1School of Automation, Beijing Institute of Technology, Beijing, China
2State Key Laboratory of Intelligent Control and Decision of Complex System, Beijing Institute of Technology, Beijing, China
*Corresponding author: Y. Yang, Email: [email protected]
Abstract

Modern object detectors take advantage of rectangular bounding boxes as a conventional way to represent objects. When it comes to fisheye images, rectangular boxes contain more background noise than semantic information. Although multi-point representation has been proposed, both its regression accuracy and its convergence still perform worse than the widely used rectangular boxes. In order to further exploit the advantages of multi-point representation for distorted images, the Concentric Rectangles Regression Strategy (CRRS) is proposed in this work. We adopt a smoother mean loss to allocate weights and discuss the effect of the hyper-parameters on prediction results. Moreover, an accurate pixel-level method is designed to obtain irregular IoU for estimating detector performance. Compared with previous work on multi-point representation, the experiments show that CRRS improves training performance in both accuracy and stability. We also show that a multi-task weighting strategy facilitates the regression process in this design. Source code is at https://github.com/IN2-ViAUn/Concentric-Rectangles-Loss.

I INTRODUCTION

The representation of standard rectangular bounding boxes is widely applied in object detectors. Despite being convenient for coordinate regression, rectangular boxes include much invalid background information that interferes with object detection, as shown in Fig. 1. This drawback becomes more obvious for fisheye images due to severe distortion. Therefore, it is important to explore novel representations for fisheye images. To solve this problem, related studies have proposed modified boxes such as rotated boxes[1, 2], circles[3] and ellipses[4]. In particular, FisheyeDet[5] used irregular quadrangles to represent distorted objects. Multi-point representation, proposed by Rashed et al.[6, 7], can describe the outline of objects exactly.

Multi-point representation faces two challenges: (1) The multi-point regression strategy suffers from complicated calculation and convergence difficulty. When each point is regressed separately, the number of loss terms increases conspicuously, which may lead to slow convergence or even failure[8]. A multi-point loss function therefore needs to consider both efficiency and convergence speed. (2) Compared with rectangular boxes, the computational complexity of multi-point Intersection over Union (IoU) increases dramatically. IoU is required at many stages, such as label assignment, the loss function, and mean Average Precision (mAP); the more time-consuming each IoU calculation is, the higher the overall cost, so more efficient means of computing it need to be explored.

Refer to caption
Figure 1: Rectangular bounding boxes provide rough representations for most objects in distorted fisheye images.
Refer to caption
Figure 2: Overview of the experiment process. Inputs: distorted images and multi-point labels. Model: the network outputs are changed to the multi-point form with N distances and the centroid coordinates (Fig. 4). Loss function: Reg Loss is the key contribution, composed of Poly IoU loss and Concentric Rectangles Loss. Dynamic Weight Average (DWA) is used to allocate weights.

To address the above problems, the following solutions are proposed: (1) A regression strategy called Concentric Rectangles Loss is designed to improve prediction performance. Firstly, we design a series of concentric rectangles for regression. Rectangles are used because a circle is determined by its radius alone, which makes advanced IoU losses unavailable. We further analyze the necessity of the concentric property based on previous work[9]. Secondly, we introduce a weighting strategy to accelerate convergence. Because Concentric Rectangles Loss leads to a large number of loss terms, adaptive loss weighting is needed to balance the convergence rate of each loss. Dynamic Weight Average (DWA)[10] is verified to be feasible for this problem. Furthermore, we discuss the quality of weight allocation under different temperature coefficients, and show that calculating the loss ratio with a mean loss reduces the negative impact of single-loss fluctuations. (2) A pixel-level method for polygon IoU is designed to generate accurate results. The relationship between a pixel and a polygon is reduced to that between a pixel and a triangle. This approach can be applied to polygons composed of different numbers of points and obtained from diverse sampling schemes. Polygon IoU plays an important role in evaluating detection performance; precise mAP provides a quantitative comparison between different regression strategies for multi-point representation in object detection.

The main contributions of this paper are as follows:

\bullet CRRS is proposed for multi-point regression. Multi-task learning (MTL) weighting strategies are adopted in this paper to improve convergence speed. Furthermore, we propose an efficient combination of multiple losses.

\bullet We design a method to calculate IoU for multi-point representation in the object detection task. On the basis of this Polygon IoU, prediction performance can be evaluated quantitatively.

II RELATED WORK

II-A Object Representation

The rectangular bounding box, with its regular shape and simple computation, is convenient for position regression. Currently, it is one of the most popular representations in object detectors[11, 12, 13]. However, rectangular boxes always contain much background information, which exerts a negative influence on the learning of semantic information and on the detection model. This defect becomes more serious when images are distorted or objects are dense and severely occluded. Therefore, representation methods that better fit object shape and position have been proposed. It is common to use oriented boxes for people detection in overhead fisheye images[14, 15]. Compared with the rectangular box, the circle box[16, 17] only requires radius prediction with one Degree of Freedom (DoF) and is rotation invariant. The ellipse box[4] includes less background area, so it is better suited to severely overlapped objects. In addition, much research aims to represent distorted objects more exactly with reduced background information; thus multi-point representations such as the irregular quadrilateral box[5] and the 24-sided box[7, 9] have been put forward. These representation methods are able to extract features and locate objects more precisely.

IoU matters in label assignment, loss functions, evaluation metrics and non-maximum suppression (NMS). Various representations make irregular IoU calculation more difficult. It is relatively simple to compute the IoU of two circles[17], but the case of two polygons is more complicated. Li et al.[5] took advantage of the cross product of vectors and, based on a statistical approach, proposed a method to estimate the positional relationship between a sampling point and an irregular quadrangle. Xie et al.[8] put forward Polar IoU, an algorithm to calculate mask IoU on the basis of polar vectors, which avoids the complex IoU computation between a polygonal prediction box and the ground truth box. Previous work[9] adopted the mean IoU of 24 concentric circles as an approximate value at the label assignment stage, but for lack of an accurate calculation of 24-sided polygon IoU, there was no way to describe prediction performance quantitatively with mAP. For distorted objects in 360° images, Cao et al.[18] proposed a detection method using Field-of-View Bounding Boxes to replace rectangular boxes, together with an approximate method called FoV-IoU that is more accurate than Sph-IoU[19]; this method helps object detectors achieve better results during training, inference, and evaluation. For multi-point representation, this paper proposes a pixel-level method to accurately calculate polygon IoU and obtain accurate mAP evaluation results.

Refer to caption
Figure 3: The procedure of calculating Poly IoU.

II-B Weighting Strategies of Multi-task Loss

MTL[20] is widely used in computer vision, e.g. classification, depth estimation and dense semantic prediction. Since different tasks have different objective functions, it is necessary to consider how to allocate task loss weights when training a multi-task model, and many researchers have proposed multi-task loss weighting strategies. Different tasks can extract features through hard or soft parameter sharing for the best learning performance. Because different tasks may be unbalanced during training, fixed weights for every task are no longer sufficient; adaptive weights in the loss function have to be set through dedicated strategies. Thanks to these weighting methods, all tasks can be kept synchronous[21], or auxiliary tasks can accelerate the main task[22, 23].

Bayesian uncertainty can be used to design weight parameters[24, 25]. This method weights the losses of different tasks based on the uncertainty of each task loss, so that the weighting reflects task difficulty. Later, some studies used the learning rate to assign weights to different losses[10, 26, 27], and certain classical algorithms update weights by gradients[28, 29]. The gradient sharing strategy combines the loss functions of multiple tasks into one total loss function; it allows different tasks to share model parameters to increase training speed and reduce the risk of over-fitting, and can balance more difficult tasks by using lower loss weights. GradNorm[28] balances learning among tasks by normalizing the gradient norms to a common scale or by considering the relationship between the gradient directions of each task's loss function. HydaLearn[30] introduces a dynamic weighting method for multi-task learning with auxiliary tasks. However, the above methods are mainly used for weight adjustment among multiple tasks. In addition, the information entropy-based strategy assigns task weights based on the mutual information between tasks: if two tasks have high mutual information, their weights are raised, and vice versa. If the special problem of single-task multi-loss is also treated as multi-task learning, the weight adjustment strategies of some of these methods are no longer applicable. [31] has shown that for single-task multi-loss problems, using the coefficient of variation to design the weighting strategy achieves better results. This paper combines DWA[10] and the Welford algorithm[32] to dynamically adjust the weights of a single-task multi-loss problem.

III METHOD

Fig. 2 summarizes the main research work of this paper. It mainly involves the design of the position regression loss terms for multi-point representation and the adjustment strategies for multiple loss weights. Our method is introduced from three aspects: Polygon IoU Calculation, Concentric Rectangles Loss, and Adaptive Weighting Strategies of Multi-loss. In order to compare with previous work[9], this paper still uses a 24-sided polygon to represent the object. $N$ is the number of vertices of the polygon bounding box and equals 24.

III-A Polygon IoU Calculation

Polygon bounding box: For the ground truth box, we first determine $C^{gt}(x_c^{gt}, y_c^{gt})$ as the centroid of each object based on the instance mask. The formula is as follows:

x_c^{gt}=\frac{M_{10}}{M_{00}},\quad y_c^{gt}=\frac{M_{01}}{M_{00}}   (1)

where the zero-order moment of the object contour is $M_{00}=\sum_{i}\sum_{j}V(i,j)$, and the first-order moments are $M_{10}=\sum_{i}\sum_{j}i\cdot V(i,j)$ and $M_{01}=\sum_{i}\sum_{j}j\cdot V(i,j)$; $i$ and $j$ represent the horizontal and vertical coordinates of each pixel composing the object contour, and $V(i,j)$ is the gray-scale value of pixel $(i,j)$. For distorted objects, the centroid is preferable to the center of the rectangular bounding box; the center may even fall outside the object mask, which affects the determination of the vertices. We obtain $N$ boundary points by equal-angular sampling: $P_0^{gt}(x_0^{gt},y_0^{gt}), P_1^{gt}(x_1^{gt},y_1^{gt}), \ldots, P_{N-1}^{gt}(x_{N-1}^{gt},y_{N-1}^{gt})$. Among these points, $P_0^{gt}$ is the intersection of the object contour with the ray emitted from the centroid $C^{gt}$ along the positive half-axis of the pixel coordinate system. The other points are then sampled clockwise every $360^{\circ}/N$ to form an approximate polygonal ground truth box for the object contour.
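For illustration, the ground-truth centroid and equal-angular sampling can be written as the following minimal sketch (ours, not the released code; it assumes OpenCV ≥ 4 and approximates each ray-contour intersection by the contour point whose polar angle is closest to the sampling angle):

```python
# Sketch of Eq. (1) and the equal-angular sampling of N ground-truth boundary
# points from a binary instance mask of shape (H, W).
import cv2
import numpy as np

def polygon_ground_truth(mask, n_points=24):
    contours, _ = cv2.findContours(mask.astype(np.uint8), cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_NONE)
    contour = max(contours, key=cv2.contourArea)
    m = cv2.moments(contour)
    xc, yc = m["m10"] / m["m00"], m["m01"] / m["m00"]      # Eq. (1): centroid

    pts = contour.reshape(-1, 2).astype(np.float64)
    angles = np.arctan2(pts[:, 1] - yc, pts[:, 0] - xc) % (2 * np.pi)
    sampled = []
    for k in range(n_points):                              # one point every 360°/N
        target = 2 * np.pi * k / n_points
        diff = np.abs((angles - target + np.pi) % (2 * np.pi) - np.pi)
        sampled.append(pts[np.argmin(diff)])               # contour point closest to the ray
    return (xc, yc), np.asarray(sampled)
```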

For the prediction box, we calculate $N$ boundary points $P_0^{pd}(x_0^{pd},y_0^{pd}), P_1^{pd}(x_1^{pd},y_1^{pd}), \ldots, P_{N-1}^{pd}(x_{N-1}^{pd},y_{N-1}^{pd})$ based on the network outputs $(x_c^{pd}, y_c^{pd}, r_0, r_1, \ldots, r_{N-1})$ and Formula (2):

\left\{\begin{array}{l} x_k^{pd}=x_c^{pd}+r_k\cdot\cos\left(k\cdot\frac{360^{\circ}}{N}\right)\\ y_k^{pd}=y_c^{pd}+r_k\cdot\sin\left(k\cdot\frac{360^{\circ}}{N}\right) \end{array}\right.   (2)

where $k=0,1,\ldots,N-1$, $(x_c^{pd}, y_c^{pd})$ are the centroid coordinates predicted by the network, and $r_0, r_1, \ldots, r_{N-1}$ are the Euclidean distances between the predicted centroid and the $N$ predicted boundary points.
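Formula (2) corresponds to the following short sketch (plain NumPy, with illustrative names):

```python
# Decode the N predicted boundary points from the network outputs
# (x_c, y_c, r_0, ..., r_{N-1}); Formula (2) with angles in radians.
import numpy as np

def decode_polygon(xc_pd, yc_pd, radii):
    n = len(radii)
    theta = 2 * np.pi * np.arange(n) / n          # k * 360° / N
    xs = xc_pd + np.asarray(radii) * np.cos(theta)
    ys = yc_pd + np.asarray(radii) * np.sin(theta)
    return np.stack([xs, ys], axis=1)             # (N, 2) predicted vertices
```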

The calculation of irregular IoU: The overlap between two polygons is much more complex than that between two rectangles. Therefore, we propose a pixel-level method that can accurately calculate the area of any polygon and the corresponding IoU. A polygon constructed by the above method can easily be divided into $N$ triangles. Since it is relatively simple to determine whether a point is inside a triangle, the problem of calculating the polygon area can be simplified: we only need to check whether each pixel in the circumscribed rectangle of the polygon lies inside any of the triangles. As shown in Fig. 3, the triangle composed of $P_0$, $P_{N-1}$, $C$ and the point $J$ illustrates the test. Construct three pairs of vectors, $\overrightarrow{P_0 P_{N-1}}$ and $\overrightarrow{P_0 J}$, $\overrightarrow{P_{N-1} C}$ and $\overrightarrow{P_{N-1} J}$, $\overrightarrow{C P_0}$ and $\overrightarrow{C J}$, and perform a cross product on each pair:

\begin{array}{l} cross1=\overrightarrow{P_0 P_{N-1}}\times\overrightarrow{P_0 J}\\ cross2=\overrightarrow{P_{N-1} C}\times\overrightarrow{P_{N-1} J}\\ cross3=\overrightarrow{C P_0}\times\overrightarrow{C J} \end{array}   (3)

If $cross1$, $cross2$ and $cross3$ have the same sign, point $J$ is inside the triangle. To compute the area, we set the gray-scale value of pixels inside the polygon to 1 and the rest to 0; summing the gray-scale values of all pixels gives the area of the polygon. The key to calculating IoU is to compute the intersection area. We first determine which pixels are contained in the ground truth polygon and the predicted polygon respectively. After converting the masks of the two polygons to Boolean type, an "and" operation is performed, and the number of intersection pixels is counted to obtain the polygon IoU (Poly IoU). Our method has been validated to be applicable to both convex and concave polygons.
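The whole procedure can be summarised by the following minimal sketch (our own illustration rather than the released code); each polygon is rasterised over its bounding rectangle with the cross-product test of Eq. (3), and the two boolean masks are then intersected:

```python
# Pixel-level Poly IoU sketch. `shape` is the (H, W) image size; vertices are
# (x, y) boundary points and `centroid` is the polygon centre C.
import numpy as np

def cross2d(u, v):
    return u[0] * v[1] - u[1] * v[0]

def point_in_triangle(p, a, b, c):
    # Point p is inside triangle (a, b, c) if the three cross products share a sign.
    d1 = cross2d(b - a, p - a)
    d2 = cross2d(c - b, p - b)
    d3 = cross2d(a - c, p - c)
    return (d1 >= 0 and d2 >= 0 and d3 >= 0) or (d1 <= 0 and d2 <= 0 and d3 <= 0)

def rasterise(vertices, centroid, shape):
    v = np.asarray(vertices, dtype=np.float64)
    c = np.asarray(centroid, dtype=np.float64)
    mask = np.zeros(shape, dtype=bool)
    x0, y0 = np.floor(v.min(axis=0)).astype(int)   # circumscribed rectangle
    x1, y1 = np.ceil(v.max(axis=0)).astype(int)
    for y in range(max(y0, 0), min(y1 + 1, shape[0])):
        for x in range(max(x0, 0), min(x1 + 1, shape[1])):
            p = np.array([x, y], dtype=np.float64)
            for k in range(len(v)):                 # triangle (P_k, P_{k+1}, C)
                if point_in_triangle(p, v[k], v[(k + 1) % len(v)], c):
                    mask[y, x] = True
                    break
    return mask

def poly_iou(gt_vertices, gt_centroid, pd_vertices, pd_centroid, shape):
    gt = rasterise(gt_vertices, gt_centroid, shape)
    pd = rasterise(pd_vertices, pd_centroid, shape)
    inter = np.logical_and(gt, pd).sum()
    union = np.logical_or(gt, pd).sum()
    return inter / max(union, 1)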

Refer to caption
Figure 4: Illustration of some of the boundary points and the centroid used for the concentric rectangles.

III-B Concentric Rectangles Loss

We designed the Concentric Rectangles Regression Strategy with EIoU loss[33], inspired by the previous concentric circles (the aspect-ratio terms in EIoU loss and CIoU loss[34] cannot help with circle regression). This new strategy not only accelerates convergence but also completes the polygonal position regression efficiently and accurately by fully utilizing EIoU loss.

The construction of concentric multi-rectangles: As shown in Fig. 4, four boundary points $P_0^{gt}$, $P_6^{gt}$, $P_{12}^{gt}$, $P_{18}^{gt}$ are aligned with the centroid axes, so we combine them to form $Rect_0^{gt}$ and $Rect_1^{gt}$. Each of the remaining boundary points forms a rectangle with the centroid, giving $N-2$ rectangles in total. Each rectangle has its center at the centroid, and the Euclidean distance from the centroid to the boundary point is taken as half of the rectangle's diagonal; hence we call them concentric multi-rectangles. The multi-rectangle construction for the polygon prediction box is the same.
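The exact width/height assignment of each rectangle is not spelled out in the text; a reading consistent with the half-diagonal statement is an axis-aligned rectangle centred on the centroid with the boundary point as one corner (the four axis-aligned points must then be paired into $Rect_0$ and $Rect_1$ as described above, since any one of them alone would give a degenerate rectangle). A minimal sketch under this assumption:

```python
# One concentric rectangle from a single off-axis boundary point (assumed
# construction): axis-aligned, centred on the centroid, boundary point at a
# corner, so the centroid-point distance is half of the diagonal.
def concentric_rect(centroid, point):
    cx, cy = centroid
    w = 2.0 * abs(point[0] - cx)      # width from the horizontal offset
    h = 2.0 * abs(point[1] - cy)      # height from the vertical offset
    return cx, cy, w, h               # (center_x, center_y, width, height)
```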

Loss function: The regression of a rectangular box needs to consider the degree of overlap, the distance, and the aspect ratio. Among the current IoU-based loss functions[34, 33, 35, 36], efficient IoU loss (EIoU) performs best. This paper uses EIoU loss on the ground truth-prediction pairs of the above concentric multi-rectangles; in this way, the position regression of every boundary point of the polygon can be achieved. As shown in Fig. 4, the calculation for the pair $Rect_i^{gt}$-$Rect_i^{pd}$ is as follows:

EIoU_i = IoU_i - \frac{\rho^2(C^{gt},C^{pd})}{(c_i\_diag)^2} - \frac{(Rect_i^{gt}\_w - Rect_i^{pd}\_w)^2}{(c_i\_w)^2} - \frac{(Rect_i^{gt}\_h - Rect_i^{pd}\_h)^2}{(c_i\_h)^2}   (4)

where $i=0,1,\ldots,N-3$; $\rho^2(C^{gt},C^{pd})$ is the squared Euclidean distance between the centroids of the ground truth box and the predicted box (which are also the centers of $Rect_i^{gt}$ and $Rect_i^{pd}$); $c_i\_w$, $c_i\_h$ and $c_i\_diag$ are the width, height, and diagonal length of the smallest rectangle enclosing $Rect_i^{gt}$ and $Rect_i^{pd}$; $Rect_i^{gt}\_w$ and $Rect_i^{gt}\_h$ are the width and height of the ground truth rectangle; and $Rect_i^{pd}\_w$ and $Rect_i^{pd}\_h$ are the width and height of the predicted rectangle. The EIoU loss can then be computed:

EIoU\_Loss_i = 1 - EIoU_i   (5)

Finally, we obtain $N-2$ loss terms that jointly participate in the position regression calculation.
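For reference, Eqs. (4)-(5) for one ground truth-prediction rectangle pair can be sketched as follows (our illustration, with an assumed (center, width, height) parameterisation and a small epsilon to avoid division by zero):

```python
# EIoU loss for one pair of axis-aligned rectangles centred on their centroids.
def eiou_loss(gt, pd, eps=1e-9):
    (gxc, gyc, gw, gh), (pxc, pyc, pw, ph) = gt, pd

    # IoU of the two rectangles
    ix = max(0.0, min(gxc + gw / 2, pxc + pw / 2) - max(gxc - gw / 2, pxc - pw / 2))
    iy = max(0.0, min(gyc + gh / 2, pyc + ph / 2) - max(gyc - gh / 2, pyc - ph / 2))
    inter = ix * iy
    iou = inter / (gw * gh + pw * ph - inter + eps)

    # smallest enclosing (circumscribed) rectangle of the pair
    cw = max(gxc + gw / 2, pxc + pw / 2) - min(gxc - gw / 2, pxc - pw / 2)
    ch = max(gyc + gh / 2, pyc + ph / 2) - min(gyc - gh / 2, pyc - ph / 2)
    c_diag2 = cw ** 2 + ch ** 2

    rho2 = (gxc - pxc) ** 2 + (gyc - pyc) ** 2           # squared centroid distance
    eiou = (iou - rho2 / (c_diag2 + eps)
                - (gw - pw) ** 2 / (cw ** 2 + eps)        # width term of Eq. (4)
                - (gh - ph) ** 2 / (ch ** 2 + eps))       # height term of Eq. (4)
    return 1.0 - eiou                                     # Eq. (5)
```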

III-C Adaptive Weighting Strategies of Multi-loss

The overall loss includes the multi-point regression loss, the confidence loss, and the classification loss. Among these, the multi-point regression loss accumulates the single-point losses, which significantly increases the number of loss terms. To balance the different loss components, this experiment draws on the weight adjustment methods of multi-task learning; in this paper, we use the DWA[10] adjustment strategy. The weight of the $j$-th loss term is determined as follows:

\frac{n\times\exp(r^{j}/T)}{\sum_{i=1}^{n}\exp(r^{i}/T)}   (6)

where $n$ is the number of loss terms, $r^{j}$ is the loss ratio of the $j$-th loss term (the sum in the denominator runs over all $n$ loss terms), and $T$ is the temperature. For each loss term, there are two ways to calculate its loss ratio at epoch $i$:

(1) the ratio of the epoch-$i$ loss to the epoch-$(i-1)$ loss:

r^{i}=\frac{L_{i}}{L_{i-1}}   (7)

(2) the ratio of the epoch-$i$ loss to the average loss over the first $i-1$ epochs:

r^{i}=\frac{L_{i}}{\sum_{j=1}^{i-1}L_{j}/(i-1)}   (8)

Comparing the two calculation methods, the mean-based loss ratio is more stable: the mean loss effectively reduces the impact of loss fluctuations during training on the loss ratio. The Welford algorithm[32] is used to calculate the mean loss incrementally, with the formula:

\bar{x}_{n}=\bar{x}_{n-1}+\frac{x_{n}-\bar{x}_{n-1}}{n}   (9)

where $x_n$ is the epoch-$n$ loss in our experiment. Expressing the loss rate as a ratio transforms losses of different scales into the same scale. According to the definition of the loss ratio, the better a regression task trains, the faster its loss decreases; correspondingly, the smaller the loss rate, the smaller the weight assigned to it, so the model focuses more on learning the parts that train poorly. This meets the requirement of balancing the losses of each part. The temperature coefficient additionally smooths the weights: the higher the temperature, the more evenly the weights are distributed, and when the temperature tends to infinity every part receives the same weight. Setting a temperature coefficient higher than 1 moderately increases the weight of the parts with lower loss rates. Compared with GradNorm[28], no extra backward pass is required, so DWA simplifies the computation.
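A minimal sketch of this weighting scheme (ours, not the released implementation), combining the Welford running mean of Eq. (9), the mean-based loss ratio of Eq. (8), and the DWA softmax of Eq. (6):

```python
# Per-epoch loss weights: Welford running mean of each loss term plus a
# DWA-style softmax over the mean-based loss ratios.
import numpy as np

class DwaWelfordWeights:
    def __init__(self, n_losses, temperature=20.0):
        self.T = temperature
        self.n = n_losses
        self.means = np.zeros(n_losses)   # running mean of each loss term, Eq. (9)
        self.count = 0

    def update(self, epoch_losses):
        epoch_losses = np.asarray(epoch_losses, dtype=np.float64)
        if self.count == 0:
            ratios = np.ones(self.n)                       # first epoch: uniform weights
        else:
            ratios = epoch_losses / (self.means + 1e-12)   # Eq. (8): loss / running mean
        self.count += 1
        self.means += (epoch_losses - self.means) / self.count   # Welford update, Eq. (9)
        e = np.exp(ratios / self.T)
        return self.n * e / e.sum()                        # DWA weights, Eq. (6)
```

In use, `update` would be called once per epoch with the current per-term losses, and the returned weights multiply the corresponding loss terms before summation.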

IV EXPERIMENTS AND RESULTS

This paper designs the position regression loss function and the multi-loss weighting strategy based on a modified YOLOX-s[12] detector. This section mainly introduces the process and results of the experiments.

Implementation details: WoodScape[37] is used to train the model and verify our method. The experiment randomly selects about 5000 training images and about 200 test images for mAP calculation. There are 8 categories: person, bicycle, car, motorcycle, bus, train, truck and traffic light. Our model is trained on two RTX 3090 GPUs for 500 epochs with batch size 32. However, the large number of parameters and the high computational cost of multi-point representation complicate data augmentation, so we only use simple techniques such as color space transformation, aspect ratio distortion, translation, rotation, and flipping to give the detector basic generalization ability. The weighting strategy of the loss function involves the choice of temperature and loss ratio. Experiment A sets the temperature coefficient to T=20 and defines the loss ratio as the ratio of the current loss to the previous single-epoch loss. Experiment B adopts 22 rectangles plus Poly IoU and further explores the effect of the temperature and the loss ratio.

IV-A Multi-point IoU Loss Function

TABLE I: The exploration of concentric rectangles
Loss Function                    mAP
22 vertex-shared rectangles      17.77%
22 center-shared rectangles      25.97%

At the beginning, we designed several different forms of multi-rectangle construction and selected a small sample of data as the training set.

\bullet 22 vertex-shared rectangles: The four boundary points aligned with the centroid axes are grouped into two pairs; each pair forms a rectangle with the centroid as one of its vertices (the diagonal is the line connecting the two boundary points). The remaining 20 boundary points are used to construct 20 rectangles, each with the centroid as one of its vertices (the diagonal is the line connecting the centroid and the boundary point). Every rectangle therefore has one vertex at the centroid.

\bullet 22 concentric rectangles: Section III explains this method in detail.

Both loss functions use EIoU loss. As shown in Table I, Scheme 2 performs significantly better than the other. We believe that the accuracy of the centroid has a greater impact on the final detection results than the multiple boundary points. The shared-centroid design in Scheme 2 aggregates the force of the multiple scattered rectangle regressions, which enables the rectangles to converge faster to the same center point, i.e. the centroid. Although Scheme 1 also has the feature of a shared vertex, that vertex is jointly determined by the width and height of each rectangle, making regression more difficult; the rectangles in Scheme 1 thus share no effective commonality, and no collective force is formed.

Based on the above analysis, three types of position regression loss functions (Poly IoU, 24 concentric circles loss[9] and 22 concentric rectangles EIoU) were used either individually or in combination for a comparative experiment. The mAP results are shown in Table II. Using Poly IoU alone for position regression results in poor detection performance, for three main reasons. Firstly, rectangular bounding boxes can exploit IoU loss better than Euclidean distance loss because of the strong correlation among their four corners, whereas the boundary points of a polygonal box are relatively independent, so an area loss alone cannot accurately predict multiple points; the direct analogy from rectangular IoU to Poly IoU without adaptation therefore loses its effect. Secondly, Poly IoU loss struggles to achieve synchronous regression of multiple points; by splitting it into 22 rectangular IoU losses, each point can be regressed separately and then aggregated into the overall regression. Thirdly, Poly IoU does not restrict the shape and orientation of the predicted bounding boxes: the same IoU value corresponds to different shapes and orientations, which decreases prediction accuracy.

Compared to Poly IoU loss, the detection performance is significantly improved when using the 24 concentric circles GIoU loss or the 22 concentric rectangles EIoU loss, and combining either with Poly IoU loss improves it further. Overall, the combination of the proposed concentric rectangles EIoU loss and the Poly IoU loss is a good choice as a regression strategy for multi-point representation. The detection results and ground truth of each method are shown in Fig. 5.

TABLE II: The mAP of different IoU losses
Loss Function                      mAP
Poly IoU                           3.12%
24 circles GIoU                    32.24%
22 rectangles EIoU                 34.10%
24 circles GIoU + Poly IoU         34.28%
22 rectangles EIoU + Poly IoU      36.09%

IV-B Exploration about Loss Weight

Exploration of the temperature: At the initial stage of the experiment, the temperature coefficient was set to 1, but the losses of each part fluctuated greatly and the final prediction performance was poor. Therefore, the temperature coefficient was adjusted, changing the weight distribution. Through a series of comparisons, the experiment finds that setting the temperature coefficient to a value greater than 1 reduces the instability of the loss and improves prediction performance. Since the temperature coefficient smooths the weights, increasing it moderately reduces the differences between the weights and avoids sharp weight changes between batches. Meanwhile, increasing the temperature moderately raises the weight of the losses that train well, which reduces the risk of falling into a local minimum to some extent. However, too high a temperature makes the weights lose their discrimination, so the model cannot focus on the poorly performing regressions. It is therefore necessary to find the optimal value by experimental comparison. As shown in Table III, based on the mAP under different temperature coefficients, the final temperature is set to 20.

TABLE III: The mAP of different temperatures
Temperature     mAP
T=1             30.95%
T=20            40.60%
T=50            35.55%
Refer to caption
Figure 5: Ground truth and detection results.

Exploration of the loss ratio: The loss curves are shown in Fig. 6. The single-epoch loss fluctuates greatly, while the average loss is relatively smooth. If the loss ratio is defined by the single-epoch loss, its value depends strongly on the loss of the previous epoch, and the irregular fluctuations of the single-epoch loss affect the accuracy of the loss ratio and the weight allocation. Therefore, the experiment improves the loss ratio by replacing the single-epoch loss with the more stable average loss, making the calculation of the loss rate more accurate.

Refer to caption
Figure 6: Comparison of mean loss until the last epoch and current loss.

V CONCLUSIONS AND FUTURE WORK

In this paper, we studied the regression loss for multi-point representation and designed careful comparative experiments. To sum up, we draw the following conclusions.

(1) It is difficult to acquire satisfactory results with Poly IoU loss alone. To solve this problem, we split the multi-point group into single points and design a corresponding loss term for each point. The experiments show that loss terms related to the dense distances significantly improve detection performance, and that the combination of Poly IoU loss and Concentric Rectangles Loss performs even better.

(2) Moderately increasing the temperature decreases the discrimination between weights and enhances the stability of the total loss. In addition, the mean loss helps reduce the impact of loss fluctuations and makes weight allocation more effective.

Our future work aims to improve the generalization of the model and to explore a mechanism for selecting points according to object size and shape. We believe that multi-point representation will continue to receive attention.

ACKNOWLEDGMENT

This work was partly supported by National Natural Science Foundation of China (Grant No. U1913203, 61973034, 62233002 and CJSP Q2018229). The authors would like to thank Tianji Jiang, Jiadong Tang, Zhaoxiang Liang, Dianyi Yang, and all other members of ININ Lab of Beijing Institute of Technology for their contribution to this work.

References

  • [1] J. Ma, W. Shao, H. Ye, L. Wang, H. Wang, Y. Zheng, and X. Xue, “Arbitrary-oriented scene text detection via rotation proposals,” IEEE transactions on multimedia, vol. 20, no. 11, pp. 3111–3122, 2018.
  • [2] M. Liao, B. Shi, and X. Bai, “Textboxes++: A single-shot oriented scene text detector,” IEEE transactions on image processing, vol. 27, no. 8, pp. 3676–3690, 2018.
  • [3] H. Yang, R. Deng, Y. Lu, Z. Zhu, Y. Chen, J. T. Roland, L. Lu, B. A. Landman, A. B. Fogo, and Y. Huo, “Circlenet: Anchor-free detection with circle representation,” arXiv preprint arXiv:2006.02474, 2020.
  • [4] W. Dong, P. Roy, C. Peng, and V. Isler, “Ellipse r-cnn: Learning to infer elliptical object from clustering and occlusion,” IEEE Transactions on Image Processing, vol. 30, pp. 2193–2206, 2021.
  • [5] T. Li, G. Tong, H. Tang, B. Li, and B. Chen, “Fisheyedet: A self-study and contour-based object detector in fisheye images,” IEEE Access, vol. 8, pp. 71739–71751, 2020.
  • [6] H. Rashed, E. Mohamed, G. Sistu, V. R. Kumar, C. Eising, A. El-Sallab, and S. Yogamani, “Fisheyeyolo: Object detection on fisheye cameras for autonomous driving,” in Machine Learning for Autonomous Driving NeurIPS 2020 Virtual Workshop, vol. 8, 2020.
  • [7] H. Rashed, E. Mohamed, G. Sistu, V. R. Kumar, C. Eising, A. El-Sallab, and S. Yogamani, “Generalized object detection on fisheye cameras for autonomous driving: Dataset, representations and baseline,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2272–2280, 2021.
  • [8] E. Xie, P. Sun, X. Song, W. Wang, X. Liu, D. Liang, C. Shen, and P. Luo, “Polarmask: Single shot instance segmentation with polar representation,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 12193–12202, 2020.
  • [9] X. Xu, Y. Gao, H. Liang, Y. Yang, and M. Fu, “Fisheye object detection based on standard image datasets with 24-points regression strategy,” in 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 9911–9918, IEEE, 2022.
  • [10] S. Liu, E. Johns, and A. J. Davison, “End-to-end multi-task learning with attention,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 1871–1880, 2019.
  • [11] R. Girshick, “Fast r-cnn,” in Proceedings of the IEEE international conference on computer vision, pp. 1440–1448, 2015.
  • [12] Z. Ge, S. Liu, F. Wang, Z. Li, and J. Sun, “Yolox: Exceeding yolo series in 2021,” arXiv preprint arXiv:2107.08430, 2021.
  • [13] C.-Y. Wang, A. Bochkovskiy, and H.-Y. M. Liao, “Yolov7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors,” arXiv preprint arXiv:2207.02696, 2022.
  • [14] Z. Duan, O. Tezcan, H. Nakamura, P. Ishwar, and J. Konrad, “Rapid: rotation-aware people detection in overhead fisheye images,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 636–637, 2020.
  • [15] O. Krams and N. Kiryati, “People detection in top-view fisheye imaging,” in 2017 14th IEEE international conference on advanced video and signal based surveillance (AVSS), pp. 1–6, IEEE, 2017.
  • [16] B. Arsenali, P. Viswanath, and J. Novosel, “Rotinvmtl: Rotation invariant multinet on fisheye images for autonomous driving applications,” in Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, pp. 0–0, 2019.
  • [17] E. H. Nguyen, H. Yang, R. Deng, Y. Lu, Z. Zhu, J. T. Roland, L. Lu, B. A. Landman, A. B. Fogo, and Y. Huo, “Circle representation for medical object detection,” IEEE transactions on medical imaging, vol. 41, no. 3, pp. 746–754, 2021.
  • [18] M. Cao, S. Ikehata, and K. Aizawa, “Field-of-view iou for object detection in 360° images,” arXiv preprint arXiv:2202.03176, 2022.
  • [19] P. Zhao, A. You, Y. Zhang, J. Liu, K. Bian, and Y. Tong, “Spherical criteria for fast and accurate 360 object detection,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 12959–12966, 2020.
  • [20] R. Caruana, Multitask learning. Springer, 1998.
  • [21] O. Sener and V. Koltun, “Multi-task learning as multi-objective optimization,” Advances in neural information processing systems, vol. 31, 2018.
  • [22] X. Lin, H. Baweja, G. Kantor, and D. Held, “Adaptive auxiliary task weighting for reinforcement learning,” Advances in neural information processing systems, vol. 32, 2019.
  • [23] Y. Du, W. M. Czarnecki, S. M. Jayakumar, M. Farajtabar, R. Pascanu, and B. Lakshminarayanan, “Adapting auxiliary losses using gradient similarity,” arXiv preprint arXiv:1812.02224, 2018.
  • [24] A. Kendall, Y. Gal, and R. Cipolla, “Multi-task learning using uncertainty to weigh losses for scene geometry and semantics,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7482–7491, 2018.
  • [25] A. Kendall and Y. Gal, “What uncertainties do we need in bayesian deep learning for computer vision?,” Advances in neural information processing systems, vol. 30, 2017.
  • [26] F. Zheng, C. Deng, X. Sun, X. Jiang, X. Guo, Z. Yu, F. Huang, and R. Ji, “Pyramidal person re-identification via multi-loss dynamic training,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 8514–8522, 2019.
  • [27] S. Liu, Y. Liang, and A. Gitter, “Loss-balanced task weighting to reduce negative transfer in multi-task learning,” in Proceedings of the AAAI conference on artificial intelligence, vol. 33, pp. 9977–9978, 2019.
  • [28] Z. Chen, V. Badrinarayanan, C.-Y. Lee, and A. Rabinovich, “Gradnorm: Gradient normalization for adaptive loss balancing in deep multitask networks,” in International conference on machine learning, pp. 794–803, PMLR, 2018.
  • [29] M. Crawshaw and J. Košecká, “Slaw: Scaled loss approximate weighting for efficient multi-task learning,” arXiv preprint arXiv:2109.08218, 2021.
  • [30] S. Verboven, M. H. Chaudhary, J. Berrevoets, V. Ginis, and W. Verbeke, “Hydalearn: Highly dynamic task weighting for multitask learning with auxiliary tasks,” Applied Intelligence, pp. 1–15, 2022.
  • [31] R. Groenendijk, S. Karaoglu, T. Gevers, and T. Mensink, “Multi-loss weighting with coefficient of variations,” in Proceedings of the IEEE/CVF winter conference on applications of computer vision, pp. 1469–1478, 2021.
  • [32] B. Welford, “Note on a method for calculating corrected sums of squares and products,” Technometrics, vol. 4, no. 3, pp. 419–420, 1962.
  • [33] Y.-F. Zhang, W. Ren, Z. Zhang, Z. Jia, L. Wang, and T. Tan, “Focal and efficient iou loss for accurate bounding box regression,” Neurocomputing, vol. 506, pp. 146–157, 2022.
  • [34] Z. Zheng, P. Wang, W. Liu, J. Li, R. Ye, and D. Ren, “Distance-iou loss: Faster and better learning for bounding box regression,” in Proceedings of the AAAI conference on artificial intelligence, vol. 34, pp. 12993–13000, 2020.
  • [35] J. Yu, Y. Jiang, Z. Wang, Z. Cao, and T. Huang, “Unitbox: An advanced object detection network,” in Proceedings of the 24th ACM international conference on Multimedia, pp. 516–520, 2016.
  • [36] H. Rezatofighi, N. Tsoi, J. Gwak, A. Sadeghian, I. Reid, and S. Savarese, “Generalized intersection over union: A metric and a loss for bounding box regression,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 658–666, 2019.
  • [37] S. Yogamani, C. Hughes, J. Horgan, G. Sistu, P. Varley, D. O’Dea, M. Uricár, S. Milz, M. Simon, K. Amende, et al., “Woodscape: A multi-task, multi-camera fisheye dataset for autonomous driving,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9308–9318, 2019.