A Guide to Image and Video based Small Object Detection using Deep Learning : Case Study of Maritime Surveillance

Aref Miri Rekavandi, Member, IEEE, Lian Xu, Farid Boussaid, Abd-Krim Seghouane, Senior Member, IEEE, Stephen Hoefs, and Mohammed Bennamoun, Senior Member, IEEE, Aref Miri Rekavandi, Lian Xu, and Mohammed Bennamoun are with the Department of Computer Science and Software Engineering, The University of Western Australia, Australia.Farid Boussaid is with the Department of Electrical, Electronics and Computer Engineering, The University of Western Australia, Australia.Abd-Krim Seghouane is with the School of Mathematics and Statistics, The University of Melbourne, Melbourne, Australia.Stephen Hoefs is discipline leader of submarine optronics, undersea combat systems, and undersea command and control maritime division, Defence Science and Technology Group, Australia. (Email: [email protected]

Abstract

Small object detection (SOD) in optical images and videos is a challenging problem and even state-of-the-art generic object detection methods fail to accurately localize and identify small objects. Because small objects occupy only a small area in the input image (e.g., less than 10%), the information extracted from such a small area is not always rich enough to support decision making. Multidisciplinary strategies are being developed by researchers working at the interface of deep learning and computer vision to enhance the performance of SOD deep learning based methods. In this paper, we provide a comprehensive review of over 160 research papers published between 2017 and 2022 in order to survey this growing subject. This paper summarizes the existing literature and provides a taxonomy that illustrates the broad picture of current research. We investigate how to improve the performance of small object detection in maritime environments, where increasing performance is critical. By establishing a connection between generic and maritime SOD research, future directions have been identified. In addition, the popular datasets that have been used for SOD for generic and maritime applications are discussed, and also well-known evaluation metrics for the state-of-the-art methods on some of the datasets are provided.

Index Terms:

Object recognition, small object detection, object localization, deep learning, maritime surveillance.

1 Introduction

Refer to caption — Figure 1: Examples of small objects. Source: MS COCO dataset [lin2014microsoft]. By definition, small objects refer to the objects smaller than $32\times 32$ pixels or objects which cover less than only 10% of the image.

Object detection is at the heart of many computer vision applications and has grown in importance over the last decade. It plays a crucial role in modern computer vision tasks such as autonomous driving [chen2016monocular, wang2020pillar], pedestrian identification [liu2019high, lan2018pedestrian], image captioning [herdade2019image, iwamura2021image], object tracking [yin2021center, lee2021cnn], ship detection [liu2017rotated, rekavandi2021robust] face recognition [yan2022age, li2022enhanced], traffic control [khan2022machine, ge2022vehicle], animal detection [berg2022weakly, xue2022small], action recognition [kanimozhi2022key, patil2022survey], environment surveillance [jha2021real, kumar2010addressing], video checking in sports [roros2022maskgru, li2022automatic], and many others. Object detection methods have become increasingly popular with the advances in deep learning and GPU power that allow Deep Neural Nets (DNNs) to be trained faster and more efficiently in recent years. Object detection methods are classified into two-stage and single stage methods. A few notable two-stage methods include Region-Based CNN (R-CNN) [girshick2014rich], Spatial Pyramid Pooling Network (SPP-Net) [he2015spatial], Fast R-CNN [girshick2015fast], Faster R-CNN [ren2015faster], Region-Based Fully Convolutional Networks (R-FCN) [dai2016r], Mask R-CNN [he2017mask], Feature Pyramid Networks (FPN) [lin2017feature], cascade R-CNN [cai2018cascade], and Libra R-CNN [pang2019libra]. These methods identify the regions in an image that are most likely to contain objects, then features are extracted to classify the objects, followed by a fine-tuning step to accurately localize the bounding boxes surrounding the objects. Some anchor-free (anchors are defined as a set of bounding boxes with a particular height and width) detectors such as RepPoints [yang2019reppoints] can also be viewed as two-stage methods. On the other hand, single-stage methods treat the object detection task as a regression problem and estimate the parameters of the bounding boxes and the probability that these boxes contain the target objects. This category of methods includes You Only Look Once (YOLO) and its variants [redmon2016you, redmon2017yolo9000, redmon2018yolov3, bochkovskiy2020yolov4, jocher2020yolov5], Single Shot multibox Detector (SSD) [liu2016ssd], RetinaNet [lin2017focal], Multi-Scale Deep Feature Learning Network (MDFN) [ma2020mdfn] and anchor-free object detection methods such as CornerNet [law2018cornernet], CenterNet [duan2019centernet], FCOS [tian2019fcos]. Although the above mentioned object detection techniques have undoubtedly grown due to the availability of large datasets, e.g., ImageNet [russakovsky2015imagenet], PASCAL VOC [everingham2010pascal] and MS COCO [lin2014microsoft], most of these deep learning based techniques fail to accurately localize and identify small objects. The main reason for their poor performance to deal with small objects is due to the loss of the geometrical information in the last layers of their networks and their large receptive fields. Solely the semantic information recovered from the last layers of deep neural networks is indeed useful for larger objects classification, but cannot help with the localization of small objects. Max pooling or large steps toward down sampling are responsible for the large receptive fields of the convolutional layers, e.g., $\times 8$ and $\times 32$ in SSD and YOLO. As a result, the last layers of deep networks have a small number of nodes whose values reflect the small objects in the input image, which is not desirable for SOD.
The applications of small object detection (SOD) are but not limited to pedestrian detection [song2018small, wu2020self], medical image analysis [xing2016robust, rashidi2020optimal], industrial product quality assessment [abedini2017defect], face recognition in surveillance cameras [cho2018face], sign detection in autonomous driving [li2022trafficss], ship detection in remotely sensed images [liu2017rotated] and others. In spite of the extensive potential use of SOD methods in the maritime surveillance, unlike the other applications, this area has not been explored as much as it truly deserves. This may be due to the paucity of publicly available datasets for maritime environments, as compared to those for other applications. On the other hand, approximately $70\%$ of the planet is covered by water, so most of the global trade and transportation of goods takes place by sea [international2011international]. This requires accurate monitoring of the environment for rescue missions, and to avoid collisions, pollution from oil leaks, illicit cargos, illegal smuggling, fisheries dumping of pollutants, and the crossing of borders by unidentified vessels. In spite of the fact that an Automatic Identification System (AIS) can be used to monitor vessels, many small and even medium-sized vessels lack such technology, or intentionally switch it off when they conduct illegal activities. Therefore, the development of a wide range automatic system that is capable of detecting and identifying small boats is vital. Synthetic Aperture Radar (SAR) technology has been the leading technology since the 1990s, providing an all-time performance and a strong signal reflection response from normal large vessels. However, the relatively weak reflected signal from small or medium-sized targets with small radar cross-sections makes it difficult to recognize targets due to the observed speckle multiplicative noise, resulting in a high number of false positives. Furthermore, SAR cannot provide global range monitoring because of its limited spatio-temporal coverage. This opens up a wide range of research opportunities in maritime environments, including the detection of objects based on images and videos.

A variety of definitions have been reported for “small object” in the literature, but most studies define a small object as one that is smaller than $32\times 32$ pixels. In high resolution images, a small object is one that covers less than 10% of the image [lin2014microsoft]. This definition means that the object of interest does not provide much information in terms of colour, shape, texture, or any other type of visually discernible information, making the task of SOD particularly challenging. There are mainly two reasons why small objects appear in images and videos. First, the object appears small by virtue of its size, e.g., a bird relative to a tree, a tennis ball relative to a tennis court, or a mobile phone relative to an indoor space, and so forth. Second, a large object-camera distance can also lead to the object looking small, in which case the object’s real size is irrelevant. Even a ship can appear small and occupy only a few pixels in a satellite image. Fig. 1 shows examples of small objects.

The task of small object detection is typically performed through a variety of computer vision techniques, such as semantic segmentation, foreground background (FB) separation, anomaly detection, regression, and finally classification. Many data modalities have also been explored in the context of SOD in the literature, including AIS data, satellite-based SAR and multi-spectral data, airborne SAR, multi-spectral data from Unmanned Aerial Vehicles (UAVs), on board (ship based, unmanned surface vessels, etc.) visual (RGB video and image), InfraRed (IR) and Near InfraRed (NIR) data, and finally shore based which includes visual data (RGB video and image), etc. Often these modalities differ in terms of their spatial and temporal resolution, cost of acquiring data, delay, robustness, range of coverage, etc. [kanjir2018vessel]. Spaceborne data (satellite), for example, can be accessed remotely. Satellites positioned in geostationary orbits may also capture images of the surface of the earth while maintaining the same footprint. Data volume generated by this technology is quite large, and it is often not suitable for continuous monitoring [kanjir2018vessel]. Furthermore, spaceborne optical images are affected by bad weather (clouds covering objects of interest), while radar data has a low resolution. Infrared imaging is particularly well-suited for night-time monitoring. However, it becomes saturated during the daytime and it does not provide colour information. Optical imaging on the other hand, provides rich colour information, real-time operation, adequate spatial resolution, and is relatively inexpensive. In particular, spaceborne optical sensors are growing in number and are becoming increasingly popular because of their excellent spatial coverage. For this reason, this survey paper will focus on images or videos acquired by optical cameras, from space, air, in-shore and off-shore.

Specifically, this paper will review the field of small object detection using deep learning, with a case study covering maritime applications. Our literature survey was conducted by searching for keywords such as “small object detection”, “small target detection”, “tiny object detection”, and “ship detection” in title. Checking the corresponding references of individual papers on Google Scholar also yielded a comprehensive list of studies. We limited the scope of this survey to deep learning based methods. Our survey paper reviewed more than 160 papers, most of which were published after 2017 (Fig. 2), when deep learning methods began to show promising results for object detection. Small object detection is a relatively new field, so this survey provides an overview of the current state-of-the-art (SOTA) and may also serve as a guide for upcoming research. In summary, the contributions of this survey paper are as follows:

TABLE I: A list of the recently published surveys on maritime and generic SOD.

Survey Title	Year	Publisher	Category	Image/Video	Limitations	Strengths
Video processing from electro-optical sensors for object detection and tracking in a maritime environment: a survey [prasad2017video]	2017	IEEE Transactions on Intelligent Transportation Systems	Maritime	Video	It just covers the classical methods not the DNNs	Both Visible and NIR parts of the spectrum
Vessel detection and classification from spaceborne optical images: A literature survey [kanjir2018vessel]	2018	Remote Sensing of Environment	Maritime	Image	This survey is up to 2017, does not contain deep learning based methods, constrained to spaceborne images	Covers all the classical approaches multiple data modalities in details
Recent advances in small object detection based on deep learning: A review [tong2020recent]	2020	Image and Vision Computing	Generic	Image	Their taxonomy is very general, does not cover maritime environment, does not cover video	It gives a great list of the works up to 2020 for deep learning based methods
Ship detection and classification from optical remote sensing images: A survey [bo2021ship]	2021	Chinese Journal of Aeronautics	Maritime	Image	This survey is up to 2020, constrained to remote sensing images, Not detailed for DNNs	To an extent at time of publication is up to date and includes some DNN based methods
Survey on Deep Learning-Based Marine Object Detection [zhang2021survey]	2021	Journal of Advanced Transportation	Maritime	Image & Video	It does not categorize the studies based on their adopted appraches, does not introduce available datasets, does not emphasize on SOD	A recent review which to an extent includes deep learning methods for maritime up to 2021
Survey of Video Based Small Target Detection [liu2021survey]	2021	Journal of Image and Graphics	Generic	Video	It focuses mostly on spatial methods, instead of spatio temporal, datasets are not comprehensive	Recent video based detection survey for SOD, addresses studies up to 2021
A survey of the four pillars for small object detection: Multiscale representation, contextual information, super-resolution, and region proposal [chen2020survey]	2022	IEEE Transactions on systems, man, and cybernetics: systems	Generic	Image	The aerial perspective is not included, limited datasets, subsection of the current manuscript.	Divides the prior works into four categories that are somehow related to popular object detection frameworks
A Guide to Image and Video based Small Object Detection using Deep Learning : Maritime Surveillance Case Study (Ours)	2022	ArXiv	Generic & Maritime	Image & Video	Limited to optical images and only DNN based techniques	We cover state-of-the-art methods in DNNs including transformers, We cover both image and video, we list all the available datasets in detail, we suggest very diverse future research directions

•

First, we review generic small object detection methods. This is the first review that explores both image and video modalities for small object detection using deep learning frameworks, including both CNNs and transformers (transformers have not previously been covered previously in any survey). Our careful review of the literature has allowed us to identify research gaps and suggest potential research directions.
•

Our study has identified object detection in maritime environments as an important and challenging task, and in addition to generic SOD, we also present a systematic review of SOD in maritime environments.
•

By comparing and establishing links between the literature of generic and maritime SOD, possible research directions are highlighted for both domains.
•

There is a limited number of datasets available, and we believe that is the main hurdle for researchers who do not work in this field of research. Therefore, in order to allow future research to be explored more effectively, we have compiled the most relevant and comprehensive datasets (50 datasets) specific to SOD.
•

Finally, the limitations of existing works as well as possible future directions, and potential tools that could be useful for SOD have been identified.

Review papers for SOD are listed in Table I. Our paper differs from existing surveys in that we consider both image and video modalities, look at each component of learning pipeline from the input to the output, establish and discuss the link between maritime and generic SOD to identify research gaps, and introduce the recent deep learning methods that have been proposed up to May 2022. Fig. 3 shows a taxonomy of small object detection methods, where the works are divided into categories according to their methodology, domains, and applications.
The remainder of the paper is organized as follows: The challenges of SOD are discussed in Section 2. Section 3, summarizes existing single- and double-stage detectors and the well-used backbones in the context of SOD. In Sections 4 and 5, we examine generic and maritime SOD methods. We provide evaluation metrics and datasets in Section 6 and compare and discuss methods and potential reserach gaps in Section 7. Finally, the paper concludes in Section 8.

2 Challenges in SOD

Let’s explore some of the potential challenges that potential SOD users may encounter before we delve into the technical content and methodologies. While some of these challenges are common across generic and maritime domains, others are specific to the maritime environment. Listed below are the most common challenges of SOD that fall under maritime specific and generic SOD. Here are some challenges associated with generic SOD:

•

As a result of the small number of pixels representing each object, SOD loses geometrical information in the deeper layers of the network, resulting in false object detection.
•

Small objects are usually occluded by larger objects, and their extracted features behave like clutter because of their relatively weak feature values.
•

Object detection evaluation metrics that are commonly used are not appropriate for small objects. These metrics can become quite sensitive when the bounding boxes are small, leading to the underestimation of methods or even incorrect solutions.
•

Compared to regular-size object detection, very few small object datasets have been released to date.
•

To annotate the ground truth frames between the ground truth human annotated frames in video object detection, most commonly used softwares use interpolation to draw the bounding boxes (e.g., they annotate the 1st and 10th frames, assume linear motion, and use linear interpolation to annotate the frames in between). This is not an issue for large object detection, however it may produce very noisy ground truth labels for SOD. SOD methods should therefore be robust to such deviations.

Challenges associated with the detection of small objects in maritime environments include:

•

The reflection of light from the water and waves can cause rapid changes in illumination in video frames.
•

The dynamic nature of maritime environments and challenging weather conditions significantly reduce the range of sight and make the images blurry or hazy. As a result, such environmental factors can adversely affect detection performance, especially when using passive remote sensing imaging to detect ships.
•

Most of the maritime datasets are aerial. Consequently, depending on the viewing angle and relative position of the target, the object may appear distorted in the image or can appear at different scales, structures and shapes which makes the detection more challenging.
•

A ship dataset can show greater intra-class variation than inter-class variation, increasing the complexity of maritime SOD.
•

When aerial data is acquired, the camera’s perspective towards the object can rapidly change between frames. A highly dynamic scenario like this can result in the object being missed in SOD over many frames.
•

Especially for cameras installed on ships, the image data shows jitter at high frequencies and a shift in the field of view at low frequencies due to irregular jittering, hull swaying, and hull heaving [qiao2021marine].

3 Background

To ensure completeness, this section provides a brief overview of the most important object detection frameworks that have been used in the SOD literature, including their underlying principles and backbones.
Regional Based Detectors: Also known as two-stage detectors, they typically involve the following three main steps: (i) region proposal, (ii) feature extraction, and (iii) classification. The first version of this framework was the Region-Based CNN (R-CNN) [girshick2014rich], whose pipeline is shown in Fig. 4(a). R-CNN takes the input image and extracts approximately $2K$ region proposals of different scales using selective search [uijlings2013selective]. In a second step, a CNN is used for feature extraction through five convolutional layers with two fully connected layers (4096-dimensional features), and then SVMs are used for classification. The R-CNN algorithm is relatively slow (two stages) and needs to pass each region individually without sharing computation. In addition, it is trained in multiple stages. R-CNN’s first issue is fixed by SPP-Net, which shares computation [he2015spatial]. SPP-Net extracts convolutional feature maps from the entire image and features are extracted from the shared feature maps to classify the objects in region proposals. In this way, the process becomes faster, and the runtime at the test stage is also reduced. In [girshick2015fast], an extension of R-CNN dubbed Fast R-CNN was proposed to increase the runtime speed of R-CNN and SPP-Net, using a multi-task loss function for learning in a single stage. On the deep VGG16 network, fast R-CNN improves training time by $9\times$ and test time by $213\times$ over regular R-CNN. Fast R-CNN jointly classifies and localizes bounding boxes (Fig. 4(b)). Faster R-CNN [ren2015faster] was introduced to improve the bottleneck of the two-stage framework, which is the first step of the pipeline (i.e., the region proposal extraction step) by replacing the selective search module with another convolutional network called the Region Proposal Network (RPN), which shares the features with the detection network. RPN takes an input image and returns a set of rectangular object proposals, each with an object score. Figure 4(c) is a flowchart of the RPN where k is the number of anchors. When 300 proposal regions per image are used, the processing frame rate reaches 5 frames per second (including all steps). Additionally, the convolution layers are shared between detection and region proposal networks. The R-FCN [dai2016r] approach was then developed to circumvent the process of repeatedly applying per-region subnetworks by sharing almost all computations across the entire image, using fully convolutional networks. Fig. 4 (d) shows the block diagram for this technique. Finally, FPN was used in [lin2017feature] to improve the object detection performance especially for small size objects since it concatenates the information of the deeper and early layers together to produce a decision. A typical FPN is shown in Fig. 4 (e).
Single Stage Detectors: YOLO [redmon2016you] was the first proposed single-stage detector, which viewed the problem of object detection as a regression problem i.e., regressing the bounding box coordinates. Since the whole detection framework is performed in a single stage, the training process can be performed in an end-to-end manner. YOLO’s first version achieved 45fps, making it suitable for real-time detection. However, the performance was relatively worse than its two-stage counterpart. As shown in Fig. 4 (f), the algorithm divides the image into $S\times S$ grids and checks whether the center of each object lies within a grid cell. After that, the matched grid cell will regress the bounding box of the selected object in the grid. Finally, the overlapping bounding boxes are merged to produce the most plausible bounding boxes. The initial version of YOLO had strong spatial constraints, which made nearby objects difficult to detect. In order to address this problem and scale up the detection framework to a variety of objects, YOLOv2 was proposed in [redmon2017yolo9000]. YOLO’s localization error and low recall were identified by [redmon2017yolo9000] as its most important limitations, which were addressed through batch normalization, high resolution classifiers, the use of anchor boxes instead of fully connected layers, and the use of clustering to determine the bounding box sizes as priors. A multi-scale prediction was used in YOLOv3 [redmon2018yolov3], to estimate bounding boxes at three different scales. A new network, called Darknet-53, has been proposed in [redmon2018yolov3], Which combines Darknet-19 and a residual network with 53 convolutional layers. In addition, the activation function of softmax has been replaced by logistic classifiers. YOLOv4 [bochkovskiy2020yolov4] was built upon CSPDarknet53 on top of YOLOv3 and used Weighted-Residual-Connections (WRC), Cross-Stage-Partial-connections (CSP), Cross mini-Batch Normalization (CmBN), Self-adversarial-training (SAT) and Mish-activation to improve the performance. The YOLO framework has been used to develop several other models, including [jocher2020yolov5, long2020pp, wu2021yolop, ge2021yolox, chen2021you, wang2021you]. SSD [liu2016ssd] is another single-stage detector that at first, was as accurate as the two-stage detectors while being much faster than its two-stage competitors. The core idea behind SSD is to determine the category scores and box offsets for a set of predefined bounding boxes using small convolutional filters on top of the feature maps. As shown in Fig. 4(g), various scales of feature maps have been used to perform the prediction. RetinaNet [lin2017focal] was then proposed to alleviate the problem of class imbalance. In RetinaNet, a new focal loss focusing on hard examples was proposed by adding a multiplicative factor to the cross-entropy loss. Through this approach, the performance finally reached the performance of the SOTA two-stage methods. The structure of RetinaNet as shown in Fig. 4(h) uses the FPN as the neck of the pipeline.
The typical backbones used to extract learned features from image include: VGGNet [simonyan2014very], ResNet [he2016deep], ResNeXt [xie2017aggregated], Inception [szegedy2016rethinking], ZF Net [zeiler2014visualizing] MobileNet [howard2017mobilenets, howard2019searching], DenseNet [huang2017densely], SqueezeNet [iandola2016squeezenet], ShuffleNet [zhang2018shufflenet], Darknet [li2018detnet], EfficientNet [tan2019efficientnet] and Hourglass [newell2016stacked].

4 Generic Small Object Detection

Throughout this section, we will examine extensively SOD methods for both image and video modalities for generic applications. In Fig. 3, we have categorized the methods for each modality and discussed how they are related below.

4.1 Image based SOD

The topics covered in this section include training datasets, architecture, feature learning and objective loss functions. Fig. 5(a) shows a general block diagram of image-based SOD methods.

4.1.1 Data Preparation

Data Augmentation. In computer vision, data augmentation is commonly used to address the problem of limited labelled data samples. Its goal is to generate a large, high-quality, and diverse set of training datasets that will enable deep learning models to be more robust and generalizable. The traditional methods of data augmentation can be broadly categorized into: (i) geometric transformations-based, including rotation, scaling, flipping, cropping, padding, translation, affine transformation, etc. (ii) Photometric transformations-based, i.e., changing the color components, which include brightness, contrast, hue, saturation, etc. In addition to these pixel-level adjustments based data augmentation methods, there are several patch-level manipulation methods, such as random erase [zhong2020random], CutOut [devries2017improved], CutMix [yun2019cutmix] and grid mask [chen2020gridmask]. Recent advances in Generative Adversarial Networks (GANs) provide a new avenue for data augmentation [antoniou2017data] by synthesizing realistic training samples of different styles [CycleGAN2017] or even novel unseen classes [wang2018low]. Moreover, Cubuk et al. [cubuk2019autoaugment] proposed a reinforcement learning-based data augmentation method, “AutoAugment”, to automatically search for the optimal augmentation strategy to train a classification model. Various data augmentation techniques have been used with the existing object detection methods, such as the horizontal flipping used with Fast R-CNN [girshick2015fast] and Cascade R-CNN [cai2019cascade], saturation and exposure shifts used in YOLO [redmon2016you] and YOLO9000 [redmon2017yolo9000], and the “Mosaic” strategy proposed with YOLOv4 [bochkovskiy2020yolov4]. Zoph et al. [zoph2020learning] extended AutoAugment [cubuk2019autoaugment] to the object detection task by performing the augmentation operations on the bounding boxes. However, existing object detection methods generally perform worse on small objects, compared to medium or large objects. There are two main reasons: (i) there are much less images containing small objects in the training dataset, leading to a model that is biased towards medium or large objects; (ii) in those images containing small objects, the small object regions are too small, leading to a limited number of matched anchors. This namely decreases the probability of small objects to be detected. To address these problems, Kisantal et al. [kisantal2019augmentation] proposed two data augmentation methods accordingly. (i) An oversampling method was used to increase the number of training samples of small objects. (ii) To increase the number of small objects appearing in a single image, multi-copy-pasting of small objects was used to increase the likelihood of matching anchors with small target objects. Based on the copy-paste augmentation strategy [kisantal2019augmentation], Chen et al. [chen2019rrnet] proposed an adaptive resampling augmentation method, which uses a pre-trained semantic segmentation model to determine suitable image regions for the augmented object pastes. This method effectively addresses the problems of background and scale mismatches when performing random pastes. In order to exploit additional datasets of different object scale distributions to pre-train the network for small object detection, Yu et al. [yu2020scale] proposed a scale match approach to align the scale distributions of the pre-training dataset with that of the target small object dataset. Similarly to the Mosaic strategy [bochkovskiy2020yolov4], Chen et al. [chen2020stitcher] proposed to balance the scale distribution of a training dataset by stitching multiple images of medium- or large-size objects to form a down-scaled collage image. Moreover, a feedback-driven decision paradigm based on the loss statistics of the minority small-scale objects was proposed to guide the image stitching process.

Super Resolution. The limited region of interest (RoI) for small objects results in insufficient feature information for an accurate detection prediction. To address this problem, a straightforward method is to perform super-resolution, namely recovering high-resolution images from their low-resolution counterparts [wang2020deep]. There are typically two types of super-resolution strategies for small object detection: (i) image super-resolution and (ii) feature super-resolution. Haris et al. [haris2021task] proposed to concatenate a super-resolution network prior to a detection network for an end-to-end training. The super-resolution process was also driven by the detection objectives, thus leading to better detection-oriented super-resolved images. Bai et al. [bai2018sod] proposed a multi-task generative adversarial network for small object detection (SOD-MTGAN). More specifically, SOD-MTGAN is composed of: (i) a generator which reconstructs super-resolved RoI images from the small blurred ones, and (ii) a multi-task discriminator to perform detection on the super-resolved RoI images and differentiates real high-resolution RoI images from the fake generated ones. Image super-resolution can help recover details of small objects in an image, thereby resulting in a moderate improvement in detection performance. However, image super-resolution based methods for small object detection suffer from several limitations. Firstly, super-resolving whole images can inevitably enlarge other irrelevant regions, which adversely impact detection performance. Secondly, if super-resolution is only performed on RoI images, object detection on the super-resolved RoI images will largely limit the detection performance due to the lack of context information. This second limitation can be alleviated by performing super-resolution on deep feature maps, which are generated by convolving context. Li et al. [li2017perceptual] proposed a Perceptual GAN to improve small object detection by generating the super-resolved features of small objects that cannot be discriminated from the features of large objects. Similarly, Noh et al. [noh2019better] used GAN to generate super-resolved features for small objects. This was shown to significantly improve the detection performance by providing a direct supervision to learning the super-resolved features of small objects using high-resolution features with appropriate receptive fields. In their article [pang2019jcs], Pang et al. introduced a unified network, called JCS-Net, to integrate the classification and super resolution tasks and to exploit the relationship between large and small scale objects (pedestrians) for recovering the detailed information.
Finally, several other methods perform semi-preprocessing steps to improve detection performance. For example, in [ozge2019power] the authors used the overlapped tiling technique to increase the likelihood of small objects being present in the training stage.

4.1.2 Deep Learning Architecture

2D-CNN. The majority of deep learning-based methods for detecting small objects rely on CNNs. These object detection methods can typically be categorized into anchor-based or anchor-free methods. Anchor-based methods primarily consists of two types of methods, namely, two-stage methods and one-stage methods (see Section 3). One-stage methods generally have a faster detection speed, while two-stage methods tend to have higher detection performance.

Anchor-based two-stage object detection methods mainly consist of the following two stages: (i) a stage to generate object proposals from images; (ii) a stage to predict the final bounding boxes of objects from the region proposals. Representative two-stage CNN frameworks include: R-CNN [girshick2014rich], SPPNet [he2015spatial], Fast R-CNN [girshick2015fast], Faster R-CNN [ren2015faster], FPN [lin2017feature], and Cascade R-CNN [cai2018cascade, cai2019cascade]. Anchor-based one-stage methods do not have a stage for generating region proposals. Instead, they directly generate the class probabilities of objects as well as the corresponding coordinates of the bounding boxes. Representative anchor-based one-stage methods include YOLO v1 [redmon2016you], SSD [liu2016ssd], YOLO v2 [redmon2017yolo9000], RetinaNet [lin2017focal], YOLO v3 [redmon2018yolov3], YOLO v4 [bochkovskiy2020yolov4], and YOLO v5 [jocher2020yolov5] (see Section 3).

Anchor-based methods usually have a large number of anchors and hyper-parameters, leading to a prohibitively high computation cost. To address these problems, recent anchor-free methods alleviate the need for anchors by performing detection through key-points. This largely reduces the number of hyper-parameters. Recent related works include CornerNet [law2018cornernet], CenterNet [duan2019centernet], FSAF [zhu2019feature], FCOS [tian2019fcos], and SAPD [zhu2020soft].

Image Transformer. Several studies have suggested the use of transformers [vaswani2017attention] for detecting objects following Dosovitskiy et. al.’s pioneering work [dosovitskiy2020image]. The Vision Transformer (ViT) was used for the first time in ViT-FRCNN [beal2020toward] to examine the feasibility of transformers for complex object detection tasks. However, the SOD results revealed that the proposed method was not suitable and modifications were necessary to improve the detection performance. Moreover, the proposed method combines transformers and CNNs (i.e., does not merely use transformers). As a way to mitigate the reliance on CNNs and to propose a purely transformer-based object detection technique, You Only Look at One Sequence (YOLOS) was proposed in [fang2021you] to test the transferability of pre-trained transformers from image recognition to object detection. But YOLOS was unable to benefit from multi-scale features and achieved limited performance. With these limitations in mind, [song2021vidt] proposed a method that integrates Vision and Detection Transformers (ViDT), and introduced three major contributions: (i) a new attention mechanism called Reconfigured Attention Module (RAM); (ii) a lightweight encoder-free neck architecture; and (iii) a token matching for knowledge distillation.

Mixed Architecture. The use of both CNNs and transformer architectures has been proposed in various studies. The Most common approach is to first use CNN networks as the backbone and extract several appropriate feature maps. Then these feature maps should be fed into a transformers for decision making. In the early work of transformer-based object detection (OD), Carion et al. [carion2020end] proposed DEtection TRansformer (DETR) using transformers (with both encoder and decoder) on top of CNNs. DETR outperformed CNN-only based SOTA methods, while alleviating the need for complex post-processing steps such as Non-Maximum Suppression (NMS). Considering the computational cost of DETR, [zhen2022deeply] proposed another compact end-to-end variant which represents the large weight matrix in one layer by low order matrices. Additionally, a decoder-only detector ( $\text{D}^{2}$ ETR) was proposed in [lin2022d] to address complexity. Furthermore, two additional modifications of DETR were introduced in [jiang2021guiding] in order to enhance learning and SOD performance. First, in order to to update the positional information of the queries, a module called Guided Query Position (GQPos) was added to the decoder. Second, the authors proposed Similar Attention (SiA), a new fusion scheme that interpolates the low-resolution attention weight map to generate a high-resolution attention map, since multi-scale feature learning is computationally expensive. This idea was motivated from the fact that the relative positions of the objects is unique across different scales. A CNN-transformer based on deformable attention (following the idea of deformable convolution [dai2017deformable]) and attending to just a small set of sampling locations has been proposed by Zhu et al. [zhu2020deformable], which has the advantage of being trained much faster than DETR (with 10 times fewer training epochs). SOD performance was also improved by adding a multi-scale deformable attention module. Their method was referred to as “Deformable DETR”. Despite the fact that DETR and Deformable DETR only account for spatial information, they are still fast enough for Video SOD. A new method of extracting small-size features, SOF-DETR, has been proposed in [dubey2021improving], together with a normalized inductive bias. In a nutshell, SOF-DETR uses a multi-scale feature representation of the input image. Consequently, the input of the transformer captures richer information (both semantic and geometrical information) that is more suitable for SOD. Pre-training is performed only on the CNN block in DETR and Deformable DETR, but not on the transformer module. This was addressed by [dai2021up], who proposed UP-DETR, which utilizes unsupervised pre-training for a pre-trained CNN backbone. However, since the pre-training of the transformer and CNN is done separately, they are unlikely to perform as well together. In FP-DETR [wang2021fp], the pre-training was thus performed on the encoder module (not the decoder) using ImageNet before fine-tuning the object detection task with a task adaptor. In [wang2022resc], a transformer-based object detection framework was proposed (RESC), which minimizes post-processing steps and the number of hyperparameters. RESC converges faster than DETR. In addition to being lighter, it enables the use of the FPN structure [lin2017feature] to detect small objects.

4.1.3 Feature Learning

Multi-Scale Learning. Multi-scale feature learning is one of the most common approaches for SOD, and several architectures have been developed to support it. Amudhan et al. [amudhan2021rfsod] introduced RFSOD, a lightweight single-stage detector that can be used in embedded systems for real time applications. RFSOD’s architecture is similar to that of the YOLO detector, and uses $3\times 3$ and $1\times 1$ convolutions for lightweight detection. By transferring and concatenating information from the earlier layers to the deeper layers, RFSOD increases the spatial resolution of the information in the last layers. This is critical for SOD and the concatenation is performed until that the receptive field reaches the size of $50\times 50$ , so that objects of size 32×32 and smaller can be detected. Chalavadi et al. [chalavadi2022msodanet] proposed mSODANET which consists of three main components: backbone network, Hierarchical Dilated Network (HDN), and Bi-directional Feature Aggregation Module (BFAM). EfficientNet [tan2019efficientnet] was used to fully exploit the visual information contained in input images of varying sizes. Furthermore, the HDN was used to learn the contextual information of objects while the BFAM aims to resolve the network’s limitation of top-down information flow (parallel connections from the last layers to the first layers) with cross-scale connections in order to improve the model efficacy. Fu et al. in [fu2021small] extended the ResNet structure to ResNeXt-RC and proposed IIHNet. IIHNet is a convolution-based network based on three key concepts: (i) information fusion; (ii) information exchange between different resolutions and modules; and (iii) a multi-scale network. Furthermore, [he2021small] proposed a lightweight network known as YOLO-MXANet which uses a powerful backbone based on the MobileNext [zhou2020rethinking] named SA-MobileNeXt, as a mean to incorporate both spatial and channel attention. Along with the addition of another scale from the shallower layers to improve the performance of SOD, the number of parameters was markedly reduced from 61.5 M to 13.8 M. The authors in [qi2022small] proposed a single stage SODNet composed of an adaptively spatial parallel convolution module (ASPConv) and a fast multi-scale fusion module (FMF) to optimize the spatial information extraction and to fuse the spatial and semantic information. By design, FMF preserves both spatial and semantic information. Following the SSD idea, Cui et al. [cui2018mdssd] proposed a Multi-scale Deconvolutional Single Shot Detector (MDSSD), where multiple feature maps at different scales are upsampled to increase the spatial resolution. For better localization of small objects, concatenation is used in [liu2020small], instead of summation in the fusion block to preserve more information across layers.

Context Learning. Objects are not isolated and they usually co-vary with other objects or particular backgrounds, which provides a rich source of contextual associations. For context learning, there are typically two types of approaches: (i) deep CNNs provide an implicit way to model the spatial context for each pixel through the convolution and pooling operations. In order to incorporate the local context information, existing methods generally manually select the surrounding regions and aggregate their features to enhance the target regional feature [li2016attentive, chen2018context]. In order to model the global context information, enlarging the receptive field to cover the whole image and performing global pooling is commonly employed. Besides, Bell et al. [bell2016inside] regarded feature maps as four sequences of feature maps arranged in the four cardinal directions, i.e., right, left, up and down, and proposed to model the global context information by using four recurrent neural networks (RNNs) to process each sequence and concatenating the outputs. To enhance the context learning of deep CNNs, a number of strategies have been developed to capture the multi-scale context [cui2020context, lim2021small] (See Multi-scale Learning in Section 4.1.3). Moreover, an attention mechanism has been used to effectively extract contextual information for object detection [li2016attentive, shen2019indoor]. (ii) Another line of methods involves explicitly modeling the contextual information, such as scene-to-object and object-to-object relationships at the semantic level or in terms of the spatial layout. Fu et al. [fu2020intrinsic] proposed a context reasoning method for small object detection, which models the object-to-object relationships using the semantic features and the spatial geometric information (i.e., location, size, and aspect ratio) of object regions with a graph convolutional network (GCN). Using the learned contextual relations, the regional features were then updated for both classification and regression, resulting in improved performance for detecting small objects. Leng et al. [leng2021realize] proposed to model object-to-object relations and use the reliable object proposals with their pairwise relations to help classify and localize ambiguous object proposals.

Region Proposal. SOD performance of deep networks can be greatly enhanced by higher input image resolution. Using high-resolution data, however, requires considerably more computational power. To mitigate this bottleneck, one approach is to select the most promising regions and discard the rest of the input image. QueryDet was developed by Yang et al. [yang2022querydet] which first localizes small objects roughly, then refers to high resolution feature maps for better adjustment of bounding box coordinates. Bosquet et al. [bosquet2020stdnet] proposed STDnet which relies on two components: Region Context Network (RCN) and Region Of Interest (ROI) Collection Layer (RCL). As a result of processing only specific areas, high-resolution feature maps are kept in deeper layers, thereby increasing SOD performance. Additionally, in order to improve adaptation, both the number and the size of anchor boxes were learned by k-means in [bosquet2020stdnet]. In [liu2021modified], MdrlEcf was proposed as a way to exploit deep reinforcement learning (DRL) with a new reward function and an efficient attention network added to a CNN for the task of SOD with very high resolution remote sensing images. Based on FastMask [hu2017fastmask], Wilms et al. [wilms2018attentionmask] proposed AttentionMask, a class-agnostic object proposal generation algorithm that is well suited for SOD. AttentionMask is biologically-inspired and includes scale-specific attention maps.

4.1.4 Loss Function Regularization

While most existing methods focused on redesigning the neural network architecture or utilizing some prior information in order to boost SOD performance, fewer works employed different loss functions or added penalty terms to the classical loss functions in order to boost SOD performance. We can cite RetinaNet [lin2017focal], which is designed to focus on the most challenging samples (e.g., small objects) by multiplying a term proportional to the network’s confidence into the classical cross-entropy loss. Other methods modify the standard IoU loss, including Intersection over Detection, Generalized IoU [rezatofighi2019generalized], Wasserstein distance [wang2021normalized], and Complete IoU [zheng2021enhancing]; The detailed explanation of these methods can be found in Section 6.2.1.

4.2 Video based SOD

In general, videos provide additional temporal contextual information that is not contained in still images. Several previous methods exploit temporal information in an ad-hoc way [han2016seq, feichtenhofer2017detect]. These methods depend essentially on the static object detection results produced by an image-based object detector and then use the temporal information in a post-processing stage. This however leads to sub-optimal results since the training of the object detector does not take advantage of temporal information. More recent methods [xiao2018video, liu2018mobile] have incorporated the temporal information into training either by aggregating feature maps across different frames or by predicting object proposals between frames. As a result, the video object detection performance has been largely improved. With so many redundancies between adjacent frames, detection performance can be improved while still maximizing detection speed. The use of temporal information can also improve detection performance when dealing with challenges such as motion blur, partial occlusion, small-scale objects, etc. Our focus in this section is on the methods that jointly learn spatial and temporal information to detect small objects in video footage.

4.2.1 Deep Learning Architecture

In this section, we present general deep learning architectures (illustrated in Fig. 6) for small object detection in videos.

3D-CNN. While 3D-CNN is the easiest tool for integrating temporal and spatial information of video frames, it is rarely used for the task of object detection. In contrast, 3D-CNN has been deeply investigated for 3D object detection [maturana2015voxnet], action recognition [ji20123d], anomaly detection [shin20203d], etc. In contrast, of those limited studies, 3D-CNN was used in [lin2019smoke] as a feature extractor in combination with Faster R-CNN in order to detect and localize smoke.

RNN. The recurrent neural network (RNN) is a type of neural network that processes temporal or time-series data. It has been widely used for video-based visual tasks following the pipeline shown in Fig. 6 (a). Tripathi et al. [tripathi2016context] proposed to use a recurrent neural network to extract the temporal context information, which is subsequently used to compute a regularization loss to better optimize the training of an object detector. Lu et al. [lu2017online] proposed Association LSTM, which is composed of an SSD and an LSTM networks. More specifically, SSD performs object detection on each frame. The features of the detected objects by SSD are stacked and then forwarded to the LSTM. An additional association error loss is applied to the LSTM outputs of two adjacent frames, to enforce the consistency of two neighboring frames in the temporal space. Compared to Association LSTM, which only uses limited motion information between two frames, Xiao et al. [xiao2018video] proposed a spatio-temporal memory network (STMN) to leverage the motion information across multiple frames. STMN is a bi-directional RNN, which is used to process the convolutional features of a sequence of multiple neighboring frames and also transfer the outputs to each frame. Therefore, the spatial and motion information of multiple neighbouring frames is all incorporated to compute the detection prediction for a target frame, thus effectively improving the detection performance. Moreover, to refine feature maps across frames, Liu et al. [liu2018mobile] proposed an inter-weaved recurrent-convolutional network, coined as Bottleneck-LSTM. By using depthwise separable convolutions and bottleneck design principles, Bottleneck-LSTM achieves a real time inference as well as a high detection performance.

Video Transformer. Due to their superior ability to detect long-range correlations, transformers have recently become very popular in object detection. Transformers have been applied to video based SOD to capture long term spatio-temporal dependencies. As described in [he2021end] and [zhou2022transvod], TransVOD is the first end-to-end system for video object detection using spatio-temporal information. TransVOD uses multiple frames of the video as inputs to its spatial transformers, and uses another temporal transformer on top of it. These two transformers can link each object query and memory encoding outputs simultaneously. Two other extensions of TransVOD have been developed, called TransVOD++ and TransVOT Lite. TransVOD++ uses hard query mining (HQM) strategy to mitigate the redundancy of the number of objects and targets. Experiments show that the TransVOD framework can improve the performance of SOD. TransVOD++ is the first to achieve $90\%$ mAP on ImageNet VID dataset. The second extension, TransVOT was designed for real time object detection.

4.2.2 Spatio-Temporal Feature Aggregation

In the previous section, we explained how sequence-based architectures such as 3D-CNN, RNN, and transformers have been applied to detect small objects. In other studies, the temporal and spatial features are mixed or aggregated during the process of object detection, e.g., by using 2D-CNN and finding the objects correlation over time. The STDnet-bST algorithm [bosquet2020stdnet] was proposed by Bosquet et al. which first detects objects in frames using STDnet, and then links the detected objects using the Viterbi algorithm across the frames. In another extension, Bosquet et al. [bosquet2021stdnet] proposed STDnet-ST, a spatio-temporal convolutional network method for SOD. Built on STDnet, STDnet-ST operates on two consecutive frames simultaneously. These two frames are integrated together through a correlation module at shallower layers and a final tubelet linking module. The term “tubelet linking” refers to forming sequences of the same objects across a video. Despite being based on the Viterbi algorithm, the tubelet linking module has three novelties, including (i) correlations are generated from the shallower layers of the convolution layers; (ii) to evaluate the degree of variability and confidence, a scoring system has been used; and (iii) dummy objects are introduced to suppress tubelets with incorrect data associations. A Faster R-CNN-like method called FANet was proposed by Cores et al. [cores2020spatio] based on short-term spatio-temporal feature aggregation to produce first a detection set, followed by long-term object linking to refine the detection. They also introduced Tubelet Non-Maximum Suppression (T-NMS) to eliminated spatially redundant tubelets.

5 Maritime SOD

This section provides a literature review of SOD in maritime environments. Objects such as vessels, swimmers, obstacles, or plastic objects on the water’s surface are included in this category.

5.1 Image based maritime SOD

This section is organized according to the flow of the detection pipeline shown in Fig. 5.

5.1.1 Data Pre-processing

Data augmentation. Data augmentation is one of the most effective methods to improve the performance of small object detection. A number of data augmentation methods [kisantal2019augmentation] have been developed to increase the size and enrich the diversity of maritime training datasets, thus improving the robustness and the generalization ability of the detection models. In the maritime context, general data augmentation techniques, such as multi-angle rotation, color jittering, random translation, random cropping, horizontal flipping and adding random noises, have also been used in [you2019broad, liu2021enhanced, zhang2020intelligent, wang2021sdgh] to increase the diversity of samples. In order to address the scarcity of real-world samples of small ships for training a deep learning based object detector, Chen et al. [chen2020deep] proposed to use a Gaussian Mixture Wasserstein GAN with Gradient Penalty (WGAN-GP) to generate synthetic small ships. Both real and synthetic data were used for training, significantly improving the detection performance over the case of not using synthetic data. Moreover, Shin et al. [shin2020data] proposed a “cut and paste” strategy to augment training images for maritime object detection. More specifically, the pre-trained mask-RCNN was used to extract the ship segments, which were then pasted in various background sea scenes to synthesize new images. The improved detection results confirmed the effectiveness of the synthetic ship images. Similarly, Hu et al. [hu2022somc] proposed a mixed strategy to mix the regions of sea surface objects with a number of varying scenes to increase the diversity and the number of training samples.
Image Enhancement. The complex marine environment makes maritime object detection challenging. The ocean wind, waves, and currents usually cause marine object motion blur, which significantly degrades the performance of visual object detectors. Feng et al. [feng2021sharpgan] proposed ShapeGAN, a deblurring method based on GAN, which aims to remove motion blur from real sea images. The ship detection results of the sharp images are clearly superior to those of the blurred ones. In [tian2021image], a GAN based low-quality to DSLR-quality image translator [ignatov2017dslr] was used to enhance the remote sensing ship imagery, leading to images with improved contrast and clarity. In [tian2021image], the proposed image enhancement method was shown to improve detection performance, especially when training data is scarce. For image enhancement, deep learning is often combined with physical models. For instance, to improve maritime vessel detection, Guo et al. [guo2021lightweight] proposed a low-light image enhancement method based on deep learning and the Retinex theory [land1977retinex]. According to the Retinex theory, the observed image can be decomposed into reflectance and illumination components, so image quality can be improved by enhancing the illumination. To this end, Guo et al. [guo2021lightweight] proposed to learn a mapping between low-light images and their illumination-enhanced counterparts through a CNN-based model. This model was supervised by pairs of synthetic low-light and normal-light images. With the trained model, low-visibility maritime imagery was significantly enhanced, which improved the vessel detection in low-visibility environments. Similar maritime image enhancement methods have been proposed in [lu2021towards, yang2021deep]. The Atmospheric Scattering model [narasimhan2000chromatic] has also been used with deep learning to de-haze the maritime images to achieve an improved vessel detection performance in [guo2021heterogeneous].
Sea-Land Segmentation. Another widely used pre-processing technique is sea-land segmentation or land masking. Usually, this technique is used when analyzing satellite images. Direct application of standard DNN-based methods in coastal areas, where the land and sea meet, can generate a high number of false positives due to similarities between urban structures and vessels. In order to reduce the false alarm rate, researchers used a pre-processing step in order to remove the land regions and thus reduce the amount of information for further analysis. Examples of DNN-based techniques include SeNet [cheng2016senet], which combines segmentation and edge detection methods in an end-to-end framework. Li et al. [li2018deepunet], developed DeepUNet, a pixel-level sea-land segmentation method based on U-Net. DeepUNet consists of a contracting path and an expansive path used to generate a high resolution optical output. Liu et al. [liu2021laenet] proposed a lightweight multitask, end-to-end fully convolutional neural network without any down sampling to simultaneously segment the input image and extract edges from remote sensing images. In addition, a novel method (BS- Net) based on the joint learning network of boundary and segmentation is described in [jing2021bs], in which these two modules interact and enhance the sea-land segmentation result. In the literature, there are several other methods for separating sea from land, however since their details are beyond the scope of this survey, we do not elaborate further.

5.1.2 Feature Learning

Multi-scale Learning. Smaller objects have fewer pixels to work with compared to normal-size objects. Therefore, obtaining good representations of small objects can be challenging. Furthermore, after passing through a number of sub-sampling and striding operations, the top-layer feature maps may not include any features of small objects [liu2016ssd]. This makes detecting small objects more difficult. A multi-scale learning strategy is an effective method for improving the detection of small objects. It is also the most commonly used strategy for detecting maritime small objects.
Multi-scale learning typically falls into two categories: (i) multi-level features, i.e., combining features from different layers. Zhang et al. [zhang2019real] improved Faster R-CNN by fusing low- and high-level features to generate object proposals, predict bounding boxes and classification scores for float detection. Li et al. [li2021water] integrated feature maps from a number of layers by employing a feature pyramid network structure with deconvolutions into SSD, effectively improving the detection performance of remote objects in water surface. Additionally, the fusion of shallow features and deep features has also been used to detect ships in remote sensing images [zhang2020intelligent] for ship detection of remote sensing images. (ii) parallel multi-scale features, which are usually obtained by applying multiple parallel convolutions with different kernel sizes or dilated rates on the same input feature. Li et al. [li2018hsf] improved faster R-CNN by proposing a Hierarchical Selective Filtering (HSF) layer, which is composed of three parallel convolutional layers with kernel sizes $1\times 1$ , $3\times 3$ , $5\times 5$ , respectively. The HSF layer, which exploits features of multiple receptive fields, was used for both object proposal generation and bounding box regression, effectively detecting both inshore and offshore ships of varying sizes. Compared to the standard convolution, dilated convolution is more efficient since it enlarges the receptive field without increasing the number of parameters. Chen et al. [chen2021ship] proposed to enhance the feature representation of YOLOv3 by using multiple dilated convolutions to capture multi-scale context information for ship detection. Tian et al. [tian2021image] embeded multiple Atrous Spatial Pyramid Pooling (ASPP) modules in FPN to improve the detection performance for ships at different scales. Zhou et al. [zhou2021image] proposed CRB-Net, a multi-scale image feature learning based method that can carry out adaptive weight adjustment (improved BIFPN) during feature fusion by attention mechanism and Mish activation (a novel self-regularized non-monotonic activation function [misra2019mish]). Two SPPNets were also used to increase the receptive field of the features in layers 4 and 5 to isolate the most significant contextual features. The performance of CRB-Net was compared to 16 different deep learning-based methods for the detection of small objects on water surface, with promising results.

Attention based learning. Multi-scale feature learning poses a challenge to real time object detection due to its increased complexity. This is because all areas in the input data (image/video) are exploited to localize objects. An alternative to reduce time and computational load is to use attention (whether spatially, temporally, or channel-wise) to eliminate irrelevant information and focus on that which is relevant to the object of interest.
For small object detection in maritime environments, Chen et al. [chen2021improved] proposed a single stage method, called ImYOLOv3 which integrates both spatial and channel attention modules (DAM) into a YOLOv3 network in order to better distinguish between ships and backgrounds. Their proposed end-to-end framework was successfully applied to optical remote sensing images. By adjusting receptive fields on three network branches, ImYOLOv3 achieved promising results for large, medium, and small sized objects. Nie et al. [nie2020attention] used both the channel attention modules and the spatial attention modules in a Mask-RCNN model to enhance the information propagation from the lower layers to the top layers. The use of the attention mechanism was shown to significantly improve the detection accuracy of small ship detection. Liu et al. [liu2021attention] used the Convolutional Block Attention Module (CBAM) [woo2018cbam], which sequentially applies channel and spatial attention modules, to refine intermediate features of the object detection network. A similar attention mechanism was also used in [hu2021pag, fu2021improved, dong2021ship, li2021enhanced]. Wang et al. [wang2021ship] used the Squeeze-and-Excitation (SE) attention module [hu2018squeeze] to dynamically perform channel-wise feature re-calibration, leading to an enhanced representational capacity of their detection network and an improved overall detection performance. A similar attention mechanism was also used in [hu2021ship]. Chen et al. [cheng2021robust] proposed a global attention module to adaptively fuse multi-modal features extracted from image and radar data for small floating waste detection [cheng2021flow].

5.1.3 Leveraging Segmentation methods

Foreground/Background Segmentation. Saliency detection aims to mimic the low-level human visual attention mechanism, which localizes the most “interesting” (salient) regions in an image for more efficient subsequent processing. Saliency object detection has been widely used in both traditional [sobral2015double, cane2016saliency] and deep learning-based [shao2019saliency] methods for maritime small object detection, to determine reliable object regions. More specifically, in [shao2019saliency], saliency detection was applied on the predicted object proposal to refine their predicted locations for a more accurate ship detection.

Semantic Segmentation. Smart modifications of the loss functions can result in a better feature representation for maritime small object detection. It was demonstrated in [moosbauer2019benchmark] that multitask (joint) learning, such as segmentation and object detection, can improve the performance of each task. A possible explanation is that due to joint learning, feature representation is no longer task-specific nor over-fitted to the training dataset. Cane et al. [cane2018evaluating] proposed the use of state-of-the-art deep semantic segmentation networks such as ENet [paszke2016enet], ESPNet [mehta2018espnet] and SegNet [badrinarayanan2017segnet] for maritime object detection. As a result of this, the segmentation stream improved greatly while the network needed fewer annotated, labelled images to train. Park et al. [park2022lightweight] proposed a lightweight Mask-RCNN by using an efficient backbone, i.e., MobileNetV2, to jointly perform warship detection and segmentation. To reduce the cost of dense pixel-level annotation, Zust et al. [vzust2022learning] proposed a weakly supervised method to train a semantic segmentation network for maritime obstacle detection.

5.1.4 Generic OD for Maritime SOD

Even though SOD in maritime environments presents some unique challenges in terms of shape and domain, several works have directly applied and evaluated generic object detection methods for this more challenging task. The main focus of these studies was to introduce a new maritime dataset and use the generic OD approaches as a baseline. This section reviews such prior works. YOLOv2 was evaluated by Lee et al. [lee2018image] in maritime video surveillance with no changes to the overall network except a slight modification to the final layer used to classify objects into the 10 different ship classes. A speed of 30fps was achieved, thus making the method suitable for real time maritime detection. In [moon2020comparative], a cascade R-CNN [cai2018cascade] with a HRNetV2 backbone for high resolution representation [wang2020deep2] was used to more accurately detect small objects in maritime environment. This accuracy was the consequence of maintaining information throughout all the layers. In another study, Shao et al. [shao2018seaships] compared and analyzed the performance of Faster R-CNN (ZF Net, VGG16 Net, ResNet18, ResNet50, ResNet101), YOLO (DarkNet19), SSD (MobileNet, VGG16 Net) on their own maritime dataset. It was observed that YOLOv2 can achieve a proper trade-off between accuracy and speed in practical applications (average precision of 79 and speed of 91fps). The speed of YOLOv2 was adequate for real time video-based object detection. Aside from providing a new dataset for the maritime environment, the authors of [ribeiro2017data], also used four different techniques (2 supervised and 2 unsupervised) to provide a benchmark for SOD in maritime environments. Parasad et al. [prasad2018object] evaluated the performance of 23 classical and state-of-the-art Background Subtraction (BS) algorithms on visible range and near infrared range videos using the Singapore Maritime dataset. They found that those methods were not suitable for maritime environments (poor prediction), largely due to spurious dynamics of water, wakes, ghost effects, and multiple small detections for a single object. Therefore, BS methods must be adapted to suit the highly dynamic maritime backgrounds. The authors in [scholler2019assessing] used LWIR input images, together with CNN-based methods such as RetinaNet (ResNet50), YOLOv3 (Darknet53) and Faster RCNN to localize objects at sea. In [soloviev2020comparing], the authors reported the results for Faster R-CNN, R-FCN and SSD on their own dataset. Compared to their other evaluated methods, Faster R-CNN with ResNet101 achieved the highest detection accuracy for large objects. Its accuracy was reduced, however, when they considered small objects. A cascading approach was used in [van2020automated] to monitor plastic pollution, using one network for the segmentation of regions of interest and another network for classification. In their comparison step, their goal was not to determine the exact location of the plastic bottles, but to predict their number in river streams. In [chen2019port], the YOLOv3 framework was used to accurately identify small, medium and large ships using three feature scales provided by DarkNet53. Varga et al. [varga2022seadronessee] presented a new sea-based vision dataset for identifying and localizing swimmers in open waters for emergency rescue missions. They compared the state-of-the-art CNN based techniques such as Faster R-CNN, CenterNet [zhou2019objects], and EfficientDet [tan2020efficientdet] with different backbones and showed that Faster R-CNN with a deep network (ResNeXt-101-FPN) outperforms others. However, it revealed very challenging to localize swimmers from a far distance, since they appear as points on the image.

TABLE II: Commonly used datasets for Generic SOD.

Dataset	Application	Video	Image	Shooting Angle (Type)	Resolution (pixels)	#Object Classes	#Instances	#Image/Video	Public?
MS COCO [lin2014microsoft]	Generic		✓	(RGB)	NF	91 Stuff C. 80 Object C.	2.5M	328K	Yes: Click Here
ImageNet Vid [russakovsky2015imagenet]	Generic	✓		(RGB)	–	30	–	4417 ( $>$ 1.2M frames)	Yes: Click Here
Lost and Found [pinggera2016lost]	Generic (Autonomous Driving)	✓		On-board (stereo RGB sequence)	$2048\times 1024$	37	–	112 (2104 annotated frames)	Yes: Click Here
STS [larsson2011using]	Generic (Autonomous Driving)	✓		On-board (RGB)	–	7	3488	$>$ 20K frames (20% labeled)	Yes: Click Here
Tsinghua-Tencent 100K [zhu2016traffic]	Generic (Autonomous Driving)		✓	On-board Shoulder-mounted (panoramas RGB)	$2048\times 2048$	45	30K	100K	Yes: Click Here
GTSDB[houben2013detection]	Generic (Autonomous Driving)		✓	On-board (RGB)	$1360\times 800$	4	1206	900	Yes: Click Here
CURE-TSD [temel2019challenging]	Generic (Autonomous Driving)	✓		On-board (RGB)	$1628\times 1236$	14	2.2M	5733 (1.7M frames)	Yes: Click Here
SOD [chen2016r]	Generic		✓	(RGB)	–	10	8393	4925	–
CURE-OR [temel2018cure]	Generic		✓	(RGB)	NF	100	–	1M	Yes: Click Here
WIDER FACE [yang2016wider]	Generic (Face Detection)		✓	(RGB)	–	60	393K	32.2K	Yes: Click Here
DeepScores [tuggener2018deepscores]	Generic (optical Music Recognition)		✓	(GS)	$1894\times 2668$	123	80M	300K	Yes: Click Here
ATSETC4 [liang2021small]	Generic (Air-Target Recognition)	✓		(RGB)	–	4	–	2400 (60K frames)	Yes
HVD [song2019vision]	Generic (Vehicle Detection)		✓	(RGB)	$1920\times 1080$	3	57290	11129	Yes: Click Here
BIT-Vehicle [dong2015vehicle]	Generic (Vehicle Detection)		✓	(RGB)	$1600\times 1200$ $1920\times 1080$	6		9850	Yes
KITTI [geiger2012we]	Generic (Autonomous Driving)		✓	(RGB)	–	2	$>$ 100K	80256	Yes: Click Here
Caltech [dollar2009pedestrian]	Generic (Pedestrian Detection)	✓		(RGB)	$640\times 480$	3	350K	1M frames (250K labeled frames)	Yes: Click Here
USC-GRAD-STDdb [bosquet2020stdnet]	Generic	✓		(RGB)	$1280\times 720$	5	56K	115 ( $>$ 25K frames)	Yes: Under Request
UAVDT [du2018unmanned]	Generic (Vehicle Detection)	✓		UAV based (RGB)	$1080\times 540$	3	841.5K	100 (80K frames)	Yes: Click Here
VisDrone2021 [zhu2020detection]	Generic	✓	✓	UAV based (RGB)	Image: $2000\times 1500$ Video: $3840\times 2160$	10	$>$ 2.6M	400 Videos, $>$ 10K Imgages ( $>$ 265K frames)	Yes: Click Here
Neovision2 Tower [khosla2014neuromorphic]	Generic	✓		On-board (RGB)	$1920\times 1080$	5	–	100	Yes: Click Here
NWPU VHR-10 [cheng2014multi]	Generic		✓	Satellite based (RGB&CIR)	–	10	–	800	Yes: Click Here
LULC [yang2011spatial, yang2010bag]	Generic		✓	Satellite based (RGB)	$256\times 256$	21	–	2100	Yes: Click Here
DOTA [xia2018dota, ding2021object]	Generic		✓	Aerial & Satellite Images (RGB)	From $800\times 800$ to $20000\times 20000$	18	$>$ 1.7M	11268	Yes: Click Here
Xie et al. [xie2021small]	Generic (Drone Detection)	✓		(RGB)	$1920\times 1080$ $2048\times 1538$ $4096\times 1800$	2	–	6	No
xView [lam2018xview]	Generic		✓	Satellite based (RGB)	$1500\times 1200$	60	1M	1413	Yes: Click Here
VEDAI [razakarivony2016vehicle]	Generic (Vehicle Detection)		✓	Aerial based (RGB & NIR)	$1024\times 1024$	9	3640	1210	Yes Click Here
DIOR [li2020object]	Generic		✓	Satellite based (RGB)	$800\times 800$	20	192472	23463	Yes:Click Here

TABLE III: Commonly used datasets for Maritime SOD.

Dataset	Application	Video	Image	Shooting Angle (Type)	Resolution (pixels)	#Object Classes	#Instances	#Image/Video	Public?
TinyPerson [yu2020scale]	Maritime (Person Detection)		✓	UAV based (RGB)	From $497\times 700$ to $4064\times 6354$	2	$>$ 72K	2369 (1610 labeled)	Yes: Click Here
Scholler et al. [scholler2019assessing]	Maritime (Ship Detection)		✓	On-board (LWIR)	$640\times 480$	2	–	$>$ 21k	No
HRSC2016 [liu2017high]	Maritime (Ship Detection)		✓	Satellite based (RGB)	From $300\times 300$ to $1500\times 900$	25	2976	1061	Yes:Click Here
ETRI-Maritime [soloviev2020comparing]	Maritime (Ship Detection)		✓	(RGB)	NF	12	50K	37694	No
SeaShip [shao2018seaships]	Maritime (Ship Detection)		✓	Shore based (RGB)	$1920\times 1080$	6	40077	31455	–
WSODD [zhou2021image]	Maritime ( Obstacle Detection)		✓	(RGB)	$1920\times 1080$	14	21911	7467	Yes: Click Here
Seagull [ribeiro2017data]	Maritime (Ship Detection)	✓		UAV based (RGB&NIR&IR&Hyperspectral)	$1920\times 1080$ $1024\times 768$ $640\times 480$ $384\times 288$ $1024\times 648$	6	–	19 (150K frames)	Yes:Under Request Click Here
Soloviev et al. [soloviev2020comparing]	Maritime (Ship Detection)		✓	Waterborne (RGB)	$1920\times 720$	–	850	400	No
Soloviev et al. [soloviev2020comparing]	Maritime (Ship Detection)		✓	Waterborne (RGB&IR Thermal)	$1200\times 400$	4	9137	1750	No
River Image [van2020automated]	Maritime (Plastic Monitoring)		✓	(RGB)	–	2	14968	1272	–
SMD [prasad2017video]	Maritime (Ship Detection)	✓		Shore based (RGB) On-board (RGB) Shore Based (NIR)	$1920\times 1080$	10	240842	81 (31653 frames)	Yes: Click Here
MarDCT [Bl-Io-Pe-15]	Maritime (Ship Detection)	✓		Shore based (RGB & IR)	–	–	–	20	Yes: Click Here
Botlek [ghahremani2017self]	Maritime (Vessel Detection)		✓	(RGB)	$1536\times 2048$	–	–	$>$ 48K	No
MSD [chen2021improved]	Maritime (Ship Detection)		✓	Satellite based (panchromatic)	$1000\times 1000$	4	–	1015	No
MODD [kristan2015fast]	Maritime (Obstacle Detection)	✓		USV based (RGB)	$640\times 480$	2	–	12 (4454 fully annotated frames)	Yes: Click Here
IPATCH [patino2016pets]	Maritime (Auto Protection)	✓		On-board (Visual & IR)	$640\times 480$ $640\times 512$	–	–	14	Yes
FGSD [chen2020fgsd]	Maritime (Ship Detection)		✓	Satellite based (RGB)	$930\times 930$	43	5634	4736 2612 annotated	Yes: Coming Soon
ShipRSImageNet [zhang2021shiprsimagenet]	Maritime (Ship Detection)		✓	Satellite based (RGB)	$930\times 930$	50	17573	$>$ 3435	Yes: Click Here
BCCT200 [rainey2011object]	Maritime (Ship Detection)		✓	Satellite based (GS)	NF	4	–	800	Yes
Chen et al. [chen2020video]	Maritime (Ship Detection)	✓		UAV based (RGB)	$720\times 480$	–	–	2 (3000 frames)	Yes: Under Request
Airbus Ship Detection	Maritime (Ship Detection)		✓	Satellite based (RGB)	$768\times 768$	–	–	$>$ 192K	Yes: Click Here
SeaDronesSees [varga2022seadronessee]	Maritime (Search and Rescue)	✓	✓	UAV based (RGB & NIR & RE)	$3840\times 2160$ $5456\times 3632$	6	400K	5630 images, 208 short videos, 22 videos ( $>$ 393K and 54K frames)	Yes: Click Here
MOBDrone [cafarelli2022mobdrone]	Maritime (Search and Rescue)	✓		UAV based (RGB)	–	5	$>$ 180K	66 (126170 annotated frames)	Yes: Click Here

5.1.5 Other Maritime SOD

In [li2018multiscale] the authors used a slightly different regression task by adding an angle parameter to the existing standard four bounding box parameters regression. This modification provides a more precise localization of rotated ships within a rectangular bounding box that is aligned with the ship’s direction. Similar approaches have been reported in [qin2021mrdet, yi2021oriented, zand2021oriented]. Using SSD, [ghahremani2018cascaded] developed a cascade object detection method to identify obscure regions. Following some verification steps, the method considers the original high resolution input image (the one before down sampling) for decision making. This method does not require any modifications when different architectures are used. However, this cascading approach makes the method inappropriate for real time applications due to its high complexity.

5.2 Video based maritime SOD

Prior works for video-based maritime small object detection are typically categorized into: (i) spatial-based (i.e., frame-based) detection and (ii) spatio-temporal based detection. The first category of methods, e.g., [guo2021lightweight, guo2021heterogeneous, yang2021deep, lu2021towards, shao2019saliency, liu2021attention], generally developed similar strategies compared to their image-based counterparts (see Section 5.1), and detected maritime small objects in videos frame by frame. While these methods (by only using the spatial information) have been able to achieve good detection accuracy and speed in several video based maritime applications, we believe that using the temporal information across video frames could lead to better performance by inferring relationships between moving objects. Therefore, this section focuses on the methods which leverage both the spatial and temporal information for maritime small object detection in videos. Recent deep learning-based object detection methods generally perform well on large- and medium-sized objects. However, they perform poorly on small-sizes objects. Even though a number of specific techniques have been proposed to enhance the spatial features of small objects, their performance largely degrades in a dynamic environment characterized by background elements (e.g., water surface perturbations, sunlight reflection, floating driftwood and kelp), which are similar to the target objects in appearance or size. In such cases, the temporal information, i.e., the movement conveyed by multiple images/frames of the same scene, could be a useful cue to detect the presence of small objects. There are a number of works, which exploit both the spatial and temporal information for maritime small object detection in videos. Using the intersection of union of the bounding boxes between consecutive frames, Kim et al. [kim2018probabilistic] proposed to detect ships that could not be detected based solely on the spatial information of individual frames. Marques et al. [marques2021size] proposed a Detector of Small Marine Vessels (DSMV), which exploited the temporal information to model backgrounds using a bi-directional gaussian mixture model. With the combination of DSMV and temporal information, the performance of general deep object detection methods was found to be significantly improved. The results confirm the effectiveness of using the temporal information for detecting maritime small objects in videos. Using a convolutional LSTM, Cruz et al. [cruz2019learning] extracted temporal features, which were combined with spatial features from CNNs, to detect objects in maritime airborne videos. Chen et al. [chen2020video] proposed an automated ship recognition method consisting of four main steps: (i) feature extraction at different scales and construction of feature pyramids using ensemble YOLOv3 framework, (ii) bounding box generation, (iii) removal of interference bounding boxes using K-means algorithm and localization of ships, (iv) ship behavior analysis by a spatio-temporal constraints-based method on two consecutive frames. However, the reported spatio-temporal method still exhibits potential issues in handling fast moving ships and identifying individual ships in water-sky line as well as in dense (ship wise) environments such as ports and harbors. The components of the YOLOv3 network have been improved by Jie et al. [jie2021ship] to achieve higher precision and recall values. Their contribution can be described as follows: (i) using the K-means algorithm to initialize the number of anchor boxes and their sizes based on the characteristics of the ships instead of the objects found in the VOC dataset, (ii) replacing the Sigmoid function with Softmax, (iii) introducing Soft Non-Maximum Suppression (Soft-NMS) to resolve the shortcomings of the standard NMS algorithm when detecting overlapped objects. Finally, Deep Simple Online and Real time Tracking (Deep SORT) algorithm was used to accurately localize objects in frames with severe occlusions. They reported improvements of about 5% and 2fps on average, in mean average precision (mAP) and in the number of analyzed Frame Per Second (FPS), respectively. An innovative spatio-temporal object detection method based on high-quality region proposals mainly centered around rigid (i.e., potential object) video locations is proposed in [marie2018real]. The high quality regions of proposals were obtained by assessing textural variations at key video locations using a long-term keypoint tracking algorithm. Scale Invariant Feature Transform (SIFT) [lowe2004distinctive] was shown to perform best compared to other keypoint extractors in terms of both accuracy and repeatability.

6 Evaluation of Small Object Detection

6.1 Small Object datasets

A review of existing SOD datasets is provided in supplementary material. These datasets are summarized in Tables II and III.

6.2 Evaluation Metrics

6.2.1 General Measures

Intersection over Union [everingham2010pascal]: Since the output of an object detection method and its corresponding ground truth are the coordinates of bounding boxes, the Intersection over Union (IoU) is used to quantify the similarity between the areas of these two bounding boxes; Ground Truth (GT) and Predicted (P). when the bounding boxes are indexing the same pixels, this measure is expected to return a value one in the best case, and zero in the worst case when the boxes are not overlapped at all. Using the set notations, the IOU is given by

IoU=\frac{|S_{GT}\cap S_{P}|}{|S_{GT}\cup S_{P}|},

(1)

where $S$ indicates the pixels as a set, $|.|$ is the size of a set, $\cap$ and $\cup$ are the intersection and union, respectively. Fig. 7(a) shows the GT in green and in Fig. 7(b) the intersection (pink square) and union (black boundaries) are clearly shown in the image assuming the red bounding box as the prediction.
Precision, Recall and Accuracy: These are well known measures in classification tasks defined for categorical outputs. Object detection, however, uses bounding boxes whose similarity is shown through continuous numbers ranging from 0 to 1. A threshold is therefore applied to the IoU in order to use such measures for object detection. Predicted bounding boxes are accepted as true positives (accurate recovery of the ground truth bounding box) if the corresponding IoUs exceed the threshold, otherwise they are considered false positives. Specifically, precision is defined as the number of correctly detected bounding boxes compared to the total number of detected or predicted boxes. Recall, on the other hand, is defined as the number of correctly detected bounding boxes over the total number of ground truth boxes. It is therefore necessary to make a trade-off between recall and precision. Finally, accuracy is defined as the total number of correctly labeled bounding boxes (either positive or negative) over the total number of evaluated boxes.
Average Precision (AP): The trade-off between Precision (Pr) and Recall (Re) prevents comparing two given methods using a single precision value for a fixed recall. Rather, precision needs to be on average better across all recall values. Therefore, the precision-recall curve can be drawn for each class label and the area under the curve can be determined. A method is better if its computed area is larger than that of its competitors. Precisely, the AP is given by:

AP=\int_{0}^{1}Pr(Re)dRe,

(2)

where $Pr(Re)$ indicates the dependence of precision on the recall value.
mean Average Precision (mAP) and mean Average Recall (mAR): Due to the fact that AP is defined over a single class label, it is not universal across all classes. In order to generalize this measure, the mAP computes the average over all the classes. In other words, for mAP we have

mAP=\frac{1}{C}\sum_{i=1}^{C}AP_{i},

(3)

where $C$ denotes the number of classes. Observe that the average above is computed based on a single predefined threshold, e.g., 0.5. In a broader sense, this average can be computed in terms of different threshold values, notably from 0.5 to 0.95 with a 0.05 step size. This particular setup is denoted as $mAP^{@[0.5,0.95]}$ in [lin2014microsoft]. Similarly we have the same concept for recall, with the equivalent metric being mAR which is defined for the average of the individual recalls over the number of classes.
Frame Per Second (FPS): In addition to the measures which evaluate the ability of the detection methods in recovering the true objects, FPS measures the running time of these techniques to evaluate their applicability to video or real time detection. The higher FPS implies that the method is faster and can potentially be applied to real-time video-based small object detection.
Degrade of Reduction (DOR) [chen2020survey]: This measure indicates the performance gap between the AP of medium/large objects and that of small objects. SOD performance is weaker when DOR is larger.
FPPI: The average number of false positives per image when recall is 0.5 and the recall when FPPI is 1 are two other measures that have been used for evaluation of SOD methods [rozantsev2016detecting]. Ideally, we aim for smaller FPPI and higher recall for a fixed FPPI.
Intersection over Detection (IoD): This measure is similar to IoU with a minor change in the denominator. In other words, the IoD is given by:

IoD=\frac{|S_{GT}\cap S_{P}|}{|S_{P}|}.

(4)

As a result of this change, small objects won’t be missed in applications where accurate detection of true objects is crucial at the cost of more false positives.
Generalized IoU (GIoU) [rezatofighi2019generalized]: If two boxes are not overlapping, IoU is not helpful during the learning process since it is always zero no matter how distinct the boxes are. For this reason, the GIoU loss has been proposed as a solution to Gradient vanishing. Thus the GIoU is given by:

GIoU=IoU-\frac{|C\backslash S_{GT}\cup S_{P}|}{|C|},

(5)

where $C$ is the smallest box containing both GT and P bounding boxes and “ $\backslash$ ” means excluding the set in the right from the left set. Fig. 7(c) shows an example of the use of this metric ( $C$ and $S_{GT}\cup S_{P}$ ).
Complete IoU (CIoU) [zheng2021enhancing]: As a result of its inability to exploit geometrical factors in the metric, GIoU suffers from slow convergence and inaccurate regression. In contrast, CIoU improves performance by considering three main geometrical factors, namely the overlaped area, the distance and the aspect ratio to improve the performance. It is given by:

CIoU=IoU-\frac{\rho^{2}(GT,P)}{c^{2}}-\alpha V,

(6)

where $\rho$ is the Euclidean distance between the central points of the boxes, $c$ is the diagonal length of the smallest box containing both GT and P bounding boxes, $\alpha$ is the trade-off parameter, and finally $V$ is the consistency of aspect ratios. Fig. 7(d) shows an example of the use of this metric ( $\rho$ and $c$ ).
Miss Rate (MR): Even though the trade-off between false positives and miss detection rate matters in most applications, in some real world problems (e.g., pedestrian and tumor detection) the MR is the main objective since the object should not be missed in order to avoid major consequences (e.g., accident or cancer). A smaller MR is always desirable [dollar2009pedestrian].
Error Rate (ER): Deep network training can also be optimized by minimizing a measure of error. In this case, the ER is defined as the total number of miss classified pixels over the total number of pixels.

TABLE IV: Detection performance (%) for small-scale objects on Tsinghua-Tencent 100K [zhu2016traffic]

	Recall	Accuracy	F1-score
Fast RCNN (ICCV15)[girshick2015fast]	46.0	74.0	56.7
Faster RCNN (NIPS15)[ren2015faster]	49.8	24.1	32.5
SSD (ECCV16)[liu2016ssd]	43.4	25.3	32.0
Zhu et al. (CVPR16)[zhu2016traffic]	87.4	81.7	84.5
FPN (CVPR17)[lin2017feature]	78.6	77.3	77.9
Perceptual GAN (CVPR17)[li2017perceptual]	89.0	84.0	86.4
Pon et al. (CRV18)[pon2018hierarchical]	65.0	24.0	35.1
Liang et al. (PCM18)[liang2018small]	93.0	84.0	88.3
Song et al. (JSA19)[song2019efficient]	88.0	85.0	86.5
Noh et al. (ICCV19)[noh2019better]	92.6	84.9	88.6
MR-CNN (ACCESS19)[liu2019mr]	89.3	82.9	86.0
Wang et al. (ITS20)[wang2020traffic]	89.4	87.3	88.3
YOLOv3-Final (JSPS21)[wan2021efficient]	91.0	91.0	91.0
SODNet (RS22)[qi2022small]	90.0	85.5	87.7
Min et al. (ITS22)[min2022traffic]	92.3	88.1	90.2

TABLE V: Detection performance (%) for small-scale objects on MS COCO image dataset. “

{*}

” indicates average over just three classes of stop sign, mouse, fire hydrant. Currently, the leadership for small objects on MS COCO dataset belongs to Noah CV Lab (Huawei) with

\text{mAP}^{@[0.5,0.95]}=40.7

. The results are by default for objects smaller than

32\times 32

pixels. “

+

” indicates that the results are for object sizes smaller than

16\times 16

	$\text{mAP}^{@0.5}$ $\uparrow$	$\text{mAP}^{@[0.5,0.95]}$ $\uparrow$
Faster R-CNN (NIPS2015)[ren2015faster]	$5^{+}$	$15.6$ , $1.5^{+}$
Faster R-CNN+FPN (NIPS2015)[ren2015faster]	–	27.2
R-FCN (NIPS16)[dai2016r]	–	10.8
SSD (ECCV16)[liu2016ssd]	–	10.9
FPN (CVPR17)[lin2017feature]	$\textbf{11.8}^{+}$	18.2, $\textbf{4.8}^{+}$
RetinaNet (ICCV17)[lin2017focal]	$9.1^{+}$	21.8, $4.5^{+}$
RFBNet (ECCV18)[liu2018receptive]	16.2	–
YOLOv3 (arXiv18)[redmon2018yolov3]	–	18.3
SOD-MTGAN (ECCV18)[bai2018sod]	–	25.1
Noh et al.(ICCV19) [noh2019better]	–	16.2
Kisantal et al. (arXiv19) [kisantal2019augmentation]	–	17.9
FCOS (ICCV19)[tian2019fcos]	–	24.4
SSD-MSN (IEEE ACCESS19)[chen2019ssd]	–	29.4
FSAF (CVPR19)[zhu2019feature]	–	29.7
DR-CNN sum (AI20)[liu2020small]	18.3	–
DR-CNN concat. (AI20)[liu2020small]	18.6	–
ViT-FRCNN (arXiv20)[beal2020toward]	–	17.8
DETR (ECCV20)[carion2020end]	–	21.9
DETR-DC5	–	23.7
Deformable DETR (arXiv20)[zhu2020deformable]	–	26.4
Two Stage Deformable DETR (arXiv20)[zhu2020deformable]	–	28.8
Full Deformable DETR (arXiv20)[zhu2020deformable]	–	34.4
ATSS (CVPR20)[zhang2020bridging]	–	33.2
YOLOv5s [jocher2020yolov5]	–	18.8
TSD (CVPR20) [song2020revisiting]	–	33.8
STDnet-C3 (EAAI20)[bosquet2020stdnet]	$11.4^{+}$	$5.5^{+}$
YOLOS (NIPS21)[fang2021you]	–	19.5
UP-DETR (CVPR21)[dai2021up]	–	20.8
SOF-DETR[dubey2021improving]	–	21.7
ViDT w.o. Neck (arXiv21)[song2021vidt]	–	21.9
ViDT (arXiv21)[song2021vidt]	–	30.6
SMCA (ICCV21)[gao2021fast]	22.8	–
DETR-GQPos (arXiv21)[jiang2021guiding]	23.1	–
DETR-GQPos-SiA (arXiv21)[jiang2021guiding]	24.4	–
FP-DETR (ICLR22)[wang2021fp]	–	27.5
SODNet (RS22)[qi2022small]	–	20.1
RFSOD (RTIP22)[amudhan2021rfsod]	$\textbf{59.09}^{*}$	–
RFSODTL (RTIP22)[amudhan2021rfsod]	$56.42^{*}$	–
QueryDet (CVPR22)[yang2022querydet]	–	25.24
RESC (NCA22)[wang2022resc]	–	26.2
$\text{D}^{2}$ ETR (arXiv22)[lin2022d]	–	22
Deformable $\text{D}^{2}$ ETR (arXiv22)[lin2022d]	–	31.7

TABLE VI: Detection performance (%) for small-scale objects on USC-GRAD-STDdb video dataset [bosquet2020stdnet]. +k indicates that the anchors were defined by the k-means algorithm and the “

*

” indicates that they were run on Caffe2 framework. The results are by default for the objects smaller than

16\times 16

pixels.

	mAP^@0.5 $\uparrow$	$\text{mAP}^{@[0.5,0.95]}$ $\uparrow$	FPPI $\downarrow$	FPS $\uparrow$
Faster R-CNN (NIPS15)+k[ren2015faster]	$44$	$14.4$	0.95	2.6
FPN (CVPR17)[lin2017feature]	$50.8$	$16.3$	$0.29$	$3$
FPN (CVPR17)+k[lin2017feature]	$50.7$	$16.8$	$0.31$	$3.5$
RetinaNet (ICCV17)[lin2017focal]	$47.6$	$16.2$	$0.47$	$\textbf{6.5}^{*}$
FGFA (ICCV17)[zhu2017flow]	$37.5$	$11.7$	–	–
Cascade-FPN (CVPR18)[cai2018cascade]	$55.9$	$17.4$	–	–
RDN (ICCV19)[deng2019relation]	$48.6$	$15.5$	–	–
FANet(short term) (arXiv20)[cores2020spatio]	$48.5$	$17.6$	–	–
FANet(short&long term) (arXiv20)[cores2020spatio]	$49.9$	$18.3$	–	–
MEGA (CVPR20)[chen2020memory]	$53.1$	$17.4$	–	–
STDnet-C3 (EAAI20)[bosquet2020stdnet]	$57.4$	$20$	0.22	3.7
STDnet-bST (EAAI20)[bosquet2020stdnet]	$59.7$	$20.6$	0.2	–
STDnet-ST (PR21)[bosquet2021stdnet]	$62.1$	$20.1$	–	–
STDnet-ST++ (PR21)[bosquet2021stdnet]	63.4	21.4	–	–

TABLE VII: Detection performance (%) for small-scale objects on UAVDT video dataset [du2018unmanned]. The results are by default for objects smaller than

32\times 32

pixels. “

+

” indicates the results of object sizes smaller than

16\times 16

	mAP^@0.5 $\uparrow$	$\text{mAP}^{@[0.5,0.95]}$ $\uparrow$
Faster R-CNN+FPN (ECCV18)[du2018unmanned]	$26^{+}$	8.1
R-FCN (NIPS16) [dai2016r]	$32.5^{+}$	4.4
SSD (ECCV16)[liu2016ssd]	$23.5^{+}$	7.1
RON (CVPR17)[kong2017ron]	$19.7^{+}$	2.9
FPN (CVPR17)[lin2017feature]	$29.7^{+}$	$11.8^{+}$
FGFA (ICCV17)[zhu2017flow]	$20.7^{+}$	$6.3^{+}$
Cascade-FPN (CVPR18)[cai2018cascade]	$30.5^{+}$	$12^{+}$
RDN (ICCV19)[deng2019relation]	$27.9^{+}$	$9.3^{+}$
ClusDet (ICCV19)[yang2019clustered]	–	9.1
YOLOv5s [jocher2020yolov5]	–	9.8
MEGA (CVPR20)[chen2020memory]	$26.6^{+}$	$9.2^{+}$
STDnet++ (EAAI20)[bosquet2020stdnet]	$35.4^{+}$	$12.6^{+}$
STDnet-ST++ (PR21)[bosquet2021stdnet]	$\textbf{36.4}^{+}$	$\textbf{13.3}^{+}$
SODNet (RS22)[qi2022small]	–	11.9

Normalized Wasserstein Distance (NWD)[wang2021normalized]: As opposed to the aforementioned metrics, which treat bounding boxes as deterministic variables, here the bounding boxes are represented by multivariate Gaussian densities. The similarity is then calculated by an exponential function of the existing Optimal Transport (OT) theory (i.e., Wasserstein distance). The benefit of this approach lies in assigning different weights to different pixels, putting more emphasis on the central pixels. In other words, the similarity is given by

NWD(GT,P)=\exp\{-\frac{\sqrt{W_{2}^{2}(GT,P)}}{c}\},

(7)

where $c$ is a learnable constant, and $W_{2}^{2}(GT,P)=\|\textbf{m}_{1}-\textbf{m}_{2}\|_{2}^{2}+\|\boldsymbol{\Sigma}_{1}^{1/2}-\boldsymbol{\Sigma}_{2}^{1/2}\|_{F}^{2}$ is the Wasserstein distance between two ground truth and predicted bounding boxes where m is the centre of the boxes and $\boldsymbol{\Sigma}$ is their covariance.

6.2.2 Specific to Maritime

Intersection over Ground truth (IoG) [prasad2018object]: As with autonomous driving, detecting ships in maritime environments is very important in order to avoid collisions. The bounding boxes in maritime SOD tend to be wider than those in other applications because of wakes and waves. False positives are caused by using the standard IoU metric. The modified metric IoG can help mitigate this issue and is defined by:

IoG=\frac{|S_{GT}\cap S_{P}|}{|S_{G}|}.

(8)

Bottom Edge Proximity 1 (BEP1) [prasad2018object]: Objects in the sea may be characterized by a solid dense hull having a larger possibility of detection and a sparse mast region. The standard IoU criteria may regard the detected object as a false alarm since the ground truth covers both dense hull and mast regions. The BEP1 metric helps to avoid such inaccuracies and it is given by:

BEP1=X(1-Y);X=\frac{\Delta x_{ov}}{x_{GT}},Y=\frac{\Delta y_{bot}}{y_{GT}}.

The parameters for this metric are as shown in Fig. 7(e).
Bottom Edge Proximity 2 (BEP2) [prasad2019object]: BEP2 is symmetric with respect to ground truth and predicted bounding boxes while BEP1 is biased toward ground truth. The BEP2 is defined as

BEP2=X(1-Y);X=\frac{\Delta x_{ov}}{x_{GT}+x_{c}},Y=\frac{\Delta y_{bot}}{min(y_{GT},y_{P})}.

The parameters for this metric are as shown in Fig. 7(f).

6.3 Performance Evaluation

In this section, we assess the performance of the discussed SOD methods on different large-scale datasets. For the generic SOD evaluation, we selected the popular image datasets: Tsinghua-Tencent 100K and MS COCO. For the analysis of video-based techniques, we selected the USC-GRAD-STDdb and UAVDT, which are relatively challenging. This paper uses all performance measures taken from the original papers, or their websites. Research usually compares methods using a subset of these datasets (for example, MS COCO) since some of these datasets are not specifically designed for SOD. The table captions clearly indicate the setups corresponding to the reported results.
SOD datasets for maritime applications are still rare, so most papers perform performance analyses on datasets that they have designed themselves. As a result, the maritime case study results were presented together with the generic methods using four image datasets, including TinyPerson, SeeDronesSees, WSODD, and ShipRSImageNet. For video datasets, we selected Seagull and SMD since they are more popular.
Tables IV to VII show the results for generic small objects and similarly, Tables VIII and IX show the results for maritime small objects.

6.3.1 Generic SOD Performance Results

Tsinghua-Tencent 100K. Table IV reports the detection performance of the state-of-the-art methods on images with small objects, whose number of pixels are in the range of (0,32], in terms of recall, accuracy and F1-score. As shown in Table IV, Liang et al.[liang2018small] achieved the best Recall of 93.0% and a moderate accuracy of 84.0%. In contrast, YOLOv3-Final [wan2021efficient] attained the best accuracy of 91.0% with a recall of 91.0%, leading to the best F1-score of 91.0%.

MS COCO. Table V shows the detection results of deep learning-based methods on MS COCO dataset. For comparison, we report $\text{mAP}^{@0.5}$ and $\text{mAP}^{@[0.5,0.95]}$ . Since the comparison was made using different setups, we denote the results of object detection with sizes smaller than $32\times 32$ with normal values, the results of objects with sizes smaller than $16\times 16$ with “+” and the results on a subset of MS COCO including the three classes of stop signs, mice, and fire hydrants with values marked with “*”. As shown, Full Deformable DETR (arXiv20)[zhu2020deformable] achieved the best $\text{mAP}^{@[0.5,0.95]}=34.4$ . In general, smaller objects produce poorer results. FPN (CVPR17)[lin2017feature] achieves the best values for both $\text{mAP}^{@0.5}$ and $\text{mAP}^{@[0.5,0.95]}$ for smaller objects, with values of 11.8 and 4.8, respectively. Finally DETR-GQPos-SiA (arXiv21)[jiang2021guiding] achieves the best $\text{mAP}^{@0.5}$ of 24.4 for normal small objects on MS COCO. Table V also shows the results for the MS COCO subset separately. As can be observed, transformer-based deep learning methods currently have the best SOTA results.

USC-GRAD-STDdb. For the evaluation on video sequences, we selected the recently released dataset, USC-GRAD-STDdb to compare existing SOTA methods. Table VI shows the results obtained on this dataset for various metrics of $\text{mAP}^{@0.5}$ , $\text{mAP}^{@[0.5,0.95]}$ , FPPI, and FPS. By default, the results for this particular dataset are reported for sizes smaller than $16\times 16$ . The team who collected the USC-GRAD-STDdb dataset proposed STDnet-ST++ (PR21)[bosquet2021stdnet], which remains the leading technique in terms of average precision. In terms of FPPI, STDnet-bST (EAAI20)[bosquet2020stdnet], another framework proposed by the same team, performs best. Finally, RetinaNet (ICCV17)[lin2017focal] achieves the best results in terms of runtime speed.

UAVDT. As for this video dataset, Table VII shows the results for $\text{mAP}^{@0.5}$ , and $\text{mAP}^{@[0.5,0.95]}$ . The values are by default for objects smaller than $32\times 32$ . However, smaller sizes than $16\times 16$ are indicated by “+”. As for USC-GRAD-STDdb dataset, STDnet-ST++ (PR21)[bosquet2021stdnet] is seen again to be the leading method for generic small object detection task.

6.3.2 Maritime SOD Performance Results

TABLE VIII: Detection performance (%) for small-scale objects on TinyPerson [yu2020scale]. MR and AP denote Miss Rate and Average Precision. The superscripts of MR and AP denote the size splits, where “tiny” refers to the size range [2,20] and “small” refers to the size range [20,32]. The subscripts of MR and AP denote the IOU thresholds used for the evaluation.

	$MR^{tiny}_{50}$ $\downarrow$	$MR^{small}_{50}$ $\downarrow$	$MR^{tiny}_{25}$ $\downarrow$	$MR^{tiny}_{75}$ $\downarrow$	$AP^{tiny}_{50}$ $\uparrow$	$AP^{small}_{50}$ $\uparrow$	$AP^{tiny}_{25}$ $\uparrow$	$AP^{tiny}_{75}$ $\uparrow$
Faster RCNN-FPN (CVPR17)[lin2017feature]	87.78	71.31	77.35	98.40	43.55	56.69	64.07	5.35
RetinaNet (ICCV17)[lin2017focal]	92.40	81.75	81.56	99.11	30.82	43.38	57.33	2.64
DSFD (CVPR19)[li2019dsfd]	93.47	78.72	78.02	99.48	31.15	51.64	59.58	1.99
Adaptive FreeAnchor (NIPS19)[zhang2019freeanchor]	88.97	73.67	77.62	98.70	41.36	53.36	63.73	4.00
FCOS (ICCV19)[tian2019fcos]	96.12	84.14	89.56	99.56	16.9	35.75	40.49	1.45
Libra RCNN (CVPR19)[pang2019libra]	89.22	74.86	82.44	98.39	44.68	62.65	64.77	6.26
Grid RCNN (CVPR19)[lu2019grid]	87.96	73.16	78.27	98.21	47.14	62.48	68.89	6.38
RetinaNet-SM (WACV20)[yu2020scale]	88.87	71.82	77.88	98.57	48.48	63.01	69.41	5.83
Faster RCNN-FPN+MSM (WACV20)[yu2020scale]	85.86	68.76	74.33	98.23	50.89	65.76	71.28	6.66
RetinaNet+SM with S- $\alpha$ (WACV21) [gong2021effective]	87.00	69.25	74.72	98.41	52.56	65.69	73.09	6.64
Faster RCNN-FPN+MSM with S- $\alpha$ (WACV21) [gong2021effective]	86.18	69.28	73.90	98.24	51.41	65.97	72.25	6.69
Faster RCNN-FPN-MSM+ (ICASSP21)[jiang2021sm+]	–	–	–	–	52.61	67.37	72.54	6.72

TABLE IX: Detection performance for three images and two videos maritime datasets. Unlike generic results, we did not limit ourselves to objects with specific size and reported the results for the whole dataset, due to the fact that most of the objects are small. “*” indicates the results only on the visible range videos.

Method		SeaDronesSees		WSODD		ShipRSImageNet
Method		$\text{mAP}^{@0.5}$ $\uparrow$	$\text{mAP}^{@[0.5,0.95]}$ $\uparrow$	$\text{mAP}^{@0.5}$ $\uparrow$	FPS $\uparrow$	$\text{mAP}^{@[0.5,0.95]}$ $\uparrow$	$\text{mAR}^{@[0.5,0.95]}$ $\uparrow$
Image	SSD (ECCV16)[liu2016ssd]	–	–	41.5	43.02	48.3	61.8
	Faster R-CNN+FPN (NIPS15)[ren2015faster]	30.1	14.2	32.3	19.42	54.3	–
	Faster R-CNN+FPN (CVPR17) [xie2017aggregated]	54.7	30.4	–	–	–	–
	Mask R-CNN (ICCV17)[he2017mask]	–	–	–	–	56.4	–
	RetinaNet+FPN (ICCV17)[lin2017focal]	–	–	–	–	48.3	68.9
	YOLOv3 (arXiv18)[redmon2018yolov3]	–	–	56.1	45.34	–	–
	TridentNet (ICCV19)[li2019scale]	–	–	62.2	10.16	–	–
	CenterNet-Hourglass (arXiv19)[zhou2019objects]	50.3	25.6	–	–	–	–
	CenterNet-ResNet (arXiv19)[zhou2019objects]	36.4	15.1	–	–	–	–
	CenterNet(ICCV19)[duan2019centernet]	–	–	53.5	43.42	–	–
	FCOS+FPN(ICCV19)[tian2019fcos]	–	–	–	–	49.8	67.4
	YOLOv4(arXiv20)[bochkovskiy2020yolov4]	–	–	57.2	46.25	–	–
	FoveaBox(TIP20)[kong2020foveabox]	–	–	–	–	45.9	62.2
	YOLOv3-2SMA(IJARS20)[li2020modified]	–	–	56.9	50.46	–	–
	EfficientDet-D0 (CVPR20)[tan2020efficientdet]	37.1	20.8	31.3	30.83	–	–
	Cascade R-CNN (TPAMI21)[cai2018cascade]	–	–	41.1	29.56	59.3	69.5
	ShipYOLO(JAT21)[han2021shipyolo]	–	–	58.4	49.81	–	–
	EfficientDet-D0+CroW (ICCV21)[varga2021tackling]	–	31.21	–	–	–	–
	YOLOv4+CroW (ICCV21)[varga2021tackling]	–	36.41	–	–	–	–
	Synth Pretrained RX101FPN (arXiv21)[kiefer2021leveraging]	59.2	32.6	–	–	–	–
	Synth Pretrained Yolo5 (arXiv21)[kiefer2021leveraging]	59.1	33.2	–	–	–	–
	CRB-Net (FN21)[zhou2021image]	–	–	65	43.76	–	–
Method		Seagull	SMD
Method		ER $\downarrow$	FPS $\uparrow$	$\text{mAP}^{@0.3}$ $\uparrow$	$\text{mAR}^{@0.3}$ $\uparrow$	$\text{Pr}^{@0.5}$ $\uparrow$	$\text{Re}^{@0.5}$ $\uparrow$
Video	ConvNet	0.16	–	–	–	–	–
	Eigen-background (TPAMI00) [oliver2000bayesian]	–	–	$0.5^{*}$	$26.8^{*}$	–	–
	Adaptive SOM (TIP08) [maddalena2008self]	–	–	$1.2^{*}$	$23^{*}$	–	–
	Fuzzy ASOM (NCA10) [maddalena2010fuzzy]	–	–	$1.5^{*}$	$20.3^{*}$	–	–
	LSTM	0.22		–	–	–	–
	GRU	0.17	–	–	–	–	–
	GFLFM (TCVPR15) [xin2015background]	–	–	$\textbf{8.9}^{*}$	$\textbf{32}^{*}$	–	–
	Faster R-CNN (NIPS15)[ren2015faster]	–	–	–	–	$81^{*}$	$71^{*}$
	YOLO (CVPR16) [redmon2016you]	–	–	–	–	42.3	57
	SSD (ECCV16)[liu2016ssd]	–	–	–	–	83.7	40.1
	Mask R-CNN recursive (ICCV17)[he2017mask]	–	–	–	–	$78^{*}$	$73^{*}$
	Mask R-CNN fine-tuned (ICCV17)[he2017mask]	–	–	–	–	$\textbf{82}^{*}$	$71^{*}$
	Mask R-CNN w/o seg. (ICCV17)[he2017mask]	–	–	–	–	$\textbf{82}^{*}$	$\textbf{77}^{*}$
	Marie et al.(AVSS18) [marie2018real]	–	–	–	–	77	79
	ConvLSTM (TGRS19) [cruz2019learning]	0.132	–	–	–	–	–
	ConvLSTM+DS Knowledge (TGRS19) [cruz2019learning]	0.13	–	–	–	–	–
	CNN (OSE20) [leela2020image]	–	–	–	–	–	56
	CNN+PASSTHROUGH L. (OSE20) [leela2020image]	–	–	–	–	–	68
	CNN+PASSTHROUGH L. initialized (OSE20) [leela2020image]	–	–	–	–	66	73
	Feng et al. (TITS22) [leela2020image]	–	–	–	–	38.8	93.6

TinyPerson. Table VIII shows the detection results obtained (i.e., MR and AP with IoU thresholds set to be 0.25, 0.5, 0.75) for the state-of-the-art methods on the images of tiny and small objects, whose number of pixels are in the range of [2,20] and [20,32], respectively. Recent methods are generally based on two commonly used object detection architectures, i.e., Faster RCNN-FPN and RetinaNet. Among these methods, MSM+ [jiang2021sm+] achieved the best performance for almost all AP results. S- $\alpha$ [gong2021effective] achieved the best results among the methods based on RetinaNet with respect to all MR evaluations. In contrast, MSM [yu2020scale] achieved relatively better results compared to other methods based on Faster RCNN-FPN in terms of all MR scores. Overall, the two-stage detection methods are seen to outperform the one-stage methods on TinyPerson.

Other Maritime Image and Video Datasets. Table IX presents detection results for other maritime datasets and the best results are marked in bold. The Table provides more information and identifies the leading methods for each metric. Figure 8 shows some of the predicted bounding boxes for different datasets and techniques. Generally, it is observed that using general object detection frameworks to detect small objects is challenging, whereas small object specific methods can better locate those objects.

7 Discussion and Future Directions

Our review of the literature on the detection of small objects has identified several limitations. Taking into account these limitations, we suggest the following directions for future research in SOD:
1. Transformer models have recently greatly benefited computer vision and object detection in general, however the field of SOD has yet to fully utilize them. This is particularly more acute for video-based small object detection (VSOD). We believe that transformers have the potential to achieve superior results in VSOD as well as SOD in maritime environments.
2. While several studies have been conducted on generic SOD tasks, they either used different definitions of small objects, or they missed to report their experiments on publicly available datasets devoted to small objects, or they used a subset of a generic dataset with relatively large objects. Using MS COCO as an example: (i) this dataset is not ideal for studying small objects; (ii) different definitions are used for small objects (e.g., $32\times 32$ or $16\times 16$ ); or (iii) a small subset of small objects is used, which can result in bias and make benchmarking difficult. For a fair benchmarking, researchers should report their performance results on large-scale datasets such as Tsinghua-Tencent 100K, CURE-TSD, USC-GRAD-STDdb, DOTA, VisDrone2021 for generic SOD and TinyPerson, ETRI-Maritime, MOBDrone, Seagull, SMD, SeaDronesSees for maritime SOD.
3. The technology of VSOD is still evolving compared to image-based SOD. The majority of current research exploits spatial information from videos and does not fully explore the temporal information; however, spatial and temporal information can be used together to minimize false alarms and miss detections for small objects when video quality is poor or when objects are occluded, which is especially relevant in maritime applications.
4. There has not been any proper benchmarking of maritime SOD literature yet, and studies seldom use the same large-scale datasets. When it comes to VSOD, speed and the ability to monitor the maritime environment in real time are crucial. The majority of prior studies have attempted to improve accuracy of SOD methods, but this has resulted in increased computational complexity, which is not desirable for real-time surveillance. Recent studies overlook this and do not report FPS, which is vital for monitoring maritime environments in real time. Therefore, it is necessary to investigate networks that are accurate and lightweight.
5. Even though multi-task or joint learning pipelines have yielded promising results for global feature extraction for small object identification, this area has not been studied deeply, and only a few papers have been published in this field.
6. A majority of approaches reported in the SOD literature are based on the standard 2D-CNN. Hence, 3D-CNN can be used as an alternative to extend the 2D-CNN-based methods for videos. Moreover, the definition of small objects in images that deal with limited spatial information can be extended to video. In video, small objects can be redefined as objects with limited spatio-temporal information. Here, a limited temporal information refers to the fact that a small object (spatially small) appears in only a few frames of a video. With this new definition, all the existing tools for SOD using 2D-CNN can also be applied to 3D-CNN, such as pyramidal networks.
7. In spite of the fact that most maritime objects are small (since the camera-to-object distance is large), analyzing the taxonomy of the works in the two domains (i.e., generic vs maritime), some ideas have been applied to only one domain whereas the other domain has not taken advantage of them. Following, we examine such ideas in both domains and discuss their potentials. (i) Although Super Resolution has improved generic SOD performance, it has not yet been investigated for maritime SOD. (ii) In maritime SOD, image enhancement is used to improve visibility under poor maritime conditions. It has not, however, been exploited for generic SOD. Then again, poor weather conditions may also hamper applications such as autonomous driving. (iii) Sea-Land Segmentation is another extensively used maritime SOD technique that reduces the number of false alarms. When prior information about the location of the objects is available, this approach could also be used for generic SOD. Pedestrians, for example, are not expected to appear in the sky. (iv) The use of context learning has been successful in improving generic SOD performance. Marine environments, however, do not lend themselves well to this method since water is a major component of the background. (v) There have been limited studies examining the performance of recurrent networks for video-based detection, despite their success in sequential data analysis such as time series and natural language processing.

8 Conclusion

In this paper, we survey more than 160 recent studies (2017-2022) in the field of small object detection in optical images and videos using deep learning, along with a maritime case study. A survey of relevant pre-processing techniques (e.g., data augmentation, super resolution), modern neural network architectures (e.g., 2D-CNN, 3D-CNN, RNN, transformers, and mixed architectures), feature learning (e.g., multi-scale, context, feature aggregation, and region proposal), multi-task learning, and loss function regularization for image and video-based small object detection is presented. In addition, 50 different datasets used for small object detection are extensively reviewed in this paper. This paper also presents popular learning and evaluation metrics and discusses their limitations. Lastly, potential future research directions in the field of small object detection are presented.

Acknowledgement

This research is supported by the Commonwealth of Australia as represented by the Defence Science and Technology Group of the Department of Defence.


R-CNN [girshick2014rich]	Fast R-CNN [girshick2015fast]

RPN in Faster R-CNN [ren2015faster]	R-FCN [dai2016r]

FPN [lin2017feature]	YOLO [redmon2016you]

SSD [liu2016ssd]	RetinaNet [lin2017focal]

SMD
	(a)	(b)
SeaShip
	(c)-(e)
Seagull
	(f)	(g)