
Underwater Object Detection in the Era of Artificial Intelligence: Current, Challenge, and Future

Long Chen, Yuzhi Huang, Junyu Dong, Qi Xu, Sam Kwong, Huimin Lu, Huchuan Lu, and Chongyi Li
L. Chen is with the Department of Medical Physics and Biomedical Engineering, University College London, United Kingdom. Y. Huang and Q. Xu are with the School of Computer Science and Technology, Dalian University of Technology, China. J. Dong is with the Department of Information Science and Engineering, Ocean University of China, China. S. Kwong is with the School of Data Science, Lingnan University, Hong Kong, China. Huimin Lu is with the School of Automation, Southeast University, China, and also with the Advanced Institute of Ocean, Southeast University, China. Huchuan Lu is with the School of Information and Communication Engineering, Dalian University of Technology, Dalian, China. C. Li is with VCIP, CS, Nankai University, China. C. Li is the corresponding author.
Abstract

Underwater object detection (UOD), aiming to identify and localise objects in underwater images or videos, presents significant challenges due to optical distortion, water turbidity, and changing illumination in underwater scenes. In recent years, artificial intelligence (AI) based methods, especially deep learning methods, have shown promising performance in UOD. To further facilitate future advancements, we comprehensively study AI-based UOD. In this survey, we first categorise existing algorithms into traditional machine learning-based methods and deep learning-based methods, and summarise them by considering learning strategy, experimental dataset, utilised features or frameworks, and learning stage. Next, we discuss the potential challenges and suggest possible solutions and new directions. We also perform both quantitative and qualitative evaluations of mainstream algorithms across multiple benchmark datasets, taking into account their diverse and biased experimental setups. Finally, we introduce two off-the-shelf detection analysis tools, Diagnosis and TIDE, which examine the effects of object characteristics and various types of errors on detectors. These tools help identify the strengths and weaknesses of detectors, providing insights for further improvement. The source code, trained models, utilised datasets, detection results, and detection analysis tools are publicly available at https://github.com/LongChenCV/UODReview, and will be regularly updated.

Index Terms:
Underwater object detection, artificial intelligence, machine learning, deep learning.

I Introduction

Underwater object detection (UOD) aims not only to identify object categories but also to predict their locations. It remains one of the most challenging tasks in computer vision due to the complex underwater environment, where the captured images frequently suffer from severe blur, color distortion, and degraded visibility [1, 2]. As shown in Fig. 1, challenges such as degraded image quality, small object sizes, noisy labels, and class imbalances significantly hinder the performance of underwater object detection models.

Refer to caption
Figure 1: Challenges such as (a) degraded image quality and small objects, (b) noisy labels, and (c) class imbalance in underwater object detection datasets significantly impair the performance of deep detection models. For instance, (c) highlights that the deep detector RepPoints [3] faces severe class imbalance issues on the DUO [4] dataset.

Recent years have witnessed exponential growth in artificial intelligence (AI)-based UOD algorithms, driven by advancements in computational power and data availability [5, 6]. AI, a general concept for techniques that enable computers to emulate human intelligence, has garnered attention since its inception in the 1950s, as illustrated in Fig. 2. After an initial wave of optimism, specialised fields of AI, first machine learning and later deep learning, triggered notable disruptions. By 1995, AI technology began to be integrated into the field of UOD following its revival. Today, AI, particularly its key branch, machine learning, is extensively utilised in the field of UOD. Previous machine learning-based UOD methods can be broadly categorised into traditional machine learning methods and deep learning methods. To gain insight into their strengths and weaknesses, we summarise these methods based on four aspects:

  • Learning Strategies

  • Applicable Datasets

  • Utilised Features or Frameworks

  • Learning Stages

Among various AI techniques, deep learning has garnered significant attention and extensive study in UOD, as illustrated in Fig. 3. Deep learning models require large amounts of training data but offer clear advantages over traditional methods in terms of detection accuracy. Two main research lines are primarily pursued in UOD. First, general deep detection networks, such as Faster RCNN [7], SSD [8], YOLO [9], and their variants, have been adapted for underwater object detection, significantly enhancing performance in this field. The second research line focuses on developing special network backbones, loss functions, or learning strategies for UOD.

Refer to caption
Figure 2: The development of (a) AI in underwater object detection and (b) generic artificial intelligence.

Despite the significant potential of deep learning-based UOD techniques for both academic and commercial applications, they still face serious challenges. Many generic detection networks struggle to achieve accurate detection due to difficulties unique to the underwater domain. To facilitate a comprehensive understanding of these challenges, we categorise them into four groups:

  • Image Quality Degradation Challenge

  • Small Object Detection Challenge

  • Noisy Label Challenge

  • Class Imbalance Challenge

In searching for solutions, we observed significant research gaps between generic object detection (GOD) and UOD in addressing these challenges. To bridge these gaps and advance UOD development, we summarise current solutions for each challenge in both UOD and GOD and suggest potential improvements and new research directions.

Datasets are fundamental to learning-based UOD approaches. Over the past decades, numerous datasets have been introduced. These public benchmarks offer platforms for evaluating UOD methods and significantly advance related fields. Hence, we review existing UOD datasets and offer comprehensive and detailed descriptions of both the datasets and the evaluation metrics. Moreover, we introduce two valuable detection analysis tools, Diagnosis [10] and TIDE [11], to assess the specific strengths and weaknesses of detectors by analysing their errors. Diagnosis [10] examines how object characteristics, such as size and aspect ratio, impact detection performance, while TIDE [11] evaluates the effects of different types of errors on detectors.
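As a quick illustration of how such an analysis tool is used in practice, the sketch below assumes the pip-installable tidecv package released by the TIDE authors; the annotation and result file names are placeholders, and the exact keyword argument for loading a custom COCO-format annotation file is an assumption.

```python
# Minimal sketch of running TIDE on COCO-format detection results (assumes the
# `tidecv` pip package; file names are placeholders, the `path=` kwarg is an assumption).
from tidecv import TIDE, datasets

tide = TIDE()
gt = datasets.COCO(path='ruod_test_annotations.json')        # ground-truth annotations
preds = datasets.COCOResult('detector_results.json')          # detector outputs

tide.evaluate(gt, preds, mode=TIDE.BOX)  # analyse bounding-box errors
tide.summarize()                         # per-error-type dAP breakdown in the console
tide.plot()                              # save summary figures
```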

Refer to caption
Figure 3: The frequency map of the keyword ‘underwater object detection’ in Google Scholar from 2020 to 2024. The size of the keyword is proportional to the frequency of the word. The keywords ‘underwater object detection’, ‘deep learning’, and ‘neural network’ have drawn large research interests in the community.
TABLE I: The summary of the existing underwater object detection surveys. The quality of each survey is rated based on five key criteria: review of traditional learning-based UOD, review of deep learning-based UOD, datasets, challenges, and proposed solutions. More stars (\bigstar) indicate better quality.
Year Title Venue ML UOD DL UOD Dataset Challenge Future
2017 Deep Learning on Underwater Marine Object Detection: A Survey [12] (Conference) Advanced Concepts for Intelligent Vision Systems 2017 \bigstar \bigstar\bigstar \bigstar \bigstar \bigstar
2020 Robust Underwater Object Detection with Autonomous Underwater Vehicle: A Comprehensive Study [13] (Conference) Proceedings of the International Conference on Computing Advancements \bigstar\bigstar\bigstar \bigstar\bigstar\bigstar \bigstar \bigstar \bigstar
2022 Review on deep learning techniques for marine object recognition: Architectures and algorithms [14] (Journal) Control Engineering Practice \bigstar \bigstar \bigstar\bigstar \bigstar \bigstar
2022 Underwater Object Detection: Architectures and Algorithms–A Comprehensive Review [15] (Journal) Multimedia Tools and Applications \bigstar \bigstar \bigstar \bigstar \bigstar
2023 A Systematic Review and Analysis of Deep Learning-based Underwater Object Detection [6] (Journal) Neurocomputing \bigstar \bigstar\bigstar\bigstar \bigstar \bigstar\bigstar\bigstar \bigstar\bigstar
2023 Rethinking General Underwater Object Detection: Datasets, Challenges, and Solutions [2] (Journal) Neurocomputing \bigstar \bigstar\bigstar \bigstar\bigstar\bigstar \bigstar\bigstar\bigstar \bigstar\bigstar
2024 Underwater Object Detection and Datasets: A Survey [16] (Journal) Intelligent Marine Technology and Systems \bigstar\bigstar \bigstar\bigstar \bigstar\bigstar \bigstar\bigstar \bigstar
2024 Ours - \bigstar\bigstar\bigstar \bigstar\bigstar\bigstar \bigstar\bigstar\bigstar \bigstar\bigstar\bigstar \bigstar\bigstar\bigstar
Refer to caption
Figure 4: The road maps of (a) generic object detection and (b) underwater object detection reveal that the latter often leverages insights and techniques from the former to enhance detection performance in underwater environments.

I-A Comparisons with Previous Surveys

In the last few years, quite a number of surveys on underwater object detection have been published. For this work, we carefully selected seven representative survey papers [12, 13, 14, 15, 6, 2, 16] for comparison, most of which are the latest surveys published in the last three years. To quantitatively assess the quality of the selected surveys, we evaluate them across five key aspects: coverage of traditional machine learning-based UOD, deep learning-based UOD, datasets, challenges, and proposed solutions. We posit that a comprehensive survey of underwater object detection should provide researchers with a clear understanding of the field’s potential, challenges, and future directions. To quickly convey the strengths and weaknesses of each survey paper, we rate the quality of each aspect using a star system (\bigstar), with more stars indicating higher review quality. The quantitative evaluations are presented in Table I, and detailed discussions of each paper follow.

In 2017, Moniruzzaman et al. [12] classified underwater object detection approaches according to object categories. While the paper offers a comprehensive review of deep learning applications in fish, plankton, and coral recognition, it lacks analysis of the challenges and potential future solutions for underwater object detection. Gomes et al. [13] conducted a broad study on traditional machine learning and deep learning approaches for underwater object detection. Although they review and compare a wide range of methods, the analysis of challenges and proposed solutions remains limited. Wang et al. [14] provided comprehensive discussions on deep learning techniques for general object recognition and detection, but dedicated minimal attention to underwater object detection. Similarly, Fayaz et al. [15] concentrated heavily on the development of general deep learning methods, overlooking the specific techniques used for underwater object detection. Xu et al. [6] provided a comprehensive review of the potential applications and research challenges in underwater object detection. This work also examines the connection between underwater image enhancement and underwater object detection, providing some insights for future research directions in the field. Fu et al. [2] presented a review focused on deep learning-based underwater object detection, but did not cover traditional machine learning methods. This paper also provides a comprehensive survey of datasets and challenges, and suggests some promising solutions to advance underwater object detection. Jian et al. [16] presented a moderate review of traditional machine learning and deep learning-based methods, along with discussions on datasets and challenges. However, their review offers limited insights for further advancements.

Table I reveals several common issues in previous surveys. Firstly, many survey papers, such as [5, 6, 2], focus exclusively on deep learning-based UOD techniques while neglecting traditional machine learning approaches. Several works [14, 15] overemphasise generic object detection methods but overlook the core topic of underwater object detection. Secondly, most studies, such as [12, 13, 14, 15, 16], offer limited exploration of challenges and corresponding solutions for UOD. Therefore, this survey focuses on discussing these challenges and their potential solutions in depth.

I-B Contributions of This Survey

In this paper, we provide a more comprehensive review of AI-based UOD techniques and discuss key challenges, potential solutions, available datasets, evaluation metrics, and useful analysis tools for underwater object detection. The contributions of this survey are summarised as follows:

  • This survey focuses on reviewing AI techniques in underwater object detection, unlike previous work that centers on generic object detection. These UOD techniques are categorised based on learning strategies, applicable datasets, utilised features or network architectures, and learning stages, providing a clearer understanding of their strengths and weaknesses.

  • An in-depth analysis of the challenges in underwater object detection is presented, with these challenges summarised into four categories: image quality degradation, noisy labels, small object detection, and class imbalance. Advanced techniques from both generic and underwater object detection are explored and summarised to provide potential solutions to these challenges, offering valuable insights for future development.

  • A comprehensive overview of underwater object detection datasets and evaluation metrics is provided. Moreover, two valuable detection analysis tools, Diagnosis and TIDE, are introduced to identify the strengths and weaknesses of each detector by analysing their errors. The pre-built source code for these tools, tailored for the RUOD and DUO datasets, is available online.

  • To ensure a fair evaluation of underwater detectors, we recommend two high-quality and large-scale benchmarks, RUOD and DUO, and evaluate several mainstream deep detectors on these datasets. The source code, trained models, utilised datasets, and detection results are available online, offering researchers an accessible platform to compare their detectors against previous works.

The remainder of this paper is structured as follows: Section II reviews related works on AI-based UOD methods. Section III describes the research challenges in UOD and suggests potential solutions. Section IV presents popular datasets, evaluation metrics, and introduces two valuable detection analysis tools. Section V evaluates several mainstream detection frameworks on two large-scale datasets, RUOD and DUO, and reports and discusses the experimental results. Finally, future insights and vision are provided in Section VI.

II Existing Works of AI-based UOD

The development of underwater object detection is closely tied to the advancements in generic object detection, as evidenced by the road maps of GOD and UOD presented in Fig. 4. Many generic object detection frameworks and algorithms have been directly applied to underwater object detection and significantly advanced the research field. Hence, we first review the progress in GOD to identify similarities and gaps between the two areas. Then, we review related works of AI-based UOD methods, including both traditional machine learning methods and deep learning methods. Table II and Table III summarise the traditional and deep learning-based UOD approaches, detailing their learning strategies, utilised datasets, and the features, techniques, or detection frameworks used.

II-A Generic Object Detection

The evolution of generic object detection can be divided into two periods: the traditional era (before 2014) and the deep learning era (after 2014).

II-A1 Traditional Generic Object Detection

During the traditional period, researchers focused on developing complex hand-crafted features for object detection. In 2001, Viola and Jones [17] introduced a real-time face detection algorithm that used sliding windows to scan all possible locations in an image and applied a detector to identify whether the window contained a human face. In 2005, Dalal et al. [18] introduced an effective hand-crafted feature called Histogram of Oriented Gradients (HOG) for pedestrian detection. In 2008, Felzenszwalb et al. [19] developed the Deformable Part-Based Model (DPM) for generic object detection, which became the champion model in the Pascal VOC detection challenge. However, by 2010, traditional object detection methods reached a plateau as the performance of hand-crafted features became saturated.

II-A2 Deep Generic Object Detection

It is widely recognised that the deep learning era in object detection began with the introduction of the two-stage deep detection network RCNN [20], proposed by Girshick et al. in 2014. RCNN has since served as a crucial foundation for numerous deep detectors. Following 2014, the field of generic object detection evolved at an unprecedented pace. There are two main research lines in deep learning-based object detection: two-stage deep detectors and one-stage deep detectors.

Two-stage Detectors. Two-stage detectors, such as Faster RCNN [7], RFCN [21], and FPNs [22], initially employ a proposal generation technique to produce a set of object proposals. Subsequently, a separate deep network extracts features from these proposals and predicts their locations and categories. The milestone work, Faster RCNN, integrates proposal generation, feature extraction, and bounding box regression into a unified end-to-end framework. Since then, the end-to-end deep pipeline has become dominant in object detection research. These two-stage deep detectors have significantly advanced detection performance in authoritative object detection challenges, such as the ImageNet and Microsoft COCO. However, many of these detectors struggle to achieve real-time detection due to their complex, coarse-to-fine processing paradigm.

One-stage Detectors. To overcome speed limitations, one-stage detectors like YOLO [9] and SSD [8] were introduced to perform object detection in real-time. Although these detectors excel in speed, they initially lagged behind two-stage detectors in accuracy. However, advanced one-stage frameworks such as RetinaNet [23], CornerNet [24], and CenterNet [25] later emerged, surpassing two-stage detectors in both accuracy and speed. In recent years, transformer-based networks such as DETR [26] and Deformable DETR [27] have attracted considerable attention in GOD. These models utilise attention mechanisms instead of convolution layers, which provide a global receptive field and show great promise for future advancement.

From the road maps presented in Fig. 4, it is clear that underwater object detection follows a similar development trajectory to generic object detection. Many techniques from generic object detection have been adapted to enhance underwater detection performance. Therefore, we believe that advanced techniques from generic object detection can benefit underwater object detection, and vice versa, in the future.

II-B Traditional ML-based Underwater Object Detection

Traditional machine learning algorithms have been employed in the underwater object detection task for a long time. These approaches generally consist of a two-stage learning process. In the first stage, hand-crafted features are extracted, which can include simple visual features (e.g., color [28] and shape [29]) or complex hand-crafted features (e.g., HOG, SIFT, and SURF). In the second stage, these features are forwarded to traditional classifiers, such as SVM and decision trees, to carry out various tasks in underwater scenes. Since traditional machine learning-based methods encompass a range of diverse techniques, we categorise them into sonar-based and RGB-based UOD methods based on the data used, and summarise their advantages and disadvantages.
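To make this two-stage pipeline concrete, the sketch below pairs a hand-crafted HOG descriptor with an SVM classifier, the combination most often cited in Table II; it is a minimal, generic illustration (with toy stand-in data rather than a real underwater dataset) and not a reproduction of any specific work.

```python
# Minimal sketch of the traditional two-stage pipeline: hand-crafted feature extraction
# (stage 1) followed by a classical classifier (stage 2). The toy data below is a
# stand-in for labelled underwater image crops.
import numpy as np
from skimage.feature import hog
from sklearn.svm import SVC

def extract_hog(patch):
    # Stage 1: HOG descriptor of a fixed-size grayscale patch.
    return hog(patch, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2))

rng = np.random.default_rng(0)
train_patches = rng.random((20, 64, 64))            # stand-in 64x64 grayscale crops
train_labels = np.array([0, 1] * 10)                # 1 = object, 0 = background

features = np.array([extract_hog(p) for p in train_patches])
clf = SVC(kernel='rbf')                             # Stage 2: classical SVM classifier
clf.fit(features, train_labels)

# At test time, a sliding window scans the image and every window is scored.
test_patch = rng.random((64, 64))
print(clf.predict(extract_hog(test_patch).reshape(1, -1)))
```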

Refer to caption
Figure 5: Comparisons of the sonar (top) and RGB (bottom) images. RGB images capture rich visual features but with limited perceptual range. Sonar images extend the perceptual range but are less intuitive and harder for humans to interpret.
TABLE II: Summary of Traditional Machine Learning Approaches for UOD, focusing on learning strategies (LS), utilised datasets, input data formats, extracted features, and detection methods.
Refs Year LS Datasets Format Features Detectors
Strachan et al. [29] 1993 SL Private Dataset RGB Color and Shape Features Discriminant Analysis
Shiela et al. [30] 2005 SL Great Barrier Reef Dataset RGB Color and Local Binary Pattern (LBP) Features Artificial Neural Network (ANN)
Barat et al. [31] 2006 UL Private Dataset RGB Visual Attention Mechanisms Motion-based Algorithm
Chew et al. [32] 2007 UL Private Dataset Sonar Frequency and Contour Features Self-adaptive Power Filtering Technique
Spampinato et al. [33] 2008 UL Private Dataset RGB Texture and Color Features Moving Average Algorithm
Beijbom et al. [34] 2012 SL Moorea Labeled Corals Dataset RGB Texture and Color Features Support Vector Machine (SVM)
Spampinato et al. [35] 2012 UL Private Dataset RGB Covariance Representation Covariance Tracking Algorithm
Galceran et al. [36] 2012 UL Private Dataset Sonar Integral-Image Representation Echo Scoring and Thresholding
Li et al. [37] 2013 UL Private Dataset Sonar Color and Area Features Otsu Segmentation Algorithm and Contour Detection Algorithm
Lee et al. [38] 2014 UL Private Dataset RGB Contour and Shape Features Contour Matching Algorithm
Kim et al. [28] 2014 UL Private Dataset RGB Artificial Landmark Features Template Matching Algorithm
Ravanbakhsh et al. [39] 2015 SL Private Dataset RGB Shape Features Modeled by PCA Haar-Like Detector
Cho et al. [40] 2015 UL Private Dataset Sonar Motion Features of Acoustic Beam Profiles Analysis of the Cross-correlations between Successive Sonar Images
Hou et al. [41] 2016 UL Private Dataset RGB Color and Shape Features Shape Signature Algorithm
Chuang et al. [42] 2016 SL Fish4Knowledge and NOAA Fisheries Datasets RGB Part Features An Error-Resilient Classifier
Villon et al. [43] 2016 SL MARBEC Dataset RGB HOG Features Support Vector Machine (SVM)
Liu et al. [44] 2016 UL Private Dataset RGB Motion Features Background Subtraction and Three Frame Difference algorithms
Chen et al. [45] 2017 UL Private Dataset RGB Color, Intensity and Light Transmission Features Bottom-up ROI Detection and Otsu Segmentation Algorithm
Kim et al. [46] 2017 SL Private Dataset Sonar Haar-Like Feature AdaBoost
Vasamsetti et al. [47] 2018 UL Fish4Knowledge Dataset RGB Color, Motion and Multi Frame Triplet Pattern (MFTP) Features Multi-Feature Integration (MFI) Framework

II-B1 Sonar Data-based ML UOD Methods

Underwater scenes often present a significant decrease in visibility because the signals received by sensors are absorbed and distorted by water bodies. As a result, sonar sensors [37, 40] have been widely applied in underwater exploration, as they can provide relatively reliable scene data regardless of visibility. Sonar sensors are adept at capturing geometric structure information and can offer insights into underwater scenes even in low-visibility conditions.

Two main types of sonars are commonly used in sonar-based UOD: side-scan sonar (SSS) [32, 48, 49] and multi-beam forward-looking sonar (FLS) [36, 50, 51]. SSS provides long-range, high-resolution data, allowing for detection across vast survey areas (hundreds of meters long). In contrast, FLS is suited for closer, more detailed inspection of specific underwater object locations. Chew et al. [32] utilised simple visual features to detect man-made objects in side-scan sonar images, whereas Yu et al. [48] developed a TR-YOLOv5s network for detecting shipwrecks and containers in side-scan sonar images. Hayes et al. [52] demonstrated that the high-resolution imagery from SSS is effective for identifying potential objects on the seafloor over vast surveyed areas.

However, for a more detailed inspection of detected targets, FLS is a more appropriate option. Galceran et al. [36] introduced a real-time underwater object detection algorithm designed to detect man-made objects in images captured by multi-beam forward-looking sonars. This work used integral-image representation to extract features without the need for training data, significantly reducing computational overhead by processing smaller portions of the underwater images. Zhang et al. [50] also developed an enhanced YOLOv5 network, called CoordConv-YOLOv5, for underwater object detection in forward-looking sonar images.

Sonar images extend the perceptual range but are less intuitive and harder for humans to interpret due to the absence of visual features, as illustrated in Fig. 5. Additionally, sonar images often contain a significant amount of noise, which makes it challenging to ensure the reliability of sonar image recognition and analysis.

II-B2 RGB Data-based ML UOD Methods.

In contrast to sonars, cameras can capture a large number of RGB images with high spatial and temporal resolutions. We broadly categorise RGB data-based ML UOD approaches into three groups: methods based on static hand-crafted features, methods based on dynamic motion features, and methods based on innovative pipelines.

Static Hand-crafted Features-based Methods. In RGB data-based ML UOD approaches, the most significant research focus has been on developing robust hand-crafted features. In 1993, Strachan et al. [29] used color and shape features to identify fish transported on a conveyor belt monitored by a digital camera. Later, Beijbom et al. [34] utilised texture and color features and employed a Support Vector Machine (SVM) classifier to detect corals at various scales. In 2015, Ravanbakhsh et al. [39] introduced Principal Component Analysis (PCA) to model fish shape features, which were then processed by a Haar-like detector to identify fish heads and snouts. Several traditional machine learning algorithms have employed more sophisticated hand-crafted features such as SIFT [53], SURF [54], Haar-like features [46], HOG [43], and light transmission features [45] for UOD.

Dynamic Motion Features-based Methods. In addition to detecting objects in still images, some research efforts [31, 35] focus on using motion information to detect moving objects. Spampinato et al. [35] introduced a covariance-based tracking algorithm for fish detection in 2012. In 2016, Liu et al. [44] exploited background subtraction to handle lighting changes and a three frame difference algorithm to address background noise in moving object detection. Subsequently, Vasamsetti et al. [47] developed a novel spatiotemporal texture feature for detecting moving objects. Their method demonstrated significant performance improvements and achieved state-of-the-art results on the Fish4Knowledge dataset.

Innovative Pipeline-based Methods. In addition to developing static hand-crafted features and dynamic motion features, many researchers have explored new pipelines or techniques to construct underwater object detection systems. In 2008, Spampinato et al. [33] developed a vision pipeline for detecting, tracking, and counting fish in low-quality underwater videos. Lee et al. [38] utilised contour matching to recognise fish in fish tanks, while Kim et al. [28] applied multi-template object selection and color-based image segmentation for underwater object detection. For detecting man-made underwater objects, Hou et al. [41] developed a color-based extraction algorithm to identify objects of interest after addressing non-uniform illumination, followed by an improved Otsu algorithm to eliminate color noise in the backgrounds. The pipeline concluded with a shape signature algorithm to recognise objects based on their shape.

TABLE III: Summary of Deep Learning Approaches for UOD, focusing on learning strategies (LS), utilised datasets, input data formats, learning stages, novel techniques, and detection frameworks.
Refs Year LS Datasets Formats Stages Novel Techniques Detection Frameworks
Li et al. [55] 2015 SL Datasets from LifeCLEF Fish Task of ImageCLEF RGB Two - Fast RCNN
Villon et al. [43] 2016 SL MARBEC Dataset RGB Two Motion from Previous Sliding Window RCNN Variant
Shkurti et al. [56] 2017 SL Aqua and Synthetic Datasets RGB One - YOLO Variant
Li et al. [57] 2017 SL Datasets from LifeCLEF Fish Task of ImageCLEF RGB Two - Faster RCNN Variant
Ji et al. [58] 2018 SL URPC2017 Dataset RGB Two - R-FCN Variant
Zhang et al. [59] 2018 SL URPC2017 Dataset RGB One Multi-scale Features and Context Information SSD Variant
Lee et al. [60] 2018 SL Private and Synthetic Datasets Sonar Two Style Transfer Algorithm for Sonar Image Synthesis Faster RCNN
Pedersen et al. [61] 2019 SL Brackish Dataset RGB One - YOLOv2 and YOLOv3
Chen et al. [62] 2020 SL URPC2017 and URPC2018 Datasets RGB One Invert Multi-class Adaboost Algorithm to Handle Noisy Data SWIPENET (SSD Variant)
Zhang et al. [63] 2020 SL URPC2019 Dataset RGB One Image Enhancement Technique SSD
Chen et al. [64] 2020 SL URPC2017 Dataset RGB Two Mixed Attention Mechanism and Multi-Enhancement Strategy Faster RCNN Variant
Wang et al. [65] 2020 SL URPC2019 Dataset RGB One - YOLO Nano
Liu et al. [66] 2020 SL URPC2019 Dataset RGB One Data Augmentation Method Water Quality Transfer (WQT) DG-YOLO
Fan [67] 2020 SL UWD Dataset RGB One Multi-scale Contextual Features and Anchor Refinement FERNet (SSD Variant)
Zhang et al. [68] 2020 SL URPC Dataset RGB One Attention Module and Multi-scale Feature Fusion MFFSSD
Lin [69] 2020 SL URPC2018 Dataset RGB Two Data Augmentation Technique RoIMix Faster RCNN Variant
Karimanzira et al. [70] 2020 SL Private Dataset Sonar Two - Faster RCNN Variant
Sung et al. [71] 2020 SL Private and Synthetic Datasets Sonar One Synthesising Sonar Images with GANs YOLO
Yang et al. [72] 2021 SL URPC2017 Dataset RGB One - YOLOv3
Pan et al. [73] 2021 SL Fish4Knowledge and Private Datasets RGB One Multi-scale ResNet for Multi-scale Object Detection SSD Variant
Jiang et al. [74] 2021 SL UODD Dataset RGB One Image Enhancement Framework WaterNet YOLOv3 Variant
Yu et al. [48] 2021 SL Private Dataset Sonar One Deep Learning Transformer–YOLOv5
Yeh et al. [75] 2021 SL Private Dataset RGB One Joint Color Conversion and Object Detection YOLOv3 Variant
Chen et al. [76] 2022 SL URPC2017 and URPC2018 Datasets RGB One Curriculum Multi-class Adaboost Algorithm to Handle Noisy Data SWIPENET (SSD Variant)
Alla et al. [77] 2022 SL Private Dataset RGB One Image Enhancement Techniques YOLOV4
Zhang et al. [78] 2022 SL URPC2017 Dataset RGB One - YOLO Variant
Cai et al. [79] 2022 SL URPC2021 Dataset RGB One Collaborative Weakly Supervision YOLOv5
Jia et al. [80] 2022 SL URPC Dataset RGB One - EfficientDet Variant (SSD Variant)
Wang et al. [81] 2022 SL URPC2020 Dataset RGB Two Joint Image Reconstruction and Object Detection Faster RCNN Variant
Liang et al. [82] 2022 SL UTDAC2020 Dataset RGB Two Attention Module to Capture RoI-level Relation Faster RCNN Variant
Chen et al. [83] 2023 SL URPC2018 Dataset RGB One - Hybrid Transformer
Song et al. [84] 2023 SL URPC2020 and Brackish Datasets RGB Two Boosting Re-weighting for Hard Example Mining Boosting RCNN
Dai et al. [85] 2023 SL UTDAC2020, Brackish, and TrashCan Datasets RGB Two Edge-guided Attention Module to Capture Boundary Information Faster RCNN Variant
Fu et al. [86] 2023 SL URPC2020 and UODD Datasets RGB Two Incorporate Transferable Priors to Remove Degradation Cascade RCNN Variant
Zhou et al. [87] 2024 SL DUO and RUOD Datasets RGB One AMSP-VConv to Handle Noise and Degradation AMSP-UOD (SSD Variant)
Guo et al. [88] 2024 SL RUOD, UTDAC2020, and URPC2022 Datasets RGB One Backbone Improvements for Real-time Inference YOLOv8
Gao et al. [89] 2024 SL UTDAC and RUOD Datasets RGB One Path-augmented Transformer to Enhance Small Object Detection Transformer Variant
Ge et al. [90] 2024 SL UATD Dataset Sonar One - YOLOv7 Variant
Dai et al. [91] 2024 SL DUO, Brackish, TrashCan, and WPBB Datasets RGB One Gated Cross-domain Collaborative Network to Address Poor Visibility and Low Contrast GCCNet (SSD Variant)
Wang et al. [92] 2024 SL DUO, UODD, RUOD, and UDD Datasets RGB One Joint Image Enhancement and Object Detection DJLNet (SSD Variant)
Refer to caption
Figure 6: A statistical analysis of AI-based UOD methods, including (a, b) utilised frameworks and learning stages, (c, d, e) utilised datasets, (g) learning strategies, and (f, h) publication dates.
Refer to caption
Figure 7: The categorisation and summary of previous research in AI-based UOD, highlighting the advantages (red) and disadvantages of each type.

II-C DL-based Underwater Object Detection

In deep learning-based UOD, two primary research directions have emerged: transferring generic object detection frameworks for UOD and designing specialised frameworks tailored to underwater environments.

II-C1 Transferring Generic Object Detection Frameworks.

A prominent approach in deep learning-based UOD involves directly transferring generic deep detection backbones and frameworks [7, 8, 9] to the underwater object detection task. These frameworks are generally categorised into two-stage detectors and one-stage detectors.

Two-stage Generic Detectors for UOD. Several studies have leveraged two-stage detectors to extract powerful deep features for UOD. For example, Li et al. [55] initially applied the generic Fast-RCNN [93] framework for fish species detection and later adopted the Faster-RCNN [7] framework to accelerate the detection process in [57]. In 2018, Ji et al. [58] introduced an RFCN variant for marine organism detection, while Lee et al. [60] and Karimanzira et al. [70] used Faster RCNN for underwater object detection. Two-stage deep detectors generally offer high localisation and recognition accuracy; however, they fall short of real-time detection due to their complex, coarse-to-fine processing paradigm.

One-stage Generic Detectors for UOD. To achieve real-time detection, many generic one-stage detectors have been applied in underwater object detection, particularly SSD variants [8] and YOLO variants [9]. Zhang et al. [63] utilised the generic SSD framework for underwater object detection. They also applied three underwater image enhancement algorithms to enhance the quality of underwater images and subsequently examined the correlations between image enhancement and object detection tasks.

YOLO [9] and its variants [72, 56, 79, 94] are also among the most frequently used generic detectors for UOD. Yang et al. [72] utilised the real-time YOLOv3 [95] framework to detect underwater objects in color images. In contrast, Sung et al. [71] and Zhang et al. [78] applied the YOLO framework or its variants to underwater object detection. Pedersen et al. [61] used both YOLOv2 and YOLOv3 for detecting marine animals in underwater scenes with varying visibility. Additionally, YOLOv4 [77] and YOLOv5 [79] have also been employed for underwater object detection.

II-C2 Specially Designed Underwater Object Detection Frameworks

Another key research direction in DL-based UOD is the development of novel frameworks or algorithms specially designed to tackle the unique challenges of UOD. Interestingly, we observe that most of these frameworks are one-stage deep detectors, as they can effectively balance accuracy and speed. We categorise these one-stage detectors into SSD variants, YOLO variants, and transformer variants.

SSD Variants. Over the past few years, many UOD frameworks have been developed based on the SSD architecture. Zhang et al. [59] enhanced the SSD framework by incorporating multi-scale and context features to improve multi-scale object detection in complex underwater environments. Similarly, Zhang et al. [68] introduced the MFFSSD framework, which integrates an attention module and a multi-scale feature fusion module into SSD to enhance multi-scale underwater object detection. To tackle the issue of noisy data, Chen et al. [76] proposed an easy-to-hard learning algorithm, called Curriculum Multi-class Adaboost (CMA), to train deep detection networks. Lin et al. [69] introduced RoIMix, a data augmentation technique that enhances interactions between images by mixing region proposals from multiple images. To achieve real-time underwater object detection, Pan et al. [73] proposed a lightweight multi-scale ResNet tailored for underwater environments.

YOLO Variants. Many researchers have focused on enhancing the YOLO framework for UOD. Wang et al. [65] introduced YOLO-Nano-Underwater, a deep detector designed to reduce inference time. Given that deep detectors trained on limited underwater data often suffer from severe domain shift, Liu et al. [66] integrated a domain invariant module (DIM) and an invariant risk minimisation (IRM) penalty term into the YOLOv3 framework to construct a detector that performs consistently across various underwater domains. Meanwhile, Jiang et al. [74] introduced a channel sharpening attention module (CSAM) to enhance the fusion of feature maps with the input image in the YOLOv3 framework. This strategy improves the accuracy of multi-scale object detection, particularly for small and medium-sized objects.

Transformer Variants. Several studies have explored the use of transformers in underwater object detection. Yu et al. [48] integrated a transformer module with the YOLOv5s framework to construct a novel model called TR–YOLOv5s. The transformer module, which includes a self-attention mechanism, allows TR–YOLOv5s to focus more on objects rather than backgrounds. While transformers excel at modeling long-range relationships compared to CNNs, they require more training data and exhibit higher computational complexity. To leverage the strengths of both transformers and CNNs, Chen et al. [83] developed a hybrid transformer network that integrates transformer and CNN modules. This hybrid transformer network captures global contextual information and outperforms both standalone CNNs and transformers.

II-D The Summary of AI-based UOD Methods

Fig. 7 provides a summary of the strengths and weaknesses of each type of AI-based UOD methods.

Refer to caption
Figure 8: Challenges and possible solutions for underwater object detection.

II-D1 Comparisons of ML UOD and DL UOD Methods

To understand the strengths and weaknesses of traditional ML and DL UOD methods, we categorise them based on four aspects: learning strategies, applicable datasets, utilised features or frameworks, and learning stages. A statistical analysis from different perspectives is presented in Fig. 6.

As illustrated in Tables II and III, traditional machine learning-based UOD methods utilise supervised or unsupervised learning on relatively small-scale datasets, while deep learning methods typically exploit supervised learning on large-scale datasets. Traditional machine learning approaches generally involve a two-stage learning process: feature extraction and detector selection or design. In contrast, deep learning approaches commonly employ a one-stage, end-to-end learning pipeline. Traditional machine learning methods have recently fallen behind advanced deep learning techniques. This is because they are limited to a narrow set of hand-crafted features and cannot fully leverage the potential of large-scale datasets. On the other hand, deep learning models require large amounts of training data but offer clear advantages over traditional methods in terms of detection accuracy.

II-D2 Comparison of Transferred Generic Detectors and Specially Designed Underwater Detectors

Transferring generic object detection frameworks for UOD is straightforward and easy to implement. This enables researchers to quickly adapt and scale detection systems for various underwater applications without the need to design new networks from scratch. Moreover, generic detection models are typically pre-trained on large, diverse datasets, offering a strong foundation of learned features that can be fine-tuned for underwater environments.

However, generic object detectors trained on terrestrial objects may not adequately capture the unique characteristics of underwater environments, such as the specific colors, textures, and shapes of marine objects, leading to reduced detection performance. In contrast, specially designed underwater detectors are more robust to underwater image degradation, as many are optimised for underwater-specific issues such as color distortion, haze, blurring, and low visibility. These optimisations can significantly enhance detection performance, as the models are tailored to address these challenges.

TABLE IV: A summary of existing underwater object detection datasets.
Datasets Train Test/Val Total Cls Anno. Object Year Download Link
F4K-Species [96] - - 27,370 23 Mask - 2012 -
F4K-Trajectory [97] - - 93 videos - BBox - 2013 -
F4K-Complex [98] - - 14 videos - BBox - 2014 -
Brackish [61] 11,739 1,468/1,467 14,674 6 BBOX 35,565 2019 Link
URPC2017 17,655 985 18,640 3 BBOX - 2017 Link (pwd:0hct)
URPC2018 2,901 800 3,701 4 BBOX 22,688 2018 Link (pwd:0hct)
URPC2019 4,757 1,029 5,786 4 BBOX 36,100 2019 Link
URPC2020 6,575 2,400 8,975 4 BBOX 46,287 2020 Link
URPC2021 7,478 1,200 8,678 4 BBOX 54,238 2021 -
UDD [99] 1,827 400 2,227 3 BBOX 15,022 2021 Link
DUO [4] 6,671 1,111 7,782 4 BBOX 74,515 2021 Link
UODD [74] 2,688 506 3,194 3 BBOX 19,212 2021 Link
RUOD [2] 9,800 4,200 14,000 10 BBOX 74,903 2022 Link
Refer to caption
Figure 9: Visual comparisons of images and annotations in some representative underwater object detection datasets.
Refer to caption
Figure 10: Examples of image quality degradation challenges include (a) uneven illumination, (b) color distortion, (c) blur effects, and (d) low visibility.

III Challenges and Future Solutions

Underwater scenes are among the most challenging environments for object detection due to their specific difficulties. To facilitate a clearer understanding of the challenges, we categorise them into four groups: image quality degradation, small object detection, noisy labels, and class imbalance. In this work, we review previous research, summarise existing solutions for each challenge in both UOD and GOD, and suggest potential improvements and research directions, as illustrated in Fig. 8.

III-A Image Quality Challenges and Solutions

Underwater environments face significant image quality degradation due to physical phenomena like light absorption and scattering [1], along with the use of artificial lighting, as illustrated in Fig. 10. These physical phenomena significantly degrade image quality, posing huge challenges to UOD. Most previous works [100, 101] first apply underwater image enhancement (UIE) techniques to improve image quality before using the enhanced images to train UOD frameworks. While it is generally assumed that UIE can boost UOD performance by enhancing visual quality, several studies [102, 2] have examined the relationship between UIE and UOD and found no strong correlation between image quality and detection accuracy. In some cases, UIE algorithms can even degrade detection performance [6].

These findings indicate that separate optimisation of UIE and UOD can result in inconsistent outcomes and suboptimal solutions, as the two tasks have different optimisation objectives: UIE is optimised using visual quality-based losses, while UOD relies on detection accuracy-based losses.

Potential Solutions: 1) Exploring Influencing Factors. Begin by identifying the key factors that contribute to performance drops in UOD, and then develop targeted solutions. For instance, Fu et al. [2] summarised the challenges posed by image quality degradation into three types: color distortion, haze effect, and light interference. Their research revealed that color distortion and light interference are the primary factors causing performance drops in detectors, while haze effect has a relatively minor impact. Therefore, it is essential to investigate these interfering factors in depth.

2) Joint Optimisation of UIE and UOD. Given the semantic gaps between UIE and UOD, enhancing their interactions to bridge these gaps is a promising research direction. For instance, Chen et al. [103] introduced a detection perceptor that guides the enhancement model to produce images favorable for detection, incorporating an object-focused perceptual loss to jointly supervise the optimisation of the enhancement model. Liu et al. [104] introduced a task-aware feedback module into the enhancement pipeline. This module provides detection-related information to the enhancement model, directing it towards generating detection-favoring images. Additionally, Wang et al. [81] proposed a joint framework for simultaneous underwater object detection and image reconstruction; the joint optimisation enables the shared network backbone to better generalise and adapt to various underwater scenes.
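The common idea behind these joint schemes can be written as a single weighted objective in which the enhancement network receives gradients from both a visual-quality term and the downstream detection loss. The sketch below is a conceptual PyTorch-style illustration under the assumption of a generic enhancement network and a detector that returns its training loss; it is not the formulation of any specific published model, and the weighting factor lam is an illustrative choice.

```python
import torch

def joint_uie_uod_loss(enh_net, detector, raw_img, ref_img, targets, lam=0.5):
    """Conceptual joint objective: the enhancer is supervised by a visual-quality loss
    and by the downstream detection loss, so both tasks are optimised together."""
    enhanced = enh_net(raw_img)                                     # UIE forward pass
    quality_loss = torch.nn.functional.l1_loss(enhanced, ref_img)   # visual-quality term
    det_loss = detector(enhanced, targets)                          # detection loss on the enhanced image
    return lam * quality_loss + det_loss                            # gradients flow back into enh_net
```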

III-B Small Object Detection Challenges and Solutions

Many objects in underwater images occupy only a tiny fraction of the image. For instance, objects in the DUO dataset [4] typically cover only 0.3% to 1.5% of the image area. Even advanced deep detectors often struggle to identify these small objects effectively. Several approaches [73, 105, 106] have been developed to tackle the challenge of small object detection in underwater scenes. Here, we review advanced techniques in both GOD and UOD to facilitate small object detection.

Potential Solutions: 1) Scale-specific Detector Design. Since features at different layers are sensitive to objects of specific scales, many approaches have developed multi-branch architectures to construct scale-specific detectors. Farhadi et al. [95] designed parallel branches to detect multi-scale objects, with one branch focused on constructing high-resolution feature maps to capture small objects. Singh et al. [107] introduced a novel training paradigm called Scale Normalization for Image Pyramids (SNIP), which trains deep detectors using only objects that fall within the desired scale range, effectively addressing small objects at the most relevant scales. Similarly, Najibi et al. [108] proposed a coarse-to-fine pipeline that processes small objects at the most reasonable scale.
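To illustrate the SNIP-style idea of training only on objects within a valid scale range, the sketch below filters ground-truth boxes by their resized area for a given image scale; the (x1, y1, x2, y2) box format and the area thresholds are illustrative assumptions, not the exact values used in [107].

```python
def filter_boxes_by_scale(boxes, img_scale, min_area=32**2, max_area=96**2):
    """SNIP-style filtering: keep only ground-truth boxes whose area, after resizing the
    image by `img_scale`, falls inside the range this detector branch is trained on.
    Boxes are (x1, y1, x2, y2); the thresholds here are illustrative assumptions."""
    kept = []
    for (x1, y1, x2, y2) in boxes:
        area = (x2 - x1) * (y2 - y1) * img_scale ** 2
        if min_area <= area <= max_area:
            kept.append((x1, y1, x2, y2))
    return kept

# Example: at 2x upsampling, a small underwater object moves into the valid range,
# while a very large object is excluded from this branch.
print(filter_boxes_by_scale([(10, 10, 40, 40), (0, 0, 300, 300)], img_scale=2.0))
```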

2) Attention Mechanism Deployment. The attention mechanism, designed to emphasise regions of interest, offers another notable solution for small object detection. Yang et al. [109] introduced pixel and channel attention modules to enhance regions containing small objects. Gao et al. [110] developed a local attention pyramid module to enhance the feature representation of small objects while suppressing backgrounds and noise in shallower feature maps. These attention-based solutions offer high flexibility and can be plugged into almost any deep detection architecture; however, they often come with substantial computational overhead due to the correlation operations involved.
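As an example of how such a module is plugged into a detector backbone, the sketch below shows a squeeze-and-excitation-style channel attention block in PyTorch; it is a generic illustration of the idea rather than the exact modules of [109, 110].

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation-style channel attention: reweights feature channels so that
    responses belonging to small foreground objects can be emphasised."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                        # squeeze: global context per channel
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),                                   # excitation: per-channel weights in (0, 1)
        )

    def forward(self, x):
        return x * self.fc(x)                               # rescale the input feature map

# Drop-in usage on a backbone feature map (batch of 2, 256 channels, 32x32 resolution).
feat = torch.randn(2, 256, 32, 32)
attended = ChannelAttention(256)(feat)
```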

3) High-resolution Image Reconstruction. This technique seeks to improve the details of small objects by increasing image resolution. Zhang et al. [111] incorporated a super-resolution branch within the detection framework to learn high-resolution feature representations for small objects. Both Rabbi et al. [112] and Courtrai et al. [113] employed GANs to super-resolve low-resolution remote sensing images, enhancing edge details and minimising high-frequency information loss during reconstruction. While high-resolution image reconstruction enriches the feature representation of small objects by amplifying image resolution, it may also introduce spurious textures and artifacts, which can negatively impact detection performance.

III-C Noisy Label Challenge and Solutions

In real-world datasets, the proportion of noisy/incorrect labels has been reported to range from 8.0% to 38.5% [114, 115]. The noisy label problem is particularly pronounced in underwater object detection datasets, where images often suffer from high levels of blur and low visibility, making accurate data annotation more challenging. Here, we review existing solutions for handling noisy labels in the deep learning community and summarise key approaches that may be well-suited for UOD.

Potential Solutions: 1) Sample Selection. Sample selection [116, 117] is a widely used strategy for filtering out clean samples from noisy datasets. Both theoretical [118] and empirical studies [119] demonstrate that deep learning models first learn simple, generalised patterns before gradually overfitting to noisy patterns. Based on this observation, quite a few works [120, 121] select small-loss training samples as clean examples in the early stages of training. More advanced approaches have utilised multi-network learning [122] and multi-round learning [123] to iteratively refine the selection of clean samples. These methods are well-motivated; however, they can suffer from accumulated errors due to incorrect sample selection, and finding a robust strategy for noisy sample selection remains challenging.
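The small-loss trick at the heart of these sample-selection methods can be sketched in a few lines; the per-sample loss values and the estimated noise rate below are illustrative assumptions rather than values from any cited work.

```python
import torch

def select_clean_samples(per_sample_loss, noise_rate=0.2):
    """Small-loss trick: treat the fraction (1 - noise_rate) of samples with the smallest
    loss as clean, and use only those samples to update the model in early epochs."""
    num_keep = int((1.0 - noise_rate) * per_sample_loss.numel())
    _, clean_idx = torch.topk(per_sample_loss, num_keep, largest=False)
    return clean_idx

# Example: with an assumed 20% noise rate, the 8 lowest-loss samples out of 10 are kept.
losses = torch.tensor([0.1, 2.3, 0.2, 0.15, 3.1, 0.3, 0.25, 0.12, 1.9, 0.4])
print(select_clean_samples(losses))
```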

2) Robust Loss Function. Several noise-robust loss functions have been developed to train deep learning models on datasets with noisy labels. Ghosh et al. [124] introduced a noise-tolerant loss function for multi-class classification, which performs well under symmetric noise but requires knowledge of the noise rate. Zhang et al. [125] proposed a generalised cross-entropy (GCE) loss function that combines the benefits of both mean absolute error and categorical cross-entropy losses. Zhou et al. [126] introduced asymmetric loss functions that satisfy the Bayes-optimal condition, making them robust to noisy labels. While these loss functions can enhance the generalisation ability of deep models in certain situations, they often depend on specific conditions, such as a known noise rate.
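As a concrete example, the GCE loss of Zhang et al. [125] takes the form L_q = (1 - p_y^q) / q, which interpolates between mean absolute error (q = 1) and categorical cross-entropy (q -> 0). A minimal PyTorch sketch follows; the choice q = 0.7 and the toy inputs are illustrative.

```python
import torch
import torch.nn.functional as F

def generalized_cross_entropy(logits, targets, q=0.7):
    """GCE loss [125]: L_q = (1 - p_y^q) / q, where p_y is the predicted probability of
    the true class. q -> 0 recovers cross-entropy; q = 1 recovers mean absolute error."""
    probs = F.softmax(logits, dim=1)
    p_y = probs.gather(1, targets.unsqueeze(1)).squeeze(1).clamp(min=1e-7)
    return ((1.0 - p_y.pow(q)) / q).mean()

# Example on random logits for a 4-class classification head.
logits = torch.randn(8, 4)
targets = torch.randint(0, 4, (8,))
print(generalized_cross_entropy(logits, targets))
```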

III-D Class Imbalance Challenge and Solutions

A long-tail distribution is frequently observed in underwater object detection datasets, as illustrated in Fig. 11. Here, we present two strategies to tackle class imbalance issues in UOD.

1) Class-Aware Sampling. Class-aware sampling seeks to address class imbalance by adjusting the class distribution through class-wise resampling during training. For instance, Chang et al. [127] identified limitations in image-level resampling strategies for long-tailed detection and proposed an object-centric sampling method. Feng et al. [128] introduced a Memory-Augmented Feature Sampling (MFS) module designed to over-sample features from underrepresented classes. While class-aware sampling helps generate a more balanced data distribution, it may also introduce large amounts of duplicated samples, which can slow down training and increase the risk of overfitting.
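A lightweight realisation of class-aware sampling is to draw training images with probabilities inversely proportional to their class frequencies. The sketch below uses PyTorch's WeightedRandomSampler and, for simplicity, assumes a single dominant class label per image; the toy label list is illustrative.

```python
from collections import Counter
from torch.utils.data import WeightedRandomSampler

# Assumed per-image labels for a toy long-tailed set: class 0 dominates.
image_labels = [0, 0, 0, 0, 0, 0, 1, 1, 2]
freq = Counter(image_labels)
weights = [1.0 / freq[lbl] for lbl in image_labels]   # rarer classes receive larger weights

# Draw a resampled epoch in which minority classes appear far more often.
sampler = WeightedRandomSampler(weights, num_samples=len(image_labels), replacement=True)
print(list(sampler))  # indices of the images selected for this epoch
```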

2) Loss Reweighing. Loss reweighing [129] aims to enhance the learning of minority classes by increasing their weights. For instance, Mahajan et al. [130] reweighted classes using the square root of their frequency, while Cui et al. [131] employed the effective number of samples rather than the raw class frequency to balance the classes. Li et al. [132] presented a Balanced Group Softmax (BAGS) module, which optimises the classification head in the detection framework to ensure adequate training for all classes. Observing that learning bias arises not only from class distribution disparities but also from differences in sample diversity, Qi et al. [133] proposed the BAlanced CLassification (BACL) framework to address imbalances stemming from both class distribution and sample diversity. All these works provide insights for addressing class imbalance in underwater object detection and require further investigation to validate their effectiveness.
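For instance, the class-balanced weighting of Cui et al. [131] replaces raw frequencies with the effective number of samples E_n = (1 - beta^n) / (1 - beta) and weights each class by its inverse. A short sketch follows; the class counts are illustrative assumptions meant to mimic a long-tailed distribution such as the one in Fig. 11.

```python
import numpy as np

def class_balanced_weights(class_counts, beta=0.999):
    """Class-balanced reweighting [131]: weight each class by the inverse of its
    effective number of samples E_n = (1 - beta^n) / (1 - beta)."""
    counts = np.asarray(class_counts, dtype=np.float64)
    effective_num = (1.0 - np.power(beta, counts)) / (1.0 - beta)
    weights = 1.0 / effective_num
    return weights / weights.sum() * len(counts)   # normalise so the weights average to 1

# Assumed long-tailed class counts (e.g., a seaurchin-heavy distribution).
print(class_balanced_weights([50000, 12000, 3000, 800]))
```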

Refer to caption
Figure 11: Data distributions in the representative underwater object detection datasets show that class imbalance is a widespread issue.

IV Datasets, Evaluation Metrics, and Detection Diagnosis Tools

In this section, we first provide a comprehensive overview of previous UOD datasets, followed by a discussion of commonly used evaluation metrics in UOD. Finally, we introduce two valuable detection diagnosis tools that help identify strengths and weaknesses of each underwater detector.

IV-A Datasets

Several underwater object detection datasets have been proposed over the past decade. In this subsection, we review some of the most representative datasets, including the Fish4K [134], Brackish [61], the URPC series, UDD [99], DUO [4], UODD [74], and RUOD [2]. A summary of these datasets is presented in Table IV and Fig. 11.

Fish4K Dataset [134]. The Fish4K dataset is one of the most well-known early-stage underwater datasets for fish detection and species classification. It offers a large collection of videos and images captured by 10 cameras between 2010 and 2013. The dataset includes a diverse range of moving organisms, such as swimming fish, sea anemones, algae, and aquatic plants. Numerous detection and classification algorithms [47, 73] have been developed and validated on various subsets of Fish4K, such as F4K-Complex [98], F4K-Species [96], and F4K-Trajectories [97]. However, the download links for this dataset and its subsets are no longer available.

Brackish dataset [61]. The Brackish dataset is a Kaggle competition dataset for fish detection, collected in brackish water with varying levels of visibility. The dataset consists of images extracted from 89 videos, annotated with bounding boxes. Fish are categorised into six broad groups: fish, small fish, crab, shrimp, jellyfish, and starfish. The dataset is divided into training, validation, and testing sets, following an 80/10/10 split. In total, it contains 14,674 underwater images with 35,565 annotated instances.

URPC Series Datasets. The Underwater Robot Professional Contest (URPC)111http://www.urpc.org.cn/index.html has been held annually since 2017, providing large-scale underwater object detection datasets for competition purposes each year. To date, URPC has released five datasets (URPC 2017-2021). The underwater images are captured near Zhangzi Island, Dalian, China, and all URPC datasets include bounding box annotations. Among them, URPC2017 features three object categories (scallop, seacucumber, and seaurchin), while later datasets include four categories (scallop, starfish, seacucumber, and seaurchin). However, the download links for these datasets are no longer available after the competition. In Table IV, we provide the download links sourced from unofficial researchers.

UDD Dataset [99]. The UDD is a benchmark designed for underwater object detection tasks, featuring 15,022 instances with bounding box annotations. The images are collected from an open-sea farm in Zhangzi Island, Dalian, China. The dataset comprises 2,227 images, divided into a training set of 1,827 images and a testing set of 400 images. It includes three object categories: scallop, seacucumber, and seaurchin.

DUO Dataset [4]. The DUO dataset is generated by re-annotating the URPC series (2017-2020) and UDD datasets, as many of these datasets do not offer testing sets and their images suffer from poor annotation quality. The DUO dataset includes four object categories (scallop, starfish, seacucumber, and seaurchin), featuring improved and more accurate annotations. The dataset comprises 6,671 images for training and 1,111 images for testing.

UODD Dataset [74]. The UODD dataset is another specialised dataset for underwater object detection, also collected from Zhangzi Island, Dalian, China. It includes 3,194 underwater images with 19,212 annotated instances, divided into 2,688 training images and 506 testing images. The dataset features three object categories: scallop, seacucumber, and seaurchin.

RUOD Dataset [2]. The RUOD dataset is created to include a wide range of object categories and visual variations. RUOD features underwater images collected from various locations across the internet, capturing diverse visual effects such as haze, color distortion, and varying light conditions. RUOD encompasses ten underwater object categories (fish, seaurchin, coral, starfish, seacucumber, scallop, diver, cuttlefish, turtle and jellyfish), 14,000 images, and 74,903 annotated objects. Fig. 9 displays examples from RUOD alongside those from previous datasets, illustrating that RUOD features a greater diversity of object categories and visual variations.

TABLE V: The quantitative performance of representative detection frameworks on the RUOD and DUO datasets.
Datasets Types Methods Backbone Params FLOPs AP AP50 AP75 APs APm APl
RUOD Generic RetinaNet [23] ResNet101 55.51M 273.41G 52.8 81.8 57.9 14.6 39.9 58.3
GuidedAnchor [135] ResNet101 60.98M 258.05G 46.6 80.8 48.3 21.5 41.1 50.6
CascadeRCNN [136] ResNet101 88.17M 301.06G 49.8 80.5 54.2 18.7 43.1 54.3
RepPoints [3] ResNet101 55.82M 256.00G 53.2 82.2 60.1 28.2 44.9 57.8
FoveaBox [137] ResNet101 56.68M 268.29G 44.8 80.2 45.2 18.0 37.5 49.1
ATSS [138] ResNet101 51.13M 267.26G 54.0 80.3 60.2 18.0 40.0 59.5
DetectoRS [139] DResNet50 123.23M 90.05G 53.3 84.1 58.7 30.8 46.6 57.8
GridRCNN [140] ResNet101 129.63M 365.60G 54.2 81.6 60.0 25.6 46.1 59.8
Underwater RoIMix [69] ResNet50 68.94M 91.08G 54.6 81.3 60.3 15.6 41.7 59.8
RoIAttn [82] ResNet50 55.23M 331.39G 52.9 81.7 57.3 12.2 39.0 58.3
BoostRCNN [84] ResNet50 45.95M 54.71G 53.9 80.6 59.5 11.6 39.0 59.3
RFTM [86] ResNet50 75.58M 91.06G 53.3 80.2 57.7 11.8 39.2 59.3
ERLNet [85] SiEdgeR50 45.95M 54.71G 54.8 83.1 60.9 14.7 41.4 59.8
GCCNet [91] SwinFT 38.31M 78.93G 56.1 83.2 60.5 11.7 41.9 62.1
DJLNet [92] ResNet50 58.48M 69.51G 57.5 83.7 62.5 15.5 41.8 63.1
DUO Generic RetinaNet [23] ResNet101 55.38M 289.79G 50.9 73.1 59.7 51.0 50.1 53.1
CascadeRCNN [136] ResNet101 88.15M 319.49G 60.6 80.9 70.5 54.4 61.4 61.2
RepPoints [3] ResNet101 55.82M 256.00G 59.4 80.4 70.1 55.5 59.6 60.1
GridRCNN [140] ResNet101 129.63M 365.60G 56.6 78.9 67.2 50.3 56.7 57.4
FoveaBox [137] ResNet101 55.24M 286.72G 53.7 78.4 63.9 55.3 54.3 54.6
ATSS [138] ResNet101 51.11M 286.72G 55.4 79.2 63.2 55.7 55.7 56.0
DetectoRS [139] DResNet50 123.23M 90.05G 58.9 81.4 68.3 49.6 57.6 61.8
GuidedAnchor [135] ResNet101 60.94M 276.48G 61.4 83.8 72.0 58.9 62.4 61.3
Underwater RoIMix [69] ResNet50 68.94M 91.08G 61.0 80.0 69.7 48.0 62.5 60.2
RoIAttn [82] ResNet50 55.23M 331.39G 58.7 79.5 66.5 45.5 59.6 58.5
BoostRCNN [84] ResNet50 45.95M 54.71G 60.8 80.6 69.0 48.3 62.3 59.4
RFTM [86] ResNet50 75.58M 91.06G 60.1 79.4 68.1 49.0 61.1 59.5
ERLNet [85] SiEdgeR50 45.95M 54.71G 61.2 81.4 69.5 55.2 62.2 60.8
GCCNet [91] SwinFT 38.31M 78.93G 61.1 81.6 67.3 52.5 63.6 59.3
DJLNet [92] ResNet50 58.48M 69.51G 65.6 84.2 73.0 55.6 67.4 64.1
Figure 12: Precision-recall curves of representative detection frameworks for each category in the RUOD dataset.

IV-B Evaluation Metrics

Most evaluation metrics used in underwater object detection are adopted from generic object detection. In both fields, evaluation metrics assess model performance in terms of both accuracy and efficiency.

IV-B1 Accuracy Metrics

Accuracy metrics assess how precisely a model detects objects and include Average Precision (AP) and mean Average Precision (mAP). Average Precision (AP) represents the area under the precision-recall curve, as shown in Fig. 12. It essentially reflects the precision averaged across all recall values from 0 to 1, and can be formulated as:

AP@\alpha=\int_{0}^{1}p(r)\,dr \qquad (1)

Here, p(r) denotes the precision-recall curve. The notation AP@α indicates that AP is evaluated at a specific IoU threshold α. For example, AP@0.5 and AP@0.75 refer to the AP calculated at IoU thresholds of 0.5 and 0.75, respectively. For each class, an individual AP can be calculated. The mean Average Precision (mAP) is then obtained by averaging the AP values over all n classes:

mAP@\alpha=\frac{1}{n}\sum_{i=1}^{n}AP_{i} \qquad (2)

To avoid bias that might arise from using a single IoU threshold, the COCO [141] mAP evaluator computes mAP across 10 IoU thresholds from 0.5 to 0.95 in increments of 0.05 (mAP@[0.5:0.05:0.95]). Additionally, COCO introduces metrics to assess detection performance across different object scales: APs for small objects, APm for medium objects, and APl for large objects.
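For readers implementing these metrics themselves, the sketch below computes AP for a single class from ranked detections and averages per-class APs into mAP. It omits the precision-envelope interpolation and 101-point sampling applied by the official COCO evaluator, so it illustrates Eqs. (1) and (2) rather than replacing pycocotools.

```python
import numpy as np

def average_precision(scores, matches, num_gt):
    """AP@alpha for one class: area under the precision-recall curve (Eq. 1).

    scores  -- confidence score of every detection of this class
    matches -- 1 if the detection matches an unmatched GT box at IoU >= alpha, else 0
    num_gt  -- number of ground-truth boxes of this class
    """
    order = np.argsort(-np.asarray(scores))
    tp = np.asarray(matches, dtype=float)[order]
    fp = 1.0 - tp
    tp_cum, fp_cum = np.cumsum(tp), np.cumsum(fp)
    recall = tp_cum / max(num_gt, 1)
    precision = tp_cum / (tp_cum + fp_cum)
    # Step-wise integration of p(r) over recall in [0, 1].
    ap, prev_r = 0.0, 0.0
    for p, r in zip(precision, recall):
        ap += p * (r - prev_r)
        prev_r = r
    return ap

def mean_average_precision(per_class_aps):
    """mAP@alpha (Eq. 2): average the per-class AP values."""
    return sum(per_class_aps) / len(per_class_aps)
```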

IV-B2 Efficiency Metrics

The efficiency metrics measure the detection model’s performance in terms of computational complexity, model size, and real-time inference speed. These metrics include FLOPs, Params, and FPS.

FLOPs measures computational complexity by counting the number of floating-point operations the model requires. Models with fewer FLOPs are more computationally efficient. Params refers to the number of learnable weights in a model; models with fewer parameters are more lightweight and better suited for deployment on resource-constrained devices. FPS measures the inference speed of the model, indicating how many frames it can process per second.
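As a rough illustration (not the exact protocol used for Table V), Params and FPS can be measured directly in PyTorch, while FLOPs are usually obtained with an external profiler; the input size, device, and warm-up settings below are arbitrary choices.

```python
import time
import torch

def count_params(model):
    """Number of learnable weights (the 'Params' column)."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

@torch.no_grad()
def measure_fps(model, input_size=(1, 3, 640, 640), n_runs=50, device='cuda'):
    """Frames processed per second on a fixed-size dummy input."""
    model = model.eval().to(device)
    x = torch.randn(*input_size, device=device)
    for _ in range(5):                      # warm-up iterations
        model(x)
    if device == 'cuda':
        torch.cuda.synchronize()
    start = time.time()
    for _ in range(n_runs):
        model(x)
    if device == 'cuda':
        torch.cuda.synchronize()
    return n_runs / (time.time() - start)

# FLOPs are typically reported via a profiler such as fvcore's FlopCountAnalysis
# rather than counted by hand.
```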

IV-C Detection Analysis Tools

Object detectors can fail due to various issues, such as misclassification, localisation errors, or missed detections. High-level metrics like mAP offer limited insight into the root causes of underperformance. Tools that provide a quantitative breakdown of error types give researchers a deeper understanding of where and why detectors fail, helping identify areas for improvement. Here, we introduce two practical and easy-to-use detection analysis tools: Diagnosis [10] and TIDE [11], each offering distinct yet complementary advantages.

Diagnosis examines the impact of object characteristics, such as size and aspect ratio, on detection performance. It categorises objects into five size groups based on their percentile size within each object category: extra-small (XS: bottom 10%), small (S: next 20%), medium (M: next 40%), large (L: next 20%), and extra-large (XL: top 10%), as illustrated in Fig. 14. Similarly, aspect ratio, defined as the object's width divided by its height, is categorised into five groups based on percentile rankings: extra-tall (XT: bottom 10%), tall (T: next 20%), medium (M: next 40%), wide (W: next 20%), and extra-wide (XW: top 10%), as illustrated in Fig. 15. TIDE offers a more precise and comprehensive error analysis than Diagnosis. It defines six error types: classification error, localisation error, combined classification and localisation error, duplicate detection error, background error, and missed GT error. Fig. 13 illustrates the distributions of error types as defined by TIDE: the pie chart represents the relative contribution of each error type, while the bar plots display their absolute contribution. To support research in underwater object detection, we publish ready-to-use source code for Diagnosis and TIDE, specifically tailored to the RUOD and DUO datasets.
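The percentile binning used by Diagnosis is easy to reproduce. The sketch below is our own illustration with toy boxes (not the tool's code) and assigns objects of one category to size and aspect-ratio groups following the 10/20/40/20/10 scheme described above.

```python
import numpy as np

def bin_by_percentile(values, labels=('XS', 'S', 'M', 'L', 'XL')):
    """Diagnosis-style grouping: bottom 10%, next 20%, next 40%, next 20%, top 10%."""
    edges = np.percentile(values, [10, 30, 70, 90])
    return [labels[i] for i in np.searchsorted(edges, values, side='right')]

# Toy (width, height) boxes standing in for one category's annotations.
boxes = np.array([[12, 10], [30, 28], [64, 60], [120, 40], [200, 180]], dtype=float)
areas = boxes[:, 0] * boxes[:, 1]
ratios = boxes[:, 0] / boxes[:, 1]            # width / height, as defined above

print(bin_by_percentile(areas))                                # size groups
print(bin_by_percentile(ratios, ('XT', 'T', 'M', 'W', 'XW')))  # aspect-ratio groups
```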

Figure 13: The error types of representative object detection frameworks on (a) RUOD and (b) DUO. The pie chart represents the relative contribution of each error type, while the bar plots display their absolute contribution.
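For reference, an error breakdown like the one in Fig. 13 can be produced with the publicly released tidecv package. The sketch below follows its documented usage; the annotation and result file paths are placeholders and assume COCO-format ground truth and detections.

```python
# pip install tidecv
from tidecv import TIDE, datasets

tide = TIDE()
gt = datasets.COCO('data/RUOD/annotations/instances_test.json')   # placeholder path
preds = datasets.COCOResult('results/retinanet_ruod.bbox.json')   # placeholder path

tide.evaluate(gt, preds, mode=TIDE.BOX)   # box-level error analysis
tide.summarize()                          # prints the six error types and their dAP
tide.plot()                               # saves summary plots similar to Fig. 13
```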

V Evaluations of mainstream detection frameworks on the RUOD and DUO benchmarks

Although numerous datasets have been used to evaluate UOD algorithms, well-recognised and reliable benchmarks are still urgently needed to support UOD development, for two reasons: (1) researchers often use the same dataset but apply different splits, particularly with the URPC series datasets, which lack test set annotations; and (2) many early-stage datasets, like URPC2017 and URPC2018, contain considerable annotation errors, so results reported on test sets with faulty annotations are neither fair nor reliable. Therefore, we recommend two high-quality, large-scale datasets, RUOD and DUO, with consistent data splits and accurate annotations, as unified benchmarks for comparing UOD algorithms.

Many specialised underwater detectors have not released their source code, or their environments are difficult to reproduce. While many generic detectors have been tested on the RUOD and DUO datasets, the trained models and detection results are not publicly accessible. As a result, researchers must repeat experiments, causing duplicated, time-consuming effort. To address these issues, we introduce UODReview, a platform specially designed for comparing underwater object detection approaches. It provides source code, trained models, detection results, and detection analysis tools for RUOD and DUO, facilitating faster comparisons and advancing UOD development. Both mainstream generic detectors and state-of-the-art underwater detectors are chosen for comparison. For the generic detectors, we use source code from MMDetection to retrain and evaluate several leading detectors, including RetinaNet [23], FoveaBox [137], ATSS [138], GridRCNN [140], DetectoRS [139], RepPoints [3], CascadeRCNN [136], and GuidedAnchor [135]; an example configuration is sketched below. For the specialised underwater detectors, we directly present results reported in previous studies, as most of these detectors have not published their source code or cannot be successfully reproduced. On the DUO and RUOD datasets, we consider several top-performing methods, including DJLNet [92], ERLNet [85], GCCNet [91], RoIMix [69], BoostRCNN [84], RFTM [86], and RoIAttn [82]. Their results are sourced directly from the study by Wang et al. [92].
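As an illustration of how such retraining is set up, a minimal MMDetection (2.x-style) config for RetinaNet on RUOD might look as follows; the config name, paths, and schedule are placeholders, and the exact settings used for Table V may differ.

```python
# configs/ruod/retinanet_r101_fpn_1x_ruod.py  (hypothetical file name)
_base_ = '../retinanet/retinanet_r101_fpn_1x_coco.py'  # base config shipped with MMDetection

# RUOD contains ten object categories.
model = dict(bbox_head=dict(num_classes=10))

dataset_type = 'CocoDataset'
data_root = 'data/RUOD/'                                # placeholder path
data = dict(
    # The category names must match those in the COCO-style annotation files.
    train=dict(ann_file=data_root + 'annotations/instances_train.json',
               img_prefix=data_root + 'train/'),
    val=dict(ann_file=data_root + 'annotations/instances_test.json',
             img_prefix=data_root + 'test/'),
    test=dict(ann_file=data_root + 'annotations/instances_test.json',
              img_prefix=data_root + 'test/'))
```

Training then proceeds through MMDetection's standard entry point, e.g. `python tools/train.py configs/ruod/retinanet_r101_fpn_1x_ruod.py`.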

V-A Comparisons between Underwater and Generic Detectors

Table V presents the quantitative results, while Fig. 12 displays the precision-recall curves for underwater detectors and generic detectors. From these, we observe that many underwater detectors, such as DJLNet, GCCNet, and ERLNet, perform much better than generic detectors, while many generic detectors, especially FoveaBox and GuidedAnchor on RUOD, and RetinaNet and FoveaBox on DUO, perform much worse. This gap stems from several key factors related to the unique challenges and characteristics of the underwater environment. For instance, DJLNet excels at simultaneously learning both appearance and edge features, enhancing robustness in complex underwater scenes. It incorporates an image decolorisation module to eliminate color distortions caused by underwater light absorption and scattering, and its edge enhancement branch sharpens shape and texture details, improving the recognition of object boundary features. Similarly, ERLNet leverages an edge-guided attention module to refine edge information for better detection performance; by sharpening object boundaries, it enhances the detection of objects with fuzzy boundaries, a common challenge in underwater images due to blurring and low visibility. GCCNet tackles the challenges of poor visibility and low contrast in underwater environments through three key components: first, a UIE method that enhances object visibility in low-contrast areas; second, a cross-domain feature interaction module that extracts complementary information between raw and enhanced images; and third, a gated feature fusion module that adaptively manages the fusion ratio of cross-domain information.

Generic detectors often exhibit poor or unstable performance across different underwater datasets. For instance, GuidedAnchor performs well on the DUO dataset but lags far behind other methods on the RUOD dataset. Similarly, CascadeRCNN shows varying performance across the two datasets. This discrepancy is largely due to domain-specific variations in the datasets: RUOD features diverse object sizes, color distortions, and lighting conditions, whereas the DUO dataset has a more consistent style. Cascade R-CNN enhances object detection through a multi-stage framework in which each stage progressively refines the bounding box predictions. This approach improves detection quality, particularly for challenging cases such as small or occluded objects; however, it can also overfit to easier-to-detect objects, especially when more refinement stages are used. GuidedAnchor presents a new anchor generation method that uses feature-guided anchors to improve object localisation. However, it can be less effective for objects with extreme aspect ratios or variable shapes, making it harder to generalise across diverse datasets. Hence, the design of underwater detectors should consider both the unique characteristics of the underwater environment and the variations present in different datasets.

V-B Discussions on the Impact of Object Characteristics

Fig. 14 presents the performance of representative detection frameworks for objects of different sizes on the RUOD and DUO datasets. In both datasets, object characteristics, such as size and aspect ratio, significantly impact detection performance. Most detection frameworks exhibit very low precision for the extra-small object category, as these objects occupy only a few pixels and contain much less visual information, making it difficult for detectors to distinguish them from backgrounds and noise in underwater images. Moreover, current detectors struggle to maintain scale invariance, with detection accuracy varying considerably across object sizes. Detecting a wide range of object scales is challenging because models must remain sensitive to small objects without sacrificing performance on larger ones.

Fig. 15 presents the performance of representative detection frameworks for objects with different aspect ratios on the RUOD and DUO datasets. Most detectors favour objects with a 1:1 aspect ratio (i.e., square objects). This can be attributed to several aspects of detection network architecture: (1) Square kernels: most detectors utilise square convolutional filters to process input images. These filters are inherently more effective at capturing patterns in objects closer to a 1:1 aspect ratio, as spatial information is processed evenly in both dimensions (height and width). (2) Anchor boxes: many detectors rely on anchor boxes to predict object locations and categories. These anchor boxes are predefined with various aspect ratios, but 1:1 ratios are commonly emphasised. Objects with extreme aspect ratios (either extremely tall or extremely wide) may not align well with the square filters, and the predefined anchor boxes may not match their shapes, leading to less accurate predictions, as illustrated by the sketch below.
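To make the anchor-box point concrete, the following sketch generates a typical set of anchors for one feature-map location; with ratios limited to {0.5, 1, 2}, an extremely elongated object can have low IoU with every anchor and is therefore harder to assign and regress. The parameter values are illustrative only, not those of any particular detector.

```python
import numpy as np

def generate_anchors(base_size=32, ratios=(0.5, 1.0, 2.0), scales=(1, 2, 4)):
    """Centred anchors (x1, y1, x2, y2) for a single feature-map location.

    'ratio' here is width / height, matching the aspect-ratio definition above.
    """
    anchors = []
    for r in ratios:
        for s in scales:
            area = (base_size * s) ** 2
            w, h = np.sqrt(area * r), np.sqrt(area / r)
            anchors.append([-w / 2, -h / 2, w / 2, h / 2])
    return np.array(anchors)

print(generate_anchors().round(1))   # 9 anchors; none matches, e.g., a 10:1 object well
```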

V-C Discussions on the Impact of Error Types

Fig. 13 presents the error types of representative object detection frameworks on these two datasets. On the RUOD dataset, most detectors, including FoveaBox, CascadeRCNN, RepPoints, and ATSS, tend to make more background and localisation errors than other error types.

Figure 14: Performance of representative detection frameworks for objects of varying sizes in the RUOD and DUO datasets: XS (bottom 10%)=extra-small, S (next 20%)=small, M (next 40%)=medium, L (next 20%)=large, XL (top 10%)=extra-large.
Figure 15: Performance of representative detection frameworks for objects of varying aspect ratios in the RUOD and DUO datasets: XT (bottom 10%)=extra-tall, T (next 20%)=tall, M (next 40%)=medium, W (next 20%)=wide, XW (top 10%)=extra-wide.

This is mainly due to severe color distortions that obscure the distinctions between objects and backgrounds. DetectoRS and GridRCNN exhibit a higher number of missed ground truth errors, indicating that many objects are missed entirely. GridRCNN utilises a grid-guided localisation approach to refine bounding boxes, which, although precise, can struggle with objects that have low contrast and irregular shapes (e.g., seacucumbers); the grid representation can fail to capture such objects, especially when they are small or blurred. DetectoRS employs advanced feature extraction techniques, such as the Recursive Feature Pyramid (RFP) and Switchable Atrous Convolution (SAC), which significantly enhance feature representation but also increase model complexity. This complexity can occasionally cause important features to be underrepresented or masked by noise during the recursive processing, leading to missed detections.

On the DUO dataset, background errors are more pronounced for most detectors, mainly due to blurring effects that cause seacucumbers to resemble sand and scallops to appear like stones, as illustrated in Fig. 9; this significantly complicates accurate detection. It is also notable that RetinaNet produces more classification errors, which may be attributed to its use of focal loss to address class imbalance. Focal loss emphasises hard-to-classify examples, which can cause the model to overlook easier ones; if the model becomes overly focused on these challenging cases, it may misclassify simple instances, increasing the overall number of classification errors.
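This behaviour follows directly from the focal loss formulation, sketched below in a simplified binary form: the modulating factor (1 - p_t)^gamma suppresses the loss of well-classified (easy) examples, so training is dominated by hard, ambiguous ones.

```python
import torch
import torch.nn.functional as F

def sigmoid_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Simplified sigmoid focal loss (Lin et al. [23]) for binary 0/1 float targets."""
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction='none')
    p_t = p * targets + (1 - p) * (1 - targets)              # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    # (1 - p_t)^gamma -> 0 for confident correct predictions, so easy examples
    # contribute little and hard examples dominate the gradient.
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()
```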

VI Future insights and vision

This survey presents a systematic review of AI-based UOD techniques. Our analysis reveals that sonar images are a better choice than RGB images for UOD in extremely murky waters due to their superior perceptual range. Furthermore, developing improved network architectures, loss functions, and learning strategies is a promising direction for enhancing UOD performance. We explore the challenges in UOD, noting that some issues, such as noisy labels, have not been thoroughly investigated, and that advanced techniques from GOD could help address them. Additionally, we provide two valuable detection analysis tools to help identify the strengths and weaknesses of each underwater detector. Finally, we recommend two high-quality, large-scale benchmarks for fair comparison of UOD algorithms. Several mainstream deep detectors are evaluated on these benchmarks, and the source code, trained models, utilised datasets, and detection results are available online, enabling researchers to compare their detectors against existing methods.

References

  • [1] J. Y. Chiang and Y.-C. Chen, “Underwater image enhancement by wavelength compensation and dehazing,” IEEE Transactions on Image Processing, vol. 21, no. 4, pp. 1756–1769, 2011.
  • [2] C. Fu, R. Liu, X. Fan, P. Chen, H. Fu, W. Yuan, M. Zhu, and Z. Luo, “Rethinking general underwater object detection: Datasets, challenges, and solutions,” Neurocomputing, vol. 517, pp. 243–256, 2023.
  • [3] Z. Yang, S. Liu, H. Hu, L. Wang, and S. Lin, “Reppoints: Point set representation for object detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 9657–9666.
  • [4] C. Liu, H. Li, S. Wang, M. Zhu, D. Wang, X. Fan, and Z. Wang, “A dataset and benchmark of underwater object detection for robot picking,” in 2021 IEEE International Conference on Multimedia & Expo Workshops (ICMEW).   IEEE, 2021, pp. 1–6.
  • [5] D. Akkaynak and T. Treibitz, “A revised underwater image formation model,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6723–6732.
  • [6] S. Xu, M. Zhang, W. Song, H. Mei, Q. He, and A. Liotta, “A systematic review and analysis of deep learning-based underwater object detection,” Neurocomputing, 2023.
  • [7] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6, pp. 1137–1149, 2016.
  • [8] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “Ssd: Single shot multibox detector,” in European Conference on Computer Vision.   Springer, 2016, pp. 21–37.
  • [9] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 779–788.
  • [10] D. Hoiem, Y. Chodpathumwan, and Q. Dai, “Diagnosing error in object detectors,” in European Conference on Computer Vision.   Springer, 2012, pp. 340–353.
  • [11] D. Bolya, S. Foley, J. Hays, and J. Hoffman, “Tide: A general toolbox for identifying object detection errors,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part III 16.   Springer, 2020, pp. 558–573.
  • [12] M. Moniruzzaman, S. M. S. Islam, M. Bennamoun, and P. Lavery, “Deep learning on underwater marine object detection: A survey,” in Advanced Concepts for Intelligent Vision Systems: 18th International Conference, ACIVS 2017, Antwerp, Belgium, September 18-21, 2017, Proceedings 18.   Springer, 2017, pp. 150–160.
  • [13] D. Gomes, A. S. Saif, and D. Nandi, “Robust underwater object detection with autonomous underwater vehicle: A comprehensive study,” in Proceedings of the International Conference on Computing Advancements, 2020, pp. 1–10.
  • [14] N. Wang, Y. Wang, and M. J. Er, “Review on deep learning techniques for marine object recognition: Architectures and algorithms,” Control Engineering Practice, vol. 118, p. 104458, 2022.
  • [15] S. Fayaz, S. A. Parah, and G. Qureshi, “Underwater object detection: architectures and algorithms–a comprehensive review,” Multimedia Tools and Applications, vol. 81, no. 15, pp. 20 871–20 916, 2022.
  • [16] M. Jian, N. Yang, C. Tao, H. Zhi, and H. Luo, “Underwater object detection and datasets: a survey,” Intelligent Marine Technology and Systems, vol. 2, no. 1, p. 9, 2024.
  • [17] P. Viola and M. Jones, “Rapid object detection using a boosted cascade of simple features,” in Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001, vol. 1.   IEEE, 2001, pp. I–I.
  • [18] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), vol. 1.   IEEE, 2005, pp. 886–893.
  • [19] P. Felzenszwalb, D. McAllester, and D. Ramanan, “A discriminatively trained, multiscale, deformable part model,” in 2008 IEEE Conference on Computer Vision and Pattern Recognition.   IEEE, 2008, pp. 1–8.
  • [20] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 580–587.
  • [21] J. Dai, Y. Li, K. He, and J. Sun, “R-fcn: Object detection via region-based fully convolutional networks,” Advances in Neural Information Processing Systems, vol. 29, 2016.
  • [22] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, “Feature pyramid networks for object detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2117–2125.
  • [23] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2980–2988.
  • [24] H. Law and J. Deng, “Cornernet: Detecting objects as paired keypoints,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 734–750.
  • [25] K. Duan, S. Bai, L. Xie, H. Qi, Q. Huang, and Q. Tian, “Centernet: Keypoint triplets for object detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 6569–6578.
  • [26] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-end object detection with transformers,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part I 16.   Springer, 2020, pp. 213–229.
  • [27] X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai, “Deformable detr: Deformable transformers for end-to-end object detection,” arXiv preprint arXiv:2010.04159, 2020.
  • [28] D. Kim, D. Lee, H. Myung, and H.-T. Choi, “Artificial landmark-based underwater localization for auvs using weighted template matching,” Intelligent Service Robotics, vol. 7, no. 3, pp. 175–184, 2014.
  • [29] N. Strachan, “Recognition of fish species by colour and shape,” Image and Vision Computing, vol. 11, no. 1, pp. 2–10, 1993.
  • [30] M. S. A. C. Marcos, M. N. Soriano, and C. A. Saloma, “Classification of coral reef images from underwater video using neural networks,” Optics Express, vol. 13, no. 22, pp. 8766–8771, 2005.
  • [31] C. Barat and M.-J. Rendas, “A robust visual attention system for detecting manufactured objects in underwater video,” in OCEANS 2006.   IEEE, 2006, pp. 1–6.
  • [32] A. L. Chew, P. B. Tong, and C. S. Chia, “Automatic detection and classification of man-made targets in side scan sonar images,” in 2007 Symposium on Underwater Technology and Workshop on Scientific Use of Submarine Cables and Related Technologies.   IEEE, 2007, pp. 126–132.
  • [33] C. Spampinato, Y.-H. Chen-Burger, G. Nadarajan, and R. B. Fisher, “Detecting, tracking and counting fish in low quality unconstrained underwater videos.” VISAPP (2), vol. 2008, no. 514-519, p. 1, 2008.
  • [34] O. Beijbom, P. J. Edmunds, D. I. Kline, B. G. Mitchell, and D. Kriegman, “Automated annotation of coral reef survey images,” in 2012 IEEE Conference on Computer Vision and Pattern Recognition.   IEEE, 2012, pp. 1170–1177.
  • [35] C. Spampinato, S. Palazzo, D. Giordano, I. Kavasidis, F.-P. Lin, and Y.-T. Lin, “Covariance based fish tracking in real-life underwater environment.” in VISAPP (2), 2012, pp. 409–414.
  • [36] E. Galceran, V. Djapic, M. Carreras, and D. P. Williams, “A real-time underwater object detection algorithm for multi-beam forward looking sonar,” IFAC Proceedings Volumes, vol. 45, no. 5, pp. 306–311, 2012.
  • [37] M. Li, H. Ji, X. Wang, L. Weng, and Z. Gong, “Underwater object detection and tracking based on multi-beam sonar image processing,” in 2013 IEEE International Conference on Robotics and Biomimetics (ROBIO).   IEEE, 2013, pp. 1071–1076.
  • [38] D.-J. Lee, R. B. Schoenberger, D. Shiozawa, X. Xu, and P. Zhan, “Contour matching for a fish recognition and migration-monitoring system,” in Two-and Three-Dimensional Vision Systems for Inspection, Control, and Metrology II, vol. 5606.   International Society for Optics and Photonics, 2004, pp. 37–48.
  • [39] M. Ravanbakhsh, M. R. Shortis, F. Shafait, A. Mian, E. S. Harvey, and J. W. Seager, “Automated fish detection in underwater images using shape-based level sets,” The Photogrammetric Record, vol. 30, no. 149, pp. 46–62, 2015.
  • [40] H. Cho, J. Gu, H. Joe, A. Asada, and S.-C. Yu, “Acoustic beam profile-based rapid underwater object detection for an imaging sonar,” Journal of Marine Science and Technology, vol. 20, pp. 180–197, 2015.
  • [41] G.-J. Hou, X. Luan, D.-L. Song, and X.-Y. Ma, “Underwater man-made object recognition on the basis of color and shape features,” Journal of Coastal Research, vol. 32, no. 5, pp. 1135–1141, 2016.
  • [42] M.-C. Chuang, J.-N. Hwang, and K. Williams, “A feature learning and object recognition framework for underwater fish images,” IEEE Transactions on Image Processing, vol. 25, no. 4, pp. 1862–1872, 2016.
  • [43] S. Villon, M. Chaumont, G. Subsol, S. Villéger, T. Claverie, and D. Mouillot, “Coral reef fish detection and recognition in underwater videos by supervised machine learning: Comparison between deep learning and hog+ svm methods,” in International Conference on Advanced Concepts for Intelligent Vision Systems.   Springer, 2016, pp. 160–171.
  • [44] H. Liu, J. Dai, R. Wang, H. Zheng, and B. Zheng, “Combining background subtraction and three-frame difference to detect moving object from underwater video,” in OCEANS 2016-Shanghai.   IEEE, 2016, pp. 1–5.
  • [45] Z. Chen, Z. Zhang, F. Dai, Y. Bu, and H. Wang, “Monocular vision-based underwater object detection,” Sensors, vol. 17, no. 8, p. 1784, 2017.
  • [46] B. Kim and S.-C. Yu, “Imaging sonar based real-time underwater object detection utilizing adaboost method,” in 2017 IEEE Underwater Technology (UT).   IEEE, 2017, pp. 1–5.
  • [47] S. Vasamsetti, S. Setia, N. Mittal, H. K. Sardana, and G. Babbar, “Automatic underwater moving object detection using multi-feature integration framework in complex backgrounds,” IET Computer Vision, vol. 12, no. 6, pp. 770–778, 2018.
  • [48] Y. Yu, J. Zhao, Q. Gong, C. Huang, G. Zheng, and J. Ma, “Real-time underwater maritime object detection in side-scan sonar images based on transformer-yolov5,” Remote Sensing, vol. 13, no. 18, p. 3555, 2021.
  • [49] C. Barngrover, A. Althoff, P. DeGuzman, and R. Kastner, “A brain–computer interface (bci) for the detection of mine-like objects in sidescan sonar imagery,” IEEE Journal of Oceanic Engineering, vol. 41, no. 1, pp. 123–138, 2015.
  • [50] H. Zhang, M. Tian, G. Shao, J. Cheng, and J. Liu, “Target detection of forward-looking sonar image based on improved yolov5,” IEEE Access, vol. 10, pp. 18 023–18 034, 2022.
  • [51] T. Zhou, J. Si, L. Wang, C. Xu, and X. Yu, “Automatic detection of underwater small targets using forward-looking sonar images,” IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–12, 2022.
  • [52] M. P. Hayes and P. T. Gough, “Broad-band synthetic aperture sonar,” IEEE Journal of Oceanic Engineering, vol. 17, no. 1, pp. 80–94, 1992.
  • [53] K. Blanc, D. Lingrand, and F. Precioso, “Fish species recognition from video using svm classifier,” in Proceedings of the 3rd ACM International Workshop on Multimedia Analysis for Ecological Data, 2014, pp. 1–6.
  • [54] H. Bay, T. Tuytelaars, and L. Van Gool, “Surf: Speeded up robust features,” Lecture Notes in Computer Science, vol. 3951, pp. 404–417, 2006.
  • [55] X. Li, M. Shang, H. Qin, and L. Chen, “Fast accurate fish detection and recognition of underwater images with fast r-cnn,” in OCEANS 2015-MTS/IEEE Washington.   IEEE, 2015, pp. 1–5.
  • [56] F. Shkurti, W.-D. Chang, P. Henderson, M. J. Islam, J. C. G. Higuera, J. Li, T. Manderson, A. Xu, G. Dudek, and J. Sattar, “Underwater multi-robot convoying using visual tracking by detection,” in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2017, pp. 4189–4196.
  • [57] X. Li, Y. Tang, and T. Gao, “Deep but lightweight neural networks for fish detection,” in OCEANS 2017-Aberdeen.   IEEE, 2017, pp. 1–5.
  • [58] L. Ji-Yong, Z. Hao, H. Hai, Y. Xu, W. Zhaoliang, and W. Lei, “Design and vision based autonomous capture of sea organism with absorptive type remotely operated vehicle,” IEEE Access, vol. 6, pp. 73 871–73 884, 2018.
  • [59] L. Zhang, X. Yang, Z. Liu, L. Qi, H. Zhou, and C. Chiu, “Single shot feature aggregation network for underwater object detection,” in 2018 24th International Conference on Pattern Recognition (ICPR).   IEEE, 2018, pp. 1906–1911.
  • [60] S. Lee, B. Park, and A. Kim, “Deep learning from shallow dives: Sonar image generation and training for underwater object detection,” arXiv preprint arXiv:1810.07990, 2018.
  • [61] M. Pedersen, J. Bruslund Haurum, R. Gade, and T. B. Moeslund, “Detection of marine animals in a new underwater dataset with varying visibility,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2019, pp. 18–26.
  • [62] L. Chen, Z. Liu, L. Tong, Z. Jiang, S. Wang, J. Dong, and H. Zhou, “Underwater object detection using invert multi-class adaboost with deep learning,” in 2020 International Joint Conference on Neural Networks, IJCNN 2020.   Institute of Electrical and Electronics Engineers (IEEE), 2020.
  • [63] J. Zhang, L. Zhu, L. Xu, and Q. Xie, “Research on the correlation between image enhancement and underwater object detection,” in 2020 Chinese Automation Congress (CAC).   IEEE, 2020, pp. 5928–5933.
  • [64] W. Chen and B. Fan, “Underwater object detection with mixed attention mechanism and multi-enhancement strategy,” in 2020 Chinese Automation Congress (CAC).   IEEE, 2020, pp. 2821–2826.
  • [65] L. Wang, X. Ye, H. Xing, Z. Wang, and P. Li, “Yolo nano underwater: A fast and compact object detector for embedded device,” in Global Oceans 2020: Singapore–US Gulf Coast.   IEEE, 2020, pp. 1–4.
  • [66] H. Liu, P. Song, and R. Ding, “Towards domain generalization in underwater object detection,” in 2020 IEEE International Conference on Image Processing (ICIP).   IEEE, 2020, pp. 1971–1975.
  • [67] B. Fan, W. Chen, Y. Cong, and J. Tian, “Dual refinement underwater object detection network,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XX 16.   Springer, 2020, pp. 275–291.
  • [68] J. Zhang, L. Zhu, L. Xu, and Q. Xie, “Mffssd: An enhanced ssd for underwater object detection,” in 2020 Chinese Automation Congress (CAC).   IEEE, 2020, pp. 5938–5943.
  • [69] W.-H. Lin, J.-X. Zhong, S. Liu, T. Li, and G. Li, “Roimix: Proposal-fusion among multiple images for underwater object detection,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2020, pp. 2588–2592.
  • [70] D. Karimanzira, H. Renkewitz, D. Shea, and J. Albiez, “Object detection in sonar images,” Electronics, vol. 9, no. 7, p. 1180, 2020.
  • [71] M. Sung, J. Kim, M. Lee, B. Kim, T. Kim, J. Kim, and S.-C. Yu, “Realistic sonar image simulation using deep learning for underwater object detection,” International Journal of Control, Automation and Systems, vol. 18, no. 3, pp. 523–534, 2020.
  • [72] H. Yang, P. Liu, Y. Hu, and J. Fu, “Research on underwater object recognition based on yolov3,” Microsystem Technologies, vol. 27, pp. 1837–1844, 2021.
  • [73] T.-S. Pan, H.-C. Huang, J.-C. Lee, and C.-H. Chen, “Multi-scale resnet for real-time underwater object detection,” Signal, Image and Video Processing, vol. 15, pp. 941–949, 2021.
  • [74] L. Jiang, Y. Wang, Q. Jia, S. Xu, Y. Liu, X. Fan, H. Li, R. Liu, X. Xue, and R. Wang, “Underwater species detection using channel sharpening attention,” in Proceedings of the 29th ACM International Conference on Multimedia, 2021, pp. 4259–4267.
  • [75] C.-H. Yeh, C.-H. Lin, L.-W. Kang, C.-H. Huang, M.-H. Lin, C.-Y. Chang, and C.-C. Wang, “Lightweight deep neural network for joint learning of underwater object detection and color conversion,” IEEE Transactions on Neural Networks and Learning Systems, vol. 33, no. 11, pp. 6129–6143, 2021.
  • [76] L. Chen, F. Zhou, S. Wang, J. Dong, N. Li, H. Ma, X. Wang, and H. Zhou, “Swipenet: Object detection in noisy underwater scenes,” Pattern Recognition, vol. 132, p. 108926, 2022.
  • [77] D. N. V. Alla, V. B. N. Jyothi, H. Venkataraman, and G. Ramadass, “Vision-based deep learning algorithm for underwater object detection and tracking,” in OCEANS 2022-Chennai.   IEEE, 2022, pp. 1–6.
  • [78] L. Zhang, C. Li, and H. Sun, “Object detection/tracking toward underwater photographs by remotely operated vehicles (rovs),” Future Generation Computer Systems, vol. 126, pp. 163–168, 2022.
  • [79] S. Cai, G. Li, and Y. Shan, “Underwater object detection using collaborative weakly supervision,” Computers and Electrical Engineering, vol. 102, p. 108159, 2022.
  • [80] J. Jia, M. Fu, X. Liu, and B. Zheng, “Underwater object detection based on improved efficientdet,” Remote Sensing, vol. 14, no. 18, p. 4487, 2022.
  • [81] Y. Wang, J. Guo, and W. He, “Underwater object detection aided by image reconstruction,” in 2022 IEEE 24th International Workshop on Multimedia Signal Processing (MMSP).   IEEE, 2022, pp. 1–6.
  • [82] X. Liang and P. Song, “Excavating roi attention for underwater object detection,” in 2022 IEEE International Conference on Image Processing (ICIP).   IEEE, 2022, pp. 2651–2655.
  • [83] G. Chen, Z. Mao, K. Wang, and J. Shen, “Htdet: A hybrid transformer-based approach for underwater small object detection,” Remote Sensing, vol. 15, no. 4, p. 1076, 2023.
  • [84] P. Song, P. Li, L. Dai, T. Wang, and Z. Chen, “Boosting r-cnn: Reweighting r-cnn samples by rpn’s error for underwater object detection,” Neurocomputing, 2023.
  • [85] L. Dai, H. Liu, P. Song, H. Tang, R. Ding, and S. Li, “Edge-guided representation learning for underwater object detection,” CAAI Transactions on Intelligence Technology, 2023.
  • [86] C. Fu, X. Fan, J. Xiao, W. Yuan, R. Liu, and Z. Luo, “Learning heavily-degraded prior for underwater object detection,” IEEE Transactions on Circuits and Systems for Video Technology, 2023.
  • [87] J. Zhou, Z. He, K.-M. Lam, Y. Wang, W. Zhang, C. Guo, and C. Li, “Amsp-uod: When vortex convolution and stochastic perturbation meet underwater object detection,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 7, 2024, pp. 7659–7667.
  • [88] A. Guo, K. Sun, and Z. Zhang, “A lightweight yolov8 integrating fasternet for real-time underwater object detection,” Journal of Real-Time Image Processing, vol. 21, no. 2, pp. 1–15, 2024.
  • [89] J. Gao, Y. Zhang, X. Geng, H. Tang, and U. A. Bhatti, “Pe-transformer: Path enhanced transformer for improving underwater object detection,” Expert Systems with Applications, vol. 246, p. 123253, 2024.
  • [90] L. Ge, P. Singh, and A. Sadhu, “Advanced deep learning framework for underwater object detection with multibeam forward-looking sonar,” Structural Health Monitoring, p. 14759217241235637, 2024.
  • [91] L. Dai, H. Liu, P. Song, and M. Liu, “A gated cross-domain collaborative network for underwater object detection,” Pattern Recognition, vol. 149, p. 110222, 2024.
  • [92] B. Wang, Z. Wang, W. Guo, and Y. Wang, “A dual-branch joint learning network for underwater object detection,” Knowledge-Based Systems, vol. 293, p. 111672, 2024.
  • [93] R. Girshick, “Fast r-cnn,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1440–1448.
  • [94] B. Gašparović, J. Lerga, G. Mauša, and M. Ivašić-Kos, “Deep learning approach for objects detection in underwater pipeline images,” Applied Artificial Intelligence, vol. 36, no. 1, p. 2146853, 2022.
  • [95] J. Redmon and A. Farhadi, “Yolov3: An incremental improvement,” arXiv preprint arXiv:1804.02767, 2018.
  • [96] B. J. Boom, P. X. Huang, J. He, and R. B. Fisher, “Supporting ground-truth annotation of image datasets using clustering,” in Proceedings of the 21st International Conference on Pattern Recognition (ICPR2012).   IEEE, 2012, pp. 1542–1545.
  • [97] C. Beyan and R. B. Fisher, “Detecting abnormal fish trajectories using clustered and labeled data,” in 2013 IEEE International Conference on Image Processing.   IEEE, 2013, pp. 1476–1480.
  • [98] I. Kavasidis, S. Palazzo, R. D. Salvo, D. Giordano, and C. Spampinato, “An innovative web-based collaborative platform for video annotation,” Multimedia Tools and Applications, vol. 70, pp. 413–432, 2014.
  • [99] C. Liu, Z. Wang, S. Wang, T. Tang, Y. Tao, C. Yang, H. Li, X. Liu, and X. Fan, “A new dataset, poisson gan and aquanet for underwater object grabbing,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 32, no. 5, pp. 2831–2844, 2021.
  • [100] C. Fabbri, M. J. Islam, and J. Sattar, “Enhancing underwater imagery using generative adversarial networks,” in 2018 IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2018, pp. 7159–7165.
  • [101] X. Chen, J. Yu, S. Kong, Z. Wu, X. Fang, and L. Wen, “Towards real-time advancement of underwater visual quality with gan,” IEEE Transactions on Industrial Electronics, vol. 66, no. 12, pp. 9350–9359, 2019.
  • [102] R. Liu, X. Fan, M. Zhu, M. Hou, and Z. Luo, “Real-world underwater enhancement: Challenges, benchmarks, and solutions under natural light,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 30, no. 12, pp. 4861–4875, 2020.
  • [103] L. Chen, Z. Jiang, L. Tong, Z. Liu, A. Zhao, Q. Zhang, J. Dong, and H. Zhou, “Perceptual underwater image enhancement with deep learning and physical priors,” IEEE Transactions on Circuits and Systems for Video Technology, 2020.
  • [104] R. Liu, Z. Jiang, S. Yang, and X. Fan, “Twin adversarial contrastive learning for underwater image enhancement and beyond,” IEEE Transactions on Image Processing, vol. 31, pp. 4922–4936, 2022.
  • [105] Y. Dai, F. Gieseke, S. Oehmcke, Y. Wu, and K. Barnard, “Attentional feature fusion,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2021, pp. 3560–3569.
  • [106] M. Zhang, S. Xu, W. Song, Q. He, and Q. Wei, “Lightweight underwater object detection based on yolo v4 and multi-scale attentional feature fusion,” Remote Sensing, vol. 13, no. 22, p. 4706, 2021.
  • [107] B. Singh and L. S. Davis, “An analysis of scale invariance in object detection snip,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3578–3587.
  • [108] M. Najibi, B. Singh, and L. S. Davis, “Autofocus: Efficient multi-scale inference,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 9745–9755.
  • [109] X. Yang, J. Yang, J. Yan, Y. Zhang, T. Zhang, Z. Guo, X. Sun, and K. Fu, “Scrdet: Towards more robust detection for small, cluttered and rotated objects,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 8232–8241.
  • [110] T. Gao, Q. Niu, J. Zhang, T. Chen, S. Mei, and A. Jubair, “Global to local: A scale-aware network for remote sensing object detection,” IEEE Transactions on Geoscience and Remote Sensing, 2023.
  • [111] J. Zhang, J. Lei, W. Xie, Z. Fang, Y. Li, and Q. Du, “Superyolo: Super resolution assisted object detection in multimodal remote sensing imagery,” IEEE Transactions on Geoscience and Remote Sensing, vol. 61, pp. 1–15, 2023.
  • [112] J. Rabbi, N. Ray, M. Schubert, S. Chowdhury, and D. Chao, “Small-object detection in remote sensing images with end-to-end edge-enhanced gan and object detector network,” Remote Sensing, vol. 12, no. 9, p. 1432, 2020.
  • [113] S. M. A. Bashir and Y. Wang, “Small object detection in remote sensing images with residual feature aggregation-based super-resolution and object detector network,” Remote Sensing, vol. 13, no. 9, p. 1854, 2021.
  • [114] H. Song, M. Kim, D. Park, Y. Shin, and J.-G. Lee, “Learning from noisy labels with deep neural networks: A survey,” IEEE Transactions on Neural Networks and Learning Systems, 2022.
  • [115] H. Song, M. Kim, and J.-G. Lee, “Selfie: Refurbishing unclean samples for robust deep learning,” in International Conference on Machine Learning.   PMLR, 2019, pp. 5907–5915.
  • [116] Y. Wang, W. Liu, X. Ma, J. Bailey, H. Zha, L. Song, and S.-T. Xia, “Iterative learning with open-set noisy labels,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8688–8696.
  • [117] D. Patel and P. Sastry, “Adaptive sample selection for robust learning under label noise,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2023, pp. 3932–3942.
  • [118] X. Wang, Y. Hua, E. Kodirov, D. A. Clifton, and N. M. Robertson, “Proselflc: Progressive self label correction for training robust deep neural networks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 752–761.
  • [119] P. Ma, Z. Liu, J. Zheng, L. Wang, and Q. Ma, “Ctw: confident time-warping for time-series label-noise learning,” in Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, 2023, pp. 4046–4054.
  • [120] H. Wei, L. Feng, X. Chen, and B. An, “Combating noisy labels by agreement: A joint training method with co-regularization,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 13 726–13 735.
  • [121] Q. Yao, H. Yang, B. Han, G. Niu, and J. T.-Y. Kwok, “Searching to exploit memorization effect in learning with noisy labels,” in International Conference on Machine Learning.   PMLR, 2020, pp. 10 789–10 798.
  • [122] Y. Li, J. Yang, Y. Song, L. Cao, J. Luo, and L.-J. Li, “Learning from noisy labels with distillation,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 1910–1918.
  • [123] Z.-F. Wu, T. Wei, J. Jiang, C. Mao, M. Tang, and Y.-F. Li, “Ngc: A unified framework for learning with open-world noisy data,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 62–71.
  • [124] A. Ghosh, H. Kumar, and P. S. Sastry, “Robust loss functions under label noise for deep neural networks,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 31, no. 1, 2017.
  • [125] Z. Zhang and M. Sabuncu, “Generalized cross entropy loss for training deep neural networks with noisy labels,” Advances in Neural Information Processing Systems, vol. 31, 2018.
  • [126] X. Zhou, X. Liu, D. Zhai, J. Jiang, and X. Ji, “Asymmetric loss functions for noise-tolerant learning: Theory and applications,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
  • [127] N. Chang, Z. Yu, Y.-X. Wang, A. Anandkumar, S. Fidler, and J. M. Alvarez, “Image-level or object-level? a tale of two resampling strategies for long-tailed detection,” in International Conference on Machine Learning.   PMLR, 2021, pp. 1463–1472.
  • [128] C. Feng, Y. Zhong, and W. Huang, “Exploring classification equilibrium in long-tailed object detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 3417–3426.
  • [129] M. Ren, W. Zeng, B. Yang, and R. Urtasun, “Learning to reweight examples for robust deep learning,” in International Conference on Machine Learning.   PMLR, 2018, pp. 4334–4343.
  • [130] D. Mahajan, R. Girshick, V. Ramanathan, K. He, M. Paluri, Y. Li, A. Bharambe, and L. Van Der Maaten, “Exploring the limits of weakly supervised pretraining,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 181–196.
  • [131] Y. Cui, M. Jia, T.-Y. Lin, Y. Song, and S. Belongie, “Class-balanced loss based on effective number of samples,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 9268–9277.
  • [132] Y. Li, T. Wang, B. Kang, S. Tang, C. Wang, J. Li, and J. Feng, “Overcoming classifier imbalance for long-tail object detection with balanced group softmax,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 10 991–11 000.
  • [133] T. Qi, H. Xie, P. Li, J. Ge, and Y. Zhang, “Balanced classification: A unified framework for long-tailed object detection,” IEEE Transactions on Multimedia, 2023.
  • [134] R. B. Fisher, Y.-H. Chen-Burger, D. Giordano, L. Hardman, F.-P. Lin et al., Fish4Knowledge: collecting and analyzing massive coral reef fish video data.   Springer, 2016, vol. 104.
  • [135] J. Wang, K. Chen, S. Yang, C. C. Loy, and D. Lin, “Region proposal by guided anchoring,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 2965–2974.
  • [136] Z. Cai and N. Vasconcelos, “Cascade r-cnn: High quality object detection and instance segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 5, pp. 1483–1498, 2019.
  • [137] T. Kong, F. Sun, H. Liu, Y. Jiang, L. Li, and J. Shi, “Foveabox: Beyound anchor-based object detection,” IEEE Transactions on Image Processing, vol. 29, pp. 7389–7398, 2020.
  • [138] S. Zhang, C. Chi, Y. Yao, Z. Lei, and S. Z. Li, “Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 9759–9768.
  • [139] S. Qiao, L.-C. Chen, and A. Yuille, “Detectors: Detecting objects with recursive feature pyramid and switchable atrous convolution,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 10 213–10 224.
  • [140] X. Lu, B. Li, Y. Yue, Q. Li, and J. Yan, “Grid r-cnn,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 7363–7372.
  • [141] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13.   Springer, 2014, pp. 740–755.