DARE: AI-based Diver Action Recognition System using Multi-Channel CNNs for AUV Supervision

Jing Yang21, James P. Wilson2 and Shalabh Gupta2
2 Department of Electrical and Computer Engineering, University of Connecticut, Storrs, CT.1 Corresponding author: Shalabh Gupta (email: [email protected])

Abstract

With the growth of sensing, control and robotic technologies, autonomous underwater vehicles (AUVs) have become useful assistants to human divers for performing various underwater operations. In the current practice, the divers are required to carry expensive, bulky, and waterproof keyboards or joystick-based controllers for supervision and control of AUVs. Therefore, diver action-based supervision is becoming increasingly popular because it is convenient, easier to use, faster, and cost effective. However, the various environmental, diver and sensing uncertainties present underwater makes it challenging to train a robust and reliable diver action recognition system. In this regard, this paper presents DARE, a diver action recognition system, that is trained based on Cognitive Autonomous Driving Buddy (CADDY) dataset, which is a rich set of data containing images of different diver gestures and poses in several different and realistic underwater environments. DARE is based on fusion of stereo-pairs of camera images using a multi-channel convolutional neural network supported with a systematically trained tree-topological deep neural network classifier to enhance the classification performance. DARE is fast and requires only a few milliseconds to classify one stereo-pair, thus making it suitable for real-time underwater implementation. DARE is comparatively evaluated against several existing classifier architectures and the results show that DARE supersedes the performance of all classifiers for diver action recognition in terms of overall as well as individual class accuracies and F1-scores.

Index Terms:

Diver gesture recognition, Transfer learning, Multi channel convolutional neural networks, Human-robot interaction, Autonomous underwater vehicles

I Introduction

Underwater robots (e.g., autonomous underwater vehicles (AUVs)) have become vital assistants to human operators for a variety of tasks [1] including search and exploration [2, 3, 4], 3-D seafloor mapping[5, 6], ocean resource analysis[7], marine data collection[8], target tracking [9, 10, 11], ocean demining [12, 13], oil spill cleaning [14, 15] and underwater structural inspection and repairing[16]. Since underwater environments can be hazardous, human divers can utilize robots for performing heavy or risky tasks, such as lifting heavy parts [17]. On the other hand, underwater robots can also perform tasks which require precision and delicacy, such as gathering light-weight biological samples[18]. Therefore, it is becoming increasingly important to develop tools and methods that facilitate cost-effective, fast, and reliable communication between the divers and their robot counterparts for safe and rapid completion of underwater tasks.

I-A Motivation

Since on-demand reprogramming of the AUVs is difficult in dynamic and uncertain underwater environments, they require constant supervision from the diver to perform tasks. Traditional methods of communicating and controlling the AUVs utilize waterproof tablets, keyboards, mice, or joysticks directly connected to the AUVs. These interfaces, however: 1) are expensive to waterproof and deploy in underwater environments, 2) require the diver to be close to the AUV, and 3) are cumbersome and unwieldy to operate. Therefore, due to lack of effective technologies of underwater radio and wireless communication, one way for the diver to send commands to an AUV is by means of hand gestures [19]. In response, the AUV can utilize AI to recognize the diver’s hand gesture and interpret the corresponding command, as seen in Fig. 1. Sometimes the AUVs are also required to follow the diver by recognizing diver’s posture. It is desired that such diver action recognition system implemented on an AUV should be fast to act in time-critical situations, such as changing environments, risky diver locations, and limited oxygen supply.

Refer to caption — Figure 1: Underwater diver AUV interaction

I-B Challenges

There exist several challenges in building a robust, reliable and real-time underwater diver action recognition system, as highlighted in Fig. 2. These challenges are discussed below.

I-B1 Environmental Uncertainties

The two main challenges in underwater environments are due to: a) the water appearance and b) the background. Water appearance can adversely affect the image quality due to varying levels of water clarity, color, and brightness. This results in a large combination of contrasting environments that complicate the diver action recognition problem; thus requiring different image enhancement and filtering techniques to recognize diver actions. While the diver action recognition system might need minimal pre-processing in clear, colorless, and bright water environments, it might require complex nonlinear filtering and feature extraction in murky, brown, and dark water environments.

The difficulties of an underwater diver action recognition system are further exacerbated by the presence of background complexities. These are of three main categories: terrain and rock formations, marine vegetation and wildlife, and artificial constructs. Thus, it is necessary to filter out these background effects for efficient and reliable diver action recognition.

Therefore, it is critically important that an underwater diver action recognition system must be robust to operating in diverse and complex underwater environments.

I-B2 Diver Uncertainties

There are several uncertainties associated with the diver due to sporadic movements and ambiguous actions. The diver’s position and motion is usually uncertain in an underwater environment. First, the size of the diver varies depending on the distance from the AUV. If the diver is too close to the AUV, key parts of his body might not be clear within the camera’s field of view. On the other hand, if the diver is too far from the AUV, the camera might not capture enough details to recognize his actions. Second, it is impossible for the diver to remain stationary at all times because of the underwater current. Specifically, the diver’s orientation changes with respect to the direction of the current. The diver’s position can also be anywhere in the image such as at the bottom left, top middle, or center right. Finally, the diver is moving when performing a particular pose such as turning horizontally, thus making the problem more difficult.

Similarly, the position and motion of the diver’s hand is crucial for gesture recognition. Similar to the diver size and diver orientation, hand size and hand position can affect the recognition of hand gesture. However, unlike estimating the diver’s body pose, the hand is a much smaller target, which requires the recognition model to extract and process particular regions of interest in the image.

Furthermore, the problem is exacerbated by the presence of bulky equipment (e.g., oxygen tank) carried by the divers. In addition, the exhalation of air bubbles into the underwater environment adds considerable disturbance in the images. Other issue arises, when multiple divers appear in front of the AUV camera system at the same time; thus making it difficult to identify the diver issuing the commands.

Therefore, it is vital for the underwater diver action recognition system to distinguish between different body poses and hand gestures in the presence of various diver uncertainties while the diver performs complex maneuvers at various positions and orientations relative to the AUV.

I-B3 Sensing Uncertainties

The underwater camera system is susceptible to noise and biases, which might adversely affect the clarity and color accuracy of images. In addition, various objects in the environment can interfere with the camera’s vision. For example, floating underwater debris near the lens, particles scattered throughout the environment, and reflective underwater surfaces can result in viewing angle obstruction, difficulty focusing the camera lens, and lens flare. These make it difficult to determine if the image distortion is due to the environment, diver motion, or hardware issues.

Thus, it is important that the underwater diver action recognition system is robust to these sensing uncertainties.

I-B4 Fusion of Stereo Camera Images

The training dataset used in this paper consists of images that are captured using a stereo camera system, which contains one camera with two separate lenses abreast in order to implicitly preserve the distance information. This system uses the left-lens and the right-lens to capture two images of a diver action at the same time from slightly different angles. If only the left or the right lens image is used to classify the diver action, then it might exclude the crucial distance information and thus it can degrade the classifier performance. On the other hand, treating left and right-lens images as individual uncorrelated images in the same training dataset will cause the model to over-fit since these images correspond to the same diver action. Thus, the left and right lens images should be fused together at the feature level prior to classification. Fusion, however, is difficult because of stereo correlation and subtle spatial differences.

Therefore, the underwater diver action recognition system should be able to fuse the left and right lens information cohesively to boost the classification accuracy.

I-B5 Computational Efficiency and Reliability

While performing tasks in hazardous underwater environments, there may be several time-critical situations where commands must be carried out by the AUV urgently to ensure the safety of the diver and/or the AUV, and to accommodate the changing environment. However, it might be difficult to interpret commands fast because of the high-resolution RGB stereo imagery captured by the camera system and limited computational resources on the AUV. Furthermore, many frames may need to be processed in rapid succession to verify a sequence of commands to be executed. Additionally, it is required that the diver actions are recognized with high accuracy to ensure that the AUV operates reliably in time-critical situations.

Therefore, the on-board diver action recognition system needs to be simple for ease of implementation, accurate for robust and reliable performance, and computationally efficient for classification of diver actions in real-time.

I-C Related work

Remotely operated vehicles (ROVs) are common underwater robots which are directly controlled by a human operator from the surface via cable connections. These vehicles, however, are limited in their deployment and autonomy due to expensive equipment, short range, unwieldy cables, and complex operation [20]. On the other hand, autonomous underwater vehicles (AUVs) are cable-free, equipped with advanced sensing and control technologies [21], provide longer range autonomy and are becoming increasingly more intelligent and cost effective. Thus, they have become useful to serve as advanced assistants for a variety of underwater tasks. However, despite the advances in AUV technologies, several underwater tasks require efficient diver-AUV collaboration. The key requirement for diver-AUV collaboration is the capability to dynamically reprogram the AUV’s task parameters. In underwater environments, radio and wireless communications between the diver and AUVs[22] are not efficient. Traditionally, divers used tactile devices to send commands using waterproof tablets, keyboards, mice and joysticks; however, these interfaces are expensive and inconvenient [19]. Therefore, the underwater diver action recognition problem has become an increasingly important area of research to enable efficient underwater diver-AUV interaction.

There have been some initial attempts to address the diver action recognition problem, which requires: 1) the creation of a database of diver actions in diverse and realistic underwater environments, 2) the use of cost-effective but information-rich underwater sensors to capture diver motions, and 3) a computationally efficient and reliable diver action recognition model. Initial studies focused on the classification of diver actions were based on images captured in ideal swimming pool environments [23] with controllable lighting and background conditions. These were later extended to include diver actions in more realistic underwater ocean environments [24]. In both cases, images were captured using a single monocular RGB camera. While this sensing solution is cost-effective, this system has several limitations including blind spots from a single point-of-view, low image resolution, and limited depth information which can make the system unreliable.

The diver pose recognition problem is also crucial for diver tracking application [25]. Traditionally, acoustic sensors are used to locate the diver’s position relative to the water surface for autonomous surface vehicles; however, such sensors cannot identify specific diver poses (e.g., orientations and motion types). Similarly, information-rich sonar, ultra-short baseline acoustic localization, and stereo-cameras have also been used for AUV diver tracking. These sensors ensure the AUV closely follows the diver underwater; however, diver action recognition models were not developed to identify specific diver poses present in the captured images[26]. At the same time, research has also focused on detecting the number of divers present in the view of the AUV camera in various underwater environments [24, 27]. Researchers have also addressed the diver arm motion recognition problem using wireless acoustic networks in a simulated underwater environment [28]; however, there is no real-world data for validation. Furthermore, these networks might not always be available for implementation in the ocean environments.

Common underwater recognition models utilize time-frequency feature extractors with k-nearest neighbor and SVM classifiers to identify mammals and inanimate objects on the ocean floor [29, 30, 31]. Researchers performed gesture recognition by fusing convex hull method and SVM classifier [32]; however, very few gestures and no poses were considered. As such, the method might not scale well to larger and more complex set of diver actions. On the other hand, deep transfer learning methods have been applied to underwater fish species classification [33], human motion recognition in terrestrial environments [34] and diver gesture recognition[35]. While these methods provide good average performance, high individual class performance is not guaranteed.

However, for the diver action recognition problem it is of critical importance that each diver action is recognized with high accuracy for safe and reliable operation in underwater environments. To the best of our knowledge, the diver action recognition problem is still an active area of research.

I-D Contributions

To address the challenges and limitations discussed above, this paper develops a robust, reliable and computationally efficient Diver Action Recognition System, called DARE, to identify the diver hand gestures and full-body poses in real-time; thus, enabling the AUVs to reliably interpret diver commands. DARE has been trained based on a recently created rich database of diver hand gestures and poses, called Cognitive Autonomous Diving Buddy (CADDY) [36] dataset, which contains data generated in real uncertain ocean environments.

DARE is built upon deep transfer learning architecture for robust classification of a multitude of diver actions under various underwater uncertainties and complex scenarios. First, DARE enables fusion of information available from the left and right images of the stereo camera system on AUVs. For this purpose, the deep transfer learning method is extended to a multi-channel framework that: 1) extracts the relevant features from each stereo image individually using the convolution layers, and 2) optimally fuses these features together using fully connected neural networks.

Second, DARE provides high classification performance of each individual class. Typically, a single monolithic neural network classifier is trained after the convolution layers; however, for problems with large number of classes, such networks cannot guarantee a high minimum performance for each individual class. Therefore, instead of considering all diver actions simultaneously, DARE trained a decision tree of fully-connected neural network classifiers, where each network is tailored to discriminate between groups of diver actions. This ensures high individual class recognition performance at the bottom of the tree, thus yielding superior reliability.

The main contributions of this paper are as follows:

•
Development of DARE, to facilitate computationally efficient, robust, reliable and highly accurate diver action recognition, under various environmental, diver, and sensing uncertainties. It consists of:
- –
  
  a multi-channel deep transfer learning architecture for fusing stereo pairs of camera images and
- –
  
  a hierarchical tree structured classification scheme to yield high individual class recognition performance.
•

Training and testing of DARE using the CADDY dataset consisting of a large number of diver action images in real-life challenging underwater environments.
•

Comparative evaluation with baseline architectures.

TABLE I: CADDY dataset diver hand gestures

Diver Image

Gesture

Code

Diver Image

Gesture

Code

Diver Image

Gesture

Code

\hlxvhvvv

Start

End

\hlxvv

Here

Take a

photo

Four

\hlxvv

Carry

Tessel-

lation

Two

\hlxvv

Down

One

Back-

ward

\hlxvv

Three

Five

Number

delimiter

\hlxvv

Boat

I-E Organization

The rest of the paper is organized as follows: Section II describes how CADDY dataset was created and organized for both gestures and poses. Section III provides the details of the deep transfer learning tree structure approach for classifying the diver gestures and poses. Section IV presents the data analysis setup, compares the performance with other existing techniques, and discusses the results. Finally, Section V presents the conclusions and plan for future work.

II CADDY Dataset Overview

The objective of an underwater diver action recognition system is to enable remote supervision of AUVs to remove the need of cumbersome and expensive waterproof joysticks and keyboards. Such a system should have low complexity for real-time execution once deployed on an AUV. Furthermore, it should be robust to various underwater uncertainties (e.g., environment, diver, and sensor uncertainties) and deliver highly reliable classification performance.

Therefore, to train such a system this paper utilizes the Cognitive Autonomous Diving Buddy (CADDY)[36] dataset, which consists of a rich variety of diver action images collected in real uncertain underwater environments. Specifically, images in the CADDY dataset were captured using the BumbleBee XB3 stereo RGB camera system with a resolution of $640\times 480$ pixels. The stereo camera is similar to human eyes which contains one camera and two lenses side by side; it takes synchronized pictures which together preserve the information like spatial distance, object/diver details and the broader view of the background environment[37]. These images were collected in various indoor swimming pools and in the open seas under complex and diverse environmental conditions. Specifically, the images were collected in three different locations: 1) the open seas of Biograd na Moru, Croatia, 2) an indoor pool in the Brodarski Institute, Croatia, and 3) an outdoor pool in Genova, Italy. Images collected from these distinct locations have diverse conditions such as water color, water clarity, underwater lighting, terrain, diver equipment, diver movement, size of the gestures and their position and orientation. All these factors make the diver gesture and pose recognition problem very challenging.

The CADDY dataset consists of: 1) diver hand gestures to demonstrate specific commands for the AUV under realistic underwater operating conditions and 2) diver poses for the AUV to track and follow the diver.

II-A Diver Gestures

Hand gestures in the CADDY dataset use the CADDIAN gesture-based language[38]. The diver signals a specific hand gesture or performs a sequence of gestures to command the AUV in real-world underwater missions. In order to aid the classifier in distinguishing between different hand gestures, the divers wear gloves with special features that have a different color strip on each finger, a white square on the back of the hand, and a white circle on the front of the hand. Without these features on the gloves, it is not only difficult to distinguish between different hand gestures but also difficult to differentiate between the diver’s body and his hands in dim surroundings even with human eyes.

Table I shows a total of 16 different gestures and the associated commands, such as start, up, down, and take a photo. Overall, the CADDY dataset contains $9239$ stereo pairs, i.e., $9239\times 2=18,478$ images of the diver gestures.

II-B Diver Poses

Pose images were extracted from video sequences showing the diver’s movement. The purpose of classifying diver’s pose is to track diver’s movement and position such that the diver is always facing the camera on the AUV. Table II shows $3$ different poses: i) turning horizontally with chest pointing downward to the floor, ii) turning clockwise or counterclockwise vertically, and iii) free swim. Overall, the dataset contains $12,708\times 2=25,416$ images of the diver poses.

In this paper, DARE is trained using both gesture and pose datasets, thus it is capable of recognizing $20$ different classes consisting of $16$ gestures, $3$ poses, and $1$ true negative (or null) class. The null class contains $7190$ stereo pair which include miscellaneous situations like diver idle, diver missing, gestures missing, improper gestures or the transition between two gestures. Thus, there are a total of $(9239+12,708+7190)\times 2=58,274$ images which are used for training and validation of DARE.

TABLE II: CADDY dataset diver postures

Command

Turning

Free

horizontally

vertically

swim

\hlxvhvv

Diver

Posture

III DARE Architecture

DARE is constructed to enable robust, efficient and reliable underwater communication between the diver and AUV considering various environmental, diver, and sensing uncertainties. Specifically, it is critical that any diver action is accurately recognized in real-time in order to enable safe and rapid completion of tasks in hazardous underwater environments. The DARE architecture is shown in Figure 3. During AUV missions, a live video of the divers is recorded using an on-board stereo camera. Each frame captured in the video is composed of two images, corresponding to the left and right lens of the stereo camera. Deep multi-channel transfer learning architectures are utilized to extract the features from each image individually. These features from each image are then ”flattened” into a single one-dimensional feature vector, which is then systematically fused and classified to identify the diver action using a tree structured classifier consisting of fully-connected neural networks. The tree-structure helps boost the individual class recognition performance.

III-A Multi-Channel Feature Extraction

DARE is built upon the deep transfer learning approach utilizing CNNs in a bi-channel configuration (i.e., one for each image of the stereo pair of the AUV camera).

Multi-channel CNN approaches have been applied in other domains that have multiple high-dimensional input data streams for robust and accurate classification [39, 40]. In multi-channel frameworks, data from each input stream is processed independently before fusion and classification. In the context of stereo camera system in the CADDY dataset, we employ a bi-channel approach to enable feature extraction from two different perspectives, which: i) implicitly capture the distance of the diver from the AUV, and ii) allow for better filtering of noise, biases, and other uncertainties by comparing correlated details between stereo images. Thus, a bi-channel CNN approach enables DARE to take full advantage of stereo images and provide robust classification.

The training of deep CNNs requires a large dataset (on the order of tens of thousands of images per class) in order to adequately train the weights of the network without overfitting. However, the collection of such large amounts of data is often cost-prohibitive and requires expensive computational resources to process. In order to overcome these limitations, we employ the use of pre-trained deep transfer learning models[41]. These models were trained on large-scale and diverse benchmark datasets, and their structure can be transferred to solve a different, smaller, but also relevant problem. In a transfer learning approach, the weights of the pre-trained convolutional layers are directly utilized to extract the features of the input images, and only the fully-connected neural network layers are retrained. This approach therefore only requires minimal data for re-training and thus shorter computation times, while still yielding high classification accuracy.

In this paper, DARE is constructed using three basic pre-trained transfer learning models: ResNet18 [42], AlexNet [43], and VggNet16 [44]. The architectures of these models are summarized in Table III. In general, these models are composed of an input layer, a series of convolutional stages that extract abstract and relevant features from the images, and finally a fully connected neural network that fuses the features and makes a classification decision. The input images from the camera are down-sampled to match the input size requirements of the considered transfer learning model.

Both channels are identical to each other and consist of the same stages as the underlying transfer learning model. These stages typically consist of a convolutional layer with a specified filter size that extracts the features, an activation layer that augments the features via nonlinear transformations to help increase separation, and finally a pooling layer that downsamples the features to ensure they are compact to maintain generality while reducing computation times. An example of the feature extraction process from the AlexNet is shown in Figure 4. The final features of each channel are then flattened and concatenated before being sent to the tree topology classifier for robust information fusion and decision making. For the sake of completeness, a brief overview of the most common CNN layers in each stage of a deep transfer learning model [45] is presented below.

TABLE III: Specifics of transfer learning models

ResNet

AlexNet

VggNet

Network depth

\hlxv

Input layer

size

224

\times

224

227

\times

227

224

\times

224

\hlxv

Convolution

layers

\hlxv

Fully connected

layers

\hlxv

\#

of parameters

(million)

11.7

61.1

138.4

III-A1 Convolutional Layer

Each convolution layer outputs a feature map generated from a set of filters. Without loss of generality, suppose layer $l$ is a convolution layer. Then, the convolution operations in this layer are as follows. The input to this layer is a 3D matrix $\mathbf{Z}_{l-1}$ of feature maps generated from the previous layer $l-1$ with dimensions $N_{l-1}\times N_{l-1}\times C_{l-1}$ , where $N_{l-1}$ is the length and width, and $C_{l-1}$ is the depth (i.e., the total number) of feature maps. Define padding parameter $p_{l}\in\mathbb{Z}^{+}\cup\{0\}$ as the number of zero-element rows or columns to add to each outer edge of each input feature map. This padded set of feature maps is denoted as $\mathbf{Z}^{p_{l}}_{l-1}$ with dimensions $(N_{l-1}+2p_{l})\times(N_{l-1}+2p_{l})\times C_{l-1}$ . The convolution operations are performed on $\mathbf{Z}^{p_{l}}_{l-1}$ . Suppose there are $C_{l}$ filters in layer $l$ and let $k\in\{0,\dots,C_{l}-1\}$ be the filter index. Define filter $k$ as a 3D matrix of weights $\mathbf{W}_{l}^{k}$ with dimensions $V_{l}\times V_{l}\times C_{l-1}$ , where $V_{l}\leq N_{l-1}+2p_{l}$ . Each filter has an associated scalar bias term $b_{l}^{k}$ . Finally, define stride parameter $s_{l}\in\mathbb{Z}^{+}$ , which is the step size of the convolution filter both horizontally and vertically across the maps. Then, the 3D output feature map of layer $l$ is computed as:

\begin{split}\mathbf{Z}_{l}[i][j][k]=b_{l}^{k}+\sum_{x=0}^{V_{l}-1}\sum_{y=0}^{V_{l}-1}\sum_{c=0}^{C_{l-1}-1}&\mathbf{Z}^{p_{l}}_{l-1}[s_{l}i+x][s_{l}j+y][c]\\ &\times\mathbf{W}^{k}_{l}[x][y][c]\\ \end{split}

(1)

where $i,j\in\{0,\dots N_{l}-1\}$ and $N_{l}=(N_{l-1}+2p_{l}-V_{l}+s_{l})/s_{l}$ is the length and width of $\mathbf{Z}_{l}$ . The main benefit of this step is that only a subset of the feature maps are processed with the sliding filter at a time, which reduces the number of weights needed and hence improves computational performance.

III-A2 Activation Layer

Activation functions perform nonlinear transformations to increase feature separation, which helps the network to learn complex patterns and solve nontrivial problems. The pre-trained transfer learning models considered in this paper use the rectified linear unit (ReLU) activation function. This activation function provides faster learning due to its constant gradient, which is crucial in deep learning architectures. Without loss of generality, suppose layer $l$ is a ReLU activation layer. Let $\mathbf{Z}_{l-1}$ be the 3D feature map input from the previous layer. The output feature map after the activation layer is given by:

\mathbf{Z}_{l}[i][j][k]=\max\big{(}0,\;\mathbf{Z}_{l-1}[i][j][k]\big{)}

(2)

The output feature map has the same dimensions as those of the input since activation is performed on each element.

III-A3 Pooling Layer

The pooling layer performs down-sampling by dividing the input feature maps from the previous layer into (non-overlapping) rectangular regions and computing either the maximum or average of each region. This layer does not perform any learning; however, its purpose is to reduce the numbers of parameters and prevent overfitting. In the three transfer learning models considered, max pooling is often performed after the convolution and activation layers. Without loss of generality, suppose layer $l$ is a max pooling layer. Given a 3D feature map $\mathbf{Z}_{l-1}$ of size $N_{l-1}\times N_{l-1}\times C_{l-1}$ from the previous layer, a square pooling region of size $Q_{l}\times Q_{l}$ , and a stride parameter $s_{l}\in\mathbb{Z}^{+}$ , the output feature map is computed as:

\mathbf{Z}_{l}[i][j][k]=\max_{x,y\in\{0,\dots,Q_{l}-1\}}\mathbf{Z}_{l-1}[s_{l}i+x][s_{l}j+y][k]

(3)

where $i,j\in\{0,\dots,N_{l}-1\}$ , $k\in\{0,\dots,C_{l-1}-1\}$ , and $N_{l}=(N_{l-1}-Q_{l}+s_{l})/s_{l}$ is the length and width of $\mathbf{Z}_{l}$ .

III-A4 Flattening Layer

Once feature extraction has been completed for each channel, the feature maps from each channel need to be reorganized and combined into a single flattened feature vector, denoted as the Multi-FM, for processing by the Tree Topology Classifier and the fully-connected networks therein. Suppose there are $L$ layers during the multi-channel feature extraction. Let $\mathbf{Z}_{L}^{1}$ and $\mathbf{Z}_{L}^{2}$ be the feature map outputs of the left and right channels, respectively, both with dimensions $N_{L}\times N_{L}\times C_{L}$ . Each of these 3D feature maps are reshaped into 1D feature vectors, each with $N_{L}^{2}\cdot C_{L}$ elements. Then, the Multi-FM feature vector is created by concatenating the 1D feature vectors from each channel, which contains all $2\cdot N_{L}^{2}\cdot C_{L}$ elements from both $\mathbf{Z}_{L}^{1}$ and $\mathbf{Z}_{L}^{2}$ .

III-B Classification

DARE consists of hierarchical arranged tree-structured fully-connected neural network classifiers which are custom trained for systematically identifying individual classes, as shown in Figure 3. Since CADDY dataset has a large number of classes, a single neural network classifier may not yield satisfactory individual class recognition performances although it may still provide good overall average performance. The classification tree is constructed as follows. Each node of the tree represents a neural network classifier which is trained to separate certain groups of classes. The branches correspond to the outputs of the classifiers and the final decisions of individual classes are predicted at the leaf nodes.

In the classification tree, the class groups separated at each node are formed based on similar categories (e.g., number of hands, hand orientation, number of fingers, etc.) The root node at level $0$ , called RootNet, is trained to classify the three major categories: gestures, poses and true negatives (i.e., images with no, improper or wrong gestures). Subsequently, the two classifiers at level $1$ , GNet10 and PNet11, further classify the gestures and poses identified from RootNet, respectively. It is noted that the back and front faces of the hands with gloves have different features with the white square shape on the back and white circle on the front. Moreover, the gestures are made with either one or two hands. Thus, GNet10 is constructed as a quinary classifier to classify the following five gesture types: one hand (back face), one hand (front face), two hands closed (back face), two hands open (back face), two hands open (front face). On the other hand, PNet11 is constructed as a ternary classifier to separate the three different diver poses: turning horizontally, turning vertically and free swim. Next, at level $2$ , the one hand gestures are further classified using the finger/thumb pointing direction and the number of fingers. Thus, GNet20 is constructed as a ternary classifier to classify the one hand (back face) gestures into three categories: thumb pointing up/no finger, finger pointing down/front, and finger(s) pointing up. Similarly, GNet21 is constructed as a quaternary classifier to classify the one hand (front face) gestures into four categories: thumb pointing down, four/five fingers, three fingers/two fingers and a thumb, and one/two fingers.

Finally, at level $3$ , six binary classifiers are constructed: GNet30, GNet31, GNet32, GNet33, GNet34 and GNet35, which classify no finger vs thumb; finger pointing downward vs finger pointing forward; two fingers pointing up vs one finger pointing up; four vs five fingers pointing up; three fingers vs two fingers and a thumb pointing up; and two fingers vs one finger pointing up, respectively.

As previously mentioned, each node of the tree is a fully-connected feed-forward neural network classifier. The input layer to each of these classifiers is the Multi-FM. The architecture of the hidden layers of each of these classifiers is the same as the fully-connected hidden layers of the pre-trained transfer learning model that the multi-channel feature extraction stages are based on. For example, AlexNet has two identical fully-connected hidden layers. Each hidden layer has $4096$ neurons that are connected to each input from the previous layer, a ReLU activation layer, and a $50\%$ dropout layer. Thus, if the convolution layers of AlexNet are used for multi-channel feature extraction, then these hidden layers are used in each node classifier of the tree. These hidden layers are then followed by a fully-connected output layer that has the same number of neurons as number of classes or class groups separated at that node, and a softmax layer to calculate the probability of a sample belonging to a class. The weights of the fully-connected network are retrained using the cross-entropy loss function at the output.

The tree-structured classifier simplifies the 20-class underwater diver action recognition problem by constructing several focused neural network classifiers arranged in a hierarchical manner. It not only boosts the individual class recognition performance but also improves the overall performance, while producing a robust and reliable decision in real-time.

III-C Performance Measures

In this paper, we used several measures to evaluate the performance of DARE against other existing deep learning architectures. For each individual diver action, we compute the micro F1 score and balanced individual class accuracy. Then, we compute the macro F1 score and the overall correct classification rate (CCR). Each of these measures are defined in terms of the number of true positives, false negatives, true negatives, and false positives for each diver action.

For each class $i\in\{1,\dots,n\}$ , where $n$ is the number of classes, let $TP_{i}$ be the number of true positives, which is the number of samples of class $i$ that are correctly classified. Let $FN_{i}$ be the number of false negatives, which is the number of samples of class $i$ that are misclassified. It should be noted that the total number of samples belonging to class $i$ is $TP_{i}+FN_{i}$ . Also let $TN_{i}$ be the number of true negatives, which is the number of samples belonging to any class $j\neq i$ that are not classified as class $i$ . Finally, let $FP_{i}$ be the number of false positives, which is the number of samples belonging to any class $j\neq i$ that are misclassified as class $i$ .

F1 Score: The F1 Score is the harmonic mean of precision and recall. A high precision for diver action $i$ suggests that there are few false positives relative to true positives (i.e., other diver actions are not misclassified as action $i$ ), whereas a high recall suggests that there are few false negatives relative to true positives (i.e., diver action $i$ is not misclassified as another action). The precision and recall for class $i$ are computed as:

Precision_{i}=\frac{TP_{i}}{TP_{i}+FP_{i}},\ i=1,\dots,n,

(4)

Recall_{i}=\frac{TP_{i}}{TP_{i}+FN_{i}},\ i=1,\dots,n.

(5)

Then, the individual class (micro) F1 score is computed as:

F\textit{1}_{i}=2\cdot\frac{Precision_{i}\times Recall_{i}}{Precision_{i}+Recall_{i}},\ i=1,\dots,n

(6)

and the average (macro) F1 score is computed as:

\overline{F\textit{1}}=\frac{1}{n}\sum_{i=1}^{n}F\textit{1}_{i}.

(7)

Balanced Individual Class Accuracy: The balanced individual class accuracy (BACC) is the mean of the true positive rate and true negative rate. This measure is useful in inherently unbalanced one-verses-the-rest class performance comparisons, as other accuracy measures do not normalize the true positive and true negative predictions. The individual class true positive rate $TPR_{i}$ is the same as the recall in Eq. (5), and the individual class true negative rate is computed as:

TNR_{i}=\frac{TN_{i}}{TN_{i}+FP_{i}}\ i=1,\dots,n.

(8)

Then, the balanced individual class accuracy is computed as:

BACC_{i}=\frac{TPR_{i}+TNR_{i}}{2},\ i=1,\dots,n.

(9)

Correct Classification Rate: The CCR is the overall accuracy of the classifier, which is computed as:

CCR=\frac{\sum_{i=1}^{n}TP_{i}}{\sum_{i=1}^{n}TP_{i}+FN_{i}}\times 100\%

(10)

IV Results and Discussion

This section presents the comparative evaluation results of the performance of DARE with other deep transfer learning methods. In this paper, three base networks are used: ResNet18, AlexNet, and VggNet16. Furthermore, each of these base networks are augmented in the bi-channel framework to analyse stereo image pairs. Subsequently, DARE is built from each of these bi-channel networks by replacing the single neural network classifier with a tree-topology classifier to boost the individual class recognition performance. Thus, the performance of DARE networks with three different underlying transfer learning models are compared with the corresponding base networks and their bi-channel augmented networks. Classification was performed on the $16$ hand gesture types, $3$ diver pose types, and $1$ miscellaneous class of no-gesture/no-pose images. In order to utilize the transfer learning approaches above, the CADDY dataset images were downsampled and resized to the specified input image sizes of each network architecture. The data analysis was performed using the MATLAB Deep Learning Toolbox on a Windows 10 computer with an Intel Core i7 7700 processor and 32GB of RAM. For DARE, in each of the sub-classifiers in the tree, the initial learning rate is set to $0.001$ with stochastic gradient descent with momentum optimizer. The performance results are obtained via the k-fold cross validation.

Figure 5 shows the variations of individual class performance measures using the box plots. While Figures 5a-5c show the box plots of balanced individual class accuracies for different networks, Figures 5d-5f show the boxplots of individual class F1-scores for different networks. Each boxplot shows a red line (the median), two black lines (the minimum and maximum values), and a blue box with the lower and upper boundaries representing the $25^{th}$ and $75^{th}$ percentile values, respectively. Figure 5a shows the box plots of balanced individual class accuracy of the ResNet-based networks. As seen, the regular ResNet has a huge variation in the accuracy values with the maximum accuracy of $0.98$ and the minimum accuracy of $0.52$ , which will give poor underwater performance. The blue box also shows high variation. On the other hand, MC ResNet not only boosts the maximum and minimum accuracy values to $0.99$ and $0.55$ , respectively, it also shrinks the variation between the $25^{th}$ and $75^{th}$ percentile values. On the other hand, DARE significantly improves the minimum accuracy value to $0.82$ , thus verifying the utility of the tree-topology classifier. It also shrinks the interval between the $25^{th}$ and $75^{th}$ percentile values, where these values are equal to $0.84$ and $0.94$ , respectively, which are better than those of the regular ResNet and MC ResNet.

TABLE IV: Comparison of the overall average performance measures of different learning networks.

Metrics

ResNet

AlexNet

VggNet

Regular

DARE

Regular

DARE

Regular

DARE

\hlxvhvv

CCR (%)

86.03

88.80

89.47

94.21

95.88

95.93

95.20

95.39

95.87

\hlxvv

Macro F1 score

0.728

0.782

0.799

0.875

0.919

0.920

0.899

0.901

0.921

\hlxvv

Training

Time (hrs)

1.46

1.72

3.65

2.08

3.92

13.80

7.82

13.80

17.19

\hlxvv

Testing

Time (ms)

34.59

69.24

72.65

16.69

33.33

34.20

266.75

533.09

534.49

Similarly, Figure 5b shows the box plots of balanced individual class accuracy of the AlexNet-based learning networks. As seen, DARE outperforms the regular AlexNet and MC AlexNet with the minimum accuracy of $0.89$ . DARE also achieves the shortest range of the $25^{th}$ and $75^{th}$ percentile values. Finally, Figure 5c shows the box plots of balanced individual class accuracy of the VggNet-based learning networks. As seen, DARE significantly boosts the minimum accuracy to $0.92$ , while shrinking the $25^{th}$ and $75^{th}$ percentile range, which indicates significantly better and reliable performance as compared to the regular VggNet and MC VggNet.

Figures 5d-5f show similar trends in the boxplots of individual class F1 scores for different networks, where for each base network DARE supersedes the performance of the other networks by boosting the minimum individual class F1 score and shrinking the range of the $25^{th}$ and $75^{th}$ percentile values. In particular, DARE with VggNet-based training model achieves the best performance with the minimum individual class F1 score of $0.83$ .

Table IV presents the overall CCR, overall F1 score, and the training and testing times of all networks. DARE gives the highest overall CCR and F1 score among all base networks. Although DARE requires slightly more training times due to training of several sub-classifiers in the tree; its testing times are fairly small and are suitable for real-time implementation.

In summary, DARE with VggNet-based training model reveals the best performance in terms of both the individual class and overall performance measures. The high accuracies, smallest range of the $25^{th}$ and $75^{th}$ percentile values and high F1-scores obtained by DARE indicate that it delivers robust and reliable performance in the presence of various environmental, driver and sensor uncertainties in the CADDY dataset. Although DARE with VggNet gives the best classification performance, the prediction time for one stereo pair is $0.53$ seconds. While this is suitable for real-time prediction in static or slowly changing environments, it might not be adequate in rapidly changing environments with sporadic diver movements. On the other hand, DARE with AlexNet can make predictions in only $0.034$ seconds, while offering diver action recognition performance close to DARE with VggNet. As such, DARE with AlexNet is able to process all incoming frames from the stereo camera system on the AUV in order to make robust and reliable predictions in real-time under dynamic underwater operating conditions.

V Conclusion and Future Work

This paper addresses the diver gesture and pose recognition problem for robust, efficient and reliable underwater communication between the diver and AUV in uncertain underwater environments. In this regard, the paper developed a diver action recognition system, called DARE, which is trained using the CADDY dataset collected in real underwater environments. DARE is built upon a multi-channel deep transfer learning architecture with a tree-topology neural network classifier. The multi-channel deep transfer learning CNN facilitates feature extraction and fusion from stereo-pairs of AUV camera images. Furthermore, the tree structured neural network based classifier boosts the individual class recognition performance for reliable operation. The results show that DARE delivers better performance in both individual class and overall classification measures. In addition, DARE is computationally efficient and delivers fast decisions, making it suitable for real-time underwater implementation.

For future work, there are several areas of research for this problem. First, to further enhance the classification performance, a customized multi-channel CNN can be designed to extract specific and useful features. Second, more data can be collected to train the classifier with more different diver actions under diverse real-life underwater conditions. Finally, a smart information-theoretic approach to efficiently and automatically create the tree topology classifier can be developed so that DARE can scale better to arbitrarily large diver action recognition problems.

References

[1] J. Yuh, G. Marani, and D. R. Blidberg, “Applications of marine robotic vehicles,” Intelligent Service Robotics, vol. 4, no. 4, pp. 221–231, 2011.
[2] J. Song and S. Gupta, “ $\epsilon^{*}$ : An online coverage path planning algorithm,” IEEE Transactions on Robotics, vol. 34, no. 2, pp. 526–533, 2018.
[3] J. Song and S. Gupta, “CARE: Cooperative autonomy for resilience and efficiency of robot teams for complete coverage in unknown environments under robot failures,” Autonomous Robots, vol. 44, no. 3, pp. 647–671, 2020.
[4] R. K. Katzschmann, J. Del Preto, R. MacCurdy, and D. Rus, “Exploration of underwater life with an acoustically controlled soft robotic fish,” Science Robotics, vol. 3, no. 16, 2018.
[5] Z. Shen, J. Song, K. Mittal, and S. Gupta, “Autonomous 3-D mapping and safe-path planning for underwater terrain reconstruction using multi-level coverage trees,” in OCEANS’17 MTS/IEEE, (Anchorage, AK, USA), pp. 1–6, 2017.
[6] S. Negahdaripour and H. Madjidi, “Stereovision imaging on submersible platforms for 3-D mapping of benthic habitats and sea-floor structures,” IEEE Journal of Oceanic Engineering, vol. 28, no. 4, pp. 625–650, 2003.
[7] N. Wakita, K. Hirokawa, T. Ichikawa, and Y. Yamauchi, “Development of autonomous underwater vehicle (AUV) for exploring deep sea marine mineral resources,” Mitsubishi Heavy Industries Technical Review, vol. 47, no.3, pp. 73–80, 2010.
[8] T. Somers and G. A. Hollinger, “Human–robot planning and learning for marine data collection,” Autonomous Robots, vol. 40, no. 7, p. 1123–1137, 2016.
[9] J. Hare, S. Gupta, and T. A. Wettergren, “POSE: Prediction-based opportunistic sensing for energy efficiency in sensor networks using distributed supervisors,” IEEE Transactions on Cybernetics, vol. 48, no. 7, pp. 2114–2127, 2018.
[10] J. Z. Hare, S. Gupta, and T. A. Wettergren, “POSE.3C: Prediction-based opportunistic sensing using distributed classification, clustering and control in heterogeneous sensor networks,” IEEE Transactions on Control of Network Systems, vol. 6, no. 4, pp. 1438–1450, 2019.
[11] K. Shojaei and M. Dolatshahi, “Line-of-sight target tracking control of underactuated autonomous underwater vehicles,” Ocean Engineering, vol. 133, pp. 244–252, 2017.
[12] K. Mukherjee, S. Gupta, A. Ray, and S. Phoha, “Symbolic analysis of sonar data for underwater target detection,” IEEE Journal of Oceanic Engineering, vol. 36, no. 2, pp. 219–230, 2011.
[13] E. U. Acar, H. Choset, Y. Zhang, and M. Schervish, “Path planning for robotic demining: Robust sensor-based coverage of unstructured environments and probabilistic methods,” The International Journal of Robotics Research, vol. 22, no. 7-8, pp. 441–466, 2003.
[14] J. Song, S. Gupta, J. Hare, and S. Zhou, “Adaptive cleaning of oil spills by autonomous vehicles under partial information,” in OCEANS’13 MTS/IEEE, (San Diego, CA, USA), pp. 1–5, 2013.
[15] S. V. Kumar, R. Jayaparvathy, and B. Priyanka, “Efficient path planning of AUVs for container ship oil spill detection in coastal areas,” Ocean Engineering, vol. 217, 2020.
[16] G. L. Foresti, “Visual inspection of sea bottom structures by an autonomous underwater vehicle,” IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 31, no. 5, pp. 691–705, 2001.
[17] S. Sivčev, J. Coleman, E. Omerdić, G. Dooly, and D. Toal, “Underwater manipulators: a review,” Ocean Engineering, vol. 163, pp. 431–450, 2018.
[18] K. C. Galloway, K. P. Becker, B. Phillips, J. Kirby, S. Licht, D. Tchernov, R. J. Wood, , and D. F. Gruber, “Soft robotic grippers for biological sampling on deep reefs,” Soft Robotics, vol. 3, no. 1, pp. 23–33, 2016.
[19] G. M. Bandeira, M. Carmo, B. Ximenes, and J. Kelner, “Using gesture-based interfaces to control robots,” in Human-Computer Interaction: Interaction Technologies, vol. 9170, pp. 3–12, 2015.
[20] R. Capocci, G. Dooly, E. Omerdić, J. Coleman, T. Newe, and D. Toal, “Inspection-class remotely operated vehicles: a review,” Journal of Marine Science and Engineering, vol. 5, no. 1, pp. 13–44, 2017.
[21] J. W. Nicholson and A. J. Healey, “The present state of autonomous underwater vehicle (AUV) applications and technologies,” Marine Technology Society Journal, vol. 42, no. 1, pp. 44–51, 2008.
[22] G. Dudek, P. Giguere, and J. Sattar, “Sensor-based behavior control for an autonomous underwater vehicle,” Experimental Robotics, vol. 39, pp. 267–276, 2008.
[23] M. J. Islam, M. Ho, and J. Sattar, “Dynamic reconfiguration of mission parameters in underwater human-robot collaboration,” in IEEE International Conference on Robotics and Automation (ICRA), (Brisbane, QLD, Australia), 2018.
[24] M. J. Islam, M. Ho, and J. Sattar, “Understanding human motion and gestures for underwater human-robot collaboration,” Journal of Field Robotics, vol. 36, no. 5, pp. 851–873, 2018.
[25] N. Miskovic, D. Nad, and I. Rendulic, “Tracking divers: An autonomous marine surface vehicle to increase diver safety,” IEEE Robotics and Automation Magazine, vol. 22, no. 3, pp. 72–84, 2015.
[26] D. Nad, F. Mandic, and N. Miskovic, “Using autonomous underwater vehicles for diver tracking and navigation aiding,” Journal of Marine Science and Engineering, vol. 8, no. 6, 2020.
[27] M. J. Islam, M. Fulton, and J. Sattar, “Toward a generic diver-following algorithm: Balancing robustness and efficiency in deep visual detection,” IEEE Robotics and Automation Letters, vol. 4, no. 1, pp. 113–120, 2018.
[28] H. Hu, Z. Sun, and L. Su, “Underwater motion and activity recognition using acoustic wireless networks,” in ICC 2020-2020 IEEE International Conference on Communications (ICC)., (Dublin, Ireland), 2020.
[29] Q. Q. Huynh, L. N. Cooper, N. Intrator, and H. Shouval, “Classification of underwater mammals using feature extraction based on time–frequency analysis and bcm theory,” IEEE Transactions on Signal Processing, vol. 46, no. 5, pp. 1202–1207, 1998.
[30] S.Murugan, T. G. Babu, and C.Srinivasan, “Underwater object recognition using KNN classifier,” International Journal of MC Square Scientific Research, vol. 9, no. 3, pp. 48–52, 2017.
[31] N. Kumar, U. Mitra, and S. S. Narayanan, “Robust object classification in underwater sidescan sonar images by using reliability-aware fusion of shadow features,” IEEE Journal of Oceanic Engineering, vol. 40, no. 3, pp. 592–606, 2016.
[32] F. Gustin, I. Rendulic, N. Miskovic, and Z. Vukic, “Hand gesture recognition from multibeam sonar imagery,” IFAC-PapersOnLine, vol. 49, no. 23, pp. 470 – 475, 2016.
[33] S. A. Siddiqui, A. Salman, M. I. Malik, F. Shafait, A. Mian, M. R. Shortis, and E. S. Harvey, “Automatic fish species classification in underwater videos: exploiting pre-trained deep neural network models to compensate for limited labelled data,” ICES Journal of Marine Science, vol. 75, no. 1, p. 374–389, 2018.
[34] P. Wang, W. Li, P. Ogunbona, J. Wan, and S. Escalera, “RGB-D-based human motion recognition with deep learning: A survey,” Computer Vision and Image Understanding, vol. 171, pp. 118–139, 2018.
[35] J. Yang, J. P. Wilson, and S. Gupta, “Diver gesture recognition using deep learning for underwater human-robot interaction,” in OCEANS 2019 MTS/IEEE, (Seattle, WA, USA), pp. 1–5, 2019.
[36] A. G. Chavez, A. Ranieri, D. Chiarella, E. Zereik, A. Babić, and A. Birk, “CADDY underwater stereo-vision dataset for human-robot interaction (HRI) in the context of diver activities,” Journal of Marine Science and Engineering, vol. 7, no. 1, pp. 16–29, 2019.
[37] M. Shortis, E. Harvey, and D. Aboo, “A review of underwater stereo-image measurement for marine biology and ecology applications,” Oceanography and Marine Biology: An Annual Review, vol. 47, pp. 257–292, 2016.
[38] D. Chiarella, M. Bibuli, G. Bruzzone, M. Caccia, A. Ranieri, E. Zereik, L. Marconi, and P. Cutugno, “A novel gesture-based language for underwater human–robot interaction,” Journal of Marine Science and Engineering, vol. 6, no. 3, 2018.
[39] X. Y. Wu, “A hand gesture recognition algorithm based on DC-CNN,” Multimedia Tools and Applications, vol. 79, p. 9193–9205, 2019.
[40] J. Jiang, X. Feng, F. Liu, Y. Xu, and H. Huang, “Multi-spectral RGB-NIR image classification using double-channel CNN,” IEEE Access, vol. 7, pp. 20607–20613, 2019.
[41] S. J. Pan and Q. Yang, “A survey on transfer learning,” IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1345–1359, 2009.
[42] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778, 2016.
[43] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems 25 (NIPS), pp. 1097–1105, 2012.
[44] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv:1409.1556, 2014.
[45] V. Sze, Y. Chen, T. Yang, and J. S. Emer, “Efficient processing of deep neural networks: A tutorial and survey,” Proceedings of the IEEE, vol. 105, no. 12, pp. 2295–2329, 2017.