
Semantics-aware Adaptive Knowledge Distillation for Sensor-to-Vision Action Recognition

Yang Liu, Keze Wang, Guanbin Li, and Liang Lin. This work is supported in part by the National Natural Science Foundation of China under Grant No. 62002395, in part by the National Natural Science Foundation of Guangdong Province (China) under Grant No. 2021A15150123, and in part by the China Postdoctoral Science Foundation funded project under Grant No. 2020M672966. (Corresponding author: Liang Lin.) Yang Liu, Guanbin Li and Liang Lin are with the School of Computer Science and Engineering, Sun Yat-Sen University, Guangzhou 510006, China. (e-mail: [email protected], [email protected], [email protected]). Keze Wang is with DMAI Co., Ltd, Guangzhou 511400, China. (e-mail: [email protected]).
Abstract

Existing vision-based action recognition is susceptible to occlusion and appearance variations, while wearable sensors can alleviate these challenges by capturing human motion with one-dimensional time-series signals (e.g., acceleration, gyroscope, and orientation). For the same action, the knowledge learned from vision sensors (videos or images) and wearable sensors may be related and complementary. However, there exists a significantly large modality difference between action data captured by wearable sensors and vision sensors in data dimension, data distribution, and inherent information content. In this paper, we propose a novel framework, named Semantics-aware Adaptive Knowledge Distillation Networks (SAKDN), to enhance action recognition in the vision-sensor modality (videos) by adaptively transferring and distilling the knowledge from multiple wearable sensors. The SAKDN uses multiple wearable sensors as teacher modalities and RGB videos as the student modality. To preserve the local temporal relationship and facilitate the use of visual deep learning models, we transform the one-dimensional time-series signals of wearable sensors into two-dimensional images by designing a Gramian Angular Field based virtual image generation model. Then, we introduce a novel Similarity-Preserving Adaptive Multi-modal Fusion Module (SPAMFM) to adaptively fuse intermediate representation knowledge from different teacher networks. Finally, to fully exploit and transfer the knowledge of multiple well-trained teacher networks to the student network, we propose a novel Graph-guided Semantically Discriminative Mapping (GSDM) module, which utilizes graph-guided ablation analysis to produce good visual explanations that highlight the important regions across modalities and concurrently preserve the interrelations of the original data. Experimental results on the Berkeley-MHAD, UTD-MHAD, and MMAct datasets demonstrate the effectiveness of our proposed SAKDN for adaptive knowledge transfer from wearable-sensor modalities to vision-sensor modalities. The code is publicly available at https://github.com/YangLiu9208/SAKDN.

Index Terms:
Action recognition, wearable sensor, knowledge distillation, multi-modalities, transfer learning.

I Introduction

Human action recognition has attracted increasing attention due to its wide applications such as health-care services, smart homes, intelligent surveillance, and human-machine interaction. With the development of deep learning, vision-sensor (image, video) based methods dominate the action recognition community, and a large number of effective models have been proposed and applied to real-world scenarios [1, 2, 3, 4]. However, the performance of vision-based methods is easily affected by camera position, camera view-point, background clutter, occlusion, and appearance variation [5]. Furthermore, vision-based methods usually require expensive hardware resources to run computationally complex computer vision algorithms [6]. In some privacy-sensitive areas such as banks and government facilities, the difficulty of acquiring images and videos makes these methods infeasible. These limitations can be addressed by low-cost and computationally efficient wearable sensors. The wearable sensors embedded in smartwatches or smartphones can capture human actions as three-axis time-series acceleration, gyroscope, and orientation signals, which are suitable for privacy protection and robust to varying illumination and camera viewpoints [5]. With the popularity and increasing demand of intelligent cities and smart health-care, human action recognition based on wearable sensors has become a key research area in human activity understanding. Although some wearable-sensor based action recognition methods [7, 8, 9, 10] have been proposed and achieved promising results, most of them only consider the time-series data of wearable sensors without considering the complementary relationship and domain divergence between vision-sensor and wearable-sensor action data. Therefore, it is worthwhile to leverage the knowledge from both vision-sensor and wearable-sensor modalities to improve the performance of action recognition in such a multi-modal manner.

Figure 1: Comparison of vision and wearable sensor action data.
Figure 2: Framework of our proposed method SAKDN. (a) Virtual image generation: building a Gramian Angular Field (GAF) based virtual image generation model to transform the one-dimensional time-series signals into two-dimensional virtual image representations. (b) Training of multiple teacher networks: constructing a Similarity-Preserving Adaptive Multi-modal Fusion Module (SPAMFM) that utilizes the intra-modality similarity, semantic embedding, and multiple relational knowledge to fully utilize the complementary information among different teacher networks. (c) Multi-modality knowledge distillation: training the student network using GSDM loss, soft-target loss along with semantic preserving loss, and cross-entropy loss.

However, there exists a significantly large modality difference between vision-sensor and wearable-sensor action data, as can be observed from Fig. 1. Obviously, the vision-sensor action data are two-dimensional images or three-dimensional videos which contain abundant color and texture information. In contrast, wearable-sensor action data are one-dimensional time-series signals without color and texture information. Specifically, traditional action recognition methods usually operate in a unimodal manner (either in the vision-sensor modality or the wearable-sensor modality), which is infeasible in real-world scenarios because the dynamic environment makes it hard for a model to adapt to the modality difference. Previous works [11, 12, 13, 14] have verified the existence of complementary information between vision-sensor and wearable-sensor action data. For instance, vision-based sensors provide global motion features while wearable sensors give 3D information about local body movement. Hence, by utilizing the complementary information from these two modalities, the generalization ability and the performance of action recognition can be improved. However, due to the huge modality gap between vision-sensor and wearable-sensor action data, the following two key challenges should be addressed: 1) there are multiple modalities of wearable-sensor action data, and the data of each modality are one-dimensional time-series signals that lack local temporal relationships, color, and texture information; this makes it difficult for existing models to interpret and fuse the content of multi-modal wearable-sensor action data, so a specific and effective multi-modal representation learning method is required to increase the representative power of wearable-sensor data and concurrently fuse different kinds of wearable-sensor data. 2) there exists a large modality difference between wearable sensors and vision sensors in data dimension, data distribution, and inherent information content, which highlights the importance of adaptive feature fusion and specific knowledge transfer methods.

Based on these observations, in this paper we focus on enhancing action recognition performance in the vision-sensor modality (videos) by adaptively transferring the knowledge from multiple wearable-sensor modalities, while solving the aforementioned challenges. Since knowledge distillation allows a model with only one input modality to achieve performance close to that obtained with multiple modalities, even with heterogeneous models and data [15, 16], we propose an end-to-end knowledge distillation framework, named Semantics-aware Adaptive Knowledge Distillation Network (SAKDN), which adaptively distills the complementary knowledge from multiple wearable-sensor modalities (teachers) to the vision-sensor modality (student), and concurrently improves the action recognition performance in the vision-sensor modality (videos). An overview of the SAKDN is presented in Fig. 2. In SAKDN, we use multiple kinds of wearable-sensor signals as teacher modalities and the RGB stream of video as a single student modality. Since multi-modal action data share the same semantic content, we use the semantics-aware knowledge of action class names to guide the multi-modal feature fusion, knowledge distillation, and representation learning.

More specifically, the SAKDN consists of multiple teacher networks and a single student network. The acceleration, gyroscope, and orientation signals are used as our teacher modalities, and the RGB videos are used as our student modality. To make the one-dimensional wearable-sensor action data preserve the local temporal relationship and facilitate its visual recognition, we build a Gramian Angular Field (GAF) [17] based virtual image generation model (as shown in Fig. 3) which transforms the one-dimensional time-series signals into two-dimensional image representations and facilitates their use with existing visual models. Since there are multiple kinds of wearable-sensor modalities, we construct a Similarity-Preserving Adaptive Multi-modal Fusion Module (SPAMFM) to fully utilize the complementary information among different teacher networks. This module utilizes intra-modality similarity, semantic embedding, and multiple relational knowledge to recalibrate the channel-wise features adaptively in each teacher network, as shown in Fig. 4. To improve the performance of the student modality, we propose the Graph-guided Semantically Discriminative Mapping (GSDM) module, which transfers the graph-guided semantics-aware attention knowledge of multiple well-trained teacher networks to guide the training of the student network, as shown in Fig. 5. Extensive experiments on three benchmarks verify that our SAKDN realizes adaptive knowledge transfer from multiple wearable-sensor modalities to the vision-sensor modality and achieves state-of-the-art performance.

The main contributions of this paper are as follows:

  • To fully utilize the complementary knowledge from intermediate layers of multiple teacher networks, we propose a novel plug-and-play module, named Similarity-Preserving Adaptive Multi-modal Fusion Module (SPAMFM), which integrates intra-modality similarity, semantic embeddings, and multiple relational knowledge to learn the global context representation and recalibrate the channel-wise features adaptively in each teacher network.

  • To effectively exploit and transfer the knowledge of multiple well-trained teacher networks to the student network, we propose a novel knowledge distillation loss, named the Graph-guided Semantically Discriminative Mapping (GSDM) module, which utilizes graph-guided ablation analysis to produce good visual explanations that highlight the important regions in the image for predicting the semantic concept, while concurrently preserving the respective interrelations of the data for each modality.

  • One major advantage of our method is that it exploits the semantic relationship to bridge the modality gap between wearable sensors and vision sensors, and utilizes this constraint to guide the multi-modal feature fusion, knowledge transfer, and representation learning. The SAKDN focuses on the sensor-to-vision heterogeneous action recognition problem and integrates SPAMFM and GSDM into a unified end-to-end adaptive knowledge distillation framework. Extensive experiments on three benchmark datasets validate the effectiveness of our SAKDN.

This paper is organized as follows: Section II briefly reviews the related works. Section III introduces the proposed SAKDN. Experimental results and related discussions are presented in Section IV. Finally, Section V concludes the paper.

II Related Work

II-A Uni-modal Action Recognition

Action recognition is an active research field and has received great attention in recent years [18]. Action recognition methods can be divided into three types: (1) handcrafted-representation based [19, 20, 21, 22, 23, 24, 25, 26], (2) graph-learning based [27, 28, 29], and (3) deep-learning based [30, 31, 32, 33, 34, 35]. Notably, most of them are based on vision-sensor modalities such as RGB, depth, skeleton, or infrared images and videos. Some representative RGB-based works include IDT [36], DAG [27], 3D CNN [30], two-stream CNNs [31], C3D [32], TSN [33], TRN [2], MLGCN [28], and TSM [35]. Yuan et al. [37] introduced a statistical hypothesis detector for abnormal RGB event detection in crowded scenes. Yuan et al. [38] proposed a memory-augmented temporal dynamic learning network to learn temporal motion dynamics and tackle unsteady dynamics in long-duration motion of videos. Li et al. [39] introduced a spatio-temporal manifold network (STMN) that leverages data manifold structures to regularize deep action feature learning. In addition, other modalities (depth [40, 41], skeleton [42, 43], infrared [44]) based methods also receive increasing attention. Zhang et al. [29] proposed a semantics-guided neural network (SGN) for skeleton-based action recognition. Zhang et al. [45] presented a low-cost descriptor called 3D histograms of texture (3DHoTs) to extract discriminant features from a sequence of depth maps. Though these vision-sensor based methods have achieved promising results, their performance is easily affected by camera viewpoints, background clutter, occlusion, and illumination change. In some privacy-sensitive areas such as banks and government facilities, the difficulty of acquiring images and videos makes them infeasible. Furthermore, vision-based methods usually require expensive hardware resources to run deep learning models with high computational demands.

With the popularity of wearable devices such as smartwatches and smartphones, human action recognition based on wearable sensors has become a key research area in human activity understanding [46, 9]. Although the aforementioned vision-sensor based methods have achieved good results, they cannot be directly applied to wearable-sensor based problems due to the huge modality divergence. Since wearable-sensor action data are suitable for privacy protection and robust to varying illumination and camera viewpoints, some specific works have been proposed recently. Jiang et al. [7] assembled signal sequences of accelerators and gyroscopes into an activity image to learn optimal representations automatically. Wannenburg et al. [8] utilized ten different classifier algorithms to classify human actions using the accelerator signals captured by smartphones. Setiawan [47] used the Gramian Angular Field to transform one-dimensional wearable-sensor signals into two-dimensional images. Wang et al. [10] proposed an attention-based CNN framework to address the weakly-supervised sensor-based action recognition problem. Fazli et al. [48] built a hierarchical classification framework with neural networks to recognize human activities based on the built-in sensors of smart and wearable devices. Different from vision-sensor based methods, most of these wearable-sensor based action recognition methods rely on raw sensor time-series signals, which lack color and texture information and cannot preserve the local temporal relationship. In addition, these methods use simple feature fusion schemes to fuse the knowledge from different sensor modalities without considering intra-modality similarity, semantic embeddings, and multiple relational knowledge.

To increase the representative ability of wearable-sensor action features, we construct a Gramian Angular Field (GAF) based virtual image generation model, which transforms the one-dimensional time-series signals of wearable sensors into two-dimensional image representations. To fully utilize the complementary knowledge from multiple wearable sensors, we propose a Similarity-Preserving Adaptive Multi-modal Fusion Module (SPAMFM), which integrates intra-modality similarity, semantic embeddings, and multiple relational knowledge to learn the global context representation and recalibrate the channel-wise features adaptively.

II-B Multi-modal Action Recognition

Action recognition has been developed for a long period, but action recognition on multiple modalities is a relatively new topic. With the development of deep learning methods and various hardware such as cameras and wearable devices, there are some typical methods of dealing with multi-modal action recognition problems in recent years. These methods can be roughly categorized into three types: 1) cross-view action recognition, typical works [49, 50] used transfer learning methods to reduce the domain gap of action data from different camera views; 2) cross-spectral action recognition, typical works [51, 52] addressed the visible-to-infrared action recognition problems using domain adaptation methods. Yuan et al. [53] proposed a spatial-optical data organization and sequential learning framework based on spatial-optical action data; 3) cross-media action recognition, typical works [54, 55] designed specific multi-modal feature learning frameworks to address the image-to-video action recognition problems.

Different from these cross-domain action recognition problems, multi-modal action recognition based on wearable sensors and vision sensors (sensor-to-vision) is essentially a heterogeneous knowledge transfer problem, because there exists a large modality difference between wearable sensors and vision sensors in data dimension, data distribution, and inherent information content, as shown in Fig. 1. Related research on sensor-to-vision action recognition is limited. Chen et al. [56] proposed a feature fusion framework to combine signals from a depth camera and an inertial body sensor. Kong et al. [5] built a multi-modality distillation model with an attention mechanism to realize adaptive knowledge transfer from sensor modalities to vision modalities. Hamid et al. [57] proposed a multi-modal transfer module to fuse knowledge from different unimodal CNNs and tested this module on three different multi-modal fusion tasks: gesture recognition, audio-visual speech enhancement, and action recognition. However, most of these methods only use raw one-dimensional time-series sensor signals to recognize actions. Since time-series data lack local temporal relationships, color, and texture information, this may limit the representative ability of wearable-sensor signals and make existing pre-trained deep learning models (e.g. LeNet, AlexNet, VGGNet, ResNet, etc.) hard to adapt. Furthermore, the semantic relationship between wearable-sensor and vision-sensor action data, which can guide the knowledge transfer, is ignored in previous works.

In this paper, we use semantics-aware information to guide the multi-modal feature fusion, knowledge distillation, and representation learning of our SAKDN. The method in [57] builds a squeeze-and-excitation based multi-modal feature fusion module, but it only uses a simple concatenation of features from different modalities to learn the global context embedding, without considering more diverse relation functions or the intra-modality similarity relationship. In contrast, we propose a novel plug-and-play module, named the Similarity-Preserving Adaptive Multi-modal Fusion Module (SPAMFM), which seamlessly integrates intra-modality similarity, semantic embeddings, and multiple relational knowledge to learn the global context representation and recalibrate the channel-wise features adaptively in each network.

II-C Knowledge Distillation

Knowledge distillation is a general technique for supervising the training of student networks by capturing and transferring useful knowledge from well-trained teacher networks. Hinton et al. [15] used softened labels of the teacher with a temperature to transfer knowledge to a small student network. Attention transfer [58] designed a knowledge distillation loss based on the summed $p$-norm of convolutional feature activations along the channel dimension. Park et al. [59] proposed distance-wise and angle-wise distillation losses to realize relational knowledge transfer. Tung et al. [60] constructed a knowledge distillation loss with the constraint that input pairs that produce similar (dissimilar) activations in the teacher network should produce similar (dissimilar) activations in the student network. Hoffman et al. [61] built a modality hallucination architecture for training an RGB object detection model using depth as side information. Garcia et al. [62] proposed a generalized distillation framework that learns representations from depth and RGB videos while relying on RGB data only at test time. Crasto et al. [63] introduced a feature-based loss that mimics the Flow stream, combined linearly with the standard cross-entropy loss, thereby avoiding flow computation at test time.

Different from existing knowledge distillation methods that focus on modality transfer across vision-sensor based modalities, we move a further step towards knowledge transfer from wearable-sensor based modalities to vision-sensor based modalities. In this paper, we construct a novel knowledge distillation module, named Graph-guided Semantically Discriminative Mapping (GSDM), which utilizes graph-guided ablation analysis to produce good visual explanations that highlight the important regions for predicting the semantic concept while concurrently preserving the intrinsic structures. Since the semantic relationships of wearable-sensor and vision-sensor data are similar, we transfer the semantics-aware attention knowledge of multiple well-trained teacher networks to guide the training of the student network.

III Semantics-aware Adaptive Knowledge Distillation Networks

III-A Framework Overview

The framework of the SAKDN is shown in Fig. 2. It is an end-to-end knowledge distillation framework seamlessly constituted by three parts: virtual image generation for wearable sensors, training of multiple teacher networks, and multi-modality knowledge distillation from the multiple teacher networks to the student network. We use wearable-sensor action data (acceleration, gyroscope, and orientation) as teacher modalities and RGB videos as the student modality. The virtual image generation part uses the Gramian Angular Field (GAF) [17, 47] to encode the one-dimensional time-series signals of wearable sensors into two-dimensional image representations. We build a novel Similarity-Preserving Adaptive Multi-modal Fusion Module (SPAMFM) to fuse intermediate representation knowledge from different teacher networks adaptively. Then we use a semantic preserving loss along with a cross-entropy loss for the training of the multiple teacher networks. The multi-modality knowledge distillation consists of two knowledge distillation losses: one is our proposed Graph-guided Semantically Discriminative Mapping (GSDM) loss, and the other is the soft-target knowledge distillation loss [15]. We train the student network using the GSDM loss, the soft-target loss, the semantic preserving loss, and the cross-entropy loss.

III-B Virtual Image Generation

To make the one-dimensional wearable-sensor action data preserve the local temporal relationship and facilitate its visual recognition, we build a Gramian Angular Field (GAF) based virtual image generation model which transforms the one-dimensional time-series signals into two-dimensional image representations. The GAF based virtual image generation model is shown in Fig. 3. Since the wearable-sensor action data consist of three axial time-series signals (x, y, z), we denote one of the tri-axial signals as $X=\{x_{1},\cdots,x_{n}\}$. We then use min-max normalization to normalize the original signal $X$ into the interval $[-1,1]$ and obtain the normalized signal $\widetilde{X}$,

\widetilde{X}_{i}=\frac{(x_{i}-\textrm{max}(X))+(x_{i}-\textrm{min}(X))}{\textrm{max}(X)-\textrm{min}(X)} \qquad (1)

Then, we use a transformation function $g$ to transform the normalized signal $\widetilde{X}$ to the polar coordinate system, which represents the cosine angle from the normalized amplitude and the radius from the time stamp $t$, as represented in Eq. (2).

g(\widetilde{x}_{i},t_{i})=[\theta_{i},r_{i}] \quad \textrm{where} \quad \left\{\begin{aligned} \theta_{i}&=\arccos(\widetilde{x}_{i}),\ \widetilde{x}_{i}\in\widetilde{X}\\ r_{i}&=t_{i}\end{aligned}\right. \qquad (2)

After encoding the normalized time-series signals into the polar coordinate system, the correlation coefficient between time intervals can be easily obtained by the trigonometric sum between points. Since the correlation coefficient can be calculated by the cosine of the angle between vectors [17, 47], the correlation between times $i$ and $j$ is calculated using $\cos(\theta_{i}+\theta_{j})$ and the Gramian Angular Field based matrix is defined as $G$:

G=\left(\begin{array}{ccc}\cos(\theta_{1}+\theta_{1})&\cdots&\cos(\theta_{1}+\theta_{n})\\ \vdots&\ddots&\vdots\\ \cos(\theta_{n}+\theta_{1})&\cdots&\cos(\theta_{n}+\theta_{n})\end{array}\right) \qquad (3)
Figure 3: Overview of Gramian Angular Field (GAF) based virtual image generation framework.

The encoding map of Eq. (3) has two important properties. Firstly, it is bijective, as $\cos(\phi)$ is monotonic when $\phi\in[0,\pi]$. Given the time-series data, the proposed map produces one and only one result in the polar coordinate system with a unique inverse map. Secondly, as opposed to Cartesian coordinates, polar coordinates preserve absolute temporal relations. After encoding the scaled time-series signals into the polar coordinate system, we can easily extract the correlation coefficient between time intervals from the trigonometric sum between points. In this way, the GAF provides a new representation style that preserves the local temporal relationship in the form of temporal correlation as the timestamp increases. For wearable-sensor based action data, the accelerator, gyroscope, and orientation signals are tri-axial. Therefore, each axis's sensor data of length $n$ can be transformed into a single GAF matrix of size $n\times n$. Then, the GAF matrices of the tri-axial sensor data (x-, y-, and z-axis) are assembled into a three-channel image representation $P=\{G_{x},G_{y},G_{z}\}$ of size $n\times n\times 3$. This novel image representation is named the GAF based Virtual Image (GAFVI). The GAFVIs of the wearable sensors are used as the input for the teacher modalities in this paper.
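To make the construction concrete, the following is a minimal NumPy sketch of Eqs. (1)-(3) and of assembling the GAFVI; the function names and the toy signal are illustrative assumptions rather than the released code.

```python
import numpy as np

def gaf_matrix(x):
    """Gramian Angular Field of a 1-D signal (Eqs. 1-3)."""
    x = np.asarray(x, dtype=np.float64)
    # Eq. (1): min-max normalize into [-1, 1]
    x_tilde = ((x - x.max()) + (x - x.min())) / (x.max() - x.min())
    # Eq. (2): polar encoding; only the angle is needed for the GAF
    theta = np.arccos(np.clip(x_tilde, -1.0, 1.0))
    # Eq. (3): pairwise cos(theta_i + theta_j)
    return np.cos(theta[:, None] + theta[None, :])

def gaf_virtual_image(x_axis, y_axis, z_axis):
    """Stack the tri-axial GAF matrices into an n x n x 3 virtual image (GAFVI)."""
    return np.stack([gaf_matrix(x_axis), gaf_matrix(y_axis), gaf_matrix(z_axis)], axis=-1)

# toy usage: a 128-sample tri-axial accelerometer window
t = np.linspace(0, 2 * np.pi, 128)
img = gaf_virtual_image(np.sin(t), np.cos(t), np.sin(2 * t))
print(img.shape)  # (128, 128, 3)
```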

III-C Similarity-Preserving Adaptive Multi-modal Fusion

Since there exist multiple teacher modalities when training the teacher networks, we need to discover and fuse representative features among these modalities. Notably, attention provides a feasible way to extract important features for prediction. However, most existing attention operations [58, 64, 65] focus on the uni-modal problem and neglect the relational knowledge among different modalities when conducting multi-modal fusion. To fully utilize the complementary knowledge from multiple teacher modalities, we extend attention operations from uni-modal to multi-modal feature fusion by integrating the intra-modality similarity, semantic embeddings, and multiple relational knowledge into a unified Similarity-Preserving Adaptive Multi-modal Fusion (SPAMFM) module. The simplest case of SPAMFM, with two modalities, is shown in Fig. 4.

III-C1 Intra-modality Similarity Matrix Generation

Assume that we have $m$ teacher modalities and each modality has its own network $\{T_{k}|k=1,\cdots,m\}$. Given an input mini-batch of size $b$, the activation map produced by the teacher network $T_{k}$ at a particular layer $l$ is denoted as $A_{T_{k}}^{l}\in\mathbb{R}^{b\times c_{k}\times h_{k}\times w_{k}}$, where $b$ is the batch size, $c_{k}$ is the number of output channels for the $k$-th modality, and $h_{k}$, $w_{k}$ are the spatial dimensions. Inspired by attention-based knowledge transfer methods [58, 60] which use activation correlation to conduct knowledge transfer, we use mini-batch data to calculate intra-modality similarities in particular intermediate layers for the different teacher modalities. Specifically, the activation maps $A_{T_{k}}^{l}$ are first reshaped to $R_{T_{k}}^{l}\in\mathbb{R}^{b\times c_{k}h_{k}w_{k}}$, and then we use the row-wise L2-normalized outer product of the $R_{T_{k}}^{l}$ matrices to calculate the intra-modality similarity-preserving matrices $G_{T_{k}}^{l}\in\mathbb{R}^{b\times b}$:

\tilde{R}_{T_{k}}^{l}=R_{T_{k}}^{l}\times R_{T_{k}}^{l\top} \qquad (4)

G_{T_{k}[i,:]}^{l}=\frac{\tilde{R}_{T_{k}[i,:]}^{l}}{\left\|\tilde{R}_{T_{k}[i,:]}^{l}\right\|_{2}} \qquad (5)

where $\tilde{R}_{T_{k}}^{l}$ encodes the similarity of the activations within teacher modality $k$ at layer $l$ in the mini-batch, and $[i,:]$ denotes the $i$-th row of a matrix. These intra-modality similarities $G_{T_{k}[i,:]}^{l}$ can be utilized as weight matrices to guide the fusion of different relation functions in global context modeling. In this way, the calculated global context information adaptively preserves the intra-modality relationship as well as the complementary information among different teachers.
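For reference, Eqs. (4)-(5) can be computed with a few lines of PyTorch; the function name below is ours.

```python
import torch
import torch.nn.functional as F

def intra_modality_similarity(activation):
    """Intra-modality similarity-preserving matrix G (Eqs. 4-5).

    activation: tensor of shape (b, c, h, w) from one teacher's layer l.
    Returns a (b, b) row-wise L2-normalized similarity matrix.
    """
    b = activation.size(0)
    r = activation.reshape(b, -1)       # (b, c*h*w)
    g = r @ r.t()                       # Eq. (4): batch-wise outer product
    return F.normalize(g, p=2, dim=1)   # Eq. (5): row-wise L2 normalization

# toy usage
a = torch.randn(8, 64, 28, 28)
print(intra_modality_similarity(a).shape)  # torch.Size([8, 8])
```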

Figure 4: Architecture of SPAMFM for two modalities. $A$ and $B$ denote the features at a given layer of two CNNs.

III-C2 Global Context Modeling and Feature Recalibration

After obtaining the intra-modality similarity matrices $\{G_{T_{k}}^{l}|k=1,\cdots,m\}$, we build a global context modeling module that receives features from particular layers (conv or fc) of the different teacher networks and learns a similarity-preserving global context embedding; we then use this embedding to recalibrate the input features from different modalities. To fix notation, let $A_{k}^{l}\in\mathbb{R}^{b\times c_{k}\times h_{k}\times w_{k}}$ denote the feature maps of a batch at a given layer $l$ of modality $k$. We first use global average pooling (GAP) to generate squeezed feature vectors $S_{k}^{l}\in\mathbb{R}^{b\times c_{k}}$ for the different modalities. Formally, the statistic $S_{k}^{l}$ is generated by shrinking $A_{k}^{l}$ through the spatial dimensions $h_{k}\times w_{k}$, where the $c$-th element of $S_{k}^{l}$ is calculated by:

S_{k}^{l}(b,c)=\frac{1}{h_{k}\times w_{k}}\sum_{i=1}^{h_{k}}\sum_{j=1}^{w_{k}}A_{k}^{l}(b,c,i,j) \qquad (6)

In order to make the global context preserve the intra-modality relationship as well as the complementary information among different teacher modalities, we use the product of the intra-modality similarity matrix $G_{T_{k}}^{l}$ and the squeezed feature vector $S_{k}^{l}$ of each teacher modality to learn a joint representation. To aggregate their complementary heterogeneous information from different aspects, we use three different relation functions: concatenation, summation, and Hadamard product, whose effectiveness has been validated in [65]. Thus, we obtain three forms of joint representations through three independent fully-connected layers with the three relation functions:

Z_{con}^{l}=W_{con1}^{l}[G_{T_{1}}^{l}S_{1}^{l},\cdots,G_{T_{m}}^{l}S_{m}^{l}]+b_{con1}^{l} \qquad (7)

Z_{sum}^{l}=W_{sum1}^{l}\left(\sum_{k=1}^{m}{G_{T_{k}}^{l}S_{k}^{l}}\right)+b_{sum1}^{l} \qquad (8)

Z_{had}^{l}=W_{had1}^{l}\prod_{k=1}^{m}{G_{T_{k}}^{l}S_{k}^{l}}+b_{had1}^{l} \qquad (9)

where $[\cdot,\cdot]$ denotes the concatenation operation, $\prod_{k=1}^{m}$ denotes the Hadamard product from modality $1$ to modality $m$, and $Z_{con}^{l}\in\mathbb{R}^{c_{con}}$, $Z_{sum}^{l}\in\mathbb{R}^{c_{sum}}$ and $Z_{had}^{l}\in\mathbb{R}^{c_{had}}$ denote the joint representations of the $l$-th layer for the concatenation, summation and Hadamard product relation functions, respectively. Here, $W_{con1}^{l}\in\mathbb{R}^{c_{con}\times\sum_{k=1}^{m}{c_{k}}}$, $W_{sum1}^{l}\in\mathbb{R}^{c_{sum}\times c_{k}}$, $W_{had1}^{l}\in\mathbb{R}^{c_{had}\times c_{k}}$ are weights, and $b_{con1}^{l}\in\mathbb{R}^{c_{con}}$, $b_{sum1}^{l}\in\mathbb{R}^{c_{sum}}$ and $b_{had1}^{l}\in\mathbb{R}^{c_{had}}$ are the biases of the fully-connected layers. We choose $c_{con}=\frac{\sum_{k=1}^{m}{c_{k}}}{2m}$, $c_{sum}=c_{k}$ and $c_{had}=c_{k}$ according to [64] to restrict the model capacity and increase its generalization ability.

To make use of the global context information aggregated in the above three joint representations $Z_{con}^{l}$, $Z_{sum}^{l}$ and $Z_{had}^{l}$, we predict excitation signals for them through three independent fully-connected layers:

E_{con}^{l}=W_{con2}^{l}Z_{con}^{l}+b_{con2}^{l} \qquad (10)

E_{sum}^{l}=W_{sum2}^{l}Z_{sum}^{l}+b_{sum2}^{l} \qquad (11)

E_{had}^{l}=W_{had2}^{l}Z_{had}^{l}+b_{had2}^{l} \qquad (12)

where $W_{con2}^{l}\in\mathbb{R}^{c_{k}\times c_{con}}$, $W_{sum2}^{l}\in\mathbb{R}^{c_{k}\times c_{sum}}$, $W_{had2}^{l}\in\mathbb{R}^{c_{k}\times c_{had}}$ are weights, and $b_{con2}^{l}\in\mathbb{R}^{c_{k}}$, $b_{sum2}^{l}\in\mathbb{R}^{c_{k}}$ and $b_{had2}^{l}\in\mathbb{R}^{c_{k}}$ are the biases of the fully-connected layers.

After obtaining these three excitation signals $E_{con}^{l}\in\mathbb{R}^{c_{k}}$, $E_{sum}^{l}\in\mathbb{R}^{c_{k}}$ and $E_{had}^{l}\in\mathbb{R}^{c_{k}}$, we use them to recalibrate the input feature $A_{k}^{l}$ of each modality $k$ adaptively via a simple gating mechanism,

\tilde{A}_{k}^{l}=(\delta(E_{con}^{l})+\delta(E_{sum}^{l})+\delta(E_{had}^{l}))\odot A_{k}^{l} \qquad (13)

where $\odot$ is the channel-wise product operation for each element in the channel dimension, and $\delta(\cdot)$ is the ReLU function. With SPAMFM, we realize adaptive multi-modal feature fusion and inter-modality feature recalibration, which allows the features of one modality to recalibrate the features of another modality while concurrently preserving the intra-modality similarities as well as the complementary information among the different teacher modalities.
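To make the fusion concrete, the following is a simplified PyTorch sketch of Eqs. (6)-(13) for the special case where all modalities share the same channel count at the fused layer and a single excitation signal is shared across modalities; the class name, layer sizes, and this simplification are our own assumptions, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SPAMFM(nn.Module):
    """Simplified Similarity-Preserving Adaptive Multi-modal Fusion (Eqs. 6-13)."""

    def __init__(self, num_modalities, channels):
        super().__init__()
        c_con = (num_modalities * channels) // (2 * num_modalities)  # c_con = sum(c_k) / 2m
        self.fc_con1 = nn.Linear(num_modalities * channels, c_con)
        self.fc_sum1 = nn.Linear(channels, channels)
        self.fc_had1 = nn.Linear(channels, channels)
        self.fc_con2 = nn.Linear(c_con, channels)
        self.fc_sum2 = nn.Linear(channels, channels)
        self.fc_had2 = nn.Linear(channels, channels)

    def forward(self, feats, sims):
        # feats: list of (b, c, h, w) maps; sims: list of (b, b) similarity matrices
        squeezed = [f.mean(dim=(2, 3)) for f in feats]          # Eq. (6): global average pooling
        weighted = [g @ s for g, s in zip(sims, squeezed)]       # G_{T_k} S_k
        z_con = self.fc_con1(torch.cat(weighted, dim=1))         # Eq. (7): concatenation
        z_sum = self.fc_sum1(torch.stack(weighted).sum(dim=0))   # Eq. (8): summation
        z_had = self.fc_had1(torch.stack(weighted).prod(dim=0))  # Eq. (9): Hadamard product
        excite = (F.relu(self.fc_con2(z_con))                    # Eqs. (10)-(13): gating
                  + F.relu(self.fc_sum2(z_sum))
                  + F.relu(self.fc_had2(z_had)))
        # channel-wise recalibration of every modality's feature map
        return [f * excite.unsqueeze(-1).unsqueeze(-1) for f in feats]

# toy usage with two modalities
fuse = SPAMFM(num_modalities=2, channels=64)
a, b = torch.randn(8, 64, 28, 28), torch.randn(8, 64, 28, 28)
sims = [F.normalize(x.flatten(1) @ x.flatten(1).t(), dim=1) for x in (a, b)]  # Eqs. (4)-(5)
out_a, out_b = fuse([a, b], sims)
print(out_a.shape)  # torch.Size([8, 64, 28, 28])
```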

III-D Graph-guided Semantically Discriminative Mapping

Traditional knowledge distillation methods usually conduct knowledge transfer at the last fully-connected layers and ignore the intermediate layers, which contain abundant and essential complementary information between networks. For heterogeneous knowledge transfer problems such as sensor-to-vision action recognition, the knowledge in intermediate layers is important for efficient knowledge transfer. To mitigate the modality divergence between teacher and student modalities, we propose a novel semantics-aware knowledge distillation module, named Graph-guided Semantically Discriminative Mapping (GSDM), which works at convolutional layers and transfers the semantics-aware attention knowledge of multiple well-trained teacher networks to guide the training of the student network. This module utilizes graph-guided ablation analysis to produce good visual explanations for both teacher and student modalities that highlight the important regions for predicting the semantic concept while concurrently preserving the respective interrelations of the data for each modality.

Since previous works [66, 67, 68] have validated that the ablation of some units of a network can be an indicator of how important a unit is for a particular class, we use the ablation drop of mini-batch input features to produce a visual-explanation-based knowledge distillation loss across domains. Different from previous methods which use global average pooled gradients and class scores for visual explanation, we use semantics-guided ablation analysis to learn the visual explanations, because the similar semantic relationship between wearable-sensor and vision-sensor data can be considered good guidance for knowledge transfer, while class scores are too strict for the heterogeneous sensor-to-vision action recognition problem. The framework of GSDM is shown in Fig. 5.

The input mini-batch data of the student network contains two parts: the first part $I\in\mathbb{R}^{b\times c\times h\times w}$ contains the raw input mini-batch data, and the second part $I_{a}\in\mathbb{R}^{b\times c\times h\times w}$ contains black images, which is essentially the ablation of the raw input data. The combination of these two parts $[I;I_{a}]\in\mathbb{R}^{2b\times c\times h\times w}$ is used as the input. We assume that the class score $y^{c}$ for class $c$ can be considered a non-linear function of the input data. When we set all the input mini-batch data to zeros and repeat the forward pass, we get a reduced activation score $y_{a}^{c}$ with respect to the feature map $A_{p}$ of the $p$-th unit. Based on these class scores $y^{c}$ and $y_{a}^{c}$, we use Glove [69] to calculate their corresponding semantic embeddings $F^{c}$ and $F_{a}^{c}$:

F=\left\{\begin{array}{ll}F^{c}=\textrm{Glove}(y^{c})\in\mathbb{R}^{b\times 300}, & \hbox{when the input is } I;\\ F_{a}^{c}=\textrm{Glove}(y_{a}^{c})\in\mathbb{R}^{b\times 300}, & \hbox{when the input is } I_{a}.\end{array}\right. \qquad (14)
Figure 5: Overview of Graph-guided Semantically Discriminative Mapping (GSDM) knowledge distillation module.

Since manifold learning [70] extracts intrinsic structures from data, we construct two graphs for the original data $I$ and the ablation data $I_{a}$, where the vertexes are the embedded features at the final fully-connected layers and the edges are the relations between features. The edge weight $W_{i,j}$ between the input data $x_{i}$ and $x_{j}$ is determined by the Gaussian similarity, $W_{i,j}=\textrm{exp}(-\frac{\|f_{i}-f_{j}\|^{2}}{2})$, where $f_{i}$ and $f_{j}$ are the embedded feature vectors of $x_{i}$ and $x_{j}$. Then we apply the normalized graph Laplacian [71] on $W$, that is, $Q=D^{1/2}WD^{-1/2}$, where $D$ is a diagonal matrix whose $(i,i)$-th value is the sum of the $i$-th row of $W$. In this way, the manifold structure in the data can be well represented in the graph matrix $Q\in\mathbb{R}^{b\times b}$.
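A small PyTorch sketch of this graph construction, following the formula as written above, is given below; the function name is ours.

```python
import torch

def normalized_graph(features):
    """Graph matrix Q built from the embedded features of a mini-batch.

    features: (b, d) tensor; edge weights use the Gaussian similarity
    W_ij = exp(-||f_i - f_j||^2 / 2), and Q = D^{1/2} W D^{-1/2},
    with D the diagonal matrix of row sums of W.
    """
    sq_dist = torch.cdist(features, features).pow(2)
    w = torch.exp(-sq_dist / 2.0)
    d = w.sum(dim=1)                        # row sums (all positive, since W_ii = 1)
    return d.pow(0.5).unsqueeze(1) * w * d.pow(-0.5).unsqueeze(0)

# toy usage
f = torch.randn(8, 300)
print(normalized_graph(f).shape)  # torch.Size([8, 8])
```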

To preserve the embedded low-dimensional manifold subspace structure of the original modalities when conducting knowledge distillation, we first multiply the semantic embeddings with the graph matrix, which is essentially a manifold regularization. In this way, a semantic space that is robust against small perturbations can be produced [72]. Semantics propagation can be seen as a repeated random walk through the graph of features, using an affinity matrix to assign the semantic embeddings of the data, which effectively preserves the manifold structure [73]. Then, we define a graph-guided slope metric $\omega_{p,l}^{c}\in\mathbb{R}^{b\times 300}$ to measure the changing rate of the transformed semantic embeddings for the $p$-th unit of layer $l$ and class $c$:

\omega_{p,l}^{c}=\frac{QF^{c}-Q_{a}F_{a}^{c}}{QF^{c}} \qquad (15)

where $Q\in\mathbb{R}^{b\times b}$ and $Q_{a}\in\mathbb{R}^{b\times b}$ are the normalized graph similarity matrices for the original data and the ablation data, respectively. In this way, the intrinsic structure of the data is preserved and, concurrently, the importance value is represented by the fraction of the drop in the semantic embeddings of class $c$ when the input features are removed. Then the graph-guided semantically discriminative map $M_{l}^{c}$ for the $l$-th layer of class $c$ is obtained as a weighted linear combination of the activation maps $A_{p,l}$ and the corresponding weights $\omega_{p,l}^{c}$,

M_{l}^{c}=\textrm{ReLU}\left(\sum_{p}\omega_{p,l}^{c}A_{p,l}\right) \qquad (16)

The dimensionality of the weight $\omega_{p,l}^{c}$ is adaptively adjusted to the dimensionality of the different feature maps in the same way as in Grad-CAM [74]. The GSDMs of specific layers for the teacher and student networks are generated following Eq. (16). After obtaining the GSDMs, the GSDM based knowledge distillation loss can be constructed. Assuming that we have $m$ teacher modalities and one student modality, we use the mean squared error (MSE) loss between the normalized GSDMs of the teachers and the student to transfer knowledge:

L_{\textrm{GSDM}}=\frac{\sum_{k=1}^{m}\sum_{l_{T}\in\mathcal{L}_{distill}^{T},l_{S}\in\mathcal{L}_{distill}^{S}}\left\|\frac{M_{l_{T}}^{T_{k}}}{\|M_{l_{T}}^{T_{k}}\|_{2}}-\frac{M_{l_{S}}^{S}}{\|M_{l_{S}}^{S}\|_{2}}\right\|_{2}^{2}}{m\times N^{\mathcal{L}_{distill}^{T}}} \qquad (17)

where $M_{l_{T}}^{T_{k}}$ denotes the GSDM of the $l_{T}$-th layer for teacher network $T_{k}$, $M_{l_{S}}^{S}$ is the GSDM of the $l_{S}$-th layer for the student network, $\mathcal{L}_{distill}^{T}$ represents the group containing the chosen layers of the teacher networks for knowledge distillation, $\mathcal{L}_{distill}^{S}$ is the group containing the chosen layers of the student network, and $\|\cdot\|_{2}$ denotes the L2 norm. $N^{\mathcal{L}_{distill}^{T}}$ is the number of chosen layers in the group $\mathcal{L}_{distill}^{T}$.
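As an illustration only, the following PyTorch sketch computes a simplified GSDM and the corresponding distillation loss of Eqs. (15)-(17); it collapses the slope $\omega$ to a per-sample scalar instead of the Grad-CAM-style dimensionality adaptation used in the paper, and resizes the student map to the teacher's spatial size, so it should be read as a sketch rather than the released implementation.

```python
import torch
import torch.nn.functional as F

def gsdm_map(q, q_a, f_sem, f_sem_a, activation):
    """Simplified graph-guided semantically discriminative map (Eqs. 15-16).

    q, q_a:         (b, b) graph matrices for original and ablated inputs
    f_sem, f_sem_a: (b, 300) semantic embeddings of the predicted classes
    activation:     (b, c, h, w) feature map of the chosen layer
    """
    qf, qf_a = q @ f_sem, q_a @ f_sem_a
    omega = (qf - qf_a) / (qf + 1e-8)               # Eq. (15): graph-guided slope
    weight = omega.mean(dim=1).view(-1, 1, 1, 1)    # collapsed to one weight per sample
    return F.relu((weight * activation).sum(dim=1))  # Eq. (16): weighted sum over units

def gsdm_loss(teacher_maps, student_maps):
    """MSE between L2-normalized teacher and student GSDMs (Eq. 17),
    averaged over the teacher-student layer pairs passed in."""
    loss = 0.0
    for m_t, m_s in zip(teacher_maps, student_maps):
        # spatial sizes may differ across backbones, so resize the student map
        m_s = F.interpolate(m_s.unsqueeze(1), size=m_t.shape[-2:],
                            mode='bilinear', align_corners=False).squeeze(1)
        m_t = F.normalize(m_t.flatten(1), dim=1)
        m_s = F.normalize(m_s.flatten(1), dim=1)
        loss = loss + F.mse_loss(m_s, m_t)
    return loss / max(len(teacher_maps), 1)
```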

III-E Semantics-aware Adaptive Knowledge Distillation

Based on the Virtual Image Generation model, the Similarity-Preserving Adaptive Multi-modal Fusion Module (SPAMFM), and the Graph-guided Semantically Discriminative Mapping (GSDM) knowledge distillation module, we seamlessly integrate them into a unified adaptive knowledge distillation framework, named the Semantics-aware Adaptive Knowledge Distillation Network (SAKDN), to address the sensor-to-vision heterogeneous action recognition problem, as shown in Fig. 2. In SAKDN, we use multiple wearable sensors as teacher modalities and the vision sensor as the student modality. For the teacher networks, the input is the GAF images of the wearable sensors, and the input of the student network is the RGB video.

III-E1 Training of teacher networks

Assume that we have $m$ wearable-sensor modalities; we build $m$ teacher networks using VGG16 [75] as the backbone. As shown in Fig. 2, given the GAF images of each modality, we simultaneously feed them into their respective networks for model training. The SPAMFM is added into selected layers of VGG16 among the teacher networks, as follows:

\mathcal{L}_{\textrm{SPAMFM}}=\{\textrm{conv}_{1}^{2},\textrm{conv}_{2}^{2},\textrm{conv}_{3}^{3},\textrm{conv}_{4}^{3},\textrm{conv}_{5}^{3},\textrm{fc}1,\textrm{fc}2\} \qquad (18)

In addition to the SPAMFM, we design a semantic preserving loss at the fc2 layer among the different teacher networks to make sure the fc2 layer contains semantic knowledge as well as the intra-modality relationship and the complementary information from the different teacher modalities. The semantic preserving loss $L_{\textrm{SP}}^{T}$ is defined as the MSE loss between the raw features of the fc2 layer and their corresponding semantic representations of the action class names,

L_{\textrm{SP}}^{T}=\frac{1}{m}\sum_{k=1}^{m}\left\|H_{k}^{T}-F_{k}\right\|_{2}^{2} \qquad (19)

where $H_{k}^{T}$ is the raw feature of the fc2 layer for teacher network $k$, and $F_{k}$ is the corresponding semantic representation.

All the teacher networks are trained simultaneously using $L_{\textrm{SP}}^{T}$ along with the summed cross-entropy loss of all teacher networks, $L_{\textrm{CS}}^{T}$. The total loss $L_{\textrm{T}}$ for all teacher networks is organized as:

L_{\textrm{T}}=L_{\textrm{CS}}^{T}+L_{\textrm{SP}}^{T}=\frac{1}{m}\sum_{k=1}^{m}\textrm{CE}(Y_{k}^{T},Z_{k}^{T})+\frac{1}{m}\sum_{k=1}^{m}\left\|H_{k}^{T}-F_{k}\right\|_{2}^{2} \qquad (20)

where CE is the cross-entropy loss, and $Y_{k}^{T}$ and $Z_{k}^{T}$ denote the predicted labels and class probabilities for teacher network $k$, respectively.
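For reference, a minimal PyTorch sketch of the teacher objective in Eqs. (19)-(20) is given below; the function name and argument layout are ours, and F.mse_loss averages the squared differences rather than taking the raw squared L2 norm of Eq. (19).

```python
import torch.nn.functional as F

def teacher_loss(logits_list, fc2_feats_list, glove_targets_list, labels):
    """Total teacher loss L_T = L_CS^T + L_SP^T (Eqs. 19-20), averaged over m teachers.

    logits_list:        per-teacher class scores, each (b, num_classes)
    fc2_feats_list:     per-teacher fc2 features, each (b, 300)
    glove_targets_list: per-teacher Glove embeddings of the ground-truth class names, each (b, 300)
    labels:             (b,) ground-truth class indices
    """
    m = len(logits_list)
    ce = sum(F.cross_entropy(z, labels) for z in logits_list) / m
    sp = sum(F.mse_loss(h, f) for h, f in zip(fc2_feats_list, glove_targets_list)) / m
    return ce + sp
```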

III-E2 Training of student network

Our student network is a TRN [2] with BN-Inception, using only RGB videos as input. During the training of the student network, the parameters of the teacher networks are fixed, as shown in Fig. 2. To reduce the computational cost during the training phase, we only perform distillation on some representative features. Thus, the GSDM is added into selected convolutional layers between the BN-Inception and VGG16 networks for all teacher-student pairs, as follows:

\mathcal{L}_{distill}^{T}=\{\textrm{conv}_{1}^{2},\textrm{conv}_{2}^{2},\textrm{conv}_{3}^{3},\textrm{conv}_{4}^{3},\textrm{conv}_{5}^{3}\} \qquad (21)

\mathcal{L}_{distill}^{S}=\{\textrm{conv}2,\textrm{Inc}3c,\textrm{Inc}4c,\textrm{Inc}5a,\textrm{Inc}5b\} \qquad (22)

where $\textrm{conv}_{i}^{j}$ represents the $j$-th convolutional activation map of convolution group $i$, and Inc represents an inception layer.

In addition to the GSDM distillation loss $L_{\textrm{GSDM}}$ in Eq. (17), we build a complementary knowledge distillation loss at the last fully-connected layers between the teacher and student networks.

L_{\textrm{ST}}=\frac{1}{m}\sum_{k=1}^{m}\textrm{KL}\left(\frac{P_{k}^{T}}{T},\frac{P^{S}}{T}\right) \qquad (23)

where $\textrm{KL}(\cdot,\cdot)$ is the Kullback-Leibler divergence, $P_{k}^{T}$ is the class probability prediction of teacher network $k$, $P^{S}$ is the class probability prediction of the student network, and $T$ denotes the temperature controlling the smoothness of the probability distribution. We set $T=4$ in this paper as suggested by [15].
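A minimal PyTorch sketch of Eq. (23) is given below, assuming the raw logits of the teachers and the student are available; the function name is ours, and the optional T^2 rescaling sometimes used with soft targets is not written into Eq. (23), so it is omitted here.

```python
import torch.nn.functional as F

def soft_target_loss(teacher_logits_list, student_logits, temperature=4.0):
    """Soft-target distillation loss L_ST (Eq. 23), averaged over m teachers.

    Computes KL(teacher || student) on temperature-softened distributions.
    """
    m = len(teacher_logits_list)
    log_p_s = F.log_softmax(student_logits / temperature, dim=1)
    loss = 0.0
    for t_logits in teacher_logits_list:
        p_t = F.softmax(t_logits / temperature, dim=1)
        loss = loss + F.kl_div(log_p_s, p_t, reduction='batchmean')
    return loss / m
```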

To make the semantic knowledge of the teacher and student networks similar, we use a semantic preserving loss between the fc1 layer of the student network and the fc2 layer of the teacher networks, defined as follows:

L_{\textrm{SP}}^{S}=\frac{1}{m}\sum_{k=1}^{m}\left\|H^{S}-H_{k}^{T}\right\|_{2}^{2} \qquad (24)

where $H^{S}$ denotes the features of the fc1 layer of the student network, and $H_{k}^{T}$ represents the features of the fc2 layer of teacher network $k$. Since we use Eq. (19) to train the teacher networks, the fc2 layer of a trained teacher network already contains semantic knowledge. Therefore, we can realize semantic knowledge preservation for the student network using Eq. (24).

To train the student network, we use the cross-entropy loss $L_{\textrm{CS}}^{S}=\textrm{CE}(Y^{S},Z^{S})$ along with the two knowledge distillation losses $L_{\textrm{GSDM}}$ and $L_{\textrm{ST}}$, and the semantic preserving loss $L_{\textrm{SP}}^{S}$. The total loss $L_{S}$ for the student network is defined as follows:

L_{S}=L_{\textrm{CS}}^{S}+\alpha L_{\textrm{ST}}+\beta L_{\textrm{GSDM}}+\gamma L_{\textrm{SP}}^{S} \qquad (25)

where $Y^{S}$ and $Z^{S}$ are the predicted labels and class probabilities of the student network, respectively, and $\alpha$, $\beta$, and $\gamma$ are the parameters controlling the importance of the ST, GSDM, and SP losses, respectively.
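Putting the pieces together, a minimal sketch of the student objective in Eq. (25) might look as follows, assuming the soft-target and GSDM terms have already been computed (e.g., by the sketches above); the function name and the averaging inside the semantic-preserving term are our own choices.

```python
import torch.nn.functional as F

def student_total_loss(student_logits, labels, l_st, l_gsdm, student_fc1,
                       teacher_fc2_list, alpha, beta, gamma):
    """Total student loss L_S of Eq. (25): cross-entropy plus the weighted
    soft-target (Eq. 23), GSDM (Eq. 17), and semantic-preserving (Eq. 24) terms."""
    ce = F.cross_entropy(student_logits, labels)
    sp = sum(F.mse_loss(student_fc1, h_t) for h_t in teacher_fc2_list) / len(teacher_fc2_list)
    return ce + alpha * l_st + beta * l_gsdm + gamma * sp
```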

IV Experiments

IV-A Experimental Setup

In this work, we conduct extensive experiments on three benchmarks for sensor-to-vision action recognition. We first introduce the three datasets and the implementation details. Then, we compare our SAKDN with existing knowledge distillation and action recognition methods. In addition, we conduct ablation studies to analyze the importance of the proposed ST, SPAMFM, GSDM, and SP. Furthermore, we conduct experiments with different backbone architectures and selected transfer layers to validate whether our SAKDN generalizes to different networks and layer choices. To show how we select the values of the hyper-parameters, we use a grid search to conduct a parameter sensitivity analysis of the hyper-parameters $\alpha$, $\beta$, and $\gamma$ on the three datasets. Finally, we illustrate the training curves of our method and use CAM to visualize the space-time regions that contribute to our predictions.

IV-A1 Berkeley-MHAD [76]

This dataset consists of 11 action classes performed by 12 subjects with 5 repetitions of each action. There are 12 different camera views in total. The action data modalities include RGB videos, depth images, accelerators, and microphones. In this paper, we use RGB videos and accelerators. There are 7,900 RGB videos and 6 different accelerator modalities. Each accelerator modality has 658 samples, and the total number of accelerator samples is 3,948. In all experiments, we use the first 7 subjects for training and the last 5 subjects for testing.

TABLE I: Implementation details for three benchmark datasets. Batch denotes the batch size, LR is the initial learning rate, DR is the decay ratio of the learning rate, DI is the decay iterations of the learning rate, Iters is the total iterations.
Dataset Modality Batch LR DR DI Iters
Berkeley-MHAD Teacher 8 0.0001 0.5 50 100
Student 8 0.001 0.1 20 30
UTD-MHAD Teacher 16 0.0002 0.5 50 100
Student 16 0.001 0.5 50 100
MMAct Teacher 16 0.0001 0.5 50 70
Student 32 0.001 0.5 30 60

IV-A2 UTD-MHAD [77]

It consists of 27 different actions performed by 8 subjects with 4 repetitions. This dataset has five modalities: RGB, depth, skeleton, Kinect, and inertial data. The vision-sensor data are captured by a Kinect camera, while the wearable-sensor data are captured by an inertial sensor. In this paper, we use RGB videos and two different wearable-sensor modalities (accelerator, gyroscope). Each modality has 861 samples. Since each subject performs each action 4 times, we choose the first two samples of each action to form the training set and the remaining samples as the testing set.

IV-A3 MMAct [5]

MMAct is a large-scale multi-modal action dataset consisting of more than 36,000 trimmed clips with seven modalities captured from 20 subjects, including RGB videos, keypoints, acceleration, gyroscope, orientation, Wi-Fi, and pressure signals. Each modality has 37 action classes. This dataset is challenging as it contains 4 camera views combined with random walks and occlusion scenes. In this paper, we use RGB videos and four different wearable-sensor modalities (accelerator-phone, accelerator-watch, gyroscope, and orientation). We use four different settings to evaluate this dataset: cross-subject, cross-view, cross-scene, and cross-session, according to the train-test split strategy in [5].

For the teacher networks, we use VGGNet16 [75] as the backbone. For the student network, we adopt the multi-scale TRN [2] with BN-Inception pretrained on ImageNet because of its balance between accuracy and efficiency. In the multi-scale TRN, we set the dropout ratio to 0.8 after the global pooling layer to reduce the effect of over-fitting; the number of segments is set to 8 for Berkeley-MHAD and UTD-MHAD, and 3 for MMAct. The implementation details for the Berkeley-MHAD, UTD-MHAD, and MMAct datasets are presented in Table I. All the experiments are conducted on two NVIDIA RTX 2080Ti GPUs using PyTorch [78]. To synchronize the two different modalities, we make the random seeds used to initialize the two dataloaders the same. For semantic representation extraction, we use the Glove [69] model and obtain 300-dimensional semantic vectors for each action class name. We set the hyper-parameters $\alpha$, $\beta$ and $\gamma$ in SAKDN according to the parameter sensitivity analysis in Section IV-D.
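As an illustration of the dataloader synchronization mentioned above, one possible way to keep two modality loaders sampling in the same order is to seed their sampling generators identically; the helper below is our own sketch, not the released training script.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def make_synced_loaders(video_set, sensor_set, batch_size, seed=0):
    """Build two shuffled dataloaders that draw samples in the same order
    by giving both sampling generators the same seed."""
    g1, g2 = torch.Generator(), torch.Generator()
    g1.manual_seed(seed)
    g2.manual_seed(seed)
    video_loader = DataLoader(video_set, batch_size=batch_size, shuffle=True, generator=g1)
    sensor_loader = DataLoader(sensor_set, batch_size=batch_size, shuffle=True, generator=g2)
    return video_loader, sensor_loader

# toy usage: indices come out in the same order for both modalities
videos = TensorDataset(torch.arange(10).float())
sensors = TensorDataset(torch.arange(10).float())
vl, sl = make_synced_loaders(videos, sensors, batch_size=4)
for (v,), (s,) in zip(vl, sl):
    assert torch.equal(v, s)
```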

TABLE II: Performance comparison on Berkeley-MHAD.
Type Method Modality Accuracy
VAR TSN [1] RGB videos 88.19
TRN [2] RGB videos 95.32
TSM [35] RGB videos 96.87
MMAR MKL [76] Accelerators+Depth 97.81
MPE [79] Accelerators+Depth 98.10
MOCAP [80] Accelerators+Depth 98.38
KD Logits [81] Accelerators+RGB Videos 97.93
Fitnet [82] Accelerators+RGB Videos 94.38
ST [15] Accelerators+RGB Videos 95.99
AT [58] Accelerators+RGB Videos 97.99
RKD [59] Accelerators+RGB Videos 97.11
SP [60] Accelerators+RGB Videos 98.17
CC [83] Accelerators+RGB Videos 97.11
Proposed SAKDN Accelerators+RGB Videos 99.33
TABLE III: Performance comparison on UTD-MHAD.
Type Method Modality Accuracy
VAR TSN [1] RGB videos 92.54
TRN [2] RGB videos 94.87
TSM [35] RGB videos 94.17
MMAR CRC [77] Acc+Gyro+Depth 79.10
CRC-2 [84] Acc+Gyro+Depth 97.20
CNN+LSTM [12] Acc+Gyro+Depth 89.20
MFLF [6] Acc+Gyro+RGB Videos 98.20
KD Logits [81] Acc+Gyro+RGB Videos 97.20
Fitnet [82] Acc+Gyro+RGB Videos 90.20
ST [15] Acc+Gyro+RGB Videos 97.90
AT [58] Acc+Gyro+RGB Videos 95.80
RKD [59] Acc+Gyro+RGB Videos 96.73
SP [60] Acc+Gyro+RGB Videos 94.40
CC [83] Acc+Gyro+RGB Videos 94.87
Proposed SAKDN Acc+Gyro+RGB Videos 98.60
TABLE IV: Performance comparison on MMAct.
Method Modality Cross-subject Cross-view Cross-scene Cross-session
TSN[1] RGB videos 59.50 54.37 51.21 68.65
TRN [2] RGB videos 66.56 65.51 60.03 71.95
TSM [35] RGB videos 70.12 67.22 66.04 81.32
SMD [15] A+RGB 63.89 66.31 61.56 71.23
MMD [5] A+G+O+RGB 64.33 68.19 62.23 72.08
MMAD [5] A+G+O+RGB 66.45 70.33 64.12 74.58
Logits [81] A+G+O+RGB 65.06 60.94 57.92 74.14
 Fitnet [82] A+G+O+RGB 33.96 30.14 18.88 35.87
 ST [15] A+G+O+RGB 64.45 60.39 58.72 74.80
 AT [58] A+G+O+RGB 65.59 60.30 55.92 74.28
 RKD [59] A+G+O+RGB 65.54 61.67 55.38 75.05
 SP [60] A+G+O+RGB 65.16 60.76 57.48 74.41
 CC [83] A+G+O+RGB 65.60 59.59 59.65 73.98
SAKDN A+G+O+RGB 77.23 73.48 66.38 82.77
TABLE V: Average accuracies (%) on Berkeley-MHAD dataset. W/O denotes Without, A denotes Accelerator. The number in parenthesis means decreased accuracy over the proposed SAKDN.
Method Teacher Backbone Student Backbone Train Modality Test Modality Accuracy
Teacher-Acc1 (SKDN) VGG16 VGG16 Accelerator 1 Accelerator 1 78.18
Teacher-Acc1 (AKDN) VGG16 VGG16 Accelerator 1 Accelerator 1 74.90
Teacher-Acc1 (SAKDN) VGG16 VGG16 Accelerator 1 Accelerator 1 81.09
Teacher-Acc2 (SKDN) VGG16 VGG16 Accelerator 2 Accelerator 2 75.63
Teacher-Acc2 (AKDN) VGG16 VGG16 Accelerator 2 Accelerator 2 73.45
Teacher-Acc2 (SAKDN) VGG16 VGG16 Accelerator 2 Accelerator 2 82.90
Teacher-Acc3 (SKDN) VGG16 VGG16 Accelerator 3 Accelerator 3 71.63
Teacher-Acc3 (AKDN) VGG16 VGG16 Accelerator 3 Accelerator 3 68.72
Teacher-Acc3 (SAKDN) VGG16 VGG16 Accelerator 3 Accelerator 3 75.27
Teacher-Acc4 (SKDN) VGG16 VGG16 Accelerator 4 Accelerator 4 76.00
Teacher-Acc4 (AKDN) VGG16 VGG16 Accelerator 4 Accelerator 4 70.54
Teacher-Acc4 (SAKDN) VGG16 VGG16 Accelerator 4 Accelerator 4 80.36
Teacher-Acc5 (SKDN) VGG16 VGG16 Accelerator 5 Accelerator 5 52.72
Teacher-Acc5 (AKDN) VGG16 VGG16 Accelerator 5 Accelerator 5 51.27
Teacher-Acc5 (SAKDN) VGG16 VGG16 Accelerator 5 Accelerator 5 55.27
Teacher-Acc6 (SKDN) VGG16 VGG16 Accelerator 6 Accelerator 6 50.90
Teacher-Acc6 (AKDN) VGG16 VGG16 Accelerator 6 Accelerator 6 46.18
Teacher-Acc6 (SAKDN) VGG16 VGG16 Accelerator 6 Accelerator 6 54.54
Multi-Teachers (SKDN) VGG16 VGG16 A1+A2+A3+A4+A5+A6 A1+A2+A3+A4+A5+A6 89.09
Multi-Teachers (AKDN) VGG16 VGG16 A1+A2+A3+A4+A5+A6 A1+A2+A3+A4+A5+A6 90.54
Multi-Teachers (SAKDN) VGG16 VGG16 A1+A2+A3+A4+A5+A6 A1+A2+A3+A4+A5+A6 92.00
Student (Baseline) BNInception BNInception RGB videos RGB videos 95.32 (-4.01)
SKDN (W/O SPAMFM) VGG16 BNInception A1+A2+A3+A4+A5+A6+RGB RGB videos 98.11 (-1.22)
KDN (W/O ST) VGG16 BNInception A1+A2+A3+A4+A5+A6+RGB RGB videos 98.14 (-1.19)
SADN (W/O GSDM) VGG16 BNInception A1+A2+A3+A4+A5+A6+RGB RGB videos 98.48 (-0.85)
AKDN (W/O SP) VGG16 BNInception A1+A2+A3+A4+A5+A6+RGB RGB videos 97.63 (-1.70)
SAKDN VGG16 BNInception A1+A2+A3+A4+A5+A6+RGB RGB videos 99.33

IV-B Comparison with State-of-the-Art Methods

We compare the performance of our SAKDN with state-of-the-art knowledge distillation (KD) methods [81, 82, 15, 58, 59, 60, 83], vision-based action recognition (VAR) methods [1, 2, 35], and multi-modal action recognition (MMAR) methods [76, 79, 80, 77, 84, 12, 6, 5]. The comparison results on the three datasets are shown in Table II, Table III and Table IV, respectively. For the VAR and KD methods, we use the publicly shared code, and the parameters are selected based on the default settings. Since [5] is the only existing multi-modal action recognition method on the MMAct dataset and it uses the F-measure to evaluate performance, we also adopt the F-measure on the MMAct dataset for a fair comparison.

In Table II and Table III, the SAKDN achieves better performance than all the compared action recognition, multi-modal action recognition, and knowledge distillation methods. This validates that our SAKDN can effectively improve vision-sensor based action recognition by integrating the SPAMFM, GSDM, and SP into a unified end-to-end adaptive knowledge distillation framework. From Table IV, we can see that the SAKDN performs better than most of the compared action recognition, multi-modal action recognition, and knowledge distillation methods. Notably, the TSM [35] achieves performance comparable to the SAKDN. This is because the TSM shifts part of the channels along the temporal dimension and thus facilitates information exchange among neighboring frames. Although the MMAD [5] proposes a multi-modality distillation model to transfer knowledge from wearable-sensors to vision-sensors, it only uses raw one-dimensional time-series sensor signals, without virtual image generation of the wearable-sensor data or the semantic relationship. The SAKDN achieves the best performance among all the compared methods, which validates the effectiveness of our adaptive knowledge distillation framework for sensor-to-vision heterogeneous action recognition, integrating Gramian Angular Field (GAF) based virtual image generation, the SPAMFM, and the GSDM into a unified end-to-end deep learning framework.
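To make the mechanism behind TSM's strong result concrete, the sketch below illustrates a zero-padded temporal channel shift; the shift ratio is an assumed placeholder rather than the configuration released with [35].

```python
import torch

def temporal_shift(x, shift_ratio=0.125):
    """Shift a fraction of channels along the temporal axis (illustrative sketch).

    x: tensor of shape (batch, time, channels, height, width). One slice of
    channels is shifted forward in time and another backward, so each frame
    exchanges information with its neighbors.
    """
    b, t, c, h, w = x.size()
    fold = int(c * shift_ratio)
    out = torch.zeros_like(x)
    out[:, 1:, :fold] = x[:, :-1, :fold]                  # shift forward in time
    out[:, :-1, fold:2 * fold] = x[:, 1:, fold:2 * fold]  # shift backward in time
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]             # remaining channels unchanged
    return out
```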

IV-C Ablation Study

To evaluate the contributions of the SPAMFM, the ST, the GSDM, and the SP, we construct four ablated variants of the SAKDN together with two reference models. Student (Baseline): the student network trained with only RGB videos. Multi-Teachers: the teacher networks trained with all wearable-sensor modalities. SKDN: the SAKDN without the Similarity-Preserving Adaptive Multi-modal Fusion Module (SPAMFM). KDN: the SAKDN without the soft-target (ST) loss. SADN: the SAKDN without the Graph-guided Semantically Discriminative Mapping (GSDM). AKDN: the SAKDN without Semantic Preserving (SP) for both the teacher and student networks. SAKDN: our full Semantics-aware Adaptive Knowledge Distillation Networks.

TABLE VI: Average accuracies (%) on the UTD-MHAD dataset. W/O denotes Without. The number in parentheses indicates the decrease in accuracy relative to the proposed SAKDN.
Method Teacher Backbone Student Backbone Train Modality Test Modality Accuracy
Teacher-Acc (SKDN) VGG16 VGG16 Accelerometer Accelerometer 96.27
Teacher-Acc (AKDN) VGG16 VGG16 Accelerometer Accelerometer 91.84
Teacher-Acc (SAKDN) VGG16 VGG16 Accelerometer Accelerometer 97.66
Teacher-Gyo (SKDN) VGG16 VGG16 Gyroscope Gyroscope 93.93
Teacher-Gyo (AKDN) VGG16 VGG16 Gyroscope Gyroscope 92.77
Teacher-Gyo (SAKDN) VGG16 VGG16 Gyroscope Gyroscope 94.87
Multi-Teachers (SKDN) VGG16 VGG16 Acc+Gyo Acc+Gyo 96.27
Multi-Teachers (AKDN) VGG16 VGG16 Acc+Gyo Acc+Gyo 97.43
Multi-Teachers (SAKDN) VGG16 VGG16 Acc+Gyo Acc+Gyo 98.83
Student (Baseline) BNInception BNInception RGB videos RGB videos 94.87 (-3.73)
SKDN (W/O SPAMFM) VGG16 BNInception Acc+Gyo+RGB RGB videos 97.43 (-1.17)
KDN (W/O ST) VGG16 BNInception Acc+Gyo+RGB RGB videos 96.27 (-2.33)
SADN (W/O GSDM) VGG16 BNInception Acc+Gyo+RGB RGB videos 97.66 (-0.94)
AKDN (W/O SP) VGG16 BNInception Acc+Gyo+RGB RGB videos 96.96 (-1.64)
SAKDN VGG16 BNInception Acc+Gyo+RGB RGB videos 98.60
TABLE VII: Average accuracies (%) on the MMAct dataset. W/O denotes Without. The number in parentheses indicates the decrease in accuracy relative to the proposed SAKDN.
Method Teacher Backbone Student Backbone Train Modality Test Modality Cross-Subject Cross-View Cross-Scene Cross-Session
Teacher-Ap (SKDN) VGG16 VGG16 Acc-phone Acc-phone 49.54 56.65 55.44 56.81
Teacher-Ap (AKDN) VGG16 VGG16 Acc-phone Acc-phone 43.41 52.30 49.22 53.57
Teacher-Ap (SAKDN) VGG16 VGG16 Acc-phone Acc-phone 52.34 59.82 57.15 59.38
Teacher-Aw (SKDN) VGG16 VGG16 Acc-watch Acc-watch 44.23 49.97 63.26 16.50
Teacher-Aw (AKDN) VGG16 VGG16 Acc-watch Acc-watch 37.08 47.08 60.54 16.44
Teacher-Aw (SAKDN) VGG16 VGG16 Acc-watch Acc-watch 44.83 53.14 69.42 18.58
Teacher-Gyo (SKDN) VGG16 VGG16 Gyroscope Gyroscope 44.70 37.83 50.40 56.14
Teacher-Gyo (AKDN) VGG16 VGG16 Gyroscope Gyroscope 41.52 37.74 47.85 51.39
Teacher-Gyo (SAKDN) VGG16 VGG16 Gyroscope Gyroscope 52.98 40.86 56.52 59.66
Teacher-Ori (SKDN) VGG16 VGG16 Orientation Orientation 42.87 55.09 53.78 57.70
Teacher-Ori (AKDN) VGG16 VGG16 Orientation Orientation 40.72 54.20 51.29 53.74
Teacher-Ori (SAKDN) VGG16 VGG16 Orientation Orientation 47.12 60.60 58.71 61.56
Multi-Teachers (SKDN) VGG16 VGG16 Ap+Aw+Gyo+Ori Ap+Aw+Gyo+Ori 67.45 65.66 78.72 68.77
Multi-Teachers (AKDN) VGG16 VGG16 Ap+Aw+Gyo+Ori Ap+Aw+Gyo+Ori 66.64 65.88 79.24 66.53
Multi-Teachers (SAKDN) VGG16 VGG16 Ap+Aw+Gyo+Ori Ap+Aw+Gyo+Ori 68.69 68.22 81.61 70.11
Student (Baseline) BNInception BNInception RGB videos RGB videos 68.41 (-2.70) 65.25 (-3.33) 56.33 (-7.08) 76.79 (-4.98)
SKDN (W/O SPAMFM) VGG16 BNInception Ap+Aw+Gyo+Ori+RGB RGB videos 70.38 (-0.73) 67.42 (-1.16) 57.69 (-5.72) 76.96 (-4.81)
KDN (W/O ST) VGG16 BNInception Ap+Aw+Gyo+Ori+RGB RGB videos 70.45 (-0.66) 67.26 (-1.32) 61.90 (-1.51) 80.79 (-0.98)
SADN (W/O GSDM) VGG16 BNInception Ap+Aw+Gyo+Ori+RGB RGB videos 69.11 (-2.00) 65.16 (-3.42) 56.86 (-6.55) 79.53 (-2.24)
AKDN (W/O SP) VGG16 BNInception Ap+Aw+Gyo+Ori+RGB RGB videos 70.63 (-0.48) 64.03 (-4.55) 62.48 (-0.93) 79.63 (-2.14)
SAKDN VGG16 BNInception Ap+Aw+Gyo+Ori+RGB RGB videos 71.11 68.58 63.41 81.77

The average accuracies on the Berkeley-MHAD, UTD-MHAD, and MMAct datasets are shown in Tables V, VI, and VII, respectively. In Table V, the multi-teacher SAKDN achieves better performance than any single teacher modality (Acc1-Acc6). Moreover, the SAKDN with six teacher modalities performs better than the SKDN, which validates that our proposed SPAMFM can effectively fuse the complementary knowledge among different wearable-sensors. Without the soft-target loss, the KDN performs worse than the SAKDN, which shows the importance of the ST in conducting knowledge transfer at the last fully-connected layers between teacher and student networks. In addition, the SAKDN also performs better than the AKDN, which verifies that the semantics-preserving term indeed acts as an effective guide for knowledge transfer. The student model with only RGB input (Baseline) performs better than the teacher models because the wearable sensors in the Berkeley-MHAD dataset lack color and texture information, which limits their representative ability. Introducing the different accelerometers to the student model improves the performance from 95.32% to 99.33%, which validates the existence of complementary knowledge between the wearable-sensor and vision-sensor modalities. The SAKDN achieves a more significant improvement in video action recognition than SKDN, KDN, SADN, and AKDN. In addition, SKDN, KDN, SADN, and AKDN all perform better than the student baseline, which verifies that the SPAMFM, ST, GSDM, and SP are complementary and essential.
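Since the ST term acts only at the final classification layer, its role can be illustrated with the standard temperature-scaled soft-target loss of [15]; the temperature below is an assumed placeholder rather than the value used in our experiments.

```python
import torch.nn.functional as F

def soft_target_loss(student_logits, teacher_logits, temperature=4.0):
    """KL divergence between temperature-softened teacher and student outputs."""
    t = temperature
    p_teacher = F.softmax(teacher_logits / t, dim=1)
    log_p_student = F.log_softmax(student_logits / t, dim=1)
    # the t**2 factor keeps gradient magnitudes comparable across temperatures
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (t * t)
```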

From Table VI, we can see that the Teacher-Acc, Teacher-Gyo, and Multi-Teachers models using the SAKDN outperform those using the SKDN and AKDN. This validates that the SPAMFM can fully utilize the complementary knowledge from multiple teacher modalities and that the SP can guide knowledge transfer using the similar semantic relationship between the teacher and student modalities. Moreover, the SKDN, KDN, SADN, and AKDN all perform better than the student baseline, which validates the effectiveness of the SPAMFM, ST, GSDM, and SP. Among SKDN, KDN, SADN, AKDN, and SAKDN, the SAKDN achieves the best performance and improves video action recognition from 94.87% (baseline) to 98.60%. This validates that our SAKDN can effectively transfer knowledge from wearable-sensor modalities to vision-sensor modalities by integrating the four complementary modules SPAMFM, ST, GSDM, and SP.

To make a fair comparison with other methods on the MMAct dataset, we use four different settings. 1) cross-subject: samples from 80% of the subjects are used for training and the remaining 20% for testing; 2) cross-view: samples from three views are used for training and the remaining view for testing; 3) cross-scene: samples from scenes without occlusion are used for training and samples from the occluded scene for testing; 4) cross-session: for each subject, samples from the first 80% of sessions (in ascending order) are used for training and the remaining sessions for testing. The results for the different settings on the MMAct dataset are shown in Table VII. In the cross-view and cross-scene settings, the multi-teacher model achieves better performance than the baseline. This is because wearable-sensor action data are more robust to the occlusion and appearance variations caused by camera viewpoint and scene changes. Since appearance and texture information are important for action recognition in the cross-subject and cross-session settings, the baseline model, which fully utilizes appearance information, performs better than the multi-teacher model, which lacks the texture and color information needed to discriminate different human subjects. Based on these observations, we conclude that the wearable-sensor and vision-sensor modalities are related and complementary. Notably, the performance drops significantly when the GSDM is removed (SADN), which shows that the GSDM is critical for knowledge transfer from wearable-sensors to vision-sensors. In the cross-view setting, where the appearance of human actions varies considerably with the camera view, both SADN (without GSDM) and AKDN (without SP) perform worse than the baseline, while the full SAKDN improves the performance by 3.33%. This shows that the GSDM and SP are both important for addressing the cross-view challenge. More importantly, our SAKDN outperforms both the multi-teacher model and the baseline model under all settings, which verifies that the SAKDN can effectively exploit the complementary knowledge between wearable-sensor and vision-sensor action data and thus improve video action recognition in the wild. Across all settings, the SAKDN performs significantly better than SKDN, KDN, SADN, and AKDN, which demonstrates that the SPAMFM, ST, GSDM, and SP are all complementary and essential.
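As a concrete illustration of the cross-subject protocol described above, a minimal sketch is given below; the sample representation and the subject ordering are assumptions for illustration and may differ from the official benchmark split.

```python
def cross_subject_split(samples, train_ratio=0.8):
    """Split samples by subject id: the first 80% of subjects for training, the rest for testing.

    samples: list of dicts, each with at least a 'subject' key (assumed format).
    """
    subjects = sorted({s["subject"] for s in samples})
    n_train = int(len(subjects) * train_ratio)
    train_subjects = set(subjects[:n_train])
    train = [s for s in samples if s["subject"] in train_subjects]
    test = [s for s in samples if s["subject"] not in train_subjects]
    return train, test
```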

Figure 6: Visualization of GSDM in UTD-MHAD dataset. The maps highlight the discriminative region for action recognition.

To obtain an intuitive understanding of the effect of “ablating” video frames and of what the Graph-guided Semantically Discriminative Mapping (GSDM) learns, we visualize the GSDM generated by the student network (video modality) on the UTD-MHAD dataset in Fig. 6. To facilitate visualization, the GSDM of the inception5b layer is resized to the input size and used to weight each channel of the corresponding input video frame. The examples in Fig. 6 indicate that the GSDM can effectively highlight the important regions for predicting the semantic concept, and can therefore serve as an effective intermediate visual pattern for knowledge distillation.
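The visualization step can be sketched as follows: the inception5b-level GSDM map is upsampled to the frame resolution, normalized, and used to weight the input frame. The function and tensor names are illustrative and not taken from the released code.

```python
import torch.nn.functional as F

def overlay_gsdm(frame, gsdm_map):
    """Resize a GSDM saliency map and use it to weight an input video frame.

    frame:    (3, H, W) float tensor in [0, 1].
    gsdm_map: (h, w) saliency map from an intermediate layer (e.g. inception5b).
    """
    h, w = frame.shape[1:]
    heat = F.interpolate(gsdm_map[None, None], size=(h, w),
                         mode="bilinear", align_corners=False)[0, 0]
    heat = (heat - heat.min()) / (heat.max() - heat.min() + 1e-8)  # normalize to [0, 1]
    return frame * heat  # weight every channel of the frame by the saliency map
```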

Figure 7: Four-dimensional scatter diagram for the parameter sensitivity analysis of $\alpha$, $\beta$, and $\gamma$.
Figure 8: Training loss and accuracy curves for Berkeley-MHAD, UTD-MHAD and MMAct datasets.

IV-D Effect of Different Transfer Layers and Backbones

To validate whether our SAKDN generalizes to different selected layers between the teacher and student networks, we evaluate its performance on the UTD-MHAD dataset using different combinations of $\mathcal{L}_{distill}^{T}$ and $\mathcal{L}_{distill}^{S}$ in Eq. (21) and (22). In addition, we use different backbones (BNInception, ResNet18, and ResNet50) for the student network to measure the generalization ability of the SAKDN across student backbones, as shown in Table VIII. For BNInception, the performance with every choice of transfer layers is better than the baseline (94.87%). This validates that our SAKDN can conduct knowledge distillation across different layers between teacher and student networks. Notably, for the same backbone, the performance gap between different numbers of transfer layers is marginal, which shows that our SAKDN generalizes well to different levels of transfer layers. The performance is best when we use $\mathcal{L}_{distill}^{T}=\{\textrm{c}_{1}^{2},\textrm{c}_{2}^{2},\textrm{c}_{3}^{3},\textrm{c}_{4}^{3},\textrm{c}_{5}^{3}\}$ and $\mathcal{L}_{distill}^{S}=\{\textrm{c}_{2},\textrm{I}_{3c},\textrm{I}_{4c},\textrm{I}_{5a},\textrm{I}_{5b}\}$. When using different backbones, ResNet18, ResNet50, and BNInception all perform better than the baseline, which verifies the generalization ability of the SAKDN across backbones. Among them, ResNet50 and BNInception both achieve better performance than ResNet18, which we attribute to the simpler architecture of ResNet18 for representation learning.
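To make the multi-layer transfer concrete, the sketch below sums a simple attention-style matching term over paired teacher/student feature maps; the layer pairing, pooling, and loss form are illustrative assumptions and not the exact definition of Eq. (21) and (22).

```python
import torch.nn.functional as F

def multi_layer_distill_loss(teacher_feats, student_feats):
    """Sum a feature-matching term over selected teacher/student layer pairs.

    teacher_feats, student_feats: lists of feature maps of shape (B, C, H, W),
    one per selected layer; channel counts and spatial sizes may differ.
    """
    loss = 0.0
    for ft, fs in zip(teacher_feats, student_feats):
        # collapse channels into a spatial attention map, resize the student map
        # to the teacher's resolution, then match the normalized maps
        at = F.normalize(ft.pow(2).mean(dim=1).flatten(1), dim=1)
        a_s = F.normalize(
            F.interpolate(fs.pow(2).mean(dim=1, keepdim=True), size=ft.shape[-2:],
                          mode="bilinear", align_corners=False).flatten(1), dim=1)
        loss = loss + (at - a_s).pow(2).sum(dim=1).mean()
    return loss
```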

TABLE VIII: Average accuracies (%) on UTD-MHAD dataset for different transfer layers and student backbones, where c denotes convolutional layer, I denotes inception layer.
Student Backbone Teacher Layers Student Layers Accuracy
BNInception $\{\textrm{c}_{5}^{3}\}$ $\{\textrm{I}_{5a}\}$ 97.66
BNInception $\{\textrm{c}_{5}^{3}\}$ $\{\textrm{I}_{5b}\}$ 97.90
BNInception $\{\textrm{c}_{4}^{3},\textrm{c}_{5}^{3}\}$ $\{\textrm{I}_{4d},\textrm{I}_{5a}\}$ 97.66
BNInception $\{\textrm{c}_{4}^{3},\textrm{c}_{5}^{3}\}$ $\{\textrm{I}_{5a},\textrm{I}_{5b}\}$ 97.43
BNInception $\{\textrm{c}_{3}^{3},\textrm{c}_{4}^{3},\textrm{c}_{5}^{3}\}$ $\{\textrm{I}_{4b},\textrm{I}_{4d},\textrm{I}_{5a}\}$ 96.27
BNInception $\{\textrm{c}_{3}^{3},\textrm{c}_{4}^{3},\textrm{c}_{5}^{3}\}$ $\{\textrm{I}_{4c},\textrm{I}_{5a},\textrm{I}_{5b}\}$ 96.27
BNInception $\{\textrm{c}_{2}^{2},\textrm{c}_{3}^{3},\textrm{c}_{4}^{3},\textrm{c}_{5}^{3}\}$ $\{\textrm{I}_{3c},\textrm{I}_{4b},\textrm{I}_{4d},\textrm{I}_{5a}\}$ 98.36
BNInception $\{\textrm{c}_{2}^{2},\textrm{c}_{3}^{3},\textrm{c}_{4}^{3},\textrm{c}_{5}^{3}\}$ $\{\textrm{I}_{3c},\textrm{I}_{4c},\textrm{I}_{5a},\textrm{I}_{5b}\}$ 97.90
ResNet18 $\{\textrm{c}_{1}^{2},\textrm{c}_{2}^{2},\textrm{c}_{3}^{3},\textrm{c}_{4}^{3},\textrm{c}_{5}^{3}\}$ $\{\textrm{c}_{1}^{1},\textrm{c}_{2}^{1},\textrm{c}_{3}^{1},\textrm{c}_{4}^{1},\textrm{c}_{5}^{1}\}$ 96.73
ResNet50 $\{\textrm{c}_{1}^{2},\textrm{c}_{2}^{2},\textrm{c}_{3}^{3},\textrm{c}_{4}^{3},\textrm{c}_{5}^{3}\}$ $\{\textrm{c}_{1}^{1},\textrm{c}_{2}^{1},\textrm{c}_{3}^{1},\textrm{c}_{4}^{1},\textrm{c}_{5}^{1}\}$ 98.36
BNInception $\{\textrm{c}_{1}^{2},\textrm{c}_{2}^{2},\textrm{c}_{3}^{3},\textrm{c}_{4}^{3},\textrm{c}_{5}^{3}\}$ $\{\textrm{I}_{3a},\textrm{I}_{3c},\textrm{I}_{4b},\textrm{I}_{4d},\textrm{I}_{5a}\}$ 97.20
ResNet18 $\{\textrm{c}_{1}^{2},\textrm{c}_{2}^{2},\textrm{c}_{3}^{3},\textrm{c}_{4}^{3},\textrm{c}_{5}^{3}\}$ $\{\textrm{c}_{1}^{1},\textrm{c}_{2}^{4},\textrm{c}_{3}^{4},\textrm{c}_{4}^{4},\textrm{c}_{5}^{4}\}$ 95.33
ResNet50 $\{\textrm{c}_{1}^{2},\textrm{c}_{2}^{2},\textrm{c}_{3}^{3},\textrm{c}_{4}^{3},\textrm{c}_{5}^{3}\}$ $\{\textrm{c}_{1}^{1},\textrm{c}_{2}^{9},\textrm{c}_{3}^{12},\textrm{c}_{4}^{18},\textrm{c}_{5}^{9}\}$ 97.90
BNInception $\{\textrm{c}_{1}^{2},\textrm{c}_{2}^{2},\textrm{c}_{3}^{3},\textrm{c}_{4}^{3},\textrm{c}_{5}^{3}\}$ $\{\textrm{c}_{2},\textrm{I}_{3c},\textrm{I}_{4c},\textrm{I}_{5a},\textrm{I}_{5b}\}$ 98.60
TABLE IX: Parameter sensitivity analysis of $\alpha$, $\beta$, and $\gamma$ on the Berkeley-MHAD, UTD-MHAD, and MMAct datasets.
$\alpha$ $\beta$ $\gamma$ Berkeley-MHAD UTD-MHAD MMAct (cross-scene)
1 1 1 98.27 95.80 58.32
1 1 0.1 98.39 95.57 60.01
1 0.1 1 98.54 95.80 59.11
1 0.1 0.1 98.72 96.27 56.01
0.1 1 1 97.96 98.60 62.37
0.1 1 0.1 97.90 97.20 62.01
0.1 0.1 1 99.33 96.73 59.47
0.1 0.1 0.1 98.63 97.90 60.79
0.01 1 1 98.39 96.96 59.63
0.01 1 0.1 98.48 96.73 60.92
0.01 0.1 1 98.33 97.66 56.42
0.01 0.1 0.1 98.72 98.13 61.32

IV-E Parameter Sensitivity Analysis

There are three hyper-parameters $\alpha$, $\beta$, and $\gamma$ in Eq. (25). To study how they influence performance, we conduct a parameter sensitivity analysis on the Berkeley-MHAD, UTD-MHAD, and MMAct (cross-scene) datasets using grid search. The parameter $\alpha$ takes values in $\{0.01,0.1,1\}$, and the parameters $\beta$ and $\gamma$ take values in $\{0.1,1\}$. Table IX shows the performance of our SAKDN for different values of $\alpha$, $\beta$, and $\gamma$. To give a more intuitive picture of how the optimal values are chosen, we also transform Table IX into a four-dimensional scatter diagram in polar coordinates, shown in Fig. 7. From Table IX and Fig. 7, we can see that the optimal values for the Berkeley-MHAD dataset are $\{0.1,0.1,1\}$, while for both the UTD-MHAD and MMAct datasets they are $\{0.1,1,1\}$. This indicates that semantics-preserving should be weighted more heavily than the soft-target and GSDM terms, because the semantic relationship contributes substantially to multi-modal feature fusion, knowledge transfer, and representation learning in our SAKDN. For parameter $\alpha$, the optimal value is 0.1 on all datasets. This is because the soft-target loss only conducts knowledge distillation at the last fully-connected layers, while the GSDM and SP losses contribute to knowledge transfer throughout all layers. For parameter $\beta$, the optimal value on the Berkeley-MHAD dataset is 0.1, while on both the UTD-MHAD and MMAct datasets it is 1. Since the Berkeley-MHAD dataset has six teacher modalities, more than the UTD-MHAD and MMAct datasets, $\beta$ should be set to a moderate value to avoid overfitting to any single teacher modality during knowledge distillation. The value of $\gamma$ is set to 1 for all datasets.
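The grid search itself is straightforward; a minimal sketch is shown below, where train_and_evaluate is a placeholder standing in for a full SAKDN training and evaluation run.

```python
import itertools

def grid_search(train_and_evaluate):
    """Exhaustively evaluate the loss weights alpha, beta, gamma of Eq. (25)."""
    alphas, betas, gammas = [0.01, 0.1, 1], [0.1, 1], [0.1, 1]
    best_params, best_acc = None, -1.0
    for alpha, beta, gamma in itertools.product(alphas, betas, gammas):
        acc = train_and_evaluate(alpha=alpha, beta=beta, gamma=gamma)
        if acc > best_acc:
            best_params, best_acc = (alpha, beta, gamma), acc
    return best_params, best_acc
```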

Figure 9: CAM visualization examples of space-time regions on the Berkeley-MHAD, UTD-MHAD, and MMAct datasets. We visualize these regions as overlaid heat maps, where red and blue correspond to highly and weakly activated regions, respectively. Interestingly, our model is able to pick up the salient mover in the presence of significant camera motion, background clutter, occlusion, and appearance variation. The red dashed ellipses in the failure-case images denote the true space-time regions.

IV-F Visualization Analysis

To show the training behaviour of the proposed method, we plot the training loss and accuracy curves for the Berkeley-MHAD, UTD-MHAD, and MMAct datasets in Fig. 8. The training loss decreases steadily and converges to a small value as training proceeds, while the training accuracy increases with the number of iterations. This validates that our proposed loss function is effective and converges well.
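A minimal sketch of producing such curves from logged values is given below; the logging format (plain lists of per-iteration values) is an assumption.

```python
import matplotlib.pyplot as plt

def plot_training_curves(losses, accuracies, title):
    """Plot training loss and accuracy against iteration on twin y-axes."""
    fig, ax_loss = plt.subplots()
    ax_acc = ax_loss.twinx()
    ax_loss.plot(losses, color="tab:red")
    ax_acc.plot(accuracies, color="tab:blue")
    ax_loss.set_xlabel("iteration")
    ax_loss.set_ylabel("training loss")
    ax_acc.set_ylabel("training accuracy (%)")
    ax_loss.set_title(title)
    fig.tight_layout()
    return fig
```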

To gain a better understanding of which space-time regions contribute to our predictions, we follow the Class Activation Map (CAM) technique [85] and visualize the energy of the last convolutional layer in the student network, before the global max and average pooling. We show some correct examples and failure examples on the Berkeley-MHAD, UTD-MHAD, and MMAct datasets to further analyze our proposed method. Fig. 9 depicts the computed heat maps superimposed over sample video frames. In most cases, these examples show a strong correlation between highly activated regions and the dominant movement in the scene, even for complex interactive actions performed in the presence of significant camera motion, background clutter, occlusion, and appearance variation. For example, in the fourth row, our model captures the dominant moving regions under different camera views. In the fifth row, despite significant background clutter, occlusion, and appearance variation, our model still concentrates on the body motion. These visualization results validate that our proposed model takes full advantage of both the vision-sensor and wearable-sensor modalities in addressing background clutter, occlusion, and appearance variation, and is thus able to focus on the salient movement under these challenges. Note that in some extremely challenging cases our model fails to capture the true moving regions in the video frames. For example, in the failure cases, our model mistakes some background objects (a potted plant, heavy luggage) for the human subject due to high appearance similarity, severe background clutter, and occlusion.
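For reference, the CAM of [85] for a single frame or clip can be sketched as below; the tensor names are illustrative and assume a network that ends in global pooling followed by a linear classifier.

```python
import torch
import torch.nn.functional as F

def class_activation_map(features, fc_weights, class_idx, out_size):
    """Compute a CAM heat map for one class (illustrative sketch).

    features:   (C, H, W) activations of the last convolutional layer.
    fc_weights: (num_classes, C) weights of the final linear classifier.
    out_size:   (height, width) of the input frame for upsampling.
    """
    weights = fc_weights[class_idx]                      # (C,)
    cam = torch.einsum("c,chw->hw", weights, features)   # weighted sum over channels
    cam = F.relu(cam)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # normalize to [0, 1]
    cam = F.interpolate(cam[None, None], size=out_size,
                        mode="bilinear", align_corners=False)[0, 0]
    return cam
```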

V Conclusion

In this paper, we propose an end-to-end knowledge distillation framework, named Semantics-aware Adaptive Knowledge Distillation Networks (SAKDN), to adaptively distill the complementary knowledge from multiple wearable-sensors (teachers) to the vision-sensor (student) and thereby improve action recognition in the vision-sensor modality (videos). To fully utilize the complementary knowledge from multiple teachers, we propose a novel plug-and-play module, named Similarity-Preserving Adaptive Multi-modal Fusion Module (SPAMFM), which integrates intra-modality similarity, semantic embeddings, and multiple relational knowledge to learn the global context representation and adaptively recalibrate the channel-wise features in each teacher network. To effectively exploit and transfer the knowledge of multiple well-trained teachers to the student, we propose a novel knowledge distillation module, named Graph-guided Semantically Discriminative Mapping (GSDM), which utilizes graph-guided ablation analysis to produce a visual explanation that highlights the important regions for predicting the semantic concept while preserving the interrelations of the original data. Extensive experiments on three benchmarks demonstrate the effectiveness of our SAKDN for adaptive knowledge transfer from wearable-sensors to vision-sensors.

References

  • [1] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool, “Temporal segment networks for action recognition in videos,” IEEE transactions on pattern analysis and machine intelligence, vol. 41, no. 11, pp. 2740–2755, 2018.
  • [2] B. Zhou, A. Andonian, A. Oliva, and A. Torralba, “Temporal relational reasoning in videos,” in Proceedings of the European Conference on Computer Vision, 2018, pp. 803–818.
  • [3] X. Wang, R. Girshick, A. Gupta, and K. He, “Non-local neural networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 7794–7803.
  • [4] Y. Liu, K. Wang, H. Lan, and L. Lin, “Temporal contrastive graph for self-supervised video representation learning,” arXiv preprint arXiv:2101.00820, 2021.
  • [5] Q. Kong, Z. Wu, Z. Deng, M. Klinkigt, B. Tong, and T. Murakami, “Mmact: A large-scale dataset for cross modal human action understanding,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 8658–8667.
  • [6] M. Ehatisham-Ul-Haq, A. Javed, M. A. Azam, H. M. Malik, A. Irtaza, I. H. Lee, and M. T. Mahmood, “Robust human activity recognition using multimodal feature-level fusion,” IEEE Access, vol. 7, pp. 60 736–60 751, 2019.
  • [7] W. Jiang and Z. Yin, “Human activity recognition using wearable sensors by deep convolutional neural networks,” in Proceedings of the 23rd ACM international conference on Multimedia, 2015, pp. 1307–1310.
  • [8] J. Wannenburg and R. Malekian, “Physical activity recognition from smartphone accelerometer data for user context awareness sensing,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 47, no. 12, pp. 3142–3149, 2016.
  • [9] J. Wang, Y. Chen, S. Hao, X. Peng, and L. Hu, “Deep learning for sensor-based activity recognition: A survey,” Pattern Recognition Letters, vol. 119, pp. 3–11, 2019.
  • [10] K. Wang, J. He, and L. Zhang, “Attention-based convolutional neural network for weakly labeled human activities’ recognition with wearable sensors,” IEEE Sensors Journal, vol. 19, no. 17, pp. 7598–7604, 2019.
  • [11] Z. Ahmad and N. Khan, “Human action recognition using deep multilevel multimodal (m2) fusion of depth and inertial sensors,” IEEE Sensors Journal, vol. 20, no. 3, pp. 1445–1455, 2019.
  • [12] N. Dawar, S. Ostadabbas, and N. Kehtarnavaz, “Data augmentation in deep learning-based fusion of depth and inertial sensing for action recognition,” IEEE Sensors Letters, vol. 3, no. 1, pp. 1–4, 2018.
  • [13] H. Wei, R. Jafari, and N. Kehtarnavaz, “Fusion of video and inertial sensing for deep learning–based human action recognition,” Sensors, vol. 19, no. 17, p. 3680, 2019.
  • [14] N. C. Garcia, P. Morerio, and V. Murino, “Learning with privileged information via adversarial discriminative modality distillation,” IEEE transactions on pattern analysis and machine intelligence, vol. 42, no. 10, pp. 2581–2593, 2020.
  • [15] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” arXiv preprint arXiv:1503.02531, 2015.
  • [16] M. Phuong and C. Lampert, “Towards understanding knowledge distillation,” in International Conference on Machine Learning, 2019, pp. 5142–5151.
  • [17] Z. Wang and T. Oates, “Imaging time-series to improve classification and imputation,” in Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015, pp. 3939–3945.
  • [18] J. K. Aggarwal and M. S. Ryoo, “Human activity analysis: A review,” ACM Computing Surveys (CSUR), vol. 43, no. 3, pp. 1–43, 2011.
  • [19] J. M. Chaquet, E. J. Carmona, and A. Fernández-Caballero, “A survey of video datasets for human action and activity recognition,” Computer Vision and Image Understanding, vol. 117, no. 6, pp. 633–659, 2013.
  • [20] M. J. Roshtkhari and M. D. Levine, “Human activity recognition in videos using a single example,” Image and Vision Computing, vol. 31, no. 11, pp. 864–876, 2013.
  • [21] X. Wang and C. Qi, “Action recognition using edge trajectories and motion acceleration descriptor,” Machine Vision and Applications, vol. 27, no. 6, pp. 861–875, 2016.
  • [22] X. Wang and Q. Ji, “Hierarchical context modeling for video event recognition,” IEEE transactions on pattern analysis and machine intelligence, vol. 39, no. 9, pp. 1770–1782, 2016.
  • [23] A. B. Sargano, P. Angelov, and Z. Habib, “A comprehensive review on handcrafted and learning-based action representation approaches for human activity recognition,” applied sciences, vol. 7, no. 1, p. 110, 2017.
  • [24] S. Ma, J. Zhang, S. Sclaroff, N. Ikizler-Cinbis, and L. Sigal, “Space-time tree ensemble for action recognition and localization,” International Journal of Computer Vision, vol. 126, no. 2, pp. 314–332, 2018.
  • [25] S. Siddiqui, M. A. Khan, K. Bashir, M. Sharif, F. Azam, and M. Y. Javed, “Human action recognition: a construction of codebook by discriminative features selection approach,” International Journal of Applied Pattern Recognition, vol. 5, no. 3, pp. 206–228, 2018.
  • [26] A. B. Sargano, X. Gu, P. Angelov, and Z. Habib, “Human action recognition using deep rule-based classifier,” Multimedia Tools and Applications, vol. 79, no. 41, pp. 30 653–30 667, 2020.
  • [27] L. Wang and H. Sahbi, “Directed acyclic graph kernels for action recognition,” in Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 3168–3175.
  • [28] A. Mazari and H. Sahbi, “Mlgcn: Multi-laplacian graph convolutional networks for human action recognition.” in BMVC, 2019, p. 281.
  • [29] P. Zhang, C. Lan, W. Zeng, J. Xing, J. Xue, and N. Zheng, “Semantics-guided neural networks for efficient skeleton-based human action recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020, pp. 1112–1121.
  • [30] S. Ji, W. Xu, M. Yang, and K. Yu, “3d convolutional neural networks for human action recognition,” IEEE transactions on pattern analysis and machine intelligence, vol. 35, no. 1, pp. 221–231, 2012.
  • [31] K. Simonyan and A. Zisserman, “Two-stream convolutional networks for action recognition in videos,” in Advances in neural information processing systems, 2014, pp. 568–576.
  • [32] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning spatiotemporal features with 3d convolutional networks,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 4489–4497.
  • [33] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool, “Temporal segment networks: Towards good practices for deep action recognition,” in European conference on computer vision, 2016, pp. 20–36.
  • [34] A. B. Sargano, X. Wang, P. Angelov, and Z. Habib, “Human action recognition using transfer learning with deep representations,” in International joint conference on neural networks, 2017, pp. 463–469.
  • [35] J. Lin, C. Gan, and S. Han, “Tsm: Temporal shift module for efficient video understanding,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 7083–7093.
  • [36] H. Wang and C. Schmid, “Action recognition with improved trajectories,” in Proceedings of the IEEE international conference on computer vision, 2013, pp. 3551–3558.
  • [37] Y. Yuan, Y. Feng, and X. Lu, “Statistical hypothesis detector for abnormal event detection in crowded scenes,” IEEE transactions on cybernetics, vol. 47, no. 11, pp. 3597–3608, 2016.
  • [38] Y. Yuan, D. Wang, and Q. Wang, “Memory-augmented temporal dynamic learning for action recognition,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, no. 01, 2019, pp. 9167–9175.
  • [39] C. Li, B. Zhang, C. Chen, Q. Ye, J. Han, G. Guo, and R. Ji, “Deep manifold structure transfer for action recognition,” IEEE Transactions on Image Processing, vol. 28, no. 9, pp. 4646–4658, 2019.
  • [40] J. Liu, A. Shahroudy, M. L. Perez, G. Wang, L.-Y. Duan, and A. K. Chichung, “Ntu rgb+d 120: A large-scale benchmark for 3d human activity understanding,” IEEE transactions on pattern analysis and machine intelligence, vol. 42, no. 10, pp. 2684–2701, 2020.
  • [41] W. Li, L. Chen, D. Xu, and L. Van Gool, “Visual recognition in rgb images and videos by learning from rgb-d data,” IEEE transactions on pattern analysis and machine intelligence, vol. 40, no. 8, pp. 2030–2036, 2017.
  • [42] S. Yan, Y. Xiong, and D. Lin, “Spatial temporal graph convolutional networks for skeleton-based action recognition,” in Proceedings of the AAAI conference on artificial intelligence, vol. 32, no. 1, 2018.
  • [43] F. Meng, H. Liu, Y. Liang, J. Tu, and M. Liu, “Sample fusion network: an end-to-end data augmentation network for skeleton-based human action recognition,” IEEE Transactions on Image Processing, vol. 28, no. 11, pp. 5281–5295, 2019.
  • [44] Y. Liu, Z. Lu, J. Li, T. Yang, and C. Yao, “Global temporal representation based cnns for infrared action recognition,” IEEE Signal Processing Letters, vol. 25, no. 6, pp. 848–852, 2018.
  • [45] B. Zhang, Y. Yang, C. Chen, L. Yang, J. Han, and L. Shao, “Action recognition using 3d histograms of texture and a multi-class boosting classifier,” IEEE Transactions on Image processing, vol. 26, no. 10, pp. 4648–4660, 2017.
  • [46] O. D. Lara and M. A. Labrador, “A survey on human activity recognition using wearable sensors,” IEEE communications surveys & tutorials, vol. 15, no. 3, pp. 1192–1209, 2012.
  • [47] F. Setiawan, B. N. Yahya, and S.-L. Lee, “Deep activity recognition on imaging sensor data,” Electronics Letters, vol. 55, no. 17, pp. 928–931, 2019.
  • [48] M. Fazli, K. Kowsari, E. Gharavi, L. Barnes, and A. Doryab, “Hhar-net: Hierarchical human activity recognition using neural networks,” arXiv preprint arXiv:2010.16052, 2020.
  • [49] Y. Liu, Z. Lu, J. Li, and T. Yang, “Hierarchically learned view-invariant representations for cross-view action recognition,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 29, no. 8, pp. 2416–2430, 2018.
  • [50] L. Wang, Z. Ding, Z. Tao, Y. Liu, and Y. Fu, “Generative multi-view human action recognition,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 6212–6221.
  • [51] A. Shahroudy, T.-T. Ng, Y. Gong, and G. Wang, “Deep multimodal feature analysis for action recognition in rgb+ d videos,” IEEE transactions on pattern analysis and machine intelligence, vol. 40, no. 5, pp. 1045–1058, 2017.
  • [52] Y. Liu, Z. Lu, J. Li, C. Yao, and Y. Deng, “Transferable feature representation for visible-to-infrared cross-dataset human action recognition,” Complexity, vol. 2018, 2018.
  • [53] Y. Yuan, Y. Zhao, and Q. Wang, “Action recognition using spatial-optical data organization and sequential learning framework,” Neurocomputing, vol. 315, pp. 221–233, 2018.
  • [54] F. Yu, X. Wu, J. Chen, and L. Duan, “Exploiting images for video recognition: Heterogeneous feature augmentation via symmetric adversarial learning,” IEEE Transactions on Image Processing, vol. 28, no. 11, pp. 5308–5321, 2019.
  • [55] Y. Liu, Z. Lu, J. Li, T. Yang, and C. Yao, “Deep image-to-video adaptation and fusion networks for action recognition,” IEEE Transactions on Image Processing, vol. 29, pp. 3168–3182, 2020.
  • [56] C. Chen, R. Jafari, and N. Kehtarnavaz, “Improving human action recognition using fusion of depth camera and inertial sensors,” IEEE Transactions on Human-Machine Systems, vol. 45, no. 1, pp. 51–61, 2014.
  • [57] H. R. V. Joze, A. Shaban, M. L. Iuzzolino, and K. Koishida, “Mmtm: multimodal transfer module for cnn fusion,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020, pp. 13 289–13 299.
  • [58] S. Zagoruyko and N. Komodakis, “Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer,” in International Conference on Learning Representations, 2017.
  • [59] W. Park, D. Kim, Y. Lu, and M. Cho, “Relational knowledge distillation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 3967–3976.
  • [60] F. Tung and G. Mori, “Similarity-preserving knowledge distillation,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 1365–1374.
  • [61] J. Hoffman, S. Gupta, and T. Darrell, “Learning with side information through modality hallucination,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 826–834.
  • [62] N. C. Garcia, P. Morerio, and V. Murino, “Modality distillation with multiple stream networks for action recognition,” in Proceedings of the European Conference on Computer Vision, 2018, pp. 103–118.
  • [63] N. Crasto, P. Weinzaepfel, K. Alahari, and C. Schmid, “Mars: Motion-augmented rgb stream for action recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 7882–7891.
  • [64] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 7132–7141.
  • [65] H. Zhao, J. Jia, and V. Koltun, “Exploring self-attention for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020, pp. 10 076–10 085.
  • [66] A. S. Morcos, D. G. T. Barrett, N. C. Rabinowitz, and M. Botvinick, “On the importance of single directions for generalization,” in International Conference on Learning Representations, 2018.
  • [67] B. Zhou, Y. Sun, D. Bau, and A. Torralba, “Revisiting the importance of individual units in cnns via ablation,” arXiv preprint arXiv:1806.02891, 2018.
  • [68] S. Desai and H. G. Ramaswamy, “Ablation-cam: Visual explanations for deep convolutional network via gradient-free localization,” in IEEE Winter Conference on Applications of Computer Vision, 2020, pp. 972–980.
  • [69] J. Pennington, R. Socher, and C. D. Manning, “Glove: Global vectors for word representation,” in Proceedings of the conference on empirical methods in natural language processing, 2014, pp. 1532–1543.
  • [70] F. Nie, D. Xu, I. W.-H. Tsang, and C. Zhang, “Flexible manifold embedding: A framework for semi-supervised and unsupervised dimension reduction,” IEEE Transactions on Image Processing, vol. 19, no. 7, pp. 1921–1932, 2010.
  • [71] F. R. Chung and F. C. Graham, Spectral graph theory.   American Mathematical Soc., 1997, no. 92.
  • [72] D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schölkopf, “Learning with local and global consistency,” Advances in neural information processing systems, vol. 16, no. 16, pp. 321–328, 2004.
  • [73] Y. Liu, J. Lee, M. Park, S. Kim, E. Yang, S. J. Hwang, and Y. Yang, “Learning to propagate labels: Transductive propagation network for few-shot learning,” in International Conference on Learning Representations, 2019.
  • [74] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, “Grad-cam: Visual explanations from deep networks via gradient-based localization,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 618–626.
  • [75] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in International Conference on Learning Representations, 2015.
  • [76] F. Ofli, R. Chaudhry, G. Kurillo, R. Vidal, and R. Bajcsy, “Berkeley mhad: A comprehensive multimodal human action database,” in IEEE Workshop on Applications of Computer Vision, 2013, pp. 53–60.
  • [77] C. Chen, R. Jafari, and N. Kehtarnavaz, “Utd-mhad: A multimodal dataset for human action recognition utilizing a depth camera and a wearable inertial sensor,” in IEEE International conference on image processing, 2015, pp. 168–172.
  • [78] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga et al., “Pytorch: An imperative style, high-performance deep learning library,” in Advances in Neural Information Processing Systems, 2019, pp. 8024–8035.
  • [79] A. Shafaei and J. J. Little, “Real-time human motion capture with multiple depth cameras,” in International Conference on Computer and Robot Vision (CRV), 2016, pp. 24–31.
  • [80] E. P. Ijjina and C. K. Mohan, “Human action recognition based on mocap information using convolution neural networks,” in International Conference on Machine Learning and Applications, 2014, pp. 159–164.
  • [81] J. Ba and R. Caruana, “Do deep nets really need to be deep?” in Advances in neural information processing systems, 2014, pp. 2654–2662.
  • [82] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio, “Fitnets: Hints for thin deep nets,” in International Conference on Learning Representations, 2015.
  • [83] B. Peng, X. Jin, J. Liu, D. Li, Y. Wu, Y. Liu, S. Zhou, and Z. Zhang, “Correlation congruence for knowledge distillation,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 5007–5016.
  • [84] C. Chen, R. Jafari, and N. Kehtarnavaz, “A real-time human action recognition system using depth and inertial sensor fusion,” IEEE Sensors Journal, vol. 16, no. 3, pp. 773–781, 2015.
  • [85] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, “Learning deep features for discriminative localization,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 2921–2929.
[Uncaptioned image] Yang Liu (M’21) is currently a Postdoctoral Researcher in the School of Computer Science and Engineering, Sun Yat-Sen University, working with Prof. Liang Lin. He received the Ph.D. degree in telecommunications and information systems from Xidian University, Xi’an, China, in June 2019, advised by Prof. Zhaoyang Lu. Before that, he received the B.S. degree in telecommunications engineering from Chang’an University, Xi’an, China, in 2014. His current research interests include video understanding, transfer learning and computer vision. He has been serving as a reviewer for numerous academic journals, including TNNLS, TMM, TCyb, TCSVT, THMS, SPL and PR. More information can be found on his personal website https://yangliu9208.github.io/home.
[Uncaptioned image] Keze Wang received his B.S. degree in software engineering from Sun Yat-Sen University, Guangzhou, China, in 2012. He obtained his Ph.D. degree with honors from the School of Data and Computer Science at Sun Yat-Sen University in December 2017, advised by Prof. Liang Lin. He also obtained a dual Ph.D. degree from the Department of Computing of the Hong Kong Polytechnic University in March 2019, advised by Prof. Lei Zhang. His current research interests include computer vision and machine learning. More information can be found on his personal website https://kezewang.com.
[Uncaptioned image] Guanbin Li (M’15) is currently an associate professor in the School of Computer Science and Engineering, Sun Yat-Sen University. He received his PhD degree from the University of Hong Kong in 2016. His current research interests include computer vision, image processing, and deep learning. He is a recipient of the ICCV 2019 Best Paper Nomination Award. He has authored or co-authored more than 70 papers in top-tier academic journals and conferences. He serves as an area chair for VISAPP. He has been serving as a reviewer for numerous academic journals and conferences such as TPAMI, IJCV, TIP, TMM, TCyb, CVPR, ICCV, ECCV and NeurIPS.
[Uncaptioned image] Liang Lin (M’09, SM’15) is a Full Professor of computer science at Sun Yat-Sen University. He served as the Executive Director and Distinguished Scientist of SenseTime Group from 2016 to 2018, leading the R&D teams for cutting-edge technology transfer. He has authored or co-authored more than 200 papers in leading academic journals and conferences (e.g., 20+ papers in TPAMI/IJCV), and his papers have been cited more than 16,000 times. He is an associate editor of IEEE Trans. Neural Networks and Learning Systems and IEEE Trans. Human-Machine Systems, and has served as an Area Chair for numerous conferences such as CVPR, ICCV, SIGKDD and AAAI. He is the recipient of numerous awards and honors including the Wu Wen-Jun Artificial Intelligence Award, the First Prize of the China Society of Image and Graphics, the ICCV Best Paper Nomination in 2019, the Annual Best Paper Award by Pattern Recognition (Elsevier) in 2018, the Best Paper Diamond Award at IEEE ICME 2017, and a Google Faculty Award in 2012. His supervised PhD students received the ACM China Doctoral Dissertation Award, the CCF Best Doctoral Dissertation, and the CAAI Best Doctoral Dissertation. He is a Fellow of IET.