
1 Shenzhen Institute for Advanced Study, University of Electronic Science and Technology of China
2 China Unicom Shenzhen Branch
Emails: [email protected], [email protected], {shaohua.wan, lxduan}@uestc.edu.cn

Multi-modal Instance Refinement for Cross-domain Action Recognition

Yuan Qing 1    Naixing Wu 2    Shaohua Wan 1    Lixin Duan(🖂) 1
Abstract

Unsupervised cross-domain action recognition aims at adapting a model trained on an existing labeled source domain to a new unlabeled target domain. Most existing methods solve the task by directly aligning the feature distributions of the source and target domains. However, this can cause negative transfer during domain adaptation due to negative training samples in both domains. In the source domain, some training samples are of low relevance to the target domain due to differences in viewpoints, action styles, etc. In the target domain, some ambiguous training samples can easily be misclassified as another type of action defined in the source domain. The problem of negative transfer has been explored in cross-domain object detection, while it remains under-explored in cross-domain action recognition. Therefore, we propose a Multi-modal Instance Refinement (MMIR) method based on reinforcement learning to alleviate negative transfer. Specifically, a reinforcement learning agent is trained in both domains for every modality to refine the training data by selecting out negative samples from each domain. Our method outperforms several other state-of-the-art baselines in cross-domain action recognition on the benchmark EPIC-Kitchens [4] dataset, which demonstrates the advantage of MMIR in reducing negative transfer.

Keywords:
Negative Transfer · Cross-domain Action Recognition · Unsupervised Domain Adaptation · Reinforcement Learning.

1 Introduction

Action recognition is one of the most important tasks in video understanding, which aims at recognizing and predicting human actions in videos [24, 22, 1, 4]. The majority of works in action recognition are carried out on the basis of supervised learning, which involves using a large number of labeled video/action segments to train a model for a specified scene setting. However, obtaining a large amount of annotated video data for a certain scene is costly and sometimes difficult due to the environment setup, video post-processing and labeling. To fully leverage existing labeled data and reduce the cost of acquiring new data, Unsupervised Domain Adaptation (UDA) [18, 20, 14] has been introduced to generalize a model trained on a source domain with adequate annotations to a target domain with no labels, where the two domains differ from each other but are partially related. For action recognition, though several UDA methods [26, 10, 17] have been proposed, most of them directly align the feature distributions of the source and target domains. However, this can lead to negative transfer during domain adaptation due to negative training samples in both domains [3]. For instance, there might be ambiguous actions in the target domain that do not belong to the defined action categories in the source domain or are very similar to another kind of action in the source domain. There might also be less-relevant actions in the source domain whose viewpoints or action styles differ completely from samples in the target domain. Specifically, Fig. 1 shows these two types of negative samples in domains D2 (source) and D3 (target) defined in [17] from the EPIC-Kitchens dataset [4]. In Fig. 1(a), the action open in the source domain is considered less relevant to that of the target domain since the trajectory of motion and way of opening are dissimilar. In Fig. 1(b), a spraying action in the target domain that does not belong to a predefined action type is likely to be mistakenly recognized as wipe due to the similarity in action style and appearance.

[Figure: (a) Less-relevant action; (b) Ambiguous action]
Figure 1: Negative samples that cause negative transfer from source domain D2 (first row) and target domain D3 (second row) defined in [17].

To alleviate the impact of negative transfer brought by these negative training samples, we propose Multi-modal Instance Refinement (MMIR) based on deep Q-learning (DQN) [15], under the framework of MM-SADA [17]. Our MMIR trains reinforcement learning agents in both domains in each modality to refine source and target samples by selecting out less-relevant instances from the source domain and ambiguous instances from the target domain. To the best of our knowledge, there is no previous work on reducing negative transfer in cross-domain action recognition. Our contributions are summarized as follows:

  • As far as we know, we are the first to define and tackle the issue of negative transfer in cross-domain action recognition.

  • We adopt a novel instance refinement strategy using deep reinforcement learning to select outlier instances in source domain and target domain within two modalities (RGB and optical flow).

  • Our method achieves superior performance compared with other state-of-the-art methods in cross-domain action recognition on EPIC-Kitchens dataset.

2 Related Work

2.1 Action Recognition

In action recognition, early works use 2D/3D convolution [9, 8] for feature extraction in a single modality, i.e., RGB. Later, the optical flow of video segments is used as auxiliary training data, as it carries more temporal and motion information than RGB [21]. Therefore, current popular CNN-based methods adopt a two-stream 3D convolutional neural network structure [19, 1] for feature extraction, which can utilize the information contained in multiple modalities and model temporal information. Most recently, vision transformer [6] based approaches have surpassed CNN-based methods on many benchmarks. MVD [22] builds a two-stage masked feature modeling framework, where the high-level features of pretrained models learned in the first stage serve as masked prediction targets for the student model in the second stage. Although these methods show promising performance in the supervised setting, we focus on action recognition under the setting of UDA.

2.2 Unsupervised Domain Adaptation for Action Recognition

Though both RGB and optical flow have been studied for domain adaptation in action recognition, only a limited number of works have attempted multi-modal domain adaptation [26, 10, 17]. Munro and Damen [17] propose MM-SADA, a multi-modal 3D convolutional neural network with a self-supervision classifier between modalities; it uses a Gradient Reversal Layer (GRL) [7] to implement the domain discriminator within different modalities. Kim et al. [10] apply contrastive learning to design a unified transformer-based framework for multi-modal UDA in video understanding. Xu et al. [26] propose a source-free UDA model that learns temporal consistency in videos between the source and target domains. Similar to [17], our work adopts a multi-modal 3D ConvNet for feature extraction and utilizes domain adversarial learning, but we focus on a different task by incorporating deep reinforcement learning into our action recognition framework to eliminate the effect of negative transfer.

2.3 Deep Reinforcement Learning

Deep reinforcement learning has been applied to various tasks in computer vision [23, 27]. Wang et al. [23] design a reinforcement learning-based two-level framework for video captioning, in which a low-level module recognizes the original actions to fulfill the goals specified by the high-level module. ReinforceNet [27] incorporates region selection and bounding box refinement networks to form a reinforcement learning framework based on CNN to select optimal proposals and refine bounding box positions. Recently, several works apply reinforcement learning to action recognition [5, 25]. Dong et al. [5] design a deep reinforcement learning framework to capture the most discriminative frames and delete confusing frames in action segments. Weng et al. [25] improve recognition accuracy by designing agents that learn to produce binary masks to select out interfering categories. All these methods adopt deep reinforcement learning to refine negative frames in the action segment within the same domain, while our method uses deep reinforcement learning to refine negative action segments across domains to handle negative transfer in cross-domain action recognition.

3 Proposed Method

In unsupervised cross-domain action recognition, two domains are given, namely source and target. The labeled source domain is denoted as $\mathcal{D}^s = \{(x_i^s, y_i^s)\}_{i=1}^{N_s}$, where $x_i^s$ is the $i$-th action segment and $y_i^s$ is its verb label. Similarly, the unlabeled target domain is represented by $\mathcal{D}^t = \{x_i^t\}_{i=1}^{N_t}$, where $x_i^t$ is the $i$-th action segment. Each action segment is a sequence of $L$ frames, so we have $x_i^s = \{x_{i,1}^s, x_{i,2}^s, x_{i,3}^s, \ldots, x_{i,L}^s\}$ and $x_i^t = \{x_{i,1}^t, x_{i,2}^t, x_{i,3}^t, \ldots, x_{i,L}^t\}$, respectively.
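As a concrete reference for this setup, the following is a minimal sketch (in PyTorch-style Python, not the authors' code) of the data structures described above; the class name Segment, the tensor shapes and the 16-frame length are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple
import torch

@dataclass
class Segment:
    frames: torch.Tensor          # (L, C, H, W) RGB or optical-flow frames
    label: Optional[int] = None   # verb label y_i^s for source, None for target

def make_domains(n_s: int, n_t: int, L: int = 16) -> Tuple[List[Segment], List[Segment]]:
    """Toy labeled source domain D^s and unlabeled target domain D^t."""
    source = [Segment(torch.randn(L, 3, 224, 224), label=i % 8) for i in range(n_s)]
    target = [Segment(torch.randn(L, 3, 224, 224)) for _ in range(n_t)]
    return source, target
```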

To reduce negative transfer during domain adaptation, two reinforcement learning agents, S-agent and T-agent, are defined under a deep Q-learning network (DQN) to make selections in the source and target domains. S-agent learns policies to select out less-relevant action segments from the source action segments $x^s$, while T-agent is trained to select out ambiguous action segments from the target action segments $x^t$. After refinement, we use the refined instances $\hat{x}^s$ and $\hat{x}^t$ to train our domain discriminator to learn domain-invariant features.

The following sections give detailed explanations of our proposed Multi-modal Instance Refinement (MMIR) method. The architecture of MMIR is shown in Fig. 2(a): a two-stream 3D convolutional feature extractor in two modalities (RGB and optical flow), together with a domain discriminator and an instance refinement module in each modality, followed by an action classifier. The structure of the domain discriminator and the instance refinement module is depicted in Fig. 2(b).

[Figure: (a) Overall model structure; (b) Domain discriminator and instance refinement agents]
Figure 2: Structure of proposed method MMIR: (a) An I3D [1] feature extractor in each modality is shared by both domains. The output feature vectors of I3D network are fed to the instance refinement and domain adversarial learning modules as well as action classifiers. (b) S-agent and T-agent are built for source and target domain, respectively. They take input feature vectors as state and make selections in the training instances to select out noisy samples. A domain classifier with GRL is optimized with refined instances, which gives rewards to agents according to their selections.

3.1 Two-stream Action Recognition

For multiple modalities, a feature fusion layer is applied after the action classifiers to sum the prediction scores from different modalities, as shown in Fig. 2(a). For an input $\mathcal{X}$ with multiple modalities, we have $\mathcal{X} = (\mathcal{X}^1, \mathcal{X}^2, \ldots, \mathcal{X}^K)$, where $\mathcal{X}^k$ represents the input from the $k$-th modality. Therefore, we can define the classification loss as follows:

$\mathcal{L}_{cls} = \sum_{x \in S} -y \cdot \log\left(\mathrm{Softmax}\left(\sum_{k=1}^{K} C^{k}\left(F^{k}\left(x^{k}\right)\right)\right)\right)$   (1)

where $y$ represents the class label, $C^k$ is the action classifier of the $k$-th modality, $F^k$ denotes the feature extractor of the $k$-th modality, and $x^k$ represents the labeled source action segments from the $k$-th modality.
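As a minimal sketch of Eq. (1), assuming PyTorch, the snippet below sums the per-modality classifier scores before the softmax and computes the cross-entropy on labeled source segments only; the dictionary-based module layout is an illustrative assumption rather than the authors' code.

```python
import torch
import torch.nn.functional as F

def classification_loss(x_by_modality, y, feature_extractors, classifiers):
    """
    x_by_modality:      dict {modality k: source batch tensor x^k}
    y:                  (B,) long tensor of verb labels
    feature_extractors: dict of per-modality modules F^k
    classifiers:        dict of per-modality modules C^k
    """
    fused_logits = None
    for k, x_k in x_by_modality.items():
        logits_k = classifiers[k](feature_extractors[k](x_k))   # C^k(F^k(x^k))
        fused_logits = logits_k if fused_logits is None else fused_logits + logits_k
    # Eq. (1): -y * log(Softmax(sum_k C^k(F^k(x^k)))) summed over the source batch
    return F.cross_entropy(fused_logits, y, reduction="sum")
```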

3.2 Instance Refinement

We visualize the overall workflow of the instance refinement module in Fig. 3. In each modality, we select negative instances from the $i$-th batch of action segments $\mathcal{F}^s_i$ and $\mathcal{F}^t_i$ in the source and target domain, respectively. We divide a batch into several sub-batches, namely candidate sets $\mathcal{F}_C$, so that the agents iterate over more episodes; a batch therefore contains a total of $\mathcal{E}$ candidate sets. Each episode is responsible for selecting out $E$ negative samples, so the terminal time of an episode is defined as $E$. Taking time $e$ in the selection process as an example, S-agent takes an action $\mathcal{A}^s_e$ by observing the current state $\mathcal{S}^s_e$. The state is then updated to $\mathcal{S}^s_{e+1}$ since an action segment has been selected out, and in the meantime S-agent receives a reward $\mathcal{R}^s_e$ for taking action $\mathcal{A}^s_e$. After arriving at the terminal time $E$, S-agent has finished its selection for this episode and the candidate set $\mathcal{F}_C$ has been refined into $\hat{\mathcal{F}}_C$. The batch $\mathcal{F}^s_i$ becomes $\hat{\mathcal{F}}^s_i$ at the end of the last episode, and similarly we obtain $\hat{\mathcal{F}}^t_i$ for $\mathcal{F}^t_i$. We give detailed illustrations of State, Action, Reward and DQN Loss in the following part of this section; a minimal code sketch of one selection step is given after the list below.

Refer to caption
Figure 3: Workflow of the instance refinement module.
  • State. Agents make selections at the level of a candidate set and take the feature vectors of all action segments inside the candidate set as the state. In this case, the state $\mathcal{S}^s_k$ of S-agent in the $k$-th candidate set $\mathcal{C}^s_k$ is defined as $\mathcal{S}^s_k = [f^s_{k,1}, f^s_{k,2}, f^s_{k,3}, \ldots, f^s_{k,N_c}] \in \mathbb{R}^{d \times N_c}$, where $f^s_{k,n}$ is the $d$-dimensional feature vector of $\mathcal{F}^s_{i,k,n}$ and $N_c$ is the number of action segments inside a candidate set. Once an action segment $f^s_{k,n}$ is selected out of $\mathcal{S}^s_k$, it is replaced by a $d$-dimensional zero vector to keep the state shape unchanged. The same holds for T-agent, whose state is $\mathcal{S}^t_k = [f^t_{k,1}, f^t_{k,2}, f^t_{k,3}, \ldots, f^t_{k,N_c}] \in \mathbb{R}^{d \times N_c}$, where $f^t_{k,n}$ is the feature vector of a target action segment.

  • Action. For a candidate set of size $N_c$, there are $N_c$ actions that can be performed in each episode. We therefore define the set of actions that can be taken by S-agent as $\mathcal{A}^s = \{1, 2, \ldots, N_c\}$ and that of T-agent as $\mathcal{A}^t = \{1, 2, \ldots, N_c\}$, where each action is the index of the action segment to be selected out. The aim of the DQN agents is to maximize the accumulated reward of the actions taken. We define the accumulated reward at time $e$ as $\mathcal{R}_e = \sum_{t=e}^{E} \gamma^{t-e} r_t$, where $\gamma$ is the discount factor and $r_t$ is the instant reward at time $t$. In DQN, we define a state-action value function $Q(\mathcal{S}_e, a_e)$ to approximate the accumulated reward, where $\mathcal{S}_e$ denotes the state and $a_e$ the action taken at time $e$. For both modalities, $\mathcal{S}_e \in \{\mathcal{S}^s_e, \mathcal{S}^t_e\}$ and $a_e \in \{\mathcal{A}^s, \mathcal{A}^t\}$. As shown in Fig. 3, the DQN outputs a set of Q-values, one per action, and chooses the optimal action with the maximum Q-value so as to maximize the accumulated reward. This policy can be defined as follows:

    $\hat{a}_e = \arg\max_{a_e} Q(\mathcal{S}_e, a_e).$   (2)
  • Reward. Rewards given to agents are based on the actions taken and the relevance of the selected action segments to the opposite domain. To measure the relevance, we use the prediction results from the domain classifier $D$. The domain logits of an action segment are processed by a sigmoid function and the relevance measure $\Delta(f)$ is defined as:

    $\Delta(f) = \begin{cases} \mathrm{Sigmoid}(D(f)), & f \in \hat{\mathcal{F}^{s}} \\ 1 - \mathrm{Sigmoid}(D(f)), & f \in \hat{\mathcal{F}^{t}}. \end{cases}$   (3)

    In Eq. 3, we unify the relevance measure in both the source and target domains by defining the domain label of source to be 0 and that of target to be 1. Then, the relevance measure $\Delta(f)$ is compared with the predefined threshold $\tau$ to give rewards to agents according to the criterion defined below:

    $r = \begin{cases} 1, & \Delta(f) < \tau, \\ -1, & \text{otherwise}. \end{cases}$   (4)

    This criterion gives an agent an intuitive signal of whether it has made the right selection. Besides, we can set different thresholds $\tau^s$ and $\tau^t$ for the agents in the source and target domains, respectively.

  • DQN Loss. For a DQN, the target output is defined as:

    $y_e = r_e + \gamma \cdot \max_{a_{e+1}} Q(\mathcal{S}_{e+1}, a_{e+1} \mid \mathcal{S}_e, a_e)$   (5)

    where $y_e$ represents the temporal-difference target value of the Q-function $Q(\mathcal{S}_e, a_e)$. Based on this, the loss of the DQN can be defined as:

    $\mathcal{L}_q = \mathbb{E}_{\mathcal{S}_e, a_e}\left[\left(y_e - Q(\mathcal{S}_e, a_e)\right)^2\right].$   (6)

    Then, we can have the overall deep Q-learning loss defined as follows:

    $\mathcal{L}_{dqn} = \sum_{k=1}^{K}\left(\mathcal{L}_q^{s} + \mathcal{L}_q^{t}\right)_k$   (7)

    which is the sum of losses from S-agents and T-agents from all modalities.
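As promised above, here is a minimal sketch, assuming PyTorch, of one selection step: the agent observes the candidate-set state, greedily picks the segment with the highest Q-value (Eq. 2), the selected segment is replaced by a zero vector, a reward follows from the domain-classifier relevance and threshold $\tau$ (Eqs. 3-4), and the Q-network is regressed towards the TD target (Eqs. 5-6). The network sizes and function names are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QNet(nn.Module):
    """Maps the (d x N_c) candidate-set state to one Q-value per action."""
    def __init__(self, d: int, n_c: int):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(d * n_c, 256),
                                 nn.ReLU(), nn.Linear(256, n_c))

    def forward(self, state):              # state: (1, d, N_c)
        return self.net(state)             # (1, N_c) Q-values

def relevance(domain_logit, is_source: bool):
    """Eq. (3), with source domain label 0 and target domain label 1."""
    p = torch.sigmoid(domain_logit)
    return p if is_source else 1.0 - p

def refinement_step(qnet, state, domain_logits, is_source, tau=0.5, gamma=0.9):
    """One greedy selection step; domain_logits is an (N_c,) tensor of D(f) values."""
    q = qnet(state)                                    # (1, N_c)
    a = q.argmax(dim=1)                                # greedy action (Eq. 2)
    rel = relevance(domain_logits[a], is_source).item()
    reward = 1.0 if rel < tau else -1.0                # Eq. (4)
    next_state = state.clone()
    next_state[:, :, a] = 0.0                          # selected segment -> zero vector
    with torch.no_grad():                              # generic (non-terminal) TD target, Eq. (5)
        target = reward + gamma * qnet(next_state).max(dim=1).values
    q_sa = q.gather(1, a.view(1, 1)).squeeze()
    loss = F.mse_loss(q_sa, target.squeeze())          # Eq. (6), one sample
    return next_state, reward, loss
```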

3.3 Domain Adversarial Alignment

We realize feature alignment across domains in an adversarial way by connecting the domain discriminator with a GRL as shown in Fig. 2(b). We apply a domain discriminator in each modality rather than using a single discriminator for all modalities after late fusion since aligning domains in a combined way might lead the network to focus on a less robust modality and lose the ability to generalize to other modalities [17]. Then, we can define our domain adversarial loss as:

$\mathcal{L}_{adv} = \sum_{x^k \in \{S, T\}} -d \cdot \log\left(D^{k}\left(F^{k}\left(x^{k}\right)\right)\right) - (1 - d) \cdot \log\left(1 - D^{k}\left(F^{k}\left(x^{k}\right)\right)\right)$   (8)

where $d$ is the domain label, $D^k$ is the domain discriminator for the $k$-th modality, $F^k$ is the feature extractor of the $k$-th modality, and $x^k$ denotes the action segments from the source or target domain of the $k$-th modality.
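Below is a minimal sketch, assuming PyTorch, of the gradient reversal layer and the per-modality domain adversarial loss of Eq. (8). GradReverse follows the usual GRL formulation of [7]; Eq. (8) writes $D^k$ as a probability, whereas this sketch has the discriminator output a logit and folds the sigmoid into the binary cross-entropy call. All surrounding names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F
from torch.autograd import Function

class GradReverse(Function):
    """Identity in the forward pass; multiplies the gradient by -lambda backward."""
    @staticmethod
    def forward(ctx, x, lam=1.0):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def domain_adversarial_loss(features, d, discriminator, lam=1.0):
    """
    features:      (B, dim) refined features F^k(x^k) from source and target
    d:             (B,) float tensor of domain labels (source = 0, target = 1)
    discriminator: module mapping (B, dim) -> (B, 1) domain logits D^k
    """
    reversed_feat = GradReverse.apply(features, lam)
    logits = discriminator(reversed_feat).squeeze(1)
    return F.binary_cross_entropy_with_logits(logits, d, reduction="sum")
```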

3.4 Training

With the losses defined in previous sections, we can have an overall loss function:

$\mathcal{L} = \mathcal{L}_{cls} + \mathcal{L}_{dqn} + \mathcal{L}_{adv}.$   (9)

For the DQN agents, we use experience replay [13] and an $\epsilon$-greedy strategy [16] during training. An experience replay pool storing actions, states, rewards, etc. is established for every agent, which ensures that the data given to the agents is uncorrelated. The $\epsilon$-greedy strategy introduces a probability threshold $\epsilon$ that controls whether an action is predicted by the DQN agent or chosen at random, which balances the exploitation and exploration of an agent. The strategy is implemented as follows:

$\hat{a}_e = \begin{cases} \arg\max_{a_e} Q(\mathcal{S}_e, a_e) & \text{if } \lambda \geq \epsilon, \\ a_e^{*} & \text{otherwise}, \end{cases}$   (10)

where $\lambda$ is a random variable and $a_e^{*}$ is a randomly chosen action. If $\lambda$ is not smaller than $\epsilon$, the action is predicted by the agent; otherwise, the action is randomly chosen from the set of actions.
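A minimal sketch of the $\epsilon$-greedy rule of Eq. (10), together with a simple experience replay buffer [13], is given below; the buffer capacity and sampling details are illustrative assumptions rather than the paper's exact settings.

```python
import random
from collections import deque
import torch

class ReplayBuffer:
    """Stores (state, action, reward, next_state) transitions for uncorrelated sampling."""
    def __init__(self, capacity: int = 10000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size: int):
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))

def epsilon_greedy(qnet, state, n_actions: int, epsilon: float = 0.5):
    """Eq. (10): exploit when the random draw lambda >= epsilon, otherwise explore."""
    if random.random() >= epsilon:
        with torch.no_grad():
            return int(qnet(state).argmax(dim=1).item())
    return random.randrange(n_actions)
```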

4 Experiments

4.1 Implementation Details

4.1.1 Dataset.

We use EPIC-Kitchens [4] to set up the cross-domain environment for fine-grained action recognition, as it includes action segments captured in 32 different kitchen scenes. Following the domain split in [17], we sample videos taken in 3 different kitchens to form 3 domains, denoted as D1, D2 and D3. There are a total of 8 action classes and, according to [17], the distribution of training and testing samples over these classes is highly unbalanced. We use this unbalanced dataset to show that our method achieves competitive performance even though the unbalanced data distribution makes domain adaptation more challenging.

4.1.2 Model Architecture.

We use a two-stream I3D [1] feature extractor as our backbone. A training instance is composed of a temporal window of 16 frames sampled from an action segment. The domain discriminator $D^k$ in each modality takes the feature vectors, flattens them, and passes them through a GRL and two fully connected layers with a hidden dimension of 128 and an intermediate LeakyReLU activation. For data augmentation, we follow the setup in [2], where random cropping, scale jittering and horizontal flipping are applied to the training data; only center cropping is applied to the testing data.
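For reference, a minimal sketch of the per-modality discriminator head described above (flatten, a 128-unit hidden layer with LeakyReLU, and an output layer) is shown below; the input feature dimension of 1024 is an assumption (typical of I3D features), and the GRL sketched in Sec. 3.3 would be applied to the features before this head.

```python
import torch.nn as nn

def make_domain_discriminator(in_dim: int = 1024, hidden: int = 128) -> nn.Module:
    """Per-modality domain discriminator head D^k (GRL applied upstream)."""
    return nn.Sequential(
        nn.Flatten(),                # flatten the I3D feature map into a vector
        nn.Linear(in_dim, hidden),   # hidden layer of dimension 128
        nn.LeakyReLU(),
        nn.Linear(hidden, 1),        # single domain logit: source (0) vs. target (1)
    )
```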

4.1.3 Hyperparameter and Training Setting.

The overall dropout rate of $F^k$ is set to 0.5 and a weight decay of $10^{-7}$ is applied to the model parameters. We divide the training process into two stages. In stage 1, our network is trained without the domain discriminator and DQN agents, and the loss is optimized as follows:

$\mathcal{L}_{stage1} = \mathcal{L}_{cls}.$   (11)

The learning rate of this stage is set to 0.01 and the network is trained for 4000 steps. In the second stage, the domain discriminator and DQN agents are incorporated and the objective function for this stage is defined as:

$\mathcal{L}_{stage2} = \mathcal{L}_{cls} + \mathcal{L}_{adv} + \mathcal{L}_{dqn}.$   (12)

The learning rate in this stage is reduced to 0.001 and the model is trained for a further 8000 steps. Note that in both stages, the action classifier is optimized only on labeled source data. For the DQN hyperparameters, we set the discount factor $\gamma = 0.9$, the $\epsilon$-greedy factor $\epsilon = 0.5$, the relevance thresholds $\tau^s = \tau^t = 0.5$, the terminal time $E = 1$ and the candidate set size $N_c = 5$. The Adam [11] optimizer is used in both stages, and the batch size is set to 96 in stage 1 and 80 in stage 2, divided equally between the source and target domains. Training our model takes 6 hours on 4 NVIDIA RTX 3090 GPUs.
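A minimal sketch of this two-stage schedule, assuming PyTorch, is shown below: stage 1 optimizes only the classification loss at a learning rate of 0.01 for 4000 steps, and stage 2 adds the adversarial and DQN losses at a learning rate of 0.001 for 8000 further steps. The loss callables and the data iterator are illustrative stand-ins, not the released training code.

```python
import torch

def train_two_stage(model, cls_loss, adv_loss, dqn_loss, data_iter):
    # Stage 1: classification loss only (Eq. 11), lr = 0.01, 4000 steps.
    opt = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=1e-7)
    for _ in range(4000):
        batch = next(data_iter)
        loss = cls_loss(model, batch)
        opt.zero_grad()
        loss.backward()
        opt.step()

    # Stage 2: add adversarial alignment and instance refinement (Eq. 12), lr = 0.001, 8000 steps.
    opt = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-7)
    for _ in range(8000):
        batch = next(data_iter)
        loss = cls_loss(model, batch) + adv_loss(model, batch) + dqn_loss(model, batch)
        opt.zero_grad()
        loss.backward()
        opt.step()
```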

4.2 Results

For all experimental results, we follow [17] and report the average top-1 accuracy on the target domain over the last 9 epochs of training. The results of our model trained with only source data are reported as a lower bound, and the results of supervised learning on the target domain as the upper bound.

Table 1: Top-1 Accuracy of the experimental results of different baselines and our MMIR under different domain settings.
Method  D2→D1  D3→D1  D1→D2  D3→D2  D1→D3  D2→D3  Mean
Source-only 42.5 44.3 42.0 56.3 41.2 46.5 45.5
MMD [14] 43.1 48.3 46.6 55.2 39.2 48.5 46.8
AdaBN [12] 44.6 47.8 47.0 54.7 40.3 48.8 47.2
MCD [18] 42.1 47.9 46.5 52.7 43.5 51.0 47.3
MM-SADA [17] 48.2 50.9 49.5 56.1 44.1 52.7 50.3
Kim et al. [10] 49.5 51.5 50.3 56.3 46.3 52.0 51.0
MMIR 46.1 53.5 49.7 61.5 44.5 52.6 51.3
Supervised 71.7 74.0 62.8 74.0 62.8 71.7 69.5

We have a total of 6 domain adaptation settings based on D1, D2 and D3. Table 1 compares our method with several baseline methods and two state-of-the-art methods under every domain setting. On average, our method outperforms all other state-of-the-art methods, with an overall improvement of 5.8% over Source-only, 4.5% over MMD [14], 4.1% over AdaBN [12], 4.0% over MCD [18], 1.0% over MM-SADA [17] and 0.3% over Kim et al. [10].

4.3 Ablation Study

4.3.1 Effects of RL Agents.

We evaluate the performance of reinforcement learning agents according to modality and domain. In particular, we investigate the effect of S-agent and T-agent only in the modality of RGB as this modality contributes more during the feature alignment process. The results are shown in Table 2 and we denote “without” as “w/o”. We further elaborate on the results in the following part.

  • RGB vs Optical flow. Removing the agents in RGB leads to a performance drop of 1.8%, while removing the agents in optical flow causes a drop of only 0.7%. This indicates that the agents in RGB play the major part in refining the feature alignment, since RGB frames contain more spatial information whereas flow frames carry more temporal information, which contributes less to the alignment process.

  • S-agent vs T-agent in RGB. In RGB, removing S-agent causes a performance drop of 0.3%, while removing T-agent causes a drop of 0.8%. This shows that, in the RGB modality, T-agent contributes more than S-agent to alleviating the issue of negative transfer.

Table 2: Ablation study of the effect of reinforcement learning agents on D2 \rightarrow D1.
Method D2 \rightarrow D1
Source-only 42.5
MMIR w/o RGB agents 44.3
MMIR w/o Flow agents 45.4
MMIR w/o S-agent (RGB) 45.8
MMIR w/o T-agent (RGB) 45.3
MMIR 46.1

4.3.2 Overall Evaluation.

In addition, we also evaluate the overall effect of our instance refinement strategy (IR) by comparing it with the case of Adversarial-only in Table 3. We give a detailed illustration of our results as follows.

Table 3: Overall evaluation of our instance refinement strategy.
Method  Adversarial  IR  D2→D1  D3→D1  D1→D2  D3→D2  D1→D3  D2→D3  Mean
Source-only  -  -  42.5  44.3  42.0  56.3  41.2  46.5  45.5
MMIR  ✓  -  43.0  51.8  49.3  59.9  43.5  51.6  49.9
MMIR  ✓  ✓  46.1  53.5  49.7  61.5  44.5  52.6  51.3
  • Adversarial-only. Compared with Source-only, an improvement of 4.4% in top-1 accuracy can be observed. This shows that our domain adversarial alignment is effective in improving model performance on the target domain through directly training a domain discriminator in an adversarial way.

  • Adversarial + IR. Compared with Adversarial-only, it yields an overall performance boost of 1.4% and shows an improvement in every domain setting, as depicted in Table 3. This demonstrates the successful implementation of our instance refinement strategy and its capability to alleviate negative transfer in cross-domain action recognition.

5 Conclusion

We design a multi-modal instance refinement framework to alleviate the problem of negative transfer in cross-domain action recognition. Reinforcement learning agents are trained to learn policies that select out negative training samples, resulting in a better-aligned feature distribution via domain adversarial learning. Experiments show that our method successfully addresses negative transfer in multi-modal cross-domain action recognition and outperforms several competitive methods on a benchmark dataset. In the future, it is worth conducting experiments on a broader spectrum of datasets to validate whether MMIR generalizes to other use cases and to additional modalities such as text, speech and depth.

5.0.1 Acknowledgements.

This work is supported by the Major Project for New Generation of AI under Grant No. 2018AAA0100400, National Natural Science Foundation of China No. 82121003, and Shenzhen Research Program No. JSGG20210802153537009.

References

  • [1] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 6299–6308 (2017)
  • [2] Chen, C.F., Panda, R., Ramakrishnan, K., Feris, R., Cohn, J., Oliva, A., Fan, Q.: Deep analysis of cnn-based spatio-temporal representations for action recognition. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (Jun 2021)
  • [3] Chen, J., Wu, X., Duan, L., Chen, L.: Sequential instance refinement for cross-domain object detection in images. IEEE Transactions on Image Processing 30, 3970–3984 (2021)
  • [4] Damen, D., Doughty, H., Farinella, G.M., Fidler, S., Furnari, A., Kazakos, E., Moltisanti, D., Munro, J., Perrett, T., Price, W., et al.: Scaling egocentric vision: The epic-kitchens dataset. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 720–736 (2018)
  • [5] Dong, W., Zhang, Z., Tan, T.: Attention-aware sampling via deep reinforcement learning for action recognition. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 33, pp. 8247–8254 (2019)
  • [6] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
  • [7] Ganin, Y., Ustinova, E., Ajakan, H., Germain, P., Larochelle, H., Laviolette, F., Marchand, M., Lempitsky, V.: Domain-adversarial training of neural networks. The Journal of Machine Learning Research 17(1), 2096–2030 (2016)
  • [8] Ji, S., Xu, W., Yang, M., Yu, K.: 3d convolutional neural networks for human action recognition. IEEE transactions on pattern analysis and machine intelligence 35(1), 221–231 (2012)
  • [9] Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., Fei-Fei, L.: Large-scale video classification with convolutional neural networks. In: Proceedings of the IEEE conference on Computer Vision and Pattern Recognition. pp. 1725–1732 (2014)
  • [10] Kim, D., Tsai, Y.H., Zhuang, B., Yu, X., Sclaroff, S., Saenko, K., Chandraker, M.: Learning cross-modal contrastive features for video domain adaptation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 13618–13627 (2021)
  • [11] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
  • [12] Li, Y., Wang, N., Shi, J., Hou, X., Liu, J.: Adaptive batch normalization for practical domain adaptation. Pattern Recognition 80, 109–117 (2018)
  • [13] Lin, L.J.: Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8, 293–321 (1992)
  • [14] Long, M., Cao, Y., Wang, J., Jordan, M.: Learning transferable features with deep adaptation networks. In: International conference on machine learning. pp. 97–105. PMLR (2015)
  • [15] Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., Riedmiller, M.: Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602 (2013)
  • [16] Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness, J., Bellemare, M.G., Graves, A., Riedmiller, M., Fidjeland, A.K., Ostrovski, G., et al.: Human-level control through deep reinforcement learning. Nature 518(7540), 529–533 (2015)
  • [17] Munro, J., Damen, D.: Multi-modal domain adaptation for fine-grained action recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 122–132 (2020)
  • [18] Saito, K., Watanabe, K., Ushiku, Y., Harada, T.: Maximum classifier discrepancy for unsupervised domain adaptation. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 3723–3732 (2018)
  • [19] Tran, D., Bourdev, L., Fergus, R., Torresani, L., Paluri, M.: Learning spatiotemporal features with 3d convolutional networks. In: Proceedings of the IEEE international conference on computer vision. pp. 4489–4497 (2015)
  • [20] Tzeng, E., Hoffman, J., Saenko, K., Darrell, T.: Adversarial discriminative domain adaptation. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 7167–7176 (2017)
  • [21] Wang, H., Schmid, C.: Action recognition with improved trajectories. In: Proceedings of the IEEE international conference on computer vision. pp. 3551–3558 (2013)
  • [22] Wang, R., Chen, D., Wu, Z., Chen, Y., Dai, X., Liu, M., Yuan, L., Jiang, Y.G.: Masked video distillation: Rethinking masked feature modeling for self-supervised video representation learning. arXiv preprint arXiv:2212.04500 (2022)
  • [23] Wang, X., Chen, W., Wu, J., Wang, Y.F., Wang, W.Y.: Video captioning via hierarchical reinforcement learning. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 4213–4222 (2018)
  • [24] Wang, Y., Li, K., Li, Y., He, Y., Huang, B., Zhao, Z., Zhang, H., Xu, J., Liu, Y., Wang, Z., et al.: Internvideo: General video foundation models via generative and discriminative learning. arXiv preprint arXiv:2212.03191 (2022)
  • [25] Weng, J., Jiang, X., Zheng, W.L., Yuan, J.: Early action recognition with category exclusion using policy-based reinforcement learning. IEEE Transactions on Circuits and Systems for Video Technology 30(12), 4626–4638 (2020)
  • [26] Xu, Y., Yang, J., Cao, H., Wu, K., Wu, M., Chen, Z.: Source-free video domain adaptation by learning temporal consistency for action recognition. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXIV. pp. 147–164. Springer (2022)
  • [27] Zhou, M., Wang, R., Xie, C., Liu, L., Li, R., Wang, F., Li, D.: Reinforcenet: A reinforcement learning embedded object detection framework with region selection network. Neurocomputing 443, 369–379 (2021)