
1 Shenzhen Institute for Advanced Study, University of Electronic Science and Technology of China
2 China Unicom Shenzhen Branch
Emails: [email protected], [email protected], {shaohua.wan, lxduan}@uestc.edu.cn

Multi-modal Instance Refinement for Cross-domain Action Recognition

Yuan Qing 1    Naixing Wu 2    Shaohua Wan 1    Lixin Duan(🖂) 1
Abstract

Unsupervised cross-domain action recognition aims at adapting a model trained on an existing labeled source domain to a new unlabeled target domain. Most existing methods solve the task by directly aligning the feature distributions of the source and target domains. However, this can cause negative transfer during domain adaptation due to negative training samples in both domains. In the source domain, some training samples are of low relevance to the target domain due to differences in viewpoints, action styles, etc. In the target domain, some ambiguous training samples can easily be misclassified as another type of action defined in the source domain. The problem of negative transfer has been explored in cross-domain object detection, while it remains under-explored in cross-domain action recognition. Therefore, we propose a Multi-modal Instance Refinement (MMIR) method based on reinforcement learning to alleviate negative transfer. Specifically, a reinforcement learning agent is trained in both domains for every modality to refine the training data by selecting out negative samples from each domain. Our method outperforms several other state-of-the-art baselines in cross-domain action recognition on the benchmark EPIC-Kitchens [4] dataset, which demonstrates the advantage of MMIR in reducing negative transfer.

Keywords:
Negative Transfer · Cross-domain Action Recognition · Unsupervised Domain Adaptation · Reinforcement Learning.

1 Introduction

Action recognition is one of the most important tasks in video understanding, which aims at recognizing and predicting human actions in videos [24, 22, 1, 4]. The majority of works in action recognition are carried out on the basis of supervised learning, which involves using a large number of labeled video/action segments to train a model for a specified scene setting. However, obtaining a large amount of annotated video data for a certain scene is costly and sometimes difficult due to the environment setup, video post-processing and labeling. To fully leverage existing labeled data and reduce the cost of acquiring new data, Unsupervised Domain Adaptation (UDA) [18, 20, 14] has been introduced to generalize a model trained on a source domain with adequate annotations to a target domain with no labels, where the two domains differ from each other but are partially related. For action recognition, though several UDA methods [26, 10, 17] have been proposed, most of them directly align the feature distributions of the source and target domains. However, this can lead to negative transfer during domain adaptation due to negative training samples in both domains [3]. For instance, there might be ambiguous actions in the target domain that do not belong to the defined action categories in the source domain or are very similar to another kind of action in the source domain. There might also be less-relevant actions in the source domain whose viewpoints or action styles differ completely from samples in the target domain. Specifically, Fig. 1 shows these two types of negative samples in domains D2 (source) and D3 (target) defined in [17] from the EPIC-Kitchens dataset [4]. In Fig. 1(a), the action open in the source domain is considered less relevant to that of the target domain since the trajectory of motion and way of opening are dissimilar. In Fig. 1(b), a spraying action in the target domain that does not belong to a predefined action type is likely to be mistakenly recognized as wipe due to the similarity in action style and appearance.

[Figure: (a) Less-relevant action; (b) Ambiguous action]
Figure 1: Negative samples that cause negative transfer from source domain D2 (first row) and target domain D3 (second row) defined in [17].

To alleviate the impact of negative transfer brought by these negative training samples, we propose Multi-modal Instance Refinement (MMIR) based on deep Q-learning (DQN) [15], under the framework of MM-SADA [17]. Our MMIR trains reinforcement learning agents in both domains in each modality to refine source and target samples by selecting out less-relevant instances from the source domain and ambiguous instances from the target domain. To the best of our knowledge, there is no previous work on reducing negative transfer in cross-domain action recognition. Our contributions are summarized as follows:

  • As far as we know, we are the first to define and tackle the issue of negative transfer in cross-domain action recognition.

  • We adopt a novel instance refinement strategy using deep reinforcement learning to select outlier instances in source domain and target domain within two modalities (RGB and optical flow).

  • Our method achieves superior performance compared with other state-of-the-art methods in cross-domain action recognition on EPIC-Kitchens dataset.

2 Related Work

2.1 Action Recognition

In action recognition, early works use 2D/3D convolution [9, 8] for feature extraction in a single modality, i.e., RGB. Later, the optical flow of video segments is used as auxiliary training data, as it carries more temporal and motion information than RGB [21]. Therefore, current popular CNN-based methods adopt a two-stream 3D convolutional neural network structure [19, 1] for feature extraction, which can utilize the information contained in multiple modalities and model temporal information. Most recently, vision transformer [6] based approaches have surpassed CNN-based methods on many benchmarks. MVD [22] builds a two-stage masked feature modeling framework, where the high-level features of pretrained models learned in the first stage serve as masked prediction targets for the student model in the second stage. Although these methods show promising performance in the supervised setting, we focus on action recognition under the setting of UDA.

2.2 Unsupervised Domain Adaptation for Action Recognition

Though both RGB and optical flow have been studied for domain adaptation in action recognition, only a limited number of works have attempted multi-modal domain adaptation [26, 10, 17]. Munro and Damen [17] propose MM-SADA, a multi-modal 3D convolutional neural network with a self-supervision classifier between modalities; it uses a Gradient Reversal Layer (GRL) [7] to implement the domain discriminator within different modalities. Kim et al. [10] apply contrastive learning to design a unified transformer-based framework for multi-modal UDA in video understanding. Xu et al. [26] propose a source-free UDA model that learns temporal consistency in videos between the source and target domains. Similar to [17], our work adopts a multi-modal 3D ConvNet for feature extraction and utilizes domain adversarial learning, but we focus on a different task by incorporating deep reinforcement learning into our action recognition framework to eliminate the effect of negative transfer.

2.3 Deep Reinforcement Learning

Deep reinforcement learning has been applied to various tasks in computer vision [23, 27]. Wang et al. [23] design a reinforcement learning-based two-level framework for video captioning, in which a low-level module recognizes the original actions to fulfill the goals specified by the high-level module. ReinforceNet [27] incorporates region selection and bounding box refinement networks to form a reinforcement learning framework based on CNN to select optimal proposals and refine bounding box positions. Recently, several works apply reinforcement learning to action recognition [5, 25]. Dong et al. [5] design a deep reinforcement learning framework to capture the most discriminative frames and delete confusing frames in action segments. Weng et al. [25] improve recognition accuracy by designing agents that learn to produce binary masks to select out interfering categories. All these methods adopt deep reinforcement learning to refine negative frames in the action segment within the same domain, while our method uses deep reinforcement learning to refine negative action segments across domains to handle negative transfer in cross-domain action recognition.

3 Proposed Method

In unsupervised cross-domain action recognition, two domains are given, namely source and target. The labeled source domain is denoted as $\mathcal{D}^s = \{(x_i^s, y_i^s)\}_{i=1}^{N_s}$, where $x_i^s$ is the $i$-th action segment and $y_i^s$ is its verb label. Similarly, the unlabeled target domain is represented by $\mathcal{D}^t = \{x_i^t\}_{i=1}^{N_t}$, where $x_i^t$ is the $i$-th action segment. Each action segment is a sequence of $L$ frames, so we have $x_i^s = \{x_{i,1}^s, x_{i,2}^s, x_{i,3}^s, \ldots, x_{i,L}^s\}$ and $x_i^t = \{x_{i,1}^t, x_{i,2}^t, x_{i,3}^t, \ldots, x_{i,L}^t\}$, respectively.
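As a concrete reference for this setup, the following is a minimal sketch (in PyTorch-style Python, not the authors' code) of the data structures described above; the class name Segment, the tensor shapes and the 16-frame length are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple
import torch

@dataclass
class Segment:
    frames: torch.Tensor          # (L, C, H, W) RGB or optical-flow frames
    label: Optional[int] = None   # verb label y_i^s for source, None for target

def make_domains(n_s: int, n_t: int, L: int = 16) -> Tuple[List[Segment], List[Segment]]:
    """Toy labeled source domain D^s and unlabeled target domain D^t."""
    source = [Segment(torch.randn(L, 3, 224, 224), label=i % 8) for i in range(n_s)]
    target = [Segment(torch.randn(L, 3, 224, 224)) for _ in range(n_t)]
    return source, target
```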

To reduce negative transfer during domain adaptation, two reinforcement learning agents, S-agent and T-agent, are defined under a deep Q-learning network (DQN) to make selections in the source and target domains. S-agent learns policies to select out less-relevant action segments from the source action segments $x^s$, while T-agent is trained to select out ambiguous action segments from the target action segments $x^t$. After refinement, we use the refined instances $\hat{x}^s$ and $\hat{x}^t$ to train our domain discriminator to learn domain-invariant features.

The following sections give detailed explanations of our proposed Multi-modal Instance Refinement (MMIR) method. The architecture of MMIR is shown in Fig. 2(a): a two-stream 3D convolutional feature extractor in two modalities (RGB and optical flow), together with a domain discriminator and an instance refinement module in each modality, followed by an action classifier. The structure of the domain discriminator and the instance refinement module is depicted in Fig. 2(b).

[Figure: (a) Overall model structure; (b) Domain discriminator and instance refinement agents]
Figure 2: Structure of proposed method MMIR: (a) An I3D [1] feature extractor in each modality is shared by both domains. The output feature vectors of I3D network are fed to the instance refinement and domain adversarial learning modules as well as action classifiers. (b) S-agent and T-agent are built for source and target domain, respectively. They take input feature vectors as state and make selections in the training instances to select out noisy samples. A domain classifier with GRL is optimized with refined instances, which gives rewards to agents according to their selections.

3.1 Two-stream Action Recognition

For multiple modalities, a feature fusion layer is applied after the action classifiers to sum the prediction scores from different modalities, as shown in Fig. 2(a). For an input $\mathcal{X}$ with multiple modalities, we have $\mathcal{X} = (\mathcal{X}^1, \mathcal{X}^2, \ldots, \mathcal{X}^K)$, where $\mathcal{X}^k$ represents the input from the $k$-th modality. Therefore, we can define the classification loss as follows:

$\mathcal{L}_{cls} = \sum_{x \in S} -y \cdot \log\left(\mathrm{Softmax}\left(\sum_{k=1}^{K} C^{k}\left(F^{k}\left(x^{k}\right)\right)\right)\right)$   (1)

where $y$ represents the class label, $C^k$ is the action classifier of the $k$-th modality, $F^k$ denotes the feature extractor of the $k$-th modality, and $x^k$ represents the labeled source action segments from the $k$-th modality.
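As a minimal sketch of Eq. (1), assuming PyTorch, the snippet below sums the per-modality classifier scores before the softmax and computes the cross-entropy on labeled source segments only; the dictionary-based module layout is an illustrative assumption rather than the authors' code.

```python
import torch
import torch.nn.functional as F

def classification_loss(x_by_modality, y, feature_extractors, classifiers):
    """
    x_by_modality:      dict {modality k: source batch tensor x^k}
    y:                  (B,) long tensor of verb labels
    feature_extractors: dict of per-modality modules F^k
    classifiers:        dict of per-modality modules C^k
    """
    fused_logits = None
    for k, x_k in x_by_modality.items():
        logits_k = classifiers[k](feature_extractors[k](x_k))   # C^k(F^k(x^k))
        fused_logits = logits_k if fused_logits is None else fused_logits + logits_k
    # Eq. (1): -y * log(Softmax(sum_k C^k(F^k(x^k)))) summed over the source batch
    return F.cross_entropy(fused_logits, y, reduction="sum")
```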

3.2 Instance Refinement

We visualize the overall workflow of the instance refinement module in Fig. 3. In each modality, we select negative instances from the $i$-th batch of action segments $\mathcal{F}^s_i$ and $\mathcal{F}^t_i$ in the source and target domain, respectively. We divide a batch into several sub-batches, namely candidate sets $\mathcal{F}_C$, so that the agents iterate over more episodes; a batch therefore contains a total of $\mathcal{E}$ candidate sets. Each episode is responsible for selecting out $E$ negative samples, so the terminal time of an episode is defined as $E$. Taking time $e$ in the selection process as an example, S-agent takes an action $\mathcal{A}^s_e$ by observing the current state $\mathcal{S}^s_e$. The state is then updated to $\mathcal{S}^s_{e+1}$ since an action segment has been selected out, and in the meantime S-agent receives a reward $\mathcal{R}^s_e$ for taking action $\mathcal{A}^s_e$. After arriving at the terminal time $E$, S-agent has finished its selection for this episode and the candidate set $\mathcal{F}_C$ has been refined into $\hat{\mathcal{F}}_C$. The batch $\mathcal{F}^s_i$ becomes $\hat{\mathcal{F}}^s_i$ at the end of the last episode, and similarly we obtain $\hat{\mathcal{F}}^t_i$ for $\mathcal{F}^t_i$. We give detailed illustrations of State, Action, Reward and DQN Loss in the following part of this section; a minimal code sketch of one selection step is given after the list below.

Refer to caption
Figure 3: Workflow of the instance refinement module.
  • State. Agents make selections at the level of a candidate set and take the feature vectors of all action segments inside the candidate set as the state. In this case, the state $\mathcal{S}^s_k$ of S-agent in the $k$-th candidate set $\mathcal{C}^s_k$ is defined as $\mathcal{S}^s_k = [f^s_{k,1}, f^s_{k,2}, f^s_{k,3}, \ldots, f^s_{k,N_c}] \in \mathbb{R}^{d \times N_c}$, where $f^s_{k,n}$ is the $d$-dimensional feature vector of $\mathcal{F}^s_{i,k,n}$ and $N_c$ is the number of action segments inside a candidate set. Once an action segment $f^s_{k,n}$ is selected out of $\mathcal{S}^s_k$, it is replaced by a $d$-dimensional zero vector to keep the state shape unchanged. The same holds for T-agent, whose state is $\mathcal{S}^t_k = [f^t_{k,1}, f^t_{k,2}, f^t_{k,3}, \ldots, f^t_{k,N_c}] \in \mathbb{R}^{d \times N_c}$, where $f^t_{k,n}$ is the feature vector of a target action segment.

  • Action. For a candidate set of size $N_c$, there are $N_c$ actions that can be performed in each episode. We therefore define the set of actions that can be taken by S-agent as $\mathcal{A}^s = \{1, 2, \ldots, N_c\}$ and that of T-agent as $\mathcal{A}^t = \{1, 2, \ldots, N_c\}$, where each action is the index of the action segment to be selected out. The aim of the DQN agents is to maximize the accumulated reward of the actions taken. We define the accumulated reward at time $e$ as $\mathcal{R}_e = \sum_{t=e}^{E} \gamma^{t-e} r_t$, where $\gamma$ is the discount factor and $r_t$ is the instant reward at time $t$. In DQN, we define a state-action value function $Q(\mathcal{S}_e, a_e)$ to approximate the accumulated reward, where $\mathcal{S}_e$ denotes the state and $a_e$ the action taken at time $e$. For both modalities, $\mathcal{S}_e \in \{\mathcal{S}^s_e, \mathcal{S}^t_e\}$ and $a_e \in \{\mathcal{A}^s, \mathcal{A}^t\}$. As shown in Fig. 3, the DQN outputs a set of Q-values, one per action, and chooses the optimal action with the maximum Q-value so as to maximize the accumulated reward. This policy can be defined as follows:

    $\hat{a}_e = \arg\max_{a_e} Q(\mathcal{S}_e, a_e).$   (2)
  • Reward. Rewards given to agents are based on the actions taken and the relevance of the selected action segments to the opposite domain. To measure the relevance, we use the prediction results from the domain classifier $D$. The domain logits of an action segment are processed by a sigmoid function and the relevance measure $\Delta(f)$ is defined as:

    $\Delta(f) = \begin{cases} \mathrm{Sigmoid}(D(f)), & f \in \hat{\mathcal{F}^{s}} \\ 1 - \mathrm{Sigmoid}(D(f)), & f \in \hat{\mathcal{F}^{t}}. \end{cases}$   (3)

    In Eq. 3, we unify the relevance measure in both the source and target domains by defining the domain label of source to be 0 and that of target to be 1. Then, the relevance measure $\Delta(f)$ is compared with the predefined threshold $\tau$ to give rewards to agents according to the criterion defined below:

    $r = \begin{cases} 1, & \Delta(f) < \tau, \\ -1, & \text{otherwise}. \end{cases}$   (4)

    This criterion gives an agent an intuitive signal of whether it has made the right selection. Besides, we can set different thresholds $\tau^s$ and $\tau^t$ for the agents in the source and target domains, respectively.

  • DQN Loss. For a DQN, the target output is defined as:

    $y_e = r_e + \gamma \cdot \max_{a_{e+1}} Q(\mathcal{S}_{e+1}, a_{e+1} \mid \mathcal{S}_e, a_e)$   (5)

    where $y_e$ represents the temporal-difference target value of the Q-function $Q(\mathcal{S}_e, a_e)$. Based on this, the loss of the DQN can be defined as:

    $\mathcal{L}_q = \mathbb{E}_{\mathcal{S}_e, a_e}\left[\left(y_e - Q(\mathcal{S}_e, a_e)\right)^2\right].$   (6)

    Then, we can have the overall deep Q-learning loss defined as follows:

    $\mathcal{L}_{dqn} = \sum_{k=1}^{K}\left(\mathcal{L}_q^{s} + \mathcal{L}_q^{t}\right)_k$   (7)

    which is the sum of losses from S-agents and T-agents from all modalities.
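As promised above, here is a minimal sketch, assuming PyTorch, of one selection step: the agent observes the candidate-set state, greedily picks the segment with the highest Q-value (Eq. 2), the selected segment is replaced by a zero vector, a reward follows from the domain-classifier relevance and threshold $\tau$ (Eqs. 3-4), and the Q-network is regressed towards the TD target (Eqs. 5-6). The network sizes and function names are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QNet(nn.Module):
    """Maps the (d x N_c) candidate-set state to one Q-value per action."""
    def __init__(self, d: int, n_c: int):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(d * n_c, 256),
                                 nn.ReLU(), nn.Linear(256, n_c))

    def forward(self, state):              # state: (1, d, N_c)
        return self.net(state)             # (1, N_c) Q-values

def relevance(domain_logit, is_source: bool):
    """Eq. (3), with source domain label 0 and target domain label 1."""
    p = torch.sigmoid(domain_logit)
    return p if is_source else 1.0 - p

def refinement_step(qnet, state, domain_logits, is_source, tau=0.5, gamma=0.9):
    """One greedy selection step; domain_logits is an (N_c,) tensor of D(f) values."""
    q = qnet(state)                                    # (1, N_c)
    a = q.argmax(dim=1)                                # greedy action (Eq. 2)
    rel = relevance(domain_logits[a], is_source).item()
    reward = 1.0 if rel < tau else -1.0                # Eq. (4)
    next_state = state.clone()
    next_state[:, :, a] = 0.0                          # selected segment -> zero vector
    with torch.no_grad():                              # generic (non-terminal) TD target, Eq. (5)
        target = reward + gamma * qnet(next_state).max(dim=1).values
    q_sa = q.gather(1, a.view(1, 1)).squeeze()
    loss = F.mse_loss(q_sa, target.squeeze())          # Eq. (6), one sample
    return next_state, reward, loss
```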

3.3 Domain Adversarial Alignment

We realize feature alignment across domains in an adversarial way by connecting the domain discriminator with a GRL as shown in Fig. 2(b). We apply a domain discriminator in each modality rather than using a single discriminator for all modalities after late fusion since aligning domains in a combined way might lead the network to focus on a less robust modality and lose the ability to generalize to other modalities [17]. Then, we can define our domain adversarial loss as:

$\mathcal{L}_{adv} = \sum_{x^k \in \{S, T\}} -d \cdot \log\left(D^{k}\left(F^{k}\left(x^{k}\right)\right)\right) - (1 - d) \cdot \log\left(1 - D^{k}\left(F^{k}\left(x^{k}\right)\right)\right)$   (8)

where $d$ is the domain label, $D^k$ is the domain discriminator for the $k$-th modality, $F^k$ is the feature extractor of the $k$-th modality, and $x^k$ denotes the action segments from the source or target domain of the $k$-th modality.
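Below is a minimal sketch, assuming PyTorch, of the gradient reversal layer and the per-modality domain adversarial loss of Eq. (8). GradReverse follows the usual GRL formulation of [7]; Eq. (8) writes $D^k$ as a probability, whereas this sketch has the discriminator output a logit and folds the sigmoid into the binary cross-entropy call. All surrounding names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F
from torch.autograd import Function

class GradReverse(Function):
    """Identity in the forward pass; multiplies the gradient by -lambda backward."""
    @staticmethod
    def forward(ctx, x, lam=1.0):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def domain_adversarial_loss(features, d, discriminator, lam=1.0):
    """
    features:      (B, dim) refined features F^k(x^k) from source and target
    d:             (B,) float tensor of domain labels (source = 0, target = 1)
    discriminator: module mapping (B, dim) -> (B, 1) domain logits D^k
    """
    reversed_feat = GradReverse.apply(features, lam)
    logits = discriminator(reversed_feat).squeeze(1)
    return F.binary_cross_entropy_with_logits(logits, d, reduction="sum")
```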

3.4 Training

With the losses defined in previous sections, we can have an overall loss function:

$\mathcal{L} = \mathcal{L}_{cls} + \mathcal{L}_{dqn} + \mathcal{L}_{adv}.$   (9)

For the DQN agents, we use experience replay [13] and an $\epsilon$-greedy strategy [16] during training. An experience replay pool storing actions, states, rewards, etc. is established for every agent, which ensures that the data given to the agents is uncorrelated. The $\epsilon$-greedy strategy introduces a probability threshold $\epsilon$ that controls whether an action is predicted by the DQN agent or chosen at random, which balances the exploitation and exploration of an agent. The strategy is implemented as follows:

$\hat{a}_e = \begin{cases} \arg\max_{a_e} Q(\mathcal{S}_e, a_e) & \text{if } \lambda \geq \epsilon, \\ a_e^{*} & \text{otherwise}, \end{cases}$   (10)

where $\lambda$ is a random variable and $a_e^{*}$ is a randomly chosen action. If $\lambda$ is not smaller than $\epsilon$, the action is predicted by the agent; otherwise, the action is randomly chosen from the set of actions.
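A minimal sketch of the $\epsilon$-greedy rule of Eq. (10), together with a simple experience replay buffer [13], is given below; the buffer capacity and sampling details are illustrative assumptions rather than the paper's exact settings.

```python
import random
from collections import deque
import torch

class ReplayBuffer:
    """Stores (state, action, reward, next_state) transitions for uncorrelated sampling."""
    def __init__(self, capacity: int = 10000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size: int):
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))

def epsilon_greedy(qnet, state, n_actions: int, epsilon: float = 0.5):
    """Eq. (10): exploit when the random draw lambda >= epsilon, otherwise explore."""
    if random.random() >= epsilon:
        with torch.no_grad():
            return int(qnet(state).argmax(dim=1).item())
    return random.randrange(n_actions)
```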

4 Experiments

4.1 Implementation Details

4.1.1 Dataset.

We use EPIC-Kitchens [4] to set up the cross-domain environment for fine-grained action recognition, as it includes action segments captured in 32 different kitchen scenes. Following the domain split in [17], we sample videos taken in 3 different kitchens to form 3 domains, denoted as D1, D2 and D3. There are a total of 8 action classes and, according to [17], the distribution of training and testing samples over these classes is highly unbalanced. We use this unbalanced dataset to show that our method achieves competitive performance even though the unbalanced data distribution makes domain adaptation more challenging.

4.1.2 Model Architecture.

We use a two-stream I3D [1] feature extractor as our backbone. A training instance is composed of a temporal window of 16 frames sampled from an action segment. The domain discriminator $D^k$ in each modality takes the feature vectors, flattens them, and passes them through a GRL and two fully connected layers with a hidden dimension of 128 and an intermediate LeakyReLU activation. For data augmentation, we follow the setup in [2], where random cropping, scale jittering and horizontal flipping are applied to the training data; only center cropping is applied to the testing data.
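For reference, a minimal sketch of the per-modality discriminator head described above (flatten, a 128-unit hidden layer with LeakyReLU, and an output layer) is shown below; the input feature dimension of 1024 is an assumption (typical of I3D features), and the GRL sketched in Sec. 3.3 would be applied to the features before this head.

```python
import torch.nn as nn

def make_domain_discriminator(in_dim: int = 1024, hidden: int = 128) -> nn.Module:
    """Per-modality domain discriminator head D^k (GRL applied upstream)."""
    return nn.Sequential(
        nn.Flatten(),                # flatten the I3D feature map into a vector
        nn.Linear(in_dim, hidden),   # hidden layer of dimension 128
        nn.LeakyReLU(),
        nn.Linear(hidden, 1),        # single domain logit: source (0) vs. target (1)
    )
```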

4.1.3 Hyperparameter and Training Setting.

The overall dropout rate of $F^k$ is set to 0.5 and a weight decay of $10^{-7}$ is applied to the model parameters. We divide the training process into two stages. In stage 1, our network is trained without the domain discriminator and DQN agents, and the loss is optimized as follows:

$\mathcal{L}_{stage1} = \mathcal{L}_{cls}.$   (11)

The learning rate of this stage is set to 0.01 and the network is trained for 4000 steps. In the second stage, the domain discriminator and DQN agents are incorporated and the objective function for this stage is defined as:

$\mathcal{L}_{stage2} = \mathcal{L}_{cls} + \mathcal{L}_{adv} + \mathcal{L}_{dqn}.$   (12)

The learning rate in this stage is reduced to 0.001 and the model is trained for a further 8000 steps. Note that in both stages, the action classifier is optimized only on labeled source data. For the DQN hyperparameters, we set the discount factor $\gamma = 0.9$, the $\epsilon$-greedy factor $\epsilon = 0.5$, the relevance thresholds $\tau^s = \tau^t = 0.5$, the terminal time $E = 1$ and the candidate set size $N_c = 5$. The Adam [11] optimizer is used in both stages, and the batch size is set to 96 in stage 1 and 80 in stage 2, divided equally between the source and target domains. Training our model takes 6 hours on 4 NVIDIA RTX 3090 GPUs.
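A minimal sketch of this two-stage schedule, assuming PyTorch, is shown below: stage 1 optimizes only the classification loss at a learning rate of 0.01 for 4000 steps, and stage 2 adds the adversarial and DQN losses at a learning rate of 0.001 for 8000 further steps. The loss callables and the data iterator are illustrative stand-ins, not the released training code.

```python
import torch

def train_two_stage(model, cls_loss, adv_loss, dqn_loss, data_iter):
    # Stage 1: classification loss only (Eq. 11), lr = 0.01, 4000 steps.
    opt = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=1e-7)
    for _ in range(4000):
        batch = next(data_iter)
        loss = cls_loss(model, batch)
        opt.zero_grad()
        loss.backward()
        opt.step()

    # Stage 2: add adversarial alignment and instance refinement (Eq. 12), lr = 0.001, 8000 steps.
    opt = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-7)
    for _ in range(8000):
        batch = next(data_iter)
        loss = cls_loss(model, batch) + adv_loss(model, batch) + dqn_loss(model, batch)
        opt.zero_grad()
        loss.backward()
        opt.step()
```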

4.2 Results

For all experimental results, we follow [17] and report the average top-1 accuracy on the target domain over the last 9 epochs of training. The results of our model trained with only source data are reported as a lower bound, and the results of supervised learning on the target domain as the upper bound.

Table 1: Top-1 Accuracy of the experimental results of different baselines and our MMIR under different domain settings.
Method  D2→D1  D3→D1  D1→D2  D3→D2  D1→D3  D2→D3  Mean
Source-only 42.5 44.3 42.0 56.3 41.2 46.5 45.5
MMD [14] 43.1 48.3 46.6 55.2 39.2 48.5 46.8
AdaBN [12] 44.6 47.8 47.0 54.7 40.3 48.8 47.2
MCD [18] 42.1 47.9 46.5 52.7 43.5 51.0 47.3
MM-SADA [17] 48.2 50.9 49.5 56.1 44.1 52.7 50.3
Kim et al. [10] 49.5 51.5 50.3 56.3 46.3 52.0 51.0
MMIR 46.1 53.5 49.7 61.5 44.5 52.6 51.3
Supervised 71.7 74.0 62.8 74.0 62.8 71.7 69.5

We have a total of 6 domain adaptation settings based on D1, D2 and D3. Table 1 compares our method with several baseline methods and two state-of-the-art methods under every domain setting. On average, our method outperforms all other state-of-the-art methods, with an overall improvement of 5.8% over Source-only, 4.5% over MMD [14], 4.1% over AdaBN [12], 4.0% over MCD [18], 1.0% over MM-SADA [17] and 0.3% over Kim et al. [10].

4.3 Ablation Study

4.3.1 Effects of RL Agents.

We evaluate the performance of reinforcement learning agents according to modality and domain. In particular, we investigate the effect of S-agent and T-agent only in the modality of RGB as this modality contributes more during the feature alignment process. The results are shown in Table 2 and we denote “without” as “w/o”. We further elaborate on the results in the following part.

  • RGB vs Optical flow. Removing the agents in RGB leads to a performance drop of 1.8%, while removing the agents in optical flow causes a drop of only 0.7%. This indicates that the agents in RGB play the major part in refining the feature alignment, since RGB frames contain more spatial information whereas flow frames carry more temporal information, which contributes less to the alignment process.

  • S-agent vs T-agent in RGB. In RGB, removing S-agent causes a performance drop of 0.3%, while removing T-agent causes a drop of 0.8%. This shows that, in the RGB modality, T-agent contributes more than S-agent to alleviating the issue of negative transfer.

Table 2: Ablation study of the effect of reinforcement learning agents on D2 \rightarrow D1.
Method D2 \rightarrow D1
Source-only 42.5
MMIR w/o RGB agents 44.3
MMIR w/o Flow agents 45.4
MMIR w/o S-agent (RGB) 45.8
MMIR w/o T-agent (RGB) 45.3
MMIR 46.1

4.3.2 Overall Evaluation.

In addition, we also evaluate the overall effect of our instance refinement strategy (IR) by comparing it with the case of Adversarial-only in Table 3. We give a detailed illustration of our results as follows.

Table 3: Overall evaluation of our instance refinement strategy.
Method  Adversarial  IR  D2→D1  D3→D1  D1→D2  D3→D2  D1→D3  D2→D3  Mean
Source-only  -  -  42.5  44.3  42.0  56.3  41.2  46.5  45.5
MMIR  ✓  -  43.0  51.8  49.3  59.9  43.5  51.6  49.9
MMIR  ✓  ✓  46.1  53.5  49.7  61.5  44.5  52.6  51.3
  • Adversarial-only. Compared with Source-only, an improvement of 4.4% in top-1 accuracy can be observed. This shows that our domain adversarial alignment is effective in improving model performance on the target domain through directly training a domain discriminator in an adversarial way.

  • Adversarial + IR. Compared with Adversarial-only, it yields an overall performance boost of 1.4% and shows an improvement in every domain setting, as depicted in Table 3. This demonstrates the successful implementation of our instance refinement strategy and its capability to alleviate negative transfer in cross-domain action recognition.

5 Conclusion

We design a multi-modal instance refinement framework to alleviate the problem of negative transfer in cross-domain action recognition. Reinforcement learning agents are trained to learn policies that select out negative training samples, resulting in a better-aligned feature distribution via domain adversarial learning. Experiments show that our method successfully addresses negative transfer in multi-modal cross-domain action recognition and outperforms several competitive methods on a benchmark dataset. In the future, it is worth conducting experiments on a broader spectrum of datasets to validate whether MMIR generalizes to other use cases and to additional modalities such as text, speech and depth.

5.0.1 Acknowledgements.

This work is supported by the Major Project for New Generation of AI under Grant No. 2018AAA0100400, National Natural Science Foundation of China No. 82121003, and Shenzhen Research Program No. JSGG20210802153537009.

References

  • [1] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 6299–6308 (2017)
  • [2] Chen, C.F., Panda, R., Ramakrishnan, K., Feris, R., Cohn, J., Oliva, A., Fan, Q.: Deep analysis of cnn-based spatio-temporal representations for action recognition. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (Jun 2021)
  • [3] Chen, J., Wu, X., Duan, L., Chen, L.: Sequential instance refinement for cross-domain object detection in images. IEEE Transactions on Image Processing 30, 3970–3984 (2021)
  • [4] Damen, D., Doughty, H., Farinella, G.M., Fidler, S., Furnari, A., Kazakos, E., Moltisanti, D., Munro, J., Perrett, T., Price, W., et al.: Scaling egocentric vision: The epic-kitchens dataset. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 720–736 (2018)
  • [5] Dong, W., Zhang, Z., Tan, T.: Attention-aware sampling via deep reinforcement learning for action recognition. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 33, pp. 8247–8254 (2019)
  • [6] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
  • [7] Ganin, Y., Ustinova, E., Ajakan, H., Germain, P., Larochelle, H., Laviolette, F., Marchand, M., Lempitsky, V.: Domain-adversarial training of neural networks. The Journal of Machine Learning Research 17(1), 2096–2030 (2016)
  • [8] Ji, S., Xu, W., Yang, M., Yu, K.: 3d convolutional neural networks for human action recognition. IEEE transactions on pattern analysis and machine intelligence 35(1), 221–231 (2012)
  • [9] Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., Fei-Fei, L.: Large-scale video classification with convolutional neural networks. In: Proceedings of the IEEE conference on Computer Vision and Pattern Recognition. pp. 1725–1732 (2014)
  • [10] Kim, D., Tsai, Y.H., Zhuang, B., Yu, X., Sclaroff, S., Saenko, K., Chandraker, M.: Learning cross-modal contrastive features for video domain adaptation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 13618–13627 (2021)
  • [11] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
  • [12] Li, Y., Wang, N., Shi, J., Hou, X., Liu, J.: Adaptive batch normalization for practical domain adaptation. Pattern Recognition 80, 109–117 (2018)
  • [13] Lin, L.J.: Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8, 293–321 (1992)
  • [14] Long, M., Cao, Y., Wang, J., Jordan, M.: Learning transferable features with deep adaptation networks. In: International conference on machine learning. pp. 97–105. PMLR (2015)
  • [15] Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., Riedmiller, M.: Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602 (2013)
  • [16] Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness, J., Bellemare, M.G., Graves, A., Riedmiller, M., Fidjeland, A.K., Ostrovski, G., et al.: Human-level control through deep reinforcement learning. Nature 518(7540), 529–533 (2015)
  • [17] Munro, J., Damen, D.: Multi-modal domain adaptation for fine-grained action recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 122–132 (2020)
  • [18] Saito, K., Watanabe, K., Ushiku, Y., Harada, T.: Maximum classifier discrepancy for unsupervised domain adaptation. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 3723–3732 (2018)
  • [19] Tran, D., Bourdev, L., Fergus, R., Torresani, L., Paluri, M.: Learning spatiotemporal features with 3d convolutional networks. In: Proceedings of the IEEE international conference on computer vision. pp. 4489–4497 (2015)
  • [20] Tzeng, E., Hoffman, J., Saenko, K., Darrell, T.: Adversarial discriminative domain adaptation. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 7167–7176 (2017)
  • [21] Wang, H., Schmid, C.: Action recognition with improved trajectories. In: Proceedings of the IEEE international conference on computer vision. pp. 3551–3558 (2013)
  • [22] Wang, R., Chen, D., Wu, Z., Chen, Y., Dai, X., Liu, M., Yuan, L., Jiang, Y.G.: Masked video distillation: Rethinking masked feature modeling for self-supervised video representation learning. arXiv preprint arXiv:2212.04500 (2022)
  • [23] Wang, X., Chen, W., Wu, J., Wang, Y.F., Wang, W.Y.: Video captioning via hierarchical reinforcement learning. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 4213–4222 (2018)
  • [24] Wang, Y., Li, K., Li, Y., He, Y., Huang, B., Zhao, Z., Zhang, H., Xu, J., Liu, Y., Wang, Z., et al.: Internvideo: General video foundation models via generative and discriminative learning. arXiv preprint arXiv:2212.03191 (2022)
  • [25] Weng, J., Jiang, X., Zheng, W.L., Yuan, J.: Early action recognition with category exclusion using policy-based reinforcement learning. IEEE Transactions on Circuits and Systems for Video Technology 30(12), 4626–4638 (2020)
  • [26] Xu, Y., Yang, J., Cao, H., Wu, K., Wu, M., Chen, Z.: Source-free video domain adaptation by learning temporal consistency for action recognition. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXIV. pp. 147–164. Springer (2022)
  • [27] Zhou, M., Wang, R., Xie, C., Liu, L., Li, R., Wang, F., Li, D.: Reinforcenet: A reinforcement learning embedded object detection framework with region selection network. Neurocomputing 443, 369–379 (2021)