
Improving Continuous Sign Language Recognition with Consistency Constraints and Signer Removal

Ronglai Zuo and Brian Mak, The Hong Kong University of Science and Technology, Hong Kong
Abstract.

Deep-learning-based continuous sign language recognition (CSLR) models typically consist of a visual module, a sequential module, and an alignment module. However, the effectiveness of training such CSLR backbones is hindered by limited training samples, rendering the use of a single connectionist temporal classification loss insufficient. To address this limitation, we propose three auxiliary tasks to enhance CSLR backbones. First, we enhance the visual module, which is particularly sensitive to the challenges posed by limited training samples, from the perspective of consistency. Specifically, since sign languages primarily rely on signers’ facial expressions and hand movements to convey information, we develop a keypoint-guided spatial attention module that directs the visual module to focus on informative regions, thereby ensuring spatial attention consistency. Furthermore, recognizing that the output features of both the visual and sequential modules represent the same sentence, we leverage this prior knowledge to better exploit the power of the backbone. We impose a sentence embedding consistency constraint between the visual and sequential modules, enhancing the representation power of both features. The resulting CSLR model, referred to as consistency-enhanced CSLR, demonstrates superior performance on signer-dependent datasets, where all signers appear during both training and testing. To enhance its robustness for the signer-independent setting, we propose a signer removal module based on feature disentanglement, effectively eliminating signer-specific information from the backbone. To validate the effectiveness of the proposed auxiliary tasks, we conduct extensive ablation studies. Notably, utilizing a transformer-based backbone, our model achieves state-of-the-art or competitive performance on five benchmarks, including PHOENIX-2014, PHOENIX-2014-T, PHOENIX-2014-SI, CSL, and CSL-Daily. Code and models are available at https://github.com/2000ZRL/LCSA_C2SLR_SRM.

continuous sign language recognition, auxiliary learning, signer-independent, feature disentanglement.

1. Introduction

Sign language is usually the principal communication method among hearing-impaired people. Sign language recognition (SLR) aims to transcribe sign languages into glosses (the basic lexical units of a sign language), and it is an important technology for bridging the communication gap between normal-hearing and hearing-impaired people. According to the number of glosses in a sign sentence, SLR can be categorized into (a) isolated SLR (ISLR), in which each sign sentence consists of only a single gloss, and (b) continuous SLR (CSLR), in which each sign sentence may consist of multiple glosses. ISLR can be seen as a simple classification task and has become less popular in recent years. In this paper, we focus on CSLR, which is more practical than its isolated counterpart. In recent years, more and more CSLR models have been built using deep learning techniques because of their superior performance over traditional methods (Zhou et al., 2020; Min et al., 2021; Niu and Mak, 2020). According to (Niu and Mak, 2020), the backbone of most deep-learning-based CSLR models is composed of three parts: a visual module, a sequential (contextual) module, and an alignment module. Within this framework, visual features are first extracted from sign videos by the visual module. After that, sequential and contextual information is modeled by the sequential module. Finally, due to the difference between the length of a sign video and that of its gloss label sequence, an alignment module is needed to align the sequential features with the gloss label sequence and yield its probability.

Figure 1. An overview of the CSLR backbone and the three proposed auxiliary tasks. First, our SAC guides the visual module to focus on informative regions by leveraging pose keypoints heatmaps. Second, our SEC aligns the visual and sequential features at the sentence level, which can enhance the representation power of both features simultaneously. SAC and SEC constitute our preliminary work (Zuo and Mak, 2022a), consistency-enhanced CSLR (C²SLR). In this work, we extend C²SLR by developing a novel signer removal module based on feature disentanglement for signer-independent CSLR.

Usually, such CSLR backbones are trained with the connectionist temporal classification (CTC) (Graves et al., 2006) loss. However, since CSLR datasets are usually small, only using the CTC loss may not train the backbones sufficiently (Pu et al., 2019; Cui et al., 2019; Pu et al., 2020; Zhou et al., 2020; Hao et al., 2021; Cheng et al., 2020; Min et al., 2021). That is, the extracted features are not representative enough to be used to produce accurate recognition results. To relieve this issue, existing works can be roughly divided into two categories. First, (Cui et al., 2019) proposes a stage optimization strategy to iteratively refine the extracted features with the help of pseudo labels, which is widely adopted in (Pu et al., 2018, 2019, 2020; Zhou et al., 2020; Hao et al., 2021). However, it introduces more hyper-parameters and is time-consuming since the model needs to adapt to a different objective in each new stage (Cheng et al., 2020). As an alternative strategy, auxiliary learning can keep the whole model end-to-end trainable by just adding several auxiliary tasks (Cheng et al., 2020; Min et al., 2021). In this work, three novel auxiliary tasks are proposed to help train CSLR backbones.

Our first auxiliary task aims to enhance the visual module, which is important for feature extraction but sensitive to the insufficient training problem (Min et al., 2021; Cui et al., 2019; Zhou et al., 2020). Since the information in sign languages is mainly conveyed by signers’ facial expressions and hand movements (Zhou et al., 2020; Koller, 2020; Hu et al., 2021), signers’ face and hands are treated as informative regions. Thus, to enrich the visual features, some CSLR models (Zhou et al., 2020; Papadimitriou and Potamianos, 2020) leverage an off-the-shelf pose detector (Cao et al., 2019a; Sun et al., 2019) to locate the informative regions and then crop the feature maps to form a multi-stream architecture. However, this architecture introduces many extra parameters since each stream processes its inputs independently, and the cropping operation may discard the rich information in the pose keypoints heatmaps. As shown in Figure 1, by visualizing the heatmaps, we find that they reflect the importance of different spatial positions, which is similar to the idea of spatial attention. Thus, as shown in Figure 2, we insert a lightweight spatial attention module into the visual module and enforce spatial attention consistency (SAC) between the learned attention masks and the pose keypoints heatmaps. In this way, the visual module can pay more attention to the informative regions.

Only enhancing the visual module may not fully exploit the power of the backbone. According to (Min et al., 2021; Hao et al., 2021), better performance can be obtained by explicitly enforcing the consistency between the visual and sequential modules. VAC (Min et al., 2021) adopts a knowledge distillation loss between the two modules by treating the visual and sequential modules as a student-teacher pair. With a similar idea, SMKD (Hao et al., 2021) transfers knowledge through shared classifiers. Knowledge distillation can be treated as a kind of consistency since it is usually instantiated as a KL-divergence loss, a measure of the distance between two probability distributions. Nevertheless, the above two methods share a common deficiency: they measure consistency at the frame level, i.e., each frame has its own probability distribution. We argue that it is inappropriate to enforce frame-level consistency since the sequential module is supposed to gather contextual information; otherwise, the sequential module could simply be dropped. Motivated by the fact that both the visual and sequential features represent the same sentence, we propose the second auxiliary task: enforcing sentence embedding consistency (SEC) between them. As shown in Figure 2, we build a lightweight sentence embedding extractor that can be jointly trained with the backbone, and then minimize the distance between positive sentence embedding pairs while maximizing the distance between negative pairs.

We name the CSLR model trained with SAC and SEC consistency-enhanced CSLR (C²SLR). According to our experimental results (Table 9), with a transformer-based backbone, C²SLR can achieve satisfactory performance on signer-dependent datasets, in which all signers in the test set appear in the training set. However, as shown in Table 10(a), C²SLR cannot outperform the state-of-the-art (SOTA) work on the more challenging but realistic signer-independent CSLR (SI-CSLR) setting. Under the SI setting, since the signers in the test set are unseen during training, removing signer-specific information can make the model more robust to signer discrepancy. In this work, we further develop a signer removal module (SRM) based on the idea of feature disentanglement. More specifically, we first extract robust sentence-level signer embeddings with statistics pooling (Snyder et al., 2018) to “distill” signer information, which is then dispelled from the backbone implicitly by a gradient reversal layer (Ganin et al., 2016). Finally, the SRM is trained with a signer classification loss. To the best of our knowledge, we are the first to develop a specific module for SI-CSLR. (Some works (Cui et al., 2019; Pu et al., 2020) evaluate their methods on SI-CSLR datasets, but none of them propose any dedicated modules for the SI setting; (Yin et al., 2016) proposes a metric learning method to deal with the SI situation, but it focuses on ISLR.)

In summary, our main contributions are:

  • We propose to enforce the consistency between the learned attention masks and pose keypoints heatmaps to enable the visual module to focus on informative regions.

  • We propose to align the visual and sequential features at the sentence level to enhance the representation power of both features simultaneously.

  • We propose a signer removal module based on the idea of feature disentanglement to implicitly remove signer information from the backbone for SI-CSLR. To the best of our knowledge, we are the first to focus on this challenging setting.

  • Extensive experiments are conducted to validate the effectiveness of the three auxiliary tasks. More remarkably, with a transformer-based backbone, our model can achieve SOTA or competitive performance on five benchmarks, while the whole model is trained in an end-to-end manner.

This work is an extension of our 2022 CVPR paper, C²SLR (Zuo and Mak, 2022a). More specifically, we make the following new contributions:

  • Besides the investigation on signer-dependent continuous sign language recognition (SD-CSLR) in the CVPR paper, we propose in this paper an additional signer removal module (SRM) to tackle the more challenging signer-independent continuous sign language recognition (SI-CSLR) problem. More specifically, the SRM is designed to remove signer information from the backbone for SI-CSLR based on feature disentanglement. To the best of our knowledge, we are the first to propose a dedicated module to deal with SI-CSLR.

  • We successfully adapt statistics pooling to SI-CSLR to extract robust sentence-level signer embeddings for the SRM.

  • We conduct extensive ablation studies to validate the effectiveness of the SRM, and show that the combination of C²SLR and the SRM achieves SOTA performance on an SI-CSLR benchmark.

  • We also report additional experimental results of C²SLR on the latest large-scale Chinese sign language dataset, CSL-Daily (Zhou et al., 2021), which has a vocabulary size of 2K and about 20K videos.

2. Related Works

2.1. Deep-learning-based CSLR

According to (Niu and Mak, 2020), most deep-learning-based CSLR backbones consist of a visual module (3D-CNNs (Pu et al., 2019; Zhou et al., 2019) or 2D-CNNs (Zhou et al., 2020; Min et al., 2021; Hao et al., 2021)), a sequential module (1D-CNNs (Guo et al., 2019; Cheng et al., 2020), RNNs (Zhou et al., 2020; Min et al., 2021; Hao et al., 2021; Pu et al., 2019, 2020), or Transformers (Niu and Mak, 2020; Camgöz et al., 2020)), and an alignment module (CTC (Zhou et al., 2020; Min et al., 2021; Hao et al., 2021) or hidden Markov models (Koller et al., 2019)). To mitigate the issue of insufficient training, (Cui et al., 2019) introduces a stage optimization strategy that iteratively refines the extracted features using pseudo labels. This technique has garnered significant attention and has been widely adopted in related studies (Pu et al., 2019, 2020; Zhou et al., 2020; Hao et al., 2021). Extending this strategy, (Pu et al., 2019) incorporates a Long Short-Term Memory (LSTM) based auxiliary decoder. Furthermore, SMKD (Hao et al., 2021) proposes a three-stage optimization strategy that requires training the model for more than 100 epochs. VAC (Min et al., 2021) improves training efficiency while enhancing the visual module and enforcing consistency between the visual and sequential modules; this is accomplished through the proposed visual enhancement and visual alignment constraints on frame-level probability distributions, and the resulting model is end-to-end trainable. In this work, we enhance the visual module from the novel view of spatial attention consistency, and align the two modules at the sentence level to enforce their sentence embedding consistency.

Recently published CSLR works mostly focus on injecting more domain knowledge into sign video modeling (Chen et al., 2022b; Hu et al., 2023; Jiao et al., 2023), better training techniques (Guo et al., 2023; Zheng et al., 2023), or cross-lingual signs (Wei and Chen, 2023). However, all these works still focus on the signer-dependent setting, which limits their application scenarios. In this work, we propose a signer removal module to make the model robust to signer discrepancy in the more realistic signer-independent setting.

2.2. Spatial Attention

The spatial attention mechanism allows models to selectively attend to specific spatial positions and is widely adopted in various computer vision tasks, including semantic segmentation (Fu et al., 2019), object detection (Woo et al., 2018; Cao et al., 2019b), and image classification (Woo et al., 2018; Cao et al., 2019b; Linsley et al., 2018). However, the spatial attention module may not be well-trained with a single task-specific loss function. Leveraging external information to guide the spatial attention module can be a solution to this issue. In (Chen and Jiang, 2019), the spatial attention module is guided by motion information for video captioning. (Pang et al., 2019) and (Li et al., 2020) propose mask and relation guidance for occluded pedestrian detection and person re-identification, respectively. GALA (Linsley et al., 2018) presents an intriguing approach by utilizing click maps obtained from a game as supervision. In this work, we leverage pose keypoints heatmaps to direct the learning process of the spatial attention module.

2.3. Sentence Embedding

Traditional methods (Palangi et al., 2016; Liu et al., 2019) commonly adopt a straightforward approach in which the word embedding sequence is directly fed into recurrent neural networks (RNNs), and the final hidden state (or two hidden states for bidirectional RNNs) is extracted as the sentence embedding. Recently, many powerful sentence embedding extractors (Reimers and Gurevych, 2019; Gao et al., 2021; Carlsson et al., 2020) have been built on BERT (Kenton and Toutanova, 2019). However, it is difficult to use these methods in our work because (1) they are too large to be co-trained along with the backbone, and (2) they are pretrained on spoken languages, which are totally different from sign languages represented by videos. In this work, we build a lightweight sentence embedding extractor that can be jointly trained with the CSLR backbone.

2.4. Feature Disentanglement

In the context of signer-independent continuous sign language recognition (SI-CSLR), each signer can be considered as a distinct domain, and the key is to enable the model to generalize well to unseen domains, i.e., the test signers. Feature disentanglement has emerged as a powerful approach for achieving domain generalization by decomposing features into domain-invariant and domain-specific components (Wang et al., 2021). Adversarial learning has gained significant traction in the field of feature disentanglement, with the feature extractor serving as the generator and the domain classifier as the discriminator (Xu et al., 2020; Cheng et al., 2022; Liu et al., 2018b). For instance, in the context of facial expression recognition, (Xu et al., 2020) employs an adversarial approach to mitigate biases such as gender and race by training a series of domain classifiers. (Cheng et al., 2022) introduces a self-adversarial framework specifically designed to remove gaze-irrelevant factors, resulting in improved gaze estimation performance. Another notable advancement in feature disentanglement is the utilization of attention mechanisms to emphasize task-relevant features, while considering the remaining features as task-irrelevant. This approach has been successfully employed in various domains. For instance, in person re-identification, (Jin et al., 2020) utilizes a channel attention module to suppress style information, while in face recognition, (Huang et al., 2021) incorporates both spatial and channel attention mechanisms to eliminate age-related features. These studies exemplify the efficacy of leveraging adversarial learning and attention mechanisms for feature disentanglement in a range of applications. However, adversarial learning is usually complicated as the generator and discriminator are trained iteratively, and the attention modules would introduce extra parameters. In this work, we adopt the gradient reversal (GR) layer (Ganin et al., 2016) that reverses the gradient coming from the domain (signer) classification loss when the back-propagation process arrives at the feature extractor (CSLR backbone) while keeping the gradient of the domain classifier unchanged. It shares a similar idea with adversarial learning, but it is totally end-to-end and introduces no extra parameters compared to attention-based methods. Thus, we believe it can serve as a simple baseline for future research on SI-CSLR.

Figure 2. An overview of our proposed method. The sign video input is first fed into the visual module (e.g., VGGNet (Simonyan and Zisserman, 2015) or ResNet (He et al., 2016)) to extract visual features. The following sequential module (e.g., the local Transformer (see details in Section 3.6) or a TCN) further models long-/short-term dependencies and yields sequential features. The CTC loss (Graves et al., 2006) is adopted as the main objective function. Three auxiliary tasks (highlighted in different colors) are proposed to improve the performance of the CSLR backbone. For spatial attention consistency, we insert a keypoint-guided spatial attention module after the m-th convolution layer, C_m, of the visual module. Besides, we push the model to align the visual and sequential features at the sentence level to enhance their representation power. Finally, we introduce a signer removal module to make the model more robust to signer discrepancy under the signer-independent setting.

3. Our Proposed Method

3.1. Framework Overview

Figure 2 gives an overview of our proposed method. The blue, orange, and green arrows represent the three components of the CSLR backbone: the visual module, the sequential module, and the alignment module, respectively. Taking a sign video with T RGB frames \mathbf{x}=\{\mathbf{x}_{t}\}_{t=1}^{T}\in\mathbb{R}^{T\times H\times W\times 3} as input, the visual module, which simply consists of several 2D-CNN layers (C_{1},\dots,C_{n}) followed by a global average pooling (GAP) layer, first extracts visual features \mathbf{v}=\{\mathbf{v}_{t}\}_{t=1}^{T}\in\mathbb{R}^{T\times d}. (We only consider visual modules based on 2D-CNNs since a recent survey (Adaloglou et al., 2021) shows that 3D-CNNs cannot provide as precise gloss boundaries as 2D-CNNs, and lead to worse performance.) The sequential features \mathbf{s}=\{\mathbf{s}_{t}\}_{t=1}^{T}\in\mathbb{R}^{T\times d} are then extracted by the sequential module. Finally, the alignment module computes the probability of the gloss label sequence p(\mathbf{y}|\mathbf{x}) based on the widely-adopted CTC (Graves et al., 2006), where \mathbf{y}=\{y_{i}\}_{i=1}^{N} and N denotes the length of the gloss sequence. Below we first present the three proposed auxiliary tasks: spatial attention consistency (Section 3.2), sentence embedding consistency (Section 3.3), and signer removal (Section 3.4). The overall loss function is formulated in Section 3.5. Finally, in Section 3.6, we introduce a variant of the Transformer as a strong sequential module for CSLR.

3.2. Spatial Attention Consistency (SAC)

Signers’ facial expressions and hand movements are two major clues of sign languages (Koller, 2020; Zhou et al., 2020; Zuo et al., 2023). Thus, it is reasonable to expect the visual module to focus on signers’ faces and hands, i.e., informative regions (IRs). From this perspective, we insert a spatial attention module into the visual module and enforce the consistency between the learned attention masks and keypoints heatmaps. Since SAC is applied to all frames in the same way, we omit the time steps in the formulation below.

Figure 3. (a) The architecture of our spatial attention module (J\times K\times C: the size of the input feature maps; GAP: global average pooling; CMP: channel-wise max pooling). (b) Two examples of the original and refined heatmaps.

3.2.1. Spatial Attention Module

We build our spatial attention module based on CBAM (Woo et al., 2018) due to its simplicity and effectiveness. As shown in Figure 3(a), we first pick the most informative channel via a channel-wise max pooling (CMP) operation:

(1) \mathbf{M}_{1}=f_{CMP}(\mathbf{F})\in\mathbb{R}^{J\times K\times 1},

where \mathbf{M}_{1} is the feature map squeezed by CMP, and \mathbf{F}\in\mathbb{R}^{J\times K\times C} denotes the input feature maps.

Besides CMP, CBAM also squeezes the feature maps with an average pooling operation along the channel dimension. However, we propose to dynamically weight the importance of each channel. As shown in Figure 3(a), we first conduct global average pooling (GAP) over \mathbf{F} to gather global spatial information. Then the channel weights \mathbf{E}\in(0,1)^{1\times 1\times C} are simply generated by a channel-wise softmax layer. By a weighted sum along the channel dimension, we can generate another squeezed feature map \mathbf{M}_{2}:

(2) \mathbf{M}_{2}=\mathbf{F}\oplus\mathbf{E}=\sum_{i=1}^{C}\mathbf{F}_{i}\cdot\mathbf{E}_{i}\in\mathbb{R}^{J\times K\times 1}.

Finally, the spatial attention mask \mathbf{M} is generated as:

(3) \mathbf{M}=\sigma(f_{conv}(cat(\mathbf{M}_{1},\mathbf{M}_{2})))\in(0,1)^{J\times K},

where \sigma(\cdot) is the sigmoid function, f_{conv}(\cdot) is a 2D-CNN layer with a kernel size of 7\times 7, and cat(\cdot,\cdot) is a channel-wise concatenation operation. The output feature maps are the product of \mathbf{F} and \mathbf{M}. In this way, important positions can be highlighted while trivial ones are suppressed.

It should be noted that our channel weights are similar to the channel attention module in CBAM, but they introduce no extra parameters and can even outperform the vanilla CBAM according to our ablation studies in Table 3.
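To make the computation concrete, a minimal PyTorch sketch of Equations 1-3 is given below. It is a simplified illustration of the description above rather than the exact implementation in our released code; the module and variable names are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        # 2-channel input: CMP map and softmax-weighted channel sum; 7x7 conv as in Eq. (3)
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, feat: torch.Tensor):
        # feat: (B, C, J, K) input feature maps
        m1, _ = feat.max(dim=1, keepdim=True)               # channel-wise max pooling, Eq. (1)
        weights = F.softmax(feat.mean(dim=(2, 3)), dim=1)   # channel weights from GAP + softmax
        m2 = (feat * weights[:, :, None, None]).sum(dim=1, keepdim=True)  # weighted sum, Eq. (2)
        mask = torch.sigmoid(self.conv(torch.cat([m1, m2], dim=1)))       # (B, 1, J, K), Eq. (3)
        return feat * mask, mask.squeeze(1)                 # attended features and attention mask
```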

3.2.2. Keypoints Heatmap Extractor

Simply training the spatial attention module with the backbone may lead to sub-optimal solutions. Given the prior knowledge that signers’ faces and hands are informative regions (IRs), we guide the spatial attention module with keypoints heatmaps extracted by the pretrained HRNet (Sun et al., 2019; Andriluka et al., 2014). Specifically, we first normalize the raw outputs of HRNet linearly to obtain the original heatmaps:

(4) \mathbf{H}_{o}^{i}=\frac{f_{H}^{i}(\mathbf{I})-\min{f_{H}^{i}(\mathbf{I})}}{\max{f_{H}^{i}(\mathbf{I})}-\min{f_{H}^{i}(\mathbf{I})}}\in[0,1]^{H\times W},

where \mathbf{I} is the raw RGB frame, f_{H}(\cdot) is the pretrained HRNet, and i\in\{1,2,3\} denotes the face, left hand, and right hand, respectively.

3.2.3. Post-processing

There are some defects in the original heatmaps although they can roughly highlight the positions of the IRs. As shown in Figure 3(b), some trivial regions, e.g., the top of the face heatmap in the first row and the middle part of the left-hand heatmap in the second row, may receive high activation values. Besides, some highlighted regions, e.g., both face heatmaps in Figure 3(b), may not cover the IRs entirely. In addition, there is usually a mismatch between the fixed heatmap resolution of the pretrained HRNet and the resolution of the spatial attention masks. Below we elaborate on our heatmap post-processing module, which addresses these issues.

We first locate the center of each IR from the original heatmaps via a simple argmax operation: (x_{i},y_{i})=\mathrm{argmax}\ \mathbf{H}_{o}^{i}. To fit different resolutions of the spatial attention masks, we normalize the center as (\hat{x}_{i},\hat{y}_{i})=(\frac{x_{i}}{H-1},\frac{y_{i}}{W-1}). Suppose the spatial attention masks have a common resolution of J\times K; then a Gaussian-like refined keypoints heatmap is generated for each IR to reduce unwanted noise:

(5) \mathbf{H}_{r}^{i}(a,b)=\exp{\left(-\frac{1}{2}\left(\frac{(a-\hat{c}_{i}^{x})^{2}}{(J/\gamma_{x})^{2}}+\frac{(b-\hat{c}_{i}^{y})^{2}}{(K/\gamma_{y})^{2}}\right)\right)},

where 0\leq a<J and 0\leq b<K; (\hat{c}_{i}^{x},\hat{c}_{i}^{y})=(\hat{x}_{i}(J-1),\hat{y}_{i}(K-1)) denotes the transformed center of each IR under the resolution J\times K; and \gamma_{x} and \gamma_{y} are two hyper-parameters that control the scale of the highlighted regions. In practice, we set \gamma_{x}=\gamma_{y}. Finally, we merge the three processed IR heatmaps into a single one: \mathbf{H}_{r}=\max_{i}\mathbf{H}_{r}^{i}\in(0,1)^{J\times K}.
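A small sketch of this post-processing step (center localization, rescaling, and the Gaussian-like refinement of Equation 5) is given below. It assumes the original heatmaps have already been normalized as in Equation 4; the function and variable names are illustrative only.

```python
import torch

def refine_heatmaps(heatmaps: torch.Tensor, J: int, K: int,
                    gamma_x: float = 14.0, gamma_y: float = 14.0) -> torch.Tensor:
    # heatmaps: (3, H, W) normalized heatmaps for face, left hand, right hand
    _, H, W = heatmaps.shape
    a = torch.arange(J).float()[:, None]   # row indices of the J x K grid
    b = torch.arange(K).float()[None, :]   # column indices of the J x K grid
    refined = []
    for hm in heatmaps:
        idx = hm.flatten().argmax()
        x, y = idx // W, idx % W                                # keypoint center (row, col)
        cx, cy = x / (H - 1) * (J - 1), y / (W - 1) * (K - 1)   # rescale center to J x K
        # Gaussian-like refined heatmap, Eq. (5)
        refined.append(torch.exp(-0.5 * ((a - cx) ** 2 / (J / gamma_x) ** 2
                                         + (b - cy) ** 2 / (K / gamma_y) ** 2)))
    # merge face/hand heatmaps by element-wise max
    return torch.stack(refined).max(dim=0).values               # (J, K)
```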

3.2.4. SAC Loss

The spatial attention module is guided by the refined keypoints heatmaps via the SAC loss (in the implementation, we further average \mathcal{L}_{sac} over all time steps):

(6) \mathcal{L}_{sac}=\frac{1}{J\times K}\|\mathbf{M}-\mathbf{H}_{r}\|_{2}^{2}.
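In code, \mathcal{L}_{sac} reduces to a mean-squared error between the attention mask and the refined heatmap, as sketched below with dummy tensors (the per-time-step averaging is omitted).

```python
import torch
import torch.nn.functional as F

mask = torch.rand(14, 14)             # learned spatial attention mask M (J x K), e.g. from the sketch above
refined_heatmap = torch.rand(14, 14)  # refined keypoints heatmap H_r (J x K)
sac_loss = F.mse_loss(mask, refined_heatmap)  # equals ||M - H_r||_2^2 / (J * K), Eq. (6)
```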

3.3. Sentence Embedding Consistency (SEC)

Figure 4. The workflow of sentence embedding extraction. We omit LayerNorm (Ba et al., 2016) for simplicity.

Some works (Min et al., 2021; Hao et al., 2021) find that enforcing the consistency between the visual and sequential features can enhance their representation power, and lead to better performance. Different from (Min et al., 2021; Hao et al., 2021) that measure their consistency at the frame level, we impose a sentence embedding consistency between them.

3.3.1. Sentence Embedding Extractor (SEE)

Within a sign video, each gloss consists of only a few frames. We believe a good SEE for sign languages should take local contexts into consideration. As shown in Figure 4, our SEE is built on QANet (Yu et al., 2018), which consists of a depth-wise temporal convolution network (TCN) layer and a transformer encoder layer. The depth-wise TCN first extracts local contextual information from the frame-level feature sequence, then the transformer encoder models global contexts by its inner self-attention module.

Similar to the class token in BERT (Kenton and Toutanova, 2019), we first prepend a learnable sentence embedding token, [SEN], to the sequential features \mathbf{s}\in\mathbb{R}^{T\times d} defined in Section 3.1:

(7) \mathbf{s}^{\prime}=cat(\text{[SEN]},\mathbf{s})\in\mathbb{R}^{(T+1)\times d}.

The input of the SEE is the summation of the feature sequence and the positional embeddings (Vaswani et al., 2017); i.e., \mathbf{s}^{\prime\prime}=\mathbf{s}^{\prime}+\mathbf{P}, where \mathbf{P}\in\mathbb{R}^{(T+1)\times d}.

Within the SEE, the depth-wise TCN (Wu et al., 2018) layer first models local contexts with a residual shortcut: \mathbf{s}_{l}^{\prime\prime}=f_{TCN}(\mathbf{s}^{\prime\prime})+\mathbf{s}^{\prime\prime}. Then the transformer encoder layer gathers information from all time steps to get the sentence embedding:

(8) \mathbf{E}_{sen}^{s}=f_{TF}(\mathbf{s}_{l}^{\prime\prime})\in\mathbb{R}^{d}.

We can also get the sentence embedding of the visual features, \mathbf{E}_{sen}^{v}, in the same way.
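A minimal PyTorch sketch of the SEE is given below. It follows Figure 4 (a learnable [SEN] token, a depth-wise TCN with a residual shortcut, and a transformer encoder layer), but it is only an illustration; the class name, default sizes, and the use of torch.nn.TransformerEncoderLayer are simplifying assumptions rather than our exact implementation.

```python
import torch
import torch.nn as nn

class SentenceEmbeddingExtractor(nn.Module):
    def __init__(self, d: int = 512, kernel_size: int = 5, n_heads: int = 8):
        super().__init__()
        self.sen_token = nn.Parameter(torch.randn(1, 1, d))  # learnable [SEN] token
        self.dw_tcn = nn.Conv1d(d, d, kernel_size, padding=kernel_size // 2, groups=d)
        self.encoder = nn.TransformerEncoderLayer(d_model=d, nhead=n_heads, batch_first=True)

    def forward(self, feats: torch.Tensor, pos_emb: torch.Tensor) -> torch.Tensor:
        # feats: (B, T, d) visual or sequential features; pos_emb: (1, T+1, d) positional embeddings
        x = torch.cat([self.sen_token.expand(feats.size(0), -1, -1), feats], dim=1)  # Eq. (7)
        x = x + pos_emb
        x = self.dw_tcn(x.transpose(1, 2)).transpose(1, 2) + x   # depth-wise TCN + residual shortcut
        return self.encoder(x)[:, 0]                             # [SEN] position -> (B, d), Eq. (8)
```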

3.3.2. Negative Sampling

Directly minimizing the distance between \mathbf{E}_{sen}^{s} and \mathbf{E}_{sen}^{v} will result in trivial solutions. For example, if the parameters of the SEE are all zeros, then its outputs will always be the same. A simple way to address this issue is to introduce negative samples. In this work, we follow the common practice (Schroff et al., 2015; Ye et al., 2019; Oord et al., 2018; Hjelm et al., 2019) and sample another video from the mini-batch, taking its sequential features as the negative sample. Note that most CSLR models (Min et al., 2021; Hao et al., 2021; Zhou et al., 2020) are trained with a batch size of 2, and our negative sampling strategy degenerates to swapping under this setting:

(9) (neg(\mathbf{B}[0]),neg(\mathbf{B}[1]))=(\mathbf{B}[1],\mathbf{B}[0]),

where \mathbf{B}\in\mathbb{R}^{2\times T\times d} is a mini-batch of sequential features, and neg(\mathbf{B}[\cdot]) denotes the corresponding negative sample.

3.3.3. SEC Loss

We implement the SEC loss as a triplet loss (Schroff et al., 2015), minimizing the distances between the sentence embeddings computed from the visual and sequential features of the same sentence while maximizing the distances between those from different sentences:

(10) \mathcal{L}_{sec}=\max\{d(\mathbf{E}_{sen}^{v},\mathbf{E}_{sen}^{s})-d(\mathbf{E}_{sen}^{v},neg(\mathbf{E}_{sen}^{s}))+\alpha,0\},

where d(\mathbf{x}_{1},\mathbf{x}_{2})=1-\frac{\mathbf{x}_{1}\cdot\mathbf{x}_{2}}{\|\mathbf{x}_{1}\|_{2}\cdot\|\mathbf{x}_{2}\|_{2}}; \{\mathbf{E}_{sen}^{v},\mathbf{E}_{sen}^{s}\} are the sentence embeddings of the visual and sequential features from the same sentence; \{\mathbf{E}_{sen}^{v},neg(\mathbf{E}_{sen}^{s})\} are those from different sentences, where the sentence embedding of the sequential features from a different sentence is treated as the negative sample neg(\mathbf{E}_{sen}^{s}); and \alpha is the margin.
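For a batch size of 2, the SEC loss can be sketched as below, where the negative sample is obtained by swapping the two samples (Equation 9) and the cosine distance is used as d(\cdot,\cdot); the margin value follows Section 4.2.2, and the function name is illustrative.

```python
import torch
import torch.nn.functional as F

def sec_loss(e_v: torch.Tensor, e_s: torch.Tensor, alpha: float = 2.0) -> torch.Tensor:
    # e_v, e_s: (2, d) sentence embeddings of the visual and sequential features
    neg_e_s = e_s.flip(0)                                   # swap the two samples as negatives, Eq. (9)
    d_pos = 1 - F.cosine_similarity(e_v, e_s, dim=-1)       # cosine distance of positive pairs
    d_neg = 1 - F.cosine_similarity(e_v, neg_e_s, dim=-1)   # cosine distance of negative pairs
    return torch.clamp(d_pos - d_neg + alpha, min=0).mean() # triplet loss, Eq. (10)
```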

3.4. Signer Removal Module (SRM)

To remove signer information from CSLR backbones, we further develop a signer removal module (SRM) based on statistics pooling and gradient reversal as shown in Figure 5.

3.4.1. Signer Embeddings

We first extract signer embeddings to “distill” signer information before dispelling it. A naïve method is to simply feed the frame-level features into an MLP and treat the outputs of the MLP as signer embeddings. In this work, motivated by the superior performance of x-vectors (Snyder et al., 2018) in speaker recognition, we leverage statistics pooling to obtain more robust sentence-level signer embeddings.

Specifically, we first feed the intermediate visual features \mathbf{F}\in\mathbb{R}^{T\times J\times K\times C} into a global average pooling layer to squeeze the spatial dimensions and obtain frame-level features \mathbf{F}_{s}\in\mathbb{R}^{T\times C} (here we reuse the notation \mathbf{F} from Equation 1). Then a statistics pooling (SP) layer is used to aggregate frame-level information:

(11) \mathbf{F}_{s}^{SP}=cat(\mathbf{F}_{s}^{mean},\mathbf{F}_{s}^{std})\in\mathbb{R}^{2C},

where \mathbf{F}_{s}^{mean}\in\mathbb{R}^{C} and \mathbf{F}_{s}^{std}\in\mathbb{R}^{C} are the temporal mean and standard deviation of \mathbf{F}_{s}, respectively. In this way, \mathbf{F}_{s}^{SP} is capable of capturing signer characteristics over the entire video instead of at the frame level.

After that, a simple two-layer MLP with rectified linear unit (ReLU) activations is used to project the statistics into the signer embedding space:

(12) \mathbf{E}_{sig}=ReLU(\mathbf{W}_{2}ReLU(\mathbf{W}_{1}\mathbf{F}_{s}^{SP}+\mathbf{b}_{1})+\mathbf{b}_{2})\in\mathbb{R}^{C},

where \mathbf{W}_{1}\in\mathbb{R}^{C\times 2C},\mathbf{b}_{1}\in\mathbb{R}^{C},\mathbf{W}_{2}\in\mathbb{R}^{C\times C},\mathbf{b}_{2}\in\mathbb{R}^{C} are the parameters of the two-layer MLP.

Finally, the signer embeddings \mathbf{E}_{sig} are fed into a classifier to yield signer probabilities \mathbf{p}_{sig}\in(0,1)^{N_{sig}}, where N_{sig} is the number of signers. The SRM is trained with the signer classification loss, which is simply a cross-entropy loss:

(13) \mathcal{L}_{srm}=-\log p_{sig}^{i},

where i is the label of the signer.
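A minimal sketch of this signer embedding branch (Equations 11-13) is shown below; the class and variable names are placeholders, and the code is an illustration rather than our exact implementation.

```python
import torch
import torch.nn as nn

class SignerEmbedding(nn.Module):
    def __init__(self, C: int, n_signers: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * C, C), nn.ReLU(), nn.Linear(C, C), nn.ReLU())
        self.classifier = nn.Linear(C, n_signers)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, T, J, K, C) intermediate visual features
        f = feats.mean(dim=(2, 3))                                 # spatial GAP -> (B, T, C)
        stats = torch.cat([f.mean(dim=1), f.std(dim=1)], dim=-1)   # statistics pooling, Eq. (11)
        emb = self.mlp(stats)                                      # signer embedding, Eq. (12)
        return self.classifier(emb)                                # signer logits; trained with cross-entropy, Eq. (13)
```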

Figure 5. Workflow of our signer removal module (SRM). We insert the SRM after the m-th CNN layer, C_m. The loss of C²SLR, \mathcal{L}_{b}, which is the sum of the CTC, SAC, and SEC losses, is used to train the backbone parameters \theta_{b}. The signer classification loss \mathcal{L}_{srm} is used to train the SRM parameters \theta_{s} as usual, while the gradient from \mathcal{L}_{srm} is reversed for \theta_{b}. \lambda is the loss weight for \mathcal{L}_{srm}.

3.4.2. Gradient Reversal

If the CSLR backbone is jointly trained with \mathcal{L}_{srm}, the setup becomes multi-task learning, which, however, cannot guarantee that signer information is removed from the backbone. In this work, we treat each signer as a domain and formulate SI-CSLR as a domain generalization problem in which no test signers are seen during training. The gradient reversal layer was proposed in (Ganin et al., 2016) to address the domain generalization problem by learning features that are discriminative for the main classification task while indiscriminate with respect to the domain gap. More specifically, following (Ganin et al., 2016), we denote the parameters of the feature extractor, label predictor, and domain classifier as \theta_{f}, \theta_{y}, and \theta_{d}, respectively; the optimization of these parameters can be formulated as:

(14) \begin{split}\theta_{f}&\leftarrow\text{optimizer}(\theta_{f},\nabla_{\theta_{f}}\mathcal{L}_{y},-\lambda\nabla_{\theta_{f}}\mathcal{L}_{d},\eta),\\ \theta_{y}&\leftarrow\text{optimizer}(\theta_{y},\nabla_{\theta_{y}}\mathcal{L}_{y},\eta),\\ \theta_{d}&\leftarrow\text{optimizer}(\theta_{d},\lambda\nabla_{\theta_{d}}\mathcal{L}_{d},\eta),\end{split}

where \mathcal{L}_{y} and \mathcal{L}_{d} are the main classification and domain classification losses, respectively, \lambda is the loss weight for \mathcal{L}_{d}, and \eta is the learning rate.

We adapt Equation 14 by instantiating \mathcal{L}_{y} and \mathcal{L}_{d} as the backbone training loss \mathcal{L}_{b} and the signer classification loss \mathcal{L}_{srm}, respectively, both of which are illustrated in Figure 5. We also merge \theta_{f} and \theta_{y} into \theta_{b} to denote the parameters of the backbone, and use \theta_{s} to represent the parameters of the SRM. The new optimization process can be formulated as:

(15) \begin{split}\theta_{b}&\leftarrow\text{optimizer}(\theta_{b},\nabla_{\theta_{b}}\mathcal{L}_{b},-\lambda\nabla_{\theta_{b}}\mathcal{L}_{srm},\eta),\\ \theta_{s}&\leftarrow\text{optimizer}(\theta_{s},\lambda\nabla_{\theta_{s}}\mathcal{L}_{srm},\eta).\end{split}

As a result, the SRM itself is trained with \mathcal{L}_{srm} as usual, but the backbone is trained “reversely” so that the extracted features cannot discriminate between signers, and the signer information is implicitly removed. We validate the effectiveness of the SRM on two challenging SI-CSLR benchmarks, establishing a strong baseline for future works on SI-CSLR.
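A gradient reversal layer of this kind can be sketched in PyTorch as below: the forward pass is the identity, while the backward pass negates and scales the gradient flowing back into the backbone. The function and class names are illustrative.

```python
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lambda_):
        ctx.lambda_ = lambda_
        return x.view_as(x)           # identity in the forward pass

    @staticmethod
    def backward(ctx, grad_output):
        # reverse and scale the gradient that flows back into the backbone
        return -ctx.lambda_ * grad_output, None

def grad_reverse(x: torch.Tensor, lambda_: float = 0.75) -> torch.Tensor:
    return GradReverse.apply(x, lambda_)
```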

3.5. Alignment Module and Loss Function

We follow recent works (Hao et al., 2021; Min et al., 2021; Zhou et al., 2020; Pu et al., 2020) and adopt a CTC-based alignment module. It yields a label for each frame, which may be a repeated gloss label or a special blank symbol. CTC assumes that the model outputs at different time steps are conditionally independent of each other. Given an input sequence \mathbf{x}, the conditional probability of a label sequence \boldsymbol{\phi}=\{\phi_{i}\}_{i=1}^{T}, where \phi_{i}\in\mathcal{V}\cup\{blank\} and \mathcal{V} is the vocabulary of glosses, can be estimated by:

(16) p(\boldsymbol{\phi}|\mathbf{x})=\prod_{i=1}^{T}p(\phi_{i}|\mathbf{x}),

where p(\phi_{i}|\mathbf{x}) is the frame-level gloss probability generated by a classifier. The final probability of the gloss label sequence is the summation over all feasible alignments:

(17) p(\mathbf{y}|\mathbf{x})=\sum_{\boldsymbol{\phi}=\mathcal{G}^{-1}(\mathbf{y})}p(\boldsymbol{\phi}|\mathbf{x}),

where \mathcal{G} is a mapping function that removes repetitions and blank symbols in \boldsymbol{\phi}, and \mathcal{G}^{-1} is its inverse mapping. The CTC loss is then defined as:

(18) \mathcal{L}_{ctc}=-\log p(\mathbf{y}|\mathbf{x}).
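For reference, the CTC objective can be computed with an off-the-shelf implementation; the sketch below uses PyTorch's built-in CTC loss with dummy shapes (the vocabulary size corresponds to the 1,081 glosses of PHOENIX-2014 plus one blank) and is only an illustration.

```python
import torch
import torch.nn as nn

ctc = nn.CTCLoss(blank=0, zero_infinity=True)
log_probs = torch.randn(180, 2, 1082).log_softmax(-1)   # (T, B, |V|+1) frame-level gloss log-probs
targets = torch.randint(1, 1082, (2, 12))               # (B, N) dummy gloss label sequences
loss = ctc(log_probs, targets,
           input_lengths=torch.full((2,), 180, dtype=torch.long),
           target_lengths=torch.full((2,), 12, dtype=torch.long))  # Eq. (18)
```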

Finally, the overall loss function is a combination of the CTC, SAC, SEC, and signer classification losses:

(19) \mathcal{L}=\underbrace{\mathcal{L}_{ctc}+\mathcal{L}_{sac}+\mathcal{L}_{sec}}_{\mathcal{L}_{b}}+\lambda\mathcal{L}_{srm},

where \lambda=0 for signer-dependent datasets, and \lambda>0 for signer-independent ones.

3.6. A Strong Sequential Module: Local Transformer

Figure 6. Our proposed strong sequential module, the local transformer (LT). (a) Local Transformer; LayerNorm (Ba et al., 2016) is omitted for simplicity. (b) Local self-attention (LSA). LT is based on QANet (Yu et al., 2018), which validates the effectiveness of combining TCNs with self-attention. The difference is that we further leverage a Gaussian bias (Luong et al., 2015; Yang et al., 2018) to introduce local contexts into the self-attention module, i.e., local self-attention. (L: number of LT layers, set to 2 by default; RPE: relative positional encoding (Shaw et al., 2018); D: window size of the Gaussian bias.)

The sequential module is an important component of the CSLR backbone. Most existing CSLR works adopt globally-guided architectures, e.g., BiLSTM (Pu et al., 2019, 2020) and the vanilla Transformer (Niu and Mak, 2020; Camgöz et al., 2020), for sequence modeling due to their strong capability of capturing long-term temporal dependencies. However, within a sign video, each gloss is short, consisting of only a few frames. This can explain why locally-guided architectures, such as TCNs, can also achieve excellent performance (Cheng et al., 2020). In this subsection, we elaborate on a mixed architecture, the Local Transformer (LT), which leverages both global and local contexts for sequence modeling in CSLR.

Figure 6(a) shows the architecture of LT. Each LT layer consists of a depth-wise TCN layer, a local self-attention (LSA) layer, and a feed-forward network. Since the depth-wise TCN layer and the feed-forward network are the same as those used in (Yu et al., 2018; Vaswani et al., 2017), below we will only give the formulation of the LSA.

As shown in Figure 6(b), three linear layers first project the input feature sequence \mathbf{A}\in\mathbb{R}^{T\times d} into queries \mathbf{Q}\in\mathbb{R}^{T\times d}, keys \mathbf{K}\in\mathbb{R}^{T\times d}, and values \mathbf{V}\in\mathbb{R}^{T\times d}, respectively. We then split \mathbf{Q},\mathbf{K},\mathbf{V} into \{\mathbf{Q}^{h}\}_{h=1}^{N_{h}},\{\mathbf{K}^{h}\}_{h=1}^{N_{h}},\{\mathbf{V}^{h}\}_{h=1}^{N_{h}}, respectively, for multi-head self-attention as in (Vaswani et al., 2017), where \mathbf{Q}^{h},\mathbf{K}^{h},\mathbf{V}^{h}\in\mathbb{R}^{T\times d/N_{h}} and N_{h} is the number of heads. The attention scores for each head can be obtained by the scaled dot-product attention as follows:

(20) \mathbf{ATT}=\left\{\frac{(\mathbf{Q}^{h})(\mathbf{K}^{h})^{T}}{\sqrt{d/N_{h}}}\right\}_{h=1}^{N_{h}}\in\mathbb{R}^{N_{h}\times T\times T}.

The vanilla self-attention treats each position equally. To emphasize local contexts, we adopt a Gaussian bias (Luong et al., 2015; Yang et al., 2018) to weaken the interactions between distant query-key (QK) pairs. Given a QK pair (\mathbf{q}_{i}^{h},\mathbf{k}_{j}^{h}), the Gaussian bias (GB) is defined as:

(21) GB_{ij}^{h}=-\frac{(j-i)^{2}}{2\sigma^{2}},

where \sigma=\frac{D}{2}, and D is the window size of the Gaussian bias (Luong et al., 2015). Note that although we could assign a Gaussian bias with a different value of D to each head, we find that a common Gaussian bias shared among all heads suffices to boost the performance of the transformer significantly. The final attention weights for each value vector are obtained from a softmax layer, and the output of the LSA is:

(22) \begin{cases}\quad\mathbf{O}^{h}=softmax(\mathbf{ATT}^{h}+\mathbf{GB}^{h})\mathbf{V}^{h}\\ \quad\mathbf{O}^{LSA}=cat(\{\mathbf{O}^{h}\}_{h=1}^{N_{h}})\mathbf{W}^{O}\in\mathbb{R}^{T\times d}\ ,\end{cases}

where \mathbf{W}^{O}\in\mathbb{R}^{d\times d} denotes the output linear layer.
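A minimal sketch of the LSA is given below: a standard multi-head self-attention whose logits are offset by the Gaussian bias of Equation 21 before the softmax. Relative positional encoding and dropout are omitted, and the class name and default values are illustrative assumptions (the default D follows the PHOENIX value reported below).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalSelfAttention(nn.Module):
    def __init__(self, d: int = 512, n_heads: int = 8, D: float = 6.3):
        super().__init__()
        self.n_heads, self.d_head, self.sigma = n_heads, d // n_heads, D / 2.0
        self.qkv = nn.Linear(d, 3 * d)
        self.out = nn.Linear(d, d)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(B, T, self.n_heads, self.d_head).transpose(1, 2) for t in (q, k, v))
        att = q @ k.transpose(-2, -1) / self.d_head ** 0.5                 # scaled dot-product, Eq. (20)
        pos = torch.arange(T, device=x.device).float()
        gb = -(pos[None, :] - pos[:, None]) ** 2 / (2 * self.sigma ** 2)   # Gaussian bias, Eq. (21)
        out = F.softmax(att + gb, dim=-1) @ v                              # Eq. (22)
        return self.out(out.transpose(1, 2).reshape(B, T, d))
```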

We intuitively set D to the average ratio of frame length to gloss length: D=\frac{1}{|tr|}\sum_{i=1}^{|tr|}\frac{T_{i}}{N_{i}}, where |tr| is the number of training samples, based on the idea that a good window size should reflect the average number of frames per gloss. More specifically, D=6.3, 15.8, and 5.0 for the PHOENIX datasets, CSL, and CSL-Daily, respectively.

4. Experiments

4.1. Datasets and Evaluation Metric

4.1.1. Datasets

Table 1. Dataset statistics.
Dataset | Language | Vocab Size | #Samples (Train / Dev / Test) | #Signers (Train / Dev / Test) | Signer-Independent
PHOENIX-2014 | German | 1,081 | 5,672 / 540 / 629 | 9 / 9 / 9 | No
PHOENIX-2014-T | German | 1,085 | 7,096 / 519 / 642 | 9 / 9 / 9 | No
CSL-Daily | Chinese | 2,000 | 18,401 / 1,077 / 1,176 | 10 / 10 / 10 | No
PHOENIX-2014-SI | German | 1,081 | 4,376 / 111 / 180 | 8 / 1 / 1 | Yes
CSL | Chinese | 178 | 4,000 / N/A / 1,000 | 40 / N/A / 10 | Yes

We evaluate our method on three signer-dependent datasets (PHOENIX-2014, PHOENIX-2014-T, and CSL-Daily) and two signer-independent datasets (PHOENIX-2014-SI and CSL). Information about these datasets, including language, vocabulary size, train/dev/test splits, and number of signers, is available in Table 1. Compared to some widely-adopted datasets in action recognition, e.g., Kinetics-600 (Carreira et al., 2018) with about 500K videos and Something-Something v2 (Goyal et al., 2017) with about 169K videos, these sign language datasets are quite small. This also explains why specific training strategies, e.g., stage optimization and auxiliary training, have previously been suggested as necessary for CSLR.

4.1.2. Evaluation Metric

We use word error rate (WER) to measure the dissimilarity between two sequences.

(23) \text{WER}=\frac{\#\text{deletions}+\#\text{substitutions}+\#\text{insertions}}{\#\text{glosses in label}}

The official evaluation scripts provided by each dataset are used for measuring the WER.
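For reference, the WER of Equation 23 can be computed with a standard dynamic-programming edit distance, as sketched below; the reported numbers, however, all come from the official scripts.

```python
def wer(ref: list, hyp: list) -> float:
    # edit distance between reference and hypothesis gloss sequences, normalized by the label length
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])   # substitution (or match)
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)  # deletion / insertion
    return d[len(ref)][len(hyp)] / len(ref)
```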

4.2. Implementation Details

4.2.1. Data Augmentation

We first resize the RGB frames to 256\times 256 and then crop them to 224\times 224. For the PHOENIX datasets, we adopt stochastic frame dropping (SFD) (Niu and Mak, 2020) with a dropping ratio of 50%. However, due to the longer duration of videos in CSL and CSL-Daily, we implement a seg-and-drop strategy that first segments each video into short clips of two frames and then randomly drops one frame from each clip. The processed videos thus retain half of the original frames while preserving most of the information. After that, we further randomly drop 40% of the frames of these processed videos using SFD.
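The two dropping strategies can be sketched as index-selection functions, as below; the function names and the exact sampling details are illustrative assumptions rather than our exact implementation.

```python
import random

def seg_and_drop(num_frames: int) -> list:
    # keep one random frame from every 2-frame clip (used for CSL / CSL-Daily)
    return [min(i + random.randint(0, 1), num_frames - 1) for i in range(0, num_frames, 2)]

def stochastic_frame_drop(indices: list, drop_ratio: float = 0.5) -> list:
    # randomly drop a fraction of the remaining frames (SFD)
    keep = sorted(random.sample(range(len(indices)), int(len(indices) * (1 - drop_ratio))))
    return [indices[i] for i in keep]
```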

4.2.2. Backbones and Hyper-parameters

We first choose three representative backbones to validate the effectiveness of our method.

  • VGG11+TCN+BiLSTM (VTB). It is widely adopted in some recent works (Zhou et al., 2020; Min et al., 2021). VGG11 (Simonyan and Zisserman, 2015) is used as the visual module, and the sequential module is composed of the TCN and BiLSTM to capture both local and global contexts.

  • CNN+TCN (CT). This lightweight backbone only consists of a 9-layer 2D-CNN and a 3-layer TCN, which is proposed in (Cheng et al., 2020).

  • VGG11+Local Transformer (VLT). The sequential module is a 2-layer local transformer encoder described in Section 3.6.

To better validate the robustness of our method, we additionally append our local transformer to three mainstream visual backbones, including ResNet-18 (He et al., 2016), MobileNet-v3-Small (Howard et al., 2019), and GoogLeNet (Szegedy et al., 2015). To align the channel dimensions of the visual and sequential features, we configure the TCN layers in CT and VTB to have an output channel size of 512. Additionally, we set the number of hidden units of the BiLSTM in VTB to 2\times 256. These adjustments lead to word error rates (WERs) comparable to those reported in the original papers (Cheng et al., 2020; Zhou et al., 2020), maintaining consistency in performance evaluation. We empirically insert the spatial attention module after the 5th CNN layer. For post-processing, we set \gamma_{x}=\gamma_{y}=14 based on the experimental results presented in Section 4.3.6. The kernel size of the depth-wise TCN layer in both our SEE and the VLT backbone is set to 5, consistent with (Yu et al., 2018). To determine the margin \alpha in Equation 10, we consider the maximum difference between negative and positive cosine distances and set \alpha to 2. Regarding the signer removal module, we empirically position it after the 5th CNN layer, and the default weight for \mathcal{L}_{srm}, \lambda, is set to 0.75.

4.2.3. Training

All models are trained with a batch size of 2, following recent works (Hao et al., 2021; Min et al., 2021; Zhou et al., 2020). We employ the Adam optimizer (Kingma and Ba, 2015) with an initial learning rate of 1\times 10^{-4} and a weight decay factor of 1\times 10^{-4}. We empirically notice that \mathcal{L}_{sec} decreases at a faster rate than \mathcal{L}_{ctc}. To ensure consistent training progress between the backbone and the SEE, we decrease the learning rate of the SEE by a default factor of 0.1 for the backbone architectures used in our experiments (for CT, we set the factor to 0.01). Following the previous approach (Camgöz et al., 2020), we employ a plateau learning rate schedule: if the WER on the dev set does not decrease for 6 consecutive evaluation steps, the learning rate is reduced by a factor of 0.7. However, since CSL does not have an official dev set, we decrease the learning rate after the 15th and 25th epochs, and subsequently every 5 epochs after the 30th epoch. The total number of training epochs is set to 60.

4.2.4. Inference and Decoding.

Following (Niu and Mak, 2020), to match the training condition, we evenly select every \frac{1}{p_{d}}-th frame to drop during inference, where p_{d} is the dropping ratio. We adopt the beam search algorithm with a beam size of 10 for decoding.
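The deterministic frame selection can be sketched as below; the function name and rounding choice are illustrative assumptions.

```python
def evenly_drop(num_frames: int, p_d: float = 0.5) -> list:
    # drop every (1/p_d)-th frame so that the kept ratio matches the training dropping ratio
    stride = max(round(1 / p_d), 1)
    dropped = set(range(0, num_frames, stride))
    return [i for i in range(num_frames) if i not in dropped]
```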

Table 2. Ablation study for SAC and SEC. During inference, since our SEC can be removed, only the spatial attention module in SAC will introduce negligible parameters and affect inference speed. (SAC⁻ denotes only inserting the spatial attention module without the guidance of \mathcal{L}_{sac}; Par.: number of parameters; Sp.: inference speed measured on the same TITAN RTX GPU in seconds per video.)
Backbone | Configuration | WER% | Par. (M) | Sp. (s)
VTB | baseline | 25.0 | 15.6359 | 0.169
VTB | + SAC⁻ | 24.6 | +0.0001 | +0.002
VTB | + SAC | 23.7 | +0.0001 | +0.002
VTB | + SEC | 24.3 | +0.0000 | +0.000
VTB | + SAC + SEC | 22.6 | +0.0001 | +0.002
CT | baseline | 26.1 | 8.7504 | 0.095
CT | + SAC⁻ | 26.0 | +0.0001 | +0.001
CT | + SAC | 25.1 | +0.0001 | +0.001
CT | + SEC | 25.2 | +0.0000 | +0.000
CT | + SAC + SEC | 24.5 | +0.0001 | +0.001
VLT | baseline | 21.5 | 16.1850 | 0.163
VLT | + SAC⁻ | 21.4 | +0.0001 | +0.002
VLT | + SAC | 20.8 | +0.0001 | +0.002
VLT | + SEC | 20.9 | +0.0000 | +0.000
VLT | + SAC + SEC | 20.4 | +0.0001 | +0.002
ResNet-18+LT | baseline | 23.8 | 18.1356 | 0.103
ResNet-18+LT | + SAC⁻ | 23.8 | +0.0001 | +0.002
ResNet-18+LT | + SAC | 22.6 | +0.0001 | +0.002
ResNet-18+LT | + SEC | 22.8 | +0.0000 | +0.000
ResNet-18+LT | + SAC + SEC | 22.2 | +0.0001 | +0.002
MobileNet-v3-Small+LT | baseline | 26.0 | 9.0502 | 0.098
MobileNet-v3-Small+LT | + SAC⁻ | 25.8 | +0.0001 | +0.001
MobileNet-v3-Small+LT | + SAC | 25.2 | +0.0001 | +0.001
MobileNet-v3-Small+LT | + SEC | 25.2 | +0.0000 | +0.000
MobileNet-v3-Small+LT | + SAC + SEC | 24.7 | +0.0001 | +0.001
GoogLeNet+LT | baseline | 24.0 | 23.7070 | 0.112
GoogLeNet+LT | + SAC⁻ | 23.9 | +0.0001 | +0.002
GoogLeNet+LT | + SAC | 23.4 | +0.0001 | +0.002
GoogLeNet+LT | + SEC | 23.3 | +0.0000 | +0.000
GoogLeNet+LT | + SAC + SEC | 22.9 | +0.0001 | +0.002
Figure 7. Visualization results for learned spatial attention masks with or without the guidance of \mathcal{L}_{sac}. We randomly select five samples (s_{1},\dots,s_{5}) from the test set, and for each sample, we select one clear frame and one blurry frame. It is clear that the guidance of \mathcal{L}_{sac} can help the spatial attention module capture the informative regions (face and hands) more accurately.

4.3. Ablation Studies for C²SLR

We first conduct ablation studies for C²SLR on PHOENIX-2014, following previous works (Min et al., 2021; Zhou et al., 2020; Pu et al., 2020; Hao et al., 2021).

4.3.1. Effectiveness of SAC and SEC

As shown in Table 2, both SAC and SEC generalize well across different backbones: the performance of all six backbones is clearly improved. However, if the spatial attention module is inserted into the backbones without any guidance, i.e., SAC⁻, the model performance is only improved slightly, which verifies the effectiveness of \mathcal{L}_{sac}. The effectiveness of SEC suggests that explicitly enforcing the consistency between the visual and sequential modules at the sentence level can strengthen the cross-module cooperation, which leads to the performance gain. The improvements due to SAC and SEC are complementary, so using both of them obtains better results than using only one. Besides, since VLT performs the best among the six backbones, we use it as the default backbone for the following experiments.

4.3.2. Visualization Results for SAC

Figure 7 shows the visualization results of the learned spatial attention masks of SAC (with \mathcal{L}_{sac}) and SAC⁻ (without \mathcal{L}_{sac}) for five test samples. Note that since SAC is deactivated during testing, the comparison is fair. First, it is quite clear that the attention masks learned with the guidance of \mathcal{L}_{sac} look much better. Without the guidance of \mathcal{L}_{sac}, the attention masks are quite messy, with horizontal lines at the top and many highlights on trivial regions, e.g., the left shoulder of s_{2}, the hair of s_{1} and s_{4}, and the waist of s_{3} and s_{5}. This explains why SAC⁻ only slightly improves the performance of the backbones, as shown in Table 2. Second, our SAC is so robust that the IRs (face and hands) in blurry frames (right columns of s_{1} to s_{5}) can still be captured precisely. Third, it is capable of dealing with different hand positions: e.g., both hands are lower than the face (s_{1},s_{3}); one hand is near the face while the other is not (s_{1},s_{2},s_{4}); and the hands overlap (s_{5}).

Table 3. Ablation study for SAC.
Method | WER% | #Param (M)
VLT + SAC | 20.8 | 16.1851
  - channel weights | 21.3 | -0.0000
    + channel attention (Woo et al., 2018) | 21.2 | +0.0335
  - post-processing | 21.7 | -0.0000
  - face | 21.1 | -0.0000
  - hands | 21.2 | -0.0000

4.3.3. Channel Weights

Within our spatial attention module, each channel receives a weight to better measure its importance before the feature maps are squeezed. Removing the channel weights degenerates to the channel-wise average pooling in CBAM (Woo et al., 2018) and achieves a WER of 21.3%, a performance drop of 0.5%, as shown in Table 3. Although our channel weights share a similar idea with the channel attention module of CBAM, which builds extra linear layers to generate the attention weights, no extra parameters are introduced in our spatial attention module. To further validate their effectiveness, after removing the channel weights, we conduct one more experiment that adds the channel attention module back as in CBAM; however, it only leads to a slight performance gain and cannot outperform ours even with extra parameters.

4.3.4. Heatmap Refinement

We discussed in Section 3.2.3 that the raw heatmaps of HRNet (Sun et al., 2019) contain defects which may hinder the learning of the spatial attention module. As shown in Table 3, the quality of the keypoints heatmaps can make a difference to model performance: directly using the original heatmaps without post-processing yields a WER of 21.7%, which reduces the performance of SAC by almost 1%.

4.3.5. Effect of Each Informative Region

As shown in the last two rows of Table 3, removing either the face or the hands region can harm the performance of SAC. The results validate that both signers' faces and hands play a key role in conveying information, as also noted in (Zhou et al., 2020; Koller, 2020).

Figure 8. Visualization results and performance comparison for different \gamma_{x},\gamma_{y} in Equation 5. Since in practice the height and width of the spatial attention masks are usually the same, we set \gamma_{x} and \gamma_{y} to the same value.

4.3.6. Effect of the Hyper-parameters \gamma_{x},\gamma_{y} of Equation 5

We think \gamma_{x} and \gamma_{y} are two important hyper-parameters since they control the scale of the highlighted regions in the keypoints heatmaps. Thus, we conduct experiments to compare the performance of different \gamma_{x},\gamma_{y}, as shown in Figure 8. The model performance is worse when they are either too large (cannot cover the informative regions entirely) or too small (cover too many trivial regions). When \gamma_{x}=\gamma_{y}=14, the model achieves the best performance.

Table 4. Ablation study for SEC.
(a) Ablation study for the architecture of the sentence embedding extractor and negative sampling. (TF: Transformer; DTCN: depth-wise TCN; Neg. Sam.: negative sampling.)
Method | Extractor | Neg. Sam. | WER%
VLT + SEC | TF+DTCN | ✓ | 20.9
VLT + SEC | TF+DTCN | × | 21.5
VLT + SEC | TF | ✓ | 21.1
VLT + SEC | BiLSTM | ✓ | 21.3
(b) Ablation study for the constraint level. We fine-tune the loss factor of VA as in (Min et al., 2021) on the VLT for fair comparisons.
Level | Constraint | WER%
Sentence | consistency | 20.9
Frame | consistency | 21.6
Frame | visual enhancement (VE) (Min et al., 2021) | 22.3
Frame | visual alignment (VA) (Min et al., 2021) | 21.9
Frame | VE+VA (Min et al., 2021) | 22.8

4.3.7. Sentence Embedding Extractor and Negative Sampling

Our sentence embedding extractor consists of a depth-wise TCN layer and a transformer encoder, aiming to model local and global contexts, respectively. As shown in Table 4(a), local contexts are important to sentence embedding extraction, as dropping the TCN layer leads to worse performance. We also compare our method with the common practice that concatenates the last two hidden states of a BiLSTM and treats the result as the sentence embedding. The fact that it underperforms the transformer-based extractors implies the strength of the self-attention mechanism for sentence embedding extraction. Table 4(a) also shows that negative sampling plays a key role in our SEC: without negative sampling, i.e., directly minimizing the sentence embedding distance between the visual and sequential features, the constraint is not effective.

4.3.8. Constraint Level

As shown in Table 4(b), we implement several frame-level constraints to validate the effectiveness of our SEC. First, we replace the sentence embeddings v_se and s_se in Equation 10 with their corresponding frame-level features, so that the positive distances are minimized and the negative distances are maximized at the frame level. However, this leads to a performance degradation of 0.7% compared with our SEC. We further compare our SEC with VAC (Min et al., 2021), which is composed of two frame-level constraints: visual enhancement (VE) and visual alignment (VA). An extra classifier is first appended to the visual module to yield frame-level probability distributions (the visual distribution). VE is implemented as a CTC loss computed between the visual distribution and the gloss label, the same as the one used for training the backbone. VA is simply a KL-divergence loss that minimizes the distance between the visual distribution and the original probability distribution (p(φ_i|x) in Equation 16). Table 4(b) shows that both VE and VA perform much worse than our SEC. The results suggest that our SEC is a more appropriate way to measure the consistency between the visual and sequential modules.
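For reference, the sketch below implements the sentence-level constraint as a triplet loss with cosine distance and margin α = 2, following the definition recalled in Appendix A.2. Drawing one negative per anchor from the other videos in the batch is an illustrative choice.

import torch
import torch.nn.functional as F

def sec_loss(v_se, s_se, margin=2.0):
    # v_se, s_se: (N, D) sentence embeddings from the visual and sequential modules.
    # Cosine distance d(x, y) = 1 - cos(x, y), which lies in [0, 2].
    v = F.normalize(v_se, dim=1)
    s = F.normalize(s_se, dim=1)
    d = 1.0 - v @ s.t()                        # (N, N) pairwise distances
    pos = d.diag()                             # same video: positive pairs
    neg = d.roll(shifts=1, dims=1).diag()      # one illustrative negative per anchor (another video)
    return F.relu(pos - neg + margin).mean()

loss = sec_loss(torch.randn(4, 512), torch.randn(4, 512))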

Figure 9. Two examples of video-gloss pairs. For i ∈ {1, 2}, v_i denotes a video and l_i its corresponding gloss annotation. Their sentence embedding distances are shown in Table 5.
Table 5. Examples of sentence embedding distances between the visual and sequential features. v_1 and v_2 are the videos in Figure 9.
d(·,·)  E^s_sen(v_1)  E^s_sen(v_2)
E^v_sen(v_1)  0.01  1.99
E^v_sen(v_2)  1.76  0.37
Table 6. Effect of the value of λ (the weight of the loss ℒ_srm) in Equation 19.
λ  0  0.25  0.5  0.75  1.0  1.25  1.5
Dev 34.3 35.1 35.3 33.1 33.5 35.0 34.4
Test 34.4 33.8 33.1 32.7 32.8 34.2 33.6

4.3.9. Examples of Video-gloss Pairs

To verify whether ℒ_sec can really separate positive and negative samples, we provide two examples of video-gloss pairs, (v_1, l_1) and (v_2, l_2), as shown in Figure 9. Table 5 lists the sentence embedding distances between the visual and sequential features of v_1 and v_2. The distance between the two features of the same video (diagonal entries, positive pairs) is very small, whereas the distance between features of different videos (off-diagonal entries, negative pairs) is large (the maximum possible distance is 2.00).

4.4. Ablation Studies for the Signer Removal Module

We further conduct ablation studies for our signer removal module (SRM) on the challenging signer-independent dataset, PHOENIX-2014-SI.

4.4.1. Effect of the Hyper-parameter λ of Equation 19

According to (Liu et al., 2018b), the weight of the domain classification loss, i.e., our signer classification loss ℒ_srm, is an important hyper-parameter. We tune it from 0 to 1.5 with an interval of 0.25, as shown in Table 6. When λ = 0, the model degenerates to C²SLR and performs worse on the test set than all models with λ > 0, which suggests the importance of removing signer information for SI-CSLR. When λ = 0.75, the model achieves the best performance, with WERs of 33.1% and 32.7% on the dev and test sets, respectively.

Table 7. Ablation study for the signer removal module. Experiments are conducted on PHOENIX-2014-SI. (SP: statistics pooling; GR: gradient reversal)
Method  ℒ_srm  SP  GR  WER%  Type
C²SLR +  ×  ×  ×  34.4  N/A
C²SLR +  ✓  ×  ×  34.9  Multi-task Learning
C²SLR +  ✓  ✓  ×  33.5  Multi-task Learning
C²SLR +  ✓  ×  ✓  33.6  Feature Disentanglement
C²SLR +  ✓  ✓  ✓  32.7  Feature Disentanglement

4.4.2. Statistics Pooling and Gradient Reversal

We further conduct ablation studies on the two major components of our SRM: statistics pooling (SP) and the gradient reversal (GR) layer. The use of the GR layer determines the type of learning method: feature disentanglement (with GR) or multi-task learning (without GR). As shown in Table 7, models trained with GR under the feature disentanglement setting clearly outperform their counterparts under the multi-task learning setting, which implies that removing signer information is effective for SI-CSLR. Interestingly, the multi-task model with SP can also outperform the baseline. We attribute this to the regularization effect of multi-task learning (Zhang and Yang, 2021), which endows the network shared between the CSLR and signer classification branches with better generalization capability; similar ideas appear in works that jointly train a speech recognition model and a speaker recognition model (Liu et al., 2018a; Pironkov et al., 2016). Finally, the effectiveness of SP validates that sentence-level signer embeddings are more robust than frame-level ones for signer classification, leading to better performance.
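A minimal sketch of the two components follows, written as a gradient reversal function in the style of (Ganin et al., 2016) and a statistics-pooling signer classifier in the spirit of x-vectors (Snyder et al., 2018). The hidden size, the number of signers, and the attachment point are illustrative assumptions, and lam scales the reversed gradient, playing a role analogous to the weight λ in Equation 19.

import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    # Identity in the forward pass; scales gradients by -lam in the backward pass.
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

class SignerClassifier(nn.Module):
    def __init__(self, dim=512, num_signers=8, lam=0.75):    # num_signers: training signers (hypothetical value)
        super().__init__()
        self.lam = lam
        self.fc = nn.Linear(2 * dim, num_signers)             # classifier on pooled statistics

    def forward(self, feats):                                  # feats: (N, T, dim) frame-level features
        feats = GradReverse.apply(feats, self.lam)             # gradient reversal -> feature disentanglement
        stats = torch.cat([feats.mean(dim=1), feats.std(dim=1)], dim=1)   # statistics pooling (mean & std)
        return self.fc(stats)                                  # sentence-level signer logits for the signer loss

# Dropping GradReverse.apply turns the same branch into plain multi-task learning.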

Table 8. Performance comparison between seen and unseen signers. The relative gap is computed as (unseen WER − seen WER) / seen WER.
Method  Seen Signers (WER%)  Unseen Signers (WER%)  Relative Gap (%)
C²SLR  22.7  34.4  51.5
C²SLR + SRM  23.0  32.7  42.2

4.4.3. Effect of the SRM over Seen and Unseen Signers

Finally, we study the effect of the SRM on seen and unseen signers. We first build an extra test set consisting only of signers seen during training by removing the videos performed by unseen signers from the official test set of PHOENIX-2014, and then re-test "C²SLR" and "C²SLR + SRM" on this extra test set. As shown in Table 8, with comparable performance on the seen signers, adding the SRM significantly narrows the relative performance gap between unseen and seen signers. The results suggest that our SRM is especially helpful in the realistic scenario where most test signers are unseen.

4.5. Comparison with State-of-the-art Results

Table 9. Comparison on signer-dependent datasets. (R: RGB; F: optical flow; P: pose.)
Method End-to-end Modalities PHOENIX-2014 PHOENIX-2014-T CSL-Daily
Training Inference Dev Test Dev Test Dev Test
CNN-LSTM-HMMs (Koller et al., 2019) × R R 26.0 26.0 22.1 24.1
DNF (RGB) (Cui et al., 2019) + SBD-RL (Wei et al., 2020) × R R 23.4 23.5
DNF (Cui et al., 2019) × R+F R+F 23.1 22.9 32.8 32.4
CMA (Pu et al., 2020) × R R 21.3 21.9
SMKD (Hao et al., 2021) × R R 20.8 21.0 20.8 22.4
STMC (Zhou et al., 2020) × R+P R 21.1 20.7 19.6 21.0
LS-HAN (Huang et al., 2018) R R 38.3 39.0 39.4
TIN + Transformer (Zhou and et al., 2021) R R 33.6 33.1
SFL (Niu and Mak, 2020) R R 24.9 25.3 25.1 26.1
FCN (Cheng et al., 2020) R R 23.7 23.9 23.3 25.1 33.2 32.5
LCSA (Zuo and Mak, 2022b) R R 21.4 21.9
SLT (Camgöz et al., 2020) R R 24.6 24.5 33.1 32.0
VAC (Min et al., 2021) R R 21.2 22.3
MMTLB (Chen et al., 2022a) R R 21.9 22.5
C²SLR (ours) R+P R 20.5 20.4 20.2 20.4 31.9 31.0

4.5.1. Signer-dependent

As shown in Table 9, we first evaluate our C²SLR on three signer-dependent benchmarks: PHOENIX-2014, PHOENIX-2014-T, and CSL-Daily.

Our C²SLR follows the idea of auxiliary learning, which also appears in some existing works, e.g., FCN (Cheng et al., 2020) and VAC (Min et al., 2021). FCN proposes a gloss feature enhancement (GFE) module to introduce auxiliary supervision signals into the training process. However, the GFE module relies heavily on pseudo labels (CTC decoding results), which may contain many errors. Our method instead relies on pre-extracted heatmaps, which are quite accurate with the help of our post-processing algorithm, and on the model's inherent consistency: the visual and sequential features represent the same sentence. These two properties enable our method to outperform FCN by more than 3% on both PHOENIX-2014 and PHOENIX-2014-T. VAC proposes two auxiliary losses at the frame level, which are less appropriate and perform worse than ours according to the comparison in Section 4.3.8. The SOTA work, STMC (Zhou et al., 2020), adopts a complicated stage optimization strategy, which introduces extra hyper-parameters and requires manually deciding when to switch to a new stage. Our method is fully end-to-end trainable and outperforms STMC on both PHOENIX-2014 and PHOENIX-2014-T. To the best of our knowledge, this is the first time that an end-to-end method outperforms those using the stage optimization strategy.

In terms of modality usage, our method uses the extra pose modality only during training, while only RGB videos are needed for inference. It is thus simpler for real applications than DNF (Cui et al., 2019), which is built on a two-stream architecture taking both RGB videos and optical flow as inputs.

Finally, the results on CSL-Daily may be more important due to its large vocabulary. Our method still achieves SOTA performance on this large-scale dataset, which also validates the generalization capability of our method across different sign languages.

Table 10. Comparison on signer-independent datasets. (R: RGB; F: optical flow; P: pose; D: depth.)
(a) PHOENIX-2014-SI.
Method End-to-end Modalities Dev Test
Training Inference
Re-sign (Koller et al., 2017) × R R 45.1 44.1
DNF (Cui et al., 2019) × R+F R+F 36.0 35.7
CMA (Pu et al., 2020) × R R 34.8 34.3
C²SLR (ours) R+P R 34.3 34.4
C²SLR + SRM (ours) R+P R 33.1 32.7
(b) CSL.
Method End-to-end Modalities Test
Training Inference
LS-HAN (Huang et al., 2018) × R R 17.3
DPD + TEM (Zhou et al., 2019) × R R 4.7
STMC (Zhou et al., 2020) × R+P R 2.1
CTF (Wang et al., 2018) R R 11.2
HLSTM-attn (Guo et al., 2018) R R 10.2
FCN (Cheng et al., 2020) R R 3.0
VAC (Min et al., 2021) R R 1.6
MSeqGraph (Tang et al., 2021) R+P+D R+P+D 0.6
C²SLR (ours) R+P R 0.90
C²SLR + SRM (ours) R+P R 0.68

4.5.2. Signer-independent

As shown in Table 10, we further evaluate our SRM on two signer-independent benchmarks: PHOENIX-2014-SI and CSL.

Although some works, e.g., DNF (Cui et al., 2019) and CMA (Pu et al., 2020), evaluate their methods on PHOENIX-2014-SI, none of them proposes a dedicated module to deal with the challenging SI setting. In this work, we develop a simple yet effective signer removal module (SRM) for SI-CSLR to make the model more robust to signer discrepancy. As shown in Table 10(a), our C²SLR already achieves competitive performance on PHOENIX-2014-SI, and the SRM further improves the performance significantly. The result validates that feature disentanglement is an effective way to remove signer-relevant information, and we believe our SRM can serve as a strong baseline for future work on SI-CSLR.

As shown in Table 10(b), our SRM leads to a relative performance gain of 24.4% over the baseline C²SLR on CSL. (Although the SI setting itself is challenging, the sentences in the CSL test set all appear during training, so the WER can be very low (<1%).) It is worth noting that the SOTA work, MSeqGraph (Tang et al., 2021), uses three modalities: RGB, pose, and depth. Our method uses only RGB and pose information for training, and only RGB frames are needed for inference. Thus, with performance comparable to the SOTA work, we believe our method is more applicable in real practice.

5. Conclusion and Future Works

In this work, we propose three auxiliary tasks to enhance CSLR backbones. The first task guides the model to learn informative attention maps with a keypoint-guided spatial attention module. The second task enhances the representation power of the visual and sequential features by imposing a sentence embedding consistency constraint between them. The third task forces the model to dispel signer information through a dedicated signer removal module for the signer-independent setting. Extensive ablation studies validate the effectiveness of the three auxiliary tasks. Remarkably, our model achieves SOTA or competitive performance on five benchmarks while being trained in an end-to-end manner.

Several directions deserve attention in future work. First, to enhance the quality of keypoint heatmaps, lightweight keypoint estimators that can be co-trained with the CSLR backbone are necessary. Second, more advanced cross-modality sentence embedding extractors should be considered. Finally, we believe more attention should be paid to signer-independent CSLR, since it is more realistic than its signer-dependent counterpart.

Acknowledgements.
The work described in this paper was supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China (Project No. HKUST16200118).

References

  • Adaloglou et al. (2021) Nikolaos M. Adaloglou, Theocharis Chatzis, Ilias Papastratis, Andreas Stergioulas, Georgios Th Papadopoulos, Vassia Zacharopoulou, George Xydopoulos, Klimis Antzakas, Dimitris Papazachariou, and Petros Daras. 2021. A Comprehensive Study on Deep Learning-based Methods for Sign Language Recognition. IEEE TMM (2021), 1–1.
  • Andriluka et al. (2014) Mykhaylo Andriluka, Leonid Pishchulin, Peter Gehler, and Bernt Schiele. 2014. 2D Human Pose Estimation: New benchmark and State of the Art Analysis. In CVPR. 3686–3693.
  • Ba et al. (2016) Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer Normalization. arXiv preprint arXiv:1607.06450 (2016).
  • Camgoz et al. (2018) Necati Cihan Camgoz, Simon Hadfield, Oscar Koller, Hermann Ney, and Richard Bowden. 2018. Neural Sign Language Translation. In CVPR.
  • Camgöz et al. (2020) Necati Cihan Camgöz, Oscar Koller, Simon Hadfield, and Richard Bowden. 2020. Sign Language Transformers: Joint End-to-End Sign Language Recognition and Translation. In CVPR. 10020–10030.
  • Cao et al. (2019b) Yue Cao, Jiarui Xu, Stephen Lin, Fangyun Wei, and Han Hu. 2019b. GCNet: Non-local Networks Meet Squeeze-excitation Networks and Beyond. In CVPRW.
  • Cao et al. (2019a) Zhe Cao, Gines Hidalgo, Tomas Simon, Shih-En Wei, and Yaser Sheikh. 2019a. OpenPose: Realtime Multi-person 2D Pose Estimation using Part Affinity Fields. TPAMI 43, 1 (2019), 172–186.
  • Carlsson et al. (2020) Fredrik Carlsson, Amaru Cuba Gyllensten, Evangelia Gogoulou, Erik Ylipää Hellqvist, and Magnus Sahlgren. 2020. Semantic Re-tuning with Contrastive Tension. In ICLR.
  • Carreira et al. (2018) Joao Carreira, Eric Noland, Andras Banki-Horvath, Chloe Hillier, and Andrew Zisserman. 2018. A Short Note about Kinetics-600. arXiv preprint arXiv:1808.01340 (2018).
  • Chen and Jiang (2019) Shaoxiang Chen and Yu-Gang Jiang. 2019. Motion Guided Spatial Attention for Video Captioning. In AAAI, Vol. 33. 8191–8198.
  • Chen et al. (2022a) Yutong Chen, Fangyun Wei, Xiao Sun, Zhirong Wu, and Stephen Lin. 2022a. A Simple Multi-modality Transfer Learning Baseline for Sign Language Translation. In CVPR. 5120–5130.
  • Chen et al. (2022b) Yutong Chen, Ronglai Zuo, Fangyun Wei, Yu Wu, Shujie Liu, and Brian Mak. 2022b. Two-Stream Network for Sign Language Recognition and Translation. In NeurIPS.
  • Cheng et al. (2020) Ka Leong Cheng, Zhaoyang Yang, Qifeng Chen, and Yu-Wing Tai. 2020. Fully Convolutional Networks for Continuous Sign Language Recognition. In ECCV, Vol. 12369. 697–714.
  • Cheng et al. (2022) Yihua Cheng, Yiwei Bao, and Feng Lu. 2022. PureGaze: Purifying Gaze Feature for Generalizable Gaze Estimation. In AAAI.
  • Cui et al. (2019) Runpeng Cui, Hu Liu, and Changshui Zhang. 2019. A Deep Neural Framework for Continuous Sign Language Recognition by Iterative Training. IEEE TMM PP (07 2019), 1–1.
  • Fu et al. (2019) Jun Fu, Jing Liu, Haijie Tian, Yong Li, Yongjun Bao, Zhiwei Fang, and Hanqing Lu. 2019. Dual Attention Network for Scene Segmentation. In CVPR. 3146–3154.
  • Ganin et al. (2016) Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. 2016. Domain-adversarial Training of Neural Networks. JMLR 17, 1 (2016), 2096–2030.
  • Gao et al. (2021) Tianyu Gao, Xingcheng Yao, and Danqi Chen. 2021. SimCSE: Simple Contrastive Learning of Sentence Embeddings. EMNLP (2021).
  • Goyal et al. (2017) Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. 2017. The “something something” Video Database for Learning and Evaluating Visual Common Sense. In ICCV. 5842–5850.
  • Graves et al. (2006) Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. 2006. Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks. In ICML. 369–376.
  • Guo et al. (2019) Dan Guo, Shuo Wang, Qi Tian, and Meng Wang. 2019. Dense Temporal Convolution Network for Sign Language Translation. In IJCAI. 744–750.
  • Guo et al. (2018) Dan Guo, Wengang Zhou, Houqiang Li, and Meng Wang. 2018. Hierarchical LSTM for Sign Language Translation. In AAAI. 6845–6852.
  • Guo et al. (2023) Leming Guo, Wanli Xue, Qing Guo, Bo Liu, Kaihua Zhang, Tiantian Yuan, and Shengyong Chen. 2023. Distilling Cross-Temporal Contexts for Continuous Sign Language Recognition. In CVPR. 10771–10780.
  • Hao et al. (2021) Aiming Hao, Yuecong Min, and Xilin Chen. 2021. Self-Mutual Distillation Learning for Continuous Sign Language Recognition. In ICCV. 11303–11312.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In CVPR. 770–778.
  • Hjelm et al. (2019) R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Phil Bachman, Adam Trischler, and Yoshua Bengio. 2019. Learning Deep Representations by Mutual Information Estimation and Maximization. In ICLR.
  • Howard et al. (2019) Andrew Howard, Mark Sandler, Bo Chen, Weijun Wang, Liang-Chieh Chen, Mingxing Tan, Grace Chu, Vijay Vasudevan, Yukun Zhu, Ruoming Pang, Hartwig Adam, and Quoc Le. 2019. Searching for MobileNetV3. In ICCV. 1314–1324.
  • Hu et al. (2021) Hezhen Hu, Wengang Zhou, Junfu Pu, and Houqiang Li. 2021. Global-local Enhancement Network for NMF-aware Sign Language Recognition. ACM TOMM 17, 3 (2021), 1–19.
  • Hu et al. (2023) Lianyu Hu, Liqing Gao, Zekang Liu, and Wei Feng. 2023. Continuous Sign Language Recognition with Correlation Network. In CVPR.
  • Huang et al. (2018) Jie Huang, Wengang Zhou, Qilin Zhang, Houqiang Li, and Weiping Li. 2018. Video-Based Sign Language Recognition Without Temporal Segmentation. In AAAI. 2257–2264.
  • Huang et al. (2021) Zhizhong Huang, Junping Zhang, and Hongming Shan. 2021. When Age-invariant Face Recognition Meets Face Age Synthesis: A Multi-task Learning Framework. In CVPR. 7282–7291.
  • Jiao et al. (2023) Peiqi Jiao, Yuecong Min, Yanan Li, Xiaotao Wang, Lei Lei, and Xilin Chen. 2023. CoSign: Exploring Co-occurrence Signals in Skeleton-based Continuous Sign Language Recognition. In ICCV. 20676–20686.
  • Jin et al. (2020) Xin Jin, Cuiling Lan, Wenjun Zeng, Zhibo Chen, and Li Zhang. 2020. Style Normalization and Restitution for Generalizable Person Re-identification. In CVPR. 3143–3152.
  • Kenton and Toutanova (2019) Jacob Devlin Ming-Wei Chang Kenton and Lee Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL-HLT. 4171–4186.
  • Kingma and Ba (2015) Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In ICLR.
  • Koller (2020) Oscar Koller. 2020. Quantitative Survey of the State of the Art in Sign Language Recognition. arXiv preprint arXiv:2008.09918 (2020).
  • Koller et al. (2019) Oscar Koller, Necati Camgoz, Hermann Ney, and Richard Bowden. 2019. Weakly Supervised Learning with Multi-Stream CNN-LSTM-HMMs to Discover Sequential Parallelism in Sign Language Videos. IEEE TPAMI 42, 9 (04 2019), 2306–2320.
  • Koller et al. (2015) Oscar Koller, Jens Forster, and Hermann Ney. 2015. Continuous Sign Language Recognition: Towards Large Vocabulary Statistical Recognition Systems Handling Multiple Signers. CVIU 141 (Dec. 2015), 108–125.
  • Koller et al. (2017) Oscar Koller, Sepehr Zargaran, and Hermann Ney. 2017. Re-Sign: Re-Aligned End-to-End Sequence Modelling with Deep Recurrent CNN-HMMs. In CVPR. 3416–3424.
  • Li et al. (2020) Xingze Li, Wengang Zhou, Yun Zhou, and Houqiang Li. 2020. Relation-guided Spatial Attention and Temporal Refinement for Video-based Person Re-identification. In AAAI, Vol. 34. 11434–11441.
  • Linsley et al. (2018) Drew Linsley, Dan Shiebler, Sven Eberhardt, and Thomas Serre. 2018. Learning What and Where to Attend. In ICLR.
  • Liu et al. (2018a) Yi Liu, Liang He, Jia Liu, and Michael T Johnson. 2018a. Speaker Embedding Extraction with Phonetic Information. Interspeech (2018), 2247–2251.
  • Liu et al. (2019) Yue Liu, Xin Wang, Yitian Yuan, and Wenwu Zhu. 2019. Cross-modal Dual Learning for Sentence-to-video Generation. In ACM MM. 1239–1247.
  • Liu et al. (2018b) Yu Liu, Fangyin Wei, Jing Shao, Lu Sheng, Junjie Yan, and Xiaogang Wang. 2018b. Exploring Disentangled Feature Representation beyond Face Identification. In CVPR. 2080–2089.
  • Luong et al. (2015) Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective Approaches to Attention-based Neural Machine Translation. In EMNLP. 1412–1421.
  • Min et al. (2021) Yuecong Min, Aiming Hao, Xiujuan Chai, and Xilin Chen. 2021. Visual Alignment Constraint for Continuous Sign Language Recognition. In ICCV. 11542–11551.
  • Niu and Mak (2020) Zhe Niu and Brian Mak. 2020. Stochastic Fine-Grained Labeling of Multi-state Sign Glosses for Continuous Sign Language Recognition. In ECCV. 172–186.
  • Oord et al. (2018) Aaron van den Oord, Yazhe Li, and Oriol Vinyals. 2018. Representation Learning with Contrastive Predictive Coding. arXiv preprint arXiv:1807.03748 (2018).
  • Palangi et al. (2016) Hamid Palangi, Li Deng, Yelong Shen, Jianfeng Gao, Xiaodong He, Jianshu Chen, Xinying Song, and Rabab Ward. 2016. Deep Sentence Embedding using Long Short-term Memory Networks: Analysis and Application to Information Retrieval. IEEE/ACM TASLP 24, 4 (2016), 694–707.
  • Pang et al. (2019) Yanwei Pang, Jin Xie, Muhammad Haris Khan, Rao Muhammad Anwer, Fahad Shahbaz Khan, and Ling Shao. 2019. Mask-guided Attention Network for Occluded Pedestrian Detection. In CVPR. 4967–4975.
  • Papadimitriou and Potamianos (2020) Katerina Papadimitriou and Gerasimos Potamianos. 2020. Multimodal Sign Language Recognition via Temporal Deformable Convolutional Sequence Learning. In Interspeech. 2752–2756.
  • Pironkov et al. (2016) Gueorgui Pironkov, Stéphane Dupont, and Thierry Dutoit. 2016. Speaker-aware Long Short-term Memory Multi-task Learning for Speech Recognition. In European Signal Processing Conference (EUSIPCO). 1911–1915.
  • Pu et al. (2020) Junfu Pu, Wengang Zhou, Hezhen Hu, and Houqiang Li. 2020. Boosting Continuous Sign Language Recognition via Cross Modality Augmentation. In ACM MM. 1497–1505.
  • Pu et al. (2018) Junfu Pu, Wengang Zhou, and Houqiang Li. 2018. Dilated Convolutional Network with Iterative Optimization for Continuous Sign Language Recognition. In IJCAI. 885–891.
  • Pu et al. (2019) Junfu Pu, Wengang Zhou, and Houqiang Li. 2019. Iterative Alignment Network for Continuous Sign Language Recognition. In CVPR. 4165–4174.
  • Reimers and Gurevych (2019) Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In EMNLP. 3982–3992.
  • Schroff et al. (2015) Florian Schroff, Dmitry Kalenichenko, and James Philbin. 2015. Facenet: A Unified Embedding for Face Recognition and Clustering. In CVPR. 815–823.
  • Shaw et al. (2018) Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. 2018. Self-Attention with Relative Position Representations. In NAACL-HLT. 464–468.
  • Simonyan and Zisserman (2015) Karen Simonyan and Andrew Zisserman. 2015. Very Deep Convolutional Networks for Large-Scale Image Recognition. In ICLR.
  • Snyder et al. (2018) David Snyder, Daniel Garcia-Romero, Gregory Sell, Daniel Povey, and Sanjeev Khudanpur. 2018. X-vectors: Robust DNN Embeddings for Speaker Recognition. In ICASSP. 5329–5333.
  • Sun et al. (2019) Ke Sun, Bin Xiao, Dong Liu, and Jingdong Wang. 2019. Deep High-resolution Representation Learning for Human Pose Estimation. In CVPR. 5693–5703.
  • Szegedy et al. (2015) Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott E. Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. Going Deeper with Convolutions. In CVPR. 1–9.
  • Tang et al. (2021) Shengeng Tang, Dan Guo, Richang Hong, and Meng Wang. 2021. Graph-Based Multimodal Sequential Embedding for Sign Language Translation. IEEE TMM (2021).
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In NeurIPS. 5998–6008.
  • Wang et al. (2021) Jindong Wang, Cuiling Lan, Chang Liu, Yidong Ouyang, Wenjun Zeng, and Tao Qin. 2021. Generalizing to Unseen Domains: A Survey on Domain Generalization. arXiv preprint arXiv:2103.03097 (2021).
  • Wang et al. (2018) Shuo Wang, Dan Guo, Wen-gang Zhou, Zheng-Jun Zha, and Meng Wang. 2018. Connectionist Temporal Fusion for Sign Language Translation. In ACM MM. 1483–1491.
  • Wei et al. (2020) Chengcheng Wei, Jian Zhao, Wengang Zhou, and Houqiang Li. 2020. Semantic Boundary Detection with Reinforcement Learning for Continuous Sign Language Recognition. IEEE TCSVT 31, 3 (2020), 1138–1149.
  • Wei and Chen (2023) Fangyun Wei and Yutong Chen. 2023. Improving Continuous Sign Language Recognition with Cross-Lingual Signs. In ICCV. 23612–23621.
  • Woo et al. (2018) Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon. 2018. CBAM: Convolutional Block Attention Module. In ECCV. 3–19.
  • Wu et al. (2018) Felix Wu, Angela Fan, Alexei Baevski, Yann Dauphin, and Michael Auli. 2018. Pay Less Attention with Lightweight and Dynamic Convolutions. In ICLR.
  • Xu et al. (2020) Tian Xu, Jennifer White, Sinan Kalkan, and Hatice Gunes. 2020. Investigating Bias and Fairness in Facial Expression Recognition. In ECCV. 506–523.
  • Yang et al. (2018) Baosong Yang, Zhaopeng Tu, Derek F. Wong, Fandong Meng, Lidia S. Chao, and Tong Zhang. 2018. Modeling Localness for Self-Attention Networks. In EMNLP. 4449–4458.
  • Ye et al. (2019) Mang Ye, Xu Zhang, Pong C Yuen, and Shih-Fu Chang. 2019. Unsupervised Embedding Learning via Invariant and Spreading Instance Feature. In CVPR. 6210–6219.
  • Yin et al. (2016) Fang Yin, Xiujuan Chai, and Xilin Chen. 2016. Iterative Reference Driven Metric Learning for Signer Independent Isolated Sign Language Recognition. In ECCV. 434–450.
  • Yu et al. (2018) Adams Wei Yu, David Dohan, Minh-Thang Luong, Rui Zhao, Kai Chen, Mohammad Norouzi, and Quoc V. Le. 2018. QANet: Combining Local Convolution with Global Self-Attention for Reading Comprehension. In ICLR.
  • Zhang and Yang (2021) Yu Zhang and Qiang Yang. 2021. A Survey on Multi-task Learning. IEEE TKDE (2021).
  • Zheng et al. (2023) Jiangbin Zheng, Yile Wang, Cheng Tan, Siyuan Li, Ge Wang, Jun Xia, Yidong Chen, and Stan Z Li. 2023. CVT-SLR: Contrastive Visual-Textual Transformation for Sign Language Recognition with Variational Alignment. In CVPR.
  • Zhou and et al. (2021) Hao Zhou and et al. 2021. Improving Sign Language Translation with Monolingual Data by Sign Back-Translation. In CVPR.
  • Zhou et al. (2019) Hao Zhou, Wengang Zhou, and Houqiang Li. 2019. Dynamic Pseudo Label Decoding for Continuous Sign Language Recognition. In ICME. 1282–1287.
  • Zhou et al. (2020) Hao Zhou, Wengang Zhou, Yun Zhou, and Houqiang Li. 2020. Spatial-Temporal Multi-Cue Network for Continuous Sign Language Recognition. In AAAI. 13009–13016.
  • Zuo and Mak (2022a) Ronglai Zuo and Brian Mak. 2022a. C2SLR: Consistency-Enhanced Continuous Sign Language Recognition. In CVPR. 5131–5140.
  • Zuo and Mak (2022b) Ronglai Zuo and Brian Mak. 2022b. Local Context-aware Self-attention for Continuous Sign Language Recognition. In Proc. Interspeech. 4810–4814.
  • Zuo et al. (2023) Ronglai Zuo, Fangyun Wei, and Brian Mak. 2023. Natural Language-Assisted Sign Language Recognition. In CVPR. 14890–14900.

Appendix A Appendix

A.1. An Example of Word Error Rate

Word error rate (WER) is a widely adopted evaluation metric that measures the dissimilarity between two sequences and is commonly used in speech recognition and sign language recognition systems (Koller et al., 2015; Camgoz et al., 2018; Pu et al., 2020; Zhou et al., 2020; Min et al., 2021; Hao et al., 2021). It is defined as the ratio of the number of errors to the number of words (glosses) in the label sequence after aligning the prediction with the label:

(24)   WER = (#deletions + #substitutions + #insertions) / (#glosses in label),

where # denotes "the number of". Table 11 shows an example with a WER of 50%, and a small illustrative implementation follows the table.

Table 11. An example of WER computation. "A", "B", "C", and "D" represent words (glosses). □ denotes a deletion.
Prediction C \square C A D B
Label A A B C A D
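For completeness, a small routine that computes WER via a dynamic-programming edit distance is given below. The example sequences at the bottom are hypothetical and are not the ones in Table 11.

def wer(label, prediction):
    # Levenshtein alignment: errors = deletions + substitutions + insertions.
    n, m = len(label), len(prediction)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i                      # i deletions
    for j in range(m + 1):
        dp[0][j] = j                      # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dp[i - 1][j - 1] + (label[i - 1] != prediction[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[n][m] / n                   # divided by the number of glosses in the label

print(wer("A B C D".split(), "A C C D".split()))   # hypothetical example: 1 substitution -> 0.25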

A.2. Loss Normalization

As shown in Table 12, we conduct another experiment on PHOENIX-2014 by normalizing each loss term (ℒ_ctc, ℒ_sac, ℒ_sec) of ℒ_b into the range [0, 1]. Since ℒ_ctc is defined as the negative log-likelihood of all feasible alignment paths, we use the reciprocal of its maximum training loss value (about 1/7) as its weight for normalization. ℒ_sac is defined as an MSE loss between attention maps and keypoint heatmaps, which already ranges from 0 to 1, so we set its weight to 1. ℒ_sec is defined as a triplet loss, ℒ_sec = max{d(x, x_p) − d(x, x_n) + α, 0}, where d(x, x_p) = 1 − x·x_p / (||x||_2 ||x_p||_2) ∈ [0, 2] and α = 2; its theoretical maximum value is therefore 4, and we set its weight to 1/4. However, normalizing the loss terms does not lead to better performance, so we keep the default setting, which weighs each loss term equally by 1.0. A code sketch of the two weighting strategies follows Table 12.

Table 12. Comparison between different loss weighting strategies.
Loss Weights Dev Test
Normalization 20.5 20.6
All 1.0 (default) 20.5 20.4
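A minimal sketch of the two weighting strategies compared in Table 12, assuming the individual loss terms are computed elsewhere; the 1/7 and 1/4 factors are the normalization weights derived above.

def combine_losses(loss_ctc, loss_sac, loss_sec, normalize=False):
    # Normalized weighting (Appendix A.2): scale each term into roughly [0, 1].
    if normalize:
        return loss_ctc / 7.0 + loss_sac + loss_sec / 4.0
    # Default setting: every term weighted equally by 1.0.
    return loss_ctc + loss_sac + loss_sec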

A.3. Ablation Study for Local Transformer

We conduct an ablation study on PHOENIX-2014 to verify the effectiveness of the Gaussian bias and the DTCN layer. As shown in Table 13, both the Gaussian bias and the DTCN layer significantly decrease the WER, showing that our local transformer is a strong sequential module for CSLR (a sketch of the Gaussian bias is given after Table 13).

Table 13. Ablation study for the local transformer. (GB: Gaussian bias; DTCN: depth-wise temporal convolution network.)
GB DTCN WER%
25.2
22.7
21.5
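To make the Gaussian bias concrete, the sketch below adds a distance-dependent Gaussian penalty to the self-attention logits so that each frame attends mainly to its temporal neighbourhood. The window width sigma and the single-head formulation are illustrative assumptions rather than our exact local transformer.

import torch

def gaussian_biased_attention(q, k, v, sigma=4.0):
    # q, k, v: (N, T, D). Scaled dot-product attention plus a Gaussian bias
    # -((i - j)^2) / (2 * sigma^2) that favours temporally close positions.
    n, t, d = q.shape
    logits = q @ k.transpose(1, 2) / d ** 0.5             # (N, T, T)
    pos = torch.arange(t, dtype=torch.float32)
    bias = -((pos[:, None] - pos[None, :]) ** 2) / (2 * sigma ** 2)
    attn = torch.softmax(logits + bias, dim=-1)
    return attn @ v

out = gaussian_biased_attention(torch.randn(1, 60, 64), torch.randn(1, 60, 64), torch.randn(1, 60, 64))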

A.4. Position of the Signer Removal Module

The position of the signer removal module (SRM) is simply a hyper-parameter. Besides the SRM, the position of the spatial attention module is also decided empirically. As shown in Table 14(a), we first determine the position of the spatial attention module by placing it after different CNN layers. We find that an intermediate position, i.e., after the 5th CNN layer, is the best choice. An intuitive explanation is that early positions cannot provide enough error signals for the visual module, while too late a position leads to low heatmap resolution. We then fix the position of the spatial attention module and vary the position of the SRM. As shown in Table 14(b), placing the SRM right after the 5th CNN layer is also the best choice.

Table 14. The effects of the position of the spatial attention module and the signer removal module. m denotes the index of the CNN layer. All experiments are conducted on PHOENIX-2014-SI.
(a) The effect of the position of the spatial attention module.
m  1  2  3  4  5  6  7  8
Resolution  224  112  56  56  28  28  14  14
Dev  36.1  34.9  35.5  36.1  34.3  35.5  36.2  37.2
Test  36.0  35.3  35.5  36.5  34.4  33.6  35.7  36.0
(b) The effect of the position of the signer removal module.
m  1  2  3  4  5  6  7  8
Dev  35.3  35.3  36.2  35.3  33.1  35.2  35.9  35.1
Test  35.3  34.4  35.0  35.0  32.7  35.6  33.2  34.1

A.5. Abbreviations

We list all abbreviations that appear in the main text and their corresponding full names in Table 15.

Table 15. Abbreviations and full names.
Abbreviation Full Name Abbreviation Full Name
BiLSTM bidirectional long short-term memory RNN recurrent neural network
CMP channel-wise max pooling SAC spatial attention consistency
CNN convolutional neural network SD signer-dependent
CSLR continuous sign language recognition SEC sentence embedding consistency
C²SLR consistency-enhanced CSLR SEE sentence embedding extractor
CT CNN+TCN SFD stochastic frame dropping
CTC connectionist temporal classification SI signer-independent
GAP global average pooling SLR sign language recognition
GB Gaussian bias SOTA state-of-the-art
GFE gloss feature enhancement SP statistics pooling
GR gradient reversal SRM signer removal module
IR informative region TCN temporal convolutional network
ISLR isolated sign language recognition VA visual alignment
LSA local self-attention VE visual enhancement
LT local transformer VLT VGG11+LT
MLP multi-layer perceptron VTB VGG11+TCN+BiLSTM
QK query-key WER word error rate
ReLU rectified linear unit