
Improving Continuous Sign Language Recognition with Consistency Constraints and Signer Removal

Ronglai Zuo and Brian Mak, The Hong Kong University of Science and Technology, Hong Kong
Abstract.

Deep-learning-based continuous sign language recognition (CSLR) models typically consist of a visual module, a sequential module, and an alignment module. However, the effectiveness of training such CSLR backbones is hindered by limited training samples, rendering the use of a single connectionist temporal classification loss insufficient. To address this limitation, we propose three auxiliary tasks to enhance CSLR backbones. First, we enhance the visual module, which is particularly sensitive to the challenges posed by limited training samples, from the perspective of consistency. Specifically, since sign languages primarily rely on signers’ facial expressions and hand movements to convey information, we develop a keypoint-guided spatial attention module that directs the visual module to focus on informative regions, thereby ensuring spatial attention consistency. Furthermore, recognizing that the output features of both the visual and sequential modules represent the same sentence, we leverage this prior knowledge to better exploit the power of the backbone. We impose a sentence embedding consistency constraint between the visual and sequential modules, enhancing the representation power of both features. The resulting CSLR model, referred to as consistency-enhanced CSLR, demonstrates superior performance on signer-dependent datasets, where all signers appear during both training and testing. To enhance its robustness for the signer-independent setting, we propose a signer removal module based on feature disentanglement, effectively eliminating signer-specific information from the backbone. To validate the effectiveness of the proposed auxiliary tasks, we conduct extensive ablation studies. Notably, utilizing a transformer-based backbone, our model achieves state-of-the-art or competitive performance on five benchmarks, including PHOENIX-2014, PHOENIX-2014-T, PHOENIX-2014-SI, CSL, and CSL-Daily. Code and models are available at https://github.com/2000ZRL/LCSA_C2SLR_SRM.

continuous sign language recognition, auxiliary learning, signer-independent, feature disentanglement.

1. Introduction

Sign language is usually the principal communication method among hearing-impaired people. Sign language recognition (SLR) aims to transcribe sign languages into glosses (the basic lexical units of a sign language), and it is an important technology for bridging the communication gap between normal-hearing and hearing-impaired people. According to the number of glosses in a sign sentence, SLR can be categorized into (a) isolated SLR (ISLR), in which each sign sentence consists of only a single gloss, and (b) continuous SLR (CSLR), in which each sign sentence may consist of multiple glosses. ISLR can be seen as a simple classification task and has become less popular in recent years. In this paper, we focus on CSLR, which is more practical than its isolated counterpart. In recent years, more and more CSLR models have been built using deep learning techniques because of their superior performance over traditional methods (Zhou et al., 2020; Min et al., 2021; Niu and Mak, 2020). According to (Niu and Mak, 2020), the backbone of most deep-learning-based CSLR models is composed of three parts: a visual module, a sequential (contextual) module, and an alignment module. Within this framework, visual features are first extracted from sign videos by the visual module. After that, sequential and contextual information is modeled by the sequential module. Finally, due to the difference between the length of a sign video and that of its gloss label sequence, an alignment module is needed to align the sequential features with the gloss label sequence and yield its probability.

Figure 1. An overview of the CSLR backbone and the three proposed auxiliary tasks. First, our SAC guides the visual module to focus on informative regions by leveraging pose keypoints heatmaps. Second, our SEC aligns the visual and sequential features at the sentence level, which can enhance the representation power of both features simultaneously. SAC and SEC constitute our preliminary work (Zuo and Mak, 2022a), consistency-enhanced CSLR (C²SLR). In this work, we extend C²SLR by developing a novel signer removal module based on feature disentanglement for signer-independent CSLR.

Usually, such CSLR backbones are trained with the connectionist temporal classification (CTC) (Graves et al., 2006) loss. However, since CSLR datasets are usually small, only using the CTC loss may not train the backbones sufficiently (Pu et al., 2019; Cui et al., 2019; Pu et al., 2020; Zhou et al., 2020; Hao et al., 2021; Cheng et al., 2020; Min et al., 2021). That is, the extracted features are not representative enough to be used to produce accurate recognition results. To relieve this issue, existing works can be roughly divided into two categories. First, (Cui et al., 2019) proposes a stage optimization strategy to iteratively refine the extracted features with the help of pseudo labels, which is widely adopted in (Pu et al., 2018, 2019, 2020; Zhou et al., 2020; Hao et al., 2021). However, it introduces more hyper-parameters and is time-consuming since the model needs to adapt to a different objective in each new stage (Cheng et al., 2020). As an alternative strategy, auxiliary learning can keep the whole model end-to-end trainable by just adding several auxiliary tasks (Cheng et al., 2020; Min et al., 2021). In this work, three novel auxiliary tasks are proposed to help train CSLR backbones.

Our first auxiliary task aims to enhance the visual module, which is important for feature extraction but sensitive to the insufficient training problem (Min et al., 2021; Cui et al., 2019; Zhou et al., 2020). Since the information in sign languages is mainly conveyed by signers’ facial expressions and hand movements (Zhou et al., 2020; Koller, 2020; Hu et al., 2021), signers’ face and hands are treated as informative regions. Thus, to enrich the visual features, some CSLR models (Zhou et al., 2020; Papadimitriou and Potamianos, 2020) leverage an off-the-shelf pose detector (Cao et al., 2019a; Sun et al., 2019) to locate the informative regions and then crop the feature maps to form a multi-stream architecture. However, this architecture introduces many extra parameters since each stream processes its inputs independently, and the cropping operation may discard the rich information in the pose keypoints heatmaps. As shown in Figure 1, by visualizing the heatmaps, we find that they reflect the importance of different spatial positions, which is similar to the idea of spatial attention. Thus, as shown in Figure 2, we insert a lightweight spatial attention module into the visual module and enforce spatial attention consistency (SAC) between the learned attention masks and the pose keypoints heatmaps. In this way, the visual module can pay more attention to the informative regions.

Only enhancing the visual module may not fully exploit the power of the backbone. According to (Min et al., 2021; Hao et al., 2021), better performance can be obtained by explicitly enforcing the consistency between the visual and sequential modules. VAC (Min et al., 2021) adopts a knowledge distillation loss between the two modules by treating the visual and sequential modules as a student-teacher pair. With a similar idea, SMKD (Hao et al., 2021) transfers knowledge through shared classifiers. Knowledge distillation can be treated as a kind of consistency since it is usually instantiated as a KL-divergence loss, a measure of the distance between two probability distributions. Nevertheless, the above two methods share a common deficiency: they measure consistency at the frame level, i.e., each frame has its own probability distribution. We argue that it is inappropriate to enforce frame-level consistency since the sequential module is supposed to gather contextual information; otherwise, the sequential module could simply be dropped. Motivated by the fact that both the visual and sequential features represent the same sentence, we propose the second auxiliary task: enforcing sentence embedding consistency (SEC) between them. As shown in Figure 2, we build a lightweight sentence embedding extractor that can be jointly trained with the backbone, and then minimize the distance between positive sentence embedding pairs while maximizing the distance between negative pairs.

We name the CSLR model trained with SAC and SEC consistency-enhanced CSLR (C²SLR). According to our experimental results (Table 9), with a transformer-based backbone, C²SLR can achieve satisfactory performance on signer-dependent datasets, in which all signers in the test set appear in the training set. However, as shown in Table 10(a), C²SLR cannot outperform the state-of-the-art (SOTA) work on the more challenging but realistic signer-independent CSLR (SI-CSLR) setting. Under the SI setting, since the signers in the test set are unseen during training, removing signer-specific information can make the model more robust to signer discrepancy. In this work, we further develop a signer removal module (SRM) based on the idea of feature disentanglement. More specifically, we first extract robust sentence-level signer embeddings with statistics pooling (Snyder et al., 2018) to “distill” signer information, which is then dispelled from the backbone implicitly by a gradient reversal layer (Ganin et al., 2016). Finally, the SRM is trained with a signer classification loss. To the best of our knowledge, we are the first to develop a specific module for SI-CSLR. (Some works (Cui et al., 2019; Pu et al., 2020) evaluate their methods on SI-CSLR datasets, but none of them propose any dedicated modules for the SI setting; (Yin et al., 2016) proposes a metric learning method to deal with the SI situation, but it focuses on ISLR.)

In summary, our main contributions are:

  • We propose to enforce the consistency between the learned attention masks and pose keypoints heatmaps to enable the visual module to focus on informative regions.

  • We propose to align the visual and sequential features at the sentence level to enhance the representation power of both features simultaneously.

  • We propose a signer removal module based on the idea of feature disentanglement to implicitly remove signer information from the backbone for SI-CSLR. To the best of our knowledge, we are the first to focus on this challenging setting.

  • Extensive experiments are conducted to validate the effectiveness of the three auxiliary tasks. More remarkably, with a transformer-based backbone, our model can achieve SOTA or competitive performance on five benchmarks, while the whole model is trained in an end-to-end manner.

This work is an extension of our 2022 CVPR paper, C²SLR (Zuo and Mak, 2022a). More specifically, we make the following new contributions:

  • Besides the investigation on signer-dependent continuous sign language recognition (SD-CSLR) in the CVPR paper, we propose in this paper an additional signer removal module (SRM) to tackle the more challenging signer-independent continuous sign language recognition (SI-CSLR) problem. More specifically, the SRM is designed to remove signer information from the backbone for SI-CSLR based on feature disentanglement. To the best of our knowledge, we are the first to propose a dedicated module to deal with SI-CSLR.

  • We successfully adapt statistics pooling to SI-CSLR to extract robust sentence-level signer embeddings for the SRM.

  • We conduct extensive ablation studies to validate the effectiveness of the SRM, and show that the combination of C²SLR and the SRM achieves SOTA performance on an SI-CSLR benchmark.

  • We also report additional experimental results of C²SLR on the latest large-scale Chinese sign language dataset, CSL-Daily (Zhou et al., 2021), which has a vocabulary size of 2K and about 20K videos.

2. Related Works

2.1. Deep-learning-based CSLR

According to (Niu and Mak, 2020), most deep-learning-based CSLR backbones consist of a visual module (3D-CNNs (Pu et al., 2019; Zhou et al., 2019) or 2D-CNNs (Zhou et al., 2020; Min et al., 2021; Hao et al., 2021)), a sequential module (1D-CNNs (Guo et al., 2019; Cheng et al., 2020), RNNs (Zhou et al., 2020; Min et al., 2021; Hao et al., 2021; Pu et al., 2019, 2020), or Transformers (Niu and Mak, 2020; Camgöz et al., 2020)), and an alignment module (CTC (Zhou et al., 2020; Min et al., 2021; Hao et al., 2021) or hidden Markov models (Koller et al., 2019)). To mitigate the issue of insufficient training, (Cui et al., 2019) introduces a stage optimization strategy that iteratively refines the extracted features using pseudo labels. This technique has garnered significant attention and has been widely adopted in related studies (Pu et al., 2019, 2020; Zhou et al., 2020; Hao et al., 2021). Extending this strategy, (Pu et al., 2019) incorporates a Long Short-Term Memory (LSTM) based auxiliary decoder. Furthermore, SMKD (Hao et al., 2021) proposes a three-stage optimization strategy that requires training the model for more than 100 epochs. VAC (Min et al., 2021) improves training efficiency while enhancing the visual module and enforcing consistency between the visual and sequential modules; this is accomplished through the proposed visual enhancement and visual alignment constraints on frame-level probability distributions, and the resulting model is end-to-end trainable. In this work, we enhance the visual module from the novel view of spatial attention consistency, and align the two modules at the sentence level to enforce their sentence embedding consistency.

Recently published CSLR works mostly focus on injecting more domain knowledge into sign video modeling (Chen et al., 2022b; Hu et al., 2023; Jiao et al., 2023), better training techniques (Guo et al., 2023; Zheng et al., 2023), or cross-lingual signs (Wei and Chen, 2023). However, all these works still focus on the signer-dependent setting, which limits their application scenarios. In this work, we propose a signer removal module to make the model robust to signer discrepancy in the more realistic signer-independent setting.

2.2. Spatial Attention

The spatial attention mechanism allows models to selectively attend to specific spatial positions and is widely adopted in various computer vision tasks, including semantic segmentation (Fu et al., 2019), object detection (Woo et al., 2018; Cao et al., 2019b), and image classification (Woo et al., 2018; Cao et al., 2019b; Linsley et al., 2018). However, the spatial attention module may not be well-trained with a single task-specific loss function. Leveraging external information to guide the spatial attention module can be a solution to this issue. In (Chen and Jiang, 2019), the spatial attention module is guided by motion information for video captioning. (Pang et al., 2019) and (Li et al., 2020) propose mask and relation guidance for occluded pedestrian detection and person re-identification, respectively. GALA (Linsley et al., 2018) presents an intriguing approach by utilizing click maps obtained from a game as supervision. In this work, we leverage pose keypoints heatmaps to direct the learning process of the spatial attention module.

2.3. Sentence Embedding

Traditional methods (Palangi et al., 2016; Liu et al., 2019) commonly adopt a straightforward approach in which the word embedding sequence is directly fed into recurrent neural networks (RNNs), and the final hidden state (or two hidden states for bidirectional RNNs) is extracted as the sentence embedding. Recently, many powerful sentence embedding extractors (Reimers and Gurevych, 2019; Gao et al., 2021; Carlsson et al., 2020) have been built on BERT (Kenton and Toutanova, 2019). However, it is difficult to use these methods in our work because (1) they are too large to be co-trained along with the backbone, and (2) they are pretrained on spoken languages, which are totally different from sign languages represented by videos. In this work, we build a lightweight sentence embedding extractor that can be jointly trained with the CSLR backbone.

2.4. Feature Disentanglement

In the context of signer-independent continuous sign language recognition (SI-CSLR), each signer can be considered as a distinct domain, and the key is to enable the model to generalize well to unseen domains, i.e., the test signers. Feature disentanglement has emerged as a powerful approach for achieving domain generalization by decomposing features into domain-invariant and domain-specific components (Wang et al., 2021). Adversarial learning has gained significant traction in the field of feature disentanglement, with the feature extractor serving as the generator and the domain classifier as the discriminator (Xu et al., 2020; Cheng et al., 2022; Liu et al., 2018b). For instance, in the context of facial expression recognition, (Xu et al., 2020) employs an adversarial approach to mitigate biases such as gender and race by training a series of domain classifiers. (Cheng et al., 2022) introduces a self-adversarial framework specifically designed to remove gaze-irrelevant factors, resulting in improved gaze estimation performance. Another notable advancement in feature disentanglement is the utilization of attention mechanisms to emphasize task-relevant features, while considering the remaining features as task-irrelevant. This approach has been successfully employed in various domains. For instance, in person re-identification, (Jin et al., 2020) utilizes a channel attention module to suppress style information, while in face recognition, (Huang et al., 2021) incorporates both spatial and channel attention mechanisms to eliminate age-related features. These studies exemplify the efficacy of leveraging adversarial learning and attention mechanisms for feature disentanglement in a range of applications. However, adversarial learning is usually complicated as the generator and discriminator are trained iteratively, and the attention modules would introduce extra parameters. In this work, we adopt the gradient reversal (GR) layer (Ganin et al., 2016) that reverses the gradient coming from the domain (signer) classification loss when the back-propagation process arrives at the feature extractor (CSLR backbone) while keeping the gradient of the domain classifier unchanged. It shares a similar idea with adversarial learning, but it is totally end-to-end and introduces no extra parameters compared to attention-based methods. Thus, we believe it can serve as a simple baseline for future research on SI-CSLR.

Figure 2. An overview of our proposed method. The sign video input is first fed into the visual module (e.g., VGGNet (Simonyan and Zisserman, 2015) or ResNet (He et al., 2016)) to extract visual features. The following sequential module (e.g., the local Transformer (see details in Section 3.6) or a TCN) further models long-/short-term dependencies and yields sequential features. The CTC loss (Graves et al., 2006) is adopted as the main objective function. Three auxiliary tasks (highlighted in different colors) are proposed to improve the performance of the CSLR backbone. For spatial attention consistency, we insert a keypoint-guided spatial attention module after the m-th convolution layer, C_m, of the visual module. Besides, we push the model to align the visual and sequential features at the sentence level to enhance their representation power. Finally, we introduce a signer removal module to make the model more robust to signer discrepancy under the signer-independent setting.

3. Our Proposed Method

3.1. Framework Overview

Figure 2 gives an overview of our proposed method. The blue, orange, and green arrows represent the three components of the CSLR backbone: the visual module, the sequential module, and the alignment module, respectively. Taking a sign video with T RGB frames \mathbf{x}=\{\mathbf{x}_{t}\}_{t=1}^{T}\in\mathbb{R}^{T\times H\times W\times 3} as input, the visual module, which simply consists of several 2D-CNN layers (C_{1},\dots,C_{n}) followed by a global average pooling (GAP) layer, first extracts visual features \mathbf{v}=\{\mathbf{v}_{t}\}_{t=1}^{T}\in\mathbb{R}^{T\times d}. (We only consider visual modules based on 2D-CNNs since a recent survey (Adaloglou et al., 2021) shows that 3D-CNNs cannot provide as precise gloss boundaries as 2D-CNNs, and lead to worse performance.) The sequential features \mathbf{s}=\{\mathbf{s}_{t}\}_{t=1}^{T}\in\mathbb{R}^{T\times d} are then extracted by the sequential module. Finally, the alignment module computes the probability of the gloss label sequence p(\mathbf{y}|\mathbf{x}) based on the widely-adopted CTC (Graves et al., 2006), where \mathbf{y}=\{y_{i}\}_{i=1}^{N} and N denotes the length of the gloss sequence. Below we first present the three proposed auxiliary tasks: spatial attention consistency (Section 3.2), sentence embedding consistency (Section 3.3), and signer removal (Section 3.4). The overall loss function is formulated in Section 3.5. Finally, in Section 3.6, we introduce a variant of the Transformer as a strong sequential module for CSLR.

3.2. Spatial Attention Consistency (SAC)

Signers’ facial expressions and hand movements are two major clues of sign languages (Koller, 2020; Zhou et al., 2020; Zuo et al., 2023). Thus, it is reasonable to expect the visual module to focus on signers’ faces and hands, i.e., informative regions (IRs). From this perspective, we insert a spatial attention module into the visual module and enforce the consistency between the learned attention masks and keypoints heatmaps. Since SAC is applied to all frames in the same way, we omit the time steps in the formulation below.

Figure 3. (a) The architecture of our spatial attention module (J\times K\times C: the size of the input feature maps; GAP: global average pooling; CMP: channel-wise max pooling). (b) Two examples of the original and refined heatmaps.

3.2.1. Spatial Attention Module

We build our spatial attention module based on CBAM (Woo et al., 2018) due to its simplicity and effectiveness. As shown in Figure 3(a), we first pick the most informative channel via a channel-wise max pooling (CMP) operation:

(1) \mathbf{M}_{1}=f_{CMP}(\mathbf{F})\in\mathbb{R}^{J\times K\times 1},

where \mathbf{M}_{1} is the feature map squeezed by CMP, and \mathbf{F}\in\mathbb{R}^{J\times K\times C} denotes the input feature maps.

Besides CMP, CBAM also squeezes the feature maps with an average pooling operation along the channel dimension. However, we propose to dynamically weight the importance of each channel. As shown in Figure 3(a), we first conduct global average pooling (GAP) over \mathbf{F} to gather global spatial information. Then the channel weights \mathbf{E}\in(0,1)^{1\times 1\times C} are simply generated by a channel-wise softmax layer. By a weighted sum along the channel dimension, we can generate another squeezed feature map \mathbf{M}_{2}:

(2) \mathbf{M}_{2}=\mathbf{F}\oplus\mathbf{E}=\sum_{i=1}^{C}\mathbf{F}_{i}\cdot\mathbf{E}_{i}\in\mathbb{R}^{J\times K\times 1}.

Finally, the spatial attention mask \mathbf{M} is generated as:

(3) \mathbf{M}=\sigma(f_{conv}(cat(\mathbf{M}_{1},\mathbf{M}_{2})))\in(0,1)^{J\times K},

where \sigma(\cdot) is the sigmoid function, f_{conv}(\cdot) is a 2D-CNN layer with a kernel size of 7\times 7, and cat(\cdot,\cdot) is a channel-wise concatenation operation. The output feature maps are the product of \mathbf{F} and \mathbf{M}. In this way, important positions can be highlighted while trivial ones are suppressed.

It should be noted that our channel weights are similar to the channel attention module in CBAM, but they introduce no extra parameters and can even outperform the vanilla CBAM according to our ablation studies in Table 3.
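To make the computation concrete, a minimal PyTorch sketch of Equations 1-3 is given below. It is a simplified illustration of the description above rather than the exact implementation in our released code; the module and variable names are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        # 2-channel input: CMP map and softmax-weighted channel sum; 7x7 conv as in Eq. (3)
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, feat: torch.Tensor):
        # feat: (B, C, J, K) input feature maps
        m1, _ = feat.max(dim=1, keepdim=True)               # channel-wise max pooling, Eq. (1)
        weights = F.softmax(feat.mean(dim=(2, 3)), dim=1)   # channel weights from GAP + softmax
        m2 = (feat * weights[:, :, None, None]).sum(dim=1, keepdim=True)  # weighted sum, Eq. (2)
        mask = torch.sigmoid(self.conv(torch.cat([m1, m2], dim=1)))       # (B, 1, J, K), Eq. (3)
        return feat * mask, mask.squeeze(1)                 # attended features and attention mask
```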

3.2.2. Keypoints Heatmap Extractor

Simply training the spatial attention module with the backbone may lead to sub-optimal solutions. Given the prior knowledge that signers’ faces and hands are informative regions (IRs), we guide the spatial attention module with keypoints heatmaps extracted by the pretrained HRNet (Sun et al., 2019; Andriluka et al., 2014). Specifically, we first normalize the raw outputs of HRNet linearly to obtain the original heatmaps:

(4) \mathbf{H}_{o}^{i}=\frac{f_{H}^{i}(\mathbf{I})-\min{f_{H}^{i}(\mathbf{I})}}{\max{f_{H}^{i}(\mathbf{I})}-\min{f_{H}^{i}(\mathbf{I})}}\in[0,1]^{H\times W},

where \mathbf{I} is the raw RGB frame, f_{H}(\cdot) is the pretrained HRNet, and i\in\{1,2,3\} denotes the face, left hand, and right hand, respectively.

3.2.3. Post-processing

There are some defects in the original heatmaps although they can roughly highlight the positions of the IRs. As shown in Figure 3(b), some trivial regions, e.g., the top of the face heatmap in the first row and the middle part of the left-hand heatmap in the second row, may receive high activation values. Besides, some highlighted regions, e.g., both face heatmaps in Figure 3(b), may not cover the IRs entirely. In addition, there is usually a mismatch between the fixed heatmap resolution of the pretrained HRNet and the resolution of the spatial attention masks. Below we elaborate on our heatmap post-processing module, which addresses these issues.

We first locate the center of each IR from the original heatmaps via a simple argmax operation: (x_{i},y_{i})=\mathrm{argmax}\ \mathbf{H}_{o}^{i}. To fit different resolutions of the spatial attention masks, we normalize the center as (\hat{x}_{i},\hat{y}_{i})=(\frac{x_{i}}{H-1},\frac{y_{i}}{W-1}). Suppose the spatial attention masks have a common resolution of J\times K; then a Gaussian-like refined keypoints heatmap is generated for each IR to reduce unwanted noise:

(5) \mathbf{H}_{r}^{i}(a,b)=\exp{\left(-\frac{1}{2}\left(\frac{(a-\hat{c}_{i}^{x})^{2}}{(J/\gamma_{x})^{2}}+\frac{(b-\hat{c}_{i}^{y})^{2}}{(K/\gamma_{y})^{2}}\right)\right)},

where 0\leq a<J and 0\leq b<K; (\hat{c}_{i}^{x},\hat{c}_{i}^{y})=(\hat{x}_{i}(J-1),\hat{y}_{i}(K-1)) denotes the transformed center of each IR under the resolution J\times K; and \gamma_{x} and \gamma_{y} are two hyper-parameters that control the scale of the highlighted regions. In practice, we set \gamma_{x}=\gamma_{y}. Finally, we merge the three processed IR heatmaps into a single one: \mathbf{H}_{r}=\max_{i}\mathbf{H}_{r}^{i}\in(0,1)^{J\times K}.
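A small sketch of this post-processing step (center localization, rescaling, and the Gaussian-like refinement of Equation 5) is given below. It assumes the original heatmaps have already been normalized as in Equation 4; the function and variable names are illustrative only.

```python
import torch

def refine_heatmaps(heatmaps: torch.Tensor, J: int, K: int,
                    gamma_x: float = 14.0, gamma_y: float = 14.0) -> torch.Tensor:
    # heatmaps: (3, H, W) normalized heatmaps for face, left hand, right hand
    _, H, W = heatmaps.shape
    a = torch.arange(J).float()[:, None]   # row indices of the J x K grid
    b = torch.arange(K).float()[None, :]   # column indices of the J x K grid
    refined = []
    for hm in heatmaps:
        idx = hm.flatten().argmax()
        x, y = idx // W, idx % W                                # keypoint center (row, col)
        cx, cy = x / (H - 1) * (J - 1), y / (W - 1) * (K - 1)   # rescale center to J x K
        # Gaussian-like refined heatmap, Eq. (5)
        refined.append(torch.exp(-0.5 * ((a - cx) ** 2 / (J / gamma_x) ** 2
                                         + (b - cy) ** 2 / (K / gamma_y) ** 2)))
    # merge face/hand heatmaps by element-wise max
    return torch.stack(refined).max(dim=0).values               # (J, K)
```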

3.2.4. SAC Loss

The spatial attention module is guided by the refined keypoints heatmaps via the SAC loss (in the implementation, we further average \mathcal{L}_{sac} over all time steps):

(6) \mathcal{L}_{sac}=\frac{1}{J\times K}\|\mathbf{M}-\mathbf{H}_{r}\|_{2}^{2}.
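In code, \mathcal{L}_{sac} reduces to a mean-squared error between the attention mask and the refined heatmap, as sketched below with dummy tensors (the per-time-step averaging is omitted).

```python
import torch
import torch.nn.functional as F

mask = torch.rand(14, 14)             # learned spatial attention mask M (J x K), e.g. from the sketch above
refined_heatmap = torch.rand(14, 14)  # refined keypoints heatmap H_r (J x K)
sac_loss = F.mse_loss(mask, refined_heatmap)  # equals ||M - H_r||_2^2 / (J * K), Eq. (6)
```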

3.3. Sentence Embedding Consistency (SEC)

Figure 4. The workflow of sentence embedding extraction. We omit LayerNorm (Ba et al., 2016) for simplicity.

Some works (Min et al., 2021; Hao et al., 2021) find that enforcing the consistency between the visual and sequential features can enhance their representation power, and lead to better performance. Different from (Min et al., 2021; Hao et al., 2021) that measure their consistency at the frame level, we impose a sentence embedding consistency between them.

3.3.1. Sentence Embedding Extractor (SEE)

Within a sign video, each gloss consists of only a few frames. We believe a good SEE for sign languages should take local contexts into consideration. As shown in Figure 4, our SEE is built on QANet (Yu et al., 2018), which consists of a depth-wise temporal convolution network (TCN) layer and a transformer encoder layer. The depth-wise TCN first extracts local contextual information from the frame-level feature sequence, then the transformer encoder models global contexts by its inner self-attention module.

Similar to the class token in BERT (Kenton and Toutanova, 2019), we first prepend a learnable sentence embedding token, [SEN], to the sequential features \mathbf{s}\in\mathbb{R}^{T\times d} defined in Section 3.1:

(7) \mathbf{s}^{\prime}=cat(\text{[SEN]},\mathbf{s})\in\mathbb{R}^{(T+1)\times d}.

The input of the SEE is the summation of the feature sequence and the positional embeddings (Vaswani et al., 2017); i.e., \mathbf{s}^{\prime\prime}=\mathbf{s}^{\prime}+\mathbf{P}, where \mathbf{P}\in\mathbb{R}^{(T+1)\times d}.

Within the SEE, the depth-wise TCN (Wu et al., 2018) layer first models local contexts with a residual shortcut: \mathbf{s}_{l}^{\prime\prime}=f_{TCN}(\mathbf{s}^{\prime\prime})+\mathbf{s}^{\prime\prime}. Then the transformer encoder layer gathers information from all time steps to get the sentence embedding:

(8) \mathbf{E}_{sen}^{s}=f_{TF}(\mathbf{s}_{l}^{\prime\prime})\in\mathbb{R}^{d}.

We can also get the sentence embedding of the visual features, \mathbf{E}_{sen}^{v}, in the same way.
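A minimal PyTorch sketch of the SEE is given below. It follows Figure 4 (a learnable [SEN] token, a depth-wise TCN with a residual shortcut, and a transformer encoder layer), but it is only an illustration; the class name, default sizes, and the use of torch.nn.TransformerEncoderLayer are simplifying assumptions rather than our exact implementation.

```python
import torch
import torch.nn as nn

class SentenceEmbeddingExtractor(nn.Module):
    def __init__(self, d: int = 512, kernel_size: int = 5, n_heads: int = 8):
        super().__init__()
        self.sen_token = nn.Parameter(torch.randn(1, 1, d))  # learnable [SEN] token
        self.dw_tcn = nn.Conv1d(d, d, kernel_size, padding=kernel_size // 2, groups=d)
        self.encoder = nn.TransformerEncoderLayer(d_model=d, nhead=n_heads, batch_first=True)

    def forward(self, feats: torch.Tensor, pos_emb: torch.Tensor) -> torch.Tensor:
        # feats: (B, T, d) visual or sequential features; pos_emb: (1, T+1, d) positional embeddings
        x = torch.cat([self.sen_token.expand(feats.size(0), -1, -1), feats], dim=1)  # Eq. (7)
        x = x + pos_emb
        x = self.dw_tcn(x.transpose(1, 2)).transpose(1, 2) + x   # depth-wise TCN + residual shortcut
        return self.encoder(x)[:, 0]                             # [SEN] position -> (B, d), Eq. (8)
```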

3.3.2. Negative Sampling

Directly minimizing the distance between \mathbf{E}_{sen}^{s} and \mathbf{E}_{sen}^{v} will result in trivial solutions. For example, if the parameters of the SEE are all zeros, then its outputs will always be the same. A simple way to address this issue is to introduce negative samples. In this work, we follow the common practice (Schroff et al., 2015; Ye et al., 2019; Oord et al., 2018; Hjelm et al., 2019) and sample another video from the mini-batch, taking its sequential features as the negative sample. Note that most CSLR models (Min et al., 2021; Hao et al., 2021; Zhou et al., 2020) are trained with a batch size of 2, and our negative sampling strategy degenerates to swapping under this setting:

(9) (neg(\mathbf{B}[0]),neg(\mathbf{B}[1]))=(\mathbf{B}[1],\mathbf{B}[0]),

where \mathbf{B}\in\mathbb{R}^{2\times T\times d} is a mini-batch of sequential features, and neg(\mathbf{B}[\cdot]) denotes the corresponding negative sample.

3.3.3. SEC Loss

We implement the SEC loss as a triplet loss (Schroff et al., 2015), minimizing the distances between the sentence embeddings computed from the visual and sequential features of the same sentence while maximizing the distances between those from different sentences:

(10) \mathcal{L}_{sec}=\max\{d(\mathbf{E}_{sen}^{v},\mathbf{E}_{sen}^{s})-d(\mathbf{E}_{sen}^{v},neg(\mathbf{E}_{sen}^{s}))+\alpha,0\},

where d(\mathbf{x}_{1},\mathbf{x}_{2})=1-\frac{\mathbf{x}_{1}\cdot\mathbf{x}_{2}}{\|\mathbf{x}_{1}\|_{2}\cdot\|\mathbf{x}_{2}\|_{2}}; \{\mathbf{E}_{sen}^{v},\mathbf{E}_{sen}^{s}\} are the sentence embeddings of the visual and sequential features from the same sentence; \{\mathbf{E}_{sen}^{v},neg(\mathbf{E}_{sen}^{s})\} are those from different sentences, where the sentence embedding of the sequential features from a different sentence is treated as the negative sample neg(\mathbf{E}_{sen}^{s}); and \alpha is the margin.
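For a batch size of 2, the SEC loss can be sketched as below, where the negative sample is obtained by swapping the two samples (Equation 9) and the cosine distance is used as d(\cdot,\cdot); the margin value follows Section 4.2.2, and the function name is illustrative.

```python
import torch
import torch.nn.functional as F

def sec_loss(e_v: torch.Tensor, e_s: torch.Tensor, alpha: float = 2.0) -> torch.Tensor:
    # e_v, e_s: (2, d) sentence embeddings of the visual and sequential features
    neg_e_s = e_s.flip(0)                                   # swap the two samples as negatives, Eq. (9)
    d_pos = 1 - F.cosine_similarity(e_v, e_s, dim=-1)       # cosine distance of positive pairs
    d_neg = 1 - F.cosine_similarity(e_v, neg_e_s, dim=-1)   # cosine distance of negative pairs
    return torch.clamp(d_pos - d_neg + alpha, min=0).mean() # triplet loss, Eq. (10)
```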

3.4. Signer Removal Module (SRM)

To remove signer information from CSLR backbones, we further develop a signer removal module (SRM) based on statistics pooling and gradient reversal as shown in Figure 5.

3.4.1. Signer Embeddings

We first extract signer embeddings to “distill” signer information before dispelling it. A naïve method is to simply feed the frame-level features into an MLP and treat the outputs of the MLP as signer embeddings. In this work, motivated by the superior performance of x-vectors (Snyder et al., 2018) in speaker recognition, we leverage statistics pooling to obtain more robust sentence-level signer embeddings.

Specifically, we first feed the intermediate visual features \mathbf{F}\in\mathbb{R}^{T\times J\times K\times C} into a global average pooling layer to squeeze the spatial dimensions and obtain frame-level features \mathbf{F}_{s}\in\mathbb{R}^{T\times C} (here we reuse the notation \mathbf{F} from Equation 1). Then a statistics pooling (SP) layer is used to aggregate frame-level information:

(11) \mathbf{F}_{s}^{SP}=cat(\mathbf{F}_{s}^{mean},\mathbf{F}_{s}^{std})\in\mathbb{R}^{2C},

where \mathbf{F}_{s}^{mean}\in\mathbb{R}^{C} and \mathbf{F}_{s}^{std}\in\mathbb{R}^{C} are the temporal mean and standard deviation of \mathbf{F}_{s}, respectively. In this way, \mathbf{F}_{s}^{SP} is capable of capturing signer characteristics over the entire video instead of at the frame level.

After that, a simple two-layer MLP with rectified linear unit (ReLU) activations is used to project the statistics into the signer embedding space:

(12) \mathbf{E}_{sig}=ReLU(\mathbf{W}_{2}ReLU(\mathbf{W}_{1}\mathbf{F}_{s}^{SP}+\mathbf{b}_{1})+\mathbf{b}_{2})\in\mathbb{R}^{C},

where \mathbf{W}_{1}\in\mathbb{R}^{C\times 2C},\mathbf{b}_{1}\in\mathbb{R}^{C},\mathbf{W}_{2}\in\mathbb{R}^{C\times C},\mathbf{b}_{2}\in\mathbb{R}^{C} are the parameters of the two-layer MLP.

Finally, the signer embeddings \mathbf{E}_{sig} are fed into a classifier to yield signer probabilities \mathbf{p}_{sig}\in(0,1)^{N_{sig}}, where N_{sig} is the number of signers. The SRM is trained with the signer classification loss, which is simply a cross-entropy loss:

(13) \mathcal{L}_{srm}=-\log p_{sig}^{i},

where i is the label of the signer.
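A minimal sketch of this signer embedding branch (Equations 11-13) is shown below; the class and variable names are placeholders, and the code is an illustration rather than our exact implementation.

```python
import torch
import torch.nn as nn

class SignerEmbedding(nn.Module):
    def __init__(self, C: int, n_signers: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * C, C), nn.ReLU(), nn.Linear(C, C), nn.ReLU())
        self.classifier = nn.Linear(C, n_signers)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, T, J, K, C) intermediate visual features
        f = feats.mean(dim=(2, 3))                                 # spatial GAP -> (B, T, C)
        stats = torch.cat([f.mean(dim=1), f.std(dim=1)], dim=-1)   # statistics pooling, Eq. (11)
        emb = self.mlp(stats)                                      # signer embedding, Eq. (12)
        return self.classifier(emb)                                # signer logits; trained with cross-entropy, Eq. (13)
```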

Figure 5. Workflow of our signer removal module (SRM). We insert the SRM after the m-th CNN layer, C_m. The loss of C²SLR, \mathcal{L}_{b}, which is the sum of the CTC, SAC, and SEC losses, is used to train the backbone parameters \theta_{b}. The signer classification loss \mathcal{L}_{srm} is used to train the SRM parameters \theta_{s} as usual, while the gradient from \mathcal{L}_{srm} is reversed for \theta_{b}. \lambda is the loss weight for \mathcal{L}_{srm}.

3.4.2. Gradient Reversal

If the CSLR backbone is jointly trained with \mathcal{L}_{srm}, the setup becomes multi-task learning, which, however, cannot guarantee that signer information is removed from the backbone. In this work, we treat each signer as a domain and formulate SI-CSLR as a domain generalization problem in which no test signers are seen during training. The gradient reversal layer was proposed in (Ganin et al., 2016) to address the domain generalization problem by learning features that are discriminative for the main classification task while indiscriminate with respect to the domain gap. More specifically, following (Ganin et al., 2016), we denote the parameters of the feature extractor, label predictor, and domain classifier as \theta_{f}, \theta_{y}, and \theta_{d}, respectively; the optimization of these parameters can be formulated as:

(14) \begin{split}\theta_{f}&\leftarrow\text{optimizer}(\theta_{f},\nabla_{\theta_{f}}\mathcal{L}_{y},-\lambda\nabla_{\theta_{f}}\mathcal{L}_{d},\eta),\\ \theta_{y}&\leftarrow\text{optimizer}(\theta_{y},\nabla_{\theta_{y}}\mathcal{L}_{y},\eta),\\ \theta_{d}&\leftarrow\text{optimizer}(\theta_{d},\lambda\nabla_{\theta_{d}}\mathcal{L}_{d},\eta),\end{split}

where \mathcal{L}_{y} and \mathcal{L}_{d} are the main classification and domain classification losses, respectively, \lambda is the loss weight for \mathcal{L}_{d}, and \eta is the learning rate.

We adapt Equation 14 by instantiating \mathcal{L}_{y} and \mathcal{L}_{d} as the backbone training loss \mathcal{L}_{b} and the signer classification loss \mathcal{L}_{srm}, respectively, both of which are illustrated in Figure 5. We also merge \theta_{f} and \theta_{y} into \theta_{b} to denote the parameters of the backbone, and use \theta_{s} to represent the parameters of the SRM. The new optimization process can be formulated as:

(15) \begin{split}\theta_{b}&\leftarrow\text{optimizer}(\theta_{b},\nabla_{\theta_{b}}\mathcal{L}_{b},-\lambda\nabla_{\theta_{b}}\mathcal{L}_{srm},\eta),\\ \theta_{s}&\leftarrow\text{optimizer}(\theta_{s},\lambda\nabla_{\theta_{s}}\mathcal{L}_{srm},\eta).\end{split}

As a result, the SRM itself is trained with \mathcal{L}_{srm} as usual, but the backbone is trained “reversely” so that the extracted features cannot discriminate between signers, and the signer information is implicitly removed. We validate the effectiveness of the SRM on two challenging SI-CSLR benchmarks, establishing a strong baseline for future works on SI-CSLR.
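A gradient reversal layer of this kind can be sketched in PyTorch as below: the forward pass is the identity, while the backward pass negates and scales the gradient flowing back into the backbone. The function and class names are illustrative.

```python
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lambda_):
        ctx.lambda_ = lambda_
        return x.view_as(x)           # identity in the forward pass

    @staticmethod
    def backward(ctx, grad_output):
        # reverse and scale the gradient that flows back into the backbone
        return -ctx.lambda_ * grad_output, None

def grad_reverse(x: torch.Tensor, lambda_: float = 0.75) -> torch.Tensor:
    return GradReverse.apply(x, lambda_)
```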

3.5. Alignment Module and Loss Function

We follow recent works (Hao et al., 2021; Min et al., 2021; Zhou et al., 2020; Pu et al., 2020) and adopt a CTC-based alignment module. It yields a label for each frame, which may be a repeated gloss label or a special blank symbol. CTC assumes that the model outputs at different time steps are conditionally independent of each other. Given an input sequence \mathbf{x}, the conditional probability of a label sequence \boldsymbol{\phi}=\{\phi_{i}\}_{i=1}^{T}, where \phi_{i}\in\mathcal{V}\cup\{blank\} and \mathcal{V} is the vocabulary of glosses, can be estimated by:

(16) p(\boldsymbol{\phi}|\mathbf{x})=\prod_{i=1}^{T}p(\phi_{i}|\mathbf{x}),

where p(\phi_{i}|\mathbf{x}) is the frame-level gloss probability generated by a classifier. The final probability of the gloss label sequence is the summation over all feasible alignments:

(17) p(\mathbf{y}|\mathbf{x})=\sum_{\boldsymbol{\phi}=\mathcal{G}^{-1}(\mathbf{y})}p(\boldsymbol{\phi}|\mathbf{x}),

where \mathcal{G} is a mapping function that removes repetitions and blank symbols in \boldsymbol{\phi}, and \mathcal{G}^{-1} is its inverse mapping. The CTC loss is then defined as:

(18) \mathcal{L}_{ctc}=-\log p(\mathbf{y}|\mathbf{x}).
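For reference, the CTC objective can be computed with an off-the-shelf implementation; the sketch below uses PyTorch's built-in CTC loss with dummy shapes (the vocabulary size corresponds to the 1,081 glosses of PHOENIX-2014 plus one blank) and is only an illustration.

```python
import torch
import torch.nn as nn

ctc = nn.CTCLoss(blank=0, zero_infinity=True)
log_probs = torch.randn(180, 2, 1082).log_softmax(-1)   # (T, B, |V|+1) frame-level gloss log-probs
targets = torch.randint(1, 1082, (2, 12))               # (B, N) dummy gloss label sequences
loss = ctc(log_probs, targets,
           input_lengths=torch.full((2,), 180, dtype=torch.long),
           target_lengths=torch.full((2,), 12, dtype=torch.long))  # Eq. (18)
```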

Finally, the overall loss function is a combination of the CTC, SAC, SEC, and signer classification losses:

(19) \mathcal{L}=\underbrace{\mathcal{L}_{ctc}+\mathcal{L}_{sac}+\mathcal{L}_{sec}}_{\mathcal{L}_{b}}+\lambda\mathcal{L}_{srm},

where \lambda=0 for signer-dependent datasets, and \lambda>0 for signer-independent ones.

3.6. A Strong Sequential Module: Local Transformer

Figure 6. Our proposed strong sequential module, the local transformer (LT). (a) Local Transformer; LayerNorm (Ba et al., 2016) is omitted for simplicity. (b) Local self-attention (LSA). LT is based on QANet (Yu et al., 2018), which validates the effectiveness of combining TCNs with self-attention. The difference is that we further leverage a Gaussian bias (Luong et al., 2015; Yang et al., 2018) to introduce local contexts into the self-attention module, i.e., local self-attention. (L: number of LT layers, set to 2 by default; RPE: relative positional encoding (Shaw et al., 2018); D: window size of the Gaussian bias.)

The sequential module is an important component of the CSLR backbone. Most existing CSLR works adopt globally-guided architectures, e.g., BiLSTM (Pu et al., 2019, 2020) and the vanilla Transformer (Niu and Mak, 2020; Camgöz et al., 2020), for sequence modeling due to their strong capability of capturing long-term temporal dependencies. However, within a sign video, each gloss is short, consisting of only a few frames. This can explain why locally-guided architectures, such as TCNs, can also achieve excellent performance (Cheng et al., 2020). In this subsection, we elaborate on a mixed architecture, the Local Transformer (LT), which leverages both global and local contexts for sequence modeling in CSLR.

Figure 6(a) shows the architecture of LT. Each LT layer consists of a depth-wise TCN layer, a local self-attention (LSA) layer, and a feed-forward network. Since the depth-wise TCN layer and the feed-forward network are the same as those used in (Yu et al., 2018; Vaswani et al., 2017), below we will only give the formulation of the LSA.

As shown in Figure 6(b), three linear layers first project the input feature sequence \mathbf{A}\in\mathbb{R}^{T\times d} into queries \mathbf{Q}\in\mathbb{R}^{T\times d}, keys \mathbf{K}\in\mathbb{R}^{T\times d}, and values \mathbf{V}\in\mathbb{R}^{T\times d}, respectively. We then split \mathbf{Q},\mathbf{K},\mathbf{V} into \{\mathbf{Q}^{h}\}_{h=1}^{N_{h}},\{\mathbf{K}^{h}\}_{h=1}^{N_{h}},\{\mathbf{V}^{h}\}_{h=1}^{N_{h}}, respectively, for multi-head self-attention as in (Vaswani et al., 2017), where \mathbf{Q}^{h},\mathbf{K}^{h},\mathbf{V}^{h}\in\mathbb{R}^{T\times d/N_{h}} and N_{h} is the number of heads. The attention scores for each head can be obtained by the scaled dot-product attention as follows:

(20) \mathbf{ATT}=\left\{\frac{(\mathbf{Q}^{h})(\mathbf{K}^{h})^{T}}{\sqrt{d/N_{h}}}\right\}_{h=1}^{N_{h}}\in\mathbb{R}^{N_{h}\times T\times T}.

The vanilla self-attention treats each position equally. To emphasize local contexts, we adopt a Gaussian bias (Luong et al., 2015; Yang et al., 2018) to weaken the interactions between distant query-key (QK) pairs. Given a QK pair (\mathbf{q}_{i}^{h},\mathbf{k}_{j}^{h}), the Gaussian bias (GB) is defined as:

(21) GB_{ij}^{h}=-\frac{(j-i)^{2}}{2\sigma^{2}},

where \sigma=\frac{D}{2}, and D is the window size of the Gaussian bias (Luong et al., 2015). Note that although we could assign a Gaussian bias with a different value of D to each head, we find that a common Gaussian bias shared among all heads suffices to boost the performance of the transformer significantly. The final attention weights for each value vector are obtained from a softmax layer, and the output of the LSA is:

(22) \begin{cases}\quad\mathbf{O}^{h}=softmax(\mathbf{ATT}^{h}+\mathbf{GB}^{h})\mathbf{V}^{h}\\ \quad\mathbf{O}^{LSA}=cat(\{\mathbf{O}^{h}\}_{h=1}^{N_{h}})\mathbf{W}^{O}\in\mathbb{R}^{T\times d}\ ,\end{cases}

where \mathbf{W}^{O}\in\mathbb{R}^{d\times d} denotes the output linear layer.
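A minimal sketch of the LSA is given below: a standard multi-head self-attention whose logits are offset by the Gaussian bias of Equation 21 before the softmax. Relative positional encoding and dropout are omitted, and the class name and default values are illustrative assumptions (the default D follows the PHOENIX value reported below).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalSelfAttention(nn.Module):
    def __init__(self, d: int = 512, n_heads: int = 8, D: float = 6.3):
        super().__init__()
        self.n_heads, self.d_head, self.sigma = n_heads, d // n_heads, D / 2.0
        self.qkv = nn.Linear(d, 3 * d)
        self.out = nn.Linear(d, d)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(B, T, self.n_heads, self.d_head).transpose(1, 2) for t in (q, k, v))
        att = q @ k.transpose(-2, -1) / self.d_head ** 0.5                 # scaled dot-product, Eq. (20)
        pos = torch.arange(T, device=x.device).float()
        gb = -(pos[None, :] - pos[:, None]) ** 2 / (2 * self.sigma ** 2)   # Gaussian bias, Eq. (21)
        out = F.softmax(att + gb, dim=-1) @ v                              # Eq. (22)
        return self.out(out.transpose(1, 2).reshape(B, T, d))
```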

We intuitively set D to the average ratio of frame length to gloss length: D=\frac{1}{|tr|}\sum_{i=1}^{|tr|}\frac{T_{i}}{N_{i}}, where |tr| is the number of training samples, based on the idea that a good window size should reflect the average number of frames per gloss. More specifically, D=6.3, 15.8, and 5.0 for the PHOENIX datasets, CSL, and CSL-Daily, respectively.

4. Experiments

4.1. Datasets and Evaluation Metric

4.1.1. Datasets

Table 1. Dataset statistics.
Dataset | Language | Vocab Size | #Samples (Train / Dev / Test) | #Signers (Train / Dev / Test) | Signer-Independent
PHOENIX-2014 | German | 1,081 | 5,672 / 540 / 629 | 9 / 9 / 9 | No
PHOENIX-2014-T | German | 1,085 | 7,096 / 519 / 642 | 9 / 9 / 9 | No
CSL-Daily | Chinese | 2,000 | 18,401 / 1,077 / 1,176 | 10 / 10 / 10 | No
PHOENIX-2014-SI | German | 1,081 | 4,376 / 111 / 180 | 8 / 1 / 1 | Yes
CSL | Chinese | 178 | 4,000 / N/A / 1,000 | 40 / N/A / 10 | Yes

We evaluate our method on three signer-dependent datasets (PHOENIX-2014, PHOENIX-2014-T, and CSL-Daily) and two signer-independent datasets (PHOENIX-2014-SI and CSL). Information about these datasets, including language, vocabulary size, train/dev/test splits, and number of signers, is available in Table 1. Compared to some widely-adopted datasets in action recognition, e.g., Kinetics-600 (Carreira et al., 2018) with about 500K videos and Something-Something v2 (Goyal et al., 2017) with about 169K videos, these sign language datasets are quite small. This also explains why specific training strategies, e.g., stage optimization and auxiliary training, have previously been suggested as necessary for CSLR.

4.1.2. Evaluation Metric

We use word error rate (WER) to measure the dissimilarity between two sequences.

(23) \text{WER}=\frac{\#\text{deletions}+\#\text{substitutions}+\#\text{insertions}}{\#\text{glosses in label}}

The official evaluation scripts provided by each dataset are used for measuring the WER.
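For reference, the WER of Equation 23 can be computed with a standard dynamic-programming edit distance, as sketched below; the reported numbers, however, all come from the official scripts.

```python
def wer(ref: list, hyp: list) -> float:
    # edit distance between reference and hypothesis gloss sequences, normalized by the label length
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])   # substitution (or match)
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)  # deletion / insertion
    return d[len(ref)][len(hyp)] / len(ref)
```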

4.2. Implementation Details

4.2.1. Data Augmentation

We first resize the RGB frames to 256\times 256 and then crop them to 224\times 224. For the PHOENIX datasets, we adopt stochastic frame dropping (SFD) (Niu and Mak, 2020) with a dropping ratio of 50%. However, due to the longer duration of videos in CSL and CSL-Daily, we implement a seg-and-drop strategy that first segments each video into short clips of two frames and then randomly drops one frame from each clip. The processed videos thus retain half of the original frames while preserving most of the information. After that, we further randomly drop 40% of the frames of these processed videos using SFD.
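The two dropping strategies can be sketched as index-selection functions, as below; the function names and the exact sampling details are illustrative assumptions rather than our exact implementation.

```python
import random

def seg_and_drop(num_frames: int) -> list:
    # keep one random frame from every 2-frame clip (used for CSL / CSL-Daily)
    return [min(i + random.randint(0, 1), num_frames - 1) for i in range(0, num_frames, 2)]

def stochastic_frame_drop(indices: list, drop_ratio: float = 0.5) -> list:
    # randomly drop a fraction of the remaining frames (SFD)
    keep = sorted(random.sample(range(len(indices)), int(len(indices) * (1 - drop_ratio))))
    return [indices[i] for i in keep]
```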

4.2.2. Backbones and Hyper-parameters

We first choose three representative backbones to validate the effectiveness of our method.

  • VGG11+TCN+BiLSTM (VTB). It is widely adopted in some recent works (Zhou et al., 2020; Min et al., 2021). VGG11 (Simonyan and Zisserman, 2015) is used as the visual module, and the sequential module is composed of the TCN and BiLSTM to capture both local and global contexts.

  • CNN+TCN (CT). This lightweight backbone only consists of a 9-layer 2D-CNN and a 3-layer TCN, which is proposed in (Cheng et al., 2020).

  • VGG11+Local Transformer (VLT). The sequential module is a 2-layer local transformer encoder described in Section 3.6.

To better validate the robustness of our method, we additionally append our local transformer to three mainstream visual backbones, including ResNet-18 (He et al., 2016), MobileNet-v3-Small (Howard et al., 2019), and GoogLeNet (Szegedy et al., 2015). To align the channel dimensions of the visual and sequential features, we configure the TCN layers in CT and VTB to have an output channel size of 512. Additionally, we set the number of hidden units of the BiLSTM in VTB to 2\times 256. These adjustments lead to word error rates (WERs) comparable to those reported in the original papers (Cheng et al., 2020; Zhou et al., 2020), maintaining consistency in performance evaluation. We empirically insert the spatial attention module after the 5th CNN layer. For post-processing, we set \gamma_{x}=\gamma_{y}=14 based on the experimental results presented in Section 4.3.6. The kernel size of the depth-wise TCN layer in both our SEE and the VLT backbone is set to 5, consistent with (Yu et al., 2018). To determine the margin \alpha in Equation 10, we consider the maximum difference between negative and positive cosine distances and set \alpha to 2. Regarding the signer removal module, we empirically position it after the 5th CNN layer, and the default weight for \mathcal{L}_{srm}, \lambda, is set to 0.75.

4.2.3. Training

All models are trained with a batch size of 2, following recent works (Hao et al., 2021; Min et al., 2021; Zhou et al., 2020). We employ the Adam optimizer (Kingma and Ba, 2015) with an initial learning rate of 1\times 10^{-4} and a weight decay factor of 1\times 10^{-4}. We empirically notice that \mathcal{L}_{sec} decreases at a faster rate than \mathcal{L}_{ctc}. To ensure consistent training progress between the backbone and the SEE, we decrease the learning rate of the SEE by a default factor of 0.1 for the backbone architectures used in our experiments (for CT, we set the factor to 0.01). Following the previous approach (Camgöz et al., 2020), we employ a plateau learning rate schedule: if the WER on the dev set does not decrease for 6 consecutive evaluation steps, the learning rate is reduced by a factor of 0.7. However, since CSL does not have an official dev set, we decrease the learning rate after the 15th and 25th epochs, and subsequently every 5 epochs after the 30th epoch. The total number of training epochs is set to 60.

4.2.4. Inference and Decoding.

Following (Niu and Mak, 2020), to match the training condition, we evenly select every \frac{1}{p_{d}}-th frame to drop during inference, where p_{d} is the dropping ratio. We adopt the beam search algorithm with a beam size of 10 for decoding.
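The deterministic frame selection can be sketched as below; the function name and rounding choice are illustrative assumptions.

```python
def evenly_drop(num_frames: int, p_d: float = 0.5) -> list:
    # drop every (1/p_d)-th frame so that the kept ratio matches the training dropping ratio
    stride = max(round(1 / p_d), 1)
    dropped = set(range(0, num_frames, stride))
    return [i for i in range(num_frames) if i not in dropped]
```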

Table 2. Ablation study for SAC and SEC. During inference, since our SEC can be removed, only the spatial attention module in SAC will introduce negligible parameters and affect inference speed. (SAC⁻ denotes only inserting the spatial attention module without the guidance of \mathcal{L}_{sac}; Par.: number of parameters; Sp.: inference speed measured on the same TITAN RTX GPU in seconds per video.)
Backbone | Configuration | WER% | Par. (M) | Sp. (s)
VTB | baseline | 25.0 | 15.6359 | 0.169
VTB | + SAC⁻ | 24.6 | +0.0001 | +0.002
VTB | + SAC | 23.7 | +0.0001 | +0.002
VTB | + SEC | 24.3 | +0.0000 | +0.000
VTB | + SAC + SEC | 22.6 | +0.0001 | +0.002
CT | baseline | 26.1 | 8.7504 | 0.095
CT | + SAC⁻ | 26.0 | +0.0001 | +0.001
CT | + SAC | 25.1 | +0.0001 | +0.001
CT | + SEC | 25.2 | +0.0000 | +0.000
CT | + SAC + SEC | 24.5 | +0.0001 | +0.001
VLT | baseline | 21.5 | 16.1850 | 0.163
VLT | + SAC⁻ | 21.4 | +0.0001 | +0.002
VLT | + SAC | 20.8 | +0.0001 | +0.002
VLT | + SEC | 20.9 | +0.0000 | +0.000
VLT | + SAC + SEC | 20.4 | +0.0001 | +0.002
ResNet-18+LT | baseline | 23.8 | 18.1356 | 0.103
ResNet-18+LT | + SAC⁻ | 23.8 | +0.0001 | +0.002
ResNet-18+LT | + SAC | 22.6 | +0.0001 | +0.002
ResNet-18+LT | + SEC | 22.8 | +0.0000 | +0.000
ResNet-18+LT | + SAC + SEC | 22.2 | +0.0001 | +0.002
MobileNet-v3-Small+LT | baseline | 26.0 | 9.0502 | 0.098
MobileNet-v3-Small+LT | + SAC⁻ | 25.8 | +0.0001 | +0.001
MobileNet-v3-Small+LT | + SAC | 25.2 | +0.0001 | +0.001
MobileNet-v3-Small+LT | + SEC | 25.2 | +0.0000 | +0.000
MobileNet-v3-Small+LT | + SAC + SEC | 24.7 | +0.0001 | +0.001
GoogLeNet+LT | baseline | 24.0 | 23.7070 | 0.112
GoogLeNet+LT | + SAC⁻ | 23.9 | +0.0001 | +0.002
GoogLeNet+LT | + SAC | 23.4 | +0.0001 | +0.002
GoogLeNet+LT | + SEC | 23.3 | +0.0000 | +0.000
GoogLeNet+LT | + SAC + SEC | 22.9 | +0.0001 | +0.002
Figure 7. Visualization results for learned spatial attention masks with or without the guidance of \mathcal{L}_{sac}. We randomly select five samples (s_{1},\dots,s_{5}) from the test set, and for each sample, we select one clear frame and one blurry frame. It is clear that the guidance of \mathcal{L}_{sac} can help the spatial attention module capture the informative regions (face and hands) more accurately.

4.3. Ablation Studies for C²SLR

We first conduct ablation studies for C²SLR on PHOENIX-2014, following previous works (Min et al., 2021; Zhou et al., 2020; Pu et al., 2020; Hao et al., 2021).

4.3.1. Effectiveness of SAC and SEC

As shown in Table 2, both SAC and SEC generalize well across different backbones: the performance of all six backbones is clearly improved. However, if the spatial attention module is inserted into the backbones without any guidance, i.e., SAC⁻, the model performance is only improved slightly, which verifies the effectiveness of \mathcal{L}_{sac}. The effectiveness of SEC suggests that explicitly enforcing the consistency between the visual and sequential modules at the sentence level can strengthen the cross-module cooperation, which leads to the performance gain. The improvements due to SAC and SEC are complementary, so using both of them obtains better results than using only one. Besides, since VLT performs the best among the six backbones, we use it as the default backbone for the following experiments.

4.3.2. Visualization Results for SAC

Figure 7 shows the visualization results of the learned spatial attention masks of SAC (with \mathcal{L}_{sac}) and SAC⁻ (without \mathcal{L}_{sac}) for five test samples. Note that since SAC is deactivated during testing, the comparison is fair. First, it is quite clear that the attention masks learned with the guidance of \mathcal{L}_{sac} look much better. Without the guidance of \mathcal{L}_{sac}, the attention masks are quite messy, with horizontal lines at the top and many highlights on trivial regions, e.g., the left shoulder of s_{2}, the hair of s_{1} and s_{4}, and the waist of s_{3} and s_{5}. This explains why SAC⁻ only slightly improves the performance of the backbones, as shown in Table 2. Second, our SAC is so robust that the IRs (face and hands) in blurry frames (right columns of s_{1} to s_{5}) can still be captured precisely. Third, it is capable of dealing with different hand positions: e.g., both hands are lower than the face (s_{1},s_{3}); one hand is near the face while the other is not (s_{1},s_{2},s_{4}); and the hands overlap (s_{5}).

Table 3. Ablation study for SAC.
Method | WER% | #Param (M)
VLT + SAC | 20.8 | 16.1851
  - channel weights | 21.3 | -0.0000
    + channel attention (Woo et al., 2018) | 21.2 | +0.0335
  - post-processing | 21.7 | -0.0000
  - face | 21.1 | -0.0000
  - hands | 21.2 | -0.0000

4.3.3. Channel Weights

Within our spatial attention module, each channel receives a weight to better measure its importance before the feature maps are squeezed. Removing the channel weights degenerates to the channel-wise average pooling in CBAM (Woo et al., 2018) and achieves a WER of 21.3%, a performance drop of 0.5%, as shown in Table 3. Although our channel weights share a similar idea with the channel attention module of CBAM, which builds extra linear layers to generate the attention weights, no extra parameters are introduced in our spatial attention module. To further validate their effectiveness, after removing the channel weights, we conduct one more experiment that adds the channel attention module back as in CBAM; however, it only leads to a slight performance gain and cannot outperform ours even with extra parameters.

4.3.4. Heatmap Refinement

We discussed in Section 3.2.3 that the raw heatmaps of HRNet (Sun et al., 2019) contain defects which may hinder the learning of the spatial attention module. As shown in Table 3, the quality of the keypoints heatmaps can make a difference to model performance: directly using the original heatmaps without post-processing yields a WER of 21.7%, which reduces the performance of SAC by almost 1%.

4.3.5. Effect of Each Informative Region

As shown in the last two rows of Table 3, removing either the face or the hands region can harm the performance of SAC. The results validate that both signers' faces and hands play a key role in conveying information, as also noted in (Zhou et al., 2020; Koller, 2020).

Figure 8. Visualization results and performance comparison for different \gamma_{x},\gamma_{y} in Equation 5. Since in practice the height and width of the spatial attention masks are usually the same, we set \gamma_{x} and \gamma_{y} to the same value.

4.3.6. Effect of the Hyper-parameters \gamma_{x},\gamma_{y} of Equation 5

We think \gamma_{x} and \gamma_{y} are two important hyper-parameters since they control the scale of the highlighted regions in the keypoints heatmaps. Thus, we conduct experiments to compare the performance of different \gamma_{x},\gamma_{y}, as shown in Figure 8. The model performance is worse when they are either too large (cannot cover the informative regions entirely) or too small (cover too many trivial regions). When \gamma_{x}=\gamma_{y}=14, the model achieves the best performance.

Table 4. Ablation study for SEC.
(a) Ablation study for the architecture of the sentence embedding extractor and negative sampling. (TF: Transformer; DTCN: depth-wise TCN; Neg. Sam.: negative sampling.)
Method | Extractor | Neg. Sam. | WER%
VLT + SEC | TF+DTCN | ✓ | 20.9
VLT + SEC | TF+DTCN | × | 21.5
VLT + SEC | TF | ✓ | 21.1
VLT + SEC | BiLSTM | ✓ | 21.3
(b) Ablation study for the constraint level. We fine-tune the loss factor of VA as in (Min et al., 2021) on the VLT for fair comparisons.
Level | Constraint | WER%
Sentence | consistency | 20.9
Frame | consistency | 21.6
Frame | visual enhancement (VE) (Min et al., 2021) | 22.3
Frame | visual alignment (VA) (Min et al., 2021) | 21.9
Frame | VE+VA (Min et al., 2021) | 22.8

4.3.7. Sentence Embedding Extractor and Negative Sampling

Our sentence embedding extractor consists of a depth-wise TCN layer and a transformer encoder, aiming to model local and global contexts, respectively. As shown in Table 4(a), local contexts are important to sentence embedding extraction, as dropping the TCN layer leads to worse performance. We also compare our method with the common practice that concatenates the last two hidden states of a BiLSTM and treats the result as the sentence embedding. The fact that it underperforms the transformer-based extractors implies the strength of the self-attention mechanism for sentence embedding extraction. Table 4(a) also shows that negative sampling plays a key role in our SEC: without negative sampling, i.e., directly minimizing the sentence embedding distance between the visual and sequential features, the constraint is not effective.

4.3.8. Constraint Level

As shown in Table 4(b), we implement several frame-level constraints to validate the effectiveness of our SEC. First, we replace the sentence embeddings v_se and s_se in Equation 10 with their corresponding frame-level features, so that the positive distances are minimized and the negative distances are maximized at the frame level. However, this leads to a performance degradation of 0.7% compared with our SEC. We further compare our SEC with VAC (Min et al., 2021), which is composed of two frame-level constraints: visual enhancement (VE) and visual alignment (VA). An extra classifier is first appended to the visual module to yield frame-level probability distributions (the visual distribution). VE is implemented as a CTC loss computed between the visual distribution and the gloss label, the same as the one used for training the backbone. VA is simply a KL-divergence loss that minimizes the distance between the visual distribution and the original probability distribution (p(φ_i|x) in Equation 16). Table 4(b) shows that both VE and VA perform much worse than our SEC. The results suggest that our SEC is a more appropriate way to measure the consistency between the visual and sequential modules.
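For reference, the sketch below implements the sentence-level constraint as a triplet loss with cosine distance and margin α = 2, following the definition recalled in Appendix A.2. Drawing one negative per anchor from the other videos in the batch is an illustrative choice.

import torch
import torch.nn.functional as F

def sec_loss(v_se, s_se, margin=2.0):
    # v_se, s_se: (N, D) sentence embeddings from the visual and sequential modules.
    # Cosine distance d(x, y) = 1 - cos(x, y), which lies in [0, 2].
    v = F.normalize(v_se, dim=1)
    s = F.normalize(s_se, dim=1)
    d = 1.0 - v @ s.t()                        # (N, N) pairwise distances
    pos = d.diag()                             # same video: positive pairs
    neg = d.roll(shifts=1, dims=1).diag()      # one illustrative negative per anchor (another video)
    return F.relu(pos - neg + margin).mean()

loss = sec_loss(torch.randn(4, 512), torch.randn(4, 512))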

Figure 9. Two examples of video-gloss pairs. For i ∈ {1, 2}, v_i denotes a video and l_i its corresponding gloss annotation. Their sentence embedding distances are shown in Table 5.
Table 5. Examples of sentence embedding distances between the visual and sequential features. v_1 and v_2 are the videos in Figure 9.
d(·,·)  E^s_sen(v_1)  E^s_sen(v_2)
E^v_sen(v_1)  0.01  1.99
E^v_sen(v_2)  1.76  0.37
Table 6. Effect of the value of λ (the weight of the loss ℒ_srm) in Equation 19.
λ  0  0.25  0.5  0.75  1.0  1.25  1.5
Dev 34.3 35.1 35.3 33.1 33.5 35.0 34.4
Test 34.4 33.8 33.1 32.7 32.8 34.2 33.6

4.3.9. Examples of Video-gloss Pairs

To verify whether ℒ_sec can really separate positive and negative samples, we provide two examples of video-gloss pairs, (v_1, l_1) and (v_2, l_2), as shown in Figure 9. Table 5 lists the sentence embedding distances between the visual and sequential features of v_1 and v_2. The distance between the two features of the same video (diagonal entries, positive pairs) is very small, whereas the distance between features of different videos (off-diagonal entries, negative pairs) is large (the maximum possible distance is 2.00).

4.4. Ablation Studies for the Signer Removal Module

We further conduct ablation studies for our signer removal module (SRM) on the challenging signer-independent dataset, PHOENIX-2014-SI.

4.4.1. Effect of the Hyper-parameter λ of Equation 19

According to (Liu et al., 2018b), the weight of the domain classification loss, i.e., our signer classification loss ℒ_srm, is an important hyper-parameter. We tune it from 0 to 1.5 with an interval of 0.25, as shown in Table 6. When λ = 0, the model degenerates to C²SLR and performs worse on the test set than all models with λ > 0, which suggests the importance of removing signer information for SI-CSLR. When λ = 0.75, the model achieves the best performance, with WERs of 33.1% and 32.7% on the dev and test sets, respectively.

Table 7. Ablation study for the signer removal module. Experiments are conducted on PHOENIX-2014-SI. (SP: statistics pooling; GR: gradient reversal)
Method  ℒ_srm  SP  GR  WER%  Type
C²SLR +  ×  ×  ×  34.4  N/A
C²SLR +  ✓  ×  ×  34.9  Multi-task Learning
C²SLR +  ✓  ✓  ×  33.5  Multi-task Learning
C²SLR +  ✓  ×  ✓  33.6  Feature Disentanglement
C²SLR +  ✓  ✓  ✓  32.7  Feature Disentanglement

4.4.2. Statistics Pooling and Gradient Reversal

We further conduct ablation studies on the two major components of our SRM: statistics pooling (SP) and the gradient reversal (GR) layer. The use of the GR layer determines the type of learning method: feature disentanglement (with GR) or multi-task learning (without GR). As shown in Table 7, models trained with GR under the feature disentanglement setting clearly outperform their counterparts under the multi-task learning setting, which implies that removing signer information is effective for SI-CSLR. Interestingly, the multi-task model with SP can also outperform the baseline. We attribute this to the regularization effect of multi-task learning (Zhang and Yang, 2021), which endows the network shared between the CSLR and signer classification branches with better generalization capability; similar ideas appear in works that jointly train a speech recognition model and a speaker recognition model (Liu et al., 2018a; Pironkov et al., 2016). Finally, the effectiveness of SP validates that sentence-level signer embeddings are more robust than frame-level ones for signer classification, leading to better performance.
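A minimal sketch of the two components follows, written as a gradient reversal function in the style of (Ganin et al., 2016) and a statistics-pooling signer classifier in the spirit of x-vectors (Snyder et al., 2018). The hidden size, the number of signers, and the attachment point are illustrative assumptions, and lam scales the reversed gradient, playing a role analogous to the weight λ in Equation 19.

import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    # Identity in the forward pass; scales gradients by -lam in the backward pass.
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

class SignerClassifier(nn.Module):
    def __init__(self, dim=512, num_signers=8, lam=0.75):    # num_signers: training signers (hypothetical value)
        super().__init__()
        self.lam = lam
        self.fc = nn.Linear(2 * dim, num_signers)             # classifier on pooled statistics

    def forward(self, feats):                                  # feats: (N, T, dim) frame-level features
        feats = GradReverse.apply(feats, self.lam)             # gradient reversal -> feature disentanglement
        stats = torch.cat([feats.mean(dim=1), feats.std(dim=1)], dim=1)   # statistics pooling (mean & std)
        return self.fc(stats)                                  # sentence-level signer logits for the signer loss

# Dropping GradReverse.apply turns the same branch into plain multi-task learning.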

Table 8. Performance comparison between seen and unseen signers. The relative gap is computed as (unseen WER − seen WER) / seen WER.
Method  Seen Signers (WER%)  Unseen Signers (WER%)  Relative Gap (%)
C²SLR  22.7  34.4  51.5
C²SLR + SRM  23.0  32.7  42.2

4.4.3. Effect of the SRM over Seen and Unseen Signers

Finally, we study the effect of the SRM on seen and unseen signers. We first build an extra test set consisting only of signers seen during training by removing the videos performed by unseen signers from the official test set of PHOENIX-2014, and then re-test "C²SLR" and "C²SLR + SRM" on this extra test set. As shown in Table 8, with comparable performance on the seen signers, adding the SRM significantly narrows the relative performance gap between unseen and seen signers. The results suggest that our SRM is especially helpful in the realistic scenario where most test signers are unseen.

4.5. Comparison with State-of-the-art Results

Table 9. Comparison on signer-dependent datasets. (R: RGB; F: optical flow; P: pose.)
Method End-to-end Modalities PHOENIX-2014 PHOENIX-2014-T CSL-Daily
Training Inference Dev Test Dev Test Dev Test
CNN-LSTM-HMMs (Koller et al., 2019) × R R 26.0 26.0 22.1 24.1
DNF (RGB) (Cui et al., 2019) + SBD-RL (Wei et al., 2020) × R R 23.4 23.5
DNF (Cui et al., 2019) × R+F R+F 23.1 22.9 32.8 32.4
CMA (Pu et al., 2020) × R R 21.3 21.9
SMKD (Hao et al., 2021) × R R 20.8 21.0 20.8 22.4
STMC (Zhou et al., 2020) × R+P R 21.1 20.7 19.6 21.0
LS-HAN (Huang et al., 2018) R R 38.3 39.0 39.4
TIN + Transformer (Zhou and et al., 2021) R R 33.6 33.1
SFL (Niu and Mak, 2020) R R 24.9 25.3 25.1 26.1
FCN (Cheng et al., 2020) R R 23.7 23.9 23.3 25.1 33.2 32.5
LCSA (Zuo and Mak, 2022b) R R 21.4 21.9
SLT (Camgöz et al., 2020) R R 24.6 24.5 33.1 32.0
VAC (Min et al., 2021) R R 21.2 22.3
MMTLB (Chen et al., 2022a) R R 21.9 22.5
C²SLR (ours) R+P R 20.5 20.4 20.2 20.4 31.9 31.0

4.5.1. Signer-dependent

As shown in Table 9, we first evaluate our C²SLR on three signer-dependent benchmarks: PHOENIX-2014, PHOENIX-2014-T, and CSL-Daily.

Our C²SLR follows the idea of auxiliary learning, which also appears in some existing works, e.g., FCN (Cheng et al., 2020) and VAC (Min et al., 2021). FCN proposes a gloss feature enhancement (GFE) module to introduce auxiliary supervision signals into the training process. However, the GFE module relies heavily on pseudo labels (CTC decoding results), which may contain many errors. Our method instead relies on pre-extracted heatmaps, which are quite accurate with the help of our post-processing algorithm, and on the model's inherent consistency: the visual and sequential features represent the same sentence. These two properties enable our method to outperform FCN by more than 3% on both PHOENIX-2014 and PHOENIX-2014-T. VAC proposes two auxiliary losses at the frame level, which are less appropriate and perform worse than ours according to the comparison in Section 4.3.8. The SOTA work, STMC (Zhou et al., 2020), adopts a complicated stage optimization strategy, which introduces extra hyper-parameters and requires manually deciding when to switch to a new stage. Our method is fully end-to-end trainable and outperforms STMC on both PHOENIX-2014 and PHOENIX-2014-T. To the best of our knowledge, this is the first time that an end-to-end method outperforms those using the stage optimization strategy.

In terms of modality usage, our method uses the extra pose modality only during training, while only RGB videos are needed for inference. It is thus simpler for real applications than DNF (Cui et al., 2019), which is built on a two-stream architecture taking both RGB videos and optical flow as inputs.

Finally, the results on CSL-Daily may be more important due to its large vocabulary. Our method still achieves SOTA performance on this large-scale dataset, which also validates the generalization capability of our method across different sign languages.

Table 10. Comparison on signer-independent datasets. (R: RGB; F: optical flow; P: pose; D: depth.)
(a) PHOENIX-2014-SI.
Method End-to-end Modalities Dev Test
Training Inference
Re-sign (Koller et al., 2017) × R R 45.1 44.1
DNF (Cui et al., 2019) × R+F R+F 36.0 35.7
CMA (Pu et al., 2020) × R R 34.8 34.3
C²SLR (ours) R+P R 34.3 34.4
C²SLR + SRM (ours) R+P R 33.1 32.7
(b) CSL.
Method End-to-end Modalities Test
Training Inference
LS-HAN (Huang et al., 2018) × R R 17.3
DPD + TEM (Zhou et al., 2019) × R R 4.7
STMC (Zhou et al., 2020) × R+P R 2.1
CTF (Wang et al., 2018) R R 11.2
HLSTM-attn (Guo et al., 2018) R R 10.2
FCN (Cheng et al., 2020) R R 3.0
VAC (Min et al., 2021) R R 1.6
MSeqGraph (Tang et al., 2021) R+P+D R+P+D 0.6
C²SLR (ours) R+P R 0.90
C²SLR + SRM (ours) R+P R 0.68

4.5.2. Signer-independent

As shown in Table 10, we further evaluate our SRM on two signer-independent benchmarks: PHOENIX-2014-SI and CSL.

Although some works, e.g., DNF (Cui et al., 2019) and CMA (Pu et al., 2020), evaluate their methods on PHOENIX-2014-SI, none of them proposes a dedicated module to deal with the challenging SI setting. In this work, we develop a simple yet effective signer removal module (SRM) for SI-CSLR to make the model more robust to signer discrepancy. As shown in Table 10(a), our C²SLR already achieves competitive performance on PHOENIX-2014-SI, and the SRM further improves the performance significantly. The result validates that feature disentanglement is an effective way to remove signer-relevant information, and we believe our SRM can serve as a strong baseline for future work on SI-CSLR.

As shown in Table 10(b), our SRM leads to a relative performance gain of 24.4% over the baseline C²SLR on CSL. (Although the SI setting itself is challenging, the sentences in the CSL test set all appear during training, so the WER can be very low (<1%).) It is worth noting that the SOTA work, MSeqGraph (Tang et al., 2021), uses three modalities: RGB, pose, and depth. Our method uses only RGB and pose information for training, and only RGB frames are needed for inference. Thus, with performance comparable to the SOTA work, we believe our method is more applicable in real practice.

5. Conclusion and Future Works

In this work, we propose three auxiliary tasks to enhance CSLR backbones. The first task guides the model to learn informative attention maps with a keypoint-guided spatial attention module. The second task enhances the representation power of the visual and sequential features by imposing a sentence embedding consistency constraint between them. The third task forces the model to dispel signer information through a dedicated signer removal module for the signer-independent setting. Extensive ablation studies validate the effectiveness of the three auxiliary tasks. Remarkably, our model achieves SOTA or competitive performance on five benchmarks while being trained in an end-to-end manner.

Several directions deserve attention in future work. First, to enhance the quality of keypoint heatmaps, lightweight keypoint estimators that can be co-trained with the CSLR backbone are necessary. Second, more advanced cross-modality sentence embedding extractors should be considered. Finally, we believe more attention should be paid to signer-independent CSLR, since it is more realistic than its signer-dependent counterpart.

Acknowledgements.
The work described in this paper was supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China (Project No. HKUST16200118).

References

  • Adaloglou et al. (2021) Nikolaos M. Adaloglou, Theocharis Chatzis, Ilias Papastratis, Andreas Stergioulas, Georgios Th Papadopoulos, Vassia Zacharopoulou, George Xydopoulos, Klimis Antzakas, Dimitris Papazachariou, and Petros Daras. 2021. A Comprehensive Study on Deep Learning-based Methods for Sign Language Recognition. IEEE TMM (2021), 1–1.
  • Andriluka et al. (2014) Mykhaylo Andriluka, Leonid Pishchulin, Peter Gehler, and Bernt Schiele. 2014. 2D Human Pose Estimation: New benchmark and State of the Art Analysis. In CVPR. 3686–3693.
  • Ba et al. (2016) Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer Normalization. arXiv preprint arXiv:1607.06450 (2016).
  • Camgoz et al. (2018) Necati Cihan Camgoz, Simon Hadfield, Oscar Koller, Hermann Ney, and Richard Bowden. 2018. Neural Sign Language Translation. In CVPR.
  • Camgöz et al. (2020) Necati Cihan Camgöz, Oscar Koller, Simon Hadfield, and Richard Bowden. 2020. Sign Language Transformers: Joint End-to-End Sign Language Recognition and Translation. In CVPR. 10020–10030.
  • Cao et al. (2019b) Yue Cao, Jiarui Xu, Stephen Lin, Fangyun Wei, and Han Hu. 2019b. GCNet: Non-local Networks Meet Squeeze-excitation Networks and Beyond. In CVPRW.
  • Cao et al. (2019a) Zhe Cao, Gines Hidalgo, Tomas Simon, Shih-En Wei, and Yaser Sheikh. 2019a. OpenPose: Realtime Multi-person 2D Pose Estimation using Part Affinity Fields. TPAMI 43, 1 (2019), 172–186.
  • Carlsson et al. (2020) Fredrik Carlsson, Amaru Cuba Gyllensten, Evangelia Gogoulou, Erik Ylipää Hellqvist, and Magnus Sahlgren. 2020. Semantic Re-tuning with Contrastive Tension. In ICLR.
  • Carreira et al. (2018) Joao Carreira, Eric Noland, Andras Banki-Horvath, Chloe Hillier, and Andrew Zisserman. 2018. A Short Note about Kinetics-600. arXiv preprint arXiv:1808.01340 (2018).
  • Chen and Jiang (2019) Shaoxiang Chen and Yu-Gang Jiang. 2019. Motion Guided Spatial Attention for Video Captioning. In AAAI, Vol. 33. 8191–8198.
  • Chen et al. (2022a) Yutong Chen, Fangyun Wei, Xiao Sun, Zhirong Wu, and Stephen Lin. 2022a. A Simple Multi-modality Transfer Learning Baseline for Sign Language Translation. In CVPR. 5120–5130.
  • Chen et al. (2022b) Yutong Chen, Ronglai Zuo, Fangyun Wei, Yu Wu, Shujie Liu, and Brian Mak. 2022b. Two-Stream Network for Sign Language Recognition and Translation. In NeurIPS.
  • Cheng et al. (2020) Ka Leong Cheng, Zhaoyang Yang, Qifeng Chen, and Yu-Wing Tai. 2020. Fully Convolutional Networks for Continuous Sign Language Recognition. In ECCV, Vol. 12369. 697–714.
  • Cheng et al. (2022) Yihua Cheng, Yiwei Bao, and Feng Lu. 2022. PureGaze: Purifying Gaze Feature for Generalizable Gaze Estimation. In AAAI.
  • Cui et al. (2019) Runpeng Cui, Hu Liu, and Changshui Zhang. 2019. A Deep Neural Framework for Continuous Sign Language Recognition by Iterative Training. IEEE TMM PP (07 2019), 1–1.
  • Fu et al. (2019) Jun Fu, Jing Liu, Haijie Tian, Yong Li, Yongjun Bao, Zhiwei Fang, and Hanqing Lu. 2019. Dual Attention Network for Scene Segmentation. In CVPR. 3146–3154.
  • Ganin et al. (2016) Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. 2016. Domain-adversarial Training of Neural Networks. JMLR 17, 1 (2016), 2096–2030.
  • Gao et al. (2021) Tianyu Gao, Xingcheng Yao, and Danqi Chen. 2021. SimCSE: Simple Contrastive Learning of Sentence Embeddings. EMNLP (2021).
  • Goyal et al. (2017) Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. 2017. The “something something” Video Database for Learning and Evaluating Visual Common Sense. In ICCV. 5842–5850.
  • Graves et al. (2006) Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. 2006. Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks. In ICML. 369–376.
  • Guo et al. (2019) Dan Guo, Shuo Wang, Qi Tian, and Meng Wang. 2019. Dense Temporal Convolution Network for Sign Language Translation. In IJCAI. 744–750.
  • Guo et al. (2018) Dan Guo, Wengang Zhou, Houqiang Li, and Meng Wang. 2018. Hierarchical LSTM for Sign Language Translation. In AAAI. 6845–6852.
  • Guo et al. (2023) Leming Guo, Wanli Xue, Qing Guo, Bo Liu, Kaihua Zhang, Tiantian Yuan, and Shengyong Chen. 2023. Distilling Cross-Temporal Contexts for Continuous Sign Language Recognition. In CVPR. 10771–10780.
  • Hao et al. (2021) Aiming Hao, Yuecong Min, and Xilin Chen. 2021. Self-Mutual Distillation Learning for Continuous Sign Language Recognition. In ICCV. 11303–11312.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In CVPR. 770–778.
  • Hjelm et al. (2019) R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Phil Bachman, Adam Trischler, and Yoshua Bengio. 2019. Learning Deep Representations by Mutual Information Estimation and Maximization. In ICLR.
  • Howard et al. (2019) Andrew Howard, Mark Sandler, Bo Chen, Weijun Wang, Liang-Chieh Chen, Mingxing Tan, Grace Chu, Vijay Vasudevan, Yukun Zhu, Ruoming Pang, Hartwig Adam, and Quoc Le. 2019. Searching for MobileNetV3. In ICCV. 1314–1324.
  • Hu et al. (2021) Hezhen Hu, Wengang Zhou, Junfu Pu, and Houqiang Li. 2021. Global-local Enhancement Network for NMF-aware Sign Language Recognition. ACM TOMM 17, 3 (2021), 1–19.
  • Hu et al. (2023) Lianyu Hu, Liqing Gao, Zekang Liu, and Wei Feng. 2023. Continuous Sign Language Recognition with Correlation Network. In CVPR.
  • Huang et al. (2018) Jie Huang, Wengang Zhou, Qilin Zhang, Houqiang Li, and Weiping Li. 2018. Video-Based Sign Language Recognition Without Temporal Segmentation. In AAAI. 2257–2264.
  • Huang et al. (2021) Zhizhong Huang, Junping Zhang, and Hongming Shan. 2021. When Age-invariant Face Recognition Meets Face Age Synthesis: A Multi-task Learning Framework. In CVPR. 7282–7291.
  • Jiao et al. (2023) Peiqi Jiao, Yuecong Min, Yanan Li, Xiaotao Wang, Lei Lei, and Xilin Chen. 2023. CoSign: Exploring Co-occurrence Signals in Skeleton-based Continuous Sign Language Recognition. In ICCV. 20676–20686.
  • Jin et al. (2020) Xin Jin, Cuiling Lan, Wenjun Zeng, Zhibo Chen, and Li Zhang. 2020. Style Normalization and Restitution for Generalizable Person Re-identification. In CVPR. 3143–3152.
  • Kenton and Toutanova (2019) Jacob Devlin Ming-Wei Chang Kenton and Lee Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL-HLT. 4171–4186.
  • Kingma and Ba (2015) Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In ICLR.
  • Koller (2020) Oscar Koller. 2020. Quantitative Survey of the State of the Art in Sign Language Recognition. arXiv preprint arXiv:2008.09918 (2020).
  • Koller et al. (2019) Oscar Koller, Necati Camgoz, Hermann Ney, and Richard Bowden. 2019. Weakly Supervised Learning with Multi-Stream CNN-LSTM-HMMs to Discover Sequential Parallelism in Sign Language Videos. IEEE TPAMI 42, 9 (04 2019), 2306–2320.
  • Koller et al. (2015) Oscar Koller, Jens Forster, and Hermann Ney. 2015. Continuous Sign Language Recognition: Towards Large Vocabulary Statistical Recognition Systems Handling Multiple Signers. CVIU 141 (Dec. 2015), 108–125.
  • Koller et al. (2017) Oscar Koller, Sepehr Zargaran, and Hermann Ney. 2017. Re-Sign: Re-Aligned End-to-End Sequence Modelling with Deep Recurrent CNN-HMMs. In CVPR. 3416–3424.
  • Li et al. (2020) Xingze Li, Wengang Zhou, Yun Zhou, and Houqiang Li. 2020. Relation-guided Spatial Attention and Temporal Refinement for Video-based Person Re-identification. In AAAI, Vol. 34. 11434–11441.
  • Linsley et al. (2018) Drew Linsley, Dan Shiebler, Sven Eberhardt, and Thomas Serre. 2018. Learning What and Where to Attend. In ICLR.
  • Liu et al. (2018a) Yi Liu, Liang He, Jia Liu, and Michael T Johnson. 2018a. Speaker Embedding Extraction with Phonetic Information. Interspeech (2018), 2247–2251.
  • Liu et al. (2019) Yue Liu, Xin Wang, Yitian Yuan, and Wenwu Zhu. 2019. Cross-modal Dual Learning for Sentence-to-video Generation. In ACM MM. 1239–1247.
  • Liu et al. (2018b) Yu Liu, Fangyin Wei, Jing Shao, Lu Sheng, Junjie Yan, and Xiaogang Wang. 2018b. Exploring Disentangled Feature Representation beyond Face Identification. In CVPR. 2080–2089.
  • Luong et al. (2015) Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective Approaches to Attention-based Neural Machine Translation. In EMNLP. 1412–1421.
  • Min et al. (2021) Yuecong Min, Aiming Hao, Xiujuan Chai, and Xilin Chen. 2021. Visual Alignment Constraint for Continuous Sign Language Recognition. In ICCV. 11542–11551.
  • Niu and Mak (2020) Zhe Niu and Brian Mak. 2020. Stochastic Fine-Grained Labeling of Multi-state Sign Glosses for Continuous Sign Language Recognition. In ECCV. 172–186.
  • Oord et al. (2018) Aaron van den Oord, Yazhe Li, and Oriol Vinyals. 2018. Representation Learning with Contrastive Predictive Coding. arXiv preprint arXiv:1807.03748 (2018).
  • Palangi et al. (2016) Hamid Palangi, Li Deng, Yelong Shen, Jianfeng Gao, Xiaodong He, Jianshu Chen, Xinying Song, and Rabab Ward. 2016. Deep Sentence Embedding using Long Short-term Memory Networks: Analysis and Application to Information Retrieval. IEEE/ACM TASLP 24, 4 (2016), 694–707.
  • Pang et al. (2019) Yanwei Pang, Jin Xie, Muhammad Haris Khan, Rao Muhammad Anwer, Fahad Shahbaz Khan, and Ling Shao. 2019. Mask-guided Attention Network for Occluded Pedestrian Detection. In CVPR. 4967–4975.
  • Papadimitriou and Potamianos (2020) Katerina Papadimitriou and Gerasimos Potamianos. 2020. Multimodal Sign Language Recognition via Temporal Deformable Convolutional Sequence Learning. In Interspeech. 2752–2756.
  • Pironkov et al. (2016) Gueorgui Pironkov, Stéphane Dupont, and Thierry Dutoit. 2016. Speaker-aware Long Short-term Memory Multi-task Learning for Speech Recognition. In European Signal Processing Conference (EUSIPCO). 1911–1915.
  • Pu et al. (2020) Junfu Pu, Wengang Zhou, Hezhen Hu, and Houqiang Li. 2020. Boosting Continuous Sign Language Recognition via Cross Modality Augmentation. In ACM MM. 1497–1505.
  • Pu et al. (2018) Junfu Pu, Wengang Zhou, and Houqiang Li. 2018. Dilated Convolutional Network with Iterative Optimization for Continuous Sign Language Recognition. In IJCAI. 885–891.
  • Pu et al. (2019) Junfu Pu, Wengang Zhou, and Houqiang Li. 2019. Iterative Alignment Network for Continuous Sign Language Recognition. In CVPR. 4165–4174.
  • Reimers and Gurevych (2019) Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In EMNLP. 3982–3992.
  • Schroff et al. (2015) Florian Schroff, Dmitry Kalenichenko, and James Philbin. 2015. Facenet: A Unified Embedding for Face Recognition and Clustering. In CVPR. 815–823.
  • Shaw et al. (2018) Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. 2018. Self-Attention with Relative Position Representations. In NAACL-HLT. 464–468.
  • Simonyan and Zisserman (2015) Karen Simonyan and Andrew Zisserman. 2015. Very Deep Convolutional Networks for Large-Scale Image Recognition. In ICLR.
  • Snyder et al. (2018) David Snyder, Daniel Garcia-Romero, Gregory Sell, Daniel Povey, and Sanjeev Khudanpur. 2018. X-vectors: Robust DNN Embeddings for Speaker Recognition. In ICASSP. 5329–5333.
  • Sun et al. (2019) Ke Sun, Bin Xiao, Dong Liu, and Jingdong Wang. 2019. Deep High-resolution Representation Learning for Human Pose Estimation. In CVPR. 5693–5703.
  • Szegedy et al. (2015) Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott E. Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. Going Deeper with Convolutions. In CVPR. 1–9.
  • Tang et al. (2021) Shengeng Tang, Dan Guo, Richang Hong, and Meng Wang. 2021. Graph-Based Multimodal Sequential Embedding for Sign Language Translation. IEEE TMM (2021).
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In NeurIPS. 5998–6008.
  • Wang et al. (2021) Jindong Wang, Cuiling Lan, Chang Liu, Yidong Ouyang, Wenjun Zeng, and Tao Qin. 2021. Generalizing to Unseen Domains: A Survey on Domain Generalization. arXiv preprint arXiv:2103.03097 (2021).
  • Wang et al. (2018) Shuo Wang, Dan Guo, Wen-gang Zhou, Zheng-Jun Zha, and Meng Wang. 2018. Connectionist Temporal Fusion for Sign Language Translation. In ACM MM. 1483–1491.
  • Wei et al. (2020) Chengcheng Wei, Jian Zhao, Wengang Zhou, and Houqiang Li. 2020. Semantic Boundary Detection with Reinforcement Learning for Continuous Sign Language Recognition. IEEE TCSVT 31, 3 (2020), 1138–1149.
  • Wei and Chen (2023) Fangyun Wei and Yutong Chen. 2023. Improving Continuous Sign Language Recognition with Cross-Lingual Signs. In ICCV. 23612–23621.
  • Woo et al. (2018) Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon. 2018. CBAM: Convolutional Block Attention Module. In ECCV. 3–19.
  • Wu et al. (2018) Felix Wu, Angela Fan, Alexei Baevski, Yann Dauphin, and Michael Auli. 2018. Pay Less Attention with Lightweight and Dynamic Convolutions. In ICLR.
  • Xu et al. (2020) Tian Xu, Jennifer White, Sinan Kalkan, and Hatice Gunes. 2020. Investigating Bias and Fairness in Facial Expression Recognition. In ECCV. 506–523.
  • Yang et al. (2018) Baosong Yang, Zhaopeng Tu, Derek F. Wong, Fandong Meng, Lidia S. Chao, and Tong Zhang. 2018. Modeling Localness for Self-Attention Networks. In EMNLP. 4449–4458.
  • Ye et al. (2019) Mang Ye, Xu Zhang, Pong C Yuen, and Shih-Fu Chang. 2019. Unsupervised Embedding Learning via Invariant and Spreading Instance Feature. In CVPR. 6210–6219.
  • Yin et al. (2016) Fang Yin, Xiujuan Chai, and Xilin Chen. 2016. Iterative Reference Driven Metric Learning for Signer Independent Isolated Sign Language Recognition. In ECCV. 434–450.
  • Yu et al. (2018) Adams Wei Yu, David Dohan, Minh-Thang Luong, Rui Zhao, Kai Chen, Mohammad Norouzi, and Quoc V. Le. 2018. QANet: Combining Local Convolution with Global Self-Attention for Reading Comprehension. In ICLR.
  • Zhang and Yang (2021) Yu Zhang and Qiang Yang. 2021. A Survey on Multi-task Learning. IEEE TKDE (2021).
  • Zheng et al. (2023) Jiangbin Zheng, Yile Wang, Cheng Tan, Siyuan Li, Ge Wang, Jun Xia, Yidong Chen, and Stan Z Li. 2023. CVT-SLR: Contrastive Visual-Textual Transformation for Sign Language Recognition with Variational Alignment. In CVPR.
  • Zhou and et al. (2021) Hao Zhou and et al. 2021. Improving Sign Language Translation with Monolingual Data by Sign Back-Translation. In CVPR.
  • Zhou et al. (2019) Hao Zhou, Wengang Zhou, and Houqiang Li. 2019. Dynamic Pseudo Label Decoding for Continuous Sign Language Recognition. In ICME. 1282–1287.
  • Zhou et al. (2020) Hao Zhou, Wengang Zhou, Yun Zhou, and Houqiang Li. 2020. Spatial-Temporal Multi-Cue Network for Continuous Sign Language Recognition. In AAAI. 13009–13016.
  • Zuo and Mak (2022a) Ronglai Zuo and Brian Mak. 2022a. C2SLR: Consistency-Enhanced Continuous Sign Language Recognition. In CVPR. 5131–5140.
  • Zuo and Mak (2022b) Ronglai Zuo and Brian Mak. 2022b. Local Context-aware Self-attention for Continuous Sign Language Recognition. In Proc. Interspeech. 4810–4814.
  • Zuo et al. (2023) Ronglai Zuo, Fangyun Wei, and Brian Mak. 2023. Natural Language-Assisted Sign Language Recognition. In CVPR. 14890–14900.

Appendix A Appendix

A.1. An Example of Word Error Rate

Word error rate (WER) is a widely adopted evaluation metric that measures the dissimilarity between two sequences and is commonly used in speech recognition and sign language recognition systems (Koller et al., 2015; Camgoz et al., 2018; Pu et al., 2020; Zhou et al., 2020; Min et al., 2021; Hao et al., 2021). It is defined as the ratio of the number of errors to the number of words (glosses) in the label sequence after aligning the prediction with the label:

(24)   WER = (#deletions + #substitutions + #insertions) / (#glosses in label),

where # denotes "the number of". Table 11 shows an example with a WER of 50%, and a small illustrative implementation follows the table.

Table 11. An example of WER computation. "A", "B", "C", and "D" represent words (glosses). □ denotes a deletion.
Prediction C \square C A D B
Label A A B C A D
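For completeness, a small routine that computes WER via a dynamic-programming edit distance is given below. The example sequences at the bottom are hypothetical and are not the ones in Table 11.

def wer(label, prediction):
    # Levenshtein alignment: errors = deletions + substitutions + insertions.
    n, m = len(label), len(prediction)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i                      # i deletions
    for j in range(m + 1):
        dp[0][j] = j                      # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dp[i - 1][j - 1] + (label[i - 1] != prediction[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[n][m] / n                   # divided by the number of glosses in the label

print(wer("A B C D".split(), "A C C D".split()))   # hypothetical example: 1 substitution -> 0.25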

A.2. Loss Normalization

As shown in Table 12, we conduct another experiment on PHOENIX-2014 by normalizing each loss term (ℒ_ctc, ℒ_sac, ℒ_sec) of ℒ_b into the range [0, 1]. Since ℒ_ctc is defined as the negative log-likelihood of all feasible alignment paths, we use the reciprocal of its maximum training loss value (about 1/7) as its weight for normalization. ℒ_sac is defined as an MSE loss between attention maps and keypoint heatmaps, which already ranges from 0 to 1, so we set its weight to 1. ℒ_sec is defined as a triplet loss, ℒ_sec = max{d(x, x_p) − d(x, x_n) + α, 0}, where d(x, x_p) = 1 − x·x_p / (||x||_2 ||x_p||_2) ∈ [0, 2] and α = 2; its theoretical maximum value is therefore 4, and we set its weight to 1/4. However, normalizing the loss terms does not lead to better performance, so we keep the default setting, which weighs each loss term equally by 1.0. A code sketch of the two weighting strategies follows Table 12.

Table 12. Comparison between different loss weighting strategies.
Loss Weights Dev Test
Normalization 20.5 20.6
All 1.0 (default) 20.5 20.4
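A minimal sketch of the two weighting strategies compared in Table 12, assuming the individual loss terms are computed elsewhere; the 1/7 and 1/4 factors are the normalization weights derived above.

def combine_losses(loss_ctc, loss_sac, loss_sec, normalize=False):
    # Normalized weighting (Appendix A.2): scale each term into roughly [0, 1].
    if normalize:
        return loss_ctc / 7.0 + loss_sac + loss_sec / 4.0
    # Default setting: every term weighted equally by 1.0.
    return loss_ctc + loss_sac + loss_sec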

A.3. Ablation Study for Local Transformer

We conduct an ablation study on PHOENIX-2014 to verify the effectiveness of the Gaussian bias and the DTCN layer. As shown in Table 13, both the Gaussian bias and the DTCN layer significantly decrease the WER, showing that our local transformer is a strong sequential module for CSLR (a sketch of the Gaussian bias is given after Table 13).

Table 13. Ablation study for the local transformer. (GB: Gaussian bias; DTCN: depth-wise temporal convolution network.)
GB DTCN WER%
25.2
22.7
21.5
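To make the Gaussian bias concrete, the sketch below adds a distance-dependent Gaussian penalty to the self-attention logits so that each frame attends mainly to its temporal neighbourhood. The window width sigma and the single-head formulation are illustrative assumptions rather than our exact local transformer.

import torch

def gaussian_biased_attention(q, k, v, sigma=4.0):
    # q, k, v: (N, T, D). Scaled dot-product attention plus a Gaussian bias
    # -((i - j)^2) / (2 * sigma^2) that favours temporally close positions.
    n, t, d = q.shape
    logits = q @ k.transpose(1, 2) / d ** 0.5             # (N, T, T)
    pos = torch.arange(t, dtype=torch.float32)
    bias = -((pos[:, None] - pos[None, :]) ** 2) / (2 * sigma ** 2)
    attn = torch.softmax(logits + bias, dim=-1)
    return attn @ v

out = gaussian_biased_attention(torch.randn(1, 60, 64), torch.randn(1, 60, 64), torch.randn(1, 60, 64))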

A.4. Position of the Signer Removal Module

The position of the signer removal module (SRM) is simply a hyper-parameter. Besides the SRM, the position of the spatial attention module is also decided empirically. As shown in Table 14(a), we first determine the position of the spatial attention module by placing it after different CNN layers. We find that an intermediate position, i.e., after the 5th CNN layer, is the best choice. An intuitive explanation is that early positions cannot provide enough error signals for the visual module, while too late a position leads to low heatmap resolution. We then fix the position of the spatial attention module and vary the position of the SRM. As shown in Table 14(b), placing the SRM right after the 5th CNN layer is also the best choice.

Table 14. The effects of the position of the spatial attention module and the signer removal module. m denotes the index of the CNN layer. All experiments are conducted on PHOENIX-2014-SI.
(a) The effect of the position of the spatial attention module.
m  1  2  3  4  5  6  7  8
Resolution  224  112  56  56  28  28  14  14
Dev  36.1  34.9  35.5  36.1  34.3  35.5  36.2  37.2
Test  36.0  35.3  35.5  36.5  34.4  33.6  35.7  36.0
(b) The effect of the position of the signer removal module.
m  1  2  3  4  5  6  7  8
Dev  35.3  35.3  36.2  35.3  33.1  35.2  35.9  35.1
Test  35.3  34.4  35.0  35.0  32.7  35.6  33.2  34.1

A.5. Abbreviations

We list all abbreviations that appear in the main text and their corresponding full names in Table 15.

Table 15. Abbreviations and full names.
Abbreviation Full Name Abbreviation Full Name
BiLSTM bidirectional long short-term memory RNN recurrent neural network
CMP channel-wise max pooling SAC spatial attention consistency
CNN convolutional neural network SD signer-dependent
CSLR continuous sign language recognition SEC sentence embedding consistency
C²SLR consistency-enhanced CSLR SEE sentence embedding extractor
CT CNN+TCN SFD stochastic frame dropping
CTC connectionist temporal classification SI signer-independent
GAP global average pooling SLR sign language recognition
GB Gaussian bias SOTA state-of-the-art
GFE gloss feature enhancement SP statistics pooling
GR gradient reversal SRM signer removal module
IR informative region TCN temporal convolutional network
ISLR isolated sign language recognition VA visual alignment
LSA local self-attention VE visual enhancement
LT local transformer VLT VGG11+LT
MLP multi-layer perceptron VTB VGG11+TCN+BiLSTM
QK query-key WER word error rate
ReLU rectified linear unit