Improving Continuous Sign Language Recognition with Adapted Image Models
Abstract
The abundance of web-scale weakly labelled image-text pairs has greatly facilitated the development of large-scale vision-language models (e.g., CLIP), which show impressive generalization performance over a series of downstream tasks. However, the massive model size and the scarcity of available data make it impractical to fine-tune the whole model on downstream tasks. Besides, fully fine-tuning the model easily forgets the generic essential knowledge acquired in the pretraining stage and overfits the downstream data. To efficiently adapt these large vision-language models (e.g., CLIP) to continuous sign language recognition (CSLR) while preserving their generalizability, we propose a novel strategy, AdaptSign. Specifically, CLIP is adopted as the visual backbone to extract frame-wise features with its parameters frozen, and a set of learnable modules is introduced to model spatial sign variations and capture temporal sign movements. The introduced modules are quite lightweight, adding only 3.2% extra computation, and the generic knowledge acquired in the pretraining stage is well preserved in the frozen CLIP backbone. Extensive experiments show that despite being efficient, AdaptSign achieves superior performance across a series of CSLR benchmarks, including PHOENIX14, PHOENIX14-T, CSL-Daily and CSL, compared to existing methods. Visualizations show that AdaptSign learns to dynamically attend to the informative spatial regions and cross-frame trajectories in sign videos.
1 Introduction
Sign language is one of the most commonly used communication tools for the deaf community in their daily life. However, mastering this language is rather difficult and time-consuming for hearing people, thus hindering direct communication between the two groups. To relieve this problem, continuous sign language recognition (CSLR) sequentially translates image streams into a series of glosses (the gloss is the atomic lexical unit used to annotate sign languages) to express a complete sentence, making it more promising for bridging the communication gap.
Recently, the availability of large-scale web image-text pairs has greatly accelerated the development of vision-language models such as CLIP [58]. These methods typically model multi-modal information in a contrastive way, pulling positive image-text pairs closer and pushing negative pairs apart [58, 1, 44, 45, 38]. Powered by the supervision of natural language, these models have shown impressive generalization performance over a series of downstream tasks [18, 15, 7, 13, 62] in a zero-shot manner.
Despite the excellent performance of these models on novel concepts, directly adapting them to downstream tasks via the traditional fine-tuning paradigm remains challenging: (1) The massive model scale and the scarcity of training data limit their application to downstream scenarios, where restricted computing resources and the absence of data may not support fine-tuning the whole model. The situation is even more severe in video-related tasks due to higher computational costs and the difficulty of collecting labelled video data [49, 34]. (2) Fully fine-tuning the model easily forgets the generic knowledge acquired in the pretraining stage and overfits the downstream data; the well-generalized knowledge of the pretrained model is easily destroyed by fine-tuning on inadequate data, which hurts generalizability [37].

The above issues motivate us to find an efficient strategy to adapt these strong vision-language models to learning good video representations for continuous sign language recognition (CSLR), where data scarcity and limited computing resources are both encountered. To tackle these problems, we propose a novel strategy, AdaptSign. Specifically, we adopt a frozen CLIP model as the visual backbone, and stack several modules on top to learn more discriminative spatial sign features and model temporal sign correlations. The extra computation incurred by these modules is quite lightweight (3.2%) compared to the frozen CLIP model, enabling high efficiency. Concretely, we instantiate a new frame-level token as a query, with features from each CLIP block as keys and values, to aggregate multiscale features. Adapters are appended in parallel within each CLIP block to update the intermediate spatial features in a residual way. We inject learnable prefix embeddings before the visual features to model specific domain knowledge. To capture temporal sign information, a correlation module is introduced on top to model cross-frame trajectories. Fig. 1 depicts the differences between our framework and the traditional fine-tuning pipeline. Extensive experiments show that despite being efficient, AdaptSign outperforms existing methods by a large margin across a series of CSLR benchmarks including PHOENIX14, PHOENIX14-T, CSL-Daily and CSL. Plentiful visualizations demonstrate that AdaptSign pays major attention to the informative spatial regions and cross-frame trajectories in sign videos.
2 Related Work
2.1 Continuous Sign Language Recognition
Sign language recognition methods can be roughly divided into isolated sign language recognition [60, 28, 29] and continuous sign language recognition (CSLR) [56, 6, 9, 54, 51]; we focus on the latter in this paper. CSLR aims to translate input images into corresponding glosses in a weakly supervised way: only sentence-level labels are provided. Earlier CSLR methods [19, 16] typically employ hand-crafted features or HMM-based systems [40, 22, 41, 39] to perform temporal modelling and translate sentences step by step. Recently, the CTC loss [20] has been broadly used in CSLR methods [56, 57, 6, 9, 54, 51] to train deep networks in an end-to-end manner by sequentially aligning target sentences with input frames. These CTC-based methods first rely on a spatial extractor, often instantiated as a 2D CNN, to extract frame-wise features, and then adopt a sequence model consisting of a 1D CNN and an LSTM to capture temporal dependencies. However, several methods [56, 9] found that in such a setting the spatial extractor is not well trained and proposed an iterative training strategy to relieve this problem, at the cost of much more computation. Some recent studies try to directly enhance the spatial extractor with visual supervision [51, 6, 23], squeeze more beneficial temporal features [30], or emphasize critical spatial features [31]. Adapting large vision-language models for CSLR faces two major problems: scarcity of available data (e.g., only 20k videos for the commonly used PHOENIX14 [39] and CSL-Daily [66]) and huge computational costs during fine-tuning. We present a novel strategy to adapt these high-quality visual features for sign language understanding with high efficiency while preserving their generalizability.
2.2 Vision-Language Models
Powered by the abundance of image-text pairs collected from the web, large-scale vision-language methods have developed rapidly over the past several years. Earlier methods mainly rely on grid features [35, 53] or region proposals [2] to align image features with text embeddings. In contrast, contrastive vision-language methods such as CLIP [58] are trained by maximizing the feature similarity between positive image-text pairs, learning more powerful visual features aligned with language semantics. Another advantage of CLIP is its impressive feature transferability, which shows promising results on a series of downstream visual tasks in a zero-shot manner. However, fine-tuning the whole model is still infeasible in some downstream tasks due to the scarcity of training data and the incurred computation. In this paper, we design an efficient strategy to adapt these generic features to help understand sign videos.
2.3 Efficient Transfer Learning
This line of work tries to efficiently transfer high-quality representations from pretrained models to downstream tasks, and was first explored in natural language processing and image recognition. The Adapter series [26, 27, 24] keeps the pretrained model fixed and designs adapters consisting of an MLP with residual connections to adjust output features. Some works explore prompt tuning [47, 43], which appends a learnable prompt before the input or the intermediate features to adapt the output features to specific tasks. In the visual domain, CLIP-Adapter [18] calibrates the classifier on top of a frozen CLIP model in a residual way. VL-Adapter [59] explores the setup of adapters in the multitask setting. Tip-Adapter [63] proposes a training-free adapter to align texts with images. VPT [34] tests a set of visual prompt choices in visual tasks. EVL [49] designs a transformer decoder to learn more powerful spatial-temporal representations. Some works [7, 15] transfer the aligned image and text features to cross-modality tasks with a CLIP backbone. In contrast to previous methods that mostly focus on image understanding, we handle the data scarcity and incurred computation in video scenarios like CSLR.
3 Method

3.1 Overview
Given a sign language video with $T$ frames, a CSLR model aims to translate the input video into a series of glosses to express a sentence, with $L$ denoting the length of the gloss sequence. The CTC loss [20] is used to provide supervision during training by aligning input frames to ground-truth gloss sequences. We follow recent CSLR methods [51, 6, 23, 30, 31] in first deploying a spatial extractor to extract frame-wise features, and then employing a sequence model to perform temporal reasoning for sentence prediction. Fig. 2 shows the framework overview of our AdaptSign. While large-scale vision-language models have shown excellent performance over a series of downstream tasks [13, 38, 58, 17, 46, 1, 44, 45], directly fine-tuning such a model on a downstream task easily leads to inferior performance and unstable generalizability [37], which hinders broader application when faced with new data or tasks. Besides, their high computational demands cannot always be met in real-world applications with constrained computing resources. To transfer their powerful features to tasks with limited available data like CSLR, we propose a novel strategy, termed AdaptSign. Specifically, to keep training efficient, we adopt a frozen CLIP model [58] as the spatial extractor, and stack several lightweight learnable modules on top to adapt its general features to the downstream task. In particular, we propose attention & FFN adaption, multiscale aggregation, and prefix embedding to inject task-related spatial information into the well-generalized features. To model the temporal movements in sign videos, a cross-frame attention module is introduced to capture the trajectories of the signer. Finally, we add the features of both cross-frame attention and multiscale aggregation as the output representation, and feed them into the sequence model for sentence prediction.
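To make the CTC supervision concrete, the snippet below is a minimal PyTorch sketch of aligning frame-wise gloss probabilities to an unsegmented gloss sequence; the shapes and vocabulary size are illustrative assumptions, not the paper's actual configuration.

```python
import torch
import torch.nn as nn

# Illustrative sizes: T frames, batch B, gloss vocabulary V (index 0 reserved for the CTC blank).
T, B, V = 120, 2, 1296
log_probs = torch.randn(T, B, V, requires_grad=True).log_softmax(-1)  # per-frame gloss log-probs
targets = torch.randint(1, V, (B, 12))                                # glosses, no frame alignment
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), 12, dtype=torch.long)

ctc = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc(log_probs, targets, input_lengths, target_lengths)  # aligns frames to glosses internally
loss.backward()                                                # standard backward pass
```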
3.2 Obtaining More Representative Features
As a specific language with its own grammar and lexicon [12, 55], expressed through both manual components (hand/arm gestures) and non-manual components (facial expressions, head movements, and body postures), sign language requires special expert knowledge to understand. We argue that the general features of CLIP, acquired by pretraining on large-scale web data, do not directly generalize well to this specialized domain. Thus, we introduce a series of lightweight modules to adapt the frozen CLIP backbone to learn specific spatial sign features. Besides, as the CLIP model lacks the ability to model local temporal information (e.g., cross-frame trajectories of the signer's hands and face), we introduce a cross-frame attention module to encode temporal information into the frozen CLIP backbone.
3.2.1 Attention & FFN adaption.
Efficient transfer learning methods [26, 27, 24, 43, 50, 47] have been extensively explored in Natural Language Processing (NLP) to adapt large language models to downstream applications, achieving comparable or even superior performance to fine-tuning the whole heavy model. This line of work can be roughly divided into two categories: Adapters [26, 27, 24] and Prompt-tuning [43, 50, 47]. Inspired by the achievements of efficient fine-tuning techniques in NLP, we introduce the Adapter [26] to adapt the generalized visual features.
Specifically, the Adapter consists of two fully connected (FC) layers with a GELU activation function in between, placed in parallel via a skip connection. To lower the required computation and parameters, the first FC layer reduces the channel dimension by a factor of $\gamma$, and the second FC layer projects it back to the original dimension. To reuse the learned ability of modelling patch relationships, we add an Adapter in parallel with the LN-MSA layer to adapt the pretrained image features to videos. During training, all other layers in the backbone are frozen and only the Adapters are updated. The outputs of the Adapter and MSA layer are added to calibrate the general visual features with specialized video information. The computing procedure could be expressed as follows, with $x_l$ denoting the input features of layer $l$:
$x_l' = x_l + \mathrm{MSA}(\mathrm{LN}(x_l)) + \mathrm{Adapter}(\mathrm{LN}(x_l))$   (1)
To keep the original behavior of the image model, we initialize the weights of the second FC layer in the Adapter with zeros. Thus, the Adapter acts as an identity function at the beginning of training, without hurting the learned features of the pretrained image model. Practically, as the Adapters are quite lightweight, the incurred extra parameters and computation are few (0.1%) compared to the frozen backbone. However, as we show in the experiments, this simple design greatly unlocks the potential of large-scale pretrained image models on videos, and works well with limited available data.
Some works [34, 3] find that the feedforward network (FFN) can learn fine-grained image features by sequentially transforming patches via non-linear transformations. To likewise leverage this ability of pretrained image models, we add an Adapter in parallel with the LN-FFN layer to calibrate its features towards specialized video information. The calculation procedure could be expressed as:
$x_{l+1} = x_l' + \mathrm{FFN}(\mathrm{LN}(x_l')) + \mathrm{Adapter}(\mathrm{LN}(x_l'))$   (2)
The LN and FFN layers are kept frozen and only the Adapters are updated.
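As a concrete illustration of Eqs. (1) and (2), the following is a minimal PyTorch sketch of such a bottleneck Adapter, assuming a channel reduction factor of 4; the class and variable names are ours, not the paper's.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, GELU, zero-initialized up-projection."""
    def __init__(self, dim: int, reduction: int = 4):
        super().__init__()
        self.down = nn.Linear(dim, dim // reduction)   # first FC: reduce channels
        self.act = nn.GELU()
        self.up = nn.Linear(dim // reduction, dim)     # second FC: project back
        nn.init.zeros_(self.up.weight)                 # identity behaviour at the start of training
        nn.init.zeros_(self.up.bias)

    def forward(self, x):                              # x: (B*T, N_patches, dim) patch tokens
        return self.up(self.act(self.down(x)))

# Placed in parallel with the frozen LN-MSA and LN-FFN paths of each CLIP block:
#   x = x + msa(ln1(x)) + adapter_attn(ln1(x))         # cf. Eq. (1)
#   x = x + ffn(ln2(x)) + adapter_ffn(ln2(x))          # cf. Eq. (2)
```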
3.2.2 Prefix embedding.
While pretrained models contain generic, powerful knowledge acquired by training on a large corpus, there may be no proper way to inject specific domain knowledge so that they can adapt to specialized downstream tasks. To handle this challenge, we inject specific sign knowledge into the pretrained image model by appending learnable prefix embeddings in its basic blocks, which serve as instructions for the frozen backbone.
Specifically, in MSA, the input $x_l$ first undergoes three linear transformations $W_Q$, $W_K$ and $W_V$ to obtain the query $Q$, key $K$ and value $V$. To structurally encode specific sign knowledge into the pretrained image model, we append learnable prompt embeddings $P_K, P_V \in \mathbb{R}^{M\times C}$ of length $M$ before the key $K$ and value $V$, respectively, reformulating them as $[P_K; K]$ and $[P_V; V]$. Here, $C$ denotes the number of intermediate channels. Taking patch $i$ as an example, the attention operation to compute the output $z^i$ could be formulated as:
$z^i = \mathrm{Softmax}\!\left(\dfrac{Q^i\,[P_K; K]^{\top}}{\sqrt{C}}\right)[P_V; V]$   (3)
It's noticed that the query $Q^i$ first computes its affinities with $P_K$ and then aggregates task-specific information from $P_V$, irrespective of the input features. $P_K$ and $P_V$ are individually set and learned for different layers, which is expected to offer specialized knowledge of various spatial hierarchies to update the intermediate features of each CLIP block. Practically, $P_K$ and $P_V$ are randomly initialized, and then updated together with the Adapters via backward gradient propagation, keeping the other network components frozen. As the length $M$ is often quite small (e.g., 8), the incurred extra parameters are few (<0.1%) compared to the frozen backbone.
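The sketch below illustrates the prefix embedding idea under our reconstruction of Eq. (3), assuming single-head attention and stand-in projection layers in place of the frozen CLIP weights; in practice only the two prefix tensors would be trainable.

```python
import torch
import torch.nn as nn

class PrefixedAttention(nn.Module):
    """Single-head attention with learnable key/value prefixes of length M."""
    def __init__(self, dim: int, prefix_len: int = 8):
        super().__init__()
        self.q = nn.Linear(dim, dim)                   # stand-ins for the frozen CLIP projections
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.p_k = nn.Parameter(torch.randn(prefix_len, dim) * 0.02)   # learnable prefix for keys
        self.p_v = nn.Parameter(torch.randn(prefix_len, dim) * 0.02)   # learnable prefix for values
        self.scale = dim ** -0.5

    def forward(self, x):                              # x: (B, N, dim) patch tokens
        b = x.shape[0]
        q, k, v = self.q(x), self.k(x), self.v(x)
        k = torch.cat([self.p_k.expand(b, -1, -1), k], dim=1)   # [P_K; K]
        v = torch.cat([self.p_v.expand(b, -1, -1), v], dim=1)   # [P_V; V]
        attn = (q @ k.transpose(1, 2)) * self.scale              # affinities, cf. Eq. (3)
        return attn.softmax(dim=-1) @ v
```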
3.2.3 Multiscale aggregation.
Features across different layers are shown to contain beneficial information of various spatial hierarchies [32, 14, 48, 25]. To effectively leverage this multiscale information for the task, we progressively aggregate features from different CLIP blocks into a robust unified representation. Specifically, we initialize a new frame-level token $g$ as a query, and treat the features from each frozen CLIP block as keys and values to perform MSA. An MLP layer with LN and skip connections is then employed on $g$ to update its features. Taking layer $l$ as an example, this procedure could be expressed as:
$\hat{g}_l = g_{l-1} + \mathrm{MSA}(g_{l-1},\, x_l,\, x_l)$   (4)

$g_l = \hat{g}_l + \mathrm{MLP}(\mathrm{LN}(\hat{g}_l))$   (5)
This procedure is repeated for each CLIP block, from the first to the last.
Especially, the MSA and MLP operations are only conducted on the single token $g$, and thus the overall computing complexity is $\mathcal{O}(N)$ with respect to the patch number $N$, in contrast to the $\mathcal{O}(N^2)$ complexity of the CLIP blocks. Overall, the extra computation is small (1.9%) compared to the CLIP backbone.
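The following sketch illustrates Eqs. (4) and (5) as reconstructed above: a single learnable token cross-attends to the patch features of each frozen CLIP block, keeping the cost linear in the patch number. The head count, hidden sizes and module names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MultiscaleAggregator(nn.Module):
    """A single learnable token cross-attends to the patch features of every CLIP block."""
    def __init__(self, dim: int = 768, num_blocks: int = 12, heads: int = 8):
        super().__init__()
        self.token = nn.Parameter(torch.zeros(1, 1, dim))   # frame-level token g
        self.attn = nn.ModuleList([nn.MultiheadAttention(dim, heads, batch_first=True)
                                   for _ in range(num_blocks)])
        self.norm = nn.ModuleList([nn.LayerNorm(dim) for _ in range(num_blocks)])
        self.mlp = nn.ModuleList([nn.Sequential(nn.LayerNorm(dim),
                                                nn.Linear(dim, dim * 4),
                                                nn.GELU(),
                                                nn.Linear(dim * 4, dim))
                                  for _ in range(num_blocks)])

    def forward(self, block_feats):       # list of (B*T, N, dim) features, one per CLIP block
        g = self.token.expand(block_feats[0].shape[0], -1, -1)
        for attn, norm, mlp, x in zip(self.attn, self.norm, self.mlp, block_feats):
            g = g + attn(norm(g), x, x)[0]   # g as query, block features as keys/values (Eq. 4)
            g = g + mlp(g)                   # MLP with LN and skip connection (Eq. 5)
        return g.squeeze(1)                  # aggregated frame-level representation
```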
3.2.4 Cross-frame attention.
Hands and face play a major role in expressing sign language, delivering messages through horizontal/vertical hand movements, finger activities, and facial expressions [12, 55]. However, the pretrained CLIP model fails to capture the temporal dynamics of these body parts. To depict such trajectories, we compute attention maps between patches in a local temporal neighborhood to obtain their temporal correspondences. Specifically, we use the frame-level token $g_t$ as a query for each frame $t$, and treat the neighboring spatial-temporal patches $x_{t-w:t+w}$ within the $2w{+}1$ adjacent frames as keys and values to compute the attention map $A_t$ between $g_t$ and $x_{t-w:t+w}$ as:
$A_t = \dfrac{g_t\, x_{t-w:t+w}^{\top}}{\sqrt{C}}$   (6)
$A_t$ is then passed through a sigmoid function to generate weights within $(0,1)$ that measure the importance of each neighboring patch. We further subtract 0.5 from these values to map them into the range $(-0.5, 0.5)$, so that informative patches are emphasized with positive values and unnecessary patches are suppressed with negative values. Next, we element-wise multiply the weights with the neighboring spatial-temporal patches $x_{t-w:t+w}$ to aggregate motion information, average the results over all patches, and add them onto $g_t$ to encode temporal correspondences in a residual way as:
$g_t' = g_t + \mathrm{Avg}\big((\sigma(A_t) - 0.5) \odot x_{t-w:t+w}\big)$   (7)
As the MSA and aggregation operations in Eq. 6 and Eq. 7 are only conducted between a single token $g_t$ and its neighboring patches, the overall computing complexity is $\mathcal{O}(N)$ with respect to the patch number $N$. Practically, the extra computation brought by the cross-frame attention module is small (1.0%) compared to the CLIP backbone.
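Below is a simplified sketch of this cross-frame attention under our reconstruction of Eqs. (6) and (7); the stand-in projection layers and the per-frame loop are for clarity only and are assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn as nn

class CrossFrameAttention(nn.Module):
    """Frame-level tokens attend to patches of 2w+1 neighbouring frames (sigmoid-0.5 weighting)."""
    def __init__(self, dim: int = 768, window: int = 2):
        super().__init__()
        self.q = nn.Linear(dim, dim)       # stand-in projections
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.window = window               # w: number of adjacent frames on each side
        self.scale = dim ** -0.5

    def forward(self, g, patches):         # g: (B, T, dim) tokens, patches: (B, T, N, dim)
        b, t, n, d = patches.shape
        out = []
        for i in range(t):
            lo, hi = max(0, i - self.window), min(t, i + self.window + 1)
            neigh = patches[:, lo:hi].reshape(b, -1, d)                  # (B, (hi-lo)*N, dim)
            attn = (self.q(g[:, i:i + 1]) @ self.k(neigh).transpose(1, 2)) * self.scale
            w = torch.sigmoid(attn) - 0.5                                # emphasise / suppress patches
            out.append(g[:, i] + (w @ self.v(neigh)).squeeze(1) / neigh.shape[1])  # average + residual
        return torch.stack(out, dim=1)     # temporally enriched frame-level tokens
```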
3.3 Complexity Analysis
In total, the extra computation introduced by our proposed modules is small (3.2%) with respect to the frozen CLIP backbone. Despite being efficient, as we show below, AdaptSign shows clear advantages in accuracy over both the frozen and the fine-tuned CLIP backbone.
4 Experiments
4.1 Experimental Setup
4.1.1 Datasets.
PHOENIX14 [39] and PHOENIX14-T [4] are both recorded from German weather forecast broadcasts in front of a clean background with a resolution of 210×260. They contain 6841/8247 sentences with a vocabulary of 1295/1085 signs, divided into 5672/7096 training samples, 540/519 development (Dev) samples and 629/642 testing (Test) samples.
CSL-Daily [66] revolves around daily life, recorded indoors at 30 fps by 10 signers. It contains 20654 sentences, divided into 18401 training samples, 1077 development (Dev) samples and 1176 testing (Test) samples.
CSL [33] is collected in a laboratory environment by fifty signers, with a vocabulary of 178 signs and 100 sentences. It contains 25000 videos, divided into training and testing sets at a ratio of 8:2.
4.1.2 Training details.
For fair comparison, we follow the same setting as state-of-the-art methods [51, 31] to prepare our model and restrict the training procedure to a single graphics card. We adopt ViT-B/16 [11] as the spatial extractor with pretrained weights from CLIP [58]. The sequence model consists of a 1D CNN and a two-layer BiLSTM module, followed by a fully connected layer for prediction. The 1D CNN is a sequence of {K5, P2, K5, P2} layers, where K5 denotes a 1D convolutional layer with kernel size 5 and P2 denotes a pooling layer with kernel size 2. We train our model for 40 epochs with an initial learning rate of 0.0001, decayed by a factor of 5 after 20 and 30 epochs. The Adam optimizer is adopted with weight decay 0.001 and batch size 2. All frames are first resized to 256×256 and then randomly cropped to 224×224, with 50% horizontal flipping and 20% random temporal scaling during training. During inference, a central 224×224 crop is simply selected. We use the VE and VA losses from VAC [51] for extra supervision.
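For concreteness, the following sketch assembles the sequence model described above ({K5, P2, K5, P2} followed by a two-layer BiLSTM and a prediction layer); the channel widths and vocabulary size are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SequenceModel(nn.Module):
    """{K5, P2, K5, P2} temporal module followed by a two-layer BiLSTM and a prediction layer."""
    def __init__(self, dim: int = 512, hidden: int = 1024, num_classes: int = 1296):
        super().__init__()
        self.tconv = nn.Sequential(
            nn.Conv1d(dim, hidden, kernel_size=5), nn.ReLU(),     # K5
            nn.MaxPool1d(kernel_size=2),                          # P2
            nn.Conv1d(hidden, hidden, kernel_size=5), nn.ReLU(),  # K5
            nn.MaxPool1d(kernel_size=2),                          # P2
        )
        self.bilstm = nn.LSTM(hidden, hidden // 2, num_layers=2,
                              bidirectional=True, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)                # gloss logits (incl. CTC blank)

    def forward(self, x):                                         # x: (B, T, dim) frame-wise features
        x = self.tconv(x.transpose(1, 2)).transpose(1, 2)         # temporal convolution + pooling
        x, _ = self.bilstm(x)
        return self.head(x)                                       # (B, T', num_classes)
```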
4.1.3 Evaluation Metric.
We use Word Error Rate (WER) as the evaluation metric. It is defined as the minimal number of substitution, insertion, and deletion operations needed to convert the predicted sentence into the reference sentence, normalized by the reference length:
$\mathrm{WER} = \dfrac{\#\mathrm{substitutions} + \#\mathrm{insertions} + \#\mathrm{deletions}}{\#\mathrm{words\ in\ reference}}$   (8)
Note that a lower WER indicates better accuracy.
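For reference, a minimal implementation of Eq. (8) via edit distance could look as follows (a sketch, not the official evaluation script):

```python
def wer(reference: list, hypothesis: list) -> float:
    """Word error rate: edit distance between gloss sequences over the reference length."""
    m, n = len(reference), len(hypothesis)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                          # deletions
    for j in range(n + 1):
        d[0][j] = j                          # insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1   # substitution
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[m][n] / max(m, 1)

# One substitution and one deletion over a 4-gloss reference -> WER = 0.5
print(wer(["WETTER", "MORGEN", "REGEN", "NORD"], ["WETTER", "HEUTE", "REGEN"]))
```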
Configurations | Step time (s) | Dev(%) | Test(%) |
Freezing | 0.28 | 28.6 | 29.7 |
Fine-tuning | 0.65 (2.34×) | 34.3 | 34.9
Partial-1 | 0.31 | 23.4 | 23.1 |
Partial-2 | 0.34 | 23.1 | 22.8 |
Ours | 0.31 (1.15×) | 18.5 | 19.4
4.2 Ablation Study
We perform ablation studies on both development (Dev) and testing (Test) sets of the PHOENIX14 dataset to verify the effectiveness of AdaptSign.
Study on different training paradigms. We compare our method in both effectiveness and efficiency against freezing or fine-tuning the CLIP backbone in tab. 1. Here, Partial-$k$ denotes fine-tuning only the last $k$ layers of the pretrained model. It's observed that a frozen backbone achieves considerable recognition accuracy (28.6% & 29.7% WER on the two sets), while directly fine-tuning it leads to unsatisfactory results (-5.7% & -5.2% compared to freezing), which indicates the difficulty of directly adapting large models with limited training data. Fine-tuning only the last several layers notably improves the recognition performance. Our method achieves much higher accuracy than all of them (e.g., +10.0% & +10.3% over freezing and +15.8% & +15.5% over fine-tuning). Regarding training costs, fine-tuning takes a much longer (2.34×) step time per batch compared to freezing and partial fine-tuning. Our method only adds 0.15× step time and consumes a comparable step time to freezing and partial fine-tuning, showing much better training efficiency than full fine-tuning. Overall, our method demonstrates a much better accuracy-computation trade-off than the commonly used fine-tuning paradigm.
Study on the effectiveness of each component. We add the proposed modules one by one on top of the frozen CLIP backbone to verify their effectiveness in tab. 2. Adding the attention & FFN adaption brings a significant +6.0% & +6.5% accuracy boost, which shows the necessity of injecting specific sign video representations into the intermediate features of the pretrained visual backbone. Sequentially adding the other three modules further gives +1.4% & +1.1%, +1.4% & +1.5%, and +1.6% & +2.2% accuracy boosts, which demonstrates the effectiveness of injecting domain knowledge, leveraging multiscale features and capturing cross-frame trajectories.
Configurations | Dev(%) | Test(%) |
- | 28.6 | 29.7 |
w/ attention & FFN adaption | 22.6 | 23.2 |
w/ prompt embedding | 21.2 | 22.1 |
w/ multiscale aggregation | 19.8 | 20.6 |
w/ cross-frame attention | 18.5 | 19.4 |
Study on the configurations of attention & FFN adaption. The upper part of tab. 3 tests the dimension $d$ of the intermediate features between the two FC layers in the Adapters. Intuitively, a smaller $d$ leads to fewer computations but a higher WER, so we look for an accuracy-computation trade-off for $d$. Practically, it's observed that a larger $d$ consistently achieves better performance, which peaks at the value we adopt by default. We then test the choice of initialization. It's found that, compared to a normal distribution, zero-initializing the second FC layer is critical to obtaining high-quality visual features, as it keeps the original behavior of the CLIP backbone at the beginning of training.
Configurations | Dev(%) | Test(%) |
$d$ = | 18.9 | 19.9
$d$ = | 18.5 | 19.4
$d$ = | 18.6 | 19.6
$d$ = | 18.8 | 19.8
Initialize by normal distribution | 20.3 | 20.6 |
Initialize by all zeros | 18.5 | 19.4 |
Study on the configuration of prompt embedding. We first explore the length $M$ of the prompt embeddings in the upper part of tab. 4. It's observed that the accuracy consistently improves as $M$ increases, reaching a peak at $M=8$, which we adopt by default. We then test whether to share the prompt embeddings across layers in the bottom part of tab. 4. It's noticed that setting independent prompt embeddings for each layer achieves better accuracy. This could be attributed to the fact that independent prompt embeddings in different layers offer specific sign knowledge of various hierarchies to help understand sign videos.
Configurations | Dev(%) | Test(%) |
$M$ = | 20.1 | 20.3
$M$ = | 19.8 | 20.0
$M$ = | 18.5 | 19.4
$M$ = | 19.6 | 19.9
Shared across layers | 20.3 | 20.6 |
Independent across layers | 18.5 | 19.4 |
Configurations | Dev(%) | Test(%) |
unidirectional | 19.4 | 20.1 |
bidirectional | 18.5 | 19.4 |
$w$ = 0 | 19.8 | 20.6
$w$ = 1 | 19.1 | 20.3
$w$ = 2 | 18.5 | 19.4
$w$ = 3 | 18.9 | 19.9

Study on the configurations of cross-frame attention. We first explore whether to perform cross-frame attention bidirectionally or unidirectionally. As shown in the upper part of tab. 5, aggregating temporal information from bidirectional frames outperforms unidirectional attention that observes only future frames, which demonstrates the benefit of using both past and future information. We then explore the temporal neighborhood $w$ of cross-frame attention, where $w=0$ means no temporal information is incorporated. In the bottom part of tab. 5, it's observed that a larger $w$ consistently brings better performance until $w$ reaches 2, after which no further gain is obtained. We thus set $w$ to 2 by default.
Backbone | Configuration | Dev(%) | Test(%)
CLIP [58] | Freezing | 29.7 | 30.6
CLIP [58] | Partial-1 | 24.2 | 23.8
CLIP [58] | AdaptSign | 19.5 | 19.8
CoCa [62] | Freezing | 29.4 | 29.9
CoCa [62] | Partial-1 | 23.8 | 23.2
CoCa [62] | AdaptSign | 19.1 | 19.4
Flexibility over multiple backbones. We verify the flexibility of our AdaptSign by deploying it over multiple large-scale vision-language backbones, such as CLIP [58] and CoCa [62], with ViT-B/32 as the spatial extractor. The results are listed in tab. 6. It’s observed our AdaptSign generalizes well across different backbones to improve the performance of CSLR.
4.3 Visualizations
Visualizations for spatial attention maps. Fig. 3 compares the attention maps generated by the last layer of our method with those from the last layer of the frozen CLIP visual backbone. Notably, our method generally focuses on the human body (light yellow areas) and pays specific attention to regions like the hands and face (dark red areas), which play an important role in expressing signs. In contrast, the attention maps of the frozen CLIP backbone are much sparser and mostly focus on static objects like clothes or the background. These results show that our method helps the frozen CLIP model learn to emphasize the information that matters for expressing signs, e.g., fine-grained features of the hands and face.
Visualizations for the cross-frame attention module. Fig. 4 shows the attention maps generated by our cross-frame attention module. The red box denotes the query location, i.e., the frame-level token $g_t$. It's observed that the query consistently attends to informative regions in neighboring frames, e.g., the hands or face, to track critical body trajectories when expressing a sign. In particular, it learns to pay special attention to the moving body parts that play a major role in expressing signs. For example, in the first row, the query (placed on the left hand) pays major attention to the quickly moving right hand to capture its trajectory across frames, but much less attention to the static left hand itself and other regions.

Methods | PHOENIX14 Dev(%) | PHOENIX14 Test(%) | PHOENIX14-T Dev(%) | PHOENIX14-T Test(%)
FCN [6] | 23.7 | 23.9 | 23.3 | 25.1 |
CMA [57] | 21.3 | 21.9 | - | - |
VAC [51] | 21.2 | 22.3 | - | - |
SMKD [23] | 20.8 | 21.0 | 20.8 | 22.4 |
TLP [30] | 19.7 | 20.8 | 19.4 | 21.2 |
RadialCTC [52] | 19.4 | 20.2 | - | - |
SEN [31] | 19.5 | 21.0 | 19.3 | 20.7 |
CTCA [21] | 19.5 | 20.1 | 19.3 | 20.3 |
CoSign-2s [36] | 19.7 | 20.1 | 19.5 | 20.1 |
CVT-SLR [64] | 19.8 | 20.1 | 19.4 | 20.3 |
C+L+H∗ [42] | 26.0 | 26.0 | 22.1 | 24.1 |
DNF∗ [9] | 23.1 | 22.9 | - | - |
STMC∗ [65] | 21.1 | 20.7 | 19.6 | 21.0 |
C2SLR∗ [67] | 20.5 | 20.4 | 20.2 | 20.4 |
AdaptSign | 18.5 | 18.8 | 18.6 | 19.8 |
Methods | Dev(%) | Test(%) |
LS-HAN [33] | 39.0 | 39.4 |
TIN-Iterative [9] | 32.8 | 32.4 |
Joint-SLRT [5] | 33.1 | 32.0 |
FCN [6] | 33.2 | 32.5 |
BN-TIN [66] | 33.6 | 33.1 |
CTCA [21] | 31.3 | 29.4 |
SEN [31] | 31.1 | 30.7 |
CoSign-2s [36] | 28.1 | 27.2 |
AdaptSign | 26.7 | 26.3 |
4.4 Comparison with State-of-the-Art Methods
PHOENIX14 and PHOENIX14-T. Tab. 7 comprehensively compares our method with other state-of-the-art approaches. Entries notated with ∗ indicate methods that utilize additional cues like face or hand features for better accuracy. All previous CSLR methods fine-tune the pretrained image backbone to prepare their recognition model. Compared to these methods that require 100% of the parameters to be tuned, our AdaptSign adapts generic features from the frozen pretrained model to learn specific sign representations and outperforms them by a large margin on both datasets. Notably, AdaptSign outperforms previous CSLR methods equipped with hand and face features acquired by heavy pose-estimation networks or pre-extracted heatmaps (notated with ∗), without relying on such extra expensive annotations.
CSL-Daily. CSL-Daily is designed to cover a wide range of daily content including family life, social contact and daily communication, with the largest vocabulary (2k) among commonly used CSLR datasets. Tab. 8 shows that our method achieves new state-of-the-art accuracy on this challenging dataset with notable progress, demonstrating its effectiveness in handling real-life scenarios like daily communication.
CSL. CSL is a widely used CSLR dataset recorded indoors. Tab. 9 shows that our method achieves superior accuracy (0.7% WER) on this well-examined dataset, outperforming existing CSLR methods.
5 Conclusion
Although large vision-language models achieve impressive performance over a series of downstream tasks, their massive scale and the scarcity of training data limit their application to downstream tasks. We propose an efficient training strategy to transfer their high-quality visual features to CSLR. Our experiments show that despite being efficient, our strategy outperforms previous methods by a large margin across commonly used benchmarks. Visualizations show that the proposed method attends well to informative regions for expressing signs, such as the hands and face, as well as their trajectories across frames, to capture discriminative sign features.
Acknowledgements
This study was supported by National Key Research and Development Program of China (2023YFF0906200) and National Natural Science Foundation of China (Grant Nos. 62072334).
References
- Alayrac et al. [2022] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022.
- Anderson et al. [2018] Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6077–6086, 2018.
- Bahng et al. [2022] Hyojin Bahng, Ali Jahanian, Swami Sankaranarayanan, and Phillip Isola. Visual prompting: Modifying pixel space to adapt pre-trained models. arXiv preprint arXiv:2203.17274, 2022.
- Camgoz et al. [2018] Necati Cihan Camgoz, Simon Hadfield, Oscar Koller, Hermann Ney, and Richard Bowden. Neural sign language translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7784–7793, 2018.
- Camgoz et al. [2020] Necati Cihan Camgoz, Oscar Koller, Simon Hadfield, and Richard Bowden. Sign language transformers: Joint end-to-end sign language recognition and translation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10023–10033, 2020.
- Cheng et al. [2020] Ka Leong Cheng, Zhaoyang Yang, Qifeng Chen, and Yu-Wing Tai. Fully convolutional networks for continuous sign language recognition. In ECCV, 2020.
- Cheng et al. [2021] Xing Cheng, Hezheng Lin, Xiangyu Wu, Fan Yang, and Dong Shen. Improving video-text retrieval by multi-stream corpus alignment and dual softmax loss. arXiv preprint arXiv:2109.04290, 2021.
- Cihan Camgoz et al. [2017] Necati Cihan Camgoz, Simon Hadfield, Oscar Koller, and Richard Bowden. Subunets: End-to-end hand shape and continuous sign language recognition. In ICCV, 2017.
- Cui et al. [2019] Runpeng Cui, Hu Liu, and Changshui Zhang. A deep neural framework for continuous sign language recognition by iterative training. TMM, 21(7):1880–1891, 2019.
- Deng et al. [2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009.
- Dosovitskiy et al. [2020] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
- Dreuw et al. [2007] Philippe Dreuw, David Rybach, Thomas Deselaers, Morteza Zahedi, and Hermann Ney. Speech recognition techniques for a sign language recognition system. hand, 60:80, 2007.
- Dzabraev et al. [2021] Maksim Dzabraev, Maksim Kalashnikov, Stepan Komkov, and Aleksandr Petiushko. Mdmmt: Multidomain multimodal transformer for video retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3354–3363, 2021.
- Fan et al. [2021] Haoqi Fan, Bo Xiong, Karttikeya Mangalam, Yanghao Li, Zhicheng Yan, Jitendra Malik, and Christoph Feichtenhofer. Multiscale vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6824–6835, 2021.
- Fang et al. [2021] Han Fang, Pengfei Xiong, Luhui Xu, and Yu Chen. Clip2video: Mastering video-text retrieval via image clip. arXiv preprint arXiv:2106.11097, 2021.
- Freeman and Roth [1995] William T Freeman and Michal Roth. Orientation histograms for hand gesture recognition. In International workshop on automatic face and gesture recognition, pages 296–301. Zurich, Switzerland, 1995.
- Gabeur et al. [2020] Valentin Gabeur, Chen Sun, Karteek Alahari, and Cordelia Schmid. Multi-modal transformer for video retrieval. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part IV 16, pages 214–229. Springer, 2020.
- Gao et al. [2021] Peng Gao, Shijie Geng, Renrui Zhang, Teli Ma, Rongyao Fang, Yongfeng Zhang, Hongsheng Li, and Yu Qiao. Clip-adapter: Better vision-language models with feature adapters. arXiv preprint arXiv:2110.04544, 2021.
- Gao et al. [2004] Wen Gao, Gaolin Fang, Debin Zhao, and Yiqiang Chen. A chinese sign language recognition system based on sofm/srn/hmm. Pattern Recognition, 37(12):2389–2402, 2004.
- Graves et al. [2006] Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd international conference on Machine learning, pages 369–376, 2006.
- Guo et al. [2023] Leming Guo, Wanli Xue, Qing Guo, Bo Liu, Kaihua Zhang, Tiantian Yuan, and Shengyong Chen. Distilling cross-temporal contexts for continuous sign language recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10771–10780, 2023.
- Han et al. [2009] Junwei Han, George Awad, and Alistair Sutherland. Modelling and segmenting subunits for sign language recognition based on hand motion analysis. Pattern Recognition Letters, 30(6):623–633, 2009.
- Hao et al. [2021] Aiming Hao, Yuecong Min, and Xilin Chen. Self-mutual distillation learning for continuous sign language recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11303–11312, 2021.
- He et al. [2021] Junxian He, Chunting Zhou, Xuezhe Ma, Taylor Berg-Kirkpatrick, and Graham Neubig. Towards a unified view of parameter-efficient transfer learning. arXiv preprint arXiv:2110.04366, 2021.
- He et al. [2015] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE transactions on pattern analysis and machine intelligence, 37(9):1904–1916, 2015.
- Houlsby et al. [2019] Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for nlp. In International Conference on Machine Learning, pages 2790–2799. PMLR, 2019.
- Hu et al. [2021a] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021a.
- Hu et al. [2021b] Hezhen Hu, Weichao Zhao, Wengang Zhou, Yuechen Wang, and Houqiang Li. Signbert: Pre-training of hand-model-aware representation for sign language recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11087–11096, 2021b.
- Hu et al. [2021c] Hezhen Hu, Wengang Zhou, and Houqiang Li. Hand-model-aware sign language recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 1558–1566, 2021c.
- Hu et al. [2022] Lianyu Hu, Liqing Gao, Zekang Liu, and Wei Feng. Temporal lift pooling for continuous sign language recognition. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pages 511–527. Springer, 2022.
- Hu et al. [2023] Lianyu Hu, Liqing Gao, Zekang Liu, and Wei Feng. Self-emphasizing network for continuous sign language recognition. In Thirty-seventh AAAI conference on artificial intelligence, 2023.
- Huang et al. [2017] Gao Huang, Danlu Chen, Tianhong Li, Felix Wu, Laurens Van Der Maaten, and Kilian Q Weinberger. Multi-scale dense networks for resource efficient image classification. arXiv preprint arXiv:1703.09844, 2017.
- Huang et al. [2018] Jie Huang, Wengang Zhou, Qilin Zhang, Houqiang Li, and Weiping Li. Video-based sign language recognition without temporal segmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018.
- Jia et al. [2022] Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Belongie, Bharath Hariharan, and Ser-Nam Lim. Visual prompt tuning. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXIII, pages 709–727. Springer, 2022.
- Jiang et al. [2020] Huaizu Jiang, Ishan Misra, Marcus Rohrbach, Erik Learned-Miller, and Xinlei Chen. In defense of grid features for visual question answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10267–10276, 2020.
- Jiao et al. [2023] Peiqi Jiao, Yuecong Min, Yanan Li, Xiaotao Wang, Lei Lei, and Xilin Chen. Cosign: Exploring co-occurrence signals in skeleton-based continuous sign language recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 20676–20686, 2023.
- Ju et al. [2022] Haotian Ju, Dongyue Li, and Hongyang R Zhang. Robust fine-tuning of deep neural networks with hessian-based generalization guarantees. In International Conference on Machine Learning, pages 10431–10461. PMLR, 2022.
- Kim et al. [2021] Wonjae Kim, Bokyung Son, and Ildoo Kim. Vilt: Vision-and-language transformer without convolution or region supervision. In International Conference on Machine Learning, pages 5583–5594. PMLR, 2021.
- Koller et al. [2015] Oscar Koller, Jens Forster, and Hermann Ney. Continuous sign language recognition: Towards large vocabulary statistical recognition systems handling multiple signers. Computer Vision and Image Understanding, 141:108–125, 2015.
- Koller et al. [2016] Oscar Koller, O Zargaran, Hermann Ney, and Richard Bowden. Deep sign: Hybrid cnn-hmm for continuous sign language recognition. In Proceedings of the British Machine Vision Conference 2016, 2016.
- Koller et al. [2017] Oscar Koller, Sepehr Zargaran, and Hermann Ney. Re-sign: Re-aligned end-to-end sequence modelling with deep recurrent cnn-hmms. In CVPR, 2017.
- Koller et al. [2019] Oscar Koller, Necati Cihan Camgoz, Hermann Ney, and Richard Bowden. Weakly supervised learning with multi-stream cnn-lstm-hmms to discover sequential parallelism in sign language videos. PAMI, 42(9):2306–2320, 2019.
- Lester et al. [2021] Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691, 2021.
- Li et al. [2022a] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pages 12888–12900. PMLR, 2022a.
- Li et al. [2023] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023.
- Li et al. [2022b] Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, et al. Grounded language-image pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10965–10975, 2022b.
- Li and Liang [2021] Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190, 2021.
- Lin et al. [2017] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2117–2125, 2017.
- Lin et al. [2022] Ziyi Lin, Shijie Geng, Renrui Zhang, Peng Gao, Gerard de Melo, Xiaogang Wang, Jifeng Dai, Yu Qiao, and Hongsheng Li. Frozen clip models are efficient video learners. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pages 388–404. Springer, 2022.
- Liu et al. [2021] Xiao Liu, Kaixuan Ji, Yicheng Fu, Weng Lam Tam, Zhengxiao Du, Zhilin Yang, and Jie Tang. P-tuning v2: Prompt tuning can be comparable to fine-tuning universally across scales and tasks. arXiv preprint arXiv:2110.07602, 2021.
- Min et al. [2021] Yuecong Min, Aiming Hao, Xiujuan Chai, and Xilin Chen. Visual alignment constraint for continuous sign language recognition. In ICCV, 2021.
- Min et al. [2022] Yuecong Min, Peiqi Jiao, Yanan Li, Xiaotao Wang, Lei Lei, Xiujuan Chai, and Xilin Chen. Deep radial embedding for visual sequence learning. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part VI, pages 240–256. Springer, 2022.
- Nguyen et al. [2020] Duy-Kien Nguyen, Vedanuj Goswami, and Xinlei Chen. Movie: Revisiting modulated convolutions for visual counting and beyond. arXiv preprint arXiv:2004.11883, 2020.
- Niu and Mak [2020] Zhe Niu and Brian Mak. Stochastic fine-grained labeling of multi-state sign glosses for continuous sign language recognition. In ECCV, 2020.
- Ong and Ranganath [2005] Sylvie CW Ong and Surendra Ranganath. Automatic sign language analysis: A survey and the future beyond lexical meaning. IEEE Transactions on Pattern Analysis & Machine Intelligence, 27(06):873–891, 2005.
- Pu et al. [2019] Junfu Pu, Wengang Zhou, and Houqiang Li. Iterative alignment network for continuous sign language recognition. In CVPR, 2019.
- Pu et al. [2020] Junfu Pu, Wengang Zhou, Hezhen Hu, and Houqiang Li. Boosting continuous sign language recognition via cross modality augmentation. In ACM MM, 2020.
- Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
- Sung et al. [2022] Yi-Lin Sung, Jaemin Cho, and Mohit Bansal. Vl-adapter: Parameter-efficient transfer learning for vision-and-language tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5227–5237, 2022.
- Tunga et al. [2021] Anirudh Tunga, Sai Vidyaranya Nuthalapati, and Juan Wachs. Pose-based sign language recognition using gcn and bert. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 31–40, 2021.
- Yang et al. [2019] Zhaoyang Yang, Zhenmei Shi, Xiaoyong Shen, and Yu-Wing Tai. Sf-net: Structured feature network for continuous sign language recognition. arXiv preprint arXiv:1908.01341, 2019.
- Yu et al. [2022] Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, and Yonghui Wu. Coca: Contrastive captioners are image-text foundation models. Trans. Mach. Learn. Res., 2022, 2022.
- Zhang et al. [2021] Renrui Zhang, Rongyao Fang, Wei Zhang, Peng Gao, Kunchang Li, Jifeng Dai, Yu Qiao, and Hongsheng Li. Tip-adapter: Training-free clip-adapter for better vision-language modeling. arXiv preprint arXiv:2111.03930, 2021.
- Zheng et al. [2023] Jiangbin Zheng, Yile Wang, Cheng Tan, Siyuan Li, Ge Wang, Jun Xia, Yidong Chen, and Stan Z Li. Cvt-slr: Contrastive visual-textual transformation for sign language recognition with variational alignment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23141–23150, 2023.
- Zhou et al. [2020] Hao Zhou, Wengang Zhou, Yun Zhou, and Houqiang Li. Spatial-temporal multi-cue network for continuous sign language recognition. In AAAI, 2020.
- Zhou et al. [2021] Hao Zhou, Wengang Zhou, Weizhen Qi, Junfu Pu, and Houqiang Li. Improving sign language translation with monolingual data by sign back-translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1316–1325, 2021.
- Zuo and Mak [2022] Ronglai Zuo and Brian Mak. C2slr: Consistency-enhanced continuous sign language recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5131–5140, 2022.