

Exploiting VLM Localizability and Semantics for
Open Vocabulary Action Detection

Wentao Bao 1∗, Kai Li2, Yuxiao Chen3∗, Deep Patel2, Martin Renqiang Min2, Yu Kong1
1Department of Computer Science and Engineering, Michigan State University
2Machine Learning Department, NEC Laboratories America
3Department of Computer Science, Rutgers University
{baowenta,yukong}@msu.edu, [email protected],
{kaili,dpatel,renqiang}@nec-labs.com
Abstract

Action detection aims to detect (recognize and localize) human actions spatially and temporally in videos. Existing approaches focus on the closed-set setting where an action detector is trained and tested on videos from a fixed set of action categories. However, this constrained setting is not viable in an open world where test videos inevitably contain actions beyond the trained action categories. In this paper, we address the practical yet challenging Open-Vocabulary Action Detection (OVAD) problem. It aims to detect any action in test videos while training a model on a fixed set of action categories. To achieve such an open-vocabulary capability, we propose a novel method OpenMixer that exploits the inherent semantics and localizability of large vision-language models (VLMs) within the family of query-based detection transformers (DETR). Specifically, the OpenMixer is developed by spatial and temporal OpenMixer blocks (S-OMB and T-OMB), and a dynamically fused alignment (DFA) module. The three components collectively enjoy the merits of strong generalization from pre-trained VLMs and end-to-end learning from the DETR design. Moreover, we establish OVAD benchmarks under various settings, and the experimental results show that OpenMixer outperforms baselines in detecting both seen and unseen actions. We release the codes, models, and dataset splits at https://github.com/Cogito2012/OpenMixer.

∗ Most of this work was done during the internship of Wentao Bao at NEC Labs America and finished at Michigan State University.

1 Introduction

Action Detection (AD) aims to recognize actions and to spatially and temporally localize the actors in videos. It plays a vital role in various applications such as video surveillance [83, 85, 8], autonomous driving [61], and sports event analysis [28], and it has thus drawn increasing attention in the recent literature [24, 63, 26, 49, 7, 74, 91, 76, 6].

Existing AD methods are mostly developed in a closed-set setting where the models are trained and tested on videos from the same fixed set of action categories. While significant progress has been made over the past few years [7, 91, 76, 6], the assumption that the training and test videos are from the same action categories limits their application to the real world, where test videos could contain actions beyond the pre-defined training categories. For example, a video surveillance system may be able to detect fighting, but other dangerous or suspicious actions like shooting and chasing will not be detected if the system has not been trained with annotated videos from these action categories. In addition, being able to detect actions in an open world facilitates a comprehensive understanding of videos and opens doors to high-level video understanding tasks, like reasoning [89], forecasting [64], etc., that usually require detecting various actions in videos.

This motivates us to investigate Open-Vocabulary Action Detection (OVAD), a task aiming to detect any actions in videos, including both seen categories contained in the training set and unseen categories absent from it. However, OVAD is challenging as it requires understanding the human motion dynamics across frames. While motion dynamics modeling has been well studied in conventional closed-set action detection [13, 7, 91, 76], which takes advantage of full supervision in training, it is challenging in the open-vocabulary setting since there is no supervision for the unseen action categories.

Recently, harvesting the strong generalization capability of pre-trained large vision-language foundation models (VLMs) [51], various open-vocabulary approaches have been proposed for image recognition [90], object detection [1, 25, 79, 35, 27], and image segmentation [97, 34]. However, these methods are designed for images and do not consider the temporal dynamics among video frames. In addition, image VLMs such as CLIP [51] struggle to capture action verbs in text and human motion in videos [42]. This inevitably requires learning the temporal dynamics [22] or fully fine-tuning [53] to recognize actions on downstream tasks, which risks poor generalizability to the unseen.

There are a few seminal works that leverage VLMs for open-vocabulary video understanding, including action recognition [45, 72, 37, 53] and temporal action localization [54, 43, 82]. However, for region-level action detection with VLMs, there exists a representation gap between video-level pre-training and region-level adaptation, analogous to the representation gap discussed in the image-based open-vocabulary object detection literature [1, 25, 79]. Specific to the OVAD task, the representation gap stems from the mismatch between the holistic video-action alignment in pre-training and the downstream region-level sub-tasks, i.e., region-action alignment and action-relevant person localization. The cause of the representation gap can be attributed to their intrinsically different adaptation goals from pre-trained video VLMs, i.e., transferring the semantics and localizability of VLMs from videos to regions for the two sub-tasks, respectively.

Re-thinking the Transformer-like design of VLMs, we found that the way of using VLM semantic features and the undervalued localizability of VLMs are both critical to the OVAD task. First, to transfer the video-level semantics to each region, we propose to learn a set of region-wise queries to decode the temporal dynamics from videos, by using the pre-trained video-level features as adaptive semantics conditions. The updated queries and video-level features are further dynamically fused and aligned with the textual semantics for recognition. Second, to exploit the video VLM localizability for region-wise localization, we learn a set of queries to decode the person boxes starting from the prior locations revealed by the VLM visual attention.

Specifically, we develop a query-based open-vocabulary action detector, OpenMixer, to detect any video actions in an open vocabulary. It fits in the family of the detection transformers (DETR) [2, 58, 15, 91, 76]. The basic idea is to decouple the action recognition and localization by learning two sets of queries and corresponding decoding modules. Our OpenMixer consists of a spatial OpenMixer Block (S-OMB) for person localization, a temporal OpenMixer Block (T-OMB) for capturing the region-level temporal motion, and a dynamically fused alignment (DFA) for open-vocabulary action recognition. The S-OMB inherits the localizability of VLMs by the text-patch cross-attention, the T-OMB exploits the visual semantic features of VLMs to capture the temporal dynamics, and the DFA dynamically fuses the pre-trained semantics into learnable region-level queries for generalizable recognition. Eventually, our model enjoys the merits of semantics and localizability from VLMs and the end-to-end detection capability from the DETR pipeline.

In experiments, we set up OVAD benchmarks based on popular action detection datasets and evaluate technically viable baselines, showing the superior performance of OpenMixer. In summary, the contributions are three-fold:

  • We formulate the task of open-vocabulary action detection (OVAD), which is valuable yet challenging even for foundation models.

  • We develop the OpenMixer model that exploits the semantics and localizability of pre-trained video-language models toward the OVAD task.

  • We empirically reveal the effectiveness of the proposed modules that show strong generalizability on multiple video action detection datasets.

2 Related Work

Spatio-temporal Action Detection. This task aims to localize human actions spatially and temporally in videos and recognize their action categories, and it has been a fundamental video understanding topic [73, 13, 12, 59]. A line of recent works [63, 26, 66, 49, 5, 92] adopts a two-backbone design to separately extract features of the keyframes and the entire video for actor localization and actor-context relation modeling, respectively. Though they are flexible and achieve promising performance by taking advantage of both image and video backbones, the model parameters are redundant in design and heavy to optimize [6]. With the recent advances in detection transformers (DETR) [2], end-to-end action detection with a single backbone shows impressive performance and has thus become a more popular design choice [16, 7, 91, 76, 6]. The basic idea is to use a single video transformer to extract features of all video frames, and then introduce learnable queries that mix with the video features for actor localization and action recognition. Specifically, WOO [7] follows Sparse RCNN [65] for localization, TubeR [91] learns action tubes following the classical DETR [2], and STMixer [76] follows the AdaMixer [15] design and achieves state-of-the-art performance. The query-based design is advantageous in modeling the interaction between actors and actor context while simplifying the architecture into a single-stage design. However, none of these works can handle unseen actions in an open world. Therefore, we introduce an open-vocabulary action detection method in a query-based design to detect any actions.

Figure 1: Framework (left) and the OpenMixer Block (right). Given a video and an open vocabulary of actions, we use prompted classes and a pre-trained video VLM to obtain all kinds of VLM features. With a stack of cascaded OpenMixer blocks and spatial-temporal queries, the model predicts the action scores, person boxes, and their associated person scores for the OVAD task.

Open-vocabulary Visual Understanding. Thanks to the strong alignment capability of pre-trained vision-language models (VLMs), visual data from unseen classes in an open world can be recognized by aligning the visual feature with the text feature of the class names [75, 51]. This motivates a series of open-vocabulary works in object detection [87, 39, 86, 93, 25, 35, 27, 79], action recognition [69, 46, 36, 22, 53, 71], and temporal action localization (OVTAL) [22, 54, 43, 37]. For localizing unseen objects, the recent image-based open-vocabulary object detectors OV-DETR [86] and CORA [79] share a common spirit with ours by injecting VLM semantics into the learnable queries. However, the query conditions in OV-DETR are class-specific such that they are not adaptive to test-time samples, and the two-stage training in CORA limits its flexibility in video domains. For video understanding, the recent works [22, 54] and STAN [37] are built on the existing image-based CLIP model. Compared to the OVTAL task, however, the OVAD task proposed in this paper is even more challenging as it needs to distinguish between any actions in both spatial regions and temporal segments. In the literature, iCLIP [20] targets zero-shot action detection, which does not consider seen actions in testing. Moreover, it skips learning to localize actions by relying on off-the-shelf person detectors [56] and only learns to recognize unseen actions, which lacks the adaptability to localize action-relevant persons. We noticed a concurrent work [78] on the OVAD problem, but it adopts a two-stage design and relies on extra large-scale region-text pre-training data, without fully exploiting the inherent detection knowledge of video-based VLMs (see Appendix F for more comparative discussions). To the best of our knowledge, OpenMixer is the first query-based OVAD model that can be combined with any video VLM without region-level pre-training.

3 Method

In contrast to the closed-set video action detection [63, 24], open-vocabulary action detection (OVAD) aims to recognize and spatiotemporally localize any human actions in videos, including action categories seen and unseen in training. Concretely, an OVAD model is learned from a training set of $N_{train}$ samples $\{(\mathbf{X},\mathbf{Y})_{i}\,|\,i=1,\dots,N_{train}\}$, where $\mathbf{X}$ denotes the training video and $\mathbf{Y}$ denotes the bounding box annotations on the keyframe, consisting of box coordinates $\mathbf{b}$ and an action category $y$. In training, an action $y$ is drawn from a fixed set of base action categories $\mathcal{C}_{B}$. In testing, the learned action detector should detect “any” actions in a given video from the open vocabulary $\mathcal{C}_{B}\cup\mathcal{C}_{N}$, where $\mathcal{C}_{N}$ contains arbitrary novel action categories.

3.1 OpenMixer

(a) S-OMB  (b) T-OMB  (c) DFA
Figure 2: Spatial and Temporal OMB, and DFA. In Figs. 2(a) and 2(b), the Q-Q and Q-V mixing modules mix information among queries and across query-visual features, respectively. S-OMB is described in Sec. 3.2, where the dashed arrow is only used at the 1st stage. T-OMB is described in Sec. 3.3 and DFA in Sec. 3.4.

In this paper, we propose the OpenMixer to solve the OVAD task. The OpenMixer model is developed within the family of query-based action DEtection TRansformers (DETR) [76, 7, 91]. Basically, DETR-style models treat the action detection task as a set-to-set prediction problem, i.e., learning a sparse set of query features from videos to match with the ground truth boxes and action classes. Specific to the OVAD task, the action classes are predicted from an open vocabulary that contains both the base and novel actions.

The OpenMixer is shown in Fig. 1 (left). Given a video $\mathbf{X}$ and a list of text-prompted action classes as input, we leverage the visual and text encoders $\Psi_{\text{VE}}$ and $\Psi_{\text{TE}}$ of a pre-trained video VLM to obtain all features of the video and action text, i.e., $\mathbf{V},\mathbf{f}_{v},\mathbf{S}=\Psi_{\text{VE}}(\mathbf{X})$ and $\mathbf{f}_{t}=\Psi_{\text{TE}}(y)$. Here, $\mathbf{V}$, $\mathbf{f}_{v}$, and $\mathbf{S}$ are the 4D patch-level video feature, the video-level feature, and the video attention, respectively, and $\mathbf{f}_{t}$ is the text feature of class $y$. Then, we build $M$ cascaded OpenMixer Blocks (OMB) to learn a set of $N$ spatial queries $\mathbf{Q}_{s}$ and $N$ temporal queries $\mathbf{Q}_{t}$ from $(\mathbf{V},\mathbf{S},\mathbf{f}_{v},\mathbf{f}_{t})$ for person detection and action classification, respectively. The OMB takes as input all the VLM features together with $\mathbf{Q}_{s}$ and $\mathbf{Q}_{t}$ to predict person boxes, person scores, and action scores.
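For concreteness, the following is a minimal sketch of how these VLM features could be gathered for one video; `video_encoder`, `text_encoder`, and the returned shapes are assumptions of this sketch rather than the released implementation.

```python
# Sketch of gathering the VLM features used by OpenMixer. `video_encoder` and
# `text_encoder` stand in for the frozen CLIP-ViP visual/text towers; their
# exact interfaces and the returned shapes are assumptions of this sketch.
import torch
import torch.nn.functional as F

@torch.no_grad()
def extract_vlm_features(video_encoder, text_encoder, video, prompts):
    """video: (T, 3, H, W) frames; prompts: list of C prompted class strings."""
    patch_tokens, video_feat = video_encoder(video)   # (T, h*w, D), (D,)
    text_feats = text_encoder(prompts)                # (C, D)

    # L2-normalize so inner products are cosine similarities.
    patch_tokens = F.normalize(patch_tokens, dim=-1)
    video_feat = F.normalize(video_feat, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)

    # Patch-text correlation S for the best-matching class (used in Sec. 3.2).
    best_cls = (text_feats @ video_feat).argmax()
    S = patch_tokens @ text_feats[best_cls]           # (T, h*w)
    return patch_tokens, video_feat, S, text_feats
```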

For the $m$-th OMB, as shown in Fig. 1 (right), it consists of a Temporal OpenMixer Block (T-OMB) $\Psi_{\alpha}$, a Spatial OpenMixer Block (S-OMB) $\Psi_{\theta}$, and a dynamically fused alignment (DFA) module. The S-OMB consists of prior location sampling, query-query (Q-Q) mixing by self-attention [68], and query-video (Q-V) mixing by AdaMixer [15], while the T-OMB sequentially consists of Q-Q mixing, query conditioning, and Q-V mixing (see Figs. 2(a) and 2(b) for reference). The DFA module recursively updates $\mathbf{Q}_{s}$, $\mathbf{Q}_{t}$, and the person boxes from the $(m-1)$-th OMB, and predicts person scores and action scores. These three modules are developed for the OVAD task with the VLM semantics and localizability in mind, and they are introduced in the following sections.

3.2 Localizability Prior for Spatial OMB

A major challenge for one-stage query-based detectors is the slow convergence of localization. One cause is the lack of prior knowledge about object locations. Specific to action detection, recent two-stage action detectors [84, 20, 11, 6] address the location prior with an off-the-shelf person detector and RoIAlign [18] cropping, but the cropped features lack spatiotemporal context and suffer from a representation gap when a pre-trained VLM is introduced. For recent query-based action detectors [7, 91, 76], the prior knowledge of person locations is missing in their design. Therefore, when it comes to the OVAD task with VLMs, a natural question is: can we obtain the prior locations of actors from pre-trained VLMs in a cheap way? Motivated by these considerations, we resort to the visual attention of a pre-trained VLM.

Prior Locations from VLM Attention. Visual attention maps are traditionally represented by the class activation map (CAM) to visually explain recognition models [95, 60]. In the era of ViT [9] and VLM [51], recent works [3, 4, 30] propose to use the multi-head self-attention (MHSA) of the last ViT layer, or the gradient-weighted accumulative product over multi-layer self-attention. However, MHSA is not visually faithful due to the high redundancy of video tokens, and gradient-based methods suffer from a huge computational cost on video VLMs and require ad-hoc implementations for different VLMs. Moreover, due to the lack of token-level video-text correlation, their attention maps are not closely relevant to the action specified by the vocabulary. Therefore, an efficient and structure-agnostic CAM is preferable for large video VLMs, which motivates us to use the patch-text correlation as the VLM attention that encodes the location priors.

Specifically, we have the $D$-dimensional 4D video feature $\mathbf{V}\in\mathbb{R}^{T\times hw\times D}$, where $T$ is the number of frames and $hw$ is the number of visual tokens in each frame, the holistic video feature $\mathbf{f}_{v}\in\mathbb{R}^{D}$, and the text features of $C$ classes $\mathcal{F}_{t}=[\mathbf{f}_{t}^{(1)},\ldots,\mathbf{f}_{t}^{(C)}]^{\top}$. All features are $L_{2}$ normalized. We first get the pre-matched text feature $\mathbf{f}_{t}$ by maximum similarity, $\mathbf{f}_{t}=\arg\max_{\mathbf{f}_{t}\in\mathcal{F}_{t}}\mathbf{f}_{v}^{\top}\otimes\mathbf{f}_{t}$, since we do not have access to the class label in testing. The inner product between $\mathbf{f}_{t}$ and $\mathbf{V}$ then gives the patch-text correlation $\mathbf{S}=\mathbf{V}\otimes\mathbf{f}_{t}$. Furthermore, as discussed in [32, 30], the q-v attention in self-attention layers shows an opposite heatmap where foreground regions are associated with low attention values. In practice, we also observed this issue, so similar to [32], our CAM is determined by the reversed patch-text similarity $\hat{\mathbf{S}}=1-\mathbf{S}$ (see a detailed explanation in Appendix B). By reshaping and spatially interpolating $\hat{\mathbf{S}}$, the attention map is obtained for prior location sampling. We treat $\hat{\mathbf{S}}$ as the prior distribution of person locations indicated by the VLM, and thus the top-$N$ positions are sampled as the initial box centers: $\{(u,v)_{i}\,|\,i=1,\ldots,N\}\sim\hat{\mathbf{S}}(u,v,k)$, where $(u,v)$ are 2D coordinates on the keyframe $k$ and $N$ is the number of queries.
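As an illustration, a minimal sketch of this prior-location sampling is given below; the bilinear upsampling and the deterministic top-$N$ selection (instead of multinomial sampling) are simplifying assumptions of the sketch.

```python
# Sketch of prior-location sampling from the reversed patch-text similarity.
# S: (T, h*w) patch-text correlation of the pre-matched class; the keyframe
# index, frame resolution, and top-N selection are illustrative choices.
import torch
import torch.nn.functional as F

def sample_prior_centers(S, keyframe_idx, grid_hw, frame_hw, num_queries=100):
    h, w = grid_hw
    H, W = frame_hw
    S_hat = 1.0 - S                                  # reversed similarity (CAM)
    cam = S_hat[keyframe_idx].view(1, 1, h, w)       # keyframe attention map
    cam = F.interpolate(cam, size=(H, W), mode="bilinear", align_corners=False)
    # Treat the upsampled map as an (unnormalized) prior over person locations
    # and take the top-N positions as initial box centers.
    top_idx = cam.flatten().topk(num_queries).indices
    v, u = top_idx // W, top_idx % W                 # row (v), column (u)
    return torch.stack([u, v], dim=-1).float()       # (N, 2) pixel coordinates
```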

Spatial OMB. With the sampled prior locations, the S-OMB (see Fig. 2(a)), which consists of Q-Q and Q-V mixing modules, takes as input the video patch features $\mathbf{V}$ and the box predictions $\hat{\mathbf{b}}_{m-1}$ of the previous $(m-1)$-th stage to update the spatial queries by $\hat{\mathbf{Q}}_{s}=\Psi_{\theta_{m}}(\mathbf{V},\mathbf{Q}_{s},\hat{\mathbf{b}}_{m-1})$. The updated spatial queries $\hat{\mathbf{Q}}_{s}$ are used to predict the person scores $\hat{\mathbf{o}}_{m}$ and person box offsets $\Delta\hat{\mathbf{b}}_{m}$ by MLPs. The predicted boxes at stage $m$ are then updated by $\hat{\mathbf{b}}_{m}=\hat{\mathbf{b}}_{m-1}+\Delta\hat{\mathbf{b}}_{m}$, where the initial box queries $\hat{\mathbf{b}}_{0}$ consist of the sampled prior locations and the video spatial range.

The technical intuition behind this design is to encourage the proposed Spatial OMB to learn the box offset $\Delta\mathbf{b}$ starting from the prior locations inherited from the pre-trained VLM. Besides, compared to [76], which uses fixed, non-informative frame centers as prior locations, our VLM attention-based prior locations are adaptive to the test-time video content and vocabulary, which improves not only seen action localization but also generalization to the unseen (see the ZSR+TL section of Tab. 1).
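The cascade of S-OMB stages can be summarized by the sketch below; `s_omb_stages` and its `mix`, `score_head`, and `box_head` members are hypothetical interfaces used only to illustrate the stage-wise offset prediction, not names from the released code.

```python
# Sketch of the stage-wise box refinement. Boxes are (cx, cy, w, h) in
# normalized coordinates; the initial boxes pair the VLM prior centers with
# the full video spatial range, and each stage predicts an offset on top of
# the previous prediction (b_m = b_{m-1} + delta_b_m).
import torch

def refine_boxes(s_omb_stages, V, Q_s, prior_centers):
    N = prior_centers.shape[0]
    boxes = torch.cat([prior_centers, torch.ones(N, 2)], dim=-1)  # b_0
    outputs = []
    for stage in s_omb_stages:                 # M cascaded S-OMB stages
        Q_s = stage.mix(V, Q_s, boxes)         # Q-Q and Q-V mixing
        person_scores = stage.score_head(Q_s)  # per-query person score
        boxes = boxes + stage.box_head(Q_s)    # offset update
        outputs.append((boxes, person_scores))
    return outputs
```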

3.3 Adaptive Semantics for Temporal OMB

For the query-based OVAD models, temporal queries are expected to be discriminative for both base and novel actions. This requires a strong capability of content decoding for the query-video (Q-V) mixing module. The pioneering work DETR [2] uses cross-attention while [15, 76] adopt the MLP-Mixer [67]. However, without VLM semantics, these approaches inevitably overfit the seen class data and are unable to detect the unseen. Recent works [86, 79] rightly address the importance of VLM semantics for the query features, but they lack the adaptability to the test-time visual content due to the class-wise semantic condition in [86] and the region prompting in [79]. These motivate us to propose the Temporal OMB that exploits adaptive semantics from pre-trained VLMs.

Temporal OMB. As depicted in Fig. 2(b), given the temporal queries $\mathbf{Q}_{t}$ and the predicted boxes $\hat{\mathbf{b}}_{m}$ at the current stage $m$, the queries are updated by interacting with the video features $\mathbf{V}$ and $\mathbf{f}_{v}$ through the function $\hat{\mathbf{Q}}_{t}=\Psi_{\alpha_{m}}(\mathbf{V},\mathbf{Q}_{t},\mathbf{f}_{v},\hat{\mathbf{b}}_{m})$. To realize our motivation of using adaptive semantics, we propose the query update:

$$\hat{\mathbf{Q}}_{t}=\Psi_{qv}\!\left(\Psi_{qq}(\mathbf{Q}_{t},\mathbf{b})\oplus\mathbf{f}_{v},\,\mathbf{V},\,\mathbf{b}\right), \qquad (1)$$

where $\Psi_{qq}$ and $\Psi_{qv}$ are the Q-Q and Q-V mixing modules, implemented by self-attention [68] and AdaMixer [15], respectively. Here, $\mathbf{f}_{v}$ is the adaptive semantic condition given by the pre-trained VLM video feature, which is added via broadcasting (denoted as $\oplus$) to the output of Q-Q mixing.

Remark. Note that the adaptiveness of the semantic condition stems from the test-time video feature $\mathbf{f}_{v}$. Alternatively, when the semantic condition $\mathbf{f}_{v}$ is replaced with $\mathbf{f}_{t}$ over $C$ classes, it is equivalent to the approach in [86]. However, we empirically show that this leads to inferior performance (see Tab. 4), especially for seen action detection, which can be attributed to the lack of adaptability to test-time video content. Another alternative is the post-condition that places the condition $\mathbf{f}_{v}$ after the Q-V mixing, i.e., $\hat{\mathbf{Q}}_{t}=\Psi_{qv}\left(\Psi_{qq}(\mathbf{Q}_{t},\mathbf{b}),\mathbf{V},\mathbf{b}\right)\oplus\mathbf{f}_{v}$, so that the module $\Psi_{qv}$ learns the residual of $\mathbf{f}_{v}$. We empirically found that our pre-condition in Eq. 1 is superior to the post-condition, potentially because the important Q-V mixing module is learned from better query features.
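To make the two variants concrete, the sketch below contrasts the pre-condition of Eq. 1 with the post-condition alternative; `qq_mixing` and `qv_mixing` are placeholder callables standing in for the self-attention and AdaMixer modules, not the released implementation.

```python
# Schematic contrast of the two conditioning variants; qq_mixing and
# qv_mixing stand in for the self-attention and AdaMixer modules.

def temporal_omb_pre(qq_mixing, qv_mixing, Q_t, V, f_v, boxes):
    # Eq. (1): broadcast-add the video-level feature f_v to the mixed
    # queries *before* Q-V mixing (our pre-condition).
    Q = qq_mixing(Q_t, boxes) + f_v
    return qv_mixing(Q, V, boxes)

def temporal_omb_post(qq_mixing, qv_mixing, Q_t, V, f_v, boxes):
    # Post-condition alternative: add f_v *after* Q-V mixing, so the mixer
    # effectively learns a residual of f_v (found inferior in Tab. 4).
    Q = qq_mixing(Q_t, boxes)
    return qv_mixing(Q, V, boxes) + f_v
```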

3.4 Dynamically Fused Alignment

To recognize both seen and unseen actions, the model needs to learn discriminative region-wise visual features to align with seen actions, while keeping the generalizable knowledge of the pre-trained VLMs to align with unseen actions. Handling the two goals simultaneously is challenging. A line of recent approaches uses model adaptation by prompt tuning [22, 10, 43, 71, 79, 20], adapters [50, 14], and gradient preserving [72, 98]. However, these methods either struggle to generalize to novel categories or need to back-propagate through the large VLM, which incurs huge computational costs, especially for long videos. Therefore, we resort to a dynamically fused alignment (DFA) for open-vocabulary action recognition, which is lightweight in design and works well for both seen and unseen actions.

Specifically, as shown in Figs. 1 and 2(c), the DFA learns the action classification at each stage $m$, i.e., $\hat{\mathbf{y}}_{m}=\Psi_{\boldsymbol{\lambda}_{m}}(\hat{\mathbf{Q}}_{t},\mathbf{f}_{v},\mathbf{f}_{t})$, where $\hat{\mathbf{y}}_{m}$ are the predicted actions for all queries $\hat{\mathbf{Q}}_{t}$ and $\boldsymbol{\lambda}_{m}$ are the learnable parameters. This module includes dynamic feature fusion and query-text alignment.

Dynamic Feature Fusion. This step fuses the video-level feature $\mathbf{f}_{v}$ into each of the queries $\hat{\mathbf{Q}}_{t}$ dynamically. Specifically, we first repeat $\mathbf{f}_{v}$ $N$ times to obtain $\mathbf{F}_{v}\in\mathbb{R}^{N\times D}$. Then, the fusion between $\mathbf{F}_{v}$ and $\hat{\mathbf{Q}}_{t}$ is achieved by $\tilde{\mathbf{F}}_{v}=\boldsymbol{\lambda}\odot\mathbf{F}_{v}+(1-\boldsymbol{\lambda})\odot\hat{\mathbf{Q}}_{t}$, where $\boldsymbol{\lambda}\in\mathbb{R}^{N\times 1}$ is learnable in training. The intuition behind the query-specific learnable $\boldsymbol{\lambda}$ is that it allows the video-level knowledge from $\mathbf{f}_{v}$ to contribute dynamically to the different learnable queries during set-matching training.
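A minimal PyTorch sketch of this fusion step is shown below; the 0.5 initialization and the clamping of $\boldsymbol{\lambda}$ to $[0,1]$ are assumptions of the sketch, not details from the paper.

```python
# Minimal sketch of the dynamic feature fusion. The 0.5 initialization and the
# clamping of lambda to [0, 1] are assumptions of this sketch.
import torch
import torch.nn as nn

class DynamicFusion(nn.Module):
    def __init__(self, num_queries: int):
        super().__init__()
        # One learnable fusion weight per query.
        self.lam = nn.Parameter(torch.full((num_queries, 1), 0.5))

    def forward(self, f_v: torch.Tensor, Q_t: torch.Tensor) -> torch.Tensor:
        """f_v: (D,) video-level VLM feature; Q_t: (N, D) temporal queries."""
        F_v = f_v.unsqueeze(0).expand_as(Q_t)     # repeat f_v for all N queries
        lam = self.lam.clamp(0.0, 1.0)
        return lam * F_v + (1.0 - lam) * Q_t      # weighted fusion per query
```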

Settings | Models | J-HMDB (Mean / Base / Novel) | UCF101-24 (Mean / Base / Novel)
ZSR+ZSL | Region + GPT | 31.86 / 30.06 / 33.51 | 19.92 / 21.54 / 18.29
ZSR+ZSL | Video + HC | 54.40 / 49.89 / 58.51 | 31.04 / 31.43 / 30.64
ZSR+ZSL | Video + GPT | 66.73 / 64.61 / 68.66 | 35.01 / 34.59 / 35.43
ZSR+TL | STMixer [76] | 63.53 / 58.27 / 68.31 | 36.66 / 45.26 / 28.07
ZSR+TL | Spatial OpenMixer | 74.06 / 68.04 / 79.53 | 40.32 / 48.80 / 31.85
E2E | STMixer [76] | 49.16 / 73.06 / 27.44 | 33.72 / 60.91 / 6.54
E2E | STMixer [76] (w. CoOp [96]) | 42.27 / 75.66 / 11.91 | 36.12 / 60.42 / 11.81
E2E | OpenMixer (w. CoOp [96]) | 86.86 / 94.18 / 80.20 | 45.11 / 62.48 / 27.75
E2E | OpenMixer | 86.34 / 90.75 / 82.33 | 47.71 / 61.18 / 34.23
Table 1: OVAD results. Note that all methods, including STMixer [76], use the same pre-trained CLIP-ViP [80] as the frozen VLM and are evaluated by video mAP. For the ZSR+ZSL setting, we use an off-the-shelf Mask RCNN [18] as the ZSL person localizer and use either handcrafted (HC) or GPT-generated (GPT) prompts for either video- or region-level zero-shot recognition.

Query-Text Alignment. To make the classification decision from $\tilde{\mathbf{F}}_{v}$ and the open vocabulary of actions, we leverage GPT-4 [47] to generate multiple visually descriptive action prompts for each category (see the prompts in Appendix A). With the VLM text encoder, the aggregated text features of the $C$ classes are represented as $\mathbf{F}_{t}\in\mathbb{R}^{C\times D}$, where $C$ is the number of classes. Eventually, we use the softmax of the visual-text cosine similarity to represent the multi-class classification probability: $P(\hat{y}|\hat{\mathbf{Q}})=\text{softmax}(\tilde{\mathbf{F}}_{v}\otimes\mathbf{F}_{t}^{\top}/\tau)$, where $\tau$ is the VLM temperature. In testing, the open-vocabulary action recognition for all queries is achieved by finding the maximum visual-text cosine similarity: $\hat{y}=\arg\max_{y\in\mathcal{C}}(\tilde{\mathbf{F}}_{v}\otimes\mathbf{F}_{t}^{\top})$.
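The alignment step can be sketched as follows, assuming the per-class GPT prompt features are aggregated by simple averaging and using a CLIP-style temperature; both choices are illustrative assumptions rather than the paper's exact recipe.

```python
# Sketch of query-text alignment with a temperature-scaled cosine softmax.
# Averaging the GPT prompt features per class and tau=0.01 are assumptions.
import torch
import torch.nn.functional as F

def align_queries_to_text(fused_queries, prompt_feats, tau=0.01):
    """fused_queries: (N, D); prompt_feats: (C, P, D) text features of P
    prompts per class; returns per-query class probabilities and predictions."""
    F_t = F.normalize(prompt_feats.mean(dim=1), dim=-1)  # aggregate per class
    F_q = F.normalize(fused_queries, dim=-1)
    logits = F_q @ F_t.t() / tau                         # cosine / temperature
    probs = logits.softmax(dim=-1)                       # (N, C)
    preds = probs.argmax(dim=-1)                         # open-vocabulary labels
    return probs, preds
```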

Note that we do not include the spatial queries $\mathbf{Q}_{s}$ as input to our DFA module. This decouples the T-OMB $\Psi_{\alpha}$ and the S-OMB $\Psi_{\theta}$ in training such that the person localization is class-agnostic, which is essential for open-vocabulary tasks according to [43, 79].

3.5 Training and Inference

In training, for action localization, we adopt the regular set matching loss following the DETR literature [2, 58, 15]: $\mathcal{L}_{set}=\mathcal{L}_{bce}+\mathcal{L}_{L_{1}}+\mathcal{L}_{giou}$, where $\mathcal{L}_{bce}$ is a binary cross-entropy loss for person score prediction, and $\mathcal{L}_{L_{1}}$ and $\mathcal{L}_{giou}$ are the coordinate distance and GIoU distance [57] between predicted and ground-truth boxes, respectively. We use Hungarian matching [2] to find the optimal bipartite matching between predicted and ground-truth boxes for each video. For action recognition, we use a multi-class cross-entropy loss $\mathcal{L}_{act}$, so that the total training loss is $\mathcal{L}_{total}=w_{1}\mathcal{L}_{set}+w_{2}\mathcal{L}_{act}$, where the hyperparameters $w_{1}$ and $w_{2}$ balance the two subtasks.
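A simplified single-video sketch of this objective is given below; the matching cost and the use of `scipy` Hungarian matching with `torchvision` GIoU follow common DETR practice and are not taken from the released code.

```python
# Simplified single-video sketch of the training objective: Hungarian matching
# between predictions and ground truth, then BCE + L1 + GIoU for localization
# and cross-entropy for action recognition. Cost weights are illustrative.
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment
from torchvision.ops import generalized_box_iou

def detection_loss(pred_boxes, person_logits, action_logits,
                   gt_boxes, gt_labels, w1=2.0, w2=48.0):
    """pred_boxes: (N, 4) xyxy; person_logits: (N,); action_logits: (N, C);
    gt_boxes: (G, 4) xyxy; gt_labels: (G,) action class indices."""
    # Pairwise matching cost: L1 distance plus GIoU distance.
    cost = torch.cdist(pred_boxes, gt_boxes, p=1) \
        + (1.0 - generalized_box_iou(pred_boxes, gt_boxes))
    row, col = linear_sum_assignment(cost.detach().cpu().numpy())
    row, col = torch.as_tensor(row), torch.as_tensor(col)

    # Person scores: matched queries are positives, all others negatives.
    person_tgt = torch.zeros_like(person_logits)
    person_tgt[row] = 1.0
    l_bce = F.binary_cross_entropy_with_logits(person_logits, person_tgt)

    # Box regression on matched pairs.
    l_l1 = F.l1_loss(pred_boxes[row], gt_boxes[col])
    l_giou = (1.0 - generalized_box_iou(pred_boxes[row], gt_boxes[col]).diag()).mean()

    # Action recognition on matched queries.
    l_act = F.cross_entropy(action_logits[row], gt_labels[col])
    return w1 * (l_bce + l_l1 + l_giou) + w2 * l_act
```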

During inference, person boxes are kept by thresholding the person scores, while the action scores assign action categories to the kept boxes from the input class vocabulary.

4 Experiments

Datasets. Our method is evaluated on two commonly used action detection datasets, J-HMDB [21] and UCF101-24 [62]. The J-HMDB dataset contains per-frame annotated bounding boxes of persons for 21 action classes. Similar to [22, 20], with 50% of the actions as novel classes, we randomly split it into 10 base classes for training and 11 novel classes for testing, resulting in 10,570 training samples and 9,139 testing samples. The UCF101-24 dataset is a subset of UCF101 [62]; it is also per-frame annotated for action detection and contains 24 action classes. With the same 50% splitting strategy, we split it into 12 base classes for training and 12 novel classes for testing. Similar to [22, 43], we also report results on other random splits with both 50%-50% and 75%-25% ratios in Appendix D. We will release all data splits.
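For reproducibility, a base/novel class split can be generated with a fixed random seed, e.g., as in the following sketch (class names and the output path are illustrative, not from the released splits).

```python
# Sketch of generating a reproducible base/novel class split; class names and
# the output path are illustrative.
import json
import random

def make_open_vocab_split(class_names, base_ratio=0.5, seed=42,
                          out_path="ovad_split.json"):
    rng = random.Random(seed)
    classes = sorted(class_names)
    rng.shuffle(classes)
    n_base = round(len(classes) * base_ratio)
    split = {"base": sorted(classes[:n_base]),
             "novel": sorted(classes[n_base:])}
    with open(out_path, "w") as f:
        json.dump(split, f, indent=2)
    return split

# e.g., applying this to the 21 J-HMDB classes with base_ratio=0.5 yields a
# 10-base / 11-novel split as in the paper.
```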

Evaluation criteria. Following the standard protocol in the action detection literature [26, 11, 7, 91, 76], model performance is evaluated by video mAP, which evaluates the spatiotemporal action tubes formed by the detected bounding boxes in terms of classification and 3D intersection-over-union (IoU). Following [11], the 3D IoU threshold is set to 0.5 for J-HMDB and 0.2 for UCF101-24.
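For reference, one common definition of the spatio-temporal tube IoU, per-frame box IoU summed over the temporal intersection and divided by the size of the temporal union, can be computed as in the sketch below; the exact evaluation protocol is that of the released code.

```python
# One common definition of spatio-temporal tube IoU: per-frame box IoU summed
# over the temporal intersection, divided by the size of the temporal union.
import numpy as np

def box_iou(a, b):
    """a, b: boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda box: max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def tube_iou(tube_a, tube_b):
    """tube_*: dict mapping frame index -> box. Returns the 3D (tube) IoU."""
    frames_a, frames_b = set(tube_a), set(tube_b)
    t_union = frames_a | frames_b
    if not t_union:
        return 0.0
    overlap = sum(box_iou(np.asarray(tube_a[t]), np.asarray(tube_b[t]))
                  for t in frames_a & frames_b)
    return overlap / len(t_union)
```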

Implementation details. We experiment with two VLMs: the image pre-trained OpenAI CLIP [51] and the video pre-trained CLIP-ViP [80]. We use the same ViT-B/16 architecture for both VLMs, which are kept frozen during training. For the image CLIP, we obtain video-level semantic features by temporal mean pooling. We take the patch token features of the last ViT layer and use them to construct the 4D pyramid feature $\mathbf{V}$ by multi-scale residual convolutions. By default, we set the number of queries to 100 and the number of OMB stages to 3. In training, we set the mini-batch size to 16 and use a $16\times 1$ frame sampling. The weights of the set loss $\mathcal{L}_{set}$ and action loss $\mathcal{L}_{act}$ are set to 2.0 and 48.0, respectively. Following [2, 7, 76], each intermediate stage is individually supervised by $\mathcal{L}_{set}$ and $\mathcal{L}_{act}$. We set the base learning rate to 1e-5 and use the AdamW [41] optimizer to train models for 12 epochs on 4 RTX 6000 Ada or 2 A100 (80G) GPUs. In testing, the person detection threshold is set to 0.6. We test the base and novel classes individually and report their video mAP results and the mean over all categories. In Appendix D, we also report generalized zero-shot testing by providing the complete list of base and novel classes. Our model runs at 0.23 s/video on a single A6000 GPU, with 587M parameters based on the CLIP-ViP/B16 VLM. Other details are in Appendix C.
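As a small illustration of the image-CLIP baseline, the video-level feature can be obtained by temporal mean pooling of per-frame embeddings, as sketched below; `clip_image_encoder` is a placeholder for the frozen CLIP image tower.

```python
# Sketch of obtaining a video-level feature from the image-CLIP baseline by
# temporal mean pooling; `clip_image_encoder` is a placeholder for the frozen
# CLIP image tower returning per-frame embeddings.
import torch
import torch.nn.functional as F

@torch.no_grad()
def video_feature_from_image_clip(clip_image_encoder, frames):
    """frames: (T, 3, H, W) preprocessed frames; returns (D,) video feature."""
    frame_feats = clip_image_encoder(frames)              # (T, D)
    frame_feats = F.normalize(frame_feats, dim=-1)
    return F.normalize(frame_feats.mean(dim=0), dim=0)    # temporal mean pooling
```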

OVAD task settings. To benchmark methods on the OVAD task, we present three settings, depending on whether the localization and classification modules are trained or not.

  • ZSR+ZSL (zero-shot action recognition and actor localization): without any training, we only use pre-trained person detectors such as Mask RCNN [18] to detect persons, and use pre-trained video VLMs such as CLIP-ViP [80] for open-vocabulary recognition.

  • ZSR+TL (zero-shot action recognition and trainable actor localization): we use pre-trained CLIP-ViP [80] to perform video-level action recognition while training the localization modules to detect persons.

  • E2E (end-to-end learning): we train and test models in an end-to-end way by using raw videos and the vocabulary as input. In this setting, we compare against STMixer [76] using the same CLIP-ViP [80] backbone and investigate soft prompt tuning with CoOp [96].

4.1 Comparative Results

The main results are reported in Tab. 1. To analyze the baseline performance, we summarize the discussion below.

Zero-shot recognition and localization. In the ZSR+ZSL setting, the findings are as follows. First, region-level features (the 1st row) obtained by RoIAlign [18] perform significantly worse than video-level features (the 3rd row). This indicates that the RoI-cropped features from the VLM suffer from a large representation gap between video-level pre-training and downstream region-level recognition. Second, the descriptive GPT-generated prompts (the 3rd row) achieve better performance than the handcrafted (HC) prompt such as “a video of person [CLS]” (the 2nd row). This can be explained by the more transferable knowledge in the GPT prompts compared to the handcrafted ones.

Zero-shot recognition with learnable localization. Under the ZSR+TL setting, we observe a significant advantage of the Spatial OpenMixer over the STMixer baseline, with more than 10% performance gain on the J-HMDB dataset. Since training only encourages localizing actors in videos, this outperformance suggests a good exploitation of the localizability in pre-trained VLMs.

Models | f-mAP | v-mAP
iCLIP [20] | 65.41 | –
OpenMixer | 77.06 | 81.20
Table 2: Zero-shot action detection. Following the same 75%-25% split as [20], we report both the frame- and video-level mAP (f-mAP and v-mAP) on the novel classes of J-HMDB.
S-OMB | DFA | T-OMB | Mean | Base | Novel
– | ✓ | ✓ | 81.77 | 86.32 | 77.64
✓ | – | – | 74.06 | 68.04 | 79.53
✓ | ✓ | – | 83.47 | 86.01 | 81.18
✓ | ✓ | ✓ | 86.34 | 90.75 | 82.33
Table 3: Ablation study of each proposed component.

End-to-end learnable OVAD. For the E2E setting, the OpenMixer (the last row) outperforms the simple STMixer baseline (STMixer+VLM) by large margins, with $7.69\%$ and $54.89\%$ video mAP gains on the base and novel categories of the J-HMDB dataset, respectively. Besides, we explored the widely used VLM adaptation method CoOp [96], which optimizes the context of class names, i.e., prompt tuning. From Tab. 1, we observe that CoOp improves the base class performance at the cost of the novel classes, while the GPT-prompted OpenMixer achieves much better performance on the novel classes. Lastly, we notice relatively smaller numbers on UCF101-24 than on J-HMDB. This reflects the challenging aspects of the UCF101-24 dataset, such as the long duration ($\sim 10\times$ longer), heavy background bias, and multi-person scenarios.

Zero-shot action detection. We note that iCLIP [20] defines the zero-shot action detection (ZSAD) task, which is different from our OVAD task: ZSAD only cares about samples from novel classes, while OVAD values both the base and novel classes. Therefore, ZSAD uses all samples from the base classes in training and only tests on the novel classes. Following the same setting as iCLIP, the results in Tab. 2 show that our method achieves much better performance than iCLIP, even though iCLIP relies on pre-detected person boxes from YOWO [26].

Methods | Queries | Modalities | Mean | Base | Novel
w/o condition | – | – | 83.99 | 85.86 | 82.28
post | TQ | video $\mathbf{f}_{v}$ | 85.48 | 88.74 | 82.52
pre | TQ, SQ | video $\mathbf{f}_{v}$ | 85.66 | 90.29 | 81.45
pre | TQ | text $\mathbf{f}_{t}$ | 76.36 | 70.25 | 81.92
pre (Ours) | TQ | video $\mathbf{f}_{v}$ | 86.34 | 90.75 | 82.33
Table 4: Results of query conditions. The post/pre and TQ/SQ indicate that the conditional feature (from the video $\mathbf{f}_{v}$ or from the text $\mathbf{f}_{t}$) is placed after/before the Q-V mixing on the temporal queries (TQ) or spatial queries (SQ).
Methods | Mean | Base | Novel
w/o $\mathbf{F}_{v}$ ($\boldsymbol{\lambda}=0$) | 68.84 | 88.94 | 50.58
w/o $\hat{\mathbf{Q}}_{t}$ ($\boldsymbol{\lambda}=1$) | 74.06 | 68.04 | 79.53
w/o dynamics ($\boldsymbol{\lambda}=0.5$) | 51.48 | 63.08 | 40.93
concat $[\hat{\mathbf{Q}}_{s};\hat{\mathbf{Q}}_{t}]$ & MLP | 85.51 | 89.19 | 82.17
Ours | 86.34 | 90.75 | 82.33
Table 5: Results of fusion strategies. We explored different strategies to fuse the pre-trained $\mathbf{F}_{v}$ and the learnable queries ($\hat{\mathbf{Q}}_{t}$ or $\hat{\mathbf{Q}}_{s}$) within our DFA module.

4.2 Ablation Study

Figure 3: Hyperparameters. (a) Number of queries; (b) number of stages. We show the video mAP with respect to different numbers of learnable queries and OMB stages.
Figure 4: Unseen Action Detection. We visualize our OpenMixer detections (in blue) and the ground truth (in yellow) on two representative videos from the novel classes (“kick ball” and “pullup”), at frames sampled at $t/T=1/6,2/6,\ldots,1$. The numbers after class names are confidence scores. More visualizations are in Appendix E.

In this section, we analyze the properties of the OpenMixer model on the J-HMDB dataset. Results of the component-wise ablation are reported in Tab. 3, showing that all three components work well together. Specifically, without S-OMB, i.e., when the attentional location prior is removed, the performance drops significantly, especially for the novel classes. If the DFA is removed, we only use the pre-trained VLM feature for zero-shot recognition, and the base class performance becomes the worst. Without T-OMB, i.e., when the semantic condition is removed and both spatial and temporal queries are used for recognition, we observe a decrease of $4.74\%$ and $1.15\%$ on base and novel actions, respectively.

Query condition strategies. Specific to the T-OMB, we further investigate different query condition strategies in Tab. 4, with the following observations: (1) Without any condition, the model performs worse on the base classes, with a $4.89\%$ mAP drop. (2) The pre-condition performs much better than the post-condition on the base classes ($+2\%$), with a negligible performance drop on the novel classes ($-0.19\%$). This can be explained by the pre-condition alleviating the difficulty of content decoding for the subsequent Q-V mixing module. (3) An additional condition on the spatial queries (SQ) hurts performance on both the base and novel classes, because it essentially entangles recognition and localization in training. (4) When using the text feature $\mathbf{f}_{t}$ as the condition, the base class performance decreases significantly ($-20.50\%$) and the novel class performance also decreases slightly. This is due to the large semantic gap between the text feature $\mathbf{f}_{t}$ and the patch-wise video token features $\mathbf{V}$, suggesting that the test-time adaptive $\mathbf{f}_{v}$ is preferable even though $\mathbf{f}_{v}$ and $\mathbf{f}_{t}$ are semantically aligned.

Feature fusion strategies. To validate the design choice of our DFA module, we explored different feature fusion strategies, as shown in Tab. 5. The results show that only using the learned query feature $\hat{\mathbf{Q}}_{t}$ (the 1st row) performs much worse on the novel classes, indicating a loss of generalization. If only using the pre-trained feature $\mathbf{F}_{v}$ (the 2nd row), the model cannot work well on the base classes, which indicates under-fitting to the task. If fusing the features by simple averaging, the performance still lags behind ours, as it is not adaptive to the variety of queries. Moreover, we notice that [76] uses both spatial and temporal queries for recognition through MLP layers. Thus, we additionally include the spatial queries $\hat{\mathbf{Q}}_{s}$ by concatenating them with the temporal queries $\hat{\mathbf{Q}}_{t}$ and use MLP layers for dimension reduction. We observe a performance drop, which can be explained by the MLP layers eliminating the benefits of the semantic conditions and entangling localization and recognition in training.

Number of queries and OMB stages. In Fig. 3(a) and 3(b), we show that using 100 queries and 3 OMB stages achieves the best average mAP. The figures also indicate that the number of OMB stages is more important than the number of queries, as the bipartite matching could handle the redundant queries in training. The decreasing trend with more than three OMB stages can be attributed to the risk of overfitting to training data. More interesting results and discussions can be found in Appendix D.

Qualitative results. We visualize results on the J-HMDB novel categories in Fig. 4. They show that OpenMixer could precisely localize and confidently recognize those unseen actions, even though there are multiple persons. More visualizations are in Appendix E.

Limitations and Future Work. The recent large-scale action detection dataset AVA [17] is not included in this paper, as we emphasize the adaptation of existing pre-trained VLMs to downstream small datasets. In the future, similar to the concurrent work [78], we will explore how to effectively pre-train on AVA to benefit a broader range of applications.

5 Conclusion

We present an open-vocabulary action detection method OpenMixer to detect any human actions in videos. It is a query-based detection transformer that fully exploits the semantics and localizability of pre-trained VLMs. Furthermore, we build OVAD benchmarks that extensively evaluate baselines and our model under various settings, showing the superiority of the OpenMixer.

Acknowledgement. Wentao Bao and Yu Kong are partially supported by the Army Research Office (ARO) grant W911NF-24-1-0385 and Office of Naval Research (ONR) grant N00014-23-1-2046. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of ARO nor ONR.

References

  • [1] Hanoona Bangalath, Muhammad Maaz, Muhammad Uzair Khattak, Salman H Khan, and Fahad Shahbaz Khan. Bridging the gap between object and image-level representations for open-vocabulary detection. In NeurIPS, pages 33781–33794, 2022.
  • [2] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In ECCV, pages 213–229, 2020.
  • [3] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In ICCV, pages 9650–9660, 2021.
  • [4] Hila Chefer, Shir Gur, and Lior Wolf. Generic attention-model explainability for interpreting bi-modal and encoder-decoder transformers. In ICCV, pages 397–406, 2021.
  • [5] Lei Chen, Zhan Tong, Yibing Song, Gangshan Wu, and Limin Wang. Cycleacr: Cycle modeling of actor-context relations for video action detection. arXiv preprint arXiv:2303.16118, 2023.
  • [6] Lei Chen, Zhan Tong, Yibing Song, Gangshan Wu, and Limin Wang. Efficient video action detection with token dropout and context refinement. In ICCV, 2023.
  • [7] Shoufa Chen, Peize Sun, Enze Xie, Chongjian Ge, Jiannan Wu, Lan Ma, Jiajun Shen, and Ping Luo. Watch only once: An end-to-end video action detection framework. In ICCV, pages 8178–8187, 2021.
  • [8] Ishan Dave, Zacchaeus Scheffer, Akash Kumar, Sarah Shiraz, Yogesh Singh Rawat, and Mubarak Shah. Gabriellav2: Towards better generalization in surveillance videos for action detection. In WACV, pages 122–132, 2022.
  • [9] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
  • [10] Yu Du, Fangyun Wei, Zihe Zhang, Miaojing Shi, Yue Gao, and Guoqi Li. Learning to prompt for open-vocabulary object detection with vision-language model. In CVPR, pages 14084–14093, 2022.
  • [11] Gueter Josmy Faure, Min-Hung Chen, and Shang-Hong Lai. Holistic interaction transformer network for action detection. In WACV, pages 3340–3350, 2023.
  • [12] Christoph Feichtenhofer. X3d: Expanding architectures for efficient video recognition. In CVPR, pages 203–213, 2020.
  • [13] Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. Slowfast networks for video recognition. In ICCV, pages 6202–6211, 2019.
  • [14] Peng Gao, Shijie Geng, Renrui Zhang, Teli Ma, Rongyao Fang, Yongfeng Zhang, Hongsheng Li, and Yu Qiao. Clip-adapter: Better vision-language models with feature adapters. International Journal of Computer Vision, pages 1–15, 2023.
  • [15] Ziteng Gao, Limin Wang, Bing Han, and Sheng Guo. Adamixer: A fast-converging query-based object detector. In CVPR, pages 5364–5373, 2022.
  • [16] Rohit Girdhar, Joao Carreira, Carl Doersch, and Andrew Zisserman. Video action transformer network. In CVPR, June 2019.
  • [17] Chunhui Gu, Chen Sun, David A Ross, Carl Vondrick, Caroline Pantofaru, Yeqing Li, Sudheendra Vijayanarasimhan, George Toderici, Susanna Ricco, Rahul Sukthankar, et al. Ava: A video dataset of spatio-temporally localized atomic visual actions. In CVPR, 2018.
  • [18] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In ICCV, pages 2961–2969, 2017.
  • [19] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In ICCV, pages 2961–2969, 2017.
  • [20] Wei-Jhe Huang, Jheng-Hsien Yeh, Min-Hung Chen, Gueter Josmy Faure, and Shang-Hong Lai. Interaction-aware prompting for zero-shot spatio-temporal action detection. In ICCV Workshop, pages 284–293, 2023.
  • [21] Hueihan Jhuang, Juergen Gall, Silvia Zuffi, Cordelia Schmid, and Michael J Black. Towards understanding action recognition. In ICCV, pages 3192–3199, 2013.
  • [22] Chen Ju, Tengda Han, Kunhao Zheng, Ya Zhang, and Weidi Xie. Prompting visual-language models for efficient video understanding. In ECCV, pages 105–124, 2022.
  • [23] Chen Ju, Tengda Han, Kunhao Zheng, Ya Zhang, and Weidi Xie. Prompting visual-language models for efficient video understanding. In ECCV, pages 105–124, 2022.
  • [24] Vicky Kalogeiton, Philippe Weinzaepfel, Vittorio Ferrari, and Cordelia Schmid. Action tubelet detector for spatio-temporal action localization. In ICCV, 2017.
  • [25] Dahun Kim, Anelia Angelova, and Weicheng Kuo. Region-aware pretraining for open-vocabulary object detection with vision transformers. In CVPR, pages 11144–11154, 2023.
  • [26] Okan Köpüklü, Xiangyu Wei, and Gerhard Rigoll. You only watch once: A unified cnn architecture for real-time spatiotemporal action localization. arXiv preprint arXiv:1911.06644, 2019.
  • [27] Weicheng Kuo, Yin Cui, Xiuye Gu, AJ Piergiovanni, and Anelia Angelova. F-vlm: Open-vocabulary object detection upon frozen vision and language models. In ICLR, 2022.
  • [28] Yixuan Li, Lei Chen, Runyu He, Zhenzhi Wang, Gangshan Wu, and Limin Wang. Multisports: A multi-person video dataset of spatio-temporally localized sports actions. In ICCV, pages 13536–13545, 2021.
  • [29] Yanghao Li, Hanzi Mao, Ross Girshick, and Kaiming He. Exploring plain vision transformer backbones for object detection. In ECCV, pages 280–296, 2022.
  • [30] Yi Li, Hualiang Wang, Yiqun Duan, and Xiaomeng Li. Clip surgery for better explainability with enhancement in open-vocabulary tasks. arXiv preprint arXiv:2304.05653, 2023.
  • [31] Yi Li, Hualiang Wang, Yiqun Duan, and Xiaomeng Li. Clip surgery for better explainability with enhancement in open-vocabulary tasks. arXiv preprint arXiv:2304.05653, 2023.
  • [32] Yi Li, Hualiang Wang, Yiqun Duan, Hang Xu, and Xiaomeng Li. Exploring visual interpretability for contrastive language-image pre-training. arXiv preprint arXiv:2209.07046, 2022.
  • [33] Yi Li, Hualiang Wang, Yiqun Duan, Hang Xu, and Xiaomeng Li. Exploring visual interpretability for contrastive language-image pre-training. arXiv preprint arXiv:2209.07046, 2022.
  • [34] Feng Liang, Bichen Wu, Xiaoliang Dai, Kunpeng Li, Yinan Zhao, Hang Zhang, Peizhao Zhang, Peter Vajda, and Diana Marculescu. Open-vocabulary semantic segmentation with mask-adapted clip. In CVPR, pages 7061–7070, 2023.
  • [35] Chuang Lin, Peize Sun, Yi Jiang, Ping Luo, Lizhen Qu, Gholamreza Haffari, Zehuan Yuan, and Jianfei Cai. Learning object-language alignments for open-vocabulary object detection. In ICLR, 2022.
  • [36] Ziyi Lin, Shijie Geng, Renrui Zhang, Peng Gao, Gerard de Melo, Xiaogang Wang, Jifeng Dai, Yu Qiao, and Hongsheng Li. Frozen clip models are efficient video learners. In ECCV, pages 388–404, 2022.
  • [37] Ruyang Liu, Jingjia Huang, Ge Li, Jiashi Feng, Xinglong Wu, and Thomas H Li. Revisiting temporal modeling for clip-based image-to-video knowledge transferring. In CVPR, pages 6555–6564, 2023.
  • [38] Ruyang Liu, Jingjia Huang, Ge Li, Jiashi Feng, Xinglong Wu, and Thomas H Li. Revisiting temporal modeling for clip-based image-to-video knowledge transferring. In CVPR, pages 6555–6564, 2023.
  • [39] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499, 2023.
  • [40] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499, 2023.
  • [41] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2018.
  • [42] Liliane Momeni, Mathilde Caron, Arsha Nagrani, Andrew Zisserman, and Cordelia Schmid. Verbs in action: Improving verb understanding in video-language models. In ICCV, pages 15579–15591, 2023.
  • [43] Sauradip Nag, Xiatian Zhu, Yi-Zhe Song, and Tao Xiang. Zero-shot temporal action detection via vision-language prompting. In ECCV, pages 681–697, 2022.
  • [44] Sauradip Nag, Xiatian Zhu, Yi-Zhe Song, and Tao Xiang. Zero-shot temporal action detection via vision-language prompting. In ECCV, pages 681–697, 2022.
  • [45] Bolin Ni, Houwen Peng, Minghao Chen, Songyang Zhang, Gaofeng Meng, Jianlong Fu, Shiming Xiang, and Haibin Ling. Expanding language-image pretrained models for general video recognition. In ECCV, pages 1–18, 2022.
  • [46] Bolin Ni, Houwen Peng, Minghao Chen, Songyang Zhang, Gaofeng Meng, Jianlong Fu, Shiming Xiang, and Haibin Ling. Expanding language-image pretrained models for general video recognition. In ECCV, pages 1–18, 2022.
  • [47] OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  • [48] OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  • [49] Junting Pan, Siyu Chen, Mike Zheng Shou, Yu Liu, Jing Shao, and Hongsheng Li. Actor-context-actor relation network for spatio-temporal action localization. In CVPR, pages 464–474, 2021.
  • [50] Junting Pan, Ziyi Lin, Xiatian Zhu, Jing Shao, and Hongsheng Li. St-adapter: Parameter-efficient image-to-video transfer learning. In NeurIPS, pages 26462–26477, 2022.
  • [51] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, pages 8748–8763, 2021.
  • [52] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, pages 8748–8763, 2021.
  • [53] Hanoona Rasheed, Muhammad Uzair Khattak, Muhammad Maaz, Salman Khan, and Fahad Shahbaz Khan. Fine-tuned clip models are efficient video learners. In CVPR, pages 6545–6554, 2023.
  • [54] Vivek Rathod, Bryan Seybold, Sudheendra Vijayanarasimhan, Austin Myers, Xiuye Gu, Vighnesh Birodkar, and David A Ross. Open-vocabulary temporal action detection with off-the-shelf image-text features. In BMVC, 2022.
  • [55] Vivek Rathod, Bryan Seybold, Sudheendra Vijayanarasimhan, Austin Myers, Xiuye Gu, Vighnesh Birodkar, and David A Ross. Open-vocabulary temporal action detection with off-the-shelf image-text features. In BMVC, 2022.
  • [56] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In NeurIPS, 2015.
  • [57] Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, and Silvio Savarese. Generalized intersection over union: A metric and a loss for bounding box regression. In CVPR, pages 658–666, 2019.
  • [58] Byungseok Roh, JaeWoong Shin, Wuhyun Shin, and Saehoon Kim. Sparse DETR: Efficient end-to-end object detection with learnable sparsity. In ICLR, 2022.
  • [59] Chaitanya Ryali, Yuan-Ting Hu, Daniel Bolya, Chen Wei, Haoqi Fan, Po-Yao Huang, Vaibhav Aggarwal, Arkabandhu Chowdhury, Omid Poursaeed, Judy Hoffman, et al. Hiera: A hierarchical vision transformer without the bells-and-whistles. In ICML, 2023.
  • [60] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In ICCV, pages 618–626, 2017.
  • [61] Gurkirt Singh, Stephen Akrigg, Manuele Di Maio, Valentina Fontana, Reza Javanmard Alitappeh, Salman Khan, Suman Saha, Kossar Jeddisaravi, Farzad Yousefi, Jacob Culley, Tom Nicholson, Jordan Omokeowa, Stanislao Grazioso, Andrew Bradley, Giuseppe Di Gironimo, and Fabio Cuzzolin. Road: The road event awareness dataset for autonomous driving. IEEE TPAMI, 45(1):1036–1054, 2023.
  • [62] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild. In CRCV-TR-12-01, 2012.
  • [63] Chen Sun, Abhinav Shrivastava, Carl Vondrick, Kevin Murphy, Rahul Sukthankar, and Cordelia Schmid. Actor-centric relation network. In ECCV, pages 318–334, 2018.
  • [64] Chen Sun, Abhinav Shrivastava, Carl Vondrick, Rahul Sukthankar, Kevin Murphy, and Cordelia Schmid. Relational action forecasting. In CVPR, 2019.
  • [65] Peize Sun, Rufeng Zhang, Yi Jiang, Tao Kong, Chenfeng Xu, Wei Zhan, Masayoshi Tomizuka, Lei Li, Zehuan Yuan, Changhu Wang, et al. Sparse r-cnn: End-to-end object detection with learnable proposals. In CVPR, pages 14454–14463, 2021.
  • [66] Jiajun Tang, Jin Xia, Xinzhi Mu, Bo Pang, and Cewu Lu. Asynchronous interaction aggregation for action detection. In ECCV, pages 71–87, 2020.
  • [67] Ilya O Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Andreas Steiner, Daniel Keysers, Jakob Uszkoreit, et al. Mlp-mixer: An all-mlp architecture for vision. In NeurIPS, pages 24261–24272, 2021.
  • [68] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.
  • [69] Mengmeng Wang, Jiazheng Xing, and Yong Liu. Actionclip: A new paradigm for video action recognition. arXiv preprint arXiv:2109.08472, 2021.
  • [70] Mengmeng Wang, Jiazheng Xing, and Yong Liu. Actionclip: A new paradigm for video action recognition. arXiv preprint arXiv:2109.08472, 2021.
  • [71] Syed Talal Wasim, Muzammal Naseer, Salman Khan, Fahad Shahbaz Khan, and Mubarak Shah. Vita-clip: Video and text adaptive clip via multimodal prompting. In CVPR, pages 23034–23044, 2023.
  • [72] Zejia Weng, Xitong Yang, Ang Li, Zuxuan Wu, and Yu-Gang Jiang. Open-vclip: Transforming clip to an open-vocabulary video model via interpolated weight optimization. In ICML, 2023.
  • [73] Chao-Yuan Wu, Christoph Feichtenhofer, Haoqi Fan, Kaiming He, Philipp Krahenbuhl, and Ross Girshick. Long-term feature banks for detailed video understanding. In CVPR, 2019.
  • [74] Jianchao Wu, Zhanghui Kuang, Limin Wang, Wayne Zhang, and Gangshan Wu. Context-aware rcnn: A baseline for action detection in videos. In ECCV, pages 440–456, 2020.
  • [75] Jianzong Wu, Xiangtai Li, Shilin Xu Haobo Yuan, Henghui Ding, Yibo Yang, Xia Li, Jiangning Zhang, Yunhai Tong, Xudong Jiang, Bernard Ghanem, et al. Towards open vocabulary learning: A survey. arXiv preprint arXiv:2306.15880, 2023.
  • [76] Tao Wu, Mengqi Cao, Ziteng Gao, Gangshan Wu, and Limin Wang. Stmixer: A one-stage sparse action detector. In CVPR, pages 14720–14729, 2023.
  • [77] Tao Wu, Mengqi Cao, Ziteng Gao, Gangshan Wu, and Limin Wang. Stmixer: A one-stage sparse action detector. In CVPR, pages 14720–14729, 2023.
  • [78] Tao Wu, Shuqiu Ge, Jie Qin, Gangshan Wu, and Limin Wang. Open-vocabulary spatio-temporal action detection. arXiv preprint arXiv:2405.10832, 2024.
  • [79] Xiaoshi Wu, Feng Zhu, Rui Zhao, and Hongsheng Li. Cora: Adapting clip for open-vocabulary detection with region prompting and anchor pre-matching. In CVPR, pages 7031–7040, 2023.
  • [80] Hongwei Xue, Yuchong Sun, Bei Liu, Jianlong Fu, Ruihua Song, Houqiang Li, and Jiebo Luo. Clip-vip: Adapting pre-trained image-text model to video-language representation alignment. In ICLR, 2022.
  • [81] Hongwei Xue, Yuchong Sun, Bei Liu, Jianlong Fu, Ruihua Song, Houqiang Li, and Jiebo Luo. Clip-vip: Adapting pre-trained image-text model to video-language representation alignment. In ICLR, 2022.
  • [82] Shen Yan, Xuehan Xiong, Arsha Nagrani, Anurag Arnab, Zhonghao Wang, Weina Ge, David Ross, and Cordelia Schmid. Unloc: A unified framework for video localization tasks. In ICCV, pages 13623–13633, 2023.
  • [83] Ming Yang, Shuiwang Ji, Wei Xu, Jinjun Wang, Fengjun Lv, Kai Yu, Yihong Gong, Mert Dikmen, Dennis J Lin, and Thomas S Huang. Detecting human actions in surveillance videos. In TRECVID, 2009.
  • [84] Liangzhe Yuan, Rui Qian, Yin Cui, Boqing Gong, Florian Schroff, Ming-Hsuan Yang, Hartwig Adam, and Ting Liu. Contextualized spatio-temporal contrastive learning with self-supervision. In CVPR, pages 13977–13986, 2022.
  • [85] Kimin Yun, Yongjin Kwon, Sungchan Oh, Jinyoung Moon, and Jongyoul Park. Vision-based garbage dumping action detection for real-world surveillance platform. ETRI Journal, 41(4):494–505, 2019.
  • [86] Yuhang Zang, Wei Li, Kaiyang Zhou, Chen Huang, and Chen Change Loy. Open-vocabulary detr with conditional matching. In ECCV, pages 106–122, 2022.
  • [87] Alireza Zareian, Kevin Dela Rosa, Derek Hao Hu, and Shih-Fu Chang. Open-vocabulary object detection using captions. In CVPR, pages 14393–14402, 2021.
  • [88] Alireza Zareian, Kevin Dela Rosa, Derek Hao Hu, and Shih-Fu Chang. Open-vocabulary object detection using captions. In CVPR, pages 14393–14402, 2021.
  • [89] Rowan Zellers, Yonatan Bisk, Ali Farhadi, and Yejin Choi. From recognition to cognition: Visual commonsense reasoning. In CVPR, 2019.
  • [90] Pengchuan Zhang, Xiujun Li, Xiaowei Hu, Jianwei Yang, Lei Zhang, Lijuan Wang, Yejin Choi, and Jianfeng Gao. Vinvl: Revisiting visual representations in vision-language models. In CVPR, pages 5579–5588, 2021.
  • [91] Jiaojiao Zhao, Yanyi Zhang, Xinyu Li, Hao Chen, Bing Shuai, Mingze Xu, Chunhui Liu, Kaustav Kundu, Yuanjun Xiong, Davide Modolo, et al. Tuber: Tubelet transformer for video action detection. In CVPR, pages 13598–13607, 2022.
  • [92] Yin-Dong Zheng, Guo Chen, Minglei Yuan, and Tong Lu. Mrsn: Multi-relation support network for video action detection. arXiv preprint arXiv:2304.11975, 2023.
  • [93] Yiwu Zhong, Jianwei Yang, Pengchuan Zhang, Chunyuan Li, Noel Codella, Liunian Harold Li, Luowei Zhou, Xiyang Dai, Lu Yuan, Yin Li, et al. Regionclip: Region-based language-image pretraining. In CVPR, pages 16793–16803, 2022.
  • [94] Yiwu Zhong, Jianwei Yang, Pengchuan Zhang, Chunyuan Li, Noel Codella, Liunian Harold Li, Luowei Zhou, Xiyang Dai, Lu Yuan, Yin Li, et al. Regionclip: Region-based language-image pretraining. In CVPR, pages 16793–16803, 2022.
  • [95] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Learning deep features for discriminative localization. In CVPR, pages 2921–2929, 2016.
  • [96] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models. International Journal of Computer Vision, 130(9):2337–2348, 2022.
  • [97] Ziqin Zhou, Yinjie Lei, Bowen Zhang, Lingqiao Liu, and Yifan Liu. Zegclip: Towards adapting clip for zero-shot semantic segmentation. In CVPR, pages 11175–11185, 2023.
  • [98] Beier Zhu, Yulei Niu, Yucheng Han, Yue Wu, and Hanwang Zhang. Prompt-aligned gradient for prompt tuning. In ICCV, pages 15659–15669, 2023.

Appendix

Appendix A Prompts for Query-Text Alignment

To generate text prompts for each action category, we send a request to GPT [48] using the template: “For the action type {CLS}, what are the visual descriptions? Please respond with a list of 16 short sentences.” where the placeholder “{CLS}” is replaced by the action class name from the vocabulary. In this way, we obtain multiple caption-like sentence descriptions of each action. The text feature for each class is then computed by mean pooling the VLM text encoder features of these prompts. In Fig. 5 and Fig. 6, we show examples of the generated prompts for the J-HMDB and UCF101-24 datasets, respectively. We will release all the prompts used in this work.
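Below is a minimal sketch of this prompt-and-pooling pipeline, assuming a hypothetical GPT client `query_gpt` and a frozen CLIP-style text encoder with its tokenizer; the function names and tensor shapes are placeholders rather than our released implementation.

```python
import torch

PROMPT_TEMPLATE = ("For the action type {CLS}, what are the visual descriptions? "
                   "Please respond with a list of 16 short sentences.")

def build_class_text_features(class_names, query_gpt, tokenizer, text_encoder):
    """Return a (num_classes, D) matrix with one pooled text feature per action class."""
    class_feats = []
    for cls in class_names:
        # 1) Ask the language model for caption-like descriptions of the action class.
        sentences = query_gpt(PROMPT_TEMPLATE.format(CLS=cls))   # list of sentences
        # 2) Encode every sentence with the frozen VLM text encoder.
        tokens = tokenizer(sentences)                            # (num_sent, seq_len)
        with torch.no_grad():
            feats = text_encoder(tokens)                         # (num_sent, D)
        feats = feats / feats.norm(dim=-1, keepdim=True)         # unit-normalize
        # 3) Mean-pool over the prompts to get a single text feature for the class.
        class_feats.append(feats.mean(dim=0))
    return torch.stack(class_feats)                              # (num_classes, D)
```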

Appendix B Explanation of the Reversed Attention

As discussed in the main paper, the seemingly counterintuitive phenomenon of reversed visual-text attention has been studied in [33, 31], and we also observed it in our video-based experiments. For CLIP-based models, the [CLS] token in ViT is aligned to the text semantics so that its attention weight corresponds to the foreground, while the weights of the remaining $L$ visual tokens are complementary after the softmax over $L+1$ tokens in attention pooling. Therefore, due to the attention pooling, a high similarity between the text feature (or the visual [CLS] token feature) and the $L$ visual tokens could indicate the background.
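The toy example below, which is only an illustration and not the model's actual attention, shows the complementarity argument: the softmax weights over the [CLS] token and the $L$ patch tokens sum to one, so a large foreground-aligned [CLS] weight necessarily suppresses the patch weights.

```python
import torch

# Toy attention pooling over a [CLS] token and L patch tokens (random features).
torch.manual_seed(0)
L, D = 4, 8
cls_tok = torch.randn(1, D)
patches = torch.randn(L, D)
keys = torch.cat([cls_tok, patches], dim=0)                      # (L+1, D)

# Attention pooling uses [CLS] as the query; softmax runs over all L+1 tokens.
weights = torch.softmax(cls_tok @ keys.t() / D ** 0.5, dim=-1)   # (1, L+1)
cls_w, patch_w = weights[0, 0], weights[0, 1:]

# The weights sum to 1, so the [CLS] (foreground) weight and the patch weights are
# complementary; patches scoring high against the text/[CLS] feature can therefore
# end up corresponding to background.
print(float(cls_w), float(patch_w.sum()))                        # the two parts sum to 1
```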

Appendix C Implementation Details

Positional Embedding Interpolation. When using the pre-trained VLM without fine-tuning, an immediate challenge is that the input videos have spatiotemporal resolutions different from the data used in VLM pre-training. For example, CLIP-ViP is pre-trained on input videos of size $12\times 224\times 224$, while videos from J-HMDB can be of any resolution after random augmentations in training. A simple solution is to resize the input videos to match the pre-trained size, but for the action detection subtask, person localization is sensitive to the input resolution. To handle this challenge, we instead keep the raw resolution as input and interpolate the pre-trained spatial and temporal positional embeddings. For example, given the CLIP-ViP B/16 VLM and an input video of size $T\times H\times W$, we interpolate the 12 temporal positional embeddings $\text{PE}_{t}\in\mathbb{R}^{12\times D}$ to $\hat{\text{PE}}_{t}\in\mathbb{R}^{T\times D}$, and the $196\,(=\frac{224}{16}\times\frac{224}{16})$ spatial positional embeddings $\text{PE}_{s}\in\mathbb{R}^{196\times D}$ to $\hat{\text{PE}}_{s}\in\mathbb{R}^{L\times D}$, where $L=\frac{H}{16}\times\frac{W}{16}$. We found this technique useful for the action detection problem.
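A minimal PyTorch sketch of this interpolation is shown below, assuming a ViT-B/16 patch size of 16 and embeddings stored as plain tensors; the function name and the exact interpolation modes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def interpolate_pos_embed(pe_t, pe_s, T, H, W, patch=16):
    """Resize pre-trained positional embeddings to a new input size.

    pe_t: (12, D) temporal embeddings; pe_s: (196, D) spatial embeddings laid out as a
    14x14 grid for 224x224 pre-training inputs. Returns (T, D) and (H/16 * W/16, D).
    """
    D = pe_t.shape[-1]
    # Temporal: 1D linear interpolation from 12 pre-training frames to T frames.
    pe_t_new = F.interpolate(pe_t.t().unsqueeze(0), size=T,
                             mode='linear', align_corners=False)
    pe_t_new = pe_t_new.squeeze(0).t()                               # (T, D)
    # Spatial: reshape to the 14x14 grid and bicubically resize to (H/16, W/16).
    g = int(pe_s.shape[0] ** 0.5)                                    # 14
    h, w = H // patch, W // patch
    grid = pe_s.reshape(g, g, D).permute(2, 0, 1).unsqueeze(0)       # (1, D, 14, 14)
    grid = F.interpolate(grid, size=(h, w), mode='bicubic', align_corners=False)
    pe_s_new = grid.squeeze(0).permute(1, 2, 0).reshape(h * w, D)    # (h*w, D)
    return pe_t_new, pe_s_new
```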

4D Feature Pyramid. Following the detection literature [29, 77], the pre-trained patch token features are transformed into a 4D feature pyramid before the detection head. Let $\mathbf{H}\in\mathbb{R}^{h\times w\times T\times D}$ be the pre-trained patch token features from the VLM video encoder, where $h\times w$ is the number of patches per frame, $T$ is the number of video frames, and $D$ is the Transformer dimension. We use deconvolution or convolution to produce hierarchical feature maps $\tilde{\mathbf{H}}^{(l)}$ with spatial strides $s^{(l)}\in\{1/4,1/2,1,2\}$, where fractional strides denote deconvolutional strides and $l$ indexes the pyramid level. Different from [29, 77], which fully fine-tune the visual encoder, our VLM visual encoder has to be frozen. Therefore, to allow the pre-trained features to be better utilized by the OpenMixer head, we add a residual connection at each level of the 4D feature pyramid via spatial interpolation: $\hat{\mathbf{H}}^{(l)}=\phi(\mathbf{H},s^{(l)})+\tilde{\mathbf{H}}^{(l)}$, where the function $\phi$ spatially interpolates the feature map from size $h\times w$ to the resolution of $\tilde{\mathbf{H}}^{(l)}$.
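The sketch below illustrates this residual 4D pyramid for one frame's patch tokens; the kernel sizes and the exact conv/deconv configuration at each stride are assumptions for illustration, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualPyramid(nn.Module):
    """Hierarchical maps at spatial strides {1/4, 1/2, 1, 2} with a residual
    connection H_hat = phi(H, s) + H_tilde at every level (applied frame-wise)."""
    def __init__(self, dim):
        super().__init__()
        self.levels = nn.ModuleList([
            nn.Sequential(nn.ConvTranspose2d(dim, dim, 2, stride=2),
                          nn.ConvTranspose2d(dim, dim, 2, stride=2)),  # s = 1/4
            nn.ConvTranspose2d(dim, dim, 2, stride=2),                 # s = 1/2
            nn.Conv2d(dim, dim, 3, padding=1),                         # s = 1
            nn.Conv2d(dim, dim, 3, stride=2, padding=1),               # s = 2
        ])

    def forward(self, H):            # H: (T, D, h, w) frozen VLM patch tokens
        pyramid = []
        for level in self.levels:
            H_tilde = level(H)       # strided feature map at this pyramid level
            # Residual: spatially interpolate the frozen features to the same size.
            H_res = F.interpolate(H, size=H_tilde.shape[-2:],
                                  mode='bilinear', align_corners=False)
            pyramid.append(H_res + H_tilde)   # H_hat^(l) = phi(H, s^(l)) + H_tilde^(l)
        return pyramid
```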

Appendix D Additional Results

Table 6: Effect of VLMs. We implement OpenMixer with CLIP-ViP and CLIP using the same ViT-B/16 transformer.
VLMs | Modality | Mean | Base | Novel
CLIP [52] | image | 71.60 | 79.46 | 64.44
CLIP-ViP [81] | video | 86.34 | 90.75 | 82.33
Table 7: GPT helps temporal localization. We compute mAP using only the temporal IoU on the J-HMDB dataset.
Setting | Mean(t) | Base(t) | Novel(t)
w/o GPT | 83.57 | 90.74 | 77.06
w/ GPT | 91.62 | 93.63 | 89.79
Table 8: Impact of person detectors. In the E2E setting, the boxes predicted by OpenMixer are replaced with boxes from Mask R-CNN [19] or G-DINO [40], and each replacement box is assigned the classification scores of the OpenMixer box with which it has maximum IoU.
Models | Person boxes | J-HMDB Mean | J-HMDB Base | J-HMDB Novel | UCF101-24 Mean | UCF101-24 Base | UCF101-24 Novel
ZSR+ZSL | Mask R-CNN [19] | 66.73 | 64.61 | 68.66 | 35.01 | 34.59 | 35.43
ZSR+ZSL | G-DINO [40] | 69.72 | 67.09 | 72.12 | 45.43 | 44.82 | 46.04
E2E | Mask R-CNN [19] | 83.51 | 87.45 | 79.92 | 42.31 | 48.48 | 36.13
E2E | G-DINO [40] | 85.06 | 87.76 | 82.60 | 46.56 | 47.00 | 46.11

Impact of VLMs. We note that a line of literature [70, 55, 23, 44, 38] builds on the image-based CLIP for open-vocabulary video understanding. It is therefore interesting to see whether image CLIP also works for the OVAD task. In Tab. 6, we compare OpenMixer variants using the video-based CLIP-ViP [81] and the image-based CLIP [52] under the same ViT-B/16 architecture. The results show that OpenMixer with CLIP performs substantially worse than the model with CLIP-ViP, due to the limited capacity of image CLIP for capturing video actions.

Can GPT help temporal action localization? This question is interesting because how textual prompts from language models like GPT could help temporal localization has not been explored in the literature. In Tab. 7, by evaluating temporal action localization performance, we show that GPT prompts bring a significant improvement.

Impact of person detectors. In Tab. 8, we compare the impact of using external person boxes from off-the-shelf person detectors, i.e., G-DINO [40] and Mask R-CNN [19], at test time on the two best-performing models under the ZSR+ZSL and E2E settings, respectively. The high-quality boxes from G-DINO consistently outperform those from Mask R-CNN. With the same external test-time boxes, the results of the OpenMixer model are consistently better than those of the strongest ZSR+ZSL baseline (Video+GPT). The smaller gains on UCF101-24 compared with J-HMDB can be explained by the background bias in UCF videos, which limits the VLM's action recognition.
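As a reference, the score-transfer step described in the Table 8 caption can be sketched as follows; the tensor shapes and helper names are assumptions, with pairwise IoU computed by torchvision.

```python
import torch
from torchvision.ops import box_iou

def assign_scores_to_external_boxes(ext_boxes, om_boxes, om_scores):
    """ext_boxes: (M, 4) boxes from Mask R-CNN / G-DINO; om_boxes: (N, 4) OpenMixer
    boxes; om_scores: (N, C) their class scores. Each external box inherits the
    scores of the OpenMixer box with which it has maximum IoU."""
    iou = box_iou(ext_boxes, om_boxes)      # (M, N) pairwise IoU
    match = iou.argmax(dim=1)               # best-matching OpenMixer box per external box
    return om_scores[match]                 # (M, C) transferred classification scores
```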

Table 9: Impact of location prior noise. We analyze how the noise level of the location priors used to initialize the box queries of the first S-OMB block affects performance.
Priors from | Noise level | Mean | Base | Novel
(a) G.T. (UB) | clean | 91.19 | 93.23 | 89.34
(b) detection | moderate | 83.92 | 88.19 | 80.03
(c) random (LB) | serious | 54.15 | 56.50 | 52.02
Ours: attention map | slight | 86.34 | 90.75 | 82.33
Table 10: Generalized zero-shot testing. A complete vocabulary of base and novel categories is given in testing.
Models | J-HMDB Mean | J-HMDB Base | J-HMDB Novel | UCF101-24 Mean | UCF101-24 Base | UCF101-24 Novel
STMixer [77] | 36.26 | 55.71 | 18.57 | 28.72 | 53.42 | 4.02
OpenMixer | 74.28 | 77.72 | 71.16 | 40.07 | 54.00 | 26.14

Impact of location prior noise. In Table 9, we compare ours with three variants that use location priors from (a) ground-truth (G.T.) boxes, which can be regarded as noise-free and upper-bound (UB) the performance, (b) detected person boxes, which may be moderately noisy, and (c) uniformly random boxes, which are completely noisy and lower-bound (LB) the performance. The results show that our location priors, which are sampled from the text-patch attention map, perform much better than the baselines (b) and (c), and are close to the upper-bound performance of (a).
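A minimal sketch of how box-query location priors could be drawn from a text-to-patch attention map is given below; the top-k peak selection and the fixed initial box size are illustrative assumptions rather than the exact procedure.

```python
import torch

def location_priors_from_attention(attn_map, num_queries, box_size=0.3):
    """attn_map: (h, w) text-to-patch attention for a keyframe. Returns
    (num_queries, 4) boxes in normalized (cx, cy, w, h) format."""
    h, w = attn_map.shape
    topk = attn_map.flatten().topk(num_queries).indices          # strongest locations
    cy = (torch.div(topk, w, rounding_mode='floor').float() + 0.5) / h
    cx = ((topk % w).float() + 0.5) / w
    wh = torch.full((num_queries, 2), box_size)                  # fixed initial size
    return torch.cat([cx.unsqueeze(1), cy.unsqueeze(1), wh], dim=1)
```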

Generalized zero-shot testing. In the main paper, the base and novel categories are given separately in testing. In Table 10, we additionally present results under the generalized zero-shot testing protocol, in which a complete vocabulary of base and novel categories is given for each test video. This protocol is more challenging, but our OpenMixer still outperforms the STMixer baseline [77]. Moreover, according to [88, 94], the model rankings are stable across the two testing protocols and only the magnitudes of the numbers differ. Therefore, model efficacy can still be validated by the individual testing in the main paper.
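The difference between the two protocols amounts to the candidate vocabulary used when scoring each video, as the toy sketch below illustrates (class counts and scores are made up for illustration).

```python
import torch

def classify(scores, class_ids):
    """scores: (N, C_total) query-text similarities over the full label space.
    Restrict predictions to the given candidate classes; return (labels, confidences)."""
    sub = scores[:, class_ids]                 # keep only the allowed vocabulary
    conf, idx = sub.max(dim=1)
    return class_ids[idx], conf

base_ids, novel_ids = torch.arange(0, 10), torch.arange(10, 21)
scores = torch.rand(5, 21)                     # 5 detections, 21 classes (toy numbers)
# Individual testing: base and novel videos are scored within their own vocabulary.
labels_base, _ = classify(scores, base_ids)
# Generalized zero-shot testing: every video is scored against the full vocabulary.
labels_all, _ = classify(scores, torch.cat([base_ids, novel_ids]))
```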

Results on Different Splits. We experiment with five random 50%-50% seen-unseen class splits on both the J-HMDB and UCF101-24 datasets. The full video mAP results are summarized in Tab. 11 and 12, where split (0) is the one used in all experiments of the main paper. We also experiment with five random 75%-25% seen-unseen class splits on the two datasets and report the results in Tab. 13 and 14. As some human actions are much harder to detect than others and can fall into either the base or the novel categories, it is expected that the overall performance varies significantly across splits. Following the existing literature, we will release all splits.
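For reference, a simple way to produce such reproducible random splits is sketched below; the seeding scheme is an assumption, and the released splits should be used for exact reproduction.

```python
import random

def make_splits(class_names, seen_ratio=0.5, num_splits=5, seed0=0):
    """Generate random base/novel class splits (e.g., 50%-50% or 75%-25%)."""
    splits = []
    for i in range(num_splits):
        rng = random.Random(seed0 + i)          # one fixed seed per split
        shuffled = list(class_names)
        rng.shuffle(shuffled)
        k = int(round(seen_ratio * len(shuffled)))
        splits.append({'base': sorted(shuffled[:k]), 'novel': sorted(shuffled[k:])})
    return splits
```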

Table 11: Results on 50%-50% J-HMDB splits.
Metrics | (0) | (1) | (2) | (3) | (4) | avg
Mean | 86.34 | 86.29 | 85.50 | 86.73 | 83.40 | 85.65
Base | 90.75 | 89.89 | 89.20 | 87.70 | 85.36 | 88.58
Novel | 82.33 | 83.02 | 82.13 | 85.85 | 81.61 | 82.99

Table 12: Results on 50%-50% UCF101-24 splits.
Metrics | (0) | (1) | (2) | (3) | (4) | avg
Mean | 46.42 | 46.28 | 45.45 | 47.32 | 48.30 | 46.75
Base | 59.10 | 61.11 | 55.85 | 62.33 | 61.25 | 59.93
Novel | 33.73 | 31.45 | 35.05 | 32.31 | 35.34 | 33.58

Table 13: Results on 75%-25% J-HMDB splits.
Metrics | (0) | (1) | (2) | (3) | (4) | avg
Mean | 75.96 | 79.43 | 79.77 | 81.88 | 86.56 | 80.72
Base | 74.73 | 75.21 | 78.34 | 82.14 | 85.46 | 79.17
Novel | 79.03 | 89.98 | 83.34 | 81.23 | 89.30 | 84.57

Table 14: Results on 75%-25% UCF101-24 splits.
Metrics | (0) | (1) | (2) | (3) | (4) | avg
Mean | 55.78 | 55.83 | 57.04 | 57.19 | 61.85 | 57.54
Base | 64.85 | 61.83 | 60.16 | 58.74 | 61.82 | 61.48
Novel | 28.55 | 37.80 | 47.69 | 52.55 | 61.96 | 45.71

Appendix E Visualizations

We present more visualizations on the J-HMDB and UCF101-24 datasets in Fig. 7 and 8, respectively. They show that our method can detect human actions with precise bounding boxes for both seen and unseen actions. In scenarios where multiple persons exist, e.g., the seen action Volleyball Spiking and the unseen action Ice Dancing on the UCF101-24 dataset, our method can still localize the action-relevant persons in most frames. Compared with single-person action detection, there is still room to improve multi-person action detection in future work.

Appendix F Comparison with Concurrent Work [78]

The concurrent work [78] defines the same task setting and identifies similar challenges as ours. However, there are several important differences in technical motivation and design. First, regarding the roadmap, [78] focuses on large-scale video region-text pre-training followed by downstream fine-tuning, while we emphasize adapting the model to small downstream datasets in one-time training. Second, for model design, [78] is a two-stage method with region proposal generation and action detection refinement, while we adopt a DETR-like end-to-end design. An empirical comparison is currently not feasible because (1) [78] is concurrent work and had not released any code, data, or models during the submission period, and (2) it would not be an apples-to-apples comparison, since the data splits and evaluation metrics of the benchmarks in [78] differ from ours, as indicated in [78].

{
"brush_hair": "Brush Hair: A person is brushing their hair using hand movements with a hairbrush or their fingers.",
"catch": "Catch: Someone is catching an object, such as a ball or a frisbee, with their hands.",
"clap": "Clap: A person is bringing their hands together to create a clapping sound.",
"climb_stairs": "Climb Stairs: Someone is ascending or descending a set of stairs, using alternating leg movements.",
"golf": "Golf: A person is swinging a golf club to hit a golf ball.",
"jump": "Jump: Someone is propelling themselves off the ground using both feet simultaneously.",
"kick_ball": "Kick Ball: A person is striking a ball with their foot.",
"pick": "Pick: Someone is picking up an object from the ground, usually using their hands.",
"pour": "Pour: A person is pouring liquid from one container to another.",
"pullup": "Pull Up: Someone is lifting their body upwards using their arms, typically performed on a horizontal bar.",
"push": "Push: A person is exerting force on an object away from their body, using their hands or body.",
"run": "Run: Someone is moving quickly on their feet, usually in a straight line.",
"shoot_ball": "Shoot Ball: A person is shooting a ball towards a target or a goal using their hands or feet.",
"shoot_bow": "Shoot Bow: Someone is using a bow to shoot an arrow.",
"shoot_gun": "Shoot Gun: A person is firing a gun, typically aimed at a target.",
"sit": "Sit: Someone is in a seated position with their weight supported by a surface, such as a chair.",
"stand": "Stand: A person is upright on their feet, with their body fully supported by their legs.",
"swing_baseball": "Swing Baseball Bat: Someone is swinging a baseball bat to hit a ball.",
"throw": "Throw: A person is propelling an object through the air using their hand or arm.",
"walk": "Walk: Someone is moving on their feet with a regular, steady pace, but slower than running.",
"wave": "Wave: A person is moving their hand or arm back and forth in a greeting or farewell gesture, usually with an open palm."
}
Figure 5: Generated prompts for J-HMDB action categories. For each category, we generate one prompt sentence.
{"Basketball": [
"Basketball: A player dribbles the ball swiftly down the court amidst cheers from the crowd.",
"Basketball: An athlete performs a high jump and slam dunks the ball into the net with confidence.",
"Basketball: Teammates pass the ball around the court, strategizing their next move.",
"Basketball: A player precision shoots the ball from the three-point line and scores.",
"Basketball: A tense one-on-one standoff as a player attempts to steal the ball.",
"Basketball: Players execute deft maneuvers around opponents on the court.",
"Basketball: A player displays impressive footwork while maintaining control of the ball.",
"Basketball: Following a whistle blow, a player steps up to take a free throw.",
"Basketball: The coach calls a timeout to relay new strategies to the team.",
"Basketball: A swift breakaway leads to a stunning layup and two points on the board.",
"Basketball: Thorny defense put up by players trying to prevent the opposing team from scoring.",
"Basketball: The player manages to steal the ball, intercepting a pass and turning the game around.",
"Basketball: In the sound of the last buzzer, players celebrate a well-earned victory.",
"Basketball: Spectators erupt in cheers as the ball swishes through the net.",
"Basketball: A captivating display of agility and teamwork witnessed on the court.",
"Basketball: A player makes a long, arching shot from the half-court line, electrifying the crowd."
],
.
.
.
"TrampolineJumping": [
"Trampoline Jumping: A joyful child is leaping high on a trampoline in their backyard.",
"Trampoline Jumping: A gymnast is skillfully performing somersaults on a trampoline.",
"Trampoline Jumping: A group of friends are competing in tricks while bouncing on a trampoline.",
"Trampoline Jumping: A professional athlete is executing a perfect backflip on a trampoline. ",
"Trampoline Jumping: Enthralled family members are enjoying a trampoline jump session on a sunny day.",
"Trampoline Jumping: Excited children are bouncing and laughing on a trampoline at a birthday party.",
"Trampoline Jumping: A fitness enthusiast is getting an intense workout by jumping on a trampoline.",
"Trampoline Jumping: An acrobat rehearses complicated maneuvers on a large trampoline. ",
"Trampoline Jumping: A fearless teenager is executing high jumps on a trampoline in a skate park.",
"Trampoline Jumping: An adventurous person is defying gravity with bounces on a massive trampoline.",
"Trampoline Jumping: A young girl confidently performs flips and twists on a trampoline. ",
"Trampoline Jumping: A trampoline athlete practices precise landings in a professional gym.",
"Trampoline Jumping: An aspiring gymnast is perfecting their routine on a trampoline.",
"Trampoline Jumping: A boy exhilaratingly jumps towards the sky on a trampoline, his laughter filling the air.",
"Trampoline Jumping: A daring young woman is doing mid-air splits on a trampoline in an indoor park.",
"Trampoline Jumping: A man is reaching extreme heights, all while being propelled off a trampoline."
]}
Figure 6: Generated prompts for UCF101-24 action categories. For each category, we generate 16 prompt sentences.
[Figure 7 shows qualitative detections at frames t/T = 1/6, 2/6, 3/6, 4/6, 5/6, and 1 for the classes brush hair, catch, pick, pour, push, golf, shoot bow, shoot gun, sit, and baseball.]
Figure 7: Visualization on the J-HMDB dataset. We visualize our OpenMixer detections (in blue boxes) and ground truth (in yellow boxes) on five base classes (in black font) and five novel classes (in red font). Class names are shortened for brevity. The numbers after class names are confidence scores.
[Figure 8 shows qualitative detections at frames t/T = 1/6, 2/6, 3/6, 4/6, 5/6, and 1 for the classes Biking, Floor Gym., Horse Riding, Surfing, Volleyball, Basketball, Ice Dance, Long Jump, Skijet, and Walking Dog.]
Figure 8: Visualization on the UCF101-24 dataset. We visualize our OpenMixer detections (in blue boxes) and ground truth (in yellow boxes) on five base classes (in black font) and five novel classes (in red font). Class names are shortened for brevity. The numbers after class names are confidence scores.