
OpenKD: Opening Prompt Diversity for Zero- and Few-shot Keypoint Detection

Changsheng Lu1 (0000-0002-1894-286X)    Zheyuan Liu1 (0009-0002-3666-1778)    Piotr Koniusz2,1 (0000-0002-6340-5289)

1The Australian National University    2Data61/CSIRO
[email protected], [email protected], [email protected]
Abstract

Exploiting foundation models (e.g., CLIP) to build a versatile keypoint detector has gained increasing attention. Most existing models accept either a text prompt (e.g., “the nose of a cat”) or a visual prompt (e.g., a support image with keypoint annotations) to detect the corresponding keypoints in the query image, thereby exhibiting either zero-shot or few-shot detection ability. However, research on multimodal prompting is still underexplored, and the prompt diversity in semantics and language is far from being opened. For example, how should a model handle unseen text prompts for novel keypoint detection, or diverse text prompts like “Can you detect the nose and ears of a cat?” In this work, we open the prompt diversity in three aspects: modality, semantics (seen vs. unseen), and language, to enable a more general zero- and few-shot keypoint detection (Z-FSKD). We propose a novel OpenKD model which leverages a multimodal prototype set to support both visual and textual prompting. Further, to infer the keypoint locations of unseen texts, we add auxiliary keypoints and texts, interpolated in the visual and textual domains, into training, which improves the spatial reasoning of our model and significantly enhances zero-shot novel keypoint detection. We also find that a large language model (LLM) is a good parser, achieving over 96% accuracy when parsing keypoints from texts. With the LLM, OpenKD can handle diverse text prompts. Experimental results show that our method achieves state-of-the-art performance on Z-FSKD and initiates new ways of dealing with unseen and diverse texts. The source code and data are available at https://github.com/AlanLuSun/OpenKD.

1 Introduction

Keypoint detection is a fundamental research problem in computer vision and has numerous applications such as pose estimation for humans [8, 6, 37, 52, 12, 33] and animals [34, 2, 5, 23], action recognition [47, 30], and fine-grained image classification [55, 39]. Over the past decade, significant advancements have been made in deep keypoint detection. However, fully-supervised methods can only predict specific body parts and species from a fixed set, which limits their generality to novel keypoints on unseen species. Semi-supervised methods suffer from a similar problem and still require several hundred or thousand labels when training on a new class. Unsupervised methods struggle to detect user-desired keypoints. Moreover, due to the diversity of creatures, it is difficult to collect datasets containing all species. These shortcomings motivate zero- and few-shot keypoint detection: after being trained on a diverse dataset, the model should quickly recognize novel or base keypoints in unseen species given only zero, one, or a few labelled samples. Such a quick adaptation to new tasks mimics the transfer ability of humans, which is crucial for versatile keypoint detection.

On the one hand, researchers have recently explored few-shot keypoint detection (FSKD) [27, 28, 29, 53]. FSKD, inspired by few-shot learning [36, 38, 13, 44, 20], is general and offers the flexibility of detecting a varying number of keypoints in a query image, given prompts formed by support images with keypoint annotations. In this regard, the support set serves as the visual prompt, which is required when evaluating new classes. On the other hand, in the era of foundation models, vision-language models (e.g., CLIP [35]) inject new life into keypoint detection, enabling zero-shot keypoint detection (ZSKD) [56, 54]. Compared to FSKD, ZSKD does not need annotated support images; instead, it uses a text prompt to instruct the model to detect the specified keypoints in the query image. In this case, the support set is the text prompt. Though the pioneering FSKD and ZSKD models are impressive, we identify an important issue limiting progress: the prompt diversity of current keypoint detection methods is far from opened, especially in the aspects of modality, semantics, and language.

Figure 1: Illustration of multimodal prompting for keypoint detection. Our model can successfully detect keypoints given visual prompts formed by support images and keypoints (a), text prompts (b), or both (c). Graph (d) shows that our model combines the advantages of different modalities, mitigating the weaknesses induced by either modality.

Modality diversity. Most existing keypoint detection models cannot support multimodal prompts, e.g., image, text, or both (see Fig. 1). Multimodal prompting is friendlier in real-world interaction and coherent with human concept recognition: we not only see objects but also describe them with language, which ultimately renders a deeper understanding of object concepts. Our work extends existing FSKD and ZSKD, building a more general zero- and few-shot keypoint detector by leveraging a multimodal prototype set and aligning the visual keypoint features towards the textual ones. While straightforward, our method enables one to study the advantages of each modality and exploit them for better model training and testing.

Semantic diversity. Keypoints of seen and unseen species share large semantic similarity, yet also exhibit clear differences. A strong advantage of text prompts is that keypoints with the same semantics share high similarity in language across species, which enables excellent ZSKD on base keypoints. However, if the text has different semantics, a significant domain gap arises. For instance, “the eye of a cat” vs. “the eye of a dog” has a cosine similarity of 0.93 using CLIP text embeddings, whereas for “the eye of a cat” vs. “the knee of a dog” it drops to 0.77. We observe that keypoint detectors cannot perform well if the text is unseen during training. To bridge the gap, we propose to open the semantic diversity by adding auxiliary keypoints and texts into training, where the auxiliary texts are reasoned by an LLM given the base keypoints. To reduce reasoning errors, we carefully design the question to the LLM using chain of thought [48]. We also propose a novel sampling strategy with false text control to improve the matching between auxiliary keypoints and texts. Together, these components significantly improve our model's ZSKD on novel keypoints.

Figure 2: Examples of keypoint detection under diverse text prompting. With an LLM, our method can deal with diverse texts, showing potential for real-world applications. The circles and crosses refer to predictions and GT, respectively. A keypoint is regarded as a correct detection if it falls within the white shaded area, which signifies the PCK@0.1 threshold.

Language diversity. Following zero-shot image classification with CLIP [35], existing ZSKD also constructs simple text prompts based on templates, e.g., “⟨keypoint⟩” or “the ⟨keypoint⟩ of a ⟨category⟩ in the photo”, where ⟨·⟩ is replaced by class names. In real-world interaction, humans tend to ask questions in diverse styles, which results in diverse text prompts, e.g., “Please locate the right-back leg on cat” or “Can you find the left eye, right ear and nose in image?”. A natural question is how to handle such diverse text prompting. To address this, we propose a simple yet effective approach whose key idea is to use a large language model (LLM) such as GPT3.5 [4] or Vicuna [9, 58] to parse the diverse texts via prompt engineering. After extracting the keypoint types and object category, they are synthesized into simple text prompts to instruct the detection model. As such, diverse text prompting can be transformed into simple text prompting, which sheds light on opening the language diversity (Fig. 2).

In summary, in this paper we propose an OpenKD model with several intriguing features: 1) it supports both visual and textual prompts, 2) it has the potential to handle unseen and diverse texts, and 3) it maintains strong generality and performance on ZSKD and FSKD. We show that an LLM can serve as a reasoner for text interpolation and as a good language parser for diverse texts. We also contribute four diverse text prompt sets for the popular Animal pose dataset [5], AwA [2], CUB [45], and NABird [42] for fair evaluations. To the best of our knowledge, we are the first to open the semantic and language diversity of text prompts for keypoint detection.

2 Related Work

Keypoint Detection has been widely studied, ranging from traditional interest points [25, 10] to deep corner detection [57], semi-supervised [31, 46, 18] and fully-supervised keypoint estimation [40, 33, 6, 8, 12, 37, 52, 51]. There are two major classes of methods for deep keypoint localization: i) direct coordinate regression [7, 41] and ii) heatmap-based regression. In contrast to existing heatmap-based models dedicated to specific body parts, e.g., top-down [37, 12, 16] and bottom-up pose estimators [33, 6, 8], our OpenKD model offers more flexible keypoint detection, breaking the limitation on the keypoint types to be detected.

Few-shot Keypoint Detection is more versatile and data-efficient than the supervised paradigms. Similar to standard few-shot learning (FSL) [44], FSKD uses episodes for training and evaluation. Many well-known FSL methods such as ProtoNet [36], RelationNet [38], LwoF [15] and MAML [13] have been extended to keypoint detection [27, 3, 14] and serve as baselines. Recently, Lu and Koniusz [27] formalize comprehensive FSKD settings and model the localization uncertainty of keypoints. Further, they propose a lightweight FSKD model [28] and explore FSKD under transductive settings and occlusions [29]. Bohdal et al. [3] propose a dataset-of-datasets, benchmarking FSL algorithms across various vision tasks. FSKD also inspires class-agnostic animal pose estimation [27, 28, 50]. Compared to existing FSKD, our work pushes further by opening text prompting, which enables more versatile keypoint detection. Moreover, text prompts can complement visual prompts, delivering better detection of base keypoints on unseen species.

Zero-shot Keypoint Detection becomes feasible thanks to the great progress of vision-language pre-training (e.g., CLIP [35] and BLIP [22]). Since CLIP is pre-trained on internet-scale image-text pairs that cover far more semantic concepts than task-specific datasets, it provides semantically rich features for various downstream tasks [32, 59, 56] and shows strong zero-shot transfer ability.

Recently, CLAMP [56] adapts CLIP for animal pose estimation in feature-aware and spatial-aware aspects, and the approach of [54] also employs CLIP as the backbone and proposes language-driven keypoint detection. Compared to FSKD, which still requires one or more annotations, ZSKD merely demands cheap language descriptions, which makes it attractive for keypoint detection. Nevertheless, most existing ZSKD works cannot support multimodal prompts and their prompt diversity is limited. To address these issues, we propose an OpenKD model which further opens modality, semantics, and language. To handle unseen texts, we propose to generate auxiliary keypoint-text pairs which significantly help in reasoning about novel keypoints on unseen species. Moreover, we contribute a simple yet effective language parsing module to handle diverse text prompts.

Foundation Models include a broad set of models pre-trained on large-scale datasets, including LLMs (e.g., GPT3.5 [4]), VLMs (e.g., CLIP [35]), and multimodal models (e.g., GPT4 [1]). Transferring the knowledge of these foundation models to specific tasks in a cost-effective way is popular in computer vision [19, 24, 54, 56, 26]. Thus, we follow this trend and leverage an LLM such as GPT3.5 (relatively cheap) to perform keypoint reasoning and parsing.

3 Method

3.1 Model Architecture

Following general few-shot learning [44, 38, 36], the Z-FSKD model is evaluated on episodes, each of which includes a support set and a query set. The query set comprises query images, while the support set provides the prompts. If visual prompts, i.e., $K$ support images with keypoint annotations ($K\geq 1$), are given, the problem is defined as $K$-shot detection. If only the text prompt (i.e., the language description) is given, then $K=0$ and the problem becomes zero-shot detection. The goal of Z-FSKD is to detect the corresponding keypoints in the query image given the prompts, whether a visual prompt, a text prompt, or both. Such a formulation allows the model to respond effectively to diverse inputs.

The overview of our model inference is shown in Fig. 3, which mainly includes four stages: i) image/text feature extraction, ii) feature adaptation, iii) keypoint prototype set building, and iv) correlation, decoding, and heatmap fusion. The overview of our model training is shown in Fig. 4, which introduces our novel approaches for improving model performance.

Image/Text Feature Extraction Since the input of our model involves two different modalities, vision and language, we want the features extracted from image and text to have a small modality gap, so that the text can correlate with image regions of the same semantics. Such a property helps the model find keypoint locations in the query image. To this end, we resort to CLIP [35], which is pre-trained on large-scale image-text pairs. CLIP includes an image encoder and a text encoder, where the text encoder is a transformer [43] while the image encoder can be a CNN [21] or a ViT [11]. We empirically found that the CLIP image encoder based on RN50 [35, 17], a CNN model, offers a good trade-off between performance and cost; thus, we choose it as the default backbone. However, when extracting image features, the original CLIP image encoder only retains the classification token $\mathbf{x}_{\text{class}}$ and discards the image tokens $\mathbf{X}$ after attention pooling:

\mathbf{x}_{\text{class}}=\text{AvgPool}(\mathbf{X}),\quad\mathbf{x}_{\text{class}}^{\prime}=\text{Attention}(\mathbf{x}_{\text{class}}\mathbf{W}_{q},\mathbf{X}\mathbf{W}_{k},\mathbf{X}\mathbf{W}_{v})\mathbf{W}_{o}. \qquad (1)

While this is natural for image-text matching, we require the image tokens to recover the spatial locations of keypoints. Thus, we propose to obtain the image tokens via a projection composed of $\mathbf{W}_{v}$ and $\mathbf{W}_{o}$ as

\mathbf{X}^{\prime}=\mathbf{X}\mathbf{W}_{v}\mathbf{W}_{o}. \qquad (2)

In this way, we not only handle the channel inconsistency between the raw image tokens $\mathbf{X}$ and the text features, but also reuse the projection layers $\mathbf{W}_{v}$ and $\mathbf{W}_{o}$.
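
The snippet below is a minimal sketch of this token projection (shapes, the RN50 channel sizes, and the random stand-in weights are our assumptions, not the released implementation); it only illustrates reusing the attention-pool projections to map raw image tokens into the text-aligned space of Eq. 2.

```python
import torch

def project_image_tokens(X, W_v, W_o):
    """X: raw image tokens (B, HW, C_in); W_v: (C_in, C_mid); W_o: (C_mid, d)."""
    return X @ W_v @ W_o                         # (B, HW, d), channel-aligned with text features

# toy usage with random matrices standing in for CLIP's attnpool parameters
X = torch.randn(2, 12 * 12, 2048)                # assumed 12x12 spatial tokens from RN50
W_v, W_o = torch.randn(2048, 2048), torch.randn(2048, 1024)
tokens = project_image_tokens(X, W_v, W_o)       # (2, 144, 1024)
```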

For brevity, we assume that each input episode contains one query image $\mathbf{I}^{\text{q}}$, one support image $\mathbf{I}^{\text{s}}$ with $N$ annotated keypoints, and $N$ keypoint texts. With Eq. 2, both the support and query images are encoded and projected into the support and query feature maps $\mathbf{X}^{\text{s}}=\mathcal{F}_{\text{v}}(\mathbf{I}^{\text{s}})$ and $\mathbf{X}^{\text{q}}=\mathcal{F}_{\text{v}}(\mathbf{I}^{\text{q}})$ in the deep feature space $\mathbb{R}^{l\times l\times d}$ via a shared CLIP image encoder $\mathcal{F}_{\text{v}}$. Further, with the CLIP text encoder $\mathcal{F}_{\text{t}}$, the $N$ texts are first tokenized and then encoded as text features $\mathbf{t}_{n}\in\mathbb{R}^{m\times d}$, where $n=1,2,\cdots,N$ and $m$ is the sequence length.

Figure 3: The sketch of model inference. Our OpenKD allows testing under a visual prompt, a text prompt, or both. For clarity, we show the “both” case (i.e., 1-shot with text testing). We first extract the deep features of the texts, support, and query images via CLIP, and then adapt both modalities of features via residual refinement. After extracting the visual keypoint prototype (VKP) and its textual counterpart, we build the prototype set to perform class-agnostic correlation and heatmap decoding. Finally, we fuse the heatmaps induced by the two modalities (i.e., M1 & M2) to obtain predictions.

Feature Adaptation Although CLIP provides general yet transferable features, we need to adapt them into a new multimodal feature space that is more suitable for keypoint detection. Inspired by residual learning [17], we propose two adaptation nets $\mathcal{A}_{\text{v}}$ and $\mathcal{A}_{\text{t}}$ to refine the obtained image and text features, respectively, in a residual way:

\mathbf{X}^{\text{s}}:=\mathbf{X}^{\text{s}}+\mathcal{A}_{\text{v}}(\mathbf{X}^{\text{s}}),\quad\mathbf{X}^{\text{q}}:=\mathbf{X}^{\text{q}}+\mathcal{A}_{\text{v}}(\mathbf{X}^{\text{q}}),\quad\mathbf{t}_{n}:=\mathbf{t}_{n}+\mathcal{A}_{\text{t}}(\mathbf{t}_{n}). \qquad (3)

The adaptation net $\mathcal{A}$ is highly scalable and can use transformer [43] or bottleneck [17] blocks for refinement. We found that $\mathcal{A}$ adapts the features well.
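
A minimal sketch of such a residual adapter is given below (the bottleneck layer sizes are assumptions; the paper's refinement blocks may use transformer or convolutional layers instead). It shows the shared pattern of Eq. 3: the adapted feature is the CLIP feature plus a learned residual.

```python
import torch
import torch.nn as nn

class ResidualAdapter(nn.Module):
    """Bottleneck-style residual refinement, applied to image or text features."""
    def __init__(self, dim=1024, hidden=256):
        super().__init__()
        self.block = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(inplace=True), nn.Linear(hidden, dim))

    def forward(self, x):                 # x: (..., dim)
        return x + self.block(x)          # residual update keeps CLIP features intact

A_v, A_t = ResidualAdapter(), ResidualAdapter()
X_q = A_v(torch.randn(2, 144, 1024))      # adapted query image tokens
t_n = A_t(torch.randn(5, 77, 1024))       # adapted text token features
```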

Keypoint Prototype Set Once we obtain the adapted features of the support image and text prompts, i.e., $\mathbf{X}^{\text{s}}$ and $\mathbf{t}_{n}$, we convert them into keypoint prototypes, which are unified in a keypoint prototype set.

For the visual prompt, recall that keypoint labels are provided along with the support image. Thus, we aggregate the local features for each support keypoint $\mathbf{p}_{n}$ by pixel-wise weighted summation between the feature map $\mathbf{X}^{\text{s}}$ and a Gaussian heatmap $\mathbf{H}(\mathbf{p}_{n};\sigma)$, yielding the visual keypoint representation (VKR) $\mathbf{\Phi}_{n}\in\mathbb{R}^{d}$, where $\sigma$ is the standard deviation controlling the Gaussian spread. If $K$ support images are given, i.e., in the few-shot case, the VKRs of the same keypoint type, $\mathbf{\Phi}_{k,n}$, are averaged to build the visual keypoint prototype (VKP) $\mathbf{\Psi}^{\text{v}}_{n}=\frac{1}{K}\sum_{k}\mathbf{\Phi}_{k,n}$. Analogously, we can develop the textual keypoint prototype (TKP) $\mathbf{\Psi}^{\text{t}}_{n}$ from the text features if multiple texts per keypoint are provided. Finally, we build the keypoint prototype set $\mathcal{T}=\mathcal{T}^{\text{v}}\cup\mathcal{T}^{\text{t}}$, where $\mathcal{T}^{\text{v}}=\{\mathbf{\Psi}^{\text{v}}_{n}\}$ contains the VKPs and $\mathcal{T}^{\text{t}}=\{\mathbf{\Psi}^{\text{t}}_{n}\}$ the TKPs. In this way, prompts of different modalities are summarized as keypoint prototypes in a shared $d$-dimensional feature space, which enables our model to flexibly handle various modalities. Moreover, each prototype in $\mathcal{T}$ can guide the model to induce a keypoint heatmap, establishing zero- and few-shot keypoint detection.
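
The sketch below illustrates the VKP construction under stated assumptions (feature-map coordinates for keypoints, a normalized Gaussian weighting, and toy tensor shapes); it is not the released code, only a direct reading of the Gaussian-weighted aggregation and $K$-shot averaging described above.

```python
import torch

def gaussian_heatmap(p, size, sigma=1.0):
    """p: (2,) keypoint (x, y) in feature-map coords; returns (size, size) weights summing to 1."""
    ys, xs = torch.meshgrid(torch.arange(size), torch.arange(size), indexing="ij")
    g = torch.exp(-((xs - p[0]) ** 2 + (ys - p[1]) ** 2) / (2 * sigma ** 2))
    return g / g.sum()

def visual_keypoint_prototype(feats, kps, sigma=1.0):
    """feats: (K, l, l, d) support feature maps; kps: (K, 2), one keypoint per shot."""
    vkrs = [(gaussian_heatmap(p, f.shape[0], sigma)[..., None] * f).sum(dim=(0, 1))
            for f, p in zip(feats, kps)]          # K visual keypoint representations (VKRs)
    return torch.stack(vkrs).mean(dim=0)          # (d,) averaged visual keypoint prototype

proto = visual_keypoint_prototype(torch.randn(3, 12, 12, 1024), torch.tensor([[4., 5.]] * 3))
```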

Correlation, Decoding and Fusion To discover the corresponding keypoints in the query image $\mathbf{I}^{\text{q}}$, each keypoint prototype $\mathbf{\Psi}_{n}$, either a VKP $\mathbf{\Psi}^{\text{v}}_{n}$ or a TKP $\mathbf{\Psi}^{\text{t}}_{n}$, is correlated with the query feature map $\mathbf{X}^{\text{q}}$. To this end, we leverage a correlation module $\mathcal{C}$ which takes both $\mathbf{\Psi}_{n}$ and $\mathbf{X}^{\text{q}}$ as input to produce the attentive feature maps $\mathbf{A}_{n}=\mathcal{C}(\mathbf{\Psi}_{n},\mathbf{X}^{\text{q}})$, where $\mathbf{A}_{n}\in\mathbb{R}^{l\times l\times d}$ and $n=1,2,\cdots,N$. For the correlation module $\mathcal{C}$, we explored multiple variants and found that simple cross correlation already yields high performance, i.e., $\mathbf{A}_{n}=\mathbf{X}^{\text{q}}\odot\mathbf{\Psi}_{n}$, where $\odot$ denotes channel-wise multiplication.

Subsequently, a class-agnostic decoder $\mathcal{D}$ converts each attentive feature map into a keypoint heatmap, i.e., $\mathbf{H}_{n}=\mathcal{D}(\mathbf{A}_{n})$, where $\mathbf{H}_{n}\in\mathbb{R}^{l\times l}$. For heatmap-regression-based keypoint localization, a higher heatmap resolution greatly reduces the coordinate decoding error. Thus, a lightweight upsampling module $\mathcal{U}$ is adopted to further refine the heatmaps, i.e., $\mathbf{H}_{n}:=\mathcal{U}(\mathbf{H}_{n})$.
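
As a rough sketch of this correlation-and-decoding step (the decoder architecture and channel sizes here are hypothetical; only the channel-wise multiplication follows the description above), one attentive map is produced per prototype and mapped to a single-channel heatmap:

```python
import torch
import torch.nn as nn

def correlate(X_q, proto):
    """X_q: (B, d, l, l) query features; proto: (B, d) keypoint prototype (VKP or TKP)."""
    return X_q * proto[:, :, None, None]          # attentive feature map A_n, (B, d, l, l)

decoder = nn.Sequential(                          # assumed light-weight class-agnostic decoder D
    nn.Conv2d(1024, 256, 3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(256, 1, 1))                         # one-channel keypoint heatmap H_n

X_q, proto = torch.randn(2, 1024, 12, 12), torch.randn(2, 1024)
H_n = decoder(correlate(X_q, proto))              # (2, 1, 12, 12); later refined by the upsampler U
```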

Since each modality of prototypes induces one group of heatmaps, considering the visual and textual modalities, we have two groups of heatmaps $\mathbf{H}^{\text{v}}=\{\mathbf{H}^{\text{v}}_{n}\}_{n=1}^{N}$ and $\mathbf{H}^{\text{t}}=\{\mathbf{H}^{\text{t}}_{n}\}_{n=1}^{N}$. In the testing phase, we fuse the upsampled multi-group heatmaps into the final output $\mathbf{H}\in\mathbb{R}^{N\times ul\times ul}$ ($u$ is the upsampling factor):

\mathbf{H}=(\mathbf{H}^{\text{v}}+\mathbf{H}^{\text{t}})/2 \qquad (4)

During the training phase, inspired by HigherHRNet [8], we perform group-specific supervision of the heatmaps:

\mathcal{L}_{\text{kp}}=(\|\mathbf{H}^{\text{v}}-\mathbf{H}^{*}\|_{2}^{2}+\|\mathbf{H}^{\text{t}}-\mathbf{H}^{*}\|_{2}^{2})/2 \qquad (5)

where $\mathbf{H}^{*}$ denotes the ground-truth heatmap that encodes the query keypoints.
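
A minimal sketch of the test-time fusion (Eq. 4) and the group-specific supervision (Eq. 5) follows; tensor shapes are assumed, and F.mse_loss is used as the squared-error term averaged over elements.

```python
import torch
import torch.nn.functional as F

def fuse_heatmaps(H_v, H_t):
    """Average the visual- and text-induced heatmap groups at test time (Eq. 4)."""
    return (H_v + H_t) / 2

def keypoint_loss(H_v, H_t, H_gt):
    """Group-specific heatmap supervision against the ground truth (Eq. 5)."""
    return 0.5 * (F.mse_loss(H_v, H_gt) + F.mse_loss(H_t, H_gt))

H_v, H_t, H_gt = torch.rand(5, 96, 96), torch.rand(5, 96, 96), torch.rand(5, 96, 96)
H = fuse_heatmaps(H_v, H_t)
loss_kp = keypoint_loss(H_v, H_t, H_gt)
```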

Figure 4: Model training and text interpolation. (a) In addition to multi-group heatmap regression, we improve model performance by introducing intra- and inter-modality contrastive learning and the novel auxiliary keypoint and text learning. (b) We exploit an LLM for auxiliary text interpolation and explore incorporating visual keypoint features for text selection in order to mitigate noisy or false texts.

Improve Model by Contrastive Learning Since the keypoint prototypes directly impact performance, we aim for their representation ability to be as strong as possible. Keypoints of the same type should be highly invariant across species, e.g., the ear of a cat, dog, and cow, while different types should be sufficiently discriminative such that our model can distinguish ambiguous or fine-grained texts like “the left ear” vs. “the right ear”. To this end, we introduce a contrastive loss over TKPs. Randomly sampling two species $(s,s^{\prime})$ at a time, each pertaining to an episode, we obtain pairwise sets of TKPs, i.e., $\mathcal{T}^{\text{t}}_{s}=\{\mathbf{\Psi}_{n}\}$ and $\mathcal{T}^{\text{t}}_{s^{\prime}}=\{\mathbf{\Psi}^{\prime}_{n}\}$, which form the similarity matrix

\mathbf{J}(\mathcal{T}^{\text{t}}_{s},\mathcal{T}^{\text{t}}_{s^{\prime}})=\begin{pmatrix}\cos(\mathbf{\Psi}_{1},\mathbf{\Psi}^{\prime}_{1})&\cdots&\cos(\mathbf{\Psi}_{1},\mathbf{\Psi}^{\prime}_{N})\\\vdots&&\vdots\\\cos(\mathbf{\Psi}_{N},\mathbf{\Psi}^{\prime}_{1})&\cdots&\cos(\mathbf{\Psi}_{N},\mathbf{\Psi}^{\prime}_{N})\end{pmatrix} \qquad (6)

where $\cos(\cdot,\cdot)$ denotes the cosine similarity. Then, the contrastive loss within the textual modality, $\mathcal{L}_{\text{tt}}$, becomes

\mathcal{L}^{s\rightarrow s^{\prime}}_{\text{tt}}=-\langle\mathbb{I},\log(\text{softmax}(\mathbf{J}(\mathcal{T}^{\text{t}}_{s},\mathcal{T}^{\text{t}}_{s^{\prime}})/\tau))\rangle,\quad\mathcal{L}_{\text{tt}}=\frac{1}{2}(\mathcal{L}^{s\rightarrow s^{\prime}}_{\text{tt}}+\mathcal{L}^{s^{\prime}\rightarrow s}_{\text{tt}}) \qquad (7)

where $\mathbb{I}$ is the identity matrix and $\langle\cdot,\cdot\rangle$ denotes the inner product. Since CLIP is trained via image-level rather than keypoint-level alignment to text, we want the VKPs and TKPs to be aligned so that their respective advantages can complement each other. This motivates a second contrastive loss between VKPs and TKPs: similar to Eq. 7, we build $\mathcal{L}_{\text{vt}}$. We find that visual-textual alignment improves 1-shot testing but slightly drops 0-shot performance. We attribute this to the better clustering of textual keypoint representations compared to visual ones. To address this issue, we apply a stop gradient to the TKPs in $\mathcal{L}_{\text{vt}}$, thus enforcing the VKPs to align towards the TKPs. We also find that adding $\mathcal{L}_{\text{vv}}$ does not further improve scores; we conjecture that, after aligning the VKPs to the TKPs via $\mathcal{L}_{\text{vt}}$, the visual features also become more discriminative along with the textual features once $\mathcal{L}_{\text{tt}}$ is applied.
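
A minimal sketch of the intra-modality loss $\mathcal{L}_{\text{tt}}$ (Eqs. 6-7) is shown below under simple assumptions: prototypes of the same keypoint type across the two sampled species form the positives (the diagonal of $\mathbf{J}$), and the symmetric term is implemented with cross-entropy. An inter-modality loss $\mathcal{L}_{\text{vt}}$ can be built the same way, detaching the text side to realize the stop gradient mentioned above.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(P_s, P_s2, tau=0.05):
    """P_s, P_s2: (N, d) keypoint prototype sets of two sampled species."""
    J = F.normalize(P_s, dim=-1) @ F.normalize(P_s2, dim=-1).T   # cosine similarity matrix
    labels = torch.arange(P_s.size(0))                           # diagonal entries are positives
    return 0.5 * (F.cross_entropy(J / tau, labels) + F.cross_entropy(J.T / tau, labels))

tkp_s, tkp_s2, vkp_s2 = torch.randn(10, 1024), torch.randn(10, 1024), torch.randn(10, 1024)
loss_tt = contrastive_loss(tkp_s, tkp_s2)
loss_vt = contrastive_loss(vkp_s2, tkp_s2.detach())              # align VKPs towards detached TKPs
```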

Optimization By incorporating the heatmap regression loss $\mathcal{L}_{\text{kp}}$ and the intra- and inter-modality contrastive losses $\mathcal{L}_{\text{tt}}$ and $\mathcal{L}_{\text{vt}}$, the overall loss is

\mathcal{L}=\lambda_{1}\mathcal{L}_{\text{kp}}+\lambda_{2}\mathcal{L}_{\text{tt}}+\lambda_{3}\mathcal{L}_{\text{vt}} \qquad (8)

where $\lambda_{i}$ are the loss weights. Note that the contrastive losses serve as regularization for the main task loss. By default, we set $\lambda_{1}=1$ and $\lambda_{2}=\lambda_{3}=0.002$.

3.2 Training with Auxiliary Keypoints and Texts

Localizing seen keypoints and texts is relatively easy to handle, but it becomes difficult when unseen ones are encountered. Thus, we propose to add auxiliary keypoints and texts, interpolated in the visual and textual domains, into training, which significantly improves the ability to detect novel keypoints.

Visual Interpolation Following [27], we generate auxiliary keypoints in the visual domain via interpolation $\mathcal{I}^{\text{v}}(z;\mathbf{p}_{1},\mathbf{p}_{2})$, where $z\in(0,1)$ is the interpolation node and $\mathbf{p}_{i}\in\mathbb{R}^{2}$ are the end points that form an interpolation path. An off-the-shelf saliency detector [49] is adopted to filter out auxiliary keypoints that fall outside the foreground [27]. Via the interpolation $\mathcal{I}^{\text{v}}$ in the visual domain, the auxiliary keypoints $\hat{\mathbf{p}}$ boost the visual diversity of training keypoints.
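
A minimal sketch of this visual interpolation is given below (function names and the binary saliency mask are assumptions); it shows the linear interpolation along a path and the foreground check that discards background auxiliary keypoints.

```python
import torch

def interpolate_keypoint(p1, p2, z=0.5):
    """p1, p2: (2,) end points of an interpolation path; returns the auxiliary keypoint."""
    return (1 - z) * p1 + z * p2

def keep_if_foreground(p_aux, saliency_mask):
    """saliency_mask: (H, W) binary mask from an off-the-shelf saliency detector."""
    x, y = p_aux.long()
    return bool(saliency_mask[y, x])

p_aux = interpolate_keypoint(torch.tensor([40., 60.]), torch.tensor([80., 100.]))  # z = 0.5
keep = keep_if_foreground(p_aux, torch.ones(200, 200, dtype=torch.bool))
```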

Text Interpolation and Selection The visual interpolation $\mathcal{I}^{\text{v}}$ can assist the model in handling unseen keypoints given visual prompts, but it helps little when the model is prompted with unseen texts. Consequently, we propose text interpolation $\mathcal{I}^{\text{t}}(z;t(\mathbf{p}_{1}),t(\mathbf{p}_{2}),c)$ to mitigate this issue, where $t(\mathbf{p}_{i})$ are the names of the end points $\mathbf{p}_{i}$ and $c$ is the object category. Thanks to the strong language processing ability of LLMs (e.g., GPT3.5 [4]), we propose to ask the LLM a well-designed question to reason the auxiliary keypoint texts given the base keypoints (Fig. 4(b)). For instance, to infer the keypoint text between the “thigh” and “paw” of a “cat”, we can design a vanilla prompt to the LLM as:

“Please give me three most common body parts/keypoints at 1/2 between thigh and paw of a cat. Please answer in concise words.”

Naturally, we expect the answers returned by the LLM to include “knee” or “elbow”. In practice, the keypoint types are usually more complex, making the reasoning challenging. To enhance reasoning, we improve the above vanilla prompt into a more advanced one using chain of thought (CoT) [48], whose key idea is to first provide an example with analysis and then ask the question. We provide the detailed prompts and examples in §0.A of the Suppl.

Considering that the LLM is stochastic and might produce erroneous answers, we let it give the three most probable answers for each interpolation path and repeat $R$ times, yielding a text pool of $3R$ texts, i.e., $\{\hat{t}_{i}\}_{i=1}^{3R}$. As such, the GT text has a higher chance of being included. Afterwards, we need to select one interpolated text from the pool to form a pair with the visually interpolated keypoint. A simple yet effective approach is to sample from the top-1 results of all repetitions. However, it does not use the visual features for selection. An interesting alternative is to use the correlation between visual and text features for selection as follows:

j=\arg\max\nolimits_{i}\cos(\hat{\mathbf{\Phi}},\hat{\mathbf{t}}_{i}),\quad\hat{t}^{*}=\hat{t}_{j} \qquad (9)

where $\hat{\mathbf{\Phi}}$ is the auxiliary keypoint feature of $\hat{\mathbf{p}}$, $\hat{\mathbf{t}}_{i}$ is the text feature of $\hat{t}_{i}$, and $\hat{t}^{*}$ is the selected text. However, we found that this sometimes leads to a performance drop. We suspect this is because CLIP's keypoint-text features are not aligned finely enough in regions beyond the annotations. Rather than using the correlation for picking, we observe that it can be used to reject a low-quality text whose keypoint-text similarity $\cos(\hat{\mathbf{\Phi}},\hat{\mathbf{t}}_{i})$ is low. Consequently, we propose a more advanced sampling with false text control (FTC): we sample a text within the top-$\eta$ candidates, but reject it if the similarity $\cos(\hat{\mathbf{\Phi}},\hat{\mathbf{t}}_{i})$ is below a threshold $\alpha$. In this way, we form auxiliary keypoint-text pairs of higher quality to enhance model training. We apply $\mathcal{I}^{\text{v}}$ to both support and query images to generate support and query auxiliary keypoints $\hat{\mathbf{p}}^{\text{s}}$ and $\hat{\mathbf{p}}^{\text{q}}$, respectively. Similar to Eq. 5, $\hat{\mathbf{p}}^{\text{q}}$ encodes the GT heatmap supervising the heatmaps induced by $\hat{\mathbf{p}}^{\text{s}}$ and $\hat{t}^{*}$.
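
The snippet below is a rough sketch of FTC sampling under stated assumptions (function names, the exact meaning of the top-$\eta$ range, and the threshold values are ours): candidates are restricted to the most similar $\eta$ texts in the pool, and any candidate below the similarity threshold $\alpha$ is rejected.

```python
import random
import torch
import torch.nn.functional as F

def sample_text_with_ftc(phi_aux, text_feats, texts, eta=3, alpha=0.2):
    """phi_aux: (d,) auxiliary keypoint feature; text_feats: (3R, d); texts: list of 3R strings."""
    sims = F.cosine_similarity(phi_aux[None], text_feats)          # keypoint-text similarities
    top = sims.argsort(descending=True)[:eta].tolist()             # restrict to top-eta candidates
    candidates = [i for i in top if sims[i] >= alpha]              # false text control: reject low-sim texts
    return texts[random.choice(candidates)] if candidates else None

t_star = sample_text_with_ftc(torch.randn(1024), torch.randn(9, 1024),
                              [f"candidate_text_{i}" for i in range(9)])
```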

3.3 LLM is a Good Language Parser

To handle diverse text prompts in keypoint detection, unlike the current fashion of using LLMs for text generation, we use the LLM for text decomposition. For example, if our model receives a diverse text prompt like “Can you localize the left eye and nose of cat?”, we first leverage an LLM (e.g., GPT3.5 [4]) to parse the text by providing it with a simple yet effective prompt as follows:

Please extract the animal and keypoint keywords from the below text: “Can you localize the left eye and nose of cat?”

Then, the LLM returns the parsed ⟨keypoint⟩ and ⟨object⟩ terms, i.e., “left eye”, “nose”, and “cat” in this example. Subsequently, we directly leverage the parsed results to synthesize simple text prompts that instruct our model to predict all keypoints corresponding to the text. In this way, the problem of diverse text prompting is converted into simple text prompting, thus opening the language diversity for keypoint detection.
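
A minimal sketch of this parse-then-synthesize step follows (the prompt wording matches the example above, but `ask_llm`, the reply format, and the splitting logic are our assumptions; any chat-completion backend such as GPT3.5 or Vicuna could stand behind `ask_llm`).

```python
def parse_and_synthesize(diverse_text, ask_llm):
    """Parse a diverse prompt with an LLM, then build simple template prompts for the detector."""
    prompt = ("Please extract the animal and keypoint keywords from the below text: "
              f'"{diverse_text}"')
    reply = ask_llm(prompt)                       # assumed format: "keypoints: left eye, nose; animal: cat"
    kp_part, obj_part = reply.split(";")
    keypoints = [k.strip() for k in kp_part.split(":")[1].split(",")]
    category = obj_part.split(":")[1].strip()
    # one simple text prompt per extracted keypoint
    return [f"the {kp} of a {category} in the photo" for kp in keypoints]

prompts = parse_and_synthesize("Can you localize the left eye and nose of cat?",
                               lambda q: "keypoints: left eye, nose; animal: cat")
```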

To fairly evaluate our approach, we manually collect 100 diverse text templates which cover most question scenarios (§0.B of Suppl.), for example: “Where is the ⟨keypoint⟩ for ⟨object⟩?” We then develop four diverse text prompt sets for the popular Animal pose dataset [5], AwA [2], CUB [45], and NABird [42]. Specifically, we accompany each object instance with a diverse text prompt synthesized by randomly sampling one template and one to $N$ valid keypoints. Compared to traditional text prompts which only have $N$ variants, our constructed prompt sets are far more diverse, with a space as large as $100\cdot(C_{N}^{1}+C_{N}^{2}+\cdots+C_{N}^{N})=100\cdot(2^{N}-1)$ variants; e.g., CUB has 11 keypoints, so its space is larger than $10^{5}$. We randomly sample 1000 diverse texts from each dataset and evaluate the LLM's parsing performance. Table 5(b) reports that GPT3.5 achieves over 96% accuracy in parsing keypoints from text, which shows that the LLM is a good language parser.
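
As a rough illustration of how such a prompt set can be synthesized (the template wording and helper names below are hypothetical, not the released construction scripts), one template and a random non-empty subset of valid keypoints are sampled per instance:

```python
import random

def synthesize_diverse_prompt(templates, keypoint_names, category):
    """Sample one template and a non-empty subset of the valid keypoints of an instance."""
    template = random.choice(templates)
    chosen = random.sample(keypoint_names, random.randint(1, len(keypoint_names)))
    return template.format(keypoint=" and ".join(chosen), object=category)

prompt = synthesize_diverse_prompt(["Where is the {keypoint} for {object}?"],
                                   ["left eye", "nose", "right ear"], "cat")
```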

4 Experiments

4.1 Experimental Settings

Datasets and Splits. 1) The Animal pose dataset [5] contains five mammal species (cat, dog, cow, horse, and sheep) with over 6000 keypoint-annotated instances. Each species is alternately chosen as the unseen species for testing while the remaining four serve as seen species for training, which yields five subproblems; 2) AwA [2] has 35 diverse animal species with 10064 images; 25 species are used for training and 10 for testing; 3) CUB [45] consists of 200 species with 15 keypoint annotations. We use 100 species for training, 50 for validation, and 50 for testing; 4) NABird [42] is larger than CUB, with 555 categories, 11 types of annotated body parts, and 48,562 images. The species split is 333, 111, and 111 for training, validation, and testing, respectively. To test model generalization, we follow [27] and split the keypoints into base and novel sets, reporting the performance of both base and novel keypoint detection on unseen species.

Metric. The percentage of correct keypoints (PCK) is used. A predicted keypoint is correct if its distance to the GT satisfies $d\leq\rho\cdot\max(w_{\text{bbx}},h_{\text{bbx}})$, where $w_{\text{bbx}}$ and $h_{\text{bbx}}$ are the width and height of the object bounding box. Following [27], we set $\rho$ to 0.1.
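
A minimal sketch of this PCK computation is given below (tensor shapes and the optional visibility mask are our assumptions); it simply counts predictions whose distance to the GT is within $\rho\cdot\max(w_{\text{bbx}},h_{\text{bbx}})$.

```python
import torch

def pck(pred, gt, bbox_wh, rho=0.1, visible=None):
    """pred, gt: (N, 2) keypoints; bbox_wh: (2,) object bounding-box width and height."""
    dist = torch.linalg.norm(pred - gt, dim=-1)
    correct = dist <= rho * bbox_wh.max()
    if visible is not None:                      # optionally ignore unannotated keypoints
        correct, n = correct[visible], visible.sum()
    else:
        n = correct.numel()
    return correct.float().sum() / n

score = pck(torch.rand(17, 2) * 200, torch.rand(17, 2) * 200, torch.tensor([180., 220.]))
```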

Implementation Details. The input image size for all models is $384\times 384$. We freeze the CLIP text encoder while fine-tuning the last two layers of the image encoder. The temperature $\tau$ used in the contrastive loss is 0.05. As in [27], the visual auxiliary keypoints are generated along pre-defined interpolation paths and we set $z=0.5$. For text interpolation, to reduce randomness, we use GPT3.5 to reason $R=3$ times for all datasets except CUB, where $R=10$. Our model is trained with 40k episodes and results are reported over 1000 test episodes.

Compared Methods. For few-shot keypoint detection (FSKD), as in previous works, we compare with few-shot learning models such as ProtoNet [36], RelationNet [38], and LwoF [15], and the FSKD-dedicated models FSKD-R/-D [27]. For zero-shot keypoint detection (ZSKD), we adopt the source code of CLAMP [56] and compare against it in our experiments for fairness. Moreover, we build a Baseline that shares the same backbone with our model but does not use the proposed auxiliary texts, contrastive losses, etc. We denote our method as OpenKD.

Table 1: Main results on Animal pose dataset. Each row shows the results tested on the model trained in one setting. Each PCK score is the average of five subproblems.
#   Kp   Aux. kp   Text   Aux. text   1-shot testing (Novel / Base)   0-shot testing (Novel / Base)   1-shot w/ text (Novel / Base)
1   ✓    –         –      –           21.36 / 50.84   1.26 / 1.93   16.16 / 32.81
2   ✓    ✓         –      –           47.54 / 49.45   2.00 / 2.10   39.24 / 34.11
3   –    –         ✓      –           16.18 / 35.46   25.60 / 61.64   27.02 / 61.21
4   –    –         ✓      ✓           15.87 / 31.57   63.14 / 65.31   63.30 / 65.52
5   ✓    –         ✓      –           21.46 / 52.15   22.42 / 61.07   23.88 / 60.49
6   ✓    ✓         ✓      ✓           50.32 / 54.39   63.37 / 65.59   63.19 / 64.93
Table 2: Results on 1-shot keypoint detection for unseen species. The PCK scores on novel and base keypoint detection are reported. +T means adding texts during testing.
Setting   Model   Animal Pose Dataset (Cat / Dog / Cow / Horse / Sheep / Avg)   AwA   CUB   NAB
Novel ProtoNet 19.68 16.18 14.39 12.05 15.06 15.47 29.57 51.32 36.65
RelationNet 22.15 17.19 15.47 13.58 16.55 16.99 20.91 56.59 34.02
LwoF 22.47 19.39 16.82 16.40 16.94 18.40 28.54 54.75 34.19
FSKD-R 46.05 40.66 37.55 38.09 31.50 38.77 51.81 77.90 54.01
FSKD-D 52.36 47.94 44.07 42.77 36.60 44.75 64.76 77.89 56.04
OpenKD 60.36 53.58 47.59 49.01 41.05 50.32 66.71 78.39 53.35
OpenKD+T 69.26 66.81 62.40 63.21 54.27 63.19 79.02 73.29 53.40
Base ProtoNet 45.80 39.83 34.88 35.80 32.33 37.73 57.17 80.36 73.18
RelationNet 51.03 45.85 39.86 41.97 37.19 43.18 57.31 79.40 78.85
LwoF 50.05 44.64 43.47 43.35 37.84 43.87 63.87 81.96 81.39
FSKD-R 57.12 51.12 47.83 49.71 43.71 49.90 65.26 87.94 87.84
FSKD-D 56.38 51.29 48.24 49.77 43.95 49.93 66.39 87.71 86.99
OpenKD 63.61 55.43 51.18 53.87 47.86 54.39 74.22 87.45 85.11
OpenKD+T 70.47 65.09 62.84 66.46 59.81 64.93 84.50 91.81 91.23
Table 3: Results on 0-shot keypoint detection for unseen species. CLAMP* is the variant trained by adding our interpolated auxiliary texts.
Setting   Model   Animal Pose Dataset (Cat / Dog / Cow / Horse / Sheep / Avg)   AwA   CUB   NAB
Novel Baseline 25.64 24.29 25.29 18.14 25.51 23.77 28.95 34.94 30.00
CLAMP 20.90 24.06 27.07 16.86 20.73 21.92 38.96 37.09 18.18
CLAMP* 61.70 58.09 61.42 64.86 53.13 59.84 77.66 69.30 50.81
OpenKD 71.07 66.49 60.30 65.53 53.45 63.37 78.64 70.16 52.29
Base Baseline 58.69 56.62 56.58 60.00 51.35 56.65 73.45 87.12 71.25
CLAMP 60.71 54.04 62.58 62.73 57.30 59.47 84.16 90.97 79.10
CLAMP* 60.30 56.49 62.69 60.90 57.16 59.51 83.76 90.65 83.65
OpenKD 71.34 66.60 64.53 67.26 58.23 65.59 84.32 91.72 90.63

4.2 Main Results

Firstly, we explore single-modal training and the impact of adding auxiliary keypoints/texts (1st-4th rows, Table 1). Using only main keypoints (1st row) or main texts (3rd row), the models perform well on base keypoints in 1-shot (50.84%) or 0-shot (61.64%) testing, but fail to detect novel keypoints. However, when the model is trained with auxiliary keypoints (2nd row) or auxiliary texts (4th row), we observe remarkable improvements in novel keypoint detection, e.g., 47.54% vs. 21.36% in 1-shot testing for the model trained with visual prompts (2nd vs. 1st row), and 63.14% vs. 25.60% in 0-shot testing for the model trained with text prompts (4th vs. 3rd row). These significant gains confirm the benefits of visual interpolation [27] and highlight the effectiveness of our proposed textual interpolation. Secondly, we investigate multimodal training (5th-6th rows, Table 1). Our model performs well under multimodal training (6th row) and greatly outperforms the model trained with multimodal prompts but without auxiliary keypoints and texts (5th row). The higher performance is due to the fact that auxiliary keypoints and texts boost visual and textual semantic diversity, thus enabling novel keypoint detection under both 1-shot and 0-shot settings. Thirdly, we observe that 0-shot testing on base keypoints is considerably higher than 1-shot, which shows that texts can serve as an excellent guiding representation for detecting keypoints. Lastly, we examine multimodal testing, i.e., 1-shot testing with text: our model obtains 63.19% and 64.93% on novel and base keypoints, respectively, which highlights that it combines the advantages of both modalities and mitigates the weaknesses of either one.

Comparisons on FSKD and ZSKD: We comprehensively conduct few-shot keypoint detection across four datasets. Table 2 shows that our OpenKD model greatly outperforms the compared methods in 15 out of 18 tasks when detecting novel or base keypoints. Moreover, if texts are added during testing, the performance of our model can be further improved, yielding 63.19% on the Animal pose dataset (vs. 44.75% of the prior-art FSKD-D) and 79.02% on AwA. We also notice that the improvements on CUB and NAB are modest, which might be because the auxiliary texts are relatively harder to reason about, thus bringing less benefit to the few-shot model. For ZSKD (Table 3), again, our model outperforms the other methods (in 17 out of 18 tasks), and the improvement of CLAMP* on novel keypoint detection strongly shows the benefits of adding auxiliary texts into training.

Figure 5: Different clustering effects of two modalities on base keypoints in unseen species (a) and statistical feature variance per keypoint (b).
Table 4: Ablation study. (a) Intra- and inter-modality contrastive losses, where $\mathcal{L}_{\text{vt}}^{*}$ denotes no stop gradient on text features; (b) text interpolation, where Baseline: w/o auxiliary texts; CoT: chain of thought; Corr: Eq. 9; FTC: sampling with false text control (ours).
(a) AwA   1-shot (Novel / Base)   0-shot (Novel / Base)
1: w/o CL   65.56 / 72.17   76.71 / 81.67
2: $\mathcal{L}_{\text{tt}}$   66.05 / 72.35   78.28 / 84.00
3: $\mathcal{L}_{\text{tt}}+\mathcal{L}_{\text{vt}}^{*}$   66.70 / 73.91   77.10 / 83.85
4: $\mathcal{L}_{\text{tt}}+\mathcal{L}_{\text{vt}}$   66.71 / 74.22   78.64 / 84.32
5: $\mathcal{L}_{\text{tt}}+\mathcal{L}_{\text{vt}}+\mathcal{L}_{\text{vv}}$   66.27 / 74.08   78.60 / 83.42
(b) 0-shot testing   AwA (Novel / Base)   CUB (Novel / Base)
1: Baseline (▲)   34.31 / 83.53   36.89 / 88.85
2: ▲ + w/o CoT   73.80 / 82.79   68.03 / 88.93
3: ▲ + CoT   78.30 / 84.32   68.68 / 88.98
4: ▲ + CoT + Corr   62.12 / 83.13   71.80 / 94.27
5: ▲ + CoT + FTC   78.64 / 84.32   70.16 / 91.72

4.3 Model and Performance Analysis

Why can 0-shot outperform 1-shot? We observe that the 0-shot testing scores on base keypoints are always higher than the 1-shot ones. In this case, the training texts are accurate (whereas the auxiliary texts are noisy) and have high semantic similarity to the testing texts. Digging deeper, we visualize the corresponding text and keypoint features. Fig. 5 shows that the textual keypoint features have better clustering and lower variance than the visual ones, which suggests that the textual representations of base keypoints are better, and thus 0-shot yields higher scores.

Contrastive Learning (CL). After applying $\mathcal{L}_{\text{tt}}$, the 0-shot performance improves significantly, e.g., 84.00% vs. 81.67% (2nd row, Table 4(a)), as CL improves the discrimination between text features. However, when adding visual-textual alignment (3rd row), the 1-shot testing scores improve but the 0-shot scores decrease slightly. We suspect this is due to the imbalance of representation quality between the multimodal features. Adding a stop gradient mitigates the negative impact from the “weaker” modality, thus achieving higher scores (4th row). We also observe that $\mathcal{L}_{\text{vv}}$ does not further improve scores (5th row).

Table 5: (a) Study on the repetition $R$; (b) parsing results with GPT3.5 and Vicuna.
(a) $R$   AwA (Novel / Base)   CUB (Novel / Base)
1   67.34 / 83.98   64.21 / 91.40
3   78.64 (+11.3) / 84.32   69.81 (+5.60) / 92.38
5   76.36 (+9.02) / 83.73   69.81 (+5.60) / 92.02
10   76.61 (+9.27) / 83.36   70.16 (+5.95) / 91.72
(b) Dataset   GPT3.5 (Acc. kp / Acc. obj)   Vicuna (Acc. kp / Acc. obj)
Animal   0.97 / 0.99   0.94 / 0.98
AwA   0.96 / 1.00   0.93 / 0.98
CUB   0.97 / 0.96   0.94 / 1.00
NABird   0.97 / 0.99   0.95 / 1.00
Table 6: Diverse text prompting evaluation. Results are the average of three runs.
Method   Animal Pose (Novel / Base)   AwA (Novel / Base)   CUB (Novel / Base)   NAB (Novel / Base)   Avg Drop (Novel / Base)
OpenKD   63.37 / 65.59   78.64 / 84.32   70.16 / 91.72   52.29 / 90.63   – / –
No parsing   13.89 / 15.18   13.44 / 21.70   22.09 / 17.48   14.55 / 17.62   50.12↓ / 65.07↓
Vicuna   57.54 / 59.84   73.97 / 78.18   67.89 / 87.39   50.45 / 86.59   3.65↓ / 5.06↓
GPT3.5   61.45 / 62.82   76.23 / 80.07   69.39 / 90.65   52.21 / 89.14   1.29↓ / 2.39↓

Text Interpolation. Table 4(b) shows that using chain of thought (CoT) significantly improves performance, especially on AwA (3rd row), as CoT improves the quality of the reasoned texts. Moreover, after applying FTC, the scores strike a good balance across all datasets (5th row), as FTC rejects low-quality texts and further reduces noise by including visual features in the decision. We also explore the repetition $R$ for text reasoning; Table 5(a) shows the importance of setting $R$ larger than 1.

Diverse Text Prompting. To evaluate our model under diverse text prompting, we randomly sample 1000 diverse texts from each dataset and conduct zero-shot testing. As shown in Table 6, after coupling with GPT3.5, our model largely retains its performance even under diverse texts (with less than a 3% drop). The strong results compared to no parsing show that leveraging an LLM as a parser is a viable solution to handling language diversity. Moreover, coupling with GPT3.5 yields higher performance than Vicuna, as GPT3.5 has stronger keypoint text parsing ability, e.g., 96% vs. 93% on AwA (Table 5(b)).

5 Conclusion

We propose to open the prompt diversity in the aspects of modality, semantics, and language to enable more general zero- and few-shot keypoint detection. To this end, we build a versatile OpenKD model which supports both visual and textual prompting. Moreover, to bridge the semantic gap between seen and unseen texts, we propose a novel text interpolation and selection strategy with false text control, which strongly improves zero-shot novel keypoint detection. We also discover that an LLM is a good language parser; after coupling with an LLM, our model can handle diverse texts well. We hope the proposed model, text interpolation, and parsing approach provide useful insights into versatile keypoint detection for the vision community.

Acknowledgment

Changsheng Lu is supported by Australian Government Research Training Program (AGRTP) International Scholarship. Piotr Koniusz is supported by CSIRO’s Science Digital.

References

  • [1] Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al.: Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023)
  • [2] Banik, P., Li, L., Dong, X.: A novel dataset for keypoint detection of quadruped animals from images. arXiv preprint arXiv:2108.13958 (2021)
  • [3] Bohdal, O., Tian, Y., Zong, Y., Chavhan, R., Li, D., Gouk, H., Guo, L., Hospedales, T.: Meta omnium: A benchmark for general-purpose learning-to-learn. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 7693–7703 (2023)
  • [4] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020)
  • [5] Cao, J., Tang, H., Fang, H.S., Shen, X., Lu, C., Tai, Y.W.: Cross-domain adaptation for animal pose estimation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 9498–9507 (2019)
  • [6] Cao, Z., Hidalgo, G., Simon, T., Wei, S.E., Sheikh, Y.: Openpose: realtime multi-person 2d pose estimation using part affinity fields. IEEE transactions on pattern analysis and machine intelligence 43(1), 172–186 (2019)
  • [7] Carreira, J., Agrawal, P., Fragkiadaki, K., Malik, J.: Human pose estimation with iterative error feedback. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 4733–4742 (2016)
  • [8] Cheng, B., Xiao, B., Wang, J., Shi, H., Huang, T.S., Zhang, L.: Higherhrnet: Scale-aware representation learning for bottom-up human pose estimation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 5386–5395 (2020)
  • [9] Chiang, W.L., Li, Z., Lin, Z., Sheng, Y., Wu, Z., Zhang, H., Zheng, L., Zhuang, S., Zhuang, Y., Gonzalez, J.E., et al.: Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. See https://vicuna.lmsys.org (accessed 14 April 2023) (2023)
  • [10] Derpanis, K.G.: The harris corner detector. York University pp. 2–3 (2004)
  • [11] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. In: International Conference on Learning Representations (2021)
  • [12] Fang, H.S., Xie, S., Tai, Y.W., Lu, C.: Rmpe: Regional multi-person pose estimation. In: Proceedings of the IEEE international conference on computer vision. pp. 2334–2343 (2017)
  • [13] Finn, C., Abbeel, P., Levine, S.: Model-agnostic meta-learning for fast adaptation of deep networks. In: International Conference on Machine Learning. pp. 1126–1135. PMLR (2017)
  • [14] Ge, Y., Zhang, R., Luo, P.: Metacloth: Learning unseen tasks of dense fashion landmark detection from a few samples. IEEE Transactions on Image Processing 31, 1120–1133 (2021)
  • [15] Gidaris, S., Komodakis, N.: Dynamic few-shot visual learning without forgetting. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4367–4375 (2018)
  • [16] He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask r-cnn. In: Proceedings of the IEEE international conference on computer vision. pp. 2961–2969 (2017)
  • [17] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 770–778 (2016)
  • [18] Honari, S., Molchanov, P., Tyree, S., Vincent, P., Pal, C., Kautz, J.: Improving landmark localization with semi-supervised learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1546–1555 (2018)
  • [19] Jiao, B., Liu, L., Gao, L., Wu, R., Lin, G., Wang, P., Zhang, Y.: Toward re-identifying any animal. Advances in Neural Information Processing Systems 36 (2024)
  • [20] Koch, G., Zemel, R., Salakhutdinov, R., et al.: Siamese neural networks for one-shot image recognition. In: ICML deep learning workshop. vol. 2. Lille (2015)
  • [21] LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. nature 521(7553), 436–444 (2015)
  • [22] Li, J., Li, D., Xiong, C., Hoi, S.: Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning. pp. 12888–12900. PMLR (2022)
  • [23] Li, S., Gunel, S., Ostrek, M., Ramdya, P., Fua, P., Rhodin, H.: Deformation-aware unpaired image translation for pose estimation on laboratory animals. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13158–13168 (2020)
  • [24] Liu, S., Zeng, Z., Ren, T., Li, F., Zhang, H., Yang, J., Li, C., Yang, J., Su, H., Zhu, J., et al.: Grounding dino: Marrying dino with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499 (2023)
  • [25] Lowe, G.: Sift-the scale invariant feature transform. Int. J 2, 91–110 (2004)
  • [26] Lu, C., Gu, C., Wu, K., Xia, S., Wang, H., Guan, X.: Deep transfer neural network using hybrid representations of domain discrepancy. Neurocomputing 409, 60–73 (2020)
  • [27] Lu, C., Koniusz, P.: Few-shot keypoint detection with uncertainty learning for unseen species. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 19416–19426 (2022)
  • [28] Lu, C., Koniusz, P.: Detect any keypoints: An efficient light-weight few-shot keypoint detector. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 38, pp. 3882–3890 (2024)
  • [29] Lu, C., Zhu, H., Koniusz, P.: From saliency to dino: Saliency-guided vision transformer for few-shot keypoint detection. arXiv preprint arXiv:2304.03140 (2023)
  • [30] Mathis, A., Mamidanna, P., Cury, K.M., Abe, T., Murthy, V.N., Mathis, M.W., Bethge, M.: Deeplabcut: markerless pose estimation of user-defined body parts with deep learning. Nature neuroscience 21(9), 1281–1289 (2018)
  • [31] Moskvyak, O., Maire, F., Dayoub, F., Baktashmotlagh, M.: Semi-supervised keypoint localization. arXiv preprint arXiv:2101.07988 (2021)
  • [32] Mukhoti, J., Lin, T.Y., Poursaeed, O., Wang, R., Shah, A., Torr, P.H., Lim, S.N.: Open vocabulary semantic segmentation with patch aligned contrastive learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 19413–19423 (2023)
  • [33] Newell, A., Yang, K., Deng, J.: Stacked hourglass networks for human pose estimation. In: Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part VIII 14. pp. 483–499. Springer (2016)
  • [34] Pereira, T.D., Aldarondo, D.E., Willmore, L., Kislin, M., Wang, S.S.H., Murthy, M., Shaevitz, J.W.: Fast animal pose estimation using deep neural networks. Nature methods 16(1), 117–125 (2019)
  • [35] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PMLR (2021)
  • [36] Snell, J., Swersky, K., Zemel, R.S.: Prototypical networks for few-shot learning. arXiv preprint arXiv:1703.05175 (2017)
  • [37] Sun, K., Xiao, B., Liu, D., Wang, J.: Deep high-resolution representation learning for human pose estimation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 5693–5703 (2019)
  • [38] Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 1199–1208 (2018)
  • [39] Tang, L., Wertheimer, D., Hariharan, B.: Revisiting pose-normalization for fine-grained few-shot recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 14352–14361 (2020)
  • [40] Tompson, J.J., Jain, A., LeCun, Y., Bregler, C.: Joint training of a convolutional network and a graphical model for human pose estimation. Advances in neural information processing systems 27 (2014)
  • [41] Toshev, A., Szegedy, C.: Deeppose: Human pose estimation via deep neural networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 1653–1660 (2014)
  • [42] Van Horn, G., Branson, S., Farrell, R., Haber, S., Barry, J., Ipeirotis, P., Perona, P., Belongie, S.: Building a bird recognition app and large scale dataset with citizen scientists: The fine print in fine-grained dataset collection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 595–604 (2015)
  • [43] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017)
  • [44] Vinyals, O., Blundell, C., Lillicrap, T., Wierstra, D., et al.: Matching networks for one shot learning. Advances in neural information processing systems 29, 3630–3638 (2016)
  • [45] Wah, C., Branson, S., Welinder, P., Perona, P., Belongie, S.: The Caltech-UCSD Birds-200-2011 Dataset. Tech. Rep. CNS-TR-2011-001, California Institute of Technology (2011)
  • [46] Wang, C., Jin, S., Guan, Y., Liu, W., Qian, C., Luo, P., Ouyang, W.: Pseudo-labeled auto-curriculum learning for semi-supervised keypoint localization. arXiv preprint arXiv:2201.08613 (2022)
  • [47] Wang, L., Huynh, D.Q., Koniusz, P.: A comparative review of recent kinect-based action recognition algorithms. IEEE Transactions on Image Processing 29, 15–28 (2019)
  • [48] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q.V., Zhou, D., et al.: Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems 35, 24824–24837 (2022)
  • [49] Wu, Z., Su, L., Huang, Q.: Stacked cross refinement network for edge-aware salient object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 7264–7273 (2019)
  • [50] Xu, L., Jin, S., Zeng, W., Liu, W., Qian, C., Ouyang, W., Luo, P., Wang, X.: Pose for everything: Towards category-agnostic pose estimation. In: European Conference on Computer Vision. pp. 398–416. Springer (2022)
  • [51] Xu, Y., Zhang, J., Zhang, Q., Tao, D.: Vitpose+: Vision transformer foundation model for generic body pose estimation. arXiv preprint arXiv:2212.04246 (2022)
  • [52] Yang, S., Quan, Z., Nie, M., Yang, W.: Transpose: Keypoint localization via transformer. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 11802–11812 (2021)
  • [53] Yao, Q., Quan, Q., Xiao, L., Kevin Zhou, S.: One-shot medical landmark detection. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part II 24. pp. 177–188. Springer (2021)
  • [54] Zhang, H., Zhang, K., Xu, L., Lai, S., Shao, W., Zheng, N., Luo, P., Qiao, Y.: Language-driven open-vocabulary keypoint detection for animal body and face. arXiv preprint arXiv:2310.05056 (2023)
  • [55] Zhang, N., Donahue, J., Girshick, R., Darrell, T.: Part-based r-cnns for fine-grained category detection. In: Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part I 13. pp. 834–849. Springer (2014)
  • [56] Zhang, X., Wang, W., Chen, Z., Xu, Y., Zhang, J., Tao, D.: Clamp: Prompt-based contrastive learning for connecting language and animal pose. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 23272–23281 (2023)
  • [57] Zhao, S., Gong, M., Zhao, H., Zhang, J., Tao, D.: Deep corner. International Journal of Computer Vision pp. 1–25 (2023)
  • [58] Zheng, L., Chiang, W.L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E.P., Zhang, H., Gonzalez, J.E., Stoica, I.: Judging llm-as-a-judge with mt-bench and chatbot arena (2023)
  • [59] Zhong, Y., Yang, J., Zhang, P., Li, C., Codella, N., Li, L.H., Zhou, L., Dai, X., Yuan, L., Li, Y., et al.: Regionclip: Region-based language-image pretraining. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 16793–16803 (2022)

OpenKD: Opening Prompt Diversity for Zero- and Few-shot Keypoint Detection (Supplementary Material)

Changsheng Lu1 Zheyuan Liu1 Piotr Koniusz2,1

Summary of the supplementary material:

  • §0.A presents more details on how auxiliary texts are interpolated with the LLM. The chain-of-thought prompt and examples of text interpolation are given to showcase the efficacy of the proposed method.

  • §0.B provides examples of diverse text prompts and text parsing, along with more details on the construction of the diverse text prompt sets.

  • §0.C gives additional implementation details.

Table 7: Examples of interpolating texts by prompting GPT3.5 with chain of thought on the Animal pose dataset. We use the API provided by OpenAI to perform the reasoning. All examples are generated automatically, without human intervention.
System Instruction: You are a helpful assistant that produces keypoints of an animal.
User: Q: Please give me one most common body part/keypoint at 1/2 between left-front knee and left-front paw of an animal. Please answer in concise words. Provide no excessive explanations.
A: The starting point is left-front knee. The end point is left-front paw. The answer should be between the starting point and end point. Left-front ankle is between the starting point and end point. The answer is left-front ankle.
Q: Please give me three most common body parts/keypoints at 1/2 between nose and left ear of an animal. Pay attention to the left and right. Please answer in concise words like “1. 2. 3.”. Please do not include nose and left ear in answers. Provide no excessive explanations.
GPT3.5: 1. Left eye ▷ Correct answer, most desirable
2. Left cheek ▷ Correct answer, less desirable
3. Left temple ▷ Correct answer, less desirable
User: Q: Please give me one most common body part/keypoint…
A: The starting point is … The end point is … The answer should be…
Q: Please give me three most common body parts/keypoints at 1/2 between nose and right ear of an animal. Pay attention to the left and right. Please answer in concise words like “1. 2. 3.”. Please do not include nose and right ear in answers. Provide no excessive explanations.
GPT3.5: 1. Right eye. ▷ Correct answer, most desirable
2. Temple. ▷ Correct answer, less desirable
3. Cheek. ▷ Correct answer, less desirable
User: Q: Please give me one most common body part/keypoint…
A: The starting point is … The end point is … The answer should be…
Q: Please give me three most common body parts/keypoints at 1/2 between left-front thigh and left-front paw of an animal. Pay attention to the left and right. Please answer in concise words like “1. 2. 3.”. Please do not include left-front thigh and left-front paw in answers. Provide no excessive explanations.
GPT3.5: 1. Left-front knee. ▷ Correct answer, most desirable
2. Left-front ankle. ▷ Correct answer, less desirable
3. Left-front paw. ▷ Wrong answer, not desirable

Appendix 0.A Details of Text Interpolation

We use the LLM to perform text interpolation $\mathcal{I}^{\text{t}}(z; t(\mathbf{p}_{1}), t(\mathbf{p}_{2}), c)$ by asking it a good question, i.e., by designing a good prompt for the LLM. The detailed vanilla prompt is given as follows:

“Please give me three most common body parts/keypoints at $z$ between $t(\mathbf{p}_{1})$ and $t(\mathbf{p}_{2})$ of $c$. Pay attention to the left and right. Please answer in concise words like “1. 2. 3.”. Please do not include $t(\mathbf{p}_{1})$ and $t(\mathbf{p}_{2})$ in answers. Provide no excessive explanations.”

In the above formulation, $z$ is the interpolation node, $t(\mathbf{p}_{i})$ is the name of the starting or end point, and $c$ is the object name. To produce structured output, several constraints can be appended to the prompt. For example, “Pay attention to the left and right” helps the LLM respect the left/right symmetry of animal anatomy.
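To make this concrete, the following minimal Python sketch fills the vanilla prompt from the interpolation node, the two endpoint names, and the object name; the function name and its arguments are illustrative and not part of our released code.

def build_vanilla_prompt(z: str, kp_start: str, kp_end: str, obj: str) -> str:
    """Fill the vanilla interpolation prompt with the node z, the endpoint names, and the object name."""
    return (
        f"Please give me three most common body parts/keypoints at {z} "
        f"between {kp_start} and {kp_end} of {obj}. "
        "Pay attention to the left and right. "
        'Please answer in concise words like "1. 2. 3.". '
        f"Please do not include {kp_start} and {kp_end} in answers. "
        "Provide no excessive explanations."
    )

# e.g., the midpoint between the nose and the left ear of an animal
print(build_vanilla_prompt("1/2", "nose", "left ear", "an animal"))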

Although the vanilla prompt handles most cases, a more advanced prompt is needed to improve the reasoning of the LLM, since the keypoint types may be complex. Chain-of-Thought (CoT) prompting [48] has been demonstrated to be an effective way to improve LLM reasoning. Inspired by CoT, we propose the following improved prompt:

Q: Please give me one most common body part/keypoint at 1/2 between left-front knee and left-front paw of an animal. Please answer in concise words. Provide no excessive explanations.

A: The starting point is left-front knee. The end point is left-front paw. The answer should be between the starting point and end point. Left-front ankle is between the starting point and end point. The answer is left-front ankle.

Q: Please give me three most common body parts/keypoints at $z$ between $t(\mathbf{p}_{1})$ and $t(\mathbf{p}_{2})$ of $c$. Pay attention to the left and right. Please answer in concise words like “1. 2. 3.”. Please do not include $t(\mathbf{p}_{1})$ and $t(\mathbf{p}_{2})$ in answers. Provide no excessive explanations.

Examples: Table 7 shows three real examples of text interpolation using our improved prompt. Note that all examples are generated sequentially in a single run of our code, without any human intervention. As demonstrated, our approach successfully infers the auxiliary keypoint texts.
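As a hedged illustration of how such a CoT query can be issued programmatically, the sketch below prepends the worked Q/A exemplar to the actual question and sends both to GPT3.5 through the OpenAI chat-completions API; the model name, temperature, and message layout are assumptions for this example rather than a verbatim excerpt of our pipeline.

from openai import OpenAI  # assumes openai>=1.0 and OPENAI_API_KEY set in the environment

client = OpenAI()

SYSTEM = "You are a helpful assistant that produces keypoints of an animal."

# worked Q/A demonstration prepended to every query (chain of thought)
COT_EXEMPLAR = (
    "Q: Please give me one most common body part/keypoint at 1/2 between "
    "left-front knee and left-front paw of an animal. Please answer in concise words. "
    "Provide no excessive explanations.\n"
    "A: The starting point is left-front knee. The end point is left-front paw. "
    "The answer should be between the starting point and end point. "
    "Left-front ankle is between the starting point and end point. "
    "The answer is left-front ankle."
)

def interpolate_with_cot(question: str, model: str = "gpt-3.5-turbo") -> str:
    """Send the CoT exemplar followed by the actual interpolation question to the LLM."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": COT_EXEMPLAR + "\n" + question},
        ],
    )
    return response.choices[0].message.content

# the question string can be produced, e.g., by the vanilla-prompt helper sketched above
question = (
    "Q: Please give me three most common body parts/keypoints at 1/2 between "
    "nose and left ear of an animal. Pay attention to the left and right. "
    'Please answer in concise words like "1. 2. 3.". '
    "Please do not include nose and left ear in answers. Provide no excessive explanations."
)
print(interpolate_with_cot(question))  # e.g. "1. Left eye ..."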

Listing 1: Examples of diverse text prompts on the Animal pose dataset, AwA, CUB, and NABird. For each dataset, we show five sampled diverse texts.
# Animal pose dataset
Please tell me the position of the left-front leg, left ear.
Can you find the left-back leg, left-front leg on cat?
Can you recognize the nose, left-front paw, left ear, left-back paw of dog?
Give me the position of the right-front knee, right-front leg.
Is there a way to pinpointing the left-back paw, right ear, right-back knee on cow?
# AwA dataset
Please pinpointing the front_right_knee, back_left_paw, left_eye, front_right_paw on ox.
Please determine the body_middle_left of otter.
The right_earbase, back_left_thai of wolf.
Give me the position of the back_left_thai, neck_base on zebra.
Please detect front_right_thai of grizzly bear.
# CUB dataset
Can you detect the crown, back, right eye, right wing of carolina wren?
Can you recognize the belly, right eye, right leg, forehead, nape, crown?
Spot the tail, right leg, back.
Where is the nape, right wing, tail, crown of a warbling vireo?
Please identify the throat, nape on brandt cormorant.
# NABird dataset
Can you locate the breast, belly, right eye, right wing, tail, nape on Bullocks Oriole?
Please detect right wing, belly, bill, right eye.
Says Phoebes crown, back, breast, bill, belly, left eye.
Tell me where is the back, crown, nape, right wing of California Gull?
Ovenbirds right eye, crown, belly, back.
Table 8: Examples of parsing texts via GPT3.5.
User: Please extract the animal and keypoint keywords from the below text: “Please tell me the position of the left-front leg, left ear.”. Give the answer in simple words, like “Animal type:, Keypoint part:”. If no animal is mentioned, set animal type to N/A.
GPT3.5: Animal type: N/A; Keypoint part: left-front leg, left ear ▷ Correct parsing
User: Please extract the animal and keypoint keywords from the below text: “Can you find the left-back leg, left-front leg on cat?”. Give the answer in simple words, like “Animal type:, Keypoint part:”. If no animal is mentioned, set animal type to N/A.
GPT3.5: Animal type: cat; Keypoint part: left-back leg, left-front leg ▷ Correct parsing
User: Please extract the animal and keypoint keywords from the below text: “Tell me where is the back, crown, nape, right wing of California Gull?”. Give the answer in simple words, like “Animal type:, Keypoint part:”. If no animal is mentioned, set animal type to N/A.
GPT3.5: Animal type: California Gull; Keypoint part: back, crown, nape, right wing ▷ Correct parsing

Appendix 0.B Details of Diverse Text Prompts

Examples of diverse texts: A glimpse of the diverse text prompts is shown in Listing 1. For brevity, we show five randomly sampled diverse texts per dataset.

Examples of parsing: In our implementation, the detailed parsing prompt to the LLM is as follows:

Please extract the animal and keypoint keywords from the below text: “$\langle\text{text}\rangle$”. Give the answer in simple words, like “Animal type:, Keypoint part:”. If no animal is mentioned, set animal type to N/A.

In the above prompt, $\langle\text{text}\rangle$ is replaced by a specific diverse text prompt. As an example, we select three diverse texts from Listing 1 and parse them with our proposed approach. Table 8 shows that our method extracts the object category and keypoint texts from the diverse text well.
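As a hedged illustration, the sketch below issues the parsing prompt to GPT3.5 through the OpenAI chat-completions API and then splits the reply of the form “Animal type: …; Keypoint part: …” into structured fields; the model name, the regular expressions, and the function name are our assumptions for this example, not an excerpt of the released code.

import re
from openai import OpenAI  # assumes openai>=1.0 and OPENAI_API_KEY set in the environment

client = OpenAI()

PARSE_TEMPLATE = (
    'Please extract the animal and keypoint keywords from the below text: "{text}". '
    'Give the answer in simple words, like "Animal type:, Keypoint part:". '
    "If no animal is mentioned, set animal type to N/A."
)

def parse_prompt(text: str, model: str = "gpt-3.5-turbo"):
    """Ask the LLM to parse a diverse text prompt, then split the reply into object and keypoint names."""
    reply = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{"role": "user", "content": PARSE_TEMPLATE.format(text=text)}],
    ).choices[0].message.content
    # Expected reply format: "Animal type: cat; Keypoint part: left-back leg, left-front leg"
    animal = re.search(r"Animal type:\s*(.*?)(?:;|\n|Keypoint part:|$)", reply)
    keypoints = re.search(r"Keypoint part:\s*(.*)", reply, flags=re.S)
    obj = animal.group(1).strip() if animal else "N/A"
    kps = [k.strip().rstrip(".") for k in keypoints.group(1).split(",")] if keypoints else []
    return obj, kps

print(parse_prompt("Can you find the left-back leg, left-front leg on cat?"))
# expected: ('cat', ['left-back leg', 'left-front leg'])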

Moreover, for Table 5(b) of the main paper, we randomly draw 1000 diverse texts from each dataset and then perform parsing. The parsed results are compared to the ground truth (GT); a parsed text is counted as correct if its IoU with the GT is $\geq 0.9$. “Acc. kp” and “Acc. obj” in Table 5(b) (main paper) refer to the parsing accuracy of the keypoint and object texts, respectively. As one can see, our parsing method with GPT3.5 achieves over 96% accuracy in parsing keypoints from text, which shows that the LLM is a good language parser.
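For reference, a minimal sketch of this evaluation is given below; it assumes the IoU is computed between the set of parsed keypoint names and the GT set for each text, which is our reading of the protocol rather than a verbatim excerpt of the evaluation code.

def set_iou(pred, gt):
    """IoU between the parsed and GT sets of keypoint names."""
    pred, gt = {p.lower().strip() for p in pred}, {g.lower().strip() for g in gt}
    if not pred and not gt:
        return 1.0
    return len(pred & gt) / len(pred | gt)

def parsing_accuracy(parsed_list, gt_list, thresh=0.9):
    """Fraction of texts whose parsed keypoint set matches the GT set with IoU >= thresh."""
    correct = sum(set_iou(p, g) >= thresh for p, g in zip(parsed_list, gt_list))
    return correct / max(len(gt_list), 1)

# toy check on two texts
parsed = [["left eye", "right eye"], ["nose", "left ear"]]
gt = [["left eye", "right eye"], ["nose", "left ear", "right ear"]]
print(parsing_accuracy(parsed, gt))  # 0.5: the second parse has IoU 2/3 < 0.9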

Diverse texts construction: To convincingly evaluate the effectiveness of our approach, we construct four diverse text prompt sets for the popular Animal pose dataset, AwA, CUB, and NABird. Firstly, we manually and meticulously collect 100 diverse text templates that cover most real-world scenarios, as shown in Listing 2. Secondly, for each annotated instance in each dataset, we randomly sample one template and one to $N$ valid keypoints. Lastly, we replace the $\langle\text{keypoint}\rangle$ and $\langle\text{object}\rangle$ placeholders (if present) with the sampled keypoints and the object category. For example, if the sampled template is “where is the $\langle\text{keypoint}\rangle$ for $\langle\text{object}\rangle$?”, the sampled keypoints are “left eye and right eye”, and the object category of the annotated instance is “cat”, then we synthesize the diverse text prompt “where is the left eye and right eye for cat?”. We note that each annotated instance may have a varying number of valid keypoints due to self-occlusion, pose, and appearance, so the synthesized texts are quite diverse. A minimal sketch of this synthesis step is given after Listing 2.

Listing 2: Partial diverse text templates. For brevity, we show 20 templates out of 100.
<obj>’s <keypoint>.
the <keypoint> of <obj>.
detect the <keypoint> of <obj>.
can you detect the <keypoint> of <obj>?
where is the <keypoint> for <obj>?
please detect <keypoint> of <obj>
please identify the <keypoint> on <obj>.
recognize the <keypoint> of <obj>.
please recognize the <keypoint> of <obj>.
Spot the <keypoint> of <obj>.
Locate the <keypoint> on <obj>.
please find the <keypoint> of <obj>.
distinguish the <keypoint> on <obj>.
please determine the <keypoint> of <obj>.
pinpointing the <keypoint> on <obj>.
pick out the <keypoint> on <obj>.
Could you find the <keypoint> on <obj>?
Please make out the <keypoint> on <obj>.
Give me the position of the <keypoint> on <obj>.
Please tell me the position of the <keypoint> on <obj>
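The following Python sketch illustrates the synthesis step described above, using a few templates from Listing 2; the template subset, the function name, and the toy keypoint list are placeholders for the actual per-dataset values, so this is only an illustration of the construction.

import random

# a handful of templates from Listing 2; the full set contains 100 of them
TEMPLATES = [
    "can you detect the <keypoint> of <obj>?",
    "where is the <keypoint> for <obj>?",
    "Give me the position of the <keypoint> on <obj>.",
    "Spot the <keypoint> of <obj>.",
]

def synthesize_prompt(valid_keypoints, obj, n_max):
    """Sample one template and 1..n_max valid keypoints, then fill the placeholders."""
    template = random.choice(TEMPLATES)
    k = random.randint(1, min(n_max, len(valid_keypoints)))
    kps = ", ".join(random.sample(valid_keypoints, k))
    # templates without an <obj> placeholder are returned with only the keypoints filled in
    return template.replace("<keypoint>", kps).replace("<obj>", obj)

print(synthesize_prompt(["left eye", "right eye", "nose", "left ear"], "cat", n_max=3))
# e.g. -> "where is the left eye, nose for cat?"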

Appendix 0.C Additional Implementation Details

By default, the adaptation nets $\mathcal{A}_{\text{v}}$ and $\mathcal{A}_{\text{t}}$ of our OpenKD model use one bottleneck block [17] and one transformer block [43], respectively. The decoder $\mathcal{D}$ uses two convolutional blocks. The upsampler $\mathcal{U}$ upscales the heatmap by a factor of 2. During text sampling with false text control (FTC), we set $R=3$ and $\eta=1$ for the Animal pose and AwA datasets, $R=\eta=3$ for NABird, and $R=\eta=10$ for CUB. For all datasets, we set the threshold $\alpha$ to 0.01. Since the keypoint-text alignment of the adapted features is weak in early training, we use 10k episodes for bootstrapping: before 10k episodes, we adopt the original CLIP image/text features for FTC; after 10k episodes, we switch to the adapted CLIP image/text features. We note that these hyper-parameters are easy to tune and perform well in experiments.
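For clarity, the FTC hyper-parameters and the bootstrapping switch can be summarized as in the sketch below; the dictionary keys and the function name are illustrative, assumed names rather than identifiers from our released code.

# FTC hyper-parameters per dataset and the bootstrapping schedule (values from the text above)
FTC_CONFIG = {
    "animal_pose": {"R": 3,  "eta": 1},
    "awa":         {"R": 3,  "eta": 1},
    "nabird":      {"R": 3,  "eta": 3},
    "cub":         {"R": 10, "eta": 10},
}
ALPHA = 0.01               # threshold alpha, shared by all datasets
BOOTSTRAP_EPISODES = 10_000

def features_for_ftc(episode: int, clip_feats, adapted_feats):
    """Use the original CLIP image/text features for FTC while bootstrapping, then switch to the adapted features."""
    return clip_feats if episode < BOOTSTRAP_EPISODES else adapted_feats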