
Scene-aware Human Pose Generation using Transformer

Jieteng Yao, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China, [email protected]; Junjie Chen, MoE Key Lab of Artificial Intelligence, Shanghai Jiao Tong University, Shanghai, China, [email protected]; Li Niu, MoE Key Lab of Artificial Intelligence, Shanghai Jiao Tong University, Shanghai, China, [email protected]; and Bin Sheng, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China, [email protected]
(2023)
Abstract.

Affordance learning considers the interaction opportunities for an actor in a scene and thus has wide application in scene understanding and intelligent robotics. In this paper, we focus on contextual affordance learning, i.e., using affordance as context to generate a reasonable human pose in a scene. Existing scene-aware human pose generation methods can be divided into two categories depending on whether they use pose templates. Our proposed method belongs to the template-based category, which benefits from representative pose templates. Moreover, inspired by recent transformer-based methods, we associate each query embedding with a pose template, and use the interaction between query embeddings and the scene feature map to effectively predict the scale and offsets for each pose template. In addition, we employ knowledge distillation to facilitate the offset learning given the predicted scale. Comprehensive experiments on the Sitcom dataset demonstrate the effectiveness of our method.

affordance learning, transformer, pose generation
submission id: 3213; journal year: 2023; copyright: acmlicensed; conference: Proceedings of the 31st ACM International Conference on Multimedia, October 29-November 3, 2023, Ottawa, ON, Canada; booktitle: Proceedings of the 31st ACM International Conference on Multimedia (MM '23), October 29-November 3, 2023, Ottawa, ON, Canada; price: 15.00; doi: 10.1145/3581783.3612439; isbn: 979-8-4007-0108-5/23/10; ccs: Computing methodologies, Scene understanding

1. Introduction

Affordance learning (Gibson, 1979) is a long-standing and significant topic, which studies the possible interactions that an actor is allowed to have with the environment. In application, affordance learning covers numerous types of tasks, e.g., contextual affordance learning (Lopes et al., 2007; Ugur et al., 2011), functionality understanding (Zhu et al., 2015; Shiraki et al., 2014), and affordance classification/detection/segmentation (Varadarajan and Vincze, 2013; Grabner et al., 2011; Eigen and Fergus, 2015). In this paper, we focus on contextual affordance learning, which uses affordance as context for specified tasks, e.g., scene-aware human pose generation. Learning such affordance not only allows actors to better interact with the scene, but also offers useful feedback to scene designers. Therefore, contextual affordance learning has profound benefits for content analysis (Gupta et al., 2011), intelligent robotics (Lopes and Santos-Victor, 2005), and scene understanding (Castellini et al., 2011).

Figure 1. (a,b) Given a scene image and a target point, we aim to generate a reasonable human pose at this point. (c) Through the interaction between pose templates and the scene image, our model predicts the pose $\textbf{P}$ and the corresponding compatibility score $c$ for each pose template.

In this work, we focus on learning the affordance between human and scene (Wang et al., 2017), i.e., how we can put a human in a scene. As shown in Figure 1 (a), given a 2D scene image and a target point, we need to generate a human pose that reasonably interacts with the scene at the target point. Scene-aware human pose generation is actually a challenging task, because the model needs to understand the human, the scene, and the interaction between them. For example, in the case of Figure 1 (a), the model should comprehend the function of the sofa, know the possible poses of the human body, place the human body on the sofa, and reasonably adjust the human keypoints according to the geometry of the sofa. As shown in Figure 1 (b), a hypothetical human could sit on the sofa with the back against the backrest and the thighs on the cushion. Despite the significance of this task, there are only a few efforts in this field. Previous methods and intuitive solutions for this task can be divided into two categories depending on whether they use pose templates. Non-template-based methods are similar to pose estimation methods, which directly predict the location of each keypoint. Template-based methods generate poses based on pre-defined templates. Wang et al. (Wang et al., 2017) constructed a dataset by collecting ground-truth poses at target points from video frames of sitcoms and proposed to use a VAE (Kingma and Welling, 2014) to predict the scale and offsets of the target pose relative to the chosen pose template.

Inspired by the great success of transformers in a wide range of computer vision tasks (Dosovitskiy et al., 2021; Carion et al., 2020; Cheng et al., 2021), we propose an end-to-end trainable network based on the transformer. Previous works (Carion et al., 2020; Cheng et al., 2021) use query embeddings to represent potential objects or categories, while we use query embeddings to represent pose templates. As shown in Figure 1 (c), we set a number of pose templates in our model, which are prepared by applying a clustering algorithm to find representative poses in the training set. Each pose template is associated with one template embedding to interact with the scene. Specifically, the $i$-th template embedding represents the $i$-th pose template. Given a 2D scene image, we first extract its feature map via a backbone network and an upsampling module, based on which the compatibility score of each pose template is predicted. Because there might be multiple pose templates that are reasonable for the input scene image, we introduce a self-training strategy for compatibility score learning. Then, each pose template interacts with the scene by feeding its query embedding and the scene feature map into two cascaded transformer modules (Vaswani et al., 2017) to obtain the pose scale and offsets. Specifically, the query embedding first interacts with the whole feature map to predict the scale according to the global context, and then interacts with the local feature map to predict the offsets according to the fine-grained context. In this way, all plausible pose templates directly interact with the scene image in parallel for predicting their scales and offsets.

For model training, we use a binary cross-entropy loss to supervise the compatibility scores, and use MSE losses to supervise the scale and offsets. In order to facilitate the offset learning given the predicted scale, we introduce knowledge distillation into our model. Moreover, we also supervise the refinement of templates that are not the ground-truth but might fit the scene with an adversarial loss.

We conduct comprehensive experiments on Sitcom affordance dataset (Wang et al., 2017) with in-depth analyses to demonstrate the effectiveness of our method. Our contributions can be summarized as: 1) we propose an end-to-end trainable transformer network for scene-aware human pose generation; 2) we propose to facilitate the offset learning with knowledge distillation; 3) extensive experiments on Sitcom affordance dataset show the advantage of our method against state-of-the-art baselines.

2. Related Works

2.1. Affordance Learning

The concept of affordance (Gibson, 1979) was first proposed by the ecological psychologist James Gibson, and describes the "opportunities for interactions" offered by the environment. Recently, affordance learning has found extensive applications, including contextual affordance learning (Lopes et al., 2007; Ugur et al., 2011), functionality understanding (Zhu et al., 2015; Shiraki et al., 2014), affordance classification (Varadarajan and Vincze, 2013; Ugur et al., 2014), affordance detection (Grabner et al., 2011; Moldovan and De Raedt, 2014), affordance segmentation (Eigen and Fergus, 2015; Roy and Todorovic, 2016), etc. Our work is more related to contextual affordance learning, which employs affordance relationships as context for associated tasks. For example, Lopes et al. (Lopes and Santos-Victor, 2005) proposed to use contextual affordance as prior information to enhance gesture recognition and decrease ambiguities according to motor representations. Castellini et al. (Castellini et al., 2011) employed affordance as visual features and motor features to benefit the object recognition task. Gupta et al. (Gupta et al., 2011) learned affordance in indoor images to detect the workspace according to human poses. Recently, Wang et al. (Wang et al., 2017) adopted the scene affordance as context for human pose generation. In this paper, we also focus on generating a scene-aware human pose as in (Wang et al., 2017).

2.2. Object Placement

A human could be considered as a specific type of object, and thus our task is also related to the field of object placement, which is an important subtask of image composition (Niu et al., 2021). Early methods (Tan et al., 2018; Tripathi et al., 2019; Li et al., 2019; Zhang et al., 2020a, b) mainly depended on explicit rules for placing foreground objects in background images, while recent methods (Liu et al., 2021; Zhang et al., 2020a; Tripathi et al., 2019; Zhou et al., 2022) employed deep learning for object placement. To name a few, Lin et al. (Lin et al., 2018b) proposed to use spatial transformer networks to learn geometric corrections that warp composite images to appropriate layouts. PlaceNet (Zhang et al., 2020a) predicted a diverse distribution of common-sense locations to place the given foreground object on the background scene. TERSE (Tripathi et al., 2019) proposed to train synthesizer and target networks in an adversarial manner for plausible object placement. Lee et al. (Lee et al., 2018) employed a two-stage pipeline to determine where to place objects and which classes to place. Azadi et al. (Azadi et al., 2020) proposed a self-consistent composition-by-decomposition network to learn to place objects, based on the insight that a successful composite image could be decomposed back into individual objects. Most recently, Zhou et al. (Zhou et al., 2022) treated object placement as a graph completion problem and proposed a graph completion module to solve the task. The work (Niu et al., 2022) proposed to identify reasonable placements using a discriminative approach, by predicting the rationality scores of all scales and locations. In our task, the model needs to understand the scene and the human-scene interaction, and generate reasonable human poses from a large search space, which is very challenging.

2.3. Human Pose Estimation and Generation

Human pose estimation and generation have extensive applications and thus have attracted wide attention over the past decades. As for human pose estimation, conventional methods (Dantone et al., 2013; Sapp et al., 2010; Sun et al., 2012; Wang and Li, 2013; Wang and Mori, 2008) employed probabilistic graphical models or pictorial structure models to represent the relations between human body parts. Recent human pose estimation methods (Artacho and Savakis, 2020; Cheng et al., 2020; Lin et al., 2018a; Su et al., 2019; Tang et al., 2019; Li et al., 2021) have achieved significant progress by using deep learning models. Technically, there are roughly two groups of pipelines for pose estimation: regression-based (Toshev and Szegedy, 2014; Carreira et al., 2016; Sun et al., 2017; Luvizon et al., 2019) and heatmap-based (Wei et al., 2016; Chu et al., 2017; Newell et al., 2016; Martinez et al., 2017). As for human pose generation, recent works generate human poses conditioned on past poses in video (Walker et al., 2017), textual descriptions (Hong et al., 2022), or video motion (Rempe et al., 2021). Except for (Wang et al., 2017), there are few works generating human pose skeletons conditioned on 2D scene images.

3. Method

Figure 2. Illustration of our method. Our method consists of four modules: the foundation network module, the scale transformer module, the offset transformer module, and the distillation module. The loss functions are marked in red. More details can be found in Section 3.3.

Given a full scene image $\textbf{I}_{full}\in\mathbb{R}^{3\times H\times W}$ and a target point $\textbf{o}\in\mathbb{R}^{2}$, we need to generate a reasonable human pose with its enclosing rectangle centered at $\textbf{o}$. As in (Wang et al., 2017; Andriluka et al., 2014), the human pose $\textbf{P}$ is represented by $M=16$ keypoints (e.g., right ankle, pelvis), that is, $\textbf{P}=[\textbf{p}_{1},\ldots,\textbf{p}_{M}]$ with $\textbf{p}_{m}\in\mathbb{R}^{2}$ being the $m$-th keypoint. The whole pipeline of our method is shown in Figure 2, which uses $K$ pose templates and employs two cascaded transformer modules to model the interaction between each pose template and the input scene to produce the affordance pose. Each component of our method will be introduced in the following.

3.1. Pose Template Construction

Considering that affordance learning in our task is defined as how a hypothetical human could interact with the scene, we construct a number of representative pose templates from the training set to facilitate modeling the interaction between the hypothetical human and the scene image. Specifically, we first normalize all ground-truth (GT) training poses. Then, we apply K-means clustering and further select $K$ representative cluster centers as the $K$ pose templates $\{\textbf{T}_{i}\}_{i=1}^{K}$, by treating poses as $2M$-dim vectors and using the Euclidean distance as the distance metric ($K=14$ by default).

For pose normalization, we define the box centered at $(0,0)$ with side length $1$ as the unit box, denoted as $\textbf{B}_{u}$. We also define a function $\textbf{B}_{e}(\cdot)$ to output the enclosing box of its input pose, i.e., by fetching the maximum and minimum x,y-coordinates of the keypoints in the input pose. For each GT training pose $\textbf{P}^{*}$, we compute its normalized pose by

(1) \mathcal{N}(\textbf{P}^{*})=\mathcal{D}_{\textbf{B}_{e}(\textbf{P}^{*})\rightarrow\textbf{B}_{u}}(\textbf{P}^{*}),

where $\mathcal{D}_{\textbf{B}_{x}\rightarrow\textbf{B}_{y}}(\cdot)$ is a linear deformation function defined by mapping box $\textbf{B}_{x}$ to box $\textbf{B}_{y}$, which can be used to deform the keypoints of the input pose. In this way, each normalized pose has the unit box $\textbf{B}_{u}$ as its enclosing box, i.e., $\textbf{T}_{i}\in[-0.5,0.5]^{2M}$. In our method, these pose templates will be refined into affordance poses according to their interaction with the scene image.
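To make the template construction concrete, below is a minimal sketch of the normalization in Eq. (1) and the clustering step. It assumes the training poses are available as an (N, M, 2) NumPy array and uses scikit-learn's KMeans; the helper names and the direct truncation of cluster centers (the paper further selects K of them manually) are illustrative assumptions rather than the authors' code.

import numpy as np
from sklearn.cluster import KMeans

def normalize_pose(pose):
    # Eq. (1): linearly deform the pose so that its enclosing box becomes
    # the unit box centered at (0, 0) with side length 1.
    box_min, box_max = pose.min(axis=0), pose.max(axis=0)
    center = (box_min + box_max) / 2.0
    size = np.maximum(box_max - box_min, 1e-6)   # avoid division by zero
    return (pose - center) / size                # each coordinate now lies in [-0.5, 0.5]

def build_templates(train_poses, num_clusters=30, num_templates=14):
    # Cluster the normalized poses (flattened to 2M-dim vectors) with K-means.
    # The paper then manually selects K representative centers; here we simply
    # take the first num_templates centers as a placeholder for that selection.
    flat = np.stack([normalize_pose(p).reshape(-1) for p in train_poses])
    km = KMeans(n_clusters=num_clusters, n_init=10, random_state=0).fit(flat)
    centers = km.cluster_centers_.reshape(num_clusters, -1, 2)
    return centers[:num_templates]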

3.2. Scene Image Preparation

When the scene image $\textbf{I}_{full}$ and the target point $\textbf{o}$ are given, we prepare three forms of input images to leverage multi-scale information following (Wang et al., 2017). As shown in Figure 2, the first input image $\textbf{I}_{in_{1}}$ is the whole image, which offers the global context of the scene. The second input image $\textbf{I}_{in_{2}}$ is a square patch centered at $\textbf{o}$ whose side length equals the height of the scene image, which captures the scene context around the target location. The final input image $\textbf{I}_{in_{3}}$ is also a square patch but with half the side length, which provides high-resolution information of the scene context. We re-scale the three images to the same size $H_{in}\times W_{in}$, which is set as $H_{in}=W_{in}=224$ in our implementation. For each sample, the ground-truth label consists of three parts: the class $i^{*}$, the scale $\textbf{s}^{*}$, and the normalized pose $\mathcal{N}(\textbf{P}^{*})$. The class $i^{*}$ is the index of the nearest pose template. We calculate the scale $\textbf{s}^{*}$ as the actual height and width of the GT pose divided by the height of $\textbf{I}_{full}$.
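The cropping and resizing can be sketched as follows with PyTorch tensors. The zero-padding at image borders and the helper names (crop_square, prepare_inputs) are our own simplifying assumptions for illustration, not the released pipeline.

import torch
import torch.nn.functional as F

def crop_square(img, center, side):
    # Crop a side x side square centered at (x, y) from img of shape (3, H, W),
    # zero-padding when the square extends beyond the image border.
    _, H, W = img.shape
    x, y, half = int(center[0]), int(center[1]), side // 2
    pad = (max(0, half - x), max(0, x + half - W),     # left, right
           max(0, half - y), max(0, y + half - H))     # top, bottom
    img = F.pad(img, pad)
    x, y = x + pad[0], y + pad[2]
    return img[:, y - half:y + half, x - half:x + half]

def prepare_inputs(img_full, target, size=224):
    # Build I_in1 (whole image), I_in2 (patch with side = image height) and
    # I_in3 (patch with half that side), all resized to size x size.
    _, H, _ = img_full.shape
    patches = [img_full,
               crop_square(img_full, target, H),
               crop_square(img_full, target, H // 2)]
    return [F.interpolate(p[None].float(), (size, size), mode='bilinear',
                          align_corners=False)[0] for p in patches]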

3.3. Network Architecture

As shown in Figure 2, our network consists of four modules: 1) a foundation network module, which generates feature map of the input scene image and predicts the compatibility score of each pose template; 2) a scale transformer module, which contains several transformer decoder layers as in (Carion et al., 2020; Cheng et al., 2021) and an MLP prediction head; 3) an offset transformer module, which is similar to scale transformer module in architecture; 4) a knowledge distillation module, which is a pre-trained offset regression network.

Foundation network module. The foundation network module consists of a backbone network, an upsampling network, and a classifier. We employ ResNet-18 (He et al., 2016) as our backbone network. Because the spatial resolution of the ResNet-18 output ($7\times 7$) is not sufficient for fine-grained context, we append an upsampling module to yield a relatively high-resolution feature map. Specifically, similar to the pixel decoder in MaskFormer (Cheng et al., 2021), we gradually upsample the feature map and sum it with the intermediate feature map of the corresponding resolution from the backbone network. As aforementioned, for each full scene image, we have three forms of input images $\textbf{I}_{in_{1}}$, $\textbf{I}_{in_{2}}$, and $\textbf{I}_{in_{3}}$, leading to three feature maps $\textbf{F}_{in_{1}}$, $\textbf{F}_{in_{2}}$, and $\textbf{F}_{in_{3}}$ after the foundation network module. Finally, the three feature maps are concatenated and squeezed via a $1\times 1$ convolutional layer to produce the scene feature map $\textbf{F}\in\mathbb{R}^{C\times H_{f}\times W_{f}}$, in which $H_{f}=W_{f}=28$ in our implementation. In this way, the scene feature map contains coarse context, fine context, and global context, which provide complementary information for later scene-aware human pose generation. The feature map passes through a global pooling layer and then a fully-connected layer to produce the binary compatibility score for each of the $K$ pose templates.
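As a rough reference, this module could be organized as below: ResNet-18 features are upsampled and summed with an earlier feature map (in the spirit of the MaskFormer pixel decoder), the three per-view maps are fused by a 1x1 convolution, and a pooled vector feeds the K-way compatibility classifier. The channel widths, the single skip connection, and the class name FoundationNet are assumptions for illustration.

import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class FoundationNet(nn.Module):
    def __init__(self, num_templates=14, out_ch=256):
        super().__init__()
        r18 = torchvision.models.resnet18(weights="IMAGENET1K_V1")
        self.stem = nn.Sequential(r18.conv1, r18.bn1, r18.relu, r18.maxpool,
                                  r18.layer1, r18.layer2)        # 28x28, 128 channels
        self.deep = nn.Sequential(r18.layer3, r18.layer4)        # 7x7, 512 channels
        self.lateral = nn.Conv2d(128, out_ch, 1)
        self.top = nn.Conv2d(512, out_ch, 1)
        self.fuse = nn.Conv2d(3 * out_ch, out_ch, 1)             # merge the three views
        self.cls = nn.Linear(out_ch, num_templates)              # K compatibility logits

    def forward(self, views):            # views: list of three (B, 3, 224, 224) tensors
        maps = []
        for x in views:
            shallow = self.stem(x)                               # (B, 128, 28, 28)
            deep = self.top(self.deep(shallow))                  # (B, out_ch, 7, 7)
            up = F.interpolate(deep, size=shallow.shape[-2:],
                               mode='bilinear', align_corners=False)
            maps.append(up + self.lateral(shallow))              # upsample + skip sum
        feat = self.fuse(torch.cat(maps, dim=1))                 # scene feature map F, 28x28
        scores = torch.sigmoid(self.cls(feat.mean(dim=(2, 3))))  # per-template compatibility
        return feat, scores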

Scale transformer module. Given the feature map $\textbf{F}$ of the scene image and each pose template $\textbf{T}_{i}$, we use a transformer module to model the interaction between pose templates and the scene, which can be formulated as

(2) \textbf{u}_{i}=\mathcal{T}_{s}(\textbf{q}_{i},\textbf{F},\textbf{F}),

where $\mathcal{T}_{s}(query,key,value)$ is a transformer module as in (Carion et al., 2020; Cheng et al., 2021) and $\textbf{u}_{i}\in\mathbb{R}^{d}$ is the interacted embedding decoded from the $i$-th template embedding $\textbf{q}_{i}$. Specifically, the transformer module consists of $3$ transformer decoder layers. In the transformer module, the embedding of each pose template interacts with each pixel of the scene feature map, which facilitates human pose scale prediction through the global context of the whole image. We then use an MLP to predict the scale $\textbf{s}_{i}\in\mathbb{R}^{2}$ (two scale values along the x-axis and y-axis respectively) of each pose template based on the interacted embedding $\textbf{u}_{i}$.
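A minimal sketch of Eq. (2) with a standard PyTorch transformer decoder is given below: the K learnable template embeddings serve as queries, the flattened 28x28 scene feature map serves as key/value, and a small MLP maps each decoded embedding to a scale in [0, 2] (see the range discussion below). The hidden sizes and the use of nn.TransformerDecoder are illustrative assumptions.

import torch
import torch.nn as nn

class ScaleTransformer(nn.Module):
    def __init__(self, num_templates=14, d_model=256, num_layers=3):
        super().__init__()
        self.query = nn.Embedding(num_templates, d_model)        # one query q_i per template
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.head = nn.Sequential(nn.Linear(d_model, 256), nn.ReLU(),
                                  nn.Linear(256, 256), nn.ReLU(),
                                  nn.Linear(256, 2))             # (s_x, s_y)

    def forward(self, feat):                                     # feat: (B, C, H_f, W_f)
        B = feat.size(0)
        mem = feat.flatten(2).transpose(1, 2)                    # (B, H_f*W_f, C) key/value
        q = self.query.weight.unsqueeze(0).expand(B, -1, -1)     # (B, K, C) queries
        u = self.decoder(q, mem)                                 # interacted embeddings u_i
        scale = 2.0 * torch.sigmoid(self.head(u))                # scales constrained to [0, 2]
        return u, scale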

Offset transformer module. After obtaining the predicted scale, we compute the enclosing human bounding box and pay more attention to the features within this bounding box. Specifically, we crop the feature map of the enclosing box region and resize it to the size of the original feature map ($28\times 28$) using a ROIAlign layer. The resized feature map is then used as the key and value of the transformer module, which can be formulated as

(3) \textbf{u}^{\prime}_{i}=\mathcal{T}_{p}(\textbf{u}_{i},\textbf{F}^{\prime},\textbf{F}^{\prime}),

where $\textbf{F}^{\prime}$ is the resized feature map within the human bounding box. The transformer module $\mathcal{T}_{p}$ consists of $1$ transformer decoder layer. Then we use an MLP to predict the coordinate offsets $\bm{\Delta}_{i}\in\mathbb{R}^{2M}$.

Next we show how to obtain the actual output pose prediction given a scene image. As shown in Figure 3, the relative coordinate $-0.5$ (resp., $0.5$) on the $x$-axis corresponds to the left (resp., right) boundary of the input image $\textbf{I}_{in_{2}}$. Similarly, the relative coordinate $-0.5$ (resp., $0.5$) on the $y$-axis corresponds to the bottom (resp., top) boundary of the input image $\textbf{I}_{in_{2}}$. We first add the offsets to the pose template (i.e., $\textbf{T}_{i}+\bm{\Delta}_{i}$) to obtain the raw affordance pose. Then, we normalize this pose by $\mathcal{N}(\textbf{T}_{i}+\bm{\Delta}_{i})$ (see Eq. 1) to ensure that the center of the predicted affordance pose is at the target point $\textbf{o}$. Finally, we scale the pose as $\mathcal{N}(\textbf{T}_{i}+\bm{\Delta}_{i})\cdot\textbf{S}_{i}$, where $\textbf{S}_{i}\in\mathbb{R}^{2M}$ denotes the scales of all $M$ keypoints and all $M$ keypoints share the same $x$,$y$-scale value $\mathbf{s}_{i}\in\mathbb{R}^{2}$. As for the range of offset prediction, we assume that any offset for a normalized pose should not exceed a certain range, which is set to the size of the normalized box, so we constrain the offset $\bm{\Delta}_{i}\in[-0.5,0.5]^{2M}$ for the normalized pose. We also observe that the maximum height of a ground-truth pose does not exceed twice the image height, so we set the scale $\textbf{s}_{i}\in[0,2]^{2}$ to ensure that the final predicted pose is in the range of $[-1,1]^{2M}$. Specifically, we refine each pose template to get the final pose by

(4) \textbf{P}_{i}=\mathcal{D}_{\bm{\Delta}_{i},\textbf{s}_{i}}(\textbf{T}_{i})=\mathcal{N}(\textbf{T}_{i}+\bm{\Delta}_{i})\cdot\textbf{S}_{i}.
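The sketch below illustrates this offset stage and the refinement of Eq. (4): ROIAlign crops the scene feature map to the predicted human box, a single decoder layer produces u'_i, a tanh head keeps the offsets in [-0.5, 0.5], and refine_template applies N(T_i + Delta_i) * S_i. For brevity we crop one box per sample rather than one per template, and the box format follows torchvision's roi_align; both are simplifying assumptions.

import torch
import torch.nn as nn
from torchvision.ops import roi_align

def refine_template(template, offset, scale):
    # Eq. (4): renormalize the offset template and apply the per-axis scale.
    # template, offset: (M, 2); scale: (2,) -> pose relative to the target point.
    raw = template + offset
    box_min, box_max = raw.min(dim=0).values, raw.max(dim=0).values
    normalized = (raw - (box_min + box_max) / 2) / (box_max - box_min)   # N(T_i + Delta_i)
    return normalized * scale                                            # multiply by S_i

class OffsetTransformer(nn.Module):
    def __init__(self, d_model=256, num_keypoints=16):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=1)
        self.head = nn.Sequential(nn.Linear(d_model, 256), nn.ReLU(),
                                  nn.Linear(256, 2 * num_keypoints), nn.Tanh())

    def forward(self, u, feat, boxes):
        # feat: (B, C, 28, 28); boxes: (B, 5) rows of (batch_idx, x1, y1, x2, y2)
        # in feature-map coordinates derived from the predicted scale.
        local = roi_align(feat, boxes, output_size=(28, 28))     # crop + resize to 28x28
        mem = local.flatten(2).transpose(1, 2)                   # (B, 784, C) key/value
        u_prime = self.decoder(u, mem)                           # (B, K, C)
        offsets = 0.5 * self.head(u_prime)                       # Delta_i in [-0.5, 0.5]^{2M}
        return u_prime, offsets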

Distillation module. We introduce a distillation module to improve the performance of offset prediction. Since our model can be seen as a multi-task model, introducing a teacher model that focuses on predicting offsets given the scale and pose template can effectively help our model search for optimal solutions in a smaller range and converge faster. Specifically, we employ ResNet-18 (He et al., 2016) as our teacher network. During the pre-training phase, we first use the ground-truth scale $\textbf{s}^{*}$ to derive the human bounding box, then fit the ground-truth pose template $\textbf{T}_{i^{*}}$ into the bounding box and obtain the heatmap of each keypoint. Finally, we concatenate the heatmaps and the scene image $\textbf{I}_{in_{2}}$ as the model input. The teacher model predicts an offset vector, based on which we obtain the corresponding predicted pose and use an $L_{2}$ loss to supervise the model during pre-training.

During the training phase of our model, the keypoint heatmaps are generated using the ground-truth pose template $\textbf{T}_{i^{*}}$ and the predicted scale $\textbf{s}_{i^{*}}$ from the scale transformer module. Provided with the keypoint heatmaps, we use the fixed pre-trained teacher model to extract the feature vector $\textbf{v}_{i^{*}}$. Recall that in our main network, the offset transformer embedding $\textbf{u}^{\prime}_{i^{*}}$ accounts for offset prediction. Therefore, we realize knowledge distillation by narrowing the $L_{2}$ distance between the offset transformer embedding $\textbf{u}^{\prime}_{i^{*}}$ and $\textbf{v}_{i^{*}}$.
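A possible shape of the teacher and the distillation objective is sketched below: a ResNet-18 whose first convolution is widened to take the M keypoint heatmaps concatenated with I_in2, an offset head used only during its pre-training, and an L2 pull between the student embedding u'_{i*} and the frozen teacher feature v_{i*}. The channel adaptation and the embedding dimension are assumptions.

import torch
import torch.nn as nn
import torchvision

class TeacherNet(nn.Module):
    def __init__(self, num_keypoints=16, embed_dim=256):
        super().__init__()
        r18 = torchvision.models.resnet18(weights=None)
        # accept 3 image channels + M heatmap channels
        r18.conv1 = nn.Conv2d(3 + num_keypoints, 64, 7, stride=2, padding=3, bias=False)
        r18.fc = nn.Linear(512, embed_dim)                           # feature vector v
        self.backbone = r18
        self.offset_head = nn.Linear(embed_dim, 2 * num_keypoints)   # pre-training only

    def forward(self, image_and_heatmaps):                           # (B, 3 + M, 224, 224)
        v = self.backbone(image_and_heatmaps)
        return v, self.offset_head(v)

def distillation_loss(u_prime_gt, teacher_v):
    # L_dis = || u'_{i*} - v_{i*} ||^2 with the teacher kept frozen.
    return ((u_prime_gt - teacher_v.detach()) ** 2).sum(dim=-1).mean()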

3.4. Training with Pose Mining

In the existing Sitcom dataset, each training sample is only annotated with one ground-truth pose, corresponding to one pose template. However, other pose templates might also be plausible for the same scene image. Therefore, we employ self-training to mine more reasonable human poses in a scene. Specifically, we design a multi-stage training strategy. In the first stage, we assign each sample its nearest pose template as its ground-truth class and train our model. After that, we change a label to positive if the trained model of the previous stage predicts a compatibility score higher than a threshold (i.e., $0.7$). The training moves on to the next stage when the test classification accuracy begins to decrease, and stops when the labels no longer change. We apply a binary cross-entropy loss to supervise the predicted compatibility scores:

(5) \mathcal{L}_{\text{cls}}=\frac{1}{K}\sum_{i=1}^{K}w_{i}\,BCE(c_{i},l_{i}),

where $c_{i}$ is the compatibility score for the $i$-th pose template, $l_{i}$ is the label of the $i$-th pose template ($l_{i}=1$ if the $i$-th pose template is positive), $w_{i}$ is the weight to balance the quantity difference between positive and negative samples of the $i$-th pose template in the dataset, and $BCE$ is the binary cross-entropy loss.
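The weighted BCE of Eq. (5) and the self-training label update (threshold 0.7) could look like the sketch below; the per-template positive weights, the loader interface, and the update rule details are written as we read the text and should be taken as assumptions.

import torch
import torch.nn.functional as F

def compatibility_loss(scores, labels, pos_weight):
    # Eq. (5): weighted BCE over the K templates.
    # scores, labels: (B, K); pos_weight: (K,) balancing positive/negative counts.
    bce = F.binary_cross_entropy(scores, labels, reduction='none')
    w = torch.where(labels > 0.5, pos_weight, torch.ones_like(pos_weight))
    return (w * bce).mean()

@torch.no_grad()
def update_labels(model, loader, labels, threshold=0.7):
    # After a training stage, additionally mark a template as positive whenever
    # the previous-stage model predicts a compatibility score above the threshold.
    for idx, views in loader:                 # assumed: (sample indices, input views)
        _, scores = model(views)              # scores: (B, K)
        labels[idx] = torch.maximum(labels[idx], (scores > threshold).float())
    return labels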

Figure 3. Illustration for the process of template refinement.

Although multiple pose templates could be classified as positive, the ground-truth pose and pose template are unique for each sample. Therefore, we only apply the refinement losses to the ground-truth pose template. The loss functions for the two refinement predictions (scale and offsets) can be formulated as

(6) \mathcal{L}_{\text{offset}}=\|\mathcal{N}(\textbf{T}_{i^{*}}+\bm{\Delta}_{i^{*}})-\mathcal{N}(\textbf{P}^{*})\|^{2},
(7) \mathcal{L}_{\text{scale}}=\|\textbf{s}_{i^{*}}-\textbf{s}^{*}\|^{2}.

We apply the distillation loss to the output embedding of ground-truth pose template in offset transformer module, which can be formulated as

(8) \mathcal{L}_{\text{dis}}=\|\textbf{u}^{\prime}_{i^{*}}-\textbf{v}_{i^{*}}\|^{2},

where $\textbf{u}^{\prime}_{i^{*}}$ is the output embedding of the offset transformer decoder and $\textbf{v}_{i^{*}}$ is the output embedding of the teacher model.

Besides the single ground-truth pose template, there could be other positive pose templates. We apply an adversarial loss to these templates to ensure the quality of the output poses. We use $Q_{i}=[\mathcal{N}(\textbf{T}_{i}+\bm{\Delta}_{i}),\textbf{s}_{i},\textbf{f}]$ to denote the concatenation of three vectors, in which $\textbf{f}$ is the feature vector after the average pooling layer in the foundation network module. $Q_{i}$ is fed into the discriminator $D$ as input. The adversarial loss can be formulated as

(9) \mathcal{L}_{\text{adv}}=\mathbb{E}_{Q_{i^{*}}}[\log D(Q_{i^{*}})]+\mathbb{E}_{Q_{\tilde{i}}}[\log(1-D(Q_{\tilde{i}}))],

where the discriminator $D$ is an MLP that predicts whether the input comes from the ground-truth template, and $\tilde{i}$ denotes any index $i$ which satisfies $l_{i}=1$ and $i\neq i^{*}$.
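The adversarial term can be sketched as below, with the MLP discriminator operating on Q_i = [N(T_i + Delta_i), s_i, f] and the positive-but-not-ground-truth templates playing the fake role; the hidden sizes are assumed.

import torch
import torch.nn as nn

class PoseDiscriminator(nn.Module):
    def __init__(self, num_keypoints=16, feat_dim=256):
        super().__init__()
        in_dim = 2 * num_keypoints + 2 + feat_dim        # refined pose + scale + scene feature f
        self.mlp = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                 nn.Linear(256, 256), nn.ReLU(),
                                 nn.Linear(256, 1), nn.Sigmoid())

    def forward(self, q):                                # q: (B, in_dim) concatenated Q_i
        return self.mlp(q)

def adversarial_loss(disc, q_gt, q_other, eps=1e-6):
    # Eq. (9): log D(Q_{i*}) + log(1 - D(Q_{~i})), maximized by D, minimized by the generator.
    return (torch.log(disc(q_gt) + eps).mean()
            + torch.log(1.0 - disc(q_other) + eps).mean())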

Therefore, the total training objective of our model is

(10) \min_{\theta_{G}}\max_{\theta_{D}}\quad\mathcal{L}_{\text{cls}}+\lambda_{\text{o}}\mathcal{L}_{\text{offset}}+\lambda_{\text{s}}\mathcal{L}_{\text{scale}}+\lambda_{\text{adv}}\mathcal{L}_{\text{adv}}+\lambda_{\text{dis}}\mathcal{L}_{\text{dis}},

where $\theta_{D}$ includes the model parameters of the discriminator $D$ and $\theta_{G}$ includes the other model parameters. $\lambda_{\text{o}}$, $\lambda_{\text{s}}$, $\lambda_{\text{adv}}$, and $\lambda_{\text{dis}}$ are hyper-parameters for balancing the loss terms. In our experiments, we set $\lambda_{\text{o}}=10$, $\lambda_{\text{s}}=10$, $\lambda_{\text{adv}}=10$, and $\lambda_{\text{dis}}=1$ via cross-validation.
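One alternating update under Eq. (10), with the weights reported above, could be organized as follows; `losses` is assumed to be a dict of the individual terms computed as in the earlier sketches, and `adversarial_loss` refers to the previous sketch.

def generator_step(opt_g, losses, lam_o=10.0, lam_s=10.0, lam_adv=10.0, lam_dis=1.0):
    # Generator side of Eq. (10): minimize the weighted sum of all loss terms.
    loss_g = (losses['cls'] + lam_o * losses['offset'] + lam_s * losses['scale']
              + lam_adv * losses['adv'] + lam_dis * losses['dis'])
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()
    return loss_g

def discriminator_step(opt_d, disc, q_gt, q_other):
    # Discriminator side: maximize L_adv, i.e., minimize its negative on detached inputs.
    loss_d = -adversarial_loss(disc, q_gt.detach(), q_other.detach())
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()
    return loss_d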

In the inference stage, we sort the predicted poses $\{(c_{i},\bm{\Delta}_{i},\textbf{s}_{i})\}_{i=1}^{K}$ according to their compatibility scores $c_{i}$, and select the poses with the top $k$ compatibility scores as our final prediction.
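Inference then reduces to refining all K templates in parallel and keeping the k highest-scoring poses; the helper below reuses refine_template from the earlier sketch.

import torch

def predict_topk(scores, offsets, scales, templates, k=3):
    # scores: (K,), offsets: (K, M, 2), scales: (K, 2), templates: (K, M, 2).
    top = torch.topk(scores, k).indices
    return [refine_template(templates[i], offsets[i], scales[i]) for i in top]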

4. Experiments

4.1. Dataset

We conduct experiments on the Sitcom affordance dataset (Wang et al., 2017). The training samples are collected from the TV series "How I Met Your Mother", "The Big Bang Theory", "Two and A Half Men", "Everyone Loves Raymond", "Frasier", and "Seinfeld", consisting of 25k accurate poses over 10k different scenes. The test samples are collected from "Friends", consisting of 3k accurate poses over 1k different scenes. Each scene image is annotated with one or more target location points and each target point has one annotated ground-truth pose with 16 keypoints.

Figure 4. The in-depth visual analyses for our model prediction based on pose templates. Column (a) shows the GT pose in the cropped patch of the scene image $\textbf{I}_{2}$. Columns (b-d) show the top-3 predicted pose templates with corresponding compatibility scores. Columns (e-g) show the predicted affordance poses refined from the top-3 pose templates.

4.2. Evaluation

We use three evaluation metrics in our experiments, i.e., PCK (Percentage of Correct Keypoints), MSE (mean-square error), and user study.

PCK is a common metric used in pose estimation (Artacho and Savakis, 2020; Li et al., 2021; Cheng et al., 2020), which indicates the percentage of predicted keypoints that fall within a certain distance threshold of the ground-truth. In our task, we choose PCK@0.2 as in (Artacho and Savakis, 2020), which uses a threshold of $20\%$ of the torso diameter (the distance between the left hip and the right shoulder, corresponding to the 3-rd and the 12-th keypoints in the Sitcom dataset).

MSE measures the mean-square error between the predicted pose and GT pose. Specifically, we compute the mean-square error using the normalized coordinates which are divided by the heights of corresponding images, considering that the sizes of testing images may be quite different.
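For reference, the two automatic metrics could be computed as below; the keypoint indices for the torso diameter follow the description above, but the exact (0- or 1-based) indexing convention of the dataset is an assumption.

import torch

def pck_at_02(pred, gt, hip_idx=3, shoulder_idx=12):
    # pred, gt: (M, 2) poses in pixel coordinates; threshold = 20% of the torso diameter.
    torso = torch.norm(gt[hip_idx] - gt[shoulder_idx])
    dist = torch.norm(pred - gt, dim=1)
    return (dist < 0.2 * torso).float().mean()

def normalized_mse(pred, gt, image_height):
    # MSE on coordinates normalized by the image height.
    return (((pred - gt) / image_height) ** 2).mean()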

User study is conducted with 50 voluntary participants by comparing the poses generated by different methods. For each test sample, every participant chooses the method producing the most reasonable pose. The score of each method is defined as the frequency that this method is chosen as the best one.

4.3. Implementation Details

For the network architecture, we use ResNet-18 (He et al., 2016) pretrained on ImageNet (Deng et al., 2009) as our backbone network and follow the transformer module setting in (Cheng et al., 2021) for the two transformer modules. For the MLPs in the prediction heads of the two transformer modules, the first 2 hidden layers of the offset and scale prediction heads have a hidden dimension of 256 with the ReLU activation function. The last layer of the offset prediction head is followed by a tanh function, while the last layer of the scale prediction head is followed by a sigmoid function. During the training stage, we apply the SGD optimizer with learning rate $10^{-4}$ and weight decay $10^{-4}$. A learning rate multiplier of 0.1 is applied to the backbone network and 1.0 to the other modules of the model. The batch size is set to 8. For the system environment, we use Python 3.7 and PyTorch 1.8.0 (Paszke et al., 2017). We conduct experiments on Ubuntu 18.04 with an Intel(R) Xeon(R) CPU E5-2678 v3 @ 2.50GHz and four NVIDIA GeForce GTX 1080 Ti GPUs. The random seed is set to 0 for all experiments unless stated otherwise.

     Model      Top-1 PCK\uparrow      Top-3 PCK\uparrow      Top-5 PCK\uparrow      Top-1 MSE\downarrow      Top-3 MSE\downarrow      Top-5 MSE\downarrow      User Study\uparrow
     Heatmap      0.363      -      -      53.45      -      -      0.025
     Regression      0.386      -      -      51.29      -      -      0.037
     Wang et al. (Wang et al., 2017)      0.401      0.432      0.458      46.65      44.49      43.17      0.225
     Unipose (Artacho and Savakis, 2020)      0.387      -      -      46.78      -      -      0.071
     PRTR (Li et al., 2021)      0.408      -      -      45.72      -      -      0.127
     PlaceNet (Zhang et al., 2020a)      0.060      0.082      0.107      377.78      375.92      371.60      0
     GracoNet (Zhou et al., 2022)      0.380      0.472      0.473      49.60      43.48      43.35      0.110
     Ours      0.414      0.498      0.533      44.86      41.91      39.42      0.405
Table 1. The comparison of different methods on Sitcom affordance dataset. The best results are highlighted in boldface.
Figure 5. Visual comparison of various methods on the Sitcom affordance dataset. In column (a), we plot the GT pose in the cropped scene image $\textbf{I}_{2}$. In columns (b-m), we depict the affordance poses predicted by various methods.

4.4. Qualitative Analyses

To provide in-depth analyses of how our model performs, we visualize the top-3 confident pose templates and the corresponding generated poses in Figure 4. From the visualizations, we can see that our model is able to not only choose pose templates suitable for the input scene image but also refine them into reasonable affordance poses. Taking the first row as an example, our model produces high compatibility scores for all sitting pose templates, and then refines them into affordance poses naturally sitting on the sofa. More qualitative analyses are left to the Supplementary.

4.5. Comparison with Prior Works

Baseline Setting. We compare with the following representative baselines: two simple baselines, Wang et al. (Wang et al., 2017), pose estimation methods, and object placement methods.

1) Heatmap is a basic method producing keypoint heatmaps. Specifically, we concatenate the three feature maps $\textbf{F}_{in_{1}}$, $\textbf{F}_{in_{2}}$, and $\textbf{F}_{in_{3}}$ after the ResNet-18 (He et al., 2016) backbone network and the upsampling module with an upsampled resolution of $56\times 56$. Then, the concatenated feature maps are fed into a $1\times 1$ convolution layer to produce an $M$-channel keypoint heatmap.

2) Regression is also an intuitive method that directly predicts the coordinates of human pose joints. We perform spatial average pooling on the concatenated feature maps, which are the same as those in the Heatmap baseline, and send the pooled feature to a fully-connected layer to produce a $2M$-dim vector.

3) Wang et al. (Wang et al., 2017) is a two-stage approach for scene-aware human pose generation based on pose templates. In the first stage, it classifies the scene into one of $14$ pose templates, which are the same as ours, by sending the concatenated feature maps to a spatial average pooling layer and a linear classifier. In the second stage, it employs a VAE (Kingma and Welling, 2014) to produce the deformations based on the latent code, the one-hot pose label vector, and the concatenated feature maps.

4) Unipose (Artacho and Savakis, 2020) and PRTR (Li et al., 2021) are pose estimation methods. Unipose (Artacho and Savakis, 2020) is a unified framework based on the "Waterfall" Atrous Spatial Pooling architecture. PRTR (Li et al., 2021) is the first transformer-based model used in pose estimation tasks, which regards query embeddings as keypoint hypotheses and generates the final pose by finding a match.

5) PlaceNet (Zhang et al., 2020a) and GracoNet (Zhou et al., 2022) are two state-of-the-art methods for the object placement task. To adapt them to our task, we remove the foreground encoder in these models and replace the foreground feature vector with a learnable embedding.

In these methods, we use the same backbone as ours to extract the image feature map.

Quantitative Comparison. We first conduct a quantitative comparison to precisely measure the discrepancies from the GT poses in the test set. Specifically, for our method and Wang et al. (Wang et al., 2017), we obtain the top-$k$ confident predicted poses with the $k$ highest compatibility scores and calculate their PCK/MSE, after which the predicted pose achieving the optimal PCK/MSE is used for evaluation. For PlaceNet (Zhang et al., 2020a) and GracoNet (Zhou et al., 2022), multiple results can only be generated by sampling random vectors, so we sample $k$ times to obtain $k$ results and choose the predicted pose achieving the optimal PCK/MSE for evaluation. Considering that the other baselines cannot produce multiple confident predicted poses, we only report their top-$1$ metrics. We summarize all the results in Table 1. Firstly, directly applying the simple methods (i.e., Heatmap and Regression) achieves only barely satisfactory performance. PlaceNet (Zhang et al., 2020a) gets an inferior result because there is no reconstruction loss between the generated pose and the ground-truth pose during training. The performances of GracoNet (Zhou et al., 2022) and Unipose (Artacho and Savakis, 2020) are hardly improved compared to Regression, indicating that these methods may need further adaptation to fit this task. Wang et al. (Wang et al., 2017) and PRTR (Li et al., 2021) produce better results than the above baselines, which demonstrates the feasibility of template-based methods and of transformers in this task, respectively. Finally, our method outperforms all the baseline methods, especially in the top-$3$ and top-$5$ predictions.

Qualitative Comparison. Secondly, we conduct a qualitative comparison of the visual rationality of various methods in different scenes. We present representative baselines, including Heatmap, Regression, PRTR (Li et al., 2021), Wang et al. (Wang et al., 2017), and GracoNet (Zhou et al., 2022). We show the top-3 poses for the methods which can produce multiple results. As shown in Figure 5, our method produces overall better affordance poses in various scene images. Specifically, the Heatmap and Regression baselines and GracoNet (Zhou et al., 2022) are prone to producing a standing human pose. One possible reason is that the standing pose is the most common case, so they lack explicit knowledge about various human poses and tend to produce the average pose. PRTR (Li et al., 2021) is better to some extent, but its predicted poses often lack integrity and rationality, probably because it focuses more attention on the location of each single keypoint. Wang et al. (Wang et al., 2017) outperforms PRTR (Li et al., 2021) and GracoNet (Zhou et al., 2022), producing more reasonable human poses thanks to the representative pose templates. Our model further generates more diverse and plausible results. More qualitative comparisons are left to the Supplementary.

# templates      K'=10      K'=20      K'=30      K'=40      K=20      K=14      K=10      K=7
Top-3 PCK\uparrow      0.444      0.466      0.462      0.447      0.476      0.498      0.493      0.479
Top-5 PCK\uparrow      0.477      0.495      0.505      0.486      0.511      0.533      0.520      0.513
Top-3 MSE\downarrow      48.37      46.65      46.78      50.92      44.14      41.91      42.32      42.84
Top-5 MSE\downarrow      47.39      44.00      45.23      49.02      42.75      39.42      40.23      41.33
Table 2. The performances of using different numbers of pose templates. $K^{\prime}$ denotes the number of templates obtained from K-means, and $K$ denotes the number of templates obtained by further selection.
     classification      scale      offset      Top-3 PCK\uparrow      Top-5 PCK\uparrow      Top-3 MSE\downarrow      Top-5 MSE\downarrow
1      u      u      u      0.403      0.475      49.46      43.49
2      f      u      u      0.461      0.493      46.66      43.15
3      f+ST      u      u      0.475      0.496      44.54      43.07
4      f+ST      u      u'      0.484      0.512      42.89      42.08
5      f+ST      u      u'+DS      0.498      0.533      41.91      39.42
Table 3. Ablation study on different modules. ST denotes self-training, DS denotes knowledge distillation.
scale\mathcal{L}_{\text{scale}} offset\mathcal{L}_{\text{offset}} dis\mathcal{L}_{\text{dis}} adv\mathcal{L}_{\text{adv}} Top-3 PCK\uparrow Top-5 PCK\uparrow Top-3 MSE\downarrow Top-5 MSE\downarrow
1 0.454 0.491 46.03 43.44
2 0.456 0.508 45.93 43.05
3 0.475 0.511 44.91 42.61
4 0.484 0.512 42.89 42.08
5 0.484 0.510 43.12 42.52
6 0.498 0.533 41.91 39.42
Table 4. Ablation study on loss functions.

4.6. Pose Template Analyses

The pose templates play a pivotal role in our method. In order to investigate their impact and characteristics, we conduct a series of experiments. We first group all annotated training poses into 10, 20, 30, and 40 clusters using the K-means algorithm, respectively. Then, based on the 30 clusters, we further select 20, 14, 10, and 7 representative clusters. We use $K^{\prime}$ to denote the number of templates derived by K-means and $K$ to denote the number of templates derived by selection. We summarize the results for various values of $K^{\prime}$ in Table 2, from which we observe that as $K^{\prime}$ increases, the performance first increases and then decreases. The cause may be that too few pose templates increase the refinement burden of the model, while an excessive number of pose templates renders the classification process more challenging. In the manual selection cases, the performance first increases as $K$ decreases from 20 to 14, while decreasing $K$ below 14 leads to a drop in model performance. This is probably because the templates are redundant when $K$ is greater than 14, while decreasing $K$ below 14 leads to the loss of semantic categories so that the remaining categories cannot cover all the data. With an equivalent number of templates, manual selection yields better performance. We infer that this is probably because the results of the K-means algorithm lack semantic distinctiveness, in which some pose templates share highly similar semantics.

4.7. Ablation Study

Utility of Different Modules. As introduced in Section 3.3, we use two cascaded transformer modules to predict scale and offsets in sequence and predict the compatibility score based on global feature vector. Additionally, we incorporate knowledge distillation and self-training to facilitate learning. To validate the effectiveness of these modules, we conduct the following experiments using different degraded versions of our model. Firstly, we use a single transformer to perform all the classification, scale and offset prediction tasks (row 1). Then, we decouple the classification task and predict the compatibility score based on the feature vector (row 2). We further employ self-training for the classification task (row 3). Next, we introduce a new transformer with local feature map of human bounding box as input to predict offsets (row 4). Finally, we add knowledge distillation to help predict the offsets (row 5). The results are shown in Table 3, from which we could observe that each change results in a performance improvement in the model.

Utility of Different Loss Functions. We conduct experiments on different loss functions. The results are shown in Table 4. We observe that each loss term improves the performance. Generally, each loss makes up an important part, and they work together to guarantee the effectiveness of our method.

5. Conclusion

In this paper, we focus on contextual affordance learning, which requires reasonably placing a human pose in a scene. Technically, we propose a transformer-based end-to-end trainable framework, which is based on pre-defined pose templates. We also introduce knowledge distillation to effectively facilitate the offset learning. Extensive experiments and in-depth analyses on the Sitcom dataset have demonstrated the effectiveness of our proposed framework for scene-aware human pose generation.

Acknowledgements.
The work was supported by the Shanghai Municipal Science and Technology Major/Key Project, China (Grant No. 2021SHZDZX0102, Grant No. 20511100300), and National Natural Science Foundation of China (Grant No. 62272298).

References

  • Andriluka et al. (2014) Mykhaylo Andriluka, Leonid Pishchulin, Peter Gehler, and Bernt Schiele. 2014. 2D Human Pose Estimation: New Benchmark and State of the Art Analysis. In CVPR.
  • Artacho and Savakis (2020) Bruno Artacho and Andreas Savakis. 2020. Unipose: Unified human pose estimation in single images and videos. In CVPR.
  • Azadi et al. (2020) Samaneh Azadi, Deepak Pathak, Sayna Ebrahimi, and Trevor Darrell. 2020. Compositional gan: Learning image-conditional binary composition. International Journal of Computer Vision 128, 10 (2020), 2570–2585.
  • Carion et al. (2020) Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. 2020. End-to-end object detection with transformers. In ECCV.
  • Carreira et al. (2016) Joao Carreira, Pulkit Agrawal, Katerina Fragkiadaki, and Jitendra Malik. 2016. Human pose estimation with iterative error feedback. In CVPR.
  • Castellini et al. (2011) Claudio Castellini, Tatiana Tommasi, Nicoletta Noceti, Francesca Odone, and Barbara Caputo. 2011. Using object affordances to improve object recognition. IEEE transactions on autonomous mental development 3, 3 (2011), 207–215.
  • Cheng et al. (2021) Bowen Cheng, Alex Schwing, and Alexander Kirillov. 2021. Per-Pixel Classification is Not All You Need for Semantic Segmentation. In NIPS.
  • Cheng et al. (2020) Bowen Cheng, Bin Xiao, Jingdong Wang, Honghui Shi, Thomas S Huang, and Lei Zhang. 2020. Higherhrnet: Scale-aware representation learning for bottom-up human pose estimation. In CVPR.
  • Chu et al. (2017) Xiao Chu, Wei Yang, Wanli Ouyang, Cheng Ma, Alan L Yuille, and Xiaogang Wang. 2017. Multi-context attention for human pose estimation. In CVPR.
  • Dantone et al. (2013) Matthias Dantone, Juergen Gall, Christian Leistner, and Luc Van Gool. 2013. Human pose estimation using body parts dependent joint regressors. In CVPR.
  • Deng et al. (2009) Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. Imagenet: A large-scale hierarchical image database. In CVPR.
  • Dosovitskiy et al. (2021) Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2021. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. ICLR (2021).
  • Eigen and Fergus (2015) David Eigen and Rob Fergus. 2015. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In ICCV.
  • Gibson (1979) JJ Gibson. 1979. The Ecological Approach to Visual Perception. Houghton Mifflin Comp (1979).
  • Grabner et al. (2011) Helmut Grabner, Juergen Gall, and Luc Van Gool. 2011. What makes a chair a chair?. In CVPR.
  • Gupta et al. (2011) Abhinav Gupta, Scott Satkin, Alexei A Efros, and Martial Hebert. 2011. From 3d scene geometry to human workspace. In CVPR.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In CVPR.
  • Hong et al. (2022) Fangzhou Hong, Mingyuan Zhang, Liang Pan, Zhongang Cai, Lei Yang, and Ziwei Liu. 2022. AvatarCLIP: Zero-Shot Text-Driven Generation and Animation of 3D Avatars. ACM Transactions on Graphics (TOG) 41, 4 (2022), 1–19.
  • Kingma and Welling (2014) Diederik P Kingma and Max Welling. 2014. Auto-encoding variational Bayes. In ICLR.
  • Lee et al. (2018) Donghoon Lee, Sifei Liu, Jinwei Gu, Ming-Yu Liu, Ming-Hsuan Yang, and Jan Kautz. 2018. Context-aware Synthesis and Placement of Object Instances. In NIPS.
  • Li et al. (2021) Ke Li, Shijie Wang, Xiang Zhang, Yifan Xu, Weijian Xu, and Zhuowen Tu. 2021. Pose recognition with cascade transformers. In CVPR.
  • Li et al. (2019) Xueting Li, Sifei Liu, Kihwan Kim, Xiaolong Wang, Ming-Hsuan Yang, and Jan Kautz. 2019. Putting humans in a scene: Learning affordance in 3d indoor environments. In CVPR.
  • Lin et al. (2018b) Chen-Hsuan Lin, Ersin Yumer, Oliver Wang, Eli Shechtman, and Simon Lucey. 2018b. St-gan: Spatial transformer generative adversarial networks for image compositing. In CVPR.
  • Lin et al. (2018a) Kyaw Zaw Lin, Weipeng Xu, Qianru Sun, Christian Theobalt, and Tat-Seng Chua. 2018a. Learning a disentangled embedding for monocular 3d shape retrieval and pose estimation. arXiv preprint arXiv:1812.09899 (2018).
  • Liu et al. (2021) Liu Liu, Bo Zhang, Jiangtong Li, Li Niu, Qingyang Liu, and Liqing Zhang. 2021. OPA: Object Placement Assessment Dataset. arXiv preprint arXiv:2107.01889 (2021).
  • Lopes et al. (2007) Manuel Lopes, Francisco S Melo, and Luis Montesano. 2007. Affordance-based imitation learning in robots. In IROS.
  • Lopes and Santos-Victor (2005) Manuel Lopes and José Santos-Victor. 2005. Visual learning by imitation with motor representations. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 35, 3 (2005), 438–449.
  • Luvizon et al. (2019) Diogo C Luvizon, Hedi Tabia, and David Picard. 2019. Human pose regression by combining indirect part detection and contextual information. Computers & Graphics 85 (2019), 15–22.
  • Martinez et al. (2017) Julieta Martinez, Rayat Hossain, Javier Romero, and James J Little. 2017. A simple yet effective baseline for 3d human pose estimation. In ICCV.
  • Moldovan and De Raedt (2014) Bogdan Moldovan and Luc De Raedt. 2014. Occluded object search by relational affordances. In ICRA.
  • Newell et al. (2016) Alejandro Newell, Kaiyu Yang, and Jia Deng. 2016. Stacked hourglass networks for human pose estimation. In ECCV.
  • Niu et al. (2021) Li Niu, Wenyan Cong, Liu Liu, Yan Hong, Bo Zhang, Jing Liang, and Liqing Zhang. 2021. Making Images Real Again: A Comprehensive Survey on Deep Image Composition. arXiv preprint arXiv:2106.14490 (2021).
  • Niu et al. (2022) Li Niu, Qingyang Liu, Zhenchen Liu, and Jiangtong Li. 2022. Fast Object Placement Assessment. arXiv preprint arXiv:2205.14280 (2022).
  • Paszke et al. (2017) Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in pytorch. (2017).
  • Rempe et al. (2021) Davis Rempe, Tolga Birdal, Aaron Hertzmann, Jimei Yang, Srinath Sridhar, and Leonidas J Guibas. 2021. Humor: 3d human motion model for robust pose estimation. In ICCV.
  • Roy and Todorovic (2016) Anirban Roy and Sinisa Todorovic. 2016. A multi-scale cnn for affordance segmentation in rgb images. In ECCV.
  • Sapp et al. (2010) Benjamin Sapp, Alexander Toshev, and Ben Taskar. 2010. Cascaded models for articulated pose estimation. In ECCV.
  • Shiraki et al. (2014) Yohei Shiraki, Kazuyuki Nagata, Natsuki Yamanobe, Akira Nakamura, Kensuke Harada, Daisuke Sato, and Dragomir N Nenchev. 2014. Modeling of everyday objects for semantic grasp. In RO-MAN.
  • Su et al. (2019) Kai Su, Dongdong Yu, Zhenqi Xu, Xin Geng, and Changhu Wang. 2019. Multi-person pose estimation with enhanced channel-wise and spatial information. In CVPR.
  • Sun et al. (2012) Min Sun, Pushmeet Kohli, and Jamie Shotton. 2012. Conditional regression forests for human pose estimation. In CVPR.
  • Sun et al. (2017) Xiao Sun, Jiaxiang Shang, Shuang Liang, and Yichen Wei. 2017. Compositional human pose regression. In ICCV.
  • Tan et al. (2018) Fuwen Tan, Crispin Bernier, Benjamin Cohen, Vicente Ordonez, and Connelly Barnes. 2018. Where and who? automatic semantic-aware person composition. In WACV.
  • Tang et al. (2019) Kaihua Tang, Hanwang Zhang, Baoyuan Wu, Wenhan Luo, and Wei Liu. 2019. Learning to compose dynamic tree structures for visual contexts. In CVPR.
  • Toshev and Szegedy (2014) Alexander Toshev and Christian Szegedy. 2014. Deeppose: Human pose estimation via deep neural networks. In CVPR.
  • Tripathi et al. (2019) Shashank Tripathi, Siddhartha Chandra, Amit Agrawal, Ambrish Tyagi, James M Rehg, and Visesh Chari. 2019. Learning to generate synthetic data via compositing. In CVPR.
  • Ugur et al. (2011) Emre Ugur, Erhan Oztop, and Erol Şahin. 2011. Going beyond the perception of affordances: Learning how to actualize them through behavioral parameters. In ICRA.
  • Ugur et al. (2014) Emre Ugur, Sandor Szedmak, and Justus Piater. 2014. Bootstrapping paired-object affordance learning with learned single-affordance features. In ICDL-EPIROB.
  • Varadarajan and Vincze (2013) Karthik Mahesh Varadarajan and Markus Vincze. 2013. Parallel deep learning with suggestive activation for object category recognition. In ICVS.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In NIPS.
  • Walker et al. (2017) Jacob Walker, Kenneth Marino, Abhinav Gupta, and Martial Hebert. 2017. The pose knows: Video forecasting by generating pose futures. In ICCV.
  • Wang and Li (2013) Fang Wang and Yi Li. 2013. Beyond physical connections: Tree models in human pose estimation. In CVPR.
  • Wang et al. (2017) Xiaolong Wang, Rohit Girdhar, and Abhinav Gupta. 2017. Binge watching: Scaling affordance learning from sitcoms. In CVPR.
  • Wang and Mori (2008) Yang Wang and Greg Mori. 2008. Multiple tree models for occlusion and spatial constraints in human pose estimation. In ECCV.
  • Wei et al. (2016) Shih-En Wei, Varun Ramakrishna, Takeo Kanade, and Yaser Sheikh. 2016. Convolutional pose machines. In CVPR.
  • Zhang et al. (2020a) Lingzhi Zhang, Tarmily Wen, Jie Min, Jiancong Wang, David Han, and Jianbo Shi. 2020a. Learning object placement by inpainting for compositional data augmentation. In ECCV.
  • Zhang et al. (2020b) Song-Hai Zhang, Zheng-Ping Zhou, Bin Liu, Xi Dong, and Peter Hall. 2020b. What and where: A context-based recommendation system for object insertion. Computational Visual Media 6, 1 (2020), 79–93.
  • Zhou et al. (2022) Siyuan Zhou, Liu Liu, Li Niu, and Liqing Zhang. 2022. Learning Object Placement via Dual-Path Graph Completion. In ECCV.
  • Zhu et al. (2015) Yixin Zhu, Yibiao Zhao, and Song Chun Zhu. 2015. Understanding tools: Task-oriented object modeling, learning and recognition. In CVPR.