Enhancing Weakly-Supervised Object Detection on Static Images through (Hallucinated) Motion
Abstract
While motion has garnered attention in various tasks, its potential as a modality for weakly-supervised object detection (WSOD) in static images remains unexplored. Our study introduces an approach to enhance WSOD methods by integrating motion information. This method involves leveraging hallucinated motion from static images to improve WSOD on image datasets, utilizing a Siamese network for enhanced representation learning with motion, addressing camera motion through motion normalization, and selecting training images based on object motion. Experimental validation on the COCO and YouTube-BB datasets demonstrates improvements over a state-of-the-art method.
1 Introduction
Weakly-supervised object detection (WSOD) faces the challenge of determining which instances correspond to the image-level training labels, and traditional methods [1, 12, 15, 16] rely primarily on appearance information in RGB images. While static appearance is an appropriate foundation, its limitations become evident in dynamic scenarios. Incorporating a modality that captures motion and temporal dynamics provides distinct information that complements appearance. This integration presents an opportunity to improve object localization, particularly for objects with dynamic behaviors, by leveraging the temporal cues that motion offers. For example, the motion of a car might indicate its trajectory, speed, or interaction with other objects, providing context beyond what static appearance can offer.
Our ultimate goal is to enhance object detection performance on static images in the COCO dataset by leveraging motion. As a preliminary step, we present a proof of concept on the YouTube-BB video dataset, where real motion exists between frames. Our methodology introduces a Siamese WSOD network with contrastive learning, integrating motion to enhance representation learning during training. We employ motion normalization to reduce camera motion in the video dataset, ensuring more reliable motion information. Moreover, we strategically select training images based on object motion, so that the training set contains images with pronounced and meaningful motion, amplifying the potential influence of motion while minimizing noise from low-quality motion and images with limited motion. Lastly, we extract hallucinated motion from static images, supporting our ultimate goal: that motion can enhance object detection even in static images.

In prior work, W-RPN [13] improves proposal quality in WSOD through motion by training an RPN before WSOD training. Pathak et al. [10] suggest using unsupervised motion-based grouping for object segmentation, aiming to learn improved features. In contrast, our approach directly utilizes motion during WSOD training, demonstrating that hallucinated motion enhances detection performance in static images.
2 Approach

2.1 The Siamese WSOD Network
WSOD. Following [12], the RGB image undergoes RoI pooling for each of the $R$ visual proposals to generate a fixed-length feature vector $\phi(r)$. Detection and classification scores are computed for each proposal and category using parallel fully-connected layers. The ground-truth class label for each image is represented as $\mathbf{y} \in \{0, 1\}^{C}$, where $y_c$ indicates the presence of class $c$ and $C$ is the total number of object categories.
The detection and classification scores are computed as:

$$o^{cls}_{c,r} = \mathbf{w}^{cls\top}_{c} \phi(r) + b^{cls}_{c}, \qquad o^{det}_{c,r} = \mathbf{w}^{det\top}_{c} \phi(r) + b^{det}_{c}, \tag{1}$$

where $\mathbf{w}$ and $b$ represent the weights and biases. These scores are then transformed into probabilities, where $p^{cls}_{c,r}$ denotes the probability of class $c$ being present in proposal $r$, while $p^{det}_{c,r}$ signifies the probability that proposal $r$ is crucial for predicting the image-level label of class $c$:

$$p^{cls}_{c,r} = \frac{\exp(o^{cls}_{c,r})}{\sum_{c'} \exp(o^{cls}_{c',r})}, \qquad p^{det}_{c,r} = \frac{\exp(o^{det}_{c,r})}{\sum_{r'} \exp(o^{det}_{c,r'})}. \tag{2}$$

Ultimately, image-level predictions $\hat{y}_c$ are calculated and used for training in the absence of region-level labels:

$$\hat{y}_c = \sum_{r=1}^{R} p^{cls}_{c,r} \, p^{det}_{c,r}, \tag{3}$$

$$\mathcal{L}_{img} = -\sum_{c=1}^{C} \big[ y_c \log \hat{y}_c + (1 - y_c) \log(1 - \hat{y}_c) \big]. \tag{4}$$
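To make Eqs. (1)-(4) concrete, the following is a minimal PyTorch-style sketch of the two-branch scoring and image-level aggregation described above; the tensor shapes and layer names are illustrative assumptions, not the exact implementation of [12].

```python
import torch.nn as nn
import torch.nn.functional as F

class WeakDetectionHead(nn.Module):
    """Two-branch WSOD head: classification over classes, detection over proposals."""
    def __init__(self, feat_dim: int, num_classes: int):
        super().__init__()
        self.cls_branch = nn.Linear(feat_dim, num_classes)  # Eq. (1): classification scores
        self.det_branch = nn.Linear(feat_dim, num_classes)  # Eq. (1): detection scores

    def forward(self, roi_feats):            # roi_feats: (R, feat_dim), one row per proposal
        o_cls = self.cls_branch(roi_feats)   # (R, C)
        o_det = self.det_branch(roi_feats)   # (R, C)
        p_cls = F.softmax(o_cls, dim=1)      # Eq. (2): softmax over classes
        p_det = F.softmax(o_det, dim=0)      # Eq. (2): softmax over proposals
        scores = p_cls * p_det               # per-proposal, per-class scores
        y_hat = scores.sum(dim=0).clamp(1e-6, 1 - 1e-6)  # Eq. (3): image-level predictions
        return scores, y_hat

def image_level_loss(y_hat, y):
    """Eq. (4): binary cross-entropy against the multi-hot image-level label y (shape (C,))."""
    return F.binary_cross_entropy(y_hat, y)
```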
Obtaining motion images. FlowNet 2.0 [7] is employed to compute the optical flow between each image and the successive frame in the video. The resulting optical flow images are referred to as GT motion images. Additionally, we leverage Im2Flow [4] to hallucinate motion from static images, thereby obtaining the motion modality for the image dataset. The motion images contain 2 channels, representing the horizontal and vertical components of the optical flow. These 2-channel motion images are converted into RGB images by applying color coding [2], as visualized in Fig. 2. Each distinct color in the resulting images corresponds to a different direction of motion, and darker colors signify a higher magnitude of motion. This transformation enables the motion images to be processed by the same Siamese backbone as the RGB images.
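As an illustration, below is a minimal sketch of one possible HSV-based color coding of a 2-channel flow field (direction mapped to hue, higher magnitude mapped to darker colors, matching the description above); the exact coding of [2] may differ.

```python
import numpy as np
import cv2

def flow_to_rgb(flow: np.ndarray) -> np.ndarray:
    """Convert a (H, W, 2) optical-flow field into an RGB image.

    Direction is mapped to hue; larger magnitudes are mapped to darker colors,
    following the convention described in the text.
    """
    u, v = flow[..., 0], flow[..., 1]
    magnitude = np.sqrt(u ** 2 + v ** 2)
    angle = np.arctan2(v, u)  # motion direction in radians, range [-pi, pi]

    hsv = np.zeros((*flow.shape[:2], 3), dtype=np.uint8)
    hsv[..., 0] = ((angle + np.pi) / (2 * np.pi) * 179).astype(np.uint8)  # hue encodes direction
    hsv[..., 1] = 255                                                      # full saturation
    norm_mag = magnitude / (magnitude.max() + 1e-6)
    hsv[..., 2] = (255 * (1.0 - 0.8 * norm_mag)).astype(np.uint8)          # darker = faster

    return cv2.cvtColor(hsv, cv2.COLOR_HSV2RGB)
```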
Siamese design. In our approach to weakly-supervised object detection, we integrate motion information through a Siamese network [5, 3, 14, 9] employing contrastive learning during training. This strategy allows a pre-trained RGB backbone to extract features from both RGB and motion images without introducing additional complexity to the model. Representation learning is enhanced by introducing a contrastive loss between RGB and motion features. The design of our Siamese network enables the use of only RGB images during inference, ensuring no extra overhead at inference time.
As shown in Fig. 1, we extract feature maps from the RGB image and the motion image using a shared pre-trained backbone. These feature maps are then passed through adaptive pooling and a Siamese fully-connected layer, resulting in projected feature vectors $z^{rgb}$ and $z^{mot}$.
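A minimal sketch of this Siamese projection is given below, assuming a generic backbone that outputs spatial feature maps; the channel and projection dimensions are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class SiameseProjector(nn.Module):
    """Shared backbone + shared projection head for RGB and motion images."""
    def __init__(self, backbone: nn.Module, feat_channels: int = 2048, proj_dim: int = 128):
        super().__init__()
        self.backbone = backbone                      # pre-trained RGB backbone, shared weights
        self.pool = nn.AdaptiveAvgPool2d(1)           # adaptive pooling to 1x1
        self.fc = nn.Linear(feat_channels, proj_dim)  # Siamese fully-connected projection

    def embed(self, images: torch.Tensor) -> torch.Tensor:
        feats = self.backbone(images)                 # (B, feat_channels, H, W) feature maps
        pooled = self.pool(feats).flatten(1)          # (B, feat_channels)
        return self.fc(pooled)                        # (B, proj_dim) projected features

    def forward(self, rgb: torch.Tensor, motion_rgb: torch.Tensor):
        # Both modalities pass through the *same* weights (Siamese design).
        return self.embed(rgb), self.embed(motion_rgb)
```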
Contrastive learning. The RGB and motion feature vectors, denoted as $z^{rgb}$ and $z^{mot}$ respectively, undergo L2-normalization. Following this normalization, we compute their cosine similarity:

$$\mathrm{sim}(z^{rgb}_i, z^{mot}_j) = \frac{z^{rgb\top}_i z^{mot}_j}{\tau}, \tag{5}$$

where $\tau$ is a learnable temperature parameter.
Given RGB and motion image pairs $(x^{rgb}_i, x^{mot}_i)$, $i \in B$, where $B$ represents a batch of RGB-motion pairs, we employ noise contrastive estimation (NCE) [6]. The NCE loss contrasts an RGB image with negative motion images to assess the similarity between the RGB image and its paired motion image, relative to others in the batch:

$$\mathcal{L}_{rgb \rightarrow mot} = -\frac{1}{|B|} \sum_{i \in B} \log \frac{\exp\!\big(\mathrm{sim}(z^{rgb}_i, z^{mot}_i)\big)}{\sum_{j \in B} \exp\!\big(\mathrm{sim}(z^{rgb}_i, z^{mot}_j)\big)}. \tag{6}$$

The second component of the NCE loss, denoted as $\mathcal{L}_{mot \rightarrow rgb}$, is similarly formulated to contrast a motion image with negative RGB image samples. The average of these two components forms the complete NCE loss:

$$\mathcal{L}_{NCE} = \tfrac{1}{2}\big(\mathcal{L}_{rgb \rightarrow mot} + \mathcal{L}_{mot \rightarrow rgb}\big). \tag{7}$$
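A minimal sketch of the symmetric NCE objective in Eqs. (5)-(7), computed over a batch of projected RGB and motion features, is shown below; the log-temperature parameterization and batch construction are assumptions.

```python
import torch
import torch.nn.functional as F

def nce_loss(z_rgb: torch.Tensor, z_mot: torch.Tensor, log_tau: torch.Tensor) -> torch.Tensor:
    """Symmetric noise-contrastive loss between paired RGB and motion embeddings.

    z_rgb, z_mot: (B, D) projected features; matching rows form positive pairs.
    log_tau: learnable scalar, e.g. nn.Parameter(torch.zeros(())); tau = exp(log_tau).
    """
    z_rgb = F.normalize(z_rgb, dim=1)                    # L2-normalization
    z_mot = F.normalize(z_mot, dim=1)
    sim = z_rgb @ z_mot.t() / log_tau.exp()              # Eq. (5): cosine similarity / tau

    targets = torch.arange(z_rgb.size(0), device=z_rgb.device)
    loss_rgb_to_mot = F.cross_entropy(sim, targets)      # Eq. (6): RGB anchors vs. motion negatives
    loss_mot_to_rgb = F.cross_entropy(sim.t(), targets)  # symmetric direction
    return 0.5 * (loss_rgb_to_mot + loss_mot_to_rgb)     # Eq. (7)
```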
2.2 Normalization to Tackle Camera Motion
While object motion yields valuable insights into object characteristics and positioning, the presence of camera motion introduces interference, creating noise and hindering the accurate extraction of object motion. Because camera motion is ubiquitous across images, we devise an approach to estimate and remove it, aiming to enhance the accuracy of object motion extraction.
We make the assumption that the motion values in the corners of an image approximate the background motion. Consequently, we calculate the background motion from the motion observed in these corners. We first compute per-corner motion values by averaging the per-pixel values. We then cluster the four corner values into two clusters and drop any singleton cluster (a single corner); this reduces the effect of one corner whose motion differs markedly from the others, such as when part of an object overlaps that corner. We then average the motion of the remaining corners to obtain the approximated background motion. Because the background motion is caused by camera motion, we subtract the background motion value from the motion image to obtain a more accurate object motion estimate.
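A minimal sketch of this normalization is given below. The corner window size, the use of per-channel mean flow vectors as corner values, and the k-means clustering of the four corners are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def normalize_camera_motion(flow: np.ndarray, corner: int = 20) -> np.ndarray:
    """Subtract approximate background (camera) motion from a (H, W, 2) flow field.

    Background motion is estimated from the mean flow in the four image corners;
    a corner that disagrees with the rest (a singleton cluster) is dropped.
    """
    corners = np.stack([
        flow[:corner, :corner].reshape(-1, 2).mean(axis=0),    # top-left
        flow[:corner, -corner:].reshape(-1, 2).mean(axis=0),   # top-right
        flow[-corner:, :corner].reshape(-1, 2).mean(axis=0),   # bottom-left
        flow[-corner:, -corner:].reshape(-1, 2).mean(axis=0),  # bottom-right
    ])                                                          # (4, 2) per-corner mean flow

    # Cluster the four corner estimates into two groups and discard a singleton
    # cluster, e.g. a corner covered by a moving object.
    labels = KMeans(n_clusters=2, n_init=10).fit(corners).labels_
    counts = np.bincount(labels, minlength=2)
    keep = labels == counts.argmax() if counts.min() == 1 else np.ones(4, dtype=bool)

    background = corners[keep].mean(axis=0)  # approximated camera motion
    return flow - background                 # more accurate object motion
```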
In Fig. 2, the motion image exhibits a consistent yellow tone across the image, indicating downward camera motion according to the color coding scheme. Our method focuses on the image corners outlined with red rectangles to estimate camera motion. Notably, the exclusion of the upper-right corner, affected by object (rather than camera) motion, contributes to the improved accuracy of the camera motion approximation. Following the normalization process to tackle camera motion, the depiction of the object "umbrella" illustrates a more accurate representation of its motion.

2.3 Motion-Driven Training Image Selection
We choose images in the training set depending on whether they contain significant object motion. Following the selection process, the training set includes images with more pronounced and meaningful motion, amplifying the potential influence of motion while minimizing noise from low-quality motion and images with limited motion, thereby enhancing detection performance. We apply the selection only to the training set because we do not use motion during inference.
The process by which we determine whether there is significant object motion involves comparing object and background motion (described next). Thus, it requires that we first estimate which bounding boxes contain objects. Employing ground-truth bounding boxes (even just at training time) would violate the WSOD setting, which relies solely on class labels for supervision. To adhere to this setting, we leverage a baseline WSOD method, which serves as the foundation for improvement, to generate bounding box predictions for objects in the training set.
Let $m^{in}_i$ denote the average motion magnitude inside the bounding box prediction for image $i$, where $i \in \{1, \dots, N\}$ and $N$ is the total number of images in the training set. Correspondingly, $m^{out}_i$ represents the average motion magnitude outside the bounding box prediction. The motion magnitudes are normalized between 0 and 1. Our image selection criterion chooses an image if $m^{in}_i$ surpasses a minimum motion threshold $t_{m}$ and if the ratio between $m^{in}_i$ and $m^{out}_i$ exceeds a minimum motion ratio threshold $t_{r}$ ($t_{m}$ and $t_{r}$ are chosen by subjectively evaluating multiple cases to determine the best fit). Following this criterion, the variable $s_i$ signifies whether an image is selected or not:

$$s_i = \begin{cases} 1 & \text{if } m^{in}_i > t_{m} \ \text{ and } \ m^{in}_i / m^{out}_i > t_{r}, \\ 0 & \text{otherwise}. \end{cases} \tag{8}$$
This ensures that objects in training images exhibit sufficient motion both overall and relative to the background. Note that the YouTube-BB dataset, where we apply selection, usually features a single object per image. A sketch of this criterion follows.
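The following is a minimal sketch of the selection rule in Eq. (8), computed from a normalized flow-magnitude map and a predicted box; the threshold values shown are placeholders, not the ones used in the paper.

```python
import numpy as np

def select_image(magnitude: np.ndarray, box: tuple, t_m: float = 0.1, t_r: float = 2.0) -> bool:
    """Decide whether an image enters the training set (Eq. 8).

    magnitude: (H, W) flow magnitude normalized to [0, 1].
    box: predicted (x1, y1, x2, y2) integer pixel coordinates from the baseline WSOD model.
    t_m, t_r: minimum motion and motion-ratio thresholds (placeholder values).
    """
    x1, y1, x2, y2 = box
    mask = np.zeros(magnitude.shape, dtype=bool)
    mask[y1:y2, x1:x2] = True

    m_in = magnitude[mask].mean()            # average motion inside the predicted box
    m_out = magnitude[~mask].mean() + 1e-6   # average motion outside the box

    return (m_in > t_m) and (m_in / m_out > t_r)
```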
In Fig. 3, the image in the top row features a horse, but it is identified as a non-moving object: the motion magnitude inside the predicted box is low and comparable to the motion magnitude outside the box. Thus, our method excludes this image from the WSOD training set due to the weak motion. In contrast, the bus in the image below exhibits substantial motion, with a much higher motion magnitude inside the box than outside it. Therefore, this image is selected for training to investigate the impact of motion on the WSOD task.
3 Experiments
We test our method on top of MIST [12], one of the state-of-the-art weakly-supervised detection methods, which operates in an RGB-only setting, and verify the contributions of our approach: the Siamese WSOD network with motion (Sec. 2.1), motion normalization (Sec. 2.2), and motion-driven training image selection (Sec. 2.3).
3.1 Datasets
Common Objects in Context (COCO) [8] is a large-scale image dataset for object detection, segmentation, and captioning. We focus on 18 moving object classes (bird, cat, train, boat, umbrella, motorcycle, elephant, bus, cow, car, horse, bicycle, truck, airplane, giraffe, zebra, dog, bear). We extract hallucinated motion from these images to show that motion helps object detection even in static images. The training set comprises approximately 32k images, while the test set consists of around 2k images.
YouTube-BB [11] is a video dataset, from which we curate a subset image dataset containing the same classes as in COCO by randomly sampling frames from videos. GT motion is derived by computing optical flow between each image and its consecutive frame in the video. The training set initially contains about 38k images; following motion-driven image selection, it is refined to approximately 27k images. We employ the same test set used for the COCO dataset in our evaluations. The video dataset serves as a preliminary step and proof of concept toward our ultimate goal of leveraging hallucinated motion.
3.2 Comparison against MIST and ablations
| Method | AP, IoU 0.5:0.95 | AP, IoU 0.5 | AP, IoU 0.75 |
| --- | --- | --- | --- |
| YouTube-BB | | | |
| RGB [12] | 2.8 | 9.8 | 1.0 |
| + GT Motion | 3.1 (+10%) | 10.2 (+4%) | 1.1 (+10%) |
| + GT Normalized Motion | 3.1 (+10%) | 10.4 (+6%) | 1.2 (+20%) |
| + Hallucinated Motion | 2.7 (-3.0%) | 9.3 (-5.0%) | 1.0 (-) |
| YouTube-BB w/ selection | | | |
| RGB [12] | 2.6 | 9.2 | 0.8 |
| + GT Motion | 3.0 (+15%) | 10.6 (+15%) | 0.9 (+13%) |
| + GT Normalized Motion | 3.0 (+15%) | 10.8 (+17%) | 1.0 (+25%) |
| + Hallucinated Motion | 2.9 (+12%) | 10.1 (+10%) | 0.9 (+13%) |
| COCO | | | |
| RGB [12] | 11.8 | 27.5 | 8.3 |
| + Hallucinated Motion | 12.1 (+3.0%) | 27.6 (+0.4%) | 8.4 (+1.2%) |
We enhance the MIST [12] baseline by implementing a Siamese network with contrastive learning to incorporate the motion modality and improve representation learning in the backbone. The integration of GT motion results in notable improvements, with a 4-10% and 13-15% increase in mAP across IoU thresholds for the YouTube-BB and YouTube-BB w/ selection datasets, respectively.
Applying normalization to the GT motion images, aimed at mitigating the noise introduced by camera motion, yields improved performance. GT normalized motion improves over the original GT motion, with a 2-10% and 3-12% increase in mAP for the YouTube-BB and YouTube-BB w/ selection datasets, respectively.
To more effectively evaluate the influence of motion on detection results, we employ a selective approach when curating images from YouTube-BB, as detailed in Sec. 2.3, resulting in the YouTube-BB w/ selection training set. Due to the substantial reduction in the number of training images after the selection process, from 38k to 27k, the RGB [12] results are worse on YouTube-BB w/ selection. However, upon incorporating motion in both the GT Motion and GT Normalized Motion settings, we observe a higher performance improvement over the RGB setting, ranging from 13-25%. This improvement surpasses the enhancement observed on the full YouTube-BB dataset, which ranged from 4-20%. Thus, normalization and selection allow us to observe the influence of motion and strengthen the proof of concept that motion boosts object detection.
After demonstrating the impact of GT and hallucinated motion on frames from a video dataset, we provide evidence supporting our ultimate goal: that motion hallucinated using Im2Flow [4] enhances object detection even in static images. The quality of the hallucination is notably compromised in scenarios involving complex backgrounds, non-moving entities, and small objects. Integrating hallucinated motion with RGB images on the full YouTube-BB dataset therefore results in a decrease in performance due to the poor quality of the generated motion. After selecting training images based on the amount of motion in the YouTube-BB w/ selection training set, the hallucinated motion is more reliable on average across the training set. We thus observe a noteworthy improvement of 10-13% on the YouTube-BB w/ selection dataset when leveraging hallucinated motion.
Ultimately, we apply hallucinated motion to the COCO image dataset where GT motion is unavailable, yielding a performance improvement ranging from 0.4-3%. This underscores the capability of motion to enhance the accuracy of object detection in static images.
4 Conclusion
Our method enhances WSOD by integrating motion information. The utilization of GT motion in the video dataset serves as a proof of concept, demonstrating that motion is a complementary modality to vision, improving object detection. The inclusion of hallucinated motion supports our ultimate goal, indicating that object detection in static images can be enhanced through motion.
References
- Bilen and Vedaldi [2016] Hakan Bilen and Andrea Vedaldi. Weakly supervised deep detection networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2846–2854, 2016.
- Chantas et al. [2014] Giannis Chantas, Theodosios Gkamas, and Christophoros Nikou. Variational-Bayes optical flow. Journal of Mathematical Imaging and Vision, 50:199–213, 2014.
- Fu et al. [2021] Keren Fu, Deng-Ping Fan, Ge-Peng Ji, Qijun Zhao, Jianbing Shen, and Ce Zhu. Siamese network for RGB-D salient object detection and beyond. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 44(9):5541–5559, 2021.
- Gao et al. [2018] Ruohan Gao, Bo Xiong, and Kristen Grauman. Im2Flow: Motion hallucination from static images for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5937–5947, 2018.
- Gungor and Kovashka [2024] Cagri Gungor and Adriana Kovashka. Boosting weakly supervised object detection using fusion and priors from hallucinated depth. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 739–748, 2024.
- Gutmann and Hyvärinen [2010] Michael Gutmann and Aapo Hyvärinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS), pages 297–304. JMLR Workshop and Conference Proceedings, 2010.
- Ilg et al. [2017] Eddy Ilg, Nikolaus Mayer, Tonmoy Saikia, Margret Keuper, Alexey Dosovitskiy, and Thomas Brox. FlowNet 2.0: Evolution of optical flow estimation with deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2462–2470, 2017.
- Lin et al. [2014] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In Computer Vision – ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V, pages 740–755. Springer, 2014.
- Meyer et al. [2020] Johannes Meyer, Andreas Eitel, Thomas Brox, and Wolfram Burgard. Improving unimodal object recognition with multimodal contrastive learning. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 5656–5663. IEEE, 2020.
- Pathak et al. [2017] Deepak Pathak, Ross Girshick, Piotr Dollár, Trevor Darrell, and Bharath Hariharan. Learning features by watching objects move. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2701–2710, 2017.
- Real et al. [2017] Esteban Real, Jonathon Shlens, Stefano Mazzocchi, Xin Pan, and Vincent Vanhoucke. YouTube-BoundingBoxes: A large high-precision human-annotated data set for object detection in video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5296–5305, 2017.
- Ren et al. [2020] Zhongzheng Ren, Zhiding Yu, Xiaodong Yang, Ming-Yu Liu, Yong Jae Lee, Alexander G. Schwing, and Jan Kautz. Instance-aware, context-focused, and memory-efficient weakly supervised object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
- Singh and Lee [2019] Krishna Kumar Singh and Yong Jae Lee. You reap what you sow: Using videos to generate high precision object proposals for weakly-supervised object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9414–9422, 2019.
- Song et al. [2021] Hwanjun Song, Eunyoung Kim, Varun Jampani, Deqing Sun, Jae-Gil Lee, and Ming-Hsuan Yang. Exploiting scene depth for object detection with multimodal transformers. In 32nd British Machine Vision Conference (BMVC), pages 1–14. British Machine Vision Association (BMVA), 2021.
- Sui et al. [2022] Lin Sui, Chen-Lin Zhang, and Jianxin Wu. Salvage of supervision in weakly supervised object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14227–14236, 2022.
- Unal et al. [2022] Mesut Erhan Unal, Keren Ye, Mingda Zhang, Christopher Thomas, Adriana Kovashka, Wei Li, Danfeng Qin, and Jesse Berent. Learning to overcome noise in weak caption supervision for object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2022.