Improving ProtoNet for Few-Shot Video Object Recognition: Winner of ORBIT Challenge 2022
Abstract
In this work, we present the winning solution for the ORBIT Few-Shot Video Object Recognition Challenge 2022. Built upon the ProtoNet baseline, our method is improved with three effective techniques: embedding adaptation, a uniform video clip sampler, and invalid frame detection. In addition, we refactor and re-implement the official codebase to encourage modularity, compatibility, and improved performance. Our implementation accelerates data loading in both training and testing. The code can be found at: ORBIT-2022-winner-method.
1 Introduction
Recently, few-shot learning has received increasing attention [19, 15, 7], as it allows models to recognize novel objects from only a few examples. This enables computer vision systems to adapt to dynamic real-world environments where users can provide a few training examples themselves. The few-shot concept has been adapted and extended to various real-world settings, such as few-shot continual learning [4, 10, 17], test-time adaptation [5, 14, 8], and handling domain shift [21, 3].
Existing few-shot learning datasets lack variation in both the number of examples per object and the quality of those examples. As a result, the trained object recognizers are not robust to the noisy input data (e.g., video frames) streamed from real-world systems. To drive further innovation in few-shot learning, the ORBIT dataset [11] captures the high variation inherent in real-world applications through thousands of videos recorded by vision-impaired people. The ORBIT Few-Shot Object Recognition Challenge, newly introduced in 2022, invited teams to build a teachable object recognizer using the ORBIT dataset. Unlike a generic object recognizer, a teachable object recognizer can be 'taught' by a user to recognize their specific personal objects from just a few video sequences. Specifically, to register the object categories to be classified, a few clean video sequences of the user's objects are provided to the recognizer at the personalization stage. Then, different video sequences from the same user are used to evaluate the recognizer at the recognition stage. Since vision-impaired people cannot localize the target object accurately and quickly, the video frames at the recognition stage are collected from cluttered scenes containing multiple objects.
[11] establishes baselines on the ORBIT benchmark by extending four mainstream meta-learning methods from images to videos [18, 12, 13, 7]. LITE [1] scales up the input image resolution with a general, memory-efficient episodic training scheme and thus achieves state-of-the-art accuracy. However, none of the existing methods includes an algorithmic design for handling noisy examples or for sampling video frames. As observed in this real-world dataset, video frames are quite noisy, especially those taken by vision-impaired people. In addition, the video frame sampling used in prior works is inefficient and ineffective: since video lengths vary, random sampling may cause over-sampling or under-sampling. Both factors hamper the performance of the trained object recognizer.

We build our solution on top of the state-of-the-art method, ProtoNet [13] with LITE. In the ORBIT benchmark, ProtoNet first uses clean video clips as the support data to produce category-level prototypes at the personalization stage. Then, at the recognition stage, the clutter video clips serving as the query data are classified by directly comparing their embeddings with the prototypes under a similarity metric. To generate high-quality prototypes at the personalization stage, we add three techniques. First, inspired by [20], we incorporate one transformer encoder block that leverages the relationships among prototypes to adapt them to the specific episode, and thus highlight their discriminative representation for a specific user. Second, we replace the random video clip sampler with a uniform sampler to achieve higher temporal coverage during testing and to reduce under-sampling and over-sampling in long and short video sequences, respectively. Last, we apply an edge detector to each sampled video frame and use an empirical threshold to detect and remove frames that contain nothing. Our approach improves the accuracy significantly by 8% and was selected as the winner from 12 submissions in the ORBIT Few-Shot Object Recognition Challenge.
In addition to our algorithmic innovations, we also refactor and re-implement the data pipeline of the original ORBIT codebase to encourage modularity, compatibility, and performance. Specifically, our implementation achieves more than a 2.7x acceleration in data loading during both training and testing.
2 Method
In this section, we present the details of our solution. We begin by introducing the entire pipeline of the baseline ProtoNet method for few-shot video object recognition. Then, we introduce our improvements to each module: the few-shot learner, the video clip sampler, and the frame edge detector.

2.1 Baseline pipeline
As shown in Fig. 1, the pipeline consists of three modules: a video clip sampler, a few-shot learner, and a causal sliding window. At the personalization stage, the few-shot learner, ProtoNet, extracts semantic information from a few clean video sequences and generates the prototypes of the object categories. Since the few-shot learner cannot process all video frames simultaneously due to limited computational resources, a video clip sampler is introduced to randomly select multiple video clips from each sequence. The few-shot learner then takes these clean video clips as the support set to generate the prototypes. At the recognition stage, in order to classify every frame of a clutter video sequence, a causal sliding window converts the entire sequence into a series of overlapping video clips that form the query set, where each video frame corresponds to one video clip. Each video clip is fed into the few-shot learner, and the prediction is produced by comparing its embedding with the prototypes using a similarity metric.
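The following is a minimal sketch of this two-stage pipeline, assuming the clip embeddings have already been extracted by the backbone and that cosine similarity is the metric; the function names are ours for illustration, not those of the official codebase.

```python
import torch
import torch.nn.functional as F

def personalize(support_embeds, support_labels, num_classes):
    """Personalization: average the clean support-clip embeddings of each
    category to form its prototype. support_embeds: (S, dim)."""
    return torch.stack([
        support_embeds[support_labels == c].mean(dim=0)
        for c in range(num_classes)
    ])                                                    # (num_classes, dim)

def recognize(query_embeds, prototypes):
    """Recognition: classify each query clip (one per frame, produced by the
    causal sliding window) by cosine similarity to the prototypes."""
    sims = F.cosine_similarity(
        query_embeds.unsqueeze(1),                        # (Q, 1, dim)
        prototypes.unsqueeze(0),                          # (1, C, dim)
        dim=-1,
    )                                                     # (Q, C)
    return sims.argmax(dim=-1)                            # predicted label per frame
```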
However, several factors hinder the generation of high-quality prototypes. First, due to the distribution shift between the clean support and the clutter query video sequences, using the same backbone to extract video clip features from both sets is sub-optimal. Second, each user's video sequences are collected from a limited number of scenes, so different objects may share similar backgrounds, or multiple target user-specific objects may appear in the same sequence. Third, there are dramatic appearance changes across each support video sequence, and some frames suffer from an "object not present" issue. Thus, randomly sampled clips from the support video sequences may not provide comprehensive information for building prototypes.

2.2 Improvements
To help the few-shot learner, ProtoNet, build high-quality prototypes at the personalization stage, we develop three techniques on top of the baseline pipeline, as shown in Fig. 1.
Embedding Adaptation. Since the support video clips come from the same user, the few-shot learner should generate prototypes adapted to that specific user. Inspired by [20], we therefore add a set-to-set function, implemented as one transformer encoder block, to refine the original prototypes by leveraging the relationships among them, as shown in Fig. 3. As a result, the most discriminative representations for a specific user are highlighted. The transformer encoder block can also map the feature embeddings of the clean support video clips closer to the space of the clutter query video clips, which helps alleviate the distribution shift.
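A minimal sketch of this set-to-set refinement, assuming a single PyTorch nn.TransformerEncoderLayer applied over the set of prototypes; the feature dimension (1280 for EfficientNet-B0) and head count are illustrative, and this is not the exact FEAT implementation.

```python
import torch.nn as nn

class PrototypeAdapter(nn.Module):
    """Refine the set of class prototypes with one transformer encoder block,
    so that each prototype attends to the others (FEAT-style set-to-set function)."""
    def __init__(self, dim=1280, num_heads=8):   # 1280 = EfficientNet-B0 feature dim
        super().__init__()
        self.encoder = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True)

    def forward(self, prototypes):                        # (num_classes, dim)
        refined = self.encoder(prototypes.unsqueeze(0))   # (1, num_classes, dim)
        return refined.squeeze(0)                         # (num_classes, dim)
```

The refined prototypes replace the original ones before the cosine-similarity comparison at the recognition stage.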
Uniform Clip Sampler. In the random video clip sampler, the number of frames in each clip is fixed, but both the number of clips and the starting position of each clip are chosen randomly across different video sequences, regardless of their duration. In consequence, long and short video sequences may suffer from under-sampling or over-sampling, respectively. To achieve higher temporal coverage and a constant sampling rate, we replace the random video clip sampler with a uniform one [9]: First, we split each support video sequence into multiple fixed-size, non-overlapping clip candidates. Then, we evenly split the clip candidates into non-overlapping chunks so that each chunk contains the same number of candidates. Last, we sample one clip from each chunk. Fig. 2 demonstrates the details.
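A minimal sketch of this sampler, assuming the clip length and the number of clips are fixed hyper-parameters; the handling of very short videos and of leftover candidates is our assumption.

```python
import random

def sample_uniform_clips(num_frames, clip_length=8, num_clips=4):
    """Return the start indices of `num_clips` clips spread uniformly over a video."""
    # 1) Fixed-size, non-overlapping clip candidates.
    candidates = list(range(0, max(num_frames - clip_length, 0) + 1, clip_length))
    if len(candidates) <= num_clips:
        # Short video: keep every candidate instead of over-sampling.
        return candidates
    # 2) Evenly split the candidates into non-overlapping chunks of equal size;
    #    leftover candidates at the end are dropped.
    per_chunk = len(candidates) // num_clips
    chunks = [candidates[i * per_chunk:(i + 1) * per_chunk] for i in range(num_clips)]
    # 3) Sample one clip candidate from each chunk.
    return [random.choice(chunk) for chunk in chunks]
```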
Invalid Frame Detection. Due to dramatic changes across a video sequence, some sampled support video clips may not contain the target object and therefore contribute no informative features to the prototypes. We thus apply an edge detector to each frame of the sampled video clips and use an empirical threshold to decide whether the frame contains any object. If more than half of the frames in a clip are identified as having the "object not present" issue, that clip is removed.
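A minimal sketch of this filter, assuming a Canny edge detector (via OpenCV) and an edge-density criterion; the choice of detector, the Canny hysteresis values, and the threshold are our assumptions, since the exact settings are not specified here.

```python
import cv2
import numpy as np

def frame_is_empty(frame_bgr, edge_threshold=0.01):
    """Flag a frame as 'object not present' when its edge density is too low."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 100, 200)                   # binary edge map
    edge_density = np.count_nonzero(edges) / edges.size
    return edge_density < edge_threshold

def clip_is_invalid(clip_frames):
    """Drop a sampled clip when more than half of its frames look empty."""
    num_empty = sum(frame_is_empty(f) for f in clip_frames)
    return num_empty > len(clip_frames) / 2
```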
3 Contributions to Code Quality
To improve code efficiency, we refactor and re-implement the data pipeline of the official codebase to encourage modularity, compatibility, and performance.
Modularity. We decouple object category sampling, video sequence (instance) sampling, video frame loading, and tensor preparation from one monolithic class into multiple independent, shallow classes. These components can be used in a plug-and-play and mix-and-match manner.
Compatibility. The pipeline is designed to be interoperable with standard PyTorch domain-specific libraries (e.g., torchvision), so their highly-optimized modules and functions can be used to process the ORBIT dataset.
Performance. To optimize I/O, we introduce multi-threading to reduce the latency of loading images from disk in each episode. Table 1 compares data loading speed: our re-implemented codebase accelerates data loading by 2.7x at test time and 2.77x at training time.
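A minimal sketch of this I/O optimization: an episode's frame images are read concurrently by a thread pool rather than sequentially. The pool size and the use of PIL are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from PIL import Image
import numpy as np

def load_frames(frame_paths, num_threads=16):
    """Load an episode's frames concurrently; disk latency overlaps across
    threads, while frame order is preserved by executor.map."""
    def _read(path):
        with Image.open(path) as img:
            return np.asarray(img.convert("RGB"))
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return list(pool.map(_read, frame_paths))
```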
4 Experiment
4.1 Dataset and implementation details
Dataset. We evaluate our method on the ORBIT dataset, which is designed for real-world few-shot object recognition [11]. It contains 3,822 videos of 486 objects collected by 67 users. Each user is asked to record videos showing each target object in isolation, referred to as clean videos, as well as videos where the target object appears among multiple other objects, referred to as clutter videos. The goal of the challenge is to train a teachable object recognizer that is personalized to each user from their clean videos; the personalized model is then evaluated on the clutter videos. Please refer to [11] for more details. In meta-learning terms, the clean videos serve as the support set and the clutter videos as the query set.
Network. We follow [1] and use EfficientNet-B0 [16] pre-trained on ImageNet [6] as the feature extractor. A single-layer transformer encoder, as in FEAT [20], is used to adapt the computed prototypes.
Training and evaluation protocol. We leverage prototype-based meta-learning to train our network; a comprehensive survey of the related bi-level optimization can be found in [2]. The embeddings of the support video clips belonging to the same class are averaged to form its prototype, and the query videos are classified by their cosine similarity to the prototypes. We follow the episodic learning scheme of LITE [1] to train the network with large-resolution frames. For a fair comparison, we use the same hyper-parameters and evaluation protocol as in [1, 11].
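A minimal sketch of one training episode under this protocol (prototypes from support clips, cosine-similarity logits, cross-entropy over query clips). The logit scale is an assumed temperature, and LITE's memory-saving back-propagation through only a subset of support clips is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def episode_loss(backbone, support_clips, support_labels,
                 query_clips, query_labels, num_classes, scale=10.0):
    """One episodic training step: personalize on support clips, score query clips."""
    support_embeds = backbone(support_clips)               # (S, dim)
    query_embeds = backbone(query_clips)                   # (Q, dim)
    prototypes = torch.stack([
        support_embeds[support_labels == c].mean(dim=0)
        for c in range(num_classes)
    ])                                                     # (C, dim)
    logits = scale * F.cosine_similarity(                  # scale: assumed temperature
        query_embeds.unsqueeze(1), prototypes.unsqueeze(0), dim=-1)
    return F.cross_entropy(logits, query_labels)
```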
Table 1: Comparison of data loading speed; lower numbers are faster (speedup in parentheses).

| Method   | # workers | # threads | Test speed  | Train speed  |
|----------|-----------|-----------|-------------|--------------|
| Baseline | 4         | 1         | 233         | 2.61         |
| Ours     | 4         | 4         | 201 (1.15x) | 2.2 (1.18x)  |
| Ours     | 4         | 16        | 152 (1.53x) | 1.08 (2.41x) |
| Ours     | 8         | 16        | 86 (2.7x)   | 0.94 (2.77x) |
5 Results
Quantitative results. Table 2 shows the main results and the contribution of each component. We re-ran the codebase of [11] for the vanilla ProtoNet with default hyper-parameters, but its per-frame accuracy dropped by 3%; our re-implemented codebase achieves results similar to those reported in [1]. With embedding adaptation, the generated prototypes are refined by examining the relationships among them: they are pushed away from one another and become more discriminative, increasing accuracy by 2.87%. The uniform clip sampler provides a constant sampling rate and higher temporal coverage for all videos, reducing the randomness of information gathering; because video lengths vary, it also ensures that short videos are not over-sampled, improving accuracy by an additional 1.52%. Furthermore, due to the high correlation between consecutive video frames, some sampled clips convey repetitive information, especially when movement is minor, which reduces the number of discriminative data points; enriching the information by augmenting the video clips is therefore effective, and data augmentation improves accuracy by an additional 0.88%. Finally, invalid frame detection, based on edge detection and thresholding, filters out frames that carry little information (background only) and further improves accuracy by 0.12%. Overall, with all proposed components, our method outperforms the baseline by 5.39%.
Table 2: Main results and the contribution of each component (per-frame accuracy, %).

| Method                     | Frame accuracy | Improvement |
|----------------------------|----------------|-------------|
| ProtoNet (reported in [1]) | 66.30          | -           |
| ProtoNet (ORBIT codebase)  | 63.27          | -3.03       |
| ProtoNet (Ours)            | 66.27          | -0.03       |
| + Embedding adaptation     | 69.17          | +2.87       |
| + Uniform sampler          | 70.69          | +4.39       |
| + Data augmentation        | 71.57          | +5.27       |
| + Invalid frame detection  | 71.69          | +5.39       |
Qualitative results. Fig. 4 shows the per-user accuracy. It is worth noting that embedding adaptation and uniform sampling provide significant improvements for most users. With all proposed components integrated, our method achieves the best accuracy for 11 users.

6 Conclusion and future work
In this work, we proposed several improvements for the few-shot video object recognition task, consisting of embedding adaptation, uniform video clip sampling, and invalid frame detection. Our unified solution won the first challenge on the ORBIT dataset. Furthermore, we refactored and optimized the original codebase to improve the productivity of other researchers. Future work includes tackling the domain shift between support and query videos, which is common in real-world scenarios.
References
- [1] John Bronskill, Daniela Massiceti, Massimiliano Patacchiola, Katja Hofmann, Sebastian Nowozin, and Richard Turner. Memory efficient meta-learning with large images. Advances in Neural Information Processing Systems, 2021.
- [2] Can Chen, Xi Chen, Chen Ma, Zixuan Liu, and Xue Liu. Gradient-based bi-level optimization for deep learning: A survey. arXiv preprint arXiv:2207.11719, 2022.
- [3] Can Chen, Yingxue Zhang, Jie Fu, Mark Coates, et al. Bidirectional learning for offline infinite-width model-based optimization. arXiv preprint arXiv:2209.07507, 2022.
- [4] Zhixiang Chi, Li Gu, Huan Liu, Yang Wang, Yuanhao Yu, and Jin Tang. MetaFSCIL: A meta-learning approach for few-shot class incremental learning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
- [5] Zhixiang Chi, Yang Wang, Yuanhao Yu, and Jin Tang. Test-time fast adaptation for dynamic scene deblurring via meta-auxiliary learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021.
- [6] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition, 2009.
- [7] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International conference on machine learning, 2017.
- [8] Yizhuo Li, Miao Hao, Zonglin Di, Nitesh Bharadwaj Gundavarapu, and Xiaolong Wang. Test-time personalization with a transformer for human pose estimation. Advances in Neural Information Processing Systems, 2021.
- [9] Hanwen Liang, Niamul Quader, Zhixiang Chi, Lizhe Chen, Peng Dai, Juwei Lu, and Yang Wang. Self-supervised spatiotemporal representation learning by exploiting video continuity. In Proceedings of the AAAI Conference on Artificial Intelligence, 2022.
- [10] Huan Liu, Li Gu, Zhixiang Chi, Yang Wang, Yuanhao Yu, Jun Chen, and Jin Tang. Few-shot class-incremental learning via entropy-regularized data-free replay. arXiv preprint arXiv:2207.11213, 2022.
- [11] Daniela Massiceti, Luisa Zintgraf, John Bronskill, Lida Theodorou, Matthew Tobias Harris, Edward Cutrell, Cecily Morrison, Katja Hofmann, and Simone Stumpf. ORBIT: A real-world few-shot dataset for teachable object recognition. In IEEE/CVF International Conference on Computer Vision, 2021.
- [12] James Requeima, Jonathan Gordon, John Bronskill, Sebastian Nowozin, and Richard E Turner. Fast and flexible multi-task classification using conditional neural adaptive processes. Advances in Neural Information Processing Systems, 32, 2019.
- [13] Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. Advances in neural information processing systems, 30, 2017.
- [14] Yu Sun, Xiaolong Wang, Zhuang Liu, John Miller, Alexei Efros, and Moritz Hardt. Test-time training with self-supervision for generalization under distribution shifts. In International conference on machine learning, 2020.
- [15] Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip HS Torr, and Timothy M Hospedales. Learning to compare: Relation network for few-shot learning. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2018.
- [16] Mingxing Tan and Quoc Le. EfficientNet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning, 2019.
- [17] Xiaoyu Tao, Xiaopeng Hong, Xinyuan Chang, Songlin Dong, Xing Wei, and Yihong Gong. Few-shot class-incremental learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.
- [18] Yonglong Tian, Yue Wang, Dilip Krishnan, Joshua B Tenenbaum, and Phillip Isola. Rethinking few-shot image classification: a good embedding is all you need? In European Conference on Computer Vision, pages 266–282. Springer, 2020.
- [19] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. Advances in neural information processing systems, 2016.
- [20] Han-Jia Ye, Hexiang Hu, De-Chuan Zhan, and Fei Sha. Few-shot learning via embedding adaptation with set-to-set functions. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.
- [21] Marvin Zhang, Henrik Marklund, Nikita Dhawan, Abhishek Gupta, Sergey Levine, and Chelsea Finn. Adaptive risk minimization: Learning to adapt to domain shift. Advances in Neural Information Processing Systems, 34:23664–23678, 2021.