Object Occlusion for Adding New Categories in Object Detection
Abstract
Building instance detection models that are data efficient and can handle rare object categories is an important challenge in computer vision. However, data collection methods and evaluation metrics for real-world neural network applications remain under-researched. Here, we perform a systematic study of Object Occlusion data collection and augmentation methods, in which we imitate the object occlusion relationships of target scenarios. We find that this simple object occlusion mechanism is sufficient to provide acceptable accuracy when adding a new category in real scenarios. We illustrate that adding only 15 images of a new category to a training dataset of half a million images spanning hundreds of categories can give this new category 95% accuracy on an unseen test dataset containing thousands of images of that category.
Index Terms:
Data Augmentation, Detection, Artificial Intelligence
I Introduction
Object detection is an important task in computer vision with many real-world applications. Object detection models based on state-of-the-art convolutional networks are often data-hungry [1]. At the same time, annotating a large dataset for object detection is usually expensive and time-consuming. For example, four workers need to work for an hour to annotate about 3000 object bounding boxes in our dataset, called Smart Shelf. Therefore, it is imperative to develop new methods to improve the data-efficiency of state-of-the-art object detection models.
We discuss the process of adding new categories to a base dataset with less effort spent collecting and annotating images of the new categories. There are two main approaches to this problem: (1) use 3D reconstruction to build a 3D model of each new category, and use a rendering engine to generate annotated images automatically [2] [3]; (2) train new categories with fewer real images while keeping high accuracy. We think the first approach currently suffers from domain adaptation issues, so it cannot reach sufficiently high accuracy in real situations, even with GAN-based methods [4]. Much research focuses on network structures to boost detection performance [5] [6] [1], but these can have disadvantages such as increased inference time or network complexity. We want to provide universal strategies that boost network performance through data augmentation, as in [7]. Therefore, we focus on the second approach to the problem of training new categories.
We consider managing data collection and augmentation to be a straightforward way to significantly improve the data-efficiency of object detection models, since detection network training benefits from diverse images [8]. Object occlusion has the potential to create challenging new training data [9]. The data collection phase can produce more natural occlusion because real objects are photographed, while the data augmentation phase uses bounding boxes extracted from other images to occlude target objects. Bounding boxes can be annotated much faster than segmentation masks, but an extracted bounding box may contain some background. If we copy this bounding box image and paste it into other images, that background may be inconsistent with the new image's background, so object occlusion in the data augmentation phase is suboptimal compared to using segmentation annotations. However, our experiments show that, with appropriate organization, object occlusion in the data augmentation phase can still improve model accuracy.
We first considered using Class-Balanced Loss [10] together with data copy-paste, but our experiments showed no obvious improvement, so we did not adopt this method in subsequent experiments. Inspired by several data augmentation methods [11] [12], we propose a new copy-paste based method for network training using bounding boxes. We also agree with the idea proposed in [13], although we did not obtain good results using only synthetic data as in [2].
The key idea behind Object Occlusion is to imitate object occlusion in real scenarios when constructing the training dataset. This idea leads to a large number of new combinatorial occlusion relationships, with multiple choices to make:
1. Choose the objects that occlude each other;
2. Decide the occlusion relationship among these objects;
3. Decide the positions at which to place these objects and the camera viewpoints from which to view the scene.
II Method
Our approach to generating new data using Copy-Paste designs different occlusion levels. We assume that the occlusion relationship between objects is the most important factor for neural network learning: at any time, the network can see only some parts of a new category. Imitating real occlusion between objects therefore provides a strong ability to learn a new category from a small set of annotated images. Our experiments show that the categories of the occluding objects are less important than the occlusion relationship itself. In other words, the occlusion relationships of real scenarios, including the occlusion level, the view direction of target objects, and the visible region of target objects, can be imitated well when we construct the training dataset. The necessary number of training images is small, yet high accuracy can still be reached on the test dataset. We also demonstrate that annotated bounding boxes should cover only the visible part of objects, ignoring the occluded region, which makes the training process converge faster and gives better accuracy.
II-A Choices of objects for occlusion
In real object occlusion relationships, a small object can occlude a small portion of a large object, while a large object can occlude a large portion of a small object. In real scenarios, object occlusion relationships follow a relatively fixed distribution, which gives us a chance to imitate its important sample points by placing objects to create the corresponding occlusions. We find that, for a target object A of category X, if at one point of the occlusion distribution A's bottom 50% is occluded by an object B of category Y, but no object of category Y is at hand, then we can use an object C of category Z to occlude A's bottom 50%, which has the same effect as using B. Therefore, the category of the frontal object used to occlude the target object is not important; what matters is whether the occlusion relationship fits the real occlusion distribution.
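The occlusion level we match here can be made precise as the fraction of a target box covered by a frontal box. Below is a minimal sketch in Python; the box format (x1, y1, x2, y2) and the function name are our own illustrative choices, not part of any released code.

```python
def occluded_fraction(target, occluder):
    """Fraction of the target box's area covered by the occluder box.
    Boxes are (x1, y1, x2, y2) in pixels. Any frontal object producing
    the same fraction is treated as equivalent, whatever its category.
    """
    x1 = max(target[0], occluder[0])
    y1 = max(target[1], occluder[1])
    x2 = min(target[2], occluder[2])
    y2 = min(target[3], occluder[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = (target[2] - target[0]) * (target[3] - target[1])
    return inter / area if area > 0 else 0.0

# A box covering the bottom half of the target gives 0.5,
# regardless of which category the occluder belongs to.
print(occluded_fraction((0, 0, 100, 200), (0, 100, 100, 200)))  # 0.5
```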
The general placement of goods on one layer of a shelf can be seen in Fig. 1 and Fig. 2. Because we use a fisheye camera to take pictures in our scenario, we must account for its characteristics: it is an ultra-wide-angle lens that produces strong visual distortion, creating a hemispherical image. Therefore, so that all goods have distinctive regions visible in the camera view, tall beverages should be placed near the shelf wall, and short goods should be placed in the center area of the layer. As the size of goods increases, they are placed more peripherally. We must guarantee that all goods can be seen properly by the fisheye camera, so that they can all potentially be detected by the neural network model.


Preparing objects of more different sizes helps to imitate more occlusion relationships. For instance, one real occlusion relationship is a target object occluded a small amount by small objects; another is a target object occluded a large amount by large objects.
We can generate occlusion in the data collection stage or the data augmentation stage. In the data collection stage, we must be aware of the importance of the object size of each category when constructing occlusions, because categories of different sizes naturally generate different occlusion relationships. In the data augmentation stage, however, object size does not matter: we can use any category to occlude a target object, using techniques including copy-paste, cut-paste, image scaling, and image translation, as sketched below.
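As a sketch of the augmentation-stage mechanism (assuming OpenCV and numpy images; the helper name and the convention of occluding the bottom of the target are our illustrative assumptions), a stored bounding-box crop can be pasted over part of a target box:

```python
import cv2

def occlude_bottom(image, target_box, occluder_crop, ratio=0.5):
    """Paste a crop taken from another image's bounding box over the
    bottom `ratio` (0 < ratio < 1) of a target box, imitating an object
    standing in front of the target. Boxes are (x1, y1, x2, y2) pixels.
    Hypothetical helper; the annotation must also be updated so the
    target box covers only the part left visible.
    """
    x1, y1, x2, y2 = target_box
    occ_top = int(y2 - ratio * (y2 - y1))          # top of occluded strip
    # cv2.resize takes (width, height); stretch the crop to fit the strip.
    patch = cv2.resize(occluder_crop, (x2 - x1, y2 - occ_top))
    image[occ_top:y2, x1:x2] = patch
    visible_box = (x1, y1, x2, occ_top)            # annotate visible part only
    return image, visible_box
```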
II-B Decide the occlusion relationship
Correctly finding the real object occlusion distribution is the precondition for imitation. In a particular scenario, object occlusion is determined by many factors, such as camera viewpoint, object size, and object location. The relative positions of objects should be reasonable. For example, in an indoor scenario with a cup on a table next to a TV remote controller, a paper picker may partially occlude the cup, or even totally occlude it from some viewpoints, but the remote controller is rarely placed on top of the cup; in almost all cases it lies alongside it. We therefore give the major cases high priority in data collection. Collecting only one or two examples per occlusion relationship is enough to reach high accuracy on real test cases.
The Monte Carlo method can be used to pick data sample points from the occlusion distribution. First, we find the occlusion distribution of our new category in real scenarios, dealing with each new category one by one. Then we generate synthetic images by copy-pasting bounding boxes to occlude objects of the new category, following its occlusion distribution. Finally, we combine these synthetic images with a few real images to train our detection network.
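A minimal sketch of this sampling step follows, with an invented toy distribution; the real distribution would come from scenario analysis as described below.

```python
import numpy as np

rng = np.random.default_rng(0)

# Empirical occlusion distribution of the new category in the target
# scenario: (occlusion ratio, probability). Values here are invented
# for illustration only.
ratios = np.array([0.0, 0.33, 0.5, 0.66])
probs  = np.array([0.2, 0.3, 0.3, 0.2])

def sample_occlusions(n):
    """Monte Carlo: draw n occlusion levels to realize with copy-paste."""
    return rng.choice(ratios, size=n, p=probs)

print(sample_occlusions(5))
```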
In the data collection stage, we have no prior occlusion distribution for a new category, but we probably have at least one previous category of similar size. For occlusion generation, we do not care about the surface material or texture detail of the new category; the location where it should be placed may be affected by its volume and intrinsic characteristics. In our dataset FVSS, small boxed snacks and canned drinks are usually placed in the center of a shelf layer, while big bagged snacks are placed on the periphery. For the most efficient use of a layer's space, we fill the layer with as many goods as possible, under the condition that each good placed on the layer can be viewed clearly by the top-centered fisheye camera. At a minimum, each good must have a distinct visible region from which a human can recognize its category in the fisheye images. Therefore, in the fisheye view, we guarantee that the top or top-lateral part of every good is visible and that no goods are stacked. On a full layer, the lower part of each good is usually occluded by nearby goods. We follow the rule that small objects are placed at the center of a layer while large objects are placed at its periphery. Therefore, if we add a new large category, it is usually placed on the periphery of a layer or near the shelf walls; this placement makes its objects less likely to be occluded by nearby objects of the same or other categories. The lower parts of objects are usually occluded at different ratios by different categories. For example, suppose we add a bottled water. If goods of the same category occlude this good, perhaps only its bottle cap is visible. If a smaller milk box is next to the bottle, perhaps its bottom two thirds are occluded. If a lying-down snack bag is next to it, perhaps its bottom third is occluded. And we should not place a bigger object next to the bottle on the side toward the fisheye camera center, because we do not want the bottle to be totally occluded.
In the data augmentation stage, we use already annotated images to generate new occluded images following the occlusion distribution we found. First, we must find the correct occlusion distribution of the target new category. The most reasonable method is for an experienced researcher to analyze the new category's attributes and infer the occlusion distribution, but this is hard to generalize because it always requires experienced experts. An automatic method is therefore needed.
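One plausible automatic method, sketched here under the assumption that only bounding box annotations are available (so front/behind ordering is unknown and the estimate is only a proxy), is to histogram pairwise coverage fractions over an annotated dataset, reusing occluded_fraction from the earlier sketch:

```python
import numpy as np

def estimate_occlusion_distribution(boxes_per_image, bins=10):
    """Rough automatic substitute for expert analysis: for every pair of
    annotated boxes in each image, record how much of one box is covered
    by the other, then histogram the fractions. Bounding boxes alone
    cannot tell which object is in front, so this is an upper-bound proxy.
    """
    fractions = []
    for boxes in boxes_per_image:                 # list of (x1,y1,x2,y2)
        for i, a in enumerate(boxes):
            for j, b in enumerate(boxes):
                if i == j:
                    continue
                f = occluded_fraction(a, b)       # helper sketched above
                if f > 0:
                    fractions.append(f)
    hist, edges = np.histogram(fractions, bins=bins, range=(0.0, 1.0))
    return hist / max(hist.sum(), 1), edges       # normalized frequencies
```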
In the COCO dataset, the category 'person' usually stands in a street or sits near a table, so a person may be occluded by other people outdoors or by a table indoors. Interestingly, an annotated person almost always shows their head, even in crowded or distant scenes. If no head of a person appears, the dataset organizers may not have collected that kind of image, because humans usually identify one another by the head, and annotators may feel strange annotating a person whose lower half is visible but whose head cannot be seen.
Some features: (1) a person's head may be occluded by an umbrella; (2) the head may be shown in a lateral view or from behind; these views should be imitated in the data collection stage, while the data augmentation stage imitates only the real occlusion relationships rather than different viewpoints; (3) in rare cases, a region showing only a small visible part of a human, such as close-up hands or a foot, is still annotated as a person.
These features generalize to all kinds of animals, which usually show their heads in pictures because photographers make the objects in their pictures easy to understand.
COCO is a general dataset containing large diversity within each category. In COCO, each category may appear in abundant environments and lighting conditions, with diverse poses and viewpoints. For example, the animal category 'bear' covers many sub-categories, such as polar bear, black bear, brown bear, and raccoon. We can add a new category with a small number of images, such as a few dozen, and train a detection network well on all categories including the new one.
Small objects, such as toothbrushes or remote controllers, may be viewed from near or far, so their sizes in pictures vary over a large range. This raises a question: is it useful to apply the occlusion distribution of small objects to large objects? We find that it is, and it can work even better with image scaling as data augmentation.
We also analyze the Open Images Dataset [9]. Each of its categories has more samples and more diversity. It provides four types of annotation: Detection, Segmentation, Relationships, and Localized Narratives. Detection is annotated with bounding boxes; Segmentation with polygons. Relationships describe many kinds of relationships between humans and objects or between different objects, using dotted-line bounding boxes to show that one object is inside another. These relationships also relate strongly to the occlusion distributions of categories and give us inspiration. Fig. 3 shows one Relationships image, which describes a containment relationship between a platter and a tomato.

II-C Decide camera viewpoints
It is hard to imitate all viewpoints, especially in large outdoor scenarios. But we find a simple way that uses only several camera viewpoints and reaches high accuracy in real scenarios. Following the method proposed in NERFIES [14], we take object pictures from the main camera viewpoint and several viewpoints near it, ignoring rarely seen viewpoints.
II-D Copy-paste data augmentation
After collecting tens of images for our new SKUs, we use copy-paste [12] data augmentation to cover more sample points of the placement and occlusion distributions, which improves detection performance. Our copy-paste strategy randomly copies many bounding box regions from one image and pastes them into another, following our occlusion distribution and respecting boundary conditions such as not overlapping existing objects too much.
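A sketch of one such boundary condition follows; the IoU threshold and retry count are illustrative assumptions, not values from our experiments.

```python
import random

def box_iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def find_paste_box(existing_boxes, crop_w, crop_h, img_w, img_h,
                   max_iou=0.3, tries=50):
    """Rejection sampling for a paste location: keep drawing random
    positions until the pasted box does not overlap any existing box
    by more than max_iou.
    """
    for _ in range(tries):
        x = random.randint(0, img_w - crop_w)
        y = random.randint(0, img_h - crop_h)
        cand = (x, y, x + crop_w, y + crop_h)
        if all(box_iou(cand, b) <= max_iou for b in existing_boxes):
            return cand
    return None  # give up on this crop if the image is too crowded
```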
II-E FairMOT for speeding up data annotation
We mainly use bounding boxes to annotate each object. When data collection is controllable, we can move objects slowly between frames. We can then use tracking models to annotate each object through its continuous slow movement once the first frame is annotated, reducing the workload of human annotators.
We tried single object tracking (SOT) first. If we move one object slowly at a time, we can annotate the target object in the first frame of a clip that captures only that object's movement, while the other objects stay static. We assume the camera also stays in the same position to simplify the problem, though the camera could also be moved slowly while maintaining excellent single object tracking performance. However, there are two main issues with SOT.
(1) If several objects of the same category are placed near each other and we slowly move one of them, the tracker may jump to another static but nearby object. A more accurate SOT model is required for scalable use of SOT in annotation.
(2) Because only a single object is tracked, we must annotate many clips, each of which moves one new object.
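For illustration, the single-object workflow can be sketched with OpenCV's CSRT tracker (assuming an OpenCV build with the tracking module); FairMOT replaces this in the multi-object setting described next.

```python
import cv2

def propagate_annotation(frames, first_box):
    """Tracking-assisted annotation: given the box drawn on the first
    frame of a clip in which only one object moves slowly, let a tracker
    produce the boxes for the remaining frames. first_box is (x, y, w, h).
    """
    tracker = cv2.TrackerCSRT_create()
    tracker.init(frames[0], first_box)
    boxes = [first_box]
    for frame in frames[1:]:
        ok, box = tracker.update(frame)
        boxes.append(tuple(int(v) for v in box) if ok else None)
    return boxes  # None marks frames a human annotator must fix
```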
A multi-object tracking strategy can track many objects, so we can move different objects simultaneously. We therefore incorporate FairMOT to speed up our annotation process.


III Experiments
We design experiments to demonstrate that our approach needs only tens of images per new category to match the accuracy obtained with thousands of images. We conduct experiments in two directions: object occlusion in the data collection stage, and object occlusion in the data augmentation stage.
Our dataset, named Fisheye View of Shelf SKUs (FVSS), is used for experiments such as data collection stage validation. Fig. 4 shows the basic setting of FVSS, and Fig. 5 shows several bounding boxes of one category. The dataset views one layer of goods in a shelf through a fisheye camera mounted at the top center. In our experiments, we use hundreds of categories as the base dataset and try to add a new category.
We use the COCO dataset as our testbed for data augmentation stage validation. COCO has 80 categories. We randomly pick one category as the new category and use the remaining 79 as the base dataset. We analyze the occlusion distribution of the new category and pick only the 1% to 10% of its images that hit important sample points of the distribution for training. We use yolov5-small as our detection model and convert all annotations to the yolov5 format.
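The conversion is standard: COCO stores [x_min, y_min, width, height] in pixels, while yolov5 label files expect a class index plus center coordinates and sizes normalized to [0, 1].

```python
def coco_to_yolo(bbox, img_w, img_h):
    """Convert a COCO bbox [x_min, y_min, width, height] (pixels) to the
    YOLO format [x_center, y_center, width, height] normalized to [0, 1],
    as required by yolov5 label files.
    """
    x, y, w, h = bbox
    return [(x + w / 2) / img_w, (y + h / 2) / img_h, w / img_w, h / img_h]

# One line per object in the .txt label file: "<class> cx cy w h"
print(coco_to_yolo([10, 20, 100, 50], 640, 480))
```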
III-A Data collection stage
We run experiments in a shelf environment with fisheye cameras, following the construction style of our FVSS dataset. We use 10 thousand images containing 457 categories as the base training dataset and add one new category with only 10 images, then test performance on a validation dataset of 1000 images, each of which contains at least one bounding box of the new category. Fig. 6 shows heatmaps of two categories in our dataset, which are relevant to their occlusion distributions.

For example, we use "coke can" as the new category, adding 10 images of it to a 10000-image training dataset that contains 457 categories but nothing of the category "coke can". These 10 images were taken at important sample points of the occlusion distribution of "coke can" in the shelf environment, giving 58 bounding boxes in total. We also construct a validation dataset of 1000 images containing 179 categories in total, where each image contains at least one "coke can" bounding box, for a total of 3939 "coke can" boxes. We use three metrics. The first is the [email protected] and [email protected]:0.95 of "coke can" on the validation dataset. The second is the pass rate of "coke can": an image passes only if all of its "coke can" bounding boxes are found at the correct positions. Third, if a "coke can" bounding box is not detected as "coke can", it may be detected as another category or simply missed by the model; we therefore design a rate describing whether such a box is detected as another category with confidence below 95%. Results are shown in Table I.
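As a sketch of the pass-rate metric (the input format here is our own choice for illustration):

```python
def pass_rate(per_image_results):
    """Image-level pass rate as used in Table I: an image passes only if
    every ground-truth box of the new category is detected at the correct
    position. per_image_results is a list of (n_gt, n_correctly_detected).
    """
    passed = sum(1 for n_gt, n_det in per_image_results if n_det == n_gt)
    return passed / len(per_image_results)

# e.g. 2 of 3 images have all their "coke can" boxes found:
print(pass_rate([(3, 3), (2, 1), (1, 1)]))  # 0.666...
```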
[email protected] | [email protected]:0.95 | pass rate | mis-detect [email protected] | mis-detect [email protected] |
---|---|---|---|---|
98.4% | 83.6% | 81.3% | 77.0% | 87.0% |
We test 16 categories, one at a time as the new category. Our categories are all retail goods such as snacks, milk, and beverages. We conclude that we can train a new category with few images while keeping accuracy above 80% and a mis-detected rate above 85% on average, which means only 3% of images in the validation dataset are mis-detected with high confidence or missed by the detection model. Some category results are shown in Table II, where the mis-detected rate counts confidences below 90%. We also add a new metric called the final undistinguishable rate, which specifically indicates bounding boxes that are mis-detected as another category with confidence above 90% or cannot be detected by the model. Results are shown in Table II and Table III.
We also find that using categories of many different sizes to generate diverse occlusion relationships better imitates the important points of the occlusion distribution, and this method nearly doubles accuracy on average.
name | image number | pass rate | mis-detect [email protected] | final undistinguishable [email protected] |
---|---|---|---|---|
xiandangao | 6015 | 78% | 91% | 1.98% |
yibaochunjingshui | 2037 | 54% | 92% | 3.68% |
jiaduobaoguan | 1238 | 42% | 78% | 12.76% |
cuiguoba | 6359 | 91% | 80% | 1.8% |
420meizhiyuanguolicheng | 712 | 95% | 95% | 0.25% |
feizixiaolizhi | 1884 | 52% | 95% | 2.4% |
heqingjiaotangbinggan | 3569 | 84% | 92% | 1.28% |
duoweixiaoxibing200 | 1799 | 90% | 90% | 1.0% |
4wahahaadgainai | 2807 | 55% | 100% | 0.0% |
yizhongtaohuangtaoguantou | 2520 | 78% | 80% | 4.4% |
heqingjiaotangbinggan | 3992 | 94% | 86% | 0.84% |
4wahahaadgainai | 3094 | 32% | 91% | 6.21% |
enaakdianxinmian30g | 488 | 62% | 76% | 9.12% |
guowangshiguangguoba | 867 | 90% | 100% | 0.0% |
mailisu | 531 | 95% | 95% | 0.25% |
average | 3194 | 72% | 90.7% | 4.2% |
name | pass rate | mis-detect [email protected] |
---|---|---|
without new category data | 0.0% | 79.0% |
with new category data | 72.0% | 90.0% |
name | image number | mis-detected [email protected] |
---|---|---|
xiandangao | 6015 | 87.0% |
yibaochunjingshui | 2030 | 87.0% |
jiaduobaoguan | 2945 | 81.0% |
cuiguoba | 6992 | 90.0% |
420meizhiyuanguolicheng | 712 | 78.0% |
feizixiaolizhi | 1884 | 87.0% |
heqingjiaotangbinggan | 3569 | 52.0% |
duoweixiaoxibing200 | 1799 | 93.0% |
4wahahaadgainai | 2805 | 44.0% |
yizhongtaohuangtaoguantou | 2520 | 82.0% |
average | 3127.1 | 79.3% |
sku name | 3000+ bboxes | 60 bboxes(20 images) | 370 bboxes(20 images+copy-paste) |
---|---|---|---|
guangshiboluopi | 33.83% | 12.7% | 53.38% |
yangzhiganlu | 60.96% | 34.76% | 76.83% |
zhiqingchunniunai | 49.87% | 3.56% | 27.95% |
tengyeyicunxiaoyuanbinggan | 95.98% | 38.16% | 98.19% |
aolangtangeweihuabinggan | 37.50% | 58.33% | 97.22% |
average | 55.63% | 29.52% | 70.71% |
We also tried adding images of a new category from another domain, such as pictures taken with a phone, to a training dataset of shelf fisheye images, without applying any domain adaptation methods. We got a 0% pass rate for the new category on the 1000-image validation dataset, and its mis-detection rate was almost the same as adding no new category data. Therefore, domain adaptation is still an issue when training a new category. Results are shown in Table III.
We test the effect of including or excluding the new category in the training dataset, evaluating on a validation dataset that contains new category data. We find that without new category data in the training dataset, not even one image of the new category in the validation dataset is detected correctly, so the pass rate is always 0.0%, and 21% of bounding boxes are missed or mis-detected as other categories with high confidence. Meanwhile, the average pass rate of the new category reaches 72.0% if we add only 10 images of it to the training dataset, and the mis-detect [email protected] increases to 90.0%. We also find that some high-aspect-ratio categories get a larger increase in mis-detect [email protected]; perhaps adding images of such a category leaves less chance for its objects to be mis-detected as other categories. But a high-aspect-ratio category often gets a lower pass rate than other categories when trained with few images. Therefore, we believe adding new category images is important even when only a few are added, and if these few images fit the occlusion distribution of the new category well, the most necessary and important information will be learned by the neural network. Results are shown in Table IV.
We run a further experiment on adding few images of a new category to a large dataset. We use a large shelf fisheye-view dataset with more than 360 thousand images, add a new category called "sizhoushaokaoweixiatiao" with only 15 images containing 60 bounding boxes, and train for 1.5 epochs with common data augmentation methods such as image flipping, hue tuning, and normalization. Evaluating on a test dataset of 500 real images, only about 10 images are incorrect. This result shows the huge potential of using copy-paste with the occlusion distribution for detection with bounding box annotations. Results are shown in Fig. 7 and Fig. 8.


III-B Data augmentation stage
We compare normal training with an empirically sufficient number of images of a new category (more than 3000 bounding boxes) against training with only 20 images of the new category, which are a subset of the former large set. We conduct two kinds of experiments with the 20 images. In the first, we use only these 20 images with some data augmentation such as image flipping, HSV transformation, and hue tuning. In the second, we use copy-paste data augmentation following the occlusion distribution to create 100 more images from these 20 (see the sketch below), giving 120 images of the new category in total, 100 of them copy-pasted. We test 5 new categories one by one and show results in Table V. The results are striking: a small number of images plus copy-paste generated images beats training on the original large image set. Fig. 9 shows an image augmented with copy-paste by our strategy, and Fig. 10 shows a failure case of a network trained with our strategy.
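A sketch of the generation loop, reusing the hypothetical helpers from Section II (sample_occlusions, occlude_bottom); the bookkeeping details are our own illustrative choices.

```python
import random

def expand_with_copy_paste(annotated, crop_pool, n_new=100):
    """From a small pool of (image, boxes) pairs, synthesize n_new extra
    images by occluding one target box per image with a pasted crop at
    an occlusion level drawn from the category's distribution.
    """
    synthetic = []
    for _ in range(n_new):
        img, boxes = random.choice(annotated)       # (HxWx3 array, [boxes])
        img, boxes = img.copy(), list(boxes)
        i = random.randrange(len(boxes))            # pick a target box
        ratio = float(sample_occlusions(1)[0])      # Monte Carlo draw
        if ratio > 0:
            img, boxes[i] = occlude_bottom(img, boxes[i],
                                           random.choice(crop_pool), ratio)
        synthetic.append((img, boxes))
    return synthetic
```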


There are situations where we train on an already collected dataset and cannot control the data collection stage, yet still want to add a few images of a new category to the existing dataset. We design experiments demonstrating that a few images of a new category can give relatively high accuracy if we pick those images from important sample points of the category's occlusion distribution in the test dataset. Our implementation refers to [12]. We believe the noticeable accuracy improvement of the simple copy-paste strategy in [12] comes from the many new occlusion relationships these copy-paste operations create, which contain the important sample points of the occlusion distribution. Fig. 11 and Fig. 12 give test results for two categories, showing the confidence distribution of the target categories in the test dataset.


IV Conclusion
Data collection is a core step in applying vision systems to the real world. In this paper, we propose an Object Occlusion data collection method and find that it is effective and robust. Object Occlusion performs well across multiple experimental settings and provides significant improvement using a remarkably small amount of data. Our experiments are based on our FVSS dataset and the COCO benchmark.
The Object Occlusion data collection and augmentation strategy is easy to plug into any dataset when constructing a new dataset or adding new categories to an existing one, and it saves training cost because few images are used. We can therefore use smaller models with suitable occlusion design, for example using copy-paste to create appropriate occlusion relationships for target objects, and use less memory during training. A proper Object Occlusion data collection and augmentation strategy enables small models to achieve accuracy competitive with more complicated models.
We find that networks can learn a new category from few samples, like humans, who have strong inference ability. On the other hand, humans may also benefit from imitating the networks' learning style, whose learning process uses little analysis or inference but sees more samples. This style may be useful when learning new concepts or new languages: showing more samples of the new thing without detailed explanation makes it easy for learners to understand and remember it.
Interesting directions for future work include improving object occlusion data collection and augmentation strategies to close the gap between virtual 3D objects and real 3D objects.
References
- [1] Z. Tian, C. Shen, H. Chen, and T. He, “Fcos: Fully convolutional one-stage object detection,” in Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 9627–9636.
- [2] S. Hinterstoisser, O. Pauly, H. Heibel, M. Martina, and M. Bokeloh, “An annotation saved is an annotation earned: Using fully synthetic training for object detection,” in Proceedings of the IEEE/CVF international conference on computer vision workshops, 2019, pp. 0–0.
- [3] J. Tremblay, A. Prakash, D. Acuna, M. Brophy, V. Jampani, C. Anil, T. To, E. Cameracci, S. Boochoon, and S. Birchfield, “Training deep networks with synthetic data: Bridging the reality gap by domain randomization,” in Proceedings of the IEEE conference on computer vision and pattern recognition workshops, 2018, pp. 969–977.
- [4] J. Hoffman, E. Tzeng, T. Park, J.-Y. Zhu, P. Isola, K. Saenko, A. Efros, and T. Darrell, “Cycada: Cycle-consistent adversarial domain adaptation,” in International conference on machine learning. PMLR, 2018, pp. 1989–1998.
- [5] S. Zhang, C. Chi, Y. Yao, Z. Lei, and S. Z. Li, “Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 9759–9768.
- [6] T. Rong, Y. Zhu, H. Cai, and Y. Xiong, “A solution to product detection in densely packed scenes,” arXiv preprint arXiv:2007.11946, 2020.
- [7] Z. Zhang, T. He, H. Zhang, Z. Zhang, J. Xie, and M. Li, “Bag of freebies for training object detection neural networks,” arXiv preprint arXiv:1902.04103, 2019.
- [8] K. He, R. Girshick, and P. Dollár, “Rethinking imagenet pre-training,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 4918–4927.
- [9] A. Kuznetsova, H. Rom, N. Alldrin, J. Uijlings, I. Krasin, J. Pont-Tuset, S. Kamali, S. Popov, M. Malloci, A. Kolesnikov et al., “The open images dataset v4,” International Journal of Computer Vision, vol. 128, no. 7, pp. 1956–1981, 2020.
- [10] Y. Cui, M. Jia, T.-Y. Lin, Y. Song, and S. Belongie, “Class-balanced loss based on effective number of samples,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 9268–9277.
- [11] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, “mixup: Beyond empirical risk minimization,” arXiv preprint arXiv:1710.09412, 2017.
- [12] G. Ghiasi, Y. Cui, A. Srinivas, R. Qian, T.-Y. Lin, E. D. Cubuk, Q. V. Le, and B. Zoph, “Simple copy-paste is a strong data augmentation method for instance segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 2918–2928.
- [13] K. Shmelkov, C. Schmid, and K. Alahari, “Incremental learning of object detectors without catastrophic forgetting,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 3400–3409.
- [14] K. Park, U. Sinha, J. T. Barron, S. Bouaziz, D. B. Goldman, S. M. Seitz, and R. Martin-Brualla, “Nerfies: Deformable neural radiance fields,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2021, pp. 5865–5874.