School of Computing and Information Systems, Singapore Management University, Singapore
Email: [email protected], [email protected]
Incremental Learning on Food Instance Segmentation
Abstract
Food instance segmentation is essential for estimating the serving size of dishes in a food image. The recent cutting-edge techniques for instance segmentation are deep learning networks with impressive segmentation quality and fast computation. Nonetheless, they are data-hungry and expensive to annotate. This paper proposes an incremental learning framework to optimize model performance given a limited data labelling budget. The power of the framework is a novel difficulty assessment model, which forecasts how challenging an unlabelled sample is for the latest trained instance segmentation model. The data collection procedure is divided into several stages, in each of which a new sample package is collected. The framework allocates the labelling budget to the most difficult samples. The unlabelled samples that meet a certain qualification from the assessment model are used to generate pseudo-labels. Eventually, the manual labels and pseudo-labels are added to the training data to improve the instance segmentation model. On four large-scale food datasets, our proposed framework outperforms current incremental learning benchmarks and achieves competitive performance with the model trained on fully annotated samples.
Keywords:
Food computing · Incremental learning · Instance segmentation · Semi-supervised learning · Membership inference
1 Introduction
Instance segmentation is a fundamental task in computer vision with several applications, including food portion size estimation [1, 2, 3], text localization [4, 5], and vehicle surveillance [6, 7]. With the development of deep learning, several neural networks, such as Mask R-CNN [8], CenterMask [9], Watershed [6], and Terrace [2], have been developed to address the problem. Nonetheless, these techniques prioritize segmentation quality and computation efficiency over annotation effort. This paper aims to fill this gap with an incremental learning framework that maximizes the value of the labour force by automatically selecting the most difficult samples for annotation and high-quality pseudo-labels from the remaining samples for model improvement.

Over the past few years, food computing has received great attention, with several research papers on food recognition [11], detection [12] and segmentation [2]. Among these tasks, food instance segmentation is essential for portion size estimation [13]. However, preparing ground-truth instance segmentation masks for model training is time-consuming. To address the annotation problem, most existing strategies reduce the amount of supervised information, ranging over image-level [14], instance-level [15], bounding box [12], and polygon [2, 3] supervision, introducing a trade-off between segmentation performance and annotation effort. Instead of an instance mask, image-level and instance-level supervision may only yield a peak response map or a blob that indicates instance locations in the input image. Bounding box information is better for estimating instance size but fails to predict instance shape. Polygon supervision is the most appropriate technique, annotating critical corner points on the instances to secure segmentation quality. However, for complex appearances, such as in the food domain, the large number of critical points makes polygon annotation expensive. For such a situation, semi-supervision [16] is a potential solution, in which some random samples are annotated and the remaining ones are given model-generated instance masks. A risky aspect of this approach is that the quality of the generated masks is not guaranteed to improve the model.
This paper presents an incremental learning framework for instance segmentation, whose novelty is an assessment module that automatically scores the difficulty of newly collected samples. In practice, data samples are collected gradually in package units for research and commercial purposes. Our framework annotates the first package and trains an initial instance segmentation model on it. In parallel, we use this package to train a difficulty assessment model that quantitatively evaluates how challenging each new sample is for the latest instance segmentation model. Usually, a new package comprises easy, neutral, and hard samples, as illustrated in Fig. 1. The ready-to-use assessment model is employed to forecast their difficulty levels. With inspiration from hard-mining [17] and self-paced learning [18], we then direct the hard samples to the labour force for polygon annotation and use the easy samples to generate pseudo-labels. The neutral and easy samples are then preserved in the data pool for consideration in the next stage. The manual labels and high-quality pseudo-labels are used to fine-tune the instance segmentation model. We repeat this procedure whenever a new sample package is collected or the labour force is available for annotation work.
2 Related Work
Instance segmentation Current mainstream deep learning techniques for instance segmentation can be classified into two major groups: proposal-based [8, 9] and clustering-based [2, 6]. Proposal-based techniques, such as Mask R-CNN [8] and CenterMask [9], generate a set of candidate bounding boxes for instance detection, and each box is followed by a prediction of the corresponding instance mask. Clustering-based techniques generate one or many instance feature maps, such as multiple terrace layers [2] or watershed energy and centre-direction maps [6], from which an instance segmentation map can be constructed. This paper exploits Terrace [2], a clustering-based instance segmentation technique with impressive performance in the food domain, for our incremental learning framework. The use of the Terrace technique is two-fold. First, given a new sample, the output is a terrace probability distribution, representing how certain the model is about its prediction. This paper investigates the generated probability distribution to forecast the difficulty level of a new image sample. Second, we employ Terrace [2] as the core instance segmentation model to develop and evaluate the proposed incremental learning framework.
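To make the first point concrete, the snippet below sketches one way the spread of a terrace probability distribution can signal prediction confidence: the mean per-pixel entropy is high when probability mass is spread across layers. This is only an illustrative proxy with assumed array shapes; the framework itself feeds the full clustering maps into a learned assessment model (Section 3.2) rather than relying on a hand-crafted score.

```python
import numpy as np

def mean_terrace_entropy(terrace_probs: np.ndarray) -> float:
    """Average per-pixel entropy of a predicted terrace distribution.

    terrace_probs: array of shape (L, H, W) holding, for each pixel, a
    probability distribution over L terrace layers (summing to 1 along
    axis 0). Higher mean entropy = a less confident prediction.
    """
    eps = 1e-12
    per_pixel = -(terrace_probs * np.log(terrace_probs + eps)).sum(axis=0)
    return float(per_pixel.mean())

# Toy example: confident pixels (0.99/0.01) vs. uncertain pixels (0.5/0.5).
probs = np.zeros((2, 2, 2))
probs[:, 0, :] = [[0.99, 0.99], [0.01, 0.01]]  # confident row of pixels
probs[:, 1, :] = 0.5                           # uncertain row of pixels
print(mean_terrace_entropy(probs))             # higher than an all-confident map
```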
Semi-supervised instance segmentation To improve instance segmentation performance with a limited human-resource budget for image annotation, [16] proposes a semi-supervised learning mechanism. Two instance segmentation models work in sequence: an annotation model and a production model. When a new package of samples is collected, a small number of samples are randomly selected and manually labelled. The annotation model is trained on the manually labelled samples and then used to generate pseudo-labels for the remaining unlabelled samples. The production model is trained on both manual and pseudo-labels. [16] argues that the extra knowledge from pseudo-labels improves segmentation performance at no additional cost. However, the generated pseudo-labels are not quality-checked. While a good pseudo-label may provide more context to help the model distinguish instance from background pixels and be aware of instance boundaries, a bad pseudo-label may give wrong information and confuse the production model. In our framework, we propose a difficulty assessment technique to forecast the quality of generated pseudo-labels. Then, only samples with good pseudo-labels are used to improve the production model, while the remaining samples can be either skipped or annotated depending on labour force availability.
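For reference, here is a minimal sketch of the two-model mechanism of [16] as described above. `annotate` and `train` are placeholder callables, and the 10% labelling fraction is illustrative rather than a value from [16]:

```python
import random

def semi_supervised_round(package, annotate, train, label_fraction=0.1):
    """One budget-aware round in the spirit of [16] (schematic sketch).

    annotate(samples) -> list of manual labels; train(pairs) -> model,
    where model(sample) -> pseudo-label. All callables are placeholders.
    """
    random.shuffle(package)
    k = int(len(package) * label_fraction)
    labelled = list(zip(package[:k], annotate(package[:k])))

    annotation_model = train(labelled)  # trained on manual labels only
    # Pseudo-labels for the rest; note: no quality check in [16].
    pseudo = [(s, annotation_model(s)) for s in package[k:]]

    production_model = train(labelled + pseudo)  # manual + pseudo-labels
    return annotation_model, production_model
```

Our framework keeps this two-model split but inserts a difficulty assessment step between pseudo-label generation and production-model training.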
Membership inference attack Given a deep learning model with open access, [19] proposes a technique to predict whether a query sample is inside or outside the training set. An incremental learning framework may employ this technique to filter out used samples and spend the annotation budget on new ones. This paper proposes a more advanced method, which forecasts the panoptic quality (PQ) [10] score of every new sample (without a ground-truth instance segmentation mask), reflecting how challenging the sample is for the latest instance segmentation model.
3 Incremental Learning Framework
Fig. 2 illustrates the proposed incremental learning framework for instance segmentation, including an incremental learning engine interconnecting the states of collected samples, supervised labels, and instance segmentation models between successive stages. In the scope of this paper, we employ the clustering-based technique Terrace [2] for instance segmentation. At a specific stage, our system stores unlabelled samples in a data pool, a set of supervised samples, an annotation model, and a production model. The incremental learning engine receives unlabelled samples, forecasts their difficulty scores, and navigates them to the manual labelling module, the pseudo-labelling module, or the data pool for future use.

3.1 Data Pool, Labels and Instance Segmentation Model
In both academia and industry, the data collection process is either automatic, via crawler tools, or manual, by humans. As the required data for machine learning is usually large-scale, collection and annotation take place on a regular basis, such as daily, weekly, or monthly. This paper defines such a period as a “stage” and presents the incremental learning framework between consecutive stages. At stage t, the following states are recorded in the learning system:
- Data pool stores the set of samples whose instance segmentation masks have not been annotated yet. As the task is time-consuming, there is always a long queue of unlabelled samples. The pool holds the samples left unlabelled in the previous stage for use in the current stage.
- Supervised labels holds the set of (sample, instance segmentation label) pairs. In the proposed framework, we annotate most of the samples collected in the first stage. In the subsequent stages, we reduce the labelling cost by selecting only a small percentage of samples to annotate and navigating them to the supervised labels. In other words, the labour cost is high only for the first stage and can be flexibly adjusted in the subsequent stages.
- Annotation model, inspired by [16], is an instance segmentation model trained on supervised samples. At the beginning of each stage, the annotation model runs inference on all pooled samples to generate clustering maps and pseudo-labels. At the end of the stage, the model is improved when the set of supervised samples is updated.
- Production model shares the same architecture as the annotation model but is trained on both supervised labels and pseudo-labels. As pseudo-labels are generated from a large-scale data pool, they cover a diverse food context. This diversity helps the production model understand the data population better. Unlike [16], we do not include all pseudo-labels to improve the production model. Instead, we pick high-quality pseudo-labels, automatically recommended by the difficulty assessment module.
When a new package is collected, we merge it with the current pooled data and feed it into the incremental learning engine, as sketched below.
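The following sketch summarizes the state carried between stages; the class and field names are ours, chosen for illustration rather than taken from the paper:

```python
from dataclasses import dataclass, field

@dataclass
class StageState:
    """Per-stage state of the framework (illustrative names)."""
    data_pool: list = field(default_factory=list)   # unlabelled samples
    supervised: list = field(default_factory=list)  # (sample, manual mask) pairs
    annotation_model: object = None   # trained on supervised labels only
    production_model: object = None   # supervised labels + good pseudo-labels

def begin_stage(state: StageState, new_package: list) -> list:
    """Merge the new package with the pool; the engine then scores the result."""
    candidates = state.data_pool + new_package
    state.data_pool = []  # routed samples will be re-pooled by the engine
    return candidates
```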
3.2 Incremental Learning Engine
The novelty of this paper is the proposal of an incremental learning engine that forecasts the difficulty level of collected samples and distributes them to manual annotation, pseudo-label generation, and the data pool. In particular, the samples collected in the new package and the current samples in the data pool are merged. Each sample is then subjected to the latest annotation model, generating a set of clustering maps. Each clustering map comprises a pre-defined number of terrace layers. The probability distribution over terrace layers reveals how confident the annotation model is about its prediction for the sample. Therefore, we explore the output terrace layers to train a difficulty assessment model as follows:


- Difficulty assessment model The proposed difficulty assessment model, as shown in Fig. 3, is a convolutional neural network composed of convolutional and fully connected layers. This paper employs ResNet-50 [20], pre-trained on ImageNet [21], in the experiments. The model's input is a stack of clustering layers representing the terrace probability distributions predicted by the latest annotation model. The model's output is a regression value within the range [0, 1], forecasting the PQ score the annotation model would achieve on the new sample. A low score means a difficult sample, and a high score means an easy sample. The loss function for difficulty assessment model training is formulated as follows (a sketch of the model follows this list):

  $\mathcal{L} = \frac{1}{N}\sum_{i=1}^{N}\left| q_i - \hat{q}_i \right|$   (1)

  where $q_i$ and $\hat{q}_i$ are the ground-truth and predicted PQ scores for the $i$-th sample, and $N$ is the total number of training samples.
- Generating training samples Inspired by the membership attack framework [19], we generate training samples for the difficulty assessment model via shadow models, as illustrated in Fig. 4. First, a sample package is collected and manually annotated; in transfer learning, an alternative is to use annotated samples from ready-to-use datasets. Second, the labelled samples are distributed into shadow pairs of training and testing sets. Third, each shadow training set is used to train a shadow Terrace model for instance segmentation, sharing the same hyper-parameters (e.g., number of terrace layers) as the production model. Since all these samples, both training and testing, come with ground-truth instance segmentation masks, the instance maps generated by a shadow model can be measured with PQ scores. A predicted clustering map and the corresponding PQ score form a (clustering map, PQ score) pair. We repeat the second and third steps to build a large-scale set of (clustering map, PQ score) samples to train and test the difficulty assessment model. We iterated the process 80 times in the experiments, resulting in over 35,000 (clustering map, PQ score) pairs, from which we select roughly 9,100 pairs with a uniform distribution over the PQ score for assessment model training and evaluation.
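Below is a minimal PyTorch sketch of the difficulty assessment model and one training step with the absolute-error loss of Eq. (1). The number of terrace layers (8 here), input resolution, batch size, and optimizer settings are assumptions for illustration; the paper specifies only a ResNet-50 [20] backbone pre-trained on ImageNet [21] with a regression output in [0, 1]:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class DifficultyAssessor(nn.Module):
    """Regresses a PQ score in [0, 1] from stacked terrace clustering layers."""
    def __init__(self, num_terrace_layers: int = 8):
        super().__init__()
        backbone = resnet50(weights="IMAGENET1K_V1")  # ImageNet pre-training
        # Re-initialize the first conv to accept the clustering-layer stack
        # instead of RGB (this layer loses its pre-trained weights).
        backbone.conv1 = nn.Conv2d(num_terrace_layers, 64, kernel_size=7,
                                   stride=2, padding=3, bias=False)
        backbone.fc = nn.Linear(backbone.fc.in_features, 1)  # regression head
        self.backbone = backbone

    def forward(self, clustering_maps: torch.Tensor) -> torch.Tensor:
        # Sigmoid keeps the forecast PQ score within [0, 1].
        return torch.sigmoid(self.backbone(clustering_maps)).squeeze(1)

model = DifficultyAssessor()
criterion = nn.L1Loss()  # mean absolute error, matching Eq. (1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# Dummy batch standing in for shadow-model (clustering map, PQ score) pairs.
maps = torch.rand(4, 8, 224, 224)
pq_scores = torch.rand(4)
loss = criterion(model(maps), pq_scores)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```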
At this point, the proposed incremental learning framework can automatically forecast the difficulty of a new sample by feeding it into the latest annotation model and then the difficulty assessment model. Depending on the available budget, a percentage of the most challenging samples is selected for manual labelling. The remaining samples are divided into neutral and easy sets, where easy samples are required to meet a certain threshold of difficulty score. The instance maps generated for easy samples by the latest annotation model are used as pseudo-labels. For the next production model, low-weighted pseudo-labels and high-weighted supervised labels are added to the training set; in particular, we weigh the samples in the first package, the hard samples, and the easy samples in a ratio of 2:4:1 to better mine the challenging cases (see the routing sketch below). For the annotation model, only supervised labels are used. Finally, neutral and easy samples are put into the data pool for use in the next stage.
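A compact sketch of this routing rule, assuming `assessor` wraps the annotation model plus the difficulty assessment model and returns a forecast PQ score in [0, 1] for a (hashable) sample id. The 0.6 threshold is illustrative here (Section 4 reports 60% PQ as the balancing point), and the weights follow the 2:4:1 ratio stated above:

```python
def route_samples(candidates, assessor, budget: int, easy_threshold: float = 0.6):
    """Split scored samples into manual-annotation, pseudo-label, and pool sets."""
    scores = {s: assessor(s) for s in candidates}  # forecast PQ per sample
    ranked = sorted(candidates, key=scores.get)    # hardest (lowest PQ) first
    hard, rest = ranked[:budget], ranked[budget:]  # hard -> manual labelling
    easy = [s for s in rest if scores[s] >= easy_threshold]  # -> pseudo-labels
    return hard, easy, rest  # rest (neutral + easy) goes back to the data pool

# Training weights for the production model: first package : hard : easy.
SAMPLE_WEIGHTS = {"first_package": 2.0, "hard": 4.0, "easy": 1.0}
```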
4 Experiments
4.1 Experimental Setup
4.1.1 Dataset
Four food datasets, namely Dimsum, Sushi, Cookie, and UECFoodPixComp (UEC) [2, 22], are employed to evaluate the incremental learning performance. While Sushi is only used as a pre-training dataset to evaluate various sampling strategies for transfer learning on the Dimsum dataset, each of the remaining datasets is split into six packages to evaluate the incremental learning framework. It is noted that the number of food instances differs among image samples. To be fair to the instance-based labelling budget for instance segmentation, we organize the packages so that each contains a similar number of food instances: 1,575 (one possible splitting heuristic is sketched below). The number of evaluation samples in Dimsum, Cookie, and UEC is 768, 1,152, and 1,000, respectively.
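The paper does not prescribe how images are assigned to packages; the following greedy heuristic is one simple way to balance instance counts across packages, matching the roughly 1,575-instances-per-package setup:

```python
def split_into_packages(samples, counts, num_packages: int = 6):
    """Greedily assign samples so packages hold similar instance totals.

    samples: list of image ids; counts: per-image instance counts, aligned
    with samples. Largest images are placed first into the lightest package.
    """
    order = sorted(range(len(samples)), key=lambda i: -counts[i])
    packages = [[] for _ in range(num_packages)]
    totals = [0] * num_packages
    for i in order:
        j = totals.index(min(totals))  # current lightest package
        packages[j].append(samples[i])
        totals[j] += counts[i]
    return packages
```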
4.1.2 Evaluation Metric
Regarding instance segmentation performance, we employ Panoptic Quality (PQ) [10] to evaluate the production model. For the difficulty assessment model, which outputs a PQ-based difficulty score, we measure the absolute error between the predicted and ground-truth PQ scores. To assess the efficiency of the proposed incremental framework, we evaluate instance segmentation performance across various settings of manual labelling effort, including 5%, 10%, 20%, 30% and 100% annotation time for each data package.
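For completeness, PQ [10] is the sum of IoUs over matched segments divided by |TP| + ½|FP| + ½|FN|, with predictions and ground truths matched at IoU > 0.5. A minimal sketch, assuming matching has already been performed:

```python
def panoptic_quality(matched_ious, num_fp: int, num_fn: int) -> float:
    """PQ = sum(IoU over matched pairs) / (|TP| + 0.5*|FP| + 0.5*|FN|).

    matched_ious: IoU of each matched (prediction, ground truth) pair,
    where matches require IoU > 0.5; num_fp / num_fn count the unmatched
    predictions / ground-truth instances.
    """
    tp = len(matched_ious)
    denom = tp + 0.5 * num_fp + 0.5 * num_fn
    return sum(matched_ious) / denom if denom else 0.0

# Two matched instances (IoU 0.9 and 0.8) and one missed instance:
print(panoptic_quality([0.9, 0.8], num_fp=0, num_fn=1))  # 1.7 / 2.5 = 0.68
```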
Table 1: PQ (%) of easy, random, and hard sampling on the Dimsum dataset under trivial and transfer learning, varying the fraction of annotated training samples.

| Sampling strategy | Trivial, 5% | Trivial, 10% | Transfer, 0% | Transfer, 5% | Transfer, 10% |
|---|---|---|---|---|---|
| easy | 45.66 | 54.79 | 76.33 | 76.88 | 77.54 |
| random | 58.77 | 62.75 | 76.33 | 79.52 | 80.74 |
| hard | 62.99 | 65.96 | 76.33 | 80.61 | 82.84 |
4.2 Experimental Results
4.2.1 Data sampling for instance segmentation
First, we investigate the easy, random, and hard sampling strategies for food instance segmentation. Table 1 lists their performance on the Dimsum dataset. The hard-mining strategy consistently outperforms easy and random sampling in both trivial and transfer learning, with a gap of 1-4% PQ over random sampling when training on only 5% of the samples and 1-3% PQ when training on 10%. When the annotation budget is limited, it is better to prioritize labour resources for the most challenging samples.

4.2.2 Difficulty assessment model
We evaluate the assessment model by the mean absolute error on the validation set presented in Section 3.2. The recorded error is relatively low, at 7.8%; the model is therefore promising for forecasting the difficulty score of an unlabelled sample. Fig. 5 visualizes some food images and the predicted PQ scores. For the first three samples, which are challenging due to obscure boundaries, stacking, and occlusion, the assessment model gives lower PQ scores, indicating that they are difficult for the current instance segmentation model. Meanwhile, higher scores, marking easy samples, are given to the last three samples, where the boundaries between instances are relatively clear.
Table 2: Production-model PQ (%) on the Dimsum dataset under different PQ-score thresholds for accepting pseudo-labels from easy samples.

| PQ threshold for easy samples (%) | 0 | 20 | 40 | 60 | 80 |
|---|---|---|---|---|---|
| Performance (PQ, %) | 82.52 | 82.85 | 83.06 | 83.34 | 83.19 |
4.2.3 Pseudo-labels
Next, we examine how pseudo-labels contribute to incremental learning. Table 2 lists the instance segmentation quality on the Dimsum dataset given different thresholds for accepting pseudo-labels. There is a trade-off between pseudo-label quality and the number of easy samples explored. On the one hand, a low threshold (e.g., 0% or 20% PQ), as in [16], lets the model explore the context of a large number of unlabelled samples. On the other hand, a high threshold (e.g., 80% PQ) guarantees that the model learns from more precise pseudo-labels. A medium PQ threshold of 60% is recorded as the balancing point for retrieving pseudo-labels for incremental learning.
4.2.4 Incremental Learning strategies
We compare the proposed incremental learning strategy, namely PQ-based∗, against various benchmarks of random and hard sampling techniques on the Dimsum dataset. Random sampling shuffles the newly collected samples and navigates 10% of them to the human resource for annotation. The advanced method random∗ [16] additionally employs the latest annotation model to generate pseudo-labels for the remaining 90% of samples. Hard sampling techniques forecast the difficulty scores of every sample in the new package and give higher annotation priority to the more difficult samples. The membership-based method [19] evaluates whether a new sample is similar to an existing member of the current training data. While the output label of this method is a binary yes or no, the output confidence score can be used as an indicator of how challenging the samples are. However, as membership inference is a binary classification problem, the confidence score distribution is skewed towards 0% for non-members and 100% for members. Our techniques, PQ-based and PQ-based∗, generate difficulty scores with a more balanced distribution. The PQ-based∗ method additionally adds pseudo-labels to the training samples. Unlike random∗ [16], we can exclude low-quality pseudo-labels thanks to the proposed assessment model.
Table 3 lists the performance of these incremental learning strategies on the Dimsum dataset, where the annotation budget covers only 10% of each new sample package. Overall, PQ-based∗ consistently outperforms the remaining strategies across the six stages. The 10% package in the hard sampling techniques, membership-based and PQ-based, contributes around 1.5% improvement in panoptic quality compared to random sampling. PQ-based demonstrates a better assessment approach than membership-based, especially in the early stages. PQ-based∗, aggregating the advantages of hard sampling and good pseudo-labels, improves by 1.5-2% over the benchmark random∗ [16].
Table 3: PQ (%) of incremental learning strategies on the Dimsum dataset with a 10% annotation budget per package. random and random∗ [16] are random sampling; membership [19], PQ-based, and PQ-based∗ are hard sampling. Package 1 is fully annotated.

| Package | Full annotation | random | random∗ [16] | membership [19] | PQ-based | PQ-based∗ |
|---|---|---|---|---|---|---|
| 1 | 74.60 | | | | | |
| 2 | | 76.23 | 76.73 | 76.57 | 77.08 | 78.32 |
| 3 | | 77.30 | 78.81 | 78.42 | 79.49 | 80.98 |
| 4 | | 79.39 | 79.87 | 80.57 | 81.24 | 81.99 |
| 5 | | 80.27 | 80.47 | 80.97 | 81.90 | 82.74 |
| 6 | | 80.75 | 81.95 | 82.11 | 82.39 | 83.34 |
4.2.5 Annotation effort and model performance
Lastly, we compare the performance of the proposed method, PQ-based∗, under a small annotation effort against the traditional approach of training on manual labels for all collected samples. Fig. 6a illustrates panoptic quality over annotation time on the Dimsum dataset. Given the PQ-based∗ 10% model pre-trained on Dimsum, we perform transfer learning on the Cookie and UEC datasets. On the Dimsum dataset, at the last package, with only one-fourth of the annotation effort, PQ-based∗ 10% is competitive with the fully annotated approach. PQ-based∗ 20% and PQ-based∗ 30%, with one-third and one-half of the annotation time, achieve performance equivalent to the model with 100% of the data annotated. Using the PQ-based∗ 10% model pre-trained on Dimsum, the transfer-learning models of PQ-based∗ 30% on Cookie and UEC show a minor gap of around 1% PQ below the models with 100% of samples annotated.
5 Conclusion
We have presented an incremental learning framework for food instance segmentation given a fixed annotation budget. The framework's power comes from the proposed assessment model, which forecasts difficulty scores for unlabelled samples. The score enables the framework to select hard samples for manual labelling and high-confidence samples for pseudo-labels. The experimental results justify the efficiency of the assessment model in scoring how challenging food images are for instance segmentation. The proposed framework outperforms current incremental learning benchmarks on the Dimsum dataset with the same amount of annotation effort. The framework also exhibits competitive performance against fully annotated models on the Dimsum dataset and in transfer learning on the Cookie and UEC datasets, with much less labelling time. The proposed incremental learning strategy is a promising solution for training and deploying a food instance segmentation model in practice.
References
- [1] Eduardo Aguilar, Beatriz Remeseiro, Marc Bolanos, and Petia Radeva. Grab, pay, and eat: Semantic food detection for smart restaurants. In IEEE Transactions on Multimedia, pages 3266–3275, 2018.
- [2] Huu-Thanh Nguyen and Chong-Wah Ngo. Terrace-based food counting and segmentation. In The Association for the Advancement of Artificial Intelligence (AAAI), pages 2364–2372, May 2021.
- [3] Huu-Thanh Nguyen, Chong-Wah Ngo, and Wing-Kwong Chan. Sibnet: Food instance counting and segmentation. In Pattern Recognition, volume 124, page 108470, 2022.
- [4] Wenhai Wang, Enze Xie, Xiang Li, Wenbo Hou, Tong Lu, Gang Yu, and Shuai Shao. Shape robust text detection with progressive scale expansion network. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 9328–9337, Jun 2019.
- [5] Yixing Zhu and Jun Du. Textmountain: Accurate scene text detection via instance segmentation. In Pattern Recognition, page 107336, 2021.
- [6] Min Bai and Raquel Urtasun. Deep watershed transform for instance segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2858–2866, Jul 2017.
- [7] Davy Neven, Bert De Brabandere, Marc Proesmans, and Luc Van Gool. Instance segmentation by jointly optimizing spatial embeddings and clustering bandwidth. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2019.
- [8] Kaiming He, Georgia Gkioxari, Piotr Dollar, and Ross B. Girshick. Mask R-CNN. In IEEE International Conference on Computer Vision (ICCV), pages 2980–2988, 2017.
- [9] Youngwan Lee and Jongyoul Park. Centermask: Real-time anchor-free instance segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
- [10] Alexander Kirillov, Kaiming He, Ross Girshick, Carsten Rother, and Piotr Dollar. Panoptic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2019.
- [11] Jingjing Chen, Lei Pang, and Chong-Wah Ngo. Cross-modal recipe retrieval: How to cook this dish? In MultiMedia Modeling, pages 588–600, 2017.
- [12] Lixi Deng, Jingjing Chen, Qianru Sun, Xiangnan He, Sheng Tang, Zhaoyan Ming, Yongdong Zhang, and Tat-Seng Chua. Mixed-dish recognition with contextual relation networks. In Proceedings of the 27th ACM International Conference on Multimedia, pages 112–120, 2019.
- [13] Jiabao Lei, Jianing Qiu, Frank P.-W. Lo, and Benny Lo. Assessing individual dietary intake in food sharing scenarios with food and human pose detection. In Pattern Recognition. ICPR International Workshops and Challenges, pages 549–557, 2021.
- [14] Yanzhao Zhou, Yi Zhu, Qixiang Ye, Qiang Qiu, and Jianbin Jiao. Weakly supervised instance segmentation using class peak response. In Conference on Computer Vision and Pattern Recognition, pages 3791–3800, 2018.
- [15] Issam H. Laradji, Negar Rostamzadeh, Pedro O. Pinheiro, David Vazquez, and Mark Schmidt. Where are the blobs: Counting by localization with point supervision. In The European Conference on Computer Vision (ECCV), Sep 2018.
- [16] Miriam Bellver, Amaia Salvador, Jordi Torres, and Xavier Giró-i-Nieto. Budget-aware semi-supervised semantic and instance segmentation. In CVPR Workshops, 2019.
- [17] Abhinav Shrivastava, Abhinav Gupta, and Ross Girshick. Training region-based object detectors with online hard example mining. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
- [18] Te Pi, Xi Li, Zhongfei Zhang, Deyu Meng, Fei Wu, Jun Xiao, and Yueting Zhuang. Self-paced boost learning for classification. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI'16, pages 1932–1938. AAAI Press, 2016.
- [19] Reza Shokri, Marco Stronati, Congzheng Song, and Vitaly Shmatikov. Membership inference attacks against machine learning models. In IEEE Symposium on Security and Privacy, 2017.
- [20] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
- [21] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009.
- [22] Kaimu Okamoto and Keiji Yanai. UEC-FoodPix Complete: A large-scale food image segmentation dataset. In Proc. of ICPR Workshop on Multimedia Assisted Dietary Management (MADiMa), 2021.