
NutritionVerse-Thin: An Optimized Strategy for Enabling Improved Rendering of 3D Thin Food Models

Chi-en Amy Tai*1   Jason Li*1   Sriram Kumar*1
  Saeejith Nair1   Yuhao Chen1   Pengcheng Xi2   Alexander Wong1
* All authors contributed equally.
1 Vision and Image Processing Lab, University of Waterloo
2 National Research Council Canada
{amy.tai, j2643li, ssriramk, smnair, yuhao.chen1, alexander.wong}@uwaterloo.ca
[email protected]
Abstract

With the growth in capabilities of generative models, there has been increasing interest in using photo-realistic renders of common 3D food items to improve downstream tasks such as food printing, nutrition prediction, and management of food wastage. Despite 3D modelling capabilities being more accessible than ever due to the success of NeRF-based view synthesis, such rendering methods still struggle to correctly capture thin food objects, often generating meshes with significant holes. In this study, we present an optimized strategy for enabling improved rendering of 3D thin food models and demonstrate qualitative improvements in rendering quality. Our method generates the 3D model mesh via a proposed thin-object-optimized differentiable reconstruction method and tailors the strategy at both the data collection and training stages to better handle thin objects. While simple, we find that this technique can be employed for quick and highly consistent capturing of 3D thin objects.

1 Introduction

With the growth in capabilities of generative models, there has been increasing interest in using photo-realistic renders of common 3D food items. Having 3D models of food is highly beneficial for food printing, helping to design foods with new textures that are potentially more nutritious [1]. 3D food models can also assist with the generation of more data for volume estimation and nutrient prediction [2, 3]. Another use case for 3D food models is improved management of food wastage, as the internal structure of food can be modified to control a user’s food intake [4]. Despite 3D modelling capabilities being more accessible than ever due to the success of NeRF-based view synthesis, such rendering methods still struggle to correctly capture thin food objects, often generating meshes with significant holes, as seen in Fig. 1.

Figure 1: Comparison of a Lay’s chip mesh representation rendered using instant-ngp (left) and NutritionVerse-Thin (right).

In this study, we present an optimized strategy for enabling improved rendering of thin 3D food models, and demonstrate qualitative improvements in rendering quality. Our method generates the 3D model mesh via a proposed thin-object-optimized differentiable reconstruction method [5] and tailors the strategy at both the data collection and training stages to better handle thin objects. While simple, we find that this technique can be employed for quick and highly consistent capturing of 3D thin objects.

2 Methodology

The proposed optimized strategy for enabling improved rendering of 3D thin food models consists of optimizations in both the data collection and model training phases. An overview of the entire process is shown in Fig. 2.

Figure 2: Overview of the process map for an example thin food object using the proposed optimized strategy.

2.1 Data Collection

A significant problem that arises when creating a 3D model of a thin object is capturing images from different viewpoints, since the object cannot stand vertically without external support. Providing such support obscures much of the object and stands out against the background, preventing quality data collection of the object itself. To alleviate this problem and collect multi-view perspectives without obstruction, we first attach the object to a thread using hot glue and then suspend the thread from a platform. We rotate the thread (and thus the object) to capture images of the object from a full 360 degrees.

Another issue is inconsistent lighting, which results in irregularities in surface brightness. To minimize interference and ensure consistency, we suspend the object directly below a light source (at a distance of 45 cm) that shines perpendicular to the ground and dim all other light sources in the vicinity, as seen in Fig. 2. A Google Pixel 7 Pro phone is used to record a video of the rotating item, keeping the item centered in the frame at all times. We position the camera parallel to the item at around 10-15 cm away, as these close-up shots capture maximum detail and thus optimize render quality. We then gradually move the phone down and subsequently up (while maintaining distance and centering) to capture the top and bottom of the object. The suggested video length is about 1 minute 20 seconds, as this duration is short enough to avoid lengthy preprocessing times while still allowing for adequate multi-view data collection.

Approximately 200 frames are then sampled from the video at equal intervals using OpenCV [6]. The backgrounds of the sampled images are removed using Rembg, and image masks are generated [7]. The images are downscaled to a resolution of 512×512 to reduce strain on the GPU during the mesh extraction process. The preprocessed images are then input to COLMAP [8], which produces intrinsic and extrinsic camera parameters for the accepted images [9]. We find that the more comprehensive exhaustive feature matching (compared to sequential matching) produces better results without imposing a significant time delay, since our image set is small. Finally, the pose information, images, and image masks are passed into the thin-object-optimized differentiable reconstruction method [5] for model training and mesh generation.
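A minimal sketch of this preprocessing stage is given below, assuming the rembg Python package and a local COLMAP installation are available; the video file name, directory layout, and frame count are illustrative rather than the exact pipeline used.

```python
import subprocess
from pathlib import Path

import cv2                    # OpenCV for frame sampling and resizing
from rembg import remove      # background removal

VIDEO = "chip_rotation.mp4"   # hypothetical input video of the rotating item
OUT = Path("dataset")         # hypothetical output layout
N_FRAMES = 200                # ~200 evenly spaced frames
RES = (512, 512)              # downscaled resolution

(OUT / "images").mkdir(parents=True, exist_ok=True)
(OUT / "masks").mkdir(parents=True, exist_ok=True)

# 1. Sample ~200 frames at equal intervals from the video.
cap = cv2.VideoCapture(VIDEO)
total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
for i in range(N_FRAMES):
    cap.set(cv2.CAP_PROP_POS_FRAMES, i * total // N_FRAMES)
    ok, frame = cap.read()
    if not ok:
        break
    # 2. Remove the background, downscale, and keep the alpha channel as the mask.
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    rgba = remove(rgb)                       # RGBA cutout with background removed
    rgba = cv2.resize(rgba, RES)
    cv2.imwrite(str(OUT / "images" / f"{i:04d}.png"),
                cv2.cvtColor(rgba, cv2.COLOR_RGBA2BGRA))
    cv2.imwrite(str(OUT / "masks" / f"{i:04d}.png"), rgba[:, :, 3])
cap.release()

# 3. Estimate intrinsic/extrinsic camera parameters with COLMAP,
#    using exhaustive (rather than sequential) feature matching.
db = OUT / "colmap.db"
subprocess.run(["colmap", "feature_extractor",
                "--database_path", str(db),
                "--image_path", str(OUT / "images")], check=True)
subprocess.run(["colmap", "exhaustive_matcher",
                "--database_path", str(db)], check=True)
(OUT / "sparse").mkdir(exist_ok=True)
subprocess.run(["colmap", "mapper",
                "--database_path", str(db),
                "--image_path", str(OUT / "images"),
                "--output_path", str(OUT / "sparse")], check=True)
```

The resulting poses, images, and masks can then be converted into whatever input format the downstream reconstruction method expects.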

2.2 Model Training

We first investigated instant-ngp [10] for rendering thin objects but found that the output mesh had an irregular, rough surface and contained numerous holes, as seen in Fig. 1. To address the poor rendering results of existing methods, we propose an extended training strategy that optimizes the differentiable reconstruction method proposed in [5] to handle thin objects. More specifically, we conduct training with a reduced tetrahedral grid resolution, a large Laplacian regularization weight, and strong signed distance field regularization to enforce greater mesh smoothness. While these constraints come at the expense of minor detail loss, they regularize the surface geometry and significantly reduce the jaggedness of the mesh. An additional benefit is the elimination of ghosting artifacts, as seen in Fig. 3. Finally, we recommend a lower training resolution and fewer iterations, as these significantly lower computational cost without incurring any noticeable difference in render quality.
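As a rough illustration, the snippet below sketches how such a configuration might be written for an nvdiffrec-style pipeline [5]. The parameter names follow the public reference implementation of [5], but the specific values and file paths are assumptions for illustration, not the exact settings used in this work.

```python
import json

# Illustrative thin-object training configuration (values are assumptions).
config = {
    "ref_mesh": "dataset",            # hypothetical path to the preprocessed data
    "random_textures": True,
    "iter": 3000,                     # fewer iterations to cut compute cost
    "train_res": [512, 512],          # lower training resolution
    "batch": 4,
    "learning_rate": [0.03, 0.01],
    "dmtet_grid": 64,                 # reduced tetrahedral grid resolution
    "mesh_scale": 2.1,
    "laplace_scale": 10000,           # large Laplacian weight for smoothness
    "sdf_regularizer": 1.0,           # strong signed distance field regularization
    "texture_res": [1024, 1024],
    "background": "white",
    "out_dir": "out/chip_thin",
}

with open("chip_thin.json", "w") as f:
    json.dump(config, f, indent=2)
```

Lowering the tetrahedral grid resolution coarsens the grid on which the signed distance field is defined, while the larger Laplacian and SDF regularization weights trade fine detail for smoother, hole-free surfaces.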

Figure 3: Comparisons of 3D model representations of a Lay’s and Doritos chip generated using the original and thin-object-optimized differentiable reconstruction method.
Figure 4: Example of a successful 3D model rendering from a sample input image.

3 Conclusion

In this paper, we present NutritionVerse-Thin, a technique that can be employed for quick and consistent capturing of 3D thin objects, and show sample successful renderings of chips using this training strategy. Moreover, this approach can be applied to collect other thin food models and hence would be beneficial in advancing food printing, improving models for nutrient prediction, and reducing food wastage.

References

  • Godoi et al. [2016] Fernanda C. Godoi, Sangeeta Prakash, and Bhesh R. Bhandari. 3d printing technologies applied for food design: Status and prospects. Journal of Food Engineering, 179:44–54, 2016. ISSN 0260-8774. doi: https://doi.org/10.1016/j.jfoodeng.2016.01.025. URL https://www.sciencedirect.com/science/article/pii/S0260877416300243.
  • Xu et al. [2013] Chang Xu, Ye He, Nitin Khanna, Carol J. Boushey, and Edward J. Delp. Model-based food volume estimation using 3d pose. In 2013 IEEE International Conference on Image Processing, pages 2534–2538, 2013. doi: 10.1109/ICIP.2013.6738522.
  • Rahman et al. [2012] Md Hafizur Rahman, Qiang Li, Mark Pickering, Michael Frater, Deborah Kerr, Carol Bouchey, and Edward Delp. Food volume estimation in a mobile phone based dietary assessment system. In 2012 Eighth International Conference on Signal Image Technology and Internet Based Systems, pages 988–995, 2012. doi: 10.1109/SITIS.2012.146.
  • Lin et al. [2020] Ying-Ju Lin, Parinya Punpongsanon, Xin Wen, Daisuke Iwai, Kosuke Sato, Marianna Obrist, and Stefanie Mueller. Foodfab: Creating food perception illusions using food 3d printing. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, CHI ’20, page 1–13, New York, NY, USA, 2020. Association for Computing Machinery. ISBN 9781450367080. doi: 10.1145/3313831.3376421. URL https://doi.org/10.1145/3313831.3376421.
  • Munkberg et al. [2021] Jacob Munkberg, Jon Hasselgren, Tianchang Shen, Jun Gao, Wenzheng Chen, Alex Evans, Thomas Müller, and Sanja Fidler. Extracting Triangular 3D Models, Materials, and Lighting From Images. arXiv:2111.12503, 2021.
  • Bradski [2000] G. Bradski. The OpenCV Library. Dr. Dobb’s Journal of Software Tools, 2000.
  • Gatis [2022] Daniel Gatis. Rembg. https://github.com/danielgatis/rembg, 2022.
  • Schönberger and Frahm [2016] Johannes Lutz Schönberger and Jan-Michael Frahm. Structure-from-Motion Revisited. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • Mildenhall et al. [2019] Ben Mildenhall, Pratul P. Srinivasan, Rodrigo Ortiz-Cayon, Nima Khademi Kalantari, Ravi Ramamoorthi, Ren Ng, and Abhishek Kar. Local light field fusion: Practical view synthesis with prescriptive sampling guidelines. ACM Transactions on Graphics (TOG), 2019.
  • Müller et al. [2022] Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller. Instant neural graphics primitives with a multiresolution hash encoding. ACM Trans. Graph., 41(4):102:1–102:15, July 2022. doi: 10.1145/3528223.3530127. URL https://doi.org/10.1145/3528223.3530127.