
Saliency-Aware Class-Agnostic Food Image Segmentation

Sri Kalyan Yarlagadda (Purdue University, West Lafayette, USA), Daniel Mas Montserrat (Purdue University, West Lafayette, USA), David Güera (Purdue University, West Lafayette, USA), Carol J. Boushey (University of Hawaii Cancer Center, Honolulu, USA), Deborah A. Kerr (Curtin University, Perth, Australia), and Fengqing Zhu (Purdue University, West Lafayette, USA)
Abstract.

Advances in image-based dietary assessment methods have allowed nutrition professionals and researchers to improve the accuracy of dietary assessment, where images of food consumed are captured using smartphones or wearable devices. These images are then analyzed using computer vision methods to estimate the energy and nutrient content of the foods. Food image segmentation, which determines the regions in an image where foods are located, plays an important role in this process. Current methods are data dependent and thus cannot generalize well across different food types. To address this problem, we propose a class-agnostic food image segmentation method. Our method uses a pair of eating scene images, one captured before eating starts and one after eating is completed. Using information from both the before and after eating images, we can segment food images by finding the salient missing objects without any prior information about the food class. We model a paradigm of top down saliency, which guides the attention of the human visual system (HVS) based on a task, to find the salient missing objects in a pair of images. Our method is validated on food images collected from a dietary study and shows promising results.

food segmentation, image based dietary assessment

1. Introduction

It is well-known that dietary habits have profound impacts on the quality of one’s health and well-being (Nordström et al., 2013; Mesas et al., 2012). While a nutritionally sound diet is essential to good health (Organization, 2009), it has been established through various studies that poor dietary habits can lead to many diseases and health complications. For example, studies from the World Health Organization (WHO) (Organization, 2009) have shown that poor diet is a key modifiable risk factor for the development of various non-communicable diseases such as heart disease, diabetes and cancers, which are the leading causes of death globally (Organization, 2009). In addition, studies have shown that poor dietary habits such as frequent consumption of fast food (Ginny et al., 2012), diets containing large portion sizes of energy-dense foods (Piernas and Popkin, 2011), absence of home food (Hammons et al., 2011) and skipping breakfast (PR1 et al., 2010) all contribute to the increasing risk of overweight and obesity. Because many of the prevalent diseases affecting humans are related to dietary habits, there is a need to study the relationship between our dietary habits and their effect on our health.

Understanding the complex relationship between dietary habits and human health is extremely important as it can help us mount intervention programs to prevent these diet related diseases (Daugherty et al., 2012). To better understand the relationship between our dietary habits and human health, nutrition practitioners and researchers often conduct dietary studies in which participants are asked to subjectively assess their dietary intake. In these studies, participants are asked to report foods and drinks they consumed on a daily basis over a period of time. Traditionally, self-reporting methods such as 24-hr recall, dietary records and food frequency questionnaire (FFQ) are popular for conducting dietary assessment studies (Shim et al., 2014). However, these methods have several drawbacks. For example, both the 24-hr recall and FFQ rely on the participants’ ability to recall foods they have consumed in the past. In addition, they are also very time-consuming. For dietary records, participants are asked to record details of the meals they consumed. Although this approach is less reliant on the participants’ memory, it requires motivated and trained participants to accurately report their diet (Shim et al., 2014). Another issue that affects the accuracy of these methods is that of under-reporting due to incorrect estimation of food portion sizes. Under-reporting has also been associated with factors such as obesity, gender, social desirability, restrained eating and hunger, education, literacy, perceived health status, age, and race/ethnicity (Zhu et al., 2010). Therefore, there is an urgent need to develop new dietary assessment methods that can overcome these limitations.

In the past decade, experts from the nutrition and engineering fields have combined forces to develop new dietary assessment methods by leveraging technologies such as the Internet and mobile phones. Among the various new approaches, some use images captured at the eating scene to extract dietary information. These are called image-based dietary assessment methods. Examples of such methods include TADA™ (Zhu et al., 2010), FoodLog (Aizawa and Ogawa, 2015), FoodCam (Kawano and Yanai, 2015b), Snap-n-Eat (Zhang et al., 2015), GoCARB (Vasiloglou et al., 2018), DietCam (Kong and Tan, 2012) and the method of (McCrory et al., 2019), to name a few. In these methods, participants are asked to capture images of foods and drinks consumed via a mobile phone. These images are then analyzed to estimate the nutrient content. Estimating the nutrient content of foods in an image is commonly performed by trained dietitians, which can be time-consuming, costly and laborious. More recently, automated methods have been developed to extract nutrient information of the foods from images (Fang et al., 2017, 2015; Fang et al., 2018). The process of extracting nutrient information from images generally involves three sub-tasks: food segmentation, food classification and portion size estimation (Zhu et al., 2010). Food image segmentation is the task of grouping pixels in an image representing foods. Food classification then identifies the food types. Portion size estimation (Fang et al., 2015) is the task of estimating the volume/energy of the foods in the image. Each of these tasks is essential for building an automated system to accurately extract nutrient information from food in images. In this paper, we focus on the task of food segmentation. In particular, we propose a food segmentation method that does not require information about the food types.

Food segmentation plays a crucial role in estimating nutrient information as the image segmentation masks are often used to estimate food portion sizes (Kong and Tan, 2012; Fang et al., 2017, 2015; Okamoto and Yanai, 2016; Pouladzadeh et al., 2014). Food segmentation from a single image is a challenging problem as there is large inter- and intra-class variance among different food types. Because of this variation, techniques developed for segmenting a particular class of foods will not be effective on other food classes. Despite these drawbacks, several learning based food segmentation methods (Zhu et al., 2015; Wang et al., 2017; Shimoda and Yanai, 2015; Dehais et al., 2016) have been proposed in recent years. One of the constraints of learning based methods is data dependency: they are only effective on the food categories they were trained on. For instance, in (Wang et al., 2017), class activation maps are used to segment food images. The Food-101 dataset (Bossard et al., 2014) is used to train the model, and the method is tested on a subset of another dataset that shares food categories with Food-101. This is a clear indication that the method in (Wang et al., 2017) is only effective on food classes it has been trained on. Similarly, the learning based method proposed in (Shimoda and Yanai, 2015) is trained and tested only on UEC-FOOD100 (Yuji et al., 2012). The UEC-FOOD100 dataset has a total of 12,740 images covering 100 different food categories, of which 1,174 contain multiple foods in a single image. The authors of (Shimoda and Yanai, 2015) split this dataset as follows: all images containing a single food category were used for training, and images containing multiple food categories were used for testing, so the training set contained 11,566 images and the testing set contained 1,174 images. Splitting the dataset in this fashion does not guarantee that the training and testing subsets contain images belonging to different food categories; in fact, it ensures they share common food categories. Furthermore, the authors of (Shimoda and Yanai, 2015) did not conduct any cross dataset evaluation. Thus the learning based method in (Shimoda and Yanai, 2015) is also only effective on food categories it has been trained on. In (Dehais et al., 2016), a semi-automatic method is proposed to segment foods. The authors of (Dehais et al., 2016) assume that foods are always present in a circular region and that the number of different food categories is known. The experiments are conducted on a dataset of 821 images. While they achieved promising results, the proposed approach is not designed for real world scenarios as these assumptions may not hold. In (Chen et al., 2015), a food segmentation technique is proposed that exploits saliency information. However, this approach relies on successfully detecting the food container, which is assumed to be a circular plate. Experimental results were reported using a dataset consisting of only 60 images. While the assumptions in (Chen et al., 2015) are valid in some cases, they may not hold in many real life scenarios.

In addition, there are also constraints imposed by the available datasets. Publicly available food image datasets such as UECFOOD-100 (Yuji et al., 2012), Food-101 (Bossard et al., 2014) and UECFOOD-256 (Kawano and Yanai, 2015a) are biased towards a particular cuisine and do not provide pixel level labelling. Pixel level labelling is crucial because it forms the necessary ground truth for training and evaluating learning based food segmentation methods. To overcome the limitations of learning based methods and the lack of public datasets with ground truth information, we propose a food segmentation method that is class-agnostic. In particular, our class-agnostic food segmentation method uses information from two images, the before eating and after eating images, to segment the foods consumed during the meal.

Our data is collected from a community dwelling dietary study (Kerr et al., 2016) using the TADA™ platform. In this study, participants were asked to take two pictures of their eating scene: one before they started eating, which we call the before eating image, and one immediately after they finished eating, which we call the after eating image. The before eating and after eating images represent the same eating scene; however, for the purpose of this work, we only select image pairs where the after eating image does not contain any food. Our goal is to segment the foods in the before eating image using information from both the before and after eating images. To illustrate this problem in a more general scenario, let us consider an experimental setup in which a person is given the pair of images shown in Fig. 1 and is asked the following question: “Can you spot the salient objects in Fig. 1a that are missing in Fig. 1b?” We refer to these as the salient missing objects. To find salient missing objects, the human visual system (HVS) compares regions that are salient in both images. In this example, the food, container and color checkerboard in Fig. 1a are the salient objects, and in Fig. 1b, the color checkerboard, spoon and container are the salient objects. Comparing the salient objects in both images, the HVS can identify the food as the salient missing object. In this paper, our goal is to build a model to answer this question. By looking for salient missing objects in the before eating image using the after eating image as the reference, we can segment the foods without additional information such as the food classes. Since this approach does not require information about the food class, we are able to build a class-agnostic food segmentation method by segmenting only the salient missing objects.

(a) Before eating image.
(b) After eating image.
Figure 1. A pair of eating scene images, taken before and after a meal is consumed. The salient missing object in Fig. 1a is the food in the container.

The above question does not bear significance for just any random pair of images; it only becomes relevant when the image pairs are related. For example, in Fig. 1, both images have many regions/objects with the same semantic labels, such as the color checkerboard, the container and the black background. However, the relative positions of these regions/objects differ between the two images because of changes in camera pose and the different capture times. Because of this semantic similarity between the two images, it is plausible to define the notion of salient missing objects. Notice that we are not interested in pixel-level differences due to changes in illumination, poses and angles.

In this experimental scenario, the visual attention of the HVS is guided by a task, hence it falls under the category of top down saliency. Visual attention (Borji et al., 2013; Borji and Itti, 2013) is defined as the process that enables a biological or artificial vision system to identify relevant regions in a scene (Borji et al., 2013). The relevance of each region in a scene is determined through two different mechanisms, namely top down saliency and bottom up saliency. In top down saliency, attention is directed by a task. An example of this mechanism in action is how a human driver’s HVS identifies relevant regions on the road for a safe journey. Other examples where top down saliency has been studied include sandwich making (Ballard et al., 1995) and interactive game playing (Peters et al., 2007). In bottom up saliency, attention is directed towards the regions that are most conspicuous. Bottom up saliency is also known as visual saliency. In the real world, visual attention of the HVS is guided by a combination of top down and bottom up saliency. Finding salient missing objects is a task-driven problem and hence falls under top down saliency, which has not been studied as extensively as visual saliency because of its complexity (Borji et al., 2013).

In this paper, we propose an unsupervised method to find the salient missing objects between a pair of images for the purpose of designing a class-agnostic food segmentation method. We use the after eating image as the background to find the contrast of every pixel in the before eating image. We then fuse the contrast map with saliency maps to obtain the final segmentation mask of the salient missing objects in the before eating image. We also compare our method to other class-agnostic methods: since food is a salient object in the before eating image, detecting salient objects in the before eating image can also segment the food. We therefore compare our method to four state-of-the-art salient object detection methods, namely R3NET (Deng et al., 2018), NLDF (Zhiming et al., 2017), UCF (Pingping et al., 2017) and Amulet (Zhang et al., 2017).

The paper is organized as follows. In Section 2, we formulate our problem and discuss related work. We describe our proposed method in detail in Section 3. In Section 4, we describe the dataset and experiment design. In Section 5, we discuss experimental results and compare our method with other salient object detection methods. Conclusions are provided in Section 6.

2. Problem Formulation and Related Work

In this section, we first introduce common notations used throughout the paper. We then discuss related works on modeling top down saliency and change detection.

2.1. Problem Formulation

Consider a pair of images $\{I^{b},I^{a}\}$ captured from an eating scene.

  • $I^{b}$: We refer to it as the “before eating image.” This is the meal image captured before consumption.

  • $I^{a}$: We refer to it as the “after eating image.” This is the meal image captured immediately after consumption.

Our goal is to obtain a binary mask $M^{b}$ that labels the salient missing objects in $I^{b}$ as foreground (with a binary label of 1) and the rest of $I^{b}$ as background (with a binary label of 0).

2.2. Related Work

Our goal is to find salient missing objects in a pair of images. Since the visual attention of the HVS is guided by a task, it falls under the category of top down saliency. Top down saliency is much more complex than visual saliency and hence has not been studied extensively. Some of the recent works modeling top down saliency paradigms are (Cao et al., 2015; Ramanishka et al., 2017). In (Ramanishka et al., 2017), given an image or video and an associated caption, the authors proposed a model to selectively highlight different regions based on words in the caption. Our work is related in the sense that we also try to highlight and segment objects/regions based on a description, except that the description in our case is a much more generic question of finding the salient missing objects in a pair of images without specific details.

Another related problem is change detection (Khan et al., 2017b; Sakurada and Okatani, 2015; Khan et al., 2017a; Vijay et al., 2014). In change detection, the objective is to detect all relevant changes between a pair of images that are aligned or can potentially be aligned via image registration. Examples of such changes include object motion, missing objects, structural changes (Sakurada and Okatani, 2015) and changes in vegetation (Khan et al., 2017b). One of the key differences between change detection and our proposed problem is that in change detection the pair of images are aligned or can potentially be aligned via image registration (Szeliski, 2004), which is not true in the case of salient missing objects. In our case, we cannot guarantee that $I^{b}$ and $I^{a}$ can be registered, as there is often relative motion between the objects of interest, as shown in Fig. 1 and Fig. 6.

The problem of finding salient missing objects can be thought of as a change detection problem in a more complex environment than those that have been previously considered. Hence, we need to develop new methods to solve this problem.

3. Method

In this section, we describe the details of our proposed method to segment salient missing objects in $I^{b}$. Our method consists of three parts: segmentation and feature extraction, contrast map generation, and saliency fusion. An overview of our proposed method is shown in Fig. 2.

Figure 2. Overview of the proposed method.

3.1. Segmentation And Feature Extraction

We first segment the pair of images $I^{a}$ and $I^{b}$ using SLIC (Radhakrishna et al., 2012) to group pixels into perceptually similar superpixels. Let $\mathcal{A}=\{a_{i}\}$ denote the superpixels of the after eating image $I^{a}$ and $\mathcal{B}=\{b_{j}\}$ the superpixels of the before eating image $I^{b}$.
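As a concrete illustration, this superpixel step can be sketched with scikit-image's SLIC implementation; the file names and SLIC parameters below are illustrative assumptions, not values from our experiments.

```python
# Minimal sketch of the superpixel segmentation step using scikit-image's SLIC.
# File names and parameters (n_segments, compactness) are assumptions.
from skimage.io import imread
from skimage.segmentation import slic

I_b = imread("before_eating.jpg")  # I^b, the before eating image
I_a = imread("after_eating.jpg")   # I^a, the after eating image

# Group pixels into perceptually similar superpixels.
labels_b = slic(I_b, n_segments=300, compactness=10, start_label=0)  # B = {b_j}
labels_a = slic(I_a, n_segments=300, compactness=10, start_label=0)  # A = {a_i}
```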

We then extract features from each superpixel and use them to compute the contrast map. The contrast map $C_{a}^{b}$ gives an estimate of the probability of each pixel belonging to objects/regions present in $I^{b}$ but missing in $I^{a}$; this is explained in detail in Section 3.2. To compute an accurate contrast map, pixels belonging to similar regions in $I^{b}$ and $I^{a}$ should have similar feature representations, and vice versa. Going from $I^{b}$ to $I^{a}$, we can expect changes in scene lighting, noise levels and segmentation boundaries because of relative object motion, so it is important that the pixel feature representations are robust to these artifacts. For this reason, we extract features using a pretrained Convolutional Neural Network (CNN) instead of hand-crafted features. We use VGG19 (Karen and Zisserman, 2015) pretrained on the ImageNet dataset (Deng et al., 2009). ImageNet is a large dataset consisting of more than a million images belonging to 1000 different classes and captures the distribution of natural images very well. For these reasons, models pretrained on ImageNet are widely used in many applications (Donahue et al., 2014; Anselmi et al., 2016; Sakurada and Okatani, 2015; Guanbin and Yu, 2015).

We apply the pretrained VGG19 to both $I^{b}$ and $I^{a}$. The output of the $16^{th}$ convolutional layer of VGG19 is extracted as the feature map; the reasoning behind this choice is explained in Section 4.3.1. According to Table 1 in (Karen and Zisserman, 2015), VGG19 has a total of 16 convolutional layers. The output of the $16^{th}$ convolutional layer has dimensions $14\times 14\times 512$, where $14\times 14$ is the spatial resolution. The input ($I^{b}$ or $I^{a}$) to VGG19 has a spatial resolution of $224\times 224$. We spatially upscale the output of the $16^{th}$ convolutional layer by a factor of 16 and denote the upscaled feature maps of $I^{b}$ and $I^{a}$ as $F^{b}$ and $F^{a}$, respectively. The dimensions of $F^{b}$ and $F^{a}$ are then $224\times 224\times 512$, so every pixel is represented by a 512-dimensional vector in feature space. For each superpixel, we denote the extracted features as $\{f_{j}^{b}\}$ for the before eating image and $\{f_{i}^{a}\}$ for the after eating image. Using the upscaled feature maps, $f_{i}^{a}$ and $f_{j}^{b}$ are computed as described in Eq. 1.

(1)  $f_{i}^{a}=\frac{1}{m_{i}^{a}}\sum_{k\in r_{i}^{a}}F^{a}(k), \qquad f_{j}^{b}=\frac{1}{m_{j}^{b}}\sum_{k\in r_{j}^{b}}F^{b}(k)$

where $r_{i}^{a}$ denotes the set of pixels belonging to superpixel $a_{i}$ and $m_{i}^{a}$ is its cardinality; $r_{j}^{b}$ and $m_{j}^{b}$ are defined similarly.
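For concreteness, a minimal sketch of this feature extraction step is given below using a torchvision VGG19 pretrained on ImageNet. The module index used for the 16th convolutional layer, the preprocessing, and the nearest-neighbor resizing of the superpixel label map to 224×224 are our assumptions about implementation details not spelled out above; it also reuses names from the SLIC sketch.

```python
# Sketch of per-superpixel feature extraction (Eq. 1) with a pretrained VGG19.
# Assumes I_b, I_a, labels_b, labels_a from the SLIC sketch above.
import numpy as np
import torch
import torch.nn.functional as F
import torchvision.models as models
import torchvision.transforms as T
from skimage.transform import resize

vgg19 = models.vgg19(pretrained=True).features.eval()

preprocess = T.Compose([
    T.ToPILImage(), T.Resize((224, 224)), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def superpixel_features(image, labels, conv_index=34):
    """Mean 512-d feature per superpixel, as in Eq. 1."""
    # conv_index=34 is assumed to be the 16th convolution in vgg19.features.
    x = preprocess(image).unsqueeze(0)
    with torch.no_grad():
        for i, layer in enumerate(vgg19):
            x = layer(x)
            if i == conv_index:
                break
    # Upscale the 14x14x512 output back to 224x224 (factor of 16).
    fmap = F.interpolate(x, size=(224, 224), mode="bilinear", align_corners=False)
    fmap = fmap.squeeze(0).permute(1, 2, 0).numpy()            # 224 x 224 x 512
    # Resize the superpixel label map to 224x224 with nearest-neighbor.
    lab = resize(labels, (224, 224), order=0, preserve_range=True,
                 anti_aliasing=False).astype(int)
    return {sp: fmap[lab == sp].mean(axis=0) for sp in np.unique(lab)}

feats_b = superpixel_features(I_b, labels_b)  # {f_j^b}
feats_a = superpixel_features(I_a, labels_a)  # {f_i^a}
```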

3.2. Contrast Map Generation

Contrast is a term often associated with salient object detection methods. The contrast of a region in an image refers to its overall dissimilarity with other regions in the same image. It is generally assumed that regions with high contrast demand more visual attention (Federico et al., 2012). In the context of our problem, visual attention is guided by the task of finding objects in $I^{b}$ that are missing in $I^{a}$. Therefore, our contrast map $C^{b}_{a}$ of $I^{b}$ is an estimate of the probability of each pixel belonging to an object missing in $I^{a}$. $C^{b}_{a}$ is computed as shown in Eq. 2.

(2)  $C^{b}_{a}=\dfrac{C_{a}^{b,\text{local}}+C_{a}^{b,\text{neigh}}}{\max\left(C_{a}^{b,\text{local}}+C_{a}^{b,\text{neigh}}\right)}$

In $C_{a}^{b,\text{local}}$, the contrast value of a superpixel $b_{j}$ is computed using information from $b_{j}$ and $I^{a}$, while in $C_{a}^{b,\text{neigh}}$ the contrast value of $b_{j}$ is computed using information from $b_{j}$, its neighboring superpixels and $I^{a}$. The term $\max(C_{a}^{b,\text{local}}+C_{a}^{b,\text{neigh}})$, the maximum value in the summed map, is used to normalize $C^{b}_{a}$ to $[0,1]$. To compute $C_{a}^{b,\text{local}}$ or $C_{a}^{b,\text{neigh}}$, contrast values are computed for each superpixel and then assigned to the associated individual pixels. However, if $b_{j}$ is a superpixel along the image boundary, that is $b_{j}\in\mathcal{B}^{e}$, we assign $b_{j}$ a contrast value of zero, since we assume that the salient missing objects are unlikely to be present along the image boundaries.

The contrast value of a superpixel $b_{j}\notin\mathcal{B}^{e}$ is denoted by $c_{j}^{b,\text{local}}$ and is computed as:

(3)  $c_{j}^{b,\text{local}}=\min_{\forall i\ \text{such that}\ a_{i}\in\mathcal{A}}\left\|f_{j}^{b}-f_{i}^{a}\right\|_{2}$

If $b_{j}\in\mathcal{B}^{e}$, then $c_{j}^{b,\text{local}}=0$. $c_{j}^{b,\text{local}}$ is the Euclidean distance between the feature vector $f_{j}^{b}$ and the closest feature vector of a superpixel in the after eating image. A superpixel $b_{j}$ belonging to objects/regions common to both $I^{b}$ and $I^{a}$ will have a lower value of $c_{j}^{b,\text{local}}$, while a $b_{j}$ belonging to objects/regions present in $I^{b}$ but missing in $I^{a}$ will likely have a higher value of $c_{j}^{b,\text{local}}$.
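A minimal sketch of this local contrast term is shown below, assuming the per-superpixel feature dictionaries from the previous sketch and a precomputed set boundary_b of before-image superpixels touching the image border (i.e., $\mathcal{B}^{e}$); how that set is built is not shown.

```python
# Sketch of c_j^{b,local} (Eq. 3): minimum Euclidean distance from each
# before-image superpixel feature to all after-image superpixel features.
import numpy as np

def local_contrast(feats_b, feats_a, boundary_b):
    Fa = np.stack(list(feats_a.values()))          # all f_i^a, shape (|A|, 512)
    c_local = {}
    for j, fb in feats_b.items():
        if j in boundary_b:                        # b_j in B^e gets zero contrast
            c_local[j] = 0.0
        else:
            c_local[j] = float(np.min(np.linalg.norm(Fa - fb, axis=1)))
    return c_local
```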

Before describing how we compute $C_{a}^{b,\text{neigh}}$, we need to introduce a few more notations. For a given superpixel $b_{j}$, let $\mathcal{N}(b_{j})$ denote the set of all neighboring superpixels of $b_{j}$. Similarly, for any superpixel $a_{i}$, $\mathcal{N}(a_{i})$ is the set of its neighboring superpixels. Consider a complete bipartite graph over the two sets of superpixels $\{a_{i},\mathcal{N}(a_{i})\}$ and $\{b_{j},\mathcal{N}(b_{j})\}$, denoted by

(4)  $G_{a_{i},b_{j}}=\left(\{a_{i},\mathcal{N}(a_{i})\}\cup\{b_{j},\mathcal{N}(b_{j})\},\ \mathcal{E}_{a_{i},b_{j}}\right)$

where $\mathcal{E}_{a_{i},b_{j}}$ is the set of edges in $G_{a_{i},b_{j}}$. An example is shown in Fig. 3.

(a) $G_{a_{i_{1}},b_{j_{25}}}$
(b) $\mathcal{S}^{\mathcal{E}_{a_{i_{1}},b_{j_{25}}}}_{1}=\{e_{a_{i_{4}},b_{j_{28}}},e_{a_{i_{10}},b_{j_{30}}},e_{a_{i_{1}},b_{j_{25}}}\}$
(c) $\mathcal{S}^{\mathcal{E}_{a_{i_{1}},b_{j_{25}}}}_{2}=\{e_{a_{i_{10}},b_{j_{30}}},e_{a_{i_{4}},b_{j_{28}}},e_{a_{i_{5}},b_{j_{25}}}\}$
Figure 3. Consider two hypothetical nodes $a_{i_{1}}\in\mathcal{A}$ with $\mathcal{N}(a_{i_{1}})=\{a_{i_{10}},a_{i_{4}},a_{i_{5}}\}$ and $b_{j_{25}}\in\mathcal{B}$ with $\mathcal{N}(b_{j_{25}})=\{b_{j_{30}},b_{j_{28}}\}$. Fig. 3a illustrates how $G_{a_{i_{1}},b_{j_{25}}}$ is constructed. Note that because $G_{a_{i_{1}},b_{j_{25}}}$ is a complete bipartite graph, there is an edge from every node in $\{a_{i_{1}},\mathcal{N}(a_{i_{1}})\}$ to every node in $\{b_{j_{25}},\mathcal{N}(b_{j_{25}})\}$. Figs. 3b and 3c show examples of plausible maximum matchings, with $D(\mathcal{S}^{\mathcal{E}_{a_{i_{1}},b_{j_{25}}}}_{1})=w(e_{a_{i_{1}},b_{j_{25}}})+w(e_{a_{i_{10}},b_{j_{30}}})+w(e_{a_{i_{4}},b_{j_{28}}})$; $D(\mathcal{S}^{\mathcal{E}_{a_{i_{1}},b_{j_{25}}}}_{2})$ can be computed in a similar manner.

In $G_{a_{i},b_{j}}$, consider an edge $e_{a_{i_{1}},b_{j_{1}}}$ between two superpixels $a_{i_{1}}\in\{a_{i},\mathcal{N}(a_{i})\}$ and $b_{j_{1}}\in\{b_{j},\mathcal{N}(b_{j})\}$; the edge weight is given by the Euclidean norm $w(\cdot)$ defined as:

(5)  $w(e_{a_{i_{1}},b_{j_{1}}})=\left\|f_{a_{i_{1}}}-f_{b_{j_{1}}}\right\|_{2}$

A matching over $G_{a_{i},b_{j}}$ is a set of edges $\mathcal{S}\subset\mathcal{E}_{a_{i},b_{j}}$ such that no two edges in $\mathcal{S}$ share the same node. A maximum matching over $G_{a_{i},b_{j}}$, denoted by $\mathcal{S}^{\mathcal{E}_{a_{i},b_{j}}}_{k}\subset\mathcal{E}_{a_{i},b_{j}}$, is a matching of maximum cardinality. There can be many possible maximum matchings over $G_{a_{i},b_{j}}$; hence we use the subscript $k$ in $\mathcal{S}^{\mathcal{E}_{a_{i},b_{j}}}_{k}$ to denote one such possibility. The cost of a given $\mathcal{S}^{\mathcal{E}_{a_{i},b_{j}}}_{k}$ is denoted by $D(\mathcal{S}^{\mathcal{E}_{a_{i},b_{j}}}_{k})$ and is defined as:

(6)  $D(\mathcal{S}_{k}^{\mathcal{E}_{a_{i},b_{j}}})=\sum_{\forall e\in\mathcal{S}_{k}^{\mathcal{E}_{a_{i},b_{j}}}}w(e)$

Given $G_{a_{i},b_{j}}$, we want to find the maximum matching with the minimum cost. We refer to this minimum cost as $\hat{D}_{\text{min}}(G_{a_{i},b_{j}})$, computed as:

(7)  $\hat{D}_{\text{min}}(G_{a_{i},b_{j}})=\min_{\forall k\ \text{such that}\ \exists\,\mathcal{S}_{k}^{\mathcal{E}_{a_{i},b_{j}}}}D(\mathcal{S}_{k}^{\mathcal{E}_{a_{i},b_{j}}})$

For two superpixels $a_{i}$ and $b_{j}$, $\hat{D}_{\text{min}}(G_{a_{i},b_{j}})$ measures the similarity between the two superpixels and between their neighborhoods. The lower the value of $\hat{D}_{\text{min}}(G_{a_{i},b_{j}})$, the more similar the two superpixels are, both in terms of their individual characteristics and their neighboring superpixels. The contrast value of a superpixel $b_{j}\notin\mathcal{B}^{e}$ in $C_{a}^{b,\text{neigh}}$ is denoted by $c_{j}^{b,\text{neigh}}$ and is computed as:

(8)  $c_{j}^{b,\text{neigh}}=\min_{\forall i\ \text{such that}\ a_{i}\in\mathcal{A}}\dfrac{\hat{D}_{\text{min}}(G_{a_{i},b_{j}})}{l_{a_{i},b_{j}}}$

In Eq. 8, $l_{a_{i},b_{j}}=\min\left(|\{a_{i},\mathcal{N}(a_{i})\}|,\,|\{b_{j},\mathcal{N}(b_{j})\}|\right)$, where $|\{\cdot\}|$ denotes the cardinality of a set. If $b_{j}\in\mathcal{B}^{e}$, then $c_{j}^{b,\text{neigh}}=0$. $\hat{D}_{\text{min}}(G_{a_{i},b_{j}})$ is likely to increase as $l_{a_{i},b_{j}}$ increases because there are more edges in the maximum matching. To compensate for this effect, we divide $\hat{D}_{\text{min}}(G_{a_{i},b_{j}})$ by $l_{a_{i},b_{j}}$ in Eq. 8.
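The minimum-cost maximum matching in Eq. 7 can be solved with the Hungarian algorithm; the sketch below uses scipy's linear_sum_assignment on the rectangular cost matrix of edge weights from Eq. 5. The neighbour dictionaries (neigh_a, neigh_b) and the brute-force search over all $a_{i}$ are implementation assumptions, and no attempt is made to optimize the search.

```python
# Sketch of c_j^{b,neigh} (Eq. 8). For each pair (a_i, b_j), a rectangular cost
# matrix of edge weights (Eq. 5) is built over {a_i, N(a_i)} x {b_j, N(b_j)},
# and the minimum-cost maximum matching (Eq. 7) is found with the Hungarian method.
import numpy as np
from scipy.optimize import linear_sum_assignment

def min_cost_matching(A_nodes, B_nodes):
    cost = np.linalg.norm(A_nodes[:, None, :] - B_nodes[None, :, :], axis=2)
    rows, cols = linear_sum_assignment(cost)       # covers min(|A|,|B|) node pairs
    return cost[rows, cols].sum()                  # \hat{D}_min(G_{a_i,b_j})

def neigh_contrast(feats_a, feats_b, neigh_a, neigh_b, boundary_b):
    c_neigh = {}
    for j, fb in feats_b.items():
        if j in boundary_b:                        # b_j in B^e gets zero contrast
            c_neigh[j] = 0.0
            continue
        B_nodes = np.stack([feats_b[n] for n in [j] + list(neigh_b[j])])
        best = np.inf
        for i in feats_a:
            A_nodes = np.stack([feats_a[n] for n in [i] + list(neigh_a[i])])
            l = min(len(A_nodes), len(B_nodes))    # l_{a_i,b_j}
            best = min(best, min_cost_matching(A_nodes, B_nodes) / l)
        c_neigh[j] = float(best)
    return c_neigh
```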

3.3. Saliency Fusion

The contrast map $C_{a}^{b}$ gives an estimate of the probability of pixels belonging to objects/regions present in $I^{b}$ but missing in $I^{a}$. However, we would like to segment the salient missing objects. As explained in Section 1, to find the salient missing objects, the HVS compares objects/regions in $I^{b}$ that have a high value of visual saliency. Therefore, we are interested in identifying regions in the contrast map $C^{b}_{a}$ that correspond to high visual saliency. The visual saliency information of $I^{b}$ needs to be incorporated into $C^{b}_{a}$ to obtain our final estimate $\hat{M}_{a}^{b}$, where $\hat{M}_{a}^{b}$ is the probability of each pixel in $I^{b}$ belonging to the salient missing objects. We can then obtain the final binary label $M^{b}$ by thresholding $\hat{M}_{a}^{b}$ with $T\in[0,1]$. If $S^{b}$ is the visual saliency map of $I^{b}$, then $\hat{M}_{a}^{b}$ is computed as:

(9)  $\hat{M}_{a}^{b}=\dfrac{\alpha S^{b}+C^{b}_{a}}{\max\left(\alpha S^{b}+C^{b}_{a}\right)}$

where $\max(\alpha S^{b}+C^{b}_{a})$ is the normalization term. In Eq. 9, $\alpha$ is a weighting factor in $[0,1]$ that varies the relative contributions of $S^{b}$ and $C^{b}_{a}$ towards $\hat{M}_{a}^{b}$. The value of $\alpha$ is determined empirically, as explained in Section 4.3. To compute $S^{b}$, we use the state-of-the-art salient object detection method R3NET (Deng et al., 2018). We also compare our method to other deep learning based salient object detection methods, namely Amulet (Zhang et al., 2017), UCF (Pingping et al., 2017) and NLDF (Zhiming et al., 2017).
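Combining the two contrast terms (Eq. 2) and fusing the result with the saliency map (Eq. 9) then amounts to a few lines of array arithmetic, sketched below. Here S_b is assumed to be a visual saliency map of $I^{b}$ in $[0,1]$ produced by a pretrained detector such as R3NET (not shown), and the threshold value is only illustrative.

```python
# Sketch of Eq. 2 and Eq. 9: combine the per-pixel contrast maps, fuse with
# the visual saliency map S^b, and threshold to get the binary mask M^b.
import numpy as np

def fuse_and_threshold(C_local, C_neigh, S_b, alpha=0.6, T=0.5):
    # alpha = 0.6 follows the validation result in Section 4.3.1; T is arbitrary here.
    C_ab = C_local + C_neigh
    C_ab = C_ab / C_ab.max()                       # Eq. 2: contrast map in [0, 1]
    M_hat = alpha * S_b + C_ab
    M_hat = M_hat / M_hat.max()                    # Eq. 9: probability map of salient missing objects
    M_b = (M_hat >= T).astype(np.uint8)            # binary mask M^b for threshold T
    return M_hat, M_b
```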

4. Experimental Results

4.1. Dataset

The dataset $\mathcal{D}$ we use for evaluating our method contains 566 pairs of before eating and after eating images. Along with the image pairs, ground truth masks of the salient missing objects in the before eating images (which in this case are foods) are also provided. These images are a subset of the images collected from a community dwelling dietary study (Kerr et al., 2016) and exhibit a wide variety of foods and eating scenes. Participants in this dietary study were asked to capture a pair of before and after eating scene images, denoted as $I^{b}$ and $I^{a}$. A typical participant takes about 3 to 5 pairs of images per day depending on his/her eating habits. These image pairs are then sent to a cloud based server to analyze nutrient contents. $\mathcal{D}$ is split randomly into $\mathcal{D}_{val}$ (49 image pairs) and $\mathcal{D}_{test}$ (517 image pairs). $\mathcal{D}_{val}$ is used for choosing the optimal hyperparameters, namely $\alpha$ and the convolutional layer; more details are given in Section 4.3.1. $\mathcal{D}_{test}$ is used to compare the accuracy of our method with other methods. Examples of image pairs from $\mathcal{D}_{test}$, along with the predicted masks obtained by our method and the salient object detection methods, are shown in Fig. 6. $\mathcal{D}_{test}$ and $\mathcal{D}_{val}$ have very different food classes. In addition, the backgrounds of the images in $\mathcal{D}_{val}$ are very different from those in $\mathcal{D}_{test}$. This makes $\mathcal{D}$ well suited for our experiments, because $\mathcal{D}_{val}$ does not give any information about the food classes present in $\mathcal{D}_{test}$. Thus, if a model tuned on $\mathcal{D}_{val}$ performs well on $\mathcal{D}_{test}$, it indicates that the model is able to segment foods without requiring information about the food class.

4.2. Evaluation Metrics

We use two standard metrics for evaluating the performance of the proposed method. These metrics are commonly used to assess the quality of salient object detection methods (Borji et al., 2015).

  • Precision and Recall: Consider $t=\{I^{b},I^{a},G^{b}\}$ in $\mathcal{D}$, where $G^{b}$ represents the ground truth mask of the salient missing objects in $I^{b}$. Pixels belonging to the salient missing objects in $G^{b}$ have a value of 1 and the rest have a value of 0. Our proposed method outputs $\hat{M}^{b}_{a}$, which has a range of $[0,1]$. We can then generate a segmentation mask $M^{b}$ using a threshold $T\in[0,1]$. Given $M^{b}$ and $G^{b}$, precision (P) and recall (R) are computed over $\mathcal{D}$ as:

    (10)  $\text{P}:\ \dfrac{\sum_{\forall t\in\mathcal{D}}|M^{b}\cap G^{b}|}{\sum_{\forall t\in\mathcal{D}}|M^{b}|}\ ,\qquad \text{R}:\ \dfrac{\sum_{\forall t\in\mathcal{D}}|M^{b}\cap G^{b}|}{\sum_{\forall t\in\mathcal{D}}|G^{b}|}$

    For a binary mask, $|\cdot|$ denotes the number of non-zero entries. By varying $T$ between 0 and 1, we obtain different pairs of precision and recall values. When precision and recall are plotted against each other, we obtain the precision recall (PR) curve. The information provided by precision and recall can be condensed into their weighted harmonic mean, denoted $F_{\beta}$ and computed as:

    (11)  $F_{\beta}=\dfrac{(1+\beta^{2})\cdot\text{Precision}\cdot\text{Recall}}{\beta^{2}\cdot\text{Precision}+\text{Recall}}$

    The value of $F_{\beta}$ lies in $[0,1]$, and a higher value indicates better performance. $\beta$ is a control parameter that weighs the importance of precision relative to recall; the value of $\beta^{2}$ is chosen to be 0.3, as in other works (Borji et al., 2015). The value of $F_{\beta}$ varies as we move along the PR curve, and the entire PR curve can be summarized by the maximal $F_{\beta}$, denoted $F_{\beta}^{\text{max}}$, as discussed in (Borji et al., 2015; R. et al., 2004).

  • Receiver Operating Characteristic (ROC): Similar to the PR curve, the ROC curve is a plot of the true positive rate (TPR) against the false positive rate (FPR), defined as:

    (12)  $\text{TPR}:\ \dfrac{\sum_{\forall t\in\mathcal{D}}|M^{b}\cap G^{b}|}{\sum_{\forall t\in\mathcal{D}}|G^{b}|}\ ,\qquad \text{FPR}:\ \dfrac{\sum_{\forall t\in\mathcal{D}}|M^{b}\cap(1-G^{b})|}{\sum_{\forall t\in\mathcal{D}}|(1-G^{b})|}$

    Similar to $F_{\beta}^{\text{max}}$, the information provided by the ROC curve can be condensed into a single metric, the AUC, which is the area under the ROC curve. Higher values of AUC indicate better performance: a perfect method has an AUC of 1, and a method that randomly guesses the values in $M^{b}$ has an AUC of 0.5. A minimal computation sketch for both metrics follows this list.
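The sketch below illustrates how $F_{\beta}^{\text{max}}$ and AUC can be computed over a dataset of predicted probability maps and binary ground truth masks; the threshold grid and the use of scikit-learn's roc_auc_score are implementation assumptions.

```python
# Sketch of dataset-level F_beta^max (Eqs. 10-11) and AUC (Eq. 12), assuming
# `preds` is a list of probability maps M_hat and `gts` a list of binary masks G^b.
import numpy as np
from sklearn.metrics import roc_auc_score

def f_beta_max(preds, gts, beta2=0.3, n_thresholds=100):
    gt_pos = sum(int((g > 0).sum()) for g in gts)
    best = 0.0
    for T in np.linspace(0.0, 1.0, n_thresholds):
        tp = sum(int(np.logical_and(p >= T, g > 0).sum()) for p, g in zip(preds, gts))
        pred_pos = sum(int((p >= T).sum()) for p in preds)
        if tp == 0 or pred_pos == 0:
            continue
        precision, recall = tp / pred_pos, tp / gt_pos
        best = max(best, (1 + beta2) * precision * recall / (beta2 * precision + recall))
    return best

def dataset_auc(preds, gts):
    # Pool all pixels over the dataset and compute the area under the ROC curve.
    scores = np.concatenate([p.ravel() for p in preds])
    labels = np.concatenate([(g > 0).ravel().astype(int) for g in gts])
    return roc_auc_score(labels, scores)
```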

(a) VGG19: $F_{\beta}^{\text{max}}$ vs $\alpha$
(b) ResNet34: $F_{\beta}^{\text{max}}$ vs $\alpha$
(c) Inception-v3: $F_{\beta}^{\text{max}}$ vs $\alpha$
Figure 4. $F_{\beta}^{\text{max}}$ of $\hat{M}^{b}_{a}$ on $\mathcal{D}_{val}$ plotted as $\alpha$ varies. (a) For VGG19, $F_{\beta}^{\text{max}}$ is reported using features from all convolutional layers that precede a max pooling layer. (b) For ResNet34, features were extracted from the output of each stage. (c) For Inception-v3, features were extracted from each layer whose output spatial dimensions differ from its input spatial dimensions.
(a) ROC curve
(b) ROC curve (zoomed)
(c) PR curve
Figure 5. ROC and PR curves of R3NET (Deng et al., 2018) (also $S^{b}$), NLDF (Zhiming et al., 2017), Amulet (Zhang et al., 2017), UCF (Pingping et al., 2017), $C^{b}_{a}$ and $\hat{M}^{b}_{a}$. Fig. 5b is a zoomed-in version of the ROC curve in Fig. 5a.

4.3. Experiments

4.3.1. Hyperparameter selection

The method described in Section 3 requires two hyperparameters, namely $\alpha$ in Eq. 9 and the convolutional layer of VGG19 used for feature extraction. To justify the use of a pretrained VGG19 for feature extraction, we also conducted experiments extracting features from ResNet34 (He et al., 2016) and Inception-v3 (Szegedy et al., 2016), both pretrained on ImageNet. These experiments are conducted on $\mathcal{D}_{val}$ to find the best $F_{\beta}$, which gives us a set of optimal hyperparameters.

To choose the best convolutional layer, we evaluate $\hat{M}^{b}_{a}$ using features from every convolutional layer of VGG19 that precedes a max pooling layer; there are 5 such convolutional layers in VGG19. The architecture of ResNet34 can be divided into 5 stages (He et al., 2016), and to find the optimal layer in ResNet34 we extracted features from the output of each stage. The architecture of Inception-v3 is very different from those of ResNet34 and VGG19; to find the optimal layer in Inception-v3, we extract features wherever the spatial dimensions change as the input propagates through the network. There are 7 such changes in Inception-v3 before the average pooling operation. Please refer to the architecture of Inception-v3 provided in PyTorch (Paszke et al., 2017) for more details. In addition to extracting features from various layers, we also vary $\alpha$ from 0 to 1 in steps of 0.1 and plot $F_{\beta}^{\text{max}}$ as $\alpha$ varies for every layer. The results are shown in Fig. 4. From Fig. 4, it is evident that features from the $16^{th}$ convolutional layer give the best performance compared to features from other layers, and that features from VGG19 achieve better performance than features from ResNet34. For features from VGG19, $F_{\beta}^{\text{max}}$ attains its maximum value of 0.754 at $\alpha=0.6$.

As we go deeper into the convolutional layers of VGG19, the extracted features become increasingly abstract but suffer from a decrease in resolution. Abstract features are less sensitive to changes in illumination, noise and pose, which suits our task well. As Fig. 4 shows, going deeper into the convolutional layers we first observe a degradation in the quality of the extracted features (conv-layer 2 to conv-layer 8). This trend is reversed from conv-layer 8 to conv-layer 16, with a significant improvement in $F_{\beta}^{\text{max}}$. We suspect this is because at first the negative effect of decreased resolution outweighs the benefit of more abstract features, and this trade-off reverses from conv-layer 8 onward.

4.3.2. Testing

After obtaining the optimal hyperparameters as described in Section 4.3.1, we evaluate our method on $\mathcal{D}_{test}$. $\hat{M}^{b}_{a}$ is computed for every image pair in $\mathcal{D}_{test}$, and the ROC and PR curves are computed over $\mathcal{D}_{test}$. Since our goal is to develop a class-agnostic food segmentation method, we compare the proposed method to four state-of-the-art salient object detection techniques, namely R3NET (Deng et al., 2018), NLDF (Zhiming et al., 2017), Amulet (Zhang et al., 2017) and UCF (Pingping et al., 2017). Salient object detection methods are class-agnostic and are applicable in this scenario because food is always a salient object in $I^{b}$. Since these are deep learning based methods, we use their respective pretrained models to compute the saliency maps of $I^{b}$. The ROC and PR curves of the various methods are shown in Fig. 5, and the $F_{\beta}^{\text{max}}$ and AUC values are reported in Table 1.

Table 1. AUC and $F_{\beta}^{\text{max}}$ values of various maps and methods.
Maps | AUC | $F_{\beta}^{\text{max}}$
$C^{b}_{a}$ | 0.937 | 0.645
R3NET (Deng et al., 2018) ($S^{b}$) | 0.871 | 0.527
$\hat{M}^{b}_{a}$ (ours) | 0.954 | 0.741
Amulet (Zhang et al., 2017) | 0.919 | 0.499
NLDF (Zhiming et al., 2017) | 0.909 | 0.493
UCF (Pingping et al., 2017) | 0.934 | 0.536

5. Discussion

Figure 6. Sample image pairs from $\mathcal{D}_{test}$ along with various maps. For every row, the first group of two images are the original before and after eating images, respectively. The second group of images are the saliency maps generated by Amulet (Zhang et al., 2017), UCF (Pingping et al., 2017), NLDF (Zhiming et al., 2017) and R3NET (Deng et al., 2018), followed by $\hat{M}^{b}_{a}$ (our method) and the ground truth mask $G^{b}$. The ground truth images are binary maps with pixels of value 1 representing foods and pixels of value 0 representing background; all the others are probability maps with pixel values between 0 and 1.

The goal of our method is to segment the salient missing objects in $I^{b}$ using information from the pair of images $I^{a}$ and $I^{b}$. The contrast map generation step described in Section 3.2 provides an estimate of the probability of pixels belonging to objects/regions in $I^{b}$ but missing in $I^{a}$. The saliency fusion step described in Section 3.3 fuses the saliency information of pixels in $I^{b}$ into the contrast map $C^{b}_{a}$ to emphasize that we are looking for salient missing objects. To show that the individual steps of our proposed method achieve their objectives, we plot the PR and ROC curves of the contrast map $C^{b}_{a}$, the visual saliency map $S^{b}$ from R3NET (Deng et al., 2018) and the estimated salient missing objects probability map $\hat{M}^{b}_{a}$ in Fig. 5c and Fig. 5a. In addition, we also plot the PR and ROC curves of the three other salient object detection methods. From these plots, we can see that combining $S^{b}$ and $C^{b}_{a}$ as described in Section 3.3 improves the overall performance. This is also shown in Table 1, where both the AUC and $F_{\beta}^{\text{max}}$ of $\hat{M}^{b}_{a}$ are higher than those of $C^{b}_{a}$. This is because the contrast map $C^{b}_{a}$ by itself models all the missing objects/regions, while the probability map $\hat{M}^{b}_{a}$ also takes into account the visual saliency map $S^{b}$ and can therefore more accurately model the salient missing objects. We can also observe from the PR and ROC curves in Fig. 5 and the values in Table 1 that our method achieves better performance than the state-of-the-art salient object detection methods R3NET (Deng et al., 2018), NLDF (Zhiming et al., 2017), Amulet (Zhang et al., 2017) and UCF (Pingping et al., 2017). We also visually verify the performance of our method, as illustrated in Fig. 6. The salient object detection methods Amulet (Zhang et al., 2017), UCF (Pingping et al., 2017) and NLDF (Zhiming et al., 2017) failed to detect only the foods in these images, while R3NET (Deng et al., 2018) succeeded in detecting the foods but placed equal importance on other salient objects such as the color checkerboard. Our method assigned higher probability to the foods, which are the salient missing objects, than to the other salient objects in the scene. It must also be noted that our method did not have access to information about the food classes in $\mathcal{D}_{test}$: $\mathcal{D}_{val}$ and $\mathcal{D}_{test}$ have very few food classes in common, so tuning the parameters on $\mathcal{D}_{val}$ gives our method no access to the food classes in $\mathcal{D}_{test}$. Hence the performance of our method on $\mathcal{D}_{test}$ is indicative of its effectiveness in segmenting foods in a class-agnostic manner. These characteristics of $\mathcal{D}$ are also explained in Section 4.1. By modeling the foods as salient missing objects, we are thus able to build a better class-agnostic food segmentation method compared to existing methods.

6. Conclusion

In this paper, we propose a class-agnostic food segmentation method that segments the salient missing objects in a before eating image $I^{b}$ using information from a pair of before and after eating images, $I^{b}$ and $I^{a}$. We treat this problem as a paradigm of top down saliency detection in which the visual attention of the HVS is guided by a task. Our proposed method uses $I^{a}$ as background to obtain a contrast map that estimates the probability of pixels in $I^{b}$ belonging to objects/regions missing in $I^{a}$. The contrast map is then fused with the saliency information of $I^{b}$ to obtain a probability map $\hat{M}^{b}_{a}$ for the salient missing objects. Our experimental results validate that our approach achieves better performance, both quantitatively and visually, than state-of-the-art salient object detection methods such as R3NET (Deng et al., 2018), NLDF (Zhiming et al., 2017), Amulet (Zhang et al., 2017) and UCF (Pingping et al., 2017). As discussed in Section 1, we have only considered the case where there is no food in the after eating image. In the future, we will extend our model to consider more general scenarios.

References

  • Aizawa and Ogawa (2015) Kiyoharu Aizawa and Makoto Ogawa. 2015. FoodLog: Multimedia Tool for Healthcare Applications. IEEE MultiMedia 22, 2 (Apr 2015), 4–8. https://doi.org/10.1109/MMUL.2015.39
  • Anselmi et al. (2016) Fabio Anselmi, Joel Z. Leibo, Lorenzo Rosasco, Jim Mutch, Andrea Tacchetti, and Tomaso Poggio. 2016. Unsupervised learning of invariant representations. Theoretical Computer Science 633 (2016), 112 – 121. https://doi.org/10.1016/j.tcs.2015.06.048
  • Ballard et al. (1995) Dana H Ballard, Mary M Hayhoe, and Jeff B Pelz. 1995. Memory Representations in Natural Tasks. Journal of cognitive neuroscience. 7, 1 (1995), 66–80. https://doi.org/10.1162/jocn.1995.7.1.66
  • Borji et al. (2015) Ali Borji, Ming-Ming Cheng, Huaizu Jiang, and Jia Li. 2015. Salient Object Detection: A Benchmark. IEEE Transactions on Image Processing 24, 12 (Dec 2015), 5706–5722. https://doi.org/10.1109/TIP.2015.2487833
  • Borji and Itti (2013) Ali Borji and Laurent Itti. 2013. State-of-the-Art in Visual Attention Modeling. IEEE Transactions on Pattern Analysis and Machine Intelligence 35, 1 (Jan 2013), 185–207.
  • Borji et al. (2013) Ali Borji, Dicky N. Sihite, and Laurent Itti. 2013. Quantitative Analysis of Human-Model Agreement in Visual Saliency Modeling: A Comparative Study. IEEE Transactions on Image Processing 22, 1 (Jan 2013), 55–69. https://doi.org/10.1109/TIP.2012.2210727
  • Bossard et al. (2014) Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. 2014. Food-101 – Mining Discriminative Components with Random Forests. European Conference on Computer Vision (2014), 446–461. https://doi.org/10.1007/978-3-319-10599-4_29
  • Cao et al. (2015) Chunshui Cao, Xianming Liu, Yi Yang, Yinan Yu, Jiang Wang, Zilei Wang, Yongzhen Huang, Liang Wang, Chang Huang, Wei Xu, Deva Ramanan, and Thomas S. Huang. 2015. Look and Think Twice: Capturing Top-Down Visual Attention with Feedback Convolutional Neural Networks. IEEE International Conference on Computer Vision (Dec 2015), 2956–2964. https://doi.org/10.1109/ICCV.2015.338
  • Chen et al. (2015) Hsin-Chen Chen, Wenyan Jia, Xin Sun, Zhaoxin Li, Yuecheng Li, John D Fernstrom, Lora E Burke, Thomas Baranowski, and Mingui Sun. 2015. Saliency-aware food image segmentation for personal dietary assessment using a wearable computer. Measurement Science and Technology 26, 2 (2015), 025702.
  • Daugherty et al. (2012) Bethany L Daugherty, TusaRebecca E Schap, Reynolette Ettienne-Gittens, Fengqing M Zhu, Marc Bosch, Edward J Delp, David S Ebert, Deborah A Kerr, and Carol J Boushey. 2012. Novel Technologies for Assessing Dietary Intake: Evaluating the Usability of a Mobile Telephone Food Record Among Adults and Adolescents. Journal of Medical Internet Research 14, 2 (Apr 2012), e58. https://doi.org/10.2196/jmir.1967
  • Dehais et al. (2016) Joachim Dehais, Marios Anthimopoulos, and Stavroula Mougiakakou. 2016. Food image segmentation for dietary assessment. In Proceedings of the 2nd International Workshop on Multimedia Assisted Dietary Management. 23–28.
  • Deng et al. (2009) Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. 2009 IEEE Conference on Computer Vision and Pattern Recognition (June 2009), 248–255. https://doi.org/10.1109/CVPR.2009.5206848
  • Deng et al. (2018) Zijun Deng, Xiaowei Hu, Lei Zhu, Xuemiao Xu, Jing Qin, Guoqiang Han, and Pheng-Ann Heng. 2018. R3Net: Recurrent Residual Refinement Network for Saliency Detection. International Joint Conference on Artificial Intelligence (July 2018), 684–690. http://dl.acm.org/citation.cfm?id=3304415.3304513
  • Donahue et al. (2014) Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. 2014. DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition. Proceedings of the 31st International Conference on International Conference on Machine Learning - Volume 32 (2014), I–647–I–655.
  • Fang et al. (2015) Shaobo Fang, Chang Liu, Fengqing Zhu, Edward J. Delp, and Carol J. Boushey. 2015. Single-View Food Portion Estimation Based on Geometric Models. IEEE International Symposium on Multimedia (Dec 2015), 385–390. https://doi.org/10.1109/ISM.2015.67
  • Fang et al. (2018) S. Fang, Z. Shao, R. Mao, C. Fu, E. J. Delp, F. Zhu, D. A. Kerr, and C. J. Boushey. 2018. Single-View Food Portion Estimation: Learning Image-to-Energy Mappings Using Generative Adversarial Networks. 2018 25th IEEE International Conference on Image Processing (ICIP) (Oct 2018), 251–255. https://doi.org/10.1109/ICIP.2018.8451461
  • Fang et al. (2017) Shaobo Fang, Fengqing Zhu, Carol J Boushey, and Edward J Delp. 2017. The use of co-occurrence patterns in single image based food portion estimation. IEEE Global Conference on Signal and Information Processing (Nov 2017), 462–466. https://doi.org/10.1109/GlobalSIP.2017.8308685
  • Federico et al. (2012) Perazzi Federico, Philipp Krähenbühl, Yael Pritch, and Alexander Hornung. 2012. Saliency filters: Contrast based filtering for salient region detection. IEEE Conference on Computer Vision and Pattern Recognition (June 2012), 733–740. https://doi.org/10.1109/CVPR.2012.6247743
  • Ginny et al. (2012) Garcia Ginny, Thankam S. Sunil, and Pedro Hinojosa. 2012. The fast food and obesity link: consumption patterns and severity of obesity. Obes Surg 22, 5 (May 2012), 810–818. https://doi.org/10.1007/s11695-012-0601-8
  • Guanbin and Yu (2015) Li Guanbin and Yizhou Yu. 2015. Visual saliency based on multiscale deep features. IEEE Conference on Computer Vision and Pattern Recognition (June 2015), 5455–5463. https://doi.org/10.1109/CVPR.2015.7299184
  • Hammons et al. (2011) Hammons, Amber J., and Barbara H. Fiese. 2011. Is frequency of shared family meals related to the nutritional health of children and adolescents? Pediatrics 127, 6 (Jun 2011), e1565–1574.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. The IEEE Conference on Computer Vision and Pattern Recognition (June 2016).
  • Karen and Zisserman (2015) Simonyan Karen and Andrew Zisserman. 2015. Very Deep Convolutional Networks for Large-Scale Image Recognition. International Conference on Learning Representations (May 2015). San Diego, CA.
  • Kawano and Yanai (2015a) Yoshiyuki Kawano and Keiji Yanai. 2015a. Automatic Expansion of a Food Image Dataset Leveraging Existing Categories with Domain Adaptation. European Conference on Computer Vision 2014 Workshops (2015), 3–17. https://doi.org/10.1007/978-3-319-16199-0_1
  • Kawano and Yanai (2015b) Yoshiyuki Kawano and Keiji Yanai. 2015b. FoodCam: A real-time food recognition system on a smartphone. Multimedia Tools and Applications 74, 14 (01 Jul 2015), 5263–5287. https://doi.org/10.1007/978-3-319-04117-9_38
  • Kerr et al. (2016) Deborah A Kerr, Amelia J Harray, Christina M Pollard, Satvinder S Dhaliwal, Edward J Delp, Peter A Howat, Mark R Pickering, Ziad Ahmad, Xingqiong Meng, Iain S Pratt, Janine L Wright, Katherine R Kerr, and Carol J Boushey. 2016. The connecting health and technology study: a 6-month randomized controlled trial to improve nutrition behaviours using a mobile food record and text messaging support in young adults. The international journal of behavioral nutrition and physical activity. 13, 1 (2016). https://doi.org/10.1186/s12966-016-0376-8
  • Khan et al. (2017b) Salman Khan, Xuming He, Fatih Porikli, Mohammed Bennamoun, Ferdous Sohel, and Roberto Togneri. 2017b. Learning Deep Structured Network for Weakly Supervised Change Detection. International Joint Conference on Artificial Intelligence (2017), 2008–2015. http://dl.acm.org/citation.cfm?id=3172077.3172167
  • Khan et al. (2017a) Salman H. Khan, Xuming He, Fatih Porikli, and Mohammed Bennamoun. 2017a. Forest Change Detection in Incomplete Satellite Images With Deep Neural Networks. IEEE Transactions on Geoscience and Remote Sensing 55, 9 (Sept 2017), 5407–5423. https://doi.org/10.1109/TGRS.2017.2707528
  • Kong and Tan (2012) Fanyu Kong and Jindong Tan. 2012. DietCam: Automatic dietary assessment with mobile camera phones. Pervasive and Mobile Computing 8, 1 (2012), 147 – 163. https://doi.org/10.1016/j.pmcj.2011.07.003
  • McCrory et al. (2019) Megan McCrory, Mingui Sun, Edward Sazonov, Gary Frost, Alex Anderson, Wenyan Jia, Modou L Jobarteh, Kathryn Maitland, Matilda Steiner-Asiedu, Tonmoy Ghosh, et al. 2019. Methodology for Objective, Passive, Image-and Sensor-based Assessment of Dietary Intake, Meal-timing, and Food-related Activity in Ghana and Kenya (P13-028-19). Current developments in nutrition 3, Supplement_1 (2019), nzz036–P13.
  • Mesas et al. (2012) A. E. Mesas, M. Muñoz-Pareja, E. López-García, and F. Rodríguez-Artalejo. 2012. Selected eating behaviours and excess body weight: a systematic review. Obes Rev 13, 2 (Feb 2012), 106–135. https://doi.org/10.1111/j.1467-789X.2011.00936.x PMID: 21955734.
  • Nordström et al. (2013) Karin Nordström, Christian Coff, Håkan Jönsson, Lennart Nordenfelt, and Ulf Görman. 2013. Food and health: individual, cultural, or scientific matters? Genes Nutr 8, 4 (Jul 2013), 357–363. https://doi.org/10.1007/s12263-013-0336-8 PMID: 23494484.
  • Okamoto and Yanai (2016) Koichi Okamoto and Keiji Yanai. 2016. An Automatic Calorie Estimation System of Food Images on a Smartphone. International Workshop on Multimedia Assisted Dietary Management (2016), 63–70. https://doi.org/10.1145/2986035.2986040
  • Organization (2009) World Health Organization. 2009. Global Health Risks: Mortality and Burden of Disease Attributable to Selected Major Risks. World Health Organization. https://apps.who.int/iris/handle/10665/44203
  • Paszke et al. (2017) Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. NIPS 2017 Autodiff Workshop (2017).
  • Peters et al. (2007) Robert J. Peters and Laurent Itti. 2007. Beyond bottom-up: Incorporating task-dependent influences into a computational model of spatial attention. IEEE Conference on Computer Vision and Pattern Recognition (June 2007), 1–8. https://doi.org/10.1109/CVPR.2007.383337
  • Piernas and Popkin (2011) Carmen Piernas and Barry M. Popkin. 2011. Food portion patterns and trends among U.S. children and the relationship to total eating occasion size, 1977-2006. J Nutr 141, 6 (Jun 2011), 1159–1164. https://doi.org/10.3945/jn.111.138727
  • Pingping et al. (2017) Pingping Zhang, Dong Wang, Huchuan Lu, Hongyu Wang, and Baocai Yin. 2017. Learning Uncertain Convolutional Features for Accurate Saliency Detection. IEEE International Conference on Computer Vision (Oct 2017), 212–221. https://doi.org/10.1109/ICCV.2017.32
  • Pouladzadeh et al. (2014) Parisa Pouladzadeh, Shervin Shirmohammadi, and Rana Al-Maghrabi. 2014. Measuring Calorie and Nutrition From Food Image. IEEE Transactions on Instrumentation and Measurement 63, 8 (Aug 2014), 1947–1956. https://doi.org/10.1109/TIM.2014.2303533
  • PR1 et al. (2010) Deshmukh-Taskar PR, Nicklas TA, O’Neil CE, Keast DR, Radcliffe JD, and Cho S. 2010. The relationship of breakfast skipping and type of breakfast consumption with nutrient intake and weight status in children and adolescents: the National Health and Nutrition Examination Survey 1999-2006. J Am Diet Assoc 110, 6 (Jun 2010), 869–878. https://doi.org/10.1016/j.jada.2010.03.023
  • R. et al. (2004) David R. Martin, Charless C. Fowlkes, and Jitendra Malik. 2004. Learning to detect natural image boundaries using local brightness, color, and texture cues. IEEE Transactions on Pattern Analysis and Machine Intelligence 26, 5 (May 2004), 530–549. https://doi.org/10.1109/TPAMI.2004.1273918
  • Radhakrishna et al. (2012) Radhakrishna Achanta, Appu Shaji, Kevin Smith, Aurelien Lucchi, Pascal Fua, and Sabine Süsstrunk. 2012. SLIC Superpixels Compared to State-of-the-Art Superpixel Methods. IEEE Transactions on Pattern Analysis and Machine Intelligence 34, 11 (Nov 2012), 2274–2282. https://doi.org/10.1109/TPAMI.2012.120
  • Ramanishka et al. (2017) Vasili Ramanishka, Abir Das, Jianming Zhang, and Kate Saenko. 2017. Top-Down Visual Saliency Guided by Captions. IEEE Conference on Computer Vision and Pattern Recognition (July 2017), 3135–3144. https://doi.org/10.1109/CVPR.2017.334
  • Sakurada and Okatani (2015) Ken Sakurada and Takayuki Okatani. 2015. Change Detection from a Street Image Pair using CNN Features and Superpixel Segmentation. British Machine Vision Conference, Article 61 (September 2015), 12 pages. https://doi.org/10.5244/C.29.61
  • Shim et al. (2014) Jee-Seon Shim, Kyungwon Oh, and Hyeon Chang Kim. 2014. Dietary assessment methods in epidemiologic studies. Epidemiol Health 36 (22 Jul 2014), e2014009–e2014009. https://doi.org/10.4178/epih/e2014009 PMID: 25078382.
  • Shimoda and Yanai (2015) Wataru Shimoda and Keiji Yanai. 2015. CNN-Based Food Image Segmentation Without Pixel-Wise Annotation. New Trends in Image Analysis and Processing – ICIAP Workshops (2015). https://doi.org/10.1007/978-3-319-23222-5_55
  • Szegedy et al. (2016) Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. 2016. Rethinking the Inception Architecture for Computer Vision. IEEE Conference on Computer Vision and Pattern Recognition (June 2016), 2818–2826.
  • Szeliski (2004) Richard Szeliski. 2004. Image Alignment and Stitching: A Tutorial. Technical Report MSR-TR-2004-92. Microsoft Research. 89 pages. https://www.microsoft.com/en-us/research/publication/image-alignment-and-stitching-a-tutorial/
  • Vasiloglou et al. (2018) Maria F Vasiloglou, Stavroula Mougiakakou, Emilie Aubry, Anika Bokelmann, Rita Fricker, Filomena Gomes, Cathrin Guntermann, Alexa Meyer, Diana Studerus, and Zeno Stanga. 2018. A comparative study on carbohydrate estimation: GoCARB vs. Dietitians. Nutrients 10, 6 (2018), 741.
  • Vijay et al. (2014) Rengarajan Vijay, Abhijith Punnappurath, A. N. Rajagopalan, and Guna Seetharaman. 2014. Efficient Change Detection for Very Large Motion Blurred Images. IEEE Conference on Computer Vision and Pattern Recognition Workshops (June 2014), 315–322. https://doi.org/10.1109/CVPRW.2014.55
  • Wang et al. (2017) Yu Wang, Fengqing Zhu, Carol J. Boushey, and Edward J. Delp. 2017. Weakly supervised food image segmentation using class activation maps. IEEE International Conference on Image Processing (Sep 2017), 1277–1281. https://doi.org/10.1109/ICIP.2017.8296487
  • Yuji et al. (2012) Yuji Matsuda, Hajime Hoashi, and Keiji Yanai. 2012. Recognition of Multiple-Food Images by Detecting Candidate Regions. IEEE International Conference on Multimedia and Expo (July 2012), 25–30. https://doi.org/10.1109/ICME.2012.157
  • Zhang et al. (2017) Pingping Zhang, Dong Wang, Huchuan Lu, Hongyu Wang, and Xiang Ruan. 2017. Amulet: Aggregating Multi-level Convolutional Features for Salient Object Detection. IEEE International Conference on Computer Vision (Oct 2017), 202–211. https://doi.org/10.1109/ICCV.2017.31
  • Zhang et al. (2015) Weiyu Zhang, Qian Yu, Behjat Siddiquie, Ajay Divakaran, and Harpreet Sawhney. 2015. “Snap-n-Eat”: Food Recognition and Nutrition Estimation on a Smartphone. J Diabetes Sci Technol 9, 3 (May 2015), 525–533. https://doi.org/10.1177/1932296815582222
  • Zhiming et al. (2017) Zhiming Luo, Akshaya Mishra, Andrew Achkar, Justin Eichel, Shaozi Li, and Pierre-Marc Jodoin. 2017. Non-local Deep Features for Salient Object Detection. IEEE Conference on Computer Vision and Pattern Recognition (July 2017), 6593–6601. https://doi.org/10.1109/CVPR.2017.698
  • Zhu et al. (2015) Fengqing Zhu, Marc Bosch, Insoo Woo, SungYe Kim, Carol J. Boushey, David S. Ebert, and Edward J. Delp. 2015. Multiple Hypotheses Image Segmentation and Classification With Application to Dietary Assessment. IEEE Journal of Biomedical and Health Informatics 19, 1 (Jan 2015), 377–388. https://doi.org/10.1109/JBHI.2014.2304925
  • Zhu et al. (2010) Fengqing Zhu, Marc Bosch, Insoo Woo, SungYe Kim, Carol J. Boushey, David S. Ebert, and Edward J. Delp. 2010. The Use of Mobile Devices in Aiding Dietary Assessment and Evaluation. IEEE Journal of Selected Topics in Signal Processing 4, 4 (Aug 2010), 756–766. https://doi.org/10.1109/JSTSP.2010.2051471