
Saliency-Aware Class-Agnostic Food Image Segmentation

Sri Kalyan Yarlagadda (Purdue University, West Lafayette, USA), Daniel Mas Montserrat (Purdue University, West Lafayette, USA), David Güera (Purdue University, West Lafayette, USA), Carol J. Boushey (University of Hawaii Cancer Center, Honolulu, USA), Deborah A. Kerr (Curtin University, Perth, Australia), and Fengqing Zhu (Purdue University, West Lafayette, USA)
Abstract.

Advances in image-based dietary assessment methods have allowed nutrition professionals and researchers to improve the accuracy of dietary assessment, where images of food consumed are captured using smartphones or wearable devices. These images are then analyzed using computer vision methods to estimate the energy and nutrient content of the foods. Food image segmentation, which determines the regions in an image where foods are located, plays an important role in this process. Current methods are data dependent and thus cannot generalize well across different food types. To address this problem, we propose a class-agnostic food image segmentation method. Our method uses a pair of eating scene images, one captured before eating starts and one after eating is completed. Using information from both the before and after eating images, we can segment food images by finding the salient missing objects without any prior information about the food class. We model a paradigm of top down saliency, which guides the attention of the human visual system (HVS) based on a task, to find the salient missing objects in a pair of images. Our method is validated on food images collected from a dietary study and shows promising results.

food segmentation, image based dietary assessment

1. Introduction

It is well-known that dietary habits have profound impacts on the quality of one’s health and well-being (Nordström et al., 2013; Mesas et al., 2012). While a nutritionally sound diet is essential to good health (Organization, 2009), it has been established through various studies that poor dietary habits can lead to many diseases and health complications. For example, studies from the World Health Organization (WHO) (Organization, 2009) have shown that poor diet is a key modifiable risk factor for the development of various non-communicable diseases such as heart disease, diabetes and cancers, which are the leading causes of death globally (Organization, 2009). In addition, studies have shown that poor dietary habits such as frequent consumption of fast food (Ginny et al., 2012), diets containing large portion sizes of energy-dense foods (Piernas and Popkin, 2011), absence of home food (Hammons et al., 2011) and skipping breakfast (PR1 et al., 2010) all contribute to the increasing risk of overweight and obesity. Because many of the prevalent diseases affecting humans are related to dietary habits, there is a need to study the relationship between our dietary habits and their effect on our health.

Understanding the complex relationship between dietary habits and human health is extremely important as it can help us mount intervention programs to prevent these diet related diseases (Daugherty et al., 2012). To better understand the relationship between our dietary habits and human health, nutrition practitioners and researchers often conduct dietary studies in which participants are asked to subjectively assess their dietary intake. In these studies, participants are asked to report foods and drinks they consumed on a daily basis over a period of time. Traditionally, self-reporting methods such as 24-hr recall, dietary records and food frequency questionnaire (FFQ) are popular for conducting dietary assessment studies (Shim et al., 2014). However, these methods have several drawbacks. For example, both the 24-hr recall and FFQ rely on the participants’ ability to recall foods they have consumed in the past. In addition, they are also very time-consuming. For dietary records, participants are asked to record details of the meals they consumed. Although this approach is less reliant on the participants’ memory, it requires motivated and trained participants to accurately report their diet (Shim et al., 2014). Another issue that affects the accuracy of these methods is that of under-reporting due to incorrect estimation of food portion sizes. Under-reporting has also been associated with factors such as obesity, gender, social desirability, restrained eating and hunger, education, literacy, perceived health status, age, and race/ethnicity (Zhu et al., 2010). Therefore, there is an urgent need to develop new dietary assessment methods that can overcome these limitations.

In the past decade, experts from the nutrition and engineering fields have combined forces to develop new dietary assessment methods by leveraging technologies such as the Internet and mobile phones. Among the various new approaches, some use images captured at the eating scene to extract dietary information. These are called image-based dietary assessment methods. Examples of such methods include TADA™ (Zhu et al., 2010), FoodLog (Aizawa and Ogawa, 2015), FoodCam (Kawano and Yanai, 2015b), Snap-n-Eat (Zhang et al., 2015), GoCARB (Vasiloglou et al., 2018), DietCam (Kong and Tan, 2012) and the method of (McCrory et al., 2019), to name a few. In these methods, participants are asked to capture images of foods and drinks consumed via a mobile phone. These images are then analyzed to estimate the nutrient content. Estimating the nutrient content of foods in an image is commonly performed by trained dietitians, which can be time-consuming, costly and laborious. More recently, automated methods have been developed to extract nutrient information of the foods from images (Fang et al., 2017, 2015; Fang et al., 2018). The process of extracting nutrient information from images generally involves three sub-tasks: food segmentation, food classification and portion size estimation (Zhu et al., 2010). Food image segmentation is the task of grouping pixels in an image representing foods. Food classification then identifies the food types. Portion size estimation (Fang et al., 2015) is the task of estimating the volume/energy of the foods in the image. Each of these tasks is essential for building an automated system to accurately extract nutrient information from food in images. In this paper, we focus on the task of food segmentation. In particular, we propose a food segmentation method that does not require information about the food types.

Food segmentation plays a crucial role in estimating nutrient information as the image segmentation masks are often used to estimate food portion sizes (Kong and Tan, 2012; Fang et al., 2017, 2015; Okamoto and Yanai, 2016; Pouladzadeh et al., 2014). Food segmentation from a single image is a challenging problem as there is large inter- and intra-class variance among different food types. Because of this variation, techniques developed for segmenting a particular class of foods will not be effective on other food classes. Despite these drawbacks, several learning based food segmentation methods (Zhu et al., 2015; Wang et al., 2017; Shimoda and Yanai, 2015; Dehais et al., 2016) have been proposed in recent years. One of the constraints of learning based methods is data dependency: they are only effective on the food categories they were trained on. For instance, in (Wang et al., 2017), class activation maps are used to segment food images. The Food-101 dataset (Bossard et al., 2014) is used to train the model, and the method is tested on a subset of another dataset that shares food categories with Food-101. This is a clear indication that the method in (Wang et al., 2017) is only effective on food classes it has been trained on. Similarly, the learning based method proposed in (Shimoda and Yanai, 2015) is trained and tested only on UEC-FOOD100 (Yuji et al., 2012). The UEC-FOOD100 dataset has a total of 12,740 images covering 100 different food categories, of which 1,174 contain multiple foods in a single image. The authors of (Shimoda and Yanai, 2015) split this dataset as follows: all images containing a single food category were used for training, and images containing multiple food categories were used for testing, so the training set contained 11,566 images and the testing set contained 1,174 images. Splitting the dataset in this fashion does not guarantee that the training and testing subsets contain images belonging to different food categories; in fact, it ensures they share common food categories. Furthermore, the authors of (Shimoda and Yanai, 2015) did not conduct any cross dataset evaluation. Thus the learning based method in (Shimoda and Yanai, 2015) is also only effective on food categories it has been trained on. In (Dehais et al., 2016), a semi-automatic method is proposed to segment foods. The authors of (Dehais et al., 2016) assume that foods are always present in a circular region and that the number of different food categories is known. The experiments are conducted on a dataset of 821 images. While they achieved promising results, the proposed approach is not designed for real world scenarios as these assumptions may not hold. In (Chen et al., 2015), a food segmentation technique is proposed that exploits saliency information. However, this approach relies on successfully detecting the food container, which is assumed to be a circular plate. Experimental results were reported using a dataset consisting of only 60 images. While the assumptions in (Chen et al., 2015) are valid in some cases, they may not hold in many real life scenarios.

In addition, there are also constraints imposed by the available datasets. Publicly available food image datasets such as UECFOOD-100 (Yuji et al., 2012), Food-101 (Bossard et al., 2014) and UECFOOD-256 (Kawano and Yanai, 2015a) are biased towards a particular cuisine and do not provide pixel level labelling. Pixel level labelling is crucial because it forms the necessary ground truth for training and evaluating learning based food segmentation methods. To overcome the limitations of learning based methods and the lack of public datasets with ground truth information, we propose a food segmentation method that is class-agnostic. In particular, our class-agnostic food segmentation method uses information from two images, the before eating and after eating images, to segment the foods consumed during the meal.

Our data is collected from a community dwelling dietary study (Kerr et al., 2016) using the TADA™ platform. In this study, participants were asked to take two pictures of their eating scene: one before they started eating, which we call the before eating image, and one immediately after they finished eating, which we call the after eating image. The before eating and after eating images represent the same eating scene; however, for the purpose of this work, we only select image pairs where the after eating image does not contain any food. Our goal is to segment the foods in the before eating image using information from both the before and after eating images. To illustrate this problem in a more general scenario, let us consider an experimental setup in which a person is given the pair of images shown in Fig. 1 and is asked the following question: “Can you spot the salient objects in Fig. 1a that are missing in Fig. 1b?” We refer to these as the salient missing objects. To find salient missing objects, the human visual system (HVS) compares regions that are salient in both images. In this example, the food, container and color checkerboard in Fig. 1a are the salient objects, and in Fig. 1b, the color checkerboard, spoon and container are the salient objects. Comparing the salient objects in both images, the HVS can identify the food as the salient missing object. In this paper, our goal is to build a model to answer this question. By looking for salient missing objects in the before eating image using the after eating image as the reference, we can segment the foods without additional information such as the food classes. Since this approach does not require information about the food class, we are able to build a class-agnostic food segmentation method by segmenting only the salient missing objects.

(a) Before eating image.
(b) After eating image.
Figure 1. A pair of eating scene images, taken before and after a meal is consumed. The salient missing object in Fig. 1a is the food in the container.

The above question does not bear significance for just any random pair of images; it only becomes relevant when the image pairs are related. For example, in Fig. 1, both images have many regions/objects with the same semantic labels, such as the color checkerboard, the container and the black background. However, the relative positions of these regions/objects differ between the two images because of changes in camera pose and the different capture times. Because of this semantic similarity between the two images, it is plausible to define the notion of salient missing objects. Notice that we are not interested in pixel-level differences due to changes in illumination, poses and angles.

In this experimental scenario, the visual attention of the HVS is guided by a task, hence it falls under the category of top down saliency. Visual attention (Borji et al., 2013; Borji and Itti, 2013) is defined as the process that enables a biological or artificial vision system to identify relevant regions in a scene (Borji et al., 2013). The relevance of each region in a scene is determined through two different mechanisms, namely top down saliency and bottom up saliency. In top down saliency, attention is directed by a task. An example of this mechanism in action is how a human driver’s HVS identifies relevant regions on the road for a safe journey. Other examples where top down saliency has been studied include sandwich making (Ballard et al., 1995) and interactive game playing (Peters et al., 2007). In bottom up saliency, attention is directed towards the regions that are most conspicuous. Bottom up saliency is also known as visual saliency. In the real world, visual attention of the HVS is guided by a combination of top down and bottom up saliency. Finding salient missing objects is a task-driven problem and hence falls under top down saliency, which has not been studied as extensively as visual saliency because of its complexity (Borji et al., 2013).

In this paper, we propose an unsupervised method to find the salient missing objects between a pair of images for the purpose of designing a class-agnostic food segmentation method. We use the after eating image as the background to find the contrast of every pixel in the before eating image. We then fuse the contrast map with saliency maps to obtain the final segmentation mask of the salient missing objects in the before eating image. We also compare our method to other class-agnostic methods: since food is a salient object in the before eating image, detecting salient objects in the before eating image can also segment the food. We therefore compare our method to four state-of-the-art salient object detection methods, namely R3NET (Deng et al., 2018), NLDF (Zhiming et al., 2017), UCF (Pingping et al., 2017) and Amulet (Zhang et al., 2017).

The paper is organized as follows. In Section 2, we formulate our problem and discuss related work. We describe our proposed method in detail in Section 3. In Section 4, we describe the dataset and experiment design. In Section 5, we discuss experimental results and compare our method with other salient object detection methods. Conclusions are provided in Section 6.

2. Problem Formulation and Related Work

In this section, we first introduce common notations used throughout the paper. We then discuss related works on modeling top down saliency and change detection.

2.1. Problem Formulation

Consider a pair of images $\{I^{b},I^{a}\}$ captured from an eating scene.

  • $I^{b}$: We refer to it as the “before eating image.” This is the meal image captured before consumption.

  • $I^{a}$: We refer to it as the “after eating image.” This is the meal image captured immediately after consumption.

Our goal is to obtain a binary mask $M^{b}$ that labels the salient missing objects in $I^{b}$ as foreground (with a binary label of 1) and the rest of $I^{b}$ as background (with a binary label of 0).

2.2. Related Work

Our goal is to find salient missing objects in a pair of images. Since the visual attention of the HVS is guided by a task, it falls under the category of top down saliency. Top down saliency is much more complex than visual saliency and hence has not been studied extensively. Some of the recent works modeling top down saliency paradigms are (Cao et al., 2015; Ramanishka et al., 2017). In (Ramanishka et al., 2017), given an image or video and an associated caption, the authors proposed a model to selectively highlight different regions based on words in the caption. Our work is related in the sense that we also try to highlight and segment objects/regions based on a description, except that the description in our case is a much more generic question of finding the salient missing objects in a pair of images without specific details.

Another related problem is change detection (Khan et al., 2017b; Sakurada and Okatani, 2015; Khan et al., 2017a; Vijay et al., 2014). In change detection, the objective is to detect all relevant changes between a pair of images that are aligned or can potentially be aligned via image registration. Examples of such changes include object motion, missing objects, structural changes (Sakurada and Okatani, 2015) and changes in vegetation (Khan et al., 2017b). One of the key differences between change detection and our proposed problem is that in change detection the pair of images are aligned or can potentially be aligned via image registration (Szeliski, 2004), which is not true in the case of salient missing objects. In our case, we cannot guarantee that $I^{b}$ and $I^{a}$ can be registered, as there is often relative motion between the objects of interest, as shown in Fig. 1 and Fig. 6.

The problem of finding salient missing objects can be thought of as a change detection problem in a more complex environment than those that have been previously considered. Hence, we need to develop new methods to solve this problem.

3. Method

In this section, we describe the details of our proposed method to segment salient missing objects in $I^{b}$. Our method consists of three parts: segmentation and feature extraction, contrast map generation, and saliency fusion. An overview of our proposed method is shown in Fig. 2.

Figure 2. Overview of the proposed method.

3.1. Segmentation And Feature Extraction

We first segment the pair of images $I^{a}$ and $I^{b}$ using SLIC (Radhakrishna et al., 2012) to group pixels into perceptually similar superpixels. Let $\mathcal{A}=\{a_{i}\}$ denote the superpixels of the after eating image $I^{a}$ and $\mathcal{B}=\{b_{j}\}$ the superpixels of the before eating image $I^{b}$.
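As a concrete illustration, this superpixel step can be sketched with scikit-image's SLIC implementation; the file names and SLIC parameters below are illustrative assumptions, not values from our experiments.

```python
# Minimal sketch of the superpixel segmentation step using scikit-image's SLIC.
# File names and parameters (n_segments, compactness) are assumptions.
from skimage.io import imread
from skimage.segmentation import slic

I_b = imread("before_eating.jpg")  # I^b, the before eating image
I_a = imread("after_eating.jpg")   # I^a, the after eating image

# Group pixels into perceptually similar superpixels.
labels_b = slic(I_b, n_segments=300, compactness=10, start_label=0)  # B = {b_j}
labels_a = slic(I_a, n_segments=300, compactness=10, start_label=0)  # A = {a_i}
```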

We then extract features from each superpixel and use them to compute the contrast map. The contrast map $C_{a}^{b}$ gives an estimate of the probability of each pixel belonging to objects/regions present in $I^{b}$ but missing in $I^{a}$; this is explained in detail in Section 3.2. To compute an accurate contrast map, pixels belonging to similar regions in $I^{b}$ and $I^{a}$ should have similar feature representations, and vice versa. Going from $I^{b}$ to $I^{a}$, we can expect changes in scene lighting, noise levels and segmentation boundaries because of relative object motion, so it is important that the pixel feature representations are robust to these artifacts. For this reason, we extract features using a pretrained Convolutional Neural Network (CNN) instead of hand-crafted features. We use VGG19 (Karen and Zisserman, 2015) pretrained on the ImageNet dataset (Deng et al., 2009). ImageNet is a large dataset consisting of more than a million images belonging to 1000 different classes and captures the distribution of natural images very well. For these reasons, models pretrained on ImageNet are widely used in many applications (Donahue et al., 2014; Anselmi et al., 2016; Sakurada and Okatani, 2015; Guanbin and Yu, 2015).

We apply the pretrained VGG19 to both $I^{b}$ and $I^{a}$. The output of the $16^{th}$ convolutional layer of VGG19 is extracted as the feature map; the reasoning behind this choice is explained in Section 4.3.1. According to Table 1 in (Karen and Zisserman, 2015), VGG19 has a total of 16 convolutional layers. The output of the $16^{th}$ convolutional layer has dimensions $14\times 14\times 512$, where $14\times 14$ is the spatial resolution. The input ($I^{b}$ or $I^{a}$) to VGG19 has a spatial resolution of $224\times 224$. We spatially upscale the output of the $16^{th}$ convolutional layer by a factor of 16 and denote the upscaled feature maps of $I^{b}$ and $I^{a}$ as $F^{b}$ and $F^{a}$, respectively. The dimensions of $F^{b}$ and $F^{a}$ are then $224\times 224\times 512$, so every pixel is represented by a 512-dimensional vector in feature space. For each superpixel, we denote the extracted features as $\{f_{j}^{b}\}$ for the before eating image and $\{f_{i}^{a}\}$ for the after eating image. Using the upscaled feature maps, $f_{i}^{a}$ and $f_{j}^{b}$ are computed as described in Eq. 1.

(1)  $f_{i}^{a}=\frac{1}{m_{i}^{a}}\sum_{k\in r_{i}^{a}}F^{a}(k), \qquad f_{j}^{b}=\frac{1}{m_{j}^{b}}\sum_{k\in r_{j}^{b}}F^{b}(k)$

where $r_{i}^{a}$ denotes the set of pixels belonging to superpixel $a_{i}$ and $m_{i}^{a}$ is its cardinality; $r_{j}^{b}$ and $m_{j}^{b}$ are defined similarly.
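For concreteness, a minimal sketch of this feature extraction step is given below using a torchvision VGG19 pretrained on ImageNet. The module index used for the 16th convolutional layer, the preprocessing, and the nearest-neighbor resizing of the superpixel label map to 224×224 are our assumptions about implementation details not spelled out above; it also reuses names from the SLIC sketch.

```python
# Sketch of per-superpixel feature extraction (Eq. 1) with a pretrained VGG19.
# Assumes I_b, I_a, labels_b, labels_a from the SLIC sketch above.
import numpy as np
import torch
import torch.nn.functional as F
import torchvision.models as models
import torchvision.transforms as T
from skimage.transform import resize

vgg19 = models.vgg19(pretrained=True).features.eval()

preprocess = T.Compose([
    T.ToPILImage(), T.Resize((224, 224)), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def superpixel_features(image, labels, conv_index=34):
    """Mean 512-d feature per superpixel, as in Eq. 1."""
    # conv_index=34 is assumed to be the 16th convolution in vgg19.features.
    x = preprocess(image).unsqueeze(0)
    with torch.no_grad():
        for i, layer in enumerate(vgg19):
            x = layer(x)
            if i == conv_index:
                break
    # Upscale the 14x14x512 output back to 224x224 (factor of 16).
    fmap = F.interpolate(x, size=(224, 224), mode="bilinear", align_corners=False)
    fmap = fmap.squeeze(0).permute(1, 2, 0).numpy()            # 224 x 224 x 512
    # Resize the superpixel label map to 224x224 with nearest-neighbor.
    lab = resize(labels, (224, 224), order=0, preserve_range=True,
                 anti_aliasing=False).astype(int)
    return {sp: fmap[lab == sp].mean(axis=0) for sp in np.unique(lab)}

feats_b = superpixel_features(I_b, labels_b)  # {f_j^b}
feats_a = superpixel_features(I_a, labels_a)  # {f_i^a}
```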

3.2. Contrast Map Generation

Contrast is a term often associated with salient object detection methods. The contrast of a region in an image refers to its overall dissimilarity with other regions in the same image. It is generally assumed that regions with high contrast demand more visual attention (Federico et al., 2012). In the context of our problem, visual attention is guided by the task of finding objects in $I^{b}$ that are missing in $I^{a}$. Therefore, our contrast map $C^{b}_{a}$ of $I^{b}$ is an estimate of the probability of each pixel belonging to an object missing in $I^{a}$. $C^{b}_{a}$ is computed as shown in Eq. 2.

(2)  $C^{b}_{a}=\dfrac{C_{a}^{b,\text{local}}+C_{a}^{b,\text{neigh}}}{\max\left(C_{a}^{b,\text{local}}+C_{a}^{b,\text{neigh}}\right)}$

In $C_{a}^{b,\text{local}}$, the contrast value of a superpixel $b_{j}$ is computed using information from $b_{j}$ and $I^{a}$, while in $C_{a}^{b,\text{neigh}}$ the contrast value of $b_{j}$ is computed using information from $b_{j}$, its neighboring superpixels and $I^{a}$. The term $\max(C_{a}^{b,\text{local}}+C_{a}^{b,\text{neigh}})$, the maximum value in the summed map, is used to normalize $C^{b}_{a}$ to $[0,1]$. To compute $C_{a}^{b,\text{local}}$ or $C_{a}^{b,\text{neigh}}$, contrast values are computed for each superpixel and then assigned to the associated individual pixels. However, if $b_{j}$ is a superpixel along the image boundary, that is $b_{j}\in\mathcal{B}^{e}$, we assign $b_{j}$ a contrast value of zero, since we assume that the salient missing objects are unlikely to be present along the image boundaries.

The contrast value of a superpixel $b_{j}\notin\mathcal{B}^{e}$ is denoted by $c_{j}^{b,\text{local}}$ and is computed as:

(3)  $c_{j}^{b,\text{local}}=\min_{\forall i\ \text{such that}\ a_{i}\in\mathcal{A}}\left\|f_{j}^{b}-f_{i}^{a}\right\|_{2}$

If $b_{j}\in\mathcal{B}^{e}$, then $c_{j}^{b,\text{local}}=0$. $c_{j}^{b,\text{local}}$ is the Euclidean distance between the feature vector $f_{j}^{b}$ and the closest feature vector of a superpixel in the after eating image. A superpixel $b_{j}$ belonging to objects/regions common to both $I^{b}$ and $I^{a}$ will have a lower value of $c_{j}^{b,\text{local}}$, while a $b_{j}$ belonging to objects/regions present in $I^{b}$ but missing in $I^{a}$ will likely have a higher value of $c_{j}^{b,\text{local}}$.
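A minimal sketch of this local contrast term is shown below, assuming the per-superpixel feature dictionaries from the previous sketch and a precomputed set boundary_b of before-image superpixels touching the image border (i.e., $\mathcal{B}^{e}$); how that set is built is not shown.

```python
# Sketch of c_j^{b,local} (Eq. 3): minimum Euclidean distance from each
# before-image superpixel feature to all after-image superpixel features.
import numpy as np

def local_contrast(feats_b, feats_a, boundary_b):
    Fa = np.stack(list(feats_a.values()))          # all f_i^a, shape (|A|, 512)
    c_local = {}
    for j, fb in feats_b.items():
        if j in boundary_b:                        # b_j in B^e gets zero contrast
            c_local[j] = 0.0
        else:
            c_local[j] = float(np.min(np.linalg.norm(Fa - fb, axis=1)))
    return c_local
```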

Before describing how we compute $C_{a}^{b,\text{neigh}}$, we need to introduce a few more notations. For a given superpixel $b_{j}$, let $\mathcal{N}(b_{j})$ denote the set of all neighboring superpixels of $b_{j}$. Similarly, for any superpixel $a_{i}$, $\mathcal{N}(a_{i})$ is the set of its neighboring superpixels. Consider a complete bipartite graph over the two sets of superpixels $\{a_{i},\mathcal{N}(a_{i})\}$ and $\{b_{j},\mathcal{N}(b_{j})\}$, denoted by

(4)  $G_{a_{i},b_{j}}=\left(\{a_{i},\mathcal{N}(a_{i})\}\cup\{b_{j},\mathcal{N}(b_{j})\},\ \mathcal{E}_{a_{i},b_{j}}\right)$

where $\mathcal{E}_{a_{i},b_{j}}$ is the set of edges in $G_{a_{i},b_{j}}$. An example is shown in Fig. 3.

(a) $G_{a_{i_{1}},b_{j_{25}}}$
(b) $\mathcal{S}^{\mathcal{E}_{a_{i_{1}},b_{j_{25}}}}_{1}=\{e_{a_{i_{4}},b_{j_{28}}},e_{a_{i_{10}},b_{j_{30}}},e_{a_{i_{1}},b_{j_{25}}}\}$
(c) $\mathcal{S}^{\mathcal{E}_{a_{i_{1}},b_{j_{25}}}}_{2}=\{e_{a_{i_{10}},b_{j_{30}}},e_{a_{i_{4}},b_{j_{28}}},e_{a_{i_{5}},b_{j_{25}}}\}$
Figure 3. Consider two hypothetical nodes $a_{i_{1}}\in\mathcal{A}$ with $\mathcal{N}(a_{i_{1}})=\{a_{i_{10}},a_{i_{4}},a_{i_{5}}\}$ and $b_{j_{25}}\in\mathcal{B}$ with $\mathcal{N}(b_{j_{25}})=\{b_{j_{30}},b_{j_{28}}\}$. Fig. 3a illustrates how $G_{a_{i_{1}},b_{j_{25}}}$ is constructed. Note that because $G_{a_{i_{1}},b_{j_{25}}}$ is a complete bipartite graph, there is an edge from every node in $\{a_{i_{1}},\mathcal{N}(a_{i_{1}})\}$ to every node in $\{b_{j_{25}},\mathcal{N}(b_{j_{25}})\}$. Figs. 3b and 3c show examples of plausible maximum matchings, with $D(\mathcal{S}^{\mathcal{E}_{a_{i_{1}},b_{j_{25}}}}_{1})=w(e_{a_{i_{1}},b_{j_{25}}})+w(e_{a_{i_{10}},b_{j_{30}}})+w(e_{a_{i_{4}},b_{j_{28}}})$; $D(\mathcal{S}^{\mathcal{E}_{a_{i_{1}},b_{j_{25}}}}_{2})$ can be computed in a similar manner.

In $G_{a_{i},b_{j}}$, consider an edge $e_{a_{i_{1}},b_{j_{1}}}$ between two superpixels $a_{i_{1}}\in\{a_{i},\mathcal{N}(a_{i})\}$ and $b_{j_{1}}\in\{b_{j},\mathcal{N}(b_{j})\}$; the edge weight is given by the Euclidean norm $w(\cdot)$ defined as:

(5)  $w(e_{a_{i_{1}},b_{j_{1}}})=\left\|f_{a_{i_{1}}}-f_{b_{j_{1}}}\right\|_{2}$

A matching over $G_{a_{i},b_{j}}$ is a set of edges $\mathcal{S}\subset\mathcal{E}_{a_{i},b_{j}}$ such that no two edges in $\mathcal{S}$ share the same node. A maximum matching over $G_{a_{i},b_{j}}$, denoted by $\mathcal{S}^{\mathcal{E}_{a_{i},b_{j}}}_{k}\subset\mathcal{E}_{a_{i},b_{j}}$, is a matching of maximum cardinality. There can be many possible maximum matchings over $G_{a_{i},b_{j}}$; hence we use the subscript $k$ in $\mathcal{S}^{\mathcal{E}_{a_{i},b_{j}}}_{k}$ to denote one such possibility. The cost of a given $\mathcal{S}^{\mathcal{E}_{a_{i},b_{j}}}_{k}$ is denoted by $D(\mathcal{S}^{\mathcal{E}_{a_{i},b_{j}}}_{k})$ and is defined as:

(6)  $D(\mathcal{S}_{k}^{\mathcal{E}_{a_{i},b_{j}}})=\sum_{\forall e\in\mathcal{S}_{k}^{\mathcal{E}_{a_{i},b_{j}}}}w(e)$

Given $G_{a_{i},b_{j}}$, we want to find the maximum matching with the minimum cost. We refer to this minimum cost as $\hat{D}_{\text{min}}(G_{a_{i},b_{j}})$, computed as:

(7)  $\hat{D}_{\text{min}}(G_{a_{i},b_{j}})=\min_{\forall k\ \text{such that}\ \exists\,\mathcal{S}_{k}^{\mathcal{E}_{a_{i},b_{j}}}}D(\mathcal{S}_{k}^{\mathcal{E}_{a_{i},b_{j}}})$

For two superpixels $a_{i}$ and $b_{j}$, $\hat{D}_{\text{min}}(G_{a_{i},b_{j}})$ measures the similarity between the two superpixels and between their neighborhoods. The lower the value of $\hat{D}_{\text{min}}(G_{a_{i},b_{j}})$, the more similar the two superpixels are, both in terms of their individual characteristics and their neighboring superpixels. The contrast value of a superpixel $b_{j}\notin\mathcal{B}^{e}$ in $C_{a}^{b,\text{neigh}}$ is denoted by $c_{j}^{b,\text{neigh}}$ and is computed as:

(8)  $c_{j}^{b,\text{neigh}}=\min_{\forall i\ \text{such that}\ a_{i}\in\mathcal{A}}\dfrac{\hat{D}_{\text{min}}(G_{a_{i},b_{j}})}{l_{a_{i},b_{j}}}$

In Eq. 8, $l_{a_{i},b_{j}}=\min\left(|\{a_{i},\mathcal{N}(a_{i})\}|,\,|\{b_{j},\mathcal{N}(b_{j})\}|\right)$, where $|\{\cdot\}|$ denotes the cardinality of a set. If $b_{j}\in\mathcal{B}^{e}$, then $c_{j}^{b,\text{neigh}}=0$. $\hat{D}_{\text{min}}(G_{a_{i},b_{j}})$ is likely to increase as $l_{a_{i},b_{j}}$ increases because there are more edges in the maximum matching. To compensate for this effect, we divide $\hat{D}_{\text{min}}(G_{a_{i},b_{j}})$ by $l_{a_{i},b_{j}}$ in Eq. 8.
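The minimum-cost maximum matching in Eq. 7 can be solved with the Hungarian algorithm; the sketch below uses scipy's linear_sum_assignment on the rectangular cost matrix of edge weights from Eq. 5. The neighbour dictionaries (neigh_a, neigh_b) and the brute-force search over all $a_{i}$ are implementation assumptions, and no attempt is made to optimize the search.

```python
# Sketch of c_j^{b,neigh} (Eq. 8). For each pair (a_i, b_j), a rectangular cost
# matrix of edge weights (Eq. 5) is built over {a_i, N(a_i)} x {b_j, N(b_j)},
# and the minimum-cost maximum matching (Eq. 7) is found with the Hungarian method.
import numpy as np
from scipy.optimize import linear_sum_assignment

def min_cost_matching(A_nodes, B_nodes):
    cost = np.linalg.norm(A_nodes[:, None, :] - B_nodes[None, :, :], axis=2)
    rows, cols = linear_sum_assignment(cost)       # covers min(|A|,|B|) node pairs
    return cost[rows, cols].sum()                  # \hat{D}_min(G_{a_i,b_j})

def neigh_contrast(feats_a, feats_b, neigh_a, neigh_b, boundary_b):
    c_neigh = {}
    for j, fb in feats_b.items():
        if j in boundary_b:                        # b_j in B^e gets zero contrast
            c_neigh[j] = 0.0
            continue
        B_nodes = np.stack([feats_b[n] for n in [j] + list(neigh_b[j])])
        best = np.inf
        for i in feats_a:
            A_nodes = np.stack([feats_a[n] for n in [i] + list(neigh_a[i])])
            l = min(len(A_nodes), len(B_nodes))    # l_{a_i,b_j}
            best = min(best, min_cost_matching(A_nodes, B_nodes) / l)
        c_neigh[j] = float(best)
    return c_neigh
```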

3.3. Saliency Fusion

The contrast map $C_{a}^{b}$ gives an estimate of the probability of pixels belonging to objects/regions present in $I^{b}$ but missing in $I^{a}$. However, we would like to segment the salient missing objects. As explained in Section 1, to find the salient missing objects, the HVS compares objects/regions in $I^{b}$ that have a high value of visual saliency. Therefore, we are interested in identifying regions in the contrast map $C^{b}_{a}$ that correspond to high visual saliency. The visual saliency information of $I^{b}$ needs to be incorporated into $C^{b}_{a}$ to obtain our final estimate $\hat{M}_{a}^{b}$, where $\hat{M}_{a}^{b}$ is the probability of each pixel in $I^{b}$ belonging to the salient missing objects. We can then obtain the final binary label $M^{b}$ by thresholding $\hat{M}_{a}^{b}$ with $T\in[0,1]$. If $S^{b}$ is the visual saliency map of $I^{b}$, then $\hat{M}_{a}^{b}$ is computed as:

(9)  $\hat{M}_{a}^{b}=\dfrac{\alpha S^{b}+C^{b}_{a}}{\max\left(\alpha S^{b}+C^{b}_{a}\right)}$

where $\max(\alpha S^{b}+C^{b}_{a})$ is the normalization term. In Eq. 9, $\alpha$ is a weighting factor in $[0,1]$ that varies the relative contributions of $S^{b}$ and $C^{b}_{a}$ towards $\hat{M}_{a}^{b}$. The value of $\alpha$ is determined empirically, as explained in Section 4.3. To compute $S^{b}$, we use the state-of-the-art salient object detection method R3NET (Deng et al., 2018). We also compare our method to other deep learning based salient object detection methods, namely Amulet (Zhang et al., 2017), UCF (Pingping et al., 2017) and NLDF (Zhiming et al., 2017).
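Combining the two contrast terms (Eq. 2) and fusing the result with the saliency map (Eq. 9) then amounts to a few lines of array arithmetic, sketched below. Here S_b is assumed to be a visual saliency map of $I^{b}$ in $[0,1]$ produced by a pretrained detector such as R3NET (not shown), and the threshold value is only illustrative.

```python
# Sketch of Eq. 2 and Eq. 9: combine the per-pixel contrast maps, fuse with
# the visual saliency map S^b, and threshold to get the binary mask M^b.
import numpy as np

def fuse_and_threshold(C_local, C_neigh, S_b, alpha=0.6, T=0.5):
    # alpha = 0.6 follows the validation result in Section 4.3.1; T is arbitrary here.
    C_ab = C_local + C_neigh
    C_ab = C_ab / C_ab.max()                       # Eq. 2: contrast map in [0, 1]
    M_hat = alpha * S_b + C_ab
    M_hat = M_hat / M_hat.max()                    # Eq. 9: probability map of salient missing objects
    M_b = (M_hat >= T).astype(np.uint8)            # binary mask M^b for threshold T
    return M_hat, M_b
```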

4. Experimental Results

4.1. Dataset

The dataset $\mathcal{D}$ we use for evaluating our method contains 566 pairs of before eating and after eating images. Along with the image pairs, ground truth masks of the salient missing objects in the before eating images (which in this case are foods) are also provided. These images are a subset of the images collected from a community dwelling dietary study (Kerr et al., 2016) and exhibit a wide variety of foods and eating scenes. Participants in this dietary study were asked to capture a pair of before and after eating scene images, denoted as $I^{b}$ and $I^{a}$. A typical participant takes about 3 to 5 pairs of images per day depending on his/her eating habits. These image pairs are then sent to a cloud based server to analyze nutrient contents. $\mathcal{D}$ is split randomly into $\mathcal{D}_{val}$ (49 image pairs) and $\mathcal{D}_{test}$ (517 image pairs). $\mathcal{D}_{val}$ is used for choosing the optimal hyperparameters, namely $\alpha$ and the convolutional layer; more details are given in Section 4.3.1. $\mathcal{D}_{test}$ is used to compare the accuracy of our method with other methods. Examples of image pairs from $\mathcal{D}_{test}$, along with the predicted masks obtained by our method and the salient object detection methods, are shown in Fig. 6. $\mathcal{D}_{test}$ and $\mathcal{D}_{val}$ have very different food classes. In addition, the backgrounds of the images in $\mathcal{D}_{val}$ are very different from those in $\mathcal{D}_{test}$. This makes $\mathcal{D}$ well suited for our experiments, because $\mathcal{D}_{val}$ does not give any information about the food classes present in $\mathcal{D}_{test}$. Thus, if a model tuned on $\mathcal{D}_{val}$ performs well on $\mathcal{D}_{test}$, it indicates that the model is able to segment foods without requiring information about the food class.

4.2. Evaluation Metrics

We use two standard metrics for evaluating the performance of the proposed method. These metrics are commonly used to assess the quality of salient object detection methods (Borji et al., 2015).

  • Precision and Recall: Consider $t=\{I^{b},I^{a},G^{b}\}$ in $\mathcal{D}$, where $G^{b}$ represents the ground truth mask of the salient missing objects in $I^{b}$. Pixels belonging to the salient missing objects in $G^{b}$ have a value of 1 and the rest have a value of 0. Our proposed method outputs $\hat{M}^{b}_{a}$, which has a range of $[0,1]$. We can then generate a segmentation mask $M^{b}$ using a threshold $T\in[0,1]$. Given $M^{b}$ and $G^{b}$, precision (P) and recall (R) are computed over $\mathcal{D}$ as:

    (10)  $\text{P}:\ \dfrac{\sum_{\forall t\in\mathcal{D}}|M^{b}\cap G^{b}|}{\sum_{\forall t\in\mathcal{D}}|M^{b}|}\ ,\qquad \text{R}:\ \dfrac{\sum_{\forall t\in\mathcal{D}}|M^{b}\cap G^{b}|}{\sum_{\forall t\in\mathcal{D}}|G^{b}|}$

    For a binary mask, $|\cdot|$ denotes the number of non-zero entries. By varying $T$ between 0 and 1, we obtain different pairs of precision and recall values. When precision and recall are plotted against each other, we obtain the precision recall (PR) curve. The information provided by precision and recall can be condensed into their weighted harmonic mean, denoted $F_{\beta}$ and computed as:

    (11)  $F_{\beta}=\dfrac{(1+\beta^{2})\cdot\text{Precision}\cdot\text{Recall}}{\beta^{2}\cdot\text{Precision}+\text{Recall}}$

    The value of $F_{\beta}$ lies in $[0,1]$, and a higher value indicates better performance. $\beta$ is a control parameter that weighs the importance of precision relative to recall; the value of $\beta^{2}$ is chosen to be 0.3, as in other works (Borji et al., 2015). The value of $F_{\beta}$ varies as we move along the PR curve, and the entire PR curve can be summarized by the maximal $F_{\beta}$, denoted $F_{\beta}^{\text{max}}$, as discussed in (Borji et al., 2015; R. et al., 2004).

  • Receiver Operating Characteristic (ROC): Similar to the PR curve, the ROC curve is a plot of the true positive rate (TPR) against the false positive rate (FPR), defined as:

    (12)  $\text{TPR}:\ \dfrac{\sum_{\forall t\in\mathcal{D}}|M^{b}\cap G^{b}|}{\sum_{\forall t\in\mathcal{D}}|G^{b}|}\ ,\qquad \text{FPR}:\ \dfrac{\sum_{\forall t\in\mathcal{D}}|M^{b}\cap(1-G^{b})|}{\sum_{\forall t\in\mathcal{D}}|(1-G^{b})|}$

    Similar to $F_{\beta}^{\text{max}}$, the information provided by the ROC curve can be condensed into a single metric, the AUC, which is the area under the ROC curve. Higher values of AUC indicate better performance: a perfect method has an AUC of 1, and a method that randomly guesses the values in $M^{b}$ has an AUC of 0.5. A minimal computation sketch for both metrics follows this list.
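The sketch below illustrates how $F_{\beta}^{\text{max}}$ and AUC can be computed over a dataset of predicted probability maps and binary ground truth masks; the threshold grid and the use of scikit-learn's roc_auc_score are implementation assumptions.

```python
# Sketch of dataset-level F_beta^max (Eqs. 10-11) and AUC (Eq. 12), assuming
# `preds` is a list of probability maps M_hat and `gts` a list of binary masks G^b.
import numpy as np
from sklearn.metrics import roc_auc_score

def f_beta_max(preds, gts, beta2=0.3, n_thresholds=100):
    gt_pos = sum(int((g > 0).sum()) for g in gts)
    best = 0.0
    for T in np.linspace(0.0, 1.0, n_thresholds):
        tp = sum(int(np.logical_and(p >= T, g > 0).sum()) for p, g in zip(preds, gts))
        pred_pos = sum(int((p >= T).sum()) for p in preds)
        if tp == 0 or pred_pos == 0:
            continue
        precision, recall = tp / pred_pos, tp / gt_pos
        best = max(best, (1 + beta2) * precision * recall / (beta2 * precision + recall))
    return best

def dataset_auc(preds, gts):
    # Pool all pixels over the dataset and compute the area under the ROC curve.
    scores = np.concatenate([p.ravel() for p in preds])
    labels = np.concatenate([(g > 0).ravel().astype(int) for g in gts])
    return roc_auc_score(labels, scores)
```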

(a) VGG19: $F_{\beta}^{\text{max}}$ vs $\alpha$
(b) ResNet34: $F_{\beta}^{\text{max}}$ vs $\alpha$
(c) Inception-v3: $F_{\beta}^{\text{max}}$ vs $\alpha$
Figure 4. $F_{\beta}^{\text{max}}$ of $\hat{M}^{b}_{a}$ on $\mathcal{D}_{val}$ plotted as $\alpha$ varies. (a) For VGG19, $F_{\beta}^{\text{max}}$ is reported using features from all convolutional layers that precede a max pooling layer. (b) For ResNet34, features were extracted from the output of each stage. (c) For Inception-v3, features were extracted from each layer whose output spatial dimensions differ from its input spatial dimensions.
(a) ROC curve
(b) ROC curve (zoomed)
(c) PR curve
Figure 5. ROC and PR curves of R3NET (Deng et al., 2018) (also $S^{b}$), NLDF (Zhiming et al., 2017), Amulet (Zhang et al., 2017), UCF (Pingping et al., 2017), $C^{b}_{a}$ and $\hat{M}^{b}_{a}$. Fig. 5b is a zoomed-in version of the ROC curve in Fig. 5a.

4.3. Experiments

4.3.1. Hyperparameter selection

The method described in Section 3 requires two hyperparameters, namely $\alpha$ in Eq. 9 and the convolutional layer of VGG19 used for feature extraction. To justify the use of a pretrained VGG19 for feature extraction, we also conducted experiments extracting features from ResNet34 (He et al., 2016) and Inception-v3 (Szegedy et al., 2016), both pretrained on ImageNet. These experiments are conducted on $\mathcal{D}_{val}$ to find the best $F_{\beta}$, which gives us a set of optimal hyperparameters.

To choose the best convolutional layer, we evaluate $\hat{M}^{b}_{a}$ using features from every convolutional layer of VGG19 that precedes a max pooling layer; there are 5 such convolutional layers in VGG19. The architecture of ResNet34 can be divided into 5 stages (He et al., 2016), and to find the optimal layer in ResNet34 we extracted features from the output of each stage. The architecture of Inception-v3 is very different from those of ResNet34 and VGG19; to find the optimal layer in Inception-v3, we extract features wherever the spatial dimensions change as the input propagates through the network. There are 7 such changes in Inception-v3 before the average pooling operation. Please refer to the architecture of Inception-v3 provided in PyTorch (Paszke et al., 2017) for more details. In addition to extracting features from various layers, we also vary $\alpha$ from 0 to 1 in steps of 0.1 and plot $F_{\beta}^{\text{max}}$ as $\alpha$ varies for every layer. The results are shown in Fig. 4. From Fig. 4, it is evident that features from the $16^{th}$ convolutional layer give the best performance compared to features from other layers, and that features from VGG19 achieve better performance than features from ResNet34. For features from VGG19, $F_{\beta}^{\text{max}}$ attains its maximum value of 0.754 at $\alpha=0.6$.

As we go deeper into the convolutional layers of VGG19, the extracted features become increasingly abstract but suffer from a decrease in resolution. Abstract features are less sensitive to changes in illumination, noise and pose, which suits our task well. As Fig. 4 shows, going deeper into the convolutional layers we first observe a degradation in the quality of the extracted features (conv-layer 2 to conv-layer 8). This trend is reversed from conv-layer 8 to conv-layer 16, with a significant improvement in $F_{\beta}^{\text{max}}$. We suspect this is because at first the negative effect of decreased resolution outweighs the benefit of more abstract features, and this trade-off reverses from conv-layer 8 onward.

4.3.2. Testing

After obtaining the optimal hyperparameters as described in Section 4.3.1, we evaluate our method on $\mathcal{D}_{test}$. $\hat{M}^{b}_{a}$ is computed for every image pair in $\mathcal{D}_{test}$, and the ROC and PR curves are computed over $\mathcal{D}_{test}$. Since our goal is to develop a class-agnostic food segmentation method, we compare the proposed method to four state-of-the-art salient object detection techniques, namely R3NET (Deng et al., 2018), NLDF (Zhiming et al., 2017), Amulet (Zhang et al., 2017) and UCF (Pingping et al., 2017). Salient object detection methods are class-agnostic and are applicable in this scenario because food is always a salient object in $I^{b}$. Since these are deep learning based methods, we use their respective pretrained models to compute the saliency maps of $I^{b}$. The ROC and PR curves of the various methods are shown in Fig. 5, and the $F_{\beta}^{\text{max}}$ and AUC values are reported in Table 1.

Table 1. AUC and $F_{\beta}^{\text{max}}$ values of various maps and methods.
Maps | AUC | $F_{\beta}^{\text{max}}$
$C^{b}_{a}$ | 0.937 | 0.645
R3NET (Deng et al., 2018) ($S^{b}$) | 0.871 | 0.527
$\hat{M}^{b}_{a}$ (ours) | 0.954 | 0.741
Amulet (Zhang et al., 2017) | 0.919 | 0.499
NLDF (Zhiming et al., 2017) | 0.909 | 0.493
UCF (Pingping et al., 2017) | 0.934 | 0.536

5. Discussion

Figure 6. Sample image pairs from $\mathcal{D}_{test}$ along with various maps. For every row, the first group of two images are the original before and after eating images, respectively. The second group of images are the saliency maps generated by Amulet (Zhang et al., 2017), UCF (Pingping et al., 2017), NLDF (Zhiming et al., 2017) and R3NET (Deng et al., 2018), followed by $\hat{M}^{b}_{a}$ (our method) and the ground truth mask $G^{b}$. The ground truth images are binary maps with pixels of value 1 representing foods and pixels of value 0 representing background; all the others are probability maps with pixel values between 0 and 1.

The goal of our method is to segment the salient missing objects in $I^{b}$ using information from the pair of images $I^{a}$ and $I^{b}$. The contrast map generation step described in Section 3.2 provides an estimate of the probability of pixels belonging to objects/regions in $I^{b}$ but missing in $I^{a}$. The saliency fusion step described in Section 3.3 fuses the saliency information of pixels in $I^{b}$ into the contrast map $C^{b}_{a}$ to emphasize that we are looking for salient missing objects. To show that the individual steps of our proposed method achieve their objectives, we plot the PR and ROC curves of the contrast map $C^{b}_{a}$, the visual saliency map $S^{b}$ from R3NET (Deng et al., 2018) and the estimated salient missing objects probability map $\hat{M}^{b}_{a}$ in Fig. 5c and Fig. 5a. In addition, we also plot the PR and ROC curves of the three other salient object detection methods. From these plots, we can see that combining $S^{b}$ and $C^{b}_{a}$ as described in Section 3.3 improves the overall performance. This is also shown in Table 1, where both the AUC and $F_{\beta}^{\text{max}}$ of $\hat{M}^{b}_{a}$ are higher than those of $C^{b}_{a}$. This is because the contrast map $C^{b}_{a}$ by itself models all the missing objects/regions, while the probability map $\hat{M}^{b}_{a}$ also takes into account the visual saliency map $S^{b}$ and can therefore more accurately model the salient missing objects. We can also observe from the PR and ROC curves in Fig. 5 and the values in Table 1 that our method achieves better performance than the state-of-the-art salient object detection methods R3NET (Deng et al., 2018), NLDF (Zhiming et al., 2017), Amulet (Zhang et al., 2017) and UCF (Pingping et al., 2017). We also visually verify the performance of our method, as illustrated in Fig. 6. The salient object detection methods Amulet (Zhang et al., 2017), UCF (Pingping et al., 2017) and NLDF (Zhiming et al., 2017) failed to detect only the foods in these images, while R3NET (Deng et al., 2018) succeeded in detecting the foods but placed equal importance on other salient objects such as the color checkerboard. Our method assigned higher probability to the foods, which are the salient missing objects, than to the other salient objects in the scene. It must also be noted that our method did not have access to information about the food classes in $\mathcal{D}_{test}$: $\mathcal{D}_{val}$ and $\mathcal{D}_{test}$ have very few food classes in common, so tuning the parameters on $\mathcal{D}_{val}$ gives our method no access to the food classes in $\mathcal{D}_{test}$. Hence the performance of our method on $\mathcal{D}_{test}$ is indicative of its effectiveness in segmenting foods in a class-agnostic manner. These characteristics of $\mathcal{D}$ are also explained in Section 4.1. By modeling the foods as salient missing objects, we are thus able to build a better class-agnostic food segmentation method compared to existing methods.

6. Conclusion

In this paper, we propose a class-agnostic food segmentation method that segments the salient missing objects in a before eating image $I^{b}$ using information from a pair of before and after eating images, $I^{b}$ and $I^{a}$. We treat this problem as a paradigm of top down saliency detection in which the visual attention of the HVS is guided by a task. Our proposed method uses $I^{a}$ as background to obtain a contrast map that estimates the probability of pixels in $I^{b}$ belonging to objects/regions missing in $I^{a}$. The contrast map is then fused with the saliency information of $I^{b}$ to obtain a probability map $\hat{M}^{b}_{a}$ for the salient missing objects. Our experimental results validate that our approach achieves better performance, both quantitatively and visually, than state-of-the-art salient object detection methods such as R3NET (Deng et al., 2018), NLDF (Zhiming et al., 2017), Amulet (Zhang et al., 2017) and UCF (Pingping et al., 2017). As discussed in Section 1, we have only considered the case where there is no food in the after eating image. In the future, we will extend our model to consider more general scenarios.

References

  • Aizawa and Ogawa (2015) Kiyoharu Aizawa and Makoto Ogawa. 2015. FoodLog: Multimedia Tool for Healthcare Applications. IEEE MultiMedia 22, 2 (Apr 2015), 4–8. https://doi.org/10.1109/MMUL.2015.39
  • Anselmi et al. (2016) Fabio Anselmi, Joel Z. Leibo, Lorenzo Rosasco, Jim Mutch, Andrea Tacchetti, and Tomaso Poggio. 2016. Unsupervised learning of invariant representations. Theoretical Computer Science 633 (2016), 112 – 121. https://doi.org/10.1016/j.tcs.2015.06.048
  • Ballard et al. (1995) Dana H Ballard, Mary M Hayhoe, and Jeff B Pelz. 1995. Memory Representations in Natural Tasks. Journal of cognitive neuroscience. 7, 1 (1995), 66–80. https://doi.org/10.1162/jocn.1995.7.1.66
  • Borji et al. (2015) Ali Borji, Ming-Ming Cheng, Huaizu Jiang, and Jia Li. 2015. Salient Object Detection: A Benchmark. IEEE Transactions on Image Processing 24, 12 (Dec 2015), 5706–5722. https://doi.org/10.1109/TIP.2015.2487833
  • Borji and Itti (2013) Ali Borji and Laurent Itti. 2013. State-of-the-Art in Visual Attention Modeling. IEEE Transactions on Pattern Analysis and Machine Intelligence 35, 1 (Jan 2013), 185–207.
  • Borji et al. (2013) Ali Borji, Dicky N. Sihite, and Laurent Itti. 2013. Quantitative Analysis of Human-Model Agreement in Visual Saliency Modeling: A Comparative Study. IEEE Transactions on Image Processing 22, 1 (Jan 2013), 55–69. https://doi.org/10.1109/TIP.2012.2210727
  • Bossard et al. (2014) Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. 2014. Food-101 – Mining Discriminative Components with Random Forests. European Conference on Computer Vision (2014), 446–461. https://doi.org/10.1007/978-3-319-10599-4_29
  • Cao et al. (2015) Chunshui Cao, Xianming Liu, Yi Yang, Yinan Yu, Jiang Wang, Zilei Wang, Yongzhen Huang, Liang Wang, Chang Huang, Wei Xu, Deva Ramanan, and Thomas S. Huang. 2015. Look and Think Twice: Capturing Top-Down Visual Attention with Feedback Convolutional Neural Networks. IEEE International Conference on Computer Vision (Dec 2015), 2956–2964. https://doi.org/10.1109/ICCV.2015.338
  • Chen et al. (2015) Hsin-Chen Chen, Wenyan Jia, Xin Sun, Zhaoxin Li, Yuecheng Li, John D Fernstrom, Lora E Burke, Thomas Baranowski, and Mingui Sun. 2015. Saliency-aware food image segmentation for personal dietary assessment using a wearable computer. Measurement Science and Technology 26, 2 (2015), 025702.
  • Daugherty et al. (2012) Bethany L Daugherty, TusaRebecca E Schap, Reynolette Ettienne-Gittens, Fengqing M Zhu, Marc Bosch, Edward J Delp, David S Ebert, Deborah A Kerr, and Carol J Boushey. 2012. Novel Technologies for Assessing Dietary Intake: Evaluating the Usability of a Mobile Telephone Food Record Among Adults and Adolescents. Journal of Medical Internet Research 14, 2 (Apr 2012), e58. https://doi.org/10.2196/jmir.1967
  • Dehais et al. (2016) Joachim Dehais, Marios Anthimopoulos, and Stavroula Mougiakakou. 2016. Food image segmentation for dietary assessment. In Proceedings of the 2nd International Workshop on Multimedia Assisted Dietary Management. 23–28.
  • Deng et al. (2009) Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. 2009 IEEE Conference on Computer Vision and Pattern Recognition (June 2009), 248–255. https://doi.org/10.1109/CVPR.2009.5206848
  • Deng et al. (2018) Zijun Deng, Xiaowei Hu, Lei Zhu, Xuemiao Xu, Jing Qin, Guoqiang Han, and Pheng-Ann Heng. 2018. R3Net: Recurrent Residual Refinement Network for Saliency Detection. International Joint Conference on Artificial Intelligence (July 2018), 684–690. http://dl.acm.org/citation.cfm?id=3304415.3304513
  • Donahue et al. (2014) Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. 2014. DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition. Proceedings of the 31st International Conference on International Conference on Machine Learning - Volume 32 (2014), I–647–I–655.
  • Fang et al. (2015) Shaobo Fang, Chang Liu, Fengqing Zhu, Edward J. Delp, and Carol J. Boushey. 2015. Single-View Food Portion Estimation Based on Geometric Models. IEEE International Symposium on Multimedia (Dec 2015), 385–390. https://doi.org/10.1109/ISM.2015.67
  • Fang et al. (2018) S. Fang, Z. Shao, R. Mao, C. Fu, E. J. Delp, F. Zhu, D. A. Kerr, and C. J. Boushey. 2018. Single-View Food Portion Estimation: Learning Image-to-Energy Mappings Using Generative Adversarial Networks. 2018 25th IEEE International Conference on Image Processing (ICIP) (Oct 2018), 251–255. https://doi.org/10.1109/ICIP.2018.8451461
  • Fang et al. (2017) Shaobo Fang, Fengqing Zhu, Carol J Boushey, and Edward J Delp. 2017. The use of co-occurrence patterns in single image based food portion estimation. IEEE Global Conference on Signal and Information Processing (Nov 2017), 462–466. https://doi.org/10.1109/GlobalSIP.2017.8308685
  • Federico et al. (2012) Perazzi Federico, Philipp Krähenbühl, Yael Pritch, and Alexander Hornung. 2012. Saliency filters: Contrast based filtering for salient region detection. IEEE Conference on Computer Vision and Pattern Recognition (June 2012), 733–740. https://doi.org/10.1109/CVPR.2012.6247743
  • Ginny et al. (2012) Garcia Ginny, Thankam S. Sunil, and Pedro Hinojosa. 2012. The fast food and obesity link: consumption patterns and severity of obesity. Obes Surg 22, 5 (May 2012), 810–818. https://doi.org/10.1007/s11695-012-0601-8
  • Guanbin and Yu (2015) Li Guanbin and Yizhou Yu. 2015. Visual saliency based on multiscale deep features. IEEE Conference on Computer Vision and Pattern Recognition (June 2015), 5455–5463. https://doi.org/10.1109/CVPR.2015.7299184
  • Hammons et al. (2011) Hammons, Amber J., and Barbara H. Fiese. 2011. Is frequency of shared family meals related to the nutritional health of children and adolescents? Pediatrics 127, 6 (Jun 2011), e1565–1574.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. The IEEE Conference on Computer Vision and Pattern Recognition (June 2016).
  • Karen and Zisserman (2015) Simonyan Karen and Andrew Zisserman. 2015. Very Deep Convolutional Networks for Large-Scale Image Recognition. International Conference on Learning Representations (May 2015). San Diego, CA.
  • Kawano and Yanai (2015a) Yoshiyuki Kawano and Keiji Yanai. 2015a. Automatic Expansion of a Food Image Dataset Leveraging Existing Categories with Domain Adaptation. European Conference on Computer Vision 2014 Workshops (2015), 3–17. https://doi.org/10.1007/978-3-319-16199-0_1
  • Kawano and Yanai (2015b) Yoshiyuki Kawano and Keiji Yanai. 2015b. FoodCam: A real-time food recognition system on a smartphone. Multimedia Tools and Applications 74, 14 (01 Jul 2015), 5263–5287. https://doi.org/10.1007/978-3-319-04117-9_38
  • Kerr et al. (2016) Deborah A Kerr, Amelia J Harray, Christina M Pollard, Satvinder S Dhaliwal, Edward J Delp, Peter A Howat, Mark R Pickering, Ziad Ahmad, Xingqiong Meng, Iain S Pratt, Janine L Wright, Katherine R Kerr, and Carol J Boushey. 2016. The connecting health and technology study: a 6-month randomized controlled trial to improve nutrition behaviours using a mobile food record and text messaging support in young adults. The international journal of behavioral nutrition and physical activity. 13, 1 (2016). https://doi.org/10.1186/s12966-016-0376-8
  • Khan et al. (2017b) Salman Khan, Xuming He, Fatih Porikli, Mohammed Bennamoun, Ferdous Sohel, and Roberto Togneri. 2017b. Learning Deep Structured Network for Weakly Supervised Change Detection. International Joint Conference on Artificial Intelligence (2017), 2008–2015. http://dl.acm.org/citation.cfm?id=3172077.3172167
  • Khan et al. (2017a) Salman H. Khan, Xuming He, Fatih Porikli, and Mohammed Bennamoun. 2017a. Forest Change Detection in Incomplete Satellite Images With Deep Neural Networks. IEEE Transactions on Geoscience and Remote Sensing 55, 9 (Sept 2017), 5407–5423. https://doi.org/10.1109/TGRS.2017.2707528
  • Kong and Tan (2012) Fanyu Kong and Jindong Tan. 2012. DietCam: Automatic dietary assessment with mobile camera phones. Pervasive and Mobile Computing 8, 1 (2012), 147 – 163. https://doi.org/10.1016/j.pmcj.2011.07.003
  • McCrory et al. (2019) Megan McCrory, Mingui Sun, Edward Sazonov, Gary Frost, Alex Anderson, Wenyan Jia, Modou L Jobarteh, Kathryn Maitland, Matilda Steiner-Asiedu, Tonmoy Ghosh, et al. 2019. Methodology for Objective, Passive, Image-and Sensor-based Assessment of Dietary Intake, Meal-timing, and Food-related Activity in Ghana and Kenya (P13-028-19). Current developments in nutrition 3, Supplement_1 (2019), nzz036–P13.
  • Mesas et al. (2012) A. E. Mesas, M. Muñoz-Pareja, E. López-García, and F. Rodríguez-Artalejo. 2012. Selected eating behaviours and excess body weight: a systematic review. Obes Rev 13, 2 (Feb 2012), 106–135. https://doi.org/10.1111/j.1467-789X.2011.00936.x PMID: 21955734.
  • Nordström et al. (2013) Karin Nordström, Christian Coff, Håkan Jönsson, Lennart Nordenfelt, and Ulf Görman. 2013. Food and health: individual, cultural, or scientific matters? Genes Nutr 8, 4 (Jul 2013), 357–363. https://doi.org/10.1007/s12263-013-0336-8 PMID: 23494484.
  • Okamoto and Yanai (2016) Koichi Okamoto and Keiji Yanai. 2016. An Automatic Calorie Estimation System of Food Images on a Smartphone. International Workshop on Multimedia Assisted Dietary Management (2016), 63–70. https://doi.org/10.1145/2986035.2986040
  • Organization (2009) World Health Organization. 2009. Global Health Risks: Mortality and Burden of Disease Attributable to Selected Major Risks. World Health Organization. https://apps.who.int/iris/handle/10665/44203
  • Paszke et al. (2017) Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. NIPS 2017 Autodiff Workshop (2017).
  • Peters et al. (2007) Robert J. Peters and Laurent Itti. 2007. Beyond bottom-up: Incorporating task-dependent influences into a computational model of spatial attention. IEEE Conference on Computer Vision and Pattern Recognition (June 2007), 1–8. https://doi.org/10.1109/CVPR.2007.383337
  • Piernas and Popkin (2011) Carmen Piernas and Barry M. Popkin. 2011. Food portion patterns and trends among U.S. children and the relationship to total eating occasion size, 1977-2006. J Nutr 141, 6 (Jun 2011), 1159–1164. https://doi.org/10.3945/jn.111.138727
  • Pingping et al. (2017) Pingping Zhang, Dong Wang, Huchuan Lu, Hongyu Wang, and Baocai Yin. 2017. Learning Uncertain Convolutional Features for Accurate Saliency Detection. IEEE International Conference on Computer Vision (Oct 2017), 212–221. https://doi.org/10.1109/ICCV.2017.32
  • Pouladzadeh et al. (2014) Parisa Pouladzadeh, Shervin Shirmohammadi, and Rana Al-Maghrabi. 2014. Measuring Calorie and Nutrition From Food Image. IEEE Transactions on Instrumentation and Measurement 63, 8 (Aug 2014), 1947–1956. https://doi.org/10.1109/TIM.2014.2303533
  • PR1 et al. (2010) Deshmukh-Taskar PR, Nicklas TA, O’Neil CE, Keast DR, Radcliffe JD, and Cho S. 2010. The relationship of breakfast skipping and type of breakfast consumption with nutrient intake and weight status in children and adolescents: the National Health and Nutrition Examination Survey 1999-2006. J Am Diet Assoc 110, 6 (Jun 2010), 869–878. https://doi.org/10.1016/j.jada.2010.03.023
  • R. et al. (2004) David R. Martin, Charless C. Fowlkes, and Jitendra Malik. 2004. Learning to detect natural image boundaries using local brightness, color, and texture cues. IEEE Transactions on Pattern Analysis and Machine Intelligence 26, 5 (May 2004), 530–549. https://doi.org/10.1109/TPAMI.2004.1273918
  • Radhakrishna et al. (2012) Radhakrishna Achanta, Appu Shaji, Kevin Smith, Aurelien Lucchi, Pascal Fua, and Sabine Süsstrunk. 2012. SLIC Superpixels Compared to State-of-the-Art Superpixel Methods. IEEE Transactions on Pattern Analysis and Machine Intelligence 34, 11 (Nov 2012), 2274–2282. https://doi.org/10.1109/TPAMI.2012.120
  • Ramanishka et al. (2017) Vasili Ramanishka, Abir Das, Jianming Zhang, and Kate Saenko. 2017. Top-Down Visual Saliency Guided by Captions. IEEE Conference on Computer Vision and Pattern Recognition (July 2017), 3135–3144. https://doi.org/10.1109/CVPR.2017.334
  • Sakurada and Okatani (2015) Ken Sakurada and Takayuki Okatani. 2015. Change Detection from a Street Image Pair using CNN Features and Superpixel Segmentation. British Machine Vision Conference, Article 61 (September 2015), 12 pages. https://doi.org/10.5244/C.29.61
  • Shim et al. (2014) Jee-Seon Shim, Kyungwon Oh, and Hyeon Chang Kim. 2014. Dietary assessment methods in epidemiologic studies. Epidemiol Health 36 (22 Jul 2014), e2014009–e2014009. https://doi.org/10.4178/epih/e2014009 PMID: 25078382.
  • Shimoda and Yanai (2015) Wataru Shimoda and Keiji Yanai. 2015. CNN-Based Food Image Segmentation Without Pixel-Wise Annotation. New Trends in Image Analysis and Processing – ICIAP Workshops (2015). https://doi.org/10.1007/978-3-319-23222-5_55
  • Szegedy et al. (2016) Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. 2016. Rethinking the Inception Architecture for Computer Vision. IEEE Conference on Computer Vision and Pattern Recognition (June 2016), 2818–2826.
  • Szeliski (2004) Richard Szeliski. 2004. Image Alignment and Stitching: A Tutorial. Technical Report MSR-TR-2004-92. Microsoft Research. 89 pages. https://www.microsoft.com/en-us/research/publication/image-alignment-and-stitching-a-tutorial/
  • Vasiloglou et al. (2018) Maria F Vasiloglou, Stavroula Mougiakakou, Emilie Aubry, Anika Bokelmann, Rita Fricker, Filomena Gomes, Cathrin Guntermann, Alexa Meyer, Diana Studerus, and Zeno Stanga. 2018. A comparative study on carbohydrate estimation: GoCARB vs. Dietitians. Nutrients 10, 6 (2018), 741.
  • Vijay et al. (2014) Rengarajan Vijay, Abhijith Punnappurath, A. N. Rajagopalan, and Guna Seetharaman. 2014. Efficient Change Detection for Very Large Motion Blurred Images. IEEE Conference on Computer Vision and Pattern Recognition Workshops (June 2014), 315–322. https://doi.org/10.1109/CVPRW.2014.55
  • Wang et al. (2017) Yu Wang, Fengqing Zhu, Carol J. Boushey, and Edward J. Delp. 2017. Weakly supervised food image segmentation using class activation maps. IEEE International Conference on Image Processing (Sep 2017), 1277–1281. https://doi.org/10.1109/ICIP.2017.8296487
  • Yuji et al. (2012) Yuji Matsuda, Hajime Hoashi, and Keiji Yanai. 2012. Recognition of Multiple-Food Images by Detecting Candidate Regions. IEEE International Conference on Multimedia and Expo (July 2012), 25–30. https://doi.org/10.1109/ICME.2012.157
  • Zhang et al. (2017) Pingping Zhang, Dong Wang, Huchuan Lu, Hongyu Wang, and Xiang Ruan. 2017. Amulet: Aggregating Multi-level Convolutional Features for Salient Object Detection. IEEE International Conference on Computer Vision (Oct 2017), 202–211. https://doi.org/10.1109/ICCV.2017.31
  • Zhang et al. (2015) Weiyu Zhang, Qian Yu, Behjat Siddiquie, Ajay Divakaran, and Harpreet Sawhney. 2015. “Snap-n-Eat”: Food Recognition and Nutrition Estimation on a Smartphone. J Diabetes Sci Technol 9, 3 (May 2015), 525–533. https://doi.org/10.1177/1932296815582222
  • Zhiming et al. (2017) Zhiming Luo, Akshaya Mishra, Andrew Achkar, Justin Eichel, Shaozi Li, and Pierre-Marc Jodoin. 2017. Non-local Deep Features for Salient Object Detection. IEEE Conference on Computer Vision and Pattern Recognition (July 2017), 6593–6601. https://doi.org/10.1109/CVPR.2017.698
  • Zhu et al. (2015) Fengqing Zhu, Marc Bosch, Insoo Woo, SungYe Kim, Carol J. Boushey, David S. Ebert, and Edward J. Delp. 2015. Multiple Hypotheses Image Segmentation and Classification With Application to Dietary Assessment. IEEE Journal of Biomedical and Health Informatics 19, 1 (Jan 2015), 377–388. https://doi.org/10.1109/JBHI.2014.2304925
  • Zhu et al. (2010) Fengqing Zhu, Marc Bosch, Insoo Woo, SungYe Kim, Carol J. Boushey, David S. Ebert, and Edward J. Delp. 2010. The Use of Mobile Devices in Aiding Dietary Assessment and Evaluation. IEEE Journal of Selected Topics in Signal Processing 4, 4 (Aug 2010), 756–766. https://doi.org/10.1109/JSTSP.2010.2051471