\field\authorlist\authorentry

[[email protected]]Takao YamanakaMembersophia\MembershipNumber0909050 \authorentryTatsuya SuzukiNonmembersophia\MembershipNumber \authorentryTaiki NobutsuneNonmembersophia\MembershipNumber \authorentryChenjunlin WuNonmembersophia\MembershipNumber \affiliate[sophia]The authors are with the Department of Information and Communication Sciences, Sophia University, Tokyo, 102-0094, Japan. 322 610

Multi-Scale Estimation for Omni-Directional Saliency Maps Using Learnable Equator Bias

(2023; 2023)

keywords:

omni-directional image, saliency map, bias layer, multi-scale detection

{summary}

Omni-directional images have been used in wide range of applications including virtual/augmented realities, self-driving cars, robotics simulators, and surveillance systems. For these applications, it would be useful to estimate saliency maps representing probability distributions of gazing points with a head-mounted display, to detect important regions in the omni-directional images. This paper proposes a novel saliency-map estimation model for the omni-directional images by extracting overlapping 2-dimensional (2D) plane images from omni-directional images at various directions and angles of view. While 2D saliency maps tend to have high probability at the center of images (center bias), the high-probability region appears at horizontal directions in omni-directional saliency maps when a head-mounted display is used (equator bias). Therefore, the 2D saliency model with a center-bias layer was fine-tuned with an omni-directional dataset by replacing the center-bias layer to an equator-bias layer conditioned on the elevation angle for the extraction of the 2D plane image. The limited availability of omni-directional images in saliency datasets can be compensated by using the well-established 2D saliency model pretrained by a large number of training images with the ground truth of 2D saliency maps. In addition, this paper proposes a multi-scale estimation method by extracting 2D images in multiple angles of view to detect objects of various sizes with variable receptive fields. The saliency maps estimated from the multiple angles of view were integrated by using pixel-wise attention weights calculated in an integration layer for weighting the optimal scale to each object. The proposed method was evaluated using a publicly available dataset with evaluation metrics for omni-directional saliency maps. It was confirmed that the accuracy of the saliency maps was improved by the proposed method.

1 Introduction

Omni-directional images (ODI) are expected to be applied in widespread fields, such as virtual/augmented realities, self-driving cars, and robotics. It provides immersive impression in virtual environments using a head-mounted display (HMD). In addition, ODI can be also dealt with panorama viewers in personal computers or smart phones, which extend applicable situations of ODI. Estimation of salient regions in ODI will promote the applications of ODI [1]. For example, it can be used for indicating detailed information of salient objects in virtual/augmented environments, or for suppressing amount of ODI data for network transmission.

Refer to caption — Figure 1: Examples of saliency maps for 2D image and omni-directional image.

ODI is usually represented in the equirectangular projection (ERP), as shown in Fig. 1 (b). An approach to estimate ODI saliency maps is to directly apply a 2-dimensional (2D) saliency model to ODI in ERP [2]. However, ODI in ERP has large distortion in the top and bottom areas (the north and south poles), which results in unsatisfactory saliency estimation. To overcome this distortion, the representation of cube mapping is also used for the saliency estimation in SalNet360 [3], and then each cubic face is independently processed by a 2D saliency model, followed by a refinement stage to integrate saliency maps from all the faces. However, the cube mapping has too small field of view (90 degrees) to estimate accurate saliency in each cubic face [4]. In addition, an object is often divided into two or more cubic faces, so that it is difficult to detect the object by a 2D saliency model [5].

To solve these problems, this paper proposes a novel ODI saliency model by extracting overlapping 2D plane images from ODI at various directions with multiple angles of view. 2D saliency models have been well established, and can be trained using large amount of training data [6, 7, 8], while ODI saliency datasets have been so far limited since obtaining fixation data is much more cumbersome than 2D images. By using the pretrained 2D saliency models, the required training data for the fine-tuning to the ODI dataset can be reduced. However, there is an important difference between the conventional 2D saliency model and the model for 2D plane images extracted from ODI. The saliency maps for normal 2D images have high prior probability at the center of images called center bias, while the high-probability region is observed at the horizontal directions in ODI saliency maps called equator bias [9], as shown in Fig. 1. Therefore, the 2D saliency model was pretrained with a learnable center-bias layer (single channel pixel-wise weighting layer) with 2D saliency datasets, and then the network was fine-tuned with an ODI dataset by replacing the center-bias layer to an equator-bias layer composed of pixel-wise weights in multiple channels, each of which corresponds to the elevation angle for the extraction of the 2D plane image. Furthermore, the proposed method extracts 2D plane images from ODI in multiple angles of view to detect objects of various sizes. Since the field of view changes depending on the angles of view, the receptive field to detect saliency for an object also changes depending on the angles of view. That is, information from larger area (receptive field) is used for larger angle of view. This would contribute on more accurate saliency detection in ODI.

The contributions of this paper are as follows.

1.

A novel method to estimate the saliency maps for ODI was proposed by extracting overlapping 2D plane images from ODI at various directions with multiple angles of view, to accurately detect objects of various sizes with variable receptive fields.
2.

The ODI saliency model was realized based on a well-established 2D saliency model pretrained with large amount of training data, by learning prior distribution conditioned on elevation angles in ODI.
3.

It was confirmed from experiments that the accuracy of ODI saliency estimation was improved by the proposed method in a publicly available ODI saliency dataset.

2 Related Works

2.1 Saliency map estimation for 2D images

Many saliency models have been proposed to estimate the locations in 2D images which attract attentions of people. The saliency map computed from image features with these models represents the probability density function of fixations when people look at an image. In the earlier studies, the saliency map has been estimated by extracting low-level features. One of these saliency-map estimation models is that proposed by Itti et al. [10], which estimates saliency maps by extracting early visual features such as intensities, colors, and orientations. AWS (Adaptive Whitening Saliency) [11] is another conventional model based on the whitening of low level features.

The recent progress of deep learning has also contributed to the improvement of saliency-map estimation models. Although the conventional models described above have used the hand-crafted features to estimate the saliency maps, the image feature vectors extracted from convolutional neural networks (CNN) have been used instead [12, 13, 14, 15, 16, 17]. The ensembles of deep networks (eDN) [14] have trained small-size CNN (up to 3 layers) to extract multiple-layer features for estimating the saliency maps by combining the features with support vector machines. DeepGaze I [15] has shown that the image features extracted from the multiple layers of CNN trained for an image-classification task on ImageNet [18] are useful to estimate the saliency maps. This implies that the transfer learning from the image classification to the saliency map estimation is effective because people tend to look at the centers of objects [19], which are learned to recognize in the pretrained model for the ImageNet classification task.

In contrast to DeepGaze I which uses AlexNet as the backbone network, SaliconNet [12] and DeepFix [16] uses fully convolutional neural networks based on the VGG architecture which has shown better performance than AlexNet on the image classification task. SaliconNet [12] estimates saliency maps with two image scales to allow the model to be robust against the sizes of objects in images. DeepFix [16] uses dilated convolution filters [20] to enlarge the receptive field and inception modules to capture multi-scale information. The deep-type network of SalNet [13] has also adopted the VGG architecture with simplification composed of 10 layers. DeepGaze II [17] is also based on the VGG architecture, which uses a center-bias layer to incorporate the prior distribution and log-likelihood as the loss function.

In addition to the VGG-based CNN models, several CNN architectures have been used for the saliency-map estimation. For example, SalGAN [21] is a model that estimates saliency maps using generative adversarial networks. DenseSal and DPNSal [8] have shown better performance based on densely connected neural networks (DenseNet) [22] and dual path networks (DPN) [23], respectively. EML-NET [24] combines the feature vectors extracted from DenseNet and NasNet [25]. DeepGazeIIE [26] also combines saliency predictions from multiple saliency models. These methods have achieved the state-of-the-art results on the MIT Saliency Benchmark [6] and MIT/Tuebingen Saliency Benchmark [7].

2.2 Saliency map estimation for omni-directional images

In addition to the saliency-map estimation for 2D images, the saliency models for ODI have been developed since the ICME2017 Competition [27]. ODI is represented in various projections, among which ERP is commonly used to represent ODI in 2 dimensions as shown in Fig. 1, though it has distortions at poles (the top and bottom regions in ERP). The model by Abreu et al. [2] has used SaliconNet [12] to estimate the ODI saliency map by directly inputting ODI in ERP. The ERP saliency maps for horizontally different viewing directions are fused into an ERP saliency map to suppress the center-bias effect induced by the 2D saliency-map model (SaliconNet). However, the ERP images have distortions at poles, so that saliency at the poles cannot be correctly estimated. Moreover, the center-bias effect cannot be completely suppressed at the edges of the ERP image.

Since the database for ODI saliency maps are relatively small, it would be important to first pre-train the CNN model for 2D images with large databases, and then fine-tune it with the ODI database. In SalNet360 [3], this has been realized by subdividing ODI into 6 undistorted plane patches in cubic projection, estimating the 2D saliency maps with SaliconNet (2D saliency model) [12] and a refinement network, and then integrating the 2D saliency maps into an ODI saliency map in ERP. In order to incorporate the dependence of saliency on the locations in ODI, the ODI spherical coordinates of each pixel in each 2D patch were input to the refinement neural network. However, since SaliconNet outputs a saliency map biased in the center of an image, the saliency at the 2D image edges is estimated lower. SalGAN360 [28] based on SalGAN [21] has estimated the ODI saliency maps by fusing the saliency map in ERP and saliency maps for undistorted plane patches in the cubic projection to integrate the global and local information. This method employs multiple cubic projection, where overlapping cubic faces are extracted as in our proposed method. However, the angles of view are limited to 90^∘ in the multiple cubic projection, and biases to represent prior distributions are not considered. Qing et al. [4] also have proposed a method to train a model with multiple cubic projection and to estimate saliency maps using ERP with multiple sphere rotation to reduce the influence of distortion at poles. Although global information can be utilized from ERP in the inference, the angle of view in the training stage is limited to 90^∘ and the equator bias is not explicitly considered.

3 Method

The proposed model for estimating ODI saliency maps is illustrated in Fig. 2 and Fig. 3. In the method, undistorted plane patches (2D images) are extracted in various viewing directions from ODI to use an existing saliency model for 2D images. These patches are overlapping each other in contrast to the cubic projection as in [3]. In addition, 2D patches are extracted in multiple angles of view for a viewing direction as shown in Fig. 3. Since an object appears in different sizes in these patches, the 2D saliency model can detect the object in the optimal scale with appropriate receptive field (information from surrounding region). Then, the patches in various viewing directions with multiple angles of view are fed into the 2D saliency-map estimation model (MainNet), to predict the saliency map for each patch. The 2D saliency-map estimation model is pretrained with large databases of 2D saliency maps such as the SALICON dataset [13]. The outputs of MainNet are applied to the bias layer to model the prior distribution for ODI (equator bias) depending on the elevation angle in spherical coordinates. The patches in multiple angles of view are first integrated into a saliency map for a viewing direction by weighting important scales at each pixel using an integration layer consisting of an attention module, as in Fig. 3. Then, the integrated patches are fused into a saliency map in ERP, as in Fig. 2. In the following subsections, key components in the proposed method are explained in detail.

3.1 Bias Layer for Prior Distribution

As shown in Fig. 1, the center bias represents the property of tendency for fixations to concentrate on the center of a 2D image while the equator bias represents that to concentrate on the equator of ODI. The average of saliency maps over 2D images and ODI in databases are shown in Fig. 4 (a) and (b), respectively. They represent the prior distributions of fixations, independent on the local image features. Thus, the prior distributions are different between 2D images and ODI, so that these differences have to be compensated in the ODI saliency-map estimation.

The saliency is defined as a probability of observing a fixation ( $o$ : binary random variable) at a location $x$ for a local image feature $f$ : $p(o|f,x)$ . If $x$ and $f$ are assumed to be independent,

\begin{split}p(o|f,x)&=p(f|o,x)p(o|x)/p(f|x)\\ &=p(f|o)p(o|x)/p(f)\\ &=p(o|f)p(o|x)/p(o),\end{split}

(1)

where $p(o|f)$ represents the saliency map depending only on the local image feature $f$ without the prior probability $p(o|x)$ , and $p(o)$ is a normalization constant. Thus, the saliency map can be modeled by the multiplication of $p(o|f)$ and $p(o|x)$ with normalization. In the proposed ODI saliency model, a 2D saliency-map estimation model, DenseSal [8], was used with an additional layer (bias layer) for learning the prior probability $p(o|x)$ in 2D saliency maps, as shown in Fig. 5. This layer consists of pixel-wise scaling weights for a feature map from the 2D saliency model. By learning the weights with 2D saliency datasets, the saliency depending only on a location $x$ in an image can be estimated in the center-bias layer. When estimating ODI saliency maps, this center-bias layer is replaced by an equator-bias layer, which consists of pixel-wise scaling weights in multiple channels corresponding to elevation angles of viewing directions for 2D patches.

In the 2D saliency model, the saliency at image boundaries are underestimated by the center-bias effect even if there are some objects at the edges of an image [29, 30]. In the extracted 2D images from ODI, objects can appear at the edge of the extracted images, which may be focused by people with HMD. Therefore, the center bias can give the negative effect to the saliency estimation for ODI. In our approach, first the 2D saliency model was trained with center-bias layer to model the center-bias effect existing in the 2D saliency dataset similar to [29], so that the MainNet (excluding the center-bias layer) represents only the saliency for contents (such as objects) not depending on the position of the image. By replacing the center-bias layer to the equator-bias layer, and by learning the bias layer using the training data extracted from ODI, the bias layer expresses the equator-bias effect (saliency depending on the elevation angle in the spherical coordinates, and not depending on the horizontal angle), so that the negative effect of the center bias can be eliminated.

3.2 Extraction of 2D Images from ODI

Since ODI in ERP has distortion at poles as shown in Fig. 1, 2D images are extracted from ODI to estimate an ODI saliency map by removing the distortion. The correspondence between ERP and the spherical coordinate system is shown in Fig. 6. The unit vectors of a 2D image extracted at a viewing direction ( $\theta_{c},\phi_{c}$ ) in spherical coordinates are represented in the following equations in the 3D Euclidean coordinate system:

\begin{split}X_{n}&=(-\sin\theta_{c},-\cos\theta_{c},0)\\ Y_{n}&=(-\sin\phi_{c}\cos\theta_{c},\sin\phi_{c}\sin\theta_{c},\cos\phi_{c})\end{split}

(2)

The ERP coordinates of the 2D image in a tangent plane can be obtained by transforming the 3D Euclidean coordinates of points in the 2D image to spherical coordinates. Thus, 2D images can be extracted from ODI in ERP using the coordinates. When 2D saliency maps are integrated into an ODI saliency map in ERP, each point in ERP is assigned to the nearest point in 2D saliency maps of multiple viewing directions.

When the 2D images are extracted from ODI without overlapping (for example, cube mapping), the objects placed at the edge of the 2D patch are cut off as shown in Fig. 7 (a), so that it would be difficult to be recognized for accurate saliency estimation. However, if the 2D images are extracted from ODI with overlapping as in Fig. 7 (b), the objects can be detected, leading to more accurate estimation. In the integration of 2D saliency maps for multiple viewing directions, the saliency values in ERP are calculated by averaging the overlapping saliency values. Note that the 2D saliency maps with multiple angles of view are integrated using an integration layer by weighting the optimal scales, as explained in the following subsection and shown in Fig. 3. Although the multiple viewing directions can also be integrated using importance weights depending on the position of an extracted image (saliency at the center of an extracted image may be more accurately predicted than at the edge of an extracted image), it would be less effective than weighting optimal scales, so that a simple averaging integration was selected in the proposed method for the simplicity.

3.3 Multi-scale Saliency Model with Multiple Angles of View

In the proposed method, 2D images are extracted in multiple angles of view, as shown in Fig. 3. The relationship between the angle of view ( $\theta_{a},\phi_{a}$ ) and the extracted 2D image size ( $W,H$ ) can be represented by the following equation, where $L$ is the distance from the camera to the image plane.

\begin{split}\tan(\theta_{a}/2)&=W/2L\\ \tan(\phi_{a}/2)&=H/2L\end{split}

(3)

The 2D image size ( $W,H$ ) are set to constant values for multiple angles of view.

The structures of the integration layer for the 2D saliency maps estimated from the extracted patches with multiple angles of view at a viewing direction are shown in Fig. 8. Since the receptive fields are different among the angles of view, the 2D saliency maps with larger angles of view are cropped and resized to cover the same range with the smallest angle of view, using the relationship in Eq. 3. Then, they are integrated into a 2D saliency map using pixel-wise channel weighting, whose weights are calculated by an attention module. In the experiments, two integration structures were examined, as in Fig. 8 (a) and (b). The attention weights are calculated from the concatenated saliency maps in (a), whereas the weights are obtained from image features in MainNet for the smallest angle of view in (b) since more information for importance of scales for each objects can be included in the image features. Since two structures of attention modules were also examined, the 4 types of architectures shown in Table 1 were tested in the experiments.

Table 1: Architectures of integration layer for multi-scale saliency estimation.

	Arch. 1	Arch. 2	Arch. 3	Arch. 4
Structure	Integration	Integration	Integration	Integration
(Fig. 8)	Layer 1	Layer 1	Layer 2	Layer 2
Attention	3x3 Conv ( $C$ ch)	1x1 Conv ( $C$ ch)	3x3 Conv ( $C$ ch)	1x1 Conv ( $C$ ch)
module	ReLU	ReLU	ReLU	ReLU
	2x2 Conv ( $N$ ch)	3x3 Conv ( $C$ ch)	2x2 Conv ( $N$ ch)	3x3 Conv ( $C$ ch)
	Softmax	ReLU	Softmax	ReLU
		1x1 Conv ( $4C$ ch)		1x1 Conv ( $4C$ ch)
		ReLU		ReLU
		1x1 Conv ( $N$ ch)		1x1 Conv ( $N$ ch)
		Softmax		Softmax

3.4 Normalization of Saliency Map

The output of the 2D saliency model is usually normalized by its norm, such as the L1 norm. If the 2D saliency model was used with the normalization to estimate the saliency for 2D patches extracted from ODI, the information of the dependence on the viewing direction for the extraction would be lost by the normalization at the integration to an ODI saliency map. Therefore, the 2D saliency model is used without the normalization at the output of the 2D saliency model. After the integration to the ODI saliency map, it is normalized by the L1 norm.

4 Experimental Setup

In the experiments, ’Head+Eye Saliency’ in the Salient360 Dataset (ver. 2018) [31] was used to evaluate ODI saliency maps. This dataset includes 85 ERP images with fixations obtained using HMD, whose field of view is 100^∘ [32]. Following [3], the 25 images specified in the paper were used as test data, and the remaining 60 images were used as training and validation data. In the experiments for the comparison with conventional methods, only the 40 images in the training data (training data in salient360 ver. 2017) were used for fine-tuning of the network, which was same as [3]. The evaluation metrics used in the experiments were also same as the reference [3, 27]: the normalized scanpath saliency (NSS), the area under the curve (AUC), the correlation coefficient (CC), and the Kullback-Leibler divergence (KLD). Although these metrics have been also used in the evaluation of 2D saliency models, the metrics used in this paper were tailored to evaluate ODI saliency models by uniformly sampling on the sphere [27], since the area at poles on the sphere is much smaller than in ERP. For these metrics, a higher value represents better performance except for KLD.

For the 2D saliency model in Fig. 5, DenseSal [8] was adopted, which have achieved high performance in a saliency benchmark [6]. It was pre-trained with the ImageNet classification task, and was then fine-tuned by the 2D saliency-map estimation task with Salicon Dataset [33] and OSIE Dataset [34]. In our experiments, this model was further fine-tuned with the extracted 2D images for multiple viewing directions with multiple angles of view from ODI training data in Salient360 Dataset. From the results of the preliminary experiments [5], the interval of viewing directions was set to 45^∘ in all the experiments in this paper. In this interval, 2D images were extracted at 26 viewing directions (8 horizontal directions $\times$ 3 vertical directions + 2 poles). At each of the viewing directions, multiple angles of view were used to extract the 2D images. The size of ODI in ERP was $800\times 1600$ pixels, resized from the original sizes in the dataset, whereas the extracted 2D images were $500\times 500$ pixels. The equator-bias layer in the saliency-map model was a scalar layer of $20\times 20$ pixels with 5 channels corresponding to 5 elevation angles ( $0^{\circ},\pm 45^{\circ},\pm 90^{\circ}$ ). The model was fine-tuned up to 5 epochs for the ODI dataset with the batch size of 1 due to the GPU memory limitation. Learning rates for MainNet and the bias layer were set to $10^{-5}$ and $10^{-4}$ , respectively. KLD with L1 normalization and RMSProp were used for the loss function and the optimization method for training, respectively. The image features at the output of MainNet in Fig. 8 was $M=4416$ ch, and the intermediate channels in the attention modules was set to $C=512$ ch in Table 1.

5 Results

5.1 ODI Saliency Estimation with Learnable Equator Bias

First, the effect of fine-tuning on the ODI dataset and the learnable equator-bias layer was examined in the condition of 2D-image extraction with the single angle of view 100^∘. The results are shown in Table 2. The experimental conditions were separated into two groups, MainNet without the fine-tuning to the ODI dataset (w/o FT in MainNet) and MainNet with the fine-tuning (w/ FT in MainNet). For each condition in MainNet, several biases were compared. The ”Learned (single bias)” means that the 2D saliency model with a single-channel bias layer (same as the center-bias layer) was fine-tuned on the ODI dataset. The ”Learned (multi-bias)” is the proposed method but using only a single angle of view for 2D-image extraction. The ”Constant (Avg)” bias means that the average equator bias over images in the ODI dataset was used for weighting the output of 2D saliency map. From the results, the fine-tuning of MainNet was highly effective to improve the performance. The learnable bias layers were also effective in both the conditions in ’MainNet w/o FT’ and ’MainNet w/ FT’, especially in the multiple-channel equator bias which achieved the best performance in the condition of MainNet with the fine-tuning in all the metrics. This means that leaning bias layer depending on the elevation angle for the 2D image extraction was useful to estimate the ODI saliency maps. To see this more clearly, NSS was calculated against the elevation angle of ODI in ERP, as shown in Fig. 9. It can be seen from the figure that the fine-tuning with the multiple-channel bias was better than other conditions, especially around at $-45^{\circ}$ .

Table 2: Effect of fine-tuning on ODI dataset and equator-bias layer. FT: Fine-Tuning on ODI dataset.

MainNet	Bias	NSS $\uparrow$	AUC $\uparrow$	CC $\uparrow$	KLD $\downarrow$
w/o FT	w/o Bias	0.8711	0.7098	0.5634	0.7152
w/o FT	Learned (single bias)	0.8906	0.7119	0.5766	0.7640
w/o FT	Learned (multi-bias)	0.8986	0.7142	0.5819	0.7586
w/ FT	Constant (Avg)	1.0097	0.7313	0.6601	0.7116
w/ FT	w/o Bias	1.1083	0.7531	0.7295	0.3038
w/ FT	Learned (single bias)	1.1108	0.7527	0.7306	0.3021
w/ FT	Learned (multi-bias)	$\mathbf{1.1601}$	$\mathbf{0.7541}$	$\mathbf{0.7345}$	$\mathbf{0.3005}$

5.2 Multi-scale Saliency Estimation

Next, the effect of extracting 2D images in multiple angles of view (MAV) was studied for the integration layers in Table 1. The results are shown in Table 3. The 4 architectures of the integration layer with MAV ( $100^{\circ},110^{\circ},120^{\circ}$ ) were compared with the results with single angle of view (SAV) $100^{\circ}$ . It can be seen from the table that the performance on all the architectures with MAV was better than that with SAV. Among the 4 architectures, ”Arch. 4” achieved the best performance. This means that the image features from MainNet (DenseSal) with deeper attention modules were effective to calculate the attention weights on multiple scales for each pixel.

Table 3: Effect of multi-scale method by extracting 2D images in multi-angles of view. SAV: Single Angle of View, MAV: Multi-Angles of View.

Methods		NSS $\uparrow$	AUC $\uparrow$	CC $\uparrow$	KLD $\downarrow$
SAV		1.1601	0.7541	0.7345	0.3005
MAV	Arch. 1	1.1638	0.7604	0.7789	0.2523
	Arch. 2	1.1728	0.7617	0.7879	0.2467
	Arch. 3	1.1732	0.7618	0.7881	0.2427
	Arch. 4	$\mathbf{1.1937}$	$\mathbf{0.7642}$	$\mathbf{0.8010}$	$\mathbf{0.2344}$

To examine the effective combination of angles of view, several combinations were tested, as shown in Table 4. It can be seen from the results that smaller angles of view than the field of view in HMD ( $100^{\circ}$ ) were not effective, as in ( $80^{\circ},90^{\circ},100^{\circ}$ ). Since the performance with more than 3 angles of view decreased, the combination of ( $100^{\circ},110^{\circ},120^{\circ}$ ) was the best on performance among the conditions tested. The reason for this might be that ( $100^{\circ},110^{\circ},120^{\circ}$ ) was enough to capture the saliency for people, or that the larger angles of view detected the saliency which was not focused by people.

Table 4: Comparison among angles of view for extraction.

Angles of View	NSS $\uparrow$	AUC $\uparrow$	CC $\uparrow$	KLD $\downarrow$
(100, 110)	1.1736	0.7617	0.7874	0.2478
(80, 90, 100)	1.1438	0.7587	0.7678	0.2673
(90, 100, 110)	1.1897	0.7612	0.7961	0.2386
(100, 110, 120)	$\mathbf{1.1937}$	$\mathbf{0.7642}$	$\mathbf{0.8010}$	$\mathbf{0.2344}$
(100, 110, 120, 130)	1.1540	0.7593	0.7758	0.2570
(100, 110, 120, 130, 140)	1.1828	0.7631	0.7944	0.2394

5.3 Comparison with Conventional Method

To compare the performance with conventional methods, the proposed model was evaluated in the same experimental setup as the conventional methods following the description in [3] and [27]. In this experiment, two additional state-of-the-art (SOTA) 2D saliency models (DPNSal [8] and DeepGaze IIE [26]) in addition to DenseSal were tested as MainNet in the proposed method. The results are shown in Table 5. The values for the conventional methods were taken from the reference [3]. As can be seen from the table, the proposed method with DenseSal achieved the best performance in all the metrics. Although DeepGaze IIE [26] has shown the best performance in the 2D saliency benchmark [7], the performance on the ODI saliency estimation was worse than DenseSal and DPNSal. Since DeepGaze IIE has frozen the base networks (ShapeNetC, EfficientNetB5, DenseNet201, and ResNext50) and has trained only a readout network, it would be difficult to be adapted to new datasets such as the ODI saliency dataset. The effectiveness of integrating multi-view angles (MAV) was confirmed for all the 2D saliency models tested. The proposed method with DenseSal outperformed the conventional methods by a large margin.

Table 5: Comparison with conventional methods.

Proposed (MainNet: DenseSal [8])
Methods	NSS $\uparrow$	AUC $\uparrow$	CC $\uparrow$	KLD $\downarrow$
TU Munich [35]	0.805	0.726	0.579	0.449
SJTU [36]	0.918	0.735	0.532	0.481
CDSR [37]	0.936	0.736	0.538	0.508
GBVS360-Eq [32]	0.851	0.714	0.527	0.698
SalNet360 [3]	0.757	0.702	0.536	0.487
SAV	1.1405	0.7537	0.7622	0.2781
MAV	$\mathbf{1.1751}$	$\mathbf{0.7623}$	$\mathbf{0.7891}$	$\mathbf{0.2464}$
Proposed (MainNet: DPNSal [8])
SAV	1.0858	0.7509	0.7178	0.3112
MAV	1.1105	0.7539	0.7296	0.3061
Proposed (MainNet: DeepGazeIIE [26])
SAV	0.9446	0.7275	0.6169	0.4053
MAV	0.9836	0.7359	0.6465	0.3770

5.4 Examples of Saliency Maps

The examples of saliency maps estimated by the proposed methods (SAV and MAV with Arch. 4 in Table 3) are shown in Fig. 10. As can be seen from the sample images, most of the salient regions were detected both in SAV and MAV. To see more details, 2D images were extracted from the estimated ODI saliency maps, as shown in Fig. 11. In the left column of the figure, a small painting was fixated in the ground truth (GT), whereas SAV estimated lower saliency than GT. By using the multi-scale method with MAV, high saliency was estimated to the small painting. Similarly, a left hand of a human in the middle column example and a small head of a bird in the right column example were provided higher saliency by MAV than SAV as in GT.

6 Conclusions

In this paper, a novel ODI saliency estimation model was proposed, based on the well-established 2D saliency model pretrained on large 2D saliency databases. By extracting overlapping 2D patches from ODI in various viewing directions with multiple angles of view, the saliency of objects was accurately estimated in the optimal scales. To compensate the difference between the prior distributions for 2D saliency maps (center bias) and ODI saliency maps (equator bias), the 2D saliency model was pretrained with a center-bias layer to explicitly represent the prior distribution and was fine-tuned to an ODI saliency dataset by replacing the center-bias layer to an equator-bias layer composed of multiple channels corresponding to elevation angles for 2D-patch extraction. From the experiments, it was confirmed that the accuracy of the ODI saliency map estimation was improved by the proposed method by a large margin.

One of the limitations of the ODI saliency prediction is that the database for this task has been relying on a single small dataset only including 60 images for training and 25 images for testing, since the publicly available database for this task is currently limited to this database. Thus, the presented results are limited to the evaluation for only the 25 omni-directional images. Therefore, the more evaluation would be desired in the future research by creating larger databases for this task. Furthermore, since the proposed method is relying on the extraction of multiple 2D images from an omni-directional image, the computational cost for the saliency prediction was higher than the methods which predict saliency directly from omni-directional images. Therefore, the suppression of the computational cost would be another future research direction.

References

[1] C. Li, M. Xu, S. Zhang, and P.L. Callet, ”State-of-the-art in 360^∘ video/image processing: perception, assessment and compression,” IEEE Journal of Selected Topics in Signal Processing, vol. 14, issue 1, 2020, pp. 5-26.
[2] A.D. Abreu, C. Ozcinar, and A. Smolic, ”Look around you: saliency maps for omnidirectional images in VR applications,” International Conference on Quality of Multimedia Experience, 2017.
[3] R. Monroy, S. Lutz, T. Chalasani, and A. Smolic, ”SalNet360: saliency maps for omni-directional images with CNN,” Signal Processing: Image Communication, vol. 69, 2018, pp. 26-34.
[4] C. Qing, H. Zhu, X. Xing, D. Chen, and J. Jin, ”Attentive and context-aware deep network for saliency prediction on omni-directional images,” Digital Signal Processing, vol. 120, 2022.
[5] T. Suzuki and T. Yamanaka, ”Saliency map estimation for omni-directional image considering prior distributions,” International Conference on Systems, Man, and Cybernetics, 2018.
[6] Z. Bylinskii, T. Judd, A. Borji, L. Itti, F. Durand, A. Oliva, and A. Torralba, ”Mit saliency benchmark,” available at: http://saliency.mit.edu.
[7] M. Kümmerer, Z. Bylinskii, T. Judd, A. Borji, L. Itti, F. Durand, A. Oliva, and A. Torralba, ”MIT/Tuebingen saliency benchmark,” available at: https://saliency.tuebingen.ai/.
[8] T. Oyama and T. Yamanaka, ”Influence of image classification accuracy on saliency map estimation,” CAAI Transactions on Intelligence Technology, vol. 3, issue 3, 2018, pp. 140-152.
[9] V. Sitzmann, A. Serrano, A. Pavel, M. Agrawala, D. Gutierrez, B. Masia, and G. Wetzstein, ”Saliency in VR: how do people explore virtual environments?” IEEE Transactions on Visualization and Computer Graphics, vol. 24, issue 4, 2018, pp. 1633-1642.
[10] L. Itti, C. Koch, and E. Niebur, ”A model of saliency-based visual attention for rapid scene analysis,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, issue 11, 1998, pp. 1254-1259.
[11] A. Garcia-Diaz, X.R. Fdez-Vidal, X.M. Pardo, and R. Dosil, ”Saliency from hierarchical adaptation through decorrelation and variance normalization,” Image and Vision Computing, vol. 30, issue 1, 2012, pp. 51-64.
[12] X. Huang, C. Shen, X. Boix, and Q. Zhao, ”SALICON: reducing the semantic gap in saliency prediction by adapting deep neural networks,” International Conference on Computer Vision, 2015.
[13] J. Pan, K. McGuinness, E. Sayrol, N. O’Connor, and X. Giro-i-Nieto, ”Shallow and deep convolutional networks for saliency prediction,” IEEE/CVF Computer Vision and Pattern Recognition Conference, 2016.
[14] E. Vig, M. Dorr, and D. Cox, ”Large-scale optimization of hierarchical features for saliency prediction in natural images,” IEEE/CVF Computer Vision and Pattern Recognition Conference, 2014.
[15] M. Kümmerer, L. Theis, and M. Bethge, ”Deep gaze I: boosting saliency prediction with feature maps trained on ImageNet,” arXiv, 2014.
[16] S.S.S. Kruthiventi, K. Ayush, and R.V. Babu, ”DeepFix: a fully convolutional neural network for predicting human eye fixations,” IEEE Transactions on Image Processing, vol. 26, issue 9, 2017, pp. 4446-4456.
[17] M. Kümmerer, T.S.A. Wallis, and M. Bethge, ”DeepGaze II: reading fixations from deep features trained on object recognition,” arXiv, 2016.
[18] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A.C. Berg, and L. Fei-Fei, ”ImageNet large scale visual recognition challenge,” International Journal of Computer Vision, vol. 115, 2015, pp. 211-252.
[19] A. Borji and J. Tanner, ”Reconciling saliency and object center-bias hypotheses in explaining free-viewing fixations,” IEEE Transactions on Neural Networks and Learning Systems, vol. 27, issue 6, 2016, pp. 1214-1226.
[20] F. Yu and V. Koltun, ”Multi-scale context aggregation by dilated convolutions,” International Conference on Learning Representations, 2016.
[21] J. Pan, C.C. Ferrer, K. McGuinness, N.E. O’Connor, J. Torres, E. Sayrol, and X. Giro-i-Nieto, ”SalGAN: visual saliency prediction with generative adversarial networks,” Workshop on IEEE/CVF Computer Vision and Pattern Recognition Conference, 2017.
[22] G. Huang, Z. Liu, L. van der Maaten, and K.Q. Weinberger, ”Densely connected convolutional networks,” IEEE/CVF Computer Vision and Pattern Recognition Conference, 2017.
[23] Y. Chen, J. Li, H. Xiao, X. Jin, S. Yan, and J. Feng, ”Dual path networks,” Conference on Neural Information Processing Systems, 2017.
[24] S. Jia and N.D.B. Bruce, ”EML-NET: an expandable multi-layer network for saliency prediction,” Image and Vision Computing, vol. 95, 2020.
[25] B. Zoph, V. Vasudevan, J. Shlens, and Q.V. Le, ”Learning transferable architectures for scalable image recognition,” IEEE/CVF Computer Vision and Pattern Recognition Conference, 2018.
[26] A. Linardos, M. Kümmerer, O. Press, and M. Bethge, ”DeepGaze IIE: Calibrated prediction in and out-of-domain for state-of-the-art saliency modeling,” IEEE/CVF International Conference on Computer Vision, 2021.
[27] J. Gutiérrez, E.J. David, Y. Rai, and P. Le Callet, ”Toolbox and dataset for the development of saliency and scan-path models for omnidirectional/360^∘ still images,” Signal Processing: Image Communication, vol. 69, 2018, pp. 35-42.
[28] F.Y. Chao, L. Zhang, W. Hamidouche, and O. Déforges, ”Salgan360: visual saliency prediction on 360 degree images with generative adversarial networks,” Workshop on International Conference on Multimedia & Expo, 2018.
[29] M. Cornia, L. Baraldi, G. Serra, and R. Cucchiara, ”A deep multi-level network for saliency prediction,” International Conference on Pattern Recognition, 2016.
[30] H. Li, H. Lu, Z. Lin, X. Shen, and B. Price, ”Inner and inter label propagation: Salient object detection in the wild,” IEEE Transactions on Image Processing, vol. 24, no. 10, 2015, pp. 3176-3186.
[31] J. Gutiérrez, E.J. David, A. Coutrot, M.P. Da Silva, and P. Le Callet, ”Introducing UN Salient360! Benchmark: a platform for evaluating visual attention models for 360^∘ contents,” International Conference on Quality of Multimedia Experience, 2018.
[32] P. Lebreton and A. Raake, ”GBVS360, BMS360, ProSal: extending existing saliency prediction models from 2D to omnidirectional images,” Signal Processing: Image Communication, vol. 69, 2018, pp. 69-78.
[33] M. Jiang, S. Huang, J. Duan and Q. Zhao, ”SALICON: saliency in context,” IEEE/CVF Computer Vision and Pattern Recognition Conference, 2015.
[34] J. Xu, M. Jiang, S. Wang, M.S. Kankanhalli, and Q. Zhao, ”Predicting human gaze beyond pixels,” Journal of Vision, vol. 14, 2014, pp. 1-20.
[35] M. Startsev and M. Dorr, ”360-aware saliency estimation with conventional image saliency predictors,” Signal Processing: Image Communication, vol. 69, 2018, pp. 43-52.
[36] Y. Zhu, G. Zhai, and X. Min, ”The prediction of head and eye movement for 360 degree images,” Signal Processing: Image Communication, vol. 69, 2018, pp. 15-25.
[37] Jing Ling, Kao Zhang, Yingxue Zhang, Daiqin Yang, Zhenzhong Chen, ”A saliency prediction model on 360 degree images using color dictionary based sparse representation,” Signal Processing: Image Communication, vol. 69, 2018, pp. 60-68.

\profile

Takao Yamanaka received the B.S., M.S., and Ph.D. degrees in Electrical Engineering from Tokyo Institute of Technology in 1996, 1998, and 2004, respectively. During 1998-2000, he worked in Canon Inc. After working in Texas A&M University as a post-doctoral fellow, he joined Sophia University in 2006, where he is currently an associate professor in Department of Information and Communication Sciences.

\profile

Tatsuya Suzuki received the B.S. and M.S. degrees in Computer Science from Sophia University in 2018 and 2020, respectively. During his master’s program, he worked on saliency estimation for omni-directional images.

\profile

Taiki Nobutsune received the B.S degree in Computer Science from Sophia University in 2023. He worked on multi-scale saliency estimation for omni-directional images in his undergraduate project.

\profile

Chenjunlin Wu received the M.S. degree in Computer Science from Sophia University in 2022. During his master’s program, he worked on saliency estimation for both 2D images and omni-directional images.