
Upright Adjustment with Graph Convolutional Networks

Abstract

We present a novel method for the upright adjustment of 360° images. Our network consists of two modules: a convolutional neural network (CNN) and a graph convolutional network (GCN). The input 360° image is processed by the CNN for visual feature extraction, and the extracted feature map is converted into a graph that encodes a spherical representation of the input. We also introduce a novel loss function for discrete probability distributions defined on the surface of a sphere. Experimental results demonstrate that our method outperforms methods based on fully connected layers.

Index Terms—  Upright adjustment, Graph convolution

1 Introduction

A 360° image covers a 180° vertical field of view (FoV) and a 360° horizontal FoV. A distinguishing feature of a 360° image is that it preserves image information in every direction. To exploit this advantageous feature, 360° images are used on popular online platforms such as YouTube and Facebook, which have widely adopted support for 360° images and videos [1]. However, when a 360° image is captured by an amateur without any specialized instrument for stabilization (e.g., a tripod), the resulting image can display slanted objects and wavy horizons due to camera tilt and roll, as shown in the image on the left of Fig. 1. If a user views this image through a head-mounted display (HMD), he/she is likely to feel as if falling down or leaning backward. This not only diminishes the quality of the virtual reality (VR) experience but can also make the user feel sick. Upright adjustment aims to compensate for these tilts and rolls and recover an upright version of the inclined image [2].

Fig. 1: Upright adjustment consists of two steps. The first step is to estimate the North pole. Once the North pole is estimated, a rotation matrix R that maps the estimated North pole to (0,0,1) is applied to the input image. The left and right images represent the input and output, respectively.
Fig. 2: Illustration of the network forwarding process. The orange box shows how the feature map is converted into the graph; the colors of the feature map and of the graph nodes indicate the correspondence between feature-map locations and points (i.e., nodes) on the sphere. After being mapped to the sphere, each node is connected to its 6 nearest neighbors to form the graph. The red box illustrates the concept of the JSD loss.

Upright adjustment of 360° images consists of two steps. The first step is to estimate the position of the North pole (i.e., the direction opposite to gravity). The second step is to apply a rotation matrix that maps the estimated North pole to (0,0,1). Fig. 1 illustrates an example of upright adjustment, and a sketch of the second step is given below. Recently, a few studies have addressed upright adjustment of 360° images with deep learning [3, 4, 5]; they adopt convolutional neural networks (CNNs), whose input is a regular 2D grid image. As the natural shape of a 360° image is a sphere, each study suggests its own way to fit the 360° image onto a regular 2D grid through projection methods or by sampling FoV images.
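For concreteness, the rotation in the second step can be constructed with Rodrigues' formula. The following is a minimal NumPy sketch, not necessarily the authors' implementation; the function name rotation_to_zenith is ours.

    import numpy as np

    def rotation_to_zenith(n):
        # Return R such that R @ (n / |n|) = (0, 0, 1), via Rodrigues' formula.
        n = np.asarray(n, dtype=float)
        n = n / np.linalg.norm(n)
        z = np.array([0.0, 0.0, 1.0])
        v = np.cross(n, z)                  # rotation axis (unnormalized)
        c = float(n @ z)                    # cosine of the rotation angle
        if np.isclose(c, -1.0):             # n is the South pole: rotate 180 degrees about x
            return np.diag([1.0, -1.0, -1.0])
        K = np.array([[0.0, -v[2], v[1]],
                      [v[2], 0.0, -v[0]],
                      [-v[1], v[0], 0.0]])  # skew-symmetric cross-product matrix
        return np.eye(3) + K + (K @ K) / (1.0 + c)

    n_est = np.array([0.1, -0.2, 0.97])         # hypothetical estimated North pole
    R = rotation_to_zenith(n_est)
    print(R @ (n_est / np.linalg.norm(n_est)))  # approximately [0, 0, 1]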

In this paper, we investigate a way to process the 360° image in its natural shape (a sphere). For processing spherical data, we adopt graph convolutional networks (GCNs). We use the GCN in conjunction with a CNN module: the CNN module extracts a visual representation from the input image, this feature map is converted into a graph that represents the sphere, and finally the graph is processed by the GCN.

The main contributions of our method are three-fold.

  • We propose a network composed of a CNN and a GCN. The GCN module enables processing the input image in the form of a sphere.

  • We propose a new loss function. It handles the position of the North pole in a probabilistic way, reducing the distance between the predicted probability distribution and the ground truth probability distribution of the North pole position.

  • We show that our network achieves more competitive results than a typical network composed of a CNN and fully connected layers. Its advantages are rotation invariance, fast convergence, and overall performance.

2 Related Work

2.1 Upright Adjustment Methods

Feature-based algorithms: Feature-based algorithms rely on several assumptions to determine features. For example, line-based algorithms [6] follow the Atlanta world [7] or Manhattan world [8] assumptions and search for the vanishing point that most likely lies in the direction opposite to the sky. Another kind of feature-based algorithm is based on horizon search [9]. These algorithms assume that a clearly visible horizon exists in the image and try to find it.

Deep learning-based algorithms: Owing to their ability to extract semantic visual features, CNN-based algorithms do not require assumptions about the input. However, 360° images have to be fitted onto a 2D regular grid in order to be processed by a CNN, so existing deep learning approaches process flat images generated by projections rather than the spherical representation itself. Jeon et al. [4] addressed this issue by sampling narrow-FoV images from a 360° image. Jung et al. [3] chose the equirectangular projection, which is the most popular choice. Shan and Li [5] investigated more accurate projection methods and proposed a discrete spherical image representation.

2.2 Graph Convolutional Networks

GCNs are designed to represent graph-structured data such as social networks, 3D meshes, relational databases, and molecular geometry. Most GCNs are trained by propagating information through the edges that connect nodes. The connectivity is expressed using the adjacency matrix (a square matrix) that represents a finite graph. Bruna et al. [10] generalized CNNs to signals defined on graphs. Defferrard et al. [11] designed fast, localized convolution filters on graphs in the context of spectral graph theory and Chebyshev polynomials. Kipf et al. [12] proposed a streamlined GCN using a first-order approximation of spectral graph convolutions and successfully performed node classification.
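For reference, the layer-wise propagation rule of [12], on which the GCN module in Section 3.2 builds, is restated here for completeness:

H^{(l+1)} = \sigma\left( \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} H^{(l)} W^{(l)} \right), \qquad \tilde{A} = A + I_N,

where A is the adjacency matrix, \tilde{D} is the degree matrix of \tilde{A}, H^{(l)} holds the node features at layer l, W^{(l)} is a learnable weight matrix, and \sigma is a nonlinearity such as ReLU.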

3 Proposed Method

3.1 Method Overview

The proposed network is composed of a CNN module and a GCN module. The CNN extracts visual features from the input image in equirectangular projection, and the GCN predicts a discrete probability distribution of the North pole over a group of points defined on the surface of the sphere, sampled with the equal-area scheme of Leopardi [13]. The final predicted position of the North pole is obtained by computing the expectation of x, y, and z under this distribution, as sketched below.
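A minimal NumPy sketch of this readout; the renormalization of the expectation back onto the unit sphere is our assumption, as the paper only states that the expectation is computed.

    import numpy as np

    def expected_north_pole(points, probs):
        # points: (N, 3) unit vectors of the sampled sphere points [13]
        # probs:  (N,) softmax output of the GCN, summing to 1
        p = probs @ points            # expectations of x, y, and z
        return p / np.linalg.norm(p)  # project back onto the unit sphere (our choice)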

As input, the 360° image $x \in \mathbb{R}^{h \times w \times c}$ is fed into the CNN, which produces the feature map $f_{\theta}(x) \in \mathbb{R}^{h' \times w' \times c'}$. Then, the feature map is converted into a graph denoting the spherical representation: the map is flattened and its entries are projected onto the points of the sphere, starting from the North pole and moving toward the South pole. The orange box in Fig. 2 shows the correspondence between the feature map and the graph; a sketch of this conversion follows.
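The following sketch shows one way to implement the conversion, under two assumptions that are ours: the feature map is flattened in row-major order so that node i carries the features projected onto sphere point i, and the 6-NN connectivity of Section 3.2 is computed from dot products between the unit vectors (which order points by angular proximity). The helper name build_graph is hypothetical.

    import numpy as np

    def build_graph(feature_map, points, k=6):
        # feature_map: (h', w', c') CNN output; points: (N, 3) unit sphere
        # points with N = h' * w', ordered from the North toward the South pole.
        nodes = feature_map.reshape(-1, feature_map.shape[-1])  # (N, c') node features
        sim = points @ points.T                # dot product = angular proximity
        np.fill_diagonal(sim, -np.inf)         # exclude self-matches
        nn = np.argsort(sim, axis=1)[:, -k:]   # indices of the k nearest neighbors
        A = np.zeros((len(points), len(points)))
        rows = np.repeat(np.arange(len(points)), k)
        A[rows, nn.ravel()] = 1.0
        return nodes, np.maximum(A, A.T)       # symmetrized adjacency matrix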

3.2 Network Architecture

For the CNN module, we utilize pre-trained architectures such as ResNet-18 [14] and DenseNet-121 [15]. The GCN module [12] is composed of five layers. The channel size is halved in each subsequent layer, except the last layer, whose output channel size is 1 regardless of its input channel size. A rectified linear unit (ReLU) is inserted between consecutive GCN layers. The adjacency matrix is constructed by connecting each node to its 6 nearest neighbors and is extended to an n-hop matrix by multiplying it with itself n times; as n grows, more nodes become connected. Finally, the softmax function is applied to the GCN output, which produces the discrete probability distribution. A sketch of this module under stated assumptions is given below.
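The following PyTorch sketch reflects our reading of this architecture. Assumptions that are ours: a dense adjacency matrix, the symmetric normalization of [12], and a backbone channel size of 512 (ResNet-18) or 1024 (DenseNet-121).

    import torch
    import torch.nn as nn

    class GCNLayer(nn.Module):
        # dense first-order graph convolution in the style of [12]
        def __init__(self, c_in, c_out):
            super().__init__()
            self.lin = nn.Linear(c_in, c_out, bias=False)

        def forward(self, x, A_norm):  # x: (N, c_in), A_norm: (N, N)
            return A_norm @ self.lin(x)

    def normalize_adjacency(A, n_hops=1):
        # n-hop connectivity via matrix power, then symmetric normalization
        A = torch.linalg.matrix_power(A, n_hops).clamp(max=1.0)
        A = A + torch.eye(A.size(0))   # add self-loops
        d = A.sum(dim=1).rsqrt()       # D^{-1/2} as a vector
        return d[:, None] * A * d[None, :]

    class GCNHead(nn.Module):
        # five layers; channels halve at each layer, and the last layer outputs 1
        def __init__(self, c_in=512):
            super().__init__()
            dims = [c_in, c_in // 2, c_in // 4, c_in // 8, c_in // 16, 1]
            self.layers = nn.ModuleList(
                [GCNLayer(a, b) for a, b in zip(dims, dims[1:])])
            self.relu = nn.ReLU()

        def forward(self, x, A_norm):
            for layer in self.layers[:-1]:
                x = self.relu(layer(x, A_norm))
            return self.layers[-1](x, A_norm).squeeze(-1)  # (N,) logits over sphere points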

Fig. 3: von Mises-Fisher distributions for different values of κ, where μ is the unit vector pointing toward the viewer.

3.3 Objective Function

We represent the position of the North pole as a probability distribution. The output of our network is a discrete probability distribution over points defined on the surface of the sphere. Therefore, it is necessary to generate a ground truth probability distribution whose expectation is the ground truth North pole.

3.3.1 Distribution Labels

In directional statistics, the von Mises-Fisher distribution is a probability distribution on the (p-1)-dimensional sphere. The probability density function (PDF) of this distribution for a random p-dimensional unit vector is as follows:

f_p(\mathbf{x}; \mu, \kappa) = C_p(\kappa) \exp(\kappa \mu^{\mathbf{T}} \mathbf{x}),   (1)

where μ is the mean direction, i.e., the center of the distribution, with ‖μ‖ = 1, and κ ≥ 0 is the concentration parameter, which plays a role analogous to the inverse variance of a Gaussian distribution on the sphere. The larger the value of κ, the more concentrated the distribution is around μ, as shown in Fig. 3. Therefore, we set μ to the ground truth North pole and vary κ. In (1), the normalization constant C_p(κ) is defined as

C_p(\kappa) = \frac{\kappa^{p/2-1}}{(2\pi)^{p/2} \, I_{p/2-1}(\kappa)},   (2)

where p denotes the dimension of the embedding space (p = 3 in our case) and I_v is the modified Bessel function of the first kind at order v. Using the von Mises-Fisher PDF, we can generate labels for training the network, as sketched below.
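The labels can be generated directly on the sampled sphere points. Since the distribution is renormalized over the finite point set, the constant C_p(κ) cancels and can be omitted; the helper name vmf_labels is ours.

    import numpy as np

    def vmf_labels(points, mu, kappa):
        # points: (N, 3) unit vectors; mu: (3,) ground truth North pole
        logits = kappa * (points @ mu)  # unnormalized log-density of Eq. (1)
        logits -= logits.max()          # for numerical stability
        p = np.exp(logits)
        return p / p.sum()              # discrete distribution summing to 1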

3.3.2 JSD Loss

We use the Jensen-Shannon divergence (JSD) as a distance metric to calculate the distance between two probability distributions. Our method then minimizes the distance between the predicted distribution and the ground truth distribution label_dist using the following loss function:

\mathcal{L} = \mathrm{JSD}\big(\mathrm{softmax}(G_{\phi}(f_{\theta}(x), A)), \mathrm{label}_{dist}\big),   (3)

where A denotes the adjacency matrix, and θ and ϕ denote the parameters of the CNN and the GCN, respectively, as shown in Fig. 2. The red box in Fig. 2 illustrates the JSD loss concept; a minimal sketch of the JSD term follows.
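Below is a minimal PyTorch sketch of the JSD term in (3), using its standard definition JSD(P, Q) = 0.5 KL(P || M) + 0.5 KL(Q || M) with M = (P + Q)/2; the epsilon clamping for numerical stability is our addition.

    import torch

    def _kl(p, q, eps=1e-12):
        # KL(P || Q) for discrete distributions, clamped to avoid log(0)
        p = p.clamp_min(eps)
        q = q.clamp_min(eps)
        return (p * (p / q).log()).sum()

    def jsd_loss(pred, target):
        # pred, target: (N,) probability vectors over the sphere points
        m = 0.5 * (pred + target)
        return 0.5 * _kl(pred, m) + 0.5 * _kl(target, m)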

4 Experiments

We used the SUN360 dataset [16], which consists of 360° images taken in various places (e.g., indoor, outdoor, urban, and rural) and under different conditions (e.g., day and night). We sampled 25000, 5000, and 4260 images for the training, validation, and testing sets, respectively. All images were synthetically rotated following the rotation strategy in [3].


(a) Input image

(b) Horizon-based [9]

(c) Jung et al. [3]

(d) Ours

Fig. 4: Qualitative comparison. In the first row, the horizon-based method failed because a clearly visible horizon was not detected. In the second row, both the horizon-based method and Jung et al. failed. In the third row, the horizon-based method succeeded thanks to a clear horizon, whereas Jung et al. failed. In contrast, ours succeeded in all cases and handled various scenarios (e.g., nature/urban, indoor/outdoor, and presence or absence of a horizon).

4.1 Ablation Study

Table 1: We used DenseNet and ResNet as CNN backbones with different values of κ. DenseNet with κ = 25 shows the best result in terms of the average error. The column "within 10°" indicates the percentage of images whose error is below 10°.
Variants of our method   κ    Avg error (°)   within 10°
DenseNet121              15   6.0             90%
DenseNet121              20   4.3             97%
DenseNet121              25   4.0             97%
ResNet18                 15   6.4             93%
ResNet18                 20   6.4             93%
ResNet18                 25   6.6             93%

We made six variants of our network by changing its main components: the CNN structure and κ. For the CNN structure, we tested two popular networks, ResNet-18 [14] and DenseNet-121 [15]. Three values, 15, 20, and 25, were used for κ, resulting in the six combinations shown in Table 1. According to Table 1, DenseNet performs better than ResNet, and this tendency holds regardless of the value of κ.

Fig. 5: Advantages of the GCN module.

4.2 Advantages of the GCN module

In our method, the two primary advantages of using the GCN module are rotation invariance and fast convergence. To demonstrate these advantages, we compared our method, CNN (DenseNet-121 with κ = 25) + GCN, with CNN + conventional fully connected layers.

Rotation invariance: For this experiment, 500 images were selected from the test set, and each image was rotated into 20 random directions. We computed the standard deviation (STD) of the error angle within each group sharing the same source image. A lower standard deviation indicates better rotation invariance, because in this case the error angle remains the same regardless of the initial rotation. The mean STD is 2.1 for ours and 4.4 for the conventional fully connected layer-based method; the proposed GCN thus produces more consistent error angles regardless of the initial rotation. A compact sketch of this metric follows.
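One way to compute this metric, assuming the angular errors are stored in a hypothetical (500, 20) array with one row per source image and one column per random rotation:

    import numpy as np

    def rotation_invariance_std(errors):
        # per-image STD over the 20 rotations, averaged over the 500 images;
        # lower values mean the error is less sensitive to the initial rotation
        return errors.std(axis=1).mean()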

Fast convergence: Our method with the GCN converged much faster than the network with fully connected (FC) layers. Fig. 5 shows the average error of the GCN and FC variants over epochs for the training and validation sets, recorded during a training session. Our method with the GCN consistently attains lower errors on both sets.

4.3 Comparison to other methods

Table 2: Quantitative comparison. Our network attains the most competitive result in terms of the average angle error. The column "within 10°" indicates the percentage of images whose error is below 10°.
Method              Avg error (°)   within 10°
Ours (GCN)          4.0             97%
Horizon-based [9]   89.7            20%
Jung et al. [3]     5.9             96%

We compared our network (i.e., DenseNet with κ = 25 + GCN) with a feature-based algorithm [9] and a deep learning-based algorithm [3]. We used 4260 randomly rotated images for testing. Table 2 shows that our method outperforms the other methods in terms of accuracy. Notably, our network was trained for only 50 epochs, whereas Jung et al. trained their network for 800 epochs under exactly the same training environment as ours.

5 Conclusion

We presented a network based on a CNN and a GCN for upright adjustment. The feature map obtained by the CNN is converted into a graph with a spherical representation of the input. To our knowledge, this is the first upright adjustment approach that preserves the spherical shape of the input. With the newly adopted GCN, our network shows better rotation invariance and faster convergence than its fully connected layer counterpart.

6 Acknowledgement

This work was supported by the Seoul R&BD Program (CY190032) and by grant NRF-2018R1A4A1059731.

References

  • [1] YouTube, “YouTube VR support,” 2020.
  • [2] Hyunjoon Lee, Eli Shechtman, Jue Wang, and Seungyong Lee, “Automatic upright adjustment of photographs with robust camera calibration,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 5, pp. 833–844, 2013.
  • [3] Raehyuk Jung, Aiden Seuna Joon Lee, Amirsaman Ashtari, and Jean-Charles Bazin, “Deep360Up: A deep learning-based approach for automatic VR image upright adjustment,” in IEEE Conference on Virtual Reality and 3D User Interfaces, 2019.
  • [4] Junho Jeon, Jinwoong Jung, and Seungyong Lee, “Deep upright adjustment of 360 panoramas using multiple roll estimations,” in Asian Conference on Computer Vision, 2018.
  • [5] Yuhao Shan and Shigang Li, “Discrete spherical image representation for CNN-based inclination estimation,” IEEE Access, 2019.
  • [6] Kyungdon Joo, Tae-Hyun Oh, In So Kweon, and Jean-Charles Bazin, “Globally optimal inlier set maximization for atlanta frame estimation,” in IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  • [7] Grant Schindler and Frank Dellaert, “Atlanta world: An expectation maximization framework for simultaneous low-level edge grouping and camera calibration in complex man-made environments,” in IEEE Conference on Computer Vision and Pattern Recognition, 2004.
  • [8] James M Coughlan and Alan L Yuille, “The Manhattan world assumption: Regularities in scene statistics which enable Bayesian inference,” in Advances in Neural Information Processing Systems, 2001.
  • [9] Cédric Demonceaux, Pascal Vasseur, and Claude Pégard, “Robust attitude estimation with catadioptric vision,” in IEEE/RSJ International Conference on Intelligent Robots and Systems, 2006.
  • [10] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun, “Spectral networks and locally connected networks on graphs,” arXiv preprint arXiv:1312.6203, 2013.
  • [11] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst, “Convolutional neural networks on graphs with fast localized spectral filtering,” in Advances in Neural Information Processing Systems, 2016.
  • [12] Thomas N Kipf and Max Welling, “Semi-supervised classification with graph convolutional networks,” arXiv preprint arXiv:1609.02907, 2016.
  • [13] Paul Leopardi, “A partition of the unit sphere into regions of equal area and small diameter,” Electronic Transactions on Numerical Analysis, vol. 25, no. 12, pp. 309–327, 2006.
  • [14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition,” in IEEE Conference on Computer Vision and Pattern Recognition, 2016.
  • [15] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger, “Densely connected convolutional networks,” in IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • [16] Jianxiong Xiao, Krista A Ehinger, Aude Oliva, and Antonio Torralba, “Recognizing scene viewpoint using panoramic place representation,” in IEEE Conference on Computer Vision and Pattern Recognition, 2012.