
Deep network for rolling shutter rectification

Praveen K, Lokesh Kumar T, and A.N. Rajagopalan
Abstract

CMOS sensors employ a row-wise acquisition mechanism while imaging a scene, which can result in undesired motion artifacts known as rolling shutter (RS) distortions in the captured image. Existing single-image RS rectification methods attempt to account for these distortions either by using algorithms tailored for specific classes of scenes, which require knowledge of intrinsic camera parameters, or by a learning-based framework trained with known ground truth motion parameters. In this paper, we propose an end-to-end deep neural network for the challenging task of single-image RS rectification. Our network consists of a motion block, a trajectory module, a row block, an RS rectification module, and an RS regeneration module (used only during training). The motion block predicts the camera pose for every row of the input RS distorted image, while the trajectory module fits the estimated motion parameters to a third-order polynomial. The row block predicts the camera motion to be associated with every pixel in the target, i.e., the RS rectified image. Finally, the RS rectification module uses the motion trajectory and the output of the row block to warp the input RS image into a distortion-free image. For faster convergence during training, we additionally use an RS regeneration module which compares the input RS image with the ground truth image distorted by the estimated motion parameters. The end-to-end formulation of our model does not constrain the estimated motion to ground-truth motion parameters, thereby successfully rectifying RS images with complex real-life camera motion. Experiments on synthetic and real datasets reveal that our network outperforms prior art both qualitatively and quantitatively.

1 Introduction

Most present-day cameras are equipped with CMOS sensors due to advantages such as slimmer readout circuitry, lower cost, and higher frame rate over their CCD counterparts. While capturing an image, the CMOS sensor array is exposed to the scene in a sequential row-wise manner. The flip side is that, in the presence of camera motion, the inter-row delay leads to undesirable geometric effects, known as rolling shutter (RS) distortions. This is because the rows of the RS image do not necessarily sense the same camera motion. A prominent effect of RS distortion is the manifestation of straight lines as curves, which calls for correction of the RS effect, also termed RS rectification. Rectification of RS distortion involves finding the camera motion for every row of the RS image (row motions). Each row of the RS image is then warped using the estimated row motions, taking one of the rows as reference. Beyond aesthetic appeal, RS rectification is critical for vision tasks such as image registration, structure from motion (SFM), etc., which perform scene inference based on geometric attributes of the captured images.

Multi-frame RS rectification methods use videos [10, 21, 7, 2] and estimate the motion across frames using point correspondences. The inter-frame motion helps estimate the row motion of each RS frame, aiding the warping process to obtain distortion-free frames. Different algorithms have been proposed for RS deblurring [20, 13], RS super-resolution [16], RS registration [26], and change detection [15]. The works in [25, 30] address the problem of depth-aware RS rectification and again rely on multiple input frames. A differential SFM-based framework is employed in [30] to perform RS rectification of input images captured by a slow-moving camera. [25] can handle the additional effects of occlusion that arise while capturing images using a fast-moving RS camera. Some methods use external sensor information such as a gyroscope [4, 5, 14] to stabilize RS distortion in videos. However, these methods are strongly constrained by the availability as well as reliability of external sensor data.

Except for [2], the aforementioned methods are data-hungry and time-consuming. Moreover, they are rendered unusable when only a single image is available. The single-image methods [19, 17, 9] rely on straight lines manifesting as curves, a prominent RS effect, to correct the distortion. However, these methods are tailored for scenes that consist predominantly of straight lines and hence fail to generalize to natural images where actual curves are present in the 3D world. Moreover, they require knowledge of intrinsic camera parameters for RS rectification.

In this paper, we address the problem of single-image RS rectification using a deep neural network. The prior work that uses a deep network for RS rectification of 2D scenes is [18], wherein a neural network is trained using RS images as input and ground truth distortion (i.e., motion) parameters as the target. Given an RS image during inference, the trained neural network predicts motion parameters corresponding to a set of key rows, which is then followed by interpolation for all rows. A main drawback of this approach is that it restricts the solution space of estimated camera parameters to the ground truth parameters used during training. Moreover, arriving at the rectified image is challenging since the association between the estimated motion parameters and the pixel positions of the ground truth global shutter (GS) image is unknown. [18] attempts to solve this problem using an optimization framework as a complex post-processing step.

Recent findings in image restoration advocate that end-to-end training performs better than decoupled or piece-wise training, for example in image deblurring [12, 24], ghost imaging [27], hyperspectral imaging [1] and image super-resolution [11, 6]. As also noted in [29], a fisheye distortion rectification network that regresses for ground truth distortion parameters and then rectifies the distorted image gives sub-optimal performance compared to training end-to-end for the clean image. To this end, we propose a simple and elegant end-to-end deep network which uses the ground truth image to guide the rectification process during training. RS rectification is done in a single step during inference.

2 RS Image Generation and Rectification

Rolling shutter distortion due to the row-wise exposure of the sensor array depends on the relative motion between camera and scene. Consider a scene captured using an RS camera under different camera trajectories, i.e., different values of $[r_x, r_y, r_z, t_x, t_y, t_z]$, where $t_\phi$ and $r_\phi$, $\phi \in \{x, y, z\}$, indicate translation along and rotation about the $\phi$ axis, respectively. As also stated in [18], the effect of $t_y$, $t_z$ and $r_x$ on RS distortion is negligible compared to the effect of $t_x$, $r_y$ and $r_z$. Moreover, the effect of $r_y$ can be approximated by $t_x$ for large focal lengths and when the movement of the camera towards or away from the scene is minimal. Hence, it suffices to consider only $t_x$ and $r_z$ as being essentially responsible for RS image formation.

The GS image coordinates $(x_{\text{gs}}, y_{\text{gs}})$ are related to the RS image coordinates $(x_{\text{rs}}, y_{\text{rs}})$ by

$$\begin{split}
x_{\text{rs}} &= x_{\text{gs}}\cos(r_z(x_{\text{rs}})) - y_{\text{gs}}\sin(r_z(x_{\text{rs}})) + t_x(x_{\text{rs}})\\
y_{\text{rs}} &= x_{\text{gs}}\sin(r_z(x_{\text{rs}})) + y_{\text{gs}}\cos(r_z(x_{\text{rs}}))
\end{split} \tag{1}$$

where $r_z(x_{\text{rs}})$ and $t_x(x_{\text{rs}})$ are the rotation and translation motion experienced by the $x_{\text{rs}}$-th row of the RS image. The GS-RS image pairs required for training our neural network are synthesized using Eq. (1). Given a GS image and the rotation and translation motion for every row of the RS image, the RS image can be generated using either source-to-target (S-T) or target-to-source (T-S) mapping, with GS coordinates as source and RS coordinates as target. Since the motion parameters are associated with RS coordinates, S-T mapping is not employed for RS image generation. In T-S mapping, each pixel location of the target RS image is multiplied with the corresponding warping matrix formed by the motion parameters to yield the source GS pixel location. The intensity at the resultant GS pixel coordinate is found using bilinear interpolation and then copied to the RS pixel location.
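To make the T-S generation step concrete, below is a minimal NumPy sketch (not the authors' code): it assumes per-row motion arrays t_x and r_z (in radians) indexed by the RS row, treats x as the row coordinate as in Eq. (1), and clamps out-of-range samples to the image border; the helper names bilinear_sample and generate_rs are illustrative.

```python
import numpy as np

def bilinear_sample(img, x, y):
    """Bilinearly sample img (H x W x 3) at real-valued (row = x, column = y) positions."""
    H, W, _ = img.shape
    x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
    x1, y1 = x0 + 1, y0 + 1
    wx, wy = x - x0, y - y0                     # fractional weights (before clamping)
    x0, x1 = np.clip(x0, 0, H - 1), np.clip(x1, 0, H - 1)
    y0, y1 = np.clip(y0, 0, W - 1), np.clip(y1, 0, W - 1)
    out = (((1 - wx) * (1 - wy))[..., None] * img[x0, y0]
           + ((1 - wx) * wy)[..., None] * img[x0, y1]
           + (wx * (1 - wy))[..., None] * img[x1, y0]
           + (wx * wy)[..., None] * img[x1, y1])
    return out

def generate_rs(gs_img, t_x, r_z):
    """T-S mapping: for every RS (target) pixel, invert Eq. (1) to find the GS (source) pixel."""
    H, W, _ = gs_img.shape
    x_rs, y_rs = np.meshgrid(np.arange(H), np.arange(W), indexing='ij')  # x: row, y: column
    theta = r_z[x_rs]                            # per-row rotation, indexed by the RS row
    tx = t_x[x_rs]                               # per-row translation
    # Inverse of Eq. (1): rotate (x_rs - t_x, y_rs) back by -r_z(x_rs).
    x_gs = (x_rs - tx) * np.cos(theta) + y_rs * np.sin(theta)
    y_gs = -(x_rs - tx) * np.sin(theta) + y_rs * np.cos(theta)
    return bilinear_sample(gs_img, x_gs, y_gs)
```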

Given an RS image and motion parameters for each row of the RS image, the RS observation can be rectified akin to the process of RS image formation, except that the source is now the RS image while the target is the RS rectified image. In S-T mapping, each pixel location of the RS image along with its row-wise camera motion is substituted in Eq. (1) to get the RS rectified (target) pixel location. However, some of the pixels in the target RS rectified image may go unfilled, leaving holes in the resultant rectified image. In T-S mapping, for every pixel location of the RS rectified image (target), the same set of equations (i.e., Eq. (1)) can be used to solve for the RS image (source) coordinates, provided the camera motion acting on the RS rectified coordinates is known.

Figure 1: Proposed end-to-end deep network architecture.

3 Network architecture

Our main objective is to find a direct mapping from the input RS image to the target RS rectified image. This requires estimation of the row-wise camera motion parameters, and the correspondence between the estimated motion parameters and the target pixel locations. We achieve the above in a principled manner as follows. We propose to use image features of the input RS image for estimating camera motion parameters, and devise a mapping function to relate the estimated motion parameters with the pixel locations of the target image.

Our network architecture (Fig. 1) consists of five basic modules: motion block, trajectory module, row block, RS regeneration module and RS rectification module. The motion block predicts camera motion parameters ($t_x$ and $r_z$) for each row of the input RS image. The trajectory module ensures that the estimated camera motion parameters follow a smooth and continuous curve as we traverse the rows of the RS image, in compliance with real-life camera trajectories. For each pixel location in the target image, the corresponding camera motion is found using the row block. The outputs of the row block and trajectory module are used by the RS rectification module to warp the input RS image into the RS rectified image. For faster convergence during training and to better condition the optimization process, we also employ an RS regeneration module which takes the motion parameters from the trajectory module and warps the GS image to estimate the given (input) RS image. A detailed discussion of each module follows.
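As a rough illustration of the trajectory module's role, the sketch below fits raw per-row motion estimates to a third-order polynomial in the row index, which is the smoothness constraint described above. In the actual network this step would have to be differentiable, which this NumPy version does not address, and fit_trajectory is an assumed helper name.

```python
import numpy as np

def fit_trajectory(raw_motion, degree=3):
    """Fit a per-row motion signal (length r) to a smooth third-order polynomial in the row index."""
    rows = np.arange(len(raw_motion))
    coeffs = np.polyfit(rows, raw_motion, deg=degree)   # least-squares cubic fit
    return np.polyval(coeffs, rows)                     # smoothed motion value for every row

# Usage: smooth the motion block's raw per-row estimates before warping.
# t_x_smooth = fit_trajectory(t_x_raw)
# r_z_smooth = fit_trajectory(r_z_raw)
```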

Motion block This consists of a base block followed by $t_x$ (translation) and $r_z$ (rotation) blocks, respectively. The base block extracts features from the input RS image which are then used to find the row-wise translation and rotation motion of the input image. Thus, the motion block takes an input color image of size $r \times r \times 3$ and outputs two 1D vectors of length $r$ indicating the rotation and translation motion parameters for each row of the input RS image. The base block is designed using three convolutional layers. Both the translation and rotation blocks, which take the output of the base block, are designed using three convolutional layers followed by two fully connected (FC) layers. The final FC layer is of dimension $[r, 1]$, reflecting the motion for every row of the input RS image. Each convolutional layer is followed by batch normalization.
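A possible tf.keras rendering of the motion block following the layer counts above (three-convolution base block; three convolutions plus two FC layers per branch; batch normalization after every convolution) is sketched below. Kernel sizes, strides, channel widths and activations are not specified in this section, so the values used here are placeholders.

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_bn(filters, strides=2):
    """Convolution followed by batch normalization and an activation (placeholder hyper-parameters)."""
    return tf.keras.Sequential([
        layers.Conv2D(filters, 3, strides=strides, padding='same'),
        layers.BatchNormalization(),
        layers.ReLU(),
    ])

def build_motion_block(r=256):
    inp = layers.Input(shape=(r, r, 3))                  # RS image of size r x r x 3
    feat = conv_bn(32)(inp)                              # base block: three convolutional layers
    feat = conv_bn(64)(feat)
    feat = conv_bn(128)(feat)

    def branch(x):                                       # t_x / r_z branch: three conv + two FC layers
        for f in (128, 256, 256):
            x = conv_bn(f)(x)
        x = layers.Flatten()(x)
        x = layers.Dense(512, activation='relu')(x)
        return layers.Dense(r)(x)                        # one motion value per row of the RS image

    t_x = branch(feat)
    r_z = branch(feat)
    return tf.keras.Model(inp, [t_x, r_z], name='motion_block')
```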

Row block As discussed in Section 2, every pixel coordinate in the GS (target) image is substituted in Eq. (1) along with the corresponding camera motion to get the RS (source) image coordinate. However, the camera motion acting on each GS pixel to form the RS image is known only to the extent that it is one of the motions experienced by the rows of the RS image. Specifically, for a given input RS image, all the pixels in a row stem from a single camera motion, and the motion will generally be different for each row. In contrast, pixels in a row of the GS image need not be influenced by a single motion. The ambiguity of which motion to associate with a pixel in the target image was addressed in [18] as a post-processing step, which is a complicated exercise and implicitly constrains the estimated motion parameters.

We propose to use a deep network to solve this issue, thus rendering our network end-to-end. The row block takes an image of size $r \times r \times 3$ and outputs a matrix of dimension $r \times r$, with each location indicating which row motion of the RS image must be picked from the estimated motion parameters for rectification. A real camera trajectory is typically smooth. Consequently, the output of the row block at each coordinate location can be expected to be close to its corresponding row number. Hence, for both stability and faster convergence, we learn to solve only for the residual (motivated by [3, 6]). The residual image is an $r \times r$ matrix and is the output of the row block. This image is added to another matrix (say $A$) initialized with its own row number, i.e., $A(i,j) = i$, for the reason stated above. The values of the resultant matrix indicate which camera motion is to be taken from the estimated motion parameters for rectification of the RS image. From now on, we refer to the output of the row block as the sum of the learnt residual and the matrix initialized with its row number at each coordinate position. The row block consists of five convolutional layers, each followed by batch normalization and an activation function. The use of three layers for the base block and a total of six convolutional layers for the motion block is partly motivated by [18] (which uses five convolutional layers for motion estimation). Since the objective of the row block (finding only a residual) is less complex than that of the motion block, we use fewer layers for it.
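The sketch below illustrates how the row-block output could drive the RS rectification module: the learnt residual is added to the matrix $A$ with $A(i,j) = i$, the resulting per-pixel row index selects a motion from the (smoothed) per-row estimates, and Eq. (1) is applied in T-S mode to pull intensities from the RS image. It reuses bilinear_sample from the generation sketch; the hard rounding of the row index is a simplification, since in the trained network this association must remain differentiable.

```python
import numpy as np

def rectify_rs(rs_img, t_x, r_z, residual):
    """RS rectification by T-S mapping from rectified (target) pixels into the RS (source) image.

    residual : r x r output of the row block; added to A(i, j) = i to give, for every
               target pixel, the RS row whose motion should be applied.
    """
    H, W, _ = rs_img.shape
    i, j = np.meshgrid(np.arange(H), np.arange(W), indexing='ij')
    A = i.astype(np.float32)                               # matrix initialized with its own row number
    row_idx = np.clip(np.rint(A + residual), 0, H - 1).astype(int)  # which row motion to use
    theta, tx = r_z[row_idx], t_x[row_idx]                 # per-pixel motion for the target image
    # Eq. (1), with (i, j) playing the role of the rectified (GS-like) coordinates:
    x_src = i * np.cos(theta) - j * np.sin(theta) + tx
    y_src = i * np.sin(theta) + j * np.cos(theta)
    return bilinear_sample(rs_img, x_src, y_src)
```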

Figure 2: Comparison of $|R_F|$ values for different algorithms on a real video. Input RS frames; [17]: $|R_F| = 29.0$; [19]: $|R_F| = 35.7$; [9]: $|R_F| = 40.7$; [18]: $|R_F| = 41.7$; Ours: $|R_F| = 45.92$.

3.1 Loss functions

Given an input RS image, the motion block, trajectory module, row block and RS rectification module together rectify it to produce an RS distortion-free (rectified) image. We employ different loss functions to enable the network to learn this rectification process.

The first loss is the mean squared error (MSE) between the rectified RS image and the ground truth image, but with a modification. In the rectified RS image, it is possible that certain regions at the boundary are not recovered (when compared with the GS image), since these regions were not present in the RS image itself due to camera motion. This can be noticed in Fig. 5 (third row), where the building has been rectified but there are boundary regions for which the rectification algorithm could not retrieve pixel values, as they were not present in the original RS image. To account for this effect, we use a visibility-aware MSE loss, where the MSE is computed between two pixels only if the intensity in at least one of the channels of the rectified RS image is non-zero. Let $I_{\text{rs}}$, $I_{\text{gs}}$, $I_{\text{rs\_rec}}$ be the input RS, ground truth GS, and RS rectified image, respectively. Then, we define a mask $M_{\text{rs\_rec}}$ such that

$$M_{\text{rs\_rec}}(i,j) = \begin{cases} 0 & \text{if } \sum_{k=1}^{3} I_{\text{rs\_rec}}(i,j,k) = 0 \\ 1 & \text{otherwise} \end{cases}$$

where $k$ indexes the color channels of the RGB image. The error between the GS and RS rectified images can be written as

$$L_{\text{rs\_rec\_MSE}} = \lVert I_{\text{rs\_rec}} - M_{\text{rs\_rec}} \otimes I_{\text{gs}} \rVert_2^2$$

where $\otimes$ denotes point-wise multiplication.
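A direct TensorFlow transcription of this visibility-aware loss is given below; tensors are assumed to be batched [B, H, W, 3] images, and reduce_mean is used in place of the unnormalised squared norm, which only rescales the loss. The function names are illustrative.

```python
import tensorflow as tf

def visibility_mask(img):
    """1 where the sum over the colour channels is non-zero, 0 otherwise (shape [B, H, W, 1])."""
    channel_sum = tf.reduce_sum(img, axis=-1, keepdims=True)
    return tf.cast(tf.not_equal(channel_sum, 0), img.dtype)

def masked_mse(pred, target):
    """Visibility-aware MSE: compare only where the predicted (rectified/regenerated) image is filled."""
    mask = visibility_mask(pred)
    return tf.reduce_mean(tf.square(pred - mask * target))
```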

The second loss that we devise is based on the error between the given RS image and the GS image distorted by the estimated motion parameters. To account for holes at the boundary, we again define a mask $M_{\text{rs\_reg}}(i,j)$ such that

$$M_{\text{rs\_reg}}(i,j) = \begin{cases} 0 & \text{if } \sum_{k=1}^{3} I_{\text{rs\_reg}}(i,j,k) = 0 \\ 1 & \text{otherwise} \end{cases}$$

where $I_{\text{rs\_reg}}$ is the image obtained by applying the estimated motion parameters on the GS image. The error between the RS image and the RS regenerated image is given by

$$L_{\text{rs\_reg\_MSE}} = \lVert I_{\text{rs\_reg}} - M_{\text{rs\_reg}} \otimes I_{\text{rs}} \rVert_2^2$$

Since edges play a very important role in RS rectification, we also compare the Sobel edges of the RS rectified and RS regenerated images with those of the ground truth GS and input RS images, respectively. Let the Sobel operation be represented as $E(\cdot)$. Then the edge losses for the rectification and regeneration phases can be formulated as

$$L_{\text{rs\_rec\_edge}} = \lVert E(I_{\text{rs\_rec}}) - M_{\text{rs\_rec}} \otimes E(I_{\text{gs}}) \rVert_2^2$$
$$L_{\text{rs\_reg\_edge}} = \lVert E(I_{\text{rs\_reg}}) - M_{\text{rs\_reg}} \otimes E(I_{\text{rs}}) \rVert_2^2$$

The overall loss function of our network (please refer to the Appendix for the back-propagation equations w.r.t. the different loss functions) is a combination of the aforementioned losses and is given by

$$L_{\text{total}} = \lambda_1 L_{\text{rs\_rec\_MSE}} + \lambda_2 L_{\text{rs\_reg\_MSE}} + \lambda_3 L_{\text{rs\_rec\_edge}} + \lambda_4 L_{\text{rs\_reg\_edge}} \tag{2}$$
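Assembling the four terms might look as follows, using tf.image.sobel_edges for $E(\cdot)$ and reusing visibility_mask from the earlier sketch; the $\lambda$ weights are the values reported in Section 4.2, and the per-term masking follows the equations above.

```python
import tensorflow as tf

def sobel_mag(img):
    """E(.): per-channel Sobel edge magnitude of a batched image tensor [B, H, W, 3]."""
    g = tf.image.sobel_edges(img)                        # [B, H, W, 3, 2] (dy, dx)
    return tf.sqrt(tf.reduce_sum(tf.square(g), axis=-1) + 1e-8)

def total_loss(I_rs, I_gs, I_rs_rec, I_rs_reg,
               lam1=1.0, lam2=1.0, lam3=0.5, lam4=0.5):
    """Weighted sum of the two visibility-aware MSE terms and the two edge terms, as in Eq. (2)."""
    M_rec = visibility_mask(I_rs_rec)                    # mask from the rectified image
    M_reg = visibility_mask(I_rs_reg)                    # mask from the regenerated image
    L_rec_mse = tf.reduce_mean(tf.square(I_rs_rec - M_rec * I_gs))
    L_reg_mse = tf.reduce_mean(tf.square(I_rs_reg - M_reg * I_rs))
    L_rec_edge = tf.reduce_mean(tf.square(sobel_mag(I_rs_rec) - M_rec * sobel_mag(I_gs)))
    L_reg_edge = tf.reduce_mean(tf.square(sobel_mag(I_rs_reg) - M_reg * sobel_mag(I_rs)))
    return lam1 * L_rec_mse + lam2 * L_reg_mse + lam3 * L_rec_edge + lam4 * L_reg_edge
```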
Figure 3: Visual comparisons on synthetic examples with different RS rectification methods. Columns: input RS image, [19], [17], [9], [18], Ours, ground truth.

4 Experiments

This section is arranged as follows: (i) dataset generation, (ii) implementation details, (iii) competing methods, (iv) quantitative analysis, and (v) visual results.

4.1 Dataset generation

To train our network, we used images of buildings, and tested on building images as well as real-world RS images (having at least a few real-world straight lines). In this section, we explain the synthesis of camera motion and the generation of RS images for training. Since fully connected (FC) layers are present in our network, we used images of a fixed size (256×256) for both training and testing.
Camera motion and training dataset: Because it is difficult to capture real GS-RS pairs, following [18] we synthesized camera motion using a second-degree polynomial to generate the RS images. We used the Buildings dataset from [28, 22] with a total of 440 clean images cropped to a size of 256×256. Out of these, we randomly chose 400 images, and each image was distorted using 200 synthesized camera motions, resulting in 80K images for training. The remaining 40 images were used for the test dataset. To ensure that there are no missing parts at the boundaries of the generated RS images, we increased the size of each image to 356×356, applied the RS distortion, and then cropped it back to 256×256.
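A sketch of this distortion pipeline is given below; the second-degree polynomial coefficient ranges are illustrative placeholders (the section does not specify them), and generate_rs refers to the earlier T-S warping sketch.

```python
import numpy as np

def synthesize_row_motion(num_rows, max_tx=10.0, max_rz=np.deg2rad(3.0), rng=np.random):
    """Per-row camera motion from random second-degree polynomials in the row index (as in [18])."""
    rows = np.linspace(-1.0, 1.0, num_rows)
    t_x = np.polyval(rng.uniform(-max_tx, max_tx, size=3), rows)   # quadratic translation curve
    r_z = np.polyval(rng.uniform(-max_rz, max_rz, size=3), rows)   # quadratic rotation curve
    return t_x, r_z

def make_training_pair(gs_big):
    """gs_big: 356 x 356 x 3 version of a GS image; distort, then crop both images back to 256 x 256."""
    t_x, r_z = synthesize_row_motion(gs_big.shape[0])
    rs_big = generate_rs(gs_big, t_x, r_z)        # T-S warp from the earlier sketch
    c = (gs_big.shape[0] - 256) // 2
    return gs_big[c:c + 256, c:c + 256], rs_big[c:c + 256, c:c + 256]
```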

4.2 Implementation details and competing methods

To stabilize training and mitigate ill-conditioning during the initial steps, we first trained our network to regress only for the ground truth motion parameters using 50 images from the training dataset for 5 epochs. The network was then trained using Eq. (2) as the loss function on the full training dataset. We used TensorFlow for both training and testing with the following settings: the ADAM optimizer to minimize the loss function, momentum values $\beta_1 = 0.9$ and $\beta_2 = 0.99$, and a learning rate of 0.001. The weights of the different cost terms are set as $\lambda_1 = \lambda_2 = 1$ and $\lambda_3 = \lambda_4 = 0.5$. We compared our method with the state-of-the-art single-image RS rectification methods [18, 9, 17, 19]. Note that the non-learning-based methods [19, 9, 17] require intrinsic camera parameters, while ours and [18] do not. We gave our set of RS images to the respective authors and obtained their results for comparison.
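The optimizer settings above, expressed as a minimal tf.keras training step, might look as follows. The model's input/output signature is an assumption (the regeneration branch needs the GS image during training), total_loss refers to the earlier loss sketch, and the warm-up regression stage on 50 images is omitted.

```python
import tensorflow as tf

# ADAM with the settings reported above.
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3, beta_1=0.9, beta_2=0.99)

@tf.function
def train_step(model, I_rs, I_gs):
    with tf.GradientTape() as tape:
        # Assumed signature: the model returns the rectified and regenerated images.
        I_rs_rec, I_rs_reg = model([I_rs, I_gs], training=True)
        loss = total_loss(I_rs, I_gs, I_rs_rec, I_rs_reg)   # lambdas default to 1, 1, 0.5, 0.5
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```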

4.3 Visual comparisons

We give results on the test dataset and on RS images captured using a hand-held mobile camera. Fig. 3 depicts qualitative comparisons with competing methods. The RS image in the first row is an indoor scene image ([23]) affected by a real-life camera trajectory ([8]). Because the Manhattan-world assumption is not satisfied and due to the presence of influential outliers in the background (cloth), [9, 17, 19] are not able to rectify the image properly. Our rectification result is better than that of [18], since our solution space of estimated motion parameters is not skewed, unlike [18]. The RS images in the second and third rows (taken from [9]), also part of the test dataset, are affected by complex real-life camera motion, which is evident from the RS distortions. Since strong outliers are present in these images in the form of branches, [9, 17, 19], which depend on the detection of curves for the estimation of camera motion, fail to rectify the images. Due to the restrictions placed on the estimated camera motion, [18] is unable to rectify the images as well as our method.

Figure 4: Ablation study. (a) Input real RS image. (b) Result of our model regressed for ground truth motion parameters. (c) Result of [18]. (d) Our result.

A refined and complete version of this work appeared in JOSA, 2020.

References

  • [1] Hao Fu, Liheng Bian, Xianbin Cao, and Jun Zhang. Hyperspectral imaging from a raw mosaic image with end-to-end learning. Optics Express, 28(1):314–324, 2020.
  • [2] Matthias Grundmann, Vivek Kwatra, Daniel Castro, and Irfan Essa. Calibration-free rolling shutter removal. In 2012 IEEE International Conference on Computational Photography (ICCP), pages 1–8. IEEE, 2012.
  • [3] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
  • [4] Sung Hee Park and Marc Levoy. Gyro-based multi-image deconvolution for removing handshake blur. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3366–3373, 2014.
  • [5] Chao Jia and Brian L Evans. Probabilistic 3-d motion estimation for rolling shutter video rectification from visual and inertial measurements. In 2012 IEEE 14th International Workshop on Multimedia Signal Processing (MMSP), pages 203–208. IEEE, 2012.
  • [6] Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee. Accurate image super-resolution using very deep convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1646–1654, 2016.
  • [7] Young-Geun Kim, Venkata Ravisankar Jayanthi, and In-So Kweon. System-on-chip solution of video stabilization for cmos image sensors in hand-held devices. IEEE Transactions on Circuits and Systems for Video Technology, 21(10):1401–1414, 2011.
  • [8] Rolf Köhler, Michael Hirsch, Betty Mohler, Bernhard Schölkopf, and Stefan Harmeling. Recording and playback of camera shake: Benchmarking blind deconvolution with a real-world database. In European conference on computer vision, pages 27–40. Springer, 2012.
  • [9] Yizhen Lao and Omar Ait-Aider. A robust method for strong rolling shutter effects correction using lines with automatic feature selection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4795–4803, 2018.
  • [10] Chia-Kai Liang, Li-Wen Chang, and Homer H Chen. Analysis and compensation of rolling shutter effect. IEEE Transactions on Image Processing, 17(8):1323–1330, 2008.
  • [11] Bee Lim, Sanghyun Son, Heewon Kim, Seungjun Nah, and Kyoung Mu Lee. Enhanced deep residual networks for single image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 136–144, 2017.
  • [12] Michael Mathieu, Camille Couprie, and Yann LeCun. Deep multi-scale video prediction beyond mean square error. arXiv preprint arXiv:1511.05440, 2015.
  • [13] Mahesh MR Mohan, AN Rajagopalan, and Gunasekaran Seetharaman. Going unconstrained with rolling shutter deblurring. In Proceedings of the IEEE International Conference on Computer Vision, pages 4010–4018, 2017.
  • [14] Alonso Patron-Perez, Steven Lovegrove, and Gabe Sibley. A spline-based trajectory representation for sensor fusion and rolling shutter cameras. International Journal of Computer Vision, 113(3):208–219, 2015.
  • [15] Vijay Rengarajan Angarai Pichaikuppan, Rajagopalan Ambasamudram Narayanan, and Aravind Rangarajan. Change detection in the presence of motion blur and rolling shutter effect. In European Conference on Computer Vision, pages 123–137. Springer, 2014.
  • [16] Abhijith Punnappurath, Vijay Rengarajan, and AN Rajagopalan. Rolling shutter super-resolution. In Proceedings of the IEEE International Conference on Computer Vision, pages 558–566, 2015.
  • [17] Pulak Purkait, Christopher Zach, and Ales Leonardis. Rolling shutter correction in Manhattan world. In Proceedings of the IEEE International Conference on Computer Vision, pages 882–890, 2017.
  • [18] Vijay Rengarajan, Yogesh Balaji, and AN Rajagopalan. Unrolling the shutter: Cnn to correct motion distortions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2291–2299, 2017.
  • [19] Vijay Rengarajan, Ambasamudram N Rajagopalan, and Rangarajan Aravind. From bows to arrows: Rolling shutter rectification of urban scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2773–2781, 2016.
  • [20] Vijay Rengarajan, Ambasamudram Narayanan Rajagopalan, Rangarajan Aravind, and Guna Seetharaman. Image registration and change detection under rolling shutter motion blur. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(10):1959–1972, 2017.
  • [21] Erik Ringaby and Per-Erik Forssén. Efficient video rectification and stabilisation for cell-phones. International Journal of Computer Vision, 96(3):335–352, 2012.
  • [22] Hao Shao, Tomáš Svoboda, and Luc Van Gool. Zubud-zurich buildings database for image based recognition. Computer Vision Lab, Swiss Federal Institute of Technology, Switzerland, Tech. Rep, 260(20):6–8, 2003.
  • [23] Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from rgbd images. In European Conference on Computer Vision, pages 746–760. Springer, 2012.
  • [24] Xin Tao, Hongyun Gao, Xiaoyong Shen, Jue Wang, and Jiaya Jia. Scale-recurrent network for deep image deblurring. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8174–8182, 2018.
  • [25] Subeesh Vasu, Mahesh MR Mohan, and AN Rajagopalan. Occlusion-aware rolling shutter rectification of 3d scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 636–645, 2018.
  • [26] Subeesh Vasu, Ambasamudram Narayanan Rajagopalan, and Guna Seetharaman. Camera shutter-independent registration and rectification. IEEE Transactions on Image Processing, 27(4):1901–1913, 2018.
  • [27] Fei Wang, Hao Wang, Haichao Wang, Guowei Li, and Guohai Situ. Learning from simulation: An end-to-end deep-learning approach for computational ghost imaging. Optics Express, 27(18):25560–25572, 2019.
  • [28] Jianxiong Xiao, James Hays, Krista A Ehinger, Aude Oliva, and Antonio Torralba. Sun database: Large-scale scene recognition from abbey to zoo. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 3485–3492. IEEE, 2010.
  • [29] Xiaoqing Yin, Xinchao Wang, Jun Yu, Maojun Zhang, Pascal Fua, and Dacheng Tao. Fisheyerecnet: A multi-context collaborative deep network for fisheye image rectification. In Proceedings of the European Conference on Computer Vision (ECCV), pages 469–484, 2018.
  • [30] Bingbing Zhuang, Loong-Fah Cheong, and Gim Hee Lee. Rolling-shutter-aware differential sfm and image rectification. In Proceedings of the IEEE International Conference on Computer Vision, pages 948–956, 2017.