
Real-World Video for Zoom Enhancement based on Spatio-Temporal Coupling

Zhiling Guo^{1,2}, Yinqiang Zheng^{3}, Haoran Zhang^{4}, Xiaodan Shi^{2}, Zekun Cai^{2},
Ryosuke Shibasaki^{2}, Jinyue Yan^{1}

^{1}Department of Building Environment and Energy Engineering,
The Hong Kong Polytechnic University, Kowloon, Hong Kong, China
^{2}Center for Spatial Information Science, The University of Tokyo, Kashiwa, Japan
^{3}Next Generation Artificial Intelligence Research Center, The University of Tokyo, Tokyo, Japan
^{4}School of Urban Planning and Design, Peking University, Shenzhen, China
Abstract

In recent years, single-frame image super-resolution (SR) has become more realistic by considering the zooming effect and using real-world short- and long-focus image pairs. In this paper, we further investigate the feasibility of applying realistic multi-frame clips to enhance zoom quality via spatio-temporal information coupling. Specifically, we first build a real-world video benchmark, VideoRAW, using a synchronized co-axial optical system. The dataset contains paired short-focus raw and long-focus sRGB videos of different dynamic scenes. Based on VideoRAW, we then present a Spatio-Temporal Coupling Loss (STCL). STCL exploits information from paired and adjacent frames to align and fuse features both spatially and temporally at the feature level. Experimental results in different zoom scenarios demonstrate the benefit of integrating the real-world video dataset and STCL into existing SR models for zoom quality enhancement, and show that the proposed method can serve as an advanced and viable tool for video zoom.

1 Introduction

Zoom functionality plays an important role in modern cameras owing to the increasing demand for more detailed views. Instead of using bulky and expensive optical lenses, digital zoom has emerged as an alternative strategy for increasing resolution. Digital zoom, which is accomplished by cropping a centered portion of an image and simply upsampling it back to the original size, as shown in Figure 1 B, unavoidably introduces quality problems into the super-resolved image, including noise, artifacts, loss of detail, and unnaturalness [29]. Generating high-quality content with digital zoom therefore remains a formidable challenge.
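
To make the crop-and-upsample pipeline concrete, the following minimal Python sketch (using OpenCV; the function name and the bicubic interpolation choice are our own assumptions rather than part of any cited method) illustrates the naive digital zoom described above.

import cv2


def digital_zoom(image, zoom=4.0):
    """Naive digital zoom: crop a centered region of size (H/zoom, W/zoom)
    and upsample it back to the original frame size."""
    h, w = image.shape[:2]
    ch, cw = int(h / zoom), int(w / zoom)
    y0, x0 = (h - ch) // 2, (w - cw) // 2
    crop = image[y0:y0 + ch, x0:x0 + cw]
    # Bicubic upsampling back to the original resolution; this is exactly
    # where noise, artifacts and loss of detail are introduced.
    return cv2.resize(crop, (w, h), interpolation=cv2.INTER_CUBIC)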

Figure 1: Visual comparison among a realistic LR&HR pair and learning-based digital zoom methods. L_2, CoBi, and Ours are all trained with the SRResNet architecture, and represent methods based on weak-spatial, spatial, and spatio-temporal coupling constraints, respectively.

Considering the zoom principles as well as the difficulties faced by digital zoom mentioned above, super-resolution (SR) techniques, which aim at increasing the image resolution while providing finer spatial details than those captured by the original acquisition sensors, can be adopted to boost digital zoom quality. SR has experienced significant improvements over the last few years thanks to deep learning methods and large-scale training datasets [8, 30, 18, 22, 23, 37, 7, 46]. Unfortunately, most existing methods, which are evaluated on simulated datasets, generalize poorly to challenging real-world SR conditions, where the authentic degradations in low-resolution (LR) images are far more complicated [2, 20, 43]. Thus, high-quality training datasets that contain real-world LR and high-resolution (HR) pairs are highly desired for realistic SR.

Recent studies [48, 3, 5, 16] have investigated the strategy of applying real-sensor single-frame datasets, including raw data, to digital zoom quality enhancement. Although remarkable improvement can be achieved, several issues are inevitable in captured raw data pairs, including lens distortion, spatial misalignment, and color mismatching; consequently, the effectiveness of single-frame methods is heavily limited by how much information can be reconstructed from limited and misaligned spatial features.

The success of applying simulated multi-frame datasets to image restoration tasks such as video SR [39, 13, 45, 36, 44] and deblurring [33] suggests the possibility of adopting real-sensor multi-frame datasets for digital zoom. Considering the inter-frame spatio-temporal correlations, extracting and combining information from multiple frames is a promising strategy to alleviate the intrinsic issues between realistic LR and HR pairs. Under this assumption, we propose to utilize real-sensor video datasets in SR to achieve digital zoom enhancement. Two challenges remain: (1) how to acquire high-quality real-world video pairs with different resolutions, and (2) how to effectively make full use of the multi-frame information for model training.

To obtain a paired video dataset, we build a novel optical system that adopts a beam splitter to split the light from the same scene and then captures the paired LR and HR videos with the equipped short- and long-focal-length cameras independently and simultaneously. The system can conveniently collect realistic raw video as well as image datasets at different scale ratios by simply adjusting the equipped manual zoom lenses. In this paper, we provide a benchmark, named VideoRAW, for training and evaluating SR algorithms in practical applications. We define a scene captured at a long focal length as the HR ground truth, and the same scene captured at a short focal length as its paired LR observation. In comparison to existing datasets [20, 27, 3, 48, 2, 16], VideoRAW is the first large-scale video-based raw dataset for real-world computational zoom. It enables comparisons among different algorithms in both video and image zoom scenarios, and the diverse scenes it contains make it more realistic and practical.

Since in VideoRAW the paired LR and HR images are not perfectly aligned while adjacent frames contain spatio-temporal correlations, we propose a novel loss framework, termed Spatio-Temporal Coupling Loss (STCL), to address the challenges of feature alignment and fusion during training. STCL draws inspiration from the recently proposed contextual bilateral loss (CoBi) [48] for dealing with unaligned features in paired single-frame LR and HR images. Different from CoBi, which focuses only on limited spatial patterns, STCL takes both the spatial and the temporal correlations of the reference multi-frame clips into account, and performs realistic SR enhancement in a coupled manner at the feature level. Specifically, on the spatial side, STCL aligns the locations of HR and the input LR at a lower scale with a coarse constraint, while fusing features from the paired HR frame into SR at a higher scale. On the temporal side, STCL leverages the features of adjacent frames as supplementary cues to help compensate the SR quality.

Finally, the proposed VideoRAW and STCL are applied to SR for digital zoom quality enhancement. During training, raw sensor data taken with the shorter focal length serve as the LR input, both to fully exploit the information in the raw signal and to avoid the artifacts introduced by demosaicing preprocessing [9, 48, 4]. To evaluate our approach, we integrate the proposed method into different existing deep-learning-based SR architectures [23, 24, 42, 34]. The experimental results show that our method outperforms others in both reconstruction accuracy [11] and perceptual quality [47, 15] across all scenarios, which reveals the generalizability and effectiveness of applying real-world video datasets and spatio-temporal coupling to realistic SR.

The main contributions of this study are three-fold:

  • We demonstrate the feasibility of introducing spatio-temporal coupling for zoom quality enhancement, which is achieved by adopting real-world zoom video datasets and a novel loss framework.

  • We design a co-axial optical system to obtain paired short- and long-focal length videos from different scenes, and will publicly release a valuable real sensor based raw/sRGB video benchmark, VideoRAW.

  • We present a loss framework, STCL, for realistic SR based on VideoRAW and spatio-temporal coupling.

To the best of our knowledge, this paper presents the first realistic video-based solution for learning-based digital zoom quality enhancement.

2 Related Work

2.1 Super-Resolution for Digital Zoom

The past few years have witnessed great success in applying deep learning to enhance SR quality [23, 24, 42, 34, 8, 30, 18, 22, 37, 7, 13, 39]. However, most existing SR methods are evaluated on simulated datasets; applying them to digital zoom in practical scenarios is less effective and results in a significant deterioration of SR performance.

Lately, a few studies have proposed methods based on realistic single-frame datasets for zoom quality enhancement. Cai et al. [3] presented a new single-frame benchmark and a Laplacian pyramid based kernel prediction network (LP-KPN) to handle real-world scenes. Chen et al. [5] investigated the feasibility of alleviating the intrinsic trade-off between resolution and field-of-view from the perspective of camera lenses. Zhang et al. [48] adopted a contextual bilateral loss (CoBi) to deal with the misalignment between paired LR and HR images, and used a high-bit raw dataset, SR-RAW, to improve the input data quality; the benefit of using raw input over RGB is also demonstrated in [48]. Joze et al. [16] released a realistic dataset named ImagePairs for image SR and image quality enhancement. Besides, NTIRE 2019 [2] organized a challenge to facilitate the development of real image SR; the image pairs provided in that challenge are registered beforehand by a sophisticated image registration algorithm, so high-quality SR results can be generated with carefully designed deep learning architectures [6, 21, 49]. Considering that very accurate subpixel registration is difficult for realistic raw data, and that misaligned features undermine the feature extraction capability of deep learning models, adopting the temporal correlation in adjacent multi-frame clips, instead of using single-frame pairs, is a promising strategy to enrich the limited spatial information. However, to the best of our knowledge, no existing work performs learning-based zoom from multi-frame pairs; the closest is [38], which supplants the need for demosaicing in a camera pipeline by merging a burst of raw images. In this paper, we address the challenge of introducing both spatial and temporal information via a real-world video dataset for learning-based zoom quality enhancement.

2.2 Spatio-Temporal Coupling

The inter-frame spatial and temporal information has been exploited by many recent video SR studies based on simulated data [35, 12, 13, 36, 45, 1, 41, 32, 28]. These methods can be grouped into time-based, space-based, and spatio-temporal approaches. A representative time-based study is presented by Shi et al. [40], who formulated a convolutional long short-term memory (ConvLSTM) architecture to retain information from the previous frame. Space-based methods aim to merge temporal information in a parallel manner; studies such as VSRnet [17] and DUFVSR [14] achieve this goal with a direct fusion architecture and 3D convolutional neural networks (CNNs), respectively. Among spatio-temporal methods, Yi et al. [45] proposed a progressive fusion network built from a series of progressive fusion residual blocks (PFRBs), Caballero et al. [1] adopted spatio-temporal networks with motion compensation, Wang et al. [36] introduced enhanced deformable convolutional networks, and Wang et al. [35] applied optical flow estimation.

However, due to the spatial misalignment in realistic video pairs, the aforementioned methods, which mainly focus on network architecture optimization, cannot effectively couple spatio-temporal information through pixel-wise loss functions. Instead of further optimizing the network architecture, we investigate the feasibility of achieving spatio-temporal coupling from the perspective of the loss function.

3 Realistic Video Raw Dataset

To enable training, we introduce a novel dataset, VideoRAW, which contains realistic LR and HR video pairs taken with our co-axial optical imaging system. For data preprocessing, we align each captured video pair with a geometric transformation.

Figure 2: Design of our optical system. (A) The real device, equipped with identical manual zoom lenses set to long and short focal lengths; and (B) video capturing with an external signal generator for rigorous temporal synchronization.

3.1 Data Capturing

As shown in Figure 2, the optical system consists of one beam splitter, two global-shutter cameras, two identical zoom lenses set to different focal lengths, and a signal synchronizer for temporal synchronization. When capturing videos, the incoming light is first divided into two perpendicular beams by a 45° beam splitter; the beams then enter the long and short focal-length cameras, each equipped with an RGGB Bayer sensor.

Here, two FLIR GS3-U3-15S5C cameras and two RICOH FL-CC6Z1218-VG manual zoom lenses are adopted to collect 4X upscale-ratio video pairs. The focal length is set to 18 mm in the LR branch with a larger field-of-view (FoV), and to 72 mm in the HR branch with a smaller FoV. Although we focus on the investigation of 4X data, which is common in video SR, our capture system can collect up to 6X paired data without modification.

For the camera settings, we choose a 15 fps frame rate to enhance the spatio-temporal correlation, a 2 ms shutter speed to avoid obvious blur on fast-moving objects, and a relatively small aperture to alleviate the influence of depth-of-field differences. With the proposed imaging system, 84 pairs covering multiple scenes, each containing 200 frames at 1384×1032 resolution, are captured from different street spots. We take the 16-bit LR raw and 8-bit HR sRGB frames as the input and ground truth for zoom learning, respectively. Note that the camera has a 14-bit ADC, and the affiliated FlyCapture2 software converts the original 14-bit raw into 16-bit by linear scaling; the 8-bit HR sRGB images are generated by the built-in ISP of FlyCapture2.

Figure 3: A subset of the VideoRAW dataset. The top row includes consecutive LR frames after FoV matching, while the second row is the paired HR frames.

3.2 Data Preprocessing

For geometric alignment, we first match the FoV of each paired LR and HR frame in VideoRAW based on the predefined focal lengths; some examples are shown in Figure 3. Since videos taken at different focal lengths suffer from different lens distortions and perspective effects, as well as a subtle shift between the light-splitting paths, misalignment is inevitable during data capturing. To address this issue, we employ a homography transformation to warp the HR image onto the paired LR image. Then, to match the size of the LR frames under the target zoom ratio, a scale offset is applied to the HR frames. After that, we randomly crop consecutive frame patches from the paired videos for 4X SR training. Although obvious misalignment can be alleviated by this preprocessing step, as shown in Figure 4 A and B (GT: HR0), nontrivial misalignment between paired LR and HR imagery is still unavoidable. We usually observe a 10-40 pixel shift in a processed pair, depending on the scene geometry. We attribute this shift to various physical effects in optical zooming, such as perspective distortion, rather than to temporal synchronization, since the two cameras are rigorously synchronized with sub-microsecond accuracy.
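
A minimal sketch of this geometric alignment step is given below (in Python with OpenCV). The ORB feature matching and the RANSAC threshold are illustrative assumptions rather than the exact registration procedure used to build VideoRAW.

import cv2
import numpy as np


def align_hr_to_lr(hr_rgb, lr_rgb, zoom=4):
    """Warp the (FoV-matched) HR frame onto the LR frame with a homography,
    then apply the scale offset so the HR target is `zoom` times larger
    than the LR patch, as required for 4X SR training."""
    orb = cv2.ORB_create(4000)
    k_hr, d_hr = orb.detectAndCompute(cv2.cvtColor(hr_rgb, cv2.COLOR_BGR2GRAY), None)
    k_lr, d_lr = orb.detectAndCompute(cv2.cvtColor(lr_rgb, cv2.COLOR_BGR2GRAY), None)
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(d_hr, d_lr)
    src = np.float32([k_hr[m.queryIdx].pt for m in matches])
    dst = np.float32([k_lr[m.trainIdx].pt for m in matches])
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    h, w = lr_rgb.shape[:2]
    warped = cv2.warpPerspective(hr_rgb, H, (w, h))   # HR aligned to the LR FoV
    # Scale offset: resize the aligned HR to zoom x the LR size for training.
    return cv2.resize(warped, (w * zoom, h * zoom), interpolation=cv2.INTER_CUBIC)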

Figure 4: An example pair (A, with ground truth HR0 in B) taken from corresponding training patches; the misalignment is unavoidable. This issue motivates the transition from a single-frame constraint to the consecutive multi-frame constraints in B.

For photometric alignment, which covers image brightness and color white balance, we capture a dark background and estimate white-balance ratios between the raw and RGB images for the blue and red channels. For all 16-bit raw images, we first subtract the black level and then multiply by the two ratios to approximate the white balance of the RGB images. This correction minimizes the color and brightness differences within each pair and helps to highlight the aforementioned discrepancies caused by the zooming effect.
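
A sketch of this photometric correction on a 16-bit RGGB Bayer frame is shown below. The black level and the per-channel gains (r_gain, b_gain) are assumed to have been estimated beforehand from the dark-background capture, and the RGGB site indexing is an assumption about the sensor layout.

import numpy as np


def photometric_align_raw(raw16, black_level, r_gain, b_gain):
    """Subtract the black level from a 16-bit RGGB Bayer frame, then scale
    the red and blue photosites so the raw white balance approximates that
    of the HR sRGB branch."""
    img = raw16.astype(np.float32) - black_level
    img = np.clip(img, 0.0, None)
    img[0::2, 0::2] *= r_gain   # R photosites of the RGGB pattern
    img[1::2, 1::2] *= b_gain   # B photosites of the RGGB pattern
    return img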

4 Spatio-Temporal Coupling Loss

4.1 Framework

Figure 5: Our framework for digital zoom quality enhancement based on consecutive multi-frame clips and spatio-temporal coupling.

We propose a unified framework, the Spatio-Temporal Coupling Loss (STCL), as shown in Figure 5, which is extensible to different existing deep learning architectures for digital zoom quality enhancement. The challenge lies in designing the constraint for spatio-temporal coupling on realistic video datasets, where the paired LR and HR0 are misaligned and establishing precise correspondences between LR and the adjacent frames HRt is difficult. To obtain high-quality outputs, we address these difficulties via (1) spatial alignment and fusion, and (2) temporal fusion and aggregation, both at the high-dimensional feature level.

Concretely, we design a spatial constraint by exploring the correlation among LR, HR0, and the expected SR output. In realistic scenarios, when capturing a closer view of far-away subjects with greater detail, the positions of features in the zoomed-in content should follow the short-focal-length LR, while other properties such as edges, texture, and color should be referenced from the long-focal-length HR0. Thus, the proposed spatial constraint consists of two components: one aligns the feature positions of SR and LR at a lower scale with a position-constraint kernel, while the other fuses the features from the paired HR0 frame into SR at a higher scale.

In terms of the temporal constraint, we propose to compare the feature distributions of SR and the adjacent frames rather than just comparing their appearance. Since different frames are not equally informative for the reconstruction, a weighted constraint is designed by considering the correlation between the features of HR0 and the adjacent frames. The temporal constraint thus guides the feature extraction and aggregation from consecutive frames for effective feature fusion and compensation. Finally, spatio-temporal coupling is achieved in the zoom task by the given realistic video pairs and the integration of the spatial and temporal constraints.

4.2 Loss Function

Our objective function is formulated as

STCL = Loss_{s} + \lambda\, Loss_{t},    (1)

where Loss_s and Loss_t refer to the spatial and temporal constraints, respectively, and λ balances the two terms. By effectively coupling the spatial and temporal information from realistic video datasets, STCL achieves zoom quality enhancement. To the best of our knowledge, our approach is the first attempt in this direction.

Spatial constraint. The core of Loss_s consists of two loss terms: (1) the alignment loss, Loss_a, computed at low resolution, which drives the generated image to share the spatial structure of the LR input; and (2) the Contextual Loss (CX) [26], used here as a reference loss, Loss_r, which ensures that the internal statistics of the generated image match those of the target H_0.

To align SR and LR, we first downsample SR into LR' to match the size of LR at a lower scale. Instead of using a pixel-to-pixel loss such as L_1 or L_2, we align LR and LR' at the feature level, based on the features extracted by the conv3_2 and conv4_1 layers of a pretrained VGG-19 (Φ_1) [31]. We then introduce spatial awareness into CX through a Gaussian kernel that constrains the spatial distance between two similar features. Our Loss_a is defined as

Loss_{a}(L, L') = \frac{1}{N} \sum_{i}^{N} \min_{j=1,\dots,M} \left( \kappa \cdot D_{l_i, l'_j} \right),    (2)

where L and L' are the feature spaces of LR and LR', respectively, and D_{l_i, l'_j} denotes the cosine distance between feature l_i in L and l'_j in L'. The kernel κ is formulated as

\kappa = \exp\left( -\frac{(D'_{l_i, l'_j} - \mu)^{2}}{2\sigma^{2}} \right),    (3)

where D'_{l_i, l'_j} = \|(x_i, y_i) - (x_j, y_j)\|_2 denotes the spatial coordinate distance between features l_i and l'_j. Here, we set μ = 0 and σ = 2. By adopting the proposed Loss_a, similar feature pairs between LR and LR' can be aligned spatially.
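
The NumPy sketch below illustrates the computation of Eq. (2)-(3) on flattened VGG feature maps. It is an illustration of the math only; in training, the same computation is carried out on the framework's tensors so that gradients can flow, and the (N, C)/(N, 2) input layout is our assumption.

import numpy as np


def loss_a(feat_lr, feat_lr_prime, coords_lr, coords_lr_prime, mu=0.0, sigma=2.0):
    """Alignment loss of Eq. (2): for each LR feature l_i, take the minimum over
    LR' features of the Gaussian-weighted cosine distance kappa * D, then average.
    feat_* are (N, C) / (M, C) feature matrices; coords_* are (N, 2) / (M, 2)."""
    a = feat_lr / np.linalg.norm(feat_lr, axis=1, keepdims=True)
    b = feat_lr_prime / np.linalg.norm(feat_lr_prime, axis=1, keepdims=True)
    D = 1.0 - a @ b.T                                     # cosine distances, (N, M)
    # Spatial coordinate distances D' and the Gaussian kernel of Eq. (3)
    D_prime = np.linalg.norm(coords_lr[:, None, :] - coords_lr_prime[None, :, :], axis=2)
    kappa = np.exp(-((D_prime - mu) ** 2) / (2.0 * sigma ** 2))
    return float(np.mean(np.min(kappa * D, axis=1)))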

Regarding Loss_r, since CX can be viewed as an approximation to the KL divergence and is designed for comparing images that are not spatially aligned [25], we directly apply it as a statistical constraint between feature distributions:

Loss_{r}(H_{0}, S) = \frac{1}{K_{0}} \sum_{i}^{K_{0}} \min_{j=1,\dots,G} \left( D_{h_{0_i}, s_j} \right),    (4)

where H_0 and S refer to the feature spaces generated by the conv1_2, conv2_2, and conv3_2 layers of VGG-19 (Φ_2). Thus, Loss_s = Loss_a(L, L') + Loss_r(H_0, S) aligns features with LR and fuses features from H_0 spatially.
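
Under the same conventions as the previous sketch (and reusing its loss_a), Eq. (4) and the resulting spatial term can be written as below. The nearest-neighbour form shown here is a simplification; the full contextual loss additionally normalizes distances into affinities [26].

import numpy as np


def loss_r(feat_h0, feat_sr):
    """Reference loss of Eq. (4): mean, over H0 features, of the cosine
    distance to the nearest SR feature (simplified CX form)."""
    a = feat_h0 / np.linalg.norm(feat_h0, axis=1, keepdims=True)
    b = feat_sr / np.linalg.norm(feat_sr, axis=1, keepdims=True)
    D = 1.0 - a @ b.T
    return float(np.mean(np.min(D, axis=1)))


def loss_s(feat_lr, feat_lr_prime, coords_lr, coords_lr_prime, feat_h0, feat_sr):
    """Spatial constraint: Loss_s = Loss_a(L, L') + Loss_r(H0, S)."""
    return (loss_a(feat_lr, feat_lr_prime, coords_lr, coords_lr_prime)
            + loss_r(feat_h0, feat_sr))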

Temporal constraint. We further adopt CX to emphasize important features in the temporal frames for information compensation and restoration. However, the adjacent frames are not as informative for the reconstruction as H_0. To avoid the errors introduced by adjacent frames, which would degrade and corrupt the SR performance, we define a correlation coefficient w_t to weight each neighboring frame H_t. The compensation loss for S and H_t is formulated as

Loss_{c}(H_{t}, S) = w_{t} \cdot CX(H_{t}, S).    (5)

Then, the temporal loss is defined by aggregating all the compensation losses as

Loss_{t} = \sum_{t=-T,\, t \neq 0}^{T} Loss_{c}(H_{t}, S).    (6)

In this paper, since our current dataset (15 fps) mainly covers city views with pedestrians and vehicles moving below 45 km/h, we choose T = 1 (a 3-frame clip) and w_{±1} = 0.1.
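
Putting Eq. (1), (5), and (6) together for this 3-frame setting gives the small assembly sketched below; cx_prev and cx_next stand for the contextual losses between the SR output and the previous/next HR frames at the feature level, and the default value of lam (the λ in Eq. (1)) is an assumption, since it is not specified numerically here.

def stcl(loss_s_value, cx_prev, cx_next, w=0.1, lam=1.0):
    """STCL of Eq. (1) with T = 1: Loss_t = w * CX(H_-1, S) + w * CX(H_+1, S),
    and STCL = Loss_s + lambda * Loss_t."""
    loss_t = w * cx_prev + w * cx_next    # Eq. (5)-(6) aggregated over t = ±1
    return loss_s_value + lam * loss_t    # Eq. (1)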

5 Experimental Setup

16-bit LR raw and 8-bit HR RGB videos are adopted to train a 4X SR model. We first randomly choose 80 clips from different video pairs in VideoRAW, comprising 4000 image pairs, for training, validation, and testing, with a split ratio of roughly 45:10:45. We then randomly crop 160×160 and 640×640 consecutive patches from the LR and HR clips as input for training. A 16-layer ResNet [10] based SRResNet [23] without batch normalization [37] is adopted as the SR architecture. We use a batch size of one, so in our spatio-temporal model one LR Bayer mosaic is paired with three consecutive HR RGB ground-truth frames in each iteration. We implement the proposed network in TensorFlow 1.9 and train it on an NVIDIA Tesla V100. The model is trained for 200,000 iterations, with validation performed every 1,000 iterations. Parameters are optimized by the Adam optimizer [19] using an initial learning rate of 1e-4, β_1 = 0.9, β_2 = 0.999, and ε = 1e-8.
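
For reference, the optimizer settings above correspond to the following TensorFlow 1.x configuration sketch; the loss tensor name (stcl_loss) and the rest of the graph are hypothetical placeholders, not part of any released code.

import tensorflow as tf  # TensorFlow 1.x API, as used in the paper

# Adam settings reported in this section (configuration only; the model graph,
# the STCL loss tensor and the data pipeline are omitted).
optimizer = tf.train.AdamOptimizer(learning_rate=1e-4,
                                   beta1=0.9, beta2=0.999, epsilon=1e-8)
# train_op = optimizer.minimize(stcl_loss)  # stcl_loss: hypothetical loss tensor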

Given that existing video SR models are not designed for realistic datasets with misalignment issues, in the baselines we investigate the feasibility of achieving spatio-temporal coupling from the perspective of the loss function. We first compare the proposed spatio-temporal coupling approach with loss functions that rely on a distribution constraint, a weak-spatial constraint, and a spatial constraint, respectively; here we regard pixel-wise methods as weak-spatial under misalignment. Then we conduct an ablation study on our model variants with respect to temporal compensation. After that, we integrate our framework into other deep learning architectures for generalization testing. Finally, we investigate the extensibility of our approach to more challenging video zoom tasks through perceptual experiments. All comparisons are conducted on three randomly selected scenes, each containing 4 clips with 200 frames.

5.1 Baselines

For comparison, we choose several representative loss functions used in SR methods with different spatial and temporal concerns: CX [26], which compares statistical feature distributions without considering spatial or temporal correlations; L_2 [23], the most widely used pixel-wise spatial constraint in state-of-the-art SR approaches; and CoBi [48], an effective spatial-constraint loss used in realistic SR. All baselines are integrated into the SRResNet architecture and re-trained on the proposed training dataset.

6 Results and Discussions

Figure 6: Our zoom results show better perceptual quality than the baseline methods in different scenes.

6.1 Quantitative Evaluation

To evaluate our method and the baselines, we adopt three evaluation metrics: pixel-based PSNR, structure-based SSIM, and learning-based LPIPS. Unlike PSNR and SSIM, a lower LPIPS score indicates better image quality.
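
A sketch of how these three metrics can be computed for an SR/HR pair is given below. The use of scikit-image for PSNR/SSIM and the lpips package for LPIPS is our assumption; the paper does not state its exact implementations.

import torch
import lpips                                   # pip install lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity


def evaluate_pair(sr_rgb, hr_rgb, lpips_fn):
    """Compute PSNR, SSIM and LPIPS for 8-bit HxWx3 uint8 arrays.
    lpips_fn is an lpips.LPIPS(net='alex') instance, created once and reused."""
    psnr = peak_signal_noise_ratio(hr_rgb, sr_rgb, data_range=255)
    ssim = structural_similarity(hr_rgb, sr_rgb, channel_axis=-1, data_range=255)
    to_tensor = lambda x: torch.from_numpy(x).permute(2, 0, 1)[None].float() / 127.5 - 1.0
    lp = lpips_fn(to_tensor(sr_rgb), to_tensor(hr_rgb)).item()   # lower is better
    return psnr, ssim, lp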

The relative performance of the different methods on the testing data is listed in Table 1. In general, the proposed method outperforms the others on most evaluation metrics and scenes, while CX, which enforces constraints on the feature distribution only, performs worst among the learning-based methods. Specifically, both our method and CoBi achieve higher PSNR than L_2. The L_2 loss usually attains particularly high PSNR in SR due to its pixel-to-pixel mapping via MSE; however, in the realistic case where LR and HR are misaligned, such pixel-wise mapping introduces errors and noise into learning, which yields lower PSNR. This indicates that a pixel-wise loss cannot perform effective mapping between LR and misaligned HR. In terms of LPIPS, since pixel-wise optimization often lacks high-frequency content, it produces perceptually unsatisfying results with overly smooth textures; our model and CoBi again perform much better than L_2 in all scenes. Moreover, by adopting spatio-temporal coupling, our model outperforms CoBi in all cases. These results verify the effectiveness of introducing a temporal component into zoom quality enhancement.

Table 1: Performance comparison on digital zoom tasks under different scenes. For metrics marked '↑', higher values indicate better image quality; for '↓', lower is better.
Scene       | #1                       | #2                       | #3
Method      | PSNR↑   SSIM↑   LPIPS↓   | PSNR↑   SSIM↑   LPIPS↓   | PSNR↑   SSIM↑   LPIPS↓
Bicubic     | 12.5652 0.4584  0.6525   | 11.5330 0.4928  0.5081   | 11.6146 0.4637  0.5888
CX [26]     | 24.4284 0.6652  0.3900   | 24.2536 0.7503  0.3582   | 25.0389 0.7134  0.3355
L_2 [23]    | 29.4006 0.8034  0.3456   | 26.8314 0.8419  0.3011   | 26.5033 0.7914  0.3208
CoBi [48]   | 29.4692 0.8131  0.2336   | 27.9387 0.8272  0.2207   | 27.1417 0.7759  0.2442
Ours        | 30.2093 0.8216  0.2213   | 27.9551 0.8311  0.2114   | 27.4081 0.7930  0.2391

6.2 Qualitative Evaluation

A qualitative comparison of our model against the baselines in all three scenes is shown in Figure 6; the scenes include moving vehicles and pedestrians. Since direct bicubic upsampling of the LR (2nd column) produces a very blurry appearance, and CX (4th column) yields strong artifacts caused by inappropriate feature matching, we mainly focus on the comparison among the 'weak-spatial' L_2, the 'spatial' CoBi, and our 'spatio-temporal' method.

In scene 1, we focus on the characters (1st row) and the vegetation (2nd row). Due to its weak-spatial mapping, L_2 produces very blurry characters. Although CoBi can generate characters as clear as ours, our method achieves sharper edges and finer textures for the high-frequency texture on the wall behind them; as for the vegetation, our method yields the most consistent visual result without any artifacts. In scene 2, our method super-resolves the zebra crossing (3rd row) with higher quality, while CoBi appears too pale with limited contrast. Regarding the number plate on the moving vehicle (4th row), none of the results, including ours, is very clear, which we attribute to high noise and signal loss in the original LR image, yet our method still generates the best result. In scene 3, our method produces a very clear appearance for the distant guideboard (5th row) and the wall (6th row), while L_2 is too blurry to reveal the details and CoBi yields additional unnatural 'stripe' artifacts on the vegetation. These qualitative results demonstrate the effectiveness of our method for realistic SR in different scenarios.

6.3 Ablation Analysis

To further investigate the effectiveness of the proposed spatio-temporal coupling method, we conduct an ablation study using two variants: one with temporal compensation provided by multi-frame videos and the other without. The relative quantitative comparison is shown in Table 2.

Table 2: Performances of variants with or without temporal compensation. ’T’ refers to temporal compensation.
Scene Method PSNR\uparrow SSIM\uparrow LPIPS\downarrow
#1 Ours(-T) 29.7731 0.7865 0.2254
Ours(+T) 30.2093 0.8216 0.2213
#2 Ours(-T) 27.7265 0.8105 0.2123
Ours(+T) 27.9551 0.8311 0.2114
#3 Ours(-T) 27.0743 0.7462 0.2309
Ours(+T) 27.4081 0.7930 0.2391

The results reveal that, with the help of spatio-temporal coupling, the reconstruction quality improves on multiple metrics, which consolidates the value of temporal compensation. In particular, SSIM improves by about 4.4% (0.8152 vs. 0.7811) on average.
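
For reference, the 4.4% figure follows from averaging the SSIM columns of Table 2 over the three scenes:

\overline{\mathrm{SSIM}}_{+T} = \tfrac{1}{3}(0.8216 + 0.8311 + 0.7930) \approx 0.8152, \qquad
\overline{\mathrm{SSIM}}_{-T} = \tfrac{1}{3}(0.7865 + 0.8105 + 0.7462) \approx 0.7811, \qquad
\frac{0.8152 - 0.7811}{0.7811} \approx 4.4\%.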

6.4 Generalization Ability

Table 3: Generalization ability analysis using existing deep learning architectures.
Archi. Method PSNR\uparrow SSIM\uparrow LPIPS\downarrow
EDSR [24] CX [26] 25.7379 0.7828 0.3679
Ori 26.6179 0.7798 0.3724
CoBi [48] 27.7801 0.7918 0.3405
Ours 27.8733 0.7953 0.3248
DCSCN [42] CX [26] 25.2992 0.8035 0.3488
Ori 26.5322 0.8318 0.3471
CoBi [48] 27.5385 0.8226 0.3130
Ours 27.8891 0.8228 0.2946
FEQE [34] CX [26] 25.7089 0.7937 0.3374
Ori 27.0253 0.8287 0.3623
CoBi [48] 27.7957 0.8360 0.3134
Ours 27.9870 0.8376 0.2913

To evaluate the generalizability of the proposed method, we further integrate our framework into additional existing deep learning architectures: EDSR [24], DCSCN [42], and FEQE [34]. For comparison, 'Ori' refers to the weak-spatial loss function applied in the original paper of each architecture. The results in Table 3, averaged over the three scenes, confirm the generalization ability of the proposed method.

6.5 Perceptual Experiments for Video Zoom

Table 4: Perceptual experiments show that our results are significantly preferred on video zoom tasks.
Scene Preference Rate
CX [26]   L_2 [23]   CoBi [48]   Ours   No preference
#1 3.33% 6.67% 15.00% 53.33% 21.67%
#2 1.67% 23.33% 8.33% 55.00% 11.67%
#3 1.67% 18.33% 15.00% 50.00% 15.00%

Moreover, to demonstrate that the proposed method has a favorable capability for video zoom, we evaluate the perceptual quality of the generated videos through blind testing. In each inquiry, we present the participants with three videos (200 frames each) taken from different scenes. At every frame, the ground-truth image and the corresponding images generated by the baseline models (CX, L_2, CoBi) and ours are arranged side by side, without revealing which result is 'Ours' and which is a baseline. The participants are asked to pick the result that is closest to the ground-truth video. Responses from 60 valid participants are collected and listed in Table 4. Since some occasional but noticeable artifacts in CoBi can severely influence the subjective evaluation, especially in scene #2, the videos generated by our method achieve a significantly higher preference rate under blind pairwise human judgment.

7 Conclusion

This paper investigated the effectiveness of spatio-temporal coupling for digital zoom quality enhancement. To enable training with spatio-temporal information, we collected a new dataset containing realistic LR&HR video pairs and introduced a novel loss framework for the spatio-temporal constraint. The experimental results demonstrated the potential and capability of the proposed method in solving realistic SR problems.

References

  • [1] Jose Caballero, Christian Ledig, Andrew Aitken, Alejandro Acosta, Johannes Totz, Zehan Wang, and Wenzhe Shi. Real-time video super-resolution with spatio-temporal networks and motion compensation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4778–4787, 2017.
  • [2] Jianrui Cai, Shuhang Gu, Radu Timofte, and Lei Zhang. Ntire 2019 challenge on real image super-resolution: Methods and results. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 0–0, 2019.
  • [3] Jianrui Cai, Hui Zeng, Hongwei Yong, Zisheng Cao, and Lei Zhang. Toward real-world single image super-resolution: A new benchmark and a new model. arXiv preprint arXiv:1904.00523, 2019.
  • [4] Chen Chen, Qifeng Chen, Jia Xu, and Vladlen Koltun. Learning to see in the dark. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3291–3300, 2018.
  • [5] Chang Chen, Zhiwei Xiong, Xinmei Tian, Zheng-Jun Zha, and Feng Wu. Camera lens super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1652–1660, 2019.
  • [6] Guoan Cheng, Ai Matsune, Qiuyu Li, Leilei Zhu, Huaijuan Zang, and Shu Zhan. Encoder-decoder residual network for real super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 0–0, 2019.
  • [7] Tao Dai, Jianrui Cai, Yongbing Zhang, Shu-Tao Xia, and Lei Zhang. Second-order attention network for single image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 11065–11074, 2019.
  • [8] Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. Image super-resolution using deep convolutional networks. IEEE transactions on pattern analysis and machine intelligence, 38(2):295–307, 2015.
  • [9] Michaël Gharbi, Gaurav Chaurasia, Sylvain Paris, and Frédo Durand. Deep joint demosaicking and denoising. ACM Transactions on Graphics (TOG), 35(6):191, 2016.
  • [10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [11] Alain Hore and Djemel Ziou. Image quality metrics: Psnr vs. ssim. In 2010 20th International Conference on Pattern Recognition, pages 2366–2369. IEEE, 2010.
  • [12] Takashi Isobe, Xu Jia, Shuhang Gu, Songjiang Li, Shengjin Wang, and Qi Tian. Video super-resolution with recurrent structure-detail network. In European Conference on Computer Vision, pages 645–660. Springer, 2020.
  • [13] Takashi Isobe, Songjiang Li, Xu Jia, Shanxin Yuan, Gregory Slabaugh, Chunjing Xu, Ya-Li Li, Shengjin Wang, and Qi Tian. Video super-resolution with temporal group attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
  • [14] Younghyun Jo, Seoung Wug Oh, Jaeyeon Kang, and Seon Joo Kim. Deep video super-resolution network using dynamic upsampling filters without explicit motion compensation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3224–3232, 2018.
  • [15] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European conference on computer vision, pages 694–711. Springer, 2016.
  • [16] Hamid Reza Vaezi Joze, Ilya Zharkov, Karlton Powell, Carl Ringler, Luming Liang, Andy Roulston, Moshe Lutz, and Vivek Pradeep. Imagepairs: Realistic super resolution dataset via beam splitter camera rig. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2020.
  • [17] Armin Kappeler, Seunghwan Yoo, Qiqin Dai, and Aggelos K Katsaggelos. Video super-resolution with convolutional neural networks. IEEE Transactions on Computational Imaging, 2(2):109–122, 2016.
  • [18] Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee. Accurate image super-resolution using very deep convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1646–1654, 2016.
  • [19] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [20] Thomas Köhler, Michel Bätz, Farzad Naderi, André Kaup, Andreas Maier, and Christian Riess. Bridging the simulated-to-real gap: benchmarking super-resolution on real data. arXiv preprint arXiv:1809.06420, 2018.
  • [21] Junhyung Kwak and Donghee Son. Fractal residual network and solutions for real super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 0–0, 2019.
  • [22] Wei-Sheng Lai, Jia-Bin Huang, Narendra Ahuja, and Ming-Hsuan Yang. Deep laplacian pyramid networks for fast and accurate super-resolution. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 624–632, 2017.
  • [23] Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4681–4690, 2017.
  • [24] Bee Lim, Sanghyun Son, Heewon Kim, Seungjun Nah, and Kyoung Mu Lee. Enhanced deep residual networks for single image super-resolution. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pages 136–144, 2017.
  • [25] Roey Mechrez, Itamar Talmi, Firas Shama, and Lihi Zelnik-Manor. Maintaining natural image statistics with the contextual loss. In Asian Conference on Computer Vision, pages 427–443. Springer, 2018.
  • [26] Roey Mechrez, Itamar Talmi, and Lihi Zelnik-Manor. The contextual loss for image transformation with non-aligned data. In Proceedings of the European Conference on Computer Vision (ECCV), pages 768–783, 2018.
  • [27] Chengchao Qu, Ding Luo, Eduardo Monari, Tobias Schuchert, and Jürgen Beyerer. Capturing ground truth super-resolution data. In 2016 IEEE International Conference on Image Processing (ICIP), pages 2812–2816. IEEE, 2016.
  • [28] Mehdi SM Sajjadi, Raviteja Vemulapalli, and Matthew Brown. Frame-recurrent video super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6626–6634, 2018.
  • [29] Hamid R Sheikh, John W Glotzbach, and Osman G Sezer. Methodology for generating high fidelity digital zoom for mobile phone cameras, Sept. 6 2016. US Patent 9,438,809.
  • [30] Wenzhe Shi, Jose Caballero, Ferenc Huszár, Johannes Totz, Andrew P Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1874–1883, 2016.
  • [31] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • [32] Xin Tao, Hongyun Gao, Renjie Liao, Jue Wang, and Jiaya Jia. Detail-revealing deep video super-resolution. In Proceedings of the IEEE International Conference on Computer Vision, pages 4472–4480, 2017.
  • [33] Xin Tao, Hongyun Gao, Xiaoyong Shen, Jue Wang, and Jiaya Jia. Scale-recurrent network for deep image deblurring. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8174–8182, 2018.
  • [34] Thang Vu, Cao Van Nguyen, Trung X Pham, Tung M Luu, and Chang D Yoo. Fast and efficient image quality enhancement via desubpixel convolutional neural networks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 0–0, 2018.
  • [35] Longguang Wang, Yulan Guo, Li Liu, Zaiping Lin, Xinpu Deng, and Wei An. Deep video super-resolution using hr optical flow estimation. IEEE Transactions on Image Processing, 29:4323–4336, 2020.
  • [36] Xintao Wang, Kelvin CK Chan, Ke Yu, Chao Dong, and Chen Change Loy. Edvr: Video restoration with enhanced deformable convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 0–0, 2019.
  • [37] Xintao Wang, Ke Yu, Shixiang Wu, Jinjin Gu, Yihao Liu, Chao Dong, Yu Qiao, and Chen Change Loy. Esrgan: Enhanced super-resolution generative adversarial networks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 0–0, 2018.
  • [38] Bartlomiej Wronski, Ignacio Garcia-Dorado, Manfred Ernst, Damien Kelly, Michael Krainin, Chia-Kai Liang, Marc Levoy, and Peyman Milanfar. Handheld multi-frame super-resolution. arXiv preprint arXiv:1905.03277, 2019.
  • [39] Xiaoyu Xiang, Yapeng Tian, Yulun Zhang, Yun Fu, Jan P. Allebach, and Chenliang Xu. Zooming slow-mo: Fast and accurate one-stage space-time video super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
  • [40] SHI Xingjian, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-Kin Wong, and Wang-chun Woo. Convolutional lstm network: A machine learning approach for precipitation nowcasting. In Advances in neural information processing systems, pages 802–810, 2015.
  • [41] Tianfan Xue, Baian Chen, Jiajun Wu, Donglai Wei, and William T Freeman. Video enhancement with task-oriented flow. International Journal of Computer Vision, 127(8):1106–1125, 2019.
  • [42] Jin Yamanaka, Shigesumi Kuwashima, and Takio Kurita. Fast and accurate image super resolution by deep cnn with skip connection and network in network. In International Conference on Neural Information Processing, pages 217–225. Springer, 2017.
  • [43] Chih-Yuan Yang, Chao Ma, and Ming-Hsuan Yang. Single-image super-resolution: A benchmark. In European Conference on Computer Vision, pages 372–386. Springer, 2014.
  • [44] Ren Yang, Mai Xu, Zulin Wang, and Tianyi Li. Multi-frame quality enhancement for compressed video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6664–6673, 2018.
  • [45] Peng Yi, Zhongyuan Wang, Kui Jiang, Junjun Jiang, and Jiayi Ma. Progressive fusion video super-resolution network via exploiting non-local spatio-temporal correlations. In Proceedings of the IEEE International Conference on Computer Vision, pages 3106–3115, 2019.
  • [46] Kai Zhang, Luc Van Gool, and Radu Timofte. Deep unfolding network for image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3217–3226, 2020.
  • [47] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 586–595, 2018.
  • [48] Xuaner Zhang, Qifeng Chen, Ren Ng, and Vladlen Koltun. Zoom to learn, learn to zoom. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3762–3770, 2019.
  • [49] Yulun Zhang, Kunpeng Li, Kai Li, Lichen Wang, Bineng Zhong, and Yun Fu. Image super-resolution using very deep residual channel attention networks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 286–301, 2018.