Reversing the Damage: A QP-Aware Transformer-Diffusion Approach for 8K Video Restoration under Codec Compression

Ali Mollaahmadi Dehaghi Reza Razavi Mohammad Moshirpour University of Calgary Userful Corporation University of California, Irvine Calgary, AB, Canada Calgary, AB, Canada Irvine, CA, USA [email protected] [email protected] [email protected]

Abstract

In this paper, we introduce DiQP; a novel Transformer-Diffusion model for restoring 8K video quality degraded by codec compression. To the best of our knowledge, our model is the first to consider restoring the artifacts introduced by various codecs (AV1, HEVC) by Denoising Diffusion without considering additional noise. This approach allows us to model the complex, non-Gaussian nature of compression artifacts, effectively learning to reverse the degradation. Our architecture combines the power of Transformers to capture long-range dependencies with an enhanced windowed mechanism that preserves spatio-temporal context within groups of pixels across frames. To further enhance restoration, the model incorporates auxiliary ”Look Ahead” and ”Look Around” modules, providing both future and surrounding frame information to aid in reconstructing fine details and enhancing overall visual quality. Extensive experiments on different datasets demonstrate that our model outperforms state-of-the-art methods, particularly for high-resolution videos such as 4K and 8K, showcasing its effectiveness in restoring perceptually pleasing videos from highly compressed sources. ¹¹1https://github.com/alimd94/DiQP

1 Introduction

8K video offers exceptional resolution, contrast, and motion quality, but it demands significant data and computational power for transmitting and coding [46]. With an estimated 15% of global electricity consumption attributed to information and communication technology (ICT) by 2040 [24], and video traffic accounting for 82% of global Internet traffic in 2022, efficient storage and transmission are increasingly crucial, particularly in bandwidth-limited scenarios prevalent in certain regions and demographics [17]. Video codecs offer a solution by compressing video data, but this often introduces visual artifacts like blockiness, blurring, or ringing, due to the lossy nature of compression algorithms [8]. A key parameter controlling this trade-off is the Quantization Parameter (QP), which determines the level of quantization applied to transform coefficients in the video data [42] (Fig. 1).

With the help of Video Restoration we can address this issue by recovering high-quality video sequences from their degraded compressed counterparts [30]. Video restoration, particularly for heavily compressed videos, is a highly challenging and ill-posed problem due to the inherent trade-off between compression and quality [50]. This process requires a range of techniques, including denoising [35] to remove unwanted artifacts, deblurring [34] to sharpen the frame, super-resolution [3, 19, 20] to enhance details, and crucially reducing compression artifacts – all aimed at recovering lost visual information. This task becomes even more complex when considering high-resolution videos like 8K, where the sheer volume of data exacerbates the challenges of artifact removal and quality restoration.

Refer to caption — Figure 1: As the QP increases, the overall quality decreases, leading to more noticeable artifacts. Pixels are affected differently relative to each other. We calculated the Mean Absolute Difference and PSNR for these four quality levels and also present a heatmap of the affected areas.

Our proposed model, DiQP (Fig. 2), not only introduces a novel approach to reducing video compression artifacts, but also, according to our research, is the first model specifically designed and trained for 8K videos. DiQP uniquely reverses the codec side effects by using Denoising Diffusion [15, 33]. While modern codecs such as AV1 and HEVC utilize adaptive QPs, we focus on fixed QPs to ensure robustness across varying compression levels. Unlike previous methods that add artificial noise [49], DiQP directly addresses complex compression artifacts by leveraging the inherent noise introduced during compression. DiQP features a U-shaped hierarchical network inspired by [41], with skip connections and enhanced windowed self-attention for capturing long-range dependencies in high-resolution videos while preserving local context. Look Ahead and Look Around modules further enhance temporal coherence and global awareness, while LOST embedding effectively incorporates conditional data. This combination of components allows DiQP to effectively reverse compression degradation, and significantly improve video quality restoration; particularly for 8K content.

2 Related Works

Video Diffusion Models’ applications contain a wide scope of video analysis tasks [45], including video generation [14, 23, 32], and video editing [48, 26]. The methodologies for these tasks share similarities, often formulating the problems as Diffusion generation tasks or utilizing the potent controlled generation capabilities of Diffusion models for downstream tasks [45]. In video enhancement and restoration, CaDM [49] introduces a new approach to video streaming, reducing bitrates while improving video restoration quality compared to existing methods. This is achieved by reducing frame resolution and color depth during encoding, and then utilizing a Diffusion-based restoration process at the decoder that is aware of these encoding conditions. LDMVFI [6] marks a advancement in video frame interpolation by utilizing a conditional latent Diffusion model. It features an autoencoding network tailored for video frame interpolation, incorporating efficient self-attention mechanisms and deformable kernel-based synthesis for superior performance. VIDM [4] leverages a pre-trained Latent Diffusion Model (LDM) [28] to tackle video in-painting, demonstrating the adaptability of this tool. By providing a mask for first-person perspective videos, VIDM harnesses the image completion capabilities of LDM to generate seamless in-painted videos.

Video Transformers have found applications in various domains due to their ability to model long-range dependencies efficiently [9, 21, 11]. These applications showcase the versatility and effectiveness of Video Transformers in various video processing tasks [31]. In the restoration task, VRT [19], a new Video Restoration Transformer, allows for parallel processing of long video sequences and models long-range dependencies for video restoration. VRT jointly extracts, aligns, and fuses features at multiple scales using a novel mutual attention mechanism, achieving great performance in various video restoration tasks. RVRT [20], a recurrent video restoration transformer, combines the strengths of parallel and recurrent methods for efficient and effective video restoration. It processes video clips jointly, utilizes a larger hidden state to alleviate information loss, and introduces a novel guided deformable attention mechanism for accurate video clip alignment.

Video Restoration has gained significant attention in recent years. FTVSR [27] uses frequency-based patch representations and attention mechanisms to address the challenges of compressed video restoration. This approach preserves high-frequency details and leverages low-frequency information to guide high-frequency texture generation, effectively reducing compression artifacts. BasicVSR++ [3] improves video super-resolution using two main techniques: second-order grid propagation, which allows for more flexible information flow and aggregation across frames, and flow-guided deformable alignment, which utilizes optical flow to refine feature alignment across misaligned frames. These enhancements lead to better utilization of spatio-temporal information and improve overall performance. CAVSR [38] is designed to enhance video super-resolution specifically for compressed videos. It incorporates a compression encoder to assess compression levels in frames, using metadata such as frame type and motion vectors. This information is then used to modulate a base VSR model, enabling adaptive handling of various compression levels. The model further utilizes metadata like residual maps for accurate frame alignment, enhancing the bidirectional recurrent network’s performance. In addition to the aforementioned multi-frame-based models, VCISR [37] introduces an approach for the blind single image super-resolution (SISR) task that focuses on enhancing single-frame input affected by video compression artifacts, relying solely on spatial information.

3 Preliminary

Before delving into the specifics of our work, it is important to discuss the inspirations behind this idea. Deep generative models, such as Denoising Diffusion Probabilistic Models (DDPMs) [15], offer a compelling alternative to GANs [12, 7] without the need for adversarial training, careful optimization, or the risk of missing parts of the data distribution. DDPMs achieve this by training denoising models to progressively transform Gaussian noise into images through a Markov chain process, providing a stable and effective generative approach. While DDPMs produce high-quality images through a lengthy generative process, this method requires numerous iterations, making it significantly slower than GANs. For instance, generating 50,000 images of size 256 × 256 can take nearly 1,000 hours on a Nvidia 2080 Ti GPU, which becomes increasingly impractical for larger resolutions [33]. To address this efficiency gap, Denoising Diffusion Implicit Models (DDIMs) [33] were introduced as a more efficient alternative to DDPMs. DDIMs generalize the forward Diffusion process from the Markovian framework used in DDPMs to non-Markovian processes. This allows for the creation of ”short” generative Markov chains that can produce high-quality samples in far fewer steps. Cold Diffusion [2] further explores the boundaries of Diffusion models by eliminating the reliance on Gaussian noise or randomness altogether. Instead of using noise, it leverages arbitrary image transformations and degradations, training a restoration network to reverse these transformations. This approach challenges the traditional theoretical frameworks of Diffusion models and opens the door to new types of generative models with distinct properties compared to conventional methods. Moreover, the use of Gaussian noise schedules not only prevents Stable Diffusion models [28] from generating images with mean brightness greater or less than 0 (on a scale of -1 to 1), but also proves to be an overextension of model’s capacity. This is particularly true for restoration tasks, where the model must remove both artificially added Gaussian noise and existing artifacts [39]. Since compressed frames are not a natural intermediate step in the vanilla Diffusion process, the restoration process does not need to start from pure noise, nor does it require a large number of inference steps or a large model size—advantages that are critical for real-world applications. With all this in mind, we present DiQP as a proof of concept for using Denoising Diffusion to directly address the complex artifacts introduced by video compression in 8K resolution without adding artificial Gaussian noise [49].

4 Methodology

Big Picture: DiQP consists of 4 different parts which are explained in depth in this section. But first let’s revisit the problem. Let $F_{\text{raw}}\in\mathbb{R}^{T\times H\times W\times C}$ , be a sequence of raw target frames without added artifacts and distortions. $T,H,W,C$ are the frame number, height, width and channel number, respectively. Now if we consider $CODEC_{\text{Decode}}(CODEC_{\text{Encode}}(F_{\text{raw}},QP))=F_{\text{qp}}=F_{\text{raw}}+\text{noise}_{\text{qp}}$ , our model aims to predict the $\text{noise}_{\text{qp}}$ as accurately as possible. Therefore, we formulate our model as $DiQP(F_{\text{iqp}},QP,Z)=\text{noise}_{\text{qp}}^{\prime}$ , where $F_{\text{iqp}}$ is a randomly selected window from the original 8K frame, calculated by the Hadamard product of $F_{\text{qp}}$ and a random binary mask $M^{h\times w}$ : $F_{\text{iqp}}=F_{\text{qp}}\circ M$ . The reconstructed output frame sequence $F_{\text{res}}$ is then obtained as $F_{\text{res}}=F_{\text{iqp}}+\text{noise}_{\text{qp}}^{\prime}$ . Additional input Z includes both conditional inputs and the inputs specifically for the Look Ahead and Look Around modules, which are discussed in detail later in this section.

For a fair comparison with existing methods, we use the commonly adopted Charbonnier loss [5] between the reconstructed frame sequence and the ground truth or raw sequence, defined as:

\mathcal{L}=\sqrt{\|F_{\text{res}}-(F_{\text{raw}}\circ M)\|^{2}+\epsilon^{2}}

(1)

and $\epsilon=10^{-3}$ is a constant in all the experiments.

4.1 Look Around

Due to the large size of 8K frames (7680 * 4320) and the limited coverage of the model’s randomly selected (512 * 512) window which is less than 1% of the whole frame, we introduce a supplementary model called Look Around (Fig. 3) to provide the Core model with a better understanding of the surrounding context. To achieve this, we feed the Look Around model, which consists of Temporal Separable CNN [44] blocks, with bicubically down sampled ( $DS$ ) versions of the original 8K frames. The extracted features from this model are then used to guide the Core model. The Look Around model consists of K blocks of these temporal CNNs, with the output of each block being added separately to the corresponding level of the Core’s Encoder part. This addition helps the Encoder better understand the input. The entire process can be formulated as follows:

LookAround(DS(F_{\text{qp}}^{7680\times 4320}))=L_{\text{ar}}

(2)

4.2 Look Ahead

Since our model employs a sliding window-based method which is not easily scalable to long sequence modeling [19] and due to the challenges associated with estimating optical flow, particularly in highly compressed videos where inaccurate predictions and increased computational overhead can negatively impact the performance [38], we introduce an auxiliary model called Look Ahead (Fig. 3) to address this limitation and enhance the model’s ability to anticipate future events and changes. The Look Ahead model takes the down-sampled of next T frame from the last frame in the input sequence and extracts informative features for the Core model. In addition to that frame, the Look Ahead model is also fed the same window coordination of input from the future frame. These two groups of data are processed separately and then concatenated ( $\oplus$ ). Unlike the Look Around model, these extracted features are added to the corresponding levels of the Core’s Decoder part. This addition enhances the decoder’s restoration abilities. Furthermore, a weight decay factor ( $WDF$ ) is incorporated to control the influence of the Look Ahead model. This decay factor proves particularly beneficial when processing the last T frames of the clip, as the last frame is used as input for these frames. The entire process can be formulated as follows: Let’s denote the input frame set as $\{F_{\text{qp1}},F_{\text{qp2}},\ldots,F_{\text{qpn}}\}$ where $F_{\text{qpn}}$ represents the last frame in the input set. The frame of interest is then $F_{\text{qp(n+T)}}$

\begin{split}WDF&=\frac{\text{Total Frame Numbers}-\text{Middle Frame Position}}{\text{Total Frame Numbers}}\end{split}

(3)

\begin{split}&\text{LookAhead}\Big{(}\underbrace{DS(F_{\text{qpn}}^{7680\times 4320})\oplus DS(F_{\text{qp(n+T)}}^{7680\times 4320})}_{\text{Group 1}},\\ &\qquad\underbrace{(F_{\text{iqpn}}\oplus F_{\text{qp(n+t)}}\circ M)}_{\text{Group 2}}\Big{)}\times WDF=L_{\text{ah}}\end{split}

(4)

To determine the optimal temporal window size (T) for the Look Ahead model, we conducted an experiment analyzing how the input changes as the window size varies. Specifically, we randomly selected a frame, referred to as N, and subtracted it from each subsequent frame, up to the last frame (N+1 to frame 300). For each subtraction result, we calculated the minimum, maximum, total number of non-zero pixels, and the average. This process was repeated for various window sizes to evaluate the differences and identify the optimal temporal resolution.

Our results, as shown in Fig. 4, indicate that the most significant changes occur at a window size of 50. The magnitude of change between window sizes 1 to 50 is considerably greater than that of between 50 to 299. Furthermore, the first derivative of the mean change, also depicted in Fig. 4(c), approaches zero around window size 50.

4.3 LOST embedding

When utilizing Denoising Diffusion for restoration tasks, a key challenge lies in effectively incorporating both degraded data and additional conditional data such as Diffusion time steps into the model. An improved conditional framework can significantly enhance the generative potential of Denoising Diffusions, guiding them towards producing realistic output that accurately matches the original sources [47]. Thus, we employ an alternative approach to fully leverage the capabilities of the Diffusion model.

We introduce two types of conditional data to our model: 1) LOcation and 2) Diffusion STep, collectively referred to as LOST! (Fig. 5). Location data encompasses the index of the middle frame within the entire clip, as well as the height and width of the input window in both original and down scaled sizes (used in Look Ahead and Look Around). For instance, the scaled crop points can be calculated as:

\begin{split}\text{Scale}_{d}&=\frac{\text{WindowSize}_{d}}{\text{OriginalSize}_{d}},\\ \text{CropPoint}_{\text{Downscaled},d}&=\left\lfloor\text{CropPoint}_{\text{Original},d}\times\text{Scale}_{d}\right\rfloor,\\ &\quad\quad\quad\quad d\in\{\text{Width, Height}\}\end{split}

(5)

The step embedding is simply the Quantization Parameter (QP) used for video encoding. For each of these six values, we have a dedicated embedding that is trained independently. After obtaining and concatenating these embeddings into a larger vector, we pass the vector through a neural network (NN) with SiLU activation [10] to produce a more informative and compact embedding. This embedding is then concatenated with the output of each block, serving as guidance and conditioning for subsequent blocks to optimize the model’s performance. However, since the NN output is a vector, we reshape it into a matrix with dimensions matching the core’s kernel size and replicate it horizontally and vertically to align with the corresponding block sizes (Algorithm 1).

Algorithm 1 LOST

0: Conditional information (step, frame number, window starting point scaled and original) and Size of the final output

0: Processed and encoded location and step information

1: Apply embedding function to all conditional information

2: Concatenate all embedded tensors on the last dimension

3: Apply the NN on the embedded tensors

4: Reshape the output from shape

(1,L)

(1,l,l)

where

l=\sqrt{\text{Size}}

5: Repeat the reshaped tensor along new dimensions

6: Create a new dimension with size

k=\text{Size}/l

by replicating each element in the first reshaped dimension and the second reshaped dimension

l

times

7: The resulting tensor will have the shape

(1,k\times l,k\times l)

4.4 Core Model

The Core model is a U-shaped hierarchical network with skip connections between the encoder and the decoder. To be specific, given a triplet of degraded frames $F_{\text{iqp}}\in\mathbb{R}^{3\times H\times W\times C}$ , the Core model first applies a 3D convolutional layer with LeakyReLU and a kernel size of 3 to extract low-level features. Next, following the design of the U-shaped structures [16, 29], the feature maps are passed through K encoder stages. Each stage contains a stack of the Transformer blocks and one down-sampling layer. The output of each stage is then concatenated with the output of $K$ -th layer of Look Around and LOST before going through down sampling. In the down-sampling layer, we first reshape the flattened features into 3D spatial-temporal feature maps, and then down-sample the maps. Then, a bottleneck stage with a stack of Transformer blocks is added at the end of the encoder. In this stage, only LOST is concatenated with the output of each block. For feature reconstruction, the decoder also contains K stages. Each consists of an up-sampling layer and a stack of Transformer blocks similar to the encoder. After that, the features input to the Transformer blocks are the concatenation of the up-sampled features and the corresponding features from the encoder through skip-connection and the output of $K$ -th layer of the Look Ahead model and LOST. Next, the Transformer blocks are utilized to learn to restore the frames. After the K decoder stages, we reshape the flattened features to 3D feature maps and apply a 3D convolution layer with kernel size of 3 to obtain added artifacts and distortions to remove from the frames. Due to the high computational cost of the standard Transformer architecture [36, 9] and its limitations in capturing local dependencies [43, 18], we created a spatio-temporal compatible Transformer block based on the Locally enhanced Window (LeWin) Transformer introduced by [41]. This block benefits from two key designs: Window-based Multiheaded Self-Attention, which performs self-attention within non-overlapping local windows, significantly reducing computational cost, and an enhanced Feed-Forward Network that leverages local context.

5 Experiments

Datasets:

We used the SEPE8K dataset [1] for our training. This dataset comprises 40 different 8K (8192 x 4320) video sequences, each captured at a framerate of 29.97 frames per second (FPS) with a duration of 10 seconds. We randomly split the dataset into 30, 5, and 5 sequences for training, testing ²²2Sequences 5, 11, 21, 23, 26, and validation, respectively. Using ffmpeg with the help of NVIDIA A6000 Ada GPU, we created frames from encoded videos using two codecs, HEVC/H.265 and AV1, with varying Quantization Parameters (QPs). For HEVC, we used QPs ranging from 3 to 51 (maximum) with a step size of 3, resulting in 17 quality levels. For AV1, we used QPs from 3 to 255 (maximum) with the same step size, yielding 85 quality levels. The total data occupied approximately 40 TB of storage. For training, we divided each video into 100 non-overlapping segments, each containing three frames. After loading the frames, we randomly selected 512*512 non-tile-wise window crops to prevent probable boundary artifacts. To broaden the evaluation of our model and ensure a fair comparison, we also tested it on the UVG 4K dataset [25], specifically selecting videos with a duration of 12 seconds, given the very limited availability of 8K datasets.

Implementation Details:

Due to the performance gap between HEVC and AV1 in the high-resolution domain [46], we trained the same model on each codec separately. Training was conducted on a server with 8 NVIDIA A100 GPUs, taking 40 epochs for AV1 and 200 epochs for HEVC. The total training time, including experiments for the ablation study, was 40 days.

Following the common training strategy for Transformers [36], we employed the AdamW optimizer [22] with momentum terms of (0.9, 0.999) and a weight decay of 0.02. We also applied a learning rate warmup [13] for approximately 3% of the initial epochs.

Evaluation Metrics:

We adopt the commonly-used PSNR and SSIM [40] metrics to evaluate the restoration performance. These metrics are calculated in the RGB.

Table 1: Performance Comparison on SEPE8K and UVG with AV1 (QP = 255) and HEVC (QP = 51) codecs

(a) SEPE8K with AV1 (QP = 255)

Model	PSNR	SSIM
DiQP	34.868	0.8611
FTVSR	33.201	0.8525
RVRT	33.113	0.8577
VRT	33.101	0.8501
BasicVSR++	33.105	0.8469
Degraded Input	33.095	0.8463

(b) SEPE8K with HEVC (QP = 51)

Model	PSNR	SSIM
DiQP	34.197	0.8538
FTVSR	32.316	0.8472
RVRT	32.281	0.8491
VRT	32.213	0.8464
BasicVSR++	32.211	0.8395
Degraded Input	32.206	0.8393

Model	PSNR	SSIM
DiQP	32.551	0.8662
FTVSR	31.880	0.8511
RVRT	31.711	0.8545
VRT	31.708	0.8506
BasicVSR++	31.706	0.8463
Degraded Input	31.702	0.8461

(d) UVG with HEVC (QP = 51)

Model	PSNR	SSIM
DiQP	31.965	0.8590
FTVSR	31.289	0.8485
RVRT	31.271	0.8498
VRT	31.270	0.8481
BasicVSR++	31.269	0.8430
Degraded Input	31.267	0.8426

5.1 Comparison

We selected the four representative methods in video restoration (VRT [19], RVRT [20], BasicVSR++ [3], and FTVSR) as baselines to compare our model. We present quantitive comparison results between DiQP and baselines in Table 1(d). The test was conducted with the maximum QP available for both codecs. To develop a better understanding, we also added the metrics for degraded input. As we can see, DiQP achieves the best performance on SEPE8K and UVG for both codecs. Compared with the baseline models, it improves the PSNR by significant margins of 1.77 to 1.99dB in SEPE8K and 0.84 to 0.69dB in UVG. For comparison with UVG, due to the fixed dimensions of the LOST embedding (learned specifically for 8K domains), we had to upsample the UVG 4K frames to 8K using bicubic interpolation before performing restoration. After restoration, we downscaled the results and compared them with the original raw frames. This process likely affected the overall results, as some fine details may have been lost during the upscaling and downscaling steps. In Table 2 we present a comparison of the model parameters and runtime across different methods, highlighting that DiQP, despite having the highest number of parameters, achieves the fastest runtime. The visual comparisons of different methods shown in Fig. 6(a) and Fig. 6(b) indicate that DiQP generates smoother and more clear HQ frames with removed artifacts, while other methods fail to restore fine textures and details. The second best performing model here is FTVSR because it has a better understanding of compression side effects on video. Due to limited space, additional results are provided in the supplementary material.

5.2 Ablation Study

Understanding the Role of Auxiliary Models

We conducted an ablation study to evaluate the impact of Look Ahead and Look Around models on the overall performance. Due to computational constraints, we focused our analysis on comparing our complete, fully-featured DiQP model with a simplified version lacking the Look Ahead and Look Around modules. This targeted comparison allowed us to isolate the contributions of these two modules and better understand their role in achieving the final performance of the complete model. In this experiment, both models were trained under identical conditions for 10 epochs. We then analyzed their output quality by calculating the PSNR between the generated results and the ground truth. Notably, after 10 epochs, we observed a significant difference of approximately 3dB in PSNR between the two models as illustrated in Fig. 7.

Table 2: Model Parameters and Runtime Comparison

Method	Params (M)	Runtime (ms)
DiQP	79.37	125
FTVSR	10.8	499
RVRT	10.8	321
VRT	35.6	427
BasicVSR++	7.3	135

6 Conclusion

DiQP is a novel Transformer-Diffusion model for 8K video restoration; specifically addressing the complex artifacts introduced by codec compression. By viewing the restoration process itself as a Deonising Diffusion model and leveraging the Quantization Parameter (QP) as the Diffusion step, we successfully applied this powerful framework to the challenging task of video restoration. Our approach demonstrates superior performance in restoring high-resolution videos from heavily compressed sources. The experimental results highlight the effectiveness of our model in recovering fine details and improving overall visual quality compared to other existing models. For future work, we plan to work on a more compatible LOST embedding since it is currently designed only for 8K resolution, and adapting it for other resolutions is challenging.

References

[1] Tariq Al Shoura, Ali Mollaahmadi Dehaghi, Reza Razavi, Behrouz Far, and Mohammad Moshirpour. Sepe dataset: 8k video sequences and images for analysis and development. In Proceedings of the 14th ACM Multimedia Systems Conference, MMSys ’23, page 463–468, New York, NY, USA, 2023. Association for Computing Machinery.
[2] Arpit Bansal, Eitan Borgnia, Hong-Min Chu, Jie S. Li, Hamid Kazemi, Furong Huang, Micah Goldblum, Jonas Geiping, and Tom Goldstein. Cold diffusion: Inverting arbitrary image transforms without noise. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
[3] Kelvin C.K. Chan, Shangchen Zhou, Xiangyu Xu, and Chen Change Loy. Basicvsr++: Improving video super-resolution with enhanced propagation and alignment. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 2022-June, 2022.
[4] Matthew Chang, Aditya Prakash, and Saurabh Gupta. Look ma, no hands! agent-environment factorization of egocentric videos. Advances in Neural Information Processing Systems, 36:21466–21486, 12 2023.
[5] P. Charbonnier, L. Blanc-Féraud, G. Aubert, and M. Barlaud. Two deterministic half-quadratic regularization algorithms for computed imaging. Proceedings - International Conference on Image Processing, ICIP, 2:168–172, 1994.
[6] Duolikun Danier, Fan Zhang, and David Bull. Ldmvfi: Video frame interpolation with latent diffusion models. Proceedings of the AAAI Conference on Artificial Intelligence, 38:1472–1480, 3 2023.
[7] Ali Mollaahmadi Dehaqi, Vahid Seydi, and Yeganeh Madadi. Adversarial image caption generator network. SN Computer Science, 2(3):182, 2021.
[8] Jianing Deng, Li Wang, Shiliang Pu, and Cheng Zhuo. Spatio-temporal deformable convolution for compressed video quality enhancement. In AAAI 2020 - 34th AAAI Conference on Artificial Intelligence, 2020.
[9] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. ICLR 2021 - 9th International Conference on Learning Representations, 2021.
[10] Stefan Elfwing, Eiji Uchibe, and Kenji Doya. Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural Networks, 107:3–11, 11 2018.
[11] Rohit Girdhar, João João Carreira, Carl Doersch, and Andrew Zisserman. Video action transformer network. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 244–253, 2019.
[12] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks, 2014.
[13] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He Facebook. Accurate, large minibatch sgd: Training imagenet in 1 hour, 6 2017.
[14] Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P. Kingma, Ben Poole, Mohammad Norouzi, David J. Fleet, and Tim Salimans. Imagen video: High definition video generation with diffusion models, 10 2022.
[15] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems, volume 2020-December, 2020.
[16] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5967–5976, 2017.
[17] Janusz Klink. A method of codec comparison and selection for good quality video transmission over limited-bandwidth networks. Sensors, 21(13), 2021.
[18] Yawei Li, Kai Zhang, Jiezhang Cao, Radu Timofte, and Luc Van Gool. Localvit: Bringing locality to vision transformers, 2021.
[19] Jingyun Liang, Jiezhang Cao, Yuchen Fan, Kai Zhang, Rakesh Ranjan, Yawei Li, Radu Timofte, and Luc Van Gool. Vrt: A video restoration transformer. IEEE Transactions on Image Processing, 33, 2024.
[20] Jingyun Liang, Yuchen Fan, Xiaoyu Xiang, Rakesh Ranjan, Eddy Ilg, Simon Green, Jiezhang Cao, Kai Zhang, Radu Timofte, and Luc Van Gool. Recurrent video restoration transformer with guided deformable attention. In Advances in Neural Information Processing Systems, volume 35, 2022.
[21] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 9992–10002, 2021.
[22] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. 7th International Conference on Learning Representations, ICLR 2019, 11 2017.
[23] Zhengxiong Luo, Dayou Chen, Yingya Zhang, Yan Huang, Liang Wang, Yujun Shen, Deli Zhao, Jingren Zhou, and Tieniu Tan. Notice of removal: Videofusion: Decomposed diffusion models for high-quality video generation. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10209–10218, 6 2023.
[24] Laura U. Marks and Radek Przedpełski. The carbon footprint of streaming media: Problems, calculations, solutions. Palgrave Studies in Media and Environmental Communication, pages 207–234, 2022.
[25] Alexandre Mercat, Marko Viitanen, and Jarno Vanne. Uvg dataset: 50/120fps 4k sequences for video codec analysis and development. MMSys 2020 - Proceedings of the 2020 Multimedia Systems Conference, pages 297–302, 5 2020.
[26] Chenyang Qi, Xiaodong Cun, Yong Zhang, Chenyang Lei, Xintao Wang, Ying Shan, and Qifeng Chen. Fatezero: Fusing attentions for zero-shot text-based video editing. Proceedings of the IEEE International Conference on Computer Vision, pages 15886–15896, 3 2023.
[27] Zhongwei Qiu, Huan Yang, Jianlong Fu, and Dongmei Fu. Learning spatiotemporal frequency-transformer for compressed video super-resolution. In Shai Avidan, Gabriel Brostow, Moustapha Cissé, Giovanni Maria Farinella, and Tal Hassner, editors, Computer Vision – ECCV 2022, pages 257–273, Cham, 2022. Springer Nature Switzerland.
[28] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bjorn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 2022-June, 2022.
[29] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 9351:234–241, 2015.
[30] Claudio Rota, Marco Buzzelli, Simone Bianco, and Raimondo Schettini. Video restoration based on deep learning: a comprehensive survey. Artificial Intelligence Review, 56(6):5317–5364, 2023.
[31] Javier Selva, Anders S. Johansen, Sergio Escalera, Kamal Nasrollahi, Thomas B. Moeslund, and Albert Clapes. Video transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, page 1–20, 2023.
[32] Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, Devi Parikh, Sonal Gupta, and Yaniv Taigman. Make-a-video: Text-to-video generation without text-video data, 9 2022.
[33] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations, 2021.
[34] Shuochen Su, Mauricio Delbracio, Jue Wang, Guillermo Sapiro, Wolfgang Heidrich, and Oliver Wang. Deep video deblurring for hand-held cameras. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 237–246, 2017.
[35] Gregory Vaksman, Michael Elad, and Peyman Milanfar. Patch craft: Video denoising by deep modeling and patch matching. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 2137–2146, 2021.
[36] Ashish Vaswani, Google Brain, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
[37] Boyang Wang, Bowen Liu, Shiyu Liu, and Fengyu Yang. Vcisr: Blind single image super-resolution with video compression synthetic data. arXiv preprint arXiv:2311.00996, 2023.
[38] Yingwei Wang, Takashi Isobe, Xu Jia, Xin Tao, Huchuan Lu, and Yu Wing Tai. Compression-aware video super-resolution. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 2023-June, 2023.
[39] Y. Wang, Y. Yu, W. Yang, L. Guo, L. Chau, A. C. Kot, and B. Wen. Exposurediffusion: Learning to expose for low-light image enhancement. In 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 12404–12414, Los Alamitos, CA, USA, oct 2023. IEEE Computer Society.
[40] Zhou Wang, Alan Bovik, Hamid Sheikh, and Eero Simoncelli. Image quality assessment: From error visibility to structural similarity. Image Processing, IEEE Transactions on, 13:600 – 612, 05 2004.
[41] Zhendong Wang, Xiaodong Cun, Jianmin Bao, Wengang Zhou, Jianzhuang Liu, and Houqiang Li. Uformer: A general u-shaped transformer for image restoration. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 2022-June, 2022.
[42] Ching Yu Wu and Po Chyi Su. A content-adaptive distortion-quantization model for h.264/avc and its applications. IEEE Transactions on Circuits and Systems for Video Technology, 24:113–126, 1 2014.
[43] Haiping Wu, Bin Xiao, Noel Codella, Mengchen Liu, Xiyang Dai, Lu Yuan, and Lei Zhang. Cvt: Introducing convolutions to vision transformers. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 22–31, 2021.
[44] Saining Xie, Chen Sun, Jonathan Huang, Zhuowen Tu, and Kevin Murphy. Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification, 2018.
[45] Zhen Xing, Qijun Feng, Haoran Chen, Qi Dai, Han Hu, Hang Xu, Zuxuan Wu, and Yu-Gang Jiang. A survey on video diffusion models, 10 2023.
[46] Yiming Yuan, Jingya Zhu, and Rao Fu. A comparative study of video coding standards for 8k ultra-high-definition video: Hevc vs. av1. Academic Journal of Engineering and Technology Science, 6:1–9, 8 2023.
[47] Yi Zhang, Xiaoyu Shi, Dasong Li, Xiaogang Wang, Jian Wang, and Hongsheng Li. A unified conditional framework for diffusion-based image restoration, 5 2023.
[48] Zicheng Zhang, Bonan Li, Xuecheng Nie, Congying Han, Tiande Guo, and Luoqi Liu. Towards consistent video editing with text-to-image diffusion models. Advances in Neural Information Processing Systems, 36:58508–58519, 12 2023.
[49] Qihua Zhou, Ruibin Li, Song Guo, Peiran Dong, Yi Liu, Jingcai Guo, and Zhenda Xu. Cadm: Codec-aware diffusion modeling for neural-enhanced video streaming, 11 2022.
[50] Shangchen Zhou, Kelvin Chan, Chongyi Li, and Chen Change Loy. Towards robust blind face restoration with codebook lookup transformer. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 30599–30611. Curran Associates, Inc., 2022.

Supplementary Material

In this section, we delve deeper into the quantitative and qualitative results obtained from our model. Fig. S1 showcases frame 179 from sequence 26 of the SEPE8K dataset, along with our model’s output. To better illustrate our model’s performance, we highlight three specific crops from this frame: 1) the in-focus ”bag” crop, demonstrating the model’s ability to handle sharp details, 2) the out-of-focus ”hat” crop with its intricate winter hat texture, testing the model’s performance on challenging patterns, and 3) the out-of-focus ”jacket” crop, further evaluating the model’s ability to reconstruct detailed textures in less-than-ideal conditions. Our findings highlight that our model consistently delivers high-quality results in both in-focus and out-of-focus scenarios, effectively handling various textures and sharpness levels, unlike other models that struggle in such diverse situations.

To validate our model’s robustness, we evaluated its performance under a range of compression levels (QPs). This included not only the most challenging highest QP scenario but also lower QPs to ensure comprehensive learning. As detailed in Tables S1, S2, S3, and S4, our model consistently demonstrates strong performance across various compression levels, from the least to the most compressed videos. Intuitively, our DiQP exhibits a notably stronger performance advantage at lower QPs, with a margin of approximately 5 dB in the SEPE8K dataset and around 2 dB in the UVG dataset. This shows a superior ability to leverage the additional information available in less compressed videos.

Table S1: PSNR Performance of DiQP on AV1 Codec with Various QPs

QP	Dataset	Output	Input
51	SEPE8K	42.171	37.544
	UVG	37.001	35.124
102	SEPE8K	40.478	36.774
	UVG	35.772	34.286
153	SEPE8K	38.762	35.816
	UVG	34.962	33.644
204	SEPE8K	36.818	34.564
	UVG	33.871	32.740

Table S2: SSIM Performance of DiQP on AV1 Codec with Various QPs

QP	Dataset	Output	Input
51	SEPE8K	0.9304	0.9105
	UVG	0.9148	0.8981
102	SEPE8K	0.9047	0.8886
	UVG	0.8924	0.8788
153	SEPE8K	0.8852	0.8708
	UVG	0.8831	0.8699
204	SEPE8K	0.8730	0.8584
	UVG	0.8731	0.8587

Table S3: PSNR Performance of DiQP on HEVC Codec with Various QPs

QP	Dataset	Output	Input
12	SEPE8K	43.063	38.036
	UVG	38.396	36.018
33	SEPE8K	38.099	35.808
	UVG	34.882	33.590
42	SEPE8K	35.829	34.385
	UVG	33.660	32.665

Table S4: SSIM Performance of DiQP on HEVC Codec with Various QPs

QP	Dataset	Output	Input
12	SEPE8K	0.9456	0.9241
	UVG	0.9307	0.9133
33	SEPE8K	0.8842	0.8704
	UVG	0.8835	0.8684
42	SEPE8K	0.8695	0.8564
	UVG	0.8728	0.8571