
PNeRV: Enhancing Spatial Consistency via
Pyramidal Neural Representation for Videos

Qi Zhao
Nanjing University
[email protected]
   M. Salman Asif
University of California Riverside
[email protected]
Zhan Ma

Nanjing University
[email protected]
Abstract

The primary focus of Neural Representation for Videos (NeRV) is to effectively model its spatiotemporal consistency. However, current NeRV systems often face a significant issue of spatial inconsistency, leading to decreased perceptual quality. To address this issue, we introduce the Pyramidal Neural Representation for Videos (PNeRV), which is built on a multi-scale information connection and comprises a lightweight rescaling operator, the Kronecker Fully-connected layer (KFc), and a Benign Selective Memory (BSM) mechanism. The KFc, inspired by the tensor decomposition of the vanilla Fully-connected layer, facilitates low-cost rescaling and global correlation modeling. BSM merges high-level features with granular ones adaptively. Furthermore, we provide an analysis based on the Universal Approximation Theory of the NeRV system and validate the effectiveness of the proposed PNeRV. We conducted comprehensive experiments to demonstrate that PNeRV surpasses the performance of contemporary NeRV models, achieving the best results in video regression on UVG and DAVIS under various metrics (PSNR, SSIM, LPIPS, and FVD). Compared to vanilla NeRV, PNeRV achieves a +4.49 dB gain in PSNR and a 231% FVD improvement on UVG, along with a +3.28 dB PSNR gain and a 634% FVD improvement on DAVIS.

Figure 1: High-quality video ($1920\times 960$) reconstruction comparisons between the proposed Pyramidal NeRV and other models; PSNR in yellow. PNeRV outperforms other models on perceptual quality with less noise and fewer artifacts, maintaining spatial consistency.
Corresponding author: Zhan Ma ([email protected])

1 Introduction

In recent years, Implicit Neural Representation (INR) has emerged as a pivotal area of research across various vision tasks, including neural radiance fields modeling [42, 67], 3D vision [51, 8, 57] and multimedia neural coding [53, 9]. INR operates on the philosophy that target implicit mapping will be encoded into a learnable neural network through end-to-end training. By leveraging the modeling capabilities of neural nets, INR can approximate a wide range of complex nonlinear or high-dimensional mappings.

However, when considering the video coding task, extant NeRV systems exhibit a notable deficiency in perceptual quality. Reconstructions of foreground subjects, obscured by high-frequency irrelevant details or blurring, prove challenging for current NeRV models. This issue of spatial inconsistency is primarily attributed to semantic uncertainty, which causes the model to struggle to discern whether two long-range pixels belong to the same object or constitute part of a noisy background. We postulate that this predicament stems from the absence of a global receptive field and of multi-scale information communication. Inspired by existing empirical evidence from other vision research, we speculate that if the dense prediction could leverage the high-level information learned from the raw input, it would substantially alleviate both the semantic uncertainty and the spatial inconsistency (as illustrated in Fig. 1).

In practice, introducing multi-scale structures into NeRV poses a significant and non-trivial challenge. Existing NeRV models typically resort to cascaded upsampling layers (the so-called “mainstream”) for decoding fine video, striking a compromise between performance and efficiency. However, layers that use subpixel-based operators [52, 69] can hardly maintain a balance between the increasing receptive field, parameter demand, and performance (more discussions in Sec. B and visualization in Fig. 2). Additionally, these decoding layers are solely receptive to features from the previous layer, ignoring information from other preceding layers. Moreover, the design of multi-scale structures in NeRV remains unguided by either practical or theoretical principles due to constraints on parameter quantities compared with methods for other vision tasks.

To address this issue, we propose the Pyramidal Neural Representation for Videos (PNeRV) based on hierarchical information interaction via a low-cost upscaling operator, Kronecker Fully-connected (KFc) layer, and a gated mechanism, Benign Selective Memory (BSM), which aims at adaptive feature merging. Utilizing these modules, PNeRV can fuse the high-level features directly into each underlying fine-grained layer via shortcuts, thereby creating a pyramidal structure. Further, we introduce Universal Approximation Theory (UAT) into the NeRV system for the first time and provide an analysis of existing NeRV models, revealing the superiority of our proposed pyramid structure. Our main contributions are summarized as follows.

  • Towards the poor perceptual quality of NeRV systems, we propose PNeRV to enhance spatial consistency via multi-scale feature learning.

  • In pursuit of model efficiency, we propose KFc, which realizes low-cost upsampling with a global receptive field, and BSM for adaptive feature fusion, thus forming an efficient multi-level pyramidal structure.

  • We introduce the first UAT analysis in NeRV research. Using UAT, we describe NeRV-based video neural coding as the Implicit Video Neural Coding problem, clarifying and defining some fundamental concepts within this framework.

  • We confirm the superiority of PNeRV against other models on two datasets (UVG and DAVIS) using four video quality metrics (PSNR, SSIM, LPIPS, and FVD).

2 Related Work

Implicit Neural Representation for Videos. In recent years, INR has gained increasing attention in various vision areas, such as neural radiance fields modeling [42, 43, 11, 16], novel view synthesis [41, 29], and multimedia neural coding [14, 9, 10, 31, 71]. For INR-based neural video coding, NeRV [9] first uses index embeddings as input and then decodes back to high-resolution videos via cascaded PixelShuffle [52] blocks. ENeRV [31] aims to reallocate the parameter quantity between different modules for better performance. Unlike the above index-based methods, HNeRV [10] employs ConvNeXT [37] blocks as an encoder and provides content-aware embeddings, improving the performance. Furthermore, apart from content embedding, DiffNeRV [71] inputs the difference between adjacent frames as temporal embeddings, enhancing temporal consistency. The major distinction between PNeRV and DiffNeRV is that the latter does not exploit multi-scale spatial information, resulting in spatial discontinuity.

Multi-scale Hierarchy Structure for Dense Prediction. In previous CV research, there have emerged numerous studies on multi-scale vision [7, 50, 22, 34, 72, 36, 35, 62, 38]. UNet [50] aimed to improve accuracy by combining contextual information from features at different resolutions. FPN [34] developed a top-down architecture with high-level semantic feature maps at all scales, showing significant improvements in dense prediction tasks. PANet [35] followed the idea of multi-level information fusion and proposed adaptive feature pooling to leverage useful information from each level. PVT [62] introduced the pyramidal architecture into vision transformers. The success of pyramidal structure lies in multi-level feature fusion, and detailed predictions should be guided by high-level context features.

Video Coding Pipelines and Theories. Video coding has been studied for several decades based on handcrafted design and domain transformation [17, 64, 55, 4]. Furthermore, neural video coding [39, 30, 26, 33] aims to replace some components in the traditional pipeline, but they suffer from high computational complexity and slow decoding speeds. Beyond Rate-Distortion Optimization (RDO) [54], [5] reveals the importance of perceptual quality and proposes the Perception-Distortion Optimization (PDO). [6] defines the Rate-Distortion-Perception Optimization (RDPO). Different from those pipelines, we reinterpret the INR-based video coding [9, 10, 71] with UAT framework, and more details are in Sec. 4.2 and Sec. A.1.

Universal Approximation Theory (UAT). One of the pursuits of UAT analysis on deep neural nets is to estimate the minimal width of a model needed to approximate continuous functions under a certain error and fixed length. [20] provides the estimation of the minimal width $w^{*}$ of a ReLU net as $d_{in}+1\leq w^{*}\leq d_{in}+d_{out}$ in Theorem 1. [46] provides the first definitive result for deep ReLU nets: the minimum width required for universal approximation of the $L^{p}$ functions is exactly $\max\{d_{in}+1,d_{out}\}$. [32] demonstrates that a deep ReLU ResNet with one neuron per hidden layer can uniformly approximate any Lebesgue integrable function. More discussions are given in Sec. 4 and Sec. A.1.

3 Pyramidal Neural Representation for Videos

Figure 2: Visualized comparison between PixelShuffle and KFc, where $\times$ denotes matrix multiplication and the black box is the subpixel area. PixelShuffle fills the subpixels using a local receptive field, lacking long-range relationship modeling ability, while KFc calculates the correlation between every pair of positions.

As analyzed above, pursuing spatial consistency calls for the communication of multi-scale information through a global receptive field. Fine-grained reconstruction requires high-level information as guidance, and a low-cost upsampling operator is crucial for creating multi-level shortcuts.

Therefore, we propose Pyramidal NeRV (PNeRV), which consists of a learnable encoder and a novel pyramidal decoder. The main innovations in the decoder are a low-cost, global upscaling operator, the Kronecker Fully-connected (KFc) layer, and a gated memory unit, Benign Selective Memory (BSM), for disentangled feature fusion. The overall structure of PNeRV is shown in Fig. 3.

3.1 Kronecker Fully-connected Layer

NeRV aims to decode high-resolution videos from tiny embeddings. Conv-based upsampling operators [69, 52] are therefore not efficient enough due to the huge upscaling ratio, which differs from previous visual tasks. Their parameter quantity grows sharply with increased channels or kernel size, whereas NeRV aims to encode videos with as few parameters as possible, namely the pursuit of model efficiency.

In conflict with this goal, subpixel-based upscaling operators fail to form shortcuts and a pyramidal structure. When upscaling from the given embeddings $F_{0}$ ($16\times 2\times 4$) to fine-grained features $F_{n}$ ($16\times 320\times 640$), there is an intolerable $25600\times$ subpixel expansion to fill. Even with a kernel size of only $1\times 1$, a single PixelShuffle [52] layer requires 6.96M parameters to go from $F_{0}$ to $F_{n}$, regardless of the size of the videos or the model structure.
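As a sanity check on this figure, the count follows directly from the layer shapes, assuming a single $1\times 1$ Conv followed by PixelShuffle with spatial scale $r=160$ and including the bias:

\underbrace{16\cdot\left(16\cdot 160^{2}\right)}_{1\times 1\text{ Conv weights}}+\underbrace{16\cdot 160^{2}}_{\text{bias}}=6{,}553{,}600+409{,}600\approx 6.96\text{M}.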

To address this dilemma, we propose the Kronecker Fully-connected layer (KFc), given as

\mathbf{Z}=\textsc{concat}_{i}\left(\mathbf{K}_{1}^{(i)}\mathbf{X}^{(i)}\mathbf{K}_{2}^{(i)}\right)+\mathbf{b}_{c}\otimes\mathbf{b}_{h}\otimes\mathbf{b}_{w}, \qquad (1)

where $\mathbf{X}^{(i)}\in\mathbb{R}^{H_{in}\times W_{in}}$ are the input feature maps, $\mathbf{Z}^{(i)}\in\mathbb{R}^{H_{out}\times W_{out}}$ are the output feature maps, and $\mathbf{K}_{1}^{(i)}\in\mathbb{R}^{H_{out}\times H_{in}}$ and $\mathbf{K}_{2}^{(i)}\in\mathbb{R}^{W_{in}\times W_{out}}$ are the two kernels for channel $i$. Each feature map is computed channel-wise, and the results are concatenated along the channel dimension. $\mathbf{b}_{c}\in\mathbb{R}^{C\times 1}$, $\mathbf{b}_{h}\in\mathbb{R}^{H_{out}\times 1}$, and $\mathbf{b}_{w}\in\mathbb{R}^{W_{out}\times 1}$ are three vectors that produce the bias via the Kronecker product $\otimes$.

Motivation. KFc is motivated by the fact that the subpixels at one position are related to every other position in the current feature map. The dilemma between local and global feature learning is an enduring issue in deep learning [63, 68, 61, 38]. Unlike the local prior in the Conv layer, an Fc layer is more effective, especially for the top embeddings, which contain semantic features with little local spatial structure. The computation involving $\mathbf{K}_{1}$, $\mathbf{X}$, and $\mathbf{K}_{2}$ is actually a product between the vectorized input features $\text{vec}(\mathbf{X})\in\mathbb{R}^{H_{in}W_{in}\times 1}$ and a hybrid weight matrix $\mathbf{K}_{\otimes}\in\mathbb{R}^{H_{out}W_{out}\times H_{in}W_{in}}$, where $\mathbf{K}_{\otimes}=\mathbf{K}_{1}\otimes\mathbf{K}_{2}^{\top}$. Compared with the vanilla Fc layer, the two low-rank matrices $\mathbf{K}_{1}$ and $\mathbf{K}_{2}$ come from the Kronecker decomposition, while the bias terms $\mathbf{b}_{c,h,w}$ form the CP decomposition of the original bias.
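For reference, this equivalence is an instance of the standard Kronecker identity, written here for the row-major vectorization $\text{vec}_{r}$ (stack rows), which matches the form of $\mathbf{K}_{\otimes}$ above; under the column-stacking convention the two factors swap order:

\text{vec}_{r}(\mathbf{K}_{1}\mathbf{X}\mathbf{K}_{2})=(\mathbf{K}_{1}\otimes\mathbf{K}_{2}^{\top})\,\text{vec}_{r}(\mathbf{X}), \qquad \text{vec}(\mathbf{K}_{1}\mathbf{X}\mathbf{K}_{2})=(\mathbf{K}_{2}^{\top}\otimes\mathbf{K}_{1})\,\text{vec}(\mathbf{X}).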

Besides, KFc is also inspired by LoRA [25], which adapts weights in a low “intrinsic dimension” [2] for parameter-efficient fine-tuning (PEFT). A visualization is shown in Fig. 2. For the same $F_{0}$ and $F_{n}$ mentioned above, the number of parameters needed by KFc is 0.05M, only 0.7% of that required by PixelShuffle. Detailed comparisons of parameters and FLOPs are given in Fig. 3.
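A minimal PyTorch sketch of the channel-wise KFc operation in Eq. (1) is given below; the class name, initialization scale, and einsum-based implementation are our own illustrative choices rather than the released code. Counting its parameters for the $16\times 2\times 4 \to 16\times 320\times 640$ case above reproduces the roughly 0.05M figure.

```python
import torch
import torch.nn as nn

class KroneckerFC(nn.Module):
    """Sketch of a Kronecker Fully-connected (KFc) upscaling layer.

    Per channel i, it computes K1^(i) @ X^(i) @ K2^(i) and adds the
    rank-1 bias b_c (x) b_h (x) b_w, following Eq. (1)."""

    def __init__(self, channels, h_in, w_in, h_out, w_out):
        super().__init__()
        self.k1 = nn.Parameter(0.02 * torch.randn(channels, h_out, h_in))
        self.k2 = nn.Parameter(0.02 * torch.randn(channels, w_in, w_out))
        self.b_c = nn.Parameter(torch.zeros(channels, 1, 1))
        self.b_h = nn.Parameter(torch.zeros(1, h_out, 1))
        self.b_w = nn.Parameter(torch.zeros(1, 1, w_out))

    def forward(self, x):  # x: (B, C, H_in, W_in)
        # Channel-wise K1 @ X @ K2, broadcast over the batch dimension.
        z = torch.einsum('coh,bchw,cwp->bcop', self.k1, x, self.k2)
        # Kronecker (outer-product) bias, broadcast to (C, H_out, W_out).
        return z + self.b_c * self.b_h * self.b_w

kfc = KroneckerFC(16, 2, 4, 320, 640)
print(sum(p.numel() for p in kfc.parameters()))  # 52,176 parameters (~0.05M)
print(kfc(torch.randn(1, 16, 2, 4)).shape)       # torch.Size([1, 16, 320, 640])
```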

Figure 3: The overall architecture of PNeRV, consisting of KFc and BSM. The right part shows the comparison of parameters and FLOPs between PixelShuffle (PS) and KFc, where the input feature maps are of size $c\times h\times w$, the upscaling rate is $r$, and the kernel size in PS is $k\times k$.

3.2 Benign Selective Memory

Using KFc as the basic operator for shortcuts, PNeRV realizes efficient multi-scale feature learning. Also, adaptive feature fusion between different levels is quite important.

Therefore, we propose the Benign Selective Memory (BSM). BSM is inspired by the gated mechanisms in RNN research [12, 24], treating features in different streams as input and cell states. We follow the convention in RNNs, where lowercase denotes hidden states. For the high-level feature $z$ on the top and the fine-grained feature $h_{l-1}$ in the $l$-th layer, BSM is given as follows:

n_{l} = W_{n}\ast z, \qquad \text{(Knowledge)}
m_{l} = W_{m}\ast h_{l-1}, \qquad \text{(Memory)}
s_{l} = \sigma\left(W_{s}\ast\textsc{ReLU}(n_{l}+m_{l})\right), \qquad \text{(Decision)}
h_{l} = h_{l-1}\odot(1-s_{l})+n_{l}\odot s_{l}, \qquad \text{(Behaviour)}

where $\ast$ denotes convolution with weights $W_{n,m,s}$, $\odot$ is the Hadamard product, and $\sigma$ is the sigmoid activation.
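A minimal PyTorch sketch of this gated fusion follows; the module interface, shared channel width, and $3\times 3$ kernels are assumptions made for illustration, not the exact released configuration.

```python
import torch
import torch.nn as nn

class BSM(nn.Module):
    """Sketch of Benign Selective Memory (BSM).

    Fuses the upsampled high-level feature z ("Knowledge") with the
    fine-grained mainstream feature h ("Memory") through a learned gate."""

    def __init__(self, channels):
        super().__init__()
        self.w_n = nn.Conv2d(channels, channels, 3, padding=1)  # Knowledge branch
        self.w_m = nn.Conv2d(channels, channels, 3, padding=1)  # Memory branch
        self.w_s = nn.Conv2d(channels, channels, 3, padding=1)  # Decision gate

    def forward(self, h, z):
        n = self.w_n(z)                                  # n_l = W_n * z
        m = self.w_m(h)                                  # m_l = W_m * h_{l-1}
        s = torch.sigmoid(self.w_s(torch.relu(n + m)))   # s_l (Decision)
        return h * (1 - s) + n * s                       # h_l (Behaviour)
```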

BSM imitates the human learning and decision-making process. The high-level feature $z$ is regarded as external Knowledge, while $h_{l-1}$ from the previous block in the mainstream is the inherited Memory. The model should learn from Knowledge and integrate it with Memory to guide its Behaviour (reconstruction), hence the name Benign Selective Memory.

Motivation. The primary distinction between previous gated mechanisms and BSM is that BSM learns features (referred to as “Knowledge” and “Memory”) separately before merging them. This disentangled fashion aids PNeRV in adaptively merging features from different levels. The ablation studies in Tab. 7 show the superiority of BSM.

3.3 Overall Structure

Therefore, the proposed PNeRV consists of three parts, as follows (where $X$ is the input embedding, $\hat{H}_{l}$ are the mainstream features in the $l$-th layer, $Z_{l}$ are the features upsampled by the shortcuts, and $H_{l}$ are the features after fusion):

  1. A mainstream comprising cascaded upsampling layers (containing Conv, PixelShuffle, and GELU) to provide the high-resolution reconstruction, $\hat{H}_{l}=Block(H_{l-1})$, $1\leq l\leq L$, $L=6$, $H_{0}=X$.

  2. Shortcuts that upsample the high-level embeddings $X$ into $Z_{l}$ before merging into the mainstream, forming a multi-level hierarchical architecture, $Z_{l}=Shortcut(X)$, $2\leq l\leq L_{0}$, $L_{0}=5$.

  3. A feature fusion mechanism employed to merge $Z_{l}$ with $\hat{H}_{l}$ adaptively for the final output, $H_{l}=Fusion(Z_{l},\hat{H}_{l})$.

In our implementation, we construct two versions, namely PNeRV-M and PNeRV-L. PNeRV-M has only a single stream, which takes content embeddings [10] $\mathbf{X}^{C}$ of size $16\times 2\times 4$ as input. For PNeRV-L, temporal embeddings [71] $\mathbf{X}^{T}$ of size $2\times 40\times 80$ are also involved: $\mathbf{X}^{C}$ is delivered to the mainstream, and $\mathbf{X}^{T}$ is upscaled in the shortcuts via KFc and merged into each mainstream layer through BSM. We choose PNeRV-L as the final version. All kernels are $3\times 3$ except for the first and the final output layer. For the input video $\mathbf{V}$ and reconstruction $\tilde{\mathbf{V}}$, the key equations of the entire model in the $l$-th layer ($1<l\leq L$) are as follows:

Encoder: \mathbf{X}^{C},\mathbf{X}^{T}=\mathcal{E}(\mathbf{V}),
Decoder: \hat{\mathbf{H}}_{l}=\textsc{Block}_{l}(\mathbf{H}_{l-1})=\textsc{Block}_{l}\circ\cdots\circ\textsc{Block}_{1}(\mathbf{X}^{C}),
\mathbf{Z}_{l}=\textsc{Shortcut}_{l}(\mathbf{X}^{T}),
\mathbf{H}_{l}=\textsc{BSM}_{l}(\hat{\mathbf{H}}_{l},\mathbf{Z}_{l}),

where $\mathbf{H}_{0}=\mathbf{X}^{C}$. The final feature is passed through an output layer, $\tilde{\mathbf{V}}=\textsc{Conv}_{1\times 1}(\mathbf{H}_{L})$.
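The data flow above can be summarized in a minimal PyTorch sketch; the placeholder module containers, the LazyConv2d output layer, and the dictionary-based bookkeeping are illustrative assumptions, with the actual Block, Shortcut (KFc-based), and BSM modules assumed to be supplied by the caller.

```python
import torch.nn as nn

class PyramidalDecoder(nn.Module):
    """Sketch of the PNeRV decoder data flow in Sec. 3.3 (L = 6 mainstream
    blocks, shortcuts and BSM fusion on layers 2..5)."""

    def __init__(self, blocks, shortcuts, fusions):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)         # Block_1 .. Block_L
        self.shortcuts = nn.ModuleDict(shortcuts)   # KFc shortcuts, keyed by layer index
        self.fusions = nn.ModuleDict(fusions)       # BSM modules, keyed by layer index
        self.out = nn.LazyConv2d(3, kernel_size=1)  # 1x1 output layer

    def forward(self, x_c, x_t):
        h = x_c                                     # H_0 = X^C (content embedding)
        for l, block in enumerate(self.blocks, start=1):
            h_hat = block(h)                        # \hat{H}_l = Block_l(H_{l-1})
            if str(l) in self.shortcuts:            # pyramidal shortcut from X^T
                z = self.shortcuts[str(l)](x_t)     # Z_l = Shortcut_l(X^T)
                h = self.fusions[str(l)](h_hat, z)  # H_l = BSM_l(\hat{H}_l, Z_l)
            else:
                h = h_hat
        return self.out(h)                          # \tilde{V} = Conv_{1x1}(H_L)
```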

4 Universal Approximation Theory on NeRV

First, we clarify some concepts of NeRV within the UAT framework. A NeRV-based neural video coding pipeline is defined in Sec. 4.2. We describe the limitations of existing NeRV models in Sec. 4.3 and discuss the significance of shortcuts and the multi-scale structure in the proposed PNeRV in Sec. 4.4.

4.1 Basic Definitions and Notations

One of the main issues in the UAT analysis of a feed-forward network of finite length $L$ is to find the minimal width $w^{*}:=\min\max d_{i}$, $1\leq i\leq L$, where $d_{i}$ is the width of the $i$-th layer, such that neural nets with width $w^{*}$ and length $L$ can approximate any scalar continuous function arbitrarily well [20, 19, 46]. Following the statement in [20], a deep affine net is defined as follows.

Definition 1.

(Deep Affine Net). A deep affine net of $L$ layers is given as follows:

\mathcal{N}:=A_{L}\circ\sigma\circ A_{L-1}\circ\cdots\circ\sigma\circ A_{1}. \qquad (2)

where the $i$-th layer is an affine transformation $A_{i}:\mathbb{R}^{d_{i}}\to\mathbb{R}^{d_{i+1}}$, with $d_{1}=d_{in}$, $d_{L}=d_{out}$, and $\sigma$ as the activation.

In existing NeRV research, NeRV [9] and HNeRV [10] meet this definition.

4.2 Implicit Neural Video Coding

Recently, INR-based video coding has received increasing attention, and it uses a lightweight model to fit a video clip. We formulate this coding pipeline as Implicit Neural Video Coding (INVC), and the decoder with its embeddings together is known as the NeRV system [9, 31, 10, 21, 71].

Definition 2.

(NeRV System). Each frame $V_{t}$ in an RGB video clip $V=\{V_{t}\}^{T}_{t=1}\in\mathbb{R}^{T\times 3\times H\times W}$ is represented by an implicit unknown continuous function $\mathcal{F}:[0,1]^{d_{in}}\to\mathbb{R}^{d_{out}}$ with the embedding $\mathcal{E}(t)$ obtained by an encoder $\mathcal{E}:\mathbb{N}\to[0,1]^{d_{in}}$ at time stamp $t$,

V_{t}=\mathcal{F}\circ\mathcal{E}(t),

where $\mathcal{F}$ can be approximated by a learnable neural network $\mathcal{D}$ of finite length $L_{\mathcal{D}}$, width $w_{\mathcal{D}}$, and activation $\sigma$. The reconstruction $\tilde{V}_{t}$ via $\mathcal{D}$ and $\mathcal{E}$ is given as follows:

\tilde{V}_{t}=\mathcal{D}\circ\mathcal{E}(t),

where the decoder $\mathcal{D}$ and the embeddings $\mathcal{E}(t)$ together are known as the NeRV system, $\{\mathcal{D},\mathcal{E}(t)\}^{T}_{t=1}$.

For the index-based models [9] and [31], the encoder $\mathcal{E}$ is Positional Encoding [53]. In content-based models [10, 71], $\mathcal{E}$ is learnable and provides content embeddings. When $\mathcal{D}$ is a deep affine net, it is called a serial cascaded NeRV system, such as NeRV [9] and HNeRV [10], and $\mathcal{D}$ is formulated as follows, where $B_{l}$ is the $l$-th upsampling layer:

\mathcal{D}:=B_{L}\circ\sigma\circ B_{L-1}\circ\cdots\circ\sigma\circ B_{1}. \qquad (3)
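As a concrete picture of Eq. (3), the sketch below builds a serial cascaded decoder from Conv + PixelShuffle + GELU blocks; the channel widths and per-layer scale factors are illustrative and do not reproduce the exact NeRV/HNeRV configurations.

```python
import torch.nn as nn

def serial_cascaded_decoder(channels, scales):
    """Plain composition B_L . sigma . ... . sigma . B_1 as in Eq. (3),
    with each upsampling block B_l realized as Conv followed by PixelShuffle."""
    layers = []
    for (c_in, c_out), r in zip(zip(channels[:-1], channels[1:]), scales):
        layers += [nn.Conv2d(c_in, c_out * r * r, 3, padding=1),
                   nn.PixelShuffle(r),   # fills subpixels from a local neighborhood
                   nn.GELU()]            # sigma between blocks
    return nn.Sequential(*layers[:-1])   # no activation after the last block B_L

# e.g. a 16x2x4 embedding upscaled by 5*4*2*2*2 = 160x per spatial dimension
decoder = serial_cascaded_decoder([16, 64, 48, 32, 24, 16], [5, 4, 2, 2, 2])
```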

We present the proposed Implicit Neural Video Coding Problem (INVCP) as follows. More discussions between INVCP and existing pipelines are given in Sec. A.

Problem 1.

(INVCP). The goal of INVC is to obtain the minimal parameter quantity under a certain approximation error $\epsilon$ between the input $V$ and the reconstruction $\tilde{V}$,

\mathop{\arg\min}_{\mathcal{D},\mathcal{E}}~\mathsf{Param}(\mathcal{D})+\sum_{t=1}^{T}d_{in}^{t},
\mathrm{s.t.}~L_{\mathcal{D}},w_{\mathcal{D}}\in[1,\infty),~\sup\sum\|\tilde{V}_{t}-V_{t}\|\leq\epsilon,~t\in[1,T].

where $d_{in}^{t}$ is the dimension of the embedding $\mathcal{E}(t)$ w.r.t. the $t$-th frame, and $L_{\mathcal{D}}$ and $w_{\mathcal{D}}$ are the length and width of $\mathcal{D}$.

4.3 UAT Analysis of Cascaded NeRV Model

For video INRs, the model strives to capture the implicit function that efficiently encodes a video. Within the UAT framework, a keen focus is on the smoothness properties of this implicit function, as it also encapsulates the video’s inherent dynamics.

We call these properties the rate of dynamics, referring to the differences and transitions between consecutive frames within the video. We introduce $\omega_{\mathcal{V}}^{-1}$ to informally represent the rate of dynamics of a video $\mathcal{V}$, inspired by the mathematical techniques used in UAT analysis [20].

Definition 3.

The dual modulus of continuity $\omega_{f}^{-1}$ w.r.t. a continuous $f$ defined on $\Omega$ is set as

\omega_{f}^{-1}(\epsilon):=\sup\{\delta:\omega_{f}(\delta)\leq\epsilon\},

where $\omega_{f}$ denotes the modulus of continuity of $f$,

\omega_{f}(\delta):=\sup_{x,y\in\Omega}\{\|f(x)-f(y)\|:d(x,y)\leq\delta\}.
Remark 1.

Using a function $\mathcal{F}:\mathbb{N}\to\mathbb{R}^{d_{V}}$ to roughly represent a video $V$: when the variation between frames (the video dynamics) $\|\mathcal{F}(t_{i})-\mathcal{F}(t_{j})\|$ stays at a certain level $\epsilon$ for two time stamps $t_{i}$ and $t_{j}$, then the longer this duration is sustained, the larger $\omega_{\mathcal{F}}^{-1}$ becomes. A smoother video has a larger $\omega_{\mathcal{F}}^{-1}$.
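The intuition can be illustrated with a toy discrete analogue computed on sampled frames; this is only a heuristic stand-in, since the true $\omega_{\mathcal{F}}^{-1}$ is defined over the unknown continuous $\mathcal{F}$ and, as noted next, is not explicitly computable.

```python
import numpy as np

def empirical_dual_modulus(frames, eps):
    """Toy discrete stand-in for omega^{-1}(eps): the largest frame gap
    delta such that every pair of frames at most delta steps apart differs
    by no more than eps in L2 norm."""
    T = len(frames)
    flat = frames.reshape(T, -1).astype(np.float64)
    best = 0
    for delta in range(1, T):
        # omega(delta): largest difference over all frame pairs within delta steps
        omega = max(np.linalg.norm(flat[i] - flat[j])
                    for i in range(T) for j in range(i + 1, min(i + delta + 1, T)))
        if omega <= eps:
            best = delta          # omega(delta) <= eps still holds
        else:
            break                 # omega is non-decreasing in delta, so stop early
    return best

# A slowly varying clip yields a larger value than a rapidly changing one.
t = np.linspace(0, 1, 30)[:, None, None]
smooth = np.tile(t, (1, 8, 8))                 # slow intensity ramp
dynamic = np.tile(np.sin(40 * t), (1, 8, 8))   # fast oscillation
print(empirical_dual_modulus(smooth, eps=1.0), empirical_dual_modulus(dynamic, eps=1.0))
```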

Notably, an explicit calculation of $\omega_{f}^{-1}$ is hard to obtain; in practice it is more of an empirical judgment based on factors such as camera movement, subject speed, and noise. We present the estimation of the upper bound of the minimal parameter quantity of the cascaded NeRV model as Theorem 1. The proof of Theorem 1 can be found in Sec. A.3.

Theorem 1.

For a cascaded NeRV system that $\epsilon$-approximates a video $V$, implicitly characterized by a certain unknown L-Lipschitz continuous function $\mathcal{F}:K\to\mathbb{R}^{d_{out}}$, where $K\subseteq\mathbb{R}^{d_{in}}$ is a compact set, the upper bound of the minimal parameter quantity $\mathsf{Param}(\mathcal{D})$ is given as

\mathsf{Param}_{\min}(\mathcal{D})\leq d_{out}^{2}\left(\frac{O\left(\mathrm{diam}(K)\right)}{\omega_{\mathcal{F}}^{-1}(\epsilon)}\right)^{d_{in}+1}.

Theorem 1 shows that, for a given video, the fitting performance of a cascaded NeRV model depends on the rate of dynamics $\omega_{\mathcal{F}}^{-1}$ and the dimension of the video, $d_{out}$. The smoother the video and the lower its dimension, the easier it is to approximate.

Remark 2.

The rate of dynamics for a given video will determine the performance of the NeRV system.

4.4 UAT Analysis of PNeRV

According to Theorem 1, the upper bound on the parameters of a cascaded NeRV required for fitting depends only on the properties of the target video. It demonstrates that, although different models can exhibit diverse architectures, their fitting behavior on the same video tends to be similar, indicating a limitation in the model's capability. However, according to observations in UAT research [32, 15], a model with shortcuts can reduce the required width to as little as one neuron per hidden layer, indicating that the model size can be greatly reduced while maintaining performance. Therefore, the involvement of shortcuts is the key to enhancing model capability.

Besides, we believe the implicit function representing a video can be decomposed into diverse sub-functions from a pattern-disentangled perspective. If we treat each stream in $\mathcal{D}$ as a sub-net, the whole $\mathcal{D}$ is an ensemble,

\mathcal{D}:=\sum A_{L}^{(i)}\circ\rho_{L-1}^{(i)}\circ A_{L-1}^{(i)}\circ\cdots\circ\rho_{1}^{(i)}\circ A_{1}^{(i)}. \qquad (4)

Different shortcut pathways can fit various patterns, as a single shortcut already has universal approximation ability. For example, in Fig. 3, shortcut ① may capture the low-frequency motions, whereas ②, directed towards fine-grained layers, captures spatial details. This hypothesis aligns with the empirical evidence observed in other vision areas, which shows that the pyramid structure, a widely adopted hierarchical topology, can improve dense prediction tasks. That is why PNeRV outperforms others and achieves less semantic uncertainty and better perceptual quality.

Remark 3.

As an ensemble of sub-nets, the pyramidal structure enhances the perceptual quality of NeRV systems.

5 Experiment

Settings. We perform video regression on two datasets, and all videos are center-cropped to a $1\times 2$ ratio. UVG [40] has 7 videos of size $960\times 1920$ with 300 or 600 frames at 120 FPS. DAVIS [47] is a large dataset of 47 videos at $960\times 1920$, containing large motions and complex spatial details. We choose 9 videos (Bmx-bumps, Camel, Dance-jump, Dog, Drift-chicane, Elephant, Parkour, Scooter-gray, Soapbox) from DAVIS as a subset, covering different types of spatiotemporal features.

Metrics. We use PSNR and MS-SSIM to evaluate pixel-wise errors. For spatial consistency, we choose the Learned Perceptual Image Patch Similarity (LPIPS) [70] and the Fréchet Video Distance (FVD) [60] as perceptual metrics, where LPIPS is based on AlexNet and FVD on the I3D model. The difference between PNeRV (P) and a baseline (B) is calculated as $(B-P)/B$ to show the improvement.
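For concreteness, a minimal sketch of the PSNR computation and the relative-improvement formula is shown below; it is illustrative only, and the MS-SSIM, LPIPS, and FVD numbers in the tables rely on the metrics' reference implementations.

```python
import torch
import torch.nn.functional as F

def psnr(pred, target, max_val=1.0):
    """Pixel-wise fidelity in dB for frames scaled to [0, max_val]."""
    mse = F.mse_loss(pred, target)
    return 10.0 * torch.log10(max_val ** 2 / mse)

def relative_gap(baseline, pnerv):
    """Relative difference (B - P) / B between a baseline score B and PNeRV's P."""
    return (baseline - pnerv) / baseline

frames_gt = torch.rand(4, 3, 320, 640)
frames_rec = (frames_gt + 0.01 * torch.randn_like(frames_gt)).clamp(0, 1)
print(psnr(frames_rec, frames_gt))  # roughly 40 dB for ~1% additive noise
```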

Training. We adopt Adam as the optimizer, with betas (0.9, 0.999) and weight decay 0. The learning rate is 5e-4 with a cosine annealing schedule. The loss function is L2, and the batch size is 1. All experiments are conducted using PyTorch 1.8.1 on an NVIDIA RTX 2080 Ti GPU, training for 300 epochs. We choose NeRV [9], E-NeRV [31], HNeRV [10], DivNeRV [21], and DiffNeRV [71] as baseline models. All models are trained at a similar size of about 3M parameters, and we follow the baselines' embedding-size settings.
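A minimal per-video fitting loop mirroring these settings is sketched below; the data interface (per-frame embeddings and target frames) is an assumption, not the released training script.

```python
import torch
import torch.nn.functional as F

def fit_video(model, embeddings, frames, epochs=300, lr=5e-4):
    """Fit one NeRV-style model to one video with Adam, cosine annealing,
    L2 (MSE) loss, and batch size 1, as described in the training settings."""
    opt = torch.optim.Adam(model.parameters(), lr=lr,
                           betas=(0.9, 0.999), weight_decay=0.0)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=epochs)
    for _ in range(epochs):
        for emb, target in zip(embeddings, frames):       # batch size 1
            pred = model(emb.unsqueeze(0))
            loss = F.mse_loss(pred, target.unsqueeze(0))  # L2 loss
            opt.zero_grad()
            loss.backward()
            opt.step()
        sched.step()
    return model
```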

5.1 Video Regression on UVG

Pixel-wise error. PSNR comparison on UVG is reported in Tab. 1, where bold font is the best result and underline is the second best. PNeRV-L surpasses other models (+0.42 dB against DiffNeRV and +4.25 dB against NeRV). PNeRV-M achieves the best result against other single-stream models (+1.96 dB against HNeRV and +3.02 dB against NeRV). The proposed pyramidal architecture shows its effectiveness when combined with various encoders.

Perceptual quality. The perceptual results are given in Tab. 3 (LPIPS) and Tab. 4 (FVD), and the results of PNeRV show a significant improvement, especially for “Bospho” and “ShakeN”. The FVD results in Tab. 4 indicate that PNeRV provides better spatiotemporal consistency compared to other baseline models (+231% against NeRV [9] and +64.5% against DiffNeRV [71]).

Case study. The visualized comparison on UVG is exhibited in the bottom three rows of Fig. 4. For dynamic objects with indistinct boundaries or noisy backgrounds, such as the horse in “ReadyS” and the tail in “ShakeN,” PNeRV demonstrates superior visual quality without requiring additional semantic information.

PSNR \uparrow D.P. E.S. Beauty Bospho HoneyB Jockey ReadyS ShakeN YachtR Avg. M.
Avg. V. N/A N/A 36.06 35.32 39.48 33.27 27.53 35.27 30.03 N/A
NeRV [9] 3M 160 33.25 33.22 37.26 31.74 24.84 33.08 28.03 31.63
NeRV* [9] 3.2M 160 32.71 33.36 36.74 32.16 26.93 32.69 28.48 31.87
E-NeRV [31] 3M 160 33.17 33.69 37.63 31.63 25.24 34.39 28.42 32.02
HNeRV [10] 3M 128 33.58 34.73 38.96 32.04 25.74 34.57 29.26 32.69
DiffNeRV [71] 3.4M 6528 40.00 36.67 41.92 35.75 28.67 36.53 31.10 35.80
DivNeRV* [21] 3.2M N/A 33.77 38.66 37.97 35.51 33.93 35.04 33.73 35.52
PNeRV-M 1.5M 128 37.51 33.80 41.76 29.96 24.15 36.18 28.92 33.18
PNeRV-M 3M 128 39.08 35.56 42.59 31.51 25.94 37.61 30.27 34.65
PNeRV-L 1.5M 6528 37.98 35.18 41.78 34.43 27.28 36.65 28.29 34.51
PNeRV-L 3.3M 6528 39.46 36.68 42.73 35.81 28.97 38.25 30.92 36.12
Table 1: PSNR comparison on UVG: the larger, the better. * indicates methods that fit videos in a shared model while others fit each video in a single model. D.P. is the parameter quantity of the decoder, and E.S. is the corresponding embedding size per frame. Avg. V is the average PSNR across all models for the same video. Avg. M is the average PSNR for a single model on the entire dataset.
PSNR / SSIM \uparrow Bmx-B Camel Dance-J Dog Drift-C Elephant Parkour Scoo-gray Soapbox Avg.
NeRV [9] 29.42/0.864 24.81/0.781 27.33/0.794 28.17/0.795 36.12/0.969 26.51/0.826 25.15/0.794 28.16/0.892 27.68/0.848 27.99/0.840
E-NeRV [31] 28.90/0.851 25.85/0.844 29.52/0.855 30.40/0.882 39.26/0.983 28.11/0.871 25.31/0.845 29.49/0.907 28.98/0.867 29.62/0.878
HNeRV [10] 29.98/0.872 25.94/0.851 29.60/0.850 30.96/0.898 39.27/0.985 28.25/0.876 26.56/0.851 31.64/0.939 29.81/0.881 30.22/0.889
DiffNeRV [71] 30.58/0.890 27.38/0.887 29.09/0.837 31.32/0.905 40.29/0.987 27.30/0.848 25.75/0.827 30.35/0.923 31.47/0.912 30.39/0.890
PNeRV-L (ours) 31.05/0.896 27.89/0.892 30.45/0.873 31.08/0.898 40.23/0.987 29.72/0.903 27.53/0.878 32.68/0.950 30.85/0.902 31.27/0.908
Table 2: PSNR and MS-SSIM comparison on DAVIS.

Compared with the SOTA. As shown in Tab. 1, PNeRV obtains competitive PSNR results on both dynamic and smooth videos. [21] is less effective for videos with little motion but complicated contextual spatial correlation, while [71] struggles to reconstruct videos filled with high-frequency details. By comparison, PNeRV achieves comparable performance on all videos.

5.2 Video Regression on DAVIS

Pixel-wise error. In Tab. 2, we present the PSNR and SSIM comparison on the DAVIS dataset. PNeRV gains a +0.88 dB PSNR increase compared to DiffNeRV and +3.28 dB compared to vanilla NeRV. Despite the challenges posed by complex spatiotemporal features, PNeRV exhibits significant improvements (refer to “Parkour”, which is the most difficult one, or “Drift-chicane”, where the racing car undergoes intense motion amidst smoke-induced noise).

Perceptual quality. The LPIPS results on DAVIS are reported in Tab. 3, where PNeRV achieves a 32.0% improvement compared to NeRV and 12.6% against the second-best DiffNeRV. In Tab. 5, PNeRV gains a 634% FVD improvement over NeRV and 128% against DiffNeRV. For the worst case, “Dog”, although PNeRV obtains a poor FVD result owing to the severe global blurring caused by camera motion, its PSNR is only slightly lower than the best (-0.24 dB).

Case study. Visualizations are shown in Fig. 4. PNeRV reduced spatial inconsistency, particularly in “Dance Jump” and “Elephant,” which are filled with irrelevant high-frequency details obscuring semantic clarity.

Refer to caption
Figure 4: Visual comparison on various videos. “Bmx” has larger motion, “Elephant” has massive droplets blurring, “Parkour” involves both camera rotation and extreme dynamics, “Dance” contains large motion under high-frequency reed leaves. “Jockey”, “ReadyS”, and “ShakeN” are videos with complex spatiotemporal correlation in UVG. Zoom in for a detailed comparison.
LPIPS \downarrow Beauty Bospho HoneyB Jockey ReadyS ShakeN YachtR Avg.
NeRV [9] 0.229 0.203 0.043 0.251 0.326 0.189 0.276 0.216
ENeRV [31] 0.224 0.179 0.039 0.279 0.318 0.168 0.363 0.224
HNeRV [10] 0.218 0.172 0.042 0.270 0.348 0.191 0.253 0.213
DiffNeRV [71] 0.205 0.164 0.042 0.196 0.206 0.181 0.241 0.176
PNeRV (ours) 0.210 0.132 0.037 0.177 0.211 0.146 0.230 0.163
LPIPS \downarrow Bmx-B Camel Dance Dog Drift Eleph Parko Scoo-g Soapb Avg.
NeRV [9] 0.374 0.476 0.517 0.573 0.136 0.490 0.481 0.308 0.424 0.419
ENeRV [31] 0.386 0.357 0.426 0.404 0.061 0.419 0.429 0.282 0.380 0.349
HNeRV [10] 0.315 0.331 0.392 0.405 0.058 0.387 0.414 0.226 0.357 0.321
DiffNeRV [71] 0.320 0.278 0.423 0.394 0.053 0.431 0.478 0.268 0.297 0.326
PNeRV (ours) 0.308 0.284 0.363 0.387 0.054 0.343 0.314 0.188 0.324 0.285
Table 3: LPIPS comparison on UVG (left) and DAVIS (right) dataset.
FVD\downarrow  Gap\uparrow Beauty Bospho HoneyB Jockey ReadyS ShakeN YachtR Avg. \uparrow
NeRV [9] 3.76e-5 281% 1.00e-4 253% 1.45e-5 193% 5.81e-4 499% 1.98e-3 122% 3.27e-5 178% 4.07e-4 92.8% 231%
ENeRV [31] 2.66e-5 169% 7.86e-5 176% 5.88e-6 186% 1.00e-3 936% 1.46e-3 64.2% 2.12e-5 80.7% 1.00e-3 376% 284%
HNeRV [10] 3.29e-5 233% 6.74e-5 137% 1.50e-5 203% 9.46e-4 874% 2.07e-3 132% 5.06e-5 331% 3.56e-4 68.8% 282%
DiffNeRV [71] 1.29e-5 30.7% 4.28e-5 50.3% 6.50e-6 31.1% 1.55e-4 60.1% 6.58e-4 -26.3% 4.69e-5 300% 2.23e-4 5.9% 64.5%
PNeRV (ours) 9.88e-6 - 2.85e-5 - 4.96e-6 - 9.70e-5 - 8.94e-4 - 1.17e-5 - 2.11e-4 - -
Table 4: FVD comparison on UVG.
FVD\downarrow  Gap\uparrow Bmx-B Camel Dance-Jump Dog Drift-C Elephant Parkour Scoo-gray Soapbox Avg. \uparrow
NeRV [9] 8.99e-5 146% 2.70e-4 404% 6.66e-5 1273% 3.02e-5 336% 3.85e-6 2830% 2.470e-5 95.8% 1.35e-4 309% 3.815e-5 197% 9.39e-5 115% 634%
ENeRV [31] 1.20e-4 229% 1.08e-4 102% 6.05e-6 24.8% 4.04e-6 -41.5% 5.41e-7 311% 2.647e-5 110% 7.09e-5 114% 3.961e-5 208% 7.01e-5 61.1% 124%
HNeRV [10] 4.97e-5 36.2% 1.04e-4 94.1% 9.58e-6 97.5% 4.51e-6 -34.6% 1.21e-6 821% 4.439e-5 252% 7.81e-5 135% 2.256e-5 75.8% 7.36e-5 69.3% 171%
DiffNeRV [71] 3.11e-5 -14.8% 3.85e-5 -28.1% 1.19e-5 146% 3.61e-6 -47.6% 6.48e-7 392% 6.408e-5 408% 1.45e-4 339% 1.614e-5 25.7% 1.64e-5 -62.2% 128%
PNeRV (ours) 3.65e-5 - 5.36e-5 - 4.85e-6 - 6.91e-6 - 1.31e-7 - 1.261e-5 - 3.31e-5 - 1.283e-5 - 4.35e-5 - -
Table 5: FVD comparison on DAVIS.

5.3 Ablation Studies

The ablation of the effectiveness of the proposed pyramidal architecture is in Tab. 6, and the contributions of two proposed modules are validated in Tab. 7, where the parameters of different models remain the same for a fair comparison.

Overall structure. We validate the design of the multi-level structure on the most dynamic and the smoothest videos (“Parkour” and “HoneyB”). In Tab. 6, “Serial” in the first row represents HNeRV [10], while “Pyram.+Concat.” adds only the shortcuts without the fusion modules. The main difference between DiffNeRV and PNeRV-L is the number of shortcuts (2 vs. 5), and PNeRV-L performs better.

Module contributions. We compare KFc with two upscaling layers, Deconv [69] and Bilinear (the combination of bilinear upsampling and Conv2D). KFc performs better due to its global receptive field, as shown in Tab. 7.

We also compare BSM with Concat, GRU [12], and LSTM [24]. The results suggest that disentangled feature fusion significantly enhances performance. Detailed results for each video are listed in Tab. C.6 in the appendix.

Parkour (Dynamic) HoneyB (Smooth)
Models Size 1.5M 3M 5M Avg. 0.75M 1.5M 3M Avg.
Serial (HNeRV [10]) 25.07 26.56 24.34 25.32 36.65 36.72 38.96 37.44
Pyram. + Concat. 24.20 25.45 25.83 25.16 40.07 41.58 42.34 41.33
Pyram. + BSM. (PNeRV-M) 24.81 26.02 27.13 25.99 40.34 41.36 42.59 41.43
Serial + Diff. (DiffNeRV [71]) 25.49 25.75 25.71 25.65 40.52 41.52 41.92 41.32
Pyram. + Diff. + BSM. (PNeRV-L) 25.62 27.08 27.21 26.67 39.81 41.85 42.73 41.46
Table 6: Ablation studies for model size and overall architecture on “HoneyB” and “Parkour”.

5.4 Validation of Theoretical Analysis

The results in Tab. 1 and Tab. 6 validate Remark 2: for smooth videos with a larger $\omega_{f}^{-1}$ and thus a smaller upper bound, models tend to obtain better performance, and vice versa. The results of PNeRV in Fig. 4, which exhibit less noise and blurring, validate Remark 3: the hierarchical structure reduces ambiguity and artifacts caused by semantic uncertainty.

PSNR\uparrow SSIM\uparrow (A.P.G.)\uparrow Concat GRU LSTM BSM
Bilinear 27.16/0.816(-4.14) 28.39/0.847(-2.91) 28.07/0.834(-3.23) 29.08/0.862(-2.22)
Deconv 27.37/0.803(-3.93) 29.00/0.845(-2.30) 28.91/0.850(-2.39) 29.96/0.881(-1.34)
KFc 28.68/0.848(-2.62) 29.31/0.868(-1.99) 29.04/0.866(-2.26) 31.30/0.904(+0)
Table 7: Contribution ablations for KFc and BSM, reported as average results on 7 DAVIS videos. A.P.G. indicates the average PSNR gap compared with the final version of PNeRV (KFc + BSM); the larger, the better. Detailed results for each video are given in Sec. D.2.

5.5 Additional Experiment Results

Additional results are provided in the appendix. Video interpolation on UVG is discussed in Sec. C.1 where PNeRV achieves the second-best PSNR (31.18 dB), exceeding the vanilla NeRV (26.54 dB). Video compression is shown in Sec. C.2, where competitive results are achieved over different coding pipelines. Video inpainting on the DAVIS subset is provided in Sec. C.3, where an average PSNR of 25.54 dB is achieved, outperforming NeRV (22.71 dB) and DNeRV (25.20 dB). More visual examples are shown in Sec. C.4, and visualization of feature maps in Sec. D.1. More detailed ablations are presented in Sec. D.2. More video examples with the link are listed in Sec. C.6.

6 Conclusion

To resolve the spatiotemporal inconsistency issue, we propose Pyramidal NeRV, which realizes multi-level information interaction via the low-cost KFc layer and the BSM fusion module. Further, we use UAT to provide explanations and insights for NeRV. Competitive results on various tasks and metrics validate the superiority of PNeRV.

Limitation and future work. The hierarchical structure brings higher computational complexity. We will optimize redundant modules of the model for acceleration in future work.

Acknowledgements. The work was supported in part by National Key Research and Development Project of China (2022YFF0902402) and U.S. National Science Foundation award (CCF-2046293).

References

  • Advani and Saxe [2017] Madhu S. Advani and Andrew M. Saxe. High-dimensional dynamics of generalization error in neural networks. Neural Networks, 2017.
  • [2] Armen Aghajanyan, Sonal Gupta, and Luke Zettlemoyer. Intrinsic dimensionality explains the effectiveness of language model fine-tuning. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021.
  • Agustsson et al. [2018] Eirikur Agustsson, Michael Tschannen, Fabian Mentzer, Radu Timofte, and Luc Van Gool. Generative adversarial networks for extreme learned image compression. 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 2018.
  • Ahmed et al. [2019] Nasir Ahmed, T. Raj Natarajan, and K. R. Rao. Discrete cosine transform. IEEE Transactions on Computers, 2019.
  • Blau and Michaeli [2017] Yochai Blau and Tomer Michaeli. The perception-distortion tradeoff. In Computer Vision and Pattern Recognition, 2017.
  • Blau and Michaeli [2019] Yochai Blau and Tomer Michaeli. Rethinking lossy compression: The rate-distortion-perception tradeoff. Proceedings of the 36th International Conference on Machine Learning, 2019.
  • Burt and Adelson [1983] Peter J. Burt and Edward H. Adelson. The laplacian pyramid as a compact image code. IEEE Trans. Commun., 1983.
  • Chan and et al. [2021] Eric Chan and Connor Z. Lin et al. Efficient geometry-aware 3d generative adversarial networks. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
  • Chen et al. [2021] Hao Chen, Bo He, Hanyu Wang, Yixuan Ren, Ser-Nam Lim, and Abhinav Shrivastava. Nerv: Neural representations for videos. In NeurIPS, 2021.
  • Chen et al. [2023a] Hao Chen, M. Gwilliam, Ser Nam Lim, and Abhinav Shrivastava. Hnerv: A hybrid neural representation for videos. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023a.
  • Chen et al. [2023b] Zhiqin Chen, Thomas A. Funkhouser, Peter Hedman, and Andrea Tagliasacchi. Mobilenerf: Exploiting the polygon rasterization pipeline for efficient neural field rendering on mobile architectures. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June 17-24, 2023, 2023b.
  • Chung et al. [2014] Junyoung Chung, Caglar Gülcehre, Kyunghyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. ArXiv, 2014.
  • Delétang et al. [2023] Grégoire Delétang, Anian Ruoss, Paul-Ambroise Duquenne, Elliot Catt, Tim Genewein, Christopher Mattern, Jordi Grau-Moya, Wenliang Kevin Li, Matthew Aitchison, Laurent Orseau, Marcus Hutter, and Joel Veness. Language modeling is compression. ArXiv, abs/2309.10668, 2023.
  • Dupont et al. [2021] Emilien Dupont, Adam Goliński, Milad Alizadeh, Yee Whye Teh, and A. Doucet. Coin: Compression with implicit neural representations. ArXiv, 2021.
  • Fan et al. [2018] Fenglei Fan, Dayang Wang, and Ge Wang. Universal approximation by a slim network with sparse shortcut connections. ArXiv, abs/1811.09003, 2018.
  • Fridovich-Keil et al. [2022] Sara Fridovich-Keil, Alex Yu, Matthew Tancik, Qinhong Chen, Benjamin Recht, and Angjoo Kanazawa. Plenoxels: Radiance fields without neural networks. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, 2022.
  • Gall [1991] Didier J. Le Gall. Mpeg: a video compression standard for multimedia applications. Commun. ACM, 1991.
  • Goodfellow et al. [2014] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. Generative adversarial networks. Communications of the ACM, 2014.
  • Hanin [2017] Boris Hanin. Universal function approximation by deep neural nets with bounded width and relu activations. ArXiv, abs/1708.02691, 2017.
  • Hanin and Sellke [2017] Boris Hanin and Mark Sellke. Approximating continuous functions by relu nets of minimal width. ArXiv, abs/1710.11278, 2017.
  • He and et al. [2023] Bo He and Xitong Yang et al. Towards scalable neural representation for diverse videos. ArXiv, abs/2303.14124, 2023.
  • He et al. [2014] Kaiming He, X. Zhang, Shaoqing Ren, and Jian Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2014.
  • Ho et al. [2020] Jonathan Ho, Ajay Jain, and P. Abbeel. Denoising diffusion probabilistic models. ArXiv, abs/2006.11239, 2020.
  • Hochreiter and Schmidhuber [1997] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 1997.
  • [25] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022.
  • Hu et al. [2021] Zhihao Hu, Guo Lu, and Dong Xu. Fvc: A new framework towards deep video compression in feature space. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
  • Jonschkowski et al. [2015] Rico Jonschkowski, Sebastian Hofer, and Oliver Brock. Patterns for learning with side information. arXiv: Learning, 2015.
  • Kingma and Welling [2013] Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. CoRR, abs/1312.6114, 2013.
  • Li et al. [2021a] Jiaxin Li, Zijian Feng, Qi She, Henghui Ding, Changhu Wang, and Gim Hee Lee. MINE: towards continuous depth MPI with nerf for novel view synthesis. In 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, October 10-17, 2021, 2021a.
  • Li et al. [2021b] Jiahao Li, Bin Li, and Yan Lu. Deep contextual video compression. NeurIPS, 2021b.
  • Li et al. [2022] Zizhang Li, Mengmeng Wang, Huaijin Pi, Kechun Xu, Jianbiao Mei, and Yong Liu. E-nerv: Expedite neural video representation with disentangled spatial-temporal context. arXiv:2207.08132, 2022.
  • Lin and Jegelka [2018] Hongzhou Lin and Stefanie Jegelka. Resnet with one-neuron hidden layers is a universal approximator. ArXiv, abs/1806.10909, 2018.
  • Lin et al. [2020] Jianping Lin, Dong Liu, Houqiang Li, and Feng Wu. M-lvc: Multiple frames prediction for learned video compression. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
  • Lin et al. [2016] Tsung-Yi Lin, Piotr Dollár, Ross B. Girshick, Kaiming He, Bharath Hariharan, and Serge J. Belongie. Feature pyramid networks for object detection. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • Liu et al. [2018] Shu Liu, Lu Qi, Haifang Qin, Jianping Shi, and Jiaya Jia. Path aggregation network for instance segmentation. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018.
  • Liu et al. [2019] Songtao Liu, Di Huang, and Yunhong Wang. Learning spatial fusion for single-shot object detection. ArXiv, abs/1911.09516, 2019.
  • [37] Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A convnet for the 2020s. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022.
  • Liu et al. [2021] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021.
  • Lu et al. [2019] Guo Lu, Wanli Ouyang, Dong Xu, Xiaoyun Zhang, Chunlei Cai, and Zhiyong Gao. Dvc: An end-to-end deep video compression framework. 2019.
  • [40] Alexandre Mercat, Marko Viitanen, and Jarno Vanne. UVG dataset: 50/120fps 4k sequences for video codec analysis and development. In Proceedings of the 11th ACM Multimedia Systems Conference, MMSys 2020.
  • Mildenhall et al. [2019] Ben Mildenhall, Pratul P. Srinivasan, Rodrigo Ortiz Cayon, Nima Khademi Kalantari, Ravi Ramamoorthi, Ren Ng, and Abhishek Kar. Local light field fusion: practical view synthesis with prescriptive sampling guidelines. ACM Trans. Graph., 2019.
  • Mildenhall et al. [2020] Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. In Proceedings of the European Conference on Computer Vision (ECCV), 2020.
  • Müller et al. [2022] Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller. Instant neural graphics primitives with a multiresolution hash encoding. ACM Trans. Graph., 2022.
  • Nakkiran et al. [2019] Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, and Ilya Sutskever. Deep double descent: where bigger models and more data hurt. Journal of Statistical Mechanics: Theory and Experiment, 2019.
  • Odena et al. [2016] Augustus Odena, Vincent Dumoulin, and Christopher Olah. Deconvolution and checkerboard artifacts. 2016.
  • [46] Sejun Park, Chulhee Yun, Jaeho Lee, and Jinwoo Shin. Minimum width for universal approximation. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021.
  • [47] Federico Perazzi, Jordi Pont-Tuset, Brian McWilliams, Luc Van Gool, Markus H. Gross, and Alexander Sorkine-Hornung. A benchmark dataset and evaluation methodology for video object segmentation. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016.
  • Rae [2023] Jack Rae. Compression for agi. YouTube, https://youtu.be/dO4TPJkeaaU, 2023.
  • [49] Robin Rombach, A. Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
  • Ronneberger et al. [2015] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. ArXiv, abs/1505.04597, 2015.
  • Schwarz et al. [2020] Katja Schwarz, Yiyi Liao, Michael Niemeyer, and Andreas Geiger. Graf: Generative radiance fields for 3d-aware image synthesis. ArXiv, abs/2007.02442, 2020.
  • [52] Wenzhe Shi and Jose Caballero et al. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Sitzmann et al. [2020] Vincent Sitzmann, Julien N. P. Martel, Alexander W. Bergman, David B. Lindell, and Gordon Wetzstein. Implicit neural representations with periodic activation functions. In NeurIPS, 2020.
  • Sullivan and Wiegand [1998] Gary J. Sullivan and Thomas Wiegand. Rate-distortion optimization for video compression. IEEE Signal Process. Mag., 1998.
  • Sullivan et al. [2012] Gary J. Sullivan, Jens-Rainer Ohm, Woojin Han, and Thomas Wiegand. Overview of the high efficiency video coding (HEVC) standard. IEEE Trans. Circuits Syst. Video Technol., 2012.
  • T et al. [2023] Mukund Varma T, Peihao Wang, Xuxi Chen, Tianlong Chen, Subhashini Venugopalan, and Zhangyang Wang. Is attention all that neRF needs? In The Eleventh International Conference on Learning Representations, 2023.
  • Takikawa and et al. [2021] Towaki Takikawa and Joey Litalien et al. Neural geometric level of detail: Real-time rendering with implicit 3d shapes. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
  • Tancik et al. [2020] Matthew Tancik, Pratul P. Srinivasan, Ben Mildenhall, Sara Fridovich-Keil, Nithin Raghavan, Utkarsh Singhal, Ravi Ramamoorthi, Jonathan T. Barron, and Ren Ng. Fourier features let networks learn high frequency functions in low dimensional domains. In NeurIPS, 2020.
  • Tschannen et al. [2018] Michael Tschannen, Eirikur Agustsson, and Mario Lucic. Deep generative models for distribution-preserving lossy compression. In Neural Information Processing Systems, 2018.
  • Unterthiner et al. [2018] Thomas Unterthiner, Sjoerd van Steenkiste, Karol Kurach, Raphaël Marinier, Marcin Michalski, and Sylvain Gelly. Towards accurate generative models of video: A new metric & challenges. ArXiv, abs/1812.01717, 2018.
  • Vaswani et al. [2017] Ashish Vaswani, Noam M. Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, 2017.
  • Wang et al. [2021] Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 548–558, 2021.
  • Wang et al. [2018] X. Wang, Ross Girshick, Abhinav Kumar Gupta, and Kaiming He. Non-local neural networks. IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018.
  • Wiegand et al. [2003] Thomas Wiegand, Gary J. Sullivan, Gisle Bjøntegaard, and Ajay Luthra. Overview of the H.264/AVC video coding standard. IEEE Trans. Circuits Syst. Video Technol., 2003.
  • Wyner and Ziv [1976] Aaron D. Wyner and Jacob Ziv. The rate-distortion function for source coding with side information at the decoder. IEEE Trans. Inf. Theory, 1976.
  • Xu et al. [2022] Dejia Xu, Peihao Wang, Yifan Jiang, Zhiwen Fan, and Zhangyang Wang. Signal processing for implicit neural representations. In Advances in Neural Information Processing Systems (NeurIPS), 2022.
  • Yu et al. [2020] Alex Yu, Vickie Ye, Matthew Tancik, and Angjoo Kanazawa. pixelnerf: Neural radiance fields from one or few images. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
  • Yu and Koltun [2015] Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. CoRR, abs/1511.07122, 2015.
  • Zeiler et al. [2010] Matthew D. Zeiler, Dilip Krishnan, Graham W. Taylor, and Rob Fergus. Deconvolutional networks. 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2010.
  • [70] Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  • Zhao et al. [2023] Qi Zhao, M. Salman Asif, and Zhan Ma. Dnerv: Modeling inherent dynamics via difference neural representation for videos. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
  • [72] Zongwei Zhou, Md Mahfuzur Rahman Siddiquee, Nima Tajbakhsh, and Jianming Liang. Unet++: A nested u-net architecture for medical image segmentation. MICCAI 2018.
PNeRV: Enhancing Spatial Consistency via Pyramidal Neural Representation for Videos

Supplementary Material

A More Discussions of Universal Approximation Theory (UAT) Analysis on NeRV

We provide further analysis and discussion of the UAT treatment of the NeRV system. We define the problem that current NeRV systems attempt to address and provide a comparison with existing video neural coding pipelines.

A.1 Implicit Neural Video Coding Problem

Following the pipeline of Implicit Neural Video Coding (INVC) presented in Sec. 4.2, we recall the proposed Implicit Neural Video Coding Problem (INVCP) as follows.

Problem A.1.

(INVC Problem). The goal of Implicit Neural Video Coding is to find the optimal design of the decoder $\mathcal{D}$ and encoder $\mathcal{E}$ in pursuit of a minimal parameter quantity $\mathsf{Param}(\mathcal{D})$ and minimal embeddings $\{e_{t}=\mathcal{E}(t)\in\mathbb{R}^{d_{in}^{t}}\}^{T}_{t=1}$ (where $d=d^{t}_{in}$ is often the same for all $t$ in existing NeRV systems), under a certain approximation error $\epsilon$ between the reconstruction $\tilde{V}$ and a given video sequence $V$,

\mathop{\arg\min}_{\mathcal{D},\mathcal{E}}~\mathsf{Param}(\mathcal{D})+\sum_{t=1}^{T}d_{in}^{t},
\mathrm{s.t.}~L_{\mathcal{D}},w_{\mathcal{D}}\in[1,\infty),~\sup\sum\|\tilde{V}_{t}-V_{t}\|\leq\epsilon,~t\in[1,T].

In the practice of INVC research, we usually use the dual problem of A.1 to determine the optimal architecture of a model to achieve a certain level of accuracy for fitting the video. We name it the Dual Implicit Neural Video Coding Problem (DINVCP).

Problem A.2.

(Dual INVC Problem). Given a certain parameter quantity $\mu$, the Dual INVC problem aims to determine the optimal design of the decoder $\mathcal{D}$ and encoder $\mathcal{E}$ that minimizes the approximation error between the reconstruction $\tilde{V}$ and the given video sequence $V$,

\mathop{\arg\min}_{\mathcal{D},\mathcal{E}}~\sup\sum\|\tilde{V}_{t}-V_{t}\|,
\mathrm{s.t.}~L_{\mathcal{D}},w_{\mathcal{D}}\in[1,\infty),~\mathsf{Param}(\mathcal{D})+\sum_{t=1}^{T}d_{in}^{t}\leq\mu,~t\in[1,T].

In practice, when a NeRV model is used to represent a given video within a certain model size limit $\mu$ through end-to-end training, it is effectively solving the DINVCP.

A.2 Comparison between DINVCP and Previous Neural Coding Pipelines

Distribution-Preserving Lossy Compression (DPLC) was proposed by [59], motivated by GAN-based image compression [3]. It is defined as follows:

\min_{E,D}~\mathbb{E}_{X,D}[d(X,D(E(X)))]+\lambda d_{f}(p_{X},p_{\tilde{X}}),

where $E,D,X,\tilde{X}$ are the encoder, decoder, given input, and reconstruction, and $d_{f}$ is a divergence which can be estimated from samples. DPLC emphasizes the importance of maintaining distribution consistency for effective compression and reconstruction.

[54] proposes Rate-Distortion Optimization (RDO). Later, [5] reveals the importance of perceptual quality and proposes Perception-Distortion Optimization (PDO) as

\min_{p_{\tilde{X}|Y}}~d(p_{X},p_{\tilde{X}})\quad\mathrm{s.t.}\quad\mathbb{E}[\Delta(X,\tilde{X})]\leq D,

where $\Delta$ is the distortion measure and $d$ is the divergence between distributions. Furthermore, [6] defines Rate-Distortion-Perception Optimization (RDPO) as

\min_{p_{\tilde{X}|X}}I(X,\tilde{X})\quad\mathrm{s.t.}\quad\mathbb{E}[\Delta(X,\tilde{X})]\leq D,~d(p_{X},p_{\tilde{X}})\leq P,

where $I$ denotes mutual information.

The main obstacle in the aforementioned pipelines is that density estimation is not only costly but also challenging to perform accurately. Different from DPLC, PDO, or RDPO, DINVCP does not need to model the distribution of the given signal explicitly. In fact, the distribution of input images or videos is difficult to approximate: whether it is approached by minimizing an ELBO or through adversarial training [28, 18], there is always a certain gap or mismatch. Besides, other density estimation methods, such as flow-based or diffusion models, suffer from huge computational costs [23, 49]. In contrast, the NeRV system implicitly models the unknown distribution of a given signal via a specific decoding computation process under a certain model parameter constraint. The computation process itself is regarded as the side information [65, 27].

This approach of implicitly modeling distributions through computational processes under parameter constraints aligns with recent perspectives suggesting that the intelligence of Large Language Models (LLMs) emerges from data compression [48, 13]. LLMs such as GPT aim to fit as much data as possible into a model of a given size (and then continue to scale up the model) to achieve information compression and efficient information coding. The NeRV system, in contrast, strives to compress the model size as much as possible for a given video, yielding robust representations with generalization capability.

The improvement of PNeRV in terms of perceptual quality supports this conjecture. By upgrading the model structure and training with only an MSE loss, PNeRV achieves better perceptual performance without having to estimate the signal's unobtainable prior distribution.

A.3 Proof of Theorem 1

Following the definitions given in Sec. 4, the width $w$ of $\mathcal{N}$ is defined as $\max_{i} d_{i}$, $\{d_{i}\in\mathbb{N}\}^{L}_{i=1}$. The minimal width $w^{*}=w_{\min}\left(d_{in},d_{out}\right)$ is estimated from $d_{in},d_{out}$ such that, for any continuous function $f:[0,1]^{d_{in}}\to\mathbb{R}^{d_{out}}$ and any $\epsilon\geq 0$, there exists an $\mathcal{N}$ with input dimension $d_{in}$, hidden layer widths at most $w^{*}$, and output dimension $d_{out}$ that $\epsilon$-approximates $f$:

\sup_{x\in[0,1]^{d_{in}}}\|f\left(x\right)-\mathcal{N}\left(x\right)\|\leq\epsilon.

The goal of Theorem 1 is to determine the minimal parameter demand when $\epsilon$-approximating the implicit function $\mathcal{F}$ that represents the given video. We restate Theorem 1 as Theorem A.1 below for better illustration.

Theorem A.1.

For a cascaded NeRV system to $\epsilon$-approximate a video $V$ that is implicitly characterized by a certain unknown L-Lipschitz continuous function $\mathcal{F}:K\to\mathbb{R}^{d_{out}}$, where $K\subseteq\mathbb{R}^{d_{in}}$ is a compact set, the upper bound of the minimal parameter quantity $\mathsf{Param}(\mathcal{D})$ is given as

\mathsf{Param}_{\min}(\mathcal{D})\leq d_{out}^{2}\left(\frac{O\left(\mathrm{diam}\left(K\right)\right)}{\omega_{\mathcal{F}}^{-1}\left(\epsilon\right)}\right)^{d_{in}+1}.

Before we start, we recall the setup and introduce some mathematical concepts and lemmas.

Definition A.1.

A function $g:\mathbb{R}^{d_{in}}\to\mathbb{R}^{d_{out}}$ is a max-min string of length $L\geq 1$ on $d_{in}$ input variables and $d_{out}$ output variables if there exist affine functions $\ell_{1},\ldots,\ell_{L}:\mathbb{R}^{d_{in}}\to\mathbb{R}^{d_{out}}$ such that

g=\sigma_{L-1}\left(\ell_{L},\sigma_{L-2}\left(\ell_{L-1},\cdots,\sigma_{2}\left(\ell_{3},\sigma_{1}\left(\ell_{1},\ell_{2}\right)\right)\cdots\right)\right).

The definitions of the max-min string and DMoC (Def. 3) were first introduced in [19] and [20]. We introduce two lemmas, which were presented as Propositions 2 and 3 in [20].

Lemma A.1.

[20] For every compact $K\subseteq\mathbb{R}^{d_{in}}$, any continuous $f:K\to\mathbb{R}^{d_{out}}$, and each $\epsilon\geq 0$, there exists a max-min string $g$ on $d_{in}$ input variables and $d_{out}$ output variables with length

\left(\frac{O\left(\mathrm{diam}\left(K\right)\right)}{\omega_{f}^{-1}\left(\epsilon\right)}\right)^{d_{in}+1},

for which

\|f-g\|_{C^{0}(K)}\leq\epsilon.
Lemma A.2.

[20] For every max-min string $g$ on $d_{in}$ input variables and $d_{out}$ output variables with length $L$, and every compact $K\subseteq\mathbb{R}^{d_{in}}$, there exists a ReLU net $\mathcal{N}$ with input dimension $d_{in}$, hidden layer width $d_{in}+d_{out}$, and depth $L$ that computes $x\mapsto g(x)$ for every $x\in K$.

Lemma A.3.

[46] For any $p\in[1,\infty)$, ReLU nets of width $w$ are dense in $L^{p}(\mathbb{R}^{d_{in}},\mathbb{R}^{d_{out}})$ if and only if $w\geq\max\{d_{in}+1,d_{out}\}$.

The proofs of Lemmas A.1 and A.2 can be found in Secs. 2.1 and 2.2 of [20]. Lemma A.3 is Theorem 1 of [46], where its proof is given. We now provide the proof of Theorem A.1.

Proof.

From Lemma A.1, the implicit function $\mathcal{F}_{\mathcal{V}}$ that represents the video $\mathcal{V}$ can be approximated by a max-min string $g$. It is worth mentioning that $\mathcal{F}_{\mathcal{V}}$ is assumed to be continuous because a video can be considered a slice of the real world. The length of this max-min string $g$ is given by Lemma A.1. According to Lemma A.2, there exists a ReLU net $\mathcal{N}_{g}$ with the same input and output dimensions that fits this max-min string. The minimal number of parameters of $\mathcal{N}_{g}$, i.e., the sum of the weight counts of all layers, is

\mathbf{Param}=\sum_{l=1}^{L}w_{l}w_{l-1},

where $w_{l}$ is the width of each hidden layer and $L$ is given in Lemma A.1. Notice that the overall width $w$ of a model is an upper bound on all hidden layer widths $\{w_{l}\}^{L}_{l=0}$, and $w_{\min}$ is the minimum estimate of this upper bound, so $w_{l}\leq w_{\min}\leq w$. The value of $w_{\min}$ is further tightened from $d_{in}+d_{out}$ to $\max\{d_{in}+1,d_{out}\}$ by [46] (Lemma A.3).

Thus, the minimal number of parameters of $\mathcal{N}_{g}$ under a certain error is no larger than

\begin{split}\mathbf{Param}_{\min}&\leq w_{\min}^{2}\left(\frac{O\left(\mathrm{diam}\left(K\right)\right)}{\omega_{f}^{-1}\left(\epsilon\right)}\right)^{d_{in}+1}\\ &=d_{out}^{2}\left(\frac{O\left(\mathrm{diam}\left(K\right)\right)}{\omega_{f}^{-1}\left(\epsilon\right)}\right)^{d_{in}+1},\end{split}

where $w_{\min}=d_{out}$ for a video $\mathcal{V}:\mathbb{N}\to\mathbb{R}^{d_{out}}$. Equality is reached in the worst case, when each layer width attains the upper bound of the minimal width. ∎

Although the upper bound of $\mathsf{Param}(\mathcal{D})$ is fixed regardless of the detailed architecture, the actual performance of a serial NeRV will be influenced by the structure design, parameter initialization, activation functions, loss functions, and optimizer.
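As a side note, the parameter count used in the proof, $\mathbf{Param}=\sum_{l}w_{l}w_{l-1}$, and the bound of Theorem A.1 can be evaluated numerically. The toy snippet below is purely illustrative: it assumes an L-Lipschitz target (so $\omega_{\mathcal{F}}^{-1}(\epsilon)=\epsilon/\mathrm{Lip}$), sets the hidden $O(\cdot)$ constant to 1, and uses arbitrary dimensions.

# A toy numeric illustration (not part of the paper) of the parameter count
# Param = sum_l w_l * w_{l-1} and of the worst-case bound in Theorem A.1 for a
# Lipschitz target, where omega_F(delta) = Lip * delta so omega_F^{-1}(eps) =
# eps / Lip. The O(.) constant is unknown; we simply set it to 1 here.
def mlp_param_count(widths):
    """Weights of a fully-connected ReLU net with layer widths [w_0, ..., w_L]."""
    return sum(widths[l] * widths[l - 1] for l in range(1, len(widths)))

def theorem_a1_bound(d_in, d_out, diam_K, lip, eps, big_o_const=1.0):
    """Upper bound d_out^2 * (C * diam(K) / omega^{-1}(eps))^(d_in + 1)."""
    return d_out ** 2 * (big_o_const * diam_K / (eps / lip)) ** (d_in + 1)

# Example: a scalar time index (d_in = 1) decoded to a small 3 x 8 x 8 patch.
d_in, d_out = 1, 3 * 8 * 8
print(mlp_param_count([d_in, d_out, d_out, d_out]))   # actual cascaded-net count
print(theorem_a1_bound(d_in, d_out, diam_K=1.0, lip=10.0, eps=0.1))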

B More Related Works

Comparison with Other Subpixel-based Upsampling Operators. The NeRV system aims at reconstructing high-resolution videos by decoding low-dimensional embeddings. Therefore, proper upsampling operators are crucial for its performance. Existing subpixel-based upsampling operators are not efficient enough for the NeRV system. Deconv [69] pads the subpixels with zeros and passes them through a Conv layer, resulting in block artifacts [45]. PixelShuffle [52] first expands the feature map channels through a Conv and then rearranges them into the target subpixels. However, the desired subpixels of a given position are only related to the expanded channels of the same position, ignoring contextual information, as shown in Fig. 2 of the main text. Additionally, the Conv preceding PixelShuffle must expand the channels by a factor of $r^{2}$, which becomes prohibitive when the upsampling ratio $r$ is large.
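The behavior described above can be reproduced with standard PyTorch operators; the snippet below (channel counts and kernel sizes are illustrative) shows that the Conv feeding PixelShuffle must output $r^{2}$ times more channels, while Deconv trades this for zero-padded subpixels.

# A short sketch contrasting the two subpixel-based upsampling operators
# discussed above. Note how the Conv feeding PixelShuffle must expand channels
# by a factor r^2, which explodes when the upsampling ratio r is large.
import torch
import torch.nn as nn

x = torch.rand(1, 64, 30, 60)             # a toy low-resolution feature map
r = 4                                      # upsampling ratio

# Deconv: zero-padded subpixels + convolution (prone to block artifacts).
deconv = nn.ConvTranspose2d(64, 64, kernel_size=r, stride=r)
print(deconv(x).shape)                     # torch.Size([1, 64, 120, 240])

# PixelShuffle: channel expansion followed by subpixel rearrangement; each
# output subpixel only sees the expanded channels at the same spatial site.
expand = nn.Conv2d(64, 64 * r * r, kernel_size=3, padding=1)   # 64 -> 1024 channels
shuffle = nn.PixelShuffle(r)
print(shuffle(expand(x)).shape)            # torch.Size([1, 64, 120, 240])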

Comparison with INR on Images. [53] (SIREN) uses sine as a periodic activation function to model the high-frequency information of a given image, [58] performs a sinusoidal transformation before the input, and [66] tries to directly modify an INR without explicit decoding. The main difference between these methods and ours is that, for INR on images, the input coordinate-pixel pairs can be considered dense. In a natural image, the RGB value at a specific position is often closely related to its neighboring positions. However, for high-resolution videos, the gap between adjacent frames can be much larger, both in pixel and semantic terms. This situation is akin to observing only partial pixels from a given image.

Comparison with Self-attention Module. Self-attention (SA) and Multi-head Self-attention (MSA) modules [63, 61, 38, 56] compute the response at a position by attending to all positions, which is similar to KFc. The major defect of SA and MSA when adopted in NeRV is that the computational complexity and the space complexity are too high to efficiently compute the global correlations between arbitrary positions, especially the computational cost ($O(n^{2}d)$) between queries and keys for high-resolution feature maps. KFc not only captures long-range dependencies but also achieves low-cost rescaling, both of which are significant for NeRV.
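For illustration only, the sketch below shows one possible Kronecker-factorized fully-connected rescaling, written from the description above: two small matrices mix the height and width axes per channel, giving every output position a global receptive field at a cost of $HH'+WW'$ weights instead of $HWH'W'$. This is our own simplified reading, not the exact KFc used in PNeRV.

# An illustrative sketch of a Kronecker-factorized fully-connected rescaling
# layer. It is an assumption about the design, not the authors' exact KFc:
# a vanilla FC from H*W to H'*W' would need H*W*H'*W' weights, while
# factorizing it into height and width mixing matrices needs only H*H' + W*W'.
import torch
import torch.nn as nn

class KroneckerFC(nn.Module):
    def __init__(self, in_hw, out_hw):
        super().__init__()
        (h_in, w_in), (h_out, w_out) = in_hw, out_hw
        self.U = nn.Parameter(torch.randn(h_out, h_in) / h_in ** 0.5)  # height mixing
        self.V = nn.Parameter(torch.randn(w_out, w_in) / w_in ** 0.5)  # width mixing

    def forward(self, x):                  # x: (B, C, H_in, W_in)
        # y = U x V^T per channel, i.e. a (U kron V)-structured linear map:
        # every output position attends to every input position.
        y = torch.einsum('oh,bchw->bcow', self.U, x)
        y = torch.einsum('pw,bcow->bcop', self.V, y)
        return y

x = torch.rand(2, 16, 30, 60)
print(KroneckerFC((30, 60), (60, 120))(x).shape)   # torch.Size([2, 16, 60, 120])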

C Additional Results

Unless otherwise specified, all models used in the additional results are 3M in size and trained for 300 epochs.

C.1 Comparison of Generalization Ability by Video Interpolation Results

Indeed, approximation and generalization are distinct topics within deep learning theory [44, 1]. Understanding the causal relationship between overfitting and the generalization capacity of NeRV necessitates further investigation. Existing NeRV works mainly focus on approximation capability through overfitting training.

Nonetheless, we also evaluate the generalization performance of our proposed PNeRV through a video interpolation experiment. Adhering to the experimental methodology employed in [10] and [71], the model is trained using odd-numbered frames and then tested with unseen even-numbered frames. The results, presented in Table C.1, indicate that PNeRV surpasses most baseline methods. Future research will focus on the theoretical analysis and enhancement of PNeRV’s generalization abilities.
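For reference, a minimal sketch of this split is given below, assuming 0-based frame indices (so "odd-numbered" frames are those with odd index); the PSNR helper follows the standard definition.

# A minimal sketch of the interpolation protocol described above: fit on
# odd-numbered frames, evaluate on the held-out even-numbered ones.
import torch

def split_odd_even(video):
    """video: (T, C, H, W) -> (train on odd frame indices, test on even)."""
    train_idx = torch.arange(1, video.shape[0], 2)   # seen during training
    test_idx = torch.arange(0, video.shape[0], 2)    # unseen, for evaluation
    return video[train_idx], video[test_idx]

def psnr(pred, target, max_val=1.0):
    mse = torch.mean((pred - target) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)

video = torch.rand(32, 3, 64, 64)
train_frames, test_frames = split_odd_even(video)
print(train_frames.shape, test_frames.shape)         # (16, ...) and (16, ...)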

Beauty Bospho Honey Jockey Ready Shake Yacht Avg.
NeRV [9] 28.05 30.04 36.99 20.00 17.02 29.15 24.50 26.54
E-NeRV [31] 27.35 28.95 38.24 19.39 16.74 30.23 22.45 26.19
H-NeRV [10] 31.10 34.38 38.83 23.82 20.99 32.61 27.24 29.85
DiffNeRV [71] 35.99 35.10 37.43 30.61 24.05 35.34 28.70 32.47
PNeRV 33.64 34.09 39.85 28.74 23.12 31.49 27.35 31.18
Table C.1: Video interpolation results on 960×1920 UVG in PSNR.
Bmx-B Camel Dance-J Drift-C Elephant Parkour Scoo-G Scoo-B Avg.
HNeRV 20.39 21.85 21.73 28.81 17.35 19.97 24.49 19.76 21.79
DiffNeRV 22.95 23.72 21.78 30.37 26.02 21.55 22.78 21.00 23.77
PNeRV 21.69 24.28 25.21 30.01 27.32 22.61 22.84 22.61 24.57
Table C.2: Video inpainting results using the center mask on 960×1920 DAVIS in PSNR.
Bmx-B Camel Dance-J Drift-C Elephant Parkour Scoo-G Scoo-B Avg.
HNeRV 0.665 0.733 0.677 0.650 0.489 0.650 0.859 0.789 0.725
DiffNeRV 0.767 0.815 0.667 0.949 0.817 0.754 0.852 0.844 0.808
PNeRV 0.802 0.844 0.792 0.947 0.862 0.801 0.874 0.812 0.842
Table C.3: Video inpainting results using the center mask on 960×1920 DAVIS in SSIM.
Bmx-B Camel Dance-J Drift-C Elephant Parkour Scoo-G Scoo-B Avg.
HNeRV 23.16 20.94 26.54 31.70 17.36 21.32 26.89 21.05 23.62
DiffNeRV 25.70 24.71 26.59 34.74 25.93 24.51 26.61 24.27 26.63
PNeRV 24.96 24.18 26.62 34.84 27.50 24.98 26.85 22.13 26.51
Table C.4: Video inpainting results using the disperse mask on 960×1920 DAVIS in PSNR.
Bmx-B Camel Dance-J Drift-C Elephant Parkour Scoo-G Scoo-B Avg.
HNeRV 0.728 0.661 0.779 0.957 0.490 0.685 0.889 0.794 0.748
DiffNeRV 0.819 0.832 0.795 0.972 0.827 0.799 0.892 0.897 0.854
PNeRV 0.843 0.854 0.806 0.975 0.877 0.836 0.910 0.866 0.871
Table C.5: Video inpainting results using the disperse mask on 960×1920 DAVIS in SSIM.
Figure C.1: PSNR of video compression on UVG.
Figure C.2: SSIM of video compression on UVG.
Figure C.3: Example of training difficulty of different NeRV methods in 3M size.

C.2 Comparison of Video Compression and Discussion of Training Difficulties

The video compression comparison of PNeRV with other NeRV models in terms of PSNR and MS-SSIM is shown in Fig. C.1 and Fig. C.2. Following the same settings as in [10, 71], we evaluate video compression with 8-bit quantization for both the embeddings and the model, without model pruning.
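For clarity, the snippet below sketches a plain uniform 8-bit quantizer of the kind referred to above; the exact quantization scheme used by the compared methods may differ in details such as per-layer scaling.

# A minimal sketch of 8-bit post-training quantization, applied uniformly per
# tensor to decoder weights and embeddings. Illustrative only.
import torch

def quantize_uint8(x):
    """Uniform 8-bit quantization; returns the dequantized tensor plus codes."""
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / 255.0 + 1e-12
    codes = torch.round((x - lo) / scale).clamp(0, 255).to(torch.uint8)
    return codes.float() * scale + lo, codes

w = torch.randn(256, 256)
w_hat, codes = quantize_uint8(w)
print(torch.mean((w - w_hat) ** 2))        # small quantization error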

PNeRV demonstrates remarkable performance, notably outperforming conventional coding pipelines such as H.264 [64] and H.265 [55], and possesses substantial advantages over several traditional neural video coding models [39, 33, 30], particularly at low bit rates. Compared to INR-based methods, PNeRV also achieves competitive results and outperforms other NeRV methods [9, 10, 71] in terms of PSNR.

Regarding the detailed experimental settings, we adjust the size of the decoder and the dimensions of the input diff embedding to validate the coding performance of the proposed method across various bit rates. At low bit rates, the coding performance of the model may degrade somewhat. We believe this is due to the diversity and complexity of the modules required by PNeRV: maintaining a certain number of parameters (such as the number of channels in the convolutional layers) is crucial for preserving performance and ensures that the model has sufficient capacity to handle the challenges posed by low-bit-rate coding.

It is worth noting that all implicit models encounter significant training challenges when the parameter count is large, e.g., exceeding 5M. As a result, these models often converge to local minima, leading to trivial outputs. This issue poses a significant obstacle to the compression performance of all NeRV methods, particularly as the bpp value increases. Some examples of training failure are shown in Fig. C.3, where all models are 3M and trained under the same conditions.

C.3 Comparison of Robustness by Video Inpainting Results

We evaluate the robustness of different methods on video inpainting tasks following the same settings as [10] and [71], which use a center mask and a disperse mask. The center mask is a rectangular area occupying one-fourth of the width and height of the original frame, positioned at its center. The disperse mask comprises five square areas, each measuring 100×100 pixels, positioned at the four corners and the center of the frame. Pixel values inside the masked areas are set to 0. The models trained on the video regression task are directly used for inpainting without any fine-tuning: they take the masked frames as input and try to predict the original ones.
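The two masks can be generated as follows; this sketch mirrors the description above (a centered h/4 × w/4 rectangle, and five 100×100 squares at the corners and center), with masked pixels zeroed by multiplication.

# Center mask: a rectangle of size (h/4, w/4) at the frame center.
# Disperse mask: five 100 x 100 squares at the four corners and the center.
import torch

def center_mask(h, w):
    mask = torch.ones(h, w)
    mh, mw = h // 4, w // 4
    top, left = (h - mh) // 2, (w - mw) // 2
    mask[top:top + mh, left:left + mw] = 0
    return mask

def disperse_mask(h, w, s=100):
    mask = torch.ones(h, w)
    anchors = [(0, 0), (0, w - s), (h - s, 0), (h - s, w - s),
               ((h - s) // 2, (w - s) // 2)]
    for top, left in anchors:
        mask[top:top + s, left:left + s] = 0
    return mask

frame = torch.rand(3, 960, 1920)
masked = frame * center_mask(960, 1920)     # model input for inpainting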

The results using the center mask are provided in Tab. C.2 and Tab. C.3, and those using the disperse mask in Tab. C.4 and Tab. C.5. PNeRV achieves competitive results with both the center mask and the disperse mask, indicating robust modeling capability.

C.4 More Visualization Examples for Perceptual Quality

We show some more examples of qualitative comparisons between different models.

As shown in Fig. D.4, the results of PNeRV are smoother and less noisy. For instance, in “Lucia” and “Horse-low”, PNeRV pays more attention to the geometric pattern of the main objects and ignores the high-frequency, irrelevant details of the background scene. Other baseline methods cannot reconstruct frames at such a semantic level. Due to the lack of high-level information guidance and a global receptive field, baseline methods struggle to reasonably allocate model capacity to the more important objects, e.g., the red water pipe in “Breakdance-flare” and the patterns in “Cows”.

As shown in Fig. D.5, the comparison at different timestamps of the same video reveals some common issues of different models. Overlapping and noisy patterns appear in the results of DiffNeRV [71] and HNeRV [10], such as the grass and hands in “Hike”. ENeRV [31] and NeRV [9] often suffer from color deviation and blurring, e.g., the backpack in “Hike” and the motor in “motor-bump”. PNeRV achieves a balance between preserving details and maintaining semantic consistency. DiffNeRV, which also uses the difference between frames as input, reconstructs details in an unbiased manner; however, human attention to visual elements differs under different semantics. Improving the reconstruction through high-level information is one of PNeRV’s pursuits.

C.5 Discussion on the Failure Cases

As shown in Table 2, PNeRV fails on “Dog”, which is blurred and mixed with jitter and deformation. The “Soapbox” video, which comprises two clips from entirely different scenes connected by a few frames in which the camera rotates through a large angle, also poses a challenge. So far, PNeRV cannot handle severe temporal inconsistency effectively.

C.6 Video Examples

We provide some video examples from DAVIS as follows. From the video comparisons, it can be seen that the reconstructions of NeRV lose spatial details, and it is difficult for DNeRV to reconstruct videos containing pervasive, scattered high-frequency details. Whether there is large motion or high-frequency detail in the given videos, PNeRV is more robust in modeling spatial consistency, leading to better perceptual quality in the reconstructions. The links to the examples are presented as follows.
Dance-jump: https://drive.google.com/file/d/18JZq1BCkBJWCkZs-71OB7wI6j_Vma0vP/view?usp=drive_link
Elephant: https://drive.google.com/file/d/1rnPEsEtfA5UADU6BnwEDOPRG9hO9uPuM/view?usp=drive_link
Kite-surf: https://drive.google.com/file/d/1DDGw1zc2iJWcJHdBS4DOnfUQVf2H04Bs/view?usp=drive_link
Parkour: https://drive.google.com/file/d/1jWbJuoc-GCz2N_dXAJSER0PSy7ThrMr-/view?usp=drive_link
Scooter-grey: https://drive.google.com/file/d/1vs22Ru-AwAQuG710qbF72lwdHS1ABy83/view?usp=drive_link

Models Bmx-B Camel Dance-J Dog Drift-C Parkour Soapbox Avg. A.P.G
NeRV [9] 29.42/0.864 24.81/0.781 27.33/0.794 28.17/0.795 36.12/0.969 25.15/0.794 27.68/0.848 28.38/0.835 -
E-NeRV [31] 28.90/0.851 25.85/0.844 29.52/0.855 30.40/0.882 39.26/0.983 25.31/0.845 28.98/0.867 29.75/0.875 -
HNeRV [10] 29.98/0.872 25.94/0.851 29.60/0.850 30.96/0.898 39.27/0.985 26.56/0.851 29.81/0.881 30.30/0.874 -
DiffNeRV [71] 30.58/0.890 27.38/0.887 29.09/0.837 31.32/0.905 40.21/0.987 25.75/0.827 31.47/0.912 30.84/0.892 -
Ablation Study
Bilinear + Concat 24.85/0.783 24.49/0.793 28.32/0.806 26.19/0.723 31.92/0.943 25.09/0.793 29.23/0.872 27.16/0.816 -4.07
Bilinear + GRU 29.86/0.874 25.00/0.811 29.16/0.830 27.11/0.753 32.09/0.945 26.43/0.845 29.10/0.874 28.39/0.847 -2.84
Bilinear + LSTM 26.22/0.792 26.87/0.871 27.85/0.788 26.71/0.741 33.65/0.946 25.82/0.820 29.42/0.881 28.07/0.834 -3.16
Bilinear + BSM 29.97/0.877 27.35/0.881 29.49/0.838 27.14/0.756 34.34/0.968 26.15/0.835 29.14/0.876 29.08/0.862 -2.15
DeConv + Concat 28.06/0.840 24.07/0.774 27.86/0.792 25.16/0.693 34.97/0.961 22.13/0.683 29.33/0.877 27.37/0.803 -3.86
DeConv + GRU 27.52/0.827 28.16/0.900 29.09/0.825 25.76/0.706 37.91/0.980 25.09/0.793 29.54/0.882 29.00/0.845 -2.23
DeConv + LSTM 30.15/0.882 26.49/0.859 28.30/0.805 25.94/0.712 34.91/0.956 26.35/0.842 30.26/0.895 28.91/0.850 -2.32
DeConv + BSM 31.56/0.906 27.18/0.878 29.77/0.847 30.09/0.868 36.03/0.971 26.09/0.831 29.00/0.872 29.96/0.881 -1.27
KFc + Concat 27.51/0.826 25.02/0.816 29.02/0.831 28.80/0.831 36.82/0.974 25.12/0.796 28.53/0.864 28.68/0.848 -2.55
KFc + GRU 31.69/0.910 25.88/0.848 28.32/0.805 28.47/0.813 33.25/0.942 26.68/0.853 30.89/0.903 29.31/0.868 -1.92
KFc + LSTM 29.16/0.862 27.24/0.878 28.90/0.825 29.28/0.842 32.73/0.935 26.62/0.839 29.35/0.879 29.04/0.866 -2.19
KFc + BSM (PNeRV) 31.05/0.896 27.89/0.892 30.45/0.873 31.08/0.898 40.23/0.987 27.08/0.867 30.85/0.902 31.22/0.902 +0
Table C.6: Ablation results on the DAVIS subset in PSNR and MS-SSIM, where Avg. is the average PSNR and A.P.G is the average PSNR gap. Every result is reported for the corresponding model trained for 300 epochs at 3M size.
40×80 20×40 10×20
PSNR 31.94 31.33 30.50
SSIM 0.960 0.954 0.947
Table C.7: Embedding size in PNeRV-L.
1×1 3×3 5×5
PSNR 31.92 31.94 31.90
SSIM 0.960 0.960 0.961
Table C.8: Kernel size in BSM.
ReLU Leaky GeLU w/o BN
PSNR 31.80 31.86 31.94 31.53
SSIM 0.959 0.961 0.960 0.959
Table C.9: Activation and BN in KFc.

D Additional Ablation Studies

D.1 Ablation Results of Model Structure Details

We ablate the structural details of PNeRV at 3M size on “Rollerblade” (480×960) from DAVIS, as given in Tab. C.7, Tab. C.8, and Tab. C.9. Altering the kernel size or the activation has little influence. Encoding more information into the embeddings helps the decoder reconstruct better but also increases the overall size.

D.2 Ablation Results of Proposed Modules on DAVIS

To verify the contribution of different modules in PNeRV, we conduct ablation studies on (1) upscaling operators and (2) gated memory mechanisms. We compare KFc with two upscaling layers, Deconv and Bilinear, where “Deconv” is implemented by “nn.ConvTranspose2d” from PyTorch, and “Bilinear” is the combination of bilinear upsampling and Conv2D. KFc achieves better performance due to the global receptive field regardless of what fusion module it is combined with.
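For completeness, the two baseline upscaling blocks can be written as below, following the description above; the kernel sizes and channel counts are illustrative choices.

# The two baseline upscaling blocks as described above: "Deconv" built on
# nn.ConvTranspose2d, and "Bilinear" as bilinear upsampling followed by Conv2d.
import torch
import torch.nn as nn

def deconv_block(c_in, c_out, r):
    return nn.ConvTranspose2d(c_in, c_out, kernel_size=r, stride=r)

def bilinear_block(c_in, c_out, r):
    return nn.Sequential(
        nn.Upsample(scale_factor=r, mode='bilinear', align_corners=False),
        nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
    )

x = torch.rand(1, 64, 30, 60)
print(deconv_block(64, 64, 2)(x).shape)     # torch.Size([1, 64, 60, 120])
print(bilinear_block(64, 64, 2)(x).shape)   # torch.Size([1, 64, 60, 120])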

Also, to illustrate the importance of adaptive feature fusion and improvement of BSM, we compare BSM with Concat, GRU and LSTM, where “Concat” means directly concatenating two features from different domains together. The ablation results suggest that the adaptive fusion of features from different domains significantly improves performance, and BSM outperforms other memory cells due to the disentangled feature learning. The last row is the final PNeRV and the last column shows PSNR gaps when changing modules in PNeRV.
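To illustrate what “adaptive fusion” means in contrast to plain concatenation, the sketch below shows a generic convolutional gate that mixes a granular feature with a high-level one per position and channel. It is only an illustration of the idea and is not the authors’ BSM, whose internal design differs.

# A generic gated fusion of a high-level feature with a granular one, written
# as a sigmoid gate. Illustration only; this is not the actual BSM.
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)

    def forward(self, fine, coarse):        # both (B, C, H, W), coarse pre-upsampled
        z = torch.sigmoid(self.gate(torch.cat([fine, coarse], dim=1)))
        return z * fine + (1 - z) * coarse  # per-pixel, per-channel mixing

fine = torch.rand(1, 32, 60, 120)
coarse = torch.rand(1, 32, 60, 120)
print(GatedFusion(32)(fine, coarse).shape)  # torch.Size([1, 32, 60, 120])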

D.3 Visualization of Feature Maps

To verify the effectiveness of hierarchical information merging via KFc and BSM, we visualize some feature maps from PNeRV-L pretrained on “Parkour”. The feature maps shown in Fig. D.6 are from different channels and layers using the same frame as input, while those in Fig. D.7 are all from the 4-th layer but use different frames as input. The feature maps of the 4-th layer are at 480×960, and the original frames are at 960×1920; for each lower layer, the height and width are halved compared to the upper layer. “Before” and “After” refer to the feature maps before and after passing through BSM.

Fig. D.6 illustrates how the coarse features are refined by BSM. Different channels respond to distinct spatial patterns of video frames, including factors like color, geometric structure, texture, brightness, motion, and so on. Before being processed by the BSM, the vanilla features are semantically mixed and entangled. However, the BSM is able to decouple these features and distinguish their specific effects, resulting in more refined and distinct outputs.

Additionally, for imperfect feature maps, BSM can add details or balance the focus of the reconstruction across various areas of the frames. These phenomena are commonly observed in the 4-th layer, which is responsible for preparing for fine-grained reconstruction, as demonstrated in Fig. D.7. This shows the effectiveness of BSM in enhancing the quality of the feature maps and improving the overall reconstruction.

Figure D.4: Visual comparison examples on various videos.
Figure D.5: Visual comparison examples on the same video by the same models. Corresponding time stamps are shown in the bottom left.
Figure D.6: Visualization examples of feature maps in different layers. “C” refers to the channel number and “L” is the layer number.
Figure D.7: Visualization examples of feature maps for different frames. “C” refers to the channel number and “L” is the layer number.