MNeRV: A Multilayer Neural Representation for Videos
Abstract
As a novel video representation method, Neural Representations for Videos (NeRV) has shown great potential in video compression, video restoration, and video interpolation. When a video is represented with NeRV, each frame corresponds to an embedding, which is reconstructed into a video frame after passing through a small number of decoding layers (E-NeRV, HNeRV, etc.). However, this small number of decoding layers easily leads to redundant model parameters, because a single decoding layer holds a large proportion of the parameters, which greatly restricts the video regression ability of the model. In this paper, we propose a multilayer neural representation for videos (MNeRV) and design a new decoder, M-Decoder, and its matching encoder, M-Encoder. MNeRV has more encoding and decoding layers, which effectively alleviates the parameter redundancy caused by too few layers. In addition, we design MNeRV blocks to allocate parameters more uniformly and effectively across decoding layers. In video regression reconstruction, we achieve better reconstruction quality (+4.06 PSNR) with fewer parameters. Finally, we showcase MNeRV's performance on downstream tasks such as video restoration and video interpolation. The source code of MNeRV is available at https://github.com/Aaronbtb/MNeRV.
Keywords: Neural networks; Noise reduction; Video compression; Network architecture; Decoding
1 Introduction
With the rise of short videos on the internet, more and more people are uploading videos shot on mobile devices (such as smartphones) to video-sharing websites like YouTube and Bilibili, and videos are playing an increasingly important role in people's lives. Because of objective factors such as network bandwidth and storage space, videos must be compressed to better serve people; the growing demand for online video sharing and the limits of bandwidth and storage make video compression algorithms ever more important. Traditional video compression methods[1; 2; 3] can explicitly represent videos as frame sequences, but they bring huge computational costs during compression and decompression, and the corresponding decoding time also grows. Some researchers have introduced deep learning into video encoding and decoding: within the traditional framework, some modules are replaced with trainable deep learning models[4; 5; 6; 7; 8; 9; 10; 11], which greatly improves performance. Others have replaced all traditional modules with deep learning models[12; 13; 14; 15; 16; 17; 18; 19; 20] and adopted end-to-end training, which also achieves good compression results. However, both approaches share the disadvantages of traditional methods: high training cost and slow decoding speed.
To avoid the high computational cost of the traditional pipeline and exploit the stronger fitting ability of neural networks, some researchers have applied implicit neural representations to video encoding and decoding, proposing a new video representation method: neural representations for videos (NeRV)[21]. This method represents a video as a neural network at the decoder end: it takes frame indices as input, feeds them through the decoding layers, and outputs images, i.e., the video is represented as an implicit neural network. NeRV and its follow-up works have not only shown good results in compression but also achieved good results in various downstream video tasks such as video interpolation and video denoising, with the advantages of easy training and fast decoding. However, most NeRV-like video representation models have relatively few decoding layers at the decoder end, which leads to two problems. First, fewer decoding layers perform poorly when fitting videos with large camera movements. Second, fewer NeRV-like blocks lack the ability to allocate model parameters reasonably.
In this work, we propose a multilayer neural representation for videos (MNeRV) with a new encoder end, M-Encoder, and a new decoder end, M-Decoder. Thanks to its larger number of encoding and decoding layers, it performs well when fitting videos with large camera movements. In addition, we propose the MNeRV block, which allows more reasonable parameter allocation at the decoder end. We also remove some redundant designs, making the model more streamlined and able to achieve better results with fewer parameters. In M-Encoder, we keep the number of encoding layers equal to the number of decoding layers in the decoder end, so that the extracted feature embeddings match the decoder end more closely. We also introduce the Global Response Normalization (GRN)[22] layer to enhance the competition among feature channels at the encoding end and improve the model's accuracy. We apply MNeRV to downstream tasks such as video compression, video restoration, and video interpolation, and demonstrate its excellent performance.
Our main contributions are summarized as follows:
-
We propose MNeRV, a novel image-wise implicit video representation method with an efficient multilayer architecture.
-
We design a new encoder and decoder, M-Encoder and M-Decoder. By introducing the GRN layer, removing redundant structures, and increasing the number of encoding and decoding layers, the model becomes more efficient; ablation experiments verify that these designs are effective.
-
Through extensive experiments, we demonstrate that our method achieves better video reconstruction quality (+4.06 PSNR) and better results on downstream tasks such as video compression, video restoration, and video interpolation.
2 Related Work
2.1 Pixel-wise Implicit Neural Representations
Implicit neural representations are a novel signal representation method that approximates a mapping function by fitting a neural network. They have powerful modeling capabilities for various signals, with applications such as data compression[23; 24; 25], 3D reconstruction[26; 27; 28; 29; 30], and 3D view synthesis[26; 27; 28; 29; 30]. Early implicit neural representations for videos were usually pixel-wise: the model was trained to learn the mapping between the coordinates of a point and its RGB value[31; 32]. This approach has high training costs, slow encoding and decoding speeds, and lower compression rates.
2.2 Image-wise Implicit Neural Representations
NeRV first proposed an image-wise implicit representation for videos, which fits a video into a neural network built from convolution and pixel-shuffle layers. In this way, the decoding process of the video becomes the inference process of the model, which greatly improves decoding speed. In addition, owing to the properties of neural networks themselves, NeRV-like representations also perform well in downstream video tasks such as video interpolation and video restoration, attracting more and more researchers to study and improve them. Bai et al.[33] balanced coordinate-based implicit neural representations (INRs) and the image-wise representation of NeRV: they introduced the idea of partitioning into NeRV, representing a video as multiple image patches, each with a coordinate, and added AdaIN to the network to improve the fitting quality. Li et al.[34] improved the encoder and decoder layers of NeRV and proposed E-NeRV, further removing redundant structures in the network and introducing a spatiotemporal-context-based encoder, converging 8 times faster than NeRV. Mai et al.[35] introduced a motion-adjustable neural implicit video representation, which maps time to a driving signal to modulate the frame-generation process, and achieved good results. Lee et al.[36] further improved NeRV by introducing optical flow into the frame information; in addition, they introduced a fully convolutional architecture, enabled by one-dimensional temporal grids, improving the continuity of spatial features.
However, all of the above improvements take the positional embedding of time as the input. Time-based encoding cannot capture the specific content of an image, resulting in a low compression rate. To make the embedding content-related, Chen et al. proposed the content-adaptive encoder CNeRV[37]. Subsequently, following the idea of decoding from feature embeddings, they proposed another architecture, HNeRV[38], which uses ConvNeXt blocks in the encoder to encode each frame into a small feature map used as its embedding. In this way, the embedding is content-related, giving the model a high compression rate and good fitting quality. However, because the decoder end has few decoding layers, accuracy decreases when fitting dynamic videos with fast-moving objects or moving cameras. He et al. exploited the fitting ability of neural networks across video types and proposed D-NeRV[39], which represents a large and diverse set of videos as a single neural network and employs task-oriented flow as an intermediate output to reduce spatial redundancy. It performs better on long videos and points toward future large video models, but its performance is poor when fitting a single short video. Zhao et al.[40] exploited the difference flow between frames on top of HNeRV and proposed a differential encoder to model the spatial features of specific content over short time spans, achieving good results in video interpolation and video restoration; however, the model cannot keep the embeddings small, so the overall compression rate of the represented video is low, which is especially costly for videos with many frames. Kwan et al. proposed HiNeRV[41], which pursues the best rate-distortion performance and is currently the most competitive INR method in video compression; however, its decoding based on bilinear interpolation performs poorly in video interpolation and video restoration.
In MNeRV, a single video is fitted with a neural network while keeping the embeddings small. Thanks to the reasonable per-layer parameter budget and the streamlined network architecture, MNeRV reconstructs videos with better quality using fewer parameters. Unlike DNeRV and HiNeRV, which each focus on a specific downstream task, MNeRV, like HNeRV, achieves good performance on many downstream tasks such as video interpolation, video restoration, and video compression.
3 Preliminaries
In NeRV-like models, the model mainly learns a mapping between the current frame $v_t$ and the reconstructed frame $v_t'$ produced by the neural network, where $t \in \{1,\dots,T\}$ indexes the frames. The entire architecture (see Figure 1) is divided into the encoder part $E$, the embedding $e_t$, and the decoder part $D$, expressed as follows:
$$e_t = E(v_t), \qquad v_t' = D(e_t) \tag{1}$$
$E$ is a learnable network encoder. In NeRV and E-NeRV, $E$ uses regular frequency positional encoding; in HNeRV, $E$ is built from ConvNeXt blocks[42]. $e_t$ is the feature map encoded by the encoder $E$, and $D$ is the decoder composed of many NeRV-like blocks. A NeRV-like block consists of three layers: a convolution layer, a PixelShuffle layer, and an activation layer, where the activation and PixelShuffle layers contain no learnable parameters. Unlike the earlier coordinate-based pixel-wise implicit neural representations[32], NeRV blocks mainly learn the mapping from feature map to feature map. During model inference, the number of channels in the feature map decreases after each NeRV block, while its spatial size increases; finally, the number of channels decreases to 3 and the size increases to that of the image. This image-wise representation gives NeRV more compact model parameters and faster decoding. In NeRV-like architectures before HNeRV, the positional embedding of time was used as input, and the embedding contained no content information, resulting in a low model compression rate. HNeRV encodes each frame into a small feature map and inputs it to the decoder as the embedding. With this content-based encoder end, it can quickly encode the content of a frame into a feature map, which eliminates the feature-encoding stage after the embedding in NeRV and gives faster decoding than NeRV.
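To make the structure of a NeRV-like decoding block concrete, the following is a minimal PyTorch sketch; the class name, channel widths, and GELU activation are illustrative assumptions, not the exact implementation of any of the cited models.

```python
import torch
import torch.nn as nn

class NeRVLikeBlock(nn.Module):
    """Conv -> PixelShuffle -> Activation: upsamples the feature map by `stride`
    while reducing the channel count from `c_in` to `c_out`."""
    def __init__(self, c_in: int, c_out: int, stride: int, kernel: int = 3):
        super().__init__()
        # The convolution produces c_out * stride**2 channels so that
        # PixelShuffle can trade channels for spatial resolution.
        self.conv = nn.Conv2d(c_in, c_out * stride ** 2, kernel, padding=kernel // 2)
        self.shuffle = nn.PixelShuffle(stride)   # no learnable parameters
        self.act = nn.GELU()                     # no learnable parameters

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.shuffle(self.conv(x)))

# Example: a 16x2x4 embedding upsampled 5x in the first decoding stage.
feat = torch.randn(1, 16, 2, 4)
out = NeRVLikeBlock(c_in=16, c_out=96, stride=5, kernel=1)(feat)  # -> (1, 96, 10, 20)
```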
MNeRV and HNeRV entropy-code the embedding $e_t$ and input it into the decoder to obtain the reconstructed frame $v_t'$. As a type of implicit neural representation, when MNeRV and HNeRV are used to represent videos, the total size includes both the embeddings and the decoder. The principle is to fit the frames of a video into a neural network stored at the decoder end, and the size of this neural network directly determines the size of the fitted video. Therefore, the size of the video represented as a neural network is the sum of the parameters of the decoder $D$ and the embeddings $e_t$, expressed as follows:
$$\mathrm{Size}(\text{video}) = \mathrm{Size}(D) + \sum_{t=1}^{T} \mathrm{Size}(e_t) \tag{2}$$
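As an illustration of Eq. (2), the sketch below counts the stored values of a hypothetical decoder and its per-frame embeddings; the helper name and the 300-frame example (the length of ShakeNDry) are ours, not from the paper.

```python
import torch
import torch.nn as nn

def representation_size(decoder: nn.Module, embeddings) -> int:
    """Total number of stored values: decoder parameters plus all per-frame embeddings."""
    decoder_params = sum(p.numel() for p in decoder.parameters())
    embedding_values = sum(e.numel() for e in embeddings)
    return decoder_params + embedding_values

# Example: 300 frames, each with a 16x2x4 embedding (as used for 640x1280 videos).
# total = representation_size(decoder, [torch.zeros(16, 2, 4)] * 300)
```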
There are multiple decoding layers in the decoder $D$. In HNeRV, most decoding layers contain a NeRV block. MNeRV further removes redundant structures, changes the decoding strides, and changes the kernel size of the convolution layer in each MNeRV block, so that every decoding layer contains an MNeRV block, reducing the model parameters while achieving the same quality (see Sec. 4.2 for details). In M-Encoder, we use ConvNeXt blocks to construct our encoder, introduce the GRN layer, and change the kernel size and stride (see Sec. 4.1 for details).
4 Method
First, we introduce M-Encoder, including its encoding strides and number of encoding layers. Then, we detail M-Decoder and the upgraded design and parameter configuration of the MNeRV block. The improved design of MNeRV is illustrated in Figure 2 a), b), and c), while the composition of the MNeRV block is shown in d). Finally, we introduce the loss functions used in downstream video tasks such as video compression, video interpolation, and video restoration.
4.1 Encoder
Inspired by HNeRV, we also use ConvNeXt blocks to construct the encoding layers and assemble them into M-Encoder, which extracts features as the encoder end of MNeRV. In M-Encoder, to make the extracted features more compatible with the decoder, we change the strides 5,4,4,2,2 of HNeRV to 5,2,2,2,2,2,2, while keeping the small 16×2×4 embedding unchanged. To enhance the competition between feature channels, we introduce the GRN layer in M-Encoder, which we show to be necessary (Table 5).
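A minimal sketch of the M-Encoder idea is shown below, assuming PyTorch: seven downsampling stages with strides 5,2,2,2,2,2,2 reduce a 3×640×1280 frame to a 16×2×4 embedding, with a GRN layer after each stage. The intermediate channel widths and the plain convolution stages are illustrative simplifications; the paper builds the stages from ConvNeXt blocks.

```python
import torch
import torch.nn as nn

class GRN(nn.Module):
    """Global Response Normalization (ConvNeXt V2), channels-first variant."""
    def __init__(self, dim: int):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(1, dim, 1, 1))
        self.beta = nn.Parameter(torch.zeros(1, dim, 1, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gx = torch.norm(x, p=2, dim=(2, 3), keepdim=True)    # per-channel spatial L2 energy
        nx = gx / (gx.mean(dim=1, keepdim=True) + 1e-6)       # divisive normalization across channels
        return self.gamma * (x * nx) + self.beta + x

class MEncoderSketch(nn.Module):
    """Seven downsampling stages: 3x640x1280 frame -> 16x2x4 embedding."""
    def __init__(self, strides=(5, 2, 2, 2, 2, 2, 2), widths=(64, 64, 64, 64, 64, 64, 16)):
        super().__init__()
        stages, c_in = [], 3
        for s, c_out in zip(strides, widths):
            stages += [nn.Conv2d(c_in, c_out, kernel_size=s, stride=s),  # downsample by s
                       nn.GELU(),
                       GRN(c_out)]                                       # channel competition
            c_in = c_out
        self.stages = nn.Sequential(*stages)

    def forward(self, frame):            # frame: (N, 3, 640, 1280)
        return self.stages(frame)        # embedding: (N, 16, 2, 4)
```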
4.2 Decoder
At the NeRV decoder end, there are five decoding layers; the kernel size k of the convolution layer in each NeRV block is set to 3, and the channel attenuation factor r is set to 2, i.e., the number of channels is halved at each layer. This parameter allocation makes it difficult for the later decoding layers to obtain enough parameters to fit detailed videos. The authors of HNeRV also noticed this problem: at the decoder end of HNeRV, the kernel size of the convolution layer in the later NeRV blocks is increased and the channel attenuation factor is reduced, which achieves some improvement. But we show that this is not enough. With too few decoding layers, it is difficult to allocate parameters reasonably, and the uneven parameter allocation directly affects the quality of the fitted video. In this paper, we propose a new decoder, M-Decoder, and a more efficient decoding block, the MNeRV block. M-Decoder has seven decoding layers; that is, the feature map transmitted from the encoder end is upsampled seven times to obtain the fitted image. The model parameter allocations of NeRV, HNeRV, and M-Decoder are compared in Figure 3. Thanks to more decoding layers and more reasonable MNeRV block sizes, the model can apply more fine-grained operations to the image at different stages, so that the parameters are evenly distributed across the layers. Moreover, unlike HNeRV, which performs a downsampling operation before the first decoding layer, M-Decoder has no downsampling operation: all decoding layers perform upsampling, further reducing redundancy.
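The sketch below assembles a seven-layer decoder from MNeRV-style blocks in PyTorch, using the strides 5,2,2,2,2,2,2 and kernel sizes 1,5,5,3,3,3,3 reported in Sec. 5.1. The initial channel width, the channel attenuation rule, and the final RGB projection are illustrative assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn

class MNeRVBlockSketch(nn.Module):
    """Conv -> PixelShuffle(stride) -> GELU, as in a NeRV-like decoding block."""
    def __init__(self, c_in, c_out, stride, kernel):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(c_in, c_out * stride ** 2, kernel, padding=kernel // 2),
            nn.PixelShuffle(stride),
            nn.GELU())

    def forward(self, x):
        return self.body(x)

class MDecoderSketch(nn.Module):
    """Seven upsampling stages: 16x2x4 embedding -> 3x640x1280 frame."""
    def __init__(self, c0=96, r=1.2,
                 strides=(5, 2, 2, 2, 2, 2, 2),
                 kernels=(1, 5, 5, 3, 3, 3, 3)):
        super().__init__()
        widths = [max(round(c0 / r ** i), 8) for i in range(len(strides))]
        blocks, c_in = [], 16                      # 16-channel embedding from M-Encoder
        for s, k, c_out in zip(strides, kernels, widths):
            blocks.append(MNeRVBlockSketch(c_in, c_out, s, k))
            c_in = c_out
        self.blocks = nn.Sequential(*blocks)
        self.to_rgb = nn.Conv2d(c_in, 3, kernel_size=3, padding=1)  # final RGB projection

    def forward(self, emb):                        # emb: (N, 16, 2, 4)
        return self.to_rgb(self.blocks(emb))       # frame: (N, 3, 640, 1280)

# Example: decode one embedding into a full-resolution frame.
frame = MDecoderSketch()(torch.randn(1, 16, 2, 4))  # -> torch.Size([1, 3, 640, 1280])
```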
4.3 Loss Function
Regarding the loss function, the $L2$ loss imposes a larger penalty on outliers and can make training unstable, so we use a combination of $L1$ and SSIM to avoid this. The loss function is as follows:
$$L = \frac{1}{T}\sum_{t=1}^{T}\Big[\alpha\,\lVert v_t - v_t'\rVert_1 + (1-\alpha)\big(1-\operatorname{SSIM}(v_t, v_t')\big)\Big] \tag{3}$$
where $v_t'$ is the video frame fitted by the neural network, $v_t$ is the corresponding ground truth, and $T$ is the total number of frames in the video. In MNeRV, we set $\alpha$ to 0.7 or 0.6; more details are given in the experiment section.
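A minimal sketch of Eq. (3) in PyTorch is given below; it assumes the third-party pytorch_msssim package for the SSIM term and frames normalized to [0, 1].

```python
import torch
import torch.nn.functional as F
from pytorch_msssim import ssim  # third-party package assumed for the SSIM term

def mnerv_loss(pred: torch.Tensor, gt: torch.Tensor, alpha: float = 0.7) -> torch.Tensor:
    """alpha * L1 + (1 - alpha) * (1 - SSIM), averaged over the batch of frames."""
    l1 = F.l1_loss(pred, gt)
    ssim_term = 1.0 - ssim(pred, gt, data_range=1.0)   # frames assumed in [0, 1]
    return alpha * l1 + (1.0 - alpha) * ssim_term
```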
5 Experiments
5.1 Datasets and Implementation Details
We validate our model on the UVG[43], DAVIS[44], REDs[45], and Bunny[46] datasets. The experiments were run on an RTX 3090. For the UVG dataset, we follow the processing of HNeRV and crop the 7 videos to 640×1280 resolution before inputting them into the M-Encoder. We use the Adam optimizer to train the model, set the channel attenuation factor r to 1.2, the learning rate to 0.001, and the batch size to 2. To fit 640×1280 videos, we set the strides to 5,2,2,2,2,2,2 and the kernel sizes to 1,5,5,3,3,3,3. In implicit neural representation of videos, the fitting speed of the neural network is the encoding speed of the video; to pursue faster encoding, we train for 100 epochs. Unless otherwise specified, we perform the same operations on the DAVIS, REDs, and Bunny datasets.
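For reference, the following is a minimal sketch of the per-video fitting loop described above; it reuses the mnerv_loss sketch from Sec. 4.3 and assumes a DataLoader yielding (index, frame) pairs, with other details such as learning-rate scheduling omitted.

```python
import torch

def fit_video(encoder, decoder, loader, epochs=100, lr=1e-3, device="cuda"):
    """Fit one video: encoding a video amounts to training this pair of networks."""
    params = list(encoder.parameters()) + list(decoder.parameters())
    opt = torch.optim.Adam(params, lr=lr)
    for _ in range(epochs):
        for _, frame in loader:                    # batch size 2 in the paper
            frame = frame.to(device)
            pred = decoder(encoder(frame))         # embed, then reconstruct
            loss = mnerv_loss(pred, frame, alpha=0.7)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return encoder, decoder
```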
5.2 Main Results
We first conduct experiments on the Bunny dataset, comparing models at the 1.5M and 3M sizes. The results are shown in Table 1, where NeRV and HNeRV are reported both with the L2 loss and with the combined loss of Eq. (3) ($\alpha$ = 0.7), and MNeRV uses $\alpha$ = 0.7. For a fair comparison, we train each model three times with each loss function and report the average. We find that the combined loss with $\alpha$ = 0.7 improves reconstruction quality on the Bunny dataset compared to the L2 loss, and that MNeRV outperforms both NeRV and HNeRV.
Bunny | NeRV (L2) | NeRV (α=0.7) | HNeRV (L2) | HNeRV (α=0.7) | MNeRV
1.5M | 27.86 | 29.58 | 31.06 | 31.98 | 32.14
3M | 28.79 | 31.97 | 32.47 | 34.21 | 34.36
REDs (PSNR) | size | 032 | 033 | 084 | 108 | 153 | 231 | 232 | 233 | 243 | 269 | avg
HNeRV | 1.50M | 23.49 | 25.24 | 20.64 | 25.52 | 21.10 | 26.10 | 24.91 | 25.48 | 21.98 | 24.71 | 23.92
MNeRV | 1.46M | 25.90 | 27.30 | 22.67 | 28.00 | 23.41 | 28.38 | 27.02 | 27.61 | 23.99 | 26.62 | 26.09
UVG (PSNR) | size | Beauty | Bosphorus | HoneyBee | Jockey | ReadySteadyGo | ShakeNDry | YachtRide | avg
HNeRV | 1.50M | 33.92 | 32.49 | 38.13 | 28.75 | 23.82 | 33.01 | 28.45 | 31.22
MNeRV | 1.48M | 34.10 | 32.79 | 37.79 | 30.60 | 24.75 | 32.98 | 28.84 | 31.69
We conduct comparative experiments on four scales of the Loading dataset. For the loss function, we set $\alpha$ to 0.6. Figure 5 shows the performance of the NeRV, HNeRV, and MNeRV models at different scales, and a visual comparison is shown in Figure 4. The results on the Loading dataset show that our method improves significantly over NeRV and HNeRV. In addition, we compare our method with HNeRV on 10 REDs sub-datasets and 7 UVG sub-datasets, using L2 as the loss function; the results are shown in Table 2 and Table 3, respectively. Note that our method achieves better results with fewer parameters; the model size is given in the second column of both tables. Note also that the ShakeNDry sequence differs from the other UVG sub-datasets in that it contains only 300 frames; on ShakeNDry, the size of HNeRV is 1.52M and the size of MNeRV is 1.49M.
5.3 Ablation Studies
We conduct detailed ablation experiments on NeRV, HNeRV, and MNeRV. In implicit neural networks, the main knobs for adjusting parameter allocation are the kernel size and the stride. Thanks to its multi-layer network structure (ML), MNeRV offers a wider range of parameter-allocation options. We perform an ablation on kernel size and stride on the Jockey sequence, with results shown in Table 4 (a sketch of how these choices translate into per-layer parameter counts follows the table). We find that in the 5-layer NeRV-like block structure, the number of layers naturally limits the available parameter-allocation schemes, and bloated layers easily appear (as shown in Figure 3). The 7-layer NeRV-like block structure can adjust more kernel sizes and strides, offers more allocation schemes, and distributes the parameters of each layer more evenly, leading to better results. Finally, we select 5,2,2,2,2,2,2 and 1,5,5,3,3,3,3 as the stride and kernel size of MNeRV, respectively.
Block | Stride s | Kernel size k | PSNR
5 layers | 5,4,2,2,2,2 | 1,3,5,5,5,5 | 28.64
5 layers | 5,4,2,2,2,2 | 1,3,5,7,5,3 | 28.35
5 layers | 5,4,2,2,2,2 | 1,3,5,5,3,3 | 28.77
5 layers | 5,4,2,2,2,2 | 1,3,5,5,5,3 | 28.67
7 layers | 5,2,2,2,2,2,2 | 1,3,5,5,5,5,5 | 29.23
7 layers | 5,2,2,2,2,2,2 | 1,3,5,5,5,3,3 | 29.20
7 layers | 5,2,2,2,2,2,2 | 1,3,5,5,3,3,3 | 29.23
7 layers | 5,2,2,2,2,2,2 | 1,3,5,7,5,3,3 | 28.90
7 layers | 5,2,2,2,2,2,2 | 1,3,5,3,3,3,3 | 29.30
7 layers | 5,2,2,2,2,2,2 | 1,3,3,3,3,3,3 | 29.11
7 layers | 5,2,2,2,2,2,2 | 1,5,5,3,3,3,3 | 29.62
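To make the parameter-allocation effect of Table 4 concrete, the rough sketch below counts the convolution weights of each decoding layer for a given stride and kernel configuration; biases are ignored and the channel-width rule is the illustrative one used earlier.

```python
def conv_params(c_in: int, c_out: int, stride: int, kernel: int) -> int:
    # A NeRV-like block's convolution outputs c_out * stride^2 channels before
    # PixelShuffle, so its weight tensor holds this many values.
    return c_in * (c_out * stride ** 2) * kernel ** 2

def layer_params(strides, kernels, c_embed=16, c0=96, r=1.2):
    """Per-layer convolution parameter counts for one stride/kernel configuration."""
    widths = [max(round(c0 / r ** i), 8) for i in range(len(strides))]
    params, c_in = [], c_embed
    for s, k, c_out in zip(strides, kernels, widths):
        params.append(conv_params(c_in, c_out, s, k))
        c_in = c_out
    return params

# Example: the 7-layer configuration selected in Table 4.
print(layer_params((5, 2, 2, 2, 2, 2, 2), (1, 5, 5, 3, 3, 3, 3)))
```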
We study the effects of the three novel components of our method separately: the GRN layer introduced in M-Encoder, the multi-layer network structure (ML), and the header layer (HL) design at the beginning of the network. The ablation results are shown in Table 5, which details the contribution of each component.
To evaluate the performance of MNeRV on videos with significant camera motion, we conduct a comparative experiment on the REDs dataset (Figure 7) and perform an ablation study on the loss function (Table 6, where S denotes SSIM, MS denotes MS-SSIM, and SML1 denotes the smooth L1 loss).
UVG | GRN | ML | HL | PSNR | MS-SSIM
HNeRV | ✗ | ✗ | ✗ | 32.41 | 0.9085143 |
Variant 1 | ✓ | ✗ | ✗ | 32.45 | 0.90927143 |
Variant 2 | ✓ | ✓ | ✗ | 32.61 | 0.91531 |
MNeRV | ✓ | ✓ | ✓ | 32.64 | 0.915486 |
Loss | PSNR | Loss | PSNR |
0.7*L1+0.3*SSIM | 22.875 | 0.5*L2+0.5*S | 21.752 |
SML1 | 23.191 | 0.3*L1+0.7*S | 21.602 |
L1 | 22.565 | 0.3*L2+0.7*S | 21.286 |
0.5*L1+0.5*S | 22.408 | 0.6*L2+0.4*MS | 23.503 |
0.7*L2+0.3*S | 22.706 | 0.9*L1+0.1*MS | 23.319 |
0.7*L2+0.3*L1 | 23.039 | 0.8*L1+0.2*MS | 23.502 |
0.5*L2+0.5*L1 | 22.92 | 0.4*L2+0.6*MS | 22.853 |
0.9*L1+0.1*S | 22.842 | 0.7*L1+0.3*MS | 23.729 |
5.4 Downstream Tasks
We compared MNeRV with other implicit neural representation models on various downstream tasks, including video interpolation, video restoration, and video compression.
Video compression. Figure 6(a,b) compares the reconstruction quality of HNeRV and MNeRV. In the video compression experiment, we follow the pipeline of HNeRV and compare NeRV, HNeRV, and MNeRV on the UVG dataset at different compression ratios. The results are shown in Figure 8: our method is better than NeRV and HNeRV in both PSNR and MS-SSIM. In addition, we show the best and worst results of MNeRV in Figure 9. On videos with intense camera movement (such as ReadySteadyGo), MNeRV shows a significant improvement over HNeRV, but on videos with little camera movement (such as HoneyBee), MNeRV is inferior to HNeRV. On the UVG dataset overall, MNeRV's video compression performance is better than HNeRV's (+0.47 PSNR).
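One step of the HNeRV-style compression pipeline referenced above is post-training weight quantization; the sketch below shows a simple uniform quantizer as an illustration. The bit width and rounding scheme are assumptions, and the pruning and entropy-coding steps are omitted.

```python
import torch

def quantize_tensor(w: torch.Tensor, bits: int = 8):
    """Uniformly quantize a weight tensor; returns (codes, scale, offset)."""
    qmax = 2 ** bits - 1
    w_min, w_max = w.min(), w.max()
    scale = (w_max - w_min).clamp(min=1e-8) / qmax
    codes = torch.round((w - w_min) / scale).clamp(0, qmax).to(torch.uint8)
    return codes, scale, w_min

def dequantize_tensor(codes, scale, w_min):
    """Recover an approximation of the original weights before decoding."""
    return codes.float() * scale + w_min
```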
Video restoration. In Figure 6(c,d), we compare the restoration quality of HNeRV and MNeRV; MNeRV also performs well.
Video interpolation. As a neural representation, MNeRV has a strong frame-fitting ability. In the video interpolation experiment, for both HNeRV and MNeRV we hold out every other frame as test frames and use interpolated embeddings as input. Thanks to its learnable content embeddings, MNeRV shows better results, as shown in Figure 6(e,f).
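The embedding interpolation used here can be sketched as a linear blend of neighboring frame embeddings that is then decoded; the helper below is an illustrative assumption of that procedure, not the exact evaluation code.

```python
import torch

@torch.no_grad()
def interpolate_frame(decoder, emb_prev: torch.Tensor, emb_next: torch.Tensor, t: float = 0.5):
    """Blend the embeddings of two kept frames and decode the held-out frame between them."""
    emb_mid = (1.0 - t) * emb_prev + t * emb_next   # linear blend of neighboring embeddings
    return decoder(emb_mid.unsqueeze(0))            # assumes embeddings shaped (16, 2, 4)
```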
6 Conclusion
In this paper, we propose a multilayer neural representation for videos that uses a novel encoder-decoder architecture: M-Encoder and M-Decoder. This architecture allows a more reasonable parameter allocation while preserving the quality of the fitted images and reducing the number of model parameters. It uses a network with more layers to fit videos, yielding better fitting quality on videos with camera movement. We demonstrate through experiments that MNeRV improves significantly on the UVG, REDs, DAVIS, and other datasets. Finally, we apply MNeRV to downstream tasks such as video interpolation and video restoration and showcase its excellent performance.
Future work. The potential of the encoder end of MNeRV has not been fully realized; future work should focus on improving the encoder end so that the information contained in each frame's embedding is more accurate.
Acknowledgment
This project was partially funded by: the 2021 Guangdong Province Key Construction Discipline Scientific Research Ability Improvement Project ("AI neural network algorithm gives optical camera laser capability"), the National Key Research and Development Program ("Neutrino detector intelligent monitoring"), and the Wuyi University High-level Talent Scientific Research Start-up Project ("Knowledge graph construction for big data of scientific and technological resources").
References
- [1] T. Wiegand, G. Sullivan, G. Bjontegaard, and A. Luthra, “Overview of the H.264/AVC video coding standard,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 560–576, Jul. 2003.
- [2] G. J. Sullivan, J.-R. Ohm, W.-J. Han, and T. Wiegand, “Overview of the High Efficiency Video Coding (HEVC) Standard,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1649–1668, Dec. 2012, conference Name: IEEE Transactions on Circuits and Systems for Video Technology.
- [3] B. Bross, Y.-K. Wang, Y. Ye, S. Liu, J. Chen, G. J. Sullivan, and J.-R. Ohm, “Overview of the Versatile Video Coding (VVC) Standard and its Applications,” vol. 31, no. 10, pp. 3736–3764, Oct. 2021.
- [4] M. Afonso, F. Zhang, and D. R. Bull, “Video Compression Based on Spatio-Temporal Resolution Adaptation,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 29, no. 1, pp. 275–280, Jan. 2019, conference Name: IEEE Transactions on Circuits and Systems for Video Technology.
- [5] Y. Wang, X. Fan, S. Liu, D. Zhao, and W. Gao, “Multi-Scale Convolutional Neural Network-Based Intra Prediction for Video Coding,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 30, no. 7, pp. 1803–1815, Jul. 2020, conference Name: IEEE Transactions on Circuits and Systems for Video Technology.
- [6] N. Yan, D. Liu, H. Li, B. Li, L. Li, and F. Wu, “Convolutional Neural Network-Based Fractional-Pixel Motion Compensation,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 29, no. 3, pp. 840–853, Mar. 2019, conference Name: IEEE Transactions on Circuits and Systems for Video Technology.
- [7] J. Mao and L. Yu, “Convolutional Neural Network Based Bi-Prediction Utilizing Spatial and Temporal Information in Video Coding,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 30, no. 7, pp. 1856–1870, Jul. 2020, conference Name: IEEE Transactions on Circuits and Systems for Video Technology.
- [8] K. Yang, D. Liu, and F. Wu, “Deep Learning-Based Nonlinear Transform for HEVC Intra Coding,” in 2020 IEEE International Conference on Visual Communications and Image Processing (VCIP), Dec. 2020, pp. 387–390, iSSN: 2642-9357.
- [9] C. Ma, D. Liu, X. Peng, L. Li, and F. Wu, “Convolutional Neural Network-Based Arithmetic Coding for HEVC Intra-Predicted Residues,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 30, no. 7, pp. 1901–1916, Jul. 2020, conference Name: IEEE Transactions on Circuits and Systems for Video Technology.
- [10] S. Zhang, Z. Fan, N. Ling, and M. Jiang, “Recursive Residual Convolutional Neural Network- Based In-Loop Filtering for Intra Frames,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 30, no. 7, pp. 1888–1900, Jul. 2020, conference Name: IEEE Transactions on Circuits and Systems for Video Technology.
- [11] “High-Definition Video Compression System Based on Perception Guidance of Salient Information of a Convolutional Neural Network and HEVC Compression Domain.”
- [12] H. Liu et al., “Learned Video Compression via Joint Spatial-Temporal Correlation Exploration.”
- [13] Y. Hu, W. Yang, and J. Liu, “Coarse-to-Fine Hyper-Prior Modeling for Learned Image Compression,” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 07, pp. 11 013–11 020, Apr. 2020.
- [14] J. Ballé, V. Laparra, and E. P. Simoncelli, “End-to-end optimization of nonlinear transform codes for perceptual quality,” in 2016 Picture Coding Symposium (PCS), Dec. 2016, pp. 1–5, iSSN: 2472-7822.
- [15] N. Johnston, D. Vincent, D. Minnen, M. Covell, S. Singh, T. Chinen, S. Jin Hwang, J. Shor, and G. Toderici, “Improved Lossy Image Compression with Priming and Spatially Adaptive Bit Rates for Recurrent Networks,” in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, UT, USA: IEEE, Jun. 2018, pp. 4385–4393.
- [16] S. Kim, J. S. Park, C. G. Bampis, J. Lee, M. K. Markey, A. G. Dimakis, and A. C. Bovik, “Adversarial Video Compression Guided by Soft Edge Detection,” in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2020, pp. 2193–2197, iSSN: 2379-190X.
- [17] J. Ballé et al., “Nonlinear Transform Coding,” IEEE Journal of Selected Topics in Signal Processing, 2021.
- [18] A. Goliński, R. Pourreza, Y. Yang, G. Sautière, and T. S. Cohen, “Feedback Recurrent Autoencoder for Video Compression,” in Computer Vision – ACCV 2020, H. Ishikawa, C.-L. Liu, T. Pajdla, and J. Shi, Eds. Cham: Springer International Publishing, 2021, vol. 12625, pp. 591–607, series Title: Lecture Notes in Computer Science.
- [19] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, ser. NIPS’14. Cambridge, MA, USA: MIT Press, Dec. 2014, pp. 2672–2680.
- [20] A. Ignatov, R. Timofte, W. Chou, K. Wang, M. Wu, T. Hartley, and L. Van Gool, “AI Benchmark: Running Deep Neural Networks on Android Smartphones,” in Computer Vision – ECCV 2018 Workshops, ser. Lecture Notes in Computer Science, L. Leal-Taixé and S. Roth, Eds. Cham: Springer International Publishing, 2019, pp. 288–314.
- [21] H. Chen, B. He, H. Wang, Y. Ren, S.-N. Lim, and A. Shrivastava, “NeRV: Neural Representations for Videos,” Oct. 2021, arXiv:2110.13903 [cs, eess].
- [22] S. Woo, S. Debnath, R. Hu, X. Chen, Z. Liu, I. S. Kweon, and S. Xie, “Convnext v2: Co-designing and scaling convnets with masked autoencoders,” in 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 16 133–16 142.
- [23] E. Dupont, A. Golinski, M. Alizadeh, Y. W. Teh, and A. Doucet, “COIN: COmpression with Implicit Neural representations,” Mar. 2021.
- [24] E. Dupont, H. Loya, M. Alizadeh, A. Goli’nski, Y. Teh, and A. Doucet, “COIN++: Data Agnostic Neural Compression,” ArXiv, 2022.
- [25] Y. Zhang, T. v. Rozendaal, J. Brehmer, M. Nagel, and T. Cohen, “Implicit Neural Video Compression,” Mar. 2022.
- [26] Z. Li, S. Niklaus, N. Snavely, and O. Wang, “Neural Scene Flow Fields for Space-Time View Synthesis of Dynamic Scenes,” in 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2021, pp. 6494–6504, iSSN: 2575-7075.
- [27] G. Littwin and L. Wolf, “Deep Meta Functionals for Shape Representation,” in 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Oct. 2019, pp. 1824–1833, iSSN: 2380-7504.
- [28] M. Niemeyer, L. Mescheder, M. Oechsle, and A. Geiger, “Differentiable Volumetric Rendering: Learning Implicit 3D Representations Without 3D Supervision,” in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Seattle, WA, USA: IEEE, Jun. 2020, pp. 3501–3512.
- [29] P. Wang, L. Liu, Y. Liu, C. Theobalt, T. Komura, and W. Wang, “NeuS: Learning Neural Implicit Surfaces by Volume Rendering for Multi-view Reconstruction,” in Advances in Neural Information Processing Systems, vol. 34. Curran Associates, Inc., 2021, pp. 27 171–27 183.
- [30] W. Xian, J.-B. Huang, J. Kopf, and C. Kim, “Space-time Neural Irradiance Fields for Free-Viewpoint Video,” in 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Nashville, TN, USA: IEEE, Jun. 2021, pp. 9416–9426.
- [31] M. Bauer, E. Dupont, A. Brock, D. Rosenbaum, J. R. Schwarz, and H. Kim, “Spatial Functa: Scaling Functa to ImageNet Classification and Generation,” Feb. 2023, arXiv:2302.03130 [cs].
- [32] S. Kim, S. Yu, J. Lee, and J. Shin, “Scalable Neural Video Representations with Learnable Positional Features,” Oct. 2022, arXiv:2210.06823 [cs].
- [33] Y. Bai, C. Dong, and C. Wang, “PS-NeRV: Patch-wise Stylized Neural Representations for Videos,” Aug. 2022, arXiv:2208.03742 [cs].
- [34] Z. Li, M. Wang, H. Pi, K. Xu, J. Mei, and Y. Liu, “E-NeRV: Expedite Neural Video Representation with Disentangled Spatial-Temporal Context,” Jul. 2022, arXiv:2207.08132 [cs].
- [35] L. Mai and F. Liu, “Motion-adjustable neural implicit video representation,” in 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 10 728–10 737.
- [36] J. C. Lee, D. Rho, J. H. Ko, and E. Park, “FFNeRV: Flow-Guided Frame-Wise Neural Representations for Videos,” Dec. 2022, arXiv:2212.12294 [cs].
- [37] H. Chen, M. Gwilliam, B. He, S.-N. Lim, and A. Shrivastava, “CNeRV: Content-adaptive Neural Representation for Visual Data,” Nov. 2022, arXiv:2211.10421 [cs].
- [38] H. Chen, M. Gwilliam, S.-N. Lim, and A. Shrivastava, “HNeRV: A Hybrid Neural Representation for Videos,” Apr. 2023, arXiv:2304.02633 [cs].
- [39] B. He, X. Yang, H. Wang, Z. Wu, H. Chen, S. Huang, Y. Ren, S.-N. Lim, and A. Shrivastava, “Towards Scalable Neural Representation for Diverse Videos.”
- [40] Q. Zhao, M. S. Asif, and Z. Ma, “DNeRV: Modeling Inherent Dynamics via Difference Neural Representation for Videos,” Apr. 2023, arXiv:2304.06544 [cs].
- [41] H. M. Kwan, G. Gao, F. Zhang, A. Gower, and D. Bull, “HiNeRV: Video Compression with Hierarchical Encoding based Neural Representation,” Jun. 2023, arXiv:2306.09818 [cs, eess].
- [42] Z. Liu, H. Mao, C.-Y. Wu, C. Feichtenhofer, T. Darrell, and S. Xie, “A convnet for the 2020s,” in 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 11 966–11 976.
- [43] A. Mercat, M. Viitanen, and J. Vanne, “Uvg dataset: 50/120fps 4k sequences for video codec analysis and development,” in Proceedings of the 11th ACM Multimedia Systems Conference, ser. MMSys ’20. New York, NY, USA: Association for Computing Machinery, 2020, p. 297–302. [Online]. Available: https://doi.org/10.1145/3339825.3394937
- [44] H. Wang, W. Gan, S. Hu, J. Y. Lin, L. Jin, L. Song, P. Wang, I. Katsavounidis, A. Aaron, and C.-C. J. Kuo, “Mcl-jcv: A jnd-based h.264/avc video quality assessment dataset,” in 2016 IEEE International Conference on Image Processing (ICIP), 2016, pp. 1509–1513.
- [45] S. Nah, S. Baik, S. Hong, G. Moon, S. Son, R. Timofte, and K. M. Lee, “Ntire 2019 challenge on video deblurring and super-resolution: Dataset and study,” in CVPR Workshops, June 2019.
- [46] “Big buck bunny, sunflower version.” [Online]. Available: http://bbb3d.renderfarming.net/download.html
7 Supplementary Material
7.1 Detailed Ablation Results
We conducted a detailed ablation study of the three components (GRN, ML, and HL) on the UVG dataset. The two tables below present the PSNR and MS-SSIM of each variant on the UVG sub-datasets.
UVG-PSNR | Beauty | Bosphorus | HoneyBee | Jockey | ReadySteadyGo | ShakeNDry | YachtRide | avg
HNeRV | 34.27 | 33.82 | 38.97 | 30.97 | 25.165 | 33.905 | 29.8 | 32.414 |
Variant 1 | 34.28 | 33.815 | 38.95 | 31.08 | 25.23 | 33.92 | 29.845 | 32.446 |
Variant 2 | 34.3 | 33.73 | 38.46 | 32.15 | 26.21 | 33.64 | 29.79 | 32.611 |
MNeRV | 34.3 | 33.86 | 38.48 | 32.11 | 26.11 | 33.69 | 29.96 | 32.644 |
UVG-MS-SSIM | Beauty | Bosphorus | HoneyBee | Jockey | ReadySteadyGo | ShakeNDry | YachtRide | avg
HNeRV | 0.9076 | 0.9363 | 0.9844 | 0.8742 | 0.8391 | 0.9277 | 0.8903 | 0.9085143 |
Variant 1 | 0.9077 | 0.9364 | 0.9844 | 0.8757 | 0.8415 | 0.9282 | 0.891 | 0.90927143 |
Variant 2 | 0.9080 | 0.9364 | 0.9830 | 0.8953 | 0.8701 | 0.9236 | 0.8908 | 0.91531 |
MNeRV | 0.9078 | 0.9385 | 0.9831 | 0.8944 | 0.8662 | 0.9245 | 0.8939 | 0.915486 |
7.2 More Visualizations
7.3 Detailed Comparative Experiments
In Figure 12, we present the video compression results obtained on the UVG dataset (all) and each of its sub-datasets.