Watermarking for Neural Radiance Fields by Invertible Neural Network
Abstract
To protect the copyright of a 3D scene represented by a neural radiance field, the embedding and extraction of the watermark are treated as a pair of inverse image-transformation problems. A scheme for protecting the copyright of the neural radiance field is proposed using invertible neural network watermarking, which leverages 2D image watermarking techniques to protect the 3D scene. The scheme embeds the watermark in the training images of the neural radiance field through the forward process of the invertible network and extracts the watermark from images rendered by the neural radiance field using the inverse process, thereby realizing copyright protection for both the neural radiance field and the 3D scene. Since the rendering process of the neural radiance field causes loss of watermark information, the scheme incorporates an image quality enhancement module, which uses a neural network to restore the rendered image before the watermark is extracted. The scheme embeds a watermark in each training image used to train the neural radiance field and enables the extraction of watermark information from multiple viewpoints. Simulation results demonstrate the effectiveness of the method.
Index Terms:
Neural Radiance Fields, Copyright Protection, 3D Scene, Watermarking, Invertible Neural Networks.

I Introduction
Implicit Neural Representation (INR), also known as coordinate-based representation, is a method for parameterizing various signals. While traditional signal representations are usually discrete, implicit neural representations parameterize a signal as a continuous function. Currently, the most typical application of INR is Neural Radiance Fields (NeRF) [1]. NeRF is a deep learning model for implicit 3D scene modeling that uses a neural network to implicitly represent the color and density of each point in a 3D scene. Current research on NeRF is dedicated to higher-quality 3D content representation [2, 3, 4, 5], faster rendering [6, 7, 8, 9, 10], and sparse-view reconstruction [11, 12, 13, 14, 15, 16]. As NeRF continues to progress in 3D representation, copyright protection for implicitly represented 3D models such as neural radiance fields has become a pressing issue.
Traditional representations of 3D models can be categorized into point cloud models, mesh models, and surface models. Watermarking techniques for traditional 3D models can be mainly classified into two categories: 3D mesh model-based watermarking algorithms [17, 18, 19, 20, 21, 22, 23, 24] and 3D point cloud model-based watermarking algorithms. For 3D mesh model-based watermarking algorithms, a multi-resolution framework is typically used to perform wavelet decomposition or Fourier transform on the target triangular or polygonal mesh. The watermark embedding is achieved by modifying the topological or geometric features of the mesh model or establishing a correlation function between the mesh vertices. On the other hand, the 3D point cloud model watermarking algorithm [25] first establishes a synchronization relationship between point clouds. Then, the model is divided into spherical rings based on the radial radius, and the watermark is repeatedly inserted into the vertices of each sphere ring to realize the watermark embedding.
However, NeRF's 3D model representation differs from traditional 3D models in that NeRF does not use geometric structures in the traditional sense, but instead learns and generates realistic renderings directly through a neural network. It is essentially a neural network that implicitly represents the 3D scene. Therefore, traditional 3D model watermarking algorithms cannot be applied to watermark neural radiance fields. Copyright protection for neural networks, i.e., neural network watermarking, has become an important research direction in the security field. There are four main types of neural network watermarking: white-box watermarking, black-box watermarking, no-box watermarking, and fragile neural network watermarking. In the white-box watermarking scheme [26], the verifier can access the interior of the network and retrieve information such as weights when verifying the copyright of the network. The black-box watermarking scheme [27] is suitable for cases where the verifier cannot access the interior of the network and can only interact with it through a remote API interface. No-box watermarking [28] is primarily used for copyright authentication in generative networks: the network is trained so that the generated image contains watermark information, which the verifier can then use directly to verify copyright. Fragile watermarking [29] differs from the above three methods in that it detects whether the functionality of the network has been maliciously tampered with, such as through the injection of a backdoor, by examining whether the embedded watermark has been corrupted.
Traditional neural network watermarking typically regards the neural network as a tool for data processing and thus aims to protect the tool itself. Under implicit representation, however, the neural network itself becomes the data. Therefore, this paper proposes that future protection of multimedia data (such as images, videos, and audio) can be approached in two ways, as depicted in Figure 1. The first is to apply traditional watermarking techniques directly to the multimedia data. The second is to first transform the multimedia data into a neural network via implicit neural representation and then protect the implicitly represented data with neural network watermarking techniques.
StegaNeRF [30] was the first to establish a connection between neural radiance fields and message hiding, training NeRF twice so that messages can be extracted from rendered images. In contrast to StegaNeRF, this paper presents a novel approach to safeguarding the neural radiance field by employing invertible neural network watermarking. This technique does not modify the NeRF network but protects the NeRF model by leveraging traditional image watermarking techniques. The proposed scheme starts by using the forward (embedding) process of an invertible neural network to embed watermark information into each image of the training set used for NeRF training. Following this, 3D modeling is conducted using the NeRF model. To counteract the influence of NeRF rendering, the verifier can render the NeRF from any viewpoint and then restore the rendered image using a trained image quality enhancement network. Finally, the verifier extracts the embedded watermark information using the reverse process of the invertible network, namely the extraction network. In a black-box scenario, where the 3D model is suspected of being used by unauthorized parties, the verifier can extract watermark information from multiple perspectives to verify the network copyright.
II The Proposed Algorithm
II-A Application Scenarios
The usage scenarios of the algorithm described in this paper are as follows:
Alice acquires some pictures of a 3D scene, e.g., by taking photographs;
Alice embeds watermarks in the images and reconstructs the 3D scene by training a NeRF model;
Alice shares the NeRF model and the 3D scene online for others to enjoy;
Bob acquires the NeRF model without Alice's permission and publishes it on the web under his own name;
Alice sees the NeRF model posted by Bob, renders 2D images from it, and extracts the watermark, thereby proving that she is the copyright holder of the NeRF model;
Bob is found to be infringing copyright and must withdraw the release.
II-B General framework of the algorithm
In this paper, we propose a new scheme that uses an invertible neural network 2D watermarking algorithm to protect the neural radiance field and the 3D scene it represents. As shown in Fig. 3, the algorithm framework consists of a frequency domain transform module, invertible blocks, the neural radiance field, and an image quality enhancement module. Embedding and extraction in invertible neural network watermarking form a pair of inverse processes:
$$(I_W,\, r) = H(I,\, M_W) \tag{1}$$

$$(R_W,\, \hat{I}) = H^{-1}(I',\, Z) \tag{2}$$
In Eqs. (1)-(2), H(·) represents the forward embedding process, and H⁻¹(·) denotes the reverse extraction process. In the forward embedding process, the training image I and the watermark information M_W serve as inputs. They are first decomposed into low-frequency and high-frequency wavelet subbands by the discrete wavelet transform (DWT) and then passed through a sequence of invertible blocks. After the final invertible block, the inverse wavelet transform (IWT) is applied to generate the watermarked image I_W and the loss information r. All training images for NeRF undergo these operations so that the watermark can be extracted from any viewpoint covered by the training set. The watermarked images are then used to train the NeRF model; specifying the camera position, orientation, and field-of-view parameters, the rendered image I' is generated through ray-voxel intersection sampling and color blending. In the reverse extraction process, the rendered image I' first passes through the image quality enhancement module (IQEM) to mitigate the distortion introduced by NeRF rendering. Then, analogously to embedding, the auxiliary variable Z and the quality-enhanced rendered image I' undergo the frequency domain transform and the series of invertible blocks in reverse to produce the recovered watermark R_W and the recovered image.
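A minimal sketch of this embed-train-render-extract pipeline is given below. The callables `inn_embed`, `inn_extract`, `qem`, and `render_view` are hypothetical stand-ins for the invertible network's forward and reverse passes, the quality enhancement module, and the trained NeRF renderer described above; the sketch only illustrates how the pieces are composed.

```python
import torch

def embed_watermarks(inn_embed, images, watermark):
    """Forward process: inn_embed maps (I, M_W) -> (I_W, r) for each training image."""
    pairs = [inn_embed(img, watermark) for img in images]
    watermarked = [iw for iw, _ in pairs]
    residuals = [r for _, r in pairs]
    return watermarked, residuals

def extract_watermark(inn_extract, qem, render_view, camera_pose):
    """Reverse process: render a view with the trained NeRF, enhance it, invert the INN."""
    rendered = render_view(camera_pose)     # I' from the (possibly stolen) NeRF model
    enhanced = qem(rendered)                # IQEM compensates rendering distortion
    z = torch.randn_like(enhanced)          # auxiliary variable Z
    recovered_wm, recovered_img = inn_extract(enhanced, z)
    return recovered_wm
```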
II-C Network structure
II-C1 Frequency domain transform module
Watermarks embedded in the pixel domain are prone to texture-copying artifacts and color distortion [31, 32]. The frequency domain, and in particular the high-frequency subbands, is better suited for watermark embedding than the pixel domain. This paper uses the frequency domain transform module (FDTM) to partition the image into low-frequency and high-frequency wavelet subbands before the invertible transform. The high-frequency subbands contain the image details, while the low-frequency subbands capture the overall image features. This division allows the network to integrate the watermark information into the cover image effectively. Compared with operating directly in the original image domain, the wavelet transform offers improved visual fidelity and embedding efficiency because it operates on only a few subbands; the impact on the whole image is therefore minimized and generally difficult to detect. Moreover, the favorable reconstruction properties of wavelets [33] help reduce information loss and strengthen watermark embedding. Before entering the invertible blocks, the image passes through the FDTM, and after the discrete wavelet transform (DWT), the feature map of size (B, C, H, W) is transformed into (B, 4C, H/2, W/2), where B is the batch size, H the height, W the width, and C the number of channels. The DWT reduces computational cost, thereby accelerating training. After the last invertible block, the feature map of size (B, 4C, H/2, W/2) is fed back into the FDTM for the inverse wavelet transform (IWT), restoring the feature map to (B, C, H, W) and producing the watermarked image.
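As an illustration of this shape bookkeeping, the sketch below implements a single-level Haar DWT and its inverse with the (B, C, H, W) → (B, 4C, H/2, W/2) layout described above; the specific wavelet used in the paper is not stated here, so Haar is an assumption.

```python
import torch

def dwt_haar(x):
    """Single-level Haar DWT: (B, C, H, W) -> (B, 4C, H/2, W/2).
    Output channels are ordered [LL, HL, LH, HH] per input channel group."""
    x01 = x[:, :, 0::2, :] / 2
    x02 = x[:, :, 1::2, :] / 2
    x1, x3 = x01[:, :, :, 0::2], x01[:, :, :, 1::2]
    x2, x4 = x02[:, :, :, 0::2], x02[:, :, :, 1::2]
    ll = x1 + x2 + x3 + x4
    hl = -x1 - x2 + x3 + x4
    lh = -x1 + x2 - x3 + x4
    hh = x1 - x2 - x3 + x4
    return torch.cat([ll, hl, lh, hh], dim=1)

def iwt_haar(x):
    """Inverse of dwt_haar: (B, 4C, H/2, W/2) -> (B, C, H, W)."""
    b, c4, h, w = x.shape
    c = c4 // 4
    ll, hl, lh, hh = x[:, :c], x[:, c:2*c], x[:, 2*c:3*c], x[:, 3*c:]
    out = x.new_zeros(b, c, h * 2, w * 2)
    out[:, :, 0::2, 0::2] = (ll - hl - lh + hh) / 2
    out[:, :, 1::2, 0::2] = (ll - hl + lh - hh) / 2
    out[:, :, 0::2, 1::2] = (ll + hl - lh - hh) / 2
    out[:, :, 1::2, 1::2] = (ll + hl + lh + hh) / 2
    return out

# Round trip check: iwt_haar(dwt_haar(x)) reproduces x.
x = torch.randn(2, 3, 8, 8)
assert torch.allclose(iwt_haar(dwt_haar(x)), x, atol=1e-6)
```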
II-C2 Invertible blocks
As shown in Fig. 3, the hiding process and the recovery process use the same sub-blocks and share the same network parameters; only the direction of the information flow is reversed. The network in this paper consists of 8 invertible blocks with identical structure, constructed as follows. For the l-th hiding block in the forward process, the inputs are I_l and M_l^W, and the outputs are I_{l+1} and M_{l+1}^W:

$$I_{l+1} = I_l + f\!\left(M_l^W\right) \tag{3}$$

$$M_{l+1}^W = M_l^W \odot \exp\!\left(\alpha\!\left(r(I_{l+1})\right)\right) + y(I_{l+1}) \tag{4}$$
In Equations (3)-(4), α(·) denotes the activation function, specifically LeakyReLU, and f(·), r(·), and y(·) are densely connected subnetworks applied within each invertible block. The outputs of the final invertible block are M_K^W and I_K; these are transformed by the inverse wavelet transform (IWT) to obtain the watermarked image I_W and the loss information r. The l-th invertible block in the reverse recovery process takes I_{l+1} and Z_{l+1} as inputs and produces I_l and Z_l as outputs, as given in Equations (5)-(6):
$$Z_l = \left(Z_{l+1} - y(I_{l+1})\right) \odot \exp\!\left(-\alpha\!\left(r(I_{l+1})\right)\right) \tag{5}$$

$$I_l = I_{l+1} - f\!\left(Z_l\right) \tag{6}$$
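A compact PyTorch sketch of one such coupling block is given below. The f(·), r(·), and y(·) subnetworks are reduced to single convolutions here (the paper uses DenseNet blocks), and the LeakyReLU slope is an assumed value, so this is an illustrative sketch rather than the exact architecture.

```python
import torch
import torch.nn as nn

class InvertibleBlock(nn.Module):
    """Affine coupling block following Eqs. (3)-(6); f, r, y stand in for DenseNet subnetworks."""
    def __init__(self, channels):
        super().__init__()
        self.f = nn.Conv2d(channels, channels, 3, padding=1)   # additive branch
        self.r = nn.Conv2d(channels, channels, 3, padding=1)   # scale branch
        self.y = nn.Conv2d(channels, channels, 3, padding=1)   # shift branch
        self.act = nn.LeakyReLU(0.2)                           # alpha(.); slope assumed

    def forward(self, i_l, m_l):
        """Hiding direction: (I_l, M_l^W) -> (I_{l+1}, M_{l+1}^W)."""
        i_next = i_l + self.f(m_l)                                              # Eq. (3)
        m_next = m_l * torch.exp(self.act(self.r(i_next))) + self.y(i_next)     # Eq. (4)
        return i_next, m_next

    def inverse(self, i_next, z_next):
        """Recovery direction: (I_{l+1}, Z_{l+1}) -> (I_l, Z_l)."""
        z_l = (z_next - self.y(i_next)) * torch.exp(-self.act(self.r(i_next)))  # Eq. (5)
        i_l = i_next - self.f(z_l)                                               # Eq. (6)
        return i_l, z_l
```

Because each step is either an addition or an element-wise affine map conditioned on the other branch, the inverse pass recovers the inputs exactly with the same parameters.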
II-C3 Neural Radiation Field
The Neural Radiance Field is a neural network model designed for generating 3D scenes. The network is a multilayer perceptron used to encode the scene. Figure 5 illustrates this network structure.
In the neural radiance field model, each pixel position of the input image can be mapped to 3D coordinate points in the scene, allowing precise object localization and rendering within the scene. In NeRF, an input spatial point is defined by a 3D coordinate position, x = (x, y, z), and a viewing direction, d = (θ, φ), while the output at the corresponding voxel position is a color, c = (r, g, b), and a volume density, σ:
$$F_\Theta : (\mathbf{x}, \mathbf{d}) \rightarrow (\mathbf{c}, \sigma) \tag{7}$$
NeRF takes a finite set of discrete images and the camera parameters associated with their viewpoints to generate a continuous static 3D scene; it can then render the scene from arbitrary perspectives, producing novel-view images. Volume rendering, in turn, is a 3D-to-2D modeling process that uses the color values c and volume densities σ of the 3D points obtained from reconstruction. The final pixel value of the 2D image is obtained as the weighted superposition of samples along a ray in the viewing direction, as shown in equation (8).
$$C(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\, \sigma(\mathbf{r}(t))\, \mathbf{c}(\mathbf{r}(t), \mathbf{d})\, dt, \qquad T(t) = \exp\!\left(-\int_{t_n}^{t} \sigma(\mathbf{r}(s))\, ds\right) \tag{8}$$
In equation (8), the ray is denoted r(t) = o + t d, where o is the position of the camera's optical center and d is the viewing direction. T(t) indicates the cumulative transmittance of the ray as it travels from the near bound t_n toward the far bound t_f.
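In practice, the integral in Eq. (8) is approximated by sampling points along each ray; a minimal sketch of this discretized quadrature, following the standard NeRF formulation, is shown below.

```python
import torch

def render_rays(sigmas, colors, t_vals):
    """Discretized volume rendering of Eq. (8).
    sigmas: (N_rays, N_samples) densities; colors: (N_rays, N_samples, 3);
    t_vals: (N_rays, N_samples) sample depths along each ray r(t) = o + t*d."""
    deltas = t_vals[:, 1:] - t_vals[:, :-1]                          # distances between samples
    deltas = torch.cat([deltas, torch.full_like(deltas[:, :1], 1e10)], dim=-1)
    alpha = 1.0 - torch.exp(-sigmas * deltas)                        # per-segment opacity
    # Cumulative transmittance T_i = prod_{j<i} (1 - alpha_j)
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[:, :1]), 1.0 - alpha + 1e-10], dim=-1),
        dim=-1)[:, :-1]
    weights = trans * alpha                                          # T_i * alpha_i
    return (weights.unsqueeze(-1) * colors).sum(dim=1)               # final pixel colors

# Toy usage: 2 rays with 64 samples each
sigmas = torch.rand(2, 64)
colors = torch.rand(2, 64, 3)
t_vals = torch.linspace(2.0, 6.0, 64).expand(2, 64)
pixels = render_rays(sigmas, colors, t_vals)                         # shape (2, 3)
```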
Building upon this characteristic of NeRF, this paper proposes a method for extracting watermarks from any angle in the training set by randomly selecting camera parameters. This approach aims to provide copyright protection for NeRF.
II-C4 Image Quality Enhancement Module
Before the reverse watermark-extraction process, to eliminate the distortion introduced by the NeRF rendering process, this paper adds an image quality enhancement module (IQEM) built as a residual convolutional encoder-decoder network. The convolutional encoder extracts feature information of the distorted rendered image I' at different levels. These features are fed into the deconvolutional decoder together with the residuals passed from the corresponding encoder layers, and the final output is superimposed on the input image, completing the image restoration. By adding the IQEM to the watermark extraction process, the rendered image I' is preprocessed before it enters the invertible neural network, ensuring that the input passed backward is sufficiently similar to the watermarked image I_W so that the invertible neural network can extract the watermark information more completely.
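A minimal sketch of such a residual encoder-decoder is shown below; the layer counts and channel widths are illustrative assumptions rather than the paper's exact configuration, and the key point is the global skip connection that lets the network predict only a correction to the rendered image.

```python
import torch
import torch.nn as nn

class QualityEnhancement(nn.Module):
    """Minimal residual encoder-decoder: predicts a correction added to the
    rendered image I' so that the output approximates the watermarked image I_W."""
    def __init__(self, channels=3, width=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(channels, width, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(width, width * 2, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(width * 2, width, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.ConvTranspose2d(width, channels, 4, stride=2, padding=1),
        )

    def forward(self, rendered):
        residual = self.decoder(self.encoder(rendered))
        return rendered + residual      # global skip: restore rather than regenerate

# Usage: restored = QualityEnhancement()(rendered_batch)  # rendered_batch: (B, 3, H, W)
```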
II-D Loss Functions
The loss associated with network model training proposed in this paper consists of four main components:
Embedding loss LEmb
The purpose of the embedding loss is to ensure that the generated watermarked image I_W is indistinguishable from the training image I. The embedding loss is defined as:
$$L_{Emb} = \sum_{n=1}^{N} \ell_{Emb}\!\left(I_W^{(n)},\, I^{(n)}\right) \tag{9}$$
In Eq. (9), N represents the number of training samples, and ℓ_Emb measures the difference between the watermarked image I_W and the training image I; in this paper, the L2 norm is used.
Low-frequency wavelet loss Llow-f
The literature [34] verifies that watermark information embedded in high-frequency components is less detectable than watermark information embedded in low-frequency components. To achieve higher visual fidelity and minimize the impact of watermark embedding on the image as a whole, the watermark should be embedded as much as possible in the high-frequency region of the image; this paper therefore imposes a loss constraint on the low-frequency subbands of the training image I and the watermarked image I_W.
$$L_{low\text{-}f} = \sum_{n=1}^{N} \ell_{f}\!\left(H\!\left(I^{(n)}\right)_{ll},\, H\!\left(I_W^{(n)}\right)_{ll}\right) \tag{10}$$
In Eq. (10), N represents the number of training samples, ℓ_f measures the low-frequency difference between the training image I and the watermarked image I_W, and H(·)_ll denotes extraction of the low-frequency subband of an image.
Extraction loss LExt
The extraction loss ensures consistency between the extracted watermark information R_W and the embedded watermark information M_W: the difference between the recovered watermark R_W and the embedded watermark M_W is minimized to improve the watermark extraction accuracy of the model.
$$L_{Ext} = \sum_{n=1}^{N} \ell_{Ext}\!\left(M_W^{(n)},\, R_W^{(n)}\right) \tag{11}$$
In Eq. (11), N represents the number of training samples, and ℓ_Ext computes the difference between the watermark information M_W and the recovered watermark R_W. The auxiliary vector z is sampled randomly.
The total loss function of the invertible neural network is a weighted sum of the embedding loss, the low-frequency wavelet loss, and the extraction loss:
$$L_{total} = \lambda_1 L_{Emb} + \lambda_2 L_{low\text{-}f} + \lambda_3 L_{Ext} \tag{12}$$
During training, λ2 is first set to 0, i.e., the network model is pre-trained without the effect of L_low-f, so that it first acquires basic embedding-extraction ability. The L_low-f constraint is then gradually added to further optimize the network so that the watermark information is embedded in the high-frequency region of the training image, minimizing the impact of watermark embedding on the image as a whole.
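This two-stage weighting can be written as a small helper; the warm-up length below is an assumed placeholder, while the λ values match those reported in the experimental setup.

```python
def total_loss(l_emb, l_lowf, l_ext, step, warmup_steps=10_000,
               lam1=5.0, lam2=0.5, lam3=1.0):
    """Weighted sum of Eq. (12). lam2 is held at 0 during pre-training so the network
    first learns basic embedding/extraction; warmup_steps is illustrative, not from the paper."""
    lam2_eff = 0.0 if step < warmup_steps else lam2
    return lam1 * l_emb + lam2_eff * l_lowf + lam3 * l_ext
```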
Image quality enhancement module loss L_IQEM
To ensure that watermark embedding does not damage the original 2D image content, the image quality enhancement module in this paper is trained independently of the invertible neural network. Its loss is the mean squared error (MSE), which constrains the image I' rendered by NeRF to be restored toward the watermarked image I_W generated by the invertible neural network, resisting the watermark corruption and loss caused by the rendering process.
$$L_{IQEM} = \frac{1}{N}\sum_{i=1}^{N} \left\| \mathrm{IQEM}\!\left(I'_i\right) - I_{W_i} \right\|_2^2 \tag{13}$$
In Eq. (13), I'_i is the i-th rendered image, IQEM(I'_i) is its restored version, and I_{W_i} is the i-th watermarked image.
III Experimental results and analysis
In this study, the network model was implemented on the PyTorch platform with CUDA 11.6 and an Nvidia GeForce RTX 2070 GPU. The Lego, Hotdog, and Chair scenes of the NeRF-Synthetic dataset were used to train the NeRF model. To ensure diversity, high resolution, and authenticity, the DIV2K dataset was used to train the invertible neural network, whose structure was modified from HiNet [35]. Specifically, the DIV2K training set (800 images at a resolution of 1024×1024) was used for training, the validation set (100 images at 1024×1024) for validating the network model, and the DIV2K test set (100 images at 1024×1024) for testing its effectiveness. The Adam optimizer was used with λ1=5, λ2=0.5, λ3=1, a learning rate of 1×10⁻⁴·⁵, and a batch size of 2. The entire network consisted of 8 invertible blocks, each containing three DenseNet blocks of 7 convolutional layers serving as f(·), r(·), and y(·), respectively.
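A minimal sketch of this training configuration, with a stand-in module in place of the full invertible network, might look as follows.

```python
import torch
import torch.nn as nn

# Stand-in module so the snippet runs; the real model is the 8-block invertible network.
inn_model = nn.Conv2d(3, 3, kernel_size=1)

optimizer = torch.optim.Adam(inn_model.parameters(), lr=10 ** -4.5)  # learning rate 1e-4.5
batch_size = 2
lambda1, lambda2, lambda3 = 5.0, 0.5, 1.0   # loss weights of Eq. (12)
```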
III-A Evaluation Metrics
In this paper, four metrics: Peak Signal Noise Ratio (PSNR), Structural Similarity (SSIM), Root Mean Square Error (RMSE), and Mean Absolute Error (MAE), are used to measure the watermark embedding and extraction capabilities of the network model.
PSNR is commonly used to evaluate the quality of image reconstruction and is defined via the Mean Square Error (MSE) between two images X and Y of size W×H. The formulas for MSE and PSNR are given by:
$$MSE = \frac{1}{W \times H}\sum_{i=1}^{W}\sum_{j=1}^{H}\left(X_{i,j} - Y_{i,j}\right)^2 \tag{14}$$

$$PSNR = 10 \log_{10}\!\left(\frac{MAX^2}{MSE}\right) \tag{15}$$
In the equations, X_{i,j} and Y_{i,j} refer to the pixel values of images X and Y at position (i, j), and MAX represents the maximum possible pixel value. A higher PSNR value indicates less distortion.
SSIM is another image quality evaluation metric that measures image similarity in terms of brightness, contrast, and structure. It is defined by:
$$l(X,Y) = \frac{2\mu_x\mu_y + C_1}{\mu_x^2 + \mu_y^2 + C_1}, \quad c(X,Y) = \frac{2\sigma_x\sigma_y + C_2}{\sigma_x^2 + \sigma_y^2 + C_2}, \quad s(X,Y) = \frac{\sigma_{xy} + C_3}{\sigma_x\sigma_y + C_3} \tag{16}$$
In the equation, μ_x and σ_x² are the mean and variance of image X, μ_y and σ_y² are the mean and variance of image Y, and σ_xy is the covariance of X and Y. The constants are C₁=(K₁L)², C₂=(K₂L)², and C₃=C₂/2, with K₁=0.01, K₂=0.03, and L=255 in general.
$$SSIM(X,Y) = l(X,Y)\cdot c(X,Y)\cdot s(X,Y) = \frac{(2\mu_x\mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)} \tag{17}$$
SSIM values range from 0 to 1, where a higher value indicates less image distortion.
RMSE is the sample standard deviation of the differences between predicted and observed values (the residuals). It corresponds to the L2 norm and is more sensitive to outliers in the data.
$$RMSE = \sqrt{\frac{1}{W \times H}\sum_{i=1}^{W}\sum_{j=1}^{H}\left(X_{i,j} - Y_{i,j}\right)^2} \tag{18}$$
MAE is the mean of the absolute errors between predicted and observed values and corresponds to the L1 norm.
$$MAE = \frac{1}{W \times H}\sum_{i=1}^{W}\sum_{j=1}^{H}\left|X_{i,j} - Y_{i,j}\right| \tag{19}$$
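These metrics can be computed directly from the pixel arrays; a small NumPy sketch of PSNR, RMSE, and MAE is given below (SSIM is typically taken from an existing library such as scikit-image).

```python
import numpy as np

def psnr(x, y, max_val=255.0):
    """Peak Signal-to-Noise Ratio, Eqs. (14)-(15)."""
    mse = np.mean((x.astype(np.float64) - y.astype(np.float64)) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

def rmse(x, y):
    """Root Mean Square Error, Eq. (18)."""
    return np.sqrt(np.mean((x.astype(np.float64) - y.astype(np.float64)) ** 2))

def mae(x, y):
    """Mean Absolute Error, Eq. (19)."""
    return np.mean(np.abs(x.astype(np.float64) - y.astype(np.float64)))
```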
III-B The imperceptibility of watermarked images
The imperceptibility of a watermarked image refers to the difficulty for the human eye to detect the presence of a watermark. To achieve blind watermarking, the original image must be visually indistinguishable from the watermarked image. This paper aims to minimize the distortion rate between the original image (i.e., the training image I) and the watermarked image IW. To evaluate the imperceptibility of the proposed method, four metrics, namely PSNR, SSIM, MAE, and RMSE, are utilized. The experimental results are presented in Table 2.
Table 2: Comparison between the training image I and the watermarked image I_W on the three datasets.

| Metrics | Lego | Hotdog | Chair |
|---|---|---|---|
| PSNR | 38.226542 | 37.379475 | 37.890828 |
| SSIM | 0.943185 | 0.918761 | 0.936977 |
| MAE | 3.168850 | 3.962571 | 3.114965 |
| RMSE | 5.732577 | 6.188527 | 5.956620 |
Meanwhile, as shown in Fig. 6, after embedding watermarks in the images of the three datasets (Lego, Hotdog, and Chair), a comparison between the original training image I and the watermarked image I_W reveals no visible difference indicating the presence or absence of the embedded watermark information. This demonstrates the imperceptibility of the watermark embedded by the proposed method and realizes blind watermarking.
III-C Accuracy of watermark extraction
In this paper, the watermark information, MW, is embedded in three datasets, namely Lego, Hotdog, and Chair, using a forward invertible neural network. Then, NeRF is employed for training, and the resulting 3D scenes are rendered to obtain images from different viewpoints. The rendered images are then processed using an image quality enhancement module, after which the recovered watermark, RW, is extracted with the aid of an inverse invertible neural network. To evaluate the quality of the extracted watermark information, we used four metrics, namely, PSNR, SSIM, MAE, and RMSE.
As depicted in Fig. 7, the average values of each metric over 100 images are: PSNR greater than 22 dB, SSIM around 0.55, MAE around 9.2, and RMSE approximately 29. In this study, watermark extraction was conducted on the Lego, Hotdog, and Chair datasets for both original training viewpoints and non-training viewpoints. Two viewing-angle parameters, θ and φ, determine the rendering viewpoint; this study varies θ while keeping φ constant to control the angle variation. Starting from the original angles θ=30°, θ=45°, and θ=60°, a viewpoint offset of +1° is applied to examine whether the watermark information can still be extracted when the selected angle differs from the original training angle. The experimental results in Fig. 8 indicate that the watermark can be extracted effectively when the selected angle matches an original training angle; when the selected angle differs from the training angles, accurate extraction of the watermark is not achieved.
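For reproducing such a viewpoint sweep, camera poses on a sphere can be generated from (θ, φ) and a radius; the helper below follows the spherical-pose convention common in NeRF implementations and is an assumption about the setup rather than the paper's exact code (the angle and radius values in the usage lines are illustrative).

```python
import numpy as np

def pose_spherical(theta_deg, phi_deg, radius):
    """Camera-to-world matrix for a camera on a sphere looking toward the origin."""
    trans = np.array([[1, 0, 0, 0],
                      [0, 1, 0, 0],
                      [0, 0, 1, radius],
                      [0, 0, 0, 1]], dtype=np.float32)
    phi = np.deg2rad(phi_deg)
    rot_phi = np.array([[1, 0, 0, 0],
                        [0, np.cos(phi), -np.sin(phi), 0],
                        [0, np.sin(phi),  np.cos(phi), 0],
                        [0, 0, 0, 1]], dtype=np.float32)
    th = np.deg2rad(theta_deg)
    rot_theta = np.array([[np.cos(th), 0, -np.sin(th), 0],
                          [0, 1, 0, 0],
                          [np.sin(th), 0,  np.cos(th), 0],
                          [0, 0, 0, 1]], dtype=np.float32)
    return rot_theta @ rot_phi @ trans    # camera-to-world pose

# Render at an original training angle and at a +1 degree offset, then attempt extraction.
c2w_orig = pose_spherical(30.0, -30.0, 4.0)
c2w_offset = pose_spherical(31.0, -30.0, 4.0)
```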
III-D Image Quality Enhancement Module
Traditional deep-learning-based robust image watermarking techniques, such as HiNet [35] and ISN [36], are not directly applicable to our task: they rely on reversibility and do not account for the corruption that watermarked images suffer during NeRF rendering. To address this limitation, we introduce an image quality enhancement module (IQEM) before the watermark extraction operation. With the IQEM, the PSNR between M_W and R_W improves significantly, from 5.31 dB to 27.23 dB, as shown in Table 3. The experimental results confirm the efficacy of the IQEM in successfully extracting the watermark information.
| IQEM | FDTM | L_low-f | PSNR (M_W vs. R_W) |
|---|---|---|---|
|  |  |  | 5.31 dB |
|  |  |  | 12.44 dB |
|  |  |  | 19.88 dB |
|  |  |  | 27.23 dB |
III-E Comparison with StegaNeRF
Modifying the MLP structure to enable watermark embedding is a challenging task: any direct modification of the MLP can compromise NeRF's rendering ability and hinder its capacity to capture 3D content. We therefore adopt an invertible neural network watermarking approach to protect NeRF: watermarks are embedded in the 2D images used to train NeRF and then extracted from the rendered images to confirm NeRF's copyright. Unlike StegaNeRF, which changes the network structure and may affect NeRF's capability, the proposed method achieves copyright protection through an indirect approach that does not alter the network structure or the rendering ability of NeRF. Trained for the same number of epochs (50,000), our approach yields visually higher-quality rendered images than StegaNeRF, as shown in Fig. 8.
The quantitative results of comparing the images rendered at 13 angles by the two approaches with the corresponding original training images are presented in Table 4.
| Method | PSNR (dB) | SSIM | MAE | RMSE |
|---|---|---|---|---|
| This paper | 32.879404 | 0.965815 | 3.021400 | 7.474979 |
| StegaNeRF | 29.230964 | 0.907661 | 4.134496 | 10.397353 |
The results indicate that our proposed approach outperforms StegaNeRF in all four evaluation metrics. This demonstrates that our approach achieves copyright protection without compromising NeRF’s rendering ability.
IV Conclusion
In this paper, we propose for the first time a scheme that protects the neural radiance field using invertible neural network watermarking, achieving copyright protection for NeRF. The method employs an invertible neural network to embed and extract watermarks on 2D images, modelling watermark embedding and extraction as the forward and reverse processes of the invertible network, and adds an image quality enhancement module in the intermediate stage to compensate for the loss of watermark information caused by the NeRF rendering process, thereby protecting the 3D models represented by neural radiance fields. The experimental results show that the proposed scheme can achieve watermark embedding and extraction, but the extraction quality of the watermark needs to be further improved.
References
- [1] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng, "NeRF: Representing scenes as neural radiance fields for view synthesis," arXiv:2003.08934.
- [2] P. Wang, L. Liu, Y. Liu, C. Theobalt, T. Komura, and W. Wang, "NeuS: Learning neural implicit surfaces by volume rendering for multi-view reconstruction," arXiv:2106.10689.
- [3] X. Zhang, P. P. Srinivasan, B. Deng, P. Debevec, W. T. Freeman, and J. T. Barron, "NeRFactor: Neural factorization of shape and reflectance under an unknown illumination," arXiv:2106.01970.
- [4] J. T. Barron, B. Mildenhall, M. Tancik, P. Hedman, R. Martin-Brualla, and P. P. Srinivasan, "Mip-NeRF: A multiscale representation for anti-aliasing neural radiance fields," pp. 5835–5844.
- [5] M. Tancik, V. Casser, X. Yan, S. Pradhan, B. Mildenhall, P. P. Srinivasan, J. T. Barron, and H. Kretzschmar, "Block-NeRF: Scalable large scene neural view synthesis," arXiv:2202.05263.
- [6] K. Schwarz, A. Sauer, M. Niemeyer, Y. Liao, and A. Geiger, "VoxGRAF: Fast 3D-aware image synthesis with sparse voxel grids," arXiv:2206.07695.
- [7] A. Yu, S. Fridovich-Keil, M. Tancik, Q. Chen, B. Recht, and A. Kanazawa, "Plenoxels: Radiance fields without neural networks," pp. 5491–5500.
- [8] T. Müller, A. Evans, C. Schied, and A. Keller, "Instant neural graphics primitives with a multiresolution hash encoding," arXiv:2201.05989.
- [9] L. Wang, J. Zhang, X. Liu, F. Zhao, Y. Zhang, Y. Zhang, M. Wu, L. Xu, and J. Yu, "Fourier PlenOctrees for dynamic radiance field rendering in real-time," arXiv:2202.08614.
- [10] C. Sun, M. Sun, and H.-T. Chen, "Direct voxel grid optimization: Super-fast convergence for radiance fields reconstruction," arXiv:2111.11215.
- [11] D. Xu, P. Wang, Y. Jiang, Z. Fan, and Z. Wang, "Signal processing for implicit neural representations," arXiv:2210.08772.
- [12] D. Chen, Y. Liu, L. Huang, B. Wang, and P. Pan, "GeoAug: Data augmentation for few-shot NeRF with geometry constraints," in Computer Vision – ECCV 2022, S. Avidan, G. Brostow, M. Cissé, G. M. Farinella, and T. Hassner, Eds. Springer Nature Switzerland, pp. 322–337.
- [13] A. Chen, Z. Xu, F. Zhao, X. Zhang, F. Xiang, J. Yu, and H. Su, "MVSNeRF: Fast generalizable radiance field reconstruction from multi-view stereo," pp. 14104–14113.
- [14] A. Yu, V. Ye, M. Tancik, and A. Kanazawa, "pixelNeRF: Neural radiance fields from one or few images," arXiv:2012.02190.
- [15] J. Zhang, Y. Zhang, H. Fu, X. Zhou, B. Cai, J. Huang, R. Jia, B. Zhao, and X. Tang, "Ray priors through reprojection: Improving neural radiance fields for novel view extrapolation," arXiv:2205.05922.
- [16] M. Niemeyer, J. T. Barron, B. Mildenhall, M. S. M. Sajjadi, A. Geiger, and N. Radwan, "RegNeRF: Regularizing neural radiance fields for view synthesis from sparse inputs," arXiv:2112.00724.
- [17] C. Qin and X. Zhang, "Effective reversible data hiding in encrypted image with privacy protection for image content," vol. 31, pp. 154–164. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S104732031500108X
- [18] X. Liao and C. Shu, "Reversible data hiding in encrypted images based on absolute mean difference of multiple neighboring pixels," vol. 28, pp. 21–27. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S1047320314002132
- [19] F. Uccheddu, M. Corsini, and M. Barni, "Wavelet-based blind watermarking of 3D models," in Workshop on Multimedia & Security.
- [20] E. Praun, H. Hoppe, and A. Finkelstein, "Robust mesh watermarking," in Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH '99). ACM Press/Addison-Wesley, pp. 49–56. [Online]. Available: https://doi.org/10.1145/311535.311540
- [21] R. Ohbuchi, A. Mukaiyama, and S. Takahashi, "A frequency-domain approach to watermarking 3D shapes," vol. 21.
- [22] J.-U. Hou, D.-G. Kim, and H.-K. Lee, "Blind 3D mesh watermarking for 3D printed model by analyzing layering artifact," vol. 12, pp. 2712–2725.
- [23] J. Son, D. Kim, H.-Y. Choi, H.-U. Jang, and S. Choi, "Perceptual 3D watermarking using mesh saliency," in Information Science and Applications 2017, K. Kim and N. Joukov, Eds. Springer Singapore, pp. 315–322.
- [24] M. Hamidi, A. Chetouani, M. El Haziti, M. El Hassouni, and H. Cherifi, "Blind robust 3-D mesh watermarking based on mesh saliency and QIM quantization for copyright protection," in Pattern Recognition and Image Analysis, A. Morales, J. Fierrez, J. S. Sánchez, and B. Ribeiro, Eds. Springer International Publishing, pp. 170–181.
- [25] J. Liu, Y. Yang, D. Ma, W. He, and Y. Wang, "A novel watermarking algorithm for three-dimensional point-cloud models based on vertex curvature," vol. 15.
- [26] Y. Uchida, Y. Nagai, S. Sakazawa, and S. Satoh, "Embedding watermarks into deep neural networks."
- [27] Y. Adi, C. Baum, M. Cisse, B. Pinkas, and J. Keshet, "Turning your weakness into a strength: Watermarking deep neural networks by backdooring," arXiv:1802.04633.
- [28] H. Wu, G. Liu, Y. Yao, and X. Zhang, "Watermarking neural networks with watermarked images," vol. 31, no. 7, pp. 2591–2601.
- [29] X. Guan, H. Feng, W. Zhang, H. Zhou, J. Zhang, and N. Yu, "Reversible watermarking in deep convolutional neural networks for integrity authentication," arXiv:2104.04268.
- [30] C. Li, B. Y. Feng, Z. Fan, P. Pan, and Z. Wang, "StegaNeRF: Embedding invisible information within neural radiance fields." [Online]. Available: http://arxiv.org/abs/2212.01602
- [31] J. Fridrich, M. Goljan, and R. Du, "Detecting LSB steganography in color, and gray-scale images," vol. 8, no. 4, pp. 22–28.
- [32] X. Weng, Y. Li, L. Chi, and Y. Mu, "High-capacity convolutional video steganography with temporal residual modeling." [Online]. Available: https://api.semanticscholar.org/CorpusID:174802332
- [33] S. Mallat, "A theory for multiresolution signal decomposition: the wavelet representation," vol. 11, no. 7, pp. 674–693.
- [34] S. Baluja, "Hiding images in plain sight: Deep steganography," in Neural Information Processing Systems. [Online]. Available: https://api.semanticscholar.org/CorpusID:29764034
- [35] J. Jing, X. Deng, M. Xu, J. Wang, and Z. Guan, "HiNet: Deep image hiding by invertible network," in 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 4713–4722. [Online]. Available: https://ieeexplore.ieee.org/document/9711382/
- [36] S.-P. Lu, R. Wang, T. Zhong, and P. L. Rosin, "Large-capacity image steganography based on invertible neural networks," in 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10811–10820. [Online]. Available: https://ieeexplore.ieee.org/document/9577969/