
Multiresolution Neural Networks for Imaging

Hallison Paz, Tiago Novello, Vinicius Silva, Luiz Schirmer,
Guilherme Schardong, Fabio Chagas, Helio Lopes, Luiz Velho
Abstract

We present MR-Net, a general architecture for multiresolution neural networks, and a framework for imaging applications based on this architecture. Our coordinate-based networks are continuous both in space and in scale as they are composed of multiple stages that progressively add finer details. Besides that, they are a compact and efficient representation. We show examples of multiresolution image representation and applications to texture magnification, minification, and antialiasing.

This document is the extended version of the paper [PNS+22]. It includes additional material that would not fit within the page limits of the conference track.

1 Introduction

Imaging applications benefit greatly from representations that support multiple resolutions, since these are instrumental for many tasks in computer vision and graphics, such as compression, analysis, and rendering. Traditionally, multiresolution representations for images have been based on signal processing techniques derived from Fourier theory.

Recently, the revolution in the media industry caused by deep neural networks motivated the development of new image representations adapted to machine learning methods. In that context, we introduce a new framework for the representation of images in multiresolution using coordinate-based sinusoidal neural networks.

1.1 Motivation

The main breakthrough in deep learning for computer vision and imaging was due to the seminal work of LeCun and Bengio [LB98]. This work proposed the Convolutional Neural Network (CNN) as an architecture suited to the analysis of visual imagery. The effectiveness of CNNs comes from the translation-invariance properties of the convolution operator.

Deep networks such as CNNs employ an array-based discrete representation of the underlying signal. In this case, the network input consists of a vector of (R, G, B) pixel values that represents the image directly by data samples. Therefore, we can also call this kind of network a data-based network.

In contrast, a coordinate-based neural network represents the image indirectly. This kind of network is a fully connected MLP (Multi Layer Perceptron) that takes as input a pixel coordinate (x, y) and outputs the (R, G, B) color at that location.
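To make this concrete, the following is a minimal sketch of such a coordinate-based network in PyTorch (our choice of library for illustration; the class name, layer widths, and ReLU activation are ours and not from the paper):

import torch
import torch.nn as nn

class CoordinateMLP(nn.Module):
    """Illustrative coordinate-based network: pixel coordinate -> RGB."""
    def __init__(self, hidden=256, num_layers=3):
        super().__init__()
        layers = [nn.Linear(2, hidden), nn.ReLU()]
        for _ in range(num_layers - 1):
            layers += [nn.Linear(hidden, hidden), nn.ReLU()]
        layers.append(nn.Linear(hidden, 3))  # output (R, G, B)
        self.net = nn.Sequential(*layers)

    def forward(self, xy):
        # xy: (batch, 2) pixel coordinates, typically normalized to [-1, 1]^2
        return self.net(xy)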

While the data-based network is appropriate for analysis tasks, relying on a discrete description of the image, the coordinate-based network is suitable for synthesis, providing a continuous image description. For its characteristics, namely continuity and compactness, there is a growing interest in the research community to explore the potential of coordinate-based networks for imaging applications.

1.2 Related Work

Coordinate-based networks provide a continuous functional description of images using an implicit neural representation [CLW21]. Since the coordinates are continuous, images can be rendered at arbitrary resolutions.

Spectral neural implicit architectures constitute a particular form of neural implicit representation in which the non-linear activation function is the periodic function sin(x). As such, they bridge the gap between the spatial and spectral domains, given the close relationship of the sine function with the Fourier basis.

However, these neural representations with periodic activation functions have been regarded as difficult to train [PHV17]. To overcome this problem, Sitzmann et al. [SMB+20] proposed a sinusoidal network for signal representation called SIREN. One of the key contributions of this work is the initialization scheme that guarantees stability and good convergence. Furthermore, it also allows modeling fine details in accordance with the signal’s frequency content.

A multiplicative filter network (MFN) [FSWK20] is a spectral neural implicit architecture that is simpler than SIREN and equivalent to a shallow sinusoidal network. Lindell et al. [LVPW22] presented BACON (Band-limited Coordinate Network), an MFN that produces intermediate outputs with an analytical spectral bandwidth that can be specified at initialization, achieving a multiresolution decomposition of the output.

The control of frequency bands in the representation is closely related to the capability of adaptively reconstructing the signal at multiple levels of detail. In that context, Müller et al. [MESK22] developed a multiresolution neural network architecture based on hash encoding. Also, Martel et al. [MLL+21] designed an adaptive coordinate network for neural signal representation.

Another benefit of multiresolution image representations is the built-in support for antialiasing, which traditionally is implemented using image pyramids, such as in MIP Mapping [Wil83].

One of the important applications of imaging in both 2D and 3D computer graphics is texture synthesis. In that realm, besides antialiasing, the creation of visual patterns from exemplars has great relevance [TZN19]. Spectral neural implicit architectures are particularly suited to model stationary or quasi-stationary signals, due to the periodic nature of their activation functions [CLC+22].

1.3 Contributions

In summary, we make the following contributions:

  • We introduce a family of multiresolution coordinate-based networks with a unified architecture, providing a representation that is continuous in both space and scale.

  • We develop a framework for imaging applications based on this architecture, leveraging classical multiresolution concepts such as pyramids.

  • We show that our architecture can represent images with good visual quality, outperforming related methods both in PSNR and number of parameters; we also demonstrate its use in applications of texture magnification and minification, and antialiasing.

2 Multiresolution Sinusoidal Neural Networks

In this section we present MR-Net (Multiresolution Sinusoidal Neural Networks), a representation of signals at multiple levels of detail using deep neural networks.

2.1 Overview

Our proposal is a family of coordinate-based networks with a unified architecture. We derive three main variants, namely S-Net, L-Net, and M-Net. Together, they provide different trade-offs with respect to the control of frequencies in the representation.

The characteristics of the MR-Net Family are:

  • Two types of level of detail – based either on network capacity or on spectral projection.

  • Three types of sampling – the input signal can be given by regular sampling (with or without subsampling) or by stochastic / stratified sampling.

  • Progressive Training – the network is trained progressively using a variety of schedule regimes.

  • Continuous Multiscale – the representation is continuous both in space and in scale; therefore, it can reconstruct the signal at any desired resolution / level of detail.

The following subsections present the concepts involved in our approach, as well as the technical details of the MR-Net architecture. For a more complete description see [VPNY22].

2.2 Architecture

Based on considerations for learning and splitting frequencies of the input signal into levels of detail, we devised a general architecture for a family of neural networks.

The core idea is to structure the network into multiple stages. Each stage learns in a controlled way a level of detail corresponding to a frequency band.

A stage configuration is derived from a sub-network called MR-Module, which is composed of four blocks of layers: the first layer; the hidden layers; the linear layer; and the control layer. The three initial blocks form a fully connected sinusoidal MLP. The last block consists of one node (not trained) to control the stage output, given by the function c_α(x) = αx, with α ∈ [0, 1], where x comes from the linear layer. Figure 1 depicts the anatomy of the network, showing N stages of MR-Modules.

Figure 1: Anatomy of the MR-Net family.

We can say that the first layer performs a projection of the input signal onto a dictionary of sine functions, the hidden layers correspond to correlations of order n of the signal frequencies, the linear layer reconstructs the signal as a combination of these frequency atoms, and the control layer is a mechanism to provide a continuous blend of levels of detail in the network.
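The description above can be summarized in code. The following is a sketch under our reading of the text, not the authors' implementation; in particular, the default widths, the placement of the sine activations, and the way the hidden features are exposed for reuse are our assumptions:

import torch
import torch.nn as nn

class SineLayer(nn.Module):
    """Linear layer followed by a sin activation."""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)

    def forward(self, x):
        return torch.sin(self.linear(x))

class MRModule(nn.Module):
    """One stage: first layer, hidden block, linear layer, control node."""
    def __init__(self, in_features=2, hidden=96, out_features=1, hidden_layers=1):
        super().__init__()
        self.first = SineLayer(in_features, hidden)    # projection onto sine dictionary
        self.hidden = nn.Sequential(
            *[SineLayer(hidden, hidden) for _ in range(hidden_layers)])
        self.linear = nn.Linear(hidden, out_features)  # reconstruction from frequency atoms
        self.alpha = 1.0                               # control layer c_alpha(x) = alpha * x

    def forward(self, x):
        h = self.hidden(self.first(x))                 # hidden features (reused by the M-Net)
        return self.alpha * self.linear(h), h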

The stages of the network are trained based on a predefined schedule (see Section 2.3). During training, the control layer is the identity function, i.e., α = 1.

The contribution of these N stages is added together, forming the network output. Assuming that the MR-Net is learning a function f(x) that fits the input signal, then

f(x) = g_0(x) + ⋯ + g_N(x)    (1)

where g_i(x) is the detail function given by the stage s_i, for i = 1, …, N. The first stage s_0 corresponds to the coarsest approximation of the signal, and the subsequent stages add increasingly finer details to it.

Note that this architecture is very much in the spirit of multiresolution analysis [Mal89]. Indeed, consider the base case f(x) = g_0(x) + g_1(x); then g_1(x) = f(x) - g_0(x), i.e., g_1 comprises the details that must be added to g_0 to increase the level of detail.

The network learns the decomposition of the signal as the projection into the coarse scale space and a sequence of finer detail spaces. The characteristics of the level of detail decomposition of each member of the MR-Net family will depend on the specific configuration of the network stages, as will be presented next.

2.2.1 S-Net

In the S-Net, a stage has only the first layer, the linear layer, and the control layer, which is not involved in the learning process (see Figure 2). Therefore, this kind of network consists of a learned “sine transform”, with these two layers corresponding, respectively, to the direct and inverse sine transforms. As a consequence, the S-Net can provide level of detail through the initialization of frequency bands in each stage.
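In code, an S-Net stage reduces to the first and linear layers of the MR-Module sketch above (again an illustrative sketch, reusing the SineLayer class defined there):

import torch.nn as nn

class SNetStage(nn.Module):
    """Learned "sine transform": sinusoidal projection + linear reconstruction."""
    def __init__(self, in_features=2, hidden=96, out_features=1):
        super().__init__()
        self.first = SineLayer(in_features, hidden)    # direct sine transform
        self.linear = nn.Linear(hidden, out_features)  # inverse sine transform
        self.alpha = 1.0                               # control node, not trained

    def forward(self, x):
        return self.alpha * self.linear(self.first(x))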

Figure 2: S-Net

2.2.2 L-Net

The L-Net is composed of N complete, independent stages (with the four blocks: first; hidden; linear and control) that are aggregated in the output (see Figure 3). Consequently, the level of detail in this kind of network is determined by the capacity of the individual stages.

Figure 3: L-Net

2.2.3 M-Net

The M-Net consists of a hierarchy of stages, in which subsequent stages are linked together. The output of a block of hidden layers is connected both to the linear layer and to the input of the hidden-layer block of the next stage (see Figure 4).

Figure 4: M-Net

A consequence of this hierarchical structure is that the hidden-layer block of a stage is augmented with the sequence of hidden-layer blocks coming from the previous stages. Therefore, the capacity of each stage increases with its depth in the hierarchy. Accordingly, we expect this kind of network to provide a more powerful mechanism for learning levels of detail, and for this reason we use this variant for the imaging applications in this paper.
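A sketch of the corresponding forward pass follows, assuming the MRModule class from the earlier sketch. The text does not specify how a stage combines its own sine projection with the previous stage's hidden features, so the summation used here is our assumption:

class MNet(nn.Module):
    def __init__(self, n_stages=7, hidden=96, out_features=1):
        super().__init__()
        self.stages = nn.ModuleList(
            [MRModule(2, hidden, out_features) for _ in range(n_stages)])

    def forward(self, xy):
        out, h = self.stages[0](xy)
        for stage in self.stages[1:]:
            # the hidden block receives the stage's own sine projection
            # plus the previous stage's hidden features (an assumption)
            h = stage.hidden(stage.first(xy) + h)
            out = out + stage.alpha * stage.linear(h)
        return out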

2.3 Training

The training of the MR-Net has to take into account the mechanisms for learning different levels of detail by each stage of the network. There are two basic mechanisms: i) level of detail filtering and ii) pre-processing of the input signal.

Level-of-detail filtering is achieved either with frequency-band filtering (through the initialization of the ω_0 weights of the first layer) or with capacity filtering (conditioned on the number of nodes and layers of the network).

Regarding the network input, we can either use the original signal, or pre-process the signal with a low-pass filter.

2.3.1 Multi-Stage Schedule

Since the M-Net has multiple stages, each learning a different level of detail, one important aspect is the stage training schedule. We could train the whole network with all stages in parallel, but we found it beneficial to train each stage in sequence, from the lowest to the highest level of detail. We adopt this schedule; it is also a common strategy in traditional multiresolution analysis of signals.

2.3.2 Progressive Learning

Furthermore, we adopt a progressive learning strategy by “freezing” the weights of a stage once it is trained in the schedule sequence. This strategy guarantees that details are added to the representation incrementally, from coarse to fine.
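A sketch of this progressive schedule, assuming the MNet sketch above; model, loaders (one batch iterator per target level), and train_stage are hypothetical names, with a fuller sketch of train_stage given in Section 3:

# Train stages from coarse to fine, freezing each one once it is done.
for s in model.stages:
    s.alpha = 0.0                           # stages start inactive (an assumption)
for i, stage in enumerate(model.stages):
    stage.alpha = 1.0                       # control layer is the identity during training
    train_stage(model, stage, loaders[i])   # fit the i-th level of detail
    for p in stage.parameters():
        p.requires_grad = False             # freeze: later stages only add finer details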

2.3.3 Adaptive Training

We also employ an adaptive training scheme for the optimization of each network stage, combining loss thresholds and convergence rates.

2.4 Level of Detail Schemes

By incorporating the different aspects discussed in the previous sections, we can define various schemes for learning level-of-detail representations using the family of Multiresolution Sinusoidal Neural Networks. The main ones are filtering with a Gaussian Tower and filtering with a Gaussian Pyramid, which we describe next.

2.4.1 Gaussian Tower

If we want to have control over the frequencies present in the signal we can build a multiscale representation of the signal and train the network to approximate each level of this representation.

We start by feeding our model with a “Gaussian Tower”, that is, multiple versions of the signal filtered consecutively by a low-pass filter, but without decimation. This way, each scale is reconstructed from the same number of samples. The network must be trained from the least detailed scale to the most detailed one.

2.4.2 Gaussian Pyramid

By the Shannon sampling theorem, the Gaussian Tower is a highly redundant multiscale representation. The Gaussian Pyramid, on the other hand, is “critically sampled”, i.e., it has the minimum number of samples required to represent each frequency band. The Gaussian Pyramid is a classical multiscale representation of uniformly sampled signals, and we adopt this scheme in the imaging applications of this paper.
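A sketch of both constructions for a 2D grayscale image, using SciPy's Gaussian filter; the filter choice and its sigma are our assumptions for illustration:

import numpy as np
from scipy.ndimage import gaussian_filter

def gaussian_tower(img, levels, sigma=1.0):
    """Consecutively low-pass filtered copies, no decimation (redundant)."""
    out, cur = [], img.astype(np.float32)
    for _ in range(levels):
        out.append(cur)
        cur = gaussian_filter(cur, sigma)
    return out[::-1]                 # coarsest level first

def gaussian_pyramid(img, levels, sigma=1.0):
    """Filter then decimate by 2 at each level (critically sampled)."""
    out, cur = [], img.astype(np.float32)
    for _ in range(levels):
        out.append(cur)
        cur = gaussian_filter(cur, sigma)[::2, ::2]
    return out[::-1]                 # coarsest level first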

3 Imaging Applications

In this section we describe the implementation and experiments of the MR-Net for imaging applications. As already mentioned, we will adopt the M-Net variant of the architecture and a level of detail scheme based on the Gaussian Pyramid.

We designed the MR-Module considering an empirical exploration of the sub-network capacity needed to represent images with typical characteristics (i.e., photographs). The configuration is as follows: width of 96 neurons (first, hidden, and linear layers); one layer in the hidden block.

The number of stages of the network is determined by the resolution of the image to be represented. The multiresolution image pyramid follows a dyadic structure, i.e., resolutions of 2^j. The base resolution of the first stage is 2^3 = 8.
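As a small worked example, the stage count follows directly from this dyadic structure (function name is ours):

import math

def num_stages(resolution, base_exponent=3):
    # dyadic levels from 2^3 = 8 up to the image resolution
    return int(math.log2(resolution)) - base_exponent + 1

num_stages(512)  # -> 7 stages (levels 8, 16, 32, 64, 128, 256, 512)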

Figure 5: Cameraman - reconstructed multiresolution levels 1, 3, 5, and 7, and the corresponding Fourier spectra.

The Image Pyramid is built by filtering with a Gaussian kernel and decimating. Most of the images used in the experiments have a resolution of 512×512, so the pyramid is composed of the following resolution levels: 8, 16, 32, 64, 128, 256, and 512. Consequently, the network has a total of 7 stages.

We train the network using an adaptive scheme with the following hyper-parameters: loss function = MSE (mean squared error); convergence threshold = 0.001 (i.e., training of a stage stops if the loss value changes by less than 0.001 percent); maximum number of epochs per stage = 300; each epoch visits all the pixels once; mini-batch size = 65536 (to fit in GPU memory). Training is done with ADAM and a learning rate of 0.0001.
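A sketch of the hypothetical train_stage helper used in Section 2.3, implementing this adaptive stopping rule as we read it (names and the exact form of the relative-change test are assumptions):

import torch

def train_stage(model, stage, loader, max_epochs=300, tol=1e-5):
    opt = torch.optim.Adam(
        [p for p in stage.parameters() if p.requires_grad], lr=1e-4)
    mse = torch.nn.MSELoss()
    prev_loss = float("inf")
    for epoch in range(max_epochs):
        total = 0.0
        for xy, rgb in loader:           # mini-batches of 65536 pixels
            opt.zero_grad()
            loss = mse(model(xy), rgb)
            loss.backward()
            opt.step()
            total += loss.item()
        # stop when the epoch loss changes by less than 0.001 percent
        if abs(prev_loss - total) < tol * max(prev_loss, 1e-12):
            break
        prev_loss = total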

The initialization of the network follows the scheme in [SMB+20], which normalizes the weights of all layers and includes a factor ω_0 that sets the spatial frequencies of the first layer to better match the frequency spectrum of the signal. However, we differ in that the ω_0 factor is used only for the initialization of the first layer, and we introduce an additional factor ω_G that is applied to the other layers to boost the network gradients. Also, the weights of the first layer of each stage are set to match the frequencies of the corresponding level of detail, in the following way: the first-layer weights are uniformly distributed as U(-B_i, B_i), with B_i = 4, 8, 16, 32, 64, 128, 256 for i = 1, …, 7 in a network with seven stages. The factor ω_G = 30 in all experiments.
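A sketch of this first-layer initialization; the band limits B_i are from the text, while the function and attribute names are from our earlier MRModule sketch:

import torch

BANDS = [4, 8, 16, 32, 64, 128, 256]   # B_i for a seven-stage network

def init_first_layer(linear, band):
    # draw first-layer weights uniformly from U(-B_i, B_i)
    with torch.no_grad():
        linear.weight.uniform_(-band, band)

# e.g.: for stage, band in zip(model.stages, BANDS):
#           init_first_layer(stage.first.linear, band)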

3.1 Level of Detail Example

We now show an example of a multiresolution image representation using the setup described above. For this experiment we chose “Cameraman”, a standard test image in image processing that is also used in [SMB+20]. The source is a monochromatic picture with 512×512 pixels of resolution.

Figure 5 depicts levels 1, 3, 5, and 7 of the multiresolution hierarchy reconstructed at the full resolution of 512×512, as well as the corresponding Fourier spectra (intermediate levels are skipped to fit the page).

The training times for each stage of the network are as follows: 5s, 4s, 3s, 11s, 17s, 29s, and 48s, for a total training time of 117s. The machine was a Windows 10 laptop with an NVIDIA RTX A5000 Laptop GPU. Note that these times result from the adaptive training regime and the number of samples at each level of the Gaussian Pyramid.

The training evolution is depicted in the graph of Figure 6, which shows the convergence of the MSE loss with the number of epochs for stages 1, 3, 5, and 7. It is worth pointing out the qualitative behavior of the network: the base level (stage 1) takes more than 200 epochs to converge, while the detail levels (stages 3, 5, and 7) take fewer than 150 epochs. Also, the error decreases for each level of detail. It is as if there were two different modes, one for fitting the base level and another for the detail levels.

Figure 6: Qualitative convergence behavior for the Cameraman image in Figure 5.

The inference time for image reconstruction, both on CPU and on GPU, is sufficiently fast for interactive visualization.

3.2 Texture

The second imaging example is the use of the MR-Net representation to model textures and patterns. Arguably, visual textures constitute one of the most important applications of images in diverse fields, ranging from photo-realistic simulation to interactive games. Currently, more and more image rendering relies on some kind of graphics acceleration, sometimes through GPUs integrated with neural engines. In that context, it is desirable to have a neural image representation that is compact and supports level of detail.

For the experiment shown in this subsection we chose an image of a woven fabric background with patterns. The characteristics of this texture allow us to explore the limits of visual patterns at different resolutions. The original image has a resolution of 1025×1025 pixels, and the corresponding model contains 5 levels of detail.

In Figure 7 we show our reconstruction at the original resolution, Fig. 7(b) (center), as well as a zoom-in, Fig. 7(c) (right), and a zoom-out, Fig. 7(a) (left).

Figure 7: Woven fabric texture: (a) zoom out; (b) original image resolution; (c) zoom in.

The zoom-in is a detail with a resolution of 562×562, marked by the white rectangle in Fig. 7(b), scaled up to 1025×1025, i.e., a zoom-in factor of about 1.8 times. It can be seen that the enlargement extrapolates the fine details of the image at this higher resolution, beyond the original image.

The zoom-out is a reduction of the entire image to 118×118 pixels (shown enlarged to 501×501 in the figure for better viewing). The top sub-image is the network reconstruction at the appropriate level of detail (approximately 0.92). The bottom sub-image is a point-sampled (nearest-neighbor) reduction. It can be seen that our reconstruction is a properly anti-aliased rendition of the image, while the sampled reduction exhibits aliasing artifacts.

These two behaviors in the experiment are manifestations of “magnification” and “minification”, the classical resampling regimes for scaling the image up and down, respectively [Smi21]. In the first case it is necessary to interpolate pixel values, and in the second case it is required to integrate the pixel values corresponding to each reconstructed pixel. The M-Net model accomplishes both tasks automatically. Note that we chose fractional scaling factors in both cases to demonstrate the continuity in space and scale of the M-Net model.

3.3 Anti-aliasing

In the previous subsection we resorted to level-of-detail control to guarantee an alias-free rendering, independently of the sampling resolution. However, this task was facilitated because we could use a constant level of detail for the entire image, due to the zooming in/out operation in 2D.

On the other hand, in texture mapping applications this is no longer the case. Typically, a 2D texture must be mapped onto a 3D surface that is rendered in perspective by a virtual camera. In this situation, the level of detail varies spatially, depending on the distance of the 3D surface point from the camera. Here, proper anti-aliasing must compensate for the foreshortening caused by the projective transformation. Next we present a simple example of anti-aliasing using the M-Net.

Let I be a checkerboard image, T be a homography mapping the pixel coordinates (x, y) of the screen to the texture coordinates (u, v) of I, and f = g_1 + ⋯ + g_N : [-1, 1]² → ℝ be an M-Net with N stages approximating I.

Figure 8: Checkerboard in perspective: (a) point-sampled texture rendition; (b) M-Net anti-aliased reconstruction.

Figure 8(a) illustrates the aliasing effects that arise when the image I is mapped to the screen through the inverse of T. We avoid this problem using the multiresolution structure of the M-Net f; the result is presented in Figure 8(b). Observe that the procedure reduces aliasing at large distances. Specifically, we define the level-of-detail parameter λ(x, y) for the M-Net f at a pixel (x, y) using the formula proposed by Heckbert [H+83]:

λ(x, y) = max{ √((∂u/∂x)² + (∂v/∂x)²), √((∂u/∂y)² + (∂v/∂y)²) }.

Thus λ(x, y) is the longer of the two sides of the parallelogram generated by the vectors ∂T/∂x and ∂T/∂y.
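A numerical sketch of this computation, approximating the partial derivatives with finite differences of a callable homography T; the function name and the finite-difference scheme are our construction, for illustration:

import numpy as np

def level_of_detail(T, x, y, eps=1.0):
    # T maps screen (x, y) to texture (u, v); eps is one pixel
    u0, v0 = T(x, y)
    ux, vx = T(x + eps, y)
    uy, vy = T(x, y + eps)
    du_dx, dv_dx = (ux - u0) / eps, (vx - v0) / eps
    du_dy, dv_dy = (uy - u0) / eps, (vy - v0) / eps
    return max(np.hypot(du_dx, dv_dx), np.hypot(du_dy, dv_dy))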

We define the resolution λ of f using the formula

f_λ = λ_1 g_1 + ⋯ + λ_N g_N,

where the λ_i are weights defined as follows. First, scale λ such that λ([-1, 1]²) ⊂ [0, N]. Then set λ_i = 1 for 1 ≤ i ≤ ⌊λ⌋, λ_{⌊λ⌋+1} = λ - ⌊λ⌋, and λ_i = 0 otherwise. Here ⌊λ⌋ denotes the floor value of λ.
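In code, the blending weights amount to the following sketch, where g holds the per-stage outputs g_1, …, g_N and lam is already scaled into [0, N] (function name is ours):

import math

def blend_levels(g, lam):
    k = int(math.floor(lam))
    out = sum(g[:k])                   # lambda_i = 1 for i <= floor(lam)
    if k < len(g):
        out = out + (lam - k) * g[k]   # fractional weight on the next stage
    return out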

4 Comparisons

In this section we compare the performance of the MR-Net image representation with two other neural network models, namely SIREN and BACON. For this evaluation we used the “Cameraman” image shown in Subsection 3.1, comparing model size and reconstruction quality. The results are summarized in Table 1.

Model    # Params ↓   PSNR ↑    # Levels
SIREN    197K         61.8 dB   1
BACON    398K         82.1 dB   7
MR-Net   121K         84.9 dB   7

Table 1: Comparison with SIREN and BACON.

The M-Net hyper-parameters are: 96 hidden features; ω_0 ∈ [4, 256]; trained with a Gaussian Pyramid of 7 levels. The model has 121,937 parameters. The image resolution is 512×512 pixels. The PSNR of the final image reconstruction is 84.9 dB.

For SIREN we employed the configuration of the image experiments in their paper and public GitHub code. The network hyper-parameters are: 3 hidden layers; 256 hidden features; ω_0 = 120. The model has 197,376 parameters and only 1 level of detail. The PSNR of the reconstructed image is 61.8 dB.

For BACON we also based the configuration on the examples in their paper and public GitHub code. However, we made some adjustments to make the hyper-parameters compatible with the SIREN and M-Net settings, as follows: image size 256×256; 6 hidden layers; 256 hidden features. Accordingly, the total number of parameters is 398,343. The PSNR of the level-6 image is 82.1 dB. We remark that, to establish a fair comparison with SIREN, we evaluate only the PSNR of the final full-resolution image.

Based on these experiments, we conclude that MR-Net compares favorably with SIREN and BACON, both in terms of representation size and quality of image reconstruction. The M-Net model is only 62% of the size of the SIREN model, even though it has 7 levels of detail in contrast with SIREN's single level; additionally, its reconstruction PSNR is about 1.37 times higher, despite the more compact representation. Compared to BACON, the results are also favorable: the M-Net model is just 30% of the size of the BACON model, while the reconstruction of the final image has slightly better quality (a PSNR ratio of 1.034).

4.1 Comparison with SIREN

Parameters: hidden features = 256; hidden layers = 3; number of levels = 1; ω_0 = 120; model size = 197,376 parameters; image resolution = 512×512. PSNR of the reconstruction = 61.8 dB.

Figure 9 shows the image reconstructions with M-Net and SIREN models.

Figure 9: Comparison of reconstructed images: M-Net (left) and SIREN (right).

4.2 Comparison with BACON

Parameters: hidden features = 256; hidden layers = 6; learning rate = 0.005; number of steps = 5001; Pe scale = 3.0; model size = 398,343 parameters. PSNR of the reconstruction = 82.1 dB.

Figures 10 and 11 show the image reconstructions and the corresponding frequency spectra for three levels of detail of the BACON model (levels 1, 2, and 6) and of the M-Net model (levels 3, 5, and 7), respectively. Note that we selected these levels to better match the frequency spectra of the two representations.

Figure 10: BACON image reconstruction and frequency spectra for levels 1, 2, and 6.
Figure 11: M-Net image reconstruction and frequency spectra for levels 3, 5, and 7.

BACON controls the frequency band of each level of detail by truncating the spectrum of the most detailed level (note that the centers of the spectrum images for levels 1 and 2 in Figure 10 are analogous to the center of the spectrum at level 6). This approach is analogous to applying a low-pass filter with a non-ideal shape in the frequency domain, which produces ringing artifacts in the image (notice how silhouettes propagate all over the left image in Figure 10). M-Net does not present such artifacts; Figure 11 more faithfully resembles what would be expected from sequentially applying Gaussian filters to a high-resolution image.

5 Considerations for Other MR-Net Variants

In the paper, we described the MR-Net architecture and its three variants, namely S-Net, L-Net, and M-Net. However, we used only the M-Net variant for the imaging applications. In the future we plan to explore the other two variants.

Nonetheless, it is appropriate here to make a few considerations about the S-Net and L-Net variants.

The S-Net provides a neural image representation as a weighted sum of sin(x) functions. In that sense, the S-Net is equivalent to BACON and other multiplicative filter networks based on sinusoidal atoms. As such, it is well suited to representing periodic visual patterns.

The L-Net image representation is composed of a sum of level-of-detail stages given by independent MR-Module sub-networks. The relation of this representation to the Laplacian Pyramid makes it suitable for image operations in the gradient domain.

We intend to study and compare the three variants of the MR-Net architecture in greater detail, and to develop imaging applications in the directions pointed out above for the S-Net and L-Net, complementing those presented in this paper for the M-Net.

6 Closing Remarks

In this last section we close the paper with an assessment of our results, their limitations, and a discussion of future directions for our research.

6.1 Limitations

We found that choosing appropriate values for the ω_0 parameter, which defines the spatial frequencies of the first layer of the network, is important to achieve proper results. When using a shallow sinusoidal network such as the S-Net, we can use the Nyquist frequency as a direct reference to pick the frequency intervals of each stage. However, a deep sinusoidal network such as the M-Net can learn higher frequencies that were not present at initialization. To the best of our knowledge, there is not (yet) a theory to compute or bound these frequencies.

In our experiments, we determined the ω_0 values for initialization empirically, testing values below the Nyquist frequency. To better harness the power of sinusoidal neural networks, it is important to develop a mathematical theory of how the composition of sine functions introduces new frequencies based on the initialization of the network.

6.2 Ongoing and Future Work

In terms of future work, we plan to expand this research in two main directions. On one hand, we would like to explore the MR-Net architecture for other image applications including super-resolution, operations in the gradient domain, generation of periodic and quasi-periodic patterns, as well as image compression. On the other hand, we would like to extend the MR-Net representation to other media signals in higher dimensions, such as video, volumes, and implicit surfaces.

Acknowledgment

We thank Daniel Yukimura for participating in this project.

References

  • [CLC+22] H. Chen, J. Liu, W. Chen, S. Liu, and Y. Zhao. Exemplar-based pattern synthesis with implicit periodic field network, 2022.
  • [CLW21] Yinbo Chen, Sifei Liu, and Xiaolong Wang. Learning continuous image representation with local implicit image function. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8628–8638, 2021.
  • [FSWK20] Rizal Fathony, Anit Kumar Sahu, Devin Willmott, and J Zico Kolter. Multiplicative filter networks. In International Conference on Learning Representations, 2020.
  • [H+83] Paul S Heckbert et al. Texture mapping polygons in perspective. Technical report, NYIT Computer Graphics Lab, 1983.
  • [LB98] Yann LeCun and Yoshua Bengio. Convolutional Networks for Images, Speech, and Time Series, pages 255–258. MIT Press, 1998.
  • [LVPW22] D. Lindell, D. Van Veen, J. Park, and G. Wetzstein. BACON: Band-limited coordinate networks for multiscale scene representation. In Proceedings of CVPR, 2022.
  • [Mal89] S.G. Mallat. A theory for multiresolution signal decomposition: the wavelet representation. IEEE Trans. on Pattern Analysis and Machine Intelligence, 11(7):674–693, 1989.
  • [MESK22] Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller. Instant neural graphics primitives with a multiresolution hash encoding. ACM Trans. Graph., 41(4):102:1–102:15, July 2022.
  • [MLL+21] Julien N. P. Martel, David B. Lindell, Connor Z. Lin, Eric R. Chan, Marco Monteiro, and Gordon Wetzstein. ACORN: Adaptive coordinate networks for neural scene representation. ACM Trans. Graph. (SIGGRAPH), 40(4), 2021.
  • [PHV17] G. Parascandolo, H. Huttunen, and T. Virtanen. Taming the waves: sine as activation function in deep neural networks. 2017.
  • [PNS+22] Hallison Paz, Tiago Novello, Vinicius Silva, Guilherme Schardong, Luiz Schirmer, Fabio Chagas, Helio Lopes, and Luiz Velho. Multiresolution neural networks for imaging. In Proceedings of SIBGRAPI, 2022.
  • [SMB+20] Vincent Sitzmann, Julien N.P. Martel, Alexander W. Bergman, David B. Lindell, and Gordon Wetzstein. Implicit neural representations with periodic activation functions. In Proc. NeurIPS, 2020.
  • [Smi21] Alvy Ray Smith. A Biography of the Pixel. MIT Press, Boston, 2021.
  • [TZN19] J. Thies, M. Zollhöfer, and M. Nießner. Deferred neural rendering: Image synthesis using neural textures. ACM Trans. Graph., 38(4), 2019.
  • [VPNY22] Luiz Velho, Hallison Paz, Tiago Novello, and Daniel Yukimura. Multiresolution neural networks for multiscale signal representation. Technical report, VISGRAF Lab, 2022.
  • [Wil83] Lance Williams. Pyramidal parametrics. SIGGRAPH Comput. Graph., 17(3):1–11, July 1983.