A unified Fourier slice method to derive ridgelet transform for a variety of depth-2 neural networks
Abstract
To investigate neural network parameters, it is easier to study the distribution of parameters than to study the parameters in each neuron. The ridgelet transform is a pseudo-inverse operator that maps a given function $f$ to the parameter distribution $\gamma$ so that a network $S[\gamma]$ reproduces $f$, i.e. $S[\gamma]=f$. For depth-2 fully-connected networks on a Euclidean space, the ridgelet transform has been discovered up to a closed-form expression, so we can describe how the parameters are distributed. However, for a variety of modern neural network architectures, the closed-form expression had not been known. In this paper, we explain a systematic method using Fourier expressions to derive ridgelet transforms for a variety of modern networks such as networks on finite fields $\mathbb{F}_p$, group convolutional networks on abstract Hilbert spaces, fully-connected networks on noncompact symmetric spaces, and pooling layers, or the $k$-plane ridgelet transform.
1 Introduction
Neural networks are learning machines that support today’s AI technology. Mathematically, they are nonlinear functions obtained by connecting parameterized functions (called neurons) in parallel and in series. Since the learning process is automated, we do not fully understand the parameters obtained through learning. One of the technical difficulties in analyzing the behavior of neural networks is that they depend on their parameters in an extremely nonlinear way. An integral representation is a powerful tool for the mathematical analysis of these parameters: it is a method of indirectly analyzing the parameters through their distribution, rather than directly analyzing the parameters of each neuron. The set of all signed (or probability) parameter distributions forms a linear (or convex) space, making it possible to perform far more insightful analyses than directly analyzing individual parameters.
For instance, characterizations of neural network parameters such as the ridgelet transform (Murata, 1996; Candès, 1998; Sonoda and Murata, 2017) and the representer theorems for ReLU networks (Savarese et al., 2019; Ongie et al., 2020; Parhi and Nowak, 2021; Unser, 2019), as well as convergence analyses of stochastic gradient descent (SGD) for deep learning such as the mean field theory (Nitanda and Suzuki, 2017; Chizat and Bach, 2018; Mei et al., 2018; Rotskoff and Vanden-Eijnden, 2018; Sirignano and Spiliopoulos, 2020) and the infinite-dimensional Langevin dynamics (Suzuki, 2020; Nitanda et al., 2022), have been developed using integral representations.
1.1 Integral Representation
The integral representation of a depth-2 fully-connected neural network is defined as below.
Definition 1.1.
Let $\sigma:\mathbb{R}\to\mathbb{R}$ be a measurable function, called the activation function, and fix it. For any signed measure $\gamma$ on $\mathbb{R}^m\times\mathbb{R}$, called a parameter distribution, we define the integral representation of a depth-2 fully-connected neural network as
$$S[\gamma](x) := \int_{\mathbb{R}^m\times\mathbb{R}} \gamma(a,b)\,\sigma(a\cdot x - b)\,\mathrm{d}a\,\mathrm{d}b, \qquad x\in\mathbb{R}^m. \tag{1}$$
Here, for each hidden parameter $(a,b)\in\mathbb{R}^m\times\mathbb{R}$, the feature map $x\mapsto\sigma(a\cdot x-b)$ corresponds to a single hidden neuron with activation function $\sigma$, the weight $\gamma(a,b)$ corresponds to an output coefficient, and the integration implies that all possible neurons are assigned in advance. Since the only free parameter is the parameter distribution $\gamma$, we can identify the network $S[\gamma]$ with the point $\gamma$ in a function space.
We note that this representation covers both infinite (or continuous) and finite widths. Indeed, while the integration may be understood as an infinite-width layer, integrating against a finite sum of point masses such as $\gamma = \sum_{j=1}^{p} c_j\,\delta_{(a_j,b_j)}$ represents a finite-width layer:
$$S[\gamma](x) = \sum_{j=1}^{p} c_j\,\sigma(a_j\cdot x - b_j) = c^\top\sigma(Ax-b),$$
where the third expression is the so-called “matrix” representation with the matrix $A=(a_1,\dots,a_p)^\top$ and vectors $b=(b_j)_j$, $c=(c_j)_j$, followed by the component-wise activation $\sigma$. Singular measures as above can be mathematically justified without any inconsistency if the class of parameter distributions is set to a class of Borel measures or Schwartz distributions.
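As a concrete illustration of this identification, the following minimal sketch (with our own toy sizes and activation; nothing here is specific to the paper) checks that a point-mass parameter distribution reproduces the familiar matrix form of a finite-width layer.

```python
import numpy as np

rng = np.random.default_rng(0)
m, p = 4, 10                              # input dimension and width (illustrative)
A = rng.standard_normal((p, m))           # hidden weights a_j (rows of A)
b = rng.standard_normal(p)                # hidden biases b_j
c = rng.standard_normal(p)                # output coefficients, i.e. point masses c_j at (a_j, b_j)
sigma = np.tanh                           # any activation

def S_matrix(x):
    """The 'matrix' representation c^T sigma(Ax - b)."""
    return c @ sigma(A @ x - b)

def S_sum(x):
    """The same network written as a sum over neurons, i.e. the integral
    representation with gamma = sum_j c_j * delta_{(a_j, b_j)}."""
    return sum(cj * sigma(aj @ x - bj) for aj, bj, cj in zip(A, b, c))

x = rng.standard_normal(m)
print(np.isclose(S_matrix(x), S_sum(x)))  # True
```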
There are at least four advantages to introducing integral representations:
- (1) Aggregation of the parameters, say $\{(a_j,b_j,c_j)\}_j$, into a single function (parameter distribution) $\gamma$,
- (2) Ability to represent finite models and continuous models in the same form,
- (3) Linearization of networks and convexification of learning problems, and
- (4) Presence of the ridgelet transform.
Advantages 1 and 2 have already been explained. Advantage 3 is described in the next subsection, and Advantage 4 is emphasized throughout the paper. On the other hand, there are two disadvantages:
- (1)
- (2) Advanced knowledge of functional analysis is required.
1.2 Linearization and Convexification Effect
The third advantage of the integral representation is the so-called linearization (and convexification) trick. That is, while the network is nonlinear with respect to the raw parameters $a$ and $b$, namely, $\sigma((a+a')\cdot x - b) \neq \sigma(a\cdot x - b) + \sigma(a'\cdot x - b)$ in general
(and similarly for $b$), it is linear with respect to the parameter distribution $\gamma$, namely, $S[\gamma+\gamma'] = S[\gamma] + S[\gamma']$ and $S[c\gamma] = cS[\gamma]$.
Furthermore, linearizing neural networks leads to convexifying learning problems. Specifically, for a convex function $\ell$, the loss functional defined as $L[\gamma] := \ell(S[\gamma])$ is convex in $\gamma$, since $S$ is linear.
It may sound paradoxical that a convex loss function on a function space has local minima in the raw parameters, but we can understand this through the chain rule for the functional derivative: suppose that a parameter distribution $\gamma_\theta$ is parametrized by a raw parameter, say $\theta$; then
$$\frac{\partial}{\partial\theta} L[\gamma_\theta] = \Big\langle \frac{\delta L}{\delta\gamma}[\gamma_\theta],\; \frac{\partial\gamma_\theta}{\partial\theta}\Big\rangle.$$
In other words, a local minimum ($\partial_\theta L[\gamma_\theta]=0$) in the raw parameter $\theta$ can arise not only from the global optimum ($\delta L/\delta\gamma = 0$) but also from the case when the two derivatives $\delta L/\delta\gamma$ and $\partial\gamma_\theta/\partial\theta$ are orthogonal.
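The following toy computation (our own two-dimensional stand-in for a parameter distribution, not an actual network) illustrates the mechanism: the loss is convex in $\gamma$, yet a nonlinear parametrization $\theta\mapsto\gamma_\theta$ has a stationary point (here a maximum, for simplicity) where $\delta L/\delta\gamma$ is nonzero but orthogonal to $\partial\gamma_\theta/\partial\theta$.

```python
import numpy as np

# L is convex in gamma with a unique minimum at gamma* = (1, 0).
target = np.array([1.0, 0.0])
L = lambda g: np.sum((g - target) ** 2)
grad_L = lambda g: 2.0 * (g - target)                 # dL/dgamma

# A nonlinear parametrization of gamma by a raw parameter theta.
gamma = lambda th: np.array([np.cos(th), np.sin(th)])
dgamma = lambda th: np.array([-np.sin(th), np.cos(th)])   # dgamma/dtheta

th = np.pi                                            # candidate raw parameter
print(np.dot(grad_L(gamma(th)), dgamma(th)))          # 0.0: stationary in theta ...
print(grad_L(gamma(th)))                              # ... although dL/dgamma = (-4, 0) != 0
```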
The trick of lifting nonlinear objects into a linear space has been studied since the age of Frobenius, one of the founders of the linear representation theory of groups. In the context of neural network study, as in the recent studies mentioned above, either the integral representation by Barron (1993) or the convex neural network by Bengio et al. (2006) is often referred to. In the context of deep learning theory, this linearization/convexification trick has been employed to show the global convergence of SGD training of shallow ReLU networks (Nitanda and Suzuki, 2017; Chizat and Bach, 2018; Mei et al., 2018; Rotskoff and Vanden-Eijnden, 2018; Sirignano and Spiliopoulos, 2020; Suzuki, 2020; Nitanda et al., 2022), and to characterize parameters in ReLU networks (Savarese et al., 2019; Ongie et al., 2020; Parhi and Nowak, 2021; Unser, 2019).
1.3 Ridgelet Transform
The fourth advantage of the integral representation is the so-called ridgelet transform $R$, a right-inverse operator of the integral representation operator $S$. For example, the ridgelet transform for the depth-2 fully-connected network (1) is given as below.
Definition 1.2.
For any measurable functions $f:\mathbb{R}^m\to\mathbb{C}$ and $\rho:\mathbb{R}\to\mathbb{C}$, put
$$R[f;\rho](a,b) := \int_{\mathbb{R}^m} f(x)\,\overline{\rho(a\cdot x - b)}\,\mathrm{d}x, \qquad (a,b)\in\mathbb{R}^m\times\mathbb{R}. \tag{2}$$
In principle, the ridgelet function $\rho$ can be chosen independently of the activation function $\sigma$ of the neural network. The following theorem holds.
Theorem 1.1 (Reconstruction Formula).
Suppose $\sigma$ and $\rho$ are a tempered distribution ($\mathcal{S}'$) on $\mathbb{R}$ and a rapidly decreasing function ($\mathcal{S}$) on $\mathbb{R}$, respectively. Then, for any square-integrable function $f\in L^2(\mathbb{R}^m)$, the following reconstruction formula
$$S[R[f;\rho]] = (\!(\sigma,\rho)\!)\, f$$
holds with the factor $(\!(\sigma,\rho)\!)$ being a scalar product of $\sigma$ and $\rho$,
$$(\!(\sigma,\rho)\!) := (2\pi)^{m-1}\int_{\mathbb{R}} \widehat{\sigma}(\omega)\,\overline{\widehat{\rho}(\omega)}\,\frac{\mathrm{d}\omega}{|\omega|^{m}},$$
where $\widehat{\,\cdot\,}$ denotes the Fourier transform.
From the perspective of neural network theory, the reconstruction formula claims a detailed, constructive version of the universal approximation theorem. That is, given any target function $f$, as long as $(\!(\sigma,\rho)\!)\neq 0$, the network $S[R[f;\rho]]$ with coefficient $R[f;\rho]$ reproduces the original function (up to the constant factor), and the coefficient is given explicitly.
From the perspective of functional analysis, on the other hand, the reconstruction formula states that $R$ and $S$ are analysis and synthesis operators, and thus play the same roles as, for instance, the Fourier and inverse Fourier transforms, respectively, in the sense that the reconstruction formula $S[R[f;\rho]] = (\!(\sigma,\rho)\!) f$ corresponds to the Fourier inversion formula $\mathcal{F}^{-1}[\mathcal{F}[f]]=f$.
Despite the common belief that neural network parameters are a black box, the closed-form expression (2) of the ridgelet transform clearly describes how the network parameters are distributed, which is a clear advantage of the integral representation theory (see e.g. Sonoda et al., 2021a). Moreover, the integral representation theory can deal with a wide range of activation functions without approximation, not only ReLU but all tempered distributions (see e.g. Sonoda and Murata, 2017).
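As a sanity check, the reconstruction formula can be verified numerically. The sketch below is our own toy example in dimension $m=1$ with $\sigma=\rho$ a Gaussian derivative, for which the factor evaluates to $(\!(\sigma,\rho)\!)=2\pi$ under the Fourier convention $\widehat{g}(\omega)=\int g(t)e^{-i\omega t}\,\mathrm{d}t$ (this constant is our own computation, not quoted from the paper); the agreement is only approximate because the integrals are truncated and discretized.

```python
import numpy as np

rho = sigma = lambda t: t * np.exp(-t**2 / 2)    # admissible pair: Gaussian derivative
f = lambda x: np.exp(-x**2)                      # target function on R (m = 1)
K = 2 * np.pi                                    # ((sigma, rho)) for this pair (hand computation)

x = np.linspace(-6, 6, 481)                      # integration grid for R[f]
a = np.linspace(-15, 15, 601)                    # weight grid
b = np.linspace(-25, 25, 1001)                   # bias grid
dx, da, db = x[1] - x[0], a[1] - a[0], b[1] - b[0]

# Ridgelet transform  R[f](a, b) = ∫ f(x) rho(a x - b) dx   (rho is real here)
R = np.array([np.sum(f(x)[:, None] * rho(ai * x[:, None] - b[None, :]), axis=0) * dx
              for ai in a])                      # shape (len(a), len(b))

# Reconstruction  S[R[f]](x0) = ∫∫ R(a, b) sigma(a x0 - b) da db  ≈ K f(x0)
x0 = np.array([-1.0, -0.3, 0.0, 0.6, 1.0])
S = np.array([np.sum(R * sigma(xi * a[:, None] - b[None, :])) * da * db for xi in x0])

print(np.abs(S / K - f(x0)).max())               # small (roughly percent level), limited by the grids
```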
The ridgelet transform was discovered in the late 1990s independently by Murata (1996) and Candès (1998). The term “ridgelet” was coined by Candès, based on the facts that the graph of a ridge function is ridge-shaped and that the integral transform can be regarded as a multidimensional counterpart of the wavelet transform.
In fact, the ridgelet transform can be decomposed into the composition of a wavelet transform after the Radon transform. Here, the Radon transform integrates $f$ over the hyperplanes normal to a direction $u\in\mathbb{S}^{m-1}$ (the unit sphere in $\mathbb{R}^m$), the integration being taken with respect to the Hausdorff measure on each hyperplane, i.e. the Lebesgue measure on the orthocomplement of $u$, and the parameter $a$ is represented in polar coordinates $a=ru$ allowing the double covering $(r,u)\sim(-r,-u)$ (see Sonoda and Murata, 2017, for the proof). Therefore, several authors have remarked that ridgelet analysis is wavelet analysis in the Radon domain (Donoho, 2002; Kostadinova et al., 2014; Starck et al., 2010).
In the context of deep learning theory, Savarese et al. (2019); Ongie et al. (2020); Parhi and Nowak (2021) and Unser (2019) investigate the ridgelet transform for the specific case of fully-connected ReLU layers to establish the representer theorem. Sonoda et al. (2021a) have shown that the parameter distribution of a finite model trained by regularized empirical risk minimization (RERM) converges to the ridgelet spectrum in an over-parametrized regime, meaning that we can understand the parameters at local minima to be a finite approximation of the ridgelet transform. In other words, analyzing neural network parameters can be turned to analyzing the ridgelet transform.
1.4 Scope and Contribution of This Study
On the other hand, one of the major shortcomings of ridgelet analysis is that the closed-form expression is known only for a relatively small class of networks. Indeed, until Sonoda et al. (2022b, a), it was known only for the depth-2 fully-connected layer. In the age of deep learning, a variety of layers have become popular, such as the convolution and pooling layers (Fukushima, 1980; LeCun et al., 1998; Ranzato et al., 2007; Krizhevsky et al., 2012). Furthermore, fully-connected layers on manifolds have also been developed, such as the hyperbolic networks (Ganea et al., 2018; Shimizu et al., 2021). Since the conventional ridgelet transform was discovered heuristically in the 1990s, and its derivation heavily depends on the specific structure of the affine map $x\mapsto a\cdot x - b$, the ridgelet transforms for these modern architectures had been unknown for a long time.
In this study, we explain a systematic method to find ridgelet transforms via the Fourier expression of neural networks, and obtain new ridgelet transforms in a unified manner. The Fourier expression of $S[\gamma]$ is essentially a change of frame from neurons $x\mapsto\sigma(a\cdot x-b)$ to plane waves (or harmonic oscillators) $x\mapsto e^{i\xi\cdot x}$. Since the Fourier transform has been extensively developed on a variety of domains, once a network is translated into a Fourier expression, we can systematically find a particular coefficient $\gamma_f$ satisfying $S[\gamma_f]=f$ via the Fourier inversion formula. In fact, the traditional ridgelet transform is re-discovered in this manner. Associated with the change of frame in $S[\gamma]$, the ridgelet transform is also given a Fourier expression; this form is known as the Fourier slice theorem for the ridgelet transform (see e.g. Kostadinova et al., 2014). Hence, we call the proposed method the Fourier slice method.
Besides the classical networks, we deal with four types of networks:
- (1) Networks on finite fields $\mathbb{F}_p$ in § 3,
- (2) Group convolution networks on Hilbert spaces in § 4,
- (3) Fully-connected networks on noncompact symmetric spaces in § 5, and
- (4) Pooling layers (also known as the $k$-plane ridgelet transform) in § 6.
The first three cases have already been published, and thus we only showcase them, while the last case (the pooling layer and the $k$-plane ridgelet) involves new results.
For all the cases, the reconstruction formula is understood as a constructive proof of the universal approximation theorem for the corresponding networks. The group convolution layer case widely extends the ordinary convolution layer with periodic boundary, which is also a main subject of so-called geometric deep learning (Bronstein et al., 2021). The case of fully-connected layers on symmetric spaces widely extends the recently emerging concept of hyperbolic networks (Ganea et al., 2018; Gulcehre et al., 2019; Shimizu et al., 2021), which can be cast as another branch of geometric deep learning. The pooling layer case includes the original fully-connected layer and the pooling layer; the corresponding ridgelet transforms include previously developed formulas such as the Radon transform formula by Savarese et al. (2019), and are related to the “$k$-plane ridgelet transforms” previously developed by Rubin (2004) and Donoho (2001).
1.5 General Notations
For any integer $d\ge 1$, $\mathcal{S}(\mathbb{R}^d)$ and $\mathcal{S}'(\mathbb{R}^d)$ denote the classes of Schwartz test functions (or rapidly decreasing functions) and tempered distributions on $\mathbb{R}^d$, respectively. Namely, $\mathcal{S}'$ is the topological dual of $\mathcal{S}$. We note that $\mathcal{S}'(\mathbb{R})$ includes the truncated power functions $t\mapsto t_+^k$, such as the step function ($k=0$) and ReLU ($k=1$).
Fourier Transform
To avoid potential confusion, we use two symbols and for the Fourier transforms in and , respectively. For example,
Furthermore, with a slight abuse of notation, when $\sigma$ is a tempered distribution (i.e., $\sigma\in\mathcal{S}'$), then $\widehat{\sigma}$ is understood as the Fourier transform of distributions. Namely, $\widehat{\sigma}$ is another tempered distribution satisfying $\langle\widehat{\sigma},\phi\rangle = \langle\sigma,\widehat{\phi}\rangle$ for any test function $\phi\in\mathcal{S}$. We refer to Grafakos (2008) for more details on the Fourier transform of distributions.
2 Method
We explain the basic steps to find a parameter distribution $\gamma_f$ satisfying $S[\gamma_f]=f$. The procedure is threefold: (Step 1) turn the network into a Fourier expression, (Step 2) change the variables inside the feature map into principal and auxiliary variables, and (Step 3) put the unknown $\gamma$ in a separation-of-variables form to find a particular solution. In the following, we carry out these steps for the classical setting, i.e., the fully-connected layer, for explanatory purposes. The “catch” of this procedure, however, is that it is applicable to a wide range of networks, as we will see in the subsequent sections.
2.1 Basic Steps to Solve $S[\gamma]=f$
The following procedure is valid, for example, for appropriate combinations of the classes of $\sigma$, $\rho$ and $f$. See Kostadinova et al. (2014) and Sonoda and Murata (2017) for more details on the valid combinations of function classes.
Step 1.
Using the convolution in , we can turn the network into the Fourier expression as below.
Here, denotes the convolution in ; and the third equation follows from an identity (Fourier inversion formula) with and .
Step 2.
Change variables with so that feature map splits into a product of a principal factor (in and ) and an auxiliary factor (in ), namely
Now, we can see that the integration inside brackets becomes the Fourier inversion with respect to and .
Step 3.
Because of the Fourier inversion, it is natural to assume that the unknown function $\gamma$ has a separation-of-variables form as
(3)
using an arbitrary function $\rho$. Namely, it is composed of a principal factor containing the target function $f$, and an auxiliary factor set only for the convergence of the integration. Then, we have
where we put
In other words, the separation-of-variables expression is a particular solution to the integral equation with factor .
Furthermore, is the ridgelet transform because it is rewritten as
and thus calculated as
which is exactly the definition of the ridgelet transform .
In conclusion, the separation-of-variables expression (3) is the way to naturally find the ridgelet transform. We note that Steps 1 and 2 can be understood as a change of frame from the frame spanned by neurons $\sigma(a\cdot x-b)$, with which we are less familiar, to the frame spanned by (the tensor product of an auxiliary function and) plane waves $e^{i\xi\cdot x}$, with which we are much more familiar. Hence, in particular, the change of variables can be understood as the associated coordinate transformation.
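For concreteness, the three steps can be written out in the classical fully-connected case as follows. This is a compact sketch in the notation of (1) and (2), under the Fourier convention $\widehat{g}(\omega)=\int g(t)e^{-i\omega t}\,\mathrm{d}t$ and with $\widehat{\gamma}_b(a,\omega)$ denoting the Fourier transform of $\gamma(a,b)$ in $b$; constants may differ from other references by powers of $2\pi$.

```latex
\begin{align*}
\text{(Step 1)}\quad
S[\gamma](x)
  &= \int_{\mathbb{R}^m}\big(\gamma(a,\cdot)*\sigma\big)(a\cdot x)\,\mathrm{d}a
   = \frac{1}{2\pi}\int_{\mathbb{R}^m}\!\int_{\mathbb{R}}
     \widehat{\gamma}_b(a,\omega)\,\widehat{\sigma}(\omega)\,
     e^{i\omega a\cdot x}\,\mathrm{d}\omega\,\mathrm{d}a\\
\text{(Step 2)}\quad
  &= \frac{1}{2\pi}\int_{\mathbb{R}}\widehat{\sigma}(\omega)
     \left[\int_{\mathbb{R}^m}
     \widehat{\gamma}_b(\xi/\omega,\omega)\,
     e^{i\xi\cdot x}\,\frac{\mathrm{d}\xi}{|\omega|^{m}}\right]\mathrm{d}\omega
     \qquad(\xi:=\omega a)\\
\text{(Step 3)}\quad
\widehat{\gamma}_b(a,\omega)
  &:= \widehat{f}(\omega a)\,\overline{\widehat{\rho}(\omega)}
  \;\Longleftrightarrow\; \gamma = R[f;\rho]
  \;\Longrightarrow\;
  S[\gamma] = \underbrace{(2\pi)^{m-1}\!\int_{\mathbb{R}}
     \widehat{\sigma}(\omega)\,\overline{\widehat{\rho}(\omega)}\,
     \frac{\mathrm{d}\omega}{|\omega|^{m}}}_{(\!(\sigma,\rho)\!)}\, f.
\end{align*}
```

Here the inner bracket in Step 2 becomes the Fourier inversion of $f$ as soon as $\widehat{\gamma}_b(\xi/\omega,\omega)$ is proportional to $\widehat{f}(\xi)$, which is exactly what the separation-of-variables ansatz (3) arranges; the identity $\widehat{R[f;\rho]}_b(a,\omega)=\widehat{f}(\omega a)\,\overline{\widehat{\rho}(\omega)}$ shows that this ansatz is precisely the ridgelet transform (2).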
3 Case I: NN on Finite Field
As one of the simplest applications, we showcase the results by Yamasaki et al. (2023), a neural network on the finite field $\mathbb{F}_p$ (with a prime number $p$). That study aimed to design a quantum algorithm that efficiently computes the ridgelet transform, and the authors developed this example based on the demand to represent all data and parameters in finitely many qubits. To be precise, the authors dealt with functions on the discrete space $\mathbb{F}_p^m$, as a discretization of functions on the continuous space $\mathbb{R}^m$.
3.1 Fourier Transform
For any positive integers $p$ and $m$, let $\mathbb{Z}_p^m$ denote the product of cyclic groups $\mathbb{Z}_p := \mathbb{Z}/p\mathbb{Z}$. We note that the set of all real functions on $\mathbb{Z}_p^m$ is identified with the $p^m$-dimensional real vector space, i.e. $\mathbb{R}^{p^m}$, because each value $f(x)$ of a function $f$ at $x\in\mathbb{Z}_p^m$ can be identified with the $x$-th component of a vector. In particular, $L^2(\mathbb{Z}_p^m)\cong\mathbb{R}^{p^m}$.
We use the Fourier transform on a product of cyclic groups as below.
Definition 3.1 (Fourier Transform on $\mathbb{Z}_p^m$).
For any $f:\mathbb{Z}_p^m\to\mathbb{C}$, put
$$\widehat{f}(\xi) := \sum_{x\in\mathbb{Z}_p^m} f(x)\, e^{-2\pi i\,\xi\cdot x/p}, \qquad \xi\in\mathbb{Z}_p^m.$$
Theorem 3.1 (Inversion Formula).
For any $f:\mathbb{Z}_p^m\to\mathbb{C}$,
$$f(x) = \frac{1}{p^m}\sum_{\xi\in\mathbb{Z}_p^m} \widehat{f}(\xi)\, e^{2\pi i\,\xi\cdot x/p}.$$
The proof is immediate from the so-called orthogonality of characters, the identity $\frac{1}{p^m}\sum_{\xi\in\mathbb{Z}_p^m} e^{2\pi i\,\xi\cdot(x-y)/p} = \delta_{xy}$, where $\delta_{xy}$ is the Kronecker delta.
We note that, although the Fourier transform itself can be defined on any cyclic group $\mathbb{Z}_p$ (namely, $p$ need not be prime), a finite field $\mathbb{F}_p$ is assumed in order to perform the change-of-variables step.
3.2 Network Design
Remarkably, the $\mathbb{F}_p$-version of the argument is obtained by formally replacing every integration in the $\mathbb{R}^m$-version with a summation.
Definition 3.2 (NN on $\mathbb{F}_p^m$).
For any functions $\gamma:\mathbb{F}_p^m\times\mathbb{F}_p\to\mathbb{C}$ and $\sigma:\mathbb{F}_p\to\mathbb{C}$, put
$$S[\gamma](x) := \sum_{(a,b)\in\mathbb{F}_p^m\times\mathbb{F}_p}\gamma(a,b)\,\sigma(a\cdot x - b), \qquad x\in\mathbb{F}_p^m.$$
Again, in Yamasaki et al. (2023), this is introduced as a discretized version of a network on the continuous space $\mathbb{R}^m$.
3.3 Ridgelet Transform
Theorem 3.2 (Reconstruction Formula).
For any function $f:\mathbb{F}_p^m\to\mathbb{C}$, put
$$R[f;\rho](a,b) := \sum_{x\in\mathbb{F}_p^m} f(x)\,\overline{\rho(a\cdot x - b)},$$
the summation counterpart of (2). Then, for any $x\in\mathbb{F}_p^m$, $S[R[f;\rho]](x)$ reproduces $f(x)$ up to a constant factor depending only on $\sigma$ and $\rho$.
In other words, the fully-connected network on a finite field can exactly represent any square-integrable function on $\mathbb{F}_p^m$.
Finally, the following proof shows that a new example of neural networks can be obtained by systematically following the same three steps as in the original arguments.
Proof.
Step 1. Turn to the Fourier expression:
Step 2. Change variables
Step 3. Put separation-of-variables form
and we can verify . ∎
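The following toy script mirrors the reconstruction numerically. The normalization is our own and worked out independently of Yamasaki et al. (2023) (their constant and admissibility condition may be stated differently): assuming $\sum_t\rho(t)=0$, the reconstruction holds with the constant $p^m\sum_t\sigma(t)\rho(t)$.

```python
import numpy as np
from itertools import product

p, m = 5, 2                                       # prime p and input dimension m
rng = np.random.default_rng(0)

sigma = rng.standard_normal(p)                    # arbitrary activation on F_p
rho = rng.standard_normal(p); rho -= rho.mean()   # assumed admissibility: sum(rho) = 0

X = list(product(range(p), repeat=m))             # all points of F_p^m
f = rng.standard_normal(len(X))                   # target function on F_p^m

def dot(a, x):                                    # inner product mod p
    return sum(ai * xi for ai, xi in zip(a, x)) % p

# Discrete ridgelet transform  R[f](a, b) = sum_x f(x) rho(a.x - b)
R = {(a, b): sum(f[i] * rho[(dot(a, x) - b) % p] for i, x in enumerate(X))
     for a in X for b in range(p)}

# Network  S[R[f]](x) = sum_{a,b} R[f](a, b) sigma(a.x - b)
S = np.array([sum(R[(a, b)] * sigma[(dot(a, x) - b) % p]
                  for a in X for b in range(p)) for x in X])

const = p**m * np.dot(sigma, rho)                 # reconstruction constant (our normalization)
print(np.allclose(S, const * f))                  # True
```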
4 Case II: Group Convolutional NN on Hilbert Space
Next, we showcase the results by Sonoda et al. (2022a). Since there are various versions of convolutional neural networks (CNNs), their approximation properties (such as the universality) have been investigated individually depending on the network architecture. The method presented here defines the generalized group convolutional neural network (GCNN) that encompasses a wide range of CNNs, and provides a powerful result by unifying the expressivity analysis in a constructive and simple manner by using ridgelet transforms.
4.1 Fourier Transform
Since the input to a CNN is a signal (or a function), the Fourier transform corresponding to a naive integral representation is the Fourier transform on the space of signals, which is typically an infinite-dimensional function space. Although the Fourier transform on an infinite-dimensional Hilbert space has been well developed in probability theory, the mathematics tends to become excessively advanced for the expected results. One of the important ideas of this study is to induce the Fourier transform on a finite-dimensional subspace of the Hilbert space $\mathcal{H}$ instead of using the Fourier transform on the entire space $\mathcal{H}$. To be precise, we simply take an $n$-dimensional orthonormal frame of $\mathcal{H}$, put its linear span $\mathcal{H}_n$, and identify each element of $\mathcal{H}_n$ with a point of $\mathbb{R}^n$. Obviously, this embedding depends on the choice of the $n$-frame, yet it drastically simplifies the main theory.
Definition 4.1 (Fourier Transform on a Hilbert Space $\mathcal{H}$).
Let $\mathcal{H}$ be a Hilbert space, $\mathcal{H}_n$ be an $n$-dimensional subspace, and $\mathrm{d}x$ be the Lebesgue measure induced from the identification $\mathcal{H}_n\cong\mathbb{R}^n$. Put
Then, obviously from the construction, we have the following.
Theorem 4.1.
For any ,
4.2 Network Design
Another important idea is to deal with various group convolutions in a uniform manner by using the linear representation of groups.
Definition 4.2 (Generalized Group Convolution).
Let $G$ be a group, $\mathcal{H}$ be a Hilbert space, and $T$ be a group representation of $G$ on $\mathcal{H}$. The generalized group convolution is given by
As clarified in Sonoda et al. (2022a), the generalized convolution covers a variety of typical examples such as (1) classical group convolution, (2) discrete cyclic convolution for multi-channel digital images, (3) DeepSets, or permutation-equivariant maps, (4) continuous cyclic convolution for signals, and (5) equivariant maps for more general group actions; a toy instance of case (2) is sketched below.
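As an illustration of item (2) above, the sketch below checks that the generalized convolution, assumed here to take the form $(a\star x)(g)=\langle T_g[a],x\rangle_{\mathcal{H}}$ with $T$ the regular representation of the cyclic group (the precise convention in Sonoda et al. (2022a) may differ, e.g. by an inverse), reduces to the familiar cyclic cross-correlation used in CNNs.

```python
import numpy as np

n = 6
rng = np.random.default_rng(1)
a, x = rng.standard_normal(n), rng.standard_normal(n)   # filter and signal in H = R^n

def shift(v, g):
    """Regular representation of Z_n on R^n: (T_g v)[i] = v[(i - g) % n]."""
    return np.roll(v, g)

# Generalized convolution (assumed form): (a * x)(g) = <T_g a, x>_H
conv = np.array([np.dot(shift(a, g), x) for g in range(n)])

# Direct cyclic cross-correlation (what CNNs usually call "convolution")
direct = np.array([sum(a[(i - g) % n] * x[i] for i in range(n)) for g in range(n)])

print(np.allclose(conv, direct))   # True: the generalized form reduces to cyclic convolution
```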
Definition 4.3 (Group CNN).
Let $\mathcal{H}_n\subset\mathcal{H}$ be an $n$-dimensional subspace equipped with the induced Lebesgue measure. Put
Here, the integration runs over all possible convolution filters $a$. For the sake of simplicity, however, the domain of filters is restricted to the $n$-dimensional subspace $\mathcal{H}_n$ of the entire space $\mathcal{H}$.
4.3 Ridgelet Transform
In the following, $e$ denotes the identity element of $G$.
Definition 4.4 (-Equivariance).
A (nonlinear) map is called $G$-equivariant when
We note that the proposed network $S[\gamma]$ is inherently $G$-equivariant.
Definition 4.5 (Ridgelet Transform).
For any measurable functions and , put
It is remarkable that the product of $a$ and $x$ inside the ridgelet transform is not a convolution but a scalar product $\langle a, x\rangle_{\mathcal{H}}$. This is essentially because (1) the target $f$ will be assumed to be group equivariant, and (2) the network $S[\gamma]$ is group equivariant by definition.
Theorem 4.2 (Reconstruction Formula).
Suppose that $f$ is $G$-equivariant and square integrable; then the reconstruction formula $S[R[f]]\propto f$ holds.
In other words, a continuous GCNN can represent any square-integrable group-equivariant function. Again, the proof is performed by systematically following the three steps as below.
Proof.
Step 1. Turn to a Fourier expression:
Step 2. Change variables with :
Step 3. Put separation-of-variables form
and we can verify . ∎
4.4 Literature in Geometric Deep Learning
General convolution networks for geometric/algebraic domains have been developed for capturing the invariance/equivariance to the symmetry in a data-efficient manner (Bruna and Mallat, 2013; Cohen and Welling, 2016; Zaheer et al., 2017; Kondor and Trivedi, 2018; Cohen et al., 2019; Kumagai and Sannai, 2020). To this date, quite a variety of convolution networks have been proposed for grids, finite sets, graphs, groups, homogeneous spaces and manifolds. We refer to Bronstein et al. (2021) for a detailed survey on the so-called geometric deep learning.
Since a universal approximation theorem (UAT) is a corollary of a reconstruction formula, the three-step Fourier expression method has provided a variety of UATs for GCNN-type networks in a unified manner. Here, we remark that the UATs of individual convolution networks have already been shown (Maron et al., 2019; Zhou, 2020; Yarotsky, 2022). However, in addition to the above-mentioned advantages, the wide coverage of activation functions is another strong advantage. In particular, we do not need to rely on specific features of ReLU, nor on Taylor expansions, density arguments, or randomization assumptions, to deal with nonlinear activation functions.
5 Case III: NN on Noncompact Symmetric Space
Next, we showcase the results by Sonoda et al. (2022b). When the data are known to lie on a certain manifold, it is natural to consider developing NNs on the manifold, in order to explicitly incorporate the inductive bias. Since there is no such thing as a standard inner product or affine mapping on a general manifold, various NNs have been proposed based on geometric considerations and implementation constraints. The main idea of this study is to start from the Fourier transform on a manifold and induce an NN on the manifold together with its reconstruction formula. On compact groups and homogeneous spaces such as spheres, Fourier analysis is well known through the Peter–Weyl theorem, but in this study, the authors focused on noncompact symmetric spaces, such as hyperbolic spaces and the space of positive definite matrices, and developed NNs based on the Helgason–Fourier transform on a noncompact symmetric space.
5.1 Noncompact Symmetric Space
We refer to Helgason (1984, Introduction) and Helgason (2008, Chapter III). A noncompact symmetric space $X=G/K$ is a homogeneous space with nonpositive sectional curvature on which the group $G$ acts transitively. Two important examples are the hyperbolic space (Figure 1) and the SPD manifold. See also Appendices A and B for more details on these spaces.
[Figure 1: hyperbolic space.]
Let $G$ be a connected semisimple Lie group with finite center, and let $G=KAN$ be its Iwasawa decomposition. Namely, it is a unique diffeomorphic decomposition of $G$ into subgroups $K$, $A$ and $N$, where $K$ is maximal compact, $A$ is maximal abelian, and $N$ is maximal nilpotent. For example, when $G=GL(m,\mathbb{R})$ (general linear group), then $K=O(m)$ (orthogonal group), $A$ is the group of all positive diagonal matrices, and $N$ is the group of all upper triangular matrices with ones on the diagonal.
Let $\mathrm{d}k$, $\mathrm{d}a$ and $\mathrm{d}n$ be left-invariant measures on $K$, $A$ and $N$, respectively.
Let $\mathfrak{k}$, $\mathfrak{a}$ and $\mathfrak{n}$ be the Lie algebras of $K$, $A$ and $N$, respectively. By a fundamental property of abelian Lie algebras, $\mathfrak{a}$ and its dual $\mathfrak{a}^*$ are vector spaces of the same dimension, and thus they can be identified with $\mathbb{R}^r$ for some $r$, namely $\mathfrak{a}\cong\mathfrak{a}^*\cong\mathbb{R}^r$. We call $r$ the rank. For example, when $G=GL(m,\mathbb{R})$, then the Lie algebra of $G$ is the set of all real $m\times m$ matrices, $\mathfrak{k}$ is the set of all skew-symmetric matrices, $\mathfrak{a}$ is the set of all diagonal matrices, and $\mathfrak{n}$ is the set of all strictly upper triangular matrices.
Definition 5.1.
Let $X := G/K$ be a noncompact symmetric space, namely, a Riemannian manifold composed of all the left cosets $gK$ ($g\in G$).
Using the identity element $e$ of $G$, let $o := eK$ be the origin of $X$. By the construction of $X$, the group $G$ acts transitively on $X$; let $g\cdot x$ (for $g\in G$ and $x\in X$) denote the $G$-action of $g$ on $x$. Specifically, any point $x\in X$ can always be written as $x=g\cdot o$ for some $g\in G$. Let $\mathrm{d}x$ denote the left $G$-invariant measure on $X$.
Example 5.1 (Hyperbolic Space ).
It is used for embedding words and tree-structured datasets.
Example 5.2 (SPD Manifold ).
It is a manifold of positive definite matrices, such as covariance matrices.
5.2 Boundary, Horospheres, and Vector-Valued Composite Distance
We further introduce three geometric objects in the noncompact symmetric space $X$. In comparison to the Euclidean space, the boundary corresponds to “the set of all points at infinity”, a horosphere through a point with a given normal corresponds to a straight line through that point with that normal, and the vector-valued distance between the origin and a horosphere corresponds to the distance between the origin and the straight line.
Definition 5.2.
Let $M$ be the centralizer of $A$ in $K$, and let $\partial X := K/M$
be the boundary (or ideal sphere) of $X$, which is known to be a compact manifold. Let $\mathrm{d}b$ denote the uniform probability measure on $\partial X$.
For example, when $G=GL(m,\mathbb{R})$ and $K=O(m)$, then $M$ is the subgroup of $K$ consisting of diagonal matrices with entries $\pm 1$.
Definition 5.3.
Let
be the space of horospheres.
Here, basic horospheres are: An -orbit , which is a horosphere passing through the origin with normal ; and , which is a horosphere through point with normal . In fact, any horosphere can be represented as since for any . We refer to Helgason (2008, Ch.I, § 1) and Bartolucci et al. (2021, § 3.5) for more details on the horospheres and boundaries.
Definition 5.4.
As a consequence of the Iwasawa decomposition, for any there uniquely exists an -dimensional vector satisfying . For any , put
which is understood as the -dimensional vector-valued distance, called the composite distance, from the origin to the horosphere through point with normal .
5.3 Fourier Transform
Based on the preparations so far, we introduce the Fourier transform on , known as the Helgason–Fourier transform. Let be the Weyl group of , and let denote its order. Let be the Harish-Chandra -function for . We refer to Helgason (1984, Theorem 6.14, Ch. IV) for the closed-form expression of the -function.
Definition 5.5 (Helgason–Fourier Transform).
For any measurable function on , put
with a certain constant vector . Here, the exponent is understood as the action of functional on a vector .
This is understood as a “Fourier transform” because the integral kernel is an eigenfunction of the Laplace–Beltrami operator on $X$.
Theorem 5.1 (Inversion Formula).
For any square integrable function ,
We refer to Helgason (2008, Theorems 1.3 and 1.5, Ch. III) for more details on the inversion formula.
5.4 Network Design
In accordance with the geometric perspective, it is natural to define the network as below.
Definition 5.6 (NN on Noncompact Symmetric Space ).
For any measurable functions and , put
Remarkably, the scalar product (or its polar-coordinate expression) in the Euclidean setting is replaced with a (vector-valued) distance function in the manifold setting. In Sonoda et al. (2022b), the authors instantiate two important examples as below.
Example 5.3 (Continuous Horospherical Hyperbolic NN).
On the Poincaré ball model equipped with the hyperbolic Riemannian metric (see Appendix A).
Example 5.4 (Continuous Horospherical SPD Net).
On the SPD manifold (see Appendix B), where the composite distance is computed from the diagonal of a Cholesky decomposition.
5.5 Ridgelet Transform
Definition 5.7 (Ridgelet Transform).
For any measurable functions and , put
Here is defined as a multiplier satisfying .
Theorem 5.2 (Reconstruction Formula).
Let . Then, for any square integrable function on , we have
In other words, the fully-connected network on noncompact symmetric space can represent any square-integrable function. Again, the proof is performed by systematically following the three steps as below.
Proof.
We identify the scale parameter with vector .
Step 1. Turn to a Fourier expression:
Step 2. Change variables with :
Step 3. Put separation-of-variables form
and we can verify . ∎
5.6 Literature in Hyperbolic Neural Networks
The hyperbolic neural network (HNN) (Ganea et al., 2018; Gulcehre et al., 2019; Shimizu et al., 2021) is another emerging direction of geometric deep learning, inspired by the empirical observation that datasets with tree or hierarchical structure can be efficiently embedded into hyperbolic spaces (Krioukov et al., 2010; Nickel and Kiela, 2017, 2018; Sala et al., 2018). We note that designing an FC layer on a manifold is nontrivial, because neither the scalar product, the bias subtraction, nor the elementwise activation is trivially defined on a manifold in general, and thus we have to face these primitive issues.
The design concept of the original HNN (Ganea et al., 2018) is to reconstruct basic operations of ordinary neural networks, such as linear maps, bias translations, pointwise nonlinearities and softmax layers, by using gyrovector operations in a tractable and geometric manner. For example, in HNN++ by Shimizu et al. (2021), the Poincaré multinomial logistic regression (MLR) layer and the fully-connected (FC) layer, corresponding to affine layers without activation, are designed as nonlinear maps whose output is characterized by its distance to a Poincaré hyperplane. Here, a Poincaré hyperplane is defined as the collection of all geodesics through a given point and normal to a given direction. Furthermore, they are designed so that the discriminative hyperplane coincides with a Poincaré hyperplane. The nonlinear activation function is cast as a map on the Poincaré ball via lifting. However, in practice, the activation can be omitted since the FC layer is inherently nonlinear.
[Figure 2: signed distances from the origin to a plane (Euclidean case) and to a horocycle (hyperbolic case).]
In this study, on the other hand, we take the input domain to be a noncompact symmetric space $X=G/K$, which is a generalized version of the hyperbolic space. Following the philosophy of the Helgason–Fourier transform, we regard the scalar product of a unit vector and a point in the Euclidean case as the signed distance between the origin and the plane through that point with that normal. Then, we recast it as the vector-valued distance $\langle x,b\rangle$ between the origin and the horosphere through the point $x$ with normal $b\in\partial X$. As a result, we can naturally define bias subtraction and elementwise activation because the signed distance is identified with a vector.
More geometrically, $u\cdot x - t$ in the Euclidean case is understood as the distance between the point $x$ and the plane $\{y: u\cdot y=t\}$ (see Figure 2). Similarly, $\langle x,b\rangle - t$ is understood as the distance between the point $x$ and the horocycle $\{y : \langle y,b\rangle=t\}$. Hence, as a general principle, we may formulate a versatile template of affine layers on a manifold as
(4)
For example, in the original HNN, the Poincaré hyperplane is employed as the geometric object. If we have nice coordinates satisfying an analogous relation, then we can turn the layer into a Fourier expression and hopefully obtain the ridgelet transform.
The strengths of our results are summarized as follows: we obtained the ridgelet transform in a unified and geometric manner, i.e., independently of the coordinates, for a wide class of input domains; in particular, this is the first result to define a neural network and obtain its ridgelet transform on a noncompact symmetric space.
6 Case IV: Pooling Layer and -plane Ridgelet Transform
Finally, we present several new results. Technically, we consider networks with multivariate (vector-to-scalar) activation functions. In all the sections up to this point, we have considered univariate activation functions. In the context of neural networks, a multivariate activation is understood as a mathematical model of pooling layers, such as
$$\sigma(z)=\frac{1}{k}\sum_{i=1}^{k}z_i \quad\text{(average pooling)},\qquad \sigma(z)=\max_{1\le i\le k} z_i \quad\text{(max pooling)},\qquad \sigma(z)=\|z\|_{p} \quad\text{($\ell^{p}$-norm)}$$
for $z\in\mathbb{R}^k$ (a minimal code sketch follows the display).
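A minimal sketch of a single neuron with a multivariate (pooling) activation; the sizes, the weight-matrix shape, and the choice $p=2$ for the norm are illustrative and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
m, k = 8, 3                       # input dimension and pooled dimension (illustrative)
x = rng.standard_normal(m)
A = rng.standard_normal((k, m))   # weight matrix of a single multivariate neuron
b = rng.standard_normal(k)        # bias vector
z = A @ x - b                     # pre-activation in R^k

print(z.mean())                   # average pooling
print(z.max())                    # max pooling
print(np.linalg.norm(z, 2))       # l^p-norm pooling (here p = 2)
```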
Meanwhile, in the context of sparse signal processing in the 2000s, such as Donoho (2001) and Rubin (2004) (see also § 7.2), it can also be understood as the ridgelet transform corresponding to the so-called $k$-plane transform (see also § 6.1).
As mentioned in § 1.3, ridgelet transforms have profound relations to the Radon and wavelet transforms. In the language of probability theory, a Radon transform is understood as a marginalization, and the traditional problem of Johann Radon is the inverse problem of reconstructing the original joint distribution from several marginal distributions. Hence, depending on the choice of variables to be marginalized, there are countless different Radon transforms. In other words, the Radon transform can also be a rich source for finding a variety of novel networks and ridgelet transforms. In this section, we derive the ridgelet transform corresponding to the $k$-plane transform. (Nonetheless, the proofs are carried out by the three-step Fourier expression method.)
Additional Notations
Let be positive integers satisfying ; let be a set of all full-column-rank (i.e., injective) matrices equipped with the Lebesgue measure ; let be the Stiefel manifold of orthonormal -frames in equipped with invariant measure ; let be the orthogonal group in equipped with invariant measure . In addition, let be a similitude group equipped with the product measure . For a rectangular matrix , we write for short. In the following, we use and for the Fourier transforms in and , respectively. For any , let denote the fractional Laplacian defined as a Fourier multiplier: .
6.1 -plane transform
The $k$-plane transform is a Radon transform that marginalizes a function over $k$-dimensional affine subspaces ($k$-planes) of the ambient Euclidean space. In the special cases of hyperplanes and straight lines, it is called the (strict) Radon transform and the X-ray transform, respectively. The ridgelet transforms introduced in this section correspond to the $k$-plane Radon transform, and the classical ridgelet transform corresponds to the strict Radon transform. We refer to Chapter 1 of Helgason (2010).
Definition 6.1 (-plane).
A $k$-plane is a $k$-dimensional affine subspace of the ambient Euclidean space. Here, affine emphasizes that it need not pass through the origin. The collection of all $k$-planes is called the affine Grassmannian manifold.
A $k$-plane is parametrized by its orthonormal directions and a coordinate vector from the origin as below,
where the directions form an orthonormal $k$-frame. The first term is the displacement vector from the origin, whose norm is the distance between the origin and the $k$-plane, and the second term spans the $k$-dimensional linear subspace parallel to the $k$-plane.
Recall that for each choice of directions, the whole space can be decomposed into a disjoint union of parallel $k$-planes. From this perspective, the $k$-plane transform of $f$ at a $k$-plane is defined as the marginalization of $f$ over that plane.
Definition 6.2 (-plane Transform).
For any integrable function $f$ and $k$-plane $\xi$, put
the integral of $f$ over $\xi$ with respect to the $k$-dimensional Hausdorff measure on $\xi$.
In particular, the strict Radon transform corresponds to hyperplanes (codimension one), and the X-ray transform corresponds to $k=1$.
The -plane transform has the following Fourier expression.
Lemma 6.1 (Fourier Slice Theorem for -plane Transform).
For any ,
where and denote the Fourier transforms in and respectively. In other words,
Using the Fourier slice theorem, we can invert the -plane transform.
Lemma 6.2 (Inversion Formula for -plane Radon Transform).
For any ,
Proof.
By the Fourier slice theorem,
Here, we change variable and use the matrix polar integration formula Lemma C.1. ∎
Remark 1 (Relations to Marginalization of Probability Distributions).
In short, a $k$-plane is a subset of the ambient space; it can also be identified with a single variable, and the $k$-plane (as a variable) is marginalized.
Let us consider a two-variable (or bivariate) case. The marginalization of a probability distribution $p(x,y)$ onto its first (resp. second) variable refers to the integral transform $p_1(x)=\int p(x,y)\,\mathrm{d}y$ (resp. $p_2(y)=\int p(x,y)\,\mathrm{d}x$).
On the other hand, the $k$-plane transform of an integrable function on the plane with $k=1$ (which reduces to the classical Radon transform) is given by integrating the function along a straight line,
where the line is specified by a unit normal vector $u$ and an orthogonal unit vector $u^{\perp}$ (there always exist two such $u^{\perp}$ for each $u$); each pair of a direction and an offset indicates a $1$-plane.
In particular, by fixing an orthonormal basis and identifying the bivariate function with a function on $\mathbb{R}^2$, the marginalization of probability distributions is identified with special cases of the Radon transform (with the lines parallel to the coordinate axes).
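The marginalization viewpoint is easy to check numerically. The toy script below (our own example, not from the paper) computes the classical Radon transform of a standard 2D Gaussian density as a marginalization along lines and compares it with the known 1D Gaussian marginal.

```python
import numpy as np

# Radon transform as marginalization:  R[f](u, t) = ∫ f(t*u + s*u_perp) ds,
# i.e. f is integrated over the line {x : u.x = t}.
f = lambda x, y: np.exp(-(x**2 + y**2) / 2) / (2 * np.pi)   # standard 2D Gaussian density

theta = 0.7
u = np.array([np.cos(theta), np.sin(theta)])        # unit normal of the line
u_perp = np.array([-u[1], u[0]])                    # direction along the line

s = np.linspace(-8, 8, 2001)                        # coordinate along the line
ds = s[1] - s[0]
ts = np.linspace(-3, 3, 13)                         # signed offsets of the lines

radon = np.array([np.sum(f(*(t * u[:, None] + s * u_perp[:, None]))) * ds for t in ts])
expected = np.exp(-ts**2 / 2) / np.sqrt(2 * np.pi)  # the 1D Gaussian marginal

print(np.allclose(radon, expected))                 # True
```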
6.2 Network Design
We define the $k$-plane (or $k$-affine) layer. Here, the co-dimension of the $k$-plane is constrained by the ambient dimension as in the notation above. In addition to the full-column-rank matrix cases, we consider the degenerate cases, which correspond to several previous studies.
Definition 6.3.
Let $\sigma$ be a measurable function, and let the parameter domain be either of the matrix classes introduced above. For any parameter distribution $\gamma$, the continuous neural network with a $k$-plane (or $k$-affine) layer is given by
Since the null space of each neuron's linear part is $k$-dimensional, each $k$-plane neuron has $k$ dimensions of constant directions. Therefore, $k$-plane networks are able to capture $k$-dimensional singularities in a target function $f$.
6.3 Ridgelet Transforms and Reconstruction Formulas
We present three variants of solutions for $k$-plane networks. We note that typical pooling layers such as average pooling, max pooling, and the $\ell^p$-norm are contained in the class of tempered distributions. The first and second theorems present dense and sparse solutions of the parameters for the same class of activation functions. Since the restricted parameter domain is a measure-zero subset of the full matrix space, the second solution is much sparser than the first. The third theorem presents the sparsest solution by restricting the class of activation functions; it is supposed to capture the characteristic solutions for modern activation functions such as ReLU.
In the following, .
Theorem 6.1.
Let . Put
where is defined as with being the singular values of . Then, for almost every , we have
Theorem 6.2.
Let be a real number; let . Put
Then, for almost every , we have
Theorem 6.3.
For any real number , suppose that satisfy (i.e., is the Green function of ). Let . Put
where denotes the fractional Laplacian in , and is the -plane transform. Then, for almost every , we have
As consequences, these reconstruction formulas are understood as constructive universality theorems for $k$-plane networks. We note (1) that, as far as we are aware, the first result was not previously known, (2) that the second result extends the “$k$-plane ridgelet transform” by Donoho (2001) and Rubin (2004) (see § 6.4), and (3) that the third result extends the Radon formulas (Theorem 7.2) by Carroll and Dickinson (1989) and Ito (1991), as well as recent results on ReLU networks such as Savarese et al. (2019), Ongie et al. (2020) and Parhi and Nowak (2021), as special cases.
The proof is performed by systematically following the three steps as below.
Proof.
We present the first case. See Appendix D for full proofs.
Step 1. Turn to the Fourier expression:
Step 2. Use singular value decomposition (SVD)
with to have
Change variables ( fixed) and ( fixed)
Step 3. Put a separation-of-variables form (note: )
Then, turns out to be a particular solution because
Finally, the matrix ridgelet transform can be calculated as below
∎
6.4 Literature in -plane ridgelet transform
In the past, two versions of the $k$-plane ridgelet transform have been proposed. One is a tight frame (i.e., a discrete transform) by Donoho (2001), and the other is a continuous transform by Rubin (2004). The $k$-plane ridgelet by Donoho can be regarded as the discrete version of the $k$-plane ridgelet transform by Rubin.
Theorem 6.4 (Continuous $k$-plane Ridgelet Transform by Rubin (2004)).
where the distance appearing above is the Euclidean distance between the point and the $k$-plane.
Recall that an affine $k$-plane is parametrized by an orthonormal $k$-frame and a coordinate vector. Because of this, the quantity appearing in the transform is understood as the Euclidean vector-valued distance between a point and the $k$-plane. Therefore, the $k$-plane ridgelet transform by Rubin is understood as a special case of the transform in Theorem 6.2 where both functions are radial. We remark that a more redundant parametrization as in Theorem 6.1 is natural for the purpose of neural network study, simply because neural network parameters are not strictly restricted to orthonormal frames during training.
7 Literature Overview
7.1 Ridgelet Transform in the 1990s
One of the major problems in neural network studies in the 1990s was to investigate the expressive power of (fully-connected) shallow neural networks, and the original ridgelet transform was discovered in this context independently by Murata (1996) and Candès (1998). Later, in the 2010s, the classes of $\sigma$ and $\rho$ were extended to distributions by Kostadinova et al. (2014) and Sonoda and Murata (2017) so as to include modern activation functions such as ReLU.
The idea of using integral transforms for function approximation is fundamental and has a long history in approximation theory (see, e.g., DeVore and Lorentz, 1993). In the literature of neural network studies, the integral representation by Barron (1993) is one of the representative works, where the so-called Barron class and the Maurey–Jones–Barron (MJB) approximation error upper bound were established; these play an important role both in the approximation and estimation theories of neural networks. We refer to Kainen et al. (2013) for more details on the MJB theory.
One obvious strength of the ridgelet transform is the closed-form expression. Before the ridgelet transform, two pioneering results were proposed. One is the Fourier formula by Irie and Miyake (1988) and Funahashi (1989):
Theorem 7.1.
For any and ,
The other is the Radon formula by Carroll and Dickinson (1989) and Ito (1991):
Theorem 7.2.
For (step function) and any ,
where denotes the Radon transform.
7.2 Ridgelet Transform in the 2000s
In the context of sparse signal processing, the emergence of the ridgelet transform motivated another direction of research: exploring a high-dimensional counterpart of the one-dimensional wavelet transform. Indeed, the wavelet transform for multidimensional signals such as images and videos does exist, and it is given as below,
where the kernel is built from a wavelet function. However, it is considered unsatisfactory in its localization ability, because it is essentially a tensor product of one-dimensional wavelet transforms.
More precisely, while the one-dimensional wavelet transform is good at localizing point singularities such as jumps and kinks in one-dimensional signals such as audio recordings, the two-dimensional wavelet transform is not good at localizing line singularities in two-dimensional signals such as pictures, except when the singularity is straight and parallel to one of the coordinate axes. Here, the notion of the dimension of a singularity is the terminology of Donoho (2001). For example, one can construct a square-integrable function that is singular along a hyperplane.
On the other hand, the ridgelet transform is good at localizing such hyperplane-shaped singularities in any direction, because the feature map $x\mapsto\sigma(a\cdot x-b)$ is a constant function along the hyperplanes normal to $a$. Similarly, the $k$-plane (or $k$-affine) ridgelet transform presented in this study is good at localizing $k$-dimensional singularities, because each feature map is a constant function along a $k$-dimensional subspace.
In search of better localization properties, a variety of “X-lets” have been developed, such as curvelets, beamlets, contourlets, and shearlets, under the slogan of geometric multiscale analysis (GMA) (see e.g. Donoho, 2002; Starck et al., 2010). Since ridgelet analysis had already been recognized as wavelet analysis in the Radon domain, a variety of generalizations of wavelet transforms and Radon transforms were investigated. In a modern sense, the philosophy of general Radon transforms is to map a function on a space of points to a function on another space of shapes (see e.g. Helgason, 2010). In the context of singularity localization, the shape, such as a $k$-plane, determines the shape of the singularities, namely, a collection of constant directions, and thus the general Radon domain is understood as the parameter space of the singularities. In this perspective, we can restate the functionality of the ridgelet transform as wavelet localization in the space of singularities.
7.3 Ridgelet Transform in the 2020s
In the context of deep learning studies, the idea of ridgelet transforms has regained the spotlight through the representer theorems that characterize (either deep or shallow) infinitely wide ReLU networks minimizing a “representational cost” (Savarese et al., 2019; Ongie et al., 2020; Parhi and Nowak, 2021; Unser, 2019). Here, the representational cost for a function $f$ is defined as the infimum of the total variation (TV) norm of the parameter distribution:
where is the collection of all signed measures. The TV-norm is a fundamental quantity for the MJB bounds (see e.g. Kainen et al., 2013).
According to Sonoda et al. (2021b), when the class of parameter distributions is restricted to , then any satisfying is uniquely written as a series of ridgelet transforms: where is a certain unique function satisfying , yielding ; is an orthonormal system satisfying , yielding ; and is -functions that is uniquely determined for each . We note that and are independent of . Hence, the cost is rewritten as a constraint-free expression:
As a result, we can conjecture that the minimizer of is given by ridgelet transform(s). In fact, Ongie et al. (2020) have shown that under some assumptions, the minimizer is given by a derivative of the Radon transform: , which is exactly the special case of the ridgelet transform in Theorem 6.3 when and .
Update: At the same time as the initial submission, Parhi and Unser (2023b) obtained a representer theorem for multivariate activation functions under more careful consideration of the regularization and the function spaces, based on an extended theory of the $k$-plane transforms for distributions (Parhi and Unser, 2023a). Their result suggests that our conjecture was essentially true (modulo finite-order polynomials).
8 Discussion
In the main text, we have seen a variety of examples, but what is essential behind the Fourier expression, the change of variables, and the separation-of-variables ansatz? In a nutshell, it is coefficient comparison for solving equations. Namely, after appropriately changing variables, the network is rewritten in the Fourier basis; this amounts to a coordinate transform from the basis of neurons to the Fourier basis. Since we (are supposed to) know the Fourier coefficient of the target, we can obtain the unknown coefficient function by comparing the coefficients in the Fourier domain. From this perspective, the Fourier basis is just one choice of frame, and the solution steps are summarized as below:
Let and be two feature maps on parametrized by and respectively, and consider their associated integral representations:
Here, and correspond to the continuous neural network and the inverse Fourier transform respectively. Given a function , suppose that of is known as, say , but of is unknown. Then, find a coordinate transform satisfying so that
where is a dual transform of the coefficients associated with the coordinate transform. Then, we can find by comparing the coefficients:
In other words, the mapping that matches the neuron and Fourier basis corresponds to the Fourier expression and change of variables in the main text, yielding .
9 Conclusion
The ultimate goal of this study is to understand neural network parameters. While the ridgelet transform is a strong analysis tool, one of its major shortcomings has been that the closed-form expression was known only for a small class of neural networks. In this paper, we have proposed the Fourier slice method and have shown that various neural networks and their corresponding ridgelet transforms, listed in Table 1, can be systematically obtained by following the three steps of the Fourier slice method.
layer type | input | parameter | single neuron
---|---|---|---
§ 1-2. fully-connected (FC) layer | $x\in\mathbb{R}^m$ | $(a,b)\in\mathbb{R}^m\times\mathbb{R}$ | $\sigma(a\cdot x-b)$
§ 3. FC layer on finite fields | $x\in\mathbb{F}_p^m$ | $(a,b)\in\mathbb{F}_p^m\times\mathbb{F}_p$ | $\sigma(a\cdot x-b)$
§ 4. group convolution layer | $x\in\mathcal{H}$ | $(a,b)\in\mathcal{H}\times\mathbb{R}$ | $\sigma(a\star x-b)$
§ 5. FC layer on manifolds | $x\in G/K$ | $(a,b,t)$ | $\sigma(a\cdot\langle x,b\rangle-t)$
§ 6. pooling ($k$-plane ridgelet) | $x\in\mathbb{R}^m$ | $(A,b)$ | $\sigma(Ax-b)$
Needless to say, it is more efficient to analyze networks uniformly in terms of ridgelet analysis than to analyze individual networks manually one by one. As demonstrated in this paper, the coverage of ridgelet analysis is gradually expanding. With the strength of a closed-form expression of the pseudo-inverse operator, the ridgelet transform has several applications. For example, we can/may
- (1) present a constructive proof of the universal approximation theorem,
- (2) estimate approximation error bounds by discretizing the reconstruction formula using numerical integration schemes (e.g. MJB theory),
- (3) describe the distribution of parameters obtained by gradient descent learning (Sonoda et al., 2021a),
- (4) obtain the general solution to the learning equation (Sonoda et al., 2021b), and
- (5) construct a representer theorem (Unser, 2019).
The Fourier expression further allows us to view neural networks from the perspective of harmonic analysis and integral geometry. By recasting neural networks in these contexts, we will be able to discover further varieties of novel networks. On the other hand, after the initial submission of this manuscript, the authors have also developed an alternative method of discovery that uses group invariant functions instead of the Fourier expression (Sonoda et al., 2023a, b). The characterization of the networks obtained by the Fourier slice method would be an interesting direction of this study.
Acknowledgment
This work was supported by JSPS KAKENHI 18K18113, JST CREST JPMJCR2015 and JPMJCR1913, JST PRESTO JPMJPR2125, and JST ACT-X JPMJAX2004.
Appendix A Poincaré Disk
Following Helgason (1984, Introduction, § 4) and Helgason (2008, Ch. II, § 1), we describe the group-theoretic aspect of the Poincaré disk. Let $D$ be the open unit disk in $\mathbb{R}^2$ equipped with the hyperbolic (Poincaré) Riemannian metric, which is conformal to the Euclidean metric with a factor depending only on $1-|x|^2$. Let $\partial D$ be the boundary of $D$ (the unit circle) equipped with the uniform probability measure. Namely, $D$ is the Poincaré disk model of the hyperbolic plane. On this model, the Poincaré distance between two points and the volume element are given by the standard closed-form expressions.
Consider now the group
which acts on (and ) by
The -action is transitive, conformal, and maps circles, lines, and the boundary into circles, lines, and the boundary. In addition, consider the subgroups
The subgroup fixes the origin . So we have the identifications
On this model, the following are known: (1) explicit formulas for the group actions in these coordinates, (2) that the geodesics are the circular arcs perpendicular to the boundary, and (3) that the horocycles are the circles internally tangent to the boundary. Hence, let $\xi(z,b)$ denote the horocycle through $z\in D$ tangent to the boundary at $b\in\partial D$, and let $\langle z,b\rangle$ denote the signed distance from the origin to the horocycle $\xi(z,b)$.
In order to compute the distance $\langle z,b\rangle$, we use the following fact: the hyperbolic distance from the origin to a point is a monotone function of its Euclidean norm. Hence, let $c$ be the Euclidean center of the horocycle $\xi(z,b)$, and let $q$ be the point on the horocycle closest to the origin. By definition, $\langle z,b\rangle$ is (up to sign) the hyperbolic distance from the origin to $q$. We can find the Euclidean radius of $\xi(z,b)$ via the cosine rule,
which yields the tractable formula:
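The construction above can be checked numerically. The sketch below implements the Euclidean circle construction of the horocycle $\xi(z,b)$ and compares the resulting distance with the closed form $\langle z,b\rangle=\log\frac{1-|z|^2}{|z-b|^2}$, which is our own simplification under the metric convention $2|\mathrm{d}x|/(1-|x|^2)$ (so that $d(0,x)=\log\frac{1+|x|}{1-|x|}$); with the convention $|\mathrm{d}x|/(1-|x|^2)$ used by Helgason, every distance, and hence the closed form, is halved.

```python
import numpy as np

rng = np.random.default_rng(0)

def horocycle_distance(z, b):
    """Signed distance from the origin to the horocycle xi(z, b) on the Poincare disk,
    computed from the Euclidean construction described in the text.
    Assumed convention: metric 2|dx|/(1-|x|^2), so d(0, x) = log((1+|x|)/(1-|x|))."""
    # xi(z, b) is the Euclidean circle through z, internally tangent to the unit circle at b;
    # its center is (1 - r) b and its radius r solves |z - (1 - r) b| = r.
    r = np.sum((z - b) ** 2) / (2.0 * (1.0 - z @ b))
    t = abs(1.0 - 2.0 * r)                 # Euclidean norm of the point of xi closest to 0
    d = np.log((1.0 + t) / (1.0 - t))      # hyperbolic distance from the origin to that point
    return d if r < 0.5 else -d            # negative sign when the origin lies inside xi

# Compare with the closed form  <z, b> = log( (1 - |z|^2) / |z - b|^2 )  (same convention)
for _ in range(5):
    z = 0.9 * rng.uniform(-1, 1, size=2)
    while z @ z >= 1:
        z = 0.9 * rng.uniform(-1, 1, size=2)
    phi = rng.uniform(0, 2 * np.pi)
    b = np.array([np.cos(phi), np.sin(phi)])
    closed = np.log((1 - z @ z) / np.sum((z - b) ** 2))
    print(np.isclose(horocycle_distance(z, b), closed))   # True
```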
Appendix B SPD Manifold
Following Terras (2016, Chapter 1), we introduce the SPD manifold. On the space of symmetric positive definite (SPD) matrices, the Riemannian metric is given by
where and denote the matrices of entries and .
Put , then the Iwasawa decomposition is given by ; and the centralizer is given by (diagonal matrices with entries ). The quotient space is identified with the SPD manifold via a diffeomorphism onto, for any ; and is identified with the boundary , another manifold of all singular positive semidefinite matrices. The action of on is given by for any and . In particular, the metric is -invariant. According to the spectral decomposition, for any , there uniquely exist and such that ; and according to the Cholesky (or Iwasawa) decomposition, there exist and such that .
When for some and , then the geodesic segment from the origin (the identity matrix) to is given by
satisfying and ; and the Riemannian length of (i.e., the Riemannian distance from to ) is given by . So, is the vector-valued distance from to .
The -invariant measures are given by on , to be the uniform probability measure on , on , on ,
where the second expression is for the polar coordinates with and , and to be the uniform probability measure on .
The vector-valued composite distance from the origin to a horosphere is calculated as
where denotes the diagonal vector in the Cholesky decomposition of for some .
Proof.
Since , it suffices to consider the case . Namely, we solve for unknowns . (To be precise, we only need because .) Put the Cholesky decomposition for some . Then, because , while . ∎
The Helgason–Fourier transform and its inversion formula are given by
for any (where ) and . Here, , , and
where is the beta function.
Appendix C Matrix Calculus
Let be positive integers (). Let be the set of all full-column-rank matrices equipped with the volume measure . Let be the Stiefel manifold of orthonormal -frames in equipped with the invariant measure normalized to where is the Siegel gamma function. Let be the space of real symmetric matrices equipped with the volume measure , which is isometric to the Euclidean space . Let be the cone of positive definite matrices in equipped with the volume measure .
Lemma C.1 (Matrix Polar Decomposition).
For any , there uniquely exist and such that
and for any ,
See Rubin (2018, Lemma 2.1) for more details. We remark that while Lemma C.1 is an integration formula on the Stiefel manifold, the Grassmannian manifold version is the Blaschke–Petkantschin formula.
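A minimal numerical illustration of the decomposition part of Lemma C.1 (the accompanying measure factorization is not illustrated): every full-column-rank matrix $M$ factors as $M=UP$ with $U$ in the Stiefel manifold and $P=(M^\top M)^{1/2}$ positive definite. The sizes below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 5, 3
M = rng.standard_normal((n, k))            # full column rank almost surely

S = M.T @ M                                # Gram matrix (k x k, positive definite)
w, V = np.linalg.eigh(S)
P = V @ np.diag(np.sqrt(w)) @ V.T          # P = (M^T M)^{1/2}
U = M @ np.linalg.inv(P)                   # U = M P^{-1}

print(np.allclose(U.T @ U, np.eye(k)))     # True: U has orthonormal columns (Stiefel manifold)
print(np.allclose(U @ P, M))               # True: M = U P
```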
Lemma C.2 (Matrix Polar Integration).
For any ,
where .
Proof.
Recall that and thus . Hence, using the polar decomposition with ,
then letting , which is a unit vector in , and letting be a rearranged -frame in that excludes ,
∎
Lemma C.3 (SVD).
For any full-column-rank matrix , there exist satisfying with ; and for any ,
where and denote the invariant measures.
See Lemma 1 of Díaz-García and González-Farías (2005) for the proof.
Appendix D Proofs
D.1 Solution via Singular Value Decomposition
Step 1.
We begin with the following Fourier expression:
(D.1)
Here, we assume (D.1) to be absolutely convergent for almost every $x$, so that we can change the order of integration freely. This assumption will be automatically satisfied because we eventually set $\gamma$ to be the ridgelet transform.
Step 2.
In the following, we aim to turn the integration into the Fourier inversion in the matrix polar coordinates. To achieve this, we use the singular value decomposition.
Lemma D.1 (Singular Value Decomposition, Lemma C.3).
The matrix space can be decomposed into via singular value decomposition
satisfying (distinct singular values); and the Lebesgue measure is calculated as
where ; and are invariant measures on and respectively; and denotes the Vandermonde polynomial (or the products of differences) of a given (diagonalized) vector .
If there is no risk of confusion, we write as for the sake of readability.
Using SVD, the Fourier expression is rewritten as follows:
(D.2)
Changing the variables as with ,
(D.3)
Then, extending the domain of from to , changing the variables as with , and writing ,
(D.4)
Step 3.
Therefore, it is natural to suppose to satisfy a separation-of-variables form
(D.5)
with an auxiliary convergence factor . Then, we have
(D.4)
Here, we put , and used a matrix polar integration formula given in Lemma C.1.
Finally, the form (D.5) can be satisfied as below. Since and , it is reduced to
Hence we can put
(D.6)
with additional convergence factor . In this setting, the constant is calculated as
and is obtained by taking the Fourier inversion of (D.6) with respect to as follows:
To sum up, we have shown that
D.2 Restriction to the Similitude Group
Let us consider the restricted case of the similitude group. Since it is a measure-zero subspace of the full matrix space, we can obtain a different solution.
Step 1.
The continuous network and its Fourier expression are given as
(D.7)
Step 2.
(Skipping the matrix decomposition of and) turning into polar coordinates with and , we obtain
Changing the variable with ,
and returning to the Euclidean coordinates with ,
(D.8)
Step 3.
Since and , supposing the separation-of-variables form as
(D.9)
for any number , we have
Note that in addition to , we have and thus for any number . Hence the condition is reduced to
where we put , which can be an arbitrary real number. As a result, to satisfy (D.9), we may put
which lead to
By matching the order of the fractional derivative , corresponds to the SVD solution. On the other hand, by matching the weight , and exactly reproduces the classical result.
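To make the similitude-group parametrization concrete, the following sketch evaluates a classical-style ridgelet coefficient R[f](a, b) = ∫ f(x) ψ(⟨a, x⟩ − b) dx by quadrature on a grid. The Gaussian f and Gaussian-derivative ψ are illustrative placeholders only, and no admissibility constants from the derivation above are reproduced.

```python
import numpy as np

# Minimal sketch of a (classical-style) ridgelet coefficient
#   R[f](a, b) = int_{R^2} f(x) * psi(<a, x> - b) dx
# evaluated by quadrature on a grid. f and psi are illustrative choices.
f = lambda x1, x2: np.exp(-(x1**2 + x2**2) / 2)   # 2D Gaussian
psi = lambda t: -t * np.exp(-t**2 / 2)            # Gaussian derivative (zero mean)

grid = np.linspace(-6, 6, 301)
dx = grid[1] - grid[0]
X1, X2 = np.meshgrid(grid, grid, indexing="ij")

def ridgelet_coeff(a, b):
    return np.sum(f(X1, X2) * psi(a[0] * X1 + a[1] * X2 - b)) * dx**2

print(ridgelet_coeff(np.array([1.0, 0.5]), 0.0))
print(ridgelet_coeff(np.array([2.0, -1.0]), 1.0))
```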
D.3 Restriction to the Stiefel manifold
Let us consider a further restricted case of the Stiefel manifold.
Step 1.
Namely, the weight matrix parameter now simply lies in the Stiefel manifold , and thus it does not contain any scaling factor. We show that this formulation still admits solutions, provided that is also appropriately restricted.
Step 3.
Skipping the rescaling step (Step 2), let us consider the separation-of-variables form:
(D.10)
satisfied by
for any real number . Here, we used .
In order for (D.10) to yield a solution, it is sufficient that
(D.11)
because then
(D.10)
Compared to the previous results, (D.11) is a much stricter requirement. Nonetheless, a few examples exist, such as
or equivalently in the real domain,
In particular, when , coincides with the Dirac delta (), the step function (), and the ReLU function ().
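Recall that these three activations are related by (distributional) differentiation: the derivative of ReLU is the step function, whose distributional derivative is the Dirac delta. A quick, purely illustrative finite-difference check of the first relation:

```python
import numpy as np

# Purely illustrative: the derivative of ReLU is the step function
# (and the step function's distributional derivative is the Dirac delta).
relu = lambda t: np.maximum(t, 0.0)
step = lambda t: (t > 0).astype(float)

t = np.linspace(-2, 2, 2001)
h = 1e-6
relu_prime = (relu(t + h) - relu(t - h)) / (2 * h)   # central difference

# Away from t = 0 the finite difference matches the step function exactly.
mask = np.abs(t) > 1e-3
print("max deviation away from 0:",
      np.max(np.abs(relu_prime[mask] - step(t[mask]))))
```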
Interestingly, the ridgelet transform reduces to the -plane transform (where is the codimension). Since , we have
but this is exactly the Fourier expression (a.k.a. the Fourier slice theorem) for the -plane transform, say , of the derivative . In other words, once the scaling parameter is removed, the reconstruction formula reduces to the Radon transform:
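As a numerical illustration of the Fourier slice theorem invoked here, the following sketch checks, for a two-dimensional Gaussian, that the one-dimensional Fourier transform of its Radon transform along a direction coincides with the corresponding slice of its two-dimensional Fourier transform; the Gaussian is chosen only because both sides are available in closed form.

```python
import numpy as np

# Minimal numerical check of the Fourier slice theorem for the Radon
# transform in the plane: the 1D Fourier transform of Rf(theta, .) equals
# the slice f_hat(omega * theta) of the 2D Fourier transform of f.
f = lambda x1, x2: np.exp(-(x1**2 + x2**2) / 2)

theta = np.array([np.cos(0.7), np.sin(0.7)])      # unit direction
theta_perp = np.array([-theta[1], theta[0]])

s = np.linspace(-8, 8, 801)
t = np.linspace(-8, 8, 801)
ds, dt = s[1] - s[0], t[1] - t[0]

# Radon transform: integrate f over the line <theta, x> = s.
S, T = np.meshgrid(s, t, indexing="ij")
X1 = S * theta[0] + T * theta_perp[0]
X2 = S * theta[1] + T * theta_perp[1]
Rf = np.sum(f(X1, X2), axis=1) * dt               # one value per offset s

# 1D Fourier transform of Rf at frequency omega, by Riemann sum.
omega = 1.3
lhs = np.sum(Rf * np.exp(-1j * omega * s)) * ds

# 2D Fourier transform of the Gaussian, known in closed form.
xi = omega * theta
rhs = 2 * np.pi * np.exp(-np.dot(xi, xi) / 2)

print(lhs.real, rhs)   # agree up to quadrature error
```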
References
- Barron (1993) A. R. Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39(3):930–945, 1993.
- Bartolucci et al. (2021) F. Bartolucci, F. De Mari, and M. Monti. Unitarization of the Horocyclic Radon Transform on Symmetric Spaces. In Harmonic and Applied Analysis: From Radon Transforms to Machine Learning, pages 1–54. Springer International Publishing, Cham, 2021.
- Bengio et al. (2006) Y. Bengio, N. Le Roux, P. Vincent, O. Delalleau, and P. Marcotte. Convex neural networks. In Advances in Neural Information Processing Systems 18, pages 123–130, 2006.
- Bronstein et al. (2021) M. M. Bronstein, J. Bruna, T. Cohen, and P. Veličković. Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges. arXiv preprint: 2104.13478, 2021.
- Bruna and Mallat (2013) J. Bruna and S. Mallat. Invariant scattering convolution networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1872–1886, 2013.
- Candès (1998) E. J. Candès. Ridgelets: theory and applications. PhD thesis, Stanford University, 1998.
- Carroll and Dickinson (1989) S. M. Carroll and B. W. Dickinson. Construction of neural nets using the Radon transform. In International Joint Conference on Neural Networks 1989, volume 1, pages 607–611. IEEE, 1989.
- Chizat and Bach (2018) L. Chizat and F. Bach. On the Global Convergence of Gradient Descent for Over-parameterized Models using Optimal Transport. In Advances in Neural Information Processing Systems 32, pages 3036–3046, 2018.
- Cohen and Welling (2016) T. Cohen and M. Welling. Group Equivariant Convolutional Networks. In Proceedings of The 33rd International Conference on Machine Learning, volume 48, pages 2990–2999, 2016.
- Cohen et al. (2019) T. S. Cohen, M. Geiger, and M. Weiler. A General Theory of Equivariant CNNs on Homogeneous Spaces. In Advances in Neural Information Processing Systems, volume 32, 2019.
- DeVore and Lorentz (1993) R. A. DeVore and G. G. Lorentz. Constructive Approximation. Springer-Verlag Berlin Heidelberg, 1993.
- Díaz-García and González-Farías (2005) J. A. Díaz-García and G. González-Farías. Singular random matrix decompositions: Jacobians. Journal of Multivariate Analysis, 93(2):296–312, 2005.
- Donoho (2001) D. L. Donoho. Ridge functions and orthonormal ridgelets. Journal of Approximation Theory, 111(2):143–179, 2001.
- Donoho (2002) D. L. Donoho. Emerging applications of geometric multiscale analysis. Proceedings of the ICM, Beijing 2002, I:209–233, 2002.
- Fukushima (1980) K. Fukushima. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36(4):193–202, 1980.
- Funahashi (1989) K.-I. Funahashi. On the approximate realization of continuous mappings by neural networks. Neural Networks, 2(3):183–192, 1989.
- Ganea et al. (2018) O. Ganea, G. Becigneul, and T. Hofmann. Hyperbolic Neural Networks. In Advances in Neural Information Processing Systems 31, 2018.
- Grafakos (2008) L. Grafakos. Classical Fourier Analysis. Springer New York, second edition, 2008.
- Gulcehre et al. (2019) C. Gulcehre, M. Denil, M. Malinowski, A. Razavi, R. Pascanu, K. M. Hermann, P. Battaglia, V. Bapst, D. Raposo, A. Santoro, and N. de Freitas. Hyperbolic Attention Networks. In International Conference on Learning Representations, 2019.
- Helgason (1984) S. Helgason. Groups and Geometric Analysis: Integral Geometry, Invariant Differential Operators, and Spherical Functions. American Mathematical Society, 1984.
- Helgason (2008) S. Helgason. Geometric Analysis on Symmetric Spaces: Second Edition. American Mathematical Society, second edition, 2008.
- Helgason (2010) S. Helgason. Integral Geometry and Radon Transforms. Springer-Verlag New York, 2010.
- Irie and Miyake (1988) B. Irie and S. Miyake. Capabilities of three-layered perceptrons. In IEEE 1988 International Conference on Neural Networks, pages 641–648. IEEE, 1988.
- Ito (1991) Y. Ito. Representation of functions by superpositions of a step or sigmoid function and their applications to neural network theory. Neural Networks, 4(3):385–394, 1991.
- Kainen et al. (2013) P. C. Kainen, V. Kůrková, and M. Sanguineti. Approximating multivariable functions by feedforward neural nets. In Handbook on Neural Information Processing, volume 49 of Intelligent Systems Reference Library, pages 143–181. Springer Berlin Heidelberg, 2013.
- Kapovich et al. (2017) M. Kapovich, B. Leeb, and J. Porti. Anosov subgroups: dynamical and geometric characterizations. European Journal of Mathematics, 3(4):808–898, 2017.
- Kondor and Trivedi (2018) R. Kondor and S. Trivedi. On the Generalization of Equivariance and Convolution in Neural Networks to the Action of Compact Groups. In Proceedings of the 35th International Conference on Machine Learning, pages 2747–2755, 2018.
- Kostadinova et al. (2014) S. Kostadinova, S. Pilipović, K. Saneva, and J. Vindas. The ridgelet transform of distributions. Integral Transforms and Special Functions, 25(5):344–358, 2014.
- Krioukov et al. (2010) D. Krioukov, F. Papadopoulos, M. Kitsak, A. Vahdat, and M. Boguñá. Hyperbolic geometry of complex networks. Physical Review E, 82(3):036106, 2010.
- Krizhevsky et al. (2012) A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems 25, pages 1097–1105, 2012.
- Kumagai and Sannai (2020) W. Kumagai and A. Sannai. Universal Approximation Theorem for Equivariant Maps by Group CNNs. arXiv preprint: 2012.13882, 2020.
- LeCun et al. (1998) Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-Based Learning Applied to Document Recognition. Proceedings of the IEEE, 86:2278–2324, 1998.
- Maron et al. (2019) H. Maron, E. Fetaya, N. Segol, and Y. Lipman. On the Universality of Invariant Networks. In Proceedings of the 36th International Conference on Machine Learning, volume 97, pages 4363–4371, 2019.
- Mei et al. (2018) S. Mei, A. Montanari, and P.-M. Nguyen. A mean field view of the landscape of two-layer neural networks. Proceedings of the National Academy of Sciences, 115(33):E7665–E7671, 2018.
- Murata (1996) N. Murata. An integral representation of functions using three-layered networks and their approximation bounds. Neural Networks, 9(6):947–956, 1996.
- Nickel and Kiela (2017) M. Nickel and D. Kiela. Poincaré Embeddings for Learning Hierarchical Representations. In Advances in Neural Information Processing Systems, volume 30, 2017.
- Nickel and Kiela (2018) M. Nickel and D. Kiela. Learning Continuous Hierarchies in the Lorentz Model of Hyperbolic Geometry. In Proceedings of the 35th International Conference on Machine Learning, volume 80, pages 3779–3788, 2018.
- Nitanda and Suzuki (2017) A. Nitanda and T. Suzuki. Stochastic Particle Gradient Descent for Infinite Ensembles. arXiv preprint: 1712.05438, 2017.
- Nitanda et al. (2022) A. Nitanda, D. Wu, and T. Suzuki. Convex Analysis of the Mean Field Langevin Dynamics. In Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, pages 9741–9757, 2022.
- Ongie et al. (2020) G. Ongie, R. Willett, D. Soudry, and N. Srebro. A Function Space View of Bounded Norm Infinite Width ReLU Nets: The Multivariate Case. In International Conference on Learning Representations, 2020.
- Parhi and Nowak (2021) R. Parhi and R. D. Nowak. Banach Space Representer Theorems for Neural Networks and Ridge Splines. Journal of Machine Learning Research, 22(43):1–40, 2021.
- Parhi and Unser (2023a) R. Parhi and M. Unser. Distributional Extension and Invertibility of the -Plane Transform and Its Dual. arXiv preprint: 2310.01233, 2023a.
- Parhi and Unser (2023b) R. Parhi and M. Unser. Function-Space Optimality of Neural Architectures With Multivariate Nonlinearities. arXiv preprint: 2310.03696, 2023b.
- Ranzato et al. (2007) M. Ranzato, C. Poultney, S. Chopra, and Y. LeCun. Efficient Learning of Sparse Representations with an Energy-Based Model. In Advances In Neural Information Processing Systems 19, pages 1137–1144, 2007.
- Rotskoff and Vanden-Eijnden (2018) G. Rotskoff and E. Vanden-Eijnden. Parameters as interacting particles: long time convergence and asymptotic error scaling of neural networks. In Advances in Neural Information Processing Systems 31, pages 7146–7155, 2018.
- Rubin (2004) B. Rubin. Convolution–backprojection method for the -plane transform, and Calderón’s identity for ridgelet transforms. Applied and Computational Harmonic Analysis, 16(3):231–242, 2004.
- Rubin (2018) B. Rubin. A note on the Blaschke-Petkantschin formula, Riesz distributions, and Drury’s identity. Fractional Calculus and Applied Analysis, 21(6):1641–1650, 2018.
- Sala et al. (2018) F. Sala, C. De Sa, A. Gu, and C. Re. Representation Tradeoffs for Hyperbolic Embeddings. In Proceedings of the 35th International Conference on Machine Learning, volume 80, pages 4460–4469, 2018.
- Savarese et al. (2019) P. Savarese, I. Evron, D. Soudry, and N. Srebro. How do infinite width bounded norm networks look in function space? In Proceedings of the 32nd Conference on Learning Theory, volume 99, pages 2667–2690, 2019.
- Shimizu et al. (2021) R. Shimizu, Y. Mukuta, and T. Harada. Hyperbolic Neural Networks++. In International Conference on Learning Representations, 2021.
- Sirignano and Spiliopoulos (2020) J. Sirignano and K. Spiliopoulos. Mean Field Analysis of Neural Networks: A Law of Large Numbers. SIAM Journal on Applied Mathematics, 80(2):725–752, 2020.
- Sonoda and Murata (2017) S. Sonoda and N. Murata. Neural network with unbounded activation functions is universal approximator. Applied and Computational Harmonic Analysis, 43(2):233–268, 2017.
- Sonoda et al. (2021a) S. Sonoda, I. Ishikawa, and M. Ikeda. Ridge Regression with Over-Parametrized Two-Layer Networks Converge to Ridgelet Spectrum. In Proceedings of The 24th International Conference on Artificial Intelligence and Statistics 2021, volume 130, pages 2674–2682, 2021a.
- Sonoda et al. (2021b) S. Sonoda, I. Ishikawa, and M. Ikeda. Ghosts in Neural Networks: Existence, Structure and Role of Infinite-Dimensional Null Space. arXiv preprint: 2106.04770, 2021b.
- Sonoda et al. (2022a) S. Sonoda, I. Ishikawa, and M. Ikeda. Universality of Group Convolutional Neural Networks Based on Ridgelet Analysis on Groups. In Advances in Neural Information Processing Systems 35, pages 38680–38694, 2022a.
- Sonoda et al. (2022b) S. Sonoda, I. Ishikawa, and M. Ikeda. Fully-Connected Network on Noncompact Symmetric Space and Ridgelet Transform based on Helgason-Fourier Analysis. In Proceedings of the 39th International Conference on Machine Learning, volume 162, pages 20405–20422, 2022b.
- Sonoda et al. (2023a) S. Sonoda, Y. Hashimoto, I. Ishikawa, and M. Ikeda. Deep Ridgelet Transform: Voice with Koopman Operator Proves Universality of Formal Deep Networks. In Proceedings of the 2nd NeurIPS Workshop on Symmetry and Geometry in Neural Representations, 2023a.
- Sonoda et al. (2023b) S. Sonoda, H. Ishi, I. Ishikawa, and M. Ikeda. Joint Group Invariant Functions on Data-Parameter Domain Induce Universal Neural Networks. In Proceedings of the 2nd NeurIPS Workshop on Symmetry and Geometry in Neural Representations, 2023b.
- Starck et al. (2010) J.-L. Starck, F. Murtagh, and J. M. Fadili. The ridgelet and curvelet transforms. In Sparse Image and Signal Processing: Wavelets, Curvelets, Morphological Diversity, pages 89–118. Cambridge University Press, 2010.
- Suzuki (2020) T. Suzuki. Generalization bound of globally optimal non-convex neural network training: Transportation map estimation by infinite dimensional Langevin dynamics. In Advances in Neural Information Processing Systems 33, pages 19224–19237, 2020.
- Terras (2016) A. Terras. Harmonic Analysis on Symmetric Spaces—Higher Rank Spaces, Positive Definite Matrix Space and Generalizations. Springer New York, 2016.
- Unser (2019) M. Unser. A Representer Theorem for Deep Neural Networks. Journal of Machine Learning Research, 20(110):1–30, 2019.
- Yamasaki et al. (2023) H. Yamasaki, S. Subramanian, S. Hayakawa, and S. Sonoda. Quantum Ridgelet Transform: Winning Lottery Ticket of Neural Networks with Quantum Computation. In Proceedings of the 40th International Conference on Machine Learning, volume 202, pages 39008–39034, 2023.
- Yarotsky (2022) D. Yarotsky. Universal Approximations of Invariant Maps by Neural Networks. Constructive Approximation, 55:407–474, 2022.
- Zaheer et al. (2017) M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Poczos, R. R. Salakhutdinov, and A. J. Smola. Deep Sets. In Advances in Neural Information Processing Systems, volume 30, 2017.
- Zhou (2020) D.-X. Zhou. Universality of deep convolutional neural networks. Applied and Computational Harmonic Analysis, 48(2):787–794, 2020.