
A unified Fourier slice method to derive ridgelet transform for a variety of depth-2 neural networks

Sho Sonoda1 [email protected]
Isao Ishikawa2,1 [email protected]
Masahiro Ikeda1 [email protected]
1RIKEN Center for Advanced Intelligence Project (AIP)
2Center for Data Science, Ehime University (CDSE)
(April 18, 2024)
Abstract

To investigate neural network parameters, it is easier to study the distribution of parameters than to study the parameters in each neuron. The ridgelet transform is a pseudo-inverse operator that maps a given function $f$ to the parameter distribution $\gamma$ so that a network $\mathtt{NN}[\gamma]$ reproduces $f$, i.e. $\mathtt{NN}[\gamma]=f$. For depth-2 fully-connected networks on a Euclidean space, the ridgelet transform has been discovered up to the closed-form expression, thus we could describe how the parameters are distributed. However, for a variety of modern neural network architectures, the closed-form expression has not been known. In this paper, we explain a systematic method using Fourier expressions to derive ridgelet transforms for a variety of modern networks such as networks on finite fields $\mathbb{F}_{p}$, group convolutional networks on abstract Hilbert space $\mathcal{H}$, fully-connected networks on noncompact symmetric spaces $G/K$, and pooling layers, or the $d$-plane ridgelet transform.

1 Introduction

Neural networks are learning machines that support today’s AI technology. Mathematically, they are nonlinear functions determined by a network of functions with learnable parameters (called neurons) connected in parallel and in series. Since the learning process is automated, we do not fully understand the parameters obtained through learning. One of the technical difficulties in analyzing the behavior of neural networks is that they depend on their parameters in an extremely nonlinear manner. An integral representation is a powerful tool for the mathematical analysis of these parameters: it is a method of indirectly analyzing the parameters through their distribution, rather than directly analyzing the parameters of each neuron. The set of all signed (or probability) parameter distributions forms a linear (or convex) space, making it possible to perform far more insightful analysis than directly analyzing individual parameters.

For instance, characterizations of neural network parameters such as the ridgelet transform (Murata, 1996; Candès, 1998; Sonoda and Murata, 2017) and the representer theorems for ReLU networks (Savarese et al., 2019; Ongie et al., 2020; Parhi and Nowak, 2021; Unser, 2019), as well as convergence analyses of stochastic gradient descent (SGD) for deep learning such as the mean-field theory (Nitanda and Suzuki, 2017; Chizat and Bach, 2018; Mei et al., 2018; Rotskoff and Vanden-Eijnden, 2018; Sirignano and Spiliopoulos, 2020) and the infinite-dimensional Langevin dynamics (Suzuki, 2020; Nitanda et al., 2022), have been developed using integral representations.

1.1 Integral Representation

The integral representation of a depth-2 fully-connected neural network is defined as below.

Definition 1.1.

Let $\sigma:\mathbb{R}\to\mathbb{C}$ be a measurable function, called an activation function, and fix it. For any signed measure $\gamma$ on $\mathbb{R}^{m}\times\mathbb{R}$, called a parameter distribution, we define the integral representation of a depth-2 fully-connected neural network as

S[\gamma]({\bm{x}})=\int_{\mathbb{R}^{m}\times\mathbb{R}}\gamma({\bm{a}},b)\sigma({\bm{a}}\cdot{\bm{x}}-b)\mathrm{d}{\bm{a}}\mathrm{d}b,\quad{\bm{x}}\in\mathbb{R}^{m}. (1)

Here, for each hidden parameter $({\bm{a}},b)$, the feature map ${\bm{x}}\mapsto\sigma({\bm{a}}\cdot{\bm{x}}-b)$ corresponds to a single hidden neuron with activation function $\sigma$, the weight $\gamma({\bm{a}},b)$ corresponds to an output coefficient, and the integration implies that all the possible neurons are assigned in advance. Since the only free parameter is the parameter distribution $\gamma$, we can identify the network $S[\gamma]$ with the point $\gamma$ in a function space.

We note that this representation covers both infinite (or continuous) and finite widths. Indeed, while the integration may be understood as an infinite-width layer, the integration against a finite sum of point masses such as $\gamma_{p}:=\sum_{i=1}^{p}c_{i}\delta_{({\bm{a}}_{i},b_{i})}$ can represent a finite-width layer:

S[\gamma_{p}]({\bm{x}})=\sum_{i=1}^{p}c_{i}\sigma({\bm{a}}_{i}\cdot{\bm{x}}-b_{i})=C\sigma(A{\bm{x}}-\bm{b}),\quad{\bm{x}}\in\mathbb{R}^{m}

where the third term is the so-called “matrix” representation with matrices $A\in\mathbb{R}^{p\times m}$, $C\in\mathbb{R}^{1\times p}$ and vector $\bm{b}\in\mathbb{R}^{p}$, followed by the component-wise activation $\sigma$. Singular measures as above can be mathematically justified without any inconsistency if the class of parameter distributions is set to a class of Borel measures or Schwartz distributions.
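A minimal numerical sketch of this correspondence (an illustration with arbitrarily chosen sizes, weights, and ReLU activation, not taken from the paper): the “matrix” form $C\sigma(A{\bm{x}}-\bm{b})$ and the point-mass form $\sum_{i}c_{i}\sigma({\bm{a}}_{i}\cdot{\bm{x}}-b_{i})$ of a finite-width layer coincide.

import numpy as np

def relu(t):
    return np.maximum(t, 0.0)

rng = np.random.default_rng(0)
m, p = 3, 5                        # input dimension and width (hypothetical values)
A = rng.normal(size=(p, m))        # rows a_i
b = rng.normal(size=p)             # biases b_i
C = rng.normal(size=(1, p))        # output coefficients c_i
x = rng.normal(size=m)

# "matrix" form C sigma(A x - b)
matrix_form = (C @ relu(A @ x - b)).item()

# point-mass form sum_i c_i sigma(a_i . x - b_i), i.e. S[gamma_p](x)
point_mass_form = sum(C[0, i] * relu(A[i] @ x - b[i]) for i in range(p))

assert np.isclose(matrix_form, point_mass_form)
print(matrix_form)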

There are at least four advantages to introducing integral representations:

(1) Aggregation of parameters, say $\{({\bm{a}}_{i},b_{i},c_{i})\}_{i=1}^{p}$, into a single function (parameter distribution) $\gamma({\bm{a}},b)$,

(2) Ability to represent finite models and continuous models in the same form,

(3) Linearization of networks and convexification of learning problems, and

(4) Presence of the ridgelet transform.

Advantages 1 and 2 have already been explained. Advantage 3 is described in the next subsection, and Advantage 4 is emphasized throughout the paper. On the other hand, there are two disadvantages:

(1) Extensions to deep networks are hard (good news: after the initial submission of this manuscript, the authors have successfully developed the ridgelet transform for deep networks in Sonoda et al. (2023a, b)), and

(2) Advanced knowledge of functional analysis is required.

1.2 Linearization and Convexification Effect

The third advantage of the integral representation is the so-called linearization (and convexification) trick. That is, while the network is nonlinear with respect to the raw parameters ${\bm{a}}$ and $b$, namely,

S[\delta_{(\alpha_{1}{\bm{a}}_{1}+\alpha_{2}{\bm{a}}_{2},b)}]\neq\alpha_{1}S[\delta_{({\bm{a}}_{1},b)}]+\alpha_{2}S[\delta_{({\bm{a}}_{2},b)}],\quad\alpha_{1},\alpha_{2}\in\mathbb{C}

(and similarly for $b$), it is linear with respect to the parameter distribution $\gamma$, namely,

S[\alpha_{1}\gamma_{1}+\alpha_{2}\gamma_{2}]=\alpha_{1}S[\gamma_{1}]+\alpha_{2}S[\gamma_{2}],\quad\alpha_{1},\alpha_{2}\in\mathbb{C}.

Furthermore, linearizing neural networks leads to convexifying learning problems. Specifically, for a convex function $\ell:\mathbb{R}\to\mathbb{R}$, the loss function defined as $L[\gamma]:=\ell(S[\gamma])$ satisfies the following:

L[t\gamma_{1}+(1-t)\gamma_{2}]\leq tL[\gamma_{1}]+(1-t)L[\gamma_{2}],\quad t\in[0,1].

It may sound paradoxical that a convex loss function on a function space has local minima in the raw parameters, but we can understand this through the chain rule for the functional derivative: Suppose that a parameter distribution $\gamma$ is parametrized by a raw parameter, say $\theta$; then

\frac{\partial L[\gamma(\theta)]}{\partial\theta}=\left\langle\frac{\partial\gamma(\theta)}{\partial\theta},\frac{\partial L[\gamma]}{\partial\gamma}\right\rangle.

In other words, a local minimum ($\partial_{\theta}L=0$) in the raw parameter $\theta$ can arise not only from the global optimum ($\partial_{\gamma}L=0$) but also from the case when the two derivatives $\partial_{\theta}\gamma$ and $\partial_{\gamma}L$ are orthogonal.
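A minimal numerical sketch of the linearization trick (an assumed example using a finite point-mass surrogate of $S[\gamma]$ with ReLU and random weights, not from the paper): the map $\gamma\mapsto S[\gamma]$ is linear in the parameter distribution, while $a\mapsto S[\delta_{(a,b)}]$ is nonlinear in the raw weight.

import numpy as np

def relu(t):
    return np.maximum(t, 0.0)

def S_discrete(params, x):
    """Finite-width surrogate of S[gamma]: params is a list of point masses (c, a, b)."""
    return sum(c * relu(np.dot(a, x) - b) for c, a, b in params)

rng = np.random.default_rng(1)
x = rng.normal(size=2)
a1, a2, b = rng.normal(size=2), rng.normal(size=2), 0.3

# linearity in gamma: mixing two point masses mixes the outputs exactly
g1, g2 = [(1.0, a1, b)], [(1.0, a2, b)]
mix = [(0.5, a1, b), (0.5, a2, b)]           # 0.5*gamma_1 + 0.5*gamma_2
assert np.isclose(S_discrete(mix, x),
                  0.5 * S_discrete(g1, x) + 0.5 * S_discrete(g2, x))

# nonlinearity in raw weights: mixing a_1 and a_2 does NOT mix the outputs in general
raw_mix = [(1.0, 0.5 * a1 + 0.5 * a2, b)]
print(S_discrete(raw_mix, x), 0.5 * S_discrete(g1, x) + 0.5 * S_discrete(g2, x))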

The trick of lifting nonlinear objects into a linear space has been studied since the age of Frobenius, one of the founders of the linear representation theory of groups. In the context of neural network study, as well as the recent studies mentioned above, either the integral representation by Barron (1993) or the convex neural network by Bengio et al. (2006) is often referred to. In the context of deep learning theory, this linearization/convexification trick has been employed to show the global convergence of SGD training of shallow ReLU networks (Nitanda and Suzuki, 2017; Chizat and Bach, 2018; Mei et al., 2018; Rotskoff and Vanden-Eijnden, 2018; Sirignano and Spiliopoulos, 2020; Suzuki, 2020; Nitanda et al., 2022), and to characterize parameters in ReLU networks (Savarese et al., 2019; Ongie et al., 2020; Parhi and Nowak, 2021; Unser, 2019).

1.3 Ridgelet Transform

The fourth advantage of the integral representation is the so-called ridgelet transform $R$, a right inverse operator of the integral representation operator $S$. For example, the ridgelet transform for the depth-2 fully-connected network (1) is given as below.

Definition 1.2.

For any measurable functions $f:\mathbb{R}^{m}\to\mathbb{C}$ and $\rho:\mathbb{R}\to\mathbb{C}$,

R[f;\rho]({\bm{a}},b):=\int_{\mathbb{R}^{m}}f({\bm{x}})\overline{\rho({\bm{a}}\cdot{\bm{x}}-b)}\mathrm{d}{\bm{x}},\quad({\bm{a}},b)\in\mathbb{R}^{m}\times\mathbb{R}. (2)

In principle, the ridgelet function $\rho$ can be chosen independently of the activation function $\sigma$ of the neural network $S$. The following theorem holds.

Theorem 1.1 (Reconstruction Formula).

Suppose $\sigma$ and $\rho$ are a tempered distribution ($\mathcal{S}^{\prime}$) on $\mathbb{R}$ and a rapidly decreasing function ($\mathcal{S}$) on $\mathbb{R}$, respectively. Then, for any square integrable function $f$, the following reconstruction formula

S[R[f;\rho]]=(\!(\sigma,\rho)\!)f\quad\mbox{in}\quad L^{2}(\mathbb{R}^{m})

holds with the factor being a scalar product of $\sigma$ and $\rho$,

(\!(\sigma,\rho)\!):=(2\pi)^{m-1}\int_{\mathbb{R}}\sigma^{\sharp}(\omega)\overline{\rho^{\sharp}(\omega)}|\omega|^{-m}\mathrm{d}\omega,

where \sharp denotes the Fourier transform.

From the perspective of neural network theory, the reconstruction formula claims a detailed/constructive version of the universal approximation theorem. That is, given any target function $f$, as long as $(\!(\sigma,\rho)\!)\neq 0$, the network $S[\gamma]$ with coefficient $\gamma=R[f;\rho]$ reproduces the original function, and the coefficient is given explicitly.
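For instance (an illustrative computation not in the original, in dimension $m=1$): take a Gaussian activation $\sigma(t)=e^{-t^{2}/2}$ and $\rho=\sigma^{\prime\prime}$. Then $\sigma^{\sharp}(\omega)=\sqrt{2\pi}\,e^{-\omega^{2}/2}$ and $\rho^{\sharp}(\omega)=-\omega^{2}\sqrt{2\pi}\,e^{-\omega^{2}/2}$, so

(\!(\sigma,\rho)\!)=\int_{\mathbb{R}}\sigma^{\sharp}(\omega)\overline{\rho^{\sharp}(\omega)}|\omega|^{-1}\mathrm{d}\omega=-2\pi\int_{\mathbb{R}}|\omega|e^{-\omega^{2}}\mathrm{d}\omega=-2\pi\neq 0,

hence this pair is admissible and the reconstruction formula applies with the nonzero constant $-2\pi$.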

From the perspective of functional analysis, on the other hand, the reconstruction formula states that $R$ and $S$ are analysis and synthesis operators, and thus play the same roles as, for instance, the Fourier ($F$) and inverse Fourier ($F^{-1}$) transforms respectively, in the sense that the reconstruction formula $S[R[f;\rho]]=(\!(\sigma,\rho)\!)f$ corresponds to the Fourier inversion formula $F^{-1}[F[f]]=f$.

Despite the common belief that neural network parameters are a black box, the closed-form expression (2) of the ridgelet transform clearly describes how the network parameters are distributed, which is a clear advantage of the integral representation theory (see e.g. Sonoda et al., 2021a). Moreover, the integral representation theory can deal with a wide range of activation functions without approximation, not only ReLU but all tempered distributions $\mathcal{S}^{\prime}(\mathbb{R})$ (see e.g. Sonoda and Murata, 2017).

The ridgelet transform was discovered in the late 1990s independently by Murata (1996) and Candès (1998). The term “ridgelet” was coined by Candès, based on the facts that the graph of a function ${\bm{x}}\mapsto\rho({\bm{a}}\cdot{\bm{x}}-b)$ is ridge-shaped, and that the integral transform $R$ can be regarded as a multidimensional counterpart of the wavelet transform.

In fact, the ridgelet transform can be decomposed into a wavelet transform composed with the Radon transform, namely,

R[f;\rho](a{\bm{u}},b) =\int_{\mathbb{R}}P[f]({\bm{u}},t)\overline{\rho(at-b)}\mathrm{d}t,\quad(a,{\bm{u}},b)\in\mathbb{R}\times\SS^{m-1}\times\mathbb{R},
P[f]({\bm{u}},t) :=\int_{(\mathbb{R}{\bm{u}})^{\perp}}f(t{\bm{u}}+{\bm{y}})\mathrm{d}{\bm{y}},\quad({\bm{u}},t)\in\SS^{m-1}\times\mathbb{R},

where $\SS^{m-1}$ denotes the unit sphere in $\mathbb{R}^{m}$, $(\mathbb{R}{\bm{u}})^{\perp}\cong\mathbb{R}^{m-1}$ denotes the orthocomplement of the normal vector ${\bm{u}}\in\SS^{m-1}$, $\mathrm{d}{\bm{y}}$ denotes the Hausdorff measure on $(\mathbb{R}{\bm{u}})^{\perp}$ or the Lebesgue measure on $\mathbb{R}^{m-1}$, and ${\bm{a}}\in\mathbb{R}^{m}$ is represented in polar coordinates ${\bm{a}}=a{\bm{u}}$ with $(a,{\bm{u}})\in\mathbb{R}\times\SS^{m-1}$, allowing the double covering $(a,{\bm{u}})=(-a,-{\bm{u}})$ (see Sonoda and Murata, 2017, for the proof). Therefore, several authors have remarked that ridgelet analysis is wavelet analysis in the Radon domain (Donoho, 2002; Kostadinova et al., 2014; Starck et al., 2010).

In the context of deep learning theory, Savarese et al. (2019); Ongie et al. (2020); Parhi and Nowak (2021) and Unser (2019) investigate the ridgelet transform for the specific case of fully-connected ReLU layers to establish the representer theorem. Sonoda et al. (2021a) have shown that the parameter distribution of a finite model trained by regularized empirical risk minimization (RERM) converges to the ridgelet spectrum $R[f;\rho]$ in an over-parametrized regime, meaning that we can understand the parameters at local minima to be a finite approximation of the ridgelet transform. In other words, analyzing neural network parameters can be turned into analyzing the ridgelet transform.

1.4 Scope and Contribution of This Study

On the other hand, one of the major shortcomings of ridgelet analysis is that the closed-form expression is known only for a relatively small class of networks. Indeed, until Sonoda et al. (2022b, a), it was known only for the depth-2 fully-connected layer $\sigma({\bm{a}}\cdot{\bm{x}}-b)$. In the age of deep learning, a variety of layers have become popular, such as the convolution and pooling layers (Fukushima, 1980; LeCun et al., 1998; Ranzato et al., 2007; Krizhevsky et al., 2012). Furthermore, fully-connected layers on manifolds have also been developed, such as the hyperbolic network (Ganea et al., 2018; Shimizu et al., 2021). Since the conventional ridgelet transform was discovered heuristically in the 1990s, and the derivation heavily depends on the specific structure of the affine map ${\bm{a}}\cdot{\bm{x}}-b$, the ridgelet transforms for those modern architectures have been unknown for a long time.

In this study, we explain a systematic method to find the ridgelet transforms via the Fourier expression of neural networks, and obtain new ridgelet transforms in a unified manner. The Fourier expression of $S[\gamma]$ is essentially a change-of-frame from neurons $\sigma({\bm{a}}\cdot{\bm{x}}-b)$ to plane waves (or harmonic oscillators) $\exp(i{\bm{x}}\cdot{\bm{\xi}})$. Since the Fourier transform is extensively developed on a variety of domains, once a network $S[\gamma]$ is translated into a Fourier expression, we can systematically find a particular coefficient $\gamma_{f}$ satisfying $S[\gamma_{f}]=f$ via the Fourier inversion formula. In fact, the traditional ridgelet transform is re-discovered this way. Associated with the change-of-frame in $S[\gamma]$, the ridgelet transform $R[f]$ is also given a Fourier expression, and this form is known as the Fourier slice theorem for the ridgelet transform $R[f]$ (see e.g. Kostadinova et al., 2014). Hence, we call our proposed method the Fourier slice method.

Besides the classical networks, we deal with four types of networks:

(1) Networks on finite fields $\mathbb{F}_{p}$ in § 3,

(2) Group convolution networks on Hilbert spaces $\mathcal{H}$ in § 4,

(3) Fully-connected networks on noncompact symmetric spaces $G/K$ in § 5, and

(4) Pooling layers (also known as the $d$-plane ridgelet transform) in § 6.

The first three cases are already published, so we only showcase them, while the last case (pooling layer and $d$-plane ridgelet) involves new results.

For all the cases, the reconstruction formula $S[R[f]]=f$ is understood as a constructive proof of the universal approximation theorem for the corresponding networks. The group convolution layer case widely extends the ordinary convolution layer with periodic boundary, which is also the main subject of the so-called geometric deep learning (Bronstein et al., 2021). The case of fully-connected layers on symmetric spaces widely extends the recently emerging concept of hyperbolic networks (Ganea et al., 2018; Gulcehre et al., 2019; Shimizu et al., 2021), which can be cast as another form of geometric deep learning. The pooling layer case includes the original fully-connected layer and the pooling layer; the corresponding ridgelet transforms include previously developed formulas such as the Radon transform formula by Savarese et al. (2019) and are related to the previously developed “$d$-plane ridgelet transforms” of Rubin (2004) and Donoho (2001).

1.5 General Notations

For any integer $d>0$, $\mathcal{S}(\mathbb{R}^{d})$ and $\mathcal{S}^{\prime}(\mathbb{R}^{d})$ denote the classes of Schwartz test functions (or rapidly decreasing functions) and tempered distributions on $\mathbb{R}^{d}$, respectively. Namely, $\mathcal{S}^{\prime}$ is the topological dual of $\mathcal{S}$. We note that $\mathcal{S}^{\prime}(\mathbb{R})$ includes the truncated power functions $\sigma(b)=b_{+}^{k}=\max\{b,0\}^{k}$, such as the step function ($k=0$) and ReLU ($k=1$).

Fourier Transform

To avoid potential confusion, we use two symbols $\widehat{\cdot}$ and $\cdot^{\sharp}$ for the Fourier transforms in ${\bm{x}}\in\mathbb{R}^{m}$ and $b\in\mathbb{R}$, respectively. For example,

\widehat{f}({\bm{\xi}}):=\int_{\mathbb{R}^{m}}f({\bm{x}})e^{-i{\bm{x}}\cdot{\bm{\xi}}}\mathrm{d}{\bm{x}},\quad{\bm{\xi}}\in\mathbb{R}^{m}
\rho^{\sharp}(\omega):=\int_{\mathbb{R}}\rho(b)e^{-ib\omega}\mathrm{d}b,\quad\omega\in\mathbb{R}
\gamma^{\sharp}({\bm{a}},\omega):=\int_{\mathbb{R}}\gamma({\bm{a}},b)e^{-ib\omega}\mathrm{d}b,\quad({\bm{a}},\omega)\in\mathbb{R}^{m}\times\mathbb{R}.

Furthermore, with a slight abuse of notation, when $\sigma$ is a tempered distribution (i.e., $\sigma\in\mathcal{S}^{\prime}(\mathbb{R})$), then $\sigma^{\sharp}$ is understood as the Fourier transform of distributions. Namely, $\sigma^{\sharp}$ is another tempered distribution satisfying $\int_{\mathbb{R}}\sigma^{\sharp}(\omega)\phi(\omega)\mathrm{d}\omega=\int_{\mathbb{R}}\sigma(\omega)\phi^{\sharp}(\omega)\mathrm{d}\omega$ for any test function $\phi\in\mathcal{S}(\mathbb{R})$. We refer to Grafakos (2008) for more details on the Fourier transform for distributions.

2 Method

We explain the basic steps to find the parameter distribution $\gamma$ satisfying $S[\gamma]=f$. The procedure is three-fold: (Step 1) turn the network into the Fourier expression, (Step 2) change variables inside the feature map into principal and auxiliary variables, and (Step 3) put the unknown $\gamma$ in the separation-of-variables form to find a particular solution. In the following, we conduct the basic steps for the classical setting, i.e., the case of the fully-connected layer, for explanatory purposes. The appeal of this procedure, however, is that it is applicable to a wide range of networks, as we will see in the subsequent sections.

2.1 Basic Steps to Solve $S[\gamma]=f$

The following procedure is valid, for example, when $\sigma\in\mathcal{S}^{\prime}(\mathbb{R})$, $\rho\in\mathcal{S}(\mathbb{R})$, $f\in L^{2}(\mathbb{R}^{m})$ and $\gamma\in L^{2}(\mathbb{R}^{m}\times\mathbb{R})$. See Kostadinova et al. (2014) and Sonoda and Murata (2017) for more details on the valid combinations of function classes.

Step 1.

Using the convolution in $b$, we can turn the network into the Fourier expression as below.

S[\gamma]({\bm{x}}) :=\int_{\mathbb{R}^{m}\times\mathbb{R}}\gamma({\bm{a}},b)\sigma({\bm{a}}\cdot{\bm{x}}-b)\mathrm{d}{\bm{a}}\mathrm{d}b
=\int_{\mathbb{R}^{m}}[\gamma({\bm{a}},\cdot)*_{b}\sigma]({\bm{a}}\cdot{\bm{x}})\mathrm{d}{\bm{a}}
=\frac{1}{2\pi}\int_{\mathbb{R}^{m}\times\mathbb{R}}\gamma^{\sharp}({\bm{a}},\omega)\sigma^{\sharp}(\omega)e^{i\omega{\bm{a}}\cdot{\bm{x}}}\mathrm{d}{\bm{a}}\mathrm{d}\omega.

Here, $*_{b}$ denotes the convolution in $b$; and the third equation follows from the Fourier inversion formula $\phi(b)=\frac{1}{2\pi}\int_{\mathbb{R}}\phi^{\sharp}(\omega)e^{i\omega b}\mathrm{d}\omega$ applied to $\phi(b)=[\gamma({\bm{a}},\cdot)*_{b}\sigma](b)$ at $b={\bm{a}}\cdot{\bm{x}}$.

Step 2.

Change variables $({\bm{a}},\omega)=({\bm{\xi}}/\omega,\omega)$ with $\mathrm{d}{\bm{a}}\mathrm{d}\omega=|\omega|^{-m}\mathrm{d}{\bm{\xi}}\mathrm{d}\omega$ so that the feature map $\sigma^{\sharp}(\omega)e^{i\omega{\bm{a}}\cdot{\bm{x}}}$ splits into a product of a principal factor (in ${\bm{\xi}}$ and ${\bm{x}}$) and an auxiliary factor (in $\omega$), namely

S[\gamma]({\bm{x}})=(2\pi)^{m-1}\int_{\mathbb{R}}\left[\frac{1}{(2\pi)^{m}}\int_{\mathbb{R}^{m}}\gamma^{\sharp}({\bm{\xi}}/\omega,\omega)e^{i{\bm{\xi}}\cdot{\bm{x}}}\mathrm{d}{\bm{\xi}}\right]\sigma^{\sharp}(\omega)|\omega|^{-m}\mathrm{d}\omega.

Now, we can see that the integration inside the brackets $[\cdots]$ becomes the Fourier inversion with respect to ${\bm{\xi}}$ and ${\bm{x}}$.

Step 3.

Because of the Fourier inversion, it is natural to assume that the unknown function $\gamma$ has the separation-of-variables form

\gamma_{f,\rho}^{\sharp}({\bm{\xi}}/\omega,\omega):=\widehat{f}({\bm{\xi}})\overline{\rho^{\sharp}(\omega)}, (3)

using an arbitrary function $\rho\in\mathcal{S}(\mathbb{R})$. Namely, it is composed of a principal factor $\widehat{f}$ containing the target function $f$, and an auxiliary factor $\rho^{\sharp}$ set only for the convergence of the integration in $\omega$. Then, we have

S[\gamma_{f,\rho}]({\bm{x}}) =(2\pi)^{m-1}\int_{\mathbb{R}}\left[\frac{1}{(2\pi)^{m}}\int_{\mathbb{R}^{m}}\widehat{f}({\bm{\xi}})e^{i{\bm{\xi}}\cdot{\bm{x}}}\mathrm{d}{\bm{\xi}}\right]\sigma^{\sharp}(\omega)\overline{\rho^{\sharp}(\omega)}|\omega|^{-m}\mathrm{d}\omega
=(\!(\sigma,\rho)\!)\frac{1}{(2\pi)^{m}}\int_{\mathbb{R}^{m}}\widehat{f}({\bm{\xi}})e^{i{\bm{\xi}}\cdot{\bm{x}}}\mathrm{d}{\bm{\xi}}
=(\!(\sigma,\rho)\!)f({\bm{x}}),

where we put

(\!(\sigma,\rho)\!):=(2\pi)^{m-1}\int_{\mathbb{R}}\sigma^{\sharp}(\omega)\overline{\rho^{\sharp}(\omega)}|\omega|^{-m}\mathrm{d}\omega.

In other words, the separation-of-variables expression $\gamma_{f,\rho}$ is a particular solution to the integral equation $S[\gamma]=cf$ with factor $c=(\!(\sigma,\rho)\!)\in\mathbb{C}$.

Furthermore, $\gamma_{f,\rho}$ is the ridgelet transform because it is rewritten as

\gamma_{f,\rho}^{\sharp}({\bm{a}},\omega)=\widehat{f}(\omega{\bm{a}})\overline{\rho^{\sharp}(\omega)},

and thus calculated as

\gamma_{f,\rho}({\bm{a}},b) =\frac{1}{2\pi}\int_{\mathbb{R}}\widehat{f}(\omega{\bm{a}})\overline{\rho^{\sharp}(\omega)e^{-i\omega b}}\mathrm{d}\omega
=\frac{1}{2\pi}\int_{\mathbb{R}^{m}\times\mathbb{R}}f({\bm{x}})\overline{\rho^{\sharp}(\omega)e^{i\omega({\bm{a}}\cdot{\bm{x}}-b)}}\mathrm{d}{\bm{x}}\mathrm{d}\omega
=\int_{\mathbb{R}^{m}}f({\bm{x}})\overline{\rho({\bm{a}}\cdot{\bm{x}}-b)}\mathrm{d}{\bm{x}},

which is exactly the definition of the ridgelet transform $R[f;\rho]$.

In conclusion, the separation-of-variables expression (3) is the way to naturally find the ridgelet transform. We note that Steps 1 and 2 can be understood as a change-of-frame from the frame spanned by neurons $\{{\bm{x}}\mapsto\sigma({\bm{a}}\cdot{\bm{x}}-b)\mid({\bm{a}},b)\in\mathbb{R}^{m}\times\mathbb{R}\}$, with which we are less familiar, to the frame spanned by (the tensor product of an auxiliary function and) plane waves $\{{\bm{x}}\mapsto\sigma^{\sharp}(\omega)e^{i{\bm{\xi}}\cdot{\bm{x}}}\mid({\bm{\xi}},\omega)\in\mathbb{R}^{m}\times\mathbb{R}\}$, with which we are much more familiar. Hence, in particular, the map $\gamma({\bm{a}},b)\mapsto\gamma^{\sharp}({\bm{a}}/\omega,\omega)|\omega|^{-m}$ can be understood as the associated coordinate transformation.
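As a sanity check of the whole procedure, here is a minimal numerical sketch (an illustration with arbitrarily chosen functions and quadrature grids, not from the paper) in dimension $m=1$: with Gaussian $f$ and $\sigma$, and $\rho=\sigma^{\prime\prime}$ (so that $(\!(\sigma,\rho)\!)=-2\pi$ as computed in § 1.3), a brute-force quadrature of $S[R[f;\rho]](x)$ should approximately return $-2\pi f(x)$.

import numpy as np

def f(x):     return np.exp(-x**2 / 2)                # target function
def sigma(t): return np.exp(-t**2 / 2)                # activation
def rho(t):   return (t**2 - 1) * np.exp(-t**2 / 2)   # rho = sigma'' (admissible pair)

# quadrature grids for the data x', the weights a, and the biases b
xs = np.arange(-8, 8, 0.05)
a_grid = np.arange(-20, 20, 0.1)
b_grid = np.arange(-50, 50, 0.1)
dx, da, db = xs[1] - xs[0], a_grid[1] - a_grid[0], b_grid[1] - b_grid[0]

# ridgelet transform R[f; rho](a, b) = int f(x) rho(a x - b) dx
R = np.empty((a_grid.size, b_grid.size))
for i, a in enumerate(a_grid):
    R[i] = (f(xs)[:, None] * rho(a * xs[:, None] - b_grid[None, :])).sum(0) * dx

def S_of_R(x):
    """Network S[R[f; rho]](x) = int int R(a, b) sigma(a x - b) da db."""
    return (R * sigma(a_grid[:, None] * x - b_grid[None, :])).sum() * da * db

for x in (-1.0, 0.0, 0.5):
    print(x, S_of_R(x), -2 * np.pi * f(x))   # the last two columns should roughly agree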

3 Case I: NN on Finite Field $\mathbb{F}_{p}:=\mathbb{Z}/p\mathbb{Z}$

As one of the simplest applications, we showcase the results by Yamasaki et al. (2023), a neural network on the finite field $\mathbb{F}_{p}:=\mathbb{Z}/p\mathbb{Z}\cong\{0,1,\ldots,p-1\mod p\}$ (with prime number $p$). That study aimed to design a quantum algorithm that efficiently computes the ridgelet transform, and the authors developed this example based on the demand to represent all data and parameters in finitely many qubits. To be precise, the authors dealt with functions on the discrete space $\mathbb{F}_{p}^{m}$, as a discretization of functions on a continuous space $\mathbb{R}^{m}$.

3.1 Fourier Transform

For any positive integers $n,m$, let $\mathbb{Z}_{n}^{m}:=(\mathbb{Z}/n\mathbb{Z})^{m}$ denote the product of cyclic groups. We note that the set of all real functions $f$ on $\mathbb{Z}_{n}^{m}$ is identified with the $n^{m}$-dimensional real vector space, i.e. $\{f:\mathbb{Z}_{n}^{m}\to\mathbb{R}\}\cong\mathbb{R}^{n^{m}}$, because each value $f({\bm{x}})$ of a function $f$ at ${\bm{x}}\in\mathbb{Z}_{n}^{m}$ can be identified with the ${\bm{x}}$th component of a vector ${\bm{a}}=(a_{{\bm{x}}})\in\mathbb{R}^{n^{m}}$. In particular, $L^{2}(\mathbb{Z}_{n}^{m})=\mathbb{R}^{n^{m}}$.

We use the Fourier transform on a product of cyclic groups as below.

Definition 3.1 (Fourier Transform on $\mathbb{Z}_{n}^{m}$).

For any $f\in L^{2}(\mathbb{Z}_{n}^{m})$, put

\widehat{f}({\bm{\xi}}):=\sum_{{\bm{x}}\in\mathbb{Z}_{n}^{m}}f({\bm{x}})e^{-2\pi i{\bm{\xi}}\cdot{\bm{x}}/n},\quad{\bm{\xi}}\in\mathbb{Z}_{n}^{m}.

Theorem 3.1 (Inversion Formula).

For any $f\in L^{2}(\mathbb{Z}_{n}^{m})$,

f({\bm{x}})=\frac{1}{|\mathbb{Z}_{n}^{m}|}\sum_{{\bm{\xi}}\in\mathbb{Z}_{n}^{m}}\widehat{f}({\bm{\xi}})e^{2\pi i{\bm{\xi}}\cdot{\bm{x}}/n},\quad{\bm{x}}\in\mathbb{Z}_{n}^{m}.

The proof is immediate from the so-called orthogonality of characters, i.e., the identity $\sum_{g\in\mathbb{Z}_{n}}e^{2\pi ig(t-s)/n}=|\mathbb{Z}_{n}|\delta_{ts}$ $(t,s\in\mathbb{Z}_{n})$, where $\delta_{ts}$ is the Kronecker delta.
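A quick numerical check of this orthogonality identity (an illustration with an arbitrary choice of $n$, not from the paper):

import numpy as np

n = 7
g = np.arange(n)
for t in range(n):
    for s in range(n):
        # sum_{g in Z_n} exp(2 pi i g (t - s) / n) equals n if t == s and 0 otherwise
        val = np.exp(2j * np.pi * g * (t - s) / n).sum()
        assert np.isclose(val, n if t == s else 0, atol=1e-8)
print("orthogonality of characters verified for n =", n)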

We note that although the Fourier transform itself can be defined on any cyclic group $\mathbb{Z}_{n}$ (namely, $n$ need not be prime), a finite field $\mathbb{F}_{p}(=\mathbb{Z}_{p})$ is assumed in order to perform the change-of-variables step.

3.2 Network Design

Remarkably, the $\mathbb{F}_{p}$-version of the arguments is obtained by formally replacing every integration in the $\mathbb{R}$-version of the arguments with a summation.

Definition 3.2 (NN on $\mathbb{F}_{p}^{m}$).

For any functions $\gamma\in L^{2}(\mathbb{F}_{p}^{m}\times\mathbb{F}_{p})$ and $\sigma\in L^{\infty}(\mathbb{F}_{p})$, put

S[\gamma]({\bm{x}}):=\sum_{({\bm{a}},b)\in\mathbb{F}_{p}^{m}\times\mathbb{F}_{p}}\gamma({\bm{a}},b)\sigma({\bm{a}}\cdot{\bm{x}}-b),\quad{\bm{x}}\in\mathbb{F}_{p}^{m}.

Again, in Yamasaki et al. (2023), it is introduced as a discretization of the corresponding network on the continuous space $\mathbb{R}^{m}$.

3.3 Ridgelet Transform

Theorem 3.2 (Reconstruction Formula).

For any function $\rho\in L^{\infty}(\mathbb{F}_{p})$, put

R[f;\rho]({\bm{a}},b) :=\sum_{{\bm{x}}\in\mathbb{F}_{p}^{m}}f({\bm{x}})\overline{\rho({\bm{a}}\cdot{\bm{x}}-b)},\quad({\bm{a}},b)\in\mathbb{F}_{p}^{m}\times\mathbb{F}_{p}
(\!(\sigma,\rho)\!) :=|\mathbb{F}_{p}|^{m-1}\sum_{\omega\in\mathbb{F}_{p}}\sigma^{\sharp}(\omega)\overline{\rho^{\sharp}(\omega)}.

Then, for any $f\in L^{2}(\mathbb{F}_{p}^{m})$,

S[R[f;\rho]]=(\!(\sigma,\rho)\!)f.

In other words, the fully-connected network on the finite field $\mathbb{F}_{p}^{m}$ can exactly represent any square integrable function on $\mathbb{F}_{p}^{m}$.

Finally, the following proof shows that the result for this new class of networks is obtained by systematically following the same three steps as in the original argument.

Proof.

Step 1. Turn to the Fourier expression:

S[\gamma]({\bm{x}}) :=\sum_{({\bm{a}},b)\in\mathbb{F}_{p}^{m}\times\mathbb{F}_{p}}\gamma({\bm{a}},b)\sigma({\bm{a}}\cdot{\bm{x}}-b)
=\frac{1}{|\mathbb{F}_{p}|}\sum_{({\bm{a}},\omega)\in\mathbb{F}_{p}^{m}\times\mathbb{F}_{p}}\gamma^{\sharp}({\bm{a}},\omega)\sigma^{\sharp}(\omega)e^{2\pi i\omega{\bm{a}}\cdot{\bm{x}}/p}

Step 2. Change variables ${\bm{\xi}}=\omega{\bm{a}}$:

=\frac{1}{|\mathbb{F}_{p}|}\sum_{({\bm{\xi}},\omega)\in\mathbb{F}_{p}^{m}\times\mathbb{F}_{p}}\gamma^{\sharp}({\bm{\xi}}/\omega,\omega)\sigma^{\sharp}(\omega)e^{2\pi i{\bm{\xi}}\cdot{\bm{x}}/p}

Step 3. Put the separation-of-variables form $\gamma^{\sharp}({\bm{\xi}}/\omega,\omega)=\widehat{f}({\bm{\xi}})\overline{\rho^{\sharp}(\omega)}$:

=\left(|\mathbb{F}_{p}|^{m-1}\sum_{\omega\in\mathbb{F}_{p}}\sigma^{\sharp}(\omega)\overline{\rho^{\sharp}(\omega)}\right)\left(\frac{1}{|\mathbb{F}_{p}|^{m}}\sum_{{\bm{\xi}}\in\mathbb{F}_{p}^{m}}\widehat{f}({\bm{\xi}})e^{2\pi i{\bm{\xi}}\cdot{\bm{x}}/p}\right)
=(\!(\sigma,\rho)\!)f({\bm{x}}),

and we can verify $\gamma=R[f;\rho]$. ∎
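A small numerical check of this reconstruction (an illustration with arbitrary choices of $p$, $m$, $\sigma$, $\rho$, and $f$; $\rho$ is centered so that $\rho^{\sharp}(0)=0$ and the $\omega=0$ term drops out of the change of variables):

import numpy as np
from itertools import product

p, m = 5, 2
rng = np.random.default_rng(0)
sigma = rng.normal(size=p)                     # arbitrary activation sigma: F_p -> R
rho = rng.normal(size=p); rho -= rho.mean()    # arbitrary rho with zero mean (rho^sharp(0) = 0)
f = rng.normal(size=(p,) * m)                  # arbitrary target f: F_p^m -> R

points = [np.array(x) for x in product(range(p), repeat=m)]   # all x in F_p^m

def R(a, b):
    """Ridgelet transform R[f; rho](a, b) = sum_x f(x) rho(a.x - b)."""
    return sum(f[tuple(x)] * rho[(a @ x - b) % p] for x in points)

def S_of_R(x):
    """Network S[R[f; rho]](x) = sum_{a, b} R(a, b) sigma(a.x - b)."""
    return sum(R(a, b) * sigma[(a @ x - b) % p] for a in points for b in range(p))

# constant ((sigma, rho)) = |F_p|^{m-1} sum_w sigma^sharp(w) conj(rho^sharp(w))
const = p ** (m - 1) * (np.fft.fft(sigma) * np.conj(np.fft.fft(rho))).sum().real

x0 = points[7]
print(S_of_R(x0), const * f[tuple(x0)])        # the two numbers should agree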

4 Case II: Group Convolutional NN on Hilbert Space \mathcal{H}

Next, we showcase the results by Sonoda et al. (2022a). Since there are various versions of convolutional neural networks (CNNs), their approximation properties (such as the universality) have been investigated individually depending on the network architecture. The method presented here defines the generalized group convolutional neural network (GCNN) that encompasses a wide range of CNNs, and provides a powerful result by unifying the expressivity analysis in a constructive and simple manner by using ridgelet transforms.

4.1 Fourier Transform

Since the input to a CNN is a signal (or a function), the Fourier transform corresponding to a naive integral representation is the Fourier transform on the space of signals, which is typically an infinite-dimensional space $\mathcal{H}$ of functions. Although the Fourier transform on an infinite-dimensional Hilbert space has been well developed in probability theory, the mathematics tends to become excessively advanced for the expected results. One of the important ideas of this study is to induce the Fourier transform of $\mathbb{R}^{m}$ on a finite-dimensional subspace $\mathcal{H}_{m}$ of $\mathcal{H}$ instead of using the Fourier transform on the entire space $\mathcal{H}$. To be precise, we simply take an $m$-dimensional orthonormal frame $F_{m}=\{h_{i}\}_{i=1}^{m}$ of $\mathcal{H}$, put the linear span $\mathcal{H}_{m}:=\operatorname{span}F_{m}=\{\sum_{i=1}^{m}c_{i}h_{i}\mid c_{i}\in\mathbb{R}\}$, and identify each element $f=\sum_{i=1}^{m}c_{i}h_{i}\in\mathcal{H}_{m}\subset\mathcal{H}$ with the point ${\bm{c}}=(c_{1},\ldots,c_{m})\in\mathbb{R}^{m}$. Obviously, this embedding depends on the choice of the $m$-frame $F_{m}$, yet it drastically simplifies the main theory itself.

Definition 4.1 (Fourier Transform on a Hilbert Space $\mathcal{H}_{m}\subset\mathcal{H}$).

Let $\mathcal{H}$ be a Hilbert space, $\mathcal{H}_{m}\subset\mathcal{H}$ be an $m$-dimensional subspace, and $\lambda$ be the Lebesgue measure induced from $\mathbb{R}^{m}$. Put

\widehat{f}(\xi):=\int_{\mathcal{H}_{m}}f(x)e^{-i\langle x,\xi\rangle}\mathrm{d}\lambda(x),\quad\xi\in\mathcal{H}_{m}.

Then, obviously from the construction, we have the following.

Theorem 4.1.

For any $f\in L^{2}(\mathcal{H}_{m})$,

\frac{1}{(2\pi)^{m}}\int_{\mathcal{H}_{m}}\widehat{f}(\xi)e^{i\langle x,\xi\rangle}\mathrm{d}\lambda(\xi)=f(x),\quad x\in\mathcal{H}_{m}.

4.2 Network Design

Another important idea is to deal with various group convolutions in a uniform manner by using the linear representation of groups.

Definition 4.2 (Generalized Group Convolution).

Let $G$ be a group, $\mathcal{H}$ be a Hilbert space, and $T:G\to GL(\mathcal{H})$ be a group representation of $G$. The $(G,T)$-convolution is given by

(a*x)(g):=\langle T_{g^{-1}}[x],a\rangle_{\mathcal{H}},\quad a,x\in\mathcal{H}.

As clarified in Sonoda et al. (2022a), the generalized convolution covers a variety of typical examples such as (1) the classical group convolution $\int_{G}x(h)a(h^{-1}g)\mathrm{d}h$, (2) the discrete cyclic convolution for multi-channel digital images, (3) DeepSets, or permutation-equivariant maps, (4) the continuous cyclic convolution for signals, and (5) $\mathrm{E}(n)$-equivariant maps.
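A tiny sketch of how Definition 4.2 specializes (an assumed example, not from the paper): for $G=\mathbb{Z}_{n}$ acting on $\mathcal{H}=\mathbb{R}^{n}$ by cyclic shifts $T_{g}[x](t)=x(t-g)$, the generalized convolution $(a*x)(g)=\langle T_{g^{-1}}[x],a\rangle$ is the cyclic cross-correlation of the signal $x$ with the filter $a$ (equivalently, a cyclic convolution with the reflected filter).

import numpy as np

n = 6
rng = np.random.default_rng(0)
x, a = rng.normal(size=n), rng.normal(size=n)   # a signal x and a filter a on Z_n

def T(g, vec):
    # shift representation of Z_n on R^n: (T_g vec)(t) = vec(t - g)
    return np.roll(vec, g)

# generalized (G, T)-convolution: (a * x)(g) = <T_{g^{-1}}[x], a>
conv = np.array([np.dot(T(-g, x), a) for g in range(n)])

# the same quantity written out as a cyclic sum: sum_s x(s) a(s - g)
direct = np.array([sum(x[s] * a[(s - g) % n] for s in range(n)) for g in range(n)])

print(np.allclose(conv, direct))   # True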

Definition 4.3 (Group CNN).

Let $\mathcal{H}_{m}\subset\mathcal{H}$ be an $m$-dimensional subspace equipped with the Lebesgue measure $\lambda$. Put

S[\gamma](x)(g):=\int_{\mathcal{H}_{m}\times\mathbb{R}}\gamma(a,b)\sigma((a*x)(g)-b)\mathrm{d}\lambda(a)\mathrm{d}b,\quad x\in\mathcal{H},\ g\in G.

Here, the integration runs over all possible convolution filters $a$. For the sake of simplicity, however, the domain $\mathcal{H}_{m}$ of filters is restricted to an $m$-dimensional subspace of the entire space $\mathcal{H}$.

4.3 Ridgelet Transform

In the following, $e\in G$ denotes the identity element.

Definition 4.4 ($(G,T)$-Equivariance).

A (nonlinear) map $f:\mathcal{H}\to\mathbb{C}^{G}$ is $(G,T)$-equivariant when

f(T_{g}[x])(h)=f(x)(g^{-1}h),\quad\forall x\in\mathcal{H}_{m},\ g,h\in G.

We note that the proposed network is inherently $(G,T)$-equivariant:

S[\gamma](T_{g}[x])(h)=S[\gamma](x)(g^{-1}h),\quad\forall x\in\mathcal{H},\ g,h\in G.
Definition 4.5 (Ridgelet Transform).

For any measurable functions $f:\mathcal{H}_{m}\to\mathbb{C}^{G}$ and $\rho:\mathbb{R}\to\mathbb{C}$, put

R[f;\rho](a,b):=\int_{\mathcal{H}_{m}}f(x)(e)\overline{\rho(\langle a,x\rangle_{\mathcal{H}}-b)}\mathrm{d}\lambda(x).

It is remarkable that the product of $a$ and $x$ inside $\rho$ is not the convolution $a*x$ but the scalar product $\langle a,x\rangle_{\mathcal{H}}$. This is essentially because (1) $f$ will be assumed to be group equivariant, and (2) the network is group equivariant by definition.

Theorem 4.2 (Reconstruction Formula).

Suppose that $f$ is $(G,T)$-equivariant and $f(\bullet)(e)\in L^{2}(\mathcal{H}_{m})$; then $S[R[f;\rho]]=(\!(\sigma,\rho)\!)f$.

In other words, a continuous GCNN can represent any square-integrable group-equivariant function. Again, the proof is performed by systematically following the three steps as below.

Proof.

Step 1. Turn to a Fourier expression:

S[\gamma](x)(g) =\int_{\mathcal{H}_{m}\times\mathbb{R}}\gamma(a,b)\sigma(\langle T_{g^{-1}}[x],a\rangle_{\mathcal{H}}-b)\mathrm{d}a\mathrm{d}b
=\frac{1}{2\pi}\int_{\mathcal{H}_{m}\times\mathbb{R}}\gamma^{\sharp}(a,\omega)\sigma^{\sharp}(\omega)e^{i\omega\langle T_{g^{-1}}[x],a\rangle_{\mathcal{H}}}\mathrm{d}a\mathrm{d}\omega

Step 2. Change variables $(a,\omega)=(\xi/\omega,\omega)$ with $\mathrm{d}a\mathrm{d}\omega=|\omega|^{-m}\mathrm{d}\xi\mathrm{d}\omega$:

=\frac{1}{2\pi}\int_{\mathcal{H}_{m}\times\mathbb{R}}\gamma^{\sharp}(\xi/\omega,\omega)\sigma^{\sharp}(\omega)e^{i\langle T_{g^{-1}}[x],\xi\rangle_{\mathcal{H}}}|\omega|^{-m}\mathrm{d}\xi\mathrm{d}\omega.

Step 3. Put the separation-of-variables form $\gamma_{f,\rho}^{\sharp}(\xi/\omega,\omega):=\widehat{f}(\xi)(e)\overline{\rho^{\sharp}(\omega)}$:

=\frac{1}{2\pi}\int_{\mathcal{H}_{m}}\widehat{f}(\xi)(e)e^{i\langle T_{g^{-1}}[x],\xi\rangle_{\mathcal{H}}}\mathrm{d}\lambda(\xi)\int_{\mathbb{R}}\sigma^{\sharp}(\omega)\overline{\rho^{\sharp}(\omega)}|\omega|^{-m}\mathrm{d}\omega
=(\!(\sigma,\rho)\!)f(x)(g),

and we can verify $\gamma_{f,\rho}=R[f;\rho]$. ∎

4.4 Literature in Geometric Deep Learning

General convolution networks for geometric/algebraic domains have been developed for capturing invariance/equivariance to symmetry in a data-efficient manner (Bruna and Mallat, 2013; Cohen and Welling, 2016; Zaheer et al., 2017; Kondor and Trivedi, 2018; Cohen et al., 2019; Kumagai and Sannai, 2020). To date, quite a variety of convolution networks have been proposed for grids, finite sets, graphs, groups, homogeneous spaces and manifolds. We refer to Bronstein et al. (2021) for a detailed survey of the so-called geometric deep learning.

Since a universal approximation theorem (UAT) is a corollary of a reconstruction formula $S[R[f;\rho]]=(\!(\sigma,\rho)\!)f$, the 3-step Fourier expression method has provided a variety of UATs for $\sigma(ax-b)$-type networks in a unified manner. Here, we remark that the UATs of individual convolution networks have already been shown (Maron et al., 2019; Zhou, 2020; Yarotsky, 2022). However, in addition to the above-mentioned advantages, the wide coverage of activation functions $\sigma$ is another strong advantage. In particular, we do not need to rely on specific features of ReLU, nor on Taylor expansions, density arguments, or randomization assumptions to deal with nonlinear activation functions.

5 Case III: NN on Noncompact Symmetric Space $X=G/K$

Next, we showcase the results by Sonoda et al. (2022b). When the data are known to lie on a certain manifold, it is natural to develop NNs on that manifold, in order to explicitly incorporate the inductive bias. Since there is no such thing as a standard inner product or affine map on a manifold, various NNs have been proposed based on geometric considerations and implementation constraints. The main idea of this study is to start from the Fourier transform on a manifold and induce an NN on the manifold together with its reconstruction formula. On compact groups such as spheres $\SS^{m-1}$, Fourier analysis is well known through the Peter–Weyl theorem; in this study, however, the authors focused on noncompact symmetric spaces $G/K$, such as the hyperbolic space $\mathbb{H}^{m}$ and the space $\mathbb{P}_{m}$ of positive definite matrices, and developed NNs based on the Helgason–Fourier transform on noncompact symmetric spaces.

5.1 Noncompact Symmetric Space G/KG/K

We refer to Helgason (1984, Introduction) and Helgason (2008, Chapter III). A noncompact symmetric space is a homogeneous space $G/K$ with nonpositive sectional curvature on which $G$ acts transitively. Two important examples are the hyperbolic space $\mathbb{H}^{m}$ (Figure 1) and the SPD manifold $\mathbb{P}_{m}$. See also Appendices A and B for more details on these spaces.

Figure 1: The Poincaré disk $\mathbb{B}^{2}$ is a noncompact symmetric space $SU(1,1)/SO(2)$. Shown are the Poincaré disk $\mathbb{B}^{2}$, its boundary $\partial\mathbb{B}^{2}$, a point ${\bm{x}}$ (magenta), the horocycle $\xi({\bm{y}},{\bm{u}})$ (magenta) through a point ${\bm{y}}$ tangent to the boundary at ${\bm{u}}$, and two geodesics (solid black) orthogonal to the boundary at ${\bm{u}}$ through ${\bm{o}}$ and ${\bm{x}}$ respectively. The signed composite distance $\langle{\bm{y}},{\bm{u}}\rangle$ from the origin ${\bm{o}}$ to the horocycle $\xi({\bm{y}},{\bm{u}})$ can be visualized as the Riemannian distance from ${\bm{o}}$ to the point ${\bm{y}}_{0}$. Similarly, the distance between the point ${\bm{x}}$ and the horocycle $\xi({\bm{y}},{\bm{u}})$ is understood as the Riemannian distance between ${\bm{x}}$ and ${\bm{y}}_{x}$ along the geodesic, or equivalently, between ${\bm{x}}_{0}$ and ${\bm{y}}_{0}$. See Appendix A for more details.

Let $G$ be a connected semisimple Lie group with finite center, and let $G=KAN$ be its Iwasawa decomposition. Namely, it is the unique diffeomorphic decomposition of $G$ into subgroups $K,A$, and $N$, where $K$ is maximal compact, $A$ is maximal abelian, and $N$ is maximal nilpotent. For example, when $G=GL(m,\mathbb{R})$ (general linear group), then $K=O(m)$ (orthogonal group), $A=D_{+}(m)$ (all positive diagonal matrices), and $N=T_{1}(m)$ (all upper triangular matrices with ones on the diagonal).

Let $\mathrm{d}g,\mathrm{d}k,\mathrm{d}a$, and $\mathrm{d}n$ be left invariant measures on $G,K,A$, and $N$ respectively.

Let $\mathfrak{g},\mathfrak{k},\mathfrak{a}$, and $\mathfrak{n}$ be the Lie algebras of $G,K,A$, and $N$ respectively. By a fundamental property of abelian Lie algebras, $\mathfrak{a}$ and its dual $\mathfrak{a}^{*}$ are vector spaces of the same dimension, and thus they can be identified with $\mathbb{R}^{r}$ for some $r$, namely $\mathfrak{a}=\mathfrak{a}^{*}=\mathbb{R}^{r}$. We call $r:=\dim\mathfrak{a}$ the rank of $X$. For example, when $G=GL(m,\mathbb{R})$, then $\mathfrak{g}=\mathfrak{gl}_{m}=\mathbb{R}^{m\times m}$ (all $m\times m$ real matrices), $\mathfrak{k}=\mathfrak{o}_{m}$ (all skew-symmetric matrices), $\mathfrak{a}=D(m)$ (all diagonal matrices), and $\mathfrak{n}=T_{0}(m)$ (all strictly upper triangular matrices).

Definition 5.1.

Let $X:=G/K$ be a noncompact symmetric space, namely, a Riemannian manifold composed of all the left cosets

X:=G/K:=\{x=gK\mid g\in G\}.

Using the identity element $e$ of $G$, let $o=eK$ be the origin of $X$. By the construction of $X$, the group $G$ acts transitively on $X$, and we let $g[x]:=ghK$ (for $x=hK$) denote the $G$-action of $g\in G$ on $X$. In particular, any point $x\in X$ can always be written as $x=g[o]$ for some $g\in G$. Let $\mathrm{d}x$ denote the left $G$-invariant measure on $X$.

Example 5.1 (Hyperbolic Space $\mathbb{H}^{m}=SO^{+}(1,m)/O(m)$).

It is used for embedding words and tree-structured datasets.

Example 5.2 (SPD Manifold $\mathbb{P}_{m}=GL(m)/O(m)$).

It is the manifold of positive definite matrices, such as covariance matrices.

5.2 Boundary $\partial X$, Horosphere $\xi$, and Vector-Valued Composite Distance $\langle x,u\rangle$

We further introduce three geometric objects in the noncompact symmetric space $G/K$. In comparison to the Euclidean space $\mathbb{R}^{m}$, the boundary $\partial X$ corresponds to “the set of all points at infinity” $\lim_{r\to+\infty}\{r{\bm{u}}\mid|{\bm{u}}|=1,{\bm{u}}\in\mathbb{R}^{m}\}$, a horosphere $\xi$ through a point $x\in X$ with normal $u\in\partial X$ corresponds to a straight line ${\bm{\xi}}$ through a point ${\bm{x}}\in\mathbb{R}^{m}$ with normal ${\bm{u}}\in\SS^{m-1}$, and the vector distance $\langle x,u\rangle$ between the origin $o\in X$ and the horosphere $\xi(x,u)$ corresponds to the Riemannian distance between the origin $\mathbf{0}\in\mathbb{R}^{m}$ and the straight line ${\bm{\xi}}({\bm{x}},{\bm{u}})$.

Definition 5.2.

Let $M:=C_{K}(A):=\{k\in K\mid ka=ak\mbox{ for all }a\in A\}$ be the centralizer of $A$ in $K$, and let

\partial X:=K/M:=\{u=kM\mid k\in K\}

be the boundary (or ideal sphere) of $X$, which is known to be a compact manifold. Let $\mathrm{d}u$ denote the uniform probability measure on $\partial X$.

For example, when $K=O(m)$ and $A=D_{+}(m)$, then $M=D_{\pm 1}$ (the subgroup of $K$ consisting of diagonal matrices with entries $\pm 1$).

Definition 5.3.

Let

\Xi:=G/MN:=\{\xi=gMN\mid g\in G\}

be the space of horospheres.

Here, the basic horospheres are the $N$-orbit $\xi_{o}:=N[o]=\{n[o]\mid n\in N\}$, which is a horosphere passing through the origin $x=o$ with normal $u=eM$, and $ka[\xi_{o}]=kaN[o]$, which is a horosphere through the point $x=ka[o]$ with normal $u=kM$. In fact, any horosphere can be represented as $\xi(kan[o],kM)$ since $kaN=kanN$ for any $n\in N$. We refer to Helgason (2008, Ch. I, § 1) and Bartolucci et al. (2021, § 3.5) for more details on horospheres and boundaries.

Definition 5.4.

As a consequence of the Iwasawa decomposition, for any $g\in G$ there uniquely exists an $r$-dimensional vector $H(g)\in\mathfrak{a}$ satisfying $g\in Ke^{H(g)}N$. For any $(x,u)=(g[o],kM)\in X\times\partial X$, put

\langle x,u\rangle:=-H(g^{-1}k)\in\mathfrak{a}\cong\mathbb{R}^{r},

which is understood as the $r$-dimensional vector-valued distance, called the composite distance, from the origin $o\in X$ to the horosphere $\xi(x,u)$ through the point $x$ with normal $u$.

Here, “vector-valued distance” means that its $\ell^{2}$-norm coincides with the Riemannian length, that is, $|\langle x,u\rangle|=|d(o,\xi(x,u))|$. We refer to Helgason (2008, Ch. II, § 1, 4) and Kapovich et al. (2017, § 2) for more details on the vector-valued composite distance.

5.3 Fourier Transform

Based on the preparations so far, we introduce the Fourier transform on $G/K$, known as the Helgason–Fourier transform. Let $W$ be the Weyl group of $G/K$, and let $|W|$ denote its order. Let ${\bm{c}}(\lambda)$ be the Harish-Chandra ${\bm{c}}$-function for $G$. We refer to Helgason (1984, Theorem 6.14, Ch. IV) for the closed-form expression of the ${\bm{c}}$-function.

Definition 5.5 (Helgason–Fourier Transform).

For any measurable function $f$ on $X$, put

\widehat{f}(\lambda,u):=\int_{X}f(x)e^{(-i\lambda+\varrho)\langle x,u\rangle}\mathrm{d}x,\quad(\lambda,u)\in\mathfrak{a}^{*}\times\partial X

with a certain constant vector $\varrho\in\mathfrak{a}^{*}$. Here, the exponent $(-i\lambda+\varrho)\langle x,u\rangle$ is understood as the action of the functional $-i\lambda+\varrho\in\mathfrak{a}^{*}$ on the vector $\langle x,u\rangle\in\mathfrak{a}$.

This is understood as a “Fourier transform” because $e^{(-i\lambda+\varrho)\langle x,u\rangle}$ is an eigenfunction of the Laplace–Beltrami operator $\Delta_{X}$ on $X$.

Theorem 5.1 (Inversion Formula).

For any square integrable function $f\in L^{2}(X)$,

f(x)=\frac{1}{|W|}\int_{\mathfrak{a}^{*}\times\partial X}\widehat{f}(\lambda,u)e^{(i\lambda+\varrho)\langle x,u\rangle}\frac{\mathrm{d}\lambda\mathrm{d}u}{|{\bm{c}}(\lambda)|^{2}},\quad x\in X.

We refer to Helgason (2008, Theorems 1.3 and 1.5, Ch. III) for more details on the inversion formula.

5.4 Network Design

In accordance with the geometric perspective, it is natural to define the network as below.

Definition 5.6 (NN on Noncompact Symmetric Space $G/K$).

For any measurable functions $\sigma:\mathbb{R}\to\mathbb{C}$ and $\gamma:\mathfrak{a}^{*}\times\partial X\times\mathbb{R}\to\mathbb{C}$, put

S[\gamma](x):=\int_{\mathfrak{a}^{*}\times\partial X\times\mathbb{R}}\gamma(a,u,b)\sigma(a\langle x,u\rangle-b)e^{\varrho\langle x,u\rangle}\mathrm{d}a\mathrm{d}u\mathrm{d}b,\quad x\in G/K.

Remarkably, the scalar product ${\bm{a}}\cdot{\bm{x}}$ (or $a{\bm{u}}\cdot{\bm{x}}$ in polar coordinates) in the Euclidean setting is replaced with the distance function $a\langle x,u\rangle$ in the manifold setting. In Sonoda et al. (2022b), the authors instantiate two important examples as below.

Example 5.3 (Continuous Horospherical Hyperbolic NN).

On the Poincaré ball model $\mathbb{B}^{m}:=\{{\bm{x}}\in\mathbb{R}^{m}\mid|{\bm{x}}|<1\}$ equipped with the Riemannian metric $\mathfrak{g}=4(1-|{\bm{x}}|^{2})^{-2}\sum_{i=1}^{m}\mathrm{d}x_{i}\otimes\mathrm{d}x_{i}$,

S[\gamma]({\bm{x}}):=\int_{\mathbb{R}\times\partial\mathbb{B}^{m}\times\mathbb{R}}\gamma(a,{\bm{u}},b)\sigma(a\langle{\bm{x}},{\bm{u}}\rangle-b)e^{\varrho\langle{\bm{x}},{\bm{u}}\rangle}\mathrm{d}a\mathrm{d}{\bm{u}}\mathrm{d}b,\quad{\bm{x}}\in\mathbb{B}^{m}

where $\varrho=(m-1)/2$ and $\langle{\bm{x}},{\bm{u}}\rangle=\log\left(\frac{1-|{\bm{x}}|_{E}^{2}}{|{\bm{x}}-{\bm{u}}|_{E}^{2}}\right)$ for $({\bm{x}},{\bm{u}})\in\mathbb{B}^{m}\times\partial\mathbb{B}^{m}$.
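A minimal sketch (illustrative, with hypothetical parameter values and tanh as a stand-in activation) of a single neuron of this horospherical hyperbolic network: the feature $\sigma(a\langle{\bm{x}},{\bm{u}}\rangle-b)e^{\varrho\langle{\bm{x}},{\bm{u}}\rangle}$ evaluated on the Poincaré disk using the closed-form $\langle{\bm{x}},{\bm{u}}\rangle$ above.

import numpy as np

def composite_distance(x, u):
    # <x,u> = log((1 - |x|^2) / |x - u|^2), x in the Poincare ball, u on the boundary
    return np.log((1 - np.dot(x, x)) / np.dot(x - u, x - u))

def horospherical_neuron(x, u, a, b, sigma=np.tanh):
    m = x.size
    rho_const = (m - 1) / 2                    # the constant called varrho in the text
    t = composite_distance(x, u)
    return sigma(a * t - b) * np.exp(rho_const * t)

x = np.array([0.3, -0.2])                       # a point in the Poincare disk B^2
u = np.array([np.cos(1.0), np.sin(1.0)])        # a boundary point on the unit circle
print(horospherical_neuron(x, u, a=2.0, b=0.5))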

Example 5.4 (Continuous Horospherical SPD Net).

On the SPD manifold $\mathbb{P}_{m}$,

S[\gamma](x):=\int_{\mathbb{R}^{m}\times\partial\mathbb{P}_{m}\times\mathbb{R}}\gamma({\bm{a}},u,b)\sigma({\bm{a}}\cdot\langle x,u\rangle-b)e^{\bm{\varrho}\cdot\langle x,u\rangle}\mathrm{d}{\bm{a}}\mathrm{d}u\mathrm{d}b,\quad x\in\mathbb{P}_{m}

where $\bm{\varrho}=(-\frac{1}{2},\ldots,-\frac{1}{2},\frac{m-1}{4})$ and $\langle x,u\rangle=\frac{1}{2}\log\lambda\left(uxu^{\top}\right)$ for $(x,u)\in\mathbb{P}_{m}\times\partial\mathbb{P}_{m}$, where $\lambda(x)$ denotes the diagonal in the Cholesky decomposition of $x$.

5.5 Ridgelet Transform

Definition 5.7 (Ridgelet Transform).

For any measurable functions $f:X\to\mathbb{C}$ and $\rho:\mathbb{R}\to\mathbb{C}$, put

R[f;\rho](a,u,b):=\int_{X}{\bm{c}}[f](x)\overline{\rho(a\langle x,u\rangle-b)}e^{\varrho\langle x,u\rangle}\mathrm{d}x,
{\bm{c}}[f](x):=\int_{\mathfrak{a}^{*}\times\partial X}\widehat{f}(\lambda,u)e^{(i\lambda+\varrho)\langle x,u\rangle}\frac{\mathrm{d}\lambda\mathrm{d}u}{|W|\,|{\bm{c}}(\lambda)|^{4}},
(\!(\sigma,\rho)\!):=\frac{|W|}{2\pi}\int_{\mathbb{R}}\sigma^{\sharp}(\omega)\overline{\rho^{\sharp}(\omega)}|\omega|^{-r}\mathrm{d}\omega.

Here ${\bm{c}}[f]$ is defined as a multiplier satisfying $\widehat{{\bm{c}}[f]}(\lambda,u)=\widehat{f}(\lambda,u)|{\bm{c}}(\lambda)|^{-2}$.

Theorem 5.2 (Reconstruction Formula).

Let $\sigma\in\mathcal{S}^{\prime}(\mathbb{R})$ and $\rho\in\mathcal{S}(\mathbb{R})$. Then, for any square integrable function $f$ on $X$, we have

S[R[f;\rho]](x)=\int_{\mathfrak{a}^{*}\times\partial X\times\mathbb{R}}R[f;\rho](a,u,b)\sigma(a\langle x,u\rangle-b)e^{\varrho\langle x,u\rangle}\mathrm{d}a\mathrm{d}u\mathrm{d}b=(\!(\sigma,\rho)\!)f(x).

In other words, the fully-connected network on the noncompact symmetric space $G/K$ can represent any square-integrable function. Again, the proof is performed by systematically following the three steps as below.

Proof.

We identify the scale parameter $a\in\mathfrak{a}^{*}$ with the vector ${\bm{a}}\in\mathbb{R}^{r}$.

Step 1. Turn to a Fourier expression:

S[\gamma](x) :=\int_{\mathbb{R}^{r}\times\partial X\times\mathbb{R}}\gamma({\bm{a}},u,b)\sigma({\bm{a}}\cdot\langle x,u\rangle-b)e^{\varrho\langle x,u\rangle}\mathrm{d}{\bm{a}}\mathrm{d}u\mathrm{d}b
=\frac{1}{2\pi}\int_{\mathbb{R}^{r}\times\partial X\times\mathbb{R}}\gamma^{\sharp}({\bm{a}},u,\omega)\sigma^{\sharp}(\omega)e^{(i\omega{\bm{a}}+\varrho)\langle x,u\rangle}\mathrm{d}{\bm{a}}\mathrm{d}u\mathrm{d}\omega

Step 2. Change variables $({\bm{a}},\omega)=({\bm{\lambda}}/\omega,\omega)$ with $\mathrm{d}{\bm{a}}\mathrm{d}\omega=|\omega|^{-r}\mathrm{d}{\bm{\lambda}}\mathrm{d}\omega$:

=\frac{1}{2\pi}\int_{\mathbb{R}}\left[\int_{\mathfrak{a}^{*}\times\partial X}\gamma^{\sharp}(\lambda/\omega,u,\omega)e^{(i\lambda+\varrho)\langle x,u\rangle}\mathrm{d}\lambda\mathrm{d}u\right]\sigma^{\sharp}(\omega)|\omega|^{-r}\mathrm{d}\omega.

Step 3. Put the separation-of-variables form $\gamma_{f,\rho}^{\sharp}(\lambda/\omega,u,\omega)=\widehat{f}(\lambda,u)\overline{\rho^{\sharp}(\omega)}|{\bm{c}}(\lambda)|^{-2}$:

=\left(\frac{|W|}{2\pi}\int_{\mathbb{R}}\sigma^{\sharp}(\omega)\overline{\rho^{\sharp}(\omega)}|\omega|^{-r}\mathrm{d}\omega\right)\left(\int_{\mathfrak{a}^{*}\times\partial X}\widehat{f}(\lambda,u)e^{(i\lambda+\varrho)\langle x,u\rangle}\frac{\mathrm{d}\lambda\mathrm{d}u}{|W|\,|{\bm{c}}(\lambda)|^{2}}\right)
=(\!(\sigma,\rho)\!)f(x),

and we can verify $\gamma_{f,\rho}=R[f;\rho]$. ∎

5.6 Literature in Hyperbolic Neural Networks

The hyperbolic neural network (HNN) (Ganea et al., 2018; Gulcehre et al., 2019; Shimizu et al., 2021) is another emerging direction of geometric deep learning, inspired by the empirical observations that datasets with tree or hierarchical structure can be efficiently embedded into hyperbolic spaces (Krioukov et al., 2010; Nickel and Kiela, 2017, 2018; Sala et al., 2018). We note that designing an FC layer $\sigma(\langle a,x\rangle-b)$ on a manifold $X$ is less trivial, because neither the scalar product $\langle a,x\rangle$, the bias subtraction $-b$, nor the elementwise activation of $\sigma$ is trivially defined on $X$ in general, and thus we have to face those primitive issues.

The design concept of the original HNN (Ganea et al., 2018) is to reconstruct the basic operations of ordinary neural networks, such as linear maps, bias translations, pointwise nonlinearities and softmax layers, by using gyrovector operations in a tractable and geometric manner. For example, in HNN++ by Shimizu et al. (2021), the Poincaré multinomial logistic regression (MLR) layer $v({\bm{x}})$ and fully-connected (FC) layer $\mathcal{F}({\bm{x}})$, corresponding to the $1$-affine layer ${\bm{a}}\cdot{\bm{x}}-b$ and the $k$-affine layer $A^{\top}{\bm{x}}-{\bm{b}}$ without activation respectively, are designed as nonlinear maps $v:\mathbb{H}^{m}\to\mathbb{R}$ and $\mathcal{F}:\mathbb{H}^{m}\to\mathbb{H}^{n}$ so that $v({\bm{x}};{\bm{a}},b)$ coincides with the distance between the output ${\bm{y}}=\mathcal{F}({\bm{x}};\{{\bm{a}}_{i},b_{i}\}_{i\in[n]})$ and the Poincaré hyperplane $H(o,\bm{e}_{{\bm{a}},b})$. Here, a Poincaré hyperplane $H({\bm{x}},{\bm{z}})$ is defined as the collection of all geodesics through a point ${\bm{x}}$ and normal to ${\bm{z}}$. Furthermore, they are designed so that the discriminative hyperplane coincides with a Poincaré hyperplane. The nonlinear activation function $\sigma:\mathbb{R}^{m}\to\mathbb{R}^{n}$ is cast as a map $\sigma:\mathbb{H}^{m}\to\mathbb{H}^{k}$ via the lifting $\sigma({\bm{x}}):=\exp_{0}\circ\,\sigma\circ\log_{0}({\bm{x}})$ for any ${\bm{x}}\in\mathbb{H}^{m}$. However, in practice, the activation can be omitted since the FC layer is inherently nonlinear.

Figure 2: The Euclidean fully-connected layer $\sigma({\bm{a}}\cdot{\bm{x}}-b)$ is recast as the signed distance $d({\bm{x}},\xi)$ from a point ${\bm{x}}$ to a hyperplane $\xi({\bm{y}},{\bm{u}})$ followed by the nonlinearity $\sigma(r\,\bullet)$, where ${\bm{y}}$ satisfies $r{\bm{y}}\cdot{\bm{u}}=b$ and $\xi({\bm{y}},{\bm{u}})$ passes through the point ${\bm{y}}$ with normal ${\bm{u}}$.

In this study, on the other hand, we take $X$ to be a noncompact symmetric space $G/K$, which is a generalized version of the hyperbolic space $\mathbb{H}^{m}$. Following the philosophy of the Helgason–Fourier transform, we regard the scalar product ${\bm{u}}\cdot{\bm{x}}$ of a unit vector ${\bm{u}}\in\SS^{m-1}$ and a point ${\bm{x}}\in\mathbb{R}^{m}$ as the signed distance between the origin ${\bm{o}}$ and the plane $\xi({\bm{x}},{\bm{u}})$ through the point ${\bm{x}}$ with normal ${\bm{u}}$. Then, we recast it as the vector-valued distance, denoted $\langle u,x\rangle$, between the origin $o$ and the horocycle $\xi(x,u)$ through the point $x$ with normal $u$. As a result, we can naturally define the bias subtraction $-b$ and the elementwise activation of $\sigma:\mathbb{R}\to\mathbb{R}$ because the signed distance is identified with a vector.

More geometrically, 𝒖𝒙b{\bm{u}}\cdot{\bm{x}}-b in m\mathbb{R}^{m} is understood as the distance between point 𝒙{\bm{x}} and plane ξ\xi satisfying 𝒖𝒙b=0{\bm{u}}\cdot{\bm{x}}-b=0 (see Figure 2). Similarly, u,xb\langle u,x\rangle-b is understood as the distance between point xx and horocycle ξ\xi satisfying u,xb=0\langle u,x\rangle-b=0. Hence, as a general principle, we may formulate a versatile template of affine layers on XX as

S[γ](x):=×Ξγ(a,ξ)σ(ad(x,ξ))dadξ.S[\gamma](x):=\int_{\mathbb{R}\times\Xi}\gamma(a,\xi)\sigma(ad(x,\xi))\mathrm{d}a\mathrm{d}\xi. (4)

For example, in the original HNN, the Poincaré hyperplane HH is employed as the geometric object. If we have nice coordinates such as (s,t)m×m(s,t)\in\mathbb{R}^{m}\times\mathbb{R}^{m} satisfying d(x(t),ξ(s))=tsd(x(t),\xi(s))=t-s, then we can turn it into the Fourier expression and hopefully obtain the ridgelet transform.

The strengths of our results are summarized as follows: we obtained the ridgelet transform in a unified and geometric (i.e., coordinate-independent) manner for a wide class of input domains XX; in particular, this is the first result to define a neural network, and to obtain the corresponding ridgelet transform, on a noncompact symmetric space.

6 Case IV: Pooling Layer and dd-plane Ridgelet Transform

Finally, we present several new results. Technically, we consider networks with multivariate activation functions σ:k\sigma:\mathbb{R}^{k}\to\mathbb{C}; in all the sections up to this point, we have considered a univariate activation function σ:\sigma:\mathbb{R}\to\mathbb{C} (i.e. k=1k=1). In the context of neural networks, a multivariate activation is understood as a mathematical model of pooling layers such as

σ(𝒃)\displaystyle\sigma({\bm{b}}) =1ki=1kbi\displaystyle=\frac{1}{k}\sum_{i=1}^{k}b_{i} (average pooling),
σ(𝒃)\displaystyle\sigma({\bm{b}}) =maxi[k]{bi}\displaystyle=\max_{i\in[k]}\{b_{i}\} (max pooling), and
σ(𝒃)\displaystyle\sigma({\bm{b}}) =|𝒃|p\displaystyle=|{\bm{b}}|_{p} (p\ell^{p}-norm).
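
As a concrete illustration (a minimal sketch of ours, not code from the paper), these three pooling-type activations σ: ℝ^k → ℝ can be written as follows; the function names are ours.

```python
import numpy as np

def avg_pool(b):
    # average pooling: sigma(b) = (1/k) * sum_i b_i
    return np.mean(b)

def max_pool(b):
    # max pooling: sigma(b) = max_i b_i
    return np.max(b)

def lp_pool(b, p=2):
    # l^p-norm pooling: sigma(b) = |b|_p
    return np.linalg.norm(b, ord=p)

b = np.array([1.0, -2.0, 3.0])   # k = 3 pre-activations
print(avg_pool(b), max_pool(b), lp_pool(b, p=2))
```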

Meanwhile, in the context of sparse signal processing in the 2000s, as in Donoho (2001) and Rubin (2004) (see also § 7.2), it can also be understood as the ridgelet transform corresponding to the so-called dd-plane transform (see also § 6.1).

As mentioned in § 1.3, the ridgelet transforms have profound relations to the Radon and wavelet transforms. In the language of probability theory, a Radon transform is understood as a marginalization, and the traditional problem of Johann Radon is the inverse problem of reconstructing the original joint distribution from several marginal distributions. Hence, depending on the choice of variables to be marginalized, there are countless different Radon transforms. In other words, the Radon transform can also be a rich source for finding a variety of novel networks and ridgelet transforms. In this section, we derive the ridgelet transform corresponding to the dd-plane transform. (Nonetheless, the proofs are given by the three-step Fourier expression method.)

Additional Notations

Let m,d,km,d,k be positive integers satisfying m=d+km=d+k; let Mm,k:={Am×krankA=k}M_{m,k}:=\{A\in\mathbb{R}^{m\times k}\mid\operatorname{rank}A=k\} be the set of all full-column-rank (i.e., injective) matrices equipped with the Lebesgue measure dA=ijdaij\mathrm{d}A=\bigwedge_{ij}\mathrm{d}a_{ij}; let Vm,k:={U=[𝒖1,,𝒖k]m×kUU=Ik}V_{m,k}:=\{U=[{\bm{u}}_{1},\ldots,{\bm{u}}_{k}]\in\mathbb{R}^{m\times k}\mid U^{\top}U=I_{k}\} be the Stiefel manifold of orthonormal kk-frames in m\mathbb{R}^{m} equipped with the invariant measure dU\mathrm{d}U; let O(k):={Vk×kVV=Ik}O(k):=\{V\in\mathbb{R}^{k\times k}\mid V^{\top}V=I_{k}\} be the orthogonal group in k\mathbb{R}^{k} equipped with the invariant measure dV\mathrm{d}V. In addition, let GVm,k:={A=aUm×ka+,UVm,k}GV_{m,k}:=\{A=aU\in\mathbb{R}^{m\times k}\mid a\in\mathbb{R}_{+},U\in V_{m,k}\} be a similitude group equipped with the product measure dadU\mathrm{d}a\mathrm{d}U. For a rectangular matrix AMm,kA\in M_{m,k}, we write |detA|:=|detAA|1/2|\det A|:=|\det A^{\top}A|^{1/2} for short. In the following, we use ^\widehat{\cdot} and \cdot^{\sharp} for the Fourier transforms in 𝒙m{\bm{x}}\in\mathbb{R}^{m} and 𝒃k{\bm{b}}\in\mathbb{R}^{k}, respectively. For any ss\in\mathbb{R}, let s/2\triangle^{s/2} denote the fractional Laplacian defined as a Fourier multiplier: s/2[f](𝒙):=1(2π)mm|𝝃|sf^(𝝃)ei𝝃𝒙d𝝃\triangle^{s/2}[f]({\bm{x}}):=\frac{1}{(2\pi)^{m}}\int_{\mathbb{R}^{m}}|{\bm{\xi}}|^{s}\widehat{f}({\bm{\xi}})e^{i{\bm{\xi}}\cdot{\bm{x}}}\mathrm{d}{\bm{\xi}}.
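
For readers who wish to experiment, the fractional Laplacian △^{s/2} defined above as the Fourier multiplier |ξ|^s can be approximated on a periodic grid by the FFT. The following one-dimensional sketch is our own illustration (the grid size and domain are arbitrary choices); for s = 2 the multiplier reproduces −f''.

```python
import numpy as np

def fractional_laplacian_1d(f_vals, dx, s):
    # Multiplier |xi|^s in the Fourier domain, following the definition
    # Delta^{s/2}[f](x) = (2*pi)^{-1} int |xi|^s fhat(xi) e^{i xi x} dxi,
    # discretized with the FFT on a periodic grid.
    n = f_vals.size
    xi = 2 * np.pi * np.fft.fftfreq(n, d=dx)       # angular frequencies
    f_hat = np.fft.fft(f_vals)
    return np.real(np.fft.ifft(np.abs(xi) ** s * f_hat))

# Sanity check with s = 2: the multiplier |xi|^2 reproduces -f'' for a Gaussian.
x = np.linspace(-10, 10, 2048, endpoint=False)
dx = x[1] - x[0]
f = np.exp(-x ** 2)
num = fractional_laplacian_1d(f, dx, s=2.0)
exact = -(4 * x ** 2 - 2) * np.exp(-x ** 2)        # -d^2/dx^2 exp(-x^2)
print(np.max(np.abs(num - exact)))                 # small discretization error
```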

6.1 dd-plane transform

The dd-plane transform is a Radon transform that marginalizes a dd-dimensional affine subspace (dd-plane) in an mm-dimensional space. In the special cases when d=m1d=m-1 (hyperplane) and d=1d=1 (straight line), they are respectively called the (strict) Radon transform and the X-ray transform. The ridgelet transforms to be introduced in this section correspond to dd-plane Radon transform, and the classical ridgelet transform corresponds to the strict Radon transform (d=m1d=m-1). We refer to Chapter 1 of Helgason (2010).

Definition 6.1 (dd-plane).

A dd-plane 𝝃m{\bm{\xi}}\subset\mathbb{R}^{m} is a dd-dimensional affine subspace in m\mathbb{R}^{m}. Here, affine emphasizes that it does not always pass through the origin 𝒐m{\bm{o}}\in\mathbb{R}^{m}. Let Gm,dG_{m,d} denote the collection of all dd-planes in m\mathbb{R}^{m}, called the affine Grassmannian manifold.

A dd-plane is parametrized by its orthonormal directions U=[𝒖1,,𝒖k]Vm,kU=[{\bm{u}}_{1},\ldots,{\bm{u}}_{k}]\in V_{m,k} and coordinate vector 𝒃k{\bm{b}}\in\mathbb{R}^{k} from the origin 𝒐{\bm{o}} as below

𝝃(U,𝒃):=U𝒃+kerU=i=1kbi𝒖i+{j=1dcj𝒗j|cj},{\bm{\xi}}(U,{\bm{b}}):=U{\bm{b}}+\ker U=\sum_{i=1}^{k}b_{i}{\bm{u}}_{i}+\left\{\sum_{j=1}^{d}c_{j}{\bm{v}}_{j}{\ \Bigg{|}\ }c_{j}\in\mathbb{R}\right\},

where [𝒗1,,𝒗d]Vm,d[{\bm{v}}_{1},\ldots,{\bm{v}}_{d}]\in V_{m,d} is a dd-frame satisfying 𝒗j𝒖i{\bm{v}}_{j}\perp{\bm{u}}_{i} for all i,ji,j. The first term U𝒃U{\bm{b}} is the displacement vector from the origin 𝒐{\bm{o}}, its norm |U𝒃||U{\bm{b}}| is the distance between the origin 𝒐{\bm{o}} and the dd-plane 𝝃{\bm{\xi}}, and the second term kerU=(spanU)\ker U=(\operatorname{span}U)^{\perp} is the dd-dimensional linear subspace that is parallel to 𝝃{\bm{\xi}}.

Recall that for each direction UVm,kU\in V_{m,k}, the whole space m\mathbb{R}^{m} can be decomposed into a disjoint union of dd-planes as m=𝒃k𝝃(U,𝒃)\mathbb{R}^{m}=\cup_{{\bm{b}}\in\mathbb{R}^{k}}{\bm{\xi}}(U,{\bm{b}}). From this perspective, the dd-plane transform of ff at 𝝃{\bm{\xi}} is defined as the marginalization of ff in 𝝃{\bm{\xi}}.

Definition 6.2 (dd-plane Transform).

For any integrable function fL1(m)f\in L^{1}(\mathbb{R}^{m}) and dd-plane 𝝃=(U,𝒃)Vm,k×k{\bm{\xi}}=(U,{\bm{b}})\in V_{m,k}\times\mathbb{R}^{k}, put

Pd[f](𝝃):=𝝃f(𝒙)d𝖫d(𝒙)=kerUf(U𝒃+𝒚)d𝒚,P_{d}[f]({\bm{\xi}}):=\int_{{\bm{\xi}}}f({\bm{x}})\mathrm{d}\mathsf{L}^{d}({\bm{x}})=\int_{\ker U}f(U{\bm{b}}+{\bm{y}})\mathrm{d}{\bm{y}},

where 𝖫d\mathsf{L}^{d} is the dd-dimensional Hausdorff measure on 𝝃{\bm{\xi}}.

Particularly, the strict Radon transform corresponds to d=m1d=m-1, and the XX-ray transform corresponds to d=1d=1.
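
As an illustration only (not from the paper), the dd-plane transform of a standard Gaussian can be checked numerically: since the integral over ker U factorizes, f(x) = exp(−|x|²/2) gives P_d[f](U, b) = (2π)^{d/2} exp(−|b|²/2). A minimal sketch with m = 3, k = 2 (so d = 1), using a simple quadrature over ker U; the helper names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
m, k = 3, 2
d = m - k

f = lambda x: np.exp(-0.5 * np.sum(x ** 2))      # f(x) = exp(-|x|^2 / 2)

# A random orthonormal k-frame U (directions) and a basis V of ker U = (span U)^perp.
U, _ = np.linalg.qr(rng.standard_normal((m, k)))
full, _ = np.linalg.qr(np.hstack([U, rng.standard_normal((m, d))]))
V = full[:, k:]                                  # m x d, columns orthogonal to U

def d_plane_transform(f, U, V, b, t_max=8.0, n=4001):
    # P_d[f](U, b) = int_{ker U} f(U b + y) dy, parametrized as y = V t, t in R^d.
    t = np.linspace(-t_max, t_max, n)
    vals = np.array([f(U @ b + V @ np.array([ti])) for ti in t])
    return vals.sum() * (t[1] - t[0])

b = np.array([0.7, -0.3])
numeric = d_plane_transform(f, U, V, b)
exact = (2 * np.pi) ** (d / 2) * np.exp(-0.5 * np.sum(b ** 2))
print(numeric, exact)                            # the two values agree closely
```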

The dd-plane transform has the following Fourier expression.

Lemma 6.1 (Fourier Slice Theorem for dd-plane Transform).

For any fL1(m)f\in L^{1}(\mathbb{R}^{m}),

Pd[f](U,𝝎)=f^(U𝝎),(U,𝝎)Vm,k×k,P_{d}[f]^{\sharp}(U,{\bm{\omega}})=\widehat{f}(U{\bm{\omega}}),\quad(U,{\bm{\omega}})\in V_{m,k}\times\mathbb{R}^{k},

where ^\widehat{\cdot} and \cdot^{\sharp} denote the Fourier transforms in 𝐱{\bm{x}} and 𝐛{\bm{b}} respectively. In other words,

kPd[f](U,𝒃)ei𝒃𝝎d𝒃=mf(𝒙)eiU𝝎𝒙d𝒙.\int_{\mathbb{R}^{k}}P_{d}[f](U,{\bm{b}})e^{-i{\bm{b}}\cdot{\bm{\omega}}}\mathrm{d}{\bm{b}}=\int_{\mathbb{R}^{m}}f({\bm{x}})e^{-iU{\bm{\omega}}\cdot{\bm{x}}}\mathrm{d}{\bm{x}}.
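
The slice theorem can be verified numerically in the classical case m = 2, k = 1 (our own sketch; all integrals are truncated Riemann sums, and the test function is an arbitrary smooth, rapidly decaying choice).

```python
import numpy as np

# Slice theorem check for m = 2, k = 1: the 1D Fourier transform in b of the
# Radon transform P[f](u, b) equals the 2D Fourier transform of f at omega * u.
f = lambda x: np.exp(-0.5 * np.sum(x ** 2, axis=0)) * (1.0 + 0.5 * np.sin(x[0]))

t = np.linspace(-8.0, 8.0, 801)
dt = t[1] - t[0]

theta = 0.7
u = np.array([np.cos(theta), np.sin(theta)])        # direction in S^1
u_perp = np.array([-np.sin(theta), np.cos(theta)])  # spans the line ker(u^T)
omega = 1.3                                         # dual variable of b

# Radon transform P[f](u, b) = int f(b*u + s*u_perp) ds for every b on the grid.
B, S = np.meshgrid(t, t, indexing="ij")
pts = B[None] * u[:, None, None] + S[None] * u_perp[:, None, None]   # (2, nb, ns)
radon = f(pts).sum(axis=1) * dt                     # integrate over s

lhs = np.sum(radon * np.exp(-1j * omega * t)) * dt  # 1D Fourier transform in b

# Right-hand side: 2D Fourier transform of f at xi = omega * u.
X = np.stack(np.meshgrid(t, t, indexing="ij"))      # (2, n, n)
xi = omega * u
rhs = np.sum(f(X) * np.exp(-1j * (xi[0] * X[0] + xi[1] * X[1]))) * dt ** 2

print(lhs, rhs)   # the two complex numbers agree up to quadrature error
```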

Using the Fourier slice theorem, we can invert the dd-plane transform.

Lemma 6.2 (Inversion Formula for dd-plane Radon Transform).

For any fL1(m)f\in L^{1}(\mathbb{R}^{m}),

f(𝒙)=1(2π)mVm,k×kf^(U𝝎)|U𝝎|mkeiU𝝎𝒙d𝝎dU,𝒙m.f({\bm{x}})=\frac{1}{(2\pi)^{m}}\int_{V_{m,k}\times\mathbb{R}^{k}}\widehat{f}(U{\bm{\omega}})|U{\bm{\omega}}|^{m-k}e^{iU{\bm{\omega}}\cdot{\bm{x}}}\mathrm{d}{\bm{\omega}}\mathrm{d}U,\quad{\bm{x}}\in\mathbb{R}^{m}.
Proof.

By the Fourier slice theorem,

1(2π)mVm,k×k(Pd[f])(U,𝝎)eiU𝝎𝒙|U𝝎|mkdUd𝝎\displaystyle\frac{1}{(2\pi)^{m}}\int_{V_{m,k}\times\mathbb{R}^{k}}(P_{d}[f])^{\sharp}(U,{\bm{\omega}})e^{iU{\bm{\omega}}\cdot{\bm{x}}}|U{\bm{\omega}}|^{m-k}\mathrm{d}U\mathrm{d}{\bm{\omega}}
=1(2π)mVm,k×kf^(U𝝎)eiU𝝎𝒙|U𝝎|mkdUd𝝎\displaystyle=\frac{1}{(2\pi)^{m}}\int_{V_{m,k}\times\mathbb{R}^{k}}\widehat{f}(U{\bm{\omega}})e^{iU{\bm{\omega}}\cdot{\bm{x}}}|U{\bm{\omega}}|^{m-k}\mathrm{d}U\mathrm{d}{\bm{\omega}}
=1(2π)mmf^(𝝃)ei𝝃𝒙d𝝃=f(𝒙).\displaystyle=\frac{1}{(2\pi)^{m}}\int_{\mathbb{R}^{m}}\widehat{f}({\bm{\xi}})e^{i{\bm{\xi}}\cdot{\bm{x}}}\mathrm{d}{\bm{\xi}}=f({\bm{x}}).

Here, we changed the variables as 𝝃=U𝝎{\bm{\xi}}=U{\bm{\omega}} and used the matrix polar integration formula (Lemma C.2). ∎

Remark 1 (Relations to Marginalization of Probability Distributions).

In short, a dd-plane 𝝃{\bm{\xi}} is a subset of m\mathbb{R}^{m}, but it can also be identified with a single variable, and it is this variable 𝝃{\bm{\xi}} that the dd-plane transform marginalizes.

Let us consider a two-variable (or bivariate) case. The marginalization of a probability distribution f(x1,x2)f(x_{1},x_{2}) in x1x_{1} (resp. x2x_{2}) refers to the integral transform of ff in its first (resp. second) variable defined by f1(x2)=f(x1,x2)dx1f_{1}(x_{2})=\int_{\mathbb{R}}f(x_{1},x_{2})\mathrm{d}x_{1} (resp. f2(x1)=f(x1,x2)dx2f_{2}(x_{1})=\int_{\mathbb{R}}f(x_{1},x_{2})\mathrm{d}x_{2}).

On the other hand, the dd-plane transform of an integrable function ff on 2\mathbb{R}^{2} (i.e. fL1(2)f\in L^{1}(\mathbb{R}^{2})) with d=1d=1 (which is reduced to the classical Radon transform) is given by

Pd[f](s,𝒖)=f(s𝒖+t𝒖)dt,(s,𝒖)×SS1P_{d}[f](s,{\bm{u}})=\int_{\mathbb{R}}f(s{\bm{u}}+t{\bm{u}}^{\perp})\mathrm{d}t,\quad(s,{\bm{u}})\in\mathbb{R}\times\SS^{1}

where SS1\SS^{1} denotes the set of unit vectors in 2\mathbb{R}^{2}, i.e. SS1={𝒖2|𝒖|=1}\SS^{1}=\{{\bm{u}}\in\mathbb{R}^{2}\mid|{\bm{u}}|=1\}, and 𝒖{\bm{u}}^{\perp} denotes an orthonormal vector, or a unit vector satisfying 𝒖𝒖=0{\bm{u}}\cdot{\bm{u}}^{\perp}=0 (there always exist two 𝒖{\bm{u}}^{\perp}’s for each 𝒖{\bm{u}}). Each (s,𝒖)×SS1(s,{\bm{u}})\in\mathbb{R}\times\SS^{1} indicates a dd-plane 𝝃(s,𝒖)={s𝒖+t𝒖t}{\bm{\xi}}(s,{\bm{u}})=\{s{\bm{u}}+t{\bm{u}}^{\perp}\mid t\in\mathbb{R}\}.

In particular, by fixing an orthonormal basis {𝒖1,𝒖2}2\{{\bm{u}}_{1},{\bm{u}}_{2}\}\in\mathbb{R}^{2}, and identifying bivariate function f(x1,x2)f(x_{1},x_{2}) with univariate function f(x1𝒖1+x2𝒖2)f(x_{1}{\bm{u}}_{1}+x_{2}{\bm{u}}_{2}), the marginalization of probability distributions is identified with the following specific cases:

f1(x2)=Pd[f](x2,𝒖2),f2(x1)=Pd[f](x1,𝒖1).f_{1}(x_{2})=P_{d}[f](x_{2},{\bm{u}}_{2}),\quad f_{2}(x_{1})=P_{d}[f](x_{1},{\bm{u}}_{1}).
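
A minimal numerical sketch of this identification (ours, not from the paper): on a grid, summing a bivariate Gaussian density over x1 (resp. x2) reproduces its analytic marginal, i.e. the dd-plane transform along the corresponding coordinate axis.

```python
import numpy as np

# Marginalization as a special case of the d-plane (Radon) transform on a grid.
t = np.linspace(-6.0, 6.0, 601)
dt = t[1] - t[0]
X1, X2 = np.meshgrid(t, t, indexing="ij")

Z = np.sqrt(2.0 * np.pi) * np.sqrt(np.pi)            # normalizing constant
f = np.exp(-0.5 * X1 ** 2 - X2 ** 2) / Z             # a product Gaussian density

f1 = f.sum(axis=0) * dt    # marginal over x1, a function of x2
f2 = f.sum(axis=1) * dt    # marginal over x2, a function of x1

print(np.max(np.abs(f1 - np.exp(-t ** 2) / np.sqrt(np.pi))))            # ~ 0
print(np.max(np.abs(f2 - np.exp(-0.5 * t ** 2) / np.sqrt(2 * np.pi))))  # ~ 0
```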

6.2 Network Design

We define the dd-plane (or kk-affine) layer. Here, kk is the co-dimension of dd, satisfying d+k=md+k=m. In addition to the full-column-rank case (A,𝒃)Mm,k×k(A,{\bm{b}})\in M_{m,k}\times\mathbb{R}^{k}, we consider the degenerate cases (A=aU,𝒃)GVm,k×k(A=aU,{\bm{b}})\in GV_{m,k}\times\mathbb{R}^{k} and (A=U,𝒃)Vm,k×k(A=U,{\bm{b}})\in V_{m,k}\times\mathbb{R}^{k}, which correspond to several previous studies.

Definition 6.3.

Let σ:k\sigma:\mathbb{R}^{k}\to\mathbb{C} be a measurable function. Let MM denote either Mm,k,GVm,kM_{m,k},GV_{m,k} or Vm,kV_{m,k}. For any function γ:M×k\gamma:M\times\mathbb{R}^{k}\to\mathbb{C}, the continuous neural network with dd-plane (or kk-affine) layer is given by

S[γ](𝒙):=M×kγ(A,𝒃)σ(A𝒙𝒃)dAd𝒃,𝒙m.S[\gamma]({\bm{x}}):=\int_{M\times\mathbb{R}^{k}}\gamma(A,{\bm{b}})\sigma(A^{\top}{\bm{x}}-{\bm{b}})\mathrm{d}A\mathrm{d}{\bm{b}},\quad{\bm{x}}\in\mathbb{R}^{m}.

Since the null space kerA:={𝒙mA𝒙=0}\ker A^{\top}:=\{{\bm{x}}\in\mathbb{R}^{m}\mid A^{\top}{\bm{x}}=0\} is dd-dimensional, each dd-plane neuron σ(A𝒙𝒃)\sigma(A^{\top}{\bm{x}}-{\bm{b}}) has dd dimensions of constant directions. Therefore, dd-plane networks are able to capture dd-dimensional singularities in a target function ff, as illustrated in the sketch below.
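
The following is a minimal sketch (our own illustration) of a finite-width dd-plane layer obtained by discretizing the integral in Definition 6.3 with n neurons; the choice of max pooling as σ and the random parameters are arbitrary. The last lines check that each neuron is indeed constant along the dd-dimensional subspace ker A_j^⊤.

```python
import numpy as np

rng = np.random.default_rng(0)
m, k = 5, 2                      # input dimension and co-dimension (d = m - k = 3)
n = 64                           # number of neurons (discretization of the integral)

A = rng.standard_normal((n, m, k))      # a.s. full-column-rank, i.e. A_j in M_{m,k}
b = rng.standard_normal((n, k))
gamma = rng.standard_normal(n) / n      # discretized parameter distribution

def sigma(z):
    # multivariate activation sigma: R^k -> R, here max pooling
    return np.max(z, axis=-1)

def d_plane_layer(x):
    # finite-width version of S[gamma](x) = int gamma(A, b) sigma(A^T x - b) dA db
    pre = np.einsum("jmk,m->jk", A, x) - b          # A_j^T x - b_j, shape (n, k)
    return np.sum(gamma * sigma(pre))

x = rng.standard_normal(m)
print(d_plane_layer(x))

# Each neuron is constant along the d-dimensional subspace ker A_j^T:
j = 0
null_basis = np.linalg.svd(A[j].T)[2][k:]           # rows spanning ker A_j^T
shift = 0.5 * null_basis[0]
pre1 = A[j].T @ x - b[j]
pre2 = A[j].T @ (x + shift) - b[j]
print(np.allclose(pre1, pre2))                      # True: neuron value unchanged
```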

6.3 Ridgelet Transforms and Reconstruction Formulas

We present three variants of solutions for dd-plane networks. We note that typical pooling layers σ\sigma such as average pooling, max pooling, and p\ell^{p}-norm are contained in the class of tempered distributions (𝒮\mathcal{S}^{\prime}) on k\mathbb{R}^{k}. The first and second theorems present dense (AMm,kA\in M_{m,k}) and sparse (AGVm,kA\in GV_{m,k}) solutions of parameters for the same class of activation functions. Since GVm,kGV_{m,k} is a measure-zero subset of Mm,kM_{m,k}, the second solution is much sparser than the first solution. The third theorem presents the sparsest (AVm,kA\in V_{m,k}) solution by restricting the class of activation functions; it is intended to capture the characteristic solutions for modern activation functions such as ReLU.

In the following, cm,k:=SSk1×Vm,k1dUd𝝎c_{m,k}:=\int_{\SS^{k-1}\times V_{m,k-1}}\mathrm{d}U\mathrm{d}{\bm{\omega}}.

Theorem 6.1.

Let σ𝒮(k),ρ𝒮(k),fHd(m)\sigma\in\mathcal{S}^{\prime}(\mathbb{R}^{k}),\rho\in\mathcal{S}(\mathbb{R}^{k}),f\in H^{d}(\mathbb{R}^{m}). Put

R[f;ρ](A,𝒃)\displaystyle R[f;\rho](A,{\bm{b}}) :=1δ(A)md/2[f](𝒙)ρ(A𝒙𝒃)¯d𝒙,(A,𝒃)Mm,k×k\displaystyle:=\frac{1}{\delta(A)}\int_{\mathbb{R}^{m}}\triangle^{d/2}[f]({\bm{x}})\overline{\rho(A^{\top}{\bm{x}}-{\bm{b}})}\mathrm{d}{\bm{x}},\quad(A,{\bm{b}})\in M_{m,k}\times\mathbb{R}^{k}
((σ,ρ))\displaystyle(\!(\sigma,\rho)\!) :=(2π)d2kcm,kkσ(𝝎)ρ(𝝎)¯i=1k|ωi|1d𝝎,\displaystyle:=\frac{(2\pi)^{d}}{2^{k}c_{m,k}}\int_{\mathbb{R}^{k}}\sigma^{\sharp}({\bm{\omega}})\overline{\rho^{\sharp}({\bm{\omega}})}\prod_{i=1}^{k}|\omega_{i}|^{-1}\mathrm{d}{\bm{\omega}},

where δ(A)\delta(A) is defined as 2ki=1kdidi<j(di2dj2)2^{-k}\prod_{i=1}^{k}d_{i}^{d}\prod_{i<j}(d_{i}^{2}-d_{j}^{2}) with d1>>dk>0d_{1}>\cdots>d_{k}>0 being the singular values of AA. Then, for almost every 𝐱m{\bm{x}}\in\mathbb{R}^{m}, we have

S[R[f;ρ]](𝒙)=Mm,k×kR[f;ρ](A,𝒃)σ(A𝒙𝒃)dAd𝒃=((σ,ρ))f(𝒙).S[R[f;\rho]]({\bm{x}})=\int_{M_{m,k}\times\mathbb{R}^{k}}R[f;\rho](A,{\bm{b}})\sigma(A^{\top}{\bm{x}}-{\bm{b}})\mathrm{d}A\mathrm{d}{\bm{b}}=(\!(\sigma,\rho)\!)f({\bm{x}}).
Theorem 6.2.

Let ss be a real number; let σ𝒮(k),ρ𝒮(k),fHs(m)\sigma\in\mathcal{S}^{\prime}(\mathbb{R}^{k}),\rho\in\mathcal{S}(\mathbb{R}^{k}),f\in H^{s}(\mathbb{R}^{m}). Put

Rs[f;ρ](aU,𝒃)\displaystyle R_{s}[f;\rho](aU,{\bm{b}}) :=ams1ms/2[f](𝒙)ρ(A𝒙𝒃)¯d𝒙,(aU,𝒃)GVm,k×k\displaystyle:=a^{m-s-1}\int_{\mathbb{R}^{m}}\triangle^{s/2}[f]({\bm{x}})\overline{\rho(A^{\top}{\bm{x}}-{\bm{b}})}\mathrm{d}{\bm{x}},\quad(aU,{\bm{b}})\in GV_{m,k}\times\mathbb{R}^{k}
((σ,ρ))s\displaystyle(\!(\sigma,\rho)\!)_{s} :=(2π)d2kcm,kkσ(𝝎)ρ(𝝎)¯|𝝎|(ds+1)d𝝎,\displaystyle:=\frac{(2\pi)^{d}}{2^{k}c_{m,k}}\int_{\mathbb{R}^{k}}\sigma^{\sharp}({\bm{\omega}})\overline{\rho^{\sharp}({\bm{\omega}})}|{\bm{\omega}}|^{-(d-s+1)}\mathrm{d}{\bm{\omega}},

Then, for almost every 𝐱m{\bm{x}}\in\mathbb{R}^{m}, we have

S[Rs[f;ρ]](𝒙)=GVm,k×kRs[f;ρ](aU,𝒃)σ(aU𝒙𝒃)dadUd𝒃=((σ,ρ))sf(𝒙).S[R_{s}[f;\rho]]({\bm{x}})=\int_{GV_{m,k}\times\mathbb{R}^{k}}R_{s}[f;\rho](aU,{\bm{b}})\sigma(aU^{\top}{\bm{x}}-{\bm{b}})\mathrm{d}a\mathrm{d}U\mathrm{d}{\bm{b}}=(\!(\sigma,\rho)\!)_{s}f({\bm{x}}).
Theorem 6.3.

For any real number tt, suppose that σ𝒮(k)\sigma\in\mathcal{S}^{\prime}(\mathbb{R}^{k}) satisfy σ(𝛚)=|𝛚|t\sigma^{\sharp}({\bm{\omega}})=|{\bm{\omega}}|^{t} (i.e., σ\sigma is the Green function of 𝐛t/2\triangle_{{\bm{b}}}^{-t/2}). Let fHd(m)f\in H^{d}(\mathbb{R}^{m}). Put

R[f](U,𝒃):=𝒃(dt)/2Pd[f](U,𝒃)=Pd[(dt)/2f](U,𝒃),(U,𝒃)Vm,k×kR[f](U,{\bm{b}}):=\triangle_{{\bm{b}}}^{(d-t)/2}P_{d}[f](U,{\bm{b}})=P_{d}[\triangle^{(d-t)/2}f](U,{\bm{b}}),\quad(U,{\bm{b}})\in V_{m,k}\times\mathbb{R}^{k}

where 𝐛\triangle_{{\bm{b}}} denotes the fractional Laplacian in 𝐛k{\bm{b}}\in\mathbb{R}^{k}, and PdP_{d} is the dd-plane transform. Then, for almost every 𝐱m{\bm{x}}\in\mathbb{R}^{m}, we have

S[R[f]](𝒙)=Vm,k×kR[f](U,𝒃)σ(U𝒙𝒃)dUd𝒃=1(2π)dcm,kf(𝒙).S[R[f]]({\bm{x}})=\int_{V_{m,k}\times\mathbb{R}^{k}}R[f](U,{\bm{b}})\sigma(U^{\top}{\bm{x}}-{\bm{b}})\mathrm{d}U\mathrm{d}{\bm{b}}=\frac{1}{(2\pi)^{d}c_{m,k}}f({\bm{x}}).

As a consequence, these reconstruction formulas are understood as constructive universality theorems for dd-plane networks. We note (1) that, to the best of our knowledge, the first result was not previously known, (2) that the second result extends the “dd-plane ridgelet transform” by Donoho (2001) and Rubin (2004) (see § 6.4), and (3) that the third result extends the Radon formulas (Theorem 7.2) by Carroll and Dickinson (1989) and Ito (1991) as the special case k=1k=1 and t=1t=-1, and recent results on ReLU-nets such as Savarese et al. (2019), Ongie et al. (2020) and Parhi and Nowak (2021) as the special case k=1k=1 and t=2t=-2.

The proof is performed by systematically following the three steps as below.

Proof.

We present the first case. See Appendix D for full proofs.

Step 1. Turn to the Fourier expression:

S[γ](𝒙)=1(2π)kγ(A,𝝎)σ(𝝎)ei(A𝝎)𝒙dAd𝝎δ(A)S[\gamma]({\bm{x}})=\frac{1}{(2\pi)^{k}}\int\gamma^{\sharp}(A,{\bm{\omega}})\sigma^{\sharp}({\bm{\omega}})e^{i(A{\bm{\omega}})\cdot{\bm{x}}}\frac{\mathrm{d}A\mathrm{d}{\bm{\omega}}}{\delta(A)}

Step 2. Use singular value decomposition (SVD)

A=UDV,(U,D,V)Vm,k×+k×O(k),A=UDV^{\top},\quad(U,D,V)\in V_{m,k}\times\mathbb{R}_{+}^{k}\times O(k),

with dA/δ(A)=dUdDdV\mathrm{d}A/\delta(A)=\mathrm{d}U\mathrm{d}D\mathrm{d}V to have

=1(2π)kγ(A,𝝎)σ(𝝎)ei(UDV𝝎)𝒙dUdDdVd𝝎.=\frac{1}{(2\pi)^{k}}\int\gamma^{\sharp}(A,{\bm{\omega}})\sigma^{\sharp}({\bm{\omega}})e^{i(UDV^{\top}{\bm{\omega}})\cdot{\bm{x}}}\mathrm{d}U\mathrm{d}D\mathrm{d}V\mathrm{d}{\bm{\omega}}.

Change variables 𝝎=V𝝎{\bm{\omega}}^{\prime}=V^{\top}{\bm{\omega}} (VV fixed) and 𝒚=D𝝎{\bm{y}}=D{\bm{\omega}}^{\prime} (𝝎{\bm{\omega}}^{\prime} fixed)

=1(2π)kγ(A,V𝝎)σ(V𝝎)ei(U𝒚)𝒙i=1k|ωi|1dUd𝒚dVd𝝎.=\frac{1}{(2\pi)^{k}}\int\gamma^{\sharp}(A,V{\bm{\omega}}^{\prime})\sigma^{\sharp}(V{\bm{\omega}}^{\prime})e^{i(U{\bm{y}})\cdot{\bm{x}}}\prod_{i=1}^{k}|\omega_{i}^{\prime}|^{-1}\mathrm{d}U\mathrm{d}{\bm{y}}\mathrm{d}V\mathrm{d}{\bm{\omega}}^{\prime}.

Step 3. Put a separation-of-variables form (note: A𝝎=AV𝝎=U𝒚A{\bm{\omega}}=AV{\bm{\omega}}^{\prime}=U{\bm{y}})

γf,ρ(A,V𝝎)=f^(U𝒚)|U𝒚|mkρ(V𝝎)¯\gamma^{\sharp}_{f,\rho}(A,V{\bm{\omega}}^{\prime})=\widehat{f}(U{\bm{y}})|U{\bm{y}}|^{m-k}\overline{\rho^{\sharp}(V{\bm{\omega}}^{\prime})}

Then, γf,ρ\gamma_{f,\rho} turns out to be a particular solution because

S[γf,ρ](𝒙)\displaystyle S[\gamma_{f,\rho}]({\bm{x}}) =cm,k(2π)k(O(k)×kσ(V𝝎)ρ(V𝝎)¯i=1k|ωi|1dVd𝝎)\displaystyle=\frac{c_{m,k}}{(2\pi)^{k}}\left(\int_{O(k)\times\mathbb{R}^{k}}\sigma^{\sharp}(V{\bm{\omega}}^{\prime})\overline{\rho^{\sharp}(V{\bm{\omega}}^{\prime})}\prod_{i=1}^{k}|\omega_{i}^{\prime}|^{-1}\mathrm{d}V\mathrm{d}{\bm{\omega}}^{\prime}\right)
×(Vm,k×kf^(U𝒚)|U𝒚|mkei(U𝒚)𝒙dUd𝒚)\displaystyle\quad\times\left(\int_{V_{m,k}\times\mathbb{R}^{k}}\widehat{f}(U{\bm{y}})|U{\bm{y}}|^{m-k}e^{i(U{\bm{y}})\cdot{\bm{x}}}\mathrm{d}U\mathrm{d}{\bm{y}}\right)
=((σ,ρ))f(𝒙).\displaystyle=(\!(\sigma,\rho)\!)f({\bm{x}}).

Finally, the matrix ridgelet transform can be calculated as below

γf,ρ(A,𝒃)\displaystyle\gamma_{f,\rho}(A,{\bm{b}}) =1(2π)kk|A𝝎|mkf^(A𝝎)ρ(𝝎)¯ei𝝎𝒃d𝝎\displaystyle=\frac{1}{(2\pi)^{k}}\int_{\mathbb{R}^{k}}|A{\bm{\omega}}|^{m-k}\widehat{f}(A{\bm{\omega}})\overline{\rho^{\sharp}({\bm{\omega}})}e^{i{\bm{\omega}}\cdot{\bm{b}}}\mathrm{d}{\bm{\omega}}
=1(2π)kk[m(mk)/2[f](𝒙)eiA𝝎𝒙d𝒙]ρ(𝝎)¯ei𝝎𝒃d𝝎\displaystyle=\frac{1}{(2\pi)^{k}}\int_{\mathbb{R}^{k}}\left[\int_{\mathbb{R}^{m}}\triangle^{(m-k)/2}[f]({\bm{x}})e^{-iA{\bm{\omega}}\cdot{\bm{x}}}\mathrm{d}{\bm{x}}\right]\overline{\rho^{\sharp}({\bm{\omega}})}e^{i{\bm{\omega}}\cdot{\bm{b}}}\mathrm{d}{\bm{\omega}}
=m(mk)/2[f](𝒙)[1(2π)kkρ(𝝎)ei𝝎(A𝒙𝒃)d𝝎]d𝒙\displaystyle=\int_{\mathbb{R}^{m}}\triangle^{(m-k)/2}[f]({\bm{x}})\left[\frac{1}{(2\pi)^{k}}\int_{\mathbb{R}^{k}}\rho^{\sharp}({\bm{\omega}})e^{i{\bm{\omega}}\cdot(A^{\top}{\bm{x}}-{\bm{b}})}\mathrm{d}{\bm{\omega}}\right]^{*}\mathrm{d}{\bm{x}}
=mmk2[f](𝒙)ρ(A𝒙𝒃)¯d𝒙\displaystyle=\int_{\mathbb{R}^{m}}\triangle^{\frac{m-k}{2}}[f]({\bm{x}})\overline{\rho(A^{\top}{\bm{x}}-{\bm{b}})}\mathrm{d}{\bm{x}}
=:R[f;ρ](A,𝒃)\displaystyle=:R[f;\rho](A,{\bm{b}})

6.4 Literature on the dd-plane ridgelet transform

In the past, two versions of the dd-plane ridgelet transform have been proposed. One is a tight frame (i.e., a discrete transform) by Donoho (2001), and the other is a continuous transform by Rubin (2004). The dd-plane ridgelet by Donoho can be regarded as the discrete version of the dd-plane ridgelet transform by Rubin.

Theorem 6.4 (Continuous dd-plane Ridgelet Transform by Rubin (2004)).
Ua[f](𝝃):=mf(𝒙)ua(|𝒙𝝃|)d𝒙,𝝃Gm,d\displaystyle U_{a}[f]({\bm{\xi}}):=\int_{\mathbb{R}^{m}}f({\bm{x}})u_{a}\left(|{\bm{x}}-{\bm{\xi}}|\right)\mathrm{d}{\bm{x}},\quad{\bm{\xi}}\in G_{m,d}
Va[ϕ](𝒙):=Gm,dϕ(𝝃)va(|𝒙𝝃|)d𝝃,𝒙m\displaystyle V^{*}_{a}[\phi]({\bm{x}}):=\int_{G_{m,d}}\phi({\bm{\xi}})v_{a}\left(|{\bm{x}}-{\bm{\xi}}|\right)\mathrm{d}{\bm{\xi}},\quad{\bm{x}}\in\mathbb{R}^{m}
0Va[Ua[f]]daa1+d=cf,\displaystyle\int_{0}^{\infty}V_{a}^{*}[U_{a}[f]]\frac{\mathrm{d}a}{a^{1+d}}=cf,

where |𝐱𝛏||{\bm{x}}-{\bm{\xi}}| denotes the Euclidean distance between point 𝐱{\bm{x}} and dd-plane 𝛏{\bm{\xi}}, ua()=u(/a)/adu_{a}(\cdot)=u(\cdot/a)/a^{d} and va()=v(/a)/adv_{a}(\cdot)=v(\cdot/a)/a^{d}.

Recall that an affine dd-plane 𝝃Gm,d{\bm{\xi}}\in G_{m,d} is parametrized by an orthonormal kk-frame UVm,kU\in V_{m,k} and a coordinate vector 𝒃k{\bm{b}}\in\mathbb{R}^{k} as 𝝃(U,𝒃):={𝒙mU𝒙=𝒃}=U𝒃+kerU{\bm{\xi}}(U,{\bm{b}}):=\{{\bm{x}}\in\mathbb{R}^{m}\mid U^{\top}{\bm{x}}={\bm{b}}\}=U{\bm{b}}+\ker U. Because |𝒙𝝃(U,𝒃)|=|U𝒙𝒃||{\bm{x}}-{\bm{\xi}}(U,{\bm{b}})|=|U^{\top}{\bm{x}}-{\bm{b}}| for any point 𝒙m{\bm{x}}\in\mathbb{R}^{m}, the quantity U𝒙𝒃U^{\top}{\bm{x}}-{\bm{b}} is understood as the Euclidean vector-distance between the point 𝒙{\bm{x}} and the dd-plane 𝝃{\bm{\xi}}. Therefore, the dd-plane ridgelet transform by Rubin is understood as a special case of σ(aU𝒙𝒃)\sigma(aU^{\top}{\bm{x}}-{\bm{b}}) as in Theorem 6.2 where both σ\sigma and ρ\rho are radial functions. We remark that the more redundant parametrization σ(A𝒙𝒃)\sigma(A^{\top}{\bm{x}}-{\bm{b}}) as in Theorem 6.1 is natural for the purpose of neural network study, simply because neural network parameters are not strictly restricted to σ(aU𝒙𝒃)\sigma(aU^{\top}{\bm{x}}-{\bm{b}}) during training.

7 Literature Overview

7.1 Ridgelet Transform in the 1990s

One of the major problems in neural network study in the 1990s was to investigate the expressive power of (fully-connected) shallow neural networks, and the original ridgelet transform was discovered in this context independently by Murata (1996), and Candès (1998). Later in the 2010s, the classes of ff and σ\sigma have been extended to the distributions by Kostadinova et al. (2014) and Sonoda and Murata (2017) to include the modern activation functions such as ReLU.

The idea of using integral transforms for function approximation is fundamental and has a long history in approximation theory (see, e.g. DeVore and Lorentz, 1993). In the literature of neural network study, the integral representation by Barron (1993) is one of the representative works, where the so-called Barron class and Maurey–Jones–Barron (MJB) approximation error upperbound have been established, which play an important role both in the approximation and estimation theories of neural networks. We refer to Kainen et al. (2013) for more details on the MJB theory.

One obvious strength of the ridgelet transform is the closed-form expression. Before the ridgelet transform, two pioneering results were proposed. One is the Fourier formula by Irie and Miyake (1988) and Funahashi (1989):

Theorem 7.1.

For any σL1()\sigma\in L^{1}(\mathbb{R}) and fL2(m)f\in L^{2}(\mathbb{R}^{m}),

f(𝒙)=1(2π)mσ(1)m×f^(𝒂)σ(𝒂𝒙b)eibd𝒂db.f({\bm{x}})=\frac{1}{(2\pi)^{m}\sigma^{\sharp}(1)}\int_{\mathbb{R}^{m}\times\mathbb{R}}\widehat{f}({\bm{a}})\sigma({\bm{a}}\cdot{\bm{x}}-b)e^{ib}\mathrm{d}{\bm{a}}\mathrm{d}b.
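
Theorem 7.1 can be checked numerically in the simplest case m = 1 (our own sketch, with Gaussian activation and Gaussian target; the truncation ranges of the Riemann sums are ad hoc choices).

```python
import numpy as np

# Check of Theorem 7.1 for m = 1, with Gaussian activation and Gaussian target.
sigma = lambda z: np.exp(-z ** 2)                             # sigma in L^1(R)
f = lambda x: np.exp(-x ** 2 / 2)                             # target f in L^2(R)
f_hat = lambda a: np.sqrt(2 * np.pi) * np.exp(-a ** 2 / 2)    # its Fourier transform

t_a = np.linspace(-8.0, 8.0, 801)        # weight variable a (truncated)
t_b = np.linspace(-20.0, 20.0, 2001)     # bias variable b (truncated)
da, db = t_a[1] - t_a[0], t_b[1] - t_b[0]

sigma_sharp_1 = np.sum(sigma(t_b) * np.exp(-1j * t_b)) * db   # sigma^sharp(1)

x = 0.4
A, B = np.meshgrid(t_a, t_b, indexing="ij")
integrand = f_hat(A) * sigma(A * x - B) * np.exp(1j * B)
approx = integrand.sum() * da * db / (2 * np.pi * sigma_sharp_1)

print(approx.real, f(x))                 # both are close to f(0.4)
```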

The other is the Radon formula by Carroll and Dickinson (1989) and Ito (1991):

Theorem 7.2.

For σ(b):=b+0\sigma(b):=b_{+}^{0} (step function) and any f𝒮(m)f\in\mathcal{S}(\mathbb{R}^{m}),

f(𝒙)=12(2π)m1SSm1×t(t)(m1)/2P[f](𝒖,t)σ(𝒖𝒙t)d𝒖dt,f({\bm{x}})=\frac{1}{2(2\pi)^{m-1}}\int_{\SS^{m-1}\times\mathbb{R}}\partial_{t}(-\triangle_{t})^{(m-1)/2}P[f]({\bm{u}},t)\sigma({\bm{u}}\cdot{\bm{x}}-t)\mathrm{d}{\bm{u}}\mathrm{d}t,

where PP denotes the Radon transform.

Both results clearly show the strong relationship between neural networks and the Fourier and Radon transforms. We note that our result Theorem 6.3 includes the Radon formula (Theorem 7.2) as the special case k=1k=1 and t=1t=-1.

7.2 Ridgelet Transform in the 2000s

In the context of sparse signal processing, the emergence of the ridgelet transform has motivated another direction of research: Exploring a high-dimensional counterpart of the 11-dimensional wavelet transform. Indeed, the wavelet transform for mm-dimensional signals such as images and videos does exist, and it is given as below

W[f;ψ](a,𝒃):=mf(𝒙)ψa(𝒙𝒃)¯d𝒙,(a,𝒃)+×m\displaystyle W[f;\psi](a,{\bm{b}}):=\int_{\mathbb{R}^{m}}f({\bm{x}})\overline{\psi_{a}({\bm{x}}-{\bm{b}})}\mathrm{d}{\bm{x}},\quad(a,{\bm{b}})\in\mathbb{R}_{+}\times\mathbb{R}^{m}
f(𝒙)=m×+W[f;ψ](a,𝒃)ψa(𝒙𝒃)d𝒃daa,𝒙m\displaystyle f({\bm{x}})=\int_{\mathbb{R}^{m}\times\mathbb{R}_{+}}W[f;\psi](a,{\bm{b}})\psi_{a}({\bm{x}}-{\bm{b}})\frac{\mathrm{d}{\bm{b}}\mathrm{d}a}{a},\quad{\bm{x}}\in\mathbb{R}^{m}

for fL2(m)f\in L^{2}(\mathbb{R}^{m}), where ψa(𝒃):=ψ(𝒃/a)/am\psi_{a}({\bm{b}}):=\psi({\bm{b}}/a)/a^{m} and ψ𝒮(m)\psi\in\mathcal{S}(\mathbb{R}^{m}) is a wavelet function. However, its localization ability is considered unsatisfactory, because it is essentially a tensor product of 11-dimensional wavelet transforms.

More precisely, while the 11-dimensional wavelet transform is good at localizing point singularities such as jumps and kinks in 11-dimensional signals such as audio recordings, the 22-dimensional wavelet transform is not good at localizing line singularities in 22-dimensional signals such as pictures, except when the singularity is straight and parallel to either the xx- or yy-axis. Here, “singularity of dimension dd” is the terminology of Donoho (2001). For example,

f(𝒙):=(x12++xk2)α/2exp(|𝒙|2),0<α<k/2f({\bm{x}}):=(x_{1}^{2}+\cdots+x_{k}^{2})^{-\alpha/2}\exp(-|{\bm{x}}|^{2}),\quad 0<\alpha<k/2

is a singular square-integrable function ff on m\mathbb{R}^{m} that attains \infty along the dd-dimensional subspace x1==xk=0x_{1}=\cdots=x_{k}=0.

On the other hand, the ridgelet transform is good at localizing the (m1)(m-1)-dimensional singularities in any direction because the feature map 𝒙σ(𝒂𝒙b){\bm{x}}\mapsto\sigma({\bm{a}}\cdot{\bm{x}}-b) is a constant function along the (m1)(m-1)-dimensional hyperplane normal to 𝒂{\bm{a}}. Similarly, the dd-plane (or kk-affine) ridgelet transform presented in this study is good at localizing the dd-dimensional singularities because the feature map 𝒙σ(A𝒙𝒃){\bm{x}}\mapsto\sigma(A^{\top}{\bm{x}}-{\bm{b}}) is a constant function along the dd-dimensional subspace kerA={𝒙mA𝒙=0}\ker A^{\top}=\{{\bm{x}}\in\mathbb{R}^{m}\mid A^{\top}{\bm{x}}=0\}.

In search of better localization properties, a variety of “X-lets” have been developed, such as the curvelet, beamlet, contourlet, and shearlet, under the slogan of geometric multiscale analysis (GMA) (see e.g. Donoho, 2002; Starck et al., 2010). Since ridgelet analysis had already been recognized as wavelet analysis in the Radon domain, a variety of generalizations of wavelet transforms and Radon transforms were investigated. In a modern sense, the philosophy of general Radon transforms is to map a function ff on a space X=G/KX=G/K of points xx to a function P[f]P[f] on another space Ξ=G/H\Xi=G/H of shapes ξ\xi (see e.g. Helgason, 2010). In the context of singularity localization, the shape ξ\xi such as a dd-plane determines the shape of singularities, namely, a collection of constant directions in XX, and thus the general Radon domain Ξ\Xi is understood as the parameter space of the singularities. In this perspective, we can restate the functionality of the ridgelet transform as wavelet localization in the space Ξ\Xi of singularities in XX.

7.3 Ridgelet Transform in the 2020s

In the context of deep learning, the idea of ridgelet transforms has regained the spotlight through the representer theorems that characterize (either deep or shallow) infinitely-wide ReLU networks minimizing a “representational cost” (Savarese et al., 2019; Ongie et al., 2020; Parhi and Nowak, 2021; Unser, 2019). Here, the representational cost of a function ff is defined as the infimum of the total variation (TV) norm of the parameter distribution:

C[f]:=infγ𝒢γTV,s.t.S[γ]=f,C[f]:=\inf_{\gamma\in\mathcal{G}}\|\gamma\|_{\rm TV},\quad\mbox{s.t.}\quad S[\gamma]=f,

where 𝒢\mathcal{G} is the collection of all signed measures. The TV-norm is a fundamental quantity for the MJB bounds (see e.g. Kainen et al., 2013).

According to Sonoda et al. (2021b), when the class 𝒢\mathcal{G} of parameter distributions is restricted to L2(m×)L^{2}(\mathbb{R}^{m}\times\mathbb{R}), any γ\gamma satisfying S[γ]=fS[\gamma]=f is uniquely written as a series of ridgelet transforms: γ=R[f;σ]+i=1R[fi;ρi]\gamma=R[f;\sigma_{*}]+\sum_{i=1}^{\infty}R[f_{i};\rho_{i}], where σ\sigma_{*} is a certain unique function satisfying ((σ,σ))=1(\!(\sigma,\sigma_{*})\!)=1, yielding S[R[f;σ]]=fS[R[f;\sigma_{*}]]=f; {ρi}i\{\rho_{i}\}_{i\in\mathbb{N}} is an orthonormal system satisfying ((σ,ρi))=0(\!(\sigma,\rho_{i})\!)=0, yielding S[R[fi;ρi]]=0S[R[f_{i};\rho_{i}]]=0; and {fi}i\{f_{i}\}_{i\in\mathbb{N}} are L2L^{2}-functions uniquely determined by each γ\gamma. We note that σ\sigma_{*} and ρi\rho_{i} are independent of γ\gamma. Hence, the cost is rewritten as a constraint-free expression:

C[f]=inf{fi}R[f;σ]+i=1R[fi;ρi]L1.C[f]=\inf_{\{f_{i}\}}\Big{\|}R[f;\sigma_{*}]+\sum_{i=1}^{\infty}R[f_{i};\rho_{i}]\Big{\|}_{L^{1}}.

As a result, we can conjecture that the minimizer of C[f]C[f] is given by ridgelet transform(s). In fact, Ongie et al. (2020) have shown that under some assumptions, the minimizer is given by a derivative of the Radon transform: (m+1)/2P[f]\triangle^{(m+1)/2}P[f], which is exactly the special case of the ridgelet transform in Theorem 6.3 when k=1k=1 and t=2t=-2.

Update: At the same time as the initial submission, Parhi and Unser (2023b) have obtained a representer theorem for multivariate activation functions under more careful considerations on the regularization and function spaces based on an extended theory of the dd-plane transforms for distributions (Parhi and Unser, 2023a). Their result suggests our conjecture was essentially true (modulo finite-order polynomials).

8 Discussion

In the main text, we have seen a variety of examples, but what is essential behind the Fourier expression, the change of variables, and the separation-of-variables form? In a nutshell, it is coefficient comparison for solving equations. Namely, after appropriately changing variables, the network S[γ]S[\gamma] is rewritten in the Fourier basis, which amounts to a coordinate transform from the basis {σ(axb)}a,b\{\sigma(ax-b)\}_{a,b} to the Fourier basis {exp(iξx)}ξ\{\exp(i\xi x)\}_{\xi}. Since we (are supposed to) know the Fourier coefficient f^(ξ)\widehat{f}(\xi), we can obtain the unknown function γ\gamma by comparing the coefficients in the Fourier domain. From this perspective, we can now understand that the Fourier basis is just one choice of frame, and the solution steps are summarized as below:

Let ϕ:V×X\phi:V\times X\to\mathbb{R} and ψ:Ξ×X\psi:\Xi\times X\to\mathbb{R} be two feature maps on XX parametrized by VV and Ξ\Xi respectively, and consider their associated integral representations:

S[γ](x):=Vγ(v)ϕ(v,x)dv,T[g](x):=Ξg(ξ)ψ(ξ,x)dξ.S[\gamma](x):=\int_{V}\gamma(v)\phi(v,x)\mathrm{d}v,\quad T[g](x):=\int_{\Xi}g(\xi)\psi(\xi,x)\mathrm{d}\xi.

Here, SS and TT correspond to the continuous neural network and the inverse Fourier transform, respectively. Given a function f:Xf:X\to\mathbb{R}, suppose that the coefficient gg with T[g]=fT[g]=f is known, say g=f^g=\widehat{f}, while the coefficient γ\gamma with S[γ]=fS[\gamma]=f is unknown. Then, find a coordinate transform HH satisfying H[ϕ](ξ,x)=ψ(ξ,x)H[\phi](\xi,x)=\psi(\xi,x) so that

S[γ](x)=ΞH[γ](ξ)ψ(ξ,x)dξ,S[\gamma](x)=\int_{\Xi}H^{\prime}[\gamma](\xi)\psi(\xi,x)\mathrm{d}\xi,

where HH^{\prime} is a dual transform of the coefficients γ\gamma associated with the coordinate transform. Then, we can find γ\gamma by comparing the coefficients:

H[γ]=f^.H^{\prime}[\gamma]=\widehat{f}.

In other words, the mapping HH that matches the neuron σ(𝒂𝒙b)\sigma({\bm{a}}\cdot{\bm{x}}-b) and Fourier basis exp(i𝝃𝒙)\exp(i{\bm{\xi}}\cdot{\bm{x}}) corresponds to the Fourier expression and change of variables in the main text, yielding H[γ](𝝃)=γ(𝝃/ω,ω)H^{\prime}[\gamma]({\bm{\xi}})=\gamma^{\sharp}({\bm{\xi}}/\omega,\omega).
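
The recipe can be illustrated in finite dimensions by a toy linear-algebra sketch of ours (not the paper's method): replace the feature maps ϕ and ψ by finite frames, expand the ϕ-frame in a complete ψ-basis via a change-of-coordinates matrix H, and recover γ by comparing coefficients, i.e. by solving Hγ = g. In this finite-dimensional toy, the same matrix represents both H (on features) and H′ (on coefficients).

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 100, 150            # N sample points; P >= N phi-features (overcomplete frame)

Phi = rng.standard_normal((N, P))                     # columns: phi(v_j, .) on a grid
Psi, _ = np.linalg.qr(rng.standard_normal((N, N)))    # orthonormal "Fourier-like" basis

g = rng.standard_normal(N)                 # known coefficients of f in the psi-basis
f = Psi @ g                                # f = T[g]

# Change of coordinates H: expand each phi-feature in the psi-basis (exact here,
# since the psi-basis is complete), so that Phi = Psi @ H.
H = Psi.T @ Phi

# Coefficient comparison: S[gamma] = f  <=>  H @ gamma = g.  Minimum-norm solution:
gamma = np.linalg.lstsq(H, g, rcond=None)[0]

print(np.linalg.norm(Phi @ gamma - f) / np.linalg.norm(f))   # tiny residual
```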

9 Conclusion

The ultimate goal of this study is to understand neural network parameters. While the ridgelet transform is a strong analysis tool, one of its major shortcomings is that the closed-form expression has been known only for a small class of neural networks. In this paper, we have proposed the Fourier slice method and shown that various neural networks and their corresponding ridgelet transforms, listed in Table 1, can be obtained systematically by following the three steps of the Fourier slice method.

layer type input xx parameter (a,b)(a,b) single neuron
§ 1-2. fully-connected (FC) layer m\mathbb{R}^{m} m×\mathbb{R}^{m}\times\mathbb{R} σ(𝒂𝒙b)\sigma({\bm{a}}\cdot{\bm{x}}-b)
§ 3. FC layer on finite fields 𝔽pm\mathbb{F}_{p}^{m} 𝔽pm×𝔽p\mathbb{F}_{p}^{m}\times\mathbb{F}_{p} σ(𝒂𝒙b)\sigma({\bm{a}}\cdot{\bm{x}}-b)
§ 4. group convolution layer m\mathcal{H}_{m} m×\mathcal{H}_{m}\times\mathbb{R} σ((ax)(g)b)\sigma((a\star x)(g)-b)
§ 5. FC layer on manifolds G/KG/K 𝔞×X×\mathfrak{a}^{*}\times\partial X\times\mathbb{R} σ(ax,ub)\sigma(a\langle x,u\rangle-b)
§ 6. pooling (dd-plane ridgelet) m\mathbb{R}^{m} m×k×k\mathbb{R}^{m\times k}\times\mathbb{R}^{k} σ(A𝒙𝒃)\sigma(A^{\top}{\bm{x}}-{\bm{b}})
Table 1: List of layer types σ(axb)\sigma(ax-b) covered in this study. See corresponding sections for the definitions of symbols such as 𝔽p,m,G/K,X\mathbb{F}_{p},\mathcal{H}_{m},G/K,\partial X and 𝔞\mathfrak{a}^{*}.

Needless to say, it is more efficient to analyze networks uniformly in terms of ridgelet analysis than to analyze individual networks manually one by one. As demonstrated in this paper, the coverage of ridgelet analysis is gradually expanding. With the strength of a closed-form expression of the pseudo-inverse operator, the ridgelet transform has several applications. For example, we can/may

  1. present a constructive proof of the universal approximation theorem,

  2. estimate approximation error bounds by discretizing the reconstruction formula using numerical integration schemes (e.g. MJB theory),

  3. describe the distribution of parameters obtained by gradient descent learning (Sonoda et al., 2021a),

  4. obtain the general solution to the learning equation S[γ]=fS[\gamma]=f (Sonoda et al., 2021b), and

  5. construct a representer theorem (Unser, 2019).

The Fourier expression further allows us to view neural networks from the perspective of harmonic analysis and integral geometry. By recasting neural networks in these contexts, we will be able to discover further varieties of novel networks. On the other hand, after the initial submission of this manuscript, the authors have also developed an alternative method of discovery that uses group invariant functions instead of the Fourier expression (Sonoda et al., 2023a, b). The characterization of the networks obtained by the Fourier slice method would be an interesting direction of this study.

Acknowledgment

This work was supported by JSPS KAKENHI 18K18113, JST CREST JPMJCR2015 and JPMJCR1913, JST PRESTO JPMJPR2125, and JST ACT-X JPMJAX2004.

Appendix A Poincaré Disk

Following Helgason (1984)[Introduction, § 4] and Helgason (2008)[Ch.II, § 1], we describe the group theoretic aspect of the Poincaré disk. Let D:={z|z|<1}D:=\{z\in\mathbb{C}\mid|z|<1\} be the unit open disk in \mathbb{C} equipped with the Riemannian metric gz(u,v)=(u,v)/(1|z|2)2g_{z}(u,v)=(u,v)/(1-|z|^{2})^{2} for any tangent vectors u,vTzDu,v\in T_{z}D at zDz\in D, where (,)(\cdot,\cdot) denotes the Euclidean inner product in 2\mathbb{R}^{2}. Let D:={u|u|=1}\partial D:=\{u\in\mathbb{C}\mid|u|=1\} be the boundary of DD equipped with the uniform probability measure du\mathrm{d}u. Namely, DD is the Poincaré disk model of hyperbolic plane 2\mathbb{H}^{2}. On this model, the Poincaré metric between two points z,wDz,w\in D is given by d(z,w)=tanh1|(zw)/(1zw)|d(z,w)=\tanh^{-1}|(z-w)/(1-zw^{*})|, and the volume element is given by dz=(1(x2+y2))2dxdy\mathrm{d}z=(1-(x^{2}+y^{2}))^{-2}\mathrm{d}x\mathrm{d}y.

Consider now the group

G:=SU(1,1):={(αββα)|(α,β)2,|α|2|β|2=1},G:=SU(1,1):=\left\{\begin{pmatrix}\alpha&\beta\\ \beta^{*}&\alpha^{*}\end{pmatrix}{\ \Bigg{|}\ }(\alpha,\beta)\in\mathbb{C}^{2},|\alpha|^{2}-|\beta|^{2}=1\right\},

which acts on DD (and D\partial D) by

gz:=αz+ββz+α,zDD.g\cdot z:=\frac{\alpha z+\beta}{\beta^{*}z+\alpha^{*}},\quad z\in D\cup\partial D.

The GG-action is transitive, conformal, and maps circles, lines, and the boundary into circles, lines, and the boundary. In addition, consider the subgroups

K\displaystyle K :=SO(2)={kϕ:=(eiϕ00eiϕ)|ϕ[0,2π)},\displaystyle:=SO(2)=\left\{k_{\phi}:=\begin{pmatrix}e^{i\phi}&0\\ 0&e^{-i\phi}\end{pmatrix}{\ \Bigg{|}\ }\phi\in[0,2\pi)\right\},
A\displaystyle A :={at:=(coshtsinhtsinhtcosht)|t},\displaystyle:=\left\{a_{t}:=\begin{pmatrix}\cosh t&\sinh t\\ \sinh t&\cosh t\end{pmatrix}{\ \Bigg{|}\ }t\in\mathbb{R}\right\},
N\displaystyle N :={ns:=(1+isisis1is)|s},\displaystyle:=\left\{n_{s}:=\begin{pmatrix}1+is&-is\\ is&1-is\end{pmatrix}{\ \Bigg{|}\ }s\in\mathbb{R}\right\},
M\displaystyle M :=CK(A)={k0=(1001),kπ=(1001)}\displaystyle:=C_{K}(A)=\left\{k_{0}=\begin{pmatrix}1&0\\ 0&1\end{pmatrix},k_{\pi}=\begin{pmatrix}-1&0\\ 0&-1\end{pmatrix}\right\}

The subgroup K:=SO(2)K:=SO(2) fixes the origin oDo\in D. So we have the identifications

D=G/K=SU(1,1)/SO(2),andD=K/M=SS1.D=G/K=SU(1,1)/SO(2),\quad\mbox{and}\quad\partial D=K/M=\SS^{1}.

On this model, the following are known (1) that m=dim𝔞=1m=\dim\mathfrak{a}=1, |W|=1|W|=1, ϱ=1\varrho=1, and |𝒄(λ)|2=πλ2tanh(πλ2)|{\bm{c}}(\lambda)|^{-2}=\frac{\pi\lambda}{2}\tanh(\frac{\pi\lambda}{2}) for λ𝔞=\lambda\in\mathfrak{a}^{*}=\mathbb{R}, (2) that the geodesics are the circular arcs perpendicular to the boundary D\partial D, and (3) that the horocycles are the circles tangent to the boundary D\partial D. Hence, let ξ(x,u)\xi(x,u) denote the horocycle ξ\xi through xDx\in D and tangent to the boundary at uDu\in\partial D; and let x,u\langle x,u\rangle denote the signed distance from the origin oDo\in D to the horocycle ξ(x,u)\xi(x,u).

In order to compute the distance z,u\langle z,u\rangle, we use the following fact: The distance from the origin oo to a point z=reiuz=re^{iu} is d(o,z)=tanh1|(0z)/(10z)|=12log1+r1rd(o,z)=\tanh^{-1}|(0-z)/(1-0z^{*})|=\frac{1}{2}\log\frac{1+r}{1-r}. Hence, let cDc\in D be the center of the horocycle ξ(z,u)\xi(z,u), and let wDw\in D be the closest point on the horocycle ξ(z,u)\xi(z,u) to the origin. By definition, z,u=d(o,w)\langle z,u\rangle=d(o,w). But we can find the ww via the cosine rule:

coszou=|u|2+|z|2|zu|22|u||z|=coszoc=|z|2+|12(1+|w|)|2|12(1|w|)|22|z||12(1+|w|)|,\cos zou=\frac{|u|^{2}+|z|^{2}-|z-u|^{2}}{2|u|\,|z|}=\cos zoc=\frac{|z|^{2}+{|\frac{1}{2}(1+|w|)|}^{2}-{|\frac{1}{2}(1-|w|)|}^{2}}{2|z|\,|\frac{1}{2}(1+|w|)|},

which yields the tractable formula:

z,u=12log1+|w|1|w|=12log1|z|2|zu|2,(z,u)D×D.\langle z,u\rangle=\frac{1}{2}\log\frac{1+|w|}{1-|w|}=\frac{1}{2}\log\frac{1-|z|^{2}}{|z-u|^{2}},\quad(z,u)\in D\times\partial D.
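
The closed-form expression for ⟨z,u⟩ can be sanity-checked numerically (our own sketch, not part of the original derivation): we realize the horocycle ξ(z,u) as the Euclidean circle of radius ρ tangent to the unit circle at u with center (1−ρ)u, locate its closest point w to the origin, and compare ½ log((1+|w|)/(1−|w|)) with the right-hand side above.

```python
import numpy as np

# Numerical sanity check of <z, u> = (1/2) log((1 - |z|^2) / |z - u|^2).
z = 0.35 - 0.2j                    # a point in the open unit disk
u = np.exp(1j * 1.1)               # a boundary point, |u| = 1

# The horocycle through z tangent at u is the Euclidean circle of radius rho
# tangent to the unit circle at u, with center c = (1 - rho) u.
rho = abs(z - u) ** 2 / (2.0 * (1.0 - (z * np.conj(u)).real))
c = (1.0 - rho) * u
w_abs = abs(c) - rho               # signed distance from the origin to the circle

lhs = 0.5 * np.log((1.0 + w_abs) / (1.0 - w_abs))
rhs = 0.5 * np.log((1.0 - abs(z) ** 2) / abs(z - u) ** 2)
print(lhs, rhs)                    # the two values coincide
```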

Appendix B SPD Manifold

Following Terras (2016, Chapter 1), we introduce the SPD manifold. On the space m\mathbb{P}_{m} of m×mm\times m symmetric positive definite (SPD) matrices, the Riemannian metric is given by

𝔤x:=tr((x1dx)2),xm\mathfrak{g}_{x}:=\operatorname{tr}\left((x^{-1}\mathrm{d}x)^{2}\right),\quad x\in\mathbb{P}_{m}

where xx and dx\mathrm{d}x denote the matrices of entries xijx_{ij} and dxij\mathrm{d}x_{ij}.

Put G=GL(m,)G=GL(m,\mathbb{R}), then the Iwasawa decomposition G=KANG=KAN is given by K=O(m),A=D+(m),N=T1(m)K=O(m),A=D_{+}(m),N=T_{1}(m); and the centralizer M=CK(A)M=C_{K}(A) is given by M=D±1M=D_{\pm 1} (diagonal matrices with entries ±1\pm 1). The quotient space G/KG/K is identified with the SPD manifold m\mathbb{P}_{m} via a diffeomorphism onto, gKgggK\mapsto gg^{\top} for any gGg\in G; and K/MK/M is identified with the boundary m\partial\mathbb{P}_{m}, another manifold of all singular positive semidefinite matrices. The action of GG on m\mathbb{P}_{m} is given by g[x]:=gxgg[x]:=gxg^{\top} for any gGg\in G and xmx\in\mathbb{P}_{m}. In particular, the metric 𝔤\mathfrak{g} is GG-invariant. According to the spectral decomposition, for any xmx\in\mathbb{P}_{m}, there uniquely exist kKk\in K and aAa\in A such that x=k[a]x=k[a]; and according to the Cholesky (or Iwasawa) decomposition, there exist nNn\in N and aAa\in A such that x=n[a]x=n[a].

When x=k[exp(H)]=exp(k[H])x=k[\exp(H)]=\exp(k[H]) for some H𝔞=D(m)H\in\mathfrak{a}=D(m) and kKk\in K, then the geodesic segment yy from the origin o=Io=I (the identity matrix) to xx is given by

y(t)=exp(tk[H]),t[0,1]y(t)=\exp(tk[H]),\quad t\in[0,1]

satisfying y(0)=oy(0)=o and y(1)=xy(1)=x; and the Riemannian length of yy (i.e., the Riemannian distance from oo to xx) is given by d(o,x)=|H|Ed(o,x)=|H|_{E}. So, H𝔞H\in\mathfrak{a} is the vector-valued distance from oo to x=k[exp(H)]x=k[\exp(H)].

The GG-invariant measures are given by dg=|detg|mi,jdgij\mathrm{d}g=|\det g|^{-m}\bigwedge_{i,j}\mathrm{d}g_{ij} on GG, dk\mathrm{d}k to be the uniform probability measure on KK, da=idai/ai\mathrm{d}a=\bigwedge_{i}\mathrm{d}a_{i}/a_{i} on AA, dn=1<i<jmdnij\mathrm{d}n=\bigwedge_{1<i<j\leq m}\mathrm{d}n_{ij} on NN,

dμ(x)\displaystyle\mathrm{d}\mu(x) =|detx|m+121ijmdxijonm,\displaystyle=|\det x|^{-\frac{m+1}{2}}\bigwedge_{1\leq i\leq j\leq m}\mathrm{d}x_{ij}\quad\mbox{on}\quad\mathbb{P}_{m},
=cmj=1majm121i<jm|aiaj|dadk,\displaystyle=c_{m}\prod_{j=1}^{m}a_{j}^{-\frac{m-1}{2}}\prod_{1\leq i<j\leq m}|a_{i}-a_{j}|\mathrm{d}a\mathrm{d}k,

where the second expression is for the polar coordinates x=k[a]x=k[a] with (k,a)K×A(k,a)\in K\times A and cm:=π(m2+m)/4j=1mj1Γ1(j/2)c_{m}:=\pi^{(m^{2}+m)/4}\prod_{j=1}^{m}j^{-1}\Gamma^{-1}(j/2), and du\mathrm{d}u to be the uniform probability measure on m:=K/M\partial\mathbb{P}_{m}:=K/M.

The vector-valued composite distance from the origin oo to a horosphere ξ(x,u)\xi(x,u) is calculated as

x=g[o],u=kM=12logλ(k[x]),\langle x=g[o],u=kM\rangle=\frac{1}{2}\log\lambda(k^{\top}[x]),

where λ(y)\lambda(y) denotes the diagonal vector λ\lambda in the Cholesky decomposition y=ν[λ]=νλνy=\nu[\lambda]=\nu\lambda\nu^{\top} of yy for some (ν,λ)NA(\nu,\lambda)\in NA.

Proof.

Since x,kM:=H(g1k)=k[x],eM\langle x,kM\rangle:=-H(g^{-1}k)=\langle k^{\top}[x],eM\rangle, it suffices to consider the case (x,u)=(g[o],eM)(x,u)=(g[o],eM). Namely, we solve g1=kang^{-1}=kan for unknowns (k,a,n)KAN(k,a,n)\in KAN. (To be precise, we only need aa because x,eM=loga\langle x,eM\rangle=-\log a.) Put the Cholesky decomposition x=ν[λ]=νλνx=\nu[\lambda]=\nu\lambda\nu^{\top} for some (ν,λ)NA(\nu,\lambda)\in NA. Then, a=λ1/2a=\lambda^{-1/2} because x1=(ν1)λ1ν1x^{-1}=(\nu^{-1})^{\top}\lambda^{-1}\nu^{-1}, while x1=(gg)1=na2nx^{-1}=(gg^{\top})^{-1}=n^{\top}a^{2}n. ∎
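
A minimal numerical sketch of this computation (our own illustration; we assume the convention that N = T_1(m) consists of unit lower-triangular matrices, so the decomposition y = νλν^⊤ is obtained from the standard Cholesky factor):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 4

# A random SPD point x = g[o] = g g^T and a random boundary parameter u = kM.
g = rng.standard_normal((m, m))
x = g @ g.T
k, _ = np.linalg.qr(rng.standard_normal((m, m)))   # k in K = O(m)

def vector_distance(x, k):
    # <x, kM> = (1/2) log lambda(k^T[x]), where y = k^T x k = nu lambda nu^T
    # with nu unit lower-triangular (our assumed convention for N = T_1(m)).
    y = k.T @ x @ k
    L = np.linalg.cholesky(y)                      # y = L L^T, L lower triangular
    lam = np.diag(L) ** 2                          # diagonal part lambda
    nu = L / np.diag(L)                            # unit lower-triangular factor
    assert np.allclose(nu @ np.diag(lam) @ nu.T, y)
    return 0.5 * np.log(lam)

print(vector_distance(x, k))                       # vector-valued distance in a = R^m
print(vector_distance(x, np.eye(m)))
```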

The Helgason–Fourier transform and its inversion formula are given by

f^(𝒔,u)=mf(x)e𝒔x,u¯dμ(x),\displaystyle\widehat{f}({\bm{s}},u)=\int_{\mathbb{P}_{m}}f(x)\overline{e^{{\bm{s}}\cdot\langle x,u\rangle}}\mathrm{d}\mu(x),
f(x)=ωm𝒔=𝝆mf^(𝒔,u)e𝒔x,udud𝒔|𝒄(𝒔)|2,\displaystyle f(x)=\omega_{m}\int_{\Re{\bm{s}}={\bm{\rho}}}\int_{\partial\mathbb{P}_{m}}\widehat{f}({\bm{s}},u)e^{{\bm{s}}\cdot\langle x,u\rangle}\mathrm{d}u\frac{\mathrm{d}{\bm{s}}}{|{\bm{c}}({\bm{s}})|^{2}},

for any (𝒔,u)𝔞×O(m)({\bm{s}},u)\in\mathfrak{a}^{*}_{\mathbb{C}}\times O(m) (where 𝔞=m\mathfrak{a}^{*}_{\mathbb{C}}=\mathbb{C}^{m}) and xmx\in\mathbb{P}_{m}. Here, ωm:=j=1mΓ(j/2)j(2πi)πj/2\omega_{m}:=\prod_{j=1}^{m}\frac{\Gamma(j/2)}{j(2\pi i)\pi^{j/2}}, 𝝆=(12,,12,m14)m{\bm{\rho}}=(-\frac{1}{2},\ldots,-\frac{1}{2},\frac{m-1}{4})\in\mathbb{C}^{m}, and

𝒄(𝒔)=1ij<mB(12,si++sj+ji+12)B(12,ji+12),{\bm{c}}({\bm{s}})=\prod_{1\leq i\leq j<m}\frac{B(\frac{1}{2},s_{i}+\cdots+s_{j}+\frac{j-i+1}{2})}{B(\frac{1}{2},\frac{j-i+1}{2})},

where B(x,y):=Γ(x)Γ(y)/Γ(x+y)B(x,y):=\Gamma(x)\Gamma(y)/\Gamma(x+y) is the beta function.

Appendix C Matrix Calculus

We refer to Rubin (2018) and Díaz-García and González-Farías (2005) for matrix calculus.

Let m,km,k be positive integers (mkm\geq k). Let Mm,km×kM_{m,k}\subset\mathbb{R}^{m\times k} be the set of all full-column-rank matrices equipped with the volume measure dW=i,jdwij\mathrm{d}W=\prod_{i,j}\mathrm{d}w_{ij}. Let Vm,kV_{m,k} be the Stiefel manifold of orthonormal kk-frames in m\mathbb{R}^{m} equipped with the invariant measure dU\mathrm{d}U normalized to Vm,kdU=2kπmk/2/Γk(m/2)=:σm,k\int_{V_{m,k}}\mathrm{d}U=2^{k}\pi^{mk/2}/\Gamma_{k}(m/2)=:\sigma_{m,k} where Γk\Gamma_{k} is the Siegel gamma function. Let 𝔖kk×k\mathfrak{S}_{k}\subset\mathbb{R}^{k\times k} be the space of real symmetric matrices equipped with the volume measure dS=ijdsij\mathrm{d}S=\prod_{i\leq j}\mathrm{d}s_{ij}, which is isometric to the euclidean space k(k+1)/2\mathbb{R}^{k(k+1)/2}. Let 𝔓kk×k\mathfrak{P}_{k}\subset\mathbb{R}^{k\times k} be the cone of positive definite matrices in 𝔖k\mathfrak{S}_{k} equipped with the volume measure dP=ijdpij\mathrm{d}P=\prod_{i\leq j}\mathrm{d}p_{ij}.

Lemma C.1 (Matrix Polar Decomposition).

For any WMm,kW\in M_{m,k}, there uniquely exist UVm,kU\in V_{m,k} and P𝔓kP\in\mathfrak{P}_{k} such that

W=UP1/2,P=WW;W=UP^{1/2},\quad P=W^{\top}W;

and for any fL1(Mm,k)f\in L^{1}(M_{m,k}),

Mm,kf(W)dW=12kVm,k×𝔓kf(UP1/2)|detP|mk12dPdU,P=WW.\int_{M_{m,k}}f(W)\mathrm{d}W=\frac{1}{2^{k}}\int_{V_{m,k}\times\mathfrak{P}_{k}}f(UP^{1/2})|\det P|^{\frac{m-k-1}{2}}\mathrm{d}P\mathrm{d}U,\quad P=W^{\top}W.

See Rubin (2018, Lemma 2.1) for more details. We remark that while Lemma C.1 is an integration formula on Stiefel manifold, the Grassmannian manifold version is the Blaschke–Petkantschin formula.

Lemma C.2 (Matrix Polar Integration).

For any fL1(k)f\in L^{1}(\mathbb{R}^{k}),

cm,kmf(𝒙)d𝒙=Vm,k×kf(U𝒃)|U𝒃|mkdUd𝒃,c_{m,k}\int_{\mathbb{R}^{m}}f({\bm{x}})\mathrm{d}{\bm{x}}=\int_{V_{m,k}\times\mathbb{R}^{k}}f(U{\bm{b}})|U{\bm{b}}|^{m-k}\mathrm{d}U\mathrm{d}{\bm{b}},

where cm,k:=SSk1×Vm,k1dUd𝛚c_{m,k}:=\int_{\SS^{k-1}\times V_{m,k-1}}\mathrm{d}U\mathrm{d}{\bm{\omega}}.

Proof.

Recall that U𝒃=i=1kbi𝒖imU{\bm{b}}=\sum_{i=1}^{k}b_{i}{\bm{u}}_{i}\in\mathbb{R}^{m} and thus |U𝒃|2=𝒃UU𝒃=|𝒃|2|U{\bm{b}}|^{2}={\bm{b}}^{\top}U^{\top}U{\bm{b}}=|{\bm{b}}|^{2}. Hence, using the polar decomposition 𝒃=r𝝎{\bm{b}}=r{\bm{\omega}} with d𝒃=rk1drd𝝎\mathrm{d}{\bm{b}}=r^{k-1}\mathrm{d}r\mathrm{d}{\bm{\omega}},

Vm,k×kf(U𝒃)|U𝒃|mkdUd𝒃\displaystyle\int_{V_{m,k}\times\mathbb{R}^{k}}f(U{\bm{b}})|U{\bm{b}}|^{m-k}\mathrm{d}U\mathrm{d}{\bm{b}}
=Vm,k×SSk1×+f(U𝝎r)rmkrk1drd𝝎dU\displaystyle=\int_{V_{m,k}\times\SS^{k-1}\times\mathbb{R}_{+}}f(U{\bm{\omega}}r)r^{m-k}r^{k-1}\mathrm{d}r\mathrm{d}{\bm{\omega}}\mathrm{d}U

then letting 𝒖𝝎:=i=1kωi𝒖i{\bm{u}}_{{\bm{\omega}}}:=\sum_{i=1}^{k}\omega_{i}{\bm{u}}_{i}, which is a unit vector in spanU=[𝒖1,,𝒖k]\operatorname{span}\,U=[{\bm{u}}_{1},\ldots,{\bm{u}}_{k}], and letting U𝝎U_{-{\bm{\omega}}} be a rearranged k1k-1-frame in spanU\operatorname{span}\,U that excludes 𝒖𝝎{\bm{u}}_{{\bm{\omega}}},

=SSk1[Vm,k1×SSk1×+f(r𝒖𝝎)rm1drd𝒖𝝎dU𝝎]d𝝎\displaystyle=\int_{\SS^{k-1}}\left[\int_{V_{m,k-1}\times\SS^{k-1}\times\mathbb{R}_{+}}f(r{\bm{u}}_{{\bm{\omega}}})r^{m-1}\mathrm{d}r\mathrm{d}{\bm{u}}_{{\bm{\omega}}}\mathrm{d}U_{-{\bm{\omega}}}\right]\mathrm{d}{\bm{\omega}}
=SSk1×Vm,k1dUk1d𝝎mf(𝒙)d𝒙,𝒙=r𝒖𝝎.\displaystyle=\int_{\SS^{k-1}\times V_{m,k-1}}\mathrm{d}U_{k-1}\mathrm{d}{\bm{\omega}}\int_{\mathbb{R}^{m}}f({\bm{x}})\mathrm{d}{\bm{x}},\quad{\bm{x}}=r{\bm{u}}_{{\bm{\omega}}}.

Lemma C.3 (SVD).

For any column-full-rank matrix WMm,kW\in M_{m,k}, there exist (U,D,V)Vm,k×+k×O(k)(U,D,V)\in V_{m,k}\times\mathbb{R}_{+}^{k}\times O(k) satisfying W=UDVW=UDV^{\top} with d1>>dk>0d_{1}>\cdots>d_{k}>0; and for any fL1(Mm,k)f\in L^{1}(M_{m,k}),

Mm,kf(W)dW=2kVm,k×+k×O(k)f(UDV)|detD|mki<j(di2dj2)dDdVdU,\int_{M_{m,k}}f(W)\mathrm{d}W=2^{-k}\int_{V_{m,k}\times\mathbb{R}_{+}^{k}\times O(k)}f(UDV^{\top})|\det D|^{m-k}\prod_{i<j}(d_{i}^{2}-d_{j}^{2})\mathrm{d}D\mathrm{d}V\mathrm{d}U,

where dD=i=1kddi\mathrm{d}D=\bigwedge_{i=1}^{k}\mathrm{d}d_{i} and dU,dV\mathrm{d}U,\mathrm{d}V denote the invariant measures.

See Lemma 1 of Díaz-García and González-Farías (2005) for the proof.

Appendix D Proofs

D.1 Solution via Singular Value Decomposition

Step 1.

We begin with the following Fourier expression:

S[γ](𝒙)\displaystyle S[\gamma]({\bm{x}}) =Mm,k×kγ(A,𝒃)σ(A𝒙𝒃)dAd𝒃\displaystyle=\int_{M_{m,k}\times\mathbb{R}^{k}}\gamma(A,{\bm{b}})\sigma(A^{\top}{\bm{x}}-{\bm{b}})\mathrm{d}A\mathrm{d}{\bm{b}}
=1(2π)kMm,k×kγ(A,𝝎)σ(𝝎)ei(A𝝎)𝒙dAd𝝎.\displaystyle=\frac{1}{(2\pi)^{k}}\int_{M_{m,k}\times\mathbb{R}^{k}}\gamma^{\sharp}(A,{\bm{\omega}})\sigma^{\sharp}({\bm{\omega}})e^{i(A{\bm{\omega}})\cdot{\bm{x}}}\mathrm{d}A\mathrm{d}{\bm{\omega}}. (D.1)

Here, we assume (D.1) to be absolutely convergent for a.e. 𝒙m{\bm{x}}\in\mathbb{R}^{m}, so that we can change the order of integrations freely. But this assumption will be automatically satisfied because we eventually set γ\gamma to be the ridgelet transform.

Step 2.

In the following, we aim to turn the integration ei(A𝝎)𝒙dAd𝝎\int\cdots e^{i(A{\bm{\omega}})\cdot{\bm{x}}}\mathrm{d}A\mathrm{d}{\bm{\omega}} into the Fourier inversion eiU𝝃𝒙|U𝝃|ddUd𝝃\int\cdots e^{iU{\bm{\xi}}\cdot{\bm{x}}}|U{\bm{\xi}}|^{d}\mathrm{d}U\mathrm{d}{\bm{\xi}} in the matrix polar coordinates. To achieve this, we use the singular value decomposition.

Lemma D.1 (Singular Value Decomposition, Lemma C.3).

The matrix space Mm,kM_{m,k} can be decomposed into Mm,k=Vm,k×+k×O(k)M_{m,k}=V_{m,k}\times\mathbb{R}_{+}^{k}\times O(k) via singular value decomposition

A=UDV,(U,D,V)Vm,k×+k×O(k)A=UDV^{\top},\quad(U,D,V)\in V_{m,k}\times\mathbb{R}_{+}^{k}\times O(k)

satisfying D=diag[d1,,dk](d1>>dk>0)D=\operatorname{diag}[d_{1},\ldots,d_{k}]\,(d_{1}>\cdots>d_{k}>0) (distinct singular values); and the Lebesgue measure dA\mathrm{d}A is calculated as

dA=δ(D)dDdUdV,δ(D):=2k|detD|dΔ(D2),\mathrm{d}A=\delta(D)\mathrm{d}D\mathrm{d}U\mathrm{d}V,\quad\delta(D):=2^{-k}|\det D|^{d}\Delta(D^{2}),

where dD=i=1kddi\mathrm{d}D=\bigwedge_{i=1}^{k}\mathrm{d}d_{i}; dU\mathrm{d}U and dV\mathrm{d}V are invariant measures on Vm,kV_{m,k} and O(k)O(k) respectively; and Δ(D2):=i<j(di2dj2)\Delta(D^{2}):=\prod_{i<j}(d_{i}^{2}-d_{j}^{2}) denotes the Vandermonde polynomial (or the products of differences) of a given (diagonalized) vector D=[d1,,dk]D=[d_{1},\ldots,d_{k}].

If there is no risk of confusion, we write UDVUDV^{\top} as AA for the sake of readability.

Using SVD, the Fourier expression is rewritten as follows:

(D.1) =1(2π)kMm,k×kγ(A,𝝎)σ(𝝎)ei(UDV𝝎)𝒙δ(D)dUdDdVd𝝎\displaystyle=\frac{1}{(2\pi)^{k}}\int_{M_{m,k}\times\mathbb{R}^{k}}\gamma^{\sharp}(A,{\bm{\omega}})\sigma^{\sharp}({\bm{\omega}})e^{i(UDV^{\top}{\bm{\omega}})\cdot{\bm{x}}}\delta(D)\mathrm{d}U\mathrm{d}D\mathrm{d}V\mathrm{d}{\bm{\omega}} (D.2)

Changing the variables as (𝝎,V)=(V𝝎,V)({\bm{\omega}},V)=(V{\bm{\omega}}^{\prime},V) with d𝝎dV=d𝝎dV\mathrm{d}{\bm{\omega}}\mathrm{d}V=\mathrm{d}{\bm{\omega}}^{\prime}\mathrm{d}V,

=1(2π)kMm,k×kγ(A,V𝝎)σ(V𝝎)ei(UD𝝎)𝒙δ(D)dUdDdVd𝝎\phantom{\text{\eqref{eq:fourier}}}=\frac{1}{(2\pi)^{k}}\int_{M_{m,k}\times\mathbb{R}^{k}}\gamma^{\sharp}(A,V{\bm{\omega}}^{\prime})\sigma^{\sharp}(V{\bm{\omega}}^{\prime})e^{i(UD{\bm{\omega}}^{\prime})\cdot{\bm{x}}}\delta(D)\mathrm{d}U\mathrm{d}D\mathrm{d}V\mathrm{d}{\bm{\omega}}^{\prime} (D.3)

Then, extending the domain of DD from +k\mathbb{R}_{+}^{k} to k\mathbb{R}^{k}, changing the variables as (di,ωi)=(yi/ωi,ωi)(d_{i},\omega^{\prime}_{i})=(y_{i}/\omega^{\prime}_{i},\omega^{\prime}_{i}) with ddidωi=|ωi|1dyidωi\mathrm{d}d_{i}\mathrm{d}\omega^{\prime}_{i}=|\omega^{\prime}_{i}|^{-1}\mathrm{d}y_{i}\mathrm{d}\omega^{\prime}_{i}, and writing 𝒚:=[y1,,yk]{\bm{y}}:=[y_{1},\ldots,y_{k}],

=1(4π)kVm,k×k×O(k)×kγ(A,V𝝎)σ(V𝝎)ei(U𝒚)𝒙δ(D)i=1k|ωi|1dUd𝒚dVd𝝎.\phantom{\text{\eqref{eq:fourier}}}=\frac{1}{(4\pi)^{k}}\int_{V_{m,k}\times\mathbb{R}^{k}\times O(k)\times\mathbb{R}^{k}}\gamma^{\sharp}(A,V{\bm{\omega}}^{\prime})\sigma^{\sharp}(V{\bm{\omega}}^{\prime})e^{i(U{\bm{y}})\cdot{\bm{x}}}\delta(D)\prod_{i=1}^{k}|\omega^{\prime}_{i}|^{-1}\mathrm{d}U\mathrm{d}{\bm{y}}\mathrm{d}V\mathrm{d}{\bm{\omega}}^{\prime}. (D.4)

Step 3.

Therefore, it is natural to suppose γ\gamma to satisfy a separation-of-variables form

γ(A,V𝝎)σ(V𝝎)δ(D)i=1k|ωi|1=f^(U𝒚)|U𝒚|dϕ(V,𝝎)\gamma^{\sharp}(A,V{\bm{\omega}}^{\prime})\sigma^{\sharp}(V{\bm{\omega}}^{\prime})\delta(D)\prod_{i=1}^{k}|\omega^{\prime}_{i}|^{-1}=\widehat{f}(U{\bm{y}})|U{\bm{y}}|^{d}\phi^{\sharp}(V,{\bm{\omega}}^{\prime}) (D.5)

with an auxiliary convergence factor ϕ\phi. Then, we have

(D.4) =1(4π)k(O(k)×kϕ(V,𝝎)dVd𝝎)(Vm,k×kf^(U𝒚)|U𝒚|deiU𝒚𝒙d𝒚dU)\displaystyle=\frac{1}{(4\pi)^{k}}\left(\int_{O(k)\times\mathbb{R}^{k}}\phi^{\sharp}(V,{\bm{\omega}}^{\prime})\mathrm{d}V\mathrm{d}{\bm{\omega}}^{\prime}\right)\left(\int_{V_{m,k}\times\mathbb{R}^{k}}\widehat{f}(U{\bm{y}})|U{\bm{y}}|^{d}e^{iU{\bm{y}}\cdot{\bm{x}}}\mathrm{d}{\bm{y}}\mathrm{d}U\right)
=cϕ(2π)mmf^(𝒚)ei𝒚𝒙d𝒚=cϕf(𝒙).\displaystyle=\frac{c_{\phi}}{(2\pi)^{m}}\int_{\mathbb{R}^{m}}\widehat{f}({\bm{y}})e^{i{\bm{y}}\cdot{\bm{x}}}\mathrm{d}{\bm{y}}=c_{\phi}f({\bm{x}}).

Here, we put cϕ:=2k(2π)dcm,k1O(k)×kϕ(V,𝝎)dVd𝝎c_{\phi}:=2^{-k}(2\pi)^{d}c_{m,k}^{-1}\int_{O(k)\times\mathbb{R}^{k}}\phi^{\sharp}(V,{\bm{\omega}}^{\prime})\mathrm{d}V\mathrm{d}{\bm{\omega}}^{\prime}, and used the matrix polar integration formula given in Lemma C.2.

Finally, the form (D.5) can be satisfied as below. Since V𝝎=𝝎V{\bm{\omega}}^{\prime}={\bm{\omega}} and U𝒚=UD𝝎=A𝝎U{\bm{y}}=UD{\bm{\omega}}^{\prime}=A{\bm{\omega}}, it is reduced to

γ(A,𝝎)ϕ(V,𝝎)=f^(A𝝎)|A𝝎|dσ(V𝝎)δ(D)i=1k|ωi|1.\frac{\gamma^{\sharp}(A,{\bm{\omega}})}{\phi^{\sharp}(V,{\bm{\omega}}^{\prime})}=\frac{\widehat{f}(A{\bm{\omega}})|A{\bm{\omega}}|^{d}}{\sigma^{\sharp}(V{\bm{\omega}}^{\prime})\delta(D)\prod_{i=1}^{k}|\omega^{\prime}_{i}|^{-1}}.

Hence we can put

γ(A,𝝎)\displaystyle\gamma^{\sharp}(A,{\bm{\omega}}) =f^(A𝝎)|A𝝎|dρ(𝝎)¯δ(D),\displaystyle=\frac{\widehat{f}(A{\bm{\omega}})|A{\bm{\omega}}|^{d}\overline{\rho^{\sharp}({\bm{\omega}})}}{\delta(D)}, (D.6)
ϕ(V,𝝎)\displaystyle\phi^{\sharp}(V,{\bm{\omega}}^{\prime}) =σ(V𝝎)ρ(V𝝎)¯i=1k|ωi|1,\displaystyle=\sigma^{\sharp}(V{\bm{\omega}}^{\prime})\overline{\rho^{\sharp}(V{\bm{\omega}}^{\prime})}\prod_{i=1}^{k}|\omega^{\prime}_{i}|^{-1},

with additional convergence factor ρ\rho. In this setting, the constant cϕc_{\phi} is calculated as

cϕ\displaystyle c_{\phi} :=(2π)d2kcm,kO(k)×kσ(V𝝎)ρ(V𝝎)¯i=1k|ωi|1dVd𝝎\displaystyle:=\frac{(2\pi)^{d}}{2^{k}c_{m,k}}\int_{O(k)\times\mathbb{R}^{k}}\sigma^{\sharp}(V{\bm{\omega}}^{\prime})\overline{\rho^{\sharp}(V{\bm{\omega}}^{\prime})}\prod_{i=1}^{k}|\omega^{\prime}_{i}|^{-1}\mathrm{d}V\mathrm{d}{\bm{\omega}}^{\prime}
=(2π)d2kcm,kkσ(𝝎)ρ(𝝎)¯i=1k|ωi|1d𝝎=:((σ,ρ));\displaystyle=\frac{(2\pi)^{d}}{2^{k}c_{m,k}}\int_{\mathbb{R}^{k}}\sigma^{\sharp}({\bm{\omega}})\overline{\rho^{\sharp}({\bm{\omega}})}\prod_{i=1}^{k}|\omega_{i}|^{-1}\mathrm{d}{\bm{\omega}}=:(\!(\sigma,\rho)\!);

and γ\gamma is obtained by taking the Fourier inversion of (D.6) with respect to 𝝎{\bm{\omega}} as follows:

γ(A,𝒃)\displaystyle\gamma(A,{\bm{b}}) =1(2π)kδ(D)k|A𝝎|df^(A𝝎)ρ(𝝎)¯ei𝝎𝒃d𝝎,\displaystyle=\frac{1}{(2\pi)^{k}\delta(D)}\int_{\mathbb{R}^{k}}|A{\bm{\omega}}|^{d}\widehat{f}(A{\bm{\omega}})\overline{\rho^{\sharp}({\bm{\omega}})}e^{i{\bm{\omega}}\cdot{\bm{b}}}\mathrm{d}{\bm{\omega}},
=1(2π)kδ(D)k[md/2[f](𝒙)eiA𝝎𝒙d𝒙]ρ(𝝎)¯ei𝝎𝒃d𝝎\displaystyle=\frac{1}{(2\pi)^{k}\delta(D)}\int_{\mathbb{R}^{k}}\left[\int_{\mathbb{R}^{m}}\triangle^{d/2}[f]({\bm{x}})e^{-iA{\bm{\omega}}\cdot{\bm{x}}}\mathrm{d}{\bm{x}}\right]\overline{\rho^{\sharp}({\bm{\omega}})}e^{i{\bm{\omega}}\cdot{\bm{b}}}\mathrm{d}{\bm{\omega}}
=1δ(D)md/2[f](𝒙)[1(2π)kkρ(𝝎)ei𝝎(A𝒙𝒃)d𝝎]d𝒙\displaystyle=\frac{1}{\delta(D)}\int_{\mathbb{R}^{m}}\triangle^{d/2}[f]({\bm{x}})\left[\frac{1}{(2\pi)^{k}}\int_{\mathbb{R}^{k}}\rho^{\sharp}({\bm{\omega}})e^{i{\bm{\omega}}\cdot(A^{\top}{\bm{x}}-{\bm{b}})}\mathrm{d}{\bm{\omega}}\right]^{*}\mathrm{d}{\bm{x}}
=1δ(D)md/2[f](𝒙)ρ(A𝒙𝒃)¯d𝒙\displaystyle=\frac{1}{\delta(D)}\int_{\mathbb{R}^{m}}\triangle^{d/2}[f]({\bm{x}})\overline{\rho(A^{\top}{\bm{x}}-{\bm{b}})}\mathrm{d}{\bm{x}}
=:R[f](A,𝒃).\displaystyle=:R[f](A,{\bm{b}}).

To sum up, we have shown that

S[R[f]](𝒙)=Mm,k×kR[f](A,𝒃)σ(A𝒙𝒃)dAd𝒃=((σ,ρ))f(𝒙).S[R[f]]({\bm{x}})=\int_{M_{m,k}\times\mathbb{R}^{k}}R[f](A,{\bm{b}})\sigma(A^{\top}{\bm{x}}-{\bm{b}})\mathrm{d}A\mathrm{d}{\bm{b}}=(\!(\sigma,\rho)\!)f({\bm{x}}).\qed
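
As a concrete illustration of the admissibility constant (a minimal worked example, assuming the Fourier convention $\sigma^{\sharp}(\omega)=\int_{\mathbb{R}}\sigma(b)e^{-i\omega b}\mathrm{d}b$ implied by the Fourier inversion used above), take $k=1$ and the Gaussian-derivative pair $\sigma(b)=\rho(b)=-b\,e^{-b^{2}/2}$. Then $\sigma^{\sharp}(\omega)=\rho^{\sharp}(\omega)=i\omega\sqrt{2\pi}\,e^{-\omega^{2}/2}$, so

(\!(\sigma,\rho)\!)=\frac{(2\pi)^{d}}{2c_{m,1}}\int_{\mathbb{R}}|\sigma^{\sharp}(\omega)|^{2}|\omega|^{-1}\mathrm{d}\omega=\frac{(2\pi)^{d}}{2c_{m,1}}\int_{\mathbb{R}}2\pi|\omega|e^{-\omega^{2}}\mathrm{d}\omega=\frac{(2\pi)^{d+1}}{2c_{m,1}},

which is finite and nonzero, so this pair gives a nondegenerate reconstruction constant.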

D.2 Restriction to the Similitude Group

Let us consider the restricted case of the similitude group GVm,kGV_{m,k}. Since it is a measure-zero subspace of Mm,kM_{m,k}, we obtain a different solution.

Step 1.

The continuous network and its Fourier expression are given as

S[γ](𝒙)\displaystyle S[\gamma]({\bm{x}}) :=GVm,k×kγ(A,𝒃)σ(A𝒙𝒃)dμ(A)d𝒃\displaystyle:=\int_{GV_{m,k}\times\mathbb{R}^{k}}\gamma(A,{\bm{b}})\sigma(A^{\top}{\bm{x}}-{\bm{b}})\mathrm{d}\mu(A)\mathrm{d}{\bm{b}}
=1(2π)kGVm,k×kγ(A,𝝎)σ(𝝎)ei(aU𝝎)𝒙α(a)dadUd𝝎\displaystyle=\frac{1}{(2\pi)^{k}}\int_{GV_{m,k}\times\mathbb{R}^{k}}\gamma^{\sharp}(A,{\bm{\omega}})\sigma^{\sharp}({\bm{\omega}})e^{i(aU{\bm{\omega}})\cdot{\bm{x}}}\alpha(a)\mathrm{d}a\mathrm{d}U\mathrm{d}{\bm{\omega}} (D.7)

Step 2.

Skipping the matrix decomposition of AA, we turn 𝝎{\bm{\omega}} into polar coordinates 𝝎=r𝒗{\bm{\omega}}=r{\bm{v}} with (r,𝒗)+×SSk1(r,{\bm{v}})\in\mathbb{R}_{+}\times\SS^{k-1} and d𝝎=rk1drd𝒗\mathrm{d}{\bm{\omega}}=r^{k-1}\mathrm{d}r\mathrm{d}{\bm{v}}, which yields

(D.7)=1(2π)kGVm,k×+×SSk1γ(A,r𝒗)σ(r𝒗)ei(arU𝒗)𝒙α(a)rk1dadUdrd𝒗\text{\eqref{eq:fourier.sim}}=\frac{1}{(2\pi)^{k}}\int_{GV_{m,k}\times\mathbb{R}_{+}\times\SS^{k-1}}\gamma^{\sharp}(A,r{\bm{v}})\sigma^{\sharp}(r{\bm{v}})e^{i(arU{\bm{v}})\cdot{\bm{x}}}\alpha(a)r^{k-1}\mathrm{d}a\mathrm{d}U\mathrm{d}r\mathrm{d}{\bm{v}}

Changing the variable (a,r)=(y/r,r)(a,r)=(y/r,r) with dadr=r1dydr\mathrm{d}a\mathrm{d}r=r^{-1}\mathrm{d}y\mathrm{d}r,

=1(2π)kGVm,k×+×SSk1γ(A,r𝒗)σ(r𝒗)ei(yU𝒗)𝒙α(y/r)rk2dydUdrd𝒗\phantom{\text{\eqref{eq:fourier.sim}}}=\frac{1}{(2\pi)^{k}}\int_{GV_{m,k}\times\mathbb{R}_{+}\times\SS^{k-1}}\gamma^{\sharp}(A,r{\bm{v}})\sigma^{\sharp}(r{\bm{v}})e^{i(yU{\bm{v}})\cdot{\bm{x}}}\alpha(y/r)r^{k-2}\mathrm{d}y\mathrm{d}U\mathrm{d}r\mathrm{d}{\bm{v}}

and returning y𝒗y{\bm{v}} to the Euclidean coordinate 𝒚{\bm{y}} with yk1dyd𝒗=d𝒚y^{k-1}\mathrm{d}y\mathrm{d}{\bm{v}}=\mathrm{d}{\bm{y}},

=1(2π)kGVm,k×kγ(A,r𝒚/|𝒚|)σ(r𝒚/|𝒚|)ei(U𝒚)𝒙α(|𝒚|/r)rk2|𝒚|(k1)d𝒚dUdr.\phantom{\text{\eqref{eq:fourier.sim}}}=\frac{1}{(2\pi)^{k}}\int_{GV_{m,k}\times\mathbb{R}^{k}}\gamma^{\sharp}(A,r{\bm{y}}/|{\bm{y}}|)\sigma^{\sharp}(r{\bm{y}}/|{\bm{y}}|)e^{i(U{\bm{y}})\cdot{\bm{x}}}\alpha(|{\bm{y}}|/r)r^{k-2}|{\bm{y}}|^{-(k-1)}\mathrm{d}{\bm{y}}\mathrm{d}U\mathrm{d}r. (D.8)

Step 3.

Since r𝒚/|𝒚|=r𝒗=𝝎r{\bm{y}}/|{\bm{y}}|=r{\bm{v}}={\bm{\omega}} and |𝒚|/r=y/r=a|{\bm{y}}|/r=y/r=a, supposing the separation-of-variables form as

γ(A,𝝎)σ(𝝎)α(a)a|𝝎|k2|𝒚|(k1)=f^(U𝒚)|U𝒚|dϕ(𝝎),\gamma^{\sharp}(A,{\bm{\omega}})\sigma^{\sharp}({\bm{\omega}})\alpha(a)a^{-\ell}|{\bm{\omega}}|^{k-2-\ell}|{\bm{y}}|^{-(k-1-\ell)}=\widehat{f}(U{\bm{y}})|U{\bm{y}}|^{d}\phi^{\sharp}({\bm{\omega}}), (D.9)

for any number \ell\in\mathbb{R}, we have

(D.8)=1(2π)k(+ϕ(r)dr)(Vm,k×kf^(U𝒚)|U𝒚|dei(U𝒚)𝒙dUd𝒚)=cϕf(𝒙).\text{\eqref{eq:fourier.simsim}}=\frac{1}{(2\pi)^{k}}\left(\int_{\mathbb{R}_{+}}\phi^{\sharp}(r)\mathrm{d}r\right)\left(\int_{V_{m,k}\times\mathbb{R}^{k}}\widehat{f}(U{\bm{y}})|U{\bm{y}}|^{d}e^{i(U{\bm{y}})\cdot{\bm{x}}}\mathrm{d}U\mathrm{d}{\bm{y}}\right)=c_{\phi}f({\bm{x}}).

Note that in addition to U𝒚=A𝝎U{\bm{y}}=A{\bm{\omega}}, we have |𝒚|=|U𝒚|=a|𝝎||{\bm{y}}|=|U{\bm{y}}|=a|{\bm{\omega}}| and thus |U𝒚|d=|A𝝎|nadn|𝝎|dn|U{\bm{y}}|^{d}=|A{\bm{\omega}}|^{n}a^{d-n}|{\bm{\omega}}|^{d-n} for any number nn\in\mathbb{R}. Hence the condition is reduced to

γ(A,𝝎)ϕ(𝝎)\displaystyle\frac{\gamma^{\sharp}(A,{\bm{\omega}})}{\phi^{\sharp}({\bm{\omega}})} =f^(A𝝎)|A𝝎|n+(k1)adn+|𝝎|m2kn+2+σ(𝝎)α(a)\displaystyle=\frac{\widehat{f}(A{\bm{\omega}})|A{\bm{\omega}}|^{n+(k-1-\ell)}a^{d-n+\ell}|{\bm{\omega}}|^{m-2k-n+2+\ell}}{\sigma^{\sharp}({\bm{\omega}})\alpha(a)}
=f^(A𝝎)|A𝝎|sams1|𝝎|ds+1σ(𝝎)α(a),\displaystyle=\frac{\widehat{f}(A{\bm{\omega}})|A{\bm{\omega}}|^{s}a^{m-s-1}|{\bm{\omega}}|^{d-s+1}}{\sigma^{\sharp}({\bm{\omega}})\alpha(a)},

where we put s:=n+(k1)s:=n+(k-1-\ell), which can be an arbitrary real number. As a result, to satisfy (D.9), we may put

γ(A,𝝎)\displaystyle\gamma^{\sharp}(A,{\bm{\omega}}) =f^(A𝝎)|A𝝎|sρ(𝝎)¯,\displaystyle=\widehat{f}(A{\bm{\omega}})|A{\bm{\omega}}|^{s}\overline{\rho^{\sharp}({\bm{\omega}})},
α(a)\displaystyle\quad\alpha(a) =ams1,\displaystyle=a^{m-s-1},
ϕ(𝝎)\displaystyle\phi^{\sharp}({\bm{\omega}}) =σ(𝝎)ρ(𝝎)¯|𝝎|(ds+1);\displaystyle=\sigma^{\sharp}({\bm{\omega}})\overline{\rho^{\sharp}({\bm{\omega}})}|{\bm{\omega}}|^{-(d-s+1)};

which lead to

Rs[f](A,𝒃)\displaystyle R_{s}[f](A,{\bm{b}}) =m[s/2f](𝒙)ρ(A𝒙𝒃)¯d𝒙,\displaystyle=\int_{\mathbb{R}^{m}}[\triangle^{s/2}f]({\bm{x}})\overline{\rho(A^{\top}{\bm{x}}-{\bm{b}})}\mathrm{d}{\bm{x}},
((σ,ρ))s\displaystyle(\!(\sigma,\rho)\!)_{s} kσ(𝝎)ρ(𝝎)¯|𝝎|(ds+1)d𝝎,\displaystyle\propto\int_{\mathbb{R}^{k}}\sigma^{\sharp}({\bm{\omega}})\overline{\rho^{\sharp}({\bm{\omega}})}|{\bm{\omega}}|^{-(d-s+1)}\mathrm{d}{\bm{\omega}},
S[Rs[f]](𝒙)\displaystyle S[R_{s}[f]]({\bm{x}}) =GVm,k×kRs[f](A,𝒃)σ(A𝒙𝒃)ams1dadUd𝒃=((σ,ρ))f(𝒙).\displaystyle=\int_{GV_{m,k}\times\mathbb{R}^{k}}R_{s}[f](A,{\bm{b}})\sigma(A^{\top}{\bm{x}}-{\bm{b}})a^{m-s-1}\mathrm{d}a\mathrm{d}U\mathrm{d}{\bm{b}}=(\!(\sigma,\rho)\!)f({\bm{x}}).\qed

By matching the order of the fractional derivative △s\triangle^{s}, the case s=ds=d corresponds to the SVD solution. On the other hand, by matching the weight |𝝎|(ds+1)|{\bm{\omega}}|^{-(d-s+1)}, the case k=1k=1 and s=0s=0 exactly reproduces the classical result.
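
The classical case can also be checked numerically. The following is a minimal sketch (not a reference implementation): for $m=k=1$ and $s=0$, the weight $a^{m-s-1}\mathrm{d}a\,\mathrm{d}U$ with $U\in\{\pm 1\}$ is, up to normalization of $\mathrm{d}U$, just Lebesgue measure on $\mathbb{R}$, so we parametrize the weight directly as $a\in\mathbb{R}$, take the Gaussian-derivative pair $\sigma=\rho$ with $\sigma(b)=-b\,e^{-b^{2}/2}$, and check that $S[R[f]](x)/f(x)$ is approximately constant on a crude quadrature grid. All grids and truncation ranges below are ad-hoc choices.

    # Numerical sanity check of the classical reconstruction (m = k = 1, s = 0):
    #   R[f](a, b)  = \int f(x) rho(a x - b) dx,
    #   S[gamma](x) = \int\int gamma(a, b) sigma(a x - b) da db = ((sigma, rho)) f(x).
    # We only check that the ratio S[R[f]](x) / f(x) is approximately constant in x.
    import numpy as np

    f     = lambda x: np.exp(-x**2 / 2)          # test function
    sigma = lambda t: -t * np.exp(-t**2 / 2)     # Gaussian derivative
    rho   = sigma                                # real-valued, rho = sigma

    x = np.linspace(-8.0, 8.0, 401);   dx = x[1] - x[0]
    a = np.linspace(-6.0, 6.0, 301);   da = a[1] - a[0]
    b = np.linspace(-12.0, 12.0, 601); db = b[1] - b[0]

    # ridgelet transform R[f](a, b) on the (a, b) grid
    Rf = np.empty((a.size, b.size))
    for i, ai in enumerate(a):
        Rf[i] = (f(x)[None, :] * rho(ai * x[None, :] - b[:, None])).sum(axis=1) * dx

    # reconstruction S[R[f]](x0) at a few test points
    A, B = np.meshgrid(a, b, indexing="ij")
    x_test = np.array([-1.0, 0.0, 0.5, 1.5])
    recon = np.array([(Rf * sigma(A * x0 - B)).sum() * da * db for x0 in x_test])

    print(recon / f(x_test))   # the printed ratios should (approximately) coincide

For this particular pair, the common value of the ratio should be close to $2\pi=\int_{\mathbb{R}}|\sigma^{\sharp}(\omega)|^{2}|\omega|^{-1}\mathrm{d}\omega$, the classical ($m=1$) admissibility constant; what matters for the check, however, is only that the ratio is constant in $x$.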

D.3 Restriction to the Stiefel manifold

Let us consider the further restricted case where the weight matrix lies in the Stiefel manifold Vm,kV_{m,k}.

Step 1.

S[γ](𝒙)\displaystyle S[\gamma]({\bm{x}}) :=Vm,k×kγ(U,𝒃)σ(U𝒙𝒃)dUd𝒃\displaystyle:=\int_{V_{m,k}\times\mathbb{R}^{k}}\gamma(U,{\bm{b}})\sigma(U^{\top}{\bm{x}}-{\bm{b}})\mathrm{d}U\mathrm{d}{\bm{b}}
=1(2π)kVm,k×kγ(U,𝝎)σ(𝝎)ei(U𝝎)𝒙dUd𝝎.\displaystyle=\frac{1}{(2\pi)^{k}}\int_{V_{m,k}\times\mathbb{R}^{k}}\gamma^{\sharp}(U,{\bm{\omega}})\sigma^{\sharp}({\bm{\omega}})e^{i(U{\bm{\omega}})\cdot{\bm{x}}}\mathrm{d}U\mathrm{d}{\bm{\omega}}.

Namely, the weight matrix parameter UMm,kU\in M_{m,k} now simply lies in the Stiefel manifold Vm,kV_{m,k}, and thus it does not contain any scaling factor. We show that this formulation still admits solutions, provided that σ\sigma is also appropriately restricted.

Step 3.

Skipping the rescaling step (Step 2), let us consider the separation-of-variables form:

γ(U,𝝎)σ(𝝎)=f^(U𝝎)|U𝝎|dϕ(𝝎);\gamma^{\sharp}(U,{\bm{\omega}})\sigma^{\sharp}({\bm{\omega}})=\widehat{f}(U{\bm{\omega}})|U{\bm{\omega}}|^{d}\phi^{\sharp}({\bm{\omega}}); (D.10)

which is satisfied by

γ(U,𝝎)\displaystyle\gamma^{\sharp}(U,{\bm{\omega}}) =f^(U𝝎)|U𝝎|sρ(𝝎)¯,\displaystyle=\widehat{f}(U{\bm{\omega}})|U{\bm{\omega}}|^{s}\overline{\rho^{\sharp}({\bm{\omega}})},
ϕ(𝝎)\displaystyle\phi^{\sharp}({\bm{\omega}}) =σ(𝝎)ρ(𝝎)¯|𝝎|sd,\displaystyle=\sigma^{\sharp}({\bm{\omega}})\overline{\rho^{\sharp}({\bm{\omega}})}|{\bm{\omega}}|^{s-d},

for any real number ss\in\mathbb{R}. Here, we used |U𝝎|=|𝝎||U{\bm{\omega}}|=|{\bm{\omega}}|.
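
Indeed, substituting this pair into the left-hand side of (D.10) and using $|U{\bm{\omega}}|=|{\bm{\omega}}|$ gives

\gamma^{\sharp}(U,{\bm{\omega}})\sigma^{\sharp}({\bm{\omega}})=\widehat{f}(U{\bm{\omega}})|U{\bm{\omega}}|^{s}\overline{\rho^{\sharp}({\bm{\omega}})}\sigma^{\sharp}({\bm{\omega}})=\widehat{f}(U{\bm{\omega}})|U{\bm{\omega}}|^{d}\cdot\sigma^{\sharp}({\bm{\omega}})\overline{\rho^{\sharp}({\bm{\omega}})}|{\bm{\omega}}|^{s-d}=\widehat{f}(U{\bm{\omega}})|U{\bm{\omega}}|^{d}\phi^{\sharp}({\bm{\omega}}),

as required.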

In order for (D.10) to yield a solution, it is sufficient that

ϕ(𝝎)=cm,k1(2π)d,\phi^{\sharp}({\bm{\omega}})=c_{m,k}^{-1}(2\pi)^{-d}, (D.11)

because then

S[γ](𝒙)\displaystyle S[\gamma]({\bm{x}}) =1(2π)kVm,k×kf^(U𝝎)|U𝝎|dϕ(𝝎)ei(U𝝎)𝒙dUd𝝎\displaystyle=\frac{1}{(2\pi)^{k}}\int_{V_{m,k}\times\mathbb{R}^{k}}\widehat{f}(U{\bm{\omega}})|U{\bm{\omega}}|^{d}\phi^{\sharp}({\bm{\omega}})e^{i(U{\bm{\omega}})\cdot{\bm{x}}}\mathrm{d}U\mathrm{d}{\bm{\omega}}
=1(2π)mmf^(𝒚)ei𝒚𝒙d𝒚=f(𝒙).\displaystyle=\frac{1}{(2\pi)^{m}}\int_{\mathbb{R}^{m}}\widehat{f}({\bm{y}})e^{i{\bm{y}}\cdot{\bm{x}}}\mathrm{d}{\bm{y}}=f({\bm{x}}).\qed

Compared to the previous results, (D.11) is a much more restrictive requirement. Nonetheless, a few examples exist, such as

σ(𝝎)=|𝝎|t,ρ(𝝎)=cm,k1(2π)d|𝝎|d(s+t);\sigma^{\sharp}({\bm{\omega}})=|{\bm{\omega}}|^{t},\quad\rho^{\sharp}({\bm{\omega}})=c_{m,k}^{-1}(2\pi)^{-d}|{\bm{\omega}}|^{d-(s+t)};

or equivalently in the real domain,

𝒃t2[σ](𝒃)=δ(𝒃),𝒃d(s+t)2[ρ](𝒃)=cm,k1(2π)dδ(𝒃).\triangle_{{\bm{b}}}^{-\frac{t}{2}}[\sigma]({\bm{b}})=\delta({\bm{b}}),\quad\triangle_{{\bm{b}}}^{-\frac{d-(s+t)}{2}}[\rho]({\bm{b}})=c_{m,k}^{-1}(2\pi)^{-d}\delta({\bm{b}}).

In particular, when k=1k=1, σ\sigma coincides with the Dirac delta (t=0t=0), the step function (t=1t=-1), and the ReLU function (t=2t=-2).
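
One can check directly that this pair fulfills (D.11): since $\rho^{\sharp}$ is real-valued,

\phi^{\sharp}({\bm{\omega}})=\sigma^{\sharp}({\bm{\omega}})\overline{\rho^{\sharp}({\bm{\omega}})}|{\bm{\omega}}|^{s-d}=|{\bm{\omega}}|^{t}\cdot c_{m,k}^{-1}(2\pi)^{-d}|{\bm{\omega}}|^{d-(s+t)}\cdot|{\bm{\omega}}|^{s-d}=c_{m,k}^{-1}(2\pi)^{-d}.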

Interestingly, the ridgelet transform reduces to the dd-plane transform (d:=mkd:=m-k is the codimension). Since γ(U,𝝎)=f^(U𝝎)|U𝝎|dt\gamma^{\sharp}(U,{\bm{\omega}})=\widehat{f}(U{\bm{\omega}})|U{\bm{\omega}}|^{d-t} up to the constant factor cm,k1(2π)dc_{m,k}^{-1}(2\pi)^{-d}, we have

γ(U,𝒃)\displaystyle\gamma(U,{\bm{b}}) =1(2π)kkf^(U𝝎)|U𝝎|dtei𝝎𝒃d𝝎\displaystyle=\frac{1}{(2\pi)^{k}}\int_{\mathbb{R}^{k}}\widehat{f}(U{\bm{\omega}})|U{\bm{\omega}}|^{d-t}e^{i{\bm{\omega}}\cdot{\bm{b}}}\mathrm{d}{\bm{\omega}}
=1(2π)kk(dt)/2[f]^(U𝝎)ei𝝎𝒃d𝝎,\displaystyle=\frac{1}{(2\pi)^{k}}\int_{\mathbb{R}^{k}}\widehat{\triangle^{(d-t)/2}[f]}(U{\bm{\omega}})e^{i{\bm{\omega}}\cdot{\bm{b}}}\mathrm{d}{\bm{\omega}},

but this is the Fourier expression (a.k.a. the Fourier slice theorem) for the dd-plane transform, say PdP_{d}, of the derivative (dt)/2[f]\triangle^{(d-t)/2}[f]. In other words, when the scaling parameter is removed, the reconstruction formula reduces to one based on the dd-plane (Radon) transform:

S[R[f]](𝒙)=Vm,k×kPd[(dt)/2[f]](U,𝒃)σ(U𝒙𝒃)dUd𝒃.S[R[f]]({\bm{x}})=\int_{V_{m,k}\times\mathbb{R}^{k}}P_{d}[\triangle^{(d-t)/2}[f]](U,{\bm{b}})\sigma(U^{\top}{\bm{x}}-{\bm{b}})\mathrm{d}U\mathrm{d}{\bm{b}}.
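
As a minimal illustration of the slice identity (a standard worked example with $m=2$, $k=1$, hence $d=1$ and $P_{1}$ the classical Radon transform; we assume the convention $\widehat{f}({\bm{\xi}})=\int f({\bm{x}})e^{-i{\bm{\xi}}\cdot{\bm{x}}}\mathrm{d}{\bm{x}}$), take the Gaussian $g({\bm{x}})=e^{-|{\bm{x}}|^{2}/2}$ on $\mathbb{R}^{2}$. Integrating $g$ over the line $\{{\bm{x}}:U^{\top}{\bm{x}}=b\}$ gives

P_{1}[g](U,b)=\int_{\mathbb{R}}e^{-(b^{2}+\tau^{2})/2}\mathrm{d}\tau=\sqrt{2\pi}\,e^{-b^{2}/2},

whose one-dimensional Fourier transform in $b$ is $2\pi e^{-\omega^{2}/2}=\widehat{g}(\omega U)$; this is exactly the slice identity $\int_{\mathbb{R}}P_{1}[g](U,b)e^{-i\omega b}\mathrm{d}b=\widehat{g}(\omega U)$ underlying the expression above.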

References